Article
Peer-Review Record

Optimal Prefix Free Codes with Partial Sorting

by Jérémy Barbay
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 29 November 2019 / Revised: 23 December 2019 / Accepted: 25 December 2019 / Published: 31 December 2019
(This article belongs to the Special Issue Data Compression Algorithms and their Applications)

Round 1

Reviewer 1 Report

See the attached file.

Comments for author File: Comments.pdf

Author Response

(Please see the attachment for a better-formatted answer, and the text below for a simple text version.)


* Comments partially implemented, justified here

1. /The author exaggerates the practical value of his research. For example, on page 1, last paragraph, he claims that the problem is important since the Huffman code is used in main data compression standards. This argument is completely invalid. None of implementations of standards contains routines for constructing Huffman codes. Huffman codes are pre-constructed and are kept in the form of look-up tables./

- We respectfully disagree with the statement "/None of implementations of standards contains routines for constructing Huffman codes. Huffman codes are pre-constructed and are kept in the form of look-up tables./" (sic):

1. The author, without pretending to be an expert in the field of compression, personally knows of two applications of the computation of such codes (even if, as previously stated in the article, the running time of this computation is *not* a critical component of the overall running time of the application):
1) Takaoka [1997] described an adaptive version of =Merge Sort= which scans the input for sorted runs and merges them according to a scheme which is exactly the Huffman code tree (see the sketch after the quoted paragraph below).
2) The author [2013-TCS] described how to use such a code tree to compute the optimal shape of a wavelet tree for the runs-based compression of permutations. This, in turn, has applications to the compression of general texts, via the compression of the permutations appearing in the Burrows-Wheeler transform of the text.

2. We requested confirmation from colleagues with more experience than the author in the field of compressed data structures (Alistair Moffat, author of various works on the topic of Huffman codes, and in particular of a recent survey on "Huffman Coding" (ACM Computing Surveys, 2019); Gonzalo Navarro, author of many articles and a few books on compression; and Travis Gagie, author of various articles and organizer of a few conferences on compression): they confirmed that, even if it is not the bottleneck (as already mentioned in the article), there are indeed many practical applications which do recompute the optimal prefix free code instead of using precomputed tables (e.g. bzip2, jpeg, etc.).

- Nevertheless, we added to the introduction a paragraph specifically stating that "/Even though some compression methods use only precomputed tables coding for an Optimal Binary Prefix Free Code built into the method, there are still many applications that require the computation of such codes for each instance./":

#+BEGIN_QUOTE
Even though some compression methods use only precomputed tables coding for an \textsc{Optimal Binary Prefix Free Code} built into the compression method, there are still many applications that require the computation of such codes for each instance (e.g. BZIP2, JPEG, etc.). In the current state of the art, the running time of such algorithms is almost never the bottleneck of the compression process, but it is nevertheless worth studying, if only for its theoretical interest, and potentially for future applications less directly related to compression: Takaoka described an adaptive version of \texttt{Merge Sort} which scans the input for sorted runs and merges them according to a scheme which is exactly the \textsc{Huffman} code tree; and Barbay and Navarro described how to use such a code tree to compute the optimal shape of a wavelet tree for the runs-based compression of permutations. This, in turn, has applications to the compression of general texts, via the compression of the permutations appearing in the Burrows-Wheeler transform of the text.
#+END_QUOTE
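
To make the first application above concrete, here is a minimal sketch (in illustrative Python, not Takaoka's original formulation; all names are ours) of an adaptive merge sort which detects the maximal sorted runs of its input and always merges the two shortest runs first, so that the merge schedule is exactly the Huffman code tree of the run lengths:

#+BEGIN_SRC python
import heapq

def runs(seq):
    """Split seq into its maximal non-decreasing runs."""
    result, start = [], 0
    for i in range(1, len(seq)):
        if seq[i] < seq[i - 1]:
            result.append(seq[start:i])
            start = i
    result.append(seq[start:])
    return result

def merge_two(a, b):
    """Standard merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def huffman_merge_sort(seq):
    """Sort seq by merging its runs in Huffman order: repeatedly
    merge the two shortest runs, as in Huffman's construction."""
    if not seq:
        return []
    heap = [(len(r), k, r) for k, r in enumerate(runs(seq))]
    heapq.heapify(heap)
    tie = len(heap)  # tie-breaker, so the runs themselves are never compared
    while len(heap) > 1:
        _, _, a = heapq.heappop(heap)
        _, _, b = heapq.heappop(heap)
        m = merge_two(a, b)
        heapq.heappush(heap, (len(m), tie, m))
        tie += 1
    return heap[0][2]

assert huffman_merge_sort([3, 4, 1, 2, 9, 5, 6, 7, 0]) == [0, 1, 2, 3, 4, 5, 6, 7, 9]
#+END_SRC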

2. /In the beginning of Section 4, page 12, practical importance of the problem is discussed again. Some readers can erroneously conclude from this Section that word counting statistical data presented in Figure 4 have some practical meaning and that realistic archivers apply Huffman codes based on word frequencies. In this sense the section is misleading. The author has to explain that these numerical results do not have relation to practical data compression. Neither in multimedia data compression (image, speech, audio, video) nor in archivers for computer data (texts, executable files, data bases etc.) fixed Huffman codes over large alphabets can be used. The main reason is non-stationarity of real-life random processes. Moreover, as mentioned above, the complexity of code design is not important since all fixed codes are designed "off-line". Well-known "dynamic" Huffman codes are not practical as well. Different integer implementations of arithmetic coding are used in most applications. If possible, the author should give much shorter but more realistic description of the practical applications of the algorithm./

- We respectfully disagree that the remark "/Well-known `dynamic' Huffman codes are not practical as well./" is an argument against research aiming to improve aspects of the algorithms maintaining "dynamic Huffman codes", such as the computation of Huffman codes:

1. it is by researching how to improve the components of such solutions that they might finally be made practical, and
2. an efficient implementation of dynamic Huffman codes would definitely have practical applications.

- Nevertheless, we added a paragraph to clarify that the word counting statistical data is a synthetic example, inspired by a single previous study by Moura et al. and cited by the recent survey on "Huffman Coding", without any claim of practical relevance:

#+BEGIN_QUOTE
Albeit the computation of \textsc{Optimal Prefix Free Codes} occurs in many applications (e.g. BZIP2, JPEG, etc.), the alphabet size does not necessarily increase with the size of the data to be compressed. In order to evaluate the potential improvement of adaptive techniques such as the ones presented in this work and others, we consider the application of \textsc{Optimal Prefix Free Codes} to the word-based compression of natural language texts, cited as an example of ``\emph{large alphabet}'' application by Moffat, and studied by Moura et al. As for the natural language texts themselves, we considered a random selection of nine texts from the Gutenberg project, listed in Figure.
#+END_QUOTE
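
As a toy illustration of this word-based setting (a sketch with a placeholder text and our own naming, using the classical heap-based Huffman construction rather than the algorithms studied in the article), one can count the word frequencies of a text and derive the optimal prefix free code lengths:

#+BEGIN_SRC python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Depth of each symbol in a Huffman code tree built on the
    given frequencies, i.e. the optimal prefix free code lengths."""
    if len(freqs) == 1:
        return {w: 1 for w in freqs}
    depth = {w: 0 for w in freqs}
    heap = [(f, i, [w]) for i, (w, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)  # tie-breaker for equal frequencies
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        for w in a + b:  # every symbol below the new node gets one level deeper
            depth[w] += 1
        heapq.heappush(heap, (fa + fb, tie, a + b))
        tie += 1
    return depth

# In the word-based setting, the "alphabet" is the vocabulary of the
# text, hence much larger than a byte-based alphabet.
text = "to be or not to be that is the question"
lengths = huffman_code_lengths(Counter(text.split()))
# The frequent words ("to", "be") receive shorter codewords than rare ones.
print(sorted(lengths.items(), key=lambda kv: kv[1]))
#+END_SRC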

3. /For log function three different notations lg, log, log 2 are used almost at random. I would recommend to use one notation everywhere./

- We apologize for the use of both $\lg$ and $\log$, which was a mistake and has been corrected.
- The use of the notation $\log_2$ was not random, and was (respectfully) maintained:
  - the notation $\log_2$ is used any time the base of the logarithm matters (e.g. "when $k$ is larger than $\log_2 n$"); while
  - the notation $\log$ (without the base) is used within asymptotic notations, where the base corresponds to an irrelevant constant factor (e.g. "$O(n\log n)$").
- We apologize if the rule behind its use was unclear, and added a footnote appearing at the first use of the logarithmic function in the article, stating the following:
#+BEGIN_QUOTE
The notation $\log_2$ is used any time the base of the logarithm matters (e.g. "when $k$ is larger than $\log_2 n$"); while the notation $\log$ (without the base) is used within asymptotic notations, where the base corresponds to an irrelevant constant factor (e.g. "$O(n\log n)$").
#+END_QUOTE

4. /Pages 12-18 are devoted to discussion of the impact of the new solutions on different areas of computer science. This discussion is too long taking into account that scientific and practical contributions of the paper are limited. (...) Remove unnecessary subsections (e.g. 5.4, 5.5)./

- Given the previous comments about the lack of practical impact of the results described in this work, and the fact that section 5.4's title is exactly "Potential (lack of) Practical Impact of our Results", we respectfully decline to remove the section.
- Given that section 5.5, "Generalisation to Non Binary Output Alphabets", seems at first sight like a logical extension of the work presented, and that this section explains how the impact of such an extension would be minor at best, we respectfully decline to remove this section as well: we apologize if the facts exposed in this section seem obvious to the referee, but they seemed obvious neither to the author nor to at least one colleague consulted on this topic.
- Nevertheless, we merged various sections of the conclusion into more general ones, in order to improve readability and to reduce the importance given to the sections in question: Section 5 now has only three subsections:
- 5.1 Relation to Previous Work
- 5.2 Potential (lack of) Practical Impact
- 5.3 Variants of the Optimal Prefix Free Code Problem

* Comments completely implemented

1. /I would suggest to remove the code from the appendix and just refer the reader to the gitlab repository, unfortunately many lines get truncated by the style of this journal (e.g., line 743, 757-758, ...)/

- Agreed. (After a lot of hesitation, the code had been added to the appendix in order to make it easier for the referees to review it without having to visit the gitlab repository. We apologize for the bad presentation.)

2. /There are almost 5 pages of Python programs in the Appendices. I would recommend to replace them by reference to some web-page from which programs can be conveniently downloaded. (...) Move Appendices to Internet./

- The code was /already/ in a gitlab repository, mentioned in the body of the article and a second time in the appendix:
#+BEGIN_QUOTE
*Data and Material Availability:*
Sources and Data Sets are available at https://gitlab.com/FineGrainedAnalysis/PrefixFreeCodes.
#+END_QUOTE
- The code was removed from the appendix.

* Additional minor modifications

In addition to implementing the comments from the referees, we also performed various minor corrections, listed here for transparency of the process:

1) Reviewed the capitalization of the occurrences of "Deferred Data Structure": it appeared most of the time in lower case and sometimes in upper case; all occurrences (but one, in the definition of "Partial Sum Deferred Data Structure") were moved to lower case;
2) Reviewed the use of "messages" (instead of "input symbols") and of "output symbols": this terminology is announced in the introduction, but was not consistently followed in the rest of the article;
3) Made the fonts used to refer to the various phases of the algorithm more uniform: most of them were in "teletype", but some were in bold and some in italics; all were set in teletype;
4) Made the fonts used to refer to Optimal Prefix Free Codes more uniform: most were in "textsc" but some were not;
5) Made the format of all the bibliographical references to Wikipedia articles more uniform;
6) Added a short proof to Property 1.

Once again, we would like to thank both referees for their celerity and their detailed revision of our work!

Author Response File: Author Response.pdf

Reviewer 2 Report

The article by Barbay describes the GDM (group-dock-mix) algorithm to
compute optimal prefix free codes. Prefix codes are widely known
in the field and are used by many compression programs. The main
idea behind the algorithm proposed in this paper is to plug deferred
data structures for multisets into (a modified version of) van
Leeuwen's algorithm (from 1976).
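
(For reference, van Leeuwen's algorithm builds the code in linear time from weights given in sorted order, using two queues; the following minimal sketch, in illustrative Python with our own naming, shows that classical baseline, not the GDM algorithm, which additionally defers the sorting through deferred data structures.)

#+BEGIN_SRC python
from collections import deque

def van_leeuwen_cost(sorted_weights):
    """Cost (weighted external path length) of an optimal prefix
    free code, computed in O(n) from weights sorted in
    non-decreasing order, with van Leeuwen's two-queue method."""
    leaves = deque(sorted_weights)  # input weights, already sorted
    internal = deque()              # merged weights, produced in sorted order
    def pop_min():
        if not internal or (leaves and leaves[0] <= internal[0]):
            return leaves.popleft()
        return internal.popleft()
    total = 0
    while len(leaves) + len(internal) > 1:
        w = pop_min() + pop_min()   # merge the two smallest weights
        internal.append(w)
        total += w
    return total

assert van_leeuwen_cost([1, 1, 2, 3, 5]) == 25
#+END_SRC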

This paper is an extension of the one presented by the same author in
CPM 2016, and significantly improves it (most notably, Section 4 and
Subsections 5.4-5.7 were not in the arxiv version of the CPM paper).
Overall I think that this is a good paper and the presentation is
great. I really liked the writing style and the paper was a pleasure
to read. Although the practical impact of this method will not be
significant (as stated by the author himself), I think that the
theoretical contribution of this article is enough for it to be
published in this journal. I don't have any additional major remarks
and I suggest accepting this manuscript as-is.

MINOR REMARKS:

I would suggest to remove the code from the appendix and just refer
the reader to the gitlab repository, unfortunately many lines get
truncated by the style of this journal (e.g., line 743, 757-758, ...),

Author Response

(The response to Reviewer 2 is identical to the response to Reviewer 1, reproduced in full above.)

Author Response File: Author Response.pdf
