Article

Approximation Ratios of RePair, LongestMatch and Greedy on Unary Strings †

Department Elektrotechnik und Informatik, Universität Siegen, D-57068 Siegen, Germany
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in the proceedings of SPIRE 2019.
Submission received: 20 January 2021 / Revised: 17 February 2021 / Accepted: 18 February 2021 / Published: 20 February 2021
(This article belongs to the Special Issue Algorithms and Data-Structures for Compressed Computation)

Abstract

A grammar-based compressor is an algorithm that receives a word and outputs a context-free grammar that only produces this word. The approximation ratio for a single input word is the size of the grammar produced for this word divided by the size of a smallest grammar for this word. The worst-case approximation ratio of a grammar-based compressor for a given word length is the largest approximation ratio over all input words of that length. In this work, we study the worst-case approximation ratio of the algorithms Greedy, RePair and LongestMatch on unary strings, i.e., strings that only make use of a single symbol. Our main contribution is to show the improved upper bound of $\mathcal{O}((\log n)^8 \cdot (\log\log n)^3)$ for the worst-case approximation ratio of Greedy. In addition, we also show the lower bound of $1.34847194\ldots$ for the worst-case approximation ratio of Greedy, and that RePair and LongestMatch have a worst-case approximation ratio of $\log_2(3)$.

1. Introduction

The goal of grammar-based compression is to represent a word w by a small context-free grammar that produces exactly $\{w\}$. Such a grammar is called a straight-line program (SLP) for w. In the best case, one gets an SLP of size $\Theta(\log n)$ for a word of length n, where the size of an SLP is the total length of all right-hand sides of the rules of the grammar. A grammar-based compressor is an algorithm that produces an SLP for a given word w. There are various grammar-based compressors that can be found in many places in the literature. A well-known example is the classic LZ78-compressor of Lempel and Ziv [1]. Although it was not introduced as a grammar-based compressor, it is straightforward to compute from the LZ78-factorization of w an SLP for w of roughly the same size. Other examples include BISECTION [2] and SEQUITUR [3]. In this work, we study the global grammar-based compressors Greedy [4,5,6], RePair [7] and LongestMatch [8], to which we will also refer as global algorithms. Global algorithms are important in practice because they show excellent compression results in various fields. For example, Greedy is used in [5] to compress DNA sequences. Among all global algorithms, RePair is probably the most used one. Examples include compressing web graphs [9], searching compressed text [10], suffix array compression [11] and compressing XML [12]. A key concept of global compressors is that of maximal strings. A maximal string of an SLP $\mathbb{A}$ is a word that has length at least two and occurs at least twice without overlap as a factor of the right-hand sides of the rules of $\mathbb{A}$. Furthermore, no strictly longer word appears at least as many times without overlap as a factor of the right-hand sides of $\mathbb{A}$. For an input word w, a global grammar-based compressor starts with the SLP that has a single rule $S \to w$, where S is the start nonterminal of the grammar. The SLP is then recursively updated by choosing a maximal string $\gamma$ of the current SLP and replacing a maximal set of pairwise nonoverlapping occurrences of $\gamma$ by a new nonterminal X. Additionally, a new rule $X \to \gamma$ is introduced. The algorithm stops when the obtained SLP has no maximal string. In the case of Greedy the chosen maximal string minimizes the size of the SLP in each round, while RePair selects in each round a most frequent maximal string, and LongestMatch chooses a longest maximal string. Please note that the Greedy algorithm as originally presented in [4,5,6] is different from the version studied in this work as well as in [13]: The original Greedy algorithm only considers the right-hand side of the start rule for the choice and the replacement of the maximal string. In particular, all other rules do not change after they are introduced.
In [13] the worst-case approximation ratio of grammar-based compressors is studied. For a grammar-based compressor C that computes an SLP $C(w)$ for a given word w, one defines the approximation ratio of C on w as the quotient of the size of $C(w)$ and the size $g(w)$ of a smallest SLP for w. The approximation ratio $\alpha_C(n)$ is the maximal approximation ratio of C among all words of length n. In [13] the authors provide upper and lower bounds for the approximation ratios of several grammar-based compressors (among them are all compressors mentioned so far), but for none of the compressors the lower and upper bounds match. For LZ78 and BISECTION these gaps were closed in [14]. For all global algorithms the best upper bound on the approximation ratio is $\mathcal{O}((n/\log n)^{2/3})$ [13], while the best known lower bounds so far are $\Omega(\log n/\log\log n)$ for RePair [15], $\Omega(\log\log n)$ for LongestMatch and $5/(3\log_3(5)) = 1.137\ldots$ for Greedy [13]. In general, the achieved bounds “leave a large gap of understanding surrounding the global algorithms” as the authors in [13] conclude.
Unary words have the form $a^n$ for some symbol a and integer $n \geq 1$. Grammar-based compression on unary words is strongly related to the field of addition chains, which has been studied for decades (see [16] (Chapter 4.6.3) for a survey) and still is an active topic due to the strong connection to public key cryptosystems (see [17] for a review from that point of view). An addition chain for an integer n of size m is a sequence of integers $1 = k_1, k_2, \ldots, k_m = n$ such that for each d ($2 \leq d \leq m$), there exist i, j ($1 \leq i, j < d$) such that $k_i + k_j = k_d$. It is straightforward to compute from an addition chain for an integer n of size m an SLP for $a^n$ of size $2m - 2$. Vice versa, an SLP for $a^n$ of size m yields an addition chain for n of size m. Therefore, grammar-based compressors on unary inputs can also be thought of as addition chain solvers, i.e., algorithms that find a (small) addition chain for a given integer.
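To make this correspondence concrete, the following Python sketch (our own illustration; the function and nonterminal names are not taken from the cited literature) converts an addition chain into an SLP for $a^n$ of size $2m-2$:

```python
def slp_from_addition_chain(chain):
    """Turn an addition chain 1 = k_1, ..., k_m = n into an SLP for a^n.

    One rule of size 2 is created per chain element after the first,
    so the SLP has size 2m - 2, as stated above.  The nonterminal
    names N1, N2, ... are our own choice, not from the literature.
    """
    assert chain[0] == 1
    def symbol(d):                      # symbol deriving a^{chain[d]}
        return "a" if d == 0 else f"N{d}"
    rules = {}
    for d in range(1, len(chain)):
        # find i, j with chain[i] + chain[j] == chain[d]
        i, j = next((i, j) for i in range(d) for j in range(d)
                    if chain[i] + chain[j] == chain[d])
        rules[symbol(d)] = [symbol(i), symbol(j)]
    return rules                        # start symbol is symbol(len(chain) - 1)

# The addition chain 1, 2, 4, 5, 10 for n = 10 yields an SLP of size 8 = 2*5 - 2.
print(slp_from_addition_chain([1, 2, 4, 5, 10]))
```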
The worst-case approximation ratio of global algorithms is difficult to analyze. A good starting point is therefore to analyze them on unary words because of their simplicity. Even though unary words are not interesting to compress, it is still interesting to look at how global algorithms perform on them. The improved upper bound we show for Greedy uses unary words and is the first improvement in 15 years.
We show the worst-case approximation ratio of RePair and LongestMatch for unary words to be $\log_2(3)$. Both algorithms are basically identical to the binary method that produces an addition chain for n by creating powers of two using repeated squaring, and then the integer n is represented as the sum of those powers of two that correspond to a one in the binary representation of n. Based on that information, we show that for any unary input w the produced SLPs of RePair and LongestMatch have size at most $\log_2(3)\cdot g(w)$ and we also provide a lower bound.
We improve the upper bound for the approximation ratio of Greedy on unary words to $\mathcal{O}((\log n)^8\cdot(\log\log n)^3)$. In [18], which is the previous version of this article, the authors showed an upper bound for the approximation ratio of Greedy of $\mathcal{O}(n^{1/4}/\log n)$ for unary inputs, by only analyzing the first three rounds. Here, we present a more in-depth analysis that makes use of every round. We can prove that Greedy produces an SLP of size $\mathcal{O}((\log n)^9\cdot(\log\log n)^3)$ on input $a^n$, which together with the fact that a smallest SLP for $a^n$ has size $\Omega(\log n)$ then yields the improved upper bound of $\mathcal{O}((\log n)^8\cdot(\log\log n)^3)$. To prove the size bound on the SLP produced by Greedy, we distinguish unary and nonunary nonterminals. A nonterminal X is called unary if its rule is of the form $X \to Z^d$ when it is first introduced. Otherwise, it is nonunary. We bound the total number of occurrences of all unary and nonunary nonterminals in the grammar produced by Greedy separately. For the unary nonterminals, we bound their total number by $\mathcal{O}((\log n)^9)$, while each of them contributes a size of $\mathcal{O}(\log\log n)$, which yields a total size contribution of $\mathcal{O}((\log n)^9\cdot\log\log n)$. We then bound the number of occurrences of nonunary nonterminals using the already established number of unary nonterminals, which comes out to be $\mathcal{O}((\log n)^9\cdot(\log\log n)^3)$. Thus, we obtain the desired upper bound on the size of the grammar.
We also show the lower bound of $1.34847194\ldots$ for the approximation ratio of Greedy. The key to achieving this bound is the sequence $y_k = y_{k-1}^2 + 1$ with $y_0 = 2$, which has been studied in [19] (among other sequences), where it is shown that $y_k = \lfloor\gamma^{2^k}\rfloor$ for $\gamma = 2.258\ldots$. To prove the lower bound, we show that the SLP produced by Greedy on input $a^{y_k}$ has size $3\cdot 2^k - 1$, while a smallest SLP for $a^{y_k}$ has size $3\cdot\log_3(\gamma)\cdot 2^k + o(2^k)$ (this follows from a construction used to prove the lower bound for Greedy in [13]).
This paper is an extended version of our paper published in the proceedings of SPIRE 2019 [18].

Related Work

One of the first appearances of straight-line programs in the literature are [20,21], where they are called word chains (since they generalize addition chains from numbers to words). In [20] it is shown that the function $g(k,n) = \max\{g(w) \mid w \in \{1,\ldots,k\}^n\}$ is in $\Theta(n/\log_k n)$. Recall that $g(w)$ is the size of a smallest SLP for the word w and thus $g(k,n)$ measures the worst-case SLP-compression over all words of length n over a k-letter alphabet.
The smallest grammar problem is the problem of computing a smallest SLP for a given input word. It is known from [13,22] that in general no grammar-based compressor can solve the smallest grammar problem in polynomial time unless $\mathsf{P} = \mathsf{NP}$. Even worse, unless $\mathsf{P} = \mathsf{NP}$ one cannot compute in polynomial time for a given word w an SLP of size at most $\frac{8569}{8568}\cdot g(w)$ [13]. One should mention that the constructions to prove these hardness results use alphabets of unbounded size. Although in [13] it is remarked that the construction in [22] works for words over a ternary alphabet, in [23] it is argued that this is not clear at all and a construction for fixed alphabets of size at least 24 is given. However, for grammar-based compression on unary strings as studied in this work (as well as for the problem of computing a smallest addition chain), there is no $\mathsf{NP}$-hardness result, so there might be an optimal polynomial-time algorithm even though it is widely believed that there is none.
Other notable systematic investigations of grammar-based compression are provided in [8,24]. While in [8] grammar-based compressors are used for universal lossless compression (in the information-theoretical sense), it is shown in [24] that the size of so-called irreducible SLPs (which include SLPs produced by global algorithms) can be upper-bounded by the (unnormalized) k-th order empirical entropy of the produced string plus some lower order terms.

2. Preliminaries

For $i, j \in \mathbb{N}$ we write $[i,j] = \{i, i+1, \ldots, j\}$ for $i \leq j$ and $[i,j] = \emptyset$ otherwise. For $m, n \in \mathbb{N}$ we denote by $m \operatorname{div} n$ the integer division of m and n. We denote by $m \bmod n$ the modulo of m and n, i.e., $m \bmod n \in [0, n-1]$ and
$m = (m \operatorname{div} n)\cdot n + (m \bmod n).$
If $m/n$ or $\frac{m}{n}$ is used, then this refers to the standard division over $\mathbb{R}$. Please note that $m \operatorname{div} n = \lfloor m/n \rfloor$ and $(m \operatorname{div} n) + (m \bmod n) \geq m/n$.
An alphabet $\Sigma$ is a finite set of symbols. For a word or string $w = a_1\cdots a_n$ over $\Sigma$ with $a_1, \ldots, a_n \in \Sigma$ and $n \geq 0$ we write $|w| = n$ to denote w's length. The set $\Sigma^*$ consists of all words over $\Sigma$ and $\Sigma^+ = \Sigma^* \setminus \{\varepsilon\}$, where $\varepsilon$ is the word of length 0. A unary word is a word of the form $a^n$ with $a \in \Sigma$ and $n \in \mathbb{N}$. All other words $w \in \Sigma^+$ are called nonunary. For words $w, v \in \Sigma^+$ we say that v is a factor of w if there are $x, y \in \Sigma^*$ such that $w = xvy$.
A context-free grammar is a tuple $(N, \Sigma, P, S)$, where N is the finite set of nonterminals, $\Sigma$ is the alphabet with $\Sigma \cap N = \emptyset$, P is the set of productions of the form $X \to v$, where $X \in N$ and $v \in (\Sigma \cup N)^+$, and $S \in N$ is the start symbol. An SLP is a context-free grammar $\mathbb{A} = (N, \Sigma, P, S)$, where
  • every $X \in N$ has exactly one production, i.e.,
    $|\{(X \to v) \in P \mid v \in (\Sigma \cup N)^+\}| = 1$, and
  • the relation $\{(A, B) \in N \times N \mid (A \to v) \in P,\ B \text{ occurs in } v\}$ is acyclic.
This way, for every $X \in N$ there exists a unique word $w \in \Sigma^+$ with $X \Rightarrow^+ w$. We say that $\mathbb{A}$ produces w if $S \Rightarrow^+ w$. Please note that some authors require SLPs to be in Chomsky Normal Form (CNF), i.e., every production is of the form $A \to BC$, where $B, C \in N$, or $A \to a$, where $a \in \Sigma$. We do not make this assumption here because, in general, grammar-based compressors produce SLPs that are not in CNF. Furthermore, every SLP can easily be transformed into an SLP that is in CNF, produces the same word and has roughly the same size. A grammar-based compressor C is an algorithm that given an input $w \in \Sigma^+$ outputs an SLP $\mathbb{A}$ that produces w. The size of an SLP is defined as $|\mathbb{A}| = \sum_{(X \to v) \in P} |v|$. For a word $w \in \Sigma^+$ we write $g(w)$ for the size of a smallest SLP that produces w. The worst-case approximation ratio $\alpha_C(k,n)$ of C is the maximal approximation ratio over all words of length n over an alphabet of size k:
$\alpha_C(k,n) = \max\left\{\frac{|C(w)|}{g(w)} \;\middle|\; w \in [1,k]^n\right\}.$
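As a small illustration (not part of the paper's formal development), an SLP can be represented as a mapping from nonterminals to right-hand sides. The following Python sketch computes the size $|\mathbb{A}|$ and the produced word for a hypothetical example grammar:

```python
from functools import lru_cache

# An SLP as a dict mapping each nonterminal to its right-hand side (a list of
# symbols); terminals are exactly the symbols that do not occur as keys.
# The example grammar is hypothetical and produces the word "abbabbaa".
slp = {"S": ["X", "X", "a", "a"], "X": ["a", "b", "b"]}

def size(slp):
    """Size of an SLP: the total length of all right-hand sides."""
    return sum(len(rhs) for rhs in slp.values())

def expand(slp, start="S"):
    """The unique word produced from the start symbol."""
    @lru_cache(maxsize=None)
    def val(sym):
        return sym if sym not in slp else "".join(val(s) for s in slp[sym])
    return val(start)

print(size(slp), expand(slp))   # 7 abbabbaa
```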
For a given SLP A , a word γ is called a maximal string of A if
  • $|\gamma| \geq 2$,
  • γ appears at least twice without overlap as a factor of the right-hand sides,
  • and no strictly longer word appears at least as many times as a factor of the right-hand sides without overlap.
Example 1.
Let $\mathbb{A} = (\{S, X, Y, Z\}, \{a, b\}, P, S)$ such that P contains
  • $S \to aXXXXbbYZYZ$,
  • $X \to YbbYZYZa$,
  • $Y \to bbbZZaZ$, and
  • $Z \to abb$.
The maximal strings of $\mathbb{A}$ are $bb$, $YZ$ and $bbYZYZ$. The factors $bb$ and $YZ$ occur four times on the right-hand sides without overlap and $bbYZYZ$ occurs twice without overlap.
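The following Python sketch (our own brute-force illustration, not an algorithm from the paper) computes the maximal strings of a small SLP by counting nonoverlapping occurrences greedily from left to right; on Example 1 it returns exactly $bb$, $YZ$ and $bbYZYZ$:

```python
def nonoverlap_count(rhs_list, gamma):
    """Count nonoverlapping occurrences of gamma over all right-hand sides,
    greedily from left to right (each right-hand side is a tuple of symbols)."""
    total = 0
    for rhs in rhs_list:
        i = 0
        while i + len(gamma) <= len(rhs):
            if rhs[i:i + len(gamma)] == gamma:
                total += 1
                i += len(gamma)
            else:
                i += 1
    return total

def maximal_strings(slp):
    """All maximal strings of an SLP, computed by brute force."""
    rhs_list = [tuple(rhs) for rhs in slp.values()]
    cands = {r[i:j] for r in rhs_list
             for i in range(len(r)) for j in range(i + 2, len(r) + 1)}
    counts = {g: nonoverlap_count(rhs_list, g) for g in cands}
    return [g for g, c in counts.items() if c >= 2 and
            not any(len(h) > len(g) and counts[h] >= c for h in cands)]

# The SLP of Example 1, with right-hand sides spelled out symbol by symbol.
example1 = {"S": list("aXXXXbbYZYZ"), "X": list("YbbYZYZa"),
            "Y": list("bbbZZaZ"), "Z": list("abb")}
print(sorted("".join(g) for g in maximal_strings(example1)))
# ['YZ', 'bb', 'bbYZYZ']
```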
A global grammar-based compressor (or simply global algorithm) starts on input w with the SLP $\mathbb{A}_0 = (\{S\}, \Sigma, \{S \to w\}, S)$. In each round $i \geq 1$, the algorithm selects a maximal string $\gamma$ of $\mathbb{A}_{i-1}$ and updates $\mathbb{A}_{i-1}$ to $\mathbb{A}_i$ by replacing a largest set of pairwise nonoverlapping occurrences of $\gamma$ in $\mathbb{A}_{i-1}$ by a new nonterminal X. Additionally, the algorithm introduces the rule $X \to \gamma$ in $\mathbb{A}_i$. The algorithm stops when no maximal string occurs. Please note that the replacement is not unique, e.g., the word $a^5$ has a unique maximal string $\gamma = aa$, which yields SLPs with rules $S \to XXa$, $X \to aa$ or $S \to XaX$, $X \to aa$ or $S \to aXX$, $X \to aa$. We assume the first variant here, i.e., maximal strings are replaced from left to right. The compressor Greedy that we study in this work chooses a maximal string in each round $i \geq 1$ such that the size of $\mathbb{A}_i$ is minimal.
Example 2 (Greedy).
Let $w = aaaaabbababbbaaabb$. We have
$\mathbb{A}_0$: $S \to aaaaabbababbbaaabb$,
$\mathbb{A}_1$: $S \to aaaaXabXbaaX$, $X \to abb$,
$\mathbb{A}_2$: $S \to YYXabXbYX$, $X \to abb$, $Y \to aa$,
$\mathbb{A}_3$: $S \to YYXZXbYX$, $X \to Zb$, $Y \to aa$, $Z \to ab$,
$\mathbb{A}_4$: $S \to YAZXbA$, $X \to Zb$, $Y \to aa$, $Z \to ab$, $A \to YX$.
Please note that in the first round, instead of the maximal string $abb$ the algorithm could also choose the maximal string $aaabb$, because both choices yield SLPs of minimal size 15. In the second round, instead of $aa$ the algorithm could also choose $aaX$, because both choices yield SLPs of size 14. Finally, the order of the choices $ab$ (round 3) and $YX$ (round 4) could be swapped because both choices yield SLPs of unchanged size 14.
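A complete global algorithm then only needs the replacement step. The sketch below (again our own illustration, reusing the functions size and maximal_strings from the previous sketches) replaces a chosen maximal string from left to right and, for Greedy, picks in each round a maximal string that minimizes the resulting size; ties may be broken differently than in Example 2, but the final size is the same there:

```python
def replace_all(slp, gamma, new_nt):
    """Replace nonoverlapping occurrences of gamma from left to right by the
    fresh nonterminal new_nt in every right-hand side, and add new_nt -> gamma."""
    out = {}
    for lhs, rhs in slp.items():
        new_rhs, i = [], 0
        while i < len(rhs):
            if tuple(rhs[i:i + len(gamma)]) == gamma:
                new_rhs.append(new_nt)
                i += len(gamma)
            else:
                new_rhs.append(rhs[i])
                i += 1
        out[lhs] = new_rhs
    out[new_nt] = list(gamma)
    return out

def greedy(word):
    """In each round, pick a maximal string whose replacement minimizes the size."""
    slp, rounds = {"S": list(word)}, 0
    while True:
        cands = maximal_strings(slp)         # from the sketch above
        if not cands:
            return slp
        rounds += 1
        nt = f"N{rounds}"                    # fresh nonterminal name
        slp = min((replace_all(slp, g, nt) for g in cands), key=size)

print(size(greedy("aaaaabbababbbaaabb")))    # 14, as in Example 2
```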
The following lemma from [13] provides a lower bound on the size of an SLP for a word of length n.
Lemma 1
([13] (Lemma 1)). For every word $w \in \Sigma^+$ of length n, we have $g(w) \geq 3\log_3(n) - 3$.
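For intuition about the constant in Lemma 1: for $n = 3^k$ the lower bound is nearly attained by a ternary SLP of size $3k$, as the following small sketch (our own illustration) shows:

```python
import math

def ternary_slp(k):
    """An SLP of size 3k for a^(3^k): X1 -> aaa and X_{i+1} -> X_i X_i X_i."""
    rules = {"X1": ["a"] * 3}
    for i in range(2, k + 1):
        rules[f"X{i}"] = [f"X{i-1}"] * 3
    return rules                          # start symbol is X_k

n = 3 ** 4
print(sum(len(r) for r in ternary_slp(4).values()),   # SLP size 12
      3 * math.log(n, 3) - 3)                          # lower bound of Lemma 1: 9.0
```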

3. Upper Bound for Greedy

To show our improved upper bound for the approximation ratio of Greedy on unary words, we are first going to prove that the size of the SLP produced by Greedy for the input $a^n$ is upper-bounded by $\mathcal{O}((\log n)^9\cdot(\log\log n)^3)$.
Proposition 1.
For all n, we have
$|\mathrm{Greedy}(a^n)| \in \mathcal{O}((\log n)^9\cdot(\log\log n)^3).$
First, we need to prove several lemmas that are fulfilled for any global algorithm. When we apply specific arguments to Greedy, we draw attention to it. For better readability, we will use $X_0 = a$, i.e., the input is $X_0^n$. Furthermore, let $\mathbb{A}_i = (N_i, \{X_0\}, P_i, S)$ be the SLP obtained by the global algorithm on input $X_0^n$ after i rounds. Please note that until the algorithm stops, we have $|N_i \setminus \{S\}| = i$ since exactly one new nonterminal is introduced in each round. If we quantify over the rounds of the algorithm, we always implicitly mean that the statements hold until the algorithm stops. If i is mentioned without a quantification, then the statement holds for any $\mathbb{A}_i$ constructed after some round i of the algorithm.
Lemma 2.
For every i, there is a fixed order $X_i > X_{i-1} > \cdots > X_1$ of the nonterminals in $N_i \setminus \{S\}$ such that every right-hand side of a rule $(X \to v) \in P_i$ satisfies
$v \in X_i^*\, X_{i-1}^* \cdots X_1^*\, X_0^*.$
Proof. 
We prove this property by induction. Initially, the property holds for the SLP $\mathbb{A}_0$ since $N_0 \setminus \{S\} = \emptyset$ and the only rule $S \to X_0^n$ satisfies $X_0^n \in X_0^*$. Now assume the claim is true for $\mathbb{A}_i$, i.e., each right-hand side of a rule in $P_i$ is a word from $X_i^* X_{i-1}^* \cdots X_1^* X_0^*$. Please note that any nonempty factor of such a right-hand side is a word from $X_k^* \cdots X_{j+1}^* X_j^+$ for some $i \geq k > j \geq 0$. Therefore, assume the global algorithm chooses a maximal string $\gamma \in X_k^* \cdots X_{j+1}^* X_j^+$ in round $i+1$ and $(X \to \gamma) \in P_{i+1}$ is the corresponding new rule. We show that $v' \in X_i^* X_{i-1}^* \cdots X_{j+1}^* X^* X_j^* \cdots X_1^* X_0^*$ for all rules $(Y \to v') \in P_{i+1}$, i.e., the order of the nonterminals after round $i+1$ is obtained by inserting the new nonterminal X directly before $X_j$ in the previous order. First, this is obviously true for the new rule $X \to \gamma$ as well as for all rules that have not been modified during round $i+1$. It remains to check the rules $(Y \to v') \in P_{i+1}$ that are obtained from a rule $(Y \to v) \in P_i$ by replacing a largest set of pairwise nonoverlapping occurrences of $\gamma$ in v by the new nonterminal X. If $\gamma = X_j^d$ ($d \geq 2$) is unary and $X_j^{\ell}$ is the single maximal $X_j$-block that occurs in v, then replacing occurrences of $\gamma$ from left to right yields $X^{\ell \operatorname{div} d}$ and $X_j^{\ell \bmod d}$ as the new maximal blocks of X and $X_j$ in $v'$. It follows that $v' \in X_i^* X_{i-1}^* \cdots X_{j+1}^* X^* X_j^* \cdots X_1^* X_0^*$. If otherwise $\gamma$ is not a unary word, i.e., $\gamma \in X_k^+ \cdots X_{j+1}^* X_j^+$ for $i \geq k > j \geq 0$, then v has exactly one occurrence of $\gamma$ as a factor. It follows that $v' \in X_i^* X_{i-1}^* \cdots X_k^*\, X\, X_j^* \cdots X_1^* X_0^*$ and thus $v'$ satisfies the claim. This finishes the induction. □
In other words, there is at most one maximal block for each symbol on each right-hand side and the order of these blocks is the same for all rules. Similar to the case distinction in the last steps of the proof of Lemma 2, we will distinguish two types of nonterminals. Let $X \to \gamma$ be the introduced rule in some round of the algorithm. If $\gamma$ is unary, then we call X a unary nonterminal. Otherwise, we call X nonunary. We categorize $X_0$ as a unary nonterminal, although formally $X_0$ is not a nonterminal. Please note that the type of a nonterminal is decided when it is introduced and does not change later, i.e., even if the right-hand side of a unary nonterminal becomes nonunary during the execution of the algorithm, the type of the nonterminal stays the same. Our strategy to prove Theorem 1 is to bound the total number of occurrences of unary nonterminals and nonunary nonterminals on right-hand sides independently. It follows from Lemma 2 that every factor that occurs more than once on the right-hand side of a single rule is unary. The following lemma is a direct consequence of that fact.
Lemma 3.
Every nonunary nonterminal occurs at most once on the right-hand side of each rule at any time of the algorithm.
Corollary 1.
If a unary nonterminal X is introduced and $X \to Z^d$ with $d \geq 2$ is the corresponding rule for X, then Z is a unary nonterminal.
Next, we bound the number of rules that contain a unary nonterminal X on the right-hand side. For a nonterminal X, including $X_0$, let $\#_i(X)$ be the number of rules of $\mathbb{A}_i$ where X occurs on the right-hand side, or more formally
$\#_i(X) = |\{(Y \to v) \in P_i \mid X \text{ occurs in } v\}|.$
The next two lemmas describe how # i evolves depending on the type of the introduced nonterminal.
Lemma 4.
If a nonunary nonterminal X is introduced in some round $i+1$, then for every nonterminal $X' \neq X$ (including $X_0$), we have $\#_{i+1}(X') \leq \#_i(X')$.
Proof. 
To prove this point, we use that all rules ( Y v ) P i satisfy v X i * X i 1 * X 1 * X 0 * (Lemma 2). Since X is nonunary, the chosen maximal string γ satisfies γ X k + X j + 1 * X j + for i k > j 0 . If a nonterminal X does not occur in γ , then # i + 1 ( X ) = # i ( X ) . So, assume the nonterminal X occurs in γ , i.e., X = X t for t [ j , k ] . Note first that X t occurs on the right-hand side of the new rule ( X γ ) P i + 1 . It follows that to prove the claimed result, we must show that for at least one rule ( Y v ) P i such that X t occurs in v, all occurrences of X t must disappear, i.e., X t does not occur in v for ( Y v ) P i + 1 . Let
$M = \{Y \in N_i \mid (Y \to v) \in P_i \text{ and } \gamma \text{ is a factor of } v\} \subseteq N_i$
be the set of nonterminals of A i where the corresponding rule is modified in round i + 1 . Please note that | M | 2 since a maximal string occurs at least twice on all right-hand sides and γ occurs at most once as factor of each rule since X is nonunary (Lemma 3). If k < t < j then for all Y M and ( Y v ) P i + 1 , the nonterminal X t does not occur in v anymore since the complete X t -block (among other symbols) has been replaced. This means that # i + 1 ( X t ) < # i ( X t ) since one new rule contains X t on the right-hand side, while for at least two rules the occurrences of X t have been removed in round i + 1 . If otherwise t = k or t = j , then the same argument fails since for Y M and ( Y v ) P i + 1 , the right-hand side v could still contain X t since it is not necessarily true that the complete X t -block has been replaced. However, due to the properties of a maximal string, we show that X t does not occur in v for at least one Y M and ( Y v ) P i + 1 . Towards a contradiction, assume X t occurs in v for all Y M and ( Y v ) P i + 1 . This means that for all Y M and ( Y v ) P i , the length of the maximal X t -block that occurs in v is strictly larger than the length of the maximal X t -block that occurs in γ . If t = k , it follows that X k γ is a factor of v for all Y M and ( Y v ) P i , and symmetrically, if t = j then γ X j is a factor of v for all Y M and ( Y v ) P i . This contradicts the property that no strictly longer string than γ occurs at least as often on the right-hand sides of the rules. It follows that in this case # i + 1 ( X t ) # i ( X t ) , which finishes the proof. □
Lemma 5.
If a unary nonterminal X is introduced in some round $i+1$ and $X \to Z^d$ with $d \geq 2$ is the corresponding rule, then $\#_{i+1}(X) \leq \#_i(Z)$ and $\#_{i+1}(Z) \leq \#_i(Z) + 1$.
Proof. 
Both points are straightforward: A rule $(Y \to v') \in P_{i+1}$ only contains X on the right-hand side if $Z^d$ is a factor of v for $(Y \to v) \in P_i$, which shows that $\#_{i+1}(X) \leq \#_i(Z)$. For the second point, note that if $(Y \to v) \in P_i$ does not contain Z on the right-hand side, then the same is true for (the unchanged rule) $(Y \to v) \in P_{i+1}$. The only rule where Z newly occurs is the new rule $X \to Z^d$, and thus $\#_{i+1}(Z) \leq \#_i(Z) + 1$. □
So far, we have shown that when a unary nonterminal X is introduced and $X \to Z^d$ with $d \geq 2$ is the corresponding rule, then Z is a unary nonterminal as well (Corollary 1). Furthermore, we argued that introducing a nonunary nonterminal does not increase the number of rules where a unary nonterminal occurs on the right-hand side (Lemma 4). It follows that we can upper-bound the number of unary nonterminals and the number of rules where those nonterminals occur on right-hand sides independently of the nonunary nonterminals.
To do so, we inductively define a binary tree T i that describes how the unary nonterminals evolve until A i is reached. All nodes in the tree are labeled with ( X , k ) , where X is a unary nonterminal and k is an upper bound on # j ( X ) for some j.
(1)
Initially, the tree T 0 only contains a single node that is labeled with ( X 0 , 1 ) .
(2)
If a rule $X \to Z^d$ is introduced in round $i+1$ for some $d \geq 2$, then we update $T_i$ to $T_{i+1}$ by adding two children to the unique leaf that is labeled with $(Z, k)$ for some k. The new left child is labeled with $(Z, k+1)$ and the new right child is labeled with $(X, k)$ as depicted on the left of Figure 1.
(3)
If otherwise a nonunary nonterminal is introduced in round i + 1 , then T i + 1 = T i , i.e., nonunary nonterminals are ignored.
The initial tree $T_0$ reflects that the only unary nonterminal of $\mathbb{A}_0$ is $X_0$ and $\#_0(X_0) = 1$. If the tree is modified according to point (2) of the definition, this refers to Lemma 5, where $\#_{i+1}(X) \leq \#_i(Z)$ and $\#_{i+1}(Z) \leq \#_i(Z) + 1$ is shown when a rule $X \to Z^d$ for $d \geq 2$ is introduced.
The level of a node is the length of the path from the node to the root. For a unary nonterminal $X \in N_i \cup \{X_0\}$, we denote by $\mathrm{level}_i(X)$ the level of the unique leaf of $T_i$ that is labeled with $(X, k)$ for some k.
Example 3.
Assume that the first three rules introduced by a global algorithm are $X_2 \to X_0^{d_1}$ in the first round, $X_1 \to X_0^{d_2}$ in the second round and $X_3 \to X_2^{d_3}$ in the third round for $d_1, d_2, d_3 \geq 2$. The tree $T_3$ that corresponds to these introduced rules is depicted on the right of Figure 1. The indices for the introduced nonterminals are chosen such that the ordering of the nonterminals in $N_3 \setminus \{S\}$ (see Lemma 2) is $X_3 > X_2 > X_1$, i.e., all right-hand sides of rules are contained in $X_3^* X_2^* X_1^* X_0^*$. The corresponding SLPs $\mathbb{A}_0$, $\mathbb{A}_1$, $\mathbb{A}_2$ and $\mathbb{A}_3$ are depicted next, where we simply use $*$ instead of the exact exponents of the symbols for better readability.
$\mathbb{A}_0$: $S \to X_0^*$
$\mathbb{A}_1$: $S \to X_2^* X_0^*$, $X_2 \to X_0^*$
$\mathbb{A}_2$: $S \to X_2^* X_1^* X_0^*$, $X_2 \to X_1^* X_0^*$, $X_1 \to X_0^*$
$\mathbb{A}_3$: $S \to X_3^* X_2^* X_1^* X_0^*$, $X_2 \to X_1^* X_0^*$, $X_1 \to X_0^*$, $X_3 \to X_2^*$
Please note that in this example, we have $\#_3(X_0) \leq 3$, $\#_3(X_2) \leq 2$, $\#_3(X_1) \leq 2$ and $\#_3(X_3) \leq 1$, which is exactly the information contained in the second components of the leaf labels in $T_3$. Furthermore, we have $\mathrm{level}_3(X_i) = 2$ for $i \in [0,3]$ in this example.
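The tree construction (1)–(3) is easy to simulate. The following Python sketch (our own illustration, with ad hoc data structures) builds $T_i$ from the sequence of introduced rules and reproduces the leaf labels of Example 3:

```python
def build_unary_tree(rules):
    """Build T_i from the sequence of introduced rules.  A unary rule X -> Z^d
    is passed as (X, (Z, d)); a nonunary rule as (X, None) and is ignored.
    Nodes are plain dicts; 'leaves' maps each unary nonterminal to its leaf."""
    root = {"label": ("X0", 1), "children": []}
    leaves = {"X0": root}
    for X, rhs in rules:
        if rhs is None:                  # nonunary nonterminal: tree unchanged
            continue
        Z, _d = rhs
        node = leaves[Z]
        _, k = node["label"]
        left = {"label": (Z, k + 1), "children": []}
        right = {"label": (X, k), "children": []}
        node["children"] = [left, right]
        leaves[Z], leaves[X] = left, right
    return root, {X: leaf["label"][1] for X, leaf in leaves.items()}

# Example 3: X2 -> X0^{d1}, X1 -> X0^{d2}, X3 -> X2^{d3}
_tree, bounds = build_unary_tree([("X2", ("X0", 2)), ("X1", ("X0", 2)), ("X3", ("X2", 2))])
print(bounds)    # {'X0': 3, 'X2': 2, 'X1': 2, 'X3': 1}, the bounds on #_3 from Example 3
```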
The following lemma is a direct consequence of the fact that the maximal k that occurs for some label ( X , k ) is incremented from one level to the next level (as described in Lemma 5).
Lemma 6.
For each node of $T_i$ at level m that is labeled with $(X, k)$ for some unary nonterminal $X \in N_i \cup \{X_0\}$, we have $k \leq m + 1$. Also, let $X \in N_i \cup \{X_0\}$ be a unary nonterminal. Then we have $\#_i(X) \leq \mathrm{level}_i(X) + 1$.
So far, we provided information about the number of rules where a unary nonterminal occurs. Next, we move on to the total number of occurrences of a unary nonterminal on all right-hand sides. We denote by $t_i(X)$ the total number of occurrences of X on right-hand sides of rules in $\mathbb{A}_i$. We have $\#_i(X) \leq t_i(X)$ by the definition of both functions, and for a nonunary nonterminal X we have $\#_i(X) = t_i(X)$ due to Lemma 3.
Lemma 7.
Let $X \to \gamma$ be the rule that is introduced in some round $i+1$ and let $M = \{Y \in N_i \mid Y \text{ occurs in } \gamma\}$. We have
(1) 
$t_{i+1}(Y) \leq t_i(Y)$ for all $Y \in N_i = N_{i+1} \setminus \{X\}$, and
(2) 
$\sum_{Y \in M} t_{i+1}(Y) + t_{i+1}(X) \leq \sum_{Y \in M} t_i(Y).$
Proof. 
Point (1) is straightforward: For $Y \in M$ let $Y^{\ell}$ be the maximal Y-block that occurs as a factor in $\gamma$ for some $\ell \geq 1$. Replacing $\gamma$ on right-hand sides yields that at least two occurrences of $Y^{\ell}$ are eliminated, while only one occurrence of $Y^{\ell}$ is added as a part of the new rule $X \to \gamma$. If otherwise $Y \notin M$, then $t_{i+1}(Y) = t_i(Y)$ because the occurrences of Y are not affected by the new rule.
Point (2) is also based on a simple observation. Please note that $\sum_{Y \in M} t_i(Y)$ describes the part of the SLP $\mathbb{A}_i$ that is affected by the replacement of $\gamma$ in round $i+1$, and $\sum_{Y \in M} t_{i+1}(Y) + t_{i+1}(X)$ is the size of that part in $\mathbb{A}_{i+1}$ after the occurrences of $\gamma$ are replaced by X, plus the new occurrences in the introduced rule. All other parts of $\mathbb{A}_i$ are not affected by the new rule. Now the properties of a maximal string ensure that $|\mathbb{A}_{i+1}| \leq |\mathbb{A}_i|$. The extreme case where $\gamma$ has length two and occurs only twice without overlap on the right-hand sides of rules in $\mathbb{A}_i$ satisfies $|\mathbb{A}_{i+1}| = |\mathbb{A}_i|$. All other cases even satisfy $|\mathbb{A}_{i+1}| < |\mathbb{A}_i|$. Point (2) directly follows. □
Our next goal is to bound t i ( X ) depending on level i ( X ) for a unary nonterminal X. To do so, we now apply arguments specific to Greedy. Recall that Greedy selects a maximal string that minimizes the size of the obtained SLP in each round.
Lemma 8.
Let $Z \in N_i \cup \{X_0\}$ be a unary nonterminal and assume a rule $X \to Z^d$ for some $d \geq 2$ is introduced by Greedy in round $i+1$. We have
$t_{i+1}(X) + t_{i+1}(Z) \leq 2\sqrt{t_i(Z)\cdot(\#_i(Z)+1)} + 1.$
Proof. 
If a unary nonterminal X and a rule $X \to Z^d$ with $d \geq 2$ are introduced in round $i+1$, then the choice of d only depends on the maximal Z-blocks occurring on all right-hand sides of rules in $\mathbb{A}_i$, since the remaining part of $\mathbb{A}_i$ does not change. Assume that $\#_i(Z) = k$ and let $\ell_1, \ldots, \ell_k$ be the lengths of the maximal Z-blocks occurring on right-hand sides of $\mathbb{A}_i$, i.e., $\sum_{j=1}^{k}\ell_j = t_i(Z)$. Then Greedy minimizes $t_{i+1}(X) + t_{i+1}(Z) = d + \sum_{j=1}^{k}(\ell_j \operatorname{div} d) + (\ell_j \bmod d)$, where d is the size of the new rule $X \to Z^d$ and for each $j \in [1,k]$ a maximal block $Z^{\ell_j}$ on the right-hand side of a rule in $\mathbb{A}_i$ is transformed into $X^{\ell_j \operatorname{div} d}\, Z^{\ell_j \bmod d}$. Due to the greedy nature of the algorithm, the following inequality holds for all $d \geq 1$:
$t_{i+1}(X) + t_{i+1}(Z) \leq d + \sum_{j=1}^{k}(\ell_j \operatorname{div} d) + (\ell_j \bmod d) \leq d + \sum_{j=1}^{k}\frac{\ell_j}{d} + k(d-1) = d + \frac{t_i(Z)}{d} + k(d-1).$
Please note that the chosen maximal string has length at least 2, but the upper bound also holds for $d = 1$, since in this case we have $t_{i+1}(X) + t_{i+1}(Z) \leq t_i(Z)$ due to Lemma 7 (point (2)). If we apply $d = \left\lceil\sqrt{t_i(Z)/(k+1)}\right\rceil$, we get
$t_{i+1}(X) + t_{i+1}(Z) \leq \left\lceil\sqrt{\tfrac{t_i(Z)}{k+1}}\right\rceil + \frac{t_i(Z)}{\left\lceil\sqrt{t_i(Z)/(k+1)}\right\rceil} + k\left(\left\lceil\sqrt{\tfrac{t_i(Z)}{k+1}}\right\rceil - 1\right) \leq \sqrt{\tfrac{t_i(Z)}{k+1}} + 1 + \frac{t_i(Z)}{\sqrt{t_i(Z)/(k+1)}} + k\sqrt{\tfrac{t_i(Z)}{k+1}} = (k+1)\sqrt{\tfrac{t_i(Z)}{k+1}} + t_i(Z)\cdot\sqrt{\tfrac{k+1}{t_i(Z)}} + 1 = 2\sqrt{t_i(Z)(k+1)} + 1.$
Together with $k = \#_i(Z)$ this proves the lemma. □
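The argument can be checked numerically. The sketch below (our own illustration with hypothetical block lengths, assuming at least one block of length at least two) computes the cost of the best rule $X \to Z^d$ for given maximal Z-block lengths and compares it with the bound of Lemma 8:

```python
import math

def best_unary_split(block_lengths):
    """Cost t_{i+1}(X) + t_{i+1}(Z) of the best rule X -> Z^d, given the lengths
    of the maximal Z-blocks, compared with the bound of Lemma 8."""
    t, k = sum(block_lengths), len(block_lengths)
    best = min(d + sum(l // d + l % d for l in block_lengths)
               for d in range(2, max(block_lengths) + 1))
    bound = 2 * math.sqrt(t * (k + 1)) + 1
    return best, bound

# Hypothetical block lengths; the best cost stays below the bound.
print(best_unary_split([1000, 37, 512]))
```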
The following lemma is essential for the proof of Theorem 1 since we bound the total number of occurrences of a unary nonterminal depending on its level.
Lemma 9.
Let $X \in N_i \cup \{X_0\}$ be a unary nonterminal with $\mathrm{level}_i(X) = m$. We have
$t_i(X) \leq 2^{2-2^{1-m}}\cdot n^{2^{-m}}\cdot\prod_{j=1}^{m}(m+2-j)^{2^{-j}}. \quad (1)$
Proof. 
We prove the lemma by induction on m = level i ( X ) and we start with m = 0 . The only SLP A i that contains a unary nonterminal X such that level i ( X ) = 0 is the initial SLP A 0 and the unary nonterminal is X = X 0 . Please note that the maximal string γ chosen by any global algorithm in the first round on input X 0 n trivially satisfies γ X 0 * and thus the two unary nonterminals of A 1 have level one. We have t 0 ( X 0 ) = n and this is exactly what we obtain when m = 0 is used on the right side of Equation (1) (the empty product is considered to be 1).
Now assume any unary nonterminal that has level m satisfies the claimed bound and we consider a unary nonterminal X such that $\mathrm{level}_i(X) = m + 1 > 0$ for some i. It follows from the definition that there is a leaf node at level $m+1$ in $T_i$ that is labeled with $(X, k)$ for some k. There are two cases that need to be distinguished: Either this leaf is a left child or a right child of its parent node. Assume that $(X, k)$ is the label of a right child and let $(Z, k+1)$ be the label of the left sibling of that node. To prove both cases simultaneously, we prove the upper bound for X and for Z, i.e., we use Z to cover the second case where the node is a left child. The parent node of $(Z, k+1)$ and $(X, k)$ is labeled with $(Z, k)$ (see Figure 1 on the left). Let $i' < i$ be the maximal index such that $\mathrm{level}_{i'}(Z) = m$, i.e., $X \to Z^d$ for $d \geq 2$ is the introduced rule in round $i'+1$ and $(Z, k)$ is the label of a leaf at level m in $T_{i'}$. By induction, we have $t_{i'}(Z) \leq 2^{2-2^{1-m}} n^{2^{-m}} \prod_{j=1}^{m}(m+2-j)^{2^{-j}}$. Now by Lemma 8, we have
$t_{i'+1}(X) + t_{i'+1}(Z) \leq 2\sqrt{t_{i'}(Z)\cdot(\#_{i'}(Z)+1)} + 1.$
Together with $\#_{i'}(Z) \leq m + 1$ (Lemma 6), this yields
$t_{i'+1}(X) + t_{i'+1}(Z) \leq 2\left(2^{2-2^{1-m}}\, n^{2^{-m}}\prod_{j=1}^{m}(m+2-j)^{2^{-j}}\right)^{\frac{1}{2}}(m+2)^{\frac{1}{2}} + 1 = 2^{2-2^{-m}}\, n^{2^{-m-1}}\prod_{j=1}^{m}(m+2-j)^{2^{-j-1}}\,(m+2)^{2^{-1}} + 1 = 2^{2-2^{-m}}\, n^{2^{-m-1}}\prod_{j=2}^{m+1}(m+2-j+1)^{2^{-j}}\,(m+2)^{2^{-1}} + 1 = 2^{2-2^{-m}}\, n^{2^{-m-1}}\prod_{j=1}^{m+1}(m+2-j+1)^{2^{-j}} + 1.$
Using the fact that $t_{i'+1}(X) \geq 2$ (there are at least two nonoverlapping occurrences of a maximal string) and $t_{i'+1}(Z) \geq 2$ (the new rule contains Z at least twice) yields the claimed upper bound on $t_{i'+1}(X)$ and $t_{i'+1}(Z)$: the final expression is the bound claimed for level $m+1$ plus one, and since each of the two terms is at least two, each single term is at most the claimed bound. Finally, this upper bound holds for all $i \geq i'+1$ due to Lemma 7 (point (1)). □
Corollary 2.
Let $X \in N_i \cup \{X_0\}$ be a unary nonterminal with $\mathrm{level}_i(X) = m$. We have
$t_i(X) \leq 4\, n^{2^{-m}}\,(m+2).$
Proof. 
Please note that $2^{2-2^{1-m}} \leq 4$ for all $m \geq 0$. We upper-bound the right side of Equation (1) (Lemma 9) as follows:
$2^{2-2^{1-m}}\, n^{2^{-m}}\prod_{j=1}^{m}(m+2-j)^{2^{-j}} \leq 4\, n^{2^{-m}}\prod_{j=1}^{m}(m+2)^{2^{-j}} = 4\, n^{2^{-m}}\,(m+2)^{\sum_{j=1}^{m}2^{-j}} \leq 4\, n^{2^{-m}}\,(m+2). \qquad \square$
What we achieved so far is to bound the total size t i ( X ) that a unary nonterminal X contributes on right-hand sides of the rules depending on level i ( X ) . Next, we bound the size that nonunary nonterminals contribute to | A i | depending on the levels of all unary nonterminals. To do so, we need the following definitions. Let R i ( X ) be the number of distinct right neighbors of X (which are not equal to X) on right-hand sides plus the number of occurrences of X as the last symbol of a right-hand side in A i , i.e.,
$R_i(X) = |\{A \in N_i \cup \{X_0\} \mid A \neq X,\ XA \text{ occurs on a right-hand side in } \mathbb{A}_i\}| + |\{(Y \to vX) \in P_i \mid Y \in N_i,\ v \in (N_i \cup \{X_0\})^*\}|.$
Let L i ( X ) be the number of distinct left neighbors of X (which are not equal to X) on right-hand sides plus the number of occurrences of X as the first symbol of a right-hand side in A i , i.e.,
$L_i(X) = |\{A \in N_i \cup \{X_0\} \mid A \neq X,\ AX \text{ occurs on a right-hand side in } \mathbb{A}_i\}| + |\{(Y \to Xv) \in P_i \mid Y \in N_i,\ v \in (N_i \cup \{X_0\})^*\}|.$
Also, let $f_i(X) = \#_i(X) - R_i(X)$ and $g_i(X) = \#_i(X) - L_i(X)$. Please note that $R_i(X) \leq \#_i(X)$ and $L_i(X) \leq \#_i(X)$, since for each right-hand side of a rule there is at most one right (respectively, left) neighbor $A \neq X$ for some occurrence of X due to Lemma 2, and each right-hand side can contain X at most once as the last (respectively, first) symbol. Also, $\#_i(X) = R_i(X)$ means that all maximal X-blocks on right-hand sides are either at the end of the right-hand side or are followed by a distinct symbol. Similarly, $\#_i(X) = L_i(X)$ means that all maximal X-blocks on right-hand sides are either at the beginning of the right-hand side or are preceded by a distinct symbol. The following lemmas describe how the functions $f_i(X)$ and $g_i(X)$ evolve.
Lemma 10.
If $X \in N_i \cup \{X_0\}$ then $f_{i+1}(X) \leq f_i(X)$. If a nonunary, maximal string $\gamma = Xv$ is selected in round $i+1$ for some $v \in (N_i \cup \{X_0\})^+$, then $f_{i+1}(X) < f_i(X)$.
Proof. 
Let Y γ be the introduced rule in round i + 1 . If X does not occur in γ , then it is straightforward to see that f i + 1 ( X ) f i ( X ) since # i + 1 ( X ) = # i ( X ) and R i + 1 ( X ) R i ( X ) . The new nonterminal Y could be a new right neighbor for some occurrences of X, but all occurrences of X which have this new right neighbor Y in A i + 1 shared the same right neighbor in A i (the first symbol of γ ).
If otherwise X occurs in γ , then first assume that γ = X d for some d 2 . Please note that replacing an occurrence of γ on the right-hand side of a rule ( Z u ) P i either removes all occurrences of X on this right-hand side (in case u contains a maximal X-block of length k · d for some integer k 1 ) or the right neighbor of the maximal X-block in u does not change in the modified rule ( Z u ) P i + 1 since occurrences of γ are replaced from left to right. It follows that the only way to obtain R i + 1 ( X ) < R i ( X ) is to remove all occurrences of X on a right-hand side, but then # i ( X ) decreases by the same value. Additionally, the new rule Y X d adds a new right-hand side to # i + 1 ( X ) , but since X is the last symbol on this right-hand side it follows that R i + 1 ( X ) is incremented as well. Together this yields f i + 1 ( X ) f i ( X ) in this case.
The case remains where γ is nonunary and X occurs in γ . Here, we have # i + 1 ( X ) # i ( X ) due to Lemma 4 and γ occurs at most once on each right-hand side due to Lemma 2. However, again, the only way to reduce R i + 1 ( X ) compared to R i ( X ) is to remove all occurrences of X on a right-hand side, but then again # i + 1 ( X ) decreases by the same value. This yields f i + 1 ( X ) f i ( X ) .
Assume now that a nonunary, maximal string γ = X v for some v ( N i { X 0 } ) + is selected in round i + 1 . We show that f i + 1 ( X ) < f i ( X ) . If X only occurs in the new rule in A i + 1 after occurrences of γ are replaced on all right-hand sides in A i , i.e., all modified rules do not contain X anymore, then # i + 1 ( X ) < # i ( X ) because at least two right-hand sides do not contain X as a factor anymore while only the new rule Y γ adds a new right-hand side which contains X to # i + 1 ( X ) . Also, we have R i + 1 ( X ) = R i ( X ) in this case and thus f i + 1 ( X ) < f i ( X ) because all rules where the maximal X-block is removed shared the same right neighbor due to the fact that γ = X v is the chosen maximal string. However, X v still occurs in the new rule of A i + 1 and thus R i + 1 ( X ) = R i ( X ) . If otherwise at least one of the modified rules still contains X on the right-hand side, then X Y is a factor of this right-hand side in A i + 1 after the replacement of γ . It follows that R i + 1 ( X ) > R i ( X ) in this case and thus f i + 1 ( X ) < f i ( X ) because each distinct right neighbor of X in A i is still a right neighbor of X in A i + 1 as argued above, but additionally X Y is new since Y is a new nonterminal. □
The same result does not hold for $g_i(X)$. In particular, $g_{i+1}(X) > g_i(X)$ is possible when a rule $Y \to X^d$ is introduced in round $i+1$ for some $d \geq 2$, due to the assumption that global algorithms replace occurrences of the maximal string from left to right. For example, assume that $AX^4$, $BX^7$ and $CX^{10}$ are the maximal X-blocks on right-hand sides of $\mathbb{A}_i$, including distinct left neighbors for each X-block (A, B and C). Therefore, we have $\#_i(X) = 3$, $L_i(X) = 3$ and thus $g_i(X) = 0$ in this example. If now a rule $Y \to X^3$ is introduced, then this yields $AYX$, $BY^2X$ and $CY^3X$ after replacing $X^3$. Hence we have $\#_{i+1}(X) = 4$, $L_{i+1}(X) = 2$ and thus $g_{i+1}(X) = 2$. We show in the following lemma that this is the only case where $g_{i+1}(X) > g_i(X)$ occurs.
Lemma 11.
Let $X \in N_i \cup \{X_0\}$. If a rule $Y \to \gamma$ is introduced in round $i+1$ such that $\gamma \notin X^+$, then $g_{i+1}(X) \leq g_i(X)$. If a nonunary, maximal string $\gamma = vX$ is selected in round $i+1$ for some $v \in (N_i \cup \{X_0\})^+$, then $g_{i+1}(X) < g_i(X)$.
Proof. 
The arguments are similar to the corresponding cases in Lemma 10. Let Y γ be the introduced rule in round i + 1 . If X does not occur in γ , then g i + 1 ( X ) g i ( X ) since # i + 1 ( X ) = # i ( X ) and L i + 1 ( X ) L i ( X ) . The new nonterminal Y could be a new left neighbor for some occurrences of X in A i + 1 , but all these occurrences of X shared the same left neighbor in A i (the last symbol of γ ).
If otherwise γ is nonunary and contains X, then we have # i + 1 ( X ) # i ( X ) due to Lemma 4 and γ occurs at most once on each right-hand side due to Lemma 2. The only way to obtain L i + 1 ( X ) < L i ( X ) is again to remove all occurrences of X on a right-hand side, but then # i + 1 ( X ) decreases by the same value. Please note that due to the assumption that γ is nonunary, it is not possible to modify two (or more) rules such that the maximal X-blocks have different left neighbors in A i and after the replacement these X-blocks share the same left neighbor in A i + 1 . This yields g i + 1 ( X ) g i ( X ) .
Assume now that a nonunary, maximal string γ = v X is selected in round i + 1 for some word v ( N i { X 0 } ) + . We show that g i + 1 ( X ) < g i ( X ) . If X only occurs in the new rule in A i + 1 , i.e., X does not occur in the modified rules, then we have # i + 1 ( X ) < # i ( X ) because at least two right-hand sides do not contain X as a factor anymore while only the new rule Y γ adds a new right-hand side which contains X to # i + 1 ( X ) . Moreover, we have L i + 1 ( X ) = L i ( X ) in this case because all rules where the maximal X-block is removed shared the same left neighbor since γ = v X is the selected nonunary, maximal string. However, v X still occurs on the right-hand side of the new rule of A i + 1 . It follows that g i + 1 ( X ) < g i ( X ) . If otherwise at least one of the modified rules still contains X on the right-hand side, then Y X is a factor of this right-hand side in A i + 1 after the replacement of γ . It follows that L i + 1 ( X ) > L i ( X ) and thus g i + 1 ( X ) < g i ( X ) because each distinct left neighbor of X in A i is still a left neighbor of X in A i + 1 for some occurrence of X. Additionally, Y is a new left neighbor. □
In the following proof, we use the notation
$U_i = \{X \in N_i \cup \{X_0\} \mid X \text{ is a unary nonterminal}\}$
for all unary nonterminals that appear in A i and M i = N i \ U i for all nonunary nonterminals that appear in A i .
Lemma 12.
We have
$\sum_{X \in M_i} t_i(X) \leq \sum_{X \in U_i}\left(\mathrm{level}_i(X)+1\right)\cdot\left(\mathrm{level}_i(X)+1+\left(\mathrm{level}_i(X)+1\right)^2\right).$
Proof. 
Let s ( i ) = X M i t i ( X ) be the total size that all nonunary nonterminals contribute to the size of A i . We first bound the number of rounds where the function s increases, i.e., we bound | { j [ 0 , i 1 ] s ( j + 1 ) > s ( j ) } | . If a unary nonterminal is introduced in some round j + 1 , then s ( j + 1 ) = s ( j ) , i.e., we can ignore these rules. So, consider some round j + 1 where a nonunary nonterminal X is introduced and let X γ be the introduced rule. Let M = { Z N j Z occurs in γ } be the set of nonterminals that occur at least once in γ . We first show that if k : = | M j M | 2 , then s ( j + 1 ) s ( j ) . In other words, if two nonunary nonterminals occur in γ , then s ( j + 1 ) s ( j ) . Let r be the number of rules ( Z v ) P j such that γ is a factor of v. Recall that nonunary factors and nonterminals occur at most once on the right-hand side of a single rule (Lemma 3). We have s ( j + 1 ) s ( j ) = k + r k · r because the new nonunary nonterminal X occurs now on r right-hand sides, γ contains k nonunary nonterminals which occur exactly once in γ each, and the replacement of γ on right-hand sides deletes these k nonterminals on r right-hand sides. We have r 2 (due to the properties of a maximal string) which together with k 2 yields s ( j + 1 ) s ( j ) 0 . Hence we can assume that k = | M j M | 1 . The maximal string γ has length | γ | 2 and is not unary, so the first and the last symbol of γ are different and at least one of them is unary due to our assumption that at most one nonunary nonterminal occurs in γ . Let Y be this unary nonterminal and assume that Y is the first symbol, i.e., the nonunary, maximal string is γ = Y v for some (nonempty) v. Afterwards we discuss the case where Y is the last symbol, i.e., γ = v Y .
We bound the number of rounds where a nonunary, maximal string $\gamma = Yv$ is selected for some v. Let $j_0 \leq i$ be the round where the unary nonterminal Y has been introduced. We have
$f_{j_0}(Y) \leq \#_{j_0}(Y) \leq \mathrm{level}_{j_0}(Y) + 1 \leq \mathrm{level}_i(Y) + 1$
due to Lemma 6 and $\mathrm{level}_j(Y) \leq \mathrm{level}_i(Y)$ for all $j \leq i$. We also have $f_{j+1}(Y) \leq f_j(Y)$ for $j \in [j_0, i-1]$ by Lemma 10. In addition to that, if $Yv$ is the selected nonunary, maximal string in round $j+1$ for some v, then we have $f_{j+1}(Y) < f_j(Y)$, again by Lemma 10. It follows that after at most $\mathrm{level}_i(Y) + 1$ many rounds where the chosen maximal string is nonunary and has the form $Yv$ for some (nonempty) v, we have $f_i(Y) = 0$. In this case, all maximal Y-blocks have distinct right neighbors or occur at the end of a right-hand side. Hence there is no possibility to select a nonunary, maximal string $Yv$ anymore.
Now we similarly bound the number of rounds such that a nonunary, maximal string $\gamma = vY$ is selected for some v. However, care must be taken in this case, because it is possible that $g_{j+1}(Y) > g_j(Y)$ when a rule $X \to Y^d$ for $d \geq 2$ is introduced in round $j+1$ as explained above. Fortunately, rules of this form (the selected maximal string is from $Y^+$) are introduced at most $\mathrm{level}_i(Y)$ many times up to round i by the definition of $\mathrm{level}_i(Y)$. Let $j_0 \leq i$ be the round where the unary nonterminal Y has been introduced. We have
$g_j(Y) \leq \#_j(Y) \leq \mathrm{level}_j(Y) + 1 \leq \mathrm{level}_i(Y) + 1$
for each j [ j 0 , i ] due to Lemma 6 and level j ( Y ) level i ( Y ) for all j i . Furthermore, if the selected maximal string in round j + 1 ( j j 0 ) is not from Y + , we have g j + 1 ( Y ) g j ( Y ) due to Lemma 11. Moreover, if the maximal string γ is nonunary and γ = v Y for some (nonempty) v, then g j + 1 ( Y ) < g j ( Y ) . It follows that between two rounds where maximal strings from Y + are selected, there are at most level i ( Y ) + 1 many rounds where a nonunary, maximal string of the form v Y is chosen because then g j ( Y ) = 0 is reached (for some j) and thus all maximal Y blocks have distinct left neighbors or occur at the beginning of a right-hand side. Hence no nonunary string of the form v Y for some v occurs twice on right-hand sides. Since maximal strings from Y + are chosen at most level i ( Y ) many times up to round i, it follows that the number of rounds where a nonunary, maximal string of the form v Y for some v is selected is at most ( level i ( Y ) + 1 ) 2 .
Furthermore, the maximal increase $\max\{s(j+1) - s(j) \mid j \in [0, i-1]\}$ in a single round is at most $\mathrm{level}_i(Y) + 1$, because the new nonunary nonterminal occurs in $\mathbb{A}_{j+1}$ on at most $\#_j(Y) \leq \mathrm{level}_j(Y) + 1 \leq \mathrm{level}_i(Y) + 1$ many right-hand sides of rules for any $j \leq i$, and the total number of occurrences of all other (nonunary) nonterminals does not increase (Lemma 7, point (1)).
We conclude that for each unary nonterminal Y, at most
$\mathrm{level}_i(Y) + 1 + (\mathrm{level}_i(Y) + 1)^2$
many rules are introduced such that the nonunary, maximal string γ satisfies γ = Y v or γ = v Y for some v and each of those rules increases the total size that nonunary nonterminals contribute by at most level i ( Y ) + 1 . In all other cases, we showed that the size that nonunary nonterminals contribute does not increase. □
Now we are able to prove Proposition 1.
Proof of Proposition 1.
Let $\mathbb{A}_f = \mathrm{Greedy}(X_0^n)$ be the final SLP obtained by Greedy, i.e., after f rounds the algorithm stops because $\mathbb{A}_f$ has no maximal string. First, we want to bound the level of unary nonterminals occurring in $\mathbb{A}_f$. Assume there is a unary nonterminal X such that $\mathrm{level}_i(X) = \lceil\log\log n\rceil$ after some round $i \leq f$ of the algorithm. By Corollary 2, we have
$t_i(X) \leq 4\, n^{2^{-\lceil\log\log n\rceil}}\,(\log\log n + 3) \leq 8\,(\log\log n + 3).$
Consider the unique leaf node v X in the tree T i which has level log log n and label ( X , k ) for some k. If in some round j [ i + 1 , f ] two children with labels ( X , k + 1 ) and ( Y , k ) are attached to v X , i.e., the introduced rule in round j is Y X d for some d 2 , then we have t j ( X ) + t j ( Y ) t j 1 ( X ) t i ( X ) by Lemma 7. To be more specific, if the length of the chosen maximal string X d is exactly d = 2 and this maximal string X X occurs exactly twice without overlap in A j 1 , then we have t j ( X ) + t j ( Y ) = t j 1 ( X ) (and | A j | = | A j 1 | ). Please note that in this case, there does not exist a maximal string X d or Y d of A k for all k [ j , f ] (since Y occurs only twice and X X does not occur on right-hand sides of A j ), i.e., the children of the node v X in T k are leaves for k [ j , f ] . Otherwise, if the maximal string has length d 3 or occurs at least three times without overlap, then we have t j ( X ) + t j ( Y ) < t j 1 ( X ) since | A j | < | A j 1 | holds in this setting. This means that when a new branch occurs in the tree T j for some j, then the new children of the branching node are either leaves of the final tree T f or the corresponding nonterminals contribute strictly less to the size of the current SLP than the nonterminal which corresponds to the parent node did before the branch. We can iterate this argument for the children of the children of v X and so on, i.e., if we consider the subtree rooted at v X in T f , then from level to level the size that the nonterminals contribute decreases until only leaves occur at some level. Since t i ( X ) 8 ( log log n + 3 ) , it follows that the subtree of T f rooted at v X has depth at most 8 ( log log n + 3 ) + 1 and thus the maximal level of any unary nonterminal in A f is bounded by
$\lceil\log\log n\rceil + 8\,(\log\log n + 3) + 1 \leq 9\log\log n + 26.$
Consequently, the number of unary nonterminals (the number of leaves of $T_f$) is bounded by $\mathcal{O}((\log n)^9)$ since $T_f$ is a binary tree of depth at most $9\log\log n + 26$. Furthermore, each unary nonterminal X in $\mathbb{A}_f$ satisfies $t_f(X) \in \mathcal{O}(\log\log n)$: If there is a round $i \leq f$ such that $\mathrm{level}_i(X) = \lceil\log\log n\rceil$, we obtain $t_f(X) \leq t_i(X) \leq 8(\log\log n + 3)$ by Corollary 2 and Lemma 7, point (1). Otherwise, let $m = \mathrm{level}_f(X) < \lceil\log\log n\rceil$. Then $t_f(X) \leq m + 3 \leq \log\log n + 3$, because there is at most one nonoverlapping occurrence of XX on right-hand sides of $\mathbb{A}_f$ (otherwise there would exist a maximal string of $\mathbb{A}_f$) and the number of rules where X occurs on the right-hand side is $\#_f(X) \leq m + 1$ by Lemma 6. To be more precise, a single right-hand side of $\mathbb{A}_f$ could have a maximal X-block of length 3 and all other right-hand sides must have at most one occurrence of X, since two different right-hand sides with X-blocks of length at least 2, as well as one right-hand side with an X-block of length at least 4, would contradict the fact that $\mathbb{A}_f$ has no maximal string. It follows that the size which unary nonterminals contribute to $\mathbb{A}_f$ is $\mathcal{O}((\log n)^9\cdot\log\log n)$. By Lemma 12, we can bound the size that nonunary nonterminals contribute by $\mathcal{O}((\log n)^9\cdot(\log\log n)^3)$, since there are at most $\mathcal{O}((\log n)^9)$ many unary nonterminals and each has level at most $\mathcal{O}(\log\log n)$ as argued above. It follows that $|\mathbb{A}_f| \in \mathcal{O}((\log n)^9\cdot(\log\log n)^3)$, which proves the proposition. □
The following theorem follows directly from Proposition 1 and Lemma 1, where $g(w) \in \Omega(\log n)$ is shown for words w of length n.
Theorem 1.
For all n, we have
$\alpha_{\mathrm{Greedy}}(1,n) \in \mathcal{O}((\log n)^8\cdot(\log\log n)^3).$

4. Lower Bound for Greedy

We proceed with the lower bound on the approximation ratio of Greedy. The best lower bound [13] (Theorem 11) that was known previously was
$\alpha_{\mathrm{Greedy}}(k,n) \geq \frac{5}{3\log_3(5)} = 1.13767699\ldots$
for all $k \geq 1$ and infinitely many n. This bound was shown using unary words, which we will also use in our proof for the improved lower bound. A key concept for doing this is the sequence $x_n$ described in the following lemma from [19]:
Lemma 13
([19] (Example 2.2)). Let $x_{n+1} = x_n^2 + 1$ with $x_0 = 1$ and
$\beta = \exp\left(\sum_{i=1}^{\infty}\frac{1}{2^{i}}\,\log\left(1+\frac{1}{x_{i-1}^{2}}\right)\right).$
We have $x_n = \lfloor\beta^{2^n}\rfloor$.
In this work, we use the shifted sequence $y_n = x_{n+1}$, i.e., we start with $y_0 = 2$. It follows that $y_n = \lfloor\gamma^{2^n}\rfloor$, where $\gamma = \beta^2 = 2.25851845\ldots$. Additionally, we need the following lemma:
Lemma 14.
Let $m \geq 1$ be an integer. Let $f_m : \mathbb{R}_{>0} \to \mathbb{R}$ with
$f_m(x) = x + \frac{m^2+1}{x}.$
We have $f_m(x) > 2m$ for all $x > 0$.
Proof. 
The unique minimum of $f_m(x)$ is $2\sqrt{m^2+1}$, attained at $x = \sqrt{m^2+1}$. It follows that $f_m(x) \geq 2\sqrt{m^2+1} > 2\sqrt{m^2} = 2m$. □
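Before the proof, it may help to see the sequence of Lemma 13 numerically. The following sketch (our own illustration) computes $y_0, \ldots, y_5$ and estimates $\gamma$ from the last element:

```python
def y_sequence(k):
    """The shifted sequence from Lemma 13: y_0 = 2 and y_{i+1} = y_i^2 + 1."""
    ys = [2]
    for _ in range(k):
        ys.append(ys[-1] ** 2 + 1)
    return ys

ys = y_sequence(5)
print(ys)                          # [2, 5, 26, 677, 458330, 210066388901]
print(ys[-1] ** (1 / 2 ** 5))      # ~2.2585, the constant gamma = beta^2
```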
Now we can prove the new lower bound for Greedy:
Theorem 2.
For all $k \geq 1$ and infinitely many n, we have
$\alpha_{\mathrm{Greedy}}(k,n) \geq \frac{1}{\log_3(\gamma)} = 1.34847194\ldots$
Proof. 
Let $\Sigma = \{a\}$ be a unary alphabet. We define $w_k = a^{y_k}$. By Lemma 13, we have $|w_k| \leq \gamma^{2^k}$. Applying the construction from [13] used to prove the lower bound for Greedy (see Section 1) yields
$g(w_k) \leq 3\cdot\log_3(\gamma)\cdot 2^k + o(2^k).$
In the remaining proof we show that on input $w_k$, Greedy produces an SLP of size $3\cdot 2^k - 1$, which directly implies $\alpha_{\mathrm{Greedy}}(1,n) \geq 3/(3\log_3(\gamma))$. We start with the SLP $\mathbb{A}_0$ which has the single rule $S \to a^{y_k}$. Consider now the first round of the algorithm, i.e., we need to find a maximal string $a^x$ of $\mathbb{A}_0$ such that the grammar $\mathbb{A}_1$ with rules
$X_1 \to a^{x}, \qquad S \to X_1^{\,y_k \operatorname{div} x}\, a^{\,y_k \bmod x}$
has minimal size. We have $|\mathbb{A}_1| = x + (y_k \operatorname{div} x) + (y_k \bmod x) \geq x + y_k/x$. By the definition of $y_k$ we have $|\mathbb{A}_1| \geq x + (y_{k-1}^2 + 1)/x$. Applying Lemma 14 yields $|\mathbb{A}_1| \geq 2y_{k-1} + 1$. Please note that for $x = y_{k-1}$ this minimum is achieved, i.e., we can assume that Greedy selects the maximal string $a^{y_{k-1}}$ and $\mathbb{A}_1$ is
$X_1 \to a^{y_{k-1}}, \qquad S \to X_1^{\,y_{k-1}}\, a.$
Each maximal string of $\mathbb{A}_1$ is either a unary word over $X_1$ or a unary word over a, i.e., we can analyze the behavior of Greedy on both rules independently. The rule $X_1 \to a^{y_{k-1}}$ is obviously treated similarly to the initial SLP $\mathbb{A}_0$, so we continue with analyzing $S \to X_1^{\,y_{k-1}} a$. However, again, the same arguments as above show that Greedy introduces a rule $X_3 \to X_1^{\,y_{k-2}}$, which yields $S \to X_3^{\,y_{k-2}} X_1 a$ as the new start rule. This process can be iterated using the same arguments for the leading unary strings of length $y_i$ for some $i \in [1, k]$.
The reader might think of this process as a binary tree, where each node is labeled with a rule (the root is labeled with $S \to a^{y_k}$) and the children of a node are the two rules obtained by Greedy when the rule has been processed. We assume that the left child represents the rule for the chosen maximal string and the right child represents the parent rule where all occurrences of the maximal string are replaced by the new nonterminal. In Figure 2 this binary tree is depicted for the steps we discussed above.
Please note that when a rule is processed, the longest common factor of the two new rules has length 1 (the remainder). More generally, after each round there is no word of length at least two that occurs as a factor in two different rules, since a possibly shared remainder has length 1 and otherwise only new nonterminals are introduced. It follows that we can iterate this process independently for each rule until no maximal string occurs. This is the case when each rule starts with a unary string of length $y_0 = 2$ or, in terms of the interpretation as a binary tree, when a full binary tree of height k is produced. Each right branch occurring in this tree adds a new remainder to those remainders that already occur in the parent rule and a left branch introduces a new (smaller) instance of the start problem. We show by induction that at level $i \in [0, k]$ of this full binary tree of height k, there is one rule of size $y_{k-i} + i$ and $2^{i-j-1}$ many rules of size $y_{k-i} + j$ for $j \in [0, i-1]$. At level 0, this is true since there is only a single rule of size $y_k + 0$. Assuming that our claim is true at level $i < k$, we derive from each rule at level i two new rules at level $i+1$: A right branch yields a rule that starts with a leading unary string of size $y_{k-i-1}$ and adds a new remainder to the parent rule. A left branch yields a rule that contains only a unary string of size $y_{k-i-1}$. If we first consider the left branches, we derive that each of the $2^i$ many rules at level i adds a rule of size $y_{k-i-1}$ at level $i+1$. For the right branches, the single rule of size $y_{k-i} + i$ at level i yields a rule of size $y_{k-i-1} + i + 1$ at level $i+1$. Furthermore, each of the $2^{i-j-1}$ many rules of size $y_{k-i} + j$ ($j \in [0, i-1]$) yields a rule of size $y_{k-i-1} + j + 1$. When we put everything together, we get that at level $i+1$ there is a single rule of size $y_{k-i-1} + i + 1$ and $2^{i-j}$ many rules of size $y_{k-i-1} + j$ for $j \in [0, i]$. That finishes the induction. It follows that the final SLP (which consists of the rules at level k) has a single rule of size $y_0 + k = 2 + k$ and $2^{k-j-1}$ many rules of size $2 + j$ for $j = 0, \ldots, k-1$. This gives a total size of
$$2 + k + \sum_{j=0}^{k-1} 2^{k-j-1} (2 + j) = 2 + k + 2^k \sum_{j=0}^{k-1} 2^{-j} + 2^k \sum_{j=0}^{k-1} j \, 2^{-j-1} = 2 + k + 2^k \bigl( 2 - 2^{-k+1} \bigr) + 2^k \bigl( 1 - (k+1) 2^{-k} \bigr) = 2 + k + 2^{k+1} - 2 + 2^k - k - 1 = 2^{k+1} + 2^k - 1 = 3 \cdot 2^k - 1 .$$
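The counting argument can be replayed mechanically. The following Python sketch (again only an illustration, under the same assumption on the sequence $y_i$) represents every rule by the index of its leading unary run together with the number of remainder symbols it has accumulated, splits rules exactly as described above, and confirms that the resulting SLP for $a^{y_k}$ has size $3 \cdot 2^k - 1$.

```python
# Illustration only: total size of the SLP produced by the process above on a^{y_k}.

def final_size(i: int, extras: int) -> int:
    # A rule whose leading unary run has length y_i and that already carries
    # `extras` remainder symbols.
    if i == 0:                             # y_0 = 2: no maximal string is left
        return 2 + extras
    left = final_size(i - 1, 0)            # new rule for the chosen maximal string
    right = final_size(i - 1, extras + 1)  # old rule: shortened run plus one remainder
    return left + right

for k in range(1, 12):
    assert final_size(k, 0) == 3 * 2 ** k - 1
print("final sizes match 3 * 2^k - 1 for k = 1, ..., 11")
```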

5. RePair and LongestMatch

In this section, we analyze the global grammar-based compressors RePair and LongestMatch. In each round $i$, RePair selects a most frequent maximal string of $\mathcal{A}_{i-1}$ and LongestMatch selects a longest maximal string of $\mathcal{A}_{i-1}$.
We abbreviate the approximation ratio $\alpha_{\mathrm{LongestMatch}}$ by $\alpha_{\mathrm{LM}}$ for better readability. We first show that RePair and LongestMatch produce SLPs of equal size on unary inputs $a^n$, and we determine the exact size of these SLPs as a function of $n$. We then use this information to obtain our results for $\alpha_{\mathrm{RePair}}(1,n)$ and $\alpha_{\mathrm{LM}}(1,n)$, respectively. Fix an integer $n \ge 2$ and consider the binary representation
$$n = \sum_{i=0}^{\lfloor \log_2 n \rfloor} b_i \cdot 2^i \qquad (2)$$
of $n$, where $b_i \in \{0, 1\}$ for $i \in [0, \lfloor \log_2 n \rfloor]$. We denote by $\nu(n)$ the number of 1's in the binary representation of $n$, i.e.,
$$\nu(n) = \sum_{i=0}^{\lfloor \log_2 n \rfloor} b_i .$$
For example, we have $11 = 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 1 \cdot 2^0$ and thus $b_0 = b_1 = b_3 = 1$, $b_2 = 0$ and $\nu(11) = 3$.
Proposition 2.
For $n \ge 2$, let $\mathcal{A}$ be the SLP produced by RePair on input $a^n$ and $\mathcal{B}$ be the SLP produced by LongestMatch on input $a^n$. We have
$$|\mathcal{A}| = |\mathcal{B}| = 2 \lfloor \log_2 n \rfloor + \nu(n) - 1 .$$
Proof. 
If $n = 2$ or $n = 3$ then $a^n$ has no maximal string and thus the final SLP of any global algorithm has a single rule $S \to a^n$. The reader can easily verify the claimed result for these cases.
We now assume that $n \ge 4$. Let $m = \lfloor \log_2 n \rfloor - 1$. We first prove the claim for RePair and then for LongestMatch. On input $a^n$, RePair runs for exactly $m$ rounds and creates the rules $X_1 \to aa$ and $X_i \to X_{i-1} X_{i-1}$ for $i \in [2, m]$, i.e., the nonterminal $X_i$ produces the string $a^{2^i}$. These rules have a total size of $2m$. After these steps, the start rule is
$$S \to X_m X_m X_m^{b_m} X_{m-1}^{b_{m-1}} \cdots X_1^{b_1} a^{b_0} ,$$
where the $b_i$'s are the coefficients occurring in the binary representation of $n$, see equation (2). In other words, the symbol $a$ occurs in the start rule if and only if the least significant bit is $b_0 = 1$, and the nonterminal $X_i$ ($i \in [1, m-1]$) occurs in the start rule if and only if $b_i = 1$. Since RePair only replaces words with at least two occurrences, the most significant bit $b_{m+1} = 1$ is represented by $X_m X_m$. A third $X_m$ occurs in the start rule if and only if $b_m = 1$. The size of the start rule is $2 + \sum_{i=0}^{m} b_i$. It follows that the total size of the SLP produced by RePair on input $a^n$ is $2m + 2 + \sum_{i=0}^{m} b_i$, which together with $m = \lfloor \log_2 n \rfloor - 1$ and $b_{\lfloor \log_2 n \rfloor} = 1$ (the most significant bit is always 1) yields the claimed size.
Now we prove the same result for LongestMatch. In the first round, the chosen maximal string is $a^{\lfloor n/2 \rfloor}$, which yields the rules $X_1 \to a^{\lfloor n/2 \rfloor}$ and $S \to X_1 X_1 a^{b_0}$, i.e., the symbol $a$ occurs in the start rule if and only if $n$ is odd and thus the least significant bit is $b_0 = 1$. Assuming that $n \ge 8$, this procedure is now repeated for the rule $X_1 \to a^{\lfloor n/2 \rfloor}$ (for $n < 8$ the SLP obtained after the first round has no maximal string and the algorithm stops). This yields $X_2 \to a^{\lfloor n/4 \rfloor}$, $X_1 \to X_2 X_2 a^{b_1}$ and $S \to X_1 X_1 a^{b_0}$ (note that $\lfloor \lfloor n/2 \rfloor / 2 \rfloor = \lfloor n/4 \rfloor$). After $m = \lfloor \log_2 n \rfloor - 1$ steps, the iteration of this process results in the final SLP with rules $S \to X_1 X_1 a^{b_0}$, $X_i \to X_{i+1} X_{i+1} a^{b_i}$ for $i \in [1, m-1]$ and $X_m \to a a a^{b_m}$. The size of this SLP is $2(m+1) + \sum_{i=0}^{m} b_i$, which directly implies the claimed result for LongestMatch. □
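Since only run lengths matter on unary inputs, both algorithms are easy to simulate. The following Python sketch is an illustration of the proof above (not an implementation of the general algorithms) and checks the formula of Proposition 2 for all $2 \le n < 2000$.

```python
# Illustration only: sizes of the SLPs produced by RePair and LongestMatch
# on a^n, computed from run lengths and checked against Proposition 2.

def floor_log2(n: int) -> int:
    return n.bit_length() - 1

def nu(n: int) -> int:
    """Number of 1s in the binary representation of n."""
    return bin(n).count("1")

def repair_unary_size(n: int) -> int:
    # Doubling rules X_1 -> aa, X_i -> X_{i-1} X_{i-1} for i in [2, m],
    # plus the start rule of size 2 + b_m + ... + b_0.
    if n in (2, 3):
        return n                      # S -> a^n has no maximal string
    m = floor_log2(n) - 1
    low_bits = sum((n >> i) & 1 for i in range(m + 1))
    return 2 * m + 2 + low_bits

def longest_match_unary_size(n: int) -> int:
    # Repeatedly halve the current unary run; an odd run leaves one remainder.
    size, cur = 0, n
    while cur >= 4:
        size += 2 + (cur & 1)         # rule of the form X -> Y Y a^{cur mod 2}
        cur //= 2
    return size + cur                 # last rule produces a^cur with cur in {2, 3}

for n in range(2, 2000):
    expected = 2 * floor_log2(n) + nu(n) - 1
    assert repair_unary_size(n) == expected == longest_match_unary_size(n)
print("Proposition 2 checked for 2 <= n < 2000")
```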
Using Proposition 2, we prove the matching bounds for $\alpha_{\mathrm{RePair}}(1,n)$ and $\alpha_{\mathrm{LM}}(1,n)$:
Theorem 3.
For all $n$, we have $\alpha_{\mathrm{RePair}}(1,n) = \alpha_{\mathrm{LM}}(1,n) \le \log_2(3)$.
Proof. 
As a consequence of Proposition 2, RePair and LongestMatch produce on input $a^n$ SLPs of size at most $3 \lfloor \log_2 n \rfloor$, since $\nu(n) - 1 \le \lfloor \log_2 n \rfloor$. By Lemma 1, we have $g(a^n) \ge 3 \log_3 n - 3$. The equality $\log_2 n / \log_3 n = \log_2(3)$ finishes the proof. □
Theorem 4.
For infinitely many $n$, we have $\alpha_{\mathrm{RePair}}(1,n) = \alpha_{\mathrm{LM}}(1,n) \ge \log_2(3) - o(1)$.
Proof. 
Let $w_k = a^{2^k - 1}$. We have $2^k - 1 = \sum_{i=0}^{k-1} 2^i$ and thus $\nu(2^k - 1) = k$. By Proposition 2, the size of the SLPs produced by RePair and LongestMatch is $3k - 3$. By Lemma 1, we have
$$g(w_k) \le 3 \log_3(2^k - 1) + o\bigl(\log(2^k - 1)\bigr) \le 3 \log_3(2) \cdot k + o(k) .$$
The equality $1/\log_3(2) = \log_2(3)$ finishes the proof. □
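The convergence behind Theorem 4 can be illustrated numerically. Any SLP of size $s$ generates a word of length at most $3^{s/3}$, so $3 \log_3 |w_k|$ is a lower bound on $g(w_k)$. The sketch below (an illustration only) uses this bound as a stand-in for $g(w_k)$; the printed quotient is an upper bound on the true ratio and tends to $\log_2(3) \approx 1.585$, and by the upper bound of Lemma 1 the true ratio has the same limit along this sequence.

```python
# Illustration only: numerical behavior of the ratio in Theorem 4.
# 3*log_3(m) lower-bounds the size of every SLP for a unary word of length m,
# so produced_size / (3*log_3(m)) upper-bounds the approximation ratio.

import math

print("log2(3) =", math.log2(3))
for k in (5, 10, 20, 50, 100):
    m = 2 ** k - 1
    produced_size = 3 * k - 3                # Proposition 2 with nu(m) = k
    g_lower_bound = 3 * math.log(m, 3)
    print(k, produced_size / g_lower_bound)  # tends to log2(3) from below
```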

6. Discussion

Although we improved the upper bound on the approximation ratio of Greedy, the lower bound remains quite weak. We might be able to improve it by finding a similar sequence such that Greedy produces larger remainders in each round. However, care must be taken, since for larger remainders it is no longer true that the rules can be analyzed independently, because the rules could share factors of length greater than one. Concerning the upper bound, we conjecture that Greedy achieves logarithmic compression for all unary inputs and thus that the approximation ratio is constant, but we have not been able to prove this so far. For arbitrary alphabets, a nonconstant lower bound for Greedy as well as an improvement of the upper bound of $O((n/\log n)^{2/3})$ for any global algorithm seem to be natural starting points for future work.

Author Contributions

Conceptualization, D.H.; methodology, D.H.; validation, D.H. and C.P.R.; formal analysis, D.H.; writing—original draft preparation, D.H.; writing—review and editing, C.P.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the DFG research project LO 748/10-2.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Figure 1. On the left, the general pattern that is applied during the construction of $T_i$ is illustrated, where the split refers to a rule $X \to Z^d$ that has been introduced by the global algorithm. On the right, the tree $T_3$ that corresponds to the introduced rules of Example 3 is shown.
Figure 2. Three rounds of Greedy on input $a^{y_k}$.