A major task in producing the RePair grammar is to count the frequencies of the most frequent bigrams. Our workhorse for this task is a frequency table. A frequency table in ${T}_{i}$ of length f stores pairs of the form $(\mathtt{bc},x)$, where bc is a bigram and x is the frequency of bc in ${T}_{i}$. It uses $f\lceil \lg({\sigma}_{i}^{2}{n}_{i}/2)\rceil$ bits of space since an entry stores a bigram consisting of two characters from ${\mathsf{\Sigma}}_{i}$ and its respective frequency, which can be at most ${n}_{i}/2$. Throughout this paper, we use an elementary in-place sorting algorithm like heapsort:
2.1. Trade-Off Computation
Using the frequency tables, we present a solution with a trade-off parameter:
Lemma 2. Given an integer d with $d\ge 1$, we can compute the frequencies of the d most frequent bigrams in a text of length n whose characters are drawn from an alphabet of size σ in $\mathcal{O}(\max(n,d)\cdot n\lg d/d)$ time using $2d\lceil \lg({\sigma}^{2}n/2)\rceil+\mathcal{O}(\lg n)$ bits.
Proof. Our idea is to partition the set of all bigrams appearing in T into $\lceil n/d\rceil$ subsets, compute the frequencies for each subset, and finally merge these frequencies. In detail, we partition the text $T={S}_{1}\cdots {S}_{\lceil n/d\rceil}$ into $\lceil n/d\rceil$ substrings such that each substring has length d (the last one has length at most d). Subsequently, we extend ${S}_{j}$ to the left (only if $j>1$) such that ${S}_{j}$ and ${S}_{j+1}$ overlap by one text position, for $1\le j<\lceil n/d\rceil$. By doing so, we take into account the bigram on the border of two adjacent substrings ${S}_{j}$ and ${S}_{j+1}$ for each $j<\lceil n/d\rceil$. Next, we create two frequency tables F and ${F}^{\prime}$, each of length d, for storing the frequencies of d bigrams. Both tables are initially empty. In what follows, we fill F such that, after processing ${S}_{i}$, F stores the d most frequent bigrams among those occurring in ${S}_{1},\dots ,{S}_{i}$, while ${F}^{\prime}$ acts as temporary space for storing candidate bigrams that can enter F.
With F and ${F}^{\prime}$, we process each of the $\lceil n/d\rceil$ substrings ${S}_{j}$ as follows: Let us fix an integer j with $1\le j\le \lceil n/d\rceil$. We first put all bigrams of ${S}_{j}$ into ${F}^{\prime}$ in lexicographic order. We can perform this within the space of ${F}^{\prime}$ in $\mathcal{O}(d\lg d)$ time since there are at most d different bigrams in ${S}_{j}$. We then compute the frequencies of all these bigrams in the complete text T in $\mathcal{O}(n\lg d)$ time by scanning the text from left to right while locating each read bigram in ${F}^{\prime}$ in $\mathcal{O}(\lg d)$ time with a binary search. Subsequently, we interpret F and ${F}^{\prime}$ as one large frequency table and sort it with respect to the frequencies while discarding duplicates such that F stores the d most frequent bigrams in $T[1..jd]$. This sorting step can be done in $\mathcal{O}(d\lg d)$ time. Finally, we clear ${F}^{\prime}$ and are done with ${S}_{j}$. After the final merge step, we obtain the d most frequent bigrams of T stored in F.
Since each of the $\mathcal{O}(n/d)$ merge steps takes $\mathcal{O}(d\lg d+n\lg d)$ time, we need $\mathcal{O}(\max(d,n)\cdot(n\lg d)/d)$ time in total. For $d\ge n$, we can build a large frequency table and perform one scan to count the frequencies of all bigrams in T. This scan and the final sorting with respect to the counted frequencies can be done in $\mathcal{O}(n\lg n)$ time. □
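To make the counting scheme of Lemma 2 concrete, the following Python sketch mirrors the proof's control flow: blocks of d positions (extended one position to the left so border bigrams are counted), a candidate table ${F}^{\prime}$ per block, and a merge into F keeping the d most frequent bigrams. It is illustrative only: Python dictionaries stand in for the succinct frequency tables, all identifiers are ours, and the subtlety of overlapping occurrences inside character runs is ignored.

```python
from collections import Counter

def top_d_bigram_frequencies(text, d):
    """Sketch of Lemma 2: return the d most frequent bigrams of text
    together with their frequencies, processing the text block-wise."""
    n = len(text)
    best = {}  # plays the role of F: at most d bigrams with frequencies
    for start in range(0, n, d):
        # block S_j, extended one position to the left (for j > 1) so
        # that adjacent blocks overlap and the border bigram is counted
        lo = max(start - 1, 0)
        block = text[lo:start + d]
        candidates = {block[p:p + 2] for p in range(len(block) - 1)}  # F'
        # count each candidate's frequency in the *whole* text T
        counts = Counter(text[p:p + 2] for p in range(n - 1)
                         if text[p:p + 2] in candidates)
        # merge F and F', keeping only the d most frequent bigrams
        best.update(counts)
        best = dict(sorted(best.items(), key=lambda kv: -kv[1])[:d])
    return best
```

For example, `top_d_bigram_frequencies("abracadabra", 3)` keeps exactly the three bigrams of frequency two.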
2.2. Algorithmic Ideas
With Lemma 2, we can compute ${T}_{m}$ in $\mathcal{O}(m{n}^{2}\lg d/d)$ time with additional $2d\lceil \lg({\sigma}_{m}^{2}n/2)\rceil$ bits of working space on top of the text for a parameter d with $1\le d\le n$. (The variable $\tau$ used in the abstract and in the introduction is interchangeable with ${\sigma}_{m}$, i.e., $\tau ={\sigma}_{m}$.) In the following, we show how this leads us to our first algorithm computing RePair:
Theorem 1. We can compute RePair on a string of length n in $\mathcal{O}\left({n}^{2}\right)$ time with $\max((n/c)\lg n,n\lceil \lg \tau \rceil)+\mathcal{O}(\lg n)$ bits of working space including the text space as a rewritable part of the working space, where $c\ge 1$ is a fixed constant and $\tau ={\sigma}_{m}$ is the sum of the alphabet size σ and the number of nonterminal symbols.
In our model, we assume that we can enlarge the text ${T}_{i}$ from ${n}_{i}\lceil \lg {\sigma}_{i}\rceil$ bits to ${n}_{i}\lceil \lg {\sigma}_{i+1}\rceil$ bits without extra memory. Our main idea is to store a growing frequency table in the space freed up by replacing bigrams with nonterminals. In detail, we maintain a frequency table F in ${T}_{i}$ of length ${f}_{k}$ for a growing variable ${f}_{k}$, which is set to ${f}_{0}:=\mathcal{O}\left(1\right)$ in the beginning. The table F takes ${f}_{k}\lceil \lg({\sigma}_{i}^{2}n/2)\rceil$ bits, which is $\mathcal{O}(\lg({\sigma}^{2}n))=\mathcal{O}(\lg n)$ bits for $k=0$. When we query it for a most frequent bigram, we linearly scan F in $\mathcal{O}\left({f}_{k}\right)=\mathcal{O}\left(n\right)$ time, which is not a problem since (a) the number of queries is $m\le n$ and (b) we aim for $\mathcal{O}\left({n}^{2}\right)$ overall running time. A consequence is that there is no need to sort the bigrams in F according to their frequencies, which simplifies the following discussion.
Frequency table F: With Lemma 2, we can compute F in $\mathcal{O}(n\max(n,{f}_{k})\lg {f}_{k}/{f}_{k})$ time. Instead of recomputing F in every turn i, we want to recompute it only when it no longer stores a most frequent bigram. However, it is not obvious when this happens, as replacing a most frequent bigram during a turn (a) removes this entry from F and (b) can reduce the frequencies of other bigrams in F, possibly making them less frequent than bigrams not tracked by F. Hence, the variable i for the ith turn (creating the ith nonterminal) and the variable k for recomputing the frequency table F for the $(k+1)$st time are only loosely connected. We group together all turns with the same ${f}_{k}$ and call this group the kth round of the algorithm. At the beginning of each round, we enlarge ${f}_{k}$ and create a new F with a capacity for ${f}_{k}$ bigrams. Since a recomputation of F takes much time, we want to end a round only when F is no longer useful, i.e., when we can no longer guarantee that F stores a most frequent bigram. To achieve our claimed time bounds, we want to assign all m turns to $\mathcal{O}(\lg n)$ different rounds, which can only be done if ${f}_{k}$ grows sufficiently fast.
Algorithm outline: At the beginning of the kth round and the ith turn, we compute the frequency table F storing ${f}_{k}$ bigrams and additionally keep the lowest frequency of F as a threshold ${t}_{k}$, which is treated as a constant during this round. During the ith turn, we replace the most frequent bigram (say, $\mathtt{bc}\in {\mathsf{\Sigma}}_{i}^{2}$) in the text ${T}_{i}$ with a nonterminal ${X}_{i+1}$ to produce ${T}_{i+1}$. Thereafter, we remove bc from F, update those frequencies in F that were decreased by the replacement of bc with ${X}_{i+1}$, and add each bigram containing the new character ${X}_{i+1}$ to F if its frequency is at least ${t}_{k}$. Whenever a frequency in F drops below ${t}_{k}$, we discard it. If F becomes empty, we move to the $(k+1)$st round and create a new F for storing ${f}_{k+1}$ frequencies. Otherwise (F still stores an entry), we can be sure that F stores a most frequent bigram. In both cases, we recurse with the $(i+1)$st turn by selecting the bigram with the highest frequency stored in F. We show in Algorithm 1 the pseudocode of this outlined algorithm. In the following, we describe how we update F and derive a lower bound on how large ${f}_{k+1}$ can become.
Algorithm 1: Algorithmic outline of our proposed algorithm working on a text T with a growing frequency table F. The constants ${\alpha}_{i}$ and ${\beta}_{i}$ are explained in Section 2.3. The same section shows that the outer while loop is executed $\mathcal{O}(\lg n)$ times.
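The outlined round/turn structure can also be paraphrased in plain Python. The sketch below is not Algorithm 1 verbatim: `naive_top_d` is a stand-in for Lemma 2, the capacity-growth rule `grow` is an arbitrary placeholder for the analysis of Section 2.3, and for brevity we recount frequencies after each turn instead of updating F in place as the algorithm does.

```python
from collections import Counter

def naive_top_d(text, d):
    """Stand-in for Lemma 2: the d most frequent bigrams (naive count)."""
    counts = Counter(text[p:p + 2] for p in range(len(text) - 1))
    return dict(counts.most_common(d))

def repair_rounds(text, f0=1, grow=lambda f: 2 * f):
    """Round/turn skeleton: a threshold t_k per round, and a new round
    with a larger capacity whenever F becomes empty."""
    rules = []                     # rules[i] = right-hand side of X_{i+1}
    f_k = f0
    while True:
        F = {b: x for b, x in naive_top_d(text, f_k).items() if x >= 2}
        if not F:                  # no bigram occurs twice: we are done
            return rules, text
        t_k = min(F.values())      # threshold of the kth round
        while F:                   # turns of the kth round
            bc = max(F, key=F.get)
            X = chr(0x2460 + len(rules))   # fresh nonterminal symbol
            rules.append((X, bc))
            text = text.replace(bc, X)
            # recount for brevity; the paper updates F in place and
            # discards entries whose frequency drops below t_k
            F = {b: x for b, x in naive_top_d(text, f_k).items()
                 if x >= max(t_k, 2)}
        f_k = grow(f_k)            # round k+1 with a larger capacity
```

On `"abababab"`, this produces the two rules $X_1 \to \mathtt{ab}$ and $X_2 \to X_1 X_1$, leaving a text of length two.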

2.3. Algorithmic Details
Suppose that we are in the kth round and the ith turn. Let ${t}_{k}$ be the lowest frequency in F computed at the beginning of the kth round. We keep ${t}_{k}$ as a constant threshold for the invariant that all frequencies in F are at least ${t}_{k}$ during the kth round. With this threshold, we can ensure in the following that F is either empty or stores a most frequent bigram.
Now suppose that the most frequent bigram of ${T}_{i}$ is $\mathtt{bc}\in {\mathsf{\Sigma}}_{i}^{2}$, which is stored in F. To produce ${T}_{i+1}$ (and hence advancing to the $(i+1)$st turn), we enlarge the space of ${T}_{i}$ from ${n}_{i}\lceil \lg {\sigma}_{i}\rceil$ to ${n}_{i}\lceil \lg {\sigma}_{i+1}\rceil$ bits and replace all occurrences of bc in ${T}_{i}$ with a new nonterminal ${X}_{i+1}$. Subsequently, we would like to take the next bigram of F. For that, however, we need to update the stored frequencies in F. To see this necessity, suppose that there is an occurrence of abcd with two characters $\mathtt{a},\mathtt{d}\in {\mathsf{\Sigma}}_{i}$ in ${T}_{i}$. By replacing $\mathtt{bc}$ with ${X}_{i+1}$,
 1.
the frequencies of ab and cd decrease by one (for the border case a = b = c (resp. b = c = d), there is no need to decrement the frequency of ab (resp. cd)), and
 2.
the frequencies of $\mathtt{a}{X}_{i+1}$ and ${X}_{i+1}\mathtt{d}$ increase by one.
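Both effects can be checked on a tiny example: we embed one occurrence of abcd in the hypothetical string xabcdy and let X denote the fresh nonterminal.

```python
from collections import Counter

def bigram_counts(s):
    """Count all (overlapping) bigram occurrences of s."""
    return Counter(s[p:p + 2] for p in range(len(s) - 1))

before = bigram_counts("xabcdy")
after = bigram_counts("xabcdy".replace("bc", "X"))  # replace bc with X

assert before["ab"] - after["ab"] == 1  # effect (1): ab loses one occurrence
assert before["cd"] - after["cd"] == 1  # effect (1): cd loses one occurrence
assert after["aX"] == 1 and after["Xd"] == 1  # effect (2): aX and Xd appear
```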
Updating F: We can take care of the former changes (1) by decreasing the frequency of the respective bigram in F (in case it is present). If the frequency of this bigram drops below the threshold ${t}_{k}$, we remove it from F, as there may be bigrams with a higher frequency that are not present in F. To cope with the latter changes (2), we track the characters adjacent to ${X}_{i+1}$ after the replacement, count their numbers, and add their respective bigrams to F if their frequencies are sufficiently high. In detail, suppose that we have substituted bc with ${X}_{i+1}$ exactly h times. Consequently, with the new text ${T}_{i+1}$, we have additionally $h\lg {\sigma}_{i+1}$ bits of free space (the free space is consecutive after shifting all characters to the left), which we call D in the following. Subsequently, we scan the text and put the characters of ${\mathsf{\Sigma}}_{i+1}$ appearing to the left of each of the h occurrences of ${X}_{i+1}$ into D. After sorting the characters in D lexicographically, we can count the frequency of $\mathtt{a}{X}_{i+1}$ for each character $\mathtt{a}\in {\mathsf{\Sigma}}_{i+1}$ preceding an occurrence of ${X}_{i+1}$ in the text ${T}_{i+1}$ by scanning D linearly. If the obtained frequency of such a bigram $\mathtt{a}{X}_{i+1}$ is at least as high as the threshold ${t}_{k}$, we insert $\mathtt{a}{X}_{i+1}$ into F and subsequently discard a bigram with the currently lowest frequency in F if the size of F has become ${f}_{k}+1$. In case we visit a run of ${X}_{i+1}$'s during the creation of D, we must take care not to count the overlapping occurrences of ${X}_{i+1}{X}_{i+1}$. Finally, we can analogously count the occurrences of ${X}_{i+1}\mathtt{d}$ for all characters $\mathtt{d}\in {\mathsf{\Sigma}}_{i}$ succeeding an occurrence of ${X}_{i+1}$.
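The left-neighbor counting step can be sketched as follows. A Python list stands in for the freed bit space D, and the run subtlety (not counting overlapping occurrences of ${X}_{i+1}{X}_{i+1}$) is deliberately omitted here.

```python
def count_left_neighbors(text, X):
    """Collect the character preceding each occurrence of X into D,
    sort D, and read off the frequency of each bigram aX by a linear
    scan over the runs of equal characters in the sorted buffer."""
    D = [text[p - 1] for p in range(1, len(text)) if text[p] == X]
    D.sort()
    freqs = {}
    p = 0
    while p < len(D):                    # linear scan over runs in D
        q = p
        while q < len(D) and D[q] == D[p]:
            q += 1
        freqs[D[p] + X] = q - p          # frequency of the bigram aX
        p = q
    return freqs
```

On `"aXbXaXXc"`, this reports aX twice, bX once, and XX once (note the overcount of XX in a run, which the in-place algorithm avoids).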
Capacity of F: After the above procedure, the frequencies in F are up to date. F becomes empty when all bigrams stored in it have been replaced or their frequencies have dropped below ${t}_{k}$. In that case, we end the kth round and continue with the $(k+1)$st round by (a) creating a new frequency table F with capacity ${f}_{k+1}$ and (b) setting the new threshold ${t}_{k+1}$ to the minimal frequency in the new F. In what follows, we (a) analyze in detail when F becomes empty (as this determines the sizes ${f}_{k}$ and ${f}_{k+1}$) and (b) show that, for the sake of our target total running time, we can compensate for the number of discarded bigrams by enlarging F's capacity from ${f}_{k}$ bigrams to ${f}_{k+1}$ bigrams.
Next, we analyze how many characters we have to free up (i.e., how many bigram occurrences we have to replace) to gain enough space for storing an additional frequency. Let ${\delta}_{i}:=\lg({\sigma}_{i+1}^{2}{n}_{i}/2)$ be the number of bits needed to store one entry of F, and let ${\beta}_{i}:=\min({\delta}_{i}/\lg {\sigma}_{i+1},c{\delta}_{i}/\lg n)$ be the minimum number of characters that need to be freed to store one frequency in this space. To understand the value of ${\beta}_{i}$, we look at the arguments of the minimum function in the definition of ${\beta}_{i}$ and simultaneously at the maximum function in our target working space of $\max(n\lceil \lg {\sigma}_{m}\rceil,(n/c)\lg n)+\mathcal{O}(\lg n)$ bits (cf. Theorem 1):
 1.
The first item in this maximum function allows us to spend $\lg {\sigma}_{i+1}$ bits for each freed character such that we obtain space for one additional entry in F after freeing ${\delta}_{i}/\lg {\sigma}_{i+1}$ characters.
 2.
The second item allows us to use $\lg n$ additional bits after freeing up c characters. This additional treatment helps us let ${f}_{k}$ grow sufficiently fast in the first rounds to retain our $\mathcal{O}\left({n}^{2}\right)$ time bound: for sufficiently small alphabets and large text sizes, $\lg({\sigma}^{2}n/2)/\lg \sigma =\mathcal{O}(\lg n)$, which means that we might otherwise run the first $\mathcal{O}(\lg n)$ turns with ${f}_{k}=\mathcal{O}\left(1\right)$ and, therefore, already spend $\mathcal{O}({n}^{2}\lg n)$ time. Hence, after freeing up $c{\delta}_{i}/\lg n$ characters, we have space to store one additional entry in F.
With ${\beta}_{i}=\min({\delta}_{i}/\lg {\sigma}_{i+1},c{\delta}_{i}/\lg n)=\mathcal{O}\left({\log}_{\sigma}n\right)\cap \mathcal{O}\left({\log}_{n}\sigma \right)=\mathcal{O}\left(1\right)$, we have the sufficient condition that replacing a constant number of characters gives us enough space for storing an additional frequency.
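The constant bound on ${\beta}_{i}$ can be sanity-checked numerically. The function below is our own illustration (with an arbitrary example value for the constant c, and a single alphabet size standing in for ${\sigma}_{i+1}$), not part of the algorithm: whichever argument of the minimum is large, the other stays small.

```python
import math

def beta(sigma, n, c=4):
    """beta_i = min(delta_i / lg sigma, c * delta_i / lg n) with
    delta_i = lg(sigma^2 * n / 2) bits per frequency-table entry."""
    delta = math.log2(sigma * sigma * n / 2)
    return min(delta / math.log2(sigma), c * delta / math.log2(n))

# one argument of the minimum can be large, but never both:
assert beta(2, 2 ** 30) < 8        # small alphabet: second argument is small
assert beta(2 ** 20, 2 ** 30) < 8  # large alphabet: first argument is small
```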
If we assume that replacing the occurrences of a bigram stored in F does not decrease the other frequencies stored in F, the analysis is simple: since each bigram in F has a frequency of at least two, ${f}_{k+1}\ge {f}_{k}+{f}_{k}/{\beta}_{i}$. Since ${\beta}_{i}=\mathcal{O}\left(1\right)$, ${f}_{k}$ grows exponentially, meaning that we need $\mathcal{O}(\lg n)$ rounds. In what follows, we show that this also holds in the general case.
Lemma 3. Given that the frequencies of all bigrams in F drop below the threshold ${t}_{k}$ after replacing the most frequent bigram bc, the frequency of bc has to be at least $\max(2,(|F|-1)/2)$, where $|F|\le {f}_{k}$ is the number of frequencies stored in F.
Proof. If the frequency of bc in ${T}_{i}$ is x, then we can reduce at most $2x$ frequencies of other bigrams (both the left character and the right character of each occurrence of bc can contribute to an occurrence of another bigram). Since a bigram must occur at least twice in ${T}_{i}$ to be present in F, the frequency of bc has to be at least $\max(2,({f}_{k}-1)/2)$ for discarding all bigrams of F. □
Suppose that we have enough space available for storing the frequencies of ${\alpha}_{i}{f}_{k}$ bigrams, where ${\alpha}_{i}$ is a constant (depending on ${\sigma}_{i}$ and ${n}_{i}$) such that F and the working space of Lemma 2 with $d={f}_{k}$ can be stored within this space. With ${\beta}_{i}$ and Lemma 3 with $|F|={f}_{k}$, we have:

where we use the equivalence $1+2/\left({\alpha}_{i}{\beta}_{i}{f}_{k}\right)=1+1/\left(2{\alpha}_{i}{\beta}_{i}\right)-1/\left(2{\alpha}_{i}{\beta}_{i}{f}_{k}\right)\iff 5={f}_{k}$ to estimate the two arguments of the maximum function.
Since we let ${f}_{k}$ grow by a factor of at least $\gamma :=\min_{1\le i\le m}{\gamma}_{i}>1$ with each recomputation of F, we have ${f}_{k}=\mathsf{\Omega}\left({\gamma}^{k}\right)$, and therefore ${f}_{k}=\mathsf{\Theta}\left(n\right)$ after $k=\mathcal{O}(\lg n)$ rounds. Consequently, after reaching $k=\mathcal{O}(\lg n)$, we can iterate the above procedure a constant number of times to compute the nonterminals of the remaining bigrams occurring at least twice.
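As a quick numeric illustration of this growth (the values of γ and ${f}_{0}$ below are arbitrary examples, not the constants of the analysis), geometric growth of ${f}_{k}$ reaches n within $\mathcal{O}(\lg n)$ rounds:

```python
def rounds_until(n, gamma=1.5, f0=1):
    """Count rounds until f_k >= n when f_{k+1} >= gamma * f_k."""
    f, k = f0, 0
    while f < n:
        f *= gamma
        k += 1
    return k

# roughly log_gamma(n) = lg(n) / lg(gamma) rounds, i.e., O(lg n)
assert rounds_until(2 ** 20) <= 40
assert rounds_until(2 ** 10) <= 20
```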
Time analysis: In total, we have $\mathcal{O}(\lg n)$ rounds. At the start of the kth round, we compute F with the algorithm of Lemma 2 with $d={f}_{k}$ on a text of length at most $n-{f}_{k}$ in $\mathcal{O}(n(n-{f}_{k})\cdot \lg {f}_{k}/{f}_{k})$ time with ${f}_{k}\le n$. Summing this up, we get:
In the ith turn, we update F by decreasing the frequencies of the bigrams affected by the substitution of the most frequent bigram bc with ${X}_{i+1}$. To decrease such a frequency, we look up its respective bigram with a linear scan of F, which takes ${f}_{k}=\mathcal{O}\left(n\right)$ time. However, since each such decrease is accompanied by the replacement of an occurrence of bc, we obtain $\mathcal{O}\left({n}^{2}\right)$ total time by charging each text position with $\mathcal{O}\left(n\right)$ time for a linear search in F. With the same argument, we can bound the total time for sorting the characters in D by $\mathcal{O}\left({n}^{2}\right)$: since we spend $\mathcal{O}(h\lg h)$ time on sorting the h characters preceding or succeeding a replaced character and $\mathcal{O}\left({f}_{k}\right)=\mathcal{O}\left(n\right)$ time on swapping a sufficiently frequent new bigram composed of ${X}_{i+1}$ and a character of ${\mathsf{\Sigma}}_{i+1}$ with a bigram of lowest frequency in F, we charge each text position again with $\mathcal{O}\left(n\right)$ time. Putting all time bounds together yields the claim of Theorem 1.
2.4. Storing the Output In-Place
Finally, we show that we can store the computed grammar in the text space. More precisely, we want to store the grammar in an auxiliary array A packed at the end of the working space such that the entry $A\left[i\right]$ stores the right-hand side of the nonterminal ${X}_{i}$, which is a bigram. Thus, the nonterminals are represented implicitly as indices of the array A. We therefore need to subtract $2\lg {\sigma}_{i}$ bits of space from our working space ${\alpha}_{i}{f}_{k}$ after the ith turn. By adjusting ${\alpha}_{i}$ in the above equations, we can deal with this additional space requirement as long as the frequencies of the replaced bigrams are at least three (we charge two occurrences for growing the space of A).
When only bigrams with frequencies of at most two remain, we switch to a simpler algorithm that discards the idea of maintaining the frequency table F: Suppose that we work with the text ${T}_{i}$. Let $\lambda$ be a text position, which is one in the beginning and will be incremented in the following turns while maintaining the invariant that $T[1..\lambda ]$ does not contain a bigram of frequency two. We scan ${T}_{i}[\lambda ..n]$ linearly from left to right and check, for each text position j, whether the bigram ${T}_{i}\left[j\right]{T}_{i}[j+1]$ has another occurrence ${T}_{i}\left[{j}^{\prime}\right]{T}_{i}[{j}^{\prime}+1]={T}_{i}\left[j\right]{T}_{i}[j+1]$ with ${j}^{\prime}>j+1$, and if so,
 (a)
append ${T}_{i}\left[j\right]{T}_{i}[j+1]$ to A,
 (b)
replace ${T}_{i}\left[j\right]{T}_{i}[j+1]$ and ${T}_{i}\left[{j}^{\prime}\right]{T}_{i}[{j}^{\prime}+1]$ with a new nonterminal ${X}_{i+1}$ to transform ${T}_{i}$ to ${T}_{i+1}$, and
 (c)
recurse on ${T}_{i+1}$ with $\lambda :=j$ until no bigram with frequency two is left.
The position $\lambda$, which we never decrement, helps us skip all text positions starting bigrams of frequency one. Thus, the algorithm spends $\mathcal{O}\left(n\right)$ time for each such text position and $\mathcal{O}\left(n\right)$ time for each bigram with frequency two. Since there are at most n such bigrams, the overall running time of this algorithm is $\mathcal{O}\left({n}^{2}\right)$.
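Steps (a)-(c) above translate almost directly into the following sketch; it is illustrative only (fresh nonterminals are drawn from an arbitrary symbol range, and a plain Python list stands in for the packed array A).

```python
def final_phase(text, rules):
    """Replace bigrams of frequency two, scanning from lam (the text's
    lambda), which is never decremented."""
    lam = 0
    while True:
        for j in range(lam, len(text) - 1):
            bc = text[j:j + 2]
            jp = text.find(bc, j + 2)          # occurrence with j' > j + 1
            if jp != -1:
                X = chr(0x2460 + len(rules))   # fresh nonterminal symbol
                rules.append((X, bc))          # step (a): append to A
                # step (b): replace both occurrences with X
                text = text[:j] + X + text[j + 2:jp] + X + text[jp + 2:]
                lam = j                        # step (c): recurse with lambda := j
                break
        else:                                  # no frequency-two bigram left
            return text, rules
```

For example, `final_phase("abcabc", [])` first replaces the two occurrences of ab and then the two occurrences of the resulting bigram, ending with a text of length two and two rules.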
Remark 1 (Pointer machine model). Since it refrains from using complicated data structures, our algorithm consists only of elementary sorting and scanning steps. This allows us to run our algorithm on a pointer machine, obtaining the same time bound of $\mathcal{O}\left({n}^{2}\right)$. For the space bounds, we assume that the text is given in n words, where a word is large enough to store an element of ${\mathsf{\Sigma}}_{m}$ or a text position.