The question is: how can we design a data structure that allows us to efficiently find the exact occurrences of (short) strings inside $\mathcal{T}$? In particular, one may wish to count the occurrences of a pattern (for example, $count(\mathcal{T},\text{“AT”})=3$) or to locate them (for example, $locate(\mathcal{T},\text{“AT”})=1,3,7$). As we will see, there are two conceptually different ways to solve this problem. The first, sketched in Section 2.1, is to realize that every occurrence of $\Pi=\text{“AT”}$ in $\mathcal{T}$ is a prefix of a text suffix (for example: the occurrence of “AT” at position 7 is a prefix of the text suffix “AT$”). The second, sketched in Section 2.2, is to partition the text into non-overlapping phrases appearing elsewhere in the text and divide the problem into the two sub-problems of (i) finding occurrences that overlap two adjacent phrases and (ii) finding occurrences entirely contained in a phrase. Albeit conceptually different, both approaches ultimately resort to suffix sorting: sorting lexicographically a subset of the text’s suffixes. The curious reader can refer to the excellent reviews of Mäkinen and Navarro [7] and Navarro [8,9] for a much deeper coverage of the state of the art relative to both approaches.

#### 2.1. The Entropy Model: Compressed Suffix Arrays

The observation that *every occurrence of $\Pi$ is the prefix of a suffix of $\mathcal{T}$* leads very quickly to a simple and time-efficient solution to the full-text indexing problem. Note that a suffix can be identified by a text position: for example, suffix “GAT$” corresponds to position 6. Let us sort text suffixes in lexicographic order. We call the resulting integer array the suffix array ($SA$) of $\mathcal{T}$; see Figure 2. This time, we color text positions $i$ according to the color of letter $\mathcal{T}[i]$. Note that colors (letters) get clustered, since suffixes are sorted lexicographically.

The reason for appending the special character $ at the end of the text is that, in this way, no suffix prefixes another suffix. Suffix arrays were independently discovered by Udi Manber and Gene Myers in 1990 [11] and (under the name of PAT array) by Gonnet, Baeza-Yates, and Snider in 1992 [12,13]. An earlier solution, the suffix tree [14], dates back to 1973 and is more time-efficient, albeit much less space-efficient (by a large constant factor). The idea behind the suffix tree is to build the trie of all the text’s suffixes, replacing unary paths with pairs of pointers to the text.

The suffix array $SA$, when used together with the text $\mathcal{T}$, is a full-text index: in order to count/locate occurrences of a pattern $\Pi$, it is sufficient to binary search $SA$, extracting characters from $\mathcal{T}$ to compare $\Pi$ with the corresponding suffixes during the search. This solution counts the $occ$ occurrences of a pattern $\Pi$ of length $m$ in a text $\mathcal{T}$ of length $n$ in $O(m\log n)$ time, and reports them in additional optimal $O(occ)$ time. Denoting the alphabet size by $\sigma$, our index requires $n\log\sigma + n\log n$ bits to be stored (the first component for the text and the second for the suffix array).
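The binary-search strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the paper’s implementation: the helper names `suffix_array` and `locate` are assumptions, the text `"ATATAGAT$"` is a hypothetical string chosen to be consistent with the running example (occurrences of “AT” at positions 1, 3, 7), and positions in code are 0-based.

```python
def suffix_array(t):
    # Naive O(n^2 log n) construction: sort suffix start positions
    # by the suffixes they denote (fine for illustration).
    return sorted(range(len(t)), key=lambda i: t[i:])

def locate(t, sa, p):
    # Binary search SA for the contiguous range of suffixes prefixed
    # by p; each step compares p against a length-m suffix prefix.
    m, lo, hi = len(p), 0, len(sa)
    while lo < hi:                        # first suffix with prefix >= p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:                        # first suffix with prefix > p
        mid = (lo + hi) // 2
        if t[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])           # 0-based occurrence positions

t = "ATATAGAT$"                           # hypothetical running-example text
sa = suffix_array(t)
occs = locate(t, sa, "AT")                # [0, 2, 6] -> 1-based 1, 3, 7
```

Counting is the same search: the answer is simply the width of the SA range, so `count` costs $O(m\log n)$ without reporting positions.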

Note that the suffix array cannot be considered a small data structure. On a constant-sized alphabet, the term $n\log n$ is asymptotically larger than the text. The need for processing larger and larger texts made this issue relevant by the end of the millennium. As an instructive example, consider indexing the Human genome (≈3 billion DNA bases). DNA sequences can be modeled as strings over an alphabet of size four ($\{A,C,G,T\}$); therefore, the Human genome takes less than 750 MiB of space to be stored using two bits per letter. The suffix array of the Human genome requires, on the other hand, about 11 GiB.
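The back-of-the-envelope accounting behind these figures can be checked directly (this sketch assumes 2 bits per base for the text and one 32-bit integer per suffix for the suffix array, as in the paragraph above):

```python
n = 3_000_000_000                  # ~3 billion DNA bases
text_mib = 2 * n / 8 / 2**20       # 2 bits per letter over {A, C, G, T}
sa_gib = 32 * n / 8 / 2**30        # one 32-bit integer per suffix
print(f"text: {text_mib:.0f} MiB, suffix array: {sa_gib:.1f} GiB")
```

The 16-fold gap (32 bits versus 2 bits per position) is exactly the overhead the compressed indexes of the next paragraphs aim to eliminate.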

Although the study of compressed text indexes had begun before the year 2000 [15], the turn of the millennium represented a crucial turning point for the field: two independent works by Ferragina and Manzini [16] and Grossi and Vitter [17] showed how to compress the suffix array.

Consider the integer array $\psi$ of length $n$ defined as follows. Given a suffix array position $i$ containing value (text position) $j=SA[i]$, the cell $\psi[i]$ contains the suffix array position $i^{\prime}$ containing text position $(j \bmod n)+1$ (we treat the string as circular). More formally: $\psi[i]=SA^{-1}[(SA[i] \bmod n)+1]$. See Figure 3.
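A short sketch of how $\psi$ can be derived from $SA$ follows. The helper name `psi_array` is an assumption, and the input is the suffix array of the hypothetical running-example text "ATATAGAT$" (0-based indices, so the 1-based formula $(j \bmod n)+1$ becomes `(sa[i] + 1) % n`):

```python
def psi_array(sa):
    # psi[i] = SA^{-1}[(SA[i] mod n) + 1] in the paper's 1-based terms;
    # with 0-based indices this is inv[(sa[i] + 1) % n].
    n = len(sa)
    inv = [0] * n
    for i, j in enumerate(sa):
        inv[j] = i                 # inv is the inverse suffix array SA^{-1}
    return [inv[(sa[i] + 1) % n] for i in range(n)]

sa = [8, 4, 6, 2, 0, 5, 7, 3, 1]   # suffix array of "ATATAGAT$"
psi = psi_array(sa)
```

Grouping the resulting values by the first letter of each suffix ($F=\$AAAAGTTT$) yields the runs $[4]$, $[5,6,7,8]$, $[2]$, $[0,1,3]$: each run is increasing, which is exactly the property proved next.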

Note the following interesting property: equally-colored values in $\psi$ are increasing. More precisely: $\psi$-values corresponding to suffixes beginning with the same letter form increasing subsequences. To understand why this happens, observe that $\psi[i]$ takes us from a suffix $\mathcal{T}[SA[i],\cdots,n]$ to suffix $\mathcal{T}[SA[i]+1,\cdots,n]$ (let us exclude the case $SA[i]=n$ in order to simplify our formulas). Now, assume that $i<j$ and that $\mathcal{T}[SA[i],\cdots,n]$ and $\mathcal{T}[SA[j],\cdots,n]$ begin with the same letter (that is: $i$ and $j$ have the same color in Figure 3). Since $i<j$, we have that $\mathcal{T}[SA[i],\cdots,n]<\mathcal{T}[SA[j],\cdots,n]$ (lexicographically). But then, since the two suffixes begin with the same letter, we also have that $\mathcal{T}[SA[i]+1,\cdots,n]<\mathcal{T}[SA[j]+1,\cdots,n]$, i.e., $\psi[i]<\psi[j]$.

Let us now store just the differences between consecutive values in each increasing subsequence of $\psi$. We denote this new array of differences as $\Delta(\psi)$. See Figure 4 for an example.

Two remarkable properties of $\Delta(\psi)$ are that, using an opportune encoding for its elements, this sequence: (i) takes compressed space, and (ii) still supports fast random access to the values of $\psi$.

The fact that this strategy achieves compression is actually not hard to prove. Consider any integer encoding (for example, Elias’ delta or gamma) capable of representing any integer $x$ in $O(1+\log x)$ bits. By the concavity of the logarithm function, the inequality $\sum_{i=1}^{m}\log(x_i)\le m\cdot\log\left(\frac{\sum_{i=1}^{m}x_i}{m}\right)$ holds for any integer sequence $x_1,\cdots,x_m$. Now, note that the sub-sequence $x_1,\cdots,x_{n_c}$ of $\Delta(\psi)$ corresponding to letter $c\in\Sigma$ has two properties: it has exactly $n_c$ terms and $\sum_{i=1}^{n_c}x_i\le n$. It follows that the encoded sub-sequence takes (asymptotically) $\sum_{i=1}^{n_c}(1+\log(x_i))\le n_c\cdot\log(n/n_c)+n_c$ bits. By summing this quantity over all the alphabet’s characters, we obtain precisely $O(n(H_0+1))$ bits. Other encodings (in particular, Elias-Fano dictionaries [18,19]) can achieve the claimed $nH_0+O(n)$ bits while supporting constant-time random access.
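The bound can be checked numerically for Elias gamma coding, which spends $2\lfloor\log_2 x\rfloor+1$ bits on an integer $x\ge 1$ (so the concavity bound picks up a factor 2 on the logarithmic term). The gap sequence below is a hypothetical example, not taken from the paper: $n_c=8$ positive gaps summing to $n=64$; in practice zero gaps would be shifted by one before encoding.

```python
from math import log2

def gamma_bits(x):
    # Elias gamma code length for an integer x >= 1:
    # 2 * floor(log2 x) + 1 bits.
    return 2 * (x.bit_length() - 1) + 1

# Hypothetical gap subsequence for one letter c: n_c = 8 gaps, sum = n = 64.
gaps = [3, 9, 1, 14, 2, 20, 7, 8]
encoded = sum(gamma_bits(x) for x in gaps)
# Concavity bound for gamma's ~2 log x + 1 bits per gap:
bound = 2 * 8 * log2(64 / 8) + 8     # 2 * n_c * log(n / n_c) + n_c
```

However the gaps are distributed, the encoded size stays below the bound, which summed over all letters gives the $O(n(H_0+1))$-bit figure from the text.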

The final step is to recognize that $\psi$ moves us forward (by one position at a time) in the text. This allows us to extract suffixes without using the text. To achieve this, it is sufficient to store in one array $F=\$AAAAGTTT$ (using our running example) the first character of each suffix. This can be done in $O(n)$ bits of space (using a bitvector) since those characters are sorted and we assume the alphabet to be effective. The $i$-th text suffix (in lexicographic order) is then $F[i]$, $F[\psi[i]]$, $F[\psi^{(2)}[i]]$, $F[\psi^{(3)}[i]],\cdots$, where $\psi^{(\ell)}$ indicates function $\psi$ applied $\ell$ times to its argument. See Figure 5. Extracting suffixes makes it possible to implement the binary search algorithm discussed in the previous section: the Compressed Suffix Array (CSA) takes compressed space and enables us to find all of a pattern’s occurrences in time proportional to the pattern length. By adding a small sampling of the suffix array, one can use the same solution to compute any value $SA[i]$ in polylogarithmic time without asymptotically affecting the space usage.
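The text-free extraction step above can be sketched directly. The helper name `extract` is an assumption; `f` and `psi` correspond to the hypothetical running-example text "ATATAGAT$" (0-based indices).

```python
def extract(f, psi, i, length):
    # Output the first `length` characters of the i-th suffix (in
    # lexicographic order) as F[i], F[psi[i]], F[psi^(2)[i]], ...
    # The text itself is never consulted.
    out = []
    for _ in range(length):
        out.append(f[i])
        i = psi[i]
    return "".join(out)

f = "$AAAAGTTT"                     # first characters of the sorted suffixes
psi = [4, 5, 6, 7, 8, 2, 0, 1, 3]   # psi for the running example

extract(f, psi, 4, 9)               # -> "ATATAGAT$" (the whole text)
extract(f, psi, 2, 3)               # -> "AT$"
```

Since each step of the SA binary search only needs a prefix of one suffix, plugging `extract` into that search yields pattern matching from $F$ and $\psi$ alone, i.e., from compressed space.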

Another (symmetric) philosophy is to exploit the inverse of function $\psi$. These indexes are based on the Burrows-Wheeler transform (BWT) [20] and achieve high-order compression [16]. We do not discuss BWT-based indexes here since their generalizations to labeled graphs will be covered in Section 3. For more details on entropy-compressed text indexes, we redirect the curious reader to the excellent survey of Mäkinen and Navarro [7].

#### 2.2. The Repetitive Model

Since their introduction in the year 2000, entropy-compressed text indexes have had a dramatic impact in domains such as bioinformatics; the most widely used DNA aligners, Bowtie [21] and BWA [22], are based on compressed indexes and can align thousands of short DNA fragments per second on large genomes while using only compressed space in RAM during execution. As seen in the previous section, these indexes operate within a space bounded by the text’s empirical entropy [16,17]. Entropy, however, is insensitive to long repetitions: the entropy-compressed version of $\mathcal{T}\cdot\mathcal{T}$ (that is, text $\mathcal{T}$ concatenated with itself) takes at least twice the space of the entropy-compressed $\mathcal{T}$ [23]. There is a simple reason for this fact: by its very definition (see Section 2.1), the quantity $H_0$ depends only on the characters’ relative frequencies in the text. Since characters in $\mathcal{T}$ and $\mathcal{T}\cdot\mathcal{T}$ have the same relative frequencies, their entropies are the same. The claim follows since $\mathcal{T}\cdot\mathcal{T}$ is twice as long as $\mathcal{T}$.
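This blindness of $H_0$ to repetition is easy to verify directly. A minimal sketch (the helper name `h0` is an assumption; the test string is the hypothetical running example):

```python
from collections import Counter
from math import log2

def h0(t):
    # Zero-order empirical entropy per symbol:
    # H0 = sum over characters c of (n_c / n) * log2(n / n_c).
    n = len(t)
    return sum(nc / n * log2(n / nc) for nc in Counter(t).values())

t = "ATATAGAT$"
# h0(t) == h0(t + t): relative frequencies are unchanged by duplication,
# so the total size bound n * H0 simply doubles with the text length.
```

Any compressor whose output is measured in $nH_0$ bits therefore pays the full price for the duplicate copy, which motivates the repetition-aware measures discussed next.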

While the above reasoning might seem artificial, the same problem arises on texts composed of large repetitions. Today, most large data sources, such as DNA sequencers and the web, follow this new model of highly repetitive data. As a consequence, entropy-compressed text indexes are no longer able to keep pace with the exponentially-increasing rate at which this data is produced, as very often the mere index size exceeds the RAM limit. For this reason, in recent years more powerful compressed indexes have emerged; these are based on the Lempel-Ziv factorization [23], the run-length Burrows-Wheeler transform (BWT) [24,25,26], context-free grammars [27], string attractors [28,29] (combinatorial objects generalizing those compressors), and more abstract measures of repetitiveness [30]. The core idea behind these indexes (with the exception of the run-length BWT; see Section 3) is to partition the text into phrases that also occur elsewhere and to use geometric data structures to locate pattern occurrences.

To get a more detailed intuition of how these indexes work, consider the Lempel-Ziv’78 (LZ78) compression scheme. The LZ78 factorization breaks the text into phrases with the property that each phrase extends a previous phrase by one character; see the example in Figure 6 (top). The mechanism we are going to describe works for any factorization-based compressor (also called dictionary compressors). A factorization-based index keeps a two-dimensional geometric data structure storing a labeled point $(x,y,\ell)$ for each phrase $y$, where $x$ is the reversed text prefix preceding the phrase and $\ell$ is the first text position of the phrase. Efficient techniques not discussed here (in particular, mapping to rank space) exist to opportunely reduce $x$ and $y$ to small integers preserving the lexicographic order of the corresponding strings (see, for example, Kreft and Navarro [23]). For example, such a point in Figure 6 is $(GCA,CG,4)$, corresponding to the phrase break between positions 3 and 4. To understand how locate queries are answered, consider the pattern $\Pi=CAC$. All occurrences of $\Pi$ crossing a phrase boundary can be split into a prefix and a suffix, aligned on the phrase boundary. Let us consider the split $CA|C$ (all possible splits have to be considered for the following algorithm to be complete). If $\Pi$ occurs in $\mathcal{T}$ with this split, then $C$ will be a prefix of some phrase $y$, and $CA$ will be a suffix of the text prefix preceding $y$. We can find all phrases $y$ with this property by issuing a four-sided range query on our grid, as shown in Figure 6 (bottom left). This procedure works since $C$ and $AC$ (that is, $CA$ reversed) define ranges on the horizontal and vertical axes: these ranges contain all phrases prefixed by $C$ and all reversed prefixes prefixed by $AC$ (equivalently, all prefixes suffixed by $CA$), respectively. In Figure 6 (bottom left), the queried rectangle contains text positions 13 and 11. By subtracting from those numbers the length of the pattern’s prefix (in this example, $2=|CA|$), we discover that $\Pi=CAC$ occurs at positions 11 and 9, crossing a phrase with split $CA|C$.
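The LZ78 parse underlying the index can be sketched as follows. The helper name `lz78_phrases` and the input string are illustrative assumptions (not the Figure 6 example): each phrase is the shortest prefix of the remaining text that has not appeared as a phrase before, i.e., a previous phrase extended by one character.

```python
def lz78_phrases(t):
    # LZ78 parse: scan left to right, growing the current phrase until
    # it is new; record it and start over. Every recorded phrase is a
    # previously-seen phrase (possibly empty) plus one fresh character.
    seen, phrases, cur = {""}, [], ""
    for ch in t:
        cur += ch
        if cur not in seen:
            seen.add(cur)
            phrases.append(cur)
            cur = ""
    if cur:
        phrases.append(cur)  # possibly incomplete last phrase
    return phrases

lz78_phrases("ABABABA")      # -> ["A", "B", "AB", "ABA"]
```

On a repetitive input the phrases grow by one character each time, so a long repetition is covered by few, increasingly long phrases; the index then stores one geometric point per phrase boundary, $O(b)$ points in total.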

With a similar idea (that is, resorting again to geometric data structures), one can recursively track occurrences completely contained inside a phrase; see the caption of Figure 6 (bottom right). Let $b$ be the number of phrases in the parse. Note that our geometric structures store, overall, $O(b)$ points. On repetitive texts, the smallest possible number $b$ of phrases of such a parse can be constant. In practice, on repetitive texts $b$ (or any of its popular approximations [31]) is orders of magnitude smaller than the entropy-compressed text [23,24]. The fastest existing factorization-based indexes to date can locate all $occ$ occurrences of a pattern $\Pi\in\Sigma^{m}$ in $O(b\,\mathrm{polylog}\,n)$ bits of space and optimal $O(m+occ)$ time [32]. For more details on indexes for repetitive text collections, the curious reader can refer to the excellent recent surveys of Navarro [8,9].