Gonzalo Navarro
Massive and highly repetitive text collections are arising in several modern applications. For example, a U.K. project managed in 2018 to sequence 100,000 human genomes, which, stored in plain form, require 300 terabytes. Further, the data structures needed to perform the complex searches that bioinformatics requires would add another order of magnitude, pushing the storage into the petabytes.
How can we cope with this flood of repetitive data? We can think of compression (after all, two human genomes differ by about 0.1%), but compression alone is not the answer: we must decompress the data before we can use it. A more ambitious research area, compressed data structures, promises to store the data, together with the structures required to handle it efficiently, within space close to that of the compressed data. The data is never decompressed; it is always used directly in compressed form.
However, on these repetitive datasets, statistical compression is useless. Dictionary compression techniques like Lempel-Ziv perform much better, because they replace text chunks with references to identical chunks seen earlier. Lempel-Ziv can compress our genome collections by a factor of 100, down to three affordable terabytes. But again, this is just compression. Can we design compressed data structures, based on dictionary compression, that extract snippets and even search the genomes efficiently, without ever decompressing?
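To make the idea concrete, here is a minimal sketch of an LZ77-style greedy parse, written in Python for illustration only: real compressors locate the matches with suffix-based structures rather than by this quadratic scan. Each phrase copies the longest chunk that already occurred earlier and appends one new character; the number of phrases is the measure z discussed below.

```python
def lz77_parse(text):
    """Greedy LZ77-style parse: each phrase is the longest prefix of the
    remaining text that already occurs earlier, plus one new character.
    Returns (offset, length, next_char) triples; their count approximates
    the measure z. (Quadratic-time teaching sketch.)"""
    phrases, i, n = [], 0, len(text)
    while i < n:
        best_off, best_len = 0, 0
        for j in range(i):                       # candidate earlier start
            l = 0
            # extend the match; it may overlap position i (self-reference)
            while i + l < n - 1 and text[j + l] == text[i + l]:
                l += 1
            if l > best_len:
                best_off, best_len = j, l
        phrases.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return phrases

# A repetitive string collapses into very few phrases:
print(lz77_parse("abababababab$"))   # [(0, 0, 'a'), (0, 0, 'b'), (0, 10, '$')]
```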
There has been a good deal of research on this challenge. Even extracting a snippet without decompressing the text from the beginning is tricky. We do not know how to do it on Lempel-Ziv, but it is possible on slightly weaker dictionary compression formats, like grammar compression (where one finds a small context-free grammar that generates the text). Providing efficient searches is even more difficult: on such huge collections, sequentially scanning the text is not an option. We must build an index data structure that accelerates searches. Unlike statistical compressors, dictionary compressors tend to cut the text into pieces, so a sought substring can appear split in many different ways. Despite those challenges, there are currently various compressed indexes that provide efficient access and search for short strings, and whose size is bounded in terms of the size z of the Lempel-Ziv encoding or the size g of grammar compression.
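The following sketch shows why grammars allow such access (the grammar and rule names are made up for the example). If we store the expansion length of every nonterminal, we can descend from the root straight to any position, paying only for the grammar's height rather than for the whole prefix.

```python
from functools import lru_cache

# Toy grammar in Chomsky-like form: each nonterminal expands to a pair of
# symbols (nonterminals or single characters). "S" expands to "abababab".
rules = {
    "S": ("A", "A"),
    "A": ("B", "B"),
    "B": ("a", "b"),
}

@lru_cache(maxsize=None)
def length(sym):
    """Length of the expansion of sym (a terminal expands to itself)."""
    if sym not in rules:
        return 1
    left, right = rules[sym]
    return length(left) + length(right)

def char_at(sym, i):
    """The i-th character of sym's expansion, without expanding it all:
    at each rule, go left or right depending on the stored lengths."""
    while sym in rules:
        left, right = rules[sym]
        if i < length(left):
            sym = left
        else:
            i -= length(left)
            sym = right
    return sym

def extract(start, count, root="S"):
    return "".join(char_at(root, i) for i in range(start, start + count))

print(extract(3, 4))   # -> "baba", extracted without touching positions 0..2
```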
Bioinformatic applications require more complex searches: searching while allowing errors, searching for all the substrings of a long string, finding sufficiently long substrings that occur frequently in the text, finding the longest substrings that appear repeated in the text, and many others.
A beloved data structure in stringology, the suffix tree, can efficiently answer all those complex queries. However, it requires a lot of space: an order of magnitude more than the plain text. To be usable in bioinformatics, it underwent various simplifications and space reductions. Researchers showed how to emulate it with just suffix arrays (the leaves of the suffix tree) and later with the FM-index (a statistically compressed suffix array).
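As a taste of how suffix arrays support such queries, here is a minimal Python sketch (naive construction for clarity; linear-time algorithms exist) that answers one of the complex queries above, the longest substring that appears repeated in the text:

```python
def suffix_array(text):
    """Start positions of all suffixes, in lexicographic order
    (teaching sketch; production code uses O(n) construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def longest_repeated_substring(text):
    """The longest substring occurring at least twice is the longest
    common prefix of some pair of lexicographically adjacent suffixes."""
    sa = suffix_array(text)
    best = ""
    for a, b in zip(sa, sa[1:]):
        l = 0
        while a + l < len(text) and b + l < len(text) and text[a + l] == text[b + l]:
            l += 1
        if l > len(best):
            best = text[a:a + l]
    return best

print(longest_repeated_substring("mississippi"))   # -> "issi"
```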
The FM-index builds on the Burrows-Wheeler Transform (BWT), a permutation of the text that makes it easier to compress. It was soon discovered that the BWT features runs of equal consecutive symbols, which become longer as the text becomes more repetitive. The number r of runs in the BWT then became a measure of repetitiveness, just like z or g. Further research managed, within space bounded in terms of r, not only to extract snippets and perform basic searches, but also to support all the complex searches offered by suffix trees (this compression does not cut the text, so things are simpler).
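The transform and its run count are easy to illustrate (a teaching sketch; practical tools derive the BWT from the suffix array instead of sorting rotations explicitly). Note how the more repetitive string yields proportionally fewer, longer runs:

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations. '$' is a sentinel
    assumed smaller than every other character."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def runs(s):
    """Count maximal runs of equal consecutive symbols: the measure r."""
    return sum(1 for i in range(len(s)) if i == 0 or s[i] != s[i - 1])

for t in ("abracadabra", "abcabcabcabc"):
    b = bwt(t)
    print(t, "->", b, " r =", runs(b))
```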
In parallel, researchers aimed to understand the compressibility limits of repetitive texts, obtaining lower bounds in terms of the number of distinct substrings. Over time, it has turned out that all the compressibility measures and lower bounds are sandwiched within a logarithmic factor of each other, so they are all relatively close, except r.
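For concreteness, writing n for the text length and δ for the lower-bound measure based on distinct substrings, the known relations take roughly the following form (the exact logarithmic arguments vary across papers, so this chain should be read as indicative):

```latex
\delta \;\le\; z \;\le\; g \;=\; O\!\left(z \log \tfrac{n}{z}\right),
\qquad
z \;=\; O\!\left(\delta \log \tfrac{n}{\delta}\right)
```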
The measure r, which offered a world of sophisticated searches, seemed to be an outlier. It sat between statistical and dictionary compression, and it was the only measure that could not be bounded in terms of the others. In practice, the structures based on r were indeed larger than those based on z or g, which raised concerns about how well r captured repetitiveness.
The following paper finally settles the question. It proves that r is at most a log-square factor away from z, which validates r as a measure of repetitiveness. At the same time, it confirms the previous intuition that r is not that small. But at least we now know the price of going from basic to complex string searches, which is precisely what bioinformatics needs.
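In symbols, the bound takes the form below, with n the text length; this is a direct rendering of the log-square factor mentioned above, and the paper's statement pins down the exact arguments of the logarithms:

```latex
r \;=\; O\!\left(z \log^2 n\right)
```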
Beyond the relevance of its main result, the paper is a masterpiece that explores many other consequences. It is written by two rising stars in combinatorial pattern matching. I hope to hear much more from them in the years to come.