Burrows Wheeler transform

Keywords: original string character, transformation reversible, rle compression, row, process, bwt, order, length, matrix, column, retrieve. Powered by TextRank.

I want to talk about a cool trick that has been around for a long time, one I encountered in the early 2000s when I was deeply into data compression (remember the Usenet posts on comp.compression and people claiming to be able to repeatedly compress a string down to a single bit?). This method has been used in bzip to achieve great compression ratios, albeit at the cost of extra CPU cycles and slower (de)compression speeds. The technique is called the Burrows-Wheeler Transformation (BWT).

The concept behind this technique is to group similar characters together, enabling efficient run-length encoding (RLE), followed by a move-to-front transformation for enhanced Huffman encoding or other entropy-based compression methods.

Imagine the following string:

ababababababab

Running length encoding on this isn’t feasible, as the output would actually be longer than the input:

1a1b1a1b...

But what if there were a magical algorithm that could transform the string into:

aaaaaaabbbbbbb

Then, using RLE, it would become:

7a7b

This results in a good compression ratio of 4/14. This is the idea behind BWT. However, the challenge is making the transformation reversible to retrieve the original string. We can’t simply sort the string alphabetically to group every character close together, as that wouldn’t be reversible.

How It Works

First, we create all possible rotations of the string and place them in a matrix. An easy way to implement this in code is to concatenate the string with itself and slide a window of the original string’s length along this new concatenated string.

For example, with “banana”:

banana
ananab
nanaba
anaban
nabana
abanan

Next, lexicographically sort the matrix:

abanan
anaban
ananab
banana
nabana
nanaba

The last column is the BWT representation of the original string, and the index of the original string is used as the starting point in the decoding process.

nnbaaa

In this example, the characters are grouped together, making the data a good candidate for RLE compression. Since we need lossless compression, the process has to be reversible. How can we go from nnbaaa back to banana?

Inverting the Process

The beauty of BWT is that it’s reversible, allowing us to retrieve the original string from the transformed string.

The transformed string constitutes the last column of our sorted matrix. Sorting this string gives us the first column. So we have:

0. a.....n
1. a.....n
2. a.....b
3. b.....a
4. n.....a
5. n.....a

Since the transformation is cyclic and we have the index of the original term (3), we know that the rotation in row 3 follows the rotation in row 2, as the first character appears as the last character in row 2. We build a sequence graph to indicate the order of letters based on their table locations. If multiple rows are possible, we process them in ascending order. Initially, our graph (in adjacency representation) looks like this:

shift = [_, _, _, 2, _, _]

This means position i points to shift[] as its successor. Checking row 2, we see the first character matches the last character of row 5, 4, or 3. We process these in descending order, so:

shift = [_, _, 5, 2, _, _]

Following row 2, there are two rows starting with 'a'. Again, we pick in descending order, so row 4 (without reusing rows, as the operation is bijective):

shift = [_, 4, 5, 2, _, _]

Continuing this process, we eventually get:

shift = [3, 4, 5, 2, 0, 1]

Starting with the original index from the transform (3) and following the shift array through the original string characters, we retrieve the original sequence:

input = nnbaaa
shift[3] = 2, input[2] = b
shift[2] = 5, input[5] = a
shift[5] = 1, input[1] = n
shift[1] = 4, input[4] = a
shift[4] = 0, input[0] = n
shift[0] = 3, input[3] = a

Following the shift array, we successfully reconstruct the original string “banana”.


Metadata

597 words

First published on 2024-10-30

Generated on 13 Dec 2024 at 12:12 AM

Index

Mobile optimized version. Desktop version.