G2's Blog

Learning About Lossless Compression

March 24, 2024

The other day I had the sudden idea to try to make a compression algorithm for compressing specifically space-separated strings based on an idea I had, which basically went as follows:

  1. Read through the file, and as you go, make a list of all unique space-separated strings

  2. As you read, if any strings match, replace the matching string with an identifier to denote that it's a compressed string, followed by a number which points at the first occurrence of that string in the text

Fundamentally, that's the whole idea. It's a relatively simple thing to do in theory.

To start, I'll give an example using this excerpt:

The Cat's head began fading away the moment he was gone, and, by the time he had come back with the Dutchess, it had entirely disappeared; so the King and the executioner ran wildly up and down looking for it, while the rest of the party went back to the game.

— Lewis Carroll, Alice's Adventures in Wonderland

This text contains several repeated strings, which are in this case English words, most notably, the word "the" is repeated 8 times, not including the first word:

The Cat's head began fading away the moment he was gone, and, by the time he had come back with the Dutchess, it had entirely disappeared; so the King and the executioner ran wildly up and down looking for it, while the rest of the party went back to the game.

If we make a note of where "the" is located first in the text, which is the 34th character, and then store it in a dictionary, we can then store its location as an 8-bit value. Since many latin Unicode values take up the equivalent of a single character, this will decrease the size of a word like "the" followed by a space to only one byte! That block of text is 257 bytes, and if we replace all the occurrences of "the", it ends up becoming 233 bytes, which is a compression ratio of .

This style of compression is known as "dictionary" compression, and is similar to the way something like LZ77 and LZ78 work, although usually you operate on strings of arbitrary bytes and not individual space-separated words like I was doing here.

In my testing on files such as enwik8 and enwik9 my implementation of the above algorithm was able to achieve a compression ratio of , while compressing and writing the file at a speed of 130MB/s, which I think is pretty good for something I made in one day. If you would like to take a look at the code I created, I have posted it to my GitHub here:

https://github.com/G2-Games/dictionary_compression

There are definitely ways to improve this as well, the implementation marks compressed bytes by using 0xFF, which immediately makes it incapable of compressing arbitrary binary data. Additionally, the fact that every single compressed byte has to have another byte preceeding it increases the number of bytes above the theoretical maximum of a little less than .

Comments