Learning About Lossless Compression
March 24, 2024
The other day I had the sudden idea to try to make a compression algorithm for compressing specifically space-separated strings based on an idea I had, which basically went as follows:
-
Read through the file, and as you go, make a list of all unique space-separated strings
-
As you read, if any strings match, replace the matching string with an identifier to denote that it's a compressed string, followed by a number which points at the first occurrence of that string in the text
Fundamentally, that's the whole idea. It's a relatively simple thing to do in theory.
To start, I'll give an example using this excerpt:
The Cat's head began fading away the moment he was gone, and, by the time he had come back with the Dutchess, it had entirely disappeared; so the King and the executioner ran wildly up and down looking for it, while the rest of the party went back to the game.
— Lewis Carroll, Alice's Adventures in Wonderland
This text contains several repeated strings, which are in this case English words, most notably, the word "the" is repeated 8 times, not including the first word:
The Cat's head began fading away the moment he was gone, and, by the time he had come back with the Dutchess, it had entirely disappeared; so the King and the executioner ran wildly up and down looking for it, while the rest of the party went back to the game.
If we make a note of where "the" is located first in the text, which is the
34th character, and then store it in a dictionary, we can then store its
location as an 8-bit value. Since many latin Unicode values take up the
equivalent of a single character, this will decrease the size of a word like
"the" followed by a space to only one byte! That block of text is 257 bytes,
and if we replace all the occurrences of "the", it ends up becoming 233 bytes,
which is a compression ratio of
This style of compression is known as "dictionary" compression, and is similar to the way something like LZ77 and LZ78 work, although usually you operate on strings of arbitrary bytes and not individual space-separated words like I was doing here.
In my testing on files such as
enwik8 and enwik9 my implementation
of the above algorithm was able to achieve a compression ratio
of
https://github.com/G2-Games/dictionary_compression
There are definitely ways to improve this as well, the implementation marks
compressed bytes by using 0xFF
, which immediately makes it incapable of
compressing arbitrary binary data. Additionally, the fact that every single
compressed byte has to have another byte preceeding it increases the number of
bytes above the theoretical maximum of a little less than