Kolmogorov Cocktail: Chasing Higher File Compression Ratios

~ and a way to watch your movie without EVER unzipping the file ~

Image for post
Image for post

The mathematician Andrey Kolmogorov, generations ago, invented a hypothetical measure of complexity: “If I give you the code for something, and you feed that code into a simple computer circuit, that computer will output the data associated with that code.” That computer circuit’s number of parts and connections determines its Kolmogorov complexity. If you can find the absolute simplest circuit which generates your data, then that circuit’s size is the complexity of the data. Problem is, mathematicians proved that there’s no computer program which can tell you the Kolmogorov complexity of all possible given chunks of data. So, there’s no ‘Universal Optimum File Compression Software’… ever.

Instead, folks have to settle for compression software that does an alright job, most of the time. And, technically, you could find software that gives you the actual optimum almost all of the time — just not 100% of the time. That’s the target to shoot for: Usually Optimal. There have been various strategies, each with their own wrinkles — like, finding the most common ‘snippets’ that appear in the data, and forming a ‘grammar’ out of them, where each snippet gets a name in code, and you turn the data’s bits into a string of these snippets’ names. Morse code is an example of that. Or, you can create a warped, rippled surface using a complicated sine function, and add that to a dozen other such wrinkled landscapes; when you chop the heads off of the peaks, and color-in the grid-squares that those peaks occupied, you’ve painted your data. My version is a little different…

Data-Specific Circuits

See, Kolmogorov and his buddies were imagining a single piece of software that, for any particular input code, regurgitated a whole file of data. The grammar-based compression mentioned above is of that sort; you generate a grammar, like “A: 01, B: 100, C: 110”, and then, if you have the code “CBAA”, you can generate the data file — “1101000101”. Yet, for a single given grammar, you can input many different codes; each one will generate a different file, and you only wanted one of these. The rest of the codes are almost certainly gibberish.

The ability of that grammar to represent many different files of data, depending upon the inputs, is called its expressivity. Usually, when we hear about expressivity in computer science, we want more of it: a computer language that has high expressivity can describe many behaviors, meaning it’s more likely to be able to do what you want. Yet, we are not looking for the most expressive grammar, in this case; we want the most compressed one! So, any additional expressivity is direct cause of lower compression ratios. We’re looking for a computer program that has as little expressive power as possible.

Ideally, that would be a program where every possible input is needed to reconstruct the single data file. That is, the circuit ONLY creates the data you want, and you need the WHOLE thing’s space of inputs to do so. No gibberish hiding behind alternate input codes! Instead, each input, from “0…000” and “0…001” to “1…111” extracts it’s own chunk of the data, and to avoid having to specify which chunk to do first (which would take-up precious memory), we’ll just iterate through all inputs.

View without Unzipping

This variation of compression methods, by using each and every input in turn, and running that little snippet through a tiny circuit, does something unique: it lets you jump to a specific time in the song or movie, or a specific page of a document or spreadsheet, and read-out all of that information, even view thumbnails from other places in the file, all without ever unzipping it. The file never takes up more space again! (Caveat — if you want to modify a file, yes, you need to unzip it, for modifications to be included in a re-zipping process…same as everybody else.)

Files with a low or zero probability of being edited (photos, songs, movies, government and business data, scientific and medical data) are ideal candidates. You could hold dozens of times more full-length video on your phone, without extra memory, and without any inconvenience.

Knitting Circuits Together

Another wrinkle in this compression methodology: instead of having each time stamp (your iterated inputs) map onto a single bit of output for a number of time steps equal to the bits of the file, it is far more efficient to have many output channels, each generating their own bit in an ordering, and together these are a single time stamp’s chunk of data.

By finding all the intermediate truth-tables of the circuits for each of those bit-positions, we can easily check if two circuits’ intermediates match at any point. If they do, then their circuits can be overlapped, and this amplifies the compression possible. (No, we cannot guarantee an ideal circuit without testing almost all of the circuits below a large size… which will be never.) The trick that I’m still fiddling with is exactly how to build those initial circuits efficiently, so that they’re likely to have plenty of these overlapping intermediates. I have matrices of output bits’ dependencies, arranged by the global state, kinda like having the entire permutation space of possible experiments… I’ll have to draw pictures of it, once it’s nailed down.

The key insight though: you run along each circuit, and keep a bit string associated to each gate in the proposed circuit — those bits tied to a single gate, in order, are the activations of the gate’s output as you iterate through all inputs. By bagging all these strings, we can identify when there is a match, as well as when truth tables are darn similar. That dovetails with the final section…

Adjustable Lossy-ness

The methodology so far deals strictly with lossless compression, which is usually 1:2 at best. Yet, if you don’t mind slightly lower quality, you can do with quite a bit more compression. Neural networks, happily, allow us to do something new: we can identify how important certain regions of data are, using eye-saccades from movie-viewers, and seeing what levels of noise in each region of the frame are necessary, to get people to look in that direction (“perturbations were significant enough to distract from regular viewing”). And, neural networks are already incredible at in-painting and de-noising data, so that you can store a file with only 1/5th the bits of the original, scattered around the image, and the neural network can still reconstruct the original!

So, combining those strengths with the data-specific circuit from above: a neural network is trained by a company, over months, to identify which parts of a movie are likely to be important to a viewer, as well as what level of noise is acceptable in each region. Then, another network, which is able to reconstruct an image from a fraction of the pixels, is set to work in tandem with the circuit-design algorithm, to answer a question, asked repeatedly of each little chunk of data: “if I swap this bit, or that bit, will I be able to make a simpler circuit — or, better yet, one which overlaps with the other circuits needed for all the output channels?” (I suspect that a ‘0-bias’ or ‘1-bias’ in each dependent-chunk of bits will be the easiest route to compression, by making as many input channels irrelevant to the output as possible.) Together, this neural network and circuit-designer are able to identify which bits can be flipped, to compress the file, without noticeably reducing the quality of the reconstructed image (in the important areas, at least). Then, people can share that compressed file, and download the reconstruction neural network, to regenerate any time stamp quickly, without extra memory requirements. Done.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store