There was a series of articles on lossless and lossy compression techniques in, I think, PC Magazine that I read as a kid and made a big impression on me. I didn't, like, end up ready to write code to build Huffman trees or anything, but it did change it from a total black box into a bunch of smaller pieces each of which a mere mortal can understand.
The compression rabbit hole can be a rewarding one to go down if you haven't. Besides the tech itself making a lot of things work better (like a Web built on human-readable formats, or more specialized stuff like Parquet or other columnar formats, or video on the lossy side), it can give you some tools or perspective that apply elsewhere, e.g. to other probabilistic stuff like caching and predictions for lossless compression, or other signal-processing stuff for lossy compression.
Another interesting facet is that, according to some schools of thought, compression and AI are equivalent problems: The better you understand something, the less you need to memorize to reproduce it.
Of course, large language models are (by definition) currently going the other direction, but it remains to be seen whether that leads to artificial intelligence (whatever that ends up meaning).
I interpreted that statement as saying the current practice is to make LLMs larger and larger (so they effectively memorize more and more data) to make them more powerful, but from the perspective of information theory, if models were powerful and "understanding", then models could stay the same size and become more and more powerful as they get increasingly better at compressing the available information. I am not sure if this interpretation was what was meant though.
I believe the parent poster's point is: LLMs are more effective when they use more memory, meaning the less they are forced to compress the training data, the better they perform.
Maybe a naive question: are LLMs really going the other way? My intuition is that the model weights are much smaller than the information encoded in them.
Was it maybe Dr Dobb's? Mark Nelson had an excellent series of articles about compression which opened up that world for me. I ended up working on compression for my PhD many years later.
Hi I am a neophyte interested in compression. It's difficult to find communities regarding compression online. I am looking for a guide. Is there any place that I can dm you?
The only CS class I ever took was Huffman's "Cybernetics" (undergrad UCSC CS class) and it was a real mind-blower. The very first day he dove right into sphere packing (https://en.wikipedia.org/wiki/Sphere_packing) and the whole course was a great introduction to information theory.
I remember him describing how he wrote the original solution for huffman compression, crumpled it up, threw it away, and then retrieved it from the trash.
I failed the class which led to a huge imposter syndrome in me that pushed me to learn far more CS than I ever needed to know. Huffman was definitely an arrogant bastard, but he certainly taught me a lot of interesting math.
Interesting refresher. I do remember in college we had to build the lookup tree for some word as an exercise. Obviously the name Huffman stuck in my brain. But for the love of god, I can't even remember if the lecture mentioned Fano. Seems he was just as important in the design process of what we refer to today simply as Huffman encoding.
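For anyone who wants to redo that exercise, here's a minimal sketch of the classic greedy construction in Python (a heap of subtrees, repeatedly merging the two least frequent). Just a toy, not how production encoders store their tables:

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Heap entries: (frequency, tiebreaker, tree); a tree is either a
        # leaf symbol (str) or a (left, right) pair of subtrees.
        heap = [(f, i, s) for i, (s, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)               # two least frequent subtrees
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, n, (t1, t2)))  # merge them
            n += 1
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, str):
                codes[tree] = prefix or "0"               # single-symbol edge case
            else:
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
        walk(heap[0][2])
        return codes

    codes = huffman_codes("bookkeeper")
    print(codes)                                          # rarer letters get longer codes
    print(sum(len(codes[c]) for c in "bookkeeper"), "bits total")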
Which strikes me as either rare honesty, or an example of how our values have slid over time, because today the professor would claim the algorithm for himself, and the student would get a B- for some spelling mistake in his write-up, none the wiser that the professor is making fame from their work.
This is why I was flabbergasted by the nonsense internet takeover of HTTPS. You can't cache it. Probably 99% of web traffic is generic public content, but we can't cache any of it, because 1% of it needs to be private.
So then people complain "oh but The Illuminati can see what websites I'm going to!" Yeah, and they still can with HTTPS, it's called statistical network traffic analysis. Decades of research papers show you can uniquely identify a client going to a random internet server (and the page they're browsing) just by sniffing a bunch of POPs. It's used by law enforcement and Five Eyes to identify internet users around the world. Even protocols that have countermeasures (TOR) don't stand up to it.
HTTPS everywhere is not only about privacy but also about integrity. You don't want internet randos (including ISPs) to swap your content with ads, bitcoin miners and put other batshit crazy JavaScript in it.
Also, a lot of ISPs' black-box caching proxies were buggy and broke websites.
Right, I didn't think about that part. They could technically replace it with another valid certificate, but if you're looking for specific certificates you will notice immediately.
Well, an ISP that wants to MITM your traffic today can present another valid certificate too, nothing changes there. It’s just that they couldn’t use a valid certificate that has the same Common Name (FQDN) as the site you’re connecting to, without having their root CA in your browser’s store (so, same behavior as we already have with TLS.) Presenting a cert with a different FQDN already causes a browser warning.
Well, it still involves some kind of public key infrastructure, but encryption could be optional even in https. Linux distros have been hosting their packages on plain http for two decades, PGP signatures (again, not a great example of "trivial", I admit) were sufficient to ensure integrity.
Numerous people are currently sitting in jail in Saudi Arabia for having posted tweets critical of the regime under their own names. Without HTTPS, jails around the world would be a lot busier if authorities could see everything that their populace reads online in clear text, in order to build dossiers on who is most likely to be a dissident.
I don’t think the kind of traffic analysis you mention works as well as you think it does for identifying individual pages e.g. which tweet someone is viewing. Moreover it requires a level of technical sophistication that is beyond all but the most advanced countries, countries that tend to have some measure of rule of law.
First of all, did I say get rid of HTTPS? A protocol that does integrity without privacy doesn't mean HTTPS magically disappears. You can still use it.
Second, totalitarian regimes around the world don't sit on their hands just because you use HTTPS. If they want to know who a dissident is, they go find out. Bribery, tips, intimidation, torture, spy cameras, facial recognition, etc. They also know that everyone who reads a tweet isn't automatically a dissident.
Third, no, it's not hard at all to do statistical traffic analysis, it's part of basic DPI packages shipped with commercial network gear for about a decade. All you need to identify the user is the destination and the source, and the signature of similar connections to specific hosts with specific traffic. You compare the traffic from the target user to the traffic you monitor or simulate with known destinations and content, and highest probability wins. It's child's play.
> Numerous people are currently sitting in jail in Saudi Arabia for having posted tweets critical of the regime under their own names. Without HTTPS, jails around the world would be a lot busier if authorities could see everything that their populace reads online in clear text, in order to build dossiers on who is most likely to be a dissident.
Indeed... When one lives in such regimes, to say that appears to be an utter truism.
1. What is the status of using HTTP content on HTTPS pages with subresource integrity? I think we could save on traffic if CDNs served content over both HTTPS and HTTP, and browsers allowed embedding the latter without warnings into HTTPS-served pages, provided the page had subresource integrity information (see the sketch below).
2. Wasn't there some work on serving presigned HTTPS requests? The one everyone blamed Google (?) for, because they wanted to use it to put slices of other websites inside the search results. Might make sense to resurrect that work.
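On point 1: the integrity attribute is just a base64-encoded digest of the resource bytes, so the content can be verified no matter which transport delivered it. A minimal sketch of computing such a value in Python (the file name is made up for illustration):

    import base64, hashlib

    def sri_hash(path, algo="sha384"):
        # SRI-style integrity value: "<algo>-<base64 of the digest>"
        with open(path, "rb") as f:
            digest = hashlib.new(algo, f.read()).digest()
        return algo + "-" + base64.b64encode(digest).decode()

    # Hypothetical local copy of a CDN-hosted script:
    print(sri_hash("jquery.min.js"))   # e.g. "sha384-..."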
I think the only website I visit with any regularity where my traffic does not contain some session information is Wikipedia. The transition to the web being largely for web applications and services happened before the transition to HTTPS everywhere. That's when you get stuff like Firesheep just stealing the Facebook sessions of hundreds of people at once.
Is it still? Afaik now with everything HTTPS, caching proxies on the ISP end are completely useless.
We instead now see (proprietary?) cache boxes by big players like YouTube and Netflix that ISPs can install in their DCs, but that seems less elegant, even though it gives the content providers much greater control over what, when and how much gets cached. Still, as a smaller fish, without going with cloudflare, there's no caching involved anymore, probably not even if I decide to run my site on HTTP only.
A CDN is a cache as are the Netflix/fb/etc local boxes. CPUs have three layers of cache before hitting main memory, which could be considered a cache for persistent storage…
He perhaps meant bandwidth and not requests. Netflix and youtube alone must constitute the bulk of the volume and I believe they are fairly aggressively cached, then you have all the assets stored on CDNs.
How do these cache boxes work? Does the server send them its half of the ephemeral key or something? I can't imagine each isp has a box with a copy of netflix's private key.
They terminate TLS at the cache box and use a different connection altogether from the cache box to the upstream server. (Said tunnel to upstream could simply be another TLS connection, or wireguard, or IPsec, etc etc.)
I don't think your caches would help very much if all media were sent uncompressed. Everything is already compressed, so you might not think about it much, but I'd say that it is at least as important as caching.
Compression is so ubiquitous that people tend to forget it's there. I still encounter people wondering why zipping their movie files does not result in a smaller file size
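A quick way to see why: compressed media is already close to random noise, and random bytes don't shrink. Toy demonstration with zlib (the random blob stands in for a movie file):

    import os, zlib

    text  = b"to be or not to be, that is the question " * 1000
    noise = os.urandom(len(text))      # stands in for already-compressed video

    print(len(text),  "->", len(zlib.compress(text)))    # shrinks dramatically
    print(len(noise), "->", len(zlib.compress(noise)))   # slightly bigger than the input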
Yeah it's cache all the way down. The CPU has L1, L2, L3 caches for instructions and data. The server has a RAM cache, then an SSD cache. Then apps with SQL almost always use redis/memcache as a first-level cache. Then on the network side we have CDNs, which are basically glorified web caches, and even at the router/switch there are caches for all the stuff.
Indeed. And ultimately human neurons cache and we only scroll back, reload the page, or go back to the website when our cache ages out or the request isn’t cached yet.
> The first step involves yet another clever strategy for identifying repetition and thereby compressing message size, but the second step is to take the resulting compressed message and run it through the Huffman process.
I wonder if this "first step" is Burrows-Wheeler Transform?
Side note: In Silicon Valley (the show), I'm pretty sure that Richard has a picture of David Huffman by his bedside.
No, BWT is largely unused in modern compression codecs.
Lempel-Ziv is the basis for almost all modern general purpose compression, and works more like having a hash table mapping 3 or 4 byte fragments to their positions, walking through the input byte by byte, checking the hash table for matches and inserting the latest fragments and positions into the hash table.
BWT compresses and decompresses at nearly identical speed, whereas with Lempel-Ziv, searching for matches during compression is much slower than simply copying data according to instructions during decompression.
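To make the hash-table description above concrete, here's a toy greedy match finder in Python. Real codecs add hash chains, lazy matching and entropy coding on top, so treat this as a sketch of the shape, not of any particular format:

    def lz_tokens(data, min_match=4):
        # Emit literals and (distance, length) matches, LZ77-style.
        table = {}                      # 4-byte fragment -> last position seen
        i, out = 0, []
        while i < len(data):
            frag = data[i:i + min_match]
            cand = table.get(frag) if len(frag) == min_match else None
            if len(frag) == min_match:
                table[frag] = i
            if cand is not None:
                length = min_match      # extend the match as far as it goes
                while i + length < len(data) and data[cand + length] == data[i + length]:
                    length += 1
                out.append(("match", i - cand, length))
                i += length
            else:
                out.append(("literal", data[i]))
                i += 1
        return out

    print(lz_tokens(b"abcabcabcabcxyz"))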
My understanding is that there are 1000s of different compression algorithms, each with their own pros/cons dependent on the type and characteristics of the file. And yet we still try to pick the "generically best" codec for a given file (ex. PNG) and then use that everywhere.
Why don't we have context-dependent compression instead?
I'm imagining a system that scans objects before compression, selects the optimal algorithm, and then encodes the file. The selected compression algorithm could be prefixed for easy decompression.
Compare a single black image that's 1x1 to one that's 1000x1000. The PNGs are 128 bytes and 6 KB, respectively. However, Run Length Encoding would compress the latter to a comparable size as the former.
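A toy version of that "scan, pick, prefix" idea, using only stdlib codecs as stand-ins for a real algorithm zoo (the one-byte tags are invented for the example; the cost, as replies point out, is CPU spent on the losing candidates):

    import bz2, lzma, zlib

    CODECS = {
        b"Z": (zlib.compress, zlib.decompress),
        b"B": (bz2.compress, bz2.decompress),
        b"X": (lzma.compress, lzma.decompress),
    }

    def pack(data):
        # Try each codec, keep the smallest output, prefix it with the codec's tag.
        best = b"R" + data                       # raw fallback if nothing helps
        for tag, (comp, _) in CODECS.items():
            candidate = tag + comp(data)
            if len(candidate) < len(best):
                best = candidate
        return best

    def unpack(blob):
        tag, payload = blob[:1], blob[1:]
        return payload if tag == b"R" else CODECS[tag][1](payload)

    image = bytes(1000 * 1000)                   # stand-in for the 1000x1000 black image
    packed = pack(image)
    print(packed[:1], len(packed))               # a megabyte of zeros collapses to almost nothing
    assert unpack(packed) == image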
There are a variety of use cases that dictate which algorithm is going to perform best. For example, you might use Zstandard -19 if you are compressing something once and transferring it over a slow network to millions of people. You might use LZ4 if you are generating a unique large piece of data interactively for thousands of concurrent users, because it compresses faster than Zstandard. Basically, if you're constrained by network bandwidth, Zstandard; if you're constrained by CPU, LZ4.
There are then legacy formats that have stuck around long past their sell-by date, like gzip. People are used to using gzip, so you see it everywhere, but it's slower and compresses worse than Zstandard, so there is no reason why you'd ever use it except for compatibility with legacy systems. (bzip2, 7z, xz, snappy, etc. also live in this "no reason to use in 2023" space.)
Take a look at performance measurements here: https://jolynch.github.io/posts/use_fast_data_algorithms/. For example, gzip can get a compression ratio of 0.41 at 21MiB/s, while Zstandard does 0.38 (better) at 134MiB/s. (Meanwhile, lz4 produces outputs nearly twice as large as Zstandard, but compresses almost 3x faster and decompresses 2.5x faster.)
Lossy compression is even more complicated because the compression algorithms take advantage of "nobody will notice" in a way that's data dependent; so music, video, and photographs all have their own special algorithms.
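A rough benchmark sketch of that trade-off, assuming the third-party zstandard and lz4 packages are installed and using a dictionary file as a convenient corpus (swap in any large file; exact numbers vary by machine and data):

    import time
    import lz4.frame
    import zstandard

    data = open("/usr/share/dict/words", "rb").read()   # any large-ish file will do

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        secs = time.perf_counter() - start
        print(name, "ratio", round(len(out) / len(data), 2),
              "speed", round(len(data) / secs / 1e6), "MB/s")

    bench("zstd -19", zstandard.ZstdCompressor(level=19).compress)
    bench("zstd -3",  zstandard.ZstdCompressor(level=3).compress)
    bench("lz4",      lz4.frame.compress)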
Your link seems to compare GNU gzip with zstd. When comparing file formats, I would compare the best software for that file format. igzip: https://github.com/intel/isa-l can decompress consistently faster than GNU gzip. Depending on the file, it decompresses 2-3x faster making it almost as fast as zstd decompression. I have less experience with compression benchmarks. A quick benchmark on Silesia shows igzip to be ~7x faster but it sacrifices 10% of compression ratio for that even on its highest compression setting. It seems to be optimized for speed.
There is some approach you can use sort of like this within PNGs themselves I think. It's been a while so I might be misremembering, but effectively each "row" of data that is compressed can be encoded as the difference from the preceding row using 4 or 5 different operations.
You can achieve better compression by brute forcing the possible operations used to encode the rows to find the "most compressible" output. Not quite what you meant but sort of similar in that you try multiple approaches and pick the best.
I gave up before implementing it but in the stub I left this comment to myself " A heuristic approach is to use adaptive filtering as follows: independently for each row, apply all five filters and select the filter that produces the smallest sum of absolute values per row.".
In addition, closer to your approach, PDFs internally support many compression filters for objects, like RLE and ZIP, so you can choose the best algorithm per object, but generally it's quicker just to ZIP everything.
> There is some approach you can use sort of like this within PNGs themselves I think. [...] You can achieve better compression by brute forcing the possible operations used to encode the rows to find the "most compressible" output.
That's the approach used by tools like optipng and pngcrush.
> I gave up before implementing it but in the stub I left this comment to myself " A heuristic approach is to use adaptive filtering as follows: independently for each row, apply all five filters and select the filter that produces the smallest sum of absolute values per row.".
The same idea can be found in the PNG standard itself: "For best compression of truecolor and grayscale images, we recommend an adaptive filtering approach in which a filter is chosen for each scanline. The following simple heuristic has performed well in early tests: compute the output scanline using all five filters, and select the filter that gives the smallest sum of absolute values of outputs. (Consider the output bytes as signed differences for this test.) This method usually outperforms any single fixed filter choice. However, it is likely that much better heuristics will be found as more experience is gained with PNG." (quoted from the PNG specification, version 1.2)
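A simplified sketch of that heuristic in Python, using just three of the five PNG filter types (None, Sub, Up; the real spec adds Average and Paeth) and scoring each filtered row by the sum of absolute values of its bytes taken as signed differences:

    def best_filter(row, prev):
        candidates = {
            "None": bytes(row),
            "Sub":  bytes((row[i] - (row[i - 1] if i else 0)) % 256 for i in range(len(row))),
            "Up":   bytes((row[i] - prev[i]) % 256 for i in range(len(row))),
        }
        # Treat each output byte as a signed difference and sum the absolute values.
        cost = lambda filtered: sum(min(b, 256 - b) for b in filtered)
        name = min(candidates, key=lambda n: cost(candidates[n]))
        return name, candidates[name]

    prev = bytes(range(0, 200, 2))      # a smooth gradient row
    row  = bytes(range(1, 201, 2))
    print(best_filter(row, prev)[0])    # smooth gradients tend to pick Sub or Up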
Here's the catch: how does the "system" know which algorithm would be the best? It could try encoding it with multiple algorithms and see which one is shorter, but that's extra CPU.
And the "system" can be called acompression algorithm itself.
I mean yeah that's basically what high compression solutions like paq have done, depending on the compression level desired apply increasingly speculative and computationally intensive models to the block and pick whichever one worked the best.
And then, when nobody wants to implement all the compression algorithms in a compressor or decompressor, we end up with files out in the wild that only pick one of them anyway.
OpenZFS' integration of Zstandard uses LZ4 as a "compression canary" for higher Zstandard compression levels: they feed the data blocks through LZ4 and, if it compresses them well enough, feed them through Zstandard.
This relies on LZ4 being very fast, especially with its early exit on incompressible data.
Overall this turns out to be a win, you lose a little bit of compression at a huge decrease in CPU over just using the same Zstandard compression for all the blocks.
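The control flow is roughly like the sketch below. zlib level 1 stands in for LZ4 and lzma for high-level zstd, since the real lz4/zstandard bindings are third-party packages, and the threshold is made up; the point is just the cheap-probe-before-expensive-pass pattern:

    import lzma, os, zlib

    def compress_block(block, ratio_threshold=0.9):
        probe = zlib.compress(block, level=1)          # fast, cheap probe
        if len(probe) > len(block) * ratio_threshold:  # barely compressible?
            return b"\x00" + block                     # store raw, skip the expensive pass
        return b"\x01" + lzma.compress(block, preset=9)

    print(len(compress_block(os.urandom(64 * 1024))))  # random data: stored raw
    print(len(compress_block(b"abc" * 20000)))         # repetitive data: squeezed hard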
It would still pay off in many situations. With existing algorithms you can already optimize image files to be compressed as much as possible, which takes quite a bit longer than usual but if it means 30% smaller files for an entire website that has an impact on every visitor.
WinRAR used to support specialized compression algorithms for specific kinds of files, like uncompressed bitmaps, sounds, executables, or plain text. However, the feature was removed in the RAR5 file format 10 years ago. Maybe it was not worth it?
I listened to a talk once where they said that this is sort of how AV1 compression works, where there are many possible ways to compress a block, and you can choose the best one at runtime by computing each and seeing which is best.
Most image codecs are made for natural images; those of the real world, not synthetic ones like the one you proposed. Lossy codecs like JPEG use perceptual coding to maximize perceived quality for a given file size.
missed opportunity to explain how compression and prediction are related, and that the better you can predict the next token the better your compression gets, then your article gets to mention GPT hey
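For what it's worth, the link is just Shannon's relation: a token predicted with probability p costs about -log2(p) bits under an ideal entropy coder, so a sharper predictor means a shorter message. Tiny illustration (the probabilities are made up):

    import math

    def message_bits(tokens, model):
        # Ideal code length: sum of -log2 p(token) under the model's predictions.
        return sum(-math.log2(model[t]) for t in tokens)

    message = ["the", "cat", "sat"]
    uniform = {w: 1 / 1000 for w in message}            # knows nothing: 1000 equally likely words
    sharp   = {"the": 0.5, "cat": 0.3, "sat": 0.2}      # a made-up "good predictor"

    print(round(message_bits(message, uniform), 1))     # ~29.9 bits
    print(round(message_bits(message, sharp), 1))       # ~5.1 bits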
AIXI / Solomonoff prediction uses lossless compression, which may lead to massive overfitting. If anything, some degree of lossy compression would be "equivalent" to intelligence. Ockham's razor also says that the simplicity of a hypothesis can outweigh the better fit of a rival hypothesis that describes the available evidence more closely. It's a trade-off, and AIXI doesn't make that trade-off, but insists on perfect compression/prediction of the available data.
It's basically the curve-fitting problem: You don't want the simplest curve that fits all the available data points perfectly, you want an even simpler curve that still fits the evidence reasonably well. If you hit the right balance between simplicity and fit, you can expect your model to generalize to unseen data, to make successful predictions. That would be intelligence, or some major part of it.
I think in a sense compression is worse, because not only do you want to correctly predict the next token, you also want to do it fast, with a minimal but efficient algorithm that also doesn't require much space / a big dictionary.
You could think of it as taking a "snapshot" of an AI and then optimizing the hell out of it for a specific case, and you end up with a good compression algorithm.
The modality of the data contains an amount of information comparable to the data itself. Telling ChatGPT that it's hearing music rather than a story would probably help it reason about it a lot.
On a lower level, you can't tell ChatGPT to reason about an image when its only input is a microphone.
Agreed, I had a lot of fun putting a toy one together, even if my compression format was worse in every way than a real format and my implementation quite inefficient
If you are only sending one word, and the recipient already needs to know the word, then you only need 1 bit, essentially just signaling that you are saying that specific word
If you want a richer vocabulary, you could create an index of about 300k words (from the English dictionary), shared between the parties
Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits, for any word in the index (2^19 is around 500k)
That’s without even sorting the index by frequency of appearance/usage
Where did you get that you need 27 bits for one word?
> Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits
Yep! By sorting by frequency, you are able to make it so the majority of words have shorter bit strings. By my calculations, common words such as "the", "of", and "and" will have ~4-6 bits associated with them. That means you can encode a large number of words (googling says those words make up ~1/7 of words based on frequency) with only 4-6 bits each. That's far from the 27 bits you calculated
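Back-of-the-envelope check on both claims (the word frequencies below are rough, commonly cited figures for English running text):

    import math

    print(math.ceil(math.log2(300_000)))                # 19 bits for a fixed-width index

    freq = {"the": 0.065, "of": 0.035, "and": 0.028}     # approximate frequencies
    for word, p in freq.items():
        print(word, round(-math.log2(p), 1), "bits")     # ~3.9, ~4.8, ~5.2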
> Fano’s balancing approach starts by assigning the O and one other letter to the left branch, with the five total uses of those letters balancing out the five appearances of the remaining letters.
> The resulting message requires 27 bits.
I didn’t calculate it, the author of the article did
My understanding is that arithmetic coding isn't used as much as Huffman coding because of the patents that IBM had on arithmetic coding. I'm not familiar with any actively used compression algorithm that uses arithmetic coding.
I have explored using arithmetic coding on some data at work (mainly large amounts of XYZ points). My attempts did not work better than standard zip compression.
On a related note, I got an email today saying that Cloudflare will automatically turn on Brotli compression later this month. Does anyone know if Brotli really makes a difference beyond gzip?
Brotli and ZSTD are both quite a bit better (roughly 20% smaller files while being faster to decompress). Brotli does especially well on javascript/html because the spec hard-codes a dictionary of common web keywords.
I agree. They have a breathless tone to them that's quite annoying to me (I work in data compression as an academic, and I found this article uninspiring.)
By the way, there was an old Soviet magazine called "Kvant" (Russian for Quantum, I think). I do not know Russian, but I have 2 collected volumes of selected articles from them. [1] [2] Their quality is astonishingly good, and high-level. The difference is this:
The Kvant articles were written by professional research mathematicians, trying to present their ideas to an audience that were willing to follow them with pencil and paper in hand.
Quanta magazine articles are written by journalists trying to present advanced science to a lay audience - the articles are very stilted and present the material in a tone that oversimplifies the problem and gives no idea about the actual solution, and they use hackneyed tropes like: oh look, the solvers were just some random unknown guys (in a recent case, the random unknown guy is a tenured faculty member at UCLA in theoretical computer science, apparently "a world away from mathematics" [3]).
I think this is an unfair criticism of the Quanta articles.
I'll just say up front that I am not an academic, and I enjoy the breadth of coverage in the Quanta articles. I would liken them to science articles in American Scientist. (Perhaps you don't like that either.) Yes, they are popularized, but they are still technical.
Are you bored by a description of an algorithm you know well? This article clearly describes the process of Huffman encoding. It's not one of the most amazing discoveries, but is this topic ever going to be exciting? I'd say it's easier to follow their article than either Wikipedia or the top animation hit [1].
There are two other claims you make that appear baseless.
On the first point:
> Quanta magazine articles are written by journalists
The first bio I checked [2] is a Ph.D. mathematician. If they also write, that does not make them less qualified. I'll grant that the second bio I checked [4] was "only" a journalist, but the third was a professor in a named chair [5]. Just clicking, not searching for examples.
On the second point:
> oversimplifies the problem and gives no idea about the actual solution
In your reference [3], it describes the problem clearly and devotes several paragraphs to what looks like a sketch of a solution. Certainly it outlines the ingredients used. (Search for "To see how they arrived at their new upper limit." and "In their proof [...]".)
I haven't seen Kvant, but it's worth adding that Scientific American up till maybe the mid-80s was also more real than standard pop science. Don't take its current incarnation as much like its past. (I guess the 70s were even better for it, but this is my fuzzy memory of a trove of back issues I went through in the 80s.)
That said, from a skim I wouldn't call this particular article standard pop science: it explains an idea/result rather than spending most of its words on the periphery, and the subject is not recent news. It does ask less of the reader than an old Sci Am article would, I think.
Yeah you have a good point - not sure if there’s a name for what you’re referencing but it’s tough to be an expert in something and read a bunch of cringe pop sci articles on your field. This topic comes up frequently with my partner who is a researcher - Ed Yong recently wrote a piece on some work by their lab and it was “tolerable”.
Not a huge fan of Quanta myself but Aeon frequently has articles written by researchers/experts and are higher quality in my opinion.
> "Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them. In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know."
> but because they do it badly, and the clickbait is insufferable
Well, then, who does pop science "right"?
There's no shortage, of course, of 30+ page review articles on every scientific topic imaginable. And if someone has a few days, a freshly minted STEM degree or years of related experience, and a compelling interest in the topic, they can just pick up one of these review articles and go to town.
But that isn't going to fly for the general public, not even close. And not just because of the mathematics, dry passive-voice language, lack of context, times new roman, and gratuitous expert jargon. It's just too much.
I must say, there is no written media that I know of that pleases me. Of course, it doesn't mean they don't exist.
I much prefer the treatment from some YouTubers, such as Sabine Hossenfelder (to name just one).
It's far from dry, she has a wicked sense of humour, she tries her best to be impartial (and sometimes fails), and makes it clear when something she says is her opinion.
PBS also has a ton of good content.
Of course, all these actually require the viewer to make an effort, but if you're not ready to do that, then a poorly written clickbait article is likely to do more harm than good anyways.
I like Quanta! They explain things accurately and in detail, with helpful figures and diagrams. They did go wrong a few months ago with the "quantum computer black hole" article, but the heat they received was precisely because it fell well below their usual standards, which are otherwise well above other "pop science" sites in my experience.
quanta's tone does get mildly annoying at times, but the actual content of the articles is usually excellent, and this one is no exception. I suspect you disliked it simply because you already knew most of what it had to say. (I mean sure, the writer is no Martin Gardner, but then again who is.)