There was a series of articles on lossless and lossy compression techniques in, I think, PC Magazine that I read as a kid and made a big impression on me. I didn't, like, end up ready to write code to build Huffman trees or anything, but it did change it from a total black box into a bunch of smaller pieces each of which a mere mortal can understand.
The compression rabbit hole can be a rewarding one to go down if you haven't. Besides the tech itself making a lot of things work better (like a Web built on human-readable formats, or more specialized stuff like Parquet or other columnar formats, or video on the lossy side), it can give you some tools or perspective that apply elsewhere, e.g. to other probabilistic stuff like caching and predictions for lossless compression, or other signal-processing stuff for lossy compression.
Another interesting facet is that, according to some schools of thought, compression and AI are equivalent problems: The better you understand something, the less you need to memorize to reproduce it.
Of course, large language models are (by definition) currently going the other direction, but it remains to be seen whether that leads to artificial intelligence (whatever that ends up meaning).
I interpreted that statement as saying the current practice is to make LLMs larger and larger (so they effectively memorize more and more data) to make them more powerful, but from the perspective of information theory, if models were powerful and "understanding", then models could stay the same size and become more and more powerful as they get increasingly better at compressing the available information. I am not sure if this interpretation was what was meant though.
I believe the parent poster's point is: LLMs are more effective when they use more memory, meaning the less they are forced to compress the training data, the better they perform.
Maybe a naive question: are LLMs really going the other way? My intuition is that the model weights are much smaller than the information encoded in them.
Was it maybe Dr Dobb's? Mark Nelson had an excellent series of articles about compression which opened up that world for me. I ended up working on compression for my PhD many years later.
Hi I am a neophyte interested in compression. It's difficult to find communities regarding compression online. I am looking for a guide. Is there any place that I can dm you?
The only CS class I ever took was Huffman's "Cybernetics" (undergrad UCSC CS class) and it was a real mind-blower. The very first day he dove right into sphere packing (https://en.wikipedia.org/wiki/Sphere_packing) and the whole course was a great introduction to information theory.
I remember him describing how he wrote the original solution for huffman compression, crumpled it up, threw it away, and then retrieved it from the trash.
I failed the class which led to a huge imposter syndrome in me that pushed me to learn far more CS than I ever needed to know. Huffman was definitely an arrogant bastard, but he certainly taught me a lot of interesting math.
Interesting refresher. I do remember in college we had to build the lookup tree for some word as an exercise. Obviously the name Huffman stuck in my brain. But for the love of god, I can't even remember if the lecture mentioned Fano. Seems he was just as important in the design process of what we refer to today simply as Huffman encoding.
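For anyone who wants to redo that exercise, here's a minimal sketch of the classic greedy construction in Python (a heap of subtrees, repeatedly merging the two least frequent). Just a toy, not how production encoders store their tables:

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Heap entries: (frequency, tiebreaker, tree); a tree is either a
        # leaf symbol (str) or a (left, right) pair of subtrees.
        heap = [(f, i, s) for i, (s, f) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        n = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)               # two least frequent subtrees
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, n, (t1, t2)))  # merge them
            n += 1
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, str):
                codes[tree] = prefix or "0"               # single-symbol edge case
            else:
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
        walk(heap[0][2])
        return codes

    codes = huffman_codes("bookkeeper")
    print(codes)                                          # rarer letters get longer codes
    print(sum(len(codes[c]) for c in "bookkeeper"), "bits total")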
Which strikes me as either rare honesty, or an example of how our values have slid over time, because today the professor would claim the algorithm for himself, and the student would get a B- for some spelling mistake in his write-up, none the wiser that the professor is making fame from their work.
This is why I was flabbergasted by the nonsense internet takeover of HTTPS. You can't cache it. Probably 99% of web traffic is generic public content, but we can't cache any of it, because 1% of it needs to be private.
So then people complain "oh but The Illuminati can see what websites I'm going to!" Yeah, and they still can with HTTPS, it's called statistical network traffic analysis. Decades of research papers show you can uniquely identify a client going to a random internet server (and the page they're browsing) just by sniffing a bunch of POPs. It's used by law enforcement and Five Eyes to identify internet users around the world. Even protocols that have countermeasures (TOR) don't stand up to it.
HTTPS everywhere is not only about privacy but also about integrity. You don't want internet randos (including ISPs) to swap your content with ads, bitcoin miners and put other batshit crazy JavaScript in it.
Also, a lot of ISPs' black-box caching proxies were buggy and broke websites.
Right, I didn't think about that part. They could technically replace it with another valid certificate, but if you're looking for specific certificates you will notice immediately.
Well, an ISP that wants to MITM your traffic today can present another valid certificate too, nothing changes there. It’s just that they couldn’t use a valid certificate that has the same Common Name (FQDN) as the site you’re connecting to, without having their root CA in your browser’s store (so, same behavior as we already have with TLS.) Presenting a cert with a different FQDN already causes a browser warning.
Well, it still involves some kind of public key infrastructure, but encryption could be optional even in https. Linux distros have been hosting their packages on plain http for two decades, PGP signatures (again, not a great example of "trivial", I admit) were sufficient to ensure integrity.
Numerous people are currently sitting in jail in Saudi Arabia for having posted tweets critical of the regime under their own names. Without HTTPS, jails around the world would be a lot busier if authorities could see everything that their populace reads online in clear text, in order to build dossiers on who is most likely to be a dissident.
I don’t think the kind of traffic analysis you mention works as well as you think it does for identifying individual pages e.g. which tweet someone is viewing. Moreover it requires a level of technical sophistication that is beyond all but the most advanced countries, countries that tend to have some measure of rule of law.
First of all, did I say get rid of HTTPS? A protocol that does integrity without privacy doesn't mean HTTPS magically disappears. You can still use it.
Second, totalitarian regimes around the world don't sit on their hands just because you use HTTPS. If they want to know who a dissident is, they go find out. Bribery, tips, intimidation, torture, spy cameras, facial recognition, etc. They also know that everyone who reads a tweet isn't automatically a dissident.
Third, no, it's not hard at all to do statistical traffic analysis, it's part of basic DPI packages shipped with commercial network gear for about a decade. All you need to identify the user is the destination and the source, and the signature of similar connections to specific hosts with specific traffic. You compare the traffic from the target user to the traffic you monitor or simulate with known destinations and content, and highest probability wins. It's child's play.
> Numerous people are currently sitting in jail in Saudi Arabia for having posted tweets critical of the regime under their own names. Without HTTPS, jails around the world would be a lot busier if authorities could see everything that their populace reads online in clear text, in order to build dossiers on who is most likely to be a dissident.
Indeed... When one lives in such regimes, to say that appears to be an utter truism.
1. What is the status of using HTTP content on HTTPS pages with subresource integrity? I think we could save on traffic if CDNs served content over both HTTPS and HTTP, and browsers allowed embedding the latter without warnings into HTTPS-served pages, provided the page had subresource integrity information (see the sketch below).
2. Wasn't there some work on serving presigned HTTPS requests? The one everyone blamed Google (?) for, because they wanted to use it to put slices of other websites inside the search results. Might make sense to resurrect that work.
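On point 1: the integrity attribute is just a base64-encoded digest of the resource bytes, so the content can be verified no matter which transport delivered it. A minimal sketch of computing such a value in Python (the file name is made up for illustration):

    import base64, hashlib

    def sri_hash(path, algo="sha384"):
        # SRI-style integrity value: "<algo>-<base64 of the digest>"
        with open(path, "rb") as f:
            digest = hashlib.new(algo, f.read()).digest()
        return algo + "-" + base64.b64encode(digest).decode()

    # Hypothetical local copy of a CDN-hosted script:
    print(sri_hash("jquery.min.js"))   # e.g. "sha384-..."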
I think the only website I visit with any regularity where my traffic does not contain some session information is Wikipedia. The transition to the web being largely for web applications and services happened before the transition to HTTPS everywhere. That's when you get stuff like Firesheep just stealing the Facebook sessions of hundreds of people at once.
Is it still? Afaik now with everything HTTPS, caching proxies on the ISP end are completely useless.
We instead now see (proprietary?) cache boxes by big players like YouTube and Netflix that ISPs can install in their DCs, but that seems less elegant, even though it gives the content providers much greater control over what, when and how much gets cached. Still, as a smaller fish, without going with cloudflare, there's no caching involved anymore, probably not even if I decide to run my site on HTTP only.
A CDN is a cache as are the Netflix/fb/etc local boxes. CPUs have three layers of cache before hitting main memory, which could be considered a cache for persistent storage…
He perhaps meant bandwidth and not requests. Netflix and youtube alone must constitute the bulk of the volume and I believe they are fairly aggressively cached, then you have all the assets stored on CDNs.
How do these cache boxes work? Does the server send them its half of the ephemeral key or something? I can't imagine each isp has a box with a copy of netflix's private key.
They terminate TLS at the cache box and use a different connection altogether from the cache box to the upstream server. (Said tunnel to upstream could simply be another TLS connection, or wireguard, or IPsec, etc etc.)
I don't think your caches would help very much if all media were sent uncompressed. Everything is already compressed, so you might not think about it much, but I'd say that it is at least as important as caching.
Compression is so ubiquitous that people tend to forget it's there. I still encounter people wondering why zipping their movie files does not result in a smaller file size
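A quick way to see why: compressed media is already close to random noise, and random bytes don't shrink. Toy demonstration with zlib (the random blob stands in for a movie file):

    import os, zlib

    text  = b"to be or not to be, that is the question " * 1000
    noise = os.urandom(len(text))      # stands in for already-compressed video

    print(len(text),  "->", len(zlib.compress(text)))    # shrinks dramatically
    print(len(noise), "->", len(zlib.compress(noise)))   # slightly bigger than the input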
Yeah it's cache all the way down. The CPU has L1, L2, L3 caches for instructions and data. The server has a RAM cache, then an SSD cache. Then apps with SQL almost always use redis/memcache as a first-level cache. Then on the network side we have CDNs, which are basically glorified web caches, and even at the router/switch there are caches for all the stuff.
Indeed. And ultimately human neurons cache and we only scroll back, reload the page, or go back to the website when our cache ages out or the request isn’t cached yet.
> The first step involves yet another clever strategy for identifying repetition and thereby compressing message size, but the second step is to take the resulting compressed message and run it through the Huffman process.
I wonder if this "first step" is Burrows-Wheeler Transform?
Side note: In Silicon Valley (the show), I'm pretty sure that Richard has a picture of David Huffman by his bedside.
No, BWT is largely unused in modern compression codecs.
Lempel-Ziv is the basis for almost all modern general purpose compression, and works more like having a hash table mapping 3 or 4 byte fragments to their positions, walking through the input byte by byte, checking the hash table for matches and inserting the latest fragments and positions into the hash table.
BWT compresses and decompresses at nearly identical speed, whereas with Lempel-Ziv, searching for matches during compression is much slower than simply copying data according to instructions during decompression.
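To make the hash-table description above concrete, here's a toy greedy match finder in Python. Real codecs add hash chains, lazy matching and entropy coding on top, so treat this as a sketch of the shape, not of any particular format:

    def lz_tokens(data, min_match=4):
        # Emit literals and (distance, length) matches, LZ77-style.
        table = {}                      # 4-byte fragment -> last position seen
        i, out = 0, []
        while i < len(data):
            frag = data[i:i + min_match]
            cand = table.get(frag) if len(frag) == min_match else None
            if len(frag) == min_match:
                table[frag] = i
            if cand is not None:
                length = min_match      # extend the match as far as it goes
                while i + length < len(data) and data[cand + length] == data[i + length]:
                    length += 1
                out.append(("match", i - cand, length))
                i += length
            else:
                out.append(("literal", data[i]))
                i += 1
        return out

    print(lz_tokens(b"abcabcabcabcxyz"))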
My understanding is that there are 1000s of different compression algorithms, each with their own pros/cons dependent on the type and characteristics of the file. And yet we still try to pick the "generically best" codec for a given file (ex. PNG) and then use that everywhere.
Why don't we have context-dependent compression instead?
I'm imagining a system that scans objects before compression, selects the optimal algorithm, and then encodes the file. The selected compression algorithm could be prefixed for easy decompression.
Compare a single black image that's 1x1 to one that's 1000x1000. The PNGs are 128 bytes and 6 KB, respectively. However, Run Length Encoding would compress the latter to a comparable size as the former.
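A toy version of that "scan, pick, prefix" idea, using only stdlib codecs as stand-ins for a real algorithm zoo (the one-byte tags are invented for the example; the cost, as replies point out, is CPU spent on the losing candidates):

    import bz2, lzma, zlib

    CODECS = {
        b"Z": (zlib.compress, zlib.decompress),
        b"B": (bz2.compress, bz2.decompress),
        b"X": (lzma.compress, lzma.decompress),
    }

    def pack(data):
        # Try each codec, keep the smallest output, prefix it with the codec's tag.
        best = b"R" + data                       # raw fallback if nothing helps
        for tag, (comp, _) in CODECS.items():
            candidate = tag + comp(data)
            if len(candidate) < len(best):
                best = candidate
        return best

    def unpack(blob):
        tag, payload = blob[:1], blob[1:]
        return payload if tag == b"R" else CODECS[tag][1](payload)

    image = bytes(1000 * 1000)                   # stand-in for the 1000x1000 black image
    packed = pack(image)
    print(packed[:1], len(packed))               # a megabyte of zeros collapses to almost nothing
    assert unpack(packed) == image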
There are a variety of use cases that dictate which algorithm is going to perform best. For example, you might use Zstandard -19 if you are compressing something once and transferring it over a slow network to millions of people. You might use LZ4 if you are generating a unique large piece of data interactively for thousands of concurrent users, because it compresses faster than Zstandard. Basically, if you're constrained by network bandwidth, Zstandard; if you're constrained by CPU, LZ4.
There are then legacy formats that have stuck around long past their sell-by date, like gzip. People are used to using gzip, so you see it everywhere, but it's slower and compresses worse than Zstandard, so there is no reason why you'd ever use it except for compatibility with legacy systems. (bzip2, 7z, xz, snappy, etc. also live in this "no reason to use in 2023" space.)
Take a look at performance measurements here: https://jolynch.github.io/posts/use_fast_data_algorithms/. For example, gzip can get a compression ratio of 0.41 at 21MiB/s, while Zstandard does 0.38 (better) at 134MiB/s. (Meanwhile, lz4 produces outputs nearly twice as large as Zstandard, but compresses almost 3x faster and decompresses 2.5x faster.)
Lossy compression is even more complicated because the compression algorithms take advantage of "nobody will notice" in a way that's data dependent; so music, video, and photographs all have their own special algorithms.
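A rough benchmark sketch of that trade-off, assuming the third-party zstandard and lz4 packages are installed and using a dictionary file as a convenient corpus (swap in any large file; exact numbers vary by machine and data):

    import time
    import lz4.frame
    import zstandard

    data = open("/usr/share/dict/words", "rb").read()   # any large-ish file will do

    def bench(name, compress):
        start = time.perf_counter()
        out = compress(data)
        secs = time.perf_counter() - start
        print(name, "ratio", round(len(out) / len(data), 2),
              "speed", round(len(data) / secs / 1e6), "MB/s")

    bench("zstd -19", zstandard.ZstdCompressor(level=19).compress)
    bench("zstd -3",  zstandard.ZstdCompressor(level=3).compress)
    bench("lz4",      lz4.frame.compress)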
Your link seems to compare GNU gzip with zstd. When comparing file formats, I would compare the best software for that file format. igzip: https://github.com/intel/isa-l can decompress consistently faster than GNU gzip. Depending on the file, it decompresses 2-3x faster making it almost as fast as zstd decompression. I have less experience with compression benchmarks. A quick benchmark on Silesia shows igzip to be ~7x faster but it sacrifices 10% of compression ratio for that even on its highest compression setting. It seems to be optimized for speed.
There is some approach you can use sort of like this within PNGs themselves I think. It's been a while so I might be misremembering, but effectively each "row" of data that is compressed can be encoded as the difference from the preceding row using 4 or 5 different operations.
You can achieve better compression by brute forcing the possible operations used to encode the rows to find the "most compressible" output. Not quite what you meant but sort of similar in that you try multiple approaches and pick the best.
I gave up before implementing it but in the stub I left this comment to myself " A heuristic approach is to use adaptive filtering as follows: independently for each row, apply all five filters and select the filter that produces the smallest sum of absolute values per row.".
In addition, closer to your approach, PDFs internally support many compression filters for objects, like RLE and ZIP, so you can choose the best algorithm per object, but generally it's quicker just to ZIP everything.
> There is some approach you can use sort of like this within PNGs themselves I think. [...] You can achieve better compression by brute forcing the possible operations used to encode the rows to find the "most compressible" output.
That's the approach used by tools like optipng and pngcrush.
> I gave up before implementing it but in the stub I left this comment to myself " A heuristic approach is to use adaptive filtering as follows: independently for each row, apply all five filters and select the filter that produces the smallest sum of absolute values per row.".
The same idea can be found in the PNG standard itself: "For best compression of truecolor and grayscale images, we recommend an adaptive filtering approach in which a filter is chosen for each scanline. The following simple heuristic has performed well in early tests: compute the output scanline using all five filters, and select the filter that gives the smallest sum of absolute values of outputs. (Consider the output bytes as signed differences for this test.) This method usually outperforms any single fixed filter choice. However, it is likely that much better heuristics will be found as more experience is gained with PNG." (quoted from the PNG specification, version 1.2)
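A simplified sketch of that heuristic in Python, using just three of the five PNG filter types (None, Sub, Up; the real spec adds Average and Paeth) and scoring each filtered row by the sum of absolute values of its bytes taken as signed differences:

    def best_filter(row, prev):
        candidates = {
            "None": bytes(row),
            "Sub":  bytes((row[i] - (row[i - 1] if i else 0)) % 256 for i in range(len(row))),
            "Up":   bytes((row[i] - prev[i]) % 256 for i in range(len(row))),
        }
        # Treat each output byte as a signed difference and sum the absolute values.
        cost = lambda filtered: sum(min(b, 256 - b) for b in filtered)
        name = min(candidates, key=lambda n: cost(candidates[n]))
        return name, candidates[name]

    prev = bytes(range(0, 200, 2))      # a smooth gradient row
    row  = bytes(range(1, 201, 2))
    print(best_filter(row, prev)[0])    # smooth gradients tend to pick Sub or Up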
Here's the catch: how does the "system" know which algorithm would be the best? It could try encoding it with multiple algorithms and see which one is shorter, but that's extra CPU.
And the "system" can be called acompression algorithm itself.
I mean yeah that's basically what high compression solutions like paq have done, depending on the compression level desired apply increasingly speculative and computationally intensive models to the block and pick whichever one worked the best.
And then, when nobody wants to implement all the compression algorithms in a compressor or decompressor, we end up with files out in the wild that only pick one of them anyway.
OpenZFS' integration of Zstandard uses LZ4 as a "compression canary" for higher Zstandard compression levels: they feed the data blocks through LZ4 and, if it compresses them well enough, feed them through Zstandard.
This relies on LZ4 being very fast, especially with its early exit on incompressible data.
Overall this turns out to be a win, you lose a little bit of compression at a huge decrease in CPU over just using the same Zstandard compression for all the blocks.
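The control flow is roughly like the sketch below. zlib level 1 stands in for LZ4 and lzma for high-level zstd, since the real lz4/zstandard bindings are third-party packages, and the threshold is made up; the point is just the cheap-probe-before-expensive-pass pattern:

    import lzma, os, zlib

    def compress_block(block, ratio_threshold=0.9):
        probe = zlib.compress(block, level=1)          # fast, cheap probe
        if len(probe) > len(block) * ratio_threshold:  # barely compressible?
            return b"\x00" + block                     # store raw, skip the expensive pass
        return b"\x01" + lzma.compress(block, preset=9)

    print(len(compress_block(os.urandom(64 * 1024))))  # random data: stored raw
    print(len(compress_block(b"abc" * 20000)))         # repetitive data: squeezed hard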
It would still pay off in many situations. With existing algorithms you can already optimize image files to be compressed as much as possible, which takes quite a bit longer than usual but if it means 30% smaller files for an entire website that has an impact on every visitor.
WinRAR used to support specialized compression algorithms for specific kinds of files, like uncompressed bitmaps, sounds, executables, or plain text. However, the feature was removed in the RAR5 file format 10 years ago. Maybe it was not worth it?
I listened to a talk once where they said that this is sort of how AV1 compression works, where there are many possible ways to compress a block, and you can choose the best one at runtime by computing each and seeing which is best.
Most image codecs are made for natural images; those of the real world, not synthetic ones like the one you proposed. Lossy codecs like JPEG use perceptual coding to maximize perceived quality for a given file size.
missed opportunity to explain how compression and prediction are related, and that the better you can predict the next token the better your compression gets, then your article gets to mention GPT hey
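For what it's worth, the link is just Shannon's relation: a token predicted with probability p costs about -log2(p) bits under an ideal entropy coder, so a sharper predictor means a shorter message. Tiny illustration (the probabilities are made up):

    import math

    def message_bits(tokens, model):
        # Ideal code length: sum of -log2 p(token) under the model's predictions.
        return sum(-math.log2(model[t]) for t in tokens)

    message = ["the", "cat", "sat"]
    uniform = {w: 1 / 1000 for w in message}            # knows nothing: 1000 equally likely words
    sharp   = {"the": 0.5, "cat": 0.3, "sat": 0.2}      # a made-up "good predictor"

    print(round(message_bits(message, uniform), 1))     # ~29.9 bits
    print(round(message_bits(message, sharp), 1))       # ~5.1 bits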
AIXI / Solomonoff prediction uses lossless compression, which may lead to massive overfitting. If anything, some degree of lossy compression would be "equivalent" to intelligence. Ockham's razor also says that the simplicity of a hypothesis can outweigh the better fit of a rival hypothesis that describes the available evidence more closely. It's a trade-off, and AIXI doesn't make that trade-off, but insists on perfect compression/prediction of the available data.
It's basically the curve-fitting problem: You don't want the simplest curve that fits all the available data points perfectly, you want an even simpler curve that still fits the evidence reasonably well. If you hit the right balance between simplicity and fit, you can expect your model to generalize to unseen data, to make successful predictions. That would be intelligence, or some major part of it.
I think in a sense compression is worse, because not only do you want to correctly predict the next token, you also want to do it fast, with a minimal but efficient algorithm that also doesn't require much space / a big dictionary.
You could think of it as taking a "snapshot" of an AI and then optimizing the hell out of it for a specific case, and you end up with a good compression algorithm.
The modality of the data contains an amount of information comparable to the data itself. Telling ChatGPT that it's hearing music rather than a story would probably help it reason about it a lot.
On a lower level, you can't tell ChatGPT to reason about an image when its only input is a microphone.
Agreed, I had a lot of fun putting a toy one together, even if my compression format was worse in every way than a real format and my implementation quite inefficient
If you are only sending one word, and the recipient already needs to know the word, then you only need 1 bit, essentially just signaling that you are saying that specific word
If you want a richer vocabulary, you could create an index of about 300k words (from the English dictionary), shared between the parties
Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits, for any word in the index (2^19 is around 500k)
That’s without even sorting the index by frequency of appearance/usage
Where did you get that you need 27 bits for one word?
> Then to send any word you only need to send one number, and in binary it would have between 1 and at most 19 bits
Yep! By sorting by frequency, you are able to make it so the majority of words have shorter bit strings. By my calculations, common words such as "the", "of", and "and" will have ~4-6 bits associated with them. That means you can encode a large number of words (googling says those words make up ~1/7 of words based on frequency) with only 4-6 bits each. That's far from the 27 bits you calculated
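Back-of-the-envelope check on both claims (the word frequencies below are rough, commonly cited figures for English running text):

    import math

    print(math.ceil(math.log2(300_000)))                # 19 bits for a fixed-width index

    freq = {"the": 0.065, "of": 0.035, "and": 0.028}     # approximate frequencies
    for word, p in freq.items():
        print(word, round(-math.log2(p), 1), "bits")     # ~3.9, ~4.8, ~5.2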
> Fano’s balancing approach starts by assigning the O and one other letter to the left branch, with the five total uses of those letters balancing out the five appearances of the remaining letters.
> The resulting message requires 27 bits.
I didn’t calculate it, the author of the article did
My understanding is that arithmetic coding isn't used as much as Huffman coding because of the patents that IBM had on arithmetic coding. I'm not familiar with any actively used compression algorithm that uses arithmetic coding.
I have explored using arithmetic coding on some data at work (mainly large amounts of XYZ points). My attempts did not work better than standard zip compression.
On a related note, I got an email today saying that Cloudflare will automatically turn on Brotli compression later this month. Does anyone know if Brotli really makes a difference beyond gzip?
Brotli and ZSTD are both quite a bit better (roughly 20% smaller files while being faster to decompress). Brotli does especially well on javascript/html because the spec hard-codes a dictionary of common web keywords.
I agree. They have a breathless tone to them that's quite annoying to me (I work in data compression as an academic, and I found this article uninspiring.)
By the way, there was an old Soviet magazine called "Kvant" (Russian for Quantum, I think). I do not know Russian, but I have 2 collected volumes of selected articles from them. [1] [2] Their quality is astonishingly good, and high-level. The difference is this:
The Kvant articles were written by professional research mathematicians, trying to present their ideas to an audience that were willing to follow them with pencil and paper in hand.
Quanta magazine articles are written by journalists trying to present advanced science to a lay audience - the articles are very stilted and present the material in a tone that oversimplifies the problem and gives no idea about the actual solution, and they use hackneyed tropes like: oh look, the solvers were just some random unknown guys (in a recent case, the random unknown guy is a tenured faculty member at UCLA in theoretical computer science, apparently "a world away from mathematics" [3]).
I think this is an unfair criticism of the Quanta articles.
I'll just say up front that I am not an academic, and I enjoy the breadth of coverage in the Quanta articles. I would liken them to science articles in American Scientist. (Perhaps you don't like that either.) Yes, they are popularized, but they are still technical.
Are you bored by a description of an algorithm you know well? This article clearly describes the process of Huffman encoding. It's not one of the most amazing discoveries, but is this topic ever going to be exciting? I'd say it's easier to follow their article than either Wikipedia or the top animation hit [1].
There are two other claims you make that appear baseless.
On the first point:
> Quanta magazine articles are written by journalists
The first bio I checked [2] is a Ph.D. mathematician. If they also write, that does not make them less qualified. I'll grant that the second bio I checked [4] was "only" a journalist, but the third was a professor in a named chair [5]. Just clicking, not searching for examples.
On the second point:
> oversimplifies the problem and gives no idea about the actual solution
In your reference [3], it describes the problem clearly and devotes several paragraphs to what looks like a sketch of a solution. Certainly it outlines the ingredients used. (Search for "To see how they arrived at their new upper limit." and "In their proof [...]".)
I haven't seen Kvant, but it's worth adding that Scientific American up till maybe the mid-80s was also more real than standard pop science. Don't take its current incarnation as much like its past. (I guess the 70s were even better for it, but this is my fuzzy memory of a trove of back issues I went through in the 80s.)
That said, from a skim I wouldn't call this particular article standard pop science: it explains an idea/result rather than spending most of its words on the periphery, and the subject is not recent news. It does ask less of the reader than an old Sci Am article would, I think.
Yeah you have a good point - not sure if there’s a name for what you’re referencing but it’s tough to be an expert in something and read a bunch of cringe pop sci articles on your field. This topic comes up frequently with my partner who is a researcher - Ed Yong recently wrote a piece on some work by their lab and it was “tolerable”.
Not a huge fan of Quanta myself but Aeon frequently has articles written by researchers/experts and are higher quality in my opinion.
> "Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them. In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know."
> but because they do it badly, and the clickbait is insufferable
Well, then, who does pop science "right"?
There's no shortage, of course, of 30+ page review articles on every scientific topic imaginable. And if someone has a few days, a freshly minted STEM degree or years of related experience, and a compelling interest in the topic, they can just pick up one of these review articles and go to town.
But that isn't going to fly for the general public, not even close. And not just because of the mathematics, dry passive-voice language, lack of context, times new roman, and gratuitous expert jargon. It's just too much.
I must say, there is no written media that I know of that pleases me. Of course, it doesn't mean they don't exist.
I much prefer the treatment from some YouTubers, such as Sabine Hossenfelder (to name just one).
It's far from dry, she has a wicked sense of humour, she tries her best to be impartial (and sometimes fails), and makes it clear when something she says is her opinion.
PBS also has a ton of good content.
Of course, all these actually require the viewer to make an effort, but if you're not ready to do that, then a poorly written clickbait article is likely to do more harm than good anyways.
I like Quanta! They explain things accurately and in detail, with helpful figures and diagrams. They did go wrong a few months ago with the "quantum computer black hole" article, but the heat they received was precisely because it fell well below their usual standards, which are otherwise well above other "pop science" sites in my experience.
quanta's tone does get mildly annoying at times, but the actual content of the articles is usually excellent, and this one is no exception. I suspect you disliked it simply because you already knew most of what it had to say. (I mean sure, the writer is no Martin Gardner, but then again who is.)