Compressed filesystems à la language models

55 points by grohan a day ago

> Presciently, Hutter appears to be absolutely right. His enwik8 and enwik9’s benchmark datasets are, today, best compressed by a 169M parameter LLM

Okay, that's not fair. There's a big advantage to having an external compressor and reference file whose bytes aren't counted, whether or not your compressor models knowledge.

More importantly, even with that advantage it only wins on the much smaller enwik8. It loses pretty badly on enwik9.

grohan 9 hours ago

Bellard has trained various models, so it may not be the specific 169M parameter LLM, but his Transformer-based `nncp` is indeed #1 on the "Large Text Compression Benchmark" [1], which correctly counts the total size of the compressed enwik9 plus the (zipped) decompressor.
There is no unfair advantage here. This was also achieved in the 2019-2021 period; it feels safe to say that Bellard could have likely pushed the frontier far further with modern compute/techniques.
[1] https://www.mattmahoney.net/dc/text.html
- Dylan16807 4 hours ago
  
  Okay, that's a much better claim. nncp has sizes of 15.5MB and 107MB including the decompressor. The one that's linked, ts_zip, has sizes of 13.8MB and 135MB excluding the decompressor. And it's from 2023-2024.
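
To put the sizes quoted above on one scale, here is a rough sketch that converts them to bits per input byte. It relies on enwik8 being 10^8 bytes and enwik9 being 10^9 bytes, and it assumes "MB" in the comment means 10^6 bytes; the quoted figures are rounded, so the results are approximate. The function and table names are made up for illustration.

```python
# Uncompressed sizes in bytes: enwik8 is the first 10**8 bytes of the
# Wikipedia dump, enwik9 the first 10**9 bytes.
ORIGINAL_BYTES = {"enwik8": 10**8, "enwik9": 10**9}

# Rounded compressed sizes in MB, copied from the comment above.
# Assumption: 1 MB = 10**6 bytes.
# nncp figures include the decompressor; ts_zip figures exclude it.
QUOTED_MB = {
    "nncp": {"enwik8": 15.5, "enwik9": 107.0},
    "ts_zip": {"enwik8": 13.8, "enwik9": 135.0},
}


def bits_per_byte(compressed_mb: float, original_bytes: int) -> float:
    """Compressed bits spent per byte of original input."""
    return compressed_mb * 1e6 * 8 / original_bytes


def summarize() -> dict:
    """Map (tool, dataset) to bits per byte, rounded to 3 places."""
    return {
        (tool, dataset): round(bits_per_byte(mb, ORIGINAL_BYTES[dataset]), 3)
        for tool, sizes in QUOTED_MB.items()
        for dataset, mb in sizes.items()
    }


if __name__ == "__main__":
    for (tool, dataset), bpb in summarize().items():
        print(f"{tool:7s} {dataset}: {bpb} bits/byte")
```

Because the ts_zip numbers leave out the decompressor (and the model it needs), the two rows are not directly comparable; the sketch only makes the raw ratios easier to read.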
vrighter an hour ago

Yep, this is like taking a file, saving a different, empty file whose name is the base64-encoded contents of the first, and claiming you compressed it down by 100%.
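
A minimal sketch of that trick, to make the accounting problem concrete. The helper names are hypothetical, it uses URL-safe base64 so the name contains no `/`, and it only works for inputs short enough to fit the filesystem's filename length limit.

```python
import base64
import os
import tempfile


def hide_in_filename(data: bytes, directory: str) -> str:
    """Create an empty file whose name is the base64 of `data`."""
    name = base64.urlsafe_b64encode(data).decode("ascii")
    path = os.path.join(directory, name)
    # The file body is empty; all the information lives in the name.
    open(path, "wb").close()
    return path


def recover(path: str) -> bytes:
    """Decode the original bytes back out of the filename."""
    return base64.urlsafe_b64decode(os.path.basename(path))


if __name__ == "__main__":
    payload = b"hello, enwik8"
    with tempfile.TemporaryDirectory() as tmp:
        path = hide_in_filename(payload, tmp)
        print("file size:", os.path.getsize(path))          # 0 bytes of content
        print("name length:", len(os.path.basename(path)))  # longer than the payload
        print("round trip ok:", recover(path) == payload)
```

The file reports zero bytes, but the name is longer than the original data. The bytes were moved to a place the measurement ignores, which is the same objection raised against leaving the decompressor out of the total.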

PaulHoule 13 hours ago

Love the quote:

  Every systems engineer at some point in their journey yearns to write a filesystem

It reminds me of a friend who, like me, had a TRS-80 Color Computer in the 1980s. He was a self-taught BASIC programmer who developed a very complex BBS system, and he was frustrated that the cluster size for the RS-DOS file system was half a track, so a lot of space was wasted when you stored small files. He called me up one day and told me he'd managed to store 180k of files on a 157k disk, and I had to break it to him that he was storing 150k (minus metadata) of files on a 157k disk, as opposed to the 125k or so he was getting before... With BASIC!
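
A rough sketch of the slack-space arithmetic behind that story. It assumes the standard RS-DOS layout as I remember it: 256-byte sectors, 18 sectors per track, so half a track is a 2304-byte allocation unit, with 68 such units available for data (which gives the ~157k figure). The file sizes in the example are invented.

```python
SECTOR_BYTES = 256
CLUSTER_BYTES = 9 * SECTOR_BYTES   # half of an 18-sector track = 2304 bytes
DATA_CLUSTERS = 68                 # assumed number of data clusters per disk


def allocated(size: int, cluster: int = CLUSTER_BYTES) -> int:
    """Bytes actually consumed on disk by a file of `size` bytes (size >= 1)."""
    clusters = -(-size // cluster)  # ceiling division
    return clusters * cluster


def slack(sizes: list, cluster: int = CLUSTER_BYTES) -> int:
    """Total bytes wasted by rounding each file up to whole clusters."""
    return sum(allocated(s, cluster) for s in sizes) - sum(sizes)


if __name__ == "__main__":
    capacity = DATA_CLUSTERS * CLUSTER_BYTES
    print("disk capacity:", capacity, "bytes")
    # Hypothetical BBS message files, each a few hundred bytes long.
    messages = [300, 450, 120, 900, 2400]
    print("payload:", sum(messages), "bytes")
    print("on disk:", sum(allocated(s) for s in messages), "bytes")
    print("wasted :", slack(messages), "bytes")
```

Packing many small files into one larger file with its own index removes most of that rounding loss, which is how the usable payload can rise toward the disk's capacity without ever exceeding it.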

N_Lens 12 hours ago

Sort of similar vibes as "The children yearn for the mines"

porphyra 10 hours ago

Reminds me of ts_zip by Fabrice Bellard: https://bellard.org/ts_zip/

N_Lens 12 hours ago

Interesting experiment, but the author lists some caveats (not exhaustive by any means):

"Of course, in the short term, there’s a whole host of caveats: you need an LLM, likely a GPU, all your data is in the context window (which we know scales poorly), and this only works on text data."

endofreach 13 hours ago

Interesting. I had an idea cooking some days ago, and implementing exactly this was the first step I was gonna work on this weekend. Funny how often this happens here on HN. Thank you for the inspiration & motivation. And: it was a joy to read.

ShoeMakerBox 11 hours ago

mgddbsbdbd ddfk,d ,