65 points | by mxmlnkn | 15 hours ago
ratarmount ./takeout-20231130T224325Z-0*.tgz ./mnt
For gzip, it is as you say. However, if you only want to seek to DEFLATE block boundaries, the "state" of the decompressor is as simple as the last 32 KiB of decompressed data before that boundary. Compared to the two offsets for bzip2, this is 2048x more data to store, though (32 KiB versus 16 bytes, assuming two 8-byte offsets). Rapidgzip does a sparsity analysis to find out which of the decompressed bytes are actually referenced later on, and also recompresses those windows to reduce overhead. Ratarmount still uses the full 32 KiB windows, though. This is one of the larger todos, i.e., to define a compressed index format in the first place and then use it instead. This will definitely be necessary for LZ4, for which the window size is 64 KiB instead of 32 KiB.
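To illustrate the "state is just the last 32 KiB" point, here is a minimal sketch using only Python's zlib module. It is not how rapidgzip or ratarmount implement it. The helper names are made up for this example, and it forces a byte-aligned block boundary with Z_SYNC_FLUSH, whereas block boundaries in real gzip files usually start at arbitrary bit offsets.

```python
import random
import zlib

WINDOW_SIZE = 32 * 1024  # DEFLATE back-references reach at most 32 KiB back


def compress_with_seek_point(first, second):
    """Return a raw DEFLATE stream split in two at a block boundary."""
    # wbits=-15 selects raw DEFLATE (no gzip/zlib header) with a 32 KiB window.
    compressor = zlib.compressobj(9, zlib.DEFLATED, -15)
    # Z_SYNC_FLUSH ends the current block on a byte boundary but keeps the
    # window, so the data after it may still reference the data before it.
    head = compressor.compress(first) + compressor.flush(zlib.Z_SYNC_FLUSH)
    tail = compressor.compress(second) + compressor.flush()
    return head, tail


def resume_at_seek_point(tail, window):
    """Decompress from the boundary, given the last 32 KiB before it."""
    decompressor = zlib.decompressobj(wbits=-15, zdict=window)
    return decompressor.decompress(tail)


# Incompressible first part; the second part repeats its end, so the
# compressor emits back-references that cross the seek point.
rng = random.Random(0)
first = bytes(rng.randrange(256) for _ in range(40_000))
second = first[-1_000:] * 3

head, tail = compress_with_seek_point(first, second)
restored = resume_at_seek_point(tail, first[-WINDOW_SIZE:])
print(restored == second)
```

The index entry for such a seek point would be the compressed offset (here len(head)) plus the window, which is where the 32 KiB per seek point comes from.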
For zstd and xz, this approach reaches its limits because the Lempel-Ziv back-reference windows are not limited in size in general. However, I am hoping that the sparsity analysis will make it feasible because, in the worst case, the state cannot be larger than the next decompressed chunk. In this worst case, the decompressed chunk consists only of non-overlapping back-references.
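A toy sketch of why the sparse state is bounded by the chunk size. The token format ("lit", bytes) / ("ref", distance, length) and the function name are hypothetical and only serve this illustration; they are not rapidgzip's actual data structures. Offsets are relative to the chunk start, so negative values point into the window.

```python
def referenced_window_offsets(tokens):
    """Return (sorted window offsets that are read, decompressed chunk size)."""
    used = set()
    position = 0  # number of bytes produced so far in this chunk
    for token in tokens:
        if token[0] == "lit":
            position += len(token[1])
            continue
        _, distance, length = token
        for i in range(length):
            source = position + i - distance
            # Only sources before the chunk start come from the window.
            # Sources inside the chunk were produced by earlier tokens.
            if source < 0:
                used.add(source)
        position += length
    return sorted(used), position


# Mixed chunk: two literals, one reference into the window, and one
# overlapping reference that only reads bytes produced inside the chunk.
mixed = [("lit", b"ab"), ("ref", 5, 3), ("ref", 2, 4)]
print(referenced_window_offsets(mixed))

# Worst case: only references into the window, none overlapping.
worst = [("ref", 100, 4), ("ref", 50, 4)]
offsets, chunk_size = referenced_window_offsets(worst)
print(len(offsets), chunk_size)
```

Each output byte reads at most one window byte, so the number of window bytes that have to be stored can never exceed the decompressed chunk size, no matter how far back the references reach.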
A small note: archivemount has a living fork at https://git.sr.ht/~nabijaczleweli/archivemount-ng