- sort the words in the document
- eliminate unique words (they sort together)
- merge the sorted words with the sorted dictionary and keep only the missing words
I saw this in BASIC in Creative Computing and got it working in on my TRS-80 Color Computer which had much less than 32k of available RAM, so that was the first thing I thought when I saw the headline.Now this blew people away when it came out
https://winworldpc.com/product/turbo-lightning/1x
it had a compressed dictionary that would fit together with the other programs you were running on a PC and spell check as you typed; there was a 640k limit for the PC but it could only use a fraction of that so as not to interfere and in the early days of the PC you couldn't actually afford to fill it out.
I guess you could save the working copy to disk first, then do the sort, then compare, then reload the working copy. I think the C=64 developers probably avoided that strategy because the disk interface was so damn slow.
I had a lot of data, 640KB RAM, 64KB of heap, and 64KB of stack. I had hundreds of megabytes that I had to search extract data from and then combine some of them.
I experimented with data index structured into ternary trees. Conceptually it made sense, but implementation-wise the relationships and paths were still too big to keep in 64KB.
Instead of compression, I did swapping. I wrote a TSR (think service), a piece of code that would process a chunk of the data, extract the results, store it n the stack, dump the original data, make an interrupt call to the TSR, which in turn destroy the heap, and read in the next chunk from storage, return control to the program, process, combine with stack data, and continue until finished the entire process.
Originally this process took about a week for three data entry persons (think about a dozen 3" ring binders filled with tables), and an specialist combining the information. The program completed the work in just a few hours. It was amazingly "fast".
This was on a single threaded system.
[0] https://en.wikipedia.org/wiki/Terminate-and-stay-resident_pr...
The main memory bandwidth and the disk bandwidth were about the same, a little of 1MB/s.
I would have done this in multiple passes (but still used the Bloom Filters, those are cool).
https://github.com/arnoldrobbins/v10spell
https://code.google.com/archive/p/unix-spell/
The original paper is great https://www.semanticscholar.org/paper/Development-of-a-Spell...
It is hosted on his web page https://www.cs.dartmouth.edu/~doug/
https://en.wikipedia.org/wiki/Douglas_McIlroy
If you are a word nerd, you will have found obovate and there, this chart.
https://upload.wikimedia.org/wikipedia/commons/e/e8/Leaf_mor...
Related, a contest about compressing the Wordle dictionary: http://golf.horse/wordle/
And it is a completely different thing. In general, it is more about procedural generation and tricks then good packing. Runtime packers are used, like crinkler and kkrunchy, but actually they use a lot of RAM, like hundreds of MB, which is a bit surprising considering that the decompressed executable is in the tens of kB. But that's because they use very powerful but slow compression algorithms.
Sizecoding usually doesn't care about RAM, unless the platform requires it, the only think that matters is the size of the executable file and its data. For that 39kB Playdate game, I guess that's the same idea. The Playdate has 16MB of RAM, I bet the game took full advantage of it.
makewords sentence | lowercase | sort | unique | mismatch