Running 1BRC was immensely fun, I learned a ton from it. Had you told me before how far the community would be able to push this, I'd have not believed you.
One main take-away for me was that you could improve performance by an order of magnitude over the baseline basically just by doing a good job and avoiding basic mistakes. The resulting code is still perfectly readable and maintainable. In most scenarios, this is where you should stop.
If you want to improve by another order of magnitude (like the leaders in the challenge did), the code becomes completely non-idiomatic, super-dense, and hard to maintain. You should go there only when it really, really matters, for instance when building a database kernel. Or, well, when trying to win a coding challenge ;)
Some more resources for those interested:
* Blog post with the results: https://www.morling.dev/blog/1brc-results-are-in/
* Show & Tell, featuring implementations in languages other than Java: https://github.com/gunnarmorling/1brc/discussions/categories...
* List of many more blog posts discussing 1BRC in different languages: https://github.com/gunnarmorling/1brc?tab=readme-ov-file#1br...
* 3h deep-dive into the implementation techniques by Thomas Würthinger and Roy van Rijn, two of the top participants of the challenge: https://www.youtube.com/watch?v=_w4-BqeeC0k
I’ve done a lot of performance work but never heard this expressed so clearly before. Thanks - I’m stealing that.
I’ve found exactly the same thing in my own work optimising text CRDTs. Just writing CRDT code in a straightforward, correct way, without unnecessary allocations (and ideally using some good data structures), will get you very, very far. But there’s another order of magnitude available to anyone interested in using exotic, task-specific data structures.
I suspect the same is true in just about every field of computer science. Write good code in Go / C# / Java / JavaScript / etc. and 99% of the time, performance will be just fine. Well, so long as you don’t do anything silly like pull in immutablejs. But there’s usually 10-100x more performance available if you really try.
If you want some examples, I highly recommend Algorithms for Modern Hardware, available for free online: https://en.algorithmica.org/hpc/
There are also lots of SIMD operations, for which Zig has great support in the form of vector programming.
The only thing novel I did was to line up the temperature data from the back instead of from the front for the SIMD operations. That is, for temperatures 12.3, 4.5, 67.8, 0.1, I line them up by the last digit (the one after the decimal point) before packing them into a SIMD register. That way I know their exact positions in the register.
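For context, the reason back-alignment gives fixed positions: every reading has exactly one digit after the decimal point, so relative to the end of the number the tenths digit, the '.', and the ones digit always sit in the same slots. A scalar Java sketch of that invariant (my illustration, not the commenter's Zig SIMD code; parseBackwards is a hypothetical helper):

    // end = index of the '\n' terminating a line like "Station;-12.3"
    static int parseBackwards(byte[] buf, int end) {
        // Fixed slots: buf[end-1] = tenths digit, buf[end-2] = '.', buf[end-3] = ones digit.
        int value = (buf[end - 3] - '0') * 10 + (buf[end - 1] - '0');
        int i = end - 4;                       // optional tens digit or '-'
        if (i >= 0 && buf[i] >= '0' && buf[i] <= '9') {
            value += (buf[i] - '0') * 100;
            i--;
        }
        if (i >= 0 && buf[i] == '-') value = -value;
        return value;                          // temperature in tenths of a degree
    }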
It seems like Swiss tables are tuned for workloads where most gets are for keys that are not present. Otherwise the two-level lookup is an extra cost with little benefit.
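To make the "two-level lookup" concrete, here is a toy Java sketch of the Swiss-table idea: an array of control bytes holding 7-bit fingerprints (H2) is probed first, and the full key is compared only on a fingerprint match. Heavily simplified and hypothetical: real Swiss tables probe 16 control bytes at a time with SIMD and resize themselves.

    final class ToySwissMap {
        private static final byte EMPTY = (byte) 0x80;
        private final byte[] ctrl;   // control bytes: 7-bit fingerprint (H2) or EMPTY
        private final long[] keys;
        private final long[] vals;
        private final int mask;

        ToySwissMap(int capacity) {  // capacity must be a power of two
            ctrl = new byte[capacity];
            java.util.Arrays.fill(ctrl, EMPTY);
            keys = new long[capacity];
            vals = new long[capacity];
            mask = capacity - 1;
        }

        long get(long key, long absent) {
            long h = mix(key);
            int i = (int) (h >>> 7) & mask;  // H1 picks the starting slot
            byte h2 = (byte) (h & 0x7F);     // H2 is the 7-bit fingerprint
            while (ctrl[i] != EMPTY) {       // assumes the table is never full
                // Level 1: cheap fingerprint compare rejects most non-matches.
                // Level 2: full key compare only on a fingerprint hit.
                if (ctrl[i] == h2 && keys[i] == key) return vals[i];
                i = (i + 1) & mask;
            }
            return absent;
        }

        void put(long key, long val) {
            long h = mix(key);
            int i = (int) (h >>> 7) & mask;
            byte h2 = (byte) (h & 0x7F);
            while (ctrl[i] != EMPTY && !(ctrl[i] == h2 && keys[i] == key))
                i = (i + 1) & mask;
            ctrl[i] = h2; keys[i] = key; vals[i] = val;
        }

        private static long mix(long x) {    // splitmix64 finalizer as a cheap hash
            x += 0x9E3779B97F4A7C15L;
            x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
            x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
            return x ^ (x >>> 31);
        }
    }

On a hit-heavy workload you pay both the fingerprint check and the key compare, which is the extra cost referred to above; on a miss-heavy workload the fingerprint check alone rejects almost every slot cheaply.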
IIRC Joad Nacer says he is using a hash table for this (again, essentially identical) task on highload.fun[1]. This was sort of surprising to the other competitors, because the top couple of solutions below that one use some variety of bloom filter instead.
0: https://easyperf.net/blog/2022/05/28/Performance-analysis-an...
The fastest version also used GraalVM, which would've helped with startup time.
(Edit: although it kinda then spoils it with the newsletter popup)
Maybe this is covered in the actual video, but one potential target for optimisation that I didn't see in the transcript was how the temperature values are converted to integers.
One suggestion is that, after deleting the period, if the numbers have at most 4 digits and 64-bit multiplication is available, the conversion to an integer can be done with a single such multiplication and a few shifts. Ensure each digit is in the low byte of a separate 16-bit word (that is, "space them out" with a zero byte in between each digit; this may require extra work, though there's often a SIMD "scatter"-type instruction that will accomplish it), subtract the ASCII value of the "0" digit from each 16-bit word's low byte (one instruction), then multiply by a constant which has 1000 in its lowest 16-bit word, 100 in the next-lowest, ..., 1 in the highest. The final answer will now be in the highest word, available with a 48-bit right shift.
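A minimal Java sketch of that multiply-and-shift conversion, assuming the four ASCII digits have already been spaced into 16-bit lanes with the units digit in the lowest lane (fourDigitsToInt is a hypothetical helper):

    // spaced: four ASCII digits, one per 16-bit lane, units digit in the lowest lane
    static int fourDigitsToInt(long spaced) {
        long v = spaced - 0x0030_0030_0030_0030L;              // subtract '0' from every lane at once
        long mul = 1000L | 100L << 16 | 10L << 32 | 1L << 48;  // 1000, 100, 10, 1 per lane
        return (int) ((v * mul) >>> 48);                       // the sum lands in the highest lane
    }

For example, "1234" spaced as 0x0031_0032_0033_0034 (units digit '4' in the low lane) comes out as 1234; the worst-case sum in any lane of the product is 9990, so no carries corrupt the top word.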
If multiplication is slow, another trick to gain some speed is to not actually convert from decimal until you need to, possibly right at the end. That is, just subtract the ASCII value of the "0" digit from each 16-bit word's low byte, and then treat the resulting 64-bit value as a very wasteful BCD representation in which each decimal digit occupies 16 bits. These "integers" can be safely added to each other up to 65535/9=7281 times before there is a risk of overflow; you only need to convert to regular binary then. This representation also honours min and max operations.
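In Java terms (hypothetical helpers, same lane layout as the sketch above):

    // Strip the ASCII '0's once; keep four decimal digits, one per 16-bit lane.
    static long toDigitLanes(long spaced) {
        return spaced - 0x0030_0030_0030_0030L;
    }

    // Plain 64-bit addition sums every lane independently; safe for up to
    // 65535 / 9 = 7281 additions before a lane could overflow into its neighbour.
    static long addLanes(long a, long b) {
        return a + b;
    }

    // The most significant digit sits in the highest lane, so ordinary integer
    // comparison orders single readings correctly: Math.min/Math.max work as-is.

    // Convert to a regular binary integer only when needed, e.g. at the very end.
    static long lanesToBinary(long v) {
        return ((v >>> 48) & 0xFFFF) * 1000
             + ((v >>> 32) & 0xFFFF) * 100
             + ((v >>> 16) & 0xFFFF) * 10
             +  (v         & 0xFFFF);
    }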
A corollary is 'postgres is really all you need' unless you have irrefutable proof it isn't.
Many businesses think "we have a lot of products". Nope: even with more than a million products you can load all of them into RAM, and even at 4 KB of memory allocated per product, a million products is only ~4 GB.
So a million products is a small business from a "load into RAM" perspective, easily handled on a single server.
Run a Wireshark capture while opening Slack; those ~30 ms round trips add up.
Especially if you need to have supplementary indexes to support common queries on that data.
New Go Billion Row Challenge w/ Great Optimizations | Prime Reacts
- https://www.youtube.com/watch?v=SZ1PDS7iRU8
- https://r2p.dev/b/2024-03-18-1brc-go/
Spoiler:
SwissMap: A smaller, faster Golang Hash Table
- https://www.dolthub.com/blog/2023-03-28-swiss-map/
- https://github.com/dolthub/swiss
Want to see Paul Allen's implementation in Rust, Zig, Mojo ... and even Bun.sh and Deno.com
https://hotforknowledge.com/2024/01/13/1brc-in-dotnet-among-...