55 points | by aoli-al2 天前
Fray[1] is a concurrency testing tool for Java that can help you find and debug tricky race conditions that manifest as assertion violations, run-time exceptions, or deadlocks. I’d love to hear your thoughts—feel free to ask me anything! And if you’re curious, give Fray a try.
We appear to use exactly the same basic technique, using byte weaving to intercept concurrency primitives such as synchronized, LockSupport etc to pause the system thread and run them on some schedule.
We only currently run (deterministic) probabilistic traces though, we can’t search the interleaving space. But the traces for a whole cluster are extremely complex and probably unsearchable.
I have been meaning to publish it for broader consumption for years now, but there’s always something more important to do. It’s great to see some dedicated efforts in this space.
[1] https://github.com/apache/cassandra/tree/trunk/test/simulato...
It seems that all controlled threads are wrapped with `InterceptibleThread` in the Cassandra simulator. Does this work for ThreadPools (e.g., ForkJoinPool) as well? We had a hard time intercepting thread objects because they are used by the language runtime (e.g., GC threads) as well and we don’t want to interfere with them. Additionally, modifying application code just track thread creation isn’t ideal. To work around this, we came up with this combination of JVMTi and Java Agent solution and we use JVMTi to monitor thread creation and termination.
As for searching schedules, yes, it is hard to search all possible schedules. However, it turns out many searching algorithms such as probabilistic concurrency testing[1] or partial order sampling[2] are still better than random walk. So it is worth to give them a try.
[1] https://www.microsoft.com/en-us/research/wp-content/uploads/... [2] https://www.cs.columbia.edu/~junfeng/papers/pos-cav18.pdf
> Additionally, modifying application code just track thread creation isn’t ideal.
This would certainly be necessary, but don't you anyway need to rewrite the application to trap synchronised, volatile, atomic accesses etc? It doesn't seem all that different to rewrite calls to Thread::start. The issue of JVM threads is perhaps a little trickier, but I am not averse to some ugly integrations. Just take a look at how we make RNGs deterministic
> So it is worth to give them a try.
Thanks for the tips! I am not sure when I will have time to apply these techniques to our simulator, but they are no doubt valuable for the protocol simulations I am relying on today, so maybe I will have a justification to explore them sometime soon.
Really cool work too. I hope it manages to make its way into more hands, so that this technique can be used more widely.
[1]: https://arxiv.org/pdf/2501.12618
[2]: https://robert.ocallahan.org/2016/02/introducing-rr-chaos-mo...
To be fair, the gap is because writing the tests is the hard part. Tests for deterministic testing frameworks can be more complicated because you have to simulate more complex situations with more components interacting (otherwise the simpler targeted tests would have caught your bug). So it works well in terms of making your existing integration tests gain extra value, but the complexity of writing and maintaining those integration tests is the actual challenge.
Don’t get me wrong. I love deterministic simulation testing and along with property tests and mutation testing it’s best in class techniques for having confidence in the efficacy of your tests. Just that the challenges are on the less sexy side of writing the tests whereas academia focuses on the sexy frameworks piece.
In fact, when we evaluate Fray, we collect all existing concurrency tests from Lucene, Kafka, and Guava, and running them under different thread inter-leavings can already reveal so many bugs. [1]
[1]: https://github.com/cmu-pasta/fray/blob/main/docs/bugs.md
Depressing that apple mail had better search 25 years ago than today.
I remember working with lucene around 2000/2001 and how good the results were.