It appears to be this article, though under a different title: https://www.sciencedirect.com/science/article/pii/S095579972...
I wonder if this is also a CUDA-bypassing PTX optimization, like the one that led to DeepSeek's 10x performance gain: https://xyzlabs.substack.com/p/deepseeks-latest-shocker-who-...