SVDQuant+NVFP4: 4× Smaller, 3× Faster FLUX with 16-bit Quality on Blackwell GPUs

(hanlab.mit.edu)

49 points | by lmxyy 16 hours ago

7 comments

lmxyy 16 hours ago
SVDQuant supports NVFP4 on NVIDIA Blackwell GPUs with 3× speedup over BF16 and better image quality than INT4. Try our interactive demo below or at https://svdquant.mit.edu/! Our code is all available at https://github.com/mit-han-lab/nunchaku!
semi-extrinsic 8 hours ago
I assume they've messed up the prompt caption for the squirrel-looking creature?
Interesting to see how poor the prompt adherence is in these examples. The cyanobacteria one is just "an image of the ocean". The skincare one completely ignores 50% of the ingredients in the prompt, and makes coffee beans the size and shape of almonds.
yorwba 12 hours ago
I thought I'd already seen this in the previous discussion 3 months ago https://news.ycombinator.com/item?id=42093112 but that one used INT4 quantization, so NVFP4 is a further improvement on that. Sweet!
If I found the correct docs (https://docs.nvidia.com/deeplearning/cudnn/frontend/latest/o...), NVFP4 means that each group of 16 4-bit floating-point values (1 sign bit, 2 exponent bits, 1 mantissa bit) shares one 8-bit floating-point scaling factor (1 sign bit, 4 exponent bits, 3 mantissa bits), so strictly speaking it's 4.5 bits per value.
This grouped scaling immediately makes me wonder whether the quantization error could be reduced even more by permuting the matrix so values of similar magnitude are quantized together.
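The arithmetic and the grouping intuition above can be sketched in a few lines. This is only an illustration, not the Nunchaku or cuDNN implementation: the function names are hypothetical, the exponent bias of 1 and the absence of inf/NaN encodings are assumptions, and the shared scale is kept as an ordinary Python float rather than being rounded to an 8-bit float.

```python
# Sketch of NVFP4-style group quantization as described in the comment above.
# Assumptions (not from the thread): exponent bias of 1, no inf/NaN encodings,
# and the shared scale stored as a plain float instead of an 8-bit float.

def effective_bits(elem_bits=4, scale_bits=8, group_size=16):
    """Storage cost per value once the shared scale is amortized over the group."""
    return elem_bits + scale_bits / group_size


def e2m1_magnitudes():
    """Decode every magnitude a 2-bit exponent / 1-bit mantissa pattern can encode."""
    mags = []
    for exp in range(4):
        for man in range(2):
            if exp == 0:
                mags.append(man * 0.5)  # subnormal range: 0 or 0.5
            else:
                mags.append((1 + man * 0.5) * 2 ** (exp - 1))
    return mags


def quantize_group(values):
    """Quantize one group with a shared scale and return the dequantized values."""
    grid = e2m1_magnitudes()
    peak = max(abs(v) for v in values)
    # Map the largest magnitude in the group onto the largest grid point.
    scale = peak / max(grid) if peak > 0 else 1.0
    out = []
    for v in values:
        mag = min(grid, key=lambda g: abs(abs(v) / scale - g))
        out.append((mag if v >= 0 else -mag) * scale)
    return out


def squared_error(values, group_size=16):
    """Total squared quantization error when values are grouped in order."""
    err = 0.0
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        err += sum((v - q) ** 2 for v, q in zip(group, quantize_group(group)))
    return err


if __name__ == "__main__":
    print(effective_bits())   # 4 + 8/16 bits per value
    print(e2m1_magnitudes())  # representable magnitudes before scaling
```

With interleaved large and small values, for example `[100.0, 1.0] * 16`, the small entries share a scale with the large ones and round to zero. Sorting by magnitude before grouping, `sorted(values, key=abs)`, puts them in their own group, and `squared_error` drops accordingly.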
- lmxyy 11 hours ago
  I think so. There are already rotation-based techniques that have similar effects, but they incur additional overhead in diffusion models.
  yorwba 10 hours ago
  Permuting entire columns at once should have zero overhead as long as you permute the rows of the next matrix to match. But as each entry of a column participates in a different scaling group, I guess swapping two columns will reduce quantization error for some while increasing it for others, making it unlikely to get a significant overall improvement in this way.
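The zero-overhead part of this argument can be checked with a small sketch. It assumes a row-vector convention (`x @ W1 @ W2`) and uses ReLU as a stand-in for whatever elementwise operation sits between the two matrices. All names are hypothetical, and the sketch says nothing about whether any particular permutation actually reduces quantization error.

```python
# Sketch: permuting the columns of W1 and the rows of W2 with the same
# permutation leaves the two-layer output unchanged, so the reordering can be
# done offline. ReLU is an assumed placeholder for any elementwise operation.

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def relu(m):
    """Elementwise ReLU; it commutes with any reordering of the hidden units."""
    return [[max(0, v) for v in row] for row in m]


def permute_columns(w, perm):
    """Reorder the output features (columns) of the first matrix."""
    return [[row[j] for j in perm] for row in w]


def permute_rows(w, perm):
    """Reorder the input features (rows) of the next matrix to match."""
    return [w[i] for i in perm]


def two_layer(x, w1, w2):
    """Compute relu(x @ w1) @ w2."""
    return matmul(relu(matmul(x, w1)), w2)


if __name__ == "__main__":
    x = [[1, -2, 3]]
    w1 = [[1, 0, -1, 2], [3, 1, 0, -2], [0, 2, 1, 1]]
    w2 = [[1, 2], [0, -1], [3, 1], [-2, 0]]
    perm = [2, 0, 3, 1]
    original = two_layer(x, w1, w2)
    permuted = two_layer(x, permute_columns(w1, perm), permute_rows(w2, perm))
    print(original == permuted)
```

Integer inputs are used so the comparison is exact. With floating-point weights the two results may differ by rounding, because the summation order changes.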
lmxyy 15 hours ago
FLUX-schnell takes only 800 ms on an RTX 5090.
beebaween 15 hours ago
This is amazing
curtisszmania 13 hours ago
[dead]
42lux 9 hours ago
Now release the LoRA conversion code you promised months ago…