507 points | by egnehots | 6 days ago
Random anecdote warning - In the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest neighbour search over a decent number of high-dimensional vectors.
I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.
Somewhere along the way I found this paper[1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.
As it turns out the 'random rotation' baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.
[1] https://ieeexplore.ieee.org/abstract/document/6296665 / https://slazebni.cs.illinois.edu/publications/ITQ.pdf
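For the curious, the trick is small enough to sketch in a few lines of numpy (sizes and data are made up; QR of a Gaussian matrix supplies the random rotation):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 100_000

# stand-ins for real descriptors/embeddings
xb = rng.standard_normal((n, d)).astype(np.float32)   # database
xq = rng.standard_normal(d).astype(np.float32)        # query

# random rotation: orthogonalize a Gaussian matrix via QR
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

def to_bits(x):
    # 1-bit quantization after rotation: keep only the sign of each coordinate
    return np.packbits((x @ R) > 0, axis=-1)

db_bits, q_bits = to_bits(xb), to_bits(xq)

# Hamming distance (XOR + popcount) gives a cheap candidate shortlist,
# which you then re-rank with exact distances
ham = np.unpackbits(db_bits ^ q_bits, axis=-1).sum(axis=1)
candidates = np.argsort(ham)[:100]
```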
Funny enough, if you visualize a vector-embedding's latent-space features using that "points on the surface of a hypersphere" analogy that ML programmers like to use — and you assume a really low quantization, say, 1-bit — then you can almost picture the hypersphere surface as a black-and-white vector image, the points as arbitrary-precision vector positions where you want to place dots... and your goal as quantizing those positions to reduce the storage costs down to storing a raster bitmap.
And that problem has a name: dithering!
Oddly enough, for what may or may not be coincidental reasons, what we want in ML terms (keeping the learned associational weights between features constant) is very similar to what we want from the output of image dithering: to not allow the dots to come together to create false features or false voids.
And how do we do that? In dithering, we usually apply a set of random perturbations to the vectorized points. Which, for image dithering, just look like translations in 2D space... but, in a higher-dimensional space, might very well best be analytically modelled as rotations about the origin!
Which I think is what is happening with SpinQuant as well - a smoothing of the frequency spectrum of the model weights, confirmed by the smearing of the singular values of the weight matrices.
What you are describing reminds me of low-discrepancy sequences: https://en.wikipedia.org/wiki/Low-discrepancy_sequence
Though these methods have their problems and blind spots too, and are often outdone by plain random sampling with an even slightly higher sample count, which preserves all the simplicity and (statistical) guarantees you get from randomness.
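For a feel of what a low-discrepancy sequence buys you, here is a tiny scipy sketch (the quarter-circle pi estimate is just a toy example of mine):

```python
import numpy as np
from scipy.stats import qmc

n = 1024  # Sobol' sequences like powers of two

def pi_estimate(pts):
    # fraction of points inside the unit quarter-circle, times 4
    return 4 * np.mean(np.sum(pts**2, axis=1) <= 1.0)

random_pts = np.random.default_rng(0).random((n, 2))
sobol_pts = qmc.Sobol(d=2, scramble=True, seed=0).random(n)

print("random:", pi_estimate(random_pts))  # typically off by a few hundredths
print("sobol: ", pi_estimate(sobol_pts))   # typically much closer to pi
```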
Isn't it the lengths/distances to neighbors that is the main information being stored in a vector db? Or is it just that what you're talking about only concerns the angles so the lengths are not part of the discussion?
I'm a dev but still have a lot to learn about ML :)
(Just kidding - but if you have any recommendations for learning resources to get started being able to understand what you're talking about, I'd greatly appreciate it.)
Example from electrical engineering: microprocessors will have a "clock" frequency, say, 16 MHz. But when you haul a wire up to VCC and pull it back down to ground, some amount of the power will be radiated away as radio waves. If your clock is at a constant rate, then you'll have a big spike of radiated noise at 16 MHz, and the FCC will be unhappy.
So modern devices cheat it by dithering around the central frequency. If you bounce from 15.9998 MHz to 16.001 to 15.998, the same amount of power is radiated, but smeared across a wider band of frequencies, enough to get you below the regulatory threshold. Spread spectrum clock generation. https://www.analog.com/en/resources/technical-articles/clock...
If you look in your PC's BIOS settings, spread spectrum is usually an option, and you can disable it if you want your computer to be slightly noisier.
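If you want to see the smearing numerically, here is a toy numpy sketch (all numbers are illustrative, not taken from any real part):

```python
import numpy as np

fs = 1e9                           # 1 GHz sample rate
t = np.arange(0, 1e-3, 1 / fs)     # 1 ms of signal
f0 = 16e6                          # 16 MHz "clock"

fixed = np.sin(2 * np.pi * f0 * t)

# dither the frequency +/-0.25% around 16 MHz at a 30 kHz modulation rate
wander = 1 + 0.0025 * np.sin(2 * np.pi * 30e3 * t)
dithered = np.sin(2 * np.pi * f0 * np.cumsum(wander) / fs)

for name, sig in [("fixed", fixed), ("dithered", dithered)]:
    peak = np.abs(np.fft.rfft(sig)).max()
    print(f"{name:8s} tallest spectral peak: {peak:.3e}")

# both signals carry the same total power, but the dithered clock's energy
# is smeared across many frequency bins, so its tallest peak is far lower
```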
Perhaps not entirely coincidentally, FAISS is also maintained by FB.
https://faiss.ai/cpp_api/struct/structfaiss_1_1OPQMatrix.htm...
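For reference, a minimal sketch of how OPQ is typically used through the faiss Python bindings (sizes are made up; OPQ learns a rotation before product quantization, in the same spirit as ITQ/SpinQuant):

```python
import faiss
import numpy as np

d, M = 128, 16                      # 128-dim vectors, 16 PQ sub-quantizers
xb = np.random.rand(10_000, d).astype("float32")
xq = np.random.rand(5, d).astype("float32")

opq = faiss.OPQMatrix(d, M)                            # learned rotation
index = faiss.IndexPreTransform(opq, faiss.IndexPQ(d, M, 8))
index.train(xb)          # trains both the rotation and the PQ codebooks
index.add(xb)
D, I = index.search(xq, 10)         # top-10 neighbours per query
```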
I'm no expert and I'm sure this has been tried by many people already - but would it be possible to reduce the computational effort instead by using an SVD: spreading the singular values, quantizing the SVD factors, and then reapplying the original singular values to recompose the matrix from the quantized factors?
[1] - https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...
The Johnson-Lindenstrauss lemma asserts that multiplying by a random matrix (some conditions apply, but iirc rotation matrices satisfy them) keeps, in many senses, the distances between points even if the dimension drops very significantly (some conditions apply, but they are usually satisfied by real-world data).
This is, in fact, the theoretical underpinning of compressed sensing.
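A quick numpy sanity check of the lemma (dimensions and counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 4096, 256            # 500 points, project 4096 dims down to 256

X = rng.standard_normal((n, d))
P = rng.standard_normal((d, k)) / np.sqrt(k)   # random projection, scaled
Y = X @ P

def pdist(A):
    sq = np.sum(A**2, axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * A @ A.T, 0))

iu = np.triu_indices(n, 1)
ratio = pdist(Y)[iu] / pdist(X)[iu]
print(ratio.min(), ratio.max())     # pairwise distances are roughly preserved
```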
tl;dr: round((2^R)*x) is not a great idea for an R-bit quantization.
Really appreciate that Meta published both results+model quants and didn't just make some bs claim about a new sota quant like most other bigger companies would've done.
That said, as others have pointed out, and as it's also written on the blog post, they are entirely different methods. QLoRA requires access to the full training data, while theoretically you can apply SpinQuant to any given model. For example, they also apply it to Mistral, not only to their LLaMA.
(QLoRA also takes some time and compute to apply, but since SpinQuant also involves learning some weights, I don't know whether it's actually faster/cheaper.)
Definitely nice to see them not cherry-pick results - the fact that it's not the best along all axes makes the numbers more believable.
> For example, they seem to not care about instructions to only write a response and no explanation
You need to use tools to force the model to adhere to a schema. Or you can learn to parse out the part of the response you want; both work.
You'll also need to make good use of robust examples in your initial prompt, and give lots of examples of how you want the output to look. (Yes this quickly burns up the limited context length!)
Finally, embrace the fact that these models are tuned for chat, so the more conversational you make the back and forth, the less you are stretching the model's abilities.
I wrote a very small blog post at https://meanderingthoughts.hashnode.dev/unlock-the-full-pote... explaining some of this.
tl;dr: you put into the prompt all the JSON up until what you want the LLM to say, you set the stop token to the end token of the current JSON item (so ',' or '}' or ']', whatever), and then your code fills out the rest of the JSON syntax up until another LLM-generated value is needed.
I hope that makes sense.
It is super cool, and I am pretty sure there is a way to make a generator that takes in an arbitrary JSON schema and builds a state machine to do the above.
It should be super fast on locally hosted models that use context caching.
Eh I should write this up as a blog post, hope someone else implements it, and if not, just do it myself.
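In the meantime, here is a rough sketch of the idea, assuming llama-cpp-python and a local GGUF file (the model path and field names are placeholders):

```python
import json
from llama_cpp import Llama

llm = Llama(model_path="llama-3.2-3b-instruct-q4.gguf")  # placeholder path

def fill_schema(context: str, fields: list[str]) -> dict:
    """We emit the JSON scaffolding ourselves; the model only generates values."""
    prompt = context + "\n{"
    obj = {}
    for i, name in enumerate(fields):
        prompt += f'\n  "{name}": '
        # stop at the delimiter that closes this item (naive: a comma inside a
        # string value would cut the value short)
        stop = "," if i < len(fields) - 1 else "}"
        out = llm(prompt, max_tokens=64, stop=[stop])
        value = out["choices"][0]["text"].strip()
        obj[name] = json.loads(value)   # each value must itself be valid JSON
        prompt += value + stop
    return obj

# e.g. fill_schema("Transcript: ...podcast text...", ["title", "ad_count"])
```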
I'm partial to Outlines lately, but they all have various upsides and downsides.
OpenAI even natively added support for this on their platform recently: https://openai.com/index/introducing-structured-outputs-in-t...
Outlines looks quite interesting but I wasn't able to get it to work reliably.
We haven’t built a state machine over JSON schema that uses this approach yet but it’s on the way.
Wow, that is a much more succinct way of describing it!
> We haven’t built a state machine over JSON schema that uses this approach yet but it’s on the way.
Really this should just be a simple library in JS and Python. Schema goes in, state machine pops out.
Complications will be around optional fields, I'm not sure offhand how to solve that!
It's still in early stages, but might be usable for something you're trying to build. Here's an example (this buffers the entire JSON object, but you can also gen as you go): https://docs.mixlayer.com/examples/json-output
For context, I was playing with a script to bulk download podcasts, transcribe with whisper, pass the transcription to llama.cpp to ID ads, then slice the ads out with ffmpeg. I started with the generic json_array example grammar, then iteratively tweaked it.
And Claude did everything perfectly ;)
I could recommend using the Ollama or vLLM inference servers. They support a `response_format="json"` parameter (by implementing grammars on top of the base model). It makes it reliable for production use, but in my experience the quality of the response decreases slightly when a grammar is applied.
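For example, with the Ollama Python client it looks roughly like this (model name and prompt are just placeholders):

```python
import ollama  # assumes a local Ollama server with the model already pulled

resp = ollama.chat(
    model="llama3.2:3b",
    messages=[
        {"role": "system",
         "content": "Reply only with JSON of the form {\"sentiment\": ..., \"confidence\": ...}"},
        {"role": "user",
         "content": "I love this podcast, but the ads are endless."},
    ],
    format="json",   # constrains decoding to valid JSON
)
print(resp["message"]["content"])
```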
Works as expected if you provide a few system prompts with context.
I was doing some local tidying up of recording transcripts, using a fairly long system prompt, and I saw the same behaviour you mention if the transcript I was passing in was too long -- batching it up to make sure to be under the max length prevented this.
Might not be what's happening in your case, but I mention it because it wasn't immediately obvious to me when I first saw the behaviour.
No, speculative decoding has exactly the same accuracy as the target model. It is mathematically identical to greedy decoding with the target model alone.
You will see that tokens not predicted by greedy sampling of the target model are rejected. Ergo, they are mathematically identical.
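A sketch of the greedy accept/reject loop may make that concrete; `target` and `draft` are stand-in callables that return the argmax next token for a given context:

```python
def speculative_greedy_step(target, draft, ctx, k=4):
    """One speculative-decoding step, greedy variant (illustrative sketch)."""
    # 1. the cheap draft model guesses k tokens ahead
    proposed, d_ctx = [], list(ctx)
    for _ in range(k):
        tok = draft(d_ctx)
        proposed.append(tok)
        d_ctx.append(tok)

    # 2. the target model verifies them (in practice, one batched forward pass)
    accepted, v_ctx = [], list(ctx)
    for tok in proposed:
        if target(v_ctx) != tok:   # any token the target wouldn't pick is rejected...
            break
        accepted.append(tok)
        v_ctx.append(tok)

    # 3. ...and replaced by the target's own choice, so the emitted stream is
    # exactly what greedy decoding with the target alone would produce
    accepted.append(target(v_ctx))
    return accepted
```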
>>> Remove the explanation parts and only leave yaml in place from above response.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 3
...

Alternatively this worked as well:

>>> Write a YAML file with kubernetes deployment object in it. Response should only contain the yaml file, no explanations.
...
```yml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
      - name: example-container
        image: nginx:latest
        ports:
        - containerPort: 80
```
https://huggingface.co/docs/hub/en/gguf#quantization-types
It might even mean a non-GGUF quantization scheme; I'm just an intermediate user of local models, not an expert user or developer.
So this is gonna be 8-bit weights, 8-bit activations, a group size of 256, symmetric quantization. Not sure how to map this to the GGUF variants because they don't mention how they handle activation quantization.
So, for example, AWQ and GPTQ can be accelerated by using a fast int4 kernel called tinygemm.
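A minimal sketch of what that looks like through torchao, assuming a recent torchao release and a suitable CUDA GPU (the toy model is just a placeholder for your checkpoint):

```python
import torch
from torchao.quantization import quantize_, int4_weight_only

# toy stand-in; in practice this would be your model loaded in bfloat16 on CUDA
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096),
    torch.nn.ReLU(),
    torch.nn.Linear(4096, 4096),
).to(torch.bfloat16).cuda()

# int4 weight-only quantization; the packed weights are served by the
# tinygemm int4 kernel mentioned above
quantize_(model, int4_weight_only(group_size=128))

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
print(model(x).shape)
```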
In vanilla Pytorch I have the following expression:
t.sum(values[inds] * weights)
If 'inds' is int8, I get "IndexError: tensors used as indices must be long, int, byte or bool tensors". Is this still true if I use torchao?
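That error comes from PyTorch's core advanced-indexing rule (index tensors must be long, int, byte or bool), so as far as I know the usual fix, regardless of the quantization library, is simply to cast the indices:

```python
import torch

values = torch.randn(100)
weights = torch.randn(16)
inds = torch.randint(0, 100, (16,), dtype=torch.int8)

# values[inds] raises IndexError; casting the index tensor works:
out = torch.sum(values[inds.long()] * weights)
```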
Maybe someone can recommend a way to deploy Llama to Android without Termux, maybe even something that can be potentially fully implemented inside an app?
I'm currently looking into compiling llama.cpp for Android and bundling it inside an app. Is that a viable path? Would love to hear from someone who tried something similar.
Most weights are released as fp16/bf16, so 2 bytes per weight. So just double the number of parameters to get the number of gigabytes of VRAM: Llama 3.1 8B ~= 16GB of weights in fp16. At 4-bit quantization it's half a byte per weight, so Llama 3.1 8B ~= 4GB of weights.
But this is just weights. The real issue is context and output length: how much data are you feeding in? This is where VRAM can explode, and it's entirely use-case dependent. So for a 128k context model, the range of VRAM usage is huge.
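As a back-of-the-envelope example of how the context term blows up, using Llama 3.1 8B's published config (32 layers, 8 KV heads via GQA, head dim 128, fp16 KV cache):

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes
layers, kv_heads, head_dim, fp16_bytes = 32, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes
print(per_token / 1024, "KiB per token")                   # 128 KiB
print(per_token * 131072 / 2**30, "GiB at full 128k ctx")  # ~16 GiB on top of the weights
```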
The reality is, if you're not able to quickly estimate the above, you're probably not running local models anyway.
I do think it would be good to include some info on "what we expect to be common deployment scenarios, and here's some sample VRAM values".
Tangentially, whenever these models get released with fine-tuning scripts (FFT and LoRA), I've yet to find a model that provides accurate information on the actual amount of VRAM required to train it. Oftentimes it's always 8x80GB for FFT, even for a 7B model, but you can tweak the batch sizes and DeepSpeed config to drop that down to 4x80GB, then with some tricks (8-bit Adam, activation checkpointing), drop it down to 2x80GB.
Some pretty charts here https://github.com/pytorch/ao/issues/539
It’s pretty adept at most natural language tasks (“summarize this”) and performance on iPhone is usable. It’s even decent at tool use once you get the chat template right.
But it struggles with JSON and HTML syntax (correctly escaping characters), and isn’t great at planning, which makes it a bad fit for most agentic uses.
My plan was to let llama communicate with more advanced AI’s, using natural language to offload tool use to them, but very quickly llama goes rogue and starts doing things you didn’t ask it to, like trying to delete data.
Still - the progress Meta has made here is incredible and it seems we’ll have capable on-device agents in the next generation or two.
You should customise your sampler to enforce a JSON grammar after the ```json tokens.
Take for example: "A dog says \"Woof!\""
With a grammar, you’ll end up with "A dog says " when the model forgets to escape the inner quotes.
Which is valid JSON, but not what the model intended.
So it’s usually better to catch the exception and ask the model to try again.
Unless you’ve come across a sampler with backtracking? That would be cool
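Short of a backtracking sampler, the catch-and-retry loop is simple enough to sketch (`llm_call` here is any prompt-to-text function you already have):

```python
import json

def generate_json(llm_call, prompt, max_retries=3):
    """Ask again on invalid JSON instead of letting a grammar silently 'repair' it."""
    for _ in range(max_retries):
        text = llm_call(prompt)
        try:
            return json.loads(text)
        except json.JSONDecodeError as e:
            prompt += (
                f"\nYour previous output was not valid JSON ({e}). "
                "Please reply with only the corrected JSON."
            )
    raise ValueError("model never produced valid JSON")
```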
That and average inference times on common hardware are what I'm curious about.
> Decode latency improved by 2.5x and prefill latency improved by 4.2x on average, while model size decreased by 56% and memory usage reduced by 41% on average. The benchmarks can be reproducible today via ExecuTorch Llama instructions. The table above shows results using an Android OnePlus 12 device—however, we’ve also verified similar relative performance on Samsung S24+ for 1B and 3B and Samsung S22 for 1B.
For me it is nothing short of a bad experience: it is way over-engineered, poor quality, and just plain does not work, and the maintainers are questionable. I would rather call the HuggingFace Python code for inference, or anything else.
Is ExecuTorch any better?
ExecuTorch is a runtime for mobile and embedded devices that runs PyTorch models directly. Currently it runs pretty fast on CPU, but we're expanding to mobile accelerators and GPUs.
We're still in our early stages (just turned beta status). But try it out and let us know.
Regarding Llama Stack, it is built by my colleagues. What concrete issues have you experienced? If you have error/bug reports, I'll be happy to pass them along.
With Llama Stack, well, making it work with CUDA for starters would be great.
It is also bloated: something that is supposed to take a straightforward 100 lines of code and a couple of files takes dozens of files, multiple frameworks, generators... which in the end do not work at all, and nobody knows why. Very obscure framework. Can't believe this code is coming from Meta.
> At Connect 2024 last month, we open sourced Llama 3.2 1B and 3B
No you did not. There is no source (in this case: training data) included. Stop changing the meaning of "open source", Meta!
At larger batch sizes you become compute-bound, so quantization matters less and you have to rely on hardware support to accelerate smaller dtypes like fp8.
e.g. instead of tokens ['i', 'am', 'beautiful'] having tokens ['I am', 'beautiful'] on the premise that 'I am' is a common set of bytes for a semantic token that identifies a 'property of self'?
Or taking that further and having much larger tokens based on statistical analysis of common phrases of ~5 words or such?
I wouldn't be surprised to see it add the new ones shortly, it's quite actively maintained.
This was just recently open sourced and is pretty nice. Only issue I've had is very minor UI stuff (on Android, sounds like it runs better on iOS from skimming comments)
Computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku
https://news.ycombinator.com/item?id=41914989
1421 points, 717 comments
No one serious is going to build on some horror of a Python interpreter running inside your app to run an LLM when llama.cpp is right there, with more quants available. In practice, on mobile, you run out of RAM headroom way more quickly than CPU headroom. You've been able to run llama.cpp 3B models for almost a year now on iOS, whereas here, they're just starting to be able to. (Allocating 6 GB is a quick way to get autokill'd on iOS... 2.5 GB? Doable.)
It looks like SpinQuant is effectively Q8; in widespread blind testing over months, we found, empirically, that Q5 is assuredly indistinguishable from the base model.
(edit: just saw your comment. oy. best of luck! generally, I don't bother with these sorts of 'lived experience' details, because no one wants to hear they don't get it, and most LLM comments on HN are from ppl who don't have the same luck as to work on it fulltime. so you're either stuck aggressively asserting you're right in practice and they don't know what you're talking about, or, you're stuck being talked down to about things you've seen, even if they don't match a first-pass based on theory) https://news.ycombinator.com/item?id=41939841
I’m focused on making models play nice with each other rather than building a feature that relies on it. That’s where I see the more relevant work being. Which is why news like this is exciting!
"AI will destroy the world"? "AI is great and will save humanity"? If you're seriously missing that, there's really enough platforms (and articles for more fundamental announcements/propositions on this one) where you can have these.