113 points | by ngrislain | 3 days ago
That said, it's concerning to see the reported probability for getting a 4 on a die roll come out at 65%.
Hopefully OpenAI's models aren't actually that biased when generating die rolls, so does that number really tell us anything about the accuracy of the probability assessments?
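A minimal sketch of how one could sanity-check that 65% figure, assuming the openai Python SDK; the model name and prompt are placeholders, not the article's exact setup. Sample the roll many times and compare the empirical frequency of "4" with the reported probability:

```
# Sketch: does the reported ~65% for "4" match how often the model
# actually produces a 4 when sampled? Placeholder model and prompt.
from collections import Counter
from openai import OpenAI

client = OpenAI()
counts = Counter()
for _ in range(100):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user",
                   "content": "Roll a fair six-sided die. Reply with just the number."}],
        max_tokens=1,
        temperature=1.0,
    )
    counts[resp.choices[0].message.content.strip()] += 1

print(counts)  # if "4" dominates, the bias is in the model, not the dice
```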
This is a problem when people naively use "give an answer on a scale of 1-10" in their prompts. LLMs are biased towards particular numbers (like humans!) and cannot linearly map an answer to a scale.
It's extremely concerning when teams do this in a context like medicine. Asking an LLM "how severe is this condition" on a numeric scale is fraudulent and dangerous.
The results of an LLM are an arbitrary approximation of what a human would expect to see as the results of a query. In other words, they correlate very well with human expectations and are very good at fooling you into believing them. But can the model provide you with results that you disagree with?
And more importantly, can you trust these results scientifically?
But the real question is not whether you agree with the results, it's whether they're useful. If you apply an objective method to data it is unsuitable for, it's garbage in, objective garbage out. Whether a method is suitable is not always something you can decide a priori; sometimes you have to try it and check.
And if trying it out shows that LLM-provided clusters are more useful than other methods, you should swallow your pride and accept that, even if you disagree on philosophical grounds. (Or it might show that the LLM has no idea what it's doing! Then you can feel good about yourself.)
Finding that an LLM is biased toward inventing die rolls that are the median value rounded to an actual face by the most common rounding method (3.5 rounds up to 4) is...not particularly surprising. If you want a fair RNG, use an RNG designed to be fair, not an LLM, where fairness would be, at best, an emergent accidental property.
AFAICT, the LLMs aren't creating new mental mappings of "dice are symmetric and should give equal probability of landing on any side" and then using that to infer that they should use an RNG.
Think about this: suppose you’re reading a scientific paper and the author writes “I did a study with 52 participants, and here are the answers”. Would there be any reason to believe that data is real?
I'm not sure I follow your hypothetical. The author making the claim in a public paper can be contacted for the data; it can be verified. Auditing the internals of an LLM, especially a closed one, is not the same.
https://news.ycombinator.com/item?id=42684629
> the logits aren't telling you anything like 'what is the probability in a random sample of Internet text of the next token', but are closer to a Bellman value function, expressing the model's belief as to what would be the net reward from picking each possible BPE as an 'action' and then continuing to pick the optimal BPE after that (ie. following its policy until the episode terminates). Because there is usually 1 best action, it tries to put the largest value on that action, and assign very small values to the rest (no matter how plausible each of them might be if you were looking at random Internet text)
Any interest in seeing this sort of thing added to llama.cpp?
It feels like this would be useful enough to build around -- I especially like the idea of asking the API to return the top K results for each field along with their likelihood -- almost like a dropdown box with a percentage attached to each possible result.
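For illustration, here's a rough sketch of what a per-field "dropdown with percentages" could look like today for a one-token field, built on the OpenAI chat completions logprobs/top_logprobs options. The model name and prompt are placeholders, and this isn't the linked library's API:

```
# Sketch: top-K candidates with probabilities for a single-token field,
# read from top_logprobs. Placeholder model and prompt.
import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": "Answer with exactly one word: red, blue, or green. "
                          "What color is a ripe tomato?"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)
first_token = resp.choices[0].logprobs.content[0]
candidates = {alt.token: round(math.exp(alt.logprob), 3)
              for alt in first_token.top_logprobs}
print(candidates)  # e.g. {"red": 0.97, "Red": 0.02, ...} -- illustrative only
```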
Any chance we can get Pydantic support?
If you run "bananas,fishbowl,phonebook," and get {"sponge": 0.76},
it doesn't mean that "sponge" was 76% likely to be the correct answer. It just means "sponge" was the most likely next word for the model to generate.
The library is compatible with that but does not use Pydantic further than that.
E.g., if the input model is:

```
from typing import Dict, Literal

from pydantic import BaseModel


class Classification(BaseModel):
    color: Literal['red', 'blue', 'green']
```

then the output type would be:

```
class ClassificationWithLogProbs(BaseModel):
    color: Dict[Literal['red', 'blue', 'green'], float]
```

Don't take this too literally; I'm not convinced that this is the right way to do it. But it would provide structure and scores without dealing with a mess of complex JSON.
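As a follow-on sketch (my own illustration, not the library's API), here's one way to fill that ClassificationWithLogProbs shape from a plain dict of candidate tokens and probabilities, e.g. one built from top_logprobs:

```
# Sketch: turn token-level candidates into the proposed typed output.
# Merges case/whitespace variants and renormalizes over allowed labels.
from typing import Dict, Literal
from pydantic import BaseModel

Color = Literal['red', 'blue', 'green']

class ClassificationWithLogProbs(BaseModel):
    color: Dict[Color, float]

def with_scores(candidates: Dict[str, float]) -> ClassificationWithLogProbs:
    allowed = ('red', 'blue', 'green')
    kept: Dict[str, float] = {}
    for token, p in candidates.items():
        label = token.strip().lower()
        if label in allowed:
            kept[label] = kept.get(label, 0.0) + p
    total = sum(kept.values()) or 1.0
    return ClassificationWithLogProbs(
        color={k: v / total for k, v in kept.items()})

print(with_scores({"red": 0.76, " Red": 0.1, "sponge": 0.05}))
# -> color={'red': 1.0} (the non-label candidate is dropped)
```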
One question I always had: what about the descriptions you can attach to the class and its attributes (Field(description=...) in Pydantic)? Is the model made aware of those descriptions?
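At least on the Pydantic side, those descriptions do land in the JSON schema, so whether the model "sees" them comes down to whether the library/API sends that schema along with the request (which, as I understand it, OpenAI structured outputs does). A small sketch to check what's in the schema:

```
# Sketch: Pydantic class/field descriptions end up in the JSON schema,
# which is what structured-output requests typically send along.
from typing import Literal
from pydantic import BaseModel, Field

class Classification(BaseModel):
    """Classify the dominant color of the item."""
    color: Literal['red', 'blue', 'green'] = Field(
        description="Pick the closest primary color.")

print(Classification.model_json_schema())
# Both the docstring and the Field description appear in the schema,
# so the model can see them if the schema is included in the request.
```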
Also, if you're "studying LLM-based chess" and you don't use dynamic grammars to enforce that models can only make "valid" moves at each time step, your research is basically invalid.
And don't meme me with claims that structured/constrained generation harms creativity. The devs of outlines debunked that FUD already: https://blog.dottxt.co/say-what-you-mean.html
Similarly, if you think that RLHF/DPO or LoRA or any of that harms creativity, you're really outing yourself as not having played with high-temperature sampling.
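For what it's worth, here is a minimal sketch of the idea (not outlines' or llama.cpp's actual API): real grammar-constrained decoding masks tokens step by step, but the principle is that the model only ever chooses among legal moves. score_moves is a hypothetical hook returning the model's log-score for each candidate move string.

```
# Sketch: constrain an LLM "chess player" to legal moves only.
# `score_moves` is hypothetical (the model's log-score per candidate);
# real constrained decoding does this per token via a grammar.
import math
import random
import chess  # python-chess

def pick_move(board: chess.Board, score_moves) -> str:
    legal = [board.san(m) for m in board.legal_moves]  # only valid options
    logits = score_moves(board.fen(), legal)
    # softmax over legal moves only, then sample
    top = max(logits)
    weights = [math.exp(x - top) for x in logits]
    return random.choices(legal, weights=weights, k=1)[0]
```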