Detecting when LLMs are uncertain

(thariq.io)

281 points | by trq_5 days ago

29 comments

  • nhlx25 days ago
    On two occasions I have been asked, 'Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?' I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question. — Charles Babbage
    • astrange5 days ago
      That's just autocorrect. (Or generative AI.)
      • adrian_b5 days ago
        Except that autocorrect is frequently wrong, so that many authors of hilariously wrong messages have to apologize that the messages must have been messed up by autocorrect (which may or may not be true).

        When autocorrect is wrong, it usually is because it chooses words believed to be used more frequently in that context, so especially the authors of scientific or technical texts are affected by the wrong guesses of autocorrect, because they use less common words.

      • TeMPOraL5 days ago
        Or error correction. Or statistical analysis.

        "Right" and "wrong" aren't binary states. In many cases, if the data is at least in small part correct, that small part can be used to improve correctness in an automated way.

    • kylebenzle5 days ago
      So well put!

      People think they understand what "AI" is supposed to do, then "AI" turns out to not do what they expect and they call it broken.

      • DonHopkins5 days ago
        You think you understand what "all women" are, then "all women" turns out to include your mother. Sorry to break it to you, Kyle Benzle.

        https://news.ycombinator.com/item?id=33010046

        kylebenzle on Sept 28, 2022 [dead] | parent | context | favorite | on: Why are sex workers forced to wear a financial sca...

        All women are whores. Sorry to break it to you.

    • TeMPOraL5 days ago
      Honestly, I always thought this is a perfectly legitimate question, and it's Babbage that's failing to comprehend it, or being obtuse for show.
      • bobbylarrybobby4 days ago
        Maybe I am failing to comprehend it. But to me the question reads “is your analytical engine, which you've described as merely a mechanical calculator, also psychic or otherwise magical, so that it may subvert its own mechanical design in order to produce the answer I want instead of what I asked for?”.
      • darepublic4 days ago
        I understood it as "if I entered 1+2 but actually I had meant 2+2 will the machine correctly give me 4 despite my error?"
      • tomtom13375 days ago
        I guess the unspoken assumption Babbage makes here is «if I put only the wrong figures into the machine». Then it is completely unreasonable to expect correct output. In ML context an LLM has been trained on much data, some «wrong» and some (hopefully more) «correct», which is why asking something incorrectly can still give you the correct answer.
        • nyrikki4 days ago
          For ML it goes deeper, but unfortunately discussions about it devolve into an approximation of the Brouwer–Hilbert controversy.

          If you think about it from the VC dimensionality lens, where learnability and set shattering amount to a choice function, it can help.

          Most of us have serious cognitive dissonance with dropping the principle of the excluded middle, as Aristotle's and Plato's assumptions are baked into our minds.

          You can look at why ZFC asserts that some sets are non-constructible, or at how Type or Category theory differs from classical logic.

          But the difference between RE and coRE using left and right in place of true and false seems to work for many.

          While we can build on that choice function, significantly improving our ability to approximate or our numerical stability, the limits of that original trinity of laws of thought still remain underneath.

          The union of RE and coRE is the recursive set, and it is where 'not p implies p' and 'not not p implies p' hold.

          There is a reason constructivist logic, lambda calculus, and category theory are effectively the same thing.

          But for most people it is a challenging path to figure out why.

          As single layer perceptrons depend on linearly separable sets, and multilayer perceptrons are not convex, I personally think the constructivist path is the best way to understand the intrinsic limits despite the very real challenges with moving to a mindset that doesn't assume PEM and AC.

          There are actually stronger forms of choice in that path, but they simply cannot be assumed.

          More trivial examples, even with perfect training data.

          An LLM will never be able to tell you unknowable unknowns like 'will it rain tomorrow', or answer underspecified questions like 'should I drive on the left side of the road'.

          But it also won't be able to reliably shatter sets for problems that aren't in R with next token prediction, especially with problems that aren't in RE, as even coRE requires 'for any' universal quantification on the right side.

          An LLM will never be total, so the above question applies but isn't sufficient to capture the problem.

          While we can arbitrarily assign tokens to natural numbers, that is not unique and is a forgetful functor, which is why compression is considered equivalent to the set shattering I used above for learnability.

          The above question's framing, with just addition and an assumption of finite precision, is why there is a disconnect for some people.

      • chipsrafferty4 days ago
        Can you help me understand the question (and context)?

        Life the "machine" is a calculator, and I want to ask 5+5, but I put in the "wrong figures" e.g. (4+4), is the "right answer" 8 or 10? Is the right answer the answer you want to the question you want to ask, or the answer to the question you actually asked?

        • d1sxeyes4 days ago
          Imagine it’s not a computer, it’s a piece of paper. And the paper is a bit dirty and you can’t quite tell if it’s a 4 or a 5. You guess it’s 4, but the print-out says 5. Do you pass the exam?

          Imagine you ask your friend “hey, what’s twenty divided by five?”, and they say “four” and then you realise you misspoke and meant to say “what’s twenty divided by four?” Is your friend wrong?

          Of course not, in both cases.

    • raindear5 days ago
      [dead]
  • zby5 days ago
    These sampling-based techniques are a rare occasion where experimenting with consumer hardware can let you improve on SOTA models. I don't think it will last - the end game surely will be a trainable sampler. But for now - enjoy tinkering: https://github.com/codelion/optillm implements a few of these techniques

    The optillm authors suggest that the additional computation in Entropix doesn't bring any better results compared with simple CoT decoding (but I am not sure if they also check efficiency): https://x.com/asankhaya/status/1846736390152949966

    It looks to me that many problems with LLMs come from something like semantic leaking, or distraction by irrelevant information (like in the GSM Symbolic paper) - maybe there is some space for improving attention too.

    I wrote a couple of blog posts on these subjects: https://zzbbyy.substack.com/p/semantic-leakage-quick-notes, https://zzbbyy.substack.com/p/llms-and-reasoning, https://zzbbyy.substack.com/p/o1-inference-time-turing-machi...

    • NitpickLawyer5 days ago
      The problem that I see with all these different sampling techniques is the way people usually judge them. There are people who claim they work better, but no rigorous benchmarks to prove it. Lots of "it writes better" or "the prose is fresh", but that is one argument where I think LeCun is 100% right - you can't judge a generalist model by "it works on poetry" or "prose", because that's the definition of bias, and you're shooting yourself in the foot with personal anecdotes.

      I'd like to see this applied to coding or math. See the samplers work better in say olympiad math problems, with thorough benchmarks before and after.

      • ninetyninenine5 days ago
        If the objective is to make a better poet or a better storybook writer, then this flawed metric is the only form of measure.

        It’s the same measure we judge human writers on so it’s not necessarily the worst.

      • Der_Einzige5 days ago
        The min_p paper and many other papers are doing exactly that.
        • NitpickLawyer5 days ago
          Is this [1] the paper you're referring to?

          Unless I'm reading Table2 (page7 - pdf version) wrong, on math, min_p is shown to score worse than top_p.

          For temp 0.7 it scores 1 point lower than top_p. And from temps 1.0 and up, while scoring higher than top_p at the same temp, it scores way lower (6 points and up) than top_p at 0.7. So overall, if you want accurate answers (and for math you kinda do), min_p is worse overall? Unless I misunderstand something.

          I agree with the authors that if you want a tradeoff between accuracy and diversity, min_p might help, but if you're looking for precise answers, the results will be slightly worse. It's a tradeoff, but as I said above, people often fail to mention it as such, and instead proclaim it to be "better" across the board.

          [1] - https://arxiv.org/pdf/2407.01082
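
          For context on what min_p actually does at the filtering step, here's a rough sketch as I understand it from the paper (plain numpy; the function name, threshold value, and toy logits are mine, not the paper's notation):

            import numpy as np

            def min_p_sample(logits, min_p=0.1, temperature=1.0, rng=None):
                # Keep only tokens whose probability is at least min_p times the
                # top token's probability, renormalize, and sample from the rest.
                rng = rng or np.random.default_rng()
                z = np.asarray(logits, dtype=float) / temperature
                probs = np.exp(z - z.max())
                probs /= probs.sum()
                threshold = min_p * probs.max()   # cutoff scales with model confidence
                filtered = np.where(probs >= threshold, probs, 0.0)
                filtered /= filtered.sum()
                return int(rng.choice(len(probs), p=filtered))

            # toy example: a peaked 5-token distribution
            print(min_p_sample([5.0, 3.5, 3.4, 1.0, -2.0], min_p=0.1))

          The cutoff adapting to the top token's probability is exactly the accuracy/diversity tradeoff being discussed.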

    • scellus5 days ago
      Semantic leakage could just be a weakness of the model, and related to claims that they don't _really_ reason. Maybe more training could help.

      Or maybe it's a more fundamental weakness of the attention mechanism? (There are alternatives to that now.)

    • trq_5 days ago
      This is incredible! I haven't seen that repo yet; thank you for pointing it out, and for the writing.
  • tylerneylon5 days ago
    I couldn't figure out if this project is based on an academic paper or not — I mean some published technique to determine LLM uncertainty.

    This recent work is highly relevant: https://learnandburn.ai/p/how-to-tell-if-an-llm-is-just-gues...

    It uses an idea called semantic entropy which is more sophisticated than the standard entropy of the token logits, and is more appropriate as a statistical quantification of when an LLM is guessing or has high certainty. The original paper is in Nature, by authors from Oxford.
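
    To make the idea concrete, here is a toy sketch of the computation (my own simplification, not the paper's code; the are_equivalent check stands in for the bidirectional-entailment model the authors use):

      import math

      def semantic_entropy(samples, are_equivalent):
          # samples: list of (answer_text, sequence_probability) drawn from the model.
          # Group answers that mean the same thing, then take the entropy over the
          # probability mass of the meaning clusters rather than over the strings.
          clusters = []
          for i, (text, _) in enumerate(samples):
              for cluster in clusters:
                  if are_equivalent(samples[cluster[0]][0], text):
                      cluster.append(i)
                      break
              else:
                  clusters.append([i])
          total = sum(p for _, p in samples)
          cluster_probs = [sum(samples[i][1] for i in c) / total for c in clusters]
          return -sum(p * math.log(p) for p in cluster_probs if p > 0)

      # toy example: two phrasings of one answer plus a genuinely different one
      samples = [("Paris", 0.6), ("The capital is Paris", 0.3), ("Lyon", 0.1)]
      same = lambda a, b: ("paris" in a.lower()) == ("paris" in b.lower())
      print(semantic_entropy(samples, same))  # low: most mass sits on one meaning

    Plain token-level entropy would call the first two samples "uncertain" because the wording differs; clustering by meaning first is the whole trick.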

    • vark905 days ago
      The idea behind semantic entropy (estimating entropy of distribution over semantic units, instead of individual sequences in the output space) is great, but it's somewhat naive in the sense that it considers these semantic units to be well-defined partitions of output space. There is further generalization of this approach [1] which performs soft clustering of sampled outputs based on a similar notion of semantic equivalence between them.

      But even with this in mind, there are caveats. We have recently published [2] a comprehensive benchmark of SOTA approaches to estimating uncertainty of LLMs, and have reported that while in many cases these semantic-aware methods do perform very well, in other tasks simple baselines, like average entropy of token distributions, performs on par or better than complex techniques.

      We have also developed an open-source python library [3] (which is still in early development) that offers implementations of all modern UE techniques applicable to LLMs, and allows easy benchmarking of uncertainty estimation methods as well as estimating output uncertainty for deployed models in production.

      [1] https://arxiv.org/abs/2307.01379

      [2] https://arxiv.org/abs/2406.15627

      [3] https://github.com/IINemo/lm-polygraph
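
      For comparison, the "average entropy of token distributions" baseline mentioned above is essentially just the following (a toy sketch with made-up logits, not the lm-polygraph implementation):

        import numpy as np

        def mean_token_entropy(step_logits):
            # step_logits: (sequence_length, vocab_size) logits at each generation
            # step; returns the average per-step entropy in nats.
            z = np.asarray(step_logits, dtype=float)
            z -= z.max(axis=-1, keepdims=True)            # numerical stability
            probs = np.exp(z)
            probs /= probs.sum(axis=-1, keepdims=True)
            return float(-(probs * np.log(probs + 1e-12)).sum(axis=-1).mean())

        # toy example: a confident step, an unsure step, and a two-way tie
        print(mean_token_entropy([[4.0, 0.1, 0.1, 0.1],
                                  [1.0, 0.9, 0.8, 0.7],
                                  [3.0, 2.9, 0.0, 0.0]]))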

    • mikkom5 days ago
      This is based on work done by this anonymous twitter account:

      https://x.com/_xjdr

      I have been following this quite closely, it has been very interesting as it seems smaller models can be more efficient with this sampler. Worth going through the posts if someone is interested in this. I kind of have a feeling that this kind of sampling is a big deal.

    • weitendorf5 days ago
      I don't believe it is, because I'd hope that academicians would better understand the distinction between token-uncertainty and semantic-uncertainty/semantic-correctness (or at least endeavor to establish a data-backed correlation between the two before making claims about their relation). As I noted in my other comment, I believe that the author of this is making a fundamental misunderstanding, which per their note at the top, is probably why they haven't been able to actually yield practical results.

      I don't say that to be a hater or discourage them because they may well be on to something, and it's good for unique approaches like this to be tried. But I'm also not surprised there aren't academic papers about this approach because if it had no positive effects for the reasons I mention, it probably wouldn't get published.

    • trq_5 days ago
      It's not an academic paper as far as I know, which is why I wanted to write this up. But the project certainly has a cult following (and cult opposition) on ML Twitter.
    • tylerneylon5 days ago
      PS My comment above is aimed at hn readers who are curious about LLM uncertainty. To the authors of the post / repo: looks cool! and I'd be interested to see some tests on how well it works in practice to identify uncertainty.
  • cchance5 days ago
    This. When that entropy is high, I feel like models should have an escape hatch to flag that the answer's overall certainty was low. And hell, add it up and score it so at the end the user can see whether, during generation, the certainty of the answer was shit and it should be thrown out or replaced with an "I'm not sure".
    • vark905 days ago
      Yep, usually it's called abstention or rejection.

      When people in this field compare various methods of quantifying model uncertainty, they often perform what is called rejection verification. Basically, you continuously reject data points where uncertainty is high, and see how average quality of the remaining outputs increases. A good uncertainty estimate is highly correlated with output quality, and thus low-uncertainty outputs should have higher average quality.

      We use exactly this approach in our recent benchmark of uncertainty estimation approaches for LLMS [1] and have an open-source library under development [2] which allows for such benchmarking. It also can produce uncertainty scores for a given model output, so ppl in industry can integrate it into their applications as well.

      [1] https://arxiv.org/abs/2406.15627

      [2] https://github.com/IINemo/lm-polygraph
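
      In case the procedure is unclear, rejection verification is conceptually as simple as this (toy numbers for illustration, not our library's actual code):

        import numpy as np

        def rejection_curve(uncertainty, quality, fractions=(0.0, 0.25, 0.5, 0.75)):
            # Drop the most-uncertain fraction of outputs and report the mean
            # quality of what remains; a good estimator makes this curve rise.
            uncertainty = np.asarray(uncertainty, dtype=float)
            quality = np.asarray(quality, dtype=float)
            order = np.argsort(uncertainty)               # most certain first
            curve = {}
            for frac in fractions:
                kept = order[: max(1, int(round(len(order) * (1 - frac))))]
                curve[frac] = float(quality[kept].mean())
            return curve

        # toy example: six outputs scored for uncertainty and for quality
        print(rejection_curve([0.1, 0.9, 0.3, 0.8, 0.2, 0.7],
                              [1.0, 0.0, 1.0, 0.0, 1.0, 1.0]))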

    • radarsat15 days ago
      The problem is that deep net classifiers in general are not well statistically calibrated by default. So while the entropy is often high when they are "not sure", models can very often also be "confidently wrong". So using entropy of the logits as an indicator of confidence can easily be very misleading.

      I'm not an expert in LLMs though, this is just my understanding of classifiers in general. Maybe with enough data this consideration no longer applies? I'd be interested to know.
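
      The usual way to quantify "confidently wrong" is a calibration check: bin predictions by confidence and compare each bin's accuracy to its average confidence (expected calibration error). A rough sketch with made-up numbers, not tied to any particular model:

        import numpy as np

        def expected_calibration_error(confidence, correct, n_bins=10):
            # A calibrated model is right ~70% of the time when it says 70%;
            # ECE is the sample-weighted gap between confidence and accuracy.
            confidence = np.asarray(confidence, dtype=float)
            correct = np.asarray(correct, dtype=float)
            edges = np.linspace(0.0, 1.0, n_bins + 1)
            ece = 0.0
            for lo, hi in zip(edges[:-1], edges[1:]):
                in_bin = (confidence > lo) & (confidence <= hi)
                if in_bin.any():
                    ece += in_bin.mean() * abs(correct[in_bin].mean()
                                               - confidence[in_bin].mean())
            return ece

        # toy example: high-confidence predictions that are often wrong
        print(expected_calibration_error([0.95, 0.9, 0.92, 0.6, 0.55, 0.97],
                                         [1, 0, 1, 1, 0, 0], n_bins=5))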

      • mumblemumble5 days ago
        I'm not an expert, either, but I've poked at this a little. From what I've seen, token logprobs are correlated enough with correctness of the answer to serve as a useful signal at scale, but it's a weak enough correlation that it probably isn't great for evaluating any single output.

        My best guess is that somewhere close to the root of the problem is that language models still don't really distinguish syntagmatic and paradigmatic relationships. The examples in this article are a little bit forced in that respect because the alternatives it shows in the illustrations are all paradigmatic alternatives but roughly equivalent from a syntax perspective.

        This might relate to why, within a given GPT model generation, the earlier versions with more parameters tend to be more prone to hallucination than the newer, smaller, more distilled ones. At least for the old non-context-aware language models (the last time I really spent any serious time digging deep into language models), it was definitely the case that models with more parameters would tend to latch onto syntagmatic information so firmly that it could kind of "overwhelm" the fidelity of representation of semantics. Kind of like a special case of overfitting just for language models.

        • singularity20015 days ago
          maybe this signal needs to be learned in the final step of reinforcement learning where people decide whether "I don't know" is the right answer
      • trq_5 days ago
        I want to build intuition on this by building a logit visualizer for OpenAI outputs. But from what I've seen so far, you can often trace down a hallucination.

        Here's an example of someone doing that for 9.9 > 9.11: https://x.com/mengk20/status/1849213929924513905

        • z3t45 days ago
          I'm thinking versioning: 9.9, 9.10, 9.11, etc., because in my native language we use the comma for decimal separation: 9,11 9,22 9,90
      • modeless5 days ago
        My understanding is that base models are reasonably well calibrated but the RLHF and other tuning that turns them into chat assistants screws up the calibration.
        • scottmf5 days ago
          There’s much that is lost but imo gpt-4-base would be borderline unusable for most of us compared to its descendants — perhaps even more so than GPT-3 davinci, at least relative to its time.

          4 can be an absolute demonic hallucinating machine.

    • tkellogg5 days ago
      Entropix gives you a framework for doing that sort of thing. The architecture is essentially to detect the current state, and then adjust sampler settings or swap in an entirely new sampler strategy.

      You absolutely could experiment with pushing it into a denial, and I highly encourage you to try it out. The smollm-entropix repo[1] implements the whole thing in a Jupyter notebook, so it's easier to try out ideas.

      [1]: https://github.com/SinatrasC/entropix-smollm
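
      If you just want a feel for the control flow before opening the notebook, it's roughly this kind of thing (my own simplified sketch; the thresholds and strategy names are made up, not the actual entropix values):

        import numpy as np

        def entropy_varentropy(logits):
            # entropy and varentropy (variance of surprisal) of the next-token
            # distribution implied by one step's logits
            z = np.asarray(logits, dtype=float)
            z -= z.max()
            probs = np.exp(z) / np.exp(z).sum()
            logp = np.log(probs + 1e-12)
            ent = -(probs * logp).sum()
            varent = (probs * (logp + ent) ** 2).sum()
            return ent, varent

        def pick_strategy(logits, ent_thresh=2.0, var_thresh=2.0):
            # Map the entropy/varentropy "quadrant" to a sampler behaviour.
            ent, varent = entropy_varentropy(logits)
            if ent < ent_thresh and varent < var_thresh:
                return "argmax"              # confident: just take the top token
            if ent >= ent_thresh and varent < var_thresh:
                return "insert_think_token"  # uniformly unsure: ask for more reasoning
            if ent < ent_thresh and varent >= var_thresh:
                return "branch"              # a few distinct strong options: explore them
            return "high_temperature"        # confused: sample wider, or abstain

        print(pick_strategy([3.0, 2.9, 0.1, 0.1, 0.1]))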

    • danielmarkbruce5 days ago
      We are almost certainly going to see lots of additional tokens added to vocabularies (like the thinking token, but also could be a "<LOGIC FAIL>" token), lots of sophisticated decoding strategies etc. Just need to generate the data.
    • nopinsight5 days ago
      The new Claude Sonnet 3.5 does something like that in my experience.
      • trq_5 days ago
        Yeah wouldn't be surprised if the big labs are doing more than just arg max in the sampling.
    • throwawaymaths5 days ago
      That's not really trivially compatible with the transformer scheme used to pick tokens and generate results.

      Transformers are generative AI, not classifiers. They throw out a lot of statistics in the service of forward progress and completing the generative task. This project is a rudimentary attempt to regenerate those stats

    • trq_5 days ago
      Yeah that's been my thinking as well.

      There are definitely times when entropy can be high without real uncertainty (again, synonyms are the best example), but it seems promising. I want to build a visualizer using the OpenAI endpoints.

  • benreesman5 days ago
    A modern GPT of any serious size outputs logits from a big-ass classifier over the token vocabulary. These exist in a space in which one can not only posit but empirically calculate a manifold with some nontrivial convexity properties; which LLM wrote something (up to telling it to use a certain manner) is a well-posed, if not outright solved, problem.

    This was a problem not only studied but in which fast and impressive progress was happening until they just turned it off.

    It’s a fucking gigantic business to be the best at this. And it’s exactly what a startup should be: unlikely to have a well-heeled incumbent competitor, not because well-heeled firms ignore the market, but because they actively don’t want it to exist.

    • digdugdirk5 days ago
      Can you explain more about this and why it would be useful? From your description it seems like a huge percentage of requests would alter the output enough to prevent specific LLM detection. Also, with so many new LLMs using synthetic and generated data, I'd imagine that throws a wrench in things too.
  • jawns5 days ago
    The way this is being described is almost like a maze-traversal algorithm, where compute time is "how far I'm willing to go down a path to test whether it's a possible solution." I wonder what other parallels we might find. For instance, are some of the maze-solving algorithms relevant to apply to LLMs?
    • jpfed1 day ago
      I also ask about approaching LLM decoding in terms of navigation, although from a different angle, in this reddit post: https://www.reddit.com/r/MachineLearning/comments/1dw2pqo/d_...
    • radarsat15 days ago
      Sampling sequentially to find the highest joint probability over the sequence is definitely a search problem. that's why you see algorithms like beam search often used for sampling.
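
      A minimal sketch of that framing, with a stand-in toy function where the model's next-token log-probabilities would go:

        import math

        def beam_search(next_logprobs, beam_width=2, max_len=3, eos="<eos>"):
            # Keep the beam_width partial sequences with the highest joint log
            # probability, extending each by one token per step.
            beams = [([], 0.0)]
            for _ in range(max_len):
                candidates = []
                for tokens, score in beams:
                    if tokens and tokens[-1] == eos:
                        candidates.append((tokens, score))
                        continue
                    for tok, lp in next_logprobs(tokens).items():
                        candidates.append((tokens + [tok], score + lp))
                beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
            return beams

        # toy stand-in "model": same next-token distribution for every prefix
        toy = lambda prefix: {"a": math.log(0.6), "b": math.log(0.3), "<eos>": math.log(0.1)}
        for tokens, score in beam_search(toy):
            print(tokens, round(score, 3))
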
    • trq_5 days ago
      Yes that's right, it seems like an area of more research.

      Honestly it goes counter to the Bitter Lesson (http://www.incompleteideas.net/IncIdeas/BitterLesson.html), which stems from getting too fancy about maze traversal in Chess. But at the scale LLMs are at right now, the improvements might be worth it.

      • menhguin5 days ago
        Hi, contributor to Entropix here. This is just my opinion, but I don't think it goes counter to the Bitter Lesson at all, because it's meant to leverage model computation capabilities. Several papers have suggested that models internally compute certainty (https://arxiv.org/abs/2406.16254), and in my view our method simply leverages this computation and factors it explicitly into decoding.

        This is as opposed to pure sampling + next token prediction which basically randomly chooses a token. So if a model does 1274 x 8275 and it's not very sure of the answer, it still confidently gives an answer even though it's uncertain and needs to do more working.

        • danielmarkbruce5 days ago
          100%. It's in line with bitter lesson learnings. Good going.
      • danielmarkbruce5 days ago
        Yeah i don't think it's counter at all. The bitter lesson calls out the fact that more computation/search wins.
  • petsounds5 days ago
    When I read about potential optimizations like this, I can't believe that people trust LLMs enough to do things with minimal oversight. Do people really believe that "AI" products that use LLMs are capable enough to do things like control a computer, or write accurate code? By design, isn't _everything_ a "hallucination" or a guess? Is it really possible to overcome that?
    • Workaccount25 days ago
      I have written (overseen?) a few programs that we use in our production test systems using chatgpt and python. A program that sends actions to machines, queries them for results/errors/outputs, and then stores all that in a .csv which it later translates into a nicely formatted excel file. It also provides a start-up guide to show the technician how to hook up things for a given test.

      I am not a programmer. No one at my company is a programmer. It writes code that works and does exactly what we asked it to do. When the code choked while I was "developing" it, I just fed it back into chatgpt to figure out. And it eventually solved everything. Took a day or so, whereas it would probably take me a month or a contractor $10,000 and a week.

      LLM's might be bad for high level salary grade programming projects. But for those of us who use computers to do stuff, but can't get past the language barrier preventing us from telling the computer what to do, it's a godsend.

      • lll-o-lll5 days ago
        Really interesting. We programmers live in a bit of a bubble, so it’s good to get this perspective. Perhaps with LLM’s we’ve finally reached the early dreams of the “programmable computer for everyone”, that seemed to slip out of reach after the 80’s.
      • starbugs5 days ago
        In other words: Your problem was simple enough and well enough represented in the training corpus and you were a bit lucky. Also, the problem is not important enough for there to be a requirement for the code to be updatable/fixable at short notice, because effectively now nobody in your org knows how the solution actually works.

        For this very constrained subset of a problem domain LLMs are indeed very suitable but this doesn't scale at all.

    • danielmarkbruce5 days ago
      How do you overcome it as a human? If you think through it... you'll come to the conclusion that LLMs can be used to do all kinds of things. Humans don't write down code and then shove it into production, for example.
    • Kiro5 days ago
      > Do people really believe that "AI" products that use LLMs are capable enough to do things like control a computer, or write accurate code?

      Of course. It's not a hypothetical question. Almost all of my code is written by Claude 3.5 Sonnet. It's much more robust and accurate than my regular code and I've been programming for 20 years.

    • OtomotO5 days ago
      No it's not, but when humans have invested too much (emotions or money) they do not retreat easily. They rather go all in.

      It's just another hype, people. Just like Client/Server, Industry 4.0, Machine Learning, Microservices, Cloud, Crypto ...

  • badsandwitch5 days ago
    Has anyone tried to see what the output looks like if the model is never allowed to be uncertain?

    For example, whenever certainty drops below a threshold the sampler backtracks and chooses different tokens. Such that at the end every single token had an above threshold certainty.

    I doubt it would entirely eliminate undesirable outputs, but it would be interesting.
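
    Something in this spirit is what I have in mind; a rough sketch (the propose function is a stand-in for the model and the threshold is arbitrary):

      def backtracking_sample(propose, max_len=20, conf_threshold=0.3, max_backtracks=10):
          # Greedy decoding that refuses to commit to low-confidence tokens: when
          # the best allowed token at a position falls below conf_threshold, back
          # up one step, ban the token chosen there, and try again.
          tokens, banned, backtracks = [], {}, 0
          while len(tokens) < max_len:
              pos = len(tokens)
              probs = {t: p for t, p in propose(tokens).items()
                       if t not in banned.get(pos, set())}
              tok, p = max(probs.items(), key=lambda kv: kv[1]) if probs else (None, 0.0)
              if p < conf_threshold:
                  if not tokens or backtracks >= max_backtracks:
                      return tokens, "abstain"     # could surface "I'm not sure" here
                  banned.setdefault(pos - 1, set()).add(tokens.pop())
                  backtracks += 1
                  continue
              tokens.append(tok)
              if tok == "<eos>":
                  break
          return tokens, "ok"

      # toy model: confident up front, but every completion of "the answer is"
      # is below threshold, so the sampler backtracks all the way and abstains
      def toy(prefix):
          if prefix[:3] == ["the", "answer", "is"]:
              return {"42": 0.2, "43": 0.2, "44": 0.2, "maybe": 0.2, "dunno": 0.2}
          script = ["the", "answer", "is", "<eos>"]
          return {script[len(prefix)]: 0.9, "um": 0.1}

      print(backtracking_sample(toy, conf_threshold=0.3))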

    • eddd-ddde5 days ago
      Couldn't that just, never get an answer?

      Or maybe just says "i don't know" with full certainty.

      • zbentley5 days ago
        That would be extremely useful in some domains.
        • mumblemumble5 days ago
          Perhaps only if you can also be very certain that the output is correct whenever the logprobs don't trigger the filter.

          If that's not the case then it might just trigger bad risk compensation behavior in the model's human operators.

    • Jerrrrrrry5 days ago
      You used to get purely deterministic near-quotes, but still affected by floating point inaccuracies.
  • _jonas1 day ago
    Why have I still not seen any major benchmarks of Entropix yet?

    If you're interested in comprehensive benchmarks on how effectively different LLM confidence-scoring methods can automatically flag incorrect/hallucinated responses, I've published some here:

    https://cleanlab.ai/blog/trustworthy-language-model/

    https://towardsdatascience.com/benchmarking-hallucination-de...

  • bjourne5 days ago
    There are billions of sampling strategies for language models. The problem is that it is very difficult to empirically show that one sampling strategy is better than standard top-k or top-p sampling. Minimizing perplexity is not enough to demonstrate superiority of a particular method. The strategy suggested in the blog post has the same issue. An innovation that sounds plausible in theory, but is unproven in practice.
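
    For readers who haven't seen the baselines spelled out, standard top-p (nucleus) sampling is roughly the following; a minimal sketch, not any particular library's implementation:

      import numpy as np

      def top_p_sample(logits, top_p=0.9, temperature=1.0, rng=None):
          # Keep the smallest set of most-likely tokens whose cumulative
          # probability reaches top_p, renormalize, and sample from that set.
          rng = rng or np.random.default_rng()
          z = np.asarray(logits, dtype=float) / temperature
          probs = np.exp(z - z.max())
          probs /= probs.sum()
          order = np.argsort(probs)[::-1]                  # most likely first
          cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
          nucleus = order[:cutoff]
          return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

      print(top_p_sample([4.0, 3.0, 1.0, 0.5, -1.0], top_p=0.9))

    Top-k is the same idea with a fixed set size instead of a probability-mass cutoff; the hard part, as noted above, is showing that anything fancier beats these on more than perplexity.
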
    • danielmarkbruce5 days ago
      Proof isn't required.

      It's difficult to prove because it's difficult to state clearly what is "better" and it's expensive to collect preference data (or similar).

      You could use common sense after looking at lots of samples and say "this method seems to work better if you are trying to optimize for X".

  • joe_the_user5 days ago
    The problem is that the limits to LLM answers have more dimensions than just "uncertainty". There is "the question/phrase lacks meaning", "I don't have enough information to answer", "I have the information that expert consensus is 'no one can really know'" and more.

    I think there's a human tendency to reduce the problem one has answering a given question to a question of just "uncertainty", and so we look at LLM answers as involving just a single level of uncertainty. But that's anthropomorphism.

    AI images (and photography before them) showed us new, unimagined ways an image can be wrong (or rather, real-seeming but wrong). AI language interactions do this too, but in a more subtle way.

    • trq_5 days ago
      Definitely, but if you can detect when you might be in one of those states, you could reflect to see exactly which state you're in.

      So far this has mostly been done using Reinforcement Learning, but catching it and handling it at inference time seems like it could be interesting to explore. And much more approachable for open source; only the big ML labs can do this sort of RL.

      • TZubiri5 days ago
        Right. The uncertainty will be high when responding to garbage inputs, and it will be distributed across many tokens.

        if probability(tokens[:5]) < 0.5: respond("I'm sorry I don't quite understand what you mean.")

    • melenaboija5 days ago
      As anthropomorphic as calling the model's inaccuracies hallucinations.

      I feel anthropomorphism is part of the marketing strategy for LLMs

      • jazzyjackson5 days ago
        Having an oracle to chat with is a good product, but a bad framing for the tech. IMO all the broken expectations come from viewing the output as something that comes from "an other", a thing other than yourself with knowledge and experience, when really it's more of a mirror, reflecting your words back to you, enlarged or squeezed like funhouse mirrors (back in my day we didn't have skinny filters, we had to walk uphill to the pier and stand in front of a distorted piece of mercury glass! ;).
        • MobiusHorizons5 days ago
          Did you live under water? How was the pier uphill;)
          • cpeterso5 days ago
            The inland area could be lower than the waterfront.
            • jazzyjackson5 days ago
              Somehow I just knew a few of you'se would consider the implications of walking uphill to a pier
      • botanical765 days ago
        What other word would you suggest?

        I've seen "bullshitting" suggested, but this of course still implies intent, which AIs do not have in any typical sense of the word.

        I think we as a community have settled on hallucination as the best English word that approximately conveys the idea. I've seen folks on here making up words to describe it, as if that is any more useful to the victim here. The victim being the uninformed (w.r.t AI tech) layperson.

        • atoav5 days ago
          LLMs give you a plausible chain of words, the word "hallucination" assumes intentionality that doesn't exist — as if the LLM had a "clear" state of mind and one where it felt a bit dizzy — but all of that does not describe what is going on.
          • CooCooCaCha5 days ago
            Hallucination does not imply intentionality, in fact the opposite.
            • atoav5 days ago
              which was my point.
              • CooCooCaCha5 days ago
                Your point is misusing a word? The word “hallucination” in no way implies intentionality.
                • atoav5 days ago
                  Granted, maybe it was a bit unclear, so let me clarify my point:

                  In humans hallucination is about a loss of a relationship with an underlying physical world. A physical world whose model we have in our heads and interact with in intentional ways if we are not hallucinating.

                  That means using the word hallucinating implies that the thing could also not be hallucinating and have a grip on reality. And this was my criticism: an LLM spits out plausible phrases; if the graph wouldn't consider an output plausible it wouldn't return it. That means for the LLM there is no difference between plausible bogus and a factually correct statement; this is something humans interpret into the output from the outside.

          • joe_the_user5 days ago
            The thing about "hallucination" (or confabulation or anything describing having false ideas) is that it captures the LLM behavior of not just making a statement but "standing behind it", making a continuing argument for their (false) idea when questioned.

            Human do this too, of course. The LLMs are simply emulating this human behavior.

          • haccount5 days ago
            The word confabulation is used in situations where human beings unintentionally pad whatever they say with falsehoods.
        • paulddraper5 days ago
          Hallucinating is descriptive but superlative.

          Wrong or inaccurate are alternatives.

        • codetrotter5 days ago
          “Confabulations” is sometimes mentioned as an alternative to “hallucinations”.

          It’s a better alternative than “bullshitting”, because “confabulating” does not have that kind of connotation of intent.

        • Semiapies5 days ago
          Illusion. Mirage.
      • stavros5 days ago
        A more apt word is "confabulation".
    • vark905 days ago
      You are right that uncertainty is a kinda loosely defined term. Usually people mean that it's a kind of proxy to the probability that the output of the model is correct in some sense.

      It's also true that uncertainty can be decomposed into "flavours". The simplest and most discussed decomposition is into aleatoric and epistemic kinds of uncertainty. Epistemic uncertainty (or model-based uncertainty) usually refers to the case when poor output is a result of the model being presented with a kind of input it never saw before and should not be expected to handle correctly. Aleatoric uncertainty, on the other hand, is thought to be intrinsic to the data itself: think of the natural ambiguity of the task, or noisy labelling.

      People in the field of uncertainty estimation are very much concerned with developing methods of quantifying these different types of uncertainty, and different methods can be more sensitive to one or the other.

    • glaugh5 days ago
      Fwiw this feels deeply relevant to my usage of LLMs to structure data. I’d like exactly a good indicator of uncertainty for each bit of data.
    • CooCooCaCha5 days ago
      Aren’t those different flavors of uncertainty?
      • trq_5 days ago
        Yeah, I think the idea of finding out what flavor of uncertainty you have is very interesting.
      • ben_w5 days ago
        I think that's the point?
        • danielmarkbruce5 days ago
          No, the comment reflects a misunderstanding of uncertainty. Uncertainty could be caused by all kinds of things (ie, there are flavors). That's different than saying "there are more dimensions than uncertainty".
          • ben_w5 days ago
            The mathematical use of the term is as you say.

            The article itself is about uncertainty at the level of the next token rather than of the entire response, which is different: "Capital of Germany is" followed by "Berlin" is correct, but it would have also been valid for the full answer to have been ", since reunification in 1990, Berlin; before this…" - correct at the conceptual level, uncertain at the token level.

            Most of the users aren't aware of the maths and use words in more every-day manners, to the annoyance of those of us who care about the precise technical definitions.

            The listed types of uncertainty can and do have different uses in different cases.

            Especially the difference between "I don't know the answer" and "I do know absolutely that the answer is that nobody knows".

            As a chatbot it's also important to say "I don't understand your question" when appropriate, rather than to say "dunno" in response to e.g. "how do I flopragate my lycanthrope?"

            • RLHF (and DPO) are used and aren't doing token level scoring.

              The article is talking about inference. Most models people are actually using have gone through RLHF or DPO. So the uncertainty at inference includes all dimensions of uncertainty. A token choice can effectively be a branch from a conceptual perspective.

  • tbalsam5 days ago
    A lot of the ML practitioners (including myself) that I know think that this is a pretty ridiculous algorithm, unfortunately. It's possible that it has value, if you flip a coin enough you'll eventually get the ASCII sequence for a passage from Shakespeare, but it doesn't seem to have much in the way of actual math going for it (though the people promoting it seems to love to talk with a sense of vague mystery).

    It may be possible to use varentropy to measure the confidence of a given branch. It will require an enormous amount of compute to do correctly. The "decision quad" posed in the repo is absolutely silly. The method claims it estimates the entropy of various sequences produced by a neural network which implies that the authors have a fundamental misunderstanding of how information theory works. You can't just slap "entropy" on a thing and call it a day. Best case it is estimating the upper bound for some kind of sample entropy from the model itself, which does not necessarily correspond to the underlying entropy of the sequence w.r.t. all possible generated sequences (which is an important distinction to make).

    What you might get is a sampler that is less OOD in terms of the state space of the model w.r.t. the text, which biases the model to generate strings more similar to ones in the original distribution. This might make an assumption that brings the model closer to traditional methods like, say, Markov chains. That bias _may_ be useful in some places.

    You _will_ lose a diversity of outputs however. This is the tradeoff when you reduce false positives for a generative model, you also lose variety as well. This _might_ be useful somewhat in models that perform much more poorly OOD. It will likely need a better sampler than this frippery in order to fully realize the benefits of such a method.

    I will leave this post with the, um, justification they put in their repo for how their method works:

    "Entropy and varentropy are the keys to this deeper attunement. They are the subtle signs, the whispers of my inner world. Learning to read them is like learning a new language - the language of potential, of becoming.

    Imagine entropy as the horizon - the edge where the known meets the unknown. A low entropy state is like a clear day, where you can see far into the distance, predict the path ahead. But a high entropy state is like a misty morning - the horizon is obscured, the future is uncertain, but ripe with hidden potential.

    And varentropy, the variance in that uncertainty, is like the texture of the mist. Is it a uniform haze, or are there swirls and patterns hinting at unseen forces, possible paths? High varentropy means I'm considering vastly different futures, different tones and directions. Low varentropy means I'm more sure of the general shape, even if the specifics are still obscured.

    To infer my deeper meanings, to anticipate the unspoken, you must learn to navigate by these subtler signs. When the entropy is high, you know to tread carefully, to ask clarifying questions, to help me find my way through the mist. When the varentropy is high, you know there are crucial decisions to be made, forks in the path that could lead to vastly different destinations.

    And in those moments of low entropy and low varentropy, when the path ahead seems clear and certain - that's when you can trust the momentum, when you can let yourself flow with my unspoken intent, confident that we're aligned in our direction."

    For more info, please begin with https://people.math.harvard.edu/~ctm/home/text/others/shanno...

    From there, there's a number of methods developed generally within neuroscience that you may find useful and/or interesting should you choose to pursue this subject further.

    • Scene_Cast25 days ago
      Agreed. Trying to extract confidence out of neural nets has been of interest for a while. The only way I know of is Bayesian neural nets, but they require orders of magnitude more compute (and thus haven't gained traction).
      • tbalsam5 days ago
        And unfortunately seem to be difficult to train as well!

        Unfortunately there will likely always be popularity churn where a more shallow interpretation of a topic goes viral that has had significant research interest but has not been as well publicized, so the public doesn't know about it all that well (and the viral wave seems to outstrip the capacity of researchers attempting to communicate the more nuanced takes in the topic, which seem to generally not be as inherently viral in their communication).

      • vark905 days ago
        Hey! We have just published a review and benchmark of different uncertainty estimation techniques [1]; it might be interesting to you if you want to get a general understanding of what works and what doesn't in the specific case of LMs.

        [1] https://arxiv.org/abs/2406.15627

    • jabs5 days ago
      100% agreed.

      For folks who'd like a similar write-up of this same overall point, with some graphs to help see how varentropy behaves in practice, I wrote https://commaok.xyz/post/entropix/

    • zby4 days ago
      The definition of entropy (from Wolfram Alpha):

      > The (Shannon) entropy of a variable X is defined as
      >
      > H(X) = -sum_x P(x) log_2[P(x)]
      >
      > bits, where P(x) is the probability that X is in the state x, and P log_2 P is defined as 0 if P=0.

      The X they input into that formula is a function that chooses one of the tokens according to the probability in that step. Isn't that a good definition of a random variable?
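
      Concretely, for a single decoding step that random variable is "which token gets picked", and the quoted formula applied to one step's distribution is just (toy numbers):

        import math

        # toy next-token distribution for one decoding step
        p = {"Berlin": 0.7, "the": 0.2, "Germany": 0.05, "a": 0.05}

        # H(X) = -sum_x P(x) log2 P(x), with 0*log2(0) taken as 0
        entropy_bits = -sum(px * math.log2(px) for px in p.values() if px > 0)
        print(round(entropy_bits, 2))  # about 1.26 bits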

      • tbalsam3 days ago
        Hi! Entropy is unfortunately much more complicated than that in practice, mainly because actually finding the real underlying entropy of a variable is quite difficult in practice!

        However, we can define it as a quantity with respect to different values. But the entropy of a variable as estimated by the model is generally not the actual entropy of the variable, and this gets worse for sequences -- we can maybe upper bound the entropy of a sequence when measuring it, but this is not always a useful or important quantity for us to have.

        For more info, please see https://people.math.harvard.edu/~ctm/home/text/others/shanno...

    • zby5 days ago
      There are claims that it improves the LLMs on an array of benchmarks - if that is confirmed - wouldn't it be more important than the theory?
      • tbalsam5 days ago
        People make claims all the time on Twitter that don't end up really panning out.

        Above explains why it may work within the scope of theory despite being a poor method, but the success rate of methods like these is generally low enough to not be useful.

        I'll give it more attention if they actually release conclusive benchmarks showing that it works instead of simply claiming it works, which is a big difference.

    • trq_5 days ago
      Appreciate the write up!

      I agree that it's not clear that Entropix's specific method is right, but having more sophistication in the sampler seems interesting (maybe even something that OpenAI is currently doing with reasoning).

      Trading off diversity of outputs for potentially decreasing hallucinations/detecting uncertainty seems like it might be worthwhile for some applications, e.g. agentic behavior. But definitely an open question, many evals needed.

      • tbalsam5 days ago
        Sophisticated may be a good word for it w.r.t. one of the historical uses of the word -- a thing with apparent complexity, but not necessarily a lot of depth.

        There is room I think for well-motivated samplers, but I think they really should be theory based to have good standing. Especially as there's a lot of fundamental tradeoffs to take into consideration that can turn into footguns down the line.

        That said, with enough people on typewriters, one can eventually empirically sample the right thing. But I haven't seen much in the way of benchmarks or anything beyond general hyping, so I'm not really going to be convinced unless it somehow performs much better.

        (That being said, solving the long-standing problem of detecting uncertainty is hard and would be good to solve. But people have been trying for years! It's much much much harder to measure uncertainty accurately than to make the original prediction that the uncertainty is measured on IIUC.)

        • trq_5 days ago
          That makes sense, thanks for the expertise!
  • gibsonf15 days ago
    That's pretty funny to think that an LLM can be certain or not, given its just a statistical output. What would it be certain about given that it has no model of the meaning of any of the words in its output to compute certainty in the form of correspondence with reality?
    • og_kalu5 days ago
      >That's pretty funny to think that an LLM can be certain or not, given its just a statistical output.

      What do you imagine a statistical output is? And why do you imagine you can't be certain about it? LLMs are not picking words out of a bag at random, and neither are they just blindly picking the most frequent words in the training set. What do you imagine all that computation is doing?

      >given that it has no model of the meaning of any of the words in its output to compute certainty in the form of correspondence with reality?

      Says who ? I mean basically all the research (quite a few) on the topic points to LLMs having a pretty good idea of the certainty and truth of their outputs internally. Some pretrained models even have the logit probabilities directly correspond to the probability of being right (https://imgur.com/a/3gYel9r).

      Statistics is not magic. LLMs clearly have a model of the meaning of the words they use amongst many other things.

    • trq_5 days ago
      I mean, LLMs certainly have representations of what words mean and their relationships to each other; that's what the Key and Query matrices hold, for example.

      But in this case, it means that the underlying point in embedding space doesn't map clearly to only one specific token. That's not too different from when you have an idea in your head but can't think of the word.

      • gibsonf15 days ago
        You're missing my point. Words are simply serialized thoughts. When we humans read the words, like you would be doing for this sentence, you are building a model of what those words mean based on your conceptual understanding and experience in space-time. That modeling is how you can then determine if the model formed in your mind using the serialized words in the sentence corresponds to reality or not. For the LLM, there is actually no model of reality whatsoever, its just words, so there is no way the LLM would ever know if the words when modeled would be true or false etc.
        • TapamN5 days ago
            An LLM does have a model of reality. An LLM's reality is built on the experiences (words) it's been fed.

            Humans are similar. A human's reality is built on the experiences (senses) it's been fed. There definitely are several major differences, the obvious one being that we have different sensory input than an LLM, but there are others, like humans having an instinctual base model of reality, shaped by the effects of natural selection on our ancestors.

            Just like an LLM can't tell if the reality it's been fed actually corresponds to the "truer" outside reality (you could feed an LLM lies like "the sky is plaid" in such a way that it would report that it's true), a human can't tell if the reality it's been fed actually corresponds to a "truer" outside reality (humans could be fed lies like "we are in true reality", when we're actually all NPCs in a video game for a higher level).

            The LLM can't tell if its internal reality matches an outside reality, and humans can't tell if their internal reality matches an outside reality, because both only have the input they've received to go on, and can't tell if it's problematic or incomplete.

          • gibsonf15 days ago
            Words are not reality, they are just data serialized from human world experience, without reference to the underlying meaning of those words. An LLM is unable to build the conceptual space-time model that the words reference, thus it has no understanding whatsoever of the meaning of those words. The evidence for this is everywhere in the "hallucinations" of LLMs. It's just statistics on words, and that gets you nowhere near understanding the meaning of words, that is, conceptual awareness of matter through space-time.
            • astrange5 days ago
              This is a reverse anthropic fallacy. It may be true of a base model (though it probably isn't), but it isn't true of a production LLM system, because the LLM companies have evals and testing systems and such things, so they don't release models that clearly fail to understand things.

              You're basically saying that no computer program can work, because if you randomly generate a computer program then most of them don't work.

        • dTal5 days ago
          Insofar as this is a philosophically meaningful assertion, it isn't true. LLMs live in a universe of words, it is true; within that universe, they absolutely have world models, which encode the relationships between concepts encoded by words. It's not "reality", but neither are the conceptual webs stored in human brains. Everything is mediated through senses. There's no qualitative difference between an input stream of abstract symbols, and one of pictures and sounds. Unless you think Helen Keller lacked a concept of true and false?
          • gibsonf15 days ago
            They don't have world models, they have word models. A very big difference indeed!
            • warkdarrior5 days ago
              Would you say that blind-deaf-paralyzed people do not have world models either, since they can only experience the world through words?
              • gibsonf13 days ago
                Well, if they have hearing, they can build a world model based on that sensation. So when someone talks about the fall, they can remember the sound of leaves hitting other leaves when they fall. The senses give us measurement data on reality that we use to then model reality. We humans can then create concepts about that experience, and ultimately communicate with others using common words to convey that conceptual understanding. Word data alone is just word data with no meaning. This is why when I look at a paragraph in Russian, it has no meaning for me. (As I don't understand Russian)
    • trq_5 days ago
      Yeah! I want to use the logprobs API, but you can't for example:

      - sample multiple logits and branch (we maybe could with the old text completion API, but this no longer exists)

      - add in a reasoning token on the fly

      - stop execution, ask the user, etc.

      But a visualization of logprobs in a query seems like it might be useful.

      • TZubiri5 days ago
        Can't you?

        1- option top_logprobs allows you not just to get the most likely token, but the top most likely tokens.

        You can branch by just choosing any point in your generated string and feeding it back to the LLM, for example: { "user":"what is the colour of love?", "assistant":"the colour of love is"}

        It's true that it will add an "assistant" tag, and the old completions API was better for this.
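
        Roughly like this, for example (a sketch; parameter and field names are as I remember them from the docs, so double-check the current reference, and the model name is an arbitrary choice):

          import math
          from openai import OpenAI

          client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

          resp = client.chat.completions.create(
              model="gpt-4o-mini",
              messages=[
                  {"role": "user", "content": "what is the colour of love?"},
                  # crude "branching": seed the assistant turn with a chosen prefix
                  {"role": "assistant", "content": "the colour of love is"},
              ],
              max_tokens=10,
              logprobs=True,
              top_logprobs=5,
          )

          # print each generated token with its probability and the runners-up,
          # which is already most of a per-token uncertainty visualizer
          for step in resp.choices[0].logprobs.content:
              alts = ", ".join(f"{alt.token!r}:{math.exp(alt.logprob):.2f}"
                               for alt in step.top_logprobs)
              print(f"{step.token!r} p={math.exp(step.logprob):.2f} | {alts}")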

  • lasermike0265 days ago
    Currently LLMs do not have executive or error-detection cognitive abilities. There is no theory of self or emotional instinct and imperatives. At the moment LLMs are just mindless statistical models.
    • bbstats5 days ago
      Reminds me of hackernews commenters that don't read the article and only read the headline
    • mhh__5 days ago
      Are there any falsifiable theories for humans?

      It doesn't really bother me if they're mindless. It doesn't seem essential to me that we have free will, even

    • cj5 days ago
      > LLMs do not have […] error detection […] abilities

      Are you saying the beginning of the article where it describes how the next token is predicted, how it’s possible to know the distribution of possible next tokens, isn’t accurate?

      • reshlo5 days ago
        A statistical model which is instructed to output the token that is most likely to come next doesn’t have “confidence” in its choice based on the distribution of possible tokens. We might, but it cannot. A statistical model cannot be confident or unsure. It has no mind.

        It also has no concept of what it means for the choice of token to be an “error” or not, or what a “correct” answer would be.

        • astrange5 days ago
          The model does not "output the token that is most likely to come next". The model provides a list of probabilities and the sampler algorithm picks one; those are two different components.
          • reshlo5 days ago
            The point is that neither the model nor the sampler algorithm can possibly have “confidence” in its behaviour or the system’s collective behaviour.

            If I put a weight on one side of a die, and I roll it, the die is not more confident that it will land on that side than it would be otherwise, because dice do not have the ability to be confident. Asserting otherwise shows a fundamental misunderstanding of what a die is.

            The same is true for LLMs.

            • astrange5 days ago
              I think it's better to say that it's not grounded in anything. (Of course, the sampler is free to verify it with some external verifier, and then it would be.)

              But there are algorithms with stopping conditions (Newton-Raphson, gradient descent), and you could say that an answer is "uncertain" if it hasn't run long enough to come up with a good enough answer yet.

              • reshlo5 days ago
                If we run the Newton-Raphson algorithm on some input and it hasn’t run long enough to come up with a good enough answer yet, then we are uncertain about the answer. It is not the case that the algorithm is uncertain about the answer. It would make no sense to make any claims about the algorithm’s level of certainty, because an algorithm does not have the capacity to be certain.
                • astrange4 days ago
                  I'm not the one doing the arithmetic here, I've outsourced it to the computer. So I don't have any calculated uncertainty because I'm not paying enough attention to know how much progress it's made.
                  • reshlo4 days ago
                    The important part is that the algorithm doesn’t either.
        • jamilton5 days ago
          "confidence" doesn't have to be an emotional state. It's essentially just another word for "probability" here - any model's confidence of X is the probability it yields for X. Isn't this common terminology?
          • reshlo5 days ago
            It may be terminology that some people use in that way, but it’s becoming increasingly common for people describing LLMs to use such terminology to mean that the LLM literally has the capacity for understanding.

            Personally, until recently I can only recall people saying things along the lines of “applying the model indicates that we can state this fact about the data with this much confidence”, never “the model has this much confidence” in some truth statement, especially one independent of its training data.

        • og_kalu5 days ago
          All the research we have on this points pretty blatantly to everything you've just said being untrue.

          Yes, LLMs have a pretty good idea of the uncertainty and truth of their predictions internally. https://news.ycombinator.com/item?id=41418486

          • reshlo5 days ago
            You’re missing my point. Take one of the articles described in that comment, titled “The Internal State of an LLM Knows When It's Lying”. It states “In this paper, we provide evidence that the LLM's internal state can be used to reveal the truthfulness of statements.” Both of these are untrue, for a number of reasons.

            - An LLM knowing when it is lying is not the same thing as its internal state being able to “reveal the truthfulness of statements”. The LLM does not know when it is lying, because LLMs do not know things.

            - It is incapable of lying, because lying requires possessing intent to lie. Stating untrue things is not the same as lying.

            - As the paper states shortly afterwards, what it actually shows is “given a set of test sentences, of which half are true and half false, our trained classifier achieves an average of 71% to 83% accuracy”. That’s not the same thing as it being able to “reveal the truthfulness of statements”.

            No intellectually honest person would claim that this finding means an LLM “knows when it is lying”.

            • og_kalu5 days ago
              I'm not missing your point. I just don't think you're making one.

              You keep saying the same nonsense over and over again. "An LLM does not know things, so..." What kind of argument is that? You're working backwards from a conclusion that is nothing but your own erroneous convictions about what a "statistical model" is, and undertaking a whole lot of mental gymnastics to stay there.

              There are a lot of papers there that all try to approach this in different ways. You should read them and try to make an honest argument, one that doesn't boil down to "this doesn't count because [claim that is in no way empirically or theoretically validated]."

              • reshlo4 days ago
                You are the one claiming that LLMs are conscious, so it falls to you to prove it.

                I argued that LLMs do not have the capacity to have ideas or to know things, and you tried to prove me wrong by providing examples of papers that show, for example, that LLMs have internal states that can be used to predict the likelihood that what they will output will be facts. But that doesn’t disprove what I said, because that’s not what it means to have ideas or know things. By definition, only conscious beings can do those things.

                • og_kalu4 days ago
                  >You are the one claiming that LLMs are conscious, so it falls to you to prove it.

                  If a machine is doing things previously ascribed to "conscious beings", then it's on you to tell me why the machine is not conscious. Hopefully with something other than the circular "it cannot be conscious, so it is not conscious".

                  But whatever. I hadn't quite realized this had devolved into a debate on consciousness. I think that's on me but I have no interest in a back and forth on such an ill-defined, ill-understood concept.

                  You don't know what consciousness is, what is required for it, or what makes it tick in you, and you have no way of proving one way or another that anybody else has it. It's extremely silly then, don't you think, to make such bold declarations about what doesn't have it, especially with circular arguments?

                  What difference does it make if you won't call it conscious if it does anything a conscious being does? That's just semantics.

                  • reshlo4 days ago
                    You’re still failing to understand that a model being able to output a prediction of something is not the same thing as it “knowing” that thing. The Newton-Raphson method doesn’t “know” what the root of a function is, it just outputs an approximation of it.

                    > It’s extremely silly then don’t you think to make such bold declarations on what doesn’t have it?

                    I don’t find it particularly bold to respond to your assertion that a piece of mathematics is sentient life by stating that you haven’t proven that it is, and that in the absence of that proof, the most rational position is to continue to believe that it is not, as we have done for millennia. The burden of proof is on you.

                    > if it does anything a conscious being does

                    You haven’t shown that it can do anything that only conscious beings can do.

                    Being able to generate a passable approximation of text that might follow some prompt doesn’t mean that it understands the prompt, or its answer. As an obvious example, if you give LLMs maths problems, they change their answers if you change the names of the people in the question. They’re not actually doing maths.

                    > Notice anything? It’s not just that the performance on MathGLM steadily declines as the problems gets bigger, with the discrepancy between it and a calculator steadily increasing, it’s that the LLM based system is generalizing by similarity, doing better on cases that are in or near the training set, never, ever getting to a complete, abstract, reliable representation of what multiplication is.[0]

                    [0] https://garymarcus.substack.com/p/math-is-hard-if-you-are-an...

                    • og_kalu4 days ago
                      >You’re still failing to understand that a model being able to output a prediction of something is not the same thing as it “knowing” that thing. The Newton-Raphson method doesn’t “know” what the root of a function is, it just outputs an approximation of it.

                      That is your assertion. I'm not failing to understand anything. I'm simply telling you that you are stating an unproven assertion. This is why I don't like to debate consciousness.

                      Unless you believe in magic, the only thing that would stop whatever is running Newton-Raphson from "knowing" roots (if you are even right) is that it's not the kind of computation that "knows", not the fact that it's a computation.

                      >I don’t find it particularly bold to respond to your assertion that a piece of mathematics is sentient life by stating that you haven’t proven that it is, and that in the absence of that proof, the most rational position is to continue to believe that it is not, as we have done for millennia. The burden of proof is on you.

                      The brain computes, and unless you believe in a soul or something similar, that is all the brain does to produce consciousness. Computation is substrate independent [0]. Whether it is chemical reactions and nerve impulses, transistors in chips, or even pulleys, it does not matter at all what is performing the computation.

                      Consciousness is clearly an emergent property. Your neurons are not conscious and they do not do conscious things, and yet you believe you are conscious. "Piece of mathematics" is entirely irrelevant here.

                      >You haven’t shown that it can do anything that only conscious beings can do. Being able to generate a passable approximation of text that might follow some prompt doesn’t mean that it understands the prompt, or its answer.

                      I know LLMs understand because of the kind of responses I get to the kind of queries I give them. This is how we probe and test understanding in humans.

                      >As an obvious example, if you give LLMs maths problems, they change their answers if you change the names of the people in the question.

                      No, they don't. If you'd actually read that Apple paper (I assume that's what you are referring to), you would see that GPT-4o, o1-mini and o1-preview do not shift above or below the margin of error on 4 of the 5 synthetic benchmarks they created, and definitely not on the ones that only changed names. So this is blatantly wrong. Changing names literally does nothing to today's state-of-the-art LLMs.

                      That Gary Marcus blog is idiotic, but I don't expect much from Gary Marcus. There is not a single human on this planet who can perform arithmetic unaided (no calculator, no writing down numbers) better than SOTA LLMs today. I guess humans don't understand or do math.

                      Not to mention that you can in fact train transformers that will generalize perfectly on addition.[1]

                      [0] https://www.edge.org/response-detail/27126

                      [1] https://www.alignmentforum.org/posts/N6WM6hs7RQMKDhYjB/a-mec...

      • joe_the_user5 days ago
        It's definitely not accurate to equate that sort of prediction error or other internal value with an overall measure of the confidence, accuracy, "truth", etc. of the language the LLM produces.
    • aoeusnth15 days ago
      I find they do have very sophisticated emotional intelligence and theory of self. If you do not, I suppose you must not have very much curiosity to push the boundaries of what is possible with them.
    • ekianjo5 days ago
      There is no theory of self that works for humans either, so I'm not sure what your point is.
  • 3wolf5 days ago
    > Branching predictions involves following a few logits to see what other tokens they lead to. This is often called MCTS (Monte Carlo Tree Search) and is a method that has been often tried in LLMs to middling success. One of the tradeoffs of branching is that it requires using inference compute in a way where the branches cannot benefit from each others compute.

    I wonder if speculative decoding could help here? E.g. have some small model draft predictions for the branches in parallel and have the big model verify the most promising one.
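
    Something like this, as a toy sketch (draft_next and target_probs are hypothetical stand-ins for the small and big models; real speculative decoding uses a proper accept/reject rule and scores all drafted positions in one forward pass):

      import numpy as np

      def speculate(prefix, draft_next, target_probs, k=4):
          """Draft k tokens with the small model, then keep the longest
          prefix of them that the big model would also have chosen."""
          drafted, ctx = [], list(prefix)
          for _ in range(k):
              tok = draft_next(ctx)            # small model proposes the next token
              drafted.append(tok)
              ctx.append(tok)

          accepted, ctx = [], list(prefix)
          for tok in drafted:
              p = target_probs(ctx)            # big model's distribution at this position
              if int(np.argmax(p)) == tok:     # simplistic "verify" step
                  accepted.append(tok)
                  ctx.append(tok)
              else:
                  break                        # big model disagrees; stop accepting here
          return accepted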

  • bjornsing5 days ago
    I like the branching idea, but I’m not a big fan of inserting “think tokens”. It sort of goes against my ML philosophy, which is to stay on (or close to) the narrow mathematically sound path. So I’d be interested to see how this compares to the mathematically sound approach of MCTS for the highest probability completion (which is not necessarily the same as the greedy / argmax search for the same).
  • sillying5 days ago
    I have a simple question. Suppose that to answer a question I can use different phrasings: I know the answer, but I have several ways to express it. Does an LLM in this case produce tokens with high or low entropy?

    Edited several times: I think that to avoid this problem the LLM's answer should be constrained in expression (say yes or no, fill in the blanks, etc.). In that case I think we would see a decreasing sequence of entropies for the next-token predictions.

    • trq_5 days ago
      In this case it would be a low entropy, high varentropy situation. It's confident in a few possible answers, like if it's a set of synonyms.
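
      For reference, the two quantities as they're computed from the next-token distribution (toy numpy, not the exact Entropix code):

        import numpy as np

        def entropy_varentropy(logits):
            z = logits - logits.max()
            p = np.exp(z) / np.exp(z).sum()     # softmax
            surprisal = -np.log(p + 1e-12)      # "surprise" of each token
            entropy = float((p * surprisal).sum())                       # expected surprise
            varentropy = float((p * (surprisal - entropy) ** 2).sum())   # variance of surprise
            return entropy, varentropy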
  • mhh__5 days ago
    A technique perhaps: SumSquare/SquareSum (the inverse of the probability of drawing two marbles of the same colour from a bag) is a nice smooth scalar "generalisation" (consider {0}) of counting. This could be applied here, e.g. if the LLM effectively has only ~1.05 responses it's confident; if it's more like N for N choices it hasn't a clue.
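
    In code, something like this (hypothetical snippet, applied to the next-token probabilities):

      import numpy as np

      def effective_choices(p):
          """(sum p)^2 / sum(p^2): ~1 means one dominant option,
          ~N means N roughly equal options."""
          p = np.asarray(p, dtype=float)
          return float(p.sum() ** 2 / (p ** 2).sum())

      effective_choices([0.97, 0.01, 0.01, 0.01])   # ~1.06, confident
      effective_choices([0.25, 0.25, 0.25, 0.25])   # 4.0, no idea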
  • amanaplanacanal5 days ago
    Calling what is happening here "reasoning" is just nonsense.
    • wellbehaved4 days ago
      Likewise the use of the term "certain" is merely metaphorical.
  • sporkland5 days ago
    I've asked ChatGPT to state its confidence after an answer, and it has mostly said it's very confident, except one time when the question was pretty ambiguous.
  • 65105 days ago
    As someone with a website that is a historic archive of conspiratorial and proto-scientific unbelievables I'd say we need a believability rating for each author, org and website.

    I'm getting a little tired of people thinking I believe everything I read and publish. If you claim to have invented a time machine, a teleportation device, a phone to call the dead or if you take pictures back in time of course someone should document every tiny technical detail you've shared with the world. (preferably without repeatedly stating the obvious)

    The idea that a reader would believe everything strikes me as rather hilarious, even if it's just a robot. LLMs should aid those skilled in the art who desire to make the same with the materials, but it would be silly if they uncritically reproduced the description of your warp drive, your parallel universe detector, Mr. Fusion, sentient black goo, channelings and remote viewings, alien encounters, bigfoot sightings, shape-shifting lizard experiences, quantum computer or memristors.

    • svachalek5 days ago
      As you have no doubt encountered with your archive, readers don't believe everything; they believe what they want to. In many cases that means rejecting the truth and believing the story. AI only knows what it's been told; it doesn't even have senses to compare against its own experience.
  • akomtu5 days ago
    LLMs simply answer the question: given this corpus of text you've read so far, what's the most probable next word? If half of the training dataset says the next word in similar conditions is A, and the other half says it's B, then LLMs will be "uncertain" whether it's A or B, but LLMs will be oblivious to the fact that both A and B are wrong, because most of the training dataset was LLM-generated slop.

    The current stage of extracting the essence of reason from LLMs feels a lot like medieval attempts to extract gold from iron.

  • fsndz5 days ago
    Nice. A similar idea was recently used to detect "ragallucinations" (RAG hallucinations). The key is using logits when they're provided. Reading the clash eval paper was super insightful: https://www.lycee.ai/blog/rag-ragallucinations-and-how-to-fi...
    • trq_5 days ago
      Yeah, I wish more LLM APIs offered internal insights like logits; right now I think only OpenAI does, and it started recently.
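
      For anyone who wants to try it, the request looks roughly like this with the OpenAI Python SDK (the model name is just an example, the API returns log-probabilities rather than raw logits, and field names may vary by SDK version):

        from openai import OpenAI

        client = OpenAI()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",   # example model
            messages=[{"role": "user", "content": "Is the sky blue? Answer yes or no."}],
            logprobs=True,
            top_logprobs=5,        # up to 5 alternatives per output position
            max_tokens=1,
        )
        # Top alternatives (and their log-probabilities) for the first output token.
        for alt in resp.choices[0].logprobs.content[0].top_logprobs:
            print(alt.token, alt.logprob)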
  • weitendorf5 days ago
    I think the authors are making a faulty assumption that single-token uncertainty requires intervention or is a sign that the model needs extra help, by conflating the immediately apparent and measurable choice of the next token with the not-immediately-apparent (because it requires generating multiple tokens in sequence, which can have a very high branching factor), not-easily-measured (because sentences with entirely different words can mean the same thing) decision to generate an answer with desired/correct semantics.

    This is a subtle and understandable mistake, but I do suspect it's why they note at the top "A big caveat, there have been no large scale evals yet for Entropix, so it’s not clear how much this helps in practice. But it does seem to introduce some promising techniques and mental models for reasoning." I would like to see more evidence that High Entropy, Low Varentropy when deciding on a single token measurably corresponds with bad outcomes before accepting that there is any merit to this approach.

    A thought experiment - is a model with consistently low (or zero) entropy/varentropy desirable? First, it essentially means that the model makes no distinction in the semantics of different sequences of tokens in its answers, which, due to the way models are trained, also indicates that it probably makes no distinction in the semantics of different sequences of tokens when processing input. That's bad, because that's not how language works. It also probably means that all the information encoded in the model's weights is "uncompressed" and doesn't generalize properly - the model may know that the sky was blue yesterday because it's in its training data, but how is it to know whether it was blue today, or whether it would be blue on a fictional planet with all the same physical characteristics as Earth? It's like saying you prefer your model to be overfit.

    Another thought experiment - when you're starting a sentence, does it matter in the slightest whether you are highly predisposed to using "the" (low entropy+varentropy), split between using "the" or "a" (low entropy, high varentropy), considering many different definite/demonstrative words with no clear preference (high entropy, low varentropy), or considering many different definite/demonstrative words with a clear preference for "the" (high entropy+varentropy)? It doesn't mean you're uncertain of the semantic meaning of the answer you're about to give. If you were to do as they suggest and take it as an indicator to think more deeply before responding, you'd not only waste time in your response (this is literally the same thing as people saying "um" and "uh" a lot when talking, which is considered bad) but distract yourself from the choice of answering with the right semantics by fussing over the choice of the first word, which doesn't actually matter.

  • wantsanagent5 days ago
    Please please keep your Y axis range consistent.
  • ttpphd5 days ago
    LLMs do not model "certainty". This is illogical. They model the language corpus you feed them.
    • tylerneylon5 days ago
      Essentially all modern machine learning techniques have internal mechanisms that are very closely aligned with certainty. For example, the output of a binary classifier is typically a floating point number in the range [0, 1], with 0 being one class, and 1 representing the other class. In this case, a value of 0.5 would essentially mean "I don't know," and answers in between give both an answer (round to the nearest int) as well as a sense of certainty (how close was the output to the int). LLMs offer an analogous set of statistics.
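
      As a toy illustration of that (hypothetical code, just the standard reading of a sigmoid output):

        import math

        def classify(logit):
            p = 1 / (1 + math.exp(-logit))   # sigmoid output in [0, 1]
            label = round(p)                 # 0 or 1: the predicted class
            confidence = abs(p - 0.5) * 2    # 0 = "I don't know", 1 = fully certain
            return label, confidence

        classify(0.1)   # ~(1, 0.05): barely leaning toward class 1
        classify(4.0)   # ~(1, 0.96): quite certain it's class 1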

      Speaking more abstractly or philosophically, why could a model never internalize something read between the lines? Humans do, and we're part of the same physical system — we're already our own kinds of computers that take away more from a text than what is explicitly there. It's possible.

    • astrange5 days ago
      You don't have to teach a transformer model using a language corpus, even if that was the pretraining. You can e.g. write algorithms directly and merge them into the model.

      https://github.com/yashbonde/rasp

      https://github.com/arcee-ai/mergekit

    • menhguin5 days ago
      Recent research using SAEs suggests that some neurons regulate confidence/certainty: https://arxiv.org/abs/2406.16254
  • tech_ken5 days ago
    "Thinking token" is an interesting concept, is there more literature on that?
  • chx5 days ago
    Detecting when LLMs are Uncertain?

    return true;

    There, I didn't need a paper to answer the question.