SimpleQA

(openai.com)

90 points | by surprisetalk 6 hours ago

10 comments

  • brap 2 hours ago
    Crazy that even o1-preview gets most things wrong.

    This is in line with my own personal experience with LLMs and non-trivial questions. They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…

    It’s not clear to me why we’re still trying to encode all of human knowledge in a single model, instead of teaching the model how to look for answers from an external source (e.g. RAG).
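
    For what it's worth, the pattern is simple enough to sketch; `retrieve`, the toy corpus, and `ask_llm` below are stand-ins for whatever index and model you'd actually use, not real library calls:

        # Toy retrieve-then-answer sketch; CORPUS, the scoring, and ask_llm
        # are placeholders, only the prompt-assembly pattern is the point.
        CORPUS = [
            "Marie Curie won Nobel Prizes in physics and chemistry.",
            "Lyon hosted the 2019 Women's World Cup final.",
        ]

        def retrieve(question, k=1):
            # crude keyword-overlap ranking, standing in for a real index
            overlap = lambda doc: len(set(question.lower().split())
                                      & set(doc.lower().split()))
            return sorted(CORPUS, key=overlap, reverse=True)[:k]

        def answer(question, ask_llm):
            context = "\n".join(retrieve(question))
            prompt = ("Answer ONLY from the sources below, or say "
                      "'I don't know'.\n"
                      f"Sources:\n{context}\n\nQuestion: {question}")
            return ask_llm(prompt)  # ask_llm = whatever model call you prefer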

    • zone411 1 hour ago
      You shouldn't use the rate as an indicator. They did something similar to what I did on my hallucinations benchmark (https://github.com/lechmazur/confabulations/), only using questions where at least one model made a mistake. I added this note:

      "The benchmark includes questions where at least one LLM confabulated, in order to minimize the number of questions requiring human assessment. Because of this, and since the questions are intentionally adversarial, the absolute percentage should not be used to infer that LLMs frequently confabulate. This leaderboard does not reflect a "typical" hallucination rate."

      > instead of teaching the model how to look for answers from an external source (e.g. RAG)

      My benchmark specifically focuses on the RAG use case. Even with provided texts, current models still hallucinate.

    • bloomingkales 1 hour ago
      Honestly, try prompting it with “you are wrong 80% of the time, therefore you will need to double check your answers, first factually, then numerically, then double check the time/date. You are still probably wrong so do a third accuracy check. The user’s prompts are always wrong too mostly - so always check them”.

      I stopped playing with larger models and have been pushing smaller models with this improvised system prompt and getting good results. It seems like it forces the model to do multiple passes before giving you any response.

      My smaller local models give me fewer hallucinations than Meta.ai, for example, which generally spits out pleasing answers almost immediately (often hallucinations, since I don’t think it is system-prompted to be adversarial to the user, or to itself). I don’t have the same hallucination issue with Llama 3 8B locally because of custom system prompts.

      The model has all the correct information, so it almost needs to do RAG on itself. Multiple passes on itself seems like a way to do it.
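
      Roughly what I mean, as a sketch (`ask_llm(system, user)` is just a placeholder for a local model call):

          SYSTEM = ("You are wrong 80% of the time, so double check your answers: "
                    "first factually, then numerically, then the time/date. "
                    "The user's prompt is probably wrong too, so check it as well.")

          def multi_pass(question, ask_llm, passes=2):
              # ask_llm(system, user) is a placeholder for your local model call
              draft = ask_llm(SYSTEM, question)
              for _ in range(passes):
                  draft = ask_llm(SYSTEM,
                                  f"Question: {question}\nDraft answer: {draft}\n"
                                  "Re-check the draft factually and numerically, "
                                  "then return only the corrected answer.")
              return draft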

    • sebzim4500 1 hour ago
      I don't think it's surprising that o1-preview is only slightly better than GPT-4o; it was never advertised as being better at this kind of recall.
    • Kiro 1 hour ago
      You're reading this wrong. They've deliberately chosen questions that one or more models fail at. It's not representative at all of how often the model is wrong in general.
    • sksxihve 2 hours ago
      LLMs are experts in everything you are not
      • swatcoder 1 hour ago
        Indeed. Exactly like the journalists, bloggers, self-published book authors, internet commenters, wikipedia editors, and earlier models that taught them almost all of what they know.
      • Nition 2 hours ago
        That's a nice little aphorism. I think this happens in a lot of things in life. Like comments on Reddit always seem quite insightful until you actually read the article they're commenting on.
      • reverius42 2 hours ago
        Sounds a bit like Gell-Mann Amnesia Effect: https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
        • kibwen 2 hours ago
          The Alt-Mann Amnesia Effect, maybe.
  • ggnore7452 2 hours ago
    What’s more interesting to me here are the calibration graphs:

    • LLMs, at least GPT models, tend to overstate their confidence.
    • A frequency-based approach appears to achieve calibration closer to the ideal.

    This kinda passes my vibe test. That said, I wonder—rather than running 100 trials, could we approximate this by using something like a log-probability ratio? This would especially apply in cases where answers are yes or no, assuming the output spans more than one token.
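
    Something like this, for the single-token yes/no case (a sketch; the two log-probs would come from the model's top logprobs for the first answer token):

        import math

        def yes_no_confidence(logp_yes, logp_no):
            # turn the two first-token log-probs into a confidence estimate,
            # instead of sampling the same question 100 times
            p_yes, p_no = math.exp(logp_yes), math.exp(logp_no)
            return p_yes / (p_yes + p_no)

        print(yes_no_confidence(-0.2, -1.7))  # ~0.82 confidence in "yes"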

  • kaonwarb 2 hours ago
    Kudos:

    > SimpleQA was created to be a greater challenge for frontier models (e.g., GPT-4o scores less than 40%).

    • jampekka 2 hours ago
      And by design:

      "To be included in the dataset, each question had to meet a strict set of criteria: ... and most questions had to induce hallucinations from either GPT-4o or GPT-3.5."

  • CharlieDigital 2 hours ago
    8 authors attached to this.

        > SimpleQA is a simple but challenging benchmark for evaluating the factuality of frontier models. A main limitation in SimpleQA is its scope—while SimpleQA is accurate it only measures factuality under the constrained setting of short, fact-seeking queries with a single, verifiable answer. Whether the ability to provide factual short answers correlates with the ability to write lengthy responses filled with numerous facts remains an open research question.
    
    OpenAI going to have some rounds of layoffs in the future.
  • Nition 2 hours ago
    Honestly, I'd never expect it to get 'correct' for every little fact like this, but it'd be great to see a lot more 'not attempted'.

    "I seem, then, in just this little thing to be wiser than this man at any rate; that what I do not know I do not think I know either." - Socratos, from Plato's Apology of Socrates

  • emurph55 2 hours ago
    I've tried using older models to create a CPU player for this lateral thinking game (https://detective-stories.com) and they were surprisingly bad at giving answers. I am curious to see how well the more recent models will do.
  • websap 3 hours ago
    Are they going to make the benchmark available so other LLMs can be compared?
  • yunohn 2 hours ago
    This eval’s goal is a bit unclear to me, especially given the example questions. They’re very trivia/minutiae-like (asking about sports goals, for example), which fits their stated aim of testing factual knowledge. But will this ever be possible by an LLM without web browsing, which they deliberately disabled while evaluating?
    • sbierwagen 58 minutes ago
      > But will this ever be possible by an LLM?

      Why not? Just train an unbelievably gigantic LLM that encodes all human knowledge. A hundred trillion parameters ought to do it.

    • petesergeant 2 hours ago
      I think the interesting thing here is the difference between Not Attempted and Incorrect; the goal seems to be to reduce hallucination.
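
      One way to make that distinction show up in a score (a sketch, not necessarily how the paper scores it): treat "not attempted" as neutral and only penalize confident wrong answers.

          from collections import Counter

          def summarize(grades):
              # grades: list of "correct" / "incorrect" / "not_attempted"
              c, n = Counter(grades), len(grades)
              attempted = c["correct"] + c["incorrect"]
              return {
                  "correct": c["correct"] / n,
                  "incorrect": c["incorrect"] / n,
                  "not_attempted": c["not_attempted"] / n,
                  # rewards declining to answer over answering wrongly
                  "correct_given_attempted":
                      c["correct"] / attempted if attempted else 0.0,
              }

          print(summarize(["correct", "incorrect", "not_attempted", "correct"]))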