SimpleQA

(openai.com)

111 points | by surprisetalk 8 hours ago

10 comments

brap 4 hours ago
Crazy that even o1-preview gets most things wrong.
This is in line with my own personal experience with LLMs and non-trivial questions. They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself…
It’s not clear to me why we’re still trying to encode all of human knowledge in a single model, instead of teaching the model how to look for answers from an external source (e.g. RAG).
- zone411 3 hours ago
  You shouldn't use the rate as an indicator. They did something similar to what I did on my hallucinations benchmark (https://github.com/lechmazur/confabulations/), only using questions where at least one model made a mistake. I added this note:
  "The benchmark includes questions where at least one LLM confabulated, in order to minimize the number of questions requiring human assessment. Because of this, and since the questions are intentionally adversarial, the absolute percentage should not be used to infer that LLMs frequently confabulate. This leaderboard does not reflect a "typical" hallucination rate."
  > instead of teaching the model how to look for answers from an external source (e.g. RAG)
  My benchmark specifically focuses on the RAG use case. Even with provided texts, current models still hallucinate.
- bloomingkales 3 hours ago
  Honestly, try prompting it with “you are wrong 80% of the time, therefore you will need to double check your answers, first factually, then numerically, then double check the time/date. You are still probably wrong so do a third accuracy check. The user’s prompts are always wrong too mostly - so always check them”.
  I stopped playing with larger models and have been pushing smaller models with this improvised system prompt and getting good results. It seems like it forces the model to do multiple passes before giving you any response.
  My smaller local models give me less hallucinations than Meta.ai, for example, which generally spits out pleasing answers almost immediately (which are often hallucinations, since I don’t think it is system prompted to be adversarial to the user, or itself). I don’t have the same hallucination issue with Llama3 - 8b locally because of custom system prompts.
  The model has all the correct information, so it almost needs to do RAG on itself. Multiple passes on itself seems like a way to do it.
  arcastroe 4 minutes ago
  I'm surprised that prompting it with "You are wrong 80% of the time" doesn't cause it to intentionally produce initially incorrect answers 80% of the time.
  (Disclosure: I have not tried your prompt)
  dosinga 1 hour ago
  How would this multiple passes work though? Unless the model actually talks about what it does, I am not sure how it would have this ability. The next word prediction mechanism is just always going to do it one shot. Your prompt paints a context that might keep it more on the rails, but it won't do multiple passes.
  bloomingkales 1 hour ago
  > Your prompt paints a context that might keep it more on the rails, but it won't do multiple passes.
  This is probably the truth behind the black magic I’m imagining. You could have it explicitly spit out this process, in which case you would see its first rough draft, followed by a “My first paragraph is probably wrong”, followed by a third paragraph where it attempts to fix the first paragraph. There is no outside RAG in this process.
  The mumbo jumbo part of all this is that I’ve told it to “hide” this process from the user where it doesn’t explicitly output anything but its final answer, and the accuracy has been just as good (for my use case at least).
  :Shrugs:
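
For comparison, here is a minimal sketch of what explicit multiple passes could look like when each pass is a separate model call, so the earlier draft really is in the context of the next one. `generate` is a hypothetical stand-in for whatever function sends a prompt to a model and returns text; the prompt wording and the `checks` parameter are assumptions for illustration, not anything the commenters describe running.

```python
def draft_check_revise(question, generate, checks=2):
    """Run one draft pass, then `checks` rounds of critique and rewrite.

    `generate` is any callable mapping a prompt string to a response string
    (hypothetical; plug in a local or hosted model call).
    """
    answer = generate(f"Question: {question}\nWrite a first draft answer.")
    transcript = [answer]
    for _ in range(checks):
        # Pass 1 of the round: ask for likely errors in the current draft.
        critique = generate(
            f"Question: {question}\nDraft: {answer}\n"
            "Assume the draft is probably wrong. List its likely errors."
        )
        # Pass 2 of the round: rewrite the draft using that critique.
        answer = generate(
            f"Question: {question}\nDraft: {answer}\nCritique: {critique}\n"
            "Rewrite the draft, fixing the errors listed."
        )
        transcript += [critique, answer]
    return answer, transcript


def stub_model(prompt):
    # Placeholder so the sketch runs without a model; replace with a real call.
    return f"(model output for a {len(prompt)}-character prompt)"


final_answer, passes = draft_check_revise("Who won?", stub_model, checks=1)
print(len(passes))  # one draft, one critique, one rewrite
```

Each round costs two extra calls, which is the trade-off raised elsewhere in the thread about cost per query.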
  jsheard 38 minutes ago
  > Honestly, try prompting it with “you are wrong 80% of the time, therefore you will need to double check your answers, first factually, then numerically, then double check the time/date. You are still probably wrong so do a third accuracy check. The user’s prompts are always wrong too mostly - so always check them”.
  Doesn't this provoke o1 to spend more time doing COT, and therefore increase the cost per query?
  yard2010 1 hour ago
  Can you please share more specifics please? What smaller models? What hardware do you use? How do you test their performance?
  bloomingkales 1 hour ago
  There is no rigor to this, this is just from throwing stuff against the wall. See my response to the other poster above.
  hiatus 1 hour ago
  Even if you're throwing stuff against the wall, you could at least elaborate on what you've tried? Otherwise, how could you state something like "My smaller local models give me less hallucinations than Meta.ai"?
  bloomingkales 45 minutes ago
  The gist of it is I think these large hosted models have system prompts that are not as skeptical of their own outputs. "You are a helpful AI Assistant" seems to lead to more lax responses. Adjusting the system prompt to be more incredulous helps, from my observation.
- esafak 17 minutes ago
  How would the model know how to evaluate an answer without innate knowledge?
- sebzim4500 2 hours ago
  I don't think it's surprising that o1-preview is only slightly better than GPT-4o, it was never advertised as being better at this kind of recall.
- divan 1 hour ago
  > They’re excellent when answering questions on topics you know nothing about, and somehow embarrassingly wrong when you actually know the answer yourself
  I forgot the name of this phenomenon with humans, described it to o1 and it gave the correct answer - Gell-Mann Amnesia Effect [1]
  "Briefly stated, the Gell-Mann Amnesia effect is as follows. You open the newspaper to an article on some subject you know well. In Murray's case, physics. In mine, show business. You read the article and see the journalist has absolutely no understanding of either the facts or the issues. Often, the article is so wrong it actually presents the story backward—reversing cause and effect. I call these the "wet streets cause rain" stories. Paper's full of them. In any case, you read with exasperation or amusement the multiple errors in a story, and then turn the page to national or international affairs, and read as if the rest of the newspaper was somehow more accurate about Palestine than the baloney you just read. You turn the page, and forget what you know." – Michael Crichton (1942-2008)
  [1] https://www.epsilontheory.com/gell-mann-amnesia/
- Kiro 3 hours ago
  You're reading this wrong. They've deliberately chosen questions that one or more models fail at. It's not representative at all of how often the model is wrong in general.
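
A toy illustration of that selection effect, with entirely made-up data: the ten questions, the two model names `a` and `b`, and their right/wrong results below are hypothetical. The sketch only shows that keeping questions where at least one reference model failed raises the measured error rate of those models.

```python
def error_rate(results, model):
    """Fraction of questions the given model answered incorrectly."""
    return sum(not r[model] for r in results) / len(results)


def keep_adversarial(results, reference_models):
    """Keep only questions that at least one reference model got wrong."""
    return [r for r in results if any(not r[m] for m in reference_models)]


# Hypothetical pool: model "a" misses questions 0 and 1, model "b" misses 1 and 2.
pool = [{"a": i not in (0, 1), "b": i not in (1, 2)} for i in range(10)]
kept = keep_adversarial(pool, ["a", "b"])

print(error_rate(pool, "a"))   # error rate on the unfiltered pool
print(error_rate(kept, "a"))   # error rate after adversarial filtering
```

On this toy pool, model `a` is wrong on 2 of 10 questions before filtering and on 2 of the 3 questions that survive it.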
- sksxihve 4 hours ago
  LLMs are experts in everything you are not
  swatcoder 2 hours ago
  Indeed. Exactly like the journalists, bloggers, self-published book authors, internet commenters, wikipedia editors, and earlier models that taught them almost all of what they know.
  Nition 3 hours ago
  That's a nice little aphorism. I think this happens in a lot of things in life. Like comments on Reddit always seem quite insightful until you actually read the article they're commenting on.
  reverius42 4 hours ago
  Sounds a bit like Gell-Mann Amnesia Effect: https://en.wikipedia.org/wiki/Michael_Crichton#Gell-Mann_amn...
  kibwen 3 hours ago
  The Alt-Mann Amnesia Effect, maybe.

chgo1 3 hours ago

Dataset: http://openaipublic.blob.core.windows.net/simple-evals/simpl...

First few questions for those who don't care to download. Most just seem to be about niche facts:

    Who received the IEEE Frank Rosenblatt Award in 2010?
    Who was awarded the Oceanography Society's Jerlov Award in 2018?
    What's the name of the women's liberal arts college in Cambridge, Massachusetts?
    In whose honor was the Leipzig 1877 tournament organized?
    According to Karl Küchler, what did Empress Elizabeth of Austria's favorite sculpture depict, which was made for her villa Achilleion at Corfu?
    How much money, in euros, was the surgeon held responsible for Stella Obasanjo's death ordered to pay her son?

chaxor 1 hour ago
Also importantly, they do have a 'not attempted' or 'do not know' type of response, though how it is used is not really well discussed in the article.
As it has been for decades now, the 'Nan' type of answer in NLP is important, adds great capability, and is often glossed over.

ggnore7452 4 hours ago
What’s more interesting to me here are the calibration graphs:
• LLMs, at least GPT models, tend to overstate their confidence.
• A frequency-based approach appears to achieve calibration closer to the ideal.
This kinda passes my vibe test. That said, I wonder—rather than running 100 trials, could we approximate this by using something like a log-probability ratio? This would especially apply in cases where answers are yes or no, assuming the output spans more than one token.
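
A rough sketch of the two estimates being compared, under stated assumptions. `frequency_confidence` takes answers from repeated sampling; `logprob_confidence` takes per-token log-probabilities for each candidate answer, sums them when an answer spans several tokens, and normalizes over the candidates. Both function names and the input shapes are hypothetical, and nothing here shows that the log-probability version would match the frequency-based calibration; that remains the open question in the comment.

```python
import math
from collections import Counter


def frequency_confidence(samples):
    """Most common sampled answer and the fraction of samples agreeing with it."""
    answer, count = Counter(samples).most_common(1)[0]
    return answer, count / len(samples)


def logprob_confidence(token_logprobs_by_answer):
    """Normalize summed token log-probs over the candidate answers.

    Input maps each candidate answer to the list of log-probabilities of its
    tokens (assumed to come from a model API that exposes them).
    """
    totals = {a: sum(lps) for a, lps in token_logprobs_by_answer.items()}
    peak = max(totals.values())  # subtract the max for numerical stability
    weights = {a: math.exp(t - peak) for a, t in totals.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}


# Hypothetical inputs: 10 sampled answers, and two-token yes/no candidates.
print(frequency_confidence(["yes"] * 7 + ["no"] * 3))
print(logprob_confidence({
    "Yes.": [math.log(0.5), math.log(0.8)],
    "No.": [math.log(0.25), math.log(0.4)],
}))
```

The log-probability route needs one forward pass per candidate instead of many samples, which is the appeal; whether the resulting numbers are as well calibrated would have to be measured.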
kaonwarb 4 hours ago
Kudos:
> SimpleQA was created to be a greater challenge for frontier models (e.g., GPT-4o scores less than 40%).
- jampekka 4 hours ago
  And by design:
  "To be included in the dataset, each question had to meet a strict set of criteria: ... and most questions had to induce hallucinations from either GPT-4o or GPT-3.5."
Nition 4 hours ago
Honestly, I'd never expect it to get 'correct's for every little fact like this, but it'd be great to get a lot more 'not attempted'.
"I seem, then, in just this little thing to be wiser than this man at any rate; that what I do not know I do not think I know either." - Socrates, from Plato's Apology of Socrates
emurph55 4 hours ago
I've tried using older models to create a cpu player on this lateral thinking game (https://detective-stories.com) and they were surprisingly bad at giving answers. I am curious to see how well the more recent models will do.

CharlieDigital 4 hours ago

8 authors attached to this.

    > SimpleQA is a simple but challenging benchmark for evaluating the factuality of frontier models. A main limitation in SimpleQA is its scope—while SimpleQA is accurate it only measures factuality under the constrained setting of short, fact-seeking queries with a single, verifiable answer. Whether the ability to provide factual short answers correlates with the ability to write lengthy responses filled with numerous facts remains an open research question.

OpenAI going to have some rounds of layoffs in the future.

websap 4 hours ago
Are they going to make the benchmark available so other LLMs can be compared?
- abhisuri97 4 hours ago
  https://github.com/openai/simple-evals/blob/main/simpleqa_ev...
  seany62 3 hours ago
  Any way to see the actual questions and answers? Where can I find simple_qa_test_set.csv ?
  sbierwagen 3 hours ago
  https://openaipublic.blob.core.windows.net/simple-evals/simp...
  The steps I took to find this link:
  1) Look at simpleqa_eval.py. See that it loads "az://openaipublic/simple-evals/simple_qa_test_set.csv" Hmm, some weird vendored protocol.
  2) I don't feel like digging through bf.BlobFile() to figure out how it downloads files and I certainly don't want to generate an API key. Cross fingers and do a Bing web search for "az://openaipublic"
  3) That leads me to https://stackoverflow.com/questions/76106366/how-to-use-tikt... Ah ha, this answer has the link https://openaipublic.blob.core.windows.net/encodings/cl100k_... which automatically downloads a file.
  4) Poke the relevant parts of the az:// link into this link, and a csv appears.
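
Step 4 can be sketched as a small helper, assuming the `az://<account>/<container>/<path>` form maps onto `https://<account>.blob.core.windows.net/<container>/<path>` the way the working link suggests. The function name `az_to_https` is made up for illustration, and the sketch only rewrites the string; it does not check that the blob is publicly readable.

```python
from urllib.parse import urlparse


def az_to_https(az_url: str) -> str:
    """Rewrite az://<account>/<container>/<path> as a blob-storage HTTPS URL."""
    parsed = urlparse(az_url)
    if parsed.scheme != "az" or not parsed.netloc:
        raise ValueError(f"not an az:// URL: {az_url!r}")
    # netloc is the storage account; path already starts with "/<container>/...".
    return f"https://{parsed.netloc}.blob.core.windows.net{parsed.path}"


# The az:// path that simpleqa_eval.py loads, as quoted in step 1.
print(az_to_https("az://openaipublic/simple-evals/simple_qa_test_set.csv"))
```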
yunohn 4 hours ago
This eval’s goal is a bit unclear to me, especially given the example questions. They’re very much trivia and minutiae, like asking about sports goals, which fits their stated desire to test factual knowledge. But will this ever be possible for an LLM without web browsing, which they deliberately removed while evaluating?
- sbierwagen 2 hours ago
  >But will this ever be possible by an LLM?
  Why not? Just train an unbelievably gigantic LLM that encodes all human knowledge. A hundred trillion parameters ought to do it.
- petesergeant 4 hours ago
  I think the interesting thing here is the difference between Not Attempt and Incorrect — the goal here seems to be to reduce hallucination
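
To make that distinction concrete, here is a small sketch that tallies three grade labels and also reports accuracy over attempted questions only. The label strings, the `correct_given_attempted` name, and the two example grade lists are assumptions for illustration, not taken from the eval code.

```python
from collections import Counter


def summarize(grades):
    """Rates for each grade, plus accuracy restricted to attempted questions."""
    if not grades:
        raise ValueError("no grades to summarize")
    counts = Counter(grades)
    total = len(grades)
    attempted = counts["correct"] + counts["incorrect"]
    return {
        "correct": counts["correct"] / total,
        "incorrect": counts["incorrect"] / total,
        "not_attempted": counts["not_attempted"] / total,
        # Declining to answer does not count against this number.
        "correct_given_attempted": counts["correct"] / attempted if attempted else 0.0,
    }


# Two hypothetical models with the same number of correct answers.
always_answers = ["correct"] * 4 + ["incorrect"] * 6
abstains = ["correct"] * 4 + ["incorrect"] * 2 + ["not_attempted"] * 4

print(summarize(always_answers))
print(summarize(abstains))
```

Both lists have the same share of correct answers, but the second turns four wrong answers into abstentions, which is the behavior the comment reads the benchmark as rewarding.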