OpenAI’s newest reasoning models, o3 and o4‑mini, produce made‑up answers more often than the company’s earlier models, as shown by internal and external tests.
The rise in so‑called hallucinations breaks a long‑running pattern in which each new release tended to hallucinate less than its predecessor.
OpenAI’s own numbers put the problem in stark terms. On PersonQA, a company benchmark that checks how well a model recalls facts about people, o3 invented material in 33 percent of responses, roughly double the rates logged by o1 and o3‑mini, which scored 16 percent and 14.8 percent. The o4‑mini model fared even worse, hallucinating 48 percent of the time.
A technical report details the findings. Engineers write that the new models outperform earlier versions in coding and math, yet because they “make more claims overall,” they also make “more accurate claims as well as more inaccurate / hallucinated claims.” The document adds that “more research is needed” to explain the slide in reliability.
OpenAI classifies o‑series systems as reasoning models, a category the firm and much of the industry have embraced over the past year. Traditional, non‑reasoning models still beat the new duo on truthfulness: GPT‑4o with web search, for instance, achieves 90 percent accuracy on SimpleQA, another in‑house benchmark.
OpenAI’s o3 model is making up steps
Transluce, a nonprofit AI research lab, reported that o3 fabricates steps in its reasoning. In one run, the model claimed it had executed code on a 2021 MacBook Pro “outside of ChatGPT” and then copied the results back into its answer, something the model is simply not capable of doing.
“Our hypothesis is that the kind of reinforcement learning used for o‑series models may amplify issues that are usually mitigated (but not fully erased) by standard post‑training pipelines,” said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email.
Transluce co‑founder Sarah Schwettmann said the higher error rate could make o3 less helpful than its raw skills suggest.
Kian Katanforoosh, a Stanford adjunct professor, told TechCrunch his team is already testing o3 for coding tasks and sees it as “a step above the competition.” Yet he reported another flaw: the model often returns web links that do not work when clicked.
Hallucinations can spur creativity, but they make the systems a tough sell for businesses that need accuracy. A law firm drafting contracts, for example, is unlikely to tolerate frequent factual mistakes.
Real-time search could reduce hallucinations in AI models
One possible solution is real‑time search. OpenAI’s GPT‑4o version, which consults the web, already scores better on SimpleQA. The report suggests the same tactic could cut hallucinations in reasoning models, at least when users are willing to send prompts to a third‑party engine.
“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” OpenAI spokesperson Niko Felix said in an email.
Whether real-time search alone will solve the problem remains unclear. The report warns that if scaling up reasoning models keeps worsening hallucinations, the hunt for fixes will grow more urgent. Researchers have long called hallucinations one of the hardest issues in AI, and the latest findings underline how far there is to go.
Credibility matters for OpenAI as ChatGPT spreads into workplaces, classrooms, and creative studios. Engineers say they will keep tuning reinforcement learning, data selection, and tool use to bring the numbers down. Until then, users must balance sharper skills against a higher chance of being misled.