OpenAI Launches GPT-5.4 with 1 Million Token Context and Record Benchmark Scores
The new flagship model scores 83% on OpenAI's GDPval knowledge-work benchmark, leads on computer-use tests, and cuts errors by a third over GPT-5.2.
OpenAI on Thursday released GPT-5.4, billed as its "most capable and efficient frontier model for professional work," available in three variants: a standard version, GPT-5.4 Thinking for step-by-step reasoning tasks, and GPT-5.4 Pro for maximum performance. The API version of the model supports context windows as large as 1 million tokens — by far the largest available from OpenAI — enabling it to process entire codebases, lengthy legal documents, or extended research threads in a single call.
The model set new records on several benchmarks. On OpenAI's GDPval test for economically valuable knowledge work, GPT-5.4 scored 83%, the highest result to date.
It also led on the OSWorld-Verified and WebArena Verified computer-use benchmarks, and took the top position on Mercor's APEX-Agents benchmark, which evaluates professional-grade tasks in law and finance. Compared with GPT-5.2, the new model is 33% less likely to make errors in individual factual claims, and overall responses are 18% less likely to contain errors.
OpenAI also overhauled how the API handles tool calling, introducing a system called Tool Search that lets models look up tool definitions dynamically rather than loading every definition into the system prompt. The company said this makes requests faster and cheaper in systems with large numbers of available tools.
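The article does not describe Tool Search's actual API, but the idea of dynamic lookup can be illustrated with a minimal sketch. Everything here is hypothetical: the registry, the `search_tools` helper, and the tool definitions are invented for illustration and do not reflect OpenAI's real interface.

```python
# Hypothetical sketch: instead of sending every tool definition with each
# request, the client exposes a single search capability and resolves full
# definitions only when the model asks for them by keyword.

TOOL_REGISTRY = {
    "get_weather": {
        "description": "Fetch current weather for a city",
        "params": {"city": "string"},
    },
    "get_stock_price": {
        "description": "Look up the latest stock price",
        "params": {"ticker": "string"},
    },
    "send_email": {
        "description": "Send an email to a recipient",
        "params": {"to": "string", "body": "string"},
    },
}

def search_tools(query: str) -> list[dict]:
    """Return only the definitions whose name or description matches the query."""
    q = query.lower()
    return [
        {"name": name, **spec}
        for name, spec in TOOL_REGISTRY.items()
        if q in name.lower() or q in spec["description"].lower()
    ]

# The prompt now carries one search tool instead of all definitions,
# so prompt size stays roughly constant as the registry grows.
matches = search_tools("weather")
print([m["name"] for m in matches])  # ['get_weather']
```

The payoff is that per-request prompt size no longer scales with the number of registered tools, which is the cost/latency benefit the article attributes to Tool Search.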
A new safety evaluation for chain-of-thought reasoning showed that GPT-5.4 Thinking is less prone to misrepresenting its reasoning process, a concern among AI safety researchers.
The release came amid heightened political pressure following the Pentagon deal controversy and a spike in ChatGPT uninstalls the prior week. GPT-5.4 was made available immediately via the API and through ChatGPT to Pro subscribers.
Read the original reporting at TechCrunch.