AI Gets an IQ Score: New Benchmarking Site Ranks 50+ Models — and the Results Are Controversial

How smart is ChatGPT, really? A new project called AI IQ is attempting to answer that question with the most familiar — and most controversial — metric in psychology: the IQ score.

Launched by engineer and angel investor Ryan Shea, AI IQ (aiiq.org) plots more than 50 frontier AI models on a standard bell curve, assigning each an estimated intelligence quotient based on performance across 12 benchmarks grouped into four dimensions: abstract reasoning, mathematical reasoning, programmatic reasoning, and academic reasoning. The results have ricocheted across social media, drawing both praise for making a complex market legible and sharp criticism for reducing messy reality to a single number.

## The Rankings: A Tight Race at the Top

As of mid-May 2026, OpenAI's GPT-5.5 sits at the peak of the AI IQ bell curve with an estimated score near 136, making it the highest-ranked model on the site. But the gap between the leaders has never been smaller. Anthropic's Opus 4.7 registers around 132, Google's Gemini 3.1 Pro lands near 131, and GPT-5.4 comes in around 129. The top cluster is tightly compressed: by this measurement, the four models span roughly seven IQ points.

Below the frontier tier, a crowded midfield has emerged. Chinese models including Kimi K2.6, GLM-5, DeepSeek-V3.2, and Qwen3.6 cluster between roughly 112 and 118, offering competitive performance at dramatically lower prices. For enterprises that don't need the absolute best model for every task, this cost-performance tier has become increasingly attractive.

## How the Scoring Works

AI IQ's methodology is deceptively straightforward. Each model's raw scores on 12 benchmarks are mapped to an implied IQ through what the site describes as "hand-calibrated difficulty curves." The system compresses ceilings for benchmarks considered easier or susceptible to data contamination, preventing models from inflating their scores on well-known test sets. Harder, less gameable benchmarks retain higher ceilings.
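To make the idea of a compressed ceiling concrete, here is a minimal Python sketch. It is not the site's code: the curve shape is assumed to be piecewise-linear (the site uses that term for its EQ scales, not explicitly for IQ), and the function name `implied_iq`, the anchor points, and the ceilings are all hypothetical values chosen for illustration.

```python
from bisect import bisect_right

# Hypothetical anchors: (raw benchmark score in percent, implied IQ).
# AI IQ does not publish its curves; these numbers are made up.
EASY_ANCHORS = [(0, 70), (50, 100), (100, 130)]
HARD_ANCHORS = [(0, 90), (50, 130), (100, 160)]
EASY_CEILING = 115   # assumed cap for an easier or contamination-prone benchmark
HARD_CEILING = 160   # assumed cap for a harder, less gameable benchmark


def implied_iq(raw_score, anchors, ceiling):
    """Interpolate linearly between anchors, then cap the result at `ceiling`."""
    xs = [x for x, _ in anchors]
    ys = [y for _, y in anchors]
    if raw_score <= xs[0]:
        iq = ys[0]
    elif raw_score >= xs[-1]:
        iq = ys[-1]
    else:
        i = bisect_right(xs, raw_score)          # index of the right-hand anchor
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        iq = y0 + (y1 - y0) * (raw_score - x0) / (x1 - x0)
    return min(iq, ceiling)


# A 90% score on the "easy" curve would interpolate to 124 but is capped at 115.
easy_result = implied_iq(90, EASY_ANCHORS, EASY_CEILING)
# A 25% score on the "hard" curve interpolates to 110, well under its ceiling.
hard_result = implied_iq(25, HARD_ANCHORS, HARD_CEILING)
```

Under these made-up numbers, a near-perfect score on an easy test earns less implied IQ than a modest score on a hard one would if the hard curve were steeper, which is the behavior the site describes.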

The four dimensions are weighted equally: a model's IQ is the average of its abstract, math, programming, and academic reasoning scores. A model needs scores on at least two dimensions to receive a composite IQ, and missing data lowers its score rather than being treated as neutral.

The abstract dimension draws from ARC-AGI-1 and ARC-AGI-2, pattern-recognition tests designed to measure fluid intelligence. Math uses FrontierMath (Tiers 1–4), AIME, and ProofBench. Programming draws from Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning uses Humanity's Last Exam, CritPt, and GPQA Diamond.
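The equal-weight composite can be sketched in a few lines of Python. The averaging and the two-dimension minimum come from the site's description; how exactly missing data "pulls scores down" is not spelled out, so the flat per-dimension deduction below (`MISSING_PENALTY`) is purely an assumption, as are the function and constant names.

```python
from statistics import mean

DIMENSIONS = ("abstract", "math", "programming", "academic")
MIN_DIMENSIONS = 2       # stated by the site: at least two dimensions required
MISSING_PENALTY = 3.0    # ASSUMPTION: hypothetical deduction per missing dimension


def composite_iq(dimension_scores):
    """Average the available dimension scores, penalizing any that are missing.

    `dimension_scores` maps a dimension name to its implied IQ, or to None
    (or omits the key) when the model has no score for that dimension.
    Returns None when fewer than MIN_DIMENSIONS scores are available.
    """
    present = [
        dimension_scores[d]
        for d in DIMENSIONS
        if dimension_scores.get(d) is not None
    ]
    if len(present) < MIN_DIMENSIONS:
        return None
    missing = len(DIMENSIONS) - len(present)
    return mean(present) - MISSING_PENALTY * missing


full = composite_iq(
    {"abstract": 130, "math": 134, "programming": 138, "academic": 142}
)  # all four present: plain average, 136
partial = composite_iq({"abstract": 120, "math": 124})  # 122 minus two penalties
too_sparse = composite_iq({"abstract": 120})            # None: only one dimension
```

The design point worth noting is that a gap is never neutral: a model with only two reported dimensions cannot outrank an otherwise identical model with all four.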

## The EQ Dimension: Intelligence Beyond Logic

What sets AI IQ apart from most benchmarking efforts is its inclusion of an emotional intelligence score. The site maps each model's EQ-Bench 3 Elo score and Arena Elo score to an estimated EQ using calibrated piecewise-linear scales, then takes a 50/50 composite.

The EQ rankings tell a different story than IQ alone. Anthropic's Opus 4.7 leads on EQ with a score near 132, placing it in the upper-right quadrant of the IQ vs. EQ scatter plot — high on both cognitive and emotional measures. OpenAI's GPT-5.5 and GPT-5.4 cluster in the high-IQ zone but lag slightly on EQ. Google's Gemini 3.1 Pro sits in a solid middle position on both axes.

Notably, the site acknowledges that EQ-Bench 3 is judged by Claude, an Anthropic model, which "creates potential scoring bias in favor of Anthropic models." To correct for this, AI IQ subtracts a 200-point Elo penalty from the EQ-Bench component for all Anthropic models. This kind of self-correction is unusual in benchmarking and suggests the creator is aware of the methodological minefield he has entered.
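As a rough illustration of how the pieces fit together, the following Python sketch applies the 200-point penalty, maps both Elo scores through piecewise-linear scales, and averages them 50/50. The penalty and the 50/50 split are taken from the site's description; the scale anchors and all function names are hypothetical, since the actual calibration points are not published.

```python
ANTHROPIC_ELO_PENALTY = 200  # stated by the site, applied to EQ-Bench only

# ASSUMPTION: made-up (Elo, EQ) anchor points for each scale.
EQBENCH_SCALE = [(800, 80), (1200, 110), (1600, 140)]
ARENA_SCALE = [(1100, 80), (1300, 110), (1500, 140)]


def piecewise_linear(x, anchors):
    """Linear interpolation between sorted anchors, clamped at both ends."""
    if x <= anchors[0][0]:
        return float(anchors[0][1])
    for (x0, y0), (x1, y1) in zip(anchors, anchors[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return float(anchors[-1][1])


def estimated_eq(eqbench_elo, arena_elo, is_anthropic):
    """50/50 composite of the two mapped scores, with the judge-bias penalty."""
    if is_anthropic:
        eqbench_elo -= ANTHROPIC_ELO_PENALTY
    eq_from_bench = piecewise_linear(eqbench_elo, EQBENCH_SCALE)
    eq_from_arena = piecewise_linear(arena_elo, ARENA_SCALE)
    return 0.5 * eq_from_bench + 0.5 * eq_from_arena


# Same raw Elo scores, with and without the Anthropic adjustment.
with_penalty = estimated_eq(1600, 1400, is_anthropic=True)      # 125.0
without_penalty = estimated_eq(1600, 1400, is_anthropic=False)  # 132.5
```

With these illustrative scales, the penalty costs an Anthropic model 7.5 EQ points relative to a competitor with identical raw Elo scores; the real effect depends on the site's unpublished anchors.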

## The Critics: Jagged Intelligence Can't Be a Single Number

The backlash has been swift. Critics argue that large language models have "jagged" capabilities — excelling at graduate-level physics while failing at tasks a child could do. Collapsing that unevenness into a single score, they say, creates a dangerous illusion of precision.

"IQ as a proxy is fading — we're seeing reasoning density spikes that don't map to g-factor," wrote one commentator on X, pointing out that GPT-5.5 has already saturated MMLU-Pro but still fails ClockBench 50% of the time.

Others questioned the methodology's transparency. While the site lists its 12 benchmarks and shows calibration curve shapes, the raw data and precise mathematical transformations are not published as open datasets — a gap that matters to researchers accustomed to fully reproducible methods.

Still, supporters argue that any framework that makes 50+ models comparable across providers, dimensions, and price points fills a genuine need. The alternative — wading through dozens of provider-specific benchmark tables, each cherry-picked to showcase strengths — is worse.

## The Cost-Performance Picture

Perhaps the most practically useful chart on AI IQ is not the bell curve but the IQ vs. Effective Cost scatter plot. It maps each model's estimated IQ against the token cost for a standard task (2 million input tokens and 1 million output tokens, multiplied by a usage efficiency factor).
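The arithmetic behind that figure is simple enough to sketch. The 2 million / 1 million token mix is from the site; the per-million-token pricing inputs, the function name, and the reading of the efficiency factor as a plain multiplier (above 1.0 for models that burn more tokens per task) are assumptions made for this example.

```python
INPUT_TOKENS_M = 2.0    # standard task: 2 million input tokens
OUTPUT_TOKENS_M = 1.0   # standard task: 1 million output tokens


def effective_cost(input_price_per_m, output_price_per_m, efficiency_factor=1.0):
    """Dollar cost of the standard task, scaled by a usage efficiency factor.

    Prices are in dollars per million tokens. `efficiency_factor` is treated
    here as a simple multiplier, which is an assumption about the site's method.
    """
    base = (INPUT_TOKENS_M * input_price_per_m
            + OUTPUT_TOKENS_M * output_price_per_m)
    return base * efficiency_factor


# Hypothetical prices, not any real model's rate card.
premium = effective_cost(5.00, 30.00, efficiency_factor=1.5)  # (10 + 30) * 1.5
budget = effective_cost(0.25, 1.00, efficiency_factor=2.0)    # (0.5 + 1.0) * 2.0
```

Because the factor multiplies the whole bill, a cheap model that needs twice as many tokens to finish a task can lose much of its headline price advantage.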

The chart reveals a clear pattern: GPT-5.5 and Opus 4.7 sit in the upper-left corner, delivering high IQ at high cost, with effective per-task costs above $30 and $50 respectively. Meanwhile, models like GPT-5.4-mini, DeepSeek-V3.2, and MiniMax-M2.7 occupy a sweet spot with IQ scores between 112 and 120 at effective costs of $1 to $5 per task. At the cheapest extreme, GPT-oss-20b comes in near $0.20 per task with an IQ around 107.

For anyone deploying AI in production, the message is clear: the intelligence gap between a $50 model and a $3 model has narrowed enough that routing — using expensive models for hard problems and cheap ones for everything else — is no longer optional. It is the dominant architecture for serious AI deployments.
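A toy version of such a router, in Python, might pick the cheapest model that clears a required capability bar. Using the estimated IQ as that bar is an assumption made for this sketch, not something AI IQ or the article prescribes; the class and function names are hypothetical, and the cost and IQ figures are rough placeholders loosely based on the ranges quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    iq: float             # estimated IQ, approximate
    cost_per_task: float  # effective dollars per standard task, approximate


# Placeholder catalog; values are illustrative, not authoritative.
CATALOG = [
    Model("GPT-5.5", iq=136, cost_per_task=30.0),
    Model("DeepSeek-V3.2", iq=115, cost_per_task=3.0),
    Model("GPT-oss-20b", iq=107, cost_per_task=0.20),
]


def route(required_iq, models):
    """Return the cheapest model meeting `required_iq`.

    If no model qualifies, fall back to the highest-IQ model available.
    """
    capable = [m for m in models if m.iq >= required_iq]
    if capable:
        return min(capable, key=lambda m: m.cost_per_task)
    return max(models, key=lambda m: m.iq)


easy_task = route(105, CATALOG)    # cheapest model is good enough
medium_task = route(110, CATALOG)  # needs the mid-tier model
hard_task = route(130, CATALOG)    # only the frontier model qualifies
```

In practice the hard part is estimating how demanding a task is before running it, which this sketch sidesteps by taking `required_iq` as an input.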

## What This Means for You

**If you're choosing AI tools for work or personal use,** the IQ vs. Cost chart is more useful than the bell curve. You don't need the smartest model — you need the right model for the right task. For writing, brainstorming, and conversation, mid-tier models like GPT-5.4-mini or DeepSeek-V3.2 deliver strong results at a fraction of the cost.

**If you're tracking AI progress,** the key takeaway is convergence. The top models are separated by just a few IQ points. Competition is driving rapid improvement, and the next leap could come from any provider, not just the biggest names.

**If you're concerned about AI benchmarks,** remember this: every scoring system reflects choices about what to measure and how to weight it. AI IQ is useful as a starting point, but the real intelligence — knowing which model to use, when, and for what — still requires human judgment. As one commentator put it: now a human's role is just to orchestrate. If that's true, orchestration has become its own form of intelligence, and there's no benchmark for that yet.

This article draws on reporting from VentureBeat and AI IQ's published methodology.
