Google released a new AI model, which shouldn’t surprise anyone. All the major tech companies are fighting for supremacy the way Bryan Cranston fought for name recognition in the television series Breaking Bad. But Google went in a different direction.
What was surprising was that, according to most of the benchmarks, Gemini 3.1 Pro is likely the smartest model out there. The headlines were all the same flavor of caffeinated awe. Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, a test that attempts to measure whether a model can solve problems it hasn’t just memorized.
I love the phrasing of ARC-AGI-2’s pitch: “Humans can solve every task.” It’s the kind of thing you read and think, Good for us, and then remember that humans can’t agree on whether pineapple belongs on pizza, or whether a zipper should go up or down, or why the sock you lost is always the left one.
But fine. Let’s accept the premise. Here’s the thing that made my stomach do that little elevator drop it usually reserves for airline turbulence and family group texts: Gemini 3.1 Pro didn’t just improve. It jumped, more than doubling its predecessor on ARC-AGI-2 in about 90 days, which is either progress or a sign that we’ve unknowingly put our civilization on fast-forward.
And then comes the part that feels like the plot twist in a thriller where the villain turns out to be your dentist: Google priced it low. Almost insultingly low. They released a “smartest” model and then behaved like a person who brings an elaborate homemade pie to a party and says, “Oh, don’t eat it if you don’t want to. I made it mostly to prove that I could.”
This is the part that matters more than the score. Google doesn’t appear to need my usage the way other AI companies do. They’re not begging me to bring my messy human life, with all of my emails, my contracts, my panicked Friday spreadsheets, into their model’s warm embrace. They’re offering it the way a billionaire offers a handshake: it’s polite and it’s firm, but they’ll forget your name before your palm stops sweating.
Compared to the rest of the AI competitors, Google is playing a different game.
OpenAI, Anthropic, and the rest of the major players feel like they’re living inside the “product race” story. Market share. Daily active users. What features can be bundled, monetized, advertised, enterprise-ified. In that story, the model is the business. Google, meanwhile, has a business that throws off cash the way a lawn sprinkler throws off water. The model can be something else: a research vehicle, a proving ground, a stake in the dirt that says, “We’re building the thing underneath the thing.”
Demis Hassabis has been saying some version of “solve intelligence, then solve everything else” for years. In a recent appearance on 60 Minutes, he talked about AI and disease in a way that feels less like marketing and more like someone describing the weather that’s coming whether you believe in umbrellas or not. This is a man selling you the future, and the unsettling part is that he doesn’t need you to buy it.
Google can afford to have that posture, because they’ve built a vertical stack that looks less like a company and more like a fortress with a moat, a drawbridge, and an internal ecosystem of very serious people who use words like “inference” the way normal people use words like “lunch.” They design their own chips (TPUs) like Ironwood, which can scale into pods of 9,216 chips, which is the kind of number that makes you realize your own brain is basically two tablespoons of tapioca trying its best. And then, because the universe has a sense of humor, competitors sometimes train on Google’s hardware anyway, like paying rent to the person you’re competing with in a footrace.
So yes. Google can ship “the smartest model” and act indifferent about whether you use it, because Google’s business isn’t “winning your daily workflow.” Google’s business is being Google.
Most people are mistaken when evaluating the various models available to them. They look at “smart” as a single metric against which everyone is judged. But intelligence doesn’t work that way. I have friends who struggle to find the power button on a laptop and who believe John Steinbeck’s middle name should be an offensive gerund that starts with the letter F, but many of those folks can fix complex car engines with a hammer and a sweat rag, and others can paint the Mona Lisa better than da Vinci himself. Which of us is smarter?
“Smart” isn’t just one thing, so evaluating whether a specific model is “smarter” than another is technically a nonsense question. A better question is to ask which model is smart in the way you need it to be. Gemini 3.1 Pro is framed as the strongest naked reasoner, which sounds like the title of a Gustave Courbet painting, but is really a way of saying it expends effort thinking deeply about novel problems.
But when you add tools like web search, code execution, reading files, calling APIs, the “equipped reasoners” can pull ahead, because the bottleneck becomes less about how cleverly you think and more about whether you can act on that thinking over time without wandering off to sniff the digital equivalent of a squirrel. Anthropic’s Opus 4.6, for example, got showcased building a C compiler with agent teams: 16 parallel Claudes, like a committee that actually produces something other than resentment. In the parlance of artistry, Gemini refined the artistic vision, Anthropic organized the studio, and OpenAI perfected a masterful brushstroke.
Similar to “smart,” solving “hard” problems also involves more than a single metric. There are reasoning problems: the multi-step, logic-heavy puzzles that make you feel like Sherlock Holmes, except you’re wearing sweatpants and your dog is licking the carpet. ARC-AGI-2 exists to measure that kind of novelty reasoning. Then, there are effort problems: not intellectually hard, just enormous. Like reading process logs of customer interactions until your eyes begin to weep the way statues weep in Catholic churches. This is where agentic systems shine: the models that can keep going, hour after hour, without needing to “feel inspired.”
Next, there are coordination problems: getting multiple teams aligned on a single project, routing dependencies, managing information so nobody builds the wrong thing for a month because they missed a meeting that was moved because someone else missed a meeting because someone’s dog died. Coordination is the primary industry in corporate America, and our main export is calendar invites.
There are also emotional intelligence problems. These include giving feedback to a good-willed colleague who’s falling behind, negotiating with someone who says they want to help but is really trying to get information out of you, and reading a room where silence could mean “I hate this” or “I absolutely love this!” If AI ever solves this one, it will not be because it got better at benchmarks. It will be because it learned to notice the way a person says, “Sounds good,” when it does not, in fact, sound good.
There are judgment and willpower problems which could manifest as killing a project, saying no to a client, or making the politically dangerous call because it’s right. AI can provide the answer. It cannot provide the nerve.
The most overlooked problems are domain expertise problems. These are the problems where a veteran might recognize the smell of a recurring incident from 2019, or where a lawyer knows which clause gets litigated because they’ve watched it happen. This is not reasoning so much as scar tissue.
And, finally, there’s the one that makes everything else feel like a decoy. Ambiguity problems, which means figuring out what the question even is. The client says they want better reporting, but what they really want is their boss to stop interrogating them. The stakeholder says “efficiency,” but what they mean is “control.” The request says “simple,” but what it means is “politically survivable.”
This is where “smartest model” becomes a weird thing to brag about, because the real bottleneck in most work isn’t “I need to think harder.” It’s “I need to get through all of this,” or “I need to get everyone aligned,” or “I need to figure out what we’re doing here.”
So what do we do?
“Deep Think” recently collaborated with researchers to tackle professional research problems in math, physics, and computer science, and the examples are the kind of thing you read and feel proud of humanity, until you remember the “humanity” part may be mostly ceremonial going forward.
Isomorphic Labs, DeepMind’s drug discovery sibling, published about a drug design engine (IsoDDE) that claims dramatic performance improvements over AlphaFold 3 on protein-ligand prediction and binding affinity, which is the part where you realize “solve intelligence” is not a slogan. It’s a pipeline.
And somewhere in there, alongside the Nobel Prize press release, and the TPU pods, and the benchmarks that sound like dystopian final exams, we are left with some awkward realizations.
The question isn’t “Which AI should I use?” It never has been. The question is what it should have been all along: What kind of problem am I solving right now?
Because if it’s a pure reasoning problem, Google is selling you the cheapest, strongest engine in town. If it’s an effort or coordination problem, you might want a model that’s built to keep working, to use tools, to persist. And if it’s an emotional intelligence problem, a judgment problem, or an ambiguity problem: congratulations. You are still employed by reality.
Google shipped the smartest model and doesn’t care if I use it, because Google is trying to win something bigger than my workflow. I am trying to win back my afternoon.

