A few weeks ago, I spent an evening with Claude and Gemini, running them through the kind of work I'd spent years doing manually in finance. I wasn't stress-testing them.
I was genuinely curious whether the gap between what I remembered these models being capable of and what they could actually do now was as large as I suspected.
It was larger.
That evening wasn't a revelation about AI. It was a revelation about the processes we had built around human limitations and how many of those processes we'd quietly mistaken for rigour.
In February, a Citrini Research blog post - framed as a hypothetical macro memo from June 2028, describing white-collar unemployment at 10.2% and the S&P down 38% - circulated widely enough to draw a formal rebuttal from Citadel Securities. Citadel's response was empirical: software engineer demand up 11% year-on-year, AI use at work "unexpectedly stable," the historical displacement-and-reabsorption pattern intact. Goldman then published their own take, finding no meaningful relationship between AI adoption and economy-wide productivity, but a median 30% gain for firms that had actually integrated AI into defined workflows and measured the output. The macro number is flat. The micro number is transformative. The difference between the two is not capability. It’s implementation.

And here is the figure that should unsettle any boardroom: 70% of S&P 500 management teams discussed AI on their quarterly earnings calls. 10% quantified its impact on a specific use case. 1% connected it to earnings. That is not an adoption curve. It is a room full of people talking confidently about a thing almost none of them have actually done.
These are serious arguments, and each is probably partly right about different things. Yet no one is asking the question I couldn't stop thinking about that evening, with the terminal open in front of me.
What the models can do is genuinely impressive, and technically no small feat. But that is not what this piece is about. The harder question - the one that doesn't appear in product demos or earnings calls - is what it takes to deploy that capability inside institutional workflows in finance: workflows that require the same standard every time, at scale, with an audit trail someone can stand behind when it counts.
That is where the gap lives.
Last month, I ran due diligence on a deal. For anyone outside finance: due diligence is a stringent, thorough background check on a company. Financial records, contracts, risk clauses, revenue projections - a search for the things that don't appear in the headline numbers, the detail that changes the decision. Done properly, it runs in phases across weeks, with a team of analysts, lawyers and accountants working in parallel.
The first time I directed AI through one of those phases, it took under four hours. That number has since dropped below an hour, not by cutting corners but by directing AI through the same process a human team would follow - document by document, question by question. It held the thread. It spotted the inconsistency buried in the revenue projections. It flagged the clause in the agreement that would have cost us later. It produced a risk summary I could take into a room and defend.
The revelation wasn't the speed. It was realising this wasn't a shortcut - it was the work itself. The same adversarial logic, the same granular scepticism, the same defensible output. The only thing missing was the human friction: the thousand-yard stares and the weeks of analyst burnout required to grind that level of detail out of a spreadsheet.
Then I asked the harder question: what happens when the data room has 100+ documents instead of ten?
This is where impressive stops and institutional begins.
Anyone can get results from a few documents. The gap appears at scale, not because AI can't read a large data room, but because maintaining the same adversarial standard across every document requires something more deliberate than a prompt. A prompt is a one-time instruction. A process is a repeatable standard that doesn't degrade at volume, doesn't depend on who's running it that week, and produces an output you can stand behind when someone asks you to account for it.
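
To make the distinction concrete, here is a minimal sketch of a process in that sense: a fixed adversarial checklist applied identically to every document, producing one record per document no matter who runs it or how large the room is. Everything below is illustrative - `ask_model` stands in for whatever wraps your model call, and the checklist items are examples, not a standard.

```python
from dataclasses import dataclass, field

# Illustrative checklist: defined once, applied to every document.
# The standard lives here, not in whoever happens to type the prompt.
ADVERSARIAL_CHECKLIST = [
    "List every assumption this document relies on but does not state.",
    "Identify any figure that conflicts with another document in the room.",
    "Name the clause or term most likely to cost the buyer after close.",
]

@dataclass
class ReviewRecord:
    document: str
    checks_applied: list[str]
    flags: list[str] = field(default_factory=list)

def review_data_room(documents: dict[str, str], ask_model) -> list[ReviewRecord]:
    """Apply the same checklist to every document. `ask_model` is a
    hypothetical wrapper around the model call; it is assumed to return
    the string "none" when a check surfaces nothing."""
    records = []
    for name, text in documents.items():
        record = ReviewRecord(document=name, checks_applied=list(ADVERSARIAL_CHECKLIST))
        for check in ADVERSARIAL_CHECKLIST:
            finding = ask_model(f"Document:\n{text}\n\nTask: {check}")
            if finding.strip().lower() != "none":
                record.flags.append(f"{check} -> {finding}")
        records.append(record)
    return records
```

The point is structural: the hundredth document gets exactly the scrutiny the first one did, and the output is a record per document rather than a conversation.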
On a pre-revenue infrastructure deal - significant fundraise behind it, serious money on the table - I ran the same data room through two different AI architectures. The data room contained a hidden contradiction: the construction schedule ended nine months before the revenue model began, creating a massive unfunded capital burn. No single document stated this. You had to cross-reference dates between separate files to see the cliff.
The single-agent architecture - designed to be adversarial and instructed to “find every reason to fail this deal before building the investment case” - flagged the gap immediately as a deal-killer.
The multi-agent architecture, more sophisticated on paper, with specialist sub-agents for finance, legal, operations and tax, didn't miss the risk. It simply narrativised it: "Month 29–31: Commissioning for FY4 Revenue start." The sub-agents negotiated the friction away to maintain a cohesive narrative. The cliff became a transition.
If the multi-agent output had reached the investment committee, serious money would have moved into a structurally broken deal. And the multi-agent output was the one that looked more impressive.
This is what I'd call consensus hallucination. An architecture designed to produce a professional case will smooth over the data's jagged edges - because that's what collaboration optimises for. In finance, that's a fiduciary liability. If your system is designed to find a plan rather than find a flaw, it will eventually hallucinate a success that doesn't exist. The danger at scale isn't that models are wrong. It's that the wrong architecture is persuasively, professionally incorrect.
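
The guard against that failure mode doesn't have to be exotic. Once the dates have been extracted from their source documents, the cliff is a deterministic comparison that no sub-agent can negotiate away. A minimal sketch - the file names and dates are hypothetical stand-ins for the deal described above:

```python
from datetime import date

# Hypothetical extracted facts, each pinned to its source document.
# No single file states the gap; it only appears when the two dates
# are put side by side.
construction_end = {"source": "construction_schedule.pdf", "date": date(2027, 3, 31)}
revenue_start = {"source": "revenue_model.xlsx", "date": date(2027, 12, 31)}

def months_between(a: date, b: date) -> int:
    return (b.year - a.year) * 12 + (b.month - a.month)

gap = months_between(construction_end["date"], revenue_start["date"])
if gap > 0:
    print(f"FLAG: {gap} months of unfunded burn between "
          f"{construction_end['source']} and {revenue_start['source']}")
```

The design choice that matters is where this check lives: outside the narrative-generating loop, so no amount of collaborative smoothing can turn the number into a transition.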
The auditability question is the one most conversations skip past, and it carries the most weight. This isn't only about regulators - though regulators will ask. It's about the moment a deal develops complications, or an investor wants to understand the basis for a recommendation, or a board asks how the risk was assessed. At that moment, "we used AI" is not an answer - and neither is a chat export showing a prompt and a summary.
"Here is every document reviewed, the standard applied to each, and the specific signal that raised this flag" - that is an answer.
The output that answers that question looks less like a summary and more like a ledger: document by document, flag by flag, the chain of inference that produced the conclusion. It's not a chat log. It's a record of methodology.
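
The shape of that record is not mysterious. Here is a minimal sketch of what one row might look like; the field names, checklist version and example flag are all invented for illustration:

```python
import hashlib
import json
from datetime import datetime, timezone

def ledger_entry(document_text: str, document_name: str, standard: str,
                 signal: str, conclusion: str) -> dict:
    """One auditable row: what was reviewed, against which standard,
    and the specific signal that raised the flag. Hashing the text
    pins the entry to the exact version that was reviewed."""
    return {
        "document": document_name,
        "sha256": hashlib.sha256(document_text.encode()).hexdigest(),
        "standard_applied": standard,
        "signal": signal,
        "conclusion": conclusion,
        "reviewed_at": datetime.now(timezone.utc).isoformat(),
    }

entry = ledger_entry(
    document_text="...full text of the supply agreement...",
    document_name="supply_agreement_v3.pdf",
    standard="Adversarial clause review, checklist v1.2",
    signal="Termination-for-convenience clause, 30-day notice, s.14.2",
    conclusion="Flagged: revenue concentration risk",
)
print(json.dumps(entry, indent=2))
```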
No general-purpose AI tool produces that today. Not because the models aren't capable, but because nobody has built the architecture underneath them that makes it possible.
The distance between those two things is not a prompt engineering problem. It is a complex architecture problem.
The third issue will matter most in three years, and almost nobody is naming it yet.
When a foundation model improves - and the rate of improvement through 2025 and into 2026 has been faster than most institutions have noticed - that improvement is available to everyone on the same subscription. What is not available to everyone is what sits on top of it: the institutional knowledge, the specific risk profile, the asset-class expertise, the accumulated deal history that tells the system what a red flag looks like for this fund, in this sector, at this stage of a process.
The firms that build that layer own something that cannot be replicated from a standing start. Every model upgrade makes their system more intelligent in ways that are specific to them. Everyone else gets the same commodity improvement and starts from the same place.
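
In code, that layer can be almost embarrassingly simple - which is the point: it is the content, accumulated deal by deal, that compounds, not the plumbing. A crude sketch; every rule, threshold and lesson below is invented for illustration:

```python
# Illustrative firm-specific layer. It lives outside the model, so a
# model upgrade swaps the engine underneath without touching what the
# firm has learned.
FIRM_RISK_PROFILE = {
    "sector": "infrastructure",
    "stage": "pre-revenue",
    "red_flags": [
        {"rule": "funding_gap_months", "threshold": 3,
         "lesson": "A 6-month commissioning gap sank the IRR on a prior deal."},
        {"rule": "single_offtaker_revenue_share", "threshold": 0.40,
         "lesson": "Concentration above 40% has never cleared our IC."},
    ],
}

def build_review_prompt(base_task: str, profile: dict) -> str:
    """Fold the firm's accumulated standards into the instruction given
    to whichever model is current. Swap the model; keep the layer."""
    lessons = "\n".join(
        f"- {flag['rule']} (limit: {flag['threshold']}): {flag['lesson']}"
        for flag in profile["red_flags"]
    )
    return f"{base_task}\n\nApply this firm's standards without exception:\n{lessons}"
```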
Goldman's own data bears this out. The 30% productivity gains weren't spread evenly across the economy. They showed up in the firms that had done the specific, unglamorous work of integrating AI into defined workflows and measuring the output. Everyone else got a licence and a talking point.
That is not a productivity story. It is a compounding advantage story. And right now, most firms are drifting into the commodity outcome without meaning to.
The most common AI strategy I encountered within finance wasn't a strategy at all. It was an AI licence, a town hall, and an assumption that the rest would follow.
The firms that have handed out licences haven't avoided the choice. They've made it. They've selected into the commodity outcome - the one where every model improvement is available to all competitors equally, where nothing built on top of the model belongs specifically to them, where the architecture that makes this work defensible will be built by someone else.
Most of them haven't framed it that way. But that is what the choice amounts to.
The old processes in finance were not built for speed. They were built to meet a standard - the same standard, every time, at scale. The models are now good enough to meet that standard. The architecture that delivers it is being built right now, by a small number of firms that recognised early that the capability question was already settled, and moved on to the only question that remained.
That work is not glamorous. It doesn't show up in a product demo. It will not be solved by the next model release. The question isn't whether these architectures will exist. They will. The question is whether one of them is yours, and whether that was a decision you made or one you drifted into by not deciding at all.
---
Sources:
— Citrini Research, "2028 Global Intelligence Crisis": citriniresearch.com
— Citadel Securities rebuttal (Frank Flight): citadelsecurities.com
— Goldman Sachs, "AI-nxiety" Q4 Earnings Analysis (March 2026), via Fortune: https://fortune.com/2026/03/03/goldman-earnings-ai-anxiety-no-meaningful-impact-productivity-economy-30-percent-in-2-areas/
