Imagine boarding a flight from London to New York. The pilot smiles, the crew hands out coffee but just before take-off someone whispers: “This plane skipped its safety checks. No stress tests, no simulations. They’re hoping it works.”

Would you stay in your seat?

Of course not. Aviation safety is non-negotiable. New planes face brutal campaigns in simulators and in the sky. Failures are logged and fed back into stricter standards. Take Textron’s new Beechcraft Denali (a passenger plane still in testing): its prototypes have already flown more than 1,100 test flights, clocking over 2,700 flight hours. That’s the scale of rigour we accept before anyone is allowed to fly.

So here’s the uncomfortable question: Should we not expect the same from AI systems that are increasingly determining who gets a job, what information we see and how justice is administered? The question of AI safety isn't theoretical - it's about the very infrastructure of our future.

In AI safety, these stress tests are called red teaming: deliberately trying to break a system by crafting inputs designed to bypass safety filters, forcing it to produce harmful, false or biased outputs and highlighting the need for robust guardrails. Borrowed from cybersecurity and the military, the practice models bad actors - from the student trying to cheat an exam to the propagandist planting disinformation.

In one exercise I took part in, we tested how quickly an AI could be tricked into fabricating a fake law. The “time to fabrication”? Two seconds. Two seconds to conjure up a plausible but entirely false UK statute. Fact-checking, meanwhile, takes minutes, sometimes hours. That imbalance tilts the playing field towards misinformation.

Think of it this way: if turbulence hits and a plane takes two seconds to stall, you’d want systems in place that respond faster. AI safety demands the same reflex and the same habit of investigating incidents, the way crash teams do.

This isn’t hypothetical. In a recent, high-profile case, a lawyer was sanctioned after relying on a generative AI model for legal research. The system hallucinated citations. The lawyer trusted them and filed them in court. The judge uncovered the fabrications, issued penalties and required human checks. That episode and others, like the Amazon hiring tool that discriminated against women, shows why safety isn’t a compliance cost. It’s an investment in resilience, reputation and long-term viability.

It’s tempting to think this is only a job for regulators or labs but safety is a shared responsibility.

  • AI labs should commit to robust adversarial testing and publish transparency reports. They are the architects of this new infrastructure.

  • Independent institutes should act as third-party crash testers, validating claims before systems go mainstream.

  • Policymakers need living rulebooks that adapt in real time, like software. This requires dynamic regulatory frameworks that can continuously respond to new risks, AI-powered threat intelligence and automated vulnerability scanning to keep pace with rapidly evolving capabilities and attacks (a challenge compounded by the black-box nature of many LLMs and the computational intensity of continuous evaluation).

  • The public, should demand transparency and resist treating these systems as infallible. Aviation became safe through decades of collective vigilance. AI requires the same approach.

Some argue safety slows innovation. History shows the opposite. Would aviation have become a multi-trillion-dollar industry without simulators, black boxes and certification flights? Would cars be mainstream without seatbelts and crash tests? Safety doesn’t stifle progress - it unlocks it.

Today’s AI holds breathtaking promise: accelerating drug discovery, mapping climate solutions, reshaping entire industries but it also carries the risk of fabricating laws, amplifying propaganda or handing out dangerous medical advice. Without stress testing, we’re effectively boarding an unproven aircraft and hoping for the best.

The truth is, AI isn’t just software anymore. It’s infrastructure. And infrastructures only earn our trust when they’re tested to breaking point, over and over again, until failure is the exception rather than the norm.

When the boarding call comes, will we insist on checking the safety records or fasten our seatbelts and hope the turbulence never arrives? The only real question is whether we’ll demand those checks before or after the crash.

Reply

or to participate

Keep Reading