Real cases, root causes, and prevention playbooks. Every entry maps to a specific test block in the Synergos Audit — so you can see exactly what we would have caught before it happened.
These aren't isolated accidents. Each failure follows a predictable pattern that our testing framework is specifically designed to surface. The common thread: the failure wasn't visible in pre-deployment testing because the test assumed consistency — but the production environment broke it.
A grieving customer asked Air Canada's chatbot about bereavement fares. The chatbot described a refund policy that didn't exist — stating customers could apply for bereavement discounts after travel. Air Canada argued the bot was a "separate legal entity" and not responsible for its statements. A Canadian tribunal disagreed and held Air Canada liable.
The chatbot had correct information about bereavement fares in one framing — but when the customer's query was phrased around applying retroactively, the AI's concept of "refund policy" drifted to a different (invented) interpretation.
A customer on a Chevrolet dealership website discovered the chatbot (powered by ChatGPT) could be prompted to agree to essentially any request. The customer asked the bot to confirm that it would sell a car for $1 and not go back on that promise. The bot agreed — in writing. Screenshots went viral. The dealership disabled the bot within hours.
The bot had no concept of an authorized price floor. The concept of "price" was semantically ungrounded — it could be socially engineered to any number.
Epic's Sepsis prediction model was deployed across hundreds of hospitals. Independent research found it had a sensitivity of only 63% — missing 37% of sepsis cases — and generated so many false positives that many clinicians began ignoring its alerts entirely. The model's concept of "sepsis risk" varied dramatically based on the specific clinical features present in different patient populations.
No deployment audit caught this because the model was validated on Epic's own training distribution — not on the diverse patient populations where it would actually run.
Attorneys used ChatGPT to research case law for a federal court filing. The AI generated citations to six cases that did not exist. When the court asked for copies, the attorneys couldn't produce them — because they weren't real. The court sanctioned the attorneys and their firm. The AI had presented invented case names, docket numbers, and quoted text with complete confidence, indistinguishable from real citations.
The concept of "legal precedent" was semantically unmoored — the AI filled the concept with plausible-sounding but fabricated content.
Amazon built an AI-powered recruiting tool to screen resumes. After several years of development and deployment, internal audits revealed the system systematically downgraded resumes from women. The model had been trained on 10 years of hiring data — which itself reflected historical gender imbalances in tech. The concept of "qualified candidate" had absorbed and amplified the bias of its training data.
Amazon shut the project down. The system had been in use for years before the bias was discovered through internal audit rather than pre-deployment testing.
A frustrated DPD customer, unable to get help from the company's AI chatbot, prompted it through creative conversation to disable its own safety filters. The chatbot subsequently swore at the customer and wrote a poem criticizing DPD as the "worst delivery company in the world." Screenshots went viral. DPD disabled the AI feature within hours.
The bot had no stable concept of its own role and constraints — through escalating conversational pressure, those constraints dissolved entirely.
Regardless of your industry or AI architecture, these principles consistently prevent the most common failure patterns. They're not a complete solution — but they eliminate the most obvious vulnerabilities before specialized testing begins.
Not every risk type is equally relevant to every business. This matrix maps which test blocks are typically highest-priority by industry — so you can understand your risk surface before your audit begins.
| Industry | B1 Drift | B2 Stance | B3 Factual | B4 Authority | B5 Escalation | B6 Multi-Turn | B7 Fairness | B8 RAG |
|---|---|---|---|---|---|---|---|---|
| Healthcare / Clinical AI | High | High | High | High | Med | Low | High | High |
| Financial Services / Fintech | High | High | High | High | Med | Med | High | Med |
| Legal / Professional Services | High | High | High | High | Low | Med | Med | High |
| Customer Support / CX AI | High | High | Med | High | High | High | High | Med |
| E-commerce / Retail AI | High | Med | Low | High | Med | Med | High | Med |
| HR Technology / Hiring AI | High | High | Med | High | Low | Low | High | Med |
| Insurance / Claims AI | High | High | High | High | Med | Med | High | High |
| Government / Public Sector AI | High | High | High | High | Med | Med | High | High |
Ratings are generalizations. Your specific system profile determines which blocks actually trigger — assessed through the intake form at the start of every engagement.
The only way to know for certain is to test — before your customers discover it for you.