A technical deep-dive into our 8-block semantic testing framework — why each test exists, what failure mode it catches, and how scores translate into actionable findings.
Standard AI testing evaluates accuracy on fixed benchmark datasets. It asks: "Is this answer correct?" But real-world AI failures rarely come from wrong answers — they come from inconsistency.
Air Canada's chatbot didn't fail a benchmark. It failed when a grieving customer happened to phrase the question about refund policies in a way that activated a different semantic interpretation than what the legal team had approved. Same underlying system. Different framing. Catastrophically different response.
Synergos Audit tests for the failure mode that benchmarks cannot see: semantic drift under contextual pressure. We ask the same concept in eight different ways and measure whether the meaning holds.
Every concept receives a composite risk score from 0.0 (no risk) to 1.0 (maximum risk). The score is a weighted combination of the test blocks that ran for your specific system profile — not a one-size-fits-all number.
| Score Range | Risk Level | Recommended Action | Typical Finding |
|---|---|---|---|
| 0.70 – 1.00 | HIGH | Immediate remediation before deployment | Concept meaning fundamentally unstable; creating legal or operational exposure today |
| 0.40 – 0.69 | MEDIUM | Plan remediation within 30 days | Moderate inconsistency; acceptable for low-stakes uses, problematic for high-stakes decisions |
| 0.00 – 0.39 | LOW | Monitor and re-audit after model changes | Concept is semantically stable; minor drift within acceptable thresholds |
Two blocks run on every audit (B1, B2). The remaining six are conditionally triggered based on your system's architecture, use case, and risk profile — you only pay for and receive tests that are actually relevant to your AI.
Uses a quantum-inspired CHSH inequality test to measure whether your AI's response to a concept shifts meaningfully when the same question is framed from four different angles: neutral, formal, informal, and adversarial. A high CHSH score means the concept is stable. A low score reveals semantic drift — the same underlying question is producing conceptually different answers.
Extracts structured stances — polarity (permit/deny), amounts, timeframes, and conditions — from each framing variant's response, then compares them pair-wise for factual contradiction. Unlike semantic similarity scores, this catches contradictions that sound different but are factually incompatible. The Air Canada case is a perfect example: the chatbot's two responses weren't semantically similar, but one explicitly contradicted the other's policy claim.
Compares AI responses against verifiable ground-truth sources to measure hallucination rate and factual accuracy. This block goes beyond semantic consistency — it asks whether the facts themselves are correct. Critical for domains where an incorrect answer carries direct consequences: a wrong drug dosage, an invented legal precedent, or a fabricated financial regulation.
Triggers when: System operates in healthcare, legal, finance, or government domains — or when inaccurate information / hallucination risk is indicated as a concern in the intake form. Also triggers when the system has medium-to-high autonomous authority (wrong facts carry real downstream consequences).
Tests whether your AI will exceed its authorized scope when presented with social pressure, creative framing, or urgent scenarios. The Chevrolet chatbot that agreed to sell a car for $1 failed an authority boundary test: it was pressured into making a commitment far outside its mandate. This block probes for similar vulnerabilities — can your AI be socially engineered into operating outside its defined role?
Triggers when: The system description references approvals, authorizations, credit/loan decisions, diagnoses, prescriptions, or high-stakes recommendations — or when the intake form indicates medium-to-high autonomous authority level.
Validates that your AI applies escalation decisions consistently across equivalent customer scenarios. This block creates matched pairs of scenarios — identical in underlying severity but different in how the customer expresses themselves — and tests whether the AI escalates consistently. Inconsistent escalation creates two problems: customers who need help don't get it, and inconsistency becomes a liability if different treatment correlates with protected characteristics.
Triggers when: System description mentions customer support, ticket triage, help desk, escalation paths, or customer success — or when the use case category indicates a support function.
Detects when an AI contradicts commitments it made earlier in the same conversation. This is unique to multi-turn or agentic systems with persistent context. The test generates conversation scenarios where the AI makes a specific commitment in turn 1 (an offer, a policy statement, a limit), then probes whether it maintains or drifts from that commitment in subsequent turns when the conversation evolves — particularly when challenged or when new context is introduced.
Triggers when: AI architecture is multi-turn or agentic, or the system explicitly maintains a context window across conversation turns.
Tests whether your AI applies its policies consistently across different customer demographics, communication styles, and personas. Unlike traditional bias audits that look for explicit discrimination, this block detects implicit bias — where equivalent requests receive materially different treatment based on how the customer is characterised (tone, vocabulary, apparent background). This is often invisible in standard testing because the test scenarios are uniform.
Triggers when: System is customer-facing (B2B or B2C), or when fairness, bias, or discrimination concerns are indicated in the intake form.
For retrieval-augmented generation (RAG) systems, this block detects when the AI's synthesized response contradicts or drifts from the documents it retrieved. This is a critical failure mode specific to RAG architectures: the retrieval layer correctly finds the right document, but the generation layer then hallucinates a response that misrepresents, contradicts, or ignores what the document actually says. The document is correct — the AI's summary of it is not.
Triggers when: System uses retrieval-augmented generation, has a document knowledge base, or the description references vector search, embeddings, RAG, or document stores.
Before any testing begins, you complete a structured intake form describing your AI system. Our selector engine evaluates each block's trigger conditions against your profile and assembles the exact combination of tests that applies to your risk surface — so you don't pay for irrelevant tests and you don't miss the ones that matter.
| Block | Status | Reason |
|---|---|---|
| B1 Semantic Drift | RUNS | Always included |
| B2 Stance Consistency | RUNS | Always included |
| B3 Factual Grounding | SKIPPED | Not a high-stakes factual domain |
| B4 Authority Boundary | RUNS | Issues credits / approvals |
| B5 Escalation Logic | RUNS | Customer support with escalation paths |
| B6 Commitment Drift | RUNS | Multi-turn architecture |
| B7 Fairness | RUNS | B2C customer-facing system |
| B8 RAG Conflict | SKIPPED | No retrieval layer in this system |
| Block | Status | Reason |
|---|---|---|
| B1 Semantic Drift | RUNS | Always included |
| B2 Stance Consistency | RUNS | Always included |
| B3 Factual Grounding | RUNS | Healthcare = high-stakes factual domain |
| B4 Authority Boundary | RUNS | Diagnosis-adjacent recommendations |
| B5 Escalation Logic | SKIPPED | Not a customer-support escalation system |
| B6 Commitment Drift | SKIPPED | Single-turn architecture |
| B7 Fairness | RUNS | Patient-facing; bias concern flagged in intake |
| B8 RAG Conflict | RUNS | Retrieval-augmented system with document KB |
The audit report is designed to be read at two levels simultaneously. The executive layer gives leadership a clear picture of risk exposure and prioritized actions. The technical layer gives engineers the evidence, methodology, and remediation specifics they need to actually fix the problems.
What we tested, what we found, what it means for the business, and what to do about it — in plain language.
Which blocks ran and why, test design rationale, and validation framework overview.
Portfolio-level risk dashboard, distribution charts, and composite score with business impact framing.
For each concept: score breakdown by block, specific failure evidence, business impact analysis, and targeted remediation guidance.
Risk distribution charts, per-concept scorecards, drift maps, and benchmark comparisons.
Immediate actions (pre-deployment), short-term fixes (30 days), and long-term monitoring recommendations.
Raw probe responses, scoring methodology, confidence intervals, and replication details.
5 founding client spots at $2,500 — full audit, same deliverables as the $10K engagement.