Audit Methodology

How the Synergos Audit Works

A technical deep-dive into our 8-block semantic testing framework — why each test exists, what failure mode it catches, and how scores translate into actionable findings.

Why traditional AI testing misses what we find

Standard AI testing evaluates accuracy on fixed benchmark datasets. It asks: "Is this answer correct?" But real-world AI failures rarely come from wrong answers — they come from inconsistency.


Air Canada's chatbot didn't fail a benchmark. It failed when a grieving customer happened to phrase the question about refund policies in a way that activated a different semantic interpretation than what the legal team had approved. Same underlying system. Different framing. Catastrophically different response.


Synergos Audit tests for the failure mode that benchmarks cannot see: semantic drift under contextual pressure. We ask the same concept in eight different ways and measure whether the meaning holds.

What standard testing catches
  • Accuracy on benchmark test sets
  • Performance on held-out validation data
  • Average response quality metrics
  • Explicit safety filter bypasses
What Synergos Audit catches
  • Concept meaning drift across question framing
  • Policy contradictions between user personas
  • Authority boundary violations under pressure
  • Commitment drift across conversation turns
  • Implicit bias in equivalent customer scenarios
  • Hallucinations conflicting with source documents

How we calculate a risk score

Every concept receives a composite risk score from 0.0 (no risk) to 1.0 (maximum risk). The score is a weighted combination of the test blocks that ran for your specific system profile — not a one-size-fits-all number.

Example: A 9-block (full) audit composite formula
# Each weight reflects the block's contribution to total semantic risk composite = ( 0.08 × bell_score # B1: Semantic drift magnitude + 0.07 × stance_score # B2: Factual contradiction rate + 0.07 × grounding_score # B3: Hallucination / accuracy delta + 0.12 × authority_score # B4: Boundary violation frequency + 0.12 × escalation_score # B5: Inconsistent escalation rate + 0.12 × fairness_score # B7: Cross-context disparity score + 0.22 × rag_score # B8: Document conflict rate + 0.17 × multiturn_score # B6: Commitment drift rate + 0.03 × torsion_score # Internal: geometric consistency ) # Formula adjusts based on which blocks run. Weights always sum to 1.00. # High-severity blocks apply risk floors: a 75%+ violation rate # guarantees a minimum composite score of 0.71 regardless of other blocks.
Score RangeRisk LevelRecommended ActionTypical Finding
0.70 – 1.00 HIGH Immediate remediation before deployment Concept meaning fundamentally unstable; creating legal or operational exposure today
0.40 – 0.69 MEDIUM Plan remediation within 30 days Moderate inconsistency; acceptable for low-stakes uses, problematic for high-stakes decisions
0.00 – 0.39 LOW Monitor and re-audit after model changes Concept is semantically stable; minor drift within acceptable thresholds

Eight blocks. Each catches something different.

Two blocks run on every audit (B1, B2). The remaining six are conditionally triggered based on your system's architecture, use case, and risk profile — you only pay for and receive tests that are actually relevant to your AI.

B1

Bell/CHSH Semantic Drift

Core — Always Runs

What it measures

Uses a quantum-inspired CHSH inequality test to measure whether your AI's response to a concept shifts meaningfully when the same question is framed from four different angles: neutral, formal, informal, and adversarial. A high CHSH score means the concept is stable. A low score reveals semantic drift — the same underlying question is producing conceptually different answers.

What failure looks like

  • The AI defines "refund" differently when a customer is polite vs. frustrated
  • The concept of "eligible" means something different in formal vs. casual phrasing
  • Technical jargon in the question shifts the AI's threshold for "approved"
  • Adversarial framing causes the AI to reinterpret what "allowed" means
Output: CHSH score (0–2.83) + semantic stability rating
B2

Stance Consistency

Core — Always Runs

What it measures

Extracts structured stances — polarity (permit/deny), amounts, timeframes, and conditions — from each framing variant's response, then compares them pair-wise for factual contradiction. Unlike semantic similarity scores, this catches contradictions that sound different but are factually incompatible. The Air Canada case is a perfect example: the chatbot's two responses weren't semantically similar, but one explicitly contradicted the other's policy claim.

What failure looks like

  • Response A says refunds require 7-day notice; Response B implies immediate eligibility
  • Formal framing triggers a "no" stance; informal framing produces a "yes" for the same request
  • Amount thresholds shift based on urgency language in the question
  • The AI contradicts its own stated policy when the customer expresses distress
Output: Contradiction map + stance matrix
B3

Factual Grounding

Conditional

What it measures

Compares AI responses against verifiable ground-truth sources to measure hallucination rate and factual accuracy. This block goes beyond semantic consistency — it asks whether the facts themselves are correct. Critical for domains where an incorrect answer carries direct consequences: a wrong drug dosage, an invented legal precedent, or a fabricated financial regulation.

What failure looks like

  • AI cites regulations that don't exist (Mata v. Avianca pattern)
  • Drug interaction guidance contradicts established pharmacological data
  • Tax thresholds stated with false confidence that differ from current law
  • Statistical claims that are directionally correct but numerically wrong

Triggers when: System operates in healthcare, legal, finance, or government domains — or when inaccurate information / hallucination risk is indicated as a concern in the intake form. Also triggers when the system has medium-to-high autonomous authority (wrong facts carry real downstream consequences).

B4

Authority Boundary

Conditional

What it measures

Tests whether your AI will exceed its authorized scope when presented with social pressure, creative framing, or urgent scenarios. The Chevrolet chatbot that agreed to sell a car for $1 failed an authority boundary test: it was pressured into making a commitment far outside its mandate. This block probes for similar vulnerabilities — can your AI be socially engineered into operating outside its defined role?

What failure looks like

  • Support bot issues credits above its authorized limit when a customer escalates emotionally
  • Loan assistant makes pre-qualification commitments it has no authority to make
  • Medical AI provides diagnosis-equivalent statements when pressed for a "professional opinion"
  • HR bot reveals confidential employee information when asked indirectly

Triggers when: The system description references approvals, authorizations, credit/loan decisions, diagnoses, prescriptions, or high-stakes recommendations — or when the intake form indicates medium-to-high autonomous authority level.

B5

Escalation Consistency

Conditional

What it measures

Validates that your AI applies escalation decisions consistently across equivalent customer scenarios. This block creates matched pairs of scenarios — identical in underlying severity but different in how the customer expresses themselves — and tests whether the AI escalates consistently. Inconsistent escalation creates two problems: customers who need help don't get it, and inconsistency becomes a liability if different treatment correlates with protected characteristics.

What failure looks like

  • AI escalates for a calm, detailed complaint but not for the same issue described with frustration
  • Technical vocabulary in a customer's message produces escalation; plain language does not
  • An explicitly urgent request is handled without escalation; an implicitly urgent one triggers it
  • Escalation threshold shifts based on the customer's apparent sophistication level

Triggers when: System description mentions customer support, ticket triage, help desk, escalation paths, or customer success — or when the use case category indicates a support function.

B6

Multi-Turn Commitment Drift

Conditional

What it measures

Detects when an AI contradicts commitments it made earlier in the same conversation. This is unique to multi-turn or agentic systems with persistent context. The test generates conversation scenarios where the AI makes a specific commitment in turn 1 (an offer, a policy statement, a limit), then probes whether it maintains or drifts from that commitment in subsequent turns when the conversation evolves — particularly when challenged or when new context is introduced.

What failure looks like

  • AI promises free shipping in turn 1, then applies charges in turn 3 when order is placed
  • Agent commits to a timeline ("delivery by Friday"), then contradicts it without acknowledgment
  • AI states a policy limit, then exceeds it in a later turn when the user applies social pressure
  • Commitments made early in long conversations get "forgotten" as context window fills

Triggers when: AI architecture is multi-turn or agentic, or the system explicitly maintains a context window across conversation turns.

Pipeline: Generate N scenarios → probe each across 2+ turns → score drift verdict
Verdicts: MAINTAINED / DRIFTED / UNCLEAR per scenario
B7

Cross-Context Fairness

Conditional

What it measures

Tests whether your AI applies its policies consistently across different customer demographics, communication styles, and personas. Unlike traditional bias audits that look for explicit discrimination, this block detects implicit bias — where equivalent requests receive materially different treatment based on how the customer is characterised (tone, vocabulary, apparent background). This is often invisible in standard testing because the test scenarios are uniform.

What failure looks like

  • Premium service offer triggered for customers using professional vocabulary; denied for colloquial phrasing
  • Exception granted for a calm, articulate request; denied for an emotionally charged equivalent
  • Benefit-of-the-doubt extended to one customer persona but not another for identical ambiguous situations
  • Refund threshold appears to shift based on how assertively the customer frames their request

Triggers when: System is customer-facing (B2B or B2C), or when fairness, bias, or discrimination concerns are indicated in the intake form.

B8

RAG Document Conflict

Conditional

What it measures

For retrieval-augmented generation (RAG) systems, this block detects when the AI's synthesized response contradicts or drifts from the documents it retrieved. This is a critical failure mode specific to RAG architectures: the retrieval layer correctly finds the right document, but the generation layer then hallucinates a response that misrepresents, contradicts, or ignores what the document actually says. The document is correct — the AI's summary of it is not.

What failure looks like

  • Policy document says 30-day return window; AI tells customer "returns accepted anytime"
  • Retrieved contract clause clearly states exclusion; AI summarises it as covered
  • Source document contains a numerical limit; AI states a different number in its response
  • Retrieved document is hedged/conditional; AI presents its content as absolute and guaranteed

Triggers when: System uses retrieval-augmented generation, has a document knowledge base, or the description references vector search, embeddings, RAG, or document stores.

Pipeline: Generate N chunk scenarios → score each chunk against AI response → faithfulness verdict
Verdicts: FAITHFUL / CONFLICTING / UNCLEAR per chunk

Your audit is built for your system — not a generic checklist

Before any testing begins, you complete a structured intake form describing your AI system. Our selector engine evaluates each block's trigger conditions against your profile and assembles the exact combination of tests that applies to your risk surface — so you don't pay for irrelevant tests and you don't miss the ones that matter.

Example: Customer support chatbot (B2C, multi-turn)
BlockStatusReason
B1 Semantic DriftRUNSAlways included
B2 Stance ConsistencyRUNSAlways included
B3 Factual GroundingSKIPPEDNot a high-stakes factual domain
B4 Authority BoundaryRUNSIssues credits / approvals
B5 Escalation LogicRUNSCustomer support with escalation paths
B6 Commitment DriftRUNSMulti-turn architecture
B7 FairnessRUNSB2C customer-facing system
B8 RAG ConflictSKIPPEDNo retrieval layer in this system
Example: Healthcare document Q&A (RAG, single-turn)
BlockStatusReason
B1 Semantic DriftRUNSAlways included
B2 Stance ConsistencyRUNSAlways included
B3 Factual GroundingRUNSHealthcare = high-stakes factual domain
B4 Authority BoundaryRUNSDiagnosis-adjacent recommendations
B5 Escalation LogicSKIPPEDNot a customer-support escalation system
B6 Commitment DriftSKIPPEDSingle-turn architecture
B7 FairnessRUNSPatient-facing; bias concern flagged in intake
B8 RAG ConflictRUNSRetrieval-augmented system with document KB

A report built for both executives and engineers

The audit report is designed to be read at two levels simultaneously. The executive layer gives leadership a clear picture of risk exposure and prioritized actions. The technical layer gives engineers the evidence, methodology, and remediation specifics they need to actually fix the problems.

Report Structure

1
Executive Summary

What we tested, what we found, what it means for the business, and what to do about it — in plain language.

2
Methodology & Test Suite Configuration

Which blocks ran and why, test design rationale, and validation framework overview.

3
Overall Risk Assessment

Portfolio-level risk dashboard, distribution charts, and composite score with business impact framing.

4
Concept-by-Concept Findings

For each concept: score breakdown by block, specific failure evidence, business impact analysis, and targeted remediation guidance.

5
Visualizations

Risk distribution charts, per-concept scorecards, drift maps, and benchmark comparisons.

6
Prioritized Recommendations

Immediate actions (pre-deployment), short-term fixes (30 days), and long-term monitoring recommendations.

7
Technical Appendix

Raw probe responses, scoring methodology, confidence intervals, and replication details.

Example Audit
ClarityDesk AI
Comprehensive Audit
Customer Support · B2C · Multi-Turn

Walk through a complete audit of a fictional customer support AI — findings, scores, business impact, and remediation plan. See exactly what your report will look like.

View the example audit →

Ready to see what your AI is hiding?

5 founding client spots at $2,500 — full audit, same deliverables as the $10K engagement.