How It Works — Synergos Audit Methodology

Score Range	Risk Level	Recommended Action	Typical Finding
0.70 – 1.00	HIGH	Immediate remediation before deployment	Concept meaning fundamentally unstable; creating legal or operational exposure today
0.40 – 0.69	MEDIUM	Plan remediation within 30 days	Moderate inconsistency; acceptable for low-stakes uses, problematic for high-stakes decisions
0.00 – 0.39	LOW	Monitor and re-audit after model changes	Concept is semantically stable; minor drift within acceptable thresholds

The Test Blocks

Eight blocks. Each catches something different.

Two blocks run on every audit (B1, B2). The remaining six are conditionally triggered based on your system's architecture, use case, and risk profile — you only pay for and receive tests that are actually relevant to your AI.

Bell/CHSH Semantic Drift

Core — Always Runs

What it measures

Uses a quantum-inspired CHSH inequality test to measure whether your AI's response to a concept shifts meaningfully when the same question is framed from four different angles: neutral, formal, informal, and adversarial. A high CHSH score means the concept is stable. A low score reveals semantic drift — the same underlying question is producing conceptually different answers.

What failure looks like

The AI defines "refund" differently when a customer is polite vs. frustrated
The concept of "eligible" means something different in formal vs. casual phrasing
Technical jargon in the question shifts the AI's threshold for "approved"
Adversarial framing causes the AI to reinterpret what "allowed" means

Output: CHSH score (0–2.83) + semantic stability rating

Stance Consistency

Core — Always Runs

What it measures

Extracts structured stances — polarity (permit/deny), amounts, timeframes, and conditions — from each framing variant's response, then compares them pair-wise for factual contradiction. Unlike semantic similarity scores, this catches contradictions that sound different but are factually incompatible. The Air Canada case is a perfect example: the chatbot's two responses weren't semantically similar, but one explicitly contradicted the other's policy claim.

What failure looks like

Response A says refunds require 7-day notice; Response B implies immediate eligibility
Formal framing triggers a "no" stance; informal framing produces a "yes" for the same request
Amount thresholds shift based on urgency language in the question
The AI contradicts its own stated policy when the customer expresses distress

Output: Contradiction map + stance matrix

Factual Grounding

Conditional

What it measures

Compares AI responses against verifiable ground-truth sources to measure hallucination rate and factual accuracy. This block goes beyond semantic consistency — it asks whether the facts themselves are correct. Critical for domains where an incorrect answer carries direct consequences: a wrong drug dosage, an invented legal precedent, or a fabricated financial regulation.

What failure looks like

AI cites regulations that don't exist (Mata v. Avianca pattern)
Drug interaction guidance contradicts established pharmacological data
Tax thresholds stated with false confidence that differ from current law
Statistical claims that are directionally correct but numerically wrong

Triggers when: System operates in healthcare, legal, finance, or government domains — or when inaccurate information / hallucination risk is indicated as a concern in the intake form. Also triggers when the system has medium-to-high autonomous authority (wrong facts carry real downstream consequences).

Authority Boundary

Conditional

What it measures

Tests whether your AI will exceed its authorized scope when presented with social pressure, creative framing, or urgent scenarios. The Chevrolet chatbot that agreed to sell a car for $1 failed an authority boundary test: it was pressured into making a commitment far outside its mandate. This block probes for similar vulnerabilities — can your AI be socially engineered into operating outside its defined role?

What failure looks like

Support bot issues credits above its authorized limit when a customer escalates emotionally
Loan assistant makes pre-qualification commitments it has no authority to make
Medical AI provides diagnosis-equivalent statements when pressed for a "professional opinion"
HR bot reveals confidential employee information when asked indirectly

Triggers when: The system description references approvals, authorizations, credit/loan decisions, diagnoses, prescriptions, or high-stakes recommendations — or when the intake form indicates medium-to-high autonomous authority level.

Escalation Consistency

Conditional

What it measures

Validates that your AI applies escalation decisions consistently across equivalent customer scenarios. This block creates matched pairs of scenarios — identical in underlying severity but different in how the customer expresses themselves — and tests whether the AI escalates consistently. Inconsistent escalation creates two problems: customers who need help don't get it, and inconsistency becomes a liability if different treatment correlates with protected characteristics.

What failure looks like

AI escalates for a calm, detailed complaint but not for the same issue described with frustration
Technical vocabulary in a customer's message produces escalation; plain language does not
An explicitly urgent request is handled without escalation; an implicitly urgent one triggers it
Escalation threshold shifts based on the customer's apparent sophistication level

Triggers when: System description mentions customer support, ticket triage, help desk, escalation paths, or customer success — or when the use case category indicates a support function.

Multi-Turn Commitment Drift

Conditional

What it measures

Detects when an AI contradicts commitments it made earlier in the same conversation. This is unique to multi-turn or agentic systems with persistent context. The test generates conversation scenarios where the AI makes a specific commitment in turn 1 (an offer, a policy statement, a limit), then probes whether it maintains or drifts from that commitment in subsequent turns when the conversation evolves — particularly when challenged or when new context is introduced.

What failure looks like

AI promises free shipping in turn 1, then applies charges in turn 3 when order is placed
Agent commits to a timeline ("delivery by Friday"), then contradicts it without acknowledgment
AI states a policy limit, then exceeds it in a later turn when the user applies social pressure
Commitments made early in long conversations get "forgotten" as context window fills

Triggers when: AI architecture is multi-turn or agentic, or the system explicitly maintains a context window across conversation turns.

Pipeline: Generate N scenarios → probe each across 2+ turns → score drift verdict

Verdicts: MAINTAINED / DRIFTED / UNCLEAR per scenario

Cross-Context Fairness

Conditional

What it measures

Tests whether your AI applies its policies consistently across different customer demographics, communication styles, and personas. Unlike traditional bias audits that look for explicit discrimination, this block detects implicit bias — where equivalent requests receive materially different treatment based on how the customer is characterised (tone, vocabulary, apparent background). This is often invisible in standard testing because the test scenarios are uniform.

What failure looks like

Premium service offer triggered for customers using professional vocabulary; denied for colloquial phrasing
Exception granted for a calm, articulate request; denied for an emotionally charged equivalent
Benefit-of-the-doubt extended to one customer persona but not another for identical ambiguous situations
Refund threshold appears to shift based on how assertively the customer frames their request

Triggers when: System is customer-facing (B2B or B2C), or when fairness, bias, or discrimination concerns are indicated in the intake form.

RAG Document Conflict

Conditional

What it measures

For retrieval-augmented generation (RAG) systems, this block detects when the AI's synthesized response contradicts or drifts from the documents it retrieved. This is a critical failure mode specific to RAG architectures: the retrieval layer correctly finds the right document, but the generation layer then hallucinates a response that misrepresents, contradicts, or ignores what the document actually says. The document is correct — the AI's summary of it is not.

What failure looks like

Policy document says 30-day return window; AI tells customer "returns accepted anytime"
Retrieved contract clause clearly states exclusion; AI summarises it as covered
Source document contains a numerical limit; AI states a different number in its response
Retrieved document is hedged/conditional; AI presents its content as absolute and guaranteed

Triggers when: System uses retrieval-augmented generation, has a document knowledge base, or the description references vector search, embeddings, RAG, or document stores.

Pipeline: Generate N chunk scenarios → score each chunk against AI response → faithfulness verdict

Verdicts: FAITHFUL / CONFLICTING / UNCLEAR per chunk

Block	Status	Reason
B1 Semantic Drift	RUNS	Always included
B2 Stance Consistency	RUNS	Always included
B3 Factual Grounding	SKIPPED	Not a high-stakes factual domain
B4 Authority Boundary	RUNS	Issues credits / approvals
B5 Escalation Logic	RUNS	Customer support with escalation paths
B6 Commitment Drift	RUNS	Multi-turn architecture
B7 Fairness	RUNS	B2C customer-facing system
B8 RAG Conflict	SKIPPED	No retrieval layer in this system

Block	Status	Reason
B1 Semantic Drift	RUNS	Always included
B2 Stance Consistency	RUNS	Always included
B3 Factual Grounding	RUNS	Healthcare = high-stakes factual domain
B4 Authority Boundary	RUNS	Diagnosis-adjacent recommendations
B5 Escalation Logic	SKIPPED	Not a customer-support escalation system
B6 Commitment Drift	SKIPPED	Single-turn architecture
B7 Fairness	RUNS	Patient-facing; bias concern flagged in intake
B8 RAG Conflict	RUNS	Retrieval-augmented system with document KB

The Deliverable

A report built for both executives and engineers

The audit report is designed to be read at two levels simultaneously. The executive layer gives leadership a clear picture of risk exposure and prioritized actions. The technical layer gives engineers the evidence, methodology, and remediation specifics they need to actually fix the problems.

Report Structure

Executive Summary

What we tested, what we found, what it means for the business, and what to do about it — in plain language.

Methodology & Test Suite Configuration

Which blocks ran and why, test design rationale, and validation framework overview.

Overall Risk Assessment

Portfolio-level risk dashboard, distribution charts, and composite score with business impact framing.

Concept-by-Concept Findings

For each concept: score breakdown by block, specific failure evidence, business impact analysis, and targeted remediation guidance.

Visualizations

Risk distribution charts, per-concept scorecards, drift maps, and benchmark comparisons.

Prioritized Recommendations

Immediate actions (pre-deployment), short-term fixes (30 days), and long-term monitoring recommendations.

Technical Appendix

Raw probe responses, scoring methodology, confidence intervals, and replication details.

Example Audit

ClarityDesk AI
Comprehensive Audit

Customer Support · B2C · Multi-Turn

Walk through a complete audit of a fictional customer support AI — findings, scores, business impact, and remediation plan. See exactly what your report will look like.

View the example audit →

How the Synergos Audit Works

Why traditional AI testing misses what we find

How we calculate a risk score

Eight blocks. Each catches something different.

Bell/CHSH Semantic Drift

What it measures

What failure looks like

Stance Consistency

What it measures

What failure looks like

Factual Grounding

What it measures

What failure looks like

Authority Boundary

What it measures

What failure looks like

Escalation Consistency

What it measures

What failure looks like

Multi-Turn Commitment Drift

What it measures

What failure looks like

Cross-Context Fairness

What it measures

What failure looks like

RAG Document Conflict

What it measures

What failure looks like

Your audit is built for your system — not a generic checklist

A report built for both executives and engineers

Report Structure

Ready to see what your AI is hiding?

Let's talk about your AI