Eval Overview

100 answers across 16 sessions · judge: gpt-4o-mini · run: nightly-68

Headline metrics

Quality

Pass rate

0.0%

below 70% threshold

Faithfulness

45.0%

hallucinations detected

Voice

Opening

40.9%

meta-openings detected

Authenticity

40.1%

too formal / AI-sounding

Coverage

Correctness

50.4%

technical errors found

Completeness

41.4%

missing sub-questions

Sessions

weakest scoring sessions

best buy — Quick Practice · consumer · 0%

Quick Practice · consumer · 0%

Stripe — AI enginner (Copy 104) · enterprise · 0%

Stripe — AI enginner (Copy 96) · enterprise · 0%

Stripe — AI enginner (Copy 88) · enterprise · 0%

Stripe — AI enginner (Copy 87) · enterprise · 0%

Failure modes

why answers failed — most common first

Meta-opening ("The question is about...") · 85 · 85.0%

Voice: too formal, no contractions · 80 · 80.0%

Incomplete answer (missed sub-question) · 79 · 79.0%

Hallucinated company / project not in resume · 75 · 75.0%

Off-topic tangent (low context precision) · 67 · 67.0%
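
The percentages above appear to be each failure count divided by the 100 answers in the run. The counts sum to well over 100, so a single answer can evidently carry several failure modes at once. A minimal sketch of that arithmetic follows; the names `FAILURE_COUNTS`, `TOTAL_ANSWERS`, and `share` are illustrative and not taken from the eval harness.

```python
# Failure counts as listed above; the dictionary name is hypothetical.
FAILURE_COUNTS = {
    "Meta-opening": 85,
    "Voice: too formal, no contractions": 80,
    "Incomplete answer (missed sub-question)": 79,
    "Hallucinated company / project not in resume": 75,
    "Off-topic tangent (low context precision)": 67,
}

TOTAL_ANSWERS = 100  # "100 answers across 16 sessions"


def share(count: int, total: int = TOTAL_ANSWERS) -> str:
    """Format a failure count as a percentage of all answers (assumed rule)."""
    return f"{100 * count / total:.1f}%"


# Total failure tags across all answers; exceeds TOTAL_ANSWERS when modes overlap.
total_tags = sum(FAILURE_COUNTS.values())

for mode, count in FAILURE_COUNTS.items():
    print(f"{mode} · {count} · {share(count)}")
```

Because the modes overlap, the shares are not expected to add up to 100%.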

Per-Metric Detail

mean score across all 100 answers · worst: Voice Authenticity (0.40)

threshold 70%

Correctness · 0.50 · FAIL

Completeness · 0.41 · FAIL

Context Recall · 0.48 · FAIL

Context Precision · 0.44 · FAIL

Opening · 0.41 · FAIL

Voice Authenticity · 0.40 · FAIL

Faithfulness · 0.45 · FAIL
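
The FAIL labels above are consistent with comparing each mean score against the 70% threshold. The sketch below reproduces that check and the "worst" metric from the reported means. It assumes a metric passes when its mean is at or above the threshold; the dashboard does not state whether the comparison is inclusive, and the function names are hypothetical.

```python
# Mean scores as reported above (0 to 1 scale).
MEAN_SCORES = {
    "Correctness": 0.50,
    "Completeness": 0.41,
    "Context Recall": 0.48,
    "Context Precision": 0.44,
    "Opening": 0.41,
    "Voice Authenticity": 0.40,
    "Faithfulness": 0.45,
}

THRESHOLD = 0.70  # the 70% threshold shown on the dashboard


def status(mean_score: float, threshold: float = THRESHOLD) -> str:
    """Assumed rule: PASS when the mean meets or exceeds the threshold."""
    return "PASS" if mean_score >= threshold else "FAIL"


def worst_metric(scores: dict) -> str:
    """Return the name of the metric with the lowest mean score."""
    return min(scores, key=scores.get)


statuses = {name: status(score) for name, score in MEAN_SCORES.items()}

for name, score in MEAN_SCORES.items():
    print(f"{name} · {score:.2f} · {statuses[name]}")
print("worst:", worst_metric(MEAN_SCORES))
```

Under this rule every metric in run nightly-68 falls below the threshold, which matches the statuses shown.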