MOCK · trigger workflow to load real data · Last run: 5h ago · Judge: gpt-4o-mini

Eval Overview

142 answers across 10 sessions · judge: gpt-4o-mini

Analytics

Quality
Pass rate · 64.1% · 8.0pp
Faithfulness · 92.0% · 1.1pp
Voice
Opening · 51.0% · below threshold
Authenticity · 69.0% · below threshold
Coverage
Correctness · 84.0% · 2.0pp
Completeness · 78.0% · 2.0pp

Sessions

10
Stripe — Senior ML Engineer · consumer · 75%
Google — Staff ML Engineer · enterprise · 74%
Meta — MLOps Lead · enterprise · 64%
Anthropic — LLM Inference Eng · consumer · 50%
Databricks — Senior Data Eng · enterprise · 82%

Failure modes

most common first
Meta-opening ("The question is about...") · 26 · 51.0%
Incomplete answer (missed sub-question) · 12 · 23.5%
Voice: too formal, no contractions · 8 · 15.7%
Off-topic tangent (low context precision) · 4 · 7.8%
Hallucinated company / project not in resume · 1 · 2.0%
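
The percentages in this list appear to be each mode's share of all counted failures (51 in total). That reading is inferred from the numbers, not stated by the dashboard. A minimal Python sketch of the arithmetic, using illustrative variable names and shortened labels:

```python
# Failure counts copied from the list above; names are illustrative only.
failure_counts = {
    "Meta-opening": 26,
    "Incomplete answer": 12,
    "Voice: too formal": 8,
    "Off-topic tangent": 4,
    "Hallucinated company / project": 1,
}

total_failures = sum(failure_counts.values())

# Each mode's share of all failures, as a percentage rounded to one decimal.
shares = {
    mode: round(100 * count / total_failures, 1)
    for mode, count in failure_counts.items()
}

# Assumption: every failing answer is assigned exactly one failure mode,
# so passing answers = total answers - total failures.
total_answers = 142
pass_rate = round(100 * (total_answers - total_failures) / total_answers, 1)

for mode, share in shares.items():
    print(f"{mode} · {failure_counts[mode]} · {share}%")
print(f"Pass rate · {pass_rate}%")
```

Under the one-mode-per-failing-answer assumption, 91 of 142 answers pass, which reproduces the 64.1% pass rate shown in the overview.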

Per-Metric Detail

threshold 70%
Correctness · 0.84
Completeness · 0.78
Context Recall · 0.86
Context Precision · 0.82
Opening · 0.51
Voice Authenticity · 0.69
Faithfulness · 0.92
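
The scores here are on a 0 to 1 scale while the threshold is given as 70%. A short Python sketch of the comparison, assuming a metric is flagged when its score is strictly below 0.70 (the dashboard does not say how ties are handled) and using hypothetical names:

```python
# Threshold from the panel header, expressed on the same 0-1 scale as the scores.
THRESHOLD = 0.70

# Per-metric scores copied from the panel above.
scores = {
    "Correctness": 0.84,
    "Completeness": 0.78,
    "Context Recall": 0.86,
    "Context Precision": 0.82,
    "Opening": 0.51,
    "Voice Authenticity": 0.69,
    "Faithfulness": 0.92,
}

# Assumption: "below threshold" means strictly less than the threshold.
below_threshold = [name for name, score in scores.items() if score < THRESHOLD]

for name in below_threshold:
    print(f"{name} · {scores[name]:.2f} · below threshold")
```

With these values the flagged metrics are Opening and Voice Authenticity, which agrees with the two "below threshold" labels in the Voice group of the overview.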