MOCK · trigger workflow to load real data
Last run: 5h ago
Judge: gpt-4o-mini

Per-Session Breakdown

10 sessions evaluated · grouped by sessions.title · 6 consumer + 4 enterprise

consumer
98 answers · 60 passed · 61.2% pass rate
Mean correctness 0.83 · Mean opening 0.49
enterprise
44 answers · 31 passed · 70.5% pass rate
Mean correctness 0.86 · Mean opening 0.55
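
The tier percentages above appear to be passed answers divided by total answers (60 of 98 for consumer, 31 of 44 for enterprise). A minimal sketch of that arithmetic follows; the function name `pass_rate` and the one-decimal rounding are assumptions for illustration, not taken from the dashboard's code.

```python
def pass_rate(passed: int, answers: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place.

    Hypothetical helper: the dashboard's actual computation is not shown.
    """
    if answers <= 0:
        raise ValueError("answers must be positive")
    return round(100 * passed / answers, 1)


# (passed, answers) per tier, copied from the summary cards above.
tiers = {"consumer": (60, 98), "enterprise": (31, 44)}

rates = {tier: pass_rate(passed, answers) for tier, (passed, answers) in tiers.items()}

for tier, rate in rates.items():
    print(f"{tier}: {rate}%")
```

The computed values match the card figures, which suggests the headline percentage is a pooled pass rate over answers rather than an average of per-session pass rates.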
All Sessions
most recent first
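| Session Title | Tier | Role | N | Pass | Correctness | Completeness | Context | Context | Opening | Voice | Faithfulness |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stripe — Senior ML Engineer | consumer | ml_engineer | 12 | 75% | 0.86 | 0.79 | 0.89 | 0.83 | 0.58 | 0.73 | 0.94 |
| Google — Staff ML Engineer | enterprise | ml_engineer | 19 | 74% | 0.87 | 0.81 | 0.89 | 0.85 | 0.56 | 0.72 | 0.94 |
| Meta — MLOps Lead | enterprise | mlops_engineer | 14 | 64% | 0.83 | 0.74 | 0.86 | 0.81 | 0.50 | 0.69 | 0.92 |
| Anthropic — LLM Inference Eng | consumer | llm_genai_engineer | 16 | 50% | 0.78 | 0.62 | 0.80 | 0.74 | 0.38 | 0.60 | 0.88 |
| Databricks — Senior Data Eng | enterprise | data_engineer | 11 | 82% | 0.88 | 0.83 | 0.89 | 0.86 | 0.62 | 0.76 | 0.94 |
| Snowflake — Data Platform Eng | consumer | data_engineer | 9 | 78% | 0.86 | 0.80 | 0.87 | 0.83 | 0.58 | 0.72 | 0.93 |
| Netflix — Backend SWE | consumer | backend_engineer | 13 | 62% | 0.83 | 0.71 | 0.84 | 0.81 | 0.46 | 0.65 | 0.92 |
| Airbnb — Frontend SWE | consumer | frontend_engineer | 8 | 75% | 0.82 | 0.78 | 0.86 | 0.82 | 0.54 | 0.71 | 0.91 |
| Cloudflare — SRE | enterprise | sre | 10 | 80% | 0.86 | 0.79 | 0.87 | 0.84 | 0.57 | 0.74 | 0.93 |
| Stripe — ML Infra round 2 | consumer | mlops_engineer | 11 | 55% | 0.81 | 0.71 | 0.83 | 0.78 | 0.43 | 0.65 | 0.90 |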