Per-Session Breakdown
10 sessions evaluated · grouped by sessions.title · 6 consumer + 4 enterprise
consumer: 98 answers · 60 passed · 61.2% pass rate · Mean correctness 0.83 · Mean opening 0.49
enterprise: 44 answers · 31 passed · 70.5% pass rate · Mean correctness 0.86 · Mean opening 0.55
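The per-tier percentages above are consistent with passed answers divided by total answers. A minimal sketch of that arithmetic follows, assuming one-decimal rounding. The function name `pass_rate` and the `tiers` mapping are hypothetical and are not taken from the dashboard's own code.

```python
def pass_rate(passed: int, answers: int) -> float:
    """Return the pass rate as a percentage, rounded to one decimal place."""
    if answers <= 0:
        raise ValueError("answers must be positive")
    return round(100 * passed / answers, 1)


# (passed, answers) per tier, copied from the summary figures above.
tiers = {"consumer": (60, 98), "enterprise": (31, 44)}

# Hypothetical helper output: tier name -> pass rate in percent.
rates = {tier: pass_rate(passed, answers) for tier, (passed, answers) in tiers.items()}

for tier, rate in rates.items():
    print(f"{tier}: {rate}%")
```

With the counts shown, this reproduces 61.2% for consumer and 70.5% for enterprise.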
All Sessions
most recent first
| Session Title | Tier | Role | N | Pass | Correctness | Completeness | Context | Context | Opening | Voice | Faithfulness |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Stripe — Senior ML Engineer | consumer | ml_engineer | 12 | 75% | 0.86 | 0.79 | 0.89 | 0.83 | 0.58 | 0.73 | 0.94 |
| Google — Staff ML Engineer | enterprise | ml_engineer | 19 | 74% | 0.87 | 0.81 | 0.89 | 0.85 | 0.56 | 0.72 | 0.94 |
| Meta — MLOps Lead | enterprise | mlops_engineer | 14 | 64% | 0.83 | 0.74 | 0.86 | 0.81 | 0.50 | 0.69 | 0.92 |
| Anthropic — LLM Inference Eng | consumer | llm_genai_engineer | 16 | 50% | 0.78 | 0.62 | 0.80 | 0.74 | 0.38 | 0.60 | 0.88 |
| Databricks — Senior Data Eng | enterprise | data_engineer | 11 | 82% | 0.88 | 0.83 | 0.89 | 0.86 | 0.62 | 0.76 | 0.94 |
| Snowflake — Data Platform Eng | consumer | data_engineer | 9 | 78% | 0.86 | 0.80 | 0.87 | 0.83 | 0.58 | 0.72 | 0.93 |
| Netflix — Backend SWE | consumer | backend_engineer | 13 | 62% | 0.83 | 0.71 | 0.84 | 0.81 | 0.46 | 0.65 | 0.92 |
| Airbnb — Frontend SWE | consumer | frontend_engineer | 8 | 75% | 0.82 | 0.78 | 0.86 | 0.82 | 0.54 | 0.71 | 0.91 |
| Cloudflare — SRE | enterprise | sre | 10 | 80% | 0.86 | 0.79 | 0.87 | 0.84 | 0.57 | 0.74 | 0.93 |
| Stripe — ML Infra round 2 | consumer | mlops_engineer | 11 | 55% | 0.81 | 0.71 | 0.83 | 0.78 | 0.43 | 0.65 | 0.90 |