MOCK · trigger workflow to load real data
Last run: 5h ago
Judge: gpt-4o-mini

Per-Role Breakdown

Pass rates and metric means split by role template (sorted weakest first)

Role                N   Pass  Correctness  Completeness  Context  Context  Opening  Voice  Faithfulness
llm_genai_engineer  22  55%   0.79         0.65          0.81     0.75     0.40     0.62   0.89
backend_engineer    25  64%   0.83         0.72          0.84     0.81     0.47     0.65   0.92
mlops_engineer      18  67%   0.82         0.74          0.85     0.80     0.48     0.68   0.91
ml_engineer         28  71%   0.85         0.78          0.88     0.82     0.55     0.71   0.93
frontend_engineer   15  73%   0.81         0.76          0.85     0.81     0.52     0.69   0.91
sre                 14  79%   0.85         0.78          0.86     0.83     0.55     0.72   0.92
data_engineer       20  80%   0.87         0.82          0.88     0.84     0.60     0.74   0.93
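
A per-role breakdown like the one above (N, pass rate, and metric means per role template, sorted weakest first) could be computed roughly as follows. This is a minimal sketch, not the dashboard's actual code: the record shape (`role`, `passed`, `scores`), the function name `per_role_breakdown`, and the sample records are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def per_role_breakdown(results):
    """Group eval records by role; return one row per role, lowest pass rate first."""
    by_role = defaultdict(list)
    for record in results:
        by_role[record["role"]].append(record)

    rows = []
    for role, items in by_role.items():
        # Assumes every record for a role carries the same metric names.
        metric_names = list(items[0]["scores"])
        rows.append({
            "role": role,
            "n": len(items),
            "pass_rate": sum(1 for r in items if r["passed"]) / len(items),
            "means": {m: mean(r["scores"][m] for r in items) for m in metric_names},
        })

    # "Sorted weakest first": ascending by pass rate.
    rows.sort(key=lambda row: row["pass_rate"])
    return rows


# Hypothetical records, not taken from the dashboard.
sample = [
    {"role": "sre", "passed": True, "scores": {"correctness": 0.9, "faithfulness": 1.0}},
    {"role": "sre", "passed": True, "scores": {"correctness": 0.8, "faithfulness": 0.9}},
    {"role": "backend_engineer", "passed": True, "scores": {"correctness": 0.9, "faithfulness": 0.9}},
    {"role": "backend_engineer", "passed": False, "scores": {"correctness": 0.5, "faithfulness": 0.7}},
]

breakdown = per_role_breakdown(sample)
for row in breakdown:
    means = "  ".join(f"{name}={value:.2f}" for name, value in row["means"].items())
    print(f"{row['role']:<20}{row['n']:>3}  {row['pass_rate']:.0%}  {means}")
```

Sorting on pass rate alone matches the ordering in the table; a real implementation might break ties by N or by a specific metric mean.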