MOCK · trigger workflow to load real data · Last run: 5h ago · Judge: gpt-4o-mini

A/B Compare

OLD no-essay-rule vs NEW voice-rule-fix-v2 · 200 questions, paired

Ship the new prompt
NEW wins on 7 of 7 metrics · 1 statistically significant (p<0.05): Opening
Per-Metric Delta
paired t-test on 200 matched answers
Metric               OLD    NEW    Δ       p-value   Significant?
Correctness          0.82   0.84   +0.02   0.180
Completeness         0.76   0.78   +0.02   0.220
Context Recall       0.84   0.86   +0.02   0.310
Context Precision    0.80   0.82   +0.02   0.280
Opening              0.31   0.51   +0.20   <0.001    ✓ significant
Voice Authenticity   0.65   0.69   +0.04   0.070
Faithfulness         0.91   0.92   +0.01   0.620
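
For reference, a minimal sketch of how a paired t-test on matched per-question scores could be computed. The function name `paired_t_test` and the sample scores are hypothetical, not taken from the dashboard's own code. To stay within the standard library, the p-value uses a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples such as 200 pairs.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_test(old, new):
    """Paired t-test on matched scores (hypothetical helper).

    Returns (mean_delta, t_stat, p_approx), where p_approx is a two-sided
    p-value from a normal approximation to the t distribution.
    """
    if len(old) != len(new) or len(old) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")

    # Per-question differences: NEW minus OLD for the same question.
    diffs = [n - o for o, n in zip(old, new)]
    mean_delta = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")

    # t = mean difference divided by its standard error.
    t_stat = mean_delta / (sd / math.sqrt(len(diffs)))

    # ASSUMPTION: normal approximation, adequate only for large n.
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return mean_delta, t_stat, p_approx


if __name__ == "__main__":
    # Made-up scores for illustration only; not the dashboard's data.
    old_scores = [0.1, 0.2, 0.3, 0.4]
    new_scores = [0.2, 0.4, 0.3, 0.7]
    delta, t_stat, p = paired_t_test(old_scores, new_scores)
    print(f"mean delta={delta:.3f} t={t_stat:.3f} p~{p:.3f}")
```

For small samples, or when an exact p-value matters, a t-distribution implementation such as `scipy.stats.ttest_rel` would be the safer choice.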
Example: Same Question, Both Prompts
Question
How do you detect and handle model drift in production?
OLD · no-essay-rule

The question is about detecting and handling model drift in production. The steps I'd take are: First, set up a drift detection pipeline. Second, compare distributions...

Opening: 0.20 · Voice: 0.55 · Correctness: 0.85
NEW · voice-rule-fix-v2

Yeah so for drift, I set up a pipeline that compares inference input distributions against a baseline using KS and PSI tests. When something significant pops, I trigger alerts and consider retraining...

Opening: 0.85 · Voice: 0.78 · Correctness: 0.85