A/B Compare
OLD no-essay-rule vs NEW voice-rule-fix-v2 · 200 questions, paired
Ship the new prompt
NEW wins on all 7 metrics · only Opening is statistically significant (p < 0.05)
Per-Metric Delta
paired t-test on 200 matched answers
| Metric | OLD | NEW | Δ | p-value | Significant? |
|---|---|---|---|---|---|
| Correctness | 0.82 | 0.84 | +0.02 | 0.180 | — |
| Completeness | 0.76 | 0.78 | +0.02 | 0.220 | — |
| Context Recall | 0.84 | 0.86 | +0.02 | 0.310 | — |
| Context Precision | 0.80 | 0.82 | +0.02 | 0.280 | — |
| Opening | 0.31 | 0.51 | +0.20 | <0.001 | ✓ significant |
| Voice Authenticity | 0.65 | 0.69 | +0.04 | 0.070 | — |
| Faithfulness | 0.91 | 0.92 | +0.01 | 0.620 | — |
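The significance column comes from a paired t-test on per-question score differences. A minimal sketch of that computation, using toy scores rather than the real 200-answer data (the `paired_t` helper and the sample values are illustrative assumptions):

```python
import math
from statistics import mean, stdev

def paired_t(new_scores, old_scores):
    """Paired t statistic: mean of per-question differences over its standard error."""
    diffs = [n - o for n, o in zip(new_scores, old_scores)]
    se = stdev(diffs) / math.sqrt(len(diffs))  # standard error of the mean difference
    return mean(diffs) / se

# Toy paired scores (NOT the real eval data): same 8 questions, both prompts.
old = [0.20, 0.30, 0.40, 0.25, 0.35, 0.30, 0.28, 0.33]
new = [0.50, 0.55, 0.60, 0.45, 0.50, 0.52, 0.48, 0.58]
print(f"t = {paired_t(new, old):.2f}")
```

Pairing matters here: each question is answered by both prompts, so testing the per-question deltas removes question-difficulty variance that an unpaired test would leave in. In practice `scipy.stats.ttest_rel` also returns the p-value directly.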
Example: Same Question, Both Prompts
Question
How do you detect and handle model drift in production?
OLD · no-essay-rule
The question is about detecting and handling model drift in production. The steps I'd take are: First, set up a drift detection pipeline. Second, compare distributions...
Opening: 0.20 · Voice: 0.55 · Correctness: 0.85
NEW · voice-rule-fix-v2
Yeah so for drift, I set up a pipeline that compares inference input distributions against a baseline using KS and PSI tests. When something significant pops, I trigger alerts and consider retraining...
Opening: 0.85 · Voice: 0.78 · Correctness: 0.85
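The KS and PSI checks the NEW answer mentions can be sketched in a few lines. Function names, bin count, and the alert thresholds below are assumptions for illustration, not values from either answer:

```python
import math
from bisect import bisect_right

def ks_statistic(baseline, live):
    """Two-sample KS statistic: max gap between the two empirical CDFs."""
    b, l = sorted(baseline), sorted(live)
    ecdf = lambda s, x: bisect_right(s, x) / len(s)  # fraction of s <= x
    return max(abs(ecdf(b, x) - ecdf(l, x)) for x in b + l)

def psi(baseline, live, bins=10):
    """Population Stability Index over quantile bins of the baseline."""
    b = sorted(baseline)
    edges = [b[int(i * len(b) / bins)] for i in range(1, bins)]  # approx quantile edges
    def bucket_fracs(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1
        return [max(c / len(sample), 1e-6) for c in counts]  # floor avoids log(0)
    p, q = bucket_fracs(baseline), bucket_fracs(live)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

A common (assumed) wiring: alert when PSI exceeds roughly 0.2 or the KS statistic is significant for the window size, then queue the retraining decision. Production code would typically reach for `scipy.stats.ks_2samp` rather than a hand-rolled ECDF.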