A/B Compare

OLD no-essay-rule vs NEW voice-rule-fix-v2 · 200 questions, paired

Ship the new prompt

NEW scores higher on 7 of 7 metrics · only 1 gain is statistically significant (p<0.05): Opening

Per-Metric Delta

paired t-test on 200 matched answers

Metric	OLD	NEW	Δ	p-value	Significant?
Correctness	0.82	0.84	+0.02	0.180	—
Completeness	0.76	0.78	+0.02	0.220	—
Context Recall	0.84	0.86	+0.02	0.310	—
Context Precision	0.80	0.82	+0.02	0.280	—
Opening	0.31	0.51	+0.20	<0.001	✓ significant
Voice Authenticity	0.65	0.69	+0.04	0.070	—
Faithfulness	0.91	0.92	+0.01	0.620	—
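
For readers who want to reproduce a row of the table above, here is a minimal sketch of a paired t-test on matched per-question scores. It is not the dashboard's actual implementation. The function name `paired_t_test` and the toy score lists are hypothetical, and the p-value uses a normal approximation to the t distribution, which is reasonable at n = 200 but not for small samples.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_test(old_scores, new_scores):
    """Return (mean_delta, t_statistic, approx_two_sided_p) for paired scores.

    Each index must refer to the same question answered under both prompts.
    """
    if len(old_scores) != len(new_scores):
        raise ValueError("paired test needs equal-length score lists")
    diffs = [new - old for old, new in zip(old_scores, new_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_delta = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the per-question deltas
    if sd == 0:
        raise ValueError("all deltas are identical; t statistic is undefined")
    t_stat = mean_delta / (sd / math.sqrt(n))
    # Assumption: normal approximation to the t distribution (fine for large n).
    p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return mean_delta, t_stat, p_value


# Toy data, invented for illustration only (not the scores behind the table).
old = [0.2, 0.4, 0.3, 0.5] * 50
new = [0.5, 0.6, 0.6, 0.7] * 50

mean_delta, t_stat, p_value = paired_t_test(old, new)
print(f"delta={mean_delta:+.2f}  t={t_stat:.2f}  significant={p_value < 0.05}")
```

The test works on per-question differences, not on the two column means, which is why the 200 answers have to be matched by question.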

Example: Same Question, Both Prompts

Question

How do you detect and handle model drift in production?

OLD · no-essay-rule

The question is about detecting and handling model drift in production. The steps I'd take are: First, set up a drift detection pipeline. Second, compare distributions...

Opening: 0.20 · Voice: 0.55 · Correctness: 0.85

NEW · voice-rule-fix-v2

Yeah so for drift, I set up a pipeline that compares inference input distributions against a baseline using KS and PSI tests. When something significant pops, I trigger alerts and consider retraining...

Opening: 0.85 · Voice: 0.78 · Correctness: 0.85