Iterate
Prompt Lab
Take low-scoring sessions, regenerate answers with a candidate prompt, compare metrics.
1. Pick weak sessions: 4 sessions below 70% pass (see the selection sketch after this list)
2. Edit candidate prompt: `apps/server/prompts/_base/VOICE_AND_AUTHENTICITY.md`
3. Regenerate answers: gpt-4.1-nano · same Q + new prompt
4. Compare metrics: OLD vs NEW · ship if better
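Step 1 is essentially a filter over per-session results. A minimal TypeScript sketch of that selection, assuming a hypothetical `SessionSummary` shape (`passRate`, `metricAverages`) rather than the dashboard's actual schema:

```ts
// Hypothetical session summary shape, for illustration only.
interface SessionSummary {
  id: string;
  title: string;                            // e.g. "Netflix — Backend SWE"
  passRate: number;                         // fraction of questions passing, 0..1
  metricAverages: Record<string, number>;   // e.g. { Opening: 0.46, ... }
}

const PASS_THRESHOLD = 0.7;

// Sessions below the threshold, weakest first, each tagged with its worst metric.
function pickWeakSessions(sessions: SessionSummary[]) {
  return sessions
    .filter((s) => s.passRate < PASS_THRESHOLD)
    .map((s) => {
      const [worstMetric, worstScore] = Object.entries(s.metricAverages)
        .sort(([, a], [, b]) => a - b)[0];
      return { ...s, worstMetric, worstScore };
    })
    .sort((a, b) => a.passRate - b.passRate);
}
```

Taking the lowest metric average is one way to surface the "Worst metric" column shown below; the dashboard may compute it differently.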
Low-Scoring Sessions
Click "Run with new prompt" on a row to regenerate all questions in that session using the candidate prompt (a dispatch sketch follows the table).
| Session | Tier | Role | Questions | Pass rate | Worst metric | Action |
|---|---|---|---|---|---|---|
| Meta — MLOps Lead | enterprise | mlops_engineer | 14 | 64% | Opening: 0.50 | Run with new prompt |
| Anthropic — LLM Inference Eng | consumer | llm_genai_engineer | 16 | 50% | Opening: 0.38 | Run with new prompt |
| Netflix — Backend SWE | consumer | backend_engineer | 13 | 62% | Opening: 0.46 | Run with new prompt |
| Stripe — ML Infra round 2 | consumer | mlops_engineer | 11 | 55% | Opening: 0.43 | Run with new prompt |
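"Run with new prompt" ultimately amounts to a `workflow_dispatch` of `eval-run.yml` with the session ID prefilled. A sketch of that trigger done programmatically with Octokit; the owner, repo, branch, and `session_id` input name are assumptions, and the dashboard may simply open the Actions UI instead of calling the API:

```ts
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Dispatch eval-run.yml for one session, running it against the branch
// that carries the candidate prompt. Owner, repo, ref, and the input
// name `session_id` are illustrative assumptions.
async function runWithNewPrompt(sessionId: string, promptBranch = "prompt-experiment") {
  await octokit.rest.actions.createWorkflowDispatch({
    owner: "your-org",            // assumption
    repo: "your-repo",            // assumption
    workflow_id: "eval-run.yml",
    ref: promptBranch,
    inputs: { session_id: sessionId },
  });
}
```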
How regeneration works
- Edit the prompt — open `apps/server/prompts/_base/VOICE_AND_AUTHENTICITY.md` (or any role template), then commit and push to a branch like `prompt-experiment`.
- Trigger the workflow — click "Run with new prompt" on a session above. This opens the GH Actions `eval-run.yml` workflow with the session ID prefilled.
- Pipeline runs — pulls each question from the session, calls `gpt-4.1-nano` with the new prompt, and scores the new answers with the same 7-metric judge (a sketch follows this list).
- Compare — once the run finishes, the Per-Session page shows OLD vs NEW side by side. If NEW wins, merge the prompt branch to prod.
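A minimal sketch of that pipeline step, assuming the OpenAI SDK and hypothetical `loadSession` / `scoreAnswer` helpers standing in for the session store and the 7-metric judge (both live in `apps/eval`, not shown here):

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical stand-ins for the real session store and judge.
type Question = { id: string; text: string; oldScores: Record<string, number> };
declare function loadSession(sessionId: string): Promise<Question[]>;
declare function scoreAnswer(question: string, answer: string): Promise<Record<string, number>>;

async function regenerateSession(sessionId: string, candidatePrompt: string) {
  const questions = await loadSession(sessionId);
  const results: {
    questionId: string;
    oldScores: Record<string, number>;
    newScores: Record<string, number>;
  }[] = [];

  for (const q of questions) {
    // Same question and model as the original run, new system prompt.
    const completion = await openai.chat.completions.create({
      model: "gpt-4.1-nano",
      messages: [
        { role: "system", content: candidatePrompt },
        { role: "user", content: q.text },
      ],
    });
    const newAnswer = completion.choices[0].message.content ?? "";

    // Score the new answer with the same judge, then keep OLD vs NEW
    // together for the Per-Session comparison view.
    const newScores = await scoreAnswer(q.text, newAnswer);
    results.push({ questionId: q.id, oldScores: q.oldScores, newScores });
  }

  return results;
}
```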
See `apps/eval/README.md` for the full CLI reference (regenerate command, A/B options, threshold tuning).