Iterate
Prompt Lab
Take low-scoring sessions, regenerate answers with a candidate prompt, compare metrics.
1. Pick weak sessions: 4 sessions below 70% pass (see the selection sketch after this list)
2. Edit candidate prompt: `apps/server/prompts/_base/VOICE_AND_AUTHENTICITY.md`
3. Regenerate answers: gpt-4.1-nano · same Q + new prompt
4. Compare metrics: OLD vs NEW · ship if better
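Step 1 is essentially a filter over per-session results. A minimal TypeScript sketch of that selection, assuming a hypothetical `SessionSummary` shape (`passRate`, `metricAverages`) rather than the dashboard's actual schema:

```ts
// Hypothetical session summary shape, for illustration only.
interface SessionSummary {
  id: string;
  title: string;                            // e.g. "Netflix — Backend SWE"
  passRate: number;                         // fraction of questions passing, 0..1
  metricAverages: Record<string, number>;   // e.g. { Opening: 0.46, ... }
}

const PASS_THRESHOLD = 0.7;

// Sessions below the threshold, weakest first, each tagged with its worst metric.
function pickWeakSessions(sessions: SessionSummary[]) {
  return sessions
    .filter((s) => s.passRate < PASS_THRESHOLD)
    .map((s) => {
      const [worstMetric, worstScore] = Object.entries(s.metricAverages)
        .sort(([, a], [, b]) => a - b)[0];
      return { ...s, worstMetric, worstScore };
    })
    .sort((a, b) => a.passRate - b.passRate);
}
```

Taking the lowest metric average is one way to surface the "Worst metric" column shown below; the dashboard may compute it differently.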
Low-Scoring Sessions
Click "Run with new prompt" on a row to regenerate all questions in that session using the candidate prompt (a dispatch sketch follows the table).
| Session | Tier | Role | Questions | Pass rate | Worst metric | Action |
|---|---|---|---|---|---|---|
| Meta — MLOps Lead | enterprise | mlops_engineer | 14 | 64% | Opening: 0.50 | Run with new prompt |
| Anthropic — LLM Inference Eng | consumer | llm_genai_engineer | 16 | 50% | Opening: 0.38 | Run with new prompt |
| Netflix — Backend SWE | consumer | backend_engineer | 13 | 62% | Opening: 0.46 | Run with new prompt |
| Stripe — ML Infra round 2 | consumer | mlops_engineer | 11 | 55% | Opening: 0.43 | Run with new prompt |
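"Run with new prompt" ultimately amounts to a `workflow_dispatch` of `eval-run.yml` with the session ID prefilled. A sketch of that trigger done programmatically with Octokit; the owner, repo, branch, and `session_id` input name are assumptions, and the dashboard may simply open the Actions UI instead of calling the API:

```ts
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });

// Dispatch eval-run.yml for one session, running it against the branch
// that carries the candidate prompt. Owner, repo, ref, and the input
// name `session_id` are illustrative assumptions.
async function runWithNewPrompt(sessionId: string, promptBranch = "prompt-experiment") {
  await octokit.rest.actions.createWorkflowDispatch({
    owner: "your-org",            // assumption
    repo: "your-repo",            // assumption
    workflow_id: "eval-run.yml",
    ref: promptBranch,
    inputs: { session_id: sessionId },
  });
}
```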
How regeneration works
- Edit the prompt — open `apps/server/prompts/_base/VOICE_AND_AUTHENTICITY.md` (or any role template), then commit and push to a branch like `prompt-experiment`.
- Trigger the workflow — click "Run with new prompt" on a session above. This opens the GH Actions `eval-run.yml` workflow with the session ID prefilled.
- Pipeline runs — pulls each question from the session, calls `gpt-4.1-nano` with the new prompt, and scores the new answers with the same 7-metric judge (a sketch follows this list).
- Compare — once the run finishes, the Per-Session page shows OLD vs NEW side by side. If NEW wins, merge the prompt branch to prod.
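A minimal sketch of that pipeline step, assuming the OpenAI SDK and hypothetical `loadSession` / `scoreAnswer` helpers standing in for the session store and the 7-metric judge (both live in `apps/eval`, not shown here):

```ts
import OpenAI from "openai";

const openai = new OpenAI();

// Hypothetical stand-ins for the real session store and judge.
type Question = { id: string; text: string; oldScores: Record<string, number> };
declare function loadSession(sessionId: string): Promise<Question[]>;
declare function scoreAnswer(question: string, answer: string): Promise<Record<string, number>>;

async function regenerateSession(sessionId: string, candidatePrompt: string) {
  const questions = await loadSession(sessionId);
  const results: {
    questionId: string;
    oldScores: Record<string, number>;
    newScores: Record<string, number>;
  }[] = [];

  for (const q of questions) {
    // Same question and model as the original run, new system prompt.
    const completion = await openai.chat.completions.create({
      model: "gpt-4.1-nano",
      messages: [
        { role: "system", content: candidatePrompt },
        { role: "user", content: q.text },
      ],
    });
    const newAnswer = completion.choices[0].message.content ?? "";

    // Score the new answer with the same judge, then keep OLD vs NEW
    // together for the Per-Session comparison view.
    const newScores = await scoreAnswer(q.text, newAnswer);
    results.push({ questionId: q.id, oldScores: q.oldScores, newScores });
  }

  return results;
}
```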
See `apps/eval/README.md` for the full CLI reference (regenerate command, A/B options, threshold tuning).