Prompt Lab

Last run: 5h ago · Judge: gpt-4o-mini

Take low-scoring sessions, regenerate answers with a candidate prompt, compare metrics.

1. Pick weak sessions: 4 sessions below 70% pass (see the filter sketch below)
2. Edit candidate prompt: `apps/server/prompts/_base/VOICE_AND_AUTHENTICITY.md`
3. Regenerate answers: `gpt-4.1-nano` · same Q + new prompt
4. Compare metrics: OLD vs NEW · ship if better
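
As a concrete reading of step 1, here is a minimal sketch of the pass-rate filter, assuming session summaries arrive as plain dicts; the field names and the in-line sample data are illustrative, not the pipeline's actual schema:

```python
# Minimal sketch of "pick weak sessions": keep anything below the 70% pass
# threshold. Field names ("session", "pass_rate") are assumptions.
PASS_THRESHOLD = 0.70

def pick_weak_sessions(sessions: list[dict]) -> list[dict]:
    """Return sessions whose judge pass rate falls below the threshold."""
    return [s for s in sessions if s["pass_rate"] < PASS_THRESHOLD]

sessions = [
    {"session": "Meta — MLOps Lead", "pass_rate": 0.64},
    {"session": "Anthropic — LLM Inference Eng", "pass_rate": 0.50},
]
print([s["session"] for s in pick_weak_sessions(sessions)])
```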

Low-Scoring Sessions

Click "Run with new prompt" on a row to regenerate all questions in that session using the candidate prompt, or "Run all weak sessions" to queue every session below threshold at once.
| Session | Tier | Role | Q's | Pass | Worst metric |
| --- | --- | --- | --- | --- | --- |
| Meta — MLOps Lead | enterprise | mlops_engineer | 14 | 64% | Opening: 0.50 |
| Anthropic — LLM Inference Eng | consumer | llm_genai_engineer | 16 | 50% | Opening: 0.38 |
| Netflix — Backend SWE | consumer | backend_engineer | 13 | 62% | Opening: 0.46 |
| Stripe — ML Infra round 2 | consumer | mlops_engineer | 11 | 55% | Opening: 0.43 |
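
Each "Run with new prompt" button performs a `workflow_dispatch` on `eval-run.yml`. If you would rather queue runs from a script, here is a hedged sketch using GitHub's REST API; the `OWNER/REPO` slug and the `session_id` input name are assumptions for illustration, not the workflow's confirmed inputs:

```python
# Trigger eval-run.yml for one session, the scripted equivalent of the
# "Run with new prompt" button. Requires a token with workflow scope.
import os
import requests

def run_with_new_prompt(session_id: str, branch: str = "prompt-experiment") -> None:
    url = (
        "https://api.github.com/repos/OWNER/REPO"  # replace with the real slug
        "/actions/workflows/eval-run.yml/dispatches"
    )
    resp = requests.post(
        url,
        headers={
            "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
            "Accept": "application/vnd.github+json",
        },
        # "session_id" is an assumed input name; match it to the workflow's
        # actual workflow_dispatch inputs.
        json={"ref": branch, "inputs": {"session_id": session_id}},
    )
    resp.raise_for_status()  # GitHub returns 204 No Content on success
```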

How regeneration works

  1. Edit the prompt — open `apps/server/prompts/_base/VOICE_AND_AUTHENTICITY.md` (or any role template), then commit + push to a branch like `prompt-experiment`.
  2. Trigger the workflow — click "Run with new prompt" on a session above. Opens GH Actions `eval-run.yml` with the session ID prefilled (the dispatch sketch after the table shows a scripted equivalent).
  3. Pipeline runs — pulls each question from the session, calls `gpt-4.1-nano` with the new prompt, and scores the new answers with the same 7-metric judge (a sketch follows this list).
  4. Compare — once the run finishes, the Per-Session page shows OLD vs NEW side-by-side. If NEW wins, merge the prompt branch to prod (one possible gate is sketched at the end of this section).
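
To ground step 3, here is a minimal sketch of the regenerate-and-rescore loop, assuming the `openai` SDK. The session item shape (`{"question": ...}`) and the `judge()` helper are placeholders for the pipeline's actual code, not its real API:

```python
# Sketch of step 3: same question, new system prompt, same judge.
from pathlib import Path
from openai import OpenAI

client = OpenAI()
CANDIDATE_PROMPT = Path(
    "apps/server/prompts/_base/VOICE_AND_AUTHENTICITY.md"
).read_text()

def judge(answer: str) -> dict[str, float]:
    """Stand-in for the 7-metric gpt-4o-mini judge (metric name -> score)."""
    raise NotImplementedError("the real judge lives in the eval pipeline")

def regenerate_session(questions: list[dict]) -> list[dict]:
    results = []
    for item in questions:
        answer = client.chat.completions.create(
            model="gpt-4.1-nano",
            messages=[
                {"role": "system", "content": CANDIDATE_PROMPT},
                {"role": "user", "content": item["question"]},
            ],
        ).choices[0].message.content
        # Score with the same judge so OLD and NEW numbers stay comparable.
        results.append({"question": item["question"], "new_scores": judge(answer)})
    return results
```
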
See `apps/eval/README.md` for the full CLI reference (`regenerate` command, A/B options, threshold tuning).
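
Finally, for the step 4 decision, one possible "ship if better" gate: require that no metric regresses and at least one improves. This policy, and the dict shape of the per-session metric averages, are assumptions for illustration; the project may weigh metrics differently:

```python
# Assumed gate: OLD and NEW are dicts mapping the 7 metric names to 0-1
# scores averaged over a session's questions.
def should_ship(old: dict[str, float], new: dict[str, float],
                min_gain: float = 0.0) -> bool:
    """Ship only if no metric regresses and at least one beats min_gain."""
    regressed = any(new[m] < old[m] for m in old)
    improved = any(new[m] - old[m] > min_gain for m in old)
    return improved and not regressed

old = {"Opening": 0.50, "Depth": 0.71}
new = {"Opening": 0.62, "Depth": 0.71}
print(should_ship(old, new))  # True: Opening improved, nothing regressed
```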