The One Rule That Made My AI Tutor 3× Cheaper (Without Losing Accuracy)
Cost‑Aware, Format‑Strict, and Surprisingly Minimal
Context & alignment. This post extends Stanislav Huseletov’s “Useful or Not: DSPy — Declarative Self‑improving Python” (Aug 13, 2025). I agree with Stan’s core point: DSPy is the right foundation when reliability and structure matter. Here, I zoom in on GEPA (Genetic-Pareto), a reflective, Pareto‑guided optimizer inside that worldview, and show how I apply it to AI‑tutor tasks with cost‑aware evaluation and strict format guarantees.
Code & runners: https://github.com/dp-pcs/gepa-tutor-refinery
Abstract
I evaluate Baseline, Self‑Refine (SR), and GEPA for AI‑tutor multiple‑choice questions under a cost‑aware metric (tokens per correct) with strict output rules (Answer: <LETTER>). Naïve GEPA edits often increase verbosity and reduce accuracy; however, minimal reflective edits distilled into a single‑call prompt can match SR’s accuracy at lower cost. A hybrid SR→GEPA auditor (confidence ≥ 0.85) improves robustness on LSAT‑LR while cutting tokens per correct. I conclude with practical guidance on format‑strict prompts, context‑aware auditing, and cost‑aware execution.
Metric & cost accounting (exact).
tokens_total = Σ(prompt_tokens + completion_tokens) across all calls
tokens_per_question = tokens_total / num_questions
tokens_per_correct = tokens_total / num_correct
cost_per_100_correct = 100 * (price_per_1k_tokens * tokens_total / 1000) / num_correct
I grade only the final line that matches Answer: <LETTER>; any format violation scores 0.
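In code, this accounting is a few lines. Here is a minimal sketch; the field names and the default per‑1k‑token price are illustrative assumptions, not the repo’s actual API:
def cost_report(calls, results, price_per_1k_tokens=0.002):
    # calls: per-call usage dicts, e.g. {"prompt_tokens": 412, "completion_tokens": 38}
    # results: one boolean per question (True = graded correct)
    tokens_total = sum(c["prompt_tokens"] + c["completion_tokens"] for c in calls)
    num_questions = len(results)
    num_correct = sum(results)
    per_correct = tokens_total / num_correct if num_correct else float("inf")
    cost = (100 * (price_per_1k_tokens * tokens_total / 1000) / num_correct
            if num_correct else float("inf"))
    return {
        "tokens_per_question": tokens_total / num_questions,
        "tokens_per_correct": per_correct,
        "cost_per_100_correct": cost,
    }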
Introduction
Large Language Models (LLMs) hold promise as AI tutors, but prompt design is non‑trivial. Manual prompts saturate quickly, so I explored automated refinement. I implement a prompt‑refinery pipeline inspired by GEPA (Genetic–Pareto) reflective prompt evolution. The goal is to refine prompts automatically by iteratively reflecting on failures and proposing targeted edits. I compare Baseline, Self‑Refine (SR), and GEPA, evaluating them on diverse multiple‑choice (MCQ) datasets.
Stan’s piece makes the case for DSPy when you need reliability and structure; I agree. This article dives into GEPA specifically for tutor‑style tasks where I care about tokens‑per‑correct and format compliance, not only raw accuracy.
Burying the Lead
If you read nothing else, read this!
The super‑simple version
I optimize tokens per correct while enforcing Answer: <LETTER> format compliance.
The setup is simple: quizzes (datasets) with
a question,
a few choices (A, B, C, …), and
an answer key (the correct letter).
My program makes the model pick a letter. I then check the last line: Answer: <LETTER>. If that letter matches the key, it’s right. If not (or the format is wrong), it’s wrong.
Pass rate = how many it got right.
Three ways I ask the model to answer:
Baseline - one shot.
“Here’s the question & choices → give the letter.”
Self‑Refine (SR) - two shots.
Try an answer.
Read its own first try, fix mistakes, then give the final Answer: <LETTER>.
(Usually helps, costs two calls.)
GEPA - learn a better instruction sheet (prompt) before answering.
Look at mistakes on a small set.
Write prompt edits (Variants A/B/C).
Test those edits.
Keep the best one and use it as the new single‑call prompt.
Distill from Self‑Refine. SR is the teacher that shows how to fix answers. I distill those fixes into a one‑call prompt so I don’t pay for two calls every time.
Pick by objective (aligned with Stan). For strict schema/JSON compliance, MIPROv2 is a great default. For tutor‑style reasoning under a cost budget, GEPA’s reflective edits often win. This post shows when and how I use them.
What the prompt lines mean (plain English)
“Restate the question, quote one sentence, eliminate wrong options, then final letter” → a mini playbook so the model thinks in small steps.
“Ensure answer choices align with the question stem” → don’t miss polarity flips like NOT, EXCEPT, LEAST.
“Clarify the question stem” → rewrite the ask in short words to avoid traps.
“Provide a clear rationale” → one short reason why your letter wins.
(I still only grade the final Answer: <LETTER>.)
A line you might see in a GEPA variant, “Identify the flaw in the argument…”, is LSAT‑LR‑style. It’s great for argument‑based sets, but on science/math datasets it can confuse the model. I keep dataset‑specific lines in dataset‑specific prompts.
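One way to keep dataset‑specific lines out of the shared base prompt is a small registry keyed by dataset. This is only a hypothetical sketch; the dataset keys, the extra lines, and compose_prompt are placeholders rather than the repo’s structure:
BASE_RULES = "Restate the question, quote one sentence, eliminate wrong options, then give the final letter."
DATASET_LINES = {
    "lsat_lr": "Identify the flaw in the argument before choosing.",      # argument-based sets only
    "science_mcq": "Prefer the option supported by the quoted evidence sentence.",
}

def compose_prompt(dataset_name):
    # Append a dataset-specific line only when one exists for this dataset.
    extra = DATASET_LINES.get(dataset_name, "")
    return BASE_RULES + ("\n" + extra if extra else "")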
Why you see Variant A/B/C (and sometimes they look similar)
GEPA writes several small edits to the base instructions (A/B/C). I A/B test them on a dev split and pick the winner by Pareto: best accuracy and/or lower tokens. If I run GEPA on multiple datasets with the same base prompt, saved variants can look similar because they share the same starting point and my reflection constraints are tight.
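Picking the winner “by Pareto” just means keeping the variants that no other variant beats on both accuracy and tokens. A minimal sketch, assuming each variant record carries accuracy and tokens fields:
def pareto_front(variants):
    # variants: dicts like {"name": "A", "accuracy": 0.40, "tokens": 212}
    front = []
    for v in variants:
        dominated = any(
            o["accuracy"] >= v["accuracy"] and o["tokens"] <= v["tokens"]
            and (o["accuracy"] > v["accuracy"] or o["tokens"] < v["tokens"])
            for o in variants
        )
        if not dominated:
            front.append(v)   # keep only non-dominated candidates
    return front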
How each run is graded (the exact mechanics)
I load one dataset.
For each question, I render the prompt (includes the allowed letters, and my strict rule: final line must be Answer: <LETTER>).
I run Baseline / Self‑Refine / GEPA‑chosen prompt.
I parse only the last line.
I compare to the answer key → right/wrong.
I aggregate accuracy and token cost.
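Spelled out as code, that loop is roughly the sketch below; load_dataset, render_prompt, run_strategy, and parse_final_letter are hypothetical helper names (parse_final_letter is the fail‑closed parser sketched under the format rule that follows):
def run_eval(dataset_path, strategy):
    results, usage = [], []
    for item in load_dataset(dataset_path):             # question, choices, answer key
        prompt = render_prompt(item)                    # allowed letters + strict final-line rule
        reply, tokens = run_strategy(strategy, prompt)  # Baseline / SR / GEPA-chosen prompt
        letter = parse_final_letter(reply)              # parse only the last matching line
        results.append(letter == item["answer"])        # compare to the answer key
        usage.append(tokens)
    return sum(results) / len(results), sum(usage)      # accuracy, tokens_total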
Format rule (machine‑checkable)
EBNF:
final_line = "Answer: ", LETTER ;
LETTER = "A" | "B" | "C" | "D" | "E" ; Regex (multiline):
(?m)^Answer:\s*([A-Z])\s*$Fail closed: if no match, grade = 0 and I record format_violation = true.
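The same rule as a fail‑closed Python check; a minimal sketch of what the harness does:
import re

ANSWER_RE = re.compile(r"(?m)^Answer:\s*([A-Z])\s*$")

def parse_final_letter(reply):
    # Fail closed: return None (grade 0, format_violation = True) when no line matches.
    matches = ANSWER_RE.findall(reply)
    return matches[-1] if matches else None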
Methods
Baseline Prompt
I instruct the model to restate the question, cite one evidence sentence if provided, eliminate wrong options, and output a final line Answer: <LETTER>.
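As a rough illustration only (the repo’s exact wording differs), the baseline instruction amounts to something like:
# Illustrative baseline prompt template; not the repo's exact wording.
BASELINE_PROMPT = (
    "Restate the question in one short sentence.\n"
    "If evidence is provided, quote the single most relevant sentence.\n"
    "Briefly eliminate the wrong options.\n"
    "End with exactly one final line: Answer: <LETTER>\n\n"
    "Question: {question}\nChoices:\n{choices}"
)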
Self‑Refine (SR)
A two‑call chain: the first call proposes an answer; the second call critiques and revises. SR improves accuracy but doubles token cost and latency.
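A minimal sketch of the two‑call chain, with call_model standing in as a hypothetical helper that returns the reply text and its token count:
def self_refine(task_prompt):
    draft, t1 = call_model(task_prompt)                      # call 1: propose an answer
    critique = (
        "Here is your first attempt:\n" + draft + "\n"
        "Check the stem polarity and each elimination, fix any mistake, "
        "and end with exactly one line: Answer: <LETTER>"
    )
    final, t2 = call_model(task_prompt + "\n\n" + critique)  # call 2: critique and revise
    return final, t1 + t2                                    # both calls count toward tokens_total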
GEPA
I collect failed dev examples and use a reflection prompt to describe failure modes and propose concise edits that fix them without altering the final answer format. I append these edits to the baseline prompt to create variants A/B/C. I evaluate the variants on the dev set, apply a Pareto filter to select the highest‑accuracy, lowest‑token candidates, and then run the top variant on the test set.
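Put together, one GEPA round looks roughly like this sketch; passes, reflect_on_failures, and evaluate are hypothetical helpers, evaluate is assumed to return accuracy and tokens for a variant, and pareto_front is the filter sketched earlier:
def gepa_round(base_prompt, dev_set):
    failures = [ex for ex in dev_set if not passes(base_prompt, ex)]    # failed dev examples
    edits = reflect_on_failures(failures)                               # LLM proposes concise edits
    variants = [base_prompt + "\n" + e for e in edits[:3]]              # variants A/B/C
    scored = [{"prompt": v, **evaluate(v, dev_set)} for v in variants]  # accuracy + tokens each
    best = max(pareto_front(scored), key=lambda r: r["accuracy"])       # Pareto filter, then top accuracy
    return best["prompt"]                                               # run this variant on test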
Hybrid SR→GEPA
I pass SR’s output to a GEPA auditor. I implement a conservative override:
Override rule: GEPA may overrule SR only when
(a) confidence ≥ 0.85, and
(b) the auditor explicitly flags an error type ∈ {misread stem, polarity flip, eliminated correct option, format violation}.
Confidence definition. The auditor produces a rubric score s ∈ [0,1] indicating confidence that SR’s answer contains a specific, named error; I calibrate this score on the dev set. Overrides require s ≥ 0.85 and a one‑sentence rationale tied to the error label.
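The override reduces to one conditional; a minimal sketch, assuming the auditor returns a small dict with its answer, confidence, and error label:
ALLOWED_ERRORS = {"misread stem", "polarity flip", "eliminated correct option", "format violation"}

def final_answer(sr_answer, audit):
    # audit: e.g. {"answer": "C", "confidence": 0.91, "error_type": "polarity flip", "rationale": "..."}
    if audit.get("confidence", 0.0) >= 0.85 and audit.get("error_type") in ALLOWED_ERRORS:
        return audit["answer"]   # auditor overrides SR
    return sr_answer             # otherwise keep SR's answer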
Evaluation Metrics
I compute development and test accuracy, average tokens per question, tokens per correct (my cost metric), cost per 100 correct (USD), and format compliance (strict Answer: <LETTER>).
Cost reproduction note. I count both prompt and completion tokens for every call. For SR, the second call includes the first call’s transcript where applicable.
Datasets
I use publicly available MCQ datasets spanning logic, science, math, and truthfulness tasks. Each dataset is split into train/dev/test. Only the development set influences GEPA; the test set remains unseen until final evaluation.
Results
Across 40+ runs, I observed that naïve GEPA edits often degraded accuracy and increased tokens. For example, some variants added verbose reasoning, causing token blow‑ups and format violations. The best SR runs reached 100% on easier datasets at a high cost (~600 tokens per question), while GEPA variants occasionally matched accuracy with fewer tokens.
A key discovery is that minimal scaffolding (two short reasoning lines plus the answer) often outperformed elaborate prompts. When I enforced strict format compliance and treated GEPA as a safety auditor (overriding SR only when confidence ≥ 0.85), dev accuracy improved from 0.1 → 0.3 and test accuracy from 0.1 → 0.4 on challenging LSAT‑LR tasks, while tokens per correct dropped substantially (e.g., 857 vs. 2794 for SR). The system achieved 40% accuracy on LSAT‑LR (details in the repo). Comprehensive results for AGIEval LSAT‑AR illustrate strategy variability; tokens per correct ranged from ~11 to >5000 depending on the run mode.
How this extends prior DSPy evaluations. Stan highlights DSPy’s reliability and structure benefits. I agree, and I add that minimal scaffolding + strict format rules often beat verbose prompts on tutor MCQs when I optimize for tokens‑per‑correct.
Discussion
More rules aren’t always better. Over‑complex prompts confused the model and increased cost. SR remains a strong baseline for accuracy but is expensive; GEPA, when used as an efficiency tuner (optimizing for fewer tokens under constant accuracy), is highly useful. The hybrid SR→GEPA mode, with strict confidence thresholds, prevents GEPA from introducing wrong corrections.
Future research. I plan to explore context‑aware GEPA prompts tailored to dataset domains, dynamic token budgets to avoid unnecessary reflections, and richer auditors that cross‑check stem/choice polarity and evidence consistency.
Optimizer Decision Matrix (how I choose)
Strict schema/JSON compliance is the primary goal → MIPROv2 as the default.
Tutor‑style reasoning under a cost budget → GEPA’s reflective edits, distilled into a single‑call prompt.
Maximum accuracy where token cost is secondary → Self‑Refine.
Hard argument‑based sets (e.g., LSAT‑LR) where SR still misses → Hybrid SR→GEPA auditor with the confidence ≥ 0.85 override.
Production Notes (tiny but useful)
I log traces & evals (e.g., MLflow or equivalent) for reproducibility and drift checks.
I keep a frozen eval set + seed for regression testing.
I enforce format compliance in code (regex) and fail closed.
I track tokens_per_correct and cost_per_100_correct as first‑class KPIs alongside accuracy.
I store per‑run CSVs (accuracy, tokens, violations) in /results/ and link them from the README.
Related Work
DSPy. A framework for declarative LLM programs with pluggable optimizers (e.g., BootstrapFewShot, MIPROv2). My pipeline fits the same slot with a different objective: accuracy under cost with format guarantees.
Self‑Refine. Two‑call critique/revise; strong accuracy but higher token cost and latency.
GEPA. Reflective, Pareto‑guided prompt evolution; I adapt its reflect‑mutate‑select pattern to tutor‑style MCQs with strict output checks.
Acknowledgments
Thanks to Stanislav Huseletov for the pragmatic DSPy framing that I build on, and to coworkers who shared early feedback on rubric design, auditing rules, and the evaluation harness.
Appendix: Reproducibility
Repository: https://github.com/dp-pcs/gepa-tutor-refinery
Runs & artifacts: CSVs for Baseline/SR/GEPA and Hybrid SR→GEPA, including accuracy, tokens per question, tokens per correct, and format violations.
Environment: I record model names, temperatures, max tokens, seeds, and split definitions in the repo’s configs and logs for each run.





