When Parallel Beats Smart
How We Cut Generation Time 43% by Splitting Our Pipeline: Three architecture decisions that made our Arabic education system actually work at scale
CONTEXT
You’ve seen how we built a self-improving education system with DSPy. You’ve witnessed the 7B vs 34B reality check. Now let me show you the counterintuitive engineering decisions that made this system actually work in production.
We generate 20-30 validated Arabic math questions in under 60 seconds. Not because we have faster models. Not because we optimized our prompts. But because we made three architectural choices that everyone told us were “overengineering.”
Spoiler: Sometimes the dumbest-looking architecture is the smartest.
The Dual-Pipeline Architecture
Everyone builds a single pipeline. Generate → Validate → Convert → Scaffold. Clean, simple, maintainable.
We split ours in two. Pipeline A for existing database questions. Pipeline B for new generation. They run in parallel, converge at Module 4, then split again. It looks like spaghetti on the architecture diagram.
Here’s why it’s good:
Our database has 4,000+ validated UAE curriculum questions. When a teacher asks for 20 questions, why generate all 20? We grab 15 from the database (24.5 seconds), generate 5 new ones (430 seconds), and run the two in parallel.
The result? Total execution: 430 seconds, instead of the 759 a sequential run would take.
Pipeline A skips Modules 1, 2, and 3 entirely. Those questions are already validated, pattern-extracted, and verified. It jumps straight to MCQ conversion and scaffolding. Meanwhile, Pipeline B does the full generation dance.
“But that’s more complex!” Sure. It’s also 43% faster and gives teachers a mix of battle-tested questions and fresh content. The database questions act as quality anchors while new questions add variety.
The real magic: we use an existing_ratio parameter. Want faster generation? Set it to 0.8 for 80% database questions. Want more novelty? Drop it to 0.2. The pipelines automatically rebalance.
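If you want to picture the orchestration, here's a minimal sketch of the idea in Python. Everything in it is illustrative - the module functions are toy stand-ins that just simulate latency - but the structure is the point: the ratio decides the split, and the two pipelines run concurrently.

```python
import asyncio

# Toy stand-ins for the real modules; they only simulate latency here.
async def fetch_from_database(n):
    """Pipeline A source: pre-validated curriculum questions."""
    await asyncio.sleep(0.1)
    return [{"source": "db", "id": i} for i in range(n)]

async def run_full_generation(n):
    """Pipeline B: Modules 1-3 (pattern extraction, generation, verification)."""
    await asyncio.sleep(0.5)
    return [{"source": "new", "id": i} for i in range(n)]

async def convert_and_scaffold(questions):
    """Shared Modules 4-5: MCQ conversion and scaffolding."""
    await asyncio.sleep(0.1)
    return questions

async def pipeline_a(n):
    # Skips Modules 1-3: database questions are already validated.
    return await convert_and_scaffold(await fetch_from_database(n))

async def pipeline_b(n):
    # Full generation dance for fresh questions.
    return await convert_and_scaffold(await run_full_generation(n))

async def generate_batch(total=20, existing_ratio=0.75):
    """Split by existing_ratio, run both pipelines in parallel, merge at the end."""
    db_count = round(total * existing_ratio)   # 0.75 -> 15 database questions
    new_count = total - db_count               # the remaining 5 are generated fresh
    existing, fresh = await asyncio.gather(pipeline_a(db_count), pipeline_b(new_count))
    return existing + fresh                    # wall clock ~= max(A, B), not A + B

if __name__ == "__main__":
    print(len(asyncio.run(generate_batch(20, existing_ratio=0.75))), "questions")
```

Because the database leg finishes long before the generation leg, it effectively rides along for free.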
This isn’t clean code. It’s production code. And in production, 43% faster matters more than architectural purity.
Why We Generate Math Diagrams with Code, Not AI
Everyone’s using DALL-E. Midjourney. Stable Diffusion. We use... Matplotlib and SVG generators.
For a Venn diagram question, DALL-E gives you beautiful, artistic circles. But can it guarantee exactly 5 objects in Set A and 3 in Set B? No. It’ll give you “approximately 5-ish looking things.”
Our approach: Python code generates the SVG. Every circle, every object, every intersection is programmatically placed. It’s uglier. It’s also 100% mathematically accurate.
Here’s the spiky truth: Image models can’t count. They’re pattern matchers, not mathematicians. When your Grade 3 question asks “How many apples in the picture?”, you need exactly that many apples. Not artistic interpretation.
We tried hybrid approaches. Run DALL-E first, fall back to SVG on failure. The overhead of detecting counting errors killed performance. Now we pre-classify: questions requiring precise counts use programmatic generation from the start.
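Here's roughly what "programmatic placement" means in practice. This is a toy sketch rather than our actual renderer, but the property that matters carries over: the object counts are exact by construction, because each marker comes from an explicit loop.

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, SVG output only
import matplotlib.pyplot as plt
from matplotlib.patches import Circle

def venn_with_exact_counts(only_a=5, only_b=3, filename="venn.svg"):
    """Two-set Venn diagram where each region holds an exact number of markers."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 6)
    ax.set_aspect("equal")
    ax.axis("off")

    # Circles are placed by coordinates, not by a model's best guess.
    ax.add_patch(Circle((4, 3), 2.2, fill=False, linewidth=2))
    ax.add_patch(Circle((6, 3), 2.2, fill=False, linewidth=2))
    ax.text(2.2, 5.5, f"Set A ({only_a})", fontsize=12)
    ax.text(6.3, 5.5, f"Set B ({only_b})", fontsize=12)

    # One marker per loop iteration, so the counts are exact by construction.
    for i in range(only_a):
        ax.plot(2.8, 1.8 + i * 0.6, "o", markersize=10, color="tab:red")
    for i in range(only_b):
        ax.plot(7.2, 2.1 + i * 0.6, "o", markersize=10, color="tab:blue")

    fig.savefig(filename, format="svg")
    plt.close(fig)

venn_with_exact_counts(only_a=5, only_b=3)
```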
The results:
Geometric figures: SVG beats DALL-E on accuracy by 70%
Charts/graphs: Matplotlib gives pixel-perfect data representation
Venn diagrams: Programmatic placement ensures set accuracy
Counting problems: Only SVG guarantees correct object counts
Yes, our images look like they’re from 2010. But they’re correct, generate in 3-8 seconds, and never hallucinate extra triangles.
The 10-Pattern Limit That Tripled Quality
Module 2 extracts question patterns from curriculum samples. Version 1 extracted everything it could find - usually 20-30 patterns per topic. More patterns = more variety, right?
Wrong. We now hard-limit to 10 patterns maximum.
Here’s what we discovered: LLMs extracting patterns follow a quality decay curve. The first 5-6 patterns are gold - unique, pedagogically sound, curriculum-aligned. Patterns 7-10 are decent variations. Everything after 10 is garbage - either duplicates with different wording or patterns that technically work but teach nothing.
But here’s the counterintuitive part: limiting to 10 patterns improved variety. Why? Because we spend those saved tokens on pattern enrichment instead. Each pattern gets:
Educational value scoring (0-10)
Template complexity analysis
Prerequisite skill mapping
Common misconception targeting
Differentiation strategies
A deeply enriched pattern generates better variety than surface-level quantity. Our 10 enriched patterns now produce more diverse questions than our old 30 shallow ones.
The enrichment process uses DSPy’s retrieval-augmented generation to pull pedagogical best practices from our curriculum database. Every pattern learns from 4,000+ production questions about what works.
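In DSPy terms, the enrichment step can be sketched roughly like this. Treat the field names and retrieval setup as illustrative rather than our exact modules; the signature just mirrors the enrichment list above.

```python
import dspy

class EnrichPattern(dspy.Signature):
    """Enrich a raw question pattern with pedagogical metadata."""
    pattern = dspy.InputField(desc="Raw extracted question pattern")
    curriculum_context = dspy.InputField(desc="Retrieved examples of validated production questions")
    educational_value = dspy.OutputField(desc="Score from 0-10")
    prerequisite_skills = dspy.OutputField(desc="Skills a student needs before attempting this pattern")
    misconceptions = dspy.OutputField(desc="Common misconceptions the pattern should target")
    differentiation = dspy.OutputField(desc="Strategies for stronger and weaker students")

class PatternEnricher(dspy.Module):
    def __init__(self, k=5):
        super().__init__()
        # Assumes a retrieval model is configured over the curriculum question DB,
        # e.g. dspy.settings.configure(lm=..., rm=...)
        self.retrieve = dspy.Retrieve(k=k)
        self.enrich = dspy.ChainOfThought(EnrichPattern)

    def forward(self, pattern: str):
        context = self.retrieve(pattern).passages   # pull similar validated questions
        return self.enrich(pattern=pattern, curriculum_context="\n".join(context))
```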
Quality scoring is brutal. Patterns scoring below 7/10 get rejected. We'd rather have 5 excellent patterns than 10 mediocre ones. In one test, we extracted 3 patterns from a geometry topic. The old system would have padded that out to 20. The new system generated 60% better questions from those 3 patterns alone.
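The curation rule itself is tiny. As a sketch, assuming each enriched pattern carries its educational value score as a field:

```python
MIN_SCORE = 7       # patterns below 7/10 get rejected
MAX_PATTERNS = 10   # hard cap, no padding to hit a target count

def curate(patterns: list[dict]) -> list[dict]:
    """Keep only the strongest patterns rather than padding for quantity."""
    strong = [p for p in patterns if p["educational_value"] >= MIN_SCORE]
    strong.sort(key=lambda p: p["educational_value"], reverse=True)
    return strong[:MAX_PATTERNS]   # 3 excellent patterns beat 20 padded ones
```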
This flies in the face of every instinct. More data should be better. More patterns should mean more variety. But in production, curation beats collection every single time.
What These Decisions Really Mean
These aren’t optimizations. They’re philosophical stances about how AI systems should work.
Parallel pipelines say that perfect shouldn’t be the enemy of good. Mix validated content with fresh generation.
Programmatic images say that correctness trumps beauty in education. An ugly correct diagram teaches better than a beautiful wrong one.
Pattern limits say that depth beats breadth. Ten patterns you deeply understand generate better questions than thirty you barely grasp.
Every one of these decisions makes our codebase messier. The dual pipelines require careful orchestration. Programmatic image generation means maintaining multiple rendering engines. Pattern enrichment adds entire processing stages.
But they also make our system actually work for teachers and students. Not in demos. Not in benchmarks. In actual classrooms where a wrong number in a diagram means teaching the wrong concept.
Next week, I’ll show you how we built our evaluation system that caught GPT-4 teaching students that 0.7 > 1.
Spoiler: We score ourselves harsher than any benchmark.