The Bug That Kept Cutting Our AI Videos Off Mid-Sentence
The part of 'great videos at the click of a button' that nobody puts in the demo, and the one-line fix that closed the gap.
Editor's note: this work was done when our internal tool was called Amplifier. We've since renamed the repo and in-code identifiers to amplify. Names below (amplifier-video-worker, amplifier-constraints.md, data-composition-id="amplifier-explainer") match the code at the time of the fix; on main today they're spelled amplify-*.
Previously: In Part 1 we moved explainer-video composition authoring out of our web app and into the worker, gave the LLM the same Hyperframes authoring skills we use locally, and got bespoke videos instead of templated ones. The pitch (point at an article, click a button, get a finished video) held up. The execution had a hole in it. We shipped a bug in our first published batch: videos that played their visuals to the end while the narration cut off mid-sentence. This is that story, and it's the part of "great videos at the click of a button" that nobody puts in the demo.
Context
The day after the worker-authored pipeline went live, two teammates pinged the channel with the same complaint: "the video just stops talking." Different articles. Different durations. Different goals. Same symptom. By the time they flagged it, a few hundred views had already landed on videos that didn't finish their last sentence.
The videos were beautiful, by the way. Custom palettes, tight motion graphics, on-brand. They just didn't finish.
Timeline: Bug reported 2026-05-20 morning. Root cause confirmed within ~30 minutes. Fix landed and tested within ~3 hours.
Severity: P1. The new feature's most visible failure mode was "the experience the LLM was supposed to elevate doesn't actually play through."
Constraint: We didn't want to roll back. The composition quality was the whole point of the pivot.
The Objective
Find out why TTS-narrated explainer videos were truncating before the narration finished, and fix it without giving up the worker-authored composition pipeline that made them look good in the first place.
The Investigation
A two-line root cause that took three files to confirm.
Step 1: Read the bug, not the code
The reporters described the symptom precisely: "the video plays through visually, then the audio cuts off mid-sentence around the end." That's a really specific clue. It told us:
The video's timeline ended on schedule.
The narration audio was longer than that timeline.
The renderer dutifully stopped at the timeline boundary.
So the question wasn't "why is the video broken." It was "what determines the timeline length, what determines the narration length, and what makes them disagree."
Step 2: Find the timeline source of truth
In our Hyperframes runtime, every composition's length is driven by the data-duration attribute on the root composition element. The runtime reads it, builds a GSAP master timeline of that length, and exports an MP4 of that length. So whoever sets data-duration sets the video's duration.
Two candidates write that attribute. One is our template fallback (buildHtmlTemplate), which computes the duration from the planned scene durations. The other is the LLM, when it authors a composition from scratch.
Grepping the worker:
// packages/amplifier-video-worker/src/worker.ts (around line 1511)
// Before the fix: authored composition written to disk exactly as the model wrote it
writeFileSync(join(projectDir, "index.html"), authoredIndexHtml, "utf-8");
The LLM-authored HTML went to disk exactly as the model wrote it. Whatever value the model put in data-duration, that's what shipped.
Step 3: Find what the model thought it should write
The model's instructions live in skills/amplifier-constraints.md, the ~100-line constraint document we bake into the system prompt. Relevant excerpt:
The root composition element must have data-composition-id="amplifier-explainer", data-width / data-height matching the chosen aspect ratio (16:9 → 1920×1080, 1:1 → 1080×1080), and data-duration matching the target duration in seconds.
The audio narration track, if voice is enabled, must use src="./narration.mp3" … The composition must declare every scene clip with data-start and data-duration so the Hyperframes runtime can drive visibility correctly.
The model was doing what we'd literally asked: lock data-duration on the root element, the narration track, and the last scene to the target duration the user picked in the interview (30 / 45 / 60 / 90 seconds). The model had no way to know what would happen next.
Step 4: Find what actually happened next
What happened next was ElevenLabs. After the LLM authored the composition, the worker took the model's narration array (an ordered list of { sceneId, startSeconds, narrationText } entries) and sent it to ElevenLabs to synthesize the actual MP3. ElevenLabs returned audio that, surprise, ran whatever length the words actually took. For a chatty narrative-style brief, that might be 42 seconds against a 30-second target. The voice didn't know about data-duration. Neither did the speaker's mouth.
The diagnosis, in one line:
The LLM faithfully locks data-duration to the target the user picked, as instructed, but nothing rewrites that value to match the real narration length before the HTML is rendered.
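To make the mismatch concrete, here is a toy model of the behaviour described above: the renderer stops at the timeline boundary, so whatever narration extends past it is lost. The function and field names are hypothetical and exist only for illustration; the 30 and 42 second figures are the example from Step 4.

```typescript
// Toy model, not pipeline code: names here are illustrative assumptions.
interface PlaybackOutcome {
  renderedSeconds: number;     // length of the exported video
  narrationCutSeconds: number; // narration that falls past the timeline end
}

// The timeline length (data-duration) alone decides where the render stops.
function simulateRender(timelineSeconds: number, narrationSeconds: number): PlaybackOutcome {
  return {
    renderedSeconds: timelineSeconds,
    narrationCutSeconds: Math.max(0, narrationSeconds - timelineSeconds),
  };
}

// A 30-second timeline with 42 seconds of narration loses the last 12 seconds of speech.
const outcome = simulateRender(30, 42);
```

When the narration is shorter than the timeline, the cut is zero and the only effect is some leftover visual time.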
Step 5: Confirm the sanity check wasn't saving us
We had a post-render sanity check that validated the rendered MP4 duration against the target, but with a 50% tolerance window (0.5× to 1.5×). A 30-second target with 42 seconds of actual audio is 140% of target. That falls inside the tolerance. The video passed validation while cutting off mid-sentence.
The 50% window was set for the template path, where we computed durations from planned scene lengths and the worst realistic error was ±20%. The worker-authored path could produce much wider swings, and the check we'd inherited was too generous to catch them. (More on this in "What We'd Do Differently": the window is still 50% today, and that's a debt, not a fix.)
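The arithmetic in this step can be sketched as a ratio check. This is a minimal illustration, not the worker's actual validator: the function name and signature are assumptions, and the 0.2 window is used only because ±20% is the error figure quoted for the template path.

```typescript
// Hypothetical helper: true when actual/target falls inside [1 - tolerance, 1 + tolerance].
function isWithinTolerance(actualSeconds: number, targetSeconds: number, tolerance: number): boolean {
  const ratio = actualSeconds / targetSeconds;
  return ratio >= 1 - tolerance && ratio <= 1 + tolerance;
}

// 42 / 30 = 1.4, which sits inside the inherited 0.5x to 1.5x window...
const passesInheritedWindow = isWithinTolerance(42, 30, 0.5);
// ...but outside a 0.8x to 1.2x window.
const passesTighterWindow = isWithinTolerance(42, 30, 0.2);
```

Under these assumptions the first call returns true and the second returns false, which is the gap the inherited window leaves open.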
The Fix
The right shape was already obvious from the diagnosis. The LLM should keep doing what it was doing; it's not the LLM's fault that ElevenLabs takes as long as it takes. What needed to change: after TTS came back and before we wrote the HTML to disk, we needed to rewrite data-duration on the elements that mattered to match the actual narration length.
We wrote a small pure function, extendCompositionDuration, that parses the LLM-authored HTML and, only when the actual TTS duration overruns the value the model baked into the root element, rewrites data-duration on three elements:
The root composition container (data-composition-id="amplifier-explainer").
The narration audio track (id="narration-track").
The last scene clip (determined by highest data-start + data-duration).
The function reads the original target straight out of the HTML's root element, so the only thing the caller has to supply is the actual narration length. It returns the rewritten HTML plus a structured log of which elements changed and from/to what. It's a no-op when the actual narration is shorter than or equal to what the model already wrote, which is most jobs.
// real signature (packages/amplifier-video-worker/src/composition.ts)
export function extendCompositionDuration(
  html: string,
  actualDurationSeconds: number,
): {
  html: string;
  extended: boolean;
  originalRootDurationSeconds: number;
  newRootDurationSeconds: number;
  modifications: Array<{ target: string; from: number; to: number }>;
};
// integration in the worker's render pipeline (worker.ts ~line 1511)
const narrationDurationSeconds = await synthesizeNarration(/* … */);
const extended = extendCompositionDuration(authoredIndexHtml, narrationDurationSeconds);
if (extended.extended) {
  log.info(
    {
      originalSeconds: extended.originalRootDurationSeconds,
      newSeconds: extended.newRootDurationSeconds,
      elementsRewritten: extended.modifications,
    },
    "extended composition timeline for TTS overrun",
  );
}
writeFileSync(join(projectDir, "index.html"), extended.html, "utf-8");
Three implementation details that mattered more than they look:
Regex on raw HTML, not DOM parsing. The worker writes the file to a temp directory and shells out to the renderer; we never have a live DOM. Operating on the raw string is fastest and avoids dragging in jsdom for one rewrite. The function collects its edits with their string offsets and applies them back-to-front so earlier replacements don't shift indices for later ones.
Tolerate quote styles and whitespace. The model writes data-duration="30", sometimes data-duration='30', sometimes with extra spaces. We accept both quote styles. We do not accept arbitrary attribute syntax: if the LLM strays far enough off-pattern that the regex misses, the rewriter falls back to a no-op and the existing post-render sanity check is the last line of defense.
Scene order in the HTML doesn't have to match time order. "Last scene" is whichever scene has the highest data-start + data-duration, not whichever element appears last in source order. We compute that explicitly. A surprising number of LLM-authored compositions emit scenes out of time order, likely because the model is thinking about layout grouping while writing.
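The three details above can be sketched together in a simplified rewriter. This is an illustration under stated assumptions, not the production extendCompositionDuration: every helper name is hypothetical, it returns only the rewritten HTML rather than the structured log, and it assumes the last scene is stretched so that it ends at the new timeline end.

```typescript
// Illustrative sketch only; this is NOT the production extendCompositionDuration.
interface NumberAttr { value: number; start: number; end: number }
interface Edit { start: number; end: number; text: string }

// Accepts data-duration="30", data-duration='30', and data-duration = "30".
const DURATION = /data-duration\s*=\s*(["'])\s*(\d+(?:\.\d+)?)\s*\1/;
const START = /data-start\s*=\s*(["'])\s*(\d+(?:\.\d+)?)\s*\1/;

// Locate a numeric attribute inside one tag and return its document offsets.
function findNumberAttr(tag: string, tagOffset: number, attr: RegExp): NumberAttr | null {
  const m = attr.exec(tag);
  if (m === null) return null;
  const quoteAt = m[0].indexOf(m[1]);
  const start = tagOffset + m.index + m[0].indexOf(m[2], quoteAt);
  return { value: Number(m[2]), start, end: start + m[2].length };
}

// Round away float noise; String() keeps integers free of a trailing ".0".
const fmt = (seconds: number): string => String(Number(seconds.toFixed(3)));

// Apply edits from the highest offset down so earlier offsets stay valid.
function applyEdits(source: string, edits: Edit[]): string {
  return [...edits]
    .sort((a, b) => b.start - a.start)
    .reduce((text, e) => text.slice(0, e.start) + e.text + text.slice(e.end), source);
}

function extendDurationSketch(html: string, actualSeconds: number): string {
  if (!(actualSeconds > 0)) return html; // degenerate input: no-op
  let root: NumberAttr | null = null;
  let narration: NumberAttr | null = null;
  let last: { endsAt: number; startsAt: number; duration: NumberAttr } | null = null;

  const openingTag = /<[a-zA-Z][^>]*>/g;
  let m: RegExpExecArray | null;
  while ((m = openingTag.exec(html)) !== null) {
    const tag = m[0];
    const duration = findNumberAttr(tag, m.index, DURATION);
    if (duration === null) continue;
    if (/data-composition-id\s*=/.test(tag)) {
      root = duration;
    } else if (/\sid\s*=\s*["']narration-track["']/.test(tag)) {
      narration = duration;
    } else {
      const start = findNumberAttr(tag, m.index, START);
      if (start === null) continue;
      const endsAt = start.value + duration.value;
      // "Last" means latest-ending in time, not last in source order.
      if (last === null || endsAt > last.endsAt) last = { endsAt, startsAt: start.value, duration };
    }
  }

  // Unknown HTML, or narration fits inside what the model wrote: no-op.
  if (root === null || actualSeconds <= root.value) return html;

  const edits: Edit[] = [{ start: root.start, end: root.end, text: fmt(actualSeconds) }];
  if (narration !== null) edits.push({ start: narration.start, end: narration.end, text: fmt(actualSeconds) });
  if (last !== null && last.endsAt < actualSeconds) {
    edits.push({ start: last.duration.start, end: last.duration.end, text: fmt(actualSeconds - last.startsAt) });
  }
  return applyEdits(html, edits);
}
```

Collecting offsets first and applying them from the end of the string backwards is what lets all three rewrites share one pass: a replacement of a different length never invalidates an offset that has not been used yet.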
Tests
Nine cases in composition-duration-extension.test.ts, covering:
Happy path: narration overruns → root, narration track, and last scene all rewritten to the actual length.
No-op at or below target: actual ≤ what the model wrote → no change, modifications list empty. (Short narration with leftover visual time is fine.)
Voice-disabled mode: no narration track present → still rewrites root and last scene if needed; just skips the missing track.
Integer formatting preserved: integer durations stay integers, no spurious .0.
No-op on unknown HTML: missing data-composition-id → no change.
No-op on degenerate input: zero or negative actual seconds → no change.
Out-of-order scenes: "last scene" identified by data-start + data-duration, not source order.
Single-quoted attributes: data-duration='30' handled the same as double-quoted.
Last scene already covers actual: the latest-ending clip already long enough → that one modification is skipped.
The fix shipped without re-calibrating the post-render sanity check; we'll come back to why that was a mistake.
Results
What Worked
The fix landed in one function plus one integration point: about 190 lines across the function and its single call site, plus a 164-line test file, with no schema migrations and no spec changes. The composition module didn't have to know about TTS; the TTS step didn't have to know about timelines. The new function sits between them and does one job.
Videos play through. The two reported issues are resolved. Spot checks on a fresh batch of jobs show timelines now match narration length within ~0.2s. (The remaining drift is GSAP's tick granularity, not authoring.)
No regressions on the template path. The fallback's videos still render with their old (correct) computed durations because the rewriter is a pure no-op when the inputs already agree.
The diagnostic logs paid off immediately. Every extendCompositionDuration invocation logs original duration, new duration, and which elements got rewritten. We can answer "how often is this firing in production" without instrumenting anything new.
What Didn't
We shipped a real bug to real viewers. A few hundred views landed on truncated videos before two teammates flagged it. The pivot in Part 1 was correct, but we missed an implication of it for a day. That's not catastrophic, but it's not free either. We didn't write the test "what if the LLM and ElevenLabs disagree on duration?" until viewers wrote it for us.
The post-render sanity check is still lying. A 50% tolerance window means a 140%-of-target video is reported as "passing." We caught the user-facing symptom by rewriting the duration, but we never recalibrated the check that should have caught it in the first place. It's still ±50% on the worker-authored path as of this writing.
We don't yet expose this to the user or the model. When the LLM's narration estimate is off by 40%, the next job for that user with similar settings is going to be off again: same brief, same model, same misestimate. We could feed extension-event statistics back into the model's prompt ("for this duration target, the typical TTS overrun is ~12%, plan accordingly"), but we haven't yet. The fix is reactive; the next iteration should be predictive.
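One way the feedback idea could look, as a sketch rather than anything that exists in the pipeline: aggregate the logged extension events per original target and turn the median overrun into a prompt hint. The event shape reuses the originalSeconds and newSeconds fields from the log call; the function names and hint wording are assumptions.

```typescript
// Hypothetical event shape, mirroring the fields the worker already logs.
interface ExtensionEvent { originalSeconds: number; newSeconds: number }

// Median overrun (as a fraction of the original duration) per original target.
// Only jobs where the extension fired are logged, so this is the typical
// overrun among overrunning jobs, not across all jobs.
function medianOverrunByTarget(events: ExtensionEvent[]): Map<number, number> {
  const byTarget = new Map<number, number[]>();
  for (const e of events) {
    if (e.originalSeconds <= 0) continue;
    const bucket = byTarget.get(e.originalSeconds) ?? [];
    bucket.push(e.newSeconds / e.originalSeconds - 1);
    byTarget.set(e.originalSeconds, bucket);
  }
  const medians = new Map<number, number>();
  byTarget.forEach((overruns, target) => {
    const sorted = [...overruns].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    const median = sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
    medians.set(target, median);
  });
  return medians;
}

// Turn one statistic into a sentence that could be appended to the planning prompt.
function planningHint(targetSeconds: number, medianOverrun: number): string {
  const percent = Math.round(medianOverrun * 100);
  return `For a ${targetSeconds}-second target, narration has typically overrun by about ${percent}%; plan accordingly.`;
}
```

The median is used here instead of the mean so that a single wildly long narration does not dominate the hint.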
By the Numbers
Lessons Learned
When you hand the LLM authority over one dimension, audit every dimension that depends on it. The model was given authority over the composition's timeline. The video's audio length depended on that timeline matching reality. We didn't audit the dependency. Generalized: if an LLM owns variable X and your system has a downstream invariant X = f(other_things), you need a deterministic step that re-asserts the invariant after the model writes.
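The generalized rule can be written down as a small seam. This is a sketch of the pattern, not code from the worker; the names reassertInvariant and Reconciled are invented for illustration.

```typescript
// Generic shape of "the model proposes, deterministic code re-asserts the invariant".
interface Reconciled<T> { value: T; changed: boolean; reason?: string }

function reassertInvariant<T>(
  proposed: T,
  holds: (value: T) => boolean, // the downstream invariant
  repair: (value: T) => T,      // deterministic fix when it does not hold
  reason: string,
): Reconciled<T> {
  if (holds(proposed)) return { value: proposed, changed: false };
  return { value: repair(proposed), changed: true, reason };
}

// Instance from this incident: the timeline must be at least as long as the measured narration.
const measuredNarrationSeconds = 42;
const result = reassertInvariant(
  { durationSeconds: 30 },
  (c) => c.durationSeconds >= measuredNarrationSeconds,
  (c) => ({ ...c, durationSeconds: measuredNarrationSeconds }),
  "timeline shorter than narration",
);
```

Returning changed and reason alongside the value keeps the step observable: the caller can log every repair without the seam knowing anything about logging.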
Tolerance windows are calibrated to the system that produced them. Our ±50% sanity check was reasonable for the template path. It is wildly too generous for the worker-authored path, and we still haven't fixed it, which is its own small lesson: shipping the user-facing fix made the underlying check feel less urgent than it is. Whenever the producer of a value changes, recalibrate the check that validates it in the same change, or it quietly never happens.
Pure functions between systems beat shared mutable state. extendCompositionDuration(html, actual) → { html, modifications, … } is a pure function. It doesn't know about TTS providers, S3, or composition authoring. It's also the single load-bearing piece of logic that resolves a contradiction between two systems that don't know about each other. The composition pipeline didn't have to learn about narration; the narration pipeline didn't have to learn about HTML. The seam is in one place, fully testable.
Diagnostic logging is feature work. The extension event log (original duration, new duration, elements rewritten) means we can detect future drift, tune the model's planning prompt with real data, and answer "is this still firing" without an instrumentation project. Build observability into the fix, not as a follow-up.
Production traffic finds the bugs your tests didn't. The two teammates who reported this are, effectively, the test we hadn't written. The right response isn't to feel bad about not writing it; it's to widen the test surface around the bug class. We now have nine tests for one rewriter; the next time we add a piece of state the LLM owns, the test list is going to be longer to start with.
What We'd Do Differently
Recalibrate the post-render check the day we change what it's checking. A tighter window on the worker-authored path (the template path can keep its ±50%, since its computed durations stay reliable) should have been part of the Part-1 PR, not a follow-up, and it's still outstanding. "We'll tighten it later" almost always means "we'll tighten it after the first incident," and then after the incident's symptom is patched it slips again.
Audit "implicit invariants" when adding LLM authority. Make a list of every property the system used to derive deterministically, and ask whether the new LLM owner can violate any of them.
Build the rewriter pattern into the initial pivot. The "LLM proposes, deterministic code reconciles" shape is general. We should have introduced it in Part 1, even if its first instance was a no-op rewriter that just validated the LLM hadn't messed anything up. Adding it after a real incident is the wrong order of operations.
The Meta Test
Here's the part I find genuinely fun. The whole point of this system is that you can point it at one of our articles and get a video that's worth watching: not AI filler, but something that earns its place on the feed. So the real acceptance test for this two-part series isn't whether you think it reads well.
It's this: once these posts are polished, I'm going to point the builder at this case study and let the worker author its companion explainer video. Same interview, same skills, same pipeline I just described, including extendCompositionDuration making sure the narration actually finishes.
If the thesis in Part 1 is right, the video about how we stopped shipping slop won't look like slop. If it's wrong, you'll be able to tell. That's about as honest an acceptance test as I can offer, and it's the whole argument in one move: the article describing the system becomes the input to the system.
Takeaways
When an LLM owns part of an output, ask which downstream invariants depend on the part it owns. Reconcile them in code, not in a prompt.
A deterministic seam between an LLM and a downstream system is your unit-testable, debuggable, observable surface. Use it.
Tolerance windows belong to producers, not to consumers. Recalibrate the check with the change that motivated it, before the symptom patch makes it feel optional.
Bug reports are tests you didn't write. Convert them, then widen the class.
Observability isn't follow-up. It's part of the fix.
David Proctor is VP of AI at Trilogy. He writes about AI infrastructure, agent protocols, and what actually works in production.



