[Deep Dive] Building a Meeting Copilot: The Vision
One avatar seat, many specialist brains, and the summoning pattern that makes it work
What if you could clone yourself for meetings? Not a creepy deepfake, but a transparent AI copilot that represents you when you can't be there, and seamlessly hands off to the real you when you can.
The Problem We're Solving
Here's a scenario that's probably familiar: You have three meetings scheduled at the same time. One is a status update you don't really need to attend but should probably monitor. Another is a sales call where your presence matters but most of the talking is done by your team. The third actually requires your brain.
The common solutions are bad:
Decline two meetings: Miss information, look disengaged
Ask for recordings: Never actually watch them
Double-book and hop between: Exhausting and ineffective
Hire an assistant: Expensive and can't make technical decisions
What if instead you had an AI avatar that could attend meetings on your behalf, with actual video presence, and knew when to summon specialist capabilities for specific tasks?
The Two-Fold Vision
Vision 1: One Avatar, Many Capabilities
The core model isn't "a team of bots in the meeting." It's one intelligent avatar that knows who to call.
You invite a single bot to the meeting. It joins with video presence, an actual avatar in the meeting grid, not a chat sidebar. Participants see a face, hear a voice, and interact naturally. Under the hood, this avatar is backed by a coordinator called Maestro that understands context and can summon specialist capabilities on demand.
Here's what that looks like in practice:
Someone asks about scheduling. Maestro summons Tempo, who has access to everyone's calendars and can propose times in real-time. Once the scheduling question is resolved, control returns to Maestro.
The conversation shifts to social media strategy. Maestro summons The Algorithm, who can pull live engagement data and draft tweets on the fly.
A question comes up about recent emails. Gatekeeper gets summoned, surfaces relevant threads, and hands back.
Someone asks "what's the market rate for X?" Radar does real-time web research and reports back.
The key: one seat in the meeting, one avatar, one conversation flow. The specialists swap in and out behind the scenes using LiveKit's agent dispatch. Each specialist has its own voice, personality, and toolset, but the meeting participants experience a seamless handoff, not a crowd of bots fighting for airtime.
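To make the "one avatar that knows who to call" idea concrete, here is a minimal sketch of the routing decision. The specialist names come from the examples above; the keyword lists and the `pick_agent` function are hypothetical stand-ins. A real coordinator would presumably let the language model decide rather than match substrings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Specialist:
    name: str
    keywords: tuple[str, ...]  # illustrative trigger phrases, not from the real system


# Specialist names are from the article; the keywords are assumptions.
SPECIALISTS = (
    Specialist("Tempo", ("schedule", "calendar", "availability")),
    Specialist("The Algorithm", ("tweet", "social media", "engagement")),
    Specialist("Gatekeeper", ("email", "inbox", "thread")),
    Specialist("Radar", ("market rate", "research", "look up")),
)


def pick_agent(utterance: str) -> str:
    """Return the specialist to summon, or "Maestro" to keep the default brain."""
    text = utterance.lower()
    for specialist in SPECIALISTS:
        if any(keyword in text for keyword in specialist.keywords):
            return specialist.name
    return "Maestro"


print(pick_agent("Can we reschedule Thursday's sync?"))     # Tempo
print(pick_agent("What's the market rate for this role?"))  # Radar
print(pick_agent("Thanks, everyone."))                      # Maestro
```

Whatever makes the decision, the shape stays the same: exactly one name comes back, so exactly one brain holds the seat at a time.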
This is already working. The dispatch-and-handoff architecture exists. But it's not seamless yet.
Vision 2: The Avatar Copilot
The more ambitious vision: an avatar that represents you specifically.
Picture this:
The avatar has your face (or a stylized representation)
It's loaded with your knowledge base: your articles, notes, meeting history, writing style, and domain expertise
For many routine meetings, it can handle things autonomously. It knows your positions on common questions. It can give status updates based on your actual work.
But here's the key differentiator from a generic bot: you can take over at any moment.
Two takeover modes:
1. Voice passthrough: You speak, and the avatar speaks your words in real time (maybe creepy, maybe not; we'll cross that bridge when we get there)
2. Text-to-speech: You type, the avatar vocalizes what you typed
And crucially: transparency built in. The system makes clear when it's pulling from your knowledge base versus when it's actually you. Maybe a subtle visual indicator. Maybe the avatar explicitly says "Based on David's notes..." vs. "David says..."
This isn't about tricking people. It's about extending your presence honestly.
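Since the transparency piece is not built yet, the following is only a sketch of one way the spoken attribution could work. The `Source` enum and `attribute` function are hypothetical; the two phrasings mirror the "Based on David's notes..." and "David says..." examples above.

```python
from enum import Enum


class Source(Enum):
    KNOWLEDGE_BASE = "knowledge_base"  # answer drawn from stored notes
    HUMAN_LIVE = "human_live"          # the real person is typing or speaking


def attribute(text: str, source: Source, owner: str = "David") -> str:
    """Prefix an utterance so listeners can tell who is actually behind it."""
    if source is Source.HUMAN_LIVE:
        return f"{owner} says: {text}"
    return f"Based on {owner}'s notes: {text}"


print(attribute("the migration is on track.", Source.KNOWLEDGE_BASE))
print(attribute("let's move the launch to Friday.", Source.HUMAN_LIVE))
```

Tagging every utterance with its source at the point of generation means a visual indicator could read the same tag, rather than trying to infer it afterward.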
Why This Matters Now
Three things have converged to make this feasible:
1. Real-time voice AI finally works
OpenAI's Realtime API delivers speech-to-text, LLM reasoning, and text-to-speech in a single streaming pipeline. Response times are now 300-600ms - conversational speed. A year ago this required stitching together three separate systems with compounding latency.
2. Meeting integration infrastructure exists
Services like Recall.ai and LiveKit have solved the hard problems of getting AI into video calls. You can join Google Meet, Zoom, or Teams programmatically, capture audio, display video, and stream responses back. The plumbing works.
3. Agent dispatch patterns are production-ready
LiveKit's agent dispatch lets you spin up a fresh agent instance by name into an existing room. Combined with Redis for coordination state, you get clean handoffs: the current agent finishes speaking, signals exit, and the new specialist takes over within seconds. No multi-agent chaos, just sequential, controlled transitions.
The Architecture at 30,000 Feet
Here's how the single-avatar, multi-specialist model works:
The key insight: there's one AI seat in the meeting, but its brain can change. Maestro starts as the default, handling general conversation. When a specialist capability is needed, Maestro dispatches the right agent via LiveKit, passes context through Redis, and steps aside. The specialist handles its domain, then hands control back.
This is the "summoning" pattern: Maestro summons Tempo for a calendar question the same way you'd turn to a colleague and say "hey, you handle this one." The meeting participants see one continuous avatar presence, while the architecture underneath swaps agents.
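The handoff lifecycle described above can be sketched as a small state machine. This is an illustration under stated assumptions, not the actual implementation: a plain in-memory object stands in for the Redis coordination state, the LiveKit dispatch call is omitted entirely, and the names `CoordinationState`, `request_handoff`, and `claim_seat` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CoordinationState:
    """In-memory stand-in for the coordination state the article keeps in Redis."""
    active_agent: str = "Maestro"
    exit_signal: Optional[str] = None  # agent that has been asked to step aside
    pending: Optional[dict] = None     # handoff context for the incoming agent
    log: list = field(default_factory=list)


def request_handoff(state, from_agent, to_agent, context):
    """Outgoing side: store context and signal exit. Real dispatch would follow."""
    if state.active_agent != from_agent:
        raise RuntimeError(f"{from_agent} does not hold the seat")
    if state.pending is not None:
        raise RuntimeError("a handoff is already in progress")
    state.pending = {"from": from_agent, "to": to_agent, "context": context}
    state.exit_signal = from_agent
    state.log.append(f"{from_agent} -> {to_agent} requested")


def claim_seat(state, agent):
    """Incoming side: take over the seat and read the handoff context."""
    if state.pending is None or state.pending["to"] != agent:
        raise RuntimeError(f"no handoff pending for {agent}")
    context = state.pending["context"]
    state.active_agent = agent
    state.pending = None
    state.exit_signal = None
    state.log.append(f"{agent} active")
    return context


state = CoordinationState()
request_handoff(state, "Maestro", "Tempo", {"topic": "find a slot next week"})
print(claim_seat(state, "Tempo"))    # {'topic': 'find a slot next week'}
request_handoff(state, "Tempo", "Maestro", {"result": "Tuesday 10:00 proposed"})
print(claim_seat(state, "Maestro"))  # {'result': 'Tuesday 10:00 proposed'}
print(state.active_agent)            # Maestro
```

The two guards in `request_handoff` are what keep transitions sequential: only the agent that holds the seat can give it away, and only one handoff can be in flight.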
What's Working Today
Let's be honest about the current state versus the aspirations.
Working:
Single agent joins Google Meet via Recall.ai with avatar video
OpenAI Realtime API handles voice conversation at ~500ms latency
Avatar video renders via HeyGen LiveAvatar
MCP tools connect to 5 external services (X/Twitter, Google Calendar, Gmail, web search, Substack)
Agent dispatch: Maestro summons specialists, specialists hand back
Redis coordination: active agent tracking, exit signaling, handoff context
Partially working:
Avatar rendering adds significant latency (500-2000ms)
Specialist-to-specialist handoffs (bypassing Maestro) need more testing
Handoff transitions have a brief gap while agents swap
Not yet built:
Personal knowledge base integration (the "your notes, your style" piece)
Human takeover UX (voice passthrough, text input)
Transparency signaling (AI vs. human indicator)
Client-side avatar rendering for lower latency
The Latency Reality Check
Here's the thing architects need to understand: cloud avatar rendering is slow.
Current end-to-end latency breakdown:
Total: 1-3+ seconds
That's not conversational. It's awkward.
The avatar rendering is the bottleneck. Every cloud avatar service we tested (HeyGen, Hedra) has this problem. The video frames have to be generated on their servers and streamed back. Physics and network latency impose a floor.
The path forward is probably client-side avatar rendering using WebGL and something like Ready Player Me. That could get the total latency under 500ms. But it's a significant engineering lift.
For now, there's also an audio-only mode that skips avatar rendering entirely. Total latency drops to 300-800ms - genuinely conversational. The trade-off is no video presence, just a static image with an audio visualizer.
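As a rough sanity check, the two stage figures quoted in this article can be added up. This sketch uses only those two ranges (300-600ms for the voice pipeline, 500-2000ms for cloud avatar rendering). Anything else in the path, such as network hops or meeting-bot transport, is not itemized here, so treating it as the difference between these sums and the stated totals is an assumption.

```python
# Figures quoted in this article, in milliseconds as (low, high).
VOICE_PIPELINE_MS = (300, 600)   # OpenAI Realtime response time
AVATAR_RENDER_MS = (500, 2000)   # cloud avatar rendering overhead


def add_ranges(*ranges):
    """Sum latency ranges stage by stage: lows with lows, highs with highs."""
    return (sum(r[0] for r in ranges), sum(r[1] for r in ranges))


with_avatar = add_ranges(VOICE_PIPELINE_MS, AVATAR_RENDER_MS)
audio_only = add_ranges(VOICE_PIPELINE_MS)

print(with_avatar)  # (800, 2600)
print(audio_only)   # (300, 600)
```

The two itemized stages alone give 0.8-2.6 seconds, which lines up with the 1-3+ second total once unlisted overhead is added. Dropping the avatar stage removes most of the budget, which is why audio-only mode feels conversational.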
Where We're Going
This series will walk through the architecture in detail:
Part 2: The Audio-Video Pipeline
Deep dive into getting AI into video calls. Recall.ai integration, LiveKit room management, OpenAI Realtime API, and avatar rendering. We'll trace a single utterance from human speech to AI response and back.
Part 3: Multi-Agent Orchestration
How the summoning pattern works in code. LiveKit agent dispatch, Redis coordination, the Maestro-specialist handoff lifecycle, and why one-at-a-time beats a crowd.
Part 4: The Knowledge Layer & Human-in-the-Loop
Building the personal knowledge base, the takeover UX, and how to signal transparency between AI and human speaking.
Key Takeaways
The vision is a meeting copilot: one avatar seat with specialist capabilities behind it, plus a personal avatar that represents you
One bot, many brains: agents swap in and out via dispatch, not multiple bots fighting for airtime
Transparency is non-negotiable: clear signaling when AI is speaking versus when you are
Infrastructure exists: Recall.ai, LiveKit, OpenAI Realtime make the core pipeline possible today
Latency is the enemy: cloud avatar rendering adds 0.5-2 seconds, pushing total latency to 1-3+ seconds
MCP tools are reused: the same tool infrastructure that powers chat agents powers the meeting avatar
Next up: the technical deep dive into getting an AI's voice and face into a Google Meet call.
---
This is Part 1 of a 4-part series on building a meeting copilot.




