Design a Multi-Agent Orchestration System
The real test: can you decide WHEN multi-agent is even worth it, then coordinate without deadlock, context fragmentation, or cost explosion?
1. TL;DR
Decompose complex tasks across specialized LLM agents, orchestrate execution, and reconcile outputs. The core challenge is knowing when multi-agent beats a single agent, then coordinating without deadlock, context fragmentation, or cost explosion.
2. Clarifying Questions
| Question | Why | Design Fork |
|---|---|---|
| Which tasks need multi-agent vs. single? | Not all tasks benefit from decomposition | Simple Q&A → single; research+code+review → multi |
| Homogeneous or specialized agents? | Determines routing & prompt design | Homogeneous → fan-out; specialized → skill routing |
| Latency budget? | Sequential pipelines multiply latency | ≤5s → parallel-only; ≤60s → pipelines OK |
| User-facing or batch? | Concurrency & cost tolerance differ | Interactive → fewer agents, caching; batch → throughput |
| Which LLM providers? | Heterogeneous models change cost mix | Single → uniform quotas; multi → routing layer |
| Cost ceiling per task? | Multi-agent multiplies spend | Tight → fewer agents, smaller models; generous → full orchestration |
3. Requirements
Functional
- Auto-decompose complex tasks into sub-tasks with dependency DAG
- Route sub-tasks to specialist agents via skill matching
- Support sequential, parallel, hierarchical, and debate orchestration patterns
- Aggregate and reconcile results — including handling worker disagreement
- Shared blackboard state with ownership semantics
- Enforce termination criteria and budget caps per task
Non-Functional
- Latency: ≤30s interactive, ≤5min deep research
- Cost: ≤3× single-agent (target 1.5–2×)
- Reliability: 99.5% completion; no infinite loops or deadlocks
- Concurrency: 500+ simultaneous orchestrations
- Isolation: Zero cross-task information leakage
- Observability: Distributed trace for every agent interaction
4. Back-of-Envelope Estimation
| Metric | Single Agent | Multi-Agent (3–5) | Notes |
|---|---|---|---|
| Tokens/task | ~4K | ~20–60K | Multi-agent research ≈ 15× tokens of one chat turn |
| LLM calls/task | 1–2 | 5–15 | Orchestrator + workers + aggregation |
| Coordination overhead | 0% | 15–30% | Routing, message passing, reconciliation |
| Latency (sequential) | 2–5s | 10–45s | Each hop adds model latency |
| Latency (parallel) | 2–5s | 5–10s | Bounded by slowest worker |
| Cost/task (frontier model) | $0.05–0.15 | $0.30–1.50 | N agents × per-agent cost |
| Monthly at 100K tasks/day | ~$150–450K | ~$0.9–4.5M | 30-day month at the per-task costs above; cost explosion = #1 ops risk |
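The monthly figures follow from multiplying the per-task cost by task volume. A minimal sketch of that arithmetic, assuming a 30-day month (the constant `DAYS_PER_MONTH` and the helper name are illustrative, not part of the design):

```python
# Back-of-envelope monthly spend. Costs are kept in integer cents so the
# multiplication is exact. DAYS_PER_MONTH = 30 is an assumption.
TASKS_PER_DAY = 100_000
DAYS_PER_MONTH = 30

# (low, high) cost per task in cents, taken from the Cost/task row.
COST_CENTS = {
    "single_agent": (5, 15),
    "multi_agent": (30, 150),
}


def monthly_cost_usd(cents_per_task: int) -> float:
    """Monthly spend in dollars for a given per-task cost in cents."""
    return TASKS_PER_DAY * DAYS_PER_MONTH * cents_per_task / 100


monthly = {
    name: (monthly_cost_usd(low), monthly_cost_usd(high))
    for name, (low, high) in COST_CENTS.items()
}

for name, (low, high) in monthly.items():
    print(f"{name}: ${low:,.0f} to ${high:,.0f} per month")
```

At 3M tasks per month, every extra cent per task adds $30K to the monthly bill, which is why per-task budget caps matter more than per-call optimizations.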
5. Architecture
Hierarchical orchestrator → worker pattern with shared blackboard for state.
Request Lifecycle
- Intake: Task submitted → API gateway validates and enqueues
- Decision gate: Complexity classifier routes single vs. multi-agent
- Decomposition: Orchestrator breaks task into sub-task dependency DAG
- Assignment: Skill router matches sub-tasks to specialist agents
- Execution: Workers run in parallel/sequential per dependency graph
- Aggregation: Orchestrator reconciles and synthesizes results
- Termination: Validates done-criteria; loops or finalizes
- Response: Synthesized result returned to user
6. Core Component Deep Dives
A. The Decision: When Multi-Agent Pays Off
Multi-agent earns its complexity only when three conditions hold:
- Parallelizable: Sub-tasks run concurrently with minimal dependencies
- Separable contexts: Sub-tasks need different context windows (code vs. docs vs. data)
- Genuine specialization: Sub-tasks benefit from different prompts, tools, or models
| Criterion | Single Agent Wins | Multi-Agent Wins |
|---|---|---|
| Task complexity | Linear, single-domain | Cross-domain, different expertise |
| Context window | Fits in one window | Sub-tasks need separate large contexts |
| Latency tolerance | Low — fast response | Higher — parallel overhead OK |
| Quality bar | Good-enough, one pass | Needs verification or debate |
| Tool access | All tools to one agent | Sandboxed tools per agent |
| Cost sensitivity | Tight budget | Quality justifies 2–5× cost |
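The three conditions can be expressed as a simple gate. This is a sketch only: `TaskProfile` and `choose_mode` are hypothetical names, and a real complexity classifier would have to produce these booleans from the task itself.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical output of the complexity classifier."""
    parallelizable: bool        # sub-tasks can run concurrently
    separable_contexts: bool    # sub-tasks need different context windows
    needs_specialization: bool  # sub-tasks benefit from different prompts, tools, or models


def choose_mode(task: TaskProfile) -> str:
    """Return "multi" only when all three conditions hold, else "single"."""
    if task.parallelizable and task.separable_contexts and task.needs_specialization:
        return "multi"
    return "single"


research = TaskProfile(parallelizable=True, separable_contexts=True, needs_specialization=True)
simple_qa = TaskProfile(parallelizable=False, separable_contexts=False, needs_specialization=False)
print(choose_mode(research), choose_mode(simple_qa))
```

Defaulting to "single" whenever any condition fails keeps the expensive path opt-in rather than opt-out.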
B. Orchestration Patterns
| Pattern | Mechanism | Best For | Risk |
|---|---|---|---|
| Hierarchical | Orchestrator delegates; workers may sub-delegate | Complex multi-step task trees | Deep nesting → latency |
| Sequential Pipeline | A → B → C, each refining prior output | Draft → review → polish | Error propagation |
| Parallel Fan-Out/Gather | Fan out N tasks, merge results | Independent research, map-reduce | Reconciliation complexity |
| Debate / Critic | Two agents argue; judge synthesizes | High-stakes, adversarial verification | 3× cost, gridlock |
| Blackboard | Agents read/write shared state reactively | Iterative refinement | Race conditions |
C. Communication Protocol
Agents communicate via structured messages, not raw transcripts.
- Envelope: `{ from, to, type, task_id, payload, token_budget }`
- Payload types: task assignment, result, clarification, status, escalation
- Context summaries: Compressed (≤500 tokens) — never full conversation history
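One possible shape for the envelope, sketched with standard-library dataclasses. The field names follow the envelope above; the class names, the `sender`/`recipient` renaming (needed because `from` is a Python keyword), and the JSON wire format are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class MessageType(str, Enum):
    TASK_ASSIGNMENT = "task_assignment"
    RESULT = "result"
    CLARIFICATION = "clarification"
    STATUS = "status"
    ESCALATION = "escalation"


@dataclass(frozen=True)
class Envelope:
    sender: str        # serialized as "from"
    recipient: str     # serialized as "to"
    type: MessageType
    task_id: str
    payload: dict
    token_budget: int

    def to_json(self) -> str:
        """Serialize using the envelope's wire field names."""
        data = asdict(self)
        data["from"] = data.pop("sender")
        data["to"] = data.pop("recipient")
        data["type"] = self.type.value
        return json.dumps(data, sort_keys=True)


msg = Envelope(
    sender="orchestrator",
    recipient="researcher",
    type=MessageType.TASK_ASSIGNMENT,
    task_id="task-1",
    payload={"goal": "summarize prior art"},
    token_budget=2000,
)
print(msg.to_json())
```

Restricting `type` to an enum means a worker cannot invent a new message kind that the orchestrator does not know how to handle.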
D. Task Decomposition & Assignment
- DAG output: Orchestrator decomposes task into sub-tasks with dependency edges and complexity estimates
- Skill tags: Sub-tasks tagged with capabilities (`code_gen`, `web_search`, `math`) and matched to agent profiles
- Load balancing: Among agents sharing a skill, route to lowest queue depth
- Budget split: Total token budget allocated proportional to estimated sub-task complexity
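The bullets above can be sketched with the standard-library `graphlib` module. The sub-task names, complexity weights, and agent records are made up for illustration; only the three mechanisms (dependency waves, lowest-queue routing, proportional budget split) come from the text.

```python
import graphlib

# Hypothetical decomposition: sub-task -> set of sub-tasks it depends on.
dag = {
    "search_docs": set(),
    "search_code": set(),
    "write_patch": {"search_docs", "search_code"},
    "review_patch": {"write_patch"},
}
complexity = {"search_docs": 1, "search_code": 2, "write_patch": 5, "review_patch": 2}


def execution_waves(graph):
    """Group sub-tasks into waves; tasks in the same wave can run in parallel."""
    sorter = graphlib.TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def split_budget(total_tokens, weights):
    """Allocate tokens proportionally to estimated complexity (floor division)."""
    total_weight = sum(weights.values())
    return {task: total_tokens * w // total_weight for task, w in weights.items()}


def pick_agent(skill, agents):
    """Among agents that have the skill, pick the one with the shortest queue."""
    candidates = [a for a in agents if skill in a["skills"]]
    if not candidates:
        raise LookupError(f"no agent has skill {skill!r}")
    return min(candidates, key=lambda a: a["queue_depth"])["name"]


print(execution_waves(dag))
print(split_budget(10_000, complexity))
```

Because `prepare()` rejects cyclic graphs, the same call doubles as the DAG validation step before any worker is started.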
E. Context Isolation
Each agent gets its own context window — a feature, not a limitation.
- Agents receive only sub-task description + relevant context, not global state
- Cross-agent messages use explicit schemas — prevents context fragmentation (pollution from irrelevant details) and context bloat (accumulated chatter exhausting the window)
F. Result Aggregation & Reconciliation
| Strategy | Mechanism | When to Use |
|---|---|---|
| Merge | Deduplicate and synthesize agreeing results | Workers agree on outputs |
| Majority vote | N agents vote; majority wins | Factual questions with clear answers |
| Synthesis | Dedicated synthesizer reconciles conflicts | Creative or nuanced tasks |
| Escalation | Flag for human review | Low confidence, safety-critical |
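A sketch of how the orchestrator might choose among the four strategies. The function name, the `confidence` input, and the 0.6 threshold are assumptions for illustration; the table does not specify how confidence is measured.

```python
from collections import Counter


def choose_strategy(answers, factual, confidence, min_confidence=0.6):
    """Pick a reconciliation strategy for a list of worker answers.

    Returns (strategy, answer). The answer is None when a later step
    (synthesizer agent or human reviewer) still has to produce it.
    """
    if not answers:
        raise ValueError("no worker answers to reconcile")
    if confidence < min_confidence:
        return ("escalation", None)
    top_answer, top_count = Counter(answers).most_common(1)[0]
    if top_count == len(answers):
        return ("merge", top_answer)           # workers agree
    if factual and top_count * 2 > len(answers):
        return ("majority_vote", top_answer)   # strict majority on a factual question
    return ("synthesis", None)                 # conflict needs a dedicated synthesizer


print(choose_strategy(["42", "42", "41"], factual=True, confidence=0.9))
```

Note that a tie on a factual question falls through to synthesis instead of picking a side arbitrarily.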
G. Shared State & Ownership
- Blackboard: Partitioned into named sections (`research_findings`, `code_artifacts`, `decisions`)
- Write ownership: Only the assigned agent writes its section; others read-only
- Optimistic locking: Version numbers on writes; orchestrator resolves conflicts
- TTL: Entries expire after task completion — no stale state leakage
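An in-memory sketch of ownership plus optimistic locking. It is a stand-in for the real store: TTL expiry is left out, and the class and exception names are hypothetical.

```python
class VersionConflict(Exception):
    """Raised when a write is based on a stale version of a section."""


class Blackboard:
    def __init__(self, owners):
        # owners maps section name -> the single agent allowed to write it.
        self._owners = dict(owners)
        self._sections = {name: (0, None) for name in self._owners}

    def read(self, section):
        """Any agent may read; returns (version, value)."""
        return self._sections[section]

    def write(self, section, agent, value, expected_version):
        """Write only if the agent owns the section and saw the latest version."""
        if self._owners[section] != agent:
            raise PermissionError(f"{agent} does not own {section}")
        current_version, _ = self._sections[section]
        if expected_version != current_version:
            raise VersionConflict(
                f"{section}: expected v{expected_version}, found v{current_version}"
            )
        self._sections[section] = (current_version + 1, value)
        return current_version + 1


board = Blackboard({"research_findings": "researcher", "code_artifacts": "coder"})
version, _ = board.read("research_findings")
board.write("research_findings", "researcher", ["finding A"], version)
print(board.read("research_findings"))
```

A writer that gets `VersionConflict` re-reads the section and retries, or hands the conflict to the orchestrator.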
H. Termination
Never trust agents to self-stop. Enforce externally:
- Done-criteria: Measurable conditions (“all sub-tasks returned”, “tests pass”)
- Token budget cap: Hard ceiling; agent killed on exceed
- Wall-clock timeout: Per sub-task and per orchestration
- Loop detection: Same message ≥3 times → force-terminate
- Max iterations: Cap orchestrator↔worker round-trips (≤5)
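The external checks can live in one small guard that the orchestrator consults after every round-trip. A minimal sketch, assuming the orchestrator reports token usage and the message text each time; `TerminationGuard` and its reason strings are illustrative names, and the injectable `clock` exists only to make the timeout testable.

```python
import time
from collections import Counter
from typing import Callable, Optional


class TerminationGuard:
    def __init__(self, token_cap: int, timeout_s: float, max_iterations: int = 5,
                 max_repeats: int = 3, clock: Callable[[], float] = time.monotonic):
        self.token_cap = token_cap
        self.timeout_s = timeout_s
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self._clock = clock
        self._started = clock()
        self._tokens = 0
        self._iterations = 0
        self._seen = Counter()

    def record(self, tokens: int, message: str) -> Optional[str]:
        """Record one round-trip; return a stop reason, or None to continue."""
        self._tokens += tokens
        self._iterations += 1
        self._seen[message] += 1
        if self._tokens > self.token_cap:
            return "token_budget_exceeded"
        if self._clock() - self._started > self.timeout_s:
            return "wall_clock_timeout"
        if self._seen[message] >= self.max_repeats:
            return "loop_detected"
        if self._iterations > self.max_iterations:
            return "max_iterations_exceeded"
        return None


guard = TerminationGuard(token_cap=1000, timeout_s=60)
print(guard.record(400, "draft v1"), guard.record(700, "draft v2"))
```

Any non-None return value means the orchestrator stops the agent, regardless of what the agent itself reports.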
7. Data & State
- Task store (PostgreSQL): Definitions, status, dependency graphs, results
- Blackboard (Redis): Ephemeral shared state with TTL, partitioned by task ID
- Message bus (Kafka or NATS JetStream): Inter-agent messages with at-least-once delivery; consumers deduplicate with idempotency keys
- Trace store (Jaeger/Tempo): OpenTelemetry distributed traces per orchestration
- Artifact store (S3): Large outputs — code, documents, images
8. Scaling
- Horizontal pools: Stateless agent containers; scale on queue depth
- Orchestrator sharding: Partition by task ID across instances
- Provider LB: Round-robin API keys and providers to avoid rate limits
- Priority queues: Interactive over batch; prevents head-of-line blocking
- Backpressure: Reject with retry-after when pools saturate
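Priority queues and backpressure combine naturally in the admission path. The sketch below uses a bounded in-process heap; the class name, the capacity, and the 5-second retry hint are assumptions, and a production system would put this in front of a real queue.

```python
import heapq
import itertools


class AdmissionQueue:
    INTERACTIVE = 0  # lower number = served first
    BATCH = 1

    def __init__(self, capacity, retry_after_s=5):
        self.capacity = capacity
        self.retry_after_s = retry_after_s
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, task_id, priority):
        """Accept the task, or reject with a retry hint when saturated."""
        if len(self._heap) >= self.capacity:
            return {"accepted": False, "retry_after_s": self.retry_after_s}
        heapq.heappush(self._heap, (priority, next(self._seq), task_id))
        return {"accepted": True}

    def next_task(self):
        """Pop the highest-priority task, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = AdmissionQueue(capacity=2)
queue.submit("batch-1", AdmissionQueue.BATCH)
queue.submit("chat-1", AdmissionQueue.INTERACTIVE)
print(queue.submit("chat-2", AdmissionQueue.INTERACTIVE))  # rejected: queue is full
print(queue.next_task())  # the interactive task jumps ahead of the batch task
```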
9. Failure Modes
| Failure | Cause | Mitigation |
|---|---|---|
| Cascading failures | One agent fails; dependents cascade | Circuit breakers; partial-result fallback |
| Infinite handoff loops | A routes to B, B routes back to A | Max-hop counter; orchestrator loop detection |
| Context fragmentation | Irrelevant cross-agent context pollutes reasoning | Structured summaries only; no raw transcripts |
| Duplicated work | Overlapping sub-task assignments | Dedup at decomposition; idempotency keys |
| Coordination deadlock | Circular wait: A on B, B on A | DAG validation (no cycles); timeout breakers |
| Cost explosion | Runaway agents burning tokens in loops | Per-task/per-agent budgets; hard kill |
| Hallucination poisoning | One agent hallucinates; others consume as fact | Cross-validation; source citation required |
| Stale blackboard | Agent reads outdated shared state | Version vectors; read-after-write consistency |
| Orchestrator SPOF | Orchestrator crashes mid-task | Checkpoint to durable store; restart resumes |
| Provider outage | LLM API 5xx or rate-limit | Multi-provider fallback; exponential backoff |
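The provider-outage mitigation (multi-provider fallback with exponential backoff) can be sketched as follows. The provider callables and `ProviderError` are hypothetical stand-ins for real client libraries, and the `sleep` parameter is injectable so the backoff schedule can be tested without waiting.

```python
import time


class ProviderError(Exception):
    """Stand-in for a 5xx or rate-limit error from an LLM provider."""


def call_with_fallback(providers, prompt, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Try each (name, callable) provider in order, backing off between retries.

    Returns (provider_name, response). Raises ProviderError if every
    provider fails on every attempt.
    """
    errors = []
    for name, call in providers:
        for attempt in range(max_attempts):
            try:
                return name, call(prompt)
            except ProviderError as exc:
                errors.append((name, attempt, str(exc)))
                if attempt < max_attempts - 1:
                    sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise ProviderError(f"all providers failed: {errors}")


def primary(prompt):
    raise ProviderError("503 from primary")


def backup(prompt):
    return f"echo: {prompt}"


print(call_with_fallback([("primary", primary), ("backup", backup)], "hello", sleep=lambda s: None))
```

Production code would usually add jitter to the delay so that many orchestrations do not retry in lockstep.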
10. Evaluation Strategy
- Task success rate: Target ≥95% completion with acceptable quality
- Quality delta: Multi-agent vs. single-agent on identical tasks
- Cost efficiency: Quality gain per dollar — must justify the multiplier
- Coordination tax: % tokens spent on orchestration vs. actual work
- Latency percentiles: p50/p95/p99 by orchestration pattern
- A/B testing: Identical tasks routed to single vs. multi-agent; compare quality and cost
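Two of these metrics reduce to simple ratios. A sketch with made-up numbers: the quality scores are hypothetical rubric points, and the function names are illustrative.

```python
def coordination_tax(orchestration_tokens: int, total_tokens: int) -> float:
    """Fraction of all tokens spent on orchestration instead of actual work."""
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    return orchestration_tokens / total_tokens


def quality_gain_per_dollar(quality_multi: float, quality_single: float,
                            cost_multi: float, cost_single: float) -> float:
    """Quality points gained per extra dollar spent on the multi-agent run."""
    extra_cost = cost_multi - cost_single
    if extra_cost <= 0:
        raise ValueError("multi-agent run must cost more than the single-agent run")
    return (quality_multi - quality_single) / extra_cost


# Hypothetical A/B result on one task set.
tax = coordination_tax(orchestration_tokens=6_000, total_tokens=30_000)
gain = quality_gain_per_dollar(quality_multi=90, quality_single=80,
                               cost_multi=0.75, cost_single=0.25)
print(f"coordination tax: {tax:.0%}, quality gain per dollar: {gain}")
```

A negative or near-zero gain per dollar on a task type is the signal to route that type back to the single-agent path.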
11. Safety & Security
- Agent sandboxing: Minimal tool permissions; code execution in isolated containers
- Injection isolation: User input sanitized before entering inter-agent messages
- Output filtering: Content safety at aggregation step before user-facing response
- Audit trail: Every action, tool call, and message logged immutably
- Secret scoping: Credentials scoped per agent type (research agent cannot access code-exec creds)
- HITL gates: High-impact actions require human approval
12. Cost & Latency Optimization
| Technique | Cost | Latency | Tradeoff |
|---|---|---|---|
| Model tiering (small for routing, large for reasoning) | −40–60% | −20% | Routing accuracy may decrease |
| Prompt caching for system prompts | −20–50% | −10% | Cache invalidation |
| Parallel fan-out vs. sequential | Neutral | −50–70% | Harder reconciliation |
| Early termination on high confidence | −15–30% | −20% | May miss edge cases |
| Context compression between agents | −30–40% | −5% | Information loss risk |
| Single-agent fast-path for simple tasks | −60–80% | −70% | Needs accurate complexity classifier |
13. Observability
Multi-agent requires distributed tracing — microservices discipline applied to LLM calls.
- Trace ID: Unique per orchestration, propagated to all agent calls
- Span per agent: Model, token count, latency, tool calls, result summary
- Parent-child spans: Orchestrator → worker spans mirror the dependency DAG
- Dashboards: Token spend by agent type, latency by pattern, failure rate by mode
- Alerts: Budget ≥80%, latency ≥2× p95, loop detection triggered
- Replay: Full per-agent conversation replay for debugging
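To make the span model concrete, here is a standard-library sketch of trace ID propagation and parent-child spans. It deliberately does not use the OpenTelemetry API; `Trace`, `Span`, and the attribute names are assumptions for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    parent_id: Optional[str] = None
    attributes: dict = field(default_factory=dict)


class Trace:
    """One trace per orchestration; every span carries the same trace_id."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    def start_span(self, name, parent=None, **attributes):
        span = Span(
            trace_id=self.trace_id,
            name=name,
            parent_id=parent.span_id if parent else None,
            attributes=attributes,
        )
        self.spans.append(span)
        return span

    def tokens_by_agent(self):
        """Aggregate token spend per agent type, as a dashboard would."""
        totals = {}
        for span in self.spans:
            agent = span.attributes.get("agent", "unknown")
            totals[agent] = totals.get(agent, 0) + span.attributes.get("tokens", 0)
        return totals


trace = Trace()
root = trace.start_span("orchestrate", agent="orchestrator", tokens=800)
trace.start_span("search_docs", parent=root, agent="researcher", tokens=3000)
trace.start_span("write_patch", parent=root, agent="coder", tokens=5000)
print(trace.tokens_by_agent())
```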
14. Tradeoffs
| Decision | Chose | Rejected | Why |
|---|---|---|---|
| Communication | Structured messages | Full transcripts | Transcripts cause bloat and cross-pollution |
| Topology | Hierarchical orchestrator | Peer-to-peer mesh | Mesh lacks authority for budgets, deadlock detection, reconciliation |
| State | Blackboard + ownership | Fully shared mutable | Ownership prevents write conflicts at scale |
| Termination | External caps + timeouts | Agent self-completion | Agents can’t reliably judge own completion |
| Models | Heterogeneous tiering | Uniform frontier model | Routing doesn’t need frontier; saves 40–60% |
| Disagreement | Synthesizer + escalation | Always majority vote | Voting fails on nuanced tasks |
15. Rollout & Ops
- Shadow mode: Multi-agent runs parallel with single-agent; compare without serving
- Canary: 5% of complex tasks → multi-agent; monitor cost/quality delta
- Feature flags: Per-pattern toggles (enable fan-out, disable debate)
- Cost guardrails: Per-user/org daily limits; auto-downgrade to single-agent
- Runbooks: Procedures for cost spike, stuck orchestration, cascading failure
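Canary assignment is commonly done with a stable hash so the same task always lands in the same bucket. A sketch under that assumption; the function names and the flag dictionary are hypothetical.

```python
import hashlib


def in_canary(task_id: str, percent: int = 5) -> bool:
    """Deterministically place roughly `percent`% of task IDs in the canary."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


def route(task_id: str, is_complex: bool, flags: dict) -> str:
    """Send a task to multi-agent only if it is complex, flagged on, and in the canary."""
    if is_complex and flags.get("multi_agent_enabled", False) and in_canary(task_id):
        return "multi"
    return "single"


sample = [f"task-{i}" for i in range(10_000)]
share = sum(in_canary(t) for t in sample) / len(sample)
print(f"canary share: {share:.1%}")
```

Hashing the task ID instead of drawing a random number keeps retries of the same task on the same path, which makes cost and quality comparisons cleaner.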
16. Phasing v0 → vN
| Phase | Scope | Milestone |
|---|---|---|
| v0 · Baseline | Single agent with tools; establish quality/cost baselines | Metrics for comparison |
| v1 · Pipeline | Two-agent draft → review; structured messages | Measurable quality lift over v0 |
| v2 · Fan-out | Orchestrator + 3–5 parallel workers; aggregation | Latency reduction for parallel tasks |
| v3 · Full orchestration | Dynamic decomposition, skill routing, blackboard, debate | Cost-per-quality <2× single-agent |
| v4 · Self-optimizing | Orchestrator learns best patterns per task type | Adaptive routing outperforms static |
17. Honest Uncertainty
- Optimal agent count: No theory gives the right N — discovered empirically via A/B testing
- Emergent behavior: Multi-agent can exhibit groupthink and sycophancy loops that are hard to anticipate
- Quality ceiling: Whether multi-agent consistently beats a single agent with better prompting remains task-dependent and unresolved
- Coordination scaling: As agent count grows, coordination cost may grow super-linearly — exact curve is pattern-dependent
18. Curveball Follow-Ups
“When would you NOT use multi-agent?”
When the task fits one context window, needs no specialized tools, and a single agent hits acceptable quality. Also when the latency budget is ≤5s, the cost budget is tight, or decomposition overhead exceeds the parallelism benefit. Most production LLM tasks today are better served by one agent with good tools.
“Two agents disagree. How do you resolve it?”
Three tiers:
- (1) Factual → ground-truth via tool call (search, calculator).
- (2) Judgment call → synthesizer agent reconciles with explicit reasoning.
- (3) Low confidence → escalate to human with both positions. Never silently pick one; log every disagreement.
“One worker hangs — what happens?”
Per-agent wall-clock timeout fires:
- (1) Kill agent, mark sub-task failed.
- (2) If critical-path, retry with fresh agent (max 2).
- (3) If retries exhaust, return partial results from completed workers with quality disclaimer.
- (4) Orchestration-level timeout is the ultimate backstop: the task never blocks indefinitely.
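The timeout, retry, and partial-result steps can be sketched with `asyncio`. This is a simplification: it retries every worker, not only critical-path ones, and it leaves out the orchestration-level timeout. The worker factories and the `FAILED` sentinel are hypothetical.

```python
import asyncio

FAILED = object()  # sentinel: the worker never produced a result


async def run_with_retries(make_worker, timeout_s, max_retries=2):
    """Run a worker with a wall-clock timeout; retry with a fresh one on timeout."""
    for attempt in range(1 + max_retries):
        try:
            return await asyncio.wait_for(make_worker(attempt), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue  # wait_for has already cancelled the hung worker
    return FAILED


async def orchestrate(workers, timeout_s):
    """Run all workers in parallel and return whatever completed."""
    outcomes = await asyncio.gather(
        *(run_with_retries(make_worker, timeout_s) for make_worker in workers.values())
    )
    results = {name: out for name, out in zip(workers, outcomes) if out is not FAILED}
    failed = [name for name, out in zip(workers, outcomes) if out is FAILED]
    return {"results": results, "failed": failed, "partial": bool(failed)}


async def fast(attempt):
    return "ok"


async def hangs(attempt):
    await asyncio.sleep(10)  # simulates a worker that never returns in time


async def flaky(attempt):
    if attempt == 0:
        await asyncio.sleep(10)  # first attempt hangs, the retry succeeds
    return "recovered"


outcome = asyncio.run(orchestrate({"a": fast, "b": hangs, "c": flaky}, timeout_s=0.05))
print(outcome)
```

The `partial` flag is what lets the response layer attach the quality disclaimer when some workers never finished.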
“How do agents share state without stepping on each other?”
Blackboard with ownership: each section has one designated writer, many readers. Optimistic locking via version numbers detects conflicts. For truly shared sections (running summary), the orchestrator is sole writer, aggregating worker inputs.
19. Level-Up Ladder
| Level | Demonstrates |
|---|---|
| Junior | Describes agents calling LLMs; ignores single-vs-multi decision; no cost awareness |
| Mid | Identifies patterns (sequential, parallel); addresses timeouts and retries; mentions cost |
| Senior | Explicit single-vs-multi criteria; context isolation; structured comms; termination guarantees; disagreement reconciliation |
| Staff | Designs complexity classifier for routing; models cost-quality quantitatively; distributed tracing; self-optimizing orchestration; addresses hallucination poisoning |