Design a Multi-Agent Orchestration System

The real test: can you decide WHEN multi-agent is even worth it, then coordinate without deadlock, context fragmentation, or cost explosion?

1. TL;DR

Decompose complex tasks across specialized LLM agents, orchestrate execution, and reconcile outputs. The core challenge is knowing when multi-agent beats a single agent, then coordinating without deadlock, context fragmentation, or cost explosion.

The Trap: Juniors say “spin up agents for everything” without first asking whether a single well-prompted agent would do the job. Multi-agent adds coordination overhead, roughly N× token cost, and new failure modes. You must earn that complexity.

[Figure: architecture diagram for the multi-agent orchestration system]

2. Clarifying Questions

Question	Why	Design Fork
Which tasks need multi-agent vs. single?	Not all tasks benefit from decomposition	Simple Q&A → single; research+code+review → multi
Homogeneous or specialized agents?	Determines routing & prompt design	Homogeneous → fan-out; specialized → skill routing
Latency budget?	Sequential pipelines multiply latency	≤5s → parallel-only; ≤60s → pipelines OK
User-facing or batch?	Concurrency & cost tolerance differ	Interactive → fewer agents, caching; batch → throughput
Which LLM providers?	Heterogeneous models change cost mix	Single → uniform quotas; multi → routing layer
Cost ceiling per task?	Multi-agent multiplies spend	Tight → fewer agents, smaller models; generous → full orchestration

3. Requirements

Functional

Auto-decompose complex tasks into sub-tasks with dependency DAG
Route sub-tasks to specialist agents via skill matching
Support sequential, parallel, hierarchical, and debate orchestration patterns
Aggregate and reconcile results — including handling worker disagreement
Shared blackboard state with ownership semantics
Enforce termination criteria and budget caps per task

Non-Functional

Latency: ≤30s interactive, ≤5min deep research
Cost: ≤3× single-agent (target 1.5–2×)
Reliability: 99.5% completion; no infinite loops or deadlocks
Concurrency: 500+ simultaneous orchestrations
Isolation: Zero cross-task information leakage
Observability: Distributed trace for every agent interaction

4. Back-of-Envelope Estimation

Metric	Single Agent	Multi-Agent (3–5)	Notes
Tokens/task	~4K	~20–60K	Multi-agent research ≈ 15× tokens of one chat turn
LLM calls/task	1–2	5–15	Orchestrator + workers + aggregation
Coordination overhead	0%	15–30%	Routing, message passing, reconciliation
Latency (sequential)	2–5s	10–45s	Each hop adds model latency
Latency (parallel)	2–5s	5–10s	Bounded by slowest worker
Cost/task (frontier model)	$0.05–0.15	$0.30–1.50	N agents × per-agent cost
Monthly at 100K tasks/day	~$150–450K	~$0.9–4.5M	3M tasks/month × cost/task; cost explosion = #1 ops risk

Key insight: Multi-agent multiplies token cost by ~N. Budget 15–30% extra tokens just for routing, context summaries, and reconciliation messages.
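As a sanity check, the monthly row can be recomputed from the per-task costs in the table. This is a minimal sketch that assumes a 30-day month (an assumption, the text does not state it); the function and constant names are illustrative only.

```python
# Recompute monthly spend from the per-task cost ranges in the table above.
TASKS_PER_DAY = 100_000
DAYS_PER_MONTH = 30  # assumption: the text does not state the month length


def monthly_cost_usd(cost_per_task_usd: float) -> float:
    """Monthly spend for a given per-task cost at the stated daily volume."""
    return cost_per_task_usd * (TASKS_PER_DAY * DAYS_PER_MONTH)


single_agent = (monthly_cost_usd(0.05), monthly_cost_usd(0.15))
multi_agent = (monthly_cost_usd(0.30), monthly_cost_usd(1.50))

print(f"single-agent: ${single_agent[0]:,.0f} to ${single_agent[1]:,.0f}")
print(f"multi-agent:  ${multi_agent[0]:,.0f} to ${multi_agent[1]:,.0f}")
```

Under these assumptions the single-agent range comes to roughly $150K to $450K per month and the multi-agent range to roughly $0.9M to $4.5M per month.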

5. Architecture

Hierarchical orchestrator → worker pattern with shared blackboard for state.

Request Lifecycle

Intake: Task submitted → API gateway validates and enqueues
Decision gate: Complexity classifier routes single vs. multi-agent
Decomposition: Orchestrator breaks task into sub-task dependency DAG
Assignment: Skill router matches sub-tasks to specialist agents
Execution: Workers run in parallel/sequential per dependency graph
Aggregation: Orchestrator reconciles and synthesizes results
Termination: Validates done-criteria; loops or finalizes
Response: Synthesized result returned to user

6. Core Components Deep Dives

A. The Decision: When Multi-Agent Pays Off

Multi-agent earns its complexity only when three conditions hold:

Parallelizable: Sub-tasks run concurrently with minimal dependencies
Separable contexts: Sub-tasks need different context windows (code vs. docs vs. data)
Genuine specialization: Sub-tasks benefit from different prompts, tools, or models

Criterion	Single Agent Wins	Multi-Agent Wins
Task complexity	Linear, single-domain	Cross-domain, different expertise
Context window	Fits in one window	Sub-tasks need separate large contexts
Latency tolerance	Low — fast response	Higher — parallel overhead OK
Quality bar	Good-enough, one pass	Needs verification or debate
Tool access	All tools to one agent	Sandboxed tools per agent
Cost sensitivity	Tight budget	Quality justifies 2–5× cost

Decision rule: If a single well-prompted agent with tools solves it in one pass with acceptable quality, do that. Multi-agent is justified for parallel expertise, context isolation, or adversarial verification.
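The decision rule can be sketched as a small gate function. This is one possible reading, not a definitive classifier: `TaskProfile` and its fields are hypothetical names standing in for whatever the complexity classifier actually emits.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical output of the complexity classifier."""
    single_pass_quality_ok: bool   # one well-prompted agent is good enough
    parallelizable: bool           # sub-tasks can run concurrently
    separable_contexts: bool       # sub-tasks need different context windows
    needs_specialization: bool     # different prompts, tools, or models help


def choose_mode(profile: TaskProfile) -> str:
    """Return "single" unless all three multi-agent conditions hold."""
    if profile.single_pass_quality_ok:
        return "single"
    if (profile.parallelizable
            and profile.separable_contexts
            and profile.needs_specialization):
        return "multi"
    return "single"  # default to the cheaper path when the case is not made


simple_qa = TaskProfile(True, False, False, False)
research_code_review = TaskProfile(False, True, True, True)
print(choose_mode(simple_qa), choose_mode(research_code_review))
```

Defaulting to "single" when the conditions are only partly met reflects the stance that multi-agent has to earn its complexity.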

B. Orchestration Patterns

Pattern	Mechanism	Best For	Risk
Hierarchical	Orchestrator delegates; workers may sub-delegate	Complex multi-step task trees	Deep nesting → latency
Sequential Pipeline	A → B → C, each refining prior output	Draft → review → polish	Error propagation
Parallel Fan-Out/Gather	Fan out N tasks, merge results	Independent research, map-reduce	Reconciliation complexity
Debate / Critic	Two agents argue; judge synthesizes	High-stakes, adversarial verification	3× cost, gridlock
Blackboard	Agents read/write shared state reactively	Iterative refinement	Race conditions
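To make the Parallel Fan-Out/Gather row concrete, here is a minimal sketch using Python's asyncio. The `worker` coroutine is a stand-in for a real agent call and the delays are made-up values; only the concurrency pattern is the point.

```python
import asyncio
import time


async def worker(name: str, delay_s: float) -> str:
    """Stand-in for a specialist agent; the sleep simulates model latency."""
    await asyncio.sleep(delay_s)
    return f"{name}: done"


async def fan_out_gather(subtasks: dict) -> list:
    """Run all sub-tasks concurrently; results come back in submission order."""
    return await asyncio.gather(*(worker(n, d) for n, d in subtasks.items()))


start = time.monotonic()
results = asyncio.run(fan_out_gather({"research": 0.05, "code": 0.10, "docs": 0.02}))
elapsed = time.monotonic() - start
print(results)
print(f"elapsed ~{elapsed:.2f}s")  # close to the slowest worker, not the sum
```

A sequential pipeline over the same three workers would take roughly the sum of the delays, which is the latency difference the estimation table points at.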

C. Communication Protocol

Agents communicate via structured messages, not raw transcripts.

Envelope: { from, to, type, task_id, payload, token_budget }
Payload types: task assignment, result, clarification, status, escalation
Context summaries: Compressed (≤500 tokens) — never full conversation history

Why NOT full transcripts: Passing Agent A’s 8K history to Agent B wastes context, increases cost, and cross-pollinates irrelevant reasoning. Structured summaries preserve signal, discard noise.
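The envelope could be modeled as a small typed record. The sketch below is illustrative: the `context_summary` payload key, the whitespace-based token estimate, and the renaming of `from`/`to` to `sender`/`recipient` are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum

MAX_SUMMARY_TOKENS = 500  # cap from the text


class MessageType(str, Enum):
    TASK_ASSIGNMENT = "task_assignment"
    RESULT = "result"
    CLARIFICATION = "clarification"
    STATUS = "status"
    ESCALATION = "escalation"


def rough_token_count(text: str) -> int:
    """Assumption: whitespace word count as a crude stand-in for a tokenizer."""
    return len(text.split())


@dataclass(frozen=True)
class Envelope:
    sender: str        # "from" in the text; `from` is a reserved word in Python
    recipient: str     # "to" in the text
    type: MessageType
    task_id: str
    payload: dict
    token_budget: int

    def __post_init__(self) -> None:
        # Reject oversized summaries at construction time.
        summary = self.payload.get("context_summary", "")
        if rough_token_count(summary) > MAX_SUMMARY_TOKENS:
            raise ValueError("context summary exceeds the 500-token cap")
        if self.token_budget <= 0:
            raise ValueError("token_budget must be positive")


msg = Envelope(
    sender="orchestrator",
    recipient="research_agent",
    type=MessageType.TASK_ASSIGNMENT,
    task_id="task-123",
    payload={"instruction": "Find prior art", "context_summary": "User wants a survey."},
    token_budget=4000,
)
print(msg.type.value, msg.token_budget)
```

Validating at construction time means an agent cannot forward a full transcript by accident; the message simply fails to build.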

D. Task Decomposition & Assignment

DAG output: Orchestrator decomposes task into sub-tasks with dependency edges and complexity estimates
Skill tags: Sub-tasks tagged with capabilities (code_gen, web_search, math) and matched to agent profiles
Load balancing: Among agents sharing a skill, route to lowest queue depth
Budget split: Total token budget allocated proportional to estimated sub-task complexity
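The DAG output and budget split can be sketched with the standard-library `graphlib` module (Python 3.9+). The sub-task names and complexity weights below are invented for illustration, and integer weights are assumed so the split stays in whole tokens.

```python
from graphlib import CycleError, TopologicalSorter


def execution_waves(deps: dict) -> list:
    """Group sub-tasks into waves; tasks within a wave can run in parallel.

    `deps` maps each sub-task to the set of sub-tasks it depends on.
    Raises CycleError if the graph is not a DAG.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # the cycle check happens here
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


def split_budget(total_tokens: int, complexity: dict) -> dict:
    """Allocate the token budget in proportion to estimated complexity."""
    total = sum(complexity.values())
    return {name: total_tokens * weight // total for name, weight in complexity.items()}


deps = {"research": set(), "code": {"research"}, "docs": {"research"}, "review": {"code", "docs"}}
waves = execution_waves(deps)
budget = split_budget(30_000, {"research": 4, "code": 6, "docs": 2, "review": 3})
print(waves)   # [['research'], ['code', 'docs'], ['review']]
print(budget)  # {'research': 8000, 'code': 12000, 'docs': 4000, 'review': 6000}
```

Rejecting cyclic graphs at decomposition time is also the cheapest place to rule out the circular-wait deadlock, before any worker has spent tokens.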

E. Context Isolation

Each agent gets its own context window — a feature, not a limitation.

Agents receive only sub-task description + relevant context, not global state
Cross-agent messages use explicit schemas — prevents context fragmentation (pollution from irrelevant details) and context bloat (accumulated chatter exhausting the window)

F. Result Aggregation & Reconciliation

Strategy	Mechanism	When to Use
Merge	Deduplicate and synthesize agreeing results	Workers agree on outputs
Majority vote	N agents vote; majority wins	Factual questions with clear answers
Synthesis	Dedicated synthesizer reconciles conflicts	Creative or nuanced tasks
Escalation	Flag for human review	Low confidence, safety-critical
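A minimal sketch of how three of these strategies might be selected for short factual answers. The 0.5 majority threshold and the lowercase/strip normalization are assumptions; the Synthesis strategy is left out because it needs a model call.

```python
from collections import Counter
from typing import Optional, Tuple


def reconcile(answers: list, majority_threshold: float = 0.5) -> Tuple[str, Optional[str]]:
    """Pick a reconciliation strategy for a list of worker answers.

    Returns (strategy, answer); answer is None when escalating.
    """
    if not answers:
        raise ValueError("no worker answers to reconcile")
    counts = Counter(a.strip().lower() for a in answers)
    top_answer, votes = counts.most_common(1)[0]
    if votes == len(answers):
        return ("merge", top_answer)            # all workers agree
    if votes / len(answers) > majority_threshold:
        return ("majority_vote", top_answer)    # clear majority
    return ("escalate", None)                   # no consensus: human review


print(reconcile(["Paris", "paris", "Paris "]))
print(reconcile(["42", "42", "41"]))
print(reconcile(["yes", "no"]))
```

Exact string matching only suits answers with a canonical form; free-text outputs would need the synthesizer path instead.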

G. Shared State & Ownership

Blackboard: Partitioned into named sections (research_findings, code_artifacts, decisions)
Write ownership: Only the assigned agent writes its section; others read-only
Optimistic locking: Version numbers on writes; orchestrator resolves conflicts
TTL: Entries expire after task completion — no stale state leakage
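Write ownership and optimistic locking can be illustrated with an in-memory stand-in for the blackboard. This is a sketch only: the class and exception names are hypothetical, TTL expiry is omitted, and a real deployment would sit on a shared store rather than a Python dict.

```python
class StaleWriteError(Exception):
    """Raised when a write is based on an outdated version."""


class Blackboard:
    def __init__(self, owners: dict) -> None:
        self._owners = dict(owners)                     # section -> owning agent
        self._entries = {s: (0, None) for s in owners}  # section -> (version, value)

    def read(self, section: str) -> tuple:
        """Any agent may read; returns (version, value)."""
        return self._entries[section]

    def write(self, agent: str, section: str, value, expected_version: int) -> int:
        """Write only if `agent` owns the section and saw the latest version."""
        if self._owners[section] != agent:
            raise PermissionError(f"{agent} does not own {section}")
        version, _ = self._entries[section]
        if version != expected_version:
            raise StaleWriteError(f"{section} is at v{version}, not v{expected_version}")
        self._entries[section] = (version + 1, value)
        return version + 1


board = Blackboard({"research_findings": "research_agent", "code_artifacts": "code_agent"})
version, _ = board.read("research_findings")
new_version = board.write("research_agent", "research_findings", ["source A"], version)
print(new_version, board.read("research_findings"))
```

A stale or unauthorized write fails loudly, which gives the orchestrator a clear signal to resolve the conflict instead of silently losing an update.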

H. Termination

Never trust agents to self-stop. Enforce externally:

Done-criteria: Measurable conditions (“all sub-tasks returned”, “tests pass”)
Token budget cap: Hard ceiling; agent killed when exceeded
Wall-clock timeout: Per sub-task and per orchestration
Loop detection: Same message ≥3 times → force-terminate
Max iterations: Cap orchestrator↔worker round-trips (≤5)
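The external checks above can be combined in one guard object that the orchestrator consults after every round-trip. This is a sketch under assumptions: `TerminationGuard` and its stop-reason strings are hypothetical, loop detection uses exact message equality, and done-criteria are left to the caller.

```python
import time
from collections import Counter
from typing import Optional


class TerminationGuard:
    def __init__(self, token_cap: int, timeout_s: float,
                 max_iterations: int = 5, max_repeats: int = 3) -> None:
        self.token_cap = token_cap
        self.timeout_s = timeout_s
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.tokens_used = 0
        self.iterations = 0
        self._seen = Counter()
        self._started = time.monotonic()

    def record(self, message: str, tokens: int) -> Optional[str]:
        """Record one round-trip; return a stop reason, or None to continue."""
        self.tokens_used += tokens
        self.iterations += 1
        self._seen[message] += 1
        if self.tokens_used > self.token_cap:
            return "token_budget_exceeded"
        if time.monotonic() - self._started > self.timeout_s:
            return "timeout"
        if self._seen[message] >= self.max_repeats:
            return "loop_detected"
        if self.iterations >= self.max_iterations:
            return "max_iterations"
        return None


guard = TerminationGuard(token_cap=10_000, timeout_s=30.0)
reasons = [guard.record("please clarify the input", 200) for _ in range(3)]
print(reasons)  # [None, None, 'loop_detected']
```

Because the guard lives in the orchestrator rather than in the agent prompt, a misbehaving agent cannot talk its way past it.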

7. Data & State

Task store (PostgreSQL): Definitions, status, dependency graphs, results
Blackboard (Redis): Ephemeral shared state with TTL, partitioned by task ID
Message bus (Kafka/NATS): Inter-agent messages with guaranteed delivery
Trace store (Jaeger/Tempo): OpenTelemetry distributed traces per orchestration
Artifact store (S3): Large outputs — code, documents, images

8. Scaling

Horizontal pools: Stateless agent containers; scale on queue depth
Orchestrator sharding: Partition by task ID across instances
Provider LB: Round-robin across API keys and providers to stay under rate limits
Priority queues: Interactive over batch; prevents head-of-line blocking
Backpressure: Reject with retry-after when pools saturate
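Priority queues and backpressure can be sketched together as a bounded admission queue. The class name, the two priority levels, and the 5-second retry hint are assumptions made for the example.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower number is served first


class AdmissionQueue:
    def __init__(self, capacity: int, retry_after_s: int = 5) -> None:
        self.capacity = capacity
        self.retry_after_s = retry_after_s
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order per priority

    def submit(self, task_id: str, priority: int) -> dict:
        """Accept the task, or reject with a retry hint when saturated."""
        if len(self._heap) >= self.capacity:
            return {"accepted": False, "retry_after_s": self.retry_after_s}
        heapq.heappush(self._heap, (priority, next(self._order), task_id))
        return {"accepted": True}

    def next_task(self):
        """Pop the highest-priority task, or None if the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


queue = AdmissionQueue(capacity=2)
first = queue.submit("batch-report", BATCH)
second = queue.submit("chat-42", INTERACTIVE)
third = queue.submit("chat-43", INTERACTIVE)  # queue is full, so this is rejected
served = queue.next_task()
print(third, served)
```

The interactive task is served before the earlier batch task, and the third submission is turned away with a retry hint instead of queueing without bound.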

9. Failure Modes

Failure	Cause	Mitigation
Cascading failures	One agent fails; dependents cascade	Circuit breakers; partial-result fallback
Infinite handoff loops	A routes to B, B routes back to A	Max-hop counter; orchestrator loop detection
Context fragmentation	Irrelevant cross-agent context pollutes reasoning	Structured summaries only; no raw transcripts
Duplicated work	Overlapping sub-task assignments	Dedup at decomposition; idempotency keys
Coordination deadlock	Circular wait: A on B, B on A	DAG validation (no cycles); timeout breakers
Cost explosion	Runaway agents burning tokens in loops	Per-task/per-agent budgets; hard kill
Hallucination poisoning	One agent hallucinates; others consume as fact	Cross-validation; source citation required
Stale blackboard	Agent reads outdated shared state	Version vectors; read-after-write consistency
Orchestrator SPOF	Orchestrator crashes mid-task	Checkpoint to durable store; restart resumes
Provider outage	LLM API 5xx or rate-limit	Multi-provider fallback; exponential backoff

10. Evaluation Strategy

Task success rate: Target ≥95% completion with acceptable quality
Quality delta: Multi-agent vs. single-agent on identical tasks
Cost efficiency: Quality gain per dollar — must justify the multiplier
Coordination tax: % tokens spent on orchestration vs. actual work
Latency percentiles: p50/p95/p99 by orchestration pattern
A/B testing: Identical tasks routed to single vs. multi-agent; compare quality and cost
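Two of these metrics reduce to simple ratios. The sketch below uses invented token counts, quality scores, and costs purely to show the arithmetic; how quality is scored is outside its scope.

```python
def coordination_tax(orchestration_tokens: int, work_tokens: int) -> float:
    """Fraction of all tokens spent on orchestration rather than actual work."""
    return orchestration_tokens / (orchestration_tokens + work_tokens)


def quality_gain_per_dollar(q_multi: float, q_single: float,
                            cost_multi: float, cost_single: float) -> float:
    """Quality points gained per extra dollar spent on multi-agent."""
    extra_cost = cost_multi - cost_single
    if extra_cost <= 0:
        raise ValueError("multi-agent must cost more for this ratio to be meaningful")
    return (q_multi - q_single) / extra_cost


tax = coordination_tax(orchestration_tokens=6_000, work_tokens=24_000)
gain = quality_gain_per_dollar(q_multi=0.88, q_single=0.80, cost_multi=0.60, cost_single=0.10)
print(f"coordination tax: {tax:.0%}")
print(f"quality gain per extra dollar: {gain:.2f}")
```

With these made-up inputs the tax is 20% and the gain is about 0.16 quality points per extra dollar.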

11. Safety & Security

Agent sandboxing: Minimal tool permissions; code execution in isolated containers
Injection isolation: User input sanitized before entering inter-agent messages
Output filtering: Content safety at aggregation step before user-facing response
Audit trail: Every action, tool call, and message logged immutably
Secret scoping: Credentials scoped per agent type (research agent cannot access code-exec creds)
HITL gates: High-impact actions require human approval

12. Cost & Latency Optimization

Technique	Cost	Latency	Tradeoff
Model tiering (small for routing, large for reasoning)	−40–60%	−20%	Routing accuracy may decrease
Prompt caching for system prompts	−20–50%	−10%	Cache invalidation
Parallel fan-out vs. sequential	Neutral	−50–70%	Harder reconciliation
Early termination on high confidence	−15–30%	−20%	May miss edge cases
Context compression between agents	−30–40%	−5%	Information loss risk
Single-agent fast-path for simple tasks	−60–80%	−70%	Needs accurate complexity classifier

13. Observability

Multi-agent requires distributed tracing — microservices discipline applied to LLM calls.

Trace ID: Unique per orchestration, propagated to all agent calls
Span per agent: Model, token count, latency, tool calls, result summary
Parent-child spans: Orchestrator → worker spans mirror the dependency DAG
Dashboards: Token spend by agent type, latency by pattern, failure rate by mode
Alerts: Budget ≥80%, latency ≥2× p95, loop detection triggered
Replay: Full per-agent conversation replay for debugging

14. Tradeoffs

Decision	Chose	Rejected	Why
Communication	Structured messages	Full transcripts	Transcripts cause bloat and cross-pollution
Topology	Hierarchical orchestrator	Peer-to-peer mesh	Mesh lacks authority for budgets, deadlock detection, reconciliation
State	Blackboard + ownership	Fully shared mutable	Ownership prevents write conflicts at scale
Termination	External caps + timeouts	Agent self-completion	Agents can’t reliably judge own completion
Models	Heterogeneous tiering	Uniform frontier model	Routing doesn’t need frontier; saves 40–60%
Disagreement	Synthesizer + escalation	Always majority vote	Voting fails on nuanced tasks

15. Rollout & Ops

Shadow mode: Multi-agent runs in parallel with single-agent; compare results without serving them
Canary: 5% of complex tasks → multi-agent; monitor cost/quality delta
Feature flags: Per-pattern toggles (enable fan-out, disable debate)
Cost guardrails: Per-user/org daily limits; auto-downgrade to single-agent
Runbooks: Procedures for cost spike, stuck orchestration, cascading failure

16. Phasing v0 → vN

Phase	Scope	Milestone
v0 · Baseline	Single agent with tools; establish quality/cost baselines	Metrics for comparison
v1 · Pipeline	Two-agent draft → review; structured messages	Measurable quality lift over v0
v2 · Fan-out	Orchestrator + 3–5 parallel workers; aggregation	Latency reduction for parallel tasks
v3 · Full orchestration	Dynamic decomposition, skill routing, blackboard, debate	Cost-per-quality <2× single-agent
v4 · Self-optimizing	Orchestrator learns best patterns per task type	Adaptive routing outperforms static

17. Honest Uncertainty

Optimal agent count: No theory gives the right N — discovered empirically via A/B testing
Emergent behavior: Multi-agent can exhibit groupthink and sycophancy loops that are hard to anticipate
Quality ceiling: Whether multi-agent consistently beats a single agent with better prompting remains task-dependent and unresolved
Coordination scaling: As agent count grows, coordination cost may grow super-linearly — exact curve is pattern-dependent

18. Curveball Follow-Ups

“When would you NOT use multi-agent?”

When the task fits one context window, needs no specialized tools, and a single agent hits acceptable quality. Also when the latency budget is ≤5s, the cost budget is tight, or decomposition overhead exceeds the parallelism benefit. Most production LLM tasks today are better served by one agent with good tools.

“Two agents disagree. How do you resolve it?”

Three tiers:

(1) Factual → ground truth via tool call (search, calculator).
(2) Judgment call → synthesizer agent reconciles with explicit reasoning.
(3) Low confidence → escalate to human with both positions. Never silently pick one; log every disagreement.

“One worker hangs — what happens?”

Per-agent wall-clock timeout fires:

(1) Kill agent, mark sub-task failed.
(2) If critical-path, retry with fresh agent (max 2).
(3) If retries are exhausted, return partial results from completed workers with a quality disclaimer.
(4) Orchestration-level timeout is the ultimate backstop, so the task never blocks indefinitely.
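The timeout, retry, and partial-result steps can be sketched with asyncio. This is an illustration under assumptions: the function names are hypothetical, every sub-task is retried (the sketch does not distinguish critical-path work), and a `None` outcome is used to mean "failed".

```python
import asyncio


async def run_with_retry(make_call, timeout_s: float, max_retries: int = 2):
    """Run one sub-task; on timeout, cancel it and retry with a fresh call.

    Returns the result, or None once the retries are exhausted.
    """
    for _ in range(1 + max_retries):
        try:
            # wait_for cancels the underlying call when the timeout fires
            return await asyncio.wait_for(make_call(), timeout=timeout_s)
        except asyncio.TimeoutError:
            continue
    return None


async def orchestrate(workers: dict, timeout_s: float) -> dict:
    """Run all workers concurrently and report partial results on failure."""
    names = list(workers)
    outcomes = await asyncio.gather(
        *(run_with_retry(workers[n], timeout_s) for n in names)
    )
    completed = {n: r for n, r in zip(names, outcomes) if r is not None}
    failed = [n for n, r in zip(names, outcomes) if r is None]
    return {"results": completed, "failed": failed, "partial": bool(failed)}


async def fast_worker() -> str:
    await asyncio.sleep(0.01)
    return "summary ready"


async def hung_worker() -> str:
    await asyncio.sleep(60)  # simulates a worker that never returns in time
    return "unreachable"


report = asyncio.run(
    orchestrate({"research": fast_worker, "code": hung_worker}, timeout_s=0.05)
)
print(report)
```

Wrapping the whole `orchestrate` call in one more `asyncio.wait_for` would give the orchestration-level backstop from step (4).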

“How do agents share state without stepping on each other?”

Blackboard with ownership: each section has one designated writer, many readers. Optimistic locking via version numbers detects conflicts. For truly shared sections (running summary), the orchestrator is sole writer, aggregating worker inputs.

19. Level-Up Ladder

Level	Demonstrates
Junior	Describes agents calling LLMs; ignores single-vs-multi decision; no cost awareness
Mid	Identifies patterns (sequential, parallel); addresses timeouts and retries; mentions cost
Senior	Explicit single-vs-multi criteria; context isolation; structured comms; termination guarantees; disagreement reconciliation
Staff	Designs complexity classifier for routing; models cost-quality quantitatively; distributed tracing; self-optimizing orchestration; addresses hallucination poisoning

Changelog: S04 — Multi-Agent Orchestration System · Added May 2026. Covers orchestration patterns, context isolation, termination guarantees, cost modeling, and the critical decision of when multi-agent is justified.
