Microsoft
S04
Medium

Design a Multi-Agent Orchestration System

Design a system where multiple specialized AI agents work together to complete a complex task — one researches, another drafts, a third critiques, and a fourth finalizes.

Multi-agent · Orchestration · Context Isolation · Coordination · Termination

Key Requirements

  • Share context between agents without blowing up token costs
  • Handle a failing or stuck agent gracefully
  • Decide when multi-agent is worth it vs. a single agent
  • Trace and debug across the full agent pipeline
  • Prevent the coordinator from hallucinating tasks

Interviewer Follow-ups

  • Q1: How do you debug a 5-agent pipeline when output is wrong?
  • Q2: What happens when the coordinator hallucinates a subtask?
  • Q3: How do you handle partial failure without cascading?

Microsoft · FDE

The real test: can you decide WHEN multi-agent is even worth it, then coordinate without deadlock, context fragmentation, or cost explosion?

1. TL;DR

Decompose complex tasks across specialized LLM agents, orchestrate execution, and reconcile outputs. The core challenge is knowing when multi-agent beats a single agent, then coordinating without deadlock, context fragmentation, or cost explosion.

The Trap: Juniors say “spin up agents for everything” without justifying when a single well-prompted agent is better. Multi-agent adds coordination overhead, N× token cost, and new failure modes. You must earn that complexity.

[Figure: Multi-Agent Orchestration System architecture]

2. Clarifying Questions

Question | Why | Design Fork
--- | --- | ---
Which tasks need multi-agent vs. single? | Not all tasks benefit from decomposition | Simple Q&A → single; research+code+review → multi
Homogeneous or specialized agents? | Determines routing & prompt design | Homogeneous → fan-out; specialized → skill routing
Latency budget? | Sequential pipelines multiply latency | ≤5s → parallel-only; ≤60s → pipelines OK
User-facing or batch? | Concurrency & cost tolerance differ | Interactive → fewer agents, caching; batch → throughput
Which LLM providers? | Heterogeneous models change cost mix | Single → uniform quotas; multi → routing layer
Cost ceiling per task? | Multi-agent multiplies spend | Tight → fewer agents, smaller models; generous → full orchestration

3. Requirements

Functional

  • Auto-decompose complex tasks into sub-tasks with dependency DAG
  • Route sub-tasks to specialist agents via skill matching
  • Support sequential, parallel, hierarchical, and debate orchestration patterns
  • Aggregate and reconcile results — including handling worker disagreement
  • Shared blackboard state with ownership semantics
  • Enforce termination criteria and budget caps per task

Non-Functional

  • Latency: ≤30s interactive, ≤5min deep research
  • Cost: ≤3× single-agent (target 1.5–2×)
  • Reliability: 99.5% completion; no infinite loops or deadlocks
  • Concurrency: 500+ simultaneous orchestrations
  • Isolation: Zero cross-task information leakage
  • Observability: Distributed trace for every agent interaction

4. Back-of-Envelope Estimation

Metric | Single Agent | Multi-Agent (3–5) | Notes
--- | --- | --- | ---
Tokens/task | ~4K | ~20–60K | Multi-agent research ≈ 15× tokens of one chat turn
LLM calls/task | 1–2 | 5–15 | Orchestrator + workers + aggregation
Coordination overhead | 0% | 15–30% | Routing, message passing, reconciliation
Latency (sequential) | 2–5s | 10–45s | Each hop adds model latency
Latency (parallel) | 2–5s | 5–10s | Bounded by slowest worker
Cost/task (frontier model) | $0.05–0.15 | $0.30–1.50 | N agents × per-agent cost
Monthly at 100K tasks/day | ~$300K | ~$0.9–4.5M | Cost explosion = #1 ops risk

Key insight: Multi-agent multiplies token cost by ~N. Budget 15–30% extra tokens just for routing, context summaries, and reconciliation messages.
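
The monthly row is plain arithmetic on the per-task cost row. A minimal sketch, assuming a 30-day month (an assumption; the table does not state one). Under that assumption the multi-agent range works out to roughly $0.9M to $4.5M, and the single-agent range to $150K to $450K, with ~$300K as its midpoint.

```python
# Back-of-envelope monthly spend from the per-task cost ranges in the table.
TASKS_PER_DAY = 100_000
DAYS_PER_MONTH = 30  # assumption: the table does not say which month length it uses


def monthly_cost(cost_per_task: float) -> float:
    """Monthly spend in dollars for a given per-task cost."""
    return TASKS_PER_DAY * DAYS_PER_MONTH * cost_per_task


single_low, single_high = monthly_cost(0.05), monthly_cost(0.15)
multi_low, multi_high = monthly_cost(0.30), monthly_cost(1.50)

print(f"single-agent: ${single_low:,.0f} to ${single_high:,.0f}")
print(f"multi-agent:  ${multi_low:,.0f} to ${multi_high:,.0f}")
```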

5. Architecture

Hierarchical orchestrator → worker pattern with shared blackboard for state.

Request Lifecycle

  1. Intake: Task submitted → API gateway validates and enqueues
  2. Decision gate: Complexity classifier routes single vs. multi-agent
  3. Decomposition: Orchestrator breaks task into sub-task dependency DAG
  4. Assignment: Skill router matches sub-tasks to specialist agents
  5. Execution: Workers run in parallel/sequential per dependency graph
  6. Aggregation: Orchestrator reconciles and synthesizes results
  7. Termination: Validates done-criteria; loops or finalizes
  8. Response: Synthesized result returned to user

6. Core Components Deep Dives

A. The Decision: When Multi-Agent Pays Off

Multi-agent earns its complexity only when three conditions hold:

  • Parallelizable: Sub-tasks run concurrently with minimal dependencies
  • Separable contexts: Sub-tasks need different context windows (code vs. docs vs. data)
  • Genuine specialization: Sub-tasks benefit from different prompts, tools, or models

Criterion | Single Agent Wins | Multi-Agent Wins
--- | --- | ---
Task complexity | Linear, single-domain | Cross-domain, different expertise
Context window | Fits in one window | Sub-tasks need separate large contexts
Latency tolerance | Low: fast response | Higher: parallel overhead OK
Quality bar | Good-enough, one pass | Needs verification or debate
Tool access | All tools to one agent | Sandboxed tools per agent
Cost sensitivity | Tight budget | Quality justifies 2–5× cost

Decision rule: If a single well-prompted agent with tools solves it in one pass with acceptable quality, do that. Multi-agent is justified for parallel expertise, context isolation, or adversarial verification.
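
The decision gate can be written down as a small predicate. This is a sketch only: `TaskProfile` and its fields are hypothetical names for whatever a complexity classifier would emit, and the strict "all three conditions" reading is one interpretation of the criteria above.

```python
from dataclasses import dataclass


@dataclass
class TaskProfile:
    """Hypothetical classifier output; field names are illustrative."""
    parallelizable: bool        # sub-tasks can run concurrently
    separable_contexts: bool    # sub-tasks need different context windows
    needs_specialization: bool  # different prompts, tools, or models help
    single_pass_ok: bool        # one well-prompted agent is good enough


def use_multi_agent(p: TaskProfile) -> bool:
    # Single-agent fast path wins whenever one pass gives acceptable quality.
    if p.single_pass_ok:
        return False
    # Otherwise require all three conditions before paying for coordination.
    return p.parallelizable and p.separable_contexts and p.needs_specialization


simple_qa = TaskProfile(False, False, False, single_pass_ok=True)
research_code_review = TaskProfile(True, True, True, single_pass_ok=False)
print(use_multi_agent(simple_qa), use_multi_agent(research_code_review))
```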

B. Orchestration Patterns

Pattern | Mechanism | Best For | Risk
--- | --- | --- | ---
Hierarchical | Orchestrator delegates; workers may sub-delegate | Complex multi-step task trees | Deep nesting → latency
Sequential Pipeline | A → B → C, each refining prior output | Draft → review → polish | Error propagation
Parallel Fan-Out/Gather | Fan out N tasks, merge results | Independent research, map-reduce | Reconciliation complexity
Debate / Critic | Two agents argue; judge synthesizes | High-stakes, adversarial verification | 3× cost, gridlock
Blackboard | Agents read/write shared state reactively | Iterative refinement | Race conditions
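
As one concrete instance, parallel fan-out/gather maps naturally onto `asyncio.gather`. The sketch below uses a stand-in `worker` coroutine (hypothetical; a real worker would call a model) and shows one way to keep a failed worker from sinking the whole gather.

```python
import asyncio


async def worker(name: str, subtask: str) -> dict:
    """Stand-in for an LLM-backed worker; a real one would call a model here."""
    await asyncio.sleep(0.01)  # simulated model latency
    if subtask == "":
        raise ValueError(f"{name} received an empty sub-task")
    return {"agent": name, "result": f"{name} finished: {subtask}"}


async def fan_out_gather(subtasks: dict) -> dict:
    """Run independent sub-tasks concurrently and separate successes from failures."""
    names = list(subtasks)
    settled = await asyncio.gather(
        *(worker(n, subtasks[n]) for n in names), return_exceptions=True
    )
    ok = [r for r in settled if not isinstance(r, Exception)]
    failed = [n for n, r in zip(names, settled) if isinstance(r, Exception)]
    return {"results": ok, "failed": failed}


outcome = asyncio.run(fan_out_gather({"researcher": "find sources", "coder": ""}))
print(outcome)
```

Total latency here is bounded by the slowest worker rather than the sum of all workers, which is the point of the pattern; the price is the reconciliation step that has to merge `results`.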

C. Communication Protocol

Agents communicate via structured messages, not raw transcripts.

  • Envelope: { from, to, type, task_id, payload, token_budget }
  • Payload types: task assignment, result, clarification, status, escalation
  • Context summaries: Compressed (≤500 tokens) — never full conversation history
Why NOT full transcripts: Passing Agent A’s 8K history to Agent B wastes context, increases cost, and cross-pollinates irrelevant reasoning. Structured summaries preserve signal, discard noise.
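
A sketch of the envelope as a typed structure. Assumptions: the `context_summary` payload key, the whitespace-based token estimate, and the `sender` field name (used because `from` is a Python keyword) are all illustrative and not part of the protocol described above.

```python
import json
from dataclasses import dataclass
from enum import Enum


class MsgType(str, Enum):
    TASK_ASSIGNMENT = "task_assignment"
    RESULT = "result"
    CLARIFICATION = "clarification"
    STATUS = "status"
    ESCALATION = "escalation"


MAX_SUMMARY_TOKENS = 500  # cap on context summaries from the list above


def rough_token_count(text: str) -> int:
    """Crude whitespace estimate; a real system would use the model's tokenizer."""
    return len(text.split())


@dataclass
class Envelope:
    sender: str  # serialized as "from"
    to: str
    type: MsgType
    task_id: str
    payload: dict
    token_budget: int

    def __post_init__(self) -> None:
        # Reject oversized summaries at construction time, before they are sent.
        summary = self.payload.get("context_summary", "")  # hypothetical key
        if rough_token_count(summary) > MAX_SUMMARY_TOKENS:
            raise ValueError("context summary exceeds the 500-token cap")

    def to_json(self) -> str:
        return json.dumps({
            "from": self.sender, "to": self.to, "type": self.type.value,
            "task_id": self.task_id, "payload": self.payload,
            "token_budget": self.token_budget,
        })


msg = Envelope("orchestrator", "researcher", MsgType.TASK_ASSIGNMENT, "t-1",
               {"context_summary": "find three primary sources"}, token_budget=2000)
print(msg.to_json())
```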

D. Task Decomposition & Assignment

  • DAG output: Orchestrator decomposes task into sub-tasks with dependency edges and complexity estimates
  • Skill tags: Sub-tasks tagged with capabilities (code_gen, web_search, math) and matched to agent profiles
  • Load balancing: Among agents sharing a skill, route to lowest queue depth
  • Budget split: Total token budget allocated proportional to estimated sub-task complexity
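
The DAG ordering and the proportional budget split can be sketched with the standard-library `graphlib` module (Python 3.9+). The sub-task names, complexity scores, and the tuple layout are made up for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical decomposition: sub-task -> (dependencies, complexity estimate, skill tag)
subtasks = {
    "research": (set(), 3, "web_search"),
    "draft": ({"research"}, 5, "code_gen"),
    "verify": ({"draft"}, 2, "math"),
}


def execution_order(tasks: dict) -> list:
    """Topologically sort sub-tasks; raises CycleError if dependencies form a cycle."""
    graph = {name: deps for name, (deps, _, _) in tasks.items()}
    return list(TopologicalSorter(graph).static_order())


def split_budget(tasks: dict, total_tokens: int) -> dict:
    """Allocate the token budget in proportion to estimated complexity."""
    total_complexity = sum(c for _, c, _ in tasks.values())
    return {name: total_tokens * c // total_complexity
            for name, (_, c, _) in tasks.items()}


order = execution_order(subtasks)
budgets = split_budget(subtasks, total_tokens=10_000)
print(order, budgets)
```

Running the sort at decomposition time doubles as cycle validation: a circular dependency fails fast with `CycleError` instead of deadlocking at execution time.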

E. Context Isolation

Each agent gets its own context window — a feature, not a limitation.

  • Agents receive only sub-task description + relevant context, not global state
  • Cross-agent messages use explicit schemas — prevents context fragmentation (pollution from irrelevant details) and context bloat (accumulated chatter exhausting the window)
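
In code, isolation amounts to building each worker's input from an allow-list rather than handing over global state. A minimal sketch; the `needs` field on the sub-task is a hypothetical convention.

```python
def build_worker_context(subtask: dict, global_state: dict) -> dict:
    """Return only the description plus the state sections this sub-task declared."""
    needed = subtask.get("needs", [])  # hypothetical allow-list of section names
    return {
        "description": subtask["description"],
        "context": {k: global_state[k] for k in needed if k in global_state},
    }


state = {"research_findings": ["src-a"], "code_artifacts": ["parser.py"], "decisions": []}
ctx = build_worker_context({"description": "review code", "needs": ["code_artifacts"]}, state)
print(ctx)
```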

F. Result Aggregation & Reconciliation

Strategy | Mechanism | When to Use
--- | --- | ---
Merge | Deduplicate and synthesize agreeing results | Workers agree on outputs
Majority vote | N agents vote; majority wins | Factual questions with clear answers
Synthesis | Dedicated synthesizer reconciles conflicts | Creative or nuanced tasks
Escalation | Flag for human review | Low confidence, safety-critical
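
Majority vote with an escalation fallback can be sketched in a few lines. The strict-majority threshold and the result dictionary shape are assumptions chosen for illustration.

```python
from collections import Counter


def reconcile(answers: list, min_agreement: float = 0.5) -> dict:
    """Majority vote; escalate when no answer gets more than `min_agreement` of votes."""
    if not answers:
        return {"status": "escalate", "reason": "no answers"}
    winner, votes = Counter(answers).most_common(1)[0]
    share = votes / len(answers)
    if share > min_agreement:
        return {"status": "accepted", "answer": winner, "agreement": share}
    # No clear majority: hand every candidate to a synthesizer or a human.
    return {"status": "escalate", "reason": "no majority",
            "candidates": sorted(set(answers))}


print(reconcile(["42", "42", "41"]))
print(reconcile(["yes", "no"]))
```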

G. Shared State & Ownership

  • Blackboard: Partitioned into named sections (research_findings, code_artifacts, decisions)
  • Write ownership: Only the assigned agent writes its section; others read-only
  • Optimistic locking: Version numbers on writes; orchestrator resolves conflicts
  • TTL: Entries expire after task completion — no stale state leakage
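
Write ownership and optimistic locking fit in a small class. This is an in-memory sketch (no TTL, no persistence); the class and exception names are illustrative.

```python
class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


class Blackboard:
    """In-memory sketch of a section-partitioned blackboard."""

    def __init__(self, owners: dict):
        self._owners = owners  # section name -> the only agent allowed to write it
        self._data = {}        # section name -> (version, value)

    def read(self, section: str) -> tuple:
        """Anyone may read; unwritten sections report version 0."""
        return self._data.get(section, (0, None))

    def write(self, agent: str, section: str, value, expected_version: int) -> int:
        if self._owners.get(section) != agent:
            raise PermissionError(f"{agent} does not own {section}")
        current_version, _ = self.read(section)
        if current_version != expected_version:
            raise VersionConflict(
                f"expected v{expected_version}, found v{current_version}")
        self._data[section] = (current_version + 1, value)
        return current_version + 1


bb = Blackboard({"research_findings": "researcher"})
new_version = bb.write("researcher", "research_findings", ["src-a"], expected_version=0)
print(new_version, bb.read("research_findings"))
```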

H. Termination

Never trust agents to self-stop. Enforce externally:

  • Done-criteria: Measurable conditions (“all sub-tasks returned”, “tests pass”)
  • Token budget cap: Hard ceiling; agent killed on exceed
  • Wall-clock timeout: Per sub-task and per orchestration
  • Loop detection: Same message ≥3 times → force-terminate
  • Max iterations: Cap orchestrator↔worker round-trips (≤5)
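
These checks can live in one guard object that the orchestrator consults after every round-trip. A sketch under assumptions: exact string matching stands in for loop detection, and the class name and stop-reason strings are hypothetical.

```python
import time
from collections import Counter
from typing import Optional


class TerminationGuard:
    """External stop conditions, checked by the orchestrator, never by the agent."""

    def __init__(self, token_cap: int, timeout_s: float,
                 max_iterations: int = 5, max_repeats: int = 3):
        self.token_cap = token_cap
        self.deadline = time.monotonic() + timeout_s
        self.max_iterations = max_iterations
        self.max_repeats = max_repeats
        self.tokens_used = 0
        self.iterations = 0
        self.seen = Counter()

    def check(self, message: str, tokens: int) -> Optional[str]:
        """Record one round-trip; return a stop reason, or None to continue."""
        self.tokens_used += tokens
        self.iterations += 1
        self.seen[message] += 1
        if self.tokens_used > self.token_cap:
            return "token_budget_exceeded"
        if time.monotonic() > self.deadline:
            return "wall_clock_timeout"
        if self.seen[message] >= self.max_repeats:
            return "loop_detected"
        if self.iterations >= self.max_iterations:
            return "max_iterations"
        return None


guard = TerminationGuard(token_cap=1000, timeout_s=60)
reasons = [guard.check("please clarify", tokens=100) for _ in range(3)]
print(reasons)
```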

7. Data & State

  • Task store (PostgreSQL): Definitions, status, dependency graphs, results
  • Blackboard (Redis): Ephemeral shared state with TTL, partitioned by task ID
  • Message bus (Kafka/NATS): Inter-agent messages with guaranteed delivery
  • Trace store (Jaeger/Tempo): OpenTelemetry distributed traces per orchestration
  • Artifact store (S3): Large outputs — code, documents, images

8. Scaling

  • Horizontal pools: Stateless agent containers; scale on queue depth
  • Orchestrator sharding: Partition by task ID across instances
  • Provider load balancing: Round-robin across API keys and providers to stay under rate limits
  • Priority queues: Interactive over batch; prevents head-of-line blocking
  • Backpressure: Reject with retry-after when pools saturate
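
Priority queues and backpressure combine into one bounded structure. A sketch using `heapq`; the capacity, the two priority levels, and the 5-second retry hint are illustrative values.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower number is served first


class BoundedPriorityQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def submit(self, priority: int, task_id: str) -> dict:
        # Backpressure: reject with a retry hint instead of queueing without bound.
        if len(self._heap) >= self.capacity:
            return {"accepted": False, "retry_after_s": 5}
        heapq.heappush(self._heap, (priority, next(self._seq), task_id))
        return {"accepted": True}

    def next_task(self):
        """Pop the highest-priority task id, or None when the queue is empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


q = BoundedPriorityQueue(capacity=2)
q.submit(BATCH, "batch-1")
q.submit(INTERACTIVE, "chat-1")
rejected = q.submit(INTERACTIVE, "chat-2")
print(rejected, q.next_task())
```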

9. Failure Modes

Failure | Cause | Mitigation
--- | --- | ---
Cascading failures | One agent fails; dependents cascade | Circuit breakers; partial-result fallback
Infinite handoff loops | A routes to B, B routes back to A | Max-hop counter; orchestrator loop detection
Context fragmentation | Irrelevant cross-agent context pollutes reasoning | Structured summaries only; no raw transcripts
Duplicated work | Overlapping sub-task assignments | Dedup at decomposition; idempotency keys
Coordination deadlock | Circular wait: A on B, B on A | DAG validation (no cycles); timeout breakers
Cost explosion | Runaway agents burning tokens in loops | Per-task/per-agent budgets; hard kill
Hallucination poisoning | One agent hallucinates; others consume as fact | Cross-validation; source citation required
Stale blackboard | Agent reads outdated shared state | Version vectors; read-after-write consistency
Orchestrator SPOF | Orchestrator crashes mid-task | Checkpoint to durable store; restart resumes
Provider outage | LLM API 5xx or rate-limit | Multi-provider fallback; exponential backoff
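
For the provider-outage row, fallback and backoff compose as two nested loops. A sketch: `ProviderError` and the provider callables are hypothetical stand-ins for real client calls, and `sleep` is injectable so the logic can be exercised without waiting.

```python
import time


class ProviderError(Exception):
    """Stand-in for a 5xx or rate-limit error from an LLM API."""


def call_with_fallback(providers, prompt, max_attempts=3, base_delay_s=1.0,
                       sleep=time.sleep):
    """Try providers in order; retry each with exponential backoff before moving on."""
    last_error = None
    for call in providers:
        for attempt in range(max_attempts):
            try:
                return call(prompt)
            except ProviderError as exc:
                last_error = exc
                if attempt < max_attempts - 1:
                    sleep(base_delay_s * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ProviderError("all providers failed") from last_error


def primary(prompt):  # hypothetical provider that is currently down
    raise ProviderError("503")


def secondary(prompt):  # hypothetical healthy fallback provider
    return f"echo: {prompt}"


delays = []
answer = call_with_fallback([primary, secondary], "hi", sleep=delays.append)
print(answer, delays)
```

Production versions usually add jitter to the delay so that many orchestrations do not retry in lockstep; that is omitted here for brevity.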

10. Evaluation Strategy

  • Task success rate: Target ≥95% completion with acceptable quality
  • Quality delta: Multi-agent vs. single-agent on identical tasks
  • Cost efficiency: Quality gain per dollar — must justify the multiplier
  • Coordination tax: % tokens spent on orchestration vs. actual work
  • Latency percentiles: p50/p95/p99 by orchestration pattern
  • A/B testing: Identical tasks routed to single vs. multi-agent; compare quality and cost
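
Two of these metrics reduce to short formulas. A sketch, assuming a token log tagged by `kind` and a quality score on any consistent scale; both the log shape and the function names are hypothetical.

```python
def coordination_tax(token_log: list) -> float:
    """Fraction of all tokens spent on orchestration rather than worker output."""
    total = sum(entry["tokens"] for entry in token_log)
    overhead = sum(entry["tokens"] for entry in token_log
                   if entry["kind"] == "orchestration")
    return overhead / total if total else 0.0


def quality_gain_per_dollar(q_multi: float, q_single: float,
                            cost_multi: float, cost_single: float) -> float:
    """Quality lift bought by each extra dollar of multi-agent spend."""
    extra_cost = cost_multi - cost_single
    if extra_cost <= 0:
        raise ValueError("multi-agent run must cost more than the single-agent run")
    return (q_multi - q_single) / extra_cost


log = [
    {"kind": "orchestration", "tokens": 6_000},
    {"kind": "work", "tokens": 24_000},
]
tax = coordination_tax(log)
gain = quality_gain_per_dollar(q_multi=0.9, q_single=0.8, cost_multi=0.6, cost_single=0.1)
print(tax, gain)
```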

11. Safety & Security

  • Agent sandboxing: Minimal tool permissions; code execution in isolated containers
  • Injection isolation: User input sanitized before entering inter-agent messages
  • Output filtering: Content safety at aggregation step before user-facing response
  • Audit trail: Every action, tool call, and message logged immutably
  • Secret scoping: Credentials scoped per agent type (research agent cannot access code-exec creds)
  • HITL gates: High-impact actions require human approval

12. Cost & Latency Optimization

Technique | Cost | Latency | Tradeoff
--- | --- | --- | ---
Model tiering (small for routing, large for reasoning) | −40–60% | −20% | Routing accuracy may decrease
Prompt caching for system prompts | −20–50% | −10% | Cache invalidation
Parallel fan-out vs. sequential | Neutral | −50–70% | Harder reconciliation
Early termination on high confidence | −15–30% | −20% | May miss edge cases
Context compression between agents | −30–40% | −5% | Information loss risk
Single-agent fast-path for simple tasks | −60–80% | −70% | Needs accurate complexity classifier

13. Observability

Multi-agent requires distributed tracing — microservices discipline applied to LLM calls.

  • Trace ID: Unique per orchestration, propagated to all agent calls
  • Span per agent: Model, token count, latency, tool calls, result summary
  • Parent-child spans: Orchestrator → worker spans mirror the dependency DAG
  • Dashboards: Token spend by agent type, latency by pattern, failure rate by mode
  • Alerts: Budget ≥80%, latency ≥2× p95, loop detection triggered
  • Replay: Full per-agent conversation replay for debugging
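
The trace-ID propagation and parent-child structure can be illustrated without any tracing library. The `Span` class below is a hand-rolled stand-in, not the OpenTelemetry API; a real deployment would use an OpenTelemetry SDK instead.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Hand-rolled span for illustration only."""
    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: dict = field(default_factory=dict)
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def child(self, name: str, **attributes) -> "Span":
        """Child span inherits the trace ID and points back at its parent."""
        return Span(trace_id=self.trace_id, name=name,
                    parent_id=self.span_id, attributes=attributes)

    def finish(self) -> None:
        self.end = time.monotonic()


root = Span(trace_id=uuid.uuid4().hex, name="orchestrator")
worker_span = root.child("research_worker", model="small-model", tokens=1200)
worker_span.finish()
root.finish()
print(worker_span.trace_id == root.trace_id, worker_span.parent_id == root.span_id)
```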

14. Tradeoffs

Decision | Chose | Rejected | Why
--- | --- | --- | ---
Communication | Structured messages | Full transcripts | Transcripts cause bloat and cross-pollution
Topology | Hierarchical orchestrator | Peer-to-peer mesh | Mesh lacks authority for budgets, deadlock detection, reconciliation
State | Blackboard + ownership | Fully shared mutable | Ownership prevents write conflicts at scale
Termination | External caps + timeouts | Agent self-completion | Agents can’t reliably judge own completion
Models | Heterogeneous tiering | Uniform frontier model | Routing doesn’t need frontier; saves 40–60%
Disagreement | Synthesizer + escalation | Always majority vote | Voting fails on nuanced tasks

15. Rollout & Ops

  • Shadow mode: Multi-agent runs parallel with single-agent; compare without serving
  • Canary: 5% of complex tasks → multi-agent; monitor cost/quality delta
  • Feature flags: Per-pattern toggles (enable fan-out, disable debate)
  • Cost guardrails: Per-user/org daily limits; auto-downgrade to single-agent
  • Runbooks: Procedures for cost spike, stuck orchestration, cascading failure

16. Phasing v0 → vN

Phase | Scope | Milestone
--- | --- | ---
v0 · Baseline | Single agent with tools; establish quality/cost baselines | Metrics for comparison
v1 · Pipeline | Two-agent draft → review; structured messages | Measurable quality lift over v0
v2 · Fan-out | Orchestrator + 3–5 parallel workers; aggregation | Latency reduction for parallel tasks
v3 · Full orchestration | Dynamic decomposition, skill routing, blackboard, debate | Cost-per-quality <2× single-agent
v4 · Self-optimizing | Orchestrator learns best patterns per task type | Adaptive routing outperforms static

17. Honest Uncertainty

  • Optimal agent count: No theory gives the right N — discovered empirically via A/B testing
  • Emergent behavior: Multi-agent can exhibit groupthink and sycophancy loops that are hard to anticipate
  • Quality ceiling: Whether multi-agent consistently beats a single agent with better prompting remains task-dependent and unresolved
  • Coordination scaling: As agent count grows, coordination cost may grow super-linearly — exact curve is pattern-dependent

18. Curveball Follow-Ups

“When would you NOT use multi-agent?”

When the task fits one context window, needs no specialized tools, and a single agent hits acceptable quality. Also when the latency budget is ≤5s, the cost budget is tight, or decomposition overhead exceeds the parallelism benefit. Most production LLM tasks today are better served by one agent with good tools.

“Two agents disagree: how do you resolve it?”

Three tiers:

  • (1) Factual → ground-truth via tool call (search, calculator).
  • (2) Judgment call → synthesizer agent reconciles with explicit reasoning.
  • (3) Low confidence → escalate to human with both positions. Never silently pick one; log every disagreement.

“One worker hangs — what happens?”

Per-agent wall-clock timeout fires:

  • (1) Kill agent, mark sub-task failed.
  • (2) If critical-path, retry with fresh agent (max 2).
  • (3) If retries exhaust, return partial results from completed workers with quality disclaimer.
  • (4) Orchestration-level timeout is the ultimate backstop, so the task never blocks indefinitely.
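
The four steps can be sketched with `asyncio.wait_for`, which cancels the awaited call when the timeout fires. Assumptions: workers are zero-argument coroutine functions, every sub-task is treated as critical-path, and `FAILED` is an illustrative sentinel.

```python
import asyncio

FAILED = object()  # sentinel so a legitimate None result is not mistaken for failure


async def run_with_retries(make_call, timeout_s: float, max_retries: int = 2):
    """Run a worker call; on timeout, retry with a fresh call up to max_retries times."""
    for _ in range(1 + max_retries):
        try:
            return await asyncio.wait_for(make_call(), timeout_s)
        except asyncio.TimeoutError:
            continue  # wait_for has already cancelled the hung call
    return FAILED


async def orchestrate(workers: dict, timeout_s: float) -> dict:
    """Gather all workers; report partial results when some exhaust their retries."""
    names = list(workers)
    outcomes = await asyncio.gather(
        *(run_with_retries(workers[n], timeout_s) for n in names))
    results = {n: r for n, r in zip(names, outcomes) if r is not FAILED}
    failed = [n for n, r in zip(names, outcomes) if r is FAILED]
    return {"results": results, "failed": failed, "partial": bool(failed)}


async def fast_worker():
    return "ok"


async def hung_worker():
    await asyncio.sleep(10)  # simulates a worker that never returns in time
    return "never"


report = asyncio.run(orchestrate({"fast": fast_worker, "hung": hung_worker}, timeout_s=0.05))
print(report)
```

An orchestration-level backstop would wrap the `orchestrate` call itself in another `asyncio.wait_for` with a larger timeout.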

“How do agents share state without stepping on each other?”

Blackboard with ownership: each section has one designated writer, many readers. Optimistic locking via version numbers detects conflicts. For truly shared sections (running summary), the orchestrator is sole writer, aggregating worker inputs.

19. Level-Up Ladder

Level | Demonstrates
--- | ---
Junior | Describes agents calling LLMs; ignores single-vs-multi decision; no cost awareness
Mid | Identifies patterns (sequential, parallel); addresses timeouts and retries; mentions cost
Senior | Explicit single-vs-multi criteria; context isolation; structured comms; termination guarantees; disagreement reconciliation
Staff | Designs complexity classifier for routing; models cost-quality quantitatively; distributed tracing; self-optimizing orchestration; addresses hallucination poisoning

Changelog: S04, Multi-Agent Orchestration System · Added May 2026. Covers orchestration patterns, context isolation, termination guarantees, cost modeling, and the critical decision of when multi-agent is justified.