Google
S11
Hard
Premium
Design an AI On-Call / Incident-Response Agent
Design an AI agent that helps on-call engineers respond to production incidents by investigating alerts, correlating logs and metrics, identifying the root cause, and suggesting fixes.
Agents, Tool Use, Observability, Reliability
Key Requirements
- Work under time pressure during active incidents
- Correlate signals across logs, metrics, and traces
- Avoid making things worse with dangerous remediations
- Explain reasoning clearly so engineers can verify
- Handle cascading failures across multiple services
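One possible way to approach the cascading-failure requirement is to narrow a burst of alerts down to the services that are unhealthy without any unhealthy dependency. The sketch below assumes a known service dependency map and a set of currently alerting services; the function name `root_candidates` and the example service names are hypothetical, not part of the original prompt.

```python
from typing import Dict, List, Set


def root_candidates(depends_on: Dict[str, List[str]], alerting: Set[str]) -> List[str]:
    """Return alerting services that have no alerting direct dependency.

    Assumption: depends_on maps each service to the services it calls.
    A service whose dependencies are all healthy is a plausible origin of
    the cascade; services with an alerting dependency are likely victims.
    """
    roots = []
    for svc in sorted(alerting):
        deps = depends_on.get(svc, [])
        if not any(dep in alerting for dep in deps):
            roots.append(svc)
    return roots


# Hypothetical topology: checkout calls payments and cart, both call db.
depends_on = {
    "checkout": ["payments", "cart"],
    "payments": ["db"],
    "cart": ["db"],
    "db": [],
}
alerting = {"checkout", "payments", "db"}
print(root_candidates(depends_on, alerting))  # ['db']
```

This heuristic only ranks where to look first. It returns nothing when the alerting services form a dependency cycle, and it relies on the dependency map being accurate, so the agent should present the result as a hypothesis for the engineer to verify.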
Interviewer Follow-ups
- Q1: How do you prevent the agent from executing a dangerous fix?
- Q2: How does it handle cascading failures across services?
- Q3: What if the agent's own infrastructure is down during the incident?
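For the first follow-up, one common answer is a policy gate between the agent's proposal and any execution. The sketch below is a minimal illustration under assumed risk tiers; `Risk`, `Action`, and `decide` are hypothetical names, and a real system would likely add blast-radius limits, audit logging, and rollback plans.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ_ONLY = 1     # e.g. querying logs or metrics
    REVERSIBLE = 2    # e.g. rolling back a deploy, scaling up
    DESTRUCTIVE = 3   # e.g. deleting data, failing over a database


@dataclass(frozen=True)
class Action:
    name: str
    risk: Risk
    rationale: str  # shown to the engineer so the reasoning can be verified


def decide(action: Action, human_approved: bool = False) -> str:
    """Return what the agent may do with a proposed action."""
    if action.risk is Risk.READ_ONLY:
        return "execute"
    if action.risk is Risk.REVERSIBLE:
        # Reversible changes run only after explicit human approval.
        return "execute" if human_approved else "needs_approval"
    # Destructive actions are never run by the agent, only suggested.
    return "suggest_only"


rollback = Action("rollback deploy", Risk.REVERSIBLE, "error rate rose after the last deploy")
print(decide(rollback))                       # needs_approval
print(decide(rollback, human_approved=True))  # execute
```

The design choice here is that approval can unlock a reversible action but can never unlock a destructive one, which keeps the worst-case outcome of an agent mistake bounded.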