Most organisations talk about operational resilience only when something breaks. When an incident happens, people join war rooms. Dashboards are shared, logs are pulled, and the same questions surface: what changed, what service depends on this, can we roll back safely, which failure mode was tested, and what customer journey is affected? At that point, resilience has already become expensive. The better lesson is: operational resilience starts before production.

Resilience is not the same as uptime
Resilience is a capability, not just an outcome like uptime. According to NIST’s cyber‑resiliency glossary, a system is cyber‑resilient when it can anticipate, withstand, recover from, and adapt to adverse conditions, stresses, attacks, or compromises (Ross et al., 2021). This definition shifts resilience from a monitoring problem to a lifecycle engineering problem. Modern systems depend on APIs, data stores, third‑party services, cloud platforms, and human workflows; failures rarely occur in isolation. Resilience must be designed in.
The incident call is too late for architecture
A mature incident response process is necessary, but it should not be the first time a team understands how a system behaves under stress. Google’s SRE engagement guidance notes that engaging SRE early in the design lifecycle makes services more reliable “out of the gate” because teams don’t have to unwind suboptimal designs later (Beyer et al., 2016). Production Readiness Reviews (PRRs) are intended to verify production readiness, improve reliability, and reduce the number and severity of incidents (Beyer et al., 2016). These are architectural concerns. Resilience‑aware design reviews should ask: what are the critical user journeys; how does the service degrade when a dependency is slow or unavailable; what is the blast radius of a bad deployment; what telemetry proves the system is healthy; and what is the rollback plan.
Testing strategy is resilience strategy
Many teams treat tests as a quality checkbox. But operational resilience requires testing beyond whether the code works. Google’s SRE testing guidance emphasises that testing is how you reduce uncertainty introduced by changes: each test that passes before and after a change reduces the uncertainty for which reliability analysis needs to account (Beyer et al., 2016). A resilient test strategy includes four layers:
- Functional correctness – unit, component and contract tests that verify the feature works.
- Integration realism – tests that validate interactions with real dependencies and contracts.
- Failure‑mode testing – injecting timeouts, dependency outages and other chaos experiments to confirm that retries, circuit breakers and fallback paths let the system degrade gracefully.
- Recovery testing – exercising rollbacks, restores, replays and runbook drills to ensure service can be safely recovered.
Testing is not just about correctness; it is about building confidence that the system can withstand disruption.
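As one illustration of the failure‑mode layer, the sketch below injects timeouts through a test double and checks that the caller retries a bounded number of times and then degrades to a fallback. The names call_with_fallback and FlakyDependency are hypothetical and chosen for this example only; a real client would add backoff and a circuit breaker.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_fallback(primary: Callable[[], T], fallback: T, attempts: int = 2) -> T:
    """Try the primary dependency a bounded number of times, then degrade."""
    for _ in range(attempts):
        try:
            return primary()
        except TimeoutError:
            continue  # bounded retry; a real client would also back off
    return fallback


class FlakyDependency:
    """Test double that times out a fixed number of times before succeeding."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self) -> list[str]:
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("injected timeout")
        return ["personalised", "recommendations"]


def test_recovers_after_one_timeout() -> None:
    dep = FlakyDependency(failures=1)
    assert call_with_fallback(dep, fallback=[]) == ["personalised", "recommendations"]
    assert dep.calls == 2


def test_degrades_when_dependency_stays_down() -> None:
    dep = FlakyDependency(failures=10)
    assert call_with_fallback(dep, fallback=["bestsellers"]) == ["bestsellers"]
    assert dep.calls == 2  # retries are bounded, so the outage is not amplified


test_recovers_after_one_timeout()
test_degrades_when_dependency_stays_down()
```

The second test is the one that matters for resilience: it asserts what the user sees when the dependency never comes back, not just that the happy path works.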
Canary releases reduce blast radius
Every deployment is a risk event. Canary releases mitigate this risk by exposing a change to a small subset of production traffic and evaluating whether to proceed or roll back. The Google SRE Workbook defines canarying as a partial and time‑limited deployment of a change whose evaluation determines whether to continue (Beyer et al., 2018). If the error rate deviates too far from the control, operators should pause and roll back. A well-defined canary process specifies clear success and failure criteria — error rate, latency, business transaction success, dependency timeouts, and critical log patterns — and automates rollback when those thresholds are breached. Small, automated releases coupled with canaries also make rollbacks cheaper and easier.
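A minimal sketch of what pre‑agreed canary criteria might look like in code follows. The Metrics and CanaryCriteria types and the threshold values are assumptions made for illustration, not figures from the SRE Workbook; real evaluations usually add statistical tests and more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


@dataclass(frozen=True)
class CanaryCriteria:
    # Illustrative thresholds, agreed before the release starts.
    min_requests: int = 500              # below this, there is too little data to judge
    max_error_rate_delta: float = 0.005  # canary may exceed control by at most 0.5 points
    max_latency_ratio: float = 1.2       # canary p95 may be at most 20% slower


def evaluate_canary(control: Metrics, canary: Metrics, criteria: CanaryCriteria) -> str:
    """Return 'wait', 'rollback' or 'promote' by comparing canary to control."""
    if canary.requests < criteria.min_requests:
        return "wait"
    if canary.error_rate - control.error_rate > criteria.max_error_rate_delta:
        return "rollback"
    if canary.p95_latency_ms > control.p95_latency_ms * criteria.max_latency_ratio:
        return "rollback"
    return "promote"


control = Metrics(requests=10_000, errors=20, p95_latency_ms=180.0)
print(evaluate_canary(control, Metrics(1_000, 3, 190.0), CanaryCriteria()))
print(evaluate_canary(control, Metrics(1_000, 12, 190.0), CanaryCriteria()))
```

Comparing against a concurrent control group rather than a fixed number keeps the decision robust to background noise such as a traffic spike that affects both versions equally.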
Rollback readiness is a design requirement
A release is not ready when the deployment plan is written; it is ready when the rollback plan is credible. Rollbacks are not a sign of failure but of mature operational practice. A rollback is a controlled process of reverting your system to a known good state and should be planned and tested as thoroughly as the forward deployment (Beyer et al., 2018). A credible rollback plan answers: can we roll back the application independently; are database changes backward‑compatible; can old and new versions run side‑by‑side; are feature flags available; is configuration versioned; and who approves rollback. Designing for reversibility means structuring database migrations, code changes and feature flags so that reversing them does not require dangerous manual intervention.
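One way to make "can old and new versions run side‑by‑side" testable is a compatibility check between versions. The sketch below assumes a hypothetical record format and an additive ("expand") change: the new writer keeps the fields the old reader needs, so rolling the application back does not strand data. The function and field names are invented for this example.

```python
def write_v2(first: str, last: str) -> dict:
    """New version: adds a field but keeps writing the ones v1 relies on."""
    return {"first_name": first, "last_name": last, "display_name": f"{first} {last}"}


def read_v1(record: dict) -> str:
    """Old version: reads only the fields it knows and ignores anything extra."""
    return f"{record['first_name']} {record['last_name']}"


def read_v2(record: dict) -> str:
    """New version: prefers the new field, falls back for rows written by v1."""
    return record.get("display_name") or f"{record['first_name']} {record['last_name']}"


# Rollback safety: data written by the new version is still readable by the old one.
assert read_v1(write_v2("Ada", "Lovelace")) == "Ada Lovelace"

# Roll-forward safety: the new version copes with rows the old version wrote.
assert read_v2({"first_name": "Ada", "last_name": "Lovelace"}) == "Ada Lovelace"
```

Removing the old fields would be a separate, later release, made only once rollback to v1 is no longer needed.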
Dependency mapping turns unknown risk into visible risk
Many severe incidents occur because teams do not understand their dependency graph. Mapping upstream and downstream dependencies, third‑party services, data stores, queues, and batch processes clarifies the blast radius of failures and informs mitigation plans. Regulators recognise this: the UK’s Financial Conduct Authority required firms to identify their important business services, set impact tolerances, and carry out initial mapping and testing by 31 March 2022, and to be able to remain within those impact tolerances by 31 March 2025. A 2023 FCA review of compliance one year on confirmed that firms with mature mapping practices responded to disruption more effectively (Financial Conduct Authority, 2023). For engineering teams, dependency maps should identify critical flows, owners, failure modes, fallback behaviours, and recovery sequences. This is not a static diagram; it is operational evidence to be revisited as the system evolves.
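When the dependency map is kept as data rather than as a diagram, the blast radius of a failure can be computed instead of guessed. The sketch below uses a hypothetical service graph; the service names and the blast_radius helper are assumptions for illustration.

```python
from collections import deque

# service -> services it calls directly (hypothetical names)
DEPENDS_ON = {
    "checkout": ["payments", "inventory"],
    "payments": ["fraud-check", "ledger-db"],
    "inventory": ["stock-db"],
    "search": ["catalog-db"],
    "fraud-check": [],
    "ledger-db": [],
    "stock-db": [],
    "catalog-db": [],
}


def blast_radius(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a dependency up to its callers.
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for caller in dependents.get(current, ()):
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected


print(sorted(blast_radius("ledger-db", DEPENDS_ON)))  # ['checkout', 'payments']
```

A fuller map would attach an owner, failure modes and fallback behaviour to each node, so that the same traversal also answers who to page and what degraded mode to expect.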
Culture decides whether resilience practices survive pressure
Most organisations know the right resilience practices; the harder question is whether teams follow them when deadlines are tight. DORA’s long‑running research debunks the myth that speed comes at the expense of stability: high performers excel at both throughput and stability (DORA, 2024). The 2024 DORA report notes that AI adoption increases individual productivity, flow and job satisfaction but can negatively impact software delivery stability and throughput, reminding teams that fundamentals like small batch sizes and robust testing remain crucial (DORA, 2024). A resilient culture rewards early visibility of risk, normalises rollback, treats post‑mortems as learning tools, and gives teams time to pay down reliability debt. Engineering leaders should insist on rollback plans, clear canary criteria, failure‑mode tests, and dependency maps before approving changes.
The AI angle: resilience needs evidence, not just answers
AI can help with resilience, but only if it is grounded in evidence. AI tools can review architecture documents for missing failure modes, propose test scenarios, map code changes to services, summarise readiness evidence before a change, and identify missing rollback steps. The DORA 2024 report highlights that while AI brings clear productivity gains, it creates trade‑offs that must be managed (DORA, 2024). Specifically, teams that adopted AI tools reported gains in individual productivity and developer satisfaction, but also experienced decreased software delivery stability and throughput at the team level. The report points to several contributing factors: AI-generated code that bypasses normal review discipline, reduced shared understanding of system behaviour across the team, faster accumulation of technical debt when quality gates are weakened, and over-reliance on generated outputs without sufficient validation. In short, AI can make individuals faster while making the system more fragile — which is precisely why resilience practices matter more, not less, as AI adoption increases.
AI should accelerate resilience work — organising evidence, detecting gaps, and triggering automated responses within defined guardrails. Today, accountability for resilience outcomes must remain with humans. But that boundary is already shifting. Automated rollback systems, AIOps platforms, and agentic pipelines already make deployment and remediation decisions without a human in the loop at runtime — humans designed the guardrails, but they no longer make the call.
As agentic AI matures, the human role is evolving from decision-maker to guardrail designer. An agent optimising for availability may sacrifice data consistency in ways no human would approve if asked directly. Multi-agent systems can produce emergent decisions that no single agent — or human — fully reasoned through. Agent actions can execute faster than any review cycle allows.
This does not reduce the importance of human judgment. It changes where that judgment is applied. The discipline of operational resilience — clear criteria, observable outcomes, reversible actions, and explicit failure modes — becomes the mechanism by which humans remain in meaningful control, even when they are no longer in the loop. Designing good guardrails is the decision. That is where engineering leadership matters most.
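To make the idea of a guardrail concrete, here is a deliberately small sketch of a policy check that sits between an automated agent and production. The ProposedAction and Guardrail types, the field names and the 5% limit are all hypothetical; the point is only that reversibility and blast radius are decided by humans in advance and enforced at runtime.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    name: str
    reversible: bool
    affected_traffic_pct: float


@dataclass(frozen=True)
class Guardrail:
    max_traffic_pct: float = 5.0     # illustrative blast-radius limit
    require_reversible: bool = True  # irreversible actions always go to a human


def decide(action: ProposedAction, guardrail: Guardrail) -> str:
    """Return 'execute' if the action is inside the guardrail, else 'escalate'."""
    if guardrail.require_reversible and not action.reversible:
        return "escalate"
    if action.affected_traffic_pct > guardrail.max_traffic_pct:
        return "escalate"
    return "execute"


policy = Guardrail()
print(decide(ProposedAction("shift 2% of traffic to standby", True, 2.0), policy))
print(decide(ProposedAction("drop stale partition", False, 0.5), policy))
```

The agent can act at machine speed inside the boundary, while anything irreversible or too broad is routed back to a person.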
A practical resilience review checklist
Before a significant production release, teams should be able to answer these questions:
- Design readiness – What business capability does this change support? What are the critical user journeys? What failure modes were considered? How will the system degrade? What is the blast radius if this fails?
- Dependency readiness – What upstream and downstream systems are involved? Are third‑party dependencies identified? Are timeout, retry and circuit‑breaker policies defined?
- Testing readiness – What failure modes were tested? Were contract and performance tests updated? Were recovery scenarios exercised? What scenarios were not tested?
- Release readiness – Is the release small enough to reason about? Is canary or phased rollout available? What metrics decide success or failure? Who monitors the release?
- Rollback readiness – Can rollback be executed safely? Are database changes backward‑compatible? Are feature flags available? Has rollback been tested?
- Operational readiness – Are dashboards and alerts ready? Is ownership clear? Are support teams informed? Is customer communication prepared?
Teams that can answer these questions confidently before a release have already done most of the work that prevents incidents from becoming crises.
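The checklist can also be enforced mechanically, for example as a step in a release pipeline. The sketch below is one possible shape, assuming a hypothetical set of question keys and free‑text answers supplied by the team; it covers only one or two questions per area to stay short.

```python
# Readiness area -> question keys that must have a non-empty answer (illustrative subset).
REQUIRED_QUESTIONS = {
    "design": ["blast_radius", "degradation_behaviour"],
    "dependency": ["timeout_retry_policy"],
    "testing": ["failure_modes_tested"],
    "release": ["success_metrics"],
    "rollback": ["rollback_tested"],
    "operational": ["alerts_ready"],
}


def blocking_gaps(answers: dict[str, str]) -> list[str]:
    """List every required question that has no answer or a blank one."""
    gaps = []
    for area, questions in REQUIRED_QUESTIONS.items():
        for question in questions:
            if not answers.get(question, "").strip():
                gaps.append(f"{area}: {question}")
    return gaps


answers = {
    "blast_radius": "checkout only, single region",
    "degradation_behaviour": "serve cached prices",
    "timeout_retry_policy": "300 ms timeout, 2 retries",
    "failure_modes_tested": "payment provider timeout",
    "success_metrics": "error rate and p95 latency vs control",
    "alerts_ready": "yes, dashboard linked in the change ticket",
}
print(blocking_gaps(answers))  # ['rollback: rollback_tested']
```

A gate like this cannot judge whether an answer is good, only whether it exists; the review conversation still has to happen.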
How engineering leaders can influence resilience before production
Leaders can make resilience visible before production by taking concrete steps:
- Gate high-risk changes on the checklist above — treat unanswered readiness questions as blocking, not advisory.
- Make rollback a first-class deliverable — a change without a tested rollback plan is not ready to ship.
- Connect test strategy directly to failure modes — if a failure mode has no corresponding test, it is an unmanaged risk.
- Require dependency maps for critical services — the map should name owners, failure modes, and fallback behaviour, not just draw lines between boxes.
- Set canary criteria in advance, not during the release — criteria defined under pressure are not criteria; they are guesses.
- Use post‑mortems to improve pre‑production reviews — every incident that could have been caught earlier is an input to a better design review question.
Operational resilience does not begin when the pager rings. It begins when a team designs a service and asks, “How will this fail?” It matures when testing proves the system can handle disruption, when releases are small, observable and reversible, and when culture rewards visibility over heroics. The best incident is the one that never happens, because resilience was designed in and tested before production, then carried through release and operation.
References
Beyer, B., Jones, C., Petoff, J., & Murphy, N. R. (Eds.). (2016). Site reliability engineering: How Google runs production systems. O’Reilly Media. https://sre.google/sre-book/table-of-contents/
Beyer, B., Murphy, N. R., Rensin, D., Kawahara, K., & Thorne, S. (Eds.). (2018). The site reliability workbook: Practical ways to implement SRE. O’Reilly Media. https://sre.google/workbook/
DORA. (2024). Accelerate: State of DevOps 2024. Google. https://dora.dev/research/2024/dora-report/
Financial Conduct Authority. (2023). Operational resilience: Insights and observations one year on. FCA. https://www.fca.org.uk/publications/multi-firm-reviews/operational-resilience-insights-observations
Ross, R., Pillitteri, V., Graubart, R., Bodeau, D., & McQuaid, R. (2021). Developing cyber-resilient systems: A systems security engineering approach (NIST SP 800-160 Vol. 2 Rev. 1). National Institute of Standards and Technology. https://csrc.nist.gov/publications/detail/sp/800-160/vol-2-rev-1/final