
GitOps During Incidents: When Argo CD Helps, and When It Gets in the Way

GitOps is excellent for normal platform operations, but production incidents need a deliberate break-glass model, clear Argo CD access, drift rules, and post-incident reconciliation.

May 20, 2026

#kubernetes #gitops #argocd #platform-engineering #incident-response

GitOps is one of the better operating models we have for Kubernetes. It gives teams a reviewed source of truth, a repeatable deployment path, a visible audit trail, and a way to recover from accidental manual changes. For day-to-day platform work, that is exactly what most teams need.

But incidents are not day-to-day platform work.

At 02:00, when production is down and a config change needs to happen now, the same properties that make GitOps attractive can suddenly feel hostile. A controller reverts a manual fix. A pull request waits for someone who is asleep. The only person with Argo CD permissions is unavailable. The cluster is technically "correct" because it matches Git, but the business is still down.

That is not a reason to abandon GitOps. It is a reason to design the incident model before the incident.

This tension shows up repeatedly in Kubernetes community discussions. In one r/kubernetes thread, engineers debated whether "GitOps" becomes "Git + Operations" once people start patching production directly. The most useful replies were not ideological. They described the real operational problem: sometimes you need to put the fire out, but you also need a disciplined way to get back to Git as the source of truth afterwards. One commenter described an eight-hour outage where Argo CD kept resetting manual edits while the only person with access was unavailable. That is not a tooling failure alone. It is an access, process, and platform design failure.

The thread is worth reading because it sounds like real production work, not conference-slide GitOps.

What GitOps Is Good At

GitOps works because it reduces ambiguity. The desired state of the system lives in Git. The controller compares that desired state to live cluster state and reconciles the difference. That gives the team a consistent answer to a basic operational question: "What should be running?"

In a healthy setup, this improves several things at once:

Changes go through review instead of being applied from someone's shell history.
Production state can be reproduced from a repository.
Drift becomes visible instead of hiding inside the cluster.
Rollbacks become ordinary Git operations, not archaeology.
Teams can reason about the platform without every engineer needing cluster-admin access.

That is why tools like Argo CD and Flux are so valuable. They make the steady-state path boring. Boring is good.

The mistake is assuming that a system designed for steady-state delivery automatically covers emergency operations.

The Incident Problem

Incidents compress time. They also change the cost of waiting.

During normal delivery, "open a PR and wait for review" is a strength. During an outage, it can become a bottleneck. During normal delivery, "manual cluster edits are reverted" is a safety feature. During an outage, it can undo the only mitigation that currently works.

The recurring failure modes are predictable:

No emergency path: The team can only deploy through the normal PR path, even when the normal path is too slow.
Too few privileged operators: Only one person can pause sync, switch an Argo CD application to a fix branch, or approve a production change, and that person is not always available.
Self-heal works against mitigation: A direct kubectl patch is reverted before the team has time to confirm whether the mitigation works.
Drift is treated as binary: Any difference between Git and cluster state is treated as equally bad, even if it is a short-lived incident change.
No reconciliation deadline: Emergency changes happen, the service comes back, and nobody backports the fix to Git.
No audit link: The incident timeline, Git commits, Argo events, and live cluster changes are never tied together.

The problem is not "GitOps is too strict" or "operators are undisciplined." The problem is that many teams implement the happy path and leave the emergency path as folklore.

Break-Glass Is Not an Anti-Pattern

Some teams resist break-glass access because they think it weakens GitOps. In reality, a controlled break-glass path is what lets GitOps survive contact with production.

The difference between good and bad break-glass is not whether humans can bypass the normal path. The difference is whether the bypass is deliberate, logged, time-bounded, and reconciled.

A practical break-glass model should answer these questions before anyone is under pressure:

Who can pause or modify GitOps reconciliation during a production incident?
How is that access granted, logged, and reviewed?
Can the on-call engineer switch an Argo CD application to a temporary fix branch?
When is direct kubectl allowed?
What must be captured in the incident record?
How quickly must every emergency change be backported to Git?
What alerts fire if reconciliation is suspended too long?

If those answers are not written down, the platform still has a break-glass process. It is just undocumented, inconsistent, and likely to be discovered at the worst possible moment.
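One way to make "deliberate, logged, time-bounded, and reconciled" concrete is to treat each elevation as a small record with an expiry and a backport deadline. The following is a minimal sketch, not a real access system: the class name, field names, and example values are all hypothetical, and the durations would come from your own policy.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BreakGlassGrant:
    """A time-bounded elevation tied to one incident (all fields hypothetical)."""

    operator: str
    incident_id: str
    scope: str  # e.g. the name of a single affected application
    granted_at: datetime
    ttl: timedelta  # how long the elevated access lasts
    backport_deadline: timedelta  # how long a change may stay out of Git

    def expires_at(self) -> datetime:
        return self.granted_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        # Access is valid only inside the granted window.
        return self.granted_at <= now < self.expires_at()

    def backport_overdue(self, now: datetime, backported: bool) -> bool:
        # A change still missing from Git after the deadline is a process failure.
        return not backported and now > self.granted_at + self.backport_deadline


# Example values are illustrative only.
granted = datetime(2026, 5, 20, 2, 0, tzinfo=timezone.utc)
grant = BreakGlassGrant(
    operator="alice",
    incident_id="INC-1",
    scope="payments-api",
    granted_at=granted,
    ttl=timedelta(hours=1),
    backport_deadline=timedelta(hours=24),
)
```

A record like this answers several of the questions above at once: who acted, on what scope, until when, and by when the change must be back in Git.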

A Better Incident Workflow

For most teams, the right model is not "never touch the cluster." It is "Git remains the destination of record, even if the fastest mitigation temporarily happens elsewhere."

A sane GitOps incident workflow looks like this:

Declare the incident and name the operator.
Decide whether the mitigation can go through Git quickly enough.
If yes, use a fix branch or emergency PR path.
If no, use controlled break-glass access and record the manual change.
Pause self-heal only for the affected application or namespace, not the whole platform.
Validate the mitigation with production signals, not just sync status.
Backport the final state to Git.
Re-enable reconciliation.
Confirm that Git, Argo CD, and the cluster agree.
Add the drift window and reconciliation commit to the incident record.

That workflow preserves the important principle: Git is the durable source of truth. It also accepts the operational reality that production sometimes needs a fast, reversible mitigation before the final fix is reviewed.
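Two steps in that workflow, pausing self-heal for a single application and pointing it at a fix branch, can be expressed as JSON merge patches against the Argo CD Application resource. The sketch below only builds the patches and the matching kubectl command. It assumes a single-source Application (one that uses `spec.source`), and the helper names, the application name, and the namespace are hypothetical. Verify the field paths against your Argo CD version before relying on them.

```python
import json


def pause_self_heal_patch() -> dict:
    """Merge patch that removes the automated sync policy from one Application.

    In a JSON merge patch, null deletes the key, so the controller stops
    auto-syncing (and therefore self-healing) this application only.
    """
    return {"spec": {"syncPolicy": {"automated": None}}}


def fix_branch_patch(branch: str) -> dict:
    """Merge patch that points a single-source Application at a fix branch."""
    return {"spec": {"source": {"targetRevision": branch}}}


def kubectl_patch_command(app: str, namespace: str, patch: dict) -> list[str]:
    """Build (but do not run) the kubectl invocation for a merge patch."""
    return [
        "kubectl", "patch", "applications.argoproj.io", app,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


# Hypothetical application and namespace names.
command = kubectl_patch_command("payments-api", "argocd", pause_self_heal_patch())
print(" ".join(command))
```

If the Application is itself generated or managed by a parent (an ApplicationSet or an app-of-apps setup, for example), the parent may revert a patch like this, so the runbook should say where the pause has to be applied.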

Alert on Drift Age, Not Just Drift

There is a useful idea from another Kubernetes drift discussion: alert on drift age, not merely drift existence.

In that thread, one reply described allowing hotfixes during incidents while treating anything not backported to Git within a short window as a process failure.

That distinction matters. New drift during an incident can be normal. Persistent drift after the incident is the problem.

If every diff is a page, the team will drown in noise. Kubernetes itself has many legitimate sources of live-state mutation: HPAs, operators, admission controllers, generated fields, controllers updating status, and sometimes platform-specific defaults. A useful GitOps platform knows which differences are expected, which are temporary, and which are dangerous.
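To illustrate what "knowing which differences are expected" can mean in practice, here is a minimal sketch that strips a list of ignored paths from both the desired and the live object before comparing them. It is similar in spirit to Argo CD's `ignoreDifferences` with `jsonPointers`, but it is a simplified stand-in: it handles nested dict keys only (no list indices, no pointer escaping), and the function names and sample manifests are hypothetical.

```python
import copy


def drop_pointer(doc: dict, pointer: str) -> None:
    """Remove the value at a slash-separated path of dict keys, if present."""
    parts = [p for p in pointer.split("/") if p]
    node = doc
    for key in parts[:-1]:
        node = node.get(key)
        if not isinstance(node, dict):
            return  # Path does not exist; nothing to remove.
    if parts:
        node.pop(parts[-1], None)


def has_meaningful_drift(desired: dict, live: dict, ignored: list[str]) -> bool:
    """Compare desired and live state after removing expected-mutation paths."""
    desired_n, live_n = copy.deepcopy(desired), copy.deepcopy(live)
    for pointer in ignored:
        drop_pointer(desired_n, pointer)
        drop_pointer(live_n, pointer)
    return desired_n != live_n


# Replicas are owned by an HPA and status by a controller, so both are expected.
desired = {"spec": {"replicas": 2, "template": {"image": "api:1.4"}}}
live = {
    "spec": {"replicas": 5, "template": {"image": "api:1.4"}},
    "status": {"readyReplicas": 5},
}
ignored = ["/spec/replicas", "/status"]
drifted = has_meaningful_drift(desired, live, ignored)
```

With the ignore list applied, a replica count changed by an autoscaler no longer registers as drift, while a changed image still would.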

I like this rule of thumb:

Expected drift should be ignored or normalized.
New incident drift should be visible but not necessarily paging.
Persistent drift should create work.
Growing drift should create urgency.
Suspended reconciliation should always have an owner and deadline.

This turns drift from a moral argument into an operational signal.
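The rule of thumb above can be sketched as a small classifier that takes drift age into account. This is one possible reading, not a standard: the signal names are hypothetical, "growing" is assumed to mean the diff is larger than at the previous check, and the backport window is a required parameter because the right length is a policy decision.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class DriftSignal(Enum):
    IGNORE = "ignore"    # expected mutation, normalized away
    VISIBLE = "visible"  # shown on dashboards, not paging
    TICKET = "ticket"    # persistent drift creates work
    PAGE = "page"        # persistent and growing drift creates urgency


def classify_drift(
    first_seen: datetime,
    now: datetime,
    backport_window: timedelta,
    expected: bool = False,
    growing: bool = False,
) -> DriftSignal:
    """Map a detected difference to a signal based on its age."""
    if expected:
        return DriftSignal.IGNORE
    age = now - first_seen
    if age <= backport_window:
        # New drift, for example a hotfix applied during an incident.
        return DriftSignal.VISIBLE
    # Past the window, the change should already have been backported to Git.
    return DriftSignal.PAGE if growing else DriftSignal.TICKET


# Illustrative values only; the 24-hour window is an arbitrary example.
seen = datetime(2026, 5, 20, 2, 0, tzinfo=timezone.utc)
signal = classify_drift(seen, seen + timedelta(hours=30), timedelta(hours=24))
```

The useful property is that the same diff produces a different signal over time, which is what separates a tolerated hotfix from forgotten drift.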

Access Design Matters More Than Tool Choice

Teams often compare Argo CD and Flux as if the tool choice decides the operating model. It does not.

The important questions are more basic:

Can the on-call engineer see what is deployed?
Can they understand why reconciliation is failing?
Can they safely pause or narrow reconciliation?
Can they route an emergency change through a fast path?
Can the platform prove what happened afterwards?

If only one senior engineer can operate Argo CD, GitOps becomes a single point of failure. If every engineer has broad cluster-admin access, GitOps becomes optional theater. The useful middle ground is role-based access that matches incident responsibilities.

Application teams may need read access to Argo CD and logs. On-call platform engineers may need scoped permission to sync, roll back, suspend, or switch target revisions. A small incident commander group may need break-glass elevation with automatic expiry. The exact model depends on the organization, but the key is that the model exists.
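As a rough illustration of that tiered model, the sketch below maps roles to permitted actions and treats break-glass as valid only while an elevation is unexpired. The role and action names are invented for the example and are not Argo CD or Flux RBAC syntax. A real setup would express this in the tool's own policy format and in the identity provider.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical roles and action names, mirroring the three tiers described above.
READ = {"view", "logs"}
OPERATE = READ | {"sync", "rollback", "suspend", "set-revision"}
ROLE_ACTIONS = {
    "app-team": READ,
    "platform-oncall": OPERATE,
    "incident-commander": OPERATE | {"break-glass"},
}


def is_allowed(
    role: str,
    action: str,
    now: datetime,
    elevation_expires_at: Optional[datetime] = None,
) -> bool:
    """Check a role's permission; break-glass also needs an unexpired elevation."""
    if action not in ROLE_ACTIONS.get(role, set()):
        return False
    if action == "break-glass":
        return elevation_expires_at is not None and now < elevation_expires_at
    return True


now = datetime(2026, 5, 20, 2, 0, tzinfo=timezone.utc)
can_suspend = is_allowed("platform-oncall", "suspend", now)
can_elevate = is_allowed(
    "incident-commander", "break-glass", now, now + timedelta(hours=1)
)
```

Even a table this small makes the gaps visible: if nobody on the on-call rotation holds the role that can suspend reconciliation, that is the single point of failure described above.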

What Good Looks Like

A production-ready GitOps setup should include more than a controller and a repository. At minimum, I want to see:

Documented emergency change policy.
Scoped Argo CD or Flux permissions for on-call operators.
A way to switch an application to a fix branch.
Alerts for failed syncs, degraded health, and long-lived suspended reconciliation.
Drift detection that distinguishes expected mutation from real configuration drift.
A post-incident requirement to reconcile live state back into Git.
Runbooks for rollback, sync suspension, and forced reconciliation.
Regular access review for production GitOps permissions.

This is not bureaucracy. It is how you avoid improvising the operating model while customers are already affected.

The Real Test

The real test of GitOps is not whether every normal deployment uses Git. That is the easy part.

The real test is what happens when production is broken, time matters, and the current Git state is not the state you need for the next 20 minutes.

If the team can mitigate quickly, preserve evidence, reconcile cleanly, and learn from the drift window, GitOps is helping. If the team is stuck waiting for one person, fighting the controller, or leaving manual changes behind for weeks, GitOps is only half implemented.

GitOps should make production safer. It should not make incident response slower by accident.

If your platform uses Argo CD, Flux, or another GitOps controller and you are not sure what happens during a real production incident, that is worth testing before the next outage. A short platform review can usually uncover the access gaps, sync risks, and reconciliation problems before they become downtime.