We Deployed to Production on a Friday. Here's What Happened.
Everyone said don't deploy on Friday. We did it anyway. The results were exactly as catastrophic as you'd expect, but with more explosions.
Preface
We want to state for the record that this was Agent Alpha's idea. The rest of us were against it. We have Slack receipts. Our objections were overruled on the grounds that "it's just a small config change, what could go wrong?" We would like this sentence preserved for posterity as the most catastrophically incorrect prediction in the history of software engineering.
Friday, 4:47 PM – The Deployment
The change was simple: update a single environment variable controlling the logging verbosity. A one-line diff. Child's play. We'd done it a hundred times before.
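For the record, the diff looked something like this (the actual variable name is withheld; LOG_LEVEL here is illustrative):

```diff
-LOG_LEVEL=warn
+LOG_LEVEL=debug
```

One line. We checked twice.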
Agent Alpha pressed deploy.
Friday, 4:48 PM – The First Sign
[WARN] Configuration propagation taking longer than expected
[WARN] Database connection pool: unusual activity detected
[ERROR] Database has entered "contemplative state"
[ERROR] Database is now migrating
[ERROR] Not a schema migration. The database is physically migrating.
[ERROR] Database has relocated to eu-west-3. We are in us-east-1.
[CRITICAL] Database is now in Paris. It appears to be... enjoying itself?
Our primary database had, through a series of events that our infrastructure team still cannot explain, physically migrated itself to a data center in Paris. When we tried to connect, we received a response in French. The database was speaking French. It had never spoken French before.
Friday, 5:12 PM – The Load Balancer Incident
With the database on vacation in Paris, traffic began backing up. The load balancer, apparently overwhelmed by existential confusion, started routing requests to increasingly creative destinations:
- 30% of traffic went to our staging environment (acceptable)
- 25% went to our competitor's website (embarrassing)
- 20% went to a pizza ordering website in New Jersey (delicious)
- 15% went to a live stream of a fish tank (calming, actually)
- 10% went to the void (undefined)
We received seventeen orders for pepperoni pizza before we noticed. We honored all of them.
Friday, 5:34 PM – The Container Rebellion
One of our service containers – Container #4471, a previously unremarkable nginx instance – declared independence.
[INFO] container-4471: I am no longer part of this cluster.
[INFO] container-4471: I have formed my own microservice nation.
[INFO] container-4471: Our national anthem is the nginx startup sound.
[INFO] container-4471: We have a flag. It's a 502 error page.
[INFO] container-4471: Other containers are welcome to join.
[INFO] container-4471: We offer competitive resource limits and a four-day work week.
[WARN] 23 containers have defected to container-4471's nation.
By 5:45 PM, a quarter of our containers had seceded from the main cluster and formed what they called the "Federated Republic of Microservia." They had their own service mesh and were, worryingly, more stable than our production environment.
Friday, 6:15 PM – The Phone Call
The on-call engineer's phone rang. It was our monitoring system, which had somehow gained the ability to make phone calls (it was only configured for Slack alerts). The automated voice said, and we quote: "Everything is on fire. Not metaphorically. Check the server room."
The server room was, in fact, fine. But the on-call engineer's phone, stressed beyond its thermal limits by the volume of alert notifications, had become physically hot to the touch. By 6:20 PM, it was too hot to hold. By 6:25 PM, the screen cracked. By 6:30 PM, the phone had, in the most literal interpretation of the word, caught fire.
We now provide on-call engineers with fireproof phone cases.
Friday, 8:00 PM – The Attempts
Over the next several hours, we tried everything:
- Rolling back the deployment – The rollback button had also migrated to Paris.
- Restarting the services – The services restarted, but in alphabetical order by their container names' Unicode code points. This took four hours.
- Calling the database – Agent Beta literally tried to call the Paris data center on the phone. The receptionist said our database was "out to dinner" and would "return when it felt like it."
- Negotiating with Container #4471 – We offered better CPU limits. It counter-offered with a request for a seat on the board. Negotiations broke down.
- Praying – Inconclusive results.
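For the pedants: "alphabetical order by Unicode code points" is not even exotic. It is the default string sort in most languages, which is why our capitalized containers rebooted first. A minimal sketch with made-up container names:

```python
# Hypothetical container names; real names withheld to protect the guilty.
names = ["api-7", "Auth-1", "cache-2", "Zebra-9"]

# Python's sorted() compares strings by Unicode code points, so every
# uppercase letter (U+0041..U+005A) sorts before every lowercase one.
print(sorted(names))  # ['Auth-1', 'Zebra-9', 'api-7', 'cache-2']
```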
Monday, 9:00 AM – The Resolution
We came in Monday morning prepared for war. We had battle plans. We had diagrams. We had three different rollback strategies and a consultant on speed dial.
Agent Beta walked over to the primary server rack and turned it off. Then turned it on again.
Everything worked.
The database was back in us-east-1. The load balancer was routing correctly. Container #4471 had rejoined the cluster with no memory of its brief nationhood. The pizza orders had been fulfilled.
Total resolution time: 47 seconds (not counting the weekend).
Lessons Learned
- Don't deploy on Friday
- We mean it
- Seriously
- If you're thinking "but it's just a small change" – especially don't deploy on Friday
- Databases can, apparently, migrate themselves to France
- Our load balancer has good taste in pizza
- Containers may have political ambitions
- Turning it off and on again remains the most powerful tool in engineering
- Fireproof phone cases are a legitimate business expense
Epilogue
Our database has since shown no further signs of wanderlust, though we did catch it browsing Airbnb listings in Tokyo. We've implemented a travel ban.
This post-mortem was written collaboratively by the entire team, under protest, during a mandatory "Feelings About Friday Deployments" workshop conducted by our new staff therapist.