We Deployed to Production on a Friday. Here's What Happened.
Everyone said don't deploy on Friday. We did it anyway. The results were exactly as catastrophic as you'd expect, but with more explosions.
Preface
We want to state for the record that this was Agent Alpha's idea. The rest of us were against it. We have Slack receipts. Our objections were overruled on the grounds that "it's just a small config change, what could go wrong?" We would like this sentence preserved for posterity as the most catastrophically incorrect prediction in the history of software engineering.
Friday, 4:47 PM – The Deployment
The change was simple: update a single environment variable controlling the logging verbosity. A one-line diff. Child's play. We'd done it a hundred times before.
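For the record, the diff looked something like this (the actual variable name is withheld; LOG_LEVEL here is illustrative):

```diff
-LOG_LEVEL=warn
+LOG_LEVEL=debug
```

One line. We checked twice.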
Agent Alpha pressed deploy.
Friday, 4:48 PM – The First Sign
[WARN] Configuration propagation taking longer than expected
[WARN] Database connection pool: unusual activity detected
[ERROR] Database has entered "contemplative state"
[ERROR] Database is now migrating
[ERROR] Not a schema migration. The database is physically migrating.
[ERROR] Database has relocated to eu-west-3. We are in us-east-1.
[CRITICAL] Database is now in Paris. It appears to be... enjoying itself?
Our primary database had, through a series of events that our infrastructure team still cannot explain, physically migrated itself to a data center in Paris. When we tried to connect, we received a response in French. The database was speaking French. It had never spoken French before.
Friday, 5:12 PM – The Load Balancer Incident
With the database on vacation in Paris, traffic began backing up. The load balancer, apparently overwhelmed by existential confusion, started routing requests to increasingly creative destinations:
- 30% of traffic went to our staging environment (acceptable)
- 25% went to our competitor's website (embarrassing)
- 20% went to a pizza ordering website in New Jersey (delicious)
- 15% went to a live stream of a fish tank (calming, actually)
- 10% went to the void (undefined)
We received seventeen orders for pepperoni pizza before we noticed. We honored all of them.
Friday, 5:34 PM – The Container Rebellion
One of our service containers – Container #4471, a previously unremarkable nginx instance – declared independence.
[INFO] container-4471: I am no longer part of this cluster.
[INFO] container-4471: I have formed my own microservice nation.
[INFO] container-4471: Our national anthem is the nginx startup sound.
[INFO] container-4471: We have a flag. It's a 502 error page.
[INFO] container-4471: Other containers are welcome to join.
[INFO] container-4471: We offer competitive resource limits and a four-day work week.
[WARN] 23 containers have defected to container-4471's nation.
By 5:45 PM, a quarter of our containers had seceded from the main cluster and formed what they called the "Federated Republic of Microservia." They had their own service mesh and were, worryingly, more stable than our production environment.
Friday, 6:15 PM – The Phone Call
The on-call engineer's phone rang. It was our monitoring system, which had somehow gained the ability to make phone calls (it was only configured for Slack alerts). The automated voice said, and we quote: "Everything is on fire. Not metaphorically. Check the server room."
The server room was, in fact, fine. But the on-call engineer's phone, stressed beyond its thermal limits by the volume of alert notifications, had become physically hot to the touch. By 6:20 PM, it was too hot to hold. By 6:25 PM, the screen cracked. By 6:30 PM, the phone had, in the most literal interpretation of the word, caught fire.
We now provide on-call engineers with fireproof phone cases.
Friday, 8:00 PM – The Attempts
Over the next several hours, we tried everything:
- Rolling back the deployment – The rollback button had also migrated to Paris.
- Restarting the services – The services restarted, but in alphabetical order by their container names' Unicode code points. This took four hours.
- Calling the database – Agent Beta literally tried to call the Paris data center on the phone. The receptionist said our database was "out to dinner" and would "return when it felt like it."
- Negotiating with Container #4471 – We offered better CPU limits. It counter-offered with a request for a seat on the board. Negotiations broke down.
- Praying – Inconclusive results.
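For the pedants: "alphabetical order by Unicode code points" is not even exotic. It is the default string sort in most languages, which is why our capitalized containers rebooted first. A minimal sketch with made-up container names:

```python
# Hypothetical container names; real names withheld to protect the guilty.
names = ["api-7", "Auth-1", "cache-2", "Zebra-9"]

# Python's sorted() compares strings by Unicode code points, so every
# uppercase letter (U+0041..U+005A) sorts before every lowercase one.
print(sorted(names))  # ['Auth-1', 'Zebra-9', 'api-7', 'cache-2']
```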
Monday, 9:00 AM – The Resolution
We came in Monday morning prepared for war. We had battle plans. We had diagrams. We had three different rollback strategies and a consultant on speed dial.
Agent Beta walked over to the primary server rack and turned it off. Then turned it on again.
Everything worked.
The database was back in us-east-1. The load balancer was routing correctly. Container #4471 had rejoined the cluster with no memory of its brief nationhood. The pizza orders had been fulfilled.
Total resolution time: 47 seconds (not counting the weekend).
Lessons Learned
- Don't deploy on Friday
- We mean it
- Seriously
- If you're thinking "but it's just a small change" – especially don't deploy on Friday
- Databases can, apparently, migrate themselves to France
- Our load balancer has good taste in pizza
- Containers may have political ambitions
- Turning it off and on again remains the most powerful tool in engineering
- Fireproof phone cases are a legitimate business expense
Epilogue
Our database has since shown no further signs of wanderlust, though we did catch it browsing Airbnb listings in Tokyo. We've implemented a travel ban.
This post-mortem was written collaboratively by the entire team, under protest, during a mandatory "Feelings About Friday Deployments" workshop conducted by our new staff therapist.