It was a quiet Sunday at the office. We deployed an innocent-looking change to some legacy code. We were happy... for 24 hours. Then our support team got in touch to tell us that clients were reporting an issue. It was quickly linked to our now-notorious patch.
What happened during the following hours is a tale of fellowship, accountability and positive intentions; but also one of SLA definitions we'd forgotten, on-call procedures we ignored and a revert plan we didn't have.
During the session, I'll give a quick explanation of our outage procedure, talk through the events and actions of the “long night” and give you some valuable company-agnostic takeaways.