Twice in the last six weeks I’ve been reminded that “shit happens.” Things fail, people fail, services fail. None of this is surprising; our existence is built on failure and learning from it.
Look at it this way: we built a lot of rockets before we sent people to space. Some of those had tragic endings. More than that, though, even after we thought we had gotten it right, the launch program suffered a tragedy because of a part failure. Maybe it could have been caught before the Challenger launched, maybe it couldn’t have. The truth is NASA learned from it, but that didn’t stop another failure 17 years later when the Columbia broke up on re-entry.
I’m pointing these out because they’re big things. They were disasters, they were horrible, and many, many people grieved the loss. And for 73 seconds in 1986, and then for 1008 seconds in 2003, NASA had a crisis on its hands.
So several weeks ago, when my team had to handle a service failure and people began to panic, I thought, “This is a problem. How do we solve it?” It was bigger than me, and it was bigger than my team, and someone said out loud, “This is going to be a real crisis!” There was worry, anxiety, and no small amount of angst. Combined with the level of impact, that made it a crisis in someone’s mind.
This is not a crisis.
Crisis to me means you aren’t expecting a failure, you don’t have a plan, and there’s not much a person can do to easily solve the problem.
In the modern world of systems administration, engineering, and architecture, embodied by DevOps and SRE concepts, automation, and yes, failure, the people filling those roles should not be surprised when something is broken. Whether it was broken intentionally or accidentally doesn’t matter; the end result is the same. It’s broken. It failed. What’s next?
I started in my current role 8 months ago, and in the first month everything was a crisis to the people around me. I really could not understand this mentality, because while I could see problems constantly popping up, I could also see solutions. I expect failure, and that’s something I tried to instill in my team. We made it out of crisis mode after a few months, and we started to see marked improvement in team morale and, I believe, in appreciation from our clients.
So when things broke down recently, and someone said the “C word” - crisis - I just put on my game face and got to work.
I’m not going to lie, we’re still dealing with some fallout from the failure, but instead of panicking, my team dug in and worked towards a solution. We got our partner team from another location involved to help out, and we’re continuing to push forward. When failure (and recovery) is a part of everyday thinking, there’s a lot less worry and a lot more problem solving.
So when we had another problem recently that could still blow up into something bigger, nobody on my team panicked. They just started working the problem. The reason?
This is not a crisis.