DevOps Nightmares — Fate’s Fat Finger

Tin Nguyen

Imagine, if you will, a series of tragic tales from dark dimensions where nightmares come to life and best practices go to die. The stories you’re about to witness are all true, pieced together from the shattered psyches of those who lived to tell the tales. Accounts have been anonymized to safeguard the unfortunate souls who were caught in the crosshairs of catastrophe.

As you follow this motley crew of unsuspecting engineers navigating the murky waters of automation, integration, and delivery, each alarming anecdote will become another foreboding reminder that it could all happen to you one day.

Prepare for downtime, disasters, and dilemmas. Your pulse will quicken and your hardware will cringe as you travel through the hair-raising vortex of DevOps Nightmares…

Fate’s Fat Finger

Automation is great because it lets you do good things quickly … But you can do terrible things quickly, too.

“It’s been five years and just thinking about it still makes me tired,” says the former VP of Infrastructure at the biometric security company where this DevOps Nightmare took place. “The whole fiasco was exhausting, physically and emotionally.”

At the time, the company was going through a growth spurt like a middle schooler with NBA dreams. Customers were signing up in record numbers to get various body parts scanned into the system, and the servers running the operation were redlining just to keep up.

No one had bothered to build auto-scaling infrastructure because the existing setup had always worked fine. But under the new demand it started wheezing like an old man trying to reach a fifth-floor walk-up with seven bags of groceries.

And then the plastic bags broke and servers started rolling down the stairwell. Systems were going down and customers were unable to access their biometric scan services. Let’s just say: the “stuff” was hitting the fan.

The classic scale-out story: build a new cluster and maybe automate it

Under pressure from higher-ups to get things working NOW, the DevOps team was called in to wave their magic wands. They first tried their go-to spell: slap together a new server cluster. But thanks to the platform's complexity and the manual configuration work required, the whole process took five hours plus another hour for the data to trickle in.

Even after all that effort, the new cluster only worked for a couple days, keeling over like a cow tipping contest gone wrong. Back to square one!

“There was a gremlin in the tech stack and we couldn't figure out what it was,” recalls the former Infrastructure VP. “Then someone said they could write the automation to build new clusters and it would ‘only’ take a week.”

The DevOps engineer turned half a day’s cluster provisioning work into a 20-minute automated job. He was treated like someone who stopped a runaway train with his bare hands. While the automation didn’t fix the performance issues entirely, the DevOps team could rapidly scale capacity on demand.

Every silver lining has a cloud (not that kind)

But then, after an all-hands-on-deck week or two of people working long hours and not sleeping much, Fate’s Fat Finger intervened. The engineer who solved the original problem created an even bigger one with a single mistaken keystroke: instead of adding a cluster, he deleted all the existing ones, making the entire platform inoperable. Behold, the incredible power of automation!

The former VP of Infrastructure recounts what happened next: “He came to me whiter than a ghost, trembling as he said, ‘So...I think I just took us offline. Like, completely offline.’ I had to break the news to the CTO and the CEO that we’d be out of order for about 12 hours. As you can imagine, they were not pleased.”

The exec team was emotionally in pieces. The same DevOps engineer who solved the automation problem also brought down the entire company. Heroics one week, company-crushing news the next. Do they promote the guy or fire him?

Despite the tragic accident, most people came to see in the end that automating cluster creation relieved more heartburn and ulcers than it caused. The managers, who knew how difficult this work is, got past their disappointment, and the engineers gained a new respect for peer review and for proceeding with caution, especially when exhausted and addled.
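The story doesn't say what the team's tooling looked like, but the lesson about proceeding with caution can be sketched in a few lines: a hypothetical delete routine that refuses to run until the operator retypes the exact cluster name. Every name and function here is illustrative, not from the incident.

```python
def guard_delete(cluster_name: str, typed_confirmation: str) -> bool:
    """Allow deletion only when the operator has retyped the exact cluster name."""
    return typed_confirmation.strip() == cluster_name


def delete_cluster(cluster_name: str, typed_confirmation: str, registry: set) -> bool:
    """Remove a cluster from a (hypothetical) registry only after confirmation.

    Returns True if the deletion ran, False if the guard refused it.
    """
    if not guard_delete(cluster_name, typed_confirmation):
        # Refuse rather than default to a destructive action.
        return False
    registry.discard(cluster_name)
    return True


# Illustrative usage: a mistyped confirmation leaves the cluster alone.
clusters = {"prod-a", "prod-b"}
delete_cluster("prod-a", "prod-a", clusters)   # confirmed: "prod-a" is removed
delete_cluster("prod-b", "prod-c", clusters)   # mismatch: "prod-b" survives
```

A fat finger can still type the wrong command, but it is much less likely to also retype the wrong cluster's exact name, which is why this pattern shows up in many real-world CLIs for irreversible operations.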

The engineer whose solution saved the company stayed on for a few years, but he was never totally forgiven for the mistake. Fortunately, he eventually moved on to bigger and better things.

Have your own tales of automation woes or delivery disasters? We want to hear them! If you've endured a devastating DevOps debacle and are willing to anonymously share the cringe-worthy details, please reach out to us at

Don’t hold back or hide the scars of your most frightening system scares. Together, we can immortalize the valuable lessons within your darkest DevOps hours. Your therapy is our treasured content, and we’ll gracefully craft your organizational mishap into a cautionary case study for the ages. And, in return for your candor, we'll ship some sweet swag your way as thanks.
