Disaster recovery (or the pain of losing everything)
We dared to ask the question: what if we lost everything? But what is everything? Everything means different things to different people. For us, the engineers at In The Pocket (ITP), it has a common meaning: our cloud infrastructure.
Back in 2019, we started building a cloud platform that sends, receives and translates communication between man and machine. It steadily grew into the largest cloud-based platform we currently have, supporting over 1.5 million connected devices across Europe.
So as you can imagine, the project is big. And by big, we mean huge. It’s been worked on for several years, by dozens and dozens of engineers. The architecture is quite expansive, and so is the underlying codebase. All that code has to live and run somewhere, and we use AWS (Amazon Web Services) for that. AWS facilitates easy infrastructure setup and management by providing all kinds of services. For example, AWS provides a “Simple Storage Service” (S3) that lets us store all types of data, and a “Simple Queue Service” (SQS) that takes care of reliable message delivery between different services. We use “IoT Core” to connect devices to our cloud infrastructure and “Timestream” to store historical data. So as you can see, in our infrastructure, there’s no shortage of services.
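As a small, purely illustrative sketch (the queue URL, names and region below are made up, not taken from our actual setup), this is roughly how one service could hand an event to another via SQS, using the AWS SDK for JavaScript in TypeScript:

```typescript
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

// Hypothetical queue URL and region; the real values live in configuration.
const QUEUE_URL =
  "https://sqs.eu-west-1.amazonaws.com/123456789012/device-events";

const sqs = new SQSClient({ region: "eu-west-1" });

// Forward a device event to the next service in the chain via SQS.
export async function publishDeviceEvent(
  deviceId: string,
  payload: unknown
): Promise<void> {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: QUEUE_URL,
      MessageBody: JSON.stringify({ deviceId, payload, receivedAt: Date.now() }),
    })
  );
}
```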
Disaster strikes
Imagine that, on a random Tuesday, that cloud infrastructure is gone. Maybe we’re dealing with a ransomware attack, maybe the servers are fried and there are no backups. Whatever the reason may be, it doesn’t matter. What matters is this: what would it take to rebuild it all? Can we even do it? That’s what we are trying to figure out. And we want to figure that out before it actually happens.
Thus, we set up an exercise with 3 clear objectives:
- Rebuild our AWS infrastructure in 5 days (our “Recovery Time Objective” or RTO)
- Refine the “Disaster Recovery Plan” or DRP (more about that later)
- Share the learnings
Since this is an exercise, the disaster isn’t “real”, but the recovery certainly is.

What do you need?
We need people to execute the exercise. And not just any people: we need software engineers. A team of four brave engineers at In The Pocket took on the challenge, backed by a project manager and a security officer. From Monday morning until Friday evening, their sole focus was disaster recovery.
Luckily, the DevOps team had prepared a so-called “Disaster Recovery Plan”. The plan consists of the steps to be taken to rebuild the environment on a new AWS account; it’s meant to steer the team in the right direction. A kind of guide through the woods. Keep in mind that, up until this point, the plan had never actually been used. It’s a theoretical plan that was bound to have its faults.
But first…
To better understand the journey, it’s important to understand the following two concepts.
There are two ways to configure the cloud environment. Either you do it manually, browsing through the AWS console and configuring settings by hand (also known as “clickops”), or you use “Infrastructure as Code” (IaC), where the codebase itself contains the code that manages the infrastructure. It’s obvious that the latter is the better way: it’s more transparent, it improves collaboration, and it happens to be crucial for this exercise. Luckily, we use IaC throughout the project (or do we?).
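To give a feel for what IaC looks like in practice, here’s a minimal sketch using the AWS CDK in TypeScript (the stack and resource names are made up, and this isn’t necessarily the tool or structure we use): an S3 bucket and an SQS queue declared in code, versioned in Git and deployed automatically instead of clicked together in the console.

```typescript
import { App, Stack, StackProps, Duration } from "aws-cdk-lib";
import * as s3 from "aws-cdk-lib/aws-s3";
import * as sqs from "aws-cdk-lib/aws-sqs";
import { Construct } from "constructs";

// A hypothetical stack: the bucket and queue are defined in code, versioned in
// Git and deployed through the pipeline -- no clicking around in the console.
class MessagingStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    new s3.Bucket(this, "DeviceDataBucket", {
      versioned: true,
      encryption: s3.BucketEncryption.S3_MANAGED,
    });

    new sqs.Queue(this, "DeviceEventQueue", {
      visibilityTimeout: Duration.seconds(60),
    });
  }
}

const app = new App();
new MessagingStack(app, "MessagingStack");
```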
We also use pipelines for continuous integration and continuous deployment (CI/CD). Our development platform, GitLab, is set up to automatically deploy our code (including the IaC) to the cloud environment. The pipeline is quite big, as it deploys service after service, sequentially.

Here goes nothing
We’re going to go over the 5 days, with every day highlighting something we learned.
Monday
After the exercise briefing, we started the recovery highly motivated. That motivation would soon take a serious hit.
Remember that “Disaster Recovery Plan” that would lead us, step by step, to glorious bliss? Turns out it was missing a lot of steps, which frequently stalled our progress. Many times we wondered: what do we do next?
It also made a lot of what we like to call “knowledge assumptions”. “Just do this.” Ok, but how? While the plan was made with the best intentions, it had never actually been used or tested. And as with everything in software engineering, testing is indispensable.
Tuesday
Recovering from the disaster mainly comes down to getting through our CI/CD pipeline. But we soon discovered that the order in which services are deployed in the pipeline conflicts with their dependency structure: services that are deployed first depend on services that are deployed further down the pipeline. This is never an issue during iterative deployments, because all the services are already there. Doing it “fresh” revealed the dependency hell we had created for ourselves.
But this was actually a good thing: not only did it make us aware of the problem, it also pushed us to fix many of the dependency issues.
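To make the problem concrete, here’s a small, hypothetical sketch (again with the AWS CDK and made-up stack names, not our actual pipeline): on a fresh account, the deploy order has to follow the dependency graph, something incremental deployments quietly hide.

```typescript
import { App, Stack } from "aws-cdk-lib";

const app = new App();

// Hypothetical stacks: IngestStack consumes resources created by StorageStack.
const storage = new Stack(app, "StorageStack");
const ingest = new Stack(app, "IngestStack");

// On an existing account both stacks are already there, so any deploy order works.
// On a fresh account, IngestStack can only deploy once StorageStack exists, so the
// dependency has to be declared explicitly -- and a circular dependency between
// the two would block a from-scratch deployment entirely.
ingest.addDependency(storage);
```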
Wednesday
For convenience reasons, we decided to work from home that day. A decision that left a lasting mark.
I would like to emphasise the mental toll of doing work like this. It is 90% listening, reading and discussing. It’s like repeatedly hitting your head against a brick wall. Combine that with remote work, and it puts intense strain on people. After an entire day in a remote meeting, the recovery team was absolutely exhausted. We learned that for this kind of work, being in the same (war)room is crucial.
Thursday
Remember when I said we use IaC for everything? Well, it turns out that’s not entirely true. Take “secrets”, for example. We use secrets as a form of authentication for many things, like connecting to external services such as a MongoDB database or a monitoring tool like New Relic. We store these secrets in yet another AWS service: “Secrets Manager”. Over the years, they were added to AWS manually. And since the exercise dictated that the environment was fully compromised and we had no access whatsoever, we couldn’t access those secrets.
Luckily, we have them backed up in LastPass, but we quickly discovered that many were missing. Had this been a real disaster, we would have been unable to restore those secrets. Realising this, we were granted an exception and allowed to copy-paste them from the compromised account.
The lesson here: only when you actually have to rely on your backups do you realise what isn’t backed up.
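One way to soften this in the future, sketched below with the AWS CDK and made-up names (a suggestion, not a description of our actual setup), is to at least declare the secret resources in IaC. A fresh account then gets the right placeholders automatically, and only the values need to be restored from a tested backup.

```typescript
import { App, Stack, StackProps } from "aws-cdk-lib";
import * as secretsmanager from "aws-cdk-lib/aws-secretsmanager";
import { Construct } from "constructs";

class SecretsStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // The secret *resource* lives in IaC, so it is recreated on a fresh account.
    // The value (e.g. a MongoDB connection string) can't be generated by AWS,
    // so it still has to be set afterwards from a backup you know is complete.
    new secretsmanager.Secret(this, "MongoDbConnectionSecret", {
      secretName: "prod/mongodb/connection-string",
      description: "Connection string for the external MongoDB cluster",
    });
  }
}

const app = new App();
new SecretsStack(app, "SecretsStack");
```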
Friday
The last day was finally here. Everyone was exhausted, but we were slowly but steadily deploying services to the new environment, rebuilding our infrastructure. Alas, by the end of the day, we had not been able to deploy everything. Far from it, actually. So was this a failure?
No. While we did not complete the objective of rebuilding the application, we learned a ton about what kept us from doing so. We also fixed a lot of circular dependencies and refined the disaster recovery plan.

What should you remember?
First of all, actually do disaster recovery exercises, and preferably do them regularly. They will help you prepare for when disaster actually strikes. You’ll discover that things aren’t always what you think they are, and that you’ve made faulty assumptions.
Secondly, automate everything. Don’t make any manual changes! Even if you document those manual changes, user interfaces change all the time. Approach everything as if you’re starting with a clean slate.
Quality assurance is one of the cornerstones at In The Pocket. This also means ensuring that we can recover from severe incidents by having a robust, tested recovery plan. So we are taking our learnings from this exercise and applying them.
And finally, that brings us to the bigger picture. This exercise isn’t just a technical necessity; it’s a business imperative. The consequences of not being prepared when disaster strikes go beyond the technical challenges: think of financial losses and reputational damage that lead to lost customer trust. For companies that rely on cloud-based services, the inability to recover quickly could disrupt everything.