Building a beefy backend: our lessons from connecting 500k IoT devices
This article covers what we've learned while building a large-scale cloud platform that connects roughly 500,000 devices, from automation to deployments and performance checks. A semi-technical read for IoT and cloud architecture fans!
Over the past few years, we've built a unified platform that can connect over 500k devices. What was one of the biggest challenges in our history has turned into one of our proudest achievements. But what we value most are the things we learned along the way.
One cloud platform, a lot of challenges
Back in 2019, we took on one of our biggest challenges ever: building a large-scale cloud platform that manages half a million (!) connected IoT devices at the same time. But that was only one part of our assignment. The main goal of the cloud platform was to create a unified interface for all the different types of devices, each with its own data model and interfaces. A pretty hard nut to crack.
Today, this cloud platform is a reality. It has been running for over a year with more than 250k devices on board, and it's ready to embrace another 250k.
At In The Pocket, we aim to release as often as possible. It’s the only way to create a tight feedback loop between developers, quality engineers, product & design, and, most importantly, our end-users. Naturally, running an IoT platform this size comes with more than a few challenges:
- Physical IoT devices never sleep; they're constantly interacting with our platform APIs
- A loosely-coupled microservices architecture has a lot of advantages but makes releasing more complex
- High loads on such a cloud platform require thoughtful development and architectural decisions when releasing features to production
- Physical IoT devices of multiple generations and versions are complex to simulate for automated testing
With big challenges come great insights, and we don't want to keep them to ourselves. Let us take you through our main learnings on bringing a large-scale IoT platform into production.
Automate all the things!
Deploying a large number of microservices to a production environment in a consistent manner is hard. Fortunately, existing tools can give us a hand. Terraform, for example, allows us to define every setting or resource that can be configured in AWS, Google Cloud or Azure as code.
This approach is known as Infrastructure-as-Code (IaC). Instead of manually deploying a lambda function or a container, we write definition blocks in Terraform and push them to the version control system. With IaC, creating new cloud infrastructure only requires effort during development. Setting up a production environment becomes as simple as pushing a button, provided your infrastructure is well-tested for the expected load.
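To give you a flavour of what that looks like, here is a minimal, hypothetical Terraform definition for a Lambda function. The resource names and values are purely illustrative, not our actual configuration:

```hcl
# Hypothetical example: a Lambda function defined as code instead of
# being clicked together in the AWS console. All names and values are
# illustrative.
data "aws_iam_policy_document" "lambda_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["lambda.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "device_api" {
  name               = "device-api-role"
  assume_role_policy = data.aws_iam_policy_document.lambda_assume.json
}

resource "aws_lambda_function" "device_api" {
  function_name = "device-api"
  role          = aws_iam_role.device_api.arn
  handler       = "main.handler"
  runtime       = "python3.12"
  filename      = "build/device-api.zip"
}
```

Because these blocks live in version control, every environment is created from the exact same definitions, and reviewing an infrastructure change is just reviewing a diff.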
Frequently deploying a large cloud platform with a lot of microservices is tedious enough; configuring the numerous continuous integration (CI) pipelines doesn't make it easier. Pipeline scripts are defined over and over again. If you want to change the versioning of Docker images or add a linting job, for example, you need a code change in every repository.
Our team managed to limit this to a minimum by sharing the pipeline scripts for every use case or programming language. These templates are defined in a separate repository and are included where needed. Changes to the shared templates impact all repositories and are tested first in a separate environment.
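The exact syntax depends on your CI system. Assuming GitLab CI for the sake of illustration (other CI systems have a similar template mechanism), pulling a shared template in from a central repository looks roughly like this; the repository path and file names are made up:

```yaml
# Hypothetical example, assuming GitLab CI: reuse shared job definitions
# from a central templates repository instead of copy-pasting them.
include:
  - project: "platform/ci-templates"          # illustrative repository path
    ref: main
    file: "/templates/python-service.yml"

# Only repository-specific settings live here; everything else
# (linting, Docker image versioning, deploy jobs) comes from the template.
variables:
  SERVICE_NAME: "device-api"
```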
Having a lot of microservices makes manual actions painful, especially when you have to perform them over and over again. Automation takes most of that repetitive work off your plate.
Zero-downtime deployments
Everyone wants to ship their product to production as fast as possible. This all starts with a well-integrated continuous integration and continuous deployment (CI/CD) flow. Once you've got that covered, the next question pops up: how do you release new features without impacting your end-users or your connected IoT devices?
You can choose different techniques to achieve this. We like to use a combination of rolling updates, blue-green deployments, and feature flags. Just make sure to keep an eye on data flow and datasets when introducing these deployment practices. Migrating data between deployments can become complex when running multiple versions next to each other.
Rolling updates
With rolling updates, a new version of a microservice is deployed to a subset of the instances in an environment; once that subset is up and healthy, the rollout moves on to the next subset until all instances run the new version.
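In practice a deployment platform such as Kubernetes or ECS orchestrates this for you, but the underlying logic boils down to something like this simplified sketch (the deploy and health-check functions are hypothetical placeholders):

```python
"""Simplified sketch of a rolling update: deploy to one batch of
instances at a time and only continue once the batch is healthy."""

import time


def deploy_to_instance(instance: str, version: str) -> None:
    print(f"deploying {version} to {instance}")  # placeholder for a real deploy


def is_healthy(instance: str) -> bool:
    return True  # placeholder for a real health check


def rolling_update(instances: list[str], version: str, batch_size: int = 2) -> None:
    for i in range(0, len(instances), batch_size):
        batch = instances[i : i + batch_size]
        for instance in batch:
            deploy_to_instance(instance, version)
        # Wait until the whole batch reports healthy before moving on, so
        # the remaining instances keep serving traffic on the old version.
        while not all(is_healthy(instance) for instance in batch):
            time.sleep(5)


rolling_update(["api-1", "api-2", "api-3", "api-4"], "v2.3.0")
```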
Blue-green deployment
Blue-green deployments are used for releases that have a high impact on infrastructure. This technique involves setting up two nearly identical environments: the old version is called the blue environment, while the new version is known as the green environment. Once the green environment is up, we gradually shift traffic from the blue (old) to the green (new) version. Blue-green deployments give you a fast and safe way to roll back if anything goes wrong. However, the technique requires some extra preparation and planning.
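One possible way to do the gradual traffic shift on AWS is with weighted target groups behind an Application Load Balancer; the sketch below is illustrative (the ARNs are placeholders, and metric checks and error handling are omitted), not our exact setup:

```python
"""Sketch of a gradual blue-green traffic shift, assuming an AWS
Application Load Balancer with weighted target groups."""

import time

import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/..."       # placeholder
BLUE_TG = "arn:aws:elasticloadbalancing:...:targetgroup/blue/..."    # old version
GREEN_TG = "arn:aws:elasticloadbalancing:...:targetgroup/green/..."  # new version

# Shift traffic in steps; rolling back is simply reversing the weights.
for green_weight in (10, 25, 50, 100):
    elbv2.modify_listener(
        ListenerArn=LISTENER_ARN,
        DefaultActions=[{
            "Type": "forward",
            "ForwardConfig": {
                "TargetGroups": [
                    {"TargetGroupArn": BLUE_TG, "Weight": 100 - green_weight},
                    {"TargetGroupArn": GREEN_TG, "Weight": green_weight},
                ]
            },
        }],
    )
    time.sleep(300)  # observe the metrics before taking the next step
```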
Feature flags
Another technique is using feature flags. These have a tighter coupling with the source code of our services. Normally, we implement a feature flag for every major feature, allowing us to remain in complete control when releasing new features into our IoT platform.
Feature flags let us manage functionality remotely without deploying code, so we can continuously deploy our code even when large features aren't ready yet. They enable continuous delivery, increase the rate of production releases and, above all, bring peace of mind to both developers and the client.
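At its core, a feature flag is nothing more than a conditional around a code path. A minimal sketch, with the flags hard-coded here for simplicity (in a real setup they would come from a remote flag service or config store):

```python
"""Minimal feature flag sketch. Flag names and values are illustrative."""

FLAGS = {
    "new-firmware-update-flow": False,  # large feature: deployed, but dark
    "device-health-dashboard": True,
}


def is_enabled(flag: str, default: bool = False) -> bool:
    """Unknown flags fall back to 'off', so unreleased code stays dark."""
    return FLAGS.get(flag, default)


def handle_firmware_update(device_id: str) -> None:
    if is_enabled("new-firmware-update-flow"):
        print(f"{device_id}: using the new update flow")
    else:
        print(f"{device_id}: using the existing update flow")


handle_firmware_update("device-123")
```

Defaulting to "off" is a deliberate choice: if the flag service is unreachable or a flag is missing, unfinished features stay hidden instead of leaking into production.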
Of course, as we continue to build out the IoT platform, there are a lot more techniques to consider, and we keep innovating to find elegant ways to deploy to production. None of these techniques is perfect; it's important to combine several of them depending on the use case and the requirements of the team, the client and other stakeholders.
Uptime & performance
Deploying to production without monitoring the environment afterwards is like ordering a pizza and letting it get cold: all the effort of picking ingredients and baking it goes to waste. The same goes for development effort. When you open the box, the pizza should smell and taste the same as the last time you ordered; users expect the same or an even better experience of your application after every deployment.
System tests - also called end-to-end tests - run automatically to verify that core functionality still works. For our platform, we run them on an hourly schedule using AWS CodeBuild, and the generated reports show us whether everything still behaves as expected: turning a device on, for instance, or successfully installing a firmware update. Each test simulates an IoT device that interacts with our platform.
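Conceptually, such a test looks like the sketch below. The endpoints, payloads and the SimulatedDevice helper are hypothetical stand-ins for our real test harness:

```python
"""Sketch of a system test that simulates a device end to end."""

import requests

BASE_URL = "https://platform.example.com/api"  # placeholder


class SimulatedDevice:
    def __init__(self, device_id: str) -> None:
        self.device_id = device_id

    def send_state(self, state: dict) -> None:
        resp = requests.post(
            f"{BASE_URL}/devices/{self.device_id}/state", json=state, timeout=10
        )
        resp.raise_for_status()

    def fetch_state(self) -> dict:
        resp = requests.get(
            f"{BASE_URL}/devices/{self.device_id}/state", timeout=10
        )
        resp.raise_for_status()
        return resp.json()


def test_turn_device_on() -> None:
    device = SimulatedDevice("test-device-001")
    device.send_state({"power": "on"})
    # The platform should reflect the new state back through its API.
    assert device.fetch_state()["power"] == "on"
```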
When system tests fail, things are bad. However, we can detect problems early on by setting alerts on error rates or on the number of timeouts a microservice is allowed to have. We use New Relic for alerts and uptime checks; it has an extensive feature set, including custom dashboards, which we use for alerting and for visualising system test results.
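In practice New Relic evaluates these conditions against the ingested metrics for us, but the logic behind such an alert boils down to a simple threshold check; the numbers below are made up:

```python
"""Illustration of the logic an error-rate alert condition encodes."""


def error_rate(errors: int, total_requests: int) -> float:
    return 0.0 if total_requests == 0 else errors / total_requests


def should_alert(errors: int, total: int, timeouts: int,
                 max_error_rate: float = 0.02, max_timeouts: int = 50) -> bool:
    # Fire as soon as either threshold is breached within the time window.
    return error_rate(errors, total) > max_error_rate or timeouts > max_timeouts


# 120 errors out of 4,000 requests = 3% error rate, above the 2% threshold.
print(should_alert(errors=120, total=4000, timeouts=12))  # True
```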
All these checks give us more control over, and insight into, the current traffic and health of the production environment. The load on our cloud platform steadily increases over time as we connect new customers' devices and gradually migrate devices from legacy environments. To ensure the platform can handle these increasing loads, we simulate load scenarios in a separate environment.
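A minimal load-simulation sketch using asyncio and aiohttp is shown below: it fires a batch of concurrent simulated device requests at a test environment. The URL and payload are placeholders; a real load test would ramp up gradually and record latency percentiles:

```python
"""Minimal load-simulation sketch: concurrent simulated device requests."""

import asyncio

import aiohttp

TEST_ENV_URL = "https://load-test.example.com/api/devices"  # placeholder


async def simulate_device(session: aiohttp.ClientSession, device_id: int) -> int:
    async with session.post(
        f"{TEST_ENV_URL}/{device_id}/state", json={"power": "on"}
    ) as resp:
        return resp.status


async def main(concurrent_devices: int = 1000) -> None:
    async with aiohttp.ClientSession() as session:
        statuses = await asyncio.gather(
            *(simulate_device(session, i) for i in range(concurrent_devices))
        )
    errors = sum(1 for status in statuses if status >= 500)
    print(f"{errors}/{len(statuses)} requests failed")


asyncio.run(main())
```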
Closing thoughts & takeaways
Arriving at our current setup did not happen overnight. System tests were unstable at first, as they weren't part of feature tracks and were hard to debug, and the same went for our CI pipelines. Everything improved step by step, with full buy-in from the whole team and the customer.
Summarising our best practices for developing, deploying and maintaining microservices-based cloud platforms:
- Automation is key: anything from CI/CD to infrastructure and running tests
- Optimise how you deploy features, depending on the type of system
- Test early and often. Start with unit tests and give plenty of TLC to your end-to-end tests, so you have the confidence to release to production often
- Observability brings peace of mind: logging, exception tracking, uptime checks, performance metrics and alerting should be your best friends
One final piece of advice: start small and avoid premature optimisation. Not every project has the same requirements and challenges. Make sure to communicate a clear roadmap to all stakeholders, and give your team the time to focus on innovation. We are still in the early stages of this journey; cloud platforms are never finished.