Enhancing incident response with custom CloudWatch metrics
At In The Pocket (ITP), we pride ourselves not only on our ability to build exceptional digital products but also on our commitment to maintaining and optimising them post-launch. Our dedication goes beyond just delivering a finished product; we ensure that it continues to perform seamlessly, providing value to our clients and their users. A crucial aspect of our service is our proactive approach to incident response and system reliability. We invest significant effort into monitoring and maintaining the health of our digital solutions with a team ready to act swiftly and effectively when issues arise. This commitment to "keeping the lights on" sets ITP apart, providing peace of mind and continuous support to our clients.
Background: Big platform, Big data
Over the last five years, we’ve been building and maintaining an IoT platform that connects over 1.5 million devices. As you can imagine, a platform of this size generates a lot of data: about 3 billion inbound IoT messages and 30 billion AWS CloudWatch log events every single month.
To enable seamless remote monitoring of the devices connected to our platform, we built a dedicated data ingestion service that manages and processes device data, providing real-time insights and operational control.
How It Works:
- Data Transmission: Various services send Zstandard (zstd) compressed device states to an Amazon Kinesis stream (see the producer sketch after this list).
- Data Decompression: Amazon Kinesis Data Firehose uses a transformation Lambda function to decompress these Kinesis records in batches.
- Data Storage: The decompressed device states are stored in a time-series database by another service.
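To make the first step concrete, here is a minimal, purely illustrative sketch of what a producer could look like. It assumes Go with the AWS SDK for Go v2 and the klauspost zstd encoder; the stream name and payload shape are hypothetical.

```go
// Minimal producer sketch (illustrative): serialise a device state, compress it
// with Zstandard and put it on the Kinesis stream.
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/kinesis"
	"github.com/klauspost/compress/zstd"
)

// DeviceState is a hypothetical shape; the real payload is richer.
type DeviceState struct {
	DeviceID  string             `json:"deviceId"`
	Timestamp int64              `json:"timestamp"`
	Readings  map[string]float64 `json:"readings"`
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := kinesis.NewFromConfig(cfg)

	state := DeviceState{
		DeviceID:  "device-123",
		Timestamp: 1700000000,
		Readings:  map[string]float64{"temperature": 21.5},
	}
	raw, _ := json.Marshal(state)

	// Compress before sending to keep the Kinesis records small.
	encoder, _ := zstd.NewWriter(nil)
	compressed := encoder.EncodeAll(raw, nil)

	if _, err := client.PutRecord(ctx, &kinesis.PutRecordInput{
		StreamName:   aws.String("device-states"), // hypothetical stream name
		PartitionKey: aws.String(state.DeviceID),
		Data:         compressed,
	}); err != nil {
		log.Fatal(err)
	}
}
```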
This system enables real-time monitoring with quick issue detection and resolution. Furthermore, it allows us to view historical data, spot trends, generate analytics and even feed machine learning models. It scales seamlessly to handle growing data volumes, and the time-series database optimises how we store and analyse the information.
Incident resolution
One evening, our monitoring systems triggered a critical P1 incident alert. This was particularly severe as it affected our core data ingestion pipeline - a crucial component that processes real-time device data. The Lambda function responsible for decompressing device states was experiencing a 100% error rate, completely halting our data flow.
The CloudWatch metrics showed that the Lambda function was failing consistently for every single invocation. This meant that no device data was being processed and stored in our time-series database, creating a critical blind spot for clients who rely on this data to monitor their devices' performance and health. Although the unprocessed compressed data was safely retained in an S3 bucket, a gap was growing in our time-series database.
In the Lambda function's logs, we could see the following error appearing:
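The exact log line is not reproduced here, but the failure was Lambda's response payload quota error, which looks roughly like this:

```
Function.ResponseSizeTooLarge: Response payload size exceeded maximum allowed payload size (6291456 bytes).
```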
Custom metrics: A small effort that provides great insights
In this situation, we were essentially flying blind: we knew the decompressed device states were too large and the Lambda function was hitting the response payload size limit, but we lacked any deeper insight into the data decompression process.
In the middle of this incident, we implemented simple but crucial metrics in the Lambda function to track three key data points: the size of the compressed data, the size of the decompressed data, and the size of the response payload.
Consider the following Lambda code:
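The snippet below is a simplified, illustrative sketch rather than the actual production code; it assumes Go with the aws-lambda-go events package and the klauspost zstd decoder.

```go
package main

import (
	"context"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/klauspost/compress/zstd"
)

// A shared zstd decoder, reused across invocations.
var decoder, _ = zstd.NewReader(nil)

// transformRecord decompresses a single zstd-compressed Firehose record.
func transformRecord(record events.KinesisFirehoseEventRecord) events.KinesisFirehoseResponseRecord {
	decompressed, err := decoder.DecodeAll(record.Data, nil)
	if err != nil {
		return events.KinesisFirehoseResponseRecord{
			RecordID: record.RecordID,
			Result:   events.KinesisFirehoseTransformedStateProcessingFailed,
		}
	}
	return events.KinesisFirehoseResponseRecord{
		RecordID: record.RecordID,
		Result:   events.KinesisFirehoseTransformedStateOk,
		Data:     decompressed,
	}
}

// handler transforms a batch of Firehose records and returns the decompressed states.
func handler(ctx context.Context, event events.KinesisFirehoseEvent) (events.KinesisFirehoseResponse, error) {
	response := events.KinesisFirehoseResponse{}
	for _, record := range event.Records {
		response.Records = append(response.Records, transformRecord(record))
	}
	return response, nil
}

func main() {
	lambda.Start(handler)
}
```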
With the transformRecord function taking care of the decompression, we can enhance this Lambda with custom metrics that give us more insight.
Let’s add metrics for the compressed and decompressed size of the data:
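Building on the sketch above, the handler can total the compressed and decompressed byte counts per batch (the variable names are illustrative):

```go
// Revised handler: track the total compressed and decompressed size of each batch.
func handler(ctx context.Context, event events.KinesisFirehoseEvent) (events.KinesisFirehoseResponse, error) {
	response := events.KinesisFirehoseResponse{}
	var compressedBytes, decompressedBytes float64

	for _, record := range event.Records {
		compressedBytes += float64(len(record.Data))

		transformed := transformRecord(record)
		decompressedBytes += float64(len(transformed.Data))

		response.Records = append(response.Records, transformed)
	}

	// Compression ratio: how much the data expands when decompressed.
	compressionRatio := 0.0
	if compressedBytes > 0 {
		compressionRatio = decompressedBytes / compressedBytes
	}

	// Push the values to CloudWatch (shown in the next snippet).
	publishMetrics(ctx, compressedBytes, decompressedBytes, compressionRatio)

	return response, nil
}
```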
Next, we use the AWS SDK to push these metrics to CloudWatch:
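A possible implementation of that helper, using PutMetricData from the AWS SDK for Go v2; the namespace and metric names are illustrative, and the extra imports are shown for completeness.

```go
import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// A shared CloudWatch client, initialised once per Lambda container.
var cwClient *cloudwatch.Client

func init() {
	cfg, err := config.LoadDefaultConfig(context.Background())
	if err != nil {
		panic(err)
	}
	cwClient = cloudwatch.NewFromConfig(cfg)
}

// publishMetrics pushes the custom metrics to CloudWatch under an illustrative namespace.
func publishMetrics(ctx context.Context, compressed, decompressed, ratio float64) {
	_, err := cwClient.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
		Namespace: aws.String("IngestionPipeline"), // hypothetical namespace
		MetricData: []types.MetricDatum{
			{MetricName: aws.String("CompressedSize"), Value: aws.Float64(compressed), Unit: types.StandardUnitBytes},
			{MetricName: aws.String("DecompressedSize"), Value: aws.Float64(decompressed), Unit: types.StandardUnitBytes},
			{MetricName: aws.String("CompressionRatio"), Value: aws.Float64(ratio), Unit: types.StandardUnitNone},
		},
	})
	if err != nil {
		// Metrics are best effort: never fail the transformation because of them.
		log.Printf("failed to publish metrics: %v", err)
	}
}
```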
We now gain valuable insight into the size of the compressed data and of the decompressed data, and can even calculate the compression ratio.
While this information is valuable, it still doesn't give us a precise picture of the Lambda function's response size: in addition to the decompressed data, the response records also include some metadata, which we must account for as well. To measure it precisely, we can marshal the full JSON response and record its size:
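Concretely, at the end of the handler (after the record loop) we can serialise the response, metadata included, and publish its size as a fourth metric. This fragment assumes an encoding/json import and the same illustrative namespace as above.

```go
	// Marshalling the full response, metadata included, gives the exact
	// payload that will be returned to Firehose.
	payload, err := json.Marshal(response)
	if err != nil {
		return response, err
	}

	_, err = cwClient.PutMetricData(ctx, &cloudwatch.PutMetricDataInput{
		Namespace: aws.String("IngestionPipeline"),
		MetricData: []types.MetricDatum{
			{
				MetricName: aws.String("ResponsePayloadSize"),
				Value:      aws.Float64(float64(len(payload))),
				Unit:       types.StandardUnitBytes,
			},
		},
	})
	if err != nil {
		log.Printf("failed to publish response size metric: %v", err)
	}

	return response, nil
```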
Let's visualise this metric with a graph and include an annotation for the Lambda synchronous invocation response payload quota of 6 MB (6,291,456 bytes):
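One way to build such a graph is a CloudWatch dashboard widget with a horizontal annotation at the quota. The snippet below is illustrative; the region, namespace and metric name simply match the examples above.

```json
{
  "type": "metric",
  "properties": {
    "title": "Lambda response payload size",
    "region": "eu-west-1",
    "metrics": [
      [ "IngestionPipeline", "ResponsePayloadSize" ]
    ],
    "stat": "Maximum",
    "period": 60,
    "annotations": {
      "horizontal": [
        { "label": "Lambda response payload quota (6 MB)", "value": 6291456 }
      ]
    }
  }
}
```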
The graph now clearly shows when we cross the quota, and it also lets us monitor the impact of adjusting the compression ratio in the upstream services.
Swift resolution through data-driven investigation
With these new metrics in place, we could quickly identify that the root cause was the buffer size configuration in Kinesis Firehose. The service was accumulating too many records before triggering the Lambda function, resulting in payloads that exceeded the 6MB limit when decompressed.
The solution was straightforward: we adjusted the Kinesis Firehose buffer settings, reducing both the buffer size and the buffer interval (see the configuration fragment after this list). This meant:
- The Lambda function would be triggered more frequently
- Each invocation would process fewer records
- The decompressed payload would stay well below the 6MB limit
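If the delivery stream is managed as infrastructure-as-code, the relevant knobs are the Lambda processor's buffering hints on the Firehose delivery stream. The CloudFormation fragment below is illustrative; the resource name and values are examples, not our actual settings.

```yaml
# Illustrative fragment of an AWS::KinesisFirehose::DeliveryStream destination configuration.
ProcessingConfiguration:
  Enabled: true
  Processors:
    - Type: Lambda
      Parameters:
        - ParameterName: LambdaArn
          ParameterValue: !GetAtt DecompressFunction.Arn   # hypothetical resource name
        - ParameterName: BufferSizeInMBs
          ParameterValue: "1"    # fewer records per Lambda invocation
        - ParameterName: BufferIntervalInSeconds
          ParameterValue: "60"   # batches are flushed more frequently
```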
After implementing these changes, our custom metrics showed that the response payload size consistently remained under the quota, and the data ingestion pipeline returned to normal operation.
Our takeaways
While implementing custom metrics might seem like an obvious step during development, it's often overlooked in the rush to deliver features or meet deadlines. Developers typically focus on core functionality, logging, and error handling, sometimes forgetting that metrics are also crucial for maintaining and troubleshooting a system in production.
These metrics are simple to implement and can provide invaluable insights during incident resolution:
- They help identify patterns and trends that might not be immediately visible in logs
- They allow for quick validation of whether implemented fixes are having the desired effect
- They enable proactive monitoring and early warning systems for potential issues
- They provide concrete data for post-incident analysis and system optimisation
In our case, these metrics transformed a blind troubleshooting session into a data-driven investigation, allowing us to quickly identify the root cause and verify our solution's effectiveness. This experience reinforces the importance of implementing meaningful metrics from the start, rather than scrambling to add them during an incident.