Scalable logging for distributed systems
Microservices can be a blessing or a curse when creating a large-scale cloud system. They enable us to leverage domain-driven design and keep our architecture clean. But microservices also introduce more complexity in deployment and testing, especially when we have data flows spanning multiple microservices.
Debugging and monitoring tools should be available to inspect possible issues and gain insights into functional flows. That is why the architecture for logs needs some tender loving care itself. How do you manage such a system? Here are a few tips & tricks we wanted to share about how to tackle logging in distributed systems.
Use different log levels
When you have multiple environments (e.g. development, staging, production), the required level of detail will vary: verbosity should decrease as an environment hardens towards production. When dealing with large amounts of logs, the running cost can rapidly spiral out of control, so when creating a running-cost estimate of a system, always include the logging solution as well.
Each log statement can have a different log level, depending on the situation. Defining these log levels can help you create order from chaos.
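As a minimal sketch of this idea, the snippet below maps environments to verbosity levels using Python's standard logging module; the environment names and the APP_ENV variable are illustrative assumptions, not part of any particular framework.

```python
import logging
import os

# Map each environment to a verbosity level; the environment names are
# illustrative and would follow your own deployment setup.
LEVELS = {
    "development": logging.DEBUG,
    "staging": logging.INFO,
    "production": logging.WARNING,
}

def configure_logging(environment: str) -> None:
    """Lower the root logger's verbosity as the environment hardens."""
    logging.basicConfig()
    logging.getLogger().setLevel(LEVELS.get(environment, logging.INFO))

configure_logging(os.environ.get("APP_ENV", "development"))
logging.debug("verbose detail, visible only in development")
logging.warning("always visible, even in production")
```

The point is that the same log statements stay in the code everywhere; only the configured threshold changes per environment.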
Include a logging context
Some log statements will have a strong correlation to each other. Including a logging context is highly useful for grouping log statements. This logging context clearly indicates where exactly in your application a log message originates. The context will also depend on the specific circumstances of the microservice itself. Bear in mind that your list of log contexts should always be concise and closely governed.
Log context examples could be: IO, CONFIG, API, DATABASE...
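One way to keep the list of contexts concise and governed is to model it as an enum rather than free-form strings. The helper below is a sketch under that assumption, using the example contexts above; the prefix format is arbitrary.

```python
import logging
from enum import Enum

# A small, governed list of contexts; free-form strings are deliberately
# not accepted, so new contexts require a code change and review.
class LogContext(Enum):
    IO = "IO"
    CONFIG = "CONFIG"
    API = "API"
    DATABASE = "DATABASE"

def log_with_context(context: LogContext, message: str,
                     level: int = logging.INFO) -> str:
    """Prefix the message with its context so it can be filtered on later."""
    line = f"[{context.value}] {message}"
    logging.log(level, line)
    return line
```

For example, `log_with_context(LogContext.DATABASE, "connection pool exhausted")` produces a line that a tool like Kibana can group on via the `[DATABASE]` prefix.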
Use correlation IDs
To further group log statements, a correlation ID can be added to the log statement. Such a correlation ID could be anything that makes sense in your software project (e.g. registrationId, userId, articleId, …). This ID will be extremely useful for debugging data flows spanning multiple microservices.
Together with the context, these IDs allow you to create filters in your logging solution, such as Kibana. In production environments, you will want to restrict the log level to reduce costs, but sometimes you are only interested in certain correlation IDs. By adding whitelisting logic, logs for those IDs can still be emitted regardless of the log level. Certainly in an ETL system, you are going to want to whitelist some IDs.
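The whitelisting idea can be sketched with a standard-library `logging.Filter`. This is an assumption-laden example: the `correlation_id` record attribute (passed via `extra=`) and the threshold handling are choices of this sketch, not a standard convention.

```python
import logging

class CorrelationWhitelistFilter(logging.Filter):
    """Pass records at or above a threshold, but also pass lower-level
    records whose correlation ID is whitelisted. Assumes callers attach
    the ID via extra={"correlation_id": ...}."""

    def __init__(self, whitelist: set, threshold: int) -> None:
        super().__init__()
        self.whitelist = whitelist
        self.threshold = threshold

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= self.threshold:
            return True
        # Below the threshold: only let whitelisted correlation IDs through.
        return getattr(record, "correlation_id", None) in self.whitelist
```

In use, the logger itself stays at DEBUG so all records reach the filter, e.g. `logger.debug("step done", extra={"correlation_id": "user-42"})`, and the filter enforces the effective production threshold for everything else.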
Use a shared logging library where possible
By implementing a shared logging library for all your microservices, you can:
- Enforce arguments such as correlation IDs and context via strong typing
- Implement whitelisting logic for correlation IDs
- Use centralised logic for error reporting (e.g. Sentry)
When your microservices use multiple programming languages, a logging interface should be defined upfront and implemented by each logging library. This enforces consistency while allowing flexibility in the choice of programming languages.
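Such an interface could look like the sketch below, shown here in Python; every name in it (the `Logger` contract, the argument order, the output format) is a hypothetical choice, and each language in your stack would implement the same contract.

```python
from abc import ABC, abstractmethod

class Logger(ABC):
    """A shared contract: every log call must carry a context and a
    correlation ID, enforced here via the method signature."""

    @abstractmethod
    def log(self, level: str, context: str,
            correlation_id: str, message: str) -> str:
        ...

class ConsoleLogger(Logger):
    """A trivial implementation that writes to stdout."""

    def log(self, level: str, context: str,
            correlation_id: str, message: str) -> str:
        line = f"{level} [{context}] ({correlation_id}) {message}"
        print(line)
        return line
```

Because the context and correlation ID are required parameters rather than optional metadata, a microservice cannot quietly omit them; centralised concerns such as Sentry reporting can then hook into the base class in one place.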
Choose a sensible log retention period
An application should never concern itself with the storage of its output log stream; this is the responsibility of the underlying system that handles the logs (e.g. CloudWatch). When using a system such as CloudWatch, logs should be rotated over a given time window.
It makes sense to make this configurable via your Infrastructure as Code (e.g. Terraform) per environment. The retention period should be based on the use cases in your system: too short means losing debugging capability, and too long results in an increased running cost. The logs will be a useful tool for debugging issues that are reported in your issue tracker (e.g. Sentry).
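As a Terraform sketch of this idea, the fragment below makes the CloudWatch retention period a per-environment variable; the resource name, log group name, and default value are illustrative assumptions.

```hcl
# Per-environment retention, overridden in each environment's tfvars;
# the 30-day default here is illustrative.
variable "log_retention_days" {
  type    = number
  default = 30
}

resource "aws_cloudwatch_log_group" "service_logs" {
  name              = "/my-service/app"
  retention_in_days = var.log_retention_days
}
```

A production workspace might then set a longer retention than a short-lived development environment, keeping the cost trade-off explicit in code review.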
Create a crystal clear strategy
Logs should be treated as an important aspect of any software project. They need their own crystal clear strategy, which trickles down to every layer of the system. Together with monitoring and error-reporting, it speeds up the development and debugging processes.
Our advice? Always keep a close eye on the running cost, and optimise where possible by modifying the verbosity and retention of your logs.