What Is Observability?
According to Wikipedia: “Observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. In control theory, the observability and controllability of a linear system are mathematical duals.”
Put simply, it is how well a system describes its internal state through its external outputs.
There are 3 main pillars of observability:
Metrics: time-series sensor data that provides quick, low-latency feedback on system performance.
Traces: tracing data that helps to find where an error happened.
Logs: text data that describes in detail the events that happened at a low application level.
Everything starts with data gathering. No matter how complex or simple your system is, you will need to have this data as a basis for further analysis and actions.
How To Make a System Observable
In the world of distributed computing, clouds, and microservices, making a system observable might look like a very hard problem. It becomes much easier when you look at the system from the perspective of the users who interact with it. What the user needs to know in terms of observability is the system's operational state: is it good or bad, working or not, operating successfully or out of operation?
We have plenty of examples from the real world of how we do it day-to-day. For example:
- How are you feeling today? Can you go to work?
- How is your car? Is it ready for a drive?
We do not think about it consciously because we do it automatically, but to answer these questions we need metrics. To know whether we are OK, we need to measure temperature and blood pressure and look at blood test results. To say whether the car is ready to go, we need to check the control panel for error indicators.
If the system has many components, the overall state is the binary product (logical AND) of the states of its components: the system is healthy only if every component is healthy.
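The binary multiplication described above can be sketched as follows. The component names are hypothetical, and treating each component as a simple true/false flag is an assumption made for illustration.

```python
from typing import Dict, List

# Hypothetical component health flags: True means the component is operational.
component_ok = {
    "api-gateway": True,
    "orders-service": True,
    "database": False,
}


def system_ok(components: Dict[str, bool]) -> bool:
    """Overall state is the logical AND (binary product) of all component states."""
    return all(components.values())


def failing_components(components: Dict[str, bool]) -> List[str]:
    """Names of the components that pull the overall state down."""
    return sorted(name for name, ok in components.items() if not ok)
```

With these sample values, a single failing component is enough to make `system_ok(component_ok)` false, and `failing_components` points at the culprit.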
To know the overall system state, we need to collect metrics from each component. We also want to know what the state was at earlier points in time and when it changed. This means we need to collect this data from the components continuously.
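One possible way to keep both the previous states and the state change times is to store only the transitions. This is a minimal sketch: the `StateHistory` class is hypothetical, and it assumes samples arrive in time order.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StateHistory:
    """Keeps only the moments when a component's state changed."""

    changes: List[Tuple[float, bool]] = field(default_factory=list)

    def record(self, timestamp: float, ok: bool) -> None:
        # Store a sample only if it differs from the last known state.
        if not self.changes or self.changes[-1][1] != ok:
            self.changes.append((timestamp, ok))

    def state_at(self, timestamp: float) -> Optional[bool]:
        """Return the state in effect at `timestamp`, or None if unknown."""
        state = None
        for changed_at, ok in self.changes:
            if changed_at > timestamp:
                break
            state = ok
        return state


history = StateHistory()
for t, ok in [(0, True), (10, True), (20, False), (30, False), (40, True)]:
    history.record(t, ok)
```

Five samples collapse into three transitions, and `history.state_at(25)` answers the question of what the state was at a previous time.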
Once we have metrics data, we can build a dashboard with a nice view.
A nice dashboard shows your boss you are cool, but which of these indicators do you actually need to tell what state the system is in? From the huge number of possible metrics, KPIs, and measurements, the following three are the most important, as they directly affect user experience and business operations.
1. Error Rate
This is the main indicator that something is not going as expected. It usually rises when users do not get a successful response.
2. Response Time
This indicator shows how long the user waits for a response.
3. Resources Utilization
This indicator shows how much memory, CPU, or storage is allocated versus free, indicating how long the system can work autonomously without external help.
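As a rough illustration, the three indicators could be computed from raw samples like this. The function names are hypothetical, counting only HTTP 5xx responses as errors is an assumption, and the nearest-rank percentile is just one common way to summarize response time.

```python
import math
from typing import Sequence


def error_rate(status_codes: Sequence[int]) -> float:
    """Share of requests that failed (assumption: HTTP 5xx counts as a failure)."""
    if not status_codes:
        return 0.0
    failed = sum(1 for code in status_codes if code >= 500)
    return failed / len(status_codes)


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile; pct is expected to be in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def utilization(used: float, total: float) -> float:
    """Fraction of a resource (memory, CPU, storage) that is in use."""
    return used / total
```

For example, `percentile(response_times_ms, 95)` gives the response time that 95% of requests stayed within, which tends to reflect user experience better than an average.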
Pillar 4: Events
Dashboards are really good tools for monitoring, but do we need to watch them all the time? You can, but it is not effective. If you are as lazy as me, you can make the metrics tool notify you when the application is not in good condition.

But first, you need to find out from the metrics outputs what the bad state of your system looks like. The most interesting part is to see how your system behaves in extreme conditions (e.g., under high load). You can simulate high load in a test environment using tools such as JMeter or Gatling. This gives you a better understanding of the application's capabilities and of which indicators are crucial. This is also a good point in time to set up automatic alerts.

Alerts are a very powerful tool: they remove the need to monitor the dashboards constantly, so you only open them when needed. Going back to the human body and comparing how nature deals with such problems, we never look at our body parts to be assured they work fine. On the contrary, we go about our business assuming everything is OK, and if something is not OK, the brain is notified by a pain signal.
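Once load tests have shown what a bad state looks like, it can be expressed as alert rules. The sketch below is not tied to any particular monitoring tool; the `AlertRule` type, the metric names, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    metric: str       # name of the metric to watch
    threshold: float  # value above which the system is considered unhealthy
    severity: str     # e.g. "warning" or "critical"


# Hypothetical thresholds; real values should come from load-test results.
RULES = [
    AlertRule("error_rate", 0.05, "critical"),
    AlertRule("p95_response_ms", 800.0, "warning"),
    AlertRule("memory_utilization", 0.90, "warning"),
]


def evaluate(metrics: Dict[str, float], rules: List[AlertRule]) -> List[str]:
    """Return one message per rule whose threshold is exceeded."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append(
                f"[{rule.severity}] {rule.metric}={value} exceeds {rule.threshold}"
            )
    return alerts
```

Running `evaluate` on every fresh batch of metrics plays the role of the pain signal: nothing is reported while all values stay under their thresholds.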
Most existing monitoring tools are very mature and support alerting via email, Slack, or webhook. You can send alerts directly to the admin, user, or operator, in which case taking action is up to a human. Another option is to send alert events to a dedicated observability service via an HTTP webhook and automate further actions such as disaster recovery, ML training, or notifying other dependent services using fan-out architecture patterns.
Note: Alerts give you better resource utilization, automation, and overall control of the system.
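The fan-out idea can be sketched in-process as follows. A real setup would typically receive the event over an HTTP webhook and deliver it through a message broker; here the `AlertFanOut` class and its handlers are hypothetical stand-ins for those pieces.

```python
from typing import Callable, Dict, List

AlertEvent = Dict[str, str]
Handler = Callable[[AlertEvent], None]


class AlertFanOut:
    """Delivers each alert event to every subscribed handler."""

    def __init__(self) -> None:
        self._handlers: List[Handler] = []

    def subscribe(self, handler: Handler) -> None:
        self._handlers.append(handler)

    def publish(self, event: AlertEvent) -> List[Exception]:
        # One failing handler must not block the others, so errors are
        # collected and returned instead of being raised.
        errors: List[Exception] = []
        for handler in self._handlers:
            try:
                handler(event)
            except Exception as exc:
                errors.append(exc)
        return errors
```

Each subscriber (a disaster recovery job, a notifier for dependent services, and so on) receives the same event independently of the others.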
The next question is: what should we send in the alert body itself? Again, the answer can be found by taking the perspective of the person who will use it. Usually, it is the operator or developer who will take action to resolve the issue. The content of the alert should help them understand the severity, reason, and blast radius of the issue.
The status of resources shows how urgent it is to take action:
* Service is starving, meaning it will go down soon if no action is taken. This indicator helps to avoid issues even before users start complaining about response times and errors.
A warning/error message shows the possible reason for the issue. Including the traceId helps to get further details from existing monitoring tools.
Region, cluster, application, or component names identify the location and radius of the blast.
Blast radius is a military term, but it applies equally well to production incidents. It is very important to address the alert to the right team, and the blast radius helps identify the affected applications, teams, and components. This avoids spamming or distracting other teams and improves engagement in incident resolution when an alert fires.
The alert should also include information on how to solve the issue, along with useful tips and documentation links. Some issues require manual action and cannot be fixed by application code or configuration changes. Such occasionally recurring issues may have happened before, and the SRE team knows how to fix them. This knowledge should be collected as part of the product documentation (e.g., a production incidents journal) and shared among all responsible members. Adding the corresponding references from the documentation to alert messages helps to address or solve problems much faster.
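Putting these pieces together, an alert body might look like the sketch below. The field names, the sample values, and the runbook URL are hypothetical; the point is only that severity, reason, blast radius, and resolution hints travel together.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Alert:
    severity: str     # how urgent it is to take action
    message: str      # warning/error text hinting at the reason
    trace_id: str     # lets the responder look up details in tracing tools
    region: str       # blast radius: where the issue is located
    cluster: str
    application: str
    runbook_links: List[str] = field(default_factory=list)  # how to fix it

    def to_json(self) -> str:
        """Serialize the alert for an email, Slack, or webhook body."""
        return json.dumps(asdict(self), sort_keys=True)


alert = Alert(
    severity="warning",
    message="Memory utilization above 90%",
    trace_id="hypothetical-trace-id",
    region="eu-west",
    cluster="cluster-a",
    application="orders-service",
    runbook_links=["https://wiki.example.com/runbooks/memory-pressure"],
)
```

Calling `alert.to_json()` produces a payload that a human can read at a glance and an automated consumer can parse.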
The purpose of tracing is to quickly find where the issue happened. The component, application name, or even the exact method call can be quickly identified using a nice view from tools such as Jaeger, Zipkin, or Honeycomb.
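To illustrate how trace data narrows down the location of an error, here is a toy span recorder. It is not the API of Jaeger, Zipkin, or Honeycomb; the `span` helper, the `SPANS` list, and the operation names are hypothetical, and the sketch is single-threaded.

```python
import time
import uuid
from contextlib import contextmanager
from typing import Any, Dict, List

SPANS: List[Dict[str, Any]] = []   # finished spans, innermost first
_stack: List[Dict[str, Any]] = []  # spans currently open


@contextmanager
def span(name: str, trace_id: str):
    """Record one timed operation and link it to its parent span."""
    parent = _stack[-1] if _stack else None
    record = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent["span_id"] if parent else None,
        "name": name,
        "error": None,
    }
    _stack.append(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["duration_s"] = time.perf_counter() - start
        _stack.pop()
        SPANS.append(record)


try:
    with span("checkout", "hypothetical-trace-id"):
        with span("charge-card", "hypothetical-trace-id"):
            raise ValueError("card declined")
except ValueError:
    pass

# The innermost span that recorded an error is where the failure originated.
failed = [s["name"] for s in SPANS if s["error"]]
```

Because inner spans finish before outer ones, the first entry in `failed` is the operation where the error was raised, and the parent links show the path that led to it.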
Logging is probably the very first observability technique, invented when programmers first tried to find more details about a program error or malfunction. All events happening inside the application are forwarded as text or JSON data to logging tools such as Splunk or Elasticsearch, where a full-text search engine lets you query by keywords. Logs contain the full details of the issue and are the last point of observability where engineers can find answers about its cause. If that does not help, we can proceed with remote debugging techniques, but that is another topic and outside the scope of observability.
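Events forwarded as JSON could be produced with Python's standard `logging` module along these lines. The `JsonFormatter` class, the logger name, and the `trace_id` field are assumptions for illustration, and the in-memory stream stands in for whatever ships logs to the logging tool.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id is an optional extra field, not a standard attribute.
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # stands in for the shipper to a log backend
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("orders-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in our stream only

logger.error(
    "Payment failed for order %s", 42, extra={"trace_id": "hypothetical-trace-id"}
)
```

Carrying the trace ID in every log entry makes it possible to jump from an alert or a trace straight to the matching log lines with a keyword query.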
Observability is not a functional part of the system, but it is still highly important because it affects user experience. Good observability gives you quick feedback about the operational status of the system or application and notifies you about potential issues before a user runs into them. Without good observability, it might be too late to take any action, as the user has already gone to another service or product.
Take care of yourself and your products.