Our new mantra for managing and maintaining the health and functionality of our apps and environments is observability. Observability is the quality of software, services, platforms, or products that allows us to understand how systems are behaving. Without the new sources of data giving us insights, our modern cloud-native applications would be quite a challenge to monitor. Observability, that deep data, is the new fuel for our developer and DevOps engineers.
The duality of observability is controllability. Observability is the ability to infer the internal state of a ‘machine’ from externally exposed signals. Controllability is the ability to control input to direct the internal state to the desired outcome. While driving, observing a red stoplight means controlling our vehicle by pressing the brakes (or, in some modern vehicles, having the brakes applied automatically for us).
Quite often we find that observability is presented as the desired end state. Yet, in modern computing environments, this isn’t really true. After all, how many times does an application stop working (or deliver incorrect results) and the response is to shrug and walk away? We need to move from a linear model to a loop model, from ‘See something. Say something.’ to ‘See something. Do something.’
So, observability is a loop. And we need to stop treating it as the end state of our challenge in delivering performant, quality experiences to our users and customers.
So let’s break this apart into components.
Observe and Control
As defined above, observability is the quality of software, services, platforms, or products that allows us to understand how systems are behaving. It goes beyond the traditional data that helps us look for what we already expect to happen, providing data for things we didn’t realize could happen. It’s a window into the operating state of our applications and systems. Observability is a data problem: data is only as useful as our ability to aggregate, analyze, visualize, and respond to it. This lets observability extend our monitoring paradigm, making use of AI/ML to extend our analysis and visualization.
With observability comes alerting. After all, when something goes wrong we need to become aware and respond as quickly as possible. In fact, this alerting capability may be what distinguishes monitoring from observability: detecting and alerting on what we know might go wrong (monitoring) versus detecting when something is wrong that we didn’t foresee (observability). We can consider that observability lets us see the activities (monitoring) and reduce the mean time to detection by showing us both known and unknown activities within our systems.
In some ways, observability could be monitoring at the Chuck Norris level, where an alert leads to a response, even if we didn’t expect the original action.
Controllability is that response.
Controllability consists, at a coarse level, of two parts, and one is often forgotten or the two get lumped together into a single concept. The two are mean time to respond and mean time to resolution. Your immediate response is to get things working again, and that can follow a different path than resolution, which is to identify the underlying causes and fix them so that development and performance improvement can continue.
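To make the distinction concrete, the two measures can be computed separately from incident timestamps. A minimal sketch in Python; the incident records and their field names are purely illustrative, not from any particular tool:

```python
from datetime import datetime

# Hypothetical incident records: when the problem was detected, when
# service was restored (response), and when the underlying cause was
# actually fixed (resolution).
incidents = [
    {"detected": datetime(2021, 3, 1, 10, 0),
     "responded": datetime(2021, 3, 1, 10, 12),
     "resolved": datetime(2021, 3, 2, 9, 0)},
    {"detected": datetime(2021, 3, 5, 14, 0),
     "responded": datetime(2021, 3, 5, 14, 4),
     "resolved": datetime(2021, 3, 5, 18, 30)},
]

def mean_minutes(pairs):
    """Average gap between two timestamps, in minutes."""
    pairs = list(pairs)
    total = sum((end - start).total_seconds() for start, end in pairs)
    return total / len(pairs) / 60

mttr_respond = mean_minutes((i["detected"], i["responded"]) for i in incidents)
mttr_resolve = mean_minutes((i["detected"], i["resolved"]) for i in incidents)
```

Note how different the two numbers can be: service may be back in minutes while the root-cause fix lands many hours later. Tracking only one of them hides that gap.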
So let’s look at a reasonably simple example.
The Cat and the Canary
You run a social platform that allows people to upload pictures of their cats napping: CatNapFriends. The app has a number of microservices and you’ve begun to introduce serverless elements for image processing. The application has been running quite well, meeting scale and staying responsive, leading to happy, purring users.
You’ve found that your function performs a bit slowly at lower volumes of traffic, so you change the function a bit to make it faster. It runs 50% faster in your tests, so you are ready to deploy. Being a smart, safe and sane person, you roll out in a canary model. And something goes wrong as it scales out: your serverless functions start hitting durations in the seconds, or returning errors.
Your users aren’t so happy anymore. Your Twitter feed blows up, and your Facebook is something you’d rather not see.
So the issue then is:
- How fast did you recognize the problem?
- How fast did you let someone know?
- How fast did you get back to a performant system?
- How did you figure out the root causes?
Each of these questions can have multiple answers, but some answers are probably better for your business and honestly, your sanity.
Start With Monitoring
How do we know something went wrong? Well, if you can’t see it, it never happened. Until, of course, you get a call from the CIO asking why the national news is reporting that your site is down. Learning about problems from users on Twitter might be a career-limiting move.
So we start with monitoring. While monitoring sometimes gets lumped into the category of seeing things we already know, monitoring in observability can also flag things we don’t know… yet. Sometimes you’ll hear observability described as the ability to answer the unknown unknowns, with the focus placed on root cause analysis. Stellar use of observability means detecting those unknowns long before we reach the resolution stage.
Granularity and fidelity play a major role in your monitoring and detection. Imagine the serverless scenario above is underway. You grab a data point every 5 seconds. No problem. But wait: serverless cold starts take between 200-700 ms, and warm starts 8-50 ms. You just missed them. In a 5-second window, you can miss a lot of serverless starts. And when we get to distributed tracing it can be even worse. You need to see every point, every bit of trace data you can. It’s useful in this monitoring and detection phase, but will become increasingly crucial in the resolution phase.
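The effect of coarse polling is easy to simulate. The sketch below uses made-up invocation counts and the duration ranges mentioned above, samples once every 5 seconds, and counts how many short-lived function invocations fall entirely between samples:

```python
import random

random.seed(7)  # deterministic run for the illustration

POLL_INTERVAL = 5.0  # seconds between metric samples

# Simulate 100 invocations over one minute; each lasts 8-700 ms,
# roughly the warm-start-to-cold-start range discussed above.
invocations = [(random.uniform(0, 60), random.uniform(0.008, 0.7))
               for _ in range(100)]

# Sample points at t = 0, 5, 10, ... 60 seconds.
sample_times = [t * POLL_INTERVAL for t in range(int(60 / POLL_INTERVAL) + 1)]

def observed(start, duration):
    """An invocation is 'seen' only if a sample lands inside its lifetime."""
    return any(start <= s <= start + duration for s in sample_times)

seen = sum(observed(s, d) for s, d in invocations)
missed = len(invocations) - seen
```

Under these (illustrative) numbers, the vast majority of invocations never intersect a sample at all, which is exactly why sub-second events demand per-event trace data rather than periodic polling.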
Now that we have the data we need to move to the next phase in our loop, detection.
Detect and Alert
Detection is, obviously, realizing that something went wrong (or went out of band). In traditional monitoring, it is most often concerned with static thresholds. But there are a lot of other choices, like Heartbeat Check, Resource Running Out, Outlier Detection, Sudden Change, Historical Anomaly and Custom Threshold.
As you can tell, the list covers both things we already know, (static, heartbeats), but also starts verging into the unknowns (outliers, sudden changes). And as we move into AI/ML categories, we start to see even more observable events leading to unknown detectors (already represented by historical anomaly). The real value is being able to define custom detectors; remember that in our observability world we don’t know the questions (yet), so when we do learn a new question we should be able to track against it.
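As an illustration, a ‘sudden change’ detector can be as simple as flagging points that deviate sharply from a rolling baseline. A hedged sketch; the window size and threshold here are arbitrary choices for the example, not recommendations:

```python
from statistics import mean, stdev

def sudden_change(points, window=5, threshold=3.0):
    """Flag indices whose value sits more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    alerts = []
    for i in range(window, len(points)):
        baseline = points[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(points[i] - mu) / sigma > threshold:
            alerts.append(i)
    return alerts

# Function durations in ms: steady warm starts, then a spike.
durations = [40, 42, 38, 41, 39, 40, 43, 2100, 41, 40]
spikes = sudden_change(durations)
```

The point is not this particular statistic: it’s that once you learn a new question worth asking, a custom detector like this lets you keep asking it automatically.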
Detection leads to alerting. And detection and alerting bring us to the first open/closed loop bifurcation. An open loop has a person as the triggering element: for example, an operator spots a problem (like a metric out of range) on a monitoring dashboard and alerts the responsible people. A closed loop is one in which an automated element responds to the trigger: the dashboard highlights the change (say, flashing red) and/or launches an automated alert itself. Multiple combinations can occur, but with observability you want to figure out how best to keep the process closed-loop for as long as possible. You’ll still need open loops in places, but in detection and alerting in particular, a closed loop gives you the fastest result for appropriate events, like a problematic deployment.
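The bifurcation can be sketched as a small alert router: alert kinds with a registered automated handler stay closed-loop, everything else falls back to paging a human. Every name here (the alert kinds, the handler, the queue) is hypothetical:

```python
# Hypothetical closed-loop handlers, keyed by alert kind. Anything
# without an entry falls back to the open loop (a human).
AUTOMATED_HANDLERS = {
    "bad_deployment": lambda alert: f"rolled back {alert['service']}",
}

human_queue = []  # stand-in for an on-call paging system

def dispatch(alert):
    handler = AUTOMATED_HANDLERS.get(alert["kind"])
    if handler:                       # closed loop: a machine responds
        return handler(alert)
    human_queue.append(alert)         # open loop: a person must act
    return "paged on-call"

closed = dispatch({"kind": "bad_deployment", "service": "image-resize"})
opened = dispatch({"kind": "disk_full", "service": "image-resize"})
```

The design goal is to grow the handler table over time, so more and more alert kinds resolve without waking anyone up.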
Closing the Loop With Controllability
It’s great to know something has occurred or is about to occur. However, if you can’t take action, then while the crash may be exciting, it won’t be cheap. And action can mean many things, but it normally falls into two steps.
The response is the first step in control (and recovery). You may need to think of this in triage terms: can it live without immediate attention, will it die no matter what, or will it live only with immediate attention? Fortunately for us, an application ‘death’ is resurrectable.
So think of it this way. ‘My service is running slower than normal or desired, but work is still happening. Let me figure out what is wrong and then fix it.’ Or you can end up with ‘my service is dead or inhibiting work, let me get back to steady-state, then find and fix it.’
The response is a natural fit for automation techniques. Depending on the nature of the alert and surrounding events, it may well be that an automated response (via a runbook, or a trigger to a different action script) can be kicked off.
In our CatNapFriends example above, we pushed a new version of the function. Our monitoring noticed a detrimental change and our alerts fired. Now assume that one of those alerts also connected to a trigger or script that immediately rolled our update back to the last known good distribution, and also alerted the appropriate person to examine the problem and start looking into the root causes. In every case, an appropriate alert should inform the right group of the problem and the response, and basically keep people informed.
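A rollback handler of that shape can be sketched in a few lines. The version identifiers and the notification stand-in below are made up for the example; a real system would call a deployment API and a paging service instead:

```python
# Hypothetical deployment history for the image-processing function;
# the last entry ("v43") is the bad canary.
deployments = ["v41", "v42", "v43"]
notifications = []  # stand-in for paging/Slack/email

def notify(message):
    notifications.append(message)

def on_canary_alert():
    """Roll back to the last known good version, then tell a human
    so root-cause investigation (resolution) can start."""
    bad = deployments.pop()
    good = deployments[-1]
    notify(f"rolled back {bad} -> {good}; please investigate root cause")
    return good

current = on_canary_alert()
```

Note that the automated step only restores service (response); the notification hands the slower, human-driven resolution work to the right people.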
This leads us to resolution.
Closing the Loop — and Staying Open
Resolution is almost always an open loop. When we need to fix it moving forward, we’ll dive into code, infrastructure, and/or configurations. And that will take insight not only into the immediate issue for resolution but also the other items that might affect or be affected by that resolution. For CatNapFriends and its sudden slow behavior, it could be a coding bug that fails to return the completion of the image processing. It could be that your serverless configuration was set to the minimum memory and the image is swamping the container. It could be an unforeseen interaction between the new service and other elements. Fortunately, observability anticipates this need for visibility and expresses it in the data itself.
This need maps onto the three major data classes of observability: metrics, traces, and logs.
And it continues over to the controllability concepts: respond, resolve, redeploy.
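One request can and should feed all three data classes at once, correlated by a shared trace ID. The sketch below uses plain dictionaries and the standard-library logger as stand-ins for a real telemetry SDK; the signal names are invented for the example:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("catnap")

def process_image(image_id):
    """Emit a metric, a trace span, and a log line for one (mock) request."""
    trace_id = uuid.uuid4().hex           # correlates all three signals
    start = time.perf_counter()
    # ... image processing would happen here ...
    duration_ms = (time.perf_counter() - start) * 1000

    metric = {"name": "image.process.duration_ms", "value": duration_ms}
    span = {"trace_id": trace_id, "span": "process_image",
            "duration_ms": duration_ms}
    log.info(json.dumps({"trace_id": trace_id, "image_id": image_id,
                         "msg": "processed"}))
    return metric, span

metric, span = process_image("cat-123")
```

With the trace ID threaded through all three, the resolution phase can pivot from an anomalous metric to the exact spans and log lines of the affected requests.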
We need observability to close the loop, and we need tools and techniques that allow us to do that with speed and precision at scale. It’s a loop, leading through each phase and returning to the start. We work to ensure that we spend most of our time in the monitoring aspects, but we need all of the other steps readily available.
So, while observability takes its cues from control theory, the practical approach takes it from computer process control and SCADA (Supervisory Control and Data Acquisition) implementations. Visibility into your system is good, but it is not enough to manage today’s ever-increasing complexity of containers and Kubernetes, microservices and serverless functions. With great observability comes the need for great control, be the loops open or closed.