To manage complex distributed systems, you need to be able to observe and understand what’s happening to all of the components that make up the system. Observability, however, hinges on the assumption that every component can generate information about what’s happening with it, and in an event-driven system, that can be quite complicated.
For example, if you have an application that executes activities A, B, and C and publishes a message to an event broker, which then goes to a queue, you would want to know what happened from start to finish: from the publishing application, to and within the broker, all the way to the receiving application, for every transactional event.
In this blog post, I’ll address two questions about observability in the context of event-driven architecture (EDA):
- How can event brokers generate information about what’s happening inside the broker and between microservices?
- How can we take action on the generated information from a complex distributed system’s behavior with multiple event brokers in the mix?
To answer these two questions, we will look into two technologies:
- Solace PubSub+ Event Broker: an advanced event broker that enables real-time high-performance messaging in an EDA system.
- Datadog: a cloud-based observability backend used to collect, process, and visualize metrics, logs, and traces from applications and systems.
But first, let’s cover some background.
Introduction to Distributed Tracing
Before diving deep into the distributed tracing of event-driven systems, I’d like to step back and cover some core concepts.
Distributed tracing (DT) is designed to let you observe and understand the journey of information through a distributed system by generating and collecting information about what happens as a piece of information flows through the system. DT falls under the umbrella of tracing, which is, in turn, one of the three pillars of observability. Observability aims to understand what is happening in the system so you can tell what went wrong when something does or identify bottlenecks and figure out how to fix them.
As observability grew in popularity and importance, an open-standard, vendor-agnostic framework was needed to help track transactional event information in a distributed system. Meet OpenTelemetry.
As it turns out, an asynchronous system with an event broker at the core of the architecture needs OpenTelemetry to solve mysteries about the flow of transactional events through the overall system.
There is a direct correlation between the degree of distribution in the system and the complexity of system observability. Advanced observability tools like Datadog enhance the tracing management of such complex systems by letting you monitor, optimize, and investigate all the different components in the system. By stitching together tracing data from across the system, Datadog’s dashboards give a bird’s eye view of what’s going on. However, even with Datadog leading in the observability domain, there are still some gaps in the industry when it comes to collecting metrics from event brokers in event-driven systems.
Distributed Tracing Meets Event-Driven Architecture
There are three levels at which traces can be collected in an event-driven system:
- Application level; during business logic execution.
- API level; during communication between other components and services.
- Event broker level; at every hop inside the event mesh.
The advent of OpenTelemetry has led to lots of tools that generate and collect trace information at the application and API levels. Still, it’s been hard to trace events as they transit event-driven systems because event brokers haven’t historically supported OpenTelemetry, which leaves the event broker component in the system as a black box. In other words, observability in this scenario would be:
…the event entered the broker (generate trace!), the event exited the broker (generate trace!)
I’ll give you an example: imagine an e-commerce site that offers its customers a variety of payment services. To support that, they run microservices on different cloud providers, and events flow from one service to another. So, for example, a single action, like a user clicking to pay for their order, will trigger a series of events such as checking inventory, running fraud detection, updating their customer profile, and actually charging them. Figure 2 below gives an overview of the architecture:
Now consider their distributed tracing strategy. Assume that events are published and subscribed to between all the backend microservices over a message broker. As a system architect or a developer, when a failure happens, you might ask several questions, such as:
- Why didn’t the fraud detection microservice ever receive the message it’s subscribed to? Is it due to a queue reaching quota capacity? Is it due to subscription permissions?
- What happened to the event in the event mesh if there are multiple message brokers involved?
- Did my message make it to the event broker?
- I want to track the journey the message took from the customer hitting the purchase button all the way to the fraud detection microservice. How can I do that?
We can clearly see an observability gap in an event-driven system. However, with Solace’s support of distributed tracing in the event broker component and Datadog’s commitment to contributing to OpenTelemetry, we can now bridge the observability gap in event-driven architecture.
Closer Look Into the Architecture
As stated previously, complete observability is achieved when all the components of the distributed system generate information about their actions. This includes message brokers.
In Figure 3 below, we see that applications can generate their own OpenTelemetry trace messages directly from the application logic or from the API using OpenTelemetry client libraries. As applications start publishing guaranteed messages to the Solace PubSub+ Event Broker and subscribing to these messages, the broker generates spans that reflect every hop inside the broker. Activities such as enqueuing from publishing, dequeuing from consuming, and acknowledgment will generate spans that are consumed by the OpenTelemetry collector.
So, for example, Figure 5 below shows a setup of interconnected, cloud-hosted, cloud-agnostic Solace PubSub+ Event Brokers forming an event mesh. The event brokers are hosted on AWS in a North American US-East (NA) region: Ohio, on GCP in a European (EU) region: France, and on Azure in an Asia-Pacific (APAC) region: Japan. Each of the event brokers is configured to generate trace messages following the configuration setup discussed previously in Figure 3.
Figure 4: Solace PubSub+ Cloud Service Creation Dashboard.
A Java microservice is configured to publish messages to the NA-hosted event broker on a predefined topic; in this case, it publishes a guaranteed message on topic orders/pos/1234. In our e-commerce organization example, this could be the check-out cart application. There are queues on the EU and APAC-hosted event brokers configured to subscribe to topic orders/pos/*. In our organization, these event brokers could reflect the 360 Customer Profile and the Customer Charging applications, respectively. Note: see Solace’s wildcards notation for further information on topic subscriptions.
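To make the subscription behavior concrete, here is a minimal Python sketch of how a queue subscribed to orders/pos/* would match the published topic orders/pos/1234. It assumes a simplified reading of Solace-style wildcards (a trailing `*` matches within a single level, and `>` as the last level matches one or more remaining levels). The function name `topic_matches` is hypothetical and this is not the broker’s actual matching code; the Solace wildcard documentation remains the reference for the exact rules.

```python
def topic_matches(subscription: str, topic: str) -> bool:
    """Simplified, illustrative matcher for Solace-style topic wildcards."""
    sub_levels = subscription.split("/")
    topic_levels = topic.split("/")

    for i, sub in enumerate(sub_levels):
        # '>' as the final level matches one or more remaining topic levels.
        if sub == ">" and i == len(sub_levels) - 1:
            return len(topic_levels) > i
        # The subscription has more levels than the topic.
        if i >= len(topic_levels):
            return False
        if sub.endswith("*"):
            # '*' stays within one level; 'abc*' acts as a prefix match.
            if not topic_levels[i].startswith(sub[:-1]):
                return False
        elif sub != topic_levels[i]:
            return False

    # Without a trailing '>', both sides must have the same number of levels.
    return len(sub_levels) == len(topic_levels)


if __name__ == "__main__":
    print(topic_matches("orders/pos/*", "orders/pos/1234"))
    print(topic_matches("orders/pos/*", "orders/pos/1234/refund"))
```

Under these assumptions, the first call matches because `*` covers the single level `1234`, while the second does not because the topic has an extra level.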
Another Java microservice connects to the EU-hosted event broker and binds to the preconfigured queue. Both the publisher and subscriber Java applications are configured with automatic instrumentation using the OTel Java Agent, which dynamically injects tracing metadata before the message is published to the Solace broker for context propagation.
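The idea behind context propagation can be sketched in a few lines of Python. The example below writes and reads a W3C Trace Context `traceparent` value (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) in a plain dictionary that stands in for message metadata. The dictionary, the `inject` and `extract` helper names, and the sample IDs are assumptions for illustration only; in the setup described here the OTel Java Agent does this automatically, and how the Solace APIs actually carry the context on the wire is not shown.

```python
import re
import secrets

# W3C Trace Context format: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_trace_id() -> str:
    """Random 16-byte trace ID as 32 hex characters."""
    return secrets.token_hex(16)


def new_span_id() -> str:
    """Random 8-byte span ID as 16 hex characters."""
    return secrets.token_hex(8)


def inject(properties: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Publisher side: attach the current trace context to the message metadata."""
    flags = "01" if sampled else "00"
    properties["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract(properties: dict):
    """Consumer side: recover the trace context, or None if absent or malformed."""
    match = TRACEPARENT_RE.match(properties.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": int(flags, 16) & 1 == 1,
    }


if __name__ == "__main__":
    message_properties = {}  # hypothetical stand-in for message metadata
    inject(message_properties, new_trace_id(), new_span_id())
    context = extract(message_properties)
    # The consumer starts its own span with a new ID but reuses the trace ID.
    print(context["trace_id"], "->", new_span_id())
```

Because every hop reuses the same trace ID and records the previous span as its parent, the backend can later tell that the publisher, broker, and subscriber spans belong to one transactional event.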
When the collector is configured to export OTel trace messages to Datadog, we can follow all the spans for one transactional event as it propagates between the applications through the event broker, as seen in Figure 6. Note that all the spans for one transactional event are stitched together in one trace, thanks to context propagation.
Figure 6: Datadog Span dashboard.
Figure 7: Datadog Map view of the span.
Figure 8: Full span of placing orders in the e-commerce organization business transaction.
Thanks to the standardization of trace messages using the OpenTelemetry Protocol (OTLP), after the spans are received by the Solace Receiver on the OpenTelemetry collector, they are converted into standardized OpenTelemetry trace messages and passed to exporters. An exporter is a component in the collector that sends data to the backend observability system of choice. In this example, we have chosen the Datadog exporter to export the trace messages. Finally, spans are received by Datadog, where they are intelligently stitched and correlated based on several properties and trace IDs so they can be further examined and analyzed using different dashboards and tooling.
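As a rough illustration of what stitching means, the Python sketch below groups spans by trace ID and arranges them into a parent-child tree using each span’s parent ID. It is a simplified model, not Datadog’s actual correlation logic, and the span fields, service names, and span names are hypothetical.

```python
from collections import defaultdict


def _walk(children, parent_id, depth, lines):
    """Depth-first walk that renders each span indented under its parent."""
    for span in sorted(children[parent_id], key=lambda s: s["start"]):
        lines.append("  " * depth + f'{span["service"]}: {span["name"]}')
        _walk(children, span["span_id"], depth + 1, lines)


def stitch(spans):
    """Group spans by trace_id and return {trace_id: [indented lines]}."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span["trace_id"]].append(span)

    stitched = {}
    for trace_id, trace_spans in by_trace.items():
        # Index spans by parent so children can be found quickly.
        children = defaultdict(list)
        for span in trace_spans:
            children[span["parent_id"]].append(span)
        lines = []
        _walk(children, None, 0, lines)  # root spans have no parent
        stitched[trace_id] = lines
    return stitched


if __name__ == "__main__":
    # Hypothetical spans arriving out of order from different components.
    spans = [
        {"trace_id": "t1", "span_id": "c", "parent_id": "b",
         "service": "eu-consumer", "name": "process", "start": 3},
        {"trace_id": "t1", "span_id": "a", "parent_id": None,
         "service": "checkout-publisher", "name": "publish", "start": 1},
        {"trace_id": "t1", "span_id": "b", "parent_id": "a",
         "service": "na-broker", "name": "enqueue", "start": 2},
    ]
    for trace_id, lines in stitch(spans).items():
        print(trace_id)
        print("\n".join(lines))
```

The spans can arrive in any order and from any component; as long as they share a trace ID and carry the right parent ID, they can be reassembled into the end-to-end journey of one event.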
Solace’s new distributed tracing capability in Solace PubSub+ Event Broker means that traces can be generated at every hop in the event mesh, reflecting what happens to the business transaction events. Using advanced observability backends, like Datadog, all the generated spans and traces from the system can be stitched and correlated, resulting in a better understanding of the overall system.
Distributed Tracing support in the Solace PubSub+ Event Broker and the messaging APIs will continue to improve and develop. So keep an eye out for our latest releases and collaborations for more cool projects!
For more details about Distributed Tracing in Event Driven Architecture, check out this video series:
Do you have any thoughts? Drop them in the comments section below!