To understand OpenTelemetry and why it exists, you need to understand observability. The goal of observability is to be able to understand what is happening at any time inside an application. This is done by collecting telemetry data, which provides insight into the performance, behavior, and overall health of the application. When properly done, observability makes it easier to find bugs and performance bottlenecks in your application, resulting in users getting a faster and more reliable experience.
Instrumentation
Instrumentation means adding the OpenTelemetry SDK to your application so that it emits observability signals (typically logs, metrics, and trace spans). In short, instrumentation is what actually generates the telemetry data that observability depends on. OpenTelemetry provides options for both automatic and manual instrumentation.
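To make the idea concrete, here is a minimal sketch of what instrumentation does, written with only the Python standard library. It is not the OpenTelemetry SDK or its API: the `instrumented` decorator, the `EMITTED` list, and the `parse_order` function are hypothetical names invented for illustration. The decorator wraps a function so that every call emits one record with its name, duration, and any error.

```python
import functools
import time

# Hypothetical in-memory sink standing in for a real telemetry exporter.
EMITTED = []


def instrumented(func):
    """Wrap func so that every call emits one telemetry record."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        error = None
        try:
            return func(*args, **kwargs)
        except Exception as exc:
            # Record the error type, then let the exception propagate.
            error = type(exc).__name__
            raise
        finally:
            EMITTED.append({
                "operation": func.__name__,
                "duration_s": time.perf_counter() - start,
                "error": error,
            })
    return wrapper


@instrumented
def parse_order(raw):
    # Hypothetical application function being instrumented.
    return int(raw)


parse_order("42")
try:
    parse_order("not a number")
except ValueError:
    pass

for record in EMITTED:
    print(record["operation"], record["error"])
```

Applying the decorator by hand is closer to manual instrumentation. Automatic instrumentation aims for a similar effect without requiring you to edit each function yourself.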
Data Sources
OpenTelemetry currently focuses on three data sources: traces, metrics, and logs. More may be added in the future.
Traces
A trace is a way to track a single transaction (for example, an API request or a periodic report job) as it moves through an application, or through a network of applications. In a microservice architecture, a single transaction may touch multiple services. For each operation that a transaction touches, a segment known as a span is created. The span records interesting characteristics of the operation such as duration, errors encountered, and a reference to the parent span (representing the operation that called the current operation). By tracing the progress of a single transaction through a service architecture, you can find the service or operation that is the root cause of slow or failed requests.
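The relationship between spans and their parents can be sketched with a small data model. This is an illustrative stand-in using only the Python standard library, not OpenTelemetry's own span type: the `Span` class, its field names, the service names, and the timings are all assumptions made up for the example. It also assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    name: str
    parent_id: Optional[str]  # None marks the root span of the trace
    start_ms: int
    end_ms: int
    error: bool = False

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time_ms(span, spans):
    """Time spent in a span, excluding time spent in its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def path_from_root(span, spans):
    """Follow parent references upward and return operation names, root first."""
    by_id = {s.span_id: s for s in spans}
    path = [span.name]
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        path.append(span.name)
    return list(reversed(path))


# One hypothetical transaction touching several services.
trace = [
    Span("a", "GET /checkout", None, 0, 500),
    Span("b", "auth-service: verify token", "a", 10, 40),
    Span("c", "cart-service: load cart", "a", 50, 450),
    Span("d", "db: SELECT cart_items", "c", 60, 430),
]

# The span with the most self time is a likely root cause of slowness.
slowest = max(trace, key=lambda s: self_time_ms(s, trace))
print(slowest.name, self_time_ms(slowest, trace))
print(" -> ".join(path_from_root(slowest, trace)))
```

Ranking by self time rather than total duration keeps the root span from always looking like the culprit, since its duration includes everything it called.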
As mentioned, traces are often used to debug and optimize distributed software operations. Tracing is also used to optimize manufacturing processes, answering questions like: “Why does widget production slow on Tuesdays?” In logistics pipelines, tracing can answer questions like: “Will raw materials arrive in time for batch 23 to begin on Friday?”
Once collected, traces are exported from OpenTelemetry to a backend such as Zipkin or Jaeger for analysis and visualization.
Metrics
Metrics are measurements of a service captured at a specific moment in time. They are typically aggregated in the emitting application and collected at fixed intervals, ranging from every second to once per day. The goal of collecting metrics data is to give yourself a frame of reference for how your application is behaving. Based on past data, you can then set alerts or take automated actions if certain metrics exceed acceptable thresholds.
Some examples of metrics:
- Request success and error rates, e.g., 42 requests per second
- Request latency, e.g., count per histogram bucket
- Bandwidth utilization, e.g., 1.2 Mb/s used of 10 Mb/s capacity, or 12 percent
- Fuel level, e.g., 3.2 gallons
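The aggregation behind examples like these can be sketched in a few lines of standard-library Python. This is not the OpenTelemetry metrics API. The bucket bounds, sample latencies, alert threshold, and function names are all hypothetical values chosen for illustration.

```python
from bisect import bisect_left

# Hypothetical upper bounds (inclusive) for latency buckets, in milliseconds.
BUCKET_BOUNDS_MS = [50, 100, 250, 500]


def bucket_counts(latencies_ms, bounds):
    """Count observations per histogram bucket; the last slot is overflow."""
    counts = [0] * (len(bounds) + 1)
    for value in latencies_ms:
        # bisect_left finds the first bound that is >= value.
        counts[bisect_left(bounds, value)] += 1
    return counts


def rate_per_second(count, window_seconds):
    """Turn a raw count over a collection interval into a rate."""
    return count / window_seconds


def utilization_percent(used, capacity):
    """Express used capacity as a percentage of the total."""
    return 100 * used / capacity


latencies = [12, 48, 50, 75, 120, 300, 900]
histogram = bucket_counts(latencies, BUCKET_BOUNDS_MS)
request_rate = rate_per_second(2520, 60)
bandwidth = utilization_percent(1.2, 10)

# A threshold check of the kind an alert rule might apply (threshold assumed).
ALERT_THRESHOLD_PERCENT = 80
should_alert = bandwidth > ALERT_THRESHOLD_PERCENT

print(histogram, request_rate, round(bandwidth, 1), should_alert)
```

Storing counts per bucket instead of every individual latency is what lets the emitting application aggregate locally and report a compact summary at each collection interval.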
Logs
Logs are time-stamped records containing metadata. The data can be structured or unstructured and serves as a record of an event that took place inside an application. Teams often use logs to find which change caused an error. In OpenTelemetry, logs can be independent or attached to spans, which helps when determining the root cause of an issue.
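As a rough illustration of a structured, time-stamped log that is either independent or tied to a span, the sketch below uses Python's standard `logging` and `json` modules. It is not OpenTelemetry's log data model or SDK. The `JsonFormatter` class, the field names, the logger name, and the ID values are assumptions made for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record),
            "severity": record.levelname,
            "body": record.getMessage(),
            # Set only when the caller passes them via extra=...
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(entry)


# Write to an in-memory stream so the output can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

# An independent log, followed by one attached to a (made-up) span.
logger.info("cart loaded")
logger.error(
    "payment declined",
    extra={"trace_id": "4bf92f35", "span_id": "00f067aa"},
)

records = [json.loads(line) for line in stream.getvalue().splitlines()]
for rec in records:
    print(rec["severity"], rec["body"], rec["trace_id"])
```

Carrying the trace and span identifiers on the log record is what allows a backend to show the log next to the span it belongs to.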