# OpenTelemetry 101: how it works
OpenTelemetry is easiest to misunderstand when it is described too abstractly.
It is not a monitoring backend. It is not only distributed tracing. It is not just a set of client libraries either.
At a practical level, OpenTelemetry is a standard way to generate, shape, and export telemetry data from software. That telemetry usually comes in three signals:
- traces
- metrics
- logs
The point is not simply “collect more data.” The point is to produce telemetry in a consistent format so that different services, runtimes, and backends can all speak the same language.
This post is a technical 101 on how that works.
## The short version
When you instrument an application with OpenTelemetry, a few things happen:
- your code or an instrumentation library records telemetry
- the OpenTelemetry SDK attaches metadata and decides what to keep
- the SDK exports that telemetry, often using OTLP
- the data goes either directly to a backend or through an OpenTelemetry Collector
- the backend stores, indexes, aggregates, and visualizes it
That is the whole system in one diagram:

```mermaid
flowchart LR
  A[Application code] --> B[Instrumentation]
  B --> C[OpenTelemetry SDK]
  C --> D[OTLP exporter]
  D --> E[Collector]
  E --> F[Observability backend]
```
The rest of this article is just filling in what each box is actually doing.
## The three signals
### Traces
A trace is the record of one request or workflow moving through a system.
The building block of a trace is a span. A span represents one unit of work: an HTTP request handler, a SQL query, a call to another service, or a queue publish.
Spans usually contain:
- a name
- start and end timestamps
- a parent span ID, unless the span is the root
- attributes
- events
- status
- links when needed
If Service A calls Service B and both are instrumented correctly, spans from both services can belong to the same trace. That is what makes distributed tracing useful: it shows the path of a request across process boundaries instead of only inside one application.
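The span fields listed above can be pictured as plain data. This is a sketch for illustration only, not the real OpenTelemetry API; the trace ID is hard-coded and `makeSpan` is a hypothetical helper:

```javascript
// A span modeled as plain data, mirroring the fields listed above.
function makeSpan(name, { traceId, parentSpanId = null } = {}) {
  return {
    name,
    traceId,
    spanId: Math.random().toString(16).slice(2, 18).padEnd(16, '0'),
    parentSpanId,            // null only for the root span
    startTime: Date.now(),
    endTime: null,
    attributes: {},          // key/value metadata, e.g. an HTTP route
    events: [],              // timestamped annotations within the span
    status: 'UNSET',
  };
}

// Service A starts a root span; a span in Service B joins the same trace
// by carrying the same traceId and pointing at its parent.
const root = makeSpan('GET /checkout', { traceId: '4bf92f3577b34da6a3ce929d0e0e4736' });
const child = makeSpan('SELECT orders', { traceId: root.traceId, parentSpanId: root.spanId });
```

The shared `traceId` plus the `parentSpanId` chain is all a backend needs to reassemble the tree.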
### Metrics
Metrics are measurements captured at runtime.
A metric answers questions like:
- how many requests happened?
- how long did they take?
- how many jobs are waiting in the queue right now?
OpenTelemetry metrics are produced through instruments such as:
- counters
- up-down counters
- histograms
- gauges or asynchronous instruments, depending on language support
Metrics are usually better than traces for long-term trend analysis, alerting, and capacity questions. Traces explain one request in detail. Metrics explain system behavior over time.
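To make the instrument kinds concrete, here are toy versions of a counter and a histogram. These are sketches of the data shapes involved, not the real SDK, which also handles aggregation and export:

```javascript
// Toy counter: a monotonically increasing sum.
function createCounter() {
  let value = 0;
  return { add: (n) => { value += n; }, get value() { return value; } };
}

// Toy histogram: one bucket per boundary, plus an overflow bucket.
function createHistogram(bounds) {
  const buckets = new Array(bounds.length + 1).fill(0);
  return {
    record(v) {
      let i = bounds.findIndex((b) => v <= b);
      if (i === -1) i = bounds.length;
      buckets[i] += 1;
    },
    get buckets() { return [...buckets]; },
  };
}

const requests = createCounter();                    // "how many requests happened?"
const latencyMs = createHistogram([100, 500, 1000]); // "how long did they take?"

requests.add(1);
requests.add(1);
latencyMs.record(42);   // lands in the <=100ms bucket
latencyMs.record(750);  // lands in the <=1000ms bucket
```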
### Logs
Logs in OpenTelemetry are log records emitted through a logging provider or bridged in from existing logging systems.
The important point is not that OpenTelemetry invents logging. The important point is that logs can be correlated with traces and spans. If a request has a trace ID and span ID, logs generated during that request can carry the same identifiers, which makes cross-signal investigation much easier.
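Here is what that correlation looks like as data. The IDs are hard-coded for illustration, and `logWithContext` is a hypothetical helper; in a real app the SDK or a logging bridge fills the IDs in from the active context:

```javascript
// A stand-in for the currently active span's context.
const activeSpan = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
};

// Emit a log record that carries the same identifiers as the span,
// so a backend can join the log to the trace.
function logWithContext(span, severity, message) {
  return {
    timestamp: new Date().toISOString(),
    severity,
    message,
    trace_id: span.traceId,
    span_id: span.spanId,
  };
}

const record = logWithContext(activeSpan, 'ERROR', 'payment declined');
```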
## API, SDK, and instrumentation are not the same thing
This is one of the most important distinctions in OpenTelemetry.
| Piece | What it does | Who usually uses it |
|---|---|---|
| API | Defines how telemetry is recorded | application code and libraries |
| SDK | Implements processing, sampling, resources, export | app owners and platform teams |
| Instrumentation library | Hooks into frameworks or libraries | usually installed by developers |
| Collector | Receives, processes, and forwards telemetry outside the app | platform and ops teams |
| Backend | Stores and analyzes telemetry | vendor or self-hosted platform |
### API
The API is the surface you call when you create spans, record measurements, or emit logs.
If you write:

```javascript
const span = tracer.startSpan('checkout');
```

that is API usage.
The API is intentionally small. It lets application code describe what happened, without deciding how that telemetry is exported or processed.
### SDK
The SDK does the operational work.
It provides:
- tracer providers and meter providers
- span processors
- sampling
- resources
- exporters
In other words, the API records telemetry. The SDK decides what to do with it.
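That division of labor can be sketched with a toy span processor and exporter. This is only an illustration of the idea behind batching and export, not the real `BatchSpanProcessor`:

```javascript
// Buffer finished spans, then hand batches to an exporter.
function createBatchProcessor(exporter, maxBatchSize) {
  const buffer = [];
  return {
    onEnd(span) {
      buffer.push(span);
      if (buffer.length >= maxBatchSize) this.forceFlush();
    },
    forceFlush() {
      if (buffer.length > 0) exporter.export(buffer.splice(0, buffer.length));
    },
  };
}

// A fake exporter that just collects what it is given.
const exported = [];
const exporter = { export: (batch) => exported.push(batch) };

const processor = createBatchProcessor(exporter, 2);
processor.onEnd({ name: 'span-1' });
processor.onEnd({ name: 'span-2' }); // hits maxBatchSize: batch is exported
processor.onEnd({ name: 'span-3' });
processor.forceFlush();              // flush the remainder on shutdown
```

The application code that created the spans never sees any of this; that separation is the whole point of the API/SDK split.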
### Instrumentation libraries and auto-instrumentation
Most teams do not want to hand-write spans for every HTTP framework, ORM, cache, or message queue.
That is where instrumentation libraries come in. They automatically create spans or metrics around known libraries. Auto-instrumentation takes that even further by attaching instrumentation to applications with little or no code changes.
This is why OpenTelemetry often feels larger than “just a tracing API.” In normal use, you are not only using the API. You are using:
- instrumentation packages
- the SDK
- exporters
- often the Collector
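The core trick behind most instrumentation libraries is wrapping a known function so a span is recorded around every call, with no change to the calling code. A minimal sketch, where `db` and the span shape are stand-ins for illustration:

```javascript
const spans = [];

// Replace obj[method] with a version that records a span around each call.
function instrument(obj, method, spanName) {
  const original = obj[method];
  obj[method] = function (...args) {
    const span = { name: spanName, startTime: Date.now(), endTime: null };
    try {
      return original.apply(this, args);
    } finally {
      span.endTime = Date.now();
      spans.push(span);
    }
  };
}

// A fake database client standing in for a real library.
const db = { query: (sql) => `rows for: ${sql}` };

instrument(db, 'query', 'db.query');
const rows = db.query('SELECT 1'); // caller is unchanged, span is recorded
```

Real instrumentation packages do the same thing with far more care: they propagate context, follow semantic conventions, and handle async boundaries.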
## Resources and semantic conventions
Two pieces make telemetry usable across systems.
### Resources
A resource describes what produced the telemetry.
Typical resource attributes include:
- `service.name`
- `service.version`
- `deployment.environment`
- host, container, or cloud metadata
Without resource data, a span is just an isolated event. With resource data, it becomes “a span from checkout-service version 1.8 running in production.”
### Semantic conventions
Semantic conventions standardize attribute names and meanings.
For example, instead of every team inventing different keys for HTTP route, method, or database system, OpenTelemetry defines common names for these concepts. That standardization is what makes dashboards, queries, and cross-service comparison sane.
If two services both emit HTTP spans but one uses method and the other uses httpVerb, the data is harder to query consistently. Semantic conventions exist to prevent that drift.
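Here is the difference in practice. The attribute keys below follow OpenTelemetry's HTTP semantic conventions (check the current spec for the exact names your SDK version emits); the span objects themselves are simplified stand-ins:

```javascript
// Two services describing different HTTP calls with the SAME attribute keys.
const spanFromServiceA = {
  name: 'GET /users/:id',
  attributes: {
    'http.request.method': 'GET',
    'http.route': '/users/:id',
    'http.response.status_code': 200,
  },
};

const spanFromServiceB = {
  name: 'GET /orders/:id',
  attributes: {
    'http.request.method': 'GET',
    'http.route': '/orders/:id',
    'http.response.status_code': 200,
  },
};

// Because both use the same key, one query works across both services.
const methods = [spanFromServiceA, spanFromServiceB]
  .map((s) => s.attributes['http.request.method']);
```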
## Context propagation is what makes distributed tracing work
OpenTelemetry context propagation is the mechanism that moves trace context between services and processes.
When one service calls another, the current trace context is injected into the carrier for that protocol. In HTTP, that usually means headers. On the receiving side, the downstream service extracts that context and creates a new child span in the same trace.
In practice:
- Service A starts a span
- Service A injects trace context into the outbound request
- Service B extracts the context
- Service B starts a child span using the incoming context
By default, OpenTelemetry uses the W3C Trace Context format, which is why you often see a traceparent header.
Context propagation is also how different signals can be correlated. A log record or metric emitted during a request can be associated with the active trace context.
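The `traceparent` header mentioned above can be built and parsed with plain string handling. A real app uses the SDK's propagator; this sketch only shows what travels on the wire: version `00`, a 32-hex-char trace ID, a 16-hex-char parent span ID, and trace flags (`01` means sampled):

```javascript
function buildTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const [version, traceId, parentSpanId, flags] = header.split('-');
  return { version, traceId, parentSpanId, sampled: flags === '01' };
}

// Service A injects the context into the outbound request...
const header = buildTraceparent(
  '4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7', true);

// ...and Service B extracts it to start a child span in the same trace.
const ctx = parseTraceparent(header);
```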
## OTLP is the wire format, not the destination
OTLP stands for the OpenTelemetry Protocol.
It is the general-purpose protocol OpenTelemetry defines for delivering telemetry between clients, collectors, and backends. It describes:
- how data is encoded
- how it is transported
- how export requests and responses work
OTLP commonly runs over:
- gRPC
- HTTP
For HTTP, the default paths are:
- `/v1/traces`
- `/v1/metrics`
- `/v1/logs`
This is a common source of confusion: OTLP is not a backend. It is the delivery protocol.
An application can export OTLP to:
- a local Collector
- a remote Collector
- a backend that accepts OTLP directly
## Where the Collector fits
You do not always need the OpenTelemetry Collector.
For small systems, local development, or quick experiments, sending telemetry directly from the SDK to a backend is often good enough.
But the Collector becomes useful very quickly because it can centralize operational concerns that you do not want inside every application:
- retries
- batching
- fan-out to multiple backends
- authentication or TLS handling
- enrichment
- filtering or redaction
- protocol translation
A minimal Collector pipeline looks like this:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  debug:
  otlp:
    endpoint: backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlp]
```
The app sends OTLP to the Collector. The Collector receives it, batches it, and exports it onward.
That pattern is operationally cleaner than teaching every service how to talk to every backend.
## What happens during one request
Let’s walk through one HTTP request in a microservice setup.
```mermaid
sequenceDiagram
  participant U as User
  participant F as Frontend
  participant C as Checkout
  participant DB as Database
  participant Col as Collector
  participant B as Backend
  U->>F: GET /checkout
  F->>C: HTTP request with traceparent
  C->>DB: SQL query
  C-->>F: response
  F-->>U: response
  F->>Col: spans, metrics, logs via OTLP
  C->>Col: spans, metrics, logs via OTLP
  Col->>B: processed telemetry
```
Under the hood:
- the frontend receives the incoming request and starts a root span
- the frontend propagates context to the checkout service
- the checkout service extracts that context and creates a child span
- the database client instrumentation creates another child span for the query
- metrics are recorded during the same operations
- logs can be correlated with the active span
- the SDK batches and exports telemetry
- the Collector optionally processes and forwards it
- the backend stores and visualizes it
That is the practical value of OpenTelemetry: one request can be understood across multiple components using shared context and shared conventions.
## Sampling, cost, and realism
Not every trace should be kept forever.
Tracing is detailed, and detail costs money. The SDK can apply head sampling, which means it decides near span creation time whether a trace should be recorded and exported. Some systems also use tail-based sampling later in the pipeline, often in the Collector or backend, after more context is known.
Metrics are usually cheaper and easier to retain long-term. Logs vary a lot: structured logs can be extremely useful, but high-volume logs can become expensive quickly.
A healthy OpenTelemetry setup is not “collect everything blindly.” It is:
- keep enough detail to debug
- keep enough metrics to operate
- correlate logs when useful
- control cost intentionally
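Head sampling by trace ID ratio can be sketched in a few lines. This is an illustration of the idea, not the real `TraceIdRatioBasedSampler`: the decision is made once, near root-span creation, and is deterministic per trace ID so every participant in the trace agrees:

```javascript
// Keep a trace iff the low 8 hex chars of its ID, read as a number
// in [0, 2^32), fall below ratio * 2^32.
function shouldSample(traceId, ratio) {
  const slice = parseInt(traceId.slice(-8), 16);
  return slice < ratio * 0x100000000;
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
const keepAll = shouldSample(traceId, 1.0);  // ratio 1.0 keeps everything
const keepNone = shouldSample(traceId, 0.0); // ratio 0.0 drops everything
```

Because the decision is a pure function of the trace ID, repeating it in another service yields the same answer.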
## A minimal instrumentation example
The exact APIs vary by language, but the shape is similar everywhere:

```javascript
// SpanStatusCode comes from the OpenTelemetry API package;
// tracerProvider, chargeCard, and request are assumed to exist already.
const { SpanStatusCode } = require('@opentelemetry/api');

const tracer = tracerProvider.getTracer('checkout-service');

async function checkout(request) {
  return tracer.startActiveSpan('checkout', async (span) => {
    try {
      span.setAttribute('app.user_tier', request.userTier);
      const result = await chargeCard(request);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'checkout failed' });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
This is still only one part of the system. On its own, it creates spans. The SDK, exporter, and possibly Collector are what move those spans out of the process and into an observability system.
## The mental model to keep
If you remember only one model, use this one:
- Instrumentation says what happened
- Context propagation keeps distributed work connected
- Resources and semantic conventions make telemetry understandable across services
- OTLP moves the data
- The Collector handles operational plumbing outside the app
- The backend stores and analyzes the result
That is OpenTelemetry in practice.
It is a standard telemetry pipeline, not a single library and not a single product. Once that distinction clicks, the rest of the ecosystem becomes much easier to reason about.