# OpenTelemetry 101: how it works
OpenTelemetry is easiest to misunderstand when it is described too abstractly.
It is not a monitoring backend. It is not only distributed tracing. It is not just a set of client libraries either.
At a practical level, OpenTelemetry is a standard way to generate, shape, and export telemetry data from software. That telemetry usually comes in three signals:
- traces
- metrics
- logs
The point is not simply “collect more data.” The point is to produce telemetry in a consistent format so that different services, runtimes, and backends can all speak the same language.
This post is a technical 101 on how that works.
## The short version
When you instrument an application with OpenTelemetry, a few things happen:
- your code or an instrumentation library records telemetry
- the OpenTelemetry SDK attaches metadata and decides what to keep
- the SDK exports that telemetry, often using OTLP
- the data goes either directly to a backend or through an OpenTelemetry Collector
- the backend stores, indexes, aggregates, and visualizes it
That is the whole system in one diagram:

```mermaid
flowchart LR
  A[Application code] --> B[Instrumentation]
  B --> C[OpenTelemetry SDK]
  C --> D[OTLP exporter]
  D --> E[Collector]
  E --> F[Observability backend]
```
The rest of this article is just filling in what each box is actually doing.
## The three signals
### Traces
A trace is the record of one request or workflow moving through a system.
The building block of a trace is a span. A span represents one unit of work: an HTTP request handler, a SQL query, a call to another service, or a queue publish.
Spans usually contain:
- a name
- start and end timestamps
- a parent span ID, unless the span is the root
- attributes
- events
- status
- links when needed
If Service A calls Service B and both are instrumented correctly, spans from both services can belong to the same trace. That is what makes distributed tracing useful: it shows the path of a request across process boundaries instead of only inside one application.
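The span fields listed above can be pictured as plain data. This is a sketch for illustration only, not the real OpenTelemetry API; the trace ID is hard-coded and `makeSpan` is a hypothetical helper:

```javascript
// A span modeled as plain data, mirroring the fields listed above.
function makeSpan(name, { traceId, parentSpanId = null } = {}) {
  return {
    name,
    traceId,
    spanId: Math.random().toString(16).slice(2, 18).padEnd(16, '0'),
    parentSpanId,            // null only for the root span
    startTime: Date.now(),
    endTime: null,
    attributes: {},          // key/value metadata, e.g. an HTTP route
    events: [],              // timestamped annotations within the span
    status: 'UNSET',
  };
}

// Service A starts a root span; a span in Service B joins the same trace
// by carrying the same traceId and pointing at its parent.
const root = makeSpan('GET /checkout', { traceId: '4bf92f3577b34da6a3ce929d0e0e4736' });
const child = makeSpan('SELECT orders', { traceId: root.traceId, parentSpanId: root.spanId });
```

The shared `traceId` plus the `parentSpanId` chain is all a backend needs to reassemble the tree.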
### Metrics
Metrics are measurements captured at runtime.
A metric answers questions like:
- how many requests happened?
- how long did they take?
- how many jobs are waiting in the queue right now?
OpenTelemetry metrics are produced through instruments such as:
- counters
- up-down counters
- histograms
- gauges or asynchronous instruments, depending on language support
Metrics are usually better than traces for long-term trend analysis, alerting, and capacity questions. Traces explain one request in detail. Metrics explain system behavior over time.
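To make the instrument kinds concrete, here are toy versions of a counter and a histogram. These are sketches of the data shapes involved, not the real SDK, which also handles aggregation and export:

```javascript
// Toy counter: a monotonically increasing sum.
function createCounter() {
  let value = 0;
  return { add: (n) => { value += n; }, get value() { return value; } };
}

// Toy histogram: one bucket per boundary, plus an overflow bucket.
function createHistogram(bounds) {
  const buckets = new Array(bounds.length + 1).fill(0);
  return {
    record(v) {
      let i = bounds.findIndex((b) => v <= b);
      if (i === -1) i = bounds.length;
      buckets[i] += 1;
    },
    get buckets() { return [...buckets]; },
  };
}

const requests = createCounter();                    // "how many requests happened?"
const latencyMs = createHistogram([100, 500, 1000]); // "how long did they take?"

requests.add(1);
requests.add(1);
latencyMs.record(42);   // lands in the <=100ms bucket
latencyMs.record(750);  // lands in the <=1000ms bucket
```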
### Logs
Logs in OpenTelemetry are log records emitted through a logging provider or bridged in from existing logging systems.
The important point is not that OpenTelemetry invents logging. The important point is that logs can be correlated with traces and spans. If a request has a trace ID and span ID, logs generated during that request can carry the same identifiers, which makes cross-signal investigation much easier.
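Here is what that correlation looks like as data. The IDs are hard-coded for illustration, and `logWithContext` is a hypothetical helper; in a real app the SDK or a logging bridge fills the IDs in from the active context:

```javascript
// A stand-in for the currently active span's context.
const activeSpan = {
  traceId: '4bf92f3577b34da6a3ce929d0e0e4736',
  spanId: '00f067aa0ba902b7',
};

// Emit a log record that carries the same identifiers as the span,
// so a backend can join the log to the trace.
function logWithContext(span, severity, message) {
  return {
    timestamp: new Date().toISOString(),
    severity,
    message,
    trace_id: span.traceId,
    span_id: span.spanId,
  };
}

const record = logWithContext(activeSpan, 'ERROR', 'payment declined');
```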
## API, SDK, and instrumentation are not the same thing
This is one of the most important distinctions in OpenTelemetry.
| Piece | What it does | Who usually uses it |
|---|---|---|
| API | Defines how telemetry is recorded | application code and libraries |
| SDK | Implements processing, sampling, resources, export | app owners and platform teams |
| Instrumentation library | Hooks into frameworks or libraries | usually installed by developers |
| Collector | Receives, processes, and forwards telemetry outside the app | platform and ops teams |
| Backend | Stores and analyzes telemetry | vendor or self-hosted platform |
### API
The API is the surface you call when you create spans, record measurements, or emit logs.
If you write:

```javascript
const span = tracer.startSpan('checkout');
```

that is API usage.
The API is intentionally small. It lets application code describe what happened, without deciding how that telemetry is exported or processed.
### SDK
The SDK does the operational work.
It provides:
- tracer providers and meter providers
- span processors
- sampling
- resources
- exporters
In other words, the API records telemetry. The SDK decides what to do with it.
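That division of labor can be sketched with a toy span processor and exporter. This is only an illustration of the idea behind batching and export, not the real `BatchSpanProcessor`:

```javascript
// Buffer finished spans, then hand batches to an exporter.
function createBatchProcessor(exporter, maxBatchSize) {
  const buffer = [];
  return {
    onEnd(span) {
      buffer.push(span);
      if (buffer.length >= maxBatchSize) this.forceFlush();
    },
    forceFlush() {
      if (buffer.length > 0) exporter.export(buffer.splice(0, buffer.length));
    },
  };
}

// A fake exporter that just collects what it is given.
const exported = [];
const exporter = { export: (batch) => exported.push(batch) };

const processor = createBatchProcessor(exporter, 2);
processor.onEnd({ name: 'span-1' });
processor.onEnd({ name: 'span-2' }); // hits maxBatchSize: batch is exported
processor.onEnd({ name: 'span-3' });
processor.forceFlush();              // flush the remainder on shutdown
```

The application code that created the spans never sees any of this; that separation is the whole point of the API/SDK split.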
### Instrumentation libraries and auto-instrumentation
Most teams do not want to hand-write spans for every HTTP framework, ORM, cache, or message queue.
That is where instrumentation libraries come in. They automatically create spans or metrics around known libraries. Auto-instrumentation takes that even further by attaching instrumentation to applications with little or no code changes.
This is why OpenTelemetry often feels larger than “just a tracing API.” In normal use, you are not only using the API. You are using:
- instrumentation packages
- the SDK
- exporters
- often the Collector
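The core trick behind most instrumentation libraries is wrapping a known function so a span is recorded around every call, with no change to the calling code. A minimal sketch, where `db` and the span shape are stand-ins for illustration:

```javascript
const spans = [];

// Replace obj[method] with a version that records a span around each call.
function instrument(obj, method, spanName) {
  const original = obj[method];
  obj[method] = function (...args) {
    const span = { name: spanName, startTime: Date.now(), endTime: null };
    try {
      return original.apply(this, args);
    } finally {
      span.endTime = Date.now();
      spans.push(span);
    }
  };
}

// A fake database client standing in for a real library.
const db = { query: (sql) => `rows for: ${sql}` };

instrument(db, 'query', 'db.query');
const rows = db.query('SELECT 1'); // caller is unchanged, span is recorded
```

Real instrumentation packages do the same thing with far more care: they propagate context, follow semantic conventions, and handle async boundaries.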
## Resources and semantic conventions
Two pieces make telemetry usable across systems.
### Resources
A resource describes what produced the telemetry.
Typical resource attributes include:
- `service.name`
- `service.version`
- `deployment.environment`
- host, container, or cloud metadata
Without resource data, a span is just an isolated event. With resource data, it becomes “a span from checkout-service version 1.8 running in production.”
### Semantic conventions
Semantic conventions standardize attribute names and meanings.
For example, instead of every team inventing different keys for HTTP route, method, or database system, OpenTelemetry defines common names for these concepts. That standardization is what makes dashboards, queries, and cross-service comparison sane.
If two services both emit HTTP spans but one uses method and the other uses httpVerb, the data is harder to query consistently. Semantic conventions exist to prevent that drift.
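Here is the difference in practice. The attribute keys below follow OpenTelemetry's HTTP semantic conventions (check the current spec for the exact names your SDK version emits); the span objects themselves are simplified stand-ins:

```javascript
// Two services describing different HTTP calls with the SAME attribute keys.
const spanFromServiceA = {
  name: 'GET /users/:id',
  attributes: {
    'http.request.method': 'GET',
    'http.route': '/users/:id',
    'http.response.status_code': 200,
  },
};

const spanFromServiceB = {
  name: 'GET /orders/:id',
  attributes: {
    'http.request.method': 'GET',
    'http.route': '/orders/:id',
    'http.response.status_code': 200,
  },
};

// Because both use the same key, one query works across both services.
const methods = [spanFromServiceA, spanFromServiceB]
  .map((s) => s.attributes['http.request.method']);
```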
## Context propagation is what makes distributed tracing work
OpenTelemetry context propagation is the mechanism that moves trace context between services and processes.
When one service calls another, the current trace context is injected into the carrier for that protocol. In HTTP, that usually means headers. On the receiving side, the downstream service extracts that context and creates a new child span in the same trace.
In practice:
- Service A starts a span
- Service A injects trace context into the outbound request
- Service B extracts the context
- Service B starts a child span using the incoming context
By default, OpenTelemetry uses the W3C Trace Context format, which is why you often see a traceparent header.
Context propagation is also how different signals can be correlated. A log record or metric emitted during a request can be associated with the active trace context.
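The `traceparent` header mentioned above can be built and parsed with plain string handling. A real app uses the SDK's propagator; this sketch only shows what travels on the wire: version `00`, a 32-hex-char trace ID, a 16-hex-char parent span ID, and trace flags (`01` means sampled):

```javascript
function buildTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const [version, traceId, parentSpanId, flags] = header.split('-');
  return { version, traceId, parentSpanId, sampled: flags === '01' };
}

// Service A injects the context into the outbound request...
const header = buildTraceparent(
  '4bf92f3577b34da6a3ce929d0e0e4736', '00f067aa0ba902b7', true);

// ...and Service B extracts it to start a child span in the same trace.
const ctx = parseTraceparent(header);
```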
## OTLP is the wire format, not the destination
OTLP stands for the OpenTelemetry Protocol.
It is the general-purpose protocol OpenTelemetry defines for delivering telemetry between clients, collectors, and backends. It describes:
- how data is encoded
- how it is transported
- how export requests and responses work
OTLP commonly runs over:
- gRPC
- HTTP
For HTTP, the default paths are:
- `/v1/traces`
- `/v1/metrics`
- `/v1/logs`
This is a common source of confusion: OTLP is not a backend. It is the delivery protocol.
An application can export OTLP to:
- a local Collector
- a remote Collector
- a backend that accepts OTLP directly
## Where the Collector fits
You do not always need the OpenTelemetry Collector.
For small systems, local development, or quick experiments, sending telemetry directly from the SDK to a backend is often good enough.
But the Collector becomes useful very quickly because it can centralize operational concerns that you do not want inside every application:
- retries
- batching
- fan-out to multiple backends
- authentication or TLS handling
- enrichment
- filtering or redaction
- protocol translation
A minimal Collector pipeline looks like this:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  debug:
  otlp:
    endpoint: backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug, otlp]
```
The app sends OTLP to the Collector. The Collector receives it, batches it, and exports it onward.
That pattern is operationally cleaner than teaching every service how to talk to every backend.
## What happens during one request
Let’s walk through one HTTP request in a microservice setup.
```mermaid
sequenceDiagram
  participant U as User
  participant F as Frontend
  participant C as Checkout
  participant DB as Database
  participant Col as Collector
  participant B as Backend
  U->>F: GET /checkout
  F->>C: HTTP request with traceparent
  C->>DB: SQL query
  C-->>F: response
  F-->>U: response
  F->>Col: spans, metrics, logs via OTLP
  C->>Col: spans, metrics, logs via OTLP
  Col->>B: processed telemetry
```
Under the hood:
- the frontend receives the incoming request and starts a root span
- the frontend propagates context to the checkout service
- the checkout service extracts that context and creates a child span
- the database client instrumentation creates another child span for the query
- metrics are recorded during the same operations
- logs can be correlated with the active span
- the SDK batches and exports telemetry
- the Collector optionally processes and forwards it
- the backend stores and visualizes it
That is the practical value of OpenTelemetry: one request can be understood across multiple components using shared context and shared conventions.
## Sampling, cost, and realism
Not every trace should be kept forever.
Tracing is detailed, and detail costs money. The SDK can apply head sampling, which means it decides near span creation time whether a trace should be recorded and exported. Some systems also use tail-based sampling later in the pipeline, often in the Collector or backend, after more context is known.
Metrics are usually cheaper and easier to retain long-term. Logs vary a lot: structured logs can be extremely useful, but high-volume logs can become expensive quickly.
A healthy OpenTelemetry setup is not “collect everything blindly.” It is:
- keep enough detail to debug
- keep enough metrics to operate
- correlate logs when useful
- control cost intentionally
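Head sampling by trace ID ratio can be sketched in a few lines. This is an illustration of the idea, not the real `TraceIdRatioBasedSampler`: the decision is made once, near root-span creation, and is deterministic per trace ID so every participant in the trace agrees:

```javascript
// Keep a trace iff the low 8 hex chars of its ID, read as a number
// in [0, 2^32), fall below ratio * 2^32.
function shouldSample(traceId, ratio) {
  const slice = parseInt(traceId.slice(-8), 16);
  return slice < ratio * 0x100000000;
}

const traceId = '4bf92f3577b34da6a3ce929d0e0e4736';
const keepAll = shouldSample(traceId, 1.0);  // ratio 1.0 keeps everything
const keepNone = shouldSample(traceId, 0.0); // ratio 0.0 drops everything
```

Because the decision is a pure function of the trace ID, repeating it in another service yields the same answer.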
## A minimal instrumentation example
The exact APIs vary by language, but the shape is similar everywhere:

```javascript
// SpanStatusCode comes from the OpenTelemetry API package;
// tracerProvider, chargeCard, and request are assumed to exist already.
const { SpanStatusCode } = require('@opentelemetry/api');

const tracer = tracerProvider.getTracer('checkout-service');

async function checkout(request) {
  return tracer.startActiveSpan('checkout', async (span) => {
    try {
      span.setAttribute('app.user_tier', request.userTier);
      const result = await chargeCard(request);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'checkout failed' });
      throw error;
    } finally {
      span.end();
    }
  });
}
```
This is still only one part of the system. On its own, it creates spans. The SDK, exporter, and possibly Collector are what move those spans out of the process and into an observability system.
## The mental model to keep
If you remember only one model, use this one:
- Instrumentation says what happened
- Context propagation keeps distributed work connected
- Resources and semantic conventions make telemetry understandable across services
- OTLP moves the data
- The Collector handles operational plumbing outside the app
- The backend stores and analyzes the result
That is OpenTelemetry in practice.
It is a standard telemetry pipeline, not a single library and not a single product. Once that distinction clicks, the rest of the ecosystem becomes much easier to reason about.