Grafana Tempo is an open source, easy-to-use, and high-scale distributed tracing backend. Tempo is cost-efficient, requiring only object storage to operate, and is deeply integrated with Grafana, Prometheus, and Loki. Tempo can ingest common open source tracing protocols, including Jaeger, Zipkin, and OpenTelemetry.

Why distributed tracing?

Sometimes we encounter an issue that metrics and logs alone can’t pinpoint.

Metrics are good for aggregations but lack fine-grained information. Logs are good at revealing what happened sequentially in an application, or even across applications, but they don’t show how a single request behaves inside a service.

This is where tracing comes in. Distributed tracing is a way to track and log a single request as it crosses through all of the services in your infrastructure.
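To make the idea concrete, here is a minimal sketch of how a trace can be modeled: a set of spans that share one trace ID, each pointing at its parent span. The `Span` class, its fields, and the service names are simplified assumptions for illustration, not the actual data model of Tempo or OpenTelemetry.

```kotlin
// Simplified, hypothetical span model; real spans carry many more fields.
data class Span(
    val traceId: String,        // shared by every span of one request
    val spanId: String,         // unique per unit of work
    val parentSpanId: String?,  // null for the root span
    val service: String,
    val name: String,
    val durationMs: Long,
)

// Renders the span tree as indented lines, children below their parent.
fun renderTrace(spans: List<Span>): List<String> {
    val childrenByParent = spans.groupBy { it.parentSpanId }
    val lines = mutableListOf<String>()
    fun visit(span: Span, depth: Int) {
        lines.add("  ".repeat(depth) + "${span.service}: ${span.name} (${span.durationMs} ms)")
        childrenByParent[span.spanId].orEmpty().forEach { visit(it, depth + 1) }
    }
    childrenByParent[null].orEmpty().forEach { visit(it, 0) }
    return lines
}

fun main() {
    // Hypothetical request crossing three services.
    val trace = listOf(
        Span("t1", "a", null, "gateway", "POST /transfer", 120),
        Span("t1", "b", "a", "payments", "authorize", 80),
        Span("t1", "c", "b", "ledger", "insert entry", 30),
    )
    renderTrace(trace).forEach(::println)
}
```

The indentation in the output reflects the call hierarchy, which is the same information a trace view presents as nested bars on a timeline.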

Why Grafana Tempo?

Tempo speeds up debugging and troubleshooting by letting you move quickly from metrics to the relevant traces and to the specific logs that recorded the issue.

Tempo lets you scale tracing with low operational cost and complexity. Its only dependency is object storage, and it is built around retrieving traces by trace ID. Unlike other tracing backends, Tempo can reach massive scale without a difficult-to-manage Elasticsearch or Cassandra cluster.

See Get started with Grafana Tempo for more details.

How does Grafana Tempo work?

Deploying Tempo

Tempo can be easily deployed through a number of tools.

One example is Helm, using the charts published at https://grafana.github.io/helm-charts.

See our configured tempo chart for details.

Client instrumentation

To build a tracing pipeline, you need four major components: client instrumentation, pipeline, backend, and visualization.

Client instrumentation is the first building block of a functioning distributed tracing pipeline. It is the process of adding instrumentation points to the application that create and offload spans.

Most of the popular client instrumentation frameworks have SDKs in the most commonly used programming languages. You should pick one according to your application needs.

Using OpenTelemetry instrumentation for Java

OpenTelemetry instrumentation for Java provides a Java agent JAR that can be attached to any Java 8+ application and dynamically injects bytecode to capture telemetry from a number of popular libraries and frameworks.

You can export the telemetry data in a variety of formats. You can also configure the agent and exporter via command line arguments or environment variables. The net result is the ability to gather telemetry data from a Java application without code changes.
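The two configuration styles map onto each other mechanically. As far as the agent's documentation describes it, a system property such as `otel.exporter.otlp.endpoint` corresponds to the environment variable obtained by upper-casing the name and replacing dots and dashes with underscores. A minimal sketch of that rule follows; the helper name `toEnvVarName` is made up for illustration and is not part of the agent.

```kotlin
// Converts an agent system property name to its environment variable form:
// uppercase, with '.' and '-' replaced by '_'.
// Hypothetical helper for illustration; the agent does this mapping itself.
fun toEnvVarName(systemProperty: String): String =
    systemProperty.uppercase().replace('.', '_').replace('-', '_')

fun main() {
    listOf(
        "otel.traces.exporter",
        "otel.metrics.exporter",
        "otel.exporter.otlp.endpoint",
        "otel.resource.attributes",
    ).forEach { property ->
        println("-D$property=...  <->  ${toEnvVarName(property)}=...")
    }
}
```

Environment variables are usually the more convenient choice in containers, because they can be injected by the deployment without touching the JVM command line.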

Adding dependencies and configuration

To enable automatic instrumentation, add one or more dependencies. How dependencies are added is language-specific.

Because we use Kotlin with Gradle, update the target micro-service’s Gradle file (build.gradle.kts) with the appropriate dependencies and Jib configuration.

dependencies {
    // others omitted for brevity
    
    // tracing
    runtimeOnly("io.opentelemetry.javaagent:opentelemetry-javaagent:1.19.0")
}

jib {
    // others omitted for brevity
    
    container {
        jvmFlags = listOf(
            "-javaagent:/app/libs/opentelemetry-javaagent-1.19.0.jar"
        )
    }
}

See response-message-manager/build.gradle.kts for an example implementation.

The rest is already preconfigured in the Kotlin base chart of our micro-services.

# -- Java Agent OpenTelemetry Integration
tracing:
  # -- Enable creation of OTEL env variables
  enabled: true
  # -- OTEL trace exporter
  traces_exporter: otlp
  # -- OTEL metrics exporter
  metrics_exporter: none
  # -- OTEL exporter endpoint
  endpoint: https://tempo.monitoring.dev.safibank.online

The chart’s ConfigMap template turns these values into environment variables for the agent:

apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ include "kotlin.fullname" . }}
  labels:
    {{- include "kotlin.labels" . | nindent 4 }}
data:
  {{- range $key, $value := .Values.env }}
  {{ $key }}: {{ $value | quote }}
  {{- end }}
  TM_KAFKA_CONSUMER_GROUP: {{ .Release.Name | quote }}
  {{- if .Values.tracing.enabled }}
  OTEL_EXPORTER_OTLP_ENDPOINT: {{ .Values.tracing.endpoint | quote }}
  OTEL_METRICS_EXPORTER: {{ .Values.tracing.metrics_exporter | quote }}
  OTEL_RESOURCE_ATTRIBUTES: "service.name={{ .Release.Name }}"
  OTEL_TRACES_EXPORTER: {{ .Values.tracing.traces_exporter | quote }}
  {{- end }}

See the tracing section of SaFiMono/devops/charts/kotlin/README.md for details.
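For a concrete picture of what the template produces, the sketch below mirrors its tracing branch in plain Kotlin. `TracingValues` and `otelEnv` are hypothetical names that exist only in this example, and using `response-message-manager` as the Helm release name is an assumption.

```kotlin
// Hypothetical mirror of the chart's `tracing` values; not part of the chart itself.
data class TracingValues(
    val enabled: Boolean = true,
    val tracesExporter: String = "otlp",
    val metricsExporter: String = "none",
    val endpoint: String = "https://tempo.monitoring.dev.safibank.online",
)

// Mirrors the `if .Values.tracing.enabled` block of the ConfigMap template.
fun otelEnv(releaseName: String, tracing: TracingValues = TracingValues()): Map<String, String> =
    if (!tracing.enabled) {
        emptyMap()
    } else {
        mapOf(
            "OTEL_EXPORTER_OTLP_ENDPOINT" to tracing.endpoint,
            "OTEL_METRICS_EXPORTER" to tracing.metricsExporter,
            // The release name becomes the service name shown in Grafana.
            "OTEL_RESOURCE_ATTRIBUTES" to "service.name=$releaseName",
            "OTEL_TRACES_EXPORTER" to tracing.tracesExporter,
        )
    }

fun main() {
    // Release name is an assumption for this example.
    otelEnv("response-message-manager").forEach { (key, value) -> println("$key=$value") }
}
```

Setting `tracing.enabled` to `false` yields no OTEL variables at all, which matches the template's conditional block.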

Viewing traces and visualization

Grafana is the last building block of a tracing pipeline. It has a built-in Tempo data source that you can use to query Tempo and visualize traces.

View a trace by ID

The most basic functionality is to visualize a trace using its ID. If you have a trace ID (the identifier for the entire trace), you can jump directly to that trace. You can query and display traces from Tempo via Explore.

Select the Trace ID tab and enter the ID to view it. This functionality is enabled by default.

See here for the trace view explanation.
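Outside Grafana, the same lookup can be done against Tempo's HTTP API. The sketch below assumes OpenTelemetry-style trace IDs (16 bytes written as 32 hexadecimal characters, not all zeros) and Tempo's `/api/traces/<traceid>` endpoint. The base URL `http://tempo:3200` is a placeholder, and the function names are made up for this example.

```kotlin
// OpenTelemetry-style trace ID: 32 hex characters, at least one of them non-zero.
val TRACE_ID_PATTERN = Regex("[0-9a-fA-F]{32}")

fun isValidTraceId(id: String): Boolean =
    TRACE_ID_PATTERN.matches(id) && id.any { it != '0' }

// Builds the URL for fetching a single trace by ID from Tempo's HTTP API.
// Hypothetical helper; the base URL depends on your deployment.
fun traceUrl(tempoBaseUrl: String, traceId: String): String {
    require(isValidTraceId(traceId)) { "not a 32-character hex trace ID: $traceId" }
    return "${tempoBaseUrl.trimEnd('/')}/api/traces/${traceId.lowercase()}"
}

fun main() {
    // Placeholder host and an example trace ID.
    println(traceUrl("http://tempo:3200", "4bf92f3577b34da6a3ce929d0e0e4736"))
}
```

Validating the ID before sending the request gives a clearer error than a failed lookup when an ID was truncated while copying it from a log line.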

View by service, span and others

You can search traces by originating service, duration range, span name, or process-level attributes included in your application’s instrumentation, such as HTTP status code and customer ID.

From the Search tab, you can filter by service name, span name, tags, and minimum and maximum duration, and you can limit the number of search results.
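The Search tab filters correspond closely to query parameters of Tempo's search endpoint. The sketch below builds such a URL, assuming the `/api/search` endpoint with `tags` (logfmt-encoded), `minDuration`, `maxDuration`, and `limit` parameters. `TraceSearch`, `searchUrl`, and the base URL are hypothetical, and tag values containing spaces are not handled.

```kotlin
import java.net.URLEncoder

// Hypothetical holder for the filters offered on the Search tab.
data class TraceSearch(
    val tags: Map<String, String> = emptyMap(),  // e.g. "service.name" to "payments"
    val minDuration: String? = null,             // e.g. "100ms"
    val maxDuration: String? = null,             // e.g. "2s"
    val limit: Int? = null,
)

fun searchUrl(tempoBaseUrl: String, search: TraceSearch): String {
    val params = mutableListOf<Pair<String, String>>()
    if (search.tags.isNotEmpty()) {
        // Tags are sent as logfmt: key=value pairs separated by spaces.
        params.add("tags" to search.tags.entries.joinToString(" ") { "${it.key}=${it.value}" })
    }
    search.minDuration?.let { params.add("minDuration" to it) }
    search.maxDuration?.let { params.add("maxDuration" to it) }
    search.limit?.let { params.add("limit" to it.toString()) }

    val query = params.joinToString("&") { (key, value) ->
        "$key=${URLEncoder.encode(value, "UTF-8")}"
    }
    val base = "${tempoBaseUrl.trimEnd('/')}/api/search"
    return if (query.isEmpty()) base else "$base?$query"
}

fun main() {
    // Placeholder host; finds slow traces from a hypothetical "payments" service.
    val search = TraceSearch(tags = mapOf("service.name" to "payments"), minDuration = "100ms", limit = 20)
    println(searchUrl("http://tempo:3200", search))
}
```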

View by Service Graph (Node graph)

A service graph is a visual representation of the interrelationships between various services. Service graphs help to understand the structure of a distributed system, and the connections and dependencies between its components.

Service graphs infer the topology of a distributed system, provide a high-level overview of the health of your system, and give a historic view of its topology. They show error rates and latencies, among other relevant data.
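The inference step can be illustrated with a small sketch: whenever a span's parent belongs to a different service, that pair forms an edge, and the child spans on that edge provide the request count, error count, and latency. This is a deliberately simplified model with hypothetical types (`GraphSpan`, `EdgeStats`); how Tempo actually builds its service graphs is more involved.

```kotlin
// Hypothetical, simplified span: only the fields needed to infer edges.
data class GraphSpan(
    val spanId: String,
    val parentSpanId: String?,
    val service: String,
    val durationMs: Long,
    val isError: Boolean = false,
)

data class EdgeStats(val requests: Int, val errors: Int, val avgDurationMs: Double)

// An edge caller -> callee is counted whenever a span's parent lives in another service.
fun serviceGraph(spans: List<GraphSpan>): Map<Pair<String, String>, EdgeStats> {
    val byId = spans.associateBy { it.spanId }
    return spans
        .mapNotNull { child ->
            val parent = child.parentSpanId?.let { byId[it] }
            if (parent != null && parent.service != child.service) {
                (parent.service to child.service) to child
            } else {
                null  // root spans and in-service calls do not create edges
            }
        }
        .groupBy({ it.first }, { it.second })
        .mapValues { (_, calls) ->
            EdgeStats(
                requests = calls.size,
                errors = calls.count { it.isError },
                avgDurationMs = calls.map { it.durationMs }.average(),
            )
        }
}

fun main() {
    val spans = listOf(
        GraphSpan("a", null, "gateway", 100),
        GraphSpan("b", "a", "payments", 10),
        GraphSpan("c", "a", "payments", 30, isError = true),
        GraphSpan("d", "b", "payments", 5),  // internal call, no edge
    )
    serviceGraph(spans).forEach { (edge, stats) -> println("${edge.first} -> ${edge.second}: $stats") }
}
```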

Select Service Graph, then click the Run query button (upper right). Select a node for more details.

Clicking a node on the service graph reveals specific details based on your selection, as shown below.

References