Overview

Tracing is a tool for understanding the performance and behavior of distributed systems. It lets developers and operations teams follow a request or transaction as it travels across multiple services and applications. As a core part of observability, it shows how requests flow through the system, how long each step takes, and where the bottlenecks are, which helps teams identify issues, optimize performance, and improve the user experience.

Strategy & Guidelines

Here are some key strategies for tracing our system:

  1. Tracing scope: We currently trace three major flows:

    1. REST API calls between microservices

    2. Kafka messages, from producing to consuming

    3. Database operations

  2. Instrumentation: Instrument the code with OpenTelemetry so that it generates trace data.

    1. Auto-instrumentation: trace data is generated automatically through library support (the OpenTelemetry Java Agent).

    2. Manual instrumentation: when a case needs a customized tracing flow, we generate the trace data manually in code.

  3. Collect and store trace data: Trace data is collected and stored in a central location. We use Tempo as our tracing backend.

  4. Trace visualization and analysis: We use Tempo and Grafana to visualize and analyze trace data.

  5. Logging with tracing: Log messages should include tracing information as log context.
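To make the REST, Kafka, and logging points above more concrete, here is a minimal sketch of how trace context can travel between services and end up in log context. It assumes the W3C Trace Context traceparent format, which is OpenTelemetry's default propagation format and is carried as an HTTP header on REST calls and as a record header on Kafka messages. The class TraceContextSketch and its methods are hypothetical names used only for illustration; this is not how the Java agent is implemented and not our production code. The log keys trace_id, span_id, and trace_flags are an assumption modeled on the agent's logger MDC instrumentation and should be checked against the actual logging setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenTelemetry.
final class TraceContextSketch {

    // W3C Trace Context, version 00:
    // "00-<32 hex trace id>-<16 hex parent span id>-<2 hex flags>"
    private static final Pattern TRACEPARENT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    public final String traceId;
    public final String spanId;
    public final boolean sampled;

    private TraceContextSketch(String traceId, String spanId, boolean sampled) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.sampled = sampled;
    }

    /** Parses an incoming traceparent value; returns null if it is malformed. */
    public static TraceContextSketch parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = TRACEPARENT.matcher(header.trim());
        if (!m.matches()) {
            return null;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // The spec treats all-zero trace and span IDs as invalid.
        if (traceId.matches("0+") || spanId.matches("0+")) {
            return null;
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) == 0x01;
        return new TraceContextSketch(traceId, spanId, sampled);
    }

    /**
     * Renders the value for an outgoing REST header or Kafka record header.
     * A real propagator writes the ID of the current span here; this sketch
     * simply re-emits the span ID it holds and keeps only the sampled flag.
     */
    public String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Key/value pairs to attach to each log message (key names are assumed). */
    public Map<String, String> logContext() {
        Map<String, String> ctx = new LinkedHashMap<>();
        ctx.put("trace_id", traceId);
        ctx.put("span_id", spanId);
        ctx.put("trace_flags", sampled ? "01" : "00");
        return ctx;
    }
}
```

A service would call TraceContextSketch.parse on the incoming header value, put the logContext() entries into its logging context for the duration of the request, and send toTraceparent() on any downstream call. With auto-instrumentation none of this needs to be written by hand; the sketch only shows what is being carried so that a trace ID found in a log line can be looked up in Tempo.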

Tools

OpenTelemetry

Please refer to Distributed tracing with OpenTelemetry for more details.

Tempo

Please refer to Tempo for more details.