SaFi Bank Space : Distributed tracing with OpenTelemetry



What is distributed tracing?

Micro-services form a typical distributed system, with all the benefits of a distributed design. However, troubleshooting a request can be challenging, because its journey may involve a sequence of calls across multiple micro-services.

Distributed tracing is a method of tracking application requests as they flow from frontend devices to backend micro-services (including databases, message queues, etc.). We can use distributed tracing to troubleshoot requests that exhibit high latency or errors, and to pinpoint any failures or performance bottlenecks that occurred along the way.
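To make the idea concrete, here is a minimal sketch of how a trace can be modelled as a set of spans that share one trace ID, and how a bottleneck can be located from them. The `SpanRecord` type, the service names and the timings are all hypothetical and are not OpenTelemetry classes or real data; the sketch also assumes that child spans do not overlap in time.

```kotlin
// Illustrative model only: these types are hypothetical, not OpenTelemetry classes.
data class SpanRecord(
    val traceId: String,        // shared by every span of one request
    val spanId: String,
    val parentSpanId: String?,  // null for the root span
    val service: String,
    val startMs: Long,
    val endMs: Long,
) {
    val durationMs: Long get() = endMs - startMs
}

// Time spent in a span itself, excluding the time covered by its direct children.
// Assumes children run sequentially (no overlap), which keeps the sketch simple.
fun selfTimeMs(span: SpanRecord, all: List<SpanRecord>): Long =
    span.durationMs - all.filter { it.parentSpanId == span.spanId }.sumOf { it.durationMs }

// The span with the largest self time is the most likely bottleneck.
fun bottleneck(trace: List<SpanRecord>): SpanRecord? =
    trace.maxByOrNull { selfTimeMs(it, trace) }

// Made-up sample: one request passing through three hops.
val sampleTrace = listOf(
    SpanRecord("t1", "a", null, "api-gateway", 0, 500),
    SpanRecord("t1", "b", "a", "account-service", 20, 480),
    SpanRecord("t1", "c", "b", "postgres", 40, 440),
)
```

With this sample, `bottleneck(sampleTrace)` returns the `postgres` span: it accounts for 400 ms of its own, while the two callers add only 40 ms and 60 ms.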

Overview of OpenTelemetry

OpenTelemetry is an open-source observability framework. It offers vendor-neutral APIs and software development kits (SDKs) to instrument, generate, collect, and export telemetry data (metrics, logs, and traces), which helps us understand the performance and behavior of a distributed system.

You can read more here:

https://opentelemetry.io/docs/concepts/what-is-opentelemetry/

https://www.dynatrace.com/news/blog/what-is-opentelemetry-2/

Architecture of our solution

There are three steps in the tracing data flow:

  1. Gather (instrument) trace data on the application side

  2. Process / transform the trace data (optional)

  3. Store the trace data in the telemetry backend

Instrumenting

Our micro-services are built with Kotlin and run on the JVM. OpenTelemetry provides several options for instrumenting Java-based applications:

  1. opentelemetry-java: Components for manual instrumentation, including the API and SDK as well as extensions and the OpenTracing shim.

  2. opentelemetry-java-instrumentation: Built on top of opentelemetry-java and provides a Java agent JAR that can be attached to any Java 8+ application and dynamically injects bytecode to capture telemetry from a number of popular libraries and frameworks.

  3. opentelemetry-java-contrib: Provides helpful libraries and standalone OpenTelemetry-based utilities that don’t fit the express scope of the OpenTelemetry Java or Java Instrumentation projects. For example, JMX metric gathering.

Our solution uses opentelemetry-java-instrumentation for auto-instrumentation, and opentelemetry-java for manual instrumentation where we need custom traces or metrics.

The core idea of the solution is to put the auto-instrumentation inside a Java agent that runs alongside our micro-service application. No code change is needed in the application; the tracing ability is added dynamically by the OpenTelemetry Java agent.

Refer to https://www.baeldung.com/java-instrumentation for more details about Java agents.

Store tracing data

We use Grafana Tempo (https://grafana.com/oss/tempo/) to store our tracing data. It has native support for the OpenTelemetry standard and protocol, and we can use Grafana to visualize the tracing data stored in Tempo.

A high-level diagram of the solution architecture is available in the attachments (OpenTelemetry Java Agent).

Implementation details

OpenTelemetry Java Agent setup

We added an option to our base Kotlin Helm chart to turn auto-tracing on or off for a micro-service application.

# -- Java Agent OpenTelemetry Integration
tracing:
  # -- Enable creation of OTEL env variables
  enabled: true
  # -- OTEL trace exporter
  traces_exporter: otlp
  # -- OTEL metrics exporter
  metrics_exporter: none
  # -- OTEL exporter endpoint
  endpoint: https://tempo.monitoring.dev.safibank.online

Note that the metrics exporter is turned off because we gather metrics with Micronaut Micrometer (https://micronaut-projects.github.io/micronaut-micrometer/latest/guide/).
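As an illustration of what "creation of OTEL env variables" could amount to, the sketch below maps the values above to the standard environment variables read by the OpenTelemetry Java agent. The `TracingValues` type and the `otelEnv` function are hypothetical; the actual Helm template may name or render these differently, and the behavior when `enabled` is false is an assumption.

```kotlin
// Hypothetical mirror of the `tracing` block in the Helm values; not the chart's actual template.
data class TracingValues(
    val enabled: Boolean,
    val tracesExporter: String,
    val metricsExporter: String,
    val endpoint: String,
)

// Returns the environment variables the OpenTelemetry Java agent reads for these settings.
// Assumption: when tracing is disabled, no OTEL variables are produced at all.
fun otelEnv(values: TracingValues): Map<String, String> =
    if (!values.enabled) {
        emptyMap()
    } else {
        mapOf(
            "OTEL_TRACES_EXPORTER" to values.tracesExporter,
            "OTEL_METRICS_EXPORTER" to values.metricsExporter,
            "OTEL_EXPORTER_OTLP_ENDPOINT" to values.endpoint,
        )
    }
```

For the values shown above, this yields `OTEL_TRACES_EXPORTER=otlp`, `OTEL_METRICS_EXPORTER=none` and `OTEL_EXPORTER_OTLP_ENDPOINT=https://tempo.monitoring.dev.safibank.online`.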

Request headers from the front-end

The front-end app sends the following W3C Trace Context headers with each request:

  1. traceparent carries the root traceId and spanId generated by the front-end app. Refer to https://www.w3.org/TR/trace-context/#traceparent-header for details about the format.

  2. tracestate carries custom data that is propagated along the way through the trace context in OpenTelemetry. Refer to https://www.w3.org/TR/trace-context/#tracestate-header for details about the format.
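To show what a traceparent value contains, here is a minimal parsing sketch based on the W3C format `version-traceid-parentid-flags`. The `TraceParent` type and `parseTraceParent` function are hypothetical helpers for illustration only; in our services the OpenTelemetry Java agent handles this header, so no such code is needed in the application. The sketch assumes the four-field layout defined for version 00.

```kotlin
// Hypothetical holder for the four traceparent fields; not an OpenTelemetry type.
data class TraceParent(
    val version: String,
    val traceId: String,   // 32 lowercase hex characters
    val parentId: String,  // 16 lowercase hex characters (the caller's span ID)
    val sampled: Boolean,
)

private val TRACEPARENT_PATTERN =
    Regex("^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

// Parses a traceparent value; returns null when the value is malformed.
fun parseTraceParent(header: String): TraceParent? {
    val match = TRACEPARENT_PATTERN.matchEntire(header.trim()) ?: return null
    val (version, traceId, parentId, flags) = match.destructured
    // Version "ff" is forbidden, and all-zero trace or parent IDs are invalid.
    if (version == "ff" || traceId.all { it == '0' } || parentId.all { it == '0' }) return null
    // Bit 0 of trace-flags is the "sampled" flag.
    val sampled = (flags.toInt(16) and 0x01) == 1
    return TraceParent(version, traceId, parentId, sampled)
}
```

For the example value from the W3C specification, `00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01`, the function returns trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, parent ID `00f067aa0ba902b7` and `sampled = true`.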

Logging with tracing

The common logger automatically logs tracing-related data. It relies on the MDC support of the OpenTelemetry Java agent, which adds trace_id, span_id and trace_flags to the logging context: https://github.com/open-telemetry/opentelemetry-java-instrumentation/blob/main/docs/logger-mdc-instrumentation.md

The logger also automatically logs any tracestate data sent by the front-end (such as customerId / accountId).

Implementation details

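import ch.qos.logback.classic.spi.ILoggingEvent
import io.opentelemetry.api.trace.Span

class GCPConsoleJsonLayout : StackdriverJsonLayout() {
    override fun addCustomDataToJsonMap(map: MutableMap<String, Any>, event: ILoggingEvent) {
        ...
        addTraceStateData(map)
    }

    // Copies every "key:value" pair found in the tracestate of the current span into the log entry.
    private fun addTraceStateData(map: MutableMap<String, Any>) {
        Span.current().spanContext.traceState.forEach { _, value ->
            // Each tracestate value holds items separated by ";", each item being "key:value".
            value.split(";").forEach { item ->
                // limit = 2 keeps any further ":" inside the value; items without ":" are skipped.
                val splits = item.split(":", limit = 2)
                if (splits.size == 2 && splits[0].isNotBlank()) {
                    map[splits[0]] = splits[1]
                }
            }
        }
    }
}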

Please refer to the common logger for more details: https://github.com/SafiBank/SaFiMono/tree/main/common/utils
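As a worked example of the parsing idea used in the logger, the sketch below turns one tracestate value into key/value pairs as a pure function, without any logging or OpenTelemetry dependency. The function name `parseTraceStateValue` and the sample IDs are hypothetical; the `key:value;key:value` layout is inferred from the logger code above and is our own convention inside the value, not something the W3C specification defines.

```kotlin
// Hypothetical helper: splits one tracestate value such as "customerId:123;accountId:456"
// into key/value pairs. Items without a ":" separator or with a blank key are ignored.
fun parseTraceStateValue(value: String): Map<String, String> =
    value.split(";")
        .map { it.split(":", limit = 2) }  // limit = 2 keeps any further ":" inside the value
        .filter { it.size == 2 && it[0].isNotBlank() }
        .associate { (key, itemValue) -> key.trim() to itemValue.trim() }
```

For instance, a front-end request carrying `tracestate: safi=customerId:123;accountId:456` (the `safi` key is made up here) would lead to `parseTraceStateValue("customerId:123;accountId:456")`, which returns a map with `customerId` set to `123` and `accountId` set to `456`, so both fields would appear in the JSON log entry.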

Attachments:

~drawio~557058:229b5867-a6cd-46a1-9572-1eb4b6e6294b~OpenTelemetry Java Agent.tmp (application/vnd.jgraph.mxfile)
OpenTelemetry Java Agent (application/vnd.jgraph.mxfile)
OpenTelemetry Java Agent.png (image/png)
image-20221112-232151.png (image/png)
~OpenTelemetry Java Agent.tmp (application/vnd.jgraph.mxfile)
Screen Shot 2022-11-14 at 14.52.14-20221114-065222.png (image/png)
Screen Shot 2022-11-14 at 14.55.28-20221114-065534.png (image/png)
Screen Shot 2022-11-14 at 14.58.58-20221114-065904.png (image/png)