SaFi Bank Space : Zipkin

Zipkin is a distributed tracing system implemented in Java, with an OpenTracing-compatible API. It helps gather the timing data needed to troubleshoot latency problems in service architectures. Features include both the collection and lookup of this data.

If you have a trace ID in a log file, you can jump directly to it. Otherwise, you can query based on attributes such as service, operation name, tags and duration. Some interesting data is summarized for you, such as:

The percentage of time spent in a service
Whether or not operations failed.
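
The two lookup styles above map onto Zipkin's v2 HTTP API. The following is a minimal sketch that only builds the request URLs and does not call the server. The base address, the service name "payments" and the helper names are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Assumed address of a local Zipkin server; adjust for your environment.
ZIPKIN_BASE = "http://localhost:9411"


def trace_url(trace_id):
    """URL that fetches a single trace when the trace ID is already known."""
    return f"{ZIPKIN_BASE}/api/v2/trace/{trace_id}"


def search_url(service, span_name=None, min_duration_us=None, tags=None, limit=10):
    """URL that searches traces by service, operation name, tags and duration."""
    params = {"serviceName": service, "limit": limit}
    if span_name:
        params["spanName"] = span_name
    if min_duration_us is not None:
        # Durations in the Zipkin API are expressed in microseconds.
        params["minDuration"] = min_duration_us
    if tags:
        # Tag filters are joined with " and ", e.g. "error=true and http.method=GET".
        params["annotationQuery"] = " and ".join(f"{k}={v}" for k, v in tags.items())
    return f"{ZIPKIN_BASE}/api/v2/traces?{urlencode(params)}"


if __name__ == "__main__":
    print(trace_url("463ac35c9f6413ad48485a3953bb6124"))
    print(search_url("payments", min_duration_us=500_000, tags={"error": "true"}))
```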

The Zipkin UI also presents a Dependency diagram showing how many traced requests went through each application. This can be helpful for identifying aggregate behavior including error paths or calls to deprecated services.
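
The dependency diagram is built from dependency links, each counting the calls and errors between a parent and a child service. As a rough sketch of how error paths could be picked out of that data, the function below assumes links shaped like the ones returned by the v2 /api/v2/dependencies endpoint (parent, child, callCount, errorCount). The service names and numbers are invented.

```python
def error_rates(links):
    """Return {(parent, child): error rate} for every link that saw traffic."""
    rates = {}
    for link in links:
        calls = link.get("callCount", 0)
        # errorCount may be absent when no errors were recorded.
        errors = link.get("errorCount", 0)
        if calls:
            rates[(link["parent"], link["child"])] = errors / calls
    return rates


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real Zipkin instance.
    sample = [
        {"parent": "gateway", "child": "payments", "callCount": 200, "errorCount": 10},
        {"parent": "gateway", "child": "legacy-auth", "callCount": 4},
    ]
    for (parent, child), rate in error_rates(sample).items():
        print(f"{parent} -> {child}: {rate:.1%} errors")
```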

Applications need to be “instrumented” to report trace data to Zipkin. This usually means configuring a tracer or instrumentation library. Data can be reported to Zipkin via:

HTTP
Kafka
Apache ActiveMQ
gRPC
RabbitMQ.

The data served to the UI is stored in memory, or persistently with a supported backend such as Apache Cassandra or Elasticsearch.

Overview

Tech stack

Backend: Java
Frontend: React
Instrumentation: Zipkin span model; OpenTelemetry via adapter
Storage: MySQL, Apache Cassandra, or Elasticsearch.

Pros:

Stable and well-known project
Support for multiple DBMS

Cons:

No active development
Java-based, so relatively heavy (requires more resources)
Limited UI
- However, it can be replaced with Grafana or Kibana configured to use Zipkin as a data source
Limited filtering capabilities
OpenTelemetry support requires an adapter
No ClickHouse support
No built-in authentication in the UI
- Okta integration: not found (still researching)

Architecture

Tracers live in your applications and record timing and metadata about operations that took place. They often instrument libraries, so that their use is transparent to users.

Example: an instrumented web server records when it received a request and when it sent a response.

The trace data collected is called a Span.
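
To make the web server example concrete, here is a minimal sketch of what one span could look like in Zipkin's v2 JSON format. The field names follow that format; the service name, operation name and helper functions are assumptions for illustration, and a real tracer library would normally build this for you.

```python
import json
import secrets
import time


def new_id(bits=64):
    """Random lower-hex ID: 16 characters for 64 bits, 32 for 128 bits."""
    return secrets.token_hex(bits // 8)


def server_span(service, name, start_us, end_us, trace_id=None, tags=None):
    """Build a span for a request received and answered by `service`."""
    return {
        "traceId": trace_id or new_id(128),
        "id": new_id(64),
        "kind": "SERVER",
        "name": name,
        "timestamp": start_us,           # epoch microseconds: request received
        "duration": end_us - start_us,   # microseconds until the response was sent
        "localEndpoint": {"serviceName": service},
        "tags": tags or {},
    }


if __name__ == "__main__":
    start = int(time.time() * 1_000_000)
    # ... the request would be handled here ...
    end = start + 250_000
    span = server_span("web", "get /accounts", start, end, tags={"http.method": "GET"})
    print(json.dumps([span], indent=2))
```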

Instrumentation is written to be safe in production and to have little overhead. For this reason, it propagates only IDs in-band, to tell the receiver there’s a trace in progress. Completed spans are reported to Zipkin out-of-band, similar to how applications report metrics asynchronously.

Example: when an operation is being traced and it needs to make an outgoing HTTP request, a few headers are added to propagate IDs. Headers are not used to send details such as the operation name.
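
Zipkin's B3 propagation format defines the headers used for this. The sketch below shows the headers an instrumented client might attach to an outgoing request. The function name and the example IDs are illustrative, and instrumentation libraries normally add these headers for you.

```python
def b3_headers(trace_id, span_id, parent_span_id=None, sampled=True):
    """Headers that carry trace context in-band; no span details are included."""
    headers = {
        "X-B3-TraceId": trace_id,
        "X-B3-SpanId": span_id,
        "X-B3-Sampled": "1" if sampled else "0",
    }
    if parent_span_id:
        # Only present when the outgoing call is a child of another span.
        headers["X-B3-ParentSpanId"] = parent_span_id
    return headers


if __name__ == "__main__":
    # Example IDs only: a 128-bit trace ID and 64-bit span IDs in lower hex.
    for key, value in b3_headers(
        trace_id="463ac35c9f6413ad48485a3953bb6124",
        span_id="a2fb4a1d1a96d312",
        parent_span_id="0020000000000001",
    ).items():
        print(f"{key}: {value}")
```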

The component in an instrumented app that sends data to Zipkin is called a Reporter. Reporters send trace data via one of several transports to Zipkin collectors, which persist trace data to storage. Later, storage is queried by the API to provide data to the UI.

Trace instrumentation reports spans asynchronously so that delays or failures in the tracing system do not delay or break user code.
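
The idea of asynchronous reporting can be sketched as a bounded queue drained by a background thread. This is an illustration of the pattern only, not how any particular Zipkin reporter is implemented. The class name, queue size and batch size are assumptions, and the HTTP sender assumes Zipkin's v2 span endpoint.

```python
import json
import queue
import threading
import urllib.request


class AsyncReporter:
    """Queues spans and sends them in batches from a background thread."""

    def __init__(self, send, max_queue=1000, batch_size=100):
        self._send = send
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        threading.Thread(target=self._run, daemon=True).start()

    def report(self, span):
        """Never blocks the caller; returns False if the span was dropped."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            return False

    def _run(self):
        while True:
            batch = [self._queue.get()]  # wait for at least one span
            while len(batch) < self._batch_size:
                try:
                    batch.append(self._queue.get_nowait())
                except queue.Empty:
                    break
            try:
                self._send(batch)
            except Exception:
                pass  # a tracing failure must not reach user code
            finally:
                for _ in batch:
                    self._queue.task_done()

    def flush(self):
        """Block until every queued span has been handed to the sender."""
        self._queue.join()


def http_sender(url):
    """Sender that POSTs a JSON list of spans, e.g. to /api/v2/spans."""
    def send(batch):
        request = urllib.request.Request(
            url,
            data=json.dumps(batch).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(request, timeout=5).close()
    return send
```

A caller would create one reporter per process, for example AsyncReporter(http_sender("http://localhost:9411/api/v2/spans")), and call report(span) from request-handling code. Dropping spans when the queue is full is a deliberate trade-off: losing trace data is preferable to slowing down the application.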

Transport

Spans sent by the instrumented library must be transported from the services being traced to Zipkin collectors. There are three primary transports: HTTP, Kafka and Scribe.

Components

There are 4 components that make up Zipkin:

Collector:
- Once the trace data arrives at the Zipkin collector daemon, it is validated, stored, and indexed for lookups.
Storage:
- Zipkin was initially built to store data on Cassandra since Cassandra is scalable, has a flexible schema, and is heavily used within Twitter. In addition to Cassandra, Elasticsearch and MySQL are also natively supported. Other backends might be offered as third-party extensions.
Search (Zipkin Query Service):
- Once the data is stored and indexed, there needs to be a way to extract it. The query daemon provides a simple JSON API for finding and retrieving traces. The primary consumer of this API is the Web UI.
Web UI:
- A GUI for viewing traces. Traces can be looked up by service, time, and annotations.
  There is no built-in authentication in the UI.

API

See Zipkin-API documentation

Enable Zipkin for Tyk

Configuring

In tyk.conf, add a tracing section:

{
  "tracing": {
    "enabled": true,
    "name": "zipkin",
    "options": {}
  }
}

The options object holds the settings used to initialise the Zipkin client.

Sample configuration

{
  "tracing": {
    "enabled": true,
    "name": "zipkin",
    "options": {
      "reporter": {
        "url": "http://localhost:9411/api/v2/spans"
      }
    }
  }
}

reporter.url is the URL of the Zipkin server endpoint to which trace data is sent.
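
A malformed reporter.url (for example one missing the // after the scheme) is easy to overlook, so a quick sanity check before deploying can help. The sketch below is a hypothetical helper, not part of Tyk. It assumes the configuration text is plain JSON with the structure shown above.

```python
import json
from urllib.parse import urlparse


def reporter_url(tyk_conf_text):
    """Return the Zipkin reporter URL from tyk.conf text, or raise ValueError."""
    tracing = json.loads(tyk_conf_text).get("tracing", {})
    if not tracing.get("enabled") or tracing.get("name") != "zipkin":
        raise ValueError("Zipkin tracing is not enabled")
    url = tracing.get("options", {}).get("reporter", {}).get("url", "")
    parsed = urlparse(url)
    # An absolute URL needs both a scheme and a host, e.g. http://host:9411/...
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"reporter.url is not an absolute HTTP URL: {url!r}")
    return url


if __name__ == "__main__":
    # Assumes tyk.conf is in the current directory.
    with open("tyk.conf", encoding="utf-8") as conf_file:
        print(reporter_url(conf_file.read()))
```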