Overview

Logs are discrete events that help engineers identify problem areas during failures. Logging is an essential component of observability and plays a key role in understanding the current state and behavior of a system, especially when troubleshooting issues.

Strategy & Guidelines

Here are some high-level guidelines we use to design an effective logging system:

  1. Centralized Logging: All log data should be collected in a centralized location. We use Grafana Loki (https://grafana.com/docs/loki/latest/) as our centralized logging backend.

  2. Structured Logging: Log data should be structured and consistent so that it can be easily parsed, analyzed, and correlated. We use JSON as the logging format, following the structure defined at https://cloud.google.com/logging/docs/structured-logging.

  3. Log Context: Each log message should include contextual information such as request and response headers, user-id, request-id, and correlation-id, to help identify the origin of the log message and how it relates to other parts of the system. We use OpenTelemetry for tracing and context propagation, and we follow the W3C Trace Context specification:

    1. Traceparent for standard trace context propagation

    2. Tracestate for custom trace data propagation (such as customer-id, account-id, etc.)

  4. Log Rotation: Logs should be rotated regularly to prevent disk space from being exhausted.

  5. Log Aggregation: Logs from different sources should be aggregated to enable centralized analysis. We use Grafana to visualize our log data.

  6. Log Alerting: An alerting system should be set up to notify the appropriate parties when certain conditions are met, such as when a specific error message appears in the logs.
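As a sketch of guidelines 2 and 3, here is a minimal Python formatter that emits one JSON line per event using Google Cloud structured-logging field names (`severity`, `message`, and the `logging.googleapis.com/trace` and `spanId` keys). The logger name and the trace/span values are illustrative, and the convention of passing them via `extra` is an assumption, not our library's API.

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON line using Google Cloud structured-logging field names."""
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "severity": record.levelname,
            "message": record.getMessage(),
            # Trace context fields, if the caller attached them (assumed convention):
            "logging.googleapis.com/trace": getattr(record, "trace_id", None),
            "logging.googleapis.com/spanId": getattr(record, "span_id", None),
        }
        return json.dumps({k: v for k, v in entry.items() if v is not None})

logger = logging.getLogger("payments")  # illustrative service name
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "transfer accepted",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"},
)
```

Because every field is a key/value pair, Loki's `json` parser can later turn these fields into queryable labels.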

Generation

Guidelines

When we generate logs from our application we should follow these guidelines:

  1. Log at the appropriate level: Log messages should be generated at the appropriate level of detail. This typically includes levels such as debug, info, warning, and error. Use the correct level to convey the importance of the message.

    1. Debug: Debug level logs are used for detailed information about the system's internal state and behavior. These logs are typically used only during local development to understand the system's behavior.

    2. Info: Info level logs are used to provide information about the normal operation of the system. These logs can be used to monitor the system's performance and to ensure that it is working as expected.

    3. Warning: Warning level logs are used to provide information about unexpected events that are not necessarily errors, but could indicate potential issues. These logs can be used to identify potential problems and to take appropriate action.

    4. Error: Error level logs are used to provide information about unexpected events that can cause the system to fail or degrade in performance. These logs are critical for identifying and resolving issues and should be reviewed and addressed as soon as possible.

  2. Log messages should be specific and concise: Log messages should provide the necessary information to understand the event or issue. Avoid generic messages or logging too much data.

  3. Include relevant context: Each log message should include relevant context, such as the request and response headers, trace-id, span-id and correlation-id. This helps in understanding the origin of the log message and how it relates to other parts of the system.

  4. Avoid sensitive information: Do not include sensitive information, such as passwords or personal data, in log messages. For more on data privacy, refer to the Data privacy page.
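The guidelines above can be combined into one small sketch: an appropriate level, a specific message, relevant trace context, and no PII in the output. The `mask_pii` helper below is hypothetical and only masks email addresses for illustration; real masking must cover all PII categories.

```python
import json
import re

# Hypothetical helper: mask obvious PII (emails only, for illustration)
# before the value reaches a log line.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def mask_pii(text: str) -> str:
    return EMAIL_RE.sub("***", text)

def log_line(level: str, message: str, **context) -> str:
    """Build one structured log line: appropriate level, concise message,
    relevant context, and no PII."""
    entry = {"severity": level, "message": mask_pii(message)}
    entry.update({k: mask_pii(str(v)) for k, v in context.items()})
    return json.dumps(entry)

# Specific and concise, carries trace context, and the email address is masked.
print(log_line("WARNING", "payment retry scheduled for alice@example.com",
               trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
               span_id="00f067aa0ba902b7",
               retry_count=2))
```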

Tools

We have built a library for generating logs: https://github.com/SafiBank/SaFiMono/tree/main/common/utils#Logging. Please see the README for more details.

The main features this library provides are:

  1. JSON format support for generating logs

  2. PII masking support with @PII annotation

  3. OpenTelemetry integration, with support for automatically adding context information when W3C trace context data is provided.

All microservices should use this library to generate application logs.
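To illustrate the idea behind the library's @PII annotation (a JVM-side feature), here is a hypothetical Python analogue: fields marked as PII are masked whenever the object is serialized for logging. Nothing below is the library's actual API.

```python
import dataclasses
import json

# Hypothetical analogue of the library's @PII annotation: fields marked with
# pii metadata are masked whenever the object is serialized for logging.
def PII():
    return dataclasses.field(metadata={"pii": True})

@dataclasses.dataclass
class Customer:  # illustrative model, not from the real codebase
    customer_id: str
    email: str = PII()
    phone: str = PII()

def to_log_dict(obj) -> dict:
    """Serialize a dataclass for logging, masking any field marked as PII."""
    out = {}
    for f in dataclasses.fields(obj):
        value = getattr(obj, f.name)
        out[f.name] = "****" if f.metadata.get("pii") else value
    return out

print(json.dumps(to_log_dict(Customer("c-123", "alice@example.com", "+63-900-000-0000"))))
```

The design point is the same in both languages: masking is declared once, next to the data model, instead of being re-implemented at every log call site.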

Collection

We are using Loki to collect and store all log data.

For more details on Loki, refer to https://grafana.com/docs/loki/latest/.

For usage, refer to Logging (applications in cluster).
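For illustration, this is the payload shape Loki's push endpoint (POST /loki/api/v1/push) expects: streams identified by labels, with values as `[nanosecond-timestamp, log-line]` pairs. In practice an agent such as Promtail ships logs to Loki for us; the label names below are assumptions.

```python
import json
import time

def build_loki_push_payload(labels: dict, lines: list[str]) -> dict:
    """Build a payload for Loki's push API (POST /loki/api/v1/push).
    Timestamps are nanosecond unix-epoch strings, as Loki expects."""
    ts = str(time.time_ns())  # one timestamp reused for all lines, for brevity
    return {"streams": [{"stream": labels, "values": [[ts, line] for line in lines]}]}

payload = build_loki_push_payload(
    {"app": "payments", "env": "dev"},  # label names/values are illustrative
    [json.dumps({"severity": "INFO", "message": "transfer accepted"})],
)
print(json.dumps(payload))
# To actually ship it (not executed here):
#   requests.post("http://loki:3100/loki/api/v1/push", json=payload)
```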

Visualization & Analysis

We use Grafana and Loki to visualize and analyze log data. Here are a few steps for log data analysis:

  1. Use LogQL (https://grafana.com/docs/loki/latest/logql/) to aggregate and filter log data.

  2. Combine logs with traces to get more context.

    1. You can click the Tempo button to jump from a log message to the corresponding trace.

  3. Set up log-based alerts for when critical issues are detected (WIP).

  4. Set up log-related dashboards and reports (WIP).
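As a sketch of step 1, a LogQL expression can also be submitted programmatically to Loki's range-query endpoint (GET /loki/api/v1/query_range). The label names and JSON fields in the expression below are illustrative, not our actual label schema.

```python
import time
from urllib.parse import urlencode

def build_loki_query(logql: str, minutes: int = 15) -> str:
    """Build a query string for Loki's range-query API (GET /loki/api/v1/query_range).
    Loki accepts nanosecond unix-epoch timestamps for start/end."""
    end = time.time_ns()
    start = end - minutes * 60 * 1_000_000_000
    return urlencode({"query": logql, "start": start, "end": end, "limit": 100})

# Count error-level lines per app over a 5m window (label/field names illustrative):
expr = 'sum by (app) (count_over_time({env="prod"} | json | severity="ERROR" [5m]))'
print("/loki/api/v1/query_range?" + build_loki_query(expr))
```

The `| json` stage parses the structured log line, which is why the JSON logging format in the Generation section matters: it makes fields like `severity` directly filterable in LogQL.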