SaFi Bank Space : Observability

Overview

Observability is the ability to measure a system’s current state and how well the system is behaving. Observability serves the following goals:

  • Understand the system's behavior

    • Observability allows for the monitoring and analysis of system metrics, logs, and traces to understand how the system is behaving.

  • Detect issues quickly

    • A well-designed observability architecture allows for the quick identification of issues, such as service outages or performance bottlenecks. This can be achieved through alerting systems and real-time monitoring of system metrics.

  • Diagnose and troubleshoot issues effectively

    • Once an issue has been detected, observability allows for the gathering of detailed information to diagnose and troubleshoot the issue. This includes the ability to drill down into log data, traces, and metrics to understand the root cause of the problem.

  • Improve reliability and availability

    • By providing a comprehensive view of the system's health and performance, observability allows for the identification and resolution of issues, which improves the overall reliability and availability of the system.

  • Continuous improvement

    • Observability provides feedback on how the system behaves and identifies bottlenecks. This feedback can be used to continuously improve the system's features, performance, scalability, and security.

  • A/B testing

    • Observability makes it possible to monitor and compare the performance and behavior of different versions of the system. This supports A/B testing of new features and changes in a controlled manner, which reduces the risk of errors or issues.
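
As an illustration of the A/B comparison described above, the following is a minimal sketch in Python. It assumes request telemetry is already available as in-memory samples tagged with a version label. The names RequestSample and summarize, the labels "A" and "B", and the choice of error rate and nearest-rank p95 latency as comparison metrics are all hypothetical and not part of any SaFi tooling.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    version: str        # which variant served the request, e.g. "A" or "B"
    latency_ms: float   # end-to-end latency of the request
    is_error: bool      # True if the request failed


def summarize(samples, version):
    """Return request count, error rate, and nearest-rank p95 latency for one version."""
    rows = [s for s in samples if s.version == version]
    if not rows:
        raise ValueError(f"no samples for version {version!r}")
    latencies = sorted(s.latency_ms for s in rows)
    # Nearest-rank percentile: the smallest value with at least 95% of samples at or below it.
    p95_index = math.ceil(0.95 * len(latencies)) - 1
    return {
        "requests": len(rows),
        "error_rate": sum(s.is_error for s in rows) / len(rows),
        "p95_latency_ms": latencies[p95_index],
    }


# Synthetic data: version B is slower and fails more often than version A.
samples = (
    [RequestSample("A", 10.0 * i, is_error=(i == 10)) for i in range(1, 11)]
    + [RequestSample("B", 20.0 * i, is_error=(i > 7)) for i in range(1, 11)]
)
summary_a = summarize(samples, "A")
summary_b = summarize(samples, "B")
b_is_worse = summary_b["error_rate"] > summary_a["error_rate"]
```

In practice the samples would come from the metrics or tracing backend rather than a list in memory, and a real comparison would also need enough traffic per version for the difference to be meaningful.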

Key Components

Observability uses logs, metrics, and traces to measure system behavior. The following key components are involved throughout the observability lifecycle of the system.

  • Logs

    • Centralized logging is used to collect log data from all components of the system. This allows for easy search and analysis of log data, and can provide valuable information about system behavior and events.

  • Metrics

    • Metrics are collected from all components of the system to provide a real-time view of the system's performance and health. This includes information such as system resource utilization, request and response times, and error rates.

  • Tracing

    • Distributed tracing is used to track requests as they flow through the system. This allows for the identification of bottlenecks and issues with specific service interactions.

  • Alerting

    • An alerting system is set up to notify the appropriate parties when certain conditions are met, such as when a service becomes unavailable or when a metric exceeds a specified threshold.

  • Dashboards

    • Dashboards visualize the collected telemetry data (logs, traces, and metrics) to provide a comprehensive view of the system's behavior and performance.
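
To make the relationship between these components more concrete, here is a minimal sketch using only the Python standard library. It is not the SaFi implementation: the logger name "example-service", the log_event helper, the ErrorRateMonitor class, and the 5% threshold are all assumptions chosen for illustration. The sketch emits structured log lines that carry a trace ID (so logs can be correlated with traces), counts requests and errors as a simple metric, and evaluates a threshold-based alert condition.

```python
import json
import logging
import uuid

# Hypothetical service logger; a real setup would ship these lines to a central log store.
logger = logging.getLogger("example-service")


def log_event(event, trace_id, **fields):
    """Emit one structured (JSON) log line and return it."""
    line = json.dumps({"event": event, "trace_id": trace_id, **fields}, sort_keys=True)
    logger.info(line)
    return line


class ErrorRateMonitor:
    """Tracks request outcomes and reports whether the error rate exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold  # alert when error_rate() is above this fraction
        self.total = 0
        self.errors = 0

    def record(self, success):
        self.total += 1
        if not success:
            self.errors += 1

    def error_rate(self):
        return self.errors / self.total if self.total else 0.0

    def should_alert(self):
        return self.error_rate() > self.threshold


# Simulate ten requests, the last of which fails.
monitor = ErrorRateMonitor(threshold=0.05)  # 5% is an assumed example threshold
trace_id = uuid.uuid4().hex                 # stands in for an ID issued by a tracing system
for ok in [True] * 9 + [False]:
    monitor.record(ok)
line = log_event("request_failed", trace_id, status=500)

if monitor.should_alert():
    # A real alerting system would notify the responsible team here.
    logger.warning("error rate %.0f%% exceeds threshold", monitor.error_rate() * 100)
```

A production setup would typically delegate each of these roles to dedicated tooling (a log aggregator, a metrics store, a tracing backend, and an alert manager); the sketch only shows how the pieces relate to one another.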