SaFi Bank Space: [Observability] Alerting rules for micro-services

We have a long-term plan for designing our Observability system; for more details, please refer to Observability.

This page describes the original roadmap for setting up alerting rules for micro-services, which is our next step. It covers the following two points:

  1. Set up alert rules for each micro-service based on micro-service-level metrics (e.g. https://grafana.monitoring.brave.safibank.online/d/micro_service_dashboard/micro-service-dashboard?orgId=1&var-container=account-manager-brave):

    • Identify key performance indicators (KPIs) for each micro-service, such as response time, error rate, and throughput.

    • Define alerting thresholds for each KPI based on acceptable levels of performance and business requirements.

    • Create alerting rules that trigger notifications to the appropriate team members when thresholds are breached.

    • Integrate alerting rules with existing incident management systems to ensure timely response and resolution.

  2. Set up alert rules for every JVM instance based on JVM instance-level metrics (e.g. https://grafana.monitoring.brave.safibank.online/d/micro-service_instance/micro-service-instance?var-application=account-manager-brave&orgId=1):

    • Identify key JVM metrics, such as memory usage, garbage collection frequency, and thread count.

    • Define alerting thresholds for each metric based on acceptable levels of performance and business requirements.

    • Create alerting rules that trigger notifications to the appropriate team members when thresholds are breached.

    • Integrate alerting rules with existing incident management systems to ensure timely response and resolution.

These steps will help to ensure that each micro-service and JVM instance is adequately monitored and that any issues are identified and addressed in a timely manner, minimizing downtime and improving overall system reliability.
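
To make the "define thresholds, then create rules that notify a team" steps above more concrete, here is a minimal Python sketch of how a threshold rule could be modelled and evaluated against one sample of metrics. It is an illustration only and not our actual alerting configuration: the metric names, threshold values, and team names are all assumptions chosen for the example, and real thresholds still need to come from the KPI and business-requirement analysis described above.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AlertRule:
    """One threshold rule: fire when compare(value, threshold) is true."""
    name: str
    metric: str                               # hypothetical metric key
    compare: Callable[[float, float], bool]   # e.g. operator.gt / operator.lt
    threshold: float                          # illustrative value only
    notify: str                               # hypothetical team to notify


# Hypothetical micro-service rules (response time, error rate, throughput).
SERVICE_RULES = [
    AlertRule("HighErrorRate", "error_rate", operator.gt, 0.05, "backend-oncall"),
    AlertRule("SlowResponse", "p95_response_seconds", operator.gt, 1.0, "backend-oncall"),
    AlertRule("LowThroughput", "requests_per_second", operator.lt, 1.0, "backend-oncall"),
]

# Hypothetical JVM instance rules (memory usage, GC frequency, thread count).
JVM_RULES = [
    AlertRule("HighHeapUsage", "heap_used_ratio", operator.gt, 0.9, "platform-oncall"),
    AlertRule("FrequentGC", "gc_pauses_per_minute", operator.gt, 30.0, "platform-oncall"),
    AlertRule("HighThreadCount", "live_threads", operator.gt, 500.0, "platform-oncall"),
]


def evaluate(rules: List[AlertRule], sample: Dict[str, float]) -> List[Dict[str, object]]:
    """Return one alert per rule whose threshold is breached by the sample.

    Metrics missing from the sample are skipped rather than treated as breaches.
    """
    alerts = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is None:
            continue
        if rule.compare(value, rule.threshold):
            alerts.append({
                "rule": rule.name,
                "value": value,
                "threshold": rule.threshold,
                "notify": rule.notify,
            })
    return alerts


if __name__ == "__main__":
    # Example sample for one service; the numbers are made up.
    sample = {"error_rate": 0.08, "p95_response_seconds": 0.4, "requests_per_second": 25.0}
    for alert in evaluate(SERVICE_RULES, sample):
        print(f"{alert['rule']}: {alert['value']} breached {alert['threshold']} -> {alert['notify']}")
```

Keeping the service rules and the JVM rules in separate lists mirrors the two points of the roadmap. In practice the same information would more likely live in the alerting system's own rule format, with the `notify` field mapped to the incident management integration.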