We have a long-term plan for designing our Observability system; for more details, please refer to Observability.
This is the original roadmap for setting up alerting rules for micro-services, which is our next step. It covers the following two points:
Set up alert rules for each micro-service based on micro-service-related metrics (e.g. https://grafana.monitoring.brave.safibank.online/d/micro_service_dashboard/micro-service-dashboard?orgId=1&var-container=account-manager-brave):
Identify key performance indicators (KPIs) for each micro-service, such as response time, error rate, and throughput.
Define alerting thresholds for each KPI based on acceptable levels of performance and business requirements.
Create alerting rules that trigger notifications to the appropriate team members when thresholds are breached.
Integrate alerting rules with existing incident management systems to ensure timely response and resolution.
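As a rough illustration of the steps above, the following Python sketch derives the three example KPIs from raw counts over one time window and checks them against thresholds. It is not tied to any particular alerting tool: the names `AlertRule`, `compute_kpis`, and `breached_rules`, the team names, and every threshold value are hypothetical placeholders. Real thresholds would come from the performance and business requirements mentioned above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertRule:
    """One alerting rule (hypothetical structure, for illustration only)."""
    name: str
    kpi: str          # key into the KPI snapshot, e.g. "error_rate"
    threshold: float  # placeholder value, not a recommendation
    direction: str    # "above" or "below": which side of the threshold is a breach
    notify: str       # team that should receive the notification


def compute_kpis(requests: int, errors: int,
                 latencies_ms: List[float], window_s: float) -> Dict[str, float]:
    """Derive error rate, throughput, and p95 response time for one window."""
    ordered = sorted(latencies_ms)
    if ordered:
        # Nearest-rank 95th percentile: rank = ceil(0.95 * n), computed with integers.
        rank = (95 * len(ordered) + 99) // 100
        p95 = ordered[rank - 1]
    else:
        p95 = 0.0
    return {
        "error_rate": errors / requests if requests else 0.0,
        "throughput_rps": requests / window_s,
        "p95_latency_ms": p95,
    }


def breached_rules(rules: List[AlertRule], kpis: Dict[str, float]) -> List[AlertRule]:
    """Return the rules whose KPI value is on the wrong side of its threshold."""
    breached = []
    for rule in rules:
        value = kpis[rule.kpi]
        if rule.direction == "above" and value > rule.threshold:
            breached.append(rule)
        elif rule.direction == "below" and value < rule.threshold:
            breached.append(rule)
    return breached


# Hypothetical rules for one micro-service; all values are assumptions.
RULES = [
    AlertRule("high-error-rate", "error_rate", 0.01, "above", "backend-team"),
    AlertRule("slow-responses", "p95_latency_ms", 500.0, "above", "backend-team"),
    AlertRule("traffic-drop", "throughput_rps", 1.0, "below", "platform-team"),
]

if __name__ == "__main__":
    snapshot = compute_kpis(requests=200, errors=10,
                            latencies_ms=[100.0] * 18 + [900.0] * 2, window_s=60.0)
    for rule in breached_rules(RULES, snapshot):
        # The hand-off to the incident management system would happen here (not shown).
        print(f"notify {rule.notify}: {rule.name} ({rule.kpi}={snapshot[rule.kpi]:.3f})")
```

Keeping the breach direction on the rule lets one check handle both "too high" KPIs (error rate, response time) and "too low" KPIs (throughput).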
Set up alert rules for every JVM instance based on JVM instance-related metrics (e.g. https://grafana.monitoring.brave.safibank.online/d/micro-service_instance/micro-service-instance?var-application=account-manager-brave&orgId=1):
Identify key JVM metrics, such as memory usage, garbage collection frequency, and thread count.
Define alerting thresholds for each metric based on acceptable levels of performance and business requirements.
Create alerting rules that trigger notifications to the appropriate team members when thresholds are breached.
Integrate alerting rules with existing incident management systems to ensure timely response and resolution.
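The per-instance steps above can be sketched in a similar way. JVM metrics such as heap usage naturally spike and recover around garbage collection, so this sketch assumes an alert should fire only when a metric stays over its threshold for several consecutive samples. That "sustained breach" behaviour is an assumption for illustration, not something this roadmap prescribes. The metric keys, instance names, and threshold values below are likewise hypothetical.

```python
from typing import Dict, List

# Hypothetical per-metric thresholds; placeholder values only.
JVM_THRESHOLDS: Dict[str, float] = {
    "heap_used_ratio": 0.90,
    "gc_count_per_min": 30.0,
    "thread_count": 500.0,
}


def sustained_breach(samples: List[float], threshold: float,
                     min_consecutive: int) -> bool:
    """True if the most recent `min_consecutive` samples all exceed the threshold."""
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    return all(value > threshold for value in samples[-min_consecutive:])


def breached_instances(series: Dict[str, List[float]], metric: str,
                       min_consecutive: int = 3) -> List[str]:
    """Return the names of JVM instances with a sustained breach of `metric`."""
    threshold = JVM_THRESHOLDS[metric]
    return sorted(
        instance
        for instance, samples in series.items()
        if sustained_breach(samples, threshold, min_consecutive)
    )


if __name__ == "__main__":
    # Heap usage ratio samples per instance, oldest first (made-up data).
    heap_series = {
        "account-manager-0": [0.70, 0.92, 0.95, 0.97],
        "account-manager-1": [0.95, 0.60, 0.93, 0.94],
    }
    for instance in breached_instances(heap_series, "heap_used_ratio"):
        # The hand-off to the incident management system would happen here (not shown).
        print(f"sustained heap pressure on {instance}")
```

Evaluating each instance separately means a single unhealthy JVM is not hidden by healthy replicas of the same micro-service.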
These steps will help to ensure that each micro-service and JVM instance is adequately monitored and that any issues are identified and addressed in a timely manner, minimizing downtime and improving overall system reliability.