
Alerts

  • We use AlertManager to send alerts based on metrics stored in Prometheus
  • The package ships with predefined alerts for several components of the system

Destination

Alerts are sent to a Slack channel by default, but they can be routed to any notification channel that AlertManager supports
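As an illustration of the Slack destination described above, a minimal AlertManager configuration might look like the sketch below. The receiver name, channel name, and webhook URL are placeholders rather than values taken from this package; `slack_configs`, `api_url`, `channel`, and `send_resolved` are standard AlertManager settings.

```yaml
# Sketch only: the names and the webhook URL are hypothetical placeholders.
route:
  receiver: slack-alerts            # default receiver for all alerts
receivers:
  - name: slack-alerts
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK
        channel: '#alerts'          # assumed channel name
        send_resolved: true         # also notify when an alert clears
```

To use a different destination, the `slack_configs` entry would be replaced by the matching receiver type that AlertManager provides (for example `email_configs` or `webhook_configs`).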

Common Alerts

The following components share a common set of alerts:

  • App: API, UI, MongoDB, PostgreSQL
  • Observability: Prometheus, OpenTelemetry, Grafana, AlertManager, Kubernetes Events Exporter
Alert | Metric | Environment | Description
CPU Utilisation > 70% | k8s_pod_cpu_limit_utilization_ratio | dev, staging | Trigger an alert when CPU utilisation is greater than 70% of the CPU Limit allocated for that component
RAM Utilisation > 70% | k8s_pod_memory_limit_utilization_ratio | dev, staging | Trigger an alert when RAM utilisation is greater than 70% of the RAM Limit allocated for that component
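The common alerts in the table above could be expressed as Prometheus alerting rules along the following lines. This is a sketch under assumptions, not the package's actual rule file: it assumes the `_ratio` metrics report a fraction between 0 and 1 (so 70% is `0.7`), and the group name, alert names, `for` duration, `severity` label, and the `k8s_pod_name` label are all illustrative.

```yaml
# Sketch only: names, durations, and labels are assumptions.
groups:
  - name: common-resource-alerts
    rules:
      - alert: HighCpuUtilisation
        # Assumes the metric is a 0-1 ratio of usage to the CPU limit.
        expr: k8s_pod_cpu_limit_utilization_ratio > 0.7
        for: 5m                     # assumed hold time before firing
        labels:
          severity: warning
        annotations:
          summary: "CPU utilisation above 70% of limit on {{ $labels.k8s_pod_name }}"
      - alert: HighMemoryUtilisation
        # Same assumption: 0-1 ratio of usage to the memory limit.
        expr: k8s_pod_memory_limit_utilization_ratio > 0.7
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RAM utilisation above 70% of limit on {{ $labels.k8s_pod_name }}"
```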

Component Specific Alerts

In addition to the common alerts above, some components have alerts that are specific to them

API

Alert | Metric | Environment | Description
API Health is 0 | Staging: httpcheck_status; Dev: app_api_health | dev, staging | Triggers an alert when API is down. Pull based for Staging; push based for Dev
Internal Server Error in API | http_server_duration_milliseconds_count | staging | Triggers an alert whenever an API call results in an Internal Server Error [500 status code]. Also gives info about what endpoint it occurred at for debugging purposes
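For illustration, the Internal Server Error alert could be written as a rule like the one below. The metric name comes from the table above; the `http_status_code` and `http_route` label names, the 5-minute window, and the rule names are assumptions that depend on how the API is instrumented.

```yaml
# Sketch only: label names and the time window are assumptions.
groups:
  - name: api-alerts
    rules:
      - alert: ApiInternalServerError
        # Fires when any request returned a 500 in the last 5 minutes,
        # grouped by endpoint so the notification shows where it occurred.
        expr: |
          sum by (http_route) (
            increase(http_server_duration_milliseconds_count{http_status_code="500"}[5m])
          ) > 0
        labels:
          severity: critical
        annotations:
          summary: "Internal Server Error on {{ $labels.http_route }}"
```

Grouping by the route label is what allows the alert to report the failing endpoint, matching the debugging note in the description.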

MongoDB

Alert | Metric | Environment | Description
MongoDB Health is 0 | mongodb_health_ratio | staging | Triggers an Alert whenever the MongoDB health metric reports value as 0