Alerts
- We use AlertManager to raise alerts on metrics stored in Prometheus
- The package ships with predefined alerts for several components of the system
Destination
Alerts are sent to a Slack channel by default, but they can be routed to any destination that AlertManager supports
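
As a rough illustration of the default destination, a minimal AlertManager configuration with a Slack receiver might look like the sketch below. The receiver name, channel, and webhook URL are placeholders chosen for this example, not values shipped with the package.

```yaml
# alertmanager.yml (sketch; all names and the URL are placeholders)
route:
  receiver: slack-default          # hypothetical receiver name
  group_by: ['alertname']          # group notifications by alert name
receivers:
  - name: slack-default
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook URL
        channel: '#alerts'         # hypothetical Slack channel
        send_resolved: true        # also notify when the alert clears
```

To send alerts elsewhere, the `slack_configs` entry could be swapped for another receiver type that AlertManager provides, such as `email_configs`, `pagerduty_configs`, or `webhook_configs`.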
Common Alerts
The alerts below apply to every one of the following components:
- App: API, UI, MongoDB, PostgreSQL
- Observability: Prometheus, OpenTelemetry, Grafana, AlertManager, Kubernetes Events Exporter
Alert | Metric | Environment | Description |
---|---|---|---|
CPU Utilisation > 70% | k8s_pod_cpu_limit_utilization_ratio | dev, staging | Triggers an alert when CPU utilisation exceeds 70% of the CPU limit allocated to the component |
RAM Utilisation > 70% | k8s_pod_memory_limit_utilization_ratio | dev, staging | Triggers an alert when RAM utilisation exceeds 70% of the RAM limit allocated to the component |
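
The two common alerts could be expressed as Prometheus alerting rules along the following lines. This is a sketch rather than the rules bundled with the package: the group name, alert names, `for` duration, and severity label are assumptions, and the threshold of 0.7 assumes the metrics are reported as a 0 to 1 ratio of usage to the configured limit.

```yaml
groups:
  - name: common-resource-alerts        # hypothetical group name
    rules:
      - alert: HighCpuUtilisation       # hypothetical alert name
        # Assumes the metric is a 0-1 ratio of CPU usage to the pod's CPU limit
        expr: k8s_pod_cpu_limit_utilization_ratio > 0.7
        for: 5m                         # assumed hold period before firing
        labels:
          severity: warning             # assumed severity label
        annotations:
          summary: "CPU utilisation is above 70% of the CPU limit"
      - alert: HighMemoryUtilisation    # hypothetical alert name
        # Assumes the metric is a 0-1 ratio of memory usage to the pod's memory limit
        expr: k8s_pod_memory_limit_utilization_ratio > 0.7
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RAM utilisation is above 70% of the RAM limit"
```

The `for` clause keeps short spikes from firing an alert; it can be shortened or dropped if immediate notification is preferred.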
Component Specific Alerts
In addition to the common alerts above, some components have alerts of their own
API
Alert | Metric | Environment | Description |
---|---|---|---|
API Health is 0 | Staging: httpcheck_status; Dev: app_api_health | dev, staging | Triggers an alert when the API is down. The check is pull-based in staging and push-based in dev |
Internal Server Error in API | http_server_duration_milliseconds_count | staging | Triggers an alert whenever an API call returns an Internal Server Error (500 status code). The alert also reports the endpoint where the error occurred, to help with debugging |
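
A sketch of how the staging API alerts might be written as Prometheus rules follows. The alert names, time windows, and the label names `http_status_code` and `http_route` are assumptions for illustration; the labels actually attached to these metrics depend on how the instrumentation is configured.

```yaml
groups:
  - name: api-alerts                    # hypothetical group name
    rules:
      - alert: ApiDown                  # hypothetical alert name
        # Staging uses the pull-based check; in dev the expression
        # would use app_api_health instead.
        expr: httpcheck_status == 0
        for: 1m                         # assumed hold period
        labels:
          severity: critical            # assumed severity label
        annotations:
          summary: "API health check reports 0"
      - alert: ApiInternalServerError   # hypothetical alert name
        # Counts requests that finished with a 500 in the last 5 minutes,
        # grouped by endpoint. Label names are assumptions.
        expr: >
          sum by (http_route) (
            increase(http_server_duration_milliseconds_count{http_status_code="500"}[5m])
          ) > 0
        labels:
          severity: warning
        annotations:
          summary: "Internal Server Error on {{ $labels.http_route }}"
```

Grouping by the route label is what lets the notification name the failing endpoint, as described in the table.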
MongoDB
Alert | Metric | Environment | Description |
---|---|---|---|
MongoDB Health is 0 | mongodb_health_ratio | staging | Triggers an alert whenever the MongoDB health metric reports a value of 0 |