Alerting

Alerting on metrics is essential to monitoring. They allow you to spot problems anywhere in your infrastructure, application, etc so that you can rapidly identify their causes and minimize service degradation and disruption. An alert should communicate something specific about your systems such as “90% of all web requests are taking more than 0.5s to process and respond.” Receiving these alerts allows you to respond quickly to issues and provide better service, and it also saves time by freeing you from continual manual inspection of metrics.

You can get notified of these alerts through emails, Slack etc. Our team uses the GitHub Alertmanager to handle all the alerts. The GitHub Alertmanager is a webhook receiver that creates GitHub issues from alerts. To learn more about setting up the GitHub Alertmanager, you can refer to our documentation.

Alerts: Kebechet

You can find below the alerts defined for Kebechet.

Alert

Alert Definition

Alert Rule

Mitigation Plan

Alert Provider

Thoth User-API is down

Alert for User API down

up{instance=”{CLUSTER_INSTANCE}”} < 1

Verify status of the cluster.

Prometheus GitHub AlertManager

service requests missed producing results

Alert for mismatch between number of requests and documents produced

thoth_reporter_requests_gauge{env=”{CLUSTER_INSTANCE}”, component=”{THOTH_SERVICE}”} - thoth_reporter_reports_gauge{env=”{CLUSTER_INSTANCE}”, component=”{THOTH_SERVICE}”}

Verify status of Thoth investigator, Kafka, User_API.

Prometheus GitHub AlertManager

purge job issues opened missed

Alert for mismatch between number of requests and documents produced

thoth_number_purge_issues_total{env=”zero-prod”} - thoth_number_purge_issues_created{env=”zero-prod”} > 0

Retrigger Purge Job.

Github Issue by Prometheus GitHub AlertManager

thoth middletier number of worflows is 0

Alert for 0 workflows running in Thoth Middletier namespace

argo_workflows_count{field=”workflow-controller-metrics-thoth-middletier-prod.apps.smaug.na.operate-first.cloud:80”, status=”Running”} < 1

Verify status of Data Ingestion in Thoth or add more workload.

Github Issue by Prometheus GitHub AlertManager

mismatch between analyzed solvers and known solvers

Alert for number of solvers from Solver ConfigMap and from Thoth database not matching.

thoth_graphdb_solvers_number_match{field=”metrics-exporter-thoth-infra-prod.apps.smaug.na.operate-first.cloud:80”} == 1

Check solvers in Solver ConfigMap and in Thoth database.

Github Issue by Prometheus GitHub AlertManager

Issue connecting to Kafka

Alert for issue in connection with Kafka

thoth_kafka_connection_issues{field=”metrics-exporter-thoth-infra-prod.apps.smaug.na.operate-first.cloud:80”} == 1

Verify status of Kafka.

Github Issue by Prometheus GitHub AlertManager

Kafka message is halted

Alert for halted messages

thoth_investigator_halted_topics{field=”investigator-faust-web-thoth-infra-prod.apps.smaug.na.operate-first.cloud:80”} == 1

Activate message again using endpoint in Thoth investigator.

Github Issue by Prometheus GitHub AlertManager

Issue connecting to Thoth database

Alert for issue in connection with Thoth database

thoth_graphdb_connection_issues{field=”metrics-exporter-thoth-infra-prod.apps.smaug.na.operate-first.cloud:80”} == 1

Verify status of Thoth database.

Github Issue by Prometheus GitHub AlertManager

thoth-storages version mismatch

alembic version mismatch between components and database

thoth_graph_db_component_revision_check{env=”{CLUSTER_INSTANCE}”} == 1

Release all impacted components with thoth-storages in the correct version.

Github Issue by Prometheus GitHub AlertManager

Thoth database is corrupted

Database schema is corrupted, all services need to be stopped

thoth_graphdb_is_corrupted{field=”metrics-exporter-thoth-infra-prod.apps.smaug.na.operate-first.cloud:80”} == 1

Analyze Thoth database.

Github Issue by Prometheus GitHub AlertManager

Thoth database has multiple alembic versions

Alert for alembic version table corruption

thoth_graphdb_alembic_table_check{field=”metrics-exporter-thoth-infra-prod.apps.smaug.na.operate-first.cloud:80”} == 1

Analyze Thoth database.

Github Issue by Prometheus GitHub AlertManager

The Kebechet rules triggered by Prometheus can be found here.