Alerting and configurable options for Watchdog

Currently, every watchdog job created in the UI runs once an hour to check for errors. For example, I use dataflow errors to monitor the execution of critical flows, which may run weekly, daily, or hourly.

When a daily job fails, the watchdog sends the same error message every hour, so a single failed dataflow can produce up to 24 messages (for example, in Slack when a webhook is integrated). That may be appropriate for hourly jobs, but for daily jobs I expect only one error message per failure.

The main issue with the alerting system is its frequency: too many alerts lead to them being ignored. Ideally, the following configurable parameters would be added:

  • Alert mode: alert me once when a job starts failing, or alert me every time the watchdog runs.
  • Execution frequency: how often the watchdog runs should be configurable (e.g., every 5 minutes, every hour, every day).

With these two parameters, we could better fine-tune the alerting system.
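
To make the request concrete, here is a rough sketch in Python of how the two options could behave. This is only an illustration of the idea, not how the product works internally: the names WatchdogConfig, alert_mode, interval, and send_alert are all made up for this example.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from typing import Callable, Set


@dataclass
class WatchdogConfig:
    # Hypothetical option names; not part of any existing product API.
    alert_mode: str = "once"                  # "once" or "every_run"
    interval: timedelta = timedelta(hours=1)  # would be read by the scheduler


@dataclass
class Watchdog:
    config: WatchdogConfig
    send_alert: Callable[[str], None]  # e.g. a function posting to a webhook
    # Jobs whose current failure has already been reported.
    _reported: Set[str] = field(default_factory=set)

    def run(self, failed_jobs: Set[str]) -> None:
        """One watchdog execution: alert on the currently failing jobs."""
        # Forget jobs that have recovered, so a later failure alerts again.
        self._reported &= failed_jobs
        for job in sorted(failed_jobs):
            if self.config.alert_mode == "once" and job in self._reported:
                continue  # already alerted for this ongoing failure
            self.send_alert(f"Dataflow '{job}' failed")
            self._reported.add(job)


def simulate(alert_mode: str, runs: int = 24) -> int:
    """Count alerts sent when one job stays failed across `runs` executions."""
    sent = []
    watchdog = Watchdog(WatchdogConfig(alert_mode=alert_mode), sent.append)
    for _ in range(runs):
        watchdog.run({"daily_load"})
    return len(sent)
```

With an hourly watchdog and a daily job that stays failed for a day, simulate("every_run") counts 24 alerts while simulate("once") counts 1. The interval option is only carried in the config here; a real scheduler would use it to decide when to call run.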
