Check graphite

From Wikitech

check_graphite is a Nagios/Icinga plugin script that generates alerts based on metric values in Graphite. It queries the /render endpoint of the Graphite server and fetches the data in JSON format. Our code is an (almost complete) rewrite of the check_graphite plugin from Disqus.
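As a rough illustration of the kind of request involved, the following Python sketch builds a /render URL and extracts the datapoints from a JSON response. The host name graphite.example.org, the helper names and the sample payload are hypothetical. Only the target, from and format query parameters and the [value, timestamp] datapoint layout come from Graphite's render API. This is not the plugin's actual code.

```python
import json
from urllib.parse import urlencode


def build_render_url(base_url, metric, from_spec):
    """Build a Graphite /render URL that returns JSON for one metric."""
    query = urlencode({
        "target": metric,
        "from": "-" + from_spec,  # relative time, e.g. "-1hours"
        "format": "json",
    })
    return base_url.rstrip("/") + "/render?" + query


def extract_values(payload):
    """Return the non-null values of the first series in a render response."""
    series = json.loads(payload)
    if not series:
        return []
    # Graphite encodes each datapoint as [value, timestamp]; value may be null.
    return [value for value, _timestamp in series[0]["datapoints"]
            if value is not None]


# Hypothetical response body, shaped like Graphite's JSON output.
SAMPLE = (
    '[{"target": "reqstats.5xx", "datapoints": '
    '[[120.0, 1400000000], [null, 1400000060], [310.0, 1400000120]]}]'
)

url = build_render_url("https://graphite.example.org", "reqstats.5xx", "1hours")
values = extract_values(SAMPLE)
```

No network call is made here; in a real check the URL would be fetched over HTTP and the response body passed to the parsing step.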

Puppet Usage

We have two types of checks that can be performed on graphite-collected metrics:

  • check_graphite_threshold for checking thresholds
  • check_graphite_anomaly which performs some form of anomaly detection on the metric.

Both are just wrappers around our monitor_service define. See their respective documentation for up-to-date usage information.

Define monitor_graphite_threshold

Simple threshold checking is supported: the check tests whether a given percentage (by default, 1%) of the data points in the interval of interest exceeds a threshold.

For instance, to ensure that fewer than 5% of the datapoints collected in the last hour for the number of 5xx responses are above 500, you can do as follows:

  # Alert if the metric exceeds an absolute threshold in 5% of the
  # datapoints from the last hour.
  monitor_graphite_threshold { 'reqstats-5xx':
      description          => 'Number of 5xx responses',
      metric               => 'reqstats.5xx',
      warning              => 250,
      critical             => 500,
      from                 => '1hours',
      percentage           => 5,
  }
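The percentage logic described above can be sketched in Python as follows. This is only an approximation under stated assumptions, not the plugin's implementation: the function names are hypothetical, null datapoints are assumed to be ignored, a datapoint counts only when it is strictly above the threshold, and the alert is assumed to fire once the share of such datapoints reaches the configured percentage.

```python
# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def percent_over(values, threshold):
    """Percentage of non-null datapoints strictly above threshold."""
    valid = [v for v in values if v is not None]  # assumption: skip nulls
    if not valid:
        return None
    over = sum(1 for v in valid if v > threshold)
    return 100.0 * over / len(valid)


def threshold_status(values, warning, critical, percentage=1):
    """Map a list of datapoints to a Nagios status code."""
    crit_pct = percent_over(values, critical)
    if crit_pct is None:
        return UNKNOWN  # no usable data in the interval
    # Assumption: alert when the share reaches the configured percentage.
    if crit_pct >= percentage:
        return CRITICAL
    if percent_over(values, warning) >= percentage:
        return WARNING
    return OK


# 60 datapoints, 4 of them above 500 (about 6.7%), checked against 5%.
status = threshold_status([100] * 56 + [600] * 4,
                          warning=250, critical=500, percentage=5)
```

With the values from the example above (warning 250, critical 500, percentage 5), an hour in which 4 out of 60 datapoints exceed 500 is reported as critical in this sketch.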

Define monitor_graphite_anomaly

Very simple predictive checking is also supported: it checks whether more than N points in a given range of datapoints fall outside the Holt-Winters confidence bands, as calculated by Graphite (see http://bit.ly/graphiteHoltWinters), with a delta of 3 (a 99.7% confidence level, which should be good in most cases). The obvious advantage of this method is that we don't need to predefine thresholds at all.

This kind of monitoring always requires at least a week of data in Graphite, which is needed for decent predictions, so it is fairly computationally expensive. You can define the interval of datapoints on which you wish to check for anomalies via the check_window parameter.

Let's see how we could detect an anomaly in the same metric as before: we raise a warning if 5, or a critical alert if 10, of the last 200 measured datapoints are anomalous.

  # Alert if an anomaly is found in the number of 5xx responses
  monitor_graphite_anomaly { 'reqstats-5xx-anomaly':
      description          => 'Anomaly in number of 5xx responses',
      metric               => 'reqstats.5xx',
      warning              => 5,
      critical             => 10,
      check_window         => 200,
  }
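The counting step behind this check can be sketched in Python as below. It assumes the metric and its lower and upper confidence bands have already been fetched as aligned lists (for example via Graphite's holtWintersConfidenceBands function). The function names are hypothetical, and the use of "at least" for the warning and critical counts follows the wording of the example above, so the real plugin may compare differently.

```python
def count_anomalies(values, lower, upper, check_window):
    """Count points in the last check_window that fall outside the bands."""
    recent = list(zip(values, lower, upper))[-check_window:]
    return sum(
        1
        for value, low, high in recent
        # Assumption: points with a missing value or band are not counted.
        if None not in (value, low, high) and not (low <= value <= high)
    )


def anomaly_status(values, lower, upper, warning, critical, check_window):
    """Return a Nagios status code: 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    anomalies = count_anomalies(values, lower, upper, check_window)
    if anomalies >= critical:
        return 2
    if anomalies >= warning:
        return 1
    return 0


# 200 datapoints, the last 6 of which rise above the upper band.
values = [10.0] * 194 + [100.0] * 6
lower = [0.0] * 200
upper = [50.0] * 200
status = anomaly_status(values, lower, upper,
                        warning=5, critical=10, check_window=200)
```

Here 6 of the last 200 datapoints are outside the bands, which is enough for a warning but not for a critical alert under the thresholds from the example.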

A few more checks are in the works; this page will be updated when they are available.