
Monitoring package survey

From Wikitech

An exhaustive list of monitoring tools and services evaluated or used at WMF

Open Source

Alerta

Cabot

Centreon

  • Overview: Fork of Nagios with commercial support, can also use the Icinga "engine"
  • URL: http://www.centreon.com
  • Pro:
  • Con:
    • Some features only in paid "enterprise" version
    • Nagios architecture: check scripts determine warning and critical state

Check_graphite

Check_mk

  • Overview: The client agent runs checks/plugins asynchronously and listens on a TCP port; on connect it immediately dumps all stats and closes the connection. A single Icinga active check connects to each client and feeds the returned data back as passive check results.
  • URL:
  • Pro:
    • Very fast / scales well
    • Integrates with Graphite
    • Can use Nagios plugins
    • Can completely replace Nagios/Icinga using the optional "check_mk micro core"
  • Con:
    • Generates its own Icinga config based on discovery, so service monitors are not defined for services which are down at the time of the discovery scan
    • Can be tricky to integrate this with Puppet based config templates for Icinga.
    • Some features only in paid "enterprise" version
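
The connect, dump, close behaviour described in the overview can be sketched in Python as below. This is only an illustration, not Check_mk's own code: it assumes the agent listens on TCP port 6556 and groups its output under `<<<section>>>` header lines, and the function names `read_agent_dump` and `split_sections` are hypothetical.

```python
import socket


def read_agent_dump(host, port=6556, timeout=10.0):
    """Connect, read until the agent closes the connection, return the text.

    Port 6556 is assumed here; adjust it to match the agent's configuration.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # agent closed the connection: the dump is complete
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def split_sections(dump):
    """Group dump lines under their <<<section>>> header (assumed format)."""
    sections = {}
    current = None
    for line in dump.splitlines():
        if line.startswith("<<<") and line.endswith(">>>"):
            current = line[3:-3]
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections
```

Because the whole dump arrives over one connection, a single active check per host can fan the parsed sections out as many passive results.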

Collectd

  • Overview: collectd is a daemon which collects system performance statistics periodically and provides mechanisms to store the values in a variety of ways
  • URL:
  • Pro:
    • Written in C
    • Network traffic can be signed or encrypted
    • Clients push data to a server or multicast group
    • Default resolution is 10 seconds
    • Can store data in Graphite (Carbon), RRD, Redis, MongoDB, several others
    • Statsd plugin implements the StatsD network protocol to allow clients to report events. These events are aggregated by collectd and dispatched regularly.
    • Can execute Nagios check scripts
    • Contains glue allowing Nagios to check stats harvested by collectd
  • Con:
    • As of February 2014, the website reports deployments on hundreds of nodes but notes that nobody has reported running it on 1,000.
    • There is no "write plugin" to publish to Ganglia, preventing collectd from being a drop-in replacement for gmond
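
To make the StatsD item above concrete, here is a rough Python sketch of parsing StatsD-style lines (`name:value|type`, optionally followed by `|@rate`) and aggregating counters before they would be dispatched. It is not how collectd implements its plugin; the function names and the sample metric names are hypothetical.

```python
from collections import defaultdict


def parse_statsd_line(line):
    """Parse 'name:value|type[|@rate]' into (name, value, type, rate)."""
    name, _, rest = line.strip().partition(":")
    fields = rest.split("|")
    if not name or len(fields) < 2:
        raise ValueError(f"malformed StatsD line: {line!r}")
    value = float(fields[0])
    metric_type = fields[1]
    rate = 1.0
    if len(fields) > 2 and fields[2].startswith("@"):
        rate = float(fields[2][1:])
    return name, value, metric_type, rate


def aggregate_counters(lines):
    """Sum counter ('c') events per metric, scaling up by the sample rate."""
    totals = defaultdict(float)
    for line in lines:
        name, value, metric_type, rate = parse_statsd_line(line)
        if metric_type == "c":
            totals[name] += value / rate
    return dict(totals)
```

A counter sampled at rate 0.5 is counted twice here, on the assumption that only half of the events were actually sent.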

Cucumber-nagios

Cyanite

Dashing

Dbeacon

  • Overview: dbeacon is a multicast beacon: its main purpose is to monitor other beacons' reachability and collect statistics such as loss, delay and jitter between them.
  • URL: https://packages.debian.org/sid/dbeacon
  • Pro:
  • Con:

Diamond

Fail2ban

Firefly

Ganglia

Ganglios

Grafana

Graphios

Graphite

Graphsky

  • Overview: Graphite dashboard similar to the Ganglia UI, using data from Collectd
  • URL:
  • Pro:
    • Ganglia-like design gives overview + drill-down ability
    • Simple dashboard and graph definition in JSON
  • Con:
    • Doesn't display any numbers in dashboards or graph legends
    • Lacking navigation elements in the UI
    • Documentation error: the docs recommend the prefix "collectd.", but in practice you want something like "collectd.production.bits." so that the environment and cluster are encoded in the metric name.
  • Status: Good potential but needs development.
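
The prefix issue noted above comes down to how metric names are composed. A minimal sketch, assuming a `root.environment.cluster.host.metric` layout; the helper name, the host name, and the choice to replace dots in host names are illustrative assumptions, not Graphsky or collectd behaviour.

```python
def metric_name(environment, cluster, host, metric, root="collectd"):
    """Build a Graphite path that encodes environment and cluster.

    Dots separate path components in Graphite, so dots in the host name
    are replaced with underscores (an assumption made for this sketch).
    """
    parts = [root, environment, cluster, host.replace(".", "_"), metric]
    return ".".join(parts)
```

With a hypothetical host, `metric_name("production", "bits", "host1.example.org", "load.shortterm")` yields a name under the "collectd.production.bits." prefix.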

Groundwork

  • Overview: Unified systems monitoring and network management: Nagios(R), Nmap, RRDtool, etc. - integrated in one system administration tool
  • URL: http://www.groundworkopensource.com
  • Pro:
  • Con:
    • Nagios architecture: check scripts determine warning and critical state

Hyperic

  • Overview:
    • "Hyperic is application monitoring and performance management for virtual, physical, and cloud infrastructures. Auto-discover resources of 75+ technologies, including vSphere, and collect availability, performance, utilization, and throughput metrics."
    • Now owned by VMware
  • URL: http://www.hyperic.com
  • Pro:
  • Con:

Icinga

IDOUtils (NDOUtils)

  • Overview: "The IDOUtils (Icinga Data Output Utils) addon is designed to store all configuration and event (status, historical) data from Icinga into a relational database. Storing information from Icinga in an RDBMS will allow for quicker retrieval and processing of that data." An Event Broker plugin.
  • URL:
  • Pro:
  • Con:
    • Unclear how/if this achieves the goal of increased scalability
    • This seems to be only for "output" -- simply recording stats rather than using the database as an R/W data store

Jmxtrans

  • Overview: jmxtrans is effectively the missing connector between speaking to a JVM via JMX on one end and whatever logging / monitoring / graphing package that you can dream up on the other end.
  • URL: http://www.jmxtrans.org/
  • Pro:
    • Can log to Carbon/Graphite
  • Con:
  • Status: Currently in use for Hadoop
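
For reference, logging to Carbon/Graphite usually means writing lines in Carbon's plaintext format, `path value timestamp`, to a TCP listener. The Python sketch below shows that format only; it is not jmxtrans code, port 2003 is assumed as the listener port, and the function names and metric path are hypothetical.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext format: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_carbon(lines, host, port=2003, timeout=5.0):
    """Send pre-formatted lines to a Carbon plaintext listener (port assumed)."""
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)
```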

KairosDB

LibreNMS

Logstash

Logster

Merlin

Metricinga

Mod-gearman

Monit

Munin

Nagios

  • Overview: The de facto standard tool for availability monitoring and alerting
  • URL: http://www.nagios.org
  • Pro:
  • Con:
    • Nagios architecture: check scripts determine warning and critical state
    • Dissatisfaction with the project has led to multiple forks and rewrites: Centreon, Icinga, Naemon, OpsView, Shinken
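
The "Nagios architecture" con that recurs on this page refers to the plugin convention in which each check script applies its own thresholds and reports the state through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal Python sketch of that logic follows; the disk-usage wording, the default thresholds, and the function names are hypothetical.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a reading to a plugin state; the thresholds live in the check itself."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def run_check(value, warn=80.0, crit=90.0):
    """Return (exit_code, status_line) as a check script would report them."""
    state = evaluate(value, warn, crit)
    shown = "n/a" if value is None else f"{value:.1f}%"
    return state, f"DISK {LABELS[state]} - usage {shown}"
```

A real check script would print the status line and exit with the state code, so the server only ever sees the verdict rather than the raw measurement.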

NetDB

NRPE

  • Overview: Nagios Remote Plugin Executor - NRPE is an addon that allows you to execute plugins on remote Linux/Unix hosts. This is useful if you need to monitor local resources/attributes like disk usage, CPU load, memory usage, etc. on a remote host.
  • URL: http://exchange.nagios.org/directory/Addons/Monitoring-Agents/NRPE--2D-Nagios-Remote-Plugin-Executor/details
  • Pro:
    • Allows Nagios/Icinga server to trigger execution of check scripts on client nodes
  • Con:
    • Opens a TCP connection for each check script
    • Clients run a listener daemon, requiring complementary firewall rules on hosts which have public IPs
    • Check scripts are run synchronously (active checks), so slow checks can produce high-latency responses that degrade Nagios/Icinga server performance
    • Code review by Tim deemed NRPE unacceptable for Fundraising cluster
  • Status: Currently in use

NSCA

Observium

  • Overview:
  • URL:
  • Pro:
  • Con:
  • Status: We switched from this to LibreNMS

Oculus

  • Overview: Anomaly correlation: given an identified anomalous metric, searches for similar metrics to help determine scope and root cause.
  • URL: https://github.com/etsy/oculus
  • Pro:
  • Con:

OpenNMS

OpenTSDB

  • Overview: OpenTSDB is a distributed, scalable Time Series Database (TSDB) written on top of HBase. Unlike RRD or Whisper, it never deletes or downsamples data.
  • URL: http://opentsdb.net/
  • Pro:
    • Super-scalable, said to be similar to Google's proprietary Borgmon
  • Con:

OpsView

Pandora FMS

Prometheus

RANCID

Rearview

  • Overview: Allows users to create monitors that both visualize and alert on data as it streams from Graphite.
  • URL: https://github.com/livingsocial/rearview/
  • Pro:
    • Could replace Icinga for alerting
  • Con:
    • Crontab compatible time specification means minimum 1 minute sampling frequency

Riemann

  • Overview: Riemann is an event stream processor.
  • URL: http://riemann.io/
  • Pro:
    • Could replace Icinga for alerting
  • Con:

Sensu

Servermon

Seyren

Shinken

Skyline

  • Overview: Skyline is a real-time anomaly detection system, built to enable passive monitoring of hundreds of thousands of metrics, without the need to configure a model/thresholds for each one
  • URL: https://github.com/etsy/skyline
  • Pro:
  • Con:

Smokeping

Statsd

Tessera

Torrus

Umpire

  • Overview: Lets you test Graphite metrics via HTTP: "Umpire provides a normalized HTTP endpoint that responds with 200 / non-200 according to the metric check parameters specified in the requested URL."
  • URL: https://github.com/heroku/umpire
  • Pro:
  • Con:
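
The 200 / non-200 idea quoted in the overview can be sketched as a pure function in Python. This is not Umpire's implementation: averaging the datapoints, the specific non-200 codes, and the parameter names are all assumptions made for illustration.

```python
def check_status(values, minimum=None, maximum=None):
    """Return an HTTP-style status for a series of metric datapoints.

    200 means the average is within bounds. The other codes are this
    sketch's own choices: 400 for no bounds given, 404 for no data,
    500 for an out-of-range average.
    """
    if minimum is None and maximum is None:
        return 400
    points = [v for v in values if v is not None]
    if not points:
        return 404
    average = sum(points) / len(points)
    if minimum is not None and average < minimum:
        return 500
    if maximum is not None and average > maximum:
        return 500
    return 200
```

Reducing a check to a status code means any HTTP availability monitor can alert on a Graphite metric without understanding Graphite itself.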

Zenoss

Zabbix

Developed for WMF

Dbtree

Ishmael

Labsnagiosbuilder

Sqstat

  • Overview: Short WMF Perl script that feeds Squid/Varnish stats into Ganglia/Graphite
  • URL:
  • Pro:
  • Con:
  • Status: Currently in use

Tendril

Services

Boundary

Datadog

New Relic

  • Overview: Service - Popular choice for app-layer metrics, also offers system level monitoring
  • URL:
  • Pro:
  • Con:

Nimsoft Cloud Monitor (formerly Watchmouse)

PagerDuty

  • Overview: Service - "PagerDuty is the command center for IT, providing on-call schedule management, alerting and incident tracking. When your systems are down, we wake you up."
  • URL: http://www.pagerduty.com/
  • Pro:
    • Integrates with Nagios, Zenoss, Zabbix, Splunk, etc.
  • Con:

Pingdom

  • Overview: Service
  • URL:
  • Pro:
  • Con:

RIPE Atlas

Main article: RIPE Atlas

See also