Monitoring procedure

From Wikitech
This page contains historical information. It may be outdated or unreliable.

Proposed monitoring procedure

Daily:

  • Check Nagios for new alerts.
  • Fix simple issues such as daemons that need restarting or servers that can be rebooted remotely.
  • Record any issues that need on-site attention under datacenter tasks.
  • Pass responsibility for any more complex software issues to a competent staff member.
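
The daily routine above amounts to sorting each new alert into one of three outcomes. The following is a minimal sketch of that decision, not part of the original proposal; the `Alert` fields, the `Action` names, and the host and service names are hypothetical, and in practice the two flags would be set by the person reading the Nagios alert.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    FIX_NOW = "fix now (restart daemon or reboot remotely)"
    DATACENTER_TASK = "record under datacenter tasks"
    HAND_OFF = "pass to a competent staff member"


@dataclass
class Alert:
    host: str                   # hypothetical host name
    service: str                # hypothetical service name
    needs_onsite: bool = False  # assumption: judged by whoever checks the alert
    simple_fix: bool = False    # assumption: a restart or remote reboot is enough


def triage(alert: Alert) -> Action:
    """Map one alert to the daily-procedure outcome it falls under."""
    if alert.needs_onsite:
        return Action.DATACENTER_TASK
    if alert.simple_fix:
        return Action.FIX_NOW
    # Anything else is treated as a more complex software issue.
    return Action.HAND_OFF


if __name__ == "__main__":
    print(triage(Alert("db1", "mysql", simple_fix=True)).value)
```

Checking `needs_onsite` first is a design choice in this sketch: an issue that needs hands in the datacenter is recorded as such even if a remote workaround exists.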

Weekly:

  • Capacity check. Make sure key metrics such as application CPU utilisation and disk space usage are not approaching dangerous limits.
  • Publish a report detailing the times at which Nagios was checked, the issues noted, and the people notified. Alternatively, make this information available continuously so that it can be reviewed weekly.
  • Another team member should check the report and make sure that the monitoring done was of an appropriate standard.
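
As an illustration of the weekly capacity check, the sketch below compares a few metrics against warning limits. The limit values (85% disk, 75% CPU) are assumptions chosen for the example, not figures from this procedure, and `over_limit` and `disk_fraction_used` are hypothetical helper names. Disk usage is read with Python's standard `shutil.disk_usage`; CPU figures would have to come from whatever metrics system is in use.

```python
import shutil

# Hypothetical warning limits, as fractions of capacity; tune per service.
LIMITS = {"disk": 0.85, "cpu": 0.75}


def disk_fraction_used(path="/"):
    """Return the fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def over_limit(metrics, limits=LIMITS):
    """Return the sorted names of metrics that are at or above their limit."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in limits and value >= limits[name]
    )


if __name__ == "__main__":
    metrics = {"disk": disk_fraction_used("/")}
    print("Metrics over limit:", over_limit(metrics))
```

Metrics without a configured limit are ignored rather than flagged, so adding a new metric to the check also requires choosing a limit for it.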

Every one to two months:

  • Capacity review. Analyse capacity metrics and report your findings. Notify the team of upcoming performance bottlenecks which might require hardware purchases.
  • Report any long-term issues which have been left unfixed.
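
One simple way to spot an upcoming bottleneck during the capacity review is to fit a straight line to recent usage samples and estimate when it crosses a limit. The sketch below does this with an ordinary least-squares fit. It assumes roughly linear growth, which may not hold for real workloads, and the function name and sample figures are hypothetical.

```python
def days_until_limit(samples, limit):
    """Estimate days from the last sample until usage reaches `limit`.

    `samples` is a list of (day, usage) pairs ordered by day. Returns None
    when the fitted trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    # Least-squares slope and intercept of usage against time.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (limit - intercept) / slope
    return max(0.0, crossing_day - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical disk usage in GB, sampled weekly, against a 500 GB limit.
    weekly = [(0, 100), (7, 170), (14, 240)]
    print(days_until_limit(weekly, 500))
```

A result shorter than the usual lead time for hardware purchases is the kind of finding worth reporting to the team.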