Incidents/20141126-ocg

From Wikitech

Summary

OCG (Offline Content Generator) was unable to serve user requests (e.g. PDF versions of pages).


Timeline

  • 20141128T1140 a user reports on #wikimedia-operations that PDF generation is not working; investigation begins
  • 20141128T1147 disk space is suspected to be the root cause and is investigated first
  • 20141128T1200 older (14d) PDFs are removed from the ocg100* servers; OCG does not recover
  • 20141128T1200 OCG logs on logstash indicate failures while talking to redis; investigation moves to that
 Nov 25 15:37:43 ocg1002 ganglia-ocg[15741]: ocg_job_status_queue 503449
 Nov 25 15:38:19 ocg1002 ganglia-ocg[25920]: ocg_job_status_queue 0
  • 20141128T1220 it is discovered that the OCG configuration ships with a blank password
  • 20141128T1228 the configuration change that caused the impact is fixed

Conclusions

  • PDF generation was impacted for users starting 20141125T1537; no pages were issued
  • The alarm for "ocg.svc.eqiad.wmnet" had been silenced, and thus did not page
  • The icinga OCG health check issues WARNING even for CRITICAL conditions (e.g. HTTP 500 responses, connection refused, etc)
  • OCG disks were almost full, at >90% utilization
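
To illustrate the WARNING vs CRITICAL distinction noted above, here is a minimal sketch of how a health check could map probe results to the standard Nagios/Icinga plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). This is not the actual OCG check: the function name `classify`, the `slow` flag, and the choice to treat 4xx responses as WARNING are assumptions made for illustration only.

```python
from typing import Optional, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(http_status: Optional[int], slow: bool = False) -> Tuple[int, str]:
    """Map a probe result to (exit_code, message).

    http_status is None when no HTTP response was obtained at all
    (connection refused, timeout). Hypothetical helper, not the real check.
    """
    if http_status is None:
        # The service is unreachable: this must page.
        return CRITICAL, "CRITICAL: connection failed"
    if http_status >= 500:
        # Server-side errors mean requests are not being served.
        return CRITICAL, f"CRITICAL: HTTP {http_status}"
    if http_status >= 400:
        # Assumption: client errors point at the probe, not the service.
        return WARNING, f"WARNING: HTTP {http_status}"
    if slow:
        # Assumption: a slow but successful response is degraded, not down.
        return WARNING, f"WARNING: HTTP {http_status} but slow"
    return OK, f"OK: HTTP {http_status}"
```

A plugin wrapper would print the message and call `sys.exit()` with the returned code, so that an HTTP 500 or a refused connection surfaces as CRITICAL rather than WARNING.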

Actionables

  • Permanently silencing alarms for production services is discouraged. If silencing is desired for a given service, the "downtime" facility is preferred: downtime auto-expires after the chosen period, which avoids alarms staying silenced indefinitely.
  • OCG icinga health checks should correctly distinguish CRITICAL from WARNING conditions
  • Excessive disk utilization by the OCG service should be detected and the space reclaimed automatically (e.g. based on utilization thresholds or date thresholds)
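
The last actionable could be approached along these lines. The sketch below combines a utilization threshold with a date threshold, reusing the 90% and 14-day figures from this report as defaults. All function names and the directory layout are hypothetical; it does not describe how OCG actually stores or expires its output.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def disk_utilization(path: str) -> float:
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def expired_files(root: str, max_age_days: float,
                  now: Optional[float] = None) -> List[Path]:
    """List regular files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(p for p in Path(root).rglob("*")
                  if p.is_file() and p.stat().st_mtime < cutoff)


def reclaim(root: str, max_age_days: float = 14, threshold: float = 0.90,
            dry_run: bool = True) -> List[Path]:
    """Delete expired files, but only once utilization reaches threshold.

    Returns the files selected for deletion; with dry_run=True nothing
    is actually removed.
    """
    if disk_utilization(root) < threshold:
        return []
    victims = expired_files(root, max_age_days)
    if not dry_run:
        for path in victims:
            path.unlink()
    return victims
```

Run periodically (e.g. from cron) with `dry_run=False`, this would automate the manual removal of old PDFs described in the timeline. Defaulting to a dry run is a deliberate choice so that the selection can be reviewed before anything is deleted.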