Incidents/20160610-ORES

From Wikitech

Summary

ORES was down for an unknown number of hours today due to a broken configuration file (99-redis.yaml).

Timeline

At least 6 hours pass.

  • 2016-06-10 @ 1930 UTC -- 503 errors and timeouts were noted
  • 2016-06-10 @ 2030 UTC -- 99-redis.yaml files are deleted and the workers are restarted. Service is restored.

Conclusions

https://github.com/wikimedia/operations-puppet/commit/78119152c47b7873fdd7bd0c38a356b5bff27226 should not have been merged. We need a better testing process around puppet merges to make sure that they don't take down the service. Unlike a deploy, there's no clear event at which puppet is run.

Also, this downtime did not cause a paging event.
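
As an illustration of the kind of check that could have turned the 503 errors and timeouts into a page, here is a minimal sketch. It is not how ORES or Wikimedia monitoring is actually set up: the function names, the threshold of three consecutive failures, and the idea of probing a URL directly are all assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Iterable, Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for `url`, or None on timeout/connection error.

    Hypothetical helper: the URL to probe is supplied by the caller.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as 503) are raised as HTTPError.
        return err.code
    except OSError:
        # Covers URLError, socket timeouts, and refused connections.
        return None


def is_healthy(status: Optional[int]) -> bool:
    """Treat any 2xx status as healthy; None (no response) is unhealthy."""
    return status is not None and 200 <= status < 300


def should_page(statuses: Iterable[Optional[int]], threshold: int = 3) -> bool:
    """Return True once `threshold` consecutive probe results are unhealthy."""
    consecutive = 0
    for status in statuses:
        if is_healthy(status):
            consecutive = 0
        else:
            consecutive += 1
            if consecutive >= threshold:
                return True
    return False
```

Requiring several consecutive failures before paging is a design choice that trades a short detection delay for fewer false alarms from a single slow request.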

Actionables

  • Investigate why we were not paged when the downtime started (Phab:T137592)