Incidents/20160925-ores

From Wikitech

Summary

On September 25th, the ORES service had an elevated timeout ratio (~14%) for six hours because it ran out of disk space due to overly verbose logging.

Timeline

  • Sept 25 10:34:40 UTC 2016: Icinga test on ORES failed due to a timeout.
  • 14:13 UTC: phab:T146581 is created.
  • 16:03 UTC: The fix is deployed in labs.
  • 16:26 UTC: The fix is deployed in prod.

Conclusions

We should have better monitoring of disk space and be careful about the verbosity of production service logs.
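
As an illustration only, the two points above could look something like the following minimal Python sketch. It is not the check or the fix that was actually deployed: the function names, the thresholds (80% and 90%), the status strings, and the logger name `ores_example` are all assumptions made for the example.

```python
import logging
import shutil


def disk_usage_ratio(path="."):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is used."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def classify(ratio, warn_at=0.80, crit_at=0.90):
    """Map a usage ratio to a status string (thresholds are illustrative)."""
    if ratio >= crit_at:
        return "CRITICAL"
    if ratio >= warn_at:
        return "WARNING"
    return "OK"


def configure_production_logging(name="ores_example"):
    """Return a logger that drops DEBUG/INFO records to keep log volume low."""
    logger = logging.getLogger(name)  # hypothetical logger name
    logger.setLevel(logging.WARNING)
    return logger


if __name__ == "__main__":
    ratio = disk_usage_ratio(".")
    print(f"disk usage: {ratio:.1%} -> {classify(ratio)}")
    log = configure_production_logging()
    log.debug("not emitted at WARNING level")
    log.warning("emitted")
```

A check like this would typically be run periodically by the monitoring system, with the alert raised well before the disk is full.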

Actionables