Incidents/2017-06-13 ORES

From Wikitech

Summary

ORES had an intermittent outage from 1600 to 1940 UTC on June 13th, 2017. The issue was traced to a single host, scb1001.eqiad.wmnet.

Timeline

See https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1497366649350&to=1497383640251&panelId=2&fullscreen

  • 1600 UTC: Errors rise for ORES (not noticed; no Icinga alerts)
  • 1700 UTC: Deployment for task T167223 begins
  • 1715 UTC: During canary check, error rate is noted and task T167819 is created with "Unbreak now"
  • 1740 UTC: Problem is determined to be independent of the deploy. The decision is made to continue with the deploy.
  • 1816 UTC: Ops is pulled in (mutante responds). Rollback of deploy is considered but rejected.
  • 1828 UTC: Problem is narrowed down to scb1001 specifically. Logs show no errors despite intermittent 500s
  • 1923 UTC: mutante notes that PDF rendering is using a lot of CPU and kills the rendering process
  • 1940 UTC: Recovery confirmed.

Conclusions

  • Icinga did not alert us to the issue
  • for some reason, the errors were not being written to app.log
  • ORES appears to have been competing for resources with PDF rendering
  • memory was very tight on SCB for the duration of the outage

Actionables

  • task T167830 -- "Extend icinga check to catch 500 errors like those of the 20170613 incident"
  • task T146664 -- "Limit resources used by ORES", and move ORES to dedicated hardware (see task T157222)
  • Limit resources used by the pdfrender service: task T167834
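
The first actionable asks for a check that catches intermittent 500s. A minimal sketch of what such a check could look like follows. It is not the check implemented in T167830. The sample count, the warning and critical ratios, and the use of 599 as a stand-in code for connection failures are all assumptions made for illustration. The idea is to sample the endpoint several times rather than once, since a single probe can easily miss an intermittent failure.

```python
import urllib.error
import urllib.request

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios/Icinga plugin exit codes


def sample_status_codes(url, attempts=20, timeout=5.0):
    """Request `url` repeatedly and return the HTTP status code of each attempt."""
    codes = []
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                codes.append(response.status)
        except urllib.error.HTTPError as err:
            codes.append(err.code)  # 4xx/5xx responses are raised as HTTPError
        except (urllib.error.URLError, OSError):
            codes.append(599)  # assumption: count connection failures as errors
    return codes


def evaluate(codes, warn_ratio=0.05, crit_ratio=0.20):
    """Map sampled status codes to a plugin exit code and a status message."""
    if not codes:
        return CRITICAL, "CRITICAL: no samples collected"
    errors = sum(1 for code in codes if code >= 500)
    ratio = errors / len(codes)
    message = f"{errors}/{len(codes)} requests returned 5xx"
    if ratio >= crit_ratio:
        return CRITICAL, "CRITICAL: " + message
    if ratio >= warn_ratio:
        return WARNING, "WARNING: " + message
    return OK, "OK: " + message
```

A wrapper script would pass the result of sample_status_codes() to evaluate(), print the message, and exit with the returned code so that Icinga can pick up the state.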