Incidents/2017-06-13 ORES


Summary

ORES had an intermittent outage from 1600 to 1940 UTC on June 13, 2017. The issue was traced to scb1001.eqiad.wmnet.

Timeline

See https://grafana.wikimedia.org/dashboard/db/ores?orgId=1&from=1497366649350&to=1497383640251&panelId=2&fullscreen

1600 UTC: Errors rise for ORES (not noticed; no icinga alerts)
1700 UTC: Deployment for task T167223 begins
1715 UTC: During canary check, error rate is noted and task T167819 is created with "Unbreak now"
1740 UTC: Problem is determined to be independent of the deploy. The decision is made to continue with the deploy.
1816 UTC: Ops is pulled in (mutante responds). Rollback of deploy is considered but rejected.
1828 UTC: Problem is narrowed down to scb1001 specifically. Logs show no errors despite intermittent 500s.
1923 UTC: Mutante notes that pdf rendering is taking a lot of CPU and kills it
1940 UTC: Recovery confirmed.

Conclusions

icinga did not alert us to the issue
for an unknown reason, the errors were not being written to app.log
there appears to have been a resource usage conflict with pdf rendering
memory was very tight on SCB for the duration of the outage

Actionables

task T167830 -- "Extend icinga check to catch 500 errors like those of the 20170613 incident"
task T146664 -- "Limit resources used by ORES"; also move ORES to dedicated hardware (see task T157222)
Limit resources used by the pdfrender service: task T167834
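
As an illustration of the first actionable, here is a minimal sketch of how a check could flag intermittent 500s by looking at the share of 5xx responses across several sampled requests instead of a single probe. It is not the actual icinga check: the function name `classify_error_rate` and both thresholds are hypothetical, and the HTTP sampling itself is left out. Only the 0/1/2/3 exit codes follow the standard Nagios/Icinga plugin convention.

```python
from typing import Iterable, Tuple

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_error_rate(
    statuses: Iterable[int],
    warn_ratio: float = 0.05,  # hypothetical threshold, not from the incident
    crit_ratio: float = 0.20,  # hypothetical threshold, not from the incident
) -> Tuple[int, str]:
    """Map a batch of sampled HTTP status codes to a plugin exit code."""
    codes = list(statuses)
    if not codes:
        return UNKNOWN, "UNKNOWN: no responses sampled"

    # Count server-side errors (any 5xx status).
    errors = sum(1 for code in codes if 500 <= code <= 599)
    ratio = errors / len(codes)
    summary = f"{errors}/{len(codes)} responses were 5xx ({ratio:.0%})"

    if ratio >= crit_ratio:
        return CRITICAL, f"CRITICAL: {summary}"
    if ratio >= warn_ratio:
        return WARNING, f"WARNING: {summary}"
    return OK, f"OK: {summary}"
```

A wrapper script would collect status codes from a series of requests, print the returned message, and exit with the returned code. Sampling several requests matters here because the errors were intermittent, so any single request could succeed while the service was still degraded.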