Incidents/20160620-ores


Summary

ores.wikimedia.org was down for about twenty minutes today after we deployed a commit that changed how the config directory is read without ensuring the proper order.
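To illustrate why order matters when reading a config directory, here is a minimal sketch. It is not the ORES code: it assumes the config directory holds several files that are merged together, with later files overriding earlier ones, and it uses JSON and the hypothetical names `load_config_dir`, `00-main.json` and `99-local.json` purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_config_dir(directory):
    """Merge every *.json file in `directory`, in sorted filename order.

    Hypothetical helper: later files override keys from earlier ones
    (shallow merge only).
    """
    merged = {}
    # sorted() makes the order explicit; the order in which Path.glob
    # yields files is not guaranteed.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as f:
            merged.update(json.load(f))
    return merged


def demo():
    """Build a throwaway config directory and load it."""
    with tempfile.TemporaryDirectory() as d:
        # Base settings, read first because of the "00-" prefix.
        Path(d, "00-main.json").write_text(
            json.dumps({"workers": 4, "queue": "celery"})
        )
        # Local override, read last because of the "99-" prefix.
        Path(d, "99-local.json").write_text(json.dumps({"workers": 16}))
        return load_config_dir(d)
```

If the files were read in the opposite order, the base file would overwrite the local override, which is the kind of silent misconfiguration that only shows up once the service restarts.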

Timeline

SAL log

10:58 Amir1: deploying bdc1e2b in ores nodes
11:04 deployment finished and ores went down
- The puppet agent ran and the services (uwsgi-ores, celery-ores-worker) were restarted. This did not solve the problem
- The logs showed that the problem persisted because the config was being read incorrectly
11:27 Amir1: rollbacking ae71d842dfc0958e06922062dd09d49243332a6a
- ORES went live again
12:13 Amir1: deploying bdc1e2bd only to ores on scb2001 (codfw)
- Did not work as expected. There was no downtime because only that one node in codfw was affected.
13:04 Amir1: deploying 8e65182 to scb2001
- The problem was fixed in 295214
- Worked perfectly fine
13:06 Amir1: deploying 8e65182 to all ores nodes

Conclusions

A shallow explanation would blame the code that reads the config directories, which has changed many times and is now fairly stable. Stopping at that explanation would be dangerous. What we really need is a safe method of deploying ORES, like the one we used the second time today: deploy to a single node first, check that it works, and only then deploy to all nodes. What remains is to document those steps.
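The staged approach used for the second deployment (one node first, then all nodes) can be sketched as follows. This is an assumption-laden outline rather than the actual deployment tooling: `staged_deploy` and the `deploy`, `is_healthy` and `rollback` callables are hypothetical placeholders for whatever commands and checks the real process uses.

```python
def staged_deploy(commit, canary, nodes, deploy, is_healthy, rollback):
    """Deploy `commit` to `canary` first, then to the remaining `nodes`.

    `deploy(node, commit)`, `is_healthy(node)` and `rollback(node)` are
    caller-supplied, hypothetical callables. Returns True if the commit
    reached every node, False if the canary failed and was rolled back.
    """
    deploy(canary, commit)
    if not is_healthy(canary):
        # Stop here: only the canary was touched, so the rest keep serving.
        rollback(canary)
        return False
    for node in nodes:
        if node != canary:
            deploy(node, commit)
    return True


if __name__ == "__main__":
    # Dry run with stand-in callables; "node-a" is a made-up node name.
    ok = staged_deploy(
        "8e65182",
        "scb2001",
        ["scb2001", "node-a"],
        deploy=lambda node, commit: print(f"deploy {commit} to {node}"),
        is_healthy=lambda node: True,
        rollback=lambda node: print(f"rollback {node}"),
    )
    print("deployed everywhere" if ok else "stopped at canary")
```

The point of the design is that a bad commit, like the one in the 12:13 entry, only ever reaches one node.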

Actionables

Status: Unresolved. Document safe steps to deploy ORES in prod (bug T138234)