Incidents/20160620-ores

From Wikitech
Summary

ores.wikimedia.org was down for about twenty minutes today (2016-06-20) because we deployed a commit that changed how the config directory is read, without following the proper order.
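The report does not spell out exactly how the config reading went wrong. As one illustration of why ordering can matter, here is a minimal sketch that assumes the configuration is split across several files in a directory and merged in sequence, with later files overriding earlier ones. The function name `load_config_dir`, the JSON format, and the file names are hypothetical and are not taken from the ORES code base.

```python
import json
import tempfile
from pathlib import Path


def load_config_dir(directory):
    """Merge every *.json file in `directory` in sorted filename order.

    Later files override earlier ones (shallow merge), so the read order
    must be deterministic for the result to be predictable.
    """
    merged = {}
    # sorted() makes the order explicit; Path.glob() alone gives no
    # ordering guarantee across platforms and filesystems.
    for path in sorted(Path(directory).glob("*.json")):
        with path.open() as handle:
            merged.update(json.load(handle))
    return merged


def demo():
    """Build a throwaway config directory and load it (hypothetical file names)."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "00-defaults.json").write_text(
            json.dumps({"workers": 4, "debug": False})
        )
        Path(tmp, "99-local.json").write_text(json.dumps({"workers": 16}))
        return load_config_dir(tmp)
```

In this sketch `99-local.json` sorts after `00-defaults.json`, so its `workers` value overrides the default. If the files were read in the opposite order, the default would silently override the local setting.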

Timeline

SAL log

  • 10:58 Amir1: deploying bdc1e2b in ores nodes
  • 11:04 deployment finished and ores went down
    • The puppet agent ran and the services (uwsgi-ores, celery-ores-worker) were restarted, but this did not solve the problem
    • Checking the logs showed that the problem persisted because of bad config reading
  • 11:27 Amir1: rollbacking ae71d842dfc0958e06922062dd09d49243332a6a
    • ORES went live again
  • 12:13 Amir1: deploying bdc1e2bd only to ores on scb2001 (codfw)
    • Did not work as expected. (No downtime, because it only affected that one node in codfw.)
  • 13:04 Amir1: deploying 8e65182 to scb2001
    • We fixed the problem in 295214
    • This worked perfectly fine
  • 13:06 Amir1: deploying 8e65182 to all ores nodes

Conclusions

A shallow reading would blame the way config directories are read, which changed a lot and is now fairly stable, but stopping at that conclusion is dangerous. What we really need is a safe method for deploying ORES, like the one we used the second time today: deploy to a single node first (scb2001), confirm that it works, and only then deploy to all ORES nodes. The only thing left is to document these steps.
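The staged approach described above can be sketched as follows. This is only an outline under stated assumptions: `deploy`, `is_healthy`, and `rollback` are hypothetical callables standing in for whatever deployment tooling and health checks are actually used, and the node names other than scb2001 are placeholders.

```python
def staged_rollout(nodes, canary, deploy, is_healthy, rollback):
    """Deploy to one canary node first and continue only if it stays healthy.

    Returns the list of nodes that ended up on the new revision.
    All three callables are hypothetical hooks supplied by the caller.
    """
    deploy(canary)
    if not is_healthy(canary):
        # Stop here: only the canary was affected, so the rest of the
        # fleet keeps serving traffic on the old revision.
        rollback(canary)
        return []
    deployed = [canary]
    for node in nodes:
        if node != canary:
            deploy(node)
            deployed.append(node)
    return deployed


def demo(canary_ok):
    """Run the rollout with stub hooks that only record what was called."""
    log = []
    result = staged_rollout(
        nodes=["scb2001", "node-b", "node-c"],  # node-b/node-c are placeholders
        canary="scb2001",
        deploy=lambda node: log.append(("deploy", node)),
        is_healthy=lambda node: canary_ok,
        rollback=lambda node: log.append(("rollback", node)),
    )
    return result, log
```

With a failing canary, `demo(False)` touches only scb2001 and rolls it back, which mirrors the 12:13 attempt in the timeline where the failure caused no downtime.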

Actionables

  • Status: Unresolved. Document safe steps to deploy ORES in prod (bug T138234)