Incidents/20160407-Mediawiki

From Wikitech

Summary

All MediaWiki servers served mostly HTTP 5XX errors for about 5 minutes, starting at 13:50 UTC.

Timeline

  • 13:50 UTC: CirrusSearch traffic switched to codfw with a buggy configuration (see https://gerrit.wikimedia.org/r/#/c/282163/ for the correction)
  • almost immediate rise in HTTP 5XX errors, to 400K errors / minute
  • 13:53 UTC: rollback
  • 13:55 UTC: error rate back to a reasonable level

Conclusions

  1. unit testing wmf-config is hard
  2. testing datacenter-related configuration changes is not possible on labs
  3. carefully testing this kind of change on the test nodes (mw1017/mw1099/mw2017/mw2099) is the minimum required

Actionables

Immediate issues have been addressed. This incident is mainly about human error (mine) and insufficient testing (me again).

  • a standardized and automated canary test system would help mitigate this kind of issue, but is probably a long-term action outside the scope of a post-incident action.
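
As a rough illustration of what such a canary check could look like, here is a minimal sketch in Python. It is not an existing Wikimedia tool. The function names, the 1% threshold, and the number of requests per host are assumptions made for this example. The host list reuses the test nodes named in the conclusions. The `probe` callable stands in for whatever would actually send a request to a host and return its HTTP status code.

```python
from typing import Callable, Iterable, Sequence

# Test nodes named in the conclusions, used here as canaries (illustrative choice).
CANARY_HOSTS = ["mw1017", "mw1099", "mw2017", "mw2099"]


def error_rate(statuses: Iterable[int]) -> float:
    """Return the fraction of HTTP 5XX responses among the given status codes."""
    codes = list(statuses)
    if not codes:
        # No data is treated as a failure so that a silent canary blocks the deploy.
        return 1.0
    errors = sum(1 for code in codes if 500 <= code <= 599)
    return errors / len(codes)


def canary_passes(
    probe: Callable[[str], int],
    hosts: Sequence[str],
    requests_per_host: int = 20,    # assumed sample size
    max_error_rate: float = 0.01,   # assumed threshold
) -> bool:
    """Probe each canary host and return False as soon as one exceeds the threshold."""
    for host in hosts:
        statuses = [probe(host) for _ in range(requests_per_host)]
        if error_rate(statuses) > max_error_rate:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical probe that simulates a broken configuration on one host.
    def fake_probe(host: str) -> int:
        return 503 if host == "mw2017" else 200

    if canary_passes(fake_probe, CANARY_HOSTS):
        print("canaries healthy, continue the rollout")
    else:
        print("canary failure, stop and roll back")
```

With a check of this kind run against the test nodes before the full switch, a configuration that produces mostly 5XX responses would be stopped at the canary stage instead of reaching all servers.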