Incidents/2019-12-04 MediaWiki

From Wikitech

document status: final

Summary

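Rolling out wmf.8 to group1 caused a production outage: group1 and group2 wikis became sluggish and then stopped working entirely until the train was rolled back to group0 only (23:30 to 23:38 UTC).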

Detection

Initial indicators of the issue were picked up in logstash and via logspam-watch on mwlog1001. A large number of Icinga alerts followed.

It seems likely that the primary issue was obscured during the initial deploy by a focus on Parsoid errors.

Timeline

All times in UTC.

  • 20:12 brennen: Train wmf.8 rolled forward from group0 to group1 as well (try 1) [1]
  • 20:12 Large amounts of logspam noticed, especially from Parsoid/PHP; Icinga issues many alerts.
  • 20:28 brennen: Train wmf.8 rolled back to just group0 [2]

[Fixes to exclude Parsoid/PHP]

  • 23:30 brennen: Train wmf.8 rolled forward from group0 to group1 as well (try 2) [3]
  • 23:30 OUTAGE BEGINS
  • 23:30 Large spike in database errors in logstash (T239877); shortly thereafter a large number of Icinga alerts fire.
  • 23:30+ Production group1 and group2 wikis become noticeably sluggish and eventually stop working entirely.
  • 23:35 brennen: Attempted train wmf.8 roll back thwarted by canary failures [4]
  • 23:38 brennen: Train wmf.8 rolled back to just group0, again [5]
  • 23:38 OUTAGE ENDS