Incidents/20160212-AllWikisOutage

From Wikitech

Summary

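While syncing files to backport a logging enhancement to MediaWiki 1.27.0-wmf.13, the changes were propagated in the wrong order: Setup.php, which calls the new SessionManager::checkIpLimits() method, was synced before SessionManager.php, which defines it. This resulted in the HHVM fatal error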

Call to undefined method MediaWiki\Session\SessionManager::checkIpLimits() in /srv/mediawiki/php-1.27.0-wmf.13/includes/Setup.php on line 812

for all requests to all wikis until the updated version of php-1.27.0-wmf.13/includes/session/SessionManager.php was synced to the cluster. The outage lasted approximately 2.5 minutes, from 2016-02-12T19:13 to 2016-02-12T19:16.

Timeline

[18:30:05] <jouncebot>	 bd808 tgr anomie: Dear anthropoid, the time has come. Please deploy Debug logging enhancements (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160212T1830).
...
[18:37:20] <bd808>	 Krenair: all clear on mira?
[18:37:22] <Krenair>	 bd808, yep
...
[19:12:34] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/DefaultSettings.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 16s)
[19:12:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:14:09] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/Setup.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (duration: 01m 18s)
[19:14:34] <paladox>	 wikipedia has gone down for me https://en.wikipedia.org/
[19:14:36] <bd808>	 shit. synced in wrong order
[19:14:41] <paladox>	 Request from 10.20.0.104 via cp1065 cp1065 ([10.64.0.102]:3128), Varnish XID 1730353932
[19:14:41] <paladox>	 Forwarded for: 81.140.246.2, 10.20.0.104, 10.20.0.104, 10.20.0.104
[19:14:41] <paladox>	 Error: 503, Service Unavailable at Fri, 12 Feb 2016 19:14:22 GMT 
[19:14:44] <sjoerddebruin>	 503's yep
[19:14:47] <apergos>	 wikitech empty main page. er?
[19:14:48] <bd808>	 will be fixed in 2 minutes
[19:14:49] <apergos>	 anyways
[19:15:04] <gwicke>	 uh oh, api is throwing lots of 503s
[19:15:12] <bd808>	 !log Synced files for T125455 in wrong order; broke all wikis
[19:15:26] <bd808>	 the fix is syncing now :/
[19:15:44] <logmsgbot>	 !log bd808@mira Synchronized php-1.27.0-wmf.13/includes/session/SessionManager.php: Log multiple IPs using the same session or the same user account (4d8b8ca) (T125455) (duration: 01m 17s)
[19:15:47] <bd808>	 better?
[19:15:58] <gwicke>	 bd808: back for me
[19:16:17] <paladox>	 its back up now.
[19:16:26] <paladox>	 Thanks for fixing the problem.
[19:16:28] <bd808>	 sorry everyone. brain fart from me
[19:16:35] <Krenair>	 woah
[19:16:39] <gwicke>	 we really ought to stop breaking everything at once
[19:16:55] <bd808>	 !log Wikis back up thankfully
[19:16:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master

Conclusions

  • Entirely operator error. The deployer should have understood how the changes were interrelated and performed the sync of SessionManager.php before Setup.php.
  • Having the sync-file statements prepared ahead of time in a text document allowed quick action to sync the missing file.
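
The ordering rule in the first conclusion (sync the file that defines a method before the file that calls it) can be checked mechanically. The following is a minimal sketch, not part of the deployment tooling: the dependency map is hand-written and hypothetical, the only dependency taken from this report is that Setup.php needs the updated SessionManager.php, and DefaultSettings.php is assumed to have none. It requires Python 3.9 or later for graphlib.

```python
from graphlib import TopologicalSorter

# Hypothetical, hand-written dependency map: each file maps to the set of
# files that must already be on the cluster before it is synced.
DEPENDS_ON = {
    # Assumed to have no dependencies (not stated in the report).
    "php-1.27.0-wmf.13/includes/DefaultSettings.php": set(),
    "php-1.27.0-wmf.13/includes/session/SessionManager.php": set(),
    # Setup.php calls SessionManager::checkIpLimits(), so it must go last.
    "php-1.27.0-wmf.13/includes/Setup.php": {
        "php-1.27.0-wmf.13/includes/session/SessionManager.php",
    },
}


def sync_order(depends_on):
    """Return the files ordered so every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


order = sync_order(DEPENDS_ON)
for path in order:
    print(path)
```

TopologicalSorter raises graphlib.CycleError if the map contains a cycle, which would indicate that the files cannot be synced one at a time without a broken intermediate state.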

Actionables

  • Use a less risky deployment process. Except for emergencies, always deploy to a canary first, followed by a rolling deploy. Ideally, have a mechanism to automatically detect errors and abort an ongoing deploy. phab:T121597
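
The canary-then-rolling process described above can be sketched as follows. This is an illustration under stated assumptions, not the mechanism tracked in phab:T121597: the function staged_deploy, its parameters, the host names, and the error-rate threshold are all hypothetical, and the sync and error-rate callbacks stand in for whatever the real tooling would provide.

```python
def staged_deploy(hosts, sync, error_rate, canaries=1, batch_size=2,
                  max_error_rate=0.01):
    """Sync to the canaries first, then to the remaining hosts in batches.

    After each batch, check the error rate of the hosts just synced and
    stop as soon as any of them exceeds max_error_rate.
    Returns (synced_hosts, aborted).
    """
    batches = [hosts[:canaries]] + [
        hosts[i:i + batch_size]
        for i in range(canaries, len(hosts), batch_size)
    ]
    synced = []
    for batch in batches:
        for host in batch:
            sync(host)
            synced.append(host)
        if any(error_rate(host) > max_error_rate for host in batch):
            return synced, True  # abort: do not touch the remaining hosts
    return synced, False


# Demo with in-memory stand-ins (hypothetical host names).
hosts = ["mw1", "mw2", "mw3", "mw4", "mw5"]
deployed = set()


def fake_sync(host):
    deployed.add(host)


def broken_error_rate(host):
    # Simulates a bad change: every host that received it fails all requests.
    return 1.0 if host in deployed else 0.0


synced, aborted = staged_deploy(hosts, fake_sync, broken_error_rate)
print(synced, aborted)
```

In this simulation the bad change reaches only the single canary before the deploy stops. The sketch does not roll back the hosts that were already synced; that would be a separate step.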