Incidents/2017-01-11 multiversion

From Wikitech

Summary

The multiversion code is poorly understood by many deployers. The code is complex and its entry points are a mess; an effort to clean this up is ongoing. On Jan 11th, a fairly involved refactor landed and caused a brief outage, even though it had been tested in beta and on mwdebug* and had passed the canary checks.

Timeline

  • 18:28: Gerrit #331552 was merged
  • tested on beta, mwdebug, etc
  • 18:56 demon@tin: Synchronized multiversion/MWMultiVersion.php: Attempt #2 for Multiversion cleanup (duration: 00m 41s)
  • 19:27 demon@tin: Synchronized php-1.29.0-wmf.7/extensions/FlaggedRevs: Stupid errors (duration: 00m 46s)
    • Not technically related, but testing this made weird autoloader bugs more apparent (also seen in TMH), so we backported a fix here
  • 19:34 demon@tin: Synchronized multiversion: MWVersion fallbacks & such (duration: 00m 56s)
  • Outage reported immediately; rollback began
    • PHP fatal error: Call to undefined method stdClass::get()
  • 19:36 demon@tin: Synchronized multiversion: rollback (duration: 00m 56s)

Conclusions

The canary checks for MediaWiki remain insufficient to catch production errors before code goes live. mwdebug* is useful for testing specific config changes, but it does not receive "real" traffic, so it is hard to test things extensively there. The multiversion code is incredibly fragile, but we knew this. This refactor is complicated and should be broken down even further than it already is; small changes are best for this endeavor.

Actionables

  • Status: Done. Include a fatal log rate check in the scap canary test - task T154646
  • Status: Done. All entry points (including CLI) should be subject to canary checks - task T121597
  • Status: Done. T152005 did not cause or exacerbate the outage, but it was noticed at the time and its priority was raised
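
The first actionable, a fatal log rate check in the canary test, can be illustrated with a minimal sketch. This is not scap's actual implementation: the names `LogWindow` and `canary_passes` and the thresholds `max_ratio` and `min_fatals` are hypothetical, chosen only to show the idea of comparing the fatal rate on canary hosts before and after a sync.

```python
from dataclasses import dataclass


@dataclass
class LogWindow:
    """Fatal-error count seen on the canary hosts over a time window.

    Hypothetical structure for illustration; not part of scap.
    """
    fatal_count: int
    seconds: float

    @property
    def rate(self) -> float:
        # Fatals per second; an empty window counts as a zero rate.
        return self.fatal_count / self.seconds if self.seconds > 0 else 0.0


def canary_passes(before: LogWindow, after: LogWindow,
                  max_ratio: float = 10.0, min_fatals: int = 5) -> bool:
    """Return True if the post-sync fatal rate looks acceptable.

    max_ratio and min_fatals are assumed thresholds, not real defaults.
    """
    # Ignore a handful of fatals so background noise does not block deploys.
    if after.fatal_count < min_fatals:
        return True
    # No fatals before the sync but a burst afterwards: fail the canary.
    if before.rate == 0.0:
        return False
    # Otherwise fail only when the rate grew by more than max_ratio.
    return after.rate <= before.rate * max_ratio
```

A deploy tool using a check like this would sync to the canary hosts first, wait for the second window to fill, and abort the wider sync when `canary_passes` returns False. A fatal such as the `stdClass::get()` error above, hitting every request, would push the post-sync rate far past any reasonable ratio.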