Incidents/2017-08-14 Train

From Wikitech

Summary

The train for 1.30.0-wmf.14 (week of 2017-08-14) went to group1 on Wednesday and was rolled back the same day because of task T173462 ("Cannot flush pre-lock snapshot because writes are pending"). On Thursday morning, database lag caused by a problem in Wikidata 1.30.0-wmf.12 (a submodule of MW core 1.30.0-wmf.13), tracked as task T164173 ("Cache invalidations coming from the JobQueue are causing lag on several wikis"), made it urgent to roll group1 forward to 1.30.0-wmf.14 despite its known problems.

Timeline

  • 2017-08-15 WMF holiday, shortened train schedule starting Wednesday 2017-08-16
  • 2017-08-16 Several tasks were blocking 1.30.0-wmf.14. On reading them, they all seemed to relate to a new Wikidata release, which should be a submodule of the new release. Commented on the tasks (task T172394#3528232, task T172320#3528235, task T172394#3528232)
  • 2017-08-16 19:35:13 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to 1.30.0-wmf.14
  • 2017-08-16 21:10 Noticed a slow but steady increase in "Cannot flush pre-lock snapshot because writes are pending" errors in logstash. Filed as task T173462
  • 2017-08-16 21:21:28 <logmsgbot> !log thcipriani@tin Synchronized php: revert group1 wikis to 1.30.0-wmf.14 for T173462 (duration: 00m 47s)
  • 2017-08-16 21:21:54 Sent email to engineering-l, wikitech-l: https://lists.wikimedia.org/pipermail/engineering/2017-August/000457.html
  • 2017-08-17 15:43:26 <marostegui> Actually it is not that, it was our friend: https://phabricator.wikimedia.org/T164173
  • 2017-08-17 16:10:20 <Goatification> jynus: regarding https://phabricator.wikimedia.org/T164173 I'm around to work on it (I'm Amir1), but it's outside of Wikidata team because the fix is merged and not deployed because of https://phabricator.wikimedia.org/T173462
  • 2017-08-17 16:19:14 <thcipriani> sigh. I'm not sure what the user impact is for https://phabricator.wikimedia.org/T173462 but it sounds like the user impact from halting the train may outsize it?
  • 2017-08-17 16:27:11 <greg-g> anyone else agree we need aaron's input?
  • 2017-08-17 16:33:58 <thcipriani> so my understanding is that https://phabricator.wikimedia.org/T164173 has a fix that is in wmf.14 (just from reading that ticket) the rollout of which is blocked on https://phabricator.wikimedia.org/T173462 which AaronSchulz has a patch for
  • 2017-08-17 16:37:42 <AaronSchulz> thcipriani: I did some quick local testing and un-WIP'ed it
  • 2017-08-17 16:51:30 <logmsgbot> !log thcipriani@tin Synchronized php-1.30.0-wmf.14/includes/jobqueue/jobs/RefreshLinksJob.php: Avoid lock acquisition errors for multi-title refreshlinks jobs T173462 (duration: 00m 51s)
  • 2017-08-17 16:54:53 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.14 now for T164173
  • 2017-08-17 16:57:32 <thcipriani> AaronSchulz: Goatification hrm now after rolling forward I'm seeing a lot of error: Stack overflow in /srv/mediawiki/php-1.30.0-wmf.14/includes/libs/objectcache/WANObjectCache.php on line 251 and error: Stack overflow in /srv/mediawiki/php-1.30.0-wmf.14/includes/libs/objectcache/MemcachedBagOStuff.php on line 182
  • Created task T173520
  • 2017-08-17 17:20:06 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis back to wmf.13 now T173520
  • 2017-08-17 17:53:17 <AaronSchulz> thcipriani: I'm betting on b48f361d7d606eff5ab48cc2a64c1cae4e794c84
  • 2017-08-17 18:22:30 <thcipriani> AaronSchulz: I think this is everything, but it's definitely a ton: https://gerrit.wikimedia.org/r/#/c/372427/
  • 2017-08-17 19:07:41 <logmsgbot> !log thcipriani@tin Finished scap: ProofReadPage Revert to db7507246665e69384c1d92af2aedc62263a5116 T173520 (duration: 06m 13s)
  • 2017-08-17 19:12:13 <logmsgbot> !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group1 wikis to wmf.14

Conclusions

  • We put the train in a position where the previous version had big problems, but the new version had different problems
  • The Wikidata build process made it hard to reason about backporting fixes
  • The deployment process requires a lot of people to be around to fix things
  • Patches related to change propagation (core and Wikibase) should be tested locally on pages with multiple backlinks, using edits that actually change some of the links, property, or other tracking tables
  • Change propagation is complex and involves multiple wikis and manual testing of patches; it might be worth investigating a more automated approach

Actionables