
Incidents/20140919-s1

From Wikitech

Summary

At approximately 08:50 on 2014-09-19, enwiki experienced a site outage. The apparent order of events was:

1. Three edits were made to a template:

https://en.wikipedia.org/w/index.php?title=Template:Redirect_template&action=history

2. Jobrunner write activity[1] from the wikiadmin user on the enwiki master increased substantially, with Cirrus jobs the most prominent:

https://logstash.wikimedia.org/#/dashboard/elasticsearch/LinksUpdate%20issues

The binlog showed a large number of LinksUpdate writes (LinksUpdate is often in the top 10, but its counts normally blend in with those of other traffic):

905809 LinksUpdate::incrTableUpdate
359301 LinksUpdate::updateLinksTimestamp
  9372 Title::invalidateCache
  8728 FlaggableWikiPage::clearStableVersion
  8099 User::invalidateCache
  5753 Revision::insertOn
  4692 ArticleCompileProcessor::save
  4607 `heartbeat`.`heartbeat`
  4404 FlaggedRevs::clearStableOnlyDeps
  3261 CheckUserHooks::updateCheckUserData
  3257 SiteStatsUpdate::tryDBUpdateInternal
  3185 RecentChange::save

The job activity occurred in waves with periods of very heavy writes, then minutes of nothing.

3. enwiki slaves started to experience intermittent replication lag. The main offender was:

DELETE /* LinksUpdate::incrTableUpdate

4. Surges of wikiuser DB connections to the slaves began appearing after each write surge described in #2. These hit max_connections on all slaves simultaneously, and the apaches went critical. Note that no slow queries were involved; there were simply an order of magnitude more connections and queries than normal.

5. Significant numbers of wikiadmin connections sat in SELECT MASTER_POS_WAIT because of the replication lag in #3, which reduced the connections available for the wikiuser surges in #4.

6. We killed masses of wikiadmin and wikiuser sleeping connections to make way for new ones.

7. We stopped jobrunners.

8. Things recovered.

Observations and questions:

1. Batching the LinksUpdate DELETE and UPDATE queries would help with replag.

2. The storm of wikiuser traffic after the jobs was due to cache invalidation, and presumably involved a lot of duplicated effort. Could that be mitigated in a layer above the DBs?

3. Can we throttle jobrunners more, or make them smarter in these situations?
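To make observations 1 and 3 concrete, here is a minimal sketch of batching deletes and pausing between batches while replication lag is high. It is not the MediaWiki implementation: the table and column names (links, from_id, target_id), the lag threshold, and the execute / get_max_lag callables are all hypothetical placeholders that a real jobrunner would supply.

```python
import time
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple

Statement = Tuple[str, List[int]]


def chunked(ids: Sequence[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of ids, each with at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def batched_delete_statements(
    from_id: int, target_ids: Sequence[int], batch_size: int
) -> Iterator[Statement]:
    """Build one parameterized DELETE per batch instead of one per row.

    The table and column names are hypothetical, for illustration only.
    """
    for batch in chunked(target_ids, batch_size):
        placeholders = ", ".join(["%s"] * len(batch))
        sql = (
            "DELETE FROM links WHERE from_id = %s "
            f"AND target_id IN ({placeholders})"
        )
        yield sql, [from_id, *batch]


def run_throttled(
    statements: Iterable[Statement],
    execute: Callable[[str, List[int]], None],
    get_max_lag: Callable[[], float],
    max_lag_seconds: float = 5.0,
    poll_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run each statement, waiting first while replica lag exceeds the limit.

    `execute` and `get_max_lag` are assumed to be provided by the caller,
    e.g. a DB cursor wrapper and a function returning the worst replica lag.
    """
    executed = 0
    for sql, params in statements:
        # Back off until the replicas have caught up enough.
        while get_max_lag() > max_lag_seconds:
            sleep(poll_seconds)
        execute(sql, params)
        executed += 1
    return executed
```

The point of this design is that each batch is a single, bounded write, and the lag check between batches gives the slaves a chance to catch up before the next write arrives, rather than letting writes pile up in waves.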

Actionables