
Incidents/20140622-es1006

From Wikitech

Summary

There was a 50-second burst of 5xx responses, corresponding to a spike of "Too many connections" errors for es1006 in dberror.log.

Timeline

First error: Sun Jun 22 8:01:37 UTC 2014
Last error: Sun Jun 22 8:02:27 UTC 2014

Mean CPU utilization since the 12th is up around 90% compared to the previous ten-day period:

http://graphite.wikimedia.org/render/?target=servers.es1006.cpu.total.user.value&from=00%3A00_20140601&until=23%3A59_20140622&width=600&height=300
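As a minimal sketch (not part of the incident tooling), the comparison above can be reproduced against the Graphite render API by averaging the series over the two windows. The metric and host come from the URL above; the exact date ranges and the use of the Python requests library are assumptions for illustration.

  import requests

  GRAPHITE = "http://graphite.wikimedia.org/render/"
  TARGET = "servers.es1006.cpu.total.user.value"

  def mean_cpu(from_ts, until_ts):
      """Fetch the series as JSON and average the non-null datapoints."""
      resp = requests.get(GRAPHITE, params={
          "target": TARGET,
          "from": from_ts,
          "until": until_ts,
          "format": "json",
      })
      resp.raise_for_status()
      points = [v for v, _ in resp.json()[0]["datapoints"] if v is not None]
      return sum(points) / len(points)

  # Roughly the ten days before the jump vs. the ten days after it
  # (assumed windows bracketing the 12th).
  before = mean_cpu("00:00_20140601", "23:59_20140611")
  after = mean_cpu("00:00_20140612", "23:59_20140622")
  print(f"mean user CPU before: {before:.1f}, after: {after:.1f}")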

This is in fact a general load jump for external storage that has been causing similar glitches for some days. There is a corresponding jump, also starting on the 12th, on the S5 slaves (dewiki, wikidatawiki). None of the other shards show the pattern.

During IRC discussion a probable spike in Wikidata traffic was identified, mostly from Wikibase\Lib\Store\WikiPageEntityLookup::selectRevisionRow, which would also hit ES. Aude and Hoo investigated and found a latent Wikidata caching bug.
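The actual fix is the Gerrit change listed under Actionables. Purely as an illustration of why a latent caching bug would push this lookup traffic onto ES and S5, here is a hypothetical cache-aside sketch (not the real Wikibase code): if the cached value is never stored, or is stored under the wrong key or TTL, every lookup falls through to the revision select and hence to the database.

  import time

  class EntityRevisionLookup:
      """Hypothetical cache-aside lookup, illustrating the failure mode only."""

      def __init__(self, cache, db, ttl=300):
          self.cache = cache   # e.g. a dict of key -> (expiry, row), memcached-like
          self.db = db         # callable doing the expensive revision select
          self.ttl = ttl

      def get(self, entity_id):
          key = f"entity-revision:{entity_id}"
          hit = self.cache.get(key)
          if hit is not None and hit[0] > time.time():
              return hit[1]                # served from cache
          row = self.db(entity_id)         # falls through to ES/S5
          # A latent bug here (wrong key, ttl=0, or never storing the value)
          # means every request takes the database path above.
          self.cache[key] = (time.time() + self.ttl, row)
          return row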

Conclusions

Traffic increased on ES and S5. The probable cause was a latent Wikidata caching bug.

Actionables

  • Status: Done. An additional S5 slave has been deployed.
  • Status: Done. DB traffic sampling has been deployed to S5.
  • Status: Done. Aude and Hoo deployed https://gerrit.wikimedia.org/r/#/c/141997/