Incidents/20150709-poolcounter

From Wikitech
Summary

There was an 8-minute outage of the API at eqiad, from 2015-07-09 17:26:35 to 2015-07-09 17:34:15, caused by scheduled maintenance and an unforeseen dependency. helium was powered down for https://phabricator.wikimedia.org/T84770; however, helium is also a poolcounter machine. MediaWiki waits for a 0.5-second timeout before falling back to the next poolcounter server in line, which is too high.
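
The fallback behaviour described above can be illustrated with a minimal sketch. This is not MediaWiki's actual poolcounter client; the function name, the injectable `connect` parameter, and the host names and port in the usage note are assumptions made for illustration only. It shows a client that tries each server in order and moves to the next one only after the connection attempt fails or times out.

```python
import socket
import time


def connect_with_fallback(servers, timeout=0.5, connect=socket.create_connection):
    """Try each (host, port) pair in order.

    Returns (connection, seconds_spent). The name and signature are
    hypothetical; only the 0.5 s default comes from the incident summary.
    """
    start = time.monotonic()
    for address in servers:
        try:
            # Blocks for up to `timeout` seconds if the host does not answer.
            conn = connect(address, timeout)
            return conn, time.monotonic() - start
        except OSError:
            # Covers refused connections and timeouts; try the next server.
            continue
    raise ConnectionError("no poolcounter server reachable")
```

With a powered-down first server, a call such as `connect_with_fallback([("first.example", 7531), ("second.example", 7531)])` (hypothetical hosts and port) could spend up to the full timeout on the first entry before reaching the second. If every request pays that delay, workers stay busy longer, which would be consistent with the HHVM queue alerts in the timeline.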

Timeline

17:22 cmjohnson1: shutting down helium for a few minutes to move within the same row

17:26:35 icinga complains about api.svc.eqiad.wmnet. Before that, it had already complained about HHVM queue sizes on various mw hosts. mutante noticed that helium is a poolcounter host

17:31 ori merges https://gerrit.wikimedia.org/r/#/c/223838/, making mw1154 a poolcounter server and effectively bypassing helium and potassium. Recoveries start coming in

17:34 icinga declares api.svc.eqiad.wmnet OK.

Actionables