Incidents/20150901-Elasticsearch

Summary

The Elasticsearch service (on the elastic*.eqiad.wmnet nodes) backing the search functionality went red for a few minutes. We didn't lose any real data, but we failed to serve some searches during a 10-minute period.

Timeline

05:28: dcausse pauses writes before the firewall rules are applied to the master (elastic1001)
05:32: chasemp applies the rules
05:32: master starts to lose track of its nodes
05:33: cluster is red
05:33: chasemp reverts the rules
05:34: cluster starts to recover
05:39: cluster is back to yellow
05:48: there's a 10-minute spike of "Pool errors"; dcausse and chasemp test some queries on enwiki and they all work
07:58: cluster is back to green
08:00: dcausse unfreezes the indices
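
As a rough cross-check of the durations implied by the timeline, the sketch below computes how long the cluster spent in each state. The timestamps are copied from the entries above; the dictionary keys and the minutes_between helper are illustrative names, not part of any tooling used during the incident. The report does not state a timezone, so plain clock times are used.

```python
from datetime import datetime

# Clock times copied from the timeline; key names are illustrative only.
EVENTS = {
    "writes_paused": "05:28",
    "red": "05:33",
    "yellow": "05:39",
    "green": "07:58",
    "writes_resumed": "08:00",
}


def minutes_between(start, end):
    """Return the whole minutes elapsed between two named timeline events."""
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return int(delta.total_seconds() // 60)


durations = {
    "red": minutes_between("red", "yellow"),
    "yellow": minutes_between("yellow", "green"),
    "writes_paused": minutes_between("writes_paused", "writes_resumed"),
}

for state, minutes in durations.items():
    print(f"{state}: {minutes} min")
```

Under these timestamps the cluster was red for 6 minutes and yellow for 139 minutes, and writes were paused for 152 minutes in total.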

Conclusions

https://phabricator.wikimedia.org/T104962#1594537

Actionables

https://phabricator.wikimedia.org/T104962#1594537