Incidents/20150901-Elasticsearch

From Wikitech
Summary

The Elasticsearch service (on the elastic*.eqiad.wmnet nodes) that backs the search functionality went red for a few minutes. No real data was lost, but we failed to serve some searches for about 10 minutes.

Timeline

  • 05:28: dcausse pauses writes ahead of the firewall rules being applied to the master (elastic1001)
  • 05:32: chasemp applies the rules
  • 05:32: the master starts to lose track of its nodes
  • 05:33: cluster is red
  • 05:33: chasemp reverts the rules
  • 05:34: cluster starts to recover
  • 05:39: cluster is back to yellow
  • 05:48: there is a 10-minute spike of "Pool errors"; dcausse and chasemp test some queries on enwiki, and all of them work
  • 07:58: cluster is back to green
  • 08:00: dcausse unfreezes the indices
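
The red, yellow and green states in the timeline are the status values reported by Elasticsearch's cluster health API (`_cluster/health`). As a minimal sketch of how that status can be read and interpreted, the snippet below fetches the health document and summarizes it. The URL is an assumption (the report does not say which host or port was queried), and `fetch_health` and `summarize` are hypothetical helper names, not tools used during the incident.

```python
import json
import urllib.request

# Assumed endpoint: the report does not state which host or port was queried.
HEALTH_URL = "http://localhost:9200/_cluster/health"

# Meaning of each status value returned by the cluster health API.
MEANING = {
    "green": "all primary and replica shards are allocated",
    "yellow": "all primaries are allocated, some replicas are not",
    "red": "at least one primary shard is not allocated",
}


def fetch_health(url=HEALTH_URL, timeout=5):
    """Fetch the cluster health document and return it as a dict."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def summarize(health):
    """Turn a cluster health dict into a one-line, human-readable summary."""
    status = health.get("status", "unknown")
    meaning = MEANING.get(status, "status not recognized")
    unassigned = health.get("unassigned_shards", "?")
    return f"{status}: {meaning} (unassigned shards: {unassigned})"
```

Calling `summarize(fetch_health())` against a reachable node would print one such line; run in a loop, it gives a rough view of a recovery like the one between 05:34 and 07:58.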

Conclusions

Actionables