Incidents/20160509-CirrusSearch


Summary

At 21:43 UTC on 2016-05-09 Elasticsearch started to slow down (as seen on Grafana). Investigation showed high CPU consumption on elastic1026. The Elasticsearch service was restarted and response times went back to normal. Investigation traced the cause to a large segment merge on elastic1026 and the resulting runaway garbage collection.

Timeline

  • 2016-05-09T21:43 increase in overall response time (95%-ile) on elasticsearch
  • 2016-05-09T21:51 issue with search reported on IRC by odder
  • 2016-05-09T22:14 elasticsearch service restarted on elastic1026
  • 2016-05-09T22:20 response time back to normal

Analysis

  • more details on Phabricator
  • We did not get an alert via Icinga. There is currently a check on response time, but it runs against prefix search, which now has a low volume of traffic. This again shows the fragility of Graphite checks (a sketch of such a check is included after this list).
  • Analysis of GC timings indicates that time was spent mainly in young-generation GC and that memory was successfully reclaimed. This is usually an indication of excessive memory allocation throughput, not of a memory leak or a too-small heap (see the GC-log sketch after this list).
  • GC is strongly correlated with a large segment merge on elastic1026. This does not explain why it was an issue only this time.
  • The Graphite currentAbove() function is a good tool to identify which server is under the most load (example below).
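
As an illustration of the kind of check mentioned above, the sketch below polls the Graphite render API for the latest value of a latency metric and compares it against a threshold. The metric path, threshold and Graphite URL are assumptions for illustration; this is not the actual Icinga check used in production.

  # Minimal latency check against the Graphite render API (sketch only).
  # The metric name and threshold below are assumptions, not production values.
  import requests

  GRAPHITE = 'https://graphite.wikimedia.org/render'
  TARGET = 'MediaWiki.CirrusSearch.requestTime.p95'  # hypothetical metric path
  THRESHOLD_MS = 500                                 # hypothetical threshold

  def latest_value():
      """Return the most recent non-null datapoint for TARGET, or None."""
      resp = requests.get(GRAPHITE, params={'target': TARGET,
                                            'format': 'json',
                                            'from': '-15min'})
      resp.raise_for_status()
      points = [v for v, _ts in resp.json()[0]['datapoints'] if v is not None]
      return points[-1] if points else None

  value = latest_value()
  if value is None or value > THRESHOLD_MS:
      print('CRITICAL: 95%%-ile search latency is %s ms' % value)
  else:
      print('OK: 95%%-ile search latency is %s ms' % value)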
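
To back the young-GC observation with numbers, the total pause time attributed to young collections can be summed from a node's GC log. The sketch below assumes CMS-style -XX:+PrintGCDetails output (ParNew entries) and a hypothetical log path; the exact log format and location on elastic1026 may differ.

  # Sum ParNew (young-generation) pause time from a JVM GC log (sketch only).
  # Assumes CMS-style -XX:+PrintGCDetails lines such as:
  #   [ParNew: 1747712K->27648K(1747712K), 0.0567890 secs]
  import re

  PAUSE_RE = re.compile(r'\[ParNew: .*?, (?P<secs>\d+\.\d+) secs\]')

  def young_gc_seconds(path):
      """Return the total ParNew pause time (in seconds) found in the log."""
      total = 0.0
      with open(path) as log:
          for line in log:
              match = PAUSE_RE.search(line)
              if match:
                  total += float(match.group('secs'))
      return total

  print(young_gc_seconds('/var/log/elasticsearch/gc.log'))  # hypothetical path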
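
For reference, currentAbove(seriesList, n) keeps only the series whose most recent value is above n, so a target along these lines (the metric path is an assumption) leaves only the overloaded hosts on the graph; in this incident that would have singled out elastic1026:

  currentAbove(servers.elastic10*.cpu.total.user, 80)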

Actionables