Incidents/2017-11-30 wdqs


Summary

Around 14:55 UTC, wdqs1004 was caught in a GC death spiral and froze. It recovered after a restart of blazegraph.

Timeline

14:55 UTC: slowdown in updates can be observed for wdqs1004
15:15 UTC: icinga alert: LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
15:15 UTC: icinga recovery
15:19 UTC: restart of blazegraph on wdqs1004

Conclusions

Looking at GC logs, I can see heap allocation peaking at 17 GB/s. This looks related to the traffic received. Much more investigation will be needed to get to the bottom of this.
Looking at throttled requests during that period, I can see that most requests are coming from user agent "MediaWiki/1.31.0-wmf.10". This is a surprise to me.
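
For readers unfamiliar with how an allocation rate is derived from GC logs, here is a minimal sketch. It assumes the log has already been parsed into (timestamp, heap before GC, heap after GC) tuples; the function name, the tuple layout, and the sample numbers are all hypothetical and are not taken from the wdqs1004 logs.

```python
GIB = 1024 ** 3

def allocation_rates(events):
    """Estimate the allocation rate between consecutive GC events.

    events: list of (timestamp_seconds, heap_before_bytes, heap_after_bytes),
    sorted by timestamp. Hypothetical layout, assumed to be parsed already.
    Returns a list of (timestamp_seconds, bytes_per_second).
    """
    rates = []
    for prev, cur in zip(events, events[1:]):
        elapsed = cur[0] - prev[0]
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order timestamps
        # Everything allocated since the previous collection finished:
        # heap occupancy before this GC minus occupancy after the last one.
        allocated = cur[1] - prev[2]
        rates.append((cur[0], allocated / elapsed))
    return rates

# Made-up sample data, for illustration only.
sample = [
    (0.0, 10 * GIB, 2 * GIB),
    (1.0, 12 * GIB, 3 * GIB),
    (1.5, 11 * GIB, 4 * GIB),
]
peak = max(rate for _, rate in allocation_rates(sample))
print(f"peak allocation rate: {peak / GIB:.1f} GiB/s")
```

A sustained rate close to what the collector can reclaim leaves little headroom, which is consistent with the death spiral described in the summary.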

Actionables

modify the local icinga checks to use the same check as LVS, which runs a real query rather than just requesting a dummy page phab:T181989
new wdqs cluster, dedicated to synchronous and trusted traffic phab:T178492 (this is a goal of search backend for next quarter)
investigate memory allocation on blazegraph phab:T181988
investigate and document clients of wdqs; a tracking page has been created.
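
The first actionable, a check that runs a real query instead of fetching a dummy page, could look roughly like the sketch below. This is not the actual icinga or LVS check: the `ASK` query, the function names, and the endpoint URL in the usage note are assumptions made for illustration. The HTTP call is injectable so the logic can be exercised without a live endpoint.

```python
import json
import urllib.parse
import urllib.request

# Assumed probe query: cheap, but it still has to be answered by the query engine.
SPARQL_QUERY = "ASK { ?s ?p ?o }"

def build_check_url(base_url, query=SPARQL_QUERY):
    """Build the GET URL for the probe query."""
    return base_url + "?" + urllib.parse.urlencode({"query": query})

def fetch(url, timeout):
    """Default fetcher: returns (HTTP status, response body as bytes)."""
    req = urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read()

def check_wdqs(base_url, timeout=10.0, fetcher=fetch):
    """Return (ok, message) after running a real query against the endpoint."""
    try:
        status, body = fetcher(build_check_url(base_url), timeout)
    except Exception as exc:  # timeouts, connection errors, HTTP errors
        return False, f"CRITICAL - {exc}"
    if status != 200:
        return False, f"CRITICAL - HTTP {status}"
    try:
        result = json.loads(body)
    except ValueError:
        return False, "CRITICAL - response is not valid JSON"
    # An ASK query answers with {"head": {...}, "boolean": true|false}.
    if not isinstance(result, dict) or result.get("boolean") is not True:
        return False, "CRITICAL - query did not return the expected answer"
    return True, "OK - query answered"
```

Usage would be along the lines of `check_wdqs("http://localhost/sparql")`, where the URL is a placeholder. A frozen backend that still serves static pages would fail this check, because the query itself has to complete within the timeout.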