Incidents/2018-04-23 wdqs

From Wikitech

Summary

The Wikidata Query Service experienced slowdowns and timeouts for ~30 minutes.

Timeline

  • 7:30 UTC: wdqs.svc.eqiad.wmnet starts experiencing slowdowns (the 95th percentile response time reaches the 1 minute timeout)
  • 7:39 UTC: icinga alert: LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
  • 7:41 UTC: icinga recovery: LVS HTTP IPv4 on wdqs.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 434 bytes in 0.029 second response time
  • 8:08 UTC: restart of wdqs1003, situation back to normal

Conclusions

During the slowdown, we see multiple clients being aggressively throttled. This seems to be a consequence of the slowdown, not a cause. While we do have clients who don't back off when receiving HTTP 429, they seem to be blocked correctly.
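For reference, a well-behaved client reacts to HTTP 429 by waiting before it retries. The following is a minimal sketch of that behaviour, not the throttling logic WDQS itself uses. It assumes the server may send a `Retry-After` header expressed in seconds. The names `backoff_delay` and `fetch_with_backoff`, the `send` callable and the delay constants are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Mapping, Optional, Tuple

# Hypothetical defaults, not values taken from the WDQS throttling configuration.
BASE_DELAY_S = 1.0
MAX_DELAY_S = 60.0

# (status code, response headers, body) as returned by the caller-supplied send().
Response = Tuple[int, Mapping[str, str], bytes]


def backoff_delay(attempt: int, retry_after: Optional[str]) -> float:
    """Return how many seconds to wait before retrying after an HTTP 429."""
    # Prefer the server's Retry-After value when it is a plain number of seconds.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise fall back to capped exponential backoff: 1s, 2s, 4s, ...
    return min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))


def fetch_with_backoff(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send() until it returns something other than 429 or attempts run out."""
    response = send()
    for attempt in range(max_attempts - 1):
        status, headers, _body = response
        if status != 429:
            break
        # Assumes headers uses the exact key "Retry-After" (plain dict lookup).
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
        response = send()
    return response
```

Here `send` stands for whatever function actually issues the query, so the retry policy stays independent of the HTTP library in use. If the last attempt still returns 429, that response is handed back to the caller rather than retried indefinitely.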

GC logs show almost no GC activity after 7:28 UTC, which indicates that, for once, the issue is somewhere else.

Actionables

  • investigate the cause of the slowdown of wdqs1003 phab:T192759
  • create a dedicated WDQS cluster for internal traffic phab:T178492 (mostly done, still need to move clients to use this new cluster)