
Incidents/2019-06-13 wdqs

From Wikitech

Summary

From ~15:10 UTC to ~15:50 UTC on June 13, the public WDQS endpoint in eqiad was overloaded by a bot to the point where it was not serving user queries. There is no reason to think that the bot was malicious. As a mitigation, the python-requests user agent is temporarily banned from accessing WDQS, consistent with our user agent policy.

Impact

The WDQS public endpoint in eqiad was unavailable from ~15:25 to ~15:45 UTC.

The python-requests user agent is still banned; we are waiting to implement a gentler solution before lifting the ban.

The internal WDQS endpoint was not impacted.

Detection

The problem was detected by the Icinga LVS probe.

Timeline

All times in UTC.

  • 15:10: load starts to increase on the public wdqs eqiad cluster
  • 15:31: Icinga LVS alert for wdqs.svc.eqiad.wmnet

Conclusions

  • identifying and throttling bots is a hard problem
  • we need to take more drastic action to protect the stability of the service (aggressively throttle generic user agents)
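To make the second conclusion concrete, here is a minimal sketch of per-user-agent throttling with a token bucket, where generic user agents get a much smaller budget than descriptive ones. It is not the throttling logic WDQS actually uses; the class names, the list of generic prefixes, and all rates and burst sizes are assumptions chosen for illustration.

```python
import time

# Hypothetical list of prefixes treated as "generic" user agents.
GENERIC_PREFIXES = ("python-requests", "curl", "java")


class TokenBucket:
    """Classic token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start with a full bucket
        self.updated = now

    def allow(self, now):
        # Refill according to the time elapsed since the last call.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class UserAgentThrottle:
    """Keeps one bucket per user agent string; generic agents get a tighter budget."""

    def __init__(self, generic_rate=1.0, generic_burst=5,
                 default_rate=10.0, default_burst=50):
        self.generic = (generic_rate, generic_burst)
        self.default = (default_rate, default_burst)
        self.buckets = {}

    @staticmethod
    def is_generic(user_agent):
        ua = (user_agent or "").strip().lower()
        # An empty user agent is treated as generic too.
        return ua == "" or ua.startswith(GENERIC_PREFIXES)

    def allow(self, user_agent, now=None):
        now = time.monotonic() if now is None else now
        key = (user_agent or "").strip().lower()
        bucket = self.buckets.get(key)
        if bucket is None:
            rate, burst = self.generic if self.is_generic(key) else self.default
            bucket = TokenBucket(rate, burst, now)
            self.buckets[key] = bucket
        return bucket.allow(now)
```

A caller would ask `throttle.allow(request_user_agent)` for each incoming request and reject the request (for example with an HTTP 429 response) when it returns `False`. Passing `now` explicitly is only a convenience for deterministic testing.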

What went well?

  • problem was detected automatically in a timely manner
  • good collaboration and clear communication between

What went poorly?

  • while we do have logic to throttle abusive bots, this throttling was not sufficient to protect the service
  • we are still banning python-requests as a user agent, which affects a number of bots
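For bot operators affected by the ban, one common remedy is to send a descriptive user agent instead of the HTTP library's default. The sketch below builds such a string in the general shape "client/version (contact) library/version"; the helper name, tool name, contact address, and version numbers are all hypothetical. Whether a given string passes the ban described here depends on how the ban matches user agents, which this report does not specify.

```python
def descriptive_user_agent(tool, version, contact, library=None):
    """Build a user agent string that identifies the tool and its operator.

    All arguments are supplied by the bot operator; nothing here is
    specific to WDQS.
    """
    ua = f"{tool}/{version} ({contact})"
    if library:
        # Optionally append the underlying HTTP library, e.g. "python-requests/2.21.0".
        ua = f"{ua} {library}"
    return ua


# Hypothetical example values for an imaginary bot.
headers = {
    "User-Agent": descriptive_user_agent(
        "ExampleQueryBot", "0.1", "https://example.org/bot; bot@example.org"
    )
}
```

The resulting `headers` dictionary can be passed to whichever HTTP client the bot uses, so that operators of the service can identify and contact the bot's owner rather than blocking a whole class of clients.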

Where did we get lucky?

  • This happened during the SRE offsite, when most SREs are in the same timezone. Luckily this wasn't when all of them were sleeping!

Actionables