Jump to content

Incidents/2019-06-03 eqiad-port-saturation

From Wikitech
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Summary

We received a LibreNMS alert for port utilization over 80% on cr2-eqiad.wikimedia.org in xe-3/3/3 interface.

After some investigation, we identified a single client running in AWS performing multiple requests for media content using "User-Agent: python-requests".

Impact

No user-facing impact.

Detection

LibreNMS alert on #-operations

Timeline

(In UTC)

  • 12:23: <librenms-wmf> Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%
  • 13:22: faidon@weblog1001:/srv/log/webrequest$ head -n 2560000 sampled-1000.json | jq -r '.ip + " " + (.response_size | tostring)' | awk '{ sum[$1] += $2 } END { for (ip in sum) print sum[ip],ip }' |sort -nr | head -10
  • 13:35: cdanis opens abuse report with AWS abuse team
  • 13:50: crawler begins scraping larger objects again, network usage increases to max
  • 14:00: ema merges https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/514005/ to block their User-Agent (also unavoidably blocking some legitimate traffic)
  • 14:01: ema runs puppet on cp1084
  • 14:05: outbound network returns to normal

Actionables

  • phab:T224884 - Rate limit requests to cache_upload
  • phab:T224888 - Network port saturation should page
  • phab:T224891 - Return HTTP 403 to requests violating User-Agent policy
  • Begin enforcing our existing meta:User-Agent policy, after notifying community (TODO: file task). Some summary of discussion:
    • Ratelimit/block 'default' UAs, like "curl", "python-requests", "python-urllib2", etc.
      • Probably allow only so many reqs/sec from a given IP for these 'default' ones
    • First investigate how much traffic this would hurt
    • wikitech-l@, mediawiki-api-announce@, and m:Tech/News are reasonable venues for announcements