
Incidents/2021-02-01 swift-codfw

From Wikitech

document status: in-review

Summary

On 2021-02-01 at 11:41 the Icinga check for `ms-fe.svc.codfw.wmnet` timed out and paged. It recovered three minutes later:

11:41 -icinga-wm:#wikimedia-operations- PROBLEM - LVS swift-https codfw port 443/tcp - Swift/Ceph media 
          storage IPv4 #page on ms-fe.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 
          seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems
...
11:44 -icinga-wm:#wikimedia-operations- RECOVERY - LVS swift-https codfw port 443/tcp - Swift/Ceph media 
          storage IPv4 #page on ms-fe.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 396 bytes in 
          0.140 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems

Users served from codfw, ulsfo and eqsin experienced about 15 minutes of higher latency (and possibly timeouts) for hit-local and miss requests. These account for 10-25% of a site's requests, depending on the site.

Screenshot from https://grafana.wikimedia.org/d/8T2XA-5Gz/frontend-ats-tls-ttfb-latency

Specifically, requests to /monitoring/backend timed out, which in turn meant that one or more of the backend servers where the monitoring container lives were slow or unhealthy. Case in point: ms-be2033.codfw.wmnet was reported as slow in /var/log/swift/server.log on, for example, ms-fe2006.codfw.wmnet:

Feb  1 11:44:29 ms-fe2006 proxy-server: ERROR with Object server 10.192.16.15:6000/sdk1 re: Trying to GET /v1/AUTH_mw/monitoring/backend: Timeout (10.0s) (txn: txe96767fb630b4828af04a-006017e993) (client_ip: 208.80.154.88)
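
To see which backends are timing out across a frontend's log, lines like the one above can be tallied per object server and device. The following is a minimal sketch, assuming every relevant line follows the format of the sample above; the regular expression and the function name `count_timeouts` are illustrative and not part of Swift's own tooling.

```python
import re
from collections import Counter

# Pattern modelled on the sample proxy-server line above. This is an
# assumption: other Swift versions or log configurations may differ.
TIMEOUT_RE = re.compile(
    r"ERROR with Object server (?P<ip>[\d.]+):(?P<port>\d+)/(?P<device>\S+) "
    r"re: Trying to (?P<method>[A-Z]+) (?P<path>\S+): "
    r"Timeout \((?P<secs>[\d.]+)s\)"
)


def count_timeouts(lines):
    """Count timeout errors per (backend IP, device) pair."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[(match["ip"], match["device"])] += 1
    return counts


# The log line quoted in this report, used as sample input.
sample = (
    "Feb  1 11:44:29 ms-fe2006 proxy-server: ERROR with Object server "
    "10.192.16.15:6000/sdk1 re: Trying to GET /v1/AUTH_mw/monitoring/backend: "
    "Timeout (10.0s) (txn: txe96767fb630b4828af04a-006017e993) "
    "(client_ip: 208.80.154.88)"
)
print(count_timeouts([sample]))
```

Fed with the lines of /var/log/swift/server.log, the resulting counter would show whether timeouts cluster on a single backend or disk, or are spread across the cluster.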

The slowness was induced by an earlier swift rebalance (bug T272837). The way we currently do rebalances means that such operations are generally noisy and impactful to the cluster (e.g. bug T221904, bug T271415). Swift has been depooled internally from its discovery record (essentially anticipating bug T267338).


Actionables

  • Change /monitoring/backend to /monitoring/frontend (i.e. check the frontend itself) for the Icinga service check and PyBal's ProxyFetch bug T273453
  • Consider depooling swift's discovery records during rebalances bug T273453
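
The first actionable amounts to changing the path that the health check requests. As a rough illustration, assuming a plain HTTPS GET with the same 10-second timeout seen in the Icinga alert, a probe could look like the sketch below. The names `check_url` and `probe`, the URL layout and the status strings are hypothetical, and this is not how Icinga or PyBal implement their checks.

```python
import urllib.request


def check_url(host, target="frontend"):
    """Build the monitoring URL (assumed layout); target is 'frontend' or 'backend'."""
    return f"https://{host}/monitoring/{target}"


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return ('OK' or 'CRITICAL', detail) for a single GET of url.

    `opener` is injectable so the logic can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return "CRITICAL", f"request failed: {exc}"
    if status == 200:
        return "OK", f"HTTP {status}"
    return "CRITICAL", f"HTTP {status}"
```

Calling `probe(check_url("ms-fe.svc.codfw.wmnet"))` would then exercise only the frontend, so a slow backend holding the monitoring container would no longer page on its own.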