Incidents/20151215-swift-syslog


Summary

An rsyslog config change was merged and took down swift, causing a partial public outage of image fetches from upload.wikimedia.org. According to our client response-code graphs, about 2% of those image requests failed with 503 errors for about 15 minutes; the rest succeeded, presumably because of cache hits in varnish. dzahn restarted the swift proxies and the service recovered.

upload-5xx graph vs total from this incident: https://grafana.wikimedia.org/dashboard/db/varnish-aggregate-client-status-codes?from=1450205993052&to=1450209627883&var-cache_type=upload&var-status_type=5&var-site=All

Timeline

19:08 < grrrit-wm> (CR) Ottomata: [C: 2] Increase size of programname field in remote syslog template [puppet] - https://gerrit.wikimedia.org/r/259271 (https://phabricator.wikimedia.org/T120874) (owner: Ottomata)
19:08 < ottomata> !log merged change to allow longer programnames in remote rsyslog config.
19:19 < icinga-wm> PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:21 < icinga-wm> PROBLEM - Swift HTTP backend on ms-fe1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:25 < icinga-wm> PROBLEM - Swift HTTP backend on ms-fe1003 is CRITICAL: Connection timed out
19:26 < icinga-wm> PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:27 < icinga-wm> PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:29 < icinga-wm> PROBLEM - Swift HTTP frontend on ms-fe1003 is CRITICAL: Connection timed out
19:30 < icinga-wm> RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 15119 bytes in 0.082 second response time
19:30 < icinga-wm> PROBLEM - Swift HTTP backend on ms-fe1004 is CRITICAL: Connection timed out
19:31 < icinga-wm> PROBLEM - Swift HTTP frontend on ms-fe1004 is CRITICAL: Connection timed out
19:37 < mutante> !log ms-fe1004, swift-proxy-server stop/start
19:37 < icinga-wm> RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.061 second response time
19:39 < icinga-wm> RECOVERY - Swift HTTP backend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.015 second response time
19:39 < icinga-wm> RECOVERY - Swift HTTP frontend on ms-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.006 second response time
19:39 < icinga-wm> RECOVERY - Swift HTTP frontend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 185 bytes in 0.004 second response time
19:39 < mutante> !log ms-fe1001 thru ms-fe1003: swift-proxy-server stop/start
19:39 < icinga-wm> RECOVERY - Swift HTTP backend on ms-fe1002 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.014 second response time
19:39 < icinga-wm> RECOVERY - Swift HTTP backend on ms-fe1003 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.020 second response time
19:40 < icinga-wm> RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.016 second response time
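
As a rough cross-check on the timeline, per-check alert durations can be derived from the IRC log itself. The following is a minimal sketch, not tooling that was used during the incident: `LINE_RE` and `alert_durations` are hypothetical names, the sample contains only four lines copied from the log above, and the code assumes all timestamps fall on the same day.

```python
import re
from datetime import datetime

# Four lines copied verbatim from the timeline above (not the full log).
LOG = """\
19:19 < icinga-wm> PROBLEM - Swift HTTP backend on ms-fe1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:27 < icinga-wm> PROBLEM - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
19:37 < icinga-wm> RECOVERY - LVS HTTP IPv4 on ms-fe.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.061 second response time
19:40 < icinga-wm> RECOVERY - Swift HTTP backend on ms-fe1001 is OK: HTTP OK: HTTP/1.1 200 OK - 391 bytes in 0.016 second response time
"""

# Hypothetical pattern: timestamp, PROBLEM/RECOVERY, then the check name
# up to the " is CRITICAL:" or " is OK:" status marker.
LINE_RE = re.compile(
    r"^(\d\d:\d\d) < icinga-wm> (PROBLEM|RECOVERY) - (.+?) is (?:CRITICAL|OK):"
)


def alert_durations(log):
    """Return minutes between the first PROBLEM and the next RECOVERY per check."""
    opened = {}     # check name -> time of first PROBLEM
    durations = {}  # check name -> minutes until RECOVERY
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip !log entries and anything else that is not an alert
        when = datetime.strptime(match.group(1), "%H:%M")
        state, check = match.group(2), match.group(3)
        if state == "PROBLEM":
            opened.setdefault(check, when)
        elif check in opened:
            delta = when - opened.pop(check)
            durations[check] = int(delta.total_seconds() // 60)
    return durations


for check, minutes in alert_durations(LOG).items():
    print(f"{check}: {minutes} min")
```

Note that these are alert durations as seen by icinga, which is a different measure from the roughly 15 minutes of client-facing 503s reported in the summary.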

Conclusions

This is the 3rd time we've had the same outage - once a year in 2013, 2014, and now 2015:

Actionables

Same as the last two times, still not completed.