Jump to content

Incidents/2021-04-06 partial rack codfw

From Wikitech
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

document status: final

Summary

Around 17:00 UTC, a faulty network switch caused partial failure of one rack in the Codfw cluster. As 8 individual hosts become unreachable, this led to reduced redundancy or capacity for some of the affected services. The hosts were moved to new switch ports, and were unreachabe for about 1 hour.

  • Elastic search: Three elastic nodes down, mild impact to redundancy (yellow state).
  • Swift media storage: Three ms-be nodes down in the standby cluster, no user impact (should catch up once reachable).
  • Edge cache: One cp node unreachable for the Codfw PoP, automatically depooled by healthcheck.
  • DNS: One dns node unreachable, effectively depooled automatically per lack of BGP advertising.

Notes

https://wm-bot.wmflabs.org/logs/%23wikimedia-operations/20210406.txt

[17:01:22] <icinga-wm>	 PROBLEM - Host cp2036 is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:26] <icinga-wm>	 PROBLEM - Host elastic2045 is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:28] <icinga-wm>	 PROBLEM - Host ms-be2034 is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:28] <icinga-wm>	 PROBLEM - Host ms-be2035 is DOWN: PING CRITICAL - Packet loss = 100%
[17:01:56] <icinga-wm>	 PROBLEM - Host elastic2046 is DOWN: PING CRITICAL - Packet loss = 100%
[17:02:44] <icinga-wm>	 PROBLEM - Host elastic2047 is DOWN: PING CRITICAL - Packet loss = 100%
[17:03:14] <icinga-wm>	 PROBLEM - Host dns2001 is DOWN: PING CRITICAL - Packet loss = 100%
[17:03:48] <icinga-wm>	 PROBLEM - Host ms-be2055 is DOWN: PING CRITICAL - Packet loss = 100%
[17:13:19] <elukey>	 seems a rack down
[17:13:46] <elukey>	 yep C2 
…
[17:38:21] <elukey>	 papaul: so I can connect to cp2035, that is in the same rack
[17:53:24] <bblack>	 (cp2036 was pulled from service by external healthchecks, and dns2001 was pulled from traffic flow when it became unable to advertise BGP to routers)
[18:14:51] <bblack>	 !log dns2001 - re-enabling and running puppet agent to restore service
[18:18:15] <logmsgbot>	 !log bblack@cumin1001 conftool action : set/pooled=yes; selector: name=cp2036.codfw.wmnet

Actionables

Tracking task for the incident: https://phabricator.wikimedia.org/T279457