Incidents/20151118-LVS-PyBal

From Wikitech

Summary

PyBal was stopped on the primary eqiad LVS servers (lvs100[123]) for maintenance, with the expectation that traffic would shift to the backup servers (lvs100[456]) without user impact. Some services were not operating correctly on the backups, causing a partial outage. PyBal was restarted on the primaries about 5 minutes later, ending the outage.

The impact during the window was limited. Most service traffic moved successfully, and only a few specific service+proto+port combinations failed. The main public-facing services affected were:

Service       Protocol   Port
text-lb       IPv4       80
mobile-lb     IPv6       443
misc-web-lb   IPv4       443

Because most of our traffic is IPv4 on port 443 (HTTPS) for all services, the impact on the text and mobile services was limited: text lost only its HTTP->HTTPS redirects, and mobile lost only its IPv6 users. misc-web was affected for most users, denying access to services such as Phabricator, Gerrit, and Racktables.

Traffic graphs of the primary clusters in eqiad: https://phabricator.wikimedia.org/F2972690

Timeline

  • 13:56 - pybal stopped on lvs100[123] by bblack
  • 14:00 - first user report on IRC: "< aude> did someone kill phabricator?"
  • 14:01 - first automated report on IRC: "< icinga-wm> PROBLEM - LVS HTTP IPv4 on text-lb.eqiad.wikimedia.org is CRITICAL: Connection refused"
  • 14:01 - pybal restarted on lvs100[123] by bblack
  • 14:02 - traffic restored to normal

Conclusions

Because the same PyBal failover, to the same software version, had succeeded at 3 other datacenters the day before, confidence was too high and too little pre-flight checking was done. The state of lvs100[456] should have been verified more thoroughly before maintenance began. The problems were visible before the outage in the output of "ipvsadm -Ln" and in the PyBal service logs in "journalctl", had either been examined in depth.
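
As an illustration of the kind of pre-flight check described above, the following is a minimal sketch that parses captured "ipvsadm -Ln" output and reports expected services that are either missing or have no real servers behind them. It is not the procedure used during this incident. The function names, the sample output, and every address in it (documentation ranges 192.0.2.0/24 and 2001:db8::/32) are hypothetical, and the parser assumes the usual layout in which each "TCP"/"UDP" service line is followed by "->" lines for its real servers.

```python
def parse_ipvsadm(output):
    """Map (protocol, 'address:port') to the list of real servers behind it.

    Assumes the usual `ipvsadm -Ln` layout: a service line starting with
    TCP or UDP, followed by zero or more '->' lines for its real servers.
    """
    services = {}
    current = None
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] in ("TCP", "UDP"):
            current = (fields[0], fields[1])
            services[current] = []
        elif fields[0] == "->" and current is not None:
            # Skip the column-header line if it ever appears after a service.
            if fields[1] != "RemoteAddress:Port":
                services[current].append(fields[1])
    return services


def unhealthy_services(expected, services):
    """Return expected services that are absent or have no real servers."""
    return [svc for svc in expected if not services.get(svc)]


# Hypothetical captured output; addresses are from documentation ranges.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.0.2.10:443 wrr
  -> 10.0.0.1:443                 Route   10     0          0
  -> 10.0.0.2:443                 Route   10     0          0
TCP  192.0.2.10:80 wrr
TCP  [2001:db8::10]:80 wrr
  -> [2001:db8::a]:80             Route   10     0          0
"""

# Hypothetical list of service+proto+port combinations that must be live.
EXPECTED = [
    ("TCP", "192.0.2.10:443"),
    ("TCP", "192.0.2.10:80"),        # present, but no real servers
    ("TCP", "[2001:db8::10]:443"),   # not configured at all
]

if __name__ == "__main__":
    table = parse_ipvsadm(SAMPLE)
    for proto, addr in unhealthy_services(EXPECTED, table):
        print(f"NOT READY: {proto} {addr}")
```

Run against the backup servers before stopping PyBal on the primaries, a check like this would flag any service+proto+port combination that the backups are not ready to carry. It complements, rather than replaces, reading the PyBal logs.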

The actual technical issues are due to bugs in PyBal.

Actionables