Incidents/2018-03-14 ORES


Summary

We had a small, planned, user-facing outage during maintenance (kernel-upgrade reboots of the oresrdb Redis hosts).

Timeline

  • 14:18 UTC Aaron Halfaker and Alex discuss the need for a scheduled reboot of the oresrdb hosts for kernel upgrades. The decision is taken to proceed ASAP
  • 14:25 UTC Alex starts with the slaves in codfw and eqiad. No impact
  • 14:34 UTC Both slaves have caught up with their masters (see the replication check sketched after this timeline)
  • 14:34 UTC Alex starts the reboot of the master Redis host in codfw. Errors start
  • 14:35 UTC The reboot is done, but Redis is still loading the dataset into memory
  • 14:37 UTC oresrdb2001 has the dataset in memory; jobs can be submitted once more
  • 14:45 UTC After having gauged the effects in codfw, Alex starts the same process in eqiad
  • 14:47 UTC Icinga notices the outage in eqiad
  • 14:50 UTC Up and running again. Icinga notices the recovery
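
For reference, a minimal sketch of the kind of replication check used before rebooting a master, assuming the redis-py client; the host names are illustrative and not the actual oresrdb configuration:

  # Sketch only: verify a replica has caught up before taking down the master.
  # Assumes the redis-py client; host names below are hypothetical.
  import time
  import redis

  MASTER = redis.Redis(host="oresrdb-master.example", port=6379)
  REPLICA = redis.Redis(host="oresrdb-replica.example", port=6379)

  def replica_caught_up(master, replica):
      """Return True once the replica's offset matches the master's."""
      m = master.info("replication")
      r = replica.info("replication")
      return (
          r.get("master_link_status") == "up"
          and r.get("slave_repl_offset") == m.get("master_repl_offset")
      )

  # Poll until the replica is in sync; only then is it reasonably safe
  # to reboot the master (writes arriving meanwhile will still be lost
  # unless traffic is paused or failed over first).
  while not replica_caught_up(MASTER, REPLICA):
      time.sleep(1)
  print("Replica is in sync with the master.")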

Conclusions

We were unable to serve ~2500 requests in total in eqiad (~250 of them external) and ~6000 requests in codfw (~240 external). The reason is that the Redis host that acts both as a queue and as a cache is a single point of failure (SPOF), and this should be addressed.

Actionables

  • Use twemproxy for the cache Redis, at least so that a portion of requests can still be served during a downtime (Phab: TODO)
  • Figure out a way to make the queue Redis highly available (Phab: TODO); one possible approach is sketched below
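
As one possible (not decided) approach to the second actionable: Redis Sentinel can monitor the queue master and promote a replica on failure, with clients discovering the current master through Sentinel. The sketch below uses the redis-py Sentinel API; the Sentinel hosts, the "oresrdb" service name, and the queue key are hypothetical.

  # Sketch only: one way the queue Redis could be made highly available,
  # using Redis Sentinel with the redis-py client. All names are hypothetical.
  from redis.sentinel import Sentinel

  sentinel = Sentinel(
      [("sentinel1.example", 26379),
       ("sentinel2.example", 26379),
       ("sentinel3.example", 26379)],
      socket_timeout=0.5,
  )

  # Always resolve the current master through Sentinel, so a failover is
  # transparent to queue producers and consumers.
  master = sentinel.master_for("oresrdb", socket_timeout=0.5)
  master.lpush("ores:task-queue", "example-task")

  # Reads that tolerate slight staleness can go to a replica.
  replica = sentinel.slave_for("oresrdb", socket_timeout=0.5)
  print(replica.llen("ores:task-queue"))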