
Incidents/20160319-Ores

From Wikitech

Summary

ORES was down or responding slowly for about 2 hours on 2016-03-19.

Timeline

  • 1930 UTC: New deployment begins
  • 2005 UTC: ORES begins to be overloaded
  • 2025 UTC: A problem with old Jessie installs is discovered (Phab:T130463). It turns out to be a pip versioning issue: https://github.com/pypa/pip/issues/214
  • 2130 UTC: A new cluster is built and requests are being served at the rate that they come in
  • 2300 UTC: A new cluster configuration is complete.

Conclusions

  1. Pip does not remove old versions when installing new wheels. Old versions will need to be removed manually.
  2. Our precaching utility backs up during a short outage and unleashes its backlog of requests on the service when it comes back online.
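One way to check for the situation in the first conclusion is to look for more than one metadata directory for the same package. The following is a minimal sketch, not the tooling used during the incident. It assumes wheels leave behind `name-version.dist-info` directories in a single install directory. The function name `find_duplicate_installs` and the package name `examplepkg` are hypothetical.

```python
import re
import tempfile
from collections import defaultdict
from pathlib import Path

SUFFIX = ".dist-info"


def normalize(name):
    """Normalize a project name so that '-', '_' and '.' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def find_duplicate_installs(site_packages):
    """Return {package: [versions]} for packages with more than one
    *.dist-info directory in the given install directory."""
    versions = defaultdict(set)
    for entry in Path(site_packages).iterdir():
        if not entry.is_dir() or not entry.name.endswith(SUFFIX):
            continue
        stem = entry.name[: -len(SUFFIX)]
        # Directory names look like "<name>-<version>"; split on the last "-".
        name, sep, version = stem.rpartition("-")
        if not sep:
            continue
        versions[normalize(name)].add(version)
    # Versions are sorted as plain strings, which is enough for reporting.
    return {n: sorted(v) for n, v in versions.items() if len(v) > 1}


if __name__ == "__main__":
    # Demonstration on a throwaway directory with made-up package names.
    with tempfile.TemporaryDirectory() as tmp:
        for d in ("examplepkg-0.8.1.dist-info",
                  "examplepkg-0.9.0.dist-info",
                  "otherpkg-1.0.0.dist-info"):
            (Path(tmp) / d).mkdir()
        for name, found in find_duplicate_installs(tmp).items():
            print(f"{name}: multiple versions present: {', '.join(found)}")
```

A check like this could run after each deployment, before traffic is switched over, so that leftover versions are caught before they are served.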

Actionables