Jump to content

Incidents/2020-09-09 mobileapps config change

From Wikitech
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

document status: in-review

Summary

While reconfiguring all services to use our service proxy middleware to make remote procedure calls, a faulty configuration was deployed by yours truly for mobileapps at 08:40. This caused mobileapps to create mobile-html content with broken css and js links for pages regenerated during the day.

The issue was reported at 16:40 and the issue was quickly reverted. Then we needed a few hours to actually clear all the caching layers (RESTBase, edge caches). All pages affected were purged by 20:20.

Actionables

  • The biggest actionable is of course to always wait for validation from service owners before merging a patch - and the whole outage would've been avoided if that was done. Anything else listed here is purely a second-order actionable.
  • While this deployment was the result of bad judgement, SRE need to be able to deploy a configuration change with confidence. The fact that the mobileapps spec tests all passed in staging lulled SRE into a false sense of security. The OpenAPI spec should be extended to include a test for the aforementioned URLs. (TODO: create task)
  • We need staging to become a functional environment where we can test more than just a swagger spec test. Maybe linking it to restbase-dev, and making it possible to compare results of urls with production would help (TODO: create task)