Incidents/2021-09-04 appserver latency

From Wikitech
document status: final

Summary

An increase in load on a database server made many queries much slower to respond. As a result, backend traffic occupied appserver php-fpm workers for much longer, and a proportion of those requests failed entirely because no workers were available. The failed requests received an error page with the message "upstream connect error or disconnect/reset before headers. reset reason: overflow".
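The link between slower queries and unavailable workers can be illustrated with Little's law (mean concurrent requests = arrival rate × mean time per request). The sketch below is a rough illustration only: the function names, the request rate, the latencies, and the pool size are all hypothetical and are not measurements from this incident. It also assumes requests are rejected rather than queued once the pool is full, which is a simplification.

```python
import math


def workers_needed(requests_per_second: float, mean_latency_seconds: float) -> int:
    """Little's law: mean number of busy workers = arrival rate * mean latency."""
    return math.ceil(requests_per_second * mean_latency_seconds)


def overflow_fraction(
    requests_per_second: float,
    mean_latency_seconds: float,
    worker_pool_size: int,
) -> float:
    """Crude estimate of the share of requests that find no free worker.

    Assumes no queueing: any demand beyond the pool size is rejected.
    """
    demand = requests_per_second * mean_latency_seconds
    if demand <= worker_pool_size:
        return 0.0
    return 1.0 - worker_pool_size / demand


# Hypothetical numbers, chosen only to show the effect of higher latency.
RATE = 100.0          # requests per second (assumed)
POOL = 300            # php-fpm workers available (assumed)
NORMAL_LATENCY = 0.25  # seconds per request when the database is healthy (assumed)
SLOW_LATENCY = 4.0     # seconds per request when queries are slow (assumed)

print(workers_needed(RATE, NORMAL_LATENCY))           # busy workers in the healthy case
print(workers_needed(RATE, SLOW_LATENCY))             # busy workers wanted in the slow case
print(overflow_fraction(RATE, SLOW_LATENCY, POOL))    # share of requests with no worker
```

With the same request rate, a sixteenfold increase in latency raises the number of workers wanted from well below the pool size to above it, which is the point at which some requests can no longer be served.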

Impact: For 37 minutes, backends were slow (taking several seconds to respond) and 2% of requests failed entirely. This affected logged-in users, most bots/API queries, and some page views from unregistered users for pages that were recently edited or otherwise expired from the CDN cache.

Documentation:

Actionables