Incidents/2019-04-23 varnish

From Wikitech

Summary

A recurrence of the Varnish 'mailbox lag' problem seen in many previous incidents, this time affecting cp1083 and then cp1085.

Impact

Approximately 82k queries were lost; clients were served HTTP 503 responses instead. (source)

Detection

Automated monitoring -- Icinga alerts on traffic availability.

Timeline

All times in UTC.

  • 19:54 Varnish mailbox lag begins climbing on cp1083. OUTAGE BEGINS
  • 19:56 First Icinga alert for HTTP availability: "PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo"
  • 19:57 Varnish mailbox lag recovers on cp1083 but begins climbing on cp1085
  • 20:02 Varnish mailbox lag recovers on cp1085. OUTAGE ENDS
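
The durations implied by the timeline can be worked out with a short sketch like the one below. The timestamps and the 82k figure come from this report; the variable names are illustrative, and the per-second rate assumes the failed queries were spread evenly across the outage window, which the report does not state.

```python
from datetime import datetime

# Timeline entries copied from the report (all times UTC, 2019-04-23).
FMT = "%Y-%m-%d %H:%M"
outage_start = datetime.strptime("2019-04-23 19:54", FMT)  # lag begins on cp1083
first_alert = datetime.strptime("2019-04-23 19:56", FMT)   # first Icinga alert
outage_end = datetime.strptime("2019-04-23 20:02", FMT)    # lag recovers on cp1085

outage_minutes = (outage_end - outage_start).total_seconds() / 60
detection_minutes = (first_alert - outage_start).total_seconds() / 60

# Assumption: the ~82k failed queries from the Impact section were spread
# evenly across the outage window; the real distribution is not in the report.
approx_failed_queries = 82_000
avg_failed_per_second = approx_failed_queries / (outage_minutes * 60)

print(f"outage duration: {outage_minutes:.0f} min")
print(f"time to first alert: {detection_minutes:.0f} min")
print(f"average failed queries/s: {avg_failed_per_second:.0f}")
```

Under these assumptions, the user-visible outage lasted about 8 minutes and the first alert fired about 2 minutes after mailbox lag began climbing.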

Graphs: Mailbox lag, HTTP availability

Conclusions

See Incident_documentation/20190416-varnish#Conclusions

Actionables

See Incident_documentation/20190416-varnish#Actionables