
Incidents/20151127-EventLogging

From Wikitech

Timeline

On the 27th of November

01:30 am UTC SQL insertion rate goes to zero. Topics that are fed from outside in Kafka continue to receive events; event-login-valid-mixed is receiving events, but not as many as it should

At the same time we see these errors in the eventlogging_processor log:

2015-11-27 01:31:09,663 (MainThread) Could not receive response to request [0000026b0000000000a0 ... 6b69223a2022656e77696b69227d] from server <KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: Kafka @ kafka1013.eqiad.wmnet:9092 went away
2015-11-27 01:31:09,664 (MainThread) Could not receive response to request [0000038a00000000009d ... 6b69223a2022657377696b69227d] from server <KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: Kafka @ kafka1013.eqiad.wmnet:9092 went away
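
Lines like the two above can be tallied per broker to see which Kafka hosts the processor lost contact with. The following is a minimal sketch, assuming the log format is exactly that of the two sample lines; the pattern and the function name count_went_away are illustrative and not part of EventLogging.

```python
import re
from collections import Counter

# Pattern written against the two sample lines quoted in this report only;
# other log formats are not covered (assumption).
WENT_AWAY = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"<KafkaConnection host=(?P<host>\S+) port=(?P<port>\d+)>: .* went away$"
)


def count_went_away(lines):
    """Count 'went away' errors per (host, port) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = WENT_AWAY.match(line.strip())
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts


# The two log lines from the incident, copied verbatim.
SAMPLE = [
    "2015-11-27 01:31:09,663 (MainThread) Could not receive response to request "
    "[0000026b0000000000a0 ... 6b69223a2022656e77696b69227d] from server "
    "<KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: "
    "Kafka @ kafka1013.eqiad.wmnet:9092 went away",
    "2015-11-27 01:31:09,664 (MainThread) Could not receive response to request "
    "[0000038a00000000009d ... 6b69223a2022657377696b69227d] from server "
    "<KafkaConnection host=kafka1013.eqiad.wmnet port=9092>: "
    "Kafka @ kafka1013.eqiad.wmnet:9092 went away",
]

if __name__ == "__main__":
    for (host, port), n in count_went_away(SAMPLE).items():
        print(f"{host}:{port} went away {n} time(s)")
```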

Kafka had an outage during which only one of the brokers appears to have been working.


06:50 am UTC EventLogging gets rebooted and a spike in consumption can be seen in Grafana

07:05 am UTC consumption catches up

Conclusions

EventLogging consumers get stuck when there are connection problems talking to Kafka. The system requires a reboot to recover after Kafka has been brought back.
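
One common way to avoid this kind of stall is to have the consumer drop the broken connection and reconnect with exponential backoff instead of waiting for a manual restart. The sketch below illustrates that general pattern only; it is not the change that was actually made to EventLogging. BrokerUnavailable, connect, handle and make_flaky_source are hypothetical names, no real Kafka client is used, and connect is assumed to resume from the last consumed position.

```python
import time


class BrokerUnavailable(Exception):
    """Hypothetical error raised when the broker connection is lost."""


def consume_with_reconnect(connect, handle, max_retries=5,
                           base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Consume events, reconnecting with exponential backoff on failure.

    connect: hypothetical callable returning an iterable of events; assumed
             to resume from the last consumed position on each call.
    handle:  callable invoked once per event.
    Returns the number of events handled once the stream ends cleanly.
    """
    handled = 0
    failures = 0
    while True:
        try:
            for event in connect():
                handle(event)
                handled += 1
                failures = 0  # any progress resets the backoff
            return handled
        except BrokerUnavailable:
            failures += 1
            if failures > max_retries:
                raise  # give up and let a supervisor or alert take over
            # Delays grow as 1x, 2x, 4x ... base_delay, capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))


def make_flaky_source(events, fail_times):
    """Build a fake connect() that fails fail_times times, then delivers events."""
    state = {"pos": 0, "fails_left": fail_times}

    def connect():
        def stream():
            if state["fails_left"] > 0:
                state["fails_left"] -= 1
                raise BrokerUnavailable("broker went away")
            while state["pos"] < len(events):
                event = events[state["pos"]]
                state["pos"] += 1
                yield event
        return stream()

    return connect


if __name__ == "__main__":
    delays = []
    source = make_flaky_source(["a", "b", "c"], fail_times=2)
    total = consume_with_reconnect(source, print, sleep=delays.append)
    print("handled", total, "events after waiting", delays)
```

Passing sleep as a parameter keeps the sketch testable without real waiting. Re-raising after max_retries is a deliberate choice here so that a prolonged outage surfaces as a failure rather than a silent stall.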

Actionables

  • Status: Done. Investigate whether backfilling is needed
  • Status: Done. Backfill missing data
  • Status: Done. Make EventLogging more resilient to Kafka outages: [1]