Incidents/20160503-Wikidata-Query-Service

From Wikitech

Summary

Starting on May 3, 2016 at around 07:30 UTC, WDQS intermittently showed increased response times, leading to HTTP 502 errors from Varnish. At that time, WDQS was running on a single server because a reinstall and data reload were in progress. Restarting Blazegraph restored the service. Multiple restarts were performed over the following days.

The issue was traced to two causes: a known bug in the version of Blazegraph that we use, and a file descriptor leak related to Jolokia (the monitoring agent).

Timeline

  • 2016-05-01T19:37 enabled wdqs1002, put wdqs1001 in maintenance mode for reload
  • 2016-05-03T11:08 issue reported in https://phabricator.wikimedia.org/T134238 and IRC
  • 2016-05-03T12:28 wdqs-updater killed as it seems to leak pipes
  • 2016-05-03T13:01 restarting wdqs-updater and keeping it under close scrutiny
  • 2016-05-03T17:18 restarting wdqs1002
  • 2016-05-03T21:11 restarting wdqs1002
  • 2016-05-04T23:26 deployed an additional Icinga check to increase visibility on this issue
  • 2016-05-05T08:12 restarting wdqs1002
  • 2016-05-05T08:57 restarting wdqs1002
  • 2016-05-05T11:32 restarting wdqs1001
  • 2016-05-05T21:00 deploying fix to Jolokia
  • 2016-05-07T12:32 restarting wdqs1002
  • 2016-05-07T20:13 restarting wdqs1001 and wdqs1002
  • 2016-05-07T20:28 deploying updated Blazegraph version for WDQS to mitigate deadlock issue

Conclusions

  • Running on only 2 servers is not enough when maintenance tasks (such as a data reload) can take multiple days.
  • We were alerted by users; our monitoring is not sufficient.
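
As an illustration of the kind of monitoring that could surface a file descriptor leak earlier, here is a minimal sketch of a Nagios/Icinga-style check. It is not the check that was actually deployed. It assumes a Linux host where `/proc/<pid>/fd` is readable by the user running the check, and the function names and thresholds are hypothetical.

```python
import os

# Standard Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_open_fds(pid):
    """Count the open file descriptors of a process via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def evaluate(fd_count, warn, crit):
    """Map a descriptor count to a plugin status and a one-line message."""
    if fd_count >= crit:
        return CRITICAL, f"CRITICAL - {fd_count} open file descriptors"
    if fd_count >= warn:
        return WARNING, f"WARNING - {fd_count} open file descriptors"
    return OK, f"OK - {fd_count} open file descriptors"


def main(argv):
    """Usage (hypothetical): check_fds.py PID WARN CRIT"""
    try:
        pid, warn, crit = int(argv[1]), int(argv[2]), int(argv[3])
    except (IndexError, ValueError):
        print("UNKNOWN - usage: check_fds.py PID WARN CRIT")
        return UNKNOWN
    try:
        fd_count = count_open_fds(pid)
    except OSError as exc:
        print(f"UNKNOWN - cannot read /proc/{pid}/fd: {exc}")
        return UNKNOWN
    status, message = evaluate(fd_count, warn, crit)
    print(message)
    return status
```

To use it as a plugin, a wrapper script would call `sys.exit(main(sys.argv))` so that the monitoring system sees the exit code. The warning and critical thresholds would need to be chosen relative to the process's file descriptor limit.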

Actionables