Incidents/20160503-Wikidata-Query-Service

Summary

Starting on 3 May 2016 around 07:30 UTC, WDQS intermittently showed increased response times, which led to HTTP 502 errors from Varnish. At that time, WDQS was running on a single server because a reinstall and data reload were in progress. Restarting Blazegraph restored the service. Multiple restarts were performed over the following days.

The issue was traced to two causes: a known bug in the version of Blazegraph that we use, and a file descriptor leak related to Jolokia (the monitoring agent).
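
For illustration, a file descriptor leak of this kind can be spotted by sampling a process's open descriptor count over time and flagging steady growth. The following is a minimal sketch, not the tooling used during this incident: the function names and the `min_growth` threshold are hypothetical, and `count_open_fds` assumes a Linux host where `/proc/<pid>/fd` is readable.

```python
import os
from typing import Sequence


def count_open_fds(pid: int) -> int:
    """Count the open file descriptors of a process (Linux only, via /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def looks_like_leak(samples: Sequence[int], min_growth: int = 100) -> bool:
    """Return True if the counts never decrease and grew by at least min_growth.

    `samples` are descriptor counts taken at regular intervals, oldest first.
    The default threshold is an arbitrary, hypothetical value.
    """
    if len(samples) < 2:
        return False
    never_decreases = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_decreases and samples[-1] - samples[0] >= min_growth
```

A healthy process usually opens and closes descriptors, so its count fluctuates; a count that only ever climbs between samples is the pattern worth alerting on.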

Timeline

2016-05-01T19:37 enabled wdqs1002, put wdqs1001 in maintenance mode for reload
2016-05-03T11:08 issue reported in https://phabricator.wikimedia.org/T134238 and IRC
2016-05-03T12:28 wdqs-updater killed as it seems to leak pipes
2016-05-03T13:01 restarting wdqs-updater and keeping it under close scrutiny
2016-05-03T17:18 restarting wdqs1002
2016-05-03T21:11 restarting wdqs1002
2016-05-04T23:26 deployed additional Icinga check to increase visibility on this issue
2016-05-05T08:12 restarting wdqs1002
2016-05-05T08:57 restarting wdqs1002
2016-05-05T11:32 restarting wdqs1001
2016-05-05T21:00 deploying fix to Jolokia
2016-05-07T12:32 restarting wdqs1002
2016-05-07T20:13 restarting wdqs1001 and wdqs1002
2016-05-07T20:28 deploying updated Blazegraph version for WDQS to mitigate deadlock issue

Conclusions

Two servers are not enough when maintenance tasks (such as a data reload) can take multiple days.
We were alerted by users; our monitoring is not sufficient.

Actionables

Done: run Jolokia as a Java agent instead of attaching and detaching it at each run
Done: add response time check to WDQS
Done: deploy new Blazegraph version to fix BLZG-1884
Tasks opened: Adjust balance of WDQS nodes / Deploy WDQS node on codfw
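
The response time check listed above could look roughly like the sketch below. It is not the check that was deployed: the function names, the thresholds, the trivial `ASK` query and the `User-Agent` string are all assumptions made for illustration. It follows the usual Nagios/Icinga plugin convention of status 0 for OK, 1 for WARNING and 2 for CRITICAL.

```python
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Tuple

# Conventional Nagios/Icinga plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2


def classify(elapsed: float, warn: float, crit: float) -> Tuple[int, str]:
    """Map a measured response time (seconds) to a plugin status and message."""
    if elapsed >= crit:
        return CRITICAL, f"CRITICAL: response took {elapsed:.2f}s"
    if elapsed >= warn:
        return WARNING, f"WARNING: response took {elapsed:.2f}s"
    return OK, f"OK: response took {elapsed:.2f}s"


def check_response_time(endpoint: str, warn: float = 2.0, crit: float = 10.0,
                        timeout: float = 30.0) -> Tuple[int, str]:
    """Time one trivial SPARQL query against `endpoint` and classify the result.

    The thresholds and the query are hypothetical; tune them for the service.
    """
    query = urllib.parse.urlencode({"query": "ASK { ?s ?p ?o }"})
    request = urllib.request.Request(
        f"{endpoint}?{query}",
        headers={"User-Agent": "wdqs-response-time-check (example)"},
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            response.read()
    except OSError as exc:
        # URLError, HTTPError (e.g. a 502) and socket timeouts are all OSError.
        return CRITICAL, f"CRITICAL: request failed: {exc}"
    return classify(time.monotonic() - start, warn, crit)
```

Used as a plugin, a small wrapper would print the message and exit with the returned status. Treating request failures as CRITICAL means that 502 errors like the ones seen in this incident would raise an alert instead of reaching us through user reports.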