
Incidents/20160628-wdqs

From Wikitech

Switching the WDQS deployment to Scap3 caused problems and a service outage.

Summary

On Tuesday, June 28, 2016, around 19:30 UTC, we migrated the WDQS deployment to Scap3. A Scap3 deployment replaces the application directory with a symlink, so any files in that directory that were not managed by Scap3 disappeared. As a result, WDQS was down for about 20 minutes, from 19:53 UTC to 20:11 UTC.
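The failure mode described above can be reproduced with a small simulation. The sketch below is a simplified model of a symlink-swap deployment, not Scap3's actual code. The directory names (wdqs, data, revs/abc123) and the placeholder file are hypothetical. It shows that a symlink created inside the application directory by something other than the deployment tool is gone once that directory is replaced by a symlink to a new revision. The data file that the symlink pointed to is left untouched.

```python
import os
import shutil
import tempfile


def symlink_swap_deploy(app_dir: str, new_rev_dir: str) -> None:
    """Point app_dir at new_rev_dir via a symlink.

    Simplified model (assumption, not Scap3's code): if app_dir is still a plain
    directory it is removed first, so anything in it that is not part of
    new_rev_dir (e.g. a data-file symlink created by Puppet) is lost.
    """
    if os.path.islink(app_dir):
        os.unlink(app_dir)
    elif os.path.isdir(app_dir):
        shutil.rmtree(app_dir)  # removes the link inside, not its target
    os.symlink(new_rev_dir, app_dir)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical data file living outside the application directory
        data_file = os.path.join(root, "data", "wikidata.jnl")
        os.makedirs(os.path.dirname(data_file))
        open(data_file, "w").close()

        # Hypothetical application directory with a link managed outside the deploy tool
        app_dir = os.path.join(root, "wdqs")
        os.makedirs(app_dir)
        os.symlink(data_file, os.path.join(app_dir, "wikidata.jnl"))
        before = os.path.exists(os.path.join(app_dir, "wikidata.jnl"))

        # Hypothetical new revision containing only deployed code
        new_rev = os.path.join(root, "revs", "abc123")
        os.makedirs(new_rev)
        open(os.path.join(new_rev, "placeholder-code"), "w").close()

        symlink_swap_deploy(app_dir, new_rev)
        return {
            "before": before,
            "after": os.path.exists(os.path.join(app_dir, "wikidata.jnl")),
            "code_ok": os.path.exists(os.path.join(app_dir, "placeholder-code")),
            "data_ok": os.path.exists(data_file),
        }


print(demo())
# {'before': True, 'after': False, 'code_ok': True, 'data_ok': True}
```

The data itself survives. Only the path the service expects to find it at no longer resolves, which matches the "missing symlink" diagnosis in the timeline.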

Timeline

  • 19:53: WDQS deployed and restarted by Scap3; service failing
  • 19:58: issue traced to a missing symlink to wikidata.jnl (the Blazegraph data file)
  • 19:59: ran Puppet to restore the missing symlink; this did not work (a dependency failed because the application folder is now managed by Scap3)
  • 20:08: stopped Puppet while restoring the service manually
  • 20:11: service restored by manually recreating the symlink
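Both the Puppet run at 19:59 and the manual fix at 20:11 aim at the same end state: a symlink that exists and points at the Blazegraph data file. The sketch below expresses that as an idempotent helper. It is an illustration only, not what Puppet or the operator actually ran, and the file names in the example are hypothetical.

```python
import os
import tempfile


def ensure_symlink(link_path: str, target: str) -> bool:
    """Make link_path a symlink to target; return True if anything changed.

    Idempotent: a link that already points at target is left alone. A regular
    file or directory at link_path is never replaced, to avoid clobbering data.
    """
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return False  # already correct, nothing to do
        os.unlink(link_path)  # wrong or dangling link: replace it
    elif os.path.lexists(link_path):
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "wikidata.jnl")  # hypothetical data file
        open(target, "w").close()
        link = os.path.join(root, "app-wikidata.jnl")  # hypothetical link location
        print(ensure_symlink(link, target))  # True: link created
        print(ensure_symlink(link, target))  # False: already correct
```

If the link lives inside a directory that the deployment tool replaces on every deploy, as in the simulation of the Scap3 symlink swap, recreating it this way only lasts until the next deployment.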

Notes

  • Migrating WDQS to Scap3 looked like a simple operation. Plenty of services had already done this migration without issue.
  • puppet agent --noop is a great tool, especially in non-standard situations (Gehel did not use it, but should have).
  • Restoring the service took longer than expected. Gehel was not familiar enough with the details of how WDQS works and spent some time double-checking before making changes.

Conclusions

  • Scap3 supports canary deployments; they should *always* be used.
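A canary deployment limits the damage from a bad deploy. The new version goes to one or a few hosts first, those hosts are checked, and only then does the rollout continue to the rest. The sketch below illustrates that gating logic in the abstract. It is not Scap3's implementation or configuration. The host names and the deploy and healthy callables are hypothetical stand-ins. In this incident, the health check would have needed to detect that WDQS was failing after the restart.

```python
from typing import Callable, List, Sequence


class CanaryFailed(RuntimeError):
    """Raised when a canary host is unhealthy after deployment."""


def canary_rollout(
    hosts: Sequence[str],
    canaries: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy to canary hosts first and stop if any of them is unhealthy.

    Returns the hosts deployed to, in order. Raises CanaryFailed without
    touching the remaining hosts if a canary fails its health check.
    """
    deployed: List[str] = []
    for host in canaries:
        deploy(host)
        deployed.append(host)
        if not healthy(host):
            raise CanaryFailed(f"canary {host} unhealthy; rollout stopped")
    for host in hosts:
        if host in canaries:
            continue  # already deployed as a canary
        deploy(host)
        deployed.append(host)
    return deployed


if __name__ == "__main__":
    done: List[str] = []
    try:
        # Hypothetical hosts; the health check always fails here
        canary_rollout(["host-a", "host-b", "host-c"], ["host-a"], done.append, lambda h: False)
    except CanaryFailed as exc:
        print(exc)  # canary host-a unhealthy; rollout stopped
    print(done)  # ['host-a']: the other hosts were never touched
```

The key design choice is that the check runs after deploying to each canary and before any non-canary host is touched. A failure therefore takes down at most the canary set rather than the whole service.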

Actionables