Incidents/20160924-ORES

From Wikitech

Summary

The ORES review tool (the ORES extension) could not score edits made during the roughly 14 hours between the rollout of wmf.20 on 2016-09-22 and the quick fix deployed on 2016-09-23.

Timeline

This is a step by step outline of what happened to cause the incident and how it was remedied.

  • (2016-09-22) SAL: 20:00 thcipriani: rolling out wmf.20 to all wikis
  • (2016-09-23) 09:44 The Phabricator task is created
  • 09:45 The Gerrit patch is made to fix it in master
  • 9:47 The patch is merged.
  • 9:48 The backport to wmf.20 is made.
  • 9:51 The backport is merged
  • SAL: 09:58 logmsgbot: hashar@tin Synchronized php-1.28.0-wmf.20/extensions/ORES/includes/Cache.php: No int typehinting (causes jobs to crash) T146461 (duration: 00m 42s)
  • SAL: 10:00 Amir1: ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki
  • SAL: 10:05 Amir1: ladsgroup@terbium:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=wikidatawiki (T146461) and for 'trwiki', 'plwiki', 'fawiki', 'nlwiki', 'ruwiki', 'ptwiki'

Conclusions

  • There should be an alarm that fires when jobs such as ORESFetchScoreJob have not been triggered for more than an hour.
  • The failure would have been easy to catch; the ORES extension should have extensive CI tests.
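
The alarm proposed in the first conclusion could be sketched roughly as follows. This is only an illustration in Python, not the monitoring that was actually deployed: the function name `job_is_stale`, the constant `MAX_SILENCE`, and the idea of feeding in the time of the last successful job run are all assumptions, and the SAL timestamps from the timeline are assumed to be UTC.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the conclusion above: more than an hour of silence.
MAX_SILENCE = timedelta(hours=1)


def job_is_stale(last_run, now, max_silence=MAX_SILENCE):
    """Return True if the job has not run within max_silence of now.

    last_run and now are timezone-aware datetimes. How last_run is obtained
    (job queue statistics, logs, a metrics store) is left open here.
    """
    return now - last_run > max_silence


# Timestamps from the timeline, assumed to be UTC. Treating the rollout time
# as the last successful run is an assumption made for illustration.
rollout = datetime(2016, 9, 22, 20, 0, tzinfo=timezone.utc)
fix_synced = datetime(2016, 9, 23, 9, 58, tzinfo=timezone.utc)

if job_is_stale(rollout, fix_synced):
    print("ALERT: ORESFetchScoreJob silent for", fix_synced - rollout)
```

With a check like this evaluated periodically, the lapse would have been flagged about an hour after the rollout rather than the following morning.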

Actionables

  • Extensive CI tests for ORES extension (task T146560)
  • High failure rate of account creation should trigger an alarm / page people (task T146090)