Incidents/20160807-CI

From Wikitech

Summary

CI had an outage of roughly 4 hours. Unfortunately, it was due to a known issue where Nodepool creates too many files on the Jenkins master and thus exhausts its inodes.

Timeline

  • 11:01 < Amir1> Zuul seems to be extremely slow: https://integration.wikimedia.org/zuul/
  • 11:03 < paladox> Hi nodepool seems to be down in zuul
  • 11:08 <+ Reedy> Aug 07 11:07:54 labnodepool1001 nodepoold[16727]: Forbidden: Quota exceeded for instances: Requested 1, but already used 10 of 10 instances (HTTP 403)
  • 11:15 <+ Reedy> I'm not restarting nodepool on a whim
  • 11:16 <+ Reedy> I'll text hashar
  • 11:35 - Reedy texted Antoine
  • 11:39 <+ hashar> Reedy: paladox around :)
  • Diagnosis and attempts to fix by deleting stuck, unused Nodepool instances
  • 12:30 - Antoine started deleting files in /var/lib/jenkins/config-history/config
    • ssh gallium find /var/lib/jenkins/config-history/config/nodes -path '*_deleted_*' -delete
  • 12:41 <+ hashar> Reedy: paladox ci back
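
Before running a destructive find like the one above, it can help to confirm that inodes are the problem and to preview how much would be removed. The following is a minimal sketch, not what was run during the incident: the helper names inode_usage and count_deleted_node_configs are hypothetical, and it assumes a df that supports -i (GNU and BSD do).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not from the incident report.

# Print inode usage for the filesystem holding a path.
# Assumes a df that supports -i (GNU and BSD do); the inode-usage
# percentage approaches 100% when the filesystem runs out of inodes.
inode_usage() {
    df -i "$1"
}

# Count the entries the cleanup would delete, without deleting anything.
# Uses the same -path pattern as the command run during the incident.
count_deleted_node_configs() {
    find "$1" -path '*_deleted_*' | wc -l | tr -d ' '
}
```

On the Jenkins master this might be used as `inode_usage /var/lib/jenkins` followed by `count_deleted_node_configs /var/lib/jenkins/config-history/config/nodes`.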

Conclusions

  • We need to clean up unused config files on a schedule

Actionables

  • Jenkins files under /var/lib/jenkins/config-history/config need to be garbage collected - (task T126552)
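
One possible shape for such a garbage collection job is sketched below. It is an illustration under assumptions, not the fix tracked in the task: the function name gc_deleted_node_configs, the 7-day retention, the script path in the crontab comment, and the assumption that history for deleted nodes lives in entries whose path contains _deleted_ are all hypothetical. It relies on the -delete and -empty extensions found in GNU and BSD find.

```shell
#!/bin/sh
# Sketch only: gc_deleted_node_configs and the retention period are assumptions.

# Remove config-history entries of deleted nodes that are more than a
# given number of days old.
# $1: config-history nodes directory
# $2: retention in days (defaults to 7, an arbitrary choice)
gc_deleted_node_configs() {
    dir=$1
    days=${2:-7}
    # Delete old files inside *_deleted_* entries first...
    find "$dir" -path '*_deleted_*' -type f -mtime +"$days" -delete
    # ...then remove the directories that this leaves empty.
    find "$dir" -path '*_deleted_*' -type d -empty -delete
}

# Hypothetical crontab entry running a wrapper script nightly at 03:00:
# 0 3 * * * /usr/local/bin/jenkins-config-gc /var/lib/jenkins/config-history/config/nodes 7
```

Splitting the work into two find passes avoids trying to delete directories that still hold recent files, at the cost of walking the tree twice.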