Incidents/20160214-labsdb1002


Summary

At about 22:30 UTC on 2016-02-14, labsdb1002 suffered a disk failure and XFS shut down the volume that hosted several tools databases. We were unable to remount the volume, so the db server was depooled pending a disk replacement.

Timeline

22:28: Icinga reports:

mysqld processes on labsdb1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
MariaDB disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error
Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - /srv is not accessible: Input/output error

Giuseppe, Andrew, Chase, Jaime, Ariel and Alex Monk responded to the alerts. Logs include the following:

Feb 14 22:24:55 labsdb1002 kernel: [21554975.965750] XFS (dm-1): Log I/O Error Detected.  Shutting down filesystem
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965752] XFS (dm-1): Please umount the filesystem and rectify the problem(s)
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965754] XFS (dm-1): metadata I/O error: block 0xc00f2fb0 ("xlog_iodone") error 5 numblks 64
Feb 14 22:24:55 labsdb1002 kernel: [21554975.965757] XFS (dm-1): xfs_do_force_shutdown(0x2) called from line 1170 of file /build/buildd/linux-3.13.0/fs/xfs/xfs_log.c.  Return address = 0xffffffffa02db801

22:50: Giuseppe attempts to unmount and remount the failed volume, without success.

22:59: After discussion it is agreed that the other db servers can handle the load from labsdb1002, so Jaime submits https://gerrit.wikimedia.org/r/#/c/270650/, which directs access for the affected DBs to other servers. Andrew merges the patch and applies it on labservices1001.

23:05: Andrew restarts the 'replag' tool by hand, at which point it resumes normal operation. Many tools recover spontaneously, and a few others are restarted manually by operations staff.

23:22: Andrew sends an email to labs-announce encouraging tool maintainers to restart their services.

Conclusions

Affected replica databases:

  • 'bgwiki'
  • 'bgwiktionary'
  • 'commonswiki'
  • 'cswiki'
  • 'dewiki'
  • 'enwikiquote'
  • 'enwiktionary'
  • 'eowiki'
  • 'fiwiki'
  • 'idwiki'
  • 'itwiki'
  • 'nlwiki'
  • 'nowiki'
  • 'plwiki'
  • 'ptwiki'
  • 'svwiki'
  • 'thwiki'
  • 'trwiki'
  • 'wikidatawiki'
  • 'zhwiki'

Tools with sensible reconnect logic recovered immediately after labsdb1002 was depooled; those without required a manual restart, which is largely left to tool maintainers (see the sketch below).
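
For illustration, a minimal sketch of reconnect logic of the kind described above, written in Python with PyMySQL. The host alias, credentials file, and retry parameters are illustrative assumptions, not taken from any particular tool:

import time
import pymysql

# Illustrative settings: tools typically read credentials from
# ~/replica.my.cnf and connect through a service alias rather than
# a specific backend host (both names here are hypothetical).
DB_HOST = "enwiki.labsdb"
DB_NAME = "enwiki_p"

def query_with_retry(sql, params=None, attempts=5, delay=10):
    """Run a query, reconnecting if the server has gone away."""
    for attempt in range(attempts):
        try:
            conn = pymysql.connect(host=DB_HOST, db=DB_NAME,
                                   read_default_file="~/replica.my.cnf")
            try:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
            finally:
                conn.close()
        except pymysql.err.OperationalError:
            # Server gone away or connection refused: back off and
            # retry with a fresh connection, which will follow any
            # depool of the backend.
            if attempt == attempts - 1:
                raise
            time.sleep(delay)

The key property is that the tool discards the dead connection and opens a fresh one on each retry, so after a depool it transparently lands on a healthy replica instead of needing a manual restart.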

Actionables

  • Consider implementing an automatic failover system for labsdb shards
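
By way of illustration, a minimal sketch of what such automatic failover could look like: a health-check loop that detects a dead mysqld and depools the backend. The shard map, liveness probe, and depool command are all hypothetical; in this incident the equivalent change was a manually written patch applied on labservices1001:

import socket
import subprocess
import time

# Hypothetical shard-to-backend map; in production this information
# lives in the DNS/proxy configuration that the manual patch edited.
SHARDS = {
    "s1": ["labsdb1001", "labsdb1002", "labsdb1003"],
}

def mysql_alive(host, port=3306, timeout=5):
    """Cheap liveness probe: can we open a TCP connection to mysqld?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_and_depool():
    for shard, backends in SHARDS.items():
        healthy = [h for h in backends if mysql_alive(h)]
        for host in backends:
            if host not in healthy and healthy:
                # 'depool-labsdb' is a hypothetical command standing in
                # for whatever mechanism redirects the service aliases
                # away from the failed backend.
                subprocess.run(["depool-labsdb", shard, host], check=False)

if __name__ == "__main__":
    while True:
        check_and_depool()
        time.sleep(60)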