MariaDB/Backups/Example recovery

The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Data recovery plan

1) Recover up to the drop on db1095: db1109-bin.003832:611995336

transfer.py --type=decompress dbprov1001.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s8.2020-04-05--19-00-02.tar.gz db1095.eqiad.wmnet:/srv/sqldata.s8 # [DONE]

Setup new instance: https://gerrit.wikimedia.org/r/c/operations/puppet/+/587010 [DONE]
Start server [DONE]
setup replication

CHANGE MASTER TO MASTER_HOST='db1109.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD=, MASTER_SSL=1; -- requires right password [DONE]

GTID stuff on file, check replication works.

SET GLOBAL gtid_slave_pos = ; -- requires a good gtid pos [DONE]

Disable gtid out of precaution

CHANGE MASTER TO MASTER_USE_GTID=no; -- [DONE]

Start replication until (before) delete

START SLAVE UNTIL MASTER_LOG_FILE = 'db1109-bin.003832', MASTER_LOG_POS = 611995336; -- [DONE]

Disable consitency options:

SET GLOBAL innodb_flush_log_at_trx_commit=0; -- [DONE]

Add to zarcillo so it can be monitored on prometheus [DONE]

2) Stop replication fully

STOP SLAVE; [DONE]

3) Rename the table to _old

set sql_log_bin=0; RENAME TABLE wb_items_per_site TO wb_items_per_site_recovered; [DONE]

4) Dump the table in 10 chunks: dbprov1001# dbprov1001# backup-mariadb s8_backup --type=dump --host=db1095.eqiad.wmnet --port=3318 --backup-dir /srv/backups/dumps/ongoing --regex='wikidatawiki[.]wb\_items\_per\_site' --rows=1000000 --user=dump --retention=15 --password= # requires a working --password [DONE]

The right dump is /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21

5) Stop replication on:

codfw
labs
Analytics

6) Recover on eqiad production hosts without replication:

myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host={} --port={} --user=root --password= # requires working --password
myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host=db1087.eqiad.wmnet --user=root --password= --enable-binlog # on db1087

(list of servers gotten from zarcillo with the section bash oneliner)

db1116:3318 (backup source, delayed) DONE
db2079 (codfw master, delayed) DONE
db1095:3318 (NOT TO BE DONE, delayed, instance to be removed)
db1124:3318 (NOT TO BE DONE, will be done on its master)
db1099:3318 DONE
db1101:3318 DONE
db1126 DONE
db1104 DONE
db1092 DONE
db1087 (WITH REPLICATION) DONE
db1111 DONE
dbstore1005:3318 (analytics) DONE
db1109 (master) DONE

7) drop? table and rename old to new:

- Set the master on RO for some time

db1109> set global read_only=1; RENAME TABLE wb_items_per_site TO wb_items_per_site_old; RENAME TABLE wb_items_per_site_recovered TO wb_items_per_site; set global read_only=0;

Main issue: metadata locking on the slaves
- Depooling hosts (a bunch of them at the time)
- Repooling them after rename
- Depooling the ones that get metadata locked again