Jump to content

MariaDB/Backups/Example recovery

From Wikitech
The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Data recovery plan

1) Recover up to the drop on db1095: db1109-bin.003832:611995336

transfer.py --type=decompress dbprov1001.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s8.2020-04-05--19-00-02.tar.gz db1095.eqiad.wmnet:/srv/sqldata.s8 # [DONE]
CHANGE MASTER TO MASTER_HOST='db1109.eqiad.wmnet', MASTER_USER='repl', MASTER_PASSWORD=, MASTER_SSL=1; -- requires right password [DONE]

GTID stuff on file, check replication works.

SET GLOBAL gtid_slave_pos = ; -- requires a good gtid pos [DONE]
  • Disable gtid out of precaution
CHANGE MASTER TO MASTER_USE_GTID=no; -- [DONE]
  • Start replication until (before) delete
START SLAVE UNTIL MASTER_LOG_FILE = 'db1109-bin.003832', MASTER_LOG_POS = 611995336; -- [DONE]
  • Disable consitency options:
SET GLOBAL innodb_flush_log_at_trx_commit=0; -- [DONE]
  • Add to zarcillo so it can be monitored on prometheus [DONE]

2) Stop replication fully

STOP SLAVE; [DONE]

3) Rename the table to _old

set sql_log_bin=0; RENAME TABLE wb_items_per_site TO wb_items_per_site_recovered; [DONE]

4) Dump the table in 10 chunks: dbprov1001# dbprov1001# backup-mariadb s8_backup --type=dump --host=db1095.eqiad.wmnet --port=3318 --backup-dir /srv/backups/dumps/ongoing --regex='wikidatawiki[.]wb\_items\_per\_site' --rows=1000000 --user=dump --retention=15 --password= # requires a working --password [DONE]

The right dump is /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21

5) Stop replication on:

  • codfw
  • labs
  • Analytics

6) Recover on eqiad production hosts without replication:

myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host={} --port={} --user=root --password= # requires working --password
myloader --directory /srv/backups/dumps/ongoing/dump.s8_backup.2020-04-07--10-09-21 --threads=10 --host=db1087.eqiad.wmnet --user=root --password= --enable-binlog # on db1087

(list of servers gotten from zarcillo with the section bash oneliner)

  1. db1116:3318 (backup source, delayed) DONE
  2. db2079 (codfw master, delayed) DONE
  3. db1095:3318 (NOT TO BE DONE, delayed, instance to be removed)
  4. db1124:3318 (NOT TO BE DONE, will be done on its master)
  5. db1099:3318 DONE
  6. db1101:3318 DONE
  7. db1126 DONE
  8. db1104 DONE
  9. db1092 DONE
  10. db1087 (WITH REPLICATION) DONE
  11. db1111 DONE
  12. dbstore1005:3318 (analytics) DONE
  13. db1109 (master) DONE

7) drop? table and rename old to new:

- Set the master on RO for some time
db1109> set global read_only=1; RENAME TABLE wb_items_per_site TO wb_items_per_site_old; RENAME TABLE wb_items_per_site_recovered TO wb_items_per_site; set global read_only=0;
  • Main issue: metadata locking on the slaves
    • Depooling hosts (a bunch of them at the time)
    • Repooling them after rename
    • Depooling the ones that get metadata locked again