Portal:Toolforge/Admin/Runbooks/ToolsToolsDBWritableState

From Wikitech

This alert fires when the primary ToolsDB instance is down, or is up but in read-only mode.

The procedures in this runbook require admin permissions to complete.

Error / Incident

The incident usually surfaces as an alert in Alertmanager.

Debugging

General cluster overview

You can run the following cookbook to get a cluster overview:

dcaro@urcuchillay$ wmcs-cookbooks wmcs.toolforge.toolsdb.get_cluster_status --cluster tools


Checking the systemd unit status

SSH to the instance and check the status of the mariadb.service systemd unit:

$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
fnegri@tools-db-1:~$ sudo systemctl status mariadb.service

If SSH does not work (for example because of phab:T349681) you can use virsh console: find the "instance name" and "host" in Horizon, then SSH to the cloudvirt host, and run virsh console {instance name}.
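
The console fallback above can be sketched as a small helper that only prints the command to run, so it can be reviewed before use. The function name virsh_console_command is hypothetical, both arguments are placeholders to be copied from Horizon, and the -t flag is an assumption added so that the console gets an interactive terminal. Depending on the cloudvirt setup, virsh may also need to be run with sudo.

```shell
# Sketch only: prints the command, does not run it.
#   $1 = cloudvirt "host" as shown in Horizon (placeholder)
#   $2 = "instance name" as shown in Horizon (placeholder)
virsh_console_command() {
    # ssh -t forces a TTY, which an interactive serial console needs.
    printf 'ssh -t %s virsh console %s\n' "$1" "$2"
}

# Usage (values are placeholders, fill them in from Horizon):
#   virsh_console_command "$CLOUDVIRT_HOST" "$INSTANCE_NAME"
```

Once attached, the virsh console is normally left with Ctrl+].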

Common issues

Add new issues here when you encounter them!

MariaDB process killed by OOM killer

In this case, the MariaDB journal usually contains messages like the following:

$ ssh tools-db-1.tools.eqiad1.wikimedia.cloud
fnegri@tools-db-1:~$ sudo journalctl -u mariadb |grep -i kill
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: A process of this unit has been killed by the OOM killer.
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Main process exited, code=killed, status=9/KILL
Oct 24 09:23:39 tools-db-1 systemd[1]: mariadb.service: Failed with result 'oom-kill'.

Sometimes the mariadb logs only include "Main process exited", without any mention of OOM, but you can verify whether the process was killed by the OOM killer by looking at dmesg:

fnegri@tools-db-1:~$ sudo dmesg -T |grep Killed
[Tue Oct 24 09:19:23 2023] Out of memory: Killed process 2437 (mysqld) total-vm:64835688kB, anon-rss:64103256kB, file-rss:0kB, shmem-rss:0kB, UID:497 pgtables:126460kB oom_score_adj:-600
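
When the kernel log is long, it can help to reduce each OOM kill line to the fields that matter. The following is a minimal sketch, assuming the kernel message format shown above; the helper name oom_kill_summary and its output format are hypothetical, not part of any existing tooling.

```shell
# Sketch only: reads kernel log lines on stdin and prints one summary line
# per OOM kill, with the PID, process name, and anonymous RSS in kB.
# Lines that do not match the "Out of memory: Killed process" format are dropped.
oom_kill_summary() {
    sed -n -E 's/.*Out of memory: Killed process ([0-9]+) \(([^)]+)\).*anon-rss:([0-9]+)kB.*/pid=\1 name=\2 anon_rss_kB=\3/p'
}

# Usage on the instance:
#   sudo dmesg -T | oom_kill_summary
```

For the dmesg line shown above, this prints pid=2437 name=mysqld anon_rss_kB=64103256.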

Check whether systemd restarted mariadb.service automatically with sudo systemctl status mariadb. If it did not, run sudo systemctl start mariadb.
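
The check-then-start step can be sketched as a single function. This is a minimal sketch, assuming the unit is named mariadb.service as above; the function name ensure_mariadb_running is hypothetical. It uses systemctl is-active, which exits non-zero when the unit is not active.

```shell
# Sketch only: start mariadb.service if systemd did not restart it already.
ensure_mariadb_running() {
    if systemctl is-active --quiet mariadb.service; then
        echo "mariadb.service is already active"
    else
        echo "mariadb.service is not active, starting it"
        sudo systemctl start mariadb.service
    fi
}

# Usage on the instance:
#   ensure_mariadb_running
```

The function does nothing when the unit is already active, so it is safe to call more than once. It does not change the read-only state of the server.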

Finally, set the server to read-write, as it is configured to start in read-only mode for extra safety:

$ sudo mariadb
MariaDB [(none)]> SET GLOBAL read_only=OFF;
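
To confirm the state before and after the change, the read_only variable can be queried directly. The following is a minimal sketch, assuming it runs on the primary instance and that sudo mariadb opens a privileged session as above; the function names toolsdb_read_only and toolsdb_make_writable are hypothetical.

```shell
# Sketch only: prints 1 if the server is read-only, 0 if it is writable.
toolsdb_read_only() {
    sudo mariadb --batch --skip-column-names -e 'SELECT @@global.read_only;'
}

# Sketch only: switch to read-write if needed, then print the resulting state.
toolsdb_make_writable() {
    if [ "$(toolsdb_read_only)" = "1" ]; then
        sudo mariadb -e 'SET GLOBAL read_only=OFF;'
    fi
    toolsdb_read_only
}

# Usage on the primary instance:
#   toolsdb_make_writable
```

A final output of 0 indicates that the server accepts writes again.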

Support contacts

The main discussion channel for this alert is #wikimedia-cloud-admin on IRC.

If the situation is not clear or you need additional help, you can also contact the Data Persistence team (#wikimedia-data-persistence on IRC).

Old incidents

Add any incident tasks here!