User:Razzi/2021-06-10
gonna deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194
sudo cookbook sre.hadoop.roll-restart-masters analytics
OK, I got an EOFError somehow...
razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-masters analytics [0/0]
START - Cookbook sre.hadoop.roll-restart-masters
Checking HDFS and Yarn daemon status. We expect active statuses on the Master node, and standby statuses on the other. Please do not proceed otherwise.
Checking Master/Standby status.
Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.68s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.54s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.64s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.74s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Please make sure that the active/standby nodes are correct.
Type "go" to proceed or "abort" to interrupt the execution
> go
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: an-master[1001-1002].eqiad.wmnet
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00, 2.32hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00, 2.82hosts/s]
FAIL | | 0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Restarting Yarn Resourcemanager on Master.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.71s/hosts]
FAIL | | 0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 60.0 seconds.
Restarting Yarn Resourcemanager on Standby.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.69s/hosts]
FAIL | | 0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Checking Master/Standby status.
Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.75s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00, 1.80s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed with HDFS Namenodes ?
Type "go" to proceed or "abort" to interrupt the execution
> go
Run manual HDFS failover from master to standby.
Run manual HDFS Namenode failover from an-master1001-eqiad-wmnet to an-master1002-eqiad-wmnet.
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Failover to NameNode at an-master1002.eqiad.wmnet/10.64.21.110:8040 successful
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:17<00:00, 17.95s/hosts]
FAIL | | 0% (0/1) [00:17<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 30 seconds.
Restart HDFS Namenode on the master.
----- OUTPUT of 'systemctl restart hadoop-hdfs-zkfc' -----
----- OUTPUT of 'systemctl restar...op-hdfs-namenode' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:29<00:00, 29.38s/hosts]
FAIL | | 0% (0/1) [00:29<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 600.0 seconds.
^@Checking Master/Standby status.
Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.68s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00, 1.65s/hosts]
FAIL | | 0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Exception raised while executing cookbook sre.hadoop.roll-restart-masters:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
raw_ret = runner.run()
File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 18, in run
return self._run(self.args, self.spicerack)
File "/srv/deployment/spicerack/cookbooks/sre/hadoop/roll-restart-masters.py", line 154, in run
ask_confirmation("Ok to proceed?")
File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 67, in ask_confirmation
['go', 'abort'])
File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 45, in ask_input
response = input('> ')
EOFError
END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99)
razzi@cumin1001:~$
razzi@cumin1001:~$
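For context on the traceback above: Python's built-in input() raises EOFError when stdin hits end-of-file before a line is read, which is what surfaced inside ask_input. Below is a minimal sketch of a confirmation prompt that treats EOF as "abort" instead of crashing. The helper name ask_go_or_abort is hypothetical and this is not how wmflib implements it; it only illustrates the failure mode.

```python
import io
import sys


def ask_go_or_abort(message: str) -> str:
    """Hypothetical helper (not wmflib's API): return 'go' or 'abort'.

    An EOF on stdin is reported as 'abort' instead of raising EOFError.
    """
    print(f">>> {message}")
    print('Type "go" to proceed or "abort" to interrupt the execution')
    while True:
        try:
            response = input("> ").strip()
        except EOFError:
            # stdin was closed or sent EOF (e.g. Ctrl-D); stop cleanly.
            return "abort"
        if response in ("go", "abort"):
            return response
        print('Invalid response, please type "go" or "abort"')


if __name__ == "__main__":
    # Simulate a closed stdin: an empty stream makes input() raise EOFError.
    sys.stdin = io.StringIO("")
    print(ask_go_or_abort("Ok to proceed?"))
```

Whether aborting on EOF is the right call for a cookbook in the middle of a failover is debatable; the point is only that the prompt can fail for reasons unrelated to what was typed.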
OK, ran the rest of the commands manually.
Then I saw a new error on alerts.wikimedia.org:
CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics
So I pulled up journalctl -u monitor_refine_eventlogging_analytics:
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied. dconf will not work properly.
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 WARN RefineMonitor: RefineMonitor found problems for path /wmf/data/raw/eventlogging -> database event (/wmf/data/event):
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: The following dataset targets in path /wmf/data/raw/eventlogging between 2021-06-08T00:15:07.000Z and 2021-06-09T20:15:07.001Z either have failed or still need
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: Targets with failures:
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=15
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=16
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=17
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=18
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=19
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 INFO RefineMonitor: Sending problem email report to analytics-alerts@wikimedia.org
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Failed with result 'exit-code'.
Turns out the service just needed to be restarted; the dconf error was unrelated, I guess.
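If the list of failed targets gets long, it can be easier to pull them out of the journal text than to read them by eye. A rough sketch, assuming the journal lines keep the `db`.`table` path shape seen above; the function name and regex are my own, not part of RefineMonitor or any Refine tooling.

```python
import re

# Matches lines shaped like: `event`.`wmdebannerevents` /wmf/data/event/...
# The pattern is an assumption based on the journal output in this log.
TARGET_RE = re.compile(r"`(?P<db>\w+)`\.`(?P<table>\w+)`\s+(?P<path>/\S+)")


def failed_targets(journal_text: str) -> list:
    """Return (database, table, path) tuples for each target line found."""
    return [
        (m["db"], m["table"], m["path"])
        for m in TARGET_RE.finditer(journal_text)
    ]


# Two lines copied from the journal output in this log.
SAMPLE = (
    "Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: "
    "`event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14\n"
    "Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: "
    "`event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=15\n"
)

if __name__ == "__main__":
    for db, table, path in failed_targets(SAMPLE):
        print(db, table, path)
```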