User:Razzi/2021-06-10

From Wikitech
Going to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/698194, then roll-restart the Hadoop masters:

sudo cookbook sre.hadoop.roll-restart-masters analytics

OK, I got an EOFError somehow, at the last confirmation prompt:

razzi@cumin1001:~$ sudo cookbook sre.hadoop.roll-restart-masters analytics
START - Cookbook sre.hadoop.roll-restart-masters
Checking HDFS and Yarn daemon status. We expect active statuses on the Master node, and standby statuses on the other. Please do not proceed otherwise.
Checking Master/Standby status.

Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.68s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.54s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.64s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.74s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Please make sure that the active/standby nodes are correct.
Type "go" to proceed or "abort" to interrupt the execution
> go
Scheduling downtime on Icinga server alert1001.wikimedia.org for hosts: an-master[1001-1002].eqiad.wmnet
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00,  2.32hosts/s]
FAIL |                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
----- OUTPUT of 'icinga-downtime ...razzi@cumin1001"' -----
================
PASS |████████████████████████| 100% (1/1) [00:00<00:00,  2.82hosts/s]
FAIL |                                |   0% (0/1) [00:00<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'icinga-downtime ...razzi@cumin1001"'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Restarting Yarn Resourcemanager on Master.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.71s/hosts]
FAIL |                                |   0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 60.0 seconds.
Restarting Yarn Resourcemanager on Standby.
----- OUTPUT of 'systemctl restar...-resourcemanager' -----
================
PASS |████████████████████████| 100% (1/1) [00:11<00:00, 11.69s/hosts]
FAIL |                                |   0% (0/1) [00:11<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'systemctl restar...-resourcemanager'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Checking Master/Standby status.

Master status for Yarn:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.75s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for Yarn:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |████████████████████████| 100% (1/1) [00:01<00:00,  1.80s/hosts]
FAIL |                                |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed with HDFS Namenodes ?
Type "go" to proceed or "abort" to interrupt the execution
> go
Run manual HDFS failover from master to standby.
Run manual HDFS Namenode failover from an-master1001-eqiad-wmnet to an-master1002-eqiad-wmnet.
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
Failover to NameNode at an-master1002.eqiad.wmnet/10.64.21.110:8040 successful
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:17<00:00, 17.95s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:17<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 30 seconds.
Restart HDFS Namenode on the master.
----- OUTPUT of 'systemctl restart hadoop-hdfs-zkfc' -----
----- OUTPUT of 'systemctl restar...op-hdfs-namenode' -----
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:29<00:00, 29.38s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:29<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
Sleeping 600.0 seconds.
^@Checking Master/Standby status.

Master status for HDFS:
----- OUTPUT of 'kerberos-run-com...1001-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
standby
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.68s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1001-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.

Standby status for HDFS:
----- OUTPUT of 'kerberos-run-com...1002-eqiad-wmnet' -----
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
active
================
PASS |██████████████████████████████████████████████████████████████████████████| 100% (1/1) [00:01<00:00,  1.65s/hosts]
FAIL |                                                                                  |   0% (0/1) [00:01<?, ?hosts/s]
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'kerberos-run-com...1002-eqiad-wmnet'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
>>> Ok to proceed?
Type "go" to proceed or "abort" to interrupt the execution
> go
Exception raised while executing cookbook sre.hadoop.roll-restart-masters:
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/spicerack/_menu.py", line 234, in run
    raw_ret = runner.run()
  File "/usr/lib/python3/dist-packages/spicerack/_module_api.py", line 18, in run
    return self._run(self.args, self.spicerack)
  File "/srv/deployment/spicerack/cookbooks/sre/hadoop/roll-restart-masters.py", line 154, in run
    ask_confirmation("Ok to proceed?")
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 67, in ask_confirmation
    ['go', 'abort'])
  File "/usr/lib/python3/dist-packages/wmflib/interactive.py", line 45, in ask_input
    response = input('> ')
EOFError
END (FAIL) - Cookbook sre.hadoop.roll-restart-masters (exit_code=99)
razzi@cumin1001:~$
razzi@cumin1001:~$
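
The traceback ends in Python's built-in input(), which raises EOFError when stdin returns end-of-file before any data is read (for example, a closed or detached terminal). The log does not show why stdin hit EOF here. As a minimal sketch only, and not wmflib's actual code, a confirmation prompt could turn EOF into an explicit abort instead of an unhandled exception. The names confirm and AbortError, the reader parameter, and the retry limit are all hypothetical, made up for illustration.

```python
from typing import Callable


class AbortError(Exception):
    """Raised when the operator aborts or stdin is closed (hypothetical name)."""


def confirm(
    message: str,
    reader: Callable[[str], str] = input,
    max_attempts: int = 3,
) -> None:
    """Ask the operator to type "go" or "abort".

    Returns normally on "go". Raises AbortError on "abort", on EOF,
    or after max_attempts unrecognized answers.
    """
    print(f">>> {message}")
    print('Type "go" to proceed or "abort" to interrupt the execution')
    for _ in range(max_attempts):
        try:
            response = reader("> ").strip()
        except EOFError:
            # stdin is closed, so no answer can ever arrive: stop cleanly.
            raise AbortError("stdin reached EOF before a response was read") from None
        if response == "go":
            return
        if response == "abort":
            raise AbortError("aborted by operator")
        print(f'Unrecognized response "{response}", expected "go" or "abort"')
    raise AbortError("too many invalid responses")
```

With a shape like this, the caller could catch AbortError and report which steps are still pending, rather than exiting with a bare traceback.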

OK, I ran the rest of the commands manually.

Then I saw a new alert on alerts.wikimedia.org:

CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics

So I pulled up journalctl -u monitor_refine_eventlogging_analytics on an-launcher1002:

Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:15:10 an-launcher1002 java[18217]: unable to create directory '/home/analytics/.cache/dconf': Permission denied.  dconf will not work properly.
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 WARN RefineMonitor: RefineMonitor found problems for path /wmf/data/raw/eventlogging -> database event (/wmf/data/event):
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: The following dataset targets in path /wmf/data/raw/eventlogging between 2021-06-08T00:15:07.000Z and 2021-06-09T20:15:07.001Z either have failed or still need
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: Targets with failures:
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=15
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=16
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=17
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=18
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]:         `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=19
Jun 10 00:18:23 an-launcher1002 monitor_refine_eventlogging_analytics[17497]: 21/06/10 00:18:23 INFO RefineMonitor: Sending problem email report to analytics-alerts@wikimedia.org
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Main process exited, code=exited, status=1/FAILURE
Jun 10 00:18:24 an-launcher1002 systemd[1]: monitor_refine_eventlogging_analytics.service: Failed with result 'exit-code'.
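
When the list of failed targets is long, it can help to pull it out of the journal text. This is a rough sketch that assumes the line format seen in the output above: a backtick-quoted database and table, followed by an absolute path. The function name parse_failed_targets, the FailedTarget type, and the regular expression are my own and not part of RefineMonitor.

```python
import re
from typing import List, NamedTuple


class FailedTarget(NamedTuple):
    database: str
    table: str
    path: str


# Matches e.g.:
#   `event`.`wmdebannerevents` /wmf/data/event/wmdebannerevents/year=2021/month=6/day=9/hour=14
TARGET_RE = re.compile(
    r"`(?P<database>[^`]+)`\.`(?P<table>[^`]+)`\s+(?P<path>/\S+)"
)


def parse_failed_targets(journal_text: str) -> List[FailedTarget]:
    """Return every `db`.`table` /path entry found in the journal text."""
    targets = []
    for line in journal_text.splitlines():
        match = TARGET_RE.search(line)
        if match:
            targets.append(
                FailedTarget(match["database"], match["table"], match["path"])
            )
    return targets
```

Feeding it the saved journalctl output would give one FailedTarget per failed hour, which could then be grouped by table.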

Turns out the service just needed to be restarted; the dconf "Permission denied" messages were unrelated, I guess.