User:Razzi/T280132 disk swap

From Wikitech

https://phabricator.wikimedia.org/T280132

First let me see how the host (an-worker1100) is doing

razzi@an-worker1100:~$ sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli
=== RaidStatus (does not include components in optimal state)
=== RaidStatus completed

Sweet

https://phabricator.wikimedia.org/T280132#7007970

> I did the following:
> - commented the disk in /etc/fstab
> - umounted it manually - sudo umount /var/lib/hadoop/data/k
> - ran puppet to regenerate the list of datadir for yarn and hdfs
> - the yarn nodemanager was down due to this problem, but puppet brought it up again after 3)
Then I uncommented the disk in /etc/fstab and ran:
sudo mount -a
mount: /var/lib/hadoop/data/k: can't find UUID=7bcd4c25-a157-4023-a346-924d4ccee5a0.

Ok, so I guess the disk has a new uuid

ls -l /dev/disk/by-uuid/

Hmm, that only shows links to /dev/sdX devices, and I don't know which one is the new disk.

There's also this link: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration#Swapping_broken_disk

# From the previous commands you should be able to fill in the variables 
# with the values of the disk's properties indicated below:
# X => Enclosure Device ID
# Y => Slot Number
# Z => Controller (Adapter) number
megacli -PDMakeGood -PhysDrv[X:Y] -aZ

So now I'm on step 6, where I want to do something like:

> Add the single disk RAID0 array (use the details from the steps above):
sudo megacli -CfgLdAdd -r0 [32:0] -a0

Given I have:

Adapter #0
...
Enclosure Device ID: 32
Slot Number: 11
Firmware state: Online, Spun Up

I will run

sudo megacli -CfgLdAdd -r0 [32:11] -a0

Ok I ran this but got:

Exit Code: 0x1a
razzi@an-worker1100:~$ echo $?
26

Ok I see on this webpage: https://www.thomas-krenn.com/de/wiki/MegaCLI_Error_Messages

0x1a Maximum LDs are already configured

So maybe it's already configured. Let me try to proceed.
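Side note on the two numbers above: megacli prints the exit code in hex (0x1a) while the shell's $? is decimal (26), and the error table is keyed by the hex value. A tiny sketch for converting, where the function name megacli_exit_hex is my own and not part of megacli:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a decimal shell exit status as the hex code
# used in MegaCLI error tables.
megacli_exit_hex() {
    printf '0x%02x\n' "$1"
}
```

Usage would be something like running the megacli command and then `megacli_exit_hex $?` right after it.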

Well, I want to figure out which /dev/sd? the new disk is. I don't know a direct way to tell, but its uuid should be the one that doesn't show up in /etc/fstab.

for u in $(ls /dev/disk/by-uuid/); do echo "$u"; grep "$u" /etc/fstab; done

bash loop did the trick... e97258d2-5661-469a-9d34-56bd84a80714 is the one.

But wait, there's also 91c728b2-0dc9-4755-841c-ecdab46d38ae...

a7ab9126-4ef4-4824-a41c-69b4f8630edb

Hmm these are all dm-0, 1 2... not what I want. Maybe the disk isn't showing up yet
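A sketch of the same check that prints only the uuids missing from fstab, along with the device each one resolves to, so the dm-N ones are obvious right away. The function name and its two parameters are made up for illustration; the defaults are the real paths used above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every UUID under a by-uuid directory that is not
# mentioned in an fstab file, with the device node its symlink resolves to.
# Both paths are parameters so it can be tried against a scratch directory.
uuids_missing_from_fstab() {
    local uuid_dir="${1:-/dev/disk/by-uuid}"
    local fstab="${2:-/etc/fstab}"
    local link uuid
    for link in "$uuid_dir"/*; do
        # Skip the literal glob pattern when the directory is empty.
        [ -e "$link" ] || [ -L "$link" ] || continue
        uuid="$(basename "$link")"
        # -F: match the UUID as a fixed string, -q: no output, status only.
        if ! grep -qF -- "$uuid" "$fstab"; then
            printf '%s -> %s\n' "$uuid" "$(readlink -f "$link")"
        fi
    done
}
```

Called with no arguments it looks at /dev/disk/by-uuid and /etc/fstab. Note it matches commented-out fstab lines too, same as the loop above.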

ls /dev/sd? | wc -l gives 23.

Yeah I think the disk isn't showing up. I'll comment on the task

Ok, I copied the wrong part of the output earlier: slot 11 was already Online. The replaced disk is this one:

Enclosure Device ID: 32
Slot Number: 10
Firmware state: Unconfigured(good), Spun Up

That's the right disk.
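To avoid picking the wrong slot by eye, here is a rough sketch that filters a physical drive listing down to drives whose firmware state is not Online. I'm assuming the listing comes from something like `sudo megacli -PDList -aALL` and that the three field labels look exactly like the ones pasted above; the function name non_online_drives is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a MegaCLI physical drive listing on stdin and
# print "[enclosure:slot] state" for every drive that is not Online.
# Assumes each drive block lists Enclosure Device ID and Slot Number
# before its Firmware state line.
non_online_drives() {
    awk -F': *' '
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/      {
            if ($2 !~ /^Online/) printf "[%s:%s] %s\n", enc, slot, $2
        }
    '
}
```

The [enclosure:slot] part is in the same form that -CfgLdAdd expects, so it can be pasted into the next command after a sanity check.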

razzi@an-worker1100:~$ sudo megacli -CfgLdAdd -r0 [32:10] -a0

Adapter 0: Created VD 10

Adapter 0: Configured the Adapter!!

Now it shows sdl as unused in lsblk:

sdk                     8:160  0  1.8T  0 disk
└─sdk1                  8:161  0  1.8T  0 part  /var/lib/hadoop/data/m
sdl                     8:176  0  1.8T  0 disk
sdm                     8:192  0  1.8T  0 disk
└─sdm1                  8:193  0  1.8T  0 part  /var/lib/hadoop/data/q
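With 20-odd disks it is easy to miss the empty one in the tree view. A sketch that picks out disks with no children, assuming the input is the raw output of `lsblk -rn -o NAME,TYPE,PKNAME` (the function name unused_disks is mine):

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `lsblk -rn -o NAME,TYPE,PKNAME` output on stdin
# and print disks that no other device names as its parent.
unused_disks() {
    awk '
        $2 == "disk" { order[++n] = $1 }      # remember disks in input order
        NF >= 3      { has_child[$3] = 1 }    # third column is the parent name
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in has_child)) print order[i]
        }
    '
}
```

Usage: `lsblk -rn -o NAME,TYPE,PKNAME | unused_disks`. A disk formatted directly without a partition table would also be listed, so treat the result as a hint, not proof.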

Now I want the disk uuid, but it's not showing in blkid or /dev/disk/by-uuid/...

Oh right, it doesn't have a partition or filesystem yet, and the uuid belongs to the filesystem on the partition.

sudo parted /dev/sdl --script mklabel gpt
sudo parted /dev/sdl --script mkpart primary ext4 0% 100%
sudo mkfs.ext4 -L hadoop-k /dev/sdl1
sudo tune2fs -m 0 /dev/sdl1

Now lsblk -f shows its uuid: cb58c727-dec9-4abf-8b21-3d70a6443b6d
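The fstab entry for /var/lib/hadoop/data/k still points at the old uuid (the one mount -a complained about), so it needs the new one. A cautious sketch that prints the updated fstab to stdout instead of editing in place; the function name swap_fstab_uuid is hypothetical and the rest of the fstab line is left untouched:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an fstab with one UUID swapped for another.
# Writes to stdout only, so the result can be reviewed before it replaces
# the real file.
swap_fstab_uuid() {
    local old="$1" new="$2" fstab="${3:-/etc/fstab}"
    # UUIDs are hex digits and dashes, so they are safe inside the sed pattern.
    sed "s/UUID=${old}/UUID=${new}/" "$fstab"
}
```

For example, `swap_fstab_uuid 7bcd4c25-a157-4023-a346-924d4ccee5a0 cb58c727-dec9-4abf-8b21-3d70a6443b6d | diff /etc/fstab -` shows exactly which line would change.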

But lsblk isn't showing its available space or a mountpoint, so it isn't mounted...

sdl
└─sdl1             ext4              hadoop-k        cb58c727-dec9-4abf-8b21-3d70a6443b6d
sdm
└─sdm1             ext4              hadoop-q        766882b0-078f-4bc3-b118-3f8456446b52    380.2G    79% /var/lib/hadoop/data/q

It looks like it does mount, but doesn't stay.

Perhaps https://unix.stackexchange.com/questions/474743/mount-command-finishes-successully-but-disk-is-not-mounted

Yep

Apr 29 15:38:09 an-worker1100 kernel: [1810702.143355] EXT4-fs (sdl1): mounted filesystem with ordered data mode. Opts: (null)
Apr 29 15:38:09 an-worker1100 systemd[1]: var-lib-hadoop-data-k.mount: Unit is bound to inactive unit dev-disk-by\x2duuid-7bcd4c25\x2da157\x2d4023\x2da346\x2d924d4ccee5a0.device. Stopping, too.
Apr 29 15:38:09 an-worker1100 systemd[1]: Unmounting /var/lib/hadoop/data/k...
Apr 29 15:38:09 an-worker1100 systemd[1]: var-lib-hadoop-data-k.mount: Succeeded.

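systemd is still binding the mount unit to the old uuid's device unit (7bcd4c25...), which no longer exists. systemctl daemon-reexec might fix it. It did!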
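Since mount itself exited cleanly while systemd unmounted the disk a moment later, it is worth checking the mount table afterwards rather than trusting the exit status. A minimal sketch, with a made-up function name, that reads a mounts table in /proc/mounts format:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the source device currently mounted at a given
# mountpoint, or nothing if it is not mounted. The table path is a parameter
# so the function can be tried on a sample file.
mounted_source() {
    local mountpoint="$1" table="${2:-/proc/mounts}"
    awk -v mp="$mountpoint" '$2 == mp { print $1 }' "$table"
}
```

After `sudo mount -a`, running `mounted_source /var/lib/hadoop/data/k` should print /dev/sdl1 here; empty output means the mount did not stick.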