Portal:Cloud VPS/Admin/notes/Meltdown Response

Deprecation warning. This page refers to operations we did in the past and may no longer be relevant

https://phabricator.wikimedia.org/T184189 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Meltdown_Response

Rollout checklist (done in mediawiki style so it can be archived)

Summary

Baseline performance on a labvirt with the existing kernel
Figure out right kernel versions to move to
Upgrade guests (with some special handling in Toolforge to ensure we don't have general guest issues with these kernels before fleet wide)
Upgrade labvirts
Reboots all around

Preparation

Commands

aptitude install linux-image-4.9.0-0.bpo.5-amd64

[x] Create a bunch of instances on labvirt1018

Some Jessie
Some Trusty
Some Stretch

   OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image 85e8924b-b25d-4341-ad3e-56856d4de2cc --availability-zone host:labvirt1018 labvirt1018stresstest-4
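Creating "a bunch" of instances means repeating the command above with a different name suffix each time. A minimal sketch of that loop follows; the function name `stress_create_cmds` and its argument order are hypothetical, and it only prints the commands rather than running them. Only the one image ID shown above is recorded here, so the IDs for the other distros would have to be supplied by the operator.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) one `openstack server create` command
# per test instance, numbered FIRST..LAST, pinned to the given labvirt.
# Usage: stress_create_cmds HOST IMAGE_ID FIRST LAST
stress_create_cmds() {
    host=$1
    image=$2
    i=$3
    last=$4
    while [ "$i" -le "$last" ]; do
        echo "OS_TENANT_NAME=testlabs openstack server create --flavor 4 --image $image --availability-zone host:$host ${host}stresstest-$i"
        i=$((i + 1))
    done
}
```

Reviewing the printed commands before pasting them keeps the step as manual as the original checklist.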

[x] Profile performance on labvirt1018 on existing kernel

over a few hours or a day?
what performance charts or metrics are we watching here?

https://phabricator.wikimedia.org/T184189#3893388

[x] Choose upgrade candidate for Jessie (Seems like: linux-image-4.9.0-0.bpo.5 ... -> Ack, but you should run "apt-get -y install linux-meta")

[x] Choose upgrade candidate for Stretch (Seems like: linux-image-4.9.0-5-amd64 -> Ack, but you should run "apt-get -y install linux-image-amd64")

[x] Choose upgrade candidate for Trusty (Assumed same for labvirts and guests) -> No: on labvirts you need "apt-get -y install linux-image-generic-lts-xenial" and on instances "apt-get -y install linux-image-generic"

Labvirts:

apt-get install -y linux-image-4.4.0-109-generic linux-image-extra-4.4.0-109-generic linux-lts-xenial-tools-4.4.0-109 linux-tools-4.4.0-109-generic
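The three candidate decisions above amount to a small lookup from distro and role to a package name. A sketch of that lookup, assuming a hypothetical function name `kernel_package_for` and the role labels `guest` and `labvirt` (neither appears in the original notes):

```shell
#!/bin/sh
# Hypothetical helper: map a distro codename and role (guest|labvirt) to the
# kernel package chosen in the checklist above.
kernel_package_for() {
    case "$1:$2" in
        jessie:guest)   echo linux-meta ;;
        stretch:guest)  echo linux-image-amd64 ;;
        trusty:guest)   echo linux-image-generic ;;
        trusty:labvirt) echo linux-image-generic-lts-xenial ;;
        *)
            # Anything else was not decided in these notes.
            echo "no candidate recorded for $1/$2" >&2
            return 1
            ;;
    esac
}
```

For example, `apt-get -y install "$(kernel_package_for stretch guest)"` would expand to the Stretch command quoted above.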

[x] Upgrade labvirt1018 to Trusty Kernel candidate (reboot)

[x] Upgrade labvirt1018 pilot guest instances to candidate kernels (reboot)

[x] Profile performance on labvirt1018 on candidate kernel

over a few hours or a day?

https://phabricator.wikimedia.org/T184189#3893388

Guest updates: We know what to upgrade to for each distro and believe performance will be survivable

Question!

Should we reboot all Toolforge nodes or a serious subset to see if we turn up any problems with guests on new kernels in our controlled environment before going full on? I think potentially yes.

Final answer is no: the performance impact has been fairly predictable, and while a Toolforge-first pass would be more graceful for Tools, it would require dual reboots.

Commands

apt-get install <kernel>
uname -r
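Since `uname -r` only reports the new release after a reboot, the comparison can be wrapped in a small check. A minimal sketch, with a hypothetical function name `kernel_matches`:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the running kernel release equals the
# expected one. The second argument defaults to the output of `uname -r`;
# passing it explicitly allows checking a value collected elsewhere.
kernel_matches() {
    expected=$1
    running=${2:-$(uname -r)}
    [ "$running" = "$expected" ]
}
```

On a rebooted Stretch guest, `kernel_matches 4.9.0-5-amd64 && echo ok` would be one way to use it.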

Pilot in Toolforge

[x] Upgrade canary candidates for Trusty in Toolforge

seems like kernel landed here as a security update already

apt-get -s install linux-image-generic (would confirm as a noop)

sudo apt-get update && sudo apt-get -y install linux-image-generic && sudo mv /boot/grub/menu.lst /boot/grub/menu.lst.old && sudo update-grub -y && sudo uname -r

tools-exec-1401 tools-exec-1402 tools-exec-1403 tools-exec-1404 tools-exec-1405

[x] Upgrade canary candidates for Jessie in Toolforge

tools-worker-1011.tools.eqiad.wmflabs tools-worker-1012.tools.eqiad.wmflabs tools-worker-1013.tools.eqiad.wmflabs tools-worker-1014.tools.eqiad.wmflabs tools-worker-1015.tools.eqiad.wmflabs tools-worker-1016.tools.eqiad.wmflabs

[x] Upgrade canary candidates for Stretch in Toolforge

I think this is only PAWS?
Yuvi's upgrade to k8s 1.9 caught the correct kernel so all have been updated since yesterday

tools-paws-master-01

15:11:51 up 1 day, 16:10,  0 users,  load average: 0.72, 1.36, 1.54

Linux tools-paws-master-01 4.9.0-5-amd64 #1 SMP Debian 4.9.65-3+deb9u2 (2018-01-04) x86_64 GNU/Linux

[x] Review performance implications over half a day

https://phabricator.wikimedia.org/T184189#3894622

First canaries from 1001-1009 (pcid) and 1010-1019 (pcid and invpcid) (both should have headroom)
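The pcid / invpcid distinction matters because PCID support reduces the cost of the page-table switches that the Meltdown mitigation introduces. On Linux the CPU flags are listed in /proc/cpuinfo, so the grouping above can be checked per host. A sketch, assuming a hypothetical function name `pcid_support` that reads cpuinfo-style text on stdin:

```shell
#!/bin/sh
# Hypothetical helper: read /proc/cpuinfo-style text on stdin and report
# PCID/INVPCID support based on the first "flags" line.
# Prints "pcid+invpcid", "pcid", or "none" (invpcid without pcid also
# prints "none").
pcid_support() {
    awk -F: '
        /^flags/ {
            n = split($2, f, " ")
            for (i = 1; i <= n; i++) {
                if (f[i] == "pcid") p = 1
                if (f[i] == "invpcid") v = 1
            }
            r = (p && v) ? "pcid+invpcid" : (p ? "pcid" : "none")
            print r
            found = 1
            exit
        }
        END { if (!found) print "none" }
    '
}
```

Usage on a host would be `pcid_support < /proc/cpuinfo`.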

[x] figure out how to target guests on a particular labvirt https://phabricator.wikimedia.org/T184756

[x] send email to affected guest projects for pilot labvirts

https://lists.wikimedia.org/pipermail/cloud/2018-January/000170.html

[x] Update all guests on labvirt1017

$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub"
$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub"
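The same four per-distro commands are reused later with a different cumin query, so they can be generated from the query alone. A sketch that prints them for review instead of executing anything; the function name `guest_upgrade_cmds` is hypothetical, and the command strings are copied from these notes:

```shell
#!/bin/sh
# Hypothetical helper: print the four per-distro cumin commands used in
# these notes for a given cumin query (for example "host:labvirt1017").
# Each remote command is guarded by an lsb_release check so that it only
# acts on the matching distro.
guest_upgrade_cmds() {
    query=$1
    prefix="sudo cumin --force --timeout 120 -o json \"$query\""
    printf '%s "%s"\n' "$prefix" "lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic"
    printf '%s "%s"\n' "$prefix" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"
    printf '%s "%s"\n' "$prefix" "lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub"
    printf '%s "%s"\n' "$prefix" "lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub"
}
```

Calling `guest_upgrade_cmds "host:labvirt1003"` would list the commands for the second pilot labvirt.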

[x] Update labvirt1017.eqiad.wmnet

root@tools-bastion-03:~# exec-manage depool tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs
etc.
andrew@tools-k8s-master-01:~$ kubectl cordon tools-worker-1012.tools.eqiad.wmflabs
etc.

andrew@labcontrol1001:~$ nova list --all-tenants --host labvirt1017 > restartme.txt

Set 2 hour downtime for labvirt1017
root@labvirt1017:~# apt-get install -y linux-image-4.4.0-109-generic linux-image-extra-4.4.0-109-generic linux-lts-xenial-tools-4.4.0-109 linux-tools-4.4.0-109-generic
root@labvirt1017:~# update-grub
root@labvirt1017:~# reboot

(wait)

andrew@labvirt1017:~$ dmesg | grep -i isolation
[ 0.000000] Kernel/User page tables isolation: enabled

(restart everything in restartme.txt)
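restartme.txt holds the table printed by `nova list`, so restarting "everything" first means pulling the instance IDs out of it. A sketch of that extraction, assuming the usual ASCII table layout with the ID in the first column; the function name `instance_ids` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read a `nova list` style ASCII table on stdin and
# print the first column (the instance ID) of each data row.
# Border rows ("+----+") contain no "|" and are skipped, as is the header.
instance_ids() {
    awk -F'|' '
        NF > 2 {
            id = $2
            gsub(/[ \t]/, "", id)
            if (id != "" && id != "ID") print id
        }
    '
}
```

`instance_ids < restartme.txt` would then give one ID per line to feed to whichever start or reboot command is appropriate; these notes do not record which one was used.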

sudo cumin --force --timeout 120 -o json "host:labvirt1017" "dmesg | grep -i isolation"
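The `dmesg | grep -i isolation` check only proves the mitigation is active if the matching line says "enabled". A slightly stricter sketch, with a hypothetical function name `kpti_enabled` that reads kernel log text on stdin:

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the kernel log text on stdin
# reports page table isolation as enabled.
kpti_enabled() {
    grep -qi 'page tables isolation: enabled'
}
```

On a host this could be run as `dmesg | kpti_enabled && echo "KPTI on"`.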

[x] Update all guests on labvirt1003

[x] update labvirt1003.eqiad.wmnet

[x] send notice email

[x] reboot labvirt1017

[x] reboot labvirt1003

[x] confirm labvirt kernels

andrew@labvirt1017:~$ dmesg | grep -i isolation
[ 0.000000] Kernel/User page tables isolation: enabled

andrew@labvirt1003:~$ dmesg | grep -i isolation
[ 0.000000] Kernel/User page tables isolation: enabled

[x] confirm all guest kernels

$ sudo cumin --force --timeout 120 -o json "host:labvirt1017" "dmesg | grep -i isolation"

One straggler that I fixed by changing the apt settings for the kernel (it had upgrades disabled somehow)

$ sudo cumin --force --timeout 120 -o json "host:labvirt1003" "dmesg | grep -i isolation"

One casualty: ttmserver-elasticsearch01.ttmserver.eqiad.wmflabs didn't come back up. It's an old Trusty instance in a project that's a candidate for deletion and had many full drives, so I suspect it was unable to upgrade fully. No logs.
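Full drives are only the suspected cause here, and no such check was part of the rollout. As an illustration, nearly full filesystems could be flagged before a kernel install with something like the following; the function name `full_mounts` and the default threshold of 95 percent are assumptions:

```shell
#!/bin/sh
# Hypothetical preflight: read `df -P` output on stdin and print the mount
# point of every filesystem at or above the given use% threshold.
# Mount points containing spaces are not handled.
full_mounts() {
    threshold=${1:-95}
    awk -v t="$threshold" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= t + 0) print $6
        }
    '
}
```

Usage would be `df -P | full_mounts 90`; any output suggests freeing space before upgrading.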

[x] wait over the weekend for performance indicators

Rollout to all guests

[x] Upgrade kernels on all remaining Trusty guests

$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && apt-get install -y linux-image-generic"
$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -si | grep Ubuntu && mv /boot/grub/menu.lst /boot/grub/menu.lst.old && update-grub -y"

[x] Upgrade kernels on all remaining Jessie guests to candidate pending reboot

$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -sd | grep jessie && apt-get -y install linux-meta && update-grub"

[x] Upgrade kernels on all remaining Stretch guests pending reboot

$ sudo cumin --force --timeout 120 -o json "A:all" "lsb_release -sd | grep stretch && apt-get -y install linux-image-amd64 && update-grub"

Labvirts: At this point all guests are pending a kernel upgrade post reboot along w/ the labvirts

Labvirts to update (All Trusty currently on 4.4.0-81-generic)
Make sure to grab both linux-image and linux-image-extra!!!!

Commands

apt-get install <kernel>
uname -r

Remaining Main deployment labvirt pool

[x] silence tools.checker

https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=checker.tools.wmflabs.org

Ensure mgmt interface is available before rebooting!
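Checking the mgmt interface starts with knowing its hostname. A sketch that derives it from the host's FQDN, assuming the naming pattern `<host>.mgmt.<site>.wmnet`; that pattern is an assumption not stated in these notes, and the function name `mgmt_host` is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: derive a management-interface hostname by inserting
# "mgmt" after the first label of the FQDN.
# ASSUMPTION: the <host>.mgmt.<site>.wmnet naming pattern.
mgmt_host() {
    fqdn=$1
    # ${fqdn%%.*} is the short hostname; ${fqdn#*.} is the rest of the domain.
    echo "${fqdn%%.*}.mgmt.${fqdn#*.}"
}
```

The resulting name can then be probed, for example by opening a session to it, before the reboot is issued.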

[x] labvirt1001.eqiad.wmnet

[x] labvirt1002.eqiad.wmnet

[x] labvirt1003.eqiad.wmnet

[x] labvirt1004.eqiad.wmnet

[x] labvirt1005.eqiad.wmnet

[x] labvirt1006.eqiad.wmnet

[x] labvirt1007.eqiad.wmnet

-- break to see if any unexpected effects --

[x] labvirt1008.eqiad.wmnet

[x] labvirt1009.eqiad.wmnet

[x] labvirt1010.eqiad.wmnet

[x] labvirt1011.eqiad.wmnet

[x] labvirt1012.eqiad.wmnet

[x] labvirt1013.eqiad.wmnet

[x] labvirt1014.eqiad.wmnet

[x] labvirt1015.eqiad.wmnet

-- dormant --

[x] labvirt1016.eqiad.wmnet <== this is in the spare pool so not a useful canary for profiling normal workloads

[x] labvirt1017.eqiad.wmnet

[x] labvirt1018.eqiad.wmnet <== this is in the spare pool so not a useful canary (though we used it for initial profiling with generated load)

[x] labvirt1019.eqiad.wmnet <== new DB pair with 20

[x] labvirt1020.eqiad.wmnet <== new DB pair with 19

[] labvirt1021.eqiad.wmnet <== racked, not live yet

[] labvirt1022.eqiad.wmnet <== racked, not live yet

Post: at this point all guests and labvirts are rebooted with new kernels

[x] amend kernel whitelist to only include the relevant 4.4.0-109 kernel (which will catch on 21 and 22 when ready)

https://gerrit.wikimedia.org/r/#/c/404588/

stare at performance charts biting finger nails

https://grafana.wikimedia.org/dashboard/db/labs-monitoring?refresh=5m&orgId=1

Checks for Arturo on Monday 15th

check several random VMs from Andrew's email: htop and so on, to see if they are good with performance with the new kernel
physical servers: labvirt1003.eqiad.wmnet and labvirt1017.eqiad.wmnet
check for Graphite trends on the physical servers
if something breaks: 1) put a message in WhatsApp, 2) try to fix it myself