Juniper router upgrade

Known issues

Still valid

Junos 21.4R2-Sx and later will not work with system services ssh root-login deny (the FPC won't come online after the upgrade)
- https://supportportal.juniper.net/s/article/Junos-21-4-or-later-Root-login-is-required-for-copying-the-FPC-image-from-the-Junos-VM-to-the-Linux-host-during-upgrade-of-VM-Host-based-platforms?language=en_US

Preparation

On the upgrade task, list the interesting new features, based on https://apps.juniper.net/feature-explorer/
Download the proper image to apt1001:/srv/junos/
- We now only use 64-bit vmhost images
- Pick the version based on the upgrade task and Juniper's recommended release

All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>

Make room for the image
- request system storage cleanup
- If multi-RE, cleanup files on backup RE: request system storage cleanup re1
Save rescue config (just in case)
- request system configuration rescue save
Copy image
- file copy "https://apt.wikimedia.org/junos/$filename.tgz" /var/tmp/ routing-instance mgmt_junos
- As a data point, this takes ~1h15 from eqiad to ulsfo
Check checksum
- file checksum md5 /var/tmp/$filename.tgz
- Compare with checksum on Juniper's website
Validate new image against existing config
- request vmhost software validate /var/tmp/$filename.tgz
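
The checksum comparison above can be done by eye, but a small helper avoids transcription mistakes. The following is a minimal Python sketch, not part of the cookbook. It assumes the output of file checksum md5 has the shape "MD5 (<path>) = <digest>"; the function names are hypothetical.

```python
import re

# Assumed output shape of `file checksum md5 <path>` on Junos:
#   MD5 (/var/tmp/image.tgz) = 0123456789abcdef0123456789abcdef
MD5_LINE = re.compile(r"MD5 \((?P<path>[^)]+)\) = (?P<digest>[0-9a-fA-F]{32})")


def parse_md5(cli_output: str) -> str:
    """Return the lowercase MD5 digest found in the CLI output."""
    match = MD5_LINE.search(cli_output)
    if match is None:
        raise ValueError("no MD5 line found in CLI output")
    return match.group("digest").lower()


def checksum_matches(cli_output: str, published: str) -> bool:
    """Compare the router-side digest with the published one.

    Case and surrounding whitespace are ignored, since the published
    value is usually copy-pasted from a web page.
    """
    return parse_md5(cli_output) == published.strip().lower()
```

Paste the router output and the digest from Juniper's download page into checksum_matches; anything other than True means the image should be copied again before validating it.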

Upgrade

Check that the console port(s) are working
Depool site (optional but recommended)
1. The primary core DC can't easily be depooled (it requires a DC switchover). If a router upgrade is needed in an emergency, either do a "depool-free" upgrade or get in touch with Service Ops/Traffic.
Drain traffic away from router
1. Set transport links to drained (increase OSPF metrics)
  - For each transport link terminating on the router being worked on, set its Netbox custom field state to drained, then run Homer on the router and on the remote-side router of the circuit
2. Apply GRACEFUL_SHUTDOWN, then wait ~15 min (or check that the device is no longer receiving traffic on the relevant links, such as those from the L3 switches or the other routers) - T320230
  - set protocols bgp graceful-shutdown sender
3. Disable the peers
  - set protocols bgp group Transit4 shutdown
  - set protocols bgp group Transit6 shutdown
  - set protocols bgp group IX4 shutdown
  - set protocols bgp group IX6 shutdown
Ensure router is not VRRP master (doesn't apply to codfw)
- show vrrp summary
- set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70
- set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
  - Note: if specific priorities are set on individual VRRP groups, the priority also needs to be reduced on those groups.
Downtime host in Icinga and Alertmanager
- sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
- This needs to match the Icinga "hosts"; cr3-ulsfo will match in Alertmanager as well.
- NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
Double check device has been fully drained of traffic before proceeding:
- Check no traffic to LVS at site: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs
- Check Cloudflare DDoS tunnels are disabled for site: sudo cookbook sre.network.cf status all
- Check LibreNMS graphs for router in question: https://librenms.wikimedia.org/devices/type=network
- Check that no other CR router prefers routes to private* subnets via the router being upgraded
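
Since the BGP part of the drain has to be undone during cleanup, it can help to generate both command lists from one place. The sketch below is a hypothetical Python helper, not an existing tool. It reuses the set commands listed above and assumes that the matching delete form is how each one is rolled back.

```python
# BGP groups named in the "Disable the peers" step above.
BGP_GROUPS = ["Transit4", "Transit6", "IX4", "IX6"]


def drain_commands(groups=BGP_GROUPS):
    """Configuration-mode commands for the BGP drain.

    The first command (graceful shutdown) is committed on its own and
    followed by the ~15 min wait before the peer shutdowns are applied.
    """
    commands = ["set protocols bgp graceful-shutdown sender"]
    commands += [f"set protocols bgp group {group} shutdown" for group in groups]
    return commands


def rollback_commands(drain):
    """Turn each `set ...` into the matching `delete ...`.

    The order is kept as-is, which matches the order of the rollback
    steps in the Cleanup section (graceful shutdown first, then peers).
    Assumption: deleting the statement is the intended rollback.
    """
    rollback = []
    for command in drain:
        if not command.startswith("set "):
            raise ValueError(f"not a set command: {command!r}")
        rollback.append("delete " + command[len("set "):])
    return rollback
```

Keeping the drain list around (for example pasted into the task) means the cleanup commands can be derived from exactly what was applied rather than retyped from memory.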

If Multi RE:

Remove graceful-switchover
- deactivate chassis redundancy graceful-switchover
- request system configuration rescue save (to ensure graceful-switchover is not in the rescue config)
Install image on backup RE
- request vmhost software add /var/tmp/$filename.tgz re1
Reboot RE1
- request vmhost reboot re1
Once back up (show chassis routing-engine), perform RE switchover (impactful)
- request chassis routing-engine master switch
Once done, repeat previous 3 steps for re0
Rollback "Remove graceful-switchover"
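
The ordering of the multi-RE steps is easy to get wrong, so here is the sequence above spelled out as a minimal Python sketch. The function name is hypothetical, and two things are assumptions rather than statements from this page: that the re0 pass uses the same commands with re0 as the target, and that the rollback of the first step is activate chassis redundancy graceful-switchover.

```python
def multi_re_steps(image_path, backup="re1", primary="re0"):
    """Ordered commands for a dual-RE upgrade, backup RE first.

    Note that the deactivate/activate lines are configuration-mode
    commands (they need a commit), while the request/show lines are
    operational-mode commands.
    """
    steps = [
        "deactivate chassis redundancy graceful-switchover",
        # Re-save so the rescue config no longer contains graceful-switchover.
        "request system configuration rescue save",
    ]
    for re_name in (backup, primary):
        steps += [
            f"request vmhost software add {image_path} {re_name}",
            f"request vmhost reboot {re_name}",
            # Wait here until the rebooted RE is back up.
            "show chassis routing-engine",
            # Impactful: mastership moves to the freshly upgraded RE.
            "request chassis routing-engine master switch",
        ]
    # Assumed inverse of the first step.
    steps.append("activate chassis redundancy graceful-switchover")
    return steps
```

Printing multi_re_steps("/var/tmp/$filename.tgz") gives a checklist that can be pasted into the task and ticked off during the maintenance.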

If single RE:

Install image on RE
- request vmhost software add /var/tmp/$filename.tgz
Reboot router
- request vmhost reboot

Both single and dual RE:

Check if router is healthy
- show log messages | last
- show system alarms
- show ospf(3) interface
- show bgp summary
- All green in Icinga and LibreNMS
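
If the health checks are being scripted, the alarm check is the simplest one to automate. This is a minimal Python sketch with a hypothetical function name. It assumes show system alarms prints "No alarms currently active" when the router is clean; the other checks still need a human to read them.

```python
def alarms_clear(show_system_alarms_output: str) -> bool:
    """Return True when the CLI output reports no active alarms.

    Assumption: a clean router prints "No alarms currently active".
    Any other output is treated as "needs a closer look".
    """
    return "No alarms currently active" in show_system_alarms_output
```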

Cleanup

Remove any leftover upgrade files
- request system storage cleanup
  - If multi-RE, cleanup files on backup RE: request system storage cleanup re1
Remove Icinga and LibreNMS downtimes
Rollback "Drain traffic away from router" steps
1. OSPF via Netbox
2. BGP Graceful shutdown
3. Disabled BGP peers
Rollback VRRP change if any
Save rescue config (just in case)
- request system configuration rescue save
On vmhost devices, save the disk snapshot to the backup partition
- request vmhost snapshot for single RE devices
- request vmhost snapshot routing-engine both for dual RE devices
Verify that traffic flows through the router (expect little if the site is depooled)
Repool site if depooled