Juniper router upgrade
Known issues
Still valid
- Junos 21.4R2-Sx and later will not work with system services ssh root-login deny (the FPC won't come online after the upgrade)
Preparation
- On the upgrade task, list the interesting new features, based on https://apps.juniper.net/feature-explorer/
- Download the proper image to apt1001:/srv/junos/
- We now only use 64-bit vmhost images
- Pick the image based on the upgrade task and Juniper's recommendation
All the steps below should be done with:
cumin1001:~$ sudo cookbook sre.network.prepare-upgrade <image-filename>.tgz <router-fqdn>
- Make room for the image
request system storage cleanup
- If multi-RE, clean up files on the backup RE:
request system storage cleanup re1
- Save rescue config (just in case)
request system configuration rescue save
- Copy image
file copy "https://apt.wikimedia.org/junos/$filename.tgz" /var/tmp/ routing-instance mgmt_junos
- As a data point, this takes ~1h15 from eqiad to ulsfo
- Check checksum
file checksum md5 /var/tmp/$filename.tgz
- Compare with the checksum on Juniper's website
- Validate new image against existing config
request vmhost software validate /var/tmp/$filename.tgz
Upgrade
- Check that the console port(s) are working
- Depool site (optional but recommended)
- The primary core DC can't easily be depooled (it requires a DC switchover). If a router upgrade is needed there in an emergency, we have to do a "depool-free" upgrade or get in touch with Service Ops/Traffic.
- Drain traffic away from router
- Set transport links to drained (increase OSPF metrics)
- For each transport link terminating on the router being worked on, set its Netbox custom field state to drained, then run Homer on the router and on the remote-side router of the circuit
- Apply GRACEFUL_SHUTDOWN, then wait ~15 min (or check that the device is no longer receiving traffic on the relevant links, such as those from the L3 switches or the other routers) (T320230)
set protocols bgp graceful-shutdown sender
- Disable the peers
set protocols bgp group Transit4 shutdown
set protocols bgp group Transit6 shutdown
set protocols bgp group IX4 shutdown
set protocols bgp group IX6 shutdown
- Ensure router is not VRRP master (doesn't apply to codfw)
show vrrp summary
set groups vrrp interfaces <*> unit <*> family inet address <*> vrrp-group <*> priority 70
set groups vrrp interfaces <*> unit <*> family inet6 address <*> vrrp-inet6-group <*> priority 70
- Note: if specific priorities are set on individual VRRP groups, the priority must also be reduced on those groups.
- Downtime host in Icinga and Alert-manager
sudo cookbook sre.hosts.downtime -r 'router upgrade' -t XXX -H 2 --force 'cr3-ulsfo,cr3-ulsfo IPv6,cr3-ulsfo.mgmt'
- This needs to match the Icinga "hosts"; cr3-ulsfo will match in AlertManager as well.
- NOTE: For devices with multiple REs you will probably find the mgmt hosts in Icinga named like 're0.cr3-esams.mgmt'
- Double check device has been fully drained of traffic before proceeding:
- Check no traffic to LVS at site: https://grafana-rw.wikimedia.org/d/000000343/load-balancers-lvs
- Check Cloudflare DDoS tunnels are disabled for site:
sudo cookbook sre.network.cf status all
- Check LibreNMS graphs for router in question: https://librenms.wikimedia.org/devices/type=network
- Check that neither CR router sees preferred routes to private* subnets via the one to be upgraded
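The BGP part of the drain steps above, and its later rollback, can be generated rather than typed by hand. This is a minimal sketch only: the group names are the four listed in this runbook, the function names are hypothetical, and the rollback form (turning each `set` into a `delete`, in reverse order) is an assumption about how the change is undone, not something this page specifies.

```python
# External BGP groups named in this runbook.
EXTERNAL_BGP_GROUPS = ("Transit4", "Transit6", "IX4", "IX6")


def bgp_drain_commands(groups=EXTERNAL_BGP_GROUPS):
    """Configuration-mode commands: graceful shutdown first,
    then shut down each external peer group."""
    commands = ["set protocols bgp graceful-shutdown sender"]
    commands += [f"set protocols bgp group {group} shutdown" for group in groups]
    return commands


def bgp_undrain_commands(groups=EXTERNAL_BGP_GROUPS):
    """Assumed rollback: delete the same statements in reverse order."""
    drained = bgp_drain_commands(groups)
    return [cmd.replace("set ", "delete ", 1) for cmd in reversed(drained)]


if __name__ == "__main__":
    for line in bgp_drain_commands():
        print(line)
```

The sketch emits everything at once, but the runbook's wait of ~15 min still applies between applying the graceful shutdown and disabling the peers. Both sets of commands need a commit on the router to take effect.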
If Multi RE:
- Remove graceful-switchover
deactivate chassis redundancy graceful-switchover
request system configuration rescue save
- Saving the rescue config here ensures graceful-switchover is not in it
- Install image on backup RE
request vmhost software add /var/tmp/$filename.tgz re1
- Reboot RE1
request vmhost reboot re1
- Once back up (show chassis routing-engine), perform RE switchover (impactful)
request chassis routing-engine master switch
- Once done, repeat previous 3 steps for re0
- Roll back the "Remove graceful-switchover" step
If single RE:
- Install image on RE
request vmhost software add /var/tmp/$filename.tgz
- Reboot router
request vmhost reboot
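The single-RE and multi-RE sequences above can be written out as one ordered checklist. This is a rough sketch, not an existing cookbook: `upgrade_steps` is a hypothetical helper, it assumes re0 is the master and re1 the backup at the start (as the steps above do), and the final `activate` statement is an assumed way of rolling back the graceful-switchover removal.

```python
def upgrade_steps(image_filename, multi_re):
    """Return (mode, command) pairs in the order the runbook lists them.

    mode is "configuration" (needs a commit) or "operational".
    """
    image = f"/var/tmp/{image_filename}"
    if not multi_re:
        return [
            ("operational", f"request vmhost software add {image}"),
            ("operational", "request vmhost reboot"),
        ]

    steps = [
        ("configuration", "deactivate chassis redundancy graceful-switchover"),
        ("operational", "request system configuration rescue save"),
    ]
    # Backup RE first, then the former master after the switchover.
    for re_name in ("re1", "re0"):
        steps += [
            ("operational", f"request vmhost software add {image} {re_name}"),
            ("operational", f"request vmhost reboot {re_name}"),
            # Wait until the rebooted RE is back before switching.
            ("operational", "show chassis routing-engine"),
            ("operational", "request chassis routing-engine master switch"),
        ]
    # Assumption: rollback re-activates the deactivated statement.
    steps.append(("configuration", "activate chassis redundancy graceful-switchover"))
    return steps


if __name__ == "__main__":
    for mode, command in upgrade_steps("example-image.tgz", multi_re=True):
        print(f"[{mode}] {command}")
```

The output is meant to be read by the operator as a checklist. Each switchover is impactful and should only follow a successful `show chassis routing-engine` check.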
Both single and dual RE:
- Check if router is healthy
show log messages | last
show system alarms
show ospf(3) interface
show bgp summary
- All green in Icinga and LibreNMS
Cleanup
- Remove any leftover upgrade files
request system storage cleanup
- If multi-RE, clean up files on the backup RE:
request system storage cleanup re1
- Remove Icinga and LibreNMS downtimes
- Rollback "Drain traffic away from router" steps
- OSPF via Netbox
- BGP Graceful shutdown
- Disabled BGP peers
- Rollback VRRP change if any
- Save rescue config (just in case)
request system configuration rescue save
- On vmhost devices, save the disk snapshot to the backup partition
- For single RE devices:
request vmhost snapshot
- For dual RE devices:
request vmhost snapshot routing-engine both
- Verify that traffic flows through the router (expect little if the site is depooled)
- Repool site if depooled