
Portal:Cloud VPS/Admin/Network/Tests

From Wikitech

This page explains the network checklist/testing functions that we have in place to verify normal network operations for Cloud VPS / Toolforge.

The checklist is meant to test the network exactly as a user would use it:

  • in different directions (i.e., internet --> cloud and cloud --> internet)
  • verify correct operation of routing_source_ip NAT, dmz_cidr, floating IP NAT, etc
  • verify interaction with NFS servers and other special services
  • verify some other basic network functions like DNS, LDAP, etc.

You may wonder: how is this different from Icinga or other monitoring methods? The answer is: it isn't. We could probably migrate all of this to Icinga or Prometheus with some kung fu, but it was developed quickly to fill a tooling gap, so here we are.

Components

  • cmd-checklist-runner.py: a simple Python script that reads a YAML file with a set of test definitions, runs them and reports the results.
  • /etc/networktests/networktests.yaml: yaml file containing test case definitions.
  • a systemd timer job that runs the checklist periodically (15 minutes?). Icinga can monitor systemd services and page if they fail. We can activate this if necessary.
  • In Puppet, we have the openstack::monitor::networktests class, which should be declared for cloudcontrol nodes. This class deploys all of the above.
  • We have a cookbook to help run the test suite manually.
  • Tests make use of the srv-networktests user both locally (cloudcontrol) and in virtual machines (LDAP).

Adding new tests

To add new tests:

  • include desired envvars in profile::openstack::XYZ::networktests::envvars
  • include desired checks in modules/openstack/templates/monitor/networktests.yaml.erb
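
Before adding a check, it can help to sanity-check its shape. The sketch below is not part of the runner: `validate_testcase` and `ALLOWED_TEST_KEYS` are hypothetical names, and the set of keys is inferred only from the example in the Checklist section of this page, so the real runner may accept more. It works on an already-parsed test case (a Python dict), not on the YAML file itself.

```python
# Keys seen in the Checklist example on this page (an assumption, not a spec).
ALLOWED_TEST_KEYS = {"cmd", "stdout", "stderr", "retcode"}


def validate_testcase(testcase):
    """Return a list of problems found in one parsed test case (empty if none)."""
    problems = []
    if not isinstance(testcase.get("name"), str) or not testcase["name"]:
        problems.append("missing or empty 'name'")
    tests = testcase.get("tests")
    if not isinstance(tests, list) or not tests:
        problems.append("'tests' must be a non-empty list")
        return problems
    for index, test in enumerate(tests):
        if not isinstance(test, dict) or not isinstance(test.get("cmd"), str):
            problems.append(f"tests[{index}]: missing 'cmd' string")
            continue
        unknown = set(test) - ALLOWED_TEST_KEYS
        if unknown:
            problems.append(f"tests[{index}]: unknown keys {sorted(unknown)}")
        if "retcode" in test and not isinstance(test["retcode"], int):
            problems.append(f"tests[{index}]: 'retcode' must be an integer")
    return problems


if __name__ == "__main__":
    candidate = {
        "name": "example check",
        "tests": [{"cmd": "true", "stdout": "", "retcode": 0, "stderr": ""}],
    }
    print(validate_testcase(candidate))
```

An empty list means the entry has the same shape as the existing ones; it says nothing about whether the command itself is correct.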

Checklist

The checklist is a list of shell commands to run. For each command, the runner can optionally verify stdout, stderr and the return code to decide whether the test passed.

Example:

---
- envvars:
  - SSH: /usr/bin/ssh -i /etc/networktests/sshkeyfile [..] -o Proxycommand="ssh -o StrictHostKeyChecking=no -i /etc/networktests/sshkeyfile -W %h:%p srv-networktests@eqiad1.bastion.wmcloud.org"
    CLOUDGW_A_IP: 185.15.56.245
    CLOUDGW_B_IP: 185.15.56.246
    TOOLFORGE_BASTION_LOGIN: login.toolforge.org
    TOOLFORGE_BASTION_DEV: dev.toolforge.org
---
- name: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
  tests:
    - cmd: timeout -k5s 10s ping -c1 $CLOUDGW_A_IP >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
    - cmd: timeout -k5s 10s ping -c1 $CLOUDGW_B_IP >/dev/null
      stdout: ""
      retcode: 0
      stderr: ""
      
- name: VM (using floating IP) can connect to wikireplicas from Toolforge
  tests:
    - cmd: $SSH $TOOLFORGE_BASTION_LOGIN 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""
    - cmd: $SSH $TOOLFORGE_BASTION_DEV 'sudo -iu tools.arturo-test-tool sql enwiki "select * from page limit 2;" | grep page_id | wc -l'
      stdout: "1"
      retcode: 0
      stderr: ""
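
How a single entry like the ones above might be evaluated can be sketched as follows. This is a minimal illustration, not the actual cmd-checklist-runner code: `run_check` is a hypothetical name, and stripping surrounding whitespace before comparing is an assumption made so that output such as `1` followed by a newline matches the expected `"1"`.

```python
import os
import subprocess


def run_check(test, envvars):
    """Run one test entry through the shell and compare the results.

    Only the keys present in the entry (stdout, stderr, retcode) are
    verified, mirroring the optional verification described above.
    Returns (passed, mismatches), where mismatches maps each failing
    key to an (expected, actual) tuple.
    """
    # Expose the envvars to the shell so that $NAME references expand.
    env = {**os.environ, **envvars}
    result = subprocess.run(
        test["cmd"],
        shell=True,
        env=env,
        capture_output=True,
        text=True,
    )
    actual = {
        # Assumption: whitespace is stripped before comparing.
        "stdout": result.stdout.strip(),
        "stderr": result.stderr.strip(),
        "retcode": result.returncode,
    }
    mismatches = {
        key: (test[key], actual[key])
        for key in actual
        if key in test and test[key] != actual[key]
    }
    return not mismatches, mismatches


if __name__ == "__main__":
    envvars = {"GREETING": "hello"}  # stand-in for the real envvars
    test = {"cmd": "echo $GREETING", "stdout": "hello", "retcode": 0, "stderr": ""}
    passed, mismatches = run_check(test, envvars)
    print("passed" if passed else f"failed: {mismatches}")
```

Running the command through the shell is what lets entries use redirections, pipes and variables such as `$SSH`, at the cost of requiring that the checklist file is trusted.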

Example execution:

root@cloudcontrol1004:~# cmd-checklist-runner --config /etc/networktests/networktests.yaml 
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, no floating IP
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, with floating IP
[cmd-checklist-runner] INFO: running test: VM (no floating IP) contacting the internet gets NAT'd using routing_source_ip
[cmd-checklist-runner] INFO: running test: VM (no floating IP) contacting an address covered by dmz_cidr doesn't get NAT'd
[cmd-checklist-runner] INFO: running test: VM (using floating IP) isn't affected by either routing_source_ip or dmz_cidr
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact auth DNS server
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact recursor DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact auth DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact recursor DNS server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact LDAP server
[cmd-checklist-runner] INFO: running test: VM (not using floating IP) can contact LDAP server
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can contact openstack API
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can contact openstack API
[cmd-checklist-runner] INFO: running test: puppetmasters can sync git tree
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
[cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
[cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 22
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 22

Cookbook

There is a handy cookbook to help leverage this testing suite for other purposes:

arturo@endurance:~ $ cookbook wmcs.openstack.network.tests --cluster-name codfw1dev
START - Cookbook wmcs.openstack.network.tests
----- OUTPUT of 'sudo -i cmd-chec...etworktests.yaml' -----
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron VIRT gateway from within the cloud virtual network, no floating IP
[..]
[cmd-checklist-runner] INFO: running test: puppetmasters can sync git tree
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 19
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 19
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cmd-chec...etworktests.yaml'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
NetworkTestRunner: 19/19 passed tests.
END (PASS) - Cookbook wmcs.openstack.network.tests (exit_code=0)
arturo@endurance:~ $ cookbook wmcs.openstack.network.tests -d eqiad1
START - Cookbook wmcs.openstack.network.tests
----- OUTPUT of 'sudo -i cmd-chec...etworktests.yaml' -----
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (raw addresses) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to cloudgw addresses (DNS names) from outside the cloud network
[cmd-checklist-runner] INFO: running test: basic ping to neutron WAN from outside the cloud network
[..]
[cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
[cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
[cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
[cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
[cmd-checklist-runner] INFO: ---
[cmd-checklist-runner] INFO: --- passed tests: 22
[cmd-checklist-runner] INFO: --- failed tests: 0
[cmd-checklist-runner] INFO: --- total tests: 22
================
100.0% (1/1) success ratio (>= 100.0% threshold) for command: 'sudo -i cmd-chec...etworktests.yaml'.
100.0% (1/1) success ratio (>= 100.0% threshold) of nodes successfully executed all commands.
NetworkTestRunner: 22/22 passed tests.
END (PASS) - Cookbook wmcs.openstack.network.tests (exit_code=0)

In the future we plan to develop other cookbooks that use the test suite results to decide on operations, for example:

  • only perform an operation if the network test suite passes
  • roll back a kernel upgrade if the network test suite doesn't pass
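
As a rough illustration of such gating, the summary lines printed by the runner could be parsed to decide whether to proceed. This is only a sketch: `parse_summary` and `network_is_healthy` are hypothetical helpers, not part of any existing cookbook, and they rely solely on the `passed tests` / `failed tests` / `total tests` line format shown in the output above.

```python
import re

# Matches the summary lines, e.g. "... INFO: --- passed tests: 22".
SUMMARY_RE = re.compile(r"--- (passed|failed|total) tests: (\d+)")


def parse_summary(output):
    """Extract the passed/failed/total counters from runner output."""
    counts = {}
    for line in output.splitlines():
        match = SUMMARY_RE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


def network_is_healthy(output):
    """Return True only if a complete summary reports zero failures."""
    counts = parse_summary(output)
    if not all(key in counts for key in ("passed", "failed", "total")):
        # No complete summary found: treat as unhealthy rather than guess.
        return False
    return counts["failed"] == 0 and counts["passed"] == counts["total"]


if __name__ == "__main__":
    sample = "\n".join(
        [
            "[cmd-checklist-runner] INFO: ---",
            "[cmd-checklist-runner] INFO: --- passed tests: 22",
            "[cmd-checklist-runner] INFO: --- failed tests: 0",
            "[cmd-checklist-runner] INFO: --- total tests: 22",
        ]
    )
    print(network_is_healthy(sample))  # prints True
```

Treating a missing summary as a failure is a deliberate choice here: an operation gated on network health should not proceed when the test run itself did not complete.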

systemd timer execution

To see the history of how the network tests have been performing, check the logs of the cloud-vps-networktest systemd service:

root@cloudcontrol1004:~# journalctl -u cloud-vps-networktest.service -f
-- Logs begin at Thu 2021-11-11 10:41:53 UTC. --
Nov 11 16:15:26 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (using floating IP) can read dumps NFS
Nov 11 16:15:29 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (no floating IP) can read dumps NFS
Nov 11 16:15:31 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: VM (using floating IP) can connect to wikireplicas from Toolforge
Nov 11 16:15:40 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: Toolforge webservice can be accessed from the internet
Nov 11 16:15:40 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: running test: Toolforge bastions see herald file on project NFS
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: ---
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- passed tests: 22
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- failed tests: 0
Nov 11 16:15:44 cloudcontrol1004 cmd-checklist-runner[42756]: [cmd-checklist-runner] INFO: --- total tests: 22
Nov 11 16:15:44 cloudcontrol1004 systemd[1]: cloud-vps-networktest.service: Succeeded.

As of this writing, per the Puppet code, this runs on only one of the cloudcontrol nodes, usually the second one (to avoid the already overloaded first one).

TODO: as of this writing, Icinga won't alert if this service fails, because we disabled the check on cloudcontrol boxes (it was too noisy).

See also