Portal:Cloud VPS/Admin/Runbooks/RabbitmqNetworkPartition

From Wikitech
The procedures in this runbook require admin permissions to complete.

Error / Incident

This alert fires when the rabbitmq servers no longer have consensus with one another. This happens occasionally, for reasons that are not yet understood: the servers can still talk to their clients but not to each other. When it happens, OpenStack services see many RPC and other messaging timeouts.

A state of -1 means that the metric is not being collected. There may or may not be an actual network partition.

Debugging

This alert is based on the output of rabbitmqctl cluster_status. Here's what it looks like when everything is healthy:

andrew@cloudrabbit1003:~$ sudo rabbitmqctl cluster_status
Cluster status of node rabbit@cloudrabbit1003 ...
Basics

Cluster name: rabbit@cloudrabbit1003.wikimedia.org

Disk Nodes

rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003

Running Nodes

rabbit@cloudrabbit1001
rabbit@cloudrabbit1002
rabbit@cloudrabbit1003

Versions

rabbit@cloudrabbit1001: RabbitMQ 3.9.13 on Erlang 24.2.1
rabbit@cloudrabbit1002: RabbitMQ 3.9.13 on Erlang 24.2.1
rabbit@cloudrabbit1003: RabbitMQ 3.9.13 on Erlang 24.2.1

Maintenance status

Node: rabbit@cloudrabbit1001, status: not under maintenance
Node: rabbit@cloudrabbit1002, status: not under maintenance
Node: rabbit@cloudrabbit1003, status: not under maintenance

Alarms

(none)

Network Partitions

(none)

Listeners

Node: rabbit@cloudrabbit1001, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1001, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1001, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1001, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1001, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1002, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1002, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1002, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1002, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1002, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS
Node: rabbit@cloudrabbit1003, interface: [::], port: 25672, protocol: clustering, purpose: inter-node and CLI tool communication
Node: rabbit@cloudrabbit1003, interface: [::], port: 15672, protocol: http, purpose: HTTP API
Node: rabbit@cloudrabbit1003, interface: [::], port: 15692, protocol: http/prometheus, purpose: Prometheus exporter API over HTTP
Node: rabbit@cloudrabbit1003, interface: [::], port: 5672, protocol: amqp, purpose: AMQP 0-9-1 and AMQP 1.0
Node: rabbit@cloudrabbit1003, interface: [::], port: 5671, protocol: amqp/ssl, purpose: AMQP 0-9-1 and AMQP 1.0 over TLS

Feature flags

Flag: drop_unroutable_metric, state: enabled
Flag: empty_basic_get_metric, state: enabled
Flag: implicit_default_bindings, state: enabled
Flag: maintenance_mode_status, state: enabled
Flag: quorum_queue, state: enabled
Flag: stream_queue, state: enabled
Flag: user_limits, state: enabled
Flag: virtual_host_metadata, state: enabled

Note that 'Network Partitions' shows as '(none)'. In case of a partition, that section will list the partitioned servers. By running rabbitmqctl cluster_status on all three nodes, it should be obvious which node has fallen out of consensus.
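When comparing the three nodes, it can help to extract just the 'Network Partitions' section rather than reading the full output each time. The following is a minimal sketch, not part of the existing tooling: the function names partition_section and no_partitions are hypothetical, and it assumes the plain-text layout shown above, where the section header sits on its own line and is followed by the 'Listeners' section.

```shell
# Sketch only: helper names are hypothetical, and the parsing assumes the
# plain-text cluster_status layout shown in the healthy example above.

# Print the non-blank lines of the "Network Partitions" section from
# `rabbitmqctl cluster_status` output read on stdin.
partition_section() {
    awk '
        /^Network Partitions$/ { in_section = 1; next }
        /^Listeners$/          { in_section = 0 }
        in_section && NF       { print }
    '
}

# Exit 0 only when the section body is exactly "(none)".
no_partitions() {
    [ "$(partition_section)" = "(none)" ]
}

# Example use on one node:
#   sudo rabbitmqctl cluster_status | no_partitions && echo "no partition reported"
```

Run the same check on each of the three nodes. A node whose output differs from the other two is the likely candidate for having fallen out of consensus.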

Most often this can be resolved by restarting the rabbit application on the failing host:

andrew@cloudcontrol2001-dev:~$ sudo rabbitmqctl stop_app
Stopping rabbit application on node rabbit@cloudcontrol2001-dev ...
andrew@cloudcontrol2001-dev:~$ sudo rabbitmqctl start_app
Starting node rabbit@cloudcontrol2001-dev ...

If that does not resolve the issue, it might be necessary to reset the failing node, or to reset the entire cluster.
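As a rough illustration of what resetting a single node involves, the sketch below prints the usual rabbitmqctl sequence (stop_app, reset, join_cluster, start_app) without running anything, so it can be reviewed first. The helper name print_rejoin_steps is hypothetical, and the choice of healthy peer is an assumption to verify against the current cluster state. Note that rabbitmqctl reset wipes the local node's state, so treat this as a last resort and not as the established procedure for this cluster.

```shell
# Sketch only: prints the commands, it does not run them.
# print_rejoin_steps is a hypothetical helper name.
# $1: a healthy peer to rejoin through, e.g. rabbit@cloudrabbit1001
# WARNING: `rabbitmqctl reset` removes the node from the cluster and
# wipes its local state.
print_rejoin_steps() {
    peer="${1:-}"
    if [ -z "$peer" ]; then
        echo "usage: print_rejoin_steps rabbit@<healthy-peer>" >&2
        return 1
    fi
    printf '%s\n' \
        "sudo rabbitmqctl stop_app" \
        "sudo rabbitmqctl reset" \
        "sudo rabbitmqctl join_cluster $peer" \
        "sudo rabbitmqctl start_app"
}

# Example: review the steps for rejoining via cloudrabbit1001
#   print_rejoin_steps rabbit@cloudrabbit1001
```

After the node has rejoined, run rabbitmqctl cluster_status again on every node to confirm that 'Network Partitions' is back to '(none)'.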