Portal:Cloud VPS/Admin/Runbooks/CephClusterInError

The procedures in this runbook require admin permissions to complete.

Error / Incident

Ceph is reporting health problems. Many different issues can cause this; some guidelines follow. There are three health levels for Ceph (a quick way to check the current level is shown after the list):

  • Healthy -> everything is OK
  • Warning -> something is wrong, but the cluster is up and running
  • Critical -> something is wrong, and it is affecting the cluster
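
These levels correspond to the HEALTH_OK, HEALTH_WARN and HEALTH_ERR statuses that ceph itself reports. As a quick check of the current level, run the following from any cluster member (the output shown here is just the healthy case, for illustration):

dcaro@cloudcephosd1019:~$ sudo ceph health
HEALTH_OK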

Debugging

Check cluster health details

SSH to any Ceph cluster member and run:

dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_WARN 1 daemons have recently crashed
[WRN] RECENT_CRASH: 1 daemons have recently crashed
   osd.6 crashed on host cloudcephosd1007 at 2021-07-14T13:27:32.881517Z
Daemon crash debugging

If the issue is a daemon crashing, you can see more information about the crash by running:

dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6    *

That gives you the ID of the crash; you can then get more details with:

dcaro@cloudcephosd1019:~$ sudo ceph crash info 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f
{
   "backtrace": [
       "(()+0x12730) [0x7f2c99ba3730]",
       "(std::_Rb_tree<boost::intrusive_ptr<AsyncConnection>, boost::intrusive_ptr<AsyncConnection>, std::_Identity<boost::intrusive_ptr<AsyncConnection> >, std::less<boost::intrusive_ptr<AsyncConnection> >, std::allocator<boost::intrusive_ptr<AsyncConnection> > >::find(boost::intrusive_ptr<AsyncConnection> const&) const+0x24) [0x5561dced7f64]",
...
       "(clone()+0x3f) [0x7f2c997464cf]"
   ],
   "ceph_version": "15.2.11",
...
}

That should give you some hints about what happened.
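
If the backtrace alone is not enough, the daemon's own logs on the affected host are usually the next place to look. A minimal sketch, assuming the OSD runs as the standard ceph-osd@<id> systemd unit (host and OSD id taken from the crash report above):

dcaro@cloudcephosd1007:~$ sudo systemctl status ceph-osd@6
dcaro@cloudcephosd1007:~$ sudo journalctl -u ceph-osd@6 --since "2021-07-14 13:00"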

Clearing the crash

If you have found and fixed the issue, or think it will not happen again, you can clear the crash report with:

dcaro@cloudcephosd1019:~$ sudo ceph crash archive 2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f

Or to archive all the new crashes:

dcaro@cloudcephosd1019:~$ sudo ceph crash archive-all

Note that the crash report is not removed, only no longer tagged as new (you can still see it with ceph crash <ls|info ID>).
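
For example, after archiving, the crash no longer shows up in the new list, but it is still present in the full list (output trimmed and shown only as an illustration):

dcaro@cloudcephosd1019:~$ sudo ceph crash ls-new
dcaro@cloudcephosd1019:~$ sudo ceph crash ls
ID                                                                ENTITY  NEW
2021-07-14T13:27:32.881517Z_17153103-7e31-4fd7-be93-cdbc285f0c5f  osd.6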

Damaged PG or Inconsistent PG

If the health issue looks like:

dcaro@cloudcephosd1019:~$ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
   pg 6.c0 is active+clean+inconsistent, acting [9,89,16]

You can try to recover it by forcing a repair:

dcaro@cloudcephosd1019:~$ sudo ceph pg repair 6.c0
instructing pg 6.c0 on osd.9 to repair
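
Before (or after) issuing the repair, you can inspect which objects the scrub flagged as inconsistent, and once the repair has run, check that the cluster is healthy again. A sketch, assuming the inconsistent PG is 6.c0 as in the example above:

dcaro@cloudcephosd1019:~$ sudo rados list-inconsistent-obj 6.c0 --format=json-pretty
dcaro@cloudcephosd1019:~$ sudo ceph health detail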

Slow operations

See Portal:Cloud VPS/Admin/Runbooks/CephSlowOps.

Support contacts

Usually anyone on the WMCS team should be able to help debug the issue; the subject matter experts (SMEs) are Andrew Bogott and David Caro.

Example tasks