Long Running Cluster Outage 2022 May 10 18:00:00 UTC

While new hosts were being added to the Sepia Long Running Cluster (LRC), the cluster entered a state in which all the MONs started locking up due to a lack of system resources. Josh, Neha, Dan, and David have been working to restore the cluster service by service.

The following workloads are down:

teuthology runs
Ceph CI builds (Jenkins/shaman)
quay.ceph.io
telemetry.ceph.com / telemetry-public.ceph.com
chacra.ceph.com

All services relying on the LRC have been restored. I will be upgrading all the daemons to a version of Ceph that has fixes for the problems we ran into.

Posted 2 years ago by dgalloway

The LRC is back up, and OSD recovery is still in progress. We're letting things settle before bringing up any clients.

Posted 2 years ago by dgalloway