Long Running Cluster Outage 2022 May 10 18:00:00 UTC


While new hosts were being added to the Sepia Long Running Cluster (LRC), the cluster entered a state in which all the MONs started locking up due to a lack of system resources. Josh, Neha, Dan, and David have been working to restore the cluster service by service.

The following workloads are down:

All services relying on the LRC have been restored. I will be upgrading all the daemons to a version of Ceph that has fixes for the problems we ran into.

The LRC is back up and OSD recovery is still in progress. We're letting things settle before bringing up any clients.