Widespread Outage 2021 May 10 13:40:21 UTC


Looks like RHV/Gluster has done it again. Investigating.

Looking okay for now. A bunch of jobs died initially because some gibba nodes repeatedly failed to provision, so I marked those nodes down.

Unfortunately any jobs that were running this morning are lost.

All VMs are up. Some were paused and resumed, which means their clocks are wrong. I'm rebooting those.

Will restart the dispatchers soon.

Data is rebalancing. Estimated completion time is about an hour. I don't want to risk interrupting it. We should be back online today though.

Drives and bricks have been added to the ssdstorage Gluster volume. Data rebalance is in progress.

The RAID6 volumes where the Gluster bricks reside are 100% full. I'm having Labs Ops add a drive to each storage server now.

My plan is to add each drive as a brick to the VM storage volumes, rebalance the data, get RHV running again, delete some snapshots and VMs, then remove those bricks.
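For reference, the Gluster side of that plan can be sketched roughly as below. This is only an illustration, not the exact commands that were run: the helper name `temporary_brick_plan` and the brick paths are made up, and it assumes a volume layout where bricks can be added and removed one set at a time (a replicated volume needs bricks added in multiples of its replica count). It only builds the command lines and does not execute anything.

```python
def temporary_brick_plan(volume, bricks):
    """Return the gluster CLI calls for the plan, in order, as argv lists.

    `bricks` are "host:/path" strings. The values used below are
    hypothetical placeholders, not the real brick paths.
    """
    if not bricks:
        raise ValueError("at least one brick is required")
    base = ["gluster", "volume"]
    return [
        # 1. Add the temporary bricks to the full volume.
        base + ["add-brick", volume, *bricks],
        # 2. Spread existing data onto the new bricks and watch progress.
        base + ["rebalance", volume, "start"],
        base + ["rebalance", volume, "status"],
        # (RHV restart and snapshot/VM cleanup happen here, outside gluster.)
        # 3. Drain the temporary bricks, wait for completion, then commit.
        base + ["remove-brick", volume, *bricks, "start"],
        base + ["remove-brick", volume, *bricks, "status"],
        base + ["remove-brick", volume, *bricks, "commit"],
    ]


if __name__ == "__main__":
    # Hypothetical hosts and paths, for illustration only.
    example_bricks = ["server1:/bricks/tmp1", "server2:/bricks/tmp1"]
    for argv in temporary_brick_plan("ssdstorage", example_bricks):
        print(" ".join(argv))
```

The remove-brick step is split into start, status, and commit because the data has to migrate off the temporary bricks before they are dropped from the volume; committing before the status shows completion risks losing data.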