Widespread Outage 2021 May 10 13:40:21 UTC

Looks like RHV/Gluster has done it again. Investigating.

Looking okay for now. A bunch of jobs died initially because some gibba nodes repeatedly failed to provision, but I've marked those nodes down.

Unfortunately any jobs that were running this morning are lost.

Posted 3 years ago by dgalloway

All VMs are up. Some were paused and resumed, which means their clocks are wrong. I'm rebooting those.

Will restart the dispatchers soon.

Posted 3 years ago by dgalloway

Data is rebalancing. Estimated completion time is about an hour. I don't want to risk interrupting it. We should be back online today though.

Posted 3 years ago by dgalloway

Drives and bricks have been added to the ssdstorage Gluster volume. Data rebalance is in progress.

Posted 3 years ago by dgalloway

The RAID6 volumes where the Gluster bricks reside are 100% full. I'm having Labs Ops add a drive to each storage server now.

My plan is to add each drive as a brick to the VM storage volumes, rebalance the data, get RHV running again, delete some snapshots and VMs, then remove those bricks.
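
For anyone following along, the plan above maps roughly onto the standard gluster CLI subcommands (add-brick, rebalance, remove-brick). The sketch below only builds and prints the command lines; it does not run anything. The host names and brick paths are hypothetical placeholders, the volume name is the ssdstorage volume mentioned in this log, and it assumes the number of temporary bricks is a multiple of the volume's replica count.

```python
# Sketch only: prints the gluster commands for a temporary expansion.
# Host names and brick paths are hypothetical placeholders.
VOLUME = "ssdstorage"
TEMP_BRICKS = [
    "store01.example.com:/bricks/temp/brick",
    "store02.example.com:/bricks/temp/brick",
    "store03.example.com:/bricks/temp/brick",
]


def expansion_commands(volume, bricks):
    """Commands to add the temporary bricks and spread data onto them."""
    return [
        ["gluster", "volume", "add-brick", volume, *bricks],
        ["gluster", "volume", "rebalance", volume, "start"],
        # Re-run this one until the rebalance reports completion.
        ["gluster", "volume", "rebalance", volume, "status"],
    ]


def shrink_commands(volume, bricks):
    """Commands to drain and remove the temporary bricks afterwards."""
    return [
        # "start" migrates data off the bricks being removed.
        ["gluster", "volume", "remove-brick", volume, *bricks, "start"],
        # Re-run until migration is complete, and only then commit.
        ["gluster", "volume", "remove-brick", volume, *bricks, "status"],
        ["gluster", "volume", "remove-brick", volume, *bricks, "commit"],
    ]


if __name__ == "__main__":
    steps = expansion_commands(VOLUME, TEMP_BRICKS) + shrink_commands(VOLUME, TEMP_BRICKS)
    for cmd in steps:
        print(" ".join(cmd))
```

Getting RHV running again and deleting snapshots and VMs happen between the two command groups, so that the remaining bricks have enough free space to take the data back before the temporary bricks are removed.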

Posted 3 years ago by dgalloway