- Alarms showing loss of path redundancy to storage
- Several hosts disconnect from the cluster
- Cannot access the hosts via the vSphere Client or SSH
- One or more datastores show as dead and cannot be accessed
- CPU on several hosts is at or near 100 percent
I saw these issues just after several hosts reported "loss of redundant path to storage" alarms.
The storage is managed by a separate team, so I had them check the fabric and the storage presented to the cluster. They didn't see any issues, except for an alarm around the same time as the first loss-of-redundant-path alarms. So what is the next step? Try a rescan of the storage. I did that, and the rescan ran for several minutes and timed out, and then that host disconnected from the cluster! Going back to the storage team, I had them check the LUN ID of the datastore that showed as dead; they said it showed online and they didn't see a problem. Finally, they removed and re-presented the LUN to the cluster's hosts.
I tried another rescan, and again it took forever and failed. So what's the next step, reboot a host? I had one host with only a single VM on it; I rebooted it, and the previously dead datastore was back. A few minutes later the hosts that had disconnected from the cluster reconnected and appeared fine?!
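For reference, when the vSphere Client rescan hangs like this, the same rescan can be kicked off from the ESXi shell, assuming you still have shell or SSH access (which I didn't on the worst-affected hosts). In an APD state these will most likely hang just like the client rescan did, so treat them as a diagnostic, not a fix:

# Rescan all storage adapters for new or changed devices
esxcli storage core adapter rescan --all

# Refresh/rescan the VMFS volumes on the devices that were found
vmkfstools -V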
I remember back in a 4.0 environment, when someone powered off an iSCSI array, the hosts disconnected from the cluster, so I assumed that having storage pulled out from under the hosts was still an issue in vSphere 5.0.
After doing some research and opening a case with VMware, I found that this can still be an issue.
The link below is to a KB article that explains the Permanent Device Loss (PDL) and All Paths Down (APD) conditions. One note from the KB:
“As the ESXi host is not able to determine if the device loss is permanent (PDL) or transient (APD), it indefinitely retries SCSI I/O, including:
- Userworld I/O (hostd management agent)
- Virtual machine guest I/O”
Click here for a link to the KB article.
The KB also notes that the only way to recover is to resolve the storage access issue and reboot the hosts. Nice…
It turns out there are some settings that can be added in 5.1 and in 5.0 Update 2 to keep this from happening; they are sketched below.
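Roughly (double-check the exact names and values against the KB for your build; this is my summary, not an official procedure), on a 5.0 Update 1/Update 2 host you add a line to /etc/vmware/settings telling the VMkernel to terminate VMs when a PDL is reported, and you add a vSphere HA advanced option on the cluster so HA will restart those terminated VMs elsewhere. In 5.1, as I understand it, the host-side piece becomes an advanced setting (VMkernel.Boot.terminateVMOnPDL) instead of the file edit.

# On each ESXi 5.0 U1/U2 host, add this line to /etc/vmware/settings (a host reboot is required)
disk.terminateVMOnPDLDefault = True

# On the cluster, add this vSphere HA advanced option so HA restarts the terminated VMs
das.maskCleanShutdownEnabled = True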
For more details, see Cormac Hogan's great info on the storage features in 5.1, starting here-
(Hope he doesn't mind me sharing the link.)
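One of the 5.1 features his series covers is the new APD handling, where the host gives up retrying non-VM I/O after a timeout instead of letting hostd hang indefinitely. As far as I can tell it is controlled by two host advanced settings, which can be viewed and set with esxcli; treat this as a sketch and confirm the defaults for your build:

# Check the current APD handling settings on an ESXi 5.1 host
esxcli system settings advanced list -o /Misc/APDHandlingEnable
esxcli system settings advanced list -o /Misc/APDTimeout

# Enable APD handling and set the timeout (140 seconds is the 5.1 default)
esxcli system settings advanced set -o /Misc/APDHandlingEnable -i 1
esxcli system settings advanced set -o /Misc/APDTimeout -i 140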
Another KB states that if Storage I/O Control is enabled, a host cannot unmount or remove the datastore.
In my case, SIOC was enabled on all of the datastores.
The KB details steps for stopping the SIOC service on a host to allow the datastore to be removed; a rough sketch follows the link below.
Access this KB here-
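For reference, the SIOC daemon on an ESXi 5.x host is storageRM (as far as I know), so the stop/start the KB describes looks roughly like this from the host's shell; follow the KB's exact procedure rather than my summary:

# Stop the Storage I/O Control daemon so the dead datastore can be unmounted or removed
/etc/init.d/storageRM stop

# ...unmount or remove the datastore here...

# Start the daemon again so SIOC keeps working on the remaining datastores
/etc/init.d/storageRM start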
In my case, I think rebooting the hosts was the only option to clear the I/O to the lost datastore. Of course, what caused the issue on the storage side is still a mystery.
I have since added the settings to each of the hosts and to the cluster; if there is another issue like this one, I am hoping they make a difference.
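If you do the same, it is worth confirming on each host that the settings actually took (paths and names per the assumptions above):

# Confirm the PDL line is present on a 5.0 U1/U2 host
cat /etc/vmware/settings

# The cluster-side option (das.maskCleanShutdownEnabled = True) shows up under the
# vSphere HA advanced options for the cluster in the vSphere Client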
If you have experienced this or a similar issue, please share your experiences.