- Alarms showing loss of path redundancy to storage
- Several hosts disconnect from the cluster
- Cannot access the hosts via the vSphere Client or SSH
- One or more datastores show as dead and cannot be accessed
- CPU on several hosts is at or near 100 percent
I saw these issues just after several hosts reported "loss of redundant path to storage" alarms.
The storage is managed by a separate team, so I had them check the fabric and the storage presented to the cluster. They didn't see any issues, except for an alarm around the same time as the first loss-of-redundant-path alarms. So what is the next step? Try a rescan of the storage. I did that, and the rescan ran for several minutes and timed out, and then that host disconnected from the cluster! Going back to the storage team, I had them check the LUN ID of the datastore that showed as dead; they said it showed online and they didn't see a problem. Finally, they removed and re-presented the LUN to the cluster's hosts.
I tried another rescan, and again it took forever and failed. So what's the next step, reboot a host? I had one host with only a single VM on it; I rebooted it, and the previously dead datastore was back. A few minutes later the hosts that had disconnected from the cluster reconnected and appeared fine?!
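For reference, when the vSphere Client rescan hangs like this, the same rescan can be kicked off from the ESXi shell, assuming you still have shell or SSH access (which I didn't on the worst-affected hosts). In an APD state these will most likely hang just like the client rescan did, so treat them as a diagnostic, not a fix:

# Rescan all storage adapters for new or changed devices
esxcli storage core adapter rescan --all

# Refresh/rescan the VMFS volumes on the devices that were found
vmkfstools -V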
I remember back in a 4.0 environment, when someone powered off an iSCSI array, the hosts disconnected from the cluster, so I assumed that having storage pulled out from under the hosts was still an issue in vSphere 5.0.
After doing some research and opening a case with VMware, I found that this can still be an issue.
The link below is to a KB article that explains the Permanent Device Loss (PDL) and All Paths Down (APD) conditions. One note from the KB:
“As the ESXi host is not able to determine if the device loss is permanent (PDL) or transient (APD), it indefinitely retries SCSI I/O, including:
- Userworld I/O (hostd management agent)
- Virtual machine guest I/O”
Click here for a link to the KB article.
The KB also notes that the only way to recover is to resolve the storage access issue and reboot the hosts. Nice…
It turns out there are some settings that can be added in 5.1 and in 5.0 Update 2 to keep this from happening; they are sketched below.
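Roughly (double-check the exact names and values against the KB for your build; this is my summary, not an official procedure), on a 5.0 Update 1/Update 2 host you add a line to /etc/vmware/settings telling the VMkernel to terminate VMs when a PDL is reported, and you add a vSphere HA advanced option on the cluster so HA will restart those terminated VMs elsewhere. In 5.1, as I understand it, the host-side piece becomes an advanced setting (VMkernel.Boot.terminateVMOnPDL) instead of the file edit.

# On each ESXi 5.0 U1/U2 host, add this line to /etc/vmware/settings (a host reboot is required)
disk.terminateVMOnPDLDefault = True

# On the cluster, add this vSphere HA advanced option so HA restarts the terminated VMs
das.maskCleanShutdownEnabled = True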
For more details, see Cormac Hogan's great info on the storage features in 5.1, starting here-
(Hope he doesn't mind me sharing the link.)
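One of the 5.1 features his series covers is the new APD handling, where the host gives up retrying non-VM I/O after a timeout instead of letting hostd hang indefinitely. As far as I can tell it is controlled by two host advanced settings, which can be viewed and set with esxcli; treat this as a sketch and confirm the defaults for your build:

# Check the current APD handling settings on an ESXi 5.1 host
esxcli system settings advanced list -o /Misc/APDHandlingEnable
esxcli system settings advanced list -o /Misc/APDTimeout

# Enable APD handling and set the timeout (140 seconds is the 5.1 default)
esxcli system settings advanced set -o /Misc/APDHandlingEnable -i 1
esxcli system settings advanced set -o /Misc/APDTimeout -i 140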
Another KB states that if Storage I/O Control is enabled, a host cannot unmount or remove the datastore.
In my case, SIOC was enabled on all of the datastores.
The KB details steps for stopping the SIOC service on a host to allow the datastore to be removed; a rough sketch follows the link below.
Access this KB here-
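For reference, the SIOC daemon on an ESXi 5.x host is storageRM (as far as I know), so the stop/start the KB describes looks roughly like this from the host's shell; follow the KB's exact procedure rather than my summary:

# Stop the Storage I/O Control daemon so the dead datastore can be unmounted or removed
/etc/init.d/storageRM stop

# ...unmount or remove the datastore here...

# Start the daemon again so SIOC keeps working on the remaining datastores
/etc/init.d/storageRM start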
In my case, I think rebooting the hosts was the only option to clear the I/O to the lost datastore. Of course, what caused the issue on the storage side is still a mystery.
I have since added the settings to each of the hosts and to the cluster; if there is another issue like this one, I am hoping they make a difference.
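If you do the same, it is worth confirming on each host that the settings actually took (paths and names per the assumptions above):

# Confirm the PDL line is present on a 5.0 U1/U2 host
cat /etc/vmware/settings

# The cluster-side option (das.maskCleanShutdownEnabled = True) shows up under the
# vSphere HA advanced options for the cluster in the vSphere Client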
If you have experienced this or a similar issue, please share your experiences.