I have responsibility for multiple large-ish Hyper-V failover clusters. I very occasionally see an issue where the 'ISALIVE:0' check will fail for a virtual machine (via the vmclusres.dll library) and the Resource Hosting Subsystem will terminate, bringing
down multiple machines with it.
There are multiple pieces of contradictory information on the internet about what is actually happening here. Some sources suggest that fail-over clustering will attempt to isolate the resource which first failed the health check into its own process (suggesting
this protects the other resources running under the same RHS).
This categorically does not happen on a fully patched Windows 2016 Hyper-V cluster. What appears to happen is that the failing RHS is terminated, killing any and all compute that is running under the same process. The logs make mention of the offending virtual
machine resource being isolated, but I can actually see no evidence of this occurring (in the properties of that resource), but even if this does happen, the default config has still resulted in one resource effectively causing an outage.
I can sort of add credence to this summary of behavior but forcing the resource to run in their own separate monitors myself. If I do this on a lab I've stood up:
Get-ClusterResource -Name "*Virtual Machine blah*"
foreach ($resource in $cluster_resources) {$resource.SeparateMonitor}
... I can see they are all using the default setting, which is to not run in a separate monitor. Fine.
If i set them all to run in their own monitor:
foreach ($resource in $cluster_resources) {$resource.SeparateMonitor = 1}
...and count the RHS processes there is no difference. As you would kinda expect, if I now restart the machines I suddenly have lots of RHS processes popping up, one for each VM.
So this suggests that resource cannot magically flip between RHS parent processes while they are running\switched on, so the out-of-the-box configuration can indeed bring down a whole node when there is a problem with a single resource. Could anybody anywhere
tell me if I am right here?
Also, trying to go back to the why behind the original problem. Does anyone know where I can get information about what the ISALIVE check for the vmclusres.dll library is actually doing?
There is no information anywhere about what kind of check failed, if it is a VM state check, some kind of IC communication check etc The VM didn't dump inside the guest, it just 'failed' and caused an outage, which is kinda scary. I know from some research
that the ISALIVE check is the five-minute check which is supposed to be the more in-depth check of the two that are run, but I can find no documentation saying what it is actually checking, therefore I have no way of working backward.
Thank you.