This issue has occurred on two different hosts in the last two weeks, and we haven't been able to determine what is causing it, so I am hoping to get some help.
Environment: a 5-node Server 2012 Hyper-V cluster, monitored with SCOM and managed with VMM (2012 SP1). In VMM we noticed that the host in question was showing as not responding. All the VMs were fine, but VMM reported that the WinRM service was not OK. The service is running on the host, but we can't RDP to the host, can't log in at the console, and can't connect to the cluster via Cluster Manager, because in both instances the host in question was the cluster owner. Our "solution" has been to shut down the VMs on the host and then forcefully reset it, after which everything is fine. However, I would like to get to the bottom of what is going on.
This is the exact same issue as described HERE, but there isn't a solution given. Here is a chronology of events that I gathered from the event logs.
From the logs you can tell that the issues started on 7/7/2013 at 5:38 PM.
From Application Event Log:
Error 26001, Microsoft.SystemCenter.VirtualMachineManager.2012.Report.VPortUsageCollection
Got null results from Select Connection from Msvm_SyntheticEthernetPortSettingData where InstanceId='Microsoft:3E323714-F9A3-4384-A2D7-3466B3FED595\\6D14D559-7676-4D9B-82A7-F5F1199601FA' . Different instance IDs.
This happened roughly every 30 minutes until 1:49 AM on 7/9. I rebooted the server around 11:45 PM that night.
From System Event Log:
Event 1, VDS Basic Provider
Unexpected failure. Error code: 48F@01000003
There are a lot of these errors. They still occur even when the WMI Performance Adapter messages are appearing normally.
7/7 5:26 PM
Event 7036, Service Control Manager
The WMI Performance Adapter service entered the stopped state.
7/7 6:06 PM
Event 7036, Service Control Manager
The WMI Performance Adapter service entered the running state.
7/9/2013 8:27 AM
Event 7000, Service Control Manager
The Device Setup Manager service failed to start due to the following error. The service did not respond to the start or control request in a timely fashion.
A lot of these errors
7/9/2013 8:35 AM
Event 7011, Service Control Manager
A timeout (30000 milliseconds) was reached while waiting for a transaction response from the SCVMMAgent service.
7/9/2013 8:46 AM
Event 7001, Service Control Manager
The System Center Virtual Machine Manager Agent service depends on the Windows Remote Management (WS-Management) service which failed to start because of the following error:
The service has not been started.
Once the WMI Performance Adapter service stopped cycling between the stopped and running states, warnings like these start appearing in the Operations Manager log:
7/7/2013 5:30 PM
Event 21402, Health Service Modules
- Module was unable to connect to namespace 'ROOT\MSCLUSTER'
- Module was unable to connect to namespace 'ROOT\CIMV2'
- Summary: 1 rule(s)/monitor(s) failed and got unloaded, 1 of them reached the failure limit that prevents automatic reload. Management group "MY_DOMAIN". This is summary only event, please see other events with descriptions of unloaded rule(s)/monitor(s).
- Forced to terminate the following process started at 5:27:36 PM because it ran past the configured timeout 180 seconds.
- Command executed: "C:\Windows\system32\cscript.exe" /nologo "ConsecutiveSamplesTwoThresholds.vbs" A_SERVER_NAME
Followed by errors like this:
7/7/2013 5:37 PM
Event 22402, Health Service Modules
Forced to terminate the following PowerShell script because it ran past the configured timeout 30 seconds.
Script Name: PowerShellScript
One or more workflows were affected by this.
Workflow name: Microsoft.Windows.HyperV.2012.VMReplicationHealth33412.Monitor
Instance name: A_SERVER_NAME
Instance ID: {EA9D5CBC-577D-C262-CBFC
Forced to terminate the following PowerShell script because it ran past the configured timeout 30 seconds.
Script Name: PowerShellScript
One or more workflows were affected by this.
Workflow name: Microsoft.Windows.HyperV.2012.VMReplicationHealth33414.Monitor
Instance name: A_SERVER_NAME
Instance ID: {EA9D5CBC-577D-C262-CBFC-4F5037B38E50}
Management group: MY_MANAGEMENT_GROUP
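Since the excerpts above come from three different logs and are not listed in strict time order, it may help to merge them into a single timeline. The following is only a rough sketch in Python: the timestamps and event IDs are copied from the excerpts above, the short summaries are my own paraphrases, and entries that were logged without a year ("7/7 5:26 PM") are assumed to be from 2013.

```python
from datetime import datetime

# (timestamp, event ID, source, summary) taken from the excerpts in this post.
# Assumption: entries written as "7/7" without a year are from 2013.
# The summaries are paraphrases, not the literal event text.
EVENTS = [
    ("7/9/2013 8:46 AM", 7001, "Service Control Manager", "SCVMM Agent could not start, WinRM dependency failed"),
    ("7/7/2013 5:26 PM", 7036, "Service Control Manager", "WMI Performance Adapter entered the stopped state"),
    ("7/7/2013 6:06 PM", 7036, "Service Control Manager", "WMI Performance Adapter entered the running state"),
    ("7/9/2013 8:27 AM", 7000, "Service Control Manager", "Device Setup Manager failed to start"),
    ("7/9/2013 8:35 AM", 7011, "Service Control Manager", "30000 ms timeout waiting for SCVMMAgent"),
    ("7/7/2013 5:30 PM", 21402, "Health Service Modules", "could not connect to WMI namespaces, script timed out"),
    ("7/7/2013 5:37 PM", 22402, "Health Service Modules", "PowerShell monitor scripts timed out after 30 s"),
]


def parse(ts):
    """Parse a US-style 12-hour timestamp such as '7/7/2013 5:26 PM'."""
    return datetime.strptime(ts, "%m/%d/%Y %I:%M %p")


def build_timeline(events):
    """Return the events sorted by their parsed timestamp."""
    return sorted(events, key=lambda e: parse(e[0]))


timeline = build_timeline(EVENTS)
first = parse(timeline[0][0])

# Print each event with its offset from the earliest one.
for ts, event_id, source, summary in timeline:
    offset = parse(ts) - first
    print(f"{ts:<18} +{offset}  {event_id:<6} {source}: {summary}")
```

Sorted this way, the first entry is the WMI Performance Adapter stop at 5:26 PM on 7/7, followed within minutes by the Health Service Modules warnings, while the Service Control Manager errors about the SCVMM Agent and WinRM only appear on 7/9.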
The WinRM service was running on the host (I was able to check it remotely using PowerShell). I cannot say for sure about the SCVMM Agent service, as I don't remember.
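A service that reports as running is not necessarily accepting connections, so the next time this happens it might be worth probing the relevant TCP ports from another machine before resetting the host. Below is a minimal sketch in Python. It assumes the host uses the default ports (5985 for WinRM over HTTP, 5986 for WinRM over HTTPS, 3389 for RDP), and the function names and A_SERVER_NAME are just placeholders. A successful connect only shows that something is listening, not that the service behind it is healthy.

```python
import socket

# Default ports; this assumes the host has not been reconfigured.
DEFAULT_PORTS = {"WinRM HTTP": 5985, "WinRM HTTPS": 5986, "RDP": 3389}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def probe(host, ports=DEFAULT_PORTS, timeout=3.0):
    """Return a dict mapping each port label to True/False for the host."""
    return {name: tcp_port_open(host, port, timeout) for name, port in ports.items()}
```

Usage would be something like `print(probe("A_SERVER_NAME"))` from a management workstation. If the WinRM port refuses or times out while the service still shows as running, that would at least narrow the problem down to the listener rather than the service state.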