We did a clean rebuild of our Hyper-V environment just over two weeks ago. It consists of three HP DL360 G7 servers: two are clustered with Failover Cluster Manager, and the third is a stand-alone host for Dev/Staging servers. The clustered pair were running 2008 R2, and were wiped and given a fresh install of Server 2012. The stand-alone host was an in-place OS upgrade from 2008 R2 to 2012. While the servers were being rebuilt, we used the HP utility to update all the drivers, firmware, etc. They are up to date on Windows Updates as of May 16th.

We came in last Monday to discover that one of the clustered Hyper-V hosts had run into problems over the weekend. The VMs on that host were all marked "Host Not Responding" in VMM, the host was completely inaccessible via RDP, and the KVM console showed the server hard-locked. However, all of the VMs running on the host were up and functioning. Knowing that we had scheduled downtime this week, we chose to leave things as is.

Today, we came in to find that the other host in the cluster had done the exact same thing, and again all VMs running on that host are up and working fine. All production servers will be patched and rebooted this week for regular maintenance, but we'd like to keep this from happening again.

At the time the first host became unresponsive, the only error in the event log of the other host was the one below:
Log Name: Microsoft-Windows-FailoverClustering/Diagnostic
Source: Microsoft-Windows-FailoverClustering
Date: 5/27/2013 9:25:11 PM
Event ID: 2051
Task Category: None
Level: Error
Keywords:
User: SYSTEM
Computer: HOST.domain.com
Description: [RHS] s_RhsRpcCreateResType: ERROR_NOT_READY(21)' because of 'Startup routine for ResType MSMQTriggers returned 21.'
What is somewhat disturbing is that no VMs were migrated off when the host failed, and we had no clean way to migrate the VMs that were running on the failed host. VMM doesn't allow us to migrate them (the option is grayed out), and the remaining host (at the time) didn't see the cluster, so we couldn't access those VMs in Failover Cluster Manager either. Ping to either host succeeds, but nothing beyond that works. SCOM raised a heartbeat alert on the first host, but no other alerts.

Have there been any fixes or updates released for Server 2012 Hyper-V hosts that we should be sure to install when rebooting tomorrow night? Has anyone else experienced something similar? We can re-run the HP driver/firmware update utility to check for newer versions while rebooting, but I want to make sure we cover all the bases now so this doesn't happen again. Any suggestions are appreciated.