Hello!
I have set up a cluster with 2 nodes (Hyper-V 2012) and an iSCSI box (IBM Storwize V3700). The Storwize has 2 controllers (canisters). Each server (node) has 2 connections, one to each controller. They are connected directly (10Gb), with no switch in between.
Node1 has IP address 172.16.1.1, which is connected to 172.16.1.2
Node1 has IP address 172.16.2.1, which is connected to 172.16.2.2
Node2 has IP address 172.16.3.1, which is connected to 172.16.3.2
Node2 has IP address 172.16.4.1, which is connected to 172.16.4.2
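A quick way to sanity-check the addressing above is a small Python sketch. It assumes a /24 mask on every link (the subnet mask is not stated in this post, so `PREFIX` is a guess), and the helper names `subnet_of` and `check_links` are made up for illustration.

```python
import ipaddress

# Host-side and Storwize-side addresses as listed above.
links = [
    ("Node1", "172.16.1.1", "172.16.1.2"),
    ("Node1", "172.16.2.1", "172.16.2.2"),
    ("Node2", "172.16.3.1", "172.16.3.2"),
    ("Node2", "172.16.4.1", "172.16.4.2"),
]

PREFIX = 24  # assumption: the real subnet mask is not given in the post


def subnet_of(addr, prefix=PREFIX):
    """Return the network that addr belongs to for the assumed prefix."""
    return ipaddress.ip_interface(f"{addr}/{prefix}").network


def check_links(links):
    """Return (node, host_ip, target_ip, same_subnet) for every link."""
    return [(n, h, t, subnet_of(h) == subnet_of(t)) for n, h, t in links]


for node, host, target, ok in check_links(links):
    print(f"{node}: {host} -> {target}  same /{PREFIX} subnet: {ok}")

# With direct cabling, each link would normally sit in its own subnet.
distinct = {subnet_of(h) for _, h, _ in links}
print(f"distinct subnets: {len(distinct)} of {len(links)} links")
```

If the real mask is wider than /24 (for example /16), the four links would fall into one subnet and the result would look different.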
In my opinion the cluster is not important at the moment. Let's look at only one node.
MPIO devices were created with ease. But the problem is that I get no benefit from them.
First, there is traffic on only one path (the Storwize reports this controller as preferred). Problem: I get 10Gb of bandwidth instead of the 20Gb that two paths could provide.
Second, when I disconnect that path (or change the IP address on the Storwize), it takes a long time before the second path starts working, and then it is quite slow.
mpclaim -v shows me that the supported policies are "FOO RRWS LQD WP LB". I have tried all of them, though I think the best should be Least Queue Depth (LQD). The behaviour is the same every time. The V3700 is an "ALUA Implicit only" device. IBM does not provide a vendor-specific DSM for iSCSI (it does for Fibre Channel).
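For readers who do not know the abbreviations, here is a small Python sketch that expands the policy list reported above. The number-to-name table is what I recall from the mpclaim documentation for the Microsoft DSM, so treat it as an assumption and check it against mpclaim's own help output. The function name `expand` is made up for illustration.

```python
# mpclaim load-balance policy numbers, abbreviations and names
# (as recalled from the mpclaim documentation; verify before relying on it).
MPCLAIM_POLICIES = {
    1: ("FOO", "Fail Over Only"),
    2: ("RR", "Round Robin"),
    3: ("RRWS", "Round Robin with Subset"),
    4: ("LQD", "Least Queue Depth"),
    5: ("WP", "Weighted Paths"),
    6: ("LB", "Least Blocks"),
}


def expand(abbrevs):
    """Map a space-separated abbreviation list to (abbrev, number, name)."""
    by_abbrev = {a: (num, name) for num, (a, name) in MPCLAIM_POLICIES.items()}
    return [(a, *by_abbrev[a]) for a in abbrevs.split()]


reported = "FOO RRWS LQD WP LB"  # the list from mpclaim -v in this post
for abbrev, number, name in expand(reported):
    print(f"{abbrev:5} policy {number}: {name}")

# Policies in the table that the device did not report as supported.
missing = sorted(
    a for a, _ in MPCLAIM_POLICIES.values() if a not in reported.split()
)
print("not reported:", missing)
```

In this list plain Round Robin is the only policy from the table that is not reported for the device.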
I started with the default registry values. Then I added the values suggested by IBM:
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\0003\Parameters\LinkDownTime =120 (decimal) (Default was 15)
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\0003\Parameters\MaxRequestHoldTime =120 (decimal) (Default was 60)
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4D36E97B-E325-11CE-BFC1-08002BE10318}\0003\Parameters\MaxPendingRequests =2048 (decimal) (Default was 255)
HKLM\SYSTEM\CurrentControlSet\Services\Disk\TimeOutValue =60 (decimal) (Default was 60)
This did not improve anything. It even got worse, because LinkDownTime is the time in seconds before anything really happens.
I found that decreasing this value below 12 has no effect. But 12 seconds is a terribly long time!
I tried several registry settings with no luck. At the moment the settings are:
[HKLM\System\CurrentControlSet\Services\Disk]
"TimeOutValue"=2
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0003]
"DriverDesc"="Microsoft iSCSI Initiator"
"ProviderName"="Microsoft"
"DriverDateData"=hex:00,80,8c,a3,c5,94,c6,01
"DriverDate"="6-21-2006"
"DriverVersion"="6.2.9200.16451"
"InfPath"="iscsi.inf"
"InfSection"="iScsiPort_Install_Control"
"MatchingDeviceId"="root\\iscsiprt"
"EnumPropPages32"="iscsipp.dll,iSCSIPropPageProvider"
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0003\Parameters]
"TCPConnectTime"=dword:0000000f
"TCPDisconnectTime"=dword:0000000f
"WMIRequestTimeout"=dword:0000001e
"DelayBetweenReconnect"=dword:00000005
"MaxPendingRequests"=dword:00000800
"EnableNOPOut"=dword:00000000
"MaxTransferLength"=dword:00040000
"MaxBurstLength"=dword:00040000
"FirstBurstLength"=dword:00010000
"MaxRecvDataSegmentLength"=dword:00010000
"MaxConnectionRetries"=dword:ffffffff
"MaxRequestHoldTime"=dword:00000078
"LinkDownTime"=dword:0000000f
"IPSecConfigTimeout"=dword:0000003c
"InitialR2T"=dword:00000000
"ImmediateData"=dword:00000001
"ErrorRecoveryLevel"=dword:00000002
"PortalRetryCount"=dword:00000005
"NetworkReadyRetryCount"=dword:0000000a
"SrbTimeoutDelta"=dword:0000000f
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0003\PersistentTargets]
@=""
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0003\PersistentTargets\iqn.1986-03.com.ibm:2145.my-v3700.canister1#0xB7A2B94B1A35CE01]
"LoadBalancePolicy"=dword:00000004
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0003\PersistentTargets\iqn.1986-03.com.ibm:2145.my-v3700.canister1#0xB7A2B94B1A35CE01\LoginTarget]
"LoginTargetIN"=hex:<TEXT REMOVED by me for this post>
"LocalIPAddress"=hex:<TEXT REMOVED by me for this post>
"PathWeight"=dword:00000000
"PrimaryPath"=dword:00000001
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0003\PersistentTargets\iqn.1986-03.com.ibm:2145.my-v3700.canister2#0x3568D5531A35CE01]
"LoadBalancePolicy"=dword:00000004
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4d36e97b-e325-11ce-bfc1-08002be10318}\0003\PersistentTargets\iqn.1986-03.com.ibm:2145.my-v3700.canister2#0x3568D5531A35CE01\LoginTarget]
"LoginTargetIN"=hex:<TEXT REMOVED by me for this post>
"LocalIPAddress"=hex:<TEXT REMOVED by me for this post>
"PathWeight"=dword:00000000
"PrimaryPath"=dword:00000001
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\mpio\Parameters]
"GatherHealthStats"=dword:00000001
"FlushHealthInterval"=dword:000005a0
"PathVerifyEnabled"=dword:00000001
"PathVerificationPeriod"=dword:0000003c
"PDORemovePeriod"=dword:0000003c
"RetryCount"=dword:00000003
"RetryInterval"=dword:00000001
"UseCustomPathRecoveryInterval"=dword:00000000
"PathRecoveryInterval"=dword:00000028
"DiskPathCheckDisabled"=dword:00000000
"DiskPathCheckInterval"=dword:0000000a
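Since the export above shows the values in hexadecimal, here is a small Python sketch that decodes a few of the timer-related entries to decimal so they are easier to compare with the decimal values IBM suggested. The lines in `REG_LINES` are copied from the export above. The function name `decode_dwords` is made up for illustration, and the sketch only handles `dword:` entries.

```python
import re

# A few lines copied from the registry export in this post.
REG_LINES = '''
"LinkDownTime"=dword:0000000f
"MaxRequestHoldTime"=dword:00000078
"MaxPendingRequests"=dword:00000800
"SrbTimeoutDelta"=dword:0000000f
"PathVerificationPeriod"=dword:0000003c
"PDORemovePeriod"=dword:0000003c
"PathRecoveryInterval"=dword:00000028
"DiskPathCheckInterval"=dword:0000000a
'''

# Matches lines of the form "Name"=dword:xxxxxxxx (8 hex digits).
DWORD_RE = re.compile(r'^"(?P<name>[^"]+)"=dword:(?P<hex>[0-9a-fA-F]{8})$')


def decode_dwords(text):
    """Return {value name: decimal value} for every dword line in text."""
    values = {}
    for line in text.splitlines():
        m = DWORD_RE.match(line.strip())
        if m:
            values[m.group("name")] = int(m.group("hex"), 16)
    return values


decoded = decode_dwords(REG_LINES)
for name, value in decoded.items():
    print(f"{name} = {value}")
```

Decoded this way, LinkDownTime is currently 15, MaxRequestHoldTime is 120 and MaxPendingRequests is 2048.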
-------------------
My vendor says that on Linux everything works perfectly: both paths are in use and breaking one path is not noticeable.
I cannot verify this claim, so I decided to start by asking in the MS forums. Is there something else I should try?