Mellanox WinOF VPI User Manual

Contents

1. [Figure: Device Manager window listing system devices, with the Details tab of a Mellanox ConnectX-3 Network Adapter showing the "Driver key" property.]

   3.4 Port Configuration
   After WinOF OFED VPI installation, it is possible to modify the network protocol that runs on each port of VPI adapter cards. Each port can be set to run as InfiniBand, Ethernet, or Auto Sensing.

   3.4.1 Auto Sensing
   Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the cable connected to the port and load the appropriate driver stack (InfiniBand or Eth
2. Flag                          Description
   -h, --help                    Print the help menu.
   -d, --debug                   Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
   -a, --all                     Show all LIDs in range, including invalid entries.
   -v, --verbose                 Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
   -V, --version                 Show version info.

   Mellanox Technologies, Rev 4.60

   Table 14: ibroute Flags and Options
   Flag                          Description
   -n, --no_dests                Do not try to resolve destinations.
   -D, --Direct                  Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...).
   -G, --Guid                    Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
   -M, --Multicast               Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
   -L, --Lid                     Use LID address argument.
   -u, --usage                   Usage message.
   -e, --errors                  Show send and receive errors (timeouts and others).
   -s, --sm_port <smlid>         Use <smlid> as the target LID for SM/SA queries.
   -C, --Ca <ca_name>            Use the specified channel adapter or router.
   -P, --Port <ca_port>          Use the specified port.
   -t, --timeout <timeout_ms>    Override the default timeout for the solicited MADs (msec).
   <dest dr_path | lid | guid>   Destination's dire
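The Direct flag takes a directed path written as a comma-separated list of out ports. The following is a minimal Python sketch of how such a string could be split and read back; the function names are hypothetical and are not part of ibroute, and the rule that a path starts at port 0 is an assumption drawn only from the two examples in the table.

```python
def parse_directed_path(text):
    """Split a directed path such as "0,1,2,1,4" into a list of port numbers."""
    ports = []
    for item in text.split(","):
        item = item.strip()
        if not item.isdigit():
            raise ValueError("not a port number: %r" % item)
        ports.append(int(item))
    # Assumption: both examples in the table start at the local port 0.
    if ports[0] != 0:
        raise ValueError("a directed path is expected to start with 0")
    return ports


def describe_directed_path(ports):
    """Render the parsed path in the wording used by the flag description."""
    if ports == [0]:
        return "self port"
    return "out via port " + ", then ".join(str(p) for p in ports[1:])
```

For the table's second example, describe_directed_path(parse_directed_path("0,1,2,1,4")) spells out every hop after the local port.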
3. New-Service -Name "OpenSM" -BinaryPathName "`"C:\Program Files\Mellanox\MLNX_VPI\IB\Tools\opensm.exe`" --service -L 128" -DisplayName "OpenSM" -Description "OpenSM for IB subnet" -StartupType Automatic

   To start OpenSM as a service, run:
   Start-Service OpenSM

   Notes:
   • For long-term running, avoid using the -v (verbosity) option, so as not to exceed the disk quota.
   • Running OpenSM on multiple servers may lead to incorrect OpenSM behavior. Please do not run more than two instances of OpenSM in the subnet.

   8 InfiniBand Fabric
   8.1 Network Direct Interface
   The Network Direct Interface (NDI) architecture provides application developers with a networking interface that enables zero-copy data transfers between applications, kernel-bypass I/O generation and completion processing, and one-sided data transfer operations.
   NDI is supported by Microsoft and is the recommended method for writing InfiniBand applications. NDI exposes the advanced capabilities of the Mellanox networking devices and allows applications to leverage the advances of InfiniBand.
   For further information, please refer to: http://msdn.microsoft.com/en-us/library/cc904397(v=vs.85).aspx

   8.2 part_man - Virtual IPoIB Port Creation Utility
   part_man is used to add/remove virtual IPoIB ports. Currently, each Mellanox IPoIB port can have a single virtual IPoIB only, which is created with a default PKey value of 0xff
4. 9 Software Development Kit
   The Software Development Kit (SDK) is a set of development tools that allows the creation of InfiniBand applications for the MLNX_VPI software package. The SDK package contains header files, libraries, and code examples. To compile the examples provided with the SDK, you must install Windows Driver Kit (WDK) version 8.1 or higher.
   To open the SDK package, run the sdk.exe file and get the complete list of files. The SDK package can be found under <installation_directory>\IB\SDK.
   It is highly recommended to program the applications over the ND API and not over the IBAL API.

   10 Troubleshooting
   10.1 InfiniBand Troubleshooting
   Issue 1: The InfiniBand interfaces are not up after the first reboot after the installation process is completed.
   Suggestion: To troubleshoot this issue, follow the steps below:
   1. Check that the InfiniBand driver is running on all nodes by using vstat. The vstat utility, located at <installation_directory>\tools, displays the status and capabilities of the network adapter card(s).
   2. On the command line, enter vstat (use -h for options) to retrieve information about one or more adapter ports. The field port_state will be equal to:
      PORT_DOWN when there is no InfiniBand cable ("no link")
      PORT_INITIALIZED when the port is connected to some other port ("physical link")
      PORT_ACTIVE when
5. SA Query Timeout: Sets the waiting timeout (in milliseconds) of an SA query completion. The valid values are 500-60000 (default: 1000 ms).

   6.4 Adapter Proprietary Performance Counters
   Proprietary Performance Counters are used to provide information on the performance of the operating system, applications, services, or drivers. Counters can be used for different system debugging purposes, help to determine system bottlenecks, and fine-tune system and application performance. The operating system, network, and devices provide counter data that the application can consume to provide users with a graphical view of the system's performance quality.
   WinOF counters hold the standard Windows CounterSet API that includes:
   • Network Interface
   • RDMA activity
   • SMB Direct Connection

   6.4.1 Supported Standard Performance Counters
   6.4.1.1 Proprietary Mellanox Adapter Traffic Counters
   The Proprietary Mellanox adapter traffic counter set consists of global traffic statistics, which gather information from ConnectX-3 and ConnectX-3 Pro network adapters, and includes traffic statistics and various types of errors and indications from both the Physical Function and Virtual Function.

   Table 8: Mellanox Adapter Traffic Counters
   Mellanox Adapter Traffic Counters    Description
   Bytes IN
   Bytes Received                       Shows the number of bytes received by the adapter. The counted bytes include framing characters.
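The SA Query Timeout entry at the start of this excerpt gives a valid range of 500 to 60000 ms with a default of 1000 ms. A minimal Python sketch of a range check built on those three numbers follows; the constant and function names are hypothetical, and how the driver itself treats out-of-range values is not stated in the source.

```python
# Limits quoted in the manual excerpt (milliseconds).
SA_QUERY_TIMEOUT_MIN_MS = 500
SA_QUERY_TIMEOUT_MAX_MS = 60000
SA_QUERY_TIMEOUT_DEFAULT_MS = 1000


def check_sa_query_timeout(value_ms=None):
    """Return a usable timeout, falling back to the default when none is given.

    Rejecting out-of-range values is an assumption of this sketch; the
    excerpt only lists the valid range.
    """
    if value_ms is None:
        return SA_QUERY_TIMEOUT_DEFAULT_MS
    if isinstance(value_ms, bool) or not isinstance(value_ms, int):
        raise TypeError("timeout must be an integer number of milliseconds")
    if not SA_QUERY_TIMEOUT_MIN_MS <= value_ms <= SA_QUERY_TIMEOUT_MAX_MS:
        raise ValueError("timeout must be between 500 and 60000 ms")
    return value_ms
```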
6. • Two or more machines running Windows Server 2012 and above
   • One or more Mellanox ConnectX-3 or ConnectX-3 Pro adapters for each server
   • One or more Mellanox InfiniBand switches
   • Two or more QSFP cables required for InfiniBand

   4.3 SMB Configuration Verification
   4.3.1 Verifying SMB Configuration
   Use the following PowerShell cmdlets to verify that SMB Multichannel is enabled, and to confirm that the adapters are recognized by SMB and that their RDMA capability is properly identified.
   • On the SMB client, run the following PowerShell cmdlets:
     Get-SmbClientConfiguration | Select EnableMultichannel
     Get-SmbClientNetworkInterface
   • On the SMB server, run the following PowerShell cmdlets (1):
     Get-SmbServerConfiguration | Select EnableMultichannel
     Get-SmbServerNetworkInterface
     netstat.exe -xan | ? {$_ -match "445"}
   (1) The NETSTAT command confirms if the File Server is listening on the RDMA interfaces.

   4.3.2 Verifying SMB Connection
   To verify the SMB connection on the SMB client:
   Step 1. Copy a large file to create a new session with the SMB Server.
   Step 2. Open a PowerShell window while the copy is ongoing.
   Step 3. Verify that SMB Direct is working properly and that the correct SMB dialect is used:
     Get-SmbConnection
     Get-SmbMultichannelConnection
     netstat.exe -xan | ? {$_ -match "445"}
   If you have no activity while you run the commands above, you might get an empty list due to session expir
7. 8.3 InfiniBand Fabric Diagnostic Utilities
      8.3.1 Utilities Usage
      8.3.2 ibdiagnet
      8.3.3 ibportstate
      8.3.4 ibroute
      8.3.5 ibdump
      8.3.6 smpquery
      8.3.7 perfquery
      8.3.8 [title illegible in source]
      8.3.9 ibnetdiscover
      8.3.10 ibtracert
      8.3.11 sminfo
      8.3.12 ibclearerrors
      8.3.13 ibstat
      8.3.14 vstat
      8.3.15 osmtest
      8.3.16 ibaddr
      8.3.17 ibcacheedit
      8.3.18 iblinkinfo
      8.3.19 ibqueryerrors
      8.3.20 ibsysstat
      8.3.21 saquery
      8.3.22 smpdump
   8.4 InfiniBand Fabric Performance Utilities
8. Bytes Received/Sec     Shows the rate at which bytes are received by the adapter. The counted bytes include framing characters.
   Packets Received       Shows the number of packets received by the ConnectX-3 and ConnectX-3 Pro network interface.
   Packets Received/Sec   Shows the rate at which packets are received by the ConnectX-3 and ConnectX-3 Pro network interface.

   Bytes/Packets OUT
   Bytes Sent             Shows the number of bytes sent by the adapter. The counted bytes include framing characters.
   Bytes Sent/Sec         Shows the rate at which bytes are sent by the adapter. The counted bytes include framing characters.
   Packets Sent           Shows the number of packets sent by the ConnectX-3 and ConnectX-3 Pro network interface.
   Packets Sent/Sec       Shows the rate at which packets are sent by the ConnectX-3 and ConnectX-3 Pro network interface.

   Bytes TOTAL
   Bytes Total            Shows the total of bytes handled by the adapter. The counted bytes include framing characters.
   Bytes Total/Sec        Shows the total rate of bytes that are sent and received by the adapter. The counted bytes include framing characters.
   Packets Total          Shows the total of packets handled by the ConnectX-3 and ConnectX-3 Pro network interface.
   Packets Total/Sec      Shows the rate at which packets are sent and received by the ConnectX-3 and ConnectX-3 Pro network interface.
   Control Packets        The total number of successfully received control frames.

   ERRORS, DROP, AND MISC. INDICATIONS
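The table pairs each cumulative counter (for example Bytes Received) with a per-second counterpart. The excerpt does not say how the driver derives the rates; a conventional approach, shown here only as an assumption, is to divide the difference between two samples of the cumulative counter by the sampling interval. The function name is hypothetical.

```python
def per_second_rate(previous_total, current_total, interval_seconds):
    """Estimate a per-second rate from two samples of a cumulative counter.

    This is the usual delta-over-time calculation, not a description of
    how the adapter driver computes its own rate counters.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    if current_total < previous_total:
        # A smaller value suggests the counter was reset or wrapped.
        raise ValueError("counter went backwards between samples")
    return (current_total - previous_total) / interval_seconds
```

For instance, samples of 1000 and 4000 bytes taken 2 seconds apart give a rate of 1500 bytes per second.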
9. Table 8: Mellanox Adapter Traffic Counters (continued)
   Mellanox Adapter Traffic Counters           Description
   Packets Outbound Errors                     Shows the number of outbound packets that could not be transmitted because of errors.
   Packets Outbound Discarded                  Shows the number of outbound packets to be discarded even though no errors had been detected to prevent transmission. One possible reason for discarding packets could be to free up buffer space.
   Packets Received Errors                     Shows the total number of inbound packets that contained errors preventing them from being deliverable to a higher-layer protocol.
   Packets Received with Frame Length Error    Shows the number of inbound packets that contained an error where the frame has a length error. Packets received with frame length error are a subset of packets received errors.
   Packets Received with Symbol Error          Shows the number of inbound packets that contained a symbol error or an invalid block. Packets received with symbol error are a subset of packets received errors.
   Packets Received with Bad CRC Error         Shows the number of inbound packets that failed the CRC check. Packets received with bad CRC error are a subset of packets received errors.
   Packets Received Discarded                  Shows the number of inbound packets that were chosen to be discarded even though no errors had been detected to prevent their being deliverable to a higher layer pro
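Table 8 describes the frame length, symbol, and bad CRC counters as subsets of Packets Received Errors. A small Python sketch of a sanity check based on that relationship is given below; the dictionary layout and function name are hypothetical, and the counter values would have to come from whatever tool is used to read the counters.

```python
# Counters that the table describes as subsets of "Packets Received Errors".
SUBSET_COUNTERS = (
    "Packets Received with Frame Length Error",
    "Packets Received with Symbol Error",
    "Packets Received with Bad CRC Error",
)


def inconsistent_error_counters(counters):
    """Return the subset counters whose value exceeds the overall error total.

    `counters` maps counter names to integer values sampled at the same time.
    """
    total = counters["Packets Received Errors"]
    return [name for name in SUBSET_COUNTERS if counters[name] > total]
```

An empty result means the sampled values are consistent with the subset relationship; it does not by itself mean that the link is healthy.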
10. -I <max inline size>               The maximum size of message to send inline. The default number is 128B.
    -D <test duration in seconds>      Test duration in seconds.
    -f <margin time in seconds>        The margin time to avoid calculation. It must be less than half of the duration time.
    -Q <CQ Moderation value>           The default number is 100.
    -S <server interface IP>           Server side only; must be the last parameter.
    -C <server interface IP>           Client side only; must be the last parameter.

    8.4.14 nd_write_lat
    This test is used for performance measuring of RDMA-Write requests in Microsoft Windows Operating Systems. nd_write_lat is performance oriented for RDMA-Write with minimum latency, and runs over Microsoft's NetworkDirect standard. The level of customizing for the user is relatively high. The user may choose to run with a customized message size, a customized number of iterations, or alternatively, a customized test duration time. nd_write_lat runs with all message sizes from 1B to 4MB (powers of 2), message inlining, CQ moderation.

    8.4.14.1 nd_write_lat Synopsis
    (Running on a specific single core)
    Server side:
    start /b /affinity 0X1 nd_write_lat -s1048576 -D10 -S 11.137.53.1
    Client side:
    start /b /wait /affinity 0X1 nd_write_lat -s1048576 -D10 -C 11.137.53.1

    8.4.14.2 nd_write_lat Options
    The table below lists the various flags of the
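Two numeric rules appear in the nd_write_lat description: message sizes run from 1B to 4MB in powers of 2, and the margin time must be less than half of the test duration. The Python sketch below illustrates both; the function names are hypothetical and the code does not invoke or emulate the tool itself.

```python
def message_sizes(max_bytes=4 * 1024 * 1024):
    """List the power-of-2 message sizes from 1 byte up to max_bytes."""
    sizes = []
    size = 1
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


def margin_is_valid(duration_seconds, margin_seconds):
    """Check the stated rule: margin must be less than half of the duration."""
    return 0 <= margin_seconds < duration_seconds / 2
```

With the 10-second duration used in the synopsis (D10), a margin of 4 seconds passes the check and a margin of 5 seconds does not.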
11. All InfiniBand verbs applications should work on RoCE links if they use GRH headers.
    Set the HCA to use the Ethernet protocol: Display the Device Manager and expand "System Devices". Please refer to Section 3.4.2, "Port Protocol Configuration".

    3.7.2.2 Configuring Windows Host
    Since PFC is responsible for flow controlling at the granularity of traffic priority, it is necessary to assign different priorities to different types of network traffic. As per RoCE configuration, all ND/NDK traffic is assigned to one or more chosen priorities, where PFC is enabled on those priorities.
    Configuring the Windows host requires configuring QoS. To configure QoS, please follow the procedure described in Section 5.3, "Configuring Quality of Service (QoS)".

    3.7.2.2.1 Using Global Pause Flow Control (GFC)
    To use Global Pause Flow Control (GFC) mode, disable QoS and priority flow control:
    PS $ Disable-NetQosFlowControl
    PS $ Disable-NetAdapterQos

    3.7.3 Configuring SwitchX-Based Switch System
    To enable RoCE, the SwitchX should be configured as follows:
    • Ports facing the host should be configured as access ports, and either use global pause or Priority Code Point (PCP) for priority flow control.
    • Ports facing the network should be configured as trunk ports, and use Priority Code Point (PCP) for priority flow control.
    For further information on how to configure SwitchX p
12. ibaddr [-d(ebug)] [-D(irect)] [-G(uid)] [-l(id_show)] [-g(id_show)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [-h(elp)] [<lid | dr_path | guid>]

    8.3.16.2 ibaddr Options
    Table 26: ibaddr Flags and Options
    Flags              Description
    -G (Guid)          Shows lid range and gid for GUID address.
    -l (lid_show)      Shows lid range only.
    -L (Lid_show)      Shows lid range (in decimal) only.
    -g (gid_show)      Shows gid address only.

    Debugging Flags    Description
    NOTE: Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the <util name> -h syntax.
    -d                 Raises the IB debugging level. Can be used several times (-ddd or -d -d -d).
    -e                 Shows send and receive errors (timeouts and others).
    -h                 Shows the usage message.
    -v                 Increases the application verbosity level. Can be used several times (-vv or -v -v -v).
    -V                 Shows the version info.

    Addressing Flags   Description
    -D                 Uses directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...).
    -G                 Uses GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
    -s <smlid>         Uses smlid as the target lid for SM/SA queries.
13. --mtu <mtu>                      The mtu size (default 1024).
    -c, --connection <RC/UC/UD>      Connection type RC/UC/UD (default RC).
    -s, --size <size>                The size of message to exchange (default 65536).
    -a, --all                        Runs sizes from 2 till 2^23.
    -t, --tx-depth <dep>             The size of tx queue (default 100).
    -n, --iters <iters>              The number of exchanges (at least 2, default 1000).
    -b, --bidirectional              Measures bidirectional bandwidth (default unidirectional).
    -V, --version                    Displays version number.
    -g, --grh                        Uses GRH with packets (mandatory for RoCE).

    8.4.4 ib_send_lat
    ib_send_lat calculates the latency of sending a packet of a given message size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which a side sends a packet only after it receives one. Each side samples the CPU each time it receives a packet in order to calculate the latency.

    8.4.4.1 ib_send_lat Synopsis
    ib_send_lat [-i(b_port) ib_port] [-c(onnection_type) RC/UC/UD] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-n iteration_num] [-p(ort) PDT_port] [-a(ll)] [-V(ersion)] [-C report cycles] [-H report histogram] [-U report unsorted]

    8.4.4.2 ib_send_lat Options
    The table below lists the various flags of the command.
    Table 36: ib_send_lat Flags and Options
    Flag                             Description
    -p, --port <port>
14. 3.5.4 Removing a Port VLAN in Windows 2008 R2
    To remove a port VLAN, perform the following steps:
    Step 1. In the Device Manager window, right-click the network adapter from which the port VLAN was created.
    Step 2. Left-click Properties.
    Step 3. Select the VLAN tab from the Properties sheet.
    Step 4. Select the VLAN to be removed.
    Step 5. Click Remove and confirm the operation.

    3.5.5 Configuring a Port to Work with VLAN in Windows 2012 and Above
    In this procedure you DO NOT create a VLAN; rather, you use an existing VLAN ID.
    To configure a port to work with VLAN using the Device Manager:
    Step 1. Open the Device Manager.
    Step 2. Go to the Network adapters.
    Step 3. Right-click the Mellanox ConnectX-3 Ethernet Adapter card and select Properties.
    Step 4. Go to the Advanced tab.
    Step 5. Choose the VLAN ID in the Property window.
    Step 6. Set its value in the Value window.

    [Figure: Device Manager with the Advanced tab of the adapter Properties sheet, showing the Property and Value lists.]
15. -Force -Confirm
    Stop-VM -Name "mtlae14-006" -Force -Confirm

    # Connect VM to vSwitch (maybe you have to switch off the VM before); doing it manually also works
    Connect-VMNetworkAdapter -VMName "mtlae14-005" -SwitchName "VSwMLNX"
    Add-VMNetworkAdapter -VMName "mtlae14-005" -SwitchName "VSwMLNX" -StaticMacAddress "00155D720100"
    Add-VMNetworkAdapter -VMName "mtlae14-006" -SwitchName "VSwMLNX" -StaticMacAddress "00155D720101"

    NOTE: The commands from Steps 2-4 are not persistent. It is suggested to create a script that runs them after each OS reboot.

    Step 2. Configure a Subnet Locator and Route records on each Hyper-V Host (Host 1 and Host 2: mtlae14 & mtlae15).
    New-NetVirtualizationLookupRecord -CustomerAddress 172.16.14.5 -ProviderAddress 192.168.20.114 -VirtualSubnetID 5001 -MACAddress "00155D720100" -Rule "TranslationMethodEncap"
    New-NetVirtualizationLookupRecord -CustomerAddress 172.16.14.6 -ProviderAddress 192.168.20.114 -VirtualSubnetID 5001 -MACAddress "00155D720101" -Rule "TranslationMethodEncap"
    New-NetVirtualizationLookupRecord -CustomerAddress 172.16.15.5 -ProviderAddress 192.168.20.115 -VirtualSubnetID 5001 -MACAddress "00155D730100" -Rule "TranslationMethodEncap"
    New-NetVirtualizationLookupRecord -CustomerAddress 172.16.15.6 -ProviderAddress 192.168.20.115 -VirtualSubnetID 5001 -MACAddress "00155D730101" -Rule "TranslationMethodEncap"

    # Add customer route
    New-NetVirtualizationCustomerRoute -RoutingDomainID 11111111-2222-3333-4444-00000
16. [Figure: Device Manager device list showing Intel chipset devices and Mellanox ConnectX-3 (MT04099) Network Adapter entries.]

    Step 2. Right-click a Mellanox network adapter (under the "Network adapters" list) and left-click Properties. Select the Advanced tab from the Properties sheet.

    [Figure: Advanced tab of the adapter Properties sheet. The Property list includes Bus master DMA Operations, Flow Control, Header Data Split, Interrupt Moderation, and further entries cut off in the source.]
17. Other Common Flags    Description
    -C <ca_name>          Uses the specified ca_name.
    -P <ca_port>          Uses the specified ca_port.
    -t <timeout_ms>       Overrides the default timeout for the solicited MADs.

    8.3.16.3 Multiple CA/Multiple Port Support
    When no IB device or port is specified, the port to use is selected by the following criteria:
    1. The first port that is ACTIVE.
    2. If not found, the first port that is UP (physical link up).
    If a port and/or CA name is specified, the tool attempts to fulfill the user request and fails if this is not possible.

    Examples
    ibaddr                   # local port's address
    ibaddr 32                # show lid range and gid of lid 32
    ibaddr -G 0x8f1040023    # same but using guid address
    ibaddr -l 32             # show lid range only
    ibaddr -L 32             # show decimal lid range only
    ibaddr -g 32             # show gid address only

    8.3.17 ibcacheedit
    ibcacheedit allows users to edit an ibnetdiscover cache created through the --cache option in ibnetdiscover(8).

    8.3.17.1 ibcacheedit Synopsis
    ibcacheedit [--switchguid BEFOREGUID:AFTERGUID] [--caguid BEFORE:AFTER] [--sysimgguid BEFOREGUID:AFTERGUID] [--portguid NODEGUID:BEFOREGUID:AFTERGUID] [-h(elp)] <orig.cache> <new.cache>

    8.3.17.2 ibcacheedit Options
    Table 27: ibcacheedit Flags and Options
    Flags                                  Description
    --switchguid BEFOREGUID:AFTERGUID      Specifies a switchguid that should be changed. The before and after guid should be sepa
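The default port-selection rule in section 8.3.16.3 (first ACTIVE port, otherwise first UP port) can be sketched in Python as below. The tuple layout, the state labels, and the function name are hypothetical simplifications for illustration; the real utilities read port state from the driver, which this sketch does not attempt.

```python
def select_default_port(ports):
    """Pick a port using the two criteria from the manual, in order.

    `ports` is a list of (name, state) tuples in enumeration order, where
    state is one of "ACTIVE", "UP" (physical link up), or "DOWN".
    Returns the chosen name, or None when no port qualifies.
    """
    for wanted_state in ("ACTIVE", "UP"):
        for name, state in ports:
            if state == wanted_state:
                return name
    return None
```

Because the ACTIVE pass runs over the whole list first, an ACTIVE port is preferred even when an UP port appears earlier in the enumeration.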
18. Examples
    sminfo                   # local port's sminfo
    sminfo 32                # show sminfo of lid 32
    sminfo -G 0x8f1040023    # same but using guid address

    8.3.12 ibclearerrors
    ibclearerrors is a script which clears the PMA error counters in PortCounters by either walking the InfiniBand subnet topology or using an already saved topology file.

    8.3.12.1 ibclearerrors Synopsis
    ibclearerrors [-h] [-N | -nocolor] [<topology-file> | -C ca_name -P ca_port -t(imeout) timeout_ms]

    8.3.12.2 ibclearerrors Options
    The table below lists the various flags of the command.
    Table 22: ibclearerrors Flags and Options
    Flag               Description
    -C <ca_name>       Use the specified ca_name.
    -P <ca_port>       Use the specified ca_port.
    -t <timeout_ms>    Override the default timeout for the solicited MADs.

    8.3.13 ibstat
    ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state.

    8.3.13.1 ibstat Synopsis
    ibstat [-d(ebug)] [-l(ist_of_cas)] [-s(hort)] [-p(ort_list)] [-V(ersion)] [-h] <ca_name> [portnum]

    8.3.13.2 ibstat Options
    The table below lists the various flags of the command. Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the <util name> -h syntax.
    Table 23: ibstat Flags and Options
    Flag               Description
    -l, --list_of_ca
19. • Failed to initialize the Mellanox ConnectX EN 10Gbit Ethernet Adapter <X> because it uses old firmware version (<old firmware version>). You need to burn firmware version <new firmware version> or higher, and to restart your computer.
    • Mellanox ConnectX EN 10Gbit Ethernet Adapter <X> device detected that the link connected to port <Y> is up, and has initiated normal operation.
    • Mellanox ConnectX EN 10Gbit Ethernet Adapter <X> device detected that the link connected to port <Y> is down. This can occur if the physical link is disconnected or damaged, or if the other end-port is down.
    • Mismatch in the configurations between the two ports may affect the performance. When using MSI-X, both ports should use the same RSS mode. To fix the problem, configure the RSS mode of both ports to be the same in the driver GUI.
    • Mellanox ConnectX EN 10Gbit Ethernet Adapter <X> device failed to create enough MSI-X vectors. The network interface will not use MSI-X interrupts. This may affect the performance. To fix the problem, configure the number of MSI-X vectors in the registry to be at least <Y>.

    10.3 Performance Troubleshooting
    Issue 1: Windows Settings
    Suggestion 1: In Windows 2012 and above, when a kernel debugger is configured (not necessarily physically connected), flow control is disabled unless the following registry key is set (reboot required after setting): R
20. --iters <iters>           The number of exchanges (at least 2, default 1000).
    -C, --report-cycles       Reports times in CPU cycle units (default: microseconds).
    -H, --report-histogram    Prints out all results (default: print summary only).
    -U, --report-unsorted     (Implies -H.) Prints out unsorted results (default: sorted).
    -V, --version             Displays version number.
    -g, --grh                 Uses GRH with packets (mandatory for RoCE).

    8.4.7 ibv_read_bw
    This is a more advanced version of ib_read_bw and contains more flags and features than the older version, as well as improved algorithms. ibv_read_bw calculates the BW of RDMA read between a pair of machines. One acts as a server and the other as a client. The client RDMA-reads the server memory and calculates the BW by sampling the CPU each time it receives a successful completion. The test supports a large variety of features, as described below, and has better performance than ib_read_bw on Nehalem systems. Read is available only in RC connection mode (as specified in the InfiniBand spec).

    8.4.7.1 ibv_read_bw Synopsis
    ibv_read_bw [-i(b_port) ib_port] [-d ib_device] [-o(uts) outstanding_reads] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-n iteration_num] [-p(ort) PDT_port] [-u qp_timeout] [-S(l) sl_type] [-x gid_index] [-e(vents)] [-F CPU freq fail] [-b(idirectional)] [-a(ll)] [-V(ersion)]

    8.4.7.2 ibv_read_bw Options
21. max inline size>                   The maximum size of message to send inline. The default number is 128B.
    -D <test duration in seconds>      Test duration in seconds.
    -f <margin time in seconds>        The margin time to avoid calculation. It must be less than half of the duration time.
    -S <server interface IP>           Server side only; must be the last parameter.
    -C <server interface IP>           Client side only; must be the last parameter.
    -h                                 Shows the Help screen.

    8.4.17 nd_send_bw
    This test is used for performance measuring of Send requests in Microsoft Windows Operating Systems. nd_send_bw is performance oriented for Send with maximum throughput, and runs over Microsoft's NetworkDirect standard. The level of customizing for the user is relatively high. The user may choose to run with a customized message size, a customized number of iterations, or alternatively, a customized test duration time. nd_send_bw runs with all message sizes from 1B to 4MB (powers of 2), message inlining, CQ moderation.

    8.4.17.1 nd_send_bw Synopsis
    (Running on a specific single core)
    Server side:
    start /b /affinity 0X1 nd_send_bw -s1048576 -D10 -S 11.137.53.1
    Client side:
    start /b /wait /affinity 0X1 nd_send_bw -s1048576 -D10 -C 11.137.53.1

    8.4.17.2 nd_send_bw Options
    The table below lists the various flags of the command.
    Table 49: nd_send_bw Flags and Options
    Flag                               Description
    -h                                 Shows the H
22. message inlining, CQ moderation.

    8.4.18.1 nd_send_lat Synopsis
    (Running on a specific single core)
    Server side:
    start /b /affinity 0X1 nd_send_lat -s1048576 -D10 -S 11.137.53.1
    Client side:
    start /b /wait /affinity 0X1 nd_send_lat -s1048576 -D10 -C 11.137.53.1

    8.4.18.2 nd_send_lat Options
    The table below lists the various flags of the command.
    Table 50: nd_send_lat Options
    Flag                               Description
    -h                                 Shows the Help screen.
    -V                                 Shows the version number.
    -p                                 Connects to the port <port> (default 6830).
    -s <msg size>                      Exchanges messages of the given size (default 65536B). It must not be combined with the -a flag.
    -a                                 Runs all the message sizes from 1B to 8MB. It must not be combined with the -s flag.
    -n <num of iterations>             The number of exchanges (at least 2; the default is 100000).
    -I <max inline size>               The maximum size of message to send inline. The default number is 128B.
    -D <test duration in seconds>      Test duration in seconds.
    -f <margin time in seconds>        The margin time to avoid calculation. It must be less than half of the duration time.
    -S <server interface IP>           Server side only; must be the last parameter.
    -C <server interface IP>           Client side only; must be the last parameter.
    -h                                 Shows the Help screen.

    8.4.19 NTttcp
    NTttcp is a Windows base testing application
23. perfquery -R 32 2 0x0fff    # reset only error counters of port 2
    perfquery -R 32 2 0xf000    # reset only non-error counters of port 2

    1. Read local port's performance counters.
    > perfquery
    # Port counters: Lid 6 port 1
    [Counter listing garbled in source. Legible values include 55178210, 55174680, and 766315.]

    2. Read performance counters from LID 2, all ports.
    > perfquery -a 2
    # Port counters: Lid 2 port 255
    [Counter listing garbled in source. Legible values include CounterSelect 0x0100, RcvData 129529906, and the totals 129840354, 1803332, and 1799018.]

    3. Read then reset performance counters from LID 2, port 1.
    > perfquery -r 2 1
    # Port counters: Lid 2 port 1
    [Counter listing garbled and cut off in source. Legible values include CounterSelect 0x0100.]
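The two reset examples at the start of this excerpt pass the masks 0x0fff (error counters only) and 0xf000 (non-error counters only). The Python sketch below only shows how the two masks split a 16-bit value; which individual counter each bit selects is not stated in this excerpt, so the code deliberately does not name them. The constant and function names are hypothetical.

```python
# Reset masks quoted in the perfquery examples.
ERROR_COUNTERS_MASK = 0x0FFF
NON_ERROR_COUNTERS_MASK = 0xF000


def selected_bits(mask):
    """Return the bit positions (0 = least significant) set in a 16-bit mask."""
    return [bit for bit in range(16) if mask & (1 << bit)]
```

The two masks do not overlap and together cover all 16 bits, which is why the examples can reset the error and non-error groups independently.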
24. Table 21: sminfo Flags and Options
    Table 22: ibclearerrors Flags and Options
    Table 23: ibstat Flags and Options
    Table 24: vstat Flags and Options
    Table 25: osmtest Flags and Options
    Table 26: ibaddr Flags and Options
    Table 27: ibcacheedit Flags and Options
    Table 28: iblinkinfo Flags and Options
    Table 29: ibqueryerrors Flags and Options
    Table 30: ibsysstat Flags and Options
    Table 31: saquery Flags and Options
    Table 32: smpdump Flags and Options
    Table 33: ib_read_bw Flags and Options
    Table 34: ib_read_lat Flags and Options
    Table 35: ib_send_bw Flags and Options
    Table 36: ib_send_lat Flags and Options
    Table 37: ib_write_bw Flags and Options
    Table 38: ib_write_lat Flags and Options
    Table 39: ibv_read_bw Flags and Options
    Table 40: ibv_read_lat
25. [-t(imeout) <msec>] [--src-to-dst <src:dst>] [--sgid-to-dgid <sgid-dgid>] [--node-name-map <node-name-map>] [<name> | <lid> | <guid>]

8.3.21.2 saquery Options

Table 31 - saquery Flags and Options
  Flags    Description
  -p    Gets PathRecord info.
  -N    Gets NodeRecord info.
  --list, -D    Gets NodeDescriptions of CAs only.
  -S    Gets ServiceRecord info.
  -I    Gets InformInfoRecord (subscription) info.
  -L    Returns the LIDs of the name specified.
  -l    Returns the unique LID of the name specified.
  -G    Returns the GUIDs of the name specified.
  -O    Returns the name for the LID specified.
  -U    Returns the name for the GUID specified.
  -c    Gets the SA's class port info.
  -s    Returns the PortInfoRecords with the isSM or isSMdisabled capability mask bit on.
  -g    Gets multicast group info.
  -m    Gets multicast member info. If a group is specified, limits the output to that group and prints one line containing only the GUID and node description for each entry. Example: saquery -m 0xc000
  -x    Gets LinkRecord info.
  --src-to-dst    Gets a PathRecord for <src:dst>, where src and dst are either node names or LIDs.
  --sgid-to-dgid    Gets a PathRecord for <sgid-dgid>, where both GIDs are in IPv6 format.
26. 8.4.1 ib_read_bw
8.4.2 ib_read_lat
8.4.3 ib_send_bw
8.4.4 ib_send_lat
8.4.5 ib_write_bw
8.4.6 ib_write_lat
8.4.7 ibv_read_bw
8.4.8 ibv_read_lat
8.4.9 ibv_send_bw
8.4.10 ibv_send_lat
8.4.11 ibv_write_bw
8.4.12 ibv_write_lat
8.4.13 nd_write_bw
8.4.14 nd_write_lat
8.4.15 nd_read_bw
8.4.16 nd_read_lat
8.4.17 nd_send_bw
8.4.18 nd_send_lat
8.4.19 NTttcp
Chapter 9 Software Development Kit
Chapter 10 Troubleshooting
10.1 InfiniBand Troubleshooting
10.2 Ethernet Troubleshooting
10.3 Performance Troubleshooting
Chapter 11 Documentatio
27. For example, if the adapter is represented by "Local Area Connection 6" and "Local Area Connection 7":

For single port stream tuning, type:
    perf_tuning.exe -s -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"
or, to set one adapter only:
    perf_tuning.exe -s -c1 "Local Area Connection 6"

For single stream tuning, type:
    perf_tuning.exe -st -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"
or, to set one adapter only:
    perf_tuning.exe -st -c1 "Local Area Connection 6"

For dual port streams tuning, type:
    perf_tuning.exe -d -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"

For forwarding streams tuning, type:
    perf_tuning.exe -f -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"

For manual tuning of the first adapter to use RSS on CPUs 0-3:
    perf_tuning.exe -m -c1 "Local Area Connection 6" -b 0 -n 4

To restore defaults, type:
    perf_tuning.exe -r -c1 "Local Area Connection 6" -c2 "Local Area Connection 7"

6.2 Application Specific Optimization and Tuning

6.2.1 Ethernet Performance Tuning

The user can configure the Ethernet adapter by setting registry keys. The registry keys may affect Ethernet performance. To improve performance, activate the performance tuning tool as follows:

Step 1. Start the Device Manager (open a command line window and enter: devmgmt.msc).
Step 2. Open "Network Adapters".
Step 3. Right click the relev
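The perf_tuning invocations listed above all follow one pattern: a scenario switch, one or two connection names, and, for manual mode, a CPU range. The following is a minimal Python sketch that assembles such an argument list. The switches (-s, -st, -d, -f, -m, -r, -c1, -c2, -b, -n) are taken from the examples; the reading of -b as "first CPU" and -n as "number of CPUs" is an assumption inferred from the "CPUs 0-3" example, and the function and scenario names are hypothetical.

```python
from typing import List, Optional

# Scenario switches as they appear in the examples above.
SCENARIOS = {
    "single_port": "-s",
    "single_stream": "-st",
    "dual_port": "-d",
    "forwarding": "-f",
    "manual": "-m",
    "restore": "-r",
}


def perf_tuning_args(scenario: str, c1: str, c2: Optional[str] = None,
                     base_cpu: Optional[int] = None,
                     num_cpus: Optional[int] = None) -> List[str]:
    """Build an argument list for perf_tuning.exe (hypothetical helper)."""
    args = ["perf_tuning.exe", SCENARIOS[scenario], "-c1", c1]
    if c2 is not None:
        args += ["-c2", c2]
    if scenario == "manual":
        # Assumption: -b is the first RSS CPU, -n the number of CPUs.
        if base_cpu is None or num_cpus is None:
            raise ValueError("manual tuning needs base_cpu and num_cpus")
        args += ["-b", str(base_cpu), "-n", str(num_cpus)]
    return args
```

Passing the list to `subprocess.run` avoids quoting problems with connection names that contain spaces.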
28. It is active even if the cpufreq_ondemand module is loaded.

Table 39 - ibv_read_bw Flags and Options (continued)
  Flag    Description
  (flag not shown)    Connect QPs with rdma_cm and run the test on those QPs.
  -z, --com_rdma_cm    Communicate with the rdma_cm module to exchange data; use regular QPs.
  -c, --connection <RC/UC>    Connection type RC/UC (default: RC).
  -I, --inline_size <size>    Maximum size of a message to be sent inline (default: 0).
  -Q, --cq-mod    Generate a CQE only after <cq-mod> completions.
  -N, --no peak-bw    Cancel peak bandwidth calculation (default: with peak).

8.4.8 ibv_read_lat

This is a more advanced version of ib_read_lat. It offers more flags and features than the older version, as well as improved algorithms. ibv_read_lat calculates the latency of an RDMA read operation of a given message size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which one side RDMA-reads the memory of the other side only after the other side has read its memory. Each side samples the CPU clock each time it reads the other side's memory, in order to calculate latency. Read is available only in RC connection mode, as specified in the InfiniBand specification.

8.4.8.1 ibv_read_lat Synopsis

ibv_read_lat [-i(b_port) ib_port] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-I(nline_size) inline_size] [-u qp_timeout] [-S(L) sl_type] [-d ib_device_name] [-x gid_ind
29. (default: 0)

Offload Options
Allows you to specify which TCP/IP offload settings are handled by the adapter rather than the operating system. Enabling offloading services increases transmission performance, because the offloaded tasks are performed by the adapter hardware rather than the operating system, freeing CPU resources to work on other tasks.

IPv4 Checksums Offload
Enables the adapter to compute the IPv4 checksum upon transmit and/or receive instead of the CPU (default: Enabled).

TCP/UDP Checksum Offload for IPv4 packets
Enables the adapter to compute the TCP/UDP checksum over IPv4 packets upon transmit and/or receive instead of the CPU (default: Enabled).

TCP/UDP Checksum Offload for IPv6 packets
Enables the adapter to compute the TCP/UDP checksum over IPv6 packets upon transmit and/or receive instead of the CPU (default: Enabled).

Large Send Offload (LSO)
Allows the TCP stack to build a TCP message up to 64KB long and send it in one call down the stack. The adapter then re-segments the message into multiple TCP packets for transmission on the wire, with each packet sized according to the MTU. This option offloads a large amount of kernel processing time from the host CPU to the adapter.

IB Options
Configures parameters related to InfiniBand functionality.

SA Query Retry Count
Sets the number of SA query retries once a query fails. The valid values are 1-64 (default: 10).
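The Large Send Offload description above amounts to cutting one large TCP payload into MTU-sized pieces. The following is a rough Python sketch of that split, under stated assumptions: the 40-byte header budget (20 bytes IPv4 plus 20 bytes TCP, no options) and the function name are illustrative choices, not values from the manual, and the real adapter also rewrites headers per packet, which this sketch ignores.

```python
from typing import List


def segment_payload(payload_len: int, mtu: int, header_len: int = 40) -> List[int]:
    """Return the payload size carried by each packet after segmentation.

    header_len defaults to 40 bytes (IPv4 + TCP without options); this is
    an assumption made for illustration only.
    """
    mss = mtu - header_len  # payload bytes that fit in one packet
    if mss <= 0:
        raise ValueError("MTU too small for the assumed headers")
    full_packets, remainder = divmod(payload_len, mss)
    sizes = [mss] * full_packets
    if remainder:
        sizes.append(remainder)  # last, shorter packet
    return sizes
```

For example, `segment_payload(64 * 1024, 1500)` describes how a single 64KB send would be spread over packets on a 1500-byte MTU link; the host stack issues one call while the adapter emits every packet in the returned list.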
30. port connections and port state. See ibnetdiscover for information on caching ibnetdiscover output.

Table 28 - iblinkinfo Flags and Options
  Flags    Description
  --diffcheck <key(s)>    Specifies which diff checks should be done in the --diff option above. Separate multiple diff check keys with commas. The available diff checks are: port (port connections), state (port state), lid (LIDs), nodedesc (node descriptions). If port is specified alongside lid or nodedesc, remote port LIDs and node descriptions are also compared.
  --filterdownports <filename>    Filters down ports indicated in an ibnetdiscover cache. If a port was previously indicated as down in the specified cache and is still down, it is not included in the resulting output. This option may be particularly useful in environments where switches are not fully populated, so that much of the default iblinkinfo output is not useful. See ibnetdiscover for information on caching ibnetdiscover output.

8.3.19 ibqueryerrors

The default behavior is to report the port error counters that exceed a threshold, for each port in the fabric. The default threshold is zero (0). Error fields can also be suppressed entirely. In addition to reporting errors on every port, ibqueryerrors can report the port transmit and receive data, as well as full link information to the remote port, if available.

8.3.19.1
31. -t, --timeout <timeout_ms>    Override the default timeout for the solicited MADs (msec).
  <dest dr_path | lid | guid>    Destination's directed path, LID, or GUID.
  <portnum>    Destination's port number.
  <op> [<value>]    Define the allowed port operations: enable, disable, reset, speed, and query.

In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:
1. The first ACTIVE port that is found.
2. If not found, the first port that is UP (physical link state is LinkUp).

Examples
1. Query the status of Port 1 of CA mlx4_0 (using ibstat) and use its output (the LID, 3 in this case) to obtain additional link information using ibportstate.
    > ibstat
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.11.536
    Hardware version: 0
    Node GUID: 0x0002c903002e6670
    System image GUID: 0x0002c903002e6673
    Port 1:
    ...
    Physical state: Disabled
    Rate: 10
    Base lid: 4
    LMC: 0
    SM lid: 2
    Capability mask: 0x0251486a
    Port GUID: 0x0002c903002e6671
    Link layer: InfiniBand

    > ibportstate -C mlx4_0 4 1 query
    PortInfo:
    # Port info: Lid 3 port 1
    LinkState: Initialize
    PhysLinkState: LinkUp
    LinkWidthSupported: 1X or 4X
    LinkWidthEnabled: 1X or 4X
    LinkWidthActive: 4X
    LinkSpeedSupported:
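The port-selection rule described in this excerpt (first ACTIVE port, otherwise the first port whose physical link state is LinkUp) can be sketched in a few lines of Python. The `Port` record and the `choose_port` function are hypothetical names introduced for illustration; the actual utilities apply this rule internally and expose no such API.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Port:
    """Hypothetical record of one CA port, for illustration only."""
    ca: str
    num: int
    state: str       # logical state, e.g. "Active", "Initialize", "Down"
    phys_state: str  # physical state, e.g. "LinkUp", "Disabled"


def choose_port(ports: Iterable[Port]) -> Optional[Port]:
    """Apply the two criteria in order; return None if neither matches."""
    candidates = list(ports)
    for port in candidates:
        if port.state == "Active":        # criterion 1: first ACTIVE port
            return port
    for port in candidates:
        if port.phys_state == "LinkUp":   # criterion 2: first port that is UP
            return port
    return None
```

Passing -C and -P explicitly bypasses this rule, which is the safer choice on hosts with several adapters.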
32. zip file.

2.2 Downloading Mellanox Firmware Tools

Step 1. Download Mellanox Firmware Tools.
Go to http://www.mellanox.com > Products > Firmware Tools. The tools package to download is "MFT_Software for Windows_x64" (for x64 architecture).

Step 2. Install and run WinMFT.
To install the WinMFT package, double click the MSI package or run it from the command prompt. Installing the WinMFT package from the command line requires administrator privileges. Example:
    PS $ msiexec.exe /i WinMFT_x64_3_0_0_17.msi

To check the device status:
Step 1. Start/stop mst.
    PS $ mst start
or
    PS $ mst stop
Step 2. Check the device's status.
    PS $ mst status
If no installation problems occur, the status command should produce the following output:
    PS $ mt4099_pciconf0
    PS $ mt4099_pci_cr0

2.3 Upgrading Firmware

Firmware can be upgraded either manually or automatically, as described in the sections below.

2.3.1 Upgrading Firmware Manually

To upgrade firmware manually:
Step 1. Burn the firmware image.
    PS $ flint -d mt<device-id>_pci_cr -i <image-name>.bin burn
Example:
    PS $ flint -d mt4099_pci_cr -i fw-ConnectX3-rel-2_11_0500-MCX354A-FCA_A1.bin burn
Step 2. Reboot the server.

For additional details, please check the MFT user manual under http://www.mellanox.com > Products > Firmware Tools.

3 Driver
33. 2010    First release

About this Manual

Scope
This document describes WinOF Rev 4.60 features, performance, InfiniBand diagnostic tools, content and configuration. Additionally, this document provides information on various performance tools supplied with this version.

Intended Audience
This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.

Documentation Conventions

Table 2 - Documentation Conventions
  Description    Convention    Example
  File names    file.extension
  Directory names    directory
  Commands and their parameters    command param1    mts3610-1 > show hosts
  Required item    < >
  Optional item    [ ]
  Mutually exclusive parameters    { p1, p2, p3 } or { p1 | p2 | p3 }
  Optional mutually exclusive parameters    [ p1 | p2 | p3 ]
  Variables for which users supply specific values    Italic font    enable
  Emphasized words    Italic font    These are emphasized words
  Note    <text>    This is a note.
  Warning    <text>    May result in system instability.

Common Abbreviations and Acronyms

Table 3 - Abbreviations and Acronyms
  Abbreviation / Acr
34. 8.3.3.1 ibportstate Applicable Hardware

All InfiniBand devices.

8.3.3.2 ibportstate Synopsis

ibportstate [-d] [-e] [-v] [-V] [-D] [-L] [-G] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-u] [-t <timeout_ms>] [<dest dr_path|lid|guid>] <portnum> [<op> [<value>]]

8.3.3.3 ibportstate Options

The table below lists the various flags of the command.

Table 13 - ibportstate Flags and Options
  Flag    Description
  -h, --help    Print the help menu.
  -d, --debug    Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
  -e, --errors    Show send and receive errors (timeouts and others).
  -v, --verbose    Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
  -V, --version    Show version info.
  -D, --Direct    Use directed path address arguments. The path is a comma separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...).
  -L, --Lid    Use LID address argument.
  -G, --Guid    Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
  -s, --sm_port    Use <smlid> as the target LID for SM/SA queries.
  -C, --Ca    Use the specified channel adapter or router.
  -P, --Port    Use the specified port.
  -u, --usage    Usage message.
  -t
35. [Screenshot: Device Manager, Advanced properties of the Mellanox ConnectX-3 Ethernet Adapter]

3.6 Ports TX Arbitration

On a setup with a dual-port NIC with both ports at a link speed of 40GbE, each individual port can achieve the maximum line rate. When both ports are running simultaneously in a high-throughput scenario, the total throughput is bottlenecked by the PCIe bus, and in this case each port may not achieve its maximum of 40GbE. Ports TX Arbitration ensures that bandwidth precedence is given to one of the ports on a dual-port NIC, enabling the preferred port to achieve the maximum throughput while the other port takes up the remaining bandwidth.

To configure Ports TX Arbitration:
Step 1. Open the Device Manager.
Step 2. Go to the Network adapters.
Step 3. Right click > Properties on the Mellanox ConnectX-3 Ethernet Adapter card.
Step 4. Go to the Advanced tab.
Ste
36. Clear error counters after read. -k and -K can be used together to clear both errors and counters.
  --clear-counts, -K    Clear data counters after read. CAUTION: data counters are cleared regardless of whether they are printed. Data counters are printed only on ports that have errors, so if a port has 0 errors and the -K option is specified, its data counters are cleared without any printed output.
  --details    Includes receive error and transmit discard details.
  --load-cache <filename>    Loads and uses the cached ibnetdiscover data stored in the specified file. May be useful for outputting and learning about other fabrics or a previous state of a fabric. Cannot be used if the user specifies a direct route path. See ibnetdiscover for information on caching ibnetdiscover output.
  -R    This option is obsolete and has no effect.
  -d    Raises the IB debugging level. May be used several times (-ddd or -d -d -d).
  -e    Shows send and receive errors (timeouts and others).
  -h    Shows the usage message.
  -v    Increases the application verbosity level. May be used several times (-vv or -v -v -v).
  -C <ca_name>    Uses the specified ca_name.
  -P <ca_port>    Uses the specified ca_port.

Table 29 - ibqueryerrors Flags and Options (continued)
  Flags    Description
  -t <timeout_ms>    Overrides the default timeout for
37. Flags and Options
Table 41: ibv_send_bw Flags and Options
Table 42: ibv_send_lat Flags and Options
Table 43: ibv_write_bw Flags and Options
Table 44: ibv_write_lat Flags and Options
Table 45: nd_write_bw Flags and Options
Table 46: nd_write_lat Options
Table 47: nd_read_bw Options
Table 48: nd_read_lat Options
Table 49: nd_send_bw Flags and Options
Table 50: nd_send_lat Options
Table 51: NTttcp Options

Document Revision History

Table 1 - Document Revision History
  Document Revision    Date    Changes
  Rev 4.60    March 16, 2014    Removed ConnectX-2 from Section 4.2, "Hardware and Software Prerequisites," on page 39.
      February 13, 2014    Updated the following sections: Section 3.1, "Hyper-V with VMQ," on page 17; Section 3.8.1, "Enabling/Disabling NVGRE Offloading," on page 34. Added the following sections: Section 3.8.3, "Verifying the Encapsulation of the Traffic," on page 36.
      December 30, 2013
38. -p, --port <port>    Listens on/connects to port <port> (default: 18515).
  -d, --ib-dev <dev>    Uses IB device <device guid> (default: first device found).
  -i, --ib-port <port>    Uses port <port> of the IB device (default: 1).
  -m, --mtu <mtu>    The MTU size (default: 1024).
  -c, --connection <RC/UC/UD>    Connection type RC/UC/UD (default: RC).
  -s, --size <size>    The size of the message to exchange (default: 65536).
  -l, --signal    Signals completion on each message.
  -a, --all    Runs sizes from 2 to 2^23.
  -t, --tx-depth <dep>    The size of the tx queue (default: 100).
  -n, --iters <iters>    The number of exchanges (at least 2, default: 1000).
  -C, --report-cycles    Reports times in CPU cycle units (default: microseconds).
  -H, --report-histogram    Prints out all results (default: print summary only).
  -U, --report-unsorted    Prints out unsorted results (implies -H; default: sorted).
  -V, --version    Displays the version number.
  -g, --grh    Uses GRH with packets (mandatory for RoCE).

8.4.5 ib_write_bw

ib_write_bw calculates the bandwidth of RDMA write operations between a pair of machines. One acts as a server and the other as a client. The client RDMA-writes to the server memory and calculates the bandwidth by sampling the CPU each time it receives a successful completion. The test supports features such as bidirectional mode (in which both sides RDMA-write to each other at the same time) and changing the MTU size, tx size, number of iterations, message size and more. Using the -a flag provides results for all message sizes.
39. 8.4.5.1 ib_write_bw Synopsis

ib_write_bw [-q num_of_qps] [-c(onnection_type) RC/UC] [-i(b_port) ib_port] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-n iteration_num] [-p(ort) PDT_port] [-b(idirectional)] [-a(ll)] [-V(ersion)]

8.4.5.2 ib_write_bw Options

The table below lists the various flags of the command.

Table 37 - ib_write_bw Flags and Options
  Flag    Description
  -p, --port <port>    Listens on/connects to port <port> (default: 18515).
  -d, --ib-dev <dev>    Uses IB device <device guid> (default: first device found).
  -i, --ib-port <port>    Uses port <port> of the IB device (default: 1).
  -m, --mtu <mtu>    The MTU size (default: 1024).
  -c, --connection <RC/UC>    Connection type RC/UC (default: RC).
  -s, --size <size>    The size of the message to exchange (default: 65536).
  -a, --all    Runs sizes from 2 to 2^23.
  -t, --tx-depth <dep>    The size of the tx queue (default: 100).
  -n, --iters <iters>    The number of exchanges (at least 2, default: 1000).
  -b, --bidirectional    Measures bidirectional bandwidth (default: unidirectional).
  -V, --version    Displays the version number.
  -o, --post <num of posts>    The number of posts for each QP in the chain (default: tx_depth).
  -q, --qp <num of qp's>    The number of QPs (default: 1).
  -g, --grh    Uses GRH with packets (mandatory for RoCE).

8.4.6 ib_write_lat

ib_write_lat calculates the lat
40. Shows send and receive errors (timeouts and others).
  --help, -h    Shows the usage message.
  --verbose, -v (-vvv or -v -v -v)    Increases the application verbosity level.
  --version, -V    Shows the version info.
  --Lid, -L    Uses LID address argument.
  --usage, -u    Usage message.

Table 18 - ibping Flags and Options (continued)
  Flag    Description
  --Guid, -G    Uses GUID address argument. In most cases, it is the Port GUID. For example: "0x08f1040023".
  --sm_port, -s <smlid>    Uses smlid as the target LID for SM/SA queries.
  --Ca, -C <ca_name>    Uses the specified ca_name.
  --Port, -P <ca_port>    Uses the specified ca_port.
  --timeout, -t <timeout_ms>    Overrides the default timeout for the solicited MADs.

8.3.9 ibnetdiscover

ibnetdiscover performs IB subnet discovery and outputs a readable topology file. GUIDs, node types, and port numbers are displayed, as well as port LIDs and NodeDescriptions. All nodes and links are displayed (full topology). Optionally, this utility can be used to list the currently connected nodes by node type. The output is printed to standard output unless a topology file is specified.

8.3.9.1 ibnetdiscover Synopsis

ibnetdiscover [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-s(how)] [-l(ist)] [-g(rouping)] [-H(ca_list)] [-S(witch_list)] [-R(outer_list)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [--outstanding_smps -o <val>] [-u(sage)] [-n
41. Mellanox Technologies
350 Oakmead Parkway, Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
Beit Mellanox
PO Box 586, Yokneam 20692
Israel
www.mellanox.com
Tel: +972 (0)74 723 7200
Fax: +972 (0)4 959 3245

Copyright 2014. Mellanox Technologies. All Rights Reserved.

Mellanox, Mellanox logo, BridgeX, ConnectX, Connect-IB, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, MetroX, MLNX-OS, PhyX, ScalableHPC, SwitchX, UFM, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd. ExtendX, FabricIT, Mellanox Open Ethernet, Mellanox Virtual Modular Switch, MetroDX, Unbreakable-Link are trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners.

Document Number: 3280
Rev 4.60

Table of Contents
Document Revision History
About this Manual
  Scope
  Intended Audience
  Documentation Conventions
  Common Abbreviations and Acronyms
  Related Documents
Chapter 1 Introduction
42. 5. Adaptive Load Balancing
The same functionality as Load Balancing (Send & Receive). In case of traffic load on one of the adapters, the load balancing redistributes the traffic among the other team adapters.

6. Dynamic Link Aggregation (802.3ad)
Provides dynamic link aggregation, allowing creation of one or more channel groups using same-speed or mixed-speed server adapters.

7. Static Link Aggregation (802.3ad)
Provides increased transmission and reception throughput in a team comprised of two to eight adapter ports through static configuration.

If the switch connected to the HCA supports 802.3ad, the recommended setting is teaming mode 6.

3.5.2 Creating a Load Balancing and Fail-Over (LBFO) Bundle

LBFO is used to balance the workload of packet transfers by distributing the workload over a bundle of network instances, and to set a secondary network instance to take over packet indications and information requests if the primary network instance fails. The following steps describe the process of creating an LBFO bundle.

Step 1. Display the Device Manager.
[Screenshot: Device Manager, Network adapters list]
43. and hardware related information such as driver version, firmware version, bus interface, adapter identity, and network port link information, perform the following steps:

Step 1. Display the Device Manager.
[Screenshot: Device Manager, Network adapters list with the Mellanox ConnectX-3 adapter selected]

Step 2. Select the Information tab from the Properties sheet.
[Screenshot: Information tab. Fields shown: Driver Version, Firmware Version, Port Number, Bus Type, Link Speed, Part Number, Device Id, Revision Id, Current MAC Address, Permanent MAC Address, Network Status, Adapter Friendly Name, IPv4 Address, Adapter User Name]
44. and polling methods dynamically, depending on traffic type and network usage. Choosing a different setting may improve network and/or system performance in certain configurations.

Interrupt Moderation RX Packet Count
Number of packets that need to be received before an interrupt is generated on the receive side (default: 5).

Interrupt Moderation RX Packet Time
Maximum elapsed time (in usec) between the receiving of a packet and the generation of an interrupt, even if the moderation count has not been reached (default: 10).

Rx Interrupt Moderation Type
Sets the rate at which the controller moderates or delays the generation of interrupts, making it possible to optimize network throughput and CPU utilization. The default setting (Adaptive) adjusts the interrupt rates dynamically, depending on the traffic type and network usage. Choosing a different setting may improve network and system performance in certain configurations.

Send completion method
Sets the completion method of the Send packets. It may affect network throughput and CPU utilization.

Interrupt Moderation TX Packet Count
Number of packets that need to be sent before an interrupt is generated on the send side (default: 0).

Interrupt Moderation TX Packet Time
Maximum elapsed time (in usec) between the sending of a packet and the generation of an interrupt, even if the moderation count has not been reached
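The interaction between the packet-count and packet-time settings above can be illustrated with a small simulation. The following Python sketch is a simplified model, not the adapter's actual algorithm: it assumes an interrupt fires when the pending packet count reaches the configured count, or when the configured time has elapsed since the first pending packet, whichever comes first. The function name is hypothetical, and a count of at least 1 is required here.

```python
from typing import Iterable, List


def interrupt_times(arrivals_us: Iterable[float], count: int,
                    max_wait_us: float) -> List[float]:
    """Return the times (usec) at which the model raises an interrupt."""
    if count < 1:
        raise ValueError("this model needs count >= 1")
    interrupts: List[float] = []
    pending = 0
    deadline = None  # time limit for the oldest pending packet
    for t in sorted(arrivals_us):
        if deadline is not None and t > deadline:
            # Timer expired before this packet arrived.
            interrupts.append(deadline)
            pending, deadline = 0, None
        if pending == 0:
            deadline = t + max_wait_us
        pending += 1
        if pending >= count:
            # Packet-count threshold reached.
            interrupts.append(t)
            pending, deadline = 0, None
    if deadline is not None:
        interrupts.append(deadline)  # flush the last pending packets
    return interrupts
```

With the RX defaults listed above (count 5, time 10 usec), a burst of five back-to-back packets produces one interrupt at the fifth packet, while isolated packets each wait out the 10 usec timer. Larger values trade latency for fewer interrupts.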
45. as configured is possible through any packet capturing utility. If configured correctly, an encapsulated packet should appear as a packet consisting of the following headers: Outer MAC, Outer IP, GRE Header, Inner MAC, Original Ethernet Payload.

3.9 Differentiated Services Code Point (DSCP)

DSCP is a mechanism used for classifying network traffic on IP networks. It uses the 6-bit Differentiated Services Field (DS or DSCP field) in the IP header for packet classification purposes. Using Layer 3 classification enables you to maintain the same classification semantics beyond the local network, across routers. Every transmitted packet holds the information allowing network devices to map the packet to the appropriate 802.1Qbb CoS. For DSCP-based PFC, the packet is marked with a DSCP value in the Differentiated Services (DS) field of the IP header.

3.9.1 Setting the DSCP in the IP Header

Marking the DSCP value in the IP header is done differently for IP packets constructed by the NIC (e.g. RDMA traffic) and for packets constructed by the IP stack (e.g. TCP traffic).

- For IP packets generated by the IP stack, the DSCP value is provided by the IP stack. The NIC does not validate the match between DSCP and Class of Service (CoS) values. CoS and DSCP values are expected to be set through standard tools, such as the PowerShell command New-NetQosPolicy using the PriorityValue8021Action and DSCPAction flags respectively.
- For IP packets generated by the NIC (RDMA), th
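To make the "6-bit field" statement in the DSCP description concrete, the following Python sketch packs and unpacks the DS byte of an IPv4 header. It relies only on the standard layout (DSCP in the upper six bits, ECN in the lower two); the function names are hypothetical and the example does not touch any NIC or driver setting.

```python
from typing import Tuple


def split_ds_byte(ds_byte: int) -> Tuple[int, int]:
    """Split the 8-bit DS byte into (dscp, ecn)."""
    if not 0 <= ds_byte <= 0xFF:
        raise ValueError("DS byte must be in 0..255")
    return ds_byte >> 2, ds_byte & 0x3


def make_ds_byte(dscp: int, ecn: int = 0) -> int:
    """Combine a 6-bit DSCP value and a 2-bit ECN value into the DS byte."""
    if not 0 <= dscp <= 0x3F:
        raise ValueError("DSCP must be in 0..63")
    if not 0 <= ecn <= 0x3:
        raise ValueError("ECN must be in 0..3")
    return (dscp << 2) | ecn
```

For instance, DSCP 46 (Expedited Forwarding) with ECN 0 appears on the wire as DS byte 0xB8, which is the value a packet capture would show for a packet marked this way.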
46. -c option to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.

After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
- SM report
- Number of nodes and systems
- Hop-count information: maximal hop-count, an example path, and a hop-count histogram
- All CA-to-CA paths traced
- Credit loop report
- mgid-mlid-HCAs multicast group and report
- Partitions report
- IPoIB report

In case the IB fabric includes only one CA, CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.

8.3.2.3 ibdiagnet Error Codes

1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package

8.3.3 ibportstate

Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port. If the queried port is a switch port, ibportstate can be used to:
- Disable, enable or reset the port
- Validate the port's link width and speed against the peer port

8
47. can group a set of ports inside a network adapter, or a number of physical network adapters, into virtual adapters that provide fault-tolerance and load-balancing functions. Depending on the teaming mode, one or more interfaces can be active. The non-active interfaces in a team are in standby mode and take over the network traffic in the event of a link failure in the active interfaces. All of the active interfaces in a team participate in load-balancing operations by sending and receiving a portion of the total network traffic.

3.5.1.1 Teaming (Bundle) Modes

1. Fault Tolerance
Provides automatic redundancy for the server's network connection. If the primary adapter fails, the secondary adapter (currently in standby mode) takes over. Fault Tolerance is the basis for each of the following teaming types and is inherent in all teaming modes.

2. Switch Fault Tolerance
Provides a failover relationship between two adapters when each adapter is connected to a separate switch.

3. Send Load Balancing
Provides load balancing of transmit traffic and fault tolerance. The load balancing is performed only on the send port.

4. Load Balancing (Send & Receive)
Provides load balancing of transmit and receive traffic and fault tolerance. The load balancing splits the transmit and receive traffic statically among the team adapters (without changing the base of the traffic loading) based on the source/destination MAC and IP addresses.
48. disabled, the system generates an interrupt each time a packet is received or sent. In this mode, CPU utilization increases as data rates increase, because the system handles a larger number of interrupts. However, the latency decreases, as each packet is handled faster.

Receive Side Scaling (RSS Mode)
Improves incoming packet processing performance. RSS enables the adapter port to utilize the multiple CPUs in a multi-core system for receiving incoming packets and steering them to the designated destination. RSS can significantly improve the number of transactions, the number of connections per second, and the network throughput. This parameter can be set to one of the following values:
- Enabled (default): Set RSS Mode.
- Disabled: The hardware is configured once to use the Toeplitz hash function, and the indirection table is never changed.
Note: IOAT is not used while in RSS mode.

Receive Completion Method
Sets the completion method of the received packets, and can affect network throughput and CPU utilization.
- Polling Method: Increases the CPU utilization, as the system polls the receive rings for incoming packets. However, it may increase network performance, as each incoming packet is handled faster.
- Interrupt Method: Optimizes CPU utilization, as it uses interrupts for handling incoming messages. However, in certain scenarios it can decrease the network throughput.
- Adaptive (Default Settings): A combination of the interrupt
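The RSS description above mentions the Toeplitz hash function and an indirection table. The following Python sketch shows the general shape of that mechanism as it is commonly described for RSS: each set bit of the input selects a 32-bit window of the secret key, the windows are XORed together, and low-order bits of the result index a table of CPU numbers. The key, the table contents, and both function names are illustrative assumptions; this is not the driver's implementation.

```python
from typing import Sequence


def toeplitz_hash(key: bytes, data: bytes) -> int:
    """32-bit Toeplitz hash of data under key (MSB-first bit order)."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):
            # XOR in the 32 key bits that start at bit offset i.
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def pick_cpu(hash_value: int, indirection_table: Sequence[int]) -> int:
    """Map a hash to a CPU using its low-order bits (table length is
    assumed to be a power of two, so modulo keeps only those bits)."""
    return indirection_table[hash_value % len(indirection_table)]
```

In practice the hash input is built from packet header fields such as the source and destination addresses and ports, so all packets of one flow land on the same CPU while different flows spread across the table.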
49. ibqueryerrors Synopsis: ibqueryerrors [options]
8.3.19.2 ibqueryerrors Options
Table 29 - ibqueryerrors Flags and Options (Flags: Description)
-s <err1,err2,...>: Suppresses the errors listed in the comma-separated list provided.
-c: Suppresses some of the common side-effect counters. These counters usually do not indicate an error condition and can usually be safely ignored.
-G <port_guid>, -S <port_guid>, --port-guid: Reports results for the port specified. For switches, results are printed for all ports, not just switch port 0. -S is the same as -G and is provided only for backward compatibility.
-D <direct_route>: Reports results for the port specified. For switches, results are printed for all ports, not just switch port 0.
Reports the port information. This includes LID, port, external port (if applicable), link speed setting, remote GUID, remote port, remote external port (if applicable), and remote node description information.
--data: Includes the optional transmit and receive data counters.
--threshold-file: Specifies an alternate threshold file. The default is /opt/ufm/files/conf/infiniband-diags/error_thresholds.
--switch: Prints data for switches only.
--ca: Prints data for CAs only.
--router: Prints data for routers only.
--clear-errors, -k
50. mands available through MLNX-OS, with explanations and examples.
1 Introduction
This User Manual addresses the Mellanox WinOF driver Rev 4.60 package. Mellanox WinOF is composed of several software modules that contain an InfiniBand and Ethernet driver. The Mellanox WinOF driver supports 10 or 40 Gb/s Ethernet and 40 or 56 Gb/s InfiniBand network ports. The port type is determined upon boot, based on card capabilities and user settings.
2 Firmware Upgrade
The adapter card may not have been shipped with the latest firmware version. The section below describes how to update firmware.
2.1 Downloading Firmware
To identify your adapter card, perform the following steps:
Step 1: Extract the HCA PSID. Run vstat.
Step 2: Download the latest firmware using the PSID from the step above. Go to http://www.mellanox.com > Support > Support Downloader.
[Screenshot: Support Downloader result for PSID MT_1090120019: ConnectX-3 VPI adapter card, dual-port QSFP, FDR IB (56Gb/s) and 40GigE, PCIe3.0 x8 8GT/s, RoHS R6; firmware file fw-ConnectX3-rel-2_11_0500-MCX354A-FCB_A2-A4.bin.zip]
Step 3: Unzip the binary image
51. on which you send packet only after you receive one Each of the sides samples the CPU clock each time they receive a send packet in order to calculate the latency 8 4 10 1ibv send lat Synopsys ibv send lat i b port ib port m tu mtu size s ize message size I nline size inline size type x gid index iteration num group p ort PDT port a 11 cycles H report histogram CPU freq fail 8 4 10 2ibv send lat Options c onnection type RC UC UD d ib device name Tel SEI exc ze u qp timeout S L sl e events use events n g num of qps in mcast V ersion C report U report unsorted F The table below lists the various flags of the command Table 42 ibv send lat Flags and Options Flag Description p port lt port gt d ib dev lt dev gt Listens on connect to port lt port gt default 18515 Uses IB device lt device guid gt default first device found i ib port lt port gt Uses port lt port gt of IB device default 1 m mtu lt mtu gt c connection lt RC UC UD gt The mtu size default 1024 Connection type RC UC UD default RC SiZe lt size gt The size of message to exchange default 65536 a all t tx depth lt dep gt Runs sizes from 2 till 223 The size of tx queue default 100 n iters lt iters gt The number of exchanges at least 2 default 1000 u qp
52. ping-pong benchmark in which one side RDMA-reads the memory of the other side only after the other side has read its memory. Each side samples the CPU clock every time it reads the other side's memory, in order to calculate latency. Read is available only in RC connection mode (as specified in the IB spec).
8.4.2.1 ib_read_lat Synopsis
ib_read_lat [-i(b_port) ib_port] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-n iteration_num] [-p(ort) PDT_port] [-o(uts) outstanding reads] [-a(ll)] [-V(ersion)] [-C report cycles] [-H report histogram] [-U report unsorted]
8.4.2.2 ib_read_lat Options
The table below lists the various flags of the command.
Table 34 - ib_read_lat Flags and Options (Flag: Description)
-p, --port=<port>: Listens on/connects to port <port> (default 18515).
-d, --ib-dev=<dev>: Uses IB device <device guid> (default: first device found).
-i, --ib-port=<port>: Uses port <port> of IB device (default 1).
-m, --mtu=<mtu>: The MTU size (default 1024).
-o, --outs=<num>: The number of outstanding read/atom (default 4).
-s, --size=<size>: The size of message to exchange (default 65536).
-a, --all: Runs sizes from 2 till 2^23.
-t, --tx-depth=<dep>: The size of tx queue (default 100).
-n, --iters=<iters>: The number of exchanges (at least 2, default 1000).
-C, --report-cycles: Reports times in CPU cycle units (default: microseconds).
-H, --repor
53. port lt port gt of IB device default 1 m mtu lt mtu gt The mtu size default 1024 Mellanox Technologies 114 Rev 4 60 Table 43 ibv write bw Flags and Options Flag c connection lt RC UC gt Description Connection type RC UC default RC size lt size gt The size of message to exchange default 65536 a all t tx depth lt dep gt Runs sizes from 2 till 2 23 The size of tx queue default 100 n iters lt iters gt The number of exchanges at least 2 default 1000 u qp timeout lt timeout gt S sl lt sl gt QP timeout The timeout value is 4 usec 2 timeout default 14 The service level default 0 x gid index lt index gt Test uses GID with GID index taken from command line for RDMAOE index should be 0 b bidirectional V version Measures bidirectional bandwidth default unidirectional Displays version number g post lt num of posts The number of posts for each qp in the chain default tx depth F CPU freq q qp lt num of qp s gt The CPU frequency test It is active even if the cpufreq_ondemand module is loaded The number of qp s default 1 I inline_size lt size gt The maximum size of message to be sent in inline mode default 0 N no peak bw R rdma_cm Cancels peak bw calculation default with peak bw Connect QPs with rdm
54. switch with portguid 0x000p8c 004016 gt ibroute G 0Ox000b8cffff004016 Unicast lids 0x0 0x8 of switch Lid 3 guid 0x000b8cffff004016 MT47396 Infiniscale II Mellanox Technologies Lid Out Destination Port Info x0002 023 Switch portguid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies x0003 000 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies x0006 023 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1 x0007 020 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1 x0008 024 Channel Adapter portguid 0x0002c902002582cd sw136 HCA 1 5 valid lids dumped 4 Dump all non empty mlids of switch with Lid 3 ibroute M 3 Multicast mlids 0xc000 0xc3ff of switch Lid 3 guid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies 0 1 2 Pese D 12 95 678911341571 901232 MLid xc000 X xc001 x xc002 x xc003 X xc020 x xc021 x xc022 X xc023 x xc024 x xc040 x xc041 x xc042 X 12 valid mlids dumped 8 3 5 ibdump The ibdump tool dumps InfiniBand Ethernet and all RoCE versions traffic that flows to and from Mellanox ConnectX 3 ConnectX 3 Pro NIC s ports It provides a similar functionality to the tepdump tool on a standard Ethernet port The ibdump tool generates packet dump file in Mellanox Technologies 71 J Rev 4 60 peap format This file can be loaded by the Wi
55. the Windows PowerShell execution policy:
PS Set-ExecutionPolicy AllSigned
Step 2: Remove the entire previous QoS configuration:
PS Remove-NetQosTrafficClass
PS Remove-NetQosPolicy -Confirm:$False
Step 3: Set the DCBX Willing parameter to false, as Mellanox drivers do not support this feature:
PS Set-NetQosDcbxSetting -Willing 0
Step 4: Create a Quality of Service (QoS) policy and tag each type of traffic with the relevant priority. In this example, TCP/UDP traffic uses priority 1 and ND/NDK traffic uses priority 3:
PS New-NetQosPolicy SMB -store Activestore -NetDirectPortMatchCondition 445 -PriorityValue8021Action 3
PS New-NetQosPolicy DEFAULT -store Activestore -Default -PriorityValue8021Action 3
PS New-NetQosPolicy TCP -store Activestore -IPProtocolMatchCondition TCP -PriorityValue8021Action 1
PS New-NetQosPolicy UDP -store Activestore -IPProtocolMatchCondition UDP -PriorityValue8021Action 1
Step 5 (Optional): If VLANs are used, mark the egress traffic with the relevant VlanID. The NIC is referred to as "Ethernet 4" in the examples below:
PS Set-NetAdapterAdvancedProperty -Name "Ethernet 4" -RegistryKeyword "VlanID" -RegistryValue <VLAN ID>
Step 6 (Optional): Configure the IP address for the NIC. If DHCP is used, the IP address is assigned automatically:
PS Set-NetIPInterface -InterfaceAlias "Ethernet 4" -DHCP Disabled
PS Remove-NetIPAddress -InterfaceAlias "Ethernet 4" -Address
56. the port is connected and OpenSM is running (logical link); PORT_ARMED when the port is connected to some other port (physical link).
3. Run sminfo and verify that OpenSM is running. If OpenSM is not running, see the OpenSM operation instructions in Section 7, "OpenSM - Subnet Manager," on page 61. Verify the status of ports by using vstat: all connected ports should report the PORT_ACTIVE state.
10.2 Ethernet Troubleshooting
Issue 1: The installation of Win OFED VPI for Windows fails with the following error message: "This installation package is not supported by this processor type. Contact your product vendor."
Suggestion: This message is printed if you have downloaded and attempted to install an incorrect driver version, for example, if you are trying to install a 64-bit driver on a 32-bit machine (or vice versa).
Issue 2: The performance is low.
Suggestion: This can be due to a non-optimal system configuration. See the section "Performance Tuning" to take advantage of Mellanox 40/10 Gbit NIC performance.
Issue 3: The driver does not start.
Suggestion 1: This can happen due to an RSS configuration mismatch between the TCP stack and the Mellanox adapter. To confirm this scenario, open the event log and look under "System" for the "mlx4ethX" source. If found, enable RSS as follows:
1. Run the following command: netsh int tcp set global rss=enabled
Suggestion 2: This is a less recommended s
57. the solicited MADs.
8.3.19.3 ibqueryerrors Exit Status
If a failure to scan the fabric occurs, return -1. If the scan succeeds without errors beyond thresholds, return 0. If errors are found on ports beyond thresholds, return 1.
8.3.19.4 ibqueryerrors Files
/opt/ufm/files/conf/infiniband-diags/error_thresholds
Defines threshold values for errors. The file format is simple: "name=val". Comments begin with "#".
Example:
# Define thresholds for error counters
SymbolErrorCounter=10
LinkErrorRecoveryCounter=10
VL15Dropped=100
8.3.20 ibsysstat
ibsysstat uses vendor MADs to validate connectivity between InfiniBand nodes and to obtain other information about the InfiniBand node. ibsysstat is run as client/server. The default is to run as client.
8.3.20.1 ibsysstat Synopsis
ibsysstat [-d(ebug)] [-e(rr_show)] [-v(erbose)] [-G(uid)] [-C ca_name] [-P ca_port] [-s smlid] [-t(imeout) timeout_ms] [-V(ersion)] [-o oui] [-S(erver)] [-h(elp)] <dest lid | guid> [<op>]
8.3.20.2 ibsysstat Options
Table 30 - ibsysstat Flags and Options (Flags: Description)
ping: Verifies connectivity to server (default).
host: Obtains host information from server.
cpu: Obtains CPU information from server.
-o, --oui: Uses specified OUI number to multiplex vendor MADs.
-S, --Server: Starts in server mode (do not return).
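The threshold file format is simple enough to check by hand, but a short sketch may help when generating or validating such files. It assumes each entry is written as `name=val` and that comments start with `#`; the function names `parse_thresholds` and `counters_over_threshold` and the sample counter values are hypothetical and are not part of ibqueryerrors.

```python
def parse_thresholds(text):
    """Parse 'name=val' lines into a dict, ignoring blank lines and '#' comments."""
    thresholds = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop trailing or full-line comments
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"line {lineno}: expected name=val, got {raw!r}")
        thresholds[name.strip()] = int(value.strip())
    return thresholds


def counters_over_threshold(counters, thresholds):
    """Return the names of counters whose value exceeds the configured threshold."""
    return sorted(name for name, value in counters.items()
                  if name in thresholds and value > thresholds[name])


SAMPLE_FILE = """\
# Define thresholds for error counters
SymbolErrorCounter=10
LinkErrorRecoveryCounter=10
VL15Dropped=100
"""

thresholds = parse_thresholds(SAMPLE_FILE)
# Hypothetical counter readings for one port.
exceeded = counters_over_threshold(
    {"SymbolErrorCounter": 25, "VL15Dropped": 3}, thresholds)
```

With these sample readings, only SymbolErrorCounter is over its threshold, which corresponds to the case where the tool reports errors for the port.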
58. them Ca C lt ca_name gt Port P lt ca_port gt Use the specified channel adapter or router Use the specified port Reset_only R Reset the counters timeout t lt timeout_ms gt version V Override the default timeout for the solicited MADs msec Show version info lt lid guid gt port reset mask LID or GUID extended x extended speeds T show extended port counters show port extended speeds counters perfquery r 32 1 oprcvcounters show Rcv Counters per Op code flowctlcounters show flow control counters vloppackets show packets received per Op code per VL vlopdata show data received per Op code per VL vlxmitflowctlerrors show flow control update errors per VL vlxmitcounters show ticks waiting to transmit counters per VL swportvlcong show sw port VL congestion ICvcc show Rcv congestion control counters slrevfecn show SL Rcv FECN counters slrcvbecn show SL Rcv BECN counters xmitcc show Xmit congestion control counters vlxmittimecc show VL Xmit Time congestion control counters Examples read performance counters and reset perfquery e r 32 1 read extended performance counters and reset perfquery R 0x20 1 reset performance counters of port 1 only perfquery e R 0x20 1 reset extended performance counters of port 1 only perfquery R a 32 reset performance counters of all ports
59. timeout lt timeout gt S sl lt sl gt QP timeout The timeout value is 4 usec 2 timeout default 14 The service level default 0 x gid index lt index gt Test uses GID with GID index taken from command line for RDMAoE index should be 0 C report cycles H report histogram Reports times in cpu cycle units default microseconds Print out all results default print summary only U report unsorted implies Print out unsorted results default sorted H V version Displays version number F CPU freq The CPU frequency test It is active even if the cpufreq_ondemand module is loaded Mellanox Technologies 113 Rev 4 60 Table 42 ibv send lat Flags and Options Flag Description g post lt num of posts The number of posts for each qp in the chain default tx depth I inline_size lt size gt The maximum size of message to be sent in inline mode default 0 e events Inactive during CQ events default poll Sends messages to multicast group with lt num_of_qps gt qps attached to it g mcg lt num_of_qps gt M MGID lt multicast_gid gt In case of multicast uses lt multicast_gid gt as the group MGID The format must be 255 1 X X XX X XC XXX XXX XC where X is a value within 0 255 You must specify a different MGID on both sides to avoid loopback R xrdma cm Connect QPs with rdma cm a
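The QP timeout option is an exponent, not a duration, which is easy to misread. The sketch below converts the option value to a time using the relation given in the table (4 usec multiplied by 2 to the power of the timeout value). The function name `qp_timeout_usec` is hypothetical, and the 4 usec base is taken from this manual as-is; treat it as an assumption if exact timing matters.

```python
def qp_timeout_usec(timeout, base_usec=4.0):
    """Convert the QP timeout exponent to microseconds: base_usec * 2**timeout."""
    if timeout < 0:
        raise ValueError("timeout exponent must be non-negative")
    return base_usec * (2 ** timeout)


default_usec = qp_timeout_usec(14)      # the documented default exponent
default_msec = default_usec / 1000.0    # 65536 usec, about 65.5 ms
one_step_up = qp_timeout_usec(15)       # each increment doubles the timeout
```

Because each step doubles the wait, raising the value from 14 to 18 already stretches the timeout from roughly 65 ms to about one second.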
60. to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.
-s, --stress: This option runs the specified stress test instead of the normal test suite. Stress test options are as follows (OPT: Description):
-s1: Single-MAD (RMPP) response SA queries
-s2: Multi-MAD (RMPP) response SA queries
-s3: Multi-MAD (RMPP) Path Record SA queries
-s4: Single-MAD (non-RMPP) get Path Record SA queries
Without -s, stress testing is not performed.
-M, --Multicast_Mode: This option specifies the length of the Multicast test (OPT: Description):
-M1: Short Multicast Flow (default), single mode
-M2: Short Multicast Flow, multiple mode
-M3: Long Multicast Flow, single mode
-M4: Long Multicast Flow, multiple mode
Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC. Multiple mode: could be run with other apps using MC with OpenSM. Without -M, default flow testing is performed.
-t: This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
-l, --log_file: This option defines the log to be the given file. By default, the log goes to stdout.
-v: This option increases the log verbosity level. The -v option may be specified multiple tim
61. 0 HR HERO GL T 0 UNKOWN Aiea E T mie 0 Mellanox Technologies 78 J Rev 4 60 REVELI OTS OPI OTI ROO OTDIERG RevRenotePhys Erron SU ets AG EIER coo 0090660006 Kme DRS armos P 3 MMEC ONS ceraint Errors eeek a 0 REVC ONSE RAIN EM EE OSI ee 0 Mnk incegrityErrorse nce I 0 PXCBULOVEE HEG yale eles 0 MIy 5Droppedtsemes 9e deeds reels 0 Xm Dabam ee teme eie SIS Se TIS SR 0 REVDAT Annae e Od OU QUU OU QUOD GET REVERE SEa e E 8 3 8 ibping ibping uses vendor MADs to validate connectivity between IB nodes On exit IP ping like out put is shown ibping is run as client server however the default is to run it as a client Note also that in addition to ibping a default server is implemented within the kernel 8 3 8 1 ibping Synopsys ibping d ebug e rr show v erbose G uid C ca name P ca port s smlid t imeout timeout_ms V ersion L id u sage c ping count f lood o oui S erver h elp dest lid guid 8 3 8 2 ibping Options The table below lists the various flags of the command Table 18 ibping Flags and Options Flag Description count c num Stops after count packets f flood Floods destination send packets back to back without delay 0 oui Uses specified OUI number to multiplex vendor mads Server S Starts in server mode do not return debug d ddd d d d Raises the IB debugging level errors
62. 0005001 VirtualSubnetID 5001 DestinationPrefix 172 16 0 0 16 NextHop 0 0 0 0 Metric 255 Mellanox Technologies 133 B 3 Rev 4 60 Step 3 Configure the Provider Address and Route records on Hyper V Host 1 Host 1 Only mtlael4 SNIC Get NetAdapter Port1 ew NetVirtualizationProviderAddress InterfaceIndex NIC InterfaceIndex Pro viderAddress 192 168 20 114 PrefixLength 24 ew NetVirtualizationProviderRoute InterfaceIndex NIC InterfaceIndex Destination Prefix 0 0 0 0 0 NextHop 192 168 20 1 Step 5 Configure the Virtual Subnet ID on the Hyper V Network Switch Ports for each Virtual Machine on each Hyper V Host Host 1 and Host 2 Run the command below for each VM on the host the VM is running on it i e the for mtlael4 005 mtlael4 006 on host 192 168 20 114 and for VMs mtlael5 005 mtlael5 006 on host 192 168 20 115 mtlael4 only Get VMNetworkAdapter VMName mtlae14 005 where MacAddress eq 00155D720100 Set VMNetworkAdapter VirtualSubnetID 5001 Get VMNetworkAdapter VMName mtlae14 006 where MacAddress eq 00155D720101 Set VMNetworkAdapter VirtualSubnetID 5001 B 2 Adding NVGRE Configuration to Host 15 Example The following is an example of adding NVGRE to Host 15 On both sides vSwitch create command Note tha
63. 1 ISR9024 Vol taire lid 6 4xSDR 1 8 10403960559 S 005442ba00003080 12 lid 10 lmc 1 ISR9024 Voltaire lid 6 1xSDR Node Name Map File Format The node name map is used to specify user friendly names for nodes in the output GUIDs are used to perform the lookup comment lt guid gt lt name gt Mellanox Technologies 83 J Rev 4 60 Example IBl Line cards 0x0008f 104003 125c IB1 Rack slot 1 ISR9288 ISR9096 Voltaire sLB 24D 0x0008 104003 125d IB1 Rack slot 1 ISR9288 ISR9096 Voltaire sLB 24D 0x0008 104003 10d2 IB1 Rack slot 2 ISR9288 ISR9096 Voltaire sLB 24D 0x0008 104003 10d3 IB1 Rack slot 2 ISR9288 ISR9096 Voltaire sLB 24D x0008f104003f10bf IB1 Rack slot 12 ISR9288 ISR9096 Voltaire sLB 24D Spines x0008 10400400e2d IB1 Rack spine 1 ISR9288 Voltaire sFB 12D x0008 10400400e2e IB1 Rack spine 1 ISR9288 Voltaire sFB 12D x0008 10400400e2f IB1 Rack spine 1 ISR9288 Voltaire sFB 12D x0008 10400400e31 IB1 Rack spine 2 ISR9288 Voltaire sFB 12D 0x0008 10400400e32 IB1 Rack spine 2 ISR9288 Voltaire sFB 12D GUID ode Name 0x0008f 10400411a08 SW1 Rack 3 ISR9024 Voltaire 9024D 0x0008f 10400411a28 SW2 Rack 3 ISR9024 Voltaire 9024D 0x0008 10400411a34 SW3 Rack 3 ISR9024 Voltaire 9024D 0x0008 104004119d0 SW4 Rack 3 ISR9024 Voltaire 9024D 8 3 10 ibtracert ibtracert uses SMPs to trace the pat
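A node name map is a plain lookup table from GUID to a friendly name, so it can be read with a few lines of code. The sketch below assumes the layout used by the infiniband-diags tools: `#` starts a comment line, and each entry is a hexadecimal GUID followed by whitespace and a name that may be enclosed in double quotes. The function name `parse_node_name_map` and the sample entries are illustrative only.

```python
def parse_node_name_map(text):
    """Return {guid_as_int: name} from node-name-map text."""
    names = {}
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)  # GUID, then the rest of the line
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected '<guid> <name>'")
        guid = int(parts[0], 16)     # accepts an optional 0x prefix
        names[guid] = parts[1].strip().strip('"')
    return names


SAMPLE_MAP = """\
# GUID   Node Name
0x0008f10400411a08 "SW1 (Rack 3) - ISR9024 Voltaire 9024D"
0x0008f10400411a28 "SW2 (Rack 3) - ISR9024 Voltaire 9024D"
"""

node_names = parse_node_name_map(SAMPLE_MAP)
friendly = node_names.get(0x0008f10400411a08, "unknown node")
```

Storing the GUID as an integer means lookups succeed regardless of letter case or leading zeros in the file.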
64. 2 5 Gbps or 5 0 Gbps IiinkSpeedrnaledi eee eR T I 2 5 Gbps or 5 0 Gbps IsmkopeedAGMVel IH MES 5 0 Gbps 2 Query the status of two channel adapters using directed paths gt ibportstate C mlx4 0 D 0 1 PortInfo Port info DR path slid 65535 dlid 65535 0 port 1 RE SG Initialize TEH LinkUp utri k Wage SUDO Ee disi T teres 1X or 4X MORNA NENIO E PR T M 1X or 4X SENS IDT G cers AX SEE GT ETT I 2 5 Gbps or 5 0 Gbps Wunkspeedknableokd e edd Td ts 2 5 Gbps or 5 0 Gbps THRG O I OIN ee IRIS 5 0 Gbps gt ibportstate C mthca0 D 0 1 PortInfo Port info DR path slid 65535 dlid 65535 0 port 1 MANKO tA CERS e eod ramen IUIS Down PhyslunkScate te ee eee see eles Polling MNW bU DOSES CREE T T 1X or 4X IumikWargetilinable kde dT TS 1X or 4X EXnkWarehNe Suy fert MM AX LinkSpeedSupported es ZI SS GS LinkSpeedEnabled 2 5 Gbps Minko pee GT UN 2 5 Gbps 3 Change the speed of a port First query for current configuration gt ibportstate C mlx4 0 D 0 1 PortInfo Port info DR path slid 65535 dlid 65535 0 port 1 Vilbel SABES COO ange Ubon aan UIS Initialize Plyslinkouabc ee Tu OE TT CI LinkUp Ern ewige hi dps eC eR I areata 1X or 4X SENT epe IOS sso aao soon 1X or 4X Iurrrk Che WNC oc NI ETIN LUE AX LinkSpeedSupported 4 2 5 Gbps or 5 0 Gbps unkSpeedknablleg immanent os 2 5 Gbps or 5 0 Gbps LUNKOPCECA CAV Och rns sep HIE 5 0 Gbps Now change the enabled link speed gt i
65. 3 s sm port lt smlid gt Use lt smlid gt as the target LID for SM SA queries V version Show version info L Lid se Lid address argument c combined se combined route address argument u usage C Ca lt ca_name gt se the specified channel adapter or router P Port lt ca_port gt U U Usage message U U se the specified port t timeout lt timeout_ms gt Override the default timeout for the solicited MADs msec lt op gt Supported operations NodeInfo NI lt addr gt NodeDesc ND lt addr gt PortInfo PI lt addr gt lt portnum gt SwitchInfo SI lt addr gt PKeyTable PKeys lt addr gt lt portnum gt e SL2VLTable SL2VL lt addr gt lt portnum gt e VLArbitration VLArb lt addr gt lt portnum gt GUIDInfo GI lt addr gt Rev 4 60 Table 16 smpquery Flags and Options Flag Description dest dr path lid guid Destination s directed path LID or GUID node name map lt file gt Node name map file x extended Use extended speeds Examples 1 Query PortInfo by LID with port modifier gt smpquery portinfo 1 1 Port info Lid 1 port 1 MKE E E sS 0x0000000000000000 Chelsie c aee Eom A E RO 0xf e80000000000000 Tides sere aee es e RISUS S TURTEUR S eUSIR e o een 0x0001 SMS e a E E TEL SECUS 0x0001 CADMAS koneerne eri e Ie Un ee RES 0x251086a sSM s
66. DISABLE SOCK 1 A 5 Configuring MPI Step 1 Configure all the hosts in the cluster with identical PFC see the PFC example below Step 2 Run the WHCK ND based traffic tests to Check PFC ndrping ndping ndrpingpong ndping pong Step 3 Validate PFC counters during the run time of ND tests with Mellanox Adapter QoS Coun ters in the perfmon Step 4 Install the same version of HPC Pack in the entire cluster NOTE Version mismatch in HPC Pack 2012 can cause MPI to hung Step 5 Validate the MPI base infrastructure with simple commands such as hostname A 5 1 PFC Example In the example below ND and NDK go to priority 3 that configures no drop in the switches The TCP UDP traffic directs ALL traffic to priority 1 Install dcbx and remove any previous settings Install WindowsFeature Data Center Bridging e Remove NetQosTrafficClass Remove NetQosPolicy Confirm False Set NetQosDobxsSetting Willing 0 e New NetQosPolicy SMB NetDirectPortMatchCondition 445 Priority Value8021 Action 3 e New NetQosPolicy DEFAULT Default Priority Value802 1 Action 3 e New NetQosPolicy TCP IPProtocolMatchCondition TCP Priority Value802 1 Action1 Mellanox Technologies 131 Rev 4 60 e New NetQosPolicy UDP IPProtocolMatchCondition UDP Priority Value8021 Action 1 Enable NetQosFlowControl 3 Duisable NetQosFlowControl 0 1 2 4 5 6 7 Enable netadapterqos Name A 5 2 Running MPI Co
67. Family IPv4 -Confirm:$false
PS New-NetIPAddress -InterfaceAlias "Ethernet 4" -IPAddress 192.168.1.10 -PrefixLength 24 -Type Unicast
Step 7 (Optional): Set the DNS server (assuming its IP address is 192.168.1.2):
PS Set-DnsClientServerAddress -InterfaceAlias "Ethernet 4" -ServerAddresses 192.168.1.2
After establishing the priorities of ND/NDK traffic, the priorities must have PFC enabled on them.
Step 8: Disable Priority Flow Control (PFC) for all priorities except 3:
PS Disable-NetQosFlowControl 0,1,2,4,5,6,7
Step 9: Enable QoS on the relevant interface:
PS Enable-NetAdapterQos -InterfaceAlias "Ethernet 4"
Step 10: Enable PFC on priority 3:
PS Enable-NetQosFlowControl -Priority 3
To add the script to the local machine startup scripts:
Step 1: From PowerShell, invoke gpedit.msc.
Step 2: In the pop-up window, under the "Computer Configuration" section, perform the following:
1. Select Windows Settings.
2. Select Scripts (Startup/Shutdown).
3. Double-click Startup to open the Startup Properties.
4. Click Add.
5. Browse for the script's location.
6. Click OK.
6 Performance Tuning
This section describes how to modify Windows registry parameters in order to improve performance. Please note that modifying the registry incorrectly might lead to serious problems, including loss of data or a system hang, and you may need to reinstall Windows. As such it is reco
68. Features
The Mellanox VPI WinOF driver release introduces the following capabilities: support for single- and dual-port adapters; up to 16 Rx queues per port; Rx steering mode (RSS); hardware Tx/Rx checksum calculation; Large Send off-load (i.e., TCP Segmentation Off-load); hardware multicast filtering; adaptive interrupt moderation; support for MSI-X interrupts; support for Auto-Sensing of the link-level protocol. Ethernet only: hardware VLAN filtering; Header Data Split; RDMA over Converged Ethernet (RoCE); DSCP over IPv4; NVGRE hardware off-load in ConnectX-3 Pro; ports TX arbitration/bandwidth allocation per port.
For the complete list of Ethernet and InfiniBand Known Issues and Limitations, see the WinOF Release Notes (www.mellanox.com > Products > InfiniBand/VPI Drivers > Windows SW/Drivers).
3.1 Hyper-V with VMQ
Mellanox WinOF Rev 4.60 includes a Virtual Machine Queue (VMQ) interface to support Microsoft Hyper-V network performance improvements and security enhancement. The VMQ interface supports: classification of received packets by using the destination MAC address to route the packets to different receive queues; NIC ability to use DMA to transfer packets directly to a Hyper-V child partition's shared memory; scaling to multiple processors by processing packets for different virtual machines on different processors.
To enable Hyper-V with VMQ using the UI:
Step 1: Open Hyper-V Manager.
St
69. MT23108 InfiniHost Mellanox Technologies lid 10 4xSDR vendid 0x8f1 devid 0x5a05 switchguid 0x8f 10400410015 8 10400410015 Switch 8 S 0008f10400410015 SW 61B4 Voltaire base port 0 lid 3 lmc 0 6 H 0008 10403960984 1 8 10403960985 MT23108 InfiniHost Mellanox Technologies lid 16 4xSDR 4 H 005442b100004900 1 54420100004901 MT23108 InfiniHost Mellanox Technologies lid 12 4xSDR 1 S 005442ba00003080 10 ISR9024 Voltaire lid 6 1xSDR Mellanox Technologies 82 J Rev 4 60 3 S 005442ba00003080 6 ISR9024 Voltaire lid 6 4xSDR vendid 0x2c9 devid 0x5a44 caguid 0x8f10403960984 Ca 2 H 0008 10403960984 MT23108 InfiniHost Mellanox Technologies 1 8610403960985 0008 10400410015 6 lid 16 Imc 1 SW 61B4 Vol taire lid 3 4xSDR vendid 0x2c9 devid 0x5a44 caguid 0x5442b100004900 Ca 2 H 005442b100004900 MT23108 InfiniHost Mellanox Technologies 1 5442b100004901 0008 10400410015 4 lid 12 lmc 1 SW 61B4 Vol taire lid 3 4xSDR vendid 0x2c9 devid 0x5a44 caguid 0x8 10403961354 Ca 2 H 0008 10403961354 MT23108 InfiniHost Mellanox Technologies 1 8 10403961355 S 005442ba00003080 22 lid 4 lmc 1 ISR9024 Voltaire lid 6 4xSDR vendid 0x2c9 devid 0x5a44 caguid 0x8 10403960558 Ca 2 H 0008 10403960558 MT23108 InfiniHost Mellanox Technologies 2 8 1040396055a S 005442ba00003080 8 lid 14 lmc
70. Mellanox Technologies. Mellanox WinOF VPI User Manual, Rev 4.60. Last Modified: 16 March 2014. www.mellanox.com
NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Mellanox
72. TrapSupported sAutomaticMigrationSupported sSLMappingSupported sSystemImageGUIDsupported sCommunicatonManagement Supported sVendorClassSupported sCapabilityMaskNoticeSupported sClientRegistrationSupported Daag God elses cede serie ete S ERU THER 0x0000 ke yleasSePeruod spss secre seve 0 iO CANES mend deo d OO qUTO OO i LonkWadehEnabllede eater 1X or 4X USHA iuo ST T aoo aeo e goo n 1X or 4X BERT ARR LE G R AX nM KS PCO SUD DO REE i TS 2 5 Gbps or 5 0 Gbps RETETE E osa de roe rare E E Active E TT n ooooconondacoomobeo LinkUp amie ownP eas disce ET ETT TTE Polling Prore Da E 0 MORBOS EIUS ODIO ODIO 0 LinkSpeedAetivet ace nee eneee 5 0 Gbps TEE GH TG IT DITS 2 5 Gbps or 5 0 Gbps RESIST TE 2048 SMG ro et OTHO OO UOI OO QINOOR DDR S 0 VECA cT VL0 7 RSE oe eere ii ERE eie aee 0x00 VERNO MEMBER o eren cao me da E 0 5 6 4 VAr OARA O oo estro nib oo eoa e mere 8 MMBASDSOWESD ee d IU IUIS 8 r R VE ESTO A AA 0x00 Mellanox Technologies 74 J Rev 4 60 2 Query SwitchInfo by GUID 3 Query NodelInfo by direct route Mellanox Technologies 75 J Rev 4 60 Pasto dpud O O cota ieee 128 Devils TU EIE STIS 0x634a ROVS ONE EET 0x000000a0 li oye BEBE R 5 cot eade E OSCURO OE i Vendorid s eese eere tele RES ES 0x0002c9 8 3 7 perfquery Queries InfiniBand ports performance and error counters Optionally it displays aggregated counters for all ports of a node It can also reset c
73. Updated the following sections Section 3 7 2 2 Configuring Windows Host on page 31 Updated the example in Step 5 Section 6 1 4 1 Performance Tuning Tool Appli cation on page 48 Updated the Options table Section 62 Application Specific Optimization and Tuning on page 51 Removed the Bus master DMA Operations Section 7 OpenSM Subnet Manager on page 61 Added an option of how to register OpemSM via the PowerShell Section 3 8 2 Configuring the NVGRE using PowerShell on page 35 Added the following sections Section 5 3 Configuring Quality of Service QoS on page 44 Appendix B NVGRE Configuration Scrips Examples on page 133 Rev 4 55 December 15 2013 Updated the following sections Section 3 8 Network Virtualization using Generic Routing Encapsulation on page 33 Section 3 8 2 Configuring the NVGRE using PowerShell on page 35 November 07 2013 Updated the following sections e Section 3 7 2 2 Configuring Windows Host on page 31 Section 8 4 19 1 NTttcp Synopsys on page 123 October 03 2013 Added support for Windows Server 2012 R2 Mellanox Technologies 8 J Rev 4 60 Table 1 Document Revision History Document Revision Date Changes Rev 4 40 July 17 2013 Updated the following sections Section 3 7 1 RoCE Overview on page 30 Section 7 OpenSM Subnet Manager on page 61 Sec
74. _read_bw runs with all mes sage sizes from 1B to 4MB powers of 2 message inlining CQ moderation Mellanox Technologies 118 Rev 4 60 8 4 15 1 nd read bw Synopsys running on specific single core Server side start b affinity 0X1 nd read bw s1048576 D10 S 11 137 53 1 Client side start b wait affinity 0X1 nd read bw s1048576 D10 C 11 137 53 1 8 4 15 2 nd read bw Options The table below lists the various flags of the command Table 47 nd read bw Options Flags Description h Shows the Help screen V Shows the version number p Connects to the port lt port gt lt default 6830 gt s msg size Exchanges the message size with default 65536B gt and it must not be combined with a flag a Runs all the messages sizes from 1B to 8MB and it must not be combined with s flag n lt num of iterations gt The number of exchanges at least 2 the default is 100000 I max inline size The maximum size of message to send inline The default number is 128B D test duration in seconds Tests duration in seconds f margin time in seconds The margin time to avoid calculation and it must be less than half of the duration time Q CQ Moderation lt value gt The default number is 100 S server interface IP server side only must be last parameter C server interface IP gt client side only must be last
75. a_cm and run test on those QPs
  -z, --com_rdma_cm    Communicate with rdma_cm module to exchange data - use regular QPs
  -Q, --cq-mod         Generate Cqe only after <cq-mod> completion
8.4.12 ibv_write_lat
This is a more advanced version of ib_write_lat. It contains more flags and features than the older version, as well as improved algorithms. ibv_write_lat calculates the latency of an RDMA write operation of a given message size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which one side RDMA-writes to the other side's memory only after the other side has written to its memory. Each side samples the CPU clock each time it writes to the other side's memory in order to calculate latency.
8.4.12.1 ibv_write_lat Synopsis
  ibv_write_lat [-i(b_port) ib_port] [-c(onnection_type) RC\UC\UD] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-I(nline_size) inline size] [-u qp timeout] [-d ib_device name] [-S(L) sl type] [-x gid index] [-n iteration_num] [-p(ort) PDT_port] [-a(ll)] [-V(ersion)] [-C report cycles] [-H report histogram] [-U report unsorted]
8.4.12.2 ibv_write_lat Options
The table below lists the various flags of the command.
Table 44 - ibv_write_lat Flags and Options
  Flag                 Description
  -p, --port <port>    Listens on/connects to port <port> (default 18515)
  -d, --ib-dev l
76. acceptable to inet_pton(3).
  -C <ca_name>                        Uses the specified ca_name.
  -P <ca_port>                        Uses the specified ca_port.
  --smkey <val>                       Uses the SM Key value for the query. Used only with "trusted" queries. If a non-numeric value (like 'x') is specified, saquery prompts for a value.
Table 31 - saquery Flags and Options (continued)
  Flag                                Description
  -t, --timeout <msec>                Specifies the SA query response timeout in milliseconds. The default is 100 milliseconds. You may want to use this option if IB_TIMEOUT is indicated.
  --node-name-map <node-name-map>     Specifies a node name map. The node name map file maps GUIDs to more user-friendly names. See ibnetdiscover(8) for the node name map file format. Only used with the -O and -U options.
Supported query names (and aliases):
  ClassPortInfo (CPI)
  NodeRecord (NR) [lid]
  PortInfoRecord (PIR) [[lid]/[port]/[options]]
  SL2VLTableRecord (SL2VL) [[lid]/[in_port]/[out_port]]
  PKeyTableRecord (PKTR) [[lid]/[port]/[block]]
  VLArbitrationTableRecord (VLAR) [[lid]/[port]/[block]]
  InformInfoRecord (IIR)
  LinkRecord (LR) [[from_lid]/[from_port]] [[to_lid]/[to_port]]
  ServiceRecord (SR)
  PathRecord (PR)
  MCMemberRecord (MCMR)
  LFTRecord (LFTR) [[lid]/[block]]
  MFTRecord (MFTR) [[mlid]/[position]/[block]]
  GUIDInfoRecord (GIR) [[lid]/[block]]
  -d    Enables debugging.
  -h    Shows help.
8.3.22 smpdump
smpdump i
77. all ports is identical.
  EQ overflows                          Number of EQ overflows. NOTE: this value is evaluated for the entire NIC, since there are cases where an EQ might be associated with both ports (i.e. the value on all ports is identical).
  Bad doorbells                         Number of bad DoorBells.
  Responder duplicate request received  Number of duplicate requests received when the local machine receives inbound traffic (pending firmware implementation).
  Requester time out received           Number of time-outs received when the local machine generates outbound traffic (pending firmware implementation).
6.4.1.3 Proprietary Mellanox QoS Counters
The proprietary Mellanox QoS counter set consists of flow statistics per (VLAN) priority. Each QoS policy is associated with a priority. The counter presents the priority's traffic and pause statistics.
Table 10 - Mellanox QoS Counters
  Mellanox QoS Counters    Description
  Bytes/Packets IN
  Bytes Received           The number of bytes received that are covered by this priority. The counted bytes include framing characters (modulo 2^64).
  Bytes Received/Sec       The number of bytes received per second that are covered by this priority. The counted bytes include framing characters.
  Packets Received         The number of packets received that are covered by this priority (modulo 2^64).
  Packets Received/Sec     The number of packets received per second that are covered by this priority.
  Bytes/Packets OUT
  Bytes Sent               The number of byte
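The "modulo 2^64" remark in the counter descriptions above means that a cumulative counter can wrap back to zero. The following is a minimal Python sketch of how a monitoring script might compute the difference between two samples in that case. The function name `counter_delta` and the sample values are illustrative assumptions, not part of WinOF, and the sketch assumes the counter wraps at most once between samples.

```python
# Hypothetical helper: difference between two samples of a cumulative
# counter that is reported modulo 2**64 (as the table above states).
COUNTER_MODULUS = 2 ** 64


def counter_delta(previous: int, current: int) -> int:
    """Return how much the counter advanced between two samples.

    Assumes the counter wrapped at most once between the samples.
    """
    return (current - previous) % COUNTER_MODULUS


if __name__ == "__main__":
    # Ordinary case: no wrap between the samples.
    print(counter_delta(100, 250))
    # Wrap case: the counter passed 2**64 - 1 and restarted from zero.
    print(counter_delta(2 ** 64 - 10, 5))
```

Dividing such a delta by the sampling interval gives a rate comparable to the per-second counters in the same table.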
78. ant Ethernet adapter and select Properties.
Step 4. Select the Advanced tab.
Step 5. Modify the performance parameters (properties) as desired.
6.2.1.1 Performance Known Issues
On Intel I/OAT supported systems, it is highly recommended to install and enable the latest I/OAT driver (download from www.intel.com). With I/OAT enabled, sending 256-byte messages or larger activates I/OAT. This causes a significant latency increase due to the I/OAT algorithms. On the other hand, throughput increases significantly when using I/OAT.
6.2.2 IPoIB Performance Tuning
The user can configure the IPoIB adapter by setting some registry keys. The registry keys may affect IPoIB performance. For the complete list of registry entries that may be added or changed by the performance tuning procedure, see MLNX_VPI_WinOF Registry Keys at the following path:
http://www.mellanox.com/page/products_dyn?product_family=32&mtag=windows_sw_drivers
To improve performance, activate the performance tuning tool as follows:
Step 1. Start the Device Manager (open a command line window and enter: devmgmt.msc).
Step 2. Open Network Adapters.
Step 3. Right-click the relevant IPoIB adapter and select Properties.
Step 4. Select the Advanced tab.
Step 5. Modify the performance parameters (properties) as desired.
6.3 Tunable Performance Parameters
The following is a list of key parameters for performance tuning:
  • Jumbo Packet: The maximum available size o
79. are that significantly reduces performance.
3.8.2 Configuring the NVGRE using PowerShell
Hyper-V Network Virtualization policies can be centrally configured using PowerShell 3.0 and PowerShell Remoting.
Step 1. Create a vSwitch:
  New-VMSwitch <vSwitchName> -NetAdapterName <EthInterfaceName> -AllowManagementOS $true
Step 2. Shut down the VMs:
  Stop-VM -Name <VM Name> -Force -Confirm
Step 3. Configure the Virtual Subnet ID on the Hyper-V Network Switch Ports for each Virtual Machine on each Hyper-V Host (Host 1 and Host 2):
  Add-VMNetworkAdapter -VMName <VMName> -SwitchName <vSwitchName> -StaticMacAddress <StaticMAC Address>
Step 4. Configure the Provider Address and Route records on Hyper-V Host 1 (Host 1 only):
  New-NetVirtualizationProviderRoute -InterfaceIndex <NIC InterfaceIndex> -DestinationPrefix <dest prefix> -NextHop <nexthopvalue>
Step 5. Configure a Subnet Locator and Route records on each Hyper-V Host (Host 1 and Host 2):
  New-NetVirtualizationLookupRecord -CustomerAddress <VMInterfaceIPAddress 1/n> -ProviderAddress <HypervisorInterfaceIPAddress1> -VirtualSubnetID <virtualsubnetID> -MACAddress <VMmacaddress1> -Rule "TranslationMethodEncap"
  New-NetVirtualizationLookupRecord -CustomerAddress <VMInterfaceIPAddress 2/n> -ProviderAddress <HypervisorInterfaceIPAddress2> -VirtualSubnetID <virtualsubnetID> -MACAddress <VMmacaddress2> -Rule T
80. at are not consecutive.
  Requester resync                              Number of resync operations when the local machine generates outbound traffic.
  Responder resync                              Number of resync operations when the local machine receives inbound traffic.
  Requester Remote operation errors             Number of remote operation errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end encountered an error that prevented it from completing the request.
  Requester transport retries exceeded errors   Number of transport retries exceeded errors when the local machine generates outbound traffic.
  Requester RNR NAK retries exceeded errors     Number of RNR (Receiver Not Ready) NAK retries exceeded errors when the local machine generates outbound traffic.
  Bad multicast received                        Number of bad multicast packets received.
Table 9 - Mellanox Adapter Diagnostics Counters (continued)
  Mellanox Adapter Diagnostics Counters         Description
  Discarded UD packets                          Number of UD packets silently discarded on the receive queue due to lack of receive descriptors.
  Discarded UC packets                          Number of UC packets silently discarded on the receive queue due to lack of receive descriptors.
  CQ overflows                                  Number of CQ overflows. NOTE: this value is evaluated for the entire NIC, since there are cases where a CQ might be associated with both ports (i.e. the value on
81. ation and no current connections.
4.4 Verifying SMB Events that Confirm RDMA Connection
To confirm the RDMA connection, verify the SMB events:
Step 1. Open a PowerShell window on the SMB client.
Step 2. Run the following cmdlets. NOTE: Any RDMA-related connection errors will be displayed as well.
  Get-WinEvent -LogName Microsoft-Windows-SMBClient/Operational | ? Message -match "RDMA"
5 Driver Configuration
Once you have installed the Mellanox WinOF VPI package, you can perform various modifications to your driver to make it suitable for your system's needs.
NOTE: Changes made to the Windows registry take effect immediately, and no backup is automatically made. Do not edit the Windows registry unless you are confident regarding the changes.
5.1 Configuring the InfiniBand Driver
5.1.1 Modifying IPoIB Configuration
To modify the IPoIB configuration after installation, perform the following steps:
Step 1. Open Device Manager and expand Network Adapters in the device display pane.
Step 2. Right-click the Mellanox IPoIB Adapter entry and left-click Properties.
Step 3. Click the Advanced tab and modify the desired properties.
NOTE: The IPoIB network interface is automatically restarted once you finish modifying IPoIB parameters. Consequently, it might affect any running traffic.
5.1.2 Displaying Adapter Related Information
To display a summary of network adapter software firmware
82. ation or script. A reboot may be required for the changes to take effect.
6.1.4 Tuning the Ethernet Network Adapter
General tuning of the Ethernet network adapter can be performed during installation by modifying some Windows registry entries, as explained in section "Registry Tuning" on page 32. Tuning for specific scenarios can be set manually after installation.
To improve the network adapter performance, activate the performance tuning tool as follows:
Step 1. Start the Device Manager (open a command line window and enter: devmgmt.msc).
Step 2. Open Network Adapters.
Step 3. Select the Mellanox Ethernet adapter, right-click, and select Properties.
Step 4. Select the Performance tab.
Step 5. Choose one of the tuning scenarios:
  • Single port traffic - Improves performance for running single-port traffic each time.
  • Single stream traffic - Optimizes tuning for applications with a single connection.
  • Dual port traffic - Improves performance for running traffic on both ports simultaneously.
  • Forwarding traffic - Improves performance for running scenarios that involve both ports (for example, via IXIA).
  • Multicast traffic - Improves performance when the main traffic runs on multicast.
Step 6. Click the Run Tuning button.
[Figure: adapter Properties dialog, Performance tab, showing the Performance Tuning Tool and its tuning scenarios]
83. bportstate -C mlx4_0 -D 0 1 speed 2
ibportstate -C mlx4_0 -D 0 1 speed 2
Initial PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkSpeedEnabled:................2.5 Gbps
After PortInfo set:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkSpeedEnabled:................5.0 Gbps (IBA extension)
Show the new configuration:
> ibportstate -C mlx4_0 -D 0 1
PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkState:.......................Initialize
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:................5.0 Gbps (IBA extension)
LinkSpeedActive:.................5.0 Gbps
8.3.4 ibroute
Uses SMPs to display the forwarding tables for unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range of 1 to FDBTop.
8.3.4.1 ibroute Applicable Hardware
InfiniBand switches.
8.3.4.2 ibroute Synopsis
  ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-L] [-e] [-u] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]
8.3.4.3 ibroute Options
The table below lists the various flags of the ibroute command.
Table 14 - ibroute Flags and Options
84. can be set manually after installation.
To improve the network adapter performance, activate the performance tuning tool as follows:
Step 1. Start the Device Manager (open a command line window and enter: devmgmt.msc).
Step 2. Open Network Adapters.
Step 3. Select the Mellanox IPoIB adapter, right-click, and select Properties.
Step 4. Select the Performance tab.
Step 5. Choose one of the tuning scenarios:
  • Single port traffic - Improves performance for running single-port traffic each time.
  • Dual port traffic - Improves performance for running traffic on both ports simultaneously.
  • Forwarding traffic - Improves performance for running scenarios that involve both ports (for example, via IXIA).
  • Multicast traffic - Improves performance when the main traffic runs on multicast.
Step 6. Click the Run Tuning button.
Clicking the Run Tuning button changes several registry entries (described below) and checks for system services that may decrease network performance. It also generates a log that includes the applied changes. Users can view this log to restore the previous values. The log path is:
  %HOMEDRIVE%\Windows\System32\LogFiles\PerformanceTunning.log
This tuning needs to be performed only once after the installation is completed, and on one adapter only (as long as these entries are not changed directly in the registry or by some other install
85. ce the hosts, perform the following:
  (config)# interface et10
  (config-if-Et10)# dcbx mode ieee
  (config-if-Et10)# priority-flow-control mode on
  (config-if-Et10)# priority-flow-control priority 3 no-drop
3.7.5 Configuring Router (PFC only)
The router uses the L3 DSCP value to mark the L2 PCP of the egress traffic. The required mapping maps the three most significant bits of the DSCP into the PCP. This is the default behavior, and no additional configuration is required.
3.7.5.1 Copying Port Control Protocol (PCP) between Subnets
The PCP captured from the Ethernet header of the incoming packet can be used to set the PCP bits on the outgoing Ethernet header.
3.7.6 Configuring the RoCE Mode
Configuring the RoCE mode requires the following:
  • RoCE mode is configured per driver and is enforced on all the devices in the system.
NOTE: The supported RoCE modes depend on the firmware installed. If the firmware does not support the needed mode, the fallback mode is the maximum supported RoCE mode of the installed NIC.
RoCE mode can be enabled and disabled via PowerShell.
To enable RoCE using PowerShell, open PowerShell and run:
  Set-MlnxDriverCoreSetting -RoceMode 1
To disable RoCE using PowerShell, open PowerShell and run:
  Set-MlnxDriverCoreSetting -RoceMode 0
3.8 Network Virtualization using Generic Routing Encapsulation
Networ
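The router mapping described in section 3.7.5 above (the three most significant bits of the 6-bit DSCP become the 3-bit PCP) is a simple bit shift. The following Python sketch illustrates the arithmetic only; the function name `dscp_to_pcp` is a hypothetical helper introduced here and is not part of any Mellanox or switch-vendor tool.

```python
# Illustrative arithmetic for the default DSCP-to-PCP mapping described above.
# DSCP is a 6-bit field (0-63); PCP is a 3-bit field (0-7).
def dscp_to_pcp(dscp: int) -> int:
    """Return the PCP formed by the three most significant bits of a DSCP."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0-63")
    # Dropping the three least significant bits keeps the top three.
    return dscp >> 3


if __name__ == "__main__":
    for dscp in (0, 9, 16, 32, 63):
        print(f"DSCP {dscp:2d} -> PCP {dscp_to_pcp(dscp)}")
```

Under this mapping, any eight consecutive DSCP values starting at a multiple of 8 share the same PCP, so a DSCP chosen for a lossless class should fall in the block that corresponds to the PFC-enabled priority.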
86.
  3.7.3 Configuring SwitchX Based Switch System
  3.7.4 Configuring Arista Switch
  3.7.5 Configuring Router (PFC only)
  3.7.6 Configuring the RoCE Mode
3.8 Network Virtualization using Generic Routing Encapsulation
  3.8.1 Enabling/Disabling NVGRE Offloading
  3.8.2 Configuring the NVGRE using PowerShell
  3.8.3 Verifying the Encapsulation of the Traffic
3.9 Differentiated Services Code Point (DSCP)
  3.9.1 Setting the DSCP in the IP Header
  3.9.2 Configuring Quality of Service for TCP and RDMA Traffic
  3.9.3 Configuring DSCP for TCP Traffic
  3.9.4 Configuring DSCP for RDMA Traffic
  3.9.5 Registry Settings
  3.9.6 DSCP Sanity Testing
Chapter 4 Deploying Windows Server 2012 and Above with SMB Direct
4.1 Overview
4.2 Hardware and Software Prerequisites
4.3 SMB Configuration Verification
  4.3.1 Verifying SMB Configuration
  4.3.2 Verifying SMB Connection
4.4 Ve
87. cm module to exchange data - use regular QPs
8.4.13 nd_write_bw
This test is used for performance measuring of RDMA-Write requests in Microsoft Windows Operating Systems. nd_write_bw is performance oriented for RDMA-Write with maximum throughput, and runs over Microsoft's NetworkDirect standard. The level of customization for the user is relatively high. The user may choose to run with a customized message size, a customized number of iterations, or alternatively a customized test duration time. nd_write_bw runs with all message sizes from 1B to 4MB (powers of 2), message inlining, and CQ moderation.
8.4.13.1 nd_write_bw Synopsis
(running on a specific single core)
Server side:
  start /b /affinity 0x1 nd_write_bw -s1048576 -D10 -S 11.137.53.1
Client side:
  start /b /wait /affinity 0x1 nd_write_bw -s1048576 -D10 -C 11.137.53.1
8.4.13.2 nd_write_bw Options
The table below lists the various flags of the command.
Table 45 - nd_write_bw Flags and Options
  Flag                      Description
  -h                        Shows the Help screen.
  -v                        Shows the version number.
  -p <port>                 Connects to the port <port> (default 6830).
  -s <msg size>             Exchanges the given message size (default 65536B). Must not be combined with the -a flag.
  -a                        Runs all the message sizes from 1B to 8MB. Must not be combined with the -s flag.
  -n <num of iterations>    The number of exchanges (at least 2; the default is 100000).
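The nd_write_bw description above refers to message sizes "from 1B to 4MB (powers of 2)". As a small illustration of what that range contains, the Python sketch below enumerates those sizes. It is only an arithmetic aid for picking a value for the -s flag; the function name `power_of_two_sizes` is hypothetical, and the sketch does not invoke or model the tool itself.

```python
# Enumerate the power-of-two message sizes between 1 byte and an upper bound.
def power_of_two_sizes(max_bytes: int) -> list[int]:
    """Return [1, 2, 4, ...] up to and including max_bytes when it is a power of 2."""
    sizes = []
    size = 1
    while size <= max_bytes:
        sizes.append(size)
        size *= 2
    return sizes


if __name__ == "__main__":
    four_mb = 4 * 1024 * 1024
    sizes = power_of_two_sizes(four_mb)
    print(len(sizes), "sizes, from", sizes[0], "to", sizes[-1], "bytes")
    # The synopsis example uses -s1048576, which is 2**20 bytes (1MB).
    print(1048576 in sizes)
```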
88. command.
Table 46 - nd_write_lat Options
  Flag                            Description
  -h                              Shows the Help screen.
  -v                              Shows the version number.
  -p <port>                       Connects to the port <port> (default 6830).
  -s <msg size>                   Exchanges the given message size (default 65536B). Must not be combined with the -a flag.
  -a                              Runs all the message sizes from 1B to 8MB. Must not be combined with the -s flag.
  -n <num of iterations>          The number of exchanges (at least 2; the default is 100000).
  -I <max inline size>            The maximum size of a message to send inline. The default is 128B.
  -D <test duration in seconds>   Test duration in seconds.
  -f <margin time in seconds>     The margin time to avoid calculation. Must be less than half of the duration time.
  -S <server interface IP>        Server side only; must be the last parameter.
  -C <server interface IP>        Client side only; must be the last parameter.
8.4.15 nd_read_bw
This test is used for performance measuring of RDMA-Read requests in Microsoft Windows Operating Systems. nd_read_bw is performance oriented for RDMA-Read with maximum throughput, and runs over Microsoft's NetworkDirect standard. The level of customization for the user is relatively high. The user may choose to run with a customized message size, a customized number of iterations, or alternatively a customized test duration time. nd
89. cted path, LID, or GUID
  <startlid>    Starting LID in an MLID range
  <endlid>      Ending LID in an MLID range
Examples
1. Dump all Lids with valid out ports of the switch with Lid 2:
> ibroute 2
Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
  Lid    Out Port  Destination Info
  0x0002 000     : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
  0x0003 021     : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
  0x0006 007     : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
  0x0007 021     : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
  0x0008 008     : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
5 valid lids dumped
2. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2:
> ibroute 2 3 7
Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
  Lid    Out Port  Destination Info
  0x0003 021     : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
  0x0006 007     : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
  0x0007 021     : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
3 valid lids dumped
3. Dump all Lids with valid out ports of the
90. d subnets, it is mandatory when using RoCE. Applications written over IB verbs should work seamlessly, but they require provisioning of GRH information when creating address vectors. The library and driver are modified to provide the mapping from GID to MAC addresses required by the hardware.
3.7.2 RoCE Configuration
In order to function reliably, RoCE requires a form of flow control. While it is possible to use global flow control, this is normally undesirable for performance reasons. The normal and optimal way to use RoCE is with Priority Flow Control (PFC). To use PFC, it must be enabled on all endpoints and switches in the flow path. The following section presents instructions for configuring PFC on Mellanox ConnectX cards. Multiple configuration steps are required, all of which may be performed via PowerShell. Therefore, although each step is presented individually, you may ultimately choose to write a PowerShell script that performs them all in one step. Note that administrator privileges are required for these steps.
For further information, please refer to:
http://blogs.technet.com/b/josebda/archive/2012/07/31/deploying-windows-server-2012-with-smb-direct-smb-over-rdma-and-the-mellanox-connectx-3-using-10gbe-40gbe-roce-step-by-step.aspx
3.7.2.1 Prerequisites
The following are the driver's prerequisites in order to set or configure RoCE:
  • ConnectX-3 firmware version 2.30.3000 or higher
91. e 1 - Document Revision History (columns: Document Revision, Date, Changes)
Rev 3.0.0, February 08, 2012.
  • Added section RDMA over Converged Ethernet (RoCE) and its subsections
  • Added section Hyper-V with VMQ
  • Added section Network Driver Interface Specification (NDIS)
  • Added section Header Data Split
  • Added section Auto Sensing
  • Added section Adapter Teaming
  • Added section Port Protocol Configuration
  • Added section Advanced Configuration for InfiniBand Driver
  • Added section Advanced Configuration for Ethernet Driver
  • Updated section Tunable Performance Parameters
  • Merged Ethernet and InfiniBand features sections
  • Removed section Sockets Direct Protocol and its subsections
  • Removed section Winsock Direct and Protocol and its subsections
  • Added ConnectX-3 support
  • Removed section IPoIB Drivers Overview
  • Removed section Booting Windows from an iSCSI Target
Rev 2.1.3, January 28, 2011.
  • Complete restructure
Rev 2.1.2, October 10, 2010.
  • Removed section Debug Options
  • Updated Section 3, Uninstalling Mellanox VPI Driver, on page 11
  • Added Section 6, InfiniBand Fabric, on page 38 and its subsections
  • Added Section 6.3, InfiniBand Fabric Performance Utilities, on page 71 and its subsections
Rev 2.1.1.1, July 14, 2010.
  • Removed all references to the InfiniHost adapter, since it is not supported starting with WinOF VPI v2.1.1
Rev 2.1.1, May
92. e DSCP value is generated according to the CoS value programmed for the interface. The CoS value is set through standard tools, such as the PowerShell command New-NetQosPolicy using the PriorityValue8021Action flag. The NIC uses a mapping table between the CoS value and the DSCP value, configured through the RroceDscpMarkPriorityFlowControl[0-7] registry keys.
3.9.2 Configuring Quality of Service for TCP and RDMA Traffic
Step 1. Verify that DCB is installed and enabled (it is not installed by default):
  Install-WindowsFeature Data-Center-Bridging
Step 2. Import the PowerShell modules that are required to configure DCB:
  import-module NetQos
  import-module DcbQos
  import-module NetAdapter
Step 3. Configure DCB:
  Set-NetQosDcbxSetting -Willing 0
Step 4. Enable Network Adapter QoS:
  Set-NetAdapterQos -Name "Cx3Pro_ETH_P1" -Enabled 1
Step 5. Enable Priority Flow Control (PFC) on the specific priorities 3 and 5:
  Enable-NetQosFlowControl 3,5
3.9.3 Configuring DSCP for TCP Traffic
Create a QoS policy to tag all TCP/UDP traffic with CoS value 1 and DSCP value 9:
  New-NetQosPolicy "DEFAULT" -PriorityValue8021Action 3 -DSCPAction 9
DSCP can also be configured per protocol:
  New-NetQosPolicy "TCP" -IPProtocolMatchCondition TCP -PriorityValue8021Action 3 -DSCPAction 16
  New-NetQosPolicy "UDP" -IPProtocolMatchCondition UDP -PriorityValue8021Action 3 -DSCPAction 32
3.9.4 Configuring DSCP for RDMA Traf
93. eam. The following steps describe how to create a port VLAN.
Step 1. Display the Device Manager.
[Figure: Device Manager window with the Network adapters node expanded, showing the physical adapters (Broadcom BCM5708C NetXtreme II GigE, Mellanox ConnectX MT25418 DDR Channel Adapter, Mellanox ConnectX 10Gb Ethernet Adapter, Mellanox ConnectX 10Gb Ethernet Adapter #2) and the virtual bundle (Mellanox Virtual Miniport Driver - Team A)]
Step 2. Right-click a Mellanox network adapter under the Network adapters list and left-click Properties. Select the VLAN tab from the Properties sheet.
[Figure: Properties sheets of a physical adapter (Mellanox ConnectX 10Gb Ethernet Adapter) and of a virtual bundle team (Mellanox Virtual Miniport Driver - Team A)]
94. ega Bytes (2^20) and reflect the actual data that was transferred, excluding headers.
If these results are not as expected, the problem is most probably with one or more of the following:
  • Old firmware version.
  • Misconfigured flow control: Global pause or PFC is configured incorrectly on the hosts, routers, and switches. See Section 3.7, RDMA over Converged Ethernet (RoCE), on page 30.
  • CPU/power options are not set to Maximum Performance.
Issue 3. QoS and Flow control
Flow control settings can greatly affect results. To see the configured settings for all of the QoS options, open a PowerShell prompt and use Get-NetAdapterQos.
To achieve maximum performance, all of the following must exist:
  1. All of the hosts, switches, and routers should use the same matching flow control settings. If Global pause is used, all devices must be configured for it. If PFC (Priority Flow Control) is used, all devices must have matching settings for all priorities.
  2. ETS settings that limit the speed of some priorities will greatly affect the output results.
Make sure Flow Control is enabled on the Mellanox interfaces (enabled by default). Go to the Device Manager, right-click the Mellanox interface, go to Advanced, and make sure Flow Control is enabled for both TX and RX.
To eliminate QoS and Flow control as the performance-degrading factor, set all devices to run with Global Pause and rerun the tests:
  • Set Global pause on the sw
95. egistry Path: HKLM\SYSTEM\CurrentControlSet\Services\NDIS\Parameters
  Type: REG_DWORD
  Key name: AllowFlowControlUnderDebugger
  Value: 1
Suggestion 2: Go to Power Options in the Control Panel. Make sure Maximum Performance is set as the power scheme (a reboot is needed).
Issue 2. General Diagnostic
Suggestion 1: Go to Device Manager, locate the Mellanox adapter that you are debugging, right-click, and go to Information:
  • PCI Gen 2 should appear as PCI-E 5.0 GT/s
  • PCI Gen 3 should appear as PCI-E 8.0 GT/s
  • Link Speed: 40.0Gbps / 10.0Gbps
Suggestion 2: To determine whether the Mellanox NIC and PCI bus can achieve their maximum speed, it is best to run ib_send_bw in a loopback. On the same machine:
  1. Run: start /b /affinity 0x1 ibv_write_bw
  2. Run: start /b /affinity 0x2 ibv_write_bw 127.0.0.1
  3. Repeat for port 2 with the additional -p2, and for other cards if necessary.
  4. On PCI Gen3, the expected result is around 5700MB/s. On PCI Gen2, the expected result is around 3300MB/s. Any number lower than that points to a bad configuration or installation in the wrong PCI slot. Malfunctioning QoS settings and Flow Control can be the cause as well.
Suggestion 3: To determine the maximum speed between the two sides with the most basic test:
  1. Run ib_send_bw on machine 1.
  2. Run ib_send_bw <host1> on machine 2, where <host1> is the hostname of machine 1.
Results appear in MB/s (M
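The expected loopback figures quoted above are given in MB/s, where the manual defines a megabyte as 2^20 bytes. When comparing them with a link speed stated in Gb/s, a unit conversion is needed. The Python sketch below shows that arithmetic under the assumption that Gb/s means 10^9 bits per second; the function name `mb_per_s_to_gbit_per_s` is a hypothetical helper, not a tool shipped with WinOF.

```python
# Convert a bandwidth figure in MB/s (1 MB = 2**20 bytes, as the manual
# defines it) to Gb/s (assumed here to mean 10**9 bits per second).
BYTES_PER_MB = 2 ** 20
BITS_PER_BYTE = 8


def mb_per_s_to_gbit_per_s(mb_per_s: float) -> float:
    """Return the equivalent rate in gigabits per second."""
    return mb_per_s * BYTES_PER_MB * BITS_PER_BYTE / 1e9


if __name__ == "__main__":
    for label, rate in (("PCI Gen3", 5700), ("PCI Gen2", 3300)):
        print(f"{label}: {rate} MB/s is about "
              f"{mb_per_s_to_gbit_per_s(rate):.1f} Gb/s")
```

Because the reported figures exclude headers, the converted value is a payload rate and is expected to be somewhat below the raw link speed.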
96. eir fields.
  • It verifies the existing inventory, with all the object fields, and matches it to a pre-saved one.
  • A Multicast Compliancy test
  • An Event Forwarding test
  • A Service Record registration test
  • An RMPP stress test
  • A Small SA Queries stress test
It is recommended that, after installing opensm, the user run "osmtest -f c" to generate the inventory file, and immediately afterwards run "osmtest -f a" to test OpenSM. Additionally, it is recommended to create the inventory when the IB fabric is stable, and occasionally run "osmtest -v" to verify that nothing has changed.
8.3.15.1 osmtest Synopsis
  osmtest [-f(low) <c|a|v|s|e|f|m|q|t>] [-w(ait) <trap_wait_time>] [-d(ebug) <number>] [-m(ax_lid) <LID in hex>] [-g(uid) <GUID in hex>] [-p(ort)] [-i(nventory) <filename>] [-s(tress)] [-M(ulticast_Mode)] [-t(imeout) <milliseconds>] [-l | --log_file] [-v] [-vf <flags>] [-h(elp)]
8.3.15.2 osmtest Options
The table below lists the various flags of the command.
Table 25 - osmtest Flags and Options
  Flag          Description
  -f, --flow    This option directs osmtest to run a specific flow. The following is the flow's description:
                  • c = create an inventory file with all nodes, ports, and paths
                  • a = run all validation tests (expecting an input inventory)
                  • v = only validate the given inventory file
                  • s = run service registration, deregistrat
97. elp screen.
  -V                              Shows the version number.
  -p <port>                       Connects to the port <port> (default 6830).
  -s <msg size>                   Exchanges the given message size (default 65536B). Must not be combined with the -a flag.
  -a                              Runs all the message sizes from 1B to 8MB. Must not be combined with the -s flag.
  -n <num of iterations>          The number of exchanges (at least 2; the default is 100000).
  -I <max inline size>            The maximum size of a message to send inline. The default is 128B.
  -D <test duration in seconds>   Test duration in seconds.
  -f <margin time in seconds>     The margin time to avoid calculation. Must be less than half of the duration time.
  -Q <CQ moderation value>        The default is 100.
  -S <server interface IP>        Server side only; must be the last parameter.
  -C <server interface IP>        Client side only; must be the last parameter.
8.4.18 nd_send_lat
This test is used for performance measuring of Send requests in Microsoft Windows Operating Systems. nd_send_lat is performance oriented for Send with minimum latency, and runs over Microsoft's NetworkDirect standard. The level of customization for the user is relatively high. The user may choose to run with a customized message size, a customized number of iterations, or alternatively a customized test duration time. nd_send_lat runs with all message sizes from 1B to 4MB (powers of 2)
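The option table above states three constraints: -s and -a must not be combined, -n must be at least 2, and the -f margin must be less than half of the -D duration. As a sketch only, a wrapper script could check these before launching a test. The Python function below is a hypothetical helper written for illustration; its name and parameters are assumptions and are not part of the nd_* tools.

```python
# Hypothetical pre-flight check for the constraints listed in the table above.
from typing import Optional


def check_nd_options(size: Optional[int] = None,
                     all_sizes: bool = False,
                     iterations: Optional[int] = None,
                     duration: Optional[float] = None,
                     margin: Optional[float] = None) -> list[str]:
    """Return a list of constraint violations (empty when the options are consistent)."""
    problems = []
    if all_sizes and size is not None:
        problems.append("-s and -a must not be combined")
    if iterations is not None and iterations < 2:
        problems.append("-n must be at least 2")
    if margin is not None and duration is not None and margin >= duration / 2:
        problems.append("-f must be less than half of -D")
    return problems


if __name__ == "__main__":
    # Mirrors the synopsis examples in this manual: -s1048576 -D10
    print(check_nd_options(size=1048576, duration=10))
    # A margin of 5 seconds is not less than half of a 10-second test.
    print(check_nd_options(duration=10, margin=5))
```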
98. [Figure: Device Manager window with the Network adapters node expanded, showing the Broadcom BCM5708C NetXtreme II GigE adapters, the Mellanox ConnectX MT25418 DDR Channel Adapter, the two Mellanox ConnectX 10Gb Ethernet Adapters, and the Mellanox Virtual Miniport Driver - Team A]
To modify an existing bundle, perform the following:
  a. Select the desired bundle and click Modify.
  b. Modify the bundle name, its type, and/or the participating adapters in the bundle.
  c. Click the Commit button.
To remove an existing bundle, select the desired bundle and click Remove. You will be prompted to approve this action.
Notes on this step:
  a. Each adapter that participates in a bundle has two properties:
     • Status: Connected / Disconnected / Disabled
     • Role: Active or Backup
  b. Each network adapter that is added to or removed from a bundle gets refreshed (i.e. disabled, then enabled). This may cause a temporary loss of connection to the adapter.
  c. If a bundle loses one or more network adapters through a "create" or "modify" operation, the remaining adapters in the bundle are automatically notified of the change.
3.5.3 Creating a Port VLAN in Windows 2008 R2
You can create a Port VLAN either on a physical Mellanox ConnectX EN adapter or on a virtual bundle (t
99. ency of an RDMA write operation of message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which one side RDMA-writes to the other side's memory only after the other side has written to its memory. Each side samples the CPU clock each time it writes to the other side's memory, in order to calculate latency.

8.4.6.1 ib_write_lat Synopsis
ib_write_lat [-i(b_port) ib_port] [-c(onnection_type) RC/UC] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-n iteration_num] [-p(ort) PDT_port] [-a(ll)] [-V(ersion)] [-C report cycles] [-H report histogram] [-U report unsorted]

8.4.6.2 ib_write_lat Options
The table below lists the various flags of the command.
Table 38: ib_write_lat Flags and Options
Flag  Description
  -p, --port <port>  Listens on/connects to port <port> (default 18515)
  -d, --ib-dev <dev>  Uses IB device <device guid> (default: first device found)
  -i, --ib-port <port>  Uses port <port> of the IB device (default 1)
  -m, --mtu <mtu>  The MTU size (default 1024)
  -c, --connection <RC/UC>  Connection type RC/UC (default RC)
  -s, --size <size>  The size of the message to exchange (default 65536)
  -f, --freq <dep>  How often the time stamp is taken
  -a, --all  Runs sizes from 2 to 2^23
  -t, --tx-depth <dep>  The size of the tx queue (default 100)
  -n
100. ep 2. Right-click the desired Virtual Machine (VM), and left-click Settings in the pop-up menu.
Step 3. In the Settings window, under the relevant network adapter, select Hardware Acceleration.
Step 4. Check/uncheck the box "Enable virtual machine queue" to enable/disable VMQ on that specific network adapter.
To enable Hyper-V with VMQ using PowerShell:
Step 1. Enable VMQ on a specific VM:
  Set-VMNetworkAdapter <VM Name> -VmqWeight 100
Step 2. Disable VMQ on a specific VM:
  Set-VMNetworkAdapter <VM Name> -VmqWeight 0

3.2 Header Data Split
The header-data split feature improves network performance by splitting the headers and data in received Ethernet frames into separate buffers. The feature is disabled by default and can be enabled in the Advanced tab (Performance Options) of the Properties window. For further information, please refer to the MSDN library: http://msdn.microsoft.com/en-us/library/windows/hardware/ff553723(v=VS.85).aspx

3.3 Receive Side Scaling (RSS)
The Mellanox WinOF Rev 4.60 IPoIB and Ethernet drivers use the new RSS capabilities of NDIS 6.30. The main changes are:
- Removed the previous limitation of 64 CPU cores
- Individual network adapter RSS configuration usage
RSS capabilities can be set per individual adapter as well as globally. To do so, set the registry keys listed below.
Table 5: Registry Keys Setting
Sub-key  Desc
101. ernet. Auto Sensing is performed only when rebooting the machine or after disabling/enabling the mlx4_bus interface from the Device Manager. Hence, if you replace cables during runtime, the NIC will not perform Auto Sensing. For further information on how to configure it, please refer to Section 3.4.2, "Port Protocol Configuration", on page 20.

3.4.2 Port Protocol Configuration
Step 1. Display the Device Manager and expand System devices.
[Figure: Device Manager with the System devices node expanded]
102. [Figure: Ethernet Adapter Properties dialog, LBFO tab (Load Balancing and Fail Over Settings), showing the Bundle Name, Bundle Type (Fault Tolerance), Primary and Fallback to Primary settings, the adapters in the bundle, and the Create, Modify, and Remove buttons]
LBFO stands for Load Balancing and Fail Over. The administrator can configure a bundle of adapters and associate up to 8 Mellanox ConnectX adapters with this bundle. LBFO should be used to increase system reliability upon a link failure and to balance the workload.
The newly created virtual Mellanox adapter representing the bundle will be displayed by the Device Manager under Network adapters in the following format (see the figure below):
  Mellanox Virtual Miniport Driver Team <bundle name>
[Figure: Device Manager, Network adapters list]
103. es to further increase the verbosity level. See the -vf option for more information about log verbosity.
  -V  This option sets the maximum verbosity level and forces log flushing. -V is equivalent to "-vf 0xFF -d 2". See the -vf option for more information about log verbosity.

Table 25: osmtest Flags and Options
Flag  Description
  -vf <flags>  This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:
    BIT   LOG LEVEL ENABLED
    0x01  ERROR (error messages)
    0x02  INFO (basic messages, low volume)
    0x04  VERBOSE (interesting stuff, moderate volume)
    0x08  DEBUG (diagnostic, high volume)
    0x10  FUNCS (function entry/exit, very high volume)
    0x20  FRAMES (dumps all SMP and GMP frames)
    0x40  ROUTING (dump FDB routing information)
    0x80  currently unused
  Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
  -h, --help  Displays this usage info, then exits.

8.3.16 ibaddr
Displays the LID (and range) as well as the GID address of the port specified (by DR path, LID, or GUID), or of the local port by default. This utility can be used as a simple address resolver.

8.3.16.1 ibaddr Synopsis
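The -vf flags field is a plain bitmask, so a value can be decoded by testing each bit from the table. The following Python sketch is for illustration only and is not part of osmtest; the names `LOG_LEVELS` and `decode_vf` are hypothetical.

```python
# Bit values and level names as listed in the -vf table (0x80 is unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_vf(flags):
    """Return the names of the log levels enabled by a -vf flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]


if __name__ == "__main__":
    print(decode_vf(0x03))   # the osmtest default
    print(decode_vf(0xFF))   # everything, as with -V
```

Decoding 0x3 yields ERROR and INFO, which matches the stated default, and 0 yields an empty list (all messages disabled).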
104. ex] [-n iteration_num] [-o(uts) outstanding_reads] [-e(vents) use_events] [-p(ort) PDT_port] [-a(ll)] [-V(ersion)] [-C report cycles] [-H report histogram] [-U report unsorted] [-F CPU freq fail]

8.4.8.2 ibv_read_lat Options
The table below lists the various flags of the command.
Table 40: ibv_read_lat Flags and Options
Flag  Description
  -p, --port <port>  Listens on/connects to port <port> (default 18515)
  -d, --ib-dev <dev>  Uses IB device <device guid> (default: first device found)
  -i, --ib-port <port>  Uses port <port> of the IB device (default 1)
  -m, --mtu <mtu>  The MTU size (default 1024)
  -o, --outs <num>  The number of outstanding read/atom operations (default: 16 for ConnectX, 4 for others)
  -s, --size <size>  The size of the message to exchange (default 65536)
  -a, --all  Runs sizes from 2 to 2^23
  -t, --tx-depth <dep>  The size of the tx queue (default 100)
  -n, --iters <iters>  The number of exchanges (at least 2; default 1000)
  -u, --qp-timeout <timeout>  QP timeout. The timeout value is 4 usec * 2^timeout (default 14)
  -S, --sl <sl>  The service level (default 0)
  -x, --gid-index <index>  Test uses the GID with the GID index taken from the command line (for RDMAoE, the index should be 0)
  -C, --report-cycles  Reports times in c
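The QP timeout option is an exponent rather than a duration. As a worked example, the Python sketch below evaluates the formula quoted in the table (4 usec multiplied by 2 to the power of the timeout value). The function name `qp_timeout_usec` is hypothetical, and the 4 usec base is taken from the table as written.

```python
def qp_timeout_usec(timeout, base_usec=4):
    """Evaluate base_usec * 2**timeout, the QP timeout formula from the table."""
    if timeout < 0:
        raise ValueError("timeout exponent must be non-negative")
    return base_usec * (2 ** timeout)


if __name__ == "__main__":
    usec = qp_timeout_usec(14)          # the default exponent
    print(usec, "usec =", usec / 1000.0, "ms")
```

With the default exponent of 14, the formula gives 65536 usec, i.e. roughly 65.5 ms.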
105. f the transfer unit, also known as the Maximum Transmission Unit (MTU). For IPoIB, the MTU should not include the size of the IPoIB header (4B). For example, if the network adapter card supports a 4K MTU, the upper threshold for the payload MTU is 4092B and not 4096B. The MTU of a network can have a substantial impact on performance. A 4K MTU size improves performance for short messages, since it allows the OS to coalesce many small messages into a large one.
The valid MTU range for an Ethernet driver is 614 to 9614.
The valid MTU range for an IPoIB driver is 1500 to 4092.
All devices on the same physical network, or on the same logical network, must have the same MTU.

Receive Buffers
The number of receive buffers (default 1024).

Send Buffers
The number of send buffers (default 2048).

Performance Options
Configures parameters that can improve adapter performance.

Interrupt Moderation
Moderates or delays the generation of interrupts, and hence optimizes network throughput and CPU utilization (default Enabled). When interrupt moderation is enabled, the system accumulates interrupts and sends a single interrupt rather than a series of interrupts. An interrupt is generated after receiving 5 packets or 10ms after the first packet is received. This improves performance and reduces CPU load; however, it increases latency. When the interrupt moderation is
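The MTU arithmetic above (4096B adapter MTU minus the 4B IPoIB header gives a 4092B payload MTU) and the two valid ranges can be checked with a small Python sketch. It is illustrative only: the names `MTU_RANGES`, `ipoib_payload_mtu`, and `mtu_is_valid` are hypothetical and do not correspond to any driver API.

```python
IPOIB_HEADER_BYTES = 4

# Valid MTU ranges as stated in the text (inclusive bounds).
MTU_RANGES = {
    "ethernet": (614, 9614),
    "ipoib": (1500, 4092),
}


def ipoib_payload_mtu(adapter_mtu):
    """Payload MTU for IPoIB: the adapter MTU minus the 4-byte IPoIB header."""
    return adapter_mtu - IPOIB_HEADER_BYTES


def mtu_is_valid(driver, mtu):
    """Check an MTU value against the range documented for the driver type."""
    low, high = MTU_RANGES[driver]
    return low <= mtu <= high


if __name__ == "__main__":
    payload = ipoib_payload_mtu(4096)
    print(payload, mtu_is_valid("ipoib", payload), mtu_is_valid("ipoib", 4096))
```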
106. ff gt Usage part man exe v lt show add rem gt Local area connection name v increases verbosity level Show shows the currently configured virtual ipoib ports Add adds new virtual IPoIB port Where add should be used with interface name as it appears in Network connection in the control panel Ke 99 Name any printable name without quotations marks commas and starting with i Rem removes existing virtual IPoIB port Therefore it requires running it with Show then copy the parameters Example Adding and removing virtual port part man add Ethernet 4 ipoib 4 1 Done Part man show Ethernet 6 ipoib 4 1 pare men ren eene Mo aljeyejiiloy A 1 Done 8 3 InfiniBand Fabric Diagnostic Utilities The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand IB devices in a fabric 8 3 1 Utilities Usage This section first describes common configuration interface and addressing for all the tools in the package Then it provides detailed descriptions of the tools themselves including operation synopsis and options descriptions error codes and examples Mellanox Technologies 62 J Rev 4 60 8 3 1 1 Common Configuration Interface and Addressing Topology File Optional An InfiniBand fabric is composed of switches and channel adapter HCA TCA devices To iden tify devices in a fabric or even i
107. fic
Create a QoS policy to tag the ND traffic for port 10000 with CoS value 3:
  New-NetQosPolicy "ND10000" -NetDirectPortMatchCondition 10000 -PriorityValue8021Action 3

Related Commands
- Get-NetAdapterQos: gets the QoS properties of the network adapter
- Get-NetQosPolicy: retrieves network QoS policies
- Get-NetQosFlowControl: gets QoS status per priority

3.9.5 Registry Settings
The following attributes must be set manually and will be added to the miniport registry.
Table 6: DSCP Registry Keys Settings
Registry Key  Description
  TxUntagPriorityTag  If 0x1, do not add an 802.1Q tag to transmitted packets that are assigned an 802.1p priority but are not assigned a non-zero VLAN ID (i.e., priority-tagged). Default: 0x0. For DSCP-based PFC, set to 0x1.
  RxUntaggedMapToLossless  If 0x1, all untagged traffic is mapped to the lossless receive queue. Default: 0x0. For DSCP-based PFC, set to 0x1.
  RroceDscpMarkPriorityFlowControl_<ID>  A value to mark DSCP for RoCE v2 packets assigned to CoS=ID when priority flow control is enabled. The valid values range from 0 to 63. The default is the ID value, e.g. RroceDscpMarkPriorityFlowControl_3 is 3. ID values range from 0 to 7.
For changes to take effect, please restart the network adapter after changing this registry key.

3.9.5.1 Default Settings
When DSCP configuration registry keys are missing in the miniport reg
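The default rule for the DSCP marking keys (each key defaults to its own ID, IDs run from 0 to 7, and values must lie in 0 to 63) can be expressed as a short Python sketch. It only builds and validates a table in memory and does not touch the registry; the helper names `default_dscp_marks` and `validate_dscp_marks` are hypothetical.

```python
KEY_PREFIX = "RroceDscpMarkPriorityFlowControl_"


def default_dscp_marks():
    """Build the default key/value table: each ID (0..7) maps to itself."""
    return {KEY_PREFIX + str(cos_id): cos_id for cos_id in range(8)}


def validate_dscp_marks(marks):
    """Return the keys whose value falls outside the valid DSCP range 0..63."""
    return [key for key, value in marks.items() if not 0 <= value <= 63]


if __name__ == "__main__":
    marks = default_dscp_marks()
    marks[KEY_PREFIX + "3"] = 26     # example override, an arbitrary value
    print(marks)
    print(validate_dscp_marks(marks))
```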
108. first virtual adapter (VLAN) on a specific port, the port becomes disabled. This means that it is not possible to bind to this port until all the virtual adapters associated with it are removed.
When using a VLAN, the network address is configured using the VLAN ID. Therefore, the VLAN ID on both ends of the connection must be the same.
Step 4. Verify the new VLAN(s) by opening the Device Manager window or the Network Connections window. The newly created VLAN will be displayed in the following format:
  Mellanox Virtual Miniport Driver VLAN <name>
[Figure: Device Manager, Network adapters list, showing the entry "Mellanox Virtual Miniport Driver VLAN New_Production_VLAN"]
109. g Systems support NDIS6 3 RssProfile 4 Additionally this option chooses the best processors to assign to DefaultRecvRingProcessor TxForwardingProcessor In Operating Systems support NDIS6 2 RssBaseProcNumber MaxRssProcessors In Operating Systems support NDIS6 3 NumRSSQueues RssMaxProcNumber Mellanox Technologies 49 J Rev 4 60 Flag Description f Forwarding traffic scenario This option must be followed by two connection names The tuning in this case is code pendent This option automatically sets SendCompletionMethod 1 RecvCompletionMethod 0 ReceiveBuffers 4096 UseRSSForRawIP 0 UseRSSForUDP 0 Additionally this option chooses the best processors to assign to DefaultRecvRingProcessor TxInterruptProcessor TxForwardingProcessor In Operating Systems support NDIS6 2 RssBaseProcNumber MaxRssProcessors In Operating Systems support NDIS6 3 NumRSSQueues RssMaxProcNumber m Manual configuration This option must be followed by one connection name This option assigns the provided base and number of CPUs to e RssBaseProcNumber e MaxRssProcessors Additionally this option assigns the following with processors inside the range DefaultRecvRingProcessor TxInterruptProcessor T Restore default settings This option can be followed by one or two connection names This option automatically sets the driver registry values back to the
110. >

8.3.18.2 iblinkinfo Flags and Options
Table 28: iblinkinfo Flags and Options
Flag  Description
  -S <port_guid>, -G <port_guid>, --port-guid <port_guid>  Starts a partial scan at the port specified by <port_guid> (hex format)
  -D <direct_route>  Starts a partial scan at the port specified by the direct route path
  -l  Prints all information for each link on one line (useful for grep'ing output). The default is to print a header with the node information and then a list for each port
  -d  Prints only nodes which have a port in the Down state
  -p  Prints additional port settings (<LifeTime>, <HoqLife>, <VLStallCount>)
  -C <ca_name>  Uses the specified ca_name for the search
  -P <ca_port>  Uses the specified ca_port for the search
  -R  This option is obsolete and does nothing
  --load-cache <filename>  Loads and uses the cached ibnetdiscover data stored in the specified filename. May be useful for outputting and learning about other fabrics or a previous state of a fabric. Cannot be used if the user specifies a direct route path. See ibnetdiscover for information on caching ibnetdiscover output
  --diff <filename>  Loads cached ibnetdiscover data and performs a diff comparison against the current network or another cache. A special diff output for iblinkinfo output will be displayed, showing the differences between the old and current fabric links. By default, the following are compared for differences
111. gt pm pc lw lt 1x 4x 12x gt s lt sys name gt P lt lt PM counter gt lt Trash Limit gt gt ols lt 2 HSHT r 0 lt out dir gt i lt dev index gt p lt port num gt skip dup guids zero guids pm logical state gt 8 3 2 1 ibdiagnet Options Table 11 ibdiagnet Options Flag Description c lt count gt Min number of packets to be sent across each link default 10 V Enable verbose mode r Provides a report of the fabric qualities o lt out dir gt Specifies the directory where the output files will be placed default tmp t lt topo file gt S lt Sys name gt Specifies the topology file name Specifies the local system name Meaningful only if a topology file is specified i lt dev index gt Specifies the index of the device of the port used to connect to the IB fab ric in case of multiple devices on the local system p lt port num gt pm Specifies the local device s port num used to connect to the IB fabric Dump all the fabric links pm Counters into ibdiagnet pm pc Reset all the fabric links pmCounters P lt PM lt Trash gt gt Iw lt 1x 4x 12x gt If any of the provided pm is greater than its provided value print it to screen Specifies the expected link width Is lt 2 5 5 10 gt Specifies the expected link speed skip lt skip option s gt Mellanox Technologies 64 J S
112. h from a source GID/LID to a destination GID/LID. Each hop along the path is displayed until the destination is reached or a hop does not respond. By using the -m option, multicast path tracing can be performed between source and destination nodes.

8.3.10.1 ibtracert Synopsis
ibtracert [-d(ebug)] [-v(erbose)] [-D(irect)] [-L(id)] [-e(rrors)] [-u(sage)] [-G(uids)] [-f(orce)] [-n(o_info)] [-m mlid] [-s smlid] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [--node-name-map <node-name-map>] [-h(elp)] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]

8.3.10.2 ibtracert Options
The table below lists the various flags of the command. Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message, which can be shown using the util_name -h syntax.
Table 20: ibtracert Flags and Options
Flag  Description
  --force, -f  Force
  --no_info, -n  Simple format; do not show additional information
  --mlid, -m <mlid>  Shows the multicast trace of the specified mlid
  --node-name-map <node-name-map>  Specifies a node name map. The node name map file maps GUIDs to more user-friendly names. See "Topology File Format" on page 82
  --debug, -d / -ddd / -d -d -d  Raises the IB debugging level
Table 20: ibtracert Flags and Options
Flag  Descriptio
113. he local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: -i <index of local device>
2. Define the environment variable IBDIAG_DEV_IDX.

8.3.1.3 Addressing
This section applies to the ibdiagpath tool only.
A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports:
- Using a Directed Route to the destination (tool option -d). This option defines a directed route of output port numbers from the local port to the destination.
- Using port LIDs (tool option -l). In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
- Using port names defined in the topology file (tool option -n). This option refers to the source and destination ports by the names defined in the topology file. Therefore, this option is relevant only if a topology file is specified to the tool. In this mode, the tool uses the names to extract the port LIDs from the matched topology; the tool then operates as with the -l option.

8.3.2 ibdiagnet
ibdiagnet [-c <count>] [-v] [-t <topo-file
114. [Figure: adapter Properties dialog, VLAN tab (Virtual LANs), listing the VLANs associated with this adapter. The dialog notes that after configuring a VLAN, the adapter associated with the VLAN may experience a momentary loss of connectivity.]
If a physical adapter has been added to a bundle (team), the VLAN tab will not be displayed.
Step 3. Click New to open a VLAN dialog window. Enter the desired VLAN Name and VLAN ID, and select the VLAN Priority.
[Figure: VLAN dialog window] This dialog allows you to enter or modify the following VLAN properties:
- VLAN Name: the name can be any unique alphanumeric string.
- VLAN ID: the ID is a number between 1 and 4095.
- VLAN Priority: the priority is a number between 0 and 7 (0 lowest, 7 highest).
NOTE: After creating a new VLAN, the adapter associated with the VLAN may experience a momentary loss of connectivity.
After installing the
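To make the VLAN ID and VLAN Priority fields concrete, the Python sketch below packs them into the 16-bit Tag Control Information (TCI) field of an IEEE 802.1Q tag: 3 priority bits, 1 drop-eligible bit, then 12 VLAN ID bits. This is a sketch under stated assumptions, not driver code: the function name `vlan_tci` is hypothetical, and the accepted ranges are the ones shown in the dialog (ID 1 to 4095, priority 0 to 7), although 802.1Q itself reserves ID 4095.

```python
def vlan_tci(vlan_id, priority, dei=0):
    """Pack priority (3 bits), DEI (1 bit) and VLAN ID (12 bits) into a TCI value."""
    if not 1 <= vlan_id <= 4095:
        raise ValueError("VLAN ID must be between 1 and 4095")
    if not 0 <= priority <= 7:
        raise ValueError("VLAN priority must be between 0 and 7")
    if dei not in (0, 1):
        raise ValueError("DEI must be 0 or 1")
    return (priority << 13) | (dei << 12) | vlan_id


if __name__ == "__main__":
    tci = vlan_tci(100, 3)
    print(hex(tci))
```

For VLAN ID 100 with priority 3, the packed value is 0x6064.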
115. ing MSMPI Traffic
Directing MPI traffic to a specific QoS priority may be delayed due to the following:
- Except for NetDirectPortMatchCondition, the QoS PowerShell cmdlet for NetworkDirect traffic does not support a port range. Therefore, NetworkDirect traffic cannot be directed to ports 1-65536.
- The MSMPI directive to control the port range (namely MPICH_PORT_RANGE 3000,3030) does not work for ND, and MSMPI chooses a random port.

A.4 Running MSMPI on the Desired Priority
Step 1. Set the default QoS policy to be the desired priority. (Note: this priority should be lossless all the way in the switches.)
Step 2. Set the SMB policy to a desired priority only if SMB traffic is running.
Step 3. [Recommended] Direct all TCP/UDP traffic to a lossy priority by using the IPProtocolMatchCondition. TCP is used for the MPI control channel (smpd), while UDP is used for other services such as remote desktop.
Arista switches forward the PCP bits (e.g., the 802.1p priority within the VLAN tag) from ingress to egress, to enable any two end nodes in the fabric to maintain the priority along the route. In this case, the packet from the sender goes out with priority X and reaches the far-end node with the same priority X. The priority should be lossless in the switches.
To force MSMPI to work over ND and not over sockets, add the following to the mpiexec command:
  -env MPICH_DISABLE_ND 0 -env MPICH
116. ion and lease test
    e  run event forwarding test
    f  flood the SA with queries according to the stress mode
    m  multicast flow
    q  QoS info: dump VLArb and SLtoVL tables
    t  run trap 64/65 flow (this flow requires running of an external tool)
    (default is all flows except QoS)
  -w, --wait  This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec.
  -d, --debug  This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:
    OPT  Description
    -d1  Force single-threaded dispatching
    -d2  Force log flushing after each log message
    -d3  Disable multicast support

Table 25: osmtest Flags and Options
Flag  Description
  -m, --max_lid  This option specifies the maximal LID number to be searched for during inventory file build (default: 100).
  -g, --guid  This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
  -p, --port  This option displays a menu of possible local port GUID values with which osmtest could bind.
  -i, --inventory  This option specifies the name of the inventory file. Normally, osmtest expects
117. ional
  -V, --version  Displays the version number
  -g, --post <num of posts>  The number of posts for each QP in the chain (default: tx_depth)
  -e, --events  Inactive during CQ events (default: poll)
  -F, --CPU-freq  The CPU frequency test. It is active even if the cpufreq_ondemand module is loaded
  -r, --rx-depth <dep>  Makes the rx queue bigger than tx (default 600)
  -I, --inline_size <size>  The maximum size of a message to be sent in inline mode (default 0)
  -N, --no peak-bw  Cancels the peak-bw calculation (default: with peak-bw)
  -g, --mcg <num_of_qps>  Sends messages to a multicast group with <num_of_qps> QPs attached to it
  -M, --MGID <multicast_gid>  In case of multicast, uses <multicast_gid> as the group MGID. The format must be '255:1:X:X:X:X:X:X:X:X:X:X:X:X:X:X', where X is a value within [0,255]
  -R, --rdma_cm  Connects QPs with rdma_cm and runs the test on those QPs
  -z, --com_rdma_cm  Communicates with the rdma_cm module to exchange data; uses regular QPs
  -Q, --cq-mod  Generates a CQE only after <cq-mod> completions

8.4.10 ibv_send_lat
This is a more advanced version of ib_send_lat; it contains more flags and features than the older version, as well as improved algorithms. ibv_send_lat calculates the latency of sending a packet of message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark
118. ir default values e SendCompletionMethod 0 IPoIB 1 ETH e RecvCompletionMethod 2 e ReceiveBuffers 1024 UseRSSForRawIP 1 DefaultRecvRingProcessor 1 TxlnterruptProcessor 1 TxForwardingProcessor 1 e UseRSSForUDP 1 In Operating Systems support NDIS6 2 MaxRssProcessors 8 In Operating Systems support NDIS6 3 NumRSSQueues 8 cl Specifies first connection name See examples c2 Specifies second connection name See examples b Specifies base RSS processor number See examples Used for manual option m only n Specifies number of RSS processors See examples Used for manual option m only Mellanox Technologies 50 J Rev 4 60 Flag Description st Single stream traffic scenario This option must be followed by one or two connection names for an Ethernet adapter The tuning will restore the default settings on the second connection and performed on the first connection This option automatically sets SendCompletionMethod 0 e RecvCompletionMethod 2 ReceiveBuffers 1024 In Operating Systems support NDIS6 3 RssProfile 4 Additionally this option chooses the best processors to assign to DefaultRecvRingProcessor TxInterruptProcessor TxForwardingProcessor In Operating Systems support NDIS6 2 RssBaseProcNumber MaxRssProcessors In Operating Systems support NDIS6 3 NumRSSQueues RssMaxProcNumber Examples
119. istry, the following defaults are assigned:
Table 7: DSCP Default Registry Keys Settings
Registry Key  Default Value
  TxUntagPriorityTag  0
  RxUntaggedMapToLossless  0
  RroceDscpMarkPriorityFlowControl_0  0
  RroceDscpMarkPriorityFlowControl_1  1
  RroceDscpMarkPriorityFlowControl_2  2
  RroceDscpMarkPriorityFlowControl_3  3
  RroceDscpMarkPriorityFlowControl_4  4
  RroceDscpMarkPriorityFlowControl_5  5
  RroceDscpMarkPriorityFlowControl_6  6
  RroceDscpMarkPriorityFlowControl_7  7

3.9.6 DSCP Sanity Testing
To verify that all QoS and DSCP settings are correct, you can capture incoming and outgoing traffic using the ibdump tool and see the DSCP value in the captured packets, as displayed in the figure below.
[Figure: packet capture of a UDP packet from 11.7.33.148 to 11.7.33.149 (source port 49153). In the Internet Protocol Version 4 header, the Differentiated Services Field reads 0x0e: DSCP 0x03, ECN 0x02 (ECN-Capable Transport).]
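When checking a capture by hand, it helps to remember that the IPv4 Differentiated Services byte carries the DSCP in its upper six bits and the ECN in its lower two bits (RFC 2474 and RFC 3168). The Python sketch below splits and rebuilds such a byte; the function names are hypothetical and the sample value 0x0e is used purely as an illustration.

```python
def split_ds_field(ds_byte):
    """Split the IPv4 DS byte into (DSCP, ECN): upper 6 bits and lower 2 bits."""
    if not 0 <= ds_byte <= 0xFF:
        raise ValueError("DS field is a single byte")
    return ds_byte >> 2, ds_byte & 0x3


def make_ds_field(dscp, ecn=0):
    """Rebuild the DS byte from a DSCP (0..63) and an ECN (0..3) value."""
    if not 0 <= dscp <= 63 or not 0 <= ecn <= 3:
        raise ValueError("DSCP must be 0..63 and ECN 0..3")
    return (dscp << 2) | ecn


if __name__ == "__main__":
    dscp, ecn = split_ds_field(0x0E)
    print(dscp, ecn)
```

A DS byte of 0x0e therefore corresponds to DSCP 3 with ECN 2, which is the value expected for traffic marked with the default setting of priority 3.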
120. itches routers Run Disable NetAdapterQos on all of the hosts in a PowerShell window Mellanox Technologies 128 Rev 4 60 11 Documentation e Under installation directory Documentation License file User Manual this document MLNX VPI WinOF Installation Guide MLNX VPI WinOF Release Notes MLNX VPI WinOF Registry Keys Mellanox Technologies 129 Rev 4 60 Appendix A Windows MPI MS MPI A 1 Overview Message Passing Interface MPI is meant to provide virtual topology synchronization and com munication functionality between a set of processes With MPI you can run one process on several hosts Windows MPI run over the following protocols Sockets Ethernet Network Direct ND A 1 1 Prerequisites nstall HPC Build 4 0 3906 0 Validate traffic ping between the whole MPI Hosts Every MPI client need to run smpd process which open the mpi channel MPI Initiator Server need to run mpiexec If the initiator is also client it should also run smpd A 2 Running MPI Step 1 Run the following command on each mpi client Stage smp GUESS OUS der Step2 Install ND provider on each MPI client in MPI ND Step3 Run the following command on MPI server MOLES SS Empoli orc MOS MSIE SEO Sis DOSTS H9 List env MPICH NETMASK network ip subnet env MPICH ND ZCOPY THRESHOLD 1 env MPICH DISABLE ND 0 1 env MPICH DISABLE SOCK 0 1 affinity process A 3 Direct
121. ities
Chapter 2 Firmware Upgrade
  2.1 Downloading Firmware
  2.2 Downloading Mellanox Firmware Tools
  2.3 Upgrading Firmware
    2.3.1 Upgrading Firmware Manually
Chapter 3 Driver Features
  3.1 Hyper-V with VMQ
  3.2 Header Data Split
  3.3 Receive Side Scaling (RSS)
  3.4 Port Configuration
    3.4.1 Auto Sensing
    3.4.2 Port Protocol Configuration
  3.5 Load Balancing, Fail-Over (LBFO) and VLAN
    3.5.1 Adapter Teaming
    3.5.2 Creating a Load Balancing and Fail-Over (LBFO) Bundle
    3.5.3 Creating a Port VLAN in Windows 2008 R2
    3.5.4 Removing a Port VLAN in Windows 2008 R2
    3.5.5 Configuring a Port to Work with VLAN in Windows 2012 and Above
  3.6 Ports TX Arbitration
  3.7 RDMA over Converged Ethernet (RoCE)
    3.7.1 RoCE Overview
    3.7.2 RoCE Configuration
122. ize registry value. To resolve this issue, remove the value key under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize, or set its value to 0xFFFF.

Issue 8: Packets are being lost.
Suggestion: This may occur if the port MTU has been set to a value higher than the maximum MTU supported by the switch.

Issue 9: Issue(s) not listed above.
The MLNX_EN for Windows driver records events in the system log of the Windows event system. Using the event log, you will be able to identify, diagnose, and predict sources of system problems.
Suggestion: To see the log of events, open System Event Viewer as follows:
1. Right-click My Computer, click Manage, and then click Event Viewer.
OR
1. Click Start > Run and enter "eventvwr.exe".
2. In Event Viewer, select the system log.
The following events are recorded:
- Mellanox ConnectX EN 10Gbit Ethernet Adapter <X> has been successfully initialized and enabled.
- Failed to initialize Mellanox ConnectX EN 10Gbit Ethernet Adapter.
- Mellanox ConnectX EN 10Gbit Ethernet Adapter <X> has been successfully initialized and enabled. The port's network address is <MAC Address>.
- The Mellanox ConnectX EN 10Gbit Ethernet was reset.
- Failed to reset the Mellanox ConnectX EN 10Gbit Ethernet NIC. Try disabling then re-enabling the "Mellanox Ethernet Bus Driver" device via the Windows device manager.
- Mellanox ConnectX EN 10Gbit Ethernet Adapter <X> has been successfully stopped
123. k Virtualization using Generic Routing Encapsulation (NVGRE) off-load is currently supported in Windows Server 2012 R2 only.
Network Virtualization using Generic Routing Encapsulation (NVGRE) is a network virtualization technology that attempts to alleviate the scalability problems associated with large cloud computing deployments. It uses Generic Routing Encapsulation (GRE) to tunnel layer 2 packets across an IP fabric, and uses 24 bits of the GRE key as a logical network discriminator (which is called a tenant network ID).
Configuring the Hyper-V Network Virtualization requires two types of IP addresses:
- Provider Addresses (PA): unique IP addresses assigned to each Hyper-V host that are routable across the physical network infrastructure. Each Hyper-V host requires at least one PA to be assigned.
- Customer Addresses (CA): unique IP addresses assigned to each Virtual Machine that participates in a virtualized network. Using NVGRE, multiple CAs for VMs running on a Hyper-V host can be tunneled using a single PA on that Hyper-V host. CAs must be unique across all VMs on the same virtual network, but they do not need to be unique across virtual networks with different Virtual Subnet IDs.
The VM generates a packet with the addresses of the sender and the recipient within the CA space. The Hyper-V host then encapsulates the packet with the addresses of the sender and the recipient in the PA space. PA addresses are determined by using virtualizatio
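The role of the 24-bit tenant network ID can be illustrated by splitting a 32-bit GRE key. The Python sketch below assumes the key layout defined in RFC 7637, where the upper 24 bits carry the Virtual Subnet ID (VSID) and the lower 8 bits carry a FlowID; the manual text itself only states that 24 bits of the key are used. The function names are hypothetical.

```python
def split_gre_key(key):
    """Split a 32-bit GRE key into (VSID, FlowID), assuming the RFC 7637 layout."""
    if not 0 <= key <= 0xFFFFFFFF:
        raise ValueError("GRE key is a 32-bit value")
    return key >> 8, key & 0xFF


def make_gre_key(vsid, flow_id=0):
    """Build a GRE key from a 24-bit VSID and an 8-bit FlowID."""
    if not 0 <= vsid <= 0xFFFFFF:
        raise ValueError("VSID must fit in 24 bits")
    if not 0 <= flow_id <= 0xFF:
        raise ValueError("FlowID must fit in 8 bits")
    return (vsid << 8) | flow_id


if __name__ == "__main__":
    key = make_gre_key(5000, 0)      # 5000 is an arbitrary example VSID
    print(hex(key), split_gre_key(key))
```

Because the VSID travels in the outer header, two virtual networks with different VSIDs can reuse the same Customer Addresses without ambiguity.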
124. kip the executions of the selected checks. Skip options (one or more can be specified): dup_guids, zero_guids, pm, logical_state, part, ipoib, all.

8.3.2.2 ibdiagnet Output Files

Table 12 - ibdiagnet Output Files
Output File        Description
ibdiagnet.log      A dump of all the application reports generated according to the provided flags.
ibdiagnet.lst      List of all the nodes, ports and links in the fabric.
ibdiagnet.fdbs     A dump of the unicast forwarding tables of the fabric switches.
ibdiagnet.mcfdbs   A dump of the multicast forwarding tables of the fabric switches.
ibdiagnet.masks    In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs.
ibdiagnet.sm       List of all the SMs (state and priority) in the fabric.
ibdiagnet.pm       A dump of the PM counter values of the fabric links.
ibdiagnet.pkey     A dump of the existing partitions and their member host ports.
ibdiagnet.mcg      A dump of the multicast groups, their properties and member host ports.
ibdiagnet.db       A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option.

In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times according to the
125. lease refer to SwitchX User Manual.

3.7.4 Configuring Arista Switch
Step 1. Set the ports that face the hosts as trunk.
  (config)# interface et10
  (config-if-Et10)# switchport mode trunk
Step 2. Set VID allowed on trunk port to match the host VID.
  (config-if-Et10)# switchport trunk allowed vlan 100
Step 3. Set the ports that face the network as trunk.
  (config)# interface et20
  (config-if-Et20)# switchport mode trunk
Step 4. Assign the relevant ports to LAG.
  (config)# interface et10
  (config-if-Et10)# dcbx mode ieee
  (config-if-Et10)# speed forced 40gfull
  (config-if-Et10)# channel-group 11 mode active
Step 5. Enable PFC on ports that face the network.
  (config)# interface et20
  (config-if-Et20)# load-interval 5
  (config-if-Et20)# speed forced 40gfull
  (config-if-Et20)# switchport trunk native vlan tag
  (config-if-Et20)# switchport trunk allowed vlan 11
  (config-if-Et20)# switchport mode trunk
  (config-if-Et20)# dcbx mode ieee
  (config-if-Et20)# priority-flow-control mode on
  (config-if-Et20)# priority-flow-control priority 3 no-drop

3.7.4.1 Using Global Pause Flow Control (GFC)
To enable GFC on ports that face the hosts, perform the following:
  (config)# interface et10
  (config-if-Et10)# flowcontrol receive on
  (config-if-Et10)# flowcontrol send on

3.7.4.2 Using Priority Flow Control (PFC)
To enable PFC on ports that fa
126. load Jumbo Packet Large Send Offload LSO Large Send Offload V2 IPv4 Large Send Offload V2 IPv6 Large Send Offload Version 1 IP

Step 3. Modify configuration parameters to suit your system. Please note the following:
a. For help on a specific parameter/option, check the help button at the bottom of the dialog.
b. If you select one of the entries Off-load Options, Performance Options, or Flow Control Options, you will need to click the Properties button to modify parameters via a pop-up dialog.

5.3 Configuring Quality of Service (QoS)
Prior to configuring Quality of Service, you must install Data Center Bridging using one of the following methods.

To install the Data Center Bridging using the Server Manager:
Step 1. Open the Server Manager.
Step 2. Select Add Roles and Features.
Step 3. Click Next.
Step 4. Select Features on the left panel.
Step 5. Check the Data Center Bridging checkbox.
Step 6. Click Install.

To install the Data Center Bridging using PowerShell:
Step 1. Enable Data Center Bridging (DCB).
  PS $ Install-WindowsFeature Data-Center-Bridging

To configure QoS on the host:
The procedure below is not saved after you reboot your system. Hence, we recommend you create a script using the steps below and run it on the local machine. Please see the procedure below on how to add the script to the local machine startup scripts.
Step 1. Change
127. mmand Examples Running MPI pallas test over ND MOMS ese cio GOZO nosies A iil iid WAG Ow SET Atay oi wil Zi ayy Sil 11 11 145 101 env MPICH NETMASK 11 0 0 0 255 0 0 0 rexmw IMPICEL IND ACORN HRL ERS STER il oeisny IMIPIVCIs IDICSVNS ILI IND ces MPICH DISABLE SOCK 1 affinity c testl exe Running MPI pallas test over ETH xempiexec exe p 19020 hosts 4 11 11 146 101 11 21 147 101 11 21 147 51 11 11 145 101 env MPICH NETMASK 11 0 0 0 255 0 0 0 env MPICH ND ZCOPY THRESHOLD 1 env MPICH DISABLE ND 1 env MPICH DISABLE SOCK 0 affinity c testl exe Mellanox Technologies 132 Rev 4 60 Appendix B NVGRE Configuration Scrips Examples The setup is as follow for both examples below Hypervisor mtlael4 Portl 192 168 20 114 24 VM on mtlael4 mtlael4 005 172 16 14 5 16 Mac 001550720100 VM on mtlael4 mtlael4 006 172 16 14 6 16 Mac 001550720101 Hypervisor mtlael5 Portl 192 168 20 115 24 VM on mtlael5 mtlael5 005 172 16 15 5 16 Mac 00155D730100 VM on mtlael5 mtlael5 006 172 16 15 6 16 Mac 00155D730101 B 1 Adding NVGRE Configuration to Host 14 Example The following is an example of adding NVGRE to Host 14 On both sides vSwitch create command Note that vSwitch configuration is persistent no need to configure it after each reboot ew VMSwitch VSwMLNX NetAdapterName Portl AllowManagementOS Strue Shut down VMs Stop VM Name mtlael4 005
128. mmended to back up the registry on your system before implementing recommendations included in this section. If the modifications you apply lead to serious problems, you will be able to restore the original registry state. For more details about backing up and restoring the registry, please visit www.microsoft.com.

6.1 General Performance Optimization and Tuning
To achieve the best performance for Windows, you may need to modify some of the Windows registries.

6.1.1 Registry Tuning
The registry entries that may be added/changed by this "General Tuning" procedure are:
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters:
• Disable TCP selective acks option for better CPU utilization:
  SackOpts, type REG_DWORD, value set to 0.
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\AFD\Parameters:
• Enable fast datagram sending for UDP traffic:
  FastSendDatagramThreshold, type REG_DWORD, value set to 64K.
Under HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Ndis\Parameters:
• Set RSS parameters:
  RssBaseCpu, type REG_DWORD, value set to 1.

6.1.2 Enable RSS
Enabling Receive Side Scaling (RSS) is performed by means of the following command:
  netsh int tcp set global rss=enabled

6.1.3 Tuning the IPoIB Network Adapter
The IPoIB Network Adapter tuning can be performed either during installation by modifying some of Windows registries as explained in Section 6.1.1, "Registry Tuning," on page 46, or
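The registry entries listed in Section 6.1.1 can be expressed as data and turned into command lines for review before anything is changed. The sketch below is only an illustration: it prints "reg add" commands and does not touch the registry. It assumes that "64K" means 65536 and uses the HKLM abbreviation for HKEY_LOCAL_MACHINE; the helper names are hypothetical.

```python
# (key path, value name, REG_DWORD data) for the entries in Section 6.1.1.
# Assumption: "64K" is interpreted as 64 * 1024 = 65536.
REGISTRY_TUNING = [
    (r"HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters", "SackOpts", 0),
    (r"HKLM\SYSTEM\CurrentControlSet\Services\AFD\Parameters",
     "FastSendDatagramThreshold", 64 * 1024),
    (r"HKLM\SYSTEM\CurrentControlSet\Services\Ndis\Parameters", "RssBaseCpu", 1),
]


def reg_add_command(key: str, name: str, value: int) -> str:
    """Format one 'reg add' command line that sets a REG_DWORD value."""
    return f'reg add "{key}" /v {name} /t REG_DWORD /d {value} /f'


def tuning_commands() -> list:
    """Return the command lines for all entries, in the order listed above."""
    return [reg_add_command(key, name, value) for key, name, value in REGISTRY_TUNING]


if __name__ == "__main__":
    for command in tuning_commands():
        print(command)
```

Printing the commands first makes it easy to back up the affected keys, as recommended at the start of this chapter, before running them from an elevated prompt.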
129. n

List of Tables
Table 1: Document Revision History
Table 2: Documentation Conventions
Table 3: Abbreviations and Acronyms
Table 4: Related Documents
Table 5: Registry Keys Setting
Table 6: DSCP Registry Keys Settings
Table 7: DSCP Default Registry Keys Settings
Table 8: Mellanox Adapter Traffic Counters
Table 9: Mellanox Adapter Diagnostics Counters
Table 10: Mellanox QoS Counters
Table 11: ibdiagnet Options
Table 12: ibdiagnet Output Files
Table 13: ibportstate Flags and Options
Table 14: ibroute Flags and Options
Table 15: ibdump Flags and Options
Table 16: smpquery Flags and Options
Table 17: perfquery Flags and Options
Table 18: ibping Flags and Options
Table 19: ibnetdiscover Flags and Options
Table 20: ibtracert Flags and Options
130. n
--Lid, -L                   Uses LID address argument
--errors                    Shows send and receive errors
--usage, -u                 Usage message
--Guid, -G                  Uses GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
--sm_port, -s <smlid>       Uses 'smlid' as the target lid for SM/SA queries
--help, -h                  Shows the usage message
--verbose, -v/-vv/-v -v -v  Increases the application verbosity level
--version, -V               Shows the version info
--Ca, -C <ca_name>          Uses the specified ca_name
--Port, -P <ca_port>        Uses the specified ca_port
--timeout, -t <timeout_ms>  Overrides the default timeout for the solicited mads

Examples
Unicast examples:
  ibtracert 4 16            # show path between lids 4 and 16
  ibtracert -n 4 16         # same, but using simple output format
  ibtracert -G 0x8f1040396522d 0x002c9000100d051   # use guid addresses
Multicast example:
  ibtracert -m 0xc000 4 16  # show multicast path of mlid 0xc000 between lids 4 and 16

8.3.11 sminfo
Optionally sets and displays the output of a sminfo query in a readable format. The target SM is the one listed in the local port info, or the SM specified by the optional SM lid or by the SM direct routed path.

Using sminfo for any purposes other than simple query may result in a malfunction of the target SM.

8.3.11.1 sminfo Synopsis
sminfo [-d(ebug)] [-e(rr_show)] -s <state> -p <prio> -a <activity> [-D(irect)] [-L(id)] [-u(sage)] [-G(uid)] [-C ca
131. n one switch system, each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.

For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).

To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option '-t <topology file name>'
2. Define the environment variable IBDIAG_TOPO_FILE

To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option '-s <local system name>'
2. Define the environment variable IBDIAG_SYS_NAME

8.3.1.2 IB Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option '-p <local port number>' (see below)
2. Define the environment variable IBDIAG_PORT_NUM

In case more than one HCA device is installed on t
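The two ways of supplying each setting (a command-line option or an environment variable) can be modeled with a small lookup helper. The Python sketch below is an illustration only: the manual does not say which source wins when both are given, so the precedence used here (command line first) is an assumption, and resolve_setting is a hypothetical helper rather than part of the IB Diagnostic Tools.

```python
import os


def resolve_setting(cli_value, env_name, environ=None):
    """Return the command-line value if given, else the environment variable.

    Assumption: the command line takes precedence over the environment.
    Returns None when neither source provides a value.
    """
    env = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return env.get(env_name)


if __name__ == "__main__":
    # No '-t' option given, so the environment variable (if set) is used.
    print(resolve_setting(None, "IBDIAG_TOPO_FILE"))
```

The same helper covers IBDIAG_SYS_NAME and IBDIAG_PORT_NUM by passing a different variable name.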
132. n table. The Hyper-V host retrieves the received packet, identifies the recipient, and forwards the original packet with the CA addresses to the desired VM.

NVGRE can be implemented across an existing physical IP network without requiring changes to physical network switch architecture. Since NVGRE tunnels terminate at each Hyper-V host, the hosts handle all encapsulation and de-encapsulation of the network traffic. Firewalls that block GRE tunnels between sites have to be configured to support forwarding GRE (IP Protocol 47) tunnel traffic.

Figure 1: NVGRE Packet Structure [fields shown: Outer IP Header (PA), GRE Header (includes 24-bit TNI), TCP Header, TCP user data]

3.8.4 Enabling/Disabling NVGRE Offloading
To leverage NVGRE to virtualize heavy network IO workloads, the Mellanox ConnectX-3 Pro network NIC provides hardware support for GRE off-load within the network NICs by default.

To enable/disable NVGRE off-loading:
Step 1. Open the Device Manager.
Step 2. Go to the Network adapters.
Step 3. Right-click > Properties on the Mellanox ConnectX-3 Pro Ethernet Adapter card.
Step 4. Go to the Advanced tab.
Step 5. Choose the Encapsulate Task Offload option.
Step 6. Set one of the following values:
• Enable: GRE off-loading is Enabled by default.
• Disabled: When disabled, the Hyper-V host will still be able to transfer NVGRE traffic, but TCP and inner IP checksums will be calculated by softw
133. name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [-h(elp)] sm_lid | sm_dr_path [modifier]

8.3.11.2 sminfo Options
The table below lists the various flags of the command.
Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.

Table 21 - sminfo Flags and Options
Flag                          Description
--state, -s                   Sets SM state: 0 - not active, 1 - discovering, 2 - standby, 3 - master
--priority, -p                Sets priority (0-15)
--activity, -a                Sets activity count
--debug, -d/-ddd/-d -d -d     Raises the IB debugging level
--Direct, -D                  Uses directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)
--Lid, -L                     Uses LID address argument
--usage, -u                   Usage message
--errors                      Shows send and receive errors (timeouts and others)
--Guid, -G                    Uses GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
--help, -h                    Shows the usage message
--verbose, -v/-vv/-v -v -v    Increases the application verbosity level
--version, -V                 Shows the version info
--Ca, -C <ca_name>            Uses the specified ca_name
--Port, -P <ca_port>          Uses the specified ca_port
--timeout, -t <timeout_ms>    Overrides the default timeout for the solicited mads
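The directed path argument described above is a comma-separated list of out ports. The Python sketch below shows one way to parse such a string and reproduce the two examples from the table. It is an illustration only: the function names are hypothetical, and treating a leading 0 as the local (self) port is an assumption drawn from those two examples.

```python
from typing import List


def parse_directed_path(path: str) -> List[int]:
    """Split a directed path such as "0,1,2,1,4" into a list of port numbers."""
    ports = [int(token) for token in path.split(",")]
    if any(port < 0 for port in ports):
        raise ValueError("port numbers must be non-negative")
    return ports


def describe_directed_path(path: str) -> str:
    """Render the path the way the flag table describes its examples."""
    ports = parse_directed_path(path)
    # Assumption: a leading 0 stands for the local port and is not a hop.
    hops = ports[1:] if ports[0] == 0 else ports
    if not hops:
        return "self port"
    return "out via port " + ", then ".join(str(port) for port in hops)
```

With this helper, describe_directed_path("0") yields "self port", and "0,1,2,1,4" is reported as a walk out via port 1, then 2, then 1, then 4.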
134. nd run test on those QPs
-Z, --com_rdma_cm   Communicates with rdma_cm module to exchange data (uses regular QPs)

8.4.11 ibv_write_bw
This is a more advanced version of ib_write_bw and contains more flags and features than the older version, and also improved algorithms. ibv_write_bw calculates the BW of RDMA write between a pair of machines. One acts as a server and the other as a client. The client RDMA-writes to the server memory and calculates the BW by sampling the CPU each time it receives a successful completion. The test supports a large variety of features as described below, and has better performance than ib_write_bw in Nehalem systems.

8.4.11.1 ibv_write_bw Synopsis
ibv_write_bw [-i(b_port) ib_port] [-d ib_device] [-c(onnection_type) RC/UC] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-n iteration_num] [-p(ort) PDT_port] [-I(nline_size) inline_size] [-u qp_timeout] [-S(l) sl_type] [-x gid_index] [-e(vents) use_events] [-N(o_peak) use peak calc] [-F CPU freq fail] [-g num_of_posts] [-q num_of_qps] [-b(idirectional)] [-a(ll)] [-V(ersion)]

8.4.11.2 ibv_write_bw Options
The table below lists the various flags of the command.

Table 43 - ibv_write_bw Flags and Options
Flag                    Description
-p, --port <port>       Listens on/connects to port <port> (default 18515)
-d, --ib-dev <dev>      Uses IB device <device guid> (default first device found)
-i, --ib-port <port>    Uses
135. [Figure: Device Manager device list, including the Mellanox ConnectX-3 (MT04099) Network Adapter entries]

Step 2. Right-click on the Mellanox ConnectX Ethernet network adapter and left-click Properties. Select the Port Protocol tab from the Properties window.
The "Port Protocol" tab is displayed only if the NIC is a VPI (IB and ETH).
The figure below is an example of the displayed Port Protocol window for a dual port VPI adapter card.

[Figure: Port Protocol tab. Current Setting: Port 1 IB, Port 2 Eth. Port Type Configuration: HW Defaults; Port 1: IB, ETH, AUTO; Port 2: IB, ETH, AUTO.]
Port Protocol Configuration: This menu dis
136. ode-name-map <node-name-map>] [--cache <filename>] [--load-cache <filename>] [-p(orts)] [-m(ax_hops)] [-h(elp)] [<topology-file>]

8.3.9.2 ibnetdiscover Options
The table below lists the various flags of the command.
Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.

Table 19 - ibnetdiscover Flags and Options
Flag                                Description
-l, --list                          Lists connected nodes
-g, --grouping                      Shows grouping. Grouping correlates InfiniBand nodes by different vendor-specific schemes. It may also show the switch external ports correspondence.
-H, --Hca_list                      Lists connected CAs
-S, --Switch_list                   Lists connected switches
-R, --Router_list                   Lists connected routers
-s, --show                          Shows progress information during discovery
--node-name-map <node-name-map>     Specifies a node name map. The node name map file maps GUIDs to more user-friendly names. See "Topology File Format" on page 82.
--cache <filename>                  Caches the ibnetdiscover network data in the specified filename. This cache may be used by other tools for later analysis.
--load-cache <filename>             Loads and uses the cached ibnetdiscover data stored in the specified filename. May be useful for o
137. of local CQE with errors when the local machine receives inbound traffic.

Requester Invalid request errors: Number of remote invalid request errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end detected an invalid OpCode request.
Responder Invalid request errors: Number of remote invalid request errors when the local machine receives inbound traffic.
Requester Remote access errors: Number of remote access errors when the local machine generates outbound traffic, i.e. a NAK was received indicating that the other end detected a wrong rkey.
Responder Remote access errors: Number of remote access errors when the local machine receives inbound traffic, i.e. the local machine received an RDMA request with a wrong rkey.
Requester RNR NAK: Number of RNR (Receiver Not Ready) NAKs received when the local machine generates outbound traffic.
Responder RNR NAK: Number of RNR (Receiver Not Ready) NAKs sent when the local machine receives inbound traffic.
Requester out of order sequence NAK: Number of Out of Sequence NAKs received when the local machine generates outbound traffic, i.e. the number of times the local machine received NAKs indicating OOS on the receiving side.
Responder out of order sequence received: Number of Out of Sequence packets received when the local machine receives inbound traffic, i.e. the number of times the local machine received messages th
138. of pause frames that were received for priority 1. The untagged instance indicates global pause that were received.
The total duration (in microseconds) of pause that was requested by the other end to freeze transmission on priority 1.

7 OpenSM - Subnet Manager
OpenSM v3.3.11 is an InfiniBand Subnet Manager. In order to operate one host machine or more in the InfiniBand cluster, at least one Subnet Manager is required in the fabric.

Please use the embedded OpenSM in the WinOF package for testing purposes in a small cluster. Otherwise, we recommend using OpenSM from FabricIT EFM or UFM or MLNX-OS.

OpenSM can run as a Windows service and can be started manually from the following directory: <installation_directory>\tools. OpenSM as a service will use the first active port, unless it receives a specific GUID.

OpenSM can be registered as a service from either the Command Line Interface (CLI) or the PowerShell.

The following are commands used from the CLI:
To register it as a service, execute the OpenSM service:
  > sc create OpenSM binPath= "c:\Program Files\Mellanox\MLNX_VPI\IB\Tools\opensm.exe --service" start= auto
To start OpenSM as a service:
  > sc start OpenSM
To run OpenSM manually:
  > opensm.exe
For additional run options, enter:
  > opensm.exe -h

The following are commands used from the PowerShell:
To register it as a service, execute the OpenSM service
139. onym   Whole Word / Description
B      (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
b      (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
FW     Firmware
HCA    Host Channel Adapter
HW     Hardware
IB     InfiniBand
LSB    Least significant byte
lsb    Least significant bit
MSB    Most significant byte
msb    Most significant bit
NIC    Network Interface Card
SW     Software
VPI    Virtual Protocol Interconnect
IPoIB  IP over InfiniBand
PFC    Priority Flow Control
PR     Path Record
RDS    Reliable Datagram Sockets
RoCE   RDMA over Converged Ethernet
SL     Service Level
MPI    Message Passing Interface
EoIB   Ethernet over InfiniBand
QoS    Quality of Service
ULP    Upper Level Protocol
VL     Virtual Lane

Related Documents

Table 4 - Related Documents
Document              Description
MFT User Manual       Describes the set of firmware management tools for a single InfiniBand node. MFT can be used for:
                      • Generating a standard or customized Mellanox firmware image
                      • Querying for firmware information
                      • Burning a firmware image to a single InfiniBand node
WinOF Release Notes   For possible software issues, please refer to WinOF Release Notes.
SwitchX User Manual   This document contains information regarding configuring and managing Mellanox Technologies' SwitchX switch platforms, listing all of the com
140. ounters after reading them or simply reset them.

8.3.7.1 perfquery Applicable Hardware
All InfiniBand devices.

8.3.7.2 perfquery Synopsis
periguery al eel fe attsl lt attchisc cJ wewsl 9 mewere m lsijsleel Sel Jed l thicl cu L5wngesten cc deb y cel wee v usage u 1 r C ca name P ca port gt R t timeout ms V lt lid guid gt port reset mask

The table below lists the various flags of the command.

Table 17 - perfquery Flags and Options
Flag                       Description
--help, -h                 Print the help menu
--debug, -d                Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
--Guid, -G                 Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
--xmtsl, -X                Show Xmt SL port counters
--rcvsl, -S                Show Rcv SL port counters
--xmtdisc, -D              Show Xmt Discard Details
--rcverr, -E               Show Rcv Error Details
--smplctl, -c              Show samples control
--all_ports, -a            Apply query to all ports
--Lid, -L                  Use LID address argument
--sm_port, -s <lid>        SM port lid
--errors                   Show send and receive errors
--verbose, -v              Increase verbosity level
--usage, -u                Usage message
--loop_ports, -l           Loop ports
--reset_after_read, -r     Reset the counters after reading
141. p 5. Choose the Tx Throughput Port Arbiter option.
Step 6. Set one of the following values:
• Best Effort (Default): Default behavior. No precedence is given to this port over the other.
• Guaranteed: Give higher precedence to this port.
• Not Present: No configuration exists, defaults are used.

3.7 RDMA over Converged Ethernet (RoCE)

3.7.1 RoCE Overview
Remote Direct Memory Access (RDMA) is the remote memory management capability that allows server-to-server data movement directly between application memory without any CPU involvement. RDMA over Converged Ethernet (RoCE) is a mechanism to provide this efficient data transfer with very low latencies on loss-less Ethernet networks. With advances in data center convergence over reliable Ethernet, ConnectX-2/ConnectX-3 EN/ConnectX-3 Pro EN with RoCE uses the proven and efficient RDMA transport to provide the platform for deploying RDMA technology in mainstream data center applications at 10GigE and 40GigE link speeds. ConnectX-2/ConnectX-3/ConnectX-3 Pro EN, with its hardware offload support, takes advantage of this efficient RDMA transport (InfiniBand) services over Ethernet to deliver ultra-low latency for performance-critical and transaction-intensive applications such as financial, database, storage, and content delivery networks.

RoCE encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type. While the use of GRH is optional within InfiniBan
142. parameter>
-h   Shows the Help screen.

8.4.16 nd_read_lat
This test is used for performance measuring of RDMA-Read requests in Microsoft Windows Operating Systems. nd_read_lat is performance oriented for RDMA-Read with minimum latency, and runs over Microsoft's NetworkDirect standard. The level of customizing for the user is relatively high. The user may choose to run with a customized message size, customized number of iterations, or alternatively, customized test duration time. nd_read_lat runs with all message sizes from 1B to 4MB (powers of 2), message inlining, CQ moderation.

8.4.16.1 nd_read_lat Synopsis
(Running on specific single core)
Server side:
  start /b /affinity 0X1 nd_read_lat -s1048576 -D10 -S 11.137.53.1
Client side:
  start /b /wait /affinity 0X1 nd_read_lat -s1048576 -D10 -C 11.137.53.1

8.4.16.2 nd_read_lat Options
The table below lists the various flags of the command.

Table 48 - nd_read_lat Options
Flag                      Description
-h                        Shows the Help screen.
-v                        Shows the version number.
-p                        Connects to the port <port> <default 6830>.
-s <msg size>             Exchanges the message size with <default 65536B>, and it must not be combined with the -a flag.
-a                        Runs all the message sizes from 1B to 8MB, and it must not be combined with the -s flag.
-n <num of iterations>    The number of exchanges (at least 2, the default is 100000).
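To make the "all message sizes from 1B to 4MB (powers of 2)" sweep concrete, the Python sketch below lists the sizes such a run would cover. It is an illustration only and not part of nd_read_lat: the function name is hypothetical, and MB is interpreted as 1048576 bytes, in line with the manual's conventions. The upper bound is a parameter because the options table mentions 8MB for the -a flag.

```python
def message_sizes(max_bytes: int = 4 * 1024 * 1024) -> list:
    """Return every power-of-two message size from 1 byte up to max_bytes."""
    sizes = []
    size = 1
    while size <= max_bytes:
        sizes.append(size)
        size *= 2  # next power of two
    return sizes


if __name__ == "__main__":
    sweep = message_sizes()
    print(len(sweep), "sizes, from", sweep[0], "to", sweep[-1], "bytes")
```

Calling message_sizes(8 * 1024 * 1024) gives the sweep for the larger bound.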
143. plays the adapter's port type and enables you to set the network protocols for the network adapter ports. The network protocol is determined according to the NIC's Hardware Defaults port type. You can choose the protocol explicitly by selecting the port type to InfiniBand (IB) or Ethernet (Eth). To enable Auto Sensing, please choose AUTO.

Step 3. In this step, you can perform the following functions:
• If you choose the HW Defaults option, the port protocols will be determined according to the NIC's hardware default values.
• Choose the desired port protocol for the available port(s). If you choose IB or ETH, both ends of the connection must be of the same type (IB or ETH).
• Enable Auto Sensing by checking the AUTO checkbox. If the NIC does not support Auto Sensing, the AUTO option will be grayed out.
If you choose AUTO, the current setting will indicate the actual port settings (IB or ETH).

3.5 Load Balancing, Fail-Over (LBFO) and VLAN
Windows Server 2012 and above supports load balancing as part of the operating system. Please refer to the Microsoft guide "NIC Teaming in Windows Server 2012" at the link below:
http://social.technet.microsoft.com/wiki/contents/articles/1495.nic-teaming-in-windows-server-2012.aspx
For other earlier operating systems, please refer to the sections below.

3.5.1 Adapter Teaming
Adapter teaming
144. pu cycle units (default microseconds)
-H, --report-histogram              Prints out all results (default print summary only)
-U, --report-unsorted (implies -H)  Prints out unsorted results (default sorted)
-V, --version                       Displays version number
-e, --events                        Inactive during CQ events (default poll)
-F, --CPU-freq                      The CPU frequency test. It is active even if the cpufreq_ondemand module is loaded
-R, --rdma_cm                       Connects QPs with rdma_cm and runs test on those QPs
-Z, --com_rdma_cm                   Communicates with rdma_cm module to exchange data (uses regular QPs)
-c, --connection <RC/UC>            Connection type RC/UC (default RC)
-I, --inline_size <size>            Max size of message to be sent in inline (default 400)

8.4.9 ibv_send_bw
This is a more advanced version of ib_send_bw and contains more flags and features than the older version, and also improved algorithms. ibv_send_bw calculates the BW of SEND between a pair of machines. One acts as a server and the other as a client. The server receives packets from the client and they both calculate the throughput of the operation. The test supports a large variety of features as described below, and has better performance than ib_send_bw in Nehalem systems.

8.4.9.1 ibv_send_bw Synopsis
ibv_send_bw [-i(b_port) ib_port] [-d ib_device] [-c(onnection_type) RC/UC/UD] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-r(x-depth) rx_size] [-n iteration_num] [-p(ort) PDT_port] [-I(nline_size) inline_size] [-u qp
145. raffic
-h, --help      Display this help screen
-V, --version   Print version information

8.3.6 smpquery
Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.

8.3.6.1 smpquery Applicable Hardware
All InfiniBand devices.

8.3.6.2 smpquery Synopsis
smpquery [-h] [-d] [-e] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [-D] [-G] [-s <smlid>] [-L] [-u] [-V] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]

8.3.6.3 smpquery Options
The table below lists the various flags of the command.

Table 16 - smpquery Flags and Options
Flag             Description
-h, --help       Print the help menu
-d, --debug      Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-e, --errors     Show send and receive errors (timeouts and others)
-v, --verbose    Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-D, --Direct     Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)
-G, --Guid       Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f104002
146. ranslationMethodEncap
a. This is the VM's MAC address associated with the vSwitch connected to the Mellanox device.

Step 6. Add customer route.
  New-NetVirtualizationCustomerRoute -RoutingDomainID "{11111111-2222-3333-4444-000000005001}" -VirtualSubnetID <virtualsubnetID> -DestinationPrefix <VMInterfaceIPAddress/Mask> -NextHop "0.0.0.0" -Metric 255

Step 7. Configure the Provider Address and Route records on Hyper-V Host 1 (Host 1 Only).
  $NIC = Get-NetAdapter <EthInterfaceName>
  New-NetVirtualizationProviderAddress -InterfaceIndex $NIC.InterfaceIndex -ProviderAddress <HypervisorInterfaceIPAddress> -PrefixLength 24
  New-NetVirtualizationProviderRoute -InterfaceIndex $NIC.InterfaceIndex -DestinationPrefix "0.0.0.0/0" -NextHop <HypervisorInterfaceIPAddress>
a. The Hypervisor Interface IP address is 192.168.20.118 and the NextHop will be 192.168.20.1.

Step 8. Configure the Virtual Subnet ID on the Hyper-V Network Switch Ports for each Virtual Machine on each Hyper-V Host (Host 1 and Host 2).
  Get-VMNetworkAdapter -VMName <VMName> | where {$_.MacAddress -eq <VMmacaddress1>} | Set-VMNetworkAdapter -VirtualSubnetID <virtualsubnetID>

Please repeat steps 5 to 8 on each VM after rebooting the hypervisor.

3.8.3 Verifying the Encapsulation of the Traffic
Once the configuration using PowerShell is completed, verifying that packets are indeed encapsulated
147. rated by a colon. On switches, port guids are identical to the switch guid, so port guids will be adjusted as well on switches.
--caguid BEFOREGUID:AFTERGUID               Specifies a caguid that should be changed. The before and after guid should be separated by a colon.
--sysimgguid BEFOREGUID:AFTERGUID           Specifies a sysimgguid that should be changed. The before and after guid should be separated by a colon.
--portguid NODEGUID:BEFOREGUID:AFTERGUID    Specifies a portguid that should be changed. The node guid of the port (e.g. switchguid or caguid) should be specified first, followed by a colon, the before port guid, another colon, then the after port guid. On switches, port guids are identical to the switch guid, so the switch guid will be adjusted as well on switches.

Debugging Flags
NOTE: Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.
-h   Shows the usage message
-V   Shows the version info

8.3.18 iblinkinfo
iblinkinfo reports link info for each port in an IB fabric, node by node. Optionally, iblinkinfo can do partial scans and limit its output to parts of a fabric.

8.3.18.1 iblinkinfo Synopsis
[-hcdl -C <ca_name> -P <ca_port> -p -S <port_guid> -G <port_guid> -D <direct_route> --load-cache <filename
148. reshark tool (www.wireshark.org) for graphical traffic analysis. This provides the ability to analyze network behavior and performance, and to debug applications that send or receive RDMA network traffic. Run "ibdump -h" to display a help message which details the tool's options.

8.3.5.1 ibdump Synopsis
ibdump

8.3.5.2 ibdump Options
The table below lists the various ibdump flags of the command.

Table 15 - ibdump Flags and Options
Flag                            Description
-d, --ib-dev=<dev>              Use RDMA device <dev> (default first device found). The relevant devices can be listed by running the 'ibv_devinfo' command.
-i, --ib-port=<port>            Use port <port> of IB device (default 1).
-w, --write=<file>              Dump file name (default "sniffer.pcap"). '-' stands for stdout and enables piping to tcpdump or tshark.
-o, --output=<file>             Alias for the '-w' option. Do not use (kept for backward compatibility).
-b, --max-burst=<log2 burst>    log2 of the maximal burst size that can be captured with no packet loss. Each entry takes ~MTU bytes of memory (default 12, i.e. 4096 entries).
-s, --silent                    Do not print progress indication.
--mem-mode <size>               When specified, packets are written to file only after the capture is stopped. It is faster than the default mode (less chance for packet loss), but takes more memory. In this mode, ibdump stops after <size> bytes are captured.
--decap                         Decapsulate port mirroring headers. Should be used when capturing RSPAN t
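The max-burst description above implies a simple memory estimate: the option value is a log2, so the default of 12 means 2^12 = 4096 entries, each taking roughly one MTU of memory. The Python sketch below works that arithmetic out. It is an illustration only: the function name is hypothetical, and the MTU value in the example is an assumed figure rather than one stated in the manual.

```python
def burst_buffer_estimate(log2_burst: int, mtu_bytes: int):
    """Return (entries, approximate bytes) for a given max-burst setting.

    entries = 2 ** log2_burst, and each entry takes about one MTU of memory.
    """
    if log2_burst < 0 or mtu_bytes <= 0:
        raise ValueError("log2_burst must be >= 0 and mtu_bytes must be > 0")
    entries = 1 << log2_burst
    return entries, entries * mtu_bytes


if __name__ == "__main__":
    # Default max-burst of 12 with an assumed MTU of 2048 bytes.
    entries, approx_bytes = burst_buffer_estimate(12, 2048)
    print(entries, "entries, about", approx_bytes // (1024 * 1024), "MB")
```

Raising the option by one doubles both the number of entries and the approximate memory needed.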
149. rifying SMB Events that Confirm RDMA Connection
Chapter 5 Driver Configuration
5.1 Configuring the InfiniBand Driver
5.1.1 Modifying IPoIB Configuration
5.1.2 Displaying Adapter Related Information
5.2 Configuring the Ethernet Driver
5.3 Configuring Quality of Service (QoS)
Chapter 6 Performance Tuning
6.1 General Performance Optimization and Tuning
6.1.1 Registry Tuning
6.1.2 Enable RSS
6.1.3 Tuning the IPoIB Network Adapter
6.1.4 Tuning the Ethernet Network Adapter
6.2 Application Specific Optimization and Tuning
6.2.1 Ethernet Performance Tuning
6.2.2 IPoIB Performance Tuning
6.3 Tunable Performance Parameters
6.4 Adapter Proprietary Performance Counters
6.4.1 Supported Standard Performance Counters
Chapter 7 OpenSM - Subnet Manager
Chapter 8 InfiniBand Fabric
8.1 Network Direct Interface
8.2 part_man - Virtual IPoIB Port Creation Utility
150. ription
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>\MaxRSSProcessors
    Maximum number of CPUs allotted. Sets the desired maximum number of processors for each interface. The number can be different for each interface. Note: Restart the network adapter after you change this registry key.
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>\RssBaseProcNumber
    Base CPU number. Sets the desired base CPU number for each interface. The number can be different for each interface. This allows partitioning of CPUs across network adapters. Note: Restart the network adapter when you change this registry key.
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>\NumaNodeID
    NUMA node affinitization.
HKLM\SYSTEM\CurrentControlSet\Control\Class\{4d36e972-e325-11ce-bfc1-08002be10318}\<nn>\RssBaseProcGroup
    Sets the RSS base processor group for systems with more than 64 processors.
To find the nn value of your HCA from the Device Manager, perform the following steps:
Step 1: Open Device Manager and go to System devices.
Step 2: Right-click the Mellanox ConnectX card and select Properties.
Step 3: Go to the Details tab.
Step 4: Select the Driver key and obtain the nn number.
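Once the nn number has been read from the Driver key, the full registry location of each RSS setting follows mechanically. The sketch below only builds the path string and does not touch the registry. The helper name rss_value_path, the four-digit check on nn, and the sample value "0007" are assumptions for illustration.

```python
# Network adapter class key, as listed in the table above.
NET_CLASS_KEY = (
    r"HKLM\SYSTEM\CurrentControlSet\Control\Class"
    r"\{4d36e972-e325-11ce-bfc1-08002be10318}"
)

RSS_VALUES = ("MaxRSSProcessors", "RssBaseProcNumber",
              "NumaNodeID", "RssBaseProcGroup")


def rss_value_path(nn, value_name):
    """Build the registry path of one RSS setting for adapter instance nn.

    Assumes nn is the four-digit instance number taken from the Driver key.
    """
    if len(nn) != 4 or not nn.isdigit():
        raise ValueError("nn is assumed to be a four-digit number, e.g. '0007'")
    if value_name not in RSS_VALUES:
        raise ValueError(f"unknown RSS value name: {value_name}")
    return f"{NET_CLASS_KEY}\\{nn}\\{value_name}"


print(rss_value_path("0007", "MaxRSSProcessors"))
```

As the table notes, the network adapter has to be restarted after changing these keys before the new values take effect.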
151. [Screenshot: adapter information dialog listing driver version, firmware version, PCI Express link, adapter part number, MAC addresses, link state and IP address, with a Save To File button]
To save this information for debug purposes, click Save to File and provide the output file name.
5.2 Configuring the Ethernet Driver
The following steps describe how to configure advanced features.
Step 1: Display the Device Manager.
[Screenshot: Device Manager with the System devices list expanded]
152. rt>    Uses the specified ca_port.
-t <timeout_ms>    Overrides the default timeout for the solicited MADs.
8.3.22.3 Multiple CA/Multiple Port Support
When no IB device or port is specified, the port to use is selected by the following criteria:
1. The first port that is ACTIVE.
2. If not found, the first port that is UP (physical link up).
If a port and/or CA name is specified, the tool attempts to fulfill the user request and fails if this is not possible.
Examples
Direct Routed Examples:
smpdump -D 0,1,2,3,5 16    # NODE DESC
smpdump -D 0,1,2 0x15 2    # PORT INFO, port 2
LID Routed Examples:
smpdump 3 0x15 2    # PORT INFO, lid 3 port 2
smpdump 0xa0 0x11    # NODE INFO, lid 0xa0
8.4 InfiniBand Fabric Performance Utilities
The performance utilities described in this chapter are intended to be used as a performance micro-benchmark.
8.4.1 ib_read_bw
ib_read_bw calculates the BW of RDMA read between a pair of machines. One acts as a server and the other as a client. The client RDMA reads the server memory and calculates the BW by sampling the CPU each time it receives a successful completion. The test supports features such as Bidirectional (in which both sides RDMA read from each other's memory at the same time), change of MTU size, tx size, number of iterations, message size, and more. Read is available only in RC connection mode (as specified in the IB spec).
8.4.1.1 ib_read_bw Synop
153. s
Flags    Description
NOTE: Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.
-d    Raises the IB debugging level. Can be used several times (-ddd or -d -d -d).
-e    Shows send and receive errors (timeouts and others).
-h    Shows the usage message.
-v    Increases the application verbosity level. Can be used several times (-vv or -v -v -v).
-V    Shows the version info.
Addressing Flags    Description
-G    Uses GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-s <smlid>    Uses smlid as the target LID for SM/SA queries.
Other Common Flags    Description
-C <ca_name>    Uses the specified ca_name.
-P <ca_port>    Uses the specified ca_port.
-t <timeout_ms>    Overrides the default timeout for the solicited MADs.
8.3.20.3 Multiple CA/Multiple Port Support
When no IB device or port is specified, the port to use is selected by the following criteria:
1. The first port that is ACTIVE.
2. If not found, the first port that is UP (physical link up).
If a port and/or CA name is specified, the tool attempts to fulfill the user request and fails if this is not possible.
8.3.21 saquery
saquery issues the selected SA query. Node records are queried by default.
8.3.21.1 saquery Synopsis
154. s use
The table below lists the various flags of the command.
Table 39 - ibv_read_bw Flags and Options
Flag    Description
-p, --port=<port>    Listens on/connects to port <port> (default 18515)
-d, --ib-dev=<dev>    Uses IB device <device guid> (default: first device found)
-i, --ib-port=<port>    Uses port <port> of IB device (default 1)
-m, --mtu=<mtu>    The MTU size (default 1024)
-o, --outs=<num>    The number of outstanding read/atom (default for ConnectX 16, others 4)
-s, --size=<size>    The size of message to exchange (default 65536)
-a, --all    Runs sizes from 2 till 2^23
-t, --tx-depth=<dep>    The size of tx queue (default 100)
-n, --iters=<iters>    The number of exchanges (at least 2, default 1000)
-u, --qp-timeout=<timeout>    QP timeout. The timeout value is 4 usec * 2^timeout (default 14)
-S, --sl=<sl>    The service level (default 0)
-x, --gid-index=<index>    Test uses GID with GID index taken from command line (for RDMAoE, index should be 0)
-b, --bidirectional    Measures bidirectional bandwidth (default unidirectional)
-V, --version    Displays version number
-g, --post=<num of posts>    The number of posts for each QP in the chain (default tx_depth)
-e, --events    Inactive during CQ events (default poll)
-F, --CPU-freq    The CPU frequency test
-R, --rdma_cm
155. s    List all IB devices
-s, --short    Short output
-p, --port_list    Show port list
Table 23 - ibstat Flags and Options
Flag    Description
ca_name    InfiniBand device name
portnum    Port number of InfiniBand device
-d, --debug (-ddd or -d -d -d)    Raise the IB debugging level
-h, --help    Show the usage message
-v, --verbose (-vv or -v -v -v)    Increase the application verbosity level
-V, --version    Show the version info
-u, --usage    Usage message
Examples:
ibstat    # display status of all ports on all IB devices
ibstat -l    # list all IB devices
ibstat -p    # show port guids
ibstat mthca0 2    # show status of port 2 of 'mthca0'
8.3.14 vstat
vstat is a binary which displays information on the HCA attributes. vstat synopsys is: vstat [-v] [-c] [-m] [-p N]
8.3.14.1 vstat Options
The table below lists the various flags of the command.
Table 24 - vstat Flags and Options
Flag    Description
-v    Verbose mode
-c    HCA error/statistic counters
-m    More verbose mode
-p N    Repeat every N sec
8.3.15 osmtest
osmtest is a test program to validate the InfiniBand subnet manager and administration (SM/SA). Default is to run all flows with the exception of the QoS flow. osmtest provides a test suite for opensm. osmtest has the following capabilities and testing flows: It creates an inventory file of all available Nodes, Ports, and PathRecords, including all th
156. s a general purpose SMP utility which gets SM attributes from a specified SMA. The result is dumped in hex by default.
8.3.22.1 smpdump Synopsis
smpdump [-s] [-D(irect)] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [-h(elp)] <dlid|dr_path> <attr> [mod]
8.3.22.2 smpdump Options
Table 32 - smpdump Flags and Options
Flags    Description
attr    IBA attribute ID for SM attribute
mod    IBA modifier for SM attribute
Debugging Flags    Description
NOTE: Most OpenIB diagnostics take the following common flags. The exact list of supported flags per utility can be found in the usage message and can be shown using the util_name -h syntax.
-d    Raises the IB debugging level. Can be used several times (-ddd or -d -d -d).
-e    Shows send and receive errors (timeouts and others).
-h    Shows the usage message.
-v    Increases the application verbosity level. Can be used several times (-vv or -v -v -v).
-V    Shows the version info.
Addressing Flags    Description
-D    Uses directed path address arguments. The path is a comma separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)
-G    Uses GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-s <smlid>    Uses smlid as the target LID for SM/SA queries.
Flags    Description
-C <ca_name>    Uses the specified ca_name.
-P <ca_po
157. s sent that are covered by this priority. The counted bytes include framing characters (modulo 2^64).
Table 10 - Mellanox QoS Counters
Mellanox QoS Counters    Description
Bytes Sent/Sec    The number of bytes sent per second that are covered by this priority. The counted bytes include framing characters.
Packets Sent    The number of packets sent that are covered by this priority (modulo 2^64).
Packets Sent/Sec    The number of packets sent per second that are covered by this priority.
Bytes and Packets TOTAL
Bytes Total    The total number of bytes that are covered by this priority. The counted bytes include framing characters (modulo 2^64).
Bytes Total/Sec    The total number of bytes per second that are covered by this priority. The counted bytes include framing characters.
Packets Total    The total number of packets that are covered by this priority (modulo 2^64).
Packets Total/Sec    The total number of packets per second that are covered by this priority.
PAUSE INDICATION
Per prio sent pause frames    The number of pause frames that were sent to priority i. The untagged instance indicates the global pause frames that were sent.
Per prio sent pause duration    The total duration, in microseconds, of pause that was sent to the other end to freeze the transmission on priority i.
Per prio rcv pause frames    The number
Per prio rcv pause duration
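Several of the counters above are reported modulo 2^64, so a raw reading can wrap back to a small value. A minimal sketch of computing the increase between two samples in the presence of a wrap is shown below. The function name counter_delta is hypothetical, and the sketch assumes the counter wrapped at most once between the two samples.

```python
COUNTER_MODULUS = 2 ** 64


def counter_delta(previous, current):
    """Increase between two samples of a counter that wraps at 2**64.

    Assumes at most one wrap-around occurred between the samples.
    """
    return (current - previous) % COUNTER_MODULUS


# A sample taken just before the wrap, followed by one taken just after.
print(counter_delta(2 ** 64 - 10, 5))  # 15
```

Python's modulo always returns a non-negative result for a positive modulus, which is what makes the single expression handle both the wrapped and the unwrapped case.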
158. [Figure residue: Wireshark packet details for an IPv4/UDP packet. Total Length: 1068; Identification: 0x0001; Flags: 0x02 (Don't Fragment); Fragment offset: 0; Time to live: 16; Protocol: UDP (17); Source: 11.7.33.148; Destination: 11.7.33.149; Src Port: 49153; Dst Port: 1021; Data: 1040 bytes]
4 Deploying Windows Server 2012 and Above with SMB Direct
4.1 Overview
The Server Message Block (SMB) protocol is a network file sharing protocol implemented in Microsoft Windows. The set of message packets that defines a particular version of the protocol is called a dialect. The Microsoft SMB protocol is a client-server implementation and consists of a set of data packets, each containing a request sent by the client or a response sent by the server. The SMB protocol is used on top of the TCP/IP protocol or other network protocols. Using the SMB protocol allows applications to access files or other resources on a remote server, to read, create, and update them. In addition, it enables communication with any server program that is set up to receive an SMB client request.
4.2 Hardware and Software Prerequisites
The following are hardware and software prerequisites
159. ss 172.16.15.6 -ProviderAddress 192.168.20.115 -VirtualSubnetID 5001 -MACAddress 00155D730101 -Rule TranslationMethodEncap
# Add customer route
New-NetVirtualizationCustomerRoute -RoutingDomainID "{11111111-2222-3333-4444-000000005001}" -VirtualSubnetID 5001 -DestinationPrefix "172.16.0.0/16" -NextHop 0.0.0.0 -Metric 255
Step 4: Configure the Provider Address and Route records on Hyper-V Host 2 (mtlae15 only).
$NIC = Get-NetAdapter "Port1"
New-NetVirtualizationProviderAddress -InterfaceIndex $NIC.InterfaceIndex -ProviderAddress 192.168.20.115 -PrefixLength 24
New-NetVirtualizationProviderRoute -InterfaceIndex $NIC.InterfaceIndex -DestinationPrefix "0.0.0.0/0" -NextHop 192.168.20.1
Step 5: Configure the Virtual Subnet ID on the Hyper-V Network Switch Ports for each Virtual Machine on each Hyper-V Host (Host 1 and Host 2).
Run the command below for each VM on the host on which the VM is running, i.e. for VMs mtlae14-005 and mtlae14-006 on host 192.168.20.114, and for VMs mtlae15-005 and mtlae15-006 on host 192.168.20.115.
# mtlae15 only
Get-VMNetworkAdapter -VMName mtlae15-005 | where {$_.MacAddress -eq "00155D730100"} | Set-VMNetworkAdapter -VirtualSubnetID 5001
Get-VMNetworkAdapter -VMName mtlae15-006 | where {$_.MacAddress -eq "00155D730101"} | Set-VMNetworkAdapter -VirtualSubnetID 5001
160. sys
ib_read_bw [-i(b_port) ib_port] [-m(tu) mtu_size] [-s(ize) message_size] [-n iteration_num] [-p(ort) PDT_port] [-b(idirectional)] [-o(uts) outstanding reads] [-a(ll)] [-V(ersion)]
8.4.1.2 ib_read_bw Options
The table below lists the various flags of the command.
Table 33 - ib_read_bw Flags and Options
Flag    Description
-p, --port=<port>    Listens on/connects to port <port> (default 18515)
-d, --ib-dev=<dev>    Uses IB device <device guid> (default: first device found)
-i, --ib-port=<port>    Uses port <port> of IB device (default 1)
-m, --mtu=<mtu>    The MTU size (default 1024)
-o, --outs=<num>    The number of outstanding read/atom (default 4)
-s, --size=<size>    The size of message to exchange (default 65536)
-a, --all    Runs sizes from 2 till 2^23
-t, --tx-depth=<dep>    The size of tx queue (default 100)
-n, --iters=<iters>    The number of exchanges (at least 2, default 1000)
-b, --bidirectional    Measures bidirectional bandwidth (default unidirectional)
-V, --version    Displays version number
-g, --grh    Use GRH with packets (mandatory for RoCE)
8.4.2 ib_read_lat
ib_read_lat calculates the latency of an RDMA read operation of message_size between a pair of machines. One acts as a server and the other as a client. They perform a
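The all-sizes option above is described only by its endpoints, 2 and 2^23. A short sketch of the resulting sweep is given here under the assumption that the message size doubles at every step; the manual does not state the step, and the helper name sweep_sizes is hypothetical.

```python
def sweep_sizes(max_exponent=23):
    """Message sizes in bytes for an all-sizes run, from 2 up to 2**max_exponent.

    Assumes the size doubles at each step; only the endpoints come from the table.
    """
    return [2 ** exponent for exponent in range(1, max_exponent + 1)]


sizes = sweep_sizes()
print(len(sizes), sizes[0], sizes[-1])  # 23 2 8388608
```

Under that assumption a full sweep covers 23 message sizes, the largest being 8 MiB.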
161. t dev>    Uses IB device <device guid> (default: first device found)
-i, --ib-port=<port>    Uses port <port> of IB device (default 1)
-m, --mtu=<mtu>    The MTU size (default 1024)
-c, --connection=<RC/UC>    Connection type RC/UC (default RC)
-s, --size=<size>    The size of message to exchange (default 65536)
-a, --all    Runs sizes from 2 till 2^23
-t, --tx-depth=<dep>    The size of tx queue (default 100)
-n, --iters=<iters>    The number of exchanges (at least 2, default 1000)
-u, --qp-timeout=<timeout>    QP timeout. The timeout value is 4 usec * 2^timeout (default 14)
-S, --sl=<sl>    The service level (default 0)
-x, --gid-index=<index>    Test uses GID with GID index taken from command line (for RDMAoE, index should be 0)
-C, --report-cycles    Reports times in CPU cycle units (default microseconds)
-H, --report-histogram    Print out all results (default print summary only)
-U, --report-unsorted (implies -H)    Print out unsorted results (default sorted)
-V, --version    Displays version number
-F, --CPU-freq    The CPU frequency test. It is active even if the cpufreq_ondemand module is loaded.
-I, --inline_size=<size>    The maximum size of message to be sent in inline mode (default 0)
-R, --rdma_cm    Connects QPs with rdma_cm and runs the test on those QPs
-Z, --com-rdma-cm    Communicates with rdma
162. t follows the comment is dependent on the node type. If it is a switch node, it is followed by the NodeDescription in quotes and the LID of the peer node. If it is a CA or router node, it is followed by the local LID and LMC and then followed by the NodeDescription in quotes and the LID of the peer node. The active link width and speed are then appended to the end of this output line.
Example:
# Topology file: generated on Tue Jun 5 14:15:10 2007
# Max of 3 hops discovered
# Initiated from node 0008f10403960558 port 0008f10403960559
Non-Chassis Nodes
vendid=0x8f1
devid=0x5a06
sysimgguid=0x5442ba00003000
switchguid=0x5442ba00003080(5442ba00003080)
Switch 24 "S-005442ba00003080" # "ISR9024 Voltaire" base port 0 lid 6 lmc 0
[22] "H-0008f10403961354"[1](8f10403961355) # "MT23108 InfiniHost Mellanox Technologies" lid 4 4xSDR
[10] 0008 104004 SUPL # "SW-6IB4 Voltaire" lid 3 4xSDR
[8] "H-0008f10403960558"[2](8f1040396055a) # "MT23108 InfiniHost Mellanox Technologies" lid 14 4xSDR
[6] 0008 104004 SES # "SW-6IB4 Voltaire" lid 3 4xSDR
[12] "H-0008f10403960558"[1](8f10403960559)
When grouping is used, InfiniBand nodes are organized into chassis which are numbered. Nodes which cannot be determined to be in a chassis are displayed as "Non-Chassis Nodes". External ports are also shown on the connectivity lines.
163. t histogram    Print out all results (default print summary only)
-U, --report-unsorted (implies -H)    Print out unsorted results (default sorted)
-V, --version    Displays version number
Table 34 - ib_read_lat Flags and Options
Flag    Description
-g, --grh    Use GRH with packets (mandatory for RoCE)
8.4.3 ib_send_bw
ib_send_bw calculates the BW of SEND between a pair of machines. One acts as a server and the other as a client. The server receives packets from the client, and they both calculate the throughput of the operation. The test supports features such as Bidirectional (in which both sides send and receive at the same time), change of MTU size, tx size, number of iterations, message size, and more. Using the -a flag provides results for all message sizes.
8.4.3.1 ib_send_bw Synopsys
ib_send_bw [-i(b_port) ib_port] [-c(onnection_type) RC/UC/UD] [-m(tu) mtu_size] [-s(ize) message_size] [-t(x-depth) tx_size] [-n iteration_num] [-p(ort) PDT_port] [-b(idirectional)] [-a(ll)] [-V(ersion)]
8.4.3.2 ib_send_bw Options
The table below lists the various flags of the command.
Table 35 - ib_send_bw Flags and Options
Flag    Description
-p, --port=<port>    Listens on/connects to port <port> (default 18515)
-d, --ib-dev=<dev>    Uses IB device <device guid> (default: first device found)
-i, --ib-port=<port>    Uses port <port> of IB device (default 1)
m
164. [Screenshot: Performance Tuning tab with scenario radio buttons (single port traffic, single stream traffic, dual port traffic, multicast traffic) and the Restore Default Settings and Run Tuning buttons]
Single port traffic: improves performance for running single port traffic each time.
Single stream traffic: improves performance for running single stream traffic each time.
Dual port traffic: improves performance for running traffic on both ports simultaneously.
Clicking the Run Tuning button activates the general tuning as explained above and changes several driver registry entries for the current adapter and its sibling device, provided that the sibling is an Ethernet device as well. It also generates a log including the applied changes. Users can view this log to restore the previous values. The log path is: %HOMEDRIVE%\Windows\System32\LogFiles\PerformanceTunning.log
This tuning needs to be performed only once after the installation is completed, and on one adapter only, as long as these entries are not changed directly in the registry or by some other installation or script.
Please note that a reboot may be required for the changes to take effect.
6.1.4.1 Performance Tuning Tool Application
You can also activate the performance tuning through a script called perf_tuning.exe. This script has 4 options, which include the 3 scenarios described above and an additional manual tuning through which you can set the RSS base and number of processors for each Ethernet adapter. The adapters you wish to tune are supplied to the script by their name according to the Network Connec
165. t vSwitch configuration is persistent; there is no need to configure it after each reboot.
New-VMSwitch "VSwMLNX" -NetAdapterName "Port1" -AllowManagementOS $true
# Shut down VMs
Stop-VM -Name "mtlae15-005" -Force -Confirm
Stop-VM -Name "mtlae15-006" -Force -Confirm
# Connect VM to vSwitch (you may have to switch off the VM first; doing this manually also works)
Connect-VMNetworkAdapter -VMName "mtlae14-005" -SwitchName "VSwMLNX"
Add-VMNetworkAdapter -VMName "mtlae15-005" -SwitchName "VSwMLNX" -StaticMacAddress "00155D730100"
Add-VMNetworkAdapter -VMName "mtlae15-006" -SwitchName "VSwMLNX" -StaticMacAddress "00155D730101"
The commands from Steps 2-4 are not persistent. It is suggested to create a script that runs after each OS reboot.
Step 2: Configure a Subnet Locator and Route records on each Hyper-V Host (Host 1 and Host 2): mtlae14 & mtlae15.
New-NetVirtualizationLookupRecord -CustomerAddress 172.16.14.5 -ProviderAddress 192.168.20.114 -VirtualSubnetID 5001 -MACAddress 00155D720100 -Rule TranslationMethodEncap
New-NetVirtualizationLookupRecord -CustomerAddress 172.16.14.6 -ProviderAddress 192.168.20.114 -VirtualSubnetID 5001 -MACAddress 00155D720101 -Rule TranslationMethodEncap
New-NetVirtualizationLookupRecord -CustomerAddress 172.16.15.5 -ProviderAddress 192.168.20.115 -VirtualSubnetID 5001 -MACAddress 00155D730100 -Rule TranslationMethodEncap
New-NetVirtualizationLookupRecord -CustomerAddre
166. that sends and receives TCP data between two or more endpoints. It is a Winsock-based port of the ttcp tool that measures networking performance (bytes/second). To download the latest version of NTttcp (5.28), please refer to the Microsoft website following the link below:
http://gallery.technet.microsoft.com/NTttcp-Version-528-Now-f8b12769
This tool should be run from cmd only.
8.4.19.1 NTttcp Synopsys
Server: ntttcp_x64.exe -r -t 15 -m 16,*,<interface IP>
Client: ntttcp_x64.exe -s -t 15 -m 16,*,<same address as above>
8.4.19.2 NTttcp Options
The table below lists the various flags of the command.
Table 51 - NTttcp Options
Flags    Description
-s    Works as a sender
-r    Works as a receiver
-l    <Length of buffer> (default TCP: 64K, UDP: 128)
-n    <Number of buffers> (default 20K)
-p    <port base> (default 5001)
-sp    Synchronizes data ports; if used, -p should be the same on every instance
-a    <outstanding I/O> (default 2)
-x    <PacketArray size> (default 1)
-rb    <Receive buffer size> (default 64K)
-sb    <Send buffer size> (default 8K)
-u    UDP send/recv
-w    WSARecv/WSASend
-d    Verifies Flag
-t    <Runtime> in seconds
-cd    <Cool-down> in seconds
-wu    <Warm-up> in seconds
-nic    <NIC IP>. Use NIC with <NIC IP> for sending data (sender only)
-m    <mapping> [mapping]
167. timeout] [-S(l) sl type] [-x(gid index)] [-e(vents) use events] [-N(o_peak) use peak calc] [-F(CPU freq) fail] [-g num of qps in mcast group] [-M(cast) gid] [-b(idirectional)] [-a(ll)] [-V(ersion)]
8.4.9.2 ibv_send_bw Options
The table below lists the various flags of the command.
Table 41 - ibv_send_bw Flags and Options
Flag    Description
-p, --port=<port>    Listens on/connects to port <port> (default 18515)
-d, --ib-dev=<dev>    Uses IB device <device guid> (default: first device found)
-i, --ib-port=<port>    Uses port <port> of IB device (default 1)
-m, --mtu=<mtu>    The MTU size (default 1024)
-c, --connection=<RC/UC/UD>    Connection type RC/UC/UD (default RC)
-s, --size=<size>    The size of message to exchange (default 65536)
-a, --all    Runs sizes from 2 till 2^23
-t, --tx-depth=<dep>    The size of tx queue (default 100)
-n, --iters=<iters>    The number of exchanges (at least 2, default 1000)
-u, --qp-timeout=<timeout>    QP timeout. The timeout value is 4 usec * 2^timeout (default 14)
-S, --sl=<sl>    The service level (default 0)
-x, --gid-index=<index>    Test uses GID with GID index taken from command line (for RDMAoE, index should be 0)
-b, --bidirectional    Measures bidirectional bandwidth (default unidirect
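The QP timeout option above is an exponent rather than a duration: the table gives the timeout as 4 usec * 2^timeout, with a default of 14. A minimal sketch of that conversion follows. The function name qp_timeout_seconds is hypothetical, and the 4 usec base is taken as quoted in the table.

```python
def qp_timeout_seconds(timeout=14, base_usec=4.0):
    """Convert the QP timeout exponent into seconds.

    Uses the formula quoted in the table: base_usec * 2**timeout,
    with base_usec = 4 as stated there.
    """
    return base_usec * 1e-6 * (2 ** timeout)


for exponent in (10, 14, 18):
    print(exponent, round(qp_timeout_seconds(exponent) * 1000, 3), "ms")
```

With the default exponent of 14 the formula gives roughly 65.5 ms, and each increment of the exponent doubles the timeout.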
168. tion 8.4.19 "NTttcp" on page 122
Section 10 "Troubleshooting" on page 125
Added Appendix A "Windows MPI (MS-MPI)" on page 130
June 10, 2013
Updated the following sections:
Section 2.2 "Downloading Mellanox Firmware Tools" on page 14
Section 8 "InfiniBand Fabric" on page 62
Section 10 "Troubleshooting" on page 125
Section 11 "Documentation" on page 129
Section "Options" on page 49
Added the following sections:
perf_tuning Section "Synopsys" on page 49
Section 2.3.1 "Upgrading Firmware Manually" on page 16
Section 3.7.2 "RoCE Configuration" on page 30
Section 6.4 "Adapter Proprietary Performance Counters" on page 55
Rev 4.2, October 20, 2012
Added the following sections:
Section 4 "Deploying Windows Server 2012 and Above with SMB Direct" on page 39 and its subsections
Section 3.2 "Header Data Split" on page 18
Section 8.2 "part_man - Virtual IPoIB Port Creation Utility" on page 62
Updated Section 6 "Performance Tuning" on page 46
Rev 3.2.0, July 23, 2012
No changes
Rev 3.1.0, May 21, 2012
Added section Tuning the IPoIB Network Adapter
Added section Tuning the Ethernet Network Adapter
Added section Performance tuning tool application
Removed section Tuning the Network Adapter
Removed section part_man
Removed section ibdiagnet
Tabl
169. tions
Synopsys
perf_tuning.exe -s -c1 <first connection name> -c2 <second connection name>
perf_tuning.exe -d -c1 <first connection name> -c2 <second connection name>
perf_tuning.exe -f -c1 <first connection name> -c2 <second connection name>
perf_tuning.exe -m -c1 <first connection name> -b <base RSS processor number> -n <number of RSS processors>
perf_tuning -st -c1 <first connection name> -c2 <second connection name>
Options
Flag    Description
-s    Single port traffic scenario. This option can be followed by one or two connection names. The tuning will restore the default settings on the second connection and will be performed on the first connection.
    This option automatically sets:
    SendCompletionMethod = 0
    RecvCompletionMethod = 2
    ReceiveBuffers = 1024
    In operating systems that support NDIS 6.3: RssProfile = 4
    Additionally, this option chooses the best processors to assign to:
    DefaultRecvRingProcessor
    TxInterruptProcessor
    TxForwardingProcessor
    In operating systems that support NDIS 6.2: RssBaseProcNumber, MaxRssProcessors
    In operating systems that support NDIS 6.3: NumRSSQueues, RssMaxProcNumber
-d    Dual port traffic scenario. This option must be followed by two connection names. The tuning in this case is codependent.
    This option automatically sets:
    SendCompletionMethod = 0
    RecvCompletionMethod = 2
    ReceiveBuffers = 1024
    In Operatin
170. tocol. One possible reason for discarding such a packet could be to free up buffer space.
6.4.1.2 Proprietary Mellanox Adapter Diagnostics Counters
The proprietary Mellanox adapter diagnostics counter set consists of the NIC diagnostics. These counters collect information from ConnectX-3 and ConnectX-3 Pro firmware flows.
Table 9 - Mellanox Adapter Diagnostics Counters
Mellanox Adapter Diagnostics Counters    Description
Requester length errors    Number of local length errors when the local machine generates outbound traffic.
Responder length errors    Number of local length errors when the local machine receives inbound traffic.
Requester QP operation errors    Number of local QP operation errors when the local machine generates outbound traffic.
Responder QP operation errors    Number of local QP operation errors when the local machine receives inbound traffic.
Requester protection errors    Number of local protection errors when the local machine generates outbound traffic.
Responder protection errors    Number of local protection errors when the local machine receives inbound traffic.
Requester CQE errors    Number of local CQEs with errors when the local machine generates outbound traffic.
Responder CQE errors    Number
171. [Screenshot: Device Manager, Network adapters list showing the Mellanox ConnectX-3 Ethernet Adapter entries]
Step 2: Right-click a Mellanox ConnectX 10Gb Ethernet adapter (under the "Network adapters" list) and left-click Properties. Select the LBFO tab from the Properties window.
It is not recommended to open the Properties window of more than one adapter simultaneously.
The LBFO dialog enables creating, modifying, or removing a bundle. Only Mellanox Technologies adapters can be part of the LBFO.
To create a new bundle, perform the following:
Step 1: Click Create.
Step 2: Enter a unique bundle name.
Step 3: Select a bundle type.
Step 4: Select the adapters to be included in the bundle (that have not been associated with a VLAN).
Step 5: [Optional] Select Primary Adapter. The primary adapter is used in an active-passive scenario for data transfer when a link disconnects. In such a scenario, the system uses one of the other interfaces. When the primary link comes up, the LBFO interface returns to transferring data using the primary interface. If the primary adapter is not selected, the primary interface is selected randomly.
Step 6: [Optional] Failback to Primary.
Step 7: Check the checkbox.
Mellanox ConnectX 3 Eth
172. uggestion and will cause low performance. To disable RSS on the adapter, run the following command: netsh int tcp set global rss no dynamic balancing
Issue 4: The Ethernet driver fails to start. In the Event log, under the mlx4_bus source, the following error message appears: RUN FW command failed with error 22
Suggestion: The error message indicates that the wrong firmware image has been programmed on the adapter card. See Section 2 "Firmware Upgrade" on page 15.
Issue 5: The Ethernet driver fails to start. A yellow sign appears near the Mellanox ConnectX 10Gb Ethernet Adapter in the Device Manager display.
Suggestion: This can happen due to a hardware error. Try to disable and re-enable the Mellanox ConnectX Adapter from the Device Manager display.
Issue 6: No connectivity to a Fault Tolerance bundle while using network capture tools (e.g. Wireshark).
Suggestion: This can happen if the network capture tool captures the network traffic of the non-active adapter in the bundle. This is not allowed, since the tool sets the packet filter to promiscuous, thus causing traffic to be transferred on multiple interfaces. Close the network capture tool on the physical adapter card, and set it on the LBFO interface instead.
Issue 7: No Ethernet connectivity on 10Gb adapters after activating Performance Tuning (part of the installation).
Suggestion: This can happen due to adding a TcpWindowS
173. utputting and learning about other fabrics or a previous state of a fabric.
--diff <filename>    Loads cached ibnetdiscover data and does a diff comparison to the current network or another cache. A special diff output for ibnetdiscover output will be displayed, showing differences between the old and current fabric. By default, the following are compared for differences: switches, channel adapters, routers, and port connections.
--diffcheck <key(s)>    Specifies what diff checks should be done in the --diff option above. Comma-separate multiple diff check key(s). The available diff checks are: sw = switches, ca = channel adapters, router = routers, port = port connections, lid = lids, nodedesc = node descriptions. Note that port, lid, and nodedesc are checked only for the node types that are specified (e.g. sw, ca, router). If port is specified alongside lid or nodedesc, remote port lids and node descriptions will also be compared.
-p, --ports    Obtains a ports report, which is a list of connected ports with relevant information (like LID, port number, GUID, width, speed, and NodeDescription).
-m, --max_hops    Reports max hops discovered.
-d, --debug (-ddd or -d -d -d)    Raises the IB debugging level.
-e, --errors    Shows send and receive errors (timeouts and others).
-h, --help    Shows the usage message.
-v, --verbose (-vv or -v -v -v)    Increases the application verbosity level.
-V, --version    Shows the version info.
outstanding smps o
174. val    Specifies the number of outstanding SMPs which should be issued during the scan.
-u, --usage    Usage message.
-C, --Ca <ca_name>    Uses the specified ca_name.
-P, --Port <ca_port>    Uses the specified ca_port.
-t, --timeout <timeout_ms>    Overrides the default timeout for the solicited MADs.
-f, --full    Shows full information (ports' speed and width).
-s, --show    Shows more information.
8.3.9.3 Topology File Format
The topology file format is largely intuitive. Most identifiers are given textual names like vendor ID (vendid), device ID (devid), and GUIDs of various types (sysimgguid, caguid, switchguid, etc.). PortGUIDs are shown in parentheses (). For switches, this is shown on the switchguid line. For CA and router ports, it is shown on the connectivity lines. The IB node is identified, followed by the number of ports and the node GUID. On the right of this line is a comment (#) followed by the NodeDescription in quotes. If the node is a switch, this line also contains whether switch port 0 is base or enhanced, and the LID and LMC of port 0. Subsequent lines pertaining to this node show the connectivity. On the left is the port number of the current node. On the right is the peer node (node at other end of link). It is identified in quotes with nodetype followed by - followed by NodeGUID with the port number in square brackets. Further on the right is a comment (#). Wha
