
Mellanox OFED Linux User's Manual


Contents

2. 4.7 Atomic Operations
   4.7.1 Enhanced Atomic Operations
   4.8 Ethernet Tunneling Over IPoIB Driver (eIPoIB)
   4.8.1 Enabling the eIPoIB
   4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver
   4.8.3 VLAN Configuration Over an eIPoIB
   4.8.4 Setting Performance Tuning
   4.9 Contiguous Pages
   4.10 Shared Memory Region
   4.11 XRC - eXtended Reliable Connected Transport Service for InfiniBand
   4.12 Flow Steering
   4.12.1 Enable/Disable Flow Steering
   4.12.2 Flow Domains and
   4.13 Single Root IO Virtualization (SR-IOV)
   4.13.1 System Requirements
   Mellanox Technologies, Rev 2.0-3.0.0
   4.13.2 Setting Up SR-IOV
   4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup
   4.13.4 Assigning a Virtual Function to a Virtual Machine
   4.13.5 Uninstal
3. Table 21: Congestion Control Manager CA Options File
   Table 22: Congestion Control Manager CC MGR Options File
   Table 23: ibdiagnet (of ibutils2) Output Files
   Table 24: ibdiagnet (of ibutils) Output Files
   Table 25: ibdiagpath Output Files
   Table 26: ibv_devinfo Flags and Options
   Table 27: ibstatus Flags and Options
   Table 28: ibportstate Flags and Options
   Table 29: ibportstate Flags and Options
   Table 30: smpquery Flags and Options
   Table 31: perfquery Flags and Options
   Table 32: ibcheckerrs Flags and Options
   Table 33: mstflint Switches
   Table 34: mstflint Commands

   Document Revision History
   Table 1: Document Revision History
   Release 2.0-3.0.0, October 2013: Updated the following sections: Appendix E Lu
4. BIOS Option: Values
   Memory
     Memory speed: Max performance
     Memory channel mode: Independent
     Node Interleaving: Disabled / NUMA
     Channel Interleaving: Enabled
     Thermal Mode: Performance

   7.2 Performance Tuning for Linux
   You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems. The results are significantly dependent on the CPU and chipset efficiency.

   7.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
   The following changes are recommended for improving IPv4 traffic performance:
   Disable the TCP timestamps option for better CPU utilization:
     sysctl -w net.ipv4.tcp_timestamps=0
   Enable the TCP selective acks option for better throughput:
     sysctl -w net.ipv4.tcp_sack=1
   Increase the maximum length of processor input queues:
     sysctl -w net.core.netdev_max_backlog=250000
   Increase the TCP maximum and default buffer sizes using setsockopt():
     sysctl -w net.core.rmem_max=4194304
     sysctl -w net.core.wmem_max=4194304
     sysctl -w net.core.rmem_default=4194304
     sysctl -w net.core.wmem_default=4194304
     sysctl -w net.core.optmem_max=4194304
   Increase memory thresholds to prevent packet dropping:
     sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
     sysctl -w net.ipv4.tcp_wmem="4096 65536
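The IPv4 settings listed in Section 7.2.1 can also be collected in one place before they are applied. The following is a minimal sketch, not part of the original manual: the function name print_ipv4_tuning is hypothetical, and it only prints the values in sysctl.conf format so that they can be reviewed. The tcp_wmem setting is left out because its last value is cut off in the text above.

```shell
#!/bin/sh
# Print the IPv4 tuning values from Section 7.2.1 in sysctl.conf format.
# Nothing is applied to the running system; the output is for review.
# print_ipv4_tuning is a hypothetical helper name, not an OFED tool.
print_ipv4_tuning() {
    cat <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 4194304
net.core.wmem_max = 4194304
net.core.rmem_default = 4194304
net.core.wmem_default = 4194304
net.core.optmem_max = 4194304
net.ipv4.tcp_rmem = 4096 87380 4194304
EOF
}
```

One possible use is to redirect the output to a file of your choice and load it with sysctl -p <file> once the values have been checked against the target system.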
5. [Installation progress output. Recoverable package names: dapl-devel-static, dapl-utils, perftest, mstflint]
6. B.1 Prerequisites and Installation
   B.2 How to Run
   B.3 How to Unload/Shutdown
   Appendix C: mlx4 Module Parameters
   C.1 mlx4_ib Parameters
   C.2 mlx4_core Parameters
   C.3 mlx4_en Parameters
   Appendix D: mlx5 Module Parameters
   Appendix E: Lustre Compilation over MLNX_OFED

   List of Figures
   Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards
   Figure 2: Consolidation Over InfiniBand
   Figure 3: An Example of a Virtual Network
   Figure 4: QoS Manager
   Figure 5: Example QoS Deployment on InfiniBand Subnet

   List of Tables
   Table 1: Document Revision History
   Table 2: Abbreviations and Acronyms
   Table 3: Glossary
7. Possible Value: Description
     ANON: Use current pages (ANON small ones). Default value.
     HUGE: Force huge pages.
     CONTIG: Force contiguous pages.
     PREFER_CONTIG: Try contiguous, fall back to ANON small pages.
     PREFER_HUGE: Try huge, fall back to ANON small pages.
     ALL: Try huge, fall back to contiguous; if that fails, fall back to ANON small pages.
   Note: Values are NOT case sensitive.

   Usage:
   The application calls the ibv_reg_mr API with the IBV_ACCESS_ALLOCATE_MR bit turned on and the input address set to NULL. Upon success, the address field of the struct ibv_mr holds the address of the allocated memory block. This block is freed implicitly when ibv_dereg_mr is called.

   The following environment variables can be used to control error cases/contiguity:

   Table 4: Parameters Used to Control Error Cases/Contiguity
   Parameter: Description
     MLX_MR_ALLOC_TYPE: Configures the allocator type.
       ALL (Default): Uses all possible allocators and selects the most efficient allocator.
       ANON: Enables the usage of anonymous pages and disables the allocator.
       CONTIG: Forces the usage of the contiguous pages allocator. If contiguous pages are not available, the allocation fails.
     MLX_MR_MAX_LOG2_CONTIG_BSIZE: Sets the maximum contiguous block size order. Values: 12-23. Default: 23.
     MLX_MR_MIN_LOG2_CONTIG_BSIZE: Sets the minimum contiguous block size order.
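Because a mistyped allocator setting is easy to miss, it may help to check the environment before launching an application. The sketch below is an illustration only: check_mr_alloc_env is a hypothetical helper, it accepts just the three MLX_MR_ALLOC_TYPE values and the 12-23 range for the maximum block size order that appear in Table 4, and it assumes exact upper-case spelling.

```shell
#!/bin/sh
# Check an allocator type and a maximum block size order against the
# values listed in Table 4. Returns 0 if both are acceptable, 1 otherwise.
# check_mr_alloc_env is a hypothetical helper, not part of MLNX_OFED.
check_mr_alloc_env() {
    alloc_type=${1:-ALL}   # default taken from Table 4
    max_order=${2:-23}     # default taken from Table 4
    case "$alloc_type" in
        ALL|ANON|CONTIG) ;;
        *) echo "unknown MLX_MR_ALLOC_TYPE: $alloc_type" >&2; return 1 ;;
    esac
    case "$max_order" in
        *[!0-9]*) echo "not a number: $max_order" >&2; return 1 ;;
    esac
    if [ "$max_order" -lt 12 ] || [ "$max_order" -gt 23 ]; then
        echo "MLX_MR_MAX_LOG2_CONTIG_BSIZE outside 12-23: $max_order" >&2
        return 1
    fi
    return 0
}
```

A typical call would pass the current environment, for example check_mr_alloc_env "$MLX_MR_ALLOC_TYPE" "$MLX_MR_MAX_LOG2_CONTIG_BSIZE", and start the application only if it succeeds.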
8. 5.0 Gbps

   > ibportstate -C mthca0 -D 0 1
   PortInfo:
   # Port info: DR path slid 65535; dlid 65535; 0 port 1
   LinkState: Down
   PhysLinkState: Polling
   LinkWidthSupported: 1X or 4X
   LinkWidthEnabled: 1X or 4X
   LinkWidthActive: 4X
   LinkSpeedSupported: 2.5 Gbps
   LinkSpeedEnabled: 2.5 Gbps
   LinkSpeedActive: 2.5 Gbps

   3. Change the speed of a port.
   First query for current configuration:
   > ibportstate -C mlx4_0 -D 0 1
   PortInfo:
   # Port info: DR path slid 65535; dlid 65535; 0 port 1
   LinkState: Initialize
   PhysLinkState: LinkUp
   LinkWidthSupported: 1X or 4X
   LinkWidthEnabled: 1X or 4X
   LinkWidthActive: 4X
   LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
   LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps
   LinkSpeedActive: 5.0 Gbps

   Now change the enabled link speed:
   > ibportstate -C mlx4_0 -D 0 1 speed 2
   ibportstate -C mlx4_0 -D 0 1 speed 2
   Initial PortInfo:
   # Port info: DR path slid 65535; dlid 65535; 0 port 1
   LinkSpeedEnabled: 2.5 Gbps
   After PortInfo set:
   # Port info: DR path slid 65535; dlid 65535; 0 port 1
   LinkSpeedEnabled: 5.0 Gbps (IBA extension)

   Show the new configuration:
   > ibportstate -C mlx4_0 -D 0 1
   PortInfo:
   # Port info: DR path slid 65535; dlid
9. 5.3.3 Tuning MXM Settings
   5.3.4 Configuring Multi-Rail Support
   5.3.5 Configuring MXM over the Ethernet Fabric
   5.4 Fabric Collective Accelerator
   5.5 ScalableUPC
   5.5.1 Installing ScalableUPC
   5.5.2 Runtime Parameters
   5.5.3 Various Executable Examples
   Chapter 6: Working With VPI
   6.1 Port Type
   6.2 Auto Sensing
   6.2.1 Enabling Auto Sensing
   Chapter 7: Performance
   7.1 General System
   7.1.1 PCI Express (PCIe) Capabilities
   7.1.2 Memory Configuration
   7.1.3 Recommended BIOS Settings
   7.2 Performance Tuning for Linux
   7.2.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
   7.2.2 Tuning the Network Adapter for Improved IPv6 Traffic
10. --no_part_enforce, -N (DEPRECATED)
      This option disables partition enforcement on switch external ports.
    --part_enforce [both | in | out | off]
      This option indicates the partition enforcement type (for switches). Enforcement type can be outbound only (out), inbound only (in), both, or disabled (off). Default is both.
    --allow_both_pkeys, -W
      This option indicates whether both full and limited membership on the same partition can be configured in the PKeyTable. Default is not to allow both pkeys.
    --qos, -Q
      This option enables QoS setup.
    --qos_policy_file, -Y <QoS policy file>
      This option defines the optional QoS policy file. The default name is /etc/opensm/qos-policy.conf.
    --congestion_control (EXPERIMENTAL)
      This option enables congestion control configuration.
    --cc_key <key> (EXPERIMENTAL)
      This option configures the CCkey to use when configuring congestion control.
    --stay_on_fatal, -y
      This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated guids or 12x link with lane reversal badly configured. By default, the SM will exit on these errors.
    --daemon, -B
      Run in daemon mode. OpenSM will run in the background.
    --inactive, -i
      Start SM in inactive rather than normal init SM state.
    --perfmgr
      Start with PerfMgr enabled.
    --perfmgr_sweep_time_s <sec>
      PerfMgr sweep interval in seconds.
    --prefix_routes_file <path to file>
      This op
11. [Installation progress output. Recoverable package names: libibcm, libibcm-devel, libibumad, libibumad-devel, libibumad-static]
12. Valid port types: 1-ib, 2-eth, 3-auto (string)
    log maximum number of QPs per HCA (default: 19) (int)
    log maximum number of SRQs per HCA (default: 16) (int)
    log number of RDMARC buffers per QP (default: 4) (int)
    log maximum number of CQs per HCA (default: 16) (int)
    log maximum number of multicast groups per HCA (default: 13) (int)
    log maximum number of memory protection table entries per HCA (default: 19) (int)
    log maximum number of memory translation table segments per HCA (default: max(20, 2*MTTs for register all of the host memory limited to 30)) (int)
    Enable Quality of Service support in the HCA (default: off) (bool)
    Reset device on internal errors if non-zero (default: 1, in SRIOV mode default is 0) (int)
    Threshold for using inline data (int). Default and max value is 104 bytes. Saves PCI read operation transaction; a packet smaller than the threshold size will be copied to the hw buffer directly.
    Enable RSS for incoming UDP traffic (uint). On by default. Once disabled, no RSS for incoming UDP traffic will be done.
    Priority based Flow Control policy on TX[7:0]. Per priority bit mask (uint)
    Priority based Flow Control policy on RX[7:0]. Per priority bit mask (uint)

    Appendix D: mlx5 Module Parameters
    The mlx5_ib module supports a single parameter used to select the profile which defines the number of resources supported. The parameter name for selecting the pro
13. In case sensing the port protocol fails, the port will be configured as an InfiniBand port.
    For ConnectX:
      Mellanox ConnectX FlexBoot v3.3.400
      iPXE 1.0.0+ -- Open Source Network Boot Firmware
      net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
        [Link:down, TX:0 TXE:0 RX:0 RXE:0]
        [Link status: The socket is not connected]
      Waiting for link-up on net0... ok

    After configuring the IB/ETH port, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.
    For ConnectX (InfiniBand):
      Mellanox ConnectX FlexBoot v3.3.400
      iPXE 1.0.0+ -- Open Source Network Boot Firmware
      net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
        [Link:down, TX:0 TXE:0 RX:0 RXE:0]
        [Link status: The socket is not connected]
      Waiting for link-up on net0... ok
      DHCP (net0 02:02:c9:0c:78:11)... ok
      net0: 11.3.12.2/255.255.255.0
      Next server: 11.3.12.121
      Filename: pxelinux.0
      Root path: /tftpboot/
      tftp://11.3.12.121/pxelinux.0

    Next, FlexBoot attempts to boot as directed by the DHCP server.

    A.8 Command Line Interface (CLI)
    A.8.1 Invoking the CLI
    When the boot process begins, the computer starts its Power On Self Test (POST) sequence. Shortly after completion of the POST, the user will be prompted to press CTRL-B to invoke Mellanox FlexBoot CLI. The user has few seconds to press CTRL-B before the message disa
14. PTP v2/UDP, Sync packet: HWTSTAMP_FILTER_PTP_V2_L4_SYNC
    PTP v2/UDP, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ
    802.AS1, Ethernet, any kind of event packet: HWTSTAMP_FILTER_PTP_V2_L2_EVENT
    802.AS1, Ethernet, Sync packet: HWTSTAMP_FILTER_PTP_V2_L2_SYNC
    802.AS1, Ethernet, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ
    PTP v2/802.AS1, any layer, any kind of event packet: HWTSTAMP_FILTER_PTP_V2_EVENT
    PTP v2/802.AS1, any layer, Sync packet: HWTSTAMP_FILTER_PTP_V2_SYNC
    PTP v2/802.AS1, any layer, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_DELAY_REQ

    Note: for receive side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.

    4.6.2 Getting Time Stamping
    Once time stamping is enabled, the time stamp is placed in the socket Ancillary data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached. It can be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the original outgoing packet data including all headers prepended down to and including the link layer, the scm_timestamping control message and a sock_extended_err control message with ee
15. [Installation progress output. Recoverable package names: libcxgb4-devel, libnes, libnes-devel-static, libipathverbs, libipathverbs-devel, libibcm]
16. Applications use the RDMA API to transmit using QPs.
    Raw Ethernet QP: Applications use the VERBs API to transmit using a Raw Ethernet QP.

    4.5.3 Plain Ethernet Quality of Service Mapping
    Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver. The following is the Plain Ethernet QoS mapping flow:
    1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
    2. ToS is translated into the sk_prio using a fixed translation:
         TOS 0 <=> sk_prio 0
         TOS 8 <=> sk_prio 2
         TOS 24 <=> sk_prio 4
         TOS 16 <=> sk_prio 6
    3. The Socket Priority is mapped to the UP:
       If the underlying device is a VLAN device, egress_map is used, controlled by the vconfig command. This is per VLAN mapping.
       If the underlying device is not a VLAN device, the tc command is used. In this case, even though the tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio to UP mapping.
       Mapping the sk_prio to the UP is done by using tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7
    4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.

    of the socket. In this case, the ToS to sk_prio fixed mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS
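The fixed ToS translation in step 2 is small enough to write down as a lookup. This is a sketch for illustration only; tos_to_sk_prio is a hypothetical helper name, and it covers just the four ToS values listed above, rejecting anything else.

```shell
#!/bin/sh
# Map a ToS value to the sk_prio given by the fixed translation above.
# tos_to_sk_prio is a hypothetical helper, not a tool shipped with OFED.
tos_to_sk_prio() {
    case "$1" in
        0)  echo 0 ;;
        8)  echo 2 ;;
        24) echo 4 ;;
        16) echo 6 ;;
        *)  echo "no fixed mapping listed for ToS $1" >&2; return 1 ;;
    esac
}
```

For example, tos_to_sk_prio 24 prints 4, which is the sk_prio that the later steps then map to a UP.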
17. The mlnx_add_kernel_support.sh script can be executed directly from the mlnxofedinstall script. For further information, please see the '--add-kernel-support' option below.

    2.3.2 Installation Script
    Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure".

    Usage:
      ./mnt/mlnxofedinstall [OPTIONS]

    Options:
      -c|--config <packages config_file>
          Example of the configuration file can be found under docs
      -n|--net <network config file>
          Example of the network configuration file can be found under docs
      -k|--kernel-version <kernel version>
          Use provided kernel version instead of 'uname -r'
      -p|--print-available
          Print available packages for current platform and create corresponding ofed.conf file
      --without-32bit
          Skip 32-bit libraries installation
      --without-depcheck
          Skip Distro's libraries check
      --without-fw-update
          Skip firmware update
      --fw-update-only
          Update firmware. Skip driver installation
      --force-fw-update
          Force firmware update
      --force
          Force installation
      --all|--hpc|--basic|--msm
          Install all, hpc, basic or Mellanox Subnet manager packages correspondingly
      --vma
          Install packages required by VMA to support
      --vma-eth
          Install packages required by VMA to work over Ethernet
      --with-vma
          Set confi
18. [root@selene ~]# mstflint -dev <PCI Device> dc

    Verify that the following fields appear in the [HCA] section (see notes 1 and 2):
      [HCA]
      num_pfs = 1
      total_vfs = 5
      sriov_en = true

    HCA parameters can be configured during firmware update using the mlnxofedinstall script and running the '--enable-sriov' and '--total-vfs <0-63>' installation parameters.

    1. If the fields in the example above do not appear in the [HCA] section, SR-IOV is not supported in the INI in use.
    2. If SR-IOV is supported but not enabled, it is sufficient to set "sriov_en = true" in the INI.

    If the current firmware version is the same as the one provided with MLNX_OFED, run the script in combination with the '--force-fw-update' parameter.
    This configuration option is supported only in HCAs whose configuration file (INI) is included in MLNX_OFED.

    Parameter: Recommended Value
      num_pfs: 1 (Note: This field is optional and might not always appear.)
      total_vfs: 63
      sriov_en: true

    If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.com

    Step 7: Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist, otherwise delete its contents.
    Step 8: Insert an option line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical funct
19. Arbitration tables on various nodes in the fabric. However, this is not supported in OFED: the section is parsed and ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (by default /var/cache/opensm/opensm.opts).

      end-qos-setup

      qos-levels
          # Having a QoS Level named "DEFAULT" is a must - it is applied to
          # PR/MPR requests that didn't match any of the matching rules.
          qos-level
              name: DEFAULT
              use: default QoS Level
              sl: 0
          end-qos-level

          # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
          qos-level
              name: WholeSet

    8.6.6 Simple QoS Policy - Details and Examples
    Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.
    Match rules include:
      Default match rule that is applied to PR/MPR query that didn't match any of the other match rules
      SDP
      SDP application with a specific target TCP/IP port range
      SRP with a specific target IB port GUID
      RDS
      IPoIB with a default PKey
      IPoIB with a specific PKey
      Any ULP/application with a specific Service ID in the PR/MPR query
      Any ULP/application with a specific PKey in the PR/MPR query
      Any ULP/application with a specific ta
20. Default: 0x100, as in rev 16A of the specification. (In rev 10 the default was 0xff00.)
    initiator_ext: Please refer to Section 9 (Multiple Connections).

    To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
      Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
      Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.

    4.1.2.3 SRP Tools - ibsrpdm and srp_daemon
    To assist in performing the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
      Detect targets on the fabric reachable by the Initiator (for Step 1)
      Output target attributes in a format suitable for use in the above "echo" command (Step 2)
    The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages.
    Below, several usage scenarios for these utilities are presented.

    ibsrpdm
    ibsrpdm is used for the following tasks:
    1. Detecting reachable targets
       a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:
            ibsrpdm
          This command will output information on each SRP Targ
21. [Screenshot: Add Hardware dialog with the hardware type set to Physical Host Device]

    Step 4: Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).
    Step 5: If the Virtual Machine is up, reboot it, otherwise start it.
    Step 6: Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
              lspci | grep Mellanox
              00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
    Step 7: Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly, therefore it is not necessary to add it.

    4.13.5 Uninstalling SR-IOV Driver
    To uninstall the SR-IOV driver, perform the following:
    Step 1: For Hypervisors, detach all the Virtual Functions (VF) from all the Virtual Machines (VM) or stop the Virtual Machines that use the Virtual Functions.
            Please be aware, stopping the driver when there are VMs that use the VFs will cause the machine to hang.
    Step 2: Run the script below. Please be aware, uninstalling the driver deletes the entire driver's file, but does not unload the driver.
              [root@sw1022 ~]# /usr/sbin/ofed_uninstall.sh
              This program will uninstall all OFED packages on your machine.
              Do you want to
22. Values: 12-23. Default: 12.

    4.10 Shared Memory Region
    Shared Memory Region is only applicable to the mlx4 driver.
    Shared Memory Region (MR) enables sharing MR among applications by implementing the "Register Shared MR" verb, which is part of the IB spec.
    Sharing MR involves the following steps:
    Step 1: Request to create a shared MR
      The application sends a request via the ibv_reg_mr API to create a shared MR. The application supplies the allowed sharing access to that MR and, if the MR was created successfully, a unique MR ID is returned as part of the struct ibv_mr, which can be used by other applications to register with that MR.
      The underlying physical pages must not be Least Recently Used (LRU) or Anonymous. To disable that, you need to turn on the IBV_ACCESS_ALLOCATE_MR bit as part of the sharing bits.
      Usage:
        Turn on, via ibv_reg_mr, one or more of the sharing access bits. The sharing bits are part of the ibv_reg_mr man page.
        Turn on the IBV_ACCESS_ALLOCATE_MR bit.
    Step 2: Request to register to a shared MR
      A new verb called ibv_reg_shared_mr is added to enable sharing an MR. To use this verb, the application supplies the MR ID that it wants to register for and the desired access mode to that MR. The desired access is validated against its given permissions and, upon successful creation, the physical pages of the original MR a
23. Socket applications can use setsockopt (SK_PRIO, value) to directly set the sk_prio

    In case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.

    4.5.4 RoCE Quality of Service Mapping
    Applications use the RDMA-CM API to create and use QPs. The following is the RoCE QoS mapping flow:
    1. The application sets the ToS of the QP using the rdma_set_option option (RDMA_OPTION_ID_TOS, value).
    2. ToS is translated into the Socket Priority (sk_prio) using a fixed translation:
         TOS 0 <=> sk_prio 0
         TOS 8 <=> sk_prio 2
         TOS 24 <=> sk_prio 4
         TOS 16 <=> sk_prio 6
    3. The Socket Priority is mapped to the User Priority (UP) using the tc command. In case of a VLAN device, the parent real device is used for the purpose of this mapping.
    4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
    With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.

    4.5.5 Raw Ethernet QP Quality of Service Mapping
    Applications open a Raw Ethernet QP using VERBs directly. The following is the Raw Ethernet QP QoS mapping flow:
    1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP:
         Sets qp_attrs.ah_attrs.sl = up
         Calls modify_qp with IB_QP_AV set in the mask
    2. The UP is mapped to the TC as configured by the mlnx
24. mlnx_add_kernel_support.sh -m|--mlnx_ofed <path to MLNX_OFED directory> [--make-iso|--make-tgz]
      [--make-iso]                                    Create MLNX_OFED ISO image.
      [--make-tgz]                                    Create MLNX_OFED tarball. (Default)
      [-t|--tmpdir <local work dir>]
      [--kmp]                                         Enable KMP format if supported.
      [-k|--kernel] <kernel version>                  Kernel version to use.
      [-s|--kernel-sources] <path to the kernel sources>   Path to kernel headers.
      [-v|--verbose]
      [--name]                                        Name of the package to be created.
      [-y|--yes]                                      Answer "yes" to all questions.

    1. The firmware will not be updated if you run the install script with the '--without-fw-update' option.

    Example
    The following command will create a MLNX_OFED_LINUX tarball (TGZ) for RedHat 6.3 under the /tmp directory:

      # ./MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m <path>/MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64/ --make-tgz
      Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.3 under /tmp directory.
      All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
      Do you want to continue?[y/N]:y
      See log file /tmp/mlnx_ofed_iso.1380.log
      Building OFED RPMs. Please wait...
      Removing OFED RPMs...
      Created /tmp/MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64.tgz

    Install the newly created MLNX_OFED package:
      # cd /tmp
      # tar xzf MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64.tgz
      # ./MLNX_OFED_LINUX-2.0-3.0.1-rhel6.3-x86_64/mlnxofedinstall
25. mstflint -dev <PCI device> dc > <ini device file>

    Step 4: Edit the ini file that you found in the previous step and add the following lines to the [HCA] section in order to support 63 VFs (see note 1):
              ;; SRIOV enable
              total_vfs = 63
              num_pfs = 1
              sriov_en = true
            1. Some servers might have issues accepting 63 Virtual Functions or more. In such a case, please set the number of total_vfs to any required value.
    Step 5: Create a binary image using the modified ini file. Run:
              mlxburn -fw <fw name>.mlx -conf <modified ini file> -wrimage <file name>.bin
            The file <file name>.bin is a firmware binary file with SR-IOV enabled that has 63 VFs. It can be spread across all machines and can be burnt using mstflint, which is part of the bundle, using the following command:
              mstflint -dev <PCI device> -image <file name>.bin b
            After burning the firmware, the machine must be rebooted. If the driver is only restarted, the machine may hang and a reboot using power OFF/ON might be required.

    4.13.7 Configuring Pkeys and GUIDs under SR-IOV
    4.13.7.1 Port Type Management
    Port Type management is static when enabling SR-IOV (the connectx_port_config script will not work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core. This parameter may be used to set the port type uniformly for all installed ConnectX HCAs, or it may speci
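Going back to Step 4, the three [HCA] lines can be generated for a VF count other than 63 rather than typed by hand. The following is a sketch under stated assumptions: hca_sriov_lines is a hypothetical helper, and the upper bound of 63 is an assumption taken from the example value in the text, not a limit confirmed here.

```shell
#!/bin/sh
# Print the [HCA] SR-IOV lines from Step 4 for a given number of VFs.
# hca_sriov_lines is a hypothetical helper; the 0-63 range is an assumption.
hca_sriov_lines() {
    vfs=${1:-63}
    case "$vfs" in
        *[!0-9]*) echo "total_vfs must be a non-negative number" >&2; return 1 ;;
    esac
    if [ "$vfs" -gt 63 ]; then
        echo "total_vfs above 63 is not covered by this sketch" >&2
        return 1
    fi
    printf 'total_vfs = %s\nnum_pfs = 1\nsriov_en = true\n' "$vfs"
}
```

The output still has to be pasted into the [HCA] section of the ini file by hand; the sketch does not edit the file or burn firmware.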
26. Table 22: ibv_devinfo Flags and Options
    Flag | Optional/Mandatory | Default (If Not Specified) | Description
      -d <device>, --ib-dev=<device> | Optional | First found device | Run the command for the provided IB device 'device'
      -i <port>, --ib-port=<port> | Optional | All device ports | Query the specified device port <port>
      -l, --list | Optional | Inactive | Only list the names of InfiniBand devices
      -v, --verbose | Optional | Inactive | Print all available information about the InfiniBand device(s)

    Examples
    1. List the names of all available InfiniBand devices.
         > ibv_devinfo -l
         2 HCAs found:
           mthca0
           mlx4_0
    2. Query the device mlx4_0 and print user-available information for its Port 2.
         > ibv_devinfo -d mlx4_0 -i 2
         hca_id: mlx4_0
           fw_ver: 2.5.944
           node_guid: 0000:0000:0007:3895
           sys_image_guid: 0000:0000:0007:3898
           vendor_id: 0x02c9
           vendor_part_id: 25418
           hw_ver: 0xA0
           board_id: MT_04A0140005
           phys_port_cnt: 2
             port: 2
               state: PORT_ACTIVE (4)
               max_mtu: 2048 (4)
               active_mtu: 2048 (4)
               sm_lid: 1
               port_lid: 1
               port_lmc: 0x00

    9.8 ibdev2netdev
    ibdev2netdev enables association between IB devices and ports and the associated net device. Additionally, it reports the state of the net device link.
    Synopsis
      ibdev2netdev [-v] [-h]
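When only the port state is of interest, the ibv_devinfo output can be filtered. The sketch below assumes the usual "name: value" layout of that output, with a "port:" line preceding each "state:" line; port_states is a hypothetical helper name.

```shell
#!/bin/sh
# Read ibv_devinfo output on stdin and print one "port <n>: <state>"
# line per port. port_states is a hypothetical helper name.
port_states() {
    awk '
        $1 == "port:"  { port = $2 }                      # remember the current port number
        $1 == "state:" { print "port " port ": " $2 }     # report its state
    '
}
```

A possible invocation is ibv_devinfo -d mlx4_0 | port_states, which would reduce the second example above to a single line for port 2.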
27. reset performance counters of port 1 only
    reset extended performance counters of port 1 only
    reset performance counters of all ports
    reset only error counters of port 2
    reset only non-error counters of port 2

    1. Read local port's performance counters.
         > perfquery
         # Port counters: Lid 6 port 1
    2. Read performance counters from LID 2, all ports.
    3. Read then reset performance counters from LID 2, port 1.
         RcvConstraintErrors: 0
         LinkIntegrityErrors: 0
         ExcBufOverrunErrors: 0
         VL15Dropped: 0
         XmtData: 0
         RcvData: 0
         XmtPkts: 0
         RcvPkts: 0

    9.14 ibcheckerrs
    Validates an IB port (or node) and reports errors in counters above threshold.
    Check specified port (or node) and report errors that surpassed their predefined threshold. Port address is lid unless the -G option is used to specify a GUID address. The predefined thresholds can be dumped using the -s option, and a user defined threshold_file (using the same format as the dump) can be specified using the -t <file> option.
    Synopsis
      ibcheckerrs [-h] [-b] [-v] [-G] [-T <threshold_file>] [-s] [-N | -nocolor] [-C ca_name] [-P ca_port] [-t timeout_ms] <lid|guid> [<port>]
    Output Files
    Table 28 lists the various flags of the command.
    Table 28: ibchecker
28. Direct Bridging as well as other L3 Switching modes (e.g. NAT). This document explains the configuration and driver behavior when configured in Bridging mode.
In a virtualization environment, a virtual machine can be exposed to the physical network by performing the following steps:
Step 1. Create a virtual bridge.
Step 2. Attach the para-virtualized interface created by the eth_ipoib driver to the bridge.
Step 3. Attach the Ethernet interface in the Virtual Machine to that bridge.
The diagram below describes the topology that was created after these steps:
[Diagram: Hypervisor containing Virtual Interface(s) (vX), Virtual Bridge(s) (vbrX, aka vSwitch), Bridge Uplink(s) (pif), eIPoIB and the IPoIB Uplink, connected to the InfiniBand Fabric]
The diagram shows how the traffic from the Virtual Machine goes to the virtual bridge in the Hypervisor and from the bridge to the eIPoIB interface. The eIPoIB interface is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath.
4.8.1 Enabling the Driver
Once the mlnx_ofed driver installation is completed, perform the following:
Step 1. Open the /etc/infiniband/openib.conf file and include: E_IPOIB_LOAD=yes
Step 2. Restart the InfiniBand drivers: /etc/init.d/openibd restart
4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver
When
29. ERROR_WINDOW: 5;
SWITCH 0x12345 {
    ENABLE: true;
    AGEING_TIME: 77;
}
SWITCH 0x0002c902004050f8 {
    AGEING_TIME: 44;
}
SWITCH 0xabcde {
    ENABLE: false;
}
8.9 Congestion Control
8.9.1 Congestion Control Overview
Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation.
The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks. Additionally, it takes resource-reducing steps by reducing the rate of sending packets. Congestion Control Manager enables and configures the Congestion Control mechanism on fabric nodes (HCAs and switches).
8.9.2 Running OpenSM with Congestion Control Manager
Congestion Control (CC) Manager can be enabled/disabled through the SM options file. To do so, perform the following:
1. Create the file. Run: opensm -c <options-file-name>
2. Find the event_plugin_name option in the file and add ccmgr to it:
    # Event plugin name(s)
    event_plugin_name ccmgr
3. Run the SM with the new options file: opensm -F <options-file-name>
Once the Congestion Control is enabled on the fabric nodes, to completely disable Congestion Control
30. If, for instance, TC0 is set to 80% guarantee and TC1 to 20% (the TCs sum must be 100), then the BW left after servicing all strict priority TCs will be split according to this ratio. Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.
Rate Limit
Rate limit defines a maximum bandwidth allowed for a TC. Please note that 10% deviation from the requested values is considered acceptable.
Quality of Service Tools
mlnx_qos
mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates directly with the driver and thus does not require setting up a DCBX daemon on the system. The mlnx_qos tool enables the administrator of the system to:
- Inspect the current QoS mappings and configuration. The tool will also display maps configured by the TC and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
- Set UP to TC mapping
- Assign a transmission algorithm to each TC (strict or ETS)
- Set minimal BW guarantee to ETS TCs
- Set rate limit to TCs. For unlimited ratelimit, set the ratelimit to 0.
Usage: mlnx_qos -i <interface> [options]
Options:
--version    show program's version number and exit
-h, --help    show this help message and exit
-p LIST, --prio_tc=LIST    maps UPs to TCs. LIST is 8 comma separated TC
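The minimal-guarantee behavior described in this excerpt (an 80/20 split with no maximum enforcement) can be illustrated with a short Python sketch. This is only a model for reasoning about the ratios: the helper name `ets_split` and the idea of expressing each TC's offered load as a "demand" are assumptions made for the example, not part of mlnx_qos or the driver.

```python
def ets_split(link_bw, guarantees, demands):
    """Illustrative ETS model (hypothetical helper, not a Mellanox API).

    link_bw    -- bandwidth left after the strict-priority TCs were served
    guarantees -- {tc: percent}; the percentages must sum to 100
    demands    -- {tc: bandwidth the TC currently wants to send}
    Returns {tc: bandwidth granted}.
    """
    if sum(guarantees.values()) != 100:
        raise ValueError("ETS guarantees must sum to 100")
    granted = {tc: 0.0 for tc in guarantees}
    active = {tc for tc in guarantees if demands.get(tc, 0) > 0}
    remaining = float(link_bw)
    # Share what is left in proportion to the guarantees of the TCs that
    # still want more; a TC is never granted more than it asked for, so
    # any unused share flows to the other TCs on the next pass.
    while remaining > 1e-9 and active:
        weight = sum(guarantees[tc] for tc in active)
        if weight == 0:
            break
        pool = remaining
        for tc in list(active):
            offer = pool * guarantees[tc] / weight
            take = min(offer, demands[tc] - granted[tc])
            granted[tc] += take
            remaining -= take
            if granted[tc] >= demands[tc] - 1e-9:
                active.discard(tc)
    return granted


# Both TCs saturated: the 80/20 ratio applies.
print(ets_split(10, {0: 80, 1: 20}, {0: 10, 1: 10}))
# TC1 idle: TC0 may use the whole leftover bandwidth.
print(ets_split(10, {0: 80, 1: 20}, {0: 10, 1: 0}))
```

The second call shows why the guarantee is only a minimum: with TC1 idle, nothing caps TC0 at 80%.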
31. Optional | Use <smlid> as the target LID for SM/SA queries
-V (Version) | Optional | Show version info
-C <ca_name> | Optional | Use the specified channel adapter or router
-P <ca_port> | Optional | Use the specified port
Table 26 - smpquery Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
<op> | Mandatory | | Supported operations:
    nodeinfo <addr>
    nodedesc <addr>
    portinfo <addr> [<portnum>]
    switchinfo <addr>
    pkeys <addr> [<portnum>]
    sl2vl <addr> [<portnum>]
    vlarb <addr> [<portnum>]
    guids <addr>
    mepi <addr> [<portnum>]
<dest dr_path | lid | guid> | Optional | | Destination's directed path, LID, or GUID
Examples:
1. Query PortInfo by LID, with port modifier.
> smpquery portinfo 1 1
# Port info: Lid 1 port 1
Mkey: 0x0000000000000000
GidPrefix: 0xfe80000000000000
Lid: 0x0001
SMLid: 0x0001
CapMask: 0x251086a
    IsSM
    IsTrapSupported
    IsAutomaticMigrationSupported
    IsSLMappingSupported
    IsSystemImageGUIDsupported
    IsCommunicatonManagementSupported
    IsVendorClassSupported
    IsCapabilityMaskNoticeSupported
    IsClientRegistrationSupported
DiagCode: 0x0000
32. Subinterfaces
4.3.5 Verifying IPoIB Functionality
4.3.6 Bonding IPoIB
4.4 Quality of Service InfiniBand
4.4.1 Quality of Service Overview
4.4.2 QoS Architecture
4.4.3 Supported Policy
4.4.4 CMA Features
4.4.5 OpenSM Features
4.5 Quality of Service Ethernet
4.5.1 Quality of Service Overview
4.5.2 Mapping Traffic to Traffic
4.5.3 Plain Ethernet Quality of Service Mapping
4.5.4 Quality of Service Mapping
4.5.5 Raw Ethernet QP Quality of Service Mapping
4.5.6 Map Priorities with tc_wrap.py/mlnx_qos
4.5.7 Quality of Service Properties
4.5.8 Quality of Service Tools
4.6 Time-Stamping Service
4.6.1 Enabling Time Stamping
4.6.2 Getting Time Stamping
33. Sample programs for reference: ibv_task_pingpong, ibv_cc_pingpong
4.15 Ethtool
ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for wired Ethernet devices. It can be used to:
- Get identification and diagnostic information
- Get extended device statistics
- Control speed, duplex, autonegotiation and flow control for Ethernet devices
- Control checksum offload and other hardware offload features
- Control DMA ring sizes and interrupt moderation
The following are the ethtool supported options:
Table 6 - ethtool Supported Options
Options | Description
ethtool -i eth<x> | Checks driver and device information. For example:
    > ethtool -i eth2
    driver: mlx4_en (MT_0DD0120009_CX3)
    version: 2.1.6 (Aug 2013)
    firmware-version: 2.30.3000
    bus-info: 0000:1a:00.0
ethtool -k eth<x> | Queries the stateless offload status.
ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off] | Sets the stateless offload status. TCP Segmentation Offload (TSO) and Generic Segmentation Offload (GSO) increase outbound throughput by reducing CPU overhead. They work by queuing up large buffers and letting the network interface card split them into separate packets. Large Receive Offload (LRO) increases inbound throughput of high-bandwidth network connections by r
34. The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add "MLNX FlexBoot <ver>" for a ConnectX device to the list of boot devices. The priority of this list can be modified through BIOS setup.
A.7 Operation
A.7.1 Prerequisites
- Make sure that your client is connected to the server(s).
- The FlexBoot image is already programmed on the adapter card (see Section A.2).
- For InfiniBand ports only: Start the Subnet Manager as described in Section A.4.
- The DHCP server should be configured and started (see Section 4.3.3.1, "IPoIB Configuration Based on DHCP", on page 50).
- Configure and start at least one of the services iSCSI Target (see Section A.10) and/or TFTP (see Section A.5).
A.7.2 Starting Boot
Boot the client machine and enter BIOS setup to configure MLNX FlexBoot to be the first on the boot device priority list (see Section A.6).
On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.
If MLNX FlexBoot/iPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by the Subnet Manager
35. a = run all validation tests (expecting an input inventory)
v = only validate the given inventory file
s = run service registration, deregistration, and lease test
e = run event forwarding test
f = flood the SA with queries according to the stress mode
m = multicast flow
q = QoS info: dump VLArb and SLtoVL tables
t = run trap 64/65 flow (this flow requires running of an external tool)
Default: all flows except QoS.
-w, --wait
This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec.
-d, --debug
This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:
OPT | Description
-d0 | Ignore other SM nodes
-d1 | Force single-threaded dispatching
-d2 | Force log flushing after each log message
-d3 | Disable multicast support
-m, --max_lid
This option specifies the maximal LID number to be searched for during inventory file build. Default: 100.
-g, --guid
This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-p, --port
This option displays a menu of possible local port GUID values with which osmtest could bind.
-i, --inventory
This option specifies the name of the inventory file. No
36. [Diagram: master spanning tree drawn on the torus grid, axes x=0..4 and y=0..4]
Assuming the y dateline was between y=4 and y=0, this spanning tree has a branch that crosses a dateline. However, again this cannot contribute to credit loops, as it occurs on a 1D ring (the ring for x=3) that is broken by a failure, as in the above example.
8.5.7.3 Torus Topology Discovery
The algorithm used by torus-2QoS to construct the torus topology from the undirected graph representing the fabric requires that the radix of each dimension be configured via torus-2QoS.conf. It also requires that the torus topology be "seeded"; for a 3D torus this requires configuring four switches that define the three coordinate directions of the torus.
Given this starting information, the algorithm is to examine the cube formed by the eight switch locations bounded by the corners (x,y,z) and (x+1,y+1,z+1). Based on switches already placed into the torus topology at some of these locations, the algorithm examines 4-loops of interswitch links to find the one that is consistent with a face of the cube of switch locations, and adds its switches to the discovered topology in the correct locations.
Because the algorithm is based on examining the topology of 4-loops of links, a torus with one or more radix-4 dimensions requires extra initial seed configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect and repor
37. mlnx_qos should be used. mlnx_qos gets a list of mappings between UPs and TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0 and UPs 4-7 to TC1.
Quality of Service Properties
The different QoS properties that can be assigned to a TC are:
- Strict Priority (see "Strict Priority")
- Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
- Rate Limit (see "Rate Limit")
Strict Priority
When setting a TC's transmission algorithm to be 'strict', this TC has absolute (strict) priority over other strict-priority TCs coming before it, as determined by the TC number: TC 7 is highest priority, TC 0 is lowest. It also has absolute priority over non-strict TCs (ETS).
This property needs to be used with care, as it may easily cause starvation of other TCs.
A higher strict priority TC is always given the first chance to transmit. Only if the highest strict priority TC has nothing more to transmit will the next highest TC be considered. Non-strict priority TCs will be considered last to transmit.
This property is extremely useful for low-latency, low-bandwidth traffic: traffic that needs to get immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.
Minimal Bandwidth Guarantee (ETS)
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split among other TCs according to a minimal guarantee policy
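The UP-to-TC list used in the example above (0,0,0,0,1,1,1,1) is positional: entry n names the TC for UP n. A minimal Python sketch of how such a list can be checked and read before handing it to the tool; the function name `parse_prio_tc` is hypothetical, and the 0-7 range check is an assumption based on the TC numbering given in the text (TC 7 highest, TC 0 lowest).

```python
def parse_prio_tc(list_arg):
    """Turn a UP-to-TC list such as '0,0,0,0,1,1,1,1' into {tc: [UPs]}.

    Hypothetical helper for illustration only; it does not call mlnx_qos.
    """
    tcs = [int(item) for item in list_arg.split(",")]
    if len(tcs) != 8:
        raise ValueError("expected 8 comma-separated TC numbers, one per UP")
    if any(tc < 0 or tc > 7 for tc in tcs):
        raise ValueError("TC numbers are assumed to be in the range 0-7")
    mapping = {}
    for up, tc in enumerate(tcs):
        # Position in the list is the user priority (UP).
        mapping.setdefault(tc, []).append(up)
    return mapping


print(parse_prio_tc("0,0,0,0,1,1,1,1"))
# {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
```

Grouping by TC makes it easy to confirm that the list matches the intent (here, UPs 0-3 on TC0 and UPs 4-7 on TC1).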
38. threshold 10 lid 2 port 255 threshold 10 lid 2 port 255 warn threshold 100 lid 2 port 255 Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: warn counter LinkRecovers, warn counter LinkDowned 12, warn counter RcvErrors 565, counter XmtDiscards 441: FAILED
2. Check port counters for LID 2, Port 1.
> ibcheckerrs -v 2 1
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK
3. Check the LID 2, Port 1 using the specified threshold file.
> cat thresh1
SymbolErrors=10
LinkRecovers=10
LinkDowned=10
RcvErrors=10
RcvRemotePhysErrors=100
RcvSwRelayErrors=100
XmtDiscards=100
XmtConstraintErrors=100
RcvConstraintErrors=100
LinkIntegrityErrors=10
ExcBufOverrunErrors=10
VL15Dropped=100
> ibcheckerrs -v -T thresh1 2 1
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK
9.15 mstflint
Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.
If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.
To run mstflint
39. to the event_plugin_options option in the file:
    # Options string that would be passed to the plugin(s)
    event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
2. Run Subnet Manager with the new options file: opensm -F <options-file-name>
The AR Manager options file contains two types of parameters:
1. General options: options which describe the AR Manager behavior and the AR parameters that will be applied to all the switches in the fabric.
2. Per-switch options: options which describe specific switch behavior.
Note the following:
- The Adaptive Routing configuration file is case sensitive.
- You can specify options for a nonexisting switch GUID. These options will be ignored until a switch with a matching GUID is added to the fabric.
- The Adaptive Routing configuration file is parsed every AR Manager cycle, which in turn is executed at every heavy sweep of the Subnet Manager.
- If the AR Manager fails to parse the options file, default settings for all the options will be used.
8.8.5.1 General AR Manager Options
Table 13 - Adaptive Routing Manager Options File
Option File | Description | Values
ENABLE: <true|false> | Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as a device that does not support AR, AR Manager will not try to enable AR on this switch. If the firmware of this switch was upd
40. 00.1/ports/1/pkey_idx/1
echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
echo 2 > 0000:02:00.2/ports/1/pkey_idx/0
vm1 pkey index 0 will be mapped to physical pkey index 1. vm2 pkey index 0 will be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index mapped to the default pkey.
Step d. On the host, do the following:
cd /sys/class/infiniband/mlx4_0/iov
echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
echo 2 > 0000:03:00.2/ports/1/pkey_idx/0
Step e. Once the VMs are running, you can check the VM's virtualized PKey table by running the following on the VM:
cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]
Step 3. Start up the VMs and bind VFs to them.
Step 4. Configure IP addresses for ib0 on the host and on the guests.
4.13.7.3 Ethernet Virtual Function Configuration when Running SR-IOV
4.13.7.3.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
When running ETH ports on VFs, the ports may be configured to simply pass through packets as is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging). In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag insert
41. Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
Lid | Out Port | Destination Info
0x0003 | 021 | (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 | 007 | (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
0x0007 | 021 | (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
3 valid lids dumped
4. Dump all Lids with valid out ports of the switch with portguid 004016.
> ibroute -G 0x000b8cffff004016
Unicast lids [0x0-0x8] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
Lid | Out Port | Destination Info
0x0002 | 023 | (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0003 | 000 | (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 | 023 | (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
0x0007 | 020 | (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
0x0008 | 024 | (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
5 valid lids dumped
5. Dump all non-empty mlids of the switch with Lid 3.
> ibroute -M 3
Multicast mlids [0xc000-0xc3ff] of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale-III Mellanox Technologies):
[port-membership grid garbled in the source]
MLid: 0xc000 0xc001 0xc002 0xc003 0xc020 0xc021 0xc022 0xc023 0
42. 2QoS can generate credit-loop-free unicast routes, it is also possible to generate a master spanning tree for multicast that retains the required properties. For example, consider that same 2D 6x5 torus, with the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following master spanning tree:
[Diagram: master spanning tree drawn on the 6x5 torus grid, axes x=0..5 and y=0..4]
Two things are notable about this master tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop.
Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric. That, unfortunately, is a compromise that must be made to retain the other desirable properties of torus-2QoS routing.
In the event that a single switch fails, torus-2QoS will generate a master spanning tree that has no "extra" turns by appropriately selecting a root switch. In the 2D 6x5 torus example, assume now that the switch at (3,2), i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the following master spanning tree for that case: 4
43. end-qos-ulps
OpenSM options file:
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:96,3:96,4:96
qos_vlarb_low 0:1
Gio ew Ordr 3 256 7 105 15 15 15 15 15 15 18
Partition configuration file:
Default=0x7fff,ipoib : ALL=full;
PartA=0x8001, sl=1, ipoib : ALL=full;
8.8 Adaptive Routing
8.8.1 Overview
Adaptive Routing is at beta stage.
Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes:
- Free AR: No constraints on output port selection.
- Bounded AR: The switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets.
Adaptive Routing Manager enables and configures the Adaptive Routing mechanism on fabric switches. It scans all the fabric switches, deduces which switches support Adaptive Routing, and configures the AR functionality on these switches.
Currently, Adaptive Routing Manager supports only the link aggregation algorithm. Adaptive Routing Manager configures the AR mechanism to allow switches to select the output port out of all the ports that are linked to the same remote switch. This algorithm suits any topology with several links between switches. It especially suits 3D torus/mesh, where there are several links in each direction of the X/Y/Z axis.
If some switches do not support AR, they will slow down AR Manager as it may
44. 3.1 IPoIB Configuration Based on DHCP
Setting an IPoIB interface configuration based on DHCP is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:
For RedHat: BOOTPROTO=dhcp
For SLES: BOOTPROTO=dhcp
If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
/etc/sysconfig/network-scripts/ on a RedHat machine
/etc/sysconfig/network/ on a SuSE machine
A patch for DHCP is required for supporting IPoIB. For further information, please see the README which is available under the docs/dhcp/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.
The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier (see Section A.3.2, "Configuring the DHCP Server", on page 207).
4.3.3.1.1 DHCP Server
In order for the DHCP server to provide config
45. struct ibv_flow_attr: attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers. The optional headers are detected by setting the size and num_of_specs fields. struct ibv_flow_attr can be followed by the optional flow header structs:
struct ibv_flow_spec_ib
struct ibv_flow_spec_eth
struct ibv_flow_spec_ipv4
struct ibv_flow_spec_tcp_udp
For further information, please refer to the ibv_create_flow man page.
Be advised that from MLNX_OFED v2.0-3.0.0 and higher, the parameters (both the value and the mask) should be set in big-endian format.
Each header struct holds the relevant network layer parameters for matching. To enforce the match, the user sets a mask for each parameter. The supported masks are:
- All-one mask: include the parameter value in the attached rule. Note: since the VLAN ID in the Ethernet header is 12 bits long, the following parameter should be used: flow_spec_eth.mask.vlan_tag = htons(0x0fff).
- All-zero mask: ignore the parameter value in the attached rule.
When setting the flow type to NORMAL, the incoming traffic will be steered according to the rule specifications. The ALL_DEFAULT and MC_DEFAULT rule options are valid only for the Ethernet link type, since InfiniBand link type packets always include a QP number.
For further information, please refer to the relevant man pages.
ibv_destroy_flow
int ibv
46. Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet.
opensm also provides an experimental version of a performance manager.
opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.
By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.
8.2.1 opensm Syntax
opensm [OPTIONS]
where OPTIONS are:
--version    Prints OpenSM ver
47. BER for each port and check that no BER value has exceeded the BER threshold (default threshold: 10^-12).
--ber_use_data    Indicates that the BER test will use the received data for calculation.
--ber_thresh <value>    Specifies the threshold value for the BER test. The reciprocal number of the BER should be provided. Example: for 10^-12, the value needs to be 1000000000000 or 0xe8d4a51000 (10^12). If the threshold given is 0, then all BER values for all ports will be reported.
--extended_speeds <dev-type>    Collect and test port extended speeds counters. dev-type: (sw | all).
--pm_per_lane    List all counters per lane (when available).
--ls <2.5|5|10|14|25|FDR10>    Specifies the expected link speed.
--lw <1x|4x|8x|12x>    Specifies the expected link width.
-w|--write_topo_file <file-name>    Write out a topology file for the discovered topology.
-t|--topo_file <file>    Specifies the topology file name.
--out_ibnl_dir <directory>    The topology file custom system definitions (ibnl) directory.
--screen_num_errs <num>    Specifies the threshold for printing errors to screen (default: 5).
--smp_window <num>    Max SMP MADs on wire (default: 8).
--gmp_window <num>    Max GMP MADs on wire (default: 128).
--max_hops <max-hops>    Specifies the maximum hops for the discovery process (default: 64).
-V|--version    Prints the version of the tool.
-h|--help    Prints help information (withou
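Because the BER threshold option takes the reciprocal of the bit error rate rather than the rate itself, it is easy to pass the wrong number. The following Python sketch shows the conversion for the 10^-12 example given above; the helper name `ber_thresh_value` is hypothetical and the snippet only computes the number, it does not invoke ibdiagnet.

```python
from fractions import Fraction


def ber_thresh_value(ber):
    """Return the integer reciprocal of a bit error rate.

    Hypothetical helper: 'ber' may be a Fraction or anything Fraction()
    accepts (for example Fraction(1, 10**12)). Exact rational arithmetic
    avoids floating-point rounding of very small rates.
    """
    ber = Fraction(ber)
    if ber <= 0:
        raise ValueError("BER must be positive")
    return round(1 / ber)


value = ber_thresh_value(Fraction(1, 10**12))
print(value, hex(value))
# 1000000000000 0xe8d4a51000
```

Both printed forms correspond to the decimal and hexadecimal values quoted in the option description.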
48. ERROR: Causes the process to hang in a loop when a completion with error, which is not flushed with error or retry exceeded, occurs. Otherwise disabled.
- MLX5_POST_SEND_PREFER_BF: Configures every work request that can use blue flame to use blue flame. Otherwise, blue flame depends on the size of the message and the inline indication in the packet.
- MLX5_SHUT_UP_BF: Disables the blue flame feature. Otherwise, do not disable.
- MLX5_SINGLE_THREADED: All spinlocks are disabled. Otherwise, spinlocks are enabled. Used by applications that are single threaded and would like to save the overhead of taking spinlocks.
- MLX5_CQE_SIZE: 64 - completion queue entry size is 64 bytes (default). 128 - completion queue entry size is 128 bytes.
- MLX5_SCATTER_TO_CQE: Small buffers are scattered to the completion queue entry and manipulated by the driver. Valid for RC transport. Default is 1, otherwise disabled.
1.3.3 Mid-layer Core
Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.
1.3.4 ULPs
IPoIB
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an
49. MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, you should follow the manual configuration procedure described in Section 4.3.3.3.
Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, be recognized (see Chapter 8, "OpenSM - Subnet Manager").
Removing a Subinterface
To remove a child interface (subinterface), run:
echo <subinterface> > /sys/class/net/<ib_interface>/delete_child
Using the example of Step 2:
echo 0x8001 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g. 0x8000 in the example above).
Verifying IPoIB Functionality
To verify your configuration and your IPoIB functionality, perform the following steps:
Step 1. Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:
host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0
Step 2. Enter the ping command
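The note about the most significant bit can be made concrete with a few lines of Python. This is a sketch under assumptions: the helper names `child_pkey` and `delete_child_command` are invented for the example, and the code only builds the command string shown in the text; it does not write to sysfs.

```python
MSB = 0x8000  # most significant bit of the 16-bit PKey value


def child_pkey(pkey):
    """Return the PKey with its most significant bit set (hypothetical helper)."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKey must fit in 16 bits")
    return pkey | MSB


def delete_child_command(parent, pkey):
    """Build the delete_child command line for a parent such as 'ib0'."""
    return "echo 0x%04x > /sys/class/net/%s/delete_child" % (
        child_pkey(pkey), parent)


print(delete_child_command("ib0", 0x0001))
# echo 0x8001 > /sys/class/net/ib0/delete_child
```

Passing a value that already has the bit set (0x8001) yields the same command, so the helper is safe to apply unconditionally.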
50. Mellanox FCA, which offloads collective operations from UPC. For further information on FCA, please refer to the Mellanox website.
The GasNet library contains the MXM conduit, which offloads from UPC all P2P operations as well as some synchronization routines. For further information on MXM, please refer to the Mellanox website.
Mellanox OFED 1.8 includes ScalableUPC 2.1, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as a source RPM as well and can be downloaded from the Mellanox website.
Installing ScalableUPC
Mellanox ScalableUPC is installed as part of the MLNX_OFED package. Mellanox OFED 1.8.5 includes ScalableUPC Rev 2.2, which is installed under /opt/mellanox/bupc. If you have installed OFED 1.8.5, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as a source RPM as well and can be downloaded from the Mellanox website.
Please note, the binary distribution of ScalableUPC is compiled with the following defaults:
- FCA support is disabled at runtime by default and must be configured prior to using it from ScalableUPC. For further information, please refer to the FCA User Manual.
- MXM support is enabled by default.
5.5.2 Runtime Parameters
The following parameters can be passed to upcrun in order to change FCA
51. --port_search_ordering_file, -O <path to file>
This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR). Moreover, this option provides the means to define non-default routing port order.
--dimn_ports_file, -O <path to file> (DEPRECATED)
This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR).
--honor_guid2lid, -x
This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE.
--const_multicast
This option forces OpenSM to conserve previously built multicast trees.
--log_file, -f <log-file-name>
This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output, use -f stdout.
--log_limit, -L <size in MB>
This option defines the maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.
--erase_log_file, -e
This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.
--Pconfig, -P <partition-config-file>
This option defines the optional partition configuration file. The default name is /etc/opensm/partitions.conf
52. Performance
7.2.3 Preserving Your Performance Settings after a Reboot
7.2.4 Tuning Power Management
7.2.5 Interrupt
7.2.6 Tuning for NUMA Architecture
7.2.7 IRQ Affinity
7.2.8 Tuning Multi-Threaded IP
Chapter 8 OpenSM - Subnet Manager
8.1 Overview
8.2 opensm Description
8.2.1 opensm Syntax
8.2.2 Environment Variables
8.2.3 Signaling
8.2.4 Running opensm
8.3 osmtest Description
8.3.1 Syntax
8.3.2 Running
8.4
8.4.1 File Format
8.5 Routing Algorithms
8.5.1 Effect of Topology Changes
8.5.2 Min Hop Algorithm
8.5
53. QP based congestion control; 1: SL/Port based congestion control. Default: 0.
    Table 17 - Congestion Control Manager CA Options File
    Option File: Description. Values.
    ca_control_map: An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified. Values: 0xffff.
    ccti_increase: Sets the CC Table Index (CCTI) increase. Default: 1.
    trigger_threshold: Sets the trigger threshold. Default: 2.
    ccti_min: Sets the CC Table Index (CCTI) minimum. Default: 0.
    cct: Sets all the CC table entries to a specified value. The first entry will remain 0, whereas the last value will be set to the rest of the table. Values: <comma-separated list>. Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
    ccti_timer: Sets the given ccti timer for all SLs. Default: 0. When the value is set to 0, the CCT calculation is based on the number of nodes.
    Table 18 - Congestion Control Manager CC MGR Options File
    Option File: Description. Values.
    max_errors, error_window: When the number of send/receive errors or timeouts exceeds max_errors in less than error_window seconds, the CC MGR will abort and will allow OpenSM to proceed. Values: max_errors = 0: zero tolerance, abort configuration on first error.
54. QoS policy file has the following sections:
    I) Port Groups (denoted by port-groups)
    This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
    - Port GUID
    - Port name, which is a combination of NodeDescription and IB port number
    - PKey, which means that all the ports in the subnet that belong to the partition with a given PKey belong to this port group
    - Partition name, which means that all the ports in the subnet that belong to the partition with a given name belong to this port group
    - Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)
    II) QoS Setup (denoted by qos-setup)
    This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
    III) QoS Levels (denoted by qos-levels)
    Each QoS Level defines a Service Level (SL) and a few optional fields:
    - MTU limit
    - Rate limit
    - Packet lifetime
    When a path search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Leve
55. Options
    -v  Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
    -h  Print help messages.
    Example
    sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> eth5 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 1 (ACTIVE) ==> ib0 (Down)
    mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 port 2 (DOWN) ==> ib1 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 1 (DOWN) ==> eth2 (Down)
    mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN) ==> eth3 (Down)
    sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev
    mlx4_0 port 1 ==> eth5 (Down)
    mlx4_0 port 1 ==> ib0 (Down)
    mlx4_0 port 2 ==> ib1 (Down)
    mlx4_1 port 1 ==> eth2 (Down)
    mlx4_1 port 2 ==> eth3 (Down)
    9.9 ibstatus
    Displays basic information obtained from the local InfiniBand driver. Output includes: LID, SMLID, port state, port physical state, port width and port rate.
    Synopsis
    ibstatus [-h] [<device name>[:<port>]]
    Table 23 lists the various flags of the command.
    Table 23 - ibstatus Flags and Options
    Flag: Optional / Mandatory. Default (If Not Specified). Description.
    -h: Optional. Print the help menu.
    <device>: Optional. Default: all devices. Print information for the spe
56.         skprio: 4 (tos: 24)
            skprio: 5
            skprio: 6 (tos: 16)
            skprio: 7
            skprio: 8
            skprio: 9
            skprio: 10
            skprio: 11
            skprio: 12
            skprio: 13
            skprio: 14
            skprio: 15
        up: 7
    tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
        up: 1
        up: 2
        up: 3
    tc: 2 ratelimit: 2 Gbps, tsa: strict
        up: 4
        up: 5
        up: 6
    4.5.8.2 tc and tc_wrap.py
    The 'tc' tool is used to set up the sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The 'tc_wrap.py' tool will use either sysfs or the 'tc' tool to configure the sk_prio to UP mapping.
    Usage:
    tc_wrap.py -i <interface> [options]
    Options:
    --version  show program's version number and exit
    -h, --help  show this help message and exit
    -u SKPRIO_UP, --skprio_up=SKPRIO_UP  maps sk_prio to UP. LIST is <=16 comma separated UP. The index of an element is the sk_prio.
    -i INTF, --interface=INTF  Interface name
    Example: set skprio 0-2 to UP0, and skprio 3-7 to UP1 on eth4.
    UP 0
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
    UP 1
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
    UP 2
    UP 3
    UP 4
    UP 5
    UP 6
    UP 7
    4.5.8.3 Additional Tools
    The tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is a part of iproute2 package v2
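The SKPRIO_UP list taken by the u option is positional: the index of each element is the sk_prio and the element value is the UP. The following is a minimal sketch of building such a list for the mapping named in the example (skprio 0-2 to UP0, skprio 3-7 to UP1). The helper name `build_skprio_up_list` is hypothetical, and defaulting unassigned priorities to UP 0 is an assumption made for illustration.

```python
def build_skprio_up_list(assignments, length=8, default_up=0):
    """Return a comma-separated UP list where index = sk_prio, value = UP.

    assignments is a list of (first_skprio, last_skprio, up) tuples.
    Hypothetical helper; the 16-entry limit comes from the option text.
    """
    if not 0 < length <= 16:
        raise ValueError("the list holds between 1 and 16 entries")
    ups = [default_up] * length
    for first, last, up in assignments:
        if not 0 <= first <= last < length:
            raise ValueError(f"sk_prio range {first}-{last} is outside the list")
        if not 0 <= up <= 7:
            raise ValueError(f"UP {up} is outside 0-7")
        for prio in range(first, last + 1):
            ups[prio] = up
    return ",".join(str(up) for up in ups)


if __name__ == "__main__":
    # skprio 0-2 -> UP0, skprio 3-7 -> UP1
    print(build_skprio_up_list([(0, 2, 0), (3, 7, 1)]))
```

The resulting string is what would be handed to the u option together with the interface name.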
57. Write a data block to Flash without sector erase.
    rb <addr> <size> [out-file]: Read a data block from Flash.
    swreset: SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.
    Possible command return values are:
    0 - successful completion
    1 - error has occurred
    7 - the burn command was aborted because firmware is current
    Examples
    1. Find Mellanox Technologies's ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE.
    > /sbin/lspci -d 15b3:634a
    04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
    In the example above, 15b3 is Mellanox Technologies's vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn.
    The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3.
    2. Verify the ConnectX firmware using its ID (using the results of the example above).
    > mstflint -d 04:00.0 v
    ConnectX failsafe image. Start address: 80000. Chunk size 80000:
    NOTE: The addresses below are contiguous logical addresses. Physical addresses on flash may be different, based on the
58. be unique, but PKey does need to be unique. If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also next note).
    It is possible to split a partition configuration in more than one definition, but then the PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions).
    Examples:
    Default=0x7fff : ALL, SELF=full ;
    NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
    YetAnotherOne = 0x300 : SELF=full ;
    YetAnotherOne = 0x300 : ALL=limited ;
    ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
    # 0x123453, 0x123454 will be limited
    ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
    # 0x123456, 0x123457 will be limited
    ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
    ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
    ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
    The following rule is equivalent to how OpenSM used to run prior to the partition manager:
    Default=0x7fff,ipoib:ALL=full;
    8.5 Routing Algorithms
    OpenSM offers six routing engines:
    1. Min Hop Algorithm: Based on the minimum hops to each node, where the path length is optimized.
    2. UPDN Algorithm: Based on the minimum hops to ea
59. before the rest of the LASH algorithm runs.
    8.5.6 DOR Routing Algorithm
    The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.
    Use the '-R dor' option to activate the DOR algorithm.
    8.5.7 Torus-2QoS Routing Algorithm
    Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics. The torus-2QoS routing engine can provide the following functionality on a 2D/3D torus:
60. by their LIDs, or by the names defined in the topology file. In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted as it may change.
    ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in the -p option. Otherwise, an error is reported.
    When ibdiagpath queries for the performance counters along the path between the source and destination, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error.
    Moreover, the tool allows omitting the source node in LID-route addressing, in which case the local port on the machine running the tool is assumed to be the source.
    Synopsis
    ibdiagpath {-n <[src-name,]dst-name> | -l <[src-lid,]dst-lid> | -d <p1,p2,p3,...>} [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]
61. device numbers (e.g. '1'). Max supported devices: 32 (string)
    1. In the current version, this parameter uses a decimal number to describe the InfiniBand device, and not a hexadecimal number as in previous versions, in order to make the mapping of device function numbers to InfiniBand device numbers uniform with the other module parameters (e.g. num_vfs and probe_vf). For example, to map mlx4_15 to device function number 04:00.0 in the current version, we use "options mlx4_ib dev_assign_str=04:00.0-15", as opposed to the previous version in which we used "options mlx4_ib dev_assign_str=04:00.0-f".
    C.2 mlx4_core Parameters
    set_4k_mtu: (Obsolete) attempt to set 4K MTU to all ConnectX ports (int)
    debug_level: Enable debug tracing if > 0 (int)
    msi_x: 0 - don't use MSI-X, 1 - use MSI-X, >1 - limit number of MSI-X irqs to msi_x (non-SRIOV only) (int)
    enable_sys_tune: Tune the CPUs for better performance (default 0) (int)
    block_loopback: Block multicast loopback packets if > 0 (default: 1) (int)
    num_vfs: Either a single value (e.g. '5') to define a uniform num_vfs value for all devices functions, or a string to map device function numbers to their num_vfs values (e.g. '0000:04:00.0-5,002b:1c:0b.a-15'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the num_vfs value (e.g. 15). (string)
    probe_vf: Either a single value (e.g. '3') to define a uniform number of VFs to probe by the pf driver for all devices functions, or a st
62. down SRP.
    2. After Manual Activation of High Availability
    If you manually activated SRP High Availability, perform the following steps:
    a. Unmount all SRP partitions that were mounted.
    b. Kill the SRP daemon instances.
    c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
    d. Run: multipath -F
    3. After Automatic Activation of High Availability
    If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
    4.2 iSCSI Extensions for RDMA (iSER)
    iSCSI Extensions for RDMA (iSER) is currently at beta level. Please be aware that the content below is subject to change.
    4.2.1 Overview
    iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies.
    4.2.2 iSER Initiator
    The iSER initiator is controlled through the iSCSI interface available from the iscsi-initiator-utils package. Target settings such as timeouts and retries are set the same as for any other iSCSI targets.
    If targets are set to auto-connect on boot and the targets are unreachable, it may take a long time to continue th
63. Table 4 Reference Documents
    Table 5 Software and Hardware Requirements
    Table 6 mlnxofedinstall Return Codes
    Table 7 Buffer Values
    Table 8 Parameters Used to Control Error Cases / Contiguity
    Table 9 Flow Specific Parameters
    Table 10 ethtool Supported Options
    Table 11 Useful MPI Links
    Table 12 Runtime Parameters
    Table 13 Recommended PCIe Configuration
    Table 14 Recommended BIOS Settings for Intel Sandy Bridge Processors
    Table 15 Recommended BIOS Settings for Intel Nehalem/Westmere Processors
    Table 16 Recommended BIOS Settings for AMD Processors
    Table 17 Adaptive Routing Manager Options File
    Table 18 Adaptive Routing Manager Pre-Switch Options File
    Table 19 Congestion Control Manager General Options File
    Table 20 Congestion Control Manager Switch Options
64. errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.
    When time-stamping is enabled, VLAN stripping is disabled. For more info, please refer to Documentation/networking/timestamping.txt in kernel.org.
    4.7 Atomic Operations
    Atomic Operations are applicable to the mlx4 driver only.
    4.7.1 Enhanced Atomic Operations
    ConnectX implements a set of Extended Atomic Operations beyond those defined by the IB spec. Atomicity guarantees, Atomic Ack generation, ordering rules and error behavior for this set of extended Atomic operations are the same as those for IB standard Atomic operations (as defined in section 9.4.5 of the IB spec).
    4.7.1.1 Masked Compare and Swap (MskCmpSwap)
    The MskCmpSwap atomic operation is an extension to the CmpSwap operation defined in the IB spec. MskCmpSwap allows the user to select a portion of the 64-bit target data for the "compare" check, as well as to restrict the swap to a (possibly different) portion. The pseudocode below describes the operation:
    | atomic_response = *va
    | if (!((compare_add ^ *va) & compare_add_mask)) then
    |     *va = (*va & ~(swap_mask)) | (swap & swap_mask)
    |
    | return atomic_response
    The additional operands a
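The masked compare-and-swap described above can be modelled in a few lines. The following is a minimal sketch, assuming the usual reading of the operation: the compare succeeds when the target and the compare operand agree on every bit selected by the compare mask, and only the bits selected by the swap mask are replaced. The function name and the dictionary standing in for remote memory are hypothetical; the sketch models the semantics only and is not the verbs API.

```python
MASK64 = (1 << 64) - 1  # the operation acts on 64-bit target data


def masked_cmp_swap(memory, va, compare, compare_mask, swap, swap_mask):
    """Model MskCmpSwap on a dict that stands in for remote memory.

    Returns the original value at va (the atomic response), whether or
    not the swap took place. Illustrative model, not a driver call.
    """
    original = memory[va] & MASK64
    # Compare only the bits selected by compare_mask.
    if ((compare ^ original) & compare_mask) == 0:
        # Replace only the bits selected by swap_mask; keep the rest.
        memory[va] = ((original & ~swap_mask) | (swap & swap_mask)) & MASK64
    return original


if __name__ == "__main__":
    mem = {0x1000: 0x1122334455667788}
    # Compare the low byte against 0x88, then swap only the second byte.
    old = masked_cmp_swap(mem, 0x1000, 0x88, 0xFF, 0xAA00, 0xFF00)
    print(hex(old), hex(mem[0x1000]))
```

With both masks set to all ones, the model reduces to the plain CmpSwap behaviour.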
65. eth_ipoib is loaded, a number of eIPoIB interfaces are created, with the following default naming scheme: ethX, where X represents the ETH port available on the system.
    To check which eIPoIB interfaces were created:
    cat /sys/class/net/eth_ipoib_interfaces
    For example, on a system with a dual-port HCA, the following two interfaces might be created: eth4 and eth5.
    cat /sys/class/net/eth_ipoib_interfaces
    eth4 over IB port: ib0
    eth5 over IB port: ib1
    These interfaces can be used to configure the network for the guest. For example, if the guest has a VIF that is connected to the Virtual Bridge br0, then enslave the eIPoIB interface to br0 by running:
    brctl addif br0 ethX
    In a RHEL KVM environment, there are other methods to create/configure your virtual network (e.g. macvtap). For additional information, please refer to the Red Hat User Manual.
    The IPoIB daemon (ipoibd) detects the new virtual interface that is attached to the same bridge as the eIPoIB interface and creates new IPoIB instances for it in order to send/receive data. As a result, a number of IPoIB interfaces (ibX.Y) are shown as being created/destroyed, and are enslaved to the corresponding ethX interface to serve any active VIF in the system according to the set configuration. This process is done automatically by the ipoibd service.
    To see the list of IPoIB interfaces enslaved under et
66. for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
    --retries <number>
    This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.
    --maxsmps, -n <number>
    This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.
    --rereg_on_guid_migr
    This option, if enabled, forces OpenSM to send port info with the client reregister bit set to all nodes in the fabric when an alias GUID migrates from one physical port to another.
    --aguid_inout_notice
    This option enables sending GID IN/OUT notices on Alias GUIDs register/delete request to registered clients.
    --sm_assign_guid_func [uniq_count | base_port]
    Specifies the algorithm that the SM will use when it comes to choosing SM assigned alias GUIDs. The default is uniq_count.
    --console, -q [off | local]
    This option activates the OpenSM console (default off).
    --ignore_guids, -i <equalize-ignore-guids-file>
    This option provides the means to define a set of ports (by guid) that will be ignored by the link load equalization algorithm.
    --hop_weights_file, -w <path to file>
    This option provides the means to define a weighting factor per port for customizing the least weight hops for the routing
67. -h|--help  Prints the help page information.
    -V|--version  Prints the version of the tool.
    --vars  Prints the tool's environment variables and their values.
    Output Files
    Table 21 - ibdiagpath Output Files
    Output File: Description
    ibdiagpath.log: A dump of all the application reports generated according to the provided flags.
    ibdiagnet.pm: A dump of the Performance Counters values of the fabric links.
    Error Codes
    1 - The path traced is un-healthy
    2 - Failed to parse command line options
    3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
    4 - Unable to traverse the LFT data from source to destination
    5 - Failed to use Topology File
    6 - Failed to load required Package
    9.6 ibv_devices
    Lists InfiniBand devices available for use from userspace, including node GUIDs.
    Synopsis
    ibv_devices
    Examples
    1. List the names of all available InfiniBand devices.
    > ibv_devices
    device    node GUID
    mthca0    0002c9000101d150
    mlx4_0    0000000000073895
    9.7 ibv_devinfo
    Queries InfiniBand devices and prints the information about them that is available for use from userspace.
    Synopsis
    ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]
    Table 22 lists the various flags of the command.
    Table 22 - ibv_devinfo Flags and Options
    Optional
68. image start address and chunk size.
    /0x00000038-0x000010db (0x0010a4)/ (BOOT2) - OK
    /0x000010dc-0x00004947 (0x00386c)/ (BOOT2) - OK
    /0x00004948-0x000052c7 (0x000980)/ (Configuration) - OK
    /0x000052c8-0x0000530b (0x000044)/ (GUID) - OK
    /0x0000530c-0x0000542f (0x000124)/ (Image Info) - OK
    /0x00005430-0x0000634f (0x000f20)/ (DDR) - OK
    /0x00006350-0x0000f29b (0x008f4c)/ (DDR) - OK
    /0x0000f29c-0x0004749b (0x038200)/ (DDR) - OK
    /0x0004749c-0x0005913f (0x011ca4)/ (DDR) - OK
    /0x00059140-0x0007a123 (0x020fe4)/ (DDR) - OK
    /0x0007a124-0x0007bdff (0x001cdc)/ (DDR) - OK
    /0x0007be00-0x0007eb97 (0x002d98)/ (DDR) - OK
    /0x0007eb98-0x0007f0af (0x000518)/ (Configuration) - OK
    /0x0007f0b0-0x0007f0fb (0x00004c)/ (Jump addresses) - OK
    /0x0007f0fc-0x0007f2a7 (0x0001ac)/ (FW Configuration) - OK
    FW image verification succeeded. Image is bootable.
    9.16 ibv_asyncwatch
    Display asynchronous events forwarded to userspace for an InfiniBand device.
    Synopsis
    ibv_asyncwatch
    Examples
    1. Display asynchronous events.
    > ibv_asyncwatch
    mlx4_0: async event FD 4
    9.17 ibdump
    Dump InfiniBand traffic that flows to and from Mellanox Technologies ConnectX family adapters' InfiniBand ports. The dump file can be loaded by the Wireshark tool for graphical traffic analysis.
    The following describ
69. (End of the smpquery port info example output; the field listing is not legible in this copy.)
    2. Query SwitchInfo by GUID.
    (Example output not legible in this copy.)
    3. Query by direct route.
    > smpquery -D nodeinfo 0
    # Node info: DR path slid 65535; dlid 65535; 0
    BaseVers: 1
    ClassVers: 1
    NodeType: Channel Adapter
    NumPorts: 2
    SystemGuid: 0x0002c9030000103b
    Guid: 0x0002c90300001038
    PortGuid: 0x0002c90300001039
    PartCap: 128
    DevId: 0x634a
    Revision: 0x000000a0
    LocalPort: 1
    VendorId: 0x0002c9
    9.13 perfquery
    Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them or simply reset them.
    Synopsis
    perfquery [-h] [-d] [-G] [-a] [-l] [-r] [-C <ca_name>] [-P <ca_port>] [-R] [-t <timeout_ms>] [-V] [<lid|guid> [[port] [reset
70. of the flow specification the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure.
    The following are the flow specific parameters:
    Table 5 - Flow Specific Parameters
    Columns: ether, tcp4/udp4, ip4
    Mandatory: dst (ether); src-ip/dst-ip
    Optional: vlan (ether); src-ip, dst-ip, src-port, dst-port, vlan (tcp4/udp4); src-ip, dst-ip, vlan (ip4)
    RFS
    RFS is an in-kernel logic responsible for load balancing between CPUs by attaching flows to the CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic by implementing the ndo_rx_flow_steer callback, which, in turn, calls the underlying flow steering mechanism with the RFS domain.
    Enabling RFS requires enabling the 'ntuple' flag via ethtool. For example, to enable ntuple for eth0, run:
    ethtool -K eth0 ntuple on
    RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steering support.
    RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.
    All of the rest
    The lowest priority domain serves the following users:
    - The mlx4 Ethernet driver attaches its unicast and multicast MAC addresses to its QP using L2 flow specifications.
    - The mlx4 driver, when it attaches its QP to its configured GIDs.
    Fragmented UDP tra
71. set up partitions with the appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
    3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.
    4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
    5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service-ID and use it while obtaining a PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE and Lifetime.
    6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection provide the CMA with the target IP and port number. ULPs might also provide a QoS-Class. The CMA then creates a Service-ID for the ULP and passes this ID and the optional QoS-Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.
    PathRecord and MultiPathRecord Enhancement for QoS
    As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service-ID, which is a 64-bit value. A new field, QoS-Class, is also provided.
    A new capability bit describes the SM QoS support in the SA class port info. This approach provides an easy migration path for existing access layer and ULPs by not in
72. support behavior.
    Table 8 - Runtime Parameters
    Parameter: Description
    fca_enable <0|1>: Disables/Enables FCA support at runtime (default: disable).
    fca_np <value>: Enables FCA support for collective operations if the number of processes in the job is greater than the fca_np value (default: 64).
    fca_verbose <level>: Sets the verbosity level for the FCA modules.
    fca_ops <+/->[op_list]: op_list is a comma-separated list of collective operations.
        fca_ops <+/->[op_list] enables/disables only the specified operations.
        fca_ops <+/-> enables/disables all operations.
        By default all operations are enabled. Allowed operation names are: barrier (br), bcast (bt), reduce (rc), allgather (ag). Each operation can also be enabled/disabled via an environment variable:
        GASNET_FCA_ENABLE_BARRIER
        GASNET_FCA_ENABLE_BCAST
        GASNET_FCA_ENABLE_REDUCE
        Note: All the operations are enabled by default.
    5.5.2.1 Enabling FCA Operations through Environment Variables in ScalableUPC
    This method can be used to control UPC FCA offload from the environment using the job scheduler srun utility. The valid values are: 1 - enable, 0 - disable.
    To enable a specific operation with shell environment variables in ScalableUPC:
    export GASNET_FCA_ENABLE_BARRIER=1
    export GASNET_FCA_ENABLE_BCAST=1
    export GASNET_FCA_ENABLE_REDUCE=1
    5.5.2.2 Controlling FCA Offload in ScalableUPC using Environment Variables
73. the last packet is sent/received before triggering an interrupt.
    ethtool -a eth<x>: Queries the pause frame settings.
    ethtool -A eth<x> [rx on|off] [tx on|off]: Sets the pause frame settings.
    ethtool -g eth<x>: Queries the ring size values.
    ethtool -G eth<x> [rx <N>] [tx <N>]: Modifies the ring sizes.
    ethtool -S eth<x>: Obtains additional device statistics.
    ethtool -t eth<x>: Performs a self diagnostics test.
    ethtool -s eth<x> msglvl [N]: Changes the current driver message level.
    Dynamically Connected Transport Service
    Dynamically Connected transport (DCT) is currently at alpha level. Please be aware that the content below is subject to change.
    Dynamically Connected transport (DCT) service is an extension to transport services to enable a higher degree of scalability while maintaining high performance for sparse traffic. Utilization of DCT reduces the total number of QPs required system wide by having RC-type QPs dynamically connect and disconnect from any remote node. DCT connections only stay connected while they are active. This results in a smaller memory footprint, less overhead to set up connections, and higher on-chip cache utilization, and hence increased performance. DCT is supported only in mlx5 and is at alpha level.
    5 HPC Features
    5.1 Shared Memo
74. the virtual interfaces at the Virtual Machines.
    To display the services provided to the Virtual Machine interfaces:
    cat /sys/class/net/eth0/eth/vifs
    Example:
    # cat /sys/class/net/eth0/eth/vifs
    SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
    In the example above, the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no VLAN tag for that interface.
    4.8.3 VLAN Configuration Over an eIPoIB Interface
    The eIPoIB driver supports VLAN Switch Tagging (VST) mode, which enables the virtual machine interface to have no VLAN tag over it, thus allowing VLAN tagging to be handled by the Hypervisor.
    To attach a Virtual Machine interface to a specific isolated tag:
    Step 1. Verify the VLAN tag to be used has the same pkey value that is already configured on that ib port.
    # cat /sys/class/infiniband/mlx4_0/ports/<ib port>/pkeys
    Step 2. Create a VLAN interface in the Hypervisor, over the eIPoIB interface.
    # vconfig add <eIPoIB interface> <vlan tag>
    Step 3. Attach the new VLAN interface to the same bridge that the virtual machine interface is already attached to.
    # brctl addif <br-name> <interface-name>
    For example, to create the VLAN tag 3 with pkey 0x8003 over that port in the eIPoIB interface eth4, run:
    # vconfig add eth4 3
    # brctl addif br2 eth4.3
    4.8.4 Setting Performance Tuning
    Use larger IPoIB RX/TX rings (dom0). Reload the IPoIB drive
75. transfer from a different PE, and remote pointers, allowing direct references to data objects owned by another PE. Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object.
    SHMEM libraries implement active messaging. The sending of data involves only one CPU, where the source processor puts the data into the memory of the destination processor. Likewise, a processor can read data from another processor's memory without interrupting the remote CPU. The remote processor is unaware that its memory has been read or written unless the programmer implements a mechanism to accomplish this.
    5.1.1 Mellanox ScalableSHMEM
    The ScalableSHMEM programming library is a one-sided communications library that supports a unique set of parallel programming features, including point-to-point and collective routines, synchronizations, atomic operations, and a shared memory paradigm used between the processes of a parallel programming application.
    Mellanox ScalableSHMEM is based on the API defined by the OpenSHMEM.org consortium. The library works with the OpenFabrics RDMA for Linux stack (OFED), and also has the ability to utilize Mellanox Messaging libraries (MXM) as well as Mellanox Fabric Collective Accelerations (FCA), providing an unprecedented level of scalability
76. value indicates a hexadecimal number.
    interface "ib1" {
    send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
    }
    In order to use the configuration file, run:
    host1# dhclient -cf dhclient.conf ib1
    4.3.3.2 Static IPoIB Configuration
    If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the '-n' option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
    - A static IPoIB configuration
    - An IPoIB configuration based on an Ethernet configuration
    See your Linux distribution documentation for additional information about configuring IP addresses.
    The following code lines are an excerpt from a sample IPoIB configuration file:
    # Static settings; all values provided by this file
    IPADDR_ib0=11.4.3.175
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on eth0; each '*' will be replaced with a corresponding octet from eth0.
    LAN_INTERFACE_ib0=eth0
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
    # Based on the first eth<n> interface that is found (for n=0,1,...); each '*' will be replaced with a corresponding octet from eth<n>.
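The Ethernet-based variant of the sample file derives the IPoIB address from an Ethernet interface by filling wildcard octets with the octets at the same position in the Ethernet address. The following is a minimal sketch of that substitution, assuming the wildcard is written as a quoted or unquoted `*` and that both addresses are IPv4 dotted quads. The function name `fill_from_eth` and the sample Ethernet address are hypothetical; the installation script's own implementation is not shown in this manual.

```python
def fill_from_eth(pattern, eth_addr):
    """Replace each '*' octet in pattern with the matching octet of eth_addr.

    Illustrative only: pattern is e.g. "11.4.'*'.'*'" and eth_addr is the
    IPv4 address already configured on the Ethernet interface.
    """
    pattern_octets = pattern.replace("'", "").split(".")
    eth_octets = eth_addr.split(".")
    if len(pattern_octets) != 4 or len(eth_octets) != 4:
        raise ValueError("expected dotted-quad IPv4 strings")
    return ".".join(
        eth if pat == "*" else pat
        for pat, eth in zip(pattern_octets, eth_octets)
    )


if __name__ == "__main__":
    # Hypothetical eth0 address; the last two octets carry over to ib0.
    print(fill_from_eth("11.4.'*'.'*'", "192.168.3.175"))
```

Octets given literally in the pattern are kept, so the IPoIB subnet stays fixed while the host part follows the Ethernet address.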
77. with multiple parallel links, routes are distributed in a round-robin fashion across such links, and so changing the order that CA ports are visited changes the distribution of routes across such links. This may be advantageous for some specific traffic patterns. The default is to visit CA ports in increasing port order on destination switches. Duplicate values in the list will be ignored.
    EXAMPLE
    # Look for a 2D (since x radix is one) 4x5 torus.
    torus 1 4 5
    # y is radix-4 torus dimension, need both
    # ym_link and yp_link configuration.
    yp_link 0x200000 0x200005 # sw @ y=0,z=0 -> sw @ y=1,z=0
    ym_link 0x200000 0x20000f # sw @ y=0,z=0 -> sw @ y=3,z=0
    # z is not radix-4 torus dimension, only need one of
    # zm_link or zp_link configuration.
    zp_link 0x200000 0x200001 # sw @ y=0,z=0 -> sw @ y=0,z=1
    next_seed
    yp_link 0x20000b 0x200010 # sw @ y=2,z=1 -> sw @ y=3,z=1
    ym_link 0x20000b 0x200006 # sw @ y=2,z=1 -> sw @ y=1,z=1
    zp_link 0x20000b 0x20000c # sw @ y=2,z=1 -> sw @ y=2,z=2
    y_dateline -2 # Move the dateline for this seed
    z_dateline -1 # back to its original position.
    # If OpenSM failover is configured, for maximum resiliency
    # one instance should run on a host attached to a switch
    # from the first seed, and another instance should run
    # on a host attached to a switch from the second seed.
    # Both instances should use this torus-2QoS.conf to ensure
    # path SL values do not change in the event of SM failover.
    # port_order defi
78. you must know the device location on the PCI bus. See Example 1 for details.
Synopsis
    mstflint [switches...] <command> [parameters...]
Output Files
Table 29 lists the various switches of the utility, and Table 30 lists its commands.
Table 29 - mstflint Switches (Switch; Affected/Relevant Commands; Description)
    -h; Print the help menu.
    -hh; Print an extended help menu.
    -d[evice] <device>; All; Specify the device to which the Flash is connected.
    -guid <GUID>; burn, sg; GUID base value. 4 GUIDs are automatically assigned to the following values: guid -> node GUID, guid+1 -> port1, guid+2 -> port2, guid+3 -> system image GUID. Note: port2 guid will be assigned even for a single port HCA; the HCA ignores this value.
    -guids <GUIDs...>; burn, sg; 4 GUIDs must be specified here. The specified GUIDs are assigned the following values, respectively: node, port1, port2 and system image GUID. Note: port2 guid must be specified even for a single port HCA; the HCA ignores this value. It can be set to 0x0.
    -mac <MAC>; burn, sg; MAC address base value. Two MACs are automatically assigned to the following values: mac -> port1, mac+1 -> po
79. 0
Step 4. Attach a virtual NIC to VM.
    ifconfig -a
    eth6 Link encap:Ethernet HWaddr 52:54:00:E7:77:99
    inet addr:13.195.15.5 Bcast:13.195.255.255 Mask:255.255.0.0
    inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link
    UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
    RX packets:481 errors:0 dropped:0 overruns:0 frame:0
    TX packets:450 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:1000
    RX bytes:22440 (21.9 KiB) TX bytes:19232 (18.7 KiB)
    Interrupt:10 Base address:0xa000
Step 5. Add the MAC 52:54:00:E7:77:99 to the /sys/class/net/eth5/fdb table on HV.
Before:
    cat /sys/class/net/eth5/fdb
    33 33 00 0002102 33 33 2 66 52 01 00 5 00 00 01 33 33 00 00 00 01
    echo 52:54:00:E7:77:99 > /sys/class/net/eth5/fdb
After:
    cat /sys/class/net/eth5/fdb
    799 33 33 00 0002102 33 33 2 2 66 52 01 00 5 00 00 01 33 33 00 00 00 01
4.13.4 Assigning a Virtual Function to a Virtual Machine
This section describes a mechanism for adding an SR-IOV VF to a Virtual Machine.
4.13.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
Step 1. Run the virt-manager.
Step 2. Double-click on the virtual machine and open its Properties.
Step 3. Go to Details -> Add hardware -> PCI host device.
80. qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15, or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.
Note that the same VL may be listed multiple times in the High or Low priority arbitration tables and, further, it can be listed in both tables. The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes.
A high_limit value of 255 indicates that the byte limit is unbounded. If the 255 value is used, low-priority VLs may be starved. A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.
Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 6
81. 8.5.3 UPDN Algorithm
8.5.4 Fat-tree Routing Algorithm
8.5.5 LASH Routing Algorithm
8.5.6 DOR Routing Algorithm
8.5.7 Torus-2QoS Routing Algorithm
8.6 Quality of Service Management in OpenSM
8.6.1 Overview
8.6.2 Advanced QoS Policy File
8.6.3 Simple QoS Policy
8.6.4 Policy File Syntax Guidelines
8.6.5 Examples of Advanced Policy
8.6.6 Simple QoS Policy - Details and Examples
8.6.7 SL2VL Mapping and VL Arbitration
8.6.8 Deployment Example
8.7 QoS Configuration Examples
8.7.1 Typical HPC Example: MPI and Lustre
8.7.2 EDC SOA (2-tier): IPoIB and
8.7.3 EDC (3-tier): IPoIB, RDS, SRP
8.8 Adaptive Routing
8.8.1 Overview
82. 4 Supplement to InfiniBand Architecture Specification Volume 1.2.1.
A new API can be used by user space applications to work with the XRC transport. The legacy API is currently supported in both binary and source modes; however, it is deprecated, so we recommend using the new API. The new verbs to be used are:
- ibv_open_xrcd / ibv_close_xrcd
- ibv_create_srq_ex
- ibv_get_srq_num
- ibv_create_qp_ex
- ibv_open_qp
Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their man pages.
4.12 Flow Steering
Flow Steering is applicable to the mlx4 driver only.
Flow steering is a new model which steers network flows, based on flow specifications, to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules can be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses an opposed terminology of a flow attribute (ibv_flow_attr), defined by a combination of specifications (struct ibv_flow_spec_*).
4.12.1 Enable/Disable Flow Steering
Flow Steering is disabled by default and regular L2 steering
83. 4 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64.
Below is an example of SL2VL and VL Arbitration configuration on a subnet:
    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is defined as a high-priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low-priority VLs with different weights, while VL4 is effectively turned off.
8.6.8 Deployment Example
Figure 5 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.
Figure 5: Example QoS Deployment on InfiniBand Subnet (diagram labels: Traffic class SDP, Service level 2, Policy min 20% BW; Traffic class Partition A, Service level 0, Policy min 40%; App A Server; Service Access Points; Traffic class SRP, Service Level 1, Policy mi
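The arithmetic behind these recommendations can be checked with a short Python sketch. It assumes the table is written as comma-separated vl:weight pairs and uses only the two constants stated in the text (64-byte credits and 4 KB high_limit units); the function names are illustrative and are not part of OpenSM.

```python
CREDIT_BYTES = 64        # one weight unit (credit), as stated in the text
HIGH_LIMIT_UNIT = 4096   # high_limit is counted in 4 KB units, as stated in the text


def parse_vlarb(table: str) -> list[tuple[int, int]]:
    """Parse 'vl:weight,vl:weight,...' into (vl, weight) pairs (assumed format)."""
    pairs = []
    for entry in table.split(","):
        vl, weight = entry.split(":")
        pairs.append((int(vl), int(weight)))
    return pairs


def credits_per_packet(mtu_bytes: int) -> int:
    """Number of 64-byte credits one MTU-sized packet consumes (rounded up)."""
    return -(-mtu_bytes // CREDIT_BYTES)


def misaligned_vls(table: str, mtu_bytes: int) -> list[int]:
    """VLs whose non-zero weight is not a whole number of MTU-sized packets."""
    per_packet = credits_per_packet(mtu_bytes)
    return [vl for vl, weight in parse_vlarb(table) if weight and weight % per_packet]


def high_limit_bytes(high_limit: int):
    """Byte budget of the high-priority table; None means unbounded (value 255)."""
    # A value of 0 yields 0 here; per the text it still allows a single packet.
    return None if high_limit == 255 else high_limit * HIGH_LIMIT_UNIT


low = "0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64"
print(credits_per_packet(4096))   # credits needed per 4 KB packet
print(misaligned_vls(low, 4096))  # VLs that break the multiple-of-64 rule
print(high_limit_bytes(6))        # burst budget for high_limit 6
```

For the example table every non-zero weight is a multiple of 64, so no VL is reported, and a high limit of 6 corresponds to 24576 bytes.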
84. 4194304
Enable low latency mode for TCP:
    sysctl -w net.ipv4.tcp_low_latency=1
7.2.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
The following changes are recommended for improving IPv6 traffic performance.
Disable the TCP timestamps option for better CPU utilization:
    sysctl -w net.ipv4.tcp_timestamps=0
Enable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=1
7.2.3 Preserving Your Performance Settings after a Reboot
To preserve your performance settings after a reboot, add them to the file /etc/sysctl.conf as follows:
    <sysctl name1>=<value1>
    <sysctl name2>=<value2>
    <sysctl name3>=<value3>
    <sysctl name4>=<value4>
For example, "Tuning the Network Adapter for Improved IPv4 Traffic Performance" on page 113 lists the following setting to disable the TCP timestamps option:
    sysctl -w net.ipv4.tcp_timestamps=0
In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:
    net.ipv4.tcp_timestamps=0
7.2.4 Tuning Power Management
Check that the output CPU frequency for each core is equal to the maximum supported and that all core frequencies are consistent.
Check the maximum supported CPU frequency:
    cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq
Check that core frequencies
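Converting a runtime setting into its persistent form is mechanical, as the following Python sketch shows. It assumes each command has the shape "sysctl -w name=value"; the helper name is hypothetical, and the sketch only produces the text of the line rather than writing to /etc/sysctl.conf.

```python
import shlex


def sysctl_cmd_to_conf_line(command: str) -> str:
    """Turn 'sysctl -w name=value' into the 'name=value' line for sysctl.conf."""
    parts = shlex.split(command)
    if len(parts) != 3 or parts[0] != "sysctl" or parts[1] != "-w" or "=" not in parts[2]:
        raise ValueError(f"unsupported command: {command!r}")
    name, value = parts[2].split("=", 1)
    return f"{name.strip()}={value.strip()}"


commands = [
    "sysctl -w net.ipv4.tcp_timestamps=0",
    "sysctl -w net.ipv4.tcp_sack=1",
    "sysctl -w net.ipv4.tcp_low_latency=1",
]
for cmd in commands:
    print(sysctl_cmd_to_conf_line(cmd))
```

The printed lines can be reviewed and then appended to /etc/sysctl.conf by an administrator.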
85. 2.3.5 Post-installation Notes
2.4 Updating Firmware After Installation
2.5 Installing MLNX_OFED using YUM
2.5.1 Setting up MLNX_OFED YUM Repository
2.5.2 Installing MLNX_OFED using the YUM Tool
2.5.3 Updating Firmware After
2.6 Uninstalling Mellanox OFED
2.7 Uninstalling Mellanox OFED using the YUM Tool
Chapter 3 Configuration Files
3.1 Persistent Naming for Network Interfaces
Chapter 4 Driver Features
4.1 SCSI RDMA Protocol
4.1.1 Overview
4.1.2 SRP Initiator
4.2 iSCSI Extensions for RDMA (iSER)
4.2.1
4.2.2
4.3 IP over InfiniBand
4.3.1 Introduction
4.3.2 IPoIB Mode Setting
4.3.3 IPoIB Configuration
4.3.4
86. 6 32 19 or higher. Otherwise, an alternative custom sysfs interface is available.
- mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
- tc_wrap.py (package: ofed-scripts) requires python >= 2.5
4.6 Time Stamping Service
Time Stamping is currently at beta level. Please be aware that everything listed here is subject to change.
Time Stamping is currently supported in ConnectX-3/ConnectX-3 Pro adapter cards only.
Time stamping is the process of keeping track of the creation of a packet. A time-stamping service supports assertions of proof that a datum existed before a particular time. Incoming packets are time-stamped before they are distributed on the PCI, depending on the congestion in the PCI buffers. Outgoing packets are time-stamped very close to placing them on the wire.
4.6.1 Enabling Time Stamping
Time stamping is off by default and should be enabled before use.
To enable time stamping for a socket:
Call setsockopt() with SO_TIMESTAMPING and with the following flags:
    SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
    SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software
    SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp as generated by the hardware
    SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or fail
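The flags above are combined into a single bit mask that is passed to setsockopt(). The following Python sketch shows the combination for the four flags named in this excerpt. The bit positions are taken from the Linux header linux/net_tstamp.h; the fallback value 37 for SO_TIMESTAMPING is an assumption that holds on common Linux ABIs, and the helper names are hypothetical.

```python
import socket

# Bit values as defined in the Linux header linux/net_tstamp.h.
SOF_TIMESTAMPING_TX_HARDWARE = 1 << 0
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SOF_TIMESTAMPING_RX_HARDWARE = 1 << 2
SOF_TIMESTAMPING_RX_SOFTWARE = 1 << 3

# Python may not export SO_TIMESTAMPING; 37 is assumed as the Linux value.
SO_TIMESTAMPING = getattr(socket, "SO_TIMESTAMPING", 37)


def timestamping_mask(tx_hw=True, tx_sw=True, rx_hw=True, rx_sw=True) -> int:
    """OR together the requested SOF_TIMESTAMPING_* flags."""
    mask = 0
    if tx_hw:
        mask |= SOF_TIMESTAMPING_TX_HARDWARE
    if tx_sw:
        mask |= SOF_TIMESTAMPING_TX_SOFTWARE
    if rx_hw:
        mask |= SOF_TIMESTAMPING_RX_HARDWARE
    if rx_sw:
        mask |= SOF_TIMESTAMPING_RX_SOFTWARE
    return mask


def enable_timestamping(sock: socket.socket, mask: int) -> None:
    """Apply the mask to an open socket (Linux only; not called in this sketch)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, mask)


print(hex(timestamping_mask()))
print(hex(timestamping_mask(tx_hw=False, rx_hw=False)))
```

Requesting the software flags alongside the hardware ones gives the fallback behavior described in the flag list.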
87. 65535; 0 port 1
    LinkState:.......................Initialize
    PhysLinkState:...................LinkUp
    LinkWidthSupported:..............1X or 4X
    LinkWidthEnabled:................1X or 4X
    LinkWidthActive:.................4X
    LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
    LinkSpeedEnabled:................5.0 Gbps (IBA extension)
    LinkSpeedActive:.................5.0 Gbps
9.11 ibroute
Uses SMPs to display the forwarding tables (unicast LinearForwardingTable or LFT, or multicast MulticastForwardingTable or MFT) for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.
Synopsis
    ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]
Output Files
Table 25 lists the various flags of the command.
Table 25 - ibportstate Flags and Options (Flag; Optional/Mandatory; Default If Not Specified; Description)
    -h(elp); Optional; Print the help menu.
    -d(ebug); Optional; Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
    -a(ll); Optional; Show all LIDs in range, including invalid entries.
    -v(erbose); Optional; Increase verbosity level. May be used several times for addition
88. 8.8.2 Installing the Adaptive Routing
8.8.3 Running Subnet Manager with Adaptive Routing Manager
8.8.4 Querying Adaptive Routing Tables
8.8.5 Adaptive Routing Manager Options File
8.9 Congestion Control
8.9.1 Congestion Control Overview
8.9.2 Running OpenSM with Congestion Control Manager
8.9.3 Configuring Congestion Control
8.9.4 Configuring Congestion Control Manager Main Settings
Chapter 9 InfiniBand Fabric Diagnostic Utilities
9.1 Overview
9.2 Utilities Usage
9.2.1 Common Configuration, Interface and Addressing
9.2.2 InfiniBand Interface Definition
9.2.3 Addressing
9.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
9.4 ibdiagnet (of ibutils) - IB Net Diagnostic
9.5 ibdiagpath - IB diagnostic path
9.6 ibv_devices
89. AN_INTERFACE_ib0=
    IPADDR_ib0=11.4.'*'.'*'
    NETMASK_ib0=255.255.0.0
    NETWORK_ib0=11.4.0.0
    BROADCAST_ib0=11.4.255.255
    ONBOOT_ib0=1
4.3.3.3 Manually Configuring IPoIB
This manual configuration persists only until the next reboot or driver restart.
To manually configure IPoIB for the default partition, perform the following steps:
Step 1. To configure the interface, enter the ifconfig command with the following items:
- The appropriate IB interface (ib0, ib1, etc.)
- The IP address that you want to assign to the interface
- The netmask keyword
- The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
    host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier (ib#) argument. The following example shows how to verify the configuration:
    host1# ifconfig ib0
    ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
    inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
    UP BROADCAST MULTICAST MTU:65520 Metric:1
    RX packets:0 errors:0 dropped:0 overruns:0 frame:0
    TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
    collisions:0 txqueuelen:128
    RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).
4.3.4 Sub
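The broadcast address reported by ifconfig in Step 2 follows directly from the address and subnet mask given in Step 1. The Python sketch below derives it with the standard ipaddress module and also assembles the Step 1 command string; the function names are illustrative, and nothing is executed on the host.

```python
import ipaddress


def derive_network_settings(ip: str, netmask: str) -> tuple[str, str]:
    """Return (network address, broadcast address) for an IPv4 address and mask."""
    iface = ipaddress.IPv4Interface(f"{ip}/{netmask}")
    return str(iface.network.network_address), str(iface.network.broadcast_address)


def ifconfig_command(ifname: str, ip: str, netmask: str) -> str:
    """Build the Step 1 command line as a string (it is not run here)."""
    return f"ifconfig {ifname} {ip} netmask {netmask}"


network, broadcast = derive_network_settings("11.4.3.175", "255.255.0.0")
print(network, broadcast)
print(ifconfig_command("ib0", "11.4.3.175", "255.255.0.0"))
```

The derived values match the NETWORK and BROADCAST entries used for the same address in the static configuration file.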
90. Burning a firmware binary image using mstflint that is already installed on your machine. Please refer to MSTFLINT_README.txt under docs/.
2. Burning a firmware image from a .mlx file using the mlxburn utility that is already installed on your machine.
The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2:
    host1# mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408/fw-25408-rel.mlx
Step 4. Reboot your machine after the firmware burning is completed.
2.5 Installing MLNX_OFED using YUM
2.5.1 Setting up MLNX_OFED YUM Repository
Step 1. Download the tarball to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.tgz. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 2. Extract the MLNX_OFED tarball package to a shared location in your network:
    # tar xzf MLNX_OFED_LINUX-<MLNX_OFED version>-rhel6.4-x86_64.tgz
Step 3. Download and install the Mellanox Technologies GPG-KEY. The key can be downloaded via the following link: http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
    # wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
    --2013-08-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
    Resolving www.mellanox.com... 72.3.194.0
    Connecting to www.mellanox
91. Datagram, except for the Connect-IB adapter card, which uses IPoIB with Connected mode as default. For better scalability and performance, we recommend using the Datagram mode. However, the mode can be changed to Connected mode by editing the file /etc/infiniband/openib.conf and setting SET_IPOIB_CM=yes. The SET_IPOIB_CM parameter is set to "auto" by default, to enable the Connected mode for the Connect-IB card and Datagram for all other ConnectX cards. After changing the mode, you need to restart the driver by running:
    /etc/init.d/openibd restart
To check the current mode used for outgoing connections, enter:
    cat /sys/class/net/ib<n>/mode
4.3.3 IPoIB Configuration
Unless you have run the installation script mlnxofedinstall with the flag -n, IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
An IPoIB configuration can be based on DHCP (Section 4.3.3.1) or on a static configuration (Section 4.3.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3.3).
4 3
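The mode check can also be scripted. The Python sketch below reads the same sysfs file as the cat command above; the function names are hypothetical, and the demo builds a temporary directory that imitates /sys/class/net so that it runs on a machine without an IPoIB interface.

```python
import tempfile
from pathlib import Path


def ipoib_mode(ifname: str, sysfs_root: str = "/sys/class/net") -> str:
    """Return the contents of <sysfs_root>/<ifname>/mode, stripped of whitespace."""
    return Path(sysfs_root, ifname, "mode").read_text().strip()


def demo_read_mode() -> str:
    """Exercise ipoib_mode() against a fake sysfs tree (illustration only)."""
    with tempfile.TemporaryDirectory() as root:
        iface_dir = Path(root, "ib0")
        iface_dir.mkdir()
        (iface_dir / "mode").write_text("datagram\n")
        return ipoib_mode("ib0", sysfs_root=root)


print(demo_read_mode())
```

On a real host, calling ipoib_mode("ib0") with the default root reads the live setting.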
92. ED Release Notes file.
Required Disk Space for Installation: 1GB
Device ID: For the latest list of device IDs, please visit the Mellanox website.
Operating System: Linux operating system. For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file.
Installer Privileges: The installation requires administrator privileges on the target machine.
2.2 Downloading Mellanox OFED
Step 1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display. The following example shows a system with an installed Mellanox HCA:
    # lspci -v | grep Mellanox
    06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
    Subsystem: Mellanox Technologies Device 0024
Step 2. Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso. You can download it from http://www.mellanox.com > Products > Software > InfiniBand Drivers.
Step 3. Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page:
    host1# md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso
2.3 Installing Mellanox OFED
The installation script, mlnxofedinstall, performs the foll
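The comparison in Step 3 can be automated. The Python sketch below computes an MD5 digest in chunks, which suits large ISO images, and compares it with an expected value; the function names are illustrative, and the expected digest for a real image must come from the download page.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a binary stream, reading it in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_matches(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with the published value (case-insensitive)."""
    with open(path, "rb") as handle:
        return md5_of_stream(handle) == expected_hex.strip().lower()


# Self-check on in-memory data rather than on a real ISO image.
print(md5_of_stream(io.BytesIO(b"hello")))
```

A mismatch indicates a corrupted or incomplete download, and the image should be fetched again before installation.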
93. ED was installed using the yum tool, then it can be uninstalled as follows:
    yum groupremove <group name> (1)
1. The <group name> must be the same group name that was previously used to install MLNX_OFED.
3 Configuration Files
For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt.
3.1 Persistent Naming for Network Interfaces
To avoid network interface renaming after boot or driver restart, use the /etc/udev/rules.d/70-persistent-net.rules file.
Example for Ethernet interfaces:
    # PCI device 0x15b3:0x1003 (mlx4_core)
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:50", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth1"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:fa:c3:51", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a1", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth3"
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:e9:56:a2", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth4"
Example for IPoIB interfaces:
    SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="32", NAME="ib0"
    S
94. Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.
    host1# cp /sbin/ifconfig /tmp/initrd_en/sbin
Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded.
The order of the following commands (for loading modules) is critical.
    echo "loading Mellanox ConnectX EN driver"
    /sbin/insmod /lib/modules/mlnx_en/mlx4_core.ko
    /sbin/insmod /lib/modules/mlnx_en/mlx4_en.ko
Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.
Step 9. Save the init file.
Step 10. Close initrd:
    host1# cd /tmp/initrd_en
    host1# find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
    host1# gzip /tmp/new_initrd_en.img
At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it properly.
A.10 iSCSI Boot
Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd (Linux). There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.
If y
95. Fabric Collective Accelerator
The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) process to the server CPUs. As a system-wide solution, FCA does not require any additional hardware. The FCA manager creates a topology-based collective tree and orchestrates an efficient collective operation using the CPUs in the servers that are part of the collective operation. FCA accelerates MPI collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime.
FCA is built on the following main principles:
- Topology-aware Orchestration: the MPI collective logical tree is matched to the physical topology. The collective logical tree is constructed to assure maximum utilization of fast inter-core communication and distribution of the results.
- Communication Isolation: collective communications are isolated from the rest of the traffic in the fabric using a private virtual network (VLane), eliminating contention with other types of traffic.
After MLNX_OFED installation, FCA can be found in the /opt/mellanox/fca folder. For further information on configuration instructions, please refer to the FCA User Manual.
5.5 ScalableUPC
Unified Parallel C (UPC) is an extension of the C programming lang
96. libibmad
Preparing...
libibmad
Preparing...
libibmad-devel
Preparing...
libibmad-devel
Preparing...
libibmad-static
Preparing...
libibmad-static
Preparing...
ibsim
Preparing...
ibacm
Preparing...
librdmacm
Preparing...
librdmacm
Preparing...
librdmacm-utils
Preparing...
librdmacm-devel
Preparing...
librdmacm-devel
Preparing...
opensm-libs
Preparing...
opensm-libs
Preparing...
opensm
Preparing...
opensm-devel
Preparing...
opensm-devel
Preparing...
opensm-static
Preparing...
opensm-static
Preparing...
97. IDs are requested from the SM. These GIDs are mapped to vHCAs as follows: vHCA number x is assigned the GID/GUID at index x of the physical GID table.
Each vHCA port presents its own virtual PKey table. The virtual PKey table presented to a VF is a mapping of selected indexes of the physical PKey table. The host admin can control which PKey indexes are mapped to which virtual indexes using a sysfs interface (see Section on page 89). The physical PKey table may contain both full and partial memberships of the same PKey, to allow different membership types in different virtual tables.
Each vHCA port has its own virtual port state. A vHCA port is up if the following conditions apply:
- The physical port is up
- The virtual GID table contains the GIDs requested by the host admin
- The SM has acknowledged the requested GIDs since the last time that the physical port went up
Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask.
To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs sub-tree has been added under the PF InfiniBand device.
4.13.7.2.1 SRIOV sysfs Administration Interfaces on the Hypervisor
Administration of GUIDs and PKeys is done via the sysfs interface in the Hypervisor (Dom0). This interface is under:
    /sys/class/infiniband/<infiniband device>/iov
Under this directory, the following subdirectories can be fo
98. INUX (un)installation script
- <RPMS folders>: Directory of binary RPMs for a specific CPU architecture
- firmware/: Directory of the Mellanox IB HCA firmware images (including Boot-over-IB)
- src/: Directory of the OFED source tarball
- mlnx_add_kernel_support.sh: Script required to rebuild MLNX_OFED_LINUX for a customized kernel version on a supported Linux distribution
- docs/: Directory of Mellanox OFED related documentation
1.3 Architecture
Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the versatility of markets that Mellanox OFED applies to.
Figure 1: Mellanox OFED Stack for ConnectX Family Adapter Cards (diagram labels: uDAPL, MPI, uverbs, rdmacm, Sockets Layer, SCSI Mid Layer, TCP/UDP/ICMP, IP, Netdevice, SRP, iSER, eIPoIB, IPoIB, verbs + CMA (ib_core), mlx4_en, mlx4_ib, mlx5_ib, mlx4_core, mlx5_core, Mellanox VPI Device HCA/NIC)
The following sub-sections briefly describe the various components of the Mellanox OFED stack.
1.3.1 mlx4 VPI Driver
mlx4 is the low-level driver implementation for the ConnectX family adapters designed by Mellanox Technologies. ConnectX family adapters can operate as an InfiniBand adapter or as an Ethernet NIC. The OFED driver supports
99. InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into the following modules:
mlx4_core
Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.
mlx4_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
mlx4_en
A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer.
1.3.2 mlx5 Driver
mlx5 is the low-level driver implementation for the Connect-IB adapters designed by Mellanox Technologies. Connect-IB operates as an InfiniBand adapter. The mlx5 driver is comprised of the following kernel modules:
mlx5_core
Acts as a library of common functions (e.g. initializing the device after reset) required by the Connect-IB adapter card.
mlx5_ib
Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
libmlx5
libmlx5 is the provider library that implements hardware-specific user space functionality. If there is no compatibility between the firmware and the driver, the driver will not load and a message will be printed in the dmesg.
The following are the libmlx5 environment variables:
e 5 FREEZE ON
100. InfiniBand connected or datagram transport service. IPoIB prepends the IP datagrams with an encapsulation header and sends the outcome over the InfiniBand transport service. The transport service is Unreliable Datagram (UD) by default, but it may also be configured to be Reliable Connected (RC). The interface supports unicast, multicast and broadcast. For details, see Chapter 4.3, "IP over InfiniBand".
iSER
iSCSI Extensions for RDMA (iSER) extends the iSCSI protocol to RDMA. It permits data to be transferred directly into and out of SCSI buffers without intermediate data copies. For further information, please refer to Chapter 4.2, "iSCSI Extensions for RDMA (iSER)".
SRP
SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an I/O controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an I/O unit and provides storage services. See Chapter 4.1, "SCSI RDMA Protocol" and Appendix B, "SRP Target Driver".
uDAPL
User Direct Access Programming Library (uDAPL) is a standard API that pr
101. MAGE
Mellanox Technologies, 350 Oakmead Parkway Suite 100, Sunnyvale, CA 94085, U.S.A. www.mellanox.com Tel: (408) 970-3400 Fax: (408) 970-3403
Mellanox Technologies, Ltd., Beit Mellanox, PO Box 586, Yokneam 20692, Israel. www.mellanox.com Tel: +972 (0)74 723 7200 Fax: +972 (0)4 959 3245
Copyright 2013. Mellanox Technologies. All Rights Reserved.
Mellanox, Mellanox logo, BridgeX, ConnectX, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, MLNX-OS, PhyX, SwitchX, UFM, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd. Connect-IB, ExtendX, FabricIT, Mellanox Open Ethernet, Mellanox Virtual Modular Switch, MetroX, MetroDX, ScalableHPC and Unbreakable-Link are trademarks of Mellanox Technologies, Ltd. All other trademarks are property of their respective owners.
Document Number: 2877
Table of Contents
Table of Contents
List of Figures
List of Tables
Chapter 1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
1.2 Mellanox OFED Package
1.2.1
1.2.2 Software Components
1.2.3
102. Mellanox TECHNOLOGIES
Mellanox OFED for Linux User Manual
Rev 2.0-3.0.0
Last Updated: 03 October 2013
www.mellanox.com
NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DA
103. Options
    -n <[src-name,]dst-name>: Names of the source and destination ports (as defined in the topology file; source may be omitted, in which case the local port is assumed to be the source)
    -l <[src-lid,]dst-lid>: Source and destination LIDs (source may be omitted, in which case the local port is assumed to be the source)
    -d <p1,p2,p3,...>: Directed route from the local node (which is the source) and the destination node
    -c <count>: The minimal number of packets to be sent across each link (default = 100)
    -v: Enable verbose mode
    -t <topo-file>: Specifies the topology file name
    -s <sys-name>: Specifies the local system name. Meaningful only if a topology file is specified
    -i <dev-index>: Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
    -p <port-num>: Specifies the local device's port number used to connect to the IB fabric
    -o <out-dir>: Specifies the directory where the output files will be placed (default = /tmp)
    -lw <1x|4x|12x>: Specifies the expected link width
    -ls <2.5|5|10>: Specifies the expected link speed
    -pm: Dump all the fabric links' pm Counters into ibdiagnet.pm
    -pc: Reset all the fabric links' pmCounters
    -P <PM=<Trash>>: If any of the provided pm is greater than its provided value, print it to screen
104. QoS Levels Application traffic IPoIB UD and CM and SDP Isolated from storage Min BW of 50 SRP Min BW 50 Bottleneck at storage nodes Administration OpenSM QoS policy file In the following policy file example replace SRPT with the real SRP Target port GUIDs qos ulps qefault 0 ipoib adl sdp 3l srp target port guid SRPT1 SRPT2 SRPT3 2 160 Mellanox Technologies Rev 2 0 3 0 0 end qos ulps OpenSM options file qos max vls 8 qos high limit 0 qos vlarb high 1 32 2 32 qos vlarb low 0 1 Gio O 1 2 6 D 6 7 105 15 195 159 15 15 15 18 8 7 3 EDC 3 tier IPoIB RDS SRP The following is an example of QoS configuration for an enterprise data center EDC with IPoIB carrying all application traffic RDS for database traffic and SRP used for storage QoS Levels Management traffic ssh IPoIB management VLAN partition A Min BW 10 Application traffic IPoIB application VLAN partition B Isolated from storage and database Min BW of 30 Database Cluster traffic RDS Min BW of 30 SRP Min BW 30 Bottleneck at storage nodes Administration OpenSM Qos policy file In the following policy file example replace SRPT with the real SRP Initiator port GUIDs aie qos ulps default ipoib pkey 0x8001 rds 0 i ipoib pkey 0x8002 3 srp target port guid SRPT1 SRPT2 SRPT3 4 Mellanox Technologies 161 Rev 2 0
105. R will abort and will allow OpenSM to proceed. To do so, set the following parameters: max_errors and error_window. The values are: max_errors = 0: zero tolerance, abort configuration on first error; error_window = 0: mechanism disabled, no error checking; error_window: [0-48K]. The default is: 5.
     8.9.4.1 Congestion Control Manager Options File
     Table 15 - Congestion Control Manager General Options File (Option | Description | Values)
     enable | Enables/disables the Congestion Control mechanism on the fabric nodes. | Values: <TRUE | FALSE>. Default: True
     num_hosts | Indicates the number of nodes. The CC table values are calculated based on this number. | Values: [0-48K]. Default: 0 (base on the CCT calculation on the current subnet size)
     Table 16 - Congestion Control Manager Switch Options File (Option | Description | Values)
     threshold | Indicates how aggressive the congestion marking should be. | [0-0xf]: 0 = no packet marking, 0xf = very aggressive. Default: 0xf
     marking_rate | The mean number of packets between marking eligible packets with a FECN. | Values: [0-0xffff]. Default: 0xa
     packet_size | Any packet less than this size [bytes] will not be marked with FECN. | Values: [0-0x3fc0]. Default: 0x200
     Table 17 - Congestion Control Manager CA Options File (Option | Description | Values)
     port_control | Specifies the Congestion Control attribute for this port. | Values: 0
106. RP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well. Operation: When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA). When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path). Manual Activation of High Availability. Initialization (execute after each boot of the driver): 1. Execute modprobe dm-multipath. 2. Execute modprobe ib_srp. 3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above. 4. Execute for each port and each HCA: srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>. This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log. Now it is possible to access the SRP L
107. The list of network interfaces is available via the ifstat command. Example: iPXE> ifclose net1. A.8.3.4 autoboot: Starts the boot process from the device(s). A.8.3.5 sanboot: Starts the boot process of an iSCSI target. Example: iPXE> sanboot iscsi:11.4.3.7::::iqn.2007-08.7.3.4.11:iscsiboot. A.8.3.6 echo: Echoes an environment variable. Example: iPXE> echo ${root-path}. A.8.3.7 dhcp: Attempts to open the network interface and then tries to connect to and communicate with the DHCP server to obtain the IP address and filepath from which the boot will occur. Example: iPXE> dhcp net1. A.8.3.8 help: Displays the list of available commands. A.8.3.9 exit: Exits from the command line interface. A.9 Diskless Machines: Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver. This can be achieved by adding the device driver module into the initrd image and loading it. The initrd image of some Linux distributions, such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux, cannot be edited prior to or during the installation process. If you need to install Linux distributions over FlexBoot, please replace your initrd images with the images found at www.mellanox.com > Products > Adapter IB/VPI SW
108. UBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x1", ATTR{type}=="32", NAME="ib1". 4 Driver Features. 4.1 SCSI RDMA Protocol. 4.1.1 Overview: As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol off-load and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services. Section 4.1.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target. 4.1.2 SRP Initiator: This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from http://www.t10.org. The SRP Initiator supports: Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf); Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf); Basic functionality, task management and limited error handling. 4.1.2.1 Loading SRP Initiator: To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or ch
109. UNs on /dev/mapper/. It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior. It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the "blacklist" section in /etc/multipath.conf and make sure the SRP LUNs are not black-listed. Automatic Activation of High Availability: Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart. From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/. It is possible that regular (not SRP) LUNs may also be present; the SRP LUNs may be identified by their name. It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log. 4.1.2.7 Shutting Down SRP: SRP can be shut down by using "rmmod ib_srp", or by stopping the OFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown. Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases: 1. Without High Availability: When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting
110. -v: This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity. -V: This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to "-D 0xFF -d 2". See the -D option for more information about log verbosity. -D <flags>: This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:
     BIT | LOG LEVEL ENABLED
     0x01 | ERROR (error messages)
     0x02 | INFO (basic messages, low volume)
     0x04 | VERBOSE (interesting stuff, moderate volume)
     0x08 | DEBUG (diagnostic, high volume)
     0x10 | FUNCS (function entry/exit, very high volume)
     0x20 | FRAMES (dumps all SMP and GMP frames)
     0x40 | ROUTING (dump FDB routing information)
     0x80 | currently unused
     Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option. --debug, -d <number>: This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows: OPT | Description: -d0 | Ignore other SM nodes; -d1 | Force single threaded dispatchi
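The -D bit table above can be worked through numerically. The following is a minimal Python sketch, not part of OpenSM: the helper names `compose_flags` and `decode_flags` are hypothetical, and only the bit values are taken from the table. It shows why the default ERROR + INFO corresponds to 0x3.

```python
# Bit values copied from the -D table; the helper names are illustrative only.
LOG_LEVELS = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
}


def compose_flags(*names):
    """OR together the bits of the named log levels."""
    flags = 0
    for name in names:
        flags |= LOG_LEVELS[name]
    return flags


def decode_flags(flags):
    """List the log levels enabled by a flags value, in table order."""
    return [name for name, bit in LOG_LEVELS.items() if flags & bit]


if __name__ == "__main__":
    print(hex(compose_flags("ERROR", "INFO")))  # 0x3, the documented default
    print(decode_flags(0xFF))                   # every defined level
```

A value computed this way would be passed on the command line as the argument of -D, for example -D 0x3.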
111. VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]: [Diagram: 2D 6x5 torus, x = 0 to 5, with labeled switches S, n, T, r, D, m, o and p]. For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D. Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues, reports if they are found, and refuses to route such fabrics. Note that in the case where there are multiple parallel links between a pair of switches, torus-2QoS will allocate route
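The "disjoint segments" condition in the ring example above can be checked mechanically. The sketch below is an illustration in Python and not the OpenSM implementation: the function name `ring_segments` and the list representation of the ring are assumptions made for the example. It splits the 1D ring m-S-n-T-o-p-m at its failed links and reports the resulting segments.

```python
def ring_segments(ring, failed_links):
    """Split a 1D ring (list of switch names in ring order) at its failed links.

    Returns the list of connected segments. A ring with zero or one failed
    link stays in one piece; two or more failures produce disjoint segments.
    """
    n = len(ring)
    failed = {frozenset(link) for link in failed_links}
    # Index i stands for the link between ring[i] and ring[(i + 1) % n].
    breaks = [i for i in range(n)
              if frozenset((ring[i], ring[(i + 1) % n])) in failed]
    if not breaks:
        return [list(ring)]
    segments = []
    for k, b in enumerate(breaks):
        start = (b + 1) % n
        end = breaks[(k + 1) % len(breaks)]  # last switch before the next break
        segment, i = [], start
        while True:
            segment.append(ring[i])
            if i == end:
                break
            i = (i + 1) % n
        segments.append(segment)
    return segments


if __name__ == "__main__":
    ring = ["m", "S", "n", "T", "o", "p"]
    # One failure: the ring is still a single (open) segment, so it is routable.
    print(ring_segments(ring, [("S", "n")]))
    # Two failures: segments T and o-p-m-S-n, as in the text above.
    print(ring_segments(ring, [("n", "T"), ("T", "o")]))
```

Under this model, a ring would be treated as unroutable whenever the function returns more than one segment.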
112. _64 ORE Verse Wai aaa MLNX OFED LINUX 2 0 2 0 0 OFED 2 0 2 0 0 2 6 32 279 e16 x86 64 Host PT PASS Firmware Oil enm ONCE ME v2 9 1000 Firmware Check on CA 40 NIC PASS ton CAN sell WHC E YA doll Firmware Check on CA 1 NIC PASS DIY WME oo gen don PASS Number CAWPoOIESEACIU CH 4 Port State of Port 1 on CA 0 NIC UP 1X QDR Ethernet Port State of Port 2 on CA 0 NIC UP 1X QDR Ethernet Port State of Port 1 on CA 1 NIC UP 1X QDR Ethernet Port State of Port 2 on CA 1 NIC UP 1X QDR Ethernet Error Counter Check on CA 0 NIC NA Eth ports Error Counter Check on CA 1 NIC NA Eth ports Kernell syslog Check cr n EUM PASS GUID Ca CA 0 UO aaa 00 02 c9 03 00 07 4 8 NOderGUTDEONE CAEN UNT Emaan 00 02 c9 03 00 35 d c0 SES rss DON A L After the installer completes information about the Mellanox OFED installation such as prefix kernel version and installation parameters can be retrieved by running the com Ad mand etc infiniband info 2 3 4 Installation Results Software Most of MLNX OFED packages are installed under the usr directory except for the following packages which are installed under the opt directory openshmem fca and ibutils The kernel modules are installed under lib modules uname r updates on SLES and Fedora Dis
113. _mask Output Files. Table 27 lists the various flags of the command.
     Table 27 - perfquery Flags and Options (Flag | Optional / Mandatory | Description)
     -h(help) | Optional | Print the help menu.
     -d(ebug) | Optional | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
     -G(uid) | Optional | Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
     -a | Optional | Apply query to all ports.
     -l | Optional | Loop ports.
     -r | Optional | Reset the counters after reading them.
     -C <ca_name> | Optional | Use the specified channel adapter or router.
     -P <ca_port> | Optional | Use the specified port.
     -R | Optional | Reset the counters.
     -t <timeout_ms> | Optional | Override the default timeout for the solicited MADs [msec].
     -V(ersion) | Optional | Show version info.
     <lid | guid> [[port] [reset_mask]] | Optional | LID or GUID.
     Examples: perfquery -r 32 1; perfquery -e -r 32 1; perfquery -R 0x20 1; perfquery -e -R 0x20 1; perfquery -R -a 32; perfquery -R 32 2 0x0fff; perfquery -R 32 2 0xf000. The first reads performance counters and resets them; the second reads extended performance counters and resets them.
114. _qos tool or by the lldpad daemon if DCBX is used. When using Raw Ethernet QP mapping, the TOS/sk_prio to UP mapping is lost. Performing the Raw Ethernet mapping forces the QP to transmit using the given UP. If packets with a VLAN tag are transmitted, the UP in the VLAN tag will be overwritten with the given UP. 4.5.6 Map Priorities with tc_wrap.py/mlnx_qos: A network flow that can be managed by QoS attributes is described by a User Priority (UP). A user's sk_prio is mapped to a UP, which in turn is mapped into a TC. Indicating the UP: When the user uses sk_prio, it is mapped into a UP by the 'tc' tool. This is done by the tc_wrap.py tool, which gets a list of <= 16 comma separated UPs and maps each sk_prio to the specified UP. For example, tc_wrap.py -i eth0 -u 1,5 maps sk_prio 0 of the eth0 device to UP 1 and sk_prio 1 to UP 5. Setting set_egress_map in VLAN maps the skb_priority of the VLAN to a vlan_qos. The vlan_qos represents a UP for the VLAN device. In RoCE, rdma_set_option with RDMA_OPTION_ID_TOS could be used to set the UP. When creating QPs, the sl field in the ibv_modify_qp command represents the UP. Indicating the TC: After mapping the skb_priority to a UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that
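The positional mapping performed for the -u list in the tc_wrap.py example above (sk_prio 0 to UP 1, sk_prio 1 to UP 5) can be illustrated as follows. This is a Python sketch of the mapping rule only, not the tc_wrap.py source: the function name `map_sk_prio_to_up` is hypothetical, and the 0 to 7 range check assumes that a UP is a 3-bit 802.1p priority.

```python
def map_sk_prio_to_up(up_list):
    """Map sk_prio index i to the i-th UP in a comma separated list.

    Mirrors the documented rule: at most 16 entries, the position in the
    list is the sk_prio. The 0..7 bound on UP values is an assumption
    (3-bit 802.1p priority), not taken from the manual.
    """
    ups = [int(item) for item in up_list.split(",")]
    if len(ups) > 16:
        raise ValueError("at most 16 UP values are accepted")
    for up in ups:
        if not 0 <= up <= 7:
            raise ValueError(f"UP {up} is outside the assumed 0..7 range")
    return dict(enumerate(ups))


if __name__ == "__main__":
    # Equivalent of the list given to "-u 1,5" in the example above.
    print(map_sk_prio_to_up("1,5"))
```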
115. a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned ports is selected. If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port: a. Use only ports which have the same MinHop. b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC group. c. If none, prefer those which go through another NodeGuid. d. Fall back to the number of paths method (if all go to the same node). 8.5.1 Effect of Topology Changes: OpenSM will preserve existing routing in any case where there is no change in the fabric switches unless the -r (--reassign_lids) option is specified. -r, --reassign_lids: This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID. If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked. In the case of using th
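The per-port counter rule described above (among ports with the same MinHop, take the least used one) can be sketched in a few lines. The following Python fragment is an illustration under stated assumptions and not OpenSM code: the function name `pick_port`, the dictionary inputs, and the tie-break on the lower port number are all hypothetical, and the LMC > 0 checks (a to d) are not modeled.

```python
def pick_port(min_hops, port_load):
    """Choose the output port for one target LID.

    min_hops:  dict port -> hop count from this switch to the target LID.
    port_load: dict port -> number of target LIDs already routed through it;
               updated in place for the chosen port.
    Ties on load are broken by the lower port number (an assumption).
    """
    best = min(min_hops.values())
    candidates = [port for port, hops in min_hops.items() if hops == best]
    chosen = min(candidates, key=lambda port: (port_load.get(port, 0), port))
    port_load[chosen] = port_load.get(chosen, 0) + 1
    return chosen


if __name__ == "__main__":
    load = {1: 2, 2: 0, 3: 0}
    # Ports 1 and 2 are both at MinHop 2; port 3 is farther away and ignored.
    print(pick_port({1: 2, 2: 2, 3: 3}, load))
    print(load)
```

Calling the function once per target LID spreads LIDs across equal-cost ports, which is the balancing effect the text describes.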
116. achine in an IB subnet. By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly. If a fatal, non-recoverable error occurs, opensm exits. Running OpenSM As Daemon: OpenSM can also run as a daemon. To run OpenSM in this mode, enter: host1# /etc/init.d/opensmd start. osmtest Description: osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it to a pre-saved one. See Section 8.3.2. osmtest has the following test flows: Multicast Compliancy test; Event Forwarding test; Service Record registration test; RMPP stress test; Small SA Queries stress test. 8.3.1 Syntax: osmtest [OPTIONS], where OPTIONS are: -f, --flow: This option directs osmtest to run a specific flow. Flow | Description: c = create an inventory file with all nodes, ports and paths
117. ailable. In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies; it is topology agnostic and fares well in the face of faults. It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path. The algorithm was developed by Simula Research Laboratory. Use the '-R lash -Q' option to activate the LASH algorithm. QoS support has to be turned on in order that SL/VL mappings are used. LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead. For open regular cartesian meshes, the DOR algorithm is the ideal routing algorithm. For toroidal meshes, on the other hand, there are routing loops that can cause deadlocks. LASH can be used to route these cases. The performance of LASH can be improved by preconditioning the mesh in cases where there are multiple links connecting switches and also in cases where the switches are not cabled consistently. To invoke this, use '-R lash -Q --do_mesh_analysis'. This will add an additional phase that analyses the mesh to try to determine the dimension and size of a mesh. If it determines that the mesh looks like an open or closed cartesian mesh, it reorders the ports in dimension order
118. ailable MXM parameters and their default values run the opt mellanox mxm bin mxm dump config utility which is part of the MXM RPM MXM parameters can be modified in one of the following methods Modifying the default MXM parameters value as part of the mpirun mpirun x MXM UD RX MAX BUFFERS 128000 lt gt Modifying the default MXM parameters value from SHELL export MXM UD RX MAX BUFFERS 128000 mpirun lt gt 104 Mellanox Technologies Rev 2 0 3 0 0 5 3 4 Configuring Multi Rail Support Multi Rail support enables the user to use more than one of the active ports on the card by mak ing a better use of the resources It provides a combined throughput among the used ports gt To configure dual rail support Specify the list of ports you would like to use to enable multi rail support MXM PORTS cardName portNum mpirun x PORTS mlx4 0 1 mlx4 0 2 lt gt 5 3 5 Configuring MXM over the Ethernet Fabric To configure MXM over the Ethernet fabric Step 1 Make sure the Ethernet port is active ibv devinfo ibv devinfo displays the list of cards and ports in the system Please make sure in the 16 devinfo output that the desired port has Ethernet at the 1ink layer field and that Aa its state 15 PORT ACTIVE 2 Specify the ports you would like to use if there is a non Ethernet active port in the card mpirun x PORTS mlx4 0 1 lt gt 5 4
119. airs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock. LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between switch and HCA. In detail, the algorithm works as follows: 1. LASH determines the shortest path between all pairs of source/destination switches. Note that LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST. 2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process. 3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out. Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers av
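Step 2 above (assign a route to a layer only if its channel dependency graph stays acyclic) can be sketched as follows. This Python fragment is a simplified illustration, not the opensm LASH implementation: the names `has_cycle`, `route_dependencies` and `assign_layers` are hypothetical, routes are plain lists of switch names, and the SRC/DST pairing of step 1 and the balancing of step 3 are left out.

```python
def has_cycle(edges):
    """Depth-first search for a cycle in a directed graph given as (a, b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
    state = {}  # node -> 1 while on the DFS stack, 2 when fully explored

    def visit(node):
        state[node] = 1
        for nxt in graph.get(node, ()):
            if state.get(nxt) == 1:
                return True
            if nxt not in state and visit(nxt):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in list(graph))


def route_dependencies(route):
    """Channel dependencies of a route: each link depends on the next link."""
    channels = list(zip(route, route[1:]))
    return set(zip(channels, channels[1:]))


def assign_layers(routes):
    """Place each route in the first layer whose dependency graph stays acyclic."""
    layers = []      # one set of dependency edges per layer (SL)
    assignment = []  # layer index chosen for each route, in input order
    for route in routes:
        deps = route_dependencies(route)
        for index, layer in enumerate(layers):
            if not has_cycle(layer | deps):
                layer |= deps
                assignment.append(index)
                break
        else:
            layers.append(set(deps))  # open a new layer
            assignment.append(len(layers) - 1)
    return assignment


if __name__ == "__main__":
    # Four two-hop clockwise routes on a ring A-B-C-D; the last one would
    # close a dependency cycle in layer 0, so it is moved to layer 1.
    routes = [list("ABC"), list("BCD"), list("CDA"), list("DAB")]
    print(assign_layers(routes))
```

The ring example is the classic case where a single layer deadlocks; the toroidal meshes mentioned in the text have the same structure in each dimension.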
120. al verbosity (-vvv or -v -v -v).
     -V(ersion) | Optional | Show version info.
     -a(ll) | Optional | Show all LIDs in range, including invalid entries.
     -n(o_dests) | Optional | Do not try to resolve destinations.
     -D(irect) | Optional | Use directed path address arguments. The path is a comma separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...).
     -G(uid) | Optional | Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
     -M(ulticast) | Optional | Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
     -s <smlid> | Optional | Use <smlid> as the target LID for SM/SA queries.
     -C <ca_name> | Optional | Use the specified channel adapter or router.
     -P <ca_port> | Optional | Use the specified port.
     Table 25 - ibportstate Flags and Options (continued)
     -t <timeout_ms> | Optional | Override the default timeout for the solicited MADs [msec].
     <dest dr_path | lid | guid> | Optional | Destination's directed path, LID, or GUID.
     <startlid> | Optional | Starting LID in an MLID range.
     <endlid> | Optional | Ending LID in an MLID range.
     Examples: 1. Dump all Lids with valid out ports of the switch with Lid 2: > ibroute 2 Unicast l
121. all script with the '--without-fw-update' option and now you wish to manually update firmware on your adapter card(s), you need to perform the following steps. If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image" on page 205. The following steps are also appropriate in case you wish to burn newer firmware that you have downloaded from the Mellanox Technologies Web site (http://www.mellanox.com > Downloads > Firmware). Step 1. Start mst: host1# mst start. Step 2. Identify your target InfiniBand device for firmware update. 1. Get the list of InfiniBand device names on your machine: host1# mst status
     MST modules:
         MST PCI module loaded
         MST PCI configuration module loaded
         MST Calibre (I2C) module is not loaded
     MST devices:
         /dev/mst/mt25418_pciconf0 - PCI configuration cycles access. bus:dev.fn=02:00.0 addr.reg=88 data.reg=92. Chip revision is: A0
         /dev/mst/mt25418_pci_cr0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdef00000 size=0x100000. Chip revision is: A0
         /dev/mst/mt25418_pci_msix0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdeefe000 size=0x2000
         /dev/mst/mt25418_pci_uar0 - PCI direct access. bus:dev.fn=02:00.0 bar=0xdc800000 size=0x800000
     2. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0. Step 3. Burn firmware. 1
122. ance. 7.2.6.3.1 Running an Application on a Certain Node: In order to run an application on a certain NUMA node, the process affinity should be set either in the command line or with an external tool. For example, if the adapter's NUMA node is 1 and NUMA 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only. To run an application, run one of the following commands: taskset -c 8-15 ib_write_bw, or: taskset 0xff00 ib_write_bw -a. 7.2.7 IRQ Affinity: The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off. The following command turns off the IRQ balancer: > /etc/init.d/irqbalance stop. The following command assigns the affinity of a single interrupt vector: > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity. Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not. 7.2.7.1 IRQ Affinity Configuration: It is recommended to set each IRQ to a different core. For Sandy Bridge or AMD systems, set the irq affinity to the adapter's NUMA node. For optimizing single-port t
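The hexadecimal masks used by taskset and smp_affinity above follow one rule: bit i is set when core i is in the set. As a worked example, the Python sketch below (the helper name `cores_to_mask` is hypothetical) derives the 0xff00 mask for cores 8-15 that appears in the taskset command.

```python
def cores_to_mask(cores):
    """Build an affinity bit mask: bit i is set when core i is in `cores`."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask


if __name__ == "__main__":
    # Cores 8..15, as in the NUMA node 1 example above.
    print(hex(cores_to_mask(range(8, 16))))  # 0xff00
    # A single-core mask, of the kind written to smp_affinity for one IRQ.
    print(hex(cores_to_mask([3])))           # 0x8
```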
123. and set the CC manager main settings, perform the following: To enable/disable the Congestion Control mechanism on the fabric nodes, set the following parameter: enable. The values are: <TRUE | FALSE>. The default is: True. The CC manager configures CC mechanism behavior based on the fabric size. The larger the fabric is, the more aggressive the CC mechanism is in its response to congestion. To manually modify CC manager behavior by providing it with an arbitrary fabric size, set the following parameter: num_hosts. The values are: [0-48K]. The default is: 0 (base on the CCT calculation on the current subnet size). The smaller the value of the parameter, the faster HCAs will respond to the congestion and will throttle the traffic. Note that if the number is too low, it will result in suboptimal bandwidth. To change the mean number of packets between marking eligible packets with a FECN, set the following parameter: marking_rate. The values are: [0-0xffff]. The default is: 0xa. You can set the minimal packet size that can be marked with FECN. Any packet less than this size [bytes] will not be marked with FECN. To do so, set the following parameter: packet_size. The values are: [0-0x3fc0]. The default is: 0x200. When the number of errors exceeds 'max_errors' of send/receive errors or timeouts in less than 'error_window' seconds, the CC MG
124. ange the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes". For the changes to take effect, run: /etc/init.d/openibd restart. When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12). 4.1.2.2 Manually Establishing an SRP Connection: The following steps describe how to manually establish an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically. Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running. To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command: echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target. See Section 4.1.2.3 for instructions on how the parameters in this echo command may be obtained. Notes: Execution of the above "echo" command may take some time. The SM must be running while the command executes. It is possible to include additional parameters in the echo command: max_cmd_per_lun (Default: 63); max_sect (short for max_sectors), which sets the request size of a command; io_class
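The comma separated key=value string written to add_target above is easy to get wrong by hand. The following Python sketch only assembles the string and the sysfs path from the fields named in the echo command; it does not write anything. The function names are hypothetical, and the GUID, GID and service_id values in the usage lines are made-up placeholders, not values from a real fabric.

```python
def srp_target_string(id_ext, ioc_guid, dgid, service_id, pkey="ffff"):
    """Assemble the add_target string in the field order shown in the manual."""
    return (f"id_ext={id_ext},ioc_guid={ioc_guid},dgid={dgid},"
            f"pkey={pkey},service_id={service_id}")


def add_target_path(hca_number, port_number, driver="mthca"):
    """Build the sysfs path srp-<driver><hca number>-<port number>/add_target."""
    return (f"/sys/class/infiniband_srp/"
            f"srp-{driver}{hca_number}-{port_number}/add_target")


if __name__ == "__main__":
    # Placeholder identifiers for illustration only.
    target = srp_target_string(
        id_ext="0002c90300a0b0c0",
        ioc_guid="0002c90300a0b0c0",
        dgid="fe800000000000000002c90300a0b0c1",
        service_id="0002c90300a0b0c0",
    )
    print(target)
    print(add_target_path(0, 1))
```

The resulting string would be written to the resulting path without a trailing newline, which is what the -n flag of echo achieves in the manual's command.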
125. are consistent: cat /proc/cpuinfo | grep "cpu MHz". Check that the output frequencies are the same as the maximum supported. If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in section "Recommended BIOS Settings" on page 110 to verify that the power state is disabled. Check the current CPU frequency to verify whether it is configured to the maximum available frequency: cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq. 7.2.4.1 Setting the Scaling Governor: If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance: freq_table; acpi_cpufreq (this module is architecture dependent). It is also recommended to disable the module cpuspeed; this module is also architecture dependent. To set the scaling mode to performance, use: echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor. To disable cpuspeed, use: service cpuspeed stop. 7.2.4.2 Kernel Idle Loop Tuning: The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wake up time but may result in higher power consumption. To tune the kernel idle loop, set the following options in the /etc/modprobe.d/mlx4.conf file: For MLNX_OFED 2.0.x: options mlx4_core enable_sys_tune=1. For MLNX_EN 1.5.10
126. at is UP physical link state is LinkUp Examples 186 Mellanox Technologies InfiniBand Fabric Diagnostic Utilities Rev 2 0 3 0 0 1 Query the status of Port 1 of CA mlx4_0 using ibstatus and use its output the LID 3 in this case to obtain additional link information using ibportstate gt ibstatus mlx4 0 1 Infiniband device mlx4 0 port 1 status default gid e80 0000 0000 0000 0000 0000 9289 3895 base lid 0x3 sm lid 0x3 State 23 phys state 5 LinkUp rate 20 Gb sec 4X DDR gt ibportstate C mlx4 0 3 1 query PortInfo Port info Lid 3 port 1 ee Es eer Initialize DhyshpsnkSbate e e rarer LinkUp bali poc SING GIONE amana 559502355 1X or 4X nol 1X or 4X y apasun AX LinkSpeedSupported 2 5 Gbps or 5 0 Gbps mainan 2 5 Gbps or 5 0 Gbps IinnkSpeedAGEmvOrd m sss 5 0 Gbps 2 Query the status of two channel adapters using directed paths gt ibportstate C mlx4 0 D 0 1 PortInfo Port info DR path slid 65535 dlid 65535 0 port 1 ND eR Initialize Phys lim cS LinkUp Miley NS IOI 1X or 4X iq req EES LS CS 1X or 4X Tapi CHAC annua 4X LinkSpeedSupported 2 5 Gbps or 5 0 Gbps 2 5 Gbps or 5 0 Gbps Mig
127. at recognizes an application that is pinned to a remote NUMA node and activates a flow that improves the out-of-the-box latency and throughput. However, the NUMA node recognition must be enabled as described in section "Tuning for Intel Sandy Bridge Platform" on page 116. In systems which do not support SLIT, the following environment variable should be applied: MLX4_LOCAL_CPUS=0x[bit mask of local NUMA node]. Example for a local NUMA node whose cores are 0-7: MLX4_LOCAL_CPUS=0xff. Additional modification can be applied to this feature by changing the following environment variable: MLX4_STALL_NUM_LOOP=[integer] (default: 400). The default value is optimized for most applications. However, several applications might benefit from increasing/decreasing this value. 7.2.6.2 Tuning for AMD Architecture: On AMD architecture, there is a difference between a 2 socket system and a 4 socket system. With a 2 socket system, the PCIe adapter will be connected to socket 0 (nodes 0,1). With a 4 socket system, the PCIe adapter will be connected either to socket 0 (nodes 0,1) or to socket 3 (nodes 6,7). 7.2.6.3 Recognizing NUMA Node Cores: To recognize NUMA node cores, run the following command: cat /sys/devices/system/node/node[X]/cpulist | cpumap. Example: cat /sys/devices/system/node/node1/cpulist prints 1,3,5,7,9,11,13,15; cat /sys/devices/system/node/node1/cpumap prints 0000aaaa.
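The cpulist and cpumap outputs in the example above describe the same set of cores in two notations. As a worked check, the Python sketch below (the helper name `cpumap_to_cores` is hypothetical) decodes the hexadecimal cpumap 0000aaaa into core numbers. Stripping commas is an allowance for the comma separated 32-bit groups that sysfs prints on machines with many CPUs.

```python
def cpumap_to_cores(cpumap):
    """Decode a hexadecimal cpumap string into a sorted list of core numbers."""
    value = int(cpumap.replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    # 0xaaaa has every odd bit from 1 to 15 set.
    print(cpumap_to_cores("0000aaaa"))  # [1, 3, 5, 7, 9, 11, 13, 15]
    # Cores 0..7, matching the MLX4_LOCAL_CPUS=0xff example.
    print(cpumap_to_cores("ff"))
```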
128. at will join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric). To execute SRP daemon as a daemon, you may run run_srp_daemon (found under /usr/sbin/), providing it with the same options used for running srp_daemon. Make sure only one instance of run_srp_daemon runs per port. To execute SRP daemon as a daemon on all the ports, run srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log. It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 4.1.2.6). For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart. 4.1.2.5 Multiple Connections from Initiator IB Port to the Target: Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA. In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection comma
129. ated to support the AR, the AR Manager will need to be restarted (by restarting the Subnet Manager) to allow it to configure the AR on this switch. This option can be changed on-the-fly. Default: true
AR_MODE <bounded|free>: Adaptive Routing Mode. free: no constraints on output port selection. bounded: the switch does not change the output port during the same transmission burst. This mode minimizes the appearance of out-of-order packets. This option can be changed on-the-fly. Default: bounded
AGEING_TIME <usec>: Applicable to bounded AR mode only. Specifies how long there must be no traffic before the switch declares a transmission burst finished and allows changing the output port for the next transmission burst (32-bit value). This option can be changed on-the-fly. Default: 30
MAX_ERRORS <N>, ERROR_WINDOW <N>: When the number of errors exceeds MAX_ERRORS send/receive errors or timeouts in less than ERROR_WINDOW seconds, the AR Manager will abort, returning control back to the Subnet Manager. This option can be changed on-the-fly. Values for both options: 0 - 0xffff. MAX_ERRORS = 0: zero tolerance, abort configuration on first error. Default: 10. ERROR_WINDOW = 0: mechanism disabled, no error checking. Default: 5
LOG_FILE <full path>: AR Manager log file. This option can be changed on-the-fly. Default: /var/log/armgr
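The interaction between MAX_ERRORS and ERROR_WINDOW described above can be modeled as a sliding-window counter. The following is an interpretive sketch only, not the AR Manager's code: the class and method names are hypothetical, timestamps are plain seconds, and letting ERROR_WINDOW = 0 take precedence when both options are zero is an assumption.

```python
from collections import deque


class ErrorWindow:
    """Toy model of the MAX_ERRORS / ERROR_WINDOW abort rule (hypothetical)."""

    def __init__(self, max_errors: int = 10, error_window: int = 5):
        self.max_errors = max_errors
        self.error_window = error_window
        self._times = deque()

    def record_error(self, now: float) -> bool:
        """Record an error at time `now` (seconds); return True to abort."""
        if self.error_window == 0:
            return False          # mechanism disabled, no error checking
        if self.max_errors == 0:
            return True           # zero tolerance, abort on first error
        self._times.append(now)
        # Keep only errors that occurred less than error_window seconds ago.
        while self._times and now - self._times[0] >= self.error_window:
            self._times.popleft()
        return len(self._times) > self.max_errors


if __name__ == "__main__":
    window = ErrorWindow(max_errors=2, error_window=5)
    for t in (0.0, 1.0, 2.0):
        print(t, window.record_error(t))
```

With the defaults from the table (10 errors, 5 seconds), the eleventh error inside any 5-second span is the one that triggers the abort in this model.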
130. ations and Acronyms
Table 2 - Abbreviations and Acronyms
Abbreviation/Acronym: Whole Word/Description
B: (Capital) 'B' is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
b: (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
FW: Firmware
HCA: Host Channel Adapter
HW: Hardware
IB: InfiniBand
iSER: iSCSI RDMA Protocol
LSB: Least significant byte
lsb: Least significant bit
MSB: Most significant byte
msb: Most significant bit
NIC: Network Interface Card
SW: Software
VPI: Virtual Protocol Interconnect
IPoIB: IP over InfiniBand
PFC: Priority Flow Control
PR: Path Record
RDS: Reliable Datagram Sockets
RoCE: RDMA over Converged Ethernet
SDP: Sockets Direct Protocol
SL: Service Level
SRP: SCSI RDMA Protocol
MPI: Message Passing Interface
EoIB: Ethernet over InfiniBand
QoS: Quality of Service
ULP: Upper Level Protocol
VL: Virtual Lane
vHBA: Virtual SCSI Host Bus Adapter
uDAPL: User Direct Access Programming Library
Glossary
The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is inclu
131. ble is organized by MLID.
Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
Standby Subnet Manager: A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
Subnet Manager (SM): One of several entities involved in the configuration and control of an IB fabric.
Unicast Linear Forwarding Tables (LFT): A table that exists in every switch providing the port through which packets should be sent to each LID.
Virtual Protocol Interconnect (VPI): A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).
Related Documentation
Table 4 - Reference Documents
Document Name: Description
InfiniBand Architecture Specification, Vol. 1, Release 1.2.1: is provided by IBTA
IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002), Document PDF 5594996: Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters for 10 Gb/s Operation
14 Mellanox Technolo
132. ble motherboard BIOS
• Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
• Mellanox ConnectX VPI Adapter Card family with SR-IOV capability
4.13.2 Setting Up SR-IOV
Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.
Step 1. Enable "SR-IOV" in the system BIOS.
[Figure: BIOS setup utility screen]
Step 2. Enable "Intel Virtualization Technology".
[Figure: BIOS setup utility screen, virtualization technology set to Enabled]
Step 3. Install the hypervisor that supports SR-IOV.
Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel. For example, on Intel systems, add:
    default=0
    timeout=5
    splashimage=(hd0,0)/grub/splash.xpm.gz
    hiddenmenu
    title Red Hat Enterprise Linux Server (2.6.32-36.x86-645)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
        initrd /initrd-2.6.32-36.x86-64.img
Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.
Step 5. Install the MLNX_OFED driver for Linux that supports SR-IOV.
Step 6. Verify the HCA is configured to support SR-IOV
133. bric links pmCounters.
-P <PM=<Trash>>: If any of the provided pm is greater than its provided value, print it to screen.
-skip <skip-option(s)>: Skip the executions of the selected checks. Skip options (one or more can be specified): dup_guids zero_guids pm logical_state part ipoib all
-wt <file-name>: Write out the discovered topology into the given file. This flag is useful if you later want to check for changes from the current state of the fabric. A directory named ibdiag_ibnl is also created by this option, and holds the IBNL files required to load this topology. To use these files you will need to set the environment variable named IBDM_IBNL_PATH to that directory. The directory is located in /tmp or in the output directory provided by the -o flag.
-load_db <file-name>: Load subnet data from the given .db file, and skip the subnet discovery stage. Note: Some of the checks require actual subnet discovery, and therefore would not run when load_db is specified. These checks are: Duplicated/zero guids, link state, SMs status.
-h|--help: Prints the help page information.
-V|--version: Prints the version of the tool.
--vars: Prints the tool's environment variables and their values.
Output Files
Table 20 - ibdiagnet (of ibutils) Output Files
Output File: Description
ibdiagnet.log: A dump of all the application reports generated according to the provided flags
ibdiagnet.lst: List of all the nodes, ports an
134. but only the actions, such as joining a multicast group, that need to be taken when using the API.
• Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is displayed as zero when querying the port.
• With RoCE, the alternate path is not set for RC QP, and therefore APM is not supported.
• Since the SM is not present, querying a path is impossible. Therefore, the path record structure must be filled with the relevant values before establishing a connection. Hence, it is recommended to work with RDMA-CM to establish a connection, as it takes care of filling the path record structure.
• The GID table for each port is populated with N+1 entries, where N (N>0) is the number of VLAN devices over this port.
2 Installation
This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.
2.1 Hardware and Software Requirements
Table 1 - Software and Hardware Requirements
Requirements: Description
Platforms: A server platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
• MT27508 ConnectX-3 (VPI, IB, EN) (firmware: fw-ConnectX3)
• MT4113 Connect-IB (IB) (firmware: fw-Connect-IB)
For the list of supported architecture platforms, please refer to the Mellanox OF
135. cessors
The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.
Table 11 - Recommended BIOS Settings for Intel Nehalem/Westmere Processors
BIOS Option: Values
General
    Operating Mode/Power profile: Maximum Performance
Processor
    C-States: Disabled
    Turbo mode: Disabled
    Hyper-Threading (see note 1): Disabled. Recommended for latency and message rate sensitive applications.
    CPU frequency select: Max performance
Memory
    Memory speed: Max performance
    Memory channel mode: Independent
    Node Interleaving: Disabled / NUMA
    Channel Interleaving: Enabled
    Thermal Mode: Performance
Note 1: Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when hyper-threading is enabled.
7.1.3.4 AMD Processors
The following table displays the recommended BIOS settings in machines with AMD-based processors.
Table 12 - Recommended BIOS Settings for AMD Processors
BIOS Option: Values
General
    Operating Mode/Power profile: Maximum Performance
Processor
    C-States: Disabled
    Turbo mode: Disabled
    HPC Optimizations: Enabled
    CPU frequency select: Max performance
136. ch node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
3. Fat-tree Routing Algorithm: This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.
4. LASH Routing Algorithm: Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.
5. DOR Routing Algorithm: Based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh.
6. Torus-2QoS Routing Algorithm: Based on the DOR Unicast routing algorithm, specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free routing while supporting two quality of service (QoS) levels. Additionally, it can route around multiple failed fabric links or a single failed fabric switch without intro
137. ch rules such that the target QoS-Level definition is found. Given the QoS-Level, a path search is performed with the given restrictions imposed by that level.
4.5 Quality of Service Ethernet
4.5.1 Quality of Service Overview
Quality of Service (QoS) is a mechanism for assigning a priority to a network flow (socket, rdma_cm connection) and managing its guarantees, limitations, and its priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a 2/3 stage process. The TC is assigned the QoS attributes, and the different flows behave accordingly.
4.5.2 Mapping Traffic to Traffic Classes
Mapping traffic to TCs consists of several actions which are user controllable, some controlled by the application itself and others by the system/network administrators. The following is the general mapping of traffic to Traffic Classes:
1. The application sets the required Type of Service (ToS).
2. The ToS is translated into a Socket Priority (sk_prio).
3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).
4. The UP is mapped to TC by the network/system administrator.
5. TCs hold the actual QoS parameters.
QoS can be applied on the following types of traffic. However, the general QoS flow may vary among them:
• Plain Ethernet: Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver
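The staged mapping listed above (ToS to sk_prio, sk_prio to UP, UP to TC) can be pictured as three table lookups applied in sequence. The sketch below only illustrates that chaining: every table entry is an invented example value, the function name is hypothetical, and falling back to 0 for unmapped keys is an assumption rather than documented driver behavior.

```python
from typing import Dict, Tuple

# Invented example tables. Real mappings are set by the application
# and by the system/network administrators.
EXAMPLE_TOS_TO_SK_PRIO: Dict[int, int] = {0: 0, 8: 2, 16: 4}
EXAMPLE_SK_PRIO_TO_UP: Dict[int, int] = {0: 0, 2: 1, 4: 3}
EXAMPLE_UP_TO_TC: Dict[int, int] = {0: 0, 1: 0, 3: 1}


def resolve_traffic_class(
    tos: int,
    tos_to_sk_prio: Dict[int, int],
    sk_prio_to_up: Dict[int, int],
    up_to_tc: Dict[int, int],
) -> Tuple[int, int, int]:
    """Follow ToS -> sk_prio -> UP -> TC; unmapped keys fall back to 0."""
    sk_prio = tos_to_sk_prio.get(tos, 0)
    user_priority = sk_prio_to_up.get(sk_prio, 0)
    traffic_class = up_to_tc.get(user_priority, 0)
    return sk_prio, user_priority, traffic_class


if __name__ == "__main__":
    print(resolve_traffic_class(
        16, EXAMPLE_TOS_TO_SK_PRIO, EXAMPLE_SK_PRIO_TO_UP, EXAMPLE_UP_TO_TC))
```

An application that sets sk_prio directly, as step 3 allows, would simply skip the first lookup.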
138. cified device May specify more than one device Mellanox Technologies 183 Rev 2 0 3 0 0 InfiniBand Fabric Diagnostic Utilities Table 23 ibstatus Flags and Options Optional Default Flag f r If Not Description y Specified lt port gt Optional but All ports of Print information for the specified port only requires the specified ofthe specified device specifying a device device name Examples 1 List the status of all available InfiniBand devices and their ports gt ibstatus Infiniband device mlx4 0 port 1 status default gid base lid sm lid state phys state fe80 0000 0000 0000 0000 0000 0007 3896 0x3 0x3 4 ACTIVE 5 LinkUp 20 Gb sec 4X DDR Infiniband device mlx4 0 port 2 status default gid base lid sm lid state phys state rate e80 0000 0000 0000 0000 0000 0007 3897 0 1 0 1 4 5 LinkUp 20 Gb sec 4X DDR Infiniband device mthca0 port 1 status default gid base lid sm lid state phys state e80 0000 0000 0000 0002 c900 0101 d151 0x0 0x0 215 5 LinkUp 10 Gb sec 4X Infiniband device mthca0 port 2 status default gid base lid sm lid state phys state e80 0000 0000 0000 0002 c900 0101 d152 0x0 0x0 23 5 LinkUp 10 Gb sec 4 184 Mellanox Technologies Rev 2 0 3 0 0 2 List
139. com|72.3.194.0|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 1354 (1.3K) [text/plain]
    Saving to: RPM-GPG-KEY-Mellanox
    2013-08-20 13:52:30 (247 MB/s) - RPM-GPG-KEY-Mellanox saved [1354/1354]
Step 4. Install the key:
    # sudo rpm --import RPM-GPG-KEY-Mellanox
Step 5. Check that the key was successfully imported:
    # rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox
    gpg-pubkey-a9e4b643-520791ba gpg(Mellanox Technologies <support@mellanox.com>)
Step 6. Create a YUM repository configuration file called /etc/yum.repos.d/mlnx_ofed.repo with the following content:
    [mlnx_ofed]
    name=MLNX_OFED Repository
    baseurl=file:///<path to extracted MLNX_OFED package>
    enabled=1
    gpgkey=file:///<path to the downloaded key RPM-GPG-KEY-Mellanox>
    gpgcheck=1
Step 7. Check that the repository was successfully added:
    # yum repolist
    Loaded plugins: product-id, security, subscription-manager
    This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
    repo id        repo name                          status
    mlnx_ofed      MLNX_OFED Repository               108
    rpmforge       RHEL 6Server - RPMforge.net - dag  4,597
    repolist: 8,351
2.5.2 Installing MLNX_OFED using the YUM Tool
After setting up the YUM repository for the MLNX_OFED package, perform the follo
140. continue?[y/N]:y
    Running /usr/sbin/vendor_pre_uninstall.sh
    Removing OFED Software installations
    Running /bin/rpm -e --allmatches kernel-ib kernel-ib-devel libibverbs libibverbs-devel libibverbs-devel-static libibverbs-utils libmlx4 libmlx4-devel libibcm libibcm-devel libibumad libibumad-devel libibumad-static libibmad libibmad-devel libibmad-static librdmacm librdmacm-utils librdmacm-devel ibacm opensm-libs opensm-devel perftest compat-dapl compat-dapl-devel dapl dapl-devel dapl-devel-static dapl-utils srptools infiniband-diags-guest ofed-scripts opensm-devel
    warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
    Running /tmp/2818-ofed_vendor_post_uninstall.sh
Step 3. Restart the server.
4.13.6 Burning Firmware with SR-IOV
The following procedure explains how to create a binary image with SR-IOV enabled that has 63 VFs. However, the number of VFs varies according to the working mode requirements.
To burn the firmware:
Step 1. Verify you have MFT installed on your machine.
Step 2. Enter the firmware directory, according to the HCA type (e.g. ConnectX-3). The path is: /mlnx_ofed/firmware/<device>/<FW version>
Step 3. Find the ini file that contains the HCA's PSID. Run:
    # ibv_devinfo | grep board_id
    board_id: MT_1090120019
If such an ini file cannot be found in the firmware directory, you may want to dump the configuration file using mstflint. Run
141. creases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.
-V: This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.
-vf <flags>: This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:
    BIT   LOG LEVEL ENABLED
    0x01  ERROR (error messages)
    0x02  INFO (basic messages, low volume)
    0x04  VERBOSE (interesting stuff, moderate volume)
    0x08  DEBUG (diagnostic, high volume)
    0x10  FUNCS (function entry/exit, very high volume)
    0x20  FRAMES (dumps all SMP and GMP frames)
    0x40  ROUTING (dump FDB routing information)
    0x80  currently unused
Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
-h, --help: Display this usage info, then exit.
8.3.2 Running osmtest
To run osmtest in the default mode, simply enter:
    host1# osmtest
The default mode runs all the flows except for the Quality of Service flow (see Section 8.6). After installing opensm (and if the InfiniBand fabric is stable), it is
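The flags value for the verbosity option is a bitwise OR of the per-level bits listed above. A small sketch of composing and decoding such a value follows. The helper names are hypothetical, and placing DEBUG at 0x08 is inferred from the displaced row in the extracted table.

```python
from typing import Dict, List

# Bit values as listed in the log-level table (DEBUG at 0x08 is inferred).
LOG_LEVEL_BITS: Dict[str, int] = {
    "ERROR": 0x01,
    "INFO": 0x02,
    "VERBOSE": 0x04,
    "DEBUG": 0x08,
    "FUNCS": 0x10,
    "FRAMES": 0x20,
    "ROUTING": 0x40,
}


def verbosity_flags(*levels: str) -> int:
    """OR together the bits for the named log levels (hypothetical helper)."""
    flags = 0
    for level in levels:
        flags |= LOG_LEVEL_BITS[level.upper()]
    return flags


def enabled_levels(flags: int) -> List[str]:
    """List the log levels whose bit is set in `flags`."""
    return [name for name, bit in LOG_LEVEL_BITS.items() if flags & bit]


if __name__ == "__main__":
    default = verbosity_flags("ERROR", "INFO")
    print(hex(default), enabled_levels(default))
    print(enabled_levels(0xFF))
```

Bit 0x80 is unused, so decoding 0xFF yields the same seven levels as decoding 0x7F.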
142. d links in the fabric
ibdiagnet.fdbs: A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs: A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks: In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm: List of all the SMs (state and priority) in the fabric
ibdiagnet.pm: A dump of the pm Counters values of the fabric links
ibdiagnet.pkey: A dump of the existing partitions and their member host ports
ibdiagnet.mcg: A dump of the multicast groups, their properties, and member host ports
ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option
In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output. After scanning the fabric, if the option is provided, a full report of the fabric qual
143. ded here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.
Table 3 - Glossary
Channel Adapter (CA), Host Channel Adapter (HCA): An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).
HCA Card: A network adapter card based on an InfiniBand channel adapter device.
IB Devices: Integrated circuit implementing InfiniBand compliant communication.
IB Cluster/Fabric/Subnet: A set of IB devices connected by IB cables.
In-Band: A term assigned to administration activities traversing the IB connectivity only.
Local Identifier (LID): An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
Local Device/Node/System: The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch providing the list of ports to forward received multicast packet. The ta
144. destroy_flow(struct ibv_flow *flow_id)
Input parameters: ibv_destroy_flow requires struct ibv_flow, which is the return value of ibv_create_flow in case of success.
Output parameters: Returns 0 on success, or the value of errno on failure.
For further information, please refer to the ibv_destroy_flow man page.
Ethtool
The ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool man page for all the ways to specify a flow.
Examples:
    ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2
All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).
    ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
All packets that contain the above source IP address and destination port are to be steered into rx-ring 2. When the destination MAC is not given, the user's destination MAC is filled in automatically.
    ethtool -u eth5
Shows all of ethtool's steering rules.
When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in the kernel (v2.6.28).
mlx4 Driver Support
The mlx4 driver supports only a subset
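The remark that the ethtool interface is "effectively a table" can be pictured as a mapping keyed by the rule location (the loc value, used as the priority in the examples above), where a second insert at the same location replaces the first. This is a toy model only: the names are hypothetical and nothing here calls ethtool or the kernel.

```python
from typing import Dict


def insert_rule(table: Dict[int, str], loc: int, rule: str) -> Dict[int, str]:
    """Store `rule` at location `loc`, replacing any rule already there."""
    table[loc] = rule
    return table


if __name__ == "__main__":
    rules: Dict[int, str] = {}
    insert_rule(rules, 5, "flow-type ether dst 00:11:22:33:44:55 action 2")
    insert_rule(rules, 5, "flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 action 2")
    # Only the second rule remains at location 5.
    print(rules)
```

Applied in order, the two example commands above therefore leave a single rule at location 5 in this model.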
145. dm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:
• Establish an SRP connection by itself (without the need to issue the "echo" command described in Section 4.1.2.2)
• Continue running in the background, detecting new targets and establishing SRP connections with them (daemon mode)
• Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N> where <N> is a digit
• Enable High Availability operation (together with Device-Mapper Multipath)
• Have a configuration file that determines the targets to connect to
1. srp_daemon commands equivalent to ibsrpdm:
    "srp_daemon -a -o" is equivalent to "ibsrpdm"
    "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.
2. srp_daemon extensions to ibsrpdm:
To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num>, and to generate output suitable for 'echo', you may execute:
    host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'.
To both discover the SRP Targets and establish connections with them, just add the -e option to the above command. Ex
146. ducing deadlocks, and without changing path SL values granted before the failure.
OpenSM provides an optional unicast routing cache (enabled by the -A or --ucast_cache options). When enabled, the unicast routing cache prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches go down, or one or more of these nodes come back after being down. A very common case that is handled by the unicast routing cache is a host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below).
The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would perform up after a down step was used.
2. Once MinHop matrices exist, each switch is visited and for each target LID
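The Up/Down restriction mentioned in stage 1 (no "up" step once a "down" step has been taken) can be checked for a candidate path once every switch has a rank. The following is a simplified sketch, not OpenSM's code: the function name is hypothetical, a step toward a lower rank is taken to be "up", and equal-rank steps are treated as neutral, whereas a real implementation needs a tie-break rule for them.

```python
from typing import Dict, Sequence


def is_legal_updn_path(path: Sequence[str], ranks: Dict[str, int]) -> bool:
    """Return True if `path` never goes up after it has gone down.

    Simplification: a step to a lower rank is "up", a step to a higher
    rank is "down", and equal-rank steps are ignored.
    """
    gone_down = False
    for current, nxt in zip(path, path[1:]):
        if ranks[nxt] < ranks[current]:      # up step
            if gone_down:
                return False
        elif ranks[nxt] > ranks[current]:    # down step
            gone_down = True
    return True


if __name__ == "__main__":
    example_ranks = {"root1": 0, "root2": 0, "leaf1": 1, "leaf2": 1}
    print(is_legal_updn_path(["leaf1", "root1", "leaf2"], example_ranks))
    print(is_legal_updn_path(["root1", "leaf1", "root2"], example_ranks))
```

The first path climbs to a root and then descends, which is allowed. The second descends to a leaf and climbs again, which is the pattern the BFS avoids.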
147. e and one of its loops may experience a deadlock (due, for example, to high pressure).
The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage. The user can override the node list manually. If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.
At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.
Up Dow
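The ranking process of stage 2 amounts to a breadth-first search that starts from all root switches at once. A minimal sketch is shown below, assuming the fabric is given as a plain adjacency dictionary of switch names. The function name and data layout are illustrative and are not taken from OpenSM.

```python
from collections import deque
from typing import Dict, Iterable, List


def rank_switches(adjacency: Dict[str, List[str]],
                  roots: Iterable[str]) -> Dict[str, int]:
    """Assign rank 0 to every root, then rank neighbors incrementally via BFS."""
    ranks = {root: 0 for root in roots}
    queue = deque(ranks)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in ranks:
                ranks[neighbor] = ranks[node] + 1
                queue.append(neighbor)
    return ranks


if __name__ == "__main__":
    fabric = {
        "spine1": ["leaf1", "leaf2"],
        "spine2": ["leaf1", "leaf2"],
        "leaf1": ["spine1", "spine2", "edge1"],
        "leaf2": ["spine1", "spine2"],
        "edge1": ["leaf1"],
    }
    print(rank_switches(fabric, ["spine1", "spine2"]))
```

Switches that cannot be reached from any root are left unranked in this sketch.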
148. e ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The default port number for RDS is 0x48CA, which makes a default Service-ID 0x00000000010648CA. The following two match rules are equivalent:
    rds : <SL>
    any, service-id 0x00000000010648CA : <SL>
8.6.6.4 SRP
The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent:
    srp, target-port-guid 0x1234 : <SL>
    any, target-port-guid 0x1234 : <SL>
Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port guid only) should be placed at the end of the qos-ulps match rules.
8.6.6.5 MPI
The SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on the MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section.
8.6.7 SL2VL Mapping and VL Arbitration
The OpenSM cached options file has a set of QoS related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
• Max VLs: the maximum number of VLs that will be on the subnet
• High limit: the limit of High Priority component of VL Arbitration
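The RDS Service ID layout above (a fixed 0x...0106 prefix with the 16-bit port number in the low four hex digits) can be computed directly. The helper below is an illustrative sketch with a hypothetical name. It only reproduces the arithmetic described in the text.

```python
RDS_SERVICE_ID_PREFIX = 0x0000000001060000  # 0x000000000106PPPP with PPPP zeroed
RDS_DEFAULT_PORT = 0x48CA


def rds_service_id(port: int = RDS_DEFAULT_PORT) -> str:
    """Return the RDS Service ID for a TCP/IP port as a 16-hex-digit string."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return "0x{:016X}".format(RDS_SERVICE_ID_PREFIX | port)


if __name__ == "__main__":
    print(rds_service_id())         # default RDS port
    print(rds_service_id(0x1234))   # arbitrary example port
```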
149. e ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS-Class. The CMA uses the provided QoS-Class and Service-ID in the sent PR/MPR.
4.4.4.1 IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.
4.4.4.2 SRP
The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service-ID in the PR/MPR by itself and uses that information in setting up the QP. The SRP Service-ID is defined by the SRP target I/O Controller (it also complies with IBTA Service-ID rules). The Service-ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.
4.4.5 OpenSM Features
The QoS related functionality that is provided by OpenSM, the Subnet Manager described in Chapter 8, can be split into two main parts:
I. Fabric Setup: During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.
II. PR/MPR Query Handling: OpenSM enforces the provided policy on client requests. The overall flow for such requests is: first the request is matched against the defined mat
150. e appropriate driver stack (InfiniBand or Ethernet). For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet.
6.2.1 Enabling Auto Sensing
Upon driver start up:
1. Sense the adapter card's port type. If a valid cable or module is connected (QSFP, SFP+, or SFP with EEPROM in the cable/module), set the port type to the sensed link type (IB/Ethernet). Otherwise, set the port type as default (Ethernet).
During driver run time:
• Sense a link every 3 seconds if no link is sensed/detected
• If sensed, set the port type as sensed
7 Performance
7.1 General System Configurations
The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.
7.1.1 PCI Express (PCIe) Capabilities
Table 9 - Recommended PCIe Configuration
PCIe Generation: 3.0
Speed: 8GT/s
Width: x8 or x16
Max Payload size: 256
Max Read Request: 4096
For ConnectX3 based network adapters (40GbE Ethernet adapters), it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.
7.1.2 Memory Configuration
For high performance, it is recommended to use the high
151. e bonding master configuration file (e.g. ifcfg-bond0), in addition to Linux bonding semantics, use the following parameter: MTU=65520. 65520 is a valid MTU value only if all IPoIB slaves operate in Connected mode (see Section 4.3.2, "IPoIB Mode Setting", on page 49) and are configured with the same value. For IPoIB slaves that work in datagram mode, use MTU=2044. If you do not set the correct MTU, or do not set MTU at all, performance of the interface might decrease.
• In the bonding slave configuration file (e.g. ifcfg-ib0), use the same Linux Network Scripts semantics. In particular: DEVICE=ib0
• In the bonding slave configuration file (e.g. ifcfg-ib0.8003), the line TYPE=InfiniBand is necessary when using bonding over devices configured with partitions (p_key)
• For RHEL users: In /etc/modprobe.d/bond.conf add the following lines: alias bond0 bonding
• For SLES users: It is necessary to update the MANDATORY_DEVICES environment variable in /etc/sysconfig/network/config with the names of the IPoIB slave devices (e.g. ib0, ib1, etc.). Otherwise, the bonding master may be created before the IPoIB slave interfaces at boot time.
It is possible to have multiple IPoIB bonding masters and a mix of IPoIB bonding masters and Ethernet bonding masters. However, it is NOT possible to mix Ethernet and IPoIB slaves under the same bonding master.
Restarting openibd does not keep the bonding configuration via Network Scripts. You have
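The MTU guidance above reduces to a simple rule: 65520 only when every IPoIB slave runs in Connected mode, otherwise 2044. The sketch below encodes that rule. The function name is hypothetical, and treating a mix of connected and datagram slaves as requiring 2044 is an assumption drawn from the "only if all slaves" wording.

```python
from typing import Iterable

CONNECTED_MODE_MTU = 65520
DATAGRAM_MODE_MTU = 2044


def bonding_master_mtu(slave_modes: Iterable[str]) -> int:
    """Pick the bonding master MTU from the slaves' IPoIB modes.

    Assumption: any datagram-mode slave forces the datagram MTU.
    """
    modes = {mode.lower() for mode in slave_modes}
    if not modes:
        raise ValueError("at least one slave mode is required")
    unknown = modes - {"connected", "datagram"}
    if unknown:
        raise ValueError("unknown IPoIB mode(s): " + ", ".join(sorted(unknown)))
    return CONNECTED_MODE_MTU if modes == {"connected"} else DATAGRAM_MODE_MTU


if __name__ == "__main__":
    print("MTU=%d" % bonding_master_mtu(["connected", "connected"]))
    print("MTU=%d" % bonding_master_mtu(["datagram", "datagram"]))
```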
152. e boot process if timeouts and max retries are set too high.
Example for discovering and connecting targets over iSER:
    iscsiadm -m discovery -o new -o old -t st -I iser -p <ip:port> -l
iSER also supports RoCE without any additional configuration required. To bond the RoCE interfaces, set the fail_over_mac option in the bonding driver.
4.3 IP over InfiniBand
4.3.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. The IPoIB driver, ib_ipoib, exploits the following capabilities:
• VLAN simulation over an InfiniBand network via child interfaces
• High Availability via Bonding
• Varies MTU values: up to 4k in Datagram mode, up to 64k in Connected mode
• Uses any ConnectX IB ports (one or two)
• Inserts IP/UDP/TCP checksum on outgoing packets
• Calculates checksum on received packets
• Supports net device TSO through the ConnectX LSO capability to defragment large datagrams to MTU quantas
• Dual operation mode: datagram and connected
• Large MTU support through connected mode
IPoIB also supports the following software based enhancements:
• Giant Receive Offload
• Ethtool support
4.3.2 IPoIB Mode Setting
IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in
153. e file based routing any topology changes are currently ignored. The file routing engine just loads the LFTs from the file specified, with no reaction to the real topology. As a result, it cannot recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches are skipped. Multicast is not affected by the file routing engine (this uses min hop tables). 8.5.2 Min Hop Algorithm The Min Hop algorithm is invoked by default if no routing algorithm is specified. It can also be invoked by specifying '-R minhop'. The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by: -i <equalize-ignore-guids-file>, --ignore_guids <equalize-ignore-guids-file>. This option provides the means to define a set of ports (by GUIDs) that will be ignored by the link load equalization algorithm. LMC awareness routes based on a (remote) system or switch basis. 8.5.3 UPDN Algorithm The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tre
154. e is not configured correctly 173 Failed to start the mst driver 2.3.3 Installation Procedure Step 1. Log in to the installation machine as root. Step 2. Mount the ISO image on your machine: host1# mount -o ro,loop MLNX_OFED_LINUX-<ver>-<OS label>-<CPU arch>.iso /mnt Step 3. Run the installation script mlnxofedinstall. This program will install the MLNX_OFED_LINUX package on your machine. Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed. Do you want to continue?[y/N]:y Uninstalling the previous version of MLNX_OFED_LINUX Starting MLNX_OFED_LINUX-2.0-2.6.7 installation ... Installing mlnx-ofa_kernel RPM Preparing... mlnx-ofa_kernel Installing kmod-mlnx-ofa_kernel RPM Preparing... kmod-mlnx-ofa_kernel Installing mlnx-ofa_kernel-devel RPM Preparing... mlnx-ofa_kernel-devel Installing kmod-kernel-mft-mlnx RPM Preparing... kmod-kernel-mft-mlnx Installing knem m
155. Can't auto detect fw configuration file. Step 4. In case the installation script performed firmware updates to your network adapter hardware, it will ask you to reboot your machine. Step 5. The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI: * soft memlock unlimited; * hard memlock unlimited. These settings remove the limit on the amount of memory that can be pinned by a user-space application. If desired, tune the value unlimited to a specific amount of RAM. Step 6. For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 8, "OpenSM Subnet Manager". Step 7. (InfiniBand only) Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as: HCA firmware version; kernel architecture; driver version; number of active HCA ports along with their states; Node GUID. Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/. hca_self_test.ofed ---- Performing Adapter Device Self Test ---- Number of CAs Detected: 2 PCI Device Check: PASS Kernel Arch: x86
156. ecuting srp_daemon over a port without the -a option will only display the reachable targets via the port to which the initiator is not connected. If executing with the -e option, it is better to omit -a. It is recommended to use the -n option. This option adds the initiator_ext to the connecting string (see Section 4.1.2.5 for more details). srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect). A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 4.1.2.4. 4.1.2.4 Automatic Discovery and Connection to Targets Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running. To connect to all the existing Targets in the fabric, run: srp_daemon -e -o. This utility will scan the fabric once, connect to every Target it detects, and then exit. srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file. To connect to all the existing Targets in the fabric and to connect to new targets th
157. ed and incoming packets will have the VLAN tag removed. Any VLAN-tagged packets sent by the VF are silently dropped. The default behavior is VGT. The feature may be controlled on the Hypervisor from userspace via iproute2/netlink: ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ] ... [ vf NUM [ mac LLADDR ] [ vlan VLANID [ qos VLAN-QOS ] ] ... [ spoofchk { on | off } ] ] ... Use: ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>], where NUM = 0..max-vf-num; vlan_id = 0..4095 (4095 means "set VGT"); qos = 0..7. For example: ip link set dev eth2 vf 2 vlan <vlan_id> qos 3 sets VST mode for VF 2 belonging to PF eth2, with qos = 3; ip link set dev eth2 vf 2 vlan 4095 sets the mode for VF 2 back to VGT. 4.13.7.3.2 Additional Ethernet VF Configuration Options Guest MAC configuration: By default, guest MAC addresses are configured to be all zeroes. In the mlnx_ofed guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, the guest MACs should be configured before the guest driver comes up. The guest MAC may be configured by using: ip link set dev <PF device> vf <NUM> mac <LLADDR>. For legacy guests, which do not generate random MACs, the administrator should always configure their MAC addresses via ip link, as above. Spoof checking: Spoof checking is cu
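The value ranges quoted above (vlan_id 0..4095, with 4095 returning the VF to VGT, and qos 0..7) can be checked before the command is issued. The sketch below only builds the argument list for the ip link form given in the text; the function name is hypothetical and nothing is executed.

```python
# Hypothetical helper: builds, but does not run, the "ip link" argument list.
VGT_VLAN_ID = 4095  # per the text, vlan 4095 means "set VGT"


def vf_vlan_command(pf_device, vf, vlan_id, qos=None):
    """Return argv for: ip link set dev <PF> vf <NUM> vlan <id> [qos <qos>]."""
    if vf < 0:
        raise ValueError("VF number must be non-negative")
    if not 0 <= vlan_id <= 4095:
        raise ValueError("vlan_id must be in the range 0..4095")
    argv = ["ip", "link", "set", "dev", pf_device,
            "vf", str(vf), "vlan", str(vlan_id)]
    if qos is not None:
        if not 0 <= qos <= 7:
            raise ValueError("qos must be in the range 0..7")
        argv += ["qos", str(qos)]
    return argv
```

Returning a list rather than a string keeps the result suitable for passing to subprocess.run without shell quoting concerns.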
158. ed docs Device (05:00.0): 05:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0) Link Width is not 8x PCI Link Speed: 5Gb/s Device (07:00.0): 07:00.0 Ethernet controller: Mellanox Technologies MT27500 Family [ConnectX-3] Link Width: 8x PCI Link Speed: 5Gb/s Installation finished successfully. The firmware version on /dev/mst/mt26448_pci_cr0 (2.9.1000) is up to date. Note: To force firmware update use '--force-fw-update' flag. The firmware version on /dev/mst/mt4099_pci_cr0 (2.30.4450) is up to date. Note: To force firmware update use '--force-fw-update' flag. In case your machine has the latest firmware, no firmware update will occur and the installation script will print, at the end of installation, a message similar to the following: The firmware version on /dev/mst/mt26448_pci_cr0 (2.9.1000) is up to date. Note: To force firmware update use '--force-fw-update' flag. The firmware version on /dev/mst/mt4099_pci_cr0 (2.11.500) is up to date. Note: To force firmware update use '--force-fw-update' flag. In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates. Error message ic
159. ed to the client machine. 3. An initrd file. 4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux driver that is appropriate for the kernel version the diskless image will run. Adding the Ethernet Driver to the initrd File The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting. Step 1. Back up your current initrd file. Step 2. Make a new working directory and change to it: host1$ mkdir /tmp/initrd_en; host1$ cd /tmp/initrd_en Step 3. Normally, the initrd image is zipped. Extract it using the following command: host1$ gzip -dc <initrd image> | cpio -id The initrd files should now be found under /tmp/initrd_en. Step 4. Create a directory for the ConnectX EN modules and copy them: host1$ mkdir -p /tmp/initrd_en/lib/modules/mlnx_en; host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers; host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en; host1$ cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en Step 5. To load the modules, you need the insmod executable. If you do not have it in your initrd, add it using the following command: host1$ cp /sbin/insmod /tmp/initrd_en/sbin Step 6. If you plan to give your
160. ed with major operating system distributions. Mellanox OFED is certified with the following products: Mellanox Messaging Accelerator (VMA) software, a multicast socket acceleration library that performs OS bypass for standard socket-based applications; Mellanox Unified Fabric Manager (UFM) software, a platform for managing demanding scale-out computing fabric environments, built on top of the OpenSM industry-standard routing engine; Fabric Collective Accelerator (FCA), a Mellanox MPI-integrated software package that utilizes CORE-Direct technology for implementing the MPI collective communications. 1.2 Mellanox OFED Package 1.2.1 ISO Image Mellanox OFED for Linux (MLNX_OFED_LINUX) is provided as ISO images or as a tarball, one per supported Linux distribution and CPU architecture, that includes source code and binary RPMs, firmware, utilities, and documentation. The ISO image contains an installation script (called mlnxofedinstall) that performs the necessary steps to accomplish the following: discover the currently installed kernel; uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor's commercial stack; install the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel); identify the currently installed InfiniBand HCAs and perform the required firmware updates. 1.2.2 Software Components MLNX_OFED_LINUX contains the fo
161. educing CPU overhead. It works by aggregating multiple incoming packets from a single stream into a larger buffer before they are passed higher up the networking stack, thus reducing the number of packets that have to be processed. LRO is available in kernel versions 3.1 for untagged traffic. Note: LRO will be done whenever possible. Otherwise, GRO will be done. Generic Receive Offload (GRO) is available throughout all kernels. ethtool -c eth<x>: Queries interrupt coalescing settings. Table 6 - ethtool Supported Options (Options: Description) ethtool -C eth<x> adaptive-rx on|off: Enables/disables adaptive interrupt moderation. By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern. ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]: Sets the values for packet rate limits and for moderation time high and low values. Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value. ethtool -C eth<x> [rx-usecs N] [rx-frames N]: Sets the interrupt coalescing settings when the adaptive moderation is disabled. Note: usec settings correspond to the time to wait after
162. ee of credit loops routing. Two levels of QoS, assuming switches support 8 data VLs. Ability to route around a single failed switch, and/or multiple failed links, without introducing credit loops or changing path SL values. Very short run times, with good scaling properties as fabric size increases. 8.5.7.1 Unicast Routing Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise occur in a torus using the concept of a dateline for each torus dimension. It encodes into a path SL which datelines the path crosses, as follows: sl = 0; for (d = 0; d < torus_dimensions; d++) /* path_crosses_dateline(d) returns 0 or 1 */ sl |= path_crosses_dateline(d) << d; For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to implement two QoS levels. Torus-2QoS also makes use of the output port dependence of switch SL2VL maps to encode into one VL bit the information encoded in three SL bits. It computes in which torus coordinate direction each inter-switch link points, and writes SL2VL maps for such ports as follows: for (sl = 0; sl < 16; sl++) /* cdir(port) reports which torus coordinate direction a switch port points in, and returns 0, 1, or 2 */ sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport)); Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2
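The two pseudo-code fragments above can be written out as runnable functions. This is a sketch of the encoding as described in the text, not OpenSM source: the function names and the list-of-flags input are illustrative, and the extra QoS bit is left out.

```python
def path_sl(crosses_dateline):
    """Encode dateline crossings into SL bits, one bit per torus dimension.

    crosses_dateline[d] is truthy if the path crosses the dateline of
    dimension d (the role of path_crosses_dateline(d) in the text).
    """
    sl = 0
    for d, crosses in enumerate(crosses_dateline):
        sl |= (1 if crosses else 0) << d
    return sl


def sl2vl(sl, cdir_oport):
    """Select the VL bit for an output port pointing in direction cdir_oport.

    cdir_oport is 0, 1, or 2: the torus coordinate direction of the port.
    """
    if cdir_oport not in (0, 1, 2):
        raise ValueError("coordinate direction must be 0, 1, or 2")
    return 0x1 & (sl >> cdir_oport)


# A path crossing the datelines of dimensions 0 and 2 gets SL 0b101.
sl = path_sl([True, False, True])
print(sl, [sl2vl(sl, d) for d in range(3)])  # prints: 5 [1, 0, 1]
```

Each output port only needs the crossing bit for its own direction, which is why three SL bits collapse into a single VL bit per port.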
163. en saved in /home/<username>/.ssh/id_rsa. Your public key has been saved in /home/<username>/.ssh/id_rsa.pub. The key fingerprint is: 38 10 29 0 4 08 00 4 0 50 0 05 44 7 9 05 <username>@host1 Step 2. Check that the public and private keys have been generated: host1$ cd /home/<username>/.ssh/ host1$ ls host1$ ls -la total 40 drwx------ 2 root root 4096 Mar 5 04:57 . drwxr-x--- 13 root root 4096 Mar 4 18:27 .. -rw------- 1 root root 1675 Mar 5 04:57 id_rsa -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub Step 3. Check the public key: host1$ cat id_rsa.pub ssh-rsa AAAAB3NzaClyc2EAAAABI WAAAQEA1zVY8VBHQh90kZN70A11bUQ74RxXm4 zHeczyVxpYHaDPyDmqezbYMKrCIVz d10bH ZkCOrpLYviU00UHd3fvNT Ms0gcGg08PysUf 12FyYjira2Plxyg mkHLGGqVut fEMmABZ3wNCUg6J2X 3G uiuSWXeubZmbXcMrP wAIWByfH8ajwo6A5SWioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorX NCOZBe4kTnUqm63nQ2zi1qVMdL9FrCmalxIOu9 SQUAjwONevaMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrod ZlfCQ <username>@host1 Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine: host1$ cat id_rsa.pub | xargs ssh host2 "echo >> /home/<username>/.ssh/authorized_keys2" <username>@host2's password: (enter password) For a local machine, simply add the key to authorized_keys2: host1$ cat id_rsa.pub >> authorized_keys2 Step 5. Test hos
164. erion its goal is to match a certain ULP (or a certain application on top of this ULP) PR/MPR request, and QoS Level has only one constraint: Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below. 8.6.4 Policy File Syntax Guidelines Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability. Comments start with the pound sign (#) and are terminated by EOL. Any keyword should be the first non-blank in the line, unless it is a comment. Keywords that denote section/subsection start have matching closing keywords. Having a QoS Level named "DEFAULT" is a must: it is applied to PR/MPR requests that did not match any of the matching rules. Any section/subsection of the policy file is optional. 8.6.5 Examples of Advanced Policy File As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here is an example of the shortest policy file: qos-levels qos-level name: DEFAULT sl: 0 end-qos-level end-qos-levels Port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, and there is no
165. es a work flow for local HCA (adapter) sniffing: Run ibdump with the desired options. Run the application whose traffic you wish to analyze. Stop ibdump (CTRL-C) or wait for the data buffer to fill (in mem-mode). Open Wireshark and load the generated file. How to Get Wireshark: Download the current release from www.wireshark.org for a Linux or Windows environment. See the ibdump_release_notes.txt file for more details. Although ibdump is a Linux application, the generated .pcap file may be analyzed on either operating system. Synopsis: ibdump [options] Output Files: -d, --ib-dev=<dev>: use RDMA device <dev> (default: first device found). The relevant devices can be listed by running the ibv_devinfo command. -i, --ib-port=<port>: use port <port> of IB device (default: 1). -w, --write=<file>: dump file name (default: sniffer.pcap); '-' stands for stdout, which enables piping to tcpdump or tshark. -o, --output=<file>: alias for the -w option. Do not use; kept for backward compatibility. -b, --max-burst=<log2 burst>: log2 of the maximal burst size that can be captured with no packet loss. Each entry takes MTU bytes of memory (default: 12, i.e., 4096 entries). -s, --silent: do not print progress indication. 1. Run ibdump Ap
166. est memory speed with fewest DIMMs and populate all memory channels for every CPU installed. For further information, please refer to your vendor's memory configuration instructions or memory configuration tool available online. 7.1.3 Recommended BIOS Settings These performance optimizations may result in higher power consumption. 7.1.3.1 General Set BIOS power management to Maximum Performance. 7.1.3.2 Intel Sandy Bridge Processors The following table displays the recommended BIOS settings in machines with Intel code name Sandy Bridge based processors. Table 10 - Recommended BIOS Settings for Intel Sandy Bridge Processors (BIOS Option: Values) General: Operating Mode/Power profile: Maximum Performance. Processor: C-States: Disabled; Turbo mode: Enabled; Hyper-Threading: HPC: disabled, Data Centers: enabled; CPU frequency select: Max performance. Memory: Memory speed: Max performance; Memory channel mode: Independent; Node Interleaving: Disabled / NUMA; Channel Interleaving: Enabled; Thermal Mode: Performance. 1. Hyper-Threading can increase message rate for multi-process applications by having more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when hyper-threading is enabled. 7.1.3.3 Intel Nehalem Westmere Pro
167. et detected in human readable form. Sample output: IO Unit Info: port LID: 0103 port GID: fe800000000000000002c90200402bd5 change ID: 0002 max controllers: 0x10 controller[ 1] GUID: 0002c90200402bd4 vendor ID: 0002c9 device ID: 005a44 IO class: 0100 ID: LSI Storage Systems SRP Driver 200400a0b81146a1 service entries: 1 service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1 b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command: ibsrpdm -d <umad device> 2. Assistance in creating an SRP connection a. To generate output suitable for utilization in the "echo" command of Section 4.1.2.2, add the -c option to ibsrpdm: ibsrpdm -c Sample output: id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 b. To establish a connection with an SRP Target using the output from the 'ibsrpdm -c' example above, execute the following command: echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the 'fdisk -l' command. srp_daemon The srp_daemon utility is based on ibsrp
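The connection string in the sample above is a comma-separated list of key=value pairs. The sketch below parses one such line and rebuilds it in the field order shown in the sample; the function names are hypothetical, the set of required keys is taken only from that sample, and writing the result to add_target is left to the caller.

```python
# Field order as it appears in the sample output in the text.
SRP_FIELDS = ("id_ext", "ioc_guid", "dgid", "pkey", "service_id")


def parse_target_line(line):
    """Split one 'key=value,key=value,...' line into a dict."""
    fields = {}
    for item in line.strip().split(","):
        key, sep, value = item.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed field: {item!r}")
        fields[key] = value
    return fields


def target_string(fields):
    """Rebuild the string to echo into the add_target file."""
    missing = [key for key in SRP_FIELDS if key not in fields]
    if missing:
        raise ValueError(f"missing field(s): {missing}")
    return ",".join(f"{key}={fields[key]}" for key in SRP_FIELDS)
```

Parsing first and rebuilding afterwards makes it easy to reject a truncated line before anything is written to sysfs.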
168. et_irq_affinity_cpulist.sh 0-1 eth2 set_irq_affinity_cpulist.sh 2-3 eth3 set_irq_affinity_cpulist.sh 4-5 eth4 set_irq_affinity_cpulist.sh 6-7 eth5 7.2.8 Tuning Multi-Threaded Forwarding To optimize NIC usage as IP forwarding: 1. Set the following options in /etc/modprobe.d/mlx4.conf. For MLNX_OFED-2.0.x: options mlx4_en inline_thold=0; options mlx4_core high_rate_steer=1. For MLNX_EN-1.5.10: options mlx4_en num_lro=0 inline_thold=0; options mlx4_core high_rate_steer=1. 2. Apply interrupt affinity tuning. 3. Forwarding on the same interface: set_irq_affinity_bynode.sh <numa node> <interface> 4. Forwarding from one interface to another: set_irq_affinity_bynode.sh <numa node> <interface1> <interface2> 5. Disable adaptive interrupt moderation and set static values using: ethtool -C <interface> adaptive-rx off 8 OpenSM Subnet Manager 8.1 Overview OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15). 8.2 Description opensm is an InfiniBand compliant Subnet Manager and Subnet
169. evice to boot from an iSCSI target: host host1 filename. For a ConnectX device with ports configured as InfiniBand, comment out the following line: option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; For a ConnectX device with ports configured as Ethernet, comment out the following line: hardware ethernet 00:02:c9:00:00:bb; A.11 WinPE Mellanox FlexBoot enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe. Appendix B: SRP Target Driver The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support many IO modes on real or virtual devices in the back end: 1. scst_vdisk fileio and blockio modes. This allows turning software RAID volumes, LVM volumes, IDE disks, block devices, and normal files into SRP LUNs. 2. NULLIO mode, which allows measuring the performance without sending IOs to real devices. B.1 Prerequisites and Installation 1. SRP target is part of the OpenFabrics OFED software stacks. Use the latest OFED distribution package to install SRP target. On distribu
170. fabric is routed with torus-2QoS. Torus-2QoS can provide unchanging path SL values in the presence of subnet manager failover, provided that all OpenSM instances have the same idea of dateline location. See torus-2QoS.conf(5) for details. Torus-2QoS will detect configurations of failed switches and links that prevent routing that is free of credit loops, and will log warnings and refuse to route. If no_fallback was configured in the list of OpenSM routing engines, then no other routing engine will attempt to route the fabric. In that case, all paths that do not transit the failed components will continue to work, and the subset of paths that are still operational will continue to remain free of credit loops. OpenSM will continue to attempt to route the fabric after every sweep interval and after any change (such as a link up) in the fabric topology. When the fabric components are repaired, full functionality will be restored. In the event OpenSM was configured to allow some other engine to route the fabric if torus-2QoS fails, then credit loops and message deadlock are likely if torus-2QoS had previously routed the fabric successfully. Even if the other engine is capable of routing a torus without credit loops, applications that built connections with path SL values granted under torus-2QoS will likely experience message deadlock under routing generated by a different engine, unless they repath. To verify that a torus fabric is routed free of credit
171. ffic cannot be steered. It is treated as 'other' protocol by hardware (from the first packet) and not considered as UDP traffic. We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes. 4.13 Single Root IO Virtualization (SR-IOV) Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. Mellanox ConnectX-3 adapter cards are capable of exposing 63 virtual instances, called Virtual Functions (VFs). These virtual functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals that of the Physical Function. SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual machines direct hardware access to network resources, hence increasing their performance. In this chapter, we demonstrate the setup and configuration of SR-IOV in a Red Hat Linux environment using the Mellanox ConnectX VPI adapter card family. 4.13.1 System Requirements To set up an SR-IOV environment, the following is required: MLNX_OFED Driver; a server/blade with an SR IOV capa
172. file is prof_sel. The supported values for profiles are: 0: number of resources; 1: medium number of resources; 2: large number of resources (default). Appendix E: Lustre Compilation over MLNX_OFED To compile Lustre version 2.3.65 and higher: ./configure --with-o2ib=/usr/src/ofa_kernel/default/ followed by: make rpms To compile older Lustre versions: EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" ./configure --with-o2ib=/usr/src/ofa_kernel/default/ followed by: EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" make rpms For Lustre 2.1.3, due to a duplicate definition of the INVALID_UID macro, the following patch must be applied:
--- lustre-2.1.3/lustre/include/lustre_cfg.h 2012-09-17 14:26:46.000000000 +0200
+++ lustre-2.1.3/lustre/include/lustre_cfg.h.new 2013-09-07 10:45:07.121772824 +0200
@@ -288,7 +288,9 @@
 #include <lustre/lustre_user.h>
+#ifndef INVALID_UID
 #define INVALID_UID (-1)
+#endif
173. fo 2. Create and burn the composite image. Run: flint -dev <mst device name> brom <expansion ROM image> Example on Linux: flint -dev /dev/mst/mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom Example on Windows: flint -dev mt26428_pci_cr0 brom ConnectX_26428_ROM-X_X_XXX.mrom Removing the Expansion ROM Image Remove the expansion ROM image. Run: flint -dev <mst device name> drom When removing the expansion ROM image, you also remove FlexBoot from the boot device list. A.3 Preparing the DHCP Server in Linux Environment The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB. A.3.1 Installing the DHCP Server Install the DHCP client/server embedded within the Linux distribution. 1. Depending on the OS, the device name may be preceded by a prefix. A.3.2 Configuring the DHCP Server A.3.2.1 For ConnectX Family Devices When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID (all separated by colons and represented in hexadecima
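The composition of the client identifier described above (fixed prefix followed by the 8-byte port GUID, colon-separated hexadecimal bytes) can be sketched as follows. The function name is hypothetical, and accepting the GUID with or without colons or a leading 0x is a convenience assumed here, not something the manual specifies.

```python
# Prefix quoted in the text for ConnectX family devices.
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def dhcp_client_identifier(port_guid):
    """Build the DHCP client identifier for an 8-byte port GUID."""
    digits = port_guid.lower().replace(":", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("port GUID must be exactly 8 bytes of hexadecimal")
    # Split the 16 hex digits into 8 colon-separated bytes.
    guid = ":".join(digits[i:i + 2] for i in range(0, 16, 2))
    return f"{CLIENT_ID_PREFIX}:{guid}"
```

The result is a 20-byte value: 12 prefix bytes plus the 8 GUID bytes, which is the form expected by a dhcp-client-identifier option.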
174. for SHMEM programs running over InfiniBand. The latest ScalableSHMEM software can be downloaded from the Mellanox website. 5.1.2 Running SHMEM with FCA The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager, and MPI nodes without requiring additional hardware. The FCA manager creates a topology-based collective tree and orchestrates an efficient collective operation using the switch-based CPUs on the MPI/ScalableSHMEM nodes. FCA accelerates MPI/ScalableSHMEM collective operation performance by up to 100 times, providing a reduction in the overall job runtime. Implementation is simple and transparent during the job runtime. FCA is disabled by default and must be configured prior to using it from ScalableSHMEM. To enable FCA by default in ScalableSHMEM: 1. Edit the /opt/mellanox/openshmem/2.2/etc/openmpi-mca-params.conf file. 2. Set the scoll_fca_enable parameter to 1: scoll_fca_enable=1 3. Set the scoll_fca_np parameter to 0: scoll_fca_np=0 To enable FCA in the shmemrun command line, add the following: -mca scoll_fca_enable 1 -mca scoll_fca_enable_np 0 To disable FCA: -mca scoll_fca_e
175. for all users. 2. The mpi-selector command. This command is a CLI equivalent of mpi-selector-menu, allowing for the same functionality as mpi-selector-menu but without the interactive menus and prompts. It is suitable for scripting. 5.2.4 Compiling MPI Applications Compiling MVAPICH Applications: Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html. To review the default configuration of the installation, check the default configuration file: /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf Compiling Open MPI Applications: Please refer to http://www.open-mpi.org/faq/?category=mpi-apps. 5.3 MellanoX Messaging MellanoX Messaging (MXM) provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including: multiple transport support, including RC and UD; proper management of HCA resources and memory structures; efficient memory registration; one-sided communication semantics; connection management; receive-side tag matching; intra-node shared memory communication. These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel commun
176. from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command: host1# ping -c 5 11.4.3.176 PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data. 64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms 64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms 64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms 64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms 64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms --- 11.4.3.176 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 3999ms rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2 4.3.6 Bonding IPoIB To create an interface configuration script for the ibX and bondX interfaces, use the standard syntax for your OS. Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver. Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.g., ifcfg-ib0). The only meaningful bonding policy in IPoIB is High Availability (bonding mode number 1, or active-backup). The bonding parameter fail_over_mac is meaningless in IPoIB interfaces, hence the only supported value is the default: 0 (or "none" in SLES11). For a persistent bonding IPoIB network configuration, use the same Linux Network Scripts semantics, with the following exceptions/additions: In th
177. fy an individual configuration for each HCA. This parameter should be specified as an options line in the file /etc/modprobe.d/mlx4_core.conf. For example, to configure all HCAs to have Port1 as IB and Port2 as ETH, insert the following line: options mlx4_core port_type_array=1,2
To set HCAs individually, you may use a string of Domain:bus:device.function-x;y. For example, if you have a pair of HCAs whose PFs are 0000:04:00.0 and 0000:05:00.0, you may specify that the first will have both ports as IB and the second will have both ports as ETH as follows: options mlx4_core port_type_array=0000:04:00.0-1;1,0000:05:00.0-2;2
Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.
4.13.7.2 Virtual Function InfiniBand Ports
Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments. Sharing the same physical port(s) among multiple vHCAs is achieved as follows: Each vHCA port presents its own virtual GID table. The virtual GID table for the InfiniBand ports consists of a single entry at index 0 that maps to a unique index in the physical GID table. The vHCA of the PF maps to physical GID index 0. To obtain GIDs for other vHCAs, alias GU
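The two forms of the port type setting described above can be generated mechanically. The following is a minimal sketch, assuming the type codes implied by the per-HCA example (both ports IB is written 1;1 and both ports ETH is written 2;2, so 1 = IB and 2 = ETH); the helper names `port_type_array` and `options_line` are hypothetical and not part of the driver or of any Mellanox tool.

```python
# Port type codes as implied by the per-HCA example in the text
# (1;1 = both ports IB, 2;2 = both ports ETH). Treat this as an assumption.
PORT_TYPES = {"ib": 1, "eth": 2}


def port_type_array(config):
    """Build the value of the port type setting.

    config is either a (port1, port2) tuple applied to all HCAs, or a dict
    mapping a PF address 'dddd:bb:dd.f' to a (port1, port2) tuple.
    """
    def pair(ports, sep):
        return sep.join(str(PORT_TYPES[p]) for p in ports)

    if isinstance(config, dict):
        # Per-HCA form: Domain:bus:device.function-x;y, entries joined by ','
        return ",".join(f"{bdf}-{pair(ports, ';')}" for bdf, ports in config.items())
    # Uniform form: x,y for every HCA on the host
    return pair(config, ",")


def options_line(config):
    """Return the options line to place in the modprobe configuration file."""
    return f"options mlx4_core port_type_array={port_type_array(config)}"


print(options_line(("ib", "eth")))
print(options_line({"0000:04:00.0": ("ib", "ib"), "0000:05:00.0": ("eth", "eth")}))
```

Keeping the uniform and per-HCA forms in one function makes it harder to mix the two separators by accident: a comma between the two ports in the uniform form, a semicolon between the two ports in the per-HCA form.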
178. g mft
Preparing srptools
Preparing rds-tools
Preparing rds-devel
Preparing ibutils2
Preparing ibutils
Preparing cc_mgr
Preparing dump_pr
Preparing ar_mgr
Preparing ibdump
Preparing infiniband-diags
Preparing infiniband-diags-compat
Preparing qperf
Preparing fca
INFO: updating
IMPORTANT NOTE:
The FCA Manager and FCA MPI Runtime library are installed in /opt/mellanox/fca directory.
The FCA Manager will not be started automatically.
To start FCA Manager now, type: /etc/init.d/fca_managerd start
There should be a single process of FCA Manager running per fabric.
To start FCA Manager automatically after boot, type: /etc/init.d/fca_managerd install_service
Check /opt/mellanox/fca/share/doc/fca/README.txt for quick
179. get timeouts on the AR-related queries to these switches.
8.8.2 Installing the Adaptive Routing
Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as a part of the Mellanox OFED installation.
8.8.3 Running Subnet Manager with Adaptive Routing Manager
Adaptive Routing (AR) Manager can be enabled/disabled through the SM options file.
8.8.3.4 Enabling Adaptive Routing
To enable Adaptive Routing, perform the following:
1. Create the Subnet Manager options file. Run: opensm -c <options-file-name>
2. Add armgr to the event_plugin_name option in the file:
# Event plugin name(s)
event_plugin_name armgr
3. Run Subnet Manager with the new options file: opensm -F <options-file-name>
Adaptive Routing Manager can read an options file with various configuration parameters to fine-tune the AR mechanism and AR Manager behavior. The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To provide an alternative location, perform the following:
1. Add armgr --conf_file <ar-mgr-options-file-name> to the event_plugin_options option in the file:
# Options string that would be passed to the plugin(s)
event_plugin_options armgr --conf_file <ar-mgr-options-file-name>
2. Run Subnet Manager with the new options file: opens
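The two options-file edits described above (naming the plug-in, and optionally pointing it at an alternative AR Manager options file) amount to replacing or appending two keyword lines. A minimal sketch follows; it only manipulates text and does not run opensm. The function names are hypothetical, the `--conf_file` spelling is an assumption about how the plug-in option is written, and the `(null)` placeholder used in the usage line is an assumed default value, not something stated in this manual.

```python
def set_option(conf_text, key, value):
    """Replace the 'key ...' line in an options file, or append it if absent."""
    out, found = [], False
    for line in conf_text.splitlines():
        # Compare only the first whitespace-separated token of each line,
        # so comment lines that merely mention the key are left untouched.
        if line.split(None, 1)[:1] == [key]:
            out.append(f"{key} {value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


def enable_armgr(conf_text, ar_conf_file=None):
    """Return options-file text with the armgr plug-in enabled."""
    text = set_option(conf_text, "event_plugin_name", "armgr")
    if ar_conf_file is not None:
        # Assumed option spelling; adjust to match your installed plug-in.
        text = set_option(text, "event_plugin_options",
                          f"armgr --conf_file {ar_conf_file}")
    return text


print(enable_armgr("# Event plugin name(s)\nevent_plugin_name (null)\n"))
```

Editing by keyword rather than appending blindly avoids leaving two conflicting event_plugin_name lines in the same file.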
180. gies. The InfiniBand Architecture Specification that Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Parameters, Physical Layers and Management
Table 4 - Reference Documents
Document Name | Description
Firmware Release Notes for Mellanox adapter devices | See the Release Notes PDF file relevant to your adapter device under the docs/ folder of the installed package.
MFT User's Manual | Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.
MFT Release Notes | Release Notes for the Mellanox Firmware Tools. See under the docs/ folder of the installed package.
Support and Updates Webpage
Please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.
1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers. All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are support
181. > vdisk vdisk
echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
Example 2: working with scst_vdisk FILEIO mode (using the md0 device and the file 10G-file)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 10G-file" > /proc/scsi_tgt/vdisk/vdisk
e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
2. Run:
For all distributions except SLES 11: > modprobe ib_srpt
For SLES 11: > modprobe -f ib_srpt
For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:
ib_srpt: no symbol version for scst_unregister
ib_srpt: Unknown symbol scst_unregister
ib_srpt: no symbol version for scst_register
ib_srpt: Unknown symbol scst_register
ib_srpt: no symbol version for scst_unregister_target_template
ib_srpt: Unknown symbol scst_unregister_target_template
B. On Initiator Machines
On Initiator machines, manually perform the following steps:
1. Run: modprobe ib_srp
2. Run: ibsrpdm -d /dev/infiniband/umadX (to discover a new SRP target)
umad0: port 1 of the first HCA
umad1: port 2 of the first HCA
umad2
182. guration for VMA use (to be used with any installation parameter)
--guest: Install packages required by guest OS
--hypervisor: Install packages required by hypervisor OS
-v|-vv|-vvv: Set verbosity level
--umad-dev-rw: Grant non-root users read/write permission for umad devices instead of default
--enable-affinity: Run mlnx_affinity script upon boot
--disable-affinity: Disable mlnx_affinity script (Default)
--enable-sriov: Burn SR-IOV enabled firmware
--add-kernel-support: Add kernel support (Run mlnx_add_kernel_support.sh)
--skip-distro-check: Do not check MLNX_OFED vs Distro matching
--total-vfs <0-63>: Maximum number of Virtual Functions in SR-IOV mode. Default: 16. Implies --enable-sriov
--hugepages-overcommit: Setting 80% of MAX_MEMORY as overcommit for huge page allocation
Per priority bit mask (uint). Default: 0
-q: Set quiet - no messages will be printed
2.3.2.1 mlnxofedinstall Return Codes
Table 2 lists the mlnxofedinstall script return codes and their meanings.
Table 2 - mlnxofedinstall Return Codes
Return Code | Meaning
0 | The installation ended successfully
1 | The installation failed
2 | No firmware was found for the adapter device
22 | Invalid parameter
28 | Not enough free space
171 | Not applicable to this system configuration. This can occur when the required hardware is not present on the system.
172 | Prerequisites are not met. For example, missing the required software installed or the hardwar
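A wrapper script that drives the installer can turn the numeric return code into a readable message. The sketch below is a plain lookup over the return codes listed in Table 2 as far as they are visible in this excerpt; the table may continue beyond code 172, so unknown codes are reported rather than treated as errors in the script itself. The names `RETURN_CODES` and `describe_return_code` are hypothetical.

```python
# Return codes as listed in Table 2 of this excerpt (the table may continue).
RETURN_CODES = {
    0: "The installation ended successfully",
    1: "The installation failed",
    2: "No firmware was found for the adapter device",
    22: "Invalid parameter",
    28: "Not enough free space",
    171: "Not applicable to this system configuration",
    172: "Prerequisites are not met",
}


def describe_return_code(code):
    """Map an installer exit status to the meaning given in Table 2."""
    return RETURN_CODES.get(code, f"Unrecognized return code {code}")


def succeeded(code):
    """Only return code 0 means the installation ended successfully."""
    return code == 0


for status in (0, 28, 99):
    print(status, describe_return_code(status))
```

In practice the status would come from whatever mechanism launched the installer (for example the `returncode` attribute of a completed `subprocess.run` call).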
183. h device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool).
Debug utilities: A set of debug utilities (e.g., itrace, mstdump, isw, and i2c). For additional details, please refer to the MFT User's Manual under docs/.
1.4 Quality of Service
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB and Eth network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager".
1.5 RDMA over Converged Ethernet (RoCE)
RoCE allows InfiniBand (IB) transport applications to work over an Ethernet network. RoCE encapsulates the InfiniBand transport and the GRH headers in Ethernet packets bearing a dedicated ether type (0x8915). Thus, any verbs application that works in an InfiniBand fabric can work in an Ethernet fabric as well. RoCE is enabled only for drivers that support VPI (currently, only mlx4). When working with RDMA applications over the Ethernet link layer, the following points should be noted: The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that require communication with the SM are managed in a different way in RoCE. This does not affect the API
184. h_ipoib interface:
cat /sys/class/net/ethX/eth/vifs
For example:
cat /sys/class/net/eth5/eth/vifs
SLAVE=ib0.1 MAC=9a:c2:1:d7:3b:63 VLAN=N/A
SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A
SLAVE=ib0.3 MAC=52:54:00:60:55:89 VLAN=N/A
Each ethX interface has at least one ibX.Y slave to serve the PIF itself. In the VIFs list of ethX you will notice that ibX.1 is always created to serve applications running from the Hypervisor on top of the ethX interface directly. For InfiniBand applications that require native IPoIB interfaces (e.g. CMA), the original IPoIB interfaces ibX can still be used. For example, CMA and ethX drivers can co-exist and make use of IPoIB ports: CMA can use ib0, while the eth0 eIPoIB interface will use the ibX.Y interfaces.
To see the list of eIPoIB interfaces:
cat /sys/class/net/eth_ipoib_interfaces
For example:
cat /sys/class/net/eth_ipoib_interfaces
eth4 over IB port: ib0
eth5 over IB port: ib1
The example above shows two eIPoIB interfaces, where eth4 runs traffic over ib0, and eth5 runs traffic over ib1.
Figure 3: An Example of a Virtual Network
The example above shows a few IPoIB instances that server
185. here "x" is the root switch and each "+" is a non-root switch.
[Diagram: full-fabric spanning tree on the 2D 6x5 torus, x = 0 to 5, y = 0 to 4]
For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn. For traffic routed from tip to root, and some traffic routed through the root, turns are not legal DOR turns. However, to construct a credit loop, the union of multicast routing on this spanning tree with DOR unicast routing can only provide 3 of the 4 turns needed for the loop. In addition, if none of the above spanning tree branches crosses a dateline used for unicast credit loop avoidance on a torus, and if multicast traffic is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to differentiate QoS level), then multicast traffic also cannot contribute to the "ring" credit loops that are otherwise possible in a torus. Torus-2QoS uses these ideas to create a master spanning tree. Every multicast group spanning tree will be constructed as a subset of the master tree, with the same root as the master tree. Such multicast group spanning trees will in general not be optimal for groups which are a subset of the full fabric. However, this compromise must be made to enable support for two QoS levels on a torus while preventing credit loops. In the presence of link or switch failures that result in a fabric for which torus
186. hmalloc_use_hugepages 5. If using compound pages is not possible, then the user will fall back to the regular hugepages mechanism.
To force use of the compound pages allocator, run the following command:
/opt/mellanox/openshmem/2.1/bin/shmemrun -mca shmalloc_use_hugepages 5 -x MR_FORCE_CONTIG_PAGES=1
For further information on Contiguous Pages, please refer to Section 4.9, "Contiguous Pages".
5.1.5 Running ScalableSHMEM Application
The ScalableSHMEM framework contains the shmemrun utility, which launches the executable from a service node to compute nodes. This utility accepts the same command line parameters as mpirun from the OpenMPI package. For further information, please refer to the OpenMPI MCA parameters documentation at http://www.open-mpi.org/faq/?category=running. Run "shmemrun --help" to obtain the ScalableSHMEM job launcher runtime parameters. ScalableSHMEM contains support for the environment module system (http://modules.sf.net). The modules configuration file can be found at: /opt/mellanox/openshmem/2.2/etc/shmem_modulefile
5.2 Message Passing Interface
5.2.1 Overview
Mellanox OFED for Linux includes the following Message Passing Interface (MPI) implementations over InfiniBand:
Open MPI 1.4.6 & 1.6.1: an open source MPI-2 implementation by the Open MPI Project
OSU MVAPICH2 1.7: an MPI-1 implementation by Ohio State University
These MPI
187. 9.7 ibv_devinfo
9.8 ibdev2netdev
9.9 ibstatus
9.10 ibportstate
9.11 ibroute
9.12 smpquery
9.13 perfquery
9.14 ibcheckerrs
9.15 mstflint
9.16 ibv_asyncwatch
9.17 ibdump
Appendix A Mellanox FlexBoot
A.1 Overview
A.2 Burning the Expansion ROM Image
A.3 Preparing the DHCP Server in Linux Environment
A.4 Subnet Manager - OpenSM
A.5 TFTP Server
A.6 BIOS Configuration
A.7 Operation
A.8 Command Line Interface (CLI)
A.9 Diskless Machines
A.10 iSCSI Boot
A.11 WinPE
Appendix B SRP Target Driver
188. ication libraries. The latest MXM software can be downloaded from the Mellanox website.
5.3.1 Compiling OpenMPI with MXM
Step 1. Install MXM from RPM: rpm -ihv mxm-x.y.z-1.x86_64.rpm
MXM will be installed automatically in the /opt/mellanox/mxm folder.
Step 2. Enter the OpenMPI source directory and run:
cd $OMPI_HOME
./configure --with-mxm=/opt/mellanox/mxm <other configure parameters>
make all && make install
MLNX_OFED v2.0 or later comes with a pre-installed version of MXM v1.1 and OpenMPI compiled with MXM v1.1.
To upgrade MLNX_OFED v2.0 or later with a newer MXM:
Step 1. Remove MXM v1.1: rpm -e mxm
Step 2. Remove the pre-compiled OpenMPI: rpm -e mlnx-openmpi_gcc
Step 3. Install the new MXM and compile OpenMPI with it.
To run OpenMPI without MXM, run: mpirun -mca mtl ^mxm <...>
When upgrading to MXM v1.5, OMPI compiled with MXM v1.1 should be recompiled with MXM v1.5.
5.3.2 Enabling MXM in OpenMPI
MXM is automatically selected by OpenMPI when the Number of Processes (NP) is greater than or equal to 128. To enable MXM for any NP, use the following OpenMPI parameter: -mca mtl_mxm_np <number>
To activate MXM for any NP, run: mpirun -mca mtl_mxm_np 0 <other mpirun parameters>
5.3.3 Tuning MXM Settings
The default MXM settings are already optimized. To check the av
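The selection rule described above can be read as a simple threshold: MXM is used when the number of processes reaches the value of the MXM NP parameter, which defaults to 128, and setting that parameter to 0 therefore enables MXM for every job size. The sketch below encodes that reading; treating the parameter as a "greater than or equal" threshold for values other than 0 and 128 is an interpretation of the text, and the helper names are hypothetical.

```python
DEFAULT_MTL_MXM_NP = 128  # default threshold stated in the text


def mxm_selected(num_procs, mtl_mxm_np=DEFAULT_MTL_MXM_NP):
    """True when OpenMPI would pick MXM under the threshold reading above."""
    return num_procs >= mtl_mxm_np


def mpirun_args(num_procs, mtl_mxm_np=None):
    """Build an mpirun argument list, optionally overriding the threshold."""
    args = ["mpirun", "-np", str(num_procs)]
    if mtl_mxm_np is not None:
        args += ["-mca", "mtl_mxm_np", str(mtl_mxm_np)]
    return args


print(mxm_selected(64))                  # below the default threshold
print(mxm_selected(64, mtl_mxm_np=0))    # threshold 0 enables MXM for any NP
print(" ".join(mpirun_args(64, mtl_mxm_np=0)))
```

The argument list is only assembled, not executed, so the sketch can be checked on a machine without OpenMPI installed.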
189. ids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
Lid Out Destination Port Info
0x0002 000 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 007 : (Channel Adapter portguid 0x0002c90300001039)
0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a)
0x0008 008 : (Channel Adapter portguid 0x0002c902002582cd)
5 valid lids dumped
2. Dump all Lids with valid out ports of the switch with Lid 2:
> ibroute 2
Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
Lid Out Destination Port Info
0x0002 000 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 007 : (Channel Adapter portguid 0x0002c90300001039)
0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a)
0x0008 008 : (Channel Adapter portguid 0x0002c902002582cd)
5 valid lids dumped
3. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2:
> ibroute 2 3 7
190. imilar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query didn't match any other rule. All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section. Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.
8.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
ipoib : <SL>
ipoib, pkey 0x7fff : <SL>
any, pkey 0x7fff : <SL>
8.6.6.2 SDP
An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:
sdp : <SL>
any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
8.6.6.3 RDS
Similar to SDP, an RDS PR query is matched by Servic
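The SDP Service ID layout described above (a fixed prefix with the remote TCP/IP port in the low 16 bits) explains why the whole SDP range runs from 0x0000000000010000 to 0x000000000001ffff. A minimal sketch of that arithmetic follows; the function names are hypothetical and are not part of OpenSM.

```python
SDP_SERVICE_ID_BASE = 0x0000000000010000  # 0x000000000001PPPP with PPPP = 0000


def sdp_service_id(tcp_port):
    """Return the SDP Service ID for a remote TCP/IP port (0..65535)."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError("TCP port must fit in 4 hex digits")
    return SDP_SERVICE_ID_BASE | tcp_port


def is_sdp_service_id(service_id):
    """True when a Service ID falls inside the SDP range given in the text."""
    return SDP_SERVICE_ID_BASE <= service_id <= SDP_SERVICE_ID_BASE | 0xFFFF


def fmt(service_id):
    """Format a Service ID as 16 hex digits, as written in the policy file."""
    return f"0x{service_id:016x}"


print(fmt(sdp_service_id(0)))       # lowest Service ID in the SDP range
print(fmt(sdp_service_id(0xFFFF)))  # highest Service ID in the SDP range
print(fmt(sdp_service_id(22)))      # port 22 is 0x0016
```

A rule restricted to a single port, or to a sub-range of ports, can be written by computing the two endpoints the same way.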
191. implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 7 lists some useful MPI links.
Table 7 - Useful MPI Links
MPI Standard | http://www-unix.mcs.anl.gov/mpi
Open MPI | http://www.open-mpi.org
MVAPICH2 MPI | http://mvapich.cse.ohio-state.edu
MPI Forum | http://www.mpi-forum.org
This chapter includes the following sections:
Section 5.2.2, "Prerequisites for Running MPI"
Section 5.2.3, "MPI Selector - Which MPI Runs"
Section 5.2.4, "Compiling MPI Applications"
5.2.2 Prerequisites for Running MPI
For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging in to and running commands on remote computers and/or servers.
5.2.2.1 SSH Configuration
The following steps describe how to configure password-less access over SSH:
1. Generate an ssh key on the initiator machine (host1):
host1$ ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has be
192. in a dimension that is not the last dimension routed by DOR. Here the failed switches are O and T:
[Diagram: 2D 6x5 torus (x = 0 to 5, y = 0 to 4) with failed switches O and T]
In a pristine fabric, torus-2QoS would generate the path from S to D as S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a VL with bit 1 set. In contrast to the earlier examples, the second hop after the illegal turn, q-r, can be used to construct a credit loop encircling the failed switches.
8.5.7.2 Multicast Routing
Since torus-2QoS uses all four available SL bits, and the three data VL bits that are typically available in current switches, there is no way to use SL/VL values to separate multicast traffic from unicast traffic. Thus, torus-2QoS must generate multicast routing such that credit loops cannot arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, w
193. interfaces. You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has IP and network addresses that differ from those of the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface. This section describes how to: create a subinterface (Section 4.3.4.1); remove a subinterface (Section 4.3.4.2).
4.3.4.1 Creating a Subinterface
In the following procedure, ib0 is used as an example of an IB subinterface. To create a child interface (subinterface), follow this procedure:
Step 1. Decide on the PKey to be used in the subnet (valid values can be 0 or any 16-bit unsigned value). The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 1 will give a PKey with the value 0x8001.
Step 2. Create a child interface by running:
host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child
Example:
host1$ echo 1 > /sys/class/net/ib0/create_child
This will create the interface ib0.8001.
Step 3. Verify the configuration of this interface by running:
host1$ ifconfig <subinterface>.<subinterface PKey>
Using the example of Step 2:
host1$ ifconfig ib0.8001
ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST
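The relationship between the value written to create_child, the actual PKey, and the resulting interface name can be checked with a few lines of arithmetic. The sketch below follows the rule stated in Step 1 (the actual PKey is the chosen value with the most significant of 16 bits set) and the naming seen in the example (parent name, a dot, then the PKey as four hex digits). The function names are hypothetical, and lowercase hex in the interface name is an assumption based on the single example shown.

```python
PKEY_MSB = 0x8000  # most significant bit of a 16-bit PKey


def actual_pkey(value):
    """Return the PKey actually used for a value written to create_child."""
    if not 0 <= value <= 0xFFFF:
        raise ValueError("PKey value must be a 16-bit unsigned number")
    return value | PKEY_MSB


def child_interface_name(parent, value):
    """Return the expected child interface name, e.g. ib0 and 1 -> ib0.8001."""
    return f"{parent}.{actual_pkey(value):04x}"


print(hex(actual_pkey(1)))
print(child_interface_name("ib0", 1))
```

Because the top bit is always set, two written values that differ only in that bit (for example 0x0001 and 0x8001) name the same child interface.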
194. ion. By default, the common switch in a torus seed is taken as the origin of the coordinate system used to describe switch location. The position parameter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed.
next_seed
If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification. For maximum resiliency, no seed specification should share a switch with any other seed specification. Multiple seed specifications should use dateline configuration to ensure that torus-2QoS can grant path SL values that are constant, regardless of which seed was used to initiate topology discovery.
portgroup_max_ports max_ports
This keyword specifies the maximum number of parallel inter-switch links, and also the maximum number of host ports per switch, that torus-2QoS can accommodate. The default value is 16. Torus-2QoS will log an error message during topology discovery if this parameter needs to be increased. If this keyword appears multiple times, the last instance prevails.
port_order p1 p2 p3 ...
This keyword specifies the order in which CA ports on a destination switch are visited when computing routes. When the fabric contains switches connected
195. ion driver (probe_vf).
options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1
Parameter | Recommended Value
num_vfs |
If absent, or zero: the SR-IOV mode is not enabled in the driver, hence no VFs will be available.
If its value is a single number in the range of 0-63: the driver will enable the num_vfs VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
If its format is a string: the string allows the user to specify the num_vfs parameter separately per installed HCA. Its format is bb:dd.f-v,bb:dd.f-v,... where bb:dd.f = bus:device.function of the PF of the HCA, and v = number of VFs to enable for that HCA.
This parameter can be set in one of the following ways. For example:
num_vfs=5: The driver will enable 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
num_vfs=00:04.0-5,00:07.0-8: The driver will enable 5 VFs on the HCA positioned in BDF 00:04.0 and 8 on the one in 00:07.0.
Note: PFs not included in the above list will not have SR-IOV enabled.
port_type_array | Specifies the protocol type of the ports. It is either one array of 2 port types 't1,t2' for all devices, or a list of BDF to port_type_array 'bb:dd.f-t1;t2,...' (string).
probe_vf |
If absent, or zero: no VFs will be used by the PF driver.
If its value is a single number in the range of 0-63: Physical Function drive
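The two accepted shapes of the VF count parameter (a single number for all HCAs, or a per-HCA list of bus:device.function-count entries) can be validated before they are written into a configuration file. The parser below is a minimal sketch of the format as described in the table; the function name is hypothetical, and applying the 0 to 63 range to the per-HCA counts as well as to the single number is an assumption.

```python
def parse_num_vfs(value):
    """Parse a VF count setting.

    Returns an int for the single-number form, or a dict mapping
    'bb:dd.f' to the number of VFs for the per-HCA string form.
    """
    def check(count):
        if not 0 <= count <= 63:
            raise ValueError(f"VF count {count} is outside the range 0-63")
        return count

    if "-" not in value:
        # Single number applied to all ConnectX HCAs on the host.
        return check(int(value))

    per_hca = {}
    for entry in value.split(","):
        # Each entry is bb:dd.f-v; split on the last '-' to isolate v.
        bdf, count = entry.rsplit("-", 1)
        per_hca[bdf] = check(int(count))
    return per_hca


print(parse_num_vfs("5"))
print(parse_num_vfs("00:04.0-5,00:07.0-8"))
```

A PF that does not appear in the returned dictionary gets no VFs, matching the note that PFs not included in the list will not have SR-IOV enabled.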
196. is performed instead (B0 Steering). When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the flow steering table for the guest/master.
To enable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Set the parameter log_num_mgm_entry_size to -1 by writing the option mlx4_core log_num_mgm_entry_size=-1.
Step 3. Restart the driver.
To disable Flow Steering:
Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Remove the options mlx4_core log_num_mgm_entry_size=-1.
Step 3. Restart the driver.
4.12.2 Flow Domains and Priorities
Flow steering defines the concept of domain and priority. Each domain represents a user agent that can attach a flow. The domains are prioritized. A higher priority domain will always supersede a lower priority domain when their flow specifications overlap. Setting a lower priority value will result in higher priority. In addition to the domain, there is a priority within each of the domains. Each domain can have at most 2^12 priorities, in accordance with its needs. The following are the domains, in descending order of priority:
User Verbs: allows a user application QP to be attached to a specified flow when using the ibv_create_flow and ibv_destroy_flow verbs.
ibv_create_flow
struct ibv_flow *ibv_create_flow(struct ibv_qp *qp, struct ibv_flow_attr *flow)
Input parameters: struct ibv_qp - the attached QP.
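The ordering rule for overlapping flow specifications can be illustrated with a small model: the domain is compared first, and the priority value inside the domain breaks ties, with a lower value meaning a higher priority. The sketch below is only a model of that rule and not driver code. The numeric domain ranks are hypothetical (rank 0 stands for the highest priority domain), and reading "lower value wins" as applying within a domain is an interpretation of the text.

```python
MAX_PRIORITIES_PER_DOMAIN = 2 ** 12  # upper bound stated in the text


def winning_rule(rules):
    """Pick the rule that takes effect among overlapping flow specifications.

    rules is a list of (domain_rank, priority, name) tuples, where a lower
    domain_rank is a higher priority domain and a lower priority value is a
    higher priority within that domain.
    """
    for _, priority, name in rules:
        if not 0 <= priority < MAX_PRIORITIES_PER_DOMAIN:
            raise ValueError(f"priority of {name!r} exceeds the per-domain limit")
    # Tuples compare element by element: domain first, then priority.
    return min(rules, key=lambda rule: (rule[0], rule[1]))[2]


overlapping = [
    (1, 0, "lower-domain rule"),        # best priority, but lower domain
    (0, 4095, "higher-domain rule"),    # worst priority, but higher domain
]
print(winning_rule(overlapping))
```

The example shows that a rule in a higher priority domain wins even when its in-domain priority is the weakest one allowed.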
197. ith torus-2QoS. Since SL to VL map configuration must be under the complete control of torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and will be ignored, and a warning will be generated. Torus-2QoS uses VL values 0-3 to implement one of its supported QoS levels, and VL values 4-7 to implement the other. Hard-to-diagnose application issues may arise if traffic is not delivered fairly across each of these two VL ranges. Torus-2QoS will detect and warn if VL arbitration is configured unfairly across VLs in the range 0-3, and also in the range 4-7. Note that the default OpenSM VL arbitration configuration does not meet this constraint, so all torus-2QoS users should configure VL arbitration via qos_vlarb_high, qos_vlarb_low, etc.
8.5.7.5 Operational Considerations
Any routing algorithm for a torus IB fabric must employ path SL values to avoid credit loops. As a result, all applications run over such fabrics must perform a path record query to obtain the correct path SL for connection setup. Applications that use rdma_cm for connection setup will automatically meet this requirement. If a change in fabric topology causes changes in the path SL values required to route without credit loops, in general all applications would need to repath to avoid message deadlock. Since torus-2QoS has the ability to reroute after a single switch failure without changing path SL values, repathing by running applications is not required when the
198. ities is displayed. This report includes:
SM report
Number of nodes and systems
Hop-count information: maximal hop-count, an example path, and a hop-count histogram
All CA-to-CA paths traced
Credit loop report
mgid-mlid-HCAs multicast group and report
Partitions report
IPoIB report
In case the IB fabric includes only one CA, then CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.
Error Codes
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package
9.5 ibdiagpath - IB diagnostic path
ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path. The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified
199. itrd_ib
host1$ cd /tmp/initrd_ib
Step 3. Normally, the initrd image is zipped. Extract it using the following command:
host1$ gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_ib.
Step 4. Create a directory for the InfiniBand modules and copy them:
host1$ mkdir -p /tmp/initrd_ib/lib/modules/ib
host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
host1$ cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
host1$ cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modu
200. its will be used. When omitted, the P_Key will be autogenerated.
flag: used to indicate the IPoIB capability of this partition.
defmember=full|limited: specifies the default membership for the port GUID list. Default is limited.
Currently recognized flags are:
ipoib: indicates that this partition may be used for IPoIB; as a result, an IPoIB-capable MC group will be created.
rate=<val>: specifies the rate for this IPoIB MC group (default is 3 (10GBps)).
mtu=<val>: specifies the MTU for this IPoIB MC group (default is 4 (2048)).
sl=<val>: specifies the SL for this IPoIB MC group (default is 0).
scope=<val>: specifies the scope for this IPoIB MC group (default is 2 (link local)).
Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).
PortGUIDs list:
PortGUID: GUID of the partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.
There are two useful keywords for the PortGUID definition: 'ALL' means all end ports in this subnet; 'SELF' means the subnet manager's port. An empty list means that there are no ports in this partition.
Notes:
White space is permitted between delimiters.
The line can be wrapped after ':' after a Partition Definition and between.
PartitionName does not need to
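The flag defaults listed above (rate 3, mtu 4, sl 0, scope 2, limited default membership) can be made explicit by parsing a flag list into a complete set of values. The sketch below assumes the flags are written as a comma-separated list such as `ipoib, mtu=4`; the separator is not shown in this excerpt, so treat it as an assumption. The function and dictionary names are hypothetical and not part of OpenSM.

```python
# Defaults as stated in the flag descriptions above.
DEFAULT_FLAGS = {
    "ipoib": False,
    "defmember": "limited",
    "rate": 3,   # 10GBps
    "mtu": 4,    # 2048
    "sl": 0,
    "scope": 2,  # link local
}


def parse_partition_flags(text):
    """Parse a partition flag list and fill in the documented defaults."""
    flags = dict(DEFAULT_FLAGS)
    for token in (t.strip() for t in text.split(",")):
        if not token:
            continue
        if token == "ipoib":
            flags["ipoib"] = True
            continue
        key, sep, value = (part.strip() for part in token.partition("="))
        if not sep or key == "ipoib" or key not in DEFAULT_FLAGS:
            raise ValueError(f"unrecognized flag: {token!r}")
        if key == "defmember":
            if value not in ("full", "limited"):
                raise ValueError(f"bad membership: {value!r}")
            flags[key] = value
        else:
            # Base 0 accepts both decimal and 0x-prefixed hexadecimal values.
            flags[key] = int(value, 0)
    return flags


print(parse_partition_flags("ipoib, mtu=5, defmember=full"))
```

Seeing the effective values side by side helps when a partition's multicast group is created with an unexpected rate or MTU.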
201. l, it can also be explicitly referred to by any match rule.
IV) QoS Matching Rules (denoted by qos-match-rules)
Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, such that the first match takes precedence. Each rule has the name of the QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule. Queries can be matched by:
Source port group (whether a source port is a member of a specified group)
Destination port group (same as above, only for the destination port)
PKey
QoS class
Service ID
To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only a Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.
8.6.3 Simple QoS Policy Definition
A simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one crit
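The matching semantics described above (rules scanned in file order, every criterion of a rule must be satisfied, query fields the rule does not mention are ignored, and a default level applies when nothing matches) can be modelled in a few lines. The sketch below is an illustration of those semantics only and is not OpenSM code; the field names, level names, and data layout are hypothetical.

```python
def rule_matches(criteria, query):
    """A rule matches when ALL of its criteria are present in the query and agree.

    criteria maps a field name to the set of accepted values; query maps a
    field name to the value carried by the PR/MPR query. Fields that are in
    the query but not in the rule are disregarded.
    """
    return all(field in query and query[field] in accepted
               for field, accepted in criteria.items())


def qos_level_for(rules, query, default_level):
    """Scan rules in order of appearance; the first match takes precedence."""
    for level, criteria in rules:
        if rule_matches(criteria, query):
            return level
    return default_level


rules = [
    ("level-a", {"pkey": {0x7FFF}, "service_id": {0x10016}}),
    ("level-b", {"service_id": {0x10016}}),
]

# A query carrying only a Service ID cannot satisfy a rule that also needs a PKey.
print(qos_level_for(rules, {"service_id": 0x10016}, "default"))
# A query carrying both fields matches the first rule in file order.
print(qos_level_for(rules, {"service_id": 0x10016, "pkey": 0x7FFF}, "default"))
```

Ordering rules from most specific to least specific, as in the example, keeps a broad single-criterion rule from shadowing a narrower one.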
202. l digits.

Extracting the Port GUID, Method I
To obtain the port GUID, run the following commands. The commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
host1# mst start
host1# mst status
The device name will be of the form /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via the following query command:
flint -d <MST_DEVICE_NAME> q
Example with ConnectX-2 QDR (MHJH29B-XTR Dual 4X IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R6 HCA Card, CX4 Connectors) as the adapter device:
Image type: ConnectX
FW Version: 2.9.1000
Rom Info: type=PXE version=3.3.400 devid=26428 proto=VPI
Device ID: 26428
Description: Node Port1 Port2 Sys image
GUIDs: 0002c9030005cffa 0002c9030005cffb 0002c9030005cffc 0002c9030005cffd
MACs: 0002c905cffa 0002c905cffb
Board ID: (MT_0DD0110009)
VSD:
PSID: MT_0DD0110009
Assuming that FlexBoot is connected via Port 1, the Port GUID is 00:02:c9:03:00:05:cf:fb.

Extracting the Port GUID, Method II
An alternative method for obtaining the port GUID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 8 bytes can be captured from the boot session as shown in the figure below.
Mellanox ConnectX FlexBoot v3.3.400 iPXE 1.0.0 Open So
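Picking the port GUID out of the GUIDs row of the query output and writing it byte by byte can be sketched as below. The helper names are hypothetical, the column order (Node, Port1, Port2, Sys image) is taken from the Description row of the example, and the colon separator is an assumption about how the bytes are delimited.

```python
def format_guid(guid_hex):
    """Turn a 16-digit hex GUID into colon-separated bytes."""
    digits = guid_hex.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("expected 16 hex digits, got %r" % guid_hex)
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


def port_guid(guids_row, port):
    """guids_row holds the four values of the GUIDs line, in the order
    Node, Port1, Port2, Sys image; port is 1 or 2."""
    fields = guids_row.split()
    if len(fields) != 4 or port not in (1, 2):
        raise ValueError("unexpected GUIDs row or port number")
    return format_guid(fields[port])
```

With the row from the example, port 1 selects the second value, 0002c9030005cffb.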
203. l system Specifies the local device s port number used to connect to the IB fabric Specifies the local port GUID value of the port used to connect to the IB fabric If GUID given is 0 than ibdiagnet displays a list of possible port GUIDs and waits for user input Specifies opensm path records dump file path src dst to SL mapping generated by SM plugin ibdiagnet will use this mapping for MADs sending and credit loop check if r option selected Provides a report of the fabric qualities Indicates that UpDown credit loop checking should be done against automatically determined roots Specifies the directory where the output files will be placed default var tmp ibdiagnet2 Skip the executions of the given stage Applicable skip stages all dup guids dup node desc lids links sm pm nodes info speed width check pkey aguid Skip the load of the given library name Applicable skip plugins libibdiagnet_cable diag plugin libibdiagnet_cable diag plugin 2 1 1 Reset all the fabric PM counters If any of the provided PM is greater then its provided value than print it Specifies the seconds to wait between first counters sample and second counters sample If seconds given is 0 than no second counters sample will be done default 1 InfiniBand Fabric Diagnostic Utilities Rev 2 0 3 0 0 Provides BER test for each port Calculate
204. le FCA module under Scalable UPC export GASNET FCA ENABLE CMD LINE 1 gt To set FCA verbose level export GASNET FCA VERBOSE CMD LINE 10 gt To set the minimal number of processes threshold to activate FCA export GASNET FCA NP CMD LINE 1 Mellanox Technologies 107 Rev 2 0 3 0 0 HPC Features ScalableUPC contains modules configuration file http modules sf net which can be found at opt mellanox bupc 2 2 etc bupc modulefile Fr 5 5 3 Various Executable Examples The following are various executable examples gt To run a ScalableUPC application without FCA support 9 upcrun np 128 fca enable 0 executable filename Torun UPC applications with FCA enabled for any number of processes export GASNET FCA ENABLE CMD LINE 1 GASNET FCA NP CMD LINE 0 upcrun np 64 executable filename Torun UPC application on 128 processes verbose mode 9 upcrun np 128 fca enable 1 fca np 10 fca verbose 5 executable filename gt To run UPC application offload to FCA Barrier and Broadcast only 9 upcrun np 128 fca ops executable filename 108 Mellanox Technologies Rev 2 0 3 0 0 Mellanox Technologies 109 Rev 2 0 3 0 0 Working With VPI 6 Working With VPI VPI allows ConnectX ports to be independently configured as either IB or Eth 6 1 Port Type Management ConnectX ports can be individually configured to work as InfiniBand or Ethe
205. les/ib

Step 5: IB requires loading the IPv6 module. If you do not have it in your initrd, add it using the following command:
host1# cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules
Step 6: To load the modules, you need the insmod executable. If you do not have it in your initrd, add it using the following command:
host1# cp /sbin/insmod /tmp/initrd_ib/sbin/
Step 7: If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
host1# cp /sbin/ifconfig /tmp/initrd_ib/sbin
Step 8: If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step.
To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with.
Copy the DHCP client v3.1.3 file and all the relevant files as described below:
host1# cp <path to DHCP client v3.1.3>/dhclient /tmp/initrd_ib/sbin
host1# cp <path to DHCP client v3.1.3>/dhclient-script /tmp/initrd_ib/sbin
host1# mkdir -p /tmp/initrd_ib/var/state/dhcp
host1# touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
host1# cp /bin/uname /tmp/initrd_ib/bin
host1# cp /usr/bin/expr /tmp/initrd_ib/bin
host1# cp /sbin/ifconfig /tmp/initrd_ib/bin
host1# cp /bin/hostname /tmp/initrd_ib/bin
Step 9: Create a configuratio
206. ling SR IOV Driver 86 4 13 6 Burning Firmware with SR IOV 86 4 13 7 Configuring Pkeys and GUIDs under SR IOV 87 got us freed HL pe bogota p 93 4 14 1 CORE Drrect Overview uu cererii dn eda Hea EEA MERI esa 93 AAD Bthtool 2 voie nto e eda n 94 4 16 Dynamically Connected Transport 95 Chapter 5 HPC Features 96 5 1 Shared Memory Access 96 5 1 1 Mellanox ScalableSHMEM eens 96 5 1 2 Running SHMEM with FCA 97 5 1 3 Running ScalableSHMEM with 97 5 1 4 Running SHMEM with Contiguous 98 5 1 5 Running ScalableSHMEM Application 98 5 2 Message Passing Interface 98 98 5 2 2 Prerequisites for Running MPI teens 99 5 2 3 MPI Selector Which MPI Runs 100 5 2 4 Compiling MPI Applications 100 5 3 MellanoX Messaging 101 5 3 1 Compiling OpenMPI with MXM 101 5 3 2 Enabling MXM in OpenMPI
207. ll effect SM authentication. Note that OpenSM version 3.2.1 and below used the default value '1' in a host byte order; it is fixed now, but you may need this option to interoperate with an old OpenSM running on a little-endian machine.

--reassign_lids, -r
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.

--routing_engine, -R <engine name>
This option chooses the routing engine(s) to use instead of the default Min Hop algorithm. Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. If all configured routing engines fail, OpenSM will always attempt to route with Min Hop unless 'no_fallback' is included in the list of routing engines.
Supported engines: updn, file, ftree, lash, dor, torus-2QoS

--do_mesh_analysis
This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing.

--lash_start_vl <vl number>
Sets the starting VL to use for the lash routing algorithm. Defaults to 0.

--sm_sl <sl number>
Sets the SL to use to communicate with the SM/SA. Defaults to 0.

--con
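The fallback order described for the routing engine option can be sketched as follows. This is an illustrative model only, not OpenSM's implementation: `pick_engine` and `try_engine` are hypothetical names, the keyword is spelled `no_fallback` as an assumption, and the sketch treats the keyword the same wherever it appears in the list.

```python
def pick_engine(requested, try_engine):
    """Try each requested engine in order; fall back to Min Hop unless
    the no-fallback keyword appears in the list.

    requested:  list of engine names, e.g. ["ftree", "updn"]
    try_engine: callable(name) -> True if that engine routed the subnet
    """
    fallback = True
    for name in requested:
        if name == "no_fallback":
            fallback = False          # suppress the implicit Min Hop attempt
        elif try_engine(name):
            return name               # first engine that succeeds is used
    if fallback and try_engine("minhop"):
        return "minhop"
    return None                       # nothing could route the subnet
```

For example, with a list of ftree followed by updn, updn is only attempted if ftree fails, and Min Hop only if both fail.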
208. llowing software components:
Mellanox Host Channel Adapter Drivers
  mlx5, mlx4 (VPI), which is split into multiple modules:
    mlx4_core (low-level helper)
    mlx4_ib (IB)
    mlx5_ib
    mlx5_core
    mlx4_en (Ethernet)
Mid-layer core
  Verbs, MADs, SA, CMA, uVerbs, uMADs
Upper Layer Protocols (ULPs)
  IPoIB, RDS, SRP Initiator, SRP
  NOTE: RDS was not tested by Mellanox Technologies.
MPI
  Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces
  OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces
  MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
OpenSM: InfiniBand Subnet Manager
Utilities
  Diagnostic tools
  Performance tests
Firmware tools (MFT)
Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
Documentation

1.2.3 Firmware
The ISO image includes the following firmware items:
Firmware images (.mlx format) for ConnectX-3, ConnectX-3 Pro and Connect-IB network adapters
Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
FlexBoot for ConnectX-3 HCA devices

1.2.4 Directory Structure
The ISO image of MLNX_OFED_LINUX contains the following files and directories:
mlnxofedinstall: This is the MLNX_OFED_LINUX installation script.
ofed_uninstall.sh: This is MLNX OFED L
209. lnx RPM
Preparing...
knem-mlnx
Installing kmod-knem-mlnx RPM
Preparing...
kmod-knem-mlnx
Installing mpi-selector RPM
Preparing...
mpi-selector
Installing user level RPMs
Preparing... ofed-scripts
Preparing... libibverbs
Preparing... libibverbs
Preparing... libibverbs-devel
Preparing... libibverbs-devel
Preparing... libibverbs-devel-static
Preparing... libibverbs-devel-static
Preparing... libibverbs-utils
Preparing... libmlx4
Preparing... libmlx4
Preparing... libmlx4-devel
Preparing... libmlx4-devel
Preparing... libmlx5
Preparing... libmlx5
Preparing... libmlx5-devel
Preparing... libmlx5-devel
Preparing... libcxgb3
Preparing... libcxgb3
Preparing... libcxgb3-devel
Preparing... libcxgb3-devel
Preparing... libcxgb4
210. log LOG SIZE size in MB gt This option defines maximal AR Manager log file size in MB The logfile will be truncated and restarted upon reaching this limit This option cannot be changed on the fly 0 unlimited log file size Default 5 8 8 5 1 1 Per switch AR Options A user can provide per switch configuration options with the following syntax Mellanox Technologies 165 Rev 2 0 3 0 0 OpenSM Subnet Manager SWITCH lt GUID gt lt switch option 1 gt lt switch option 2 gt The following are the per switch options Table 14 Adaptive Routing Manager Pre Switch Options File Option File Description Values ENABLE Allows you to enable disable the AR on this Default true lt true false gt switch If the general ENABLE option value is set to false then this per switch option is ignored This option can be changed on the fly AGEING_TIME Applicable to bounded AR mode only Specifies Default 30 lt usec gt how much time there should be no traffic in order for the switch to declare a transmission burst as finished and allow changing the output port for the next transmission burst 32 bit value In the pre switch options file this option refers to the particular switch only This option can be changed on the fly 8 8 5 1 2 Example of Adaptive Routing Manager Options File ENABLE true LOG FILE tmp ar_mgr log LOG SIZE 100 MAX ERRORS 10
211. loops, use ibdmchk to analyze data collected via ibdiagnet -vlr.

8.5.7.6 Torus-2QoS Configuration File Syntax
The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.

[torus|mesh] x_radix[m|M|t|T] y_radix[m|M|t|T] z_radix[m|M|t|T]

Either torus or mesh must be the first keyword in the configuration, and sets the topology that torus-2QoS will try to construct. A 2D topology can be configured by specifying one of x_radix, y_radix, or z_radix as 1. An individual dimension can be configured as mesh (open) or torus (looped) by suffixing its radix specification with one of m, M, t, or T. Thus, "mesh 3T 4 5" and "torus 3 4M 5M" both specify the same topology.
Note that although torus-2QoS can route mesh fabrics, its ability to route around failed components is severely compromised on such fabrics. A failed fabric component is very likely to cause a disjoint ring; see UNICAST ROUTING in torus-2QoS(8).

xp_link sw0_GUID sw1_GUID
yp_link sw0_GUID sw1_GUID
zp_link sw0_GUID sw1_GUID
xm_link sw0_GUID sw1_GUID
ym_link sw0_GUID sw1_GUID
zm_link sw0_GUID sw1_GUID

These keyw
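The way the topology keyword and the per-dimension radix suffixes combine can be illustrated with a small parser. This is a sketch of the rule as stated in the text, not the routing engine's own parser; `parse_topology` is a hypothetical name, and the sketch ignores every other keyword of the file.

```python
def parse_topology(line):
    """Parse a 'torus|mesh x_radix y_radix z_radix' line.

    Returns a list of (radix, looped) pairs for x, y, z. A suffix of
    m/M forces a dimension open (mesh), t/T forces it looped (torus);
    without a suffix the leading keyword decides.
    """
    tokens = line.split()
    if len(tokens) < 4 or tokens[0] not in ("torus", "mesh"):
        raise ValueError("expected: torus|mesh x_radix y_radix z_radix")
    default_looped = tokens[0] == "torus"
    dims = []
    for tok in tokens[1:4]:          # later tokens on the line are ignored
        if tok[-1] in "mMtT":
            radix, looped = int(tok[:-1]), tok[-1] in "tT"
        else:
            radix, looped = int(tok), default_looped
        dims.append((radix, looped))
    return dims
```

Both spellings from the text, "mesh 3T 4 5" and "torus 3 4M 5M", parse to a looped x dimension of radix 3 and open y and z dimensions of radix 4 and 5.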
212. m F lt options file name gt See an example of AR Manager options file with all the default values in Example of Adaptive Routing Manager Options File on page 166 8 8 3 2 Disabling Adaptive Routing There are two ways to disable Adaptive Routing Manager 1 By disabling it explicitly in the Adaptive Routing configuration file 2 removing the armgr option from the Subnet Manager options file Mellanox Technologies 163 Rev 2 0 3 0 0 OpenSM Subnet Manager Adaptive Routing mechanism is automatically disabled once the switch receives setting of the usual linear routing table LFT e Therefore no action is required to clear Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing 8 8 4 Querying Adaptive Routing Tables When Adaptive Routing is active the content of the usual Linear Forwarding Routing Table on the switch is invalid thus the standard tools that query LFT e g smpquery dump Ifts sh and others cannot be used To query the switch for the content of its Adaptive Routing table use the smparquery tool that is installed as a part of the Adaptive Routing Manager package To see its usage details run smparquery h 8 8 5 Adaptive Routing Manager Options File The default location of the AR Manager options file is etc opensm ar mgr conf To set an alter native location please perform the following 1 Add armgr conf file lt ar mgr options file name gt
213. min guids sysfs interface.
To configure the GUID at index <n> on port <port_num>:
cd /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids
echo <your_desired_guid> > n
Example:
cd /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids
echo "0x002fffff8118" > 3
Notes:
echo "0x0" means let the SM assign a value to that GUID.
echo "0xffffffffffffffff" means delete that GUID.
echo of any other value means request the SM to assign this GUID to this index.
Step 3: Read the administrative status of the GUID index.
To read the administrative status of GUID index m on port n:
cat /sys/class/infiniband/mlx4_0/iov/ports/<n>/admin_guids/<m>
Step 4: Check the operational state of a GUID.
/sys/class/infiniband/mlx4_0/iov/ports/<n>/gids (where n = 1 or 2)
The values indicate which gids are actually configured on the firmware/hardware, and all the entries are read-only.
Step 5: Compare the value you read under the admin_guids directory at that index with the value under the gids directory, to verify that the change requested in Step 3 has been accepted by the SM and programmed into the hardware port GID table.
If the value under admin_guids/<m> is different from the value under gids/<m>, the request is still in progress.

4.13.7.2.3 Partitioning IPoIB Communication using PKeys
PKeys are used to partition IPoIB communication between the Virtual Machines a
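The three meanings of a value written to an admin_guids entry, and the comparison against the gids entry in Step 5, can be sketched as below. The function names and returned labels are hypothetical. The colon-separated, fully expanded GID string is an assumption about how the gids entry is printed; that a GID carries the port GUID in its low 64 bits is standard InfiniBand addressing.

```python
ALL_ONES = 0xFFFFFFFFFFFFFFFF


def admin_guid_action(value):
    """Classify a value written to an admin_guids entry."""
    guid = int(value, 16)            # accepts an optional 0x prefix
    if guid == 0:
        return "sm-assigns"          # let the SM choose the GUID
    if guid == ALL_ONES:
        return "delete"              # remove the GUID at this index
    return "request"                 # ask the SM for this specific GUID


def guid_from_gid(gid):
    """Low 64 bits of a GID written as eight 4-digit hex groups."""
    groups = gid.split(":")
    if len(groups) != 8:
        raise ValueError("expected eight colon-separated groups")
    return int("".join(groups[-4:]), 16)


def request_pending(admin_value, gid):
    """True while the requested GUID is not yet in the port GID table."""
    return int(admin_value, 16) != guid_from_gid(gid)
```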
214. n 30 BW Traffic class IPoIB Service Level 3 Policy min 1 k IT D LA App A Server Virtual Server App B Server 8 7 QoS Configuration Examples The following are examples of QoS configuration for different cluster deployments Each exam ple provides the QoS level assignment and their administration via OpenSM configuration files 8 7 1 Typical HPC Example MPI and Lustre Assignment of QoS Levels MPI Separate from I O load Min BW of 70 Storage Control Lustre MDS Low latency Storage Data Lustre OST Min BW 30 Administration MPI 15 assigned an SL via the command line host1 mpirun s1 0 OpenSM QoS policy file In the following policy file example replace and with the real port GUIDs Mellanox Technologies 159 Rev 2 0 3 0 0 OpenSM Subnet Manager 8 7 2 qos ulps default 0 default SL for MPT any target port guid OST1 0ST2 0ST3 0ST4 1 SL for Lustre OST any target port guid MDS1 MDS2 2 SL for Lustre MDS end qos ulps OpenSM options file qos max vls 8 qos high limit 0 qos vlarb high 2 1 qos vlarb low 0 96 1 224 Cos SAW 01 2210 9 0907 109 ls 309 10 IS EDC SOA 2 tier IPoIB and SRP The following is an example of QoS configuration for a typical enterprise data center EDC with service oriented architecture SOA with IPoIB carrying all application traffic and SRP used for storage
215. n file for the DHCP client (as described in Section 4.3.3.1) and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):
# dhclient.conf
# The value indicates a hexadecimal number
# For a ConnectX device
interface "ib0" {
send dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}
Step 10: Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded.
Note: The order of the following commands (for loading modules) is critical.
echo "loading ipv6"
/sbin/insmod /lib/modules/ipv6.ko
echo "loading IB driver"
/sbin/insmod /lib/modules/ib/ib_addr.ko
/sbin/insmod /lib/modules/ib/ib_core.ko
/sbin/insmod /lib/modules/ib/ib_mad.ko
/sbin/insmod /lib/modules/ib/ib_sa.ko
/sbin/insmod /lib/modules/ib/ib_cm.ko
/sbin/insmod /lib/modules/ib/ib_uverbs.ko
/sbin/insmod /lib/modules/ib/ib_ucm.ko
/sbin/insmod /lib/modules/ib/ib_umad.ko
/sbin/insmod /lib/modules/ib/iw_cm.ko
/sbin/insmod /lib/modules/ib/rdma_cm.ko
/sbin/insmod /lib/modules/ib/rdma_ucm.ko
/sbin/insmod /lib/modules/ib/mlx4_core.ko
/sbin/insmod /lib/modules/ib/mlx4_ib.ko
/sbin/insmod /lib/modules/ib/ib_mthca.ko
The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release note
216. n routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.

8.5.3.1 UPDN Algorithm Usage
Activation through OpenSM:
Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
Use '-a <root_guid_file>' for adding a UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.
Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.

8.5.4 Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling for non-constant K, cases where not all leafs (CAs) are present, and any Constant Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also preven
217. nable 0 -mca coll_fca_enable 0
For more details on FCA installation and configuration, please refer to the FCA User Manual found on the Mellanox website.

5.1.3 Running ScalableSHMEM with MXM
The MellanoX Messaging (MXM) library provides enhancements to parallel communication libraries by fully utilizing the underlying networking infrastructure provided by Mellanox HCA/switch hardware. This includes a variety of enhancements that take advantage of Mellanox networking hardware, including:
Multiple transport support, including RC, XRC and UD
Proper management of HCA resources and memory structures
Efficient memory registration
One-sided communication semantics
Connection management
Receive-side tag matching
Intra-node shared memory communication
These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.

5.1.4 Running SHMEM with Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr.
To activate MLNX_OFED 2.0 and the contiguous pages allocator with SHMEM, run the following argument to enable compound pages with SHMEM:
/opt/mellanox/openshmem/2.1/bin/shmemrun -mca s
218. nd Also in case of two physical connections i e network paths from a single initiator IB port to two different IB ports on the same Target HCA there is need for a different initiator_ext value on each path The conventions is to use the Target port GUID as the initiator_ext value for the rele vant path Mellanox Technologies 45 J Rev 2 0 3 0 0 Driver Features If you use srp_daemon with n flag it automatically assigns initiator_ext values according to this convention For example id ext 200500A0B81146A1 ioc 9011 0002 90200402 dgid fe800000000000000002c90200402bed service id 200500a0b81146a1 initiator ext ed2b400002c90200 Notes 1 It is recommended to use the n flag for all stp daemon invocations 2 ibsrpdm does not have a corresponding option 3 srp daemon sh always uses the n option whether invoked manually by the user or automat ically at startup by setting SRPHA ENABLE to yes 4 1 2 6 High Availability HA Overview High Availability works using the Device Mapper DM multipath and the SRP daemon Each initiator is connected to the same target from several ports HCAs The DM multipath is responsi ble for joining together different paths to the same target and for fail over between paths when one of them goes offline Multipath will be executed on newly joined SCSI devices Each initiator should execute several instances of the SRP daemon one for each port At startup each S
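In the example above, the initiator ext value (ed2b400002c90200) equals the low 8 bytes of the dgid (0002c90200402bed, the target port GUID) with the byte order reversed. The sketch below reproduces that relationship. The byte reversal is inferred from this single example and is not stated in the text, so treat it as an assumption; the function name is hypothetical.

```python
def initiator_ext_from_dgid(dgid_hex):
    """Derive an initiator_ext value from a 128-bit dgid hex string.

    The port GUID is the low 64 bits of the GID; the example suggests
    initiator_ext is that GUID with its bytes reversed (assumption).
    """
    raw = bytes.fromhex(dgid_hex)
    if len(raw) != 16:
        raise ValueError("dgid must be 32 hex digits")
    port_guid = raw[8:]              # low 8 bytes of the GID
    return port_guid[::-1].hex()     # reversed byte order
```

In practice, the daemon invoked with its n flag assigns these values itself, as the text notes; the sketch only shows how the example values relate.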
219. nd the Dom0 by mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a virtual pkey index other than zero.
The following describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2. In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2 will not be able to communicate with each other, nor with Dom0.
This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm1s will have the same pkey and both vm2s will have the same PKey (different from the vm1's), and the Dom0's will have the default pkey (different from the vm's pkeys at index 0).
OpenSM must be used to configure the physical Pkey tables on both hosts.
The physical Pkey table on both hosts (Dom0) will be configured by OpenSM to be:
index 0 = 0xffff
index 1 = 0xb000
index 2 = 0xb030
The vm1's virt-to-physical PKey mapping will be:
pkey_idx 0 = 1
pkey_idx 1 = 0
The vm2's virt-to-phys pkey mapping will be:
pkey_idx 0 = 2
pkey_idx 1 = 0
so that the default pkey will reside on the vms at index 1 instead of at index 0.
The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs will all use different PKeys.
To partition IPoIB commu
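The effect of the mappings above can be checked with a short model: each guest's IPoIB QP uses whatever physical PKey its virtual index 0 resolves to. The table and mapping values are the ones listed in the text; the identity mapping for Dom0, the names, and the simplified reachability test (same 15-bit base key, at least one full member) are assumptions made for the sketch.

```python
# Physical PKey table configured by OpenSM on both hosts (from the text).
PHYSICAL_PKEYS = [0xFFFF, 0xB000, 0xB030]

# Virtual index -> physical index. Dom0's identity mapping is an assumption.
VIRT_TO_PHYS = {
    "dom0": {0: 0, 1: 1, 2: 2},
    "vm1": {0: 1, 1: 0},
    "vm2": {0: 2, 1: 0},
}


def ipoib_pkey(guest):
    """PKey used by the guest's IPoIB QPs (the one at virtual index 0)."""
    return PHYSICAL_PKEYS[VIRT_TO_PHYS[guest][0]]


def can_talk(a, b):
    """Same partition (low 15 bits) and at least one full member (bit 15)."""
    ka, kb = ipoib_pkey(a), ipoib_pkey(b)
    same_partition = (ka & 0x7FFF) == (kb & 0x7FFF)
    one_full = bool((ka | kb) & 0x8000)
    return same_partition and one_full
```

Because both hosts use the same tables, `can_talk("vm1", "vm1")` models Host1/vm1 reaching Host2/vm1, while vm1, vm2 and Dom0 land in three different partitions.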
220. nect_roots, -z
This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases this can violate the "pure" deadlock-free algorithm, so use it carefully.

This option enables unicast routing cache to prevent routing recalculation (which is a heavy task in a large cluster) when there was no topology change detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. in case of host reboot. This option becomes very handy when the cluster size is thousands of nodes.

--lid_matrix_file, -M <file name>
This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.

--lfts_file, -U <file name>
This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.

--sadb_file, -S <file name>
This option specifies the name of the SA DB dump file from where the SA database will be loaded.

--root_guid_file, -a <path to file>
Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--cn_guid_file, -u <path to file>
Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).

--io_guid_file, -G <path t
221. need defining them And since this policy file doesn t have any matching rules PR MPR query will not match any rule and OpenSM will enforce default QoS level Essentially the above example 1s equivalent to not having a QoS policy file at all The following example shows all the possible options and keywords in the policy file and their syntax See the comments in the following example They explain different keywords and their meaning port groups port group using port GUIDs name Storage use is just a description that is used for logging Other than that it is just a comment use SRP Targets port guid 0x10000000000001 0x10000000000005 0x1000000000FFFA port guid 0x1000000000FFFF end port group 152 Mellanox Technologies Rev 2 0 3 0 0 port group name Virtual Servers The syntax of the port name is as follows node description Pnum node description is compared to the NodeDescription of the node and Pnum is a port number on that node port name vsl HCA 1 P1 vs2 HCA 1 P1 end port group using partitions defined in the partition policy port group name Partitions partition Partl pkey 0x1234 end port group using node types CA ROUTER SWITCH SELF for node that runs SM or ALL for all the nodes in the subnet port group name CAs and SM node type CA SELF end port group end port groups qos setup This section of the policy file describes how to set up SL2VL and VL
222. nes the order on which the ports would be chosen for routing pong order 1 iil WA 25 29 25 2D 2 SQ Mellanox Technologies 149 N H N Rev 2 0 3 0 0 OpenSM Subnet Manager 8 6 Quality of Service Management in OpenSM 8 6 1 Overview When Quality of Service QoS in OpenSM is enabled using the Q or qos flags OpenSM looks for a QoS Policy file During fabric initialization and at every heavy sweep OpenSM parses the QoS policy file applies its settings to the discovered fabric elements and enforces the provided policy on client requests The overall flow for such requests is as follows The request is matched against the defined matching rules such that the QoS Level def inition is found Given the QoS Level a path s search is performed with the given restrictions imposed by that level Figure 4 QoS Manager M Administrator QoS Policy Config File InfiniBand subnet with QoS OFED 1 3 Manager based nodes OSM Z There are two ways to define QoS policy Advanced the advanced policy file syntax provides the administrator various ways to match a PathRecord MultiPathRecord PR MPR request and to enforce various QoS constraints on the requested PR MPR Simple the simple policy file syntax enables the administrator to match PR MPR requests by various ULPs and applications running on top of these ULPs 8 6 2 Advanced QoS Policy File The
223. ng
-d2 - Force log flushing after each log message
-d3 - Disable multicast support
-d10 - Put OpenSM in testability mode
Without -d, no debug options are enabled.

--help, -h, -?
Display this usage info then exit.

8.2.2 Environment Variables
The following environment variables control opensm behavior:
OSM_TMP_DIR
Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
OSM_CACHE_DIR
opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it: guid2lid, which stores the LID range assigned to each GUID.

8.2.3 Signaling
When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.
Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.

8.2.4 Running opensm
The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.
To run opensm in the default mode, simply enter:
host1# opensm
Note that opensm needs to be run on at least one m
224. nge 0 to 15) in their header SL field. Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL).
The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.
The DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.

4.4.2 QoS Architecture
QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the "chronology approach" to describe how the overall system works.
1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application or ULP or service uses.
2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class, Service-ID, PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also
225. nication using PKeys:
Step 1: Create a file /etc/opensm/partitions.conf on the host on which OpenSM runs, containing the lines:
Default=0x7fff,ipoib : ALL=full ;
Pkey1=0x3000,ipoib : ALL=full;
Pkey3=0x3030,ipoib : ALL=full;
This will cause OpenSM to configure the physical Port Pkey tables on all physical ports on the network as follows:
pkey idx | pkey value
0 | 0xFFFF
1 | 0xB000
2 | 0xB030
(the most significant bit indicates if a PKey is a full PKey).
The "ipoib" flag causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.
Step 2: Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
Step a: Check the PCI ID for the Physical Function and the Virtual Functions:
lspci | grep Mel
Step b: Assuming that on Host1 the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0", do the following on Host1:
cd /sys/class/infiniband/mlx4_0/iov
0000:02:00.0 0000:02:00.1 0000:02:00.2
0000:02:00.0 contains the virtual-to-physical mapping tables for the physical function.
0000:02:00.X contain the virt-to-phys mapping tables for the virtual functions.
Do not touch the Dom0 mapping table (under <nnnn>:<nn>:00.0). Modify only tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that vm1 uses VF 0000:02:00.1 and vm2 uses VF 0000:02:00.2.
Configure the virtual-to-physical PKey mapping for the VMs:
echo 0 > 0000:02
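The relation between the PKey values written in the partitions file (0x7fff, 0x3000, 0x3030) and the values that appear in the port tables (0xFFFF, 0xB000, 0xB030) is the full-membership bit: bit 15 is set for a full member. A minimal sketch of that arithmetic, with hypothetical function names:

```python
FULL_MEMBER_BIT = 0x8000  # most significant bit of the 16-bit PKey


def full_member_pkey(base_pkey):
    """Table value for a full member of the partition with this base key."""
    if not 0 <= base_pkey <= 0x7FFF:
        raise ValueError("base PKey must fit in 15 bits")
    return base_pkey | FULL_MEMBER_BIT


def is_full_member(pkey):
    """True if the membership bit of a table value is set."""
    return bool(pkey & FULL_MEMBER_BIT)
```

Applied to the three partitions of Step 1, this yields exactly the three table values listed above.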
226. numbers. Example: 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1.

-s LIST, --tsa=LIST
Transmission algorithm for each TC. LIST is comma-separated algorithm names for each TC. Possible algorithms: strict, etc. Example: ets,strict,ets sets TC0, TC2 to ETS and TC1 to strict. The rest are unchanged.

-t LIST, --tcbw=LIST
Set minimal guaranteed %BW for ETS TCs. LIST is comma-separated percents for each TC. Values set to TCs that are not configured to the ETS algorithm are ignored, but must be present. Example: if TC0, TC2 are set to ETS, then 10,0,90 will set TC0 to 10% and TC2 to 90%. Percents must sum to 100.

-r LIST, --ratelimit=LIST
Rate limit for TCs (in Gbps). LIST is a comma-separated Gbps limit for each TC. Example: 1,8,8 will limit TC0 to 1 Gbps, and TC1, TC2 to 8 Gbps each.

-i INTF, --interface=INTF
Interface name.

-a
Show all interface's TCs.

Get Current Configuration

Set ratelimit: 3 Gbps for tc0, 4 Gbps for tc1 and 2 Gbps for tc2

Configure QoS: map UP 0,7 to tc0, 1,2,3 to tc1 and 4,5,6 to tc2; set tc0, tc1 as ets and tc2 as strict; divide ets 30% for tc0 and 70% for tc1.
tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
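The list conventions above (one entry per UP or per TC, positional) can be checked before a configuration is applied. The sketch below groups user priorities by traffic class and validates the ETS bandwidth shares. The function names are hypothetical, and summing only the ETS entries to 100 is one reading of "Percents must sum to 100", since entries for non-ETS TCs are ignored.

```python
def up_to_tc(prio_tc):
    """Group user priorities by TC; prio_tc[i] is the TC of UP i."""
    groups = {}
    for up, tc in enumerate(prio_tc):
        groups.setdefault(tc, []).append(up)
    return groups


def ets_shares(tsa, tcbw):
    """Return {tc: percent} for the ETS TCs after validating the lists.

    tsa:  algorithm name per TC, e.g. ["ets", "strict", "ets"]
    tcbw: percent per TC; entries for non-ETS TCs must be present
          but are ignored.
    """
    if len(tsa) != len(tcbw):
        raise ValueError("one bandwidth entry is required per TC")
    shares = {tc: bw for tc, (alg, bw) in enumerate(zip(tsa, tcbw))
              if alg == "ets"}
    if shares and sum(shares.values()) != 100:
        raise ValueError("ETS percents must sum to 100")
    return shares
```

For the examples in the text, the mapping 0,0,0,0,1,1,1,1 groups UPs 0 to 3 under TC0 and UPs 4 to 7 under TC1, and 10,0,90 with TC0 and TC2 on ETS gives TC0 10% and TC2 90%.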
227. o file: Set the I/O nodes for the Fat-Tree routing algorithm to the GUIDs provided in the given file (one to a line).
--port-shifting: Attempt to shift port routes around to remove alignment problems in routing tables.
--scatter-ports <random seed>: Randomize the best port chosen for a route.
--max_reverse_hops, -H <hop_count>: Set the max number of hops the wrong way around an I/O node is allowed to do (connectivity for I/O nodes on top switches).
--ids_guid_file, -m <path to file>: Name of the map file with the set of IDs which will be used by the Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).
--guid_routing_order_file, -X <path to file>: Set the order in which port GUIDs will be routed for the MinHop and Up/Down routing algorithms to the GUIDs provided in the given file (one to a line).
--torus_config <path to file>: This option defines the file name for the extra configuration info needed for the torus-2QoS routing engine. The default name is /etc/opensm/torus-2QoS.conf.
--once, -o: This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.
--sweep, -s <interval>: This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.
--timeout, -t <milliseconds>: This option specifies the time in milliseconds used
228. o this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the shift communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be. The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.
Footnotes: 1. Ports that are connected to the same remote switch are referenced as a "port group". 2. The list of compute nodes (CNs) can be specified by the "-u" or "--cn_guid_file" OpenSM options.
8.5.4.1 Routing between non-CN Nodes
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.
[Figure: fat tree with Spine1, Spine2 and Spine3 above the switches to which the non-CN nodes attach] Going do
229. ock. So in the example above with failed switch T, the location of the illegal turn at I in the path from S to D requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop, because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T. Extending this argument shows that, in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches, on the condition that they are adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6 2D torus:
[Figure: 6x6 2D torus with x and y coordinates 0-5, showing the switches referred to in the text, including S, T and R]
Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1 set. As a further example, consider a case that torus-2QoS cannot route without deadlock: two failed switches adjacent
230. ode. Assume the answer is "yes" to all questions.
-no | All | Non-interactive mode. Assume the answer is "no" to all questions.
Table 29 - mstflint Switches (Sheet 3 of 3)
Switch | Affected/Relevant Commands | Description
-vsd <string> | burn | Write this string, of up to 208 characters, to VSD upon a burn command.
-use_image_ps | burn | Burn vsd as it appears in the given image; do not keep the existing VSD on Flash.
-dual_image | burn | Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
-v | | Print version info.
Table 30 - mstflint Commands
Command | Description
b[urn] | Burn Flash.
q[uery] | Query miscellaneous Flash/firmware characteristics.
v[erify] | Verify the entire Flash.
bb | Burn Block: burn the given image as is, without running any checks.
sg | Set GUIDs.
ri <out-file> | Read the firmware image on the Flash into the specified file.
dc <out-file> | Dump Configuration: print a firmware configuration file for the given image to the specified output file.
e[rase] <addr> | Erase sector.
rw <addr> | Read one DWORD from Flash.
ww <addr> <data> | Write one DWORD to Flash.
wwne <addr> | Write one DWORD to Flash without sector erase.
wbne <addr> <size> <data>
231. oduces the following files in the output directory (which is defined by the -o option described below).
Synopsis: [-i|--device <dev-name>] [-p|--port <port-num>] [-g|--guid <GUID in hex>] [--vlr <file>] [-r|--routing] [-u|--fat_tree] [-o|--output_path <directory>] [--skip <stage>] [--skip_plugin <library name>] [--pc] [-P|--counter <<PM>=<value>>] [--pm_pause_time <seconds>] [--ber_test] [--ber_use_data] [--ber_thresh <value>] [--extended_speeds <dev-type>] [--pm_per_lane] [--ls <2.5|5|10|14|25|FDR10>] [--lw <1x|4x|8x|12x>] [-w|--write_topo_file <file name>] [-t|--topo_file <file>] [--out_ibnl_dir <directory>] [--screen_num_errs <num>] [--smp_window <num>] [--gmp_window <num>] [--max_hops <max-hops>] [-V|--version] [-h|--help] [-H|--deep_help]
Options:
-i|--device <dev-name>: Specifies the name of the device of the port used to connect to the IB fabric (in case of multiple devices on he loca
232. omotes data center application data messaging performance, scalability, and reliability over RDMA interconnects: InfiniBand and RoCE. The uDAPL interface is defined by the DAT collaborative. This release of the uDAPL reference implementation package, for both the DAT 1.2 and 2.0 specifications, is timed to coincide with the OFED release of the Open Fabrics (www.openfabrics.org) software stack. For more information about the DAT collaborative, go to the following site: http://www.datcollaborative.org
1.3.5 Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand: Open MPI, an open source MPI-2 implementation by the Open MPI Project; OSU MVAPICH, an MPI-1 implementation by Ohio State University. Mellanox OFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.
1.3.6 InfiniBand Subnet Manager
All InfiniBand-compliant ULPs require proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 8, "OpenSM - Subnet Manager".
1.3.7 Diagno
233. options mlx4_en enable_sys_tune=1
7.2.4.3 OS Controlled Power Management
Some operating systems can override the BIOS power management configuration and enable C-states by default, which results in a higher latency. To resolve the high latency issue, follow the instructions below: 1. Edit the /boot/grub/grub.conf file or any other bootloader configuration file. 2. Add the following kernel parameters to the bootloader command: intel_idle.max_cstate=0 processor.max_cstate=1 3. Reboot the system.
Example:
title RH6.2x64
root (hd0,0)
kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1
7.2.5 Interrupt Moderation
Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly. To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.
> ethtool -c eth1 Coale
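After the reboot in the OS Controlled Power Management steps, it can be useful to confirm that both kernel parameters actually reached the running kernel. The following Python sketch is illustrative only; the helper name `missing_cstate_params` is hypothetical. It inspects a kernel command line string, which on Linux can be read from /proc/cmdline.

```python
# Parameters and values taken from step 2 of the procedure above.
REQUIRED = {"intel_idle.max_cstate": "0", "processor.max_cstate": "1"}


def missing_cstate_params(cmdline):
    """Return the required parameters that are absent or have another value."""
    params = dict(tok.split("=", 1) for tok in cmdline.split() if "=" in tok)
    return sorted(key for key, value in REQUIRED.items() if params.get(key) != value)


if __name__ == "__main__":
    with open("/proc/cmdline") as handle:
        missing = missing_cstate_params(handle.read())
    print("OK" if not missing else "missing or wrong: " + ", ".join(missing))
```

An empty result means both settings are present with the values recommended above.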
234. ords are used to seed the torus/mesh topology. For example, "xp_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the positive x direction, while "xm_link 0x2000 0x2001" specifies that a link from the switch with node GUID 0x2000 to the switch with node GUID 0x2001 would point in the negative x direction. All the link keywords for a given seed must specify the same "from" switch. In general, it is not necessary to configure both the positive and negative directions for a given coordinate; either is sufficient. However, the algorithm used for topology discovery needs extra information for torus dimensions of radix four (see TOPOLOGY DISCOVERY in torus-2QoS(8)). For such cases both the positive and negative coordinate directions must be specified. Based on the topology specified via the torus/mesh keyword, torus-2QoS will detect and log when it has insufficient seed configuration.
x_dateline position, y_dateline position, z_dateline position: In order for torus-2QoS to provide the guarantee that path SL values do not change under any conditions for which it can still route the fabric, its idea of dateline position must not change relative to physical switch locations. The dateline keywords provide the means to configure such behavior. The dateline for a torus dimension is always between the switch with coordinate 0 and the switch with coordinate radix-1 for that dimens
235. ou choose to continue loading the OS after boot through the HCA device driver, please verify that the initrd image includes the HCA driver, as described in Section A.8.
A.10.1 Configuring an iSCSI Target in Linux Environment
Prerequisites
Step 1: Make sure that an iSCSI Target is installed on your server side. You can download and install an iSCSI Target from the following location: http://sourceforge.net/projects/iscsitarget/files/iscsitarget/
Step 2: Dedicate a partition on your iSCSI Target on which you will later install the operating system.
Step 3: Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line: Lun 0 Path=/dev/sda5,Type=fileio. Example of an iSCSI Target iqn line: Target iqn.2007-08.7.3.4.10:iscsiboot
Step 4: Start your iSCSI Target. Example: host1# /etc/init.d/iscsitarget start
Configuring the DHCP Server to Boot From an iSCSI Target
Configure DHCP as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP". Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target:
Filename "";
option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn";
The following is an example for configuring an IB/ETH d
236. owing: Discovers the currently installed kernel. Uninstalls any software stacks that are part of the standard operating system distribution or another vendor's commercial stack. Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel). Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware.
2.3.1 Pre-installation Notes
The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages. Pre-existing configuration files will be saved with the extension ".conf.rpmsave". If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh). If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory. On Redhat and SLES distributions with an errata kernel installed, there is no need to use the mlnx_add_kernel_support.sh script. The regular installation can be performed, and the weak-updates mechanism will create symbolic links to the MLNX_OFED kernel modules. Usage
237. pendix A: Mellanox FlexBoot
A.1 Overview
Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet. Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products. FlexBoot is based on the open source project iPXE, available at http://www.ipxe.org. FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some other service. For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com > Products > Software > InfiniBand/VPI Drivers). The binary code is exported by the device as an expansion ROM image.
A.1.1 Tested Platforms
See the Mellanox FlexBoot Release Notes (FlexBoot_release_notes.txt).
A.1.2 FlexBoot in Mellanox OFED
The FlexBoot package is provided as a ta
238. port 1 of the second HCA. 3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target 4. fdisk -l (will show the newly discovered scsi disks).
Example: Assume that you use port 1 of the first HCA in the system, i.e. mthca0.
[root@lab104]# ibsrpdm -c -d /dev/infiniband/umad0
id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
[root@lab104]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
OR: You can edit /etc/infiniband/openib.conf to load the SRP driver and the SRP High Availability (HA) daemon automatically, that is, set "SRP_LOAD=yes" and "SRPHA_ENABLE=yes". To set up and use the HA feature, you need the dm-multipath driver and the multipath tool. Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable and use the HA feature.
The following is an example of an SRP Target setup file:
*********************** srpt.sh *******************************
#!/bin/sh
modprobe scst scst_threads=1
modprobe scst_vdisk scst_vdisk_ID=100
echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk3 /dev/sdd BLOCKIO" > /p
239. ported. Run: lspci | grep Mellanox
03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev 10)
03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
Where: "03:00" represents the Physical Function, and "03:00.X" represents the Virtual Function connected to the Physical Function.
4.13.3 Enabling SR-IOV and Para Virtualization on the Same Setup
To enable SR-IOV and Para Virtualization on the same setup:
Step 1. Create a bridge: vim /etc/sysconfig/network-scripts/ifcfg-bridge0
DEVICE=bridge0
TYPE=Bridge
IPADDR=<IP address>
NETMASK=255.255.0.0
BOOTPROTO=static
ONBOOT=yes
NM_CONTROLLED=no
DELAY=0
Step 2. Change the related interface (in the example below bridge0 is created over eth5):
DEVICE=eth5
BOOTPROTO=none
STARTMODE=on
HWADDR=00:02:c9:2e:66:52
TYPE=Ethernet
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=bridge0
Step 3. Restart the service network.
240. ppears (see figure): "Press Ctrl-B for the iPXE command line". Alternatively, you may skip invoking the CLI right after POST and invoke it, instead, right after FlexBoot starts booting. Once the CLI is invoked, you will see the following prompt: iPXE>
Operation
The CLI resembles a Linux shell, where the user can run commands to configure and manage one or more PXE port network interfaces. Each port is assigned a network interface called net<i>, where i is 0, 1, 2, ..., <# of interfaces>. Some commands are general and are applied to all network interfaces. Other commands are port specific; therefore the relevant network interface is specified in the command.
Command Reference
A.8.3.1 ifstat
Displays the available network interfaces (in a similar manner to Linux's ifconfig).
iPXE> ifstat
net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 [Link:down, TX:8 TXE:2 RX:11 RXE:11]
  [Link status: The socket is not connected]
  [TXE: 2 x "No such file or directory"]
  [RXE: x "The socket is not connected"]
  [RXE: 8 x "Operation canceled"]
net1: 00:02:c9:[...]:0c:78:12 on PCI02:00.0 (open) [Link:up, TX:12 TXE:0 RX:0 RXE:0]
iPXE>
A.8.3.2 ifopen
Opens the network interface net<x>. The list of network interfaces is available via the ifstat command. Example: iPXE> ifopen net1
A.8.3.3 ifclose
Closes the network interface net<x>.
241. r will use probe_vf VFs and this will be applied to all ConnectX® HCAs on the host. Its format is a string which allows the user to specify the probe_vf parameter separately per installed HCA. Its format is: "bb:dd.f-v,bb:dd.f-v,..." where bb:dd.f = bus:device.function of the PF of the HCA, and v = number of VFs to use in the PF driver for that HCA. This parameter can be set in one of the following ways. For example: probe_vfs=5: the PF driver will probe 5 VFs on the HCA, and this will be applied to all ConnectX® HCAs on the host. probe_vfs=00:04.0-5,00:07.0-8: the PF driver will probe 5 VFs on the HCA positioned in BDF 00:04.0 and 8 for the one in 00:07.0. Note: PFs not included in the above list will not use any of their VFs in the PF driver. The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per a single VM. However, the number of VFs varies upon the working mode requirements. The protocol types are: Port 1 = IB, Port 2 = Ethernet. port_type_array=2,2 (Ethernet, Ethernet); port_type_array=1,1 (IB, IB); port_type_array=1,2 (VPI: IB, Ethernet); NO port_type_array module parameter: ports are IB.
Step 9. Reboot the server. If SR-IOV is not supported by the server, the machine might not come out of boot/load.
Step 10. Load the driver and verify SR-IOV is sup
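The two forms of the per-HCA VF string described above (a single number applied to every HCA, or a "bb:dd.f-v" list) can be told apart mechanically. The Python sketch below is a hypothetical parser written for illustration; it is not how the mlx4 driver parses the parameter, and the "*" key used for the "all HCAs" case is an invention of this example.

```python
import re

# One "bb:dd.f-v" entry: hexadecimal bus, device and function, decimal VF count.
_ENTRY = re.compile(r"^([0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-9a-fA-F])-(\d+)$")


def parse_probe_vf(value):
    """Return {bdf: vf_count}; a bare number maps to the key '*' (all HCAs)."""
    value = value.strip()
    if value.isdigit():
        return {"*": int(value)}
    result = {}
    for entry in value.split(","):
        match = _ENTRY.match(entry.strip())
        if match is None:
            raise ValueError(f"malformed entry: {entry!r}")
        result[match.group(1)] = int(match.group(2))
    return result


if __name__ == "__main__":
    print(parse_probe_vf("5"))
    print(parse_probe_vf("00:04.0-5,00:07.0-8"))
```

Both inputs are the example values from the text: the first applies 5 VFs to every HCA, the second assigns 5 VFs to the PF at 00:04.0 and 8 to the PF at 00:07.0.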
242. r with the larger send_queue_size/recv_queue_size values, set the following ib_ipoib module parameters: send_queue_size=1024, recv_queue_size=1024. Use Jumbo Frames (JF), up to 64K (dom0/domU): in UD mode, the maximum MTU value is 4092 Bytes; in CM mode, the maximum MTU value is 65520 Bytes. Make sure that all interfaces (including the guest interface and its virtual bridge) have the same MTU value. For further information on MTU and JF settings, please refer to the Hypervisor User Manual. Tune the TCP/IP stack using sysctl (dom0/domU): /sbin/sysctl_perf_tuning. Enable irqbalance (dom0/domU): /etc/init.d/irqbalance start. Other performance tuning for the KVM environment, such as vCPU pinning and NUMA tuning, may apply. For further information, please refer to the Hypervisor User Manual.
Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over physically contiguous pages. It enables a user application to ask low-level drivers to allocate contiguous memory for it as part of ibv_reg_mr. Additional performance improvements can be reached by allocating Queue Pair (QP) and Completion Queue (CQ) buffers to the Contiguous Pages. To activate, set the environment variables below with values of PREFER_CONTIG or CONTIG. For QP: MLX_QP_ALLOC_TYPE. For CQ: MLX_CQ_ALLOC_TYPE. The following are all the possible values that can be allocated to the buffer: Table 3 - Buffer Values
243. raffic, run: set_irq_affinity_bynode.sh <numa node> <interface>. For optimizing dual-port traffic, run: set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>. To show the current affinity settings, run: show_irq_affinity.sh <interface>
7.2.7.2 Auto Tuning Utility
MLNX_OFED 2.0.x introduces a new affinity tool called mlnx_affinity. This tool can automatically adjust your affinity settings for each network interface according to the system architecture. Usage: Start: mlnx_affinity start. Stop: mlnx_affinity stop. Restart: mlnx_affinity restart. mlnx_affinity can also be started by driver load/unload. To enable mlnx_affinity by default, add the line below to the /etc/infiniband/openib.conf file: RUN_AFFINITY_TUNER=yes
7.2.7.3 Tuning for Multiple Adapters
When optimizing the system performance for using more than one adapter, it is recommended to separate the adapters' core utilization so there will be no interleaving between interfaces. The following script can be used to separate each adapter's IRQs to a different set of cores: set_irq_affinity_cpulist.sh <cpu list> <interface>. <cpu list> can be either a comma-separated list of single core numbers (0,1,2,3) or core groups (0-3). Example: If the system has 2 adapters on the same NUMA node (0-7), each with 2 interfaces, run the following: /etc/init.d/irqbalancer stop s
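The two accepted forms of <cpu list> described above (single core numbers such as 0,1,2,3 and core groups such as 0-3) expand to the same set of cores. The Python sketch below illustrates that expansion and one possible way to hand out non-overlapping core sets to several interfaces. Both helper names are hypothetical, and the even split is an assumption made for illustration, not the allocation the manual prescribes.

```python
def parse_cpu_list(spec):
    """Expand a list such as '0,1,2,3' or '0-3' into explicit core numbers."""
    cores = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = (int(x) for x in part.split("-", 1))
            if last < first:
                raise ValueError(f"descending range: {part!r}")
            cores.extend(range(first, last + 1))
        else:
            cores.append(int(part))
    return cores


def split_cores(cores, groups):
    """Split cores into equally sized, non-overlapping groups (assumed policy)."""
    size = len(cores) // groups
    return [cores[i * size:(i + 1) * size] for i in range(groups)]


if __name__ == "__main__":
    node_cores = parse_cpu_list("0-7")  # the NUMA node from the example above
    # 2 adapters x 2 interfaces = 4 interfaces, each given its own cores.
    for index, group in enumerate(split_cores(node_cores, 4)):
        print(f"interface {index}: {','.join(str(c) for c in group)}")
```

Each resulting group could then be passed as the <cpu list> argument for one interface, so that no two interfaces share a core.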
244. rball tgz extension containing the files specified in Appendix A 1 1 Tested Platforms page 205 1 A PXE ROM image file for each of the supported Mellanox network adapter devices Specif ically the following images are included ConnectX ConnectX 2 ConnectX 3 images e ConnectX FlexBoot PCI Device ID gt _ROM lt version gt mrom where the number after the ConnectX_FlexBoot prefix indicates the corresponding PCI Device ID of the ConnectX ConnectX 2 ConnectX 3 device 2 Additional documents under docs dhcp A 2 Burning the Expansion ROM Image A 2 1 Burning the Image on ConnectX ConnectX 2 ConnectX 3 This section 15 valid for ConnectX 2 devices with firmware versions 2 8 0600 or later and ConnectX 3 firmware de Mellanox Technologies 205 Rev 2 0 3 0 0 Prerequisites 1 Expansion ROM Image The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file FlexBoot_release_notes txt 2 Firmware Burning Tools You need to install the Mellanox Firmware Tools MFT package version 2 7 0 or later in order to burn the PXE ROM image To download MFT see Firmware Tools under www mellanox com gt Downloads Image Burning Procedure To burn the composite image perform the following steps 1 Obtain the MST device name Run mst start mst status The device name will be of the form mt lt dev_id gt pci cro con
245. re carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.
4.7.1.2 Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries. The pseudocode below describes the operation:
bit_adder(ci, b1, b2, *co)
{
    value = ci + b1 + b2
    *co = !!(value & 2)
    return value & 1
}
#define MASK_IS_SET(mask, attr) (!!((mask) & (attr)))
bit_position = 1
carry = 0
atomic_response = 0
for i = 0 to 63
{
    if (i != 0)
        bit_position = bit_position << 1
    bit_add_res = bit_adder(carry, MASK_IS_SET(*va, bit_position), MASK_IS_SET(compare_add, bit_position), &new_carry)
    if (bit_add_res)
        atomic_response |= bit_position
    carry = ((new_carry) && (!MASK_IS_SET(compare_add_mask, bit_position)))
}
return atomic_response
4.8 Ethernet Tunneling Over IPoIB Driver (eIPoIB)
The eth_ipoib driver provides a standard Ethernet interface to be used as a Physical Interface (PIF) into the Hypervisor virtual network, and serves one or more Virtual Interfaces (VIF). This driver supports L2 Switching
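The MFetchAdd pseudocode above can be exercised in software to see how the field boundary mask stops carries. The following Python sketch is a plain-software model of that bitwise loop, not the HCA implementation or a verbs API call; the function name `masked_fetch_add` and its parameter names are hypothetical. It assumes, as the text states, that a set bit in the boundary mask marks the top bit of a field, so a carry out of that bit is dropped.

```python
def masked_fetch_add(target, addend, field_boundary):
    """Model the 64-bit masked add: carries never cross a field boundary.

    Returns the value computed by the pseudocode loop (the per-field sum).
    """
    result = 0
    carry = 0
    for i in range(64):
        bit = 1 << i
        total = carry + ((target >> i) & 1) + ((addend >> i) & 1)
        if total & 1:
            result |= bit
        # A set boundary bit ends a field here, so the carry is discarded.
        carry = 0 if field_boundary & bit else (total >> 1)
    return result


if __name__ == "__main__":
    two_fields = 1 << 31  # split the 64-bit target into two 32-bit fields
    print(hex(masked_fetch_add(0xFFFFFFFF, 1, two_fields)))  # low field wraps
    print(hex(masked_fetch_add(0xFFFFFFFF, 1, 0)))           # plain 64-bit add
```

With the boundary at bit 31 the low field wraps to zero without disturbing the upper field, while with an empty mask the same operands behave like an ordinary 64-bit add.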
246. 1.2.4 Directory Structure; 1.3 Architecture; 1.3.1 mlx4 VPI Driver; 1.3.2 mlx5 Driver; 1.3.3 Mid-layer Core; 1.3.4 ULPs; 1.3.5 MPI; 1.3.6 InfiniBand Subnet Manager; 1.3.7 Diagnostic Utilities; 1.3.8 Mellanox Firmware Tools; 1.4 Quality of Service; 1.5 RDMA over Converged Ethernet (RoCE); Chapter 2 Installation; 2.1 Hardware and Software Requirements; 2.2 Downloading Mellanox OFED; 2.3 Installing Mellanox OFED; 2.3.1 Pre-installation Notes; 2.3.2 Installation Script; 2.3.3 Installation Procedure; 2.3.4 Installation Results
247. re shared by the new MR. Once the MR is shared, it can be used even if the original MR was destroyed. The request to share the MR can be repeated numerous times, and an arbitrary number of Memory Regions can potentially share the same physical memory locations.
Usage: Uses the "handle" field that was returned from ibv_reg_mr as the mr_handle. Supplies the desired "access mode" for that MR. Supplies the address field, which can be either NULL or any hint as the required output. The address and its length are returned as part of the ibv_mr struct. To achieve high performance, it is highly recommended to supply an address that is aligned as the original memory region address. Generally, it may be an alignment to a 4M address. For further information on how to use the ibv_reg_shared_mr verb, please refer to the ibv_reg_shared_mr man page and/or to the ibv_shared_mr sample program, which demonstrates a basic usage of this verb. Further information on the ibv_shared_mr sample program can be found in the ibv_shared_mr man page.
4.11 XRC - eXtended Reliable Connected Transport Service for InfiniBand
XRC allows significant savings in the number of QPs and the associated memory resources required to establish all-to-all process connectivity in large clusters. It significantly improves the scalability of the solution for large clusters of multicore end-nodes by reducing the required resources. For further details, please refer to the "Annex A1
248. recommended to run the following command in order to generate the inventory file: host1# osmtest -f c. Immediately afterwards, run the following command to test OpenSM: host1# osmtest -f a. Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.
8.4 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the "--Pconfig" or "-P" flags. The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed. The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end-ports are assigned partial membership.
8.4.1 File Format
Notes: Line content following the '#' character is a comment and is ignored by the parser.
General File Format: <Partition Definition>:<PortGUIDs list>;
Partition Definition: [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where PartitionName is a string that will be used with logging (when omitted, an empty string will be used); PKey is the P_Key value for this partition. Only low 15 b
249. rget IB port GUID in the PR/MPR query. Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file. The shortest policy file in this case would be as follows:
qos-ulps
    default : 0 # default SL
end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all. Below is an example of a simple QoS policy with all the possible keywords:
qos-ulps
    default : 0 # default SL
    sdp, port-num 30000 : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
    sdp, port-num 10000-20000 : 0
    sdp : 1 # default SL for any other application running on top of SDP
    rds : 2 # SL for RDS traffic
    ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with pkey 0x0001
    ipoib : 4 # default IPoIB partition, pkey=0x7FFF
    any, service-id 0x6234 : 6 # match any PR/MPR query with a specific Service ID
    any, pkey 0x0ABC : 6 # match any PR/MPR query with a specific PKey
    srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on a specified IB port GUID
    any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with a specific target port GUID
end-qos-ulps
S
250. ring to map device function numbers to their probe_vf values (e.g. '0000:04:00.0-3,002b:1c:0b.a-13'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for the probe_vf value (e.g. 13). (string)
log_num_mgm_entry_size: log mgm size, that defines the num of qp per mcg, for example: 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12. To activate device managed flow steering when available, set to -1. (int)
high_rate_steer: Enable steering mode for higher packet rate (default off). (int)
fast_drop: Enable fast packet drop when no receive WQEs are posted. (int)
enable_64b_cqe_eqe: Enable 64 byte CQEs/EQEs when the FW supports this, if non-zero (default: 1). (int)
log_num_mac: Log2 max number of MACs per ETH port (1-7). (int)
log_num_vlan: (Obsolete) Log2 max number of VLANs per ETH port (0-7). (int)
log_mtts_per_seg: Log2 number of MTT entries per segment (0-7) (default: 0). (int)
port_type_array: Either a pair of values (e.g. '1,2') to define a uniform port1/port2 types configuration for all devices functions, or a string to map device function numbers to their pair of port types values (e.g. '0000:04:00.0-1;2,002b:1c:0b.a-1;1')
Further parameter names listed in this table (descriptions not included in this excerpt): log_num_qp, log_num_srq, log_rdmarc_per_qp, log_num_cq, log_num_mcg, log_num_mpt, log_num_mtt, enable_qos, internal_err_reset.
mlx4_en Parameters: inline_thold, udp_rss, pfctx, pfcrx
251. rmally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.
-s, --stress: This option runs the specified stress test instead of the normal test suite. Stress test options are as follows: -s1: Single-MAD response SA queries. -s2: Multi-MAD (RMPP) response SA queries. -s3: Multi-MAD (RMPP) Path Record SA queries. Without -s, stress testing is not performed.
-M, --Multicast_Mode: This option specifies the length of the Multicast test. -M1: Short Multicast Flow (default), single mode. -M2: Short Multicast Flow, multiple mode. -M3: Long Multicast Flow, single mode. -M4: Long Multicast Flow, multiple mode. Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC. Multiple mode: could be run with other apps using MC with OpenSM. Without -M, default flow testing is performed.
-t, --timeout: This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
-l, --log_file: This option defines the log to be the given file. By default the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout.
-v, --verbose: This option in
252. rnet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded. Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices. Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".
Possible port types are:
  eth: Ethernet
  ib: InfiniBand
  auto: Link sensing mode. Detect the port type based on the attached network type. If no link is detected, the driver retries link sensing every few seconds.
The port link type can be configured for each device in the system at run time using the "/sbin/connectx_port_config" script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.
This utility also has a non-interactive mode:
  /sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]
6.2 Auto Sensing
Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load th
253. roc/scsi_tgt/vdisk/vdisk
echo add vdisk0 0 > /proc/scsi_tgt/groups/Default/devices
echo add vdisk1 1 > /proc/scsi_tgt/groups/Default/devices
echo add vdisk2 2 > /proc/scsi_tgt/groups/Default/devices
echo add vdisk3 3 > /proc/scsi_tgt/groups/Default/devices
modprobe ib_srpt
echo add mgmt > /proc/scsi_tgt/trace_level
echo add mgmt_dbg > /proc/scsi_tgt/trace_level
echo add out_of_mem > /proc/scsi_tgt/trace_level
***** End srpt.sh *****
B.3 How to Unload/Shutdown
1. Unload ib_srpt: modprobe -r ib_srpt
2. Unload scst and its dev_handlers first: modprobe -r scst_vdisk scst
3. Unload ofed: /etc/rc.d/openibd stop
Appendix C: mlx4 Module Parameters
In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
  options mlx4_core parameter=<value>
and/or
  options mlx4_ib parameter=<value>
and/or
  options mlx4_en parameter=<value>
The following sections list the available mlx4 parameters.
C.1 mlx4_ib Parameters
sm_guid_assign: Enable SM alias GUID assignment if sm_guid_assign > 0 (Default: 1). (int)
dev_assign_str: Map device function numbers to IB device numbers (e.g. '0000:04:00.0-0,002b:1c:0b.a-1'). Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and decimal for IB
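The "options <module> parameter=<value>" lines described in Appendix C can be generated rather than typed by hand. The following is a minimal sketch only: mlx4_option_line is a hypothetical helper, not part of Mellanox OFED, and it only prints the line; appending it to /etc/modprobe.conf is left to the administrator.

```shell
# Hypothetical helper (not part of Mellanox OFED): print one modprobe
# "options" line for an mlx4 module, in the form shown in Appendix C.
mlx4_option_line() {
    # Accept only the three module names named in the appendix.
    case "$1" in
        mlx4_core|mlx4_ib|mlx4_en) ;;
        *) echo "unknown mlx4 module: $1" >&2; return 1 ;;
    esac
    printf 'options %s %s=%s\n' "$1" "$2" "$3"
}

# Example: the line that would turn off SM alias GUID assignment in mlx4_ib.
mlx4_option_line mlx4_ib sm_guid_assign 0
```

Printing instead of writing keeps the sketch safe to run on any machine; the output can be reviewed before it is added to the modprobe configuration.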
254. ror window: 0 - mechanism disabled (no error checking). Default: 5
cc_statistics_cycle: Enables CC MGR to collect statistics from all nodes every cc_statistics_cycle [seconds]. Default: 0. When the value is set to 0, no statistics are collected.
9 InfiniBand Fabric Diagnostic Utilities
9.1 Overview
The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric.
9.2 Utilities Usage
This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.
9.2.1 Common Configuration, Interface and Addressing
Topology File (Optional)
An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names. For diagnostic tools to fully support the topology file, the user may need to p
255. rovide the local system name (if the local hostname is not used in the topology file).
To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option -t <topology file name>
2. Define the environment variable IBDIAG_TOPO_FILE
To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option -s <local system name>
2. Define the environment variable IBDIAG_SYS_NAME
9.2.2 InfiniBand Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option -p <local port number> (see below)
2. Define the environment variable IBDIAG_PORT_NUM
In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the option -i <index of local device>
2. Define the environment variable IBDIAG_DEV_IDX
9.2.3 Addressing
This
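The four environment variables named above can be set once per shell session instead of repeating the command-line options on every tool invocation. A minimal sketch, assuming placeholder values: the topology file path, the system name, and the numeric values below are illustrative only and must be replaced with values that match the local fabric.

```shell
# Environment-variable equivalents of the -t, -s, -p and -i options.
# All values below are placeholders chosen for illustration.
export IBDIAG_TOPO_FILE="/tmp/fabric.topo"   # hypothetical topology file
export IBDIAG_SYS_NAME="node01"              # hypothetical local system name
export IBDIAG_PORT_NUM=1                     # local HCA port number
export IBDIAG_DEV_IDX=1                      # index of the local HCA device

# List what a diagnostic tool started from this shell would inherit.
env | grep '^IBDIAG_' | sort
```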
256. rrently available only on upstream kernels newer than 3.1.
  ip link set dev <PF device> vf <NUM> spoofchk [on | off]
4.13.7.3.3 RoCE Support
RoCE is supported on VFs, and VLANs may be used. For RoCE, the hypervisor can support RoCE over up to 15 VLANs. There are 127 VLANs available per port for the hypervisor and all guests together. The hypervisor is allocated 16 GIDs, which can support 15 VLANs. The remaining VLANs are allocated equally among the number of VFs requested in the num_vfs mlx4_core module parameter. VLANs will not work in VST mode (packets will simply not be sent, nor will they arrive).
4.14 CORE-Direct
4.14.1 CORE-Direct Overview
CORE-Direct provides a solution for off-loading the MPI collectives operations from the software library to the network. CORE-Direct accelerates MPI applications and solves the scalability issues in large-scale systems by eliminating the issues of operating system noise and jitter. It addresses the collectives communication scalability problem by off-loading a sequence of data-dependent communications to the Host Channel Adapter (HCA). This solution provides the hooks needed to support computation and communication overlap. Additionally, it provides a means to reduce the effects of system noise and application skew on application scalability.
The relevant verbs to be used for CORE-Direct:
  ibv_create_qp_ex
  ibv_modify_cq
  ibv_query_device_ex
  ibv_post_task
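The VLAN budget in the RoCE Support paragraph reduces to simple arithmetic: 127 VLANs per port, 15 of them usable by the hypervisor, and the rest shared equally among the VFs. The sketch below is an estimate under one stated assumption: the text does not say how a remainder is handled, so integer division is assumed. roce_vlans_per_vf is a hypothetical helper name.

```shell
# Hypothetical helper: estimate the number of RoCE VLANs available to
# each VF, using the figures quoted in the text.
roce_vlans_per_vf() {
    num_vfs=$1
    total_vlans=127       # VLANs available per port (from the text)
    hypervisor_vlans=15   # VLANs supported by the hypervisor's 16 GIDs (from the text)
    if [ "$num_vfs" -le 0 ]; then
        echo "num_vfs must be a positive integer" >&2
        return 1
    fi
    # Assumption: an uneven remainder is simply dropped (integer division).
    echo $(( (total_vlans - hypervisor_vlans) / num_vfs ))
}

# Example: 8 VFs requested through the num_vfs module parameter.
roce_vlans_per_vf 8
```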
257. rs Flags and Options
Table 28 - ibcheckerrs Flags and Options (columns: Flag; Optional/Mandatory; Default If Not Specified; Description)
-h(help): Optional. Print the help menu.
-b: Optional. Print in brief mode. Reduce the output to show only if errors are present, not what they are.
-v(erbose): Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-G(uid): Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-T <threshold_file>: Optional. Use the specified threshold file.
-s: Optional. Show the predefined thresholds.
[...]: Optional. Default: color mode. Use mono mode rather than color mode.
-C <ca_name>: Optional. Use the specified channel adapter or router.
-P <ca_port>: Optional. Use the specified port.
-t <timeout_ms>: Optional. Override the default timeout for the solicited MADs [msec].
<lid | guid>: Mandatory (with -G flag). Use the specified port's or node's LID/GUID (with -G option).
[<port>]: Mandatory (without -G flag). Use the specified port.
Examples
1. Check aggregated node counter for LID 0x2:
  > ibcheckerrs 2
  #warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
  [...] 255 (threshold 10) lid 2 port 255
258. rt2. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
(columns: Switch; Affected/Relevant Commands; Description)
-macs <MACs>: burn, sg. Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-blank_guids: burn. Burn the image with blank GUIDs and MACs (where applicable). These values can be set later using the sg command (see Table 30 below).
-clear_semaphore: No commands allowed. Force clear the Flash semaphore on the device. No command is allowed when this switch is used. Warning: May result in system instability or Flash corruption if the device or another application is currently using the Flash.
-i(mage) <image>: burn, verify. Binary image file.
-qq: burn, query. Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
-nofs: burn. Burn image in a non-failsafe manner.
-skip_is: burn. Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant sector difference is detected.
-byte_mode: burn, write. Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
-s(ilent): burn. Do not print burn progress messages.
-y(es): All. Non interactive m
259. ry Access
The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.
The SHMEM parallel programming library is an easy-to-use programming model which uses highly efficient one-sided communication APIs to provide an intuitive global-view interface to shared or distributed memory systems. SHMEM's capabilities provide an excellent low-level interface for PGAS applications.
A SHMEM program is of a single program, multiple data (SPMD) style. All the SHMEM processes, referred to as processing elements (PEs), start simultaneously and run the same program. Commonly, the PEs perform computation on their own sub-domains of the larger problem, and periodically communicate with other PEs to exchange information on which the next communication phase depends.
The SHMEM routines minimize the overhead associated with data transfer requests, maximize bandwidth, and minimize data latency (the period of time that starts when a PE initiates a transfer of data and ends when a PE can use the data).
SHMEM routines support remote data transfer through: "put" operations - data transfer to a different PE; "get" operations - data
260. s
  /sbin/insmod /lib/modules/ib/ipoib_helper.ko
  /sbin/insmod /lib/modules/ib/ib_ipoib.ko
Step 11. In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
  /sbin/insmod /lib/modules/ib/ib_ipoib.ko lro=0
Step 12. Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example:
  /sbin/dhclient -cf /sbin/dhclient.conf ib1
Step 13. Save the init file.
Step 14. Close initrd:
  host1$ cd /tmp/initrd_ib
  host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
  host1$ gzip /tmp/new_init_ib.img
Step 15. At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_init_ib.img.gz. Copy it to the original initrd location and rename it properly.
A.9.2 Case II: Ethernet Ports
The Ethernet driver requires loading the following modules in the specified order (see the example below):
  mlx4_core.ko
  mlx4_en.ko
A.9.2.1 Example: Adding an Ethernet Driver to initrd (Linux)
Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1 on page 50, and connect
261. s, then do it in software.
SOF_TIMESTAMPING_RAW_HARDWARE: return the original raw hardware time stamp
SOF_TIMESTAMPING_SYS_HARDWARE: return the hardware time stamp transformed to the system time base
SOF_TIMESTAMPING_SOFTWARE: return the system time stamp generated in software
SOF_TIMESTAMPING_TX/RX determine how time stamps are generated. SOF_TIMESTAMPING_RAW/SYS determine how they are reported.
To enable time stamping for a net device:
An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:
Send side time sampling: enabled by ifreq.hwtstamp_config.tx_type when
Receive side time sampling: enabled by ifreq.hwtstamp_config.rx_filter when
/* possible values for hwtstamp_config->rx_filter */
enum hwtstamp_rx_filters {
    /* time stamp no incoming packet at all */
    HWTSTAMP_FILTER_NONE,
    /* time stamp any incoming packet */
    HWTSTAMP_FILTER_ALL,
    /* return value: time stamp all packets requested plus some others */
    HWTSTAMP_FILTER_SOME,
    /* PTP v1, UDP, any kind of event packet */
    HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
    /* PTP v1, UDP, Sync packet */
    HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
    /* PTP v1, UDP, Delay_req packet */
    HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
    /* PTP v2, UDP, any kind of event packet */
    HWTSTAMP_FILTER_PTP_V2_L4_EVENT
262. s across such links in a round-robin fashion, based on ports at the path destination switch that are active and not used for inter-switch links. Should a link that is one of several such parallel links fail, routes are redistributed across the remaining links. When the last of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn that would be otherwise "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as close as possible to the failed switch in order to route around it. In the above example, suppose switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under DOR. This causes the first hop after any such turn to use a separate set of VL values, and prevents deadlock in the presence of a single failed switch. For any given path, only the hops after a turn that is illegal under DOR can contribute to a credit loop that leads to deadl
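The rule for which turns count as illegal under DOR can be written down directly: the text names exactly three, y-x, z-x, and z-y. The sketch below only encodes that classification and the resulting value of VL bit 1 for the first hop after a turn; it is an illustration, not how OpenSM programs SL2VL maps, and both function names are hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if a turn from dimension $1 to
# dimension $2 is illegal under DOR, per the three cases named in the text.
dor_turn_is_illegal() {
    case "$1$2" in
        yx|zx|zy) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: print the value of VL bit 1 used for the first hop
# after the given turn (1 after an illegal turn, 0 otherwise).
vl_bit1_after_turn() {
    if dor_turn_is_illegal "$1" "$2"; then
        echo 1
    else
        echo 0
    fi
}

# Example: a z-to-x turn, as introduced when routing around a failed switch.
vl_bit1_after_turn z x
```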
263. sce parameters for eth1:
  Adaptive RX: on TX: off
  pkt-rate-low: 400000
  pkt-rate-high: 450000
  rx-usecs: 16
  rx-frames: 88
  rx-usecs-irq: 0
  rx-frames-irq: 0
> ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0
> ethtool -c eth1
  Coalesce parameters for eth1:
  Adaptive RX: off TX: off
  pkt-rate-low: 400000
  pkt-rate-high: 450000
  rx-usecs: 0
  rx-frames: 0
  rx-usecs-irq: 0
  rx-frames-irq: 0
Tuning for NUMA Architecture
Tuning for Intel Sandy Bridge Platform
The Intel Sandy Bridge processor has an integrated PCI express controller. Thus, every PCIe adapter is connected directly to a NUMA node. On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected. In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.
To see if your system supports PCIe adapter's NUMA node detection:
  cat /sys/class/net/[interface]/device/numa_node
  cat /sys/devices/[PCI root]/[PCIe function]/numa_node
Example for a supported system:
  cat /sys/class/net/eth3/device/numa_node
  0
Example for an unsupported system:
  cat /sys/class/net/ib0/device/numa_node
  -1
7.2.6.1.1 Improving Application Performance on Remote NUMA Node
Verbs API applications that mostly use polling will have an impact when using the remote NUMA node. libmlx4 has a built-in enhancement th
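The sysfs check described above can be wrapped in a small function that also interprets the result. This is a sketch under stated assumptions: numa_node_of is a hypothetical helper, and a negative value is treated as "not reported", matching the unsupported-system example. The optional second argument (a sysfs root other than /sys) exists only so the function can be exercised without real hardware.

```shell
# Hypothetical helper: report the NUMA node of a network interface by
# reading <sysfs root>/class/net/<interface>/device/numa_node.
numa_node_of() {
    iface=$1
    sysfs_root=${2:-/sys}   # overridable for testing; defaults to /sys
    node_file="$sysfs_root/class/net/$iface/device/numa_node"
    if [ ! -r "$node_file" ]; then
        echo "no numa_node entry for $iface" >&2
        return 1
    fi
    node=$(cat "$node_file")
    if [ "$node" -lt 0 ]; then
        # A negative value is what the unsupported-system example shows.
        echo "$iface: NUMA node not reported"
    else
        echo "$iface: NUMA node $node"
    fi
}

# Usage on a real system (interface name is an example):
#   numa_node_of eth3
```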
264. section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports:
Using a Directed Route to the destination (Tool option -d): This option defines a directed route of output port numbers from the local port to the destination.
Using port LIDs (Tool option -l): In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
Using port names defined in the topology file (Tool option -n): This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the -l option.
9.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
This version of ibdiagnet is included in the ibutils2 package, and it is run by default after installing Mellanox OFED. To use this ibdiagnet version, run: ibdiagnet. Please see ibutils2_release_notes.txt for additional information and known issues.
ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then pr
265. sion and exits.
--config, -F <file-name>: The name of the OpenSM config file. When not specified, /etc/opensm/opensm.conf will be used (if it exists).
--create-config, -c <file-name>: OpenSM will dump its configuration to the specified file and exit. This is a way to generate an OpenSM configuration file template.
--guid, -g <GUID in hex>: This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
--lmc, -l <LMC>: This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e. multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.
--priority, -p <PRIORITY>: This option specifies the SM's PRIORITY. This will affect the handover cases, where the master is chosen by priority and GUID. Range goes from 0 (lowest priority) to 15 (highest).
--smkey, -k <SM_Key>: This option specifies the SM's SM_Key (64 bits). This wi
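The relationship between the LMC value and the number of LIDs per port stated above (2^LMC, with LMC in the range 0-7) is easy to check before choosing a value for the LMC option. A minimal sketch; lids_per_port is a hypothetical helper name and is not part of OpenSM.

```shell
# Hypothetical helper: number of LIDs assigned to each port for a given
# LMC value, i.e. 2^LMC, rejecting values outside the documented 0-7 range.
lids_per_port() {
    lmc=$1
    if [ "$lmc" -lt 0 ] || [ "$lmc" -gt 7 ]; then
        echo "LMC must be in the range 0-7" >&2
        return 1
    fi
    echo $(( 1 << lmc ))
}

# Example: LMC = 0 is the OpenSM default and gives one LID per port.
lids_per_port 0
lids_per_port 3
```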
266. start instructions
[Installer progress output; each package is shown with a "Preparing..." progress bar: mxm, bupc, infinipath-psm, infinipath-psm-devel, mvapich2, openmpi, openshmem, mpitests_mvapich2, mpitests_openmpi, mlnxof]
267. stic Utilities
Mellanox OFED includes the following two diagnostic packages for use by network and data center managers:
ibutils: Mellanox Technologies diagnostic utilities
infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools
1.3.8 Mellanox Firmware Tools
The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for:
Generating a standard or customized Mellanox firmware image
Querying for firmware information
Burning a firmware image to a single InfiniBand node
MFT includes the following tools:
mlxburn: provides the following functions:
  Generation of a standard or customized Mellanox firmware image for burning, in .bin (binary) or .img format
  Burning an image to the Flash/EEPROM attached to a Mellanox HCA or switch device
  Querying the firmware version loaded on an HCA board
  Displaying the VPD (Vital Product Data) of an HCA board
flint: This tool burns a firmware binary image or an expansion ROM image to the Flash device of a Mellanox network adapter/bridge/switch device. It includes query functions to the burnt firmware image and to the binary image file.
spark: This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII switc
Footnote 1: OpenSM is disabled by default. See Chapter 8, "OpenSM - Subnet Manager" for details on enabling it.
268. stre Compilation over MLNX_OFED" (page 226)
Release 2.0-3.0.0, August 2013:
Updated the following sections:
  Section 1.3.4, "ULPs" on page 21
  Section 4.12, "Flow Steering" on page 77 and its subsections
  Section 1.3.3, "Mid-layer Core" on page 21
  Section 4.8, "Ethernet Tunneling Over IPoIB Driver (eIPoIB)" on page 70
  Section 8.2.1, "opensm Syntax" on page 121
  Appendix C, "mlx4 Module Parameters" (page 223)
Added the following sections:
  Section 1.5, "RDMA over Converged Ethernet (RoCE)" on page 23
  Section 4.5, "Quality of Service Ethernet" on page 59 and its subsections
  Section 4.11, "XRC - eXtended Reliable Connected Transport Service for InfiniBand" on page 76
  Section 4.13.7, "Configuring Pkeys and GUIDs under SR-IOV" on page 87 and its subsections
  Section 4.15, "Ethtool" on page 94
  Appendix E, "Lustre Compilation over MLNX_OFED" (page 226)
Release 2.0-2.0.5, April 2013: Initial release
About this Manual
This Preface provides general information concerning the scope and organization of this User's Manual.
Intended Audience
This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.
Common Abbrevi
269. t ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
Synopsis
  ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt] [-pm] [-pc] [-P <<PM>=<Value>>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-skip <ibdiag_check/s>] [-load_db <db_file>]
Options
-c <count>: Min number of packets to be sent across each link (default = 10)
-v: Enable verbose mode
-r: Provides a report of the fabric qualities
-t <topo-file>: Specifies the topology file name
-s <sys-name>: Specifies the local system name. Meaningful only if a topology file is specified
-i <dev-index>: Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
-p <port-num>: Specifies the local device's port number used to connect to the IB fabric
-o <out-dir>: Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>: Specifies the expected link width
-ls <2.5|5|10>: Specifies the expected link speed
-pm: Dump all the fabric links, pm Counters into ibdiagnet.pm
-pc: Reset all the fa
270. t FlexBoot Download Tab
A.9.1 Case I: InfiniBand Ports
The IB driver requires loading the following modules in the specified order (see Section A.9.1.1 for an example):
  ib_addr.ko
  ib_core.ko
  ib_mad.ko
  ib_sa.ko
  ib_cm.ko
  ib_uverbs.ko
  ib_ucm.ko
  ib_umad.ko
  iw_cm.ko
  rdma_cm.ko
  rdma_ucm.ko
  mlx4_core.ko
  mlx4_ib.ko
  ib_mthca.ko
  ipoib_helper.ko (this module is not required for all OS kernels; please check the release notes)
  ib_ipoib.ko
A.9.1.1 Example: Adding an IB Driver to initrd (Linux)
Prerequisites
1. The FlexBoot image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section 4.3.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine.
3. An initrd file.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.
Adding the IB Driver to the initrd File
Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.
Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it:
  host1$ mkdir /tmp/in
271. t plugins help if exists.
-H|--deep_help: Prints deep help information (including plugins' help).
Output Files
Table 19 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2.
Table 19 - ibdiagnet (of ibutils2) Output Files (columns: Output File; Description)
ibdiagnet2.lst: Fabric links in LST format
ibdiagnet2.sm: Subnet Manager
ibdiagnet2.pm: Ports Counters
ibdiagnet2.fdbs: Unicast FDBs
ibdiagnet2.mcfdbs: Multicast FDBs
ibdiagnet2.nodes_info: Information on nodes
ibdiagnet2.db_csv: ibdiagnet internal database
An ibdiagnet run performs the following stages:
  Fabric discovery
  Duplicated GUIDs detection
  Links in INIT state and unresponsive links detection
  Counters fetch
  Error counters check
  Routing checks
  Link width and speed checks
  Alias GUIDs check
  Subnet Manager check
  Partition keys check
  Nodes information
Return Codes
0: Success
1: Failure (with description)
9.4 ibdiagnet (of ibutils) - IB Net Diagnostic
This version of ibdiagnet is included in the ibutils package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils2 package, you need to specify the full path: /opt/bin/ibdiagnet a
272. t when it has insufficient configuration for a torus with radix 4 dimensions.
In the event the torus is significantly degraded, i.e., there are many missing switches or links, it may happen that torus-2QoS is unable to place into the torus some switches and/or links that were discovered in the fabric, and will generate a warning in that case. A similar condition occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.
8.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q. Since torus-2QoS depends on such functionality for correct operation, always invoke OpenSM with -Q when torus-2QoS is in the list of routing engines. Any quality of service configuration method supported by OpenSM will work with torus-2QoS, subject to the following limitations and considerations.
For all routing engines supported by OpenSM except torus-2QoS, there is a one-to-one correspondence between QoS level and SL. Torus-2QoS can only support two quality of service levels, so only the high-order bit of any SL value used for unicast QoS configuration will be honored by torus-2QoS. For multicast QoS configuration, only SL values 0 and 8 should be used w
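Because torus-2QoS honors only the high-order bit of the SL for unicast QoS, every SL value collapses onto one of two levels. The sketch below makes that mapping explicit, assuming the standard 4-bit SL range of 0-15, so the high-order bit is bit 3; torus2qos_level is a hypothetical helper name and is not part of OpenSM.

```shell
# Hypothetical helper: the QoS level torus-2QoS would honor for a unicast
# SL value, i.e. the high-order bit of the 4-bit SL (assumed range 0-15).
torus2qos_level() {
    sl=$1
    if [ "$sl" -lt 0 ] || [ "$sl" -gt 15 ]; then
        echo "SL must be in the range 0-15" >&2
        return 1
    fi
    echo $(( (sl >> 3) & 1 ))
}

# Example: SLs 0-7 share one level and SLs 8-15 share the other, which is
# consistent with the text naming SL 0 and SL 8 for multicast.
torus2qos_level 5
torus2qos_level 12
```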
273. t1$ ssh host2 uname
Linux
5.2.3 MPI Selector - Which MPI Runs
Mellanox OFED contains a simple mechanism for system administrators and end-users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details.
Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new shell (e.g., log out and log in again). Other packages (such as environment modules) provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality.
The MPI selector functionality can be invoked in one of two ways:
1. The mpi-selector-menu command. This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended
274. table (IBA 7.6.9)
VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template
VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template
SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15. (Note that VL15 used here means drop this SL.)
There are separate QoS configuration parameter sets for various target types: CAs, routers, switch external ports, and switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:
qos_ca_: QoS configuration parameters set for CAs
qos_rtr_: parameters set for routers
qos_sw0_: parameters set for switches' port 0
qos_swe_: parameters set for switches' external ports
Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):
  qos_ca_max_vls 15
  qos_ca_high_limit 0
  qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
  qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
  qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
  qos_swe_max_vls 15
  qos_swe_high_limit 0
  qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
  qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
275. the status of specific ports of specific devices:
> ibstatus mthca0:1 mlx4_0:2
Infiniband device 'mthca0' port 1 status:
  default gid: fe80:0000:0000:0000:0002:c900:0101:d151
  base lid: 0x0
  sm lid: 0x0
  state: [...]
  phys state: 5: LinkUp
  rate: 10 Gb/sec (4X)
Infiniband device 'mlx4_0' port 2 status:
  default gid: fe80:0000:0000:0000:0000:0000:0007:3897
  base lid: 0x1
  sm lid: 0x1
  state: 4: ACTIVE
  phys state: 5: LinkUp
  rate: 20 Gb/sec (4X DDR)
9.10 ibportstate
Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port. If the queried port is a switch port, then ibportstate can be used to:
  disable, enable or reset the port
  validate the port's link width and speed against the peer port
Synopsis
  ibportstate [-d] [-e] [-v] [-V] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [<dest dr_path|lid|guid>] <portnum> [<op> [<value>]]
Output Files
Table 24 lists the various flags of the command.
Table 24 - ibportstate Flags and Options (columns: Flag; Optional/Mandatory; Default If Not Specified; Description)
-h(help): Optional. Print the help menu.
-d(ebug): Optional. Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-e(rr_show): Optional. Show send and receive errors (time-outs and others).
276. tion default kernels, you can run scst_vdisk blockio mode to obtain good performance.
2. Download and install the SCST driver. The supported version is 1.0.1.1.
  a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloads.html
  b. Untar scst-1.0.1.1:
     tar zxvf scst-1.0.1.1.tar.gz
     cd scst-1.0.1.1
  c. Install scst-1.0.1.1 as follows:
     make && make install
B.2 How to Run
A. On an SRP Target machine:
1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio, ...).
Note: Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0 (it is not required to have the lun numbers in ascending order, except that the first lun must always be 0).
Note: Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only loads the ib_srpt module but does not load scst nor its dev_handlers.
Note: The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.
Example 1: Working with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)
  a. modprobe scst
  b. modprobe scst_vdisk
  c. echo open vdisk0 /dev/md0 BLOCKIO > /proc/scsi_tgt/vdisk/vdisk
  d. echo open vdisk1 /dev/sda BLOCKIO > /proc/scsi_tgt/vdisk/vdisk
  e. echo open vdisk2 /dev/cciss/c1d0 BLOCKIO > /proc/scsi_t
277. tion specifies the prefix routes file. Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. Default file is /etc/opensm/prefix-routes.conf.

  --consolidate_ipv6_snm
      Use shared MLID for IPv6 Solicited Multicast groups per MGID scope and P_Key.
  --consolidate_ipv4_mask
      Use mask for IPv4 multicast groups multiplexing per MGID scope and P_Key.
  --pid_file <path to file>
      Specifies the file that contains the process ID of the opensm daemon. The default is /var/run/opensm.pid.
  --max_seq_redisc
      Specifies the maximum number of failed discovery loops done by the SM before completing the whole heavy sweep cycle.
  --mc_secondary_root_guid <GUID in hex>
      This option defines the guid of the multicast secondary root switch.
  --mc_primary_root_guid <GUID in hex>
      This option defines the guid of the multicast primary root switch.
  --guid_routing_order_no_scatter
      Don't use scatter for ports defined in the guid_routing_order file.
  --pr_full_world_queries_allowed
      This option allows OpenSM to respond to full World Path Record queries (path record for each pair of ports in a fabric).
  --enable_crashd
      This option causes OpenSM to run the Crash Daemon child process, which allows a backtrace dump in case of fatal terminating signals.
  --log_prefix <prefix text>
      Prefix to syslog messages from OpenSM.
  --verbose
278. to restart network service in order to bring up the bonding master. After the configuration is saved, restart the network service by running:

  /etc/init.d/network restart

4.4 Quality of Service InfiniBand

4.4.1 Quality of Service Overview

Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.

Figure 2: I/O Consolidation Over InfiniBand
[Figure: servers connected over InfiniBand to an IB-Ethernet gateway and an IB-Fibre Channel gateway with block storage]

QoS over Mellanox OFED for Linux is discussed in Chapter 8, "OpenSM - Subnet Manager".

The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources.

The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
  - Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
  - Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served
  - Packets carry class of service marking in the ra
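The two-priority-level weighted round-robin arbitration described above can be pictured with a small Python sketch. This is only an illustration under simplifying assumptions, not the hardware algorithm: weights and the high-priority limit are counted in whole packets here, and the names `WrrTable`, `arbitrate`, and `high_limit` are hypothetical.

```python
from collections import deque


class WrrTable:
    """One arbitration table: a cyclic sequence of (VL, weight) pairs."""

    def __init__(self, entries):
        # Entries with weight 0 can never be served, so they are dropped.
        self.entries = [(vl, w) for vl, w in entries if w > 0]
        self.pos = 0
        self.left = self.entries[0][1] if self.entries else 0

    def next_vl(self, queues):
        """Return the next VL allowed to send, or None if nothing is queued."""
        if not self.entries:
            return None
        for _ in range(len(self.entries) + 1):
            vl, _weight = self.entries[self.pos]
            if self.left > 0 and queues.get(vl):
                self.left -= 1
                return vl
            # Entry used up or its queue is empty: move to the next entry.
            self.pos = (self.pos + 1) % len(self.entries)
            self.left = self.entries[self.pos][1]
        return None


def arbitrate(high_entries, low_entries, queues, high_limit):
    """Drain `queues` (VL -> deque of packets) and return the send order.

    Assumption: at most `high_limit` high-priority packets are sent
    before the low-priority table gets a chance to send one packet.
    """
    high, low = WrrTable(high_entries), WrrTable(low_entries)
    sent, credits = [], high_limit
    while True:
        vl = None
        if credits > 0:
            vl = high.next_vl(queues)
            if vl is not None:
                credits -= 1
        if vl is None:
            vl = low.next_vl(queues)
            credits = high_limit  # low priority had its turn
        if vl is None:
            vl = high.next_vl(queues)  # nothing low-priority is waiting
        if vl is None:
            return sent
        sent.append(queues[vl].popleft())


if __name__ == "__main__":
    queues = {0: deque(["a0", "a1", "a2"]), 1: deque(["b0", "b1"]), 2: deque(["c0", "c1"])}
    print(arbitrate([(0, 2), (1, 1)], [(2, 1)], queues, high_limit=3))
```

With VL0 and VL1 in the high-priority table and VL2 in the low-priority table, the low-priority lane is still served after every three high-priority packets, which is the starvation protection the high-priority limit is meant to provide.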
279. tributions
  - /lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat-like Distributions
  - /lib/modules/`uname -r`/updates/dkms/ on Ubuntu

Firmware

The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
  a. You run the installation script in default mode; that is, without the option '--without-fw-update'.
  b. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.

  Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.

In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.

Error message:
  -I- Querying device
  -E- Can't auto detect fw configuration file

2.3.5 Post-installation Notes

Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.

The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.

2.4 Updating Firmware After Installation

In case you ran the mlnxofedinst
280. troducing new set of PR/MPR attributes.

4.4.3 Supported Policy

The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:

I. Port Group
A set of CAs, Routers or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.

II. Fabric Setup
Defines how the SL2VL and VLArb tables should be set up.

  Note: In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).

III. QoS-Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits.

  Note: Path Bits are not implemented.

IV. Matching Rules
A list of rules that match an incoming PR/MPR request to a QoS-Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply. The matching expressions are defined for the following fields:
  - SRC and DST to lists of port groups
  - Service-ID to a list of Service-ID values or ranges
  - QoS-Class to a list of QoS-Class values or ranges

4.4.4 CMA Features

The CMA interface supports Servic
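The matching-rule semantics described above (rules processed in order, first match wins, every expression in a rule must match) can be sketched in Python. This is an illustrative model only, not the OpenSM implementation or its policy-file syntax; the dictionary layout, field names, group names, and GUID values are all hypothetical.

```python
# Hypothetical port groups: group name -> set of port GUIDs.
PORT_GROUPS = {
    "storage": {0x1000},
    "compute": {0x2000, 0x2001},
}

# Hypothetical rules, evaluated top to bottom.
RULES = [
    {"match": {"src": ["storage"], "service_id": [(0x100, 0x1FF)]}, "qos_level": "bulk"},
    {"match": {"qos_class": [7]}, "qos_level": "urgent"},
    {"match": {}, "qos_level": "default"},  # no expressions: always matches
]


def in_groups(guid, group_names, port_groups):
    """True if the GUID belongs to at least one of the named port groups."""
    return any(guid in port_groups.get(name, ()) for name in group_names)


def in_values(value, allowed):
    """`allowed` holds single values or inclusive (low, high) ranges."""
    if value is None:
        return False
    for item in allowed:
        low, high = item if isinstance(item, tuple) else (item, item)
        if low <= value <= high:
            return True
    return False


def rule_matches(rule, request, port_groups):
    """A rule applies only if all of its match expressions hold."""
    checks = {
        "src": lambda allowed: in_groups(request.get("src"), allowed, port_groups),
        "dst": lambda allowed: in_groups(request.get("dst"), allowed, port_groups),
        "service_id": lambda allowed: in_values(request.get("service_id"), allowed),
        "qos_class": lambda allowed: in_values(request.get("qos_class"), allowed),
    }
    return all(checks[field](allowed) for field, allowed in rule["match"].items())


def select_qos_level(rules, request, port_groups):
    """Return the QoS level of the first matching rule, or None."""
    for rule in rules:
        if rule_matches(rule, request, port_groups):
            return rule["qos_level"]
    return None


if __name__ == "__main__":
    request = {"src": 0x1000, "service_id": 0x150, "qos_class": 7}
    print(select_qos_level(RULES, request, PORT_GROUPS))
```

The example request satisfies both the first and the second rule, but only the first one is applied because rule order decides.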
281. ts credit-loop deadlocks.

If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be a pure fat-tree that complies with the following rules:
  - Tree rank should be between two and eight (inclusively)
  - Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all
  - Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches
  - Switches of the same rank should have the same number of ports in each UP-going port group
  - Switches of the same rank should have the same number of ports in each DOWN-going port group
  - All the CAs have to be at the same tree level (rank)

If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules:
  - Tree rank should be between two and eight (inclusively)
  - All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks.

Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a pure fat-tree.

Note that although the fat-tree algorithm supports trees with non-integer CBB ratio, the routing will not be as balanced as in the case of integer CBB ratio. In addition t
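The pure fat-tree rules listed above (the case without a root GUID file) can be expressed as a small validation sketch in Python. The data model is an assumption made for illustration: each switch is a dictionary with a `rank` (0 for root switches) and lists `up` and `down` holding the number of ports in each port group; the tree rank is taken to be the number of switch levels, and the highest rank is treated as the leaf level. None of this reflects how OpenSM represents the fabric internally.

```python
from collections import defaultdict


def check_pure_fat_tree(switches, ca_ranks):
    """Return a list of rule violations; an empty list means compliant."""
    problems = []
    by_rank = defaultdict(list)
    for sw in switches:
        by_rank[sw["rank"]].append(sw)

    tree_rank = len(by_rank)  # assumption: one rank per switch level
    if not 2 <= tree_rank <= 8:
        problems.append(f"tree rank {tree_rank} is outside 2..8")

    leaf_rank = max(by_rank, default=0)
    for rank, group in sorted(by_rank.items()):
        if rank == 0:
            # Root switches must not have UP-going ports at all.
            if any(sw["up"] for sw in group):
                problems.append("root switch has UP-going ports")
        elif len({len(sw["up"]) for sw in group}) > 1:
            problems.append(f"rank {rank}: differing numbers of UP-going port groups")
        if rank != leaf_rank and len({len(sw["down"]) for sw in group}) > 1:
            problems.append(f"rank {rank}: differing numbers of DOWN-going port groups")
        if len({size for sw in group for size in sw["up"]}) > 1:
            problems.append(f"rank {rank}: UP-going port groups differ in size")
        if len({size for sw in group for size in sw["down"]}) > 1:
            problems.append(f"rank {rank}: DOWN-going port groups differ in size")

    if len(set(ca_ranks)) > 1:
        problems.append("CAs are not all at the same tree level")
    return problems


if __name__ == "__main__":
    fabric = [
        {"rank": 0, "up": [], "down": [2, 2]},
        {"rank": 0, "up": [], "down": [2, 2]},
        {"rank": 1, "up": [2, 2], "down": [1, 1, 1]},
        {"rank": 1, "up": [2, 2], "down": [1, 1]},
    ]
    print(check_pure_fat_tree(fabric, ca_ranks=[2, 2, 2]))
```

A non-empty result corresponds to the situation in which the text says routing falls back to min-hop.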
282. uage designed for high-performance computing on large-scale parallel machines. The language provides a uniform programming model for both shared and distributed memory hardware. The programmer is presented with a single shared, partitioned address space, where variables may be directly read and written by any processor, but each variable is physically associated with a single processor. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time, typically with a single thread of execution per processor.

In order to express parallelism, UPC extends ISO C 99 with the following constructs:
  - An explicitly parallel execution model
  - A shared address space
  - Synchronization primitives and a memory consistency model
  - Memory management primitives

The UPC language evolved from experiences with three other earlier languages that proposed parallel extensions to ISO C 99: AC, Split-C, and Parallel C Preprocessor (PCP). UPC is not a superset of these three languages, but rather an attempt to distill the best characteristics of each. UPC combines the programmability advantages of the shared memory programming paradigm and the control over data layout and performance of the message passing programming paradigm.

Mellanox ScalableUPC is based on the Berkeley UPC package (see http://upc.lbl.gov) and contains the following enhancements:
  - GasNet library used within UPC integrated with
283. und ports.

The actual physical port resource tables:
  - Port GID tables:
      ports/<n>/gids/<n> where 0 <= n <= 127 (the physical port gids)
      ports/<n>/admin_guids/<n> where 0 <= n <= 127 (allows examining or changing the administrative state of a given GUID)
  - ports/<n>/pkeys/<n> where 0 <= n <= 126 (displays the contents of the physical pkey table)

<pci_id> directories: one for Dom0 and one per guest. Here you may see the mapping between virtual and physical pkey indices, and the virtual to physical gid 0. Currently, the GID mapping cannot be modified, but the pkey virtual to physical mapping can. These directories have the structure:
  - <pci_id>/port/<m>/gid_idx/0 where m = 1..2 (this is read-only)
  - <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126

For instructions on configuring pkey_idx, please see below.

4.13.7.2.2 Configuring an Alias GUID (under ports/<n>/admin_guids)

Step 1. Determine the GUID index of the PCI Virtual Function that you want to pass through to a guest. For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially need to see which GUID index is used for this function. To do so:

  cat /sys/class/infiniband/iov/0000:02:00.3/port/<port_num>/gid_idx/0

The value returned will present which guid index to modify.

Step 2. Modify the physical GUID table via the ad
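The directory layout and index ranges described above lend themselves to a small helper that builds the relevant file paths and rejects out-of-range indices. The Python sketch below is illustrative only: the function names are hypothetical, and the location of the `iov` directory is passed in by the caller because it depends on the system.

```python
from pathlib import PurePosixPath

MAX_GID_INDEX = 127   # ports/<n>/gids/<n> and ports/<n>/admin_guids/<n>
MAX_PKEY_INDEX = 126  # ports/<n>/pkeys/<n> and <pci_id>/port/<m>/pkey_idx/<n>


def _check(name, value, low, high):
    """Raise ValueError when value is outside the inclusive range."""
    if not low <= value <= high:
        raise ValueError(f"{name} must be in {low}..{high}, got {value}")


def admin_guid_path(iov_dir, port, index):
    """Path of one administrative GUID entry of a physical port."""
    _check("index", index, 0, MAX_GID_INDEX)
    return PurePosixPath(iov_dir, "ports", str(port), "admin_guids", str(index))


def gid_idx_path(iov_dir, pci_id, port):
    """Path of the (read-only) virtual-to-physical gid 0 mapping."""
    _check("port", port, 1, 2)
    return PurePosixPath(iov_dir, pci_id, "port", str(port), "gid_idx", "0")


def pkey_idx_path(iov_dir, pci_id, port, index):
    """Path of one virtual-to-physical pkey index mapping."""
    _check("port", port, 1, 2)
    _check("index", index, 0, MAX_PKEY_INDEX)
    return PurePosixPath(iov_dir, pci_id, "port", str(port), "pkey_idx", str(index))


if __name__ == "__main__":
    # The iov directory used here is an example value, not a guaranteed location.
    print(gid_idx_path("/sys/class/infiniband/iov", "0000:02:00.3", 1))
```

Only pure path objects are constructed, so the sketch can be tried on any machine; reading or writing the resulting files is left to the caller.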
284. uration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of the Mellanox OFED for Linux installation.

The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:

  dhcpd <IB network interface name> -d

Example:

  host1# dhcpd ib0 -d

4.3.3.1.2 DHCP Client (Optional)

A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under "Example: Adding an IB Driver to initrd (Linux)".

In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:

  dhclient -cf <client conf file> <IB network interface name>

Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf:

  # The value indicates a hexadecimal number
  interface "ib1" {
    send dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
  }

Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf: The
285. urce Network Boot Firmware

  net0: 00:02:c9:03:00:0c:78:11 on PCI02:00.0 (open)
    [Link:down, TX:0 TXE:0 RX:0 RXE:0]
    [Link status: The socket is not connected]
  Waiting for link-up on net0... ok

Placing Client Identifiers in /etc/dhcpd.conf

The following is an excerpt of an /etc/dhcpd.conf example file showing the format for representing a client machine to the DHCP server:

  host host1 {
    next-server 11.4.3.7;
    filename "pxelinux.0";
    fixed-address 11.4.3.130;
    option dhcp-client-identifier 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
  }

A.4 Subnet Manager - OpenSM

  Note: This section applies to ports configured as InfiniBand only.

FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM - Subnet Manager" on page 121.

  Note: To use OpenSM caching on large InfiniBand clusters (> 100 nodes), it is recommended to use the OpenSM options described in Section 8.2.1, "opensm Syntax", on page 121.

A.5 TFTP Server

When you set the 'filename' parameter in your DHCP configuration file to a non-empty file name, the client will ask for this file to be passed through TFTP. For this reason, you need to install a TFTP server.

A.6 BIOS Configuration
286. wing:

1. View the available package groups by invoking:

  # yum grouplist | grep MLNX_OFED
  MLNX_OFED ALL
  MLNX_OFED BASIC
  MLNX_OFED GUEST
  MLNX_OFED HPC
  MLNX_OFED HYPERVISOR
  MLNX_OFED VMA
  MLNX_OFED VMA-ETH
  MLNX_OFED VMA-VPI

2. Install the desired group:

  # yum groupinstall 'MLNX_OFED ALL'
  Loaded plugins: product-id, security, subscription-manager
  This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
  Setting up Group Process
  Resolving Dependencies
  --> Running transaction check
  ---> Package ar_mgr.x86_64 0:1.0-0.11.g22fff4a will be installed
  rds-devel.x86_64 0:2.0.6mlnx-1
  rds-tools.x86_64 0:2.0.6mlnx-1
  srptools.x86_64 0:0.0.4mlnx3-OFED.2.0.2.6.7.11.ge863cb7
  Complete!

2.5.3 Updating Firmware After Installation

Installing MLNX_OFED using the YUM tool does not automatically update the firmware. To update the firmware to the version included in the MLNX_OFED package, you can either:
  - Run the mlnxofedinstall script with the "--fw-update-only" flag, or
  - Update the firmware to the latest version available on the Mellanox Technologies Web site, as described in Section 2.4, "Updating Firmware After Installation", on page 36.

2.6 Uninstalling Mellanox OFED

Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.

2.7 Uninstalling Mellanox OFED using the YUM Tool

If MLNX_OF
287. wn to compute nodes. To solve this problem, a list of non-CN nodes can be specified by the '-G' or '--io_guid_file' option. These nodes will be allowed to use switches the wrong way around a specific number of times (specified by '-H' or '--max_reverse_hops'). With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree.

In the scheme above, with a max_reverse_hops value of 1, routes will be instantiated between N1<->N2 and N2<->N3. With a max_reverse_hops value of 2, N1, N2 and N3 will all have routes between them.

  Note: Using max_reverse_hops creates routes that use the switch in a counter-stream way. This option should never be used to connect nodes with high bandwidth traffic between them. It should only be used to allow connectivity for HA purposes or similar. Also, having routes the other way around can cause credit loops.

8.5.4.2 Activation through OpenSM

Use the '-R ftree' option to activate the fat-tree algorithm.

  Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.

8.5.5 LASH Routing Algorithm

LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology agnostic deadlock-free routing within communication networks.

When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all p
288. Table 24 - ibportstate Flags and Options (Continued)
  Flag | Optional / Mandatory | Default (If Not Specified) | Description
  -v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
  -V(ersion) | Optional | | Show version info.
  -D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2).
  -G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
  -s <smlid> | Optional | | Use <smlid> as the target lid for SM/SA queries.
  -C <ca_name> | Optional | | Use the specified channel adapter or router.
  -P <ca_port> | Optional | | Use the specified port.
  -t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs (msec).
  <dest dr_path | lid | guid> | Optional | | Destination's directed path, LID, or GUID.
  <portnum> | Optional | | Destination's port number.
  <op> [<value>] | Optional | query | Define the allowed port operations: enable, disable, reset, speed, and query.

In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:
  1. The first ACTIVE port that is found.
  2. If not found, the first port th
289. xc024 0xc040 0xc041 0xc042
  12 valid mlids dumped

9.12 smpquery

Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.

Synopsis

  smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]

Output Files

Table 26 lists the various flags of the command.

Table 26 - smpquery Flags and Options
  Flag | Optional / Mandatory | Default (If Not Specified) | Description
  -h(help) | Optional | | Print the help menu.
  -d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
  -e(rr_show) | Optional | | Show send and receive errors (timeouts and others).
  -v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
  -D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2).
  -G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
  -s <smlid>
290. you will need to actively turn it off. Running the SM without the CC Manager is not sufficient, as the hardware still continues to function in accordance with the previous CC configuration.

For further information on how to turn OFF CC, please refer to Section 8.9.3, "Configuring Congestion Control Manager", on page 167.

8.9.3 Configuring Congestion Control Manager

Congestion Control (CC) Manager comes with a predefined set of settings. However, you can fine-tune the CC mechanism and CC Manager behavior by modifying some of the options. To do so, perform the following:

1. Find the 'event_plugin_options' option in the SM options file, and add the following "conf_file <cc-mgr-options-file-name>":

  # Options string that would be passed to the plugin(s)
  event_plugin_options ccmgr --conf_file <cc-mgr-options-file-name>

2. Run the SM with the new options file:

  opensm -F <options-file-name>

  Note: To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.

For the full list of CC Manager options with all the default values, see "Configuring Congestion Control Manager" on page 167.

For further details on the list of CC Manager options, please refer to the IB spec.

8.9.4 Configuring Congestion Control Manager Main Settings

To fine-tune CC mechanism and CC Manager behavior
