Mellanox OFED Linux User's Manual
Contents
1. Flag | Optional / Mandatory | Default (If Not Specified) | Description
   -i <port>, --ib-port=<port> | Optional | All device ports | Query the specified device port <port>
   -l, --list | Optional | Inactive | Only list the names of InfiniBand devices
   -v, --verbose | Optional | Inactive | Print all available information about the InfiniBand device(s)

   Examples

   1. List the names of all available InfiniBand devices:

      > ibv_devinfo -l
      2 HCAs found:
          mthca0
          mlx4_0

   2. Query the device mlx4_0 and print user-available information for its Port 2:

      > ibv_devinfo -d mlx4_0 -i 2
      hca_id: mlx4_0
          fw_ver:           2.5.944
          node_guid:        0000:0000:0007:3895
          sys_image_guid:   0000:0000:0007:3898
          vendor_id:        0x02c9
          vendor_part_id:   25418
          hw_ver:           0xA0
          board_id:         MT_04A0140005
          phys_port_cnt:    2
              port: 2
                  state:       PORT_ACTIVE (4)
                  max_mtu:     2048 (4)
                  active_mtu:  2048 (4)
                  sm_lid:      1
                  port_lid:    1
                  port_lmc:    0x00

   Mellanox Technologies. Mellanox OFED for Linux User's Manual, Rev 1.5

   14.8 ibstatus

   Applicable Hardware: All InfiniBand devices

   Description: Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.

   Synopsis: ibstatus [-h] [<device name>[:<port>]]

   Table 11 lists the various flags of the command.

   Table 11 - ibstatus Flags and Options
   Flag | Optional / Mandatory | Default (If Not Specified) | Description
2. 2.7.000
       HCA Firmware Check on HCA 0 ........ PASS
       Host Driver Initialization ......... PASS
       Number of HCA Ports Active ......... 0
       Port State of Port 0 on HCA 0 ...... INIT
       Port State of Port 0 on HCA 0 ...... DOWN
       Error Counter Check on HCA 0 ....... PASS
       Kernel Syslog Check ................ PASS
       Node GUID on HCA 0 ................. 00:02:c9:03:00:00:10:e0
       DONE

   Note: After the installer completes, information about the Mellanox OFED installation, such as prefix, kernel version, and installation parameters, can be retrieved by running the command /etc/infiniband/info.

   2.3.4 Installation Results

   Software

   The OFED and MFT packages are installed under the /usr directory.

   The kernel modules are installed under:
   - InfiniBand subsystem: /lib/modules/`uname -r`/updates/kernel/drivers/infiniband/
   - mlx4 driver: under /lib/modules/`uname -r`/updates/kernel/drivers/net/mlx4 you will find mlx4_core.ko, mlx4_en.ko, mlx4_ib.ko (and mlx4_fc if you ran the installation script with --with-fc)
   - RDS: /lib/modules/`uname -r`/updates/kernel/net/rds/rds.ko
   - Bonding module: /lib/modules/`uname -r`/updates/kernel/drivers/net/bonding/bonding.ko

   The package kernel-ib-devel include files are placed under /usr/src/ofa_kernel/include/. These include files should be used when building kernel modul
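The module locations listed in this excerpt all depend on the running kernel's release string (the output of `uname -r`). The following is a minimal sketch, not part of OFED, that builds those paths so their presence can be checked after installation. The function name `expected_module_paths` is hypothetical, and the path layout is taken from the text above.

```python
import os
import platform
import posixpath


def expected_module_paths(kernel_release):
    """Return the kernel-module paths named in the text for one kernel release."""
    base = posixpath.join("/lib/modules", kernel_release, "updates", "kernel")
    mlx4_dir = posixpath.join(base, "drivers", "net", "mlx4")
    return {
        "mlx4_core": posixpath.join(mlx4_dir, "mlx4_core.ko"),
        "mlx4_en": posixpath.join(mlx4_dir, "mlx4_en.ko"),
        "mlx4_ib": posixpath.join(mlx4_dir, "mlx4_ib.ko"),
        "rds": posixpath.join(base, "net", "rds", "rds.ko"),
        "bonding": posixpath.join(base, "drivers", "net", "bonding", "bonding.ko"),
    }


if __name__ == "__main__":
    # platform.release() reports the same string as `uname -r` on Linux.
    for name, path in expected_module_paths(platform.release()).items():
        status = "found" if os.path.exists(path) else "missing"
        print(f"{name}: {path} ({status})")
```

Run as a script on the installed host, it prints one line per module with a found/missing status.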
3. -clear_semaphore ... allowed when this switch is used. Warning: May result in system instability or Flash corruption if the device or another application is currently using the Flash.
   -i(mage) <image> | burn, verify | Binary image file
   -qq | burn, query | Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
   -nofs | burn | Burn image in a non-failsafe manner
   -skip_is | burn | Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant sector difference is detected.

   Table 17 - mstflint Switches (Sheet 2 of 2)
   Switch | Affected / Relevant Commands | Description
   -byte_mode | burn, write | Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
   -s(ilent) | burn | Do not print burn progress messages
   -y(es) | All | Non-interactive mode: assume the answer is "yes" to all questions
   -no | All | Non-interactive mode: assume the answer is "no" to all questions
   -vsd <string> | burn | Write this string of up to 208 characters to VSD upon a burn command
   -use_image_ps | burn | Burn vsd as it appears in the given image; do not keep existing VSD on
4. [Screenshot: Suggested Partitioning dialog with the options Accept Proposal, Base Partition Setup on This Proposal, and Create Custom Partition Setup]

   Step 12. In the Expert Partitioner window, select from the IET-VIRTUAL-DISK device the row that has its Mount column indicating "swap", then click Delete. Confirm the delete operation and click Finish.

   [Screenshot: Expert Partitioner window listing /dev/sda (8.0 GB, IET VIRTUAL DISK), /dev/sda1 (70.5 MB, Linux native, Ext2, mounted at /boot) and /dev/sda3 (7.4 GB, Linux native, Reiser), with the confirmation prompt "Really delete device /dev/sda2"]
5. Documentation Conventions

   Typographical Conventions

   Table 1 - Typographical Conventions
   Description | Convention | Example
   File names | file.extension |
   Directory names | directory |
   Commands and their parameters | command param1 |
   Optional items | |
   Mutually exclusive parameters | p1 p2 p3 |
   Optional mutually exclusive parameters | p1 p2 p3 |
   Prompt of a user command under bash shell | hostname |
   Prompt of a root command under bash shell | hostname |
   Prompt of a user command under tcsh shell | tcsh |
   Environment variables | VARIABLE |
   Code example | if a b |
   Comment at the beginning of a code line | |
   Characters to be typed by users as-is | bold font |
   Keywords | bold font |
   Variables for which users supply specific values | Italic font |
   Emphasized words | Italic font | These are emphasized words
   Pop-up menu sequences | menu1 > menu2 > ... > item |
   Note | Note |
   Warning | Warning |

   Common Abbreviations and Acronyms

   Table 2 - Abbreviations and Acronyms (Sheet 1 of 2)
   Abbreviation / Acronym | Whole Word / Description
   B | (Capital) "B" is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
   b | (Small) "b" is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
   FCoE | Fibre Channel over Ethernet
   Mella
6. Min Hop Table setting: after ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.

   At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.

   Note: Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not break the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.

   12.5.3.1 UPDN Algorithm Usage

   Activation through OpenSM
   - Use the "-R updn" option (instead of the old "-u") to activate the UPDN algorithm.
   - Use "-a <root_guid_file>" for adding an UPDN guid file that contains the root nodes for ranking. If the "-a" option is not used, OpenSM uses its auto-detect root nodes algorithm.

   Notes on the guid list file:
   1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
   2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
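The guid-file rule above (one guid per line, badly formatted lines discarded) can be illustrated with a short sketch. This is not OpenSM's parser: the accepted guid syntax (16 hexadecimal digits with an optional 0x prefix) and the function name `parse_guid_lines` are assumptions made for the example.

```python
import re

# Assumption: a guid is written as 16 hex digits with an optional 0x prefix.
GUID_RE = re.compile(r"^(?:0x)?[0-9a-fA-F]{16}$")


def parse_guid_lines(lines):
    """Return the valid guids as integers, skipping lines with an invalid format."""
    guids = []
    for line in lines:
        token = line.strip()
        if GUID_RE.match(token):
            guids.append(int(token, 16))
    return guids


if __name__ == "__main__":
    sample = ["0x0002c90300001010\n", "not a guid\n", "\n", "0002c90300001011\n"]
    for guid in parse_guid_lines(sample):
        print(f"0x{guid:016x}")
```

Checking a root guid file this way before passing it with "-a" makes it easier to spot lines that would otherwise be dropped without notice.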
7. [Screenshot: boot loader settings dialog. Recoverable field values: Optional Kernel Command Line Parameter "resume=/dev/sda1 splash=silent showopts ibft_mode=off"; Kernel Image "/boot/vmlinuz"; Initial RAM Disk "/boot/initrd"; Root Device "/dev/sda2"]

   Step 17. If you wish to change additional settings, click the appropriate item, perform the changes, and click Accept when done.

   Step 18. In the Confirm Installation window, click Install to start the installation. (See image below.)

   [Screenshot: Installation Settings window (Overview and Expert tabs) with the installation progress sidebar]
8. B.6.2 Starting Boot

   Boot the client machine and enter BIOS setup to configure MLNX NIC <ver> to be the first on the boot device priority list (see Section B.5).

   Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2.

   If MLNX NIC was selected through BIOS setup, the client will boot from ConnectX EN PXE. The client will display ConnectX EN PXE attributes and will attempt to bring up a port link.

   [Screenshot: boot banner "MLNX NIC 1.5.5 (PCI 02:00.0) starting execution", "Mellanox ConnectX EN PXE v1.5.5", "gPXE 0.9.9 Open Source Boot Firmware", followed by the net0 link status line (Link up, TX 0, TXE 0, RX 0, RXE 0)]

   If the Ethernet link comes up successfully, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from. The client waits up to 30 seconds for a DHCP server response.

   [Screenshot: the same boot banner followed by the DHCP exchange on net0 and the lines "Booting from filename pxeboot.0" and "tftp://11.4.3.7/pxeboot.0"]

   Next, ConnectX EN PXE attempts to boot as directed
9. Description: Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them, or simply reset them.

   Synopsis: perfquery [-h] [-d] [-G] [-a] [-l] [-r] [-C ca_name] [-P ca_port] [-R] [-t timeout_ms] [-V] [<lid|guid> [[port][reset_mask]]]

   Table 15 lists the various flags of the command.

   Table 15 - perfquery Flags and Options
   Flag | Optional / Mandatory | Default (If Not Specified) | Description
   -h(help) | Optional | | Print the help menu
   -d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
   -G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
   -a | Optional | | Apply query to all ports
   -l | Optional | | Loop ports
   -r | Optional | | Reset the counters after reading them
   -C <ca_name> | Optional | | Use the specified channel adapter or router
   -P <ca_port> | Optional | | Use the specified port
   -R | Optional | | Reset the counters
   -t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
   -V(ersion) | Optional | | Show version info
   <lid|guid> [[port][reset_mask]] | Optional | | LID or GUID

   Examples

   perfquery -r 32 1        # read performance counters and reset
   perfquery -e -r 32 1     # read extended performance counters and reset
   perfquery -R 0x20 1
10. Step 13. In the pop-up window, click No to approve deleting the swap partition. You will be returned to the Installation Settings window. (See image below.)

    [Screenshot: Expert Partitioner window listing the 8.0 GB IET VIRTUAL DISK with a 70.5 MB Ext2 partition and a 7.4 GB Reiser partition, and a pop-up stating that no swap partition has been assigned and that creating one is highly recommended in most cases]
11. -M, --lid_matrix_file <file name>
        This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.
    -U, --lfts_file <file name>
        This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.
    -S, --sadb_file <file name>
        This option specifies the name of the SA DB dump file from where SA database will be loaded.
    -a, --root_guid_file <file name>
        Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
    -u, --cn_guid_file <file name>
        Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
    -G, --io_guid_file <path to file>
        Set the I/O nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
    -H, --max_reverse_hops <hop count>
        Set the max number of hops the wrong way around an I/O node is allowed to do (connectivity for I/O nodes on top switches).
    -m, --ids_guid_file <file name>
        Name of the map file with the set of IDs which will be used by the Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).
    -X, --guid_routing_order_file <file name>
        Set the order port guids will be routed for the MinHop and Up/Down routing algorithms to the gu
12. -h | Optional | | Print the help menu
    <device> | Optional | All devices | Print information for the specified device. May specify more than one device.
    <port> | Optional, but requires specifying a device name | All ports of the specified device | Print information for the specified port only (of the specified device)

    Examples

    1. List the status of all available InfiniBand devices and their ports:

       > ibstatus
       Infiniband device 'mlx4_0' port 1 status:
           default gid:   fe80:0000:0000:0000:0000:0000:0007:3896
           base lid:      0x3
           sm lid:        0x3
           state:         4: ACTIVE
           phys state:    5: LinkUp
           rate:          20 Gb/sec (4X DDR)
       Infiniband device 'mlx4_0' port 2 status:
           default gid:   fe80:0000:0000:0000:0000:0000:0007:3897
           base lid:      0x1
           sm lid:        0x1
           state:         4: ACTIVE
           phys state:    5: LinkUp
           rate:          20 Gb/sec (4X DDR)
       Infiniband device 'mthca0' port 1 status:
           default gid:   fe80:0000:0000:0000:...
           base lid:      0x0
           sm lid:        0x0
           state:         2: INIT
           phys state:    5: LinkUp
           rate:          10 Gb/sec (4X)
       Infiniband device 'mthca0' port 2 status:
           default gid:   fe80:0000:0000:0000:...
           base lid:      0x0
           sm lid:        0x0
           state:         2: INIT
           phys state:    5: LinkUp
           rate:          10 Gb/sec (4X)

    2. List the status of spe
13. [Screenshot: Installation Settings summary. Keyboard Layout: English (US). Partitioning: create boot partition /dev/sda1 (70.5 MB) with ext2; create swap partition /dev/sda2 (502.0 MB); create root partition /dev/sda3 (7.4 GB) with reiserfs. Software: SUSE Linux Enterprise Server 10, GNOME Desktop Environment for Server, Novell AppArmor, Print Server, Server Base System; size of packages to install: 1.3 GB. Primary Language: English (US)]

    Step 11. Select Base Partition Setup on This Proposal, then click Next.

    [Screenshot: Suggested Partitioning dialog proposing the same three partitions (/dev/sda1 boot, /dev/sda2 swap, /dev/sda3 root), with help text explaining Accept Proposal and Custom Partition Setup (the option to choose for RAID and LVM)]
14. ibv_asyncwatch
    ibdump

    14.2 Utilities Usage

    This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.

    14.2.1 Common Configuration, Interface and Addressing

    Topology File (Optional)

    An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.

    For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).

    To specify a topology file to a diagnostic tool, use one of the following two options:
    1. On the command line, specify the file name using the option "-t <topology file name>".
    2. Define the environment variable IBDIAG_TOPO_FILE.

    To specify the local system name to a diagnostic too
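The two ways of naming a topology file (the "-t" option and the IBDIAG_TOPO_FILE environment variable) can be modelled in a wrapper script as follows. This is a sketch only: the excerpt does not say which source wins when both are set, so the precedence used here (command line first) is an assumption, and `resolve_topology_file` is a hypothetical helper name.

```python
import os


def resolve_topology_file(cli_value=None, environ=None):
    """Pick the topology file from the -t value or from IBDIAG_TOPO_FILE.

    Assumed precedence: an explicit command-line value overrides the
    environment variable. Returns None when neither is set.
    """
    env = os.environ if environ is None else environ
    if cli_value:
        return cli_value
    return env.get("IBDIAG_TOPO_FILE") or None


if __name__ == "__main__":
    print(resolve_topology_file())
```

Passing the environment as a parameter keeps the helper easy to exercise without touching the real process environment.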
15. "-u <root_cn_file>" to provide the list of compute nodes. If the "-u" option is not used, all the CAs are considered as compute nodes.

    Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.

    12.5.5 LASH Routing Algorithm

    LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology agnostic deadlock-free routing within communication networks.

    When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock.

    Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between switch and HCA.

    In more detail, the algorithm works as follows:

    1. LASH determines the shortest path between all pairs of source/destination switches. Note: LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.

    2. LASH then begins an SL assignment process where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analysing a channel dependency gr
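The layer assignment in step 2 can be sketched in a few lines. This is a simplified illustration and not OpenSM's LASH code. It assumes that routes are given as lists of switch names, that a layer is deadlock free exactly when its channel dependency graph has no cycle, and that each route goes to the first layer that stays acyclic (a greedy first-fit choice made for the example). It also ignores the rule that both directions of a pair share one SL. All function names are hypothetical.

```python
def has_cycle(edges):
    """Detect a cycle in a directed graph given as {node: set(successors)}."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in edges.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:          # back edge: a cycle exists
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(edges))


def route_dependencies(route):
    """Channel dependencies of one route: pairs of consecutive directed links."""
    channels = list(zip(route, route[1:]))
    return list(zip(channels, channels[1:]))


def assign_layers(routes, max_layers):
    """Place each route in the first layer whose dependency graph stays acyclic."""
    layers = [dict() for _ in range(max_layers)]
    assignment = {}
    for index, route in enumerate(routes):
        for layer_no, graph in enumerate(layers):
            trial = {node: set(succ) for node, succ in graph.items()}
            for src, dst in route_dependencies(route):
                trial.setdefault(src, set()).add(dst)
            if not has_cycle(trial):
                layers[layer_no] = trial
                assignment[index] = layer_no
                break
        else:
            raise ValueError(f"route {index} fits in no layer")
    return assignment


if __name__ == "__main__":
    # Four two-hop routes around a ring of switches A-B-C-D.
    ring = [["A", "B", "C"], ["B", "C", "D"], ["C", "D", "A"], ["D", "A", "B"]]
    print(assign_layers(ring, max_layers=2))
```

In the ring example the fourth route would close a dependency cycle in the first layer, so it is moved to the second one.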
16. [Screenshot: Suggested Partitioning dialog with the options Accept Proposal, Base Partition Setup on This Proposal, and Create Custom Partition Setup]

    Step 12. In the Expert Partitioner window, select from the IET-VIRTUAL-DISK device the row that has its Mount column indicating "swap", then click Delete. Confirm the delete operation and click Finish.

    [Screenshot: Expert Partitioner window listing /dev/sda (8.0 GB, IET VIRTUAL DISK), /dev/sda1 (70.5 MB, Linux native, Ext2, mounted at /boot) and /dev/sda3 (7.4 GB, Linux native, Reiser), with the confirmation prompt "Really delete device /dev/sda2"]
17. #define TXBUFSZ 2048
    uint8_t tx_buffer[TXBUFSZ];

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            printf("Usage: sdp_client <ip_addr>\n");
            exit(EXIT_FAILURE);
        }

        /* Open an SDP stream socket */
        int sd = socket(PF_INET_SDP, SOCK_STREAM, 0);
        if (sd < 0) {
            perror("socket() failed");
            exit(EXIT_FAILURE);
        }

        struct sockaddr_in to_addr = {
            .sin_family = AF_INET,
            .sin_port = htons(DEF_PORT),
        };

        /* inet_aton() returns 0 when the address string is invalid */
        int ip_ret = inet_aton(argv[1], &to_addr.sin_addr);
        if (ip_ret == 0) {
            printf("invalid ip address '%s'\n", argv[1]);
            exit(EXIT_FAILURE);
        }

        int conn_ret = connect(sd, (struct sockaddr *) &to_addr, sizeof(to_addr));
        if (conn_ret < 0) {
            perror("connect() failed");
            exit(EXIT_FAILURE);
        }

        printf("connected to %s:%u\n",
               inet_ntoa(to_addr.sin_addr), ntohs(to_addr.sin_port));

        ssize_t nw = write(sd, tx_buffer, TXBUFSZ);
        if (nw < 0) {
            perror("write() failed");
            exit(EXIT_FAILURE);
        } else if (nw == 0) {
            printf("socket was closed by remote host\n");
        }

        printf("sent %zd bytes\n", nw);
        close(sd);
        return 0;
    }

    sdp_server.c Code

    /* Usage: sdp_server */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
18. -nocolor | Optional | color mode | Use mono mode rather than color mode
    -C <ca_name> | Optional | | Use the specified channel adapter or router
    -P <ca_port> | Optional | | Use the specified port
    -t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
    <lid|guid> | Mandatory with -G flag | | Use the specified port's or node's LID/GUID (with -G option)

    Table 16 - ibcheckerrs Flags and Options (continued)
    Flag | Optional / Mandatory | Default (If Not Specified) | Description
    <port> | Mandatory without -G flag | | Use the specified port

    Examples

    1. Check aggregated node counter for LID 0x2:

       > ibcheckerrs 2
       #warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
       #warn: counter LinkRecovers = 255 (threshold 10) lid 2 port 255
       #warn: counter LinkDowned = 12 (threshold 10) lid 2 port 255
       #warn: counter RcvErrors = 565 (threshold 10) lid 2 port 255
       #warn: counter XmtDiscards = 441 (threshold 100) lid 2 port 255
       Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port all: FAILED

    2. Check port counters for LID 2 Port 1:

       > ibcheckerrs -v 2 1
       Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK

    3. Check the LID2 Port 1 using the specified threshold file:

       >
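The warning lines in the first example follow a simple rule: a counter is reported when its value is above the configured threshold. The sketch below reproduces that comparison with the counter values and thresholds from the example. It is an illustration, not the ibcheckerrs script; the strict "greater than" comparison and the function name `exceeded` are assumptions.

```python
def exceeded(counters, thresholds):
    """Return (name, value, threshold) for each counter above its threshold."""
    return [
        (name, value, thresholds[name])
        for name, value in counters.items()
        if name in thresholds and value > thresholds[name]
    ]


# Values taken from the warning lines of the first example.
counters = {"SymbolErrors": 65535, "LinkRecovers": 255, "LinkDowned": 12,
            "RcvErrors": 565, "XmtDiscards": 441}
thresholds = {"SymbolErrors": 10, "LinkRecovers": 10, "LinkDowned": 10,
              "RcvErrors": 10, "XmtDiscards": 100}

if __name__ == "__main__":
    for name, value, limit in exceeded(counters, thresholds):
        print(f"counter {name} = {value} (threshold {limit})")
```

With these inputs all five counters are reported, which matches the FAILED verdict for "port all" in the example.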
19. # reset performance counters of port 1 only
    perfquery -e -R 0x20 1      # reset extended performance counters of port 1 only
    perfquery -R -a 32          # reset performance counters of all ports
    perfquery -R 32 2 0x0fff    # reset only error counters of port 2
    perfquery -R 32 2 0xf000    # reset only non-error counters of port 2

    1. Read local port's performance counters:

       > perfquery
       # Port counters: Lid 6 port 1
       PortSelect:......................1
       CounterSelect:...................0x1000
       SymbolErrors:....................0
       LinkRecovers:....................0
       LinkDowned:......................0
       RcvErrors:.......................0
       RcvRemotePhysErrors:.............0
       RcvSwRelayErrors:................0
       XmtDiscards:.....................0
       XmtConstraintErrors:.............0
       RcvConstraintErrors:.............0
       LinkIntegrityErrors:.............0
       ExcBufOverrunErrors:.............0
       VL15Dropped:.....................0
       XmtData:.........................55178210
       RcvData:.........................55174680
       XmtPkts:.........................766366
       RcvPkts:.........................766315

    2. Read performance counters from LID 2, all ports:

       > perfquery -a 2
       # Port counters: Lid 2 port 255
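The reset_mask values in the examples (0x0fff for the error counters, 0xf000 for the non-error counters) can be read as one bit per counter. The sketch below builds such masks. The bit order is an assumption for illustration: counters are numbered in the order perfquery prints them, starting at bit 0. Under that assumption the twelve error counters give 0x0fff and the four data and packet counters give 0xf000, matching the examples. The helper names are hypothetical.

```python
# Counter names in the order perfquery prints them. The mapping of the
# first name to bit 0, the second to bit 1, and so on is an assumption.
COUNTERS = [
    "SymbolErrors", "LinkRecovers", "LinkDowned", "RcvErrors",
    "RcvRemotePhysErrors", "RcvSwRelayErrors", "XmtDiscards",
    "XmtConstraintErrors", "RcvConstraintErrors", "LinkIntegrityErrors",
    "ExcBufOverrunErrors", "VL15Dropped",
    "XmtData", "RcvData", "XmtPkts", "RcvPkts",
]


def reset_mask(names):
    """Combine the bits of the named counters into one mask."""
    mask = 0
    for name in names:
        mask |= 1 << COUNTERS.index(name)
    return mask


def counters_in(mask):
    """List the counters whose bit is set in the mask."""
    return [name for bit, name in enumerate(COUNTERS) if mask & (1 << bit)]


if __name__ == "__main__":
    print(hex(reset_mask(COUNTERS[:12])))   # the error counters
    print(hex(reset_mask(COUNTERS[12:])))   # the data and packet counters
    print(counters_in(0x0041))
```

A mask built this way would be passed as the last perfquery argument, after the LID and port number.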
20. --without-fw-update option, and now you wish to (manually) update firmware on your adapter card(s), you need to perform the following steps:

    Note: If you need to burn an Expansion ROM image, please refer to "Burning the Expansion ROM Image" on page 189.

    Note: The following steps are also appropriate in case you wish to burn newer firmware that you have downloaded from the Mellanox Technologies Web site (http://www.mellanox.com > Downloads > Firmware).

    Step 1. Start mst.

        host1# mst start

    Step 2. Identify your target InfiniBand device for firmware update.

    a. Get the list of InfiniBand device names on your machine.

        host1# mst status
        MST modules:
            MST PCI module loaded
            MST PCI configuration module loaded
            MST Calibre (I2C) module is not loaded

        MST devices:
        /dev/mst/mt25418_pciconf0   - PCI configuration cycles access.
                                      bus:dev.fn=02:00.0 addr.reg=88 data.reg=92
                                      Chip revision is: A0
        /dev/mst/mt25418_pci_cr0    - PCI direct access.
                                      bus:dev.fn=02:00.0 bar=0xdef00000 size=0x100000
                                      Chip revision is: A0
        /dev/mst/mt25418_pci_msix0  - PCI direct access.
                                      bus:dev.fn=02:00.0 bar=0xdeefe000 size=0x2000
        /dev/mst/mt25418_pci_uar0   - PCI direct access.
                                      bus:dev.fn=02:00.0 bar=0xdc800000 size=0x800000

    b. Your InfiniBand device is the one with the postfix "_pci_cr0". In the example listed above, this will be /dev/mst/mt25418_pci_cr0.

    Step 3. Burn firmware.

    a
21. Compile Netperf by following the instructions at http://www.netperf.org/netperf/NetperfPage.html.

    Step 3. Start the Netperf server.

    The following example shows how to start the Netperf server:

        host1# netserver
        Starting netserver at port 12865
        Starting netserver at hostname 0.0.0.0 port 12865 and family AF_UNSPEC
        host1#

    Step 4. Run the Netperf client. The default test is the Bandwidth test.

    The following example shows how to run the Netperf client, which starts the Bandwidth test by default:

        host2# netperf -H 11.4.17.6 -t TCP_STREAM -c -C -- -m 65536
        TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET
        Recv   Send   Send                          Utilization      Service Demand
        Socket Socket Message Elapsed               Send    Recv     Send    Recv
        Size   Size   Size    Time     Throughput   local   remote   local   remote
        bytes  bytes  bytes   secs.    10^6bits/s   % S     % S      us/KB   us/KB

        87380  16384  65536   10.00    2483.00      7.03    5.42     1.854   1.431

    Note: You must specify the IPoIB IP address when running the Netperf client.

    The following table describes parameters for the netperf command:

    Option | Description
    -H | Where to find the server
    11.4.17.6 | IPoIB IP address
    -t | Test name. Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc.
    -c | Client CPU utilization
    -C | Server CPU utilization
    -- | Separates the global and test-specific parameters
    -m | Message siz
22. NIC
        A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.

    Standby Subnet Manager
        A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.

    Subnet Administrator (SA)
        An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.

    Subnet Manager (SM)
        One of several entities involved in the configuration and control of the subnet.

    Unicast Linear Forwarding Tables (LFT)
        A table that exists in every switch, providing the port through which packets should be sent to each LID.

    Virtual Protocol Interconnect (VPI)
        A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).
23. and Jose Duato at the Universidad Politecnica de Valencia.

    To learn more about LASH and the flexibility behind it, the requirement for layers, and performance comparisons to other algorithms, see the following articles:
    - "Layered Routing in Irregular Networks", Lysne et al., IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 12, December 2005.
    - "Routing for the ASI Fabric Manager", Solheim et al., IEEE Communications Magazine, Vol. 44, No. 7, July 2006.
    - "Layered Shortest Path (LASH) Routing in Irregular System Area Networks", Skeie et al., IEEE Computer Society Communication Architecture for Clusters, 2002.

    12.5.8 Modular Routing Engine

    The modular routing engine structure allows for the ease of "plugging" new routing modules.

    Currently, only unicast callbacks are supported. Multicast can be added later.

    One existing routing module is up-down "updn", which may be activated with the "-R updn" option (instead of the old "-u").

    General usage is:

        host1# opensm -R 'module-name'

    There is also a trivial routing module which is able to load LFT tables from a dump file.

    Main features are:
    - This will load switch LFTs and/or LID matrices (min hops tables).
    - This will load switch LFTs according to the path entries introduced in the dump file.
    - No additional checks will be performed (such as "is port connected", etc.).
    - In case when fabric LIDs were changed
24. cat thresh1
        SymbolErrors=10
        LinkRecovers=10
        LinkDowned=10
        RcvErrors=10
        RcvRemotePhysErrors=100
        RcvSwRelayErrors=100
        XmtDiscards=100
        XmtConstraintErrors=100
        RcvConstraintErrors=100
        LinkIntegrityErrors=10
        ExcBufOverrunErrors=10
        VL15Dropped=100

        > ibcheckerrs -v -T thresh1 2 1
        Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK

    14.14 mstflint

    Applicable Hardware: Mellanox InfiniBand and Ethernet devices and network adapter cards

    Description: Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.

    Note: If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.

    To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.

    Synopsis: mstflint [switches...] <command> [parameters...]

    Table 17 lists the various switches of the utility, and Table 18 lists its commands.

    Table 17 - mstflint Switches (Sheet 1 of 2)
    Switch | Affected / Relevant Commands | Description
    -h | | Print the hel
25. engines fail. Supported engines: minhop, updn, file, ftree, lash, dor

    --do_mesh_analysis
        This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes, which may reduce the number of SLs required to give a deadlock free routing.
    --lash_start_vl <vl number>
        Sets the starting VL to use for the lash routing algorithm. Defaults to 0.
    --sm_sl <sl number>
        Sets the SL to use to communicate with the SM/SA. Defaults to 0.
    -z, --connect_roots
        This option enforces routing engines (up/down and fat-tree) to make connectivity between root switches and in this way be IBA compliant. In many cases, this can violate the "pure" deadlock free algorithm, so use it carefully.
    -A, --ucast_cache
        This option enables the unicast routing cache and prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require new routing calculation, e.g. when one or more CAs/RTRs/leaf switches go down, or one or more of these nodes come back after being down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
26. host host1 {
        next-server 11.4.3.7;
        filename "pxelinux.0";
        fixed-address 11.4.3.130;
        option dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
    }
A.3 Subnet Manager - OpenSM
Note: This section applies to ports configured as InfiniBand only.
FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see OpenSM Subnet Manager on page 111.
A.4 TFTP Server
When you set the 'filename' parameter in your DHCP configuration file to a non-empty filename, the client will ask for this file to be passed through TFTP. For this reason you need to install a TFTP server.
A.5 BIOS Configuration
The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add to the list of boot devices "MLNX FlexBoot <ver>" for a ConnectX device or "gPXE" for an InfiniHost III device. The priority of this list can be modified through BIOS setup.
A.6 Operation
A.6.1 Prerequisites
- Make sure that your client is connected to the server(s).
- The FlexBoot image is already programmed on the adapter card (see Section A.2).
- Start the Subnet Manager as described in Section A.3.
- The DHCP server should be configured and started s
27. Directed route from the local node (which is the source) and the destination node
The minimal number of packets to be sent across each link (default = 100)
Enable verbose mode
Specifies the topology file name
Specifies the local system name. Meaningful only if a topology file is specified
Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system)
Specifies the local device's port number used to connect to the IB fabric
Specifies the directory where the output files will be placed (default = /tmp)
-lw <1x|4x|12x>: Specifies the expected link width
-ls <2.5|5|10>: Specifies the expected link speed
-pm: Dump all the fabric links' PM counters into ibdiagnet.pm
-pc: Reset all the fabric links' PM counters
-P <PM=<Trash>>: If any of the provided PM counters is greater than its provided value, print it to screen
-h | --help: Prints the help page information
-V | --version: Prints the version of the tool
--vars: Prints the tool's environment variables and their values
14.5.2 Output Files
Table 9 - ibdiagpath Output Files
Output File | Description
ibdiagpath.log | A dump of all the application reports generated according to the provided flags
ibdiagnet.pm | A dump of the Performance Counters values of the fabric links
14.5
28. onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging in and running commands on remote computers and/or servers.
10.2.1 SSH Configuration
The following steps describe how to configure password-less access over SSH:
Step 1. Generate an ssh key on the initiator machine (host1).
    host1$ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/<username>/.ssh/id_rsa.
    Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
    The key fingerprint is:
    38:1b:29:d:4:08:00:4a:0e:50:0:05:44:e7:9:05 username@host1
Step 2. Check that the public and private keys have been generated.
    host1$ cd /home/<username>/.ssh/
    host1$ ls -la
    total 40
    drwx------ 2 root root 4096 Mar 5 04:57 .
    drwxr-x--- 13 root root 4096 Mar 4 18:27 ..
    -rw------- 1 root root 1675 Mar 5 04:57 id_rsa
    -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub
Step 3. Check the public key.
    host1$ cat id_rsa.pub
    ssh-rsa AAAAB3NzaClyc2EAAAABIWAAAQEA1zVY8VBHOh9O0kZN7OAlibUQ74RXm4zHeczyVxpYHaDPyDmqezbYMKrCIVzd1l0b H ZkCOrpLYviUOoUHd3fvNTfMs0gcGg08PysUfM 12FyYjira2Pl1xyg6mkHLGGqVutfEMmABZ3wNCUg6J2X3G uiuS WXeubZmbXcMrP w4IWByfH8ajwo6A
29. where x is the OS-assigned interface number.
To check driver and device information, run:
    > ethtool -i eth<n>
Example:
    > ethtool -i eth2
    driver: mlx4_en (MT_04A0140005)
    version: 1.5.1 (March 2010)
    firmware-version: 2.7.000
    bus-info: 0000:13:00.0
To query stateless offload status, run:
    > ethtool -k eth<x>
To set stateless offload status, run:
    > ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
To query interrupt coalescing settings, run:
    > ethtool -c eth<x>
By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.
To enable or disable adaptive interrupt moderation, use the following command:
    > ethtool -C eth<x> adaptive-rx on|off
Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.
To set the values for packet rate limits and for moderation time high and low values, use the following command:
    > ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
To set interrupt coalescing settings when adaptive moderation is disabled, use:
    > ethtool -C eth<x> [rx-usecs N] [rx-frames N]
Note: usec settings correspond to the time to wait after the last packet is sent/received befor
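The clamping rule for adaptive moderation can be sketched as follows. This is a minimal illustration and not the driver's code: the function name and the sample numbers are hypothetical, and the linear interpolation between the two limits is an assumption, since the text only states what happens above the upper limit and below the lower limit.

```python
def moderation_time(pkt_rate, pkt_rate_low, pkt_rate_high,
                    rx_usecs_low, rx_usecs_high):
    """Pick a receive moderation time (usec) for the observed packet rate."""
    # Below the lower packet-rate limit: use the lowest moderation time.
    if pkt_rate <= pkt_rate_low:
        return rx_usecs_low
    # Above the upper packet-rate limit: use the highest moderation time.
    if pkt_rate >= pkt_rate_high:
        return rx_usecs_high
    # ASSUMPTION: between the limits, interpolate linearly.
    span = pkt_rate_high - pkt_rate_low
    return rx_usecs_low + (pkt_rate - pkt_rate_low) * (rx_usecs_high - rx_usecs_low) / span


if __name__ == "__main__":
    # Hypothetical limits, chosen only to exercise the three branches.
    for rate in (100, 5500, 20000):
        print(rate, moderation_time(rate, 1000, 10000, 0, 128))
```

The four limit arguments correspond to the pkt-rate-low, pkt-rate-high, rx-usecs-low and rx-usecs-high values set with ethtool -C.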
30. you will be using the mlxburn tool to create and burn a composite image (from an adapter device's firmware and the PXE ROM image) onto the same Flash device of the adapter.
Image Burning Procedure
To burn the composite image, perform the following steps:
1. Obtain the MST device name. Run:
    # mst start
    # mst status
The device name will be of the form: mt<dev_id>_pci{_cr0|conf0}. [1]
2. Create and burn the composite image. Run:
    mlxburn -d <mst device name> -fw <FW .mlx file> -conf <.ini file> -exp_rom <expansion ROM image>
Example on Linux:
    mlxburn -dev /dev/mst/mt25448_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MNEH28-XTC.ini -exp_rom ConnectX_EN_25448_ROM-X_X_XXX.rom
Example on Windows:
    mlxburn -dev mt25448_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MNEH28-XTC.ini -exp_rom ConnectX_EN_25448_ROM-X_X_XXX.rom
B.3 Preparing the DHCP Server in Linux Environment
The DHCP server plays a major role in the boot process by assigning IP addresses to ConnectX EN PXE clients and instructing the clients where to boot from.
When the ConnectX EN PXE boot session starts, the PXE firmware attempts to bring up a ConnectX network link (port). If it succeeds in bringing up a connected link, the PXE firmware communicates with the DHCP server. The DHCP server assigns an IP address to the PXE client and provides it with the location of the boot program.
[1] Depending on the OS, the device name may be sup
31. 0 255 The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 0 will give a PKey with the value 0x8000.
Step 2. Create a child interface by running:
    host1$ echo <PKey> > /sys/class/net/<IB subinterface>/create_child
Example:
    host1$ echo 0 > /sys/class/net/ib0/create_child
This will create the interface ib0.8000.
Step 3. Verify the configuration of this interface by running:
    host1$ ifconfig <subinterface>.<subinterface PKey>
Using the example of Step 2:
    host1$ ifconfig ib0.8000
    ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
             BROADCAST MULTICAST MTU:2044 Metric:1
             RX packets:0 errors:0 dropped:0 overruns:0 frame:0
             TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
             collisions:0 txqueuelen:128
             RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, follow the manual configuration procedure described in Section 4.3.3.
Step 5. To be able to use this interface, the Subnet Manager must be configured so that the chosen PKey, which defines a broadcast address, is recognized (see Chapter 12, OpenSM Subnet Manager).
4.4.2 Removing a Subinterface
To remove a child interface (subinterface), run:
    echo <subinterface PKey> > /sys/class/net/<ib_int
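The PKey arithmetic from Step 1 and the interface name from Step 2 can be checked with a short sketch. The function names are hypothetical, and the accepted input range (15 bits, since the most significant of the 16 bits is forced on) is an assumption rather than something stated in this excerpt.

```python
def full_pkey(value):
    """Return the 16-bit PKey actually used: the given value with the MSB set."""
    # ASSUMPTION: the chosen value must fit in the lower 15 bits.
    if not 0 <= value <= 0x7FFF:
        raise ValueError("PKey value must fit in 15 bits")
    return value | 0x8000


def child_interface_name(parent, value):
    """Name of the child interface created for this PKey, e.g. ib0.8000."""
    return f"{parent}.{full_pkey(value):04x}"


if __name__ == "__main__":
    print(hex(full_pkey(0)))               # the value 0 maps to 0x8000
    print(child_interface_name("ib0", 0))  # matches the interface created in Step 2
```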
32. 12.5.4 Fat-tree Routing Algorithm
12.5.5 LASH Routing Algorithm
12.5.6 DOR Routing Algorithm
12.5.7 Routing References
12.5.8 Modular Routing Engine
12.6 Quality of Service Management in OpenSM
12.6.1 Overview
12.6.2 Advanced QoS Policy File
12.6.3 Simple QoS Policy Definition
12.6.4 Policy File Syntax Guidelines
12.6.5 Examples of Advanced Policy File
12.6.6 Simple QoS Policy - Details and Examples
12.6.6.1 IPoIB
12.6.6.2 SDP
12.6.6.3 RDS
12.6.6.4 SRP
12.6.6.5 MPI
12.6.7 SL2VL Mapping and VL Arbitration
12.6.8 Deployment Example
12.7 QoS Configuration Examples
12.7.1 Typical HPC Example: MPI and Lustre
12.7.2 EDC SOA (2-tier): IPoIB and SRP
12.7.3 EDC (3-tier): IPoIB, RDS, SRP
Chapter 13 Adaptive Routing
13.1 Overview
13.2 Running OpenSM With AR Manager
13.2.1 AR Configuration File Example
Chapter 14 InfiniBand Fabric Diagnostic Utilities
14.1 Overview
14.2 Utilities Usage
14.2.1 Common Configuration, Interface and Addressing
14.2.2 IB Interface Definition
14.2.3 Addressing
14.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
14.3.1 SYNOPSYS
14.3.2 Output Files
14.3.3 Return Codes
14.4 ibdiagnet (of ibutils) - IB Net Diagnostic
14.4.1 SYNOPSYS
14.4.2 Output Files
14.4.3 ERROR CODES
14.5 ibdiagpath - IB diagnostic path
14.5.1 SYNOPSYS
14
33. 3 ERROR CODES
1 - The path traced is unhealthy
2 - Failed to parse command line options
3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
4 - Unable to traverse the LFT data from source to destination
5 - Failed to use Topology File
6 - Failed to load required Package
14.6 ibv_devices
Applicable Hardware: All InfiniBand devices.
Description: Lists InfiniBand devices available for use from userspace, including node GUIDs.
Synopsis:
    ibv_devices
Examples:
1. List the names of all available InfiniBand devices.
    > ibv_devices
    device     node GUID
    ------     ----------------
    mthca0     0002c9000101d150
    mlx4_0     0000000000073895
14.7 ibv_devinfo
Applicable Hardware: All InfiniBand devices.
Description: Queries InfiniBand devices and prints the information about them that is available for use from userspace.
Synopsis:
    ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]
Table 10 lists the various flags of the command.
Table 10 - ibv_devinfo Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-d <device>, --ib-dev=<device> | Optional | First found device | Run the command for the provided IB device 'device'
34. 5 Glossary
The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.
Channel Adapter (CA), Host Channel Adapter (HCA)
    An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).
HCA Card
    A network adapter card based on an InfiniBand channel adapter device.
IB Devices
    Integrated circuit implementing InfiniBand-compliant communication.
IB Cluster/Fabric/Subnet
    A set of IB devices connected by IB cables.
In-Band
    A term assigned to administration activities traversing the IB connectivity only.
LID
    An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
Local Device/Node/System
    The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG tools.
Local Port
    The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager
    The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables
    A table that exists in every switch providing the list of ports to which received multicast packets are forwarded. The table is organized by MLID.
Network Interface Card
35. 5.2 Output Files
14.5.3 ERROR CODES
14.6 ibv_devices
14.7 ibv_devinfo
14.8 ibstatus
14.9 ibportstate
14.10 ibroute
14.11 smpquery
14.12 perfquery
14.13 ibcheckerrs
14.14 mstflint
14.15 ibv_asyncwatch
14.16 ibdump
Appendix A Mellanox FlexBoot
A.1 Overview
A.2 Burning the Expansion ROM Image
A.3 Subnet Manager - OpenSM
A.4 TFTP Server
A.5 BIOS Configuration
A.6 Operation
A.7 Command Line Interface (CLI)
A.8 Diskless Machines
A.9 iSCSI Boot
A.10 WinPE
Appendix B ConnectX EN PXE
B.1 Overview
B.2 Burning the PXE ROM Image
B.3 Preparing the DHCP Server in Linux Environment
B.4 TFTP Server
B.5 BIOS Configuration
B.6 Operation
B.7 Command Line Interface (CLI)
B.8 Diskless Machines
B.9 iSCSI Boot
B.10 iSCSI Boot Example of SLES 10 SP2 OS
B.11 Windows 2008 iSCSI Boot
B.12 WinPE
Appendix C Performance Troubleshooting
C.1 PCI Express Performance Troubleshooting
C.2 InfiniBand Performance Troubleshooting
C.3 System Performance Troubleshooting
Appendix D ULP Performance Tuning
D.1 IPoIB Performance Tuning
D.2 Ethernet Performance Tuning
D.3 MPI Performance Tuning
Appendix E SRP Target Driver
E.1 Prerequisites and Installation
E.2 How to run
E.3 How to Unload/Shut
36. Burning a firmware binary image using mstflint that is already installed on your machine. Please refer to MSTFLINT_README.txt under docs/.
b. Burning a firmware image from a .mlx file using the mlxburn utility that is already installed on your machine.
The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2.
    host1$ mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw /mnt/firmware/fw-25408/fw-25408-rel.mlx
Warning: Make sure that you have the correct device name, firmware path, and firmware file name before running this command. For help, please refer to the Mellanox Firmware Tools (MFT) User's Manual under /mnt/docs/.
Step 3. Reboot your machine after the firmware burning is completed.
2.5 Uninstalling Mellanox OFED
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.
3 Working With VPI
VPI allows ConnectX ports to be independently configured as either IB or Eth. If a ConnectX port is configured as Eth, it may also function as a Fibre Channel HBA.
3.1 Port Type Management
ConnectX ports can be individually configured to work as InfiniBand, Ethernet, or Fibre Channel over Ethernet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx
37. [Figure: YaST Boot Loader Settings, Section Editor for the section "SUSE Linux Enterprise Server 10 SP2". The Optional Kernel Command Line Parameter field reads "resume=/dev/sda1 splash=silent showopts ibft_mode=off"; Kernel Image: /boot/vmlinuz; Initial RAM Disk: /boot/initrd; Root Device: /dev/sda2; Vga Mode: 0x332.]
Step 17. If you wish to change additional settings, click the appropriate item, perform the changes, and click Accept when done.
Step 18. In the Confirm Installation window, click Install to start the installation. (See image below.)
[Figure: YaST Installation Settings window with the Confirm Installation dialog.]
38. Flash
-dual_image | burn | Make the burn process burn two images on Flash. The current default fail-safe burn process burns a single image (in alternating locations).
-v | | Print version info
Table 18 - mstflint Commands
Command | Description
b[urn] | Burn Flash
q[uery] | Query miscellaneous Flash/firmware characteristics
v[erify] | Verify the entire Flash
bb | Burn Block: Burn the given image as is, without running any checks
sg | Set GUIDs
ri <out-file> | Read the firmware image on the Flash into the specified file
dc <out-file> | Dump Configuration: Print a firmware configuration file for the given image to the specified output file
e[rase] <addr> | Erase sector
rw <addr> | Read one DWORD from Flash
ww <addr> <data> | Write one DWORD to Flash
wwne <addr> | Write one DWORD to Flash without sector erase
wbne <addr> <size> <data ...> | Write a data block to Flash without sector erase
rb <addr> <size> [out-file] | Read a data block from Flash
swreset | SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.
Possible command return values are:
0 - successful completion
1 - error has occurred
7 - the burn command was aborted because firmware is current
Examples:
1. Find Mel
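A wrapper script that calls mstflint might interpret the three documented return values along these lines. This is only a sketch: the dictionary and function names are hypothetical, and treating value 7 as "nothing to do" rather than as a failure is an assumption about how a caller would want to react, not something the table prescribes.

```python
# Return values as listed for mstflint commands.
MSTFLINT_RETURN_VALUES = {
    0: "successful completion",
    1: "error has occurred",
    7: "the burn command was aborted because firmware is current",
}


def describe_return_value(code):
    """Map an mstflint exit status to the documented meaning."""
    return MSTFLINT_RETURN_VALUES.get(code, f"undocumented return value {code}")


def burn_needs_attention(code):
    """True when a burn did not end in a state the caller can accept.

    ASSUMPTION: 7 (firmware already current) is acceptable to the caller.
    """
    return code not in (0, 7)


if __name__ == "__main__":
    for status in (0, 1, 7, 42):
        print(status, describe_return_value(status), burn_needs_attention(status))
```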
39. (IBA 7.6.9), template.
SL2VL: SL2VL Mapping table (IBA 7.6.6), template. It is a list of VLs corresponding to SLs 0-15. (Note that VL15 used here means drop this SL.)
There are separate QoS configuration parameter sets for various target types: CAs, routers, switch external ports, and switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:
- qos_ca_: QoS configuration parameters set for CAs
- qos_rtr_: parameters set for routers
- qos_sw0_: parameters set for switches' port 0
- qos_swe_: parameters set for switches' external ports
Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):
    qos_ca_max_vls 15
    qos_ca_high_limit 0
    qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 0
    qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0
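To make the two list formats concrete, the sketch below parses a VL arbitration string into VL/Weight pairs and an SL2VL string into a per-SL list. It assumes the comma-separated "VL:Weight" and comma-separated VL syntax of the OpenSM options file; the function names are hypothetical, and the weight is not range-checked because its upper bound is cut off in this excerpt.

```python
def parse_vlarb(text):
    """Parse 'VL:Weight,VL:Weight,...' into a list of (vl, weight) tuples."""
    pairs = []
    for entry in text.split(","):
        vl_str, weight_str = entry.strip().split(":")
        vl, weight = int(vl_str), int(weight_str)
        # VL numbers in an arbitration table range from 0 to 14.
        if not 0 <= vl <= 14:
            raise ValueError(f"VL {vl} is outside 0-14")
        pairs.append((vl, weight))
    return pairs


def parse_sl2vl(text):
    """Parse an SL2VL list; index i holds the VL for SL i (VL15 means drop)."""
    vls = [int(item) for item in text.split(",")]
    # One entry for each of SLs 0-15.
    if len(vls) != 16:
        raise ValueError("SL2VL table needs one VL for each of SLs 0-15")
    return vls


if __name__ == "__main__":
    print(parse_vlarb("0:4,1:0,2:0"))
    print(parse_sl2vl("0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7"))
```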
40. MLNX_OFED_LINUX-1.5.1-sles11/firmware/fw-25408/2_7_000/fw-25408-rel.mlx -dev_type 25408 -no
    -I- Querying device ...
    -I- Using auto detected configuration file: /tmp/MLNX_OFED_LINUX-1.5.1/MLNX_OFED_LINUX-1.5.1-sles11/firmware/fw-25408/2_7_000/MHGH28-XTC_A4-A7.ini (PSID = MT_04A0140005)
    -I- Generating image ...
    Current FW version on flash: 2.6.0
    New FW version: 2.7.0
    Burning FW image without signatures - OK
    Restoring signature - OK
    -I- Image burn completed successfully.
    Please reboot your system for the changes to take effect.
    warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
Note: In case your machine has the latest firmware, no firmware update will occur and the installation script will print, at the end of installation, a message similar to the following:
    Installation finished successfully.
    The firmware version 2.7.000 is up to date.
Note: To force a firmware update, use the '--force-fw-update' flag.
Note: In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Error message:
    -I- Querying device ...
    -E- Can't auto detect fw configuration file: ...
Step 4. In case the installation script performed firmware updates to your network adapter hardware, it will ask you to reboot your machine.
41. Open-FCoE
The FCoE feature is based on and interacts with the Open-FCoE project. Mellanox OFED includes the following open-fcoe.org modules: libfc and fcoe. See Section 3.4, Fibre Channel over Ethernet.
1.4.5 ULPs
IPoIB
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB prepends an encapsulation header to the IP datagrams and sends the outcome over the InfiniBand transport service. The transport service is Reliable Connected (RC) by default, but it may also be configured to be Unreliable Datagram (UD). The interface supports unicast, multicast and broadcast. For details, see Chapter 4, IPoIB.
RoCE
RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over Ethernet networks. It encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type.
RDS
Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. For more details, see Chapter 6, RDS.
SDP
Sockets Direct Protocol (SDP) is a byte-stream transport protocol that provides TCP stream semantics. SDP utilizes InfiniBand's advanced protocol offload capabilities. Because of this, SDP can have lower CPU and memory bandwidth utili
42. Preparing... [100%]
    1:dapl-devel-static [100%]
    Preparing... [100%]
    1:dapl-utils [100%]
    Preparing... [100%]
    1:perftest [100%]
    Preparing... [100%]
    1:mstflint [100%]
    Preparing... [100%]
    1:sdpnetstat [100%]
    Preparing... [100%]
    1:srptools [100%]
    Preparing... [100%]
    1:rds-tools [100%]
    Preparing... [100%]
    1:nfs-utils [100%]
    Preparing... [100%]
    1:ibutils [100%]
    Preparing... [100%]
    1:ibutils2 [100%]
    Preparing... [100%]
    1:ibdump [100%]
    Preparing... [100%]
    1:infiniband-diags [100%]
    Preparing... [100%]
    1:qperf [100%]
    Preparing... [100%]
    1:mlnxofed-docs [100%]
    Preparing... [100%]
    1:mvapich_gcc [100%]
    Preparing... [100%]
    1:mvapich_pgi [100%]
    Preparing... [100%]
    1:mvapich_intel [100%]
    Preparing... [100%]
    1:openmpi_gcc [100%]
    Preparing... [100%]
    1:openmpi_pgi [100%]
    Preparing... [100%]
    1:openmpi_intel [100%]
    Preparing... [100%]
    1:mpitests_mvapich_gcc [100%]
    Preparing... [100%]
    1:mpitests_mvapich_pgi [100%]
    Preparing... [100%]
    1:mpitests_mvapich_intel [100%]
    Preparing... [100%]
    1:mpitests_openmpi_gcc [100%]
    Preparing... [100%]
    1:mpitests_openmpi_pgi [100%]
    Preparing... [100%]
    1:mpitests_openmpi_intel [100%]
    Device (15b3:634a):
    02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
    Link Width: 8x
    Link Speed: 2.5Gb/s
    Installation finished successfully.
    Programming HCA firmware for /dev/mst/mt25418_pci_cr0 device
    Running: mlxburn -d /dev/mst/mt25418_pci_cr0 -fw /tmp/MLNX_OFED_LINUX-1.5.1
43. Step 15. Click Edit in the Boot Loader Settings window.
[Figure: YaST Boot Loader Settings window, Section Management tab, listing the Linux, Floppy and Failsafe sections for SUSE Linux Enterprise Server 10 SP2, with the Add, Edit, Delete and Set as Default buttons.]
Step 16. In the Optional Kernel Command Line Parameter field, append the following string to the end of the line: "ibft_mode=off" (include a space before the string). Click OK and then Finish to apply the change.
[Figure: YaST Section Editor dialog with its help pane (Section Name, Section Settings, Optional Kernel Command Line Parameter, Kernel Image, Initial RAM Disk).]
44. by all parameters
        qos-class: 7-9,11
        source: Virtual Servers
        destination: Storage
        service-id: 0x0000000000010000-0x000000000001FFFF
        pkey: 0x0F00-0x0FFF
        qos-level-name: WholeSet
    end-qos-match-rule
    end-qos-match-rules
12.6.6 Simple QoS Policy - Details and Examples
Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.
Match rules include:
- Default match rule that is applied to a PR/MPR query that didn't match any of the other match rules
- SDP
- SDP application with a specific target TCP/IP port range
- SRP with a specific target IB port GUID
- RDS
- IPoIB with a default PKey
- IPoIB with a specific PKey
- Any ULP/application with a specific Service ID in the PR/MPR query
- Any ULP/application with a specific PKey in the PR/MPR query
- Any ULP/application with a specific target IB port GUID in the PR/MPR query
Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file.
The shortest policy file in this case would be as follows:
    qos-ulps
        default : 0 # default SL
45. by the DHCP server.
B.7 Command Line Interface (CLI)
B.7.1 Invoking the CLI
When the boot process begins, the computer starts its Power On Self Test (POST) sequence. Shortly after completion of the POST, the user will be prompted to press CTRL-B to invoke the Mellanox ConnectX EN PXE CLI. The user has a few seconds to press CTRL-B before the message disappears (see figure).
[Figure: boot screen of Mellanox ConnectX EN PXE v1.5.5 showing the prompt "Press Ctrl-B to configure MLNX NIC 1.5.5 (PCI 02:00.0)".]
Alternatively, you may skip invoking the CLI right after POST and invoke it instead right after ConnectX EN PXE starts booting.
Once the CLI is invoked, you will see the following prompt:
    gPXE>
B.7.2 Operation
The CLI resembles a Linux shell, where the user can run commands to configure and manage one or more PXE port network interfaces. Each port is assigned a network interface called net<i>, where i is 0, 1, 2, ..., <# of interfaces>. Some commands are general and are applied to all network interfaces. Other commands are port specific; therefore the relevant network interface is specified in the command.
B.7.3 Command Reference
B.7.3.1 ifstat
Displays the available network interfaces (in a similar manner to Linux's ifconfig).
[Figure: sample ifstat output listing a network interface with its PCI location, open/closed state, TX/RX counters and link status.]
46. bytes: 4194304
    # MPI_Datatype : MPI_BYTE
    # MPI_Datatype for reductions : MPI_FLOAT
    # MPI_Op : MPI_SUM
    #
    # List of Benchmarks to run:
    # PingPong
    # PingPing
    # Sendrecv
    # Exchange
    # Allreduce
    # Reduce
    # Reduce_scatter
    # Allgather
    # Allgatherv
    # Alltoall
    # Alltoallv
    # Bcast
    # Barrier
    #---------------------------------------------------
    # Benchmarking PingPong
    # #processes = 2
    #---------------------------------------------------
    #bytes #repetitions t[usec] Mbytes/sec
    0 1000 1.47 0.00
    1 1000 sb 0.61
    2 1000 1.56 1.22
    4 1000 03 2.49
    8 1000 1.55 4.92
    16 1000 1.60 O2
    32 1000 1.62 18.86
    64 1000 1.61 37.90
    128 1000 1.80 67.65
    256 1000 2.05 119.26
    512 1000 2.67 183.08
    1024 1000 3.74 260.97
    2048 1000 6.15 317.84
    4096 1000 10.15 384.74
    8192 1000 X2 115 612.84
    16384 1000 18.47 845.85
    32768 1000 30.84 1013.28
    65536 640 48.88 1278.77
    131072 320 86.36 1447.43
    262144 160 163.91 1525.26
    524288 80 335.82 1488.90
    1048576 40 726.25 1376.94
    2097152 20 1786.35 1119.60
    4194304 10 4253.59 940.38
    ------- OUTPUT TRUNCATED -------
11 Quality of Service
11.1 Overview
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
Figure 2: I/O Consolidation Over InfiniBand
[Figure: servers and an administrator attached to a QoS-enabled IB fabric with an IB-Ethernet gateway and an IB-Fibre Channel gateway.]
QoS over M
47. can be pinned by a user space application. See Step 5.
- Man pages will be installed under /usr/share/man/.
Firmware
- The firmware of existing network adapter devices will be updated if the following two conditions are fulfilled:
  1. You run the installation script in default mode; that is, without the option '--without-fw-update'.
  2. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image.
  Note: If an adapter's Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.
- In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
  Error message:
    -I- Querying device ...
    -E- Can't auto detect fw configuration file: ...
2.3.5 Post-installation Notes
- Most of the Mellanox OFED components can be configured or reconfigured after the installation by modifying the relevant configuration files. See the relevant chapters in this manual for details.
- The list of the modules that will be loaded automatically upon boot can be found in the /etc/infiniband/openib.conf file.
2.4 Updating Firmware After Installation
In case you ran the mlnxofedinstall script with the
48. correctly.
Note: If a fatal, non-recoverable error occurs, opensm exits.
12.2.4.1 Running OpenSM As Daemon
OpenSM can also run as a daemon. To run OpenSM in this mode, enter:
    host1# /etc/init.d/opensmd start
12.3 osmtest Description
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it to a pre-saved one. See Section 12.3.2.
osmtest has the following test flows:
- Multicast Compliancy test
- Event Forwarding test
- Service Record registration test
- RMPP stress test
- Small SA Queries stress test
12.3.1 Syntax
    osmtest [OPTIONS]
where OPTIONS are:
-f, --flow
    This option directs osmtest to run a specific flow:
    Flow | Description
    c | create an inventory file with all nodes, ports and paths
    a | run all validation tests (expecting an input inventory)
    v | only validate the given inventory file
    s | run service registration, deregistration, and lease test
    e | run event forwarding test
    f | flood the SA with queries according to the stress mode
    m | multicast flow
    q | QoS info: dump VLArb and SLtoVL tables
    t | run trap 64/65 flow (this flow requires running of ext
49. disconnecting
    disconnected
    test complete
    return status 0
On Client:
    ucmatose -s 20.4.3.219
    cmatose: starting client
    cmatose: connecting
    receiving data transfers
    sending replies
    data transfers complete
    test complete
    return status 0
This server-client run is without PCP or VLAN because the IP address used does not belong to a VLAN interface. If you specify a VLAN IP address, then traffic should go over VLAN.
Type Of Service (TOS)
The TOS field for rdma_cm sockets can be set using the rdma_set_option() API, just as it is set for regular sockets. If the user does not set a TOS, the default value (0) will be used. Within the rdma_cm kernel driver, the TOS field is converted into an SL field. The conversion formula is as follows:
    SL = TOS >> 5 (i.e., take the 3 most significant bits of the TOS field)
In the hardware driver, the SL field is converted into PCP by the following formula:
    PCP = SL & 7 (take the 3 least significant bits of the SL field)
Note: SL affects the PCP only when the traffic goes over tagged VLAN frames.
6 RDS
6.1 Overview
Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. RDS is intended for use with Oracle RAC 11g.
For programming details, enter:
    host1$ man rds
6.2 RDS Configuration
The RDS ULP is installed as part
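The two conversion formulas from the Type Of Service discussion can be traced with a small sketch. The function names are hypothetical, and masking the TOS to 8 bits before shifting is an assumption made here so that out-of-range inputs cannot produce an SL above 7.

```python
def tos_to_sl(tos):
    """SL = TOS >> 5: keep the 3 most significant bits of the 8-bit TOS."""
    # ASSUMPTION: treat TOS as an 8-bit field.
    return (tos & 0xFF) >> 5


def sl_to_pcp(sl):
    """PCP = SL & 7: keep the 3 least significant bits of the SL."""
    return sl & 7


if __name__ == "__main__":
    for tos in (0, 32, 96, 224):
        sl = tos_to_sl(tos)
        print(f"TOS={tos:3d} -> SL={sl} -> PCP={sl_to_pcp(sl)}")
```

A TOS of 0, the default when the application sets none, maps to SL 0 and PCP 0.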
50. distribution or another vendor's commercial stack. Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel). Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware (see footnote 1). 2.3.1 Pre-installation Notes: The installation script removes all previously installed Mellanox OFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages. Note: Pre-existing configuration files will be saved with the extension ".conf.saverpm". If you need to install Mellanox OFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh). If your kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory. Usage: mlnx_add_kernel_support.sh -i|--iso <mlnx iso> [-t|--tmpdir <local work dir>] [-v|--verbose]. Footnote 1: The firmware will not be updated if you run the install script with the --without-fw-update option. Example: The following command will create a MLNX_OFED_LINUX ISO image for RedHat 5.4 under the /tmp directory: MLNX O
51. documentation for additional information about configuring IP addresses. The following code lines are an excerpt from a sample IPoIB configuration file: # Static settings; all values provided by this file: IPADDR_ib0=11.4.3.175 NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.0.0 BROADCAST_ib0=11.4.255.255 ONBOOT_ib0=1. # Based on eth0; each '*' will be replaced with a corresponding octet from eth0: LAN_INTERFACE_ib0=eth0 IPADDR_ib0=11.4.'*'.'*' NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.0.0 BROADCAST_ib0=11.4.255.255 ONBOOT_ib0=1. # Based on the first eth<n> interface that is found (for n=0,1,...); each '*' will be replaced with a corresponding octet from eth<n>: LAN_INTERFACE_ib0= IPADDR_ib0=11.4.'*'.'*' NETMASK_ib0=255.255.0.0 NETWORK_ib0=11.4.0.0 BROADCAST_ib0=11.4.255.255 ONBOOT_ib0=1. 4.3.3 Manually Configuring IPoIB: To manually configure IPoIB for the default IB partition (VLAN), perform the following steps. Note: This manual configuration persists only until the next reboot or driver restart. Step 1: To configure the interface, enter the ifconfig command with the following items: the appropriate IB interface (ib0, ib1, etc.); the IP address that you want to assign to the interface; the netmask keyword; the subnet mask that you want to assign to the interface. The following example shows how to configure an IB int
52. Step 9: Select the appropriate Region and Time Zone in the Clock and Time Zone window, then click Finish. (Figure: Clock and Time Zone window, with Region and Time Zone lists and the Hardware Clock Set To option.) Step 10: In the Installation Settings window, click Partitioning to get the Suggested Partitioning window. (Figure: Installation Settings window.)
53. for higher debug levels (-ddd or -d -d -d). -e(rr_show): Optional. Show send and receive errors (timeouts and others). -v(erbose): Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v). -V(ersion): Optional. Show version info. -D(irect): Optional. Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' = self port; '0,1,2,1,4' = out via port 1, then 2, and so on. -G(uid): Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'. -s <smlid>: Optional. Use <smlid> as the target LID for SM/SA queries. -C <ca_name>: Optional. Use the specified channel adapter or router. -P <ca_port>: Optional. Use the specified port. -t <timeout_ms>: Optional. Override the default timeout for the solicited MADs (msec). <dest dr_path | lid | guid>: Optional. Destination's directed path, LID, or GUID. <portnum>: Optional. Destination's port number. <op> [<value>]: Optional; default is query. Define the allowed port operations: enable, disable, reset, speed, and query. In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria: 1. The first ACTIVE port that is found. 2. If not found, the first port that is UP (physical link state is LinkUp).
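The two port-selection criteria at the end of excerpt 53 amount to a two-pass search. The sketch below illustrates that rule only; the Port record and its field names are hypothetical and do not come from the utility's source.

```python
from typing import NamedTuple, Optional, Sequence


class Port(NamedTuple):
    """Hypothetical record for one local port (illustration only)."""
    ca_name: str
    port_num: int
    state: str        # logical state, e.g. "ACTIVE", "INIT", "DOWN"
    phys_state: str   # physical link state, e.g. "LinkUp", "Polling"


def choose_port(ports: Sequence[Port]) -> Optional[Port]:
    """Pick a port when neither a CA nor a port was specified."""
    # Criterion 1: the first ACTIVE port that is found.
    for port in ports:
        if port.state == "ACTIVE":
            return port
    # Criterion 2: otherwise, the first port whose physical link is up.
    for port in ports:
        if port.phys_state == "LinkUp":
            return port
    return None
```

For example, given one port in INIT with LinkUp followed by one ACTIVE port, the ACTIVE port is chosen even though it is listed second.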
54. is designed to replace the original fcoe module and to allow using ConnectX hardware offloads. Mellanox OFED also includes the following open-fcoe.org modules: libfc: Used by the mlx4_fc module to handle FC logic such as fabric login and logout, remote port login and logout, fc-ns transactions, etc. fcoe: Implements FCoE fully in software. Will load instead of mlx4_fc to support T11 frame format. Works on top of standard Ethernet NICs, including mlx4_en. See http://www.open-fcoe.org for further information on the Open-FCoE project. 3.4.2 Installation: To install the FCoE feature, run the mlnxofedinstall script (described in Section 2.3) with the --with-fc option. 3.4.3 FCoE Basic Usage: After loading the driver, userspace operations should create/destroy vHBAs on the required Ethernet interfaces. This can be done manually by issuing commands to the driver using simple sysfs operations. Alternatively, it can be handled automatically by the dcbxd daemon if the interface is connected to an FCoE switch supporting DCBX negotiation of the FCoE feature (e.g., Cisco Nexus). Once a vHBA is instantiated on an Ethernet interface, it immediately attempts to log into the FC fabric. Provided that the FC fabric and FC targets are well configured, LUNs will map to SCSI disk devices (/dev/sdXXX). vHBAs instantiated automatically by the dcbxd daemon are creat
55. (Installer progress output: each package is listed with "Preparing..." progress lines.) Packages in this excerpt: librdmacm-utils, librdmacm-devel, libsdp, libsdp-devel, opensm-libs, opensm, opensm-devel, opensm-static, compat-dapl, compat-dapl-devel, dapl, dapl-devel, dapl-devel-static. After opensm is installed, the output shows the opensmd service runlevels: opensmd 0:off 1:off 2:off 3:off 4:off 5:off 6:off
56. <IPoIB-server-name-or-address>:/<export> /mnt. To verify that the mount is using RDMA, run "cat /proc/mounts" and check the "proto" field for the given mount. Congratulations! You're using NFS/RDMA! 10 MPI 10.1 Overview: Mellanox OFED for Linux includes the following MPI implementations over InfiniBand and RoCE: Open MPI, an open source MPI-2 implementation by the Open MPI Project; OSU MVAPICH, an MPI-1 implementation by Ohio State University. These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 6 lists some useful MPI links. Table 6 - Useful MPI Links: MPI Standard: http://www-unix.mcs.anl.gov/mpi; Open MPI: http://www.open-mpi.org; MVAPICH MPI: http://mvapich.cse.ohio-state.edu/; MPI Forum: http://www.mpi-forum.org. This chapter includes the following sections: Prerequisites for Running MPI; MPI Selector - Which MPI Runs; Compiling MPI Applications; OSU MVAPICH Performance; Open MPI Performance. 10.2 Prerequisites for Running MPI: For launching multiple MPI processes on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less
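The /proc/mounts check from excerpt 56 can be automated. The following is a minimal sketch, assuming the usual whitespace-separated /proc/mounts layout (device, mount point, file system type, options) and that an NFS/RDMA mount lists proto=rdma among its options; the function name and the sample text are hypothetical.

```python
from typing import Optional


def mount_proto(proc_mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the 'proto' option of the first entry for mount_point that has one."""
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        # Expected layout: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[1] != mount_point:
            continue
        for option in fields[3].split(","):
            key, _, value = option.partition("=")
            if key == "proto":
                return value
    return None


# Hypothetical sample content; on a live system, read /proc/mounts instead.
SAMPLE = (
    "rootfs / rootfs rw 0 0\n"
    "11.4.3.175:/export /mnt nfs rw,vers=3,proto=rdma,port=20049 0 0\n"
)

if __name__ == "__main__":
    print(mount_proto(SAMPLE, "/mnt"))
```

On a real host the same function could be fed the contents of /proc/mounts; a result other than "rdma" would suggest that the mount fell back to another transport.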
57. mlx4_en net device for getting notifications about the state of the port, as well as using the mlx4_en driver to resolve IP addresses to MAC addresses that are required for address vector creation. However, RoCE traffic does not go through the mlx4_en driver; it is completely offloaded by the hardware. Configure an IP Address on the mlx4_en Interface: Run the following on both sides of the link: ifconfig eth2 20.4.3.220; ifconfig eth2. Output: eth2 Link encap:Ethernet HWaddr 00:02:C9:08:E8:11 inet addr:20.4.3.220 Bcast:20.255.255.255 Mask:255.0.0.0 UP BROADCAST MULTICAST MTU:1500 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b). Make sure that ping is working: ping 20.4.3.219. Output: PING 20.4.3.219 (20.4.3.219) 56(84) bytes of data. 64 bytes from 20.4.3.219: icmp_seq=1 ttl=64 time=0.873 ms 64 bytes from 20.4.3.219: icmp_seq=2 ttl=64 time=0.198 ms 64 bytes from 20.4.3.219: icmp_seq=3 ttl=64 time=0.167 ms --- 20.4.3.219 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.167/0.412/0.873/0.326 ms. Inspecting the GID Table: cat /sys/class/infiniband/mlx4_0/ports/2/gids/0 returns fe80:0000:0000:0000:0202:c9ff:fe08:e811; cat /sys/class/infiniband/mlx4_0/ports/2/gids/1 returns 0000:0000:0000:0000:0000:0000:0000:0000. According to the output, we currently have one entry
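The GID 0 value shown in excerpt 57 is consistent with a link-local address built from the interface's MAC (HWaddr 00:02:C9:08:E8:11) using the modified EUI-64 scheme. The sketch below reproduces that derivation as an assumption about how the entry is formed; it is not taken from the driver source, and the function name is hypothetical.

```python
def mac_to_link_local_gid(mac: str) -> str:
    """Build an fe80::/64 GID from a 48-bit MAC via modified EUI-64."""
    octets = [int(part, 16) for part in mac.split(":")]
    if len(octets) != 6:
        raise ValueError("expected a MAC of the form xx:xx:xx:xx:xx:xx")
    # Flip the universal/local bit and insert ff:fe in the middle.
    eui64 = [octets[0] ^ 0x02] + octets[1:3] + [0xFF, 0xFE] + octets[3:]
    # Link-local prefix fe80:0000:0000:0000 followed by the interface ID.
    raw = [0xFE, 0x80, 0, 0, 0, 0, 0, 0] + eui64
    groups = ["%02x%02x" % (raw[i], raw[i + 1]) for i in range(0, 16, 2)]
    return ":".join(groups)


if __name__ == "__main__":
    print(mac_to_link_local_gid("00:02:C9:08:E8:11"))
```

With the MAC from the ifconfig output, the function yields the same 16-byte value that the manual reads from gids/0.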
58. net.core.wmem_max=16777216 net.ipv4.tcp_mem=16777216 16777216 16777216 net.ipv4.tcp_rmem=4096 87380 16777216 net.ipv4.tcp_wmem=4096 65536 16777216. D.3 MPI Performance Tuning: To optimize bandwidth and message rate running over MVAPICH, you can set tuning parameters either using the command line or in the configuration file /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf. Tuning Parameters in Configuration File: Edit the mvapich.conf file with the following lines: VIADEV_USE_COALESCE=1 VIADEV_COALESCE_THRESHOLD_SQ=1 VIADEV_PROGRESS_THRESHOLD=2 VIADEV_RENDEZVOUS_THRESHOLD=8192. Tuning Parameters via Command Line: The following command tunes MVAPICH parameters: host1$ /usr/mpi/gcc/mvapich-<mvapich-ver>/bin/mpirun_rsh -np 2 -hostfile /home/<username>/cluster VIADEV_USE_COALESCE=1 VIADEV_COALESCE_THRESHOLD_SQ=1 VIADEV_PROGRESS_THRESHOLD=2 VIADEV_RENDEZVOUS_THRESHOLD=8192 /usr/mpi/gcc/mvapich-<mvapich-ver>/tests/osu_benchmarks-<osu-ver>/osu_bw. The example assumes the following: a cluster of at least two nodes (example: host1, host2); a machine file that includes the list of machines (example: host1$ cat /home/<username>/cluster prints host1 and host2). Appendix E: SRP Target Driver: The SRP Target driver is designed to work dir
59. of Mellanox OFED for Linux. To load the RDS module upon boot, edit the file /etc/infiniband/openib.conf and set "RDS_LOAD=yes". Note: For the changes to take effect, run: /etc/init.d/openibd restart. 7.1 Overview: Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced protocol offload capabilities, SDP can provide lower latency, higher bandwidth, and lower CPU utilization than IPoIB or Ethernet running some sockets-based applications. SDP can be used by applications and improve their performance transparently (that is, without any recompilation). Since SDP has the same socket semantics as TCP, an existing application is able to run using SDP; the difference is that the application's TCP socket gets replaced with an SDP socket. It is also possible to configure the driver to automatically translate TCP to SDP based on the source IP/port, the destination, or the application name. See Section 7.5. The SDP protocol is composed of a kernel module that implements SDP as a new address-family/protocol-family, and a library (see Section 7.2) that is used for replacing the TCP address family with SDP according to a policy. This chapter includes the following sections: libsdp.so Library; Configu
60. Step 14: Select the Expert tab and click Booting. (Figure: Installation Settings window, Expert tab. Keyboard Layout: English (US). Partitioning: create swap partition /dev/sda1 (502.0 MB); create root partition /dev/sda2 (7.5 GB) with reiserfs. Add-On Products: none selected for installation. Software: SUSE Linux Enterprise Server 10 SP2 with Server Base System, KDE Desktop Environment for Server, C/C++ Compiler and Tools, X Window System; size of packages to install: 1.6 GB. Booting: boot loader type GRUB, location /dev/sda2.)
61. partitions managed by LVM. Do you want to change this? The table to the right shows the current partitions on all your hard disks. (Figure: Expert Partitioner window, followed by the Installation Settings window, Expert tab. Partitioning: create swap partition /dev/sda1 (502.0 MB); create root partition /dev/sda2 (7.5 GB) with reiserfs. Software: SUSE Linux Enterprise Server 10 SP2; size of packages to install: 1.6 GB. Booting: boot loader type GRUB, location /dev/sda2.) Step 15: Click Edit in the Boot Lo
62. portguid 0x0002c902002582cd sw136 HCA 1 5 valid lids dumped 3 Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2 gt sdo xguue 2 3 7 Unicast licks Wx3 Ox7 Qut evince Hie 2 ga 0 0 0002 6 90 E 1t 3E 8 VCE MT47396 Infiniscale III Mellanox Technologies use OUE Destination Poe Info 0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies 0x0006 007 Channel Adapter portguid 0x0002c90300001039 us valige GAS 0x0007 021 Channel Adapter portguid 0x0002c9020025874a Merl sy GATE 3 valid lids dumped 168 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 4 Dump all Lids with valid out ports of the switch with portguid 0x000b8cffff004016 ibroute G 0x000b8cffff004016 Unicast lids 0x0 0x8 of switch Lid 3 guid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies Lid Out Destination Port Miro 0x0002 023 Switch portguid 0x0002c902fffff00a MT47396 Infiniscale III Mellanox Technologies 0x0003 000 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale III Mellanox Technologies 0x0006 023 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1 0x0007 020 Channel Adapter portguid 0x0002c9020025874a Ewa MEE AS IN 0x0008 024 Channel Adapter portguid 0x0002c902002582cd sw136 HCA 1 5 valid lids dumped 5 Dump all non empty mlids of sw
63. Step 5: The script adds the following lines to /etc/security/limits.conf for the userspace components such as MPI: "* soft memlock unlimited" and "* hard memlock unlimited". These settings unlimit the amount of memory that can be pinned by a user space application. If desired, tune the value unlimited to a specific amount of RAM. Step 6: For your machine to be part of the InfiniBand/VPI fabric, a Subnet Manager must be running on one of the fabric nodes. At this point, Mellanox OFED for Linux has already installed the OpenSM Subnet Manager on your machine. For details on starting OpenSM, see Chapter 12, OpenSM - Subnet Manager. Step 7 (InfiniBand only): Run the hca_self_test.ofed utility to verify whether or not the InfiniBand link is up. The utility also checks for and displays additional information such as: HCA firmware version; kernel architecture; driver version; number of active HCA ports along with their states; Node GUID. Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/. host1# /usr/bin/hca_self_test.ofed ---- Performing InfiniBand HCA Self Test ---- Number of HCAs Detected: 1. PCI Device Check: PASS. Kernel Arch: x86_64. Host Driver Version: MLNX_OFED_LINUX-1.5.1 (OFED-1.5.1-mlnx9): 1.5.1-2.6.9_89.ELlargesmp. Host Driver RPM Check: PASS. HCA Firmware on HCA #0
64. server> ibv_rc_pingpong -g 1. On client> ibv_rc_pingpong -g 1 server. For rdma_cm applications, the user needs only to specify an IP address of a VLAN device for the traffic to go with the VLAN tagged frames. 5.7 A Detailed Example: This section provides a step-by-step example of using InfiniBand over Ethernet (RoCE). Installation and Driver Loading: The MLNX_OFED installation script installs RoCE as part of mlx4 and mlx4_en and other modules. See Section 2.3, Installing Mellanox OFED, for details on installation. Note: The list of the modules that will be loaded automatically upon boot can be found in the configuration file /etc/infiniband/openib.conf. Enter the following command to display the current run of MLNX_OFED: ibv_devinfo. Output: hca_id: mlx4_0 transport: InfiniBand (0) fw_ver: 2.7.700 node_guid: 0002:c903:0008:e810 sys_image_guid: 0002:c903:0008:e813 vendor_id: 0x02c9 vendor_part_id: 26428 hw_ver: 0xB0 board_id: MT_0DD0120009 phys_port_cnt: 2 port: 1 state: PORT_INIT (2) max_mtu: 2048 (4) active_mtu: 2048 (4) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: IB port: 2 state: PORT_ACTIVE (4) max_mtu: 2048 (4) active_mtu: 1024 (3) sm_lid: 0 port_lid: 0 port_lmc: 0x00 link_layer: Ethernet. Notes regarding the command output: 1. The InfiniBand port (port 1) is in PORT_INIT state, and the Ethernet port (port 2) is in PORT_ACTIVE state. Yo
65. socket was closed by remote host\n"); printf("read %zd bytes\n", nr); printf("end of test\n"); close(cd); close(sd); return 0; } 7.6 BZCopy - Zero Copy Send: BZCOPY mode is only effective for large block transfers. By setting the /sys parameter sdp_zcopy_thresh to a non-zero value, a non-standard SDP speedup is enabled. Messages longer than sdp_zcopy_thresh bytes in length cause the user space buffer to be pinned and the data to be sent directly from the original buffer. This results in less CPU usage and, on many systems, much higher bandwidth. Note that the default value of sdp_zcopy_thresh is 64KB, but it may be too low for some systems. You will need to experiment with your hardware to find the best value. 7.7 Using RDMA for Small Buffers: For smaller buffers, the overhead of preparing a user buffer to be RDMA'ed is too big; therefore, it is more efficient to use BCopy. Large buffers can also be sent using RDMA, which lowers CPU utilization. This mode is called "ZCopy combined" mode. The sendmsg syscall is blocked until the buffer is transferred to the socket's peer, and the data is copied directly from the user buffer at the source side to the user buffer at the sink side. To set the threshold, use the module parameter sdp_zcopy_thresh. This parameter can be accessed through sysfs
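The threshold rule in excerpt 65 can be written down as a small decision function. This is a sketch of the stated rule only (messages longer than sdp_zcopy_thresh use zero copy, and, according to the same manual, a threshold of 0 disables ZCopy); the function and constant names are hypothetical, and the real decision is made inside the ib_sdp kernel module.

```python
DEFAULT_ZCOPY_THRESH = 64 * 1024  # the manual's stated default of 64KB


def send_mode(msg_len: int, zcopy_thresh: int = DEFAULT_ZCOPY_THRESH) -> str:
    """Return 'zcopy' or 'bcopy' for a message of msg_len bytes."""
    if zcopy_thresh == 0:
        # A threshold of 0 turns the zero-copy path off entirely.
        return "bcopy"
    # Only messages strictly longer than the threshold are sent zero-copy.
    return "zcopy" if msg_len > zcopy_thresh else "bcopy"


if __name__ == "__main__":
    for size in (4096, 65536, 1 << 20):
        print(size, send_mode(size))
```

Since the manual notes that 64KB may be too low for some systems, a helper like this is mainly useful for reasoning about which message sizes in a benchmark will exercise each path.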
66. specified stress test instead of the normal test suite. Stress test options are as follows: -s1 = Single-MAD response SA queries; -s2 = Multi-MAD (RMPP) response SA queries; -s3 = Multi-MAD (RMPP) Path Record SA queries. Without -s, stress testing is not performed. -M, --Multicast_Mode: This option specifies the length of the Multicast test. -M1 = Short Multicast Flow (default), single mode; -M2 = Short Multicast Flow, multiple mode; -M3 = Long Multicast Flow, single mode; -M4 = Long Multicast Flow, multiple mode. Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC. Multiple mode: could be run with other apps using MC with OpenSM. Without -M, default flow testing is performed. -t, --timeout: This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds. -l, --log_file: This option defines the log to be the given file. By default, the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout. -v, --verbose: This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity. -V: This option sets the maximu
67. /sys/module/ib_sdp/parameters/sdp_zcopy_thresh. Setting it to 0 disables ZCopy. 7.8 Testing SDP Performance: This section describes how to verify SDP performance by running the Bandwidth (BW) test and the Latency test. These tests are described in detail at the following URL: http://www.netperf.org/netperf/training/Netperf.html. To verify SDP performance, perform the following steps: Step 1: Download Netperf from the following URL: http://www.netperf.org/netperf/NetperfPage.html. Step 2: Compile Netperf by following the instructions at http://www.netperf.org/netperf/NetperfPage.html. Step 3: Create a libsdp.conf configuration file: host1$ cat > $HOME/libsdp.conf << EOF > use sdp server * *:* > use sdp client * *:* > EOF. Step 4: Start the Netperf server such that you force SDP to be used instead of TCP: host1$ LD_PRELOAD=libsdp.so LIBSDP_CONFIG_FILE=$HOME/libsdp.conf netserver. Output: Starting netserver at port 12865 Starting netserver at hostname 0.0.0.0 port 12865 and family AF_UNSPEC host1$. Step 5: Run the Netperf client such that you force SDP to be used instead of TCP. The default test is the Bandwidth test: host2$ LD_PRELOAD=libsdp.so LIBSDP_CONFIG_FILE=$HOME/libsdp.conf netperf -H 11.4.17.6 -t TCP_STREAM -c -C -- -m 65536. Output: TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET Recv S
68. that level. Note: QoS in OpenSM is described in detail in Chapter 12. 12 OpenSM - Subnet Manager 12.1 Overview: OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15). 12.2 opensm - Description: opensm is an InfiniBand compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox OFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet. opensm also provides an experimental version of a performance manager. opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes. opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the ava
69. the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows: BIT / LOG LEVEL ENABLED: 0x01 = ERROR (error messages); 0x02 = INFO (basic messages, low volume); 0x04 = VERBOSE (interesting stuff, moderate volume); 0x08 = DEBUG (diagnostic, high volume); 0x10 = FUNCS (function entry/exit, very high volume); 0x20 = FRAMES (dumps all SMP and GMP frames); 0x40 = ROUTING (dump FDB routing information); 0x80 = currently unused. Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option. -d, --debug: This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows: -d0 = Ignore other SM nodes; -d1 = Force single threaded dispatching; -d2 = Force log flushing after each log message; -d3 = Disable multicast support. -h, --help: Display this usage info, then exit. -?: Display this usage info, then exit. 12.2.2 Environment Variables: The following environment variables control opensm behavior: OSM_TMP_DIR: Controls the directory in which the temporary files generated by op
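The -D flags field in excerpt 69 is a plain bit mask, so a value can be composed or decoded mechanically. The sketch below uses only the bit assignments listed in the manual; the helper names are hypothetical and are not part of OpenSM.

```python
# Bit assignments as listed for the opensm -D option (0x80 is unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_flags(flags: int) -> list:
    """List the log levels enabled by a -D flags value, lowest bit first."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]


def encode_log_flags(names) -> int:
    """Build a -D flags value from level names; unknown names raise KeyError."""
    by_name = {name: bit for bit, name in LOG_LEVELS.items()}
    flags = 0
    for name in names:
        flags |= by_name[name]
    return flags


if __name__ == "__main__":
    print(decode_log_flags(0x03))
    print(hex(encode_log_flags(["ERROR", "DEBUG"])))
```

Decoding 0x3 yields ERROR and INFO, matching the default the manual gives for running without -D.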
70. the one specified in the -p option. Otherwise, an error is reported. Note: When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If, along the LID route, one or more links are not in the ACTIVE state, ibdiagpath reports an error. Moreover, the tool allows omitting the source node in LID route addressing, in which case the local port on the machine running the tool is assumed to be the source. 14.5.1 Synopsis: ibdiagpath {-n <[src-name,]dst-name> | -l <[src-lid,]dst-lid> | -d <p1,p2,p3,...>} [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>] [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]. OPTIONS: -n <[src-name,]dst-name>: Names of the source and destination ports (as defined in the topology file; source may be omitted, in which case the local port is assumed to be the source). -l <[src-lid,]dst-lid>: Source and destination LIDs (source may be omitted, in which case the local port is assumed to be the source). Further options listed: -d <p1,p2,p3,...>; -c <count>; -t <topo-file>; -s <sys-name>; -i <dev-index>; -p <port-num>; -o <out-dir>
71. to Boot From an iSCSI Target: Configure DHCP as described in Section 4.3.1, IPoIB Configuration Based on DHCP. Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI target: filename ""; option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn"; The following is an example for configuring an IB/ETH device to boot from an iSCSI target: host host1 { filename ""; # For a ConnectX device with ports configured as InfiniBand, comment out the following line: # option dhcp-client-identifier = 00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; # For a ConnectX device with ports configured as Ethernet, comment out the following line: # hardware ethernet 00:02:c9:00:00:bb; # For an InfiniHost III Ex, comment out the following line: # option dhcp-client-identifier = fe:00:55:00:41:fe:80:00:00:00:00:00:00:00:02:c9:03:00:00:0d:41; option root-path "iscsi:11.4.3.7::::iqn.2007-08.7.3.4.10:iscsiboot"; } A.9.2 iSCSI Boot Example of SLES 10 SP2 OS: This section provides an example of installing the SLES 10 SP2 operating system on an iSCSI target and booting from a diskless machine via FlexBoot. Note that the procedure described below assumes the following: the client's LAN card is recognized during installation; the iSCSI target can be connected to the client via LAN and InfiniBan
72. to the original initrd location and rename it properly. iSCSI Boot: Mellanox ConnectX EN PXE enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load the kernel and initrd from it. There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via ConnectX EN PXE, and the second is for loading other parts of the OS via initrd. Note: Linux distributions such as SuSE Linux Enterprise Server 10 SPx and Red Hat Enterprise Linux 5.1 can be directly installed on an iSCSI target. At the end of this direct installation, initrd is capable of continuing to load other parts of the OS on the iSCSI target. (Other distributions may also be suitable for direct installation on iSCSI targets.) If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the adapter driver as described in Section B.8.1. Configuring an iSCSI Target in Linux Environment - Prerequisites: Step 1: Make sure that an iSCSI Target is installed on your server side. Tip: You can download and install an iSCSI Target from the following location: http://sourceforge.net/project/showfiles.php?group_id=108475&package_id=117141. Step 2: Dedicate a partition on your iSCSI Target on which you will later install the operating system. Step 3: Configure your iSCSI Target to wor
73. tree ranks. Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree. Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be. The algorithm also dumps the compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables. Footnotes: 1. Ports that are connected to the same remote switch are referenced as a 'port group'. 2. The list of compute nodes (CNs) can be specified by the -u or --cn_guid_file OpenSM options. Activation through OpenSM: Use the -R ftree option to activate the fat-tree algorithm. Use -a <root guid file> to provide root nodes for ranking. If the -a option is not used, the routing algorithm will detect roots automatically. Use
74. update: Force firmware update. --force: Force installation (without querying the user). --all: Install all kernel modules, libibverbs, libibumad, librdmacm, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, MVAPICH, Open MPI, MPI tests, MPI selector, perftest, sdpnetstat and libsdp, srptools, rds-tools, static and dynamic libraries. --hpc: Install all kernel modules, libibverbs, libibumad, librdmacm, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, MVAPICH, Open MPI, MPI tests, MPI selector, dynamic libraries. --basic: Install all kernel modules, libibverbs, libibumad, mft, mstflint, dynamic libraries. --msm: Install all kernel modules, libibverbs, libibumad, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, dynamic libraries. NOTE: With the --msm flag, the OpenSM daemon is configured to run upon boot. -v|-vv|-vvv: Set verbosity level. -q: Set quiet; no messages will be printed. 2.3.2.1 mlnxofedinstall Return Codes: Table 4 lists the mlnxofedinstall script return codes and their meanings. Table 4 - mlnxofedinstall Return Codes (Return Code = Meaning): 0 = The installation ended successfully; 1 = The installation failed; 2 = No firmware was found for the adapter device; 3 = Failed to start the mst driver. 2.3.3 Insta
75. was found between the client and the iSCSI target Open a shell to ping the iSCSI target you can use CTRL ALT F2 and verify that the target is or is not accessible To return to the graphical installation screen press CTRL ALT F7 236 Mellanox Technologies J Mellanox OFED for Linux User s Manual Rev 1 5 Step 5 The iSCSI Initiator Discovery window will now request authentication to access the iSCSI target Click Next to continue without authentication unless authentication is required Preparation gt Language iscsi Initiator Discovery License Agreement Disk Activation e System Analysis Time Zone Installation e Installation Summary Perform Installation Configuration e Root Password e Hostname e Network e Customer Center Incoming Authentication e Online Update Username Password e Service e Users e Clean Up e Release Notes e Hardware Configuration X No Authentication Outgoing Authentication Username Password Help Back Abort Step 6 The iSCSI Initiator Discovery window will show the iSCSI target that got connected to Note that the Connected column must indicate True for this target Click Next See figure below Mellanox Technologies 237 Rev 1 5 E e ra EE iSCSI Initiator Discovery Vv License Agreement Disk Activation e System Analysis Portal Address Target Name Connected e Time Zone 10 4 3 7
76. 0 and 40Gb s InfiniBand switches Ethernet switches emerging Data Cen ter Ethernet switches InfiniBand to Ethernet and Fibre Channel Gateways and Ethernet to Fibre Channel gateways Fibre Channel over Ethernet and Fibre Channel over InfiniBand A single firmware image for dual port ConnectX ConnectX 2 adapters that supports indepen dent access to different convergence networks InfiniBand Ethernet or Data Center Ethernet per port A unified application programming interface with access to communication protocols including Networking TCP IP UDP sockets Storage NFS CIFS iSCSI NFS RDMA SRP Fibre Channel Clustered Storage and FCoE Clustering MPI DAPL RDS sockets and Manage ment SNMP SMI S e Communication protocol acceleration engines including networking storage clustering virtu alization and RDMA with enhanced quality of service 1 3 Mellanox OFED Package 1 3 1 ISO Image Mellanox OFED for Linux MLNX OFED LINUX is provided as ISO images one per a sup ported Linux distribution that includes source code and binary RPMs firmware utilities and doc umentation The ISO image contains an installation script called m1nxofedinstall that performs the necessary steps to accomplish the following Discover the currently installed kernel Uninstall any InfiniBand stacks that are part of the standard operating system distribution or another vendor s commercial stack e Install the MLNX OFED LINUX binary R
77. 00 Preparing 00 30 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 ofed scripts 00 Preparing 00 libibverbs 00 Preparing 00 libibverbs 00 Preparing 00 libibverbs devel 00 Preparing 00 libibverbs devel 00 Preparing 00 libibverbs devel static 00 Preparing 00 libibverbs devel static 00 Preparing 00 libibverbs utils 00 Preparing 00 libmthca 00 Preparing 00 libmthca 00 Preparing 00 libmthca devel static 00 Preparing 00 libmthca devel static 00 Preparing 00 libmlx4 00 Preparing 00 libmlx4 00 Preparing 00 libmlx4 devel 00 Preparing 00 libmlx4 devel 00 Preparing 00 libibcm 00 Preparing 00 libibcm 00 Preparing 00 libibcm devel 00 Preparing 00 libibcm devel 00 Preparing 00 libibumad static 00 Preparing 00 libibumad static 00 Preparing 00 libibmad static 00 Preparing 00 libibmad static 00 Preparing 00 ibsim 00 Preparing 00 Mellanox Technologies 31 Rev 1 5 Installation librdmacm 00 Preparing 00 librdmacm 00 Preparing 00
78. 1 26 6 07 16 1000 1 29 11 83 32 1000 1 36 224 51 102 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 64 1000 14 52 40 25 128 1000 2 67 45 74 256 1000 3 03 80 48 512 1000 3 64 134 22 1024 1000 4 89 199 69 2048 1000 6 30 309 85 4096 1000 8 91 438 24 8192 1000 14 07 555 20 16384 1000 18 85 828 93 32768 1000 30 47 1025 75 65536 640 53 67 1164 57 131072 320 99 78 1252 80 262144 160 191 80 1303 44 524288 80 373 92 1337 19 1048576 40 742 31 1347 14 2097152 20 1475 20 135535 4194304 10 2956 95 1352 75 OUTPUT TRUNCATED 10 6 Open MPI Performance 10 6 1 Requirements At least two nodes Example host1 host2 Machine file Includes the list of machines Example host1 cat home lt username gt cluster hosti host2 hostl 10 6 2 Important Note on RoCE Support In order to run Open MPI over a RoCCE network the following MCA parameter should be included in the run command mca btl openib cpc include rdmacm 10 6 3 Bandwidth Test Performance To run the OSU Bandwidth test enter hostl usr mpi gcc openmpi lt ompi ver gt bin mpirun np 2 mca mpi leave pinned 1 hostfile home lt username gt cluster usr mpi gcc openmpi lt ompi ver gt tests osu benchmarks lt osu ver gt osu bw OSU MPI Bandwidth Test v3 0 Size Bandwidth MB s Mellanox Technologies 103 Rev 1 5 MPI 1 1 12 2 2 24 4 4 43 8 8 96 16 17 38 32 34 69 64 69 31 128 121 29 256 212 70 512 326 50
79. 1 iqn 2007 08 7 3 4 10 iscsiboot True Installation e Installation Summary Perform Installation Configuration e Root Password Hostname e Network e Customer Center Online Update e Service e Users e Clean Up e Release Notes e Hardware Configuration Help Back Abor Step 7 The iSCSI Initiator Overview window will pop up Click Toggle Start Up to change start up from manual to automatic Click Finish 210 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 Preparation Vv Language License Agreement Disk Activation e System Analysis Service Connected Targets e Time Zone E iSCSI Initiator Overview Portal Address Target Name 1 iqn 08 7 3 4 10 oot manual Installation e Installation Summary e Perform Installation Configuration e Root Password Hostname Network Customer Center Online Update Service Users Clean Up Release Notes Hardware Configuration Add Log Out Toggle Start Up Help Abort Step 8 Select New Installation then click Finish in the Installation Mode window Preparation Vv Language License Agreement Disk Activation System Analysis e Time Zone a Installation Mode Installation e Installation Summary Perform Installation Select Mode Configuration RM n New Installation Root Password e Hostname Network
80. 1024 461 78 2048 597 85 4096 543 06 8192 829 64 16384 1137 22 32768 1386 08 65536 1520 89 131072 1622 73 262144 1659 33 524288 1679 36 1048576 1675 35 2097152 1668 89 4194304 1671 78 10 6 4 Latency Test Performance To run the OSU Latency test enter hostl usr mpi gcc openmpi lt ompi ver gt bin mpirun np 2 mca mpi leave pinned 1 hostfile home lt username gt cluster usr mpi gcc openmpi lt ompi ver gt tests osu benchmarks lt osu ver gt osu latency OSU MPI Latency Test v3 0 Size Latency us 0 1 23 1 1237 2 1 55 4 1 54 8 1 455 16 1 58 32 1 59 64 1 59 128 1 78 256 2 05 512 2 69 104 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 1024 Sud 2048 6 14 4096 10 07 8192 12 77 16384 18 36 32768 30 52 65536 48 92 131072 93 18 262144 171 92 524288 341 08 1048576 Ta DI 2097152 1729 27 4194304 4226 58 10 6 5 Intel MPI Benchmark To run the Intel MPI Benchmark test enter host1 usr mpi gcc openmpi ompi ver bin mpirun np 2 mca mpi leave pinned 1 hostfile home lt username gt cluster usr mpi gcc openmpi ompi ver tests IMB IMB ver IMB MPI1 Intel R MPI Benchmark Suite V3 0 MPI 1 part Date Mon Mar 10 12 57 18 2008 Machine x86 64 System Linux Release 2 6 16 21 0 8 smp Version 1 SMP Mon Jul 3 18 25 39 UTC 2006 MPI Version 2240 MPI Thread Environment MPI THREAD SINGLE Minimum message length in bytes 0 Maximum message length in
81. 12.5.4 Fat-tree Routing Algorithm. The fat-tree algorithm optimizes routing for the shift communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-trees: it also handles non-constant K, cases where not all leafs (CAs) are present, and any CBB ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks. If the root guid file is not provided (-a or --root_guid_file options), the topology has to be a pure fat-tree that complies with the following rules: Tree rank should be between two and eight (inclusive). Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they should not have UP-going ports at all. Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches. Switches of the same rank should have the same number of ports in each UP-going port group. Switches of the same rank should have the same number of ports in each DOWN-going port group. All the CAs have to be at the same tree level (rank). If the root guid file is provided, the topology does not have to be a pure fat-tree, and it should only comply with the following rules: Tree rank should be between two and eight (inclusive). All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different
82. 1800 MB/sec (see Section C.1). With IB QDR and PCI Express Gen 2, the expected unidirectional full-wire-speed bandwidth is 3000 MB/sec (see Section C.1). To check the adapter's maximum bandwidth, use the ib_write_bw utility. To check the adapter's latency, use the ib_write_lat utility. Note: The utilities ib_write_bw and ib_write_lat are installed as part of Mellanox OFED. C.3 System Performance Troubleshooting. On some systems it is recommended to change the power-saving configuration in order to achieve better performance. This configuration is usually handled by the BIOS. Please contact the system vendor for more information. Appendix D: ULP Performance Tuning. D.1 IPoIB Performance Tuning. This section provides guidelines for tuning TCP stack configuration parameters in order to boost IPoIB and IPoIB-CM performance. Without tuning, the default Linux configuration may significantly limit the total available bandwidth below the actual capabilities of the adapter card. The parameter settings described below will increase the ability of Linux to transmit and receive data. Generally, increasing the MTU (maximum transmission unit, in bytes) gives better performance. The following MTUs are suggested (use ifconfig to modify the MTU): IPoIB: 2044 bytes; IPoIB-CM: 64K bytes. When IPoIB is configured to run in connecte
83. 255 indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15, or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry. Note that the same VL may be listed multiple times in the High- or Low-priority arbitration tables, and it can also be listed in both tables. The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes. A high_limit value of 255 indicates that the byte limit is unbounded. Note: If the 255 value is used, the low-priority VLs may be starved. A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table. Keep in mind that ports usually transmit packets of size equal to the MTU. For instance, for a 4KB MTU a single packet will require 64 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64. Below is an example of
84. 26438 (Hexadecimal: 6746); Decimal: 25448 (Hexadecimal: 6368); Decimal: 26448 (Hexadecimal: 6750); Decimal: 26478 (Hexadecimal: 676e); Decimal: 26488 (Hexadecimal: 6778); Decimal: 25458 (Hexadecimal: 6372); Decimal: 26458 (Hexadecimal: 675A). B.1.2 Tested Platforms. See the Mellanox ConnectX EN PXE Release Notes (ConnectX_EN_PXE_release_notes.txt). B.1.3 ConnectX EN PXE in Mellanox OFED. The ConnectX EN PXE package is provided as part of the Mellanox OFED for Linux ISO image. The package includes a PXE ROM image file for each of the supported Mellanox network adapter devices. For example: ConnectX EN (PCI DevID: 25448): CONNECTX_EN_25448_ROM-<version>.rom. Note: Please refer to the release notes file for the exact contents. B.2 Burning the PXE ROM Image. B.2.1 Burning the Image on ConnectX EN / ConnectX-2 EN. Note: This section is applicable only to ConnectX EN / ConnectX-2 EN devices with firmware versions 2.7.000 or later. For earlier firmware versions, please follow the instructions in Section B.2.2 on page 225. Prerequisites: 1. Expansion ROM Image: The expansion ROM images are provided as part of the Mellanox OFED for Linux package and are listed in the release notes file (ConnectX_EN_PXE_release_notes.txt). 2. Firmware Burning Tools: The Mellanox Firmware Tools (MFT) package, version 2.6.0 or later, should be installed on your machine i
85. 3 Purpose of UPDN Algorithm. The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high pressure). The UPDN algorithm is based on the following main stages: 1. Auto-detect root nodes: based on the CA hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage. Note: The user can override the node list manually. Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm. 2. Ranking process: All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
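The ranking stage described in the excerpt above can be illustrated with a short sketch. This is not OpenSM's implementation: the function name rank_switches, the adjacency-dictionary representation, and the six-switch example fabric are all assumptions made for illustration. It only shows the idea of giving every root switch rank 0 and ranking the remaining switches incrementally with a breadth-first search.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """Return a dict mapping each reachable switch to its rank.

    adjacency: dict of switch name -> list of neighbouring switch names
               (hypothetical representation, not an OpenSM structure).
    roots:     switches chosen as root nodes; each gets rank 0.
    """
    ranks = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            # A switch is ranked the first time BFS reaches it,
            # one level below the switch it was reached from.
            if neighbour not in ranks:
                ranks[neighbour] = ranks[node] + 1
                queue.append(neighbour)
    return ranks


# Hypothetical three-level fabric: two roots, two middle switches, two leaves.
fabric = {
    "root1": ["mid1", "mid2"],
    "root2": ["mid1", "mid2"],
    "mid1": ["root1", "root2", "leaf1"],
    "mid2": ["root1", "root2", "leaf2"],
    "leaf1": ["mid1"],
    "leaf2": ["mid2"],
}
ranks = rank_switches(fabric, ["root1", "root2"])
```

With ranks in hand, a link can be classified as "up" when it leads to a lower rank and "down" when it leads to a higher rank, which is the information the loop-free path rules rely on.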
86. 3260 1 iqn 2007 08 7 3 4 10 iscsiboot True Installation e Installation Summary Perform Installation Configuration e Root Password Hostname e Network e Customer Center Online Update e Service e Users e Clean Up e Release Notes e Hardware Configuration Help Back Abor Step 7 The iSCSI Initiator Overview window will pop up Click Toggle Start Up to change start up from manual to automatic Click Finish 238 Mellanox Technologies J Mellanox OFED for Linux User s Manual Rev 1 5 Preparation Vv Language License Agreement Disk Activation e System Analysis Service Connected Targets e Time Zone E iSCSI Initiator Overview Portal Address Target Name 1 iqn 08 7 3 4 10 oot manual Installation e Installation Summary e Perform Installation Configuration e Root Password Hostname Network Customer Center Online Update Service Users Clean Up Release Notes Hardware Configuration Add Log Out Toggle Start Up Help Abort Step 8 Select New Installation then click Finish in the Installation Mode window Preparation Vv Language License Agreement Disk Activation System Analysis e Time Zone a Installation Mode Installation e Installation Summary Perform Installation Select Mode Configuration RM n New Installation Root Password e Hostname Netw
87. 5WioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZL27Synsn6dHpxMqBorXNC0Z Be4kTnUqm63nQ2ziqVMdL9FrCmalxlIOu945QJAjwONevaMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrodZl1fCQ <username>@host1. Step 4. Now you need to add the public key to the authorized_keys2 file on the target machine: host1$ cat id_rsa.pub | xargs ssh host2 "echo >> /home/<username>/.ssh/authorized_keys2" <username>@host2's password: (enter password) host1$. For a local machine, simply add the key to authorized_keys2: host1$ cat id_rsa.pub >> authorized_keys2. Step 5. Test: host1$ ssh host2 uname Linux. 10.3 MPI Selector - Which MPI Runs. Mellanox OFED contains a simple mechanism for system administrators and end users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs. Additional MPIs not known by the Mellanox OFED installer can be listed in the MPI selector; see the mpi-selector(1) man page for details. Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default
88. 73 0095 6 5 o9 5 0 Gbps IBA extension Show the new configuration gt ilogomneisitate C mb 0 D 0 3b PortInfo Pore amiteg DIR peus Slicl 655352 dille 655357 0 port 1 ITS RE CRETE EE O Initialize PRYS PINKO mate CHER IS TITELN LinkUp IKW EIE ASUP PORTE r Ga casducsvaco 1X or 4X Roca ruga S 1X or 4X PUNITA o or d doo doo 4X IMLUIGUCSOSSCISUISIOOISESCS 5 I a 50 0 9 8 ZIO RIGO SEEK A 50 STIS iatis pe eciEmialoJie cl M PE sodo ooo 5 0 Gbps IBA extension TENKSOCCAAGEIVO 5 Adin chord ehm vest 5 0 Gbps Mellanox Technologies 165 Rev 1 5 InfiniBand Fabric Diagnostic Utilities 14 10ibroute Applicable Hardware InfiniBand switches Description Uses SMPs to display the forwarding tables unicast LinearForwardingTable or LFT or multi cast MulticastForwardingTable or MFT for the specified switch LID and the optional lid mlid range The default range is all valid entries in the range 1 to FDBTop Synopsis ibroute h d v V a n D G M s lt smlid gt C lt ca_name gt P ca port gt t timeout ms gt lt dest dr path lid guid gt lt startlid gt lt endlid gt Table 13 lists the various flags of the command Table 13 ibportstate Flags and Options Optional Detsult Flag NE datar If Not Description y Specified h help Optional Print the help menu d ebug Optional Raise the IB debug level May be used several times for h
89. 9c 0x0005913f 0x011ca4 DDR OK 0x00059140 0x0007a123 0x020fe4 DDR OK 0x0007a124 0x0007bdff 0x001cdc DDR OK 0x0007be00 0x0007eb97 0x002d98 DDR OK 0x0007eb98 0x0007 0af 0x000518 Configuration OK 0x0007 0b0 0x0007 0fb 0x00004c Jump addresses OK 0x0007 0fc 0x0007 2a7 0x0001ac FW Configuration OK FW image verification succeeded. Image is bootable. 14.15 ibv_asyncwatch. Applicable Hardware: All InfiniBand devices. Description: Displays asynchronous events forwarded to userspace for an InfiniBand device. Synopsis: ibv_asyncwatch. Examples: 1. Display asynchronous events: > ibv_asyncwatch mlx4_0: async event FD 4. 14.16 ibdump. Applicable Hardware: Mellanox ConnectX / ConnectX-2 adapter devices. Description: Dumps InfiniBand traffic that flows to and from the InfiniBand ports of Mellanox Technologies ConnectX / ConnectX-2 adapters. The dump file can be loaded by the Wireshark tool for graphical traffic analysis. The following describes a workflow for local HCA (adapter) sniffing: Run ibdump with the desired options. Run the application whose traffic you wish to analyze. Stop ibdump (CTRL-C) or wait for the data buffer to fill (in --mem-mode). Open Wireshark and load the generated file. How to Get Wireshark: Download the current release from www.wireshark.org for a Linux or Windows en
90. A srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>. This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log. Now it is possible to access the SRP LUNs on /dev/mapper/. Note: It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior. Note: It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the blacklist of multipath. Edit the 'blacklist' section in /etc/multipath.conf and make sure the SRP LUNs are not blacklisted. Automatic Activation of High Availability: Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". Note: For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart. From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/. Note: It is possible that regular (non-SRP) LUNs may also be present; the SRP LUNs may be identified by their name. It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log. 8.2.7 Shutting Down SRP. SRP can be shut down by using rmmod ib_srp, or by stopping the OFED driver (/etc/init.d/openibd stop), or as a by-product of a complete system shutdown. Prior to shutting down SRP, remove all references to it. The actions you need to take d
91. ALL files. After building the nfs-utils package, there will be a mount.nfs binary in the utils/mount directory. This binary can be used to initiate NFS v2, v3, or v4 mounts. To initiate a v4 mount, the binary must be called mount.nfs4. The standard technique is to create a symlink called mount.nfs4 to mount.nfs. This mount.nfs binary should be installed at /sbin/mount.nfs as follows: sudo cp utils/mount/mount.nfs /sbin/mount.nfs. In this location, mount.nfs will be invoked automatically for NFS mounts by the system mount command. Note: mount.nfs, and therefore nfs-utils 1.1.2 or greater, is only needed on the NFS client machine. You do not need this specific version of nfs-utils on the server. Furthermore, only the mount.nfs command from nfs-utils 1.1.2 is needed on the client. Install a Linux kernel with NFS/RDMA: The NFS/RDMA client and server are both included in the mainline Linux kernel version 2.6.25 and later. This and other versions of the 2.6 Linux kernel can be found at ftp://ftp.kernel.org/pub/linux/kernel/v2.6/. Download the sources and place them in an appropriate location. Configure the RDMA stack: Make sure your kernel configuration has RDMA support enabled. Under Device Drivers -> InfiniBand support, update the kernel configuration to enable InfiniBand support. NOTE: The option name is misleading. Enabling InfiniBand support is required for all RDMA devices I
92. AMN Mellanox TECHNOLOGIES Mellanox OFED for Linux User s Manual Rev 1 5 www mellanox com Rev 1 5 NOTE THIS HARDWARE SOFTWARE OR TEST SUITE PRODUCT PRODUCT S AND ITS RELATED DOCUMENTATION ARE PRO VIDED BY MELLANOX TECHNOLOGIES AS IS WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS THE CUS TOMER S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCTO S AND OR THE SYSTEM USING IT THEREFORE MELLANOX TECHNOLOGIES CAN NOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY ANY EXPRESS OR IMPLIED WARRANTIES INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MER CHANTABILITY FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT INDIRECT SPECIAL EXEM PLARY OR CONSEQUENTIAL DAMAGES OF ANY KIND INCLUDING BUT NOT LIMITED TO PAYMENT FOR PROCURE MENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE DATA OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY WHETHER IN CONTRACT STRICT LIABILITY OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY FROM THE USE OF THE PRODUCT S AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE Mellanox TECH
93. B iWARP etc. Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or iWARP adapter support (amso, cxgb3, etc.). If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. Configure the NFS client and server: Your kernel configuration must also have NFS file system support and/or NFS server support enabled. These and other NFS-related configuration options can be found under File Systems -> Network File Systems. Build, install, reboot: The NFS/RDMA code will be enabled automatically if NFS and RDMA are turned on. The NFS/RDMA client and server are configured via the hidden SUNRPC_XPRT_RDMA config option, which depends on SUNRPC and INFINIBAND. The value of SUNRPC_XPRT_RDMA will be: N if either SUNRPC or INFINIBAND is N, in which case the NFS/RDMA client and server will not be built; M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, in which case the NFS/RDMA client and server will be built as modules; Y if both SUNRPC and INFINIBAND are Y, in which case the NFS/RDMA client and server will be built into the kernel. Therefore, if you have followed the steps above and turned on NFS and RDMA, the NFS/RDMA client and server will be built. Build a new kernel, install it, and boot it. 9.3 Check RDMA and NFS Setup. Before configuring the NFS/RDMA software, it is a good idea to test your new kernel t
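The three-way rule for SUNRPC_XPRT_RDMA in the excerpt above can be written out as a small truth function. This is only a sketch of the rule as the text states it, not kernel Kconfig code; the function name sunrpc_xprt_rdma and the use of the strings "N", "M", and "Y" for the option states are assumptions made for illustration.

```python
def sunrpc_xprt_rdma(sunrpc, infiniband):
    """Derive the SUNRPC_XPRT_RDMA state from its two dependencies.

    Each argument is "N" (off), "M" (module), or "Y" (built in).
    Hypothetical helper: it mirrors the rule in the text only.
    """
    valid = ("N", "M", "Y")
    if sunrpc not in valid or infiniband not in valid:
        raise ValueError("each option must be 'N', 'M', or 'Y'")
    if "N" in (sunrpc, infiniband):
        return "N"  # either dependency off: NFS/RDMA is not built
    if "M" in (sunrpc, infiniband):
        return "M"  # both on, at least one modular: built as modules
    return "Y"      # both built in: NFS/RDMA is built into the kernel


# Enumerate every combination of the two dependencies.
table = {(s, i): sunrpc_xprt_rdma(s, i) for s in "NMY" for i in "NMY"}
```

Enumerating the nine combinations this way makes it easy to confirm that the result is the weaker of the two dependency states.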
94. Bi RR 0x0000000000000000 GLOBO AO eesti yee 0xfe80000000000000 Eps ana Go meh Soon aceto 0x0001 SMI RE 0x0001 eap M aser MD E TERRE Eee NOS eR RET 0x251086a IsSM IsTrapSupported IsAutomaticMigrationSupported IsSLMappingSupported IsSystemImageGUIDsupported IsCommunicatonManagementSupported IsVendorClassSupported IsCapabilityMaskNoticeSupported IsClientRegistrationSupported Dice oloxolsiato5o o s 5m Eno 0x0000 Mkeyledsebcuutod ae a ESI E T T e 0 HOGS OIAGS 4 58 5 070 5 9 0 0 D D U D Q O DCO O Qr ES ll Tie AEREE o DE ANG 1X or 4X TRIS Cep POSERO E TEO Lite DUNG belay Ox RIE CIS ER IO 4X ILIA GSOSSCISISIOOIMENCIS a crore eno DE MIN os 550 Cos TASSATO REN nn Active PySIRERIcS ate CERRI E LinkUp KENKDOWODE O CERE SS tro 6515 8 Pe no Polling EosccdBixsci n Sa odds E E E 0 MEIER LINEE 0 TREES SONGS ZO RENEE CI 5 0 Gbps IFA OSS METSO STOKE HOT 5 565545500056 05 6 A CoS or 50 COPS Nesghls on so coos adoro so coo aU 2048 SMS T TIE VICO 0 VIT CAPE trus eee ARES ERCOLE VLO 7 IMA DS eI CE S ERE 0x00 VERRI lab ab IDE Cr eere aet Od EO TY Og TORINO EORR 4 VPA CORRON CAP ex e Mec ard Boe Ao oA Gere 8 NISUS SOWIE E NI RT ET RA ES 8 Veris Eee INVERSO E 0x00 MEC o tomo at Deas aa bles ous aoe 2048 VISCONTE E Ba SBN d BERT SA 0 HOLE EIA INO 31 OPS VAN S pre tear eee een tto NIAE TO 9 PAREEN TORCEM NO vidr r E 0 80 5 5 0 0 ECE Eine OSS 005 5g do So Hoo rs SSI a 0 172 Mellanox Technologies Mellanox OF
95. CP Client (Optional); 4.3.2 Static IPoIB Configuration; 4.3.3 Manually Configuring IPoIB; 4.4 Subinterfaces; 4.4.1 Creating a Subinterface; 4.4.2 Removing a Subinterface; 4.5 Verifying IPoIB Functionality; 4.6 The ib-bonding Driver; 4.6.1 Using the ib-bonding Driver; 4.7 IPoIB Performance Tuning; 4.8 Testing IPoIB Performance. Chapter 5 RoCE: 5.1 Overview; 5.2 Software Dependencies; 5.3 General Guidelines; 5.4 Ported Applications; 5.5 GID Tables; 5.5.1 Priority Pause Frames; 5.6 Using VLANs; 5.7 A Detailed Example. Chapter 6 RDS: 6.1 Overview; 6.2 RDS Configuration. Chapter 7 SDP: 7.1 Overview; 7.2 libsdp.so Library; 7.3 Configuring SDP; 7.3.1 How to Know SDP Is Working; 7.3.1.1 Alternative Method: Using the sdpnetstat Program; 7.3.2 Monitoring and Troubleshooting Tools; 7.4 Environment Variables; 7.5 Converting Socket-based Applications; 7.6 BZCopy - Zero Copy Send; 7.7 Using RDMA for Small Buffers; 7.8 Testing SDP Performance. Chapter 8 SRP: 8.1 Overview; 8.2 SRP Initiator; 8.2.1 Loading SRP Initiator; 8.2.2 Manually Establishing an SRP Connection; 8.2.3 SRP Tools: ibsrpdm and srp_daemon; 8.2.4 Automatic Discovery and Connection to Targets; 8.2.5 Multiple Connections from Initiator IB Port to the Target; 8.2.6 High Availability (HA); 8.2.7
96. D for Linux User s Manual Rev 1 5 However this is not supported in OFED the section is parsed f and ignored SL2VL and VLArb tables should be configured in the OpenSM options file by default var cache opensm opensm opts end qos setup qos levels Having a QoS Level named DEFAULT is a must it is applied to PR MPR requests that didn t match any of the matching rules qos level name DEFAULT use default QoS Level sl 0 end qos level the whole set SL MTU Limit Rate Limit PKey Packet Lifetime qos level name WholeSet Sl 1 mtu limit 4 rate limit 5 pkey 0x1234 packet life 8 end qos level end qos levels Match rules are scanned in order of their apperance in the policy file 4 First matched rule takes precedenc qos match rules f matching by single criteria QoS class qos match rule use by QoS class qos class 7 9 11 Name of qos level to apply to the matching PR MPR qos level name WholeSet end qos match rule show matching by destination group and service id qos match rule Mellanox Technologies 135 Rev 1 5 OpenSM Subnet Manager use Storage targets destination Storage service id 0x10000000000001 0x10000000000008 0x10000000000FFF qos level name WholeSet end qos match rule qos match rule Source Storage use match by source group only qos level name DEFAULT end qos match rule qos match rule use match
97. Driver. mthca is the low-level driver implementation for the following Mellanox Technologies HCA (InfiniBand) devices: InfiniHost, InfiniHost III Ex, and InfiniHost III Lx. 1.4.2 mlx4 VPI Driver. mlx4 is the low-level driver implementation for the ConnectX and ConnectX-2 adapters designed by Mellanox Technologies. ConnectX/ConnectX-2 can operate as an InfiniBand adapter, as an Ethernet NIC, or as a Fibre Channel HBA. The OFED driver supports InfiniBand and Ethernet NIC configurations. To accommodate the supported configurations, the driver is split into four modules: mlx4_core: Handles low-level functions like device initialization and firmware command processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other. mlx4_ib: Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer. mlx4_en: A 10GigE driver under drivers/net/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer. mlx4_fc: Handles the FCoE functions using ConnectX/ConnectX-2 Fibre Channel hardware offloads. 1.4.3 Mid-layer Core. Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user mode for verbs, CM, and management. 1.4.4
98. ED for Linux User s Manual Rev 1 5 2 Query SwitchInfo by GUID gt smpquery G switchinfo 0x000b8c 004016 at Suam RAOS Lael 3 Yestiovi uelotololors qoe ran Tse ET e 49152 Randomidh e dose moe e T CE 0 castssclo eura eee RE E 1024 penea wal Holo WMOVO Re UE S 8 DISTRAE aaa EO RS SNS 0 Disp eeishgle reiecta cn9 55 0890 90 0 DSIEMIGVSHEINOIEI TO LMMPOIAE S uo do va voodoo 0 RSH MINES ao o oo REI 18 Strade c Cbr OG RIE 0 IN SIP SSE Oe er ers re E fre CC ome 0 PISIS BIO ECL G SEA Herr SERLO n 32 TAPUNA PIECEN EE 5 5 0 0707570709 rara ora Sao Il QUNEISOWUNCIP SUA EIINE B sodo go doo ll FREES IRE MAEMNOUINCIS 45545 055999099097 il PCE ARENO OWNS 56 ao go dodo Il ERNANCCILOREO ERE EI 0 3 Query Nodelnfo by direct route gt smpquery D nodeinfo 0 i Node intos WIR parta sulle 655957 lle 655957 0 BASE Ver SER wens ace il Gia S SOUS SR ito dl NORIS Ce aise sea tases sce ease ERU ES RS TEES Channel Adapter Numbers aA ora yg eA oo ed oe BO 2 SyS Ce mE RE a d E e Sri 0x0002c9030000103b GUC est muro OD Ud doo dat 0x0002c90300001038 LIGNE TE GUIS c b Ben meni CI IONI 0x0002c90300001039 Eas C qoi Po IT 1 218 PIERA EE E EE ERE E TE 0x634a VERSOS ere TN TER E 0x000000a0 Ic ailiE ors ee posee pee Mr EU TEE E UE dl VENATORIA E I ener ee T ERE 0x0002c9 14 12perfquery Applicable Hardware All InfiniBand devices Mellanox Technologies 173 Rev 1 5 InfiniBand Fabric Diagnostic Utilities
99. FED_LINUX-1.5.1-rhel5.4/docs/mlnx_add_kernel_support.sh -i /mnt/MLNX_OFED_LINUX-1.5.1-rhel5.4.iso
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
Removing OFED RPMs...
Running mkisofs...
Created /tmp/MLNX_OFED_LINUX-1.5.1-rhel5.4.iso

2.3.2 Installation Script

Mellanox OFED includes an installation script called mlnxofedinstall. Its usage is described below. You will use it during the installation procedure described in Section 2.3.3, "Installation Procedure," on page 30.

Usage:
mlnxofedinstall [OPTIONS]

Note: If no options are provided to the script, then all available RPMs are installed.

Options:
-c|--config <packages config_file>   Example of the configuration file can be found under docs
-n|--net <network config_file>       Example of the network configuration file can be found under docs
-p|--print-available                 Print available packages for the current platform and create a corresponding ofed.conf file. The installation script exits after creating ofed.conf.
--with-fc                            Install FCoE support. Available on RHEL5.2 ONLY
--with-32bit                         Install 32-bit libraries (default). This is relevant for x86_64 and ppc64 platforms
--without-32bit                      Skip 32-bit libraries installation
--without-ib-bonding                 Skip ib-bonding RPM installation
--without-depcheck                   Skip Distro's libraries check
--without-fw-update                  Skip firmware update
--force-fw
100. H EAE s Eee 5 0 Gbps gt ibportstate C mthca0 D 0 1 Poninifor Pet bos WR paci slic 655357 gie 695357 0 pene 1 EAI Site aie Cis RIT TO AOLO MEA Down Day dS EC SIS 3 3 E Polling S MST WEB OG EIS UT OI EGYOLB 3 do wd ord 1X or 4X natalie aChElumavslollercle yaad saqsnaadsddc 1X or 4X IRENE Re E RAC E ES da s n o mon he nee 4X TrnkSPCCISUPPORECERRETTE EEE ue kris 28 Glojoys TRES CCA ERA IEEE Zod Eos ILI MES PECAS k IVE ao aednoagnao sod ou 2 8 Glos 164 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 Change the speed of a port First query for current configuration gt illjxomesicerce C mb D i PONG IDEs Powe mio DIR peus slic 655357 cllicl 655355 0 pone il IIS RE EI Initialize Bystrica CHE AR DN M MM EU TE LinkUp JU 300 etos eUIeIe O6 IESIOLg o a coo od 90599 9v 1X or 4X seme EE CI N ENa OE ae a or ce 1X or 4X IpLi MN ALCHAVNGIEILWOAS daino soda noci 4X latio eixexeolsUISI9Q5 EGGS ao 5 53 5 90 59 96 25 Ggs 540 goes utndsSpeecdimalted e 9 5 5 29 e 29 Go coc 30 Glos JESUS CONV RARE 5 0 Gbps Now change the enabled link speed gt dosomestente C ine 0 D 0 i sixsed 2 ilopporitstatto SC mbz4 0 D 0 L sped 2 Wigalicalell eroe dimito ji BOWE into DE peta Silicl 655355 chick 65535 0 pw i ko pecdEna ledn e an a don Zo5 Goyer After PortInfo set Port absusos WR paca slici 655355 ciue 655357 0 pont i IMUM KSIHSSCINMAEIOINEICS 5 3 5 5 5
101. How to Unload/Shutdown

1. Unload ib_srpt:
   modprobe -r ib_srpt
2. Unload scst and its dev_handlers first:
   modprobe -r scst_vdisk scst
3. Unload ofed:
   /etc/rc.d/openibd stop

Appendix F: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:

   options mlx4_core parameter=<value>
and/or
   options mlx4_ib parameter=<value>
and/or
   options mlx4_en parameter=<value>
and/or
   options mlx4_fc parameter=<value>

The following sections list the available mlx4 parameters.

F.1 mlx4_core Parameters

set_4k_mtu:         Attempt to set 4K MTU to all ConnectX ports (int)
debug_level:        Enable debug tracing if > 0 (default 0)
block_loopback:     Block multicast loopback packets if > 0 (default: 1)
msi_x:              Attempt to use MSI-X if nonzero (default 1)
log_num_mac:        log maximum number of MACs per ETH port (1-7) (int)
use_prio:           Enable steering by VLAN priority on ETH ports (0/1, default 0) (bool)
log_num_qp:         log maximum number of QPs per HCA (default is 17; max is 20)
log_num_srq:        log maximum number of SRQs per HCA (default is 16; max is 20)
log_rdmarc_per_qp:  log number of RDMARC buffers per QP (default is 4; max is 7)
log_num_cq:         log maximum number of CQs per HCA (default is 16; max is 19)
log_num_mcg:        log maximum number of multicast groups per HCA (default is 13; max is 21)
log_num_mpt
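The log_num_* parameters in the list above hold base-2 logarithms, so the resulting limit is 2 raised to the configured value. The following Python sketch illustrates that arithmetic and shows how an options line in the format given above could be assembled. The helper names max_resources and options_line are hypothetical and are not part of Mellanox OFED.

```python
def max_resources(log_value: int) -> int:
    """Return the resource count implied by a log2-style module parameter."""
    if log_value < 0:
        raise ValueError("log value must be non-negative")
    return 1 << log_value


def options_line(module: str, **params: int) -> str:
    """Build an 'options <module> name=value ...' line for /etc/modprobe.conf.

    Hypothetical helper: it only formats text and does not touch the system.
    """
    pairs = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options {module} {pairs}"


if __name__ == "__main__":
    # Defaults quoted in the parameter list above.
    print("QPs at log_num_qp=17:", max_resources(17))
    print("CQs at log_num_cq=16:", max_resources(16))
    # Example line raising the QP limit by one power of two.
    print(options_line("mlx4_core", log_num_qp=18, msi_x=1))
```

Raising a log parameter by one doubles the corresponding limit, which is why the documented maximums (for example 20 for log_num_qp) matter when sizing a node.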
102. ID in an MLID range
<endlid>   Optional   Ending LID in an MLID range

Examples

1. Dump all Lids with valid out ports of the switch with Lid 2:

> ibroute 2
Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
  Lid  Out   Destination
       Port     Info
0x0002 000 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
0x0008 008 : (Channel Adapter portguid 0x0002c902002582cd: 'sw136 HCA-1')
5 valid lids dumped

2. Dump all Lids with valid out ports of the switch with Lid 2:

> ibroute 2
Unicast lids [0x0-0x8] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
  Lid  Out   Destination
       Port     Info
0x0002 000 : (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
0x0008 008 : (Channel Adapter
103. Link Pause

The mlx4_en Ethernet driver supports link pause by default. To change this setting, you can use the following command:

# ethtool -A eth<x> [rx on|off] [tx on|off]

To create a vHBA, run:

> echo eth3 55 > $FCSYSFS/create

4 IPoIB

4.1 Introduction

The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. This chapter describes the following:

• IPoIB mode setting (Section 4.2)
• IPoIB configuration (Section 4.3)
• How to create and remove subinterfaces (Section 4.4)
• How to verify IPoIB functionality (Section 4.5)
• The ib-bonding driver (Section 4.6)
• IPoIB performance tuning (Section 4.7)
• How to test IPoIB performance (Section 4.8)

4.2 IPoIB Mode Setting

IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Connected mode. This can be changed to Datagram mode by editing the file /etc/infiniband/openib.conf and setting SET_IPOIB_CM=no. After changing the mode, you need to restart the driver by running:

/etc/init.d/openibd restart

To check the current mode used for out-going connections, enter:

cat /sys/class/net/ib<n>/mode

4.3 IPoIB Configuration

Unless you hav
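As an illustration of the mode setting described in Section 4.2, the Python sketch below derives the expected IPoIB mode from the text of openib.conf. It assumes the file uses shell-style KEY=value lines with # comments, and it treats any value other than "no" as Connected mode; both are assumptions, and the function name ipoib_mode_from_conf is hypothetical.

```python
def ipoib_mode_from_conf(conf_text: str) -> str:
    """Return 'connected' or 'datagram' based on SET_IPOIB_CM in openib.conf text.

    Assumption: KEY=value lines, '#' starts a comment, last assignment wins.
    """
    mode = "connected"  # documented default when the setting is absent
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key == "SET_IPOIB_CM":
            mode = "datagram" if value.lower() == "no" else "connected"
    return mode


if __name__ == "__main__":
    sample = "# IPoIB settings\nSET_IPOIB_CM=no\n"
    print(ipoib_mode_from_conf(sample))
```

The result can be compared with the contents of /sys/class/net/ib<n>/mode after the driver restart to confirm that the change took effect.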
104. NOLOGIES Mellanox Technologies Mellanox Technologies Ltd 350 Oakmead Parkway Suite 100 PO Box 586 Hermon Building Sunnyvale CA 94085 Yokneam 20692 U S A Israel www mellanox com Tel 972 4 909 7200 Tel 408 970 3400 Fax 972 4 959 3245 Fax 408 970 3403 Copyright 2010 Mellanox Technologies Inc All Rights Reserved Mellanox BridgeX ConnectX InfiniBlast InfiniBridge InfiniHost InfiniPCI InfiniRISC InfiniScale and Virtual Protocol Interconnect are registered trademarks of Mellanox Technologies Ltd CORE Direct FabricIT and PhyX are trademarks of Mellanox Technologies Ltd All other marks and names mentioned herein may be trademarks of their respective companies 2 Mellanox Technologies Document Number 2877 Mellanox OFED for Linux User s Manual Rev 1 5 Table of Contents Table of Contents List of Tables Revision History 11 Preface 12 Intended Audience 12 Documentation Conventions 13 Typographical Conventions 13 Common Abbreviations and Acronyms 13 Related Documentation 15 Support and Updates Webpage 15 Chapter 1 Mellanox OFED Overview 17 1 1 Introduction to Mellanox OFED 17 1 2 Introduction to Mellanox VPI Adapters 17 1 3 Mellanox OFED Package 17 1 3 1 ISO Image 17 1 3 2 Software Components 18 1 3 3 Firmware 18 1 3 4 Directory Structure 18 1 4 Architecture 19 1 4 1 mthca HCA IB Driver 20 1 4 2 mlx4 VPI Driver 20 1 4 3 Mid layer Core 20 1 4 4 Open FCoE 20 1 4 5 UL
105. PMs if they are available for the current kernel.
• Identify the currently installed InfiniBand HCAs and perform the required firmware updates.

1.3.2 Software Components

MLNX_OFED_LINUX contains the following software components:

• Network adapter drivers
  - mthca (IB only)
  - mlx4 (VPI), which is split into four modules: mlx4_core (low-level helper), mlx4_ib (IB), mlx4_en (Ethernet), and mlx4_fc (FCoE)
• Mid-layer core
  - Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
  - IPoIB, RDS, SDP, SRP Initiator
• NFSoRDMA (NFS over RDMA)
• MPI
  - Open MPI stack supporting the InfiniBand, RoCE and Ethernet interfaces
  - OSU MVAPICH stack supporting the InfiniBand and RoCE interfaces
  - MPI benchmark tests (OSU BW/LAT, Intel MPI Benchmark, Presta)
• OpenSM: InfiniBand Subnet Manager
• Utilities
  - Diagnostic tools
  - Performance tests
• Firmware tools (MFT)
• Source code for all the OFED software modules (for use under the conditions mentioned in the modules' LICENSE files)
• Documentation

1.3.3 Firmware

The ISO image includes the following firmware items:

• Firmware images (.mlx format) for all Mellanox standard network adapter devices
• Firmware configuration (.INI) files for Mellanox standard network adapter cards and custom cards
• FlexBoot for ConnectX/ConnectX-2, InfiniHost III Ex in Mem-free mode, and InfiniHost III Lx HCA devices
• ConnectX EN P
106. Ps 20 1 4 6 MPI 21 1 4 7 InfiniBand Subnet Manager 22 1 4 8 Diagnostic Utilities 22 1 4 9 Performance Utilities 22 1 4 10 Mellanox Firmware Tools 22 1 5 Quality of Service 23 Chapter 2 Installation 25 2 1 Hardware and Software Requirements 25 2 1 1 Hardware Requirements 25 2 1 2 Software Requirements 25 22 Downloading Mellanox OFED 26 2 3 Installing Mellanox OFED 26 2 3 1 Pre installation Notes 26 2 3 2 Installation Script 27 2 3 2 1 mlnxofedinstall Return Codes 29 2 3 3 Installation Procedure 30 2 3 4 Installation Results 36 2 3 5 Post installation Notes 37 2 4 Updating Firmware After Installation 37 2 5 Uninstalling Mellanox OFED 38 Chapter 3 Working With VPI 39 3 1 Port Type Management 39 3 2 InfiniBand Driver 40 3 3 Ethernet Driver 40 3 3 1 Overview 40 Mellanox Technologies 3 J Rev 1 5 3 3 2 Loading the Ethernet Driver 40 3 3 3 Unloading the Driver 40 3 3 4 Ethernet Driver Usage and Configuration 41 3 4 Fibre Channel over Ethernet 42 3 4 1 Overview 42 3 4 2 Installation 43 3 4 3 FCoE Basic Usage 43 3 4 3 1 FCoE Configuration 44 3 4 3 2 Starting FCoE Service 45 3 4 8 3 Stopping FCoE Service 46 3 4 4 FCoE Advanced Usage 46 3 4 4 1 Manual vHBA Control 47 3 4 4 2 Creating VHBAs That Use PFC 48 3 4 4 8 Creating VHBAs That Use Link Pause 49 Chapter 4 IPoIB 51 4 Introduction 51 4 2 IPoIB Mode Setting 51 4 3 IPoIB Configuration 51 4 3 1 IPoIB Configuration Based on DHCP 51 4 3 1 1 DHCP Server 52 4 3 1 2 DH
107. S Architecture

QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the "chronology approach" to describe how the overall system works.

1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application or ULP or service uses.
2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class, Service-ID and PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.
4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service-ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches i
108. SL2VL and VL Arbitration configuration on subnet:

qos_ca_max_vls 15
qos_ca_high_limit 6
qos_ca_vlarb_high 0:4
qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
qos_swe_max_vls 15
qos_swe_high_limit 6
qos_swe_vlarb_high 0:4
qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on the subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses a small MTU when transmitting packets. The rest of the VLs are defined as low priority VLs with different weights, while VL4 is effectively turned off.

12.6.8 Deployment Example

Figure 4 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

Figure 4: Example QoS Deployment on InfiniBand Subnet
[Figure labels: Traffic class SDP, Service level 2, Policy min 20% BW; Traffic class Partition A, Service level 0, Policy min 40% BW; Traffic class SRP, Service Level 1, Policy min 30% BW; Traffic class IPoIB, Service Level 3, Policy min 10% BW]

12.7 QoS Configuration Examples

The following are examples of QoS configu
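The arithmetic behind the SL2VL and VL arbitration example in Section 12.6.7 can be checked with a short Python sketch. It parses a vlarb string of vl:weight pairs, computes the high-limit burst size from the 4KB unit stated in the text, and reports each VL's share of the low-priority weights. Treating a weight as a simple proportional share is a simplification made for illustration, and the function names are hypothetical.

```python
from typing import Dict

VLARB_UNIT_BYTES = 4 * 1024  # the text counts the high limit in 4KB units


def parse_vlarb(spec: str) -> Dict[int, int]:
    """Parse a 'vl:weight,vl:weight,...' string into {vl: weight}."""
    table: Dict[int, int] = {}
    for entry in spec.split(","):
        vl, weight = entry.split(":")
        table[int(vl)] = int(weight)
    return table


def high_limit_bytes(high_limit: int) -> int:
    """Burst size allowed for high-priority VLs, in bytes."""
    return high_limit * VLARB_UNIT_BYTES


def weight_shares(spec: str) -> Dict[int, float]:
    """Fraction of the total low-priority weight assigned to each VL."""
    table = parse_vlarb(spec)
    total = sum(table.values())
    if total == 0:
        return {vl: 0.0 for vl in table}
    return {vl: weight / total for vl, weight in table.items()}


if __name__ == "__main__":
    low = "0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64"
    print("high limit:", high_limit_bytes(6), "bytes")
    for vl, share in sorted(weight_shares(low).items()):
        print(f"VL{vl}: {share:.1%}")
```

A weight of 0, as configured for VL4, yields a zero share, which is what "effectively turned off" means in the example.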
109. SM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, so the first match takes precedence. Each rule has the name of a QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule.

Queries can be matched by:

• Source port group (whether a source port is a member of a specified group)
• Destination port group (same as above, only for destination port)
• PKey
• QoS class
• Service ID

To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.

12.6.3 Simple QoS Policy Definition

A simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion: its goal is to match a certain ULP (or a certain application on top of thi
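The matching semantics described for the advanced policy (first match wins, all of a rule's criteria must be present in the query, default level otherwise) can be sketched in Python as follows. This is a simplified model, not OpenSM code: criteria are compared by plain equality rather than by port-group membership or ranges, and the field names, level names, and the match_qos_level function are hypothetical.

```python
from typing import Dict, List, Tuple

Query = Dict[str, int]             # only the fields set in the component mask
Rule = Tuple[str, Dict[str, int]]  # (QoS level name, criteria)


def match_qos_level(query: Query, rules: List[Rule], default_level: str = "default") -> str:
    """Return the QoS level of the first rule whose criteria all match the query."""
    for level, criteria in rules:
        # Every criterion must be present in the query and equal to the wanted
        # value; query fields that the rule does not mention are ignored.
        if all(field in query and query[field] == wanted
               for field, wanted in criteria.items()):
            return level
    return default_level


if __name__ == "__main__":
    rules = [
        ("gold", {"service_id": 0x1234, "pkey": 0x8001}),
        ("silver", {"service_id": 0x1234}),
    ]
    # A query carrying only Service ID cannot match the rule that also needs a PKey.
    print(match_qos_level({"service_id": 0x1234}, rules))
    print(match_qos_level({"service_id": 0x1234, "pkey": 0x8001}, rules))
    print(match_qos_level({"pkey": 0x8001}, rules))
```

Ordering matters in this model: if the "silver" rule were listed first, it would also capture queries that carry both fields, because scanning stops at the first match.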
110. Shutting Down SRP 91 Chapter 9 NFSoRDMA 93 9 1 Overview 93 9 2 Installation 93 9 3 Check RDMA and NFS Setup 95 9 4 NFS RDMA Setup 95 Chapter 10 MPI 97 10 1 Overview 97 10 2 Prerequisites for Running MPI 97 10 2 1 SSH Configuration 97 10 3 MPI Selector Which MPI Runs 99 10 4 Compiling MPI Applications 99 10 5 OSU MVAPICH Performance 100 10 5 1 Requirements 100 10 5 2 Bandwidth Test Performance 100 10 5 3 Latency Test Performance 101 10 5 4 Intel MPI Benchmark 101 10 6 Open MPI Performance 103 10 6 1 Requirements 103 10 6 2 Important Note on RoCE Support 103 10 6 3 Bandwidth Test Performance 103 10 6 4 Latency Test Performance 104 10 6 5 Intel MPI Benchmark 105 Chapter 11 Quality of Service 107 11 1 Overview 107 11 2 QoS Architecture 108 11 3 Supported Policy 108 11 4 CMA features 109 11 5 IPoIB 109 11 6 SDP 109 11 7 RDS 109 11 8 SRP 110 11 9 OpenSM Features 110 Chapter 12 OpenSM Subnet Manager 111 12 1 Overview 111 12 2 opensm Description 111 12 2 1 opensm Syntax 111 12 2 2 Environment Variables 117 12 2 3 Signaling 118 12 2 4 Running opensm 118 12 2 4 1 Running OpenSM As Daemon 119 12 3 osmtest Description 119 12 3 1 Syntax 119 12 3 2 Running osmtest 121 12 4 Partitions 122 12 4 1 File Format 122 Mellanox Technologies 5 J Rev 1 5 12 5 Routing Algorithms 124 12 5 1 Effect of Topology Changes 125 12 5 2 Min Hop Algorithm 125 12 5 3 Purpose of UPDN Algorithm 125 12 5 3 1 UPDN Algorithm Usage 127
111. This identifier is used to distinguish between the various DHCP sessions.

The value of the client identifier is composed of 21 bytes (separated by colons) having the following components:

20:<QP Number - 4 bytes>:<GID - 16 bytes>

Note: Bytes are represented as two hexadecimal digits.

Extracting the Client Identifier Method

The following steps describe one method for extracting the client identifier:

Step 1. QP Number equals 00:55:04:01 for InfiniHost III Ex and InfiniHost III Lx HCAs.
Step 2. GID is composed of an 8-byte subnet prefix and an 8-byte Port GUID. The subnet prefix is fixed for the supported Mellanox HCAs, and is equal to fe:80:00:00:00:00:00:00. The next step explains how to obtain the Port GUID.
Step 3. To obtain the Port GUID, run the following commands:

Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.

host1# mst start
host1# mst status

The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via a query command:

flint -d <MST_DEVICE_NAME> q

Example with InfiniHost III Ex as the HCA device:

host1# flint -d /dev/mst/mt25218_pci_cr0 q
Image type:   Failsafe
FW Version:   dad
Rom Info:     type=GPXE version=1.0.0 devid=25218 port=2
I.S. Version: 1
De
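The composition rule in Steps 1 and 2 (a leading 20 byte, the 4-byte QP Number, the 8-byte subnet prefix, then the 8-byte Port GUID) can be sketched in Python. The constants are the values quoted in the steps for InfiniHost III Ex and InfiniHost III Lx; the function name client_identifier and the sample Port GUID are hypothetical.

```python
QP_NUMBER = "00:55:04:01"                  # Step 1 value (InfiniHost III Ex / III Lx)
SUBNET_PREFIX = "fe:80:00:00:00:00:00:00"  # Step 2 fixed subnet prefix


def client_identifier(port_guid_hex: str) -> str:
    """Compose the 21-byte DHCP client identifier from an 8-byte Port GUID."""
    digits = port_guid_hex.strip().lower().replace(":", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("Port GUID must be 8 bytes (16 hexadecimal digits)")
    guid = ":".join(digits[i:i + 2] for i in range(0, 16, 2))
    return f"20:{QP_NUMBER}:{SUBNET_PREFIX}:{guid}"


if __name__ == "__main__":
    # Hypothetical Port GUID, used only to show the output format.
    ident = client_identifier("0x0002c90200000001")
    print(ident)
    print(len(ident.split(":")), "bytes")
```

The resulting string has 1 + 4 + 16 = 21 colon-separated bytes, matching the length stated above.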
112. UB JOOICE 3 4
Dump file: sniffer.pcap
Sniffer WQEs (max burst size): 4096
Initiating resources...
searching for IB devices in host
Port active_mtu=2048
MR was registered with addr=0x60d850, lkey=0x28042601, rkey=0x28042601, flags=0x1
QP was created, QP number=0x4004a
Ready to capture (Press ^c to stop):

Appendix A: Mellanox FlexBoot

A.1 Overview

Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and Boot over Ethernet (BoE).

Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX adapters, FlexBoot gives IT Managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products.

FlexBoot is based on the open source project Etherboot/gPXE available at http://www.etherboot.org.

FlexBoot first initializes the adapter device, senses the port protocol (Ethernet or InfiniBand), and brings up the port. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs FlexBoot to access the kernel/OS through a TFTP server, an iSCSI target, or some ot
113. Step 13. In the pop-up window, click No to approve deleting the swap partition. You will be returned to the Installation Settings window. See image below.

[Screenshot: Expert Partitioner pop-up warning that no swap partition has been assigned and asking "Do you want to change this?"]
114. XE (gPXE) boot for ConnectX EN and ConnectX-2 EN devices

1.3.4 Directory Structure

The ISO image of MLNX_OFED_LINUX contains the following files and directories:

• mlnxofedinstall: This is the MLNX_OFED_LINUX installation script.
• uninstall.sh: This is the MLNX_OFED_LINUX un-installation script.
• <CPU architecture folders>: Directory of binary RPMs for a specific CPU architecture.
• firmware/: Directory of the Mellanox IB HCA firmware images (including Boot-over-IB).
• src/: Directory of the OFED source tarball and the Mellanox Firmware Tools (MFT) tarball.
• docs/: Directory of Mellanox OFED related documentation.

1.4 Architecture

Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user spaces. The application level also shows the versatility of markets that Mellanox OFED applies to.

Figure 1: Mellanox OFED Stack

The following sub-sections briefly describe the various components of the Mellanox OFED stack.

1.4.1 mthca HCA (IB)
115. a086001
gPXE>

B.7.3.2 ifopen net<x>
Opens the network interface net<x>. The list of network interfaces is available via the ifstat command.

B.7.3.3 ifclose net<x>
Closes the network interface net<x>. The list of network interfaces is available via the ifstat command.

B.7.3.4 autoboot
Starts the boot process from the device(s).

B.7.3.5 sanboot
Starts the boot process of an iSCSI target.
Example:
gPXE> sanboot iscsi:11.4.3.7::::iqn.2007-08.7.3.4.11:iscsiboot

B.7.3.6 echo
Echoes an environment variable.
Example:
gPXE> echo ${root-path}

B.7.3.7 dhcp
Attempts to open the network interface and then tries to connect to and communicate with the DHCP server to obtain the IP address and filepath from which the boot will occur.
Example:
gPXE> dhcp net1

B.7.3.8 help
Displays the available list of commands.

B.7.3.9 exit
Exits from the command line interface.

B.8 Diskless Machines

Mellanox ConnectX EN PXE supports booting diskless machines. To enable using an Ethernet driver, the remote kernel or initrd image must include and be configured to load the driver. This can be achieved either by compiling the adapter driver into the kernel, or by adding the device driver module into the initrd image and loading it. The Ethernet driver requires loading the following modules in the specified order (see Secti
116. ader Settings window.

[Screenshot: Boot Loader Settings window, Section Management tab]

Step 16. In the "Optional Kernel Command Line Parameter" field, append the following string to the end of the line: "ibft_mode=off" (include a space before the string). Click OK and then Finish to apply the change.

[Screenshot: Boot Loader Settings, Section Editor for "SUSE Linux Enterprise Server 10 SP2"]
117. al   Use directed path address arguments. The path is a comma separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...)

-G(uid)           Optional   Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023"
-s <smlid>        Optional   Use <smlid> as the target LID for SM/SA queries
-V(ersion)        Optional   Show version info
-C <ca_name>      Optional   Use the specified channel adapter or router
-P <ca_port>      Optional   Use the specified port
-t <timeout_ms>   Optional   Override the default timeout for the solicited MADs (msec)

Table 14 - smpquery Flags and Options
Columns: Flag; Optional/Mandatory; Default (If Not Specified); Description

<op>                          Mandatory   Supported operations:
                                          nodeinfo <addr>
                                          nodedesc <addr>
                                          portinfo <addr> [<portnum>]
                                          switchinfo <addr>
                                          pkeys <addr> [<portnum>]
                                          sl2vl <addr> [<portnum>]
                                          vlarb <addr> [<portnum>]
                                          guids <addr>
<dest dr_path | lid | guid>   Optional    Destination's directed path, LID, or GUID

Examples

1. Query PortInfo by LID, with port modifier:

> smpquery portinfo 1 1
# Port info: Lid 1 port 1
Mkey:
118. am with a name starting with db2. t?cp would match on ttcp, etc. If program-name is not provided (default), the statement matches all programs.

<address>
Either the local address to which the server binds, or the remote server address to which the client connects. The syntax for address matching is:
   <IPv4 address>[/<prefix_length>]|*
   IPv4 address = [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+ each sub number < 255
   prefix_length = [0-9]+ and with value <= 32
A prefix_length of 24 matches the subnet mask 255.255.255.0. A prefix_length of 32 requires matching of the exact IP.

<port range>
start-port[-end-port] where port numbers are > 0 and < 65536

Note that rules are evaluated in the order of definition, so the first match wins. If no match is made, libsdp will default to "both".

Examples:

• Use SDP by clients connecting to machines that belong to subnet 192.168.1.*:
   use sdp connect * 192.168.1.0/24:*
   (fields: family, role, program, address:port[-range])
• Use SDP by ttcp when it connects to port 5001 of any machine:
   use sdp listen ttcp *:5001
• Use TCP for any program with name starting with ttcp* serving ports 22 to 25:
   use tcp server ttcp* *:22-25
• Listen on both TCP and SDP by any server that listens on port 8080:
   use both server * *:8080
• Connect ssh through SDP and fallback to TCP to hosts on 11.4.8.* port 22:
   use both connect * 11.4.8.0/24:22

Explicit No
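The rule evaluation described above (rules checked in order, first match wins, fallback to "both") can be modelled with a short Python sketch. This is not libsdp code: it assumes rules in the "use <family> <role> <program> <address>:<ports>" form used in the examples, compares the role by exact string, and uses shell-style wildcards for the program name. The function names are hypothetical.

```python
import fnmatch
import ipaddress
from typing import List, Tuple


def parse_rule(line: str) -> Tuple[str, str, str, str, str]:
    """Split 'use sdp connect * 192.168.1.0/24:*' into its five fields."""
    _use, family, role, program, target = line.split()
    address, ports = target.rsplit(":", 1)
    return family, role, program, address, ports


def _address_ok(pattern: str, address: str) -> bool:
    if pattern == "*":
        return True
    # A bare IPv4 address is treated as a /32 network (exact match).
    network = ipaddress.ip_network(pattern, strict=False)
    return ipaddress.ip_address(address) in network


def _port_ok(pattern: str, port: int) -> bool:
    if pattern == "*":
        return True
    start, _, end = pattern.partition("-")
    return int(start) <= port <= int(end or start)


def choose_family(rules: List[str], role: str, program: str, address: str, port: int) -> str:
    """Return the family of the first matching rule, or 'both' if none match."""
    for line in rules:
        family, r_role, r_program, r_address, r_ports = parse_rule(line)
        if (r_role == role
                and fnmatch.fnmatchcase(program, r_program)
                and _address_ok(r_address, address)
                and _port_ok(r_ports, port)):
            return family
    return "both"


if __name__ == "__main__":
    rules = [
        "use sdp connect * 192.168.1.0/24:*",
        "use tcp server ttcp* *:22-25",
        "use both connect * 11.4.8.0/24:22",
    ]
    print(choose_family(rules, "connect", "ssh", "192.168.1.7", 22))
    print(choose_family(rules, "server", "ttcpd", "0.0.0.0", 23))
    print(choose_family(rules, "connect", "ftp", "10.0.0.1", 21))
```

Because evaluation stops at the first match, a broad rule placed early shadows narrower rules that follow it, so specific rules belong at the top of the file.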
119. [Screenshot: Installation Settings window. Partitioning proposal: create boot partition /dev/sda1 (70.5 MB) with ext2; create swap partition /dev/sda2 (502.0 MB); create root partition /dev/sda3 (7.4 GB) with reiserfs]

Step 11. Select "Base Partition Setup on This Proposal", then click Next.

[Screenshot: Suggested Partitioning window showing the same partition proposal]
120. aph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.

3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.

Note that the implementation of LASH in opensm attempts to use as few layers as possible. This number can be less than the number of actual layers available.

In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies, it is topology agnostic, and it fares well in the face of faults.

It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path.

The algorithm was developed by Simula Research Laboratory.

Use the '-R lash -Q' option to activate the LASH algorithm.

Note: QoS support has to be turned on in order that SL/VL mappings are used.

Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.

12.5.6 DOR Routing Algorithm

The Dimension Or
121. bdde2c, GID fe80::202:c900:708:e799
remote address: LID 0x0000, QPN 0x08004f, PSN 0xc9d800, GID fe80::202:c900:708:e811
8192000 bytes in 0.01 seconds = 4824.50 Mbit/sec
1000 iters in 0.01 seconds = 13.58 usec/iter

On Client:
ibv_rc_pingpong -g 1 -i 2 sw419
local address: LID 0x0000, QPN 0x08004f, PSN 0xc9d800, GID fe80::202:c900:708:e811
remote address: LID 0x0000, QPN 0x04004f, PSN 0xbdde2c, GID fe80::202:c900:708:e799
8192000 bytes in 0.01 seconds = 4844.83 Mbit/sec
1000 iters in 0.01 seconds = 13.53 usec/iter

Defining Ethernet Priority (PCP in 802.1q Headers)

On Server:
ibv_rc_pingpong -g 1 -i 2 -l 4
local address: LID 0x0000, QPN 0x1c004f, PSN 0x9daf6c, GID fe80::202:c900:708:e799
remote address: LID 0x0000, QPN 0x1c004f, PSN 0xb0a49b, GID fe80::202:c900:708:e811
8192000 bytes in 0.01 seconds = 4840.89 Mbit/sec
1000 iters in 0.01 seconds = 13.54 usec/iter

On Client:
ibv_rc_pingpong -g 1 -i 2 -l 4 sw419
local address: LID 0x0000, QPN 0x1c004f, PSN 0xb0a49b, GID fe80::202:c900:708:e811
remote address: LID 0x0000, QPN 0x1c004f, PSN 0x9daf6c, GID fe80::202:c900:708:e799
8192000 bytes in 0.01 seconds = 4855.96 Mbit/sec
1000 iters in 0.01 seconds = 13.50 usec/iter

Using rdma_cm Tests

On Server:
# ucmatose
cmatose: starting server
initiating data transfers
completing sends
receiving data transfers
data transfers complete
cmatose
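The two summary lines printed by ibv_rc_pingpong in the runs above are consistent with each other, even though the elapsed time is shown rounded to 0.01 seconds. The Python sketch below recovers the per-iteration time from the reported byte count and bandwidth; the function names are hypothetical and the calculation is a plain unit conversion, not part of the tool.

```python
def elapsed_usec(total_bytes: int, mbit_per_sec: float) -> float:
    """Elapsed time in microseconds: bits divided by (Mbit/s) gives microseconds."""
    return total_bytes * 8 / mbit_per_sec


def usec_per_iter(total_bytes: int, mbit_per_sec: float, iters: int) -> float:
    """Average time per iteration in microseconds."""
    return elapsed_usec(total_bytes, mbit_per_sec) / iters


if __name__ == "__main__":
    # Bandwidth figures taken from the four runs shown above.
    for rate in (4824.50, 4844.83, 4840.89, 4855.96):
        print(f"{rate:.2f} Mbit/sec -> {usec_per_iter(8192000, rate, 1000):.2f} usec/iter")
```

For the first run, 8192000 bytes at 4824.50 Mbit/sec corresponds to roughly 13.58 ms in total, which matches the reported 13.58 usec/iter over 1000 iterations.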
122. be connected to the client via a LAN and a ConnectX Ethernet

Prerequisites

See Section B.6.1 on page 227.

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Procedure

Step 1. Load the SLES 10 SP2 installation disk and enter the following parameters as boot options:

netsetup=1 WithISCSI=1

[Screenshot: SLES installer boot menu with "netsetup=1 WithISCSI=1" entered in the Boot Options field]

Step 2. Continue with the procedure as instructed by the installation program until the "iSCSI Initiator Overview" window appears.

[Screenshot: iSCSI Initiator Overview window]

Step 3. Click the Add tab in the iSCSI Initiator Overview window. An iSCSI I
123. bsrpdm -d <umad device> 2. Assistance in creating an SRP connection: a. To generate output suitable for utilization in the echo command of Section 8.2.2, add the -c option to ibsrpdm: ibsrpdm -c Sample output: id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 b. To establish a connection with an SRP Target using the output from the ibsrpdm -c example above, execute the following command: echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the fdisk -l command. srp_daemon: The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also: Establish an SRP connection by itself (without the need to issue the echo command described in Section 8.2.2). Continue running in background, detecting new targets and establishing SRP connections with them (daemon mode). Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N> where <N> is a digit. Enable High Availability operation together wi
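The target description printed by ibsrpdm -c in the SRP snippet above is a flat list of key/value fields. The following is a minimal Python sketch for splitting such a line into its fields and putting it back together before it is written to add_target. It assumes the usual comma-separated key=value layout; the separators were lost in the extracted text, so that layout, and the helper names parse_srp_target and format_srp_target, are assumptions made for illustration only.

```python
# Sketch only: the "key=value,key=value" layout is assumed, not confirmed
# by the extracted manual text. Helper names are hypothetical.

def parse_srp_target(line: str) -> dict:
    """Split an ibsrpdm -c style line into an ordered field dictionary."""
    fields = {}
    for item in line.strip().split(","):
        key, sep, value = item.partition("=")
        if not sep or not key:
            raise ValueError(f"malformed field: {item!r}")
        fields[key] = value
    return fields


def format_srp_target(fields: dict) -> str:
    """Rebuild the single-line form from a field dictionary."""
    return ",".join(f"{key}={value}" for key, value in fields.items())


# Sample values copied from the manual's ibsrpdm -c example.
SAMPLE = (
    "id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,"
    "dgid=fe800000000000000002c90200402bd5,pkey=ffff,"
    "service_id=200400a0b81146a1"
)

if __name__ == "__main__":
    target = parse_srp_target(SAMPLE)
    for key, value in target.items():
        print(f"{key:12s} {value}")
```

Parsing first makes it easy to inspect or filter targets (for example by ioc_guid) before deciding which ones to connect to.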
124. ch as OSU BW/LAT, Intel MPI Benchmark, and Presta. 1.4.7 InfiniBand Subnet Manager: All InfiniBand-compliant ULPs require a proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of Mellanox OFED. See Chapter 12, OpenSM Subnet Manager. 1.4.8 Diagnostic Utilities: Mellanox OFED includes the following two diagnostic packages for use by network and data center managers: ibutils (Mellanox Technologies diagnostic utilities) and infiniband-diags (OpenFabrics Alliance InfiniBand diagnostic tools). 1.4.9 Performance Utilities: A collection of tests written over uverbs intended for use as a performance micro-benchmark. As an example, the tests can be used for hardware or software tuning and/or functional testing. See PERF_TEST_README.txt under docs/. 1.4.10 Mellanox Firmware Tools: The Mellanox Firmware Tools (MFT) package is a set of firmware management tools for a single InfiniBand node. MFT can be used for: Generating a standard or customized Mellanox firmware image. Querying for firmware information. Burning a firmware image to a single InfiniBand node. MFT includes the following tools: mlxburn: This tool provides the following functions: Generation of a standard or c
125. cific ports of specific devices ibstatus mthca0 1 mlx4 0 2 Infiniband device mthca0 port 1 status default gid fe80 0000 0000 0000 0002 c900 0101 d151 base lid 0x0 sim ae 0x0 SESS S 23 RTE phys state 5 LinkUp rate 10 Gb sec 4X Infiniband device mlx4 0 port 2 status default gid fe80 0000 0000 0000 0000 0000 0007 3897 base lid 0x1 SME 0x1 HEU B ZEE eSI TES phys state SENAK rate 20 Gb sec 4X DDR 14 9 ibportstate Applicable Hardware All InfiniBand devices Description Enables querying the logical link and physical port states of an InfiniBand port It also allows adjusting the link speed that is enabled on any InfiniBand port If the queried port is a swich port then ibportstate can be used to disable enable or reset the port validate the port s link width and speed against the peer port Synopsis ibportstate d e v V D G s lt smlid gt C ca name P ca port gt t timeout ms gt lt dest dr path lid guid gt lt portnum gt op lt value gt Mellanox Technologies 161 Rev 1 5 InfiniBand Fabric Diagnostic Utilities Table 12 lists the various flags of the command Table 12 ibportstate Flags and Options Optional Default Flag Di dator If Not Description y Specified h help Optional Print the help menu d ebug Optional Raise the IB debug level May be used several times
126. cluded ConnectX ConnectX 2 images ConnectX FlexBoot 25408 ROM lt version gt rom ConnectX FlexBoot 25418 ROM lt version gt rom ConnectX FlexBoot 26418 ROM version rom ConnectX FlexBoot 26428 ROM version rom ConnectX FlexBoot 26438 ROM lt version gt rom ConnectX FlexBoot 26488 ROM lt version gt rom where the number after the ConnectX FlexBoot prefix indicates the corresponding PCI Device ID of the ConnectX ConnectX 2 device InfiniHost III Ex images IHOST3EX_PORT1_ROM lt version gt rom IB Port 1 IHOST3EX_PORT2_ROM lt version gt rom IB Port 2 InfiniHost III Lx image IHOST3LX_ROM lt version gt rom 2 Additional documents under docs dhcpd conf sample DHCP configuration file dhcp patch patch file for DHCP v3 1 3 188 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 A 2 Burning the Expansion ROM Image A 2 1 Burning the Image on ConnectX ConnectX 2 Note This section is valid for ConnectX ConnectX 2 devices with firmware versions 2 7 000 or later For earlier firmware versions please follow the instructions in Sec tion A 2 2 on page 189 Prerequisites 1 Expansion ROM Image The expansion ROM images are provided as part of the Mellanox FlexBoot package and are listed in the release notes file F1exBoot release notes txt 2 Firmware Burning Tools You need to install the Mellanox Firmware Tools MFT package ver
127. d Prerequisites: See Section A.6.1 on page 194. Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting. Procedure: Step 1. Load the SLES 10 SP2 installation disk and enter the following parameters as boot options: netsetup=1 WithISCSI=1 [Screenshot: SLES installation boot menu with Boot Options set to netsetup=1 WithISCSI=1] Step 2. Continue with the procedure as instructed by the installation program until the iSCSI Initiator Overview window appears. [Screenshot: iSCSI Initiator Overview window with columns Portal Address, Target Name, Start-Up] Step 3. Click the Add tab in the iSCSI Initiator Overview window. An iSCSI Initiator Discovery window will pop up. Enter the IP Addres
128. d mode. TCP parameter tuning is performed at driver startup to improve the throughput of medium and large messages. The driver startup scripts set the TCP parameters listed below. Note: The following settings should not be applied when running in datagram mode, as they degrade the performance. net.ipv4.tcp_timestamps=0 net.ipv4.tcp_sack=0 net.core.netdev_max_backlog=250000 net.core.rmem_max=16777216 net.core.wmem_max=16777216 net.core.rmem_default=16777216 net.core.wmem_default=16777216 net.core.optmem_max=16777216 net.ipv4.tcp_mem="16777216 16777216 16777216" net.ipv4.tcp_rmem="4096 87380 16777216" net.ipv4.tcp_wmem="4096 65536 16777216" If you change the IPoIB run mode to datagram while the driver is running, the tuned parameters are not restored to the default values suitable for datagram mode. It is recommended to change the IPoIB mode only while the driver is down, by changing the line SET_IPOIB_CM=yes to SET_IPOIB_CM=no in the file /etc/infiniband/openib.conf and then restarting the driver. D.2 Ethernet Performance Tuning: When the /etc/init.d/openibd script loads the mlx4_en driver, the following network stack parameters are applied: net.ipv4.tcp_timestamps=0 net.ipv4.tcp_sack=0 net.core.netdev_max_backlog=250000 net.core.rmem_max=16777216 net.core.rmem_default=16777216 net.core.wmem_default=16777216 net.core.optmem_max=16777216 n n
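Because the tuned TCP values listed in the performance-tuning snippet above are left in place when the IPoIB mode changes at run time, it can help to see which of them currently differ on a host. The following is a minimal, read-only Python sketch under two assumptions: that the parameter names use the standard dotted sysctl form (the dots and underscores were lost in the extracted text), and that they are exposed under /proc/sys in the usual way. The names sysctl_path and differing are hypothetical helpers, not part of Mellanox OFED.

```python
# Read-only sketch: compares current sysctl values with the tuned values
# quoted in the manual. It never writes to /proc/sys.
from pathlib import Path

# Values as listed for connected-mode IPoIB tuning (key spelling assumed).
TUNED = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_sack": "0",
    "net.core.netdev_max_backlog": "250000",
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.core.rmem_default": "16777216",
    "net.core.wmem_default": "16777216",
    "net.core.optmem_max": "16777216",
    "net.ipv4.tcp_mem": "16777216 16777216 16777216",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
}


def sysctl_path(key, root="/proc/sys"):
    """Map a dotted sysctl key to its file under the given root."""
    return Path(root) / key.replace(".", "/")


def differing(expected=TUNED, root="/proc/sys"):
    """Return {key: (current, expected)} for readable keys whose value differs."""
    result = {}
    for key, want in expected.items():
        try:
            # Multi-value entries are tab-separated; normalize whitespace.
            have = " ".join(sysctl_path(key, root).read_text().split())
        except OSError:
            continue  # key missing or unreadable on this system
        if have != want:
            result[key] = (have, want)
    return result


if __name__ == "__main__":
    for key, (have, want) in differing().items():
        print(f"{key}: current={have!r} tuned={want!r}")
```

An empty result means every readable key already matches the tuned value; keys that do not exist on the running kernel are skipped rather than reported.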
129. d to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect). A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 8.2.4. 8.2.4 Automatic Discovery and Connection to Targets: Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running. To connect to all the existing Targets in the fabric, run srp_daemon -e -o. This utility will scan the fabric once, connect to every Target it detects, and then exit. Note: srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus, it will ignore a target that is disallowed in the configuration file. To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric). To execute SRP daemon as a daemon you may run run_srp_daemon (found under /usr/sbin/), providing it with the same options used for running srp_daemon. Note: Make sure only one instance of run_srp_daemon runs per port. To execute SRP daemon as a daemon on all the ports, run srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its lo
130. dapter and the Chelsio cxgb3 iWARP adapter. Install a Linux distribution and tools: The first kernel release to contain both the NFS/RDMA client and server was Linux 2.6.25. Therefore, a distribution compatible with this and subsequent Linux kernel releases should be installed. The procedures described in this document have been tested with distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). Install nfs-utils-1.1.2 or greater on the client: An NFS/RDMA mount point can be obtained by using the mount.nfs command in nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils version with support for NFS/RDMA mounts, but for various reasons we recommend using nfs-utils-1.1.2 or greater). To see which version of mount.nfs you are using, type: /sbin/mount.nfs -V If the version is less than 1.1.2 or the command does not exist, you should install the latest version of nfs-utils. Download the latest package from http://www.kernel.org/pub/linux/utils/nfs. Uncompress the package and follow the installation instructions. If you will not need the idmapper and gssd executables (you do not need these to create an NFS/RDMA enabled mount command), the installation process can be simplified by disabling these features when running configure: ./configure --disable-gss --disable-nfsv4 To build nfs-utils you will need the tcp_wrappers package installed. For more information on this see the package's README and INST
131. der Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available paths at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension. Use the -R dor option to activate the DOR algorithm. 12.5.7 Routing References: To learn more about deadlock-free routing, see the article "Deadlock Free Message Routing in Multiprocessor Interconnection Networks" by William J. Dally and Charles L. Seitz (1985). To learn more about the up/down algorithm, see the article "Effective Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose Carlos Sancho, Antonio Robles
132. ding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over InfiniBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier. The length of the client identifier field is not fixed in the specification. For the Mellanox OFED for Linux package, it is recommended to have IPoIB use the same format that FlexBoot uses for this client identifier; see Section A.2.4, Configuring the DHCP Server, on page 190. 4.3.1.1 DHCP Server: In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of this package. The DHCP server must run on a machine which has loaded the IPoIB module. To run the DHCP server from the command line, enter: dhcpd <IB network interface name> -d Example: host1# dhcpd ib0 -d 4.3.1.2 DHCP Client (Optional): Note: A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under Example Ad
133. ding an IB Driver to initrd (Linux). In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command: dhclient -cf <client conf file> <IB network interface name> Example of a configuration file for the ConnectX (PCI Device ID 26428), called dhclient.conf: The value indicates a hexadecimal number. interface "ib1" { send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; } Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf: The value indicates a hexadecimal number. interface "ib1" { send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92; } In order to use the configuration file, run: host1# dhclient -cf dhclient.conf ib1 4.3.2 Static IPoIB Configuration: If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the -n option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface: A static IPoIB configuration. An IPoIB configuration based on an Ethernet configuration. Note: See your Linux distribution
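The dhclient.conf examples in the DHCP client snippet above pair an interface name with a colon-separated hexadecimal client identifier. The following is a minimal Python sketch that renders such a stanza from raw bytes. It assumes standard ISC dhclient.conf syntax (quoted interface name, braces, trailing semicolon), which the extracted text no longer shows, and it assumes the first byte of the ConnectX example is ff (it appears as a bare "f" in the extraction). The function name dhclient_stanza is hypothetical.

```python
# Sketch only: renders a dhclient.conf stanza in standard ISC syntax.
# The stanza layout and the leading 0xff byte are assumptions (see lead-in).

def dhclient_stanza(interface: str, client_id: bytes) -> str:
    """Return a dhclient.conf interface block sending the given identifier."""
    if not client_id:
        raise ValueError("client identifier must not be empty")
    ident = ":".join(f"{byte:02x}" for byte in client_id)
    return (
        f'interface "{interface}" {{\n'
        f"    send dhcp-client-identifier {ident};\n"
        f"}}\n"
    )


# Identifier bytes from the ConnectX example in the manual.
CONNECTX_ID = bytes.fromhex(
    "ff 00 00 00 00 00 02 00 00 02 c9 00 00 02 c9 03 00 00 10 39"
)

if __name__ == "__main__":
    print(dhclient_stanza("ib1", CONNECTX_ID), end="")
```

Generating the stanza from bytes avoids hand-typing the colons, which is where a mistyped identifier would cause the DHCP server to treat the client as a different host.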
134. dir gt h V OPTIONS i device lt dev name gt Specify the name of the device of the port used to connect to the IB fabric in case of multiple devices on the local system pl port lt port num gt Specify the local device s port number used to connect to the IB fabric pm Dump all pmCounters values into ibdiagnet pm pc Reset all the fabric links pmCounters P counter lt lt PM gt lt Value gt gt Print any provided pm that is greater than its provided value r routing Provide a report of the fabric qualities eu fat tree Indicate that UpDown credit loop checking should be done against automatically determined roots lw 1x 4x 8x 12x Specify the expected link width ls lt 2 5 5 10 gt Specify the expected link speed skip lt ibdiag check gt Skip the execution of the given stage Applicable to the following stages dup guids lids links sm nodes info all default None o output path lt out dir gt Specify the directory where the output files will be placed screen num errs Specify the threshold for printing errors to screen default 5 Placed default var tmp ibdiagnet2 h help Print this help message V version Print the version of the tool Mellanox Technologies 151 Rev 1 5 InfiniBand Fabric Diagnostic Utilities 14 3 2 Output Files Table 7 lists the ibdiagnet output files that are placed under var tmp ibdiagnet2 Table 7 ibdiagnet of ibutils2 Ou
135. down 258 Appendix F mlx4 Module Parameters 259 F.1 mlx4_core Parameters 259 F.2 mlx4_ib Parameters 260 F.3 mlx4_en Parameters 260 F.4 mlx4_fc Parameters 260 Glossary 261 List of Tables Table 1 Typographical Conventions 13 Table 2 Abbreviations and Acronyms 13 Table 3 Reference Documents 15 Table 4 mlnxofedinstall Return Codes 29 Table 5 Supported ConnectX Port Configurations 39 Table 6 Useful MPI Links 97 Table 7 ibdiagnet (of ibutils2) Output Files 152 Table 8 ibdiagnet (of ibutils) Output Files 154 Table 9 ibdiagpath Output Files 156 Table 10 ibv_devinfo Flags and Options 157 Table 11 ibstatus Flags and Options 159 Table 12 ibportstate Flags and Options 162 Table 13 ibportstate Flags and Options 166 Table 14 smpquery Flags and Options 170 Table 15 perfquery Flags and Options 174 Table 16 ibcheckerrs Flags and Options 177 Table 17 mstflint Switches 180 Table 18 mstflint Commands 181 Table 19 ibdump Options 184 Revision History Rev 1.5 (March 29, 2010): Updated Figure 1, Mellanox OFED Stack. Added support for ConnectX-2 devices. Added support for RDMA over Converged Ethernet (RoCE); see Chapter 5, RoCE. Modified Section 7.3.1, How to Know SDP Is Working. Add
136. dules mlnx en host1 cd lib modules uname r updates kernel drivers hostl cp net mlx4 mlx4 core ko tmp initrd en lib modules mlnx en hostl cp net mlx4 mlx4 en ko tmp initrd en lib modules mlnx en Step 5 To load the modules you need the insmod executable If you do not have it in your initrd please add it using the following command host1 cp sbin insmod tmp initrd en sbin Step 6 If you plan to give your Ethernet device a static IP address then copy ifconfig Other wise skip this step hostl cp sbin ifconfig tmp initrd en sbin Step 7 Now you can add the commands for loading the copied modules into the file init Edit the file tmp initrd en init and add the following lines at the point you wish the Ether net driver to be loaded Warning The order of the following commands for loading modules is critical echo loading Mellanox ConnectX EN driver sbin insmod lib modules mlnx en mlx4 core ko sbin insmod lib modules mlnx en mlx4 en ko Step 8 Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN net work interface Step9 Savethe init file Step 10 Close initrd hostl1 cd tmp initrd en host1 find cpio H newc o gt tmp new initrd en img hostl gzip tmp new init en img Mellanox Technologies 231 Rev 1 5 B 9 B 9 1 Step 11 At this stage the modified initrd including the Ethernet driver is ready and located at tmp new init ib img gz Copy it
137. e here. Configure the mlx4_en Ethernet driver to support PFC: Add the following line to the file /etc/modprobe.conf and restart the network driver: options mlx4_en pfctx=0xff pfcrx=0xff 3.4.3.2 Starting FCoE Service: Make sure the network is up: modprobe mlx4_en Then run: > /etc/init.d/mlxfc start vHBAs will be instantiated on DCBX-monitored interfaces, and SCSI LUNs will get mapped. For manual instantiation of vHBAs, please see Section 3.4.4.1, Manual vHBA Control. 3.4.3.3 Stopping FCoE Service: Run: > /etc/init.d/mlxfc stop Note: Only when the mlxfc service is stopped and the mlx4_en module is removed can the mlx4_core module be removed as well. 3.4.4 FCoE Advanced Usage: Advanced usage will probably be needed when connected to FCoE switches that do not support the Cisco-like FCoE DCBX auto-negotiation. 3.4.4.1 Manual vHBA Control: Manual control allows creating and destroying vHBAs and signaling link-up and link-down to existing vHBAs. This is done using sysfs operations. When using the pre-T11 stack, the sysfs directory is located at /sys/class/mlx4_fc. When using the T11 stack, the sysfs directory is located at /sys/module/fcoe. Both directories contain the same entries. In the follo
138. e which is 65536 in the example above. Note that the run example above produced the following results: Throughput is 2.483 gigabits per second. Client CPU utilization is 7.03 percent of client CPU. Server CPU utilization is 5.42 percent of server CPU. Step 5. Run the Netperf Latency test. Run the test once, and stop the server so that it does not repeat the test. The following example shows how to run the Latency test and then stop the Netperf server: host2# netperf -H 11.4.17.6 -t TCP_RR -c -C -- -r 1,1 TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 10.00 19913.18 5.61 6.78 22.549 27.296 16384 87380 The following table describes parameters for the netperf command: Option: Description. -H: Where to find the server. 11.4.17.6: IPoIB IP address. -t: Test Name. Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc. -c: Client CPU utilization. -C: Server CPU utilization. --: Separates the global and test-specific parameters. -r 1,1: The request size sent and how many bytes requested back. Note that the run example above produc
139. [Screenshot: SLES installer dialog stating "All information required for the base installation is now complete. If you continue now, partitions on your hard disk will be formatted (erasing any existing data in those partitions) according to the installation settings in the previous dialogs. Go back and check the settings if you are unsure."] Step 19. At the end of the file copying stage, the Finishing Basic Installation window will pop up and ask for confirming a reboot. You can click OK to skip the countdown. See image below. Note: Assuming that the machine has been correctly configured to boot from FlexBoot via its connection to the iSCSI target, make sure that MLNX IB (for ConnectX family) or gPXE (for InfiniHost III family) has the highest priority in the BIOS boot sequence. [Screenshot: Finishing Basic Installation window stating "The system will reboot now"] Step 20. Once the boot is comp
140. e Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % S % S us/Tr us/Tr 16384 87380 1 1 10.00 37572.83 15.72 23.36 33.469 49.729 16384 87380 The following table describes parameters for the netperf command: Option: Description. -H: Where to find the server. 11.4.17.6: SDP IP address. -t: Test Name. Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc. -c: Client CPU utilization. -C: Server CPU utilization. --: Separates the global and test-specific parameters. -r 1,1: The request size sent and how many bytes requested back. Note that the run example above produced the following results: Client CPU utilization is 15.72 percent of client CPU. Server CPU utilization is 23.36 percent of server CPU. Latency is 13.31 microseconds. Latency is calculated as follows: 0.5 * (1 / Transaction rate per sec) * 1,000,000 = one-way average latency in usec. Step 7. To end the test, shut down the Netperf server: host1# pkill netserver 8.1 Overview: As described in Section 1.4.5, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection t
141. e run the installation script mlnxofedinstall with the flag -n, then IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on. An IPoIB configuration can be based on DHCP (Section 4.3.1) or on a static configuration (Section 4.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3). 4.3.1 IPoIB Configuration Based on DHCP: Setting an IPoIB interface configuration based on DHCP (v3.1.3, which is available via www.isc.org) is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line: For RedHat: BOOTPROTO=dhcp For SLES: BOOTPROTO=dhcp Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under /etc/sysconfig/network-scripts/ on a RedHat machine and /etc/sysconfig/network/ on a SuSE machine. Note: A patch for DHCP is required for supporting IPoIB. The patch file for DHCP v3.1.3, dhcp.patch, is available under the docs directory. Standard DHCP fields hol
142. e triggering an interrupt To query pause frame settings, run: > ethtool -a eth<x> To set pause frame settings, run: > ethtool -A eth<x> [rx on|off] [tx on|off] To query ring size values, run: > ethtool -g eth<x> To modify ring sizes, run: > ethtool -G eth<x> [rx <N>] [tx <N>] To obtain additional device statistics, run: > ethtool -S eth<x> To perform a self-diagnostics test, run: > ethtool -t eth<x> The mlx4_en parameters can be found under /sys/module/mlx4_en (or /sys/module/mlx4_en/parameters, depending on the OS) and can be listed using the command: > modinfo mlx4_en To set non-default values to module parameters, the following line should be added to the file /etc/modprobe.conf: options mlx4_en <param_name>=<value> <param_name>=<value> 3.4 Fibre Channel over Ethernet 3.4.1 Overview: The FCoE feature provided by Mellanox OFED allows connecting to Fibre Channel (FC) targets on an FC fabric using an FCoE-capable switch or gateway. Key features include: T11 and pre-T11 frame format. Complete hardware offload of SCSI operations in pre-T11 format. Hardware offload of FC-CRC calculations in pre-T11 format. Zero-copy FC stack in pre-T11 format. VLANs and PFC (Priority flow control, that is, PPP). The FCoE feature is based on and interacts with the Open-FCoE project. The mlx4_fc module
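The modprobe.conf line described in the mlx4_en snippet above follows a fixed pattern: the word options, the module name, then name=value pairs. The following is a minimal Python sketch that builds such a line. The helper name modprobe_options is hypothetical, and the pfctx and pfcrx parameter names used in the example are taken from the manual's PFC configuration example rather than from any list of parameters verified here.

```python
# Sketch only: formats an "options" line for /etc/modprobe.conf.
# It does not validate parameter names against the driver.

def modprobe_options(module: str, params: dict) -> str:
    """Return 'options <module> <name>=<value> ...' for the given parameters."""
    if not params:
        raise ValueError("at least one parameter is required")
    pairs = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options {module} {pairs}"


if __name__ == "__main__":
    # Parameter names as they appear in the manual's PFC example.
    print(modprobe_options("mlx4_en", {"pfctx": "0xff", "pfcrx": "0xff"}))
```

Running modinfo on the module, as the manual suggests, remains the way to find out which parameter names a given driver build actually accepts.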
143. ect to. 11.7 RDS: RDS uses CMA and thus it is very close to SDP. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hexadecimal digits holding the TCP/IP Port Number that the protocol connects to. The default port number for RDS is 0x48CA, which makes a default Service ID of 0x00000000010648CA. 11.8 SRP: The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service ID in the PR/MPR by itself and uses that information in setting up the QP. The SRP Service ID is defined by the SRP target I/O Controller (it also complies with IBTA Service ID rules). The Service ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs. 11.9 OpenSM Features: The QoS-related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 12) can be split into two main parts: I. Fabric Setup: During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements. II. PR/MPR Query Handling: OpenSM enforces the provided policy on client request. The overall flow for such requests is: first the request is matched against the defined match rules such that the target QoS Level definition is found. Given the QoS Level, a path(s) search is performed with the given restrictions imposed by
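The RDS Service ID rule in the QoS snippet above (a fixed prefix with the 16-bit port number in the low four hexadecimal digits) can be checked with a few lines of Python. This is a minimal sketch of that arithmetic only; the constant and function names are hypothetical and are not taken from any OFED library.

```python
# Sketch of the RDS Service ID layout: 0x000000000106PPPP,
# where PPPP is the 16-bit TCP/IP port number.

RDS_SERVICE_ID_PREFIX = 0x0000000001060000  # hypothetical constant name


def rds_service_id(port: int) -> int:
    """Combine the fixed RDS prefix with a 16-bit port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return RDS_SERVICE_ID_PREFIX | port


if __name__ == "__main__":
    # Default RDS port from the manual: 0x48CA
    print(f"0x{rds_service_id(0x48CA):016X}")
```

For the default port 0x48CA this yields the default Service ID quoted in the manual, 0x00000000010648CA.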
144. ectly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with an SCST driver, it is possible to work with and support a lot of IO modes on real or virtual devices in the backend: 1. scst_vdisk: fileio and blockio modes. This allows turning software RAID volumes, LVM volumes, IDE disks, block devices, and normal files into SRP LUNs. 2. NULLIO mode allows measuring the performance without sending IOs to real devices. E.1 Prerequisites and Installation: 1. For the supported distributions, please see the Mellanox OFED release notes. Note: On distribution default kernels you can run scst_vdisk blockio mode to obtain good performance. 2. Download and install the SCST driver. The supported version is 1.0.1.1. a. Download scst-1.0.1.1.tar.gz from http://scst.sourceforge.net/downloads.html b. Untar scst-1.0.1.1: tar zxvf scst-1.0.1.1.tar.gz; cd scst-1.0.1.1 c. Install scst-1.0.1.1 as follows: make && make install E.2 How to run: A. On an SRP Target machine: 1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_vdisk block or file IO mode, nullio). Note: Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0; it is
145. ed Section 7.7, Using RDMA for Small Buffers. Added support for NFS over RDMA (NFSoRDMA): Chapter 9, NFSoRDMA. Added Section 10.6.2, Important Note on RoCE Support, on page 103 in Chapter 10, MPI. Modified Section 12.2.1, opensm Syntax, on page 111. Added Chapter 13, Adaptive Routing. Added ibdiagnet (of ibutils2) and ibdump to Chapter 14, InfiniBand Fabric Diagnostic Utilities. Appendix B is now called Mellanox FlexBoot instead of BoIB. FlexBoot supports Virtual Protocol Interconnect (VPI). Added Section C.3, System Performance Troubleshooting. Added the parameter setting VIADEV_RENDEZVOUS_THRESHOLD=8192 to Section D.3, MPI Performance Tuning. Rev 1.40.1 (Changes from 1.40), March 19, 2009: Correction to text in Section 4.3, IPoIB Configuration, on page 51. Preface: This Preface provides general information concerning the scope and organization of this User's Manual. It includes the following sections: Intended Audience (page 12), Documentation Conventions (page 13), Related Documentation (page 15), Support and Updates Webpage (page 15). Intended Audience: This manual is intended for system administrators responsible for the installation, configuration, management, and maintenance of the software and hardware of VPI (InfiniBand, Ethernet, FCoE) adapter cards. It is also intended for application developers
146. ed on a VLAN 0 interface with VLAN priority set to the value negotiated with the switch. This takes advantage of PFC, which allows pausing FCoE traffic when needed without pausing the entire Ethernet link. Also, with proper configuration of the FCoE switch, the link's maximum bandwidth can be divided as needed between FCoE and regular Ethernet traffic. Instantiating vHBAs manually allows creating them on VLAN interfaces with any arbitrary VLAN id and priority, as well as on the regular (without VLAN) Ethernet interfaces. Using the regular interface means that PFC cannot be used. In this case, it is highly recommended that both the FCoE switch and the mlx4_en driver be configured to use link pause (regular flow control). Otherwise, any FCoE packet drop will trigger SCSI errors and timeouts. 3.4.3.1 FCoE Configuration: After installation, please edit the file /etc/mlxfc/mlxfc.conf and set the following variables: FC_SPEC: set to T11 or pre-T11 as supported by your FCoE switch. Note: Only pre-T11 format is offloaded in hardware. DCBX_IFS: provide a space-separated list of Ethernet devices to monitor, using the DCBX protocol, for FCoE feature availability. vHBAs are automatically created on these interfaces if the FCoE switch is configured for automatic FCoE negotiation. MTU: if the MTU of the Ethernet device is changed from the default 1500, put the correct valu
147. ed the following results:
• Client CPU utilization is 5.61 percent of client CPU
• Server CPU utilization is 6.79 percent of server CPU
• Latency is 25.11 microseconds. Latency is calculated as follows: 0.5 * (1 / Transaction rate per sec) * 1,000,000 = one-way average latency in usec
Step 6. To end the test, shut down the Netperf server:
host1 pkill netserver
5 RoCE
5.1 Overview
RDMA over Converged Ethernet (RoCE) allows InfiniBand (IB) transport over Ethernet networks. It encapsulates IB transport and GRH headers in Ethernet packets bearing a dedicated ether type. While the use of GRH is optional within IB subnets, it is mandatory when using RoCE. Verbs applications written over IB verbs should work seamlessly, but they require provisioning of GRH information when creating address vectors. The library and driver are modified to provide for mapping from GID to MAC addresses required by the hardware.
5.2 Software Dependencies
In order to use RoCE over Mellanox ConnectX(R) hardware, the mlx4_en driver must be loaded. Please refer to MLNX_EN_README.txt for further details.
5.3 General Guidelines
Since RoCE encapsulates InfiniBand traffic in Ethernet frames, the corresponding net device must be up and running. In case of Mellanox hardware, mlx4_en must be loaded and the corresponding interface configured.
• Make sure that mlx4_en.ko is loaded
• Make
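The latency formula quoted in the Netperf results above can be checked with a few lines of Python. This is a minimal sketch: the function name and the sample transaction rate are illustrative assumptions, not values taken from the manual.

```python
def one_way_latency_usec(transactions_per_sec: float) -> float:
    """One-way average latency in microseconds for a request/response test.

    Implements: 0.5 * (1 / transaction rate per sec) * 1,000,000.
    Each transaction is a full round trip, hence the factor 0.5.
    """
    if transactions_per_sec <= 0:
        raise ValueError("transaction rate must be positive")
    return 0.5 * (1.0 / transactions_per_sec) * 1_000_000


if __name__ == "__main__":
    # Hypothetical transaction rate, chosen only to illustrate the arithmetic.
    rate = 20_000.0
    print(f"{one_way_latency_usec(rate):.2f} usec one-way")
```

A higher transaction rate yields a proportionally lower latency figure, which is why the test is run with 1-byte requests and responses.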
148. ee Section 4.3.1, "IPoIB Configuration Based on DHCP", on page 51.
Configure and start at least one of the services iSCSI Target (see Section A.9) and/or TFTP (see Section A.4).
A.6.2 Starting Boot
Boot the client machine and enter BIOS setup to configure MLNX FlexBoot (for the ConnectX family) or gPXE (for the InfiniHost III family) to be the first on the boot device priority list (see Section A.5).
Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.
If MLNX FlexBoot/gPXE was selected through BIOS setup, the client will boot from FlexBoot. The client will display FlexBoot attributes and sense the port protocol (Ethernet or InfiniBand). In case of an InfiniBand port, the client will also wait for port configuration by the Subnet Manager.
Note: In case sensing the port protocol fails, the port will be configured as an InfiniBand port.
For ConnectX: [boot screen: Mellanox ConnectX FlexBoot v3.0.000, gPXE 0.9.9, Open Source Boot Firmware; link down; Link status: Not connected; waiting for link-up on net0]
For InfiniHost III Ex XE starting Mellanox Boot over IB for InfiniHost via IB Port 2 1g for Infiniband link
149. ee under docs folder of installed package.
MFT Release Notes: Release Notes for the Mellanox Firmware Tools. See under docs folder of installed package.
Support and Updates Webpage
Please visit http://www.mellanox.com > Products > IB/VPI SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.
1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack based on the OpenFabrics (OFED) Linux stack. It operates across all Mellanox network adapter solutions supporting 10, 20 and 40Gb/s InfiniBand (IB); 10Gb/s Ethernet (10GigE); Fibre Channel over Ethernet (FCoE); and 2.5 or 5.0 GT/s PCI Express 2.0 uplinks to servers. All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions.
1.2 Introduction to Mellanox VPI Adapters
Mellanox VPI adapters, which are based on Mellanox ConnectX and ConnectX-2 adapter devices, provide leading server and storage I/O performance with flexibility to support the myriad of communication protocols and network fabrics over a single device, without sacrificing functionality when consolidating I/O. For example, VPI-enabled adapters can support: Connectivity to 10 2
150. eir LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
• Using port names defined in the topology file (tool option -n): This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the -l option.
14.3 ibdiagnet (of ibutils2) - IB Net Diagnostic
Note: This version of ibdiagnet is included in the ibutils2 package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils package, you need to specify the full path: /opt/bin/ibdiagnet
Note: Please see ibutils2_release_notes.txt for additional information and known issues.
ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).
14.3.1 SYNOPSIS
ibdiagnet -i <dev-name> -p <port-num> -pm -pc -P <<PM>=<Value>> -r -u -lw <1x|4x|8x|12x> -ls <2.5|5|10> --skip <ibdiag-stage> -o <out
151. el
Essentially, the above example is equivalent to not having a QoS policy file at all.
The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the following example. They explain different keywords and their meaning.
port-groups
    port-group # using port GUIDs
        name: Storage
        # "use" is just a description that is used for logging.
        # Other than that, it is just a comment.
        use: SRP Targets
        port-guid: 0x10000000000001, 0x10000000000005-0x1000000000FFFA
        port-guid: 0x1000000000FFFF
    end-port-group
    port-group
        name: Virtual Servers
        # The syntax of the port name is as follows:
        #   "node_description/Pnum".
        # node_description is compared to the NodeDescription of the node,
        # and "Pnum" is a port number on that node.
        port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
    end-port-group
    # using partitions defined in the partition policy
    port-group
        name: Partitions
        partition: Part1
        pkey: 0x1234
    end-port-group
    # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
    # or ALL (for all the nodes in the subnet)
    port-group
        name: CAs and SM
        node-type: CA, SELF
    end-port-group
end-port-groups
qos-setup
    # This section of the policy file describes how to set up SL2VL and VL
    # Arbitration tables on various nodes in the fabric.
152. ellanox OFED for Linux is discussed in Chapter 12, "OpenSM - Subnet Manager".
The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources. The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
• Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner
• Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before low priority is served
• Packets carry class of service marking in the range 0 to 15 in their header SL field
• Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL)
• The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries
DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox OFED software stack.
11.2 Qo
153. em [perfquery port counters listing; counter names and values are not recoverable]
3. Read, then reset, performance counters from LID 2, port 1:
> perfquery [arguments not recoverable]
# Port counters: [listing not recoverable]
154. end Send Utilization Service Demand
Socket Socket Message Elapsed Send Recv Send Recv
Size Size Size Time Throughput local remote local remote
bytes bytes bytes secs. 10^6bits/s % S % S us/KB us/KB
87380 16384 65536 10.00 5872.60 19.41 17.12 2.166 [illegible]
Note: You must specify the SDP IP address when running the Netperf client.
The following table describes parameters for the netperf command:
Option | Description
-H | Where to find the server (11.4.17.6 is the SDP IP address)
-t <Test Name> | Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc.
-c | Client CPU utilization
-C | Server CPU utilization
-- | Separates the global and test-specific parameters
-m | Message size, which is 65536 in the example above
Note that the run example above produced the following results:
• Throughput is 5.872 gigabits per second
• Client CPU utilization is 19.41 percent of client CPU
• Server CPU utilization is 17.12 percent of server CPU
Step 6. Run the Netperf Latency test such that you force SDP to be used instead of TCP. Run the test once, and stop the server so that it does not repeat the test.
host2 LD_PRELOAD=libsdp.so LIBSDP_CONFIG_FILE=$HOME/libsdp.conf netperf -H 11.4.17.6 -t TCP_RR -c -C -- -r1,1
TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET
Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Siz
155. end-qos-ulps
It is equivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all.
Below is an example of a simple QoS policy with all the possible keywords:
qos-ulps
    default : 0 # default SL
    sdp, port-num 30000 : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
    sdp, port-num 10000-20000 : 0
    sdp : 1 # default SL for any other application running on top of SDP
    rds : 2 # SL for RDS traffic
    ipoib, pkey 0x0001 : 0 # SL for IPoIB on partition with pkey 0x0001
    ipoib : 4 # default IPoIB partition, pkey=0x7FFF
    any, service-id 0x6234 : 6 # match any PR/MPR query with a specific Service ID
    any, pkey 0x0ABC : 6 # match any PR/MPR query with a specific PKey
    srp, target-port-guid 0x1234 : 5 # SRP when SRP Target is located on a specified IB port GUID
    any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with a specific target port GUID
end-qos-ulps
Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query didn't match any other rule. All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps section
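The matching order described above (first match in order of appearance wins; the default rule applies only when nothing else matches) can be sketched in Python. This is not OpenSM's implementation: the query dictionary, the predicate-based rule representation, and the SL values are illustrative assumptions.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation of a PR/MPR query, e.g. {"ulp": "sdp", "port": 30000}.
Query = Dict[str, object]
# A rule is a (predicate, SL) pair; this layout is an assumption for illustration.
Rule = Tuple[Callable[[Query], bool], int]


def select_sl(query: Query, rules: List[Rule], default_sl: int) -> int:
    """Return the SL of the first matching rule, else the default SL."""
    for matches, sl in rules:
        if matches(query):
            return sl  # first match takes precedence
    return default_sl  # default applies only if no other rule matched


# Illustrative rules, listed in order of appearance.
RULES: List[Rule] = [
    (lambda q: q.get("ulp") == "sdp" and q.get("port") == 30000, 0),
    (lambda q: q.get("ulp") == "sdp", 1),
    (lambda q: q.get("ulp") == "rds", 2),
]

if __name__ == "__main__":
    print(select_sl({"ulp": "sdp", "port": 30000}, RULES, default_sl=0))
    print(select_sl({"ulp": "sdp", "port": 5000}, RULES, default_sl=0))
```

Because evaluation stops at the first hit, a specific rule has to be listed before a more general rule for the same ULP, or the general rule would shadow it.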
156. enib.conf to load the SRP driver and SRP High Availability (HA) daemon automatically, that is, set SRP_LOAD=yes and SRPHA_ENABLE=yes. To set up and use the HA feature, you need the dm-multipath driver and multipath tool. Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable and use the HA feature.
The following is an example of an SRP Target setup file:
*************** srpt.sh ***************
#!/bin/sh
modprobe scst scst_threads=1
modprobe scst_vdisk scst_vdisk_ID=100
echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
modprobe ib_srpt
echo "add mgmt" > /proc/scsi_tgt/trace_level
echo "add mgmt_dbg" > /proc/scsi_tgt/trace_level
echo "add out_of_mem" > /proc/scsi_tgt/trace_level
*************** End srpt.sh ***************
E 3
157. ensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
• OSM_CACHE_DIR: opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it: guid2lid, which stores the LID range assigned to each GUID.
12.2.3 Signaling
When opensm receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found. Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.
12.2.4 Running opensm
The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes. To run opensm in the default mode, simply enter:
host1 opensm
Note that opensm needs to be run on at least one machine in an IB subnet. By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet
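Since both log files are expected to contain the message "SUBNET UP" after a successful bring-up, a simple scripted check is possible. The following Python sketch assumes only that the marker appears verbatim somewhere in the log text; the helper names and the sample log lines are hypothetical and do not reproduce real opensm output.

```python
from pathlib import Path

MARKER = "SUBNET UP"  # message the text above says both logs should include


def subnet_is_up(log_text: str) -> bool:
    """Return True if any line of the given log text contains the marker."""
    return any(MARKER in line for line in log_text.splitlines())


def log_file_reports_subnet_up(path: str) -> bool:
    """Read a log file (tolerating undecodable bytes) and look for the marker."""
    return subnet_is_up(Path(path).read_text(errors="replace"))


if __name__ == "__main__":
    # Made-up sample lines; the real opensm log format is not reproduced here.
    sample = "made-up line: sweep started\nmade-up line: SUBNET UP\n"
    print(subnet_is_up(sample))
```

On a real system the second helper would be pointed at the opensm log file, for example the default log location named above.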
158. epend on the way SRP was loaded. There are three cases:
1. Without High Availability
When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.
2. After Manual Activation of High Availability
If you manually activated SRP High Availability, perform the following steps:
a. Unmount all SRP partitions that were mounted.
b. Kill the SRP daemon instances.
c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
d. Run: multipath -F
3. After Automatic Activation of High Availability
If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown (/etc/init.d/openibd stop), which performs Steps b-d of case 2 above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.
9 NFSoRDMA
9.1 Overview
NFS over RDMA in Mellanox OFED is a binding of NFS v2, v3, v4 on top of the InfiniBand RDMA transport and iWARP.
9.2 Installation
These instructions are a step-by-step guide to building a machine for use with NFS/RDMA.
• Install an RDMA device. Any device supported by the drivers in drivers/infiniband/hw is acceptable. Testing has been performed using several Mellanox-based IB cards, the Ammasso AMS1100 iWARP a
159. erceded with a prefix
B.3.1 Configuring the DHCP Server
When a ConnectX EN PXE client boots, it sends the DHCP server various information, including its DHCP hardware Ethernet address (MAC). The MAC address is 6 bytes long, and it is used to distinguish between the various DHCP sessions.
Extracting the MAC Address - Method I
Run the following commands:
Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
mst start
mst status
The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the MAC address via a query command:
flint -d <MST_DEVICE_NAME> q
Example with ConnectX EN as the network adapter device:
flint -d /dev/mst/mt25448_pci_cr0 q
Image type: ConnectX
FW Version: 2.7.000
Rom Info: type=GPXE version=1.5.5 devid=25448
Device ID: 25448
Chip Revision: A0
Description: Port1 Port2
MACs: 0002c90000bb 0002c90000bc
Board ID: n/a (MT_0920110004)
VSD: n/a
PSID: MT_0920110004
Assuming that ConnectX EN PXE is connected via Port 1, then the MAC address is 00:02:c9:00:00:bb.
Extracting the MAC Address - Method II
The six bytes of MAC address can be captured from the display upon the boot of the ConnectX device session, as shown in the figure below.
Mellanox ConnectX EN PXE v1.5.5 gPXE 0.9.9 Open Source Boot Firmwa
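The flint query above prints each port MAC as twelve contiguous hexadecimal digits, while DHCP server configuration usually expects byte-separated notation. A minimal Python sketch of that conversion follows; the function name is an illustrative assumption, and the input is the Port 1 value from the example output.

```python
def format_mac(raw: str) -> str:
    """Convert 12 contiguous hex digits (as printed by the query) to aa:bb:cc:dd:ee:ff."""
    digits = raw.strip().lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"expected 12 hex digits, got {raw!r}")
    # Split into six 2-digit bytes and join them with colons.
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


if __name__ == "__main__":
    print(format_mac("0002c90000bb"))  # 00:02:c9:00:00:bb
```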
160. erface:
host1 ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier (ib#) argument. The following example shows how to verify the configuration:
host1 ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:11.4.3.175 Bcast:11.4.255.255 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:65520 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).
4.4 Subinterfaces
You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network addresses from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.
This section describes how to:
• Create a subinterface (Section 4.4.1)
• Remove a subinterface (Section 4.4.2)
4.4.1 Creating a Subinterface
To create a child interface (subinterface), follow this procedure:
Note: In the following procedure, ib0 is used as an example of an IB subinterface.
Step 1. Decide on the PKey to be used in the subnet. Valid values are
161. erface>/delete_child
Using the example of Step 2:
echo 0x8000 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8000 in the example above).
Verifying IPoIB Functionality
To verify your configuration and your IPoIB functionality, perform the following steps:
Step 1. Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality. In the following example, IB node 1 is at 11.4.3.175, and IB node 2 is at 11.4.3.176:
host1 ifconfig ib0 11.4.3.175 netmask 255.255.0.0
host2 ifconfig ib0 11.4.3.176 netmask 255.255.0.0
Step 2. Enter the ping command from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command:
host1 ping -c 5 11.4.3.176
PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
--- 11.4.3.176 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 3999ms
rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2
The ib-bonding Driver
The ib-bonding driver is a High Availability solution for IPoIB int
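The note above on deleting a child interface requires the PKey to be written with its most significant bit set. As a minimal illustration in Python (the function name is an assumption, and PKeys are treated as plain 16-bit integers), the value to write can be derived like this:

```python
PKEY_MSB = 0x8000  # most significant bit of a 16-bit PKey


def pkey_with_msb_set(pkey: int) -> int:
    """Return the 16-bit PKey with its most significant bit set."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("PKey must fit in 16 bits")
    return pkey | PKEY_MSB


if __name__ == "__main__":
    print(hex(pkey_with_msb_set(0x0000)))  # 0x8000
    print(hex(pkey_with_msb_set(0x0007)))  # 0x8007
```

A value that already has the bit set is returned unchanged, so the helper is safe to apply to either form.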
162. erfaces. It is based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. The ib-bonding package contains a bonding driver and a utility called ib-bond to manage and control the driver operation. The ib-bonding driver comes with the ib-bonding package (run "rpm -qi ib-bonding" to get the package information).
4.6.1 Using the ib-bonding Driver
The ib-bonding driver can be loaded manually or automatically.
Manual Operation
Use the utility ib-bond to start, query, or stop the driver. For details on this utility, please read the documentation for the ib-bonding package under /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat, and /usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE.
Automatic Operation
There are two ways to configure automatic ib-bonding operation:
1. Using the openibd configuration file, as described in the following steps:
a. Edit the file /etc/infiniband/openib.conf to define bonding parameters. Example:
# Enable the bonding driver on startup
IPOIBBOND_ENABLE=yes
# Set bond interface names
IPOIB_BONDS=bond0,bond8007
# Set specific bond params; address and slaves
bond0_IP=10.10.10.1/24
bond0_SLAVES=ib0,ib1
bond8007_IP=20.10.10.1
bond8007_SLAVES=ib0.8007,ib1.8007
b. Restart the driver by running: /etc/init.d/openibd restart
2. Using a standard OS bonding config
163. ernal tool. Default: all flows except QoS
-w, --wait: This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). Default: 10 sec
-d, --debug: This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:
OPT | Description
-d0 | Ignore other SM nodes
-d1 | Force single threaded dispatching
-d2 | Force log flushing after each log message
-d3 | Disable multicast support
-m, --max_lid: This option specifies the maximal LID number to be searched for during inventory file build. Default: 100
-g, --guid: This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-p, --port: This option displays a menu of possible local port GUID values with which osmtest could bind.
-i, --inventory: This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.
-s, --stress: This option runs the
164. ery destination LID through neighbor switches.
• For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that will perform up after a down step was used.
Once MinHop matrices exist, each switch is visited, and for each target LID a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned ports is selected. If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop
b. First prefer the ones that go to a different systemImageGuid (than the previous LID of the same LMC group)
c. If none, prefer those which go through another NodeGuid
d. Fall back to the number-of-paths method (if all go to the same node)
12.5.1 Effect of Topology Changes
OpenSM will preserve existing routing in any case where there is no change in the fabric switches, unless the -r (--reassign_lids) option is specified.
-r, --reassign_lids: This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments
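The basic port-selection step described above (among ports with the same MinHop to a LID, pick the one with the smallest counter) can be sketched in Python. This is an illustration only, not OpenSM's code: the dictionaries, the function name, and the lowest-port-number tie-break are assumptions, and the additional LMC > 0 checks are not modeled.

```python
from typing import Dict


def pick_port(min_hops: Dict[int, int], assigned: Dict[int, int]) -> int:
    """Choose an output port for one target LID on one switch.

    min_hops maps port number -> hop count to the target LID via that port.
    assigned maps port number -> number of target LIDs already routed through it.
    """
    best = min(min_hops.values())
    # Only ports that reach the target with the minimal hop count are candidates.
    candidates = [port for port, hops in min_hops.items() if hops == best]
    # Prefer the least-loaded candidate; ties go to the lowest port number
    # (the tie-break is an assumption made for determinism).
    port = min(candidates, key=lambda p: (assigned.get(p, 0), p))
    assigned[port] = assigned.get(port, 0) + 1  # update the per-port counter
    return port


if __name__ == "__main__":
    counters: Dict[int, int] = {}
    hops_to_target = {1: 2, 2: 2, 3: 3}  # hypothetical switch with three ports
    print([pick_port(hops_to_target, counters) for _ in range(3)])
```

Repeated calls spread successive target LIDs across the equal-cost ports, which is the load-balancing effect the counters are meant to achieve.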
165. es (see below). Port group lists ports by:
• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to the partition with a given PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to the partition with a given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port)
II) QoS Setup (denoted by qos-setup)
This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).
III) QoS Levels (denoted by qos-levels)
Each QoS Level defines Service Level (SL) and a few optional fields:
• MTU limit
• Rate limit
• PKey
• Packet lifetime
When path(s) search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Level, it can also be explicitly referred to by any match rule.
IV) QoS Matching Rules (denoted by qos-match-rules)
Each PathRecord/MultiPathRecord query that Open
166. es that use the stack. Note that the include files (if needed) are "backported" to your kernel.
• The raw package (un-backported) source files are placed under /usr/src/ofa_kernel-<ver>
• The script openibd is installed under /etc/init.d/. This script can be used to load and unload the software stack.
• The script connectx_port_config is installed under /sbin. This script can be used to configure the ports of ConnectX network adapter cards to Ethernet and/or InfiniBand. For details on this script, please see Section 3.1, "Port Type Management".
• The directory /etc/infiniband is created with the files info, openib.conf, and connectx.conf. The info script can be used to retrieve Mellanox OFED installation information. The openib.conf file contains the list of modules that are loaded when the openibd script is used. The connectx.conf file saves the ConnectX adapter card's ports configuration to Ethernet and/or InfiniBand. This file is used at driver start/restart (/etc/init.d/openibd start).
• The file 90-ib.rules is installed under /etc/udev/rules.d/
• If OpenSM is installed, the daemon opensmd is installed under /etc/init.d/ and opensm.conf is installed under /etc.
• If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
- /etc/sysconfig/network-scripts/ on a RedHat machine
- /etc/sysconfig/network/ on a SuSE machine
• The installation process unlimits the amount of memory that
167. [perfquery counter output truncated]
14.13 ibcheckerrs
Applicable Hardware: All InfiniBand devices
Description: Validates an IB port (or node) and reports errors in counters above threshold. Check specified port (or node) and report errors that surpassed their predefined threshold. Port address is lid unless the -G option is used to specify a GUID address. The predefined thresholds can be dumped using the -s option, and a user-defined threshold_file (using the same format as the dump) can be specified using the -T <file> option.
Synopsis: ibcheckerrs [-h] [-b] [-v] [-G] [-T <threshold_file>] [-s] [-N | -nocolor] [-C ca_name] [-P ca_port] [-t timeout_ms] <lid|guid> [<port>]
Table 16 lists the various flags of the command.
Table 16 - ibcheckerrs Flags and Options
Flag | Optional / Mandatory | Default If Not Specified | Description
-h(help) | Optional | | Print the help menu
-b | Optional | | Print in brief mode. Reduce the output to show only if errors are present, not what they are
-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-T <threshold_file> | Optional | | Use specified threshold file
-s | Optional | | Show the predefined thresholds
-N
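The check that ibcheckerrs is described as performing (report the counters whose values surpass their thresholds) amounts to a per-counter comparison. The Python sketch below illustrates that logic only; it does not reproduce the tool's threshold file format or its defaults, and the counter names and numbers are hypothetical.

```python
from typing import Dict


def counters_over_threshold(counters: Dict[str, int],
                            thresholds: Dict[str, int]) -> Dict[str, int]:
    """Return the counters whose value is strictly above their threshold.

    Counters without a configured threshold are ignored.
    """
    return {
        name: value
        for name, value in counters.items()
        if name in thresholds and value > thresholds[name]
    }


if __name__ == "__main__":
    # Hypothetical counter names and values, for illustration only.
    observed = {"symbol_errors": 12, "link_recovers": 0, "rcv_errors": 3}
    limits = {"symbol_errors": 10, "link_recovers": 10, "rcv_errors": 10}
    print(counters_over_threshold(observed, limits))
```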
168. eview the default configuration of the installation, check the default configuration file: /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf
Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps
10.5 OSU MVAPICH Performance
10.5.1 Requirements
• At least two nodes. Example: host1, host2
• Machine file: Includes the list of machines. Example:
host1 cat /home/<username>/cluster
host1
host2
host1
10.5.2 Bandwidth Test Performance
To run the OSU Bandwidth test, enter:
host1 /usr/mpi/gcc/mvapich-<mvapich-ver>/bin/mpirun_rsh -np 2 -hostfile /home/<username>/cluster /usr/mpi/gcc/mvapich-<mvapich-ver>/tests/osu_benchmarks-<osu-ver>/osu_bw
# OSU MPI Bandwidth Test (v3.0)
# Size Bandwidth (MB/s)
1 4.62
2 8.91
4 17.70
8 32.59
16 60.13
32 113.21
64 194.22
128 293.20
256 549.43
512 883.23
1024 1096.65
2048 1165.60
4096 1233.91
8192 1230.90
16384 1308.92
32768 1414.75
65536 1465.28
131072 1500.36
262144 1515.26
524288 1525.20
1048576 1527.63
2097152 1530.48
4194304 1537.50
10.5.3 Latency Test Performance
To run the OSU Latency test, enter:
host1 /usr/mpi/gcc/mvapich-<mvapich-ver>/bin/mpirun_rsh -np 2 -hostfile /home/<username>/cluster /usr/mpi/gcc/mvapich-<mvapich-ver>
169. exports file and start the NFS/RDMA server. Exports entries with the following formats have been tested:
/vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash)
/vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
The IP address(es) is (are) the client's IPoIB address for an InfiniBand HCA, or the client's iWARP address(es) for an RNIC.
Note: The "insecure" option must be used because the NFS/RDMA client does not use a reserved port.
Each time a machine boots:
• Load and configure the RDMA drivers. For InfiniBand using a Mellanox adapter:
modprobe ib_mthca
modprobe ib_ipoib
ifconfig ib0 a.b.c.d
Note: Use unique addresses for the client and server.
• Start the NFS server. If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in kernel config), load the RDMA transport module:
modprobe svcrdma
Regardless of how the server was built (module or built-in), start the server:
/etc/init.d/nfs start
or
service nfs start
Instruct the server to listen on the RDMA transport:
echo rdma 20049 > /proc/fs/nfsd/portlist
• On the client system: If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in kernel config), load the RDMA client module:
modprobe xprtrdma.ko
Regardless of how the client was built (module or built-in), use this command to mount the NFS/RDMA server:
mount -o rdma,port=20049
170. f the fabric links
• ibdiagnet.pkey: A dump of the existing partitions and their member host ports
• ibdiagnet.mcg: A dump of the multicast groups, their properties and member host ports
• ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option
In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.
After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
• SM report
• Number of nodes and systems
• Hop-count information: maximal hop-count, an example path, and a hop-count histogram
• All CA-to-CA paths traced
• Credit loop report
• mgid-mlid-HCAs multicast group and report
• Partitions report
• IPoIB report
Note: In case the IB fabric includes only one CA, then CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.
14.4.3 ERROR CODES
1 - Failed to fully discover the fabric
171. filename.uid if filename is not specified as a full path; otherwise, write to path/filename.uid. min-level: verbosity level of the log. 9: print errors only. 8: print warnings. 7: print connect and listen summary (useful for tracking SDP usage). 4: print positive match summary (useful for config file debug). 3: print negative match summary (useful for config file debug). 2: print function calls and return values. 1: print debug messages. Examples: To print SDP usage per connect and listen to STDERR, include the following statement: log min-level 7 destination stderr. A non-root user can configure libsdp.so to record function calls and return values in the file /tmp/libsdp.log.<pid> (root log goes to /var/log/libsdp.log for this example) by including the following statement in libsdp.conf: log min-level 2 destination file libsdp.log. To print errors only to syslog, include the following statement: log min-level 9 destination syslog. To print maximum output to the file /tmp/sdp_debug.log.<pid>, include the following statement: log min-level 1 destination file sdp_debug.log. Kernel Space SDP Debug: The SDP kernel module can log detailed trace information if you enable it using the debug_level variable in the sysfs filesystem. The following command performs this: host1# echo 1 > /sys/module/ib_sdp/debug_level. Note: Depending on the o
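The log statements above all share one shape: a minimum level followed by a destination. As a rough sketch only, the Python helper below assembles such a statement and looks up the meaning of the levels listed in the text. The helper names are hypothetical, and treating levels outside the listed set as "undocumented" is an assumption made for this example.

```python
# Illustrative helper (hypothetical names): compose a libsdp.conf "log"
# statement from the levels and destinations described in the text.
LEVEL_MEANING = {
    9: "errors only",
    8: "warnings",
    7: "connect and listen summary",
    4: "positive match summary",
    3: "negative match summary",
    2: "function calls and return values",
    1: "debug messages",
}


def describe_level(min_level):
    """Return the documented meaning of a level, or 'undocumented'."""
    return LEVEL_MEANING.get(min_level, "undocumented")


def log_statement(min_level, destination, filename=None):
    """Return a 'log min-level N destination ...' statement."""
    if destination == "file":
        if not filename:
            raise ValueError("destination 'file' needs a file name")
        return f"log min-level {min_level} destination file {filename}"
    if destination not in ("stderr", "syslog"):
        raise ValueError("destination must be stderr, syslog, or file")
    return f"log min-level {min_level} destination {destination}"


if __name__ == "__main__":
    print(log_statement(7, "stderr"))
    print(log_statement(2, "file", "libsdp.log"))
```

Lower levels are more verbose, so level 1 prints the maximum output and level 9 prints errors only.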
172. forming the steps in Section 6, the OFED distribution provides two utilities, ibsrpdm and srp_daemon, which: detect targets on the fabric reachable by the Initiator (for Step 1); output target attributes in a format suitable for use in the above "echo" command (Step 2). The utilities can be found under /usr/sbin/ and are part of the srptools RPM that may be installed using the Mellanox OFED installation. Detailed information regarding the various options for these utilities is provided by their man pages. Below, several usage scenarios for these utilities are presented. ibsrpdm: ibsrpdm is used for the following tasks: 1. Detecting reachable targets. a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command: ibsrpdm. This command will output information on each SRP Target detected, in human-readable form. Sample output: IO Unit Info: port LID: 0103; port GID: fe800000000000000002c90200402bd5; change ID: 0002; max controllers: 0x10. controller[1]: GUID: 0002c90200402bd4; vendor ID: 0002c9; device ID: 005a44; IO class: 0100; ID: LSI Storage Systems SRP Driver 200400a0b81146a1; service entries: 1; service[0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1. b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command: i
173. found between the client and the iSCSI target. Open a shell to ping the iSCSI target (you can use CTRL-ALT-F2) and verify that the target is or is not accessible. To return to the (graphical) installation screen, press CTRL-ALT-F7. Step 5: The iSCSI Initiator Discovery window will now request authentication to access the iSCSI target. Click Next to continue without authentication unless authentication is required. [Figure: iSCSI Initiator Discovery window with the options No Authentication, Incoming Authentication (Username, Password), and Outgoing Authentication (Username, Password)] Step 6: The iSCSI Initiator Discovery window will show the iSCSI target that got connected to. Note that the Connected column must indicate True for this target. Click Next. [Figure: iSCSI Initiator Discovery window listing the Portal Address, Target Name, and Connected columns; portal address 10.4.3.7:3260]
174. full PartB 0x8002 sl 2 ipoib ALL full 146 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 13 Adaptive Routing 13 1 Overview Adaptive Routing AR enables the switch to select the output port based on the port s load AR supports two routing modes Free AR No constraints on output port selection Bounded AR The switch does not change the output port during the same transmission burst This mode minimizes the appearance of out of order packets 13 2 Running OpenSM With AR Manager To enable AR Manager in OpenSM run f opensm ar AR Manager scans all the fabric switches figures out which switches support AR and configures the AR functionality on these switches Note that if some switches do not support AR they will slow down the AR Manager as it may get timeouts on the AR related queries to these switches To run AR Manager with an AR configuration file enter f opensm ar ar config file path to file gt Currently there are two options in the config file 1 Enable disable AR on fabric switches by including the following line to the AR configuration file nable true false where the default value is true which is also valid for cases when the AR config file is not provided This option is different from the OpenSM command line option ar The former controls AR on fab ric switches while the latter specifies whether AR Manager in OpenSM should be launched
175. g to /var/log/srp_daemon.log. It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 8.2.6). Note: For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart. 8.2.5 Multiple Connections from Initiator IB Port to the Target. Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA. In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command. Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, a different initiator_ext value is needed on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path. If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example: id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,pkey=ffff,service
176. OPTIONS
-c <count>: Min number of packets to be sent across each link (default = 10).
-v: Enable verbose mode.
-r: Provides a report of the fabric qualities.
-t <topo-file>: Specifies the topology file name.
-s <sys-name>: Specifies the local system name. Meaningful only if a topology file is specified.
-i <dev-index>: Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).
-p <port-num>: Specifies the local device's port number used to connect to the IB fabric.
-o <out-dir>: Specifies the directory where the output files will be placed (default = /tmp).
-lw <1x|4x|12x>: Specifies the expected link width.
-ls <2.5|5|10>: Specifies the expected link speed.
-pm: Dump all the fabric links' pm Counters into ibdiagnet.pm.
-pc: Reset all the fabric links' pmCounters.
-P <PM=<Trash>>: If any of the provided pm is greater than its provided value, print it to screen.
-skip <skip-option(s)>: Skip the executions of the selected checks. Skip options (one or more can be specified): dup_guids zero_guids pm logical_state part ipoib all.
-wt <file-name>: Write out the discovered topology into the given file. This flag is useful if you later want to check for changes from the current state of the fabric. A directory named ibdiag_ibnl is also created by this op
177. h is provided in Mellanox OFED for Linux. 3. Expansion ROM Image: The expansion ROM images are provided as part of the SW package and are listed in the release notes file FlexBoot_release_notes.txt. 4. Firmware Burning Tools: You need to install the Mellanox Firmware Tools (MFT) package (version 2.6.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Downloads. Specifically, you will be using the mlxburn tool to create and burn a composite image (from an adapter device's firmware and the PXE ROM image) onto the same Flash device of the adapter. Image Burning Procedure: To burn the composite image, perform the following steps: 1. Obtain the MST device name. Run: # mst start, then # mst status. The device name will be of the form mt<dev_id>_pci{_cr0|conf0}. 2. Create and burn the composite image. Run: mlxburn -d <mst device name> -fw <FW .mlx file> -conf <.ini file> -exp_rom <expansion ROM image>. Example on Linux: mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MHGH28-XTC.ini -exp_rom ConnectX_IB_25418_ROM-X_X_XXX.rom. Example on Windows: mlxburn -dev mt25418_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MHGH28-XTC.ini -exp_rom ConnectX_IB_25418_ROM-X_X_XXX.rom. A.2.3 Preparing the DHCP Server in Linux Environment: The DHCP server plays a major role in the boot process by assigning IP addresses for FlexBoot clients and
178. hca.ko; ipoib_helper.ko (this module is not required for all OS kernels; please check the release notes); ib_ipoib.ko. A.8.1.1 Example: Adding an IB Driver to initrd (Linux). Prerequisites: 1. The FlexBoot image is already programmed on the HCA card. 2. The DHCP server is installed and configured as described in Section 4.3.1, "IPoIB Configuration Based on DHCP", and is connected to the client machine. 3. An initrd file. 4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run. Adding the IB Driver to the initrd File. Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting. Step 1: Back up your current initrd file. Step 2: Make a new working directory and change to it: host1$ mkdir /tmp/initrd_ib; host1$ cd /tmp/initrd_ib. Step 3: Normally, the initrd image is zipped. Extract it using the following command: host1$ gzip -dc <initrd image> | cpio -id. The initrd files should now be found under /tmp/initrd_ib. Step 4: Create a directory for the InfiniBand modules and copy them: host1
179. her service. For an InfiniBand port, Mellanox FlexBoot implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox OFED for Linux software package (see www.mellanox.com > Products > InfiniBand/VPI SW/Drivers). The binary code is exported by the device as an expansion ROM image. A.1.1 Supported Mellanox Adapter Devices and Firmware: The package supports all ConnectX and ConnectX-2 network adapter devices and cards. It also supports the InfiniHost III Ex and InfiniHost III Lx adapter devices and cards. Specifically, adapter products responding to the following PCI Device IDs are supported. ConnectX / ConnectX-2 devices: decimal 25408 (hexadecimal 6340); decimal 25418 (hexadecimal 634a); decimal 26418 (hexadecimal 6732); decimal 26428 (hexadecimal 673c); decimal 26438 (hexadecimal 6746); decimal 26488 (hexadecimal 6778). InfiniHost III Ex devices: decimal 25218 (hexadecimal 6282). InfiniHost III Lx devices: decimal 25204 (hexadecimal 6274). A.1.2 Tested Platforms: See the Mellanox FlexBoot Release Notes (FlexBoot_release_notes.txt). A.1.3 FlexBoot in Mellanox OFED: The FlexBoot binary files are provided as part of Mellanox OFED for Linux. The following binary files are included: 1. A PXE ROM image file for each of the supported Mellanox network adapter devices. Specifically, the following images are in
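The decimal and hexadecimal PCI Device IDs listed above are two spellings of the same numbers, which is easy to confirm mechanically. The sketch below is illustrative only: the table simply restates the IDs from the text, and the names SUPPORTED_DEVICE_IDS and family_for are hypothetical.

```python
# Illustrative lookup (hypothetical names): the supported PCI Device IDs
# from the text, keyed by adapter family.
SUPPORTED_DEVICE_IDS = {
    "ConnectX / ConnectX-2": (25408, 25418, 26418, 26428, 26438, 26488),
    "InfiniHost III Ex": (25218,),
    "InfiniHost III Lx": (25204,),
}


def family_for(device_id):
    """Return the adapter family for a PCI Device ID, or None."""
    if isinstance(device_id, str):
        # lspci reports the ID in hexadecimal, e.g. "634a".
        device_id = int(device_id, 16)
    for family, ids in SUPPORTED_DEVICE_IDS.items():
        if device_id in ids:
            return family
    return None


if __name__ == "__main__":
    for family, ids in SUPPORTED_DEVICE_IDS.items():
        for dev in ids:
            print(f"{family}: decimal {dev} = hexadecimal {dev:04x}")
```

A hexadecimal string such as the one printed by lspci after the 15b3 vendor ID can be passed directly, for example family_for("634a").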
180. hich makes a default Service ID 0x00000000010648CA. The following two match rules are equivalent: "rds : <SL>" and "any, service-id 0x00000000010648CA : <SL>". 12.6.6.4 SRP: Service ID for SRP varies from storage vendor to vendor, thus the SRP query is matched by the target IB port GUID. The following two match rules are equivalent: "srp, target-port-guid 0x1234 : <SL>" and "any, target-port-guid 0x1234 : <SL>". Note that any of the above ULPs might contain a target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port GUID only) should be placed at the end of the qos-ulps match rules. 12.6.6.5 MPI: SL for MPI is manually configured by the MPI admin. OpenSM does not force any SL on the MPI traffic, which is why it is the only ULP that does not appear in the qos-ulps section. 12.6.7 SL2VL Mapping and VL Arbitration: The OpenSM cached options file has a set of QoS-related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are: Max VLs: the maximum number of VLs that will be on the subnet. High limit: the limit of the High Priority component of the VL Arbitration table (IBA 7.6.9). VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template. VLArb high table: High priority VL Arbitration table
181. iSCSI target. Prerequisites: Make sure that your client is connected to the server(s). The ConnectX EN PXE image is programmed on your adapter card. Configure and start the DHCP server as described in Section 4.3.1, "IPoIB Configuration Based on DHCP". Configure and start at least one of the services iSCSI target (see Section B.9) and TFTP (see Section B.4). Booting Procedure: The following steps describe the procedure for booting Windows 2008 from an iSCSI target. Step 1: Install Windows 2008 on a local machine as instructed on the Etherboot Web page: http://etherboot.org/wiki/sanboot/win2k8_physical. Step 2: Install the MLNX EN for Windows driver, which is part of Mellanox OFED for Linux. Step 3: Prepare an image from your installed partition. The following Web location shows how to do this using a Linux USB key: http://etherboot.org/wiki/sanboot/transfer. Step 4: Copy the prepared image to the iSCSI target. Step 5: Make sure that an iSCSI target is installed on your server side. Tip: You can download and install an iSCSI target from the following location: http://sourceforge.net/project/showfiles.php?group_id=108475&package_id=117141. Step 6: Configure your iSCSI target to work with the file copied in Step 4. If, for example, you choose the file name w2k8_boot.img, then edit the iSCSI target configuration file /etc/ietd.conf to
182. ice: interface "ib0" { send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39; } For an InfiniHost III Ex device: interface "ib1" { send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92; } Step 9: Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded. Warning: The order of the following commands (for loading modules) is critical.
echo "loading ipv6"
/sbin/insmod /lib/modules/ipv6.ko
echo "loading IB driver"
/sbin/insmod /lib/modules/ib/ib_addr.ko
/sbin/insmod /lib/modules/ib/ib_core.ko
/sbin/insmod /lib/modules/ib/ib_mad.ko
/sbin/insmod /lib/modules/ib/ib_sa.ko
/sbin/insmod /lib/modules/ib/ib_cm.ko
/sbin/insmod /lib/modules/ib/ib_uverbs.ko
/sbin/insmod /lib/modules/ib/ib_ucm.ko
/sbin/insmod /lib/modules/ib/ib_umad.ko
/sbin/insmod /lib/modules/ib/iw_cm.ko
/sbin/insmod /lib/modules/ib/rdma_cm.ko
/sbin/insmod /lib/modules/ib/rdma_ucm.ko
/sbin/insmod /lib/modules/ib/mlx4_core.ko
/sbin/insmod /lib/modules/ib/mlx4_ib.ko
/sbin/insmod /lib/modules/ib/ib_mthca.ko
Note: The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release notes.
/sbin/insmod /lib/modules/ib/ipoib_helper.ko
/sbin/insmod /lib/modules/ib/ib_i
183. id=200500a0b81146a1,initiator_ext=ed2b400002c90200. Notes: 1. It is recommended to use the -n flag for all srp_daemon invocations. 2. ibsrpdm does not have a corresponding option. 3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRPHA_ENABLE to yes). 8.2.6 High Availability (HA). Overview: High Availability works using the Device-Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offline. Multipath will be executed on newly joined SCSI devices. Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well. Operation: When a path (from a port) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another
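In the sample connection string for Section 8.2.5, the initiator_ext value ed2b400002c90200 appears to be the target port GUID (the low 64 bits of the dgid, 0002c90200402bed) with its bytes in reverse order. This is an observation drawn from that single example, not a rule stated by the manual, and the Python sketch below only reproduces it; the function names are hypothetical.

```python
# Illustrative sketch (hypothetical names): reproduce the relationship
# seen in the sample between the target port GUID and initiator_ext.
def port_guid_from_gid(gid_hex):
    """Return the low 64 bits (last 16 hex digits) of a 128-bit GID."""
    if len(gid_hex) != 32:
        raise ValueError("a GID is 32 hexadecimal digits")
    return gid_hex[-16:].lower()


def byte_reversed(hex16):
    """Reverse the byte order of a 16-hex-digit (8-byte) value."""
    raw = bytes.fromhex(hex16)
    if len(raw) != 8:
        raise ValueError("expected 16 hexadecimal digits")
    return raw[::-1].hex()


if __name__ == "__main__":
    dgid = "fe800000000000000002c90200402bed"
    guid = port_guid_from_gid(dgid)
    print("target port GUID:", guid)
    print("byte-reversed   :", byte_reversed(guid))
```

Whether srp_daemon always derives the value this way is an assumption; the value it actually assigns should be checked in its output.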
184. ids provided in the given file (one to a line).
-o, --once: This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.
-s, --sweep <interval value>: This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.
-t, --timeout <value>: This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
--retries <number>: This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.
--maxsmps <number>: This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying --maxsmps 0 allows unlimited outstanding SMPs. Without --maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.
--console [off | local]: This option activates the OpenSM console (default off).
-i, --ignore-guids <equalize-ignore-guids-file>: This option provides the means to define a set of ports (by node guid and port number) that will be ignored by the link load equalization algorithm.
--hop_weights_file, -w <path to file>: This option provides the means to define a weighting facto
185. igher debug levels (-ddd or -d -d -d).
-a(ll): Optional. Show all LIDs in range, including invalid entries.
-v(erbose): Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-V(ersion): Optional. Show version info.
-n(o_dests): Optional. Do not try to resolve destinations.
-D(irect): Optional. Use directed path address arguments. The path is a comma-separated list of out ports. Examples: "0" (self port); "0,1,2,1,4" (out via port 1, then 2, ...).
-G(uid): Optional. Use GUID address argument. In most cases, it is the Port GUID. Example: "0x08f1040023".
-M(ulticast): Optional. Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range.
-s <smlid>: Optional. Use <smlid> as the target LID for SM/SA queries.
-C <ca_name>: Optional. Use the specified channel adapter or router.
-P <ca_port>: Optional. Use the specified port.
Table 13 ibportstate Flags and Options (columns: Flag; Optional / Mandatory; Default If Not Specified; Description)
-t <timeout_ms>: Optional. Override the default timeout for the solicited MADs [msec].
<dest dr_path | lid | guid>: Optional. Destination's directed path, LID, or GUID.
<startlid>: Optional. Starting L
186. ilable ports and prompt for a port number to attach to. By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly. 12.2.1 opensm Syntax: opensm [OPTIONS], where OPTIONS are:
--version: Prints OpenSM version and exits.
-F, --config <config file>: The name of the OpenSM config file. If not specified, /etc/opensm/opensm.conf will be used (if it exists).
-c, --create-config <file name>: OpenSM will dump its configuration to the specified file and exit. This is one way to generate an OpenSM configuration file template.
-g, --guid <GUID in hexadecimal>: This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-l, --lmc <LMC value>: This option specifies the subnet's LMC value. The number of LIDs as
187. in Section 3.3 and Section 3.3.4. 7.3.1 How to Know SDP Is Working: Since SDP is a transparent TCP replacement, it can sometimes be difficult to know that it is working correctly. To check whether traffic is passing through SDP or TCP, monitor the file /proc/net/sdpstats and see which counters are running. 7.3.1.1 Alternative Method, Using the sdpnetstat Program: The sdpnetstat program can be used to verify both that SDP is loaded and that it is being used. The following command shows all active SDP sockets using the same format as the traditional netstat program. Without the -S option, it shows all the information that netstat does, plus SDP data: host1$ sdpnetstat -S. Assuming that the SDP kernel module is loaded and is being used, the output of the command will be as follows:
host1$ sdpnetstat -S
Proto Recv-Q Send-Q Local Address Foreign Address
sdp 0 0 193.168.10.144:34216 193.168.10.125:12865
sdp 0 884720 193.168.10.144:42724 193.168.10.:filenet-rmi
The example output above shows two active SDP sockets and contains details about the connections. If the SDP kernel module is not loaded, the output of the command will be something like the following:
host1$ sdpnetstat -S
Proto Recv-Q Send-Q Local Address Foreign Address
netstat: no support for "AF INET (tcp)" on this system.
To verify whether the module is loaded or not, you can use the lsmod command:
host1$ lsmod | grep sdp
ib_sdp 125020 0
The exam
188. include the following line under the iSCSI target iqn line: Lun 0 Path=w2k8_boot.img,Type=fileio. Tip: The following is an example of an iSCSI target iqn line: Target iqn.2007-08.7.3.4.10:iscsiboot. B.12 WinPE: Mellanox ConnectX EN PXE enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe. Appendix C: Performance Troubleshooting. C.1 PCI Express Performance Troubleshooting: For the best performance on the PCI Express interface, the adapter card should be installed in an x8 slot with the following BIOS configuration parameters: Max_Read_Req, the maximum read request size, is 512 or higher; MaxPayloadSize, the maximum payload size, is 128 or higher. Note: A Max_Read_Req of 128 and/or installing the card in an x4 slot will significantly limit bandwidth. To obtain the current setting for Max_Read_Req, enter: setpci -d 15b3: 68.w. To obtain the PCI Express slot (link) width and speed, enter: setpci -d 15b3: 72. 1. If the output is neither 81 nor 82, then the card is NOT installed in an x8 PCI Express slot. 2. The least significant digit indicates the link speed: 1 for PCI Express Gen 1 (2.5 GT/s), 2 for PCI Express Gen 2 (5 GT/s). Note: If you are running InfiniBand at QDR (40 Gb/s 4X IB ports), you must run PCI Express Gen 2. C.2 InfiniBand Perf
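The two rules above for reading the setpci output at offset 72 can be restated as a small decoder. The Python sketch below is only an illustration: it takes the least significant hex digit as the link speed, as the text states, and treats the other digit as the link width, which is inferred from the "81 or 82 means x8" rule rather than stated outright. The function names are hypothetical.

```python
# Illustrative decoder (hypothetical names) for the one-byte value that
# the text reads with setpci at offset 72.
SPEED_BY_DIGIT = {
    1: "PCI Express Gen 1 (2.5 GT/s)",
    2: "PCI Express Gen 2 (5 GT/s)",
}


def decode_link(output):
    """Split a hex byte such as '82' into (width, speed description)."""
    value = int(output.strip(), 16) & 0xFF
    width = value >> 4      # high digit: link width (assumed, see lead-in)
    speed = value & 0x0F    # low digit: link speed, as stated in the text
    return width, SPEED_BY_DIGIT.get(speed, "unknown speed")


def is_x8(output):
    """True when the card reports an x8 link (output 81 or 82)."""
    return decode_link(output)[0] == 8


if __name__ == "__main__":
    for sample in ("81", "82", "41"):
        width, speed = decode_link(sample)
        print(f"{sample}: x{width}, {speed}, x8 slot: {is_x8(sample)}")
```

For example, an output of 41 would indicate an x4 link at Gen 1 speed, which the note above describes as significantly limiting bandwidth.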
189. instructing the clients where to boot from. FlexBoot requires that the DHCP server run on a machine which supports IP over IB. A.2.4 Configuring the DHCP Server. A.2.4.1 For ConnectX Family Devices: When a FlexBoot client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. (Footnote 1: Depending on the OS, the device name may be preceded by a prefix.) The value of the client identifier is composed of a prefix, ff:00:00:00:00:00:02:00:00:02:c9:00, and an 8-byte port GUID (all separated by colons and represented in hexadecimal digits). Extracting the Port GUID Method: To obtain the port GUID, run the following commands. Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine. host1# mst start; host1# mst status. The device name will be of the form /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via the following query command: flint -d <MST_DEVICE_NAME> q. Example with ConnectX-2 QDR (MHJH29B-XTR Dual 4X IB QDR Port, PCIe Gen2 x8, Tall Bracket, RoHS-R6 HCA Card, CX4 Connectors) as the adapter device: host1# flint -d /dev/mst/mt26428_pci_cr0 q. Image type: ConnectX. FW Version: 2.7.000. Device ID: 26428. Chip Re
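The client identifier described above is a fixed prefix followed by the 8-byte port GUID, with every byte written as two hexadecimal digits and separated by colons. The following Python sketch shows that composition only; the function name is hypothetical, and the accepted input spellings of the GUID (with or without a leading 0x or colons) are an assumption made for convenience.

```python
# Illustrative sketch (hypothetical names): compose the DHCP client
# identifier for a ConnectX family device from its port GUID.
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def client_identifier(port_guid):
    """Return prefix + ':' + the 8-byte port GUID, colon separated."""
    digits = port_guid.strip().lower().replace(":", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("a port GUID is 16 hexadecimal digits (8 bytes)")
    octets = [digits[i:i + 2] for i in range(0, 16, 2)]
    return CLIENT_ID_PREFIX + ":" + ":".join(octets)


if __name__ == "__main__":
    print(client_identifier("0002c90300001039"))
```

The resulting string is the value used with the dhcp-client-identifier setting on the client side and matched on the DHCP server.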
190. ion over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox OFED for Linux is discussed in Chapter 12, "OpenSM Subnet Manager". 2 Installation: This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed. The chapter includes the following sections: Hardware and Software Requirements; Downloading Mellanox OFED; Installing Mellanox OFED; Uninstalling Mellanox OFED. 2.1 Hardware and Software Requirements. 2.1.1 Hardware Requirements. Platforms: A server platform with an adapter card based on one of the following Mellanox Technologies InfiniBand HCA devices: MT25408 ConnectX-2 (VPI, IB, EN, FCoE) (firmware: fw-ConnectX2); MT25408 ConnectX (VPI, IB, EN, FCoE) (firmware: fw-25408); MT25208 InfiniHost III Ex (firmware: fw-25218 for Mem-Free cards, and fw-25208 for cards with memory); MT25204 InfiniHost III Lx (firmware: fw-25204); MT23108 InfiniHost (firmware: fw-23108). Note: For the list of supported architecture platforms, please refer to the Mellanox OFED Release Notes file. Required Dis
191. itch with Lid 3: > ibroute -M 3. [Sample output: multicast forwarding table of the switch with Lid 3 (MT47396 Infiniscale-III Mellanox Technologies), listing the MLids 0xc000, 0xc001, 0xc002, 0xc003, 0xc020, 0xc021, 0xc022, 0xc023, 0xc024, 0xc040, 0xc041, 0xc042, and ending with "12 valid mlids dumped"] 14.11 smpquery. Applicable Hardware: All InfiniBand devices. Description: Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info. Synopsis: smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]. Table 14 lists the various flags of the command. Table 14 smpquery Flags and Options (columns: Flag; Optional / Mandatory; Default If Not Specified; Description)
-h(elp): Optional. Print the help menu.
-d(ebug): Optional. Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d).
-e(rr_show): Optional. Show send and receive errors (timeouts and others).
-v(erbose): Optional. Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v).
-D(irect): Option
192. k Space for Installation: 400 MB. 2.1.2 Software Requirements. Operating System: Linux operating system. Note: For the list of supported operating system distributions and kernels, please refer to the Mellanox OFED Release Notes file. Installer Privileges: The installation requires administrator privileges on the target machine. 2.2 Downloading Mellanox OFED. Step 1: Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensuring that you can see ConnectX or InfiniHost entries in the display. The following example shows a system with an installed Mellanox HCA: host1# lspci -v | grep Mellanox; 02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0). Step 2: Download the ISO image to your host. The image's name has the format MLNX_OFED_LINUX-<ver>-<OS label>.iso. You can download it from http://www.mellanox.com > Products > IB SW/Drivers. Step 3: Use the md5sum utility to confirm the file integrity of your ISO image. Run the following command and compare the result to the value provided on the download page: host1$ md5sum MLNX_OFED_LINUX-<ver>-<OS label>.iso. 2.3 Installing Mellanox OFED: The installation script, mlnxofedinstall, performs the following: Discovers the currently installed kernel. Uninstalls any software stacks that are part of the standard operating system
193. k with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line: Lun 0 Path=/dev/sda5,Type=fileio. Tip: The following is an example of an iSCSI Target iqn line: Target iqn.2007-08.7.3.4.10:iscsiboot. Step 4: Start your iSCSI Target. Example: host1# /etc/init.d/iscsitarget start. Configuring the DHCP Server to Boot From an iSCSI Target in Linux Environment: Configure DHCP as described in Section B.3.1, "Configuring the DHCP Server". Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI Target: Filename ""; option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn"; The following is an example for configuring an Ethernet device to boot from an iSCSI Target: host host1 { filename ""; hardware ethernet 00:02:c9:00:00:bb; option root-path "iscsi:11.4.3.7::::iqn.2007-08.7.3.4.10:iscsiboot"; } B.10 iSCSI Boot Example of SLES 10 SP2 OS: This section provides an example of installing the SLES 10 SP2 operating system on an iSCSI target and booting from a diskless machine via ConnectX EN PXE. Note that the procedure described below assumes the following: The client's LAN card is recognized during installation. The iSCSI target can
194. l, use one of the following two options: 1. On the command line, specify the system name using the option -s <local system name>. 2. Define the environment variable IBDIAG_SYS_NAME. 14.2.2 IB Interface Definition: The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options: 1. On the command line, specify the port number using the option -p <local port number> (see below). 2. Define the environment variable IBDIAG_PORT_NUM. In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options: 1. On the command line, specify the index of the local device using the following option: -i <index of local device>. 2. Define the environment variable IBDIAG_DEV_IDX. 14.2.3 Addressing. Note: This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports: Using a Directed Route to the destination (tool option -d): This option defines a directed route of output port numbers from the local port to the destination. Using port LIDs (tool option -l): In this mode, the source and destination ports are defined by means of th
195. lanox Technologies s ConnectX VPI cards with PCI Express running at 2 5GT s and InfiniBand ports at DDR or Ethernet ports at 10GigE sbin lspci d 15b3 634a 04 00 0 InfiniBand Mellanox Technologies MT25418 ConnectX IB DDR PCIe 2 0 2 5GT s rev a0 In the example above 15b3 is Mellanox Technologies s vendor number in hexadecimal and 634a is the device s PCI Device ID in hexadecimal The number string 04 00 0 identifies the device in the form bus dev fn Note The PCI Device IDs of Mellanox Technologies devices can be obtained from the PCI ID Repository Website at http pci ids ucw cz read PC 15b3 2 Verify the ConnectX firmware using its ID using the results of the example above 182 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 gt SKE Lake b 04300 0 w ConnectX failsafe image Start address 80000 Chunk size 80000 NOTE The addresses below are contiguous logical addresses Physical addresses on flash may be different based on the image start address and chunk size 0x00000038 0x000010db 0x0010a4 BOOT2 OK 0x000010dc 0x00004947 0x00386c BOOT2 OK 0x00004948 0x000052c7 0x000980 Configuration OK 0x000052c8 0x0000530b 0x000044 GUID OK 0x0000530c 0x0000542f 0x000124 Image Info OK 0x00005430 0x0000634f 0x000 20 DDR OK 0x00006350 0x0000 29b 0x008f4c DDR OK 0x0000 29c 0x0004749b 0x038200 DDR OK 0x000474
196. lete, the Startup Options window will pop up. Select SUSE Linux Enterprise Server 10 SP2, then press Enter.
[Screenshot: Startup Options menu]
Step 21. The Hostname and Domain Name window will pop up. Continue configuring your machine until the operating system is up; you can then start running the machine in normal operation mode.
Step 22. (Optional) If you wish to have the second instance of connecting to the iSCSI Target go through the IB driver, copy the initrd file under /boot to a new location, add the IB driver into it after the load commands of the iSCSI Initiator modules, and continue as described in Section A.7 on page 196.
Warning: Take extra care when changing initrd, as any mistake may prevent the client machine from booting. It is recommended to have a back-up iSCSI Initiator on a machine other than the client you are working with, to allow for debug in case initrd gets corrupted.
In addition, edit the init file that is in the initrd zip and look for the following string:
    if [ "$iSCSI_TARGET_IPADDR" ] ; then
        iscsiserver="$iSCSI_TARGET_IPADDR" ;
    fi
Now add before the string the following line:
    iSCSI_TARGET_IPADDR=<IB IP Address of iSCSI Target>
Example:
    iSCSI_TARGET_IPADDR=11.4.3.7
A.10 WinPE
Mellanox FlexBoot enables WinPE boot via TFTP. For instructions
197. llation Procedure Step 1 Login to the installation machine as root Step 2 Mount the ISO image on your machine hostlf mount o ro loop MLNX OFED LINUX ver OS label gt iso mnt Note After mounting the ISO image mnt will be a Read Only folder Step3 Run the installation script hostl mnt mlnxofedinstall This program will install the MLNX OFED LINUX package on your machine Note that all other Mellanox OEM OFED or Distribution IB packages will be removed Do you want to continue y N y Uninstalling the previous version of OFED Starting MLNX OFED LINUX 1 5 1 rc6 installation Installing kernel ib RPM oe Preparing 00 1 kernel ib 00 oe Installing kernel ib devel RPM oe Preparing 00 oe l kernel ib devel 00 Installing mft RPM Preparing 00 1 mft 00 Installing mpi selector RPM Preparing 00 1 mpi selector 00 Install user level RPMs Preparing 00 libibumad 00 Preparing 00 libibumad 00 Preparing 00 libibmad 00 Preparing 00 libibmad 00 Preparing 00 libibumad devel 00 Preparing 00 libibumad devel 00 Preparing 00 libibmad devel 00 Preparing 00 libibmad devel
198. log maximum number of memory protection table entries per HCA (default is 17; max is 20)
    log_num_mtt          log maximum number of memory translation table segments per HCA (default is 20; max is 20)
    log_mtts_per_seg     log number of MTT entries per segment (1-5) (int)
    enable_qos           Enable Quality of Service support in the HCA if > 0 (default 0)
    enable_pre_t11_mode  For FCoXX, enable pre-t11 mode if non-zero (default 0)
    internal_err_reset   Reset device on internal errors if non-zero (default 1)
F.2 mlx4_ib Parameters
    debug_level          Enable debug tracing if > 0 (default 0)
F.3 mlx4_en Parameters
    inline_thold         Threshold for using inline data (default is 128)
    tcp_rss              Enable RSS for incoming TCP traffic (default 1, enabled)
    udp_rss              Enable RSS for incoming UDP traffic (default 1, enabled)
    num_lro              Number of LRO sessions per ring, or disabled (0) (default is 32)
    ip_reasm             Allow the assembly of fragmented IP packets (default 1, enabled)
    pfctx                Priority based Flow Control policy on TX[7:0]. Per priority bit mask (default is 0)
    pfcrx                Priority based Flow Control policy on RX[7:0]. Per priority bit mask (default is 0)
F.4 mlx4_fc Parameters
    log_exch_per_vhba    Max outstanding FC exchanges per virtual HBA (log). Default = 9 (int)
    max_vhba_per_port    Max vHBAs allowed per port. Default = 2 (int)
199. m verbosity level and forces log flushing. The -V is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.
-vf
    This option sets the log verbosity level. A flags field must follow the -vf option. A bit set/clear in the flags enables/disables a specific log level as follows:
    BIT     LOG LEVEL ENABLED
    0x01    ERROR (error messages)
    0x02    INFO (basic messages, low volume)
    0x04    VERBOSE (interesting stuff, moderate volume)
    0x08    DEBUG (diagnostic, high volume)
    0x10    FUNCS (function entry/exit, very high volume)
    0x20    FRAMES (dumps all SMP and GMP frames)
    0x40    ROUTING (dump FDB routing information)
    0x80    currently unused
    Without -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
-h, --help
    Display this usage info, then exit.
12.3.2 Running osmtest
To run osmtest in the default mode, simply enter:
    host1# osmtest
The default mode runs all the flows except for the Quality of Service flow (see Section 12.6).
After installing opensm, and if the InfiniBand fabric is stable, it is recommended to run the following command in order to generate the inventory file:
    host1# osmtest -f c
Immediately afterwards ru
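The verbosity flags field described in this snippet is a plain bit mask, so a value such as 0x3 can be decoded by testing each bit in turn. The following Python sketch illustrates this using only the bit assignments listed in the text; the names LOG_LEVELS, decode_verbosity and encode_verbosity are hypothetical helpers and are not part of osmtest.

```python
# Bit assignments as listed for the osmtest verbosity flags (0x80 is unused).
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_verbosity(flags):
    """Return the names of the log levels enabled by a flags value."""
    return [name for bit, name in LOG_LEVELS.items() if flags & bit]


def encode_verbosity(*names):
    """Return the flags value that enables the given log levels."""
    by_name = {name: bit for bit, name in LOG_LEVELS.items()}
    value = 0
    for name in names:
        value |= by_name[name]
    return value


if __name__ == "__main__":
    print(decode_verbosity(0x3))                    # the documented default
    print(hex(encode_verbosity("ERROR", "DEBUG")))
```

A value built this way would be passed as the flags field of the verbosity option.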
200. mkdir -p /tmp/initrd_ib/lib/modules/ib
    host1# cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1# cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/ulp/ipoib/ipoib_helper.ko /tmp/initrd_ib/lib/modules/ib
    host1# cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib
IB requires loading an IPv6 module. If you do not have it in your initrd, please add it using the following command:
    host1# cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules
To load the modules, you need the insmod executable. If you do no
201. n order to burn the PXE ROM image MFT is part of the Mellanox OFED for Linux package Burning the Image To burn the composite image perform the following steps 1 Obtain the MST device name Run f mst start mst status The device name will be of the form mt dev id pci cr0 conf0 si 2 Create and burn the composite image Run flint dev lt mst device name gt brom lt expansion ROM image gt Example on Linux flint dev dev mst mt25448 pci cr0 brom ConnectX EN 25448 ROM X X XXX rom Example on Windows flint dev mt25448 pci cr0 brom ConnectX EN 25448 ROM X X XXX rom 1 Depending on the OS the device name may be superseded with a prefix 224 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 B 2 2 Updating the Image on ConnectX EN Devices with Legacy Firmware Note This section is applicable only to ConnectX EN devices with firmware versions earlier than 2 7 000 Prerequisites 1 Firmware mlx and ini files included in the Mellanox OFED for Linux package 2 Expansion ROM Image The expansion ROM images are provided as part of the SW package and are listed in the release notes file ConnectX EN PXE release notes txt W Firmware Burning Tools The Mellanox Firmware Tools MFT package version 2 6 0 or later should be installed on your machine in order to burn the PXE ROM image MFT is part of the Mellanox OFED for Linux package Specifically
202. n the following command to test opensm:
    host1# osmtest -f a
Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.
12.4 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the '--Pconfig' or '-P' flags.
The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed. The default partition has a P_Key value of 0x7fff. The port out of which OpenSM runs is assigned full membership in the default partition. All other end ports are assigned partial membership.
12.4.1 File Format
Notes:
• Line content followed after the '#' character is a comment and is ignored by the parser.
General File Format:
    <Partition Definition>:<PortGUIDs list> ;
Partition Definition:
    [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where
    PartitionName          string, will be used with logging. When omitted, an empty string will be used.
    PKey                   P_Key value for this partition. Only the low 15 bits will be used. When omitted, P_Key will be autogenerated.
    flag                   used to indicate IPoIB capability of this partition.
    defmember=full|limited specifies default membership for port guid list. Default i
203. n transparent Conversion
Use explicit conversion if you need to maintain full control from your application while using SDP. To configure an explicit conversion to use SDP, simply recompile the application, replacing PF_INET (or AF_INET) with PF_INET_SDP (or AF_INET_SDP) when calling the socket() system call in the source code. The value of AF_INET_SDP is defined in the file sdp_socket.h, or you can define it inline:
    #define AF_INET_SDP 27
    #define PF_INET_SDP AF_INET_SDP
You can compile and execute the following very simple TCP application that has been converted explicitly to SDP.
Compilation:
    gcc sdp_server.c -o sdp_server
    gcc sdp_client.c -o sdp_client
Usage:
    Server: host1# sdp_server
    Client: host1# sdp_client <server IP addr>
Example:
    Server:
    host1# sdp_server
    accepted connection from 15.2.2.42:48710
    read 2048 bytes
    end of test
    host1#
    Client:
    host2# sdp_client 15.2.2.43
    connected to 15.2.2.43:22222
    sent 2048 bytes
    host2#
sdp_client.c Code
    /* usage: sdp_client <ip_addr> */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #define DEF_PORT 22222
    #define AF_INET_SDP 27
    #define PF_INET_SDP AF_INET_SDP
204. nclude arpa inet h include lt sys epoll h gt include lt errno h gt include lt assert h gt define RXBUFSZ 2048 uint8 t rx buffer RXBUFSZ define DI na nj PORT 22222 define AF INET SDP 27 define PF INET SDP AF INET SDP int main int argc char argv T int sd socket PF INET SDP SOCK STREAM 0 if sd lt 0 A perror socket failed exit EXIT FAILURE struct sockaddr in my addr sin family AF INET sin port htons DEF PORT sin addr s addr INADDR ANY Mellanox Technologies 79 Rev 1 5 SDP int retbind bind sd struct sockaddr amp my addr sizeof my addr if retbind 0 perror bind failed exit EXIT FAILURE E int retlisten listen sd 5 backlog if retlisten lt 0 perror listen failed exit EXIT FAILURE E accept the client connection struct sockaddr in client addr socklen t client addr len sizeof client addr int cd accept sd struct sockaddr amp client addr amp client addr len if cd lt 0 perror accept failed exit EXIT FAILURE E printf accepted connection from s u n inet ntoa client addr sin addr ntohs client addr sin port ssize t nr read cd rx buffer RXBUFSZ if nr 0 perror read failed exit EXIT FAILURE else if nr 0 printf
205. nfiniBand specific functions and plugs into the InfiniBand mid layer.
3.3 Ethernet Driver
3.3.1 Overview
The Ethernet driver mlx4_en exposes the following ConnectX/ConnectX-2 capabilities:
• Single/Dual port
• Fibre Channel over Ethernet (FCoE)
• Up to 16 Rx queues per port
• 5 Tx queues per port
• Rx steering mode: Receive Core Affinity (RCA)
• Tx arbitration mode: VLAN user-priority (off by default)
• MSI-X or INTx
• Adaptive interrupt moderation
• HW Tx/Rx checksum calculation
• Large Send Offload (i.e., TCP Segmentation Offload)
• Large Receive Offload
• IP Reassembly Offload
• Multi-core NAPI support
• VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
• HW VLAN filtering
• HW multicast filtering
• ifconfig up/down + mtu changes (up to 10K)
• Ethtool support
• Net device statistics
• CX4, QSFP and SFP+ connectors
3.3.2 Loading the Ethernet Driver
By default, the Mellanox OFED stack loads mlx4_en. Run 'ifconfig -a' to verify that the module is listed.
3.3.3 Unloading the Driver
If /etc/infiniband/openib.conf had MLX4_EN_LOAD=yes at driver start-up, then you can unload the mlx4_en driver by running:
    /etc/init.d/openibd stop
Otherwise, unload mlx4_en by running:
    > modprobe -r mlx4_en
3.3.4 Ethernet Driver Usage and Configuration
• To assign an IP address to the interface, run:
    > ifconfig eth<n> <ip
206. nitiator Discovery window will pop up. Enter the IP Address of your iSCSI target and click Next.
[Screenshot: iSCSI Initiator Discovery window with IP Address and Port fields and authentication options]
Step 4. Details of the discovered iSCSI target(s) will be displayed in the iSCSI Initiator Discovery window. Select the target that you wish to connect to and click Connect.
[Screenshot: iSCSI Initiator Discovery window listing Portal Address 10.4.3.7:3260,1, Target Name iqn.2007-08.7.3.4.10:iscsiboot, Connected False]
Tip: If no iSCSI target was recognized, then either the target was not properly installed or no connection
207. not required to have the lun numbers in ascending order except that the first lun must always be 0 Note Setting SRPT LOAD yes in etc infiniband openib conf is not enough as it only loads the ib srpt module but does not load scst not its dev handlers Note The scst disk module pass thru mode of SCST is not supported by Mellanox OFED Mellanox Technologies 255 Rev 1 5 Example 1 Working with VDISK BLOCKIO mode Using the md0 device sda and cciss c1d0 a modprobe scst b modprobe scst_vdisk echo open vdisk0 dev md0 BLOCKIO gt proc scsi_tgt vdisk vdisk echo open vdisk1 dev sda BLOCKIO gt proc scsi_tgt vdisk vdisk echo open vdisk2 dev cciss c1d0 BLOCKIO gt proc scsi_tgt vdisk vdisk echo add vdisk0 0 gt proc scsi_tgt groups Default devices a 0 echo add vdisk1 1 gt proc scsi_tgt groups Default devices ge o echo add vdisk2 2 gt proc scsi_tgt groups Default devices Example 2 working with scst_vdisk FILEIO mode Using md0 device and file 10G file a modprobe scst c modprobe scst vdisk echo open vdisk0 dev md0 gt proc scsi tgt vdisk vdisk echo open vdisk1 10G file gt proc sesi tgt vdisk vdisk a AQ echo add vdisk0 0 gt proc scsi_tgt groups Default devices mh 0 echo add vdisk1 1 gt proc scsi_tgt groups Default devices 2 Run For all distributions except SLES 11 gt modprobe ib srpt For SLES 11 modprobe f ib sr
208. nox OFED for Linux User s Manual Rev 1 5 Examples 1 Query the status of Port 1 of CA mlx4 0 using ibstatus and use its output the LID 3 in this case to obtain additional link information using ibportstate gt ibstatus mlx4 0 1 Iam oense Clewalers inl 0 port sco default gid fe80 0000 0000 0000 0000 0000 9289 3895 base lid 0x3 sum irel 0x3 Sene 2g JUNE phys state 5 LinkUp rate 20 Gb sec 4X DDR gt dlgoormestato C mb 0 gt 1 query POeIe IgE S Posi nes Jg S port dl F NOS ES EKS an Se E e e E me ede Initialize Sa SHAUN SESS RR Oo ra dyo hosrosa LinkUp ILA LCICEIMSTIOVONSIA CANCE iddio IDK He AS TRN cia EMA DUCE o 0 6 oo aos ooo 1X or 4X ILI PALCHEIVNCIELWOS 4 so esa ds0o8399 4X ua GSIOKSSCISISIOOIMEDCIS Go cdsctadcuouds 2 45 Glos gue OMG OOS isse DS CGL EPI OMG CNS ers ep MEME 255 Gioia ie 5 0 Gods PUOI 44 a eos dow oun on 5 0 Gloyos Mellanox Technologies 163 Rev 1 5 InfiniBand Fabric Diagnostic Utilities 2 Query the status of two channel adapters using directed paths gt illjsormesicece C mils 0 p Oil POENOS Pere tarcos WR paci Slicl 055957 cllicl 655353 0 port dl Mko e ASCELLE Initialize Physics ale CRETE Ne LinkUp Tinta istelitim SUPPORTE e 1X or 4X Ts mew elite brasil cle pe RCM MM ME 1X or 4X TRAKW KIERA CEUV E a doo rano 4X PINKS pecdSUpporte dia dy os g bow a 2 GO TUS oM OMIT TS PINKS pecdEna eA ARE Z 9 Gope Oi 5 0 Eos IU GUSH KSCICVAC
209. nox Technologies 13 Rev 1 5 Table 2 Abbreviations and Acronyms Sheet 2 of 2 Abbreviation Acronym Whole Word Description FW Firmware HCA Host Channel Adapter HW Hardware IB InfiniBand LSB Least significant byte Isb Least significant bit MSB Most significant byte msb Most significant bit NIC Network Interface Card SW Software VPI Virtual Protocol Interconnect 14 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 Related Documentation Table 3 Reference Documents Document Name Description InfiniBand Architecture Specification Vol 1 Release 1 2 1 The InfiniBand Architecture Specification that is provided by IBTA IEEE Std 802 3aeTM 2002 Amendment to IEEE Std 802 3 2002 Document PDF SS94996 Part 3 Carrier Sense Multiple Access with Collision Detec tion CSMA CD Access Method and Physical Layer Spec ifications Amendment Media Access Control MAC Parameters Physical Layers and Management Parameters for 10 Gb s Operation Fibre Channel BackBone 5 standard for Fibre Channel over Ethernet Document INCITS xxx 200x Fibre Channel Backbone http www t11 org draft Firmware Release Notes for Mellanox adapter devices See the Release Notes PDF file relevant to your adapter device under docs folder of installed package MFT User s Manual Mellanox Firmware Tools User s Manual S
210. o add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.
Adding the Ethernet Driver to the initrd File
Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.
Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it:
    host1# mkdir /tmp/initrd_en
    host1# cd /tmp/initrd_en
Step 3. Normally, the initrd image is zipped. Extract it using the following command:
    host1# gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_en.
Step 4. Create a directory for the ConnectX EN modules and copy them:
    host1# mkdir -p /tmp/initrd_en/lib/modules/mlnx_en
    host1# cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1# cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
    host1# cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en
Step 5. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
    host1# cp /sbin/insmod /tmp/initrd_en/sbin/
Step 6. If you plan to give your Ethernet device a static IP address, then copy ifconfig. Other wi
211. o an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services.
Section 8.2 describes the SRP Initiator included in Mellanox OFED for Linux. This package, however, does not include an SRP Target.
8.2 SRP Initiator
This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from http://www.t10.org.
The SRP Initiator supports:
• Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
• Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
• Basic functionality, task management and limited error handling
8.2.1 Loading SRP Initiator
To load the SRP module, either execute the "modprobe ib_srp" command after the OFED driver is up, or change the value of SRP_LOAD in /etc/infiniband/openib.conf to "yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
Note: When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).
8.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually establish an SRP connection between the Initiator and an SRP Target. Section 8.2.4 explains how to do this au
212. o ensure that the kernel is working correctly. In particular, it is a good idea to verify that the RDMA stack is functioning as expected and that standard NFS over TCP/IP and/or UDP/IP is working properly.
Check RDMA Setup
If you built the RDMA components as modules, load them at this time. For example, if you are using a Mellanox InfiniHost / InfiniHost III Ex / InfiniHost III Lx card:
    $ modprobe ib_mthca
    $ modprobe ib_ipoib
If you are using InfiniBand, make sure there is a Subnet Manager (SM) running on the network. If your IB switch has an embedded SM, you can use it. Otherwise, you will need to run an SM, such as OpenSM, on one of your end nodes.
If an SM is running on your network, you should see the following:
    $ cat /sys/class/infiniband/driverX/ports/1/state
    4: ACTIVE
where driverX is mthca0, ipath5, ehca3, etc.
To further test the InfiniBand software stack, use IPoIB (this assumes you have two IB hosts named host1 and host2):
    host1$ ifconfig ib0 a.b.c.x
    host2$ ifconfig ib0 a.b.c.y
    host1$ ping a.b.c.y
    host2$ ping a.b.c.x
For other device types, follow the appropriate procedures.
Check NFS Setup
For the NFS components enabled above (client and/or server), test their functionality over standard Ethernet using TCP/IP or UDP/IP.
9.4 NFS-RDMA Setup
We recommend that you use two machines, one to act as the client and one to act as the server.
One time configuration:
• On the server system, configure the /etc
213. ocal address is formed in the following way:
    gid[0..7] = fe80000000000000
    gid[8]    = mac[0] ^ 2
    gid[9]    = mac[1]
    gid[10]   = mac[2]
    gid[11]   = ff
    gid[12]   = fe
    gid[13]   = mac[3]
    gid[14]   = mac[4]
    gid[15]   = mac[5]
If VLAN is supported by the kernel, and there are VLAN interfaces on the main Ethernet interface (the interface that the IB port is tied to), then each such VLAN will appear as a new GID in the port's GID table. The format of the GID entry will be identical to the one described above, except for the following change:
    gid[11] = VLAN ID high byte (4 MS bits)
    gid[12] = VLAN ID low byte
Please note that VLAN ID is 12 bits wide.
5.5.1 Priority Pause Frames
Tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived from the IB SL field by taking the 3 least significant bits of the SL field.
5.6 Using VLANs
In order for RoCE traffic to use VLAN tagged frames, the user needs to specify GID table entries that are derived from VLAN devices when creating address vectors. Consider the example below:
• Make sure VLAN support is enabled by the kernel. Usually this requires loading the 8021q module:
    > modprobe 8021q
• Add a VLAN device:
    > vconfig add eth2 7
• Assign an IP address to the VLAN interface. This should create a new entry in the GID table (as index 1):
    > ifconfig eth2.7 7.10.11.12
• Verbs test: On
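The byte layout of the RoCE GID and the priority derivation described in this snippet can be checked with a few lines of code. The sketch below is an illustration in Python, assuming the MAC is given as six colon-separated hex octets; the function names roce_gid and priority_from_sl are hypothetical and the MAC value is only an example.

```python
def roce_gid(mac, vlan_id=None):
    """Derive the 16-byte GID for a MAC address, optionally for a VLAN."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("MAC address must have six octets")
    gid = bytearray(16)
    gid[0:8] = bytes.fromhex("fe80000000000000")  # link-local prefix
    gid[8] = octets[0] ^ 2                        # flip bit 1 of the first octet
    gid[9] = octets[1]
    gid[10] = octets[2]
    gid[11] = 0xFF
    gid[12] = 0xFE
    gid[13:16] = octets[3:6]
    if vlan_id is not None:
        if not 0 <= vlan_id < 4096:
            raise ValueError("VLAN ID is 12 bits wide")
        gid[11] = (vlan_id >> 8) & 0x0F           # high byte holds the 4 MS bits
        gid[12] = vlan_id & 0xFF                  # low byte
    return ":".join(gid[i:i + 2].hex() for i in range(0, 16, 2))


def priority_from_sl(sl):
    """Ethernet priority is the 3 least significant bits of the IB SL."""
    return sl & 0x7


if __name__ == "__main__":
    print(roce_gid("00:02:c9:08:e8:11"))
    print(roce_gid("00:02:c9:08:e8:11", vlan_id=7))
    print(priority_from_sl(10))
```

Comparing the printed values against the entries under the port's gids directory in sysfs is one way to confirm which GID index belongs to which VLAN device.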
214. Failed to parse command line options
Failed to interact with IB fabric
Failed to use local device or local port
Failed to use Topology File
Failed to load required Package
14.5 ibdiagpath - IB diagnostic path
ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path.
The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used ('-d' flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted, as it may change.
ibdiagpath should not be supplied with contradicting local ports by the '-p' and '-d' flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options '-p' and '-d' together, the first port in the direct route must be equal to
215. on B.8.1 for an example: mlx4_core.ko, mlx4_en.ko
B.8.1 Example: Adding an Ethernet Driver to initrd (Linux)
Prerequisites
1. The ConnectX EN PXE image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.1, "IPoIB Configuration Based on DHCP", and connected to the client machine.
3. An initrd file.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.
Adding the Ethernet Driver to the initrd File
Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.
Step 1. Back up your current initrd file.
Step 2. Make a new working directory and change to it:
    host1# mkdir /tmp/initrd_en
    host1# cd /tmp/initrd_en
Step 3. Normally, the initrd image is zipped. Extract it using the following command:
    host1# gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_en.
Step 4. Create a directory for the ConnectX EN modules and copy them:
    host1# mkdir -p /tmp/initrd_en/lib/mo
216. on preparing a WinPE image, please see http://etherboot.org/wiki/winpe.
Appendix B: ConnectX EN PXE
B.1 Overview
This appendix describes "Mellanox ConnectX EN PXE", the software for Boot over Mellanox Technologies network adapter devices supporting Ethernet. Mellanox ConnectX EN PXE enables booting kernels or operating systems (OSes) from remote servers in compliance with the PXE specification.
Mellanox ConnectX EN PXE is based on the open source project Etherboot/gPXE, available at http://www.etherboot.org.
Mellanox ConnectX EN PXE first initializes the network adapter device. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs Mellanox ConnectX EN PXE to access the kernel/OS through a TFTP server, an iSCSI target, or other service.
The binary code is exported by the device as an expansion ROM image.
B.1.1 Supported Mellanox Network Adapter Devices and Firmware
The package supports all ConnectX and ConnectX-2 Network Adapter family devices and cards. Specifically, adapter products responding to the following PCI Device IDs are supported:
• Decimal 25408 (Hexadecimal: 6340)
• Decimal 25418 (Hexadecimal: 634a)
• Decimal 26418 (Hexadecimal: 6732)
• Decimal 26428 (Hexadecimal: 673c)
• Decimal
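The decimal and hexadecimal forms of the PCI Device IDs listed in this snippet are the same numbers in two notations, and tools such as lspci expect the hexadecimal form in a vendor:device filter. The Python sketch below converts between them for the IDs that are visible in the text (the list there is truncated, so this table is not complete); the names LISTED_DEVICE_IDS and lspci_filter are hypothetical.

```python
MELLANOX_VENDOR_ID = 0x15B3  # Mellanox Technologies vendor number

# Only the device IDs visible in the text above; the source list is truncated.
LISTED_DEVICE_IDS = [25408, 25418, 26418, 26428]


def to_hex_id(device_id):
    """Format a decimal PCI device ID as four lowercase hex digits."""
    return f"{device_id:04x}"


def lspci_filter(device_id, vendor_id=MELLANOX_VENDOR_ID):
    """Build the vendor:device string accepted by 'lspci -d'."""
    return f"{vendor_id:04x}:{device_id:04x}"


if __name__ == "__main__":
    for dev in LISTED_DEVICE_IDS:
        print(dev, to_hex_id(dev), lspci_filter(dev))
```

For instance, the filter string for device 25418 can be passed to lspci -d to list matching adapters on the local machine.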
217. only
Run an Example Test: ibv_rc_pingpong
Start the server first:
    # ibv_rc_pingpong -g 0 -i 2
    local address:  LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID fe80::202:c9ff:fe08:e799
    remote address: LID 0x0000, QPN 0x04004f, PSN 0x2cdede, GID fe80::202:c9ff:fe08:e811
    8192000 bytes in 0.01 seconds = 4730.13 Mbit/sec
    1000 iters in 0.01 seconds = 13.85 usec/iter
Then start the client:
    # ibv_rc_pingpong -g 0 -i 2 sw419
    local address:  LID 0x0000, QPN 0x04004f, PSN 0x2cdede, GID fe80::202:c9ff:fe08:e811
    remote address: LID 0x0000, QPN 0x00004f, PSN 0x3315f6, GID fe80::202:c9ff:fe08:e799
    8192000 bytes in 0.01 seconds = 4787.84 Mbit/sec
    1000 iters in 0.01 seconds = 13.69 usec/iter
Add VLANs
Make sure that the 8021q module is loaded:
    # modprobe 8021q
Add the VLAN device:
    # vconfig add eth2 7
    Added VLAN with VID == 7 to IF -:eth2:-
Configure an IP address for it:
    # ifconfig eth2.7 7.4.3.220
Examine the GID table:
    # cat /sys/class/infiniband/mlx4_0/ports/2/gids/0
    fe80:0000:0000:0000:0202:c9ff:fe08:e811
    # cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
    fe80:0000:0000:0000:0202:c900:0708:e811
According to the output, we now have two entries.
Run the Example Again, Now on VLAN
On Server:
    # ibv_rc_pingpong -g 1 -i 2
    local address: LID 0x0000, QPN 0x04004f, PSN 0x
218. or not Note that once AR is enabled you will need to actively turn it off in order to disable it To turn it off set enable to false in the AR configuration file and run OpenSM as follows f opensm ar ar config file path to file gt 2 AR Mode In the configuration file set ar mode lt bounded free gt where the default value is bounded 13 2 1 AR Configuration File Example The following is an example of AR configuration file content Mellanox Technologies 147 Rev 1 5 Adaptive Routing Begin AR configuration file enable true ar mode bounded End AR configuration file The above file has options with default values which is equivalent to not having the AR configu ration file at all 148 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 14 InfiniBand Fabric Diagnostic Utilities 14 1 Overview The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand IB devices in a fabric The tools are jbdiagnet of ibutils2 IB Net Diagnostic page 150 jbdiagnet of ibutils IB Net Diagnostic page 152 jbdiagpath IB diagnostic path page 155 jbv devices page 157 jbv devinfo page 157 jbstatus page 159 jbportstate page 161 ibroute page 166 smpquery page 170 e perfquery page 173 ibcheckerrs page 177 mstflint
219. [Screenshot: installer window; content not recoverable]
Step 9. Select the appropriate Region and Time Zone in the Clock and Time Zone window, then click Finish.
[Screenshot: Clock and Time Zone window with Region and Time Zone lists and the Hardware Clock Set To option]
Step 10. In the Installation Settings window, click Partitioning to get the Suggested Partitioning window.
[Screenshot: Installation Settings window]
220. ormance Troubleshooting
InfiniBand (IB) performance depends on the health of the IB link(s) and on the IB card type. IB link speed (10Gb/s or SDR, 20Gb/s or DDR, 40Gb/s or QDR) also affects performance.
Note: A latency-sensitive application should take into account that each switch on the path adds 200nsec at SDR and 150nsec at DDR.
1. To check the IB link speed, enter ibstat and check the value indicated after the "Rate" string: 10 indicates SDR, 20 indicates DDR, and 40 indicates QDR.
2. Check that the link has NO symbol errors, since these errors result in the retransmission of packets and therefore in bandwidth loss. This check should be conducted for each port after the driver is loaded. To check for symbol errors, enter:
   cat /sys/class/infiniband/<device>/ports/1/counters/symbol_error
   The command above is performed on Port 1 of the device <device>. The output value should be 0 if no symbol errors were recorded.
3. Bandwidth is expected to vary between systems. It heavily depends on the chipset, memory, and CPU. Nevertheless, the full wire speed should be achieved by the host. With IB SDR, the expected unidirectional full wire speed bandwidth is 900MB/sec. With IB DDR and PCI Express Gen 1, the expected unidirectional full wire speed bandwidth is 1400MB/sec (see Section C.1). With IB DDR and PCI Express Gen 2, the expected unidirectional full wire speed bandwidth is
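The two checks above (link speed from the ibstat Rate value, and the per-port symbol error counter in sysfs) can be sketched in a few lines of Python. This is only an illustration: the helper names are hypothetical, and the sysfs layout is assumed to be the one shown in the cat command above.

```python
from pathlib import PurePosixPath

# Rate values printed by ibstat, as listed in the text above.
RATE_TO_LINK_SPEED = {10: "SDR", 20: "DDR", 40: "QDR"}


def link_speed_name(rate: int) -> str:
    """Map an ibstat Rate value to its link speed name (hypothetical helper)."""
    return RATE_TO_LINK_SPEED.get(rate, "unknown")


def symbol_error_path(device: str, port: int = 1) -> str:
    """Build the sysfs path of the symbol_error counter for one port."""
    path = (PurePosixPath("/sys/class/infiniband") / device / "ports"
            / str(port) / "counters" / "symbol_error")
    return str(path)


def read_symbol_errors(device: str, port: int = 1) -> int:
    """Read the counter; 0 means no symbol errors were recorded."""
    with open(symbol_error_path(device, port)) as counter:
        return int(counter.read().strip())
```

Calling read_symbol_errors("mlx4_0", 2) would read the counter for Port 2 of a device named mlx4_0, provided the driver is loaded and the device exists on the host.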
221. ort, a listening port, or a program name.
Socket control statements in libsdp.conf allow the user to specify when libsdp should replace AF_INET/SOCK_STREAM sockets with AF_SDP/SOCK_STREAM sockets. Each control statement specifies a matching rule that applies if all its subexpressions evaluate as true (logical and).
The use statement controls which type of sockets to open. The format of a use statement is as follows:
   use <address-family> <role> <program-name> <address>:<port range>
where
<address-family> can be one of:
   sdp: for specifying when an SDP socket should be used
   tcp: for specifying when an SDP socket should not be matched
   both: for specifying when both SDP and AF_INET sockets should be used
   Note that the semantics of both is different for server and client roles. For server, it means that the server will be listening on both SDP and TCP sockets. For client, the connect function will first attempt to use SDP and will silently fall back to TCP if the SDP connection fails.
<role> can be one of:
   server or listen: for defining the listening port address family
   client or connect: for defining the connected port address family
<program-name> defines the program name the rule applies to (not including the path). Wildcards with the same semantics as ls are supported (* and ?). So db2* would match on any progr
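The program-name matching described above can be approximated in Python with the standard fnmatch module. This is a sketch of the matching idea only, not libsdp's implementation: the function name is hypothetical, and fnmatch also accepts [seq] patterns, which the text above does not mention.

```python
import os
from fnmatch import fnmatchcase


def program_matches(pattern: str, program_path: str) -> bool:
    """Return True if the program's file name matches an ls-style pattern.

    The directory part is dropped first, because the rule applies to the
    program name without its path.
    """
    name = os.path.basename(program_path)
    return fnmatchcase(name, pattern)
```

For example, program_matches("db2*", "/opt/ibm/db2/bin/db2sysc") is true, while the same pattern does not match "/usr/bin/ssh".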
222. p menu
-hh: Print an extended help menu.
-d[evice] <device> (All): Specify the device to which the Flash is connected.
-guid <GUID> (burn, sg): GUID base value. 4 GUIDs are automatically assigned to the following values: guid -> node GUID, guid+1 -> port1, guid+2 -> port2, guid+3 -> system image GUID. Note: Port2 guid will be assigned even for a single port HCA; the HCA ignores this value.
-guids <GUIDs...> (burn, sg): 4 GUIDs must be specified here. The specified GUIDs are assigned the following values, respectively: node, port1, port2, and system image GUID. Note: Port2 guid must be specified even for a single port HCA; the HCA ignores this value. It can be set to 0x0.
-mac <MAC> (burn, sg): MAC address base value. Two MACs are automatically assigned to the following values: mac -> port1, mac+1 -> port2. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-macs <MACs...> (burn, sg): Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-blank_guids (burn): Burn the image with blank GUIDs and MACs (where applicable). These values can be set later using the sg command (see Table 18 below).
-clear_semaphore (No commands): Force clear the Flash semaphore on the device. No command is allowed
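The base-value arithmetic described for the -guid and -mac switches can be illustrated with a short Python sketch. The function names are hypothetical, and the range checks (64-bit GUIDs, 48-bit MACs) are an assumption added for the example; the text above only states the offsets.

```python
def derive_guids(base_guid: int) -> dict:
    """Derive the four GUIDs from a base value: base, base+1, base+2, base+3."""
    if not 0 <= base_guid <= (1 << 64) - 4:
        raise ValueError("base GUID must leave room for 4 consecutive 64-bit values")
    return {
        "node": base_guid,
        "port1": base_guid + 1,
        "port2": base_guid + 2,   # assigned even on a single-port HCA
        "system image": base_guid + 3,
    }


def derive_macs(base_mac: int) -> dict:
    """Derive the two port MACs from a base value: base and base+1."""
    if not 0 <= base_mac <= (1 << 48) - 2:
        raise ValueError("base MAC must leave room for 2 consecutive 48-bit values")
    return {"port1": base_mac, "port2": base_mac + 1}


def format_guid(value: int) -> str:
    """Render a GUID as a 16-digit hexadecimal string."""
    return f"0x{value:016x}"
```

With a made-up base of 0x0002c90300001000, the system image GUID comes out as 0x0002c90300001003.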
223. penSM options file (opensm.opts).
III. QoS Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and, optionally, Max MTU, Max Rate, Packet Lifetime, and Path Bits. Note: Path Bits are not implemented in OFED.
IV. Matching Rules
A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, such that the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply. The matching expressions are defined for the following fields:
   SRC and DST to lists of port groups
   Service ID to a list of Service ID values or ranges
   QoS Class to a list of QoS Class values or ranges
11.4 CMA features
The CMA interface supports Service ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_addr(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS Class. The CMA uses the provided QoS Class and Service ID in the sent PR/MPR.
11.5 IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE, and Packet Lifetime available on the multicast group which forms this broadcast group.
11.6 SDP
SDP uses CMA for building its connections. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hexadecimal digits holding the remote TCP/IP Port Number to conn
224. perating system distribution on your machine, you may need an extra level, parameters, in the directory structure, so you may need to direct the echo command to /sys/module/ib_sdp/parameters/debug_level.
Turning off kernel debug is done by setting the sysfs variable to zero using the following command:
   host1# echo 0 > /sys/module/ib_sdp/debug_level
To display debug information, use the dmesg command:
   host1# dmesg
7.4 Environment Variables
For the transparent integration with SDP, the following two environment variables are required:
1. LD_PRELOAD: this environment variable is used to preload libsdp.so, and it should point to the libsdp.so library. The variable should be set by the system administrator to /usr/lib/libsdp.so (or /usr/lib64/libsdp.so).
2. LIBSDP_CONFIG_FILE: this environment variable is used to configure the policy for replacing TCP sockets with SDP sockets. By default, it points to /etc/libsdp.conf.
3. SIMPLE_LIBSDP: ignore libsdp.conf and always use SDP.
7.5 Converting Socket-based Applications
You can convert a socket-based application to use SDP instead of TCP in an automatic (also called transparent) mode or in an explicit (also called non-transparent) mode.
Automatic (Transparent) Conversion
The libsdp.conf configuration (policy) file is used to control the automatic transparent replacement of TCP sockets with SDP sockets. In this mode, socket streams are converted based upon a destination p
225. ple output above shows that the SDP module is loaded. If the SDP module is loaded and the sdpnetstat command did not show SDP sockets, then SDP is not being used by any application.
7.3.2 Monitoring and Troubleshooting Tools
SDP has debug support for both the user-space libsdp.so library and the ib_sdp kernel module. Both can be useful to understand why a TCP socket was not redirected over SDP and to help find problems in the SDP implementation.
User Space SDP Debug
User-space SDP debug is controlled by options in the libsdp.conf file. You can also have a local version and point to it explicitly using the following command:
   host1$ export LIBSDP_CONFIG_FILE=<path>/libsdp.conf
To obtain extensive debug information, you can modify libsdp.conf to have the log directive produce maximum debug output (provide the min-level flag with the value 1).
The log statement enables the user to specify the debug and error messages that are to be sent and their destination. The syntax of log is as follows:
   log [destination (stderr | syslog | file <filename>)] [min-level 1-9]
where options are:
   destination: send log messages to the specified destination:
      stderr: forward messages to the STDERR
      syslog: send messages to the syslog service
      file <filename>: write messages to the file /var/log/<filename> for root. For a regular user, write to /tmp
226. poib.ko
Note: In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
   /sbin/insmod /lib/modules/ib/ib_ipoib.ko lro=0
Step 10. Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example:
   /sbin/dhclient -cf /sbin/dhclient.conf ib1
Step 11. Save the init file.
Step 12. Close initrd:
   host1$ cd /tmp/initrd_ib
   host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
   host1$ gzip /tmp/new_init_ib.img
Step 13. At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_init_ib.img.gz. Copy it to the original initrd location and rename it properly.
A.8.2 Case II: Ethernet Ports
The Ethernet driver requires loading the following modules in the specified order (see the example below):
   mlx4_core.ko
   mlx4_en.ko
A.8.2.1 Example: Adding an Ethernet Driver to initrd (Linux)
Prerequisites
1. The FlexBoot image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section 4.3.1 on page 51, and connected to the client machine.
3. An initrd file.
4. T
227. port HCA. When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).
Prerequisites
Installation for RHEL4/5 (execute once): Verify that the standard device-mapper-multipath rpm is installed. If not, install it from the RHEL distribution.
Installation for SLES10 (execute once): Verify that multipath is installed. If not, take it from the installation (you may use yast).
Update udev (execute once, for manual activation of High Availability only): Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules). This file should have one line:
   ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High Availability below), this file is created upon each boot of the driver and is deleted when the driver is unloaded.
Manual Activation of High Availability
Initialization (execute after each boot of the driver):
1. Execute modprobe dm-multipath
2. Execute modprobe ib_srp
3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and each HC
228. port config script after the driver is loaded. Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices. Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".
Possible port types are:
   eth: Ethernet
   ib: InfiniBand
Table 5 lists the ConnectX port configurations supported by VPI.
Table 5 - Supported ConnectX Port Configurations
   Port 1 Configuration    Port 2 Configuration
   ib                      ib
   ib                      eth
   eth                     eth
Note that the configuration Port1 = eth and Port2 = ib is not supported. Also note that FCoE can run only on a port configured as eth, and the mlx4_en driver must be loaded.
The port link type can be configured for each device in the system at run time using the /sbin/connectx_port_config script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.
Note: This utility also has a non-interactive mode:
   /sbin/connectx_port_config [[-d|--device <PCI device ID>] -c|--conf <port1,port2>]
3.2 InfiniBand Driver
The InfiniBand driver, mlx4_ib, handles I
229. pt
Note: For SLES 11, please ignore the following error messages in /var/log/messages when loading ib_srpt to the SLES 11 distribution's kernel:
   ib_srpt: no symbol version for scst_unregister
   ib_srpt: Unknown symbol scst_unregister
   ib_srpt: no symbol version for scst_register
   ib_srpt: Unknown symbol scst_register
   ib_srpt: no symbol version for scst_unregister_target_template
   ib_srpt: Unknown symbol scst_unregister_target_template
B. On Initiator Machines
On Initiator machines, manually perform the following steps:
1. Run modprobe ib_srp
2. Run ibsrpdm -c -d /dev/infiniband/umadX to discover a new SRP target, where:
   umad0: port 1 of the first HCA
   umad1: port 2 of the first HCA
   umad2: port 1 of the second HCA
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered scsi disks)
Example: Assume that you use port 1 of the first HCA in the system, i.e. mthca0:
   [root@lab104]# ibsrpdm -c -d /dev/infiniband/umad0
   id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
   [root@lab104]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target
OR
You can edit /etc/infiniband/op
230. ption defines the optional QoS policy file. The default name is /etc/opensm/qos-policy.conf.
This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated guids or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.
Run in daemon mode: OpenSM will run in the background.
Start SM in inactive rather than init SM state. This option can be used in conjunction with the perfmgr so as to run a standalone performance manager without SM/SA. However, this is NOT currently implemented in the performance manager.
--prefix_routes_file <file name>: Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. By default, the SA fails such queries. The PREFIX ROUTES section below describes the format of the configuration file. The default path is /etc/opensm/prefix-routes.conf.
--consolidate_ipv6_snm_req: Consolidate IPv6 Solicited Node Multicast group join requests into one multicast group per MGID PKey.
-v, --verbose: This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.
This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to "-D 0xFF -d 2". See the -D option for more information about log verbosity.
This option sets
231. r per port for customizing the least weight hops for the routing x honor guid2lid This option forces OpenSM to honor the guid2lid file when it comes out of Standby state if such file exists under OSM CACHE DIR and is valid By default this is FALSE f log file lt file name gt This option defines the log to be the given file By default the log goes to var log opensm log For the log to go to standard output use f stdout L log limit lt size in MB gt This option defines maximal log file size in MB When specified the log file will be truncated upon reaching this limit e erase log file This option will cause deletion of the log file if it previously exists By default the log file is accumula tive P Pconfig lt partition config file gt This option defines the optional partition configuration file The default name is etc opensm partitions conf N no part enforce This option disables partition enforcement on switch external ports rer This option enables Adaptive Routing Manager in OpenSM Mellanox Technologies 115 Rev 1 5 OpenSM Subnet Manager ar config file path to file Q Vr qos This option specifies the optional Adaptive Routing config file The default name is etc opensm osm ar conf This option enables QoS setup It is disabled by default qos policy file file name stay on fatal daemon inactive This o
232. ration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.
12.7.1 Typical HPC Example: MPI and Lustre
Assignment of QoS Levels
   MPI: separate from I/O load, Min BW of 70%
   Storage Control (Lustre MDS): low latency
   Storage Data (Lustre OST): Min BW 30%
Administration
MPI is assigned an SL via the command line:
   host1# mpirun -sl 0
OpenSM QoS policy file
Note: In the following policy file example, replace OST* and MDS* with the real port GUIDs.
   qos-ulps
      default :0 # default SL (for MPI)
      any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
      any, target-port-guid MDS1,MDS2 :2 # SL for Lustre MDS
   end-qos-ulps
OpenSM options file
   qos_max_vls 8
   qos_high_limit 0
   qos_vlarb_high 2:1
   qos_vlarb_low 0:96,1:224
   qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
12.7.2 EDC SOA (2-tier): IPoIB and SRP
The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.
QoS Levels
   Application traffic: IPoIB (UD and CM) and SDP; isolated from storage; Min BW of 50%
   SRP: Min BW 50%; bottleneck at storage nodes
Administration
OpenSM QoS policy file
Note: In the following polic
233. re netO 90 02 c9 05 cf f6 on PCIO02 00 0 open CLink up TX 0 O RK 0 RXE DHCP net 00 02 c9 05 cf 65
Placing MAC Addresses in /etc/dhcpd.conf
The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server running on a Linux machine.
   host host1 {
      next-server 11.4.3.7;
      filename "pxelinux.0";
      fixed-address 11.4.3.130;
      hardware ethernet 00:02:c9:00:00:bb;
   }
B.4 TFTP Server
If you have set the filename parameter in your DHCP configuration to a non-empty filename, you need to install TFTP (Trivial File Transfer Protocol). TFTP is a simple FTP-like file transfer protocol used to transfer files from the TFTP server to the boot client as part of the boot process.
B.5 BIOS Configuration
The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add "MLNX NIC <ver>" to the list of boot devices. The priority of this list can be modified through BIOS setup.
B.6 Operation
B.6.1 Prerequisites
   Make sure that your client is connected to the server(s).
   The ConnectX EN PXE image is already programmed on the adapter card (see Section B.2).
   The DHCP server is configured and started as described in Section 4.3.1.
   Configure and start at least one of the services iSCSI Target (see Section B.9) and/or TFTP (see Section B.4).
234. required for the base installation is now complete. If you continue now, partitions on your hard disk will be formatted (erasing any existing data in those partitions) according to the installation settings in the previous dialogs. Go back and check the settings if you are unsure.
Step 19. At the end of the file copying stage, the Finishing Basic Installation window will pop up and ask for confirming a reboot. You can click OK to skip the count-down. See image below.
Note: Assuming that the machine has been correctly configured to boot from ConnectX EN PXE via its connection to the iSCSI target, make sure that MLNX_EN has the highest priority in the BIOS boot sequence.
Step 20. Once the boot is complete, the Startup Options window will pop up. Select SUSE Linux Enterprise Server 10, then press Enter.
235. resolving multiple use of same LID.
If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.
In the case of using the file-based routing, any topology changes are currently ignored. The file routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent switches will be skipped. Multicast is not affected by the file routing engine (this uses min hop tables).
12.5.2 Min Hop Algorithm
The Min Hop algorithm is invoked when neither UPDN nor the file method is specified.
The Min Hop algorithm is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
   -i <equalize-ignore-guids-file>
   -ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by guid) that will be ignored by the link load equalization algorithm. Note that only endports (CA, switch port 0, and router ports) and not switch external ports are supported.
LMC awareness routes based on (remote) system or switch basis.
12.5
236. ring SDP (page 71), Environment Variables (page 74), Converting Socket-based Applications (page 74), BZCopy - Zero Copy Send (page 82), Using RDMA for Small Buffers (page 82), Testing SDP Performance (page 82)
7.2 libsdp.so Library
libsdp.so is a dynamically linked library, which is used for transparent integration of applications with SDP. The library is preloaded and therefore takes precedence over glibc for certain socket calls. Thus, it can transparently replace the TCP socket family with SDP socket calls.
The library also implements a user-level socket switch. Using a configuration file, the system administrator can set up the policy that selects the type of socket to be used. libsdp.so also has the option to allow server sockets to listen on both SDP and TCP interfaces. The various configurations with SDP/TCP sockets are explained inside the /etc/libsdp.conf file.
7.3 Configuring SDP
To load SDP upon boot, edit the file /etc/infiniband/openib.conf and set SDP_LOAD=yes.
Note: For the changes to take effect, run: /etc/init.d/openibd restart
SDP can work over IPoIB interfaces or RoCE interfaces. In case of IPoIB, SDP uses the same IP addresses and interface names as IPoIB (see IPoIB configuration in Section 4.3 and Section 4.3.3). In case of RoCE, SDP uses the same IP addresses and interface names of the corresponding mlx4_en interfaces (see mlx4_en configuration
237. Step 21. The Hostname and Domain Name window will pop up. Continue configuring your machine until the operating system is up, then you can start running the machine in normal operation mode.
Step 22. (Optional) If you wish to have the second instance of connecting to the iSCSI Target go through the Ethernet driver, copy the initrd file under /boot to a new location, add the Ethernet driver into it after the load commands of the iSCSI Initiator modules, and continue as described in Section B.8 on page 230.
Warning: Pay extra care when changing initrd, as any mistake may prevent the client machine from booting. It is recommended to have a back-up iSCSI Initiator on a machine other than the client you are working with, to allow for debug in case initrd gets corrupted.
Next, edit the init file that is in the initrd zip and look for the following string:
   if [ "$iSCSI_TARGET_IPADDR" ] ; then
      iscsiserver="$iSCSI_TARGET_IPADDR"
   fi
Now add before the string the following line:
   iSCSI_TARGET_IPADDR=<Ethernet IP Address of iSCSI Target>
Example:
   iSCSI_TARGET_IPADDR=11.4.3.7
Also edit the file /boot/grub/menu.lst and delete the following string:
   ibft_mode=off
B.11 Windows 2008 iSCSI Boot
ConnectX EN PXE supports booting Windows 2008 from an
238. s ULP PR/MPR request, and QoS Level has only one constraint: Service Level (SL).
The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below.
12.6.4 Policy File Syntax Guidelines
   Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
   Comments are started with the pound sign (#) and terminated by EOL.
   Any keyword should be the first non-blank in the line, unless it's a comment.
   Keywords that denote section/subsection start have matching closing keywords.
   Having a QoS Level named "DEFAULT" is a must; it is applied to PR/MPR requests that didn't match any of the matching rules.
   Any section/subsection of the policy file is optional.
12.6.5 Examples of Advanced Policy File
As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here's an example of the shortest policy file:
   qos-levels
      qos-level
         name: DEFAULT
         sl: 0
      end-qos-level
   end-qos-levels
The port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, and there is no need to define them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce default QoS lev
239. s limited. Currently recognized flags are:
   ipoib: indicates that this partition may be used for IPoIB; as a result, an IPoIB capable MC group will be created.
   rate=<val>: specifies rate for this IPoIB MC group (default is 3 (10GBps))
   mtu=<val>: specifies MTU for this IPoIB MC group (default is 4 (2048))
   sl=<val>: specifies SL for this IPoIB MC group (default is 0)
   scope=<val>: specifies scope for this IPoIB MC group (default is 2 (link local))
Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).
PortGUIDs list:
   PortGUID: GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
   full or limited: indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.
There are two useful keywords for PortGUID definition:
   ALL means all end ports in this subnet.
   SELF means subnet manager's port.
An empty list means that there are no ports in this partition.
Notes:
   White space is permitted between delimiters ('=', ',', ':', ';').
   The line can be wrapped after ':' after a Partition Definition and between.
   A PartitionName does not need to be unique, but PKey does need to be unique. If a PKey is repeated, then the associated partition configuration
240. s, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match, the query is matched against the rules in the qos-ulps section.
Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.
12.6.6.1 IPoIB
An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:
   ipoib : <SL>
   ipoib, pkey 0x7fff : <SL>
   any, pkey 0x7fff : <SL>
12.6.6.2 SDP
An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:
   sdp : <SL>
   any, service-id 0x0000000000010000-0x000000000001ffff : <SL>
12.6.6.3 RDS
Similar to SDP, an RDS PR query is matched by Service ID. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. Default port number for RDS is 0x48CA, w
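The Service ID layouts given above for SDP and RDS can be worked through in Python. This sketch only performs the arithmetic stated in the text (a fixed prefix with the 16-bit port in the low four hex digits); the constant and function names are hypothetical.

```python
# Prefixes taken from the Service ID patterns in the text:
#   SDP: 0x000000000001PPPP    RDS: 0x000000000106PPPP
SDP_PREFIX = 0x0000000000010000
RDS_PREFIX = 0x0000000001060000


def service_id(prefix: int, port: int) -> int:
    """Combine a ULP prefix with a 16-bit TCP/IP port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    return prefix | port


def is_sdp_service_id(sid: int) -> bool:
    """Check the range used by the 'any, service-id ...' rule for SDP."""
    return 0x0000000000010000 <= sid <= 0x000000000001FFFF


def format_service_id(sid: int) -> str:
    """Render a Service ID as a 16-digit hexadecimal string."""
    return f"0x{sid:016x}"
```

An SDP connection to port 80 (0x50) therefore carries Service ID 0x0000000000010050, and RDS on its default port 0x48CA carries 0x00000000010648ca.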
241. s of your iSCSI target and click Next.
Step 4. Details of the discovered iSCSI target(s) will be displayed in the iSCSI Initiator Discovery window. Select the target that you wish to connect to and click Connect.
Tip: If no iSCSI target was recognized, then either the target was not properly installed or no connection was
242. s will be merged and the first PartitionName will be used (see also next note).
   It is possible to split a partition configuration in more than one definition, but then PKey should be explicitly specified (otherwise different PKey values will be generated for those definitions).
Examples:
   Default=0x7fff : ALL, SELF=full ;
   NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
   YetAnotherOne = 0x300 : SELF=full ;
   YetAnotherOne = 0x300 : ALL=limited ;
   ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
   # 0x123453, 0x123454 will be limited
   ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
   # 0x123456, 0x123457 will be limited
   ShareIO = 0x80 : defmember=limited : 0x123456, 0x123457, 0x123458=full;
   ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
   ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
Note: The following rule is equivalent to how OpenSM used to run prior to the partition manager:
   Default=0x7fff, ipoib : ALL=full ;
12.5 Routing Algorithms
OpenSM offers five routing engines:
1. Min Hop algorithm: Based on the minimum hops to each node, where the path length is optimized.
2. UPDN Unicast routing algorithm: Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fa
243. se skip this step.
   host1$ cp /sbin/ifconfig /tmp/initrd_en/sbin
Step 7. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded.
Warning: The order of the following commands (for loading modules) is critical.
   echo "loading Mellanox ConnectX EN driver"
   /sbin/insmod /lib/modules/mlnx_en/mlx4_core.ko
   /sbin/insmod /lib/modules/mlnx_en/mlx4_en.ko
Step 8. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.
Step 9. Save the init file.
Step 10. Close initrd:
   host1$ cd /tmp/initrd_en
   host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
   host1$ gzip /tmp/new_init_en.img
At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_init_ib.img.gz. Copy it to the original initrd location and rename it properly.
A.9 iSCSI Boot
Mellanox FlexBoot enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd. There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via FlexBoot, and the second is for loading other parts of the OS via initrd.
Note: Linux distributions such as SuSE Linu
244. selection will not take effect until you start a new shell (e.g., log out and log in again). Other packages, such as environment modules, provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality. The MPI selector functionality can be invoked in one of two ways: 1. The mpi-selector-menu command. This command is a simple menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended for all users. 2. The mpi-selector command. This command is a CLI equivalent of mpi-selector-menu, allowing for the same functionality as mpi-selector-menu but without the interactive menus and prompts. It is suitable for scripting. 10.4 Compiling MPI Applications Note: A valid Fortran compiler must be present in order to build the MVAPICH MPI stack and tests. The following compilers are supported by Mellanox OFED's MVAPICH and Open MPI packages: GCC, Intel, and PGI. The install script prompts the user to choose the compiler with which to install the MVAPICH and Open MPI RPMs. Note that more than one compiler can be selected simultaneously, if desired. Compiling MVAPICH Applications Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html. To r
245. signed to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e., multiple interconnects between switches. Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports. priority <priority value>: This option specifies the SM's priority. This will affect the handover cases, where the master is chosen by priority and GUID. Range is 0 (default and lowest priority) to 15 (highest). smkey <SM_Key value>: This option specifies the SM's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used 1 as the default value in a host byte order; now it is fixed, but you may need this option to interoperate with an old OpenSM running on a little endian machine. -r (reassign_lids): This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID. -R (routing_engine) <Routing engine names>: This option chooses routing engine(s) to use instead of the Min Hop algorithm (default). Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing
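The LMC rule in the excerpt above (2^LMC LIDs per port, with LMC restricted to 0-7) can be illustrated with a short Python sketch. The helper name lids_per_port is made up for this illustration and is not part of OpenSM.

```python
def lids_per_port(lmc: int) -> int:
    """Return the number of LIDs assigned to each port for a given LMC (2^LMC)."""
    # The excerpt states that the LMC value must be in the range 0-7.
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 2 ** lmc


if __name__ == "__main__":
    for lmc in range(8):
        print(f"LMC={lmc}: {lids_per_port(lmc)} LID(s) per port")
```

With the default LMC of 0 each port gets a single LID, which is why only one path exists between any two ports in that case.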
246. sion 2 6 0 or later in order to burn the PXE ROM image To download MFT see Firmware Tools under www mellanox com gt Downloads Image Burning Procedure To burn the composite image perform the following steps 1 Obtain the MST device name Run f mst start mst status The device name will be of the form mt dev id pci cr0 confo 2 Create and burn the composite image Run flint dev mst device name brom expansion ROM image Example on Linux flint dev dev mst mt26428 pci cr0 brom ConnectX 26428 ROM X X XXX rom Example on Windows flint dev mt26428 pci cr0 brom ConnectX 26428 ROM X X XXX rom A 2 2 Burning the Image on InfiniHost III Ex Lx Products Prerequisites 1 Firmware packages The appropriate firmware mlx packages ConnectX fw 25408 InfiniHost III Ex fw 25208 and or InfiniHost HI Lx fw 25204 can be downloaded from Mellanox Technologies Web site see www mel lanox com gt Downloads gt Firmware gt Customized Firmware 2 Firmware Configuration ini Files 1 Depending on the OS the device name may be superceded with a prefix 2 Relevant only if your ConnectX EN devices are currently burnt with a firmware version earlier than 2 7 000 Mellanox Technologies 189 Rev 1 5 For standard Mellanox products ini files are included in the firmware mlx packages For help in iden tifying the correct ini file of your adapter hardware please refer to MFT User s Manual whic
247. sure an IP address has been configured to this interface. Run ibv_devinfo. There is a new field named link_layer, which can be either Ethernet or IB. If the value is IB, then you need to use connectx_port_config to change the ConnectX/ConnectX-2 ports designation to eth (see mlx4_release_notes.txt for details). Configure the IP address of the interface so that the link will become active. All IB verbs applications which run over IB verbs should work on RoCE links as long as they use GRH headers (that is, as long as they specify use of GRH in their address vector). 5.4 Ported Applications The following applications are ported with RoCE: ibv_*_pingpong examples are ported. The user must specify the GID of the remote peer using the new -g option. The GID has the same format as that in /sys/class/infiniband/mlx4_0/ports/1/gids/0. Note: Care should be taken when using ibv_ud_pingpong. The default message size is 2K, which is likely to exceed the MTU of the RoCE link. Use ibv_devinfo to inspect the link MTU and specify an appropriate message size. All rdma_cm applications should work seamlessly without any change. libsdp works without any change. Performance tests. 5.5 GID Tables With RoCE, there may be several entries in a port's GID table. The first entry always contains the IPv6 link's local address of the corresponding Ethernet interface. The link's l
248. t 3 0 000 PCI 02 00 0 _ Alternatively you may skip invoking CLI right after POST and invoke it instead right after Flex Boot starts booting Once the CLI is invoked you will see the following prompt gPXE gt 196 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 A 7 2 Operation The CLI resembles a Linux shell where the user can run commands to configure and manage one or more PXE port network interfaces Each port is assigned a network interface called neti where i is 0 1 2 lt Hof interface Some commands are general and are applied to all network inter faces Other commands are port specific therefore the relevant network interface is specified in the command A 7 3 Command Reference A 7 3 1 ifstat Displays the available network interfaces in a similar manner to Linux s ifconfig gPXE ifstat neth 00 02 c9 00 00 00 aa bc on PCIO2 00 0 closed CLink down TX 9 TXE 0 RX 0 RXE 0 Link status Unknown 0x1 neti 00 02 c3 00 12 35 on PCIE CLink down TX 9 TXE 0 RX 0 RXE 0 Link status Unknown 0x1a086001 gPXE gt A 7 3 2 ifopen Opens the network interface net lt x gt The list of network interfaces is available via the ifstat com mand Example gPXE gt ifopen netl A 7 3 3 ifclose Closes the network interface net lt x gt The list of network interfaces is available via the ifstat com mand Example gPXE gt ifclose netl A 7 3 4 au
249. t Tree and a deadlock may occur due to a loop in the subnet. Fat Tree Unicast routing algorithm: This algorithm optimizes routing for a congestion-free shift communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules. LASH Unicast routing algorithm: Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free, topology-agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node. DOR Unicast routing algorithm: Based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube and for meshes when cabled as a mesh. OpenSM also supports a file method which can load routes from a table (see Modular Routing Engine below). The basic routing algorithm consists of two stages: 1. MinHop matrix calculation: How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard min hop or Up/Down. For standard routing, a relaxation algorithm is used to propagate min hop from ev
250. t against the policy and returns the appropriate PR MPR including SL MTU RATE and Lifetime 6 ULPs and programs e g SDP use CMA to establish RC connection provide the CMA the tar get IP and port number ULPs might also provide QoS Class The CMA then creates Service ID for the ULP and passes this ID and optional QoS Class in the PR MPR request The resulting PR MPR is used for configuring the connection QP PathRecord and MultiPathRecord Enhancement for QoS As mentioned above the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID which is a 64bit value A new field QoS Class is also provided A new capability bit describes the SM QoS support in the SA class port info This approach pro vides an easy migration path for existing access layer and ULPs by not introducing new set of PR MPR attributes 11 3 Supported Policy The QoS policy which is specified in a stand alone file is divided into the following four subsec tions I Port Group A set of CAs Routers or Switches that share the same settings A port group might be a partition defined by the partition manager policy list of GUIDs or list of port names based on NodeDe scription 108 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 II Fabric Setup Defines how the SL2VL and VLArb tables should be setup Note In OFED this part of the policy is ignored SL2VL and VLArb tables should be config ured in the O
251. t have it in your initrd please add it using the following command hostl cp sbin insmod tmp initrd ib sbin 200 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 Step 7 If you plan to give your IB device a static IP address then copy ifconfig Otherwise skip this step hostl cp sbin ifconfig tmp initrd ib sbin Step 8 If you plan to obtain an IP address for the IB device through DHCP then you need to copy the DHCP client which was compiled specifically to support IB Otherwise skip this step To continue with this step DHCP client v3 1 3 needs to be already installed on the machine you are working with Copy the DHCP client v3 1 3 file and all the relevant files as described below hostl cp path to DHCP client v3 1 3 gt dhclient tmp initrd ib sbin hostl cp path to DHCP client v3 1 3 gt dhclient script tmp initrd ib sbin hostl mkdir p tmp initrd ib var state dhcp host1 touch tmp initrd ib var state dhcp dhclient leases hostl cp bin uname tmp initrd ib bin hostl cp usr bin expr tmp initrd ib bin hostl cp sbin ifconfig tmp initrd ib bin hostl cp bin hostname tmp initrd ib bin Create a configuration file for the DHCP client as described in Section 4 3 1 and place it under tmp initrd ib sbin The following is an example of such a file called dclient conf dhclient conf The value indicates a hexadecimal number For a ConnectX dev
252. tests osu benchmarks lt osu ver gt osu latency OSU MPI Latency Test v3 0 Size Latency us 0 1 20 1 1 21 2 1321 4 1 21 8 1 23 16 1 24 32 1 33 64 1 49 128 2 66 256 3 08 51 2 3 61 1024 4 82 2048 6 09 4096 8 62 8192 13 59 16384 18 12 32768 28 81 65536 50 38 131072 93 70 262144 178 77 524288 349 31 1048576 689 25 2097152 1371 04 4194304 2739 16 10 5 4 Intel MPI Benchmark To run the Intel MPI Benchmark test enter hostl usr mpi gcc mvapich lt mvapich ver gt bin mpirun_rsh np 2 hostfile home lt username gt cluster usr mpi gcc mvapich mvapich ver tests IMB IMB ver IMB MPI1 Intel R MPI Benchmark Suite V3 0 MPI 1 part Date Sun Mar 2 19 56 42 2008 Machine x86 64 System Linux Mellanox Technologies 101 Rev 1 5 MPI Release 2 6 16 21 0 8 smp Version 1 SMP Mon Jul 3 18 25 39 UTC 2006 MPI Version i 142 MPI Thread Environment MPI THREAD FUNNELED Minimum message length in bytes 0 Maximum message length in bytes 4194304 MPI Datatype MPI BYTE MPI Datatype for reductions i MPI FLOAT MPI Op MPI SUM List of Benchmarks to run PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce scatter Allgather Allgatherv Alltoall Alltoallv Bcast Barrier Benchmarking PingPong fprocesses 2 fbytes repetitions t usec Mbytes sec 0 1000 1 295 0 00 1 1000 1 24 0 77 2 1000 14 25 Lasz 4 1000 1 23 3 09 8 1000
253. th Device Mapper Multipath. Have a configuration file that determines the targets to connect to. 1. srp_daemon commands equivalent to ibsrpdm: "srp_daemon -a -o" is equivalent to "ibsrpdm"; "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c". Note: These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty. 2. srp_daemon extensions to ibsrpdm: To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num>, and to generate output suitable for echo, you may execute: host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number> Note: To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run 'ls /sys/class/infiniband'. To both discover the SRP Targets and establish connections with them, just add the -e option to the above command. Executing srp_daemon over a port without the -a option will only display the reachable targets via the port and to which the initiator is not connected. If executing with the -e option, it is better to omit -a. It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. See Section 8.2.5 for more details. srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowe
254. this will try to reconstruct LFTs correctly if endport GUIDs are represented in the dump file (in order to disable this, GUIDs may be removed from the dump file or zeroed). The dump file format is compatible with the output of the ibroute utility, and for a whole fabric it can be generated with the dump_lfts.sh script. To activate the file-based routing module, use: host1# opensm -R file -U /path/to/dump_file If the dump file is not found or is in error, the default routing algorithm is utilized. The ability to dump switch lid matrices (aka min hops tables) to a file and later to load these is also supported. The usage is similar to unicast forwarding tables loading from a dump file (introduced by the file routing engine), but the new lid matrix file name should be specified by the -M or --lid_matrix_file option. For example: host1# opensm -R file -M ./opensm-lid-matrix.dump The dump file is named opensm-lid-matrix.dump and will be generated in the standard opensm dump directory (/var/log by default) when the OSM_LOG_ROUTING logging flag is set. When the routing engine file is activated but the dump file is not specified or cannot be opened, the default lid matrix algorithm will be used. There is also a switch forwarding tables dumper which generates a file compatible with dump_lfts.sh output. This file can be used as input for forwarding tables loading by the file routing engine. Both or one of the options -U and -M can be specified together with -R file.
255. tion and holds the IBNL files required to load this topology To use these files you will need to set th nvironment variable named IBDM IBNL PATH to that directory The directory is located in tmp or in the output directory provided by the o flag load db lt file name gt gt Load subnet data from the given db file and skip subnet discovery stage Note Some of the checks require actual subnet discovery and therefore would not run when load db is specified These checks are Duplicated zero guids link state SMs Status h help Prints the help page information V version Prints the version of the tool vars Prints the tool s environment variables and their values Mellanox Technologies 153 Rev 1 5 InfiniBand Fabric Diagnostic Utilities 14 4 2 Output Files Table 8 ibdiagnet of ibutils Output Files Output File Description ibdiagnet log A dump of all the application reports generate according to the provided flags ibdiagnet lst List of all the nodes ports and links in the fabric ibdiagnet fdbs A dump of the unicast forwarding tables of the fabric switches ibdiag A dump of the multicast forwarding tables of the fabric switches net mcfdbs ibdiagnet masks In case of duplicate port node Guids these file include the map between masked Guid and real Guids ibdiagnet sm List of all the SM state and priority in the fabric ibdiagnet pm A dump of the pm Counters values o
256. toboot Starts the boot process from the device s A 7 3 5 sanboot Starts the boot process of an iSCSI target Example gPXE gt sanboot iscsi 11 4 3 7 ign 2007 08 7 3 4 11 iscsiboot Mellanox Technologies 197 Rev 1 5 A 7 3 6 echo Echoes an environment variable Example gPXE gt echo root path A 7 3 7 dhcp A network interface attempts to open the network interface and then tries to connect to and com municate with the DHCP server to obtain the IP address and filepath from which the boot will occur Example gPXE gt dhcp netl A 7 3 8 help Displays the available list of commands A 7 3 9 exit Fxits from the command line interface 198 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 A 8 Diskless Machines Mellanox FlexBoot supports booting diskless machines To enable using an IB ETH driver the remote kernel or init ra image must include and be configured to load that driver This can be achieved either by compiling the HCA driver into the kernel or by adding the device driver module into the initra image and loading it A 8 1 Case l InfiniBand Ports The IB driver requires loading the following modules in the specified order see Section A 8 1 1 for an example jb addr ko jb core ko jb mad ko jb sa ko e jb cm ko jb uverbs ko jb ucm ko jb umad ko iw_cm ko rdma cm ko rdma ucm ko mlx4 core ko mlx4 ib ko jb mt
257. tomatically. Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running. To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command: echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target See Section 8.2.3 for instructions on how the parameters in this echo command may be obtained. Notes: Execution of the above echo command may take some time. The SM must be running while the command executes. It is possible to include additional parameters in the echo command: max_cmd_per_lun: Default = 63. max_sect (short for max_sectors): sets the request size of a command. io_class: Default = 0x100, as in rev 16A of the specification (in rev 10 the default was 0xff00). initiator_ext: Please refer to Section 9, Multiple Connections. To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods: Execute fdisk -l. This command lists all devices; the new devices are included in this listing. Execute dmesg or look at /var/log/messages to find messages with the names of the new devices. 8.2.3 SRP Tools: ibsrpdm and srp_daemon To assist in per
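As an illustration of the echo command in the excerpt above, the following Python sketch assembles the comma-separated key=value string that is written to add_target. It is a sketch under assumptions: the function names are hypothetical, the GUID, GID and service values are made-up placeholders, and the sysfs path layout (/sys/class/infiniband_srp/srp-mthca<hca number>-<port number>/add_target) is this editor's reading of the flattened path in the excerpt. The sketch only prints the strings; it does not write to sysfs.

```python
def build_add_target_string(id_ext: str, ioc_guid: str, dgid: str,
                            service_id: str, pkey: str = "ffff") -> str:
    """Assemble the parameter string for the SRP add_target echo command."""
    fields = [
        ("id_ext", id_ext),
        ("ioc_guid", ioc_guid),
        ("dgid", dgid),
        ("pkey", pkey),            # the excerpt uses pkey=ffff
        ("service_id", service_id),
    ]
    return ",".join(f"{key}={value}" for key, value in fields)


def add_target_path(hca_number: int, port_number: int) -> str:
    """Return the assumed sysfs path for an mthca HCA and port (see lead-in)."""
    return f"/sys/class/infiniband_srp/srp-mthca{hca_number}-{port_number}/add_target"


if __name__ == "__main__":
    # All values below are hypothetical placeholders, not real targets.
    params = build_add_target_string(
        id_ext="200400a0b81146a1",
        ioc_guid="0002c90200402bd4",
        dgid="fe800000000000000002c90200402bd5",
        service_id="200400a0b81146a1",
    )
    print(f"echo -n {params} > {add_target_path(0, 1)}")
```

In practice the parameter values would come from ibsrpdm or srp_daemon output rather than being typed by hand.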
258. tput Files Output File Description ibdiagnet2 Ist Fabric links in LST format ibdiagnet2 sm Subnet Manager ibdiagnet2 pm Ports Counters ibdiagnet2 fdbs Unicast FDBs ibdiagnet2 mcfdbs Multicast FDBx ibdiagnet2 nodes_info Information on nodes ibdiagnet2 db_csv ibdiagnet internal database An ibdiagnet run performs the following stages Fabric discovery Duplicated GUIDs detection Links in INIT state and unresponsive links detection e Counters fetch Error counters check Routing checks Link width and speed checks 14 3 3 Return Codes 0 Success 1 Failure with description 14 4 ibdiagnet of ibutils IB Net Diagnostic Note This version of ibdiagnet is included in the ibutils package and it is run by default after installing Mellanox OFED To use this ibdiagnet version run ibdiagnet ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices It then produces the following files in the output directory which is defined by the o option described below 14 4 1 SYNOPSYS ibdiagnet c lt count gt v r o lt out dir gt t lt topo file gt s lt sys name gt i lt dev index gt p lt port num gt wt pm pc P lt lt PM gt lt Value gt gt lw lt 1x 4x 12x gt 1s lt 2 5 5 10 gt skip lt ibdiag check s gt load db db file gt 152 Mellanox Technolo
259. u can also run the following commands to obtain the port state: cat /sys/class/infiniband/mlx4_0/ports/1/state 2: INIT cat /sys/class/infiniband/mlx4_0/ports/2/state 4: ACTIVE 2. Look at the link_layer parameter of each port. In this case, port 1 is IB and port 2 is Ethernet. Nevertheless, port 2 appears in the list of the HCA's ports. You can also run the following commands to obtain the link_layer of the two ports: cat /sys/class/infiniband/mlx4_0/ports/1/link_layer InfiniBand cat /sys/class/infiniband/mlx4_0/ports/2/link_layer Ethernet 3. The firmware version is 2.7.700 (appears at the top). You can also run the following command to obtain the firmware version: cat /sys/class/infiniband/mlx4_0/fw_ver 2.7.100 4. The IB over Ethernet's Port MTU is 2K bytes at maximum; however, the actual MTU cannot exceed the mlx4_en interface's MTU. Since the mlx4_en interface's MTU is 1560, port 2 will run with an MTU of 1K. Association of IB Ports to Ethernet Ports It is useful to know how IB ports associate to network ports: ibdev2netdev mlx4_0 port 2 <===> eth2 mlx4_0 port 1 <===> ib0 Since both RoCE and mlx4_en use the Ethernet port of the adapter, one of the drivers must carry the task of controlling the port state. In this implementation, it is the task of the mlx4_en driver. The mlx4_ib driver holds a reference to the
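The MTU reasoning in item 4 of the excerpt above can be sketched as follows. This is a simplified model, not the driver's actual logic: it assumes the port MTU is the largest standard InfiniBand MTU size (256, 512, 1024 or 2048 bytes) that does not exceed the Ethernet interface MTU, and it ignores any header overhead the driver may subtract. The function name roce_port_mtu is hypothetical.

```python
# Standard InfiniBand MTU sizes up to the 2K maximum stated in the excerpt.
IB_MTU_SIZES = (256, 512, 1024, 2048)


def roce_port_mtu(netdev_mtu: int) -> int:
    """Pick the largest IB MTU that fits within the Ethernet interface MTU.

    Simplifying assumption: header overhead is ignored.
    """
    candidates = [size for size in IB_MTU_SIZES if size <= netdev_mtu]
    if not candidates:
        raise ValueError(f"interface MTU {netdev_mtu} is below the smallest IB MTU")
    return max(candidates)


if __name__ == "__main__":
    # The excerpt's case: an mlx4_en interface MTU of 1560 gives a port MTU of 1K.
    print(roce_port_mtu(1560))
```

Under this model, an interface MTU of 1560 selects 1024, which agrees with the 1K figure quoted in the excerpt.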
260. up ok After configuring the IB ETH port the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel OS to boot from Mellanox Technologies 195 Rev 1 5 For ConnectX InfiniBand Mellanox Connect FlexBoot v3 0 000 gPXE 0 9 9 Open Source Boot Firmware nett 36 02 c9 00 00 00 aa be on PCI02 00 0 open CLink down TX 9 TKE 0 RX 0 RXE Link status Not connected 001 Waiting for link up on net ok DHCP net 00 02 09 90 090 090 aa bc ok netO 11 4 3 1307255 255 255 0 gu 0 0 0 0 Booting from filename pxelinux 0 tftp 11 4 3 7 pxelinux 0 For InfiniHost III Ex PXE 0 9 3 Open Source Boot Firmware eatures TFTP iSCSI AoE PXE PXEXI leto c902 00231392 on PCIO5 00 0 open 0 RX 0 RXE 0 HCP net0 00550401 fe800000 00000000 0002c0902 00231392 ok vetO 11 4 3 130 255 Next FlexBoot attempts to boot as directed by the DHCP server A 7 Command Line Interface CLI A 7 1 Invoking the CLI When the boot process begins the computer starts its Power On Self Test POST sequence Shortly after completion of the POST the user will be prompted to press CTRL B to invoke Mel lanox FlexBoot CLI The user has few seconds to press CTRL B before the message disappears see figure Mellanox ConnectX FlexBoot v3 0 000 gPXE http etherboot org 02 00 0 CB80 PCI3 00 PnP BBS PMMO040820 CB80 Press Ctrl B to configure MLNX FlexBoo
261. uration For details on this please read the documentation for the ib bonding package under usr share doc ib bonding 0 9 0 ib bonding txt on RedHat and usr share doc packages ib bonding 0 9 0 ib bonding txt on SuSE Notes f the bondX name is defined but one of bondX SLAVES or bondX IPs is missing then that specific bond will not be created The bondX name must not contain characters which are disallowed for bash variable names such as and Using etc infiniband openib conf to create a persistent configuration is not recom mended Do not use it unless you have no other option It is not guaranteed that the first method will be supported in future versions of OFED Mellanox Technologies 57 J Rev 1 5 IPoIB 4 7 PolB Performance Tuning When IPoIB is configured to run in connected mode TCP parameter tuning is performed at driver startup to improve the throughput of medium and large messages 4 8 Testing IPoIB Performance This section describes how to verify IPoIB performance by running the Bandwidth BW test and the Latency test These tests are described in detail at the following URL http www netperf org netperf training Netperf html Note For UDP best performance please use IPoIB in Datagram mode and not in Connected mode To verify IPoIB performance perform the following steps Step 1 Download Netperf from the following URL http www netperf org netperf NetperfPage html Step 2
262. ustomized Mellanox firmware image for burning in binary or mlx for mat Burning an image to the Flash EEPROM attached to a Mellanox HCA or switch device Querying the firmware version loaded on an HCA board Displaying the VPD Vital Product Data of an HCA board flint This tool burns a firmware binary image to the Flash es attached to an HCA board It includes query functions to the burnt firmware image and to the binary image file spark 1 OpenSM is disabled by default See Chapter 12 OpenSM Subnet Manager for details on enabling it 22 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 This tool burns a firmware binary image to the EEPROM s attached to a switch device It includes query functions to the burnt firmware image and to the binary image file The tool accesses the EEPROM and or switch device via an I2C compatible interface ibspark This tool burns a firmware binary image to the EEPROM s attached to a switch device It includes query functions to the burnt firmware image and to the binary image file The tool accesses the switch device and the EEPROM via vendor specific MADs over the InfiniBand fabric In Band tool Debug utilities A set of debug utilities e g itrace mstdump isw and 12c For additional details please refer to the MFT User s Manual docs 1 5 Quality of Service Quality of Service QoS requirements stem from the realization of I O consolidat
263. vice ID: 25218 Chip Revision: A0 Description: Node Port1 Port2 Sys image GUIDs: 0002c90200231390 0002c90200231391 0002c90200231392 0002c90200231393 Board ID: MT_0370110001 VSD: PSID: MT_0370110001 Assuming that FlexBoot is connected via Port 2, the Port GUID is 00:02:c9:02:00:23:13:92. Step 4. The resulting client identifier is the concatenation, from left to right, of 20, the QP_Number, the subnet prefix, and the Port GUID. In the example above, this yields the following DHCP client identifier: 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92 Extracting the Client Identifier: Method II An alternative method for obtaining the 20 bytes of QP Number and GID involves booting the client machine via FlexBoot. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 20 bytes can be captured from the boot session as shown in the figure below. [Figure: FlexBoot boot screen for InfiniHost III] Concatenate the byte 20 to the left of the captured 20 bytes, then separate every byte (two hexadecimal digits) with a colon. You should obtain the same result shown in Step 4 above. Placing Client Identifiers in /etc/dhcpd.conf The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server
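The concatenation described in Step 4 of the excerpt above can be sketched in Python. The function name dhcp_client_identifier is hypothetical; the input values are the QP number, subnet prefix and Port GUID taken from the excerpt's own example.

```python
def dhcp_client_identifier(qp_number: str, subnet_prefix: str, port_guid: str) -> str:
    """Build the colon-separated DHCP client identifier.

    The identifier is the byte 20 followed by the QP number (4 bytes),
    the subnet prefix (8 bytes) and the Port GUID (8 bytes), all given
    here as plain hexadecimal strings without separators.
    """
    for name, value, length in (("qp_number", qp_number, 8),
                                ("subnet_prefix", subnet_prefix, 16),
                                ("port_guid", port_guid, 16)):
        if len(value) != length:
            raise ValueError(f"{name} must be {length} hex digits")
        int(value, 16)  # raises ValueError if the string is not hexadecimal
    raw = "20" + qp_number + subnet_prefix + port_guid
    # Separate every byte (two hexadecimal digits) with a colon.
    return ":".join(raw[i:i + 2] for i in range(0, len(raw), 2)).lower()


if __name__ == "__main__":
    print(dhcp_client_identifier("00550401", "fe80000000000000", "0002c90200231392"))
```

For the example values this reproduces the 21-byte identifier quoted in Step 4.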
264. vironment See the ibdump release notes txt file for more details Note Although ibdump is a Linux application the generated pcap file may be analyzed on either operating system Synopsis ibdump options Table 19 lists the various flags of the command Table 19 ibdump Options Optional Default Flag pto If Not Description Mandatory Specified h help Optional Print the help menu 184 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 Table 19 ibdump Options Optional Default Flag M dator If Not Description y Specified d ib dev lt dev gt Optional First device Use IB device lt dev gt found i ib port lt port gt Optional 1 Use port lt port gt of IB device 0 output lt file gt Optional sniffer pcap Dump file name b max burst lt log2 Optional 12 4096 entries log2 of the maximal burst size that can be captured with burst gt no packet loss Each entry takes MTU bytes of memory mem mode lt size gt Optional When specified packets are written to the dump file only after the capture is stopped It is faster than the default mode less chance for packet loss but it uses more memory In this mode ibdump stops after lt size gt bytes are cap tured decap Optional Decapsulate port mirroring headers Should be used when capturing RSPAN traffic Examples 1 Run ibdump gt ibdump IB device 2 mix On J
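The max burst option in the table above takes the log2 of the burst size (default 12, i.e. 4096 entries), and each entry takes MTU bytes of memory. A small Python sketch of that arithmetic follows; the function name is hypothetical, and the 2048-byte MTU used in the example is an assumed value, not one stated for ibdump.

```python
def burst_buffer_bytes(log2_burst: int = 12, mtu_bytes: int = 2048) -> int:
    """Estimate capture buffer memory: 2^log2_burst entries of MTU bytes each."""
    entries = 1 << log2_burst
    return entries * mtu_bytes


if __name__ == "__main__":
    size = burst_buffer_bytes()
    print(f"{1 << 12} entries x 2048 bytes = {size} bytes ({size // (1024 * 1024)} MiB)")
```

Raising the log2 value by one doubles the memory needed, which is worth keeping in mind before choosing a large burst size.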
265. vision BO Description Node Portl Port2 Sys image GUIDs 0002c90300001038 0002c90300001039 0002c9030000103a 0002c9030000103b MACs 0002c9001039 0002c900103a Board ID n a MT_0D20110009 VSD n a PSID MT 0D20110009 Assuming that FlexBoot is connected via Port 1 then the Port GUID is 00 02 c9 03 00 00 10 39 Extracting the Port GUID Method Il An alternative method for obtaining the port GUID involves booting the client machine via Flex Boot This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet The 8 bytes can be captured from the boot session as shown in the figure below Mellanox Technologies 191 ellanox ConnectX FlexBoot v3 0 000 PXE 0 9 9 Open Source Boot Firmware net 90 92 c9 00 00 00 aa bc on PCIOZ2 Link doun TX 0 TXE RX 0 RXE Link status Not connected 0 gt Jaiting for link up on net0O ok Placing Client Identifiers in etc dhcpd conf The following is an excerpt of a etc dhcpd conf example file showing the format of represent ing a client machine for the DHCP server host hostl next server 11 4 3 7 filename pxelinux 0 fixed address 11 4 3 130 option dhcp client identifier 00 00 00 00 00 02 00 00 02 c9 00 00 02 c9 03 00 00 10 39 A 2 4 2 For InfiniHost Ill Family Devices PCI Device IDs 25204 25218 When a FlexBoot client boots it sends the DHCP server various information including its DHCP client identifier
266. wing the sysfs directory will be referred to as SFCSYSFS To create a new vHBA on an Ethernet interface e g eth3 run f echo eth3 gt FCSYSFS create To destroy a previously created vHBA on an interface e g eth3 run gt echo eth3 gt FCSYSFS destroy To signal link up to an existing vHBA e g on eth3 run gt echo eth3 gt FCSYSFS link up To signal link down to an existing vHBA e g on eth3 run gt echo eth3 gt SFCSYSFS link down Mellanox Technologies 47 Rev 1 5 Working With VPI 3 4 4 2 Creating vHBAs That Use PFC To create a vHBA that uses the PFC feature it is required to configure the Ethernet driver to sup port PFC create a VLAN Ethernet interface assign it a priority and start a vHBA on the interface The following steps demonstrate the creation of such a vHBA To configure the mlx4 en Ethernet driver to support PFC add the following line to the file etc modprobe conf and restart the network driver options mlx4 en pfctx 0xff pfcrx 0xff To create a VLAN with an ID e g 55 on interface e g eth3 run gt vconfig add eth3 55 gt ifconfig eth3 55 up To set the map of skb priority 0 to the requested vlan priority e g 6 run gt vconfig set egress map eth3 55 0 6 To create the vHBA enter gt echo eth3 55 gt SFCSYSFS create 48 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 3 4 4 3 Creating vHBAs That Use
267. x Enterprise Server 10 SPx and Red Hat Enter prise Linux 5 1 or above can be directly installed on an iSCSI target At the end of this direct installation initrd is capable to continue loading other parts of the OS on the iSCSI target Other distributions may also be suitable for direct installation on iSCSI targets If you choose to continue loading the OS after boot through the HCA device driver please verify that the initrd image includes the HCA driver as described in Section A 7 Configuring an iSCSI Target in Linux Environment Prerequisites Step 1 Make sure that an iSCSI Target is installed on your server side 204 Mellanox Technologies Mellanox OFED for Linux User s Manual Rev 1 5 Tip You can download and install an iSCSI Target from the following location http sourceforge net projects iscsitarget files iscsitarget Step 2 Dedicate a partition on your iSCSI Target on which you will later install the operating sys tem Step 3 Configure your iSCSI Target to work with the partition you dedicated If for example you choose partition dev sda5 then edit the iSCSI Target configuration file etc ietd conf to include the following line under the iSCSI Target ign line Lun 0 Path dev sda5 Type fileio Tip The following is an example of an iSCSI Target iqn line Target ign 2007 08 7 3 4 10 iscsiboot Step 4 Start your iSCSI Target Example hostl etc init d iscsitarget start Configuring the DHCP Server
268. 12.6 Quality of Service Management in OpenSM
12.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the -Q or --qos flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows:
- The request is matched against the defined matching rules such that the QoS Level definition is found.
- Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.
[Figure 3: QoS Manager]
There are two ways to define QoS policy:
- Advanced: the advanced policy file syntax provides the administrator with various ways to match a PathRecord/MultiPathRecord (PR/MPR) request and to enforce various QoS constraints on the requested PR/MPR.
- Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs.
12.6.2 Advanced QoS Policy File
The QoS policy file has the following sections:
I) Port Groups (denoted by port-groups). This section defines zero or more port groups that can be referred to later by matching rul
269. y file example, replace SRPT* with the real SRP Target port GUIDs.
    qos-ulps
        default
        ipoib
        sdp
        srp, target-port-guid SRPT1,SRPT2,SRPT3
    end-qos-ulps
    [SL values illegible in source]
- OpenSM options file
    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:32
    qos_vlarb_low 0:1
    qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
12.7.3 EDC (3-tier): IPoIB, RDS, SRP
The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
QoS Levels
- Management traffic (ssh): IPoIB management VLAN (partition A); Min BW 10%
- Application traffic: IPoIB application VLAN (partition B); isolated from storage and database; Min BW of 30%
- Database Cluster traffic: RDS; Min BW of 30%
- SRP: Min BW 30%; bottleneck at storage nodes
Administration
- OpenSM QoS policy file
Note: In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.
    qos-ulps
        default : 0
        ipoib, pkey 0x8001 : 1
        ipoib, pkey 0x8002 : 2
        rds : 3
        srp, target-port-guid SRPT1,SRPT2,SRPT3 : 4
    end-qos-ulps
- OpenSM options file
    qos_max_vls 8
    qos_high_limit 0
    qos_vlarb_high 1:32,2:96,3:96,4:96
    qos_vlarb_low 0:1
    qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
- Partition configuration file
    Default=0x7fff, ipoib : ALL=full;
    PartA=0x8001, sl=1, ipoib : ALL
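The minimum-bandwidth figures in the 3-tier example can be cross-checked against the qos_vlarb_high weights. The sketch below assumes that, under contention, each VL in the high-priority arbitration table receives bandwidth roughly in proportion to its weight; this is a simplification that ignores the low-priority table and the high limit. The function names are hypothetical and the code is not part of OpenSM.

```python
from typing import Dict


def parse_vlarb(spec: str) -> Dict[int, int]:
    """Parse a 'VL:weight,VL:weight,...' string into {VL: total weight}."""
    weights: Dict[int, int] = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        vl_text, weight_text = item.split(":")
        vl, weight = int(vl_text), int(weight_text)
        weights[vl] = weights.get(vl, 0) + weight  # a VL may appear more than once
    return weights


def vlarb_shares(spec: str) -> Dict[int, float]:
    """Return each VL's fraction of the total weight in the table."""
    weights = parse_vlarb(spec)
    total = sum(weights.values())
    if total == 0:
        raise ValueError("arbitration table has no non-zero weights")
    return {vl: weight / total for vl, weight in weights.items()}


if __name__ == "__main__":
    # High-priority table from the 3-tier example above.
    for vl, share in sorted(vlarb_shares("1:32,2:96,3:96,4:96").items()):
        print(f"VL{vl}: {share:.0%}")
```

With the weights 32, 96, 96 and 96 the total is 320, so the shares come to 10%, 30%, 30% and 30%, which agrees with the minimum bandwidths listed under QoS Levels.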
270. zation when compared to conventional implementations of TCP, while preserving the TCP APIs and semantics upon which most current network applications depend. For more details, see Chapter 7, "SDP".
SRP: SRP (SCSI RDMA Protocol) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA; instead, it controls a connection to an IO controller, known as the SRP Target, to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services. See Chapter 8, "SRP" and Appendix E, "SRP Target Driver".
NFSoRDMA: NFS over RDMA in Mellanox OFED is a binding of NFS v2, v3, and v4 on top of the InfiniBand RDMA transport and iWARP.
1.4.6 MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox OFED includes the following MPI implementations over InfiniBand:
- Open MPI: an open source MPI-2 implementation by the Open MPI Project
- OSU MVAPICH: an MPI-1 implementation by Ohio State University
Mellanox OFED also includes MPI benchmark tests su