Mellanox OFED Linux User's Manual
Contents
1. 4.4.1 Creating a Subinterface
   4.4.2 Removing a Subinterface
   4.5 Verifying IPoIB Functionality
   4.6 The ib-bonding Driver
   4.6.1 Using the ib-bonding Driver
   4.7 IPoIB Performance Tuning
   4.8 Testing IPoIB Performance
   Chapter 5 RDS
   5.1 Overview
   5.2 RDS Configuration
   Chapter 6 EoIB
   6.1 Introduction
   6.2 EoIB Topology
   6.2.1 External ports (eports) and GW
   6.2.2 Virtual Hubs (vHubs)
   6.2.3 Virtual NIC (vNic)
   6.3 EoIB Configuration
   6.3.1 EoIB Host Administered vNic
   6.3.1.1 Central configuration file /etc/infiniband/mlx4_vnic.conf
   6.3.1.2 vNic specific configuration fil
2. A.1 Overview

This chapter describes "Mellanox Boot over IB" (BoIB), the software for Boot over Mellanox Technologies InfiniBand (IB) HCA devices. BoIB enables booting kernels or operating systems (OSs) from remote servers in compliance with the PXE specification. BoIB is based on the open source project Etherboot/gPXE, available at http://www.etherboot.org.

BoIB first initializes the HCA device. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs BoIB to access the kernel/OS through a TFTP server, an iSCSI target, or other service.

Mellanox Boot over IB implements a network driver with IP over IB acting as the transport layer. IP over IB is part of the Mellanox BXOFED for Linux software package (see www.mellanox.com). The binary code is exported by the device as an expansion ROM image.

A.1.1 Supported Mellanox HCA Devices and Firmware

Table 18 - Supported Mellanox Technologies Devices and PCI Device IDs

Device Name                                          PCI Device ID   PCI Device ID    Firmware Name
                                                     (Decimal)       (Hexadecimal)
MT25408 ConnectX IB SDR (PCI Express 2.0 2.5GT/s)    25408           0x6340           fw-25408
MT25408 ConnectX IB DDR (PCI Express 2.0 2.5GT/s)    25418           0x634a           fw-25408
MT25408 ConnectX IB DDR (PCI Express 2.0 5.0GT/s)    26418           0x6732           fw-25408
MT25408 ConnectX IB QDR (PCI Express 2.0 5.0GT/s)    26428           0x673c
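The decimal and hexadecimal columns of Table 18 are two notations for the same PCI device ID. The following Python sketch cross-checks the four rows shown above; the dictionary layout and the `check_ids` helper are illustrative names chosen here, not part of BoIB or of any Mellanox tool.

```python
# Rows copied from Table 18: description -> (decimal ID, hexadecimal ID).
TABLE_18 = {
    "MT25408 ConnectX IB SDR, PCI Express 2.0 2.5GT/s": (25408, "0x6340"),
    "MT25408 ConnectX IB DDR, PCI Express 2.0 2.5GT/s": (25418, "0x634a"),
    "MT25408 ConnectX IB DDR, PCI Express 2.0 5.0GT/s": (26418, "0x6732"),
    "MT25408 ConnectX IB QDR, PCI Express 2.0 5.0GT/s": (26428, "0x673c"),
}


def check_ids(table):
    """Return the descriptions whose decimal and hexadecimal IDs disagree."""
    return [name for name, (dec, hx) in table.items() if dec != int(hx, 16)]


mismatches = check_ids(TABLE_18)
print(mismatches)  # an empty list means every row is self-consistent
```

The same conversion is handy when matching the table against tools that print device IDs in only one of the two notations.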
3. [Screenshot: Suggested Partitioning window with the options Accept Proposal, Base Partition Setup on This Proposal, and Create Custom Partition Setup]

Rev 1.50, Mellanox Technologies, Mellanox Technologies Confidential

Step 31. In the Expert Partitioner window, select from the IET-VIRTUAL-DISK device the row that has its Mount column indicating swap, then click Delete. Confirm the delete operation and click Finish.

[Screenshot: Expert Partitioner window listing /dev/sda (8.0 GB, IET-VIRTUAL-DISK), /dev/sda1 (70.5 MB, Linux native, Ext2, mounted at /boot), /dev/sda2 (502.0 MB, Linux swap), and /dev/sda3 (7.4 GB, Linux native, Reiser), with a confirmation dialog reading "Really delete device /dev/sda2?"]
4. 12.10 smpquery
   12.11 perfquery
   12.12 ibcheckerrs
   12.13 mstflint
   12.14 ibv_asyncwatch
   Appendix A Boot over IB (BoIB)
   A.1 Overview
   A.2 Burning the Expansion ROM Image
   A.3 Preparing the DHCP Server in Linux Environment
   A.4 Subnet Manager - OpenSM
   A.5 TFTP Server
   A.6 BIOS Configuration
   A.7 Operation
   A.8 Diskless Machines
   A.9 iSCSI Boot
   A.10 WinPE
   Appendix B ConnectX EN PXE
   B.1 Overview
   B.2 Burning the Expansion ROM Image
   B.3 Preparing the DHCP Server in Linux Environment
   B.4 TFTP Server
   B.5 BIOS Configuration
   B.6 Operation
   B.7 Diskless Machines
   B.8 iSCSI Boot
   iSCSI Boot Example of SLES 10 SP2 OS
   B.9 WinPE
   Appendix C Performance Troubleshooting
   C.1 PCI Express Performance Troubleshooting
   C.2 InfiniBand Performance Troubleshooting
   Appendix D ULP Performance Tuning
   D.1 IPoIB Performance Tuning
   D.2 Ethernet Performance Tuning
   D.3 MPI Performance Tuning
   Appendix E SRP Target Driver
   E.1 Prerequisites
   E.2 How to run
   E.3
   Appendix F
5. 9.2.1 SSH Configuration
   9.3 MPI Selector - Which MPI Runs
   9.4 Compiling MPI Applications
   9.5 OSU MVAPICH Performance
   9.5.1 Requirements
   9.5.2 Bandwidth Test Performance
   9.5.3 Latency Test Performance
   9.5.4 Intel MPI Benchmark
   9.6 Open MPI Performance
   9.6.1 Requirements
   9.6.2 Bandwidth Test Performance
   9.6.3 Latency Test Performance
   9.6.4 Intel MPI Benchmark
   Chapter 10 Quality of Service
   10.1 Overview
   10.2 QoS Architecture
   10.3 Supported Policy
   10.4 CMA features
   10.5 IPoIB
   10.6 SDP
6.      exit(EXIT_FAILURE);
    }

    /* accept the client connection */
    struct sockaddr_in client_addr;
    socklen_t client_addr_len = sizeof(client_addr);
    int cd = accept(sd, (struct sockaddr *)&client_addr, &client_addr_len);
    if (cd < 0) {
        perror("accept failed");
        exit(EXIT_FAILURE);
    }
    printf("accepted connection from %s:%u\n",
           inet_ntoa(client_addr.sin_addr), ntohs(client_addr.sin_port));

    ssize_t nr = read(cd, (void *)rx_buffer, RXBUFSZ);
    if (nr < 0) {
        perror("read failed");
        exit(EXIT_FAILURE);
    } else if (nr == 0) {
        printf("socket was closed by remote host\n");
    }
    printf("read %zd bytes\n", nr);
    printf("end of test\n");

    close(cd);
    close(sd);
    return 0;
}

7.7 BZCopy - Zero Copy Send

BZCOPY mode is only effective for large block transfers. By setting the /sys parameter sdp_zcopy_thresh to a non-zero value, a non-standard SDP speedup is enabled. Messages longer than sdp_zcopy_thresh bytes in length cause the user space buffer to be pinned and the data to be sent directly from the original buffer. This results in less CPU usage and
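The BZCopy rule described above reduces to one comparison per send. The Python sketch below models only that decision; the function name and the example threshold of 65536 bytes are hypothetical, and treating a zero threshold as "BZCopy off" is an assumption drawn from the statement that a non-zero value enables the speedup.

```python
def uses_zero_copy(message_len: int, sdp_zcopy_thresh: int) -> bool:
    """Return True if a send of message_len bytes would take the BZCopy path."""
    if sdp_zcopy_thresh == 0:
        # Assumption: a zero threshold leaves BZCopy disabled.
        return False
    # Only messages strictly longer than the threshold are sent zero-copy.
    return message_len > sdp_zcopy_thresh


THRESH = 65536  # hypothetical threshold, not a recommended value
for size in (4096, 65536, 1048576):
    print(size, uses_zero_copy(size, THRESH))
```

A message exactly as long as the threshold still takes the ordinary copy path, because the text says "longer than".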
7. [Screenshot: lower part of the Expert Partitioner window with the buttons Create, Edit, Delete, Resize, LVM, EVMS, RAID, Crypt File, Expert, Back, and Abort]

Step 32. In the pop-up window, click No to approve deleting the swap partition. You will be returned to the Installation Settings window. See image below.

[Screenshot: Expert Partitioner help text and the installation progress sidebar (Preparation, Installation, Configuration)]

Step 33.
8. TFTP Server

When you set the 'filename' parameter in your DHCP configuration file to a non-empty filename, the client will ask for this file to be passed through TFTP. For this reason, you need to install a TFTP server.

A.6 BIOS Configuration

The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add to the list of boot devices "MLNX IB <ver>" for a ConnectX device, or "gPXE" for an InfiniHost III device. The priority of this list can be modified through BIOS setup.

A.7 Operation

A.7.1 Prerequisites

• Make sure that your client is connected to the server(s).
• The BoIB image is already programmed on the adapter card (see Section A.2).
• Start the Subnet Manager as described in Section A.4.
• Configure and start the DHCP server as described in Section A.3.
• Configure and start at least one of the services iSCSI Target (see Section A.1) and/or TFTP (see Section A.5).

A.7.2 Starting Boot

Boot the client machine and enter BIOS setup to configure "MLNX IB" (for the ConnectX family) or "gPXE" (for the InfiniHost III family) to be the first on the boot device priority list (see Section A.6).

Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 45 sec for each port to come up.
9. C.2 Ethernet Performance Tuning

When the /etc/init.d/openibd script loads the mlx4_en driver, the following network stack parameters are applied:

• net.core.wmem_max = 16777216
• net.ipv4.tcp_timestamps = 0
• net.ipv4.tcp_sack = 0
• net.core.netdev_max_backlog = 250000
• net.core.rmem_max = 16777216
• net.core.rmem_default = 16777216
• net.core.wmem_default = 16777216
• net.core.optmem_max = 16777216
• net.ipv4.tcp_mem = 16777216 16777216 16777216
• net.ipv4.tcp_rmem = 4096 87380 16777216
• net.ipv4.tcp_wmem = 4096 65536 16777216

C.3 MPI Performance Tuning

To optimize bandwidth and message rate running over MVAPICH, you can set tuning parameters either using the command line or in the configuration file /usr/mpi/<compiler>/mvapich-<mvapich-ver>/etc/mvapich.conf.

Tuning Parameters in Configuration File

Edit the mvapich.conf file with the following lines:

    VIADEV_USE_COALESCE=1
    VIADEV_COALESCE_THRESHOLD_SQ=1
    VIADEV_PROGRESS_THRESHOLD=2

Tuning Parameters via Command Line

The following command tunes MVAPICH parameters:

    host1$ /usr/mpi/gcc/mvapic hostfile /home/<username>/cluster VIADEV_USE_COALESCE=1 VIAD
10. host2$ LD_PRELOAD=libsdp.so LIBSDP_CONFIG_FILE=$HOME/libsdp.conf netperf -H 11.4.17.6 -t TCP_RR -c -C -- -r1,1
    TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET
    Local /Remote
    Socket Size   Request  Resp.   Elapsed  Trans.     CPU     CPU     S.dem   S.dem
    Send   Recv   Size     Size    Time     Rate       local   remote  local   remote
    bytes  bytes  bytes    bytes   secs.    per sec    %       %       us/Tr   us/Tr
    16384  87380  1        1       10.00    37572.83   15.72   23.36   33.469  49.729
    16384  87380

The following table describes parameters for the netperf command:

    Option            Description
    -H                Where to find the server
    11.4.17.6         SDP/IPoIB IP address
    -t <Test Name>    Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc.
    -c                Client CPU utilization
    -C                Server CPU utilization
    --                Separates the global and test-specific parameters
    -r 1,1            The request size sent and how many bytes requested back

Note that the run example above produced the following results:

• Client CPU utilization is 15.72 percent of client CPU
• Server CPU utilization is 23.36 percent of server CPU
• Latency is 13.31 microseconds. Latency is calculated as follows:
  0.5 * (1 / Transaction rate per sec) * 1,000,000 = one-way average latency in usec

Step 7. To end the test, shut
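The latency formula quoted above can be checked against the reported transaction rate. The Python sketch below applies it to the 37572.83 transactions per second from the example run; the function name is illustrative.

```python
def one_way_latency_usec(transactions_per_sec: float) -> float:
    """One-way average latency in microseconds from a TCP_RR transaction rate.

    Each transaction is a request plus a response, so half of the time per
    transaction is attributed to each direction.
    """
    return 0.5 * (1.0 / transactions_per_sec) * 1_000_000


latency = one_way_latency_usec(37572.83)
print(f"{latency:.2f} usec")  # 13.31 usec
```

The result agrees with the 13.31 microseconds listed in the results.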
11. PKey

-v, --verbose
    This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -D option for more information about log verbosity.

-V
    This option sets the maximum verbosity level and forces log flushing. The -V option is equivalent to '-D 0xFF -d 2'. See the -D option for more information about log verbosity.

-D
    This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:

    BIT    LOG LEVEL ENABLED
    0x01   ERROR (error messages)
    0x02   INFO (basic messages, low volume)
    0x04   VERBOSE (interesting stuff, moderate volume)
    0x08   DEBUG (diagnostic, high volume)
    0x10   FUNCS (function entry/exit, very high volume)
    0x20   FRAMES (dumps all SMP and GMP frames)
    0x40   ROUTING (dump FDB routing information)
    0x80   currently unused

    Without -D, OpenSM defaults to ERROR + INFO (0x3). Specifying -D 0 disables all messages. Specifying -D 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with

-d, --debug
    This o... are no... select... OPT: -d0, -d1, -d2, -d3

-h, --help
    Displa...

11.2.2 Environment Variables
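The -D flags field described above is a plain bit mask, so a value can be decoded by testing each bit. The Python sketch below does this for the table's levels; `decode_flags` is an illustrative helper written for this example, not an OpenSM interface.

```python
# Bit values and level names as listed for the -D option; 0x80 is unused.
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_flags(flags: int) -> list:
    """Return the names of the log levels enabled by a -D flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items()) if flags & bit]


print(decode_flags(0x03))  # ['ERROR', 'INFO'], the default without -D
print(decode_flags(0x00))  # [], all messages disabled
print(decode_flags(0xFF))  # every named level
```

Combining levels is a bitwise OR: for example, 0x01 | 0x08 selects ERROR and DEBUG only.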
12. Step 61. Select New Installation, then click Finish in the Installation Mode window.

[Screenshot: Installation Mode window with New Installation selected under Select Mode]

Step 62. Select the appropriate Region and Time Zone in the Clock and Time Zone window, then click Finish.

[Screenshot: Clock and Time Zone window with the Region and Time Zone lists and the Hardware Clock Set To option (UTC)]

Step 63.
13.     base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

    Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0000:0000:0007:3897
        base lid:        0x1

12.8 ibportstate

Applicable Hardware
All InfiniBand devices.

Description
Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port. If the queried port is a switch port, then ibportstate can be used to:
• disable, enable, or reset the port
• validate the port's link width and speed against the peer port

Synopsis
    ibportstate [-d] [-e] [-v] [-V] [-D] [-G] [-s <smlid>] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] <dest dr_path|lid|guid> <portnum> <op> <value>

Table 11 lists the various flags of the command.

Table 11 - ibportstate Flags and Options

    Flag            Optional /   Default (If     Description
                    Mandatory    Not Specified)
    -h(help)        Optional                     Print the help menu
    -d(ebug)        Optional                     Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
    -e(rr_show)     Optional                     Show send and receive errors (timeouts and others)
    -v(erbose)      Optional                     Increase
14. • To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
• Executing srp_daemon over a port without the -a option will only display the targets that are reachable via the port and to which the initiator is not connected. If executing with the -e option, it is better to omit -a.
• It is recommended to use the -n option. This option adds the initiator_ext to the connecting string. See Section 8.2.5 for more details.
• srp_daemon has a configuration file that can be set, where the default is /etc/srp_daemon.conf. Use the -f option to supply a different configuration file that configures the targets srp_daemon is allowed to connect to. The configuration file can also be used to set values for additional parameters (e.g., max_cmd_per_lun, max_sect).
• A continuous background (daemon) operation, providing an automatic ongoing detection and connection capability. See Section 8.2.4.

8.2.4 Automatic Discovery and Connection to Targets

• Make sure that the ib_srp module is loaded, the SRP Initiator can reach an SRP Target, and that an SM is running.
• To connect to all the existing Targets in the fabric, run "srp_daemon -e -o". This utility will scan the fabric once, connect to every Target it detects, and then exit.
  Note: srp_daemon will follow the configuration it finds in /etc/srp_daemon.conf. Thus
15. For best UDP performance, use IPoIB in Datagram mode and not in Connected mode.

To verify IPoIB performance, perform the following steps:

Step 1. Download Netperf from the following URL: http://www.netperf.org/netperf/NetperfPage.html

Step 2. Compile Netperf by following the instructions at http://www.netperf.org/netperf/NetperfPage.html

Step 3. Start the Netperf server. The following example shows how to start the Netperf server:

    host1$ netserver
    Starting netserver at port 12865
    Starting netserver at hostname 0.0.0.0 port 12865 and family AF_UNSPEC
    host1$

Step 4. Run the Netperf client. The default test is the Bandwidth test. The following example shows how to run the Netperf client, which starts the Bandwidth test by default:

    host2$ netperf -H 11.4.17.6 -t TCP_STREAM -c -C -- -m 65536
    TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET
    Recv    Send    Send                          Utilization      Service Demand
    Socket  Socket  Message  Elapsed              Send    Recv     Send    Recv
    Size    Size    Size     Time     Throughput  local   remote   local   remote
    bytes   bytes   bytes    secs.    10^6bits/s  % S     % S      us/KB   us/KB
    87380   16384   65536    10.00    2483.00     Tie 1 431 03 42 1 854

Note: You must specify the IPoIB IP address when running the Netperf client.

Step 5.

The following table describes parameters fo
16. Figure 1: BXOFED Stack

[Figure: layered diagram of the BXOFED stack. Application level: Back-End, App. MiddleWare, Front-End, Cluster Management, User Diagnostics. User APIs: uverbs API, Sockets Layer, umad API. Upper layers: TCP, UDP, ICMP, IP, SCSI Mid-Layer, RDS, SDP. Drivers: mlx4_en, mlx4_fc, mlx4_ib over the verbs API, and the HCA driver mlx4_core on a Mellanox VPI Device (HCA/NIC). Legend: Linux, Linux Modules in OFED Currently Not Supported by Mellanox, Applications, OFED.]

The following sub-sections briefly describe the various components of the BXOFED stack.

1.4.1 mthca HCA (IB) Driver

mthca is the low-level driver implementation for the following Mellanox Technologies HCA (InfiniBand) devices: InfiniHost, InfiniHost III Ex, and InfiniHost III Lx.

1.4.2 mlx4 VPI Driver

mlx4 is the low-level driver implementation for the ConnectX adapters designed by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter, as an Ethernet NIC, or as a Fibre Channel HBA. To accommodate the supported configurations, the driver is split into four modules:

mlx4_core
    Handles low-level functions like device initialization and firmware commands processing. Also controls resource allocation so that the InfiniBand and Ethernet functions can share the device without interfering with each other.

mlx4_ib
    Handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.
17. cpio -id

The initrd files should now be found under /tmp/initrd_en.

Step 44. Create a directory for the ConnectX EN modules and copy them.

    host1$ mkdir -p /tmp/initrd_en/lib/modules/mlnx_en
    host1$ cd /lib/modules/`uname -r`/updates/kernel/drivers
    host1$ cp net/mlx4/mlx4_core.ko /tmp/initrd_en/lib/modules/mlnx_en
    host1$ cp net/mlx4/mlx4_en.ko /tmp/initrd_en/lib/modules/mlnx_en

Step 45. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:

    host1$ cp /sbin/insmod /tmp/initrd_en/sbin/

Step 46. If you plan to give your Ethernet device a static IP address, then copy ifconfig. Otherwise, skip this step.

    host1$ cp /sbin/ifconfig /tmp/initrd_en/sbin

Step 47. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_en/init and add the following lines at the point you wish the Ethernet driver to be loaded.

Warning: The order of the following commands (for loading modules) is critical.

    echo "loading Mellanox ConnectX EN driver"
    /sbin/insmod lib/modules/mlnx_en/mlx4_core.ko
    /sbin/insmod lib/modules/mlnx_en/mlx4_en.ko

Step 48. Now you can assign a static or dynamic IP address to your Mellanox ConnectX EN network interface.

Step 49. Save the init file.

Step 50.

Step 51.
18. Close initrd.

    host1$ cd /tmp/initrd_en
    host1$ find . | cpio -H newc -o > /tmp/new_initrd_en.img
    host1$ gzip /tmp/new_initrd_en.img

At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it properly.

A.1 iSCSI Boot

Mellanox ConnectX EN PXE enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd. There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via ConnectX EN PXE, and the second is for loading other parts of the OS via initrd.

Note: Linux distributions such as SuSE Linux Enterprise Server 10 SPx and Red Hat Enterprise Linux 5.1 can be directly installed on an iSCSI target. At the end of this direct installation, initrd is capable of continuing to load other parts of the OS on the iSCSI target. (Other distributions may also be suitable for direct installation on iSCSI targets.)

If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the adapter driver as described in Section B.7.1.

A.1.1 Configuring an iSCSI Target in Linux Environment

Prerequisites

Step a. Make sure that an iSCSI Target is installed on your server side.

Tip

Step 52.

Step 53.
19. > mstflint -d 04:00.0 v
    ConnectX failsafe image. Start address: 80000. Chunk size: 80000
    NOTE: The addresses below are contiguous logical addresses. Physical addresses on
    flash may be different, based on the image start address and chunk size
    0x00000038-0x000010db (0x0010a4)  BOOT2          OK
    0x000010dc-0x00004947 (0x00386c)  BOOT2          OK
    0x00004948-0x000052c7 (0x000980)  Configuration  OK
    0x000052c8-0x0000530b (0x000044)  GUID           OK
    0x0000530c-0x0000542f (0x000124)  Image Info     OK
    0x00005430-0x0000634f (0x000f20)  DDR            OK
    0x00006350-0x0000f29b (0x008f4c)  DDR            OK
    0x0000f29c-0x0004749b (0x038200)  DDR            OK
    0x0004749c-0x0005913f (0x011ca4)  DDR            OK
    0x00059140-0x0007a123 (0x020fe4)  DDR            OK
    0x0007a124-0x0007bdff (0x001cdc)  DDR            OK
    0x0007be00-0x0007eb97 (0x002d98)  DDR            OK

12.14 ibv_asyncwatch

Applicable Hardware
All InfiniBand devices.

Description
Display asynchronous events forwarded to userspace for an InfiniBand device.

Synopsis
    ibv_asyncwatch

Examples
1. Display asynchronous events.

    > ibv_asyncwatch
    mlx4_0: async event FD 4

Appendix A Boot over IB (BoIB)
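In the mstflint verification listing of excerpt 19, each row gives a start address, an end address, and a size, and the note states that the addresses are contiguous. The Python sketch below checks both properties for the first five rows of that listing; `check_layout` is an illustrative helper, not an mstflint feature.

```python
# (start, end, size, section) copied from the first five rows of the listing.
SECTIONS = [
    (0x00000038, 0x000010DB, 0x0010A4, "BOOT2"),
    (0x000010DC, 0x00004947, 0x00386C, "BOOT2"),
    (0x00004948, 0x000052C7, 0x000980, "Configuration"),
    (0x000052C8, 0x0000530B, 0x000044, "GUID"),
    (0x0000530C, 0x0000542F, 0x000124, "Image Info"),
]


def check_layout(sections):
    """Return a list of problems: size mismatches or non-contiguous rows."""
    problems = []
    previous_end = None
    for start, end, size, name in sections:
        if end - start + 1 != size:  # both addresses are inclusive
            problems.append(f"{name}: size mismatch")
        if previous_end is not None and start != previous_end + 1:
            problems.append(f"{name}: gap or overlap")
        previous_end = end
    return problems


problems = check_layout(SECTIONS)
print(problems)  # an empty list means sizes and contiguity both hold
```

Because the end address is inclusive, the size is end minus start plus one; for the first row, 0x10db - 0x38 + 1 = 0x10a4.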
20. ON_CODE < KERNEL_VERSION(2, 6, 24)
    static inline void sg_clear(struct scatterlist *sg)
    static inline struct page *sg_page(struct scatterlist *sg)

h. Patch scsi_tgt.h with /tmp/scsi_tgt.patch

    $ cd /usr/local/include/scst
    $ cp scst.h scsi_tgt.h
    $ patch -p0 < /tmp/scsi_tgt.patch

4. When you install Mellanox BXOFED, remember to choose srpt=y (install.pl).

D.2 How to run

A. On an SRP Target machine:

1. Please refer to SCST's README for loading the scst driver and its dev_handlers drivers (scst_disk, scst_vdisk block or file IO mode, nullio, ...).

Note: Regardless of the mode, you always need to have lun 0 in any group's device list. Then you can have any lun number following lun 0 (it is not required to have the lun numbers in ascending order, except that the first lun must always be 0).

Setting SRPT_LOAD=yes in /etc/infiniband/openib.conf is not enough, as it only causes the loading of the ib_srpt module but does not load scst and its dev_handlers.

Example 1: Working with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)

    modprobe scst
    modprobe scst_vdisk
    echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_t
21.     bond0_SLAVES=ib0,ib1
        bond8007_IP=20.10.10.1
        bond1_SLAVES=ib0.8007,ib1.8007

b. Restart the driver by running:

    /etc/init.d/openibd restart

2. Using a standard OS bonding configuration. For details on this, please read the documentation for the ib-bonding package under /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat, and /usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE.

Notes:
• If the bondX name is defined but one of bondX_SLAVES or bondX_IPs is missing, then that specific bond will not be created.
• The bondX name must not contain characters which are disallowed for bash variable names, such as '.' and '-'.
• Using /etc/infiniband/openib.conf to create a persistent configuration is not recommended. Do not use it unless you have no other option. It is not guaranteed that the first method will be supported in future versions of BXOFED.

4.7 IPoIB Performance Tuning

When IPoIB is configured to run in connected mode, TCP parameter tuning is performed at driver startup to improve the throughput of medium and large messages.

4.8 Testing IPoIB Performance

This section describes how to verify IPoIB performance by running the Bandwidth (BW) test and the Latency test. These tests are described in detail at the following URL: http://www.netperf.org/netperf/training/Netperf.html

Note
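The first two notes on bond definitions in excerpt 21 amount to a small validation rule: a bond is created only if its name is a legal bash variable name and both its slaves and its IP are defined. The Python sketch below expresses that rule. It is not the openibd logic; the helper name and the `bond0_IP` value are hypothetical, and accepting either `_IP` or `_IPs` as the suffix is an assumption made because the excerpt uses both spellings.

```python
import re

# A bash variable name: letters, digits, underscores, not starting with a digit.
_BASH_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def creatable_bonds(variables):
    """Return the bond names that satisfy both notes, given config variables."""
    names = {key.rsplit("_", 1)[0] for key in variables if "_" in key}
    bonds = []
    for name in sorted(names):
        has_slaves = f"{name}_SLAVES" in variables
        has_ip = f"{name}_IP" in variables or f"{name}_IPs" in variables
        if _BASH_NAME.match(name) and has_slaves and has_ip:
            bonds.append(name)
    return bonds


conf = {
    "bond0_IP": "10.10.10.1",          # hypothetical address for illustration
    "bond0_SLAVES": "ib0,ib1",
    "bond1_SLAVES": "ib0.8007,ib1.8007",  # no IP defined, so bond1 is skipped
}
print(creatable_bonds(conf))  # ['bond0']
```

A name such as `bond-2` fails the bash-name check even when both of its variables are present.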
22.     exit(EXIT_FAILURE);
    } else if (nw == 0) {
        printf("socket was closed by remote host\n");
    }
    printf("sent %zd bytes\n", nw);

    close(sd);
    return 0;
}

sdp_server.c Code

/*
 * Usage: ./sdp_server
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/epoll.h>
#include <errno.h>
#include <assert.h>

#define RXBUFSZ 2048
uint8_t rx_buffer[RXBUFSZ];

#define DEF_PORT 22222
#define AF_INET_SDP 27
#define PF_INET_SDP AF_INET_SDP

int main(int argc, char *argv[])
{
    int sd = socket(PF_INET_SDP, SOCK_STREAM, 0);
    if (sd < 0) {
        perror("socket failed");
        exit(EXIT_FAILURE);
    }

    struct sockaddr_in my_addr;
    my_addr.sin_family = AF_INET;
    my_addr.sin_port = htons(DEF_PORT);
    my_addr.sin_addr.s_addr = INADDR_ANY;

    int retbind = bind(sd, (struct sockaddr *)&my_addr, sizeof(my_addr));
    if (retbind < 0) {
        perror("bind failed");
        exit(EXIT_FAILURE);
    }

    int retlisten = listen(sd, 5 /* backlog */);
    if (retlisten < 0) {
        perror("listen failed");
        exit(EXIT_FAILURE);
23. [Screenshot: Installation Settings window, Expert tab. Keyboard Layout: English (US). Partitioning: create boot partition /dev/sda1 (70.5 MB) with ext2; create swap partition /dev/sda2 (502.0 MB); create root partition /dev/sda3 (7.4 GB) with reiserfs. Software: SUSE Linux Enterprise Server 10, GNOME Desktop Environment for Server, Server Base System, Novell AppArmor, Print Server; size of packages to install: 1.3 GB. Primary Language: English (US).]

Step 30. Select Base Partition Setup on This Proposal, then click Next.

[Screenshot: Suggested Partitioning window proposing the same three partitions on /dev/sda, with help text explaining the Accept Proposal and Custom Partition Setup choices]
24.                         Optional                     Use <smlid> as the target LID for SM/SA queries
    -C <ca_name>            Optional                     Use the specified channel adapter or router
    -P <ca_port>            Optional                     Use the specified port
    -t <timeout_ms>         Optional                     Override the default timeout for the solicited MADs [msec]

Table 12 - ibportstate Flags and Options

    Flag                         Optional /   Default (If     Description
                                 Mandatory    Not Specified)
    <dest dr_path|lid|guid>      Optional                     Destination's directed path, LID, or GUID
    <startlid>                   Optional                     Starting LID in an MLID range
    <endlid>                     Optional                     Ending LID in an MLID range

Examples
1. Dump all Lids with valid out ports of the switch with Lid 2.

    > ibroute 2
    Unicast lids [0x0 0] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
      Lid    Out Port   Destination Info
    0x0002   000        (Switch portguid 0x0002c902fffff00a: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0003   021        (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
    0x0006   007        (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
    0x0007   021        (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
25. • Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port).

II) QoS Setup (denoted by qos-setup)

This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in BXOFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location: /var/cache/opensm/opensm.opts).

III) QoS Levels (denoted by qos-levels)

Each QoS Level defines a Service Level (SL) and a few optional fields:
• MTU limit
• Rate limit
• PKey
• Packet lifetime

When a path search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS level that is mandatory to define is a DEFAULT QoS level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Level, it can also be explicitly referred to by any match rule.

IV) QoS Matching Rules (denoted by qos-match-rules)

Each PathRecord/MultiPathRecord query that OpenSM receives is matched against the set of matching rules. Rules are scanned in order of appearance in the QoS policy file, such that the first match takes precedence. Each rule has the name of a QoS level that will be applied to the matching query. A default QoS level is applied to a query that did not match any rule.

Queries can be matched by:
• Source port group (whether
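The rule-scanning behaviour described for QoS matching rules (scan in order of appearance, first match wins, DEFAULT otherwise) can be sketched in a few lines of Python. This is an illustration of the matching order only, not OpenSM code or its policy-file syntax; the rule dictionaries, the criterion keys `source_group` and `destination_group`, and the level names HIGH and LOW are hypothetical.

```python
DEFAULT_LEVEL = "DEFAULT"  # the mandatory fallback QoS level


def match_qos_level(query, rules):
    """Return the QoS level name of the first rule whose criteria all match."""
    for rule in rules:  # rules are scanned in order of appearance
        if all(query.get(key) == value for key, value in rule["match"].items()):
            return rule["qos_level"]
    return DEFAULT_LEVEL  # no rule matched the query


rules = [
    {"match": {"source_group": "storage"}, "qos_level": "HIGH"},
    {"match": {"source_group": "storage", "destination_group": "compute"},
     "qos_level": "LOW"},
]

query = {"source_group": "storage", "destination_group": "compute"}
print(match_qos_level(query, rules))                      # HIGH
print(match_qos_level({"source_group": "other"}, rules))  # DEFAULT
```

The second rule is more specific but is never reached for this query, which is why the order of rules in the policy file matters.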
26. To get the full vNic information, simply type:

    mlx4_vnic_info

or

    mlx4_vnic_info eth10

A typical output for a single interface looks something like:

    NETDEV_ NAME ETDEV LINK ETDEV_OPEN W_PORT_ID N N NETDEV_QSTOP G S GW_OPN GW_LID IB_PORT IB LOG LINK eth10 up yes no 0 0 0x800000 0x0004 mlx4 0 1 LinkUp IB PHY LINK IB MTU MTU RX_RINGS_NUM RX_RINGS_LIN SW_RSS SW_RSS_SIZE TX RINGS NUM TX RINGS ACT NDO_TSS NDO_TSS_SIZE PROMISC_MCAST CAST_MASK NO_BX RO ENABLED LRO NU ED NAPI ENAB VNIC PER PORT VNIC NUM VLAN USED GID GW PORT NAME SYSTEM NAME SYSTEM GUID Active 2048 1500 8 no yes 16 1 1 no 1 yes no yes 32 yes 1 2 Oxffff 0x0000 0x80060 0x0003 00 00 00 00 02 00 0x000 0 fe 80 00 00 00 00 00 00 00 02 c9 03 00 00 22 85 A10 unnamed system 00 00 00 00 00 05 67 90 GW_EPORT up

Other useful options for this script include:

    -h    print usage
    -v    print script version
    -l    list all virtual nics
    -s    print short info

6.5.2 6.5.3 ethtool

Another way to retrieve interface info and change its configuration is through the us
27. Step 28. Select the appropriate Region and Time Zone in the Clock and Time Zone window, then click Finish.

[Screenshot: Clock and Time Zone window with the Region and Time Zone lists, the Hardware Clock Set To option (UTC), and Time and Date shown as 07:52:06 - 24-03-2008]

Step 29. In the Installation Settings window, click Partitioning to get the Suggested Partitioning window.

[Screenshot: Installation Settings window]
28. When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock.
Note: LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between an HCA and a switch does not need virtual layers, as deadlock will not arise between a switch and an HCA.
In more detail, the algorithm works as follows:
1. LASH determines the shortest path between all pairs of source/destination switches. Note: LASH ensures the same SL is used for all SRC/DST - DST/SRC pairs, and there is no guarantee that the return path for a given DST/SRC will be the reverse of the route SRC/DST.
2. LASH then begins an SL assignment process, where a route is assigned to a layer (SL) if the addition of that route does not cause deadlock within that layer. This is achieved by maintaining and analyzing a channel dependency graph for each layer. Once the potential addition of a path could lead to deadlock, LASH opens a new layer and continues the process.
3. Once this stage has been completed, it is highly likely that the first layers processed will contain more paths than the latter ones. To better balance the use of layers, LASH moves paths from one layer to another so that the number of paths in each layer averages out.
Note that the implementation of LASH in opensm attempts to use as few layers as possi
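The layer-assignment step (step 2) can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not the OpenSM implementation: routes are given as lists of switch names, a channel is modeled as a directed link between two switches, and the function names route_channels, has_cycle and assign_layers are hypothetical. The balancing pass of step 3 is left out.

```python
from collections import defaultdict


def route_channels(route):
    """Return the directed links (channels) used by a route of switches."""
    return list(zip(route, route[1:]))


def has_cycle(deps):
    """Depth-first search for a cycle in a channel dependency graph.

    deps maps each channel to the set of channels that follow it
    on some route assigned to the layer.
    """
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)

    def visit(node):
        color[node] = GREY
        for nxt in deps.get(node, ()):
            if color[nxt] == GREY:
                return True          # back edge: dependency cycle
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in list(deps))


def with_route(deps, route):
    """Copy a layer's dependency graph and add one route to the copy."""
    trial = {chan: set(nxt) for chan, nxt in deps.items()}
    chans = route_channels(route)
    for chan in chans:
        trial.setdefault(chan, set())
    for first, second in zip(chans, chans[1:]):
        trial[first].add(second)
    return trial


def assign_layers(routes):
    """Assign each route to the first layer that stays cycle-free."""
    layers = []       # one channel dependency graph per layer
    assignment = []   # layer index chosen for each route
    for route in routes:
        for index, deps in enumerate(layers):
            trial = with_route(deps, route)
            if not has_cycle(trial):
                layers[index] = trial
                assignment.append(index)
                break
        else:
            # Every existing layer would deadlock: open a new layer.
            layers.append(with_route({}, route))
            assignment.append(len(layers) - 1)
    return assignment


# Four two-hop routes around a ring of switches A-B-C-D.
ring_routes = [["A", "B", "C"], ["B", "C", "D"],
               ["C", "D", "A"], ["D", "A", "B"]]
print(assign_layers(ring_routes))
```

In the ring example, the first three routes form a chain of dependencies and share a layer; the fourth would close the cycle, so the sketch places it in a second layer.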
29. Flag / Optional or Mandatory / Default (If Not Specified) / Description
  -h        Optional. Print the help menu.
  <device>  Optional. Default: all devices. Print information for the specified device. May specify more than one device.
  <port>    Optional, but requires specifying a device name. Default: all ports of the specified device. Print information for the specified port only (of the specified device).
Examples
1. List the status of all available InfiniBand devices and their ports.
  > ibstatus
  Infiniband device 'mlx4_0' port 1 status:
      default gid:  fe80:0000:0000:0000:0000:0000:0007:3896
      base lid:     0x3
      sm lid:       0x3
      state:        4: ACTIVE
      phys state:   5: LinkUp
      rate:         20 Gb/sec (4X DDR)
  Infiniband device 'mlx4_0' port 2 status:
      default gid:  fe80:0000:0000:0000:0000:0000:0007:3897
      base lid:     0x1
      sm lid:       0x1
      state:        4: ACTIVE
      phys state:   5: LinkUp
      rate:         20 Gb/sec (4X DDR)
  Infiniband device 'mthca0' port 1 status:
      default gid:  fe80:0000:0000:0000:0002:c900:0101:d151
      base lid:     0x0
      sm lid:       0x0
      state:        2: INIT
      phys state:   5: LinkUp
      rate:         10 Gb/sec (4X)
2. List the status of specific ports of specific devices.
  > ibstatus mthca0:1 mlx4_0:2
  Infiniband device 'mthca0' port 1 status:
      default gid:  fe80:0000:0000:0000:0002:c900:0101:d151
30. host1# mkdir /tmp/initrd_ib
  host1# cd /tmp/initrd_ib
Step 8. Normally, the initrd image is zipped. Extract it using the following command:
  host1# gzip -dc <initrd image> | cpio -id
The initrd files should now be found under /tmp/initrd_ib.
Step 9. Create a directory for the InfiniBand modules and copy them.
  host1# mkdir -p /tmp/initrd_ib/lib/modules/ib
  host1# cd /lib/modules/`uname -r`/updates/kernel/drivers
  host1# cp infiniband/core/ib_addr.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/ib_core.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/ib_mad.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/ib_sa.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/ib_cm.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/ib_uverbs.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/ib_ucm.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/ib_umad.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/iw_cm.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/rdma_cm.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/core/rdma_ucm.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp net/mlx4/mlx4_core.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/hw/mlx4/mlx4_ib.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/hw/mthca/ib_mthca.ko /tmp/initrd_ib/lib/modules/ib
  host1# cp infiniband/ulp/ipoib/ib_ipoib.ko /tmp/initrd_ib/lib/modules/ib
Step 10. IB requires loading an IPv6 module. If you d
31. on many systems, much higher bandwidth. Note that the default value of sdp_zcopy_thresh is 64KB, but it may be too low for some systems. You will need to experiment with your hardware to find the best value.
Testing SDP Performance
This section describes how to verify SDP performance by running the Bandwidth (BW) test and the Latency test. These tests are described in detail at the following URL: http://www.netperf.org/netperf/training/Netperf.html
To verify SDP performance, perform the following steps:
Step 1. Download Netperf from the following URL: http://www.netperf.org/netperf/NetperfPage.html
Step 2. Compile Netperf by following the instructions at http://www.netperf.org/netperf/NetperfPage.html
Step 3. Create the libsdp.conf configuration file.
  host1# cat > $HOME/libsdp.conf << EOF
  > use sdp server * *:*
  > use sdp client * *:*
  > EOF
Step 4. Start the Netperf server such that you force SDP to be used instead of TCP.
  host1# LD_PRELOAD=libsdp.so LIBSDP_CONFIG_FILE=$HOME/libsdp.conf netserver
  Starting netserver at port 12865
  Starting netserver at hostname 0.0.0.0 port 12865 and family AF_UNSPEC
  host1#
Step 5. Run the Netperf client such that you force SDP to be used instead of TCP. The default test is the Bandwidth test.
  host2# LD_PRELOAD=libsdp.so LIBSDP_CONFIG_FILE=$HOME/libsdp.conf netperf -H 11.4.17.6 -t TCP_STREAM -c -C -- -m 65536
  TCP STREAM TEST
32. /var/log/filename for root. For a regular user, write to /tmp/<filename>.<uid> if filename is not specified as a full path; otherwise, write to <path>/<filename>.<uid>.
min-level: verbosity level of the log:
  9  print errors only
  8  print warnings
  7  print connect and listen summary (useful for tracking SDP usage)
  4  print positive match summary (useful for config file debug)
  3  print negative match summary (useful for config file debug)
  2  print function calls and return values
  1  print debug messages
Examples:
To print SDP usage per connect and listen to STDERR, include the following statement:
  log min-level 7 destination stderr
A non-root user can configure libsdp.so to record function calls and return values in the file /tmp/libsdp.log.<pid> (root log goes to /var/log/libsdp.log for this example) by including the following statement in libsdp.conf:
  log min-level 2 destination file libsdp.log
To print errors only to syslog, include the following statement:
  log min-level 9 destination syslog
To print maximum output to the file /tmp/sdp_debug.log.<pid>, include the following statement:
  log min-level 1 destination file sdp_debug.log
Kernel Space SDP Debug
The SDP kernel module can log detailed trace information if you enable it using the debug_level va
33. 1 - The path traced is unhealthy
2 - Failed to parse command line options
3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
4 - Unable to traverse the LFT data from source to destination
5 - Failed to use Topology File
6 - Failed to load required Package
12.5 ibv_devices
Applicable Hardware: All InfiniBand devices.
Description: Lists InfiniBand devices available for use from userspace, including node GUIDs.
Synopsis:
  ibv_devices
Examples
1. List the names of all available InfiniBand devices.
  > ibv_devices
  device     node GUID
  mthca0     0002c9000101d150
  mlx4_0     0000000000073895
12.6 ibv_devinfo
Applicable Hardware: All InfiniBand devices.
Description: Queries InfiniBand devices and prints information about them that is available for use from userspace.
Synopsis:
  ibv_devinfo [-d <device>] [-i <port>] [-l] [-v]
Table 9 lists the various flags of the command.
Table 9 - ibv_devinfo Flags and Options
Flag / Optional or Mandatory / Default (If Not Specified) / Description
  -d <device>  Optional. Default: first found device. Run the command for the provided IB device 'device'
34. 10 07 8192 12 77 16384 18 36 32768 30 52 65536 48 92 131072 93 18 262144 1 92 524288 341 08 1048576 137 97 2097152 1129 27 4194304 4226 58 9 6 4 Intel MPI Benchmark To run the Intel MPI Benchmark test enter host1 usr mpi gcc openmpi lt ompi ver gt bin mpirun np 2 mca mpi leave pinned 1 hostfile home lt username gt cluster usr mpi gcc openmpi lt ompi ver gt tests IMB lt IMB ver gt IMB MPI1 Intel R MPI Benchmark Suite V3 0 MPI 1 part Date Mon Mar 10 12 57 18 2008 Machine x86 64 System Linux Release 2 6 16 21 0 8 smp Version 1 SMP Mon Jul 3 18 25 39 UTC 2006 MPI Version 20 MPI Thread Environment MPI_THREAD SINGLE Minimum message length in bytes 0 Maximum message length in bytes 4194304 MPI Datatype MPI_BYTE MPI Datatype for reductions MPI FLOAT MPI Op MPI SUM Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential List of Benchmarks to run PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce scatter Allgather Allgatherv Alltoall Alltoallv Bcast Barrier Benchmarking PingPong processes 2 bytes repetitions t usec Mbytes sec 0 000 47 0 00 1 000 97 0 61 2 000 2156 1 22 4 000 53 2 49 8 000 300 4 92 16 000 60 952 32 000 62 18 86 64 000 61 37 90 128 000 80 67 65 256 000 2 05 119 26 512 000 2 67 183 08 1024 000 3 74 260 97 2048 000 6 15 317 84 4096 000 10 15 384 74 8192 000 12 75 612 84 16384 000 18 47 845 8
35. How to Unload / Shutdown
mlx4 Module Parameters
F.1 mlx4_core Parameters
F.2 mlx4_ib Parameters
F.3 mlx4_en Parameters
F.4 mlx4_fc Parameters
List of Tables
Table 1: Typographical Conventions
Table 2: Abbreviations and Acronyms
Table 3: Reference Documents
Table 4: mlnxofedinstall Return Codes
Table 5: Supported ConnectX Port Configurations
Table 6: Useful MPI Links
Table 7: ibdiagnet Output Files
Table 8: ibdiagpath Output Files
Table 9: ibv_devinfo Flags and Options
Table 10: ibstatus Flags and Options
Table 11: ibportstate Flags and Options
Table 12: ibportstate Flags and Options
Table 13: smpquery Flags and Options
Table 14: perfquery Flags and Options
Table 15: ibcheckerrs Flags and Options
Table 16: mstflint Switches
Table 17: mstflint Commands
Table 18: Supported Mellanox Tech
36. IB Cluster/Fabric/Subnet
  A set of IB devices connected by IB cables.
In-Band
  A term assigned to administration activities traversing the IB connectivity only.
LID
  An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
Local Device/Node/System
  The IB Host Channel Adapter (HCA) card installed on the machine running IBDIAG tools.
Local Port
  The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager
  The Subnet Manager that is authoritative, that is, the one that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables
  A table that exists in every switch, providing the list of ports to which received multicast packets are forwarded. The table is organized by MLID.
Network Interface Card (NIC)
  A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
Standby Subnet Manager
  A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.
Subnet Administrator (SA)
  An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
Subnet Manager (SM)
  One of several entities involv
37. IPoIB
EoIB
The Ethernet over IB (EoIB) mlx4_vnic module is a network interface implementation over InfiniBand. EoIB encapsulates Layer 2 datagrams over an InfiniBand Datagram (UD) transport service. The InfiniBand UD datagrams encapsulate the entire Ethernet L2 datagram and its payload. For details, see Chapter 6, "EoIB".
RDS
Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. For more details, see Chapter 5, "RDS".
SDP
Sockets Direct Protocol (SDP) is a byte-stream transport protocol that provides TCP stream semantics. SDP utilizes InfiniBand's advanced protocol offload capabilities. Because of this, SDP can have lower CPU and memory bandwidth utilization when compared to conventional implementations of TCP, while preserving the TCP APIs and semantics upon which most current network applications depend. For more details, see Chapter 7, "SDP".
SRP
SRP (SCSI RDMA Protocol) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP driver, known as the SRP Initiator, differs from traditional low-level SCSI drivers in Linux. The SRP Initiator does not control a local HBA
38. 11.5.4 Fat-tree Routing Algorithm
11.5.5 LASH Routing Algorithm
11.5.6 DOR Routing Algorithm
11.5.7 Routing References
11.5.8 Modular Routing Engine
11.6 Quality of Service Management in OpenSM
11.6.1 Overview
11.6.2 Advanced QoS Policy File
11.6.3 Simple QoS Policy Definition
11.6.4 Policy File Syntax Guidelines
11.6.5 Examples of Advanced Policy File
11.6.6 Simple QoS Policy - Details and Examples
11.6.6.1 IPoIB
11.6.6.2 SDP
11.6.6.3 RDS
11.6.6.4 iSER
11.6.6.5 SRP
11.6.6.6 MPI
11.6.7 SL2VL Mapping and VL Arbitration
11.6.8 Deployment Example
11.7 QoS Configuration Examples
11.7.1 Typical HPC Example: MPI and Lustre
11.7.2 EDC SOA (2-tier): IPoIB and SRP
11.7.3 EDC (3-ti
39. OPTIONS are:
--version
  Prints OpenSM version and exits.
-F, --config <config file>
  The name of the OpenSM config file. If not specified, /etc/opensm/opensm.conf will be used (if it exists).
-c, --create-config <file name>
  OpenSM will dump its configuration to the specified file and exit. This is one way to generate an OpenSM configuration file template.
-g, --guid <GUID in hexadecimal>
  This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port at a time. If the GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
-l, --lmc <LMC value>
  This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports (i.e., multiple interconnects between switches). Without -l, OpenSM defaults to LMC = 0, which allows one path between any two ports.
-p, --priority <priority value>
  This option specifies the SM's priority. This will affect the handover cases, where the master is chosen by priority and GUID. Range is 0 (default and lowest priorit
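As a worked illustration of the LMC rule above (the number of LIDs assigned to each port is 2^LMC, with LMC in the range 0-7), the following sketch tabulates the values. The function name lids_per_port is hypothetical and is not part of OpenSM.

```python
def lids_per_port(lmc):
    """Number of LIDs assigned to each port for a given LMC value."""
    if not 0 <= lmc <= 7:
        raise ValueError("LMC must be in the range 0-7")
    return 2 ** lmc


# Tabulate every legal LMC value.
for lmc in range(8):
    print(f"LMC {lmc}: {lids_per_port(lmc)} LID(s) per port")
```

With the default LMC of 0, each port gets a single LID, which matches the statement that only one path exists between any two ports.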
40. OST1,OST2,OST3,OST4 : 1 # SL for Lustre OST
      any, target-port-guid MDS1,MDS2 : 2 # SL for Lustre MDS
  end-qos-ulps
• OpenSM options file
  qos_max_vls 8
  qos_high_limit 0
  qos_vlarb_high 2:1
  qos_vlarb_low 0:96,1:224
  qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
11.7.2 EDC SOA (2-tier): IPoIB and SRP
The following is an example of QoS configuration for a typical enterprise data center (EDC) with service oriented architecture (SOA), with IPoIB carrying all application traffic and SRP used for storage.
QoS Levels
• Application traffic
  • IPoIB (UD and CM) and SDP
  • Isolated from storage
  • Min BW of 50%
• SRP
  • Min BW 50%
  • Bottleneck at storage nodes
Administration
• OpenSM QoS policy file
Note: In the following policy file example, replace SRPT* with the real SRP Target port GUIDs.
  qos-ulps
      default : 0
      ipoib : 1
      sdp : 1
      srp, target-port-guid SRPT1,SRPT2,SRPT3 : 2
  end-qos-ulps
• OpenSM options file
  qos_max_vls 8
  qos_high_limit 0
  qos_vlarb_high 1:32,2:32
  qos_vlarb_low 0:1
  qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15
11.7.3 EDC (3-tier): IPoIB, RDS, SRP
The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
QoS Levels
• Management traffic (ssh)
  • IPoIB management VLAN (partition
41. Target iqn.2007-08.7.3.4.10:iscsiboot
Step 20. Start your iSCSI Target. Example:
  host1# /etc/init.d/iscsitarget start
Configuring the DHCP Server to Boot From an iSCSI Target
Configure DHCP as described in Section A.3.1, "Configuring the DHCP Server". Edit your DHCP configuration file (/etc/dhcpd.conf) and add the following lines for the machine(s) you wish to boot from the iSCSI Target:
  filename "";
  option root-path "iscsi:iscsi_target_ip::::iscsi_target_iqn";
The following is an example for configuring an IB device to boot from an iSCSI Target:
  host host1 {
      filename "";
      # For a ConnectX device comment out the following line
      # option dhcp-client-identifier = 00:02:c9:03:00:00:10:39;
      # For an InfiniHost III Ex comment out the following line
      # option dhcp-client-identifier = fe:00:55:00:41:fe:80:00:00:00:00:00:00:00:02:c9:03:00:00:0d:41;
      option root-path "iscsi:11.4.3.7::::iqn.2007-08.7.3.4.10:iscsiboot";
  }
iSCSI Boot Example of SLES 10 SP2 OS
This section provides an example of installing the SLES 10 SP2 operating system on an iSCSI target and booting from a diskless machine via BoIB. Note that the procedure described below assumes the following:
• The client's LAN card is recognized during installation
• The iSCSI target can be connected to the client via LAN and InfiniB
42. 7.3 Configuring SDP
7.3.1 How to Know SDP Is Working
7.3.2 Monitoring and Troubleshooting Tools
7.4 Environment Variables
7.5 Converting Socket-based Applications
7.6 BZCopy - Zero Copy Send
7.7 Testing SDP Performance
Chapter 8 SRP
8.1 Overview
8.2 SRP Initiator
8.2.1 Loading SRP Initiator
8.2.2 Manually Establishing an SRP Connection
8.2.3 SRP Tools - ibsrpdm and srp_daemon
8.2.4 Automatic Discovery and Connection to Targets
8.2.5 Multiple Connections from Initiator IB Port to the Target
8.2.6 High Availability (HA)
8.2.7 Shutting Down SRP
Chapter 9 MPI
9.1 Overview
9.2 Prerequisites for Running MPI
43. Technologies Confidential Rev 1 50 InfiniBand Fabric Diagnostic Utilities Dump all Lids with valid out ports of the switch with Lid 2 gt aloiouce 2 Unicas lids 00 06 Cue swiccm Dio 2 Gube 0x0002c902fffff00a MT47396 Infiniscale Mellanox Technologies MORRONE Destination Rore Info 0x0002 000 Switch portguid 0x0002c902fffffO00a MT47396 Infiniscale Mellanox Technologies 0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale Mellanox Technologies 0x0006 007 Channel Adapter portguid 0x002 OSOS S HCA 1T 0x0007 021 Channel Adapter portguid 0x0002c9020025874a sw157 HCA 1 Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2 gt dioronte 2 3 7 Unicas lies 0 lt S3S 0 7 Morse 2 guie 0x0002c902fffff00a MT47396 Infiniscale Mellanox Technologies Liel Oue Destination POLE Info 0x0003 021 Switch portguid 0x000b8cffff004016 MT47396 Infiniscale Mellanox Technologies 0x0006 007 Channel Adapter portguid 0x0002c90300001039 sw137 HCA 1 Mellanox Technologies Mellanox Technologies Confidential 143 Dump all Lids with valid out ports of the switch with portguid gt ibroute G 0x000b8cffff004016 Unicas lides Ox0 Oxds or Sswircen hist 3 Gule 0x000b8c 004016 MT47396 Infiniscale Mellanox Technolo
44. 1.4.7 InfiniBand Subnet Manager
1.4.8 Diagnostic Utilities
1.4.9 Performance Utilities
1.5 Quality of Service
Chapter 2 Installation
2.1 Hardware and Software Requirements
2.1.1 Hardware Requirements
2.1.2 Software Requirements
2.2 Downloading BXOFED
2.3 Installing BXOFED
2.3.1 Pre-installation Notes
2.3.2 Installation Script
2.3.2.1 Install Return Codes
2.4 Uninstalling BXOFED
Chapter 3 Working With VPI
3.1 Port Type Management
3.2 InfiniBand Driver
3.3 Ethernet Driver
3.3.1 Overview
3.3.2 Loading the Ethernet Driver
3.3.3 Unloading
45. address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
An IPoIB configuration can be based on DHCP (Section 4.3.1) or on a static configuration (Section 4.3.2) that you need to supply. You can also apply a manual configuration that persists only until the next reboot or driver restart (Section 4.3.3).
4.3.1 IPoIB Configuration Based on DHCP
Setting an IPoIB interface configuration based on DHCP (v3.1.2, which is available via www.isc.org) is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that IPoIB configuration files include the following line:
  For RedHat: BOOTPROTO=dhcp
  For SLES: BOOTPROTO=dhcp
Note: If IPoIB configuration files are included, ifcfg-ib<n> files will be installed under:
  /etc/sysconfig/network-scripts/ on a RedHat machine
  /etc/sysconfig/network/ on a SuSE machine
Note: A patch for DHCP is required for supporting IPoIB. The patch file for DHCP v3.1.2, dhcp.patch, is available under the docs/ directory.
Standard DHCP fields holding MAC addresses are not large enough to contain an IPoIB hardware address. To overcome this problem, DHCP over Infin
46. addresses. The following code lines are an excerpt from a sample IPoIB configuration file:
  # Static settings; all values provided by this file
  IPADDR_ib0=11.4.3.175
  NETMASK_ib0=255.255.0.0
  NETWORK_ib0=11.4.0.0
  BROADCAST_ib0=11.4.255.255
  ONBOOT_ib0=1
  # Based on eth0; each '*' will be replaced with a corresponding octet
  # from eth0.
  LAN_INTERFACE_ib0=eth0
  IPADDR_ib0=11.4.'*'.'*'
  NETMASK_ib0=255.255.0.0
  NETWORK_ib0=11.4.0.0
  BROADCAST_ib0=11.4.255.255
  ONBOOT_ib0=1
  # Based on the first eth<n> interface that is found (for n=0,1,...);
  # each '*' will be replaced with a corresponding octet from eth<n>.
  LAN_INTERFACE_ib0=
  IPADDR_ib0=11.4.'*'.'*'
  NETMASK_ib0=255.255.0.0
  NETWORK_ib0=11.4.0.0
  BROADCAST_ib0=11.4.255.255
  ONBOOT_ib0=1
4.3.3 Manually Configuring IPoIB
To manually configure IPoIB for the default IB partition (VLAN), perform the following steps:
Note: This manual configuration persists only until the next reboot or driver restart.
Step 1. To configure the interface, enter the ifconfig command with the following items:
• The appropriate IB interface (ib0, ib1, etc.)
• The IP address that you want to assign to the interface
• The netmask keyword
• The subnet mask that you wa
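The octet substitution described in the sample file ("each will be replaced with a corresponding octet from eth0") can be sketched as follows. This is an illustration only, not the code the driver scripts use: it assumes the wildcard is written as '*' (optionally quoted), and the function name fill_wildcards and the sample LAN address 10.0.3.175 are hypothetical.

```python
def fill_wildcards(template, lan_ip):
    """Replace each wildcard octet in template with the octet found
    at the same position in lan_ip."""
    template_octets = template.replace("'", "").split(".")
    lan_octets = lan_ip.split(".")
    if len(template_octets) != 4 or len(lan_octets) != 4:
        raise ValueError("expected dotted-quad IPv4 addresses")
    return ".".join(
        lan if tmpl == "*" else tmpl
        for tmpl, lan in zip(template_octets, lan_octets)
    )


# Hypothetical eth0 address; the first two octets stay fixed at 11.4.
print(fill_wildcards("11.4.'*'.'*'", "10.0.3.175"))
```

With the hypothetical eth0 address above, the result is 11.4.3.175, the same address used in the static example.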
47. and Open MPI RPMs Note that more than one compiler can be selected simultaneously if desired Compiling MVAPICH Applications Please refer to http mvapich cse ohio state edu support mvapich user guide html To review the default configuration of the installation check the default configuration file usr mpi lt compiler gt mvapich lt mvapich ver gt etc mvapich conf Compiling Open MPI Applications Please refer to http www open mpi org faq category mpi apps Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 9 5 OSU MVAPICH Performance 9 5 1 Requirements At least two nodes Example host1 host2 e Machine file Includes the list of machines Example host1 cat home lt username gt cluster hostl host2 host1 9 5 2 Bandwidth Test Performance To run the OSU Bandwidth test enter host1 usr mpi gcc mvapich lt mvapich ver gt bin mpirun rsh np 2 hostfile home lt username gt cluster usr mpi gcc mvapich lt mvapich ver gt tests osu benchmarks lt osu ver gt osu bw OSU MPI Bandwidth Test v3 0 Size Bandwidth MB s 1 4 62 2 8 91 4 17 70 8 32 59 16 60 13 32 L13721 64 194 22 128 29320 256 549 43 512 88323 1024 1096 65 2048 1165 60 4096 123391 8192 1230 90 16384 1308 92 32768 1414 75 65536 1465 28 131072 1500 36 262144 1515 26 524288 1525 20 1048576 1527 63 2097152 1530 48 4194304 1537 50 Mellanox T
48. are RN A AA VLO 7 IN ST E Tr an 0x00 A aa IS ek ee 4 VAD HUGH CAPS naa o as 8 VIAS LOWCA Deki trate te trae tee teehee Moe RN 8 PRISER GS DYL SSA at at N 0x00 MET COP Aa Lea a nero e et centers 2048 VISA MCO IEEE A 0 HO QUES eae ates a tr e Lol ia ste a tro E e sa P EL VISS TER E ker UR A OA VLO 3 BARES MA a 0 PERE alt RISOLTI o ettaro do 0 Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 147 2 Query SwitchInfo by GUID gt smpquery G switchinfo 0x000b8cffff004016 Switch info Lid 3 ES AECI te 49152 RaNdOnBabCaprRE n wun nn 0 Meals HE Clb RED RR O Inno 1024 E SEDO DAG a 8 DONAR RS Hea che ERA cr SE 0 DSIEMSSSCPicMMPOIMCS son EROE 0 DSTIMCASICINOUEPCIMPOWIES os an o0acgaone 0 SE O RIA I NA I 18 SIERECCHANGCHRRPI ao RA 0 ETS SR ZERI A 0 Par CENFORCE CAPE A a ee IZ NOUS MES E MITE SOOO Gobo QUEDO dBat Eni aae EEE 3 Query Nodelnfo by direct route gt smpquery D nodeinfo 0 i Noce laos DR parca slic 655357 dlid 6333597 O BaS CVEN SRA E A E RN SATSA Claseverssosesconsorsscooobo ooo sl NO GIST SSP ti oa do ens to ho Channel Adapter NOMPO TESE AEN T 6 oe OC ees A DES 2 Syot emeut IR oa ae 0x0002c9030000103b GUIANA Rein 0x0002c90300001038 Ront CU Do mh CIT LI de aprte 0x0002c90300001039 A o OO O OOO oo 128 DE mee hay O IA 0x634a REVISION NI Sti 0x000000a0 12 11 perfquery Applicable Hardware All InfiniBand devices Mellanox Technologi
49. be captured from the boot session, as shown in the figure below.
(Figure: BoIB boot session screen showing net0 on PCI02:00.0, link up.)
Placing Client Identifiers in /etc/dhcpd.conf
The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server:
  host host1 {
      next-server 11.4.3.7;
      filename "pxelinux.0";
      fixed-address 11.4.3.130;
      option dhcp-client-identifier = 00:02:c9:03:00:00:10:39;
  }
A.3.1.2 For InfiniHost III Family Devices (PCI Device IDs: 25204, 25218)
When a BoIB client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between the various DHCP sessions. The value of the client identifier is composed of 21 bytes (separated by colons) having the following components:
  20:<QP Number - 4 bytes>:<GID - 16 bytes>
Note: Bytes are represented as two hexadecimal digits.
Extracting the Client Identifier - Method
The following steps describe one method for extracting the client identifier:
1. QP Number equals 00:55:04:01 for InfiniHost III Ex and InfiniHost III Lx HCAs.
2. GID is composed of an 8-byte subnet prefix and an 8-byte Port GUID. The subnet prefix is fixed for the supported Mellanox HCAs, and is equal to fe:80:00:00:00:00:0
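The 21-byte layout described above (one byte 20, a 4-byte QP Number, then a 16-byte GID made of an 8-byte subnet prefix and an 8-byte Port GUID) can be assembled as in this sketch. The function name client_identifier is hypothetical. The subnet prefix value is truncated in the text above, so the sketch assumes the default link-local prefix fe:80:00:00:00:00:00:00; the Port GUID in the usage line is only a sample value.

```python
def client_identifier(port_guid,
                      qp_number="00:55:04:01",
                      subnet_prefix="fe:80:00:00:00:00:00:00"):
    """Build a 21-byte DHCP client identifier string.

    qp_number defaults to the value stated for InfiniHost III Ex/Lx.
    subnet_prefix is an assumed default (link-local prefix).
    """
    parts = (["20"] + qp_number.split(":")
             + subnet_prefix.split(":") + port_guid.split(":"))
    if len(parts) != 21 or any(len(p) != 2 for p in parts):
        raise ValueError("identifier must be 21 two-digit hex bytes")
    return ":".join(parts)


# Sample Port GUID, for illustration only.
print(client_identifier("00:02:c9:03:00:00:0d:41"))
```

The length check guards against a GUID or QP Number with the wrong number of bytes, which would otherwise yield an identifier the DHCP server could not match.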
50. between hosts. A packet that is sent from the host on a specific EoIB interface will be routed to the Ethernet subnet through a specific external port connection on the BridgeX box.
Virtual Hubs (vHubs)
A virtual hub (vHub) connects zero or more EoIB interfaces (on internal hosts) and an eport. Each vHub has a unique virtual LAN (VLAN) ID. Virtual hub participants can send packets to one another directly, without the assistance of the Ethernet subnet (external side) routing. This means that two EoIB interfaces on the same vHub will communicate solely using the InfiniBand fabric. EoIB interfaces residing on two different vHubs (whether on the same GW or not) cannot communicate directly.
There are two types of vHub: a default vHub (one per GW), which has no VLAN ID, and vHubs that have unique, different VLAN IDs. Each vHub belongs to a specific GW (BridgeX + eport), and each GW has one default vHub and zero or more VLAN-associated vHubs. A specific GW can have multiple vHubs distinguishable by their unique virtual LAN (VLAN) ID. Traffic coming from the Ethernet side on a specific eport will be routed to the relevant vHub group based on its VLAN tag (or to the default vHub for that GW if no VLAN ID is present).
Virtual NIC (vNic)
A virtual NIC is a network interface instance on the host side. The vNic behaves like a regular HW network interface. A vNic belongs to a single vHub on a specific GW. The host can have multiple inter
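The vHub selection rule for traffic arriving from the Ethernet side (use the vHub matching the VLAN tag, or the GW's default vHub when no VLAN ID is present) can be modeled with a small lookup. This is a conceptual sketch, not BridgeX code: the Gateway class, its fields and the select_vhub method are hypothetical, and returning None for a tag with no matching vHub is an assumption, since the text does not say what happens in that case.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Gateway:
    """Hypothetical model of one GW (BridgeX + eport)."""
    default_vhub: str
    vlan_vhubs: Dict[int, str] = field(default_factory=dict)

    def select_vhub(self, vlan_id: Optional[int]) -> Optional[str]:
        """Pick the vHub for a frame arriving on this GW's eport."""
        if vlan_id is None:
            return self.default_vhub       # untagged: default vHub
        # Assumption: an unknown VLAN tag matches no vHub.
        return self.vlan_vhubs.get(vlan_id)


gw = Gateway(default_vhub="vhub-default",
             vlan_vhubs={10: "vhub-vlan10", 20: "vhub-vlan20"})
print(gw.select_vhub(None), gw.select_vhub(10), gw.select_vhub(30))
```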
51. default flow testing is performed.
-t
  This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
-f
  This option defines the log to be the given file. By default, the log goes to /var/log/osm.log. For the log to go to standard output, use -f stdout.
-v
  This option increases the log verbosity level. The -v option may be specified multiple times to further increase the verbosity level. See the -vf option for more information about log verbosity.
-V
  This option sets the maximum verbosity level and forces log flushing. The -V is equivalent to '-vf 0xFF -d 2'. See the -vf option for more information about log verbosity.
-D
  This option sets the log verbosity level. A flags field must follow the -D option. A bit set/clear in the flags enables/disables a specific log level as follows:
  BIT    LOG LEVEL ENABLED
  0x01   ERROR (error messages)
  0x02   INFO (basic messages, low volume)
  0x04   VERBOSE (interesting stuff, moderate volume)
  0x08   DEBUG (diagnostic, high volume)
  0x10   FUNCS (function entry/exit, very high volume)
  0x20   FRAMES (dumps all SMP and GMP frames)
  0x40   ROUTING (dump FDB routing information)
  0x80   currently unused
Without
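The flags field is a plain bitmask, so a value can be decoded by testing each bit from the table above. The sketch below is illustrative only; the names LOG_LEVELS and decode_log_flags are hypothetical and do not come from OpenSM.

```python
# Bit values and level names as listed in the table above.
LOG_LEVELS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
    # 0x80 is currently unused
}


def decode_log_flags(flags):
    """Return the names of the log levels enabled by a flags value."""
    return [name for bit, name in sorted(LOG_LEVELS.items())
            if flags & bit]


print(decode_log_flags(0x03))   # error and basic messages
print(decode_log_flags(0xFF))   # every defined level
```

For example, a flags value of 0x03 enables only the ERROR and INFO levels, while 0xFF enables all seven defined levels.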
52. down.
Note: A link state of down on a host-administered vNic, when the BridgeX is connected and the InfiniBand fabric seems OK, is a good indication of a misconfigured BridgeX system name, system GUID, or eport. Check the values of BXADDR and BXEPORT in the configuration file.
To query the link state, run ifconfig <interface name> and check for RUNNING in the result text. Example:
  ifconfig eth2
  eth2  Link encap:Ethernet  HWaddr 00:25:8B:00:04:00
        inet6 addr: fe80::225:8bff:fe00:400/64 Scope:Link
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
        RX packets:49 errors:0 dropped:11 overruns:0 frame:0
        TX packets:25 errors:0 dropped:0 overruns:0 carrier:0
        collisions:0 txqueuelen:1000
        RX bytes:11278 (11.0 KiB)  TX bytes:5821 (5.6 KiB)
An alternative is to use ethtool <interface name> and test for "Link detected". Example:
  ethtool eth2
  Settings for eth2:
        Supported ports: [ ]
        Supported link modes:
        Supports auto-negotiation: No
        Advertised link modes: Not reported
        Advertised auto-negotiation: No
        Speed: Unknown! (10000)
        Duplex: Full
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: internal
        Auto-negotiation: off
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000000 (0)
        Link detected: yes
6.6 Bonding Driver
EoIB uses the standard Linux bonding driver. For more information on the Linu
53. • Listen on both TCP and SDP by any server that listens on port 8080:
  use both server * *:8080
• Connect ssh through SDP and fall back to TCP to hosts on 11.4.8.* port 22:
  use both connect * 11.4.8.0/24:22

Explicit (Non-transparent) Conversion
Use explicit conversion if you need to maintain full control from your application while using SDP. To configure an explicit conversion to use SDP, simply recompile the application, replacing PF_INET (or AF_INET) with AF_INET_SDP (or PF_INET_SDP) when calling the socket() system call in the source code. The value of AF_INET_SDP is defined in the file sdp_socket.h, or you can define it inline:
  #define AF_INET_SDP 27
  #define PF_INET_SDP AF_INET_SDP
You can compile and execute the following very simple TCP application that has been converted explicitly to SDP.
Compilation:
  gcc sdp_server.c -o sdp_server
  gcc sdp_client.c -o sdp_client
Usage:
  Server: host1$ ./sdp_server
  Client: host1$ ./sdp_client <server IPoIB addr>
Example:
  Server:
  host1$ ./sdp_server
  accepted connection from 15.2.2.42:48710
  read 2048 bytes
  end of test
  host1$
  Client:
  host2$ ./sdp_client 15.2.2.43
  connected to 15.2.2.43:22222
  sent 2048 bytes
  host2$
sdp_client.c Code: usage: sdp_client <ip addr> include include include include includ
54. fw-25408
  MT25208 InfiniHost III Ex   25218 (0x6282)   fw-25218
  MT25204 InfiniHost III Lx   25204 (0x6274)   fw-25204

A.1.2 Tested Platforms
See the Boot over IB Release Notes (boot_over_ib_release_notes.txt).

A.1.3 BoIB in Mellanox BXOFED
The Boot over IB binary files are provided as part of BXOFED. The following binary files are included:
1. PXE binary files for Mellanox HCA devices
  • HCA: Single/Dual port ConnectX IB SDR (PCI DevID: 25408)
    CONNECTX_IB_25408_ROM-X.X.XXX.rom
  • HCA: Single/Dual port ConnectX IB DDR (PCI DevID: 25418)
    CONNECTX_IB_25418_ROM-X.X.XXX.rom
  • HCA: Single/Dual port ConnectX IB DDR & PCI Express 2.0 5.0GT/s (PCI DevID: 26418)
    CONNECTX_IB_26418_ROM-X.X.XXX.rom
  • HCA: Single/Dual port ConnectX IB QDR & PCI Express 2.0 5.0GT/s (PCI DevID: 26428)
    CONNECTX_IB_26428_ROM-X.X.XXX.rom
  • HCA: InfiniHost III Ex in Mem-Free mode (PCI DevID: 25218)
    IHOST3EX_PORT1_ROM-X.X.XXX.rom (IB Port 1)
    IHOST3EX_PORT2_ROM-X.X.XXX.rom (IB Port 2)
  • HCA: InfiniHost III Lx (PCI DevID: 25204)
    IHOST3LX_ROM-X.X.XXX.rom (single IB Port device)
2. Additional documents under docs/
  dhcpd.conf: sample DHCP configuration file
  dhcp.patch: patch file for DHCP v3.1.2

A.2 Burning the Expansion ROM Image
The binary code resides in the same Flash device of the device firmware. Note that the binary files are distinct and
55. > > -h --help -V --version --vars
• Source and destination LIDs (source may be omitted, in which case the local port is assumed to be the source).
• Directed route from the local node (which is the source) and the destination node.
• The minimal number of packets to be sent across each link (default = 100).
• Enable verbose mode.
• Specifies the topology file name.
• Specifies the local system name. Meaningful only if a topology file is specified.
• Specifies the index of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).
• Specifies the local device's port number used to connect to the IB fabric.
• Specifies the directory where the output files will be placed (default = /tmp).
• Specifies the expected link width.
• Specifies the expected link speed.
• Dump all the fabric links, pm Counters into ibdiagnet.pm.
• Reset all the fabric links pmCounters.
• If any of the provided pm is greater than its provided value, print it to screen.
• Prints the help page information.
• Prints the version of the tool.
• Prints the tool's environment variables and their values.

12.4.2 Output Files
Table 8 - ibdiagpath Output Files
  Output File       Description
  ibdiagpath.log    A dump of all the application reports generated according to the provided flags
  ibdiagnet.pm      A dump of the Performance Counters values of the fabric links

12.4.3 ERROR CODES
56. > to be the first on the boot device priority list (see Section B.5).
Note: On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2.
If MLNX NIC was selected through BIOS setup, the client will boot from ConnectX EN PXE. The client will display ConnectX EN PXE attributes and will attempt to bring up a port link.
  Mellanox ConnectX gPXE 0.9.5 Open Source Boot Firmware
  net0: 00:02:c9:00:00:bb on PCI02:00.0 (open)
  [Link:up, TX:0 TXE:0 RX:0 RXE:0]
  Waiting for link-up on net0... ok
If the Ethernet link comes up successfully, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from. The client waits up to 30 seconds for a DHCP server response.
  Mellanox ConnectX Boot over IB v2.0.000 gPXE 0.9.6 Open Source Boot Firmware
  net0: 00:02:c9:00:01:77:70:51 on PCI02:00.0 (open)
  [Link:down, TX:0 TXE:0 RX:0 RXE:0]
  Waiting for link DHCP 20 net0 11 4 3
Next, ConnectX EN PXE attempts to boot as directed by the DHCP server.

B.7 Diskless Machines
Mellanox ConnectX EN PXE supports booting diskless machines. To enable using an Ethernet driver, the remote kernel or initrd image must include and be configured to load the driver. This can be achieved either by compiling the adapter dri
57. ibdiagnet.mcg: A dump of the multicast groups, their properties, and member host ports.
ibdiagnet.db: A dump of the internal subnet database. This file can be loaded in later runs using the -load_db option.

In addition to generating the files above, the discovery phase also checks for duplicate node/port GUIDs in the IB fabric. If such an error is detected, it is displayed on the standard output. After the discovery phase is completed, directed route packets are sent multiple times (according to the -c option) to detect possible problematic paths on which packets may be lost. Such paths are explored, and a report of the suspected bad links is displayed on the standard output.
After scanning the fabric, if the -r option is provided, a full report of the fabric qualities is displayed. This report includes:
• SM report
• Number of nodes and systems
• Hop-count information: maximal hop count, an example path, and a hop-count histogram
• All CA-to-CA paths traced
• Credit loop report
• mgid-mlid-HCAs multicast group and report
• Partitions report
• IPoIB report
Note: In case the IB fabric includes only one CA, CA-to-CA paths are not reported. Furthermore, if a topology file is provided, ibdiagnet uses the names defined in it for the output reports.

12.3.3 ERROR CODES
Failed
58. in rev 16A of the specification. In rev 10 the default was 0xff00.
• initiator_ext: Please refer to Section 9 (Multiple Connections).
• To list the new SCSI devices that have been added by the echo command, you may use either of the following two methods:
  • Execute "fdisk -l". This command lists all devices; the new devices are included in this listing.
  • Execute "dmesg" or look at /var/log/messages to find messages with the names of the new devices.

8.2.3 SRP Tools: ibsrpdm and srp_daemon
To assist in performing the steps in Section 6, the BXOFED distribution provides two utilities, ibsrpdm and srp_daemon, which:
• Detect targets on the fabric reachable by the Initiator (for Step 1)
• Output target attributes in a format suitable for use in the above "echo" command (Step 2)
The utilities can be found under /usr/sbin and are part of the srptools RPM that may be installed using the Mellanox BXOFED installation. Detailed information regarding the various options for these utilities is provided by their man pages. Below, several usage scenarios for these utilities are presented.

ibsrpdm
ibsrpdm is used for the following tasks:
1. Detecting reachable targets
a. To detect all targets reachable by the SRP initiator via the default umad device (/dev/umad0), execute the following command:
  ibsrpdm
This command wil
59. information, run:
  > ethtool -i eth<n>
Example:
  > ethtool -i eth2
  driver: mlx4_en (MT_04A0140005)
  version: 1.4.0 (March 2009)
  firmware-version: 2.6.000
  bus-info: 0000:13:00.0
• To query stateless offload status, run:
  > ethtool -k eth<n>
• To set stateless offload status, run:
  > ethtool -K eth<n> [rx on|off] [tx on|off] [sg on|off] [tso on|off]
• To query interrupt coalescing settings, run:
  > ethtool -c eth<n>
By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time according to the traffic pattern.
• To enable/disable adaptive interrupt moderation, use the following command:
  > ethtool -C eth<n> adaptive-rx on|off
Above a higher limit of packet rate, adaptive interrupt moderation will set the moderation time to its highest value. Below a lower limit of packet rate, adaptive interrupt moderation will set the moderation time to its lowest value.
• To set the values for packet rate limits and moderation time high/low values, use the following command:
  > ethtool -C eth<n> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]
• To set interrupt coalescing settings w
60. it will ignore a target that is disallowed in the configuration file.
• To connect to all the existing Targets in the fabric and to connect to new targets that will join the fabric, execute srp_daemon -e. This utility continues to execute until it is either killed by the user or encounters connection errors (such as no SM in the fabric).
• To execute SRP daemon as a daemon, you may run run_srp_daemon (found under /usr/sbin/), providing it with the same options used for running srp_daemon.
Note: Make sure only one instance of run_srp_daemon runs per port.
• To execute SRP daemon as a daemon on all the ports, run srp_daemon.sh (found under /usr/sbin/). srp_daemon.sh sends its log to /var/log/srp_daemon.log.
• It is possible to configure this script to execute automatically when the InfiniBand driver starts by changing the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes". However, this option also enables SRP High Availability, which has some more features (see Section 8.2.6).
Note: For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart

8.2.5 Multiple Connections from Initiator IB Port to the Target
Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA.
In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enab
61. leng type MP type for reductions List of Benchmarks to run PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce scatter Allga Allga Alltoall All ther therv toall Bcast Barrier THREAD FUNNELE 4194304 2 19 56 42 Mon Jul 3 18 BYTE E MPI MPI 2008 25 39 UTC 2006 _FLOAT _SUM Benchmarking PingPong processes 2 t usec 1425 bytes repetitions 0 1000 Mbytes sec 0 00 Mellanox Technologies Mellanox Technologies Confidential Rev 1 50 MPI 82 9 6 9 6 1 9 6 2 OUTPUT TRUNCATED Bw DN RP 8 16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576 2097152 4194304 1000 1000 1 1000 1000 1000 1000 1000 1 1000 2 1000 3 1000 3 1000 4 1000 6 1000 8 1000 14 1000 18 1000 30 640 53 320 99 160 191 80 373 40 742 20 1475 10 2956 Open MPI Performance Requirements At least two nodes Example host1 host2 1 24 25 1 23 1 26 1 29 1 36 92 67 03 64 89 30 91 07 85 47 67 78 80 92 31 20 95 e Machine file Includes the list of machines Example hos hos tl t2 host1 cat home lt username gt cluster host1 Bandwidth Test Performance To run the OSU Bandwidth test enter host1
62. low priority is served.
• Packets carry class of service marking in the range 0 to 15 in their header SL field.
• Each switch can map the incoming packet by its SL to a particular output VL, based on a programmable table: VL = SL-to-VL-MAP(in-port, out-port, SL).
• The Subnet Administrator controls the parameters of each communication flow by providing them as a response to Path Record (PR) or MultiPathRecord (MPR) queries.
DiffServ architecture (IETF RFC 2474 & 2475) is widely used in highly dynamic fabrics. The following subsections provide the functional definition of the various software elements that enable a DiffServ-like architecture over the Mellanox BXOFED software stack.

10.2 QoS Architecture
QoS functionality is split between the SM/SA, CMA, and the various ULPs. We take the "chronology approach" to describe how the overall system works.
1. The network manager (human) provides a set of rules (policy) that define how the network is being configured and how its resources are split to different QoS-Levels. The policy also defines how to decide which QoS-Level each application, ULP, or service uses.
2. The SM analyzes the provided policy to see if it is realizable and performs the necessary fabric setup. Part of this policy defines the default QoS-Level of each partition. The SA is enhanced to match the requested Source, Destination, QoS-Class
63. # read performance counters and reset
  perfquery -e -r 32 1        # read extended performance counters and reset
  perfquery -R 0x20 1         # reset performance counters of port 1 only
  perfquery -e -R 0x20 1      # reset extended performance counters of port 1 only
  perfquery -R -a 32          # reset performance counters of all ports
  perfquery -R 32 2 0x0fff    # reset only error counters of port 2
  perfquery -R 32 2 0xf000    # reset only non-error counters of port 2
1. Read local port's performance counters:
  (sample output: port counters listing with CounterSelect 0x1000 and all remaining counters 0)
2. Read performance counters from LID 2, all ports:
  > smpquery -a 2
  # Port counters: Lid 2 port 255
  (sample output: PortSelect 255, CounterSelect
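The last argument of the two reset examples above is a counter-select mask: per the example comments, 0x0fff covers the error counters and 0xf000 the non-error counters. The Python sketch below only illustrates how such a 16-bit mask splits into those two groups; the names ERROR_COUNTERS, NON_ERROR_COUNTERS and reset_scope are hypothetical, and the per-bit meaning of each counter is not modeled.

```python
# Mask halves taken from the perfquery reset examples above.
ERROR_COUNTERS = 0x0FFF      # "reset only error counters"
NON_ERROR_COUNTERS = 0xF000  # "reset only non-error counters"


def reset_scope(mask):
    """Return which counter groups a 16-bit reset mask touches."""
    if not 0 <= mask <= 0xFFFF:
        raise ValueError("mask must fit in 16 bits")
    scope = []
    if mask & ERROR_COUNTERS:
        scope.append("error")
    if mask & NON_ERROR_COUNTERS:
        scope.append("non-error")
    return scope


if __name__ == "__main__":
    for mask in (0x0FFF, 0xF000, 0xFFFF):
        print(hex(mask), reset_scope(mask))
```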
64. set to the value negotiated with the switch. This takes advantage of PFC, which allows pausing FCoE traffic when needed without pausing the entire Ethernet link. Also, with proper configuration of the FCoE switch, the link's maximum bandwidth can be divided as needed between FCoE and regular Ethernet traffic.
Instantiating vHBAs manually allows creating them on VLAN interfaces with any arbitrary VLAN id and priority, as well as on the regular (without VLAN) Ethernet interfaces. Using the regular interface means that PFC cannot be used. In this case, it is highly recommended that both the FCoE switch and the mlx4_en driver be configured to use link pause (regular flow control). Otherwise, any FCoE packet drop will trigger SCSI errors and timeouts.

3.4.3.1 FCoE Configuration
After installation, please edit the file /etc/mlxfc/mlxfc.conf and set the following variables:
• FC_SPEC: set to T11 or pre-T11, as supported by your FCoE switch.
Note: Only pre-T11 format is offloaded in hardware.
• DCBX_IFS: provide a space-separated list of Ethernet devices to monitor, using the DCBX protocol, for FCoE feature availability. vHBAs are automatically created on these interfaces if the FCoE switch is configured for automatic FCoE negotiation.
• MTU: if the MTU of the Ethernet device is changed from the default (1500), put the cor
65. the Driver 3 2 4 3 2 ital sted eka anse ke bdo bodas rali 25 3 3 4 Ethernet Driver Usage and Configuration 0 0 cee eee eee ee 25 3 4 Fibre Channel over Ethernet 26 AL COVE Wo ba eae teh onan pe aap eee eee AAA A See ae eee 26 3 42 Tastallato no adn de Gens ads as athe le ae tae Gabe 27 3 4 3 FCOE Basic Usage us A A a eae bb 27 3 4 3 1 FCoE Configuration 0 0 ccc rr rr rr ere rest 28 3 4 3 2 Starting FCoE Services iii nn ii e A EA ee 28 Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential a 3 4 3 3 Stopping FCoE Seryice insine iectur iii 28 3 44 P oE Advanced Usage ai mee tay ue e ile ria s Do 28 3 4 4 1 Manual vHBA Control 29 3 4 4 2 Creating VHBAs That Use PFC 1 cee 29 3 4 4 3 Creating vHBAs That Use Link Pause 0 0 00 ee eee 30 Chapter 4 VPOUB 166605054 Sires 68 e a aa e eee OI 4 1 Introduction 31 4 2 IPoIB Mode Setting 31 4 3 IPoIB Configuration 31 4 3 1 IPoIB Configuration Based on DHCP 32 AZLI DACP Servers A chet ghee a ees 32 4 3 1 2 DHCP Client Optional lisi sa0 534 uretra red o s cde at 33 4 3 2 Static IPoIB Configuration LL 33 4 3 3 Manually Configuring IPoIB LL 34 4 4 Subinterfaces
66. the vNics and will be verified on all packets received on the vNic. When passed from the InfiniBand side to the Ethernet side, the EoIB encapsulation will be stripped but the VLAN tag will remain.
For example, if the vNic "eth23" is associated with a vHub that uses BridgeX "bridge01", eport "A10" and VLAN tag 8, all incoming and outgoing traffic on eth23 will use a VLAN tag of 8. This will be enforced by both BridgeX and destination hosts. When a packet is passed from the internal fabric to the Ethernet subnet through the BridgeX, it will have a "true" Ethernet VLAN tag of 8.
The VLAN implementation used by EoIB uses OS-unaware VLANs. This is in many ways similar to switch tagging, in which an external Ethernet switch adds/strips tags on traffic, removing the need for OS intervention. EoIB does not support OS-aware VLANs in the form of vconfig.

Configuring VLANs
To configure a VLAN tag for a vNic, add the VLAN tag property to the configuration file in host administered mode, or configure the vNic on the appropriate vHub in network administered mode. In host administered mode, if a vHub with the requested VLAN tag is not available yet, it will most likely be created automatically.
Host administered VLAN configuration in centralized conf file: Add "vid=<vlan tag>", or remove the vid property for no VLAN.
Host
67. -vf, osmtest defaults to ERROR + INFO (0x3). Specifying -vf 0 disables all messages. Specifying -vf 0xFF enables all messages (see -V). High verbosity levels may require increasing the transaction timeout with the -t option.
-h, --help: Display this usage info, then exit.

11.3.2 Running osmtest
To run osmtest in the default mode, simply enter:
  host1# osmtest
The default mode runs all the flows except for the Quality of Service flow (see Section 11.6).
After installing opensm (and if the InfiniBand fabric is stable), it is recommended to run the following command in order to generate the inventory file:
  host1# osmtest -f c
Immediately afterwards, run the following command to test opensm:
  host1# osmtest -f a
Finally, it is recommended to occasionally run "osmtest -v" (with verbosity) to verify that nothing in the fabric has changed.

11.4 Partitions
OpenSM enables the configuration of partitions (PKeys) in an InfiniBand fabric. By default, OpenSM searches for the partitions configuration file under the name /usr/etc/opensm/partitions.conf. To change this filename, you can use opensm with the --Pconfig or -P flags.
The default partition is created by OpenSM unconditionally, even when a partition configuration file does not exist or cannot be accessed. The default partition has a P_Key value of 0x7fff. The port out of which ru
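To make the 0x7fff value above concrete: in the InfiniBand architecture a P_Key is a 16-bit value whose low-order 15 bits are the key base and whose high-order bit marks full (1) or limited (0) membership. The Python sketch below only illustrates that split; split_pkey is a hypothetical helper, not an OpenSM interface.

```python
MEMBERSHIP_BIT = 0x8000  # high-order bit: 1 = full member, 0 = limited member
KEY_BASE_MASK = 0x7FFF   # low-order 15 bits identify the partition


def split_pkey(pkey):
    """Split a 16-bit P_Key into (key base, membership type)."""
    if not 0 <= pkey <= 0xFFFF:
        raise ValueError("P_Key must fit in 16 bits")
    membership = "full" if pkey & MEMBERSHIP_BIT else "limited"
    return pkey & KEY_BASE_MASK, membership


if __name__ == "__main__":
    for value in (0x7FFF, 0xFFFF, 0x8001):
        base, membership = split_pkey(value)
        print(hex(value), hex(base), membership)
```

Under this reading, 0x7fff and 0xffff name the same default partition and differ only in membership type.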
68. why it is the only ULP that did not appear in the qos-ulps section.

11.6.7 SL2VL Mapping and VL Arbitration
The OpenSM cached options file has a set of QoS-related configuration parameters that are used to configure SL2VL mapping and VL arbitration on IB ports. These parameters are:
• Max VLs: the maximum number of VLs that will be on the subnet
• High limit: the limit of the High Priority component of the VL Arbitration table (IBA 7.6.9)
• VLArb low table: Low priority VL Arbitration table (IBA 7.6.9) template
• VLArb high table: High priority VL Arbitration table (IBA 7.6.9) template
• SL2VL: SL2VL Mapping table (IBA 7.6.6) template. It is a list of VLs corresponding to SLs 0-15. (Note that VL15 used here means drop this SL.)
There are separate QoS configuration parameter sets for various target types: CAs, routers, switch external ports, and switch's enhanced port 0. The names of such parameters are prefixed by the "qos_<type>_" string. Here is a full list of the currently supported sets:
• qos_ca_: QoS configuration parameters set for CAs
• qos_rtr_: parameters set for routers
• qos_sw0_: parameters set for switches' port 0
• qos_swe_: parameters set for switches' external ports
Here is an example of typical default values for CAs and switches' external ports (hard-coded in OpenSM initialization):
  qos_ca_max_vls 15
  qos_ca_high_limit 0
  qos_ca_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
M
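As described above, an SL2VL value is a list of 16 VLs, one per SL 0-15, where VL15 means "drop this SL". The following Python sketch reads such a list into a lookup table. It assumes the list is written comma-separated (the separators are lost in this extract, so that is an assumption), and parse_sl2vl is a hypothetical helper rather than anything OpenSM provides.

```python
DROP_VL = 15  # per the text above, VL15 in an SL2VL table means "drop this SL"


def parse_sl2vl(table):
    """Parse a comma-separated SL2VL list into {SL: VL, or None when dropped}."""
    vls = [int(item) for item in table.split(",")]
    if len(vls) != 16:
        raise ValueError("expected 16 entries, one per SL 0-15")
    for vl in vls:
        if not 0 <= vl <= 15:
            raise ValueError("VL out of range: %d" % vl)
    return {sl: (None if vl == DROP_VL else vl) for sl, vl in enumerate(vls)}


if __name__ == "__main__":
    mapping = parse_sl2vl("0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15")
    print(mapping[3], mapping[8])
```

With the sample table, SLs 0-7 map to the VL of the same number and SLs 8-15 are dropped.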
69. will be. In general, even if the root list is provided, the closer the topology is to a pure and symmetrical fat-tree, the more optimal the routing will be.
The algorithm also dumps a compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create an efficient communication pattern that will match the routing tables.
1. Ports that are connected to the same remote switch are referenced as a "port group".
2. The list of compute nodes (CNs) can be specified by the -u or --cn_guid_file OpenSM options.

Activation through OpenSM
• Use the "-R ftree" option to activate the fat-tree algorithm.
• Use "-a <root guid file>" to provide root nodes for ranking. If the -a option is not used, the routing algorithm will detect roots automatically.
• Use "-u <root cn file>" to provide the list of compute nodes. If the -u option is not used, all the CAs are considered as compute nodes.
Note: LMC > 0 is not supported by fat-tree routing. If this is specified, the default routing algorithm is invoked instead.

11.5.5 LASH Routing Algorithm
LASH is an acronym for LAyered SHortest Path Routing. It is a deterministic shortest path routing algorithm that enables topology agnostic deadlock-free routing within communication networks
70. working with to allow for debug in case initrd gets corrupted. In addition, edit the init file (that is in the initrd zip) and look for the following string:
  if [ "$iSCSI_TARGET_IPADDR" ] ; then
      iscsiserver="$iSCSI_TARGET_IPADDR"
  fi
Now add before the string the following line:
  iSCSI_TARGET_IPADDR=<IB IP Address of iSCSI Target>
Example: iSCSI_TARGET_IPADDR=11 4 Zel

WinPE
Mellanox BoIB enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe.

Appendix B: ConnectX EN PXE
B.1 Overview
This appendix describes "Mellanox ConnectX EN PXE", the software for Boot over Mellanox Technologies network adapter devices supporting Ethernet. Mellanox ConnectX EN PXE enables booting kernels or operating systems (OSes) from remote servers in compliance with the PXE specification.
Mellanox ConnectX EN PXE is based on the open source project Etherboot/gPXE, available at http://www.etherboot.org.
Mellanox ConnectX EN PXE first initializes the network adapter device. Then it connects to a DHCP server to obtain its assigned IP address and network parameters, and also to obtain the source location of the kernel/OS to boot from. The DHCP server instructs Mellanox ConnectX EN PXE to access the ker
71. 0:00
The next steps explain how to obtain the Port GUID.
3. To obtain the Port GUID, run the following commands:
Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
  host1# mst start
  host1# mst status
The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via a query command:
  flint -d <MST_DEVICE_NAME> q
Example with InfiniHost III Ex as the HCA device:
  host1# flint -d /dev/mst/mt25218_pci_cr0 q
  Image type:     Failsafe
  FW Version:     5.3.0
  Rom Info:       type=GPXE version=1.0.0 devid=25218 port=2
  I.S. Version:   1
  Device ID:      25218
  Chip Revision:  A0
  Description:    Node             Port1            Port2            Sys image
  GUIDs:          0002c90200231390 0002c90200231391 0002c90200231392 0002c90200231393
  Board ID:       (MT_0370110001)
  VSD:
  PSID:           MT_0370110001
Assuming that BoIB is connected via Port 2, the Port GUID is 00:02:c9:02:00:23:13:92.
Step 6. The resulting client identifier is the concatenation, from left to right, of 20, the QP_Number, the subnet prefix, and the Port GUID. In the example above, this yields the following DHCP client identifier:
  20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92

Extracting the Client Identifier: Method II
An alternative method for obtaining the 20 bytes of QP Number and GID involves booting the client machine via BoIB. This requires having a Subnet Ma
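The concatenation in Step 6 above can be written out as a short routine. This is a Python sketch only: the function name build_client_identifier is hypothetical, the colon-separated output format is an assumption, and the field widths (4-byte QP number, 8-byte subnet prefix, 8-byte Port GUID) are inferred from the 20 bytes of QP Number and GID mentioned in the text.

```python
def _octets(hex_string):
    """Split a hex string ('00:55:04:01' or '00550401') into two-digit octets."""
    digits = hex_string.replace(":", "").replace(" ", "").lower()
    if len(digits) % 2 != 0:
        raise ValueError("odd number of hex digits: %r" % hex_string)
    int(digits, 16)  # raises ValueError on empty or non-hex input
    return [digits[i:i + 2] for i in range(0, len(digits), 2)]


def build_client_identifier(qp_number, subnet_prefix, port_guid):
    """Concatenate 20, the QP number, the subnet prefix and the Port GUID."""
    qp = _octets(qp_number)
    prefix = _octets(subnet_prefix)
    guid = _octets(port_guid)
    if (len(qp), len(prefix), len(guid)) != (4, 8, 8):
        raise ValueError("expected 4-byte QP number, 8-byte prefix, 8-byte GUID")
    return ":".join(["20"] + qp + prefix + guid)


if __name__ == "__main__":
    print(build_client_identifier(
        "00:55:04:01", "fe:80:00:00:00:00:00:00", "0002c90200231392"))
```

Fed with the values from the example (Port 2 GUID 0002c90200231392), the routine reproduces the 21-byte identifier shown in Step 6.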
72. 0

srp_daemon
The srp_daemon utility is based on ibsrpdm and extends its functionality. In addition to the ibsrpdm functionality described above, srp_daemon can also:
• Establish an SRP connection by itself (without the need to issue the "echo" command described in Section 8.2.2)
• Continue running in the background, detecting new targets and establishing SRP connections with them (daemon mode)
• Discover reachable SRP Targets given an InfiniBand HCA name and port, rather than just by /dev/umad<N>, where <N> is a digit
• Enable High Availability operation (together with Device-Mapper Multipath)
• Have a configuration file that determines the targets to connect to
srp_daemon commands equivalent to ibsrpdm:
  "srp_daemon -a -o" is equivalent to "ibsrpdm"
  "srp_daemon -c -a -o" is equivalent to "ibsrpdm -c"
Note: These srp_daemon commands can behave differently than the equivalent ibsrpdm command when /etc/srp_daemon.conf is not empty.
srp_daemon extensions to ibsrpdm:
• To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port num>, and to generate output suitable for "echo", you may execute:
  host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
Note: To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run "ls /sys/class/infiniband".
73. # 0x123456, 0x123457 will be limited
  ShareIO = 0x80 , defmember=limited : 0x123456, 0x123457, 0x123458=full;
  ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
  ShareIO = 0x80 , defmember=full : 0x12345b, 0x12345c=limited, 0x12345d;
Note: The following rule is equivalent to how OpenSM used to run prior to the partition manager:
  Default=0x7fff,ipoib:ALL=full;

11.5 Routing Algorithms
OpenSM offers five routing engines:
1. "Min Hop algorithm": Based on the minimum hops to each node, where the path length is optimized.
2. "UPDN Unicast routing algorithm": Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
3. "Fat Tree Unicast routing algorithm": This algorithm optimizes routing for a congestion-free "shift" communication pattern. It should be chosen if a subnet is a symmetrical Fat Tree of various types, not just a K-ary-N-Tree: non-constant K, not fully staffed, and for any CBB ratio. Similar to UPDN, Fat Tree routing is constrained to ranking rules.
4. "LASH Unicast routing algorithm": Uses InfiniBand virtual layers (SL) to provide deadlock-free shortest-path routing while also distributing the paths between layers. LASH is an alternative deadlock-free topolo
74. 2x gt ls lt 2 5 5 10 gt skip lt ibdiag check s gt load db lt db file gt in number of packets to be sent across each link default 10 Enable verbose mod Provides a report of the fabric qualities pecifies the topology file name pecifies the local system name Meaningful only if topology file is specified pecifies the index of the device of the port used to onnect to the IB fabric in case of multiple evices on the local system pecifies the local device s port num used to con ect to the IB fabric pecifies the directory where the output files will placed default tmp pecifies th xpected link width Nn OTN THM Aan gt NW pecifies th xpected link speed Dump all the fabric links pm Counters into ibdiag net pm Reset all the fabric links pmCounters If any of the provided pm is greater then its pro vided value print it to screen skip lt skip option s gt Skip the executions of the selected checks Skip wt lt file name gt options one or more can be specified dup_guids zero guids pm logical state part ipoib all Write out the discovered topology into the given file This flag is useful if you later want to check for changes from the current state of the fabric A directory named ibdiag ibnl is also created by this option and holds the IBNL files required to load this topology To use these files you will need to set the environment variable n
75. 32K
bx: The BridgeX box system GUID or system name string.
eport: The string describing the eport name.

6.3.1.2 vNic specific configuration files: ifcfg-ethX
EoIB configuration can use the well-known ifcfg-ethX files used by the network service to derive the needed configuration. In this usage model, a separate file is required per vNic. We will need to update the ifcfg-ethX file and add some new attributes to it. On Red Hat, the new file will be of the form:
  DEVICE=eth2
  HWADDR=00:30:48:7d:de:e4
  BOOTPROTO=dhcp
  ONBOOT=yes
  BXADDR=BX001
  BXEPORT=A10
  VNICVLAN=0 (Optional field)
  VNICIBPORT=mlx4_0:1
The fields used in the file have the following meaning:
  DEVICE: The name of the interface that is displayed when running ifconfig. This field is optional; if it is not present, the trailer of the configuration file name (e.g. ifcfg-eth47 => "eth47") is used.
  BXADDR: The BridgeX box system GUID or system name string.
  BXEPORT: The string describing the eport name.
  VNICVLAN: An optional field. If it exists, the vNic will be assigned the VLAN id specified. This value must be between 0 and 4095.
  VNICIBPORT: Device name and port number in the form <device name>:<port number>. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value
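The field rules above (optional DEVICE and VNICVLAN, VNICVLAN between 0 and 4095) lend themselves to a quick sanity check before restarting the network service. The Python sketch below assumes plain KEY=VALUE lines with optional "#" comments and treats BXADDR and BXEPORT as required, which is an assumption based on their role in the text; parse_ifcfg and check_vnic_fields are hypothetical helpers.

```python
def parse_ifcfg(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        fields[key.strip()] = value.strip()
    return fields


def check_vnic_fields(fields):
    """Return a list of problems found in the vNic-related fields."""
    problems = []
    # Assumption: a vNic needs both the BridgeX address and the eport name.
    for required in ("BXADDR", "BXEPORT"):
        if not fields.get(required):
            problems.append("missing " + required)
    vlan = fields.get("VNICVLAN")
    if vlan is not None and not (vlan.isdigit() and 0 <= int(vlan) <= 4095):
        problems.append("VNICVLAN must be between 0 and 4095")
    return problems


if __name__ == "__main__":
    sample = "DEVICE=eth2\nBXADDR=BX001\nBXEPORT=A10\nVNICVLAN=0\n"
    print(check_vnic_fields(parse_ifcfg(sample)))
```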
76. 5 32768 000 30 84 013 28 65536 640 48 88 278 77 131072 320 86 36 447 43 262144 160 163 91 525 26 524288 80 335 82 488 90 1048576 40 726 25 376 94 2097152 20 1786 39 119 60 4194304 10 4253 59 940 38
OUTPUT TRUNCATED

10 Quality of Service
10.1 Overview
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources.
Figure 4: I/O Consolidation Over InfiniBand
QoS over Mellanox BXOFED for Linux is discussed in Chapter 11, "OpenSM Subnet Manager".
The basic need is to differentiate the service levels provided to different traffic flows, such that a policy can be enforced and can control each flow's utilization of fabric resources.
The InfiniBand Architecture Specification defines several hardware features and management interfaces for supporting QoS:
• Up to 15 Virtual Lanes (VL) carry traffic in a non-blocking manner.
• Arbitration between traffic of different VLs is performed by a two-priority-level weighted round robin arbiter. The arbiter is programmable with a sequence of (VL, weight) pairs and a maximal number of high priority credits to be processed before
77. 10.7 RDS
10.8 SRP
10.9 OpenSM Features
Chapter 11 OpenSM Subnet Manager
11.1 Overview
11.2 opensm Description
11.2.1 Syntax
11.2.2 Environment Variables
11.2.3 Signaling
11.2.4 Running opensm
11.2.4.1 Running OpenSM As Daemon
11.3 osmtest Description
11.3.1
11.3.2 Running osmtest
11.4 Partitions
11.4.1 File Format
11.5 Routing Algorithms
11.5.1 Effect of Topology Changes
11.5.2 Min Hop Algorithm
11.5.3 Purpose of UPDN Algorithm
11.5.3.1 UPDN Algorithm Usage
78. A e Min BW 10 e Application traffic e IPoIB application VLAN partition B e Isolated from storage and database Min BW of 30 Database Cluster traffic e RDS e Min BW of 30 e SRP e Min BW 30 Bottleneck at storage nodes Administration e OpenSM QoS policy file Note In the following policy file example replace SRPT with the real SRP Initia tor port GUIDs qos ulps default 220 ipoib pkey 0x8001 dl Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential OpenSM Subnet Manager 124 ipoib pkey 0x8002 rds srp target port guid SRPT1 SRPT2 SRPT3 end qos ulps e OpenSM options file qos max vls 8 qos_high_limit 0 qos _vlarb high 1 32 2 96 3 96 4 96 qos_vlarb_low 0 1 qos s12w1 0 1 2 3 4 5 6 7 15 15 15 15 15 15 15 15 e Partition configuration file Default 0x7fff ipoib ALL full PartA 0x8001 sl 1 ipoib ALL full PartB 0x8002 sl 2 ipoib ALL full Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 12 InfiniBand Fabric Diagnostic Utilities 12 1 Overview The diagnostic utilities described in this chapter provide means for debugging the connec tivity and status of InfiniBand IB devices in a fabric The tools are ibdiagnet IB Net Diagnostic page 127 e ibdiagpath IB diagnostic path page 129 e ibv devices page 131 e ibv_devinfo page 131 ibstatus page 133 ibportstate page 135 ibroute page 140 smpquery page 144 perfque
79. ABIWAAAQEA1zVY8VBHOh90kZN70A1ibUO74RXm4zHeczyVxpY HaDPyDmqezbYMKrCIVzd10bH ZkC0rpLYviU0oUHd3fvNT Ms0gcGg08PysUf 12FyY jira2Plxyg6mkHLGGqVut fEMmABZ3wNCUg6J2X3G uiuSWXeubZmbXcMrP w4IWByfH8aJwo6A5WioNbFZElbYeeNfPZf4UNcgMOAMWp64sL58tkt32F RGmyLXQWZ L27Synsn6dHpxMqBorXNC0ZBe4kTnUgm63n02z1gVMdL9IFrCmalxI0u9 SQJA wONev aMzFKEHe7YHg6YrNfXunfdbEurzB524TpPcrodZ1 C0 lt username gt thostl Now you need to add the public key to the authorized keys2 file on the target machine host1 cat id rsa pub xargs ssh host2 A echo gt gt home lt username gt ssh authorized keys2 lt username gt host2 s password Enter password host1 For a local machine simply add the key to authorized _keys2 host1 cat id rsa pub gt gt authorized keys2 Test host1 ssh host2 uname Linux MPI Selector Which MPI Runs Mellanox BXOFED contains a simple mechanism for system administrators and end users to select which MPI implementation they want to use The MPI selector functionality is Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential MPI 78 9 4 not specific to any MPI implementation it can be used with any implementation that pro vides shell startup files that correctly set the environment for that MPI The Mellanox BXOFED installer will automatically add MPI selector support for each MPI that it installs Additional MPI s not known by the Mellanox BXOFED installer can be listed in the MPI
80. AMN Mellanox TECHNOLOGIES Mellanox BXOFED Stack for Linux User Manual Rev 1 50 www mellanox com Mellanox Technologies Confidential Rev 1 50 NOTE THIS INFORMATION IS PROVIDED BY MELLANOX FOR INFORMATIONAL PURPOSES ONLY AND ANY EXPRESS OR IMPLIED WARRANTIES INCLUDING BUT NOT LIM ITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED IN NO EVENT SHALL MELLANOX BE LIA BLE FOR ANY DIRECT INDIRECT INCIDENTAL SPECIAL EXEMPLARY OR CONSE QUENTIAL DAMAGES INCLUDING BUT NOT LIMITED TO PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE DATA OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY WHETHER IN CONTRACT STRICT LIABILITY OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY OUT OF THE USE OF THIS HARDWARE EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE Mellanox TECHNOLOGIES Mellanox Technologies Mellanox Technologies Ltd 350 Oakmead Parkway PO Box 586 Hermon Building Sunnyvale CA 94085 Yokneam 20692 U S A Israel www mellanox com Tel 972 4 909 7200 Tel 408 970 3400 Fax 972 4 959 3245 Fax 408 970 3403 Copyright 2009 Mellanox Technologies Inc All Rights Reserved Mellanox BridgeX ConnectX InfiniBlast InfiniBridge InfiniHost InfiniRISC InfiniScale InfiniPCI and PhyX and Virtual Protocol Interconnect are registered trademarks of Mellanox Technologie
81. Boot Example of SLES 10 SP2 OS This section provides an example of installing the SLES 10 SP2 operating system on an iSCSI target and booting from a diskless machine via ConnectX EN PXE Note that the procedure described below assumes the following e The client s LAN card is recognized during installation e The iSCSI target can be connected to the client via a LAN and a ConnectX Ethernet Prerequisites See Section B 6 1 on page 189 Warning The following procedure modifies critical files used in the boot proce dure It must be executed by users with expertise in the boot process Improper application of this procedure may prevent the diskless machine from booting Procedure Load the SLES 10 SP2 installation disk and enter the following parameters as boot options Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential 194 netsetup 1 WithISCSI 1 Boot from Hard Disk Installation Installation ACPI Disabled Installation Local APIC Disabled Installation Safe Settings Rescue System Memory Test Boot Options netsetup i HithISCSI 1l Step 55 Continue with the procedure as instructed by the installation program until the SCSI Initiator Overview window appears Preparation Y Language y License Agreement Disk Activation e System Analysis Connected Targets e Time Zone SCSI Initiator Overview Portal Address TargetName StartUp Installation e Installation Summ
82. Date 07 52 06 24 03 2008 Change Einish In the Installation Settings window click Partitioning to get the Suggested Partitioning window Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 201 Preparation v Language 8 Installation Settings v License Agreement v Disk Activation Click headli kath he Ch E bel v System Analysis ick any headline to make changes or use the ange menu below v Time Zone Overview Expert Installation Installation Summary Keyboard Layout e Perform Installation English US Configuration TO e Hostname Partitioning Root Password Create boot partition dev sda1 70 5 MB with ext2 e Network Create swap partition dev sda2 502 0 MB e Customer Center Create root partition dev sda3 7 4 GB with reiserfs e Online Update e Service Software e Users e Clean Up SUSE Linux Enterprise Server 10 e Release Notes A System GNOME Desktop Environment for Server e Hardware Configuration Server Base System Novell AppArmor Print Server Size of Packages to Install 1 3 GB Language Primary Language English US Show Release Notes Change v Help Abort Step 64 Select Base Partition Setup on This Proposal then click Next Your hard disks have Suggested Partitioning been checked The partition setup displayed is proposed for your hard drive Create boot partiti
83. E h lt mvapich ver gt bin mpirun_ rsh np 2 V_COALESCE THRESHOLD SQ 1 VIADEV_PROGRESS_THRESHOLD 2 usr mpi gcc mvapic osu_bw The example assumes the following h lt mvapich ver gt tests osu benchmarks lt osu ver gt e A cluster of at least two nodes Example host1 host2 e A machine file that includes the list of machines Example host1 cat home lt username gt cluster Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 213 hostl host2 host1 Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential 214 Appendix D SRP Target Driver D 1 The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks http www openfabrics org or InfiniBand drivers in Linux kernel tree ker nel org It also interfaces with Generic SCSI target mid level driver SCST http scst sourceforge net By interfacing with an SCST driver it is possiblee to work with and support a lot of IO modes on real or virtual devices in the backend 1 scst_disk interfacing with the scsi sub system to claim and export real scsi devices disks hardware raid volumes tape library as SRP luns scst_vdisk fileio and blockio modes This allows turning software raid volumes LVM volumes IDE disks block devices and normal files into SRP luns NULLIO mode allows measuring the perfor
84. CounterSelect: 0x0100
SymbolErrors: 65535
LinkRecovers: 255
LinkDowned: 16
RcvErrors: 657
RcvRemotePhysErrors: 0
RcvSwRelayErrors: 70
XmtDiscards: 488
XmtConstraintErrors: 0
RcvConstraintErrors: 0
LinkIntegrityErrors: 0
ExcBufOverrunErrors: 0
VL15Dropped: 0
Read, then reset, performance counters from LID 2, port 1:
> perfquery -r 2 1
# Port counters: Lid 2 port 1
PortSelect: 1
CounterSelect: 0x0100
SymbolErrors: 0
LinkRecovers: 0
LinkDowned: 0
RcvErrors: 0
RcvRemotePhysErrors: 0
RcvSwRelayErrors: 0
XmtDiscards: 3
XmtConstraintErrors: 0
RcvConstraintErrors: 0
LinkIntegrityErrors: 0
ExcBufOverrunErrors: 0
VL15Dropped: 0
XmtData: 0
12.12 ibcheckerrs
Applicable Hardware: All InfiniBand devices
Description: Validates an IB port (or node) and reports errors in counters above threshold. Checks the specified port (or node) and reports errors that surpassed their predefined threshold. The port address is a LID unless the -G option is used to specify a GUID address. The predefined thresholds can be dumped using the -s option, and a user-defined threshold_file (using the same format as the dump) can be specified using
85. FS iSCSI NFS RDMA SRP iSER Fibre Channel Clustered Storage and FCoE Clustering MPI DAPL RDS sockets and Management SNMP SMI S e Communication protocol acceleration engines including networking storage cluster ing virtualization and RDMA with enhanced quality of service BXOFED Package Contents Note For instructions on installing the package please refer to Chapter 2 Installa tion BXOFED contains the following software components e Network adapter drivers Mellanox Technologies Mellanox Technologies Confidential Rev 1 50 Mellanox BXOFED Overview 14 e mthca IB only e mlx4 VPI which is split into the following modules mlx4 core low level helper mlx4_ib IB mlx4_en Ethernet mlx4 fc FCoE and mlx4_vnic EoIB e Mid layer core Verbs MADs SA CM CMA uVerbs uMADs e Upper Layer Protocols ULPs IPoIB RDS SDP SRP Initiator e MPI e Open MPI stack supporting the InfiniBand interface e OSU MVAPICH stack supporting the InfiniBand interface MPI benchmark tests OSU BW LAT Intel MPI Benchmark Presta e OpenSM InfiniBand Subnet Manager e Utilities e Diagnostic tools e Performance tests Documentation 1 4 Architecture Figure 1 shows a diagram of the Mellanox BXOFED stack and how upper layer protocols ULPs interface with the hardware and with the kernel and user spaces The application level also shows the versatility of markets that BXOFED applies to
86. ID The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hex digits holding the remote TCP/IP port number to connect to. The default port number for RDS is 0x48CA, which makes the default Service ID 0x00000000010648CA. The following two match rules are equivalent:
rds : <SL>
any, service-id 0x00000000010648CA : <SL>
11.6.6.4 iSER
Similar to RDS, an iSER query is matched by Service ID, where the Service ID is also 0x000000000106PPPP. The default port number for iSER is 0x0CBC, which makes the default Service ID 0x0000000001060CBC. The following two match rules are equivalent:
iser : <SL>
any, service-id 0x0000000001060CBC : <SL>
11.6.6.5 SRP
The Service ID for SRP varies from storage vendor to vendor, thus an SRP query is matched by the target IB port GUID. The following two match rules are equivalent:
srp, target-port-guid 0x1234 : <SL>
any, target-port-guid 0x1234 : <SL>
Note that any of the above ULPs might contain the target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port GUID only) should be placed at the end of the qos-ulps match rules.
11.6.6.6 MPI
The SL for MPI is manually configured by the MPI admin. OpenSM is not forcing any SL on the MPI traffic and that s
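The Service ID layout described above for RDS and iSER (a fixed 0x0106 prefix followed by the 16-bit port) can be checked with a few lines of Python. This is only an illustrative sketch: the function and constant names are hypothetical, and only the two default ports quoted in the text are used.

```python
def ulp_service_id(port: int) -> str:
    """Build the 64-bit Service ID 0x000000000106PPPP for a 16-bit port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hex digits")
    # 0x0106 occupies the bits directly above the 16-bit port field.
    return f"0x{(0x0106 << 16) | port:016X}"


# Default ports quoted in the text (names are illustrative only).
RDS_DEFAULT_PORT = 0x48CA
ISER_DEFAULT_PORT = 0x0CBC

print(ulp_service_id(RDS_DEFAULT_PORT))   # 0x00000000010648CA
print(ulp_service_id(ISER_DEFAULT_PORT))  # 0x0000000001060CBC
```

The resulting string is the value that would follow service-id in an "any, service-id ..." match rule when a ULP listens on a non-default port.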
87. Placing MAC Addresses in /etc/dhcpd.conf
The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server running on a Linux machine:
host host1 {
    next-server 11.4.3.7;
    filename "pxelinux.0";
    fixed-address 11.4.3.130;
    hardware ethernet 00:02:c9:00:00:bb;
}
B.4 TFTP Server
If you have set the filename parameter in your DHCP configuration to a non-empty filename, you need to install TFTP (Trivial File Transfer Protocol). TFTP is a simple FTP-like file transfer protocol used to transfer files from the TFTP server to the boot client as part of the boot process.
B.5 BIOS Configuration
The expansion ROM image presents itself to the BIOS as a boot device. As a result, the BIOS will add "MLNX NIC <ver>" to the list of boot devices. The priority of this list can be modified through BIOS setup.
B.6 Operation
B.6.1 Prerequisites
• Make sure that your client is connected to the server(s)
• The ConnectX EN PXE image is already programmed on the adapter card (see Section B.2)
• Configure and start the DHCP server as described in Section B.3
• Configure and start at least one of the services iSCSI Target (see Section A.1) and/or TFTP (see Section B.4)
B.6.2 Starting Boot
Boot the client machine and enter BIOS setup to configure MLNX NIC <ver>
88. Service ID, PKey against the policy, so clients (ULPs, programs) can obtain a policy-enforced QoS. The SM may also set up partitions with an appropriate IPoIB broadcast group. This broadcast group carries its QoS attributes: SL, MTU, RATE, and Packet Lifetime.
3. IPoIB is being set up. IPoIB uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms the broadcast group of this partition.
4. MPI, which provides non-IB-based connection management, should be configured to run using hard-coded SLs. It uses these SLs for every QP being opened.
5. ULPs that use the CM interface (like SRP) have their own pre-assigned Service ID and use it while obtaining PathRecord/MultiPathRecord (PR/MPR) for establishing connections. The SA receiving the PR/MPR matches it against the policy and returns the appropriate PR/MPR, including SL, MTU, RATE and Lifetime.
6. ULPs and programs (e.g. SDP) that use CMA to establish an RC connection provide the CMA with the target IP and port number. ULPs might also provide a QoS Class. The CMA then creates a Service ID for the ULP and passes this ID and the optional QoS Class in the PR/MPR request. The resulting PR/MPR is used for configuring the connection QP.
PathRecord and MultiPathRecord Enhancement for QoS
As mentioned above, the PathRecord and MultiPathRecord attributes are enhanced to carry the Service ID, which is a 64-bit value. A new field, QoS Class, is also provided. A new capability bit describes the SM QoS support
89. Transparent Conversion
The libsdp.conf configuration (policy) file is used to control the automatic transparent replacement of TCP sockets with SDP sockets. In this mode, socket streams are converted based upon a destination port, a listening port, or a program name. Socket control statements in libsdp.conf allow the user to specify when libsdp should replace AF_INET/SOCK_STREAM sockets with AF_SDP/SOCK_STREAM sockets. Each control statement specifies a matching rule that applies only if all its subexpressions evaluate as true (logical and). The use statement controls which type of sockets to open. The format of a use statement is as follows:
use <address-family> <role> <program name> <address>:<port range>
where
<address-family> can be one of:
sdp: for specifying when SDP should be used
tcp: for specifying when an SDP socket should not be matched
both: for specifying when both SDP and AF_INET sockets should be used
Note that the semantics of both differ for server and client roles. For a server, it means that the server will listen on both SDP and TCP sockets. For a client, the connect function will first attempt to use SDP and will silently fall back to TCP if the SDP connection fails.
<role> can be one of:
server or listen: for defining the listening port address family
client or connect: for defining the connected port address family
<program name> D
90. VPI Virtual Protocol Interconnect Mellanox Technologies Mellanox Technologies Confidential Rev 1 50 Related Documentation Table 3 Reference Documents Document Name Description InfiniBand Architecture Specification Vol 1 Release 1 2 1 The InfiniBand Architecture Specification that is provided by IBTA IEEE Std 802 3ae 2002 Amendment to IEEE Std 802 3 2002 Document PDF SS94996 Part 3 Carrier Sense Multiple Access with Collision Detec tion CSMA CD Access Method and Physical Layer Spec ifications Amendment Media Access Control MAC Parameters Physical Layers and Management Parameters for 10 Gb s Operation Fibre Channel BackBone 5 standard for Fibre Channel over Ethernet Document INCITS xxx 200x Fibre Channel Backbone http www t11 org draft BridgeX Programmer s Reference Manual Document 2936PM Describes the software interface used by developers to write a driver for Mellanox BridgeX devices Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 1 1 1 2 1 3 Mellanox BXOFED Overview Introduction to Mellanox BXOFED BXOFED or OFED with BridgeX support is a single Virtual Protocol Internconnect VPI software stack based on the OpenFabrics OFED Linux stack and operates across all Mellanox network adapter solutions supporting 10 20 and 40Gb s InfiniBand IB 10Gb s Ethernet 10GigE Fibre Channel o
91. You can download and install an iSCSI Target from the following location http sourceforge net project showfiles php group id 108475 amp package id 117141 Dedicate a partition on your iSCSI Target on which you will later install the operating system Configure your iSCSI Target to work with the partition you dedicated If for example you choose partition dev sda5 then edit the iSCSI Target configuration file etc ietd conf to include the following line under the iSCSI Target iqn line Lun 0 Path dev sda5 Type fileio Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 1 1 Step a 193 Tip The following is an example of an iSCSI Target iqn line Target iqn 2007 08 7 3 4 10 iscsiboot Step 54 Start your iSCSI Target Example hostl etc init d iscsitarget start Configuring the DHCP Server to Boot From an iSCSI Target in Linux Environment Configure DHCP as described in Section B 3 1 Configuring the DHCP Server Edit your DHCP configuration file etc dhcpd conf and add the following lines for the machine s you wish to boot from the iSCSI Target Filename option root path iscsi iscsi target ip iscsi target ign The following is an example for configuring an Ethernet device to boot from an iSCSI Target host hostl filename hardware ethernet 00 02 c9 00 00 bb option root path iscsi 11 4 3 7 iqn 2007 08 7 3 4 10 iscsi boot SCSI
92. _LOAD in /etc/infiniband/openib.conf to "yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
Note: When loading the ib_srp module, it is possible to set the module parameter srp_sg_tablesize. This is the maximum number of gather/scatter entries per I/O (default: 12).
8.2.2 Manually Establishing an SRP Connection
The following steps describe how to manually establish an SRP connection between the Initiator and an SRP Target. Section 8.2.4 explains how to do this automatically.
• Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
• To establish a connection with an SRP Target and create an SRP (SCSI) device for that target under /dev, use the following command:
echo -n id_ext=[GUID value],ioc_guid=[GUID value],dgid=[port GID value],pkey=ffff,service_id=[service[0] value] > /sys/class/infiniband_srp/srp-mthca[hca number]-[port number]/add_target
See Section 8.2.3 for instructions on how the parameters in this echo command may be obtained.
Notes:
• Execution of the above echo command may take some time
• The SM must be running while the command executes
• It is possible to include additional parameters in the echo command:
  max_cmd_per_lun: Default: 63
  max_sect (short for max_sectors): sets the request size of a command
  io_class: Default: 0x100 as
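The string written to add_target by the echo command above is a comma-separated list of key=value pairs. A minimal Python sketch of assembling it follows; it assumes that layout, the helper name is hypothetical, and every GUID, GID and service value in the usage lines is a made-up placeholder rather than output from a real fabric.

```python
def srp_target_string(id_ext, ioc_guid, dgid, service_id, pkey="ffff", **extra):
    """Join the SRP target parameters as key=value pairs separated by commas.

    Optional parameters mentioned in the notes (for example max_cmd_per_lun or
    max_sect) can be passed as keyword arguments and are appended at the end.
    """
    fields = {
        "id_ext": id_ext,
        "ioc_guid": ioc_guid,
        "dgid": dgid,
        "pkey": pkey,
        "service_id": service_id,
    }
    fields.update(extra)
    return ",".join(f"{key}={value}" for key, value in fields.items())


# Placeholder values for illustration only.
target = srp_target_string(
    id_ext="0002c90200000001",
    ioc_guid="0002c90200000001",
    dgid="fe800000000000000002c90200000002",
    service_id="0002c90200000001",
    max_cmd_per_lun=63,
)
print(target)
```

The printed string is what would be redirected into the add_target file; writing it there is left to the shell command shown in the text.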
93. a source port is a member of a specified group
• Destination port group (same as above, only for destination port)
• PKey
• QoS class
• Service ID
To match a certain matching rule, a PR/MPR query has to match ALL the rule's criteria. However, not all the fields of the PR/MPR query have to appear in the matching rule. For instance, if the rule has a single criterion, Service ID, it will match any query that has this Service ID, disregarding the rest of the query fields. However, if a certain query has only a Service ID (which means that this is the only bit in the PR/MPR component mask that is on), it will not match any rule that has other matching criteria besides Service ID.
11.6.3 Simple QoS Policy Definition
A simple QoS policy definition comprises a single section denoted by qos-ulps. Similar to the advanced QoS policy, it has a list of match rules and their QoS Level, but in this case a match rule has only one criterion (its goal is to match a certain ULP, or a certain application on top of this ULP, PR/MPR request), and the QoS Level has only one constraint: Service Level (SL). The simple policy section may appear in the policy file in combination with the advanced policy, or as a stand-alone policy definition. See more details and the list of match rule criteria below.
11.6.4 Policy File Syntax Guidelines
• Leading and trailing bl
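The matching behavior described above (a query must satisfy every criterion of a rule, while extra query fields are ignored) can be sketched in Python. This is a simplified model and not OpenSM code: criteria are compared by plain equality, whereas real port-group criteria test group membership, and all field names are hypothetical.

```python
def rule_matches(rule: dict, query: dict) -> bool:
    """Return True if the query carries every criterion of the rule with an equal value.

    Fields absent from the query (bits that are off in the component mask) can
    never satisfy a criterion; fields absent from the rule are ignored.
    """
    return all(name in query and query[name] == value for name, value in rule.items())


service_only_rule = {"service_id": 0x00000000010648CA}
service_and_pkey_rule = {"service_id": 0x00000000010648CA, "pkey": 0x8001}

full_query = {"service_id": 0x00000000010648CA, "pkey": 0x8001}
service_only_query = {"service_id": 0x00000000010648CA}

print(rule_matches(service_only_rule, full_query))              # True
print(rule_matches(service_and_pkey_rule, service_only_query))  # False
```

The second call shows the case singled out in the text: a query that carries only a Service ID cannot match a rule that also constrains the PKey.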
94. additional link information using ibportstate gt ibstatus mlx4 0 1 imcimlsama device mlz4 OV pore I statuss default gid fe80 0000 0000 0000 0000 0000 9289 3895 base lid 0x3 sm lid 0x3 Staras 23 NAN phys state 58 inku rate 20 Gb sec 4X DDR gt oporrsrcare C mis 0 3 1 query PORTINO i Port intos biel 3 ooru il NS EN IRA By SITAS rr a E RON OSA oe ae LinkUp IbaLinhealLChelnSWjoyexrOreicOCls 55 5650000c000celM Ox AX alicia nal leca A 0000000000106 ie DX Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential InfiniBand Fabric Diagnostic Utilities 138 2 Query the status of two channel adapters using directed paths gt wdlpgoorcsicece C mls 0 D O 1 POE Jato E ij Post laos DR pava slic 659535 Clic 655357 0 porre 1 DASS aid RO pad usos Seah zie Pays OS eee LinkUp ISE PAST OAE O sso sd De A IbsLia kAVsuShCIUMMASIGILSCS sso ooo copos lk OE AM ETNIA EE oa Ses 4X SS pe SAS PORTS RR 2 5 Gbps or 5 0 Gbps ns pee En 2 5 Gbps or 5 0 Gbps TESSA det Re 5 0 Elojos gt loporcscace C mencat D Q T POET ON ij Pose laos DR pava slic 655357 clilicl 655357 0 pose i tots o TT eto ts Caer Down BAVIERA oma wo on KT BE POL aime Lia sLCheIAS WSOC Rls ss ooosoooo gol DE AX IbaLiaaNsvShclUMMEIGILSCS ssa sos ossa I OE WK ATENAS EY e 4X Rev 1 50 Mellanox Technologies Mellano
95. administered VLAN configuration with ifcfg-ethX configuration files: VNICVLAN=<vlan tag>, or remove the VNICVLAN property for no VLAN.
Note: Using a VLAN tag value of 0 is not recommended, because the traffic using it would not be separated from non-VLAN traffic.
EoIB Multicast Configuration
Configuring multicast for EoIB interfaces is identical to multicast configuration for native Ethernet interfaces.
Note: EoIB maps Ethernet multicast addresses to InfiniBand MGIDs (Multicast GIDs). The mapping is done in a way that ensures that different vHubs use mutually exclusive MGIDs. This prevents vNics on different vHubs from communicating with one another.
EoIB and QoS
EoIB enables the use of InfiniBand service levels. At this time, EoIB supports the use of a single SL for all vNics. The configuration of the SL is performed through the BridgeX. Refer to the BridgeX documentation for the use of a non-default SL.
IP Configuration Based on DHCP
Setting an EoIB interface configuration based on DHCP (v3.1.2, which is available via www.isc.org) is performed similarly to the configuration of Ethernet interfaces. In other words, you need to make sure that EoIB configuration files include the following line:
For RedHat: BOOTPROTO=dhcp
For SLES: BOOTPROTO='dhcp'
Note: If EoIB configuration files are included, ifcfg-eth<n> fi
96. amed IBDM_IBNL_PATH to that directory. The directory is located in /tmp or in the output directory provided by the -o flag.
-load_db <file name>: Load subnet data from the given .db file, and skip the subnet discovery stage.
Note: Some of the checks require actual subnet discovery, and therefore would not run when load_db is specified. These checks are: duplicated/zero GUIDs, link state, SMs status.
-h|--help: Prints the help page information
-V|--version: Prints the version of the tool
--vars: Prints the tool's environment variables and their values
12.3.2 Output Files
Table 7: ibdiagnet Output Files
Output File        Description
ibdiagnet.log      A dump of all the application reports generated according to the provided flags
ibdiagnet.lst      List of all the nodes, ports and links in the fabric
ibdiagnet.fdbs     A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs   A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks    In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm       List of all the SMs (state and priority) in the fabric
ibdiagnet.pm       A dump of the PM counter values of the fabric links
ibdiagnet.pkey     A dump of the existing partitions and their member host ports
97. ameters pl p2 p3 Prompt of a user command under bash shell hostname Prompt of a root command under bash shell hostname Prompt of a user command under tcsh shell tcsh Environment variables VARIABLE Code example if a b 1 Comment at the beginning of a code line LH Characters to be typed by users as is bold font Keywords bold font Variables for which users supply specific values Italic font Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential Table 1 Typographical Conventions 11 Description Convention Example Emphasized words Italic font These are emphasized words Pop up menu sequences menul gt menu2 gt gt item Note Note Warning Warning Common Abbreviations and Acronyms Table 2 Abbreviations and Acronyms Abbreviation Acronym Whole Word Description B Capital B is used to indicate size in bytes or multiples of bytes e g IKB 1024 bytes and 1MB 1048576 bytes b Small b is used to indicate size in bits or multiples of bits e g Kb 1024 bits FCoE Fibre Channel over Ethernet FW Firmware HCA Host Channel Adapter HW Hardware IB InfiniBand LSB Least significant byte Isb Least significant bit MSB Most significant byte msb Most significant bit NIC Network Interface Card SW Software
98. ance micro-benchmark. As an example, the tests can be used for hardware or software tuning and/or functional testing. See PERF_TEST_README.txt under docs/.
1. OpenSM is disabled by default. See Chapter 11, "OpenSM Subnet Manager", for details on enabling it.
1.5 Quality of Service
Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. QoS over Mellanox BXOFED for Linux is discussed in Chapter 11, "OpenSM Subnet Manager".
2 Installation
This chapter describes how to install and test the BXOFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed. The chapter includes the following sections:
• Hardware and Software Requirements
• Downloading BXOFED
• Installing BXOFED
• Uninstalling BXOFED
2.1 Hardware and Software Requirements
2.1.1 Hardware Requirements
Platforms
• A server platform with an adapter card based on one of the following Mellanox Technologies' InfiniBand HCA devices:
• ConnectX VPI (IB/EN/FCoE) (firm
99. and Prerequisites See Section A 7 1 on page 164 Warning The following procedure modifies critical files used in the boot proce dure It must be executed by users with expertise in the boot process Improper application of this procedure may prevent the diskless machine from booting Procedure Load the SLES 10 SP2 installation disk and enter the following parameters as boot options netsetup 1 WithISCSI 1 Boot from Hard Disk Installation Installation ACPI Disabled Installation Local APIC Disabled Installation Safe Settings Rescue System Memory Test Boot Options netsetup 1 HithISCSI 1 Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential 172 Step 21 Continue with the procedure as instructed by the installation program until the iSCSI Initiator Overview window appears Preparation Y Language License Agreement Disk Activation e System Analysis Service Connected Targets e Time Zone Sa SCSI Initiator Overview Portal Address Target Name Start Up Installation e Installation Summary Perform Installation Configuration e Root Password e Hostname e Network e Customer Center e Online Update e Service e Users e Clean Up e Release Notes e Hardware Configuration Step 22 Click the Add tab in the iSCSI Initiator Overview window An iSCSI Initiator Discov ery window will pop up Enter the IP Address
100. and applications and ULPs, and is not unique to EoIB. Refer to Chapter 11, "OpenSM Subnet Manager", for more information on SM and OpenSM. Other than the subnet manager and the BridgeX, you will most likely need one or more InfiniBand switches, and probably some Ethernet switches as well. A simple EoIB setup will look something like this:
The BridgeX gateway is at the heart of EoIB. On one side, usually referred to as the "internal" side, it is connected to the InfiniBand fabric by one or more links. On the other side, usually referred to as the "external" side, it is connected to the Ethernet subnet by one or more ports. The Ethernet connections on the BridgeX's external side are called external ports or eports. Every BridgeX that is in use with EoIB needs to have one or more eports connected.
6.2.1 External ports (eports) and GW
The combination of a specific BridgeX box and a specific eport is referred to as a gateway (GW). The GW is an entity that is visible to the EoIB host driver and is used in the configuration of the network interfaces on the host side. For example, in host-administered vNics the user will request to open an interface on a specific GW, identifying it by the BridgeX box and eport name. Distinguishing between GWs is important because they determine the network topology and affect the path that a packet traverses
101. anks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
• Comments are started with the pound sign (#) and terminated by EOL.
• Any keyword should be the first non-blank in the line, unless it's a comment.
• Keywords that denote section/subsection start have matching closing keywords.
• Having a QoS Level named "DEFAULT" is a must; it is applied to PR/MPR requests that didn't match any of the matching rules.
• Any section/subsection of the policy file is optional.
11.6.5 Examples of Advanced Policy File
As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level. Here's an example of the shortest policy file:
qos-levels
    qos-level
        name: DEFAULT
        sl: 0
    end-qos-level
end-qos-levels
The port groups section is missing because there are no match rules, which means that port groups are not referred to anywhere, and there is no need to define them. And since this policy file doesn't have any matching rules, a PR/MPR query will not match any rule, and OpenSM will enforce the default QoS level. Essentially, the above example is equivalent to not having a QoS policy file at all. The following example shows all the possible options and keywords in the policy file and their syntax. See the comments in the following exam
102. antly limit bandwidth.

To obtain the current setting for Max Read Req, enter:
    setpci -d "15b3:" 68.w
To obtain the PCI Express slot (link) width and speed, enter:
    setpci -d "15b3:" 72
1. If the output is neither 81 nor 82, then the card is NOT installed in an x8 PCI Express slot.
2. The least significant digit indicates the link speed:
   • 1 for PCI Express Gen 1 (2.5 GT/s)
   • 2 for PCI Express Gen 2 (5 GT/s)
Note: If you are running InfiniBand at QDR (40Gb/s 4X IB ports), you must run PCI Express Gen 2.

B.2 InfiniBand Performance Troubleshooting

InfiniBand (IB) performance depends on the health of the IB link(s) and on the IB card type. IB link speed (10Gb/s or SDR, 20Gb/s or DDR, 40Gb/s or QDR) also affects performance.
Note: A latency-sensitive application should take into account that each switch on the path adds 200nsec at SDR and 150nsec at DDR.
1. To check the IB link speed, enter:
    ibstat
   Check the value indicated after the "Rate:" string: 10 indicates SDR, 20 indicates DDR, and 40 indicates QDR.
2. Check that the link has NO symbol errors, since these errors result in the retransmission of packets and therefore in bandwidth loss. This check should be conducted for each port after the driver is loaded. To check for symbol errors, enter:
    cat /sys/class/infiniband/<device>/ports/1/counters/symbol_error
   The command above
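The two lookups described above (the link value read with setpci, and the Rate value printed by ibstat) can be expressed as a short helper. This is a minimal sketch that only encodes the values stated in the text; the function names are hypothetical, and the input is assumed to be the two-digit value (for example 81 or 82) already extracted from the command output.

```python
LINK_SPEED = {
    "1": "PCI Express Gen 1 (2.5 GT/s)",
    "2": "PCI Express Gen 2 (5 GT/s)",
}
IB_RATE = {10: "SDR", 20: "DDR", 40: "QDR"}


def decode_pcie_link(value):
    """Return (installed_in_x8_slot, link_speed) for a value such as '82'."""
    value = value.strip().lower()
    in_x8_slot = value in ("81", "82")            # anything else: not an x8 slot
    speed = LINK_SPEED.get(value[-1:], "unknown")  # least significant digit
    return in_x8_slot, speed


def ib_speed_name(rate_gbps):
    """Map the number after ibstat's 'Rate:' string to SDR/DDR/QDR."""
    return IB_RATE.get(rate_gbps, "unknown")
```

For a QDR link (Rate 40), the note above implies the PCI Express side should decode to Gen 2.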
103. Step 56 Click the Add tab in the iSCSI Initiator Overview window. An iSCSI Initiator Discovery window will pop up. Enter the IP Address of your iSCSI target and click Next.

Step 57 Details of the discovered iSCSI target(s) will be displayed in the iSCSI Initiator Discovery window. Select the target that you wish to connect to and click Connect.
104. at a time. If GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.

This option displays a menu of possible local port GUID values with which osmtest could bind.

This option specifies the name of the inventory file. Normally, osmtest expects to find an inventory file, which osmtest uses to validate real-time information received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat. See the -c option for related information.

-s, --stress
This option runs the specified stress test instead of the normal test suite. Stress test options are as follows:
    OPT    Description
    -s1    Single-MAD response SA queries
    -s2    Multi-MAD (RMPP) response SA queries
    -s3    Multi-MAD (RMPP) Path Record SA queries
Without -s, stress testing is not performed.

-M, --Multicast_Mode
This option specifies the length of the Multicast test:
    OPT    Description
    -M1    Short Multicast Flow (default), single mode
    -M2    Short Multicast Flow, multiple mode
    -M3    Long Multicast Flow, single mode
    -M4    Long Multicast Flow, multiple mode
Single mode: osmtest is tested alone, with no other apps that interact with OpenSM MC.
Multiple mode: could be run with other apps using MC with OpenSM.
Without M
105. atic Activation of High Availability

If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.d/openibd stop"), which performs Steps 2-4 of case b above. However, you still have to unmount all SRP partitions that were mounted before driver shutdown.

Chapter 9 MPI

9.1 Overview

Mellanox BXOFED for Linux includes the following MPI implementations over InfiniBand:
• Open MPI: an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University
These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox BXOFED for Linux installation. Table 6 lists some useful MPI links.

Table 6 Useful MPI Links
    MPI Standard    http://www-unix.mcs.anl.gov/mpi
    Open MPI        http://www.open-mpi.org
    MVAPICH MPI     http://nowlab.cse.ohio-state.edu/projects/mpi-iba
    MPI Forum       http://www.mpi-forum.org

This chapter includes the following sections:
• Prerequisites for Running MPI
• MPI Selector: Which MPI Runs
• Compiling MPI Applications
• OSU MVAPICH Performance
• Open MPI Performance

9.2 Prerequisites for Running MPI

For launching multiple MPI proce
106. ble. This number can be less than the number of actual layers available.

In general, LASH is a very flexible algorithm. It can, for example, reduce to Dimension Order Routing in certain topologies; it is topology agnostic and fares well in the face of faults. It has been shown that for both regular and irregular topologies, LASH outperforms Up/Down. The reason for this is that LASH distributes the traffic more evenly through a network, avoiding the bottleneck issues related to a root node, and always routes shortest-path. The algorithm was developed by Simula Research Laboratory.

Use the '-R lash -Q' option to activate the LASH algorithm.
Note: QoS support has to be turned on in order that SL/VL mappings are used.
Note: LMC > 0 is not supported by the LASH routing. If this is specified, the default routing algorithm is invoked instead.

11.5.6 DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension. Paths are grown from a destination back to a source using the lowest dimension (port) of available path
107. ble console socket

-i, --ignore-guids <equalize-ignore-guids-file>
    This option provides the means to define a set of ports (by node guid and port number) that will be ignored by the link load equalization algorithm.
-x, --honor_guid2lid
    This option forces OpenSM to honor the guid2lid file when it comes out of Standby state, if such a file exists under OSM_CACHE_DIR and is valid. By default, this is FALSE.
-f, --log_file <file name>
    This option defines the log to be the given file. By default, the log goes to /var/log/opensm.log. For the log to go to standard output, use -f stdout.
-L, --log_limit <size in MB>
    This option defines the maximal log file size in MB. When specified, the log file will be truncated upon reaching this limit.
-e, --erase_log_file
    This option will cause deletion of the log file (if it previously exists). By default, the log file is accumulative.
-P, --Pconfig <partition config file>
    This option defines the optional partition configuration file. The default name is /etc/opensm/partitions.conf.
--prefix_routes_file <file name>
    Prefix routes control how the SA responds to path record queries for off-subnet DGIDs. By default, the SA fails such queries. The PREFIX ROUTES section below descr
-Q, --qos
108. cket. Specifically, the number of bytes that can be sent is high_limit times 4K bytes. A high_limit value of 255 indicates that the byte limit is unbounded.
Note: If the 255 value is used, the low priority VLs may be starved.
A value of 0 indicates that only a single packet from the high-priority table may be sent before an opportunity is given to the low-priority table.

Keep in mind that ports usually transmit packets of size equal to MTU. For instance, for 4KB MTU a single packet will require 64 credits, so in order to achieve effective VL arbitration for packets of 4KB MTU, the weighting values for each VL should be multiples of 64.

Below is an example of SL2VL and VL Arbitration configuration on subnet:

    qos_ca_max_vls 15
    qos_ca_high_limit 6
    qos_ca_vlarb_high 0:4
    qos_ca_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 6
    qos_swe_vlarb_high 0:4
    qos_swe_vlarb_low 0:0,1:64,2:128,3:192,4:0,5:64,6:64,7:64
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

In this example, there are 8 VLs configured on subnet: VL0 to VL7. VL0 is defined as a high priority VL, and it is limited to 6 x 4KB = 24KB in a single transmission burst. Such a configuration would suit a VL that needs low latency and uses small MTU when transmitting packets. Res
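The arithmetic behind the two statements above (64 credits per 4KB packet, and a 24KB burst for a high limit of 6) can be checked with a few lines. This is a sketch only: the function and constant names are hypothetical, and the 64-byte credit size is taken from the VL arbitration description elsewhere in this manual.

```python
CREDIT_BYTES = 64        # one weight unit (credit) is 64 bytes
HIGH_LIMIT_UNIT = 4096   # high_limit counts in units of 4K bytes
UNBOUNDED = 255          # special value: no byte limit


def credits_per_packet(mtu_bytes):
    """Credits needed to send one MTU-sized packet (rounded up)."""
    return -(-mtu_bytes // CREDIT_BYTES)


def high_limit_bytes(high_limit):
    """Byte budget of the high-priority table, or None if unbounded.

    A result of 0 corresponds to the 'single packet only' case.
    """
    if high_limit == UNBOUNDED:
        return None
    return high_limit * HIGH_LIMIT_UNIT
```

With these helpers, credits_per_packet(4096) gives 64, which is why the low-priority weights in the example (64, 128, 192) are multiples of 64: each weight then corresponds to a whole number of 4KB packets.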
109. cket loss, time 399 ms
    rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2

4.6 The ib-bonding Driver

The ib-bonding driver is a High Availability solution for IPoIB interfaces. It is based on the Linux Ethernet Bonding Driver and was adapted to work with IPoIB. The ib-bonding package contains a bonding driver and a utility called ib-bond to manage and control the driver operation.
The ib-bonding driver comes with the ib-bonding package (run "rpm -qi ib-bonding" to get the package information).

4.6.1 Using the ib-bonding Driver

The ib-bonding driver can be loaded manually or automatically.

Manual Operation
Use the utility ib-bond to start, query, or stop the driver. For details on this utility, please read the documentation for the ib-bonding package under:
    /usr/share/doc/ib-bonding-0.9.0/ib-bonding.txt on RedHat, and
    /usr/share/doc/packages/ib-bonding-0.9.0/ib-bonding.txt on SuSE.

Automatic Operation
There are two ways to configure automatic ib-bonding operation:
1. Using the openibd configuration file, as described in the following steps:
   a. Edit the file /etc/infiniband/openib.conf to define bonding parameters. Example:
        # Enable the bonding driver on startup
        IPOIBBOND_ENABLE=yes
        # Set bond interface names
        IPOIB_BONDS=bond0,bond8007
        # Set specific bond params; address and slaves
        bond0_IP=10.10.10.1/24
110. command:

    Option            Description
    -H                Where to find the server
    11.4.17.6         IPoIB IP address
    -t <Test Name>    Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc.
    -c                Client CPU utilization
    -C                Server CPU utilization
    --                Separates the global and test-specific parameters
    -r 1,1            The request size sent and how many bytes requested back

Note that the run example above produced the following results:
• Client CPU utilization is 5.61 percent of client CPU
• Server CPU utilization is 6.79 percent of server CPU
• Latency is 25.11 microseconds. Latency is calculated as follows:
      0.5 * (1 / Transaction rate per sec) * 1,000,000 = one-way average latency in usec

Step 6 To end the test, shut down the Netperf server.
    host1$ pkill netserver

Chapter 5 RDS

5.1 Overview

Reliable Datagram Sockets (RDS) is a socket API that provides reliable, in-order datagram delivery between sockets over RC or TCP/IP. RDS is intended for use with Oracle RAC 11g. For programming details, enter:
    host1$ man rds

5.2 RDS Configuration

The RDS ULP is installed as part of Mellanox BXOFED for Linux. To load the RDS module upon boot, edit the file /etc/infiniband/openib.conf and set "RDS_LOAD=yes".
Note: For the changes to take effect, run: /etc/init.d/openibd restart
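The latency formula quoted in the Netperf results can be written out directly. The sketch below is just that arithmetic; the function name is hypothetical and the input is assumed to be the transaction rate per second reported by a request/response test such as TCP_RR.

```python
def one_way_latency_usec(transactions_per_sec):
    """0.5 * (1 / transaction rate per sec) * 1,000,000.

    One transaction is a full round trip, so half of its duration is
    taken as the one-way average latency, converted to microseconds.
    """
    if transactions_per_sec <= 0:
        raise ValueError("transaction rate must be positive")
    return 0.5 * (1.0 / transactions_per_sec) * 1_000_000
```

For example, a rate of 20,000 transactions per second corresponds to a one-way latency of about 25 microseconds, which is in the same range as the figure quoted above.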
111. ctor difference is detected.

    Switch            Affected/Relevant Commands    Description
    -byte_mode        burn, write                   Shift address when accessing Flash internal registers. May be required for burn/write commands when accessing certain Flash types.
    -s[ilent]         burn                          Do not print burn progress messages.
    -y[es]            All                           Non-interactive mode: assume the answer is "yes" to all questions.
    -no               All                           Non-interactive mode: assume the answer is "no" to all questions.
    -vsd <string>     burn                          Write this string, of up to 208 characters, to VSD upon a burn command.
    -use_image_ps     burn                          Burn vsd as it appears in the given image; do not keep existing VSD on Flash.
    -dual_image       burn                          Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
    -v                                              Print version info.

Table 17 mstflint Commands

    Command           Description
    b[urn]            Burn Flash
    q[uery]           Query miscellaneous Flash/firmware characteristics
    v[erify]          Verify the entire Flash
    bb                Burn Block: burn the given image as is, without running any checks
    sg                Set GUIDs
    ri <out-file>     Read the firmware image on the Flash into the specified file
    dc <out-file>     Dump Configuration: print a firmware configuration file for the given image to the specified output file
    e[rase] <addr>    Erase s
112. d the second is for loading other parts of the OS via initrd.

Note: Linux distributions such as SuSE Linux Enterprise Server 10 SPx and Red Hat Enterprise Linux 5.1 (or above) can be directly installed on an iSCSI target. At the end of this direct installation, initrd is capable of continuing to load other parts of the OS on the iSCSI target. (Other distributions may also be suitable for direct installation on iSCSI targets.)

If you choose to continue loading the OS (after boot) through the HCA device driver, please verify that the initrd image includes the HCA driver as described in Section A.8.

Configuring an iSCSI Target in Linux Environment

Prerequisites
Step a Make sure that an iSCSI Target is installed on your server side.
Tip: You can download and install an iSCSI Target from the following location:
    http://sourceforge.net/project/showfiles.php?group_id=108475&package_id=117141
Step 18 Dedicate a partition on your iSCSI Target on which you will later install the operating system.
Step 19 Configure your iSCSI Target to work with the partition you dedicated. If, for example, you choose partition /dev/sda5, then edit the iSCSI Target configuration file /etc/ietd.conf to include the following line under the iSCSI Target iqn line:
    Lun 0 Path=/dev/sda5,Type=fileio
Tip: The following is an example of an iSCSI Target iqn line
113. do not affect each other. Mellanox's mlxburn tool is available for burning; however, it is not possible to burn the expansion ROM image by itself. Rather, both the firmware and expansion ROM images must be burnt simultaneously.

mlxburn requires the following items:
1. MST device name. After installing the MFT package, run:
       mst start
       mst status
   The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}
2. The firmware mlx file fw-<ID>-X_X_XXX.mlx
3. One of the expansion ROM binary files listed in Section A.1.3.

Firmware burning example:
The following command burns a firmware image and an expansion ROM image to the Flash device of a ConnectX adapter card:
    mlxburn -dev /dev/mst/mt25418_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MHGH28-XTC.ini -exp_rom ConnectX_IB_25418_ROM-X_X_XXX.rom

A.3 Preparing the DHCP Server in Linux Environment

The DHCP server plays a major role in the boot process by assigning IP addresses for BoIB clients and instructing the clients where to boot from. BoIB requires that the DHCP server runs on a machine which supports IP over IB.

A.3.1 Configuring the DHCP Server

A.3.1.1 For ConnectX Family Devices

When a BoIB client boots, it sends the DHCP server various information, including its DHCP client identifier. This identifier is used to distinguish between
114. down the Netperf server.
    host1$ pkill netserver

Chapter 8 SRP

8.1 Overview

As described in Section 1.4.5, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol offload and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services.

Section 8.2 describes the SRP Initiator included in Mellanox BXOFED for Linux. This package, however, does not include an SRP Target.

8.2 SRP Initiator

This SRP Initiator is based on open source from OpenFabrics (www.openfabrics.org) that implements the SCSI RDMA Protocol-2 (SRP-2). SRP-2 is described in Document # T10/1524-D, available from www.t10.org/ftp/t10/drafts/srp2/srp2r00a.pdf.

The SRP Initiator supports:
• Basic SCSI Primary Commands-3 (SPC-3) (www.t10.org/ftp/t10/drafts/spc3/spc3r21b.pdf)
• Basic SCSI Block Commands-2 (SBC-2) (www.t10.org/ftp/t10/drafts/sbc2/sbc2r16.pdf)
• Basic functionality, task management and limited error handling

8.2.1 Loading SRP Initiator

To load the SRP module, either execute the "modprobe ib_srp" command after the BXOFED driver is up, or change the value of SRP
115.
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define DEF_PORT     22222
    #define AF_INET_SDP  27
    #define PF_INET_SDP  AF_INET_SDP
    #define TXBUFSZ      2048

    uint8_t tx_buffer[TXBUFSZ];

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            printf("Usage: sdp_client <ip_addr>\n");
            exit(EXIT_FAILURE);
        }

        int sd = socket(PF_INET_SDP, SOCK_STREAM, 0);
        if (sd < 0) {
            perror("socket() failed");
            exit(EXIT_FAILURE);
        }

        struct sockaddr_in to_addr = {
            .sin_family = AF_INET,
            .sin_port = htons(DEF_PORT),
        };

        int ip_ret = inet_aton(argv[1], &to_addr.sin_addr);
        if (ip_ret == 0) {
            printf("invalid ip address '%s'\n", argv[1]);
            exit(EXIT_FAILURE);
        }

        int conn_ret = connect(sd, (struct sockaddr *) &to_addr, sizeof(to_addr));
        if (conn_ret < 0) {
            perror("connect() failed");
            exit(EXIT_FAILURE);
        }

        printf("connected to %s:%u\n",
               inet_ntoa(to_addr.sin_addr), ntohs(to_addr.sin_port));

        ssize_t nw = write(sd, tx_buffer, TXBUFSZ);
        if (nw < 0) {
            perror("write() failed");
116. e -p option. Otherwise, an error is reported.

Note: When ibdiagpath queries for the performance counters along the path between the source and destination ports, it always traverses the LID route, even if a directed route is specified. If along the LID route one or more links are not in the ACTIVE state, ibdiagpath reports an error.
Moreover, the tool allows omitting the source node in LID route addressing, in which case the local port on the machine running the tool is assumed to be the source.

12.4.1 SYNOPSIS

    ibdiagpath {-n <[src-name,]dst-name> | -l <[src-lid,]dst-lid> | -d <p1,p2,p3,...>}
               [-c <count>] [-v] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>]
               [-p <port-num>] [-o <out-dir>] [-lw <1x|4x|12x>] [-ls <2.5|5|10>]
               [-pm] [-pc] [-P <<PM counter>=<Trash Limit>>]

OPTIONS

    -n <[src-name,]dst-name>    Names of the source and destination ports (as defined in the topology file; source may be omitted, in which case the local port is assumed to be the source)
    -l <[src-lid,]dst-lid>
    -d <p1,p2,p3,...>
    -c <count>
    -t <topo-file>
    -s <sys-name>
    -i <dev-index>
    -p <port-num>
    -o <out-dir>
    -lw <1x|4x|12x>
    -ls <2.5|5|10>
    -pm
    -pc
    -P <<PM counter>=<Trash
117. Step 34 Click Edit in the Boot Loader Settings window.

Step 35 In the Optional Kernel Command Line Parameter field, append the following string to the end of the line: "ibft_mode=off" (include a space before the string). Click OK and then Finish to apply the change.
118. Step 39 Once the boot is complete, the Startup Options window will pop up. Select SUSE Linux Enterprise Server 10 SP2, then press Enter.

Step 40 The Hostname and Domain Name window will pop up. Continue configuring your machine until the operating system is up; then you can start running the machine in normal operation mode.

Step 41 (Optional) If you wish to have the second instance of connecting to the iSCSI Target go through the IB driver, copy the initrd file under /boot to a new location, add the IB driver into it after the load commands of the iSCSI Initiator modules, and continue as described in Section A.8.

Warning: Take extra care when changing initrd, as any mistake may prevent the client machine from booting. It is recommended to have a back-up iSCSI Initiator on a machine other than the client you are
119. e discovered fabric elements, and enforces the provided policy on client requests. The overall flow for such requests is as follows:
• The request is matched against the defined matching rules such that the QoS Level definition is found.
• Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.

Figure 5: QoS Manager

There are two ways to define QoS policy:
• Advanced: the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request and to enforce various QoS constraints on the requested PR/MPR.
• Simple: the simple policy file syntax enables the administrator to match PR/MPR requests by various ULPs and applications running on top of these ULPs.

11.6.2 Advanced QoS Policy File

The QoS policy file has the following sections:

I) Port Groups (denoted by port-groups). This section defines zero or more port groups that can be referred to later by matching rules (see below). A port group lists ports by:
• Port GUID
• Port name, which is a combination of NodeDescription and IB port number
• PKey, which means that all the ports in the subnet that belong to a partition with a given PKey belong to this port group
• Partition name, which means that all the ports in the subnet that belong to a partition with a given name belong to this port group
120. e of the standard ethtool application. EoIB interfaces support ethtool in a similar way to HW Ethernet interfaces. The supported ethtool options include the following:
    -c, -C    Show and update interrupt coalesce options
    -g        Query RX/TX ring parameters
    -k, -K    Show and update protocol offloads
    -i        Show driver information
    -S        Show adapter statistics
For more information on ethtool, run: ethtool -h

Link state

An EoIB interface has two different link states that it can report. The first is the physical link state of the interface, and the second is the link state of the external port associated with the vNic interface.

The physical link state of the port is itself made up of the actual HCA port link state and the status of the vNic's connection with the BridgeX. If the HCA port link state is down, or the EoIB connection with the BridgeX has failed, the link will be reported as down. This is because without the connection to the BridgeX the EoIB protocol does not work, so no data can be sent on the wire.

The mlx4_vnic driver can also report the status of the external BridgeX port. This information can be retrieved through the mlx4_vnic_info script (GW_EPORT field value). If the eport_state_enforce module parameter is set, then the external port state will be reported as the vNic interface link state. Naturally, if the connection between the vNic and the BridgeX is broken, and therefore the external port state is unknown, the link will be reported as
121. 9.5.3 Latency Test Performance

To run the OSU Latency test, enter:
    host1$ /usr/mpi/gcc/mvapich-<mvapich-ver>/bin/mpirun_rsh -np 2 -hostfile /home/<username>/cluster /usr/mpi/gcc/mvapich-<mvapich-ver>/tests/osu_benchmarks-<osu-ver>/osu_latency

    # OSU MPI Latency Test v3.0
    # Size        Latency (us)
    0             1.20
    1             1.21
    2             1.21
    4             1.21
    8             1.23
    16            1.24
    32            1.33
    64            1.49
    128           2.66
    256           3.08
    512           3.61
    1024          4.82
    2048          6.09
    4096          8.62
    8192          13.59
    16384         18.12
    32768         28.81
    65536         50.38
    131072        93.70
    262144        178.77
    524288        349.31
    1048576       689.25
    2097152       1371.04
    4194304       2739.16

9.5.4 Intel MPI Benchmark

To run the Intel MPI Benchmark test, enter:
    host1$ /usr/mpi/gcc/mvapich-<mvapich-ver>/bin/mpirun_rsh -np 2 -hostfile /home/<username>/cluster /usr/mpi/gcc/mvapich-<mvapich-ver>/tests/IMB-<IMB-ver>/IMB-MPI1

    # Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
    # Date                   : Sun Mar
    # Machine                : x86_64
    # System                 : Linux
    # Release                : 2.6.16.21-0.8-smp
    # Version                : #1 SMP
    # MPI Version            :
    # MPI Thread Environment :
    # Minimum message length in bytes: 0
    # Maximum message
122.
    modprobe scst scst_threads=1
    modprobe scst_vdisk scst_vdisk_ID=100
    echo "open vdisk0 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk1 /dev/sdb BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk2 /dev/sdc BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "open vdisk3 /dev/sdd BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
    echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk3 3" > /proc/scsi_tgt/groups/Default/devices
    modprobe ib_srpt
    echo "add mgmt" > /proc/scsi_tgt/trace_level
    echo "add mgmt_dbg" > /proc/scsi_tgt/trace_level
    echo "add out_of_mem" > /proc/scsi_tgt/trace_level

How to Unload/Shutdown
1. Unload ib_srpt:
       modprobe -r ib_srpt
2. Unload scst and its dev_handlers first:
       modprobe -r scst_vdisk scst
3. Unload ofed:
       /etc/rc.d/openibd stop

Appendix E: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
    options mlx4_core parameter=<value>
and/or
    options mlx4_ib parameter=<value>
and/or
    opt
123. ector

    rw <addr>                       Read one DWORD from Flash
    ww <addr> <data>                Write one DWORD to Flash
    wwne <addr>                     Write one DWORD to Flash without sector erase
    wbne <addr> <size> <data ...>   Write a data block to Flash without sector erase
    rb <addr> <size> [out-file]     Read a data block from Flash
    swreset                         SW reset the target InfiniScale IV device. This command is supported only in the In-Band access method.

Possible command return values are:
    0 - successful completion
    1 - error has occurred
    7 - the burn command was aborted because firmware is current

Examples
1. Find Mellanox Technologies's ConnectX VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR, or Ethernet ports at 10GigE.
       > /sbin/lspci -d 15b3:634a
       04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)
   In the example above, 15b3 is Mellanox Technologies's vendor number (in hexadecimal), and 634a is the device's PCI Device ID (in hexadecimal). The number string 04:00.0 identifies the device in the form bus:dev.fn.
   Note: The PCI Device IDs of Mellanox Technologies' devices can be obtained from the PCI ID Repository Website at http://pci-ids.ucw.cz/read/PC/15b3.
2. Verify the ConnectX firmware using its ID (using the results of the example above). Con
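A script that wraps the tool may want to turn the three documented return values into messages. The sketch below only encodes the values listed above; the function names are hypothetical, and treating value 7 (burn skipped because firmware is current) as a non-failure is a design assumption, not something the manual states.

```python
RETURN_MEANINGS = {
    0: "successful completion",
    1: "error has occurred",
    7: "burn aborted: firmware is current",
}


def describe_return(code):
    """Translate a documented return value into a short message."""
    return RETURN_MEANINGS.get(code, "unknown return value")


def needs_attention(code):
    """Assumption: 0 and 7 both leave the Flash in a good state."""
    return code not in (0, 7)
```

A caller would pass the process exit status, for instance the returncode of a subprocess run, to these helpers.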
124. ed in the configuration and control of the subnet.

Unicast Linear Forwarding Tables (LFT)
    A table that exists in every switch, providing the port through which packets should be sent to each LID.

Virtual Protocol Interconnect (VPI)
    A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).
125. efines the program name the rule applies to (not including the path). Wildcards with the same semantics as 'ls' are supported (* and ?). So db2* would match on any program with a name starting with db2; t?cp would match on ttcp, etc. If program-name is not provided (default), the statement matches all programs.

<address>
    Either the local address to which the server binds, or the remote server address to which the client connects. The syntax for address matching is:
        <IPv4 address>[/<prefix_length>]|*
        IPv4 address  = [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+, each sub number < 255
        prefix_length = [0-9]+ and with value <= 32. A prefix_length of 24 matches the subnet mask 255.255.255.0. A prefix_length of 32 requires matching of the exact IP.
<port range>
    start-port[-end-port], where port numbers are > 0 and < 65536

Note that rules are evaluated in the order of definition, so the first match wins. If no match is made, libsdp will default to "both".

Examples:
• Use SDP by clients connecting to machines that belong to subnet 192.168.1.*
      # family role     program   address:port[-range]
      use sdp  connect  *         192.168.1.0/24:*
• Use SDP by ttcp when it connects to port 5001 of any machine
      use sdp  listen   ttcp      *:5001
• Use TCP for any program with name starting with ttcp* serving ports 22 to 25
      use tcp  server   ttcp*     *:22-25
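The "first match wins, otherwise both" evaluation can be sketched in a few lines. This is an illustration of the matching semantics described above, not libsdp's implementation: the Rule tuple and function names are hypothetical, rules are given as already-parsed values rather than configuration text, and the role field (connect/listen/server) is left out for brevity.

```python
import fnmatch
import ipaddress
from collections import namedtuple

# family: "sdp", "tcp" or "both"; program: shell-style pattern;
# network: "a.b.c.d/len" or "*"; ports: inclusive (start, end) pair.
Rule = namedtuple("Rule", "family program network ports")


def rule_matches(rule, program, address, port):
    """True if program name, address and port all match the rule."""
    if not fnmatch.fnmatchcase(program, rule.program):
        return False
    if rule.network != "*":
        net = ipaddress.ip_network(rule.network, strict=False)
        if ipaddress.ip_address(address) not in net:
            return False
    start, end = rule.ports
    return start <= port <= end


def choose_family(rules, program, address, port):
    """Evaluate rules in definition order; default to 'both'."""
    for rule in rules:
        if rule_matches(rule, program, address, port):
            return rule.family
    return "both"


rules = [
    Rule("sdp", "*", "192.168.1.0/24", (1, 65535)),
    Rule("tcp", "ttcp*", "*", (22, 25)),
]
```

Because order matters, choose_family(rules, "ttcp", "192.168.1.7", 22) yields "sdp" here even though the second rule also matches; swapping the two rules would change the result.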
126.
    qos_ca_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_ca_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7
    qos_swe_max_vls 15
    qos_swe_high_limit 0
    qos_swe_vlarb_high 0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
    qos_swe_vlarb_low 0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
    qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped. If a list entry is programmed for VL15, or for a VL that is not supported or is not currently configured by the port, the port may either skip that entry or send from any supported VL for that entry.

Note that the same VL may be listed multiple times in the High or Low priority arbitration tables, and, further, it can be listed in both tables.

The limit of the high-priority VLArb table (qos_<type>_high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority pa
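The table format and value ranges described above lend themselves to a small validating parser. The sketch below assumes the comma-separated VL:weight notation used by the qos_*_vlarb_* options; the function names are hypothetical, and rejecting out-of-range entries is a choice made for this example (the text says a port may instead skip such entries).

```python
def parse_vlarb(table):
    """Parse 'VL:weight,VL:weight,...' into a list of (vl, weight) pairs."""
    entries = []
    for item in table.split(","):
        vl_text, weight_text = item.split(":")
        vl, weight = int(vl_text), int(weight_text)
        if not 0 <= vl <= 14:
            raise ValueError("VL must be in 0-14: %r" % item)
        if not 0 <= weight <= 255:
            raise ValueError("weight must be in 0-255: %r" % item)
        entries.append((vl, weight))
    return entries


def active_entries(entries):
    """Entries with weight 0 are skipped during arbitration."""
    return [(vl, weight) for vl, weight in entries if weight != 0]


def bytes_per_turn(weight):
    """A weight counts 64-byte units (credits)."""
    return weight * 64
```

Since a VL may legitimately appear more than once in a table, the parser keeps a list of pairs rather than a dictionary keyed by VL.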
127. enSM Subnet Manager

Removing a Subinterface
To remove a child interface (subinterface), run:
    echo <subinterface PKey> > /sys/class/net/<ib_interface>/delete_child
Using the example of Step 2:
    echo 0x8000 > /sys/class/net/ib0/delete_child
Note that when deleting the interface you must use the PKey value with the most significant bit set (e.g., 0x8000 in the example above).

Verifying IPoIB Functionality
To verify your configuration and your IPoIB functionality, perform the following steps:

Step 1. Verify the IPoIB functionality by using the ifconfig command. The following example shows how two IB nodes are used to verify IPoIB functionality. In the example, IB node 1 is at 11.4.3.175 and IB node 2 is at 11.4.3.176:
    host1# ifconfig ib0 11.4.3.175 netmask 255.255.0.0
    host2# ifconfig ib0 11.4.3.176 netmask 255.255.0.0

Step 2. Enter the ping command from 11.4.3.175 to 11.4.3.176. The following example shows how to enter the ping command:
    host1# ping -c 5 11.4.3.176
    PING 11.4.3.176 (11.4.3.176) 56(84) bytes of data.
    64 bytes from 11.4.3.176: icmp_seq=0 ttl=64 time=0.079 ms
    64 bytes from 11.4.3.176: icmp_seq=1 ttl=64 time=0.044 ms
    64 bytes from 11.4.3.176: icmp_seq=2 ttl=64 time=0.055 ms
    64 bytes from 11.4.3.176: icmp_seq=3 ttl=64 time=0.049 ms
    64 bytes from 11.4.3.176: icmp_seq=4 ttl=64 time=0.065 ms
    --- 11.4.3.176 ping statistics ---
    5 packets transmitted, 5 received, 0 pa
128. ent switches will be skipped. Multicast is not affected by the file routing engine (this uses min hop tables).

11.5.2 Min Hop Algorithm
The Min Hop algorithm is invoked when neither UPDN nor the file method is specified. It is divided into two stages: computation of min-hop tables on every switch, and LFT output port assignment. Link subscription is also equalized, with the ability to override based on port GUID. The latter is supplied by:
    -i <equalize-ignore-guids-file>
    -ignore-guids <equalize-ignore-guids-file>
This option provides the means to define a set of ports (by GUID) that will be ignored by the link load equalization algorithm. Note that only endports (CA, switch port 0, and router ports), and not switch external ports, are supported.

LMC awareness routes based on (remote) system or switch basis.

11.5.3 Purpose of UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree and one of its loops may experience a deadlock (due, for example, to high pressure).

The UPDN algorithm is based on the following main stages:
1. Auto-detect root nodes - based on the CA
129. enter
[Figure: YaST Installation Settings summary for SUSE Linux Enterprise Server 10 SP2 (software selection; boot loader type GRUB, location /dev/sda2 (/boot))]

Step 68. Click Edit in the Boot Loader Settings window.
[Figure: YaST Boot Loader Settings window, Section Management tab]
130. er (IPoIB, RDS, SRP)
Chapter 12 InfiniBand Fabric Diagnostic Utilities
12.1 Overview
12.2 Utilities Usage
12.2.1 Common Configuration, Interface and Addressing
12.2.2 IB Interface Definition
12.2.3 Addressing
12.3 ibdiagnet - IB Net Diagnostic
12.3.1 SYNOPSIS
12.3.2 Output Files
12.3.3 ERROR CODES
12.4 ibdiagpath - IB diagnostic path
12.4.1 SYNOPSIS
12.4.2 Output Files
12.4.3 ERROR CODES
12.5 ibv_devices
12.6 ibv_devinfo
12.7 ibstatus
12.8 ibportstate
12.9 ibroute
131. es (ifcfg-ethX)
6.3.2 Extracting BridgeX host name
6.3.3 mlx4_vnic_confd
6.3.4 EoIB Network Administered vNic
6.3.5 VLAN Configuration
6.3.5.1 Configuring VLANs
6.3.6 EoIB Multicast Configuration
6.3.7 EoIB and QoS
6.3.8 IP Configuration Based on DHCP
6.3.8.1 DHCP Server
6.3.9 Static EoIB Configuration
6.4 Sub Interfaces (VLAN)
6.5 Retrieving EoIB information
6.5.1 mlx4_vnic_info
6.5.2 ethtool
6.5.3 Link state
6.6 Bonding Driver
6.7 Jumbo Frames
6.8 Module Parameters
Chapter 7 SDP
7.1 Overview
7.2 libsdp.so Library
132. es

Description
Queries InfiniBand ports' performance and error counters. Optionally, it displays aggregated counters for all ports of a node. It can also reset counters after reading them, or simply reset them.

Synopsis
    perfquery [-h] [-d] [-G] [-a] [-l] [-r] [-C <ca_name>] [-P <ca_port>] [-R] [-t <timeout_ms>] [-V] [<lid|guid> [[port] [reset_mask]]]

Table 14 lists the various flags of the command.

Table 14 - perfquery Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h(help) | Optional | | Print the help menu
-d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-a | Optional | | Apply query to all ports
-l | Optional | | Loop ports
-r | Optional | | Reset the counters after reading them
-C <ca_name> | Optional | | Use the specified channel adapter or router
-P <ca_port> | Optional | | Use the specified port
-R | Optional | | Reset the counters
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
-V(ersion) | Optional | | Show version info
<lid|guid> [[port] [reset_mask]] | Optional | | LID or GUID

Examples
    perfquery -r 32 1
133. faces that belong to the same vHub.

Figure 3: eports, vNics and vHubs
[Figure: three Ethernet ports, each attached to a BridgeX eport; each eport serves vHubs with different VLAN ids]

6.3 EoIB Configuration
The mlx4_vnic module supports two different modes of configuration: host administration and network administration. In the first, the vNic is configured on the host side. In the latter, the configuration is done by the BridgeX and passed to the host mlx4_vnic driver using the EoIB protocol. Both modes of operation require the presence of a BridgeX gateway in order to work properly. The EoIB driver supports a mixture of host and network administered vNics.

6.3.1 EoIB Host Administered vNic
In host administered mode, vNics are configured using static configuration files located on the host side. These configuration files define the number of vNics and the vHub that each host administered vNic will belong to (i.e., the vNic's BridgeX box, eport and VLAN id properties). The mlx4_vnic_confd service is used to read these configuration files and pass the relevant data to the mlx4_vnic module. Two forms of configuration files are supported: a central configuration file (mlx4_vnic.conf) and vNic-specific configuration files (ifcfg-ethXX). Both supply the same functionality. If both forms
134. file is not found or is in error, the default routing algorithm is utilized.

The ability to dump switch lid matrices (aka min hops tables) to file and later to load these is also supported. The usage is similar to loading unicast forwarding tables from a dump file (introduced by the file routing engine), but the new lid matrix file name should be specified by the -M or --lid_matrix_file option. For example:
    host1# opensm -R file -M ./opensm-lid-matrix.dump
The dump file is named opensm-lid-matrix.dump and will be generated in the standard opensm dump directory (/var/log by default) when the OSM_LOG_ROUTING logging flag is set.

When the file routing engine is activated but the dump file is not specified or cannot be opened, the default lid matrix algorithm will be used.

There is also a switch forwarding tables dumper which generates a file compatible with dump_lfts.sh output. This file can be used as input for forwarding tables loading by the file routing engine. Both or one of the options -U and -M can be specified together with -R file.

11.6 Quality of Service Management in OpenSM
11.6.1 Overview
When Quality of Service (QoS) in OpenSM is enabled (using the -Q or --qos flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to th
135. from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET
    Recv    Send    Send                          Utilization      Service Demand
    Socket  Socket  Message  Elapsed              Send     Recv    Send    Recv
    Size    Size    Size     Time     Throughput  local    remote  local   remote
    bytes   bytes   bytes    secs.    10^6bits/s  % S      % S     us/KB   us/KB
    87380   16384   65536    10.00    5872.60     19.41    17.12   2.166

Note: You must specify the SDP/IPoIB IP address when running the Netperf client.

The following table describes parameters for the netperf command:
Option | Description
-H | Where to find the server
11.4.17.6 | SDP/IPoIB IP address
-t <Test Name> | Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc.
-c | Client CPU utilization
-C | Server CPU utilization
-- | Separates the global and test-specific parameters
-m | Message size, which is 65536 in the example above

Note that the run example above produced the following results:
• Throughput is 5.872 gigabits per second
• Client CPU utilization is 19.41 percent of client CPU
• Server CPU utilization is 17.12 percent of server CPU

Step 6. Run the Netperf Latency test such that you force SDP to be used instead of TCP. Run the test once, and stop the server so that it does not repeat the test.
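As a quick check of the figures quoted above, the sketch below converts netperf's 10^6 bits/s throughput column into gigabits per second and megabytes per second. The helper names are made up for this illustration and are not part of netperf.

```python
def mbits_to_gbits(mbits_per_sec):
    """Convert a 10^6 bits/s figure to gigabits per second."""
    return mbits_per_sec / 1000.0


def mbits_to_mbytes(mbits_per_sec):
    """Convert a 10^6 bits/s figure to 10^6 bytes/s (8 bits per byte)."""
    return mbits_per_sec / 8.0


if __name__ == "__main__":
    throughput = 5872.60  # throughput reported in the run example above
    print(f"{mbits_to_gbits(throughput):.3f} Gb/s")
    print(f"{mbits_to_mbytes(throughput):.1f} MB/s")
```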
136. g->length = len;
    #endif /* LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24) */
    static inline void sg_clear(struct scatterlist *sg)

d. Patch scsi_tgt.h with /tmp/scsi_tgt.patch:
    $ cd /usr/local/include/scst
    $ cp scst.h scsi_tgt.h
    $ patch -p0 < /tmp/scsi_tgt.patch

Note: The above steps are for RHEL 5.2 distributions only.

e. Save the following patch as /tmp/scst.patch:
    --- scst.h 2008-07-20 14:25:30.000000000 -0700
    +++ scst.h 2008-07-20 14:25:09.000000000 -0700
    @@ -42,7 +42,9 @@
    #endif
    #if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 19)
    typedef _Bool bool;
    #define true 1
    #define false 0
    #endif

f. Untar, patch, and install scst-1.0.0:
    $ tar zxvf scst-1.0.0.tar.gz
    $ cd scst-1.0.0/include
    $ patch -p0 < /tmp/scst.patch
    $ cd ..
    $ make && make install

g. Save the following patch as /tmp/scsi_tgt.patch:
    --- scsi_tgt.h 2008-07-20 14:25:30.000000000 -0700
    +++ scsi_tgt.h 2008-07-20 14:25:09.000000000 -0700
    @@ -2330,7 +2332,7 @@
    void scst_async_mcmd_completed(struct scst_mgmt_cmd *mcmd, int status);
    #if LINUX_VERSION_CODE < KERN
    return sg->page
    @@ -2358,7 +2360,7 @@
    sg->offset = offset;
    sg->length = len;
    #endif /* LINUX_V
137. g the -t <file> option.

Synopsis
    ibcheckerrs [-h] [-b] [-v] [-G] [-T <threshold_file>] [-s] [-N | -nocolor] [-C ca_name] [-P ca_port] [-t timeout_ms] <lid|guid> [<port>]

Table 15 lists the various flags of the command.

Table 15 - ibcheckerrs Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h(help) | Optional | | Print the help menu
-b | Optional | | Print in brief mode. Reduce the output to show only if errors are present, not what they are
-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: 0x08f1040023
-T <threshold_file> | Optional | | Use specified threshold file
-s | Optional | | Show the predefined thresholds
-N | -nocolor | Optional | color mode | Use mono mode rather than color mode
-C <ca_name> | Optional | | Use the specified channel adapter or router
-P <ca_port> | Optional | | Use the specified port
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
<lid|guid> | Mandatory (with -G flag) | | Use the specified port's or node's LID/GUID (with -G option)
<port> | Mandatory (without -G flag) | | Use the specified port
138. gies
    Lid     Out Port  Destination Info
    0x0003  000       Switch portguid 0x000b8cffff004016: MT47396 Infiniscale Mellanox Technologies
    0x0007  020       Channel Adapter portguid 0x0002c9020025874a: sw157 HCA-1
    [remaining rows of the unicast forwarding table are not recoverable from the source]

Dump all non-empty mlids of switch with Lid 3:
    > ibroute -M 3
    Multicast mlids of switch Lid 3 guid 0x000b8cffff004016 (MT47396 Infiniscale Mellanox Technologies):
    MLid: 0xc000 0xc001 0xc002 0xc003 0xc020 0xc021 0xc022 0xc023 0xc024
    [per-port membership marks are not recoverable from the source]

12.10 smpquery
Applicable Hardware
All InfiniBand devices.

Description
Provides a basic subset of standard SMP queries to query Subnet management attributes such as node info, node description, switch info, and port info.

Synopsis
    smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s <smlid>] [-V] [-C <ca_name>] [-P <ca_port>] [-t <timeout_ms>] [--node-name-map <node-name-map>] <op> <dest dr_path|lid|guid> [op params]

Table 13 lists the variou
139. > echo eth3 > $FCSYSFS/create

To destroy a previously created vHBA on an interface (e.g., eth3), run:
    > echo eth3 > $FCSYSFS/destroy
To signal link up to an existing vHBA (e.g., on eth3), run:
    > echo eth3 > $FCSYSFS/link_up
To signal link down to an existing vHBA (e.g., on eth3), run:
    > echo eth3 > $FCSYSFS/link_down

3.4.4.2 Creating vHBAs That Use PFC
To create a vHBA that uses the PFC feature, you must configure the Ethernet driver to support PFC, create a VLAN Ethernet interface, assign it a priority, and start a vHBA on the interface. The following steps demonstrate the creation of such a vHBA.

To configure the mlx4_en Ethernet driver to support PFC, add the following line to the file /etc/modprobe.conf and restart the network driver:
    options mlx4_en pfctx=0xff pfcrx=0xff
To create a VLAN with an ID (e.g., 55) on an interface (e.g., eth3), run:
    > vconfig add eth3 55
    > ifconfig eth3.55 up
To map skb priority 0 to the requested VLAN priority (e.g., 6), run:
    > vconfig set_egress_map eth3.55 0 6
To create the vHBA, enter:
    > echo eth3.55 > $FCSYSFS/create

3.4.4.3 Creating vHBAs That Use Link Pause
The mlx4_en Ethernet driver supports link pause by default. To change this setting, you can use the following command
140. > ethtool -A eth<x> [rx on|off] [tx on|off]
To create a vHBA, run:
    > echo eth3.55 > $FCSYSFS/create

IPoIB

4.1 Introduction
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service. This chapter describes the following:
• IPoIB mode setting (Section 4.2)
• IPoIB configuration (Section 4.3)
• How to create and remove subinterfaces (Section 4.4)
• How to verify IPoIB functionality (Section 4.5)
• The ib-bonding driver (Section 4.6)
• IPoIB performance tuning (Section 4.7)
• How to test IPoIB performance (Section 4.8)

4.2 IPoIB Mode Setting
IPoIB can run in two modes of operation: Connected mode and Datagram mode. By default, IPoIB is set to work in Connected mode. To switch to Datagram mode, edit the file /etc/infiniband/openib.conf and set SET_IPOIB_CM=no. After changing the mode, restart the driver by running:
    /etc/init.d/openibd restart
To check the current mode used for outgoing connections, enter:
    cat /sys/class/net/ib<n>/mode

4.3 IPoIB Configuration
Unless you have run the installation script install with the flag -n, IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP
141. > vdisk/vdisk
    echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
    echo "add vdisk2 2" > /proc/scsi_tgt/groups/Default/devices

Example 2: Working with real back-end scsi disks in scsi pass-thru mode
a. modprobe scst
b. modprobe scst_disk
c. cat /proc/scsi_tgt/scsi_tgt
    ibstor00# cat /proc/scsi_tgt/scsi_tgt
    Device (host:ch:id:lun or name)    Device handler
    0:0:0:0                            dev_disk
    4:0:0:0                            dev_disk
    5:0:0:0                            dev_disk
    6:0:0:0                            dev_disk
    7:0:0:0                            dev_disk
Now you want to exclude the first scsi disk and expose the last 4 scsi disks as IB/SRP luns for I/O:
    echo "add 4:0:0:0 0" > /proc/scsi_tgt/groups/Default/devices
    echo "add 5:0:0:0 1" > /proc/scsi_tgt/groups/Default/devices
    echo "add 6:0:0:0 2" > /proc/scsi_tgt/groups/Default/devices
    echo "add 7:0:0:0 3" > /proc/scsi_tgt/groups/Default/devices

Example 3: Working with scst_vdisk FILEIO mode (using md0 device and file 10G-file)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /10G-file" > /proc/scsi_tgt/vdisk/vdisk
e. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
f. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
g. modprobe ib_srpt

B. On Ini
142. guration files. This option will not modify vNics with a valid configuration that was not changed:
    mlx4_vnic_confd reload

EoIB Network Administered vNic
In network administered mode, the configuration of the vNic is done by the BridgeX. If a vNic is configured for a specific host, it will appear on that host once a connection is established between the BridgeX and the mlx4_vnic module. This connection between the mlx4_vnic modules and all available BridgeX boxes is established automatically when the mlx4_vnic module is loaded. If the BridgeX is configured to remove the vNic, or if the connection between the host and the BridgeX is lost, the vNic interface will disappear (running ifconfig will not display the interface). Similar to host administered vNics, a network administered vNic resides on a specific vHub.

See the BridgeX documentation on how to configure a network administered vNic.

To disable network administered vNics on the host side, load the mlx4_vnic module with the net_admin module parameter set to 0.

VLAN Configuration
As explained in the topology section, a vNic instance is associated with a specific vHub group. This vHub group is connected to a BridgeX external port and has a VLAN tag attribute. When creating or configuring a vNic, you define the VLAN tag it will use via the vid or the VNICVLAN fields. If these fields are absent, the vNic will not have a VLAN tag. If the vNic has a VLAN tag, it will be present in all EoIB packets sent by
143. gy agnostic routing algorithm to the non-minimal UPDN algorithm. It avoids the use of a potentially congested root node.
5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but avoids port equalization except for redundant links between the same two switches. This provides deadlock-free routes for hypercubes when the fabric is cabled as a hypercube, and for meshes when cabled as a mesh.

OpenSM also supports a file method which can load routes from a table - see Modular Routing Engine below.

The basic routing algorithm is comprised of two stages:
1. MinHop matrix calculation. How many hops are required to get from each port to each LID? The algorithm to fill these tables is different if you run standard (min hop) or Up/Down. For standard routing, a "relaxation" algorithm is used to propagate min hop from every destination LID through neighbor switches. For Up/Down routing, a BFS from every target is used. The BFS tracks link direction (up or down) and avoids steps that would go up after a down step was used.
2. Once MinHop matrices exist, each switch is visited and, for each target LID, a decision is made as to what port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the port with the fewest previously assigned LIDs is selected. If LMC > 0, more chec
144. h the -t option.

This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable, as follows:
    OPT   Description
    -d0   Ignore other SM nodes
    -d1   Force single threaded dispatching
    -d2   Force log flushing after each log message
    -d3   Disable multicast support

Display this usage info, then exit.

The following environment variables control opensm behavior:
• OSM_TMP_DIR
  Controls the directory in which the temporary files generated by opensm are created. These files are: opensm-subnet.lst, opensm.fdbs, and opensm.mcfdbs. By default, this directory is /var/log.
• OSM_CACHE_DIR
  opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
  guid2lid - stores the LID range assigned to each GUID

11.2.3 Signaling
When opensm receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.
Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.

11.2.4 Running opensm
The defaults of opensm were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes. To
145. hen adaptive moderation is disabled, run:
    > ethtool -C eth<n> [rx-usecs N] [rx-frames N]
  Note: usec settings correspond to the time to wait after the *last* packet sent/received before triggering an interrupt.
• To query pause frame settings, run:
    > ethtool -a eth<n>
• To set pause frame settings, run:
    > ethtool -A eth<n> [rx on|off] [tx on|off]
• To obtain additional device statistics, run:
    > ethtool -S eth<n>

The mlx4_en parameters can be found under /sys/module/mlx4_en (or /sys/module/mlx4_en/parameters, depending on the OS) and can be listed using the command:
    > modinfo mlx4_en
To set non-default values for module parameters, add the following line to the file /etc/modprobe.conf:
    options mlx4_en <param_name>=<value> <param_name>=<value> ...

Fibre Channel over Ethernet
Overview
The FCoE feature provided by Mellanox BXOFED allows connecting to Fibre Channel (FC) targets on an FC fabric using an FCoE-capable switch or gateway. Key features include:
• T11 and pre-T11 frame format
• Complete hardware offload of SCSI operations in pre-T11 format
• Hardware offload of FC-CRC calculations in pre-T11 format
• Zero copy FC stack in pre-T11 format
• VLANs and PFC (Priority flow control, that is, PPP)

The FCoE feature is based on and interacts
146. hop length from any switch in the subnet, a statistical histogram is built for each switch (hop num vs. number of occurrences). If the histogram reflects a specific column (higher than others) for a certain node, then it is marked as a root node. Since the algorithm is statistical, it may not find any root nodes. The list of the root nodes found by this auto-detect stage is used by the ranking process stage.
   Note: The user can override the node list manually.
   Note: If this stage cannot find any root nodes, and the user did not specify a guid list file, OpenSM defaults back to the Min Hop routing algorithm.
2. Ranking process - All root switch nodes (found in stage 1) are assigned a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the subnet are ranked incrementally. This ranking aids in the process of enforcing rules that ensure loop-free paths.
3. Min Hop Table setting - After ranking is done, a BFS algorithm is run from each (CA or switch) node in the subnet. During the BFS process, the FDB table of each switch node traversed by BFS is updated, in reference to the starting node, based on the ranking rules and guid values.

At the end of the process, the updated FDB tables ensure loop-free paths through the subnet.

Note: Up/Down routing does not allow LID routing communication between switches that are located inside spine "switch systems". The reason is that there is no way to allow a LID route between them that does not b
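The ranking stage described above (roots get rank 0, every other switch is ranked incrementally by BFS) can be sketched as follows. This is a minimal illustration, not OpenSM's code: the adjacency dictionary, the switch names, and the rank_switches helper are all hypothetical.

```python
from collections import deque


def rank_switches(adjacency, roots):
    """Assign rank 0 to root switches and BFS distance to all others."""
    rank = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in rank:
                # first visit in BFS order gives the smallest hop count
                rank[neighbor] = rank[node] + 1
                queue.append(neighbor)
    return rank


if __name__ == "__main__":
    # Hypothetical four-switch fabric, for illustration only.
    fabric = {
        "spine": ["leaf1", "leaf2"],
        "leaf1": ["spine", "edge"],
        "leaf2": ["spine"],
        "edge": ["leaf1"],
    }
    print(rank_switches(fabric, ["spine"]))
```

With ranks in place, a hop toward a lower rank can be treated as "up" and a hop toward a higher rank as "down", which is the direction information the later table-setting stage relies on.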
147. iBand messages convey a client identifier field used to identify the DHCP session. This client identifier field can be used to associate an IP address with a client identifier value, such that the DHCP server will grant the same IP address to any client that conveys this client identifier.

Note: Refer to the DHCP documentation for more details on how to make this association.

The length of the client identifier field is not fixed in the specification. For BXOFED, it is recommended to have IPoIB use the same format that Boot over IB uses for this client identifier - see Section A.3.1, "Configuring the DHCP Server," on page 160.

4.3.1.1 DHCP Server
In order for the DHCP server to provide configuration records for clients, an appropriate configuration file needs to be created. By default, the DHCP server looks for a configuration file called dhcpd.conf under /etc. You can either edit this file, or create a new one and provide its full path to the DHCP server using the -cf flag. See a file example at docs/dhcpd.conf of this package.

The DHCP server must run on a machine which has loaded the IPoIB module.

To run the DHCP server from the command line, enter:
    dhcpd <IB network interface name> -d
Example:
    host1# dhcpd ib0 -d

4.3.1.2 DHCP Client (Optional)
Note: A DHCP client can be used if you need to prepare a diskless machine w
148. ib-dev=<device>
-i <port>, --ib-port=<port> | Optional | All device ports | Query the specified device port <port>
-l, --list | Optional | Inactive | Only list the names of InfiniBand devices
-v, --verbose | Optional | Inactive | Print all available information about the InfiniBand device(s)

Examples
1. List the names of all available InfiniBand devices.
    > ibv_devinfo -l
    2 HCAs found:
        mthca0
        mlx4_0
2. Query the device mlx4_0 and print user-available information for its Port 2.
    > ibv_devinfo -d mlx4_0 -i 2
    hca_id: mlx4_0
        fw_ver:             2.5.944
        node_guid:          0000:0000:0007:3895
        sys_image_guid:     0000:0000:0007:3898
        vendor_id:          0x02c9
        vendor_part_id:     25418
        hw_ver:             0xA0
        board_id:           MT_04A0140005
        phys_port_cnt:      2
            port: 2
                state:      PORT_ACTIVE (4)
                max_mtu:    2048 (4)
                active_mtu: 2048 (4)
                sm_lid:     1
                port_lid:   1
                port_lmc:   0x00

12.7 ibstatus
Applicable Hardware
All InfiniBand devices.

Description
Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.

Synopsis
    ibstatus [-h] [<device name>[:<port>]]

Table 10 lists the various flags of the command.

Table 10 - ibstatus Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
149. ibed below will increase the ability of Linux to transmit and receive data.
• Generally, if you increase the MTU (maximum transmission unit, in bytes), you get better performance. The following MTUs are suggested (use ifconfig to modify the MTU):
    • IPoIB: 2044 bytes
    • IPoIB-CM: 64K bytes
• When IPoIB is configured to run in connected mode, TCP parameter tuning is performed at driver startup to improve the throughput of medium and large messages. The driver startup scripts set the following TCP parameters:
  Note: The following settings should not be applied when running in datagram mode, as they degrade the performance.
    net.ipv4.tcp_timestamps=0
    net.ipv4.tcp_sack=0
    net.core.netdev_max_backlog=250000
    net.core.rmem_max=16777216
    net.core.wmem_max=16777216
    net.core.rmem_default=16777216
    net.core.wmem_default=16777216
    net.core.optmem_max=16777216
    net.ipv4.tcp_mem="16777216 16777216 16777216"
    net.ipv4.tcp_rmem="4096 87380 16777216"
    net.ipv4.tcp_wmem="4096 65536 16777216"

If you change the IPoIB run mode to datagram while the driver is running, the tuned parameters are not restored to the default values suitable for datagram mode. It is recommended to change the IPoIB mode only while the driver is down (by changing the line "SET_IPOIB_CM=yes" to "SET_IPOIB_CM=no" in the file /etc/infiniband/openib.conf, and then restarting the driver).
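For reference, the connected-mode parameters listed above can be kept in one table and rendered in sysctl key = value form, for example to compare against a running system. The sketch below is only an illustration: the dictionary mirrors the values quoted in this section, and render_sysctl is a hypothetical helper that formats text and changes no system settings.

```python
# Values as listed for IPoIB connected mode in this section.
CONNECTED_MODE_TUNING = {
    "net.ipv4.tcp_timestamps": "0",
    "net.ipv4.tcp_sack": "0",
    "net.core.netdev_max_backlog": "250000",
    "net.core.rmem_max": "16777216",
    "net.core.wmem_max": "16777216",
    "net.core.rmem_default": "16777216",
    "net.core.wmem_default": "16777216",
    "net.core.optmem_max": "16777216",
    "net.ipv4.tcp_mem": "16777216 16777216 16777216",
    "net.ipv4.tcp_rmem": "4096 87380 16777216",
    "net.ipv4.tcp_wmem": "4096 65536 16777216",
}


def render_sysctl(settings):
    """Render settings as 'key = value' lines, one per parameter."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(render_sysctl(CONNECTED_MODE_TUNING))
```

As the note above says, these values are intended for connected mode only and should not be applied when running in datagram mode.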
150. ibes the format of the configuration file. The default path is /etc/opensm/prefix-routes.conf.

This option enables QoS setup. It is disabled by default.

--qos_policy_file <file name>
    This option defines the optional QoS policy file. The default name is /etc/opensm/qos-policy.conf.
--no_part_enforce
    This option disables partition enforcement on switch external ports.
--stay_on_fatal
    This option will cause SM not to exit on fatal initialization issues: if SM discovers duplicated guids or a 12x link with lane reversal badly configured. By default, the SM will exit on these errors.
-B, --daemon
    Run in daemon mode - OpenSM will run in the background.
-I, --inactive
    Start SM in inactive rather than init SM state. This option can be used in conjunction with the perfmgr so as to run a standalone performance manager without SM/SA. However, this is NOT currently implemented in the performance manager.
--perfmgr
    Enable the perfmgr. Only takes effect if --enable-perfmgr was specified at configure time.
--perfmgr_sweep_time_s <seconds>
    Specify the sweep time for the performance manager in seconds (default is 180 seconds). Only takes effect if --enable-perfmgr was specified at configure time.
--consolidate_ipv6_snm_req
    Consolidate IPv6 Solicited Node Multicast group join requests into one multicast group per MGID
151. ications Magazine, Vol. 44, No. 7, July 2006
• "Layered Shortest Path (LASH) Routing in Irregular System Area Networks", Skeie et al., IEEE Computer Society Communication Architecture for Clusters 2002

11.5.8 Modular Routing Engine
The modular routing engine structure makes it easy to plug in new routing modules. Currently, only unicast callbacks are supported. Multicast can be added later.

One existing routing module is up-down (updn), which may be activated with the -R updn option (instead of the old -u).

General usage is:
    host1# opensm -R 'module-name'

There is also a trivial routing module which is able to load LFT tables from a dump file. Main features are:
• This will load switch LFTs and/or LID matrices (min hops tables)
• This will load switch LFTs according to the path entries introduced in the dump file
• No additional checks will be performed (such as "is port connected", etc.)
• If fabric LIDs were changed, this will try to reconstruct LFTs correctly if endport GUIDs are represented in the dump file (to disable this, GUIDs may be removed from the dump file or zeroed)

The dump file format is compatible with the output of the ibroute utility and, for a whole fabric, can be generated with the dump_lfts.sh script.

To activate the file-based routing module, use:
    host1# opensm -R file -U /path/to/dump_file
If the dump
152. in the SA class port info. This approach provides an easy migration path for existing access layer and ULPs by not introducing a new set of PR/MPR attributes.

Supported Policy

The QoS policy, which is specified in a stand-alone file, is divided into the following four subsections:

I. Port Group
A set of CAs, Routers, or Switches that share the same settings. A port group might be a partition defined by the partition manager policy, a list of GUIDs, or a list of port names based on NodeDescription.

II. Fabric Setup
Defines how the SL2VL and VLArb tables should be set up.
Note: In BXOFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).

III. QoS Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and, optionally, Max MTU, Max Rate, Packet Lifetime, and Path Bits.
Note: Path Bits are not implemented in BXOFED.

IV. Matching Rules
A list of rules that match an incoming PR/MPR request to a QoS Level. The rules are processed in order, so that the first match is applied. Each rule is built out of a set of match expressions which should all match for the rule to apply. The matching expressions are defined for the following fields:
• SRC and DST to lists of port grou
153. instead, it controls a connection to an IO controller (known as the SRP Target) to provide access to remote storage devices across an InfiniBand fabric. The SRP Target resides in an IO unit and provides storage services. See Chapter 8, SRP.

MPI
Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks. Mellanox BXOFED includes the following MPI implementations over InfiniBand:
• Open MPI: an open source MPI-2 implementation by the Open MPI Project
• OSU MVAPICH: an MPI-1 implementation by Ohio State University
Mellanox BXOFED also includes MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta.

InfiniBand Subnet Manager
All InfiniBand-compliant ULPs require the proper operation of a Subnet Manager (SM) running on the InfiniBand fabric at all times. An SM can run on any node or on an IB switch. OpenSM is an InfiniBand-compliant Subnet Manager, and it is installed as part of BXOFED. See Chapter 11, OpenSM Subnet Manager.

Diagnostic Utilities
Mellanox BXOFED includes the following two diagnostic packages for use by network and data center managers:
• ibutils: Mellanox Technologies diagnostic utilities
• infiniband-diags: OpenFabrics Alliance InfiniBand diagnostic tools

Performance Utilities
A collection of tests written over uverbs intended for use as a perform
154. ions mlx4_en parameter=<value> and/or options mlx4_fc parameter=<value>

The following sections list the available mlx4 parameters.

E.1 mlx4_core Parameters

set_4k_mtu          Attempt to set 4K MTU to all ConnectX ports (int)
msi_x               Attempt to use MSI-X if nonzero (default 1)
enable_qos          Enable Quality of Service support in the HCA if > 0 (default 0)
block_loopback      Block multicast loopback packets if > 0 (default 1)
internal_err_reset  Reset device on internal errors if non-zero (default 1)
debug_level         Enable debug tracing if > 0 (default 0)
log_num_qp          log maximum number of QPs per HCA (default is 17; max is 20)
log_num_srq         log maximum number of SRQs per HCA (default is 16; max is 20)
log_rdmarc_per_qp   log number of RDMARC buffers per QP (default is 4; max is 7)
log_num_cq          log maximum number of CQs per HCA (default is 16; max is 19)
log_num_mcg         log maximum number of multicast groups per HCA (default is 13; max is 21)
log_num_mpt         log maximum number of memory protection table entries per HCA (default is 17; max is 20)
log_num_mtt         log maximum number of memory translation table segments per HCA (default is 20; max is 20)
log_num_mac         log maximum number of MACs per ETH port (1-7) (int)
log_num_vlan        log maximum number of VLANs per ETH port (0-7) (int)
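The "log" parameters in the list above are base-2 logarithms, so the resource count is 2 raised to the parameter value. The following Python sketch works through that arithmetic for a few of the defaults. The helper name resource_count is hypothetical, and the parameter names in the dictionary are my reading of the flattened parameter column, so treat both as assumptions rather than as part of the driver.

```python
# Hypothetical helper (not part of the mlx4 driver): turn a log2-style
# module parameter into the number of resources it allows.
def resource_count(log_value: int, log_max: int) -> int:
    """Return 2**log_value after checking it against the documented maximum."""
    if not 0 <= log_value <= log_max:
        raise ValueError(f"log value {log_value} outside 0..{log_max}")
    return 1 << log_value


# (default, max) pairs taken from the parameter list above; the parameter
# names are an assumption reconstructed from the garbled column.
DEFAULTS = {
    "log_num_qp": (17, 20),
    "log_num_srq": (16, 20),
    "log_num_cq": (16, 19),
}

if __name__ == "__main__":
    for name, (default, maximum) in DEFAULTS.items():
        print(name, resource_count(default, maximum))
```

With the documented default of 17, for example, the QP limit works out to 2^17 QPs per HCA.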
155. ions supported by VPI.

Table 5 - Supported ConnectX Port Configurations

Port 1 Configuration    Port 2 Configuration
ib                      ib
ib                      eth
eth                     eth

Note that the configuration Port1 = eth and Port2 = ib is not supported. Also note that FCoE can run only on a port configured as eth, and the mlx4_en driver must be loaded.

The port link type can be configured for each device in the system at run time using the /sbin/connectx_port_config script. This utility will prompt for the PCI device to be modified (if there is only one, it will be selected automatically). In the next stage, the user will be prompted for the desired mode for each port. The desired port configuration will then be set for the selected device.

Footnote 1: In this document, ConnectX will be used to indicate also ConnectX-2 devices.

Note: This utility also has a non-interactive mode:

    /sbin/connectx_port_config -d|--device <PCI device ID> -c|--conf <port1,port2>

3.2 InfiniBand Driver

The InfiniBand driver, mlx4_ib, handles InfiniBand-specific functions and plugs into the InfiniBand midlayer.

3.3 Ethernet Driver

3.3.1 Overview

The Ethernet driver, mlx4_en, exposes the following ConnectX capabilities:
• Single/Dual port
• Fibre Channel over Ethernet (FCoE)
• Up to 16 Rx queues per port
• Rx steering m
156. is performed on Port 1 of the device <device>. The output value should be 0 if no symbol errors were recorded.

Bandwidth is expected to vary between systems. It heavily depends on the chipset, memory, and CPU. Nevertheless, the full wire speed should be achieved by the host.
• With IB SDR, the expected unidirectional full wire speed bandwidth is 900MB/sec.
• With IB DDR and PCI Express Gen 1, the expected unidirectional full wire speed bandwidth is 1400MB/sec. (See Section B.1.)
• With IB DDR and PCI Express Gen 2, the expected unidirectional full wire speed bandwidth is 1800MB/sec. (See Section B.1.)
• With IB QDR and PCI Express Gen 2, the expected unidirectional full wire speed bandwidth is 3000MB/sec. (See Section B.1.)

To check the adapter's maximum bandwidth, use the ib_write_bw utility. To check the adapter's latency, use the ib_write_lat utility.

Note: The utilities ib_write_bw and ib_write_lat are installed as part of Mellanox BXOFED.

Appendix C: ULP Performance Tuning

C.1 IPoIB Performance Tuning

This section provides tuning guidelines for TCP stack configuration parameters in order to boost IPoIB and IPoIB-CM performance. Without tuning the parameters, the default Linux configuration may significantly limit the total available bandwidth below the actual capabilities of the adapter card. The parameter settings descr
157. ith an IB driver. See Step 13 under "Example: Adding an IB Driver to initrd (Linux)".

In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier. Then run the DHCP client with this file using the following command:

    dhclient -cf <client conf file> <IB network interface name>

Example of a configuration file for the ConnectX (PCI Device ID 25418), called dhclient.conf. The value indicates a hexadecimal number.

    interface "ib1" {
        send dhcp-client-identifier 00:02:c9:03:00:00:10:39;
    }

Example of a configuration file for InfiniHost III Ex (PCI Device ID 25218), called dhclient.conf. The value indicates a hexadecimal number.

    interface "ib1" {
        send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
    }

In order to use the configuration file, run:

    host1# dhclient -cf dhclient.conf ib1

Static IPoIB Configuration

If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the -n option) containing the full IP configuration. The IPoIB configuration file can specify either or both of the following data for an IPoIB interface:
• A static IPoIB configuration
• An IPoIB configuration based on an Ethernet configuration

Note: See your Linux distribution documentation for additional information about configuring IP
158. ivalent to the previous example of the shortest policy file, and it is also equivalent to not having a policy file at all.

Below is an example of a simple QoS policy with all the possible keywords:

    qos-ulps
        default                              : 0 # default SL
        sdp, port-num 30000                  : 0 # SL for application running on top of SDP when a destination TCP/IP port is 30000
        sdp, port-num 10000-20000            : 0
        sdp                                  : 1 # default SL for any other application running on top of SDP
        rds                                  : 2 # SL for RDS traffic
        iser, port-num 900                   : 0 # SL for iSER with a specific target port
        iser                                 : 3 # default SL for iSER
        ipoib, pkey 0x0001                   : 0 # SL for IPoIB on partition with pkey 0x0001
        ipoib                                : 4 # default IPoIB partition, pkey=0x7FFF
        any, service-id 0x6234               : 6 # match any PR/MPR query with a specific Service ID
        any, pkey 0x0ABC                     : 6 # match any PR/MPR query with a specific PKey
        srp, target-port-guid 0x1234         : 5 # SRP when SRP Target is located on a specified IB port GUID
        any, target-port-guid 0x0ABC-0xFFFFF : 6 # match any PR/MPR query with a specific target port GUID
    end-qos-ulps

Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, so that the first match takes precedence, except for
159. kernel version does not match any of the offered pre-built RPMs, you can add your kernel version by using the mlnx_add_kernel_support.sh script located under the docs/ directory.

Usage:

    mlnx_add_kernel_support.sh -i|--iso <mlnx iso> [-t|--tmpdir <local work dir>] [-v|--verbose]

Example: The following command will create a MLNX_OFED_LINUX ISO image for RedHat 5.2 under the /tmp directory.

    MLNX_OFED_LINUX-1.4-rhel5.2/docs/mlnx_add_kernel_support.sh -i /mnt/MLNX_OFED_LINUX-1.4-rhel5.2.iso
    All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
    Do you want to continue? [y/N]: y
    Removing OFED RPMs...
    Running mkisofs...
    Created /tmp/MLNX_OFED_LINUX-1.4-rhel5.2.iso

Installation Script

BXOFED includes an installation script called install.pl. Its usage is described below. You will use it during the installation procedure described in Section 2.3, "Installing BXOFED," on page 20.

Usage:

    ./install.pl [OPTIONS]

Note: If no options are provided to the script, then all available RPMs are installed.

Options:

-c|--config <packages config file>
    Example of the configuration file can be found under docs.
-n|--net <network config file>
    Example of the network configuration file can be found under docs.
-p|--print-available
    Print available packages for the current platform and create a corresponding ofed co
160. key: 0x0F00-0x0FFF
            qos-level-name: WholeSet
        end-qos-match-rule
    end-qos-match-rules

11.6.6 Simple QoS Policy: Details and Examples

Simple QoS policy match rules are tailored for matching ULPs (or some application on top of a ULP) PR/MPR requests. This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.

Match rules include:
• Default match rule that is applied to a PR/MPR query that didn't match any of the other match rules
• SDP
• SDP application with a specific target TCP/IP port range
• SRP with a specific target IB port GUID
• RDS
• iSER
• iSER application with a specific target TCP/IP port range
• IPoIB with a default PKey
• IPoIB with a specific PKey
• Any ULP/application with a specific Service ID in the PR/MPR query
• Any ULP/application with a specific PKey in the PR/MPR query
• Any ULP/application with a specific target IB port GUID in the PR/MPR query

Since any section of the policy file is optional, as long as the basic rules of the file are kept (such as not referring to a nonexistent port group, having a default QoS Level, etc.), the simple policy section (qos-ulps) can serve as a complete QoS policy file. The shortest policy file in this case would be as follows:

    qos-ulps
        default : 0 # default SL
    end-qos-ulps

It is equ
161. ks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop.
b. First prefer the ones that go to a different systemImageGuid than the previous LID of the same LMC group.
c. If none, prefer those which go through another NodeGuid.
d. Fall back to the number of paths method (if all go to the same node).

11.5.1 Effect of Topology Changes

OpenSM will preserve existing routing in any case where there is no change in the fabric switches, unless the -r (--reassign_lids) option is specified.

-r, --reassign_lids
    This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.

If a link is added or removed, OpenSM does not recalculate the routes that do not have to change. A route has to change if the port is no longer UP or no longer the MinHop. When routing changes are performed, the same algorithm for balancing the routes is invoked.

In the case of using the file-based routing, any topology changes are currently ignored. The file routing engine just loads the LFTs from the file specified, with no reaction to real topology. Obviously, this will not be able to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non exist
162. PortGUID
    GUID of partition member EndPort. Hexadecimal numbers should start from 0x; decimal numbers are accepted too.
full or limited
    Indicates full or limited membership for this port. When omitted (or unrecognized), limited membership is assumed.

There are two useful keywords for PortGUID definition:
• ALL means all end ports in this subnet.
• SELF means the subnet manager's port.
An empty list means that there are no ports in this partition.

Notes:
• White space is permitted between delimiters.
• The line can be wrapped after after a Partition Definition and between.
• A PartitionName does not need to be unique, but PKey does need to be unique.
• If a PKey is repeated, then the associated partition configurations will be merged and the first PartitionName will be used (see also next note).
• It is possible to split a partition configuration into more than one definition, but then the PKey should be explicitly specified (otherwise, different PKey values will be generated for those definitions).

Examples:

    Default=0x7fff : ALL, SELF=full ;
    NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306 ;
    YetAnotherOne = 0x300 : SELF=full ;
    YetAnotherOne = 0x300 : ALL=limited ;
    ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
    # 0x123453, 0x123454 will be limited
    ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
163. l Return Codes

Table 4 lists the install script return codes and their meanings.

Table 4 - mlnxofedinstall Return Codes

Return Code    Meaning
0              The installation ended successfully
1              The installation failed
2              No firmware was found for the adapter device
3              Failed to start the mst driver

2.4 Uninstalling BXOFED

Use the script /usr/sbin/uninstall.sh to uninstall the BXOFED package. The script is part of the ofed-scripts RPM.

3 Working With VPI

VPI allows ConnectX (ConnectX-2) ports to be independently configured as either IB or Eth. If a ConnectX port is configured as Eth, it may also function as a Fibre Channel HBA.

3.1 Port Type Management

ConnectX ports can be individually configured to work as InfiniBand or Ethernet or Fibre Channel over Ethernet ports. By default, both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded.

Running "/sbin/connectx_port_config -s" will show the current port configuration for all ConnectX devices.

Port configuration is saved in the file /etc/infiniband/connectx.conf. This saved configuration is restored at driver restart only if restarting via "/etc/init.d/openibd restart".

Possible port types are:
• eth: Ethernet
• ib: InfiniBand

Table 5 lists the ConnectX port configurat
164. l output information on each SRP Target detected, in human-readable form.

Sample output:

    IO Unit Info:
        port LID:        0103
        port GID:        fe800000000000000002c90200402bd5
        change ID:       0002
        max controllers: 0x10
    controller[ 1]
        GUID:      0002c90200402bd4
        vendor ID: 0002c9
        device ID: 005a44
        IO class : 0100
        ID:        LSI Storage Systems SRP Driver 200400a0b81146a1
        service entries: 1
            service[ 0]: 200400a0b81146a1 / SRP.T10:200400A0B81146A1

b. To detect all the SRP Targets reachable by the SRP Initiator via another umad device, use the following command:

    ibsrpdm -d <umad device>

2. Assistance in creating an SRP connection

a. To generate output suitable for utilization in the "echo" command of Section 8.2.2, add the "-c" option to ibsrpdm:

    ibsrpdm -c

Sample output:

    id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1

b. To establish a connection with an SRP Target using the output from the "ibsrpdm -c" example above, execute the following command:

    echo -n id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

The SRP connection should now be up; the newly created SCSI devices should appear in the listing obtained from the "fdisk -l" command.
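The target description written to add_target is a single line of key/value fields. The Python sketch below assembles that line from its parts using the sample values from the excerpt above. The function name srp_target_string is hypothetical, and the comma-separated key=value layout is an assumption, since the separators were lost when this page was extracted; the sketch only builds the string and does not write to sysfs.

```python
# Hypothetical helper: assemble the one-line target description that the
# excerpt above echoes into the add_target file. The "key=value,key=value"
# layout is an assumption (the separators are missing in the extracted text).
def srp_target_string(id_ext: str, ioc_guid: str, dgid: str,
                      service_id: str, pkey: str = "ffff") -> str:
    fields = [
        ("id_ext", id_ext),
        ("ioc_guid", ioc_guid),
        ("dgid", dgid),
        ("pkey", pkey),
        ("service_id", service_id),
    ]
    return ",".join(f"{key}={value}" for key, value in fields)


if __name__ == "__main__":
    # Values copied from the sample ibsrpdm output above.
    print(srp_target_string(
        id_ext="200400A0B81146A1",
        ioc_guid="0002c90200402bd4",
        dgid="fe800000000000000002c90200402bd5",
        service_id="200400a0b81146a1",
    ))
```

Keeping the fields in an ordered list rather than a free-form string makes it harder to drop or misorder a field when a target has to be added by hand.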
165. lack listed

Automatic Activation of High Availability
• Set the value of SRPHA_ENABLE in /etc/infiniband/openib.conf to "yes".
  Note: For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart
• From the next loading of the driver, it will be possible to access the SRP LUNs on /dev/mapper/.
  Note: It is possible that regular (not SRP) LUNs may also be present; the SRP LUNs may be identified by their name.
• It is possible to see the output of the SRP daemon in /var/log/srp_daemon.log.

8.2.7 Shutting Down SRP

SRP can be shut down by using "rmmod ib_srp", or by stopping the BXOFED driver ("/etc/init.d/openibd stop"), or as a by-product of a complete system shutdown.

Prior to shutting down SRP, remove all references to it. The actions you need to take depend on the way SRP was loaded. There are three cases:

1. Without High Availability
   When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.

2. After Manual Activation of High Availability
   If you manually activated SRP High Availability, perform the following steps:
   a. Unmount all SRP partitions that were mounted.
   b. Kill the SRP daemon instances.
   c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them.
   d. Run: multipath -F

3. After Autom
166. If MLNX IB gPXE was selected through BIOS setup, the client will boot from BoIB. The client will display BoIB attributes and wait for IB port configuration by the subnet manager.

For ConnectX:
[Screen capture: Mellanox ConnectX Boot over IB v2, gPXE 0.9.6 Open Source Boot Firmware; net0: 00:02:c9:00:01:77:70:51 on PCI02:00.0 (open); link up on net0 ... ok]

For InfiniHost III Ex:
[Screen capture: Mellanox Boot over IB for InfiniHost III Ex ver 1.0.0; Loading via IB Port 2; Waiting for Infiniband link up ... ok]

After configuring the IB port, the client attempts connecting to the DHCP server to obtain an IP address and the source location of the kernel/OS to boot from.

[Screen captures: for ConnectX, Boot over IB v2 waits for link-up on net0 and then runs DHCP on net0, obtaining 11.4.3.130; for InfiniHost III Ex, DHCP on net0 with hardware address 00550401 fe800000 00000000 0002c902 00231392 ... ok, obtaining 11.4.3.130]

Next, BoIB attempts to boot as directed by the DHCP server.

A.8 Diskless Machines

Mellanox Boot over IB supports booting diskless machines. To enable using an IB driver, the remote kernel or init
167. EoIB

6.1 Introduction

The Ethernet over IB (EoIB) mlx4_vnic module is a network interface implementation over InfiniBand. EoIB encapsulates Layer 2 datagrams over an InfiniBand Datagram (UD) transport service. The InfiniBand UD datagrams encapsulate the entire Ethernet L2 datagram and its payload.

Figure 2 - EoIB in the OSI Model [figure: layer stack Application, Transport, Internet, Link; EoIB sits at the Link layer]

In order to perform this operation, the module performs an address translation from Ethernet layer 2 MAC addresses (48 bits long) to InfiniBand layer 2 addresses made of LID/GID and QPN. This translation is done by the module and is totally invisible to the OS and user. In this regard, EoIB differs from IPoIB, which exposes a 20-byte HW address to the OS.

The mlx4_vnic module is designed for Mellanox's ConnectX family of HCAs and is intended to be used with Mellanox's BridgeX gateway family. Having a BridgeX gateway is a requirement for using EoIB. It performs two operations:
• Enables the layer 2 address translation required by the mlx4_vnic module.
• Enables routing of packets from the InfiniBand fabric to a 1 or 10 GigE Ethernet subnet.

6.2 EoIB Topology

EoIB is designed to work over an InfiniBand fabric. In order for it to work properly, it requires the presence of two entities:
• Subnet Manager (SM)
• BridgeX gateway

The required subnet manager configuration is similar to that of other InfiniB
168. le /etc/infiniband/openib.conf and set SDP_LOAD=yes.

Note: For the changes to take effect, run: /etc/init.d/openibd restart

SDP shares the same IP addresses and interface names as IPoIB. See IPoIB configuring in Section 4.3 and Section 4.3.3.

How to Know SDP Is Working

Since SDP is a transparent TCP replacement, it can sometimes be difficult to know that it is working correctly. The sdpnetstat program can be used to verify both that SDP is loaded and that it is being used:

    host1# sdpnetstat -S

This command shows all active SDP sockets using the same format as the traditional netstat program. Without the -S option, it shows all the information that netstat does plus SDP data.

Assuming that the SDP kernel module is loaded and is being used, the output of the command will be as follows:

    host1# sdpnetstat -S
    Proto Recv-Q Send-Q Local Address        Foreign Address
    sdp   0      0      193.168.10.144:34216 193.168.10.125:12865
    sdp   0      884720 193.168.10.144:42724 193.168.10.:filenet-rmi

The example output above shows two active SDP sockets and contains details about the connections.

If the SDP kernel module is not loaded, then the output of the command will be something like the following:

    host1# sdpnetstat -S
    Proto Recv-Q Send-Q Local Address Foreign Address
    netstat: no support for AF INET (tcp) on this system.

To verify whether the module is loaded or not, you can use the lsmod command: h
169. led using a different initiator_ext value for each SRP connection. The initiator_ext value is a 16-hexadecimal-digit value specified in the connection command.

Also, in case of two physical connections (i.e., network paths) from a single initiator IB port to two different IB ports on the same Target HCA, there is a need for a different initiator_ext value on each path. The convention is to use the Target port GUID as the initiator_ext value for the relevant path. If you use srp_daemon with the -n flag, it automatically assigns initiator_ext values according to this convention. For example:

    id_ext=200500A0B81146A1,ioc_guid=0002c90200402bec,dgid=fe800000000000000002c90200402bed,pkey=ffff,service_id=200500a0b81146a1,initiator_ext=ed2b400002c90200

Notes:
1. It is recommended to use the -n flag for all srp_daemon invocations.
2. ibsrpdm does not have a corresponding option.
3. srp_daemon.sh always uses the -n option (whether invoked manually by the user, or automatically at startup by setting SRPHA_ENABLE to yes).

8.2.6 High Availability (HA)

Overview

High Availability works using the Device-Mapper (DM) multipath and the SRP daemon. Each initiator is connected to the same target from several ports/HCAs. The DM multipath is responsible for joining together different paths to the same target and for fail-over between paths when one of them goes offli
170. les will be installed under:
    /etc/sysconfig/network-scripts/ on a RedHat machine
    /etc/sysconfig/network/ on a SuSE machine

DHCP Server

No special configuration is needed to use a DHCP server with EoIB. The DHCP server can run on a server located on the Ethernet side (using any Ethernet HW) or on a server located on the InfiniBand side and running the EoIB module.

Static EoIB Configuration

If you wish, you can use an EoIB configuration that is not based on DHCP. Static configuration is performed in a similar fashion to a typical Ethernet device. See your Linux distribution documentation for additional information about configuring IP addresses.

Note: Ethernet configuration files are located at:
    /etc/sysconfig/network-scripts/ on a RedHat machine
    /etc/sysconfig/network/ on a SuSE machine

Sub Interfaces (VLAN)

EoIB interfaces do not support creating sub-interfaces via the vconfig command. In order to create interfaces with VLAN, refer to Section 6.3.5, VLAN Configuration.

Retrieving EoIB Information (mlx4_vnic_info)

To retrieve information about EoIB interfaces, use the script mlx4_vnic_info. This script gives detailed information about a specific vNic or all EoIB vNic interfaces. If network-administered vNics are enabled, this script can also be used to discover the available BridgeXs from the host side. To do this, simply run:

    mlx4_vnic_info | grep SYSTEM_GUID
or
    mlx4_vnic_info | grep SYSTEM_NAME
171. llowing procedure, ib0 is used as an example of an IB subinterface.

Step 1. Decide on the PKey to be used in the subnet. Valid values are 0-255. The actual PKey used is a 16-bit number with the most significant bit set. For example, a value of 0 will give a PKey with the value 0x8000.

Step 2. Create a child interface by running:

    host1# echo <PKey> > /sys/class/net/<IB subinterface>/create_child

Example:

    host1# echo 0 > /sys/class/net/ib0/create_child

This will create the interface ib0.8000.

Step 3. Verify the configuration of this interface by running:

    host1# ifconfig <subinterface>.<subinterface PKey>

Using the example of Step 2:

    host1# ifconfig ib0.8000
    ib0.8000 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
             BROADCAST MULTICAST MTU:2044 Metric:1
             RX packets:0 errors:0 dropped:0 overruns:0 frame:0
             TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
             collisions:0 txqueuelen:128
             RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)

Step 4. As can be seen, the interface does not have IP or network addresses. To configure those, you should follow the manual configuration procedure described in Section 4.3.3.

Step 5. To be able to use this interface, a configuration of the Subnet Manager is needed so that the PKey chosen, which defines a broadcast address, be recognized (see Chapter 11, Op
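The PKey arithmetic in the procedure above (a chosen value with the most significant bit of a 16-bit number set) can be sketched in Python as follows. The function names child_pkey and child_interface_name are hypothetical, the 0-255 range is my reading of the flattened "0 255" in the text, and the "parent.pkey" naming simply mirrors the ib0.8000 example, so treat these as assumptions rather than a description of the driver.

```python
# Hypothetical helpers illustrating the PKey rule described above; they do
# not touch /sys and are not part of IPoIB.
def child_pkey(value: int) -> int:
    """Return the 16-bit PKey for a chosen value, with the top bit set."""
    if not 0 <= value <= 255:  # assumed valid range, per the text above
        raise ValueError("PKey value must be in the range 0-255")
    return 0x8000 | value


def child_interface_name(parent: str, value: int) -> str:
    """Name of the child interface, following the ib0.8000 example."""
    return f"{parent}.{child_pkey(value):04x}"


if __name__ == "__main__":
    print(hex(child_pkey(0)))
    print(child_interface_name("ib0", 0))
```

Computing the name up front is convenient when scripting the verification step, because the interface to pass to ifconfig is derived from the same value that was written to create_child.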
172. lt /var/cache/opensm/opensm.opts).
    end-qos-setup

    qos-levels

        # Having a QoS Level named "DEFAULT" is a must: it is applied to
        # PR/MPR requests that didn't match any of the matching rules.
        qos-level
            name: DEFAULT
            use: default QoS Level
            sl: 0
        end-qos-level

        # the whole set: SL, MTU-Limit, Rate-Limit, PKey, Packet Lifetime
        qos-level
            name: WholeSet
            sl: 1
            mtu-limit: 4
            rate-limit: 5
            pkey: 0x1234
            packet-life: 8
        end-qos-level

    end-qos-levels

    # Match rules are scanned in order of their appearance in the policy file.
    # First matched rule takes precedence.
    qos-match-rules

        # matching by single criteria: QoS class
        qos-match-rule
            use: by QoS class
            qos-class: 7-9,11
            # Name of qos-level to apply to the matching PR/MPR
            qos-level-name: WholeSet
        end-qos-match-rule

        # show matching by destination group and service id
        qos-match-rule
            use: Storage targets
            destination: Storage
            service-id: 0x10000000000001, 0x10000000000008-0x10000000000FFF
            qos-level-name: WholeSet
        end-qos-match-rule

        qos-match-rule
            source: Storage
            use: match by source group only
            qos-level-name: DEFAULT
        end-qos-match-rule

        qos-match-rule
            use: match by all parameters
            qos-class: 7-9,11
            source: Virtual-Servers
            destination: Storage
            service-id: 0x0000000000010000-0x000000000001FFFF
            p
173. mance without sending IOs to real devices.

Prerequisites

1. Supported distributions: RHEL 5, 5.1, 5.2, SLES 10 SP1, and vanilla kernels > 2.6.16.

Note: On distribution default kernels, you can run scst_vdisk blockio mode to obtain good performance. You can also run scst_disk, i.e., scsi pass-thru mode; however, you have to compile scst with -DSTRICT_SERIALIZING enabled, and this does not yield good performance. It is required to recompile the kernel to have good performance with scst_disk.

2. Download and install the SCST driver.

a. Download scst-1.0.0.tar.gz from http://scst.sourceforge.net/downloads.html. If your distribution is RHEL 5.2, please go to step <e>.

b. Untar and install scst-1.0.0:

    tar zxvf scst-1.0.0.tar.gz
    cd scst-1.0.0
    make && make install

c. Save the following patch as /tmp/scsi_tgt.patch:

    scsi_tgt.h 2008-07-20 14:25:30.000000000 -0700
    scsi_tgt.h 2008-07-20 14:25:09.000000000 -0700
    42,7 42,9
    RR
    endif
    if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 19)
    typedef _Bool bool;
    FEY
    define true 1
    define false 0
    endif
    2330,7 2332,7
    void scst_async_mcmd_completed(struct scst_mgmt_cmd *mcmd ...
    if LINUX_VERSION_CODE < KERNEL_VERSION(2, 6, 24)
    RE
    static inline struct page *sg_page(struct scatterlist *sg)
    return sg->page;
    ee
    2358,7 2360,7
    CR
    sg->offset = offset; s
174. n 7.5.

The SDP protocol is composed of a kernel module that implements SDP as a new address-family/protocol-family, and a library (see Section 7.2) that is used for replacing the TCP address family with SDP according to a policy.

This chapter includes the following sections:
• libsdp.so Library
• Configuring SDP
• Environment Variables
• Converting Socket-based Applications

Note that the default value of sdp_zcopy_thresh is 64KB, but it may be too low for some systems. You will need to experiment with your hardware to find the best value.

7.2 libsdp.so Library

libsdp.so is a dynamically linked library which is used for transparent integration of applications with SDP. The library is preloaded and therefore takes precedence over glibc for certain socket calls. Thus, it can transparently replace the TCP socket family with SDP socket calls.

The library also implements a user-level socket switch. Using a configuration file, the system administrator can set up the policy that selects the type of socket to be used. libsdp.so also has the option to allow server sockets to listen on both SDP and TCP interfaces. The various configurations with SDP/TCP sockets are explained inside the /etc/libsdp.conf file.

7.3 Configuring SDP

To load SDP upon boot, edit the fi
175. n InfiniBand-compliant Subnet Manager and Subnet Administrator that runs on top of the Mellanox BXOFED stack. opensm performs the InfiniBand specification's required tasks for initializing InfiniBand hardware. One SM must be running for each InfiniBand subnet.

opensm also provides an experimental version of a performance manager.

opensm defaults were designed to meet the common case usage on clusters with up to a few hundred nodes. Thus, in this default mode, opensm will scan the IB fabric, initialize it, and sweep occasionally for changes.

opensm attaches to a specific IB port on the local machine and configures only the fabric connected to it. (If the local machine has other IB ports, opensm will ignore the fabrics connected to those other ports.) If no port is specified, opensm will select the first "best" available port. opensm can also present the available ports and prompt for a port number to attach to.

By default, the opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file will register only general major events, whereas the second file will include details of reported errors. All errors reported in this second file should be treated as indicators of IB fabric health issues. (Note that when a fatal and non-recoverable error occurs, opensm will exit.) Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

11.2.1 Syntax

    opensm [OPTIONS]

where
176. nager running on one of the machines in the InfiniBand subnet. The 20 bytes can be captured from the boot session as shown in the figure below.
[Figure: boot session screen of Mellanox Boot over IB for InfiniHost III Ex, showing the InfiniBand link coming up and the captured bytes, which end in fe800000 00000000 0002c902 00231392]
Concatenate the byte 20 to the left of the captured 20 bytes, then separate every byte (two hexadecimal digits) with a colon. You should obtain the same result shown in Step 6 above.
A.4 Placing Client Identifiers in /etc/dhcpd.conf
The following is an excerpt of a /etc/dhcpd.conf example file showing the format of representing a client machine for the DHCP server:
    host host1 {
        next-server 11.4.3.7;
        filename "pxelinux.0";
        fixed-address 11.4.3.130;
        option dhcp-client-identifier = 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
    }
A.5 Subnet Manager - OpenSM
BoIB requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of Mellanox BXOFED for Linux and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server, but it is not mandatory. For details on OpenSM, see "OpenSM - Subnet Manager" on page 90.
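The concatenation rule above (byte 20 on the left, then a colon between every byte) can be sketched in Python. This is only an illustrative helper, not part of BoIB; the function name dhcp_client_identifier is hypothetical, and the input is assumed to be the 20 captured bytes written as hexadecimal digits, optionally separated by spaces or colons.

```python
def dhcp_client_identifier(captured_hex: str) -> str:
    """Build a dhcp-client-identifier value from the 20 captured bytes."""
    # Drop separators so that "00550401 fe800000 ..." and "00:55:..." both work.
    digits = "".join(ch for ch in captured_hex if ch not in " :").lower()
    if len(digits) != 40 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("expected exactly 20 bytes (40 hexadecimal digits)")
    # Concatenate the byte 20 to the left of the captured bytes.
    full = "20" + digits
    # Separate every byte (two hexadecimal digits) with a colon.
    return ":".join(full[i:i + 2] for i in range(0, len(full), 2))


if __name__ == "__main__":
    print(dhcp_client_identifier("00550401 fe800000 00000000 0002c902 00231392"))
```

The printed value can be pasted after "option dhcp-client-identifier =" in the host entry of /etc/dhcpd.conf.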
177. nal lid/mlid range. The default range is all valid entries in the range 1 to FDBTop.
Synopsis
ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-C <ca_name>] [-P <ca_port>] [-s <smlid>] [-t <timeout_ms>] [<dest dr_path|lid|guid> [<startlid> [<endlid>]]]
Table 12 lists the various flags of the command.
Table 12 - ibportstate Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h(elp) | Optional | | Print the help menu
-d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-a(ll) | Optional | | Show all LIDs in range, including invalid entries
-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-V(ersion) | Optional | | Show version info
-n(o_dests) | Optional | | Do not try to resolve destinations
-D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port), '0,1,2,1,4' (out via port 1, then 2, ...)
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-M(ulticast) | Optional | | Show multicast forwarding tables. The parameters <startlid> and <endlid> specify the MLID range
-s <smlid>
178. nd perform the changes and click Accept when done.
Step 71. In the Confirm Installation window, click Install to start the installation. (See image below.)
[Image: Installation Settings screen with the Confirm Installation dialog: "All information required for the base installation is now complete. If you continue now, partitions on your hard disk will be formatted (erasing any existing data in those partitions) according to the installation settings in the previous dialogs. Go back and check the settings if you are unsure."]
Step 72. At the end of the file copying stage, the Finishing Basic Installation window will pop up and ask for confirming a reboot. You can click OK to skip the count-down. (See image below.)
Note: Assuming that the machine has been correctly configured to boot from ConnectX EN PXE via its connection to the iSCSI target, make sure that MLNX_EN has the highe
179. ne
[Image: Installation Settings screen with the Confirm Installation dialog: "All information required for the base installation is now complete. If you continue now, partitions on your hard disk will be formatted (erasing any existing data in those partitions) according to the installation settings in the previous dialogs. Go back and check the settings if you are unsure."]
Step 38. At the end of the file copying stage, the Finishing Basic Installation window will pop up and ask for confirming a reboot. You can click OK to skip the count-down. (See image below.)
Note: Assuming that the machine has been correctly configured to boot from BoIB via its connection to the iSCSI target, make sure that MLNX IB (for ConnectX family) or gPXE (for InfiniHost III family) has the highest priority in the BIOS boot sequence.
[Image: Finishing Basic Installation screen]
180. ne. Multipath will be executed on newly joined SCSI devices.
Each initiator should execute several instances of the SRP daemon, one for each port. At startup, each SRP daemon detects the SRP Targets in the fabric and sends requests to the ib_srp module to connect to each of them. These SRP daemons also detect targets that subsequently join the fabric, and send the ib_srp module requests to connect to them as well.
Operation
When a path (from port1) to a target fails, the ib_srp module starts an error recovery process. If this process gets to the reset_host stage and there is no path to the target from this port, ib_srp will remove this scsi_host. After the scsi_host is removed, multipath switches to another path to this target (from another port/HCA).
When the failed path recovers, it will be detected by the SRP daemon. The SRP daemon will then request ib_srp to connect to this target. Once the connection is up, there will be a new scsi_host for this target. Multipath will be executed on the devices of this host, returning to the original state (prior to the failed path).
Prerequisites
Installation for RHEL4/5: (Execute once)
• Verify that the standard device-mapper-multipath rpm is installed. If not, install it from the RHEL distribution.
Installation for SLES10: (Execute once)
• Verify that multipath is installed. If not, take it from the installation (you may use 'yast').
• Update udev: (Execute once - for manual activation
181. nel/OS through a TFTP server, an iSCSI target, or other service. The binary code is exported by the device as an expansion ROM image.
B.1.1 Supported Mellanox Network Adapter Devices and Firmware
Table 19 - Supported Mellanox Technologies Devices and PCI Device IDs
Device Name | PCI Device ID (Decimal) | PCI Device ID (Hexadecimal) | Firmware Name
MT25408 ConnectX IB SDR (PCI Express 2.0 2.5GT/s) | 25408 | 0x6340 | fw-25408
MT25408 ConnectX IB DDR (PCI Express 2.0 2.5GT/s) | 25418 | 0x634a | fw-25408
MT25408 ConnectX IB DDR (PCI Express 2.0 5.0GT/s) | 26418 | 0x6732 | fw-25408
MT25408 ConnectX IB QDR (PCI Express 2.0 5.0GT/s) | 26428 | 0x673c | fw-25408
MT25208 InfiniHost III Ex | 25218 | 0x6282 | fw-25218
MT25204 InfiniHost III Lx | 25204 | 0x6274 | fw-25204
B.1.2 Tested Platforms
See Mellanox ConnectX EN PXE Release Notes (ConnectX_EN_PXE_release_notes.txt).
B.1.3 ConnectX EN PXE in Mellanox BXOFED
The ConnectX EN PXE binary files are provided as part of the Mellanox BXOFED for Linux ISO image. The following files are included:
1. A PXE ROM image file for each of the supported Mellanox network adapter devices. For example:
• ConnectX EN (PCI DevID: 25448): CONNECTX_EN_25448_ROM-<version>.rom
B.2 Burning the Expansion ROM Image
The binary code resides in the same Flash device of the de
182. nf file. The installation script exits after creating ofed.conf.
--with-fc               Install FCoE support. Available on RHEL5.2 ONLY
--with-32bit            Install 32-bit libraries (default). This is relevant for x86_64 and ppc64 platforms
--without-32bit         Skip 32-bit libraries installation
--without-ib-bonding    Skip ib-bonding RPM installation
--without-depcheck      Skip Distro's libraries check
--without-fw-update     Skip firmware update
--force-fw-update       Force firmware update
--force                 Force installation (without querying the user)
--all                   Install all kernel modules, libibverbs, libibumad, librdmacm, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, MVAPICH, Open MPI, MPI tests, MPI selector, perftest, sdpnetstat and libsdp, srptools, rds-tools, static and dynamic libraries
--hpc                   Install all kernel modules, libibverbs, libibumad, librdmacm, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, MVAPICH, Open MPI, MPI tests, MPI selector, dynamic libraries
--basic                 Install all kernel modules, libibverbs, libibumad, mft, mstflint, dynamic libraries
--msm                   Install all kernel modules, libibverbs, libibumad, mft, mstflint, diagnostic tools, OpenSM, ib-bonding, dynamic libraries
                        NOTE: With the --msm flag, the OpenSM daemon is configured to run upon boot.
-v|-vv|-vvv             Set verbosity level
-q                      Set quiet - no messages will be printed
2.3.2.1 Instal
183. [Image: Expert Partitioner window with a pop-up dialog: "You have not assigned a swap partition. There is nothing wrong with that, but in most cases it is highly recommended to create and assign a swap partition. [...] Do you want to change this?"]
Step 67. Select the Expert tab and click Booting.
[Image: Installation Settings window, Expert tab, listing Keyboard Layout: English (US) and Partitioning: Create swap partition /dev/sda1 (502.0 MB); Create root partition /dev/sda2 (7.5 GB) with reiserfs]
184. ng rules:
• Tree rank should be between two and eight (inclusively).
• Switches of the same rank should have the same number of UP-going port groups, unless they are root switches, in which case they shouldn't have UP-going ports at all.
• Switches of the same rank should have the same number of DOWN-going port groups, unless they are leaf switches.
• Switches of the same rank should have the same number of ports in each UP-going port group.
• Switches of the same rank should have the same number of ports in each DOWN-going port group.
• All the CAs have to be at the same tree level (rank).
If the root guid file is provided, the topology doesn't have to be a pure fat-tree, and it should only comply with the following rules:
• Tree rank should be between two and eight (inclusively).
• All the Compute Nodes have to be at the same tree level (rank). Note that non-compute node CAs are allowed here to be at different tree ranks.
Topologies that do not comply cause a fallback to min-hop routing. Note that this can also occur on link failures which cause the topology to no longer be a "pure" fat-tree.
Note that although the fat-tree algorithm supports trees with a non-integer CBB ratio, the routing will not be as balanced as in the case of an integer CBB ratio. In addition to this, although the algorithm allows leaf switches to have any number of CAs, the closer the tree is to being fully populated, the more effective the "shift" communication pattern
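The pure fat-tree rules listed above can be expressed as a small checker. The following Python sketch is not OpenSM code. The function name fat_tree_violations, the dictionary layout describing each switch, and the convention that rank 0 is the root level are all assumptions made for illustration.

```python
from collections import defaultdict


def fat_tree_violations(tree_rank, switches, ca_ranks):
    """Return a list of rule violations; an empty list means none were found.

    switches: list of dicts (hypothetical layout) with the keys
        'rank' - tree level of the switch (assumption: 0 is the root level)
        'leaf' - True if the switch is a leaf switch
        'up'   - number of ports in each UP-going port group
        'down' - number of ports in each DOWN-going port group
    ca_ranks: tree level of every CA
    """
    problems = []
    if not 2 <= tree_rank <= 8:
        problems.append("tree rank must be between two and eight")

    by_rank = defaultdict(list)
    for sw in switches:
        by_rank[sw["rank"]].append(sw)

    for rank, group in sorted(by_rank.items()):
        if rank == 0:
            # Root switches must not have UP-going ports at all.
            if any(sw["up"] for sw in group):
                problems.append("root switches must not have UP-going ports")
        elif len({len(sw["up"]) for sw in group}) > 1:
            problems.append(f"rank {rank}: differing number of UP-going port groups")

        # Leaf switches are exempt from the DOWN-going group-count rule.
        non_leaf = [sw for sw in group if not sw["leaf"]]
        if len({len(sw["down"]) for sw in non_leaf}) > 1:
            problems.append(f"rank {rank}: differing number of DOWN-going port groups")

        # Every port group of one direction must have the same number of ports.
        for direction in ("up", "down"):
            sizes = {ports for sw in group for ports in sw[direction]}
            if len(sizes) > 1:
                problems.append(
                    f"rank {rank}: differing port counts in {direction.upper()}-going port groups"
                )

    if len(set(ca_ranks)) > 1:
        problems.append("all CAs must be at the same tree level")
    return problems
```

A topology for which this returns a non-empty list would, per the text above, be expected to fall back to min-hop routing unless a root guid file relaxes the rules.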
185. nologies Devices and PCI Device IDs
Table 19 - Supported Mellanox Technologies Devices and PCI Device IDs
Revision History
Rev 1.50 (September 27, 2010)
• Added Section 2, "Installation"
Rev 1.10 (February 23, 2010)
• First release
Preface
This Preface provides general information concerning the scope and organization of this User's Manual. It includes the following sections:
• Intended Audience
• Documentation Conventions
• Related Documentation
Intended Audience
This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet, FCoE, FCoIB) systems comprising servers with adapter cards, VPI gateways, and InfiniBand switch platforms. It is also intended for application developers.
Documentation Conventions
Typographical Conventions
Table 1 - Typographical Conventions
Description | Convention | Example
File names | file.extension |
Directory names | directory |
Commands and their parameters | command param1 |
Optional items | [ ] |
Mutually exclusive parameters | p1 p2 p3 |
Optional mutually exclusive par
186. mlx4_en
A 10GigE driver under drivers/net/mlx4 that handles Ethernet specific functions and plugs into the netdev mid-layer.
mlx4_fc
Handles the FCoE functions using ConnectX Fibre Channel hardware offloads.
mlx4_vnic
Handles the EoIB functions using ConnectX Ethernet hardware offloads.
1.4.3 Mid-layer Core
Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.
1.4.4 Open-FCoE
The FCoE feature is based on and interacts with the Open-FCoE project. BXOFED includes the following open-fcoe.org modules: libfc and fcoe. See Section 3.4, "Fibre Channel over Ethernet".
1.4.5 ULPs
IPoIB
The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand connected or datagram transport service. IPoIB prepends an encapsulation header to each IP datagram and sends the result over the InfiniBand transport service. The transport service is Reliable Connected (RC) by default, but it may also be configured to be Unreliable Datagram (UD). The interface supports unicast, multicast and broadcast. For details, see Chapter 4
187. nologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware. If you purchased a non-standard card from a vendor other than Mellanox Technologies, please contact your vendor.
To run mstflint, you must know the device location on the PCI bus. See Example 1 for details.
Synopsis
mstflint [switches...] <command> [parameters...]
Table 16 lists the various switches of the utility, and Table 17 lists its commands.
Table 16 - mstflint Switches
Switch | Affected/Relevant Commands | Description
-h | | Print the help menu
-hh | | Print an extended help menu
-d[evice] <device> | All | Specify the device to which the Flash is connected
-guid <GUID> | burn, sg | GUID base value. 4 GUIDs are automatically assigned to the following values: guid -> node GUID; guid+1 -> port1; guid+2 -> port2; guid+3 -> system image GUID. Note: Port2 guid will be assigned even for a single port HCA; the HCA ignores this value.
-guids <GUIDs...> | burn, sg | 4 GUIDs must be specified here. The specified GUIDs are assigned the following values, respectively: node, port1, port2 and system image GUID. Note: Port2 guid must be specified even for a single port HCA; the HCA ignores this value. It can be set to 0x0.
-mac <MAC> | burn, sg | MAC address base value. Two MACs are automatically assigned to the following values: mac -> port1; mac+1 -> port2. Note: Thi
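The base-value rule in Table 16 (guid, guid+1, guid+2, guid+3 and mac, mac+1) is simple integer arithmetic. The Python sketch below only illustrates that arithmetic and does not call mstflint. The function names and the example base values are hypothetical.

```python
def derive_guids(base_guid: int) -> dict:
    """Map a GUID base value to the four GUIDs listed for the -guid switch."""
    if not 0 <= base_guid <= 0xFFFFFFFFFFFFFFFF - 3:
        raise ValueError("base GUID plus 3 must fit in 64 bits")
    names = ("node", "port1", "port2", "system_image")
    # guid -> node, guid+1 -> port1, guid+2 -> port2, guid+3 -> system image
    return {name: base_guid + offset for offset, name in enumerate(names)}


def derive_macs(base_mac: int) -> dict:
    """Map a MAC base value to the two MACs listed for the -mac switch."""
    if not 0 <= base_mac <= 0xFFFFFFFFFFFF - 1:
        raise ValueError("base MAC plus 1 must fit in 48 bits")
    # mac -> port1, mac+1 -> port2
    return {"port1": base_mac, "port2": base_mac + 1}


if __name__ == "__main__":
    # Hypothetical base value, used only to show the offsets.
    for name, guid in derive_guids(0x0002C90300001000).items():
        print(f"{name}: 0x{guid:016x}")
```

As the table notes, the port2 value is derived even for a single-port HCA, which simply ignores it.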
188. [Image: Expert Partitioner window with a confirmation dialog: "Really delete device /dev/sda2?"]
Step 66. In the pop-up window, click No to approve deleting the swap partition. You will be returned to the Installation Settings window. (See image below.)
[Image: Expert Partitioner window listing the 8.0 GB IET VIRTUAL DISK with a 70.5 MB Linux native (Ext2) partition and a 7.4 GB Linux native (Reiser) partition]
189. ns. OpenSM is assigned full membership in the default partition. All other end ports are assigned partial membership.
11.4.1 File Format
Notes:
• Line content followed after the '#' character is a comment and is ignored by the parser.
General File Format
<Partition Definition>:<PortGUIDs list> ;
Partition Definition:
[PartitionName][=PKey][,flag[=value]][,defmember=full|limited]
where
PartitionName            string, will be used with logging. When omitted, an empty string will be used.
PKey                     P_Key value for this partition. Only the low 15 bits will be used. When omitted, the P_Key will be autogenerated.
flag                     used to indicate IPoIB capability of this partition.
defmember=full|limited   specifies default membership for the port guid list. Default is limited.
Currently recognized flags are:
ipoib          indicates that this partition may be used for IPoIB; as a result, an IPoIB capable MC group will be created.
rate=<val>     specifies rate for this IPoIB MC group (default is 3 (10GBps))
mtu=<val>      specifies MTU for this IPoIB MC group (default is 4 (2048))
sl=<val>       specifies SL for this IPoIB MC group (default is 0)
scope=<val>    specifies scope for this IPoIB MC group (default is 2 (link local))
Note that values for rate, mtu, and scope should be specified as defined in the IBTA specification (for example, mtu=4 for 2048).
PortGUIDs list:
190. nt to assign to the interface. The following example shows how to configure an IB interface:
    host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The following example shows how to verify the configuration:
    host1$ ifconfig ib0
    ib0  Link encap:UNSPEC  HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
         inet addr:11.4.3.175  Bcast:11.4.255.255  Mask:255.255.0.0
         UP BROADCAST MULTICAST  MTU:65520  Metric:1
         RX packets:0 errors:0 dropped:0 overruns:0 frame:0
         TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:128
         RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Step 3. Repeat Step 1 and Step 2 on the remaining interface(s).
4.4 Subinterfaces
You can create subinterfaces for a primary IPoIB interface to provide traffic isolation. Each such subinterface (also called a child interface) has a different IP and network address from the primary (parent) interface. The default Partition Key (PKey), ff:ff, applies to the primary (parent) interface.
This section describes how to:
• Create a subinterface (Section 4.4.1)
• Remove a subinterface (Section 4.4.2)
4.4.1 Creating a Subinterface
To create a child interface (subinterface), follow this procedure:
Note: In the fo
191. o
    insmod /lib/modules/ib/iw_cm.ko
    insmod /lib/modules/ib/rdma_cm.ko
    insmod /lib/modules/ib/rdma_ucm.ko
    insmod /lib/modules/ib/mlx4_core.ko
    insmod /lib/modules/ib/mlx4_ib.ko
    insmod /lib/modules/ib/ib_mthca.ko
    insmod /lib/modules/ib/ib_ipoib.ko
In case of interoperability issues between iSCSI and Large Receive Offload (LRO), change the last command above as follows to disable LRO:
    insmod /lib/modules/ib/ib_ipoib.ko lro=0
Now you can assign an IP address to your IB device by adding a call to ifconfig or to the DHCP client in the init file after loading the modules. If you wish to use the DHCP client, then you need to add a call to the DHCP client in the init file after loading the IB modules. For example:
    /sbin/dhclient -cf /sbin/dhclient.conf ib1
Step 16. Save the init file.
Step 17. Close initrd.
    host1$ cd /tmp/initrd_ib
    host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_ib.img
    host1$ gzip /tmp/new_initrd_ib.img
At this stage, the modified initrd (including the IB driver) is ready and located at /tmp/new_initrd_ib.img.gz. Copy it to the original initrd location and rename it properly.
iSCSI Boot
Mellanox Boot over IB enables an iSCSI boot of an OS located on a remote iSCSI Target. It has a built-in iSCSI Initiator which can connect to the remote iSCSI Target and load from it the kernel and initrd. There are two instances of connection to the remote iSCSI Target: the first is for getting the kernel and initrd via BoIB, an
192. o not have it in your initrd, please add it using the following command:
    host1$ cp /lib/modules/`uname -r`/kernel/net/ipv6/ipv6.ko /tmp/initrd_ib/lib/modules
Step 11. To load the modules, you need the insmod executable. If you do not have it in your initrd, please add it using the following command:
    host1$ cp /sbin/insmod /tmp/initrd_ib/sbin/
Step 12. If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
    host1$ cp /sbin/ifconfig /tmp/initrd_ib/sbin
Step 13. If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB. Otherwise, skip this step.
To continue with this step, DHCP client v3.1.2 needs to be already installed on the machine you are working with.
Copy the DHCP client v3.1.2 file and all the relevant files as described below.
    host1$ cp <path to DHCP client v3.1.2>/dhclient /tmp/initrd_ib/sbin
    host1$ cp <path to DHCP client v3.1.2>/dhclient-script /tmp/initrd_ib/sbin
    host1$ mkdir -p /tmp/initrd_ib/var/state/dhcp
    host1$ touch /tmp/initrd_ib/var/state/dhcp/dhclient.leases
    host1$ cp /bin/uname /tmp/initrd_ib/bin
    host1$ cp /usr/bin/expr /tmp/initrd_ib/bin
    host1$ cp /sbin/ifconfig /tmp/initrd_ib/bin
    host1$ cp /bin/hostname /tmp/initrd_ib/bin
Create a configuration file f
193. ode
• Receive Core Affinity (RCA)
• Tx arbitration mode: VLAN user-priority (off by default)
• MSI-X or INTx
• Adaptive interrupt moderation
• HW Tx/Rx checksum calculation
• Large Send Offload (i.e., TCP Segmentation Offload)
• Large Receive Offload
• Multi-core NAPI support
• VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
• HW VLAN filtering
• HW multicast filtering
• ifconfig up/down + MTU changes (up to 10K)
• Ethtool support
• Net device statistics
• CX4 connectors (XAUI) or XFP
Loading the Ethernet Driver
By default, the Mellanox BXOFED stack does not load mlx4_en. To cause the mlx4_en module to load at driver start-up, set MLX4_EN_LOAD=yes in the file /etc/infiniband/openib.conf prior to start-up.
Alternatively, if you do not wish to stop and restart the driver, mlx4_en can be loaded by running:
    /sbin/modprobe mlx4_en
The result is a new net-device appearing in ifconfig -a.
3.3.3 Unloading the Driver
If /etc/infiniband/openib.conf had MLX4_EN_LOAD=yes at driver start-up, then you can unload the mlx4_en driver by running:
    /etc/init.d/openibd stop
Otherwise, unload mlx4_en by running:
    > modprobe -r mlx4_en
3.3.4 Ethernet Driver Usage and Configuration
• To assign an IP address to the interface, run:
    > ifconfig eth<n> <ip>
  where <n> is the OS assigned interface number.
• To check driver and device
194. of High Availability only)
• Add a file to /etc/udev/rules.d/ (you can call it 91-srp.rules). This file should have one line:
    ACTION=="add", KERNEL=="sd*[!0-9]", RUN+="/sbin/multipath %M:%m"
Note: When SRPHA_ENABLE is set to "yes" (see Automatic Activation of High Availability below), this file is created upon each boot of the driver and is deleted when the driver is unloaded.
Manual Activation of High Availability
Initialization: (Execute after each boot of the driver)
1. Execute modprobe dm-multipath
2. Execute modprobe ib_srp
3. Make sure you have created the file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and each HCA:
    srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>
This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.
Now it is possible to access the SRP LUNs on /dev/mapper/.
Note: It is possible for regular (non-SRP) LUNs to also be present; the SRP LUNs may be identified by their names. You can configure the /etc/multipath.conf file to change multipath behavior.
Note: It is also possible that the SRP LUNs will not appear under /dev/mapper/. This can occur if the SRP LUNs are in the black-list of multipath. Edit the 'blacklist' section in /etc/multipath.conf and make sure the SRP LUNs are not b
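Step 4 of the manual activation requires one SRP daemon instance per port of each HCA. The Python sketch below only builds the corresponding command lines and does not start anything. The function name srp_daemon_commands and the HCA name mlx4_0 are hypothetical; the options are written with leading dashes (-c -e -R 300 -i -p), which is assumed to be the intended form of the command shown in step 4.

```python
def srp_daemon_commands(hca_ports):
    """Build one srp_daemon command line per (HCA, port) pair.

    hca_ports: dict mapping an InfiniBand HCA name to a list of port numbers.
    """
    commands = []
    for hca, ports in hca_ports.items():
        for port in ports:
            commands.append(
                ["srp_daemon", "-c", "-e", "-R", "300", "-i", hca, "-p", str(port)]
            )
    return commands


if __name__ == "__main__":
    # Hypothetical host with a single dual-port HCA.
    for argv in srp_daemon_commands({"mlx4_0": [1, 2]}):
        print(" ".join(argv))
```

Each list could be passed to subprocess.Popen to launch the daemon, although the srp_daemon.sh script mentioned above is the documented way to perform this step.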
195. of Leon 2
HWADDR    The mac address to assign the vnic.
Other fields available for regular ethernet interfaces in the ifcfg-ethX files may also be used.
Extracting BridgeX host name
In order to configure host administered vNics, the BridgeX box address needs to be known on the host. The simplest way to learn this information is by logging into the BridgeX box and querying the information. One way to perform this is by connecting through ssh to the BridgeX and running:
    bridge-1128b8 enable
    bridge-1128b8 configure terminal
    bridge-1128b8 (config) show bxm
The result will display the system GUID and some more information:
    BXM status
    System GUID: 00:02:C9:03:00:11:08:C7
An alternative is to replace the last command with:
    bridge-1128b8 (config) show hosts
The result will display the system name and some more information:
    Hostname: bridge-1128b8
mlx4_vnic_confd
After updating the configuration files, you are ready to create the host administered vNics. To create the vNics, you will use the mlx4_vnic_confd service that is located at /etc/init.d/. To use the service, run:
To start (load new vnics):
    mlx4_vnic_confd start
To stop all host administered vNics:
    mlx4_vnic_confd stop
To restart (close and then open) all host administered vNics:
    mlx4_vnic_confd restart
To update system according to up to date confi
196. of configuration files exist, the central configuration file has precedence and only this file will be used.
6.3.1.1 Central configuration file - /etc/infiniband/mlx4_vnic.conf
The mlx4_vnic.conf file consists of lines, each describing one vNic. The following file format is used:
    name=eth44 mac=00:25:8B:27:14:78 ib_port=mlx4_0:1 vid=0 vnic_id=5 bx=00:00:00:00:00:00:03:B2 eport=A10
    name=eth45 mac=00:25:8B:27:15:78 ib_port=mlx4_0:1 vnic_id=6 bx=00:00:00:00:00:00:03:B2 eport=A10
    name=eth47 mac=00:25:8B:27:16:84 ib_port=mlx4_0:1 vid=2 vnic_id=7 bx=BX001 eport=A10
    name=eth40 mac=00:25:8B:27:17:93 ib_port=mlx4_0:1 vnic_id=8 bx=BX001 eport=A10
The fields used in the file have the following meaning:
name       The name of the interface that is displayed when running ifconfig.
mac        The mac address to assign to the vnic.
ib_port    Device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of the hca_id field. The port number can have a value of 1 or 2.
vid        VLAN ID (an optional field). If it exists, the vNic will be assigned the VLAN id specified. This value must be between 0 and 4095. If no vid is specified, the vNic will be assigned to the default vHub associated with the bx/eport pair (GW).
vnic_id    A unique number per vNic between 0 and
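A line of the central configuration file can be checked before the vNic service is started. The Python sketch below is an illustration only, not a Mellanox tool. It assumes the key=value layout (for example name=eth44 ... ib_port=mlx4_0:1) and treats every field except vid as required; the function name parse_vnic_line is hypothetical.

```python
# Assumption: every field except the optional vid must be present.
REQUIRED_FIELDS = ("name", "mac", "ib_port", "vnic_id", "bx", "eport")


def parse_vnic_line(line: str) -> dict:
    """Parse one vNic description line into a dict and check basic ranges."""
    fields = dict(token.split("=", 1) for token in line.split())
    missing = [key for key in REQUIRED_FIELDS if key not in fields]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))

    # ib_port has the form <device name>:<port number>, with port 1 or 2.
    device, _, port = fields["ib_port"].partition(":")
    if not device or port not in ("1", "2"):
        raise ValueError("ib_port must be <device name>:<port number>, port 1 or 2")
    fields["ib_port"] = (device, int(port))

    # vid is optional; when present it must be between 0 and 4095.
    if "vid" in fields:
        vid = int(fields["vid"])
        if not 0 <= vid <= 4095:
            raise ValueError("vid must be between 0 and 4095")
        fields["vid"] = vid
    return fields


if __name__ == "__main__":
    example = ("name=eth44 mac=00:25:8B:27:14:78 ib_port=mlx4_0:1 vid=0 "
               "vnic_id=5 bx=00:00:00:00:00:00:03:B2 eport=A10")
    print(parse_vnic_line(example))
```

A line without vid parses successfully and simply has no "vid" key, matching the rule that such a vNic is assigned to the default vHub of the bx/eport pair.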
197. of your iSCSI target and click Next.
[Image: iSCSI Initiator Discovery window with the IP Address field set to 10.4.3.7 and the authentication options No Authentication, Incoming Authentication and Outgoing Authentication]
Step 23. Details of the discovered iSCSI target(s) will be displayed in the iSCSI Initiator Discovery window. Select the target that you wish to connect to and click Connect.
[Image: iSCSI Initiator Discovery window listing Portal Address, Target Name and Connected columns]
Tip: If no iSCSI target was recognized, then either the target was not properly installed or no connection was found between the client and the iSCSI target. Open a shell
198. oing down, or one or more of these nodes coming back after being down. A very common case that is handled by the unicast routing cache is host reboot, which otherwise would cause two full routing recalculations: one when the host goes down, and the other when the host comes back online.
This option enforces a routing engine (currently up/down only) to make connectivity between root switches and in this way to be fully IBA compliant. In many cases this can violate the "pure" deadlock-free algorithm, so use it carefully.
--lid_matrix_file <file name>
    This option specifies the name of the lid matrix dump file from where switch lid matrices (min hops tables) will be loaded.
--lfts_file <file name>
    This option specifies the name of the LFTs file from where switch forwarding tables will be loaded.
--sadb_file <file name>
    This option specifies the name of the SA DB dump file from where the SA database will be loaded.
--root_guid_file <file name>
    Set the root nodes for the Up/Down or Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
--cn_guid_file <file name>
    Set the compute nodes for the Fat-Tree routing algorithm to the guids provided in the given file (one to a line).
-m, --ids_guid_file <file name>
    Name of the map file with set of the IDs which will be used b
199. on
[Image: iSCSI Initiator Discovery window with the Connect button]
Tip: If no iSCSI target was recognized, then either the target was not properly installed or no connection was found between the client and the iSCSI target. Open a shell to ping the iSCSI target (you can use CTRL-ALT-F2) and verify that the target is or is not accessible. To return to the (graphical) installation screen, press CTRL-ALT-F7.
Step 58. The iSCSI Initiator Discovery window will now request authentication to access the iSCSI target. Click Next to continue without authentication unless authentication is required.
[Image: iSCSI Initiator Discovery window with the authentication options No Authentication, Incoming Authentication and Outgoing Authentication]
Step 59. The iSCSI Initiator Discovery window will show the iSCSI target that got connec
200. on /dev/sda1 (70.5 MB) with ext2
Create swap partition /dev/sda2 (502.0 MB)
Create root partition /dev/sda3 (7.4 GB) with reiserfs

To accept these suggestions and continue, select Accept Proposal. If the suggestion does not fit your needs, create your own partition setup starting with the partitions as currently present on the disks. For this, select Custom Partition Setup. This is also the option to choose for advanced options like RAID and LVM.

Partitioning options: Accept Proposal, Base Partition Setup on This Proposal, Create Custom Partition Setup.

Step 65. In the Expert Partitioner window, select from the IET VIRTUAL DISK device the row that has its Mount column indicating swap, then click Delete. Confirm the delete operation and click Finish.

[Screenshot: Expert Partitioner window]
Help text: Partition your hard disks. This is intended for experts. If you are not familiar with the concepts of hard disk partitions and how to use them, you might want to go back and select automatic partitioning. Please

Device | Size | Type | Mount
/dev/sda | 8.0 GB | IET VIRTUAL DISK |
/dev/sda1 | 70.5 MB | Linux native (Ext2) | /boot
/dev/sda2 | 502.0 MB | Linux swap |
/dev/sda3 | 7.4 GB | Linux native (Reiser) |
201. on B.7 on page 190.

Warning: Take extra care when changing initrd, as any mistake may prevent the client machine from booting. It is recommended to have a backup iSCSI Initiator on a machine other than the client you are working with, to allow for debugging in case initrd gets corrupted.

Next, edit the init file that is in the initrd.zip and look for the following string:

    if [ "$iSCSI_TARGET_IPADDR" ] ; then
        iscsiserver="$iSCSI_TARGET_IPADDR"
    fi

Now add the following line before the string:

    iSCSI_TARGET_IPADDR=<Ethernet IP Address of iSCSI Target>

Example:

    iSCSI_TARGET_IPADDR=11.4.3.7

Also edit the file /boot/grub/menu.lst and delete the following string:

    ibft_mode=off

A l WinPE
Mellanox ConnectX EN PXE enables WinPE boot via TFTP. For instructions on preparing a WinPE image, please see http://etherboot.org/wiki/winpe.

Appendix B: Performance Troubleshooting

B.1 PCI Express Performance Troubleshooting
For the best performance on the PCI Express interface, the adapter card should be installed in an x8 slot with the following BIOS configuration parameters:
• Max_Read_Req, the maximum read request size, is 512 or higher
• MaxPayloadSize, the maximum payload size, is 128 or higher

Note: A Max_Read_Req of 128 and/or installing the card in an x4 slot will signifi c
202. or the DHCP client as described in Section 4.3.1.2 and place it under /tmp/initrd_ib/sbin. The following is an example of such a file (called dhclient.conf):

    # dhclient.conf
    # The value indicates a hexadecimal number
    # For a ConnectX device
    interface "ib0" {
    send dhcp-client-identifier 00:02:c9:03:00:00:10:39;
    }
    # For an InfiniHost EX device
    interface "ib1" {
    send dhcp-client-identifier 20:00:55:04:01:fe:80:00:00:00:00:00:00:00:02:c9:02:00:23:13:92;
    }

Step 14. Now you can add the commands for loading the copied modules into the file init. Edit the file /tmp/initrd_ib/init and add the following lines at the point you wish the IB driver to be loaded.

Warning: The order of the following commands (for loading modules) is critical.

    echo "loading ipv6"
    /sbin/insmod /lib/modules/ipv6.ko
    echo "loading IB driver"
    /sbin/insmod /lib/modules/ib/ib_addr.ko
    /sbin/insmod /lib/modules/ib/ib_core.ko
    /sbin/insmod /lib/modules/ib/ib_mad.ko
    /sbin/insmod /lib/modules/ib/ib_sa.ko
    /sbin/insmod /lib/modules/ib/ib_cm.ko
    /sbin/insmod /lib/modules/ib/ib_uverbs.ko
    /sbin/insmod /lib/modules/ib/ib_ucm.ko
    /sbin/insmod /lib/modules/ib/ib_umad.k
203. ost1$ lsmod | grep sdp
    ib_sdp 125020 0

The example output above shows that the SDP module is loaded. If the SDP module is loaded and the sdpnetstat command did not show SDP sockets, then SDP is not being used by any application.

7.3.2 Monitoring and Troubleshooting Tools
SDP has debug support for both the user space libsdp.so library and the ib_sdp kernel module. Both can be useful to understand why a TCP socket was not redirected over SDP and to help find problems in the SDP implementation.

User Space SDP Debug
User-space SDP debug is controlled by options in the libsdp.conf file. You can also have a local version and point to it explicitly using the following command:

    host1$ export LIBSDP_CONFIG_FILE=<path>/libsdp.conf

To obtain extensive debug information, you can modify libsdp.conf to have the log directive produce maximum debug output (provide the min-level flag with the value 1).

The log statement enables the user to specify the debug and error messages that are to be sent and their destination. The syntax of log is as follows:

    log [destination stderr|syslog|file <filename>] [min-level 1-9]

where options are:
destination: send log messages to the specified destination:
    stderr: forward messages to the STDERR
    syslog: send messages to the syslog service
    file <filename>: write messages to the file
204. ot True

Step 26. The iSCSI Initiator Overview window will pop up. Click Toggle Start-Up to change start up from manual to automatic. Click Finish.

[Screenshot: iSCSI Initiator Overview window, Connected Targets tab, listing Portal Address and Target Name, with the buttons Add, Log Out, and Toggle Start-Up]

Step 27. Select New Installation, then click Finish in the Installation Mode window.

[Screenshot: Installation Mode window with New Installation selected]
205. ple. They explain different keywords and their meaning.

    port-groups
        port-group # using port GUIDs
            name: Storage
            # "use" is just a description that is used for logging
            # Other than that, it is just a comment
            use: SRP Targets
            port-guid: 0x10000000000001,0x10000000000005-0x1000000000FFFA
            port-guid: 0x1000000000FFFF
        end-port-group

        port-group
            name: Virtual Servers
            # The syntax of the port name is as follows:
            #   "node_description/Pnum".
            # node_description is compared to the NodeDescription of the node,
            # and "Pnum" is a port number on that node.
            port-name: vs1 HCA-1/P1, vs2 HCA-1/P1
        end-port-group

        # using partitions defined in the partition policy
        port-group
            name: Partitions
            partition: Part1
            pkey: 0x1234
        end-port-group

        # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM)
        # or ALL (for all the nodes in the subnet)
        port-group
            name: CAs and SM
            node-type: CA, SELF
        end-port-group
    end-port-groups

    qos-setup
        # This section of the policy file describes how to set up SL2VL and VL
        # Arbitration tables on various nodes in the fabric.
        # However, this is not supported in BXOFED - the section is parsed
        # and ignored. SL2VL and VLArb tables should be configured in the
        # OpenSM options file (by defau
206. ps Service ID to a list of Service ID values or ranges
• QoS Class to a list of QoS Class values or ranges

CMA features
The CMA interface supports Service ID through the notion of port space as a prefix to the port number, which is part of the sockaddr provided to rdma_resolve_add(). The CMA also allows the ULP (like SDP) to propagate a request for a specific QoS Class. The CMA uses the provided QoS Class and Service ID in the sent PR/MPR.

IPoIB
IPoIB queries the SA for its broadcast group information and uses the SL, MTU, RATE and Packet Lifetime available on the multicast group which forms this broadcast group.

SDP
SDP uses CMA for building its connections. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hexadecimal digits holding the remote TCP/IP Port Number to connect to.

10.7 RDS
RDS uses CMA, and thus it is very close to SDP. The Service ID for RDS is 0x000000000106PPPP, where PPPP are 4 hexadecimal digits holding the TCP/IP Port Number that the protocol connects to. The default port number for RDS is 0x48CA, which makes a default Service ID 0x00000000010648CA.

10.8 SRP
The current SRP implementation uses its own CM callbacks (not CMA). So SRP fills in the Service ID in the PR/MPR by itself and uses that information in setting up the QP. SRP Service ID is defined by the SRP target I/O Controller (it also complie
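The Service ID layouts given above for SDP and RDS can be checked with a little arithmetic. The following is a minimal Python sketch of that calculation only; the helper names (sdp_service_id, rds_service_id, fmt) are illustrative assumptions and are not part of any OFED or CMA API.

```python
# Illustrative sketch: helper names are hypothetical, not an OFED API.

SDP_PREFIX = 0x0000000000010000  # 0x000000000001PPPP with PPPP set to zero
RDS_PREFIX = 0x0000000001060000  # 0x000000000106PPPP with PPPP set to zero
RDS_DEFAULT_PORT = 0x48CA        # default RDS port number stated in the text


def _with_port(prefix, port):
    """Place a 16-bit TCP/IP port number in the low 4 hexadecimal digits."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 4 hexadecimal digits")
    return prefix | port


def sdp_service_id(port):
    """Service ID for SDP connecting to the given remote TCP/IP port."""
    return _with_port(SDP_PREFIX, port)


def rds_service_id(port=RDS_DEFAULT_PORT):
    """Service ID for RDS on the given TCP/IP port."""
    return _with_port(RDS_PREFIX, port)


def fmt(service_id):
    """Format a Service ID as a 64-bit value with 16 hexadecimal digits."""
    return "0x%016X" % service_id


if __name__ == "__main__":
    print(fmt(rds_service_id()))    # default RDS Service ID
    print(fmt(sdp_service_id(22)))  # SDP Service ID for remote port 22
```

With the default RDS port, the sketch reproduces the default Service ID quoted in the text.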
207. r the netperf command:

Option | Description
-H | Where to find the server (11.4.17.6 is the IPoIB IP address)
-t <Test Name> | Specify the test to perform. Options are TCP_STREAM, TCP_RR, etc.
-c | Client CPU utilization
-C | Server CPU utilization
-- | Separates the global and test-specific parameters
-m | Message size, which is 65536 in the example above

Note that the run example above produced the following results:
• Throughput is 2.483 gigabits per second
• Client CPU utilization is 7.03 percent of client CPU
• Server CPU utilization is 5.42 percent of server CPU

Run the Netperf Latency test. Run the test once, and stop the server so that it does not repeat the test.

The following example shows how to run the Latency test, and then stop the Netperf server:

    host2$ netperf -H 11.4.17.6 -t TCP_RR -c -C -- -r1,1
    TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 11.4.17.6 (11.4.17.6) port 0 AF_INET

Values reported by the run:
    Socket size (send / recv, local and remote): 16384 / 87380 bytes
    Request size / response size: 1 / 1 bytes
    Elapsed time: 10.00 secs
    Transaction rate: 19913.18 per sec
    CPU remote: 6.79 %
    Service demand local: 22.549 us/Tr
    Service demand remote: 27.296 us/Tr

The following table describes parameters for the netperf
208. rd image must include and be configured to load the IB driver, including IPoIB. This can be achieved either by compiling the HCA driver into the kernel, or by adding the device driver module into the initrd image and loading it.

The IB driver requires loading the following modules in the specified order (see Section A.8.1 for an example):
• ib_addr.ko
• ib_core.ko
• ib_mad.ko
• ib_sa.ko
• ib_cm.ko
• ib_uverbs.ko
• ib_ucm.ko
• ib_umad.ko
• iw_cm.ko
• rdma_cm.ko
• rdma_ucm.ko
• mlx4_core.ko
• mlx4_ib.ko
• ib_mthca.ko
• ib_ipoib.ko

Example: Adding an IB Driver to initrd (Linux)

Prerequisites
1. The BoIB image is already programmed on the HCA card.
2. The DHCP server is installed and configured as described in Section A.3.1, "Configuring the DHCP Server", and connected to the client machine.
3. An initrd file.

To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with BXOFED.

Adding the IB Driver to the initrd File
Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step a. Back up your current initrd file.
Step 7. Make a new working directory and change to it
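Because the module load order listed earlier in this section is critical, it can help to generate the load commands from a single ordered list rather than typing them by hand. The following is a minimal Python sketch under stated assumptions: the /sbin/insmod path and the /lib/modules/ib directory are assumed locations inside the initrd, and the function names are hypothetical.

```python
# Illustrative sketch: paths and function names are assumptions.

# Required load order, copied from the module list in this section.
IB_MODULES = [
    "ib_addr", "ib_core", "ib_mad", "ib_sa", "ib_cm",
    "ib_uverbs", "ib_ucm", "ib_umad", "iw_cm", "rdma_cm",
    "rdma_ucm", "mlx4_core", "mlx4_ib", "ib_mthca", "ib_ipoib",
]


def insmod_lines(module_dir="/lib/modules/ib"):
    """Return one insmod command per module, in the required order."""
    return ["/sbin/insmod %s/%s.ko" % (module_dir, name) for name in IB_MODULES]


def in_required_order(loaded):
    """Check that the known modules in `loaded` respect the required order.

    Module names that are not in IB_MODULES are ignored.
    """
    positions = [IB_MODULES.index(name) for name in loaded if name in IB_MODULES]
    return positions == sorted(positions)


if __name__ == "__main__":
    for line in insmod_lines():
        print(line)
```

The generated lines could be pasted into the init script, and in_required_order can be used to sanity-check a hand-edited list before rebuilding the image.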
209. reak the Up/Down rule. One ramification of this is that you cannot run SM on switches other than the leaf switches of the fabric.

11.5.3.1 UPDN Algorithm Usage
Activation through OpenSM:
• Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
• Use '-a <root_guid_file>' for adding an UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.

Notes on the guid list file:
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.

11.5.4 Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for the "shift" communication pattern. It should be chosen if a subnet is a symmetrical or almost symmetrical fat-tree of various types. It supports not just K-ary-N-Trees, by handling non-constant K, cases where not all leafs (CAs) are present, and any CBB ratio. As in UPDN, fat-tree also prevents credit-loop deadlocks.

If the root guid file is not provided ('-a' or '--root_guid_file' options), the topology has to be pure fat-tree that complies with the followi
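The guid list file rules described for UPDN above (one guid per line, lines with an invalid format discarded) can be mirrored in a small pre-check before handing a file to OpenSM. The following is a minimal Python sketch, not OpenSM's own parser: the accepted format (an optional 0x prefix followed by 1 to 16 hexadecimal digits) is an assumption, and the sample guid values in the usage example are made up.

```python
import re

# Assumed guid format: optional 0x prefix, then 1 to 16 hexadecimal digits.
# This is an illustration, not the exact rule applied by OpenSM.
_GUID_RE = re.compile(r"^(?:0[xX])?[0-9a-fA-F]{1,16}$")


def parse_guid_lines(lines):
    """Return the guids found one-per-line, silently dropping invalid lines."""
    guids = []
    for line in lines:
        text = line.strip()
        if _GUID_RE.match(text):
            guids.append(int(text, 16))
    return guids


if __name__ == "__main__":
    sample = [
        "0x0002c90200231392\n",  # made-up example guid
        "not a guid\n",          # invalid format, discarded
        "\n",                    # empty line, discarded
        "0008f10400411f56\n",    # made-up example guid without prefix
    ]
    for guid in parse_guid_lines(sample):
        print("0x%016x" % guid)
```

Running such a check first makes it visible which lines would be ignored, instead of discovering later that a root node was silently dropped.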
210. rect value here.

Configure the mlx4_en Ethernet driver to support PFC. Add the following line to the file /etc/modprobe.conf and restart the network driver:

    options mlx4_en pfctx=0xff pfcrx=0xff

3.4.3.2 Starting FCoE Service
Make sure the network is up:

    modprobe mlx4_en

Then run:

    > /etc/init.d/mlxfc start

vHBAs will be instantiated on DCBX-monitored interfaces, and SCSI LUNs will get mapped. For manual instantiation of vHBAs, please see Section 3.4.4.1, "Manual vHBA Control".

3.4.3.3 Stopping FCoE Service
Run:

    > /etc/init.d/mlxfc stop

Note: Only when the mlxfc service is stopped and the mlx4_en module is removed can the mlx4_core module be removed as well.

3.4.4 FCoE Advanced Usage
Advanced usage will probably be needed when connected to FCoE switches that do not support the Cisco-like FCoE DCBX auto-negotiation.

3.4.4.1 Manual vHBA Control
Manual control allows creating and destroying vHBAs, and signaling link-up and link-down to existing vHBAs. This is done using sysfs operations.

When using the pre-T11 stack, the sysfs directory is located at /sys/class/mlx4_fc. When using the T11 stack, the sysfs directory is located at /sys/module/fcoe. Both directories contain the same entries. In the following, the sysfs directory will be referred to as $FCSYSFS.

To create a new vHBA on an Ethernet interface (e.g., eth3), run
211. riable IBDIAG_TOPO_FILE

To specify the local system name to a diagnostic tool, use one of the following two options:
1. On the command line, specify the system name using the option '-s <local system name>'
2. Define the environment variable IBDIAG_SYS_NAME

12.2.2 IB Interface Definition
The diagnostic tools installed on a machine connect to the IB fabric by means of an HCA port through which they send MADs. To specify this port to an IB diagnostic tool, use one of the following options:
1. On the command line, specify the port number using the option '-p <local port number>' (see below)
2. Define the environment variable IBDIAG_PORT_NUM

In case more than one HCA device is installed on the local machine, it is necessary to specify the device's index to the tool as well. For this, use one of the following options:
1. On the command line, specify the index of the local device using the following option: '-i <index of local device>'
2. Define the environment variable IBDIAG_DEV_IDX

12.2.3 Addressing
Note: This section applies to the ibdiagpath tool only. A tool command may require defining the destination device or port to which it applies.

The following addressing modes can be used to define the IB ports:
• Using a Directed Route to the destination (Tool option '-d'): This option defines a directed route of output port numbers from the local port to the destination.
• Using port LIDs (Tool op
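Each setting above can be supplied either on the command line or through an environment variable. The following is a minimal Python sketch of how a wrapper script might resolve such a setting. It assumes that a command-line value takes precedence over the environment variable, which the text does not state; the dictionary keys and the resolve function are hypothetical names, and only the IBDIAG_* variable names come from the manual.

```python
import os

# Environment variable names are from the manual; the keys are hypothetical.
ENV_FALLBACKS = {
    "topo_file": "IBDIAG_TOPO_FILE",
    "sys_name": "IBDIAG_SYS_NAME",
    "port_num": "IBDIAG_PORT_NUM",
    "dev_idx": "IBDIAG_DEV_IDX",
}


def resolve(setting, cli_value=None, environ=None):
    """Return the command-line value if given, else the environment value.

    Assumption: the command line wins over the environment. Returns None
    when neither source provides a value.
    """
    environ = os.environ if environ is None else environ
    if cli_value is not None:
        return cli_value
    return environ.get(ENV_FALLBACKS[setting])


if __name__ == "__main__":
    env = {"IBDIAG_PORT_NUM": "2"}
    print(resolve("port_num", environ=env))       # taken from the environment
    print(resolve("port_num", "1", environ=env))  # command-line value wins
```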
212. riable in the sysfs filesystem. The following command performs this:

    host1$ echo 1 > /sys/module/ib_sdp/debug_level

Note: Depending on the operating system distribution on your machine, you may need an extra level, "parameters", in the directory structure, so you may need to direct the echo command to /sys/module/ib_sdp/parameters/debug_level.

Turning off kernel debug is done by setting the sysfs variable to zero using the following command:

    host1$ echo 0 > /sys/module/ib_sdp/debug_level

To display debug information, use the dmesg command:

    host1$ dmesg

Environment Variables
For the transparent integration with SDP, the following two environment variables are required:
1. LD_PRELOAD: this environment variable is used to preload libsdp.so and it should point to the libsdp.so library. The variable should be set by the system administrator to /usr/lib/libsdp.so (or /usr/lib64/libsdp.so).
2. LIBSDP_CONFIG_FILE: this environment variable is used to configure the policy for replacing TCP sockets with SDP sockets. By default it points to /etc/libsdp.conf.
3. SIMPLE_LIBSDP: ignore libsdp.conf and always use SDP.

Converting Socket-based Applications
You can convert a socket-based application to use SDP instead of TCP in an automatic (also called transparent) mode or in an explicit (also called non-transparent) mode.

Automatic
213. run opensm in the default mode, simply enter:

    host1# opensm

Note that opensm needs to be run on at least one machine in an IB subnet.

By default, an opensm run is logged to two files: /var/log/messages and /var/log/opensm.log. The first file, messages, registers only general major events; the second file, opensm.log, includes details of reported errors. All errors reported in opensm.log should be treated as indicators of IB fabric health. Both log files should include the message "SUBNET UP" if opensm was able to set up the subnet correctly.

Note: If a fatal, non-recoverable error occurs, opensm exits.

11.2.4.1 Running OpenSM As Daemon
OpenSM can also run as daemon. To run OpenSM in this mode, enter:

    host1# /etc/init.d/opensmd start

11.3 osmtest Description
osmtest is a test program for validating the InfiniBand Subnet Manager and Subnet Administrator. osmtest provides a test suite for opensm. It can create an inventory file of all available nodes, ports, and PathRecords, including all their fields. It can also verify the existing inventory, with all the object fields, and match it to a pre-saved one. See Section 11.3.2.

osmtest has the following test flows:
• Multicast Compliancy test
• Event Forwarding test
• Service Record registration test
• RMPP stress test
• Small SA Queries stress test

11.3.1 Syntax

    osmtest [OPTIONS]

where OPTIONS are:

-f, --flow
This option directs osmtest to run a specific flow. Flo
214. ry
• ibcheckerrs
• mstflint
• ibv_asyncwatch

12.2 Utilities Usage
This section first describes common configuration, interface and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves, including operation, synopsis and options descriptions, error codes, and examples.

12.2.1 Common Configuration, Interface and Addressing

Topology File (Optional)
An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices. To identify devices in a fabric (or even in one switch system), each device is given a GUID (a MAC equivalent). Since a GUID is a non-user-friendly string of characters, it is better to alias it to a meaningful, user-given name. For this objective, the IB Diagnostic Tools can be provided with a "topology file", which is an optional configuration file specifying the IB fabric topology in user-given names.

For diagnostic tools to fully support the topology file, the user may need to provide the local system name (if the local hostname is not used in the topology file).

To specify a topology file to a diagnostic tool, use one of the following two options:
1. On the command line, specify the file name using the option '-t <topology file name>'
2. Define the environment va
215. s

Examples
1. Check aggregated node counter for LID 0x2:

    > ibcheckerrs 2
    #warn: counter SymbolErrors = 65535 (threshold 10) lid 2 port 255
    #warn: counter LinkRecovers = 255 (threshold 10) lid 2 port 255
    #warn: counter LinkDowned = 12 (threshold 10) lid 2 port 255
    #warn: counter RcvErrors = 565 (threshold 10) lid 2 port 255
    #warn: counter XmtDiscards = 441 (threshold 100)

2. Check port counters for LID 2, Port 1:

    > ibcheckerrs -v 2 1
    Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK

3. Check the LID 2, Port 1 using the specified threshold file:

    > cat thresh1
    SymbolErrors=10
    LinkRecovers=10
    LinkDowned=10
    RcvErrors=10
    RcvRemotePhysErrors=100
    RcvSwRelayErrors=100
    XmtDiscards=100
    XmtConstraintErrors=100
    RcvConstraintErrors=100
    LinkIntegrityErrors=10
    ExcBufOverrunErrors=10
    VL15Dropped=100

12.13 mstflint

Applicable Hardware
Mellanox InfiniBand and Ethernet devices and network adapter cards.

Description
Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.

Note: If you purchased a standard Mellanox Tech
216. s, Ltd. CORE-Direct and FabricIT are trademarks of Mellanox Technologies, Ltd. All other marks and names mentioned herein may be trademarks of their respective companies.

Document Number: 3253

Table of Contents
Table of Contents
List of Tables
Revision History
Intended Audience
Documentation Conventions
    Typographical Conventions
    Common Abbreviations and Acronyms
Related Documentation
Chapter 1 Mellanox BXOFED Overview
    1.1 Introduction to Mellanox BXOFED
    1.2 Introduction to Mellanox VPI Adapters
    1.3 BXOFED Package Contents
    1.4 Architecture
        1.4.1 mthca HCA (IB) Driver
        1.4.3 Mid-layer Core
217. s at each step. This provides the ordering necessary to avoid deadlock. When there are multiple links between any two switches, they still represent only one dimension, and traffic is balanced across them unless port equalization is turned off. In the case of hypercubes, the same port must be used throughout the fabric to represent the hypercube dimension and match on both ends of the cable. In the case of meshes, the dimension should consistently use the same pair of ports, one port on one end of the cable and the other port on the other end, continuing along the mesh dimension.

Use the '-R dor' option to activate the DOR algorithm.

11.5.7 Routing References
To learn more about deadlock-free routing, see the article "Deadlock Free Message Routing in Multiprocessor Interconnection Networks" by William J. Dally and Charles L. Seitz (1985).

To learn more about the up/down algorithm, see the article "Effective Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose Carlos Sancho, Antonio Robles, and Jose Duato at the Universidad Politecnica de Valencia.

To learn more about LASH and the flexibility behind it, the requirement for layers, and performance comparisons to other algorithms, see the following articles:
• "Layered Routing in Irregular Networks", Lysne et al., IEEE Transactions on Parallel and Distributed Systems, Vol. 16, No. 12, December 2005
• "Routing for the ASI Fabric Manager", Solheim et al., IEEE Commun
218. s flags of the command.

Table 13 - smpquery Flags and Options

Flag | Optional / Mandatory | Default (If Not Specified) | Description
-h(help) | Optional | | Print the help menu
-d(ebug) | Optional | | Raise the IB debug level. May be used several times for higher debug levels (-ddd or -d -d -d)
-e(rr_show) | Optional | | Show send and receive errors (timeouts and others)
-v(erbose) | Optional | | Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-D(irect) | Optional | | Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' = self port; '0,1,2,1,4' = out via port 1, then 2, ...
-G(uid) | Optional | | Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-s <smlid> | Optional | | Use <smlid> as the target LID for SM/SA queries
-V(ersion) | Optional | | Show version info
-C <ca_name> | Optional | | Use the specified channel adapter or router
-P <ca_port> | Optional | | Use the specified port
-t <timeout_ms> | Optional | | Override the default timeout for the solicited MADs [msec]
<op> | Mandatory | | Supported operations: nodeinfo <addr>, nodedesc <addr
219. s switch is applicable only for Mellanox Technologies Ethernet products.

Table 16 - mstflint Switches (Sheet 2 of 2)

Switch | Affected/Relevant Commands | Description
-macs <MACs> | burn, sg | Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively. Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-blank_guids | burn | Burn the image with blank GUIDs and MACs (where applicable). These values can be set later using the sg command (see Table 17 below).
-clear_semaphore | No commands allowed | Force clear the Flash semaphore on the device. No command is allowed when this switch is used. Warning: May result in system instability or Flash corruption if the device or another application is currently using the Flash.
-i(mage) <image> | burn, verify | Binary image file.
-qq | burn, query | Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
-nofs | burn | Burn image in a non-failsafe manner.
-skip_is | burn | Allow burning the firmware image without updating the invariant sector. This is to ensure failsafe burning even when an invariant se
220. s with IBTA Service ID rules). The Service ID is reported by the I/O Controller in the ServiceEntries DMA attribute and should be used in the PR/MPR if the SA reports its ability to handle QoS PR/MPRs.

10.9 OpenSM Features
The QoS-related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 11) can be split into two main parts:

I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.

II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests. The overall flow for such requests is as follows: first, the request is matched against the defined match rules such that the target QoS Level definition is found. Given the QoS Level, a path(s) search is performed with the given restrictions imposed by that level.

Note: QoS in OpenSM is described in detail in Chapter 11.

11 OpenSM Subnet Manager

11.1 Overview
OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).

11.2 opensm Description
opensm is a
221. [Screenshot: Expert Partitioner window listing /dev/sda (8.0 GB, IET VIRTUAL DISK), /dev/sda1 (70.5 MB, Linux native, Ext2), and a 7.4 GB Linux native (Reiser) partition, with the following dialog]

You have not assigned a swap partition. There is nothing wrong with that, but in most cases it is highly recommended to create and assign a swap partition. Swap partitions on your system are listed in the main window with the type "Linux Swap". An assigned swap partition has the mount point "swap". You can assign more than one swap partition, if desired. Do you want to change this?

Select the Expert tab and click Booting.

[Screenshot: Installation Settings window, Expert tab. Click any headline to make changes or use the Change menu below.]
Keyboard Layout: English (US)
Partitioning: Create swap partition /dev/sda1 (502.0 MB); create root partition /dev/sda2 (7.5 GB) with reiserfs
Add-On Products: No add-on product selected for installation
Software: SUSE Linux Enterprise Server 10 SP2; Server Base System; KDE Desktop Environment for Server; C/C++ Compiler and Tools; X Window System; Size of Packages to Install: 1.6 GB
Booting: Boot Loader Typ
222. selector (see the mpi-selector(1) man page for details).

Note that MPI selector only affects the default MPI environment for future shells. Specifically, if you use MPI selector to select MPI implementation ABC, this default selection will not take effect until you start a new shell (e.g., logout and login again). Other packages (such as environment modules) provide functionality that allows changing your environment to point to a new MPI implementation in the current shell. The MPI selector was not meant to duplicate or replace that functionality.

The MPI selector functionality can be invoked in one of two ways:
1. The mpi-selector-menu command.
   This command is a simple, menu-based program that allows the selection of the system-wide MPI (usually only settable by root) and a per-user MPI selection. It also shows what the current selections are. This command is recommended for all users.
2. The mpi-selector command.
   This command is a CLI equivalent of the mpi-selector-menu, allowing for the same functionality as mpi-selector-menu but without the interactive menus and prompts. It is suitable for scripting.

Compiling MPI Applications
Note: A valid Fortran compiler must be present in order to build the MVAPICH MPI stack and tests.

The following compilers are supported by Mellanox BXOFED's MVAPICH and Open MPI packages: Gcc, Intel and PGI. The install script prompts the user to choose the compiler with which to install the MVAPICH
223. sses on multiple remote machines, the MPI standard provides a launcher program that requires automatic login (i.e., password-less) onto the remote machines. SSH (Secure Shell) is both a computer program and a network protocol that can be used for logging in and running commands on remote computers and/or servers.

SSH Configuration
The following steps describe how to configure password-less access over SSH:

Step 1. Generate an ssh key on the initiator machine (host1).

    host1$ ssh-keygen -t rsa
    Generating public/private rsa key pair.
    Enter file in which to save the key (/home/<username>/.ssh/id_rsa):
    Enter passphrase (empty for no passphrase):
    Enter same passphrase again:
    Your identification has been saved in /home/<username>/.ssh/id_rsa.
    Your public key has been saved in /home/<username>/.ssh/id_rsa.pub.
    The key fingerprint is:
    38:1b:29:df:4f:08:00:4a:0e:50:0:05:44:e7:9:05 <username>@host1

Step 2. Check that the public and private keys have been generated.

    host1$ cd /home/<username>/.ssh/
    host1$ ls -la
    total 40
    drwx------ 2 root root 4096 Mar 5 04:57 .
    drwxr-x--- 13 root root 4096 Mar 4 18:27 ..
    -rw------- 1 root root 1675 Mar 5 04:57 id_rsa
    -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub

Step 3. Check the public key.

    host1$ cat id_rsa.pub
    ssh-rsa AAAAB3NzaC1yc2EAAA
224. st priority in the BIOS boot sequence.

[Screenshot: Finishing Basic Installation window showing the tasks Copy files to installed system, Save configuration, Install boot manager, Save installation settings, and Prepare system for initial boot, with the message "The system will reboot now"]

Step 73. Once the boot is complete, the Startup Options window will pop up. Select SUSE Linux Enterprise Server 10, then press Enter.

[Screenshot: boot menu with the entries SUSE Linux Enterprise Server 10, Floppy, and SUSE Linux Enterprise Server 10 (Failsafe)]

Step 74. The Hostname and Domain Name window will pop up. Continue configuring your machine until the operating system is up; then you can start running the machine in normal operation mode.

Step 75. (Optional) If you wish to have the second instance of connecting to the iSCSI Target go through the Ethernet driver, copy the initrd file under /boot to a new location, add the Ethernet driver into it after the load commands of the iSCSI Initiator modules, and continue as described in Secti
225. stacks that are part of the standard operating system distribution or another vendor's commercial stack
- Installs the MLNX_OFED_LINUX binary RPMs (if they are available for the current kernel)
- Identifies the currently installed InfiniBand and Ethernet network adapters and automatically upgrades the firmware

2.3.1 Pre-installation Notes
- The installation script removes all previously installed BXOFED packages and re-installs from scratch. You will be prompted to acknowledge the deletion of the old packages.
  Note: Pre-existing configuration files will be saved with the extension ".conf.saverpm".
- If you need to install BXOFED on an entire (homogeneous) cluster, a common strategy is to mount the ISO image on one of the cluster nodes and then copy it to a shared file system such as NFS. To install on all the cluster nodes, use cluster-aware tools (such as pdsh).

1. The firmware will not be updated if you run the install script with the '--without-fw-update' option.

Verify that the system has a Mellanox network adapter (HCA/NIC) installed by ensur
Download the BXOFED-X.X.X-Y.Y.Y.tgz file to your target Linux host. If this package is to be installed on a cluster, it is recommended to download it to an NFS shared direc
Use the md5sum utility to confirm the file integrity of the downloaded tarball. Run the

2.3.2
- If your
226. t portinfo lt addr gt lt portnum gt switchinfo lt addr gt pkeys lt addr gt lt portnum gt sl2vl lt addr gt lt portnum gt vlarb lt addr gt lt portnum gt guids lt addr gt lt dest dr_path lid guid gt Optional Destination s directed path LID or GUID Mellanox Technologies Mellanox Technologies Confidential Rev 1 50 InfiniBand Fabric Diagnostic Utilities 146 Examples 1 Query PortInfo by LID with port modifier gt smpquery portinfo 1 1 i Pac oros Ilo il jara Al MRS Geol 0x0000000000000000 GIROLESENI a ts dao 0xfe80000000000000 MA TCA TOTO AT 0x0001 MEDE Ra OTIS S SSS 0x0001 CapMas Kier AE RAS ERRE N an scan sn 0x251086a sSM sTrapSupported sAutomaticMigrationSupported sSLMappingSupported sSystemImageGUIDsupported sCommunicatonManagementSupported sVendorClassSupported sCapabilityMaskNoticeSupported sClientRegistrationSupported DA RI I etero T 0x0000 Mkeylte ase Perito deta nea 0 LO CAME A Mom eh oe ek Oe 1 Esa di ERE A a 1X or 4X cas AEEOS em ore 1X or 4X PARAS EME AA Add 4X DECANOAS yea a a i PSICO SRO ONG OS Lira ale aaa aa ae aa a ACCIVE PAYVSLINK o sts Unt ENEA RIE LinkUp KDD EE SEA CCR eee E e tee Polling BIO We Cle NB SN I REPERTI 0 MG ES sa arial a alata a a at Sh 0 Ean Sip SCAG truly asta su stats te 5 0 Gbps A Peed bina DIARREA tere ZE ACO SRO ORLO INCH EMISE rcs tee emcee Oc teo RO SG 2048 SMS RANA RIPARA 0 E re lA oO
227. t of VLs are defined as low-priority VLs with different weights, while VL4 is effectively turned off.

11.6.8 Deployment Example

Figure 6 shows an example of an InfiniBand subnet that has been configured by a QoS manager to provide different service levels for various ULPs.

Figure 6: Example QoS Deployment on InfiniBand
[Figure labels: Service Access Points.
 Traffic class SDP: Service Level 2, Policy min 20% BW.
 Traffic class Partition A: Service Level 0, Policy min 40%.
 Traffic class SRP: Service Level 1, Policy min 30% BW.
 Traffic class IPoIB: Service Level 3, Policy min 10% BW.]

11.7 QoS Configuration Examples

The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files.

11.7.1 Typical HPC Example: MPI and Lustre

Assignment of QoS Levels
- MPI
  - Separate from I/O load
  - Min BW of 70%
- Storage Control (Lustre MDS)
  - Low latency
- Storage Data (Lustre OST)
  - Min BW 30%

Administration
- MPI is assigned an SL via the command line:
    host1$ mpirun -sl 0
- OpenSM QoS policy file
  Note: In the following policy file example, replace OST* and MDS* with the real port GUIDs.

    qos-ulps
        default : 0 # default SL (for MPI)
        any, target-port-guid
228. ted to. Note that the Connected column must indicate True for this target. Click Next. (See figure below.)

[Screenshot: iSCSI Initiator Discovery window. Portal Address: 10.4.3.7:3260,1; Target Name: iqn.2007-08.7.3.4.10:iscsiboot; Connected: True.]

Step 60. The iSCSI Initiator Overview window will pop up. Click Toggle Start-Up to change start up from manual to automatic. Click Finish.

[Screenshot: iSCSI Initiator Overview window, Connected Targets tab, with Add, Log Out, and Toggle Start-Up buttons.]
229. the default rule, which is applied only if the query didn't match any other rule. All other sections of the QoS policy file take precedence over the qos-ulps section. That is, if a policy file has both qos-match-rules and qos-ulps sections, then any query is matched first against the rules in the qos-match-rules section, and only if there was no match is the query matched against the rules in the qos-ulps section. Note that some of these match rules may overlap, so in order to use the simple QoS definition effectively, it is important to understand how each of the ULPs is matched.

11.6.6.1 IPoIB

An IPoIB query is matched by PKey or by destination GID, in which case this is the GID of the multicast group that OpenSM creates for each IPoIB partition. The default PKey for an IPoIB partition is 0x7fff, so the following three match rules are equivalent:

    ipoib : <SL>
    ipoib, pkey 0x7fff : <SL>
    any, pkey 0x7fff : <SL>

11.6.6.2 SDP

An SDP PR query is matched by Service ID. The Service ID for SDP is 0x000000000001PPPP, where PPPP are 4 hex digits holding the remote TCP/IP Port Number to connect to. The following two match rules are equivalent:

    sdp : <SL>
    any, service-id 0x0000000000010000-0x000000000001ffff : <SL>

11.6.6.3 RDS

Similar to SDP, an RDS PR query is matched by Service
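As an illustration of the SDP Service ID layout described above (0x000000000001PPPP), the following Python sketch derives the Service ID for a given TCP/IP port and checks whether a Service ID falls inside the range used by the equivalent "any, service-id" rule. It is only a sketch: the function and constant names are hypothetical helpers and are not part of OpenSM or of any Mellanox tool.

```python
# Hypothetical helpers illustrating the SDP Service ID layout; not an OpenSM API.
SDP_SERVICE_ID_BASE = 0x0000000000010000  # PPPP = 0000
SDP_SERVICE_ID_LAST = 0x000000000001FFFF  # PPPP = ffff


def sdp_service_id(tcp_port: int) -> int:
    """Return the SDP Service ID for a remote TCP/IP port number."""
    if not 0 <= tcp_port <= 0xFFFF:
        raise ValueError(f"TCP port out of range: {tcp_port}")
    # The low 16 bits (the four hex digits PPPP) carry the port number.
    return SDP_SERVICE_ID_BASE | tcp_port


def is_sdp_service_id(service_id: int) -> bool:
    """True if the Service ID lies in the range matched by the 'sdp' rule."""
    return SDP_SERVICE_ID_BASE <= service_id <= SDP_SERVICE_ID_LAST


def format_service_id(service_id: int) -> str:
    """Render a Service ID as 16 hex digits, as written in the policy file."""
    return f"0x{service_id:016x}"


if __name__ == "__main__":
    print(format_service_id(sdp_service_id(80)))
```

A rule restricted to a single port could then use the computed value for both ends of the service-id range.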
230. the various DHCP sessions. The value of the client identifier is composed of an 8-byte port GUID (separated by colons and represented in hexadecimal digits).

Extracting the Port GUID - Method I

To obtain the port GUID, run the following commands:

Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.

    host1# mst start
    host1# mst status

The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the Port GUID via the following query command:

    flint -d <MST_DEVICE_NAME> q

Example with ConnectX IB DDR (& PCI Express 2.0 2.5GT/s) as the HCA device:

    host1# flint -d /dev/mst/mt25418_pci_cr0 q
    Image type:      ConnectX
    FW Version:      2.6.000
    Device ID:       25418
    Chip Revision:   A0
    Description:     Node             Port1            Port2            Sys image
    GUIDs:           0002c90300001038 0002c90300001039 0002c9030000103a 0002c9030000103b
    MACs:                             0002c9001039     0002c900103a
    Board ID:        n/a (MT_04A0110002)
    VSD:             n/a
    PSID:            MT_04A0110002

Assuming that BoIB is connected via Port 1, then the Port GUID is 00:02:c9:03:00:00:10:39.

Extracting the Port GUID - Method II

An alternative method for obtaining the port GUID involves booting the client machine via BoIB. This requires having a Subnet Manager running on one of the machines in the InfiniBand subnet. The 8 bytes can
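The conversion from the 16-digit Port GUID printed by flint to the colon-separated form used as the DHCP client identifier can be sketched in Python as follows. The function name is a hypothetical helper introduced for illustration; it is not part of MFT.

```python
import re


def port_guid_to_client_id(guid_hex: str) -> str:
    """Format a 16-hex-digit port GUID as 8 colon-separated bytes."""
    digits = guid_hex.strip().lower()
    if digits.startswith("0x"):
        digits = digits[2:]
    if not re.fullmatch(r"[0-9a-f]{16}", digits):
        raise ValueError(f"not an 8-byte GUID: {guid_hex!r}")
    # Split into two-digit bytes and join them with colons.
    return ":".join(digits[i:i + 2] for i in range(0, 16, 2))


if __name__ == "__main__":
    # Port1 GUID taken from the flint example in the text.
    print(port_guid_to_client_id("0002c90300001039"))
```

Applied to the Port1 GUID of the example, the helper yields the same value quoted in the text.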
231. tiator Machines

On initiator machines, you can manually perform the following steps:

1. modprobe ib_srp
2. ibsrpdm -c -d /dev/infiniband/umadX (to discover a new SRP target)
     umad0: port 1 of the first HCA
     umad1: port 2 of the first HCA
     umad2: port 1 of the second HCA
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered SCSI disks)

Example: Assume that you use port 1 of the first HCA in the system, i.e. mthca0.

    [root@lab104]# ibsrpdm -c -d /dev/infiniband/umad0
    id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4
    [root@lab104]# echo id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,dgid=fe800000000000000002c90200226cf5,pkey=ffff,service_id=0002c90200226cf4 > /sys/class/infiniband_srp/srp-mthca0-1/add_target

OR

- You can edit /etc/infiniband/openib.conf to load the srp driver and the srp HA daemon automatically, that is, set SRP_LOAD=yes and SRPHA_ENABLE=yes.
- To set up and use the high availability feature, you need the dm-multipath driver and the multipath tool.

Please refer to the OFED-1.x SRP user manual for more detailed instructions on how to enable and use the HA feature.

The following is an example of an SRP Target setup file:

    #!/bin/sh
    modprobe scsi
    modprobe scst
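The target description line printed by ibsrpdm -c is the same comma-separated key=value string that is written to add_target. The Python sketch below parses such a line and rebuilds it, which can help when validating the string before writing it. The function names are hypothetical, and the required key set is an assumption based only on the example shown above.

```python
# Keys as they appear in the ibsrpdm -c example above (assumed to be required).
SRP_TARGET_KEYS = ("id_ext", "ioc_guid", "dgid", "pkey", "service_id")


def parse_srp_target(line: str) -> dict:
    """Split 'key=value,key=value,...' into a dictionary."""
    fields = {}
    for item in line.strip().split(","):
        key, sep, value = item.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"malformed field: {item!r}")
        fields[key] = value
    return fields


def build_srp_target(fields: dict) -> str:
    """Rebuild the string to be written to the add_target sysfs file."""
    missing = [k for k in SRP_TARGET_KEYS if k not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return ",".join(f"{k}={fields[k]}" for k in SRP_TARGET_KEYS)


if __name__ == "__main__":
    sample = ("id_ext=0002c90200226cf4,ioc_guid=0002c90200226cf4,"
              "dgid=fe800000000000000002c90200226cf5,pkey=ffff,"
              "service_id=0002c90200226cf4")
    print(build_srp_target(parse_srp_target(sample)))
```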
232. tion -l): In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
- Using port names defined in the topology file (Tool option -n): This option refers to the source and destination ports by the names defined in the topology file. (Therefore, this option is relevant only if a topology file is specified to the tool.) In this mode, the tool uses the names to extract the port LIDs from the matched topology, then the tool operates as in the -l option.

12.3 ibdiagnet - IB Net Diagnostic

ibdiagnet scans the fabric using directed route packets and extracts all the available information regarding its connectivity and devices. It then produces the following files in the output directory (which is defined by the -o option described below).

12.3.1 SYNOPSIS

    ibdiagnet [-c <count>] [-v] [-r] [-o <out-dir>] [-t <topo-file>] [-s <sys-name>] [-i <dev-index>] [-p <port-num>] [-wt] [-pm] [-pc] [-P <<PM>=<Value>>] [-lw <1x|4x|1

OPTIONS

    -c <count>
    -t <topo-file>
    -s <sys-name>
    -i <dev-index>
    -p <port-num>
    -o <out-dir>
    -lw <1x|4x|12x>
    -ls <2.5|5|10>
    -pm
    -pc
    -P <PM=<Trash>>
233. [Screenshot: Boot Loader Settings, Section Management, Section Editor. Section Name: SUSE Linux Enterprise Server 10 SP2; Optional Kernel Command Line Parameter: resume=/dev/sda1 splash=silent showopts ibft_mode=off; Kernel Image: /boot/vmlinuz; Initial RAM Disk: /boot/initrd; Root Device: /dev/sda2; Vga Mode: 0x332.]

Step 36. If you wish to change additional settings, click the appropriate item and perform the changes, and click Accept when done.

Step 37. In the Confirm Installation window, click Install to start the installation. (See image below.)

[Screenshot: installation progress sidebar (Preparation, Installation, Configuration).]
234. tion operation is usually the responsibility of the BridgeX.
- eport_state_enforce: Bring the vNIC link indication up only when the corresponding External Port is up (default 0)
- lro_num: Number of LRO sessions per ring, or disable (0) (default 32)
- napi_weight: NAPI weight (default 32)
- max_tx_outs: Max outstanding TX packets (default 16)
- linear_small_pkt: Use linear buffer for small packets (default 1). If set, causes packet copy for small packets.
- net_admin: Network administration enabled (default 1). If disabled, no network-administered interfaces will be opened.

SDP

7.1 Overview

Sockets Direct Protocol (SDP) is an InfiniBand byte-stream transport protocol that provides TCP stream semantics. Capable of utilizing InfiniBand's advanced protocol offload capabilities, SDP can provide lower latency, higher bandwidth, and lower CPU utilization than IPoIB running some sockets-based applications.

SDP can be used by applications and improve their performance transparently (that is, without any recompilation). Since SDP has the same socket semantics as TCP, an existing application is able to run using SDP; the difference is that the application's TCP socket gets replaced with an SDP socket.

It is also possible to configure the driver to automatically translate TCP to SDP based on the source IP/port, the destination, or the application name. See Sectio
235. to fully discover the fabric
Failed to parse command line options
Failed to interact with IB fabric
Failed to use local device or local port
Failed to use Topology File
Failed to load required Package

ibdiagpath - IB diagnostic path

ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path. It utilizes device-specific health queries for the different devices along the path.

The way ibdiagpath operates depends on the addressing mode used on the command line. If directed route addressing is used (-d flag), the local node is the source node and the route to the destination port is known a priori. On the other hand, if LID-route (or by-name) addressing is employed, then the source and destination ports of a route are specified by their LIDs (or by the names defined in the topology file). In this case, the actual path from the local port to the source port, and from the source port to the destination port, is defined by means of Subnet Management Linear Forwarding Table queries of the switch nodes along that path. Therefore, the path cannot be predicted as it may change.

ibdiagpath should not be supplied with contradicting local ports by the -p and -d flags (see synopsis descriptions below). In other words, when ibdiagpath is provided with the options -p and -d together, the first port in the direct route must be equal to the one specified in th
236. to ping the iSCSI target, you can use CTRL-ALT-F2 and verify that the target is or is not accessible. To return to the (graphical) installation screen, press CTRL-ALT-F7.

Step 24. The iSCSI Initiator Discovery window will now request authentication to access the iSCSI target. Click Next to continue without authentication unless authentication is required.

[Screenshot: iSCSI Initiator Discovery window showing the No Authentication option and the Incoming Authentication and Outgoing Authentication username and password fields.]

Step 25. The iSCSI Initiator Discovery window will show the iSCSI target that got connected to. Note that the Connected column must indicate True for this target. Click Next. (See figure below.)

[Screenshot: iSCSI Initiator Discovery window listing the connected target.]
237. use_prio: Enable steering by VLAN priority on ETH ports (0/1, default 0) (bool)

E.2 mlx4_ib Parameters

debug_level: Enable debug tracing if > 0 (default 0)

E.3 mlx4_en Parameters

rss_xor: Use XOR hash function for RSS 0 (default is xor)
rss_mask: RSS hash type bitmask (default is 0xf)
num_lro: Number of LRO sessions per ring or disabled (0) (default is 32)
pfctx: Priority based Flow Control policy on TX[7:0]. Per priority bit mask (default is 0)
pfcrx: Priority based Flow Control policy on RX[7:0]. Per priority bit mask (default is 0)
inline_thold: Threshold for using inline data (default is 128)

E.4 mlx4_fc Parameters

log_exch_per_vhba: Max outstanding FC exchanges per virtual HBA (log). Default = 9 (int)
max_vhba_per_port: Max vHBAs allowed per port. Default = 2 (int)

Glossary

The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.

Channel Adapter (CA), Host Channel Adapter (HCA): An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).

HCA Card: A network adapter card based on an InfiniBand channel adapter device.

IB Devices: Integrated circuit implementing InfiniBand compliant communication
238. usr mpi gcc openmpi lt ompi ver gt bin mpirun np 2 peli DZ 09 07 83 51 25 74 48 22 69 309 438 535 828 1025 1164 1252 1303 1337 1 85 24 20 93 75 57 80 44 1347 1 L355 1352 75 TD Rev 1 50 Mellanox Technologies Mellanox Technologies Confidential 9 6 3 mca mpi leave pinned 1 hostfile home lt username gt cluster usr mpi gcc openmpi lt ompi ver gt tests osu benchmarks lt osu ver gt osu bw OSU MPI Bandwidth Test v3 0 Size Bandwidth MB s 1 1 12 2 Dis ZA 4 4 43 8 8 96 16 17 38 32 34 69 64 69 31 128 121 29 256 212 70 512 326 50 1024 461 78 2048 597 85 4096 543 06 8192 829 64 16384 LEB V22 32768 1386 08 65536 1520 89 131072 1622 73 262144 1659 33 524288 1679 36 1048576 1675 35 2097152 1668 89 4194304 1671 78 Latency Test Performance To run the OSU Latency test enter host1 usr mpi gcc openmpi lt ompi ver gt bin mpirun np 2 mca mpi leave pinned 1 hostfile home lt username gt cluster usr mpi gcc openmpi lt ompi ver gt tests osu benchmarks lt osu ver gt osu latency OSU MPI Latency Test v3 0 Size Latency us 0 1 23 1 1 37 2 TESS Mellanox Technologies Rev 1 50 Mellanox Technologies Confidential MPI 84 4 1 54 8 1 55 16 1 58 32 1059 64 109 128 Pe 78 256 2205 512 2 69 1024 35 79 2048 6 14 4096
239. B.3.1 Configuring the DHCP Server

B.3.1.1 For ConnectX Family Devices

When a ConnectX EN PXE client boots, it sends the DHCP server various information, including its DHCP hardware Ethernet address (MAC). The MAC address is 6 bytes long, and it is used to distinguish between the various DHCP sessions.

Extracting the MAC Address - Method I

Run the following commands:

Note: The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.

    host1# mst start
    host1# mst status

The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}. Use this device name to obtain the MAC address via a query command:

    flint -d <MST_DEVICE_NAME> q

Example with ConnectX EN as the network adapter device:

    host1# flint -d /dev/mst/mt25448_pci_cr0 q
    Image type:      ConnectX
    FW Version:      2.6.0
    Rom Info:        type=GPXE version=1.3.0 devid=25448
    Device ID:       25448
    Chip Revision:   A0
    Description:     Port1            Port2
    MACs:            0002c90000bb     0002c90000bc
    Board ID:        n/a (MT_0920110004)
    VSD:             n/a
    PSID:            MT_0920110004

Assuming that ConnectX EN PXE is connected via Port 1, then the MAC address is 00:02:c9:00:00:bb.

Extracting the MAC Address - Method II

The six bytes of the MAC address can be captured from the display upon the boot of the ConnectX device session, as shown in the figure below.

[Screenshot: boot-time display of the ConnectX device session.]
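Method I can be scripted once the flint query output has been captured as text. The Python sketch below pulls the per-port MAC addresses out of such output and formats them with colons. It assumes the output contains a line labelled "MACs:" followed by one 12-digit value per port, as in the example above; the function names are hypothetical helpers, not part of MFT.

```python
HEX_DIGITS = set("0123456789abcdef")


def format_mac(raw: str) -> str:
    """Format 12 hex digits as a colon-separated 6-byte MAC address."""
    digits = raw.strip().lower()
    if len(digits) != 12 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a 6-byte MAC: {raw!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def macs_from_flint_query(output: str) -> list:
    """Return the MACs (Port1 first) found on the 'MACs:' line, if any."""
    for line in output.splitlines():
        label, sep, rest = line.partition(":")
        if sep and label.strip() == "MACs":
            return [format_mac(token) for token in rest.split()]
    return []


if __name__ == "__main__":
    sample = """Image type:      ConnectX
Device ID:       25448
Description:     Port1            Port2
MACs:            0002c90000bb     0002c90000bc
PSID:            MT_0920110004"""
    print(macs_from_flint_query(sample))
```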
240. ver Ethernet (FCoE), Fibre Channel over InfiniBand (FCoIB) connected via Mellanox BridgeX gateways, Ethernet over InfiniBand (EoIB) connected via Mellanox BridgeX gateways, and 2.5 or 5.0 GT/s PCI Express 2.0 uplinks to servers.

All Mellanox network adapter cards are compatible with OpenFabrics-based RDMA protocols and software, and are supported with major operating system distributions.

Introduction to Mellanox VPI Adapters

Mellanox VPI adapters, which are based on Mellanox ConnectX / ConnectX-2 adapter devices, provide leading server and storage I/O performance with flexibility to support the myriad of communication protocols and network fabrics over a single device, without sacrificing functionality when consolidating I/O. For example, VPI-enabled adapters can support:

- Connectivity to 10, 20 and 40Gb/s InfiniBand switches, Ethernet switches, emerging Data Center Ethernet switches, InfiniBand to Ethernet and Fibre Channel Gateways, and Ethernet to Fibre Channel gateways
- Fibre Channel over Ethernet (FCoE) and Fibre Channel over InfiniBand (FCoIB)
- Ethernet over InfiniBand (EoIB)
- A single firmware image for dual-port ConnectX / ConnectX-2 adapters that supports independent access to different convergence networks (InfiniBand, Ethernet or Data Center Ethernet) per port
- A unified application programming interface with access to communication protocols including: Networking (TCP, IP, UDP, sockets), Storage (NFS, CI
241. ver into the kernel or by adding the device driver module into the initrd image and loading it.

The Ethernet driver requires loading the following modules in the specified order (see Section B.7.1 for an example):
- mlx4_core.ko
- mlx4_en.ko

B.7.1 Example: Adding an Ethernet Driver to initrd (Linux)

Prerequisites
1. The ConnectX EN PXE image is already programmed on the adapter card.
2. The DHCP server is installed and configured as described in Section B.3.1, "Configuring the DHCP Server", and connected to the client machine.
3. An initrd file.

To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with BXOFED that is appropriate for the kernel version the diskless image will run.

Adding the Ethernet Driver to the initrd File

Warning: The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process. Improper application of this procedure may prevent the diskless machine from booting.

Step 41. Back up your current initrd file.

Step 42. Make a new working directory and change to it.

    host1$ mkdir /tmp/initrd_en
    host1$ cd /tmp/initrd_en

Step 43. Normally, the initrd image is zipped. Extract it using the following command:

    host1$ gzip -dc <initrd image>
242. verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-V(ersion): (Optional) Show version info
-D(irect): (Optional) Use directed path address arguments. The path is a comma-separated list of out ports. Examples: '0' (self port); '0,1,2,1,4' (out via port 1, then 2, ...)
-G(uid): (Optional) Use GUID address argument. In most cases, it is the Port GUID. Example: '0x08f1040023'
-s <smlid>: (Optional) Use <smlid> as the target lid for SM/SA queries
-C <ca_name>: (Optional) Use the specified channel adapter or router
-P <ca_port>: (Optional) Use the specified port
-t <timeout_ms>: (Optional) Override the default timeout for the solicited MADs [msec]
<dest dr_path | lid | guid>: (Optional) Destination's directed path, LID, or GUID
<portnum>: (Optional) Destination's port number
<op> [<value>]: (Optional, default: query) Define the allowed port operations: enable, disable, reset, speed, and query

In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria:
1. The first ACTIVE port that is found.
2. If not found, the first port that is UP (physical link state is LinkUp).

Examples

1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID, 3 in this case) to obtain
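The two port-selection criteria listed above (first ACTIVE port, otherwise first port whose physical link state is LinkUp) can be sketched in Python as follows. This is an illustration of the stated rule only, not the utility's actual implementation; the Port type and its field names are hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Port(NamedTuple):
    ca_name: str
    port_num: int
    state: str       # logical link state, e.g. "Active", "Initialize", "Down"
    phys_state: str  # physical link state, e.g. "LinkUp", "Polling"


def choose_default_port(ports: Sequence[Port]) -> Optional[Port]:
    """Apply the two selection criteria in order; None if neither matches."""
    # Criterion 1: the first ACTIVE port that is found.
    for port in ports:
        if port.state.lower() == "active":
            return port
    # Criterion 2: the first port that is UP (physical link state is LinkUp).
    for port in ports:
        if port.phys_state.lower() == "linkup":
            return port
    return None


if __name__ == "__main__":
    candidates = [
        Port("mlx4_0", 1, "Down", "Polling"),
        Port("mlx4_0", 2, "Initialize", "LinkUp"),
        Port("mlx4_1", 1, "Active", "LinkUp"),
    ]
    print(choose_default_port(candidates))
```

Specifying -C and -P explicitly avoids depending on this default choice.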
243. vice firmware. Note that the binary files are distinct and do not affect each other. Mellanox's mlxburn tool is available for burning; however, it is not possible to burn the expansion ROM image by itself. Rather, both the firmware and expansion ROM images must be burnt simultaneously.

mlxburn requires the following items:

1. MST device name. After installing the MFT package, run:
       mst start
       mst status
   The device name will be of the form: /dev/mst/mt<dev_id>_pci{_cr0|conf0}
2. The firmware mlx file fw-<ID>-X_X_XXX.mlx
3. One of the expansion ROM binary files listed in Section B.1.3.

Firmware burning example

The following command burns a firmware image and an expansion ROM image to the Flash device of a ConnectX adapter card:

    mlxburn -dev /dev/mst/mt25448_pci_cr0 -fw fw-25408-X_X_XXX.mlx -conf MNEH28-XTC.ini -exp_rom ConnectX_EN_25448_ROM-X_X_XXX.rom

Preparing the DHCP Server in Linux Environment

The DHCP server plays a major role in the boot process by assigning IP addresses to ConnectX EN PXE clients and instructing the clients where to boot from.

When the ConnectX EN PXE boot session starts, the PXE firmware attempts to bring up a ConnectX network link (port). If it succeeds in bringing up a connected link, the PXE firmware communicates with the DHCP server. The DHCP server assigns an IP address to the PXE client and provides it with the location of the boot program.
244. w  Description
c  create an inventory file with all nodes, ports and paths
a  run all validation tests (expecting an input inventory)
v  only validate the given inventory file
s  run service registration, deregistration, and lease test
e  run event forwarding test
f  flood the SA with queries according to the stress mode
m  multicast flow
q  QoS info: dump VLArb and SLtoVL tables
t  run trap 64/65 flow (this flow requires running of external tool)
(Default: all flows except QoS)

-w, --wait: This option specifies the wait time for trap 64/65 in seconds. It is used only when running -f t (the trap 64/65 flow). (Default: 10 sec)

-d, --debug: This option specifies a debug option. These options are not normally needed. The number following -d selects the debug option to enable as follows:
    OPT  Description
    -d0  Ignore other SM nodes
    -d1  Force single threaded dispatching
    -d2  Force log flushing after each log message
    -d3  Disable multicast support

-m, --max_lid: This option specifies the maximal LID number to be searched for during inventory file build. (Default: 100)

-g, --guid: This option specifies the local port GUID value with which OpenSM should bind. OpenSM may be bound to 1 port
245. ware: fw-25408
- InfiniHost III Ex firmware: fw-25218 for Mem-Free cards, and fw-25208 for cards with memory
- InfiniHost III Lx firmware: fw-25204
- InfiniHost firmware: fw-23108

Note: For the list of supported architecture platforms, please refer to the BXOFED Release Notes file.

Required Disk Space for Installation
- 400 MB

Software Requirements

Operating System
- Linux operating system
Note: For the list of supported operating system distributions and kernels, please refer to the BXOFED Release Notes file.

Installer Privileges
- The installation requires administrator privileges on the target machine.

2.2 Downloading BXOFED

Step 1. ing that you can see ConnectX, ConnectX-2 or InfiniHost entries in the display. The following example shows a system with an installed Mellanox HCA:

    host1# lspci -v | grep Mellanox
    02:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0)

Step 2. tory.

Step 3. Extract the package using:

    tar xzvf BXOFED-X.X.X-Y.Y.Y.tgz

Step 4. following command and compare the result to the value provided on the download page:

    host1$ md5sum BXOFED-1.4.1-1.1.2.tgz

2.3 Installing BXOFED

The installation script, mlnxofedinstall, performs the following:
- Discovers the currently installed kernel
- Uninstalls any software
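The md5sum comparison in the download steps above can also be automated. The Python sketch below computes the MD5 digest of a downloaded file with the standard hashlib module and compares it with the value published on the download page. The function names are hypothetical, and the expected digest must be copied from the download page; none is assumed here.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 digest of a file as a lowercase hex string."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so that large tarballs are not loaded into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_md5(path: str, published_hex: str) -> bool:
    """Compare the file's digest with the value from the download page."""
    return md5_of_file(path) == published_hex.strip().lower()


if __name__ == "__main__":
    import sys
    # Usage (hypothetical script name): python check_md5.py <tarball> <published md5>
    if len(sys.argv) == 3:
        print("OK" if matches_published_md5(sys.argv[1], sys.argv[2]) else "MISMATCH")
```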
246. with the Open-FCoE project. The mlx4_fc module is designed to replace the original fcoe module and to allow using ConnectX hardware offloads.

Mellanox BXOFED also includes the following open-fcoe.org modules:
- libfc: Used by the mlx4_fc module to handle FC logic, such as fabric login and logout, remote port login and logout, fc-ns transactions, etc.
- fcoe: Implements FCoE fully in software. Will load instead of mlx4_fc to support the T11 frame format. Works on top of standard Ethernet NICs, including mlx4_en.

See http://www.open-fcoe.org for further information on the Open-FCoE project.

Installation

To install the FCoE feature, you should run the install script described in Section 2.3 with the --with-fc option.

FCoE Basic Usage

After loading the driver, userspace operations should create/destroy vHBAs on the required Ethernet interfaces. This can be done manually by issuing commands to the driver using simple sysfs operations. Alternatively, it can be handled automatically by the dcbxd daemon if the interface is connected to an FCoE switch supporting DCBX negotiation of the FCoE feature (e.g., Cisco Nexus).

Once a vHBA is instantiated on an Ethernet interface, it immediately attempts to log into the FC fabric. Provided that the FC fabric and FC targets are well configured, LUNs will map to SCSI disk devices (/dev/sdXXX).

vHBAs instantiated automatically by the dcbxd daemon are created on a VLAN 0 interface with VLAN priority
247. x Bonding driver, please refer to <kernel-source>/Documentation/networking/bonding.txt.

Currently not all bonding modes are supported. LACP is not supported.

6.7 Jumbo Frames

EoIB supports jumbo frames up to the InfiniBand limit of 4 Kbytes. To configure EoIB to work with jumbo frames, you need to configure the entire InfiniBand fabric to use 4K MTU. This includes configuring the SM, the InfiniBand switches, and the ConnectX HCA. To configure the HCA port to work with 4K MTU, set the mlx4_core module parameter set_4k_mtu. For how to configure the SM and switches, refer to their corresponding documentation.

6.8 Module Parameters

The mlx4_vnic driver supports the following module parameters. These parameters are intended to enable more specific configuration of the mlx4_vnic driver to customer needs. The mlx4_vnic driver is also affected by module parameters of other modules, such as set_4k_mtu of mlx4_core, but these will not be addressed here.

The available module parameters include:
- tx_rings_num: Number of TX rings used per vNic, use 0 for #cores (default 0)
- rx_rings_num: Number of RX rings, use 0 for #cores (default 0). The receive rings service all vNics that use the HCA port.
- mcast_create: Create multicast group during join request (default 0) (int). If set, mlx4_vnic will request the mcast group to be created and not only joined. The crea
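Module parameters such as set_4k_mtu are typically made persistent through an "options" line in the modprobe configuration. The Python sketch below renders such a line from a dictionary of parameters. It is only an illustration: the helper name is hypothetical, the value 1 for set_4k_mtu is an assumption (the text only says to set the parameter), and the configuration file location depends on the distribution.

```python
def modprobe_options_line(module: str, params: dict) -> str:
    """Render 'options <module> name=value ...' for a modprobe config file."""
    if not params:
        raise ValueError("at least one parameter is required")
    rendered = " ".join(f"{name}={value}" for name, value in params.items())
    return f"options {module} {rendered}"


if __name__ == "__main__":
    # Assumed value: 1 enables the 4K MTU setting on the HCA port.
    print(modprobe_options_line("mlx4_core", {"set_4k_mtu": 1}))
    # Parameter names taken from the mlx4_vnic list above; values are the defaults.
    print(modprobe_options_line("mlx4_vnic", {"tx_rings_num": 0, "rx_rings_num": 0}))
```

The driver has to be reloaded for a changed module parameter to take effect.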
248. Step 69. In the Optional Kernel Command Line Parameter field, append the following string to the end of the line: "ibft_mode=off" (include a space before the string). Click OK and then Finish to apply the change.

[Screenshot: Boot Loader Settings, Section Management, Section Editor. Section Name: SUSE Linux Enterprise Server 10 SP2; Optional Kernel Command Line Parameter: resume=/dev/sda1 splash=silent showopts ibft_mode=off; Kernel Image: /boot/vmlinuz; Initial RAM Disk: /boot/initrd; Root Device: /dev/sda2; Vga Mode: 0x332.]

Step 70. If you wish to change additional settings, click the appropriate item a
249. Change the speed of a port.

First query for current configuration:
> ibportstate -C mlx4_0 -D 0 1
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkState: Initialize
PhysLinkState: LinkUp
LinkWidthSupported: 1X or 4X
LinkWidthEnabled: 1X or 4X
LinkWidthActive: 4X
LinkSpeedSupported: 2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps
LinkSpeedActive: 5.0 Gbps

Now change the enabled link speed:
> ibportstate -C mlx4_0 -D 0 1 speed 2
ibportstate -C mlx4_0 -D 0 1 speed 2
Initial PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkSpeedEnabled: 2.5 Gbps or 5.0 Gbps
After PortInfo set:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkSpeedEnabled: 5.0 Gbps (IBA extension)

Show the new configuration:
> ibportstate -C mlx4_0 -D 0 1
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkState: Initialize
LinkSpeedEnabled: 5.0 Gbps (IBA extension)

12.9 ibroute
Applicable Hardware: InfiniBand switches
Description: Uses SMPs to display the forwarding tables, unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT), for the specified switch LID and the optio
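When scripting the ibportstate speed change shown in this excerpt, it can help to pull a single field out of the PortInfo output before and after the change. The following is a minimal sketch. The helper name portinfo_field is hypothetical, and the sketch assumes each output line has the form "Name:" followed by dots or spaces and then the value.

```shell
# Hypothetical helper: read PortInfo-style text on stdin and print the value
# of the field named in $1. Assumes lines of the form "Name:.....value" or
# "Name: value"; the dot or space padding after the colon is stripped.
portinfo_field() {
    sed -n "s/^$1:[. ]*//p"
}

# Example on captured text (no HCA needed):
printf 'LinkSpeedEnabled:................5.0 Gbps\n' | portinfo_field LinkSpeedEnabled
```

On a host with an HCA, the same helper could be fed directly, for example: ibportstate -C mlx4_0 -D 0 1 | portinfo_field LinkSpeedActive.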
250. y to 15 (highest).

--smkey <SM_Key value>
This option specifies the SMA's SM_Key (64 bits). This will affect SM authentication. Note that OpenSM version 3.2.1 and below used 1 as the default value in host byte order. This is now fixed, but you may need this option to interoperate with an old OpenSM running on a little-endian machine.

-r, --reassign_lids
This option causes OpenSM to reassign LIDs to all end nodes. Specifying -r on a running subnet may disrupt subnet traffic. Without -r, OpenSM attempts to preserve existing LID assignments, resolving multiple use of the same LID.

-R, --routing_engine <Routing engine names>
This option chooses the routing engine(s) to use instead of the Min Hop algorithm (default). Multiple routing engines can be specified, separated by commas, so that a specific ordering of routing algorithms will be tried if earlier routing engines fail. Supported engines: minhop, updn, file, ftree, lash, dor.

-A, --ucast_cache
This option enables the unicast routing cache and prevents routing recalculation (which is a heavy task in a large cluster) when no topology change was detected during the heavy sweep, or when the topology change does not require a new routing calculation, e.g. when one or more CAs/RTRs/leaf switches g
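The routing engine option above takes a comma-separated list that is tried in order. The following is a minimal sketch of building that argument and rejecting names outside the supported list quoted in the text. The helper name routing_engine_arg is hypothetical.

```shell
# Engines listed as supported in the excerpt above.
SUPPORTED_ENGINES="minhop updn file ftree lash dor"

# Hypothetical helper: join the given engine names with commas, in order.
# Returns non-zero (and prints nothing on stdout) if a name is not supported.
routing_engine_arg() {
    list=""
    for engine in "$@"; do
        case " $SUPPORTED_ENGINES " in
            *" $engine "*) ;;
            *) echo "unsupported routing engine: $engine" >&2
               return 1 ;;
        esac
        list="${list:+$list,}$engine"
    done
    printf '%s\n' "$list"
}

# Example: try Up/Down first, fall back to Min Hop.
routing_engine_arg updn minhop
```

The result would then be passed to OpenSM, for example: opensm -R "$(routing_engine_arg updn minhop)".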
251. y Up/Down routing algorithm instead of node GUIDs (format: <guid> <id> per line).

-X, --guid_routing_order_file <file name>
Set the order in which port GUIDs will be routed for the MinHop and Up/Down routing algorithms to the GUIDs provided in the given file (one to a line).

-o, --once
This option causes OpenSM to configure the subnet once, then exit. Ports remain in the ACTIVE state.

-s, --sweep <interval value>
This option specifies the number of seconds between subnet sweeps. Specifying -s 0 disables sweeping. Without -s, OpenSM defaults to a sweep interval of 10 seconds.

-t, --timeout <value>
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.

-maxsmps <number>
This option specifies the number of VL15 SMP MADs allowed on the wire at any one time. Specifying -maxsmps 0 allows unlimited outstanding SMPs. Without -maxsmps, OpenSM defaults to a maximum of 4 outstanding SMPs.

-console [off | local | socket | loopback]
This option brings up the OpenSM console (default off). Note that the socket and loopback options will only be available if OpenSM was built with --enable-console-socket.

-console-port <port>
Specify an alternate telnet port for the socket console (default 10000). Note that this option only appears if OpenSM was built with ena
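The defaults quoted above (sweep interval 10 seconds, transaction timeout 200 milliseconds, 4 outstanding SMPs) can be made explicit when assembling an OpenSM command line. The following is a minimal sketch that only prints the command. The helper name opensm_cmdline is hypothetical, and the single-dash spelling -maxsmps is an assumption, because the dashes did not survive in this excerpt.

```shell
# Hypothetical helper: print an opensm command line with explicit sweep,
# timeout, and maxsmps values. Omitted arguments fall back to the defaults
# stated in the excerpt above (10 s, 200 ms, 4 SMPs).
# Assumption: the option is spelled "-maxsmps" in this OpenSM version.
opensm_cmdline() {
    sweep="${1:-10}"
    timeout="${2:-200}"
    maxsmps="${3:-4}"
    printf 'opensm -s %s -t %s -maxsmps %s\n' "$sweep" "$timeout" "$maxsmps"
}

# Example: sweep every 30 seconds, keep the other defaults.
opensm_cmdline 30
```

Printing the command instead of running it makes it easy to review the values first. Per the text, a value of 0 disables sweeping or timeouts, or allows unlimited outstanding SMPs.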