MLNX_EN for Linux User Manual
(Table 15, continued)

                Hyper-Threading (a): HPC: Disabled; Data Centers: Enabled
                CPU frequency select: Max performance
Memory          Memory speed: Max performance
                Memory channel mode: Independent
                Node Interleaving: Disabled (NUMA)
                Channel Interleaving: Enabled
                Thermal Mode: Performance

a. Hyper-Threading can increase the message rate for multi-process applications by providing more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when Hyper-Threading is enabled.

6.2.3.3 Intel Nehalem/Westmere Processors

The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.

Table 16 - Recommended BIOS Settings for Intel Nehalem/Westmere Processors

BIOS Option     Values
General         Operating Mode / Power profile: Maximum Performance
Processor       C-States: Disabled
                Turbo mode: Disabled
                Hyper-Threading (a): Disabled. Recommended for latency and message-rate sensitive applications.
                CPU frequency select: Max performance
Memory          Memory speed: Max performance
                Memory channel mode: Independent
                Node Interleaving: Disabled (NUMA)
                Channel Interleaving: Enabled
                Thermal Mode: Performance

a. Hyper-Threading can increase the message rate for multi-process applications by providing more logical cores. It might increase the latency of a single process, due to the lower frequency of a single logical core when Hyper-Threading is enabled.
When running ETH ports on VFs, the ports may be configured to simply pass through packets as-is from the VFs (VLAN Guest Tagging, or VGT), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging, or VST). In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN tag inserted, and incoming packets will have the VLAN tag removed. Any VLAN-tagged packets sent by the VF are silently dropped. The default behavior is VGT.

The feature may be controlled on the Hypervisor from user space via iproute2/netlink:

  ip link set dev <DEVICE> [group DEVGROUP] [up|down] [vf NUM [mac LLADDR] [vlan VLANID [qos VLAN-QOS]] [spoofchk on|off]]

Use:

  ip link set dev <PF device> vf NUM vlan <vlan_id> [qos <qos>]

where:
• NUM = 0..max-vf-num
• vlan = 0..4095 (4095 means "set VGT")
• qos = 0..7

For example:
• ip link set dev eth2 vf 2 vlan <vlan_id> qos 3 : sets VST mode for VF #2 belonging to PF eth2, with qos = 3
• ip link set dev eth2 vf 2 vlan 4095 : sets the mode for VF 2 back to VGT

5.4.7.2 Additional Ethernet VF Configuration Options

Guest MAC configuration: By default, guest MAC addresses are configured to be all zeroes. In the MLNX_EN guest driver, if a guest sees a zero MAC, it generates a random MAC address for itself. If the administrator wishes the guest to always start up with the same MAC, he/she should configure the guest MACs before the guest driver comes up.
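A quick way to confirm that a VST assignment took effect is to list the PF's VF table with iproute2. This is a minimal sketch; eth2, VF 2 and VLAN 10 are placeholder values, not taken from this manual:

  # assign VLAN 10 with egress priority 3 to VF 2 (example values)
  ip link set dev eth2 vf 2 vlan 10 qos 3
  # the "vf 2" line of the PF output should now report the VLAN and QoS
  ip link show dev eth2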
Enable the TCP selective acks option for better throughput:

  sysctl -w net.ipv4.tcp_sack=1

Increase the maximum length of processor input queues:

  sysctl -w net.core.netdev_max_backlog=250000

Increase the TCP maximum and default buffer sizes using setsockopt():

  sysctl -w net.core.rmem_max=4194304
  sysctl -w net.core.wmem_max=4194304
  sysctl -w net.core.rmem_default=4194304
  sysctl -w net.core.wmem_default=4194304
  sysctl -w net.core.optmem_max=4194304

Increase memory thresholds to prevent packet dropping:

  sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"

Enable low latency mode for TCP:

  sysctl -w net.ipv4.tcp_low_latency=1

6.3.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance

The following changes are recommended for improving IPv6 traffic performance:

Disable the TCP timestamps option for better CPU utilization:

  sysctl -w net.ipv4.tcp_timestamps=0

Enable the TCP selective acks option for better CPU utilization:

  sysctl -w net.ipv4.tcp_sack=1

6.3.3 Preserving Your Performance Settings after a Reboot

To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.conf as follows:

  <sysctl_name1> = <value1>
  <sysctl_name2> = <value2>
  <sysctl_name3> = <value3>
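As a convenience, the tuning values listed earlier in this section can also be applied in one pass from the shell. This is a minimal sketch (plain bash; the list is abbreviated and can be extended with any of the settings above):

  for kv in net.ipv4.tcp_timestamps=0 net.ipv4.tcp_sack=1 \
            net.core.netdev_max_backlog=250000 \
            net.core.rmem_max=4194304 net.core.wmem_max=4194304 \
            net.ipv4.tcp_low_latency=1; do
      sysctl -w "$kv"
  done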
Examples:

• ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2

All packets that contain the above destination MAC address are to be steered into rx ring 2 (its underlying QP), with priority 5 (within the ethtool domain).

• ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2

All packets that contain the above source IP address and destination port are to be steered into rx ring 2. When the destination MAC is not given, the user's destination MAC is filled automatically.

• ethtool -u eth5

Shows all of ethtool's steering rules.

When configuring two rules with the same priority, the second rule will overwrite the first one, so this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires support from both the ethtool in the user space and in the kernel (v2.6.28).

mlx4 Driver Support

The mlx4 driver supports only a subset of the flow specification that the ethtool API defines. Asking for an unsupported flow specification will result in an "invalid value" failure.

The following are the flow-specific parameters:

Table 6 - Flow Specific Parameters

            ether   tcp4/udp4                             ip4
Mandatory   dst     src-ip, dst-ip, src-port, dst-port    src-ip, dst-ip
Optional    vlan    vlan                                  vlan

RFS

RFS is an in-kernel logic responsible for load balancing between CPUs by attaching flows to the CPUs that are used by the flow's owner applications. This domain allows the RFS mechanism to use the flow steering infrastructure to support the RFS logic, by implementing ndo_rx_flow_steer, which in turn calls the underlying flow steering mechanism with the RFS domain.
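Since the rule table is keyed by location, a rule can also be removed by the same location it was inserted at; a small sketch, with eth5 and location 5 carried over from the examples above:

  # delete the rule previously installed at location 5
  ethtool -U eth5 delete 5
  # confirm it is gone
  ethtool -u eth5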
IB Cluster/Fabric/Subnet: A set of IB devices connected by IB cables.

In-Band: A term assigned to administration activities traversing the IB connectivity only.

Local Identifier (LID): An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.

Local Device/Node: The IB Host Channel Adapter (HCA) card installed on the machine running IBDIAG tools.

Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.

Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.

Multicast Forwarding Tables: A table that exists in every switch, providing the list of ports to forward a received multicast packet. The table is organized by MLID.

Network Interface Card (NIC): A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.

Standby Subnet Manager: A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM. See Subnet Manager.

Subnet Administrator (SA): An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
Note: If SR-IOV is not supported by the server, the machine might not come out of boot/load.

Step 10. Load the driver and verify that SR-IOV is supported. Run:

  lspci | grep Mellanox
  03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
  03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
  03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
  03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
  03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
  03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)

Where:
• 03:00.0 represents the Physical Function
• 03:00.X represents the Virtual Functions connected to the Physical Function

5.4.3 Enabling SR-IOV and Para-Virtualization on the Same Setup

To enable SR-IOV and Para-Virtualization on the same setup:

Step 1. Create a bridge:

  vim /etc/sysconfig/network-scripts/ifcfg-bridge0

  DEVICE=bridge0
  TYPE=Bridge
  IPADDR=12.195.15.1
  NETMASK=255.255.0.0
  BOOTPROTO=static
  ONBOOT=yes
  NM_CONTROLLED=no
  DELAY=0

Step 2. Change the related interface (in the example below, bridge0 is created over eth5):

  DEVICE=eth5
  BOOTPROTO=none
  STARTMODE=on
  HWADDR=00:02:c9:2e:66:52
  TYPE=Ethernet
  NM_CONTROLLED=no
  ONBOOT=yes
  BRIDGE=bridge0
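Before restarting the network in the next step, it can save debugging time to confirm that the bridge actually enslaved the interface. A minimal check, assuming the bridge-utils package is installed (bridge0/eth5 as configured above):

  # bridge0 should appear with eth5 listed under "interfaces"
  brctl show bridge0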
…by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter and as an Ethernet NIC. To accommodate the two flavors, the driver is split into modules: mlx4_core, mlx4_en and mlx4_ib. (Note: mlx4_ib is not part of this package.)

• mlx4_core: Handles low-level functions like device initialization and firmware command processing. Also controls resource allocation so that the InfiniBand, Ethernet and FC functions can share a device without interfering with each other.

• mlx4_en: Handles Ethernet-specific functions and plugs into the netdev mid-layer.

• mstflint: An application to burn a firmware binary image.

• Software modules: Sources of all software modules (under the conditions mentioned in the modules' LICENSE files).

Table 5 - MLNX_EN Package Content (continued)

Components      Description
Documentation   Release Notes, README

2 Driver Installation

2.1 Software Dependencies

To install the driver software, kernel sources must be installed on the machine.

The MLNX_EN driver cannot coexist with OFED software on the same machine. Hence, when installing MLNX_EN, all OFED packages should be removed (this is done by the install script).

2.2 Installing the Driver

Step 1. Download the driver package from the Mellanox site: http://www.mellanox.com/content/pages.php?pg=products_dyn&prod
  6.3.7 IRQ Affinity
  6.3.8 Tuning Multi-Threaded IP Forwarding

List of Tables

Table 1: Document Revision History
Table 2: Abbreviations and Acronyms
Table 3: Glossary
Table 4: Reference Documents
Table 5: MLNX_EN Package Content
Table 6: Flow Specific Parameters
Table 7: Port IN Counters
Table 8: Port OUT Counters
Table 9: Port VLAN Priority Tagging (where <i> is in the range 0..7)
Table 10: Port Pause (where <i> is in the range 0..7)
Table 11: VPort Statistics (where <i>=<empty_string> is the PF, and <i> ranges over 1..NumOfVf per VF)
Table 12: SW Statistics
Table 13: Per-Ring SW Statistics (where <i> is the ring ID, per configuration)
Table 14: Recommended PCIe Configuration
Table 15: Recommended BIOS Settings for Intel Sandy Bridge Processors
Table 16: Recommended BIOS Settings for Intel Nehalem/Westmere Processors
Table 17: Recommended BIOS Settings for AMD Processors
Table 8 - Port OUT Counters (continued)

Counter                     Description
tx_lt_64_bytes_packets      Number of transmitted frames of 64 octets or less
tx_127_bytes_packets        Number of transmitted 65-to-127 octet frames
tx_255_bytes_packets        Number of transmitted 128-to-255 octet frames
tx_511_bytes_packets        Number of transmitted 256-to-511 octet frames
tx_1023_bytes_packets       Number of transmitted 512-to-1023 octet frames
tx_1518_bytes_packets       Number of transmitted 1024-to-1518 octet frames
tx_1522_bytes_packets       Number of transmitted 1519-to-1522 octet frames
tx_1548_bytes_packets       Number of transmitted 1523-to-1548 octet frames
tx_gt_1548_bytes_packets    Number of transmitted frames of 1549 octets or greater

Table 9 - Port VLAN Priority Tagging (where <i> is in the range 0..7)

Counter                Description
rx_prio_<i>_packets    Total packets successfully received with priority i
rx_prio_<i>_bytes      Total bytes in successfully received packets with priority i
rx_novlan_packets      Total packets successfully received with no VLAN priority
rx_novlan_bytes        Total bytes in successfully received packets with no VLAN priority
tx_prio_<i>_packets    Total packets successfully transmitted with priority i
tx_prio_<i>_bytes      Total bytes in successfully transmitted packets with priority i
tx_novlan_packets      Total packets successfully transmitted with no VLAN priority
tx_novlan_bytes        Total bytes in successfully transmitted packets with no VLAN priority
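These per-priority counters are exposed through ethtool like the rest of the statistics, so a single priority can be watched in isolation. A minimal sketch, where eth2 and priority 3 are placeholder choices:

  # show only the priority-3 VLAN tagging counters
  ethtool -S eth2 | grep prio_3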
mlnx_qos enables the administrator to:

• Set the UP to TC mapping
• Assign a transmission algorithm to each TC (strict or ETS)
• Set a minimal BW guarantee to ETS TCs
• Set a rate limit to TCs (for an unlimited ratelimit, set the ratelimit to 0)

Usage:

  mlnx_qos -i <interface> [options]

Options

[Figure: mlnx_qos options listing]

Get Current Configuration

[Figure: mlnx_qos output showing the current configuration]

Set ratelimit: 3 Gbps for tc0, 4 Gbps for tc1 and 2 Gbps for tc2

[Figure: mlnx_qos ratelimit example]

Configure QoS: map UP 0,7 to tc0, UP 1,2,3 to tc1 and UP 4,5,6 to tc2; set tc0 and tc1 as ets and tc2 as strict; divide ets: 30% for tc0 and 70% for tc1:

  mlnx_qos -i eth3 -s ets,ets,strict -p 0,1,1,1,2,2,2 -t 30,70

  tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
        up: 0
                skprio: 0
                skprio: 1
                skprio: 2 (tos: 8)
                skprio: 3
                skprio: 4 (tos: 24)
                skprio: 5
                skprio: 6 (tos: 16)
                skprio: 7
                skprio: 8
                skprio: 9
                skprio: 10
                skprio: 11
                skprio: 12
                skprio: 13
                skprio: 14
                skprio: 15
        up: 7
  tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
        up: 1
        up: 2
        up: 3
  tc: 2 ratelimit: 2 Gbps, tsa: strict
        up: 4
        up: 5
        up: 6

5.1.5.2 tc and tc_wrap.py

The tc tool is used to set up the sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs.
It can be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the original outgoing packet data, including all headers prepended down to and including the link layer, the scm_timestamping control message, and a sock_extended_err control message with ee_errno==ENOMSG and ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket.

Note: When time stamping is enabled, VLAN stripping is disabled. For more information, please refer to Documentation/networking/timestamping.txt in kernel.org.

5.3 Flow Steering

Flow Steering is applicable to the mlx4 driver only.

Flow steering is a new model which steers network flows, based on flow specifications, to specific QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules can be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses a corresponding terminology of a flow attribute (ibv_flow_attr), defined by a combination of specifications (struct ibv_flow_spec*).
…has nothing more to transmit, will the next highest TC be considered. Non-strict-priority TCs will be considered last to transmit.

This property is extremely useful for low-latency, low-bandwidth traffic: traffic that needs to get immediate service when it exists, but is not of high enough volume to starve other transmitters in the system.

5.1.4.2 Minimal Bandwidth Guarantee (ETS)

After servicing the strict-priority TCs, the amount of bandwidth (BW) left on the wire may be split among the other TCs, according to a minimal guarantee policy. If, for instance, TC0 is set to an 80% guarantee and TC1 to 20% (the TCs' sum must be 100%), then the BW left after servicing all strict-priority TCs will be split according to this ratio. Since this is a minimal guarantee, there is no maximum enforcement. This means, in the same example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.

5.1.4.3 Rate Limit

Rate limit defines a maximum bandwidth allowed for a TC. Please note that a 10% deviation from the requested values is considered acceptable.

5.1.5 Quality of Service Tools

5.1.5.1 mlnx_qos

mlnx_qos is a centralized tool used to configure the QoS features of the local host. It communicates directly with the driver, and thus does not require setting up a DCBX daemon on the system.

The mlnx_qos tool enables the system administrator to inspect the current QoS mappings and configuration. The tool will also display the maps configured by the tc and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings.
Different systems may have different features; thus, some of the recommendations below may not be applicable.

6.2.1 PCI Express (PCIe) Capabilities

Table 14 - Recommended PCIe Configuration

PCIe Generation     3.0
Speed               8GT/s
Width               x8 or x16
Max Payload size    256
Max Read Request    4096

Note: For ConnectX-3 based network adapters (40GbE Ethernet adapters), it is recommended to use an x16 PCIe slot to benefit from the additional buffers allocated by the CPU.

6.2.2 Memory Configuration

For high performance, it is recommended to use the highest memory speed with the fewest DIMMs, and to populate all memory channels for every CPU installed.

For further information, please refer to your vendor's memory configuration instructions or memory configuration tool, available online.

6.2.3 Recommended BIOS Settings

Note: These performance optimizations may result in higher power consumption.

6.2.3.1 General

Set BIOS power management to Maximum Performance.

6.2.3.2 Intel Sandy Bridge Processors

The following table displays the recommended BIOS settings in machines with Intel (code name Sandy Bridge) based processors.

Table 15 - Recommended BIOS Settings for Intel Sandy Bridge Processors

BIOS Option     Values
General         Operating Mode / Power profile: Maximum Performance
Processor       C-States: Disabled
                Turbo mode: Enabled
• PTP v1, UDP, any kind of event packet: HWTSTAMP_FILTER_PTP_V1_L4_EVENT
• PTP v1, UDP, Sync packet: HWTSTAMP_FILTER_PTP_V1_L4_SYNC
• PTP v1, UDP, Delay_req packet: HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ
• PTP v2, UDP, any kind of event packet: HWTSTAMP_FILTER_PTP_V2_L4_EVENT
• PTP v2, UDP, Sync packet: HWTSTAMP_FILTER_PTP_V2_L4_SYNC
• PTP v2, UDP, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ
• 802.AS1, Ethernet, any kind of event packet: HWTSTAMP_FILTER_PTP_V2_L2_EVENT
• 802.AS1, Ethernet, Sync packet: HWTSTAMP_FILTER_PTP_V2_L2_SYNC
• 802.AS1, Ethernet, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ
• PTP v2/802.AS1, any layer, any kind of event packet: HWTSTAMP_FILTER_PTP_V2_EVENT
• PTP v2/802.AS1, any layer, Sync packet: HWTSTAMP_FILTER_PTP_V2_SYNC
• PTP v2/802.AS1, any layer, Delay_req packet: HWTSTAMP_FILTER_PTP_V2_DELAY_REQ

Note: for receive-side time stamping, currently only HWTSTAMP_FILTER_NONE and HWTSTAMP_FILTER_ALL are supported.

5.2.2 Getting Time Stamping

Once time stamping is enabled, the time stamp is placed in the socket's ancillary data. recvmsg() can be used to get this control message for regular incoming packets. For send time stamps, the outgoing packet is looped back to the socket's error queue with the send time stamp(s) attached.
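As an aside, the tx_type/rx_filter values listed above can be exercised from the shell without writing a test program. This is a minimal sketch, assuming the hwstamp_ctl utility from the linuxptp package is installed (eth2 is a placeholder interface name):

  # enable TX time stamping (HWTSTAMP_TX_ON = 1) and
  # time stamp all incoming packets (HWTSTAMP_FILTER_ALL = 1)
  hwstamp_ctl -i eth2 -t 1 -r 1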
Run:

  mstflint -dev <PCI device> dc > <ini device file>

Step 4. Edit the ini file that you found in the previous step, and add the following lines to the [HCA] section in order to support 63 VFs:

  ;; SRIOV enable
  total_vfs = 63
  num_pfs = 1
  sriov_en = true

Note: Some servers might have issues accepting 63 Virtual Functions or more. In such a case, please set the number of total_vfs to any required value.

Step 5. Create a binary image using the modified ini file:

Step a. Download the Mellanox Firmware Tools (www.mellanox.com > Products > Adapter IB/VPI SW > Firmware Tools) and install the package.

Step b. Run:

  mlxburn -fw <fw-name>.mlx -conf <modified ini file> -wrimage <file name>.bin

The file <file name>.bin is a firmware binary file with SR-IOV enabled that has 63 VFs. It can be spread across all machines, and can be burnt using mstflint (which is part of the bundle), using the following command:

  mstflint -dev <PCI device> -image <file name>.bin b

Note: After burning the firmware, the machine must be rebooted. If the driver is only restarted, the machine may hang, and a reboot using power OFF/ON might be required.

5.4.7 Ethernet Virtual Function Configuration when Running SR-IOV

5.4.7.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
Alternatively, you can download the current Mellanox Firmware Tools package (MFT) from www.mellanox.com > Products > Adapter IB/VPI SW > Firmware Tools. The tools package to download is "MFT_SW for Linux" (tarball name: mft-X.X.X.tgz).

For help in identifying your adapter card, please visit http://www.mellanox.com/content/pages.php?pg=firmware_FW_identification

4.2 Updating Adapter Card Firmware

Using a card-specific binary firmware image file, enter the following command:

  > mstflint -d <pci device> -i <image_name>.bin b

For burning firmware using the MFT package, please check the MFT user's manual under www.mellanox.com > Products > Adapter IB/VPI SW > Firmware Tools.

Note: After burning new firmware to an adapter card, reboot the machine so that the new firmware can take effect.

5 Driver Features

5.1 Quality of Service

Quality of Service (QoS) is a mechanism of assigning a priority to a network flow (socket, rdma_cm connection), and managing its guarantees, limitations and its priority over other flows. This is accomplished by mapping the user's priority to a hardware TC (traffic class) through a 2/3 stage process. The TC is assigned the QoS attributes, and the different flows behave accordingly.

5.1.1 Mapping Traffic to Traffic Classes

Mapping traffic to TCs consists of several actions.
Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly, therefore it is not necessary to add it.

5.4.5 Uninstalling the SR-IOV Driver

To uninstall the SR-IOV driver, perform the following:

Step 1. For Hypervisors, detach all the Virtual Functions (VFs) from all the Virtual Machines (VMs), or stop the Virtual Machines that use the Virtual Functions. Please be aware: stopping the driver when there are VMs that use the VFs will cause the machine to hang.

Step 2. Run the script below. Please be aware: uninstalling the driver deletes all of the driver's files, but does not unload the driver.

  /sbin/mlnx_en_uninstall.sh
  MLNX_EN uninstall done

Step 3. Restart the server.

5.4.6 Burning Firmware with SR-IOV

The following procedure explains how to create a binary image with SR-IOV enabled that has 63 VFs. However, the number of VFs varies according to the working mode requirements.

To burn the firmware:

Step 1. Verify you have MFT installed on your machine.

Step 2. Enter the firmware directory, according to HCA type (e.g. ConnectX-3). The path is: mlnx_en/firmware/<device>/<FW version>

Step 3. Find the ini file that contains the HCA's PSID. Run:

  mstflint -d 03:00.0 dc | grep PSID
  PSID: MT_1090110019

If such an ini file cannot be found in the firmware directory, you can dump the configuration file from the device using mstflint.
…00:1a:00.0

To query stateless offload status:

  > ethtool -k eth<x>

To set stateless offload status:

  > ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off]

To query interrupt coalescing settings:

  > ethtool -c eth<x>

To enable/disable adaptive interrupt moderation:

  > ethtool -C eth<x> adaptive-rx on|off

By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the moderation time to the traffic pattern.

To set the values for packet rate limits and for the moderation time high and low:

  > ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]

Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.

To set interrupt coalescing settings when adaptive moderation is disabled:

  > ethtool -C eth<x> [rx-usecs N] [rx-frames N]

Note: usec settings correspond to the time to wait after the last packet is sent/received before triggering an interrupt.

To query pause frame settings:

  > ethtool -a eth<x>

To set pause frame settings:

  > ethtool -A eth<x> [rx on|off] [tx on|off]

To query ring size values:

  > ethtool -g eth<x>
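As a concrete instance of the offload queries above, the state of a single feature can be checked before and after toggling it. A minimal sketch, with eth2 as a placeholder interface name:

  # check the current LRO state, flip it, and confirm
  ethtool -k eth2 | grep -i large
  ethtool -K eth2 lro off
  ethtool -k eth2 | grep -i large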
6.3.5 Interrupt Moderation

Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates, and modifies the Rx interrupt moderation settings accordingly.

To manually set Tx and/or Rx interrupt moderation, use the ethtool utility. For example, the following commands first show the current (default) setting of interrupt moderation on the interface eth1, then turn off Rx interrupt moderation, and last show the new setting.

  > ethtool -c eth1
  Coalesce parameters for eth1:
  Adaptive RX: on TX: off
  pkt-rate-low: 400000
  pkt-rate-high: 450000
  rx-usecs: ...
  rx-frames: 88
  rx-usecs-irq: 0
  rx-frames-irq: 0

  > ethtool -C eth1 adaptive-rx off rx-usecs 0 rx-frames 0

  > ethtool -c eth1
  Coalesce parameters for eth1:
  Adaptive RX: off TX: off
  pkt-rate-low: 400000
  pkt-rate-high: 450000
  rx-usecs: 0
  rx-frames: 0
  rx-usecs-irq: 0
  rx-frames-irq: 0

6.3.6 Tuning for NUMA Architecture

6.3.6.1 Tuning for Intel Sandy Bridge Platform

The Intel Sandy Bridge processor has an integrated PCI express controller. Thus every PCIe adapter is connected directly to a NUMA node.

On a system with more than one NUMA node, performance will be better when using the local NUMA node to which the PCIe adapter is connected.

In order to identify which NUMA node is the adapter's node, the system BIOS should support ACPI SLIT.
The guest MAC may be configured by using:

  ip link set dev <PF device> vf <NUM> mac <LLADDR>

For legacy guests, which do not generate random MACs, the administrator should always configure their MAC addresses via ip link, as above.

Spoof checking

Spoof checking is currently available only on upstream kernels newer than 3.1:

  ip link set dev <PF device> vf <NUM> spoofchk [on | off]

5.5 Ethernet Performance Counters

Counters are used to provide information about how well an operating system, an application, a service, or a driver is performing. The counter data helps determine system bottlenecks and fine-tune the system and application performance. The operating system, network and devices provide counter data that an application can consume to provide users with a graphical view of how well the system is performing.

The counter index is a QP attribute given in the QP context. Multiple QPs may be associated with the same counter set. If multiple QPs share the same counter, its value represents the cumulative total.

• ConnectX-3 supports 127 different counters, which are allocated as follows:
  • 4 counters reserved for the PF: 2 counters for each port
  • 2 counters reserved for each VF: 1 counter for each port
  • All other counters, if they exist, are allocated on demand

• Counters are available only through sysfs, located under /sys/class/infiniband/mlx4_<x>/ports/...
Mellanox Technologies

MLNX_EN for Linux User Manual

Rev 2.1-1.0.0

www.mellanox.com

NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
LSB: Least significant byte
lsb: Least significant bit
MSB: Most significant byte
msb: Most significant bit
NIC: Network Interface Card
SW: Software
VPI: Virtual Protocol Interconnect
IPoIB: IP over InfiniBand
PFC: Priority Flow Control
PR: Path Record
RDS: Reliable Datagram Sockets
RoCE: RDMA over Converged Ethernet

Table 2 - Abbreviations and Acronyms (Sheet 2 of 2)

Abbreviation/Acronym    Whole Word/Description
SDP                     Sockets Direct Protocol
SL                      Service Level
SRP                     SCSI RDMA Protocol
MPI                     Message Passing Interface
EoIB                    Ethernet over InfiniBand
QoS                     Quality of Service
ULP                     Upper Level Protocol
VL                      Virtual Lane
vHBA                    Virtual SCSI Host Bus Adapter
uDAPL                   User Direct Access Programming Library

Glossary

The following is a list of concepts and terms related to InfiniBand in general, and to Subnet Managers in particular. It is included here for ease of reference, but the main reference remains the InfiniBand Architecture Specification.

Table 3 - Glossary (Sheet 1 of 2)

Channel Adapter (CA), Host Channel Adapter (HCA): An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).

HCA Card: A network adapter card based on an InfiniBand channel adapter device.

IB Devices: Integrated circuit implementing InfiniBand-compliant communication.
This is a per-VLAN mapping. If the underlying device is not a VLAN device, the tc command is used. In this case, even though the tc manual states that the mapping is from the sk_prio to the TC number, the mlx4_en driver interprets this as a sk_prio-to-UP mapping.

Mapping the sk_prio to the UP is done by using:

  tc_wrap.py -i <dev name> -u 0,1,2,3,4,5,6,7

The UP is then mapped to the TC as configured by the mlnx_qos tool, or by the lldpad daemon if DCBX is used.

Socket applications can use setsockopt(SK_PRIO, value) to directly set the sk_prio of the socket. In this case, the ToS to sk_prio fixed mapping is not needed. This allows the application and the administrator to utilize more than the 4 values possible via ToS.

In the case of a VLAN interface, the UP obtained according to the above mapping is also used in the VLAN tag of the traffic.

5.1.3 Map Priorities with tc_wrap.py/mlnx_qos

A network flow that can be managed by QoS attributes is described by a User Priority (UP). A user's sk_prio is mapped to a UP, which in turn is mapped into a TC.

Indicating the UP:

• When the user uses sk_prio, it is mapped into a UP by the tc tool. This is done by the tc_wrap.py tool, which gets a list of up to 16 comma-separated UPs and maps the sk_prio to the specified UP. For example, "tc_wrap.py -i eth0 -u 1,5" maps sk_prio 0 of the eth0 device to UP 1, and sk_prio 1 to UP 5.
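Putting the two tools together, a full sk_prio-to-TC chain can be configured from the shell. A minimal sketch with placeholder values (eth0, and the UP/TC assignments, are arbitrary choices, not taken from this manual):

  # map sk_prio values 0..7 one-to-one onto UPs 0..7
  tc_wrap.py -i eth0 -u 0,1,2,3,4,5,6,7
  # then map UPs 0-3 to tc0 and UPs 4-7 to tc1
  mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1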
5. Disable adaptive interrupt moderation, and set static values, using:

  ethtool -C <interface> adaptive-rx off
2.5 Uninstalling the Driver

Chapter 3 Ethernet Driver Usage and Configuration

Chapter 4 Firmware Programming
4.1 Installing Firmware Tools
4.2 Updating Adapter Card Firmware

Chapter 5 Driver Features
5.1 Quality of Service
  5.1.1 Mapping Traffic to Traffic Classes
  5.1.2 Plain Ethernet Quality of Service Mapping
  5.1.3 Map Priorities with tc_wrap.py/mlnx_qos
  5.1.4 Quality of Service Properties
  5.1.5 Quality of Service Tools
5.2 Time Stamping Service
  5.2.1 Enabling Time Stamping
  5.2.2 Getting Time Stamping
5.3 Flow Steering
  5.3.1 Enable/Disable Flow Steering
  5.3.2 Flow Domains and Priorities
5.4 Single Root IO Virtualization (SR-IOV)
  5.4.1 System Requirements
  5.4.2 Setting Up SR-IOV
6.2.3.4 AMD Processors

The following table displays the recommended BIOS settings in machines with AMD-based processors.

Table 17 - Recommended BIOS Settings for AMD Processors

BIOS Option     Values
General         Operating Mode / Power profile: Maximum Performance
Processor       C-States: Disabled
                Turbo mode: Disabled
                HPC Optimizations: Enabled
                CPU frequency select: Max performance
Memory          Memory speed: Max performance
                Memory channel mode: Independent
                Node Interleaving: Disabled (NUMA)
                Channel Interleaving: Enabled
                Thermal Mode: Performance

6.3 Performance Tuning for Linux

You can use the Linux sysctl command to modify default system network parameters that are set by the operating system, in order to improve IPv4 and IPv6 traffic performance. Note, however, that changing the network parameters may yield different results on different systems: the results are significantly dependent on the CPU and chipset efficiency.

6.3.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance

The following changes are recommended for improving IPv4 traffic performance:

Disable the TCP timestamps option for better CPU utilization:

  sysctl -w net.ipv4.tcp_timestamps=0
5.3.1 Enable/Disable Flow Steering

Flow Steering is disabled by default, and regular L2 steering (B0 steering) is performed instead. When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the flow steering table for the guest/master.

To enable Flow Steering:

Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Set the parameter log_num_mgm_entry_size to -1 by writing the option:

  options mlx4_core log_num_mgm_entry_size=-1

Step 3. Restart the driver.

To disable Flow Steering:

Step 1. Open the /etc/modprobe.d/mlnx.conf file.
Step 2. Remove the option:

  options mlx4_core log_num_mgm_entry_size=-1

Step 3. Restart the driver.

5.3.2 Flow Domains and Priorities

Flow steering defines the concept of a domain and a priority. Each domain represents a user agent that can attach a flow. The domains are prioritized: a higher-priority domain will always supersede a lower-priority domain when their flow specifications overlap. Setting a lower priority value will result in a higher priority.

In addition to the domain, there is a priority within each of the domains. Each domain can have at most 2^12 priorities, in accordance with its needs.

The following are the domains, in descending order of priority:

Ethtool

The Ethtool domain is used to attach an RX ring, specifically its QP, to a specified flow. Please refer to the most recent ethtool manpage for all the ways to specify a flow.
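Relating back to the enable/disable steps above: after the driver restarts, the value actually in effect can be double-checked from sysfs. A minimal sketch, assuming the module parameter is exposed under the standard /sys/module layout:

  # -1 means device-managed flow steering was requested
  cat /sys/module/mlx4_core/parameters/log_num_mgm_entry_size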
For example, on Intel systems, add:

  default=0
  timeout=5
  splashimage=(hd0,0)/grub/splash.xpm.gz
  hiddenmenu
  title Red Hat Enterprise Linux Server (2.6.32-36.x86-64)
          root (hd0,0)
          kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
          initrd /initrd-2.6.32-36.x86-64.img

Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.

Step 5. Install the MLNX_EN driver for Linux that supports SR-IOV.

Step 6. Verify the HCA is configured to support SR-IOV:

  [root@selene ~]# mstflint -dev <PCI Device> dc

Verify that the following fields appear in the [HCA] section:

  [HCA]
  num_pfs = 1
  total_vfs = <5>
  sriov_en = true

Note: HCA parameters can be configured during a firmware update using the mlnxofedinstall script, by running it with the --enable-sriov and --total-vfs <0-63> installation parameters. If the current firmware version is the same as the one provided with MLNX_EN, run it in combination with the --force-fw-update parameter.

Note: This configuration option is supported only in HCAs whose configuration file (INI) is included in MLNX_EN.

Parameter    Recommended Value
num_pfs      1 (Note: This field is optional, and might not always appear)
total_vfs    63
sriov_en     true
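Returning to the "intel_iommu=on" requirement of Step 4: once the host has been rebooted, it is worth confirming that the kernel actually enabled the IOMMU before hunting for SR-IOV problems elsewhere. A minimal check (the exact message wording varies between kernel versions):

  dmesg | grep -i -e DMAR -e IOMMU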
  5.4.3 Enabling SR-IOV and Para-Virtualization on the Same Setup
  5.4.4 Assigning a Virtual Function to a Virtual Machine
  5.4.5 Uninstalling the SR-IOV Driver
  5.4.6 Burning Firmware with SR-IOV
  5.4.7 Ethernet Virtual Function Configuration when Running SR-IOV
5.5 Ethernet Performance Counters

Chapter 6 Performance Tuning
6.1 Increasing Packet Rate
6.2 General System Configurations
  6.2.1 PCI Express (PCIe) Capabilities
  6.2.2 Memory Configuration
  6.2.3 Recommended BIOS Settings
6.3 Performance Tuning for Linux
  6.3.1 Tuning the Network Adapter for Improved IPv4 Traffic Performance
  6.3.2 Tuning the Network Adapter for Improved IPv6 Traffic Performance
  6.3.3 Preserving Your Performance Settings after a Reboot
  6.3.4 Tuning Power Management
  6.3.5 Interrupt Moderation
  6.3.6 Tuning for NUMA Architecture
rx_csum_good         Number of packets received with a good checksum
rx_csum_none         Number of packets received with no checksum indication
tx_chksum_offload    Number of packets transmitted with checksum offload
tx_queue_stopped     Number of times the transmit queue was suspended
tx_wake_queue        Number of times the transmit queue was resumed
tx_timeout           Number of times the transmitter timed out
tx_tso_packets       Number of packets that were aggregated

Table 13 - Per-Ring SW Statistics (where <i> is the ring ID, per configuration)

Counter           Description
rx_<i>_packets    Total packets successfully received on ring i
rx_<i>_bytes      Total bytes in successfully received packets on ring i
tx_<i>_packets    Total packets successfully transmitted on ring i
tx_<i>_bytes      Total bytes in successfully transmitted packets on ring i

6 Performance Tuning

6.1 Increasing Packet Rate

To increase the packet rate (especially for small packets), set the value of the high_rate_steer module parameter in the mlx4_core module to 1 (default: 0).

Note: Enabling this mode will cause the following chassis management features to stop working:
• NC-SI

6.2 General System Configurations

The following sections describe recommended configurations for system components and/or interfaces.
It is also recommended to disable the cpuspeed module; this module is also architecture-dependent.

To set the scaling mode to performance, use:

  echo performance > /sys/devices/system/cpu/cpu7/cpufreq/scaling_governor

To disable cpuspeed, use:

  service cpuspeed stop

6.3.4.2 Kernel Idle Loop Tuning

The mlx4_en kernel module has an optional parameter that can tune the kernel idle loop for better latency. This will improve the CPU wake-up time, but may result in higher power consumption.

To tune the kernel idle loop, set the following options in the /etc/modprobe.d/mlx4.conf file:

• For MLNX_EN 2.0.x:

  options mlx4_core enable_sys_tune=1

• For MLNX_EN 1.5.10:

  options mlx4_en enable_sys_tune=1

6.3.4.3 OS Controlled Power Management

Some operating systems can override the BIOS power management configuration and enable c-states by default, which results in a higher latency. To resolve the high latency issue, please follow the instructions below:

1. Edit the /boot/grub/grub.conf file, or any other bootloader configuration file.
2. Add the following kernel parameters to the bootloader command:

  intel_idle.max_cstate=0 processor.max_cstate=1

3. Reboot the system.

Example:

  title RH6.2x64
          root (hd0,0)
          kernel /vmlinuz-RH6.2x64-2.6.32-220.el6.x86_64 root=UUID=817c207b-c0e8-4ed9-9c33-c589c0bb566f console=tty0 console=ttyS0,115200n8 rhgb intel_idle.max_cstate=0 processor.max_cstate=1
rx_in_range_length_error     Number of received frames with a length/type field value in the (decimal) range [1500:46] (42 is also counted for VLAN-tagged frames)
rx_out_range_length_error    Number of received frames with a length/type field value in the (decimal) range [1535:1501]
rx_lt_64_bytes_packets       Number of received frames of 64 octets or less
rx_127_bytes_packets         Number of received 65-to-127 octet frames
rx_255_bytes_packets         Number of received 128-to-255 octet frames
rx_511_bytes_packets         Number of received 256-to-511 octet frames
rx_1023_bytes_packets        Number of received 512-to-1023 octet frames
rx_1518_bytes_packets        Number of received 1024-to-1518 octet frames
rx_1522_bytes_packets        Number of received 1519-to-1522 octet frames
rx_1548_bytes_packets        Number of received 1523-to-1548 octet frames
rx_gt_1548_bytes_packets     Number of received frames of 1549 octets or greater

Table 8 - Port OUT Counters

Counter                 Description
tx_packets              Total packets successfully transmitted
tx_bytes                Total bytes in successfully transmitted packets
tx_multicast_packets    Total multicast packets successfully transmitted
tx_broadcast_packets    Total broadcast packets successfully transmitted
tx_errors               Number of frames that failed to transmit
tx_dropped              Number of transmitted frames that were dropped
Enabling the RFS requires enabling the "ntuple" flag via ethtool. For example, to enable ntuple for eth0, run:

  ethtool -K eth0 ntuple on

RFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available in kernels 2.6.39 and above. Furthermore, RFS requires Device Managed Flow Steering support. RFS cannot function if LRO is enabled; LRO can be disabled via ethtool.

All of the rest

The lowest-priority domain serves the following users:

• The mlx4 Ethernet driver attaches its unicast and multicast MAC addresses to its QP, using L2 flow specifications.

Note: Fragmented UDP traffic cannot be steered. It is treated as "other protocol" by the hardware (from the first packet), and is not considered as UDP traffic.

5.4 Single Root IO Virtualization (SR-IOV)

Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the device with separate resources. Mellanox adapters are capable of exposing, in ConnectX-3 adapter cards, 63 virtual instances called Virtual Functions (VFs). These virtual functions can then be provisioned separately.
This will be applied to all ConnectX HCAs on the host.

• num_vfs = 00:04.0-5,00:07.0-8: The driver will enable 5 VFs on the HCA positioned in BDF 00:04.0, and 8 on the one in 00:07.0. Note: PFs not included in the above list will not have SR-IOV enabled.

probe_vf

• Absent, or zero: No VFs will be used by the PF driver.

• Its value is a single number in the range 0-63: The Physical Function driver will use <probe_vf> VFs, and this will be applied to all ConnectX HCAs on the host.

• Its format is a string, which allows the user to specify the probe_vf parameter separately per installed HCA. The format is "bb:dd.f-v,bb:dd.f-v,...", where bb:dd.f is the bus:device.function of the PF of the HCA, and v is the number of VFs to use in the PF driver for that HCA.

This parameter can be set in one of the following ways. For example:

• probe_vfs = 5: The PF driver will probe 5 VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.

• probe_vfs = 00:04.0-5,00:07.0-8: The PF driver will probe 5 VFs on the HCA positioned in BDF 00:04.0, and 8 for the one in 00:07.0. Note: PFs not included in the above list will not use any of their VFs in the PF driver.

The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a single VF per single VM. However, the number of VFs varies upon the working mode requirements.
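For reference, both parameters are ordinary mlx4_core module options, so they can be pinned in a modprobe configuration file. A minimal sketch with placeholder counts (5 VFs created, 1 probed by the PF driver; the .conf file name under /etc/modprobe.d is a matter of convention):

  # /etc/modprobe.d/mlx4.conf
  options mlx4_core num_vfs=5 probe_vf=1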
vport<i>_rx_multicast_bytes      Multicast packet bytes received successfully
vport<i>_rx_broadcast_packets    Broadcast packets received successfully
vport<i>_rx_broadcast_bytes      Broadcast packet bytes received successfully
vport<i>_rx_dropped              Received packets discarded due to an out-of-buffer condition
vport<i>_rx_errors               Received packets discarded due to a receive error condition
vport<i>_tx_unicast_packets      Unicast packets sent successfully
vport<i>_tx_unicast_bytes        Unicast packet bytes sent successfully
vport<i>_tx_multicast_packets    Multicast packets sent successfully
vport<i>_tx_multicast_bytes      Multicast packet bytes sent successfully
vport<i>_tx_broadcast_packets    Broadcast packets sent successfully
vport<i>_tx_broadcast_bytes      Broadcast packet bytes sent successfully
vport<i>_tx_errors               Packets dropped due to transmit errors

Table 12 - SW Statistics

Counter              Description
rx_lro_aggregated    Number of packets aggregated
rx_lro_flushed       Number of LRO flushes to the stack
rx_lro_no_desc       Number of times an LRO descriptor was not found
rx_alloc_failed      Number of times failed preparing a receive descriptor
  <sysctl_name4> = <value4>

For example, "Tuning the Network Adapter for Improved IPv4 Traffic Performance" lists the following setting to disable the TCP timestamps option:

  sysctl -w net.ipv4.tcp_timestamps=0

In order to keep the TCP timestamps option disabled after a reboot, add the following line to /etc/sysctl.conf:

  net.ipv4.tcp_timestamps=0

6.3.4 Tuning Power Management

Check that the output CPU frequency for each core is equal to the maximum supported, and that all core frequencies are consistent.

• Check the maximum supported CPU frequency:

  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_max_freq

• Check that the core frequencies are consistent:

  cat /proc/cpuinfo | grep "cpu MHz"

• Check that the output frequencies are the same as the maximum supported.

If the CPU frequency is not at the maximum, check the BIOS settings according to the tables in section "Recommended BIOS Settings" to verify that power state is disabled.

• Check the current CPU frequency, to see whether it is configured to the maximum available frequency:

  cat /sys/devices/system/cpu/cpu*/cpufreq/cpuinfo_cur_freq

6.3.4.1 Setting the Scaling Governor

If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance:

• freq_table
• acpi-cpufreq: this module is architecture-dependent.
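When a machine has many cores, setting the governor one core at a time gets tedious. A minimal sketch that applies the performance governor to every core at once (assumes the cpufreq sysfs layout described above):

  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done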
37. lanox Technologies Mellanox Technologies Ltd 350 Oakmead Parkway Suite 100 Beit Mellanox Sunnyvale CA 94085 PO Box 586 Yokneam 20692 U S A Israel www mellanox com www mellanox com Tel 408 970 3400 Tel 972 0 74 723 7200 Fax 408 970 3403 Fax 972 0 4 959 3245 Copyright 2014 Mellanox Technologies All Rights Reserved Mellanox amp Mellanox logo BridgeX ConnectX CORE Direct InfiniBridge InfiniHost InfiniScale MLNX OS PhyX SwitchX UFM Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies Ltd Connect IB ExtendX FabricIT Mellanox Open Ethernet Mellanox Virtual Modular Switch MetroX MetroDX ScalableHPC Unbreakable Link are trademarks of Mellanox Technologies Ltd All other trademarks are property of their respective owners 2 Mellanox Technologies Document Number 2950 Rev 2 1 1 0 0 Table of Contents Table of Contents ts 2 ro e uL ee e eh cr ne rv c ees d Dist Of Tables eee Re gere et eR e Ros dre N lec Ca ated DS Chapter I O VErVieW A era ex eR eee weer ee 1 1 Package Contents 6404 ones pere Ra ed ad ee d Le e Ge 12 Chapter 2 Driver Installation zuo ce haere ce 14 2 1 Software Dependencies 14 2 2 Installing the DriVerz vec sees air RE ol aad waite dads 14 2 3 lLoadingtheDriver 0 15 24 Unlo
38. ode is the adapter s node the system BIOS should support ACPI SLIT gt To see if your system supports PCIe adapter s NUMA node detection cat sys class net interface device numa node cat sys devices PCI root PCIe function numa node Mellanox Technologies 51 J Rev 2 1 1 0 0 Performance Tuning Example for supported system cat sys class net eth3 device numa node 0 Example for unsupported system cat sys class net ib0 device numa node 1 6 3 6 1 1 Improving Application Performance on Remote NUMA Node Verbs API applications that mostly use polling will have an impact when using the remote NUMA node libmlx4 has a build in enhancement that recognizes an application that is pinned to a remote NUMA node and activates a flow that improves the out of the box latency and throughput However the NUMA node recognition must be enabled as described in section Tuning for Intel Sandy Bridge Platform on page 51 In systems which do not support SLIT the following environment variable should be applied MLX4 LOCAL CPUS 0x bit mask of local NUMA node Example for local NUMA node which its cores are 0 7 LOCAL CPUS 0xff Additional modification can apply to impact this feature by changing the following environment variable MLX4 STALL NUM LOOP integer default 400 The default value is optimized for most applications However several applications y might benefit from increasing decreasing thi
To prevent the IRQ balancer from interfering with the interrupt affinity scheme, the IRQ balancer must be turned off.

The following command turns off the IRQ balancer:

  > /etc/init.d/irqbalance stop

The following command assigns the affinity of a single interrupt vector:

  > echo <hexadecimal bit mask> > /proc/irq/<irq vector>/smp_affinity

Bit i in <hexadecimal bit mask> indicates whether processor core i is in <irq vector>'s affinity or not.

6.3.7.1 IRQ Affinity Configuration

It is recommended to set each IRQ to a different core.

For Sandy Bridge or AMD systems, set the IRQ affinity to the adapter's NUMA node:

• For optimizing single-port traffic, run:

  set_irq_affinity_bynode.sh <numa node> <interface>

• For optimizing dual-port traffic, run:

  set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>

• To show the current IRQ affinity settings, run:

  show_irq_affinity.sh <interface>

6.3.7.2 Auto-Tuning Utility

MLNX_EN 2.0.x introduces a new affinity tool called mlnx_affinity. This tool can automatically adjust your affinity settings for each network interface, according to the system architecture.

Usage:

• Start: mlnx_affinity start
• Stop: mlnx_affinity stop
• Restart: mlnx_affinity restart

mlnx_affinity can also be started by driver load/unload.

To enable mlnx_affinity by default, add the line below to the /etc/infiniband/openib.conf file:
  RUN_AFFINITY_TUNER=yes

6.3.7.3 Tuning for Multiple Adapters

When optimizing the system performance for using more than one adapter, it is recommended to separate the adapters' core utilization, so there will be no interleaving between interfaces.

The following script can be used to separate each adapter's IRQs to a different set of cores:

  set_irq_affinity_cpulist.sh <cpu list> <interface>

<cpu list> can be either a comma-separated list of single core numbers (0,1,2,3) or core groups (0-3).

Example: If the system has 2 adapters on the same NUMA node (0-7), each with 2 interfaces, run the following:

  # /etc/init.d/irqbalance stop
  # set_irq_affinity_cpulist.sh 0,1 eth2
  # set_irq_affinity_cpulist.sh 2,3 eth3
  # set_irq_affinity_cpulist.sh 4,5 eth4
  # set_irq_affinity_cpulist.sh 6,7 eth5

6.3.8 Tuning Multi-Threaded IP Forwarding

To optimize NIC usage as IP forwarding:

1. Set the following options in /etc/modprobe.d/mlx4.conf:

   • For MLNX_EN 2.0.x:

     options mlx4_en inline_thold=0
     options mlx4_core high_rate_steer=1

   • For MLNX_EN 1.5.10:

     options mlx4_en num_lro=0 inline_thold=0
     options mlx4_core high_rate_steer=1

2. Apply interrupt affinity tuning.

3. Forwarding on the same interface:

   set_irq_affinity_bynode.sh <numa node> <interface>

4. Forwarding from one interface to another:

   set_irq_affinity_bynode.sh <numa node> <interface1> <interface2>
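Tying together the echo-to-smp_affinity approach described earlier with the scripts above, pinning a single vector by hand looks like this. A minimal sketch, where the IRQ number (130) and the core (3, i.e. mask 0x8) are placeholder values, not taken from this manual:

  /etc/init.d/irqbalance stop
  # pin IRQ 130 to core 3: bit 3 set => mask 0x8
  echo 8 > /proc/irq/130/smp_affinity
  # read the mask back to confirm
  cat /proc/irq/130/smp_affinity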
Step 3. Restart the network service.

Step 4. Attach a virtual NIC to the VM.

Step 5. Add the MAC 52:54:00:E7:77:99 to the /sys/class/net/eth5 FDB table on the HV.

5.4.4 Assigning a Virtual Function to a Virtual Machine

This section describes a mechanism for adding an SR-IOV VF to a Virtual Machine.

5.4.4.1 Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server

Step 1. Run virt-manager.

Step 2. Double-click on the virtual machine and open its Properties.

Step 3. Go to Details > Add hardware > PCI host device.

[Figure: virt-manager "Add New Virtual Hardware" dialog; under "Hardware type", select "Physical Host Device"]

Step 4. Choose a Mellanox virtual function according to its PCI device (e.g. 00:03.1).

Step 5. If the Virtual Machine is up, reboot it; otherwise, start it.

Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:

  lspci | grep Mellanox
  00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
The tc_wrap.py tool will use either the sysfs interface or the tc tool to configure the sk_prio-to-UP mapping.

Usage:

  tc_wrap.py -i <interface> [options]

Options:

  --version             Show the program's version number and exit
  -h, --help            Show this help message and exit
  -u SKPRIO_UP, --skprio_up=SKPRIO_UP
                        Maps sk_prio to UP. LIST is up to 16 comma-separated UPs; the index of each element is the sk_prio.
  -i INTF, --interface=INTF
                        Interface name

Example: set skprio 0-2 to UP0, and skprio 3-7 to UP1, on eth4:

  UP 0
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
  UP 1
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
  UP 2
  UP 3
  UP 4
  UP 5
  UP 6
  UP 7

5.1.5.3 Additional Tools

The tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is a part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.

• The mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
• tc_wrap.py (package: ofed-scripts) requires python >= 2.5

5.2 Time Stamping Service

Note: Time Stamping is currently at beta level. Please be aware that everything listed here is subject to change.

Time Stamping is currently supported in ConnectX-3 and ConnectX-3 Pro adapters only.

Time stamping is the process of keeping track of the creation of a packet.
An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:

Send side time stamping:

Enabled by ifreq.hwtstamp_config.tx_type, where the possible values for hwtstamp_config->tx_type (enum hwtstamp_tx_types) are:

• HWTSTAMP_TX_OFF: No outgoing packet will need hardware time stamping; should a packet arrive which asks for it, no hardware time stamping will be done.

• HWTSTAMP_TX_ON: Enables hardware time stamping for outgoing packets; the sender of the packet decides which are to be time stamped, by setting SOF_TIMESTAMPING_TX_SOFTWARE before sending the packet(s).

• HWTSTAMP_TX_ONESTEP_SYNC: Enables time stamping for outgoing packets just as HWTSTAMP_TX_ON does, but also enables time stamp insertion directly into Sync packets. In this case, transmitted Sync packets will not receive a time stamp via the socket error queue.

Note: for send-side time stamping, currently only HWTSTAMP_TX_OFF and HWTSTAMP_TX_ON are supported.

Receive side time stamping:

Enabled by ifreq.hwtstamp_config.rx_filter, where the possible values for hwtstamp_config->rx_filter (enum hwtstamp_rx_filters) are:

• Time stamp no incoming packet at all: HWTSTAMP_FILTER_NONE
• Time stamp any incoming packet: HWTSTAMP_FILTER_ALL
• Return value: time stamp all packets requested plus some others: HWTSTAMP_FILTER_SOME
44. reation of a packet A time stamping ser vice supports assertions of proof that a datum existed before a particular time Incoming packets are time stamped before they are distributed on the PCI depending on the congestion in the PCI buffers Outgoing packets are time stamped very close to placing them on the wire Mellanox Technologies 25 Rev 2 1 1 0 0 Driver Features 5 2 1 Enabling Time Stamping Time stamping is off by default and should be enabled before use To enable time stamping for a socket Call setsockopt with SO TIMESTAMPING and with the following flags SOF TI SOF TI SOF TI SOF TI SOF TI SOF TI SOF TI SOF TI SOF TI ESTAMPING TX HARDWARE try to obtain send time stamp in hardware ESTAMPING TX SOFTWARE ABE SOF TIMESTAMPING TX HARDWARE is off or fails then do it in software ESTAMPING RX HARDWARE return the original unmodified time stamp as generated by the hardware ESTAMPING RX SOFTWARE SOF TIMESTAMPING RX HARDWARE is off or fails then do it in software ESTAMPING RAW HARDWARE return original raw hardware time stamp ESTAMPING SYS HARDWARE return hardware time stamp transformed to the system time base ESTAMPING SOFTWARE return system time stamp generated in software ESTAMPING TX RX determine how time stamps are generated ESTAMPING RAW SYS determine how they are reported gt To enable time stamping for a net device Admin privileged use
These Virtual Functions can then be provisioned separately. Each VF can be seen as an additional device connected to the Physical Function. It shares the same resources with the Physical Function, and its number of ports equals those of the Physical Function.

SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual machines direct hardware access to network resources, hence increasing their performance.

In this chapter we will demonstrate the setup and configuration of SR-IOV in a Red Hat Linux environment, using the Mellanox ConnectX VPI adapter card family.

5.4.1 System Requirements

To set up an SR-IOV environment, the following is required:
- MLNX_EN Driver
- A server/blade with an SR-IOV-capable motherboard BIOS
- Hypervisor that supports SR-IOV, such as Red Hat Enterprise Linux Server Version 6
- Mellanox ConnectX VPI Adapter Card family with SR-IOV capability

5.4.2 Setting Up SR-IOV

Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual.

Step 1: Enable SR-IOV in the system BIOS.
(Figure: BIOS Setup Utility screen showing the SR-IOV "Supported" option)
Step 2: Enable "Intel Virtualization Technology".
Step 3: Install the hypervisor that supports SR-IOV.
Step 4: Depending on your system, update the /boot/grub/grub
Document Revision History

Table 1 - Document Revision History

Release    Date          Description
2.1.1.0.0  January 2014  Added Section 5.5, "Ethernet Performance Counters," on page 39
2.0.3.0.0  October 2013  Added the following sections:
                         - Section 5.4, "Single Root IO Virtualization (SR-IOV)," on page 30
                         - Section 5.3, "Flow Steering," on page 28
                         - Section 5.2, "Time Stamping Service," on page 25

About this Manual

This Preface provides general information concerning the scope and organization of this User's Manual.

Intended Audience

This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.

Common Abbreviations and Acronyms

Table 2 - Abbreviations and Acronyms (Sheet 1 of 2)

Abbreviation/Acronym - Whole Word/Description
B    - Capital "B" is used to indicate size in bytes or multiples of bytes (e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)
b    - Small "b" is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
FW   - Firmware
HCA  - Host Channel Adapter
HW   - Hardware
IB   - InfiniBand
iSER - iSCSI Extensions for RDMA
To change the rings size:
   > ethtool -G eth<x> rx <N> tx <N>

To obtain additional device statistics:
   > ethtool -S eth<x>

To perform a self-diagnostics test:
   > ethtool -t eth<x>

The driver defaults to the following parameters:
- Both ports are activated (i.e., a net device is created for each port)
- The number of Rx rings for each port is the nearest power of 2 of the number of CPU cores, limited by 16
- LRO is enabled with 32 concurrent sessions per Rx ring

Some of these values can be changed using module parameters, which can be displayed by running:
   > modinfo mlx4_en

To set non-default values to module parameters, add to the /etc/modprobe.conf file:
   options mlx4_en <param_name>=<value> <param_name>=<value>

Values of all parameters can be observed in /sys/module/mlx4_en/parameters/.

4 Firmware Programming

The adapter card was shipped with the most current firmware available. This section is intended for future firmware upgrades, and provides instructions for (1) installing Mellanox firmware update tools (MFT), (2) downloading FW, and (3) updating adapter card firmware.

4.1 Installing Firmware Tools

The driver package compiles and installs the Mellanox mstflint utility under /usr/local/bin. You may also use this tool to burn a card-specific firmware binary image. See the file /tmp/mlnx_en/src
6.3.6.2 Tuning for AMD Architecture

On AMD architecture, there is a difference between a 2-socket system and a 4-socket system:
- With a 2-socket system, the PCIe adapter will be connected to socket 0 (nodes 0,1).
- With a 4-socket system, the PCIe adapter will be connected either to socket 0 (nodes 0,1), or to socket 3 (nodes 6,7).

6.3.6.3 Recognizing NUMA Node Cores

To recognize NUMA node cores, run the following command:
   > cat /sys/devices/system/node/node<X>/cpulist
   > cat /sys/devices/system/node/node<X>/cpumap

Example:
   > cat /sys/devices/system/node/node1/cpulist
   1,3,5,7,9,11,13,15
   > cat /sys/devices/system/node/node1/cpumap
   0000aaaa

6.3.6.3.1 Running an Application on a Certain NUMA Node

In order to run an application on a certain NUMA node, the process affinity should be set, either in the command line or via an external tool. For example, if the adapter's NUMA node is 1 and NUMA node 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.

To run an application, run the following commands:
   taskset -c 8-15 ib_write_bw -a
or:
   taskset 0xff00 ib_write_bw -a

6.3.7 IRQ Affinity

The affinity of an interrupt is defined as the set of processor cores that service that interrupt. To improve application scalability and latency, it is recommended to distribute interrupt requests (IRQs) between the available processor cores. To prevent the Linux IRQ balancer application from interfering with these settings, the IRQ balancer should be disabled.
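As an alternative to taskset, a process can also pin itself from within the application. The following is a minimal C sketch; the 8-15 core range mirrors the example above and is an assumption about the adapter's NUMA node:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    /* Pin this process to cores 8-15 (NUMA node 1 in the example above). */
    for (int cpu = 8; cpu <= 15; cpu++)
        CPU_SET(cpu, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... application work now runs only on the selected NUMA node cores ... */
    return 0;
}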
The set_egress_map option in vconfig maps the skb_priority of the VLAN to a vlan_qos. The vlan_qos represents a UP for the VLAN device.

For RoCE:
- The rdma_set_option() call with RDMA_OPTION_ID_TOS could be used to set the UP.
- When creating QPs, the sl field in the modify_qp command represents the UP.

Indicating the TC:

After mapping the skb_priority to a UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that, mlnx_qos should be used. mlnx_qos gets a list of mappings between UPs and TCs. For example:
   mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1
maps UPs 0-3 to TC0, and UPs 4-7 to TC1.

5.1.4 Quality of Service Properties

The different QoS properties that can be assigned to a TC are:
- Strict Priority (see "Strict Priority")
- Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
- Rate Limit (see "Rate Limit")

5.1.4.1 Strict Priority

When setting a TC's transmission algorithm to be "strict", this TC has absolute (strict) priority over the other TC strict priorities coming before it (as determined by the TC number: TC 7 is the highest priority, TC 0 is the lowest). It also has an absolute priority over the non-strict TCs (ETS). This property needs to be used with care, as it may easily cause starvation of other TCs. A higher strict priority TC is always given the first chance to transmit. Only if the highest strict priority TC
...subnet management data.

SM  - Subnet Manager. One of several entities involved in the configuration and control of an IB fabric.
LFT - Unicast Linear Forwarding Tables. A table that exists in every switch, providing the port through which packets should be sent to each LID.
VPI - Virtual Protocol Interconnect. A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet; each subnet connects to one of the adapter ports.

Related Documentation

Table 4 - Reference Documents

Document Name - Description
InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 - The InfiniBand Architecture Specification that is provided by IBTA.
IEEE Std 802.3ae-2002 (Amendment to IEEE Std 802.3-2002), Document # PDF: 8594996 - Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gb/s Operation.
Firmware Release Notes for Mellanox adapter devices - See the Release Notes PDF file relevant to your adapter device, under the docs/ folder of the installed package.
MFT User's Manual - Mellanox Firmware Tools User's Manual. See under the docs/ folder of the installed package.
MFT Release Notes - Release Notes for the Mellanox Firmware Tools. See under the docs/ folder of the installed package.

Support and Updates Webpage

Please visit http://www.mellanox.com > Products > InfiniBand/VPI Drivers > Linux SW/Drivers for downloads, FAQ, troubleshooting, future updates to this manual, etc.

1 Overview

This document provides information on the MLNX_EN Linux driver, and instructions for installing the driver on Mellanox ConnectX adapter cards supporting 10Gb/s and 40Gb/s Ethernet.

The MLNX_EN driver release exposes the following capabilities:
- Single/Dual port
- Up to 16 Rx queues per port
- 16 Tx queues per port
- Rx steering mode: Receive Core Affinity (RCA)
- MSI-X or INTx
- Adaptive interrupt moderation
- HW Tx/Rx checksum calculation
- Large Send Offload (i.e., TCP Segmentation Offload)
- Large Receive Offload
- Multi-core NAPI support
- VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
- Ethtool support
- Net device statistics
- SR-IOV support
- Flow steering
- Ethernet Time Stamping (at beta level)

1.1 Package Contents

This driver kit contains the following:

Table 5 - MLNX_EN Package Content

Components - Description
mlx4 driver - mlx4 is the low level driver implementation for the ConnectX adapters designed

5.5 Ethernet Performance Counters

Ports counters are located in sysfs under /sys/class/infiniband/mlx4_<i>/ports/<p>/counters, and the extended counters under /sys/class/infiniband/mlx4_<i>/ports/<p>/counters_ext. The Physical Function can also read its Virtual Functions' port counters through sysfs, located under /sys/class/net/eth<x>/vf<y>/statistics.

To display the network device Ethernet statistics, you can run:
   ethtool -S <devname>

Table 7 - Port IN Counters

Counter - Description
rx_packets - Total packets successfully received
rx_bytes - Total bytes in successfully received packets
rx_multicast_packets - Total multicast packets successfully received
rx_broadcast_packets - Total broadcast packets successfully received
rx_errors - Number of receive packets that contained errors, preventing them from being deliverable to a higher-layer protocol
rx_dropped - Number of receive packets which were chosen to be discarded (even though no errors had been detected), to prevent their being deliverable to a higher-layer protocol
rx_length_errors - Number of received frames that were dropped due to an error in frame length
rx_over_errors - Number of received frames that were dropped due to overflow
rx_crc_errors - Number of received frames with a bad CRC that are not runts, jabbers, or alignment errors
rx_jabbers - Number of received frames with a length greater than MTU octets and a bad CRC
rx_in_range_len
...successfully transmitted packets with no VLAN priority.

Table 10 - Port Pause (where <i> is in the range 0-7)

Counter - Description
rx_pause_prio_<i> - The total number of PAUSE frames received from the far-end port
rx_pause_duration_prio_<i> - The total time in microseconds that the far-end port was requested to pause transmission of packets
rx_pause_transition_prio_<i> - The number of receiver transitions from XON state (non-paused) to XOFF state (paused)
tx_pause_prio_<i> - The total number of PAUSE frames sent to the far-end port
tx_pause_duration_prio_<i> - The total time in microseconds that transmission of packets has been paused
tx_pause_transition_prio_<i> - The number of transmitter transitions from XON state (non-paused) to XOFF state (paused)

Table 11 - VPort Statistics (where <i> = <empty_string> for the PF, and ranges 1..NumOfVf per VF)

Counter - Description
vport<i>_rx_unicast_packets - Unicast packets received successfully
vport<i>_rx_unicast_bytes - Unicast packet bytes received successfully
vport<i>_rx_multicast_packets - Multicast packets received successfully
vport<i>_rx_multicast_bytes - Mult
...product family.

Step 2: Install the driver.
   > tar xzvf mlnx_en-2.0.3.0.0.tgz
   > cd mlnx_en-2.0.3.0.0
   > ./install.sh

To install mlnx_en-2.0.3.0.0 on XenServer 6.1:
   > rpm -ihv RPMS/xenserver6u1/i386/`uname -r`/mlnx_en*.rpm

The package consists of several source RPMs. The install script rebuilds the source RPMs and then installs the created binary RPMs. The created kernel module binaries are located at:

For KMP RPMs installation:
- On SLES (mellanox-mlnx-en KMP RPM): /lib/modules/<kernel-ver>/updates/mellanox-mlnx-en
- On RHEL (kmod-mellanox-mlnx-en RPM): /lib/modules/<kernel-ver>/extra/mellanox-mlnx-en

For non-KMP RPMs (mlnx_en RPM):
- On SLES: /lib/modules/<kernel-ver>/updates/mlnx_en
- On RHEL: /lib/modules/<kernel-ver>/extra/mlnx_en

The mlnx_en installer supports 2 modes of installation. The install script selects the mode of driver installation depending on the running OS/kernel version:
- Kernel Module Packaging (KMP) mode, where the source rpm is rebuilt for each installed flavor of the kernel. This mode is used for RedHat and SUSE distributions.
- Non-KMP installation mode, where the sources are rebuilt with the running kernel. This mode is used for vanilla kernels.

If the vanilla kernel is installed as an rpm, please use the --disable-kmp flag when installing the driver.

The kernel module sources are placed under /usr/src/mellanox-mlnx-en-2.0/.
...user controllable, some controlled by the application itself, and others by the system/network administrators.

The following is the general flow for mapping traffic to Traffic Classes (a C sketch of step 1 follows this list):
1. The application sets the required Type of Service (ToS).
2. The ToS is translated into a Socket Priority (sk_prio).
3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications set sk_prio directly).
4. The UP is mapped to a TC by the network/system administrator.
5. TCs hold the actual QoS parameters.

QoS can be applied on the following types of traffic; however, the general QoS flow may vary among them:
- Plain Ethernet: applications use regular inet sockets, and the traffic passes via the kernel Ethernet driver.
- RoCE: applications use the RDMA API to transmit using QPs.
- Raw Ethernet QP: applications use the VERBs API to transmit using a Raw Ethernet QP.

5.1.2 Plain Ethernet Quality of Service Mapping

Applications use regular inet sockets, and the traffic passes via the kernel Ethernet driver. The following is the Plain Ethernet QoS mapping flow:
1. The application sets the ToS of the socket using setsockopt(IP_TOS, value).
2. The ToS is translated into the sk_prio using a fixed translation:
   TOS 0  -> sk_prio 0
   TOS 8  -> sk_prio 2
   TOS 24 -> sk_prio 4
   TOS 16 -> sk_prio 6
3. The Socket Priority is mapped to the UP: if the underlying device is a VLAN device, egress_map is used (controlled by the vconfig command).
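As an illustration of step 1 above, the following is a minimal C sketch of setting the ToS on a regular inet socket, together with the equivalent UP selection for RoCE via rdma_set_option() with RDMA_OPTION_ID_TOS. The ToS value 16 (which maps to sk_prio 6 per the fixed translation above) is an arbitrary example; link the RoCE part with -lrdmacm:

#include <stdint.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <sys/socket.h>
#include <rdma/rdma_cma.h>

/* Plain Ethernet: set the ToS on a regular inet socket. */
int set_socket_tos(int sock)
{
    int tos = 16;  /* example value; maps to sk_prio 6 */
    return setsockopt(sock, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}

/* RoCE: set the ToS on an rdma_cm_id; the UP is derived from it. */
int set_rdma_tos(struct rdma_cm_id *id)
{
    uint8_t tos = 16;  /* example value */
    return rdma_set_option(id, RDMA_OPTION_ID, RDMA_OPTION_ID_TOS,
                           &tos, sizeof(tos));
}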
...sriov_en = true

If the HCA does not support SR-IOV, please contact Mellanox Support: support@mellanox.com.

Step 7: Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist; otherwise, delete its contents.

Note:
1. If the fields in the example above do not appear in the HCA section, it means that SR-IOV is not supported in the used INI.
2. If SR-IOV is supported, to enable it (if it is not), it is sufficient to set "sriov_en = true" in the INI.

Step 8: Insert an "options" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs, the protocol type per port, and the allowed number of virtual functions to be used by the physical function driver (probe_vf):
   options mlx4_core num_vfs=5 probe_vf=1

Parameter: num_vfs. Recommended values:
- Absent or zero: the SR-IOV mode is not enabled in the driver, hence no VFs will be available.
- A single number in the range of 0-63: the driver will enable the num_vfs VFs on the HCA, and this will be applied to all ConnectX HCAs on the host.
- A string which allows the user to specify the num_vfs parameter separately per installed HCA. Its format is: "bb:dd.f-v,bb:dd.f-v,..." (bb:dd.f = bus:device.function of the PF of the HCA; v = number of VFs to enable for that HCA).

This parameter can be set in one of the following ways. For example:
- num_vfs=5: the driver will enable 5 VFs on the HCA and t
To recompile the driver:
   > cd /usr/src/mellanox-mlnx-en-2.0
   > scripts/mlnx_en_patch.sh
   > make
   > make install

The uninstall and performance tuning scripts are installed. If the driver was installed without KMP support, the sources would be located under /usr/src/mlnx_en-2.0/.

2.3 Loading the Driver

Step 1: Make sure no previous driver version is currently loaded:
   > modprobe -r mlx4_en
Step 2: Load the new driver version:
   > modprobe mlx4_en
The result is a new net device appearing in the 'ifconfig -a' output.

For details on driver usage and configuration, please refer to Section 3, "Ethernet Driver Usage and Configuration," on page 16.

Note: On Ubuntu OS, the mlnx-en service is responsible for loading the mlx4_en driver upon boot.

2.4 Unloading the Driver

To unload the Ethernet driver:
   > modprobe -r mlx4_en

2.5 Uninstalling the Driver

To uninstall the mlnx_en driver:
   > /sbin/mlnx_en_uninstall.sh

3 Ethernet Driver Usage and Configuration

To assign an IP address to the interface:
   > ifconfig eth<x> <ip>
where 'x' is the OS-assigned interface number.

To check driver and device information:
   > ethtool -i eth<x>
Example:
   > ethtool -i eth2
   driver: mlx4_en
   version: 2.1.8 (Oct 06 2013)
   firmware-version: 2.30.3110
   bus-info: 00