User Manual

Contents

1. Table of Contents (Rev 6.9.1)
   A.3.1 UDP MC Ping-pong Over 1 Gb
   A.3.2 UDP MC Ping-pong Over 10 Gb
   A.3.3 UDP MC Ping-pong Over 10 Gb + VMA
   A.3.4 UDP MC Ping-pong Summary
   A.4 Bandwidth and Packet Rate With Throughput Test
   A.4.1 TCP Throughput Over 10 Gb
   A.4.2 TCP Throughput Over 10 Gb + VMA
   A.4.3 TCP Throughput Summary
   A.5 sockperf Subcommands
   A.5.1 Additional Options
   A.5.2 Sending Bursts
   A.6 Debugging
   A.7 Troubleshooting sockperf
   Appendix B Multicast Routing
   B.1 Multicast Interface
   Appendix C Acronyms
   List of Tables
   Table 1: Document Revision History
   Table 2: Target Process Statement Options
   Table 3: Socket Transport Statement Options
   Table 4: C
2. [sockperf percentile output; illegible in the source]
   3. Analyze the client output: the average latency is 1.108 usec.

   7 VMA Extra API
   7.1 Overview of the VMA Extra API
   The information in this chapter is intended for application developers who want to use VMA's Extra API to maximize performance with VMA:
   • To further lower latencies
   • To increase throughput
   • To gain additional CPU cycles for the application logic
   • To better control offload capabilities
   All socket applications are limited to the given socket API interface functions. The VMA Extra API enables VMA to open a new set of functions that allow the application developer to add code which uses zero-copy receive function calls and low-level packet filtering, by inspecting the incoming packet headers or packet payload at a very early stage in the processing. VMA is designed as a dynamically linked user-space library. As such, the VMA Extra API has been designed to allow the user to dynamically load VMA and to detect at runtime if the additional functionality described here is available o
3. Parameter: Description
   tx total byte: The number of transmit bytes from InfiniBand to Ethernet associated with a TFM rule that has a log counter n. The above example shows the number of bytes sent from InfiniBand to Ethernet (one way), or sent between InfiniBand and Ethernet, and matching the two TFM rules with log counter 1.
   rx total pack: The number of receive packets from Ethernet to InfiniBand associated with a TFM rule that has a log counter n.
   rx total byte: The number of receive bytes from Ethernet to InfiniBand associated with a TFM rule that has a log counter n.

   8 Debugging, Troubleshooting, and Monitoring
   8.1 Monitoring: the vma_stats Utility
   Networking applications open various types of sockets. The VMA library holds the following counters:
   • Separate performance counters for each socket of the datagram (UDP/IP) family type.
   • Internal performance counters which accumulate information for select(), poll(), and epoll_wait() usage by the whole application. An additional performance counter logs the CPU usage of VMA during select(), poll(), or epoll_wait() calls. VMA calculates this counter only if the VMA_CPU_USAGE_STATS parameter is enabled; otherwise this counter is not in use and displays the default value of zero.
   • VMA internal CQ performance counters.
   • VMA internal RING performance counters.
   Use the included vma
4. By default, all sockets use the same ring for both RX and TX over the same interface (VMA_RING_ALLOCATION_LOGIC_TX, VMA_RING_ALLOCATION_LOGIC_RX). For different interfaces, different rings are used, even when specifying the logic to be per socket or thread. The logic options are:
   0: Ring per interface
   10: Ring per socket (using socket ID as separator)
   20: Ring per thread (using the ID of the thread in which the socket was created)
   30: Ring per core (using CPU ID)
   31: Ring per core, attach threads (attach each thread to a CPU core)
   Default: 0

   VMA_RING_MIGRATION_RATIO_TX, VMA_RING_MIGRATION_RATIO_RX: Ring migration ratio is used with the "ring per thread" logic in order to decide when it is beneficial to replace the socket's ring with the ring allocated for the current thread. Every VMA_RING_MIGRATION_RATIO iterations (of accessing the ring), the current thread ID is checked to see whether the ring matches the current thread. If not, ring migration is considered. If the ring continues to be accessed from the same thread for a certain number of iterations, the socket is migrated to this thread's ring. Use a value of -1 to disable migration. Default: 100

   VMA_RING_LIMIT_PER_INTERFACE: Limits the number of rings that can be allocated per interface. For example, in ring allocation per socket logic, if the number of sockets using the same interface is larger than
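The ring allocation options above can be pictured as choosing a grouping key for each socket: sockets with the same key share a ring. The following is a minimal sketch of that idea, not VMA's implementation. The function name `ring_key` and its arguments are hypothetical; only the numeric option values 0, 10, 20 and 30 come from the text.

```python
# Illustrative model only: VMA's real ring selection is internal to the library.
PER_INTERFACE = 0   # option values taken from the parameter description
PER_SOCKET = 10
PER_THREAD = 20
PER_CORE = 30


def ring_key(logic, interface, socket_fd, thread_id, cpu_id):
    """Return a hashable key; sockets with equal keys would share a ring.

    The interface is always part of the key, because different interfaces
    use different rings regardless of the selected logic.
    """
    if logic == PER_SOCKET:
        return (interface, "socket", socket_fd)
    if logic == PER_THREAD:
        return (interface, "thread", thread_id)
    if logic == PER_CORE:
        return (interface, "core", cpu_id)
    return (interface,)  # default: one ring per interface


# Two sockets on eth2 created by different threads:
a = ring_key(PER_THREAD, "eth2", socket_fd=5, thread_id=100, cpu_id=0)
b = ring_key(PER_THREAD, "eth2", socket_fd=6, thread_id=101, cpu_id=0)
print(a == b)  # False: each thread gets its own ring
```

With the default logic (0) the same two sockets would map to the same key, which matches the statement that all sockets on an interface share one ring by default.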
5. Look at the Ethernet counters by using the ifconfig command to understand whether the traffic is passing through the kernel or through the VMA (Rx and Tx).

   NIC Counters
   Look at the NIC counters to monitor HW interface level packets received and sent, drops, errors, and other useful information:
   ls /sys/class/net/eth2/statistics/

   Troubleshooting
   This section lists problems that can occur when using VMA and describes solutions for these problems.
   • Problem: High log level
   VMA WARNING: ***************************************************************
   VMA WARNING: * VMA is currently configured with high log level             *
   VMA WARNING: * Application performance will decrease in this log level!    *
   VMA WARNING: * This log level is recommended for debugging purposes only   *
   VMA WARNING: ***************************************************************
   This warning message indicates that you are using VMA with a high log level. The VMA_TRACELEVEL variable value is set to 4 or more, which is good for troubleshooting but not for live runs or performance m
6. Interpretation of the results: The example shows an average latency of 6.825 usec.

   A.3.3 UDP MC Ping-pong Over 10 Gb + VMA
   To run UDP MC ping-pong over 10 Gb Ethernet + VMA:
   9. After configuring the routing table as described in Configuring the Routing Table for Multicast Tests on page 47, run the server by using:
      [...]
   10. Run the client by using:
      [...]
   The following output is obtained:
   [VMA INFO startup banner; only the entries "Log Level 3" and the year 2012 in "Current Time" are legible in the source]
   sockperf: [...] PORT = 11111 # UDP
   sockperf: Warmup stage (sending a few dummy messages)...
   sockperf: Starting test...
   sockperf: Test end (interrupted by timer)
7. The VMA library accelerates TCP and UDP socket applications by offloading traffic from the user space directly to the network interface card (NIC) or Host Channel Adapter (HCA), without going through the kernel and the standard IP stack (kernel bypass). VMA increases overall traffic packet rate, reduces latency, and improves CPU utilization.

   Basic Features
   The VMA library utilizes the direct hardware access and advanced polling techniques of RDMA-capable network cards. Utilization of InfiniBand's and Ethernet's direct hardware access enables the VMA kernel bypass, which causes the VMA library to bypass the kernel's network stack for all IP network traffic transmit and receive socket API calls. Thus, applications using the VMA library gain many benefits, including:
   • Reduced context switches and interrupts, which result in:
     • Lower latencies
     • Higher throughput
     • Improved CPU utilization
   • Minimal buffer copies between user data and hardware. VMA needs only a single copy to transfer a unicast or multicast offloaded packet between hardware and the application's data buffers.

   Target Applications
   Good application candidates for VMA include, but are not limited to:
   • Fast transaction-based network applications which require a high rate of request-response type operations over TCP or UDP unicast. This also includes any send/receive to/from an external network entity, such as a Market Data Order Gateway application working with an
8. sockperf: Test ended
   sockperf: [Total Run] RunTime=1.100 sec; SentMessages=299903; ReceivedMessages=299902
   sockperf: ========= Printing statistics for Server No: 0
   sockperf: [Valid Duration] RunTime=1.000 sec; SentMessages=272956; ReceivedMessages=272956
   sockperf: ====> avg-lat= 1.809 (std-dev=0.244)
   sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
   sockperf: Summary: Latency is 1.809 usec
   [observation count and percentile lines are illegible in the source]

   Interpretation of results: The example shows an average latency of 1.809 usec.

   A.3.4 UDP MC Ping-pong Summary
   Table 12: UDP MC Ping-pong Results
   Test              1 Gb Ethernet       10 Gb Ethernet      10 Gb Ethernet + VMA
   Latency           15.096 usec         6.825 usec          1.809 usec
   VMA Improvement   13.287 usec (89%)   5.016 usec (73%)    -

   A.4 Bandwidth and Packet Rate With Throughput Test
   To determine the maximum bandwidth and highest message rate for a single-process, single-threaded network application, sockperf attempts to send the maximum amount of data in a specific period of time.

   A.4.1 TCP Throughput Over 10 Gb
   To run TCP throughput over 10 Gb
9. Default value is 0 (no limit).

   VMA_TX_SEGS_TCP: Number of TCP LWIP segments allocation for each VMA process. Default: 1000000

   VMA_TX_BUFS: Number of global Tx data buffer elements allocation. Default: 200000

   Configuring VMA
   VMA Configuration Parameter: Description and Examples
   (The parameter column of this page lists VMA_TX_WRE, VMA_TX_MAX_INLINE, VMA_TX_MC_LOOPBACK, VMA_TX_NONBLOCKED_EAGAINS, VMA_TX_PREFETCH_BYTES, VMA_RX_BUFS, and VMA_RX_WRE.)

   VMA_TX_WRE: Number of Work Request Elements allocated in all transmit QPs. The number of QPs can change according to the number of network offloaded interfaces. Default: 16000
   The size of the Tx buffers is determined by the MTU parameter value (see below). If this value is raised, the packet rate peaking can be better sustained; however, this increases memory usage. A smaller number of data buffers gives a smaller memory footprint but may not sustain peaks in the data rate.

   VMA_TX_MAX_INLINE: Max send inline data set for QP. Data copied into the INLINE space is at least 32 bytes of headers, and the rest can be user datagram payload. VMA_TX_MAX_INLINE=0 disables INLINEing on the Tx transmit path. In older releases this parameter was called VMA_MAX_INLINE. Default: 224

   VMA_TX_MC_LOOPBACK: Sets the initial value used internally by the VMA to control multicast loopback packet behavior during transmission. An application that calls setsockopt() with IP_MULTICAST_LOOP overwrites the initial value set by this par
10. [continuation of the VMA startup banner; each entry shows a setting, its value, and the parameter that controls it; entries that are illegible in the source are omitted]
   Log File                                     [VMA_LOG_FILE]
   Stats File                                   [VMA_STATS_FILE]
   Stats shared memory directory  /tmp/         [VMA_STATS_SHMEM_DIR]
   Stats FD Num                   100           [VMA_STATS_FD_NUM]
   Conf File                      /etc/libvma.conf  [VMA_CONFIG_FILE]
   Application ID                               [VMA_APPLICATION_ID]
   Polling CPU idle usage         Disabled      [VMA_CPU_USAGE_STATS]
   SigIntr handling               Disabled      [VMA_HANDLE_SIGINTR]
   SegFault Backtrace             Disabled      [VMA_HANDLE_SIGSEGV]
   Ring allocation logic TX       0 (Ring per interface)  [VMA_RING_ALLOCATION_LOGIC_TX]
   Ring allocation logic RX       0 (Ring per interface)  [VMA_RING_ALLOCATION_LOGIC_RX]
   Ring migration ratio TX        100           [VMA_RING_MIGRATION_RATIO_TX]
   Ring migration ratio RX        100           [VMA_RING_MIGRATION_RATIO_RX]
   Ring limit per interface       0 (no limit)  [VMA_R
   Tx Mem Segs TCP                1000000
   Tx Mem Bufs                    200000
   Tx QP WRE                      16000
   Tx Max QP INLINE               220
   Tx MC Loopback                 Enabled
   Tx non-blocked eagains         Disabled
   Rx Mem Bufs                    200000
   Rx WRE                         16000
   Rx WRE BATCHING                64
   Rx UDP Poll OS Ratio           100
   Rx Prefetch Bytes              256
   Rx Prefetch Bytes Before Poll  0
   Rx Drain Rate                  Disabled
   GRO max streams                32
   TCP 3T rules                   Disabled
   Select Poll (usec)             100000
   Select Poll OS Force           Disabled
   Select Poll OS Ratio           10
   Select Poll Yield              Disabled
   Select Skip OS                 4
11. Range: 32 bytes to MTU size. Default: 256 bytes

   VMA_RX_CQ_DRAIN_RATE_NSEC: Socket's receive path CQ drain logic rate control. When disabled (default), the socket's receive path attempts to return a ready packet from the socket's receive ready packet queue. If the ready receive packet queue is empty, the socket checks the CQ for ready completions for processing. When enabled, even if the socket's receive ready packet queue is not empty, this parameter checks the CQ for ready completions for processing. This CQ polling rate is controlled in nanosecond resolution to prevent CPU consumption due to excessive CQ polling. This enables improved "real-time" monitoring of the socket ready packet queue. Recommended value: 100-5000 (nsec). Default: 0 (Disabled)

   VMA_GRO_STREAMS_MAX: Controls the number of TCP streams on which to perform GRO (generic receive offload) simultaneously. Disable GRO with a value of 0. Default: 32

   VMA_TCP_3T_RULES: Uses only 3-tuple rules for TCP, instead of using 5-tuple rules. This can improve performance for a server with a listen socket which accepts many connections from the same source IP. Enable with a value of 1. Default: 0 (Disabled)

   VMA_ETH_MC_L2_ONLY_RULES: Uses only L2 rules for Ethernet Multicast. All loopback traffic will be handled by VMA instead of OS. Enable with a value of 1. Default: 0 (Disabled)
12. The VMA configuration parameters are Linux OS environment variables. It is recommended that you set these parameters prior to loading the application with VMA. You can set the parameters in a system file, which can be run manually or automatically. All the parameters have defaults that can be modified.

   On default startup, the VMA library prints the VMA version information, as well as the configuration parameters being used and their values, to stderr. VMA always logs the values of the following parameters, even when they are equal to the default value:
   • VMA_TRACELEVEL
   • VMA_LOG_FILE
   For all other parameters, VMA logs the parameter values only when they are not equal to the default value.

   NOTE: The VMA version information, parameters, and values are subject to change.

   Example:
   VMA INFO : ---------------------------------------------------------
   VMA INFO : VMA_VERSION: 6.9.1-0 Release built on 2015-05-10-16:20:19
   VMA INFO : Cmd Line: sockperf sr
   VMA DEBUG: Current Time: [...]
   VMA DEBUG: Pid: 24714
   VMA DEBUG: OFED Version: [...]
   VMA DEBUG: System: [...]
   VMA DEBUG: Architecture: [...]
   VMA DEBUG: Node: [...]
   VMA DEBUG: ---------------------------------------------------------
   Log Level       4          [VMA_TRACELEVEL]
   Log Details     2          [VMA_LOG_DETAILS]
   Log Colors      Enabled    [VMA_LOG_COLORS]
   (the banner continues with VMA_LOG_FILE, VMA_STATS_FILE, and VMA_STATS_SHMEM_DIR)
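The logging rule above (two parameters are always logged, all others only when they differ from their default) can be sketched as a small filter. This is an illustration of the rule as stated, not VMA code. The function name `params_to_log` is hypothetical and the default values in the example dictionary are illustrative assumptions.

```python
# Parameters that the text says are always logged, even at their default value.
ALWAYS_LOGGED = {"VMA_TRACELEVEL", "VMA_LOG_FILE"}


def params_to_log(settings, defaults):
    """Return the subset of settings that would appear in the startup log."""
    return {
        name: value
        for name, value in settings.items()
        if name in ALWAYS_LOGGED or value != defaults.get(name)
    }


# Illustrative defaults (assumed for this example only).
defaults = {"VMA_TRACELEVEL": "3", "VMA_LOG_FILE": "",
            "VMA_TX_BUFS": "200000", "VMA_RX_WRE": "16000"}
# The user changed only VMA_RX_WRE.
settings = dict(defaults, VMA_RX_WRE="32000")

print(sorted(params_to_log(settings, defaults)))
# ['VMA_LOG_FILE', 'VMA_RX_WRE', 'VMA_TRACELEVEL']
```

VMA_TX_BUFS is left out of the result because it still equals its default, while the two always-logged parameters appear even though they were not changed.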
13. VMA_SELECT_POLL: The duration in microseconds (usec) in which to poll the hardware on the Rx path before blocking for an interrupt (when waiting, and also when calling select(), poll(), or epoll_wait()). Range: -1, 0, ..., 100,000,000. Default: 100,000
   When the select path has successful receive poll hits, the latency improves dramatically. However, this comes at the expense of CPU utilization. For more information, see Debugging, Troubleshooting, and Monitoring on page 37.

   VMA_SELECT_POLL_OS_RATIO: This enables polling the OS file descriptors while the user thread calls select(), poll(), or epoll_wait() and VMA is busy in the offloaded socket polling loop. This results in a single poll of the non-offloaded sockets every VMA_SELECT_POLL_OS_RATIO offloaded socket (CQ) polls. When disabled, only offloaded sockets are polled. See VMA_SELECT_POLL for more information. Disable with 0. Default: 10

   VMA_SELECT_POLL_YIELD: When an application runs with multiple threads on a limited number of cores, each thread polling inside VMA (select(), poll(), or epoll_wait()) should yield the CPU to other polling threads so as not to starve them from processing incoming packets. Default: 0 (Disabled)

   VMA_SELECT_SKIP_OS: In select(), poll(), or epoll_wait(), forces the VMA to check the non-offloaded sockets even though an offloaded socket has a ready packet that was found while polling. Range: 0 to 10,000. Default: 4

   VMA_SELECT_CQ_IRQ: When disabl
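The ratio described for VMA_SELECT_POLL_OS_RATIO (one poll of the non-offloaded sockets for every N offloaded CQ polls, with 0 disabling OS polling) can be modelled in a few lines. This is a toy model of the interleaving only; the function name `poll_schedule` is hypothetical and no real polling is performed.

```python
def poll_schedule(total_cq_polls, os_ratio):
    """Return the order of polls: 'cq' for offloaded sockets, 'os' for OS fds.

    One 'os' poll is inserted after every `os_ratio` 'cq' polls.
    A ratio of 0 means only offloaded sockets are polled.
    """
    schedule = []
    for i in range(1, total_cq_polls + 1):
        schedule.append("cq")
        if os_ratio > 0 and i % os_ratio == 0:
            schedule.append("os")
    return schedule


# With the default ratio of 10, 20 CQ polls include 2 OS polls.
print(poll_schedule(20, 10).count("os"))  # 2
# With the ratio set to 0, the OS file descriptors are never polled.
print(poll_schedule(20, 0).count("os"))   # 0
```

A lower ratio makes non-offloaded sockets more responsive at the cost of more time spent outside the offloaded polling loop.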
14. [earlier Table of Contents entries are illegible in the source]
   2.2 Socket Types
   4 Configuring VMA
   4.1 Configuring libvma.conf
   4.1.1 Configuring Target Application or Process
   4.1.2 Configuring Socket Transport Control
   4.1.3 Example of VMA Configuration
   4.2 VMA Configuration Parameters
   4.2.1 Configuration Parameter Values
   4.2.2 Beta Level Features Configuration Parameters
   5 Using sockperf with VMA
   6 Example: Running sockperf Ping-pong Test
   7 VMA Extra API
   7.1 Overview of the VMA Extra API
   7.2 Using VMA Extra API
   7.3 Control Off-load Capabilities During
15. Run the server by using:
      [...]
   12. Run the client by using:
      [...]
   where -m/--msg-size is the message size in bytes (minimum default 12).
   The following output is obtained:
   sockperf: Total of 1222013 messages sent in 1.100 sec
   sockperf: Summary: Message Rate is 1165457 [msg/sec]
   sockperf: Summary: BandWidth is 13.338 MBps (106.701 Mbps)
   Notes:
   • You can use --tcp-avoid-nodelay to deliver TCP messages immediately (default ON).
   • For more sockperf throughput options, run: sockperf tp -h

   A.4.2 TCP Throughput Over 10 Gb + VMA
   To run TCP throughput over 10 Gb + VMA:
   13. Run the server by using:
      [...]
   14. Run the client by using:
      [...]
   The following output is obtained:
   sockperf: Total of 2415772 messages sent in 1.100 sec
   sockperf: Summary: Message Rate is 7648873 [msg/sec]
   sockperf: Summary: BandWidth is 87.534 MBps (700.275 Mbps)

   A.4.3 TCP Throughput Summary
   Table 13: TCP Throughput Results
   Test              10 Gb Ethernet               10 Gb Ethernet + VMA
   Message Rate      1165457 msg/sec              7648873 msg/sec
   Bandwidth         13.338 MBps (106.701 Mbps)   87.534 MBps (700.275 Mbps)
   VMA Improvement   74.196 MBps (662%)

   A.5 sockperf Subcommands
   You can use additional sockperf subcommands.
   Usage: sockperf <subcommand> [options] [args]
   • To display help for
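The bandwidth figures in Table 13 follow from the message rate and the message size. The sketch below reproduces them under two assumptions that the source does not state outright: the payload is the 12-byte minimum default message size mentioned above, and "MB" means 1024 * 1024 bytes. The function name `bandwidth` is hypothetical.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes


def bandwidth(msg_per_sec, msg_size_bytes):
    """Return (MBps, Mbps) for a given message rate and message size."""
    mbytes_per_sec = msg_per_sec * msg_size_bytes / BYTES_PER_MB
    return mbytes_per_sec, mbytes_per_sec * 8


for label, rate in (("10 Gb Ethernet", 1165457),
                    ("10 Gb Ethernet + VMA", 7648873)):
    mbps_bytes, mbps_bits = bandwidth(rate, msg_size_bytes=12)
    print(f"{label}: {mbps_bytes:.3f} MBps ({mbps_bits:.3f} Mbps)")
# 10 Gb Ethernet: 13.338 MBps (106.701 Mbps)
# 10 Gb Ethernet + VMA: 87.534 MBps (700.275 Mbps)
```

Because both runs use the same message size, the bandwidth gain is entirely due to the higher message rate.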
16. Statement Options
   Option: Description
   <program name|*>: Define the program name (not including the path) to which the control statements appearing below this statement apply. Wildcards with the same semantics as "ls" are supported (* and ?). For example:
     • db2* matches any program with a name starting with db2.
     • t?cp matches ttcp, etc.
   <user defined id|*>: Specify the process ID to which the control statements appearing below this statement apply.
   Note: You must also set the VMA_APPLICATION_ID environment variable to the same value as user-defined-id.

   4.1.2 Configuring Socket Transport Control
   Use socket control statements to specify when libvma will offload AF_INET/SOCK_STREAM or AF_INET/SOCK_DATAGRAM sockets (currently SOCK_RAW is not supported). Each control statement specifies a matching rule that all its subexpressions must evaluate as true (logical and) to apply. Statements are evaluated in order of definition, according to "first match". Socket control statements use the following format:
   use <transport> <role> <address|*>:<port range|*>

   Table 3: Socket Transport Statement Options
   Option: Description
   transport: Define the mode of transport:
     • vma: VMA should be used.
     • os: The socket should be handled by the OS network stack. In this mode, the sockets are not offloaded.
     The default is [...]
   role: Specify one of the following roles:
     • tcp_server for listen sockets. Accepted sockets fol
17. [continuation of the VMA startup banner; each entry shows a setting, its value, and the parameter that controls it; entries that are illegible in the source are omitted]
   Offloaded Sockets              Enabled
   Timer Resolution (msec)        10            [VMA_TIMER_RESOLUTION_MSEC]
   TCP Timer Resolution (msec)    100           [VMA_TCP_TIMER_RESOLUTION_MSEC]
   Delay after join (msec)        0             [VMA_WAIT_AFTER_JOIN_MSEC]
   Delay after rereg (msec)       500           [VMA_WAIT_AFTER_REREG_MSEC]
   Internal Thread Arm CQ         Disabled
   Internal Thread Cpuset                       [VMA_INTERNAL_THREAD_CPUSET]
   Thread mode                                  [VMA_THREAD_MODE]
   Mem Allocate type
   Num of UC ARPs                 3             [VMA_NEIGH_UC_ARP_QUATA]
   UC ARP delay (msec)            10000         [VMA_NEIGH_UC_ARP_DELAY_MSEC]
   Num of neigh restart retries   1             [VMA_NEIGH_NUM_ERR_RETRIES]
   BF (Blue Flame)                Enabled
   fork() support                 Enabled       [VMA_FORK]
   MTU                            1500          [VMA_MTU]
   MSS                            0 (follow VMA_MTU)  [VMA_MSS]
   TCP CC Algorithm               0 (LWIP)      [VMA_TCP_CC_ALGO]
   TCP scaling window                           [VMA_WINDOW_SCALING]
   Suppress IGMP ver. warning     Disabled      [VMA_SUPPRESS_IGMP_WARNING]

   4.2.1 Configuration Parameter Values
   The following table lists the VMA configuration parameters and their possible values.
   Table 4: Configuration Parameter Values
   VMA Configuration Parameter: Description and Examples
   VMA_TRACELEVEL: PANIC = Panic level logging. This trace level causes fatal behavior and halts the appli
18. The library is a dynamically linked user-space library. Use of the library does not require any code changes or recompiling of user applications. Instead, it is dynamically loaded via the Linux OS environment variable LD_PRELOAD.

   When a user application transmits TCP and UDP, unicast and multicast IPv4 data, or listens for such network traffic data, the VMA library:
   • Intercepts the socket receive and send calls made to the stream socket or datagram socket address families.
   • Implements the underlying work in user space (instead of allowing the buffers to pass on to the usual OS network kernel libraries).

   VMA implements native RDMA verbs API. The native RDMA verbs have been extended into the Ethernet RDMA-capable NICs, enabling the packets to pass directly between the user application and the InfiniBand HCA or Ethernet NIC, bypassing the kernel and its TCP/UDP handling network stack. You can implement the code in native RDMA verbs API without making any changes to your applications. The VMA library does all the heavy lifting under the hood, while transparently presenting the same standard socket API to the application, thus redirecting the data flow.

   The VMA library operates in a standard networking stack fashion to serve multiple network interfaces. The VMA library behaves according to the way the application calls the bind, connect, and setsockopt directives and the administrator sets the route lookup to determine the inter
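Because the library is loaded through LD_PRELOAD and configured through environment variables, launching an unmodified application under VMA comes down to preparing its environment. The sketch below shows one way to build such an environment from Python. The helper name `vma_environment` is hypothetical, the library file name `libvma.so` is assumed to be resolvable by the dynamic loader, and the launch itself is shown only as a comment because it needs a host with VMA installed.

```python
import os


def vma_environment(base_env, **vma_params):
    """Return a copy of base_env that preloads VMA and sets VMA_* variables."""
    env = dict(base_env)
    existing = env.get("LD_PRELOAD", "")
    # Put libvma first while keeping any library that was already preloaded.
    env["LD_PRELOAD"] = "libvma.so" if not existing else "libvma.so:" + existing
    for name, value in vma_params.items():
        env[name] = str(value)  # environment values must be strings
    return env


env = vma_environment(os.environ, VMA_TRACELEVEL=3, VMA_RX_BUFS=200000)
print(env["LD_PRELOAD"].split(":")[0])  # libvma.so

# On a host with VMA installed, the application would then be started with:
#   import subprocess
#   subprocess.run(["sockperf", "sr"], env=env)
```

The application binary itself is untouched, which is the point of the preload mechanism: the same executable runs with or without VMA depending only on its environment.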
19. a specific subcommand, use: sockperf <subcommand> --help
   • To display the program version number, use: sockperf --version

   Table 14: Available Subcommands
   Option            Description                                                For help, use
   help (h)          Display a list of supported commands.
   under-load (ul)   Run sockperf client for latency under load test.           sockperf ul -h
   ping-pong (pp)    Run sockperf client for latency test in ping-pong mode.    sockperf pp -h
   playback (pb)     Run sockperf client for latency test using playback        sockperf pb -h
                     of predefined traffic, based on timeline and message size.
   throughput (tp)   Run sockperf client for one-way throughput test.           sockperf tp -h
   server (sr)       Run sockperf as a server.                                  sockperf sr -h
   For additional information, see http://code.google.com/p/sockperf.

   A.5.1 Additional Options
   The following tables describe additional sockperf options and their possible values.

   Table 15: General sockperf Options
   Full Command   Description
   --help         Show the help message and exit.
   --tcp          Use TCP protocol (default UDP).
   --ip           Listen on/send to IP <ip>.
   --port         Listen on/connect to port <port> (default 11111).
   --file         Read multiple ip+port combinations from file <file> (server uses select).
   --iomux-type   Type of multiple file descriptors handle [s|select|p|poll|e|epoll|r|recvfrom] (default select).
   --timeout      Set select/poll/epoll timeout to <msec>, or -1 for infinite (default is 10 msec).
   --activity     Measur
20. be sent without IP fragmentation. A value of 0 sets VMA's TCP MSS to be aligned with the MTU configuration, leaving 40 bytes of room for IP + TCP headers: TCP MSS = VMA_MTU - 40. Other VMA_MSS values force VMA's TCP MSS to that specific value. Default: 0 (following VMA_MTU)

   VMA_WINDOW_SCALING: TCP scaling window. This value (factor range from 0 to 14, -1 to disable, -2 to use the OS value) sets the factor by which the TCP window is scaled. A factor of 0 allows using the TCP scaling window of the remote host, while not changing the window of the local host. A value of -1 disables both directions. A value of -2 uses the OS maximum receive buffer value to calculate the factor. Make sure that VMA buffers are big enough to support the window. Default: 3

   VMA_CLOSE_ON_DUP2: When this parameter is enabled, VMA handles the duplicated file descriptor (oldfd) as if it is closed (clears internal data structures) and only then forwards the call to the OS. This is, in effect, a very rudimentary dup2 support. It supports only the case where dup2 is used to close file descriptors. Default: 1 (Enabled)

   VMA_INTERNAL_THREAD_AFFINITY: Controls which CPU core(s) the VMA internal thread is serviced on. The CPU set should be provided as either a hexadecimal value that represents a bitmask or as a comma-delimited list of values (ranges are OK). Both
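The VMA_MSS rule at the start of this excerpt is simple arithmetic and can be checked with a short sketch. The function name `effective_tcp_mss` is hypothetical; the 40-byte header allowance and the "0 follows VMA_MTU" behaviour come from the text, and the 1500-byte MTU used as a default argument is the value shown in the startup banner elsewhere in this manual.

```python
IP_TCP_HEADER_BYTES = 40  # room left for IP + TCP headers, per the text


def effective_tcp_mss(vma_mtu=1500, vma_mss=0):
    """Return the TCP MSS that results from the VMA_MTU and VMA_MSS settings."""
    if vma_mss == 0:
        # 0 means "follow VMA_MTU": align the MSS with the MTU.
        return vma_mtu - IP_TCP_HEADER_BYTES
    # Any other value forces the MSS to that specific value.
    return vma_mss


print(effective_tcp_mss())                # 1460
print(effective_tcp_mss(vma_mtu=9000))    # 8960
print(effective_tcp_mss(1500, 1400))      # 1400
```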
21. zero-copy cannot be performed. If zero-copy is performed, the flag MSG_VMA_ZCOPY is set upon exit. If zero-copy is performed (the MSG_VMA_ZCOPY flag is returned), the buffer is filled with a vma_packets_t structure holding as many fragments as len allows. The total size of all fragments is returned. Otherwise, the buffer is filled with actual data and its size is returned (same as recvfrom()).
   Return Values: If the return value is positive, data copy has been performed. If the return value is zero, no data has been received.

   7.4.2 Freeing Zero Copied Packet Buffers
   Description: Frees a packet received by recvfrom_zcopy() or held by the receive callback.
   Syntax:
   int (*free_packets)(int s, struct vma_packet_t *pkts, size_t count);
   Parameters:
   Table 10: Freeing Zero-copy Datagram Parameters
   Parameter Name   Description                                     Values
   s                Socket from which the packet was received.
   Return Values: 0 on success, -1 on failure. errno is set to:
   • EINVAL: not a VMA offloaded socket
   • ENOENT: the packet was not received from s
   Example:
   [the example is illegible in the source; the surviving text is a rule table with the columns entry, Source, Source mask, Dest, Dest mask, Interface, Service, Routing, Status, and Log]
22. [the option names in this table are illegible in the source]
   <level>: Sets the VMA log level to <level> (1 <= level <= 7).
   <level>: Sets the VMA log detail level to <level> (0 <= level <= 3).
   sockets=<list|range>: Logs only sockets that match <list> or <range> (format: 4-16 or 1,9, or a combination).
   h: Prints the help message.

   8.1.1 Examples
   The following sections contain examples of the vma_stats utility.

   8.1.1.1 Example 1
   Description: The following example demonstrates basic use of the vma_stats utility.
   Command Line:
      [...]
   NOTE: If there is only a single process running over VMA, it is not necessary to use the -p option, since vma_stats will automatically recognize the process.
   Output: If no process with a suitable pid is running over the VMA, the output is:
      [...] to identify process
   If an appropriate process was found, the output is:
   [per-socket table; the values are illegible in the source. Columns for offloaded traffic: pkt, Kbyte, eagain, error, polls. Columns for OS traffic: pkt, Kbyte, error. One Rx row and one Tx row are shown for the socket.]
   Analysis of the Output:
   • A single socket with user fd=14 was created.
   • Received 140479898 packets, 274374 Kilobytes via the socket.
   • Transmitted 140479898 packets, 274374 Kilobytes via the socket.
   • All the traffic was offloaded. No packets were transmitted or received via the OS.
   • There were no mi
23. in the routing table are mapped to the interface you are working on. If they are not mapped, you can map them as follows:
   # route add -net 224.0.0.0 netmask 240.0.0.0 dev ib0
   It is best to perform the mapping before running the user application with VMA, so that multicast packets are routed via the InfiniBand/10 Gb Ethernet interface and not via the default Ethernet interface eth0. The general rule is that the VMA routing is the same as the OS routing.

   Appendix C: Acronyms
   Table 18: Acronym Table
   Acronym   Definition
   [the table entries are illegible in the source]
24. known as the "median" and is different from the statistical average.
   • 99th percentile: The latency value for which 99 percent of the observations are smaller than it (and 1 percent are higher).
   These percentiles, and the other percentiles that the histogram provides, are very useful for analyzing spikes in the network traffic.

   Sockperf can provide a full log of all packets' tx and rx times by dumping all the data that it uses for calculating percentiles and building the histogram to a comma-separated file. This file can be further analyzed using external tools such as Microsoft Excel or matplotlib. All these additional calculations and reports are executed after the fast path is completed. This means that using these options has no effect on the benchmarking of the test itself. During runtime of the fast path, sockperf records txTime and rxTime of packets using the TSC CPU register, which has a negligible effect on the benchmark itself, as opposed to using the computer's clock, which can affect benchmarking results.

   A.2 Configuring the Routing Table for Multicast Tests
   If you want to use multicast, you must first configure the routing table to map multicast addresses to the Ethernet interface, on both client and server.
   Example:
   # route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0
   where eth0 is the 10 Gb Ethernet interface.
   You can also set the interface at runtime in sockperf:
   • Use --mc-rx-if <ip> to set the address
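To see why the percentiles described in this excerpt expose spikes that the average hides, consider the following sketch. It uses the nearest-rank definition of a percentile, which is an assumption made for illustration: the source does not say which method sockperf uses. The latency values are invented for the example.

```python
import math


def percentile(sorted_obs, pct):
    """Nearest-rank percentile of an ascending list of observations."""
    rank = max(1, math.ceil(pct * len(sorted_obs) / 100))
    return sorted_obs[rank - 1]


# 100 invented latency observations (usec): mostly 1.0, with one large spike.
latencies = sorted([1.0] * 98 + [1.5, 50.0])

median = percentile(latencies, 50)        # 50th percentile
p99 = percentile(latencies, 99)           # 99th percentile
mean = sum(latencies) / len(latencies)    # statistical average

print(median, p99, round(mean, 3))  # 1.0 1.5 1.495
```

The single 50 usec spike pulls the average up by almost 50 percent, while the median stays at 1.0 usec, which is the distinction the text draws between the median and the statistical average.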
25. of the interface on which to receive multicast packets (can be different from the route table).
   • Use --mc-tx-if <ip> to set the address of the interface on which to transmit multicast packets (can be different from the route table).

   A.3 Latency with Ping-pong Test
   To measure latency statistics, after the test completes, sockperf calculates the round-trip times (divided by two) between the client and the server for all messages. It then provides the average, statistics, and histogram.

   A.3.1 UDP MC Ping-pong Over 1 Gb
   To run UDP MC ping-pong over 1 Gb Ethernet:
   4. On both client and server, configure the routing table to map multicast addresses to the Ethernet interface by using:
      # route add -net 224.0.0.0 netmask 240.0.0.0 dev eth1
      where eth1 is the 1 Gb Ethernet interface.
   5. Run the server by using:
      [...]
   6. Run the client by using:
      [...]
   The following output is obtained:
   sockperf: Warmup stage (sending a few dummy packets)...
   sockperf: Starting test...
   sockperf: Test end (interrupted by timer)
   sockperf: [Total Run] RunTime=1.100 sec; SentMessages=36304; ReceivedMessages=36303
   sockperf: ========= Printing statistics for Server No: 0
   sockperf: [Valid Duration] RunTime=1.000 sec; SentMessages=33026; ReceivedMessages=33026
   sockperf: ====> avg-lat= 15.096 (std-dev=0.300)
   sockpe
26. the limit, several sockets will share the same ring.
   Note: VMA_RX_BUFS might need to be adjusted in order to have enough buffers for all rings in the system. Each ring consumes VMA_RX_WRE buffers. Use a value of 0 for an unlimited number of rings. Default: 0 (no limit)

   VMA_TCP_CC_ALGO: TCP congestion control algorithm. The default algorithm coming with LWIP is a variation of Reno/New-Reno. The new Cubic algorithm was adapted from the FreeBSD implementation. Use a value of 0 for the LWIP algorithm. Use a value of 1 for the Cubic algorithm. Default: 0 (LWIP)

   5 Using sockperf with VMA
   Sockperf is VMA's sample application for testing latency and throughput over a socket API. The precompiled sockperf binary is located in /usr/bin/sockperf.
   To run a sockperf UDP test:
   • To run the server, use:
      [...]
   • To run the client, use:
      [...]
   where <server ip> is the IP address of the server, and <sockperf test> is the test you want to run, for example, pp for the ping-pong test, tp for the throughput test, and so on. Use sockperf -h to display a list of all available tests.
   To run a sockperf TCP test:
   • To run the server, use:
      [...]
   • To run the client, use:
      [...]

   6 Example: Running so
27. the two machines is the RTT divided by two.
   • The average RTT is calculated by summing the round-trip times for all the packets that perform the round trip and then dividing the total by the number of packets.

   Sockperf can test the improvement of UDP/TCP traffic latency when running applications with and without VMA. Sockperf can work as a server (consumer) or execute under-load, ping-pong, playback, and throughput tests as a client (publisher). In addition, sockperf provides more detailed statistical information and analysis, as described in the following section.

   Sockperf is installed on the VMA server at /usr/bin/sockperf. For examples of running sockperf over 1 Gb and 10 Gb Ethernet, see:
   • Latency with Ping-pong Test on page 48
   • Bandwidth and Packet Rate With Throughput Test on page 50
   Note: If you want to use multicast, you must first configure the routing table to map multicast addresses to the Ethernet interface, on both client and server. See Configuring the Routing Table for Multicast Tests on page 47.

   A.1.1 Advanced Statistics and Analysis
   In each run, sockperf presents additional advanced statistics and analysis information. In addition to the average latency and standard deviation, sockperf presents a histogram with various percentiles, including:
   • 50th percentile: The latency value for which 50 percent of the observations are smaller than it. The 50th percentile is also
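The latency calculation described at the start of this excerpt (average the round-trip times, then halve the result) can be written out directly. This is a sketch of the arithmetic only. The function name `average_latency_usec` is hypothetical and the RTT samples are invented for the example.

```python
import math


def average_latency_usec(rtts_usec):
    """One-way latency estimate: the average RTT divided by two."""
    average_rtt = sum(rtts_usec) / len(rtts_usec)
    return average_rtt / 2


# Three invented round-trip times in microseconds.
samples = [30.0, 30.4, 30.2]
latency = average_latency_usec(samples)
print(math.isclose(latency, 15.1))  # True
```

Halving the RTT assumes that the path is symmetric, that is, that a packet takes about as long to travel from client to server as it does to return.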
28. usecs.
Example 4
Description: This example demonstrates how you can get multicast group membership information via vma_stats.
Command Line: [command line illegible in source]
Output: VMA Group Membership Information (group and fd number columns; values illegible in source)
If the user application performed transmit or receive activity on a socket, those values will be logged when the sockets are closed. The VMA logs its internal performance counters if VMA_TRACELEVEL=4 (see Example 5).
Example 5
Description: This is an example of a log of socket performance counters, along with an explanation of the results.
Output: [log for fd 10 listing Tx Offload, Tx OS info, Rx Offload and Rx OS info as bytes / packets / errors, followed by Rx packet and poll counters; values not fully recoverable from source]
Analysis of the Output:
• No transmission or reception errors occurred on this socket (user fd=10).
• All the traffic was offloaded. No packets were transmitted or received via the OS.
• There were practically no missed Rx polls (see VMA_RX_POLL and VMA_SELECT_POLL). This implies that the receiving thread did not enter a blocked state. Thus, there was no context switch to hurt latency.
29. will send UC ARP in case the neigh state is NUD_STALE. If the neigh state is still NUD_STALE, VMA will retry sending UC ARP VMA_NEIGH_UC_ARP_QUATA times and will then send BC ARP. Default: 3
VMA_NEIGH_UC_ARP_DELAY_MSEC: Indicates the number of msec to wait between every UC ARP. Default: 10000
VMA_NEIGH_NUM_ERR_RETRIES: Indicates the number of retries to restart the NEIGH state machine if NEIGH receives an ERROR event. Default: 1
VMA_SUPPRESS_IGMP_WARNING: Use VMA_SUPPRESS_IGMP_WARNING=1 to suppress the warnings about the igmp version not being forced to 2. Default: 0 (Disabled)
VMA_BF: Enables/disables BlueFlame usage of the card. Default: 1 (Enabled)
4.2.2 Beta Level Features Configuration Parameters
The following table lists configuration parameters and their possible values for new VMA Beta level features. The parameters below are disabled by default. These VMA features are still experimental and subject to change. They can help improve the performance of multi-thread applications. We recommend altering these parameters in a controlled environment until the best performance tuning is reached.
Table 5: Beta Level Configuration Parameter Values
VMA Configuration Parameter | Description and Examples
VMA_RING_ALLOCATION_LOGIC_TX: Ring allocation logic is used to separate the traffic into different rings
30. 372 of them offloaded (15186 via fd 15 and 15186 via fd 19), and 15185 were received via the OS through fd 23.
• There were no missed Select polls (see VMA_SELECT_POLL). This implies that the receiving thread did not enter a blocked state. Thus, there was no context switch to hurt latency.
• The CPU usage in the select call is 70%. You can use this information to calculate the division of CPU usage between VMA and the application. For example, when the CPU usage is 100%, 70% is used by VMA for polling the hardware, and the remaining 30% is used for processing the data by the application.
8.1.1.3 Example 3
Description: This example presents the most detailed vma_stats output.
Command Line: [command line illegible in source]
Output: [vma_stats output for fd 14, not fully recoverable from source; legible fields include: Blocked, MC Loop Enabled, Member of 224.7.7.7, Rx Offload 1128530 KB / 786133 / 0 / 0 (bytes / packets / eagains / errors per second), Packets dropped, Packets queue len, Drained max, RING packets count and packets bytes, Interrupt requests, Interrupt received, Moderation frame count, Moderation usec period]
Analysis of the Output:
• A single so
31. ING_LIMIT_PER_INTERFACE
VMA_TCP_MAX_SYN_FIN_RATE VMA_TX_SEGS_TCP VMA_TX_BUFS VMA_TX_WRE VMA_TX_MAX_INLINE VMA_TX_MC_LOOPBACK VMA_TX_NONBLOCKED_EAGAINS VMA_TX_PREFETCH_BYTES VMA_TX_BACKLOG_MAX
VMA_RX_BUFS VMA_RX_WRE VMA_RX_WRE_BATCHING VMA_RX_BYTES_MIN VMA_RX_POLL VMA_RX_POLL_INIT VMA_RX_UDP_POLL_OS_RATIO VMA_RX_POLL_YIELD VMA_RX_PREFETCH_BYTES VMA_RX_PREFETCH_BYTES_BEFORE_POLL VMA_RX_CQ_DRAIN_RATE_NSEC
VMA_GRO_STREAMS_MAX VMA_TCP_3T_RULES VMA_ETH_MC_L2_ONLY_RULES
VMA_SELECT_POLL VMA_SELECT_POLL_OS_FORCE VMA_SELECT_POLL_OS_RATIO VMA_SELECT_POLL_YIELD VMA_SELECT_SKIP_OS
[one "Select ..." entry illegible in source]
CQ Drain Interval (msec): 10 [VMA_PROGRESS_ENGINE_INTERVAL]
CQ Drain WCE (max): 10000 [VMA_PROGRESS_ENGINE_WCE_MAX]
CQ Interrupts Moderation: Enabled [VMA_CQ_MODERATION_ENABLE]
CQ Moderation Count: 48 [VMA_CQ_MODERATION_COUNT]
CQ Moderation Period (usec): 50 [VMA_CQ_MODERATION_PERIOD_USEC]
CQ AIM Max Count: 560 [VMA_CQ_AIM_MAX_COUNT]
CQ AIM Max Period (usec): 250 [VMA_CQ_AIM_MAX_PERIOD_USEC]
CQ AIM Interval (msec): 250 [VMA_CQ_AIM_INTERVAL_MSEC]
CQ AIM Interrupts Rate (per sec): 5000 [VMA_CQ_AIM_INTERRUPTS_RATE_PER_SEC]
CQ Poll Batch (max): 16 [VMA_CQ_POLL_BATCH_MAX]
CQ Keeps QP Full: Enabled [VMA_CQ_KEEP_QP_FULL]
QP Compensation Level: 256 [VMA
32. MEDIATE ACTION NEEDED!
VMA WARNING: * Not enough hugepage resources for VMA memory allocation.
VMA WARNING: * VMA will continue working with regular memory allocation.
VMA INFO: * Optional: 1. Disable VMA hugepage support [setting illegible in source]
VMA INFO: * 2. Restart process after increasing the number of hugepages resources in the system:
VMA INFO: * "cat /proc/meminfo | grep -i HugePage"
VMA INFO: * "echo 1000000000 > /proc/sys/kernel/shmmax"
VMA INFO: * [third command illegible in source]
VMA WARNING: * Read more about the Huge Pages in the VMA User Manual
VMA WARNING: ***************************************************************
This warning message means that you are using VMA with huge page memory allocation enabled (VMA_MEM_ALLOC_TYPE=2), but not enough huge page resources are available in the system. VMA will use contiguous pages instead.
Solution: Set VMA_MEM_ALLOC_TYPE=1 in order to enable VMA's contig pages allocation logic (this is the default setting). If you want VMA to take full advantage of the performance benefits of huge pages, restart the application after adding more huge page resources to your system, similar to the details in the warning message above, or try to
33. Mellanox TECHNOLOGIES Connect Accelerate Outperform Mellanox Messaging Accelerator VMA Library for Linux User Manual Rev 6 9 1 www mellanox com NOTE THIS HARDWARE SOFTWARE OR TEST SUITE PRODUCT PRODUCT S AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES AS IS WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS THE CUSTOMER S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT S AND OR THE SYSTEM USING IT THEREFORE MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY ANY EXPRESS OR IMPLIED WARRANTIES INCLUDING BUT NOT LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT INDIRECT SPECIAL EXEMPLARY OR CONSEQUENTIAL DAMAGES OF ANY KIND INCLUDING BUT NOT LIMITED TO PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES LOSS OF USE DATA OR PROFITS OR BUSINESS INTERRUPTION HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY WHETHER IN CONTRACT STRICT LIABILITY OR TORT INCLUDING NEGLIGENCE OR OTHERWISE ARISING IN ANY WAY FROM THE USE OF THE PRODUCT S AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF
34. Run Time
7.3.1 Adding libvma.conf Rules During Run Time
7.3.2 Creating Sockets as Off-loaded or Not Off-loaded
7.4 Packet Filtering
7.4.1 Zero Copy recvfrom
7.4.2 Freeing Zero Copied Packet Buffers
8 Debugging, Troubleshooting, and Monitoring
8.1 Monitoring the vma_stats Utility
8.1.1 [title illegible in source]
8.2 Debugging
8.2.1 VMA Logs
8.2.2 Ethernet Counters
8.2.3 [title illegible in source]
8.3 Troubleshooting
Appendix A: Sockperf UDP/TCP Latency and Throughput Benchmarking Tool
A.1 Overview
A.1.1 Advanced Statistics and Analysis
A.2 Configuring the Routing Table for Multicast Tests
A.3 Latency with Ping-pong Test
35. SUCH DAMAGE.
Mellanox Technologies, 350 Oakmead Parkway, Suite 100, Sunnyvale, CA 94085, U.S.A. www.mellanox.com Tel: (408) 970-3400 Fax: (408) 970-3403
Copyright 2015 Mellanox Technologies. All Rights Reserved.
Mellanox, Mellanox logo, BridgeX, ConnectX, Connect-IB, CoolBox, CORE-Direct, GPUDirect, InfiniBridge, InfiniHost, InfiniScale, Kotura, Kotura logo, MetroX, MLNX-OS, PhyX, ScalableHPC, SwitchX, TestX, Virtual Protocol Interconnect, Voltaire and Voltaire logo are registered trademarks of Mellanox Technologies, Ltd.
ExtendX, FabricIT, FPGADirect, HPC-X, Mellanox Care, Mellanox CloudX, Mellanox Open Ethernet, Mellanox PeerDirect, Mellanox Virtual Modular Switch, MetroDX, NVMeDirect, StPU, Switch-IB and Unbreakable-Link are trademarks of Mellanox Technologies, Ltd.
All other trademarks are property of their respective owners.
Document Number: DOC-00393
Table of Contents
Document Revision History
About this Manual
[section 1 entries illegible in source]
1.4 Advanced VMA Features
2 VMA Library Architecture
36. ameter. Range: 0 (Disabled), 1 (Enabled). Default: 1
Returns value 'OK' on all send operations that are performed on a non-blocked UDP socket. This is the OS default behavior. The datagram sent is silently dropped inside the VMA or the network stack. When set to Enabled (set to 1), VMA returns with error EAGAIN if it was unable to accomplish the send operation and the datagram was dropped. In both cases, a dropped Tx statistical counter is incremented. Default: 0 (Disabled)
Accelerates an offloaded send operation by optimizing the cache. Different values give an optimized send rate on different machines. We recommend that you adjust this parameter to your specific hardware. Range: 0 to MTU size. Disable with a value of 0. Default: 256 bytes
The number of Rx data buffer elements allocated for the processes. These data buffers are used by all QPs on all HCAs, as determined by VMA_QP_LOGIC. Default: 200000 bytes
The number of Work Request Elements allocated in all received QPs. Default: 16000
VMA_RX_BYTES_MIN: The minimum value in bytes used per socket by the VMA when applications call setsockopt(SO_RCVBUF). If the application tries to set a smaller value than configured in VMA_RX_BYTES_MIN, VMA forces this minimum limit value on the socket. VMA offloaded sockets receive the maximum amount of ready bytes. If
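The minimum-size rule described for VMA_RX_BYTES_MIN can be modelled as a simple clamp. The Python sketch below is an illustration only: effective_rcvbuf and request_rcvbuf are hypothetical helper names, and the floor of 65536 bytes used in the usage note is an example value, not a documented default. The second helper issues the standard SO_RCVBUF request through the ordinary socket API; the size reported back depends on the socket layer in use.

```python
import socket


def effective_rcvbuf(requested_bytes, rx_bytes_min):
    """Model of the rule above: a request below the configured floor
    is raised to the floor, larger requests are kept as requested."""
    return max(requested_bytes, rx_bytes_min)


def request_rcvbuf(sock, requested_bytes):
    """Ask for a receive buffer size and return the size the socket
    layer reports afterwards (which may differ from the request)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


if __name__ == "__main__":
    # 65536 is an illustrative floor, not a value taken from the manual.
    print(effective_rcvbuf(4096, 65536))
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        print(request_rcvbuf(s, 1 << 20))
```

With the illustrative floor, a request for 4096 bytes would be treated as 65536 bytes, while a request for 1 MB would be left unchanged.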
37. architects
• Systems administrators tasked with installing/uninstalling/maintaining VMA
• ISV partners who want to test/integrate their traffic-consuming/producing applications with VMA
Related Documentation
For additional relevant information, refer to the latest revision of the following documents:
• Mellanox Messaging Accelerator (VMA) Library for Linux Release Notes (DOC-00329)
• Mellanox Messaging Accelerator (VMA) Installation Guide (DOC-10055)
• Performance Tuning Guidelines for Mellanox Network Adapters (DOC-3368)
Document Conventions
NOTE: Identifies important information that contains helpful suggestions.
CAUTION: Alerts you to risk of personal injury, system damage, or loss of data.
WARNING: Warns you that failure to take or avoid a specific action might result in personal injury or malfunction of the hardware or software. Be aware of the hazards involved with electrical circuitry and be familiar with standard practices for preventing accidents before you work on any equipment.
Typography
The following table describes typographical conventions in Mellanox documentation. All terms refer to isolated terms within body text or regular table text unless otherwise mentioned in the Notes column.
Term, Construct, Text Block | Example | Notes
File name, pathname | /opt/ufm/conf/gv.cfg |
Console session (code) | flashClear <CR> | Complete sample line or block. C
38. be any machine with any OS, and can be located on an InfiniBand or an Ethernet network.
NOTE: VMA uses a standard protocol that enables an application to use the VMA for asymmetric acceleration purposes. A TCP server-side-only application, or a multicast consuming-only or multicast publishing-only application, can leverage this while remaining compatible with Ethernet or IPoIB peers.
• Kernel bypass for unicast and multicast transmit and receive operations. This delivers much lower CPU overhead, since TCP/IP stack overhead is not incurred.
• Reduced number of context switches. All VMA software is implemented in user space, in the user application's context. This allows the server to process a significantly higher packet rate than would otherwise be possible.
• Minimal buffer copies. Data is transferred from the hardware (NIC/HCA) straight to the application buffer in user space, with only a single intermediate user space buffer and zero kernel IO buffers.
• Fewer hardware interrupts for received/transmitted packets
• Fewer queue congestion problems witnessed in standard TCP/IP applications
• Supports legacy socket applications, with no need for application code rewrite
• Maximizes Messages per Second (MPS) rates
• Minimizes message latency
• Reduces latency spikes (outliers)
• Lowers the CPU usage required to handle traffic
2 VMA Library Architecture
2.1 Top Level
39. cation, typically caused by memory allocation problems. The PANIC level is rarely used.
ERROR: Runtime errors in VMA. Typically, this trace level assists you to identify internal logic errors, such as errors from underlying OS or InfiniBand verb calls, and internal double mapping/unmapping of objects.
WARNING: Runtime warning that does not disrupt the application workflow. A warning may indicate problems in the setup or in the overall setup configuration. For example, address resolution failures (due to an incorrect routing setup configuration), corrupted IP packets in the receive path, or unsupported functions requested by the user application.
3 INFO: General information passed to the user of the application. This trace level includes configuration logging or general information to assist you with better use of the VMA library.
4 DEBUG: High-level insight into the operations performed in VMA. In this logging level, all socket API calls are logged, and internal high-level control channels log their activity.
5 FUNC: Low-level runtime logging of activity. This logging level includes basic Tx and Rx logging in the fast path. Note that using this setting lowers application performance. We recommend that you use this level with the VMA_LOG_FILE parameter.
6 FUNC_ALL: Very low-level runtime logging of activity. This logging
40. cket with user fd 14 was created.
• The socket is a member of multicast group 224.7.7.7
• Received 786133 packets (1128530 Kilobytes) via the socket during the last second
• No transmitted data
• All the traffic was offloaded. No packets were transmitted or received via the OS.
• There were almost no missed Rx polls (see VMA_RX_POLL)
• There were no transmission or reception errors on this socket
• The socket's receive buffer size is 16777216 Bytes
• There were no dropped packets caused by the socket receive buffer limit (see VMA_RX_BYTES_MIN)
• Currently, one packet of 1470 Bytes is located in the socket receive queue
• The maximum number of packets ever located simultaneously in the socket's receive queue is 16
• No packets were dropped by the CQ
• No packets in the CQ ready queue (packets which were drained by the CQ and are waiting to be processed by the upper layers)
• The maximum number of packets drained by the CQ during a single drain cycle is 511 (see VMA_CQ_DRAIN_WCE_MAX)
• The RING received 786133 packets during this period
• The RING received 1192953545 bytes during this period. This includes headers bytes.
• 786137 interrupts were requested by the ring during this period
• 78613 interrupts were intercepted by the ring during this period
• The moderation engine was set to trigger an interrupt for every 10 packets and with maximum time of 181
41. ckperf Ping-pong Test
1. Run sockperf server on Host A: [command line illegible in source]
2. Run sockperf client on Host B: [command line illegible in source]
Client expected output:
[VMA INFO start-up banner, including "Log Level 3 [VMA_TRACELEVEL]"; remaining lines illegible in source]
sockperf: version 2.5.231
sockperf: Warmup stage (sending a few dummy messages)
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=5.100 sec; SentMessages=2240397; ReceivedMessages=2240396
sockperf: [Valid Duration] RunTime=4.988 sec; SentMessages=2218152; ReceivedMessages=2218152
sockperf: avg-lat=1.108 (std-dev=0.244)
sockperf: dropped messages = 0; duplicated messages = 0; out-of-order messages = 0
sockperf: Summary: Latency is 1.108 usec
sockperf: [observation totals illegible in source]
sockperf: <MAX> observation [value illegible in source]
42. d value provided during callback registration for each socket
iov_sz: Size of the iov array
NOTE: The application can call all the Socket APIs from within the callback context.
Packet loss might occur, depending on the application's behavior in the callback context. A very quick, non-blocked callback behavior is not expected to induce packet loss.
Parameters iov and vma_info are only valid until the callback context is returned to VMA. You should copy these structures for later use if working with zero copy logic.
7.4.1 Zero Copy recvfrom
Description: Zero-copy recvfrom implementation. This function attempts to receive a packet without doing data copy.
Syntax: int recvfrom_zcopy(int s, void *buf, size_t len, int *flags, struct sockaddr *from, socklen_t *fromlen);
Parameters:
Table 9: Zero-copy recvfrom Parameters
Parameter Name | Description | Values
buf | Buffer to fill with received data or pointers to data (see below) |
flags | Pointer to flags (see below) | Usual flags to recvmsg(), and MSG_VMA_ZCOPY_FORCE
from | If not NULL, is set to the source address (same as recvfrom) |
fromlen | If not NULL, is set to the source address size (same as recvfrom) |
The flags argument can contain the usual flags to recvmsg(), and also the MSG_VMA_ZCOPY_FORCE flag. If the latter is not set, the function reverts to data copy 1
43. e activity by printing a "." for the last <N> messages processed.
--Activity: Measure activity by printing the duration for the last <N> messages processed.
--tcp-avoid-nodelay: Stop delivering TCP messages immediately (default ON).
--mc-rx-if: IP address of the interface on which to receive multicast packets (can be different from the route table).
--mc-tx-if: IP address of the interface on which to transmit multicast packets (can be different from the route table).
--mc-loopback-enable: Enable MC loopback (default disabled).
--mc-ttl: Limit the lifetime of the message (default 2).
--buffer-size: Set the total socket receive/send buffer <size> in bytes (system defined by default).
--vmazcopyread: If possible, use VMA's zero-copy reads API (see the VMA readme).
--nonblocked: Open non-blocked sockets.
[option name illegible in source]: Do not send warm-up packets on start.
--pre-warmup-wait: Time to wait before sending warm-up packets (seconds).
--no-rdtsc: Do not use the RDTSC register when measuring time; instead, use the monotonic clock.
--set-sock-accl: Set socket acceleration before running (available for some Mellanox systems).
--load-vma: Load VMA dynamically even when LD_PRELOAD was not used.
Short Command | Full Command | Description
--tcp-skip-blocking-send: Enables non-blocking send operation (d
44. easurements.
Solution: Set VMA_TRACELEVEL to its default value.
• Problem: On running an application with VMA, the following error is reported:
[error text illegible in source; it ends with "ignored"]
Solution: Check that libvma is properly installed, and that libvma.so is located in /usr/lib (or in /usr/lib64, for 64-bit machines).
• Problem: On attempting to install the VMA rpm, the following error is reported:
error: can't create transaction lock
Solution: Install the rpm with a privileged user (root).
• Problem: The following warning is reported:
VMA WARNING: ***************************************************************
VMA WARNING: * Your current max locked memory is: 33554432. Please change it to unlimited.
VMA WARNING: * Set this user's default to `ulimit -l unlimited`.
VMA WARNING: * Read more about this issue in the VMA's User Manual.
VMA WARNING: ***************************************************************
Solution: When working with root, increase the maximum locked memory to unlimited by using the following command:
ulimit -l unlimited
Whe
45. ed Configuring the ConnectX-3 HCA for VMA section
• Removed deprecated parameters for NetEffect from Configuration Parameter Values section
• Updated Appendix on sockperf
• Updated chapter on VMA Library Architecture:
  - Updated graphic to reflect support of ConnectX-3
  - Removed outdated sections Unicast Support, and Link and Port Recovery
• Updated sections in chapter on Installation and Initial Configuration:
  - VMA System Requirements
  - All sections on VMA Installation and Upgrade
• Updated chapter on Configuring VMA:
  - Removed information about configuring a virtual MAC interface for unicast offload (deprecated feature)
  - Updated information for VMA configuration parameters: VMA_THREAD_MODE, VMA_CLOSE_ON_DUP2, VMA_CONFIG_FILE, VMA_APPLICATION_ID
  - Updated default values for VMA configuration parameters: VMA_RX_POLL, VMA_SELECT_POLL
  - Updated configuration parameter descriptions to include poll
• Removed chapter Tuning. This information will be included in a separate Performance Tuning Guide, which will be part of the VMA 6.0 Documentation package.
• Updated information for Zero-copy recvfrom
• Added troubleshooting topic for UMCAST enabled
About This Manual
Audience
This manual is primarily intended for:
• Market data professionals
• Messaging specialists
• Software engineers and
46. ed, no InfiniBand interrupts are used during select(), poll() and epoll_wait() socket calls. This mode of work is not recommended. This parameter is used by applications that use VMA_SELECT_POLL for polling with the default zero millisecond timeout. Range: 0 (Disabled), 1 (Enabled). Default: 1 (Enabled)
VMA_CQ_POLL_BATCH_MAX: The maximum size of the array while polling the CQs in the VMA. Default: 8
Configuration Parameter | Description and Examples
Further parameters named in this table: VMA_CQ_MODERATION_ENABLE, VMA_CQ_MODERATION_COUNT, VMA_CQ_MODERATION_PERIOD_USEC, VMA_CQ_AIM_MAX_COUNT, VMA_CQ_AIM_MAX_PERIOD_USEC, VMA_CQ_AIM_INTERVAL_MSEC, VMA_CQ_AIM_INTERRUPTS_RATE_PER_SEC, VMA_CQ_KEEP_QP_FULL
VMA_PROGRESS_ENGINE_INTERVAL: Internal VMA thread safety mechanism, which checks that the CQ is drained at least once every N milliseconds. This mechanism allows VMA to progress the TCP stack even when the application does not access its socket (so it does not provide a context to VMA). If the CQ was already drained by the application's receive socket API calls, this thread goes back to sleep without any processing. Disable with 0. Default: 10 milliseconds
VMA_PROGRESS_ENGINE_WCE_MAX: Each time the VMA's internal thread starts its CQ draining, it stops when it reaches this maximum value. The application is not limited by this value in the number of CQ elements that it can process from calling any of the
47. efault OFF).
--looping-num: Set sockperf to loop over recvfrom() until EAGAIN or <N> good received packets, -1 for infinite; must be used with --nonblocked (default value illegible in source).
[option name illegible in source]: Print extra debug information.
Table 16: Client Options
Short Command | Full Command | Description
[option name illegible in source]: Set the number of servers the client works with.
--sender-affinity: Set sender thread affinity to the given core IDs, in the list format (see: cat /proc/cpuinfo).
--receiver-affinity: Set receiver thread affinity to the given core IDs, in the list format (see: cat /proc/cpuinfo).
--full-log: Dump a full log of all message send/receive times to the given file, in CSV format.
--time: Set the number of seconds to run (default 1, max 36000000).
--burst: Control the number of messages sent from the client in a burst.
[option name illegible in source]: Print sizes in GigaBytes.
--increase_output_precision: Increase the number of digits after the decimal point of the throughput output (from 3 to 9).
--mps: Set the number of messages per second (default 10000 for under-load mode, or max for ping-pong and throughput modes); for maximum, use --mps=max. Supports --pps for backward compatibility.
-m, --msg-size: Use messages of the given minimum size in bytes (minimum default: 12 bytes).
--range: Use with -m to randomly change the minimum message size in the range <size> +- <N>.
Table 17: Server Options
Short Command | Full Command | Description
--threads-num: Run <N> th
48. exchange
• Market data feed-handler software which consumes multicast data feeds, and which often uses multicast as a distribution mechanism downstream, such as Wombat WDF and Reuters RMDS, or any home-grown feed handlers
• Messaging applications responsible for producing/consuming relatively large amounts of multicast data, including applications that use messaging middleware such as Tibco Rendezvous (RV)
• Caching/data distribution applications which utilize quick network transactions for cache creation/state maintenance, such as MemCacheD and Redis
• Applications that handle distributed denial of service (DDoS), and web services applications with a heavy load of DNS requests
• Messaging applications such as UMS/Informatica, which VMA 6.4 was certified with
• Any other applications that make heavy use of multicast or unicast and that require any combination of the following:
  - Higher Packets per Second (PPS) rates than with kernel
  - Lower data distribution latency
  - Lower CPU utilization by the multicast consuming/producing application, in order to support further application scalability
1.4 Advanced VMA Features
The VMA library provides several significant advantages:
• The underlying wire protocol used for the unicast and multicast solution is standard TCP and UDP IPv4, which is interoperable with any TCP/UDP/IP networking stack. Thus, the opposite side of the communication can
49. face to be used for the socket traffic. The library knows whether data is passing to or from an InfiniBand HCA or Ethernet NIC. If the data is passing to/from a supported HCA or Ethernet NIC, the library intercepts the call and does the bypass work. If the data is passing to/from an unsupported HCA or Ethernet NIC, the library passes the call to the usual kernel libraries responsible for handling network traffic. Thus, the same application can listen in on multiple HCAs or Ethernet NICs without requiring any configuration changes for the hybrid environment.
2.2 Socket Types
The following Internet socket types are supported:
• Datagram sockets, also known as connectionless sockets, which use User Datagram Protocol (UDP)
• Stream sockets, also known as connection-oriented sockets, which use Transmission Control Protocol (TCP) or Stream Control Transmission Protocol (SCTP)
3 Installing VMA
For detailed information on how to install the VMA software, please refer to the VMA Installation Guide.
4 Configuring VMA
You can control the behavior of VMA by configuring:
• The libvma.conf file
• VMA configuration parameters, which are Linux OS environment variables
• VMA extra API
4.1 Configuring libvma.conf
The installation process creates a default configuration file, /etc/libvma.conf, in which you can define and change the following settings:
• The target applications or processes to which
50. fload property: 1 for offloaded, 0 for not offloaded.
7.4 Packet Filtering
The packet filter logic gives the application developer the capability to inspect a received packet. You can then decide, on the fly, to keep or drop the received packet at this stage in processing.
The user's application packet filtering callback is defined by the prototype:
[callback prototype illegible in source]
This callback function should be registered with VMA by calling the VMA Extra API function register_recv_callback(). It can be unregistered by setting a NULL function pointer. VMA calls the callback to notify of new incoming packets after the internal IP and UDP/TCP header processing, and before they are queued in the socket's receive queue. The context of the callback is always that of one of the user's application threads that called one of the following socket APIs: select(), poll(), epoll_wait(), recv(), recvfrom(), recvmsg(), read(), or readv().
Table 8: Packet Filtering Callback Function Parameters
Parameter Name | Description | Values
fd | File descriptor of the socket to which this packet refers |
iov | iovector structure array pointer holding the packet received data buffer pointers and the size of each buffer |
vma_info | Additional information on the packet and socket |
context | User define
51. free unused huge page shared memory segments with the script below:
echo 1000000000 > /proc/sys/kernel/shmmax
[second command illegible in source]
If you are running multiple instances of your application loaded with VMA, you will probably need to increase the values used in the above example.
CAUTION: Check that your host machine has enough free memory after allocating the huge page resources for VMA. Low system memory resources may cause your system to hang.
NOTE: Use ipcs -m and ipcrm -m shmid to check and clean unused shared memory segments.
Use the following script to release VMA unused huge page resources:
[script illegible in source]
Appendix A: Sockperf UDP/TCP Latency and Throughput Benchmarking Tool
This appendix presents sockperf, VMA's sample application for testing latency and throughput over socket API. Sockperf can be used natively, or with VMA acceleration.
A.1 Overview
Sockperf is an open source utility. For more general information, see http://code.google.com/p/sockperf. Sockperf's advantage over other network benchmarking utilities is its focus on testing the performance of high-performance systems (as well as testing the performance of regular networking systems). I
52. • There were no dropped packets caused by the socket receive buffer limit (see VMA_RX_BYTES_MIN). A single socket with user fd 14 was created.
8.2 Debugging
8.2.1 VMA Logs
Use the VMA logs in order to trace VMA operations. VMA logs can be controlled by the VMA_TRACELEVEL variable. This variable's default value is 3, meaning that the only logs obtained are those with severity of PANIC, ERROR, and WARNING. You can increase the VMA_TRACELEVEL variable value up to 6 (as described in Configuration Parameters on page 15) to see more information about each thread's operation.
Use VMA_LOG_DETAILS=3 to add a time stamp to each log line. This can help to check the time difference between different events written to the log.
Use VMA_LOG_FILE=/tmp/my_file.log to save the daily events. It is recommended to check these logs for any VMA warnings and errors. Use the Troubleshooting section on page 42 to help resolve the different issues in the log.
VMA will replace a single '%d' appearing in the log file name with the pid of the process loaded with VMA. This can help in running multiple instances of VMA, each with its own log file name.
When VMA_LOG_COLORS is enabled, VMA uses a color scheme when logging: red for errors and warnings, and dim for low-level debugs.
Use VMA_HANDLE_SIGSEGV to print a backtrace if a segmentation fault occurs.
8.2.2 Ethernet Counters
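The logging variables described above are ordinary environment variables, so a launcher script can prepare them before starting the application. The Python sketch below shows one possible way to do that. The helper names are hypothetical, the placeholder is assumed to be the printf-style %d, and expand_log_file only imitates the documented pid substitution for illustration (VMA performs the real substitution itself). The behaviour when the placeholder appears more than once is not documented here, so the sketch leaves such names unchanged.

```python
import os


def expand_log_file(template, pid):
    """Imitate the documented rule: a single "%d" in the log file name
    is replaced by the process ID. Names with zero or several "%d"
    are returned unchanged (an assumption of this sketch)."""
    if template.count("%d") == 1:
        return template.replace("%d", str(pid))
    return template


def vma_debug_env(log_template, trace_level=4, log_details=3, base_env=None):
    """Return a copy of the environment with the VMA logging variables set.
    The variable names are the ones used in this manual."""
    env = dict(os.environ if base_env is None else base_env)
    env["VMA_TRACELEVEL"] = str(trace_level)
    env["VMA_LOG_DETAILS"] = str(log_details)
    env["VMA_LOG_FILE"] = log_template
    return env


if __name__ == "__main__":
    env = vma_debug_env("/tmp/vma_%d.log", base_env={})
    print(env)
    print(expand_log_file(env["VMA_LOG_FILE"], os.getpid()))
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the application, so that each instance writes to its own log file.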
53. ifies a group of rules from libvma.conf for VMA to apply. Example: VMA_APPLICATION_ID=iperf_server. Default: VMA_DEFAULT_APPLICATION_ID (match only the group rule)
VMA_HANDLE_SIGINTR: When enabled, the VMA handler is called when an interrupt signal is sent to the process. VMA also calls the application's handler, if it exists. Range: 0 to 1. Default: 0 (Disabled)
VMA_HANDLE_SIGSEGV: When enabled, a print backtrace is performed if a segmentation fault occurs. Range: 0 to 1. Default: 0 (Disabled)
VMA_STATS_FD_NUM: Maximum number of sockets monitored by the VMA statistics mechanism. Range: 0 to 1024. Default: 100
VMA_STATS_FILE: Redirects socket statistics to a specific user-defined file. VMA dumps each socket's statistics into a file when closing the socket. Example: VMA_STATS_FILE=/tmp/stats
VMA_STATS_SHMEM_DIR: Sets the directory path for VMA to create the shared memory files for vma_stats. In case this value is set to an empty string (""), no shared memory files are created. Default: /tmp/
VMA_TCP_MAX_SYN_FIN_RATE: Limits the number of TCP control packets (TCP SYN/FIN/RST packets) that VMA handles per second, for each thread. Example: by setting this value to 10, the maximal number of TCP control packets accepted by VMA per second for each thread will be 10. Set this value to 0 for VMA to handle an unlimited number of TCP control packets per second for each thread. Value range is 0 to 100000
54. level drastically lowers application performance. We recommend that you use this level with the VMA_LOG_FILE parameter.

VMA_LOG_DETAILS
Provides additional logging details on each log line:
• 0: Basic log line
• 1: With ThreadId
• 2: With ProcessId and ThreadId
• 3: With Time, ProcessId, and ThreadId (Time is the number of milliseconds from the start of the process)
Default: 0. For VMA_TRACELEVEL >= 4, this value defaults to 2.

VMA_LOG_FILE
Redirects all VMA logging to a specific user-defined file. This is very useful when raising VMA_TRACELEVEL. VMA replaces a single %d appearing in the log file name with the pid of the process loaded with VMA. This helps when running multiple instances of VMA, each with its own log file name.
Example: VMA_LOG_FILE=/tmp/vma_log.txt

VMA_CONFIG_FILE
Sets the full path to the VMA configuration file.
Example: VMA_CONFIG_FILE=/tmp/libvma.conf
Default: /etc/libvma.conf

VMA_LOG_COLORS
Uses a color scheme when logging: red for errors and warnings, and dim for very low-level debug messages. VMA_LOG_COLORS is automatically disabled when logging is done directly to a non-terminal device (for example, when VMA_LOG_FILE is configured).
Default: 1 (Enabled)

VMA_CPU_USAGE_STATS
Calculates the VMA CPU usage during polling hardware loops. This information is available through the VMA stats utility.

VMA_APPLICATION_ID
Spec
55. low listen sockets. Defined by local_ip:local_port.
• tcp_client: for connected sockets. Defined by remote_ip:remote_port:local_ip:local_port.
• udp_sender: for TX flows. Defined by remote_ip:remote_port.
• udp_receiver: for RX flows. Defined by local_ip:local_port.
• udp_connect: for UDP connected sockets. Defined by remote_ip:remote_port:local_ip:local_port.

address
You can specify the local address the server is bound to, or the remote server address the client connects to. The syntax for address matching is <IPv4 address>[/<prefix_length>], where:
• IPv4 address: [0-9]+.[0-9]+.[0-9]+.[0-9]+, each sub-number no greater than 255
• prefix_length: [0-9]+, with a value no greater than 32
A prefix length of 24 matches the subnet mask 255.255.255.0. A prefix length of 32 requires matching of the exact IP.

port range
Define the port range as start_port[-end_port], where port numbers are > 0 and < 65536.

4.1.3 Example of VMA Configuration

To set the following:
• Apply the rules to program tcp_lat with ID [illegible in source]
• Use VMA by TCP clients connecting to machines that belong to subnet [illegible in source]
• Use OS when TCP server listens to port 5007 of any machine

In libvma.conf, configure:
[configuration listing illegible in source]

Note: You must also set the VMA parameter VMA_APPLICATION_ID to the ID used in the rule.

4.2 VMA Configuration Parameters
56. Document Revision History

Table 1: Document Revision History

Rev 6.9.1
Updated the following sections:
• VMA Configuration Parameters
• Configuration Parameter Values

Rev 6.8.3
Updated the following sections:
• Overview of the VMA Extra API
• Zero Copy recvfrom()
• Freeing Zero Copied Packet Buffers

Rev 6.6.4, Rev 6.5.9, Rev 6.4.11, Rev 6.3.28
Updated the following sections:
• Configuring Socket Transport Control
• VMA Configuration Parameters
• Configuration Parameter Values
• Monitoring the stats Utility
• Example 3
Added the following sections:
• Adding libvma.conf Rules During Run-Time
• Creating Sockets as Off-loaded or Not-Off-loaded
Added the following sections:
• Using sockperf with VMA
• Example: Running sockperf Ping-pong Test
• Beta Level Features Configuration Parameters
Updated the following sections:
• VMA Configuration Parameters
• Configuration Parameter Values
Removed the Installation and Initial Configuration chapter. It was moved to the Installation Guide.
Updated the following sections:
• Target Applications
• VMA Configuration Parameters
• Configuration Parameter Values
• Problem: Incorrect IGMP version
• Problem: Lack of huge page resources in the system
Updated sections in the Introduction to VMA chapter and VMA Library Architecture for offload over InfiniBand.
Updated the VMA System Requirements section.
Updat
57. In addition, sockperf covers most of the socket API calls and options. Specifically, in addition to the standard throughput tests, sockperf:
• Measures the latency of each discrete packet at sub-nanosecond resolution (using the TSC register, which counts CPU ticks with very low overhead).
• Measures latency for ping-pong mode and for latency-under-load mode. This means that you can measure the latency of single packets even under a load of millions of PPS (without waiting for the reply to a packet before sending the subsequent packet on time).
• Enables spike analysis by providing, in each run, a histogram with various percentiles of the packet latencies (for example: median, min, max, 99% percentile, and more) in addition to the average and standard deviation.
• Can provide full logs containing the tx/rx times of all packets, without affecting the benchmark itself. The logs can be further analyzed with external tools, such as MS Excel or matplotlib.
• Supports many optional settings for good coverage of the socket API, while still keeping a very low overhead in the fast path to allow the cleanest results.

Sockperf operates by sending packets from the client (also known as the publisher) to the server (also known as the consumer), which then sends all or some of the packets back to the client. The measured time is the round trip time (RTT) between the two machines on a specific network path with packets of varying sizes. The latency for a given one-way path between
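To make the percentile-based spike analysis concrete, the following sketch computes a mean and a nearest-rank percentile over a set of latency samples. It is not sockperf's implementation: the nearest-rank rule is an assumption chosen for simplicity, and the names latency_mean and latency_percentile are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Arithmetic mean of n latency samples (n > 0). */
double latency_mean(const double *samples, size_t n)
{
    assert(samples != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/* Nearest-rank percentile, pct in (0, 100]. Sorts `samples` in place. */
double latency_percentile(double *samples, size_t n, double pct)
{
    assert(samples != NULL && n > 0 && pct > 0.0 && pct <= 100.0);
    qsort(samples, n, sizeof samples[0], cmp_double);

    double exact = pct / 100.0 * (double)n;
    size_t rank = (size_t)exact;        /* truncate ... */
    if ((double)rank < exact)
        rank++;                         /* ... then round up to the next rank */
    if (rank < 1)
        rank = 1;
    if (rank > n)
        rank = n;
    return samples[rank - 1];
}
```

A single slow packet barely moves the median but shows up immediately in the high percentiles and the maximum, which is why a percentile histogram is more useful for spike analysis than the average alone.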
58. When working as a non-privileged user, ask your administrator to increase the maximum locked memory to unlimited.

• Problem: Incorrect IGMP version

The following warning is reported:

VMA WARNING: ***************************************************************
VMA WARNING: IGMP Version flag is not forced to IGMPv2 for interface ib2 while
VMA WARNING: VMA_IGMP is Enabled
VMA WARNING: Working in this mode can cause issues due to Eth-IB gateway requirements
VMA WARNING: Please "echo 2 > [path illegible in source]"
VMA WARNING: before loading your application with VMA library
VMA WARNING: Read the IGMP section in the VMA's User Manual for more information
VMA WARNING: ***************************************************************

This warning message means that you are using an IGMP version other than 2, which is the version supported by VMA. Version 2 is required for the Eth-IB gateway.

Solution: Use VMA_SUPPRESS_IGMP_WARNING=1 if you are working in an InfiniBand fabric and do not need to receive multicast packets from the Ethernet to the InfiniBand fabric, or if you are working in an Ethernet fabric.
59. omprises both input and output. The code can also be shaded.
• Linux shell prompt: the prompt character stands for the Linux shell prompt.
• Mellanox CLI Guest Mode: the Mellanox CLI Guest Mode prompt.
• Mellanox CLI admin mode: the Mellanox CLI admin mode prompt.
• String: strings in < > or [ ] are descriptions of what will actually be shown on the screen. For example, the contents of <your ip> could be 192.168.1.1.
• Management GUI label (for example, New Network or New Environment): Management GUI labels and item names appear in bold, whether or not the name is explicitly displayed (for example, buttons and icons).
• User text entered into the Manager, for example to assign as the name of a logical object (for example, "Network1"): note the quotes. The text entered does not include the quotes.

1 Introduction to VMA

1.1 VMA Overview

The Mellanox Messaging Accelerator (VMA) library is a network-traffic offload, dynamically-linked, user-space Linux library which serves to transparently enhance the performance of socket-based, networking-heavy applications over an InfiniBand or Ethernet network. VMA has been designed for latency-sensitive and throughput-demanding unicast and multicast applications. VMA can be used to accelerate producer applications and consumer applications, and enhances application performance by orders of magnitude without requiring any modification to the application code.
60. Configuration Parameter Values
Table 5: Beta Level Configuration Parameter Values
Table 6: add_conf_rule Parameters
Table 7: add_conf_rule Parameters
Table 8: Packet Filtering Callback Function Parameters
Table 9: Zero-copy recvfrom Parameters
Table 10: Freeing Zero-copy Datagram Parameters
Table 11: vma_stats Utility Options
Table 12: UDP MC Ping-pong Results
Table 13: TCP Throughput Results
Table 14: Available [illegible in source]
Table 15: General sockperf Options
Table 16: Client Options
Table 17: Server Options
Table 19: Acronym Table
61. allowed for filling up the QP while full receive buffers are being processed inside VMA.
Default: 256 buffers

VMA_OFFLOADED_SOCKETS
Creates all sockets as offloaded or not-offloaded by default.
• 1 is used for offloaded
• 0 is used for not-offloaded
Default: 1 (Enabled)

VMA_TIMER_RESOLUTION_MSEC
Controls the VMA internal thread wakeup timer resolution (in milliseconds).
Default: 10 (milliseconds)

VMA_TCP_TIMER_RESOLUTION_MSEC
Controls the VMA internal TCP timer resolution (fast timer), in milliseconds. The minimum value is the internal thread wakeup timer resolution (VMA_TIMER_RESOLUTION_MSEC).
Default: 100 (milliseconds)

VMA_THREAD_MODE
By default, VMA is ready for multi-threaded applications, meaning it is thread-safe. If the user application is single-threaded, use this configuration parameter to help eliminate VMA locks and improve performance.
Values:
• 0: Single-threaded application
• 1: Multi-threaded application with spin lock
• 2: Multi-threaded application with mutex lock
• 3: Multi-threaded application with more threads than cores, using spin lock
Default: 1 (Multi with spin lock)

VMA_MEM_ALLOC_TYPE
This replaces the VMA_HUGETBL parameter logic. VMA will try to allocate data buffers as configured:
• 0, ANON: using malloc
• 1, CONTIG: using contiguous pages
• 2, HUGEPAGES: using huge pages
OFED will also try to allocate QP and CQ memory accordingly:
• 0, ANON: default, use current pages (ANON small ones)
• HUGE: fo
62. or not. The application is still able to run over the general socket library without VMA loaded, as it did previously, or it can use an application flag to decide which API to use: the Socket API or the VMA Extra API.

The VMA Extra APIs are provided as a header with the VMA binary rpm. The application developer needs to include this header file in the application code. After installing the VMA rpm on the target host, the VMA Extra API header file is located at the following path: [path illegible in source]

The vma_extra.h header provides detailed information about the various functions and structures, and instructions on how to use them.

An example using the VMA Extra API can be seen in the udp_lat source code:
• Follow the vmarxfiltercb flag for the packet filter logic
• Follow the vmazcopyread flag for the zero copy recvfrom logic

A specific example for using the TCP zero copy extra API can be seen under [path illegible in source].

7.2 Using VMA Extra API

During runtime, use the vma_get_api() function to check whether VMA is loaded in your application and whether the VMA Extra API is accessible. If the function returns NULL, either VMA is not loaded with the application, or the VMA Extra API is not compatible with the header used for compiling your application. NULL is the typical return value when running the application on the native OS without VMA loaded. Any non-NULL return value is a vma_api_t t
63. force huge pages
• CONTIG: force contig pages
• 1, PREFER_CONTIG: try contig, fall back to ANON small pages
• PREFER_HUGE: try huge, fall back to ANON small pages
• 2, ALL: try huge, fall back to contig; if that fails, fall back to ANON small pages
To override OFED, use MLX_QP_ALLOC_TYPE and MLX_CQ_ALLOC_TYPE.
Default: 1 (Contiguous pages)

VMA_FORK
Controls VMA fork support. Setting this flag on causes VMA to call the ibv_fork_init() function. ibv_fork_init() initializes the libibverbs data structures to handle fork() function calls correctly and avoid data corruption. If ibv_fork_init() is not called, or returns a non-zero status, then the libibverbs data structures are not fork-safe and the effect of an application calling fork() is undefined. ibv_fork_init() works on Linux kernels 2.6.17 and later, which support the MADV_DONTFORK flag for madvise(). You should use an OFED stack version that supports fork() with huge pages. Mellanox OFED 1.5.3 and later allocates huge pages (VMA_HUGETBL) by default.
Default: 1 (Enabled)

VMA_MTU
Sets the fragmentation size of the packets sent by the VMA library. This value determines the size of each Rx and Tx buffer.
Default: 1500 (bytes)
Recommendations:
• Set to 1500 for Ethernet networks or interoperability with Ethernet networks

VMA_MSS
Defines the max TCP payload size that can
64. where the socket is a UDP unicast socket and no multicast addresses were added to it. Once the first ADD_MEMBERSHIP is called, the VMA_RX_POLL (above) takes effect. The value range is similar to that of VMA_RX_POLL above.
Default: 0

VMA_RX_UDP_POLL_OS_RATIO
Defines the ratio between CQ polls and OS FD polls. This results in a single poll of the not-offloaded sockets every VMA_RX_UDP_POLL_OS_RATIO offloaded socket (CQ) polls, regardless of whether the CQ poll was a hit or a miss, and regardless of whether the socket is blocking or non-blocking. When disabled, only offloaded sockets are polled. This parameter replaces the two old parameters:
• VMA_RX_POLL_OS_RATIO
• VMA_RX_SKIP_OS
Disable with 0.
Default: 10

VMA_RX_POLL_YIELD
When an application is running with multiple threads on a limited number of cores, there is a need for each thread polling inside VMA (read, readv, recv, and recvfrom) to yield the CPU to another polling thread, so as not to starve it from processing incoming packets.
Default: 0 (Disabled)

VMA_RX_PREFETCH_BYTES
The size of the receive buffer to prefetch into the cache while processing ingress packets. The default is a single cache line of 64 bytes, which should be at least 32 bytes to cover the IPoIB+IP+UDP headers and a small part of the user payload. Increasing this size can help improve performance for larger user payloads.
65. reads on server side (requires option

Short Command / Full Command: Description
• N/A / --cpu-affinity: Set threads affinity to the given core IDs in the list format (see: cat /proc/cpuinfo).
• N/A / --vmarxfiltercb: If possible, use VMA's receive path packet filter callback API (see the VMA readme).
• N/A / --force-unicast-reply: Force server to reply via unicast.
• N/A / --dont-reply: Set server to not reply to the client messages.
• -m / --msg-size: Set the maximum message size that the server can receive, <size> bytes (default 65506).
• -g / --gap-detection: Enable gap detection.

A.5.2 Sending Bursts

Use the -b (--burst=<size>) option to control the number of messages sent by the client in every burst.

A.6 Debugging sockperf

Use -d (--debug) to print extra debug information without affecting the results of the test. The debug information is printed only before or after the fast path.

A.7 Troubleshooting sockperf

If the following error is received:

sockperf error:
sockperf: No messages were received from the server. Is the server down?

Perform troubleshooting as follows:
• Make sure that exactly one server is running
• Check the connection between the client and server
• Check the routing table entries for the multicast/unicast group
• Extend the test duration (use the --time command line switch)
• If you used extreme values for the --mps and/or --reply-every swi
66. receive path socket APIs.
Default: 2048

VMA_CQ_MODERATION_ENABLE
Enables CQ interrupt moderation.
Default: 1 (Enabled)

VMA_CQ_MODERATION_COUNT
Number of packets to hold before generating an interrupt.
Default: 48

VMA_CQ_MODERATION_PERIOD_USEC
Period, in microseconds, for holding the packet before generating an interrupt.
Default: 50

VMA_CQ_AIM_MAX_COUNT
Maximum count value to use in the adaptive interrupt moderation algorithm.
Default: 560

VMA_CQ_AIM_MAX_PERIOD_USEC
Maximum period value to use in the adaptive interrupt moderation algorithm.
Default: 250

VMA_CQ_AIM_INTERVAL_MSEC
Frequency of interrupt moderation adaptation: the interval, in milliseconds, between adaptation attempts. Use a value of 0 to disable adaptive interrupt moderation.
Default: 250

VMA_CQ_AIM_INTERRUPTS_RATE_PER_SEC
Desired interrupt rate per second for each ring (CQ). The count and period parameters for CQ moderation change automatically to achieve the desired interrupt rate for the current traffic rate.
Default: 5000

VMA_CQ_KEEP_QP_FULL
If disabled (default), the CQ does not try to compensate for each poll on the receive path. It uses a "debt" to remember how many WRE are missing from each QP, so that it can fill it when buffers become available. If enabled, the CQ tries to compensate the QP for each polled receive completion. If there is a shortage of buffers, it reposts a recently completed buffer. This causes a packet drop and is monitored in vma_stats.
Default: 1 (Enabled)

VMA_QP_COMPENSATION_LEVEL
The number of spare receive buffer CQ holds that can be all
67. If you do expect to receive multicast packets from the Ethernet to the InfiniBand fabric with VMA, force the IGMP working mode to version 2 in all your hosts as well as in your routers:

[command illegible in source]

• Problem: UMCAST is enabled

The following warning is reported:

VMA WARNING: ***************************************************************
VMA WARNING: UMCAST flag is Enabled for interface ib0!
VMA WARNING: Please disable it: "echo 0 > /sys/class/net/ib0/umcast"
VMA WARNING: This option in no longer needed in this version
VMA WARNING: Please refer to Release Notes for more information
VMA WARNING: ***************************************************************

This warning message means that the UMCAST flag is on.

Solution: Turn off the UMCAST flag. This option is no longer needed in this version.

• Problem: Lack of huge page resources in the system

The following warning is reported:

VMA WARNING: ***************************************************************
VMA WARNING: NO IM
68. [sockperf client output, largely illegible in source. Recoverable fields: dropped packets = 0, out-of-order packets = 0, followed by the latency summary, the percentile list, and the MAX and MIN observations.]

Interpretation of the results: The example shows an average latency of 15.096 usec.

A.3.2 UDP MC Ping-pong Over 10 Gb

To run UDP MC ping-pong over 10 Gb Ethernet:

7. After configuring the routing table as described in Configuring the Routing Table for Multicast Tests, run the server by using:

[command illegible in source]

8. Run the client by using:

[command illegible in source]

The following output is obtained:

[sockperf client output, largely illegible in source. Recoverable fields: avg-lat = 6.825, std-dev = 0.261, followed by the latency summary, the percentile list, and the MAX observation.]
69. ssed Rx polls (see VMA_RX_POLL). This implies that the receiving thread did not enter a blocked state, and therefore there was no context switch to hurt latency.
• There are no transmission or reception errors on this socket.

8.1.1.2 Example 2

Description: vma_stats presents not only cumulative statistics, but also enables you to view deltas of VMA counter updates. This example demonstrates the use of the deltas mode.

Command Line:

[command illegible in source]

Output:

[vma_stats output table, illegible in source. Recoverable columns: fd, then pkt/s, Kbyte/s, eagain/s, error/s, and poll% for offloaded traffic, and pkt/s, Kbyte/s, and error/s for OS traffic, with Rx and Tx rows per socket and a select() summary line.]

Analysis of the Output:
• Three sockets were created (fds 15, 19, and 23).
• Received 11590 packets, 22 Kbytes, during the last second via fds 15 and 19.
• Transmitted 11590 packets, 22 Kbytes, during the last second via fds 15 and 19.
• Not all the traffic was offloaded: via fd 23, 11590 packets, 22 Kbytes, were transmitted and received via the OS. This means that fd 23 was used for unicast traffic.
• No transmission or reception errors were detected on any socket.
• The application used select for I/O multiplexing.
• 45557 packets were placed in socket ready queues (over the course of the last second): 30
70. stats utility to view the per-socket information and performance counters during runtime.

Note: For TCP connections, vma_stats shows offloaded traffic and not OS traffic.

Usage:

[command illegible in source]

The following table lists the basic and additional vma_stats utility options.

Table 11: vma_stats Utility Options

Parameter Name and Argument: Description and Values
• -p, --pid <pid>: Shows VMA statistics for a process with pid <pid>.
• -v, --view <1|2|3|4>: Sets the view type:
  1. Shows the runtime basic performance counters (default).
  2. Shows extra performance counters.
  3. Shows additional application runtime configuration information.
  4. Shows multicast group membership information.
• -d, --details <mode>: Sets the details mode:
  1. Show totals (default).
  2. Show deltas.
• -i, --interval <n>: Prints a report every <n> seconds. Default: 1 sec.
• -c, --cycles <n>: Does <n> report print cycles and exits. Use a value of 0 for infinite. Default: 0.
• -n, --name <application>: Shows VMA statistics for application <application>.
• -f, --find_pid: Finds the pid and shows statistics for the VMA instance running (default).
• -F, --forbid_clean: When you set this flag to inactive, shared objects (files) are not removed.
• -l, --log_level <level>
71. switch, try other values or try the default values.

Appendix B: Multicast Routing

B.1 Multicast Interface Definitions

All applications that receive and/or transmit multicast traffic on a multiple-interface host should define the network interfaces through which they would prefer to receive or transmit the various multicast groups.

If a networking application can use existing socket API semantics for multicast packet receive and transmit, the network interface can be defined by mapping the multicast traffic. In this case, the routing table does not have to be updated for multicast group mapping. The socket API setsockopt() handles these definitions.

When the application uses setsockopt() with IP_ADD_MEMBERSHIP for the receive path multicast join request, it defines the interface through which it wants VMA to join the multicast group, and listens for incoming multicast packets for the specified multicast group on the specified socket.

When the application uses setsockopt() with IP_MULTICAST_IF on the transmit path, it defines the interface through which VMA will transmit outgoing multicast packets on that specific socket.

If the user application does not use any of the above setsockopt() socket-lib API calls, VMA uses the network routing table mapping to find the appropriate interface to be used for receiving or transmitting multicast packets.

Use the route command to verify that multicast addresses
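The two setsockopt() calls named in B.1 are part of the standard socket API, so a short sketch can show how an application pins its multicast receive and transmit interfaces. The helper names make_mreq, join_group_on, and set_tx_interface are hypothetical, and the group and interface addresses in the usage note are placeholders rather than values from this manual.

```c
#define _DEFAULT_SOURCE
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill an ip_mreq with a multicast group and the local interface address
 * through which the group should be joined. Returns 0 on success, -1 if an
 * address is malformed or `group` is not a multicast address. */
int make_mreq(const char *group, const char *iface_ip, struct ip_mreq *mreq)
{
    assert(mreq != NULL);
    memset(mreq, 0, sizeof(*mreq));
    if (inet_pton(AF_INET, group, &mreq->imr_multiaddr) != 1)
        return -1;
    if (inet_pton(AF_INET, iface_ip, &mreq->imr_interface) != 1)
        return -1;
    if (!IN_MULTICAST(ntohl(mreq->imr_multiaddr.s_addr)))
        return -1;
    return 0;
}

/* Receive path: join `group` on the interface that owns `iface_ip`. */
int join_group_on(int fd, const char *group, const char *iface_ip)
{
    struct ip_mreq mreq;
    if (make_mreq(group, iface_ip, &mreq) != 0)
        return -1;
    return setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
}

/* Transmit path: send outgoing multicast through the interface that owns
 * `iface_ip`. */
int set_tx_interface(int fd, const char *iface_ip)
{
    struct in_addr iface;
    if (inet_pton(AF_INET, iface_ip, &iface) != 1)
        return -1;
    return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, &iface, sizeof(iface));
}
```

A receiver would typically call join_group_on(fd, "224.4.4.4", "192.168.1.10") after bind(), and a sender would call set_tx_interface(fd, "192.168.1.10") before the first sendto(). When neither call is made, the interface is chosen from the routing table, as described above.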
72. the application does not drain sockets and the byte limit is reached, newly received datagrams are dropped. The application's socket usage of current, max, and dropped bytes and packet counters can be monitored using vma_stats.
Default: 65536

VMA_RX_POLL
The number of times to unsuccessfully poll an Rx for VMA packets before going to sleep.
Range: -1, 0 to 100,000,000. Default: 100,000
This value can be reduced to lower the load on the CPU. However, the price paid for this is that the Rx latency is expected to increase.
Recommended values:
• 10000: when CPU usage is not critical and Rx path latency is critical
• 0: when CPU usage is critical and Rx path latency is not critical
• -1: causes infinite polling
Once VMA has gone to sleep, if it is in blocked mode it waits for an interrupt; if it is in non-blocked mode, it returns -1. This Rx polling is performed when the application is working with direct blocked calls to read(), recv(), recvfrom(), and recvmsg(). When the Rx path has successful poll hits, the latency improves dramatically. However, this causes increased CPU utilization. For more information, see Debugging, Troubleshooting, and Monitoring.

VMA_RX_POLL_INIT
VMA maps all UDP sockets as potentially offloaded-capable. Only after ADD_MEMBERSHIP is set does the offload start working and the CQ polling start. This parameter controls the polling count during this transition phase whe
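The VMA_RX_POLL semantics (a budget of unsuccessful polls, with -1 meaning poll forever) can be sketched as a bounded busy-poll loop. This is one reading of the documented behavior under stated assumptions, not VMA's receive path: the names rx_poll, poll_once_fn, and countdown_source are hypothetical, and the sketch treats a budget of 0 as "do not busy-poll at all".

```c
#include <assert.h>
#include <stddef.h>

/* A poll attempt: returns 1 when a packet was found, 0 on a miss. */
typedef int (*poll_once_fn)(void *ctx);

/* Busy-poll until a packet arrives or `poll_budget` unsuccessful polls have
 * been made. poll_budget == -1 polls forever; 0 never busy-polls.
 * Returns 1 on a hit; returns 0 when the budget is exhausted, at which point
 * a blocking caller would wait for an interrupt and a non-blocking caller
 * would return to the application. */
int rx_poll(int poll_budget, poll_once_fn poll_once, void *ctx)
{
    assert(poll_once != NULL);
    long misses = 0;
    while (poll_budget < 0 || misses < poll_budget) {
        if (poll_once(ctx))
            return 1;
        if (poll_budget >= 0)
            misses++;       /* only count misses when the budget is finite */
    }
    return 0;
}

/* Demo packet source: misses until the counter reaches zero, then hits. */
int countdown_source(void *ctx)
{
    int *remaining = ctx;
    if (*remaining > 0) {
        (*remaining)--;
        return 0;
    }
    return 1;
}
```

A larger budget keeps the thread spinning longer, which lowers latency when the packet arrives soon and wastes CPU when it does not. That is the trade-off behind the recommended values.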
73. the bitmask and comma-delimited list methods are identical to what is supported by the taskset command. See the man page on taskset for additional information. The -1 value disables the Internal Thread Affinity setting by VMA.
Bitmask Examples:
• 0x00000001: Run on processor 0
• 0x00000007: Run on processors 0, 1, and 2 (bits 0 to 2 are set, following the taskset convention)
Comma-Delimited Examples:
• 0,4,8: Run on processors 0, 4, and 8
• 0,1,7-10: Run on processors 0, 1, 7, 8, 9, and 10
Default: -1

VMA_INTERNAL_THREAD_CPUSET
Selects a CPUSET for the VMA internal thread. For further information, see the man page of cpuset. The value is either the path to the CPUSET (for example, /dev/cpuset/my_set) or an empty string to run it on the same CPUSET the process runs on.

VMA_INTERNAL_THREAD_ARM_CQ
Wakes up the internal thread for each packet that the CQ receives. Polls and processes the packet and brings it to the socket layer. This can minimize latency for a busy application that is not available to receive the packet when it arrives. However, this might decrease performance for high pps rate applications.
Default: 0 (Disabled)

VMA_WAIT_AFTER_JOIN_MSEC
This parameter indicates the delay before the first packet is sent after receiving the multicast JOINED event from the SM. This helps overcome the loss of the first few packets of an outgoing stream due to the SM's lengthy handling of MFT configuration on the switch chips.
Default: 0 (milliseconds)

VMA_NEIGH_UC_ARP_QUATA
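The relationship between the comma-delimited list form and the bitmask form of the internal thread affinity value can be shown with a small converter. This is an illustrative sketch that follows the taskset convention (bit N stands for processor N). The function name cpu_list_to_mask is hypothetical, and the 64-processor limit is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a taskset-style CPU list such as "0,1,7-10" into a bitmask in which
 * bit N set means processor N. Supports processors 0..63.
 * Returns 0 on success, -1 on malformed input. */
int cpu_list_to_mask(const char *list, uint64_t *mask_out)
{
    assert(list != NULL && mask_out != NULL);

    uint64_t mask = 0;
    const char *p = list;

    while (*p != '\0') {
        char *end;

        if (*p < '0' || *p > '9')
            return -1;                      /* every item starts with a digit */
        long first = strtol(p, &end, 10);
        long last = first;

        if (*end == '-') {                  /* a range such as 7-10 */
            p = end + 1;
            if (*p < '0' || *p > '9')
                return -1;
            last = strtol(p, &end, 10);
        }
        if (first > last || last > 63)
            return -1;
        for (long cpu = first; cpu <= last; cpu++)
            mask |= (uint64_t)1 << cpu;

        if (*end == ',') {
            end++;
            if (*end == '\0')
                return -1;                  /* trailing comma */
        } else if (*end != '\0') {
            return -1;                      /* unexpected character */
        }
        p = end;
    }
    *mask_out = mask;
    return 0;
}
```

For example, the list 0,4,8 corresponds to the mask 0x111, and the bitmask 0x7 corresponds to the list 0-2.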
74. the configured control settings apply. By default, VMA control settings are applied to all applications.
• The transport to be used for the created sockets.
• The IP addresses and ports in which you want offload.

By default, the configuration file allows VMA to offload everything.

In the libvma.conf file:
• You can define different VMA control statements for different processes in a single configuration file. Control statements are always applied to the preceding target process statement in the configuration file.
• Comments start with # and cause the entire line after it to be ignored.
• Any beginning whitespace is skipped.
• Any line that is empty is skipped.
• It is recommended to add comments when making configuration changes.

The following sections describe configuration options in libvma.conf. For a sample libvma.conf file, see Example of VMA Configuration.

4.1.1 Configuring Target Application or Process

The target process statement specifies the process to which all control statements that appear between this statement and the next target process statement apply. Each statement specifies a matching rule that all its subexpressions must evaluate as true (logical and) to apply. If not provided (default), the statement matches all programs.

The format of the target process statement is:

application-id <program-name|*> <user-defined-id|*>

Table 2: Target Process
75. type structure pointer that holds pointers to the specific VMA Extra API function calls which are needed for the application to use.

It is recommended to call vma_get_api() once on startup and to use the returned pointer throughout the life of the process. There is no need to release this pointer in any way.

7.3 Control Off-load Capabilities During Run-Time

7.3.1 Adding libvma.conf Rules During Run-Time

Adds a libvma.conf rule to the top of the list. This rule will not apply to existing sockets which have already considered the conf rules (around connect, listen, send, recv, and so on).

Syntax: int add_conf_rule(char *config_line);

Return value:
• 0 on success
• error code on failure

Table 6: add_conf_rule Parameters

Parameter Name: Description; Values
• config_line: New rule to add to the top of the list (highest priority); a char buffer with the exact format as defined in libvma.conf, which should end with '\0'.

7.3.2 Creating Sockets as Off-loaded or Not-Off-loaded

Creates sockets on pthread tid as off-loaded or not-off-loaded. This does not affect existing sockets. Offloaded sockets are still subject to libvma.conf rules. Usually combined with the VMA_OFFLOADED_SOCKETS parameter.

Syntax: int thread_offload(int offload, pthread_t tid);

Return value:
• 0 on success
• error code on failure

Table 7: add_conf_rule Parameters

Parameter Name: Description; Values
• offload: Of
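The advice to call vma_get_api() once at startup and reuse the pointer can be sketched as a cached probe. This is a hedged example rather than reference code: the include path <mellanox/vma_extra.h> is an assumption (the path is not legible in this manual), the stub branch exists only so that the sketch builds on hosts without the VMA header, and probe_vma_extra_api is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed header location. When the header is not installed, a stub that
 * always reports "VMA not loaded" is used instead. */
#if defined(__has_include)
#  if __has_include(<mellanox/vma_extra.h>)
#    include <mellanox/vma_extra.h>
#    define HAVE_VMA_EXTRA_H 1
#  endif
#endif

#ifndef HAVE_VMA_EXTRA_H
struct vma_api_t;                                   /* opaque in the stub */
static struct vma_api_t *vma_get_api(void) { return NULL; }
#endif

/* Probe the Extra API once and cache the result for the process lifetime.
 * Stores the cached pointer in *api_out and returns 1 if the VMA Extra API
 * is usable, 0 if the application should use the plain Socket API. */
int probe_vma_extra_api(struct vma_api_t **api_out)
{
    static struct vma_api_t *cached_api;
    static int probed;

    assert(api_out != NULL);
    if (!probed) {
        cached_api = vma_get_api();   /* NULL on native OS without VMA */
        probed = 1;
    }
    *api_out = cached_api;
    return cached_api != NULL;
}
```

An application would call probe_vma_extra_api() during initialization and branch once: with a non-NULL pointer it can use the Extra API calls, such as add_conf_rule and thread_offload described above, and otherwise it keeps using the regular socket calls. The pointer never needs to be released.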
