RDMA Aware Networks Programming User Manual

Contents

1. enum ibv_atomic_cap atomic_cap;
    int max_ee;
    int max_rdd;
    int max_mw;
    int max_raw_ipv6_qp;
    int max_raw_ethy_qp;
    int max_mcast_grp;
    int max_mcast_qp_attach;
    int max_total_mcast_qp_attach;
    int max_ah;
    int max_fmr;
    int max_map_per_fmr;
    int max_srq;
    int max_srq_wr;
    int max_srq_sge;
    uint16_t max_pkeys;
    uint8_t local_ca_ack_delay;
    uint8_t phys_port_cnt;

    fw_ver            Firmware version
    node_guid         Node global unique identifier (GUID)
    sys_image_guid    System image GUID
    max_mr_size       Largest contiguous block that can be registered
    page_size_cap     Supported page sizes
    vendor_id         Vendor ID, per IEEE
    vendor_part_id    Vendor supplied part ID
    hw_ver            Hardware version
    max_qp            Maximum number of Queue Pairs (QP)
    max_qp_wr         Maximum outstanding work requests (WR) on any queue
    device_cap_flags  IBV_DEVICE_RESIZE_MAX_WR
                      IBV_DEVICE_BAD_PKEY_CNTR
                      IBV_DEVICE_BAD_QKEY_CNTR
                      IBV_DEVICE_RAW_MULTI
                      IBV_DEVICE_AUTO_PATH_MIG
                      IBV_DEVICE_CHANGE_PHY_PORT
                      IBV_DEVICE_UD_AV_PORT_ENFORCE
                      IBV_DEVICE_CURR_QP_STATE_MOD
                      IBV_DEVICE_SHUTDOWN_PORT
                      IBV_DEVICE_INIT_TYPE
                      IBV_DEVICE_PORT_ACTIVE_EVENT
                      IBV_DEVICE_SYS_IMAGE_GUID
                      IBV_DEVICE_RC_RNR_NAK_GEN
                      IBV_DEVICE_SRQ_RESIZE
                      IBV_DEVICE_N_NOTIFY_CQ
                      IBV_DEVICE_XRC
    max_sge           Maximum scatter gather entries (SGE) per WR for non-RD QPs
    max_sge_rd        Maximum SGEs per WR for RD QPs
    max_cq            Maximum supported completion queues (CQ)
    max_cqe           Maximum completion queue entries (CQE) per C
2. listen_id  For RDMA_CM_EVENT_CONNECT_REQUEST event types, this references the corresponding listening request identifier.
    event      Specifies the type of communication event which occurred. See EVENT TYPES below.
    status     Returns any asynchronous error information associated with an event. The status is zero unless the corresponding operation failed.
    param      Provides additional details based on the type of event. Users should select the conn or ud subfields based on the rdma_port_space of the rdma_cm_id associated with the event. See UD EVENT DATA and CONN EVENT DATA below.

    Mellanox Technologies, Rev 1.3, RDMA_CM API

    UD Event Data
    Event parameters related to unreliable datagram (UD) services: RDMA_PS_UDP and RDMA_PS_IPOIB. The UD event data is valid for RDMA_CM_EVENT_ESTABLISHED and RDMA_CM_EVENT_MULTICAST_JOIN events, unless stated otherwise.

    private_data      References any user-specified data associated with RDMA_CM_EVENT_CONNECT_REQUEST or RDMA_CM_EVENT_ESTABLISHED events. The data referenced by this field matches that specified by the remote side when calling rdma_connect or rdma_accept. This field is NULL if the event does not include private data. The buffer referenced by this pointer is deallocated when calling rdma_ack_cm_event.
    private_data_len  The size of the private data buffer. Users should note that th
3. /* Make sure that we get notified on the first completion */
        ret = ibv_req_notify_cq(ctx->srq_cq, 0);
        if (ret) {
            VERB_ERR("ibv_req_notify_cq", ret);
            return ret;
        }

        return 0;
    }

    /*
     * Function:    destroy_resources
     *
     * Input:
     *      ctx     The context object
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      This function cleans up resources used by the application
     */
    void destroy_resources(struct context *ctx)
    {
        int i;

        if (ctx->conn_id) {
            for (i = 0; i < ctx->qp_count; i++) {
                if (ctx->conn_id[i]) {
                    if (ctx->conn_id[i]->qp &&
                        ctx->conn_id[i]->qp->state == IBV_QPS_RTS) {
                        rdma_disconnect(ctx->conn_id[i]);
                    }
                    rdma_destroy_qp(ctx->conn_id[i]);
                    rdma_destroy_id(ctx->conn_id[i]);
                }
            }

            free(ctx->conn_id);
        }

        if (ctx->recv_mr)
            rdma_dereg_mr(ctx->recv_mr);

        if (ctx->send_mr)
            rdma_dereg_mr(ctx->send_mr);

        if (ctx->recv_buf)
            free(ctx->recv_buf);

        if (ctx->send_buf)
            free(ctx->send_buf);

        if (ctx->srq_cq)
            ibv_destroy_cq(ctx->srq_cq);

        if (ctx->srq_cq_channel)
            ibv_destroy_comp_channel(ctx->srq_cq_channel);

        if (ctx->srq_id) {
            rdma_destroy_srq(ctx->srq_id);
            rdma_destroy_id(ctx->srq_id);
        }
    }

    /*
     * Function:    a
4. 1.6 References
    Chapter 2 Introduction to the Programming User Guide
        2.1 [title illegible in source]
        2.2 Online Resources
    Chapter 3 Overview
        3.1 Available Communication Operations
            3.1.1 Send/Send With Immediate
            3.1.2 Receive
            3.1.3 RDMA Read
            3.1.4 RDMA Write / RDMA Write With Immediate
            3.1.5 Atomic Fetch and Add / Atomic Compare and Swap
        3.2 Transport Modes
            3.2.1 Reliable Connection (RC)
            3.2.2 Unreliable Connection (UC)
            3.2.3 Unreliable Datagram (UD)
        3.3 Key Concepts
            3.3.1 Send Request (SR)
            3.3.2 Receive Request (RR)
            3.3.3 Completion Queue
            3.3.4 Memory Registration
            3.3.5 Memory Window
            3.3.6 Address Vector
            3.3.7 Global Routing Header (GRH)
            3.3.8 Protection Domain
            3.3.9 Asynchronous Events
            3.3.10 Scatter Gather
            3.3.11 Poll
5. /* Create a thread to handle any CM events while messages are exchanged */
        pthread_create(&ctx.cm_thread, NULL, cm_thread, &ctx);

        if (!ctx.sender)
            printf("waiting for messages\n");

        for (i = 0; i < ctx.msg_count; i++) {
            if (ctx.sender) {
                ret = post_send(&ctx);
                if (ret)
                    goto out;
            }

            ret = get_completion(&ctx);
            if (ret)
                goto out;

            if (ctx.sender)
                printf("sent message %d\n", i + 1);
            else
                printf("received message %d\n", i + 1);
        }

    out:
        ret = rdma_leave_multicast(ctx.id, &ctx.mcast_sockaddr);
        if (ret)
            VERB_ERR("rdma_leave_multicast", ret);

        destroy_resources(&ctx);

        return ret;
    }

    9.3 Shared Receive Queue (SRQ)

    /*
     * Copyright (c) 2012 Software Forge Inc. All rights reserved.
     *
     * This software is available to you under a choice of one of two
     * licenses. You may choose to be licensed under the terms of the GNU
     * General Public License (GPL) Version 2, available from the file
     * COPYING in the main directory of this source tree, or the
     * OpenIB.org BSD license below:
     *
     * Redistribution and use in source and binary forms, with or
     * without modification, are permitted provided that the following
     * conditions are met:
     *
     * - Redistributions of source code must retain the above
     *   copyright notice, this list of conditions and the following
     *   disclaimer.
     *
     * - Redistributions in binary form must reproduce the above
     *   copyright
6. /******************************************************************************
     * Function: usage
     *
     * Input
     *   argv0  command line arguments
     *
     * Output
     *   none
     *
     * Returns
     *   none
     *
     * Description
     *   print a description of command line syntax
     ******************************************************************************/
    static void usage(const char *argv0)
    {
        fprintf(stdout, "Usage:\n");
        fprintf(stdout, "  %s start a server and wait for connection\n", argv0);
        fprintf(stdout, "  %s <host> connect to server at <host>\n", argv0);
        fprintf(stdout, "\n");
        fprintf(stdout, "Options:\n");
        fprintf(stdout, "  -p, --port <port> listen on/connect to port <port> (default 18515)\n");
        fprintf(stdout, "  -d, --ib-dev <dev> use IB device <dev> (default first device found)\n");
        fprintf(stdout, "  -i, --ib-port <port> use port <port> of IB device (default 1)\n");
        fprintf(stdout, "  -g, --gid_idx <gid index> gid index to be used in GRH (default not used)\n");
    }

    /******************************************************************************
     * Functi
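The options listed by usage() in the excerpt above could be parsed along the following lines. This is a minimal sketch, not the manual's own parser: struct example_config and parse_args are hypothetical names introduced here, and the defaults are the ones quoted in the usage text (port 18515, IB port 1, GID index unused).

```c
#include <getopt.h>
#include <stdlib.h>

/* Hypothetical container for the parsed options (not the manual's struct). */
struct example_config {
    const char *dev_name;    /* -d, --ib-dev; NULL means first device found */
    const char *server_name; /* positional <host>; NULL means server mode */
    unsigned int tcp_port;   /* -p, --port */
    int ib_port;             /* -i, --ib-port */
    int gid_idx;             /* -g, --gid_idx; -1 means "not used" */
};

/* Returns 0 on success, -1 on an unknown option or surplus arguments. */
static int parse_args(int argc, char *argv[], struct example_config *cfg)
{
    static const struct option long_opts[] = {
        { "port",    required_argument, NULL, 'p' },
        { "ib-dev",  required_argument, NULL, 'd' },
        { "ib-port", required_argument, NULL, 'i' },
        { "gid_idx", required_argument, NULL, 'g' },
        { NULL, 0, NULL, 0 }
    };
    int c;

    /* Defaults taken from the usage text. */
    cfg->dev_name = NULL;
    cfg->server_name = NULL;
    cfg->tcp_port = 18515;
    cfg->ib_port = 1;
    cfg->gid_idx = -1;

    optind = 1; /* restart scanning so the function can be called again */
    while ((c = getopt_long(argc, argv, "p:d:i:g:", long_opts, NULL)) != -1) {
        switch (c) {
        case 'p': cfg->tcp_port = (unsigned int) strtoul(optarg, NULL, 0); break;
        case 'd': cfg->dev_name = optarg; break;
        case 'i': cfg->ib_port = (int) strtol(optarg, NULL, 0); break;
        case 'g': cfg->gid_idx = (int) strtol(optarg, NULL, 0); break;
        default:  return -1;
        }
    }

    /* At most one positional argument: the server host (client mode). */
    if (optind == argc - 1)
        cfg->server_name = argv[optind];
    else if (optind < argc)
        return -1;

    return 0;
}
```

A caller would typically invoke parse_args(argc, argv, &cfg) at the top of main() and print the usage text when it returns -1.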
7. API definition files: rdma/rdma_cma.h and infiniband/verbs.h

    8.3.2 Run
    1. Get source (if provided for binding) and destination addresses; convert the input addresses to socket presentation.
    2. Joining:
       A. For all connections: if a source address is specifically provided, bind the rdma_cm object to the corresponding network interface (this associates a source address with an rdma_cm identifier). If an unmapped MC address with a bind address is provided, check the remote address and then bind.
       B. Poll on all the connection events and wait until all rdma_cm objects have joined the MC group.
    3. Send & receive:
       A. If sender: send the messages to all connection nodes (function post_sends).
       B. If receiver: poll the completion queue (function poll_cqs) until the messages arrive.
    On ending: release network resources (for all connections, leave the multicast group and detach its associated QP from the group).

    8.4 Code for Multicast Using RDMA_CM and IBV Verbs
    Multicast Code Example

    /*
     * BUILD COMMAND:
     * gcc -g -Wall -D_GNU_SOURCE -g -O2 -o examples/mckey examples/mckey.c -libverbs -lrdmacm
     *
     * Copyright (c) 2005-2007 Intel Corporation. All rights reserved.
     *
     * This software is available to you under a choice of one of two
     * licenses. You may choose to be licensed
8. The user-defined context associated with the send request will be returned to the user through the work completion work request identifier (wr_id) field.

    6.2.10 rdma_get_send_comp
    Template:
        int rdma_get_send_comp(struct rdma_cm_id *id, struct ibv_wc *wc)
    Input Parameters:
        id  A reference to the communication identifier to check for completions
        wc  A reference to a work completion structure to fill in
    Output Parameters:
        wc  A reference to a work completion structure. The structure will contain information about the completed request when the routine returns
    Return Value:
        A non-negative value (0 or 1) equal to the number of completions found on success, or -1 on failure. If the call fails, errno will be set to indicate the reason for the failure.
    Description:
        rdma_get_send_comp retrieves a completed work request for a send, RDMA read, or RDMA write operation. Information about the completed request is returned through the ibv_wc (wc) parameter, with the wr_id set to the context of the request. Please see ibv_poll_cq for details on the work completion structure, ibv_wc.
        Please note that this call polls the send completion queue associated with the rdma_cm_id, id. If a completion is not found, the call blocks until a request completes. This means, therefore, that the call should only be used on rdma_cm_ids which do not share CQs with other rdma_cm_ids and maintain separate CQs
9.  19875, /* tcp_port */
        1,     /* ib_port */
        -1     /* gid_idx */
    };

    /******************************************************************************
    Socket operations

    For simplicity, the example program uses TCP sockets to exchange control
    information. If a TCP/IP stack/connection is not available, connection manager
    (CM) may be used to pass this information. Use of CM is beyond the scope of
    this example
    ******************************************************************************/

    /******************************************************************************
     * Function: sock_connect
     *
     * Input
     *   servername  URL of server to connect to (NULL for server mode)
     *   port        port of service
     *
     * Output
     *   none
     *
     * Returns
     *   socket (fd) on success, negative error code on failure
     *
     * Description
     *   Connect a socket. If servername is specified, a client connection will be
     *   initiated to the indicated server and port. Otherwise, listen on the
     *   indicated port for an incoming connection.
     ******************************************************************************/
10. Opens an event channel used to report communication events. Asynchronous events are reported to users through event channels.
    Notes:
        Event channels are used to direct all events on an rdma_cm_id. For many clients, a single event channel may be sufficient; however, when managing a large number of connections or cm_ids, users may find it useful to direct events for different cm_ids to different channels for processing.
        All created event channels must be destroyed by calling rdma_destroy_event_channel. Users should call rdma_get_cm_event to retrieve events on an event channel.
        Each event channel is mapped to a file descriptor. The associated file descriptor can be used and manipulated like any other fd to change its behavior. Users may make the fd non-blocking, poll or select the fd, etc.
    See Also:
        rdma_cm, rdma_get_cm_event, rdma_destroy_event_channel

    5.1.2 rdma_destroy_event_channel
    Template:
        void rdma_destroy_event_channel(struct rdma_event_channel *channel)
    Input Parameters:
        channel  The communication channel to destroy
    Output Parameters:
        none
    Return Value:
        none
    Description:
        Close an event communication channel. Releases all resources associated with an event channel and closes the associated file descriptor.
    Notes:
        All rdma_cm_ids associated with the event channel must be destroyed, and all returned events must be acked before calling this funct
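Because each event channel is mapped to an ordinary file descriptor, the "non-blocking, poll or select" usage mentioned in the notes above reduces to standard POSIX calls. The sketch below uses two hypothetical helpers, set_nonblocking and wait_readable, which take any fd; in an RDMA_CM program the fd would be the one stored in the event channel, and a readable fd would be followed by a call to rdma_get_cm_event. A pipe works as a stand-in for trying the helpers without RDMA hardware.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: switch an fd to non-blocking mode.
 * Returns 0 on success, -1 on failure (errno set by fcntl). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: wait up to timeout_ms for the fd to become readable.
 * Returns 1 if readable, 0 on timeout, -1 on error or hang-up without data. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int n = poll(&pfd, 1, timeout_ms);

    if (n < 0)
        return -1;
    if (n == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}
```

With several event channels, the same pattern extends to an array of struct pollfd entries, one per channel, so that a single thread can service all of them.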
11. for (i = 0; i < ctx->qp_count; i++) {
            memset(&attr, 0, sizeof (attr));

            attr.qp_context = ctx;
            attr.cap.max_send_wr = ctx->max_wr;
            attr.cap.max_recv_wr = ctx->max_wr;
            attr.cap.max_send_sge = 1;
            attr.cap.max_recv_sge = 1;
            attr.cap.max_inline_data = 0;
            attr.recv_cq = ctx->srq_cq;
            attr.srq = ctx->srq;
            attr.sq_sig_all = 0;

            ret = rdma_create_ep(&ctx->conn_id[i], rai, NULL, &attr);
            if (ret) {
                VERB_ERR("rdma_create_ep", ret);
                return ret;
            }

            ret = rdma_connect(ctx->conn_id[i], NULL);
            if (ret) {
                VERB_ERR("rdma_connect", ret);
                return ret;
            }
        }

        while (send_count < ctx->msg_count) {
            for (i = 0; i < ctx->max_wr && send_count < ctx->msg_count; i++) {
                /* perform our send to the server */
                ret = rdma_post_send(ctx->conn_id[i % ctx->qp_count], NULL,
                                     ctx->send_buf, ctx->msg_length, ctx->send_mr,
                                     IBV_SEND_SIGNALED);
                if (ret) {
                    VERB_ERR("rdma_post_send", ret);
                    return ret;
                }

                ret = rdma_get_send_comp(ctx->conn_id[i % ctx->qp_count], &wc);
                if (ret < 0) {
                    VERB_ERR("rdma_get_send_comp", ret);
                    return ret;
                }

                send_count++;
                printf("send count: %d, qp_num: %d\n", send_count, wc.qp_num);
            }

            /* wait for a recv indicating that all buffers were processed */
            ret = await_completion(ctx);
            if (ret) {
                VERB_ERR("await_completion", ret);
                return ret;
            }
12. ibv_ack_async_event(&event);
        }

        return NULL;
    }

    /*
     * Function:    get_alt_dlid_from_private_data
     *
     * Input:
     *      event   The RDMA event containing private data
     *
     * Output:
     *      dlid    The DLID that was sent in the private data
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Takes the private data sent from the remote side and returns the
     *      destination LID that was contained in the private data
     */
    int get_alt_dlid_from_private_data(struct rdma_cm_event *event, uint16_t *dlid)
    {
        if (event->param.conn.private_data_len < 4) {
            printf("unexpected private data len: %d",
                   event->param.conn.private_data_len);
            return -1;
        }

        *dlid = ntohs(*((uint16_t *) event->param.conn.private_data));
        return 0;
    }

    /*
     * Function:    get_alt_port_details
     *
     * Input:
     *      ctx     The context object
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      First, query the device to determine if path migration is supported.
     *      Next, queries all the ports on the device to determine if there is
     *      a different port than the current one to use as an alternate port. If so,
     *      copy the port number and dlid to the context so they can be used when
     *      the alternate path is loaded.
     *
     * Note:
     *      This function assumes that if another port is found in the active state,
     *      that the port is connecte
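The byte-order step in get_alt_dlid_from_private_data above can be tried in isolation. The following is a minimal sketch under stated assumptions: decode_dlid is a hypothetical helper that takes a raw buffer instead of an rdma_cm_event, it only requires the two bytes it actually reads (the manual's function requires at least four), and it copies through memcpy so that an unaligned private-data buffer is not dereferenced directly.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode a 16-bit LID stored in network byte order
 * at the start of buf. Returns 0 on success, -1 if the buffer is too short. */
static int decode_dlid(const void *buf, size_t len, uint16_t *dlid)
{
    uint16_t net;

    if (buf == NULL || len < sizeof(net))
        return -1;

    memcpy(&net, buf, sizeof(net)); /* safe even if buf is unaligned */
    *dlid = ntohs(net);             /* network (big-endian) to host order */
    return 0;
}
```

The sending side would perform the mirror-image conversion with htons before placing the LID in the private data passed to rdma_connect or rdma_accept.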
13.  * port and watch it migrate to the other port.
     *
     * Running the Example:
     * This example requires a specific IB network configuration to properly
     * demonstrate APM. Two hosts are required, one for the client and one for the
     * server. At least one of these two hosts must have an IB card with two ports.
     * Both of these ports should be connected to the same subnet and each have a
     * route to the other host through an IB switch.
     * The executable can operate as either the client or server application. Start
     * the server side first on one host, then start the client on the other host. With
     * default parameters, the client and server will exchange 100 sends over 100
     * seconds. During that time, manually unplug the cable connected to the original
     * port of the two-port host, and watch the path get migrated to the other port.
     * It may take up to a minute for the path to be migrated. To see the path get
     * migrated by software, use the -m option on the client side.
     *
     * Server:
     *      apm -s
     *
     * Client (-a is IP of remote interface):
     *      apm -a 192.168.1.12
     */
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <getopt.h>
    #include <rdma/rdma_verbs.h>

    #define VERB_ERR(verb, ret) \
        fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

    /* Default parameter values */
    #define DEFAULT_PORT "51216"
    #define DEFAULT_MSG_COUNT 100
    #define DEFAULT_MSG_LENGTH 100
14. rdma_get_local_addr retrieves the local IP address for the rdma_cm_id which has been bound to a local device.

    5.2.20 rdma_get_peer_addr
    Template:
        struct sockaddr *rdma_get_peer_addr(struct rdma_cm_id *id)
    Input Parameters:
        id  RDMA communication identifier
    Output Parameters:
        None
    Return Value:
        A pointer to the sockaddr address of the connected peer. If the rdma_cm_id is not connected, then the contents of the sockaddr structure will be set to all zeros.
    Description:
        rdma_get_peer_addr retrieves the remote IP address of a bound rdma_cm_id.

    5.2.21 rdma_get_devices
    Template:
        struct ibv_context **rdma_get_devices(int *num_devices)
    Input Parameters:
        num_devices  If non-NULL, set to the number of devices returned
    Output Parameters:
        num_devices  Number of RDMA devices currently available
    Return Value:
        Array of available RDMA devices on success, or NULL if the request fails.
    Description:
        rdma_get_devices retrieves an array of RDMA devices currently available. Devices remain opened while librdmacm is loaded, and the array must be released by calling rdma_free_devices.

    5.2.22 rdma_free_devices
    Template:
        void rdma_free_devices(struct ibv_context **list)
    Input Parameters:
        list  List of devices returned fro
15. res->cq = ibv_create_cq(res->ib_ctx, cq_size, NULL, NULL, 0);
        if (!res->cq) {
            fprintf(stderr, "failed to create CQ with %u entries\n", cq_size);
            rc = 1;
            goto resources_create_exit;
        }

        /* allocate the memory buffer that will hold the data */
        size = MSG_SIZE;
        res->buf = (char *) malloc(size);
        if (!res->buf) {
            fprintf(stderr, "failed to malloc %zu bytes to memory buffer\n", size);
            rc = 1;
            goto resources_create_exit;
        }
        memset(res->buf, 0, size);

        /* only in the server side put the message in the memory buffer */
        if (!config.server_name) {
            strcpy(res->buf, MSG);
            fprintf(stdout, "going to send the message: '%s'\n", res->buf);
        } else
            memset(res->buf, 0, size);

        /* register the memory buffer */
        mr_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                   IBV_ACCESS_REMOTE_WRITE;
        res->mr = ibv_reg_mr(res->pd, res->buf, size, mr_flags);
        if (!res->mr) {
            fprintf(stderr, "ibv_reg_mr failed with mr_flags=0x%x\n", mr_flags);
            rc = 1;
            goto resources_create_exit;
        }
        fprintf(stdout, "MR was registered with addr=%p, lkey=0x%x, rkey=0x%x, flags=0x%x\n",
                res->buf, res->mr->lkey, res->mr->rkey, mr_flags);

        /* create the Queue Pair */
        memset(&qp_init_attr, 0, sizeof(qp_init_attr));
        qp_init_attr.qp_type = IBV_QPT_RC;
        qp_init_attr.sq_sig_all = 1;
        qp_init_attr.send_cq = res->cq;
        qp_init_attr.re
16. config.dev_name);
        fprintf(stdout, "\ntest result is %d\n", rc);
        return rc;
    }

    8.3 Synopsis for Multicast Example Using RDMA_CM and IBV Verbs
    This code example for Multicast uses RDMA-CM and VPI (and hence can be run both over IB and over LLE).
    Notes:
    1. In order to run the multicast example on either IB or LLE, no change is needed to the test's code. However, if RDMA_CM is used, it is required that the network interface be configured and up (whether it is used over RoCE or over IB).
    2. For the IB case, a join operation is involved, yet it is performed by the rdma_cm kernel code.
    3. For the LLE case, no join is required. All MGIDs are resolved into MACs at the host.
    4. To inform the multicast example which port to use, you need to specify "-b <IP address>" to bind to the desired device port.

    8.3.1 Main
    1. Get command line parameters:
       - -m  MC address, destination port
       - -M  unmapped MC address, requires also bind address (parameter "-b")
       - -s  sender flag
       - -b  bind address
       - -c  connections amount
       - -C  message count
       - -S  message size
       - -p  port space (UDP default; IPoIB)
    2. Create event channel to receive asynchronous events.
    3. Allocate Node and create an identifier that is used to track communication information.
    4. Start the "run" main function.
    5. On ending, release and free resources.
17. free(node->mem);
        return -1;
    }

    static int verify_test_params(struct cmatest_node *node)
    {
        struct ibv_port_attr port_attr;
        int ret;

        ret = ibv_query_port(node->cma_id->verbs, node->cma_id->port_num, &port_attr);
        if (ret)
            return ret;

        if (message_count && message_size > (1 << (port_attr.active_mtu + 7))) {
            printf("mckey: message_size %d is larger than active mtu %d\n",
                   message_size, 1 << (port_attr.active_mtu + 7));
            return -EINVAL;
        }

        return 0;
    }

    static int init_node(struct cmatest_node *node)
    {
        struct ibv_qp_init_attr init_qp_attr;
        int cqe, ret;

        node->pd = ibv_alloc_pd(node->cma_id->verbs);
        if (!node->pd) {
            ret = -ENOMEM;
            printf("mckey: unable to allocate PD\n");
            goto out;
        }

        cqe = message_count ? message_count * 2 : 2;
        node->cq = ibv_create_cq(node->cma_id->verbs, cqe, node, 0, 0);
        if (!node->cq) {
            ret = -ENOMEM;
            printf("mckey: unable to create CQ\n");
            goto out;
        }

        memset(&init_qp_attr, 0, sizeof init_qp_attr);
        init_qp_attr.cap.max_send_wr = message_count ? message_count : 1;
        init_qp_attr.cap.max_recv_wr = message_count ? message_count : 1;
        init_qp_attr.cap.max_send_sge = 1;
        init_qp_attr.cap.max_recv_sge = 1;
        init_qp_attr.qp_context = node;
        init_qp_attr.sq_sig_all = 0;
        init_qp_attr.qp_type = IBV_QPT_UD;
        init_qp_attr.send_cq = node->cq;
        init_qp_attr.re
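The expression 1 << (port_attr.active_mtu + 7) in verify_test_params above works because the MTU is reported as a small enumeration code rather than a byte count. The sketch below spells that mapping out. The enum example_mtu is a local copy of the values of enum ibv_mtu from infiniband/verbs.h, made here as an assumption so the code compiles without libibverbs; mtu_enum_to_bytes and message_fits are hypothetical helper names.

```c
/* Local mirror of the enum ibv_mtu codes (assumed values, copied so this
 * sketch builds without <infiniband/verbs.h>). */
enum example_mtu {
    EX_MTU_256  = 1,
    EX_MTU_512  = 2,
    EX_MTU_1024 = 3,
    EX_MTU_2048 = 4,
    EX_MTU_4096 = 5
};

/* Convert an MTU code to bytes: code 1 -> 2^8 = 256, ..., code 5 -> 2^12 = 4096.
 * Returns -1 for a code outside the known range. */
static int mtu_enum_to_bytes(int mtu)
{
    if (mtu < EX_MTU_256 || mtu > EX_MTU_4096)
        return -1;
    return 1 << (mtu + 7);
}

/* Same check as verify_test_params: 1 if the message fits within the
 * active MTU, 0 if it is too large or the MTU code is invalid. */
static int message_fits(int message_size, int active_mtu)
{
    int bytes = mtu_enum_to_bytes(active_mtu);

    return bytes > 0 && message_size <= bytes;
}
```

The check matters for the multicast example because it uses UD queue pairs, where a single message cannot be larger than the path MTU.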
18. if (wc.status != IBV_WC_SUCCESS) {
                    printf("work completion status %s\n",
                           ibv_wc_status_str(wc.status));
                    return -1;
                }

                recv_count++;
                printf("recv count: %d, qp_num: %d\n", recv_count, wc.qp_num);

                ret = rdma_post_recv(ctx->srq_id, (void *) wc.wr_id,
                                     ctx->recv_buf, ctx->msg_length,
                                     ctx->recv_mr);
                if (ret) {
                    VERB_ERR("rdma_post_recv", ret);
                    return ret;
                }

                i++;
            } while (ne);
        }

        ret = rdma_post_send(ctx->conn_id[0], NULL, ctx->send_buf,
                             ctx->msg_length, ctx->send_mr, IBV_SEND_SIGNALED);
        if (ret) {
            VERB_ERR("rdma_post_send", ret);
            return ret;
        }

        ret = rdma_get_send_comp(ctx->conn_id[0], &wc);
        if (ret < 0) {
            VERB_ERR("rdma_get_send_comp", ret);
            return -1;
        }

        send_count++;
        printf("send count: %d\n", send_count);
    }

    return 0;
    }

    /*
     * Function:    run_client
     *
     * Input:
     *      ctx     The context object
     *      rai     The RDMA address info for the connection
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Executes the client side of the example
     */
    int run_client(struct context *ctx, struct rdma_addrinfo *rai)
    {
        int ret, i, ne;
        uint64_t send_count = 0;
        uint64_t recv_count = 0;
        struct ibv_wc wc;
        struct ibv_qp_init_attr attr;

        ret = init_resources(ctx, rai);
        if (ret) {
            printf("init_resources returned %d\n", ret);
            return ret;
        }
19. malloc(sizeof *test.nodes * connections);
        if (!test.nodes) {
            printf("mckey: unable to allocate memory for test nodes\n");
            return -ENOMEM;
        }
        memset(test.nodes, 0, sizeof *test.nodes * connections);

        for (i = 0; i < connections; i++) {
            test.nodes[i].id = i;
            ret = rdma_create_id(test.channel, &test.nodes[i].cma_id,
                                 &test.nodes[i], port_space);
            if (ret)
                goto err;
        }
        return 0;
    err:
        while (--i >= 0)
            rdma_destroy_id(test.nodes[i].cma_id);
        free(test.nodes);
        return ret;
    }

    static void destroy_nodes(void)
    {
        int i;

        for (i = 0; i < connections; i++)
            destroy_node(&test.nodes[i]);
        free(test.nodes);
    }

    static int poll_cqs(void)
    {
        struct ibv_wc wc[8];
        int done, i, ret;

        for (i = 0; i < connections; i++) {
            if (!test.nodes[i].connected)
                continue;

            for (done = 0; done < message_count; done += ret) {
                ret = ibv_poll_cq(test.nodes[i].cq, 8, wc);
                if (ret < 0) {
                    printf("mckey: failed polling CQ: %d\n", ret);
                    return ret;
                }
            }
        }
        return 0;
    }

    static int connect_events(void)
    {
        struct rdma_cm_event *event;
        int ret = 0;

        while (test.connects_left && !ret) {
            ret = rdma_get_cm_event(test.channel, &event);
            if (!ret) {
                ret = cma_handler(event->id, event);
                rdma_ack_cm_event(event);
            }
        }
        return ret;
    }

    static int
20. Description:
        ibv_fork_init initializes libibverbs' data structures to handle the fork() function safely and avoid data corruption, whether fork() is called explicitly or implicitly, such as in system() calls.
        It is not necessary to call ibv_fork_init if all parent process threads are always blocked until all child processes end or change address space via an exec() operation.
        This function works on Linux kernels supporting the MADV_DONTFORK flag for madvise() (2.6.17 and higher).
        Setting the environment variable RDMAV_FORK_SAFE or IBV_FORK_SAFE to any value has the same effect as calling ibv_fork_init().
        Setting the environment variable RDMAV_HUGEPAGES_SAFE to any value tells the library to check the underlying page size used by the kernel for memory regions. This is required if an application uses huge pages either directly or indirectly via a library such as libhugetlbfs.
        Calling ibv_fork_init() will reduce performance due to an extra system call for every memory registration, and the additional memory allocated to track memory regions. The precise performance impact depends on the workload and usually will not be significant.
        Setting RDMAV_HUGEPAGES_SAFE adds further overhead to all memory registrations.

    4.2 Device Operations
    The following commands are used for general device operations, allowing the user to query information about devices th
21. /******************************************************************************
     * Function: modify_qp_to_init
     *
     * Input
     *   qp  QP to transition
     *
     * Output
     *   none
     *
     * Returns
     *   0 on success, ibv_modify_qp failure code on failure
     *
     * Description
     *   Transition a QP from the RESET to INIT state
     ******************************************************************************/
    static int modify_qp_to_init(struct ibv_qp *qp)
    {
        struct ibv_qp_attr attr;
        int flags;
        int rc;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state = IBV_QPS_INIT;
        attr.port_num = config.ib_port;
        attr.pkey_index = 0;
        attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                               IBV_ACCESS_REMOTE_WRITE;
        flags = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS;
        rc = ibv_modify_qp(qp, &attr, flags);
        if (rc)
            fprintf(stderr, "failed to modify QP state to INIT\n");
        return rc;
    }

    /******************************************************************************
     * Function: modify_qp_to_rtr
     *
     * Input
     *   qp          QP to transition
     *   remote_qpn  remote QP number
     *   dlid        destination LID
     *   dgid        destination GID (mandatory for RoCEE)
     *
     * Output
     *   none
     *
     * Returns
     *   0 on success, ibv_modify_qp failure code on failure
     *
     * Description
     *   Transi
22. and atomic operations when timeouts occur. Applies only to RDMA_PS_TCP.
    rnr_retry_count  The maximum number of times that a send operation from the remote peer should be retried on a connection after receiving a receiver not ready (RNR) error. RNR errors are generated when a send request arrives before a buffer has been posted to receive the incoming data. Applies only to RDMA_PS_TCP.
    srq              Specifies if the QP associated with the connection is using a shared receive queue. This field is ignored by the library if a QP has been created on the rdma_cm_id. Applies only to RDMA_PS_TCP.
    qp_num           Specifies the QP number associated with the connection. This field is ignored by the library if a QP has been created on the rdma_cm_id. Applies only to RDMA_PS_TCP.

    5.2.12 rdma_get_request
    Template:
        int rdma_get_request(struct rdma_cm_id *listen, struct rdma_cm_id **id)
    Input Parameters:
        listen  Listening rdma_cm_id
        id      rdma_cm_id associated with the new connection
    Output Parameters:
        id  A pointer to rdma_cm_id associated with the request
    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description:
        rdma_get_request retrieves the next pending connection request event. The call may only be used on listening rdma_cm_ids operating synchronously. If the call is successful, a new rdma_cm_id (id) representing the con
23. and management agents

    AH (Address Handle)
        An object which describes the path to the remote side used in UD QP.
    CA (Channel Adapter)
        A device which terminates an InfiniBand link and executes transport level functions.
    CI (Channel Interface)
        Presentation of the channel to the Verbs Consumer as implemented through the combination of the network adapter, associated firmware, and device driver software.
    CM (Communication Manager)
        An entity responsible to establish, maintain, and release communication for RC and UC QP service types. The Service ID Resolution Protocol enables users of UD service to locate QPs supporting their desired service. There is a CM in every IB port of the end nodes.
    Compare & Swap
        Instructs the remote QP to read a 64-bit value, compare it with the compare data provided, and if equal, replace it with the swap data provided in the QP.
    CQ (Completion Queue)
        A queue (FIFO) which contains CQEs.
    CQE (Completion Queue Entry)
        An entry in the CQ that describes the information about the completed WR (status, size, etc.).
    DMA (Direct Memory Access)
        Allowing hardware to move data blocks directly to and from the memory, bypassing the CPU.
    Fetch & Add
        Instructs the remote QP to read a 64-bit value and replace it with the sum of the 64-bit value and the added data value provided in the QP.
    GUID (Globally Unique IDentifier)
        A 64-bit number that uniquely identifies a device or componen
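The Compare & Swap and Fetch & Add entries above describe read-modify-write operations on a 64-bit value in remote memory. Their semantics can be illustrated locally with C11 atomics. This is only an analogy under stated assumptions: fetch_and_add and compare_and_swap are hypothetical helper names, the target here is local memory rather than a remote buffer, and returning the pre-operation value models the original value being handed back to the requester.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Fetch & Add: read the 64-bit target, store target + add, and return the
 * value that was read. */
static uint64_t fetch_and_add(_Atomic uint64_t *target, uint64_t add)
{
    return atomic_fetch_add(target, add);
}

/* Compare & Swap: read the 64-bit target; if it equals compare, replace it
 * with swap. Returns the value that was read in either case. */
static uint64_t compare_and_swap(_Atomic uint64_t *target,
                                 uint64_t compare, uint64_t swap)
{
    uint64_t expected = compare;

    /* On mismatch, expected is overwritten with the current target value;
     * on match, it still holds compare, which is what was read. */
    atomic_compare_exchange_strong(target, &expected, swap);
    return expected;
}
```

A caller can tell whether a Compare & Swap took effect by checking that the returned value equals the compare value it supplied.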
24. buffer
        flags  Optional flags used to control the send operation
    Output Parameters:
        None
    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description:
        rdma_post_send posts a work request to the send queue of the queue pair associated with the rdma_cm_id, id. The contents of the posted buffer will be sent to the remote peer of the connection.
        The user is responsible for ensuring that the remote peer has queued a receive request before issuing the send operations. Also, unless the send request is using inline data, the message buffer must already have been registered before being posted, with the mr parameter referencing the registration. The buffer must remain registered until the send completes.
        Send operations may not be posted to an rdma_cm_id or the corresponding queue pair until a connection has been established.
        The user-defined context associated with the send request will be returned to the user through the work completion work request identifier (wr_id) field.

    6.2.7 rdma_post_read
    Template:
        int rdma_post_read(struct rdma_cm_id *id, void *context, void *addr, size_t length, struct ibv_mr *mr, int flags, uint64_t remote_addr, uint32_t rkey)
    Input Parameters:
        id       A reference to the communication identifier where the request will be posted
        context  A user de
25. printf("\n");

        if (!ctx.server && !ctx.server_name) {
            printf("server address must be specified for client mode\n");
            exit(1);
        }

        /* both of these must be set or neither should be set */
        if (!((ctx.alt_dlid > 0 && ctx.alt_srcport > 0) ||
              (ctx.alt_dlid == 0 && ctx.alt_srcport == 0))) {
            printf("-d and -r must be used together\n");
            exit(1);
        }

        if (ctx.migrate_after > ctx.msg_count) {
            printf("num_iterations_then_migrate must be less than msg_count\n");
            exit(1);
        }

        ret = getaddrinfo_and_create_ep(&ctx);
        if (ret)
            goto out;

        if (ctx.server) {
            ret = get_connect_request(&ctx);
            if (ret)
                goto out;
        }

        /* only query for alternate port if information was not specified on the
         * command line */
        if (ctx.alt_dlid == 0 && ctx.alt_srcport == 0) {
            ret = get_alt_port_details(&ctx);
            if (ret)
                goto out;
        }

        /* create a thread to handle async events */
        pthread_create(&ctx.async_event_thread, NULL, async_event_thread, &ctx);

        ret = reg_mem(&ctx);
        if (ret)
            goto out;

        ret = establish_connection(&ctx);

        /* load the alternate path after the connection was created. This can be
         * done at connection time, but the connection must be created and
         * established using all ib verbs */
        ret = load_alt_path(&ctx);
        if (ret)
            goto out;

        send_cnt = recv_cnt = 0;

        for (i = 0; i < ctx.msg_count; i++) {
26. 5.2.11 rdma_connect
5.2.12 rdma_get_request
5.2.13 rdma_get_request
5.2.14 rdma_reject
5.2.15 rdma_notify
5.2.16 rdma_disconnect
5.2.17 rdma_get_src_port
5.2.18 rdma_get_dst_port
5.2.19 rdma_get_local_addr
5.2.20 rdma_get_peer_addr
5.2.21 rdma_get_devices
5.2.22 rdma_free_devices
5.2.23 rdma_getaddrinfo
5.2.24 rdma_freeaddrinfo
5.2.25 rdma_create_qp
5.2.26 rdma_destroy_qp
5.2.27 rdma_join_multicast
5.2.28 rdma_leave_multicast
5.3 Event Handling Operations
5.3.1 rdma_get_cm_event
5.3.2 rdma_ack_cm_event
5.3.3 rdma_event_str
Chapter 6 RDMA Verbs API
6.1 Protection Domain Operations
6.1.1 rdma_reg_msgs
27. 7.1.7 IBV_EVENT_PATH_MIG
7.1.8 IBV_EVENT_PATH_MIG_ERR
7.1.9 IBV_EVENT_DEVICE_FATAL
7.1.10 IBV_EVENT_PORT_ACTIVE
7.1.11 IBV_EVENT_PORT_ERR
7.1.12 IBV_EVENT_LID_CHANGE
7.1.13 IBV_EVENT_PKEY_CHANGE
7.1.14 IBV_EVENT_SM_CHANGE
7.1.15 IBV_EVENT_SRQ_ERR
7.1.16 IBV_EVENT_SRQ_LIMIT_REACHED
7.1.17 IBV_EVENT_QP_LAST_WQE_REACHED
7.1.18 IBV_EVENT_CLIENT_REREGISTER
7.1.19 IBV_EVENT_GID_CHANGE
7.2 IBV WC Events
7.2.1 IBV_WC_SUCCESS
7.2.2 IBV_WC_LOC_LEN_ERR
7.2.3 IBV_WC_LOC_QP_OP_ERR
7.2.4 IBV_WC_LOC_EEC_OP_ERR
7.2.5 IBV_WC_LOC_PROT_ERR
7.2.6 IBV_WC_WR_FLUSH_ERR
7.2.7 IBV_WC_MW_BIND_ERR
7.2.8 IBV_WC_BAD_RESP_ERR
7.2.9 IBV_WC_LOC_ACCESS_ERR
7.2.10 IBV_WC_REM_INV_REQ_ERR
7.2.11 IBV_WC_REM_ACCESS_ERR
7.2.12 IBV_WC_REM_OP
28. CQ\n");
            rc = 1;
        }
    if (res->pd)
        if (ibv_dealloc_pd(res->pd))
        {
            fprintf(stderr, "failed to deallocate PD\n");
            rc = 1;
        }
    if (res->ib_ctx)
        if (ibv_close_device(res->ib_ctx))
        {
            fprintf(stderr, "failed to close device context\n");
            rc = 1;
        }
    if (res->sock >= 0)
        if (close(res->sock))
        {
            fprintf(stderr, "failed to close socket\n");
            rc = 1;
        }
    return rc;
}

/******************************************************************************
* Function: print_config
*
* Input
* none
*
* Output
* none
*
* Returns
* none
*
* Description
* Print out config information
******************************************************************************/
static void print_config(void)
{
    fprintf(stdout, "\n");
    fprintf(stdout, " Device name : \"%s\"\n", config.dev_name);
    fprintf(stdout, " IB port : %u\n", config.ib_port);
    if (config.server_name)
        fprintf(stdout, " IP : %s\n", config.server_name);
    fprintf(stdout, " TCP port : %u\n", config.tcp_port);
    if (config.gid_idx >= 0)
        fprintf(stdout, " GID index : %u\n", config.gid_idx);
    fprintf(stdout,
29. If in client mode, show the message we received via the RECEIVE operation; otherwise, if we are in server mode, load the buffer with a new message.
Sync client<->server.
At this point the server goes directly to the next sync. All RDMA operations are done strictly by the client.
Client only:
Call post_send with IBV_WR_RDMA_READ to perform an RDMA read of server's buffer.
Call poll_completion.
Show server's message.
Set up send buffer with new message.
Call post_send with IBV_WR_RDMA_WRITE to perform an RDMA write of server's buffer.
Call poll_completion.
End client only operations.
Sync client<->server.
If server mode, show buffer (proving RDMA write worked).
Call resources_destroy.
Free device name string.
Done.

8.1.2 print_config
Print out configuration information.

8.1.3 resources_init
Clears resources struct.

8.1.4 resources_create
Call sock_connect to connect a TCP socket to the peer.
Get the list of devices, locate the one we want, and open it.
Free the device list.
Get the port information.
Create a PD.
Create a CQ.
Allocate a buffer, initialize it, register it.
Create a QP.

8.1.5 sock_connect
If client, resolve DNS address of server and initiate a connection to it.
If server, listen for incoming connection on indicated port.

8.1.6 connect_qp
Call modify_qp_to_init.
30. 6.2.9 rdma_post_ud_send

Template
int rdma_post_ud_send(struct rdma_cm_id *id, void *context, void *addr, size_t length, struct ibv_mr *mr, int flags, struct ibv_ah *ah, uint32_t remote_qpn)

Input Parameters
id A reference to the communication identifier where the request will be posted
context A user-defined context associated with the request
addr The address of the memory buffer to post
length The length of the memory buffer
mr Optional registered memory region associated with the posted buffer
flags Optional flags used to control the send operation
ah An address handle describing the address of the remote node
remote_qpn The destination node's queue pair number

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_post_ud_send posts a work request to the send queue of the queue pair associated with the rdma_cm_id, id. The contents of the posted buffer will be sent to the specified destination queue pair, remote_qpn.

The user is responsible for ensuring that the destination queue pair has queued a receive request before issuing the send operations. Unless the send request is using inline data, the message buffer must have been registered before being posted, with the mr parameter referencing the registration. The buffer must remain registered until the send completes.
31. /* Now the client performs an RDMA read and then write on server.
       Note that the server has no idea these events have occurred */
    if (config.server_name)
    {
        /* First we read contents of server's buffer */
        if (post_send(&res, IBV_WR_RDMA_READ))
        {
            fprintf(stderr, "failed to post SR 2\n");
            rc = 1;
            goto main_exit;
        }
        if (poll_completion(&res))
        {
            fprintf(stderr, "poll completion failed 2\n");
            rc = 1;
            goto main_exit;
        }
        fprintf(stdout, "Contents of server's buffer: '%s'\n", res.buf);
        /* Now we replace what's in the server's buffer */
        strcpy(res.buf, RDMAMSGW);
        fprintf(stdout, "Now replacing it with: '%s'\n", res.buf);
        if (post_send(&res, IBV_WR_RDMA_WRITE))
        {
            fprintf(stderr, "failed to post SR 3\n");
            rc = 1;
            goto main_exit;
        }
        if (poll_completion(&res))
        {
            fprintf(stderr, "poll completion failed 3\n");
            rc = 1;
            goto main_exit;
        }
    }
    /* Sync so server will know that client is done mucking with its memory */
    if (sock_sync_data(res.sock, 1, "W", &temp_char)) /* just send a dummy char back and forth */
    {
        fprintf(stderr, "sync error after RDMA ops\n");
        rc = 1;
        goto main_exit;
    }
    if (!config.server_name)
        fprintf(stdout, "Contents of server buffer: '%s'\n", res.buf);
    rc = 0;
main_exit:
    if (resources_destroy(&res))
    {
        fprintf(stderr, "failed to destroy resources\n");
        rc = 1;
    }
    if (config.dev_name)
        free((char
32. Table 2 - Glossary (Sheet 4 of 4)

Term: Description
Verbs: An abstract description of the functionality of a network adapter. Using the verbs, any application can create / manage objects that are needed in order to use RDMA for data transfer.
VPI (Virtual Protocol Interface): Allows the user to change the layer 2 protocol of the port.
WQ (Work Queue): One of Send Queue or Receive Queue.
WQE (Work Queue Element): A WQE, pronounced "wookie", is an element in a work queue.
WR (Work Request): A request which was posted by a user to a work queue.

1 RDMA Architecture Overview

1.1 InfiniBand
InfiniBand (IB) is a high-speed, low latency, low CPU overhead, highly efficient and scalable server and storage interconnect technology. One of the key capabilities of InfiniBand is its support for native Remote Direct Memory Access (RDMA). InfiniBand enables data transfer between servers and between server and storage without the involvement of the host CPU in the data path. InfiniBand uses I/O channels for data communication (up to 16 million per host), where each channel provides the semantics of a virtualized NIC or HCA (security, isolations, etc). InfiniBand provides various technology or solution speeds ranging from 10Gb/s (SDR) up to 56Gb/s (FDR) per port, using copper and optical fiber connections. Infin
33. The failure to complete the operation may be due to QP related errors which prevent the responder from completing the request, or a malformed WQE on the Receive Queue.

7.2.13 IBV_WC_RETRY_EXC_ERR
This event is generated when a sender is unable to receive feedback from the receiver. This means that either the receiver never ACKs sender messages within a specified time period, or it has been disconnected, or it is in a bad state which prevents it from responding.

7.2.14 IBV_WC_RNR_RETRY_EXC_ERR
This event is generated when the RNR NAK retry count is exceeded. This may be caused by lack of receive buffers on the responder side.

7.2.15 IBV_WC_LOC_RDD_VIOL_ERR
This event is generated when the RDD associated with the QP does not match the RDD associated with the EEC.

7.2.16 IBV_WC_REM_INV_RD_REQ_ERR
This event is generated when the responder detects an invalid incoming RD message. The message may be invalid because it has an invalid Q_Key or there may be a Reliable Datagram Domain (RDD) violation.

7.2.17 IBV_WC_REM_ABORT_ERR
This event is generated when an error occurs on the responder side which causes it to abort the operation.

7.2.18 IBV_WC_INV_EECN_ERR
This event is generated when an invalid End to End Context Number (EECN) is detected.

7.2.19 IBV_WC_INV_EEC_STATE_ERR
This event is generated when an illegal operation is detected in a request for the specified EEC state.

7.2.2
34. These errors are rare but may occur when there are problems in the subnet or when an RDMA device sends illegal packets. When this happens, the QP is automatically transitioned to the IBV_QPS_ERR state by the RDMA device. The user must modify the states of any such QPs from the error state to the Reset state for recovery. This event applies only to RC QPs.

7.1.4 IBV_EVENT_QP_ACCESS_ERR
This event is generated when the transport layer of the RDMA device detects a request error violation on the responder side. The error may be caused by:
- Misaligned atomic request
- Too many RDMA Read or Atomic requests
- R_Key violation
- Length errors without immediate data
These errors usually occur because of bugs in the user code. When this happens, the QP is automatically transitioned to the IBV_QPS_ERR state by the RDMA device. The user must modify the QP state to Reset for recovery. This event is relevant only to RC QPs.

7.1.5 IBV_EVENT_COMM_EST
This event is generated when communication is established on a given QP. This event implies that a QP whose state is IBV_QPS_RTR has received the first packet in its Receive Queue and the packet was processed without error. This event is relevant only to connection oriented QPs (RC and UC QPs). It may be generated for UD QPs as well, but that is driver implementation specific.

7.1.6 IBV_EVENT_SQ_DRAINED
This event is generated when all outstanding messages
35. _PS_IPOIB);
            exit(1);
        }
    }

    test.dst_addr = (struct sockaddr *) &test.dst_in;
    test.connects_left = connections;

    test.channel = rdma_create_event_channel();
    if (!test.channel) {
        printf("failed to create event channel\n");
        exit(1);
    }

    if (alloc_nodes())
        exit(1);

    ret = run();

    printf("test complete\n");
    destroy_nodes();
    rdma_destroy_event_channel(test.channel);

    printf("return status %d\n", ret);
    return ret;
}

9 Programming Examples Using RDMA Verbs
This chapter provides code examples using the RDMA Verbs.

9.1 Automatic Path Migration (APM)

/*
 * Copyright (c) 2012 Software Forge Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses. You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main directory of this source tree, or the
 * OpenIB.org BSD license below:
 *
 * Redistribution and use in source and binary forms, with or
 * without modification, are permitted provided that the following
 * conditions are met:
 *
 * - Redistributions of source code must retain the above
 *   copyright notice, this list of conditions and the following
 *   disclaimer.
 *
 * - Redistributions in binary form must reproduce the above
 *   copyright notice, this list of condit
36. ******************************************************************************/
static int poll_completion(struct resources *res)
{
    struct ibv_wc wc;
    unsigned long start_time_msec;
    unsigned long cur_time_msec;
    struct timeval cur_time;
    int poll_result;
    int rc = 0;

    /* poll the completion for a while before giving up of doing it .. */
    gettimeofday(&cur_time, NULL);
    start_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
    do
    {
        poll_result = ibv_poll_cq(res->cq, 1, &wc);
        gettimeofday(&cur_time, NULL);
        cur_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
    } while ((poll_result == 0) && ((cur_time_msec - start_time_msec) < MAX_POLL_CQ_TIMEOUT));

    if (poll_result < 0)
    {
        /* poll CQ failed */
        fprintf(stderr, "poll CQ failed\n");
        rc = 1;
    }
    else if (poll_result == 0)
    {
        /* the CQ is empty */
        fprintf(stderr, "completion wasn't found in the CQ after timeout\n");
        rc = 1;
    }
    else
    {
        /* CQE found */
        fprintf(stdout, "completion was found in CQ with status 0x%x\n", wc.status);
        /* check the completion status (here we don't care about the completion opcode) */
        if (wc.status != IBV_WC_SUCCESS)
        {
            fprintf(stderr, "got bad completion with status: 0x%x, vendor syndrome: 0x%x\n", wc.status, wc.vendor_err);
            rc = 1;
        }
    }
    return rc;
}

/******************************************************************************
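The timeout logic in poll_completion can be studied without RDMA hardware by replacing the ibv_poll_cq call with a callback. The sketch below is only an illustration under that assumption: poll_fn, now_msec, poll_with_timeout, the three sample callbacks, and the distinct return codes are hypothetical names and conventions, not part of libibverbs or of the manual's example (which returns 1 for both failure cases).

```c
#include <stddef.h>
#include <sys/time.h>

/* Hypothetical callback type standing in for a call such as ibv_poll_cq():
 * returns <0 on error, 0 if nothing is ready yet, >0 if a completion exists. */
typedef int (*poll_fn)(void *arg);

/* Current wall-clock time in milliseconds, computed as in the example. */
unsigned long now_msec(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (unsigned long)tv.tv_sec * 1000UL + (unsigned long)tv.tv_usec / 1000UL;
}

/* Busy-poll until fn() reports something other than "empty" or until
 * timeout_msec has elapsed. Returns 0 when a completion was found,
 * 1 when the poll itself failed, 2 when the timeout expired. */
int poll_with_timeout(poll_fn fn, void *arg, unsigned long timeout_msec)
{
    unsigned long start = now_msec();
    int result;

    do {
        result = fn(arg);
    } while (result == 0 && (now_msec() - start) < timeout_msec);

    if (result < 0)
        return 1;
    if (result == 0)
        return 2;
    return 0;
}

/* Sample callbacks for experimentation (hypothetical). */
int ready_on_third_call(void *arg)
{
    int *calls = arg;

    *calls += 1;
    return (*calls >= 3) ? 1 : 0;
}

int never_ready(void *arg)
{
    (void)arg;
    return 0;
}

int always_fails(void *arg)
{
    (void)arg;
    return -1;
}
```

Separating the timeout result from the poll-error result is a design choice of this sketch; it lets a caller decide whether to retry, which the single rc = 1 in the example does not allow.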
37. being allocated, the QP will be ready to handle posting of receives. If the QP is unconnected, it will be ready to post sends.

See Also
rdma_bind_addr, rdma_resolve_addr, rdma_destroy_qp, ibv_create_qp, ibv_modify_qp

5.2.26 rdma_destroy_qp

Template
void rdma_destroy_qp(struct rdma_cm_id *id)

Input Parameters
id RDMA identifier

Output Parameters
none

Return Value
none

Description
rdma_destroy_qp destroys a QP allocated on the rdma_cm_id.

Notes
Users must destroy any QP associated with an rdma_cm_id before destroying the ID.

See Also
rdma_create_qp, rdma_destroy_id, ibv_destroy_qp

5.2.27 rdma_join_multicast

Template
int rdma_join_multicast(struct rdma_cm_id *id, struct sockaddr *addr, void *context)

Input Parameters
id Communication identifier associated with the request
addr Multicast address identifying the group to join
context User-defined context associated with the join request

Output Parameters
none

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_join_multicast joins a multicast group and attaches an associated QP to the group.

Notes
Before joining a multicast group, the rdma_cm_id must be bound to an RDMA device by calling rdma_bind
38. creating address handle\n");
        goto err;
    }

    node->connected = 1;
    test.connects_left--;
    return 0;
err:
    connect_error();
    return 1;
}

static int cma_handler(struct rdma_cm_id *cma_id, struct rdma_cm_event *event)
{
    int ret = 0;

    switch (event->event) {
    case RDMA_CM_EVENT_ADDR_RESOLVED:
        ret = addr_handler(cma_id->context);
        break;
    case RDMA_CM_EVENT_MULTICAST_JOIN:
        ret = join_handler(cma_id->context, &event->param.ud);
        break;
    case RDMA_CM_EVENT_ADDR_ERROR:
    case RDMA_CM_EVENT_ROUTE_ERROR:
    case RDMA_CM_EVENT_MULTICAST_ERROR:
        printf("mckey: event: %s, error: %d\n", rdma_event_str(event->event), event->status);
        connect_error();
        ret = event->status;
        break;
    case RDMA_CM_EVENT_DEVICE_REMOVAL:
        /* Cleanup will occur after test completes. */
        break;
    default:
        break;
    }
    return ret;
}

static void destroy_node(struct cmatest_node *node)
{
    if (!node->cma_id)
        return;

    if (node->ah)
        ibv_destroy_ah(node->ah);

    if (node->cma_id->qp)
        rdma_destroy_qp(node->cma_id);

    if (node->cq)
        ibv_destroy_cq(node->cq);

    if (node->mem) {
        ibv_dereg_mr(node->mr);
        free(node->mem);
    }

    if (node->pd)
        ibv_dealloc_pd(node->pd);

    /* Destroy the RDMA ID after all device resources */
    rdma_destroy_id(node->cma_id);
}

static int alloc_nodes(void)
{
    int ret, i;

    test.nodes
39. ctx->cq)
        ibv_destroy_cq(ctx->cq);

    if (ctx->mr)
        rdma_dereg_mr(ctx->mr);

    if (ctx->buf)
        free(ctx->buf);

    if (ctx->pd && ctx->id->pd == NULL)
        ibv_dealloc_pd(ctx->pd);

    rdma_destroy_id(ctx->id);
}

/*
 * Function: post_send
 *
 * Input:
 *      ctx The context structure
 *
 * Output:
 *      none
 *
 * Returns:
 *      0 on success, non-zero on failure
 *
 * Description:
 *      Posts a UD send to the multicast address
 */
int post_send(struct context *ctx)
{
    int ret;
    struct ibv_send_wr wr, *bad_wr;
    struct ibv_sge sge;

    memset(ctx->buf, 0x12, ctx->msg_length); /* set the data to non-zero */

    sge.length = ctx->msg_length;
    sge.lkey = ctx->mr->lkey;
    sge.addr = (uint64_t) ctx->buf;

    /* Multicast requires that the message is sent with immediate data
     * and that the QP number is the contents of the immediate data */
    wr.next = NULL;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.opcode = IBV_WR_SEND_WITH_IMM;
    wr.send_flags = IBV_SEND_SIGNALED;
    wr.wr_id = 0;
    wr.imm_data = htonl(ctx->id->qp->qp_num);
    wr.wr.ud.ah = ctx->ah;
    wr.wr.ud.remote_qpn = ctx->remote_qpn;
    wr.wr.ud.remote_qkey = ctx->remote_qkey;

    ret = ibv_post_send(ctx->id->qp, &wr, &bad_wr);
    if (ret) {
        VERB_ERR("ibv_post_send", ret);
        return -1;
    }

    return 0;
}

/*
 * Function: get_completion
 *
 * In
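The post_send listing stores the sender's QP number in the immediate-data field through htonl, so the value travels in network byte order. As a small, hedged illustration of that byte-order handling only (the two function names are hypothetical and no verbs calls are involved), the round trip can be sketched as:

```c
#include <arpa/inet.h>
#include <stdint.h>

/* Hypothetical sender-side helper: convert a QP number to network byte
 * order before it is placed in the immediate-data field. */
uint32_t encode_imm_qpn(uint32_t qp_num)
{
    return htonl(qp_num);
}

/* Hypothetical receiver-side helper: convert immediate data taken from a
 * work completion back to host byte order. */
uint32_t decode_imm_qpn(uint32_t imm_data)
{
    return ntohl(imm_data);
}
```

On a little-endian host the encoded value differs from the original integer, so a receiver that compares immediate data against a QP number without calling ntohl would see a mismatch.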
40. ctx->recv_mr) {
        VERB_ERR("rdma_reg_msgs", -1);
        return -1;
    }

    ctx->send_mr = rdma_reg_msgs(ctx->srq_id, ctx->send_buf, ctx->msg_length);
    if (!ctx->send_mr) {
        VERB_ERR("rdma_reg_msgs", -1);
        return -1;
    }

    /* Create our shared receive queue */
    struct ibv_srq_init_attr srq_attr;
    memset(&srq_attr, 0, sizeof (srq_attr));
    srq_attr.attr.max_wr = ctx->max_wr;
    srq_attr.attr.max_sge = 1;

    ret = rdma_create_srq(ctx->srq_id, NULL, &srq_attr);
    if (ret) {
        VERB_ERR("rdma_create_srq", ret);
        return -1;
    }

    /* Save the SRQ in our context so we can assign it to other QPs later */
    ctx->srq = ctx->srq_id->srq;

    /* Post our receive buffers on the SRQ */
    for (i = 0; i < ctx->max_wr; i++) {
        ret = rdma_post_recv(ctx->srq_id, NULL, ctx->recv_buf, ctx->msg_length, ctx->recv_mr);
        if (ret) {
            VERB_ERR("rdma_post_recv", ret);
            return ret;
        }
    }

    /* Create a completion channel to use with the SRQ CQ */
    ctx->srq_cq_channel = ibv_create_comp_channel(ctx->srq_id->verbs);
    if (!ctx->srq_cq_channel) {
        VERB_ERR("ibv_create_comp_channel", -1);
        return -1;
    }

    /* Create a CQ to use for all connections (QPs) that use the SRQ */
    ctx->srq_cq = ibv_create_cq(ctx->srq_id->verbs, ctx->max_wr, NULL, ctx->srq_cq_channel, 0);
    if (!ctx->srq_cq) {
        VERB_ERR("ibv_create_cq", -1);
        return -1;
    }
41. data
    conn_param.responder_resources = 2;
    conn_param.initiator_depth = 2;
    conn_param.retry_count = 5;
    conn_param.rnr_retry_count = 5;

    if (ctx->server) {
        printf("rdma_accept\n");
        ret = rdma_accept(ctx->id, &conn_param);
        if (ret) {
            VERB_ERR("rdma_accept", ret);
            return ret;
        }
    }
    else {
        printf("rdma_connect\n");
        ret = rdma_connect(ctx->id, &conn_param);
        if (ret) {
            VERB_ERR("rdma_connect", ret);
            return ret;
        }

        if (ctx->id->event->event != RDMA_CM_EVENT_ESTABLISHED) {
            printf("unexpected event: %s", rdma_event_str(ctx->id->event->event));
            return -1;
        }

        /* If the alternate path info was not set on the command line, get
         * it from the private data */
        if (ctx->alt_dlid == 0 && ctx->alt_srcport == 0) {
            ret = get_alt_dlid_from_private_data(ctx->id->event, &ctx->alt_dlid);
            if (ret) {
                return ret;
            }
        }
    }

    return 0;
}

/*
 * Function: send_msg
 *
 * Input:
 *      ctx The context object
 *
 * Output:
 *      none
 *
 * Returns:
 *      0 on success, non-zero on failure
 *
 * Description:
 *      Performs a Send and gets the completion
 */
int send_msg(struct context *ctx)
{
    int ret;
    struct ibv_wc wc;

    ret = rdma_post_send(ctx->id, NULL, ctx->send_buf, ctx->msg_length, ctx->send_mr, IBV_SEND_SIGNALED);
    if (ret) {
        VERB_ERR("rdma_send_recv", ret);
        return ret;
    }

    ret
42. for sends and receive completions.

6.2.11 rdma_get_recv_comp

Template
int rdma_get_recv_comp(struct rdma_cm_id *id, struct ibv_wc *wc)

Input Parameters
id A reference to the communication identifier to check for completions
wc A reference to a work completion structure to fill in

Output Parameters
wc A reference to a work completion structure. The structure will contain information about the completed request when the routine returns.

Return Value
A non-negative value equal to the number of completions found on success, or errno on failure.

Description
rdma_get_recv_comp retrieves a completed work request for a receive operation. Information about the completed request is returned through the ibv_wc wc parameter, with the wr_id set to the context of the request. Please see ibv_poll_cq for details on the work completion structure, ibv_wc.

Please note that this call polls the receive completion queue associated with the rdma_cm_id, id. If a completion is not found, the call blocks until a request completes. This means, therefore, that the call should only be used on rdma_cm_ids which do not share CQs with other rdma_cm_ids, and maintain separate CQs for sends and receive completions.

7 Events
Th
43. get_addr(char *dst, struct sockaddr *addr)
{
    struct addrinfo *res;
    int ret;

    ret = getaddrinfo(dst, NULL, NULL, &res);
    if (ret) {
        printf("getaddrinfo failed - invalid hostname or IP address\n");
        return ret;
    }

    memcpy(addr, res->ai_addr, res->ai_addrlen);
    freeaddrinfo(res);
    return ret;
}

static int run(void)
{
    int i, ret;

    printf("mckey: starting %s\n", is_sender ? "client" : "server");
    if (src_addr) {
        ret = get_addr(src_addr, (struct sockaddr *) &test.src_in);
        if (ret)
            return ret;
    }

    ret = get_addr(dst_addr, (struct sockaddr *) &test.dst_in);
    if (ret)
        return ret;

    printf("mckey: joining\n");
    for (i = 0; i < connections; i++) {
        if (src_addr) {
            ret = rdma_bind_addr(test.nodes[i].cma_id, test.src_addr);
            if (ret) {
                printf("mckey: addr bind failure: %d\n", ret);
                connect_error();
                return ret;
            }
        }

        if (unmapped_addr)
            ret = addr_handler(&test.nodes[i]);
        else
            ret = rdma_resolve_addr(test.nodes[i].cma_id, test.src_addr, test.dst_addr, 2000);
        if (ret) {
            printf("mckey: resolve addr failure: %d\n", ret);
            connect_error();
            return ret;
        }
    }

    ret = connect_events();
    if (ret)
        goto out;

    /*
     * Pause to give SM chance to configure switches. We don't want to
     * handle reliability issue in this simple test program.
     */
    sleep(3);

    if (message_count) {
        if (is_sender) {
            printf("initiating data transfers\n");
            for (i = 0; i
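The get_addr helper above copies the first getaddrinfo result into caller-provided storage without checking that it fits. A minimal sketch of a more defensive variant follows; resolve_addr, its parameters, and the use of sockaddr_storage are assumptions for illustration and are not the manual's code. It also passes explicit zeroed hints rather than NULL so that platform default flags do not affect the lookup.

```c
#define _POSIX_C_SOURCE 200112L
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical variant of get_addr(): resolve a host name or numeric address
 * into caller-provided storage. Returns 0 on success, a getaddrinfo() error
 * code on lookup failure, or EAI_OVERFLOW if the result does not fit. */
int resolve_addr(const char *dst, struct sockaddr_storage *out, socklen_t *out_len)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;
    int ret;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;    /* accept IPv4 or IPv6 */
    hints.ai_socktype = SOCK_DGRAM; /* one result per address */

    ret = getaddrinfo(dst, NULL, &hints, &res);
    if (ret != 0)
        return ret;

    if (res->ai_addrlen > sizeof(*out)) {
        freeaddrinfo(res);
        return EAI_OVERFLOW;
    }

    memcpy(out, res->ai_addr, res->ai_addrlen);
    *out_len = res->ai_addrlen;
    freeaddrinfo(res);
    return 0;
}
```

Failures can be turned into readable text with gai_strerror, which is more specific than the fixed "invalid hostname or IP address" message printed by the example.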
44. have been drained from the Send Queue (SQ) of a QP whose state has now changed from IBV_QPS_RTS to IBV_QPS_SQD. For RC QPs, this means that all the messages received acknowledgements as appropriate.

Generally, this event will be generated when the internal QP state changes from SQD.draining to SQD.drained. The event may also be generated if the transition to the state IBV_QPS_SQD is aborted because of a transition, either by the RDMA device or by the user, into the IBV_QPS_SQE, IBV_QPS_ERR or IBV_QPS_RESET QP states.

After this event, and after ensuring that the QP is in the IBV_QPS_SQD state, it is safe for the user to start modifying the Send Queue attributes since there are no longer any send messages in progress. Thus it is now safe to modify the operational characteristics of the QP and transition it back to the fully operational RTS state.

7.1.7 IBV_EVENT_PATH_MIG
This event is generated when a connection successfully migrates to an alternate path. The event is relevant only for connection oriented QPs, that is, it is relevant only for RC and UC QPs.

When this event is generated, it means that the alternate path attributes are now in use as the primary path attributes. If it is necessary to load attributes for another alternate path, the user may do that after this event is generated.

7.1.8 IBV_EVENT_PATH_MIG_ERR
This event is generated when an error occurs which prevents a QP which has alternate path attributes loaded from p
45. hints, 0, sizeof (hints));

    ctx.server = 0;
    ctx.server_port = DEFAULT_PORT;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;
    ctx.qp_count = DEFAULT_QP_COUNT;
    ctx.max_wr = DEFAULT_MAX_WR;

    /* Read options from command line */
    while ((op = getopt(argc, argv, "sa:p:c:l:q:w:")) != -1) {
        switch (op) {
        case 's':
            ctx.server = 1;
            break;
        case 'a':
            ctx.server_name = optarg;
            break;
        case 'p':
            ctx.server_port = optarg;
            break;
        case 'c':
            ctx.msg_count = atoi(optarg);
            break;
        case 'l':
            ctx.msg_length = atoi(optarg);
            break;
        case 'q':
            ctx.qp_count = atoi(optarg);
            break;
        case 'w':
            ctx.max_wr = atoi(optarg);
            break;
        default:
            printf("usage: %s -a server_address\n", argv[0]);
            printf("\t[-s server mode]\n");
            printf("\t[-p port_number]\n");
            printf("\t[-c msg_count]\n");
            printf("\t[-l msg_length]\n");
            printf("\t[-q qp_count]\n");
            printf("\t[-w max_wr]\n");
            exit(1);
        }
    }

    if (ctx.server_name == NULL) {
        printf("server address required (use -a)!\n");
        exit(1);
    }

    hints.ai_port_space = RDMA_PS_TCP;
    if (ctx.server == 1)
        hints.ai_flags = RAI_PASSIVE; /* this makes it a server */

    ret = rdma_getaddrinfo(ctx.server_name, ctx.server_port, &hints, &rai);
    if (ret) {
        VERB_ERR("rdma_getaddrinfo", ret);
        exit(1);
    }

    /* allocate memory for our QPs and send/recv buffers */
    ctx.conn_id = (struct rdma_cm_id **) calloc(ctx.qp_count, sizeo
46. if (ctx.server) {
            if (recv_msg(&ctx))
                break;

            printf("recv: %d\n", ++recv_cnt);
        }

        if (ctx.msec_delay > 0)
            usleep(ctx.msec_delay * 1000);

        if (send_msg(&ctx))
            break;

        printf("send: %d\n", ++send_cnt);

        if (!ctx.server) {
            if (recv_msg(&ctx))
                break;

            printf("recv: %d\n", ++recv_cnt);
        }

        /* migrate the path manually if desired after the specified number of
         * sends */
        if (!ctx.server && i == ctx.migrate_after) {
            qp_attr.path_mig_state = IBV_MIG_MIGRATED;
            ret = ibv_modify_qp(ctx.id->qp, &qp_attr, IBV_QP_PATH_MIG_STATE);
            if (ret) {
                VERB_ERR("ibv_modify_qp", ret);
                goto out;
            }
        }
    }

    rdma_disconnect(ctx.id);

out:
    if (ctx.send_mr)
        rdma_dereg_mr(ctx.send_mr);

    if (ctx.recv_mr)
        rdma_dereg_mr(ctx.recv_mr);

    if (ctx.id)
        rdma_destroy_ep(ctx.id);

    if (ctx.listen_id)
        rdma_destroy_ep(ctx.listen_id);

    if (ctx.send_buf)
        free(ctx.send_buf);

    if (ctx.recv_buf)
        free(ctx.recv_buf);

    return ret;
}

9.2 Multicast Code Example Using RDMA CM

/*
 * Copyright (c) 2012 Software Forge Inc. All rights reserved.
 *
 * This software is available to you under a choice of one of two
 * licenses. You may choose to be licensed under the terms of the GNU
 * General Public License (GPL) Version 2, available from the file
 * COPYING in the main direc
47. in response to rdma_resolve_route. It is generated when the system is able to resolve the server address supplied by the client.

7.3.4 RDMA_CM_EVENT_ROUTE_ERROR
This event is generated when rdma_resolve_route fails.

7.3.5 RDMA_CM_EVENT_CONNECT_REQUEST
This is generated on the passive side of the connection to notify the user of a new connection request. It indicates that a connection request has been received.

7.3.6 RDMA_CM_EVENT_CONNECT_RESPONSE
This event may be generated on the active side of the connection to notify the user that the connection request has been successful. The event is only generated on rdma_cm_ids which do not have a QP associated with them.

7.3.7 RDMA_CM_EVENT_CONNECT_ERROR
This event may be generated on the active or passive side of the connection. It is generated when an error occurs while attempting to establish a connection.

7.3.8 RDMA_CM_EVENT_UNREACHABLE
This event is generated on the active side of a connection. It indicates that the remote server is unreachable or unable to respond to a connection request.

7.3.9 RDMA_CM_EVENT_REJECTED
This event may be generated on the client (active) side and indicates that a connection request or response has been rejected by the remote device. This may happen, for example, if an attempt is made to connect with the remote end point on the wrong port.

7.3.10 RDMA_CM_EVENT_ESTABLISHED
This event is generate
48. it is suggested for clients to check that the GID indexes used by the client's QPs are not changed as a result of this event. If a user caches the values of the P_Key table, then these must be flushed when the IBV_EVENT_PKEY_CHANGE event is received.

7.1.14 IBV_EVENT_SM_CHANGE
This event is generated when the SM being used at a given port changes. The user application must re-register with the new SM. This means that all subscriptions previously registered from the given port, such as one to join a multicast group, must be reregistered.

7.1.15 IBV_EVENT_SRQ_ERR
This event is generated when an error occurs on a Shared Receive Queue (SRQ) which prevents the RDMA device from dequeuing WRs from the SRQ and reporting receive completions.

When an SRQ experiences this error, all the QPs associated with this SRQ will be transitioned to the IBV_QPS_ERR state and the IBV_EVENT_QP_FATAL asynchronous event will be generated for them. Any QPs which have transitioned to the error state must have their state modified to Reset for recovery.

7.1.16 IBV_EVENT_SRQ_LIMIT_REACHED
This event is generated when the limit for the SRQ resources is reached. This means that the number of SRQ Work Requests (WRs) is less than the SRQ limit. This event may be used by the user as an indicator that more WRs need to be posted to the SRQ and that it should be rearmed.

7.1.17 IBV_EVENT_QP_LAST_WQE_REACHED
This event is generated when a QP which is associated with an SRQ is transit
49. list

5.2.6 rdma_destroy_ep

Template
int rdma_destroy_ep(struct rdma_cm_id *id)

Input Parameters
id The communication identifier to destroy

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_destroy_ep destroys the specified rdma_cm_id and all associated resources, including QPs associated with the id.

5.2.7 rdma_resolve_addr

Template
int rdma_resolve_addr(struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr *dst_addr, int timeout_ms)

Input Parameters
id RDMA identifier
src_addr Source address information. This parameter may be NULL.
dst_addr Destination address information
timeout_ms Time to wait for resolution to complete

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_resolve_addr resolves destination and optional source addresses from IP addresses to an RDMA address. If successful, the specified rdma_cm_id will be bound to a local device.

Notes
This call is used to map a given destination IP address to a usable RDMA address. The IP to RDMA address mapping is done using the local routing tables, or via ARP. If
50. < connections; i++) {
                ret = post_sends(&test.nodes[i], 0);
                if (ret)
                    goto out;
            }
        } else {
            printf("receiving data transfers\n");
            ret = poll_cqs();
            if (ret)
                goto out;
        }
        printf("data transfers complete\n");
    }
out:
    for (i = 0; i < connections; i++) {
        ret = rdma_leave_multicast(test.nodes[i].cma_id, test.dst_addr);
        if (ret)
            printf("mckey: failure leaving: %d\n", ret);
    }
    return ret;
}

int main(int argc, char **argv)
{
    int op, ret;

    while ((op = getopt(argc, argv, "m:M:sb:c:C:S:p:")) != -1) {
        switch (op) {
        case 'm':
            dst_addr = optarg;
            break;
        case 'M':
            unmapped_addr = 1;
            dst_addr = optarg;
            break;
        case 's':
            is_sender = 1;
            break;
        case 'b':
            src_addr = optarg;
            test.src_addr = (struct sockaddr *) &test.src_in;
            break;
        case 'c':
            connections = atoi(optarg);
            break;
        case 'C':
            message_count = atoi(optarg);
            break;
        case 'S':
            message_size = atoi(optarg);
            break;
        case 'p':
            port_space = strtol(optarg, NULL, 0);
            break;
        default:
            printf("usage: %s\n", argv[0]);
            printf("\t-m multicast_address\n");
            printf("\t[-M unmapped_multicast_address]\n"
                   "\t replaces -m and requires -b\n");
            printf("\t[-s(ender)]\n");
            printf("\t[-b bind_address]\n");
            printf("\t[-c connections]\n");
            printf("\t[-C message_count]\n");
            printf("\t[-S message_size]\n");
            printf("\t[-p port_space - %#x for UDP (default), %#x for IPOIB]\n", RDMA_PS_UDP, RDMA
51. /******************************************************************************
 * Function: modify_qp_to_rts
 *
 * Input
 *   qp    QP to transition
 *
 * Output
 *   none
 *
 * Returns
 *   0 on success, ibv_modify_qp failure code on failure
 *
 * Description
 *   Transition a QP from the RTR to RTS state
 ******************************************************************************/
static int modify_qp_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTS;
    attr.timeout = 0x12;
    attr.retry_cnt = 6;
    attr.rnr_retry = 0;
    attr.sq_psn = 0;
    attr.max_rd_atomic = 1;
    flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
            IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
    rc = ibv_modify_qp(qp, &attr, flags);
    if (rc)
        fprintf(stderr, "failed to modify QP state to RTS\n");
    return rc;
}

/******************************************************************************
 * Function: connect_qp
 *
 * Input
 *   res    pointer to resources structure
 *
 * Output
 *   none
 *
 * Returns
 *   0 on success, error code on failure
 *
 * Description
 *   Connect the QP
52. opcode) {
        case IBV_WR_SEND:
            fprintf(stdout, "Send Request was posted\n");
            break;
        case IBV_WR_RDMA_READ:
            fprintf(stdout, "RDMA Read Request was posted\n");
            break;
        case IBV_WR_RDMA_WRITE:
            fprintf(stdout, "RDMA Write Request was posted\n");
            break;
        default:
            fprintf(stdout, "Unknown Request was posted\n");
            break;
        }
    }
    return rc;
}

/******************************************************************************
 * Function: post_receive
 *
 * Input
 *   res    pointer to resources structure
 *
 * Output
 *   none
 *
 * Returns
 *   0 on success, error code on failure
 *
 * Description
 *
 ******************************************************************************/
static int post_receive(struct resources *res)
{
    struct ibv_recv_wr rr;
    struct ibv_sge sge;
    struct ibv_recv_wr *bad_wr;
    int rc;

    /* prepare the scatter/gather entry */
    memset(&sge, 0, sizeof(sge));
    sge.addr = (uintptr_t)res->buf;
    sge.length = MSG_SIZE;
    sge.lkey = res->mr->lkey;

    /* prepare the receive work request */
    memset(&rr, 0, sizeof(rr));
    rr.next = NULL;
    rr.wr_id = 0;
    rr.sg_list = &sge;
    rr.num_sge = 1;

    /* post the Receive Request to the RQ */
    rc = ibv_post_recv(res->qp, &
53. ret) {
        VERB_ERR("rdma_listen", ret);
        return ret;
    }

    printf("waiting for connection from client...\n");
    for (i = 0; i < ctx->qp_count; i++) {
        ret = rdma_get_request(ctx->listen_id, &ctx->conn_id[i]);
        if (ret) {
            VERB_ERR("rdma_get_request", ret);
            return ret;
        }

        /* Create the queue pair */
        memset(&qp_attr, 0, sizeof(qp_attr));

        qp_attr.qp_context = ctx;
        qp_attr.qp_type = IBV_QPT_RC;
        qp_attr.cap.max_send_wr = ctx->max_wr;
        qp_attr.cap.max_recv_wr = ctx->max_wr;
        qp_attr.cap.max_send_sge = 1;
        qp_attr.cap.max_recv_sge = 1;
        qp_attr.cap.max_inline_data = 0;
        qp_attr.recv_cq = ctx->srq_cq;
        qp_attr.srq = ctx->srq;
        qp_attr.sq_sig_all = 0;

        ret = rdma_create_qp(ctx->conn_id[i], NULL, &qp_attr);
        if (ret) {
            VERB_ERR("rdma_create_qp", ret);
            return ret;
        }

        /* Set the new connection to use our SRQ */
        ctx->conn_id[i]->srq = ctx->srq;

        ret = rdma_accept(ctx->conn_id[i], NULL);
        if (ret) {
            VERB_ERR("rdma_accept", ret);
            return ret;
        }
    }

    while (recv_count < ctx->msg_count) {
        i = 0;
        while (i < ctx->max_wr && recv_count < ctx->msg_count) {
            int ne;

            ret = await_completion(ctx);
            if (ret) {
                printf("await_completion %d\n", ret);
                return ret;
            }

            do {
                ne = ibv_poll_cq(ctx->srq_cq, 1, &wc);
                if (ne < 0) {
                    VERB_ERR("ibv_poll_cq", ne);
                    return ne;
                }
                else if (ne == 0)
                    break;
54. that will respond to commands from the requestor, which may include a request to write to the responder memory or read from the responder memory, and finally a command requesting the responder to receive a message.

rkey: A number that is received upon registration of an MR and is used to enforce permissions on incoming RDMA operations.

RNR (Receiver Not Ready): The flow in an RC QP where there is a connection between the sides but an RR is not present in the Receive side.

RQ (Receive Queue): A Work Queue which holds RRs posted by the user.

RR (Receive Request): A WR which was posted to an RQ and which describes where incoming data using a send opcode is going to be written. Also note that an RDMA Write with immediate will consume an RR.

RTR (Ready To Receive): A QP state in which an RR can be posted and be processed.

RTS (Ready To Send): A QP state in which an SR can be posted and be processed.

SA (Subnet Administrator): The interface for querying and manipulating subnet management data.

SGE (Scatter/Gather Elements): An entry to a pointer to a full or a part of a local registered memory block. The element holds the start address of the block, its size, and the lkey (with its associated permissions).

S/G Array: An array of S/G elements which exists in a WR and which, according to the opcode used, either collects data from multiple buffers and sends them as a single stream, or takes a single stream and breaks it down into numerous buffers.
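The two directions of an S/G array described above (collect several buffers into one stream, or break one stream into several buffers) can be sketched in plain C. This is only an illustration of the idea: `struct sg_elem`, `sg_gather`, and `sg_scatter` are hypothetical names, not part of the verbs API, and a real `struct ibv_sge` also carries the lkey of the registered memory block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one S/G element: start address and size.
 * A real struct ibv_sge additionally holds the lkey. */
struct sg_elem {
    void  *addr;
    size_t length;
};

/* Gather: copy every element, in order, into one contiguous stream.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t sg_gather(const struct sg_elem *list, size_t n, void *out, size_t out_len)
{
    size_t off = 0;

    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (list[i].length > out_len - off)
            return 0;
        memcpy((uint8_t *)out + off, list[i].addr, list[i].length);
        off += list[i].length;
    }
    return off;
}

/* Scatter: split one stream across the elements, in order.
 * Returns the number of bytes consumed from the stream. */
size_t sg_scatter(const void *in, size_t in_len, const struct sg_elem *list, size_t n)
{
    size_t off = 0;

    assert(list != NULL || n == 0);
    for (size_t i = 0; i < n && off < in_len; i++) {
        size_t chunk = in_len - off;
        if (chunk > list[i].length)
            chunk = list[i].length;
        memcpy(list[i].addr, (const uint8_t *)in + off, chunk);
        off += chunk;
    }
    return off;
}
```

On real hardware the HCA performs this copying itself while processing the WR; the sketch only shows the ordering and length bookkeeping.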
55. the dev_cap.device_cap_flags.

7.1.11 IBV_EVENT_PORT_ERR

This event is generated when the link on a given port becomes inactive and is thus unavailable to send/receive packets. The port_attr.state must have been in either the IBV_PORT_ACTIVE or IBV_PORT_ACTIVE_DEFER state and transitions to one of the following states: IBV_PORT_DOWN, IBV_PORT_INIT, IBV_PORT_ARMED. This can happen when there are connectivity problems within the IB fabric, for example when a cable is accidentally pulled. This will not affect the QPs associated with this port, although if this is a reliable connection, the retry count may be exceeded if the link takes a long time to come back up.

7.1.12 IBV_EVENT_LID_CHANGE

The event is generated when the LID on a given port changes. This is done by the SM. If this is not the first time that the SM configures the port LID, it may indicate that there is a new SM on the subnet, or that the SM has reconfigured the subnet. If the user cached the structure returned from ibv_query_port, then these values must be flushed when this event occurs.

7.1.13 IBV_EVENT_PKEY_CHANGE

This event is generated when the P_Key table changes on a given port. The PKEY table is configured by the SM, and this also means that the SM can change it. When that happens, an IBV_EVENT_PKEY_CHANGE event is generated. Since QPs use GID table indexes rather than absolute values (as the source GID
56. to indicate the reason for the failure.

Description
ibv_destroy_ah frees an address handle (AH). Once an AH is destroyed, it can't be used anymore in UD QPs.

4.5 Queue Pair Bringup (ibv_modify_qp)

Queue pairs (QP) must be transitioned through an incremental sequence of states prior to being able to be used for communication.

QP States:
RESET    Newly created, queues empty.
INIT     Basic information set. Ready for posting to receive queue.
RTR      Ready to Receive. Remote address info set for connected QPs; QP may now receive packets.
RTS      Ready to Send. Timeout and retry parameters set; QP may now send packets.

These transitions are accomplished through the use of the ibv_modify_qp command.

4.5.1 ibv_modify_qp

Template
int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask)

Input Parameters
qp           struct ibv_qp from ibv_create_qp
attr         QP attributes
attr_mask    bit mask that defines which attributes within attr have been set for this call

Output Parameters
none

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_modify_qp: this verb changes QP attributes, and one of those attributes may be the QP state. Its name is a bit of a misnomer, since you cannot use this command to modify QP attributes at will. There is a ver
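The incremental bring-up sequence listed above (RESET, INIT, RTR, RTS) can be expressed as a small check. This is a minimal sketch, assuming only those four states: the enum and function names are hypothetical stand-ins, and real code would use enum ibv_qp_state from the verbs header, which also defines states outside the bring-up path that are not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the four bring-up states named above.
 * Real code would use IBV_QPS_RESET, IBV_QPS_INIT, IBV_QPS_RTR, IBV_QPS_RTS. */
enum qp_state { QP_RESET, QP_INIT, QP_RTR, QP_RTS };

/* During bring-up a QP advances exactly one step at a time:
 * RESET -> INIT -> RTR -> RTS. */
bool qp_bringup_step_ok(enum qp_state cur, enum qp_state next)
{
    assert(cur <= QP_RTS && next <= QP_RTS);
    return cur != QP_RTS && next == (enum qp_state)(cur + 1);
}

/* What the QP may do in each state, per the state list above. */
bool qp_can_post_recv(enum qp_state s) { return s >= QP_INIT; }
bool qp_can_receive(enum qp_state s)   { return s >= QP_RTR; }
bool qp_can_send(enum qp_state s)      { return s == QP_RTS; }
```

A helper like this can guard calls to ibv_modify_qp in application code so that a skipped state is caught before the verb is issued.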
57. under ibv_create_srq. If the value of srq_limit in srq_attr is 0, then the SRQ limit reached ('low water mark') event is not (or is no longer) armed. No asynchronous events will be generated until the event is re-armed.

4.6.3 ibv_query_xrc_rcv_qp

Template
int ibv_query_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num, struct ibv_qp_attr *attr, int attr_mask, struct ibv_qp_init_attr *init_attr)

Input Parameters
xrc_domain    The XRC domain associated with this QP
xrc_qp_num    The queue pair number to identify this QP
attr          The ibv_qp_attr struct in which to return the attributes
attr_mask     A mask specifying the minimum list of attributes to retrieve
init_attr     The ibv_qp_init_attr struct to return the initial attributes

Output Parameters
attr          A pointer to the struct containing the QP attributes of interest
init_attr     A pointer to the struct containing initial attributes

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_query_xrc_rcv_qp retrieves the attributes specified in attr_mask for the XRC receive QP with the number xrc_qp_num and domain xrc_domain. It returns them through the pointers attr and init_attr.
The attr_mask specifies a minimal list to retrieve. Some RDMA devices may return extra attributes not requested. Att
58. unit. These packets are transmitted through the IB network and delivered directly into the receiving application's virtual buffer, where they are re-assembled into a complete message. The receiving application is notified once the entire message has been received. Thus neither the sending nor the receiving application is involved until the entire message is delivered into the receiving application's buffer.

1.4 Key Components

These are being presented only in the context of the advantages of deploying IB and RoCE. We do not discuss cables and connectors.

Host Channel Adapter
HCAs provide the point at which an IB end node (for example, a server) connects to an IB network. These are the equivalent of the Ethernet (NIC) card, but they do much more. HCAs provide an address translation mechanism under the control of the operating system, which allows an application to access the HCA directly. The same address translation mechanism is the means by which an HCA accesses memory on behalf of a user level application. The application refers to virtual addresses, while the HCA has the ability to translate these addresses into physical addresses in order to effect the actual message transfer.

Range Extenders
InfiniBand range extension is accomplished by encapsulating the InfiniBand traffic onto the WAN link and extending sufficient buffer credits to ensure full bandwidth across the
59. using a completion channel (CC).

The parameter channel is used to specify a CC. A CQ is merely a queue that does not have a built-in notification mechanism. When using a polling paradigm for CQ processing, a CC is unnecessary. The user simply polls the CQ at regular intervals. If, however, you wish to use a pend paradigm, a CC is required. The CC is the mechanism that allows the user to be notified that a new CQE is on the CQ.

The parameter comp_vector is used to specify the completion vector used to signal completion events. It must be >= 0 and < context->num_comp_vectors.

4.3.8 ibv_resize_cq

Template
int ibv_resize_cq(struct ibv_cq *cq, int cqe)

Input Parameters
cq     CQ to resize
cqe    Minimum number of entries CQ will support

Output Parameters
none

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_resize_cq resizes a completion queue (CQ).
The parameter cqe must be at least the number of outstanding entries on the queue. The actual size of the queue may be larger than the specified value. The CQ may (or may not) contain completions when it is being resized, thus it can be resized during work with the CQ.

4.3.9 ibv_destroy_cq

Template
int ibv_destroy_cq(struct ibv_cq *cq)

Input Para
60. 0 IBV_WC_FATAL_ERR

This event is generated when a fatal transport error occurs. The user may have to restart the RDMA device driver or reboot the server to recover from the error.

7.2.21 IBV_WC_RESP_TIMEOUT_ERR

This event is generated when the responder is unable to respond to a request within the timeout period. It generally indicates that the receiver is not ready to process requests.

7.2.22 IBV_WC_GENERAL_ERR

This event is generated when there is a transport error which cannot be described by the other specific events discussed here.

7.3 RDMA_CM Events

7.3.1 RDMA_CM_EVENT_ADDR_RESOLVED

This event is generated on the client (active) side in response to rdma_resolve_addr(). It is generated when the system is able to resolve the server address supplied by the client.

7.3.2 RDMA_CM_EVENT_ADDR_ERROR

This event is generated on the client (active) side. It is generated in response to rdma_resolve_addr() in the case where an error occurs. This may happen, for example, if the device cannot be found, such as when a user supplies an incorrect device. Specifically, if the remote device has both Ethernet and IB interfaces, and the client side supplies the Ethernet device name instead of the IB device name of the server side, an RDMA_CM_EVENT_ADDR_ERROR will be generated.

7.3.3 RDMA_CM_EVENT_ROUTE_RESOLVED

This event is generated on the client (active) side
61. 00.pdf
• Mellanox WP_2007_IB_Software_and_Protocols.pdf: http://www.mellanox.com/pdf/whitepapers/WP_2007_IB_Software_and_Protocols.pdf

2 Introduction to the Programming User Guide

2.1 Scope

The Mellanox Virtual Protocol Interconnect (VPI) architecture provides a high performance, low latency and reliable means for communication among network adapters and switches supporting both InfiniBand and Ethernet semantics. A VPI adapter or switch can be set to deliver either InfiniBand or Ethernet semantics per port. A dual-port VPI adapter, for example, can be configured to one of the following options:
• An adapter (HCA) with two InfiniBand ports
• A NIC with two Ethernet ports
• An adapter with one InfiniBand port and one Ethernet port at the same time

Similarly, a VPI switch can have InfiniBand-only ports, Ethernet-only ports, or a mix of both InfiniBand and Ethernet ports working at the same time.

Mellanox-based VPI adapters and switches support both the InfiniBand RDMA and the Ethernet RoCE solutions.

The VPI architecture permits direct user mode access to the hardware. Mellanox provides a dynamically loaded library, creating access to the hardware via the verbs API. This document contains verbs and their related inputs, outputs, descriptions, and functionality as exposed through the operating system programming interface.
62. 0000
#define DEFAULT_MSEC_DELAY 500

/*
 * Resources used in the example
 */
struct context
{
    /* User parameters */
    int server;
    char *server_name;
    char *server_port;
    int msg_count;
    int msg_length;
    int msec_delay;
    uint8_t alt_srcport;
    uint16_t alt_dlid;
    uint16_t my_alt_dlid;
    int migrate_after;

    /* Resources */
    struct rdma_cm_id *id;
    struct rdma_cm_id *listen_id;
    struct ibv_mr *send_mr;
    struct ibv_mr *recv_mr;
    char *send_buf;
    char *recv_buf;
    pthread_t async_event_thread;
};

/*
 * Function: async_event_thread
 *
 * Input:
 *   arg    The context object
 *
 * Output:
 *   none
 *
 * Returns:
 *   NULL
 *
 * Description:
 *   Reads any Asynchronous events that occur during the sending of data
 *   and prints out the details of the event. Specifically migration
 *   related events.
 */
static void *async_event_thread(void *arg)
{
    struct ibv_async_event event;
    int ret;

    struct context *ctx = (struct context *) arg;

    while (1) {
        ret = ibv_get_async_event(ctx->id->verbs, &event);
        if (ret) {
            VERB_ERR("ibv_get_async_event", ret);
            break;
        }

        switch (event.event_type) {
        case IBV_EVENT_PATH_MIG:
            printf("QP path migrated\n");
            break;
        case IBV_EVENT_PATH_MIG_ERR:
            printf("QP path migration error\n");
            break;
        default:
            printf("Async Event %d\n", event.event_type);
            break;
63. 5
6.1.2 rdma_reg_read
6.1.3 rdma_reg_write
6.1.4 rdma_dereg_mr
6.1.5 rdma_create_srq
6.1.6 rdma_destroy_srq
6.2 Active Queue Pair Operations
6.2.1 rdma_post_recvv
6.2.2 rdma_post_sendv
6.2.3 rdma_post_readv
6.2.4 rdma_post_writev
6.2.5 rdma_post_recv
6.2.6 rdma_post_send
6.2.7 rdma_post_read
6.2.8 rdma_post_write
6.2.9 rdma_post_ud_send
6.2.10 rdma_get_send_comp
6.2.11 rdma_get_recv_comp
Chapter 7 Events
7.1 IBV Events
7.1.1 IBV_EVENT_CQ_ERR
7.1.2 IBV_EVENT_QP_FATAL
7.1.3 IBV_EVENT_QP_REQ_ERR
7.1.4 IBV_EVENT_QP_ACCESS_ERR
7.1.5 IBV_EVENT_COMM_EST
7.1.6 IBV_EVENT_SQ_DRAINED
64. 5.2.5 rdma_create_ep

Template
int rdma_create_ep(struct rdma_cm_id **id, struct rdma_addrinfo *res, struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr)

Input Parameters
id              A reference where the allocated communication identifier will be returned
res             Address information associated with the rdma_cm_id, returned from rdma_getaddrinfo
pd              Optional protection domain if a QP is associated with the rdma_cm_id
qp_init_attr    Optional initial QP attributes

Output Parameters
id              The communication identifier is returned through this reference

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_create_ep creates an identifier and optional QP used to track communication information.
If qp_init_attr is not NULL, then a QP will be allocated and associated with the rdma_cm_id, id. If a protection domain (PD) is provided, then the QP will be created on that PD. Otherwise the QP will be allocated on a default PD.
The rdma_cm_id will be set to use synchronous operations (connect, listen, and get_request). To use asynchronous operations, the rdma_cm_id must be migrated to a user allocated event channel using rdma_migrate_id.
The rdma_cm_id must be released after use, using rdma_destroy_ep.

struct rdma_addrinfo is defined as follows:

struct rdma_addrinfo {
    int ai_flags;
    int ai_family;
    int ai_qp_type;
    int ai_port_space;
    socklen_t ai_src_len;
    socklen_t
65. Mellanox TECHNOLOGIES

RDMA Aware Networks Programming User Manual
Rev 1.3
www.mellanox.com

NOTE: THIS HARDWARE, SOFTWARE OR TEST SUITE PRODUCT ("PRODUCT(S)") AND ITS RELATED DOCUMENTATION ARE PROVIDED BY MELLANOX TECHNOLOGIES "AS-IS" WITH ALL FAULTS OF ANY KIND AND SOLELY FOR THE PURPOSE OF AIDING THE CUSTOMER IN TESTING APPLICATIONS THAT USE THE PRODUCTS IN DESIGNATED SOLUTIONS. THE CUSTOMER'S MANUFACTURING TEST ENVIRONMENT HAS NOT MET THE STANDARDS SET BY MELLANOX TECHNOLOGIES TO FULLY QUALIFY THE PRODUCT(S) AND/OR THE SYSTEM USING IT. THEREFORE, MELLANOX TECHNOLOGIES CANNOT AND DOES NOT GUARANTEE OR WARRANT THAT THE PRODUCTS WILL OPERATE WITH THE HIGHEST QUALITY. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT ARE DISCLAIMED. IN NO EVENT SHALL MELLANOX BE LIABLE TO CUSTOMER OR ANY THIRD PARTIES FOR ANY DIRECT, INDIRECT, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES OF ANY KIND (INCLUDING, BUT NOT LIMITED TO, PAYMENT FOR PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY FROM THE USE OF THE PRODUCT(S) AND RELATED DOCUMENTATION EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Mellanox TECHNOLOGI
66. *** All QPs ***
qp_access_flags / IBV_QP_ACCESS_FLAGS    access flags (see ibv_reg_mr)
*** Connected QPs only ***
alt_ah_attr / IBV_QP_ALT_PATH            AH with alternate path info filled in
min_rnr_timer / IBV_QP_MIN_RNR_TIMER     minimum RNR NAK timer
*** Unconnected QPs only ***
qkey / IBV_QP_QKEY                       qkey (see ibv_post_send)

Description column of the preceding required attributes, in order:
IBV_QPS_RTS
local ack timeout (recommended value: 14)
retry count (recommended value: 7)
RNR retry count (recommended value: 7)
send queue starting packet sequence number (should match remote QP's rq_psn)
number of outstanding RDMA reads and atomic operations allowed

Effect of transition
Once the QP is transitioned into the RTS state, the QP begins send processing and is fully operational. The user may now post send requests with the ibv_post_send command.

4.6 Active Queue Pair Operations

A QP can be queried starting at the point it was created, and once a queue pair is completely operational, you may query it, be notified of events, and conduct send and receive operations on it. This section describes the operations available to perform these actions.

4.6.1 ibv_query_qp

Template
int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask attr_mask, struct ibv_qp_init_attr *init_attr)

Input Parameters
qp           struct ibv_qp from ibv_create_qp
attr_mask    bitmask o
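The local ack timeout above is an exponent, not a duration. Assuming the usual InfiniBand encoding (a 5-bit field where the wait is 4.096 microseconds times 2 to the power of the field value, and 0 means wait indefinitely), the recommended value 14 works out to roughly 67 ms. The helper below is a hypothetical illustration of that arithmetic, not a verbs call.

```c
#include <assert.h>
#include <stdint.h>

/* Local ACK timeout in nanoseconds for a 5-bit timeout field:
 * 4.096 us * 2^timeout, i.e. 4096 ns shifted left by the field value.
 * A field value of 0 means "wait indefinitely" and is reported as 0 here. */
uint64_t ack_timeout_ns(uint8_t timeout)
{
    assert(timeout < 32);
    if (timeout == 0)
        return 0;
    return UINT64_C(4096) << timeout;
}
```

For example, ack_timeout_ns(14) is 67,108,864 ns (about 67 ms), while a field value of 0x12 gives 1,073,741,824 ns (about 1.07 s).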
67. struct ibv_port_attr port_attr;
    struct rdma_cm_event *event;
    char buf[40];

    memset(&ctx, 0, sizeof(ctx));

    ctx.sender = 0;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;
    ctx.server_port = DEFAULT_PORT;

    /* Read options from command line */
    while ((op = getopt(argc, argv, "shb:m:p:c:l:")) != -1) {
        switch (op) {
        case 's':
            ctx.sender = 1;
            break;
        case 'b':
            ctx.bind_addr = optarg;
            break;
        case 'm':
            ctx.mcast_addr = optarg;
            break;
        case 'p':
            ctx.server_port = optarg;
            break;
        case 'c':
            ctx.msg_count = atoi(optarg);
            break;
        case 'l':
            ctx.msg_length = atoi(optarg);
            break;
        default:
            printf("usage: %s -m mc_address\n", argv[0]);
            printf("\t[-s[ender mode]\n");
            printf("\t[-b bind_address]\n");
            printf("\t[-p port_number]\n");
            printf("\t[-c msg_count]\n");
            printf("\t[-l msg_length]\n");
            exit(1);
        }
    }

    if (ctx.mcast_addr == NULL) {
        printf("multicast address must be specified with -m\n");
        exit(1);
    }

    ctx.channel = rdma_create_event_channel();
    if (ctx.channel == NULL) {
        VERB_ERR("rdma_create_event_channel", -1);
        exit(1);
    }

    ret = rdma_create_id(ctx.channel, &ctx.id, NULL, RDMA_PS_UDP);
    if (ret) {
        VERB_ERR("rdma_create_id", -1);
        exit(1);
    }

    ret = resolve_addr(&ctx);
    if (ret)
        goto out;

    /* Verify that the buffer length is not larger than the MTU */
    ret = ibv_query_port(ctx.id->verbs, ctx.id->port_nu
68. Description
ibv_modify_xrc_rcv_qp modifies the attributes of an XRC receive QP with the number xrc_qp_num which is associated with the attributes in the struct attr according to the mask attr_mask. It then moves the QP through the following transitions: Reset->Init->RTR.

At least the following masks must be set (the user may add optional attributes as needed):

Next State    Required attributes
Init          IBV_QP_STATE, IBV_QP_PKEY_INDEX, IBV_QP_PORT, IBV_QP_ACCESS_FLAGS
RTR           IBV_QP_STATE, IBV_QP_AV, IBV_QP_PATH_MTU, IBV_QP_DEST_QPN, IBV_QP_RQ_PSN, IBV_QP_MAX_DEST_RD_ATOMIC, IBV_QP_MIN_RNR_TIMER

Please note that if any attribute to modify is invalid or if the mask has invalid values, then none of the attributes will be modified, including the QP state.

4.4.13 ibv_reg_xrc_rcv_qp

Template
int ibv_reg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num)

Input Parameters
xrc_domain    The XRC domain associated with the receive QP
xrc_qp_num    The number associated with the created QP to which the user process is to be registered

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_reg_xrc_rcv_qp registers a user process with the XRC receive QP whose number is xrc_qp_num, associated with the XRC doma
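The required-attributes table above lends itself to a pre-flight check: before calling the modify verb, confirm that the caller's mask contains every bit required for the next state. The sketch below uses stand-in flag names (EX_QP_*) with arbitrary bit positions so that it compiles without the verbs headers; real code would use the IBV_QP_* values of enum ibv_qp_attr_mask instead. All names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in bit flags, one per attribute named in the table above.
 * Bit positions are arbitrary for this sketch. */
enum {
    EX_QP_STATE              = 1 << 0,
    EX_QP_PKEY_INDEX         = 1 << 1,
    EX_QP_PORT               = 1 << 2,
    EX_QP_ACCESS_FLAGS       = 1 << 3,
    EX_QP_AV                 = 1 << 4,
    EX_QP_PATH_MTU           = 1 << 5,
    EX_QP_DEST_QPN           = 1 << 6,
    EX_QP_RQ_PSN             = 1 << 7,
    EX_QP_MAX_DEST_RD_ATOMIC = 1 << 8,
    EX_QP_MIN_RNR_TIMER      = 1 << 9
};

enum ex_next_state { EX_NEXT_INIT, EX_NEXT_RTR };

/* Minimum mask for each next state, copied from the table. */
int ex_required_mask(enum ex_next_state next)
{
    assert(next == EX_NEXT_INIT || next == EX_NEXT_RTR);
    if (next == EX_NEXT_INIT)
        return EX_QP_STATE | EX_QP_PKEY_INDEX | EX_QP_PORT | EX_QP_ACCESS_FLAGS;
    return EX_QP_STATE | EX_QP_AV | EX_QP_PATH_MTU | EX_QP_DEST_QPN |
           EX_QP_RQ_PSN | EX_QP_MAX_DEST_RD_ATOMIC | EX_QP_MIN_RNR_TIMER;
}

/* True when attr_mask carries at least the required bits;
 * optional attributes may be set in addition. */
bool ex_mask_ok(enum ex_next_state next, int attr_mask)
{
    int required = ex_required_mask(next);
    return (attr_mask & required) == required;
}
```

Because an invalid mask leaves every attribute unmodified, checking up front makes the failure easier to diagnose than a bare error return.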
69. ES

Mellanox Technologies
350 Oakmead Parkway, Suite 100
Sunnyvale, CA 94085
U.S.A.
www.mellanox.com
Tel: (408) 970-3400
Fax: (408) 970-3403

Mellanox Technologies, Ltd.
Beit Mellanox
PO Box 586 Yokneam 20692
Israel
www.mellanox.com
Tel: +972 (0)74 723 7200
Fax: +972 (0)4 959 3245

© Copyright 2012. Mellanox Technologies. All Rights Reserved.

Mellanox, Mellanox logo, BridgeX, ConnectX, CORE-Direct, InfiniBridge, InfiniHost, InfiniScale, PhyX, SwitchX, Virtual Protocol Interconnect and Voltaire are registered trademarks of Mellanox Technologies, Ltd.

Connect-IB, FabricIT, MLNX-OS, ScalableHPC, Unbreakable-Link, UFM and Unified Fabric Manager are trademarks of Mellanox Technologies, Ltd.

All other trademarks are property of their respective owners.

Document Number: 2865

Table of Contents

Revision History
Glossary
Chapter 1 RDMA Architecture Overview
1.1 InfiniBand
1.2 RDMA over Converged Ethernet (RoCE)
1.3 Comparison of RDMA Technologies
1.4 Key Components
1.5 Support for Existing Applications and ULPs
70. Call post_receive.
Call sock_sync_data to exchange information between server and client.
Call modify_qp_to_rtr.
Call modify_qp_to_rts.
Call sock_sync_data to synchronize client<->server.

8.1.7 modify_qp_to_init
Transition QP to INIT state.

8.1.8 post_receive
Prepare a scatter/gather entry for the receive buffer.
Prepare an RR.
Post the RR.

8.1.9 sock_sync_data
Using the TCP socket created with sock_connect, synchronize the given set of data between client and the server. Since this function is blocking, it is also called with dummy data to synchronize the timing of the client and server.

8.1.10 modify_qp_to_rtr
Transition QP to RTR state.

8.1.11 modify_qp_to_rts
Transition QP to RTS state.

8.1.12 post_send
Prepare a scatter/gather entry for data to be sent (or received in RDMA read case).
Create an SR. Note that IBV_SEND_SIGNALED is redundant.
If this is an RDMA operation, set the address and key.
Post the SR.

8.1.13 poll_completion
Poll CQ until an entry is found or MAX_POLL_CQ_TIMEOUT milliseconds are reached.

8.1.14 resources_destroy
Release/free/deallocate all items in resource struct.

8.2 Code for Send, Receive, RDMA Read, RDMA Write

/*
 * BUILD COMMAND:
 * gcc -Wall -I/usr/local/ofed/include -O2 -o RDMA_RC_examp
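The poll_completion step in the synopsis above (poll until an entry is found or a millisecond budget runs out) follows a generic pattern that can be shown without any RDMA hardware. In this sketch the callback type `poll_fn` and the functions `poll_until` and `countdown_poll` are hypothetical; the callback stands in for a call such as ibv_poll_cq and uses the same return convention (positive when an entry was found, 0 when the queue is empty, negative on error).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical poll callback: >0 entry found, 0 queue empty, <0 error. */
typedef int (*poll_fn)(void *arg);

/* Milliseconds elapsed since *start on the monotonic clock. */
static long long elapsed_ms(const struct timespec *start)
{
    struct timespec now;
    long long ns;

    clock_gettime(CLOCK_MONOTONIC, &now);
    ns = (long long)(now.tv_sec - start->tv_sec) * 1000000000LL
       + (now.tv_nsec - start->tv_nsec);
    return ns / 1000000LL;
}

/* Busy-poll until fn finds an entry, fails, or timeout_ms elapses.
 * Returns 0 if an entry was found, 1 on timeout, -1 on poll error. */
int poll_until(poll_fn fn, void *arg, long long timeout_ms)
{
    struct timespec start;
    int rc;

    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        rc = fn(arg);
    } while (rc == 0 && elapsed_ms(&start) < timeout_ms);

    if (rc < 0)
        return -1;
    return rc > 0 ? 0 : 1;
}

/* Demo poller: reports "empty" until the counter reaches zero. */
int countdown_poll(void *arg)
{
    int *left = arg;

    if (*left > 0) {
        (*left)--;
        return 0;
    }
    return 1;
}
```

Busy polling keeps latency low at the cost of a spinning CPU core; where that trade-off is unwanted, a completion channel is the alternative.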
71. Note: This programming manual and its verbs are valid only for user space. See header files for the kernel space verbs.

Programming with verbs allows for customizing and optimizing the RDMA-Aware network. This customizing and optimizing should be done only by programmers with advanced knowledge and experience in the VPI systems.

2.2 Online Resources

Mellanox driver software stacks and firmware are available for download from Mellanox Technologies' Web pages: http://www.mellanox.com

3 Overview

To perform RDMA operations, a connection to the remote host and the appropriate permissions must be set up first. The mechanism for accomplishing this is the Queue Pair (QP). For those familiar with a standard IP stack, a QP is roughly equivalent to a socket. The QP needs to be initialized on both sides of the connection. Communication Manager (CM) can be used to exchange information about the QP prior to actual QP setup.

Once a QP is established, the verbs API can be used to perform RDMA reads, RDMA writes, and atomic operations. Serialized send/receive operations, which are similar to socket reads/writes, can be performed as well.

3.1 Available Communication Operations

3.1.1 Send/Send With Immediate

The send operation allows you to send data to a remote QP's receive queue. The receiver must have previously posted a receiv
72. OVAL
7.3.13 RDMA_CM_EVENT_MULTICAST_JOIN
7.3.14 RDMA_CM_EVENT_MULTICAST_ERROR
7.3.15 RDMA_CM_EVENT_ADDR_CHANGE
7.3.16 RDMA_CM_EVENT_TIMEWAIT_EXIT
Chapter 8 Programming Examples Using IBV Verbs
8.1 Synopsis for RDMA_RC Example Using IBV Verbs
8.1.1 main
8.1.2 print_config
8.1.3 resources_init
8.1.4 resources_create
8.1.5 sock_connect
8.1.6 connect_qp
8.1.7 modify_qp_to_init
8.1.8 post_receive
8.1.9 sock_sync_data
8.1.10 modify_qp_to_rtr
8.1.11 modify_qp_to_rts
8.1.12 post_send
8.1.13 poll_completion
8.1.14 resources_destroy
8.2 Code for Send, Receive, RDMA Read, RDMA Write
8.3 Synopsis for Multicast Example Using RDMA_CM and IBV Verbs
8.3.1 main
8.3.2 run
8.4 Code for Multicast Using RDMA_CM and IBV Verbs
Chapter 9 Pro
73. Parameters
none

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_dealloc_pd frees a protection domain (PD). This command will fail if any other objects are currently associated with the indicated PD.

4.3.7 ibv_create_cq

Template
struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, void *cq_context, struct ibv_comp_channel *channel, int comp_vector)

Input Parameters
context        struct ibv_context from ibv_open_device
cqe            Minimum number of entries CQ will support
cq_context     (Optional) User defined value returned with completion events
channel        (Optional) Completion channel
comp_vector    (Optional) Completion vector

Output Parameters
none

Return Value
pointer to created CQ or NULL on failure.

Description
ibv_create_cq creates a completion queue (CQ). A completion queue holds completion queue entries (CQE). Each Queue Pair (QP) has an associated send and receive CQ. A single CQ can be shared for sending and receiving, and can also be shared across multiple QPs.

The parameter cqe defines the minimum size of the queue. The actual size of the queue may be larger than the specified value.

The parameter cq_context is a user defined value. If specified during CQ creation, this value will be returned as a parameter in ibv_get_cq_event when
74. Q
max_mr                       Maximum supported memory regions (MR)
max_pd                       Maximum supported protection domains (PD)
max_qp_rd_atom               Maximum outstanding RDMA read and atomic operations per QP
max_ee_rd_atom               Maximum outstanding RDMA read and atomic operations per End to End (EE) context (RD connections)
max_res_rd_atom              Maximum resources used for incoming RDMA read and atomic operations
max_qp_init_rd_atom          Maximum RDMA read and atomic operations that may be initiated per QP
max_ee_init_rd_atom          Maximum RDMA read and atomic operations that may be initiated per EE
atomic_cap                   IBV_ATOMIC_NONE - no atomic guarantees
                             IBV_ATOMIC_HCA - atomic guarantees within this device
                             IBV_ATOMIC_GLOB - global atomic guarantees
max_ee                       Maximum supported EE contexts
max_rdd                      Maximum supported RD domains
max_mw                       Maximum supported memory windows (MW)
max_raw_ipv6_qp              Maximum supported raw IPv6 datagram QPs
max_raw_ethy_qp              Maximum supported ethertype datagr
max_mcast_grp
max_mcast_qp_attach
max_total_mcast_qp_attach
max_ah
max_fmr
max_map_per_fmr
max_srq
max_srq_wr
max_srq_sge
max_pkeys
local_ca_ack_delay
phys_port_cnt
75. QP_DEST_QPN

4.5.2 RESET to INIT

When a queue pair (QP) is newly created, it is in the RESET state. The first state transition that needs to happen is to bring the QP into the INIT state.

Required Attributes:
*** All QPs ***
qp_state / IBV_QP_STATE                  IBV_QPS_INIT
pkey_index / IBV_QP_PKEY_INDEX           pkey index, normally 0
port_num / IBV_QP_PORT                   physical port number (1...n)
qp_access_flags / IBV_QP_ACCESS_FLAGS    access flags (see ibv_reg_mr)
*** Unconnected QPs only ***
qkey / IBV_QP_QKEY                       qkey (see ibv_post_send)

Optional Attributes:
none

Effect of transition:
Once the QP is transitioned into the INIT state, the user may begin to post receive buffers to the receive queue via the ibv_post_recv command. At least one receive buffer should be posted before the QP can be transitioned to the RTR state.

4.5.3 INIT to RTR

Once a queue pair (QP) has receive buffers posted to it, it is now possible to transition the QP into the ready to receive (RTR) state.

Required Attributes:
*** All QPs ***
qp_state / IBV_QP_STATE
path_mtu / IBV_QP_PATH_MTU
*** Connected QPs only ***
ah_attr / IBV_QP_AV
dest_qp_num / IBV_QP_DEST_QPN
rq_psn / IBV_QP_RQ_PSN
max_dest_rd_atomic / IB
76. RY_EXC_ERR
        IBV_WC_RNR_RETRY_EXC_ERR
        IBV_WC_LOC_RDD_VIOL_ERR
        IBV_WC_REM_INV_RD_REQ_ERR
        IBV_WC_REM_ABORT_ERR
        IBV_WC_INV_EECN_ERR
        IBV_WC_INV_EEC_STATE_ERR
        IBV_WC_FATAL_ERR
        IBV_WC_RESP_TIMEOUT_ERR
        IBV_WC_GENERAL_ERR

    opcode            IBV_WC_SEND
                      IBV_WC_RDMA_WRITE
                      IBV_WC_RDMA_READ
                      IBV_WC_COMP_SWAP
                      IBV_WC_FETCH_ADD
                      IBV_WC_BIND_MW
                      IBV_WC_RECV = 1 << 7
                      IBV_WC_RECV_RDMA_WITH_IMM
    vendor_err        vendor specific error
    byte_len          number of bytes transferred
    imm_data          immediate data
    qp_num            local queue pair (QP) number
    src_qp            remote QP number
    wc_flags          see below
    pkey_index        index of pkey (valid only for GSI QPs)
    slid              source local identifier (LID)
    sl                service level (SL)
    dlid_path_bits    destination LID path bits

    flags:
    IBV_WC_GRH        global route header (GRH) is present in UD packet
    IBV_WC_WITH_IMM   immediate data value is valid

    4.6.11 ibv_init_ah_from_wc

    Template:
    int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, struct ibv_wc *wc, struct ibv_grh *grh, struct ibv_ah_attr *ah_attr)

    Input Parameters:
        context    struct ibv_context from ibv_open_device. This sho
77. SM (Subnet Manager)
        An entity that configures and manages the subnet. It discovers the network topology, assigns LIDs, determines the routing schemes, and sets the routing tables. There is one master SM and possibly several slaves (standby mode). It administers switch routing tables, thereby establishing paths through the fabric.

    SQ (Send Queue)
        A Work Queue which holds SRs posted by the user.

    SR (Send Request)
        A WR which was posted to an SQ. It describes how much data is going to be transferred, its direction, and the way it is transferred (the opcode specifies the transfer).

    SRQ (Shared Receive Queue)
        A queue which holds WQEs for incoming messages from any RC/UC/UD QP which is associated with it. More than one QP can be associated with one SRQ.

    TCA (Target Channel Adapter)
        A Channel Adapter that is not required to support verbs, usually used in I/O devices.

    UC (Unreliable Connection)
        A QP transport service type based on a connection oriented protocol, where a QP (Queue Pair) is associated with another single QP. The QPs do not execute a reliable protocol and messages can be lost.

    UD (Unreliable Datagram)
        A QP transport service type in which messages can be one packet in length and every UD QP can send/receive messages from another UD QP in the subnet. Messages can be lost and the order is not guaranteed. UD QP is the only type which supports multicast messages. The message size of a UD packet is limited to the Path MTU.
78. Transition the server side to RTR, sender side to RTS
    ******************************************************************************/
    static int connect_qp(struct resources *res)
    {
        struct cm_con_data_t local_con_data;
        struct cm_con_data_t remote_con_data;
        struct cm_con_data_t tmp_con_data;
        int rc = 0;
        char temp_char;
        union ibv_gid my_gid;

        if (config.gid_idx >= 0)
        {
            rc = ibv_query_gid(res->ib_ctx, config.ib_port, config.gid_idx, &my_gid);
            if (rc)
            {
                fprintf(stderr, "could not get gid for port %d, index %d\n", config.ib_port, config.gid_idx);
                return rc;
            }
        }
        else
            memset(&my_gid, 0, sizeof my_gid);

        /* exchange using TCP sockets info required to connect QPs */
        local_con_data.addr = htonll((uintptr_t)res->buf);
        local_con_data.rkey = htonl(res->mr->rkey);
        local_con_data.qp_num = htonl(res->qp->qp_num);
        local_con_data.lid = htons(res->port_attr.lid);
        memcpy(local_con_data.gid, &my_gid, 16);
        fprintf(stdout, "\nLocal LID = 0x%x\n", res->port_attr.lid);

        if (sock_sync_data(res->sock, sizeof(struct cm_con_data_t), (char *) &local_con_data, (char *) &tmp_con_data) < 0)
        {
            fprintf(stderr, "failed to exchange connection data between sides\n");
            rc = 1;
            goto connect_qp_exit;
        }

        remote_con_data.addr = ntohll(tmp_con_data.addr);
        remote_con_data.rkey = ntohl(tmp_con_data.rkey);
        remo
79. V_MAX_DEST_RD_ATOMIC
        min_rnr_timer / IBV_QP_MIN_RNR_TIMER

    Optional Attributes:
        *** All QPs ***
        qp_access_flags / IBV_QP_ACCESS_FLAGS
        pkey_index / IBV_QP_PKEY_INDEX
        *** Connected QPs only ***
        alt_ah_attr / IBV_QP_ALT_PATH
        *** Unconnected QPs only ***
        qkey / IBV_QP_QKEY

    Attribute values:
        qp_state              IBV_QPS_RTR
        path_mtu              IB_MTU_256
                              IB_MTU_512 (recommended value)
                              IB_MTU_1024
                              IB_MTU_2048
                              IB_MTU_4096
        ah_attr               an address handle (AH) needs to be created and filled in as appropriate. Minimally, ah_attr.dlid needs to be filled in.
        dest_qp_num           QP number of remote QP
        rq_psn                starting receive packet sequence number (should match remote QP's sq_psn)
        max_dest_rd_atomic    maximum number of resources for incoming RDMA requests
        min_rnr_timer         minimum RNR NAK timer (recommended value: 12)
        qp_access_flags       access flags (see ibv_reg_mr)
        pkey_index            pkey index, normally 0
        alt_ah_attr           AH with alternate path info filled in
        qkey                  qkey (see ibv_post_send)

    Effect of transition:
    Once the QP is transitioned into the RTR state, the QP begins receive processing.

    4.5.4 RTR to RTS

    Once a queue pair (QP) has reached the ready to receive (RTR) state, it may then be transitioned to the ready to send (RTS) state.

    Required Attributes:
        *** All QPs ***
        qp_state / IBV_QP_STATE
        *** Connected QPs only ***
        timeout / IBV_QP_TIMEOUT
        retry_cnt / IBV_QP_RETRY_CNT
        rnr_retry / IBV_QP_RNR_RETRY
        sq_psn / IBV_QP_SQ_PSN
        max_rd_atomic / IBV_QP_MAX_QP_RD_ATOMIC

    Optional Attributes:
        Ad
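The transitions described in 4.5.2 through 4.5.4 form a fixed bring-up order: RESET, then INIT, then RTR, then RTS. The following is a minimal sketch of that ordering in plain C. The enum qp_state_sketch and the function qp_next_state_ok are hypothetical names introduced here for illustration and are not part of the verbs API; the sketch covers only the bring-up sequence named in the text and ignores any other transitions a device may allow.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the QP states named in the text (not the verbs enum). */
enum qp_state_sketch { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS };

/* Returns true when `to` directly follows `from` in the
 * RESET -> INIT -> RTR -> RTS bring-up sequence. */
static bool qp_next_state_ok(enum qp_state_sketch from, enum qp_state_sketch to)
{
    switch (from) {
    case QPS_RESET: return to == QPS_INIT;
    case QPS_INIT:  return to == QPS_RTR;
    case QPS_RTR:   return to == QPS_RTS;
    default:        return false; /* RTS is the last state in this sketch */
    }
}
```

A caller could use such a check before each state change to catch a skipped step, for example an attempt to go from RESET straight to RTR.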
80. WAN

    Subnet Manager
    The InfiniBand subnet manager assigns Local Identifiers (LIDs) to each port connected to the InfiniBand fabric, and develops a routing table based on the assigned LIDs. The IB Subnet Manager is a concept of Software Defined Networking (SDN), which eliminates the interconnect complexity and enables the creation of very large scale compute and storage infrastructures.

    Switches
    IB switches are conceptually similar to standard networking switches but are designed to meet IB performance requirements. They implement flow control of the IB Link Layer to prevent packet dropping, and to support congestion avoidance and adaptive routing capabilities, and advanced Quality of Service. Many switches include a Subnet Manager. At least one Subnet Manager is required to configure an IB fabric.

    1.5 Support for Existing Applications and ULPs

    IP applications are enabled to run over an InfiniBand fabric using IP over IB (IPoIB) or Ethernet over IB (EoIB) or RDS ULPs. Storage applications are supported via iSER, SRP, RDS, NFS, ZFS, SMB and others. MPI and Network Direct are all supported ULPs as well, but are outside the scope of this document.

    1.6 References

    IBTA Intro to IB for End Users
    http://members.infinibandta.org/kwspub/Intro_to_IB_for_End_Users.pdf

    Mellanox InfiniBandFAQ_FQ_100.pdf
    http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_1
81. WR can only be safely reused after the WR request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ).

    If a WR is being posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be placed in the first 40 bytes of the buffer(s) in the scatter list. If no GRH is present in the incoming message, then the first 40 bytes will be undefined. This means that in all cases for UD QPs, the actual data of the incoming message will start at an offset of 40 bytes into the buffer(s) in the scatter list.

    4.6.7 ibv_req_notify_cq

    Template:
    int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only)

    Input Parameters:
        cq                struct ibv_cq from ibv_create_cq
        solicited_only    only notify if WR is flagged as solicited

    Output Parameters:
        none

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    ibv_req_notify_cq arms the notification mechanism for the indicated completion queue (CQ). When a completion queue entry (CQE) is placed on the CQ, a completion event will be sent to the completion channel (CC) associated with the CQ. If there is already a CQE in that CQ, an event will not be generated for it. If the solicited_only flag is set, then only CQEs for WRs that had the solicited flag set will trigger the notification. The user should use the ibv_
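The 40-byte GRH rule for UD receive buffers can be sketched as a small helper that skips the GRH area before reading message data. This is an illustrative sketch only: ud_payload and UD_GRH_BYTES are hypothetical names, and the code assumes that the byte count reported for the completed receive includes the 40 bytes reserved for the GRH.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the GRH area at the start of a UD receive buffer, as stated in the text. */
#define UD_GRH_BYTES 40u

/* Hypothetical helper: given the start of a UD receive buffer and the byte
 * count reported for the completed receive (assumed to include the GRH area),
 * return a pointer to the message data and store its length in *payload_len.
 * Returns NULL when the reported length cannot contain the 40-byte GRH area. */
static const uint8_t *ud_payload(const uint8_t *recv_buf, uint32_t byte_len,
                                 uint32_t *payload_len)
{
    if (recv_buf == NULL || payload_len == NULL || byte_len < UD_GRH_BYTES)
        return NULL;
    *payload_len = byte_len - UD_GRH_BYTES;
    return recv_buf + UD_GRH_BYTES;
}
```

Because the first 40 bytes are undefined when no GRH is present, the helper never inspects them; it only steps over them.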
82. XRC operations

    struct ibv_qp_cap is defined as follows:

    struct ibv_qp_cap
    {
        uint32_t max_send_wr;
        uint32_t max_recv_wr;
        uint32_t max_send_sge;
        uint32_t max_recv_sge;
        uint32_t max_inline_data;
    };

    max_send_wr        Maximum number of outstanding send requests in the send queue
    max_recv_wr        Maximum number of outstanding receive requests (buffers) in the receive queue
    max_send_sge       Maximum number of scatter/gather elements (SGE) in a WR on the send queue
    max_recv_sge       Maximum number of SGEs in a WR on the receive queue
    max_inline_data    Maximum size in bytes of inline data on the send queue

    4.4.4 ibv_destroy_qp

    Template:
    int ibv_destroy_qp(struct ibv_qp *qp)

    Input Parameters:
        qp    struct ibv_qp from ibv_create_qp

    Output Parameters:
        none

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    ibv_destroy_qp frees a queue pair (QP).

    4.4.5 ibv_create_srq

    Template:
    struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr)

    Input Parameters:
        pd               The protection domain associated with the shared receive queue (SRQ)
        srq_init_attr    A list of initial attributes required to create the SRQ

    Output Parameters:
        ibv_srq_attr     Actual values of the struct are set

    Retur
83. _ERR
    7.2.13 IBV_WC_RETRY_EXC_ERR
    7.2.14 IBV_WC_RNR_RETRY_EXC_ERR
    7.2.15 IBV_WC_LOC_RDD_VIOL_ERR
    7.2.16 IBV_WC_REM_INV_RD_REQ_ERR
    7.2.17 IBV_WC_REM_ABORT_ERR
    7.2.18 IBV_WC_INV_EECN_ERR
    7.2.19 IBV_WC_INV_EEC_STATE_ERR
    7.2.20 IBV_WC_FATAL_ERR
    7.2.21 IBV_WC_RESP_TIMEOUT_ERR
    7.2.22 IBV_WC_GENERAL_ERR
    7.3 RDMA_CM Events
    7.3.1 RDMA_CM_EVENT_ADDR_RESOLVED
    7.3.2 RDMA_CM_EVENT_ADDR_ERROR
    7.3.3 RDMA_CM_EVENT_ROUTE_RESOLVED
    7.3.4 RDMA_CM_EVENT_ROUTE_ERROR
    7.3.5 RDMA_CM_EVENT_CONNECT_REQUEST
    7.3.6 RDMA_CM_EVENT_CONNECT_RESPONSE
    7.3.7 RDMA_CM_EVENT_CONNECT_ERROR
    7.3.8 RDMA_CM_EVENT_UNREACHABLE
    7.3.9 RDMA_CM_EVENT_REJECTED
    7.3.10 RDMA_CM_EVENT_ESTABLISHED
    7.3.11 RDMA_CM_EVENT_DISCONNECTED
    7.3.12 RDMA_CM_EVENT_DEVICE_REM
84. _addr or rdma_resolve_addr. Use of rdma_resolve_addr requires the local routing tables to resolve the multicast address to an RDMA device, unless a specific source address is provided. The user must call rdma_leave_multicast to leave the multicast group and release any multicast resources. After the join operation completes, any associated QP is automatically attached to the multicast group, and the join context is returned to the user through the private_data field in the rdma_cm_event.

    See Also
    rdma_leave_multicast, rdma_bind_addr, rdma_resolve_addr, rdma_create_qp, rdma_get_cm_event

    5.2.28 rdma_leave_multicast

    Template:
    int rdma_leave_multicast(struct rdma_cm_id *id, struct sockaddr *addr)

    Input Parameters:
        id      Communication identifier associated with the request
        addr    Multicast address identifying the group to leave

    Output Parameters:
        none

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    rdma_leave_multicast leaves a multicast group and detaches an associated QP from the group.

    Notes:
    Calling this function before a group has been fully joined results in canceling the join operation. Users should be aware that messages received from the multicast group may still be queued for completion processing immediately after leaving a multicast group. Destroying an rdma_cm_id will automatic
85. _alloc_pd(ctx->id->verbs);
        if (!ctx->pd) {
            VERB_ERR("ibv_alloc_pd", -1);
            return ret;
        }

        ctx->cq = ibv_create_cq(ctx->id->verbs, 2, 0, 0, 0);
        if (!ctx->cq) {
            VERB_ERR("ibv_create_cq", -1);
            return ret;
        }

        attr.qp_type = IBV_QPT_UD;
        attr.send_cq = ctx->cq;
        attr.recv_cq = ctx->cq;
        attr.cap.max_send_wr = ctx->msg_count;
        attr.cap.max_recv_wr = ctx->msg_count;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;

        ret = rdma_create_qp(ctx->id, ctx->pd, &attr);
        if (ret) {
            VERB_ERR("rdma_create_qp", ret);
            return ret;
        }

        /* The receiver must allow enough space in the receive buffer for
         * the GRH */
        buf_size = ctx->msg_length + (ctx->sender ? 0 : sizeof(struct ibv_grh));

        ctx->buf = calloc(1, buf_size);
        memset(ctx->buf, 0x00, buf_size);

        /* Register our memory region */
        ctx->mr = rdma_reg_msgs(ctx->id, ctx->buf, buf_size);
        if (!ctx->mr) {
            VERB_ERR("rdma_reg_msgs", -1);
            return -1;
        }

        return 0;
    }

    /*
     * Function:    destroy_resources
     *
     * Input:
     *      ctx     The context structure
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Destroys AH, QP, CQ, MR, PD and ID.
     */
    void destroy_resources(struct context *ctx)
    {
        if (ctx->ah)
            ibv_destroy_ah(ctx->ah);

        if (ctx->id->qp)
            rdma_destroy_qp(ctx->id);

        if
86. _attr;

        memset(&hints, 0, sizeof (hints));
        hints.ai_port_space = RDMA_PS_TCP;
        if (ctx->server == 1)
            hints.ai_flags = RAI_PASSIVE; /* this makes it a server */

        printf("rdma_getaddrinfo\n");
        ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
        if (ret) {
            VERB_ERR("rdma_getaddrinfo", ret);
            return ret;
        }

        memset(&qp_init_attr, 0, sizeof (qp_init_attr));
        qp_init_attr.cap.max_send_wr = 1;
        qp_init_attr.cap.max_recv_wr = 1;
        qp_init_attr.cap.max_send_sge = 1;
        qp_init_attr.cap.max_recv_sge = 1;

        printf("rdma_create_ep\n");
        ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
        if (ret) {
            VERB_ERR("rdma_create_ep", ret);
            return ret;
        }

        rdma_freeaddrinfo(rai);

        return 0;
    }

    /*
     * Function:    get_connect_request
     *
     * Input:
     *      ctx     The context object
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Wait for a connect request from the client
     */
    int get_connect_request(struct context *ctx)
    {
        int ret;

        printf("rdma_listen\n");
        ret = rdma_listen(ctx->id, 4);
        if (ret) {
            VERB_ERR("rdma_listen", ret);
            return ret;
        }

        ctx->listen_id = ctx->id;

        printf("rdma_get_request\n");
        ret = rdma_get_request(ctx->listen_id, &ctx->id);
        if (ret) {
            VERB_ERR("rdma_get_request", ret);
            return ret;
        }

        if (ct
87. _init_attr *qp_init_attr)

    Input Parameters:
        pd              struct ibv_pd from ibv_alloc_pd
        qp_init_attr    initial attributes of queue pair

    Output Parameters:
        qp_init_attr    actual values are filled in

    Return Value:
        pointer to created queue pair (QP) or NULL on failure.

    Description:
    ibv_create_qp creates a QP. When a QP is created, it is put into the RESET state.

    struct ibv_qp_init_attr is defined as follows:

    struct ibv_qp_init_attr
    {
        void *qp_context;
        struct ibv_cq *send_cq;
        struct ibv_cq *recv_cq;
        struct ibv_srq *srq;
        struct ibv_qp_cap cap;
        enum ibv_qp_type qp_type;
        int sq_sig_all;
        struct ibv_xrc_domain *xrc_domain;
    };

    qp_context    (optional) user defined value associated with QP.
    send_cq       send CQ. This must be created by the user prior to calling ibv_create_qp.
    recv_cq       receive CQ. This must be created by the user prior to calling ibv_create_qp. It may be the same as send_cq.
    srq           (optional) shared receive queue. Only used for SRQ QPs.
    cap           defined below.
    qp_type       must be one of the following:
                  IBV_QPT_RC = 2,
                  IBV_QPT_UC,
                  IBV_QPT_UD,
                  IBV_QPT_XRC,
                  IBV_QPT_RAW_PACKET = 8,
                  IBV_QPT_RAW_ETH = 8
    sq_sig_all    If this value is set to 1, all send requests (WR) will generate completion queue entries (CQE). If this value is set to 0, only WRs that are flagged will generate CQEs (see ibv_post_send).
    xrc_domain    (Optional) Only used for
88. _t htonll(uint64_t x) { return x; }
    static inline uint64_t ntohll(uint64_t x) { return x; }
    #else
    #error __BYTE_ORDER is neither __LITTLE_ENDIAN nor __BIG_ENDIAN
    #endif

    /* structure of test parameters */
    struct config_t
    {
        const char *dev_name;   /* IB device name */
        char *server_name;      /* server host name */
        u_int32_t tcp_port;     /* server TCP port */
        int ib_port;            /* local IB port to work with */
        int gid_idx;            /* gid index to use */
    };

    /* structure to exchange data which is needed to connect the QPs */
    struct cm_con_data_t
    {
        uint64_t addr;      /* Buffer address */
        uint32_t rkey;      /* Remote key */
        uint32_t qp_num;    /* QP number */
        uint16_t lid;       /* LID of the IB port */
        uint8_t gid[16];    /* gid */
    } __attribute__ ((packed));

    /* structure of system resources */
    struct resources
    {
        struct ibv_device_attr device_attr;    /* Device attributes */
        struct ibv_port_attr port_attr;        /* IB port attributes */
        struct cm_con_data_t remote_props;     /* values to connect to remote side */
        struct ibv_context *ib_ctx;            /* device handle */
        struct ibv_pd *pd;                     /* PD handle */
        struct ibv_cq *cq;                     /* CQ handle */
        struct ibv_qp *qp;                     /* QP handle */
        struct ibv_mr *mr;                     /* MR handle for buf */
        char *buf;                             /* memory buffer pointer, used for RDMA and send ops */
        int sock;                              /* TCP socket file descriptor */
    };

    struct config_t config =
    {
        NULL,    /* dev_name */
        NULL,    /* server_name */
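The fragment above selects htonll and ntohll at compile time from the host byte order. As an alternative sketch, the same 64-bit host-to-network conversion can be written without any byte-order test by assembling the bytes explicitly, most significant first. The names htonll_sketch and ntohll_sketch are hypothetical and are not taken from the example program; this illustrates the conversion and is not the manual's implementation.

```c
#include <stdint.h>
#include <string.h>

/* Convert a 64-bit host value to network (big-endian) byte order by writing
 * the bytes out explicitly, so no compile-time endianness check is needed. */
static uint64_t htonll_sketch(uint64_t host)
{
    uint8_t bytes[8];
    uint64_t net;
    int i;

    for (i = 0; i < 8; i++)
        bytes[i] = (uint8_t)(host >> (56 - 8 * i)); /* most significant byte first */
    memcpy(&net, bytes, sizeof net);
    return net;
}

/* Inverse conversion: read the eight bytes in network order and rebuild the
 * host value. */
static uint64_t ntohll_sketch(uint64_t net)
{
    uint8_t bytes[8];
    uint64_t host = 0;
    int i;

    memcpy(bytes, &net, sizeof bytes);
    for (i = 0; i < 8; i++)
        host = (host << 8) | bytes[i];
    return host;
}
```

A field such as the addr member of struct cm_con_data_t could be passed through the first function before being sent over the TCP socket and through the second after being received.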
89. a source address is given, the rdma_cm_id is bound to that address, the same as if rdma_bind_addr were called. If no source address is given and the rdma_cm_id has not yet been bound to a device, then the rdma_cm_id will be bound to a source address based on the local routing tables. After this call, the rdma_cm_id will be bound to an RDMA device. This call is typically made from the active side of a connection before calling rdma_resolve_route and rdma_connect.

    InfiniBand Specific
    This call maps the destination and, if given, source IP addresses to GIDs. In order to perform the mapping, IPoIB must be running on both the local and remote nodes.

    See Also
    rdma_create_id, rdma_resolve_route, rdma_connect, rdma_create_qp, rdma_get_cm_event, rdma_bind_addr, rdma_get_src_port, rdma_get_dst_port, rdma_get_local_addr, rdma_get_peer_addr

    5.2.8 rdma_bind_addr

    Template:
    int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)

    Input Parameters:
        id      RDMA identifier
        addr    Local address information. Wildcard values are permitted.

    Output Parameters:
        None

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    rdma_bind_addr associates a source address with an rdma_cm_id. The address may be wildcarded. If binding to a specific local address, the rdma_cm_id will als
90. a_id->qp->qp_num);
        send_wr.wr.ud.ah = node->ah;
        send_wr.wr.ud.remote_qpn = node->remote_qpn;
        send_wr.wr.ud.remote_qkey = node->remote_qkey;

        sge.length = message_size;
        sge.lkey = node->mr->lkey;
        sge.addr = (uintptr_t) node->mem;

        for (i = 0; i < message_count && !ret; i++)
        {
            ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr);
            if (ret)
                printf("failed to post sends: %d\n", ret);
        }
        return ret;
    }

    static void connect_error(void)
    {
        test.connects_left--;
    }

    static int addr_handler(struct cmatest_node *node)
    {
        int ret;

        ret = verify_test_params(node);
        if (ret)
            goto err;

        ret = init_node(node);
        if (ret)
            goto err;

        if (!is_sender)
        {
            ret = post_recvs(node);
            if (ret)
                goto err;
        }

        ret = rdma_join_multicast(node->cma_id, test.dst_addr, node);
        if (ret)
        {
            printf("mckey: failure joining: %d\n", ret);
            goto err;
        }
        return 0;
    err:
        connect_error();
        return ret;
    }

    static int join_handler(struct cmatest_node *node, struct rdma_ud_param *param)
    {
        char buf[40];

        inet_ntop(AF_INET6, param->ah_attr.grh.dgid.raw, buf, 40);
        printf("mckey: joined dgid: %s\n", buf);

        node->remote_qpn = param->qp_num;
        node->remote_qkey = param->qkey;
        node->ah = ibv_create_ah(node->pd, &param->ah_attr);
        if (!node->ah)
        {
            printf("mckey: failure
91. access
        IBV_ACCESS_REMOTE_READ      Allow remote hosts read access
        IBV_ACCESS_REMOTE_ATOMIC    Allow remote hosts atomic access
        IBV_ACCESS_MW_BIND          Allow memory windows on this MR

    Local read access is implied and automatic.

    Any VPI operation that violates the access permissions of the given memory operation will fail. Note that the queue pair (QP) attributes must also have the correct permissions or the operation will fail.

    If IBV_ACCESS_REMOTE_WRITE or IBV_ACCESS_REMOTE_ATOMIC is set, then IBV_ACCESS_LOCAL_WRITE must be set as well.

    struct ibv_mr is defined as follows:

    struct ibv_mr
    {
        struct ibv_context *context;
        struct ibv_pd *pd;
        void *addr;
        size_t length;
        uint32_t handle;
        uint32_t lkey;
        uint32_t rkey;
    };

    4.4.2 ibv_dereg_mr

    Template:
    int ibv_dereg_mr(struct ibv_mr *mr)

    Input Parameters:
        mr    struct ibv_mr from ibv_reg_mr

    Output Parameters:
        none

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    ibv_dereg_mr frees a memory region (MR). The operation will fail if any memory windows (MW) are still bound to the MR.

    4.4.3 ibv_create_qp

    Template:
    struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp
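The access-flag rule stated for memory registration (remote write or remote atomic access requires local write access) can be checked before a registration call is attempted. The sketch below is illustrative: the ACC_* bits and access_flags_consistent are hypothetical names defined here so the example stands alone, and a real program would test the IBV_ACCESS_* constants from the verbs header instead.

```c
#include <stdbool.h>

/* Hypothetical flag bits for illustration only; real code would use the
 * IBV_ACCESS_* constants from the verbs header. */
enum access_sketch {
    ACC_LOCAL_WRITE   = 1 << 0,
    ACC_REMOTE_WRITE  = 1 << 1,
    ACC_REMOTE_READ   = 1 << 2,
    ACC_REMOTE_ATOMIC = 1 << 3,
    ACC_MW_BIND       = 1 << 4
};

/* Returns false when remote write or remote atomic access is requested
 * without local write access, which the text says is not permitted. */
static bool access_flags_consistent(int flags)
{
    if ((flags & (ACC_REMOTE_WRITE | ACC_REMOTE_ATOMIC)) &&
        !(flags & ACC_LOCAL_WRITE))
        return false;
    return true;
}
```

Remote read access alone passes the check, since local read access is implied and the rule only concerns the two write-like remote permissions.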
92. acklog)

    Input Parameters:
        id         RDMA communication identifier
        backlog    The backlog of incoming connection requests

    Output Parameters:
        None

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    rdma_listen initiates a listen for incoming connection requests or datagram service lookup. The listen is restricted to the locally bound source address.

    Please note that the rdma_cm_id must already have been bound to a local address by calling rdma_bind_addr before calling rdma_listen. If the rdma_cm_id is bound to a specific IP address, the listen will be restricted to that address and the associated RDMA device. If the rdma_cm_id is bound to an RDMA port number only, the listen will occur across all RDMA devices.

    5.2.11 rdma_connect

    Template:
    int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)

    Input Parameters:
        id            RDMA communication identifier
        conn_param    Optional connection parameters

    Output Parameters:
        none

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    rdma_connect initiates an active connection request. For a connected rdma_cm_id, the call initiates a connection request to a remote destination; for an unconnected rdma_cm_id, it initiates a lookup of the remote QP providing the datag
93. ai_dst_len;
        struct sockaddr *ai_src_addr;
        struct sockaddr *ai_dst_addr;
        char *ai_src_canonname;
        char *ai_dst_canonname;
        size_t ai_route_len;
        void *ai_route;
        size_t ai_connect_len;
        void *ai_connect;
        struct rdma_addrinfo *ai_next;
    };

    ai_flags            Hint flags which control the operation. Supported flags are: RAI_PASSIVE, RAI_NUMERICHOST and RAI_NOROUTE
    ai_family           Address family for the source and destination address (AF_INET, AF_INET6, AF_IB)
    ai_qp_type          The type of RDMA QP used
    ai_port_space       RDMA port space used (RDMA_PS_UDP or RDMA_PS_TCP)
    ai_src_len          Length of the source address referenced by ai_src_addr
    ai_dst_len          Length of the destination address referenced by ai_dst_addr
    ai_src_addr         Address of local RDMA device, if provided
    ai_dst_addr         Address of destination RDMA device, if provided
    ai_src_canonname    The canonical name for the source
    ai_dst_canonname    The canonical name for the destination
    ai_route_len        Size of the routing information buffer referenced by ai_route
    ai_route            Routing information for RDMA transports that require routing data as part of connection establishment
    ai_connect_len      Size of connection information referenced by ai_connect
    ai_connect          Data exchanged as part of the connection establishment process
    ai_next             Pointer to the next rdma_addrinfo structure in the
94. aid applications which want the links used for their RDMA sessions to align with the network stack.

    7.3.16 RDMA_CM_EVENT_TIMEWAIT_EXIT

    This event is generated when the QP associated with the connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.

    8 Programming Examples Using IBV Verbs

    This chapter provides code examples using the IBV Verbs.

    8.1 Synopsis for RDMA_RC Example Using IBV Verbs

    The following is a synopsis of the functions in the programming example, in the order that they are called.

    8.1.1 Main

    Parse command line. The user may set the TCP port, device name, and device port for the test. If set, these values will override default values in config. The last parameter is the server name. If the server name is set, this designates a server to connect to and therefore puts the program into client mode. Otherwise the program is in server mode.
    Call print_config.
    Call resources_init.
    Call resources_create.
    Call connect_qp.
    If in server mode, do a call post_send with IBV_WR_SEND operation.
    Call poll_completion. Note that the server side expects a completion from the SEND request and the client side expects a RECEIVE completion
95. ally leave all multicast groups.

    See Also
    rdma_join_multicast, rdma_destroy_qp

    5.3 Event Handling Operations

    5.3.1 rdma_get_cm_event

    Template:
    int rdma_get_cm_event(struct rdma_event_channel *channel, struct rdma_cm_event **event)

    Input Parameters:
        channel    Event channel to check for events
        event      Allocated information about the next communication event

    Output Parameters:
        none

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    Retrieves a communication event. If no events are pending, by default, the call will block until an event is received.

    Notes:
    The default synchronous behavior of this routine can be changed by modifying the file descriptor associated with the given channel. All events that are reported must be acknowledged by calling rdma_ack_cm_event. Destruction of an rdma_cm_id will block until related events have been acknowledged.

    Event Data
    Communication event details are returned in the rdma_cm_event structure. This structure is allocated by the rdma_cm and released by the rdma_ack_cm_event routine. Details of the rdma_cm_event structure are given below.

    id    The rdma_cm identifier associated with the event. If the event type is RDMA_CM_EVENT_CONNECT_REQUEST, then this references a new id for that communication
96. am QPs
    max_mcast_grp                Maximum supported multicast groups
    max_mcast_qp_attach          Maximum QPs per multicast group that can be attached
    max_total_mcast_qp_attach    Maximum total QPs that can be attached to multicast groups
    max_ah                       Maximum supported address handles (AH)
    max_fmr                      Maximum supported fast memory regions (FMR)
    max_map_per_fmr              Maximum number of remaps per FMR before an unmap operation is required
    max_srq                      Maximum supported shared receive queues (SRQ)
    max_srq_wr                   Maximum work requests (WR) per SRQ
    max_srq_sge                  Maximum SGEs per SRQ
    max_pkeys                    Maximum number of partitions
    local_ca_ack_delay           Local CA ack delay
    phys_port_cnt                Number of physical ports

    4.3.2 ibv_query_port

    Template:
    int ibv_query_port(struct ibv_context *context, uint8_t port_num, struct ibv_port_attr *port_attr)

    Input Parameters:
        context      struct ibv_context from ibv_open_device
        port_num     physical port number (1 is first port)

    Output Parameters:
        port_attr    struct ibv_port_attr containing port attributes

    Return Value:
        0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

    Description:
    ibv_query_port retrieves the various attributes associated with a port. The user should allocate a struct ibv_port_attr, pass it to the command, and it will be filled in upon successful return. The user is responsible for freeing this struct.

    struct ibv_port_attr is defined as follows:

    struct ibv_port_attr
    {
        enum ibv_port_state state;
        enum ibv_mtu max_mtu;
        enum ibv_mtu active_mtu;
        int gid_tbl_len;
        uint32_t port_cap_flags;
        uint32_t max_msg_sz;
        u
97. do {
            ne = ibv_poll_cq(ctx->srq_cq, 1, &wc);
            if (ne < 0) {
                VERB_ERR("ibv_poll_cq", ne);
                return ne;
            }
            else if (ne == 0)
                break;

            if (wc.status != IBV_WC_SUCCESS) {
                printf("work completion status %s\n", ibv_wc_status_str(wc.status));
                return -1;
            }

            recv_count++;
            printf("recv count: %d\n", recv_count);

            ret = rdma_post_recv(ctx->srq_id, (void *) wc.wr_id, ctx->recv_buf, ctx->msg_length, ctx->recv_mr);
            if (ret) {
                VERB_ERR("rdma_post_recv", ret);
                return ret;
            }
        }
        while (ne);

        return ret;
    }

    /*
     * Function:    main
     *
     * Input:
     *      argc    The number of arguments
     *      argv    Command line arguments
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Main program to demonstrate SRQ functionality.
     *      Both the client and server use an SRQ. ctx.qp_count number of QPs are
     *      created and each one of them uses the SRQ. After the connection, the
     *      client starts blasting sends to the server upto ctx.max_wr. When the
     *      server has received all the sends, it performs a send to the client to
     *      tell it that it can continue. Process repeats until ctx.msg_count
     *      sends have been performed.
     */
    int main(int argc, char *argv[])
    {
        int ret, op;
        struct context ctx;
        struct rdma_addrinfo *rai, hints;

        memset(&ctx, 0, sizeof (ctx));
        memset(&
98. at are on the system, as well as opening and closing a specific device.

    4.2.1 ibv_get_device_list

    Template:
    struct ibv_device **ibv_get_device_list(int *num_devices)

    Input Parameters:
        none

    Output Parameters:
        num_devices    (optional) If non-null, the number of devices returned in the array will be stored here

    Return Value:
        NULL terminated array of VPI devices or NULL on failure.

    Description:
    ibv_get_device_list returns a list of VPI devices available on the system. Each entry on the list is a pointer to a struct ibv_device.

    struct ibv_device is defined as:

    struct ibv_device
    {
        struct ibv_device_ops ops;
        enum ibv_node_type node_type;
        enum ibv_transport_type transport_type;
        char name[IBV_SYSFS_NAME_MAX];
        char dev_name[IBV_SYSFS_NAME_MAX];
        char dev_path[IBV_SYSFS_PATH_MAX];
        char ibdev_path[IBV_SYSFS_PATH_MAX];
    };

    ops               pointers to alloc and free functions
    node_type         IBV_NODE_UNKNOWN
                      IBV_NODE_CA
                      IBV_NODE_SWITCH
                      IBV_NODE_ROUTER
                      IBV_NODE_RNIC
    transport_type    IBV_TRANSPORT_UNKNOWN
                      IBV_TRANSPORT_IB
                      IBV_TRANSPORT_IWARP
    name              kernel device name, e.g. "mthca0"
    dev_name          uverbs device name, e.g. "uverbs0"
    dev_path          path to infiniband_verbs class device in sysfs
    ibdev_path        path to infiniband class device in sysfs

    The list of ibv_device structs shall remain valid until the list is freed. After calling ibv_get_device_list, the user s
99. ate:
    uint16_t rdma_get_src_port(struct rdma_cm_id *id)

    Input Parameters:
        id    RDMA communication identifier

    Output Parameters:
        None

    Return Value:
        Returns the 16-bit port number associated with the local endpoint, or 0 if the rdma_cm_id, id, is not bound to a port.

    Description:
    rdma_get_src_port retrieves the local port number for an rdma_cm_id (id) which has been bound to a local address. If the id is not bound to a port, the routine will return 0.

    5.2.18 rdma_get_dst_port

    Template:
    uint16_t rdma_get_dst_port(struct rdma_cm_id *id)

    Input Parameters:
        id    RDMA communication identifier

    Output Parameters:
        None

    Return Value:
        Returns the 16-bit port number associated with the peer endpoint, or 0 if the rdma_cm_id, id, is not connected.

    Description:
    rdma_get_dst_port retrieves the port associated with the peer endpoint. If the rdma_cm_id, id, is not connected, then the routine will return 0.

    5.2.19 rdma_get_local_addr

    Template:
    struct sockaddr *rdma_get_local_addr(struct rdma_cm_id *id)

    Input Parameters:
        id    RDMA communication identifier

    Output Parameters:
        None

    Return Value:
        Returns a pointer to the local sockaddr address of the rdma_cm_id, id. If the id is not bound to an address, then the contents of the sockaddr structure will be set to all zeros.

    Description
100. before being posted as a work request. They must be deregistered by calling rdma_dereg_mr.

    6.1.2 rdma_reg_read

    Template:
    struct ibv_mr *rdma_reg_read(struct rdma_cm_id *id, void *addr, size_t length)

    Input Parameters:
        id        A reference to the communication identifier where the message buffer(s) will be used
        addr      The address of the memory buffer(s) to register
        length    The total length of the memory to register

    Output Parameters:
        ibv_mr    A reference to an ibv_mr struct of the registered memory region

    Return Value:
        A reference to the registered memory region on success or NULL on failure. If an error occurs, errno will be set to indicate the failure reason.

    Description:
    rdma_reg_read registers a memory buffer that will be accessed by a remote RDMA read operation. Memory buffers registered using rdma_reg_read may be targeted in an RDMA read request, allowing the buffer to be specified on the remote side of an RDMA connection as the remote_addr of rdma_post_read, or similar call.

    rdma_reg_read is used to register a data buffer that will be the target of an RDMA read operation on a queue pair associated with an rdma_cm_id. The memory buffer is registered with the protection domain associated with the identifier. The start of the data buffer is specified through the addr parameter, and the total size of the buffer is given by length.

    All data buffers must be regist
101. buf

    /*
     * Function:    init_resources
     *
     * Input:
     *      ctx     The context object
     *      rai     The RDMA address info for the connection
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      This function initializes resources that are common to both the
     *      client and server functionality.
     *      It creates our SRQ, registers memory regions, posts receive
     *      buffers and creates a single completion queue that will be used
     *      for the receive queue on each queue pair.
     */
    int init_resources(struct context *ctx, struct rdma_addrinfo *rai)
    {
        int ret, i;
        struct rdma_cm_id *id;

        /* Create an ID used for creating/accessing our SRQ */
        ret = rdma_create_id(NULL, &ctx->srq_id, NULL, RDMA_PS_TCP);
        if (ret) {
            VERB_ERR("rdma_create_id", ret);
            return ret;
        }

        /* We need to bind the ID to a particular RDMA device.
         * This is done by resolving the address or binding to the address. */
        if (ctx->server == 0) {
            ret = rdma_resolve_addr(ctx->srq_id, NULL, rai->ai_dst_addr, 1000);
            if (ret) {
                VERB_ERR("rdma_resolve_addr", ret);
                return ret;
            }
        }
        else {
            ret = rdma_bind_addr(ctx->srq_id, rai->ai_src_addr);
            if (ret) {
                VERB_ERR("rdma_bind_addr", ret);
                return ret;
            }
        }

        /* Create the memory regions being used in this example */
        ctx->recv_mr = rdma_reg_msgs(ctx->srq_id, ctx->recv_buf, ctx->msg_length);
        if
102. buffer is given by the length. All data buffers must be registered before being posted as work requests. Users must deregister all registered memory by calling rdma_dereg_mr.

See Also
    rdma_cm(7), rdma_create_id(3), rdma_create_ep(3), rdma_reg_msgs(3), rdma_reg_read(3), ibv_reg_mr(3), ibv_dereg_mr(3), rdma_post_write(3)

6.1.4 rdma_dereg_mr

Template
    int rdma_dereg_mr(struct ibv_mr *mr)

Input Parameters
    mr    A reference to a registered memory buffer

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
    rdma_dereg_mr deregisters a memory buffer which has been registered for RDMA or message operations. This routine must be called for all registered memory associated with a given rdma_cm_id before destroying the rdma_cm_id.

6.1.5 rdma_create_srq

Template
    int rdma_create_srq(struct rdma_cm_id *id, struct ibv_pd *pd, struct ibv_srq_init_attr *attr)

Input Parameters
    id      The RDMA communication identifier
    pd      Optional protection domain for the shared receive queue (SRQ)
    attr    Initial SRQ attributes

Output Parameters
    attr    The actual capabilities and properties of the created SRQ are returned through this structure

Return Value
    0 on success, -1 on error. If the call fails, e
103. casts
    3.4 Typical Application
Chapter 4 VPI Verbs API
    4.1 Initialization
        4.1.1
    4.2 Device Operations
        4.2.1 ibv_get_device_list
        4.2.2 ibv_free_device_list
        4.2.3 ibv_get_device_name
        4.2.4 ibv_get_device_guid
        4.2.5 ibv_open_device
        4.2.6 ibv_close_device
        4.2.7 ibv_node_type_str
        4.2.8 ibv_port_state_str
    4.3 Verb Context Operations
        4.3.1 ibv_query_device
        4.3.2 ibv_query_port
        4.3.3 ibv_query_gid
        4.3.4 ibv_query_pkey
        4.3.5 ibv_alloc_pd
        4.3.6 ibv_dealloc_pd
        4.3.7 ibv_create_cq
        4.3.8 ibv_resize_cq
        4.3.9 ibv_destroy_cq
        4.3.10 ibv_create_comp_channel
        4.3.11 ibv_destroy_comp_channel
    4.4 Protection Domain Operations
        4.4.1 ibv_reg_mr
104. cq_events
        4.6.10 ibv_poll_cq
        4.6.11 ibv_init_ah_from_wc
        4.6.12 ibv_create_ah_from_wc
        4.6.13 ibv_attach_mcast
        4.6.14 ibv_detach_mcast
    4.7 Event Handling Operations
        4.7.1 ibv_get_async_event
        4.7.2 ibv_ack_async_event
        4.7.3 ibv_event_type_str
Chapter 5 RDMA_CM API
    5.1 Event Channel Operations
        5.1.1 rdma_create_event_channel
        5.1.2 rdma_destroy_event_channel
    5.2 Connection Manager (CM) ID Operations
        5.2.1 rdma_create_id
        5.2.2 rdma_destroy_id
        5.2.3 rdma_migrate_id
        5.2.4 rdma_set_option
        5.2.5 rdma_create_ep
        5.2.6 rdma_destroy_ep
        5.2.7 rdma_resolve_addr
        5.2.8 rdma_bind_addr
        5.2.9 rdma_resolve_route
        5.2.10 rdma_listen
105. cv_cq = res->cq;
        qp_init_attr.cap.max_send_wr = 1;
        qp_init_attr.cap.max_recv_wr = 1;
        qp_init_attr.cap.max_send_sge = 1;
        qp_init_attr.cap.max_recv_sge = 1;
        res->qp = ibv_create_qp(res->pd, &qp_init_attr);
        if (!res->qp)
        {
            fprintf(stderr, "failed to create QP\n");
            rc = 1;
            goto resources_create_exit;
        }
        fprintf(stdout, "QP was created, QP number=0x%x\n", res->qp->qp_num);

    resources_create_exit:
        if (rc)
        {
            /* Error encountered, cleanup */
            if (res->qp)
            {
                ibv_destroy_qp(res->qp);
                res->qp = NULL;
            }
            if (res->mr)
            {
                ibv_dereg_mr(res->mr);
                res->mr = NULL;
            }
            if (res->buf)
            {
                free(res->buf);
                res->buf = NULL;
            }
            if (res->cq)
            {
                ibv_destroy_cq(res->cq);
                res->cq = NULL;
            }
            if (res->pd)
            {
                ibv_dealloc_pd(res->pd);
                res->pd = NULL;
            }
            if (res->ib_ctx)
            {
                ibv_close_device(res->ib_ctx);
                res->ib_ctx = NULL;
            }
            if (dev_list)
            {
                ibv_free_device_list(dev_list);
                dev_list = NULL;
            }
            if (res->sock >= 0)
            {
                if (close(res->sock))
                    fprintf(stderr, "failed to close socket\n");
                res->sock = -1;
            }
        }
        return rc;
    }
106. cv_cq = node->cq;
        ret = rdma_create_qp(node->cma_id, node->pd, &init_qp_attr);
        if (ret) {
            printf("mckey: unable to create QP: %d\n", ret);
            goto out;
        }

        ret = create_message(node);
        if (ret) {
            printf("mckey: failed to create messages: %d\n", ret);
            goto out;
        }
    out:
        return ret;
    }

    static int post_recvs(struct cmatest_node *node)
    {
        struct ibv_recv_wr recv_wr, *recv_failure;
        struct ibv_sge sge;
        int i, ret = 0;

        if (!message_count)
            return 0;

        recv_wr.next = NULL;
        recv_wr.sg_list = &sge;
        recv_wr.num_sge = 1;
        recv_wr.wr_id = (uintptr_t) node;

        sge.length = message_size + sizeof(struct ibv_grh);
        sge.lkey = node->mr->lkey;
        sge.addr = (uintptr_t) node->mem;

        for (i = 0; i < message_count && !ret; i++) {
            ret = ibv_post_recv(node->cma_id->qp, &recv_wr, &recv_failure);
            if (ret) {
                printf("failed to post receives: %d\n", ret);
                break;
            }
        }
        return ret;
    }

    static int post_sends(struct cmatest_node *node, int signal_flag)
    {
        struct ibv_send_wr send_wr, *bad_send_wr;
        struct ibv_sge sge;
        int i, ret = 0;

        if (!node->connected || !message_count)
            return 0;

        send_wr.next = NULL;
        send_wr.sg_list = &sge;
        send_wr.num_sge = 1;
        send_wr.opcode = IBV_WR_SEND_WITH_IMM;
        send_wr.send_flags = signal_flag;
        send_wr.wr_id = (unsigned long) node;
        send_wr.imm_data = htonl(node->cm
107. d on both sides of a connection. It indicates that a connection has been established with the remote end point.

7.3.11 RDMA_CM_EVENT_DISCONNECTED
    This event is generated on both sides of the connection in response to rdma_disconnect. The event will be generated to indicate that the connection between the local and remote devices has been disconnected. Any associated QP will transition to the error state. All posted work requests are flushed. The user must change any such QP's state to Reset for recovery.

7.3.12 RDMA_CM_EVENT_DEVICE_REMOVAL
    This event is generated when the RDMA CM indicates that the device associated with the rdma_cm_id has been removed. Upon receipt of this event, the user must destroy the related rdma_cm_id.

7.3.13 RDMA_CM_EVENT_MULTICAST_JOIN
    This event is generated in response to rdma_join_multicast. It indicates that the multicast join operation has completed successfully.

7.3.14 RDMA_CM_EVENT_MULTICAST_ERROR
    This event is generated when an error occurs while attempting to join a multicast group, or on an existing multicast group if the group had already been joined. When this happens, the multicast group will no longer be accessible and must be rejoined if necessary.

7.3.15 RDMA_CM_EVENT_ADDR_CHANGE
    This event is generated when the network device associated with this ID through address resolution changes its hardware address. For example, this may happen following bonding fail-over. This event may serve to
108. d to the same subnet as the initial port and that there is a route to the other host's alternate port.
     */
    int get_alt_port_details(struct context *ctx)
    {
        int ret, i;
        struct ibv_qp_attr qp_attr;
        struct ibv_qp_init_attr qp_init_attr;
        struct ibv_device_attr dev_attr;

        /* This example assumes the alternate port we want to use is on the same HCA.
         * Ports from other HCAs can be used as alternate paths as well. Get a list of
         * devices using ibv_get_device_list or rdma_get_devices. */
        ret = ibv_query_device(ctx->id->verbs, &dev_attr);
        if (ret) {
            VERB_ERR("ibv_query_device", ret);
            return ret;
        }

        /* Verify the APM is supported by the HCA */
        if (!(dev_attr.device_cap_flags & IBV_DEVICE_AUTO_PATH_MIG)) {
            printf("device does not support auto path migration!\n");
            return -1;
        }

        /* Query the QP to determine which port we are bound to */
        ret = ibv_query_qp(ctx->id->qp, &qp_attr, 0, &qp_init_attr);
        if (ret) {
            VERB_ERR("ibv_query_qp", ret);
            return ret;
        }

        for (i = 1; i <= dev_attr.phys_port_cnt; i++) {
            /* Query all ports until we find one in the active state that is
             * not the port we are currently connected to. */
            struct ibv_port_attr port_attr;

            ret = ibv_query_port(ctx->id->verbs, i, &port_attr);
            if (ret) {
                VERB_ERR("ibv_query_port", ret);
                return ret;
            }

            if (port_attr.state == IBV_PORT_ACTIVE) {
                ctx->my_alt_dlid = port_attr.lid;
                ctx->alt_srcport = i;
                if (qp_attr.port_num != i)
                    br
109.
        4.4.2 ibv_dereg_mr
        4.4.3 ibv_create_qp
        4.4.4 ibv_destroy_qp
        4.4.5 ibv_create_srq
        4.4.6 ibv_modify_srq
        4.4.7 ibv_destroy_srq
        4.4.8 ibv_open_xrc_domain
        4.4.9 ibv_create_xrc_srq
        4.4.10 ibv_close_xrc_domain
        4.4.11 ibv_create_xrc_rcv_qp
        4.4.12 ibv_modify_xrc_rcv_qp
        4.4.13 ibv_reg_xrc_rcv_qp
        4.4.14 ibv_unreg_xrc_rcv_qp
        4.4.15 ibv_create_ah
        4.4.16 ibv_destroy_ah
    4.5 Queue Pair Bringup (ibv_modify_qp)
        4.5.1 ibv_modify_qp
        4.5.2 RESET to INIT
        4.5.3 INIT to RTR
        4.5.4 RTR to RTS
    4.6 Active Queue Pair Operations
        4.6.1 ibv_query_qp
        4.6.2 ibv_query_srq
        4.6.3 ibv_query_xrc_rcv_qp
        4.6.4 ibv_post_recv
        4.6.5 ibv_post_send
        4.6.6 ibv_post_srq_recv
        4.6.7 ibv_req_notify_cq
        4.6.8 ibv_get_cq_event
        4.6.9 ibv_ack_
110. depth
    The maximum number of outstanding RDMA read/atomic operations that the recipient may have outstanding. This field matches the responder resources specified by the remote node when calling rdma_connect and rdma_accept.
flow_control
    Indicates if hardware level flow control is provided by the sender.
retry_count
    For RDMA_CM_EVENT_CONNECT_REQUEST events only; indicates the number of times that the recipient should retry send operations.
rnr_retry_count
    The number of times that the recipient should retry receiver not ready (RNR) NACK errors.
srq
    Specifies if the sender is using a shared receive queue.
qp_num
    Indicates the remote QP number for the connection.

Event Types
The following types of communication events may be reported.

RDMA_CM_EVENT_ADDR_RESOLVED
    Address resolution (rdma_resolve_addr) completed successfully.
RDMA_CM_EVENT_ADDR_ERROR
    Address resolution (rdma_resolve_addr) failed.
RDMA_CM_EVENT_ROUTE_RESOLVED
    Route resolution (rdma_resolve_route) completed successfully.
RDMA_CM_EVENT_ROUTE_ERROR
    Route resolution (rdma_resolve_route) failed.
RDMA_CM_EVENT_CONNECT_REQUEST
    Generated on the passive side to notify the user of a new connection request.
RDMA_CM_EVENT_CONNECT_RESPONSE
    Generated on the active side to notify the user of a successful response to a connection request. It is only generated on rdma_cm_id's that do
111. domains (PD), which can be used for further operations.

4.3.1 ibv_query_device

Template
    int ibv_query_device(struct ibv_context *context, struct ibv_device_attr *device_attr)

Input Parameters
    context        struct ibv_context from ibv_open_device

Output Parameters
    device_attr    struct ibv_device_attr containing device attributes

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
    ibv_query_device retrieves the various attributes associated with a device. The user should malloc a struct ibv_device_attr and pass it to the command; it will be filled in upon successful return. The user is responsible for freeing this struct.

    struct ibv_device_attr is defined as follows:

    struct ibv_device_attr
    {
        char        fw_ver[64];
        uint64_t    node_guid;
        uint64_t    sys_image_guid;
        uint64_t    max_mr_size;
        uint64_t    page_size_cap;
        uint32_t    vendor_id;
        uint32_t    vendor_part_id;
        uint32_t    hw_ver;
        int         max_qp;
        int         max_qp_wr;
        int         device_cap_flags;
        int         max_sge;
        int         max_sge_rd;
        int         max_cq;
        int         max_cqe;
        int         max_mr;
        int         max_pd;
        int         max_qp_rd_atom;
        int         max_ee_rd_atom;
        int         max_res_rd_atom;
        int         max_qp_init_rd_atom;
        int         max_ee_init_rd_atom;
112.
    /*
     * Function: post_send
     *
     * Input:
     *      res       pointer to resources structure
     *      opcode    IBV_WR_SEND, IBV_WR_RDMA_READ or IBV_WR_RDMA_WRITE
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, error code on failure
     *
     * Description:
     *      This function will create and post a send work request
     */
    static int post_send(struct resources *res, int opcode)
    {
        struct ibv_send_wr sr;
        struct ibv_sge sge;
        struct ibv_send_wr *bad_wr = NULL;
        int rc;

        /* prepare the scatter/gather entry */
        memset(&sge, 0, sizeof(sge));
        sge.addr = (uintptr_t)res->buf;
        sge.length = MSG_SIZE;
        sge.lkey = res->mr->lkey;

        /* prepare the send work request */
        memset(&sr, 0, sizeof(sr));
        sr.next = NULL;
        sr.wr_id = 0;
        sr.sg_list = &sge;
        sr.num_sge = 1;
        sr.opcode = opcode;
        sr.send_flags = IBV_SEND_SIGNALED;
        if (opcode != IBV_WR_SEND)
        {
            sr.wr.rdma.remote_addr = res->remote_props.addr;
            sr.wr.rdma.rkey = res->remote_props.rkey;
        }

        /* there is a Receive Request in the responder side, so we won't get any into RNR flow */
        rc = ibv_post_send(res->qp, &sr, &bad_wr);
        if (rc)
            fprintf(stderr, "failed to post SR\n");
        else
        {
            switch
113. e buffer to receive the data. The sender does not have any control over where the data will reside in the remote host.

Optionally, an immediate 4-byte value may be transmitted with the data buffer. This immediate value is presented to the receiver as part of the receive notification, and is not contained in the data buffer.

3.1.2 Receive
    This is the corresponding operation to a send operation. The receiving host is notified that a data buffer has been received, possibly with an inline immediate value. The receiving application is responsible for receive buffer maintenance and posting.

3.1.3 RDMA Read
    A section of memory is read from the remote host. The caller specifies the remote virtual address as well as a local memory address to be copied to. Prior to performing RDMA operations, the remote host must provide appropriate permissions to access its memory. Once these permissions are set, RDMA read operations are conducted with no notification whatsoever to the remote host. For both RDMA read and write, the remote side is not aware that the operation is being performed, apart from its preparation of the permissions and resources.

3.1.4 RDMA Write / RDMA Write With Immediate
    Similar to RDMA read, but the data is written to the remote host. RDMA write operations are performed with no notification to the remote host. RDMA write with immediate operations, however, do notify the remote host of the immediate value.

3.1.5 Atomic Fetch and Add / Atomic Com
114. e demonstrated on a simple fabric of two nodes with the server
     * application running on one node and the client application running on the
     * other. Each node must be configured to support IPoIB and the IB interface
     * (ex. ib0) must be assigned an IP Address. Finally, the fabric must be
     * initialized using OpenSM.
     *
     * Server (-a is IP of local interface):
     *      srq -s -a 192.168.1.12
     *
     * Client (-a is IP of remote interface):
     *      srq -a 192.168.1.12
     */
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <getopt.h>
    #include <rdma/rdma_verbs.h>

    #define VERB_ERR(verb, ret) \
            fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

    /* Default parameters values */
    #define DEFAULT_PORT "51216"
    #define DEFAULT_MSG_COUNT 100
    #define DEFAULT_MSG_LENGTH 100000
    #define DEFAULT_QP_COUNT 4
    #define DEFAULT_MAX_WR 64

    /* Resources used in the example */
    struct context
    {
        /* User parameters */
        int server;
        char *server_name;
        char *server_port;
        int msg_count;
        int msg_length;
        int qp_count;
        int max_wr;

        /* Resources */
        struct rdma_cm_id *srq_id;
        struct rdma_cm_id *listen_id;
        struct rdma_cm_id **conn_id;
        struct ibv_mr *send_mr;
        struct ibv_mr *recv_mr;
        struct ibv_srq *srq;
        struct ibv_cq *srq_cq;
        struct ibv_comp_channel *srq_cq_channel;
        char *send_buf;
        char *recv_
115. e pair is bound to an rdma_cm_id after calling rdma_create_ep, or rdma_create_qp if the rdma_cm_id is allocated using rdma_create_id.

The user-defined context associated with the receive request will be returned to the user through the work completion work request identifier (wr_id) field.

6.2.2 rdma_post_sendv

Template
    int rdma_post_sendv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge, int flags)

Input Parameters
    id         A reference to the communication identifier where the message buffer will be posted
    context    A user-defined context associated with the request
    sgl        A scatter-gather list of memory buffers posted as a single request
    nsge       The number of scatter-gather entries in the sgl array
    flags      Optional flags used to control the send operation

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
    rdma_post_sendv posts a work request to the send queue of the queue pair associated with the rdma_cm_id, id. The contents of the posted buffers will be sent to the remote peer of the connection.

    The user is responsible for ensuring that the remote peer has queued a receive request before issuing the send operations. Also, unless the send request is using inline data, the message buffers must already have been registered before being posted. The buf
116. e read into the local data buffers given in the sgl array.

The user must ensure that both the remote and local data buffers have been registered before the read is issued. The buffers must remain registered until the read completes.

Read operations may not be posted to an rdma_cm_id or the corresponding queue pair until a connection has been established.

The user-defined context associated with the read request will be returned to the user through the work completion work request identifier (wr_id) field.

6.2.4 rdma_post_writev

Template
    int rdma_post_writev(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge, int flags, uint64_t remote_addr, uint32_t rkey)

Input Parameters
    id             A reference to the communication identifier where the request will be posted
    context        A user-defined context associated with the request
    sgl            A scatter-gather list of the source buffers of the write
    nsge           The number of scatter-gather entries in the sgl array
    flags          Optional flags used to control the write operation
    remote_addr    The address of the remote registered memory to write into
    rkey           The registered memory key associated with the remote address

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
    rdma_post_writev posts a work request to the send queue of the queue pai
117. e size of the private data buffer may be larger than the amount of private data sent by the remote side. Any additional space in the buffer will be zeroed out.
ah_attr
    Address information needed to send data to the remote end point(s). Users should use this structure when allocating their address handle.
qp_num
    QP number of the remote endpoint or multicast group.
qkey
    QKey needed to send data to the remote endpoint(s).

Conn Event Data
Event parameters related to connected QP services: RDMA_PS_TCP. The connection related event data is valid for RDMA_CM_EVENT_CONNECT_REQUEST and RDMA_CM_EVENT_ESTABLISHED events, unless stated otherwise.

private_data
    References any user-specified data associated with the event. The data referenced by this field matches that specified by the remote side when calling rdma_connect or rdma_accept. This field is NULL if the event does not include private data. The buffer referenced by this pointer is deallocated when calling rdma_ack_cm_event.
private_data_len
    The size of the private data buffer. Users should note that the size of the private data buffer may be larger than the amount of private data sent by the remote side. Any additional space in the buffer will be zeroed out.
responder_resources
    The number of responder resources requested of the recipient. This field matches the initiator depth specified by the remote node when calling rdma_connect and rdma_accept.
initiator
118. each one of the sends and then both sides leave the multicast group and
     * cleanup resources.
     *
     * Running the Example:
     * The executable can operate as either the sender or receiver application. It
     * can be demonstrated on a simple fabric of two nodes with the sender
     * application running on one node and the receiver application running on the
     * other. Each node must be configured to support IPoIB and the IB interface
     * (ex. ib0) must be assigned an IP Address. Finally, the fabric must be
     * initialized using OpenSM.
     *
     * Receiver (-m is the multicast address, often the IP of the receiver):
     *      mc -m 192.168.1.12
     *
     * Sender (-m is the multicast address, often the IP of the receiver):
     *      mc -s -m 192.168.1.12
     */
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <getopt.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <rdma/rdma_verbs.h>

    #define VERB_ERR(verb, ret) \
            fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

    /* Default parameter values */
    #define DEFAULT_PORT "51216"
    #define DEFAULT_MSG_COUNT 4
    #define DEFAULT_MSG_LENGTH 64

    /* Resources used in the example */
    struct context
    {
        /* User parameters */
        int sender;
        char *bind_addr;
        char *mcast_addr;
        char *server_port;
        int msg_count;
        int msg_length;

        /* Resourc
119. eak;
            }
        }
        return 0;
    }

    /*
     * Function:    load_alt_path
     *
     * Input:
     *      ctx     The context object
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Uses ibv_modify_qp to load the alternate path information and set the
     *      path migration state to rearm.
     */
    int load_alt_path(struct context *ctx)
    {
        int ret;
        struct ibv_qp_attr qp_attr;
        struct ibv_qp_init_attr qp_init_attr;

        /* query to get the current attributes of the qp */
        ret = ibv_query_qp(ctx->id->qp, &qp_attr, 0, &qp_init_attr);
        if (ret) {
            VERB_ERR("ibv_query_qp", ret);
            return ret;
        }

        /* initialize the alternate path attributes with the current path attributes */
        memcpy(&qp_attr.alt_ah_attr, &qp_attr.ah_attr, sizeof(struct ibv_ah_attr));

        /* set the alt path attributes to some basic values */
        qp_attr.alt_pkey_index = qp_attr.pkey_index;
        qp_attr.alt_timeout = qp_attr.timeout;
        qp_attr.path_mig_state = IBV_MIG_REARM;

        /* if an alternate path was supplied, set the source port and the dlid */
        if (ctx->alt_srcport)
            qp_attr.alt_port_num = ctx->alt_srcport;
        else
            qp_attr.alt_port_num = qp_attr.port_num;

        if (ctx->alt_dlid)
            qp_attr.alt_ah_attr.dlid = ctx->alt_dlid;

        printf("loading alt path - local port: %d, dlid: %d\n",
               qp_attr.alt_port_num, qp_attr.alt_ah_attr.dlid);
120. ed when the CQ is created. When a CQE is polled, it is removed from the CQ. A CQ is a FIFO of CQEs. A CQ can service send queues, receive queues, or both. Work queues from multiple QPs can be associated with a single CQ. struct ibv_cq is used to implement a CQ.

Memory Registration
Memory Registration is a mechanism that allows an application to describe a set of virtually contiguous memory locations, or a set of physically contiguous memory locations, to the network adapter as a virtually contiguous buffer using Virtual Addresses.

The registration process pins the memory pages (to prevent the pages from being swapped out and to keep the physical <-> virtual mapping). During the registration, the OS checks the permissions of the registered block. The registration process writes the virtual-to-physical address table to the network adapter.

When registering memory, permissions are set for the region. Permissions are local write, remote read, remote write, atomic, and bind.

Every MR has a remote and a local key (r_key, l_key). Local keys are used by the local HCA to access local memory, such as during a receive operation. Remote keys are given to the remote HCA to allow a remote process access to system memory during RDMA operations.

The same memory buffer can be registered several times (even with different access permissions) and every registration r
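The permission set chosen at registration time can be sanity-checked before the registration call is made. The following is a minimal sketch, not part of the verbs API: the mr_access flag names and bit values are local stand-ins assumed for illustration (real code would use the IBV_ACCESS_* flags from <infiniband/verbs.h>), and mr_access_valid is a hypothetical helper. It encodes one rule documented for ibv_reg_mr: remote write or remote atomic access also requires local write access.

```c
/* Stand-in permission bits for illustration only; the names and values are
 * assumptions, not the definitions from <infiniband/verbs.h>. */
enum mr_access {
    MR_LOCAL_WRITE   = 1 << 0,
    MR_REMOTE_WRITE  = 1 << 1,
    MR_REMOTE_READ   = 1 << 2,
    MR_REMOTE_ATOMIC = 1 << 3,
    MR_BIND          = 1 << 4
};

#define MR_ALL_FLAGS (MR_LOCAL_WRITE | MR_REMOTE_WRITE | MR_REMOTE_READ | \
                      MR_REMOTE_ATOMIC | MR_BIND)

/* Hypothetical helper: returns 1 if the flag combination is acceptable,
 * 0 otherwise. Unknown bits are rejected, and remote write or remote atomic
 * access is only accepted together with local write access. */
static int mr_access_valid(unsigned int flags)
{
    if (flags & ~(unsigned int)MR_ALL_FLAGS)
        return 0;
    if ((flags & (MR_REMOTE_WRITE | MR_REMOTE_ATOMIC)) &&
        !(flags & MR_LOCAL_WRITE))
        return 0;
    return 1;
}
```

A buffer that is only the target of remote reads passes with MR_REMOTE_READ alone, while a buffer that is the target of remote writes needs MR_LOCAL_WRITE | MR_REMOTE_WRITE.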
121. ention. A message can be an RDMA Read, an RDMA Write operation, or a Send/Receive operation. IB and RoCE also support Multicast transmission.

The IB Link layer offers features such as a credit-based flow control mechanism for congestion control. It also allows the use of Virtual Lanes (VLs), which allow simplification of the higher layer level protocols and advanced Quality of Service. It guarantees strong ordering within the VL along a given path. The IB Transport layer provides reliability and delivery guarantees.

The Network Layer used by IB has features which make it simple to transport messages directly between applications' virtual memory, even if the applications are physically located on different servers. Thus the combination of the IB Transport layer with the Software Transport Interface is better thought of as an RDMA message transport service. The entire stack, including the Software Transport Interface, comprises the IB messaging service.

The most important point is that every application has direct access to the virtual memory of devices in the fabric. This means that applications do not need to make requests to an operating system to transfer messages. Contrast this with the traditional network environment, where the shared network resources are owned by the operating system and cannot be accessed by a user application. Thus, an application must rely on the involvement of the operating system to move data from the application's virtual buffer s
122. enum value which may be an HCA, Switch, Router, RNIC or Unknown

Output Parameters
    None

Return Value
    A constant string which describes the enum value node_type.

Description
    ibv_node_type_str returns a string describing the node type enum value, node_type. This value can be an InfiniBand HCA, Switch, Router, an RDMA enabled NIC, or unknown.

    enum ibv_node_type {
        IBV_NODE_UNKNOWN = -1,
        IBV_NODE_CA = 1,
        IBV_NODE_SWITCH,
        IBV_NODE_ROUTER,
        IBV_NODE_RNIC
    };

4.2.8 ibv_port_state_str

Template
    const char *ibv_port_state_str(enum ibv_port_state port_state)

Input Parameters
    port_state    The enumerated value of the port state

Output Parameters
    None

Return Value
    A constant string which describes the enum value port_state.

Description
    ibv_port_state_str returns a string describing the port state enum value, port_state.

    enum ibv_port_state {
        IBV_PORT_NOP          = 0,
        IBV_PORT_DOWN         = 1,
        IBV_PORT_INIT         = 2,
        IBV_PORT_ARMED        = 3,
        IBV_PORT_ACTIVE       = 4,
        IBV_PORT_ACTIVE_DEFER = 5
    };

4.3 Verb Context Operations
The following commands are used once a device has been opened. These commands allow you to get more specific information about a device or one of its ports, create completion queues (CQ), completion channels (CC), and protection
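The mapping that ibv_port_state_str performs can be pictured as a bounds-checked table lookup. The following is a minimal sketch under stated assumptions, not the library's implementation: enum port_state is a local stand-in that mirrors the numeric values of enum ibv_port_state given in this section, port_state_name is a hypothetical helper, and the returned strings are illustrative and may differ from what libibverbs returns.

```c
#include <stddef.h>

/* Local stand-in mirroring the values of enum ibv_port_state; real code
 * would include <infiniband/verbs.h> and call ibv_port_state_str(). */
enum port_state {
    PORT_NOP          = 0,
    PORT_DOWN         = 1,
    PORT_INIT         = 2,
    PORT_ARMED        = 3,
    PORT_ACTIVE       = 4,
    PORT_ACTIVE_DEFER = 5
};

/* Hypothetical helper: map a port state to a constant, human-readable
 * string. Values outside the table yield "unknown" instead of reading
 * past the end of the array. */
static const char *port_state_name(enum port_state state)
{
    static const char *const names[] = {
        "NOP", "DOWN", "INIT", "ARMED", "ACTIVE", "ACTIVE_DEFER"
    };
    size_t index = (size_t)state;

    if ((int)state < 0 || index >= sizeof(names) / sizeof(names[0]))
        return "unknown";
    return names[index];
}
```

Because the returned pointers refer to static storage, the caller does not free them, which matches the "constant string" wording of the Return Value entries above.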
123. ered before being posted as work requests. Users must deregister all registered memory by calling rdma_dereg_mr.

See Also
    rdma_cm(7), rdma_create_id(3), rdma_create_ep(3), rdma_reg_msgs(3), rdma_reg_write(3), ibv_reg_mr(3), ibv_dereg_mr(3), rdma_post_read(3)

6.1.3 rdma_reg_write

Template
    struct ibv_mr *rdma_reg_write(struct rdma_cm_id *id, void *addr, size_t length)

Input Parameters
    id        A reference to the communication identifier where the message buffer(s) will be used
    addr      The address of the memory buffer(s) to register
    length    The total length of the memory to register

Output Parameters
    ibv_mr    A reference to an ibv_mr struct of the registered memory region

Return Value
    A reference to the registered memory region on success, or NULL on failure. If an error occurs, errno will be set to indicate the failure reason.

Description
    rdma_reg_write registers a memory buffer which will be accessed by a remote RDMA write operation. Memory buffers registered using this routine may be targeted in an RDMA write request, allowing the buffer to be specified on the remote side of an RDMA connection as the remote_addr of an rdma_post_write or similar call.

    The memory buffer is registered with the protection domain associated with the rdma_cm_id. The start of the data buffer is specified through the addr parameter, and the total size of the
124. erforming a path migration change. The path migration may have been attempted automatically by the RDMA device or explicitly by the user.

This error usually occurs if the alternate path attributes are not consistent on the two ends of the connection. It could be, for example, that the DLID is not set correctly or that the source port is invalid. The event may also occur if a cable to the alternate port is unplugged.

7.1.9 IBV_EVENT_DEVICE_FATAL
    This event is generated when a catastrophic error is encountered on the channel adapter. The port, and possibly the channel adapter, becomes unusable.

    When this event occurs, the behavior of the RDMA device is undetermined and it is highly recommended to close the process immediately. Trying to destroy the RDMA resources may fail, and thus the device may be left in an unstable state.

7.1.10 IBV_EVENT_PORT_ACTIVE
    This event is generated when the link on a given port transitions to the active state. The link is now available for sending and receiving packets.

    This event means that the port_attr.state has moved from one of the following states:
        IBV_PORT_DOWN
        IBV_PORT_INIT
        IBV_PORT_ARMED
    to either:
        IBV_PORT_ACTIVE
        IBV_PORT_ACTIVE_DEFER

    This might happen, for example, when the SM configures the port. The event is generated by the device only if the IBV_DEVICE_PORT_ACTIVE_EVENT attribute is set in
125. erminates a link and executes transport level functions.

IB multicast groups, identified by MGIDs, are managed by the SM. The SM associates a MLID with each MGID and explicitly programs the IB switches in the fabric to ensure that the packets are received by all the ports that joined the multicast group.

MR (Memory Region)
    A contiguous set of memory buffers which have already been registered with access permissions. These buffers need to be registered in order for the network adapter to make use of them. During registration, an L_Key and R_Key are created and associated with the created memory region.

MTU (Maximum Transfer Unit)
    The maximum size of a packet payload (not including headers) that can be sent/received from a port.

MW (Memory Window)
    An allocated resource that enables remote access after being bound to a specified area within an existing Memory Registration. Each Memory Window has an associated Window Handle, set of access privileges, and current R_Key.

Outstanding Work Request
    A WR which was posted to a work queue and whose completion has not yet been polled.

pkey (Partition key)
    The pkey identifies a partition that the port belongs to. A pkey is roughly analogous to a VLAN ID in Ethernet networking. It is used to point to an entry within the port's partition key (pkey) table. Each port is assigned at least one pkey by the subnet manager (SM).

PD (Protection Domain)
    Object whose components can interact with only eac
126. 4.3.3 ibv_query_gid
    Template:
        int ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid *gid)
    Input Parameters:
        context   struct ibv_context from ibv_open_device
        port_num  physical port number (1 is first port)
        index     which entry in the GID table to return (0 is first)
    Output Parameters:
        gid       union ibv_gid containing gid information
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: ibv_query_gid retrieves an entry in the port's global identifier (GID) table. Each port is assigned at least one GID by the subnet manager (SM). The GID is a valid IPv6 address composed of the globally unique identifier (GUID) and a prefix assigned by the SM. GID[0] is unique and contains the port's GUID. The user should allocate a union ibv_gid and pass it to the command; it will be filled in upon successful return. The user is responsible for freeing this union.
    union ibv_gid is defined as follows:
        union ibv_gid {
            uint8_t raw[16];
            struct {
                uint64_t subnet_prefix;
                uint64_t interface_id;
            } global;
        };
    4.3.4 ibv_query_pkey
    Template:
        int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, int index, uint16_t *pkey)
    Input Parameters:
        context   struct ibv_context from ibv_open_device
        port_num  phy
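Because the ibv_query_gid description says a GID is laid out like an IPv6 address, its 16 raw bytes can be printed as eight 16-bit groups. The following is a minimal sketch under stated assumptions: `gid_sketch` is a local copy of the union layout quoted above (renamed so the example builds without the verbs headers), and `gid_to_str` is a hypothetical helper, not part of the verbs API.

```c
#include <stdint.h>
#include <stdio.h>

/* Local mirror of the union ibv_gid layout quoted in the text, renamed so
 * this sketch compiles without <infiniband/verbs.h>. */
union gid_sketch {
    uint8_t raw[16];
    struct {
        uint64_t subnet_prefix;
        uint64_t interface_id;
    } global;
};

/* Hypothetical helper: write the GID as eight colon-separated groups of
 * four hex digits. out should hold at least 40 bytes. Returns out. */
char *gid_to_str(const union gid_sketch *gid, char *out, size_t out_len)
{
    size_t used = 0;

    for (int i = 0; i < 16 && used < out_len; i += 2) {
        int n = snprintf(out + used, out_len - used, "%s%02x%02x",
                         i ? ":" : "",
                         (unsigned)gid->raw[i], (unsigned)gid->raw[i + 1]);
        if (n < 0)
            break;
        used += (size_t)n;
    }
    return out;
}
```

Working from the raw byte array avoids any byte-order conversion of the two 64-bit fields.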
127. es
        struct sockaddr mcast_sockaddr;
        struct rdma_cm_id *id;
        struct rdma_event_channel *channel;
        struct ibv_pd *pd;
        struct ibv_cq *cq;
        struct ibv_mr *mr;
        char *buf;
        struct ibv_ah *ah;
        uint32_t remote_qpn;
        uint32_t remote_qkey;
        pthread_t cm_thread;
    };

    /*
     * Function:    cm_thread
     *
     * Input:
     *      arg     The context object
     *
     * Output:
     *      none
     *
     * Returns:
     *      NULL
     *
     * Description:
     *      Reads any CM events that occur during the sending of data
     *      and prints out the details of the event
     */
    static void *cm_thread(void *arg)
    {
        struct rdma_cm_event *event;
        int ret;

        struct context *ctx = (struct context *) arg;

        while (1) {
            ret = rdma_get_cm_event(ctx->channel, &event);
            if (ret) {
                VERB_ERR("rdma_get_cm_event", ret);
                break;
            }

            printf("event %s, status %d\n",
                   rdma_event_str(event->event), event->status);

            rdma_ack_cm_event(event);
        }

        return NULL;
    }

    /*
     * Function:    get_cm_event
     *
     * Input:
     *      channel The event channel
     *      type    The event type that is expected
     *
     * Output:
     *      out_ev  The event will be passed back to the caller, if desired
     *              Set this to NULL and the event will be acked automatically
     *              Otherwise the caller must ack the event using rdma_ack_cm_event
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Waits for the next CM event and check that is matches the expected type in
128. es CQE to return
    Output Parameters:
        wc  CQE array
    Return Value: Number of CQEs in array wc, or -1 on error.
    Description: ibv_poll_cq retrieves CQEs from a completion queue (CQ). The user should allocate an array of struct ibv_wc and pass it to the call in wc. The number of entries available in wc should be passed in num_entries. It is the user's responsibility to free this memory. The number of CQEs actually retrieved is given as the return value. CQs must be polled regularly to prevent an overrun. In the event of an overrun, the CQ will be shut down and an async event IBV_EVENT_CQ_ERR will be sent.
    struct ibv_wc is defined as follows:
        struct ibv_wc {
            uint64_t wr_id;
            enum ibv_wc_status status;
            enum ibv_wc_opcode opcode;
            uint32_t vendor_err;
            uint32_t byte_len;
            uint32_t imm_data;  /* network byte order */
            uint32_t qp_num;
            uint32_t src_qp;
            enum ibv_wc_flags wc_flags;
            uint16_t pkey_index;
            uint16_t slid;
            uint8_t sl;
            uint8_t dlid_path_bits;
        };
    wr_id   user specified work request id as given in ibv_post_send or ibv_post_recv
    status  IBV_WC_SUCCESS
            IBV_WC_LOC_LEN_ERR
            IBV_WC_LOC_QP_OP_ERR
            IBV_WC_LOC_EEC_OP_ERR
            IBV_WC_LOC_PROT_ERR
            IBV_WC_WR_FLUSH_ERR
            IBV_WC_MW_BIND_ERR
            IBV_WC_BAD_RESP_ERR
            IBV_WC_LOC_ACCESS_ERR
            IBV_WC_REM_INV_REQ_ERR
            IBV_WC_REM_ACCESS_ERR
            IBV_WC_REM_OP_ERR
            IBV_WC_RET
129. estroyed before destroying the rdma_cm_id id.
    6.2 Active Queue Pair Operations
    6.2.1 rdma_post_recvv
    Template:
        int rdma_post_recvv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge)
    Input Parameters:
        id       A reference to the communication identifier where the message buffer(s) will be posted
        context  A user-defined context associated with the request
        sgl      A scatter-gather list of memory buffers posted as a single request
        nsge     The number of scatter-gather entries in the sgl array
    Output Parameters: None
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: rdma_post_recvv posts a single work request to the receive queue of the queue pair associated with the rdma_cm_id id. The posted buffers will be queued to receive an incoming message sent by the remote peer. Note that this routine supports multiple scatter-gather entries. The user is responsible for ensuring that the receive is posted, and that the total buffer space is large enough to contain all sent data, before the peer posts the corresponding send message. The message buffers must have been registered before being posted, and the buffers must remain registered until the receive completes. Messages may be posted to an rdma_cm_id only after a queue pair has been associated with it. A queu
130. esults in a different set of keys. struct ibv_mr is used to implement memory registration.
    3.3.5 Memory Window
    An MW allows the application to have more flexible control over remote access to its memory. Memory Windows are intended for situations where the application:
    • wants to grant and revoke remote access rights to a registered Region in a dynamic fashion with less of a performance penalty than using deregistration/registration or reregistration.
    • wants to grant different remote access rights to different remote agents and/or grant those rights over different ranges within a registered Region.
    The operation of associating an MW with an MR is called Binding. Different MWs can overlap the same MR (even with different access permissions).
    3.3.6 Address Vector
    An Address Vector is an object that describes the route from the local node to the remote node. In every UC/RC QP there is an address vector in the QP context. In a UD QP the address vector should be defined in every posted SR. struct ibv_ah is used to implement address vectors.
    3.3.7 Global Routing Header (GRH)
    The GRH is used for routing between subnets. When using RoCE, the GRH is used for routing inside the subnet and is therefore mandatory. The use of the GRH is mandatory in order for an application to support both IB and RoCE. When global routing is used on UD QPs, there will be a GRH contained in the first 40 bytes of the receive buffer. This area is used to store global rout
131. eswap.h>
    #include <unistd.h>
    #include <getopt.h>
    #include <rdma/rdma_cma.h>

    struct cmatest_node {
        int id;
        struct rdma_cm_id *cma_id;
        int connected;
        struct ibv_pd *pd;
        struct ibv_cq *cq;
        struct ibv_mr *mr;
        struct ibv_ah *ah;
        uint32_t remote_qpn;
        uint32_t remote_qkey;
        void *mem;
    };

    struct cmatest {
        struct rdma_event_channel *channel;
        struct cmatest_node *nodes;
        int conn_index;
        int connects_left;
        struct sockaddr_in6 dst_in;
        struct sockaddr *dst_addr;
        struct sockaddr_in6 src_in;
        struct sockaddr *src_addr;
    };

    static struct cmatest test;
    static int connections = 1;
    static int message_size = 100;
    static int message_count = 10;
    static int is_sender;
    static int unmapped_addr;
    static char *dst_addr;
    static char *src_addr;
    static enum rdma_port_space port_space = RDMA_PS_UDP;

    static int create_message(struct cmatest_node *node)
    {
        if (!message_size)
            message_count = 0;

        if (!message_count)
            return 0;

        /* Leave room for the GRH in front of the message data. */
        node->mem = malloc(message_size + sizeof(struct ibv_grh));
        if (!node->mem) {
            printf("failed message allocation\n");
            return -1;
        }
        node->mr = ibv_reg_mr(node->pd, node->mem,
                              message_size + sizeof(struct ibv_grh),
                              IBV_ACCESS_LOCAL_WRITE);
        if (!node->mr) {
            printf("failed to reg MR\n");
            goto err;
        }
        return 0;
    err:
132. et_device_guid returns the device's 64-bit Global Unique Identifier (GUID) in network byte order.
    4.2.5 ibv_open_device
    Template:
        struct ibv_context *ibv_open_device(struct ibv_device *device)
    Input Parameters:
        device  struct ibv_device for desired device
    Output Parameters: none
    Return Value: A verbs context that can be used for future operations on the device, or NULL on failure.
    Description: ibv_open_device provides the user with a verbs context, which is the object that will be used for all other verb operations.
    4.2.6 ibv_close_device
    Template:
        int ibv_close_device(struct ibv_context *context)
    Input Parameters:
        context  struct ibv_context from ibv_open_device
    Output Parameters: none
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: ibv_close_device closes the verb context previously opened with ibv_open_device. This operation does not free any other objects associated with the context. To avoid memory leaks, all other objects must be independently freed prior to calling this command.
    4.2.7 ibv_node_type_str
    Template:
        const char *ibv_node_type_str(enum ibv_node_type node_type)
    Input Parameters:
        node_type  ibv_node_type
133. eues (CQ) still associated with this completion channel.
    4.4 Protection Domain Operations
    Once you have established a protection domain (PD), you may create objects within that domain. This section describes operations available on a PD. These include registering memory regions (MR), creating queue pairs (QP) or shared receive queues (SRQ), and address handles (AH).
    4.4.1 ibv_reg_mr
    Template:
        struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access)
    Input Parameters:
        pd      protection domain, struct ibv_pd from ibv_alloc_pd
        addr    memory base address
        length  length of memory region in bytes
        access  access flags
    Output Parameters: none
    Return Value: pointer to created memory region (MR), or NULL on failure.
    Description: ibv_reg_mr registers a memory region (MR), associates it with a protection domain (PD), and assigns it local and remote keys (lkey, rkey). All VPI commands that use memory require the memory to be registered via this command. The same physical memory may be mapped to different MRs, even allowing different permissions or PDs to be assigned to the same memory, depending on user requirements. Access flags may be the bitwise OR of one or more of the following enumerations:
        IBV_ACCESS_LOCAL_WRITE   Allow local host write access
        IBV_ACCESS_REMOTE_WRITE  Allow remote hosts write
134. events
        IBV_EVENT_SRQ_ERR            Error occurred on an SRQ
        IBV_EVENT_SRQ_LIMIT_REACHED  SRQ limit was reached
    Port events
        IBV_EVENT_PORT_ACTIVE        Link became active on a port
        IBV_EVENT_PORT_ERR           Link became unavailable on a port
        IBV_EVENT_LID_CHANGE         LID was changed on a port
        IBV_EVENT_PKEY_CHANGE        P_Key table was changed on a port
        IBV_EVENT_SM_CHANGE          SM was changed on a port
        IBV_EVENT_CLIENT_REREGISTER  SM sent a CLIENT_REREGISTER request to a port
        IBV_EVENT_GID_CHANGE         GID table was changed on a port
    CA events
        IBV_EVENT_DEVICE_FATAL       CA is in FATAL state
    4.7.2 ibv_ack_async_event
    Template:
        void ibv_ack_async_event(struct ibv_async_event *event)
    Input Parameters:
        event  A pointer to the event to be acknowledged
    Output Parameters: None
    Return Value: None
    Description: All async events that ibv_get_async_event returns must be acknowledged using ibv_ack_async_event. To avoid races, destroying an object (CQ, SRQ or QP) will wait for all affiliated events for the object to be acknowledged; this avoids an application retrieving an affiliated event after the corresponding object has already been destroyed.
    4.7.3 ibv_event_type_str
    Template:
        const char *ibv_event_type_str(enum ibv_event_type event_type)
    Input Parameters:
        event_type  ibv_event_type enum value
    Output Parameters: N
135. f (struct rdma_cm_id *));
        memset(ctx.conn_id, 0, sizeof(ctx.conn_id));

        ctx.send_buf = (char *) malloc(ctx.msg_length);
        memset(ctx.send_buf, 0, ctx.msg_length);
        ctx.recv_buf = (char *) malloc(ctx.msg_length);
        memset(ctx.recv_buf, 0, ctx.msg_length);

        if (ctx.server)
            ret = run_server(&ctx, rai);
        else
            ret = run_client(&ctx, rai);

        destroy_resources(&ctx);
        free(rai);

        return ret;
    }
136. f items to query (see ibv_modify_qp)
    Output Parameters:
        attr       struct ibv_qp_attr to be filled in with requested attributes
        init_attr  struct ibv_qp_init_attr to be filled in with initial attributes
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: ibv_query_qp retrieves the various attributes of a queue pair (QP) as previously set through ibv_create_qp and ibv_modify_qp. The user should allocate a struct ibv_qp_attr and a struct ibv_qp_init_attr and pass them to the command. These structs will be filled in upon successful return. The user is responsible for freeing these structs. struct ibv_qp_init_attr is described in ibv_create_qp, and struct ibv_qp_attr is described in ibv_modify_qp.
    4.6.2 ibv_query_srq
    Template:
        int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr)
    Input Parameters:
        srq       The SRQ to query
        srq_attr  The attributes of the specified SRQ
    Output Parameters:
        srq_attr  The struct ibv_srq_attr is returned with the attributes of the specified SRQ
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: ibv_query_srq returns the attributes list and current values of the specified SRQ. It returns the attributes through the pointer srq_attr, which is an ibv_srq_attr struct described above
137. fers must remain registered until the send completes. This routine supports multiple scatter-gather entries. Send operations may not be posted to an rdma_cm_id or the corresponding queue pair until a connection has been established. The user-defined context associated with the send request will be returned to the user through the work completion work request identifier (wr_id) field.
    6.2.3 rdma_post_readv
    Template:
        int rdma_post_readv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge, int flags, uint64_t remote_addr, uint32_t rkey)
    Input Parameters:
        id           A reference to the communication identifier where the request will be posted
        context      A user-defined context associated with the request
        sgl          A scatter-gather list of the destination buffers of the read
        nsge         The number of scatter-gather entries in the sgl array
        flags        Optional flags used to control the read operation
        remote_addr  The address of the remote registered memory to read from
        rkey         The registered memory key associated with the remote address
    Output Parameters: None
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: rdma_post_readv posts a work request to the send queue of the queue pair associated with the rdma_cm_id id. The contents of the remote memory region at remote_addr will b
138. fined context associated with the request
        addr         The address of the local destination of the read request
        length       The length of the read operation
        mr           Registered memory region associated with the local buffer
        flags        Optional flags used to control the read operation
        remote_addr  The address of the remote registered memory to read from
        rkey         The registered memory key associated with the remote address
    Output Parameters: None
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: rdma_post_read posts a work request to the send queue of the queue pair associated with the rdma_cm_id. The contents of the remote memory region will be read into the local data buffer. For a list of supported flags, see ibv_post_send. The user must ensure that both the remote and local data buffers have been registered before the read is issued, and the buffers must remain registered until the read completes. Read operations may not be posted to an rdma_cm_id or the corresponding queue pair until it has been connected. The user-defined context associated with the read request will be returned to the user through the work completion wr_id (work request identifier) field.
    6.2.8 rdma_post_write
    Template:
        int rdma_post_write(struct rdma_cm_id *id, void *context, void *addr, size_t length, struct ibv_mr *mr, int flags, uint64_t
139. g must be provided by a higher level protocol.
    3.2.3 Unreliable Datagram (UD)
    A Queue Pair may transmit and receive single-packet messages to/from any other UD QP. Ordering and delivery are not guaranteed, and delivered packets may be dropped by the receiver. Multicast messages are supported (one to many). A UD connection is very similar to a UDP connection.
    3.3 Key Concepts
    3.3.1 Send Request (SR)
    An SR defines how much data will be sent, from where, how and, with RDMA, to where. struct ibv_send_wr is used to implement SRs.
    3.3.2 Receive Request (RR)
    An RR defines buffers where data is to be received for non-RDMA operations. If no buffers are defined and a transmitter attempts a send operation or an RDMA Write with immediate, a receive not ready (RNR) error will be sent. struct ibv_recv_wr is used to implement RRs.
    3.3.3 Completion Queue
    A Completion Queue is an object which contains the completed work requests which were posted to the Work Queues (WQ). Every completion indicates that a specific WR was completed (both successfully completed WRs and unsuccessfully completed WRs). A Completion Queue is a mechanism to notify the application about information of ended Work Requests (status, opcode, size, source). CQs have n Completion Queue Entries (CQE). The number of CQEs is specifi
140. get_cq_event operation to receive the notification. The notification mechanism will only be armed for one notification. Once a notification is sent, the mechanism must be re-armed with a new call to ibv_req_notify_cq.
    4.6.8 ibv_get_cq_event
    Template:
        int ibv_get_cq_event(struct ibv_comp_channel *channel, struct ibv_cq **cq, void **cq_context)
    Input Parameters:
        channel     struct ibv_comp_channel from ibv_create_comp_channel
    Output Parameters:
        cq          pointer to completion queue (CQ) associated with event
        cq_context  user supplied context set in ibv_create_cq
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: ibv_get_cq_event waits for a notification to be sent on the indicated completion channel (CC). Note that this is a blocking operation. The user should allocate pointers to a struct ibv_cq and a void to be passed into the function. They will be filled in with the appropriate values upon return. It is the user's responsibility to free these pointers. Each notification sent MUST be acknowledged with the ibv_ack_cq_events operation. Since the ibv_destroy_cq operation waits for all events to be acknowledged, it will hang if any events are not properly acknowledged. Once a notification for a completion queue (CQ) is sent on a CC, that CQ is now disarmed and will not se
141. gramming Examples Using RDMA Verbs
        9.1 Automatic Path Migration (APM)
        9.2 Multicast Code Example Using RDMA CM
        9.3 Shared Receive Queue (SRQ)
    Revision History
    Table 1: Revision History
        Rev 1.3 (Sep 2012):
            • Added new verbs and structures from verbs.h
            • Added new verbs and structures from rdma_cma.h
            • Added new verbs and structures from rdma_verbs.h
            • Added RDMA_CM_EVENTS
            • Added IBV_EVENTS
            • Added IBV_WC Status Codes
            • Added additional programming examples using RDMA Verbs: APM, Multicast and SRQ
            • Added discussion regarding the differences between RDMA over IB transport versus RoCE
        Rev 1.2 (Jan 2010):
            • Updated Programming Example Appendix A
            • Added RDMAoE support
        Rev 1.1 (Oct 2009):
            • Integrated Low-Latency-Ethernet API, RDMA_CM, VPI and Multicast code example
        Rev 1.0 (Mar 2009):
            • Reorganized programming example
    Glossary
    Table 2: Glossary
        Access Layer: Low level operating system infrastructure (plumbing) used for accessing the interconnect fabric (VPI, InfiniBand, Ethernet, FCoE). It includes all basic transport services needed to support upper level network protocols, middleware
142. sock_connect_exit:
        if (listenfd)
            close(listenfd);
        if (resolved_addr)
            freeaddrinfo(resolved_addr);
        if (sockfd < 0)
        {
            if (servername)
                fprintf(stderr, "Couldn't connect to %s:%d\n", servername, port);
            else
            {
                perror("server accept");
                fprintf(stderr, "accept() failed\n");
            }
        }
        return sockfd;
    }

    /******************************************************************************
     * Function: sock_sync_data
     *
     * Input
     *      sock         socket to transfer data on
     *      xfer_size    size of data to transfer
     *      local_data   pointer to data to be sent to remote
     *
     * Output
     *      remote_data  pointer to buffer to receive remote data
     *
     * Returns
     *      0 on success, negative error code on failure
     *
     * Description
     *      Sync data across a socket. The indicated local data will be sent to the
     *      remote. It will then wait for the remote to send its data back. It is
     *      assumed that the two sides are in sync and call this function in the proper
     *      order. Chaos will ensue if they are not. :)
     *
     *      Also note this is a blocking function and will wait for the full data to be
     *      received from the remote.
     ******************************************************************************/
    int sock_sync_data(int sock, int xfer_size, char *local_data, char *remote_da
143. h other. AHs interact with QPs, and MRs interact with WQs.
    QP (Queue Pair): The pair (send queue and receive queue) of independent WQs packed together in one object for the purpose of transferring data between nodes of a network. Posts are used to initiate the sending or receiving of data. There are three types of QP: UD (Unreliable Datagram), UC (Unreliable Connection), and RC (Reliable Connection).
    RC (Reliable Connection): A QP Transport service type based on a connection-oriented protocol. A QP (Queue Pair) is associated with another single QP. The messages are sent in a reliable way (in terms of the correctness and order of the information).
    RDMA (Remote Direct Memory Access): Accessing memory in a remote side without involvement of the remote CPU.
    RDMA_CM (Remote Direct Memory Access Communication Manager): API used to set up reliable, connected and unreliable datagram data transfers. It provides an RDMA transport-neutral interface for establishing connections. The API is based on sockets, but adapted for queue pair (QP) based semantics: communication must be over a specific RDMA device, and data transfers are message based.
    Requestor: The side of the connection that will initiate a data transfer, by posting a send request.
    Responder: The side of the connection
144. hange it. When that happens, an IBV_EVENT_GID_CHANGE event is generated. If a user caches the values of the GID table, then these must be flushed when the IBV_EVENT_GID_CHANGE event is received.
    7.2 IBV WC Events
    7.2.1 IBV_WC_SUCCESS
    The Work Request completed successfully.
    7.2.2 IBV_WC_LOC_LEN_ERR
    This event is generated when the receive buffer is smaller than the incoming send. It is generated on the receiver side of the connection.
    7.2.3 IBV_WC_LOC_QP_OP_ERR
    This event is generated when a QP error occurs. For example, it may be generated if a user neglects to specify responder_resources and initiator_depth values in struct rdma_conn_param before calling rdma_connect on the client side and rdma_accept on the server side.
    7.2.4 IBV_WC_LOC_EEC_OP_ERR
    This event is generated when there is an error related to the local EEC's receive logic while executing the request packet. The responder is unable to complete the request. This error is not caused by the sender.
    7.2.5 IBV_WC_LOC_PROT_ERR
    This event is generated when a user attempts to access an address outside of the registered memory region. For example, this may happen if the Lkey does not match the address in the WR.
    7.2.6 IBV_WC_WR_FLUSH_ERR
    This event is generated when an invalid remote error is thrown when the responder detects an invalid request. It may be that the operation is not supported by the request queue or there is in
145. he rdma_cm_id id. The posted buffer will be queued to receive an incoming message sent by the remote peer. The user is responsible for ensuring that the receive buffer is posted and is large enough to contain all sent data before the peer posts the corresponding send message. The buffer must have already been registered before being posted, with the mr parameter referencing the registration. The buffer must remain registered until the receive completes. Messages may be posted to an rdma_cm_id only after a queue pair has been associated with it. A queue pair is bound to an rdma_cm_id after calling rdma_create_ep, or rdma_create_qp if the rdma_cm_id is allocated using rdma_create_id. The user-defined context associated with the receive request will be returned to the user through the work completion request identifier (wr_id) field. Note that this is a simple receive call. There are no scatter-gather lists involved here.
    6.2.6 rdma_post_send
    Template:
        int rdma_post_send(struct rdma_cm_id *id, void *context, void *addr, size_t length, struct ibv_mr *mr, int flags)
    Input Parameters:
        id       A reference to the communication identifier where the message buffer will be posted
        context  A user-defined context associated with the request
        addr     The address of the memory buffer to post
        length   The length of the memory buffer
        mr       Optional registered memory region associated with the posted
146. hould open any desired devices and promptly free the list via the ibv_free_device_list command.
    4.2.2 ibv_free_device_list
    Template:
        void ibv_free_device_list(struct ibv_device **list)
    Input Parameters:
        list  list of devices provided from ibv_get_device_list command
    Output Parameters: none
    Return Value: none
    Description: ibv_free_device_list frees the list of ibv_device structs provided by ibv_get_device_list. Any desired devices should be opened prior to calling this command. Once the list is freed, all ibv_device structs that were on the list are invalid and can no longer be used.
    4.2.3 ibv_get_device_name
    Template:
        const char *ibv_get_device_name(struct ibv_device *device)
    Input Parameters:
        device  struct ibv_device for desired device
    Output Parameters: none
    Return Value: Pointer to device name char string, or NULL on failure.
    Description: ibv_get_device_name returns a pointer to the device name contained within the ibv_device struct.
    4.2.4 ibv_get_device_guid
    Template:
        uint64_t ibv_get_device_guid(struct ibv_device *device)
    Input Parameters:
        device  struct ibv_device for desired device
    Output Parameters: none
    Return Value: 64 bit GUID
    Description: ibv_g
147. iBand efficiency and scalability have made it the optimal performance and cost/performance interconnect solution for the world's leading high-performance computing, cloud, Web 2.0, storage, database and financial data centers and applications. InfiniBand is a standard technology, defined and specified by the IBTA organization.
    1.2 RDMA over Converged Ethernet (RoCE)
    RoCE is a standard for RDMA over Ethernet that is also defined and specified by the IBTA organization. RoCE provides true RDMA semantics for Ethernet, as it does not require the complex and low-performance TCP transport (needed for iWARP, for example). RoCE is the most efficient low-latency Ethernet solution today. It requires very low CPU overhead and takes advantage of Priority Flow Control in Data Center Bridging Ethernet for lossless connectivity. RoCE has been fully supported by the Open Fabrics Software since the release of OFED 1.5.1.
    1.3 Comparison of RDMA Technologies
    Currently, there are three technologies that support RDMA: InfiniBand, Ethernet RoCE and Ethernet iWARP. All three technologies share a common user API, which is defined in this document, but have different physical and link layers. When it comes to the Ethernet solutions, RoCE has clear performance advantages over iWARP in latency, throughput and CPU overhead. RoCE is supported by many leading solutions, and is incorporated within Windows Server software (as well as InfiniBand). RDMA tech
148. ibv_create_ah creates an AH. An AH contains all of the necessary data to reach a remote destination. In connected transport modes (RC, UC) the AH is associated with a queue pair (QP). In the datagram transport modes (UD), the AH is associated with a work request (WR).
    struct ibv_ah_attr is defined as follows:
        struct ibv_ah_attr {
            struct ibv_global_route grh;
            uint16_t dlid;
            uint8_t sl;
            uint8_t src_path_bits;
            uint8_t static_rate;
            uint8_t is_global;
            uint8_t port_num;
        };
        grh            defined below
        dlid           destination lid
        sl             service level
        src_path_bits  source path bits
        static_rate    static rate
        is_global      this is a global address, use grh
        port_num       physical port number to use to reach this destination
    struct ibv_global_route is defined as follows:
        struct ibv_global_route {
            union ibv_gid dgid;
            uint32_t flow_label;
            uint8_t sgid_index;
            uint8_t hop_limit;
            uint8_t traffic_class;
        };
        dgid           destination GID (see ibv_query_gid for definition)
        flow_label     flow label
        sgid_index     index of source GID (see ibv_query_gid)
        hop_limit      hop limit
        traffic_class  traffic class
    4.4.16 ibv_destroy_ah
    Template:
        int ibv_destroy_ah(struct ibv_ah *ah)
    Input Parameters:
        ah  struct ibv_ah from ibv_create_ah
    Output Parameters: none
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set
149. in xrc_domain. This function may fail if the number xrc_qp_num is not the number of a valid XRC receive QP (for example, if the QP is not allocated or it is the number of a non-XRC QP), or the XRC receive QP was created with an XRC domain other than xrc_domain.
    4.4.14 ibv_unreg_xrc_rcv_qp
    Template:
        int ibv_unreg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num)
    Input Parameters:
        xrc_domain  The XRC domain associated with the XRC receive QP from which the user wishes to unregister
        xrc_qp_num  The QP number from which the user process is to be unregistered
    Output Parameters: None
    Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
    Description: ibv_unreg_xrc_rcv_qp unregisters a user process from the XRC receive QP number xrc_qp_num, which is associated with the XRC domain xrc_domain. When the number of user processes registered with this XRC receive QP drops to zero, the QP is destroyed.
    4.4.15 ibv_create_ah
    Template:
        struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
    Input Parameters:
        pd    struct ibv_pd from ibv_alloc_pd
        attr  attributes of address
    Output Parameters: none
    Return Value: pointer to created address handle (AH), or NULL on failure.
    Description:
150. ret = ibv_modify_qp(ctx->id->qp, &qp_attr,
                            IBV_QP_ALT_PATH | IBV_QP_PATH_MIG_STATE);
        if (ret) {
            VERB_ERR("ibv_modify_qp", ret);
            return ret;
        }
    }

    /*
     * Function:    reg_mem
     *
     * Input:
     *      ctx     The context object
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Registers memory regions to use for our data transfer
     */
    int reg_mem(struct context *ctx)
    {
        ctx->send_buf = (char *) malloc(ctx->msg_length);
        memset(ctx->send_buf, 0x12, ctx->msg_length);

        ctx->recv_buf = (char *) malloc(ctx->msg_length);
        memset(ctx->recv_buf, 0x00, ctx->msg_length);

        ctx->send_mr = rdma_reg_msgs(ctx->id, ctx->send_buf, ctx->msg_length);
        if (!ctx->send_mr) {
            VERB_ERR("rdma_reg_msgs", -1);
            return -1;
        }

        ctx->recv_mr = rdma_reg_msgs(ctx->id, ctx->recv_buf, ctx->msg_length);
        if (!ctx->recv_mr) {
            VERB_ERR("rdma_reg_msgs", -1);
            return -1;
        }

        return 0;
    }

    /*
     * Function:    getaddrinfo_and_create_ep
     *
     * Input:
     *      ctx     The context object
     *
     * Output:
     *      none
     *
     * Returns:
     *      0 on success, non-zero on failure
     *
     * Description:
     *      Gets the address information and creates our endpoint
     */
    int getaddrinfo_and_create_ep(struct context *ctx)
    {
        int ret;
        struct rdma_addrinfo *rai, hints;
        struct ibv_qp_init_attr qp_init
151. ing information so an appropriate address vector can be generated to respond to the received packet. If GRH is used with UD, the RR should always have an extra 40 bytes available for this GRH. struct ibv_grh is used to implement GRHs.

3.3.8 Protection Domain

Object whose components can interact with only each other. These components can be AH, QP, MR, and SRQ. A protection domain is used to associate Queue Pairs with Memory Regions and Memory Windows, as a means for enabling and controlling network adapter access to Host System memory. PDs are also used to associate Unreliable Datagram queue pairs with Address Handles, as a means of controlling access to UD destinations. struct ibv_pd is used to implement protection domains.

3.3.9 Asynchronous Events

The network adapter may send async events to inform the SW about events that occurred in the system. There are two types of async events:

Affiliated events: events that occurred to personal objects (CQ, QP, SRQ). Those events will be sent to a specific process.

Unaffiliated events: events that occurred to global objects (network adapter, port error). Those events will be sent to all processes.

3.3.10 Scatter Gather

Data is gathered or scattered using scatter gather elements, which include:

Address: address of the local data buffer that the data will be gathered from or scattered to.

Size: the size of the data that wi
152. int srq_attr_mask)

Input Parameters
    srq            The SRQ to modify
    srq_attr       Specifies the SRQ to modify (input); the current values of the selected SRQ attributes are returned (output)
    srq_attr_mask  A bit mask used to specify which SRQ attributes are being modified

Output Parameters
    srq_attr       The struct ibv_srq_attr is returned with the updated values

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_modify_srq modifies the attributes of the SRQ srq using the attribute values in srq_attr based on the mask srq_attr_mask. srq_attr is an ibv_srq_attr struct as defined above under the verb ibv_create_srq. The argument srq_attr_mask specifies the SRQ attributes to be modified. It is either 0 or the bitwise OR of one or more of the flags:

    IBV_SRQ_MAX_WR  Resize the SRQ
    IBV_SRQ_LIMIT   Set the SRQ limit

If any of the attributes to be modified is invalid, none of the attributes will be modified. Also, not all devices support resizing SRQs. To check if a device supports resizing, check if the IBV_DEVICE_SRQ_RESIZE bit is set in the device capabilities flags.

Modifying the SRQ limit arms the SRQ to produce an IBV_EVENT_SRQ_LIMIT_REACHED "low watermark" async event once the number of WRs in the SRQ drops below the SRQ limit.

4.4.7 ibv_destroy_srq

Template
    int ibv_destr
153.
        int32_t   bad_pkey_cntr;
        uint32_t  qkey_viol_cntr;
        uint16_t  pkey_tbl_len;
        uint16_t  lid;
        uint16_t  sm_lid;
        uint8_t   lmc;
        uint8_t   max_vl_num;
        uint8_t   sm_sl;
        uint8_t   subnet_timeout;
        uint8_t   init_type_reply;
        uint8_t   active_width;
        uint8_t   active_speed;
        uint8_t   phys_state;
    };

    state            IBV_PORT_NOP, IBV_PORT_DOWN, IBV_PORT_INIT, IBV_PORT_ARMED, IBV_PORT_ACTIVE, IBV_PORT_ACTIVE_DEFER
    max_mtu          Maximum Transmission Unit (MTU) supported by port. Can be: IBV_MTU_256, IBV_MTU_512, IBV_MTU_1024, IBV_MTU_2048, IBV_MTU_4096
    active_mtu       Actual MTU in use
    gid_tbl_len      Length of source global ID (GID) table
    port_cap_flags   Supported capabilities of this port. There are currently no enumerations/defines declared in verbs.h
    max_msg_sz       Maximum message size
    bad_pkey_cntr    Bad P_Key counter
    qkey_viol_cntr   Q_Key violation counter
    pkey_tbl_len     Length of partition table
    lid              First local identifier (LID) assigned to this port
    sm_lid           LID of subnet manager (SM)
    lmc              LID Mask control (used when multiple LIDs are assigned to port)
    max_vl_num       Maximum virtual lanes (VL)
    sm_sl            SM service level (SL)
    subnet_timeout   Subnet propagation delay
    init_type_reply  Type of initialization performed by SM
    active_width     Currently active link width
    active_speed     Currently active link speed
    phys_state       Physical port state
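The max_mtu and active_mtu fields hold enumeration codes rather than byte counts. The following is a minimal sketch of the conversion, assuming the codes run from 1 (IBV_MTU_256) to 5 (IBV_MTU_4096) as in verbs.h. The enum mtu_code type and both helper functions are hypothetical stand-ins written for this illustration so that it compiles without the verbs headers; they are not part of the verbs API.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for enum ibv_mtu (assumption: values 1..5 mirror verbs.h). */
enum mtu_code {
    MTU_256  = 1,
    MTU_512  = 2,
    MTU_1024 = 3,
    MTU_2048 = 4,
    MTU_4096 = 5
};

/* Convert an MTU code to a size in bytes: 1 << (code + 7).
 * Returns 0 for a code outside the known range. */
static size_t mtu_code_to_bytes(int code)
{
    if (code < MTU_256 || code > MTU_4096)
        return 0;
    return (size_t)1 << (code + 7);
}

/* Hypothetical helper: does a message of msg_length bytes fit in one MTU? */
static int fits_in_mtu(size_t msg_length, int code)
{
    size_t mtu = mtu_code_to_bytes(code);
    return mtu != 0 && msg_length <= mtu;
}
```

With a real device, the code passed in would be port_attr.active_mtu as filled in by ibv_query_port. A check of this kind matters mainly for UD traffic, where a message cannot exceed one MTU.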
154. ion

See Also
    rdma_create_event_channel, rdma_get_cm_event, rdma_ack_cm_event

5.2 Connection Manager (CM) ID Operations

5.2.1 rdma_create_id

Template
    int rdma_create_id(struct rdma_event_channel *channel, struct rdma_cm_id **id, void *context, enum rdma_port_space ps)

Input Parameters
    channel  The communication channel that events associated with the allocated rdma_cm_id will be reported on
    id       A reference where the allocated communication identifier will be returned
    context  User specified context associated with the rdma_cm_id
    ps       RDMA port space

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
Creates an identifier that is used to track communication information.

Notes
rdma_cm_ids are conceptually equivalent to a socket for RDMA communication. The difference is that RDMA communication requires explicitly binding to a specified RDMA device before communication can occur, and most operations are asynchronous in nature. Communication events on an rdma_cm_id are reported through the associated event channel. Users must release the rdma_cm_id by calling rdma_destroy_id.

PORT SPACES
Details of the services provided by the different port spaces are outlined below.

    RDMA_PS_TCP  Provides reliable connectio
155. ioned into the IBV_QPS_ERR state, either automatically by the RDMA device or explicitly by the user. This may have happened either because a completion with error was generated for the last WQE, or the QP transitioned into the IBV_QPS_ERR state and there are no more WQEs on the Receive Queue of the QP.

This event actually means that no more WQEs will be consumed from the SRQ by this QP.

If an error occurs to a QP and this event is not generated, the user must destroy all of the QPs associated with this SRQ, as well as the SRQ itself, in order to reclaim all of the WQEs associated with the offending QP. At the minimum, the QP which is in the error state must have its state changed to Reset for recovery.

7.1.18 IBV_EVENT_CLIENT_REREGISTER

This event is generated when the SM sends a request to a given port for client reregistration for all subscriptions previously requested for the port. This could happen if the SM suffers a failure and, as a result, loses its own records of the subscriptions. It may also happen if a new SM becomes operational on the subnet.

The event will be generated by the device only if the bit that indicates a client reregister is supported is set in port_attr.port_cap_flags.

7.1.19 IBV_EVENT_GID_CHANGE

This event is generated when a GID changes on a given port. The GID table is configured by the SM, and this also means that the SM can c
156. ions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Compile Command:
    gcc apm.c -o apm -libverbs -lrdmacm

Description:
This example demonstrates Automatic Path Migration (APM). The basic flow is as follows:
1. Create connection between client and server
2. Set the alternate path details on each side of the connection
3. Perform send operations back and forth between client and server
4. Cause the path to be migrated (manually or automatically)
5. Complete sends using the alternate path

There are two ways to cause the path to be migrated:
1. Use the ibv_modify_qp verb to set path_mig_state = IBV_MIG_MIGRATED
2. Assuming there are two ports on at least one side of the connection, and each port has a path to the other host, pull out the cable of the original
157. is chapter describes the details of the events that occur when using the VPI API.

7.1 IBV Events

7.1.1 IBV_EVENT_CQ_ERR

This event is triggered when a Completion Queue (CQ) overrun occurs or, in rare cases, due to a protection error. When this happens, there are no guarantees that completions from the CQ can be pulled. All of the QPs associated with this CQ, on either the Receive or Send Queue, will also get the IBV_EVENT_QP_FATAL event. When this event occurs, the best course of action is for the user to destroy and recreate the resources.

7.1.2 IBV_EVENT_QP_FATAL

This event is generated when an error occurs on a Queue Pair (QP) which prevents the generation of completions while accessing or processing the Work Queue on either the Send or Receive Queues. The user must modify the QP state to Reset for recovery. It is the responsibility of the software to ensure that all error processing is completed prior to calling the modify QP verb to change the QP state to Reset.

If the problem that caused this event is in the CQ of that Work Queue, the appropriate CQ will also receive the IBV_EVENT_CQ_ERR event. In the event of a CQ error, it is best to destroy and recreate the resources.

7.1.3 IBV_EVENT_QP_REQ_ERR

This event is generated when the transport layer of the RDMA device detects a transport error violation on the responder side. The error may be caused by the use of an unsupported or reserved opcode, or the use of an out of sequence opcode
158. l be set to indicate the reason for the failure.

Description
rdma_migrate_id migrates a communication identifier to a different event channel and moves any pending events associated with the rdma_cm_id to the new channel.

No polling for events on the rdma_cm_id's current channel, nor running of any routines on the rdma_cm_id, should be done while migrating between channels. rdma_migrate_id will block while there are any unacknowledged events on the current event channel.

If the channel parameter is NULL, then the specified rdma_cm_id will be placed into synchronous operation mode. All calls on the id will block until the operation completes.

5.2.4 rdma_set_option

Template
    int rdma_set_option(struct rdma_cm_id *id, int level, int optname, void *optval, size_t optlen)

Input Parameters
    id       RDMA communication identifier
    level    Protocol level of the option to set
    optname  Name of the option to set
    optval   Reference to the option data
    optlen   The size of the option data (optval) buffer

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_set_option sets communication options for an rdma_cm_id. Option levels and details may be found in the enums in the relevant header files.
159.
    static int resources_create(struct resources *res)
    {
        struct ibv_device **dev_list = NULL;
        struct ibv_qp_init_attr qp_init_attr;
        struct ibv_device *ib_dev = NULL;
        size_t size;
        int i;
        int mr_flags = 0;
        int cq_size = 0;
        int num_devices;
        int rc = 0;

        /* if client side */
        if (config.server_name)
        {
            res->sock = sock_connect(config.server_name, config.tcp_port);
            if (res->sock < 0)
            {
                fprintf(stderr, "failed to establish TCP connection to server %s, port %d\n",
                        config.server_name, config.tcp_port);
                rc = -1;
                goto resources_create_exit;
            }
        }
        else
        {
            fprintf(stdout, "waiting on port %d for TCP connection\n", config.tcp_port);
            res->sock = sock_connect(NULL, config.tcp_port);
            if (res->sock < 0)
            {
                fprintf(stderr, "failed to establish TCP connection with client on port %d\n",
                        config.tcp_port);
                rc = -1;
                goto resources_create_exit;
            }
        }
        fprintf(stdout, "TCP connection was established\n");
        fprintf(stdout, "searching for IB devices in host\n");

        /* get device names in the system */
        dev_list = ibv_get_device_list(&num_devices);
        if (!dev_list)
        {
            fprintf(stderr, "failed to get IB devices list\n");
            rc = 1;
            goto resources_create_exit;
        }

        /* if there isn't any IB device in host */
        if (!num_devices)
        {
            fprintf(stderr, "found %d device(s)\n", num_devices);
            rc = 1;
            goto resources_create_exit;
        }
        fprintf(stdou
160. le -L/usr/local/ofed/lib64 -L/usr/local/ofed/lib -libverbs RDMA_RC_example.c

Copyright (c) 2009 Mellanox Technologies. All rights reserved.

This software is available to you under a choice of one of two licenses. You may choose to be licensed under the terms of the GNU General Public License (GPL) Version 2, available from the file COPYING in the main directory of this source tree, or the OpenIB.org BSD license below:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
161.
    /*
     * RDMA Aware Networks Programming Example
     *
     * This code demonstrates how to perform the following operations
     * using the VPI Verbs API:
     *
     *      Send
     *      Receive
     *      RDMA Read
     *      RDMA Write
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <stdint.h>
    #include <inttypes.h>
    #include <endian.h>
    #include <byteswap.h>
    #include <getopt.h>
    #include <sys/time.h>
    #include <arpa/inet.h>
    #include <infiniband/verbs.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netdb.h>

    /* poll CQ timeout in millisec (2 seconds) */
    #define MAX_POLL_CQ_TIMEOUT 2000
    #define MSG      "SEND operation      "
    #define RDMAMSGR "RDMA read operation "
    #define RDMAMSGW "RDMA write operation"
    #define MSG_SIZE (strlen(MSG) + 1)

    #if __BYTE_ORDER == __LITTLE_ENDIAN
    static inline uint64_t htonll(uint64_t x) { return bswap_64(x); }
    static inline uint64_t ntohll(uint64_t x) { return bswap_64(x); }
    #elif __BYTE_ORDER == __BIG_ENDIAN
    static inline uint64
162. ll be read from/written to this address.

L_Key: the local key of the MR that was registered to this buffer.

struct ibv_sge implements scatter gather elements.

3.3.11 Polling

Polling the CQ for completion is getting the details about a WR (Send or Receive) that was posted. If we have a completion with bad status in a WR, the rest of the completions will all be bad (and the Work Queue will be moved to error state). Every WR that does not have a completion (that was polled) is still outstanding. Only after a WR has a completion may the send/receive buffer be used, reused, or freed. The completion status should always be checked. When a CQE is polled, it is removed from the CQ. Polling is accomplished with the ibv_poll_cq operation.

3.4 Typical Application

This document provides two program examples:

The first code, RDMA_RC_example, uses the VPI verbs API, demonstrating how to perform RC: Send, Receive, RDMA Read and RDMA Write operations.

The second code, multicast example, uses the RDMA_CM verbs API, demonstrating Multicast UD.

The structure of a typical application is as follows. The functions in the programming example that implement each step are indicated in bold.

1. Get the device list
First you must retrieve the list of available IB devices on the local host. Ever
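To illustrate what a scatter gather element describes, the sketch below models the "gather" half in ordinary software: each element names a buffer by address and size, and the elements are concatenated in list order. The struct sge type and the gather function are hypothetical stand-ins for this illustration only. On real hardware the network adapter performs this step itself, using struct ibv_sge entries attached to a work request, and the L_Key is validated against the registered MR rather than ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in mirroring the three fields of a scatter gather element. */
struct sge {
    uint64_t addr;    /* address of the local data buffer */
    uint32_t length;  /* size of the data at that address, in bytes */
    uint32_t lkey;    /* local key of the registered MR (unused in this model) */
};

/* Software model of a gather: copy each element's bytes, in list order,
 * into one contiguous destination buffer.
 * Returns the number of bytes copied, or 0 if dst is too small. */
static size_t gather(const struct sge *list, int num_sge, void *dst, size_t dst_len)
{
    size_t total = 0;

    for (int i = 0; i < num_sge; i++) {
        if (list[i].length > dst_len - total)
            return 0;
        memcpy((char *)dst + total,
               (const void *)(uintptr_t)list[i].addr,
               list[i].length);
        total += list[i].length;
    }
    return total;
}
```

A scatter is the mirror image: one incoming message is split across the listed buffers, filling each in order up to its size.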
163. lly be acknowledged with ibv_ack_async_event.

ibv_get_async_event is a blocking function. If multiple threads call this function simultaneously, then when an async event occurs, only one thread will receive it, and it is not possible to predict which thread will receive it.

struct ibv_async_event is defined as follows:

    struct ibv_async_event {
        union {
            struct ibv_cq  *cq;        /* The CQ that got the event */
            struct ibv_qp  *qp;        /* The QP that got the event */
            struct ibv_srq *srq;       /* The SRQ that got the event */
            int             port_num;  /* The port number that got the event */
        } element;
        enum ibv_event_type event_type;  /* Type of event */
    };

One member of the element union will be valid, depending on the event_type member of the structure. event_type will be one of the following events:

QP events:
    IBV_EVENT_QP_FATAL             Error occurred on a QP and it transitioned to error state
    IBV_EVENT_QP_REQ_ERR           Invalid Request Local Work Queue Error
    IBV_EVENT_QP_ACCESS_ERR        Local access violation error
    IBV_EVENT_COMM_EST             Communication was established on a QP
    IBV_EVENT_SQ_DRAINED           Send Queue was drained of outstanding messages in progress
    IBV_EVENT_PATH_MIG             A connection has migrated to the alternate path
    IBV_EVENT_PATH_MIG_ERR         A connection failed to migrate to the alternate path
    IBV_EVENT_QP_LAST_WQE_REACHED  Last WQE Reached on a QP associated with an SRQ

CQ events:
    IBV_EVENT_CQ_ERR               CQ is in error (CQ overrun)

SRQ
164. m, &port_attr);
        if (ret) {
            VERB_ERR("ibv_query_port", ret);
            goto out;
        }

        if (ctx.msg_length > (1 << (port_attr.active_mtu + 7))) {
            printf("buffer length %d is larger than active mtu %d\n",
                   ctx.msg_length, 1 << (port_attr.active_mtu + 7));
            goto out;
        }

        ret = create_resources(&ctx);
        if (ret)
            goto out;

        if (!ctx.sender) {
            for (i = 0; i < ctx.msg_count; i++) {
                ret = rdma_post_recv(ctx.id, NULL, ctx.buf,
                                     ctx.msg_length + sizeof(struct ibv_grh),
                                     ctx.mr);
                if (ret) {
                    VERB_ERR("rdma_post_recv", ret);
                    goto out;
                }
            }
        }

        /* Join the multicast group */
        ret = rdma_join_multicast(ctx.id, &ctx.mcast_sockaddr, NULL);
        if (ret) {
            VERB_ERR("rdma_join_multicast", ret);
            goto out;
        }

        /* Verify that we successfully joined the multicast group */
        ret = get_cm_event(ctx.channel, RDMA_CM_EVENT_MULTICAST_JOIN, &event);
        if (ret)
            goto out;

        inet_ntop(AF_INET6, event->param.ud.ah_attr.grh.dgid.raw, buf, 40);
        printf("joined dgid: %s, mlid 0x%x, sl %d\n", buf,
               event->param.ud.ah_attr.dlid, event->param.ud.ah_attr.sl);

        ctx.remote_qpn = event->param.ud.qp_num;
        ctx.remote_qkey = event->param.ud.qkey;

        if (ctx.sender) {
            /* Create an address handle for the sender */
            ctx.ah = ibv_create_ah(ctx.pd, &event->param.ud.ah_attr);
            if (!ctx.ah) {
                VERB_ERR("ibv_create_ah", -1);
                goto out;
            }
        }

        rdma_ack_cm_event(event);
165. m rdma_get_devices

Output Parameters
    None

Return Value
    None

Description
rdma_free_devices frees the device array returned by the rdma_get_devices routine.

5.2.23 rdma_getaddrinfo

Template
    int rdma_getaddrinfo(char *node, char *service, struct rdma_addrinfo *hints, struct rdma_addrinfo **res)

Input Parameters
    node     Optional: name, dotted-decimal IPv4 or IPv6 hex address to resolve
    service  The service name or port number of the address
    hints    Reference to an rdma_addrinfo structure containing hints about the type of service the caller supports
    res      A pointer to a linked list of rdma_addrinfo structures containing response information

Output Parameters
    res      An rdma_addrinfo structure which returns information needed to establish communication

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_getaddrinfo provides transport independent address translation. It resolves the destination node and service address and returns information required to establish device communication. It is the functional equivalent of getaddrinfo.

Please note that either node or service must be provided. If hints are provided, the operation will be controlled by hints.ai_flags. If RAI_PASSIVE is specified, the call will resolve address information for use on
166. meters
    cq  CQ to destroy

Output Parameters
    none

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_destroy_cq frees a completion queue (CQ). This command will fail if there is any queue pair (QP) that still has the specified CQ associated with it.

4.3.10 ibv_create_comp_channel

Template
    struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context)

Input Parameters
    context  struct ibv_context from ibv_open_device

Output Parameters
    none

Return Value
    pointer to created CC or NULL on failure

Description
ibv_create_comp_channel creates a completion channel. A completion channel is a mechanism for the user to receive notifications when a new completion queue event (CQE) has been placed on a completion queue (CQ).

4.3.11 ibv_destroy_comp_channel

Template
    int ibv_destroy_comp_channel(struct ibv_comp_channel *channel)

Input Parameters
    channel  struct ibv_comp_channel from ibv_create_comp_channel

Output Parameters
    none

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_destroy_comp_channel frees a completion channel. This command will fail if there are any completion qu
167. n Value
    A pointer to the created SRQ or NULL on failure

Description
ibv_create_srq creates a shared receive queue (SRQ). srq_attr->max_wr and srq_attr->max_sge are read to determine the requested size of the SRQ, and set to the actual values allocated on return. If ibv_create_srq succeeds, then max_wr and max_sge will be at least as large as the requested values.

struct ibv_srq is defined as follows:

    struct ibv_srq {
        struct ibv_context *context;          /* struct ibv_context from ibv_open_device */
        void               *srq_context;
        struct ibv_pd      *pd;               /* Protection domain */
        uint32_t            handle;
        pthread_mutex_t     mutex;
        pthread_cond_t      cond;
        uint32_t            events_completed;
    };

struct ibv_srq_init_attr is defined as follows:

    struct ibv_srq_init_attr {
        void               *srq_context;
        struct ibv_srq_attr attr;
    };

    srq_context  struct ibv_context from ibv_open_device
    attr         An ibv_srq_attr struct defined as follows

struct ibv_srq_attr is defined as follows:

    struct ibv_srq_attr {
        uint32_t max_wr;
        uint32_t max_sge;
        uint32_t srq_limit;
    };

    max_wr     Requested maximum number of outstanding WRs in the SRQ
    max_sge    Requested number of scatter elements per WR
    srq_limit  The limit value of the SRQ (irrelevant for ibv_create_srq)

4.4.6 ibv_modify_srq

Template
    int ibv_modify_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr,
168. n oriented QP communication. Unlike TCP, the RDMA port space provides message, not stream, based communication.

    RDMA_PS_UDP  Provides unreliable, connectionless QP communication. Supports both datagram and multicast communication.

See Also
    rdma_cm, rdma_create_event_channel, rdma_destroy_id, rdma_get_devices, rdma_bind_addr, rdma_resolve_addr, rdma_connect, rdma_listen, rdma_set_option

5.2.2 rdma_destroy_id

Template
    int rdma_destroy_id(struct rdma_cm_id *id)

Input Parameters
    id  The communication identifier to destroy

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
Destroys the specified rdma_cm_id and cancels any outstanding asynchronous operation.

Notes
Users must free any associated QP with the rdma_cm_id before calling this routine and ack any related events.

See Also
    rdma_create_id, rdma_destroy_qp, rdma_ack_cm_event

5.2.3 rdma_migrate_id

Template
    int rdma_migrate_id(struct rdma_cm_id *id, struct rdma_event_channel *channel)

Input Parameters
    id       An existing RDMA communication identifier to migrate
    channel  The new event channel for rdma_cm_id events

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno wil
169. nd any more notifications to the CC until it is rearmed again with a new call to the ibv_req_notify_cq operation.

This operation only informs the user that a CQ has completion queue entries (CQE) to be processed; it does not actually process the CQEs. The user should use the ibv_poll_cq operation to process the CQEs.

4.6.9 ibv_ack_cq_events

Template
    void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents)

Input Parameters
    cq       struct ibv_cq from ibv_create_cq
    nevents  number of events to acknowledge (1...n)

Output Parameters
    None

Return Value
    None

Description
ibv_ack_cq_events acknowledges events received from ibv_get_cq_event. Although each notification received from ibv_get_cq_event counts as only one event, the user may acknowledge multiple events through a single call to ibv_ack_cq_events. The number of events to acknowledge is passed in nevents and should be at least 1. Since this operation takes a mutex, it is somewhat expensive, and acknowledging multiple events in one call may provide better performance.

See ibv_get_cq_event for additional details.

4.6.10 ibv_poll_cq

Template
    int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc)

Input Parameters
    cq           struct ibv_cq from ibv_create_cq
    num_entries  maximum number of completion queue entri
170. nection request will be returned to the user. The new rdma_cm_id will reference event information associated with the request until the user calls rdma_reject, rdma_accept, or rdma_destroy_id on the newly created identifier. For a description of the event data, see rdma_get_cm_event.

If QP attributes are associated with the listening endpoint, the returned rdma_cm_id will also reference an allocated QP.

5.2.13 rdma_accept

Template
    int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)

Input Parameters
    id          RDMA communication identifier
    conn_param  Optional connection parameters (described under rdma_connect)

Output Parameters
    None

Return Value
    0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_accept is called from the listening side to accept a connection or datagram service lookup request.

Unlike the socket accept routine, rdma_accept is not called on a listening rdma_cm_id. Instead, after calling rdma_listen, the user waits for an RDMA_CM_EVENT_CONNECT_REQUEST event to occur. Connection request events give the user a newly created rdma_cm_id, similar to a new socket, but the rdma_cm_id is bound to a specific RDMA device. rdma_accept is called on the new rdma_cm_id.

5.2.14 rdma_rejec
171. nnect_qp_exit;
        }
        rc = modify_qp_to_rts(res->qp);
        if (rc)
        {
            fprintf(stderr, "failed to modify QP state to RTS\n");
            goto connect_qp_exit;
        }
        fprintf(stdout, "QP state was changed to RTS\n");

        /* sync to make sure that both sides are in states that they can connect to prevent packet loss */
        if (sock_sync_data(res->sock, 1, "Q", &temp_char)) /* just send a dummy char back and forth */
        {
            fprintf(stderr, "sync error after QPs were moved to RTS\n");
            rc = 1;
        }

    connect_qp_exit:
        return rc;
    }

    /******************************************************************************
     * Function: resources_destroy
     *
     * Input
     *      res     pointer to resources structure
     *
     * Output
     *      none
     *
     * Returns
     *      0 on success, 1 on failure
     *
     * Description
     *      Cleanup and deallocate all resources used
     ******************************************************************************/
    static int resources_destroy(struct resources *res)
    {
        int rc = 0;

        if (res->qp)
            if (ibv_destroy_qp(res->qp))
            {
                fprintf(stderr, "failed to destroy QP\n");
                rc = 1;
            }
        if (res->mr)
            if (ibv_dereg_mr(res->mr))
            {
                fprintf(stderr, "failed to deregister MR\n");
                rc = 1;
            }
        if (res->buf)
            free(res->buf);
        if (res->cq)
            if (ibv_destroy_cq(res->cq))
            {
                fprintf(stderr, "failed to destroy
172.
    static int sock_connect(const char *servername, int port)
    {
        struct addrinfo *resolved_addr = NULL;
        struct addrinfo *iterator;
        char service[6];
        int sockfd = -1;
        int listenfd = 0;
        int tmp;
        struct addrinfo hints =
        {
            .ai_flags = AI_PASSIVE,
            .ai_family = AF_INET,
            .ai_socktype = SOCK_STREAM
        };

        if (sprintf(service, "%d", port) < 0)
            goto sock_connect_exit;

        /* Resolve DNS address, use sockfd as temp storage */
        sockfd = getaddrinfo(servername, service, &hints, &resolved_addr);
        if (sockfd < 0)
        {
            fprintf(stderr, "%s for %s:%d\n", gai_strerror(sockfd), servername, port);
            goto sock_connect_exit;
        }

        /* Search through results and find the one we want */
        for (iterator = resolved_addr; iterator; iterator = iterator->ai_next)
        {
            sockfd = socket(iterator->ai_family, iterator->ai_socktype, iterator->ai_protocol);
            if (sockfd >= 0)
            {
                if (servername)
                {
                    /* Client mode. Initiate connection to remote */
                    if ((tmp = connect(sockfd, iterator->ai_addr, iterator->ai_addrlen)))
                    {
                        fprintf(stdout, "failed connect \n");
                        close(sockfd);
                        sockfd = -1;
                    }
                }
                else
                {
                    /* Server mode. Set up listening socket and accept a connection */
                    listenfd = sockfd;
                    sockfd = -1;
                    if (bind(listenfd, iterator->ai_addr, iterator->ai_addrlen))
                        goto sock_connect_exit;
                    listen(listenfd, 1);
                    sockfd = accept(listenfd, NULL, 0);
                }
            }
173. nologies are based on networking concepts found in a traditional network, but there are differences between them and their counterparts in IP networks. The key difference is that RDMA provides a messaging service which applications can use to directly access the virtual memory on remote computers. The messaging service can be used for Inter Process Communication (IPC), communication with remote servers, and to communicate with storage devices using Upper Layer Protocols (ULPs) such as iSCSI Extensions for RDMA (ISER) and SCSI RDMA Protocol (SRP), Server Message Block (SMB), Samba, Lustre, ZFS and many more.

RDMA provides low latency through stack bypass and copy avoidance, reduces CPU utilization, reduces memory bandwidth bottlenecks and provides high bandwidth utilization. The key benefits that RDMA delivers accrue from the way that the RDMA messaging service is presented to the application and the underlying technologies used to transport and deliver those messages. RDMA provides Channel based IO. This channel allows an application using an RDMA device to directly read and write remote virtual memory.

In traditional sockets networks, applications request network resources from the operating system through an API which conducts the transaction on their behalf. However, RDMA uses the OS to establish a channel and then allows applications to directly exchange messages without further OS interv
174. none

Return Value
    A pointer to a static character string corresponding to the event

Description
rdma_event_str returns a string representation of an asynchronous event.

See Also
    rdma_get_cm_event

6 RDMA Verbs API

6.1 Protection Domain Operations

6.1.1 rdma_reg_msgs

Template
    struct ibv_mr *rdma_reg_msgs(struct rdma_cm_id *id, void *addr, size_t length)

Input Parameters
    id      A reference to the communication identifier where the message buffer(s) will be used
    addr    The address of the memory buffer(s) to register
    length  The total length of the memory to register

Output Parameters
    ibv_mr  A reference to an ibv_mr struct of the registered memory region

Return Value
    A reference to the registered memory region on success or NULL on failure

Description
rdma_reg_msgs registers an array of memory buffers for sending or receiving messages or for RDMA operations. The registered memory buffers may then be posted to an rdma_cm_id using rdma_post_send or rdma_post_recv. They may also be specified as the target of an RDMA read operation or the source of an RDMA write request.

The memory buffers are registered with the protection domain associated with the rdma_cm_id. The start of the data buffer array is specified through the addr parameter, and the total size of the array is given by the length.

All data buffers must be registered
175. not have a QP associated with them.

RDMA_CM_EVENT_CONNECT_ERROR
Indicates that an error has occurred trying to establish a connection. May be generated on the active or passive side of a connection.

RDMA_CM_EVENT_UNREACHABLE
Generated on the active side to notify the user that the remote server is not reachable or unable to respond to a connection request.

RDMA_CM_EVENT_REJECTED
Indicates that a connection request or response was rejected by the remote end point.

RDMA_CM_EVENT_ESTABLISHED
Indicates that a connection has been established with the remote end point.

RDMA_CM_EVENT_DISCONNECTED
The connection has been disconnected.

RDMA_CM_EVENT_DEVICE_REMOVAL
The local RDMA device associated with the rdma_cm_id has been removed. Upon receiving this event, the user must destroy the related rdma_cm_id.

RDMA_CM_EVENT_MULTICAST_JOIN
The multicast join operation (rdma_join_multicast) completed successfully.

RDMA_CM_EVENT_MULTICAST_ERROR
An error occurred either while joining a multicast group or, if the group had already been joined, on the existing group. The specified multicast group is no longer accessible and should be rejoined, if desired.

RDMA_CM_EVENT_ADDR_CHANGE
The network device associated with this ID through address resolution changed its HW address, e.g., following a bonding failover. This event can serve as a hint for applications that want the links used for their RDMA sessi
176. notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Compile Command:
gcc srq.c -o srq -libverbs -lrdmacm

Description:
Both the client and server use an SRQ. A number of Queue Pairs (QPs) are created (ctx.qp_count) and each QP uses the SRQ. The connection between the client and server is established using the IP address details passed on the command line. After the connection is established, the client starts blasting sends to the server and stops when the maximum work requests (ctx.max_wr) have been sent. When the server has received all the sends, it performs a send to the client to tell it to continue. The process repeats until the requested number of sends (ctx.msg_count) have been performed.

Running the Example:
The executable can operate as either the client or server application. It can b
177. o be bound to a local RDMA device.

Notes:
Typically, this routine is called before calling rdma_listen to bind to a specific port number, but it may also be called on the active side of a connection before calling rdma_resolve_addr to bind to a specific address. If used to bind to port 0, the rdma_cm will select an available port, which can be retrieved with rdma_get_src_port.

See Also:
rdma_create_id, rdma_listen, rdma_resolve_addr, rdma_create_qp, rdma_get_local_addr, rdma_get_src_port

5.2.9 rdma_resolve_route

Template:
int rdma_resolve_route(struct rdma_cm_id *id, int timeout_ms)

Input Parameters:
id            RDMA identifier
timeout_ms    Time to wait for resolution to complete

Output Parameters:
None

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
rdma_resolve_route resolves an RDMA route to the destination address in order to establish a connection. The destination must already have been resolved by calling rdma_resolve_addr. Thus this function is called on the client side after rdma_resolve_addr but before calling rdma_connect. For InfiniBand connections, the call obtains a path record which is used by the connection.

5.2.10 rdma_listen

Template:
int rdma_listen(struct rdma_cm_id *id, int b
178. omain (PD) from ibv_alloc_pd
wc          completion queue entry (CQE) from ibv_poll_cq
grh         global route header (GRH) from packet
port_num    physical port number (1..n) that CQE was received on

Output Parameters:
none

Return Value:
Created address handle (AH) on success or NULL on failure.

Description:
ibv_create_ah_from_wc combines the operations ibv_init_ah_from_wc and ibv_create_ah. See the description of those operations for details.

4.6.13 ibv_attach_mcast

Template:
int ibv_attach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid)

Input Parameters:
qp     QP to attach to the multicast group
gid    The multicast group GID
lid    The multicast group LID in host byte order

Output Parameters:
none

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
ibv_attach_mcast attaches the specified QP, qp, to the multicast group whose multicast group GID is gid, and multicast LID is lid.
Only QPs of Transport Service Type IBV_QPT_UD may be attached to multicast groups.
In order to receive multicast messages, a join request for the multicast group must be sent to the subnet administrator (SA), so that the fabric's multicast routing is configured to deliver messages to the local port.
If a QP is attached to the same multicast group multiple times, the QP will s
179. on: main
*
* Input
* argc number of items in argv
* argv command line parameters
*
* Output
* none
*
* Returns
* 0 on success, 1 on failure
*
* Description
* Main program code
******************************************************************************/
int main (int argc, char *argv[])
{
    struct resources res;
    int rc = 1;
    char temp_char;

    /* parse the command line parameters */
    while (1)
    {
        int c;
        static struct option long_options[] = {
            {.name = "port", .has_arg = 1, .val = 'p'},
            {.name = "ib-dev", .has_arg = 1, .val = 'd'},
            {.name = "ib-port", .has_arg = 1, .val = 'i'},
            {.name = "gid-idx", .has_arg = 1, .val = 'g'},
            {.name = NULL, .has_arg = 0, .val = '\0'}
        };

        c = getopt_long (argc, argv, "p:d:i:g:", long_options, NULL);
        if (c == -1)
            break;
        switch (c)
        {
        case 'p':
            config.tcp_port = strtoul (optarg, NULL, 0);
            break;
        case 'd':
            config.dev_name = strdup (optarg);
            break;
        case 'i':
            config.ib_port = strtoul (optarg, NULL, 0);
            if (config.ib_port < 0)
            {
                usage (argv[0]);
                return 1;
            }
            break;
        case 'g':
            config.gid_idx = strtoul (optarg, NULL, 0);
            if (config.gid_idx < 0)
            {
                usage (argv[0]);
                return 1;
            }
            break;
        default:
            usage (argv[0]);
            return 1;
        }
    }

    /* parse the last parameter (if exists) as the server name */
    if (optind == argc - 1)
        config.server_name = argv[o
180. one

Return Value:
A constant string which describes the enum value event_type.

Description:
ibv_event_type_str returns a string describing the event type enum value, event_type. event_type may be any one of the 19 different enum values describing different IB events.

ibv_event_type {
    IBV_EVENT_CQ_ERR,
    IBV_EVENT_QP_FATAL,
    IBV_EVENT_QP_REQ_ERR,
    IBV_EVENT_QP_ACCESS_ERR,
    IBV_EVENT_COMM_EST,
    IBV_EVENT_SQ_DRAINED,
    IBV_EVENT_PATH_MIG,
    IBV_EVENT_PATH_MIG_ERR,
    IBV_EVENT_DEVICE_FATAL,
    IBV_EVENT_PORT_ACTIVE,
    IBV_EVENT_PORT_ERR,
    IBV_EVENT_LID_CHANGE,
    IBV_EVENT_PKEY_CHANGE,
    IBV_EVENT_SM_CHANGE,
    IBV_EVENT_SRQ_ERR,
    IBV_EVENT_SRQ_LIMIT_REACHED,
    IBV_EVENT_QP_LAST_WQE_REACHED,
    IBV_EVENT_CLIENT_REREGISTER,
    IBV_EVENT_GID_CHANGE,
};

5 RDMA_CM API

5.1 Event Channel Operations

5.1.1 rdma_create_event_channel

Template:
struct rdma_event_channel *rdma_create_event_channel(void)

Input Parameters:
void    no arguments

Output Parameters:
none

Return Value:
A pointer to the created event channel, or NULL if the request fails. On failure, errno will be set to indicate the failure reason.

Description:
181. ons to align with the network stack.

RDMA_CM_EVENT_TIMEWAIT_EXIT
The QP associated with a connection has exited its timewait state and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets to exit the network. After the timewait state has completed, the rdma_cm will report this event.

See Also:
rdma_ack_cm_event, rdma_create_event_channel, rdma_resolve_addr, rdma_resolve_route, rdma_connect, rdma_listen, rdma_join_multicast, rdma_destroy_id, rdma_event_str

5.3.2 rdma_ack_cm_event

Template:
int rdma_ack_cm_event(struct rdma_cm_event *event)

Input Parameters:
event    Event to be released

Output Parameters:
none

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
rdma_ack_cm_event frees a communication event. All events which are allocated by rdma_get_cm_event must be released; there should be a one-to-one correspondence between successful gets and acks. This call frees the event structure and any memory that it references.

See Also:
rdma_get_cm_event, rdma_destroy_id

5.3.3 rdma_event_str

Template:
char *rdma_event_str(enum rdma_cm_event_type event)

Input Parameters:
event    Asynchronous event

Output Parameters:
182. side as part of the communication request. May be NULL if private_data is not required.

private_data_len
Specifies the size of the user-controlled data buffer. Note that the actual amount of data transferred to the remote side is transport dependent and may be larger than that requested.

responder_resources
The maximum number of outstanding RDMA read and atomic operations that the local side will accept from the remote side. Applies only to RDMA_PS_TCP. This value must be less than or equal to the local RDMA device attribute max_qp_rd_atom and remote RDMA device attribute max_qp_init_rd_atom. The remote endpoint can adjust this value when accepting the connection.

initiator_depth
The maximum number of outstanding RDMA read and atomic operations that the local side will have to the remote side. Applies only to RDMA_PS_TCP. This value must be less than or equal to the local RDMA device attribute max_qp_init_rd_atom and remote RDMA device attribute max_qp_rd_atom. The remote endpoint can adjust this value when accepting the connection.

flow_control
Specifies if hardware flow control is available. This value is exchanged with the remote peer and is not used to configure the QP. Applies only to RDMA_PS_TCP.

retry_count
The maximum number of times that a data transfer operation should be retried on the connection when an error occurs. This setting controls the number of times to retry send, RDMA
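The limits on responder_resources and initiator_depth described above can be sketched as a pair of small helpers. This is a minimal illustration, not part of librdmacm: the function names are hypothetical, and the caller is assumed to have obtained the local and remote device attributes by some other means (the remote values in particular have to be exchanged out of band).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a librdmacm call): smaller of two limits,
 * capped at what fits in the uint8_t fields of struct rdma_conn_param. */
static uint8_t demo_clamp_u8(unsigned requested, unsigned limit_a, unsigned limit_b)
{
    unsigned limit = limit_a < limit_b ? limit_a : limit_b;

    if (limit > UINT8_MAX)
        limit = UINT8_MAX;
    return (uint8_t)(requested < limit ? requested : limit);
}

/* responder_resources must not exceed the local max_qp_rd_atom
 * or the remote max_qp_init_rd_atom. */
static uint8_t demo_responder_resources(unsigned requested,
                                        unsigned local_max_qp_rd_atom,
                                        unsigned remote_max_qp_init_rd_atom)
{
    return demo_clamp_u8(requested, local_max_qp_rd_atom,
                         remote_max_qp_init_rd_atom);
}

/* initiator_depth must not exceed the local max_qp_init_rd_atom
 * or the remote max_qp_rd_atom. */
static uint8_t demo_initiator_depth(unsigned requested,
                                    unsigned local_max_qp_init_rd_atom,
                                    unsigned remote_max_qp_rd_atom)
{
    assert(requested > 0 || local_max_qp_init_rd_atom >= 0u);
    return demo_clamp_u8(requested, local_max_qp_init_rd_atom,
                         remote_max_qp_rd_atom);
}
```

The results would be stored in the responder_resources and initiator_depth members before calling rdma_connect or rdma_accept; the remote endpoint may still adjust them when accepting.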
183. ort = 0;
    ctx.migrate_after = -1;

    while ((op = getopt(argc, argv, "sa:p:c:l:d:r:m:w:")) != -1) {
        switch (op) {
        case 's':
            ctx.server = 1;
            break;
        case 'a':
            ctx.server_name = optarg;
            break;
        case 'p':
            ctx.server_port = optarg;
            break;
        case 'c':
            ctx.msg_count = atoi(optarg);
            break;
        case 'l':
            ctx.msg_length = atoi(optarg);
            break;
        case 'd':
            ctx.alt_dlid = atoi(optarg);
            break;
        case 'r':
            ctx.alt_srcport = atoi(optarg);
            break;
        case 'm':
            ctx.migrate_after = atoi(optarg);
            break;
        case 'w':
            ctx.msec_delay = atoi(optarg);
            break;
        default:
            printf("usage: %s [-s or -a required]\n", argv[0]);
            printf("\t[-s[erver mode]]\n");
            printf("\t[-a ip_address]\n");
            printf("\t[-p port_number]\n");
            printf("\t[-c msg_count]\n");
            printf("\t[-l msg_length]\n");
            printf("\t[-d alt_dlid] (requires -r)\n");
            printf("\t[-r alt_srcport] (requires -d)\n");
            printf("\t[-m num_iterations_then_migrate] (client only)\n");
            printf("\t[-w msec_wait_between_sends]\n");
            exit(1);
        }
    }

    printf("mode: %s\n", (ctx.server) ? "server" : "client");
    printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
    printf("port: %s\n", ctx.server_port);
    printf("count: %d\n", ctx.msg_count);
    printf("length: %d\n", ctx.msg_length);
    printf("alt_dlid: %d\n", ctx.alt_dlid);
    printf("alt_port: %d\n", ctx.alt_srcport);
    printf("mig after: %d\n", ctx.migrate_after);
    printf("msec_wait: %d\n", ctx.msec_delay);
184. oy_srq(struct ibv_srq *srq)

Input Parameters:
srq    The SRQ to destroy

Output Parameters:
none

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
ibv_destroy_srq destroys the specified SRQ. It will fail if any queue pair is still associated with this SRQ.

4.4.8 ibv_open_xrc_domain

Template:
struct ibv_xrc_domain *ibv_open_xrc_domain(struct ibv_context *context, int fd, int oflag)

Input Parameters:
context    struct ibv_context from ibv_open_device
fd         The file descriptor to be associated with the XRC domain
oflag      The desired file creation attributes

Output Parameters:
A file descriptor associated with the opened XRC domain

Return Value:
A reference to an opened XRC domain or NULL

Description:
ibv_open_xrc_domain opens an eXtended Reliable Connection (XRC) domain for the RDMA device context. The desired file creation attributes oflag can either be 0 or the bitwise OR of O_CREAT and O_EXCL. If a domain belonging to the device named by the context is already associated with the inode, then the O_CREAT flag has no effect. If both O_CREAT and O_EXCL are set, open will fail if a domain associated with the inode already exists. Otherwise a new XRC domain will be created and associated with the inode specified by fd.
Please note that the check for the existence of the domain and crea
185. pace through the network stack and out onto the wire. Similarly, at the other end, an application must rely on the operating system to retrieve the data on the wire on its behalf and place it in its virtual buffer space.

TCP/IP/Ethernet is a byte-stream oriented transport for passing bytes of information between sockets applications. TCP/IP is lossy by design but implements a reliability scheme using the Transmission Control Protocol (TCP). TCP/IP requires Operating System (OS) intervention for every operation, which includes buffer copying on both ends of the wire. In a byte-stream oriented network, the idea of a message boundary is lost. When an application wants to send a packet, the OS places the bytes into an anonymous buffer in main memory belonging to the operating system, and when the byte transfer is complete, the OS copies the data in its buffer into the receive buffer of the application. This process is repeated each time a packet arrives until the entire byte stream is received. TCP is responsible for retransmitting any packets lost due to congestion.

In IB, a complete message is delivered directly to an application. Once an application has requested transport of an RDMA Read or Write, the IB hardware segments the outbound message as needed into packets whose size is determined by the fabric path maximum transfer
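The segmentation step described above comes down to a ceiling division. The sketch below is only an illustration of that arithmetic: the function name is hypothetical, the payload size per packet is assumed to equal the path MTU, and a zero-length message is assumed to still occupy one packet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of packets needed to carry a message of
 * msg_len bytes when each packet carries at most path_mtu payload bytes.
 * Assumption: a zero-length message is still sent as a single packet. */
static uint64_t demo_packets_for_message(uint64_t msg_len, uint32_t path_mtu)
{
    assert(path_mtu > 0);
    if (msg_len == 0)
        return 1;
    /* Ceiling division without risking overflow from msg_len + path_mtu. */
    return msg_len / path_mtu + (msg_len % path_mtu != 0 ? 1 : 0);
}
```

With a 4096-byte path MTU, for instance, a 4097-byte message needs two packets, while the application on the receiving side still sees one complete message.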
186. pare and Swap. These are atomic extensions to the RDMA operations.

The atomic fetch and add operation atomically increments the value at a specified virtual address by a specified amount. The value prior to being incremented is returned to the caller.

The atomic compare and swap will atomically compare the value at a specified virtual address with a specified value, and if they are equal, a specified value will be stored at the address.

3.2 Transport Modes

There are several different transport modes you may select from when establishing a QP. Operations available in each mode are shown below in Table 3. RD is not supported by this API.

Table 3: Transport Mode capabilities

Operation                               UD    UC    RC    RD
Send (with immediate)                   X     X     X     X
Receive                                 X     X     X     X
RDMA Write (with immediate)                   X     X     X
RDMA Read                                           X     X
Atomic: Fetch and Add / Cmp and Swap                X     X
Max message size                        MTU   2GB   2GB   2GB

3.2.1 Reliable Connection (RC)

A Queue Pair is associated with only one other QP. Messages transmitted by the send queue of one QP are reliably delivered to the receive queue of the other QP. Packets are delivered in order. An RC connection is very similar to a TCP connection.

3.2.2 Unreliable Connection (UC)

A Queue Pair is associated with only one other QP. The connection is not reliable, so packets may be lost. Messages with errors are not retried by the transport, and error handlin
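The semantics of the two atomic operations described above can be modeled on ordinary local memory. The sketch below is a plain C model with hypothetical function names; it is not itself atomic and issues no RDMA work request. It assumes 64-bit operands and assumes that compare and swap, like fetch and add, hands the prior value back to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Model of Fetch and Add: add `add` to the value at addr and return
 * the value that was there before the increment. */
static uint64_t demo_fetch_and_add(uint64_t *addr, uint64_t add)
{
    uint64_t old;

    assert(addr != 0);
    old = *addr;
    *addr = old + add;
    return old;
}

/* Model of Compare and Swap: store `swap` at addr only if the current
 * value equals `compare`. Assumption: the prior value is returned either way,
 * so the caller can tell whether the swap happened. */
static uint64_t demo_compare_and_swap(uint64_t *addr, uint64_t compare,
                                      uint64_t swap)
{
    uint64_t old;

    assert(addr != 0);
    old = *addr;
    if (old == compare)
        *addr = swap;
    return old;
}
```

A caller can detect a successful swap by checking that the returned value equals the compare value, which is the usual basis for building locks or counters on a remote address.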
187. ptind];
    else if (optind < argc)
    {
        usage (argv[0]);
        return 1;
    }

    /* print the used parameters for info */
    print_config ();

    /* init all of the resources, so cleanup will be easy */
    resources_init (&res);

    /* create resources before using them */
    if (resources_create (&res))
    {
        fprintf (stderr, "failed to create resources\n");
        goto main_exit;
    }

    /* connect the QPs */
    if (connect_qp (&res))
    {
        fprintf (stderr, "failed to connect QPs\n");
        goto main_exit;
    }

    /* let the server post the sr */
    if (!config.server_name)
        if (post_send (&res, IBV_WR_SEND))
        {
            fprintf (stderr, "failed to post sr\n");
            goto main_exit;
        }

    /* in both sides we expect to get a completion */
    if (poll_completion (&res))
    {
        fprintf (stderr, "poll completion failed\n");
        goto main_exit;
    }

    /* after polling the completion we have the message in the client buffer too */
    if (config.server_name)
        fprintf (stdout, "Message is: '%s'\n", res.buf);
    else
    {
        /* setup server buffer with read message */
        strcpy (res.buf, RDMAMSGR);
    }

    /* Sync so we are sure server side has data ready before client tries to read it */
    if (sock_sync_data (res.sock, 1, "R", &temp_char)) /* just send a dummy char back and forth */
    {
        fprintf (stderr, "sync error before RDMA ops\n");
        rc = 1;
        goto main_exit;
    }
188. put:
 *     ctx - The context structure
 *
 * Output:
 *     none
 *
 * Returns:
 *     0 on success, non-zero on failure
 *
 * Description:
 *     Waits for a completion and verifies that the operation was successful.
 */
int get_completion(struct context *ctx)
{
    int ret;
    struct ibv_wc wc;

    do {
        ret = ibv_poll_cq(ctx->cq, 1, &wc);
        if (ret < 0) {
            VERB_ERR("ibv_poll_cq", ret);
            return -1;
        }
    } while (ret == 0);

    if (wc.status != IBV_WC_SUCCESS) {
        printf("work completion status %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }

    return 0;
}

/*
 * Function: main
 *
 * Input:
 *     argc - The number of arguments
 *     argv - Command line arguments
 *
 * Output:
 *     none
 *
 * Returns:
 *     0 on success, non-zero on failure
 *
 * Description:
 *     Main program to demonstrate multicast functionality.
 *     Both the sender and receiver create a UD Queue Pair and join the
 *     specified multicast group (ctx.mcast_addr). If the join is successful,
 *     the sender must create an Address Handle (ctx.ah). The sender then posts
 *     the specified number of sends (ctx.msg_count) to the multicast group.
 *     The receiver waits to receive each one of the sends and then both sides
 *     leave the multicast group and cleanup resources.
 */
int main(int argc, char **argv)
{
    int ret, op;
    struct context ctx;
189. r associated with the rdma_cm_id, id. The contents of the local data buffers in the sgl array will be written to the remote memory region at remote_addr.
Unless inline data is specified, the local data buffers must have been registered before the write is issued, and the buffers must remain registered until the write completes. The remote buffers must always be registered.
Write operations may not be posted to an rdma_cm_id or the corresponding queue pair until a connection has been established.
The user-defined context associated with the write request will be returned to the user through the work completion work request identifier (wr_id) field.

6.2.5 rdma_post_recv

Template:
int rdma_post_recv(struct rdma_cm_id *id, void *context, void *addr, size_t length, struct ibv_mr *mr)

Input Parameters:
id         A reference to the communication identifier where the message buffer will be posted
context    A user-defined context associated with the request
addr       The address of the memory buffer to post
length     The length of the memory buffer
mr         A registered memory region associated with the posted buffer

Output Parameters:
None

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
rdma_post_recv posts a work request to the receive queue of the queue pair associated with t
190. ram operations
remote_qpn            remote QP number for datagram operations
remote_qkey           Qkey for datagram operations
xrc_remote_srq_num    shared receive queue (SRQ) number for the destination extended reliable connection (XRC). Only used for XRC operations.

send flags:
IBV_SEND_FENCE        set fence indicator
IBV_SEND_SIGNALED     send completion event for this WR. Only meaningful for QPs that had the sq_sig_all set to 0
IBV_SEND_SOLICITED    set solicited event indicator
IBV_SEND_INLINE       send data in sge_list as inline data

struct ibv_sge is defined in ibv_post_recv.

4.6.6 ibv_post_srq_recv

Template:
int ibv_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *recv_wr, struct ibv_recv_wr **bad_recv_wr)

Input Parameters:
srq        The SRQ to post the work request to
recv_wr    A list of work requests to post on the receive queue

Output Parameters:
bad_recv_wr    pointer to first rejected WR

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
ibv_post_srq_recv posts a list of work requests to the specified SRQ. It stops processing the WRs from this list at the first failure (which can be detected immediately while requests are being posted), and returns this failing WR through the bad_recv_wr parameter.
The buffers used by a
191. ram service. The user must already have resolved a route to the destination address by having called rdma_resolve_route or rdma_create_ep before calling this method.

For InfiniBand specific connections, the QPs are configured with minimum RNR NAK timer and local ACK timeout values. The minimum RNR NAK timer value is set to 0, for a delay of 655 ms. The local ACK timeout is calculated based on the packet lifetime and local HCA ACK delay. The packet lifetime is determined by the InfiniBand Subnet Administrator and is part of the resolved route (path record) information. The HCA ACK delay is a property of the locally used HCA. Retry count and RNR retry count values are 3-bit values.

Connections established over iWarp RDMA devices currently require that the active side of the connection send the first message.

struct rdma_conn_param is defined as follows:

struct rdma_conn_param {
    const void *private_data;
    uint8_t private_data_len;
    uint8_t responder_resources;
    uint8_t initiator_depth;
    uint8_t flow_control;
    uint8_t retry_count;        /* ignored when accepting */
    uint8_t rnr_retry_count;
    uint8_t srq;                /* ignored if QP created on the rdma_cm_id */
    uint32_t qp_num;            /* ignored if QP created on the rdma_cm_id */
};

Here is a more detailed description of the rdma_conn_param structure members:

private_data
References a user-controlled data buffer. The contents of the buffer are copied and transparently passed to the remote
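Because retry count and RNR retry count are 3-bit values, anything above 7 cannot be represented even though the structure members are declared as uint8_t. A minimal sketch of a range check follows; the macro and function names are hypothetical and are not part of librdmacm.

```c
#include <assert.h>
#include <stdint.h>

/* Largest value representable in a 3-bit field (hypothetical name). */
#define DEMO_RETRY_FIELD_MAX 7u

/* Hypothetical helper: clamp a requested retry value so that it fits the
 * 3-bit retry_count / rnr_retry_count fields. */
static uint8_t demo_clamp_retry(unsigned requested)
{
    uint8_t v = (uint8_t)(requested > DEMO_RETRY_FIELD_MAX
                              ? DEMO_RETRY_FIELD_MAX : requested);

    assert(v <= DEMO_RETRY_FIELD_MAX);
    return v;
}
```

An application might pass user-supplied values through such a check before assigning them to retry_count and rnr_retry_count, instead of relying on how an out-of-range value would be handled.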
192. rdma_get_send_comp(ctx->id, &wc);
    if (ret < 0) {
        VERB_ERR("rdma_get_send_comp", ret);
        return ret;
    }

    return 0;
}

/*
 * Function: recv_msg
 *
 * Input:
 *     ctx - The context object
 *
 * Output:
 *     none
 *
 * Returns:
 *     0 on success, non-zero on failure
 *
 * Description:
 *     Waits for a receive completion and posts a new receive buffer
 */
int recv_msg(struct context *ctx)
{
    int ret;
    struct ibv_wc wc;

    ret = rdma_get_recv_comp(ctx->id, &wc);
    if (ret < 0) {
        VERB_ERR("rdma_get_recv_comp", ret);
        return ret;
    }

    ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length, ctx->recv_mr);
    if (ret) {
        VERB_ERR("rdma_post_recv", ret);
        return ret;
    }

    return 0;
}

/*
 * Function: main
 *
 * Input:
 *     ctx - The context object
 *
 * Output:
 *     none
 *
 * Returns:
 *     0 on success, non-zero on failure
 *
 * Description:
 */
int main(int argc, char **argv)
{
    int ret, op, i, send_cnt, recv_cnt;
    struct context ctx;
    struct ibv_qp_attr qp_attr;

    memset(&ctx, 0, sizeof (ctx));
    memset(&qp_attr, 0, sizeof (qp_attr));

    ctx.server = 0;
    ctx.server_port = DEFAULT_PORT;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;
    ctx.msec_delay = DEFAULT_MSEC_DELAY;
    ctx.alt_dlid = 0;
    ctx.alt_srcp
193. remote_addr, uint32_t rkey)

Input Parameters:
id             A reference to the communication identifier where the request will be posted
context        A user-defined context associated with the request
addr           The local address of the source of the write request
length         The length of the write operation
mr             Optional registered memory region associated with the local buffer
flags          Optional flags used to control the write operation
remote_addr    The address of the remote registered memory to write into
rkey           The registered memory key associated with the remote address

Output Parameters:
None

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
rdma_post_write posts a work request to the send queue of the queue pair associated with the rdma_cm_id, id. The contents of the local data buffer will be written into the remote memory region.
Unless inline data is specified, the local data buffer must have been registered before the write is issued, and the buffer must remain registered until the write completes. The remote buffer must always be registered.
Write operations may not be posted to an rdma_cm_id or the corresponding queue pair until a connection has been established.
The user-defined context associated with the write request will be returned to the user through the work completion work request identifier (wr_id) field.
194. request ID
next       pointer to next WR, NULL if last one
sg_list    scatter/gather array for this WR
num_sge    number of entries in sg_list

struct ibv_sge is defined as follows:

struct ibv_sge
{
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

addr      address of buffer
length    length of buffer
lkey      local key (lkey) of buffer from ibv_reg_mr

4.6.5 ibv_post_send

Template:
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr)

Input Parameters:
qp    struct ibv_qp from ibv_create_qp
wr    first work request (WR)

Output Parameters:
bad_wr    pointer to first rejected WR

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
ibv_post_send posts a linked list of WRs to a queue pair's (QP) send queue. This operation is used to initiate all communication, including RDMA operations. Processing of the WR list is stopped on the first error and a pointer to the offending WR is returned in bad_wr.
The user should not alter or destroy AHs associated with WRs until the request has been fully executed and a completion queue entry (CQE) has been retrieved from the corresponding completion queue (CQ), to avoid unexpected behaviour.
The buffers used by a WR can only be safely reused after the WR has been fully executed and a WCE has been retrieved from the cor
195. responding CQ. However, if the IBV_SEND_INLINE flag was set, the buffer can be reused immediately after the call returns.

struct ibv_send_wr is defined as follows:

struct ibv_send_wr
{
    uint64_t wr_id;
    struct ibv_send_wr *next;
    struct ibv_sge *sg_list;
    int num_sge;
    enum ibv_wr_opcode opcode;
    enum ibv_send_flags send_flags;
    uint32_t imm_data;    /* network byte order */
    union
    {
        struct
        {
            uint64_t remote_addr;
            uint32_t rkey;
        } rdma;
        struct
        {
            uint64_t remote_addr;
            uint64_t compare_add;
            uint64_t swap;
            uint32_t rkey;
        } atomic;
        struct
        {
            struct ibv_ah *ah;
            uint32_t remote_qpn;
            uint32_t remote_qkey;
        } ud;
    } wr;
    uint32_t xrc_remote_srq_num;
};

wr_id          user assigned work request ID
next           pointer to next WR, NULL if last one
sg_list        scatter/gather array for this WR
num_sge        number of entries in sg_list
opcode         IBV_WR_RDMA_WRITE
               IBV_WR_RDMA_WRITE_WITH_IMM
               IBV_WR_SEND
               IBV_WR_SEND_WITH_IMM
               IBV_WR_RDMA_READ
               IBV_WR_ATOMIC_CMP_AND_SWP
               IBV_WR_ATOMIC_FETCH_AND_ADD
send_flags     (optional) this is a bitwise OR of the flags. See the details below.
imm_data       immediate data to send in network byte order
remote_addr    remote virtual address for RDMA/atomic operations
rkey           remote key (from ibv_reg_mr on remote) for RDMA/atomic operations
compare_add    compare value for compare and swap operation
swap           swap value
ah             address handle (AH) for datag
196. ret

    hints.ai_flags = 0;
    ret = rdma_getaddrinfo(ctx->mcast_addr, NULL, &hints, &mcast_rai);
    if (ret) {
        VERB_ERR("rdma_getaddrinfo (mcast)", ret);
        return ret;
    }

    if (ctx->bind_addr) {
        /* bind to a specific adapter if requested to do so */
        ret = rdma_bind_addr(ctx->id, bind_rai->ai_src_addr);
        if (ret) {
            VERB_ERR("rdma_bind_addr", ret);
            return ret;
        }

        /* A PD is created when we bind. Copy it to the context so it can
         * be used later on */
        ctx->pd = ctx->id->pd;
    }

    ret = rdma_resolve_addr(ctx->id, (bind_rai) ? bind_rai->ai_src_addr : NULL,
                            mcast_rai->ai_dst_addr, 2000);
    if (ret) {
        VERB_ERR("rdma_resolve_addr", ret);
        return ret;
    }

    ret = get_cm_event(ctx->channel, RDMA_CM_EVENT_ADDR_RESOLVED, NULL);
    if (ret) {
        return ret;
    }

    memcpy(&ctx->mcast_sockaddr, mcast_rai->ai_dst_addr, sizeof(struct sockaddr));

    return 0;
}

/*
 * Function: create_resources
 *
 * Input:
 *     ctx - The context structure
 *
 * Output:
 *     none
 *
 * Returns:
 *     0 on success, non-zero on failure
 *
 * Description:
 *     Creates the PD, CQ, QP and MR
 */
int create_resources(struct context *ctx)
{
    int ret, buf_size;
    struct ibv_qp_init_attr attr;

    memset(&attr, 0, sizeof(attr));

    /* If we are bound to an address, then a PD was already allocated
     * to the CM ID */
    if (!ctx->pd) {
        ctx->pd = ibv
197. ributes are valid if they have been set using the ibv_modify_xrc_rcv_qp. The exact list of valid attributes depends on the QP state. Multiple ibv_query_xrc_rcv_qp calls may yield different returned values for these attributes: qp_state, path_mig_state, sq_draining, ah_attr (if automatic path migration (APM) is enabled).

4.6.4 ibv_post_recv

Template:
int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr)

Input Parameters:
qp    struct ibv_qp from ibv_create_qp
wr    first work request (WR) containing receive buffers

Output Parameters:
bad_wr    pointer to first rejected WR

Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description:
ibv_post_recv posts a linked list of WRs to a queue pair's (QP) receive queue. At least one receive buffer should be posted to the receive queue to transition the QP to RTR. Receive buffers are consumed as the remote peer executes Send, Send with Immediate and RDMA Write with Immediate operations. Receive buffers are NOT used for other RDMA operations. Processing of the WR list is stopped on the first error and a pointer to the offending WR is returned in bad_wr.

struct ibv_recv_wr is defined as follows:

struct ibv_recv_wr
{
    uint64_t wr_id;
    struct ibv_recv_wr *next;
    struct ibv_sge *sg_list;
    int num_sge;
};

wr_id    user assigned work
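The linked list of receive WRs that ibv_post_recv expects can be sketched as follows. To keep the example free of any library dependency, it uses stand-in structures (demo_sge, demo_recv_wr) that only mirror the fields listed above; real code would use struct ibv_sge and struct ibv_recv_wr from the verbs header and an lkey obtained from ibv_reg_mr. All names beginning with demo_ are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins mirroring the fields of struct ibv_sge and struct ibv_recv_wr. */
struct demo_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};

struct demo_recv_wr {
    uint64_t wr_id;
    struct demo_recv_wr *next;
    struct demo_sge *sg_list;
    int num_sge;
};

/* Carve buf into n slots of slot_len bytes and chain one WR per slot.
 * The last WR gets next == NULL, which terminates the list. */
static void demo_chain_recv_wrs(struct demo_recv_wr *wrs, struct demo_sge *sges,
                                size_t n, char *buf, uint32_t slot_len,
                                uint32_t lkey)
{
    size_t i;

    assert(wrs != NULL && sges != NULL && buf != NULL);
    for (i = 0; i < n; i++) {
        sges[i].addr = (uint64_t)(uintptr_t)(buf + i * slot_len);
        sges[i].length = slot_len;
        sges[i].lkey = lkey;

        wrs[i].wr_id = i;            /* returned later in the completion */
        wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
    }
}

/* Walk the chain the way a consumer of the list would. */
static size_t demo_count_wrs(const struct demo_recv_wr *wr)
{
    size_t count = 0;

    for (; wr != NULL; wr = wr->next)
        count++;
    return count;
}
```

With the real structures, the head of such a chain is what would be passed as the wr argument, and on failure bad_wr would point at the first WR that was not posted.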
198. rr, &bad_wr);
    if (rc)
        fprintf (stderr, "failed to post RR\n");
    else
        fprintf (stdout, "Receive Request was posted\n");
    return rc;
}

/******************************************************************************
* Function: resources_init
*
* Input
* res pointer to resources structure
*
* Output
* res is initialized
*
* Returns
* none
*
* Description
* res is initialized to default values
******************************************************************************/
static void resources_init (struct resources *res)
{
    memset (res, 0, sizeof *res);
    res->sock = -1;
}

/******************************************************************************
* Function: resources_create
*
* Input
* res pointer to resources structure to be filled in
*
* Output
* res filled in with resources
*
* Returns
* 0 on success, 1 on failure
*
* Description
* This function creates and allocates all necessary system resources. These
* are stored in res.
199. rrno will be set to indicate the reason for the failure.

Description:
rdma_create_srq allocates a shared receive queue associated with the rdma_cm_id, id. The id must be bound to a local RDMA device before calling this routine. If the protection domain, pd, is provided, it must be for that same device. After being allocated, the SRQ will be ready to handle posting of receives. If pd is NULL, then the SRQ will be created using a default protection domain. One default protection domain is allocated per RDMA device. The initial SRQ attributes are specified by the attr parameter.
If a completion queue (CQ) is not specified for the XRC SRQ, then a CQ will be allocated by the rdma_cm for the SRQ, along with corresponding completion channels. Completion channels and CQ data created by the rdma_cm are exposed to the user through the rdma_cm_id structure. The actual capabilities and properties of the created SRQ will be returned to the user through the attr parameter.
An rdma_cm_id may only be associated with a single SRQ.

6.1.6 rdma_destroy_srq

Template:
void rdma_destroy_srq(struct rdma_cm_id *id)

Input Parameters:
id    The RDMA communication identifier whose associated SRQ we wish to destroy

Output Parameters:
None

Return Value:
none

Description:
rdma_destroy_srq destroys an SRQ allocated on the rdma_cm_id, id. Any SRQ associated with an rdma_cm_id must be d
200. sical port number (1 is first port)
index    which entry in the pkey table to return (0 is first)

Output Parameters
pkey    desired pkey

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_query_pkey retrieves an entry in the port's partition key (pkey) table. Each port is assigned at least one pkey by the subnet manager (SM). The pkey identifies a partition that the port belongs to. A pkey is roughly analogous to a VLAN ID in Ethernet networking.
The user passes in a pointer to a uint16 that will be filled in with the requested pkey. The user is responsible for freeing this uint16.

4.3.5 ibv_alloc_pd

Template
struct ibv_pd *ibv_alloc_pd(struct ibv_context *context)

Input Parameters
context    struct ibv_context from ibv_open_device

Output Parameters
none

Return Value
Pointer to created protection domain, or NULL on failure.

Description
ibv_alloc_pd creates a protection domain (PD). PDs limit which memory regions can be accessed by which queue pairs (QP), providing a degree of protection from unauthorized access. The user must create at least one PD to use VPI verbs.

4.3.6 ibv_dealloc_pd

Template
int ibv_dealloc_pd(struct ibv_pd *pd)

Input Parameters
pd    struct ibv_pd from ibv_alloc_pd

Output
201. sufficient buffer space to receive the request.

7.2.7 IBV_WC_MW_BIND_ERR
This event is generated when a memory management operation error occurs. The error may be due to the fact that the memory window and the QP belong to different protection domains. It may also be that the memory window is not allowed to be bound to the specified MR, or the access permissions may be wrong.

7.2.8 IBV_WC_BAD_RESP_ERR
This event is generated when an unexpected transport layer opcode is returned by the responder.

7.2.9 IBV_WC_LOC_ACCESS_ERR
This event is generated when a local protection error occurs on a local data buffer during the process of an RDMA Write with Immediate Data operation sent from the remote node.

7.2.10 IBV_WC_REM_INV_REQ_ERR
This event is generated when the receive buffer is smaller than the incoming send. It is generated on the sender side of the connection. It may also be generated if the QP attributes are not set correctly, particularly those governing MR access.

7.2.11 IBV_WC_REM_ACCESS_ERR
This event is generated when a protection error occurs on a remote data buffer to be read by an RDMA Read, written by an RDMA Write, or accessed by an atomic operation. The error is reported only on RDMA operations or atomic operations.

7.2.12 IBV_WC_REM_OP_ERR
This event is generated when an operation cannot be completed successfully by the responder
202. t, "found %d device(s)\n", num_devices);
    /* search for the specific device we want to work with */
    for (i = 0; i < num_devices; i++)
    {
        if (!config.dev_name)
        {
            config.dev_name = strdup(ibv_get_device_name(dev_list[i]));
            fprintf(stdout, "device not specified, using first one found: %s\n", config.dev_name);
        }
        if (!strcmp(ibv_get_device_name(dev_list[i]), config.dev_name))
        {
            ib_dev = dev_list[i];
            break;
        }
    }
    /* if the device wasn't found in host */
    if (!ib_dev)
    {
        fprintf(stderr, "IB device %s wasn't found\n", config.dev_name);
        rc = 1;
        goto resources_create_exit;
    }
    /* get device handle */
    res->ib_ctx = ibv_open_device(ib_dev);
    if (!res->ib_ctx)
    {
        fprintf(stderr, "failed to open device %s\n", config.dev_name);
        rc = 1;
        goto resources_create_exit;
    }
    /* We are now done with device list, free it */
    ibv_free_device_list(dev_list);
    dev_list = NULL;
    ib_dev = NULL;
    /* query port properties */
    if (ibv_query_port(res->ib_ctx, config.ib_port, &res->port_attr))
    {
        fprintf(stderr, "ibv_query_port on port %u failed\n", config.ib_port);
        rc = 1;
        goto resources_create_exit;
    }
    /* allocate Protection Domain */
    res->pd = ibv_alloc_pd(res->ib_ctx);
    if (!res->pd)
    {
        fprintf(stderr, "ibv_alloc_pd failed\n");
        rc = 1;
        goto resources_create_exit;
    }
    /* each side will send only one WR, so Completion Queue with 1 entry is enough */
    cq_size = 1;
203. t

Template
int rdma_reject(struct rdma_cm_id *id, const void *private_data, uint8_t private_data_len)

Input Parameters
id                  RDMA communication identifier
private_data        Optional private data to send with the reject message
private_data_len    Size (in bytes) of the private data being sent

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_reject is called from the listening side to reject a connection or datagram service lookup request.
After receiving a connection request event, a user may call rdma_reject to reject the request. The optional private data will be passed to the remote side if the underlying RDMA transport supports private data in the reject message.

5.2.15 rdma_notify

Template
int rdma_notify(struct rdma_cm_id *id, enum ibv_event_type event)

Input Parameters
id       RDMA communication identifier
event    Asynchronous event

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_notify is used to notify the librdmacm of asynchronous events which have occurred on a QP associated with the rdma_cm_id id.
Asynchronous events that occur on a QP are reported through the user's device event handler. This ro
204. t get_cm_event(struct rdma_event_channel *channel,
                 enum rdma_cm_event_type type,
                 struct rdma_cm_event **out_ev)
{
    int ret = 0;
    struct rdma_cm_event *event = NULL;

    ret = rdma_get_cm_event(channel, &event);
    if (ret) {
        /* report the call that actually failed */
        VERB_ERR("rdma_get_cm_event", ret);
        return -1;
    }

    /* Verify the event is the expected type */
    if (event->event != type) {
        printf("event: %s, status: %d\n",
               rdma_event_str(event->event), event->status);
        ret = -1;
    }

    /* Pass the event back to the user if requested */
    if (!out_ev)
        rdma_ack_cm_event(event);
    else
        *out_ev = event;

    return ret;
}

/*
 * Function: resolve_addr
 *
 * Input:
 *      ctx     The context structure
 *
 * Output:
 *      none
 *
 * Returns:
 *      0 on success, non-zero on failure
 *
 * Description:
 *      Resolves the multicast address and also binds to the source address
 *      if one was provided in the context
 */
int resolve_addr(struct context *ctx)
{
    int ret;
    struct rdma_addrinfo *bind_rai = NULL;
    struct rdma_addrinfo *mcast_rai = NULL;
    struct rdma_addrinfo hints;

    memset(&hints, 0, sizeof (hints));
    hints.ai_port_space = RDMA_PS_UDP;

    if (ctx->bind_addr) {
        hints.ai_flags = RAI_PASSIVE;

        ret = rdma_getaddrinfo(ctx->bind_addr, NULL, &hints, &bind_rai);
        if (ret) {
            VERB_ERR("rdma_getaddrinfo (bind)", ret);
            return
205. t handle

4.4.10 ibv_close_xrc_domain

Template
int ibv_close_xrc_domain(struct ibv_xrc_domain *d)

Input Parameters
d    A pointer to the XRC domain the user wishes to close

Output Parameters
none

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_close_xrc_domain closes the XRC domain d. If this happens to be the last reference, then the XRC domain will be destroyed. This function decrements a reference count and may fail if any QP or SRQ are still associated with the XRC domain being closed.

4.4.11 ibv_create_xrc_rcv_qp

Template
int ibv_create_xrc_rcv_qp(struct ibv_qp_init_attr *init_attr, uint32_t *xrc_rcv_qpn)

Input Parameters
init_attr      The structure to be populated with QP information
xrc_rcv_qpn    The QP number associated with the receive QP to be created

Output Parameters
init_attr      Populated with the XRC domain information the QP will be associated with
xrc_rcv_qpn    The QP number associated with the receive QP being created

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_create_xrc_rcv_qp creates an XRC queue pair (QP) to serve as a receive side only QP and returns the QP number thro
206. t in a subnet

GID (Global IDentifier): A 128 bit identifier used to identify a Port on a network adapter, a port on a Router, or a Multicast Group. A GID is a valid 128-bit IPv6 address (per RFC 2373) with additional properties/restrictions defined within IBA to facilitate efficient discovery, communication, and routing.
GRH (Global Routing Header): A packet header used to deliver packets across a subnet boundary and also used to deliver Multicast messages. This packet header is based on the IPv6 protocol.
Network Adapter: A hardware device that allows for communication between computers in a network.
Host: A computer platform executing an Operating System which may control one or more network adapters.
IB: InfiniBand

Table 2 - Glossary (Sheet 2 of 4)
Term: Description
Join operation: An IB port must explicitly join a multicast group by sending a request to the SA to receive multicast packets.
lkey: A number that is received upon registration of MR and is used locally by the WR to identify the memory region and its associated permissions.
LID (Local IDentifier): A 16 bit address assigned to end nodes by the subnet manager. Each LID is unique within its subnet.
LLE (Low Latency Ethernet): RDMA service over CEE (Converged Enhanced Ethernet) allowing IB transport over Ethernet.
NA Network Adapter MGID Multicast Group ID A device which t
207. ta)
{
    int rc;
    int read_bytes = 0;
    int total_read_bytes = 0;

    rc = write(sock, local_data, xfer_size);
    if (rc < xfer_size)
        fprintf(stderr, "Failed writing data during sock_sync_data\n");
    else
        rc = 0;
    while (!rc && total_read_bytes < xfer_size)
    {
        /* continue after the bytes already received instead of overwriting them */
        read_bytes = read(sock, remote_data + total_read_bytes,
                          xfer_size - total_read_bytes);
        if (read_bytes > 0)
            total_read_bytes += read_bytes;
        else
            rc = read_bytes;
    }
    return rc;
}

/******************************************************************************
 End of socket operations
 ******************************************************************************/

/* poll_completion */
/******************************************************************************
 * Function: poll_completion
 *
 * Input
 *   res   pointer to resources structure
 *
 * Output
 *   none
 *
 * Returns
 *   0 on success, 1 on failure
 *
 * Description
 *   Poll the completion queue for a single event. This function will continue
 *   to poll the queue until MAX_POLL_CQ_TIMEOUT milliseconds have passed.
 ******************************************************************************/
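A socket read may return fewer bytes than requested, so a loop of this kind has to advance through the buffer until the whole transfer has arrived. The following is a minimal sketch of that pattern, not the manual's code: `read_full` is a hypothetical helper name, and treating end-of-file before `len` bytes as a failure is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes from fd into buf.
 * Returns 0 on success, -1 on error or if the peer closes early. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;
    size_t done = 0;

    assert(buf != NULL);
    while (done < len) {
        ssize_t n = read(fd, p + done, len - done);
        if (n > 0)
            done += (size_t)n;   /* partial read: keep going */
        else if (n < 0 && errno == EINTR)
            continue;            /* interrupted by a signal: retry */
        else
            return -1;           /* error, or EOF before len bytes */
    }
    return 0;
}
```

Passing `p + done` and `len - done` on each call is what keeps earlier bytes from being overwritten when the data arrives in several pieces.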
208. te_con_data.qp_num = ntohl(tmp_con_data.qp_num);
    remote_con_data.lid = ntohs(tmp_con_data.lid);
    memcpy(remote_con_data.gid, tmp_con_data.gid, 16);
    /* save the remote side attributes, we will need it for the post SR */
    res->remote_props = remote_con_data;
    fprintf(stdout, "Remote address = 0x%" PRIx64 "\n", remote_con_data.addr);
    fprintf(stdout, "Remote rkey = 0x%x\n", remote_con_data.rkey);
    fprintf(stdout, "Remote QP number = 0x%x\n", remote_con_data.qp_num);
    fprintf(stdout, "Remote LID = 0x%x\n", remote_con_data.lid);
    if (config.gid_idx >= 0)
    {
        uint8_t *p = remote_con_data.gid;
        fprintf(stdout, "Remote GID = %02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x:%02x\n",
                p[0], p[1], p[2], p[3], p[4], p[5], p[6], p[7],
                p[8], p[9], p[10], p[11], p[12], p[13], p[14], p[15]);
    }
    /* modify the QP to init */
    rc = modify_qp_to_init(res->qp);
    if (rc)
    {
        fprintf(stderr, "change QP state to INIT failed\n");
        goto connect_qp_exit;
    }
    /* let the client post RR to be prepared for incoming messages */
    if (config.server_name)
    {
        rc = post_receive(res);
        if (rc)
        {
            fprintf(stderr, "failed to post RR\n");
            goto connect_qp_exit;
        }
    }
    /* modify the QP to RTR */
    rc = modify_qp_to_rtr(res->qp, remote_con_data.qp_num, remote_con_data.lid, remote_con_data.gid);
    if (rc)
    {
        fprintf(stderr, "failed to modify QP state to RTR\n");
        goto co
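The connection data exchanged over the socket is converted between network and host byte order field by field. A minimal sketch of that round trip follows. The `con_data` layout is an assumption that mirrors the field names in the excerpt, and `hton64`, `ntoh64`, `to_wire` and `from_wire` are hypothetical helpers written for this illustration.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout, mirroring the field names used in the excerpt. */
struct con_data {
    uint64_t addr;     /* buffer address */
    uint32_t rkey;     /* remote key */
    uint32_t qp_num;   /* QP number */
    uint16_t lid;      /* LID of the IB port */
    uint8_t  gid[16];  /* GID: a byte array, so no swapping is needed */
};

/* Hypothetical 64-bit conversion built from two 32-bit conversions. */
static uint64_t hton64(uint64_t v)
{
    uint32_t hi = htonl((uint32_t)(v >> 32));
    uint32_t lo = htonl((uint32_t)(v & 0xffffffffu));
    uint64_t out;
    memcpy(&out, &hi, 4);               /* high word first in memory */
    memcpy((char *)&out + 4, &lo, 4);
    return out;
}

static uint64_t ntoh64(uint64_t v)
{
    uint32_t hi, lo;
    memcpy(&hi, &v, 4);
    memcpy(&lo, (char *)&v + 4, 4);
    return ((uint64_t)ntohl(hi) << 32) | ntohl(lo);
}

/* Host order to network order, ready to be written to the socket. */
static struct con_data to_wire(const struct con_data *host)
{
    struct con_data w;
    assert(host != NULL);
    w = *host;
    w.addr = hton64(host->addr);
    w.rkey = htonl(host->rkey);
    w.qp_num = htonl(host->qp_num);
    w.lid = htons(host->lid);
    return w;
}

/* Network order, as read from the socket, back to host order. */
static struct con_data from_wire(const struct con_data *wire)
{
    struct con_data h;
    assert(wire != NULL);
    h = *wire;
    h.addr = ntoh64(wire->addr);
    h.rkey = ntohl(wire->rkey);
    h.qp_num = ntohl(wire->qp_num);
    h.lid = ntohs(wire->lid);
    return h;
}
```

Converting every multi-byte field on both sides lets hosts of different endianness exchange the structure safely; the GID is copied as-is because it is already a byte sequence.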
209. ted in programming example by 8.1.4 resources_create.

Create a Queue Pair (QP)
Creating a QP will also create an associated send queue and receive queue. Implemented in programming example by 8.1.4 resources_create.

Bring up a QP
A created QP still cannot be used until it is transitioned through several states, eventually getting to Ready To Send (RTS). This provides needed information used by the QP to be able to send/receive data. Implemented in programming example by 8.1.6 connect_qp, 8.1.7 modify_qp_to_init, 8.1.8 post_receive, 8.1.10 modify_qp_to_rtr, and 8.1.11 modify_qp_to_rts.

Post work requests and poll for completion
Use the created QP for communication operations. Implemented in programming example by 8.1.12 post_send and 8.1.13 poll_completion.

10. Cleanup
Destroy objects in the reverse order you created them:
Delete QP
Delete CQ
Deregister MR
Deallocate PD
Close device
Implemented in programming example by 8.1.14 resources_destroy.

4 VPI Verbs API
This chapter describes the details of the VPI verbs API.

4.1 Initialization

4.1.1 ibv_fork_init

Template
int ibv_fork_init(void)

Input Parameters
None

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
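The cleanup rule in step 10 (destroy objects in the reverse order of creation) can be sketched with a small last-in, first-out list of destructor callbacks. This is an illustrative pattern only, not the manual's resources_destroy: `cleanup_stack`, `cleanup_push`, `cleanup_run` and `log_tag` are hypothetical names, and in a real program the callbacks would wrap calls such as `ibv_destroy_qp` or `ibv_dealloc_pd`.

```c
#include <assert.h>
#include <stddef.h>

#define CLEANUP_MAX 16

typedef void (*cleanup_fn)(void *arg);

struct cleanup_stack {
    cleanup_fn fn[CLEANUP_MAX];
    void *arg[CLEANUP_MAX];
    size_t n;
};

/* Register a destructor right after its object has been created.
 * Returns 0 on success, -1 if the stack is full. */
static int cleanup_push(struct cleanup_stack *s, cleanup_fn fn, void *arg)
{
    assert(s != NULL && fn != NULL);
    if (s->n == CLEANUP_MAX)
        return -1;
    s->fn[s->n] = fn;
    s->arg[s->n] = arg;
    s->n++;
    return 0;
}

/* Run the destructors last-in, first-out, i.e. in reverse creation order. */
static void cleanup_run(struct cleanup_stack *s)
{
    assert(s != NULL);
    while (s->n > 0) {
        s->n--;
        s->fn[s->n](s->arg[s->n]);
    }
}

/* Demo destructor for this sketch: records the order of the calls. */
static char destroy_order[CLEANUP_MAX];
static size_t destroy_count;

static void log_tag(void *arg)
{
    if (destroy_count < CLEANUP_MAX)
        destroy_order[destroy_count++] = *(const char *)arg;
}
```

Registering a destructor immediately after each successful creation means an error path can call `cleanup_run` at any point and release only what was actually created.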
210. the passive side of a connection.
The rdma_addrinfo structure is described under the rdma_create_ep routine.

5.2.24 rdma_freeaddrinfo

Template
void rdma_freeaddrinfo(struct rdma_addrinfo *res)

Input Parameters
res    The rdma_addrinfo structure to free

Output Parameters
None

Return Value
None

Description
rdma_freeaddrinfo releases the rdma_addrinfo (res) structure returned by the rdma_getaddrinfo routine. Note that if ai_next is not NULL, rdma_freeaddrinfo will free the entire list of addrinfo structures.

5.2.25 rdma_create_qp

Template
int rdma_create_qp(struct rdma_cm_id *id, struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr)

Input Parameters
id              RDMA identifier
pd              protection domain for the QP
qp_init_attr    initial QP attributes

Output Parameters
none

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_create_qp allocates a QP associated with the specified rdma_cm_id and transitions it for sending and receiving.

Notes
The rdma_cm_id must be bound to a local RDMA device before calling this function, and the protection domain must be for that same device. QPs allocated to an rdma_cm_id are automatically transitioned by the librdmacm through their states. After
211. till receive a single copy of a multicast message.

4.6.14 ibv_detach_mcast

Template
int ibv_detach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid)

Input Parameters
qp     QP to detach from the multicast group
gid    The multicast group GID
lid    The multicast group LID in host byte order

Output Parameters
none

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_detach_mcast detaches the specified QP, qp, from the multicast group whose multicast group GID is gid and multicast LID is lid.

4.7 Event Handling Operations

4.7.1 ibv_get_async_event

Template
int ibv_get_async_event(struct ibv_context *context, struct ibv_async_event *event)

Input Parameters
context    struct ibv_context from ibv_open_device
event      A pointer to use to return the async event

Output Parameters
event      A pointer to the async event being sought

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_get_async_event gets the next asynchronous event of the RDMA device context context and returns it through the pointer event, which is an ibv_async_event struct. All async events returned by ibv_get_async_event must eventua
212. tion a QP from the INIT to RTR state, using the specified QP number
 ******************************************************************************/
static int modify_qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t dlid, uint8_t *dgid)
{
    struct ibv_qp_attr attr;
    int flags;
    int rc;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state = IBV_QPS_RTR;
    attr.path_mtu = IBV_MTU_256;
    attr.dest_qp_num = remote_qpn;
    attr.rq_psn = 0;
    attr.max_dest_rd_atomic = 1;
    attr.min_rnr_timer = 0x12;
    attr.ah_attr.is_global = 0;
    attr.ah_attr.dlid = dlid;
    attr.ah_attr.sl = 0;
    attr.ah_attr.src_path_bits = 0;
    attr.ah_attr.port_num = config.ib_port;
    if (config.gid_idx >= 0)
    {
        attr.ah_attr.is_global = 1;
        attr.ah_attr.port_num = 1;
        memcpy(&attr.ah_attr.grh.dgid, dgid, 16);
        attr.ah_attr.grh.flow_label = 0;
        attr.ah_attr.grh.hop_limit = 1;
        attr.ah_attr.grh.sgid_index = config.gid_idx;
        attr.ah_attr.grh.traffic_class = 0;
    }
    flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN |
            IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
    rc = ibv_modify_qp(qp, &attr, flags);
    if (rc)
        fprintf(stderr, "failed to modify QP state to RTR\n");
    return rc;
}
213. tion of the domain (if it does not exist) is atomic with respect to other processes executing open with fd naming the same inode.
If fd equals -1, then no inode is associated with the domain, and the only valid value for oflag is O_CREAT.
Since each ibv_open_xrc_domain call increments the xrc_domain object's reference count, each such call must have a corresponding ibv_close_xrc_domain call to decrement the xrc_domain object's reference count.

4.4.9 ibv_create_xrc_srq

Template
struct ibv_srq *ibv_create_xrc_srq(struct ibv_pd *pd, struct ibv_xrc_domain *xrc_domain, struct ibv_cq *xrc_cq, struct ibv_srq_init_attr *srq_init_attr)

Input Parameters
pd               The protection domain associated with the shared receive queue
xrc_domain       The XRC domain
xrc_cq           The CQ which will hold the XRC completion
srq_init_attr    A list of initial attributes required to create the SRQ (described above)

Output Parameters
ibv_srq_attr     Actual values of the struct are set

Return Value
A pointer to the created SRQ, or NULL on failure.

Description
ibv_create_xrc_srq creates an XRC shared receive queue (SRQ) associated with the protection domain pd, the XRC domain xrc_domain, and the CQ which will hold the completion, xrc_cq.

struct ibv_xrc_domain is defined as follows:

struct ibv_xrc_domain {
    struct ibv_context *context;    /* struct ibv_context from ibv_open_device */
    uint64
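The pairing rule for opening and closing an XRC domain is ordinary reference counting: the object is destroyed only when the last reference is dropped. The toy model below illustrates that rule only. `toy_domain`, `toy_open` and `toy_close` are hypothetical names, and this is not how libibverbs implements XRC domains.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a reference-counted domain (illustration only). */
struct toy_domain {
    int refcnt;      /* number of opens not yet matched by a close */
    int destroyed;   /* set once the last reference is dropped */
};

/* Each open takes one reference. */
static void toy_open(struct toy_domain *d)
{
    assert(d != NULL && !d->destroyed);
    d->refcnt++;
}

/* Each close drops one reference; the last close destroys the object.
 * Returns 0 on success, -1 if there is no matching open. */
static int toy_close(struct toy_domain *d)
{
    assert(d != NULL);
    if (d->refcnt <= 0)
        return -1;
    if (--d->refcnt == 0)
        d->destroyed = 1;
    return 0;
}
```

A missing close leaves the count above zero, so the object is never released; an extra close is reported as an error rather than silently accepted.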
214. tory of this source tree, or the OpenIB.org BSD license below:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Compile Command:
gcc mc.c -o mc -libverbs -lrdmacm

Description:
Both the sender and receiver create a UD Queue Pair and join the specified multicast group (ctx.mcast_addr). If the join is successful, the sender must create an Address Handle (ctx.ah). The sender then posts the specified number of sends (ctx.msg_count) to the multicast group. The receiver waits to receive
215. ugh xrc_rcv_qpn. This number must be passed to the remote (sender) node. The remote node will use xrc_rcv_qpn in ibv_post_send when it sends messages to an XRC SRQ on this host in the same xrc domain as the XRC receive QP.
The QP with number xrc_rcv_qpn is created in kernel space and persists until the last process registered for the QP has called ibv_unreg_xrc_rcv_qp, at which point the QP is destroyed. The process which creates this QP is automatically registered for it and should also call ibv_unreg_xrc_rcv_qp at some point to unregister.
Any process which wishes to receive on an XRC SRQ via this QP must call ibv_reg_xrc_rcv_qp for this QP to ensure that the QP will not be destroyed while they are still using it.
Please note that because the QP xrc_rcv_qpn is a receive only QP, the send queue in the init_attr struct is ignored.

4.4.12 ibv_modify_xrc_rcv_qp

Template
int ibv_modify_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num, struct ibv_qp_attr *attr, int attr_mask)

Input Parameters
xrc_domain    The XRC domain associated with this QP
xrc_qp_num    The queue pair number to identify this QP
attr          The attributes to use to modify the XRC receive QP
attr_mask     The mask to use for modifying the QP attributes

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
216. uld be the device the completion queue entry (CQE) was received on
port_num    physical port number (1..n) that CQE was received on
wc          received CQE from ibv_poll_cq
grh         global route header (GRH) from packet (see description)

Output Parameters
ah_attr     address handle (AH) attributes

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
ibv_init_ah_from_wc initializes an AH with the necessary attributes to generate a response to a received datagram. The user should allocate a struct ibv_ah_attr and pass this in. If appropriate, the GRH from the received packet should be passed in as well. On UD connections the first 40 bytes of the received packet may contain a GRH. Whether or not this header is present is indicated by the IBV_WC_GRH flag of the CQE. If the GRH is not present on a packet on a UD connection, the first 40 bytes of a packet are undefined.
When the function ibv_init_ah_from_wc completes, the ah_attr will be filled in and the ah_attr may then be used in the ibv_create_ah function. The user is responsible for freeing ah_attr.
Alternatively, ibv_create_ah_from_wc may be used instead of this operation.

4.6.12 ibv_create_ah_from_wc

Template
struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, struct ibv_grh *grh, uint8_t port_num)

Input Parameters
pd    protection d
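Since the first 40 bytes of a UD receive buffer are set aside for the GRH whether or not one arrived, the payload starts at offset 40 and the GRH contents are meaningful only when the completion carries the IBV_WC_GRH flag. The following is a minimal sketch of that buffer arithmetic. `GRH_LEN`, `WC_FLAG_GRH` and the helper names are stand-ins defined here so the block compiles without libibverbs; real code would test `wc.wc_flags & IBV_WC_GRH`. It also assumes the byte count reported by the completion includes the 40 reserved bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRH_LEN     40u    /* bytes reserved at the start of a UD receive buffer */
#define WC_FLAG_GRH 0x1u   /* stand-in for IBV_WC_GRH in this sketch */

/* The payload begins after the 40 reserved bytes. */
static const uint8_t *ud_payload(const uint8_t *recv_buf)
{
    assert(recv_buf != NULL);
    return recv_buf + GRH_LEN;
}

/* Payload length, assuming byte_len counts the 40 reserved bytes.
 * Returns 0 if byte_len is too short to hold any payload. */
static size_t ud_payload_len(uint32_t byte_len)
{
    return byte_len > GRH_LEN ? (size_t)(byte_len - GRH_LEN) : 0;
}

/* Pointer to the GRH, or NULL when the flag says the first 40 bytes
 * are undefined. */
static const uint8_t *ud_grh(const uint8_t *recv_buf, unsigned wc_flags)
{
    assert(recv_buf != NULL);
    return (wc_flags & WC_FLAG_GRH) ? recv_buf : NULL;
}
```

A receive buffer for a UD QP therefore needs to be posted 40 bytes larger than the largest expected message.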
217. under the terms of the GNU General Public License (GPL) Version 2, available from the file COPYING in the main directory of this source tree, or the OpenIB.org BSD license below:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

$Id$

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <byt
218. utine is used to notify the librdmacm of communication events. In most cases, use of this routine is not necessary; however, if connection establishment is done out of band (such as done through InfiniBand), it is possible to receive data on a QP that is not yet considered connected. This routine forces the connection into an established state in this case in order to handle the rare situation where the connection never forms on its own. Calling this routine ensures the delivery of the RDMA_CM_EVENT_ESTABLISHED event to the application. Events that should be reported to the CM are: IB_EVENT_COMM_EST.

5.2.16 rdma_disconnect

Template
int rdma_disconnect(struct rdma_cm_id *id)

Input Parameters
id    RDMA communication identifier

Output Parameters
None

Return Value
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.

Description
rdma_disconnect disconnects a connection and transitions any associated QP to the error state. This action will result in any posted work requests being flushed to the completion queue. rdma_disconnect may be called by both the client and server side of the connection. After successfully disconnecting, an RDMA_CM_EVENT_DISCONNECTED event will be generated on both sides of the connection.

5.2.17 rdma_get_src_port

Templ
219. wait_completion
 *
 * Input:
 *      ctx     The context object
 *
 * Output:
 *      none
 *
 * Returns:
 *      0 on success, non-zero on failure
 *
 * Description:
 *      Waits for a completion on the SRQ CQ
 */
int await_completion(struct context *ctx)
{
    int ret;
    struct ibv_cq *ev_cq;
    void *ev_ctx;

    /* Wait for a CQ event to arrive on the channel */
    ret = ibv_get_cq_event(ctx->srq_cq_channel, &ev_cq, &ev_ctx);
    if (ret) {
        VERB_ERR("ibv_get_cq_event", ret);
        return ret;
    }

    ibv_ack_cq_events(ev_cq, 1);

    /* Reload the event notification */
    ret = ibv_req_notify_cq(ctx->srq_cq, 0);
    if (ret) {
        VERB_ERR("ibv_req_notify_cq", ret);
        return ret;
    }

    return 0;
}

/*
 * Function: run_server
 *
 * Input:
 *      ctx     The context object
 *      rai     The RDMA address info for the connection
 *
 * Output:
 *      none
 *
 * Returns:
 *      0 on success, non-zero on failure
 *
 * Description:
 *      Executes the server side of the example
 */
int run_server(struct context *ctx, struct rdma_addrinfo *rai)
{
    int ret, i;
    uint64_t send_count = 0;
    uint64_t recv_count = 0;
    struct ibv_wc wc;
    struct ibv_qp_init_attr qp_attr;

    ret = init_resources(ctx, rai);
    if (ret) {
        printf("init_resources returned %d\n", ret);
        return ret;
    }

    /* Use the srq_id as the listen_id since it is already setup */
    ctx->listen_id = ctx->srq_id;

    ret = rdma_listen(ctx->listen_id, 4);
    if
220. x->id->event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
        printf("unexpected event: %s", rdma_event_str(ctx->id->event->event));
        return ret;
    }

    /* If the alternate path info was not set on the command line, get it from
     * the private data */
    if (ctx->alt_dlid == 0 && ctx->alt_srcport == 0) {
        ret = get_alt_dlid_from_private_data(ctx->id->event, &ctx->alt_dlid);
        if (ret) {
            return ret;
        }
    }

    return 0;
}

/*
 * Function: establish_connection
 *
 * Input:
 *      ctx     The context object
 *
 * Output:
 *      none
 *
 * Returns:
 *      0 on success, non-zero on failure
 *
 * Description:
 *      Create the connection. For the client, call rdma_connect. For the
 *      server, the connect request was already received, so just do
 *      rdma_accept to complete the connection.
 */
int establish_connection(struct context *ctx)
{
    int ret;
    uint16_t private_data;
    struct rdma_conn_param conn_param;

    /* post a receive to catch the first send */
    ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
                         ctx->recv_mr);
    if (ret) {
        VERB_ERR("rdma_post_recv", ret);
        return ret;
    }

    /* send the dlid for the alternate port in the private data */
    private_data = htons(ctx->my_alt_dlid);

    memset(&conn_param, 0, sizeof (conn_param));
    /* length matches the 16-bit value actually sent */
    conn_param.private_data_len = sizeof (private_data);
    conn_param.private_data = &private
221. y device in this list contains both a name and a GUID. For example, the device names can be: mthca0, mlx4_1. Implemented in programming example by 8.1.4 resources_create.

Open the requested device
Iterate over the device list, choose a device according to its GUID or name, and open it. Implemented in programming example by 8.1.4 resources_create.

Query the device capabilities
The device capabilities allow the user to understand the supported features (APM, SRQ) and capabilities of the opened device. Implemented in programming example by 8.1.4 resources_create.

Allocate a Protection Domain to contain your resources
A Protection Domain (PD) allows the user to restrict which components can interact only with each other. These components can be AH, QP, MR, MW, and SRQ. Implemented in programming example by 8.1.4 resources_create.

Register a memory region
VPI only works with registered memory. Any memory buffer which is valid in the process's virtual space can be registered. During the registration process the user sets memory permissions and receives local and remote keys (lkey/rkey) which will later be used to refer to this memory buffer. Implemented in programming example by 8.1.4 resources_create.

Create a Completion Queue (CQ)
A CQ contains completed work requests (WR). Each WR will generate a completion queue entry (CQE) that is placed on the CQ. The CQE will specify if the WR was completed successfully or not. Implemen
222. y strict set of attributes that may be modified during each transition, and transitions must occur in the proper order. The following subsections describe each transition in more detail.

struct ibv_qp_attr is defined as follows:

struct ibv_qp_attr
{
    enum ibv_qp_state    qp_state;
    enum ibv_qp_state    cur_qp_state;
    enum ibv_mtu         path_mtu;
    enum ibv_mig_state   path_mig_state;
    uint32_t             qkey;
    uint32_t             rq_psn;
    uint32_t             sq_psn;
    uint32_t             dest_qp_num;
    int                  qp_access_flags;
    struct ibv_qp_cap    cap;
    struct ibv_ah_attr   ah_attr;
    struct ibv_ah_attr   alt_ah_attr;
    uint16_t             pkey_index;
    uint16_t             alt_pkey_index;
    uint8_t              en_sqd_async_notify;
    uint8_t              sq_draining;
    uint8_t              max_rd_atomic;
    uint8_t              max_dest_rd_atomic;
    uint8_t              min_rnr_timer;
    uint8_t              port_num;
    uint8_t              timeout;
    uint8_t              retry_cnt;
    uint8_t              rnr_retry;
    uint8_t              alt_port_num;
    uint8_t              alt_timeout;
};

The following values select one of the above attributes and should be OR'd into the attr_mask field:
IBV_QP_STATE
IBV_QP_CUR_STATE
IBV_QP_EN_SQD_ASYNC_NOTIFY
IBV_QP_ACCESS_FLAGS
IBV_QP_PKEY_INDEX
IBV_QP_PORT
IBV_QP_QKEY
IBV_QP_AV
IBV_QP_PATH_MTU
IBV_QP_TIMEOUT
IBV_QP_RETRY_CNT
IBV_QP_RNR_RETRY
IBV_QP_RQ_PSN
IBV_QP_MAX_QP_RD_ATOMIC
IBV_QP_ALT_PATH
IBV_QP_MIN_RNR_TIMER
IBV_QP_SQ_PSN
IBV_QP_MAX_DEST_RD_ATOMIC
IBV_QP_PATH_MIG_STATE
IBV_QP_CAP
IBV_


Copyright © All rights reserved.