Coherent Accelerator Processor Interface User's Manual
Contents
1. ... 21
   2.1 Organization of a CAIA-Compliant Accelerator  21
   2.1.1 POWER Service Layer  22
   2.1.2 Accelerator Function Unit  23
   2.2 Main Storage Addressing  23
   2.2.1 Main Storage Attributes  23
   3 Programming Models  25
   3.1 Dedicated Process Programming Model  26
   3.1.1 Starting and Stopping an AFU in the Dedicated Process Model  26
   3.2 Shared Programming Models  29
   3.2.1 Starting and Stopping an AFU in the Shared Models  31
   3.3 Scheduled Processes Area  33
   3.3.1 Process Element Entry  35
   3.3.2 Software State Field Format
2. [Figure residue: PSL slice block diagram. Recoverable labels: <psl_queue_area>, <psl_queue_control>, <psl_control_area>; Hypervisor Accelerator Utilization Record (HAURP), <utilization_val>; POWER Service Layer (PSL) slice; SR, SSTP, SDR, WEQP, WE fetch, AMOR, IVTEs, CSRP, HAURP, LPID, PID, TID; context manager; WED to AFU; CtxTime; preempt request; segment walk; page walk; interrupt source layer; AFU interrupt request; EA from AFU; physical address (RA).]

   Programming Models. Page 32 of 101. Version 1.2, 29 January 2015. User's Manual, Advance. Coherent Accelerator Processor Interface.

   3.3 Scheduled Processes Area

   In the virtualization programming models, the PSL reads process elements from a structure located in system memory called the scheduled processes area (SPA). The SPA contains a list of processes to be serviced by the AFUs. The process elements contain the address context and other state information for the processes scheduled to run on the AFUs assigned to service the SPA structure. The SPA structure consists of two sections: a linked list maintained by system software and a circular queue maintained by the PSL. The circular queue section is only used for programming models where th
3. ... 36
   3.3.3 Software Command Status Field Format  37
   3.4 Process Management  38
   3.4.1 Adding a Process Element to the Linked List by System Software  39
   3.4.2 PSL Queue Processing (Starting and Resuming Process Elements)  42
   3.4.3 Terminating a Process Element  43
   3.4.4 Removing a Process Element from the Linked List  48
   3.4.5 Suspending a Process Element in the Linked List  50
   3.4.6 Resume a Process Element  54
   3.4.7 Updating a Process Element in the Linked List  56
   4 AFU Descriptor Overview  59
   4.1 AFU Descriptor Format  59
   5 PSL
4. 1. The PSL notifies the AFU of the process element termination. The AFU performs any necessary operations to remove the process and then acknowledges the termination of the process element. When the acknowledgment is received, the PSL continues with the next substep.
   2. If the process is running, the process is terminated. The AFU and PSL are allowed to complete any outstanding transactions but must not start any new transactions for the process.
   3. The PSL writes a termination command to the psl_chained_command doubleword for the next PSL.
      • Write the value x'00010000' || next_psl_id || link_of_element_to_terminate to the psl_chained_command.

   Operations Performed by the Next PSL (PSL_ID[L,F] = '00')

   When the terminate_element command is detected by the next PSL, the PSL checks to see if the process element being terminated is currently running, performs any operations necessary, and sends the terminate_element command to the next PSL. The terminate_element command is detected by monitoring the psl_chained_command doubleword.

   1. The PSL notifies the AFU of the process element termination. The AFU performs any necessary operations to remove the process and then acknowledges the termination of the process element. When the acknowledgment is received, the PSL continues with
5. 3.4.6 Resume a Process Element

   The resume process element procedure is used to restart the execution of a process element after the process has been suspended.

   3.4.6.1 Software Procedure

   1. Reset the suspend flag in the software state to 0 (Software_State[S] = '0').
      • Store x'80000000' to the 31st word of the process element to resume.
   2. Ensure that the update to the software_state is visible to all processes.
      • System software running on the host processor must perform a sync instruction.
   3. Write the resume_element command to the software command status field in the linked list area.
      • Store x'00040000' || first_psl_id || link_of_element_to_resume to sw_command_status.
   4. Ensure that the resume_element command is visible to all processes.
      • System software running on the host processor must perform a sync instruction.
   5. Issue the resume_element MMIO command to the first PSL.
      • System software performs an MMIO to the PSL Linked List Command Register with the resume_element command and the link to the process element being resumed (PSL_LLCMD_An = x'000400000000' || link_of_element_to_resume).
   6. Wait for the PSLs to acknowledge the resumption of the process element.
      • The process element is resumed when a load from the sw_command_status returns x'00040004' || first_psl_id || link_of
6. form other operations. The process is terminated when the status field in the sw_command_status is x'0001'.

   Operations Performed by the Last PSL (PSL_ID[L] = '1')

   When the terminate_element MMIO command is received, or the terminate_element command is detected by the last PSL, the PSL checks to see if the process element being terminated is currently running, performs any operations necessary, and sets the completion status in the software command status word. The terminate_element command is detected by monitoring the psl_chained_command doubleword.

   1. If the process element is running, the process is terminated and the PSL sets the complete status in the software command status field to indicate that the process has been successfully terminated. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
      • The status field in the sw_command_status is set to x'0001' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00010001' || first_psl_id || link_of_element_to_terminate.
   2. If the process element is not running, the PSL searches the queue to determine if the process is waiting to be resumed and indicates the process termination is complete.
      • The PSL pulls each process link from the PSL queue and compares the link with the process being
7. Coherent Accelerator Processor Interface User's Manual, Advance. Version 1.2, 29 January 2015.

   Copyright International Business Machines Corporation 2014, 2015. Printed in the United States of America, January 2015.

   IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml. Other company, product, and service names may be trademarks or service marks of others.

   All information contained in this document is subject to change without notice. The products described in this document are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary. While the information co
8. terminates while in the middle of the update operation.

   Access Type
     sw_command_status    Read/write by both system software and PSL. Note: The PSL must never cache the line containing the sw_command_status in a modified state.
     psl_chained_command  Read/write by only the PSL.

   Base Address Offset
     sw_command_status    SPA_Base + (n + 3) × 128, where n = maximum number of process elements supported.
     psl_chained_command  SPA_Base + (n + 4) × 128 + (((n × 8) + 127) >> 7) × 128 + 128, where n = maximum number of process elements supported.

   Field layout: Command and Status occupy bits 0:31; PSL_ID and Link occupy bits 32:63.

   Bits  Field Name  Description
   0:15  Command     x'0000'  No command.
                     x'0001'  terminate_element: Terminate process element at the link provided.
                     x'0002'  remove_element: Remove the process element at the link provided.
                     x'0003'  suspend_element: Stop executing the process element at the link provided.
                     x'0004'  resume_element: Resume executing the process element a
                     x'0005'  add_element: Software is adding a process element at the link provided.
                     x'0006'  update_element: Sof
                     All other values are reserved.
                     Note: The most significant bi
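   The field layout above can be illustrated with a short C sketch that packs and unpacks a 64-bit sw_command_status value. It assumes that Command, Status, PSL_ID, and Link are each 16 bits wide with Command in the most significant halfword; that split is inferred from values such as x'00040004' || first_psl_id || link_of_element_to_resume used in the procedures, and is not stated by the table itself. The names cmd_status_pack, cmd_status_is_complete, and the enum are hypothetical and are not part of any PSL or libcxl interface.

```c
#include <assert.h>
#include <stdint.h>

/* Command codes from the Command field description (bits 0:15). */
enum sw_command {
    CMD_NONE      = 0x0000,
    CMD_TERMINATE = 0x0001,
    CMD_REMOVE    = 0x0002,
    CMD_SUSPEND   = 0x0003,
    CMD_RESUME    = 0x0004,
    CMD_ADD       = 0x0005,
    CMD_UPDATE    = 0x0006
};

/* Assumed layout (IBM bit numbering, bit 0 = most significant):
 * Command 0:15 | Status 16:31 | PSL_ID 32:47 | Link 48:63. */
static uint64_t cmd_status_pack(uint16_t command, uint16_t status,
                                uint16_t psl_id, uint16_t link)
{
    assert(command <= CMD_UPDATE);   /* all other values are reserved */
    return ((uint64_t)command << 48) | ((uint64_t)status << 32) |
           ((uint64_t)psl_id << 16) | (uint64_t)link;
}

static uint16_t cmd_status_command(uint64_t v) { return (uint16_t)(v >> 48); }
static uint16_t cmd_status_status(uint64_t v)  { return (uint16_t)(v >> 32); }
static uint16_t cmd_status_psl_id(uint64_t v)  { return (uint16_t)(v >> 16); }
static uint16_t cmd_status_link(uint64_t v)    { return (uint16_t)v; }

/* Software writes the command with Status = 0. In the procedures, the
 * last PSL reports completion by setting Status to the command code. */
static int cmd_status_is_complete(uint64_t observed, uint16_t command,
                                  uint16_t psl_id, uint16_t link)
{
    return observed == cmd_status_pack(command, command, psl_id, link);
}
```

   Under these assumptions, a resume request for link x'0080' with first PSL ID x'0002' is written as x'0004000000020080' and is complete when a load returns x'0004000400020080'.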
9. An implementation-dependent recovery procedure must be initiated by hardware.

   3.4.7.2 PSL Procedure for Time-Sliced and AFU-Directed Programming Models

   Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first or last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last.

   Operations Performed by the First PSL (PSL_ID[L,F] = '01')

   When the update_element MMIO command is received by the first PSL, the PSL checks to see if the process element being updated is currently running, performs any operations necessary, and sends the update_element command to the next PSL.

   1. When operating in an AFU-directed programming model, the PSL notifies the AFU of the updated process element. The AFU performs any necessary operations to update the process and then acknowledges the updated process element. When the acknowledgment is received, the PSL continues with the next substep. The AFU is not notified of the added process element for all other programming models.
   2. If the process is running, the
10. Operations Performed by the Last PSL (PSL_ID[L] = '1')

    When the resume_element MMIO command is received, or the resume_element command is detected by the last PSL, the PSL performs any operations necessary and sets the completion status in the software command status word. The resume_element command is detected by monitoring the psl_chained_command doubleword. The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.

    1. When operating in an AFU-directed programming model, the PSL notifies the AFU of the process element being resumed. The AFU performs any necessary operations to resume execution of the process and then acknowledges the resumed process element. When the acknowledgment is received, the PSL continues with the next substep. The AFU is not notified of the added process element for all other programming models.
    2. The PSL sets the complete status in the software command status field to indicate that the process has been successfully resumed.
       • The status field in the sw_command_status is set to x'0004' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00040004' || first_psl_id || link_of_element_to_resume.
11. Word   Contents
           Reserved | SPSIZE
    4      R | HTABORG (most significant bits)
    5      HTABORG (least significant bits) | Reserved | HTABSIZE
    6      R | HAURP Physical Address (most significant bits)
    7      HAURP Physical Address (least significant bits) | Reserved | V
    8      Reserved | Idle_Time | Reserved | Context_Time
    9      IVTE_Offset_0 | IVTE_Offset_1
    10     IVTE_Offset_2 | IVTE_Offset_3
    11     IVTE_Range_0 | IVTE_Range_1
    12     IVTE_Range_2 | IVTE_Range_3
    13     LPID
    14     TID
    15     PID
    16     CSRP Effective Address (most significant bits)
           CSRP Effective Address (least significant bits) | Reserved
           Reserved | AURP Virtual Address (most significant bits)
           AURP Virtual Address
           AURP Virtual Address (least significant bits)
           Ks | Kp | N | L | C | 0 | LP | Reserved | SSTP Virtual Address (most significant bits) | Reserved | SegTableSize
    24     SSTP Virtual Address
    25     SSTP Virtual Address (least significant bits) | Reserved | V
    26     Authority Mask (most significant bits)
    27     Authority Mask (least significant bits)
    28     Reserved
    29     Work Element Descriptor (WED) word 0
    30     Work Element Descriptor (WED) word 1
    31     Software State

    3.3.2 Software State Field Format

    The software state field in the process element is used by system software to indicate how the PSL should handle the process element. This word in the process element must only be modified by system softw
12. afu cxl_adapter_afu_next NULL afu

    The cxl_for_each_adapter_afu macro sets up a loop to iterate through each AFU on a given adapter.

    6.2.2.5 cxl_for_each_afu

    Note: Not applicable for the CAPI Developer Kit.

        #include <libcxl.h>
        #define cxl_for_each_afu(afu) \
            for (afu = cxl_afu_next(NULL); afu; afu = cxl_afu_next(afu))

    The cxl_for_each_afu macro sets up a loop to iterate through each AFU in the system by also looping through the cxl adapters in the system.

    6.2.3 Accelerated Function Unit Management

    6.2.3.1 cxl_afu_open_dev

        #include <libcxl.h>
        struct cxl_afu_h *cxl_afu_open_dev(char *path);

    The cxl_afu_open_dev routine opens an existing AFU by its device path name. It returns a handle to the open device. In the CAPI Developer Kit release, this returns a negative number if the AFU is unavailable for some reason. In the CAPI Developer Kit release, the programmer will probably know the full device name of their AFU.

    6.2.3.2 cxl_afu_open_h

    Note: Not applicable for the CAPI Developer Kit.

        #include <libcxl.h>
        int cxl_afu_open_h(struct cxl_afu_h *afu, unsigned long master);

    6.2.3.3 cxl_afu_fd_to_h

    Note: Not applicable for the CAPI Developer Kit.

        #include <libcxl.h>
        struct cxl_afu_h *cxl_afu_fd_to_h(int fd);

    The cxl_afu_fd_to_h routine inserts the file descriptor parameter in a newly allocated AFU buffer. The routine returns the pointer to the allocated AFU buffer.
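    The cxl_for_each_afu macro described above follows a common C iterator idiom: seed the cursor by passing NULL to a "next" function, then loop until that function returns NULL. The sketch below reproduces the idiom over a small static array so that it compiles without libcxl. The names demo_afu, demo_afu_next, demo_for_each_afu, and demo_count_afus are hypothetical stand-ins for illustration only; they are not libcxl symbols, and the real library enumerates devices rather than an array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an AFU handle; not the libcxl structure. */
struct demo_afu { int id; };

static struct demo_afu demo_afus[] = { {0}, {1}, {2} };
#define DEMO_AFU_COUNT (sizeof demo_afus / sizeof demo_afus[0])

/* Returns the first element when prev is NULL, the element after prev
 * otherwise, and NULL once the end of the array is reached. */
static struct demo_afu *demo_afu_next(struct demo_afu *prev)
{
    assert(prev == NULL ||
           (prev >= demo_afus && prev < demo_afus + DEMO_AFU_COUNT));
    size_t next = (prev == NULL) ? 0 : (size_t)(prev - demo_afus) + 1;
    return (next < DEMO_AFU_COUNT) ? &demo_afus[next] : NULL;
}

/* Same shape as cxl_for_each_afu: start from NULL, stop on NULL. */
#define demo_for_each_afu(afu) \
    for ((afu) = demo_afu_next(NULL); (afu); (afu) = demo_afu_next(afu))

/* Example use: count the elements visited by the loop. */
static int demo_count_afus(void)
{
    struct demo_afu *afu;
    int count = 0;

    demo_for_each_afu(afu)
        count++;
    return count;
}
```

    Because the macro expands to a plain for statement, the loop body can be a single statement or a braced block, and break and continue behave as in any other for loop.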
13. it will fail. AXh_csize must be a power of 2 and aXh_cea must be naturally aligned according to size.

    Mnemonic   Opcode    Description
    push_i     x'0140'   Attempt to accelerate the subsequent writing of a line previously written by the accelerator or by another processor. AXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned. This command is a no-op if the line is not modified.
    push_s     x'0150'   Attempt to accelerate the subsequent reading of a line previously written by the accelerator or by another processor. AXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned. This command is a no-op if the line is not modified.
    evict_i    x'1140'   Force a line out of the precise cache. Modified lines are castout to system memory. AXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
    reserved   x'1260'   Reserved for future use.
    lock       x'016B'   Request that a cache line be present in the precise cache in a locked and modified state. This command must be used as part of an atomic read-modify-write sequence. AXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
    unlock     x'017B'   Clear the lock state associated with a line. AXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.

    Table 5-3. PSL Command Opcodes That Do Not Allocate in the PSL Cache

    Mnemonic    Opcode    Description
    read_cl_na  0x0A00    Read a cache line, but do not allocate the cache line into a cache. This command must
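    The size and alignment rules quoted in the command descriptions above can be checked before a command is issued. The following C sketch is one possible way to express them; it assumes "naturally aligned" means the address is a multiple of the power-of-2 size, and the function names are hypothetical helpers rather than part of the PSL interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when n has exactly one bit set. */
static bool is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/* Natural alignment: the address is a multiple of the power-of-2 size. */
static bool is_naturally_aligned(uint64_t ea, uint32_t size)
{
    assert(is_power_of_two(size));
    return (ea & ((uint64_t)size - 1)) == 0;
}

/* Line commands such as push_i, push_s, evict_i, lock, and unlock:
 * the size must be 128 bytes and the address 128-byte line aligned. */
static bool line_command_args_ok(uint64_t ea, uint32_t size)
{
    return size == 128 && is_naturally_aligned(ea, 128);
}
```

    A check of this kind is cheap to run in a software model or test bench and catches misaligned requests before they reach the command interface.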
14. make progress. If the PSL detects a write to the address by another processor, it deactivates the reservation.

    Write_c inspects the state of the reservation during execution. If the reservation is active on the write_c line address, write_c will write data to the line, deactivate the reservation, and return DONE. If the reservation is active on a different address, write_c deactivates the reservation and returns NRES. If the reservation is not active, write_c returns NRES.

    Note: While it is not an error to submit multiple read_cl_res and write_c commands to different line addresses, the order they execute in is not defined and therefore the state of the reservation is unpredictable.

    5.1.3 Locks

    Cache lines can be locked, and while they are locked, no other read or write access is permitted by any other processor in the system. This capability allows an accelerator to implement complex atomic operations on shared memory.

    Lock requests are made with either the read_cl_lck or the lock command. If the PSL grants the lock, it responds with DONE. If the PSL declines the lock request, it responds with NLOCK. The PSL can decline a lock request based on configuration, available resources, and cache state. After the lock is in effect, it remains in effect until a subsequent write_unlock or unlock request.

    Locks cannot be held indefinitely. The PSL automatically unlocks lines after a certain amount of time to allow the system to make forward progress. Wri
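    The write_c reservation rules described above amount to a small state machine. The C sketch below is a toy software model of those rules, not PSL hardware behavior. It assumes that read_cl_res is the command that establishes the reservation and that reservations are tracked per 128-byte line; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum psl_response { RESP_DONE, RESP_NRES };

/* Toy model of the single reservation tracked for an accelerator. */
struct reservation {
    bool     active;
    uint64_t line_addr;   /* assumed 128-byte-aligned line address */
};

/* read_cl_res (assumed): establish a reservation on a line. */
static void model_read_cl_res(struct reservation *r, uint64_t line_addr)
{
    assert((line_addr & 127u) == 0);
    r->active = true;
    r->line_addr = line_addr;
}

/* A write to the reserved line by another processor deactivates it. */
static void model_snoop_write(struct reservation *r, uint64_t line_addr)
{
    if (r->active && r->line_addr == line_addr)
        r->active = false;
}

/* write_c: returns DONE only if the reservation is active on this
 * line. In every case the reservation ends up inactive. */
static enum psl_response model_write_c(struct reservation *r,
                                       uint64_t line_addr)
{
    bool hit = r->active && r->line_addr == line_addr;

    r->active = false;
    return hit ? RESP_DONE : RESP_NRES;
}
```

    The model makes one property easy to see: a write_c to a different line address consumes the reservation, so a later write_c to the originally reserved line returns NRES unless a new read_cl_res is issued first.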
15. system software first sets the system software state field in the process element to indicate that the process element is being terminated. Next, system software issues a termination command to the PSLs, which initiates a sequence of operations to remove the process element from the PSL queue. The termination pending status is needed to prevent a PSL from starting or resuming the process while the corresponding process entry is being removed from the PSL queue.

    The following sections define the system software and PSL procedures for various process element and linked list management:
    • Section 3.4.1 Adding a Process Element to the Linked List by System Software on page 39
    • Section 3.4.2 PSL Queue Processing (Starting and Resuming Process Elements) on page 42
    • Section 3.4.3 Terminating a Process Element on page 43
    • Section 3.4.4 Removing a Process Element from the Linked List on page 48
    • Section 3.4.5 Suspending a Process Element in the Linked List on page 50
    • Section 3.4.6 Resume a Process Element on page 54
    • Section 3.4.7 Updating a Process Element in the Linked List on page 56

    3.4.1 Adding a Process Element to the Linked List by System Software

    System software adds a new process element for each process that has work for the accelerator. The process element is added to the software m
16. 2 High-Level Design of the AFU on page 87 for FPGA considerations during the high-level design.

    7.3.2 General PSL Information

    The PSL contains 32 read machines (one reserved for interrupts) and 32 write machines (three reserved for deadlock prevention). An AFU design must balance the use of read and write commands accordingly.

    Note: MMIO interface requests to valid registers in the accelerator must complete with no dependencies on the completion of any other command.

    7.3.3 Buffer Interface

    It is recommended that the PSL read buffer interface for AFU write data is implemented so that BRTAG goes directly to the RAM read address and the data returned on BRDATA is latched after the RAM access to meet the BRLAT = 1 requirement.

    7.3.4 PSL Interface Timing

    It is recommended that all AFU signals are driven to the PSL directly from a latch and all PSL-to-AFU signals are received directly into a latch unless otherwise noted, as in Section 7.3.3 Buffer Interface.

    7.3.5 Designing for Performance

    PSL command ordering is performance oriented, meaning that the PSL can reorder commands for performance. If a particular order is intended by the AFU, it is the AFU's responsibility to send commands in that order. The AFU can select the translation ordering mode, though, which can impact performance. This control is described in Table 5-5 aXh_cabt Translation Ordering Behavior on page 66. It is important to understand this so that translation i
17. MMIO transfer is present on the interface. The haX_mm signals are valid during the cycle that haX_mmval is asserted.

    Signal Name     Bits  Source  Description
    haX_mmcfg       1     PSL     The MMIO represents an AFU descriptor space access.
    haX_mmrnw       1     PSL     0 = Write, 1 = Read.
    haX_mmdw        1     PSL     0 = Word (32 bits), 1 = Doubleword (64 bits).
    haX_mmad        24    PSL     MMIO word address. For doubleword access, the address is even.
    haX_mmadpar     1     PSL     Odd parity for haX_mmad, valid with haX_mmval.
    haX_mmdata      64    PSL     Write data. For word writes, data is replicated onto both halves of the bus.
    haX_mmdatapar   1     PSL     Odd parity for haX_mmdata, valid with haX_mmval and haX_mmrnw equal to 0. Not valid during an MMIO read (haX_mmrnw = 1).
    aXh_mmack       1     Acc     This signal must be asserted for a single cycle to acknowledge that the write is complete or the read data is valid.
    aXh_mmdata      64    Acc     Read data. For word reads, data must be supplied on both halves of the bus.
    aXh_mmdatapar   1     Acc     Odd parity for aXh_mmdata, valid with aXh_mmack.

    5.5 Accelerator Control Interface

    The accelerator control interface is used to control the state of the accelerator and sense change in the state of the accelerator as execution ends on the process element. This interface is also used for timebase requests and responses. The interface is a synchronous interface. HaX_jval is valid for only one cycle per command, and the other command descriptor signals are also valid during that cycle. Table 5-10 on page 74 shows the signals used for th
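    Several signals in the MMIO interface table carry odd parity. As a reminder of what that means, the C sketch below computes an odd-parity bit for a value of a given width: the parity bit is chosen so that the data bits plus the parity bit together contain an odd number of 1s. This is the conventional definition of odd parity, shown here as an illustration; the function name odd_parity is hypothetical, and an AFU would normally implement this as an XOR tree in HDL.

```c
#include <assert.h>
#include <stdint.h>

/* Returns the odd-parity bit for the low `width` bits of `value`.
 * The bit is 1 when the data holds an even number of 1s, so that
 * data plus parity always holds an odd number of 1s. */
static unsigned odd_parity(uint64_t value, unsigned width)
{
    assert(width >= 1 && width <= 64);
    if (width < 64)
        value &= (UINT64_C(1) << width) - 1;   /* ignore upper bits */

    unsigned ones = 0;
    while (value) {
        ones += (unsigned)(value & 1);
        value >>= 1;
    }
    return (ones & 1u) ? 0u : 1u;
}
```

    For example, a 24-bit haX_mmad value of all zeros has parity bit 1, and a value with a single bit set has parity bit 0.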
18. Note: The shared programming models are for future releases only. Currently, libcxl only supports the dedicated programming model. Additional programming models might be added in the future.

    In the dedicated process model, the AFU is dedicated to a single application or process under a single operating system. The single application can act as an "Application as a Service" and funnel other application requests to the accelerator, providing virtualization within a partition.

    In the PSL-controlled shared and AFU-directed shared programming models, the AFU can be shared by multiple partitions. The shared models require a system hypervisor to virtualize the AFU so that each operating system can access the AFU. For single-partition systems not running a hypervisor, the AFU is owned by the operating system. In both cases, the operating system can virtualize the AFU so that each process or application can access the AFU.

    For the AFU-directed shared programming model, the AFU selects a process element using a process handle. The process handle is an implementation-specific value provided to the host process when registering its context with the AFU (that is, calling system software to add the process element to the process element linked list). While the process handle is implementation specific, the lower 16 bits of the process handle must be the offset of the process element within the process element linked list. The process element contai
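    The one fixed property of the process handle stated above is that its lower 16 bits hold the offset of the process element. The C sketch below shows how that offset could be extracted. Everything else about it is an assumption made for illustration: the 64-bit handle width, the idea of packing implementation-specific bits above bit 16, and the function names are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Build a hypothetical handle: implementation-specific bits in the
 * upper part, process element offset in the lower 16 bits. */
static uint64_t make_process_handle(uint64_t impl_bits, uint16_t pe_offset)
{
    assert(impl_bits <= UINT64_C(0xFFFFFFFFFFFF));   /* must fit above bit 16 */
    return (impl_bits << 16) | pe_offset;
}

/* The lower 16 bits of the handle are the process element offset. */
static uint16_t pe_offset_from_handle(uint64_t process_handle)
{
    return (uint16_t)(process_handle & 0xFFFFu);
}
```

    Masking rather than casting the whole handle keeps the extraction correct regardless of what an implementation stores in the upper bits.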
19. PSL response according to their individual CABT mode. Requests received with CABT Strict or Page are added to the queue until the queue is emptied. When the queue is emptied, any Restart command from the AFU is honored. Continuing to send requests with CABT Strict or Page before the queue is emptied will delay the honoring of the Restart command for that ERAT entry. It is recommended that new requests that hit the 16 MB page are halted until a response is received for the Restart command.

    011  Pref  Checks if the translation for the address is already available in the ERAT or can be determined with a read of the PTE and/or STE from system memory. If the translation can complete without software assistance, the command completes.
               • If translation for the command results in a protection violation or the table walk process fails, the command receives the FAULT response. Only this command will be terminated. No interrupt is generated.
               • If the translation for the command results in a DERROR, only this command is terminated with a FAULT response. No FLUSHED response is generated.
    111  Spec  Checks if the translation for the address is already available in the ERAT. If it is in the ERAT, the command completes.
               • If translation for the command results in a protection violation or an ERAT miss, the command will receive the FAULT response. No new translation is performed. Only this command will be terminated. No interrupt is generated.
               • If the tra
20. PSL completes any outstanding transactions and does not start any new transactions for the process. The PSL then invalidates the process element state and refetches a new copy from the process element linked list in system memory. If the process element is coherently cached, the update is automatically handled by the coherency protocol.
       • The status field in the sw_command_status is set to x'0006' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00060006' || first_psl_id || link_of_element_to_update.
       • The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
    3. If the process element is not running, the PSL writes an update command to the psl_chained_command doubleword for the next PSL.
       • Write the value x'00060000' || next_psl_id || link_of_element_to_update to the psl_chained_command.
       • The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.

    Operations Performed by the Next PSL (PSL_ID[L,F] = '00')

    When the update_element command is detected by the next PSL, the PSL checks to see if the process element being updated is currently running, performs any operations necessary
21. PSL_ID[L] = '1'

    When the suspend_element MMIO command is received, or the suspend_element command is detected by the last PSL, the PSL checks to see if the process element being terminated is currently running, performs any operations necessary, and sets the completion status in the software command status word. The suspend_element command is detected by monitoring the psl_chained_command doubleword.

    1. The PSL notifies the AFU of the suspended process element. The AFU performs any necessary operations to suspend the process and then acknowledges the suspension of the process element. When the acknowledgment is received, the PSL continues with the next substep.
    2. If the process element is running, the process is suspended. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
    3. The PSL sets the complete status in the software command status field to indicate the process has been successfully suspended.
       • The status field in the sw_command_status is set to x'0003' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00030003' || first_psl_id || link_of_element_to_suspend.
       • The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
22. Translation misses can cause delays in the PSL-AFU interface. Therefore, using touch_i commands to make pages come in early to the PSL translation cache is recommended. Also, it is advisable to use large pages to avoid too many translation requests from the PSL to the chip.

    Care should be used to avoid cache-line thrashing between the PSL cache and the processor cache structure. An application should avoid cases where the application and the AFU are both modifying the same cache line constantly.

    7.3.6 Simulation

    Stand-alone AFU simulation in an internal environment must be done in the customer's choice of simulation application to verify the internal function of the AFU. Simulation can also be done with the POWER8 Functional Simulator along with the application to ensure proper functionality with the PSL and POWER8 system and software. See the POWER8 Functional Simulator User's Guide for additional information.

    Although the FPGA might reset to all zeros on power up, it is good practice to perform multi-value simulation with initial X states. This ensures that the AFU resets to a state that clears all previous job states and new jobs run without any issues.

    7.3.7 Debug Considerations

    The problem state interface must be implemented to provide a debug mechanism. Registers can be used to capture errors or runtime status that can then be read using MMIOs to the AFU
23. [Figure residue: command response flow chart. Recoverable labels: permission check (Error / OK); wait for data tenure; MMIO to PSL_TFC_An; failure; restart; responses DONE, DERROR, AERROR, and PAGED; enter flush mode.]

    5.4 Accelerator MMIO Interface

    The MMIO interface can be used to read and write MMIO registers and AFU descriptor space registers inside the accelerator. The PSL is the command master. It performs a single read or write and waits for an acknowledgment before beginning another MMIO. MMIO requests that are not acknowledged cause an application hang to be detected and an error condition to be reported.

    Note: MMIO interface requests to valid registers in the accelerator must complete with no dependencies on the completion of any other command.

    An MMIO request is sent to the accelerator only when the accelerator is enabled, as indicated by the AFU_CNTL_An[ES] field. Otherwise, an error condition is reported. Note that the MMIO address contains a word (4-byte) address; therefore, the last 2 bits of the true address are dropped at the interface. For an address of 0x300_1080, HAX_MMAD equals 0xC0_0420.

    Table 5-9. Accelerator MMIO Interface

    Signal Name  Bits  Source  Description
    haX_mmval    1     PSL     This signal is asserted for a single cycle when an
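    The byte-to-word address conversion described above is a right shift by two bits. The C sketch below shows the arithmetic for the example address: 0x300_1080 shifted right by 2 gives 0xC0_0420. The 24-bit width and the even-address rule for doubleword accesses are taken from the haX_mmad description in Table 5-9; the function names are hypothetical helpers, for example for a test bench.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a byte address to the word address carried on haX_mmad. */
static uint32_t mmio_word_address(uint32_t byte_addr)
{
    assert((byte_addr & 0x3u) == 0);     /* must be word aligned */
    uint32_t word_addr = byte_addr >> 2; /* drop the last 2 bits */
    assert(word_addr <= 0xFFFFFFu);      /* haX_mmad is 24 bits wide */
    return word_addr;
}

/* Doubleword accesses use an even word address. */
static int mmio_valid_doubleword_address(uint32_t word_addr)
{
    return (word_addr & 1u) == 0;
}
```

    With this helper, mmio_word_address(0x3001080) evaluates to 0xC00420, which is an even word address and therefore also usable for a doubleword access.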
24. The PSL notifies the AFU of the added process element. The AFU performs any necessary operations to prepare for the new process and then acknowledges the new process element. When the acknowledgment is received, the PSL continues with the next substep.
2. The PSL sets the complete status in the software command status field to indicate the process has been successfully added.
• The status field in the sw_command_status is set to x'0005' using a caching-inhibited DMA or special memory-update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00050005' || first_psl_id || link_of_element_to_add.
3.4.2 PSL Queue Processing (Starting and Resuming Process Elements)
Multiple PSLs can be assigned to service the list of scheduled processes. Each PSL follows the sequence outlined in Section 3.4.2.1 to start a new process or continue a previously started process. The following procedures apply only to the time-sliced programming models.
3.4.2.1 PSL Procedure for Time-Sliced Programming Models
1. Check the PSL queue for processes waiting to be restarted.
a. Perform a read of the cache line containing the head_pointer and tail_pointer such that the cache line is owned by the PSL.
• The PSL must prevent any other PSL
25. There are also example trace arrays within the Verilog and a readme file to explain how the trace array files are used.
8.4.2 Build the FPGA
1. Copy the IBM build directory structure and files to the build location.
2. Put AFU source files in the afu0 subdirectory.
3. Edit the afu0/afu0.qip file to include the AFU VHDL or Verilog files. Example of files added to afu0.qip from the provided memcopy example:
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) dw_parity.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) endian_swap.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) afu.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) job.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) dma.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) mmio.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) parity.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) ram.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) trace_array_muxout_template.v]
set_global_assignment -name VERILOG_FILE [file join $::quartus(qip_path) trace_array_template.v]
4. Ensure that the file psl/psl_accel.vhdl correctly maps signal connections to the top-level source file for your AFU.
26. This helps debug the AFU during initial bringup as well as during failure scenarios. It is also helpful to include trace arrays that can capture a logic-analyzer type of trace of a particular interface or function. These trace arrays store a history of events that can later be read out using MMIO registers to aid in the debug of performance or functional problems. At a minimum, trace arrays for the AFU-PSL interface signals should be implemented in the AFU to debug basic issues. The interface signals should be connected to the trace array. Some potential features of a trace array:
• Trigger mechanism to start or stop storing data
• Pattern match to only store a cycle with a particular pattern
• Time or cycle stamp for the relative time between events
Note: An example trace array in Verilog, along with the implementation of that trace array in the memcpy example, will be provided.
7.3.8 Operating System Error Handling
7.3.8.1 AFU Errors
If the main application is responding and the AFU is in a state to communicate with the main application, the AFU must signal an error by some user-defined means. For example:
• Interrupt
• Command response status
• MMIO
• Other
If the main application is not responding or the AFU is not in a state to communicate with the main application the AFU must assert ah_jdo
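The three trace-array features listed in Section 7.3.7 (trigger, pattern match, cycle stamp) can be modelled in software before committing to RTL. The sketch below is a hypothetical behavioural model, not the Verilog example the manual refers to; the class name, the mask/pattern scheme, and the fixed-depth ring buffer are all assumptions made for illustration.

```python
from collections import deque


class TraceArray:
    """Behavioural model of a trace array (hypothetical, illustration only)."""

    def __init__(self, depth: int, mask: int = 0, pattern: int = 0):
        # Fixed-depth history: the oldest entry is dropped when full.
        self.entries = deque(maxlen=depth)
        self.mask = mask          # bits that take part in the pattern match
        self.pattern = pattern    # required value of the masked bits
        self.armed = False        # trigger state
        self.cycle = 0            # free-running cycle counter

    def start(self) -> None:
        """Trigger: begin storing matching cycles."""
        self.armed = True

    def stop(self) -> None:
        """Trigger: stop storing."""
        self.armed = False

    def clock(self, value: int) -> None:
        """Sample one cycle of the traced signal bundle."""
        if self.armed and (value & self.mask) == self.pattern:
            # Store the cycle stamp together with the sampled value.
            self.entries.append((self.cycle, value))
        self.cycle += 1

    def read_out(self) -> list:
        """Return the stored history, oldest first (the MMIO read path)."""
        return list(self.entries)
```

With mask and pattern both zero every armed cycle is stored; a non-zero mask restricts storage to cycles whose masked bits equal the pattern.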
27. 3.4.7 Updating a Process Element in the Linked List
The update flag in the software process element state is used to update the state of a process element. This command causes the PSL to invalidate any non-coherent copies of the process element that might be cached and to read a new copy from system memory. If the update of the process element is required to be atomic, the process element must be suspended before the update is made (suspend, update, resume).
3.4.7.1 Software Procedure
1. Write the update_element command to the software command status field in the linked list area.
• Store x'00060000' || first_psl_id || link_of_element_to_update to sw_command_status.
2. Ensure that the update_element command is visible to all processes.
• System software running on the host processor must perform a sync instruction.
3. Issue the update_element MMIO command to the first PSL.
• System software performs an MMIO to the PSL Linked List Command Register with the update_element command and the link of the process element being updated (PSL_LLCMD_An = x'000600000000' || link_of_element_to_update).
4. Wait for the PSL to acknowledge the update of the process element.
• The process element is updated when a load from the sw_command_status returns x'00060006' || first_psl_id || link_of_element_to_update.
• If a value of all 1's is returned for the status, an error has occurred.
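The concatenations used in the software procedure above can be expressed as shifts and ORs. This is a minimal sketch under a stated assumption: the 64-bit sw_command_status is treated as four 16-bit fields (command, status, PSL ID, link), which is inferred from the 32-bit x'00060000' prefix and from the link occupying bits 48:63. The function names are hypothetical.

```python
UPDATE_ELEMENT = 0x0006  # command code taken from the update_element procedure


def _check16(name: str, value: int) -> None:
    # Every field in this sketch is assumed to be 16 bits wide.
    if not 0 <= value <= 0xFFFF:
        raise ValueError(f"{name} does not fit in 16 bits: {value:#x}")


def pack_sw_command_status(command: int, status: int, psl_id: int, link: int) -> int:
    """Build command || status || psl_id || link as one 64-bit value."""
    for name, value in (("command", command), ("status", status),
                        ("psl_id", psl_id), ("link", link)):
        _check16(name, value)
    return (command << 48) | (status << 32) | (psl_id << 16) | link


def pack_llcmd(command: int, link: int) -> int:
    """Build the PSL_LLCMD_An value: command, 32 zero bits, then the link."""
    _check16("command", command)
    _check16("link", link)
    return (command << 48) | link


if __name__ == "__main__":
    # Step 1: request an update of the element with link 0x0002 via PSL 0x0001.
    print(f"{pack_sw_command_status(UPDATE_ELEMENT, 0, 0x0001, 0x0002):016x}")
    # Step 3: the matching MMIO command value.
    print(f"{pack_llcmd(UPDATE_ELEMENT, 0x0002):016x}")
```

The same packing applies to the other command codes that appear in these excerpts (x'0001', x'0002', x'0003', x'0005'); only the command value changes.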
28. after the translation response is received with CABT = Abort, Pref, or Spec are processed immediately and provide a PSL response according to their individual CABT mode. Requests received with CABT = Strict or Page are added to the queue until the queue is emptied. When the queue is emptied, any Restart command from the AFU is honored. Continuing to send requests with CABT = Strict or Page before the queue is emptied will delay the honoring of the Restart command for that ERAT entry. It is recommended that new requests that hit the ERAT entry are halted until a response is received for the Restart command.
001 | Abort | Accesses to different pages proceed in high-performance order. If translation for the command results in a protection violation or the table-walk process fails, the command receives the FAULT response and an interrupt is sent. Only this command is terminated.
• If the translation for the command results in a DERROR, only this command is terminated with a FAULT response. No FLUSHED response is generated.
Table 5-5. aXh_cabt Translation Ordering Behavior (Sheet 2 of 2)
aXh_cabt | Mnemonic | Description
010 | Page | Translation is in order for addresses in the same effective page that maps into a 4 KB 16 KB and 16 MB ERAT. Accesses to different pages exit translation in a high performa
29. any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
3. If the process element is not running, the PSL writes an update command to the psl_chained_command doubleword for the next PSL.
• Write the value x'00060000' || next_psl_id || link_of_element_to_update to the psl_chained_command.
• The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
Operations Performed by the Last PSL (PSL_ID[L] = '1')
When the update_element MMIO command is received or the update_element command is detected by the last PSL, the PSL checks to see if the process element being updated is currently running, performs any operations necessary, and sets the completion status in the software command status word. The update_element command is detected by monitoring the psl_chained_command doubleword.
1. When operating in an AFU-directed programming model, the PSL notifies the AFU of the updated process element. The AFU performs any necessary operations to update the process and then acknowledges the updated process element. When the acknowledgment is received, the PSL continues with the next substep. The AFU is not notified of the updated process element for all other programming models.
2. If the process is running, the PSL completes any outstanding
30. be used during streaming operations when there is no expectation that the data will be re-used before it is cast out of the cache. aXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
Read_pna | 0x0E00 | Read all or part of a line without allocation. This command must be used for MMIO. aXh_csize must be a power of 2, and aXh_cea must be naturally aligned according to size.
Write_na | 0x0D00 | Write all or part of a cache line, but do not allocate the cache line into a cache. This command must be used during streaming operations when there is no expectation that the data will be re-used before it is cast out of the cache. aXh_csize must be a power of 2, and aXh_cea must be naturally aligned according to size.
Write_inj | 0x0D10 | Write all or part of a cache line. Do not allocate the cache line into a cache; attempt to inject the data into the highest point of coherency (HPC). aXh_csize must be a power of 2, and aXh_cea must be naturally aligned according to size.
Table 5-4. PSL Command Opcodes for Management
Mnemonic | Opcode | Description
flush | 0x0100 | Flush data from all caches.
intreq | 0x0000 | Request interrupt service. See Section 5.1.4 Request for Interrupt Service on page 69.
restart | 0x0001 | Stop flushing commands after error. aXh_cea is ignored.
PSL Implementation Note: New requests that hit in the same ERAT page entry as the request with the translation error response must not continue to be issued
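The size and alignment rules quoted in the opcode descriptions above ("a power of 2" and "naturally aligned according to size", or exactly 128 bytes on a 128-byte boundary) can be checked before a command is issued. The helpers below are a hypothetical sketch of those two checks only; they do not model any upper limit on aXh_csize, which these excerpts do not state.

```python
CACHE_LINE_BYTES = 128  # line size quoted in the opcode descriptions


def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... and False for zero or anything else."""
    return n > 0 and (n & (n - 1)) == 0


def is_naturally_aligned(cea: int, csize: int) -> bool:
    """Rule for Read_pna, Write_na and Write_inj.

    csize must be a power of 2 and the effective address must be a
    multiple of csize.
    """
    return is_power_of_two(csize) and cea % csize == 0


def is_full_line_request(cea: int, csize: int) -> bool:
    """Rule for commands that require a whole, line-aligned 128-byte access."""
    return csize == CACHE_LINE_BYTES and cea % CACHE_LINE_BYTES == 0


if __name__ == "__main__":
    print(is_naturally_aligned(0x1000, 64))   # 64-byte access on a 64-byte boundary
    print(is_naturally_aligned(0x1020, 64))   # same size, misaligned address
```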
31. called the CAPI Developer Kit card. This section describes the CAIA architecture features supported in this release for the CAPI Developer Kit card.
8.1 Supported CAIA Features
The PSL-AFU interface described in this section assumes the following restrictions on features. Additional interface signals or command opcodes will be required in the future to support some of the architecture features that are not supported in this release.
• Only the dedicated process programming model is supported.
• Only one AFU is supported per CAPI Developer Kit card.
• LPC memory is not supported.
• The only supported value for aXh_brlat is 1.
• The maximum size of the AFU problem state area is 64 MB. The maximum size of the AFU descriptor space is 4 MB.
• Timebase is not supported.
• The CAPI interface might only be available on a subset of the PCIe slots on a POWER system. For example, the IBM Power Systems S812L and S822L have CAPI enabled on location codes P1-C5, P1-C7, P1-C6, and P1-C4 as CAPI-compatible Gen3 PCIe slots. Refer to your specific system's guide for more details.
8.2 CAPI Developer Kit Card Hardware
• Altera Stratix V 5SGXA7 FPGA
• PCIe Gen3 x8 port
• 2 SFP connections
• Nallatech Developer Kit card
8.3 FPGA Build Restrictions
Note: The DRAM is not supported by the current HDK. If usage of the DRAM is desirable for your application, please contact your IBM representative.
The SFP SERDES in the delivere
32. cxl_mmio_write on page 84 for details about how to read and write registers in this space.
cxl_mmio_unmap: Unmaps the register space of the AFU from the memory associated with this process.
cxl_afu_free: Closes and frees the AFU and the related supporting data structures that have been allocated.
6.2 CAPI Low-Level Management API
6.2.1 Adapter Information and Availability
This section describes API calls used by an application to determine what resources are available and to query information about resources allocated to the calling process.
6.2.1.1 cxl_adapter_next
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
struct cxl_adapter_h * cxl_adapter_next(struct cxl_adapter_h *adapter);
The cxl_adapter_next routine returns a handle to the next available CAPI-capable adapter. If the input adapter pointer is NULL, this routine allocates the necessary buffer and returns its pointer. A subsequent call to this routine obtains the directory entry of the next adapter. If there are no more adapters, the buffers are freed and the routine returns NULL.
6.2.1.2 cxl_adapter_devname
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
char * cxl_adapter_devname(struct cxl_adapter_h *adapter);
The cxl_adapter_devname routine returns the null
33. devices in the system.
Coherent Accelerator Interface Architecture: Defines an architecture for loosely coupled coherent accelerators. The Coherent Accelerator Interface Architecture provides a basis for the development of accelerators coherently connected to a POWER processor.
Coherent Accelerator Processor Interface.
Coherent Attached Processor Proxy.
Refers to memory and cache coherence: the correct ordering of stores to a memory address and the enforcement of any required cache write-backs during accesses to that memory address. Cache coherence is implemented by a hardware snoop (or inquire) method, which compares the memory addresses of a load request with all cached copies of the data at that address. If a cache contains a modified copy of the requested data, the modified data is written back to memory before the pending load request is serviced.
Context Save/Restore Area Pointer.
DLL: Delay-locked loop.
DMA: Direct memory access. A technique for using a special-purpose controller to generate the source and destination addresses for a memory or I/O transfer.
DSISR: Data Storage Interrupt Status Register.
DSP: Digital signal processor.
EAH: PSL effective address high.
EAL: PSL effective address low.
EA: Effective address. An address generated or used by a program to reference memory. A memory management unit translates an effective address to
34. element linked list structures to reflect the removal of the process element.
2. Write a remove_element command to the software command status field in the linked list area.
• Store x'00020000' || first_psl_id || link_of_element_to_remove to sw_command_status.
3. Ensure that the remove_element command is visible to all processes.
• System software running on the host processor must perform a sync instruction.
4. Issue the remove_element MMIO command to the first PSL.
• System software performs an MMIO to the PSL Linked List Command Register with the remove_element command and the link of the process being removed (PSL_LLCMD_An = x'000200000000' || link_of_element_to_remove).
5. Wait for the PSLs to acknowledge the removal of the process element.
• The process element is terminated when a load from the sw_command_status returns x'00020002' || first_psl_id || link_of_element_to_remove.
• If a value of all 1's is returned for the status, an error has occurred. An implementation-dependent recovery procedure must be initiated by hardware.
6. Invalidate the PSL SLBs and TLBs for the processes being removed.
• System software performs an MMIO write to the Lookaside Buffer Invalidation Selector with the process ID and logical partition ID of the process being removed (PSL_LBISEL = PID || LPID).
• System software performs an MMIO write to invalidate the SLBs (PSL_SLBIA = x'3').
• System software waits until the SLB invali
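Step 5 of the procedure above polls sw_command_status until the status field echoes the command, and treats all 1's as an error. The following sketch classifies a loaded value; it assumes the same four 16-bit fields (command, status, PSL ID, link) inferred from the concatenations in the text, and it tests the 16-bit status field for all 1's, which also covers the case where the whole doubleword reads as all 1's. The function name and return strings are hypothetical.

```python
REMOVE_ELEMENT = 0x0002  # command code taken from the remove_element procedure


def command_state(sw_command_status: int, command: int) -> str:
    """Classify a 64-bit sw_command_status value loaded by system software.

    Returns "error" when the status field is all 1's, "complete" when both
    the command and status fields equal the issued command, and "pending"
    otherwise.
    """
    issued = (sw_command_status >> 48) & 0xFFFF
    status = (sw_command_status >> 32) & 0xFFFF
    if status == 0xFFFF:
        return "error"
    if issued == command and status == command:
        return "complete"
    return "pending"


if __name__ == "__main__":
    # x'00020002' || psl_id 0x0001 || link 0x0004: removal acknowledged.
    print(command_state(0x0002_0002_0001_0004, REMOVE_ELEMENT))
```

A caller would repeat the load while the result is "pending" and start its recovery path on "error".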
35. four AFUs per PSL; early implementations support only a single AFU. The AFU can be dedicated to a single application or shared between multiple applications. However, only the dedicated programming model is currently supported.
Physically, a CAIA-compliant accelerator can consist of a single chip, a multi-chip module (or modules), or multiple single-chip modules on a system board or other second-level package. The design depends on the technology used and on the cost and performance characteristics of the intended design point.
Figure 2-1 on page 22 illustrates a CAIA-compliant accelerator with several (n) AFUs connected to the PSL. All the AFUs share a single cache.
Figure 2-1. CAIA-Compliant Processor System
(Block diagram: a CAIA-compliant accelerator containing AFU_0, AFU_1, through AFU_n and the POWER Service Layer (PSL) with its cache, memory management unit (SLB, TLB, ERAT), and ISL, connected through a PCIe link to the host processor.)
Legend: AFU = Accelerator Function Unit; MMU = Memory Management Unit; PSL = POWER Service Layer; SLB = Segment Lookaside Buffer; ISL = Interrupt Source Layer; TLB = Translation Lookaside Buffer; Cache = cache for data accessed by AFUs; ERAT = Effective to Real Address Translation.
2.1.1 POWER Service Layer
A CAIA compliant processor includes
36. in the software command status field to indicate that it is now safe to remove the process element from the linked list.
• Write the value x'00020000' || next_psl_id || link_of_element_to_remove to the psl_chained_command.
• The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
Operations Performed by the Last PSL (PSL_ID[L] = '1')
When the remove_element MMIO command is received or the remove_element command is detected by the last PSL, the PSL checks to see if the process element being removed is currently running, performs any operations necessary, and sets the completion status in the software command status word. The remove_element command is detected by monitoring the psl_chained_command doubleword.
1. The PSL notifies the AFU of the process element removal. The AFU performs any necessary operations to remove the process and then acknowledges the removal of the process element. When the acknowledgment is received, the PSL continues with the next substep.
2. The PSL sets the complete status in the software command status field to indicate that the process has been successfully removed.
• The status field in the sw_command_status is set to x'0002' u
37. load from the sw_command_status returns x'00010001' || first_psl_id || link_of_element_to_terminate.
• If a value of all 1's is returned for the status, an error has occurred. An implementation-dependent recovery procedure must be initiated by hardware.
7. Reset the valid flag in the software state to 0 (Software_State[V] = 0).
• Store x'00000000' to the 31st word of the process element to terminate.
8. Remove the process element from the linked list.
• See the procedure in Section 3.4.4 Removing a Process Element from the Linked List on page 48.
3.4.3.2 PSL Procedure for Time-Sliced Programming Models
Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first nor last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last.
Operations Performed by the First PSL (PSL_ID[L,F] = '01')
When the terminate_element MMIO command is received by the first PSL, the PSL checks to see if the process element being terminated is currently running, performs any operations necessary, and sends the terminate_element command to the next PSL or sets the completion status in the software command status word.
1. If the process element is r
38. must be the first in the list of PSLs assigned to service the processes. Each PSL has the ID of the next PSL in the list and forwards the command to the next PSL in the psl_chained_command, if required.
Bits 48:63 | Link | Process element link. The process element link is the offset from the SPA_Base, shifted right by 7 bits, of the process element to operate on.
3.4 Process Management
In the shared programming model, the PSL switches between the processes scheduled to use the AFUs by system software. This section describes the procedures for both system software and the PSLs for scheduling, descheduling, and terminating processes.
To schedule a process for an AFU, system software adds a process element entry to the linked list in system storage. Once added, the PSL starts the new process at the next available context interval for a time-sliced programming model, or at an implementation-dependent point in time for an AFU-directed programming model.
For the time-sliced programming models, any newly added processes are placed into a circular queue maintained by the PSL, referred to as the psl_queue. Process elements are pulled from the psl_queue in round-robin order by one or more CAIA-compliant devices to be run. When a process element completes, system software is responsible for removing the process element and updating the linked list before allocating the process element to another process. To terminate a process element
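The link field defined above (bits 48:63) is the byte offset of a process element from SPA_Base shifted right by 7 bits, which corresponds to 128-byte granularity. The sketch below computes and reverses that value; the 128-byte alignment check and the 16-bit range check follow from the shift and the field width, and the function names are hypothetical.

```python
ELEMENT_SHIFT = 7  # the link is the offset from SPA_Base shifted right by 7 bits


def process_element_link(spa_base: int, element_address: int) -> int:
    """Return the 16-bit link for a process element at element_address."""
    offset = element_address - spa_base
    if offset < 0:
        raise ValueError("element lies below SPA_Base")
    if offset & ((1 << ELEMENT_SHIFT) - 1):
        # A non-zero low part would be lost by the shift.
        raise ValueError("element offset is not 128-byte aligned")
    link = offset >> ELEMENT_SHIFT
    if link > 0xFFFF:
        raise ValueError("link does not fit in bits 48:63")
    return link


def process_element_address(spa_base: int, link: int) -> int:
    """Inverse mapping: address of the element that a link refers to."""
    if not 0 <= link <= 0xFFFF:
        raise ValueError("link must be a 16-bit value")
    return spa_base + (link << ELEMENT_SHIFT)


if __name__ == "__main__":
    base = 0x1_0000  # hypothetical SPA_Base used only for this example
    print(process_element_link(base, base + 3 * 128))
```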
39. note to Section 6.2.2.1 cxl_adapter_afu_next on page 81, Section 6.2.2.2 cxl_afu_next on page 81, Section 6.2.2.3 cxl_afu_devname on page 81, Section 6.2.2.4 cxl_for_each_adapter_afu on page 82, Section 6.2.2.5 cxl_for_each_afu on page 82, Section 6.2.3.2 cxl_afu_open_h on page 82, Section 6.2.3.3 cxl_afu_fd_to_h on page 82, Section 6.2.3.6 cxl_afu_attach_full on page 83, Section 6.2.3.7 cxl_afu_fd on page 83, Section 6.2.3.8 cxl_afu_open_and_attach on page 83, and Section 6.2.3.9 cxl_afu_sysfs_pci on page 84.
• Added Section 6.2.3.5 cxl_afu_attach on page 83.
• Revised Section 6.2.3.11 cxl_mmio_unmap on page 84.
• Revised Section 6.2.3.12 cxl_mmio_read on page 84.
• Revised Section 7.3.5 Designing for Performance on page 89.
20 November 2014, Version 1.1:
• Revised Table 3-2 Process Element Entry Format on page 35.
• Revised Table 5-2 PSL Command Opcodes Directed at the PSL Cache on page 64.
• Revised Section 5.1.3 Locks on page 68.
• Revised Table 5-8 PSL Response Codes on page 71.
06 November 2014, Version 1.0: Initial release.
About this Document
This user's guide describes the Coherent Accelerator Processor Interface (CAPI) for the IBM POWER8 systems. This document is intended to ass
40. on for multiple cycles, indicating that the responses are being returned back to back.
haX_rtag | 8 | PSL | Accelerator-generated ID for the request.
haX_rtagpar | 1 | PSL | Odd parity for haX_rtag, valid with haX_rvalid.
haX_response | 8 | PSL | Response code. See Table 5-8 PSL Response Codes on page 71.
haX_rcredits | 9 | PSL | Two's complement number of credits returned.
haX_rcachestate | 2 | PSL | Reserved.
haX_rcachepos | 13 | PSL | Reserved.
Table 5-8. PSL Response Codes
Mnemonic | Code | Description
DONE | 0x00 | Command is complete. Any and all data requests have been made for the request to/from the buffer interface. Data movement between the accelerator and the PSL for these requests is complete.
AERROR | 0x01 | Command has resulted in an address translation error. All further commands are flushed until a restart command is accepted on the command interface.
DERROR | 0x03 | Command has resulted in a data error. All further commands are flushed until a restart command is accepted on the command interface.
NLOCK | 0x04 | Command requires a lock status that is not present. Command issued is unrelated to an outstanding lock.
NRES | 0x05 | Command requires a reservation that is not present.
FLUSHED | 0x06 | Command follows a command that failed and is flushed. See Table 5-5 aXh_cabt Translation Ordering Behavior on
41. page 66 for additional information.
FAULT | 0x07 | Command address could not be quickly translated. Interrupt has been sent to the operating system or hypervisor for aXh_cabt mode ABORT. The command has been terminated.
FAILED | 0x08 | Command could not be completed because:
• An interrupt service request that receives this response contained an invalid source number.
• Parity error detected on command request; therefore, the command was ignored.
• Command issued that is not supported in the configured PSL_SCNTL_An[PSL Model Type].
PAGED | 0x0A | Command address could not be translated. The operating system has requested that the accelerator continue. The command has been terminated. All further commands are flushed until a restart command is accepted on the command interface.
5.3.1 Command/Response Flow
Figure 5-1 illustrates the PSL command and response flow.
Figure 5-1. PSL Command/Response Flow
(Flowchart; only fragments survive extraction. Recoverable note from the figure: the accelerator can reissue faulted commands but needs to reissue with aXh_cabt = 0 to guarantee forward progress.)
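The response-interface fields and Table 5-8 lend themselves to a small decoder. The sketch below is hypothetical host-side or testbench code, not part of the PSL: it maps the response codes listed in the table, marks the three codes whose descriptions say that all further commands are flushed until a restart is accepted, computes odd parity for an 8-bit haX_rtag, and interprets the 9-bit haX_rcredits value as two's complement. All names are assumptions made for illustration.

```python
from enum import IntEnum


class PslResponse(IntEnum):
    """Response codes as listed in Table 5-8."""
    DONE = 0x00
    AERROR = 0x01
    DERROR = 0x03
    NLOCK = 0x04
    NRES = 0x05
    FLUSHED = 0x06
    FAULT = 0x07
    FAILED = 0x08
    PAGED = 0x0A


# Codes whose table descriptions state that further commands are flushed
# until a restart command is accepted.
FLUSH_UNTIL_RESTART = frozenset(
    {PslResponse.AERROR, PslResponse.DERROR, PslResponse.PAGED}
)


def enters_flush_mode(code: int) -> bool:
    """True when the response requires a restart command before continuing."""
    return PslResponse(code) in FLUSH_UNTIL_RESTART


def odd_parity(value: int, width: int = 8) -> int:
    """Parity bit that makes the total number of 1 bits (data + parity) odd."""
    ones = bin(value & ((1 << width) - 1)).count("1")
    return 0 if ones % 2 else 1


def decode_credits(raw: int) -> int:
    """Interpret the 9-bit haX_rcredits field as a two's complement number."""
    raw &= 0x1FF
    return raw - 0x200 if raw & 0x100 else raw


if __name__ == "__main__":
    print(PslResponse(0x0A).name, enters_flush_mode(0x0A))
    print(odd_parity(0x00), decode_credits(0x1FF))
```

A testbench could compare odd_parity(rtag) against the sampled haX_rtagpar and flag a mismatch as a parity error.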
42. process has been successfully terminated.
• The status field in the sw_command_status is set to x'0001' using a caching-inhibited DMA or special memory-update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00010001' || first_psl_id || link_of_element_to_terminate.
3.4.4 Removing a Process Element from the Linked List
To make room for new process elements in the linked list, completed and terminated process elements must be removed by system software. To safely remove a process element from the linked list of scheduled processes, software must follow the sequence outlined in Section 3.4.4.1.
Implementation Note: The removal of a process element must also invalidate all cache copies of translations that are associated with the process element being removed. An implementation cannot depend on system software performing TLB and SLB invalidates.
3.4.4.1 Software Procedure
Note: The following sequence is only for a single system software process managing the linked list. Additional locking and synchronization steps are necessary to allow for multiple system software processes to concurrently manage the linked list.
1. Update the system software implementation-dependent free list and process
43. successfully suspended. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
• The status field in the sw_command_status is set to x'0003' using a caching-inhibited DMA or special memory-update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00030003' || first_psl_id || link_of_element_to_suspend.
2. If the process element is not running, the PSL writes a suspend command to the psl_chained_command doubleword for the next PSL.
• Write the value x'00030000' || next_psl_id || link_of_element_to_suspend to the psl_chained_command.
• The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
Operations Performed by the Last PSL (PSL_ID[L] = '1')
When the suspend_element MMIO command is received or the suspend_element command is detected by the last PSL, the PSL checks to see if the process element being suspended is currently running, performs any operations necessary, and sets the completion status in the software command status word. The suspend_element command is detected by monitoring the psl_chained_command doubleword.
1. If the process elemen
44. terminated string that represents the device path name of the CAPI-capable adapter.
6.2.1.3 cxl_adapter_free
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
void cxl_adapter_free(struct cxl_adapter_h *adapter);
The cxl_adapter_free routine frees the buffers associated with the adapter handle.
6.2.1.4 cxl_for_each_adapter
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
#define cxl_for_each_adapter(adapter) for (adapter = cxl_adapter_next(NULL); adapter; adapter = cxl_adapter_next(adapter))
This macro visits each CAPI-capable adapter in the system.
6.2.2 Accelerated Function Unit Selection
6.2.2.1 cxl_adapter_afu_next
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
struct cxl_afu_h * cxl_adapter_afu_next(struct cxl_adapter_h *adapter, struct cxl_afu_h *afu);
The cxl_adapter_afu_next routine returns a handle to the next available AFU on a given CAPI-capable adapter. The adapter parameter must not be NULL. If the AFU parameter is NULL, cxl_adapter_afu_next returns a pointer to the buffer holding the information for the first available AFU on this adapter. Subsequent calls to this routine return the information for the next AFU on this adapter. If there are no more remaining AFUs
45. the buffer for the AFU information is freed and NULL is returned.
6.2.2.2 cxl_afu_next
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
struct cxl_afu_h * cxl_afu_next(struct cxl_afu_h *afu);
The cxl_afu_next routine returns a handle to the next available AFU on the next CAPI-capable adapter. If the AFU parameter is NULL, the routine allocates buffers for the CXL adapter information and AFU information and returns the pointer to the AFU information buffer that contains the information for the first available adapter and AFU. Subsequent calls iterate first through the AFUs on the current adapter stored in the AFU buffer. The routine advances to the next adapter after exhausting all the AFUs on the current adapter and returns the information for the first AFU on that adapter.
6.2.2.3 cxl_afu_devname
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
char * cxl_afu_devname(struct cxl_afu_h *afu);
The cxl_afu_devname routine returns the path name that represents the AFU associated with the given AFU handle.
6.2.2.4 cxl_for_each_adapter_afu
Note: Not applicable for the CAPI Developer Kit.
#include <libcxl.h>
#define cxl_for_each_adapter_afu(adapter, afu) for (afu = cxl_adapter_afu_next(adapter, NULL); afu
46. the first PSL, the PSL notifies the AFU that the process element is being removed and sends the remove_element command to the next PSL.
1. The PSL notifies the AFU of the process element removal. The AFU performs any necessary operations to remove the process and then acknowledges the removal of the process element. When the acknowledgment is received, the PSL continues with the next substep.
2. The PSL sets the complete status in the software command status field to indicate that it is now safe to remove the process element from the linked list.
• Write the value x'00020000' || next_psl_id || link_of_element_to_remove to the psl_chained_command.
• The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
Operations Performed by the Next PSL (PSL_ID[L,F] = '00')
When the remove_element command is detected by the next PSL, the PSL notifies the AFU that the process element is being removed and sends the remove_element command to the next PSL. The remove_element command is detected by monitoring the psl_chained_command doubleword.
1. The PSL notifies the AFU of the process element removal. The AFU performs any necessary operations to remove the process and then acknowledges the removal of the process element. When the acknowledgment is received, the PSL continues with the next substep.
2. The PSL sets the complete status
47. transactions and does not start any new transactions for the process. The PSL then invalidates the process element state and refetches a new copy from the process element linked list in system memory. If the process element is coherently cached, the update is automatically handled by the coherency protocol.
• The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
3. The PSL sets the complete status in the software command status field to indicate the process has been successfully suspended.
• The status field in the sw_command_status is set to x'0006' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00060006' || first_psl_id || link_of_element_to_update.
• The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.

4 AFU Descriptor Overview
A CAIA-compliant device can support programmable AFUs. The AFU descriptor is a set of registers within the problem state area that contains information about the capabilities of the AFU that is r
48. virtual memory that reside on disk. It does not include caches or execution-unit register files.
A pattern of bits used to accept or reject bit patterns in another set of data. Hardware interrupts are enabled and disabled by setting or clearing a string of bits, with each interrupt assigned a bit position in a mask register.
Megabyte.
An aspect of caching in which it is ensured that an accurate view of memory is provided to all devices that share system memory.
Mapped into the Coherent Attached Accelerator's addressable memory space. Registers, local storage (LS), I/O devices, and other readable or writable storage can be memory mapped. Privileged software does the mapping.
Memory-mapped I/O.
Process ID.
POWER service layer. It is the interface logic for a coherently attached accelerator and provides two main functions: it moves data between accelerator function units (AFUs) and main storage, and it synchronizes the transfers with the rest of the processing units in the system.
Memory-mapped input/output. See memory mapped.
MMU, Most significant bit, MRU, MSb, Page, Page table, PCIe, PLL, POWER, Power ISA, Privileged mode, Privileged software, Problem state, PSL, PTE, RA, RAM, RA, SAO, SLB
Memory management unit. A functional unit that translates between effective addresses (EAs) used by pro
49.
& | bitwise AND
| | bitwise OR
~ | bitwise NOT
% | modulus
= | equal to
!= | not equal to
≥ | greater than or equal to
≤ | less than or equal to
x >> y | shift to the right; for example, 6 >> 2 = 1; least-significant y bits are dropped
x << y | shift to the left; for example, 3 << 2 = 12; least-significant y bits are replaced with zeros
|| | Concatenate

References to Registers, Fields, and Bits
Registers are referred to by their full name or by their short name (also called the register mnemonic). Fields are referred to by their field name or by their bit position. Table 1 describes how registers, fields, and bit ranges are referred to in this document and provides examples.

Table 1. Register References
Type of Reference | Format | Example
Reference to a specific register and a specific field using the register short name and the field name | Register_Short_Name[Field_Name] | MSR[R]
Reference to a field using the field name | [Field_Name] | [R]
Reference to a specific register and to multiple fields using the register short name and the field names | Register_Short_Name[Field_Name1, Field_Name2] | MSR[FE0, FE1]
Reference to a specific register and to multiple fields using the register short name and the bit positions | Register_Short_Name[Bit_Number, Bit_Number] | MSR[52, 55]
Reference to a spe
50. 28 bytes and aXh_cea must be 128-byte line aligned.
Read a cache line and allocate the cache line in the precise cache in the modified state. This command must be used when there is an expectation that data within the line will be written in the near future. aXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
Read_cl_lck | x'0A6B' | Read a cache line and allocate the cache line in the precise cache in the locked and modified state. This command must be used as part of an atomic read-modify-write sequence. aXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
Read_cl_res | x'0A67' | Read a cache line, allocate the cache line in the precise cache, and acquire a reservation. aXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
touch_i | x'0240' | Bring a cache line into the precise cache in the IHPC state without reading data, in preparation for a cache line write. aXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned. IHPC: The owner of the line is the highest point of coherency, but it is holding the line in an I state.
touch_s | x'0250' | Bring a cache line into the precise cache in the shared state. aXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
touch_m | x'0260' | Bring a cache line into the precise cache in the modified state. aXh_csize must be 128 bytes and aXh_cea must be 128-byte line aligned.
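The size and alignment rule repeated in the rows above can be checked before a command is issued. The following is a minimal sketch in C and is not part of the CAIA or of libcxl: the opcode values are copied from the rows above, while the enumerator names, CACHE_LINE_BYTES, and line_command_args_ok are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Opcode values copied from the table rows; the names are illustrative. */
enum psl_cache_opcode {
    PSL_READ_CL_LCK = 0x0A6B, /* read line, locked and modified state  */
    PSL_READ_CL_RES = 0x0A67, /* read line and acquire a reservation   */
    PSL_TOUCH_I     = 0x0240, /* bring line in, IHPC state             */
    PSL_TOUCH_S     = 0x0250, /* bring line in, shared state           */
    PSL_TOUCH_M     = 0x0260  /* bring line in, modified state         */
};

#define CACHE_LINE_BYTES 128u

/* The modulo test below is a plain alignment check only because the
 * line size is a power of two. */
static_assert((CACHE_LINE_BYTES & (CACHE_LINE_BYTES - 1u)) == 0u,
              "line size must be a power of two");

/* Hypothetical helper: true when the size is exactly one cache line and
 * the effective address is 128-byte line aligned. */
bool line_command_args_ok(uint32_t csize, uint64_t cea)
{
    return csize == CACHE_LINE_BYTES && (cea % CACHE_LINE_BYTES) == 0u;
}
```

For example, a size of 128 with address 0x1000 passes, while address 0x1040 fails because it sits 64 bytes into a line.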
51. Accelerator Interface 63
5.1 Accelerator Command Interface 63
5.1.1 Command Ordering 66
5.1.2 Reservation 68
5.1.3 [...] 68
5.1.4 Request for Interrupt Service 69
5.1.5 Parity Handling for the Command Interface 69
5.2 Accelerator Buffer Interface 69
5.3 PSL Response Interface 70
5.3.1 Command Response Flow 72
5.4 Accelerator MMIO Interface 73
5.5 Accelerator Control Interface 73
5.5.1 Accelerator Control Interface in the Non-Shared Mode
52. Control Interface 74
Table 5-11. PSL Control Commands on haX_jcom 74
Table 7-1. FPGA Resources Available for AFU 88

List of Figures
Figure 1-1. Coherent Accelerator Process Interface Overview 17
Figure 1-2. POWER Service Layer 18
Figure 1-3. CAPI Application on the FPGA 19
Figure 2-1. CAIA-Compliant Processor System 22
Figure 3-1. Accelerator Invocation Process in the Dedicated Process Model 28
Figure 3-2. Accelerator Invocation Process in the Shared Model 32
Figure 3-3. Structure for Scheduled Processes
53. Developer Kit Card 93
8.1 Supported CAIA Features 93
8.2 CAPI Developer Kit Card Hardware 93
8.3 FPGA Build Restrictions 93
8.4 CAPI Developer Kit Card FPGA Build Flow 94
8.4.1 Structure of Quartus Project Files 94
8.4.2 Build the FPGA 94
8.4.3 Load FPGA rbf File onto the CAPI Developer Kit Card 95
8.4.4 Timing Closure Hints 95
8.4.5 Debug Information 95
Glossary 97

List of Tables
Table 1. Register References 15
Table 2-1. Sizes of Main Storage Address Spaces
54. FU process is an AFU implementation-specific procedure.

Figure 3-2. Accelerator Invocation Process in the Shared Model
[Figure: On the host processor, the application issues a syscall with the parameters Function_ID, WED, AMR, and CSRP. The operating system issues an hcall with the parameters Function_ID, WED, AMR, CSRP, PID, TID, SSTP, LISN, and AURP. The hypervisor enqueues a process element containing SR, LPID, HAURP, SDR, PID, TID, CSRP, AURP, SSTP, AMR, IVTEs, and WED. In main memory, the application EA space holds the accelerator-specific job information (WED) and the accelerator context save/restore area (CSRP); the operating system VA space holds the accelerator utilization record (AURP) and the storage segment table (SSTP); the hypervisor RA space holds the process element list (SPAP).]
55. PI. The libcxl provides an application programming interface (API) for the allocation, deallocation, and communication with a CAPI accelerator.
Section 7 AFU Development and Design on page 87 provides some general information about developing an AFU and some best practices to consider when designing an AFU.
Section 8 CAPI Developer Kit Card on page 93 describes CAIA implementation details for the CAPI Developer Kit card and the FPGA build flow for the CAPI Developer Kit card.

2 Introduction to Coherent Accelerator Interface Architecture
The Coherent Accelerator Interface Architecture (CAIA) defines an accelerator interface structure for coherently attaching accelerators to the Power Systems using a standard PCIe bus. The intent is to allow implementation of a wide range of accelerators to optimally address many different market segments.

2.1 Organization of a CAIA-Compliant Accelerator
Logically, the CAIA defines two functional components: the PSL and the AFU. The PSL in a CAIA-compliant accelerator provides the interface to the host processor. Effective addresses from an AFU are translated to a physical address in system memory by the PSL. The PSL also provides miscellaneous management for the AFUs. Although the CAIA architecture defines interfaces for up to
56.
1.3 Application
The application that runs on the FPGA can be a new solution or one ported from a software application or an I/O subsystem. The new host algorithm is far lighter than in the old paradigm. The new paradigm off-loads the processor or avoids device driver programming overhead. Figure 1-3 compares the old paradigm with the CAPI paradigm.

Figure 1-3. CAPI Application on the FPGA
[Figure: Two panels. Old Paradigm: Main loops over all inputs and calls myAlgorithm(i); the idea is implemented in software or as a call to I/O, and runs on the POWER core or in I/O. CAPI Paradigm, Development Stages: the same loop first calls a software myAlgorithm; the stages are labeled "Create FPGA Accelerator" and "Integrate into Application", with a further step marked "If widely used".]

The accelerator algorithm that resides on the FPGA is referred to as the accelerator functional unit (AFU). The AFU is created in a source language that can be synthesized by the FPGA tools. This source language must also be able to be compiled into a simulation
57. The script name and location are included in the README file.
4. Run your application.

8.4.4 Timing Closure Hints
A general flow to help close timing after the initial build follows:
1. Run the Top, alt_xcvr_reconfig, and psl_accel design units as source, with the psl as post-fit with the preservation level set to placement and routing (default copied project file settings).
2. If timing is not met, rerun with alt_xcvr_reconfig and psl as post-fit, and Top and psl_accel as post-synthesis.
3. If timing is not met, rerun the same as step 2, except with different fitter seeds.
4. If timing is still not met, start over at step 1 to rerun synthesis. You might also want to delete the db and incremental_db directories to start fresh.
5. In all the above steps, if very large timing misses are occurring, look at AFU design file changes to correct the problem.

8.4.5 Debug Information
Use trace arrays implemented within the AFU to monitor the AFU-PSL interface for any errors. Simulate failing scenarios with the POWER8 Functional Simulator to try to isolate the issue.

AFU, ALUT, AMOR, AMR, architecture, AURP, Big endian, Cache, Caching inhibited, CAIA, CAPI, CAPP, Coherence, CSRP
58.
Write_mi | x'0D60' | Write all or part of a cache line and allocate the cache line in the precise cache in the modified state. The line goes invalid if a snoop read hits it. This command must be used when there is an expectation of temporal locality followed by a use by another processor. aXh_csize must be a power of 2 and aXh_cea must be naturally aligned according to size.
Write_ms | x'0D70' | Write all or part of a cache line and allocate the cache line in the precise cache in the modified state. The line goes to a shared state if a snoop read hits it. This command must be used when there is an expectation of temporal locality in a producer-consumer model. aXh_csize must be a power of 2 and aXh_cea must be naturally aligned according to size.

Table 5-2. PSL Command Opcodes Directed at the PSL Cache (Sheet 2 of 2)
Mnemonic | Opcode | Description
Write_unlock | x'0D6B' | If a lock is present, write all or part of a cache line and clear the line's lock status back to a modified state. The command fails if the lock is not present. aXh_csize must be a power of 2 and aXh_cea must be naturally aligned according to size.
Write_c | x'0D67' | If a reservation is present, write all or part of a cache line and clear the reservation status. If a reservation is not present
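The "power of 2, naturally aligned" rule for the write commands can be expressed as two bit tests. The following is a minimal sketch in C under stated assumptions: write_command_args_ok and MAX_WRITE_BYTES are hypothetical names, and the 128-byte upper bound is inferred from the phrase "all or part of a cache line" together with the 128-byte line size used by the read commands. It is not taken from an opcode table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed upper bound: one 128-byte cache line. */
#define MAX_WRITE_BYTES 128u

static_assert((MAX_WRITE_BYTES & (MAX_WRITE_BYTES - 1u)) == 0u,
              "upper bound must itself be a power of two");

/* Hypothetical helper: true when csize is a power of two no larger than
 * one cache line and cea is a multiple of csize (natural alignment). */
bool write_command_args_ok(uint32_t csize, uint64_t cea)
{
    /* A power of two has exactly one bit set. */
    bool power_of_two = csize != 0u && (csize & (csize - 1u)) == 0u;

    if (!power_of_two || csize > MAX_WRITE_BYTES) {
        return false;
    }
    /* For a power-of-two size, the low bits of the address must be 0. */
    return (cea & (uint64_t)(csize - 1u)) == 0u;
}
```

With this check, a 64-byte write to 0x1040 is accepted, while a 64-byte write to 0x1020 is rejected because the address is only 32-byte aligned.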
59. _element_to_resume
• If a value of all 1's is returned for the status, an error has occurred. An implementation-dependent recovery must be initiated by hardware.

3.4.6.2 PSL Procedure for Time-Sliced and AFU-Directed Programming Models
Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first nor last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last.

Operations Performed by the First PSL (PSL_ID[L,F] = '01')
When the resume_element MMIO command is received by the first PSL, the PSL performs any operations necessary and sends the resume_element command to the next PSL. The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
1. When operating in an AFU-directed programming model, the PSL notifies the AFU of the process element being resumed. The AFU performs any necessary operations to resume execution of the process and then acknowledges the resumed process element. When the acknowledgment is received, the PSL continues with the next substep. The AFU is not notified of the added process element for all other programming
60. a POWER service layer (PSL). The PSL is the bridge to the system for the AFU and provides address translation and system memory cache. In addition, the PSL provides miscellaneous facilities for the host processor to manage the virtualization of the AFUs, interrupts, and memory management. The PSL consists of several functional units, such as the memory protection tables. Hardware resources defined in the CAIA are mapped explicitly to the real address space seen by the host processor. Therefore, any host processor can address any of these resources directly by using an appropriate effective address value. A primary function of the PSL is the physical separation of the AFUs so that they appear to the system as independent units.

2.1.2 Accelerator Function Unit
Note: The AFU functional definition is outside the scope of the CAPI User's Manual. The AFU functional definition is owned by the CAPI solution provider.
A CAIA-compliant processor includes one or more AFUs. The AFUs are user-defined functions for accelerating applications. They typically process data and initiate any required data transfers to perform their allocated tasks.
The purpose of an AFU is to provide applications with a higher computational unit density for hardware acceleration
61. a is fetched or interpreted (for example, by an accelerator function).

1 Coherent Accelerator Processor Interface Overview
The Coherent Accelerator Processor Interface (CAPI) is a general term for the infrastructure of attaching a coherent accelerator to an IBM POWER system. The main application is executed on the host processor, with computation-heavy functions executing on the accelerator. The accelerator is a full peer to the host processor, with direct communication with the application. The accelerator uses an unmodified effective address with full access to the real address space. It uses the processor's page tables directly, with page faults handled by system software. Figure 1-1 shows an overview of CAPI.

Figure 1-1. Coherent Accelerator Process Interface Overview
[Figure labels: CAPI Developer Kit Card; POWER8 Processor Chip]
• The application sets up the data and calls the accelerator functional unit (AFU).
• The accelerator functional unit reads and writes coherent data across the PCIe and communicates with the application.
• The POWER Service Layer (PSL) cache holds coherent data for quick AFU access.

1.1 Coherency
The Coherent Attached Processor Proxy (CAPP) in the multi-core POWER8 processor extends coherency to the attached accelerator. A directory on the CAPP provides coherency respons
62. a reset sequence also disables the AFU. The AFU does not respond to the Problem State MMIO region while disabled. System software must poll the AFU Slice Reset Status for the AFU Slice Reset Sequence to be complete (AFU_Cntl_An[RS] = '10').

Figure 3-1. Accelerator Invocation Process in the Dedicated Process Model
[Figure: On the host processor, the application performs an accelerator-specific invocation. In main memory, the application EA space holds the accelerator-specific job information (WED) and the operating system VA space holds the storage segment table (SSTP). The POWER Service Layer (PSL) slice holds SR, SSTP, SDR, AMOR, IVTEs, LPID, PID, and TID, passes the WED to the AFU, and contains the preempt request, context save/restore, segment walk, and page walk functions and the interrupt source layer. The AFU sends interrupt requests and effective addresses (EA) to the PSL, which produces the physical address (RA).]

3.2 Shared Programming Models
Note: This section is for future releases only. Currently, libcxl only suppo
63. a virtual address, which it then translates to a real address (RA) that accesses real physical memory. The maximum size of the effective address space is 2^64 bytes.
ELF: Executable and linkable format.
ERAT: Effective-to-real-address translation, or a buffer or table that contains such translations, or a table entry that contains such a translation.
Exception: An error, unusual condition, or external signal that can alter a status bit and causes a corresponding interrupt if the interrupt is enabled. See interrupt.
Fetch: Retrieving instructions from either the cache or system memory and placing them into the instruction queue.
FPGA: Field-programmable gate array.
HAURP: Hypervisor Accelerator Utilization Record Pointer.
hcall: Hypervisor call.
HPC: Highest point of coherency.
Hypervisor: A control (or virtualization) layer between hardware and the operating system. It allocates resources, reserves resources, and protects resources among (for example) sets of AFUs that may be running under different operating systems.
IHPC: The owner of the line is the highest point of coherency, but it is holding the line in an I state.
Implementation: A particular processor that conforms to the architecture but might differ from other architecture-compliant implementations. For example, in design, this could be the feature set and implementation of optional features.
INT, Interrupt packet, ISA, JEA
Interrupt: A change
64. [...] 9
Revision [...] 11
About this Document 13
Who Should Read This Manual 13
Document Organization 13
Related Publications 14
Conventions Used in This Document 14
Representation of Numbers 14
Bit Significance 14
Other Conventions 14
References to Registers, Fields, and Bits 15
[...] 15
[...] 16
1 Coherent Accelerator Processor Interface Overview 17
1.1 Coherency 17
1.2 POWER Service Layer 18
1.3 Application 19
2 Introduction to Coherent Accelerator Interface Architecture
65. anaged linked list of scheduled processes using the following sequence. The sequence outlined below is only for a single system software process managing the linked list. Additional locking and synchronization steps are necessary to allow for multiple system software processes to concurrently manage the linked list.

3.4.1.1 Software Procedure
1. Determine if there is room in the linked list for the new process element.
Note: The method system software uses to calculate the free space in the linked list is implementation specific.
2. Write the new process state to a free process element location in the linked list area.
• The free process element can be obtained from a linked list of free processes or by some other implementation-specific means.
3. Set the valid flag in the software state to 1 (Software_State[V] = 1).
• Store x'80000000' to the 31st word of the process element to add.
4. Ensure that the terminate status is visible to all processes.
• System software running on the host processor must perform a sync instruction.
5. Write an add_element command to the software command status field in the linked list area.
• Store x'00050000' || first_psl_id || link_of_element_to_add to address sw_command_status.
6. Update the system software implementation-dependent free list and the process element linked list structures to reflect the added process element.
7. Ensure that the new process element is visible to all processes. S
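The concatenated value written in step 5 can be assembled with shifts. The following is a minimal sketch in C, assuming that sw_command_status is a 64-bit doubleword made of four 16-bit fields (command, status, PSL ID, link). That split is inferred from the 32-bit prefix x'00050000' followed by two more fields, and is an assumption rather than a layout quoted from the register definitions. All function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SW_CMD_ADD_ELEMENT 0x0005u /* command half of x'00050000' */

/* Concatenate four 16-bit fields, most significant first:
 * command || status || psl_id || link. */
uint64_t sw_command_word(uint16_t command, uint16_t status,
                         uint16_t psl_id, uint16_t link)
{
    return ((uint64_t)command << 48) | ((uint64_t)status << 32) |
           ((uint64_t)psl_id << 16) | (uint64_t)link;
}

/* Value for step 5: x'00050000' || first_psl_id || link_of_element_to_add. */
uint64_t add_element_command(uint16_t first_psl_id, uint16_t link)
{
    return sw_command_word(SW_CMD_ADD_ELEMENT, 0x0000u, first_psl_id, link);
}

/* Extract field 0..3 (0 = command, 1 = status, 2 = PSL ID, 3 = link). */
uint16_t sw_command_field(uint64_t word, unsigned index)
{
    assert(index < 4u);
    return (uint16_t)(word >> (48u - 16u * index));
}
```

For instance, add_element_command(0x0001, 0x0042) yields 0x0005000000010042 under these assumptions, and sw_command_field(word, 1) reads back the status field that the PSL later updates.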
66. and sends the terminate_element command to the next PSL or sets the completion status in the software command status word. The update_element command is detected by monitoring the psl_chained_command doubleword.
1. When operating in an AFU-directed programming model, the PSL notifies the AFU of the updated process element. The AFU performs any necessary operations to update the process and then acknowledges the updated process element. When the acknowledgment is received, the PSL continues with the next substep. The AFU is not notified of the added process element for all other programming models.
2. If the process is running, the PSL completes any outstanding transactions and does not start any new transactions for the process. The PSL then invalidates the process element state and refetches a new copy from the process element linked list in system memory. If the process element is coherently cached, the update is automatically handled by the coherency protocol.
• The status field in the sw_command_status is set to x'0006' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status should be x'00060006' || first_psl_id || link_of_element_to_update.
• The PSL does not start
67. are
[Bit layout: V (bit 0), Reserved (bits 1:29), S (bit 30), T (bit 31)]

Bits | Field Name | Description
0 | V | Process element valid. 0: Process element information is not valid. 1: Process element information is valid.
1:29 | Reserved | Reserved.
30 | S | Suspend process element. 0: Process element can execute (not suspended). 1: Process element execution is suspended (suspended). All outstanding operations are also complete. Note: The process element can be added to the PSL queue even if the suspend flag is 1.
31 | T | Terminate process element. 0: Termination of the process element has not been requested. 1: Process element is being terminated.

3.3.3 Software Command Status Field Format
There are two command status words in the scheduled processes area: the sw_command_status word and the psl_chained_command word. These commands are used by system software and the PSLs to either terminate or to safely remove a process element. Updates of the sw_command_status word by the PSL must be performed using a caching-inhibited write operation. In some implementations, a special write operation must be used. The special write operation allows the system to continue normal operation in the scenario where the CAIA-compliant device abnormally
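Because the manual numbers bits from the most significant end, bit 0 (V) is the top bit of the 32-bit word and bit 31 (T) is the bottom bit. The following is a minimal sketch in C of that mapping and of a start-eligibility test. The names are hypothetical, and requiring V = 1 before a process may start is an assumption drawn from the field description, not a rule quoted from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convert a big-endian bit number (0 = most significant) of a 32-bit
 * word into a mask. */
uint32_t sw_state_bit(unsigned bit)
{
    assert(bit <= 31u);
    return UINT32_C(1) << (31u - bit);
}

#define SW_STATE_V UINT32_C(0x80000000) /* bit 0:  valid     */
#define SW_STATE_S UINT32_C(0x00000002) /* bit 30: suspend   */
#define SW_STATE_T UINT32_C(0x00000001) /* bit 31: terminate */

/* Hypothetical predicate: a process may be started only when its element
 * is valid and neither suspended nor being terminated. */
bool element_may_start(uint32_t software_state)
{
    return (software_state & SW_STATE_V) != 0u &&
           (software_state & (SW_STATE_S | SW_STATE_T)) == 0u;
}
```

This mapping agrees with the stored values used in the software procedures: x'80000000' sets only V, and x'80000001' sets V and T.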
68. are Procedure
The following sequence is only for a single system software process managing the linked list. Additional locking and synchronization steps are necessary to allow for multiple system software processes to concurrently manage the linked list.
1. Set the terminate flag in the software state to 1 (Software_State[T] = 1).
• Store x'80000001' to the 31st word of the process element to terminate.
2. Ensure that the terminate status is visible to all processes.
• System software running on the host processor must perform a sync instruction.
3. Write a terminate_element command to the software command status field in the linked list area.
• Store x'00010000' || first_psl_id || link_of_element_to_terminate to address sw_command_status.
4. Ensure that the terminate_element command is visible to all processes.
• System software running on the host processor must perform a sync instruction.
5. Issue the terminate_element MMIO command to the first PSL.
• System software performs an MMIO to the PSL Linked List Command Register with the terminate_element command and the link of the process being terminated (PSL_LLCMD_An = x'000100000000' || link_of_element_to_terminate).
6. Wait for the PSLs to complete the termination of the process element.
• The process element is terminated when a
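Steps 1 through 5 are an ordered sequence of stores separated by barriers. The following is a minimal sketch in C against a plain in-memory model, not real hardware. The struct spa_model layout and the function name are hypothetical, the 16-bit widths of the PSL ID and link fields are assumptions, and the C11 atomic_thread_fence call merely stands in for the sync instruction that the procedure requires.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SW_STATE_V            UINT32_C(0x80000000)
#define SW_STATE_T            UINT32_C(0x00000001)
#define CMD_TERMINATE_ELEMENT UINT64_C(0x0001)

/* Hypothetical model; the real layouts are defined by the CAIA. */
struct spa_model {
    uint32_t software_state;    /* software state word of one element   */
    uint64_t sw_command_status; /* software command status field        */
    uint64_t psl_llcmd;         /* stands in for the PSL_LLCMD_An MMIO  */
};

void request_terminate(struct spa_model *spa, uint16_t first_psl_id,
                       uint16_t link)
{
    assert(spa != NULL);

    /* Step 1: store x'80000001' (V and T set). */
    spa->software_state = SW_STATE_V | SW_STATE_T;
    /* Step 2: make the terminate status visible (sync stand-in). */
    atomic_thread_fence(memory_order_seq_cst);
    /* Step 3: x'00010000' || first_psl_id || link. */
    spa->sw_command_status = (CMD_TERMINATE_ELEMENT << 48) |
                             ((uint64_t)first_psl_id << 16) | (uint64_t)link;
    /* Step 4: make the command visible (sync stand-in). */
    atomic_thread_fence(memory_order_seq_cst);
    /* Step 5: x'000100000000' || link to the linked list command register. */
    spa->psl_llcmd = (CMD_TERMINATE_ELEMENT << 48) | (uint64_t)link;
}
```

On real hardware the last store would be an MMIO write, and step 6 (waiting for the PSLs) would follow; both are outside this model.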
69. arity inputs are provided for important fields in the command interface. The command, tag, and address are protected by odd parity. Bad parity on any of these buses causes the PSL to return the error status for the command. All parity signals on the command interface are valid in the same cycle as aXh_cvalid.

5.2 Accelerator Buffer Interface
Data is moved between the PSL and the accelerator through the buffer interfaces. When a command is given to the PSL, it assumes that it can read or write data to the accelerator with the aXh_ctag contained in the command. Data is read or written before the command is completed, and it can be read or written more than once before the command is completed. There are two buffer interfaces present: one for reading during a write operation and one for writing during a read operation. Each read or write moves half of a line of data (64 bytes). Requests can arrive at any time on either interface. Each interface is synchronous, pipelined, and non-blocking. Read requests are serviced after a small (1-4 cycle) fixed delay in a pipelined fashion in the order that they are received, so that data can be directly sent to the PCIe write stream without PSL buffering.

Table 5-6. Accelerator Buffer Interface
Signal Name | Bits | Source | Description
haX_brvalid | 1 | PSL | This signal is asserted for a single cycle when a valid read data transfer is present on the interface. The haX_br signals are valid during the cycle haX_brvali
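Odd parity means the parity bit is chosen so that the protected field and its parity bit together contain an odd number of 1 bits. The following is a minimal sketch in C of generating and checking such a bit. The function names are hypothetical, and treating every protected field as a value of up to 64 bits is an assumption; the actual signal widths are defined by the interface tables.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return the parity bit that makes the total number of 1 bits
 * (field bits plus the parity bit itself) odd. */
unsigned odd_parity_bit(uint64_t field)
{
    unsigned ones = 0u;

    while (field != 0u) {
        ones += (unsigned)(field & 1u);
        field >>= 1;
    }
    /* An even count of 1 bits needs one more 1 to become odd. */
    return (ones % 2u == 0u) ? 1u : 0u;
}

/* True when the received parity bit matches the odd-parity rule. */
bool odd_parity_ok(uint64_t field, unsigned parity_bit)
{
    assert(parity_bit <= 1u);
    return odd_parity_bit(field) == parity_bit;
}
```

A field of all zeros therefore carries a parity bit of 1, which lets the receiver distinguish a valid zero value from an undriven bus.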
70. ated to a single process in the system. These types of accelerators are operating in the dedicated process virtualization programming model defined by the Coherent Accelerator Interface Architecture (CAIA).
Virtualized Accelerators: An accelerator that is shared between one or more processes in the system. These types of accelerators are operating in either the shared or dedicated partition virtualization programming models defined by the Coherent Accelerator Interface Architecture (CAIA).
Note: The CAPI Accelerator Management Library is still under development. Currently, libcxl only supports the dedicated programming model.
The following sections describe the contents of libcxl. In the Developer Kit release, a number of these calls are not implemented. They will be implemented in the future as the architecture expands beyond the contents of the CAPI Developer Kit release. CAPI Developer Kit users should focus on the following routines to start and eventually close their AFU:
cxl_afu_open_dev: Opens an existing AFU by its device path name and returns a handle to the open device. It is necessary for the user to know the device name that has been associated with their AFU.
cxl_afu_attach: Passes the work element descriptor (WED) to the FPGA and enables the given AFU for operation.
cxl_mmio_map: Maps the register space in the AFU into the memory associated with this process.
See Section 6.2.3.14 Additional Routines on page 84 and Section 6.2.3.13
71. ation. ERAT misses and protection violations stall subsequent aXh_cabt Strict operations before translation efforts. This ensures that the order of translation interrupts is the same as the order of command submission, and that loads and stores that follow a translation event have not been executed if the state needs to be saved and restored during the handling of a translation interrupt.
- If translation for the command results in a protection violation or the table walk process fails the command, an interrupt is sent. If the translation interrupt response is CONTINUE, the command receives the PAGED response, and all subsequent commands get FLUSHED responses until a restart command is received.
- If the translation interrupt response is Address Error, the command receives the AERROR response, and all subsequent commands get FLUSHED responses until a restart command is received.
- If the translation detects an internal error or data error, the command receives the DERROR response, and all subsequent commands get FLUSHED responses until a restart command is received.

PSL Implementation Note: When a protection violation occurs and before the translation interrupt response is received, subsequent commands that hit the same 16 MB page are held in a queue and marked as a protection violation. Once the translation response is received, the queued commands are processed and provide a PSL response according to their individual CABT mode. Requests that are received
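The response ordering described in this excerpt can be pictured as a small state machine: the first faulting command receives PAGED, AERROR, or DERROR, and every later command receives FLUSHED until a restart. The sketch below is a simplified Python model under stated assumptions: the class name, the outcome strings, and the DONE response for a successful command are illustrative choices, and queueing by 16 MB page and CABT modes are deliberately left out.

```python
from enum import Enum


class Response(Enum):
    DONE = "DONE"        # assumed response for a command that succeeds
    PAGED = "PAGED"
    AERROR = "AERROR"
    DERROR = "DERROR"
    FLUSHED = "FLUSHED"


# Hypothetical outcome labels mapped to the responses named in the text.
FAULT_RESPONSES = {
    "continue": Response.PAGED,        # translation interrupt response CONTINUE
    "address_error": Response.AERROR,  # translation interrupt response Address Error
    "data_error": Response.DERROR,     # internal error or data error
}


class ResponseModel:
    """Toy model: after one fault, flush everything until restart()."""

    def __init__(self) -> None:
        self.flushing = False

    def submit(self, outcome: str = "ok") -> Response:
        if self.flushing:
            return Response.FLUSHED
        if outcome == "ok":
            return Response.DONE
        response = FAULT_RESPONSES[outcome]  # KeyError for unknown labels
        self.flushing = True
        return response

    def restart(self) -> None:
        """Model the restart command that ends the flushing state."""
        self.flushing = False
```

An accelerator design would typically track the same flag so that it knows which commands to resubmit after issuing the restart command.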
72. by the next PSL, the PSL checks to see if the process element that is being terminated is currently running, performs any operations necessary, and sends the terminate_element command to the next PSL or sets the completion status in the software command status word. The terminate_element command is detected by monitoring the psl_chained_command doubleword.
1. If the process element is running, the process is terminated. The PSL sets the complete status in the software command status field to indicate that the process has been successfully terminated. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
   - The status field in the sw_command_status is set to x'0001' using a caching-inhibited DMA or special memory-update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be (x'00010001' || first_psl_id || link_of_element_to_terminate).
2. If the process element is not running, the PSL writes a termination command to the psl_chained_command doubleword for the next PSL and watches for the termination to be completed.
   - Write the value (x'00010000' || next_psl_id || link_of_element_to_terminate) to the psl_chained_command.
   - While waiting for the process to be terminated, the PSL does not attempt to start the corresponding process or any process with the complete, suspend, or terminate flags set. The PSL can per
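The concatenated values in this procedure can be built with shifts and ORs. The following Python sketch assumes that the doubleword is 64 bits wide and that the PSL ID and the link each occupy 16 bits, so that the four fields read left to right as command, status, PSL ID, and link. Those widths are an assumption inferred from the 32-bit constants in the text; the function name is illustrative.

```python
TERMINATE_ELEMENT = 0x0001  # command code that appears in this excerpt


def pack_command_word(command: int, status: int, psl_id: int, link: int) -> int:
    """Concatenate four 16-bit fields into one 64-bit doubleword.

    Assumed layout (most significant first): command, status, psl_id, link.
    """
    fields = {"command": command, "status": status, "psl_id": psl_id, "link": link}
    for name, value in fields.items():
        if not 0 <= value <= 0xFFFF:
            raise ValueError(f"{name} does not fit in 16 bits")
    return (command << 48) | (status << 32) | (psl_id << 16) | link
```

With a status of zero the result has the form of the value written to the chained command doubleword; with the status equal to the command it has the form of the final software command status.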
73. 3.4.1.3 PSL Procedure for the AFU-Directed Programming Models

Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first nor last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last.

Operations Performed by the First PSL (PSL_ID[L,F] = '01')

When the add_element MMIO command is received by the first PSL, the PSL performs any operations necessary and sends the add_element command to the next PSL. The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
1. The PSL notifies the AFU of the added process element. The AFU performs any necessary operations to prepare for the new process and then acknowledges the new process element. When the acknowledgment is received, the PSL continues with the next substep.
2. The PSL writes an add_element command to the psl_chained_command doubleword for the next PSL and watches for the add_element to be complete.
   - Write the value (x'00050000' || next_psl_id || link_of_element_to_add) to the psl_chained_command.

Operations Perform
74. cific register and to a field using the register short name and the bit position or the bit range:
  Register_Short_Name[Bit_Number], for example MSR[52]
  Register_Short_Name[Starting_Bit_Number:Ending_Bit_Number], for example MSR[39:44]
A field name followed by an equal sign (=) and a value indicates the value for that field:
  Register_Short_Name[Field_Name] = n, for example MSR[FE0] = '1' or MSR[FE] = x'1'
  Register_Short_Name[Bit_Number] = n, for example MSR[52] = '0' or MSR[52] = x'0'
  Register_Short_Name[Starting_Bit_Number:Ending_Bit_Number] = n, for example MSR[39:43] = '10010' or MSR[39:43] = x'11'
where n is the binary or hexadecimal value for the field or bits specified in the brackets.

Endian Order

The Power ISA supports both big-endian and little-endian byte-ordering modes. Book of the Power ISA describes these modes. The CAIA supports only big-endian byte ordering. Because the CAIA supports only big-endian byte ordering, the POWER service layer (PSL) does not implement the optional little-endian byte-ordering mode of the Power ISA. The data transfers themselves are simply byte moves, without regard to the numerical significance of any byte. Thus, the big-endian or little-endian issue becomes irrelevant to the actual movement of a block of data. The byte-order mapping only becomes significant when dat
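Because the bit positions in this notation follow the Power ISA convention in which bit 0 is the most significant bit, reading a field such as MSR[39:43] from an integer takes a small conversion. The Python sketch below shows one way to do it; the function name and the default 64-bit register width are assumptions made for illustration.

```python
def extract_field(value: int, start: int, end: int, width: int = 64) -> int:
    """Return bits start..end (inclusive) of a register value.

    Bit 0 is the most significant bit, as in the Power ISA; `width`
    is the register size in bits (64 is assumed here).
    """
    if not 0 <= start <= end < width:
        raise ValueError("bit range outside the register")
    shift = width - 1 - end                 # distance of `end` from the LSB
    mask = (1 << (end - start + 1)) - 1     # one mask bit per field bit
    return (value >> shift) & mask
```

For example, `extract_field(reg, 52, 52)` reads a single bit written as REG[52], and `extract_field(reg, 39, 43)` reads a five-bit range written as REG[39:43].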
75. contains at least the following information:
- A work element descriptor (WED)
- An Authority Mask Register (AMR) value (masked with the current AMOR)
- An effective address (EA) Context Save/Restore Area Pointer (CSRP)
- A process ID (PID) and optional thread ID (TID)
- A virtual address accelerator utilization record pointer (AURP)
- The virtual address of the storage segment table pointer (SSTP)
- Interrupt vector table (IVTE_Offset_n, IVTE_Range_n), derived from the LISNs in the hypervisor call parameters
- A state register (SR) value
- A logical partition ID (LPID)
- A real address (RA) hypervisor accelerator utilization record pointer (HAURP)
- The Storage Descriptor Register (SDR)

The hypervisor initializes the following PSL registers:
- PSL Control Register (PSL_SCNTL_An)
- Real address (RA) Scheduled Processes Area Pointer (PSL_SPAP_An)
- PSL Authority Mask Override Register (PSL_AMOR_An)

3.2.1 Starting and Stopping an AFU in the Shared Models

In the shared, PSL-controlled time-sliced programming model, the AFU is automatically started and stopped by the PSL. The PSL essentially follows the procedures defined in Section 3.1.1 Starting and Stopping an AFU in the Dedicated Process Model on page 26. In the AFU-directed shared programming model, starting and stopping an A
76. d 4 KB configuration space defined by the PCIe specification. If multiple AFU configuration records exist, each record corresponds to a physical function of the AFU. The AFU configuration record space is defined in little-endian format to conform to the PCIe standard.

Note: The length of each configuration record is selectable in 256-byte blocks. The AFU does not have to reserve a full 4 KB for the extended configuration space.

The AFU descriptor also contains an AFU error buffer. The AFU error buffer is intended to be used by the AFU to report application-specific errors. This data can be collected by system software and combined with adapter error data to use in creating error logs or other problem determination.

Note: Some operating systems have a base page size of 64 KB. To be compatible with a base page size of 64 KB, the AFU_PSA_offset must start on a 64 KB boundary, and the AFU_PSA_length must also be a multiple of 64 KB. For implementation requirements on the alignment and size of the problem state area, refer to the design guides for the target operating system.

Table 4-1 defines the format of the AFU descriptor for a CAIA-compliant device.

Table 4-1. AFU Descriptor
Register Offset | Field Name | Bits | Description
x'0' | num_ints_per_process | 0:15 | The power-on reset value of this field
77. d and stopped by system software (operating system or hypervisor). This section describes the sequence used by system software to start an AFU and is provided for reference only. An application simply calls libcxl with the desired WED, and libcxl performs the system software calls described in the following procedures. The WED is specific to each AFU. It contains all the information an AFU requires to do its work, or it can be a pointer to a memory location where the application has set up a command queue of work to be completed. See Section 6 CAPI Low-Level Management (libcxl) on page 79 for additional information.

Use the following procedure to start an AFU:
1. System software must initialize the state of the PSL. All the required Privileged 1, Privileged 1 Slice, and Privileged 2 Slice registers must be initialized so that the address context for the processes and other contexts (such as the interrupt vector table entries) can be used.
2. System software must set the AFU Slice Reset bit in the AFU_Cntl_An Register (AFU_Cntl_An[RA]). Setting the AFU Slice Reset starts a reset sequence for the corresponding AFU. Initiating a reset sequence also disables the AFU. The AFU does not respond to the problem state MMIO region while disabled.
3. System software must poll the AFU Slice Reset Status
78. d is asserted. The buffer read interface is used for accelerator write requests, and the buffer write interface is used for accelerator read requests. Note: This signal can be on for multiple cycles, indicating that data is being returned on back-to-back cycles.
haX_brtag | 8 | PSL | Accelerator-generated ID for the accelerator write request.
haX_brtagpar | 1 | PSL | Odd parity for haX_brtag, valid with haX_brvalid.
haX_brad | 6 | PSL | Half-line index of read data within the transaction. Cache lines are 128 bytes, so only the LSB is modulated.
aXh_brlat | 4 | Acc | Read buffer latency. This bus is a static indicator of the access latency of the read buffer. It must not change while there are commands that have been submitted on the command interface that have not been acknowledged on the response interface. It is sampled continuously. However, after a reset, the PSL assumes that this is a constant and that it is static for any particular accelerator. 1: Data is ready the second cycle after haX_brvalid is asserted. 3: Data is ready the fourth cycle after haX_brvalid is asserted.
aXh_brdata | 512 | Acc | Read data.
aXh_brpar | 8 | Acc | Odd parity for each 64-bit doubleword of read data. aXh_brpar must be provided on the
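To make the odd-parity convention concrete, the following Python sketch computes one parity bit per 64-bit doubleword of a 512-bit (64-byte) read data transfer, giving the eight bits that aXh_brpar carries. The function names are illustrative, and the ordering of the returned bits (first doubleword first) is an assumption; the excerpt does not state how parity bits map to doublewords.

```python
def odd_parity_bit(value: int) -> int:
    """Return the parity bit that makes the total count of 1 bits odd."""
    if value < 0:
        raise ValueError("value must be non-negative")
    ones = bin(value).count("1")
    return 0 if ones % 2 == 1 else 1


def read_data_parity(data: bytes) -> list[int]:
    """Return eight odd-parity bits, one per 64-bit doubleword.

    Assumes `data` is one 64-byte transfer and that parity bit i
    covers bytes 8*i .. 8*i+7 (an illustrative ordering).
    """
    if len(data) != 64:
        raise ValueError("expected a 64-byte (512-bit) transfer")
    return [
        odd_parity_bit(int.from_bytes(data[i:i + 8], "big"))
        for i in range(0, 64, 8)
    ]
```

A doubleword of all zeros has an even number of 1 bits (none), so its odd-parity bit is 1; the same helper applies to narrower buses such as the 8-bit tag.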
79. d project are configured for an 8 Gb Fibre Channel. Altera Quartus software must be used for the FPGA build. Check the readme file for the required version. AFU source design files must be VHDL or Verilog.

8.4 CAPI Developer Kit Card FPGA Build Flow

This section describes the process for building the AFU into the Altera FPGA.

8.4.1 Structure of Quartus Project Files

All files necessary for compiling and synthesizing the AFU into the CAPI Developer Kit FPGA can be obtained from Nallatech. All I/O connections to the CAPI Developer Kit card, timing parameters, placement constraints, and so on are contained in this directory and will be pulled into the project with the Quartus software.
- Root directory, psl.qpf: Main project file that must be loaded into Quartus. This includes .qip files for the PSL logic and the AFU logic, along with all other infrastructure files and hard IP.
- psl/psl.qip: Library file that pulls in all files needed by the top level, as well as the encrypted post-routed PSL file. All files reside within the psl subdirectory.
- afu0/afu0.qip: This file must contain all of the AFU source files that are to be included in the design. The delivered project contains a sample AFU called memcopy. This is a simple AFU that copies data from one area of memory to another
80. d_pointer, continue with substep 2. If the tail_pointer is equal to initial_head_pointer, continue with the next substep.
6. Release the protection of the cache line containing the head_pointer and tail_pointer values.

After completing the search of all process links, the status field in the sw_command_status is set to x'0001' using a caching-inhibited DMA or special memory-update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be (x'00010001' || first_psl_id || link_of_element_to_terminate). All other PSLs can now stop protecting against starting the process being terminated.

3.4.3.3 PSL Procedure for AFU-Directed Programming Models

Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first nor last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last.

Operations Performed by the First PSL (PSL_ID[L,F] = '01')

When the terminate_element MMIO command is received by the first PSL, the PSL checks to see if the process element being terminated is currently running, performs any operations necessary, and sends the terminate_element command to the next PSL
81. date is completed (MMIO read of PSL_SLBIA returns zero in the least significant bit).
   - System software performs an MMIO write to invalidate the TLBs (PSL_TLBIA = x'3').
   - System software waits until the TLB invalidate is completed (MMIO read of PSL_TLBIA returns zero in the least significant bit).
7. At this point, the memory locations for the process element that was removed can now be reused.

3.4.4.2 PSL Procedure for Time-Sliced Programming Models

Operations Performed by the First PSL (PSL_ID[L,F] = 'x1')

When the remove_element MMIO command is received by the first PSL, the PSL sets the completion status in the software command status word.
1. The PSL sets the complete status in the software command status field to indicate that it is now safe to remove the process element from the linked list.
   - The status field in the sw_command_status is set to x'0002' using a caching-inhibited DMA or special memory-update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be (x'00020002' || first_psl_id || link_of_element_to_remove).

3.4.4.3 PSL Procedure for AFU-Directed Programming Models

Operations Performed by the First PSL (PSL_ID[L,F] = 'x1')

When the remove_element MMIO command is received by
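The write-then-poll pattern used for the invalidate registers can be sketched as follows. This is a minimal Python model, not system software: the `write_reg` and `read_reg` callables stand in for whatever MMIO access mechanism is available, and the polling limit is an arbitrary assumption added so that the loop cannot spin forever.

```python
from typing import Callable


def invalidate_and_wait(
    write_reg: Callable[[str, int], None],
    read_reg: Callable[[str], int],
    register: str,
    value: int,
    max_polls: int = 1000,
) -> int:
    """Write `value` to `register`, then poll until its LSB reads zero.

    Returns the number of reads performed. `write_reg` and `read_reg`
    are hypothetical MMIO accessors supplied by the caller.
    """
    write_reg(register, value)
    for polls in range(1, max_polls + 1):
        if read_reg(register) & 1 == 0:  # least significant bit cleared
            return polls
    raise TimeoutError(f"{register} did not report completion")
```

For the TLB step in this excerpt, the call would pass the register name PSL_TLBIA and the value 0x3; the same helper covers the PSL_SLBIA step with the value that the procedure specifies for it.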
82. dress for physical storage, which includes physical memory, local storage (LS), and memory-mapped I/O registers. The maximum size of the real address space is 28 bytes.
Strict address ordering.
SLB: Segment lookaside buffer. It is used to map an effective address to a virtual address.
SPA: Scheduled processes area.
SSTP: Storage segment table pointer.
Storage model: A CAPI User's Manual compliant accelerator implements a storage model consistent with the Power ISA. For more information about storage models, see the Coherent Accelerator Interface Architecture document.
SUE: Special uncorrectable error.
TAG: PSL command tag.
Tag group: A group of PSL commands. Each PSL command is tagged with an n-bit tag group identifier. An AFU can use this identifier to check or wait on the completion of all queued commands in one or more tag groups.
TG: Tag parameter.
TID: Thread ID.
TLB: Translation lookaside buffer. An on-chip cache that translates virtual addresses (VAs) to real addresses (RAs). A TLB caches page-table entries for the most recently accessed pages, thereby eliminating the necessity to access the page table from memory during load/store operations.
UAMOR: User Authority Mask Override.
VA: Virtual address. An address to the virtual memory space, which is typically much larger than the real addres
83. dress space and naturally aligned to the size of the scheduled processes area.

Note: The structure for the scheduled processes in Figure 3-3 contains an implementation-dependent system software area. This area is used to maintain the linked list pointers for maintaining the list of active process elements and the free list of process elements. How these pointers are maintained is implementation specific and outside the scope of the CAIA.

Table 3-1. Scheduled Processes Area Structure
Mnemonic | Address (Byte) | Description
start_of_linked_list_area | SPA_Base | This is the start of the area in system storage used by system software to store the linked list of process elements scheduled for the acceleration function units (AFUs). The process elements in this area must never be cached by the PSL in a modified state.
end_of_linked_list_area | SPA_Base + (n × 128) - 1, where n = maximum number of process elements supported | This is the end of the area in system storage used by system software to store the linked list of process elements scheduled for the AFUs.
sw_command_status | SPA_Base + ((n + 3) × 128), where n = maximum number of process elements supported | Soft
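The address column of the table lends itself to a short worked calculation. The Python sketch below reads the flattened expressions as SPA_Base + (n × 128) - 1 and SPA_Base + ((n + 3) × 128); the operators were lost in extraction, so that reading is an assumption, as are the function name and the dictionary layout.

```python
PROCESS_ELEMENT_BYTES = 128  # multiplier used in the table's address column


def spa_layout(spa_base: int, max_elements: int) -> dict[str, int]:
    """Return byte addresses of selected areas in the scheduled processes area.

    Assumes the address expressions use + and x as reconstructed above.
    """
    if max_elements < 1:
        raise ValueError("at least one process element is required")
    return {
        "start_of_linked_list_area": spa_base,
        "end_of_linked_list_area": spa_base + max_elements * PROCESS_ELEMENT_BYTES - 1,
        "sw_command_status": spa_base + (max_elements + 3) * PROCESS_ELEMENT_BYTES,
    }
```

With a base of 0x10000 and four process elements, the linked list area would span 0x10000 through 0x101FF under this reading, and the software command status would sit at 0x10380.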
84. - The top-level FPGA file is called psl_fpga.vhdl. It instantiates a component called psl_accel.vhdl. psl_accel.vhdl is a wrapper around the AFU top level.
- Modify psl_accel.vhdl to bind top-level customer AFU signals to the IBM-supplied PSL.
  - Change the component declaration name to match the AFU top-level entity name.
  - Port and map the AFU entity (VHDL) or module (Verilog) names to psl_accel names.
5. Compile and synthesize the design using Quartus software until timing targets are met. The PSL design unit must always remain a post-fit partition with routing and placement preserved. If routing and placement are not preserved, the partition is rerouted, which can cause timing misses.
6. Assemble the design using Quartus software to get your complete .sof build.
7. Use Quartus software to obtain the .rbf for loading to the FPGA.

8.4.3 Load FPGA .rbf File onto the CAPI Developer Kit Card

1. Boot the system being used to test your AFU.
2. Run <IBM CAPI flash download script> to transfer your bitfile to the FPGA on the CAPI Developer Kit card flash memory. The script name and location are included in the README file.
3. Run <IBM CAPI Developer Kit reset script> to reset the CAPI Developer Kit card. This causes the new image that is in flash memory to be loaded into the FPGA
85. e accelerator control interface.

Table 5-10. Accelerator Control Interface
Signal Name | Bits | Source | Description
haX_jval | 1 | PSL | This signal is asserted for a single cycle when a valid job control command is present. The haX_j* signals are valid during this cycle.
haX_jcom | 8 | PSL | Job control command opcode. See Table 5-11 PSL Control Commands on haX_jcom on page 74.
haX_jcompar | 1 | PSL | Odd parity for haX_jcom, valid with haX_jval.
haX_jea | 64 | PSL | This is the WED or timebase information. Note: Timebase is currently not supported.
haX_jeapar | 1 | PSL | Odd parity for haX_jea, valid with haX_jval.
aXh_jrunning | 1 | Acc | Accelerator is running. This signal should transition to a 1 after a start command is recognized. It must be negated when the job is complete, in error, or a reset command is recognized.
aXh_jdone | 1 | Acc | Assert for a single cycle to acknowledge a reset command or when the accelerator is finished. The aXh_jerror signal is valid when aXh_jdone is asserted.
aXh_jcack | 1 | Acc | In dedicated process mode, drive to 0.
aXh_jerror | 64 | Acc | Accelerator error code. A 0 means success. If nonzero, the information is captured in the AFU_ERR_An Register, and PSL_DSISR_An[AE] is set, causing an interrupt.
aXh_jyield | 1 | Acc | Reserved, drive to 0.
aXh_tbr
86. e context swaps are managed by the PSL. For all other programming models, the circular queue section is not used. Figure 3-3 shows the structure that contains the processes scheduled for the AFUs.

[Figure 3-3. Structure for Scheduled Processes: the SPAP points to the scheduled processes area, which holds the list of scheduled process elements 0 to n (each with an address context, a handle, and a software state), the system software area (previous and next links for each process element, sw_command_status), and the PSL queue.]

Table 3-1 defines the various fields and areas within the scheduled processes structure. The starting address of the area (SPA_Base) is defined by the PSL Scheduled Processes Area Pointer Register (PSL_SPAP_An). The size of the area (PSL_SPAP_An[size]) determines the number of process elements supported by the structure and the amount of storage that must be allocated. The storage must be contiguous in the real ad
87. ed. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
   - The status field in the sw_command_status is set to x'0003' using a caching-inhibited DMA or special memory-update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be (x'00030003' || first_psl_id || link_of_element_to_suspend).
2. If the process element is not running, the PSL writes a suspend command to the psl_chained_command doubleword for the next PSL.
   - Write the value (x'00030000' || next_psl_id || link_of_element_to_suspend) to the psl_chained_command.
   - The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.

Operations Performed by the Last PSL (PSL_ID[L,F] = '00')

When the suspend_element command is detected by the next PSL, the PSL checks to see if the process element being suspended is currently running, performs any operations necessary, and sends the suspend_element command to the next PSL or sets the completion status in the software command status word. The suspend_element command is detected by monitoring the psl_chained_command doubleword.
1. If the process element is running, the process is suspended. The PSL sets the complete status in the software command status field to indicate that the process has been
88. ed by the Next PSL (PSL_ID[L,F] = '00')

When the add_element command is detected by the next PSL, the PSL performs any operations necessary and sends the add_element command to the next PSL. The add_element command is detected by monitoring the psl_chained_command doubleword. The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
1. The PSL notifies the AFU of the added process element. The AFU performs any necessary operations to prepare for the new process and then acknowledges the new process element. When the acknowledgment is received, the PSL continues with the next substep.
2. The next PSL writes an add_element command to the psl_chained_command doubleword for the next PSL and watches for the add_element to be complete.
   - Write the value (x'00050000' || next_psl_id || link_of_element_to_add) to the psl_chained_command.

Operations Performed by the Last PSL (PSL_ID[L] = '1')

When the add_element MMIO command is received or the add_element command is detected by the last PSL, the PSL performs any operations necessary and sets the completion status in the software command status word. The add_element command is detected by monitoring the psl_chained_command doubleword. The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
1
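The hand-off between PSLs can be pictured as each PSL except the last writing one chained command doubleword addressed to its successor. The Python sketch below lists those doublewords for a given chain. It assumes a 64-bit doubleword with 16-bit PSL ID and link fields, and it deliberately stops short of the last PSL's completion status, which this excerpt does not spell out; the function name is illustrative.

```python
ADD_ELEMENT = 0x0005  # command code from the x'00050000' prefix in the text


def chained_add_commands(psl_ids: list[int], link: int) -> list[int]:
    """Return the chained command doublewords written along a PSL chain.

    `psl_ids` lists the PSLs from first to last. Each PSL except the
    last writes (x'00050000' || next_psl_id || link) for its successor.
    Field widths of 16 bits are an assumption.
    """
    if not 0 <= link <= 0xFFFF or any(not 0 <= p <= 0xFFFF for p in psl_ids):
        raise ValueError("PSL IDs and link must fit in 16 bits")
    words = []
    for next_psl_id in psl_ids[1:]:
        words.append((ADD_ELEMENT << 48) | (next_psl_id << 16) | link)
    return words
```

A chain with a single PSL (both first and last) produces no chained commands, which matches the case where that PSL sets the completion status directly.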
89. Table 3-1. Scheduled Processes Area Structure
Table 3-2. Process Element Entry Format
Table 4-1. AFU Descriptor
Table 5-1. Accelerator Command Interface
Table 5-2. PSL Command Opcodes Directed at the PSL Cache
Table 5-3. PSL Command Opcodes That Do Not Allocate in the PSL Cache
Table 5-4. PSL Command Opcodes for Management
Table 5-5. aXh_cabt Translation Ordering Behavior
Table 5-6. Accelerator Buffer Interface
Table 5-7. PSL Response Interface
Table 5-8. PSL Response Codes
Table 5-9. Accelerator MMIO Interface
Table 5-10. Accelerator
90. 5.5.2 Accelerator Control Interface for Timebase
6. CAPI Low-Level Management (libcxl)
6.1 Overview
6.2 CAPI Low-Level Management API
6.2.1 Adapter Information and Availability
6.2.2 Accelerated Function Unit Selection
6.2.3 Accelerated Function Unit Management
7. AFU Development and Design
7.1 High-Level Planning
7.2 Development
7.2.1 Design Language
7.2.2 High-Level Design of the AFU
7.2.3 Application Development
7.2.4 AFU Development
91. environment of the user's choice. The host algorithm uses the off-loaded AFU through the library calls to an included library (libcxl). For more information about the AFU development cycle, see Section 7 AFU Development and Design on page 87.

Section 2 Introduction to Coherent Accelerator Interface Architecture on page 21 and Section 3 Programming Models on page 25 provide an overview of the architecture for coherent acceleration in a POWER8 system. These sections are provided as background to the programming models provided by the Coherent Accelerator Interface Architecture (CAIA). The facilities referenced are not fully described in these sections and are generally not required for an application developer.

Section 4 AFU Descriptor Overview on page 59 provides an overview of the AFU descriptor. The AFU descriptor is a set of registers within the problem state area that contains information about the capabilities of the AFU required by system software.

Section 5 PSL Accelerator Interface on page 63 describes the interface facilities provided by the POWER service layer (PSL) for the AFU. The interface facilities provide the AFU with the ability to read and write main storage, maintain coherency with the system caches, and perform synchronization primitives. Collectively, these facilities are called the accelerator unit interface (AUI).

Section 6 CAPI Low-Level Management (libcxl) on page 79 describes the low-level library interface (libcxl) for CA
92. eq | 1 | Acc | Single-cycle pulse to request that the PSL send a timebase control command with the current timebase value.
aXh_paren | 1 | Acc | If asserted, the accelerator supports parity generation on various interface buses. The parity is checked by the PSL.
haX_pclock | 1 | PSL | All accelerator interfaces are synchronous to the rising edge of this 250 MHz clock.

Table 5-11. PSL Control Commands on haX_jcom
Mnemonic | Code | Description
Start | 0x90 | Job execution in all modes. Begin running a new context. haX_jea contains the work element descriptor in dedicated process mode and shared mode.
Reset | 0x80 | Job execution in all modes. Force into a clean state, erasing all of the state from the previous context. This command is sent before a start command.
Timebase | 0x42 | Send requested 64-bit timebase value to the accelerator on the haX_jea bus. Note: Timebase is currently not supported.

5.5.1 Accelerator Control Interface in the Non-Shared Mode

In a non-shared mode, the hypervisor must always reset and enable the AFU through the AFU_CNTL_A Register, as shown in Figure 5-2 PSL Accelerator Control Interface Flow in Non-Shared Mode on page 76. While the accelerator is enabled, the following functions are possible:
- Requests can be submitted to the PSL through the command inte
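The control commands in the table can be modeled in a few lines to show the expected reset-then-start ordering. The following Python sketch is a behavioral toy under stated assumptions: the class and method names are hypothetical, signal timing is ignored, and the return value simply indicates whether the accelerator would pulse aXh_jdone in response (as it does to acknowledge a reset).

```python
from enum import IntEnum


class JobCommand(IntEnum):
    START = 0x90
    RESET = 0x80
    TIMEBASE = 0x42  # listed in the table but noted as not supported


class AcceleratorModel:
    """Toy model of an accelerator's reaction to job control commands."""

    def __init__(self) -> None:
        self.running = False  # models aXh_jrunning
        self.wed = None       # work element descriptor from haX_jea

    def command(self, opcode: int, jea: int = 0) -> bool:
        """Apply one command; return True if aXh_jdone would pulse."""
        cmd = JobCommand(opcode)  # ValueError for unknown opcodes
        if cmd is JobCommand.RESET:
            self.running = False
            self.wed = None       # erase state from the previous context
            return True           # reset is acknowledged with a jdone pulse
        if cmd is JobCommand.START:
            self.wed = jea
            self.running = True
            return False
        return False              # timebase: ignored in this sketch
```

A typical sequence in this model is `command(0x80)` followed by `command(0x90, wed)`, mirroring the rule that a reset is sent before a start.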
93. equired by system software. The AFU descriptor also contains a standard format for reporting errors to system software. All AFUs must implement an AFU descriptor.

4.1 AFU Descriptor Format

The length of the AFU descriptor is implementation specific. All accesses to the AFU descriptor, including the AFU configuration record, must be either 32-bit or 64-bit operations.

The AFU descriptor provides system software with information specific to the AFU. The AFU descriptor also provides a mechanism for assigning regions of the problem state area to system processes attached to the AFU. The assignment is based on the process handle or on the offset of the process element in the linked list. The region assigned to a process handle of x'0' corresponds to the beginning of the AFU per-process problem state area, that is, the area starting at the AFU_PSA_offset within the problem state area. The region assigned to a process handle of n corresponds to the problem state area starting at (AFU_PSA_offset + (n × AFU_PSA_length × 4096)), where 0 ≤ n ≤ (num_of_processes - 1).

Note: In this version, only one process is supported (dedicated mode only).

The AFU descriptor contains AFU configuration records that provide system software with the information that is typically provided by the PCIe configuration space if the AFU was a PCIe device. The format of the AFU configuration record can either be the standard 256-byte configuration space or the extende
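As a worked example of the per-process region calculation, the Python sketch below evaluates the start of the region for a given process handle. It assumes that the flattened expression reads AFU_PSA_offset + (n × AFU_PSA_length × 4096), that AFU_PSA_offset is a byte offset, and that AFU_PSA_length counts 4 KB units; the function name is illustrative.

```python
PAGE_4K = 4096  # multiplier that appears in the region expression


def psa_region_offset(
    afu_psa_offset: int, afu_psa_length: int, handle: int, num_of_processes: int
) -> int:
    """Return the start of the problem state region for a process handle.

    Assumes afu_psa_offset is in bytes and afu_psa_length is in 4 KB units.
    """
    if not 0 <= handle <= num_of_processes - 1:
        raise ValueError("process handle out of range")
    return afu_psa_offset + handle * afu_psa_length * PAGE_4K
```

Handle 0 maps to AFU_PSA_offset itself, which agrees with the statement that the region for a process handle of x'0' begins at the start of the per-process problem state area.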
94. es on behalf of the accelerator. Coherency protocol is tunneled over standard PCI Express links between the CAPP unit on the processor and the POWER service layer (PSL) on the accelerator card.

1.2 POWER Service Layer

The PSL, provided by IBM, is used by the accelerator to interface with the POWER8 system. The PSL interface to the accelerator is described in Section 5 PSL Accelerator Interface on page 63. This interface provides the basis for all communication between the accelerator and the POWER8 system. The PSL provides address translation that is compatible with the Power Architecture for the accelerator and provides a cache for the data being used by the accelerator. This provides many advantages over a standard I/O model, including shared memory, no pinning of data in memory for DMA, lower latency for cached data, and an easier, more natural programming model. Figure 1-2 shows an overview of the FPGA with the PSL, the customer's AFU, the CAPI interface, and other available interfaces.

[Figure 1-2. POWER Service Layer. Diagram labels: I/O (Ethernet, DASD, and so on), FPGA, POWER8 Processor]
95. f the problem state area to each process (PerProcessPSA_control[6] = '1'), each region might be required to be aligned on a 64 KB boundary. See the target operating system details for more information.

Offset  Field  Bits  Description
x'40'  Reserved  0:7  Reserved; set to x'0'.
x'40'  AFU_EB_len  8:63  This field specifies the length of the AFU error buffer in multiples of 4 KB. A length of x'0' indicates that an AFU error buffer does not exist. This is a read-only field.
x'48'  AFU_EB_offset  0:63  This field specifies the 4 KB aligned offset of the AFU error buffer information from the start of the AFU descriptor. This field contains a 64-bit pointer to the start of the AFU error status information. The lower 12 bits of the pointer are always '0'. This is a read-only field.

5 PSL Accelerator Interface

The PSL accelerator interface communicates to the acceleration logic running on the FPGA. Through this interface, the PSL offers services to the accelerator. The services offered are cache-line oriented and allow the accelerator to make buffering versus throughput trade-offs. The interface to the accelerator is composed of five independent interfaces:
• Accelerator C
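As an illustration of the error buffer fields described above, the following C sketch decodes AFU_EB_len and checks AFU_EB_offset. The function names are hypothetical, and the sketch assumes the doubleword at offset x'40' has already been read into a host integer whose most significant bit is bit 0 of the field layout, so that bits 8:63 are the low-order 56 bits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: length of the AFU error buffer in bytes.
 * dw_x40 is the descriptor doubleword at offset x'40'; bits 0:7 are
 * reserved and bits 8:63 hold the length in multiples of 4 KB.
 * A result of 0 means that no error buffer exists.
 */
static uint64_t afu_eb_len_bytes(uint64_t dw_x40)
{
    uint64_t len_4k = dw_x40 & UINT64_C(0x00FFFFFFFFFFFFFF);
    /* Guard against overflow when converting 4 KB units to bytes. */
    assert(len_4k <= (UINT64_MAX >> 12));
    return len_4k << 12;
}

/* Hypothetical helper: the offset must be 4 KB aligned (low 12 bits zero). */
static int afu_eb_offset_is_aligned(uint64_t afu_eb_offset)
{
    return (afu_eb_offset & UINT64_C(0xFFF)) == 0;
}
```

Masking off the reserved byte before scaling keeps a nonzero reserved field from being mistaken for part of the length.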
96. for the AFU Slice Reset Sequence to be complete (AFU_Cntl_An[RS] = '10').

System software must set the WED, if required by the AFU, at start time. The WED is initialized by writing a 64-bit WED value to the PSL_WED_An Register. System software writes the WED that was passed to libcxl by the application.

System software must set the AFU Enable bit in the AFU_Cntl_An Register (AFU_Cntl_An[E]). The state of the AFU Enable Status must be '00' before system software can set the AFU Enable bit to '1' for a start command to be issued to the AFU by the PSL. The WED is passed to the AFU when the start command is issued.

System software must poll the AFU Enable Status for the AFU Slice Enabled (AFU_Cntl_An[ES] = '10'). The AFU_Cntl_An[ES] field is set to '10' when the PSL and AFU are initialized, running, and able to accept MMIO.

After the AFU is running, system memory accesses can be performed by the AFU, and problem state MMIOs can be performed by software.

Note: If problem state registers are required to be initialized in the AFU before the application starts, the AFU must provide a mechanism for starting the accelerator and must not depend on the start command issued by the PSL.

Use the following procedure to stop an AFU:
1. System software must set the AFU Slice Reset bit in the AFU_Cntl_An Register (AFU_Cntl_An[RA]). Setting the AFU Slice Reset starts a reset sequence for the corresponding AFU. Initiating
97. from accessing the cache line until substep c or substep b of step 3.
   b. Compare the head and tail pointers of the PSL queue.
      • If the head_pointer does not equal the tail_pointer, a process is waiting to be started or resumed. Continue with step 3.
      • If the head_pointer equals the tail_pointer, no processes are waiting to be restarted. Continue with substep c.
   c. Release the protection of the cache line containing the head_pointer and tail_pointer values. Continue with step 2.
2. No processes to run. Wait until a process is added to the PSL queue.
   a. Wait for the head and tail pointers to be updated.
      • The PSL can either poll the cache line that contains the head_pointer and tail_pointer information for a change in state or detect when the cache line is modified by another device using the coherency protocol.
   b. Continue with step 1.
3. Start the next process in the PSL queue.
   a. Remove the process to start/resume from the PSL queue.
      • Set the process element handle (PSL_PEHandle_An[PE_Handle]) for the process element to resume or start and the process_state with the data contained at the tail_pointer.
      • Add 8 to the tail_pointer (tail_pointer equals tail_pointer + 8).
      • If the tail_pointer is greater than end_of_PSL_queue_area, tail_pointer equals start_of_PSL_queue_area.
   b. Release the protection of the cache line containing the head_pointer and tail_pointer values.
   c. Read the process element state from the linked list and star
98. fu_h afu _uint32_t data

This routine removes the indicated AFU's register space from the calling process's virtual address space.

6.2.3.12 cxl_mmio_read

   #include <libcxl.h>
   _uint64_t cxl_afu_mmio_read64(struct cxl_afu_h *afu, _uint64_t offset);
   _uint32_t cxl_afu_mmio_read32(struct cxl_afu_h *afu, _uint64_t offset);

These two routines read from a 32-bit or 64-bit register in the indicated AFU. The offset parameter indicates the register to be read. Data is returned to the local address space. The data parameter is a pointer to the location in memory at which the data from the AFU should be placed.

6.2.3.13 cxl_mmio_write

   #include <libcxl.h>
   void cxl_afu_mmio_write64(struct cxl_afu_h *afu, _uint64_t offset, uint64_t data);
   void cxl_afu_mmio_write32(struct cxl_afu_h *afu, _uint64_t offset, uint32_t data);

These two routines write to a 32-bit or 64-bit register in the indicated AFU's register space. The data (32 or 64 bits long) indicated by the data parameter is written to the register in the AFU indicated by the offset parameter.

6.2.3.14 Additional Routines

The following accelerated function unit management routines are also available:
• cxl_read_event
• cxl_read_expected_event
• fprint_cxl_event
• fprint_cxl_unknown_event
99. g models.

2. The PSL writes a resume_element command to the psl_chained_command doubleword for the next PSL.
   • Write the value x'00040000' || next_psl_id || link_of_element_to_resume to the psl_chained_command.

Operations Performed by the Next PSL (PSL_ID[L,F] = '00')

When the resume_element command is detected by the next PSL, perform any operations necessary and send the add_element command to the next PSL. The resume_element command is detected by monitoring the psl_chained_command doubleword. The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.

1. When operating in an AFU-directed programming model, the PSL notifies the AFU of the process element being resumed. The AFU performs any necessary operations to resume execution of the process and then acknowledges the resumed process element. When the acknowledgment is received, the PSL continues with the next substep. The AFU is not notified of the added process element for all other programming models.
2. The PSL writes a resume_element command to the psl_chained_command doubleword for the next PSL.
   • Write the value x'00040000' || next_psl_id || link_of_element_to_resume to the psl_chained_command
100. ge with the host processors. This storage area can either be uniform in structure or can be part of a hierarchical cache structure. Programs reference this level of storage by using an effective address.

2.2.1 Main Storage Attributes

The main storage of a system typically includes both general-purpose and nonvolatile storage. It also includes special-purpose hardware registers or arrays used for functions such as system configuration, data-transfer synchronization, memory-mapped I/O, and I/O subsystems.

Table 2-1 lists the sizes of address spaces in main storage.

Table 2-1. Sizes of Main Storage Address Spaces
Address Space  Size  Description
Real Address Space  2^m bytes, where m ≤ 60
Effective Address Space  2^64 bytes  An effective address is translated to a virtual address using the segment lookaside buffer (SLB).
Virtual Address Space  2^n bytes, where 65 ≤ n ≤ 78  A virtual address is translated to a real address using the page table.
Real Page (Base)  2^12 bytes
Virtual Page  2^p bytes, where 12 ≤ p ≤ 28  Up to eight page sizes can be supported simultaneously. A small 4 KB (p = 12) page is always supported. The number of large pages and their sizes are implementation dependent.
Segment Size  2^s bytes, where s = 28 or 40  The number of virtual segments i
101. grams and real addresses (RAs) used by physical memory. The MMU also provides protection mechanisms and other functions.
The highest-order bit in an address, registers, data element, or instruction encoding.
See most recently used.
Most significant bit.
A region in memory. The Power ISA defines a page as a 4 KB area of memory aligned on a 4 KB boundary or a large-page size, which is implementation dependent.
A table that maps virtual addresses (VAs) to real addresses (RAs) and contains related protection parameters and other information about memory locations.
Peripheral Component Interconnect Express.
Phase-locked loop.
Of or relating to the Power ISA or the microprocessors that implement this architecture.
A computer architecture that is based on the third generation of reduced instruction set computer (RISC) processors. The Power ISA was developed by IBM.
Also known as supervisor mode. The permission level of operating system instructions. The instructions are described in PowerPC Architecture Book III and are required of software that accesses system-critical resources.
Software that has access to the privileged modes of the architecture.
The permission level of user instructions. The instructions are described in Power ISA Books I and II and are required of software that implements application programs.
POWER service layer.
Page table entry. See page table.
Real address.
Random access memory.
Real address. An ad
102. head_pointer.
      • Add 8 to the head_pointer (head_pointer equals head_pointer + 8).
      • If the head_pointer is greater than end_of_PSL_queue_area, head_pointer equals start_of_PSL_queue_area.
   c. Release the protection of the cache line that contains the head_pointer and tail_pointer values.

3.4.2.2 PSL Procedure for AFU-Directed Programming Models

The procedure for starting and resuming a process element in the AFU-directed programming models is implementation specific. In these models, system software adds a process element to the linked list and provides the application with a context handle. The lower 16 bits of the process handle are a pointer to the process element that contains the corresponding process state for the application. The AFU provides the lower 16 bits of the process handle (context ID) for each transaction associated with the process handle. The PSL uses the context ID to find the corresponding process element. In the AFU-directed programming models, the PSL does not manage any queue of processes waiting to be resumed.

3.4.3 Terminating a Process Element

Under certain circumstances, system software might have to terminate a process element currently scheduled for the AFUs. Because a scheduled process element might have already been started or is currently being executed by a PSL, system software must use the following sequence to safely terminate a process element in the linked list of scheduled processes.

3.4.3.1 Softw
103. in machine state in response to an exception. See exception.
Used to signal an interrupt, typically to a processor or to another interruptible device.
Instruction set architecture.
Job effective address.

Terms: Least significant bit, Little endian, LISN, Logical partitioning, LPAR, LPC, LPID, LSb, LSB, Main storage, Mask, MB, Memory coherency, Memory mapped, MMIO, PID, PSL MMIO

Kilobyte.
A local storage (LS) address of a PSL list. It is used as a parameter in a PSL command.
The bit of least value in an address, register, data element, or instruction encoding.
A byte-ordering method in memory where the address n of a word corresponds to the least significant byte. In an addressed memory word, the bytes are ordered (left to right) 3, 2, 1, 0, with 3 being the most significant byte. See big endian.
Logical interrupt service number.
A function of an operating system that enables the creation of logical partitions.
Logical partitioning.
Lowest point of coherency.
Logical partition identity.
Least significant bit.
Least significant byte.
The effective address space. It consists physically of real memory (whatever is external to the memory interface controller), Local Storage, memory-mapped registers and arrays, memory-mapped I/O devices, and pages of
104. is required to make an operating system system call with at least the following information:
• An AFU type (AFU_Type). The AFU type describes the targeted acceleration function for the system call. The AFU_Type is a system-specific value.
• A work element descriptor (WED). This document does not define the contents of the WED. The WED is AFU implementation specific and can be in the form of an AFU command, an effective address pointer to a user-defined structure, an effective address pointer to a queue of commands, or any other data structure to describe the work to be done by the AFU.
• An Authority Mask Register (AMR) value. The AMR value is the AMR state to use for the current process. The value passed to the operating system is similar to an application setting the AMR in the processor by using spr 13 or by calling a system library. If the PSL and AFU implementations do not support a User Authority Mask Override Register (UAMOR), the operating system should apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call (hcall). The UAMOR is not described in this document. For more information about the UAMOR, see the Power ISA Book III. The hypervisor can optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into the process element. The PSL applies the PSL_AMOR_An when updating the PSL_AMR_An Register from the process element.
105. ist users of CAPI implementations in designing applications for hardware acceleration. Maintaining compatibility with the interfaces described in this document allows applications to migrate from one implementation to another with minor changes. For a specific implementation of the CAPI, see the documentation for that accelerator.

Who Should Read This Manual

This manual is intended for system software and hardware developers and application programmers who want to develop products that use CAPI. It is assumed that the reader understands operating systems, microprocessor system design, basic principles of reduced instruction set computer (RISC) processing, and details of the Power ISA.

Document Organization

This CAPI User's Manual contains two types of information. First, it provides a general overview of CAPI, accelerator interfaces, and application library calls to use the accelerator. Second, it provides implementation-specific information about building an accelerator for the supported card, along with the architecture limitations of this implementation.

Document Division  Description
About this Document  Describes this document, related documents, the intended audience, and other general information.
Revision Log  Lists all significant changes made to the document since its initial release.
Introduction to Coherent Accelerator Interface  Provides a high-level overview of the Coherent Accelerator Inte
106. izes the PSL for the owning partition, and the operating system initializes the PSL for the owning process at the time when the AFU is assigned. The following information is initialized.

Note: The PSL architecture allows multiple AFUs (available in future implementations). These registers are duplicated for each AFU. Each of these duplicated registers is called a slice.

Registers initialized by the hypervisor:
• PSL Slice Control Register (PSL_SCNTL_An)
• Real Address (RA) Scheduled Processes Area Pointer (PSL_SPAP_An) (disable)
• PSL Authority Mask Override Register (PSL_AMOR_An)
• Interrupt Vector Table Entry Offset (PSL_IVTE_Offset_An)
• Interrupt Vector Table Entry Limit (PSL_IVTE_Limit_An)
• PSL State Register (PSL_SR_An)
• PSL Logical Partition ID (PSL_LPID_An)
• Real Address (RA) Hypervisor Accelerator Utilization Record Pointer (HAURP_An) (disable)
• PSL Storage Description Register (PSL_SDR_An)

Registers initialized by the operating system:
• PSL Process and Thread Identification (PSL_PID_TID_An)
• Effective Address (EA) Context Save/Restore Pointer (CSRP_An) (disable)
• Virtual Address (VA) Accelerator Utilization Record Pointer (AURP0_An and AURP1_An) (disable)
• Virtual Address (VA) Storage Segment Table Pointer (SSTP0_An and SSTP1_An)
• PSL Authority Mask (PSL_AMR_An)
• PSL Work Element Descriptor (PSL_WED_An)

3.1.1 Starting and Stopping an AFU in the Dedicated Process Model

In a dedicated-process programming model, an AFU is starte
107. language will be used to develop the AFU. This language must be a type that is compatible with the FPGA toolset and compilable for a simulation environment.

7.2.2 High-Level Design of the AFU

Consider AFU partitions, interfaces to the PSL, command and dataflow logic. Estimate latch and RAM cell requirements for size estimates of the logic. Begin early floorplan work with the FPGA footprint. Import the post-routed PSL into the project, along with the example memcpy AFU, to get an initial look at what the PSL will occupy in the floorplan. Consider how debug of the logic will be performed. One possible debug aid is to route information to FPGA memories as trace arrays that can be read after a fail to determine the command sequence that caused a fail. One might also capture failure information into registers so that more information can be accessed after an error occurs. A high-level performance target should also be established before implementation begins.

Examples of some FPGA considerations during the high-level design that can impact both performance and floorplanning are:
• RAM cells must be used whenever possible because they are much more area efficient than latches.
• Wiring delays are large and consume FPGA area. Avoid routing a large number of wires to many destinations whenever possible.
• The PSL supplies a 250 MHz clock for the AFU implementation, and no DLL or PLL is required unless there is a unique clocking requirement.
• Conside
108. lid must be valid for only one cycle per command, and the other command descriptor signals must also be valid during that cycle. Each command is assigned a tag by the accelerator. This tag is used by the PSL during subsequent phases of the transaction to identify the command. Table 5-1 lists the commands that can be sent to the PSL by the application.

Table 5-1. Accelerator Command Interface
Signal Name  Bits  Source  Description
aXh_cvalid  1  Acc  A valid command is present on the interface. This signal is asserted for a single cycle for each command that is to be accepted. Design recommendation: make this a latched interface to the PSL. Note: This signal can be driven for multiple cycles. That is, different commands can be driven back to back as long as there is an adequate number of credits outstanding.
aXh_ctag  8  Acc  Accelerator-generated ID for the request. This is used as an array address on the Accelerator Buffer interface and for status notification.
aXh_ctagpar  1  Acc  Odd parity for aXh_ctag axh_aparen 1
aXh_com  13  Acc  Indicates which command the PSL will execute. Opcodes are defined in Table 5-2 PSL Command Opcodes Directed at
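The odd parity bit described for the 8-bit aXh_ctag can be illustrated with a short software model in C. The function name is hypothetical, and this is only a sketch of the general odd-parity rule (the parity bit is chosen so that the tag plus the parity bit contain an odd number of ones), not a description of the hardware implementation.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: returns the odd-parity bit for an 8-bit tag.
 * The result is 1 when the tag holds an even number of 1 bits, so that
 * the nine bits taken together always hold an odd number of 1 bits.
 */
static unsigned odd_parity8(uint8_t tag)
{
    unsigned ones = 0;
    for (unsigned v = tag; v != 0; v >>= 1)
        ones += v & 1u;
    return (ones & 1u) ^ 1u;
}
```

A testbench could use such a model to check the parity bit that the AFU logic drives alongside each tag.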
109. 7 AFU Development and Design

The previous sections describe the overall CAIA architecture, the PSL interfaces, and application library calls. This section describes some general information about developing an accelerator functional unit (AFU).

7.1 High-Level Planning

Before starting development work on an AFU, the user should become familiar with CAPI, which includes reading this document and any other education material, as well as determining the hardware implementation that the AFU will reside on. Implementation-dependent information for this version of the document begins in Section 8 CAPI Developer Kit Card on page 93. The user must ensure that they have access to any needed FPGA development tools and hardware that is specific to each implementation. A rough sizing must also be done at this point to ensure that the AFU will fit on the available space within the FPGA for a given implementation. Planning for the needed hardware systems must also begin in this early stage.

7.2 Development

The following sections describe some topics and examples that should be considered during the development stage.

7.2.1 Design Language

Determine what design
110. nce order. If translation for the command results in a protection violation or the table-walk process fails the command, an interrupt is sent. If the interrupt response is CONTINUE, the command receives a PAGED response, and all subsequent commands that hit this page receive a FLUSHED response until a command restart for an address in the same effective page is received. Commands outside of this effective page are not affected.
• If the translation interrupt response is Address Error, the command receives the AERROR response, and all subsequent commands that hit this page get FLUSHED responses until a restart command is received. Commands outside of this effective page are not affected.
• If the translation detects an internal error or Data Error, the command receives the DERROR response, and all subsequent commands that hit this page get FLUSHED responses until a restart command is received. Commands outside of this effective page are not affected.

PSL Implementation Note: When a protection violation occurs and before the translation interrupt response is received, subsequent commands that hit the same 16 MB page are held in a queue and marked as a protection violation. Once the translation response is received, the queued commands are processed and provide a PSL response according to their individual CABT mode. Requests that are received after the translation response is received with CABT Abort, Pref, or Spec are processed immediately and provide a
111. ne with a nonzero ah_jerror. The error is logged by the operating system and provided to the applications as an event.

7.3.8.2 Application Errors

Normal signal faults result in the process on the AFU being terminated. If an application determines that the AFU is not responding, the application should either request an AFU reset, terminate the AFU processes, or reload the AFU. Reset is the least invasive. The others require the application to detach and re-attach to the AFU.

7.3.8.3 Errors Reported by the System (Including the PSL)

The application must monitor the error event. On these errors, the operating system also logs the error, but the AFU will, at the minimum, be reset. The application must reload the AFU to continue. The severity of the error should determine what happened to the adapter:
• Only the AFU was reset. The application might be able to recover.
• The card was reset, meaning the PSL was reloaded.

8 CAPI Developer Kit Card

The previous sections describe the overall CAIA architecture, PSL interface, library calls, and general AFU development steps. This section is intended to aid application developers for a specific card
112. ng aXh_tbreq on the accelerator control interface for one cycle. Only one request can be issued at a time. The PSL returns the timebase information by asserting haX_jval = 1, haX_jcom = timebase, and haX_jea = timebase value (0:63).

6 CAPI Low-Level Management (libcxl)

6.1 Overview

Note: The CAPI Developer Kit release does not support partial reconfiguration.

The CAPI accelerator management library (libcxl) is a low-level management library that consists of the following categories of POWER functions:
• Adapter Information and Availability on page 80
• Accelerated Function Unit Selection on page 81
• Accelerated Function Unit Management on page 82

Libcxl introduces the following terms:
Accelerator: The physical accelerator available to the system. A task is sent to a hardware thread within the physical accelerator.
Hardware Thread: The physical accelerator function unit within a physical accelerator.
Accelerator Instance: An accelerator with a defined function. The function can either be defined by the system or by the application, depending on whether the accelerator is shared or dedicated.
Dedicated Accelerators: An accelerator that is dedic
113. ns the process state for the corresponding application. The work element descriptor (WED) contained in the process element can be a single job requested by an application or can contain a pointer to a queue of jobs. In the latter case, the WED is a pointer to the job request queue in the application's address space. This document does not cover all aspects of the programming models. The intent of this section is to provide a reference for how the AFUs can be shared by all or a subset of the processes in the system. This section defines the infrastructure for setting up the process state and sending a work element descriptor (WED) to an AFU to start a job in a virtualized environment. The function performed by an AFU is implementation dependent.

3.1 Dedicated Process Programming Model

The dedicated-process programming model is implementation specific. Figure 3-1 Accelerator Invocation Process in the Dedicated Process Model on page 28 shows how an application invokes an accelerator under the dedicated-process programming model. In this model, a single process owns the AFU. Because the AFU is dedicated to a single process, the programming model is not defined in this document. For more information, see the documentation for the specific implementation. Because the AFU is owned by a single process, the hypervisor initial
114. nslation for the command results in a DERROR, only this command is terminated with a FAULT response. No FLUSHED response is generated.

5.1.1.2 Strict Address Ordering Pages

Accelerator designs might need to delay accesses until prior accesses are completed if they need to interoperate with POWER applications with pages in strict address ordering (SAO) mode. PSL operation ordering is affected by accesses to pages with WIMG SAO.

5.1.1.3 Execution Ordering

After commands have proceeded past address translation, the PSL orders only on a cache-line address basis. Commands to an address are performed after earlier commands to that address and before later commands to that address. Order between commands involving different addresses is unpredictable.

5.1.2 Reservation

The operations read_cl_res and write_c manipulate the reservation. There is one reservation for the accelerator. This reservation can be active on an address or inactive. Read_cl_res reads an address and acquires the reservation, after which the reservation is active on the address of the read. While the reservation is active, the PSL snoops for writes performed to the address. Reservations cannot be held indefinitely. The PSL will automatically clear the reservation on lines after a certain amount of time to allow the system to
115. ntained herein is believed to be accurate, such information is preliminary and should not be relied upon for accuracy or completeness, and no representations or warranties of accuracy or completeness are made.

Note: This document contains information on products in the design, sampling and/or initial production phases of development. This information is subject to change without notice. Verify with your IBM field applications engineer that you have the latest version of this document before finalizing a design.

You may use this documentation solely for developing technology products compatible with Power Architecture. You may not modify or distribute this documentation. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN "AS IS" BASIS. In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Systems and Technology Group
2070 Route 52, Bldg. 330
Hopewell Junction, NY 12533-6351

The IBM home page can be found at ibm.com

Version 1.2, 29 January 2015

Contents
List of Tables
List of Figures
116. ocessor Interface (CAPI) for POWER8 Systems White Paper
Coherent Accelerator Processor Interface (CAPI) for POWER8 Systems Decision Guide and Development Process
Data Engine for NoSQL - IBM Power Systems Edition White Paper
POWER8 Functional Simulator User's Guide

Conventions Used in This Document

This section explains numbers, bit fields, instructions, and signals that are in this document.

Representation of Numbers

Numbers are generally shown in decimal format, unless designated as follows:
• Hexadecimal values are preceded by an "x" and enclosed in single quotation marks. For example: x'0A00'.
• Binary values in sentences are shown in single quotation marks. For example: '1010'.

Note: A bit value that is immaterial, which is called a "don't care" bit, is represented by an "X".

Bit Significance

In the documentation, the smallest bit number represents the most significant bit of a field, and the largest bit number represents the least significant bit of a field.

Other Conventions

This document uses the following software documentation conventions:
• Command names or instruction mnemonics are written in bold type. For example: afu_wr and afu_rd.
• Variables are written in italic type. Required parameters are enclosed in angle brackets. Optional parameters are enclosed in brackets. For example: afu<f,b>_wr[a].

This document uses the following symbols:
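Because bit 0 is the most significant bit under the convention above, field positions in register descriptions do not map directly onto the shift counts most C code uses. The following sketch shows one way to extract a field given in this numbering from a 64-bit value. The function name is hypothetical and is not part of libcxl or any other library.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: extract bits first:last of a 64-bit value, where
 * bit 0 is the most significant bit and bit 63 the least significant.
 */
static uint64_t msb0_field(uint64_t value, unsigned first, unsigned last)
{
    assert(first <= last && last <= 63);
    unsigned width = last - first + 1;
    /* Move the field's last (least significant) bit down to position 0. */
    uint64_t shifted = value >> (63 - last);
    if (width == 64)
        return shifted;
    return shifted & ((UINT64_C(1) << width) - 1);
}
```

For example, bits 52:55 of x'0A00' evaluate to x'A', and bit 0 of a value is its topmost bit.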
117. ommand Interface is the interface through which the accelerator sends service requests to the PSL. • Accelerator Buffer Interface is the interface through which the PSL moves data to and from the accelerator. • PSL Response Interface is the interface through which the PSL reports status about service requests. • Accelerator MMIO Interface is the interface through which software MMIO reads and writes can access registers within the accelerator. • Accelerator Control Interface allows the PSL job management functions to control the state of the accelerator. Together, these interfaces allow software to control the accelerator state and allow the accelerator to access data in the system. 5.1 Accelerator Command Interface. Note: There are references to PSL internal register mnemonics within this section. These registers are mentioned to provide additional content clarity. These registers are set by system software during initialization or library calls to the AFU. However, the format of these registers is not information required by an AFU designer. The accelerator command interface provides the accelerator logic with the ability to send commands to the PSL. The interface is credit based: the bus haX_croom informs the accelerator of the number of commands the PSL can accept from the accelerator. The number of commands allocated to the accelerator might change based on job management policies. The interface is a synchronous interface aXh_va
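The credit scheme for the command interface can be pictured as a simple counter on the accelerator side. The following is a minimal sketch in C, not PSL or AFU source: the structure and function names are hypothetical, and it assumes only what the manual states, namely that haX_croom (an 8-bit value) gives the number of commands the PSL accepts and that command completions hand credits back.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for command credits; not part of any PSL API. */
struct cmd_credits {
    uint8_t available; /* commands the accelerator may still issue */
};

/* Capture the haX_croom value when the accelerator is enabled. */
void credits_init(struct cmd_credits *c, uint8_t croom)
{
    c->available = croom;
}

/* Try to issue one command; returns false when no credit is left. */
bool credits_take(struct cmd_credits *c)
{
    if (c->available == 0)
        return false;
    c->available--;
    return true;
}

/* A response from the PSL hands n credits back to the accelerator. */
void credits_return(struct cmd_credits *c, uint8_t n)
{
    c->available = (uint8_t)(c->available + n);
}
```

In real AFU logic this would be a hardware counter; the C form only illustrates that a command is held back whenever the counter reaches zero.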
118. on of functions to improve the performance of the application and off-load the host processor. Using an AFU for application acceleration allows for cost-effective processing over a wide range of applications. When an application requests use of an AFU, a process element is added to the process element linked list that describes the application's process state. The process element also contains a work element descriptor (WED) provided by the application. The WED can contain the full description of the job to be performed or a pointer to other main memory structures in the application's memory space. Several programming models are described, providing for an AFU to be used by any application or for an AFU to be dedicated to a single application. See Section 3 Programming Models on page 25 for details. 2.2 Main Storage Addressing. The addressing of main storage in the CAIA is compatible with the addressing defined in the Power ISA. The CAIA builds upon the concepts of the Power ISA and extends the addressing of main storage to the AFU. The AFU uses an effective address to access main storage. The effective address is computed by the AFU and is provided to the PSL. The effective address is translated to a real address according to the procedures described in the overview of address translation in Power ISA, Book III. The real address is the location in main storage that is referenced by the translated effective address. All the AFUs share main stora
119. perations necessary and sends the add_element command to the next PSL. The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
1. Performs a read of the cache line containing the head_pointer and tail_pointer such that the cache line is owned by the PSL.
   • The PSL must prevent any other PSL from accessing the cache line until substep 3.
2. Writes the link to the process element and its status to the PSL queue of processes waiting to be restarted.
   • Writes the added process element link to the memory location pointed to by the head_pointer.
   • Adds 8 to the head_pointer; head_pointer equals head_pointer + 8.
   • If the head_pointer is greater than end_of_PSL_queue_area, head_pointer equals start_of_PSL_queue_area.
3. Releases the protection of the cache line containing the head_pointer and tail_pointer values.
4. The PSL sets the complete status in the software command status field to indicate the process has been successfully added.
   • The status field in the sw_command_status is set to x'0005' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status should be x'00050005' || first_psl_id || link_of_element_to_add.
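The pointer update in substep 2 (add 8, then wrap to the start of the queue area) can be sketched as follows. This is an illustrative C model rather than PSL logic: the function name is hypothetical, and it assumes that end_of_PSL_queue_area is the address of the last valid 8-byte entry, which the text does not state explicitly.

```c
#include <stdint.h>

#define PSL_QUEUE_ENTRY_BYTES 8u /* each queue entry is one doubleword */

/* Advance a head or tail pointer by one entry and wrap to the start of
 * the queue area once it passes the end, as described in substep 2.
 * Assumption: 'end' is the address of the last valid entry. */
uint64_t advance_queue_pointer(uint64_t ptr, uint64_t start, uint64_t end)
{
    ptr += PSL_QUEUE_ENTRY_BYTES;
    if (ptr > end)
        ptr = start;
    return ptr;
}
```

The same helper applies to the tail_pointer when entries are removed from the queue.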
120. r the number of logic levels between latch stages and pipeline the design. Be aware of this with performance targets. • Use FPGA floorplanning for AFU internal logic blocks to help the FPGA tools place the logic optimally and have repeatable synthesis and timing results. 7.2.2.1 Floorplan Considerations. The PSL uses just under 25% of the Stratix V FPGA's ALUTs, arrays, and DSPs. For estimation purposes, plan on your algorithm fitting in 70% of the overall ALUTs, arrays, and DSPs. The maximum remaining resources after placing the PSL are shown in Table 7-1.
Table 7-1. FPGA Resources Available for AFU
   Item     Total Available for AFU
   ALUTs    341548
   M20K     1874
   DSP      188
7.2.3 Application Development. Develop host application code that makes use of libcxl to call the AFU. See Section 6 CAPI Low-Level Management (libcxl) on page 79 for additional information. 7.2.4 AFU Development. Code the AFU in the chosen design language. Perform unit simulations to verify that the internal accelerator function is operating correctly. Some simulation with a basic PSL interface driver (customer developed) can be done in this unit simulation stage. Synthesize with the FPGA tools in the implementation environment to ensure that timing is met. Make floorplan and logic updates as needed for timing closu
121. Operations Performed by the Last PSL (PSL_ID[L,F] = '00'). When the suspend_element command is detected by the next PSL, the PSL checks to see if the process element being suspended is currently running, performs any operations necessary, and sends the suspend_element command to the next PSL or sets the completion status in the software command status word. The suspend_element command is detected by monitoring the psl_chained_command doubleword.
1. The PSL notifies the AFU of the suspended process element. The AFU performs any necessary operations to suspend the process and then acknowledges the suspension of the process element. When the acknowledgment is received, the PSL continues with the next substep.
2. If the process element is running, the process is suspended. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
3. The PSL writes a suspend command to the psl_chained_command doubleword for the next PSL.
   • Write the value x'00030000' || next_psl_id || link_of_element_to_suspend to the psl_chained_command.
   • The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
Operations Performed by the Last PSL
122. rator in the AFU descriptor (num_interrupts = -1). The routine can also return the process element number if the process element pointer is nonzero. This routine resets the AFU, transmits the WED to the AFU, and enables the AFU.
6.2.3.7 cxl_afu_fd. Note: Not applicable for the CAPI Developer Kit.
   #include <libcxl.h>
   int cxl_afu_fd(struct cxl_afu_h *afu);
The cxl_afu_fd routine returns the file descriptor of the AFU that is contained in the AFU handle.
6.2.3.8 cxl_afu_open_and_attach. Note: Not applicable for the CAPI Developer Kit.
   #include <libcxl.h>
   int cxl_afu_open_and_attach(struct cxl_afu_h *afu, mode);
6.2.3.9 cxl_afu_sysfs_pci. Note: Not applicable for the CAPI Developer Kit.
   #include <libcxl.h>
   int cxl_afu_sysfs_pci(char **pathp, struct cxl_afu_h *afu);
6.2.3.10 cxl_mmio_map
   #include <libcxl.h>
   int cxl_mmio_map(struct cxl_afu_h *afu, u32 flags);
This routine returns the base virtual address of the register space of the AFU indicated by the AFU parameter and adds that space to the calling process's virtual address space. The flags parameter indicates the key characteristics of the register space. One such characteristic is endianness.
6.2.3.11 cxl_mmio_unmap
   #include <libcxl.h>
   int cxl_mmio_unmap struct cxl_a
123. re and bug fixes. 7.2.5 Develop Lab Test Plan for the AFU. Develop a test plan to determine what testing is needed to validate the hardware and application function after hardware testing begins. These tests must also be run in the system simulation environment. 7.2.6 System Simulation of Application and AFU. The POWER8 Functional Simulator provides a system simulation of an entire POWER8 CAPI system by providing the complete system behavior at the AFU-PSL interface. This simulator can be run on a customer platform before actual lab testing with a POWER8 system begins. This simulation ensures compatibility with the PSL interface and verifies the interaction between the application system running on the host and the AFU. See the Power8 Functional Simulator Demo in the HDK SDK and the POWER8 Functional Simulator User's Guide for additional information. 7.2.7 Test. After the application and the AFU have been developed and simulated, the test phase on actual hardware begins. The first step is to prepare the AFU FPGA image so that it can be downloaded to the FPGA. The method used is implementation dependent. After the image is downloaded, the application can be started and the lab test plan can be executed. 7.3 Best Practices for AFU Design. 7.3.1 FPGA Considerations. See Section 7 2
124. rface • MMIO requests can be passed from the PSL to the accelerator and must be acknowledged. • Timebase values can be passed to the accelerator. When a PSL slice is initialized for dedicated process mode, the PSL fetches the process element from system memory if the address specified in PSL_SPAP_An is valid when the AFU_CNTL_A Enable is set to 1. If the PSL_SPAP_An address is not valid, the PSL assumes that the process element registers have been initialized by software already, so the start command is immediately sent to the AFU. The 64-bit haX_jea indicates the value of the work element descriptor.
[Figure 5-2. PSL Accelerator Control Interface Flow in Non-Shared Mode. Diagram labels: Accel Reset; Reconfiguration Possible; Send Start Command (JEA = Work Element Descriptor); AFU Reset; JRunning = 1; JRunning = 0; Enabled; JDone = 1; Resetting; JError = Error code; Send Reset Command; Clocks Started; Idle.]
5.5.2 Accelerator Control Interface for Timebase. Note: Timebase is currently not supported. The accelerator requests the latest timebase information by asserti
125. rface Architecture (CAIA) and the system software programming models. PSL Accelerator Interface: Describes the interface between the POWER service layer (PSL) and the accelerator function unit (AFU). CAPI Low-Level Management (libcxl): Provides an overview description of the low-level accelerator management and some programming examples. AFU Development and Design: General information about developing an accelerator functional unit (AFU) and some best practices to consider when designing an AFU. CAPI Developer Kit Card: Describes CAIA implementation details for the CAPI Developer Kit card and the FPGA build flow for the CAPI Developer Kit card. Glossary: Defines terms and acronyms used in this document. Related Publications. The following documents can be helpful when reading this specification. Contact your IBM representative to obtain any documents that are not available through OpenPOWER Connect or Power.org. • Power ISA User Instruction Set Architecture, Book I (Version 2.07) • Power ISA Virtual Environment Architecture, Book II (Version 2.07) • Power ISA Operating Environment Architecture (Server Environment), Book III-S (Version 2.07) • I/O Design Architecture, v2 (IODA2) (Version 2.4) • Coherent Accelerator Processor Interface (CAPI) Education Package • Coherent Accelerator Pr
126. rts the dedicated programming model. The shared programming models allow for all or a subset of processes from all or a subset of partitions in the system to use an AFU. There are two programming models where the AFU is shared by multiple processes and partitions: PSL time-sliced shared and AFU-directed shared. Figure 3-2 on page 32 shows how an application invokes an AFU under the shared programming model. In this model, the system hypervisor owns the AFU and makes the function available to all operating systems. For an AFU to support virtualization by the system hypervisor, the AFU must adhere to the following requirements: • An application's job request must be autonomous (that is, the state does not need to be maintained between jobs), OR the AFU must provide a context save and restore mechanism. • An application's job request must be guaranteed by the AFU to complete in a specified amount of time, including any translation faults, OR the AFU must provide the ability to preempt the processing of the job. • The AFU must guarantee fairness between processes when operating in the AFU-directed shared programming model. In the case where an AFU can be preempted, the AFU can either require the current job to be restarted from the beginning, or it can provide a method to save and restore the context so that the current job can be restarted from the preemption point at a later time. For the shared model, the application
127. s 2^n, where 65 ≤ n ≤ 78. Note: The values of m, n, and p are implementation dependent. 3. Programming Models. The Coherent Accelerator Interface Architecture (CAIA) defines several programming models for virtualization of an accelerator function unit (AFU): • Dedicated-process programming model (no AFU virtualization). • Shared programming models, which include these two types: PSL-controlled shared programming models (AFU time-sliced virtualization) and AFU-directed shared programming models (AFU-controlled process element selection virtualization). Architecture Note: The AFU-directed programming model, where the AFU selects a context from the process element linked list to use for a transfer, is intended for the Networking and Storage market segments. For these types of applications, the required address context is selected based on a packet received from a network or which process is accessing storage. A CAIA-compliant device can also act as system memory or the lowest point of coherency (LPC). In this model, the process element and address translation are not required. The LPC model can also be used in combination with the other programming models, but might not be supported by all devices.
128. s done efficiently according to the requirements of the AFU. Write operation sizes to the PSL must be in powers of 2, and the address must be aligned to the size. Write operations with odd alignment or size must be broken into multiple writes, each with a power-of-2 size and the correct alignment. It is advisable that the AFU buffers commands that go to the same cache line and issues them together as one write or read instead of sending multiple shorter commands. If there is a chance that the data will again be used by the AFU, it is good practice to hold it inside the AFU buffers. This minimizes PSL traffic and frees the PSL interface resources for other commands. Locking commands must be used to make sure a line is not modified while the AFU is holding data in internal buffers. There are two possibilities for this: lock and reservation. Locks are typically used when updates to shared memory must be atomic. After the PSL grants a lock, it does not allow anybody else to modify that line. The number of locks allowed at a point in time is dependent on the PSL resources available. The AFU can acquire a reservation for any particular line and do write_c operations later, which are successful only if the reservation is available. If some other processor has taken the reservation, the AFU's previous reservation is killed. Therefore, in cases where it is probable that a line might not be modified, use a reservation instead of a lock.
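The rule that an oddly aligned or oddly sized write must be broken into power-of-2, size-aligned pieces can be illustrated with a greedy split. This is a minimal sketch in C under stated assumptions: the names are hypothetical, the 128-byte upper bound comes from the cache-line size used elsewhere in this manual, and a real AFU would implement the equivalent logic in hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* One write command: a power-of-2 size with the address aligned to it. */
struct write_chunk {
    uint64_t addr;
    uint32_t size;
};

/* Split [addr, addr + len) into naturally aligned power-of-2 writes of
 * at most 128 bytes. Returns the number of chunks stored in 'out';
 * stops early if 'max_out' chunks are not enough. */
size_t split_write(uint64_t addr, uint64_t len,
                   struct write_chunk *out, size_t max_out)
{
    size_t n = 0;

    while (len > 0 && n < max_out) {
        uint32_t size = 128;

        /* Shrink until the chunk fits the remaining length and the
         * address is aligned to the chunk size; size 1 always fits. */
        while (size > len || (addr & (size - 1)) != 0)
            size >>= 1;

        out[n].addr = addr;
        out[n].size = size;
        n++;
        addr += size;
        len -= size;
    }
    return n;
}
```

For example, a 13-byte write starting at address 0x1003 becomes three writes of 1, 4, and 8 bytes at 0x1003, 0x1004, and 0x1008. Because each chunk is aligned to its own size and no larger than 128 bytes, no chunk crosses a cache-line boundary.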
129. 7.2.5 Develop Lab Test Plan for the AFU 88
7.2.6 System Simulation of Application and AFU 88
7.2.7 Test 88
7.3 Best Practices for AFU Design 89
7.3.1 FPGA Considerations 89
7.3.2 General PSL Information 89
7.3.3 Buffer Interface 89
7.3.4 PSL Interface Timing 89
7.3.5 Designing for Performance 89
7.3.6 Simulation 90
7.3.7 Debug Considerations 90
7.3.8 Operating System Error Handling 90
8. CAPI
130. s space and includes pages stored on disk It is trans lated from an effective address by a segmentation mechanism and used by the paging mechanism to obtain the real address RA The maximum size of the virtual address space is 2 bytes VHSIC Hardware Description Language Work element descriptor Glossary Page 101 of 101
131. same cycle as aXh_brdata. A parity check fail results in a DERROR response and SUE data written.
   Signal Name (Bits, Source): Description
   haX_bwvalid (1, PSL): This signal is asserted for a single cycle when a valid write data transfer is present on the interface. The haX_bw signals, except for haX_bwpar, are valid during the cycle that haX_bwvalid is asserted. Note: This signal can be on for multiple cycles, indicating that data is being driven on back-to-back cycles.
   haX_bwtag (8, PSL): Accelerator-generated ID for the read request.
   haX_bwtagpar (1, PSL): Odd parity for haX_bwtag, valid with haX_bwvalid.
   haX_bwad (6, PSL): Half-line index of write data within the transaction. Cache lines are 128 bytes, so that only the LSB is modulated.
   haX_bwdata (512, PSL): Data to be written.
   haX_bwpar (8, PSL): Odd parity for each 64-bit doubleword of haX_bwdata. haX_bwpar is presented to the accelerator one PSL cycle after haX_bwdata.
5.3 PSL Response Interface. The PSL uses the response interface to indicate the completion status of each command and to manage the command flow control credits. Each command completion can return credits back to the accelerator so that further commands can be sent.
Table 5-7. PSL Response Interface
   Signal Name (Bits, Source): Description
   haX_rvalid (1, PSL): This signal is asserted for a single cycle when a valid response is present on the interface. The haX_r signals are valid during the cycle that haX_rvalid is asserted. Note: This signal can be
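The haX_bwpar description (odd parity for each 64-bit doubleword of a 512-bit transfer) can be made concrete with a small software model. The sketch below is illustrative only. The function names are hypothetical, and the mapping of doubleword 0 to the most significant parity bit is an assumption, chosen to match the manual's convention that bit 0 is the most significant bit.

```c
#include <stdint.h>

/* Odd parity bit for one 64-bit doubleword: chosen so that the data
 * bits plus the parity bit together contain an odd number of 1s. */
unsigned odd_parity64(uint64_t dw)
{
    unsigned ones = 0;

    while (dw) {
        ones += (unsigned)(dw & 1u);
        dw >>= 1;
    }
    return (ones & 1u) ? 0u : 1u;
}

/* Eight parity bits for a 512-bit transfer held as eight doublewords.
 * Assumption: dw[0] maps to the most significant parity bit. */
uint8_t odd_parity512(const uint64_t dw[8])
{
    uint8_t par = 0;

    for (int i = 0; i < 8; i++)
        par = (uint8_t)((par << 1) | odd_parity64(dw[i]));
    return par;
}
```

A checker on the receiving side would recompute the same eight bits from the data and compare them with the parity bus, keeping in mind that haX_bwpar arrives one PSL cycle after haX_bwdata.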
132. seneeeseeeeeeneaeeeees 33 Figure 5 1 PSL Command Response FIOW cccccccccceeeeeeeeseeeeeseeeeceaeeeseeeeseaeeeseeeeseeeesenaeesenaeeseeeeeee 72 Figure 5 2 PSL Accelerator Control Interface Flow in Non Shared Mode c cecseeseeeeeeseeeenteeseeeeaes 76 29 January 2015 Page 9 of 101 User s Manual Coherent Accelerator Processor Interface Advance List of Figures Version 1 2 Page 10 of 101 29 January 2015 Revision Log User s Manual Coherent Accelerator Processor Interface Each release of this document supersedes all previously released versions The revision log lists all signifi cant changes made to the document since its initial release In the rest of the document change bars in the margin indicate that the adjacent text was significantly modified from the previous release of this document Revision Date Version Contents of Modification 29 January 2015 1 2 e Changed reference to the Iwsync instruction to the sync instruction in the following sections Section 3 4 1 1 on page 39 Section 3 4 3 1 on page 43 Section 3 4 4 1 on page 48 Section 3 4 5 1 on page 50 Section 3 4 6 1 on page 54 and Section 3 4 7 1 on page 56 e Revised Section 5 1 2 Reservation on page 68 e Revised Section 5 1 3 Locks on page 68 e Revised Table 5 5 aXh_cabt Translation Ordering Behavior on page 66 e Revised Table 5 6 Accelerator Buffer Interface on page 69 e Revised Section 6 1 Overview on page 79 e Added a
133. sing a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. 3.4.5 Suspending a Process Element in the Linked List. The suspend flag in the software process element state is used to temporarily stall the processing of a process element. If the process is already running on an AFU, setting the suspend flag stops the current running process. 3.4.5.1 Software Procedure
1. Set the suspend flag in the software state to 1 (Software_State[S] = 1).
   • Store x'80000002' to the 31st word of the process element to suspend.
2. Ensure that the update to the software_state is visible to all processes.
   • System software running on the host processor must perform a sync instruction.
3. Write a suspend_element command to the software command status field in the linked list area.
   • Store x'00030000' || first_psl_id || link_of_element_to_suspend to sw_command_status.
4. Ensure that the suspend_element command is visible to all processes.
   • System software running on the host processor must perform a sync instruction.
5. Issue the suspend_element MMIO command to the first PSL.
   • System software performs an MMIO to the PSL Linked List Command Register with the suspend_element command and the link to the process element being suspended (PSL_LLCMD_An = x'000300000000' || link_of_element_to_suspend).
6. Wait for the PSL to suspend the process element.
   • The process element is
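The two 64-bit values written in steps 3 and 5 are concatenations of 16-bit fields, which can be sketched as shift-and-OR operations. This is a hedged illustration, not system software source: the function names are hypothetical. The 16-bit link width is inferred from x'000300000000' being 48 bits wide, and the position of the command in bits 0:15 is inferred from the x'0003...' prefix.

```c
#include <stdint.h>

#define CMD_SUSPEND_ELEMENT 0x0003u /* command code used in steps 3 and 5 */

/* Build a sw_command_status doubleword.
 * Layout with bit 0 as the most significant bit (partly inferred):
 * command 0:15, status 16:31, PSL_ID 32:47, link 48:63. */
uint64_t make_sw_command_status(uint16_t cmd, uint16_t status,
                                uint16_t psl_id, uint16_t link)
{
    return ((uint64_t)cmd << 48) | ((uint64_t)status << 32) |
           ((uint64_t)psl_id << 16) | (uint64_t)link;
}

/* Build the PSL_LLCMD_An value: command, 32 zero bits, then the link. */
uint64_t make_llcmd(uint16_t cmd, uint16_t link)
{
    return ((uint64_t)cmd << 48) | (uint64_t)link;
}
```

System software would store the first value with a status of x'0000' (operation pending) and later poll for the same word with the status field updated by the PSL.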
134. 6.2.3.4 cxl_afu_free
   #include <libcxl.h>
   void cxl_afu_free(struct cxl_afu_h *afu);
The routine cxl_afu_free releases the buffers allocated to hold the handle, file descriptor, and other information required by the device. It also closes the device, thereby releasing it from the process so that a subsequent process can open the device.
6.2.3.5 cxl_afu_attach
   #include <libcxl.h>
   int cxl_afu_attach(struct cxl_afu_h *afu, __uint64_t wed);
The routine cxl_afu_attach creates the connection between the current process and the AFU on the accelerator card. The calling process will have established an accelerator-specified work element descriptor (WED). This routine resets the AFU, transmits the WED to the AFU, and enables the AFU.
6.2.3.6 cxl_afu_attach_full. Note: Not applicable for the CAPI Developer Kit.
   #include <libcxl.h>
   int cxl_afu_attach_full(struct cxl_afu_h *afu, __uint64_t wed, __u16 num_interrupts, __u16 *process_element);
The routine cxl_afu_attach_full creates the connection between the current process and the AFU on the accelerator card. The calling process will have established an accelerator-specified work element descriptor (WED). The calling process can specify a number of interrupts that it can process, or it can rely on the number of interrupts that are defined by the accele
135. specifies the minimum number of interrupts required by the AFU for each process supported. This field is read-only. Implementation Note: This value does not include LISN 0, used by the PSL for reporting translation faults. A value of zero in this field implies that the AFU does not require any interrupts.
   num_of_processes (bits 16:31): This field specifies the maximum number of processes that can be supported by the AFU. This field can be written by system software to a number less than the power-on value. System software is required to read back the value to determine if the device supports reducing the number of processes supported. Implementation Note: If the value written to this field by system software is less than the minimum number of processes required to be supported, an implementation can return the minimum number of processes or the power-on value. For a dedicated process, this field must be set to x'0001'.
   num_of_afu_CRs (bits 32:47): This field specifies the number of configuration records contained in the configuration record area. A length of x'0' indicates that an AFU configuration record does not exist. This is a read-only field.
   req_prog_model (bits 48:63): This field specifies the programming model required by the AFU. This is a read-only field. This should be set to x'8010' for the dedicated-process mode programming model.
   x'8' x'18' Reserved (bits 0:63): Reserved, set to x'0'.
   x'20' Reserved (bits 0:7): Reserved set
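Because the manual numbers bits with bit 0 as the most significant bit, extracting a field such as num_of_processes (bits 16:31) from a 64-bit descriptor word takes a shift that counts from the other end. The helper below is a small sketch of that conversion. The function name is hypothetical, and it assumes the descriptor doubleword has already been read into a host integer with the correct endianness.

```c
#include <stdint.h>

/* Extract bits first:last from a 64-bit word using the manual's
 * numbering, where bit 0 is the most significant bit and bit 63 the
 * least significant. Requires first <= last <= 63. */
uint64_t get_field(uint64_t word, unsigned first, unsigned last)
{
    unsigned width = last - first + 1;
    uint64_t shifted = word >> (63 - last);

    if (width == 64)
        return shifted;
    return shifted & ((UINT64_C(1) << width) - 1);
}
```

For a descriptor word of x'0001000400028010', get_field(word, 16, 31) yields 4 processes, get_field(word, 32, 47) yields 2 configuration records, and get_field(word, 48, 63) yields x'8010', the dedicated-process programming model.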
136. suspended when a load from the sw_command_status returns x'00030003' || first_psl_id || link_of_element_to_suspend. • If a value of all 1's is returned for the status, an error has occurred. An implementation-dependent recovery procedure must be initiated by hardware. 3.4.5.2 PSL Procedure for Time-Sliced Programming Models. Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first nor last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last. Operations Performed by the First PSL (PSL_ID[L,F] = '01'). When the suspend_element MMIO command is received by the first PSL, the PSL checks to see if the process element being suspended is currently running, performs any operations necessary, and sends the suspend_element command to the next PSL or sets the completion status in the software command status word. 1. If the process is running, the process is suspended. The PSL sets the complete status in the software command status field to indicate that the process has been successfully suspend
137. t is running, the process is suspended. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
2. The PSL sets the complete status in the software command status field to indicate that the process has been successfully suspended.
   • The status field in the sw_command_status is set to x'0003' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be x'00030003' || first_psl_id || link_of_element_to_suspend.
   • The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
3.4.5.3 PSL Procedure for AFU-Directed Programming Models. Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first nor last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last. Operations Performed by the First PSL (PSL_ID[L,F] = '01'). When the suspend_element MMIO command is received by the first PSL, the PSL checks to see if the process element being suspended is currently r
138. t the link provided. tware is updating the process element state at the link provided. t of the command is reserved and must always be set to 0.
   Bits, Field Name: Description
   16:31, Status: The status field in the sw_command_status word must always be set to x'0000' by system software. The PSL should only update this field when setting the completion status. The most significant bit being set indicates an error. For example, a status of x'8001' indicates that there was an error terminating a process element.
      x'0000' Operation pending.
      x'0001' Process element has been terminated.
      x'0002' Safe to remove process element from the linked list.
      x'0003' Process element has been suspended and all outstanding operations are complete.
      x'0004' Execution of the process element has been resumed.
      x'0005' PSL acknowledgment of added process element.
      x'0006' PSL acknowledgment of updated process element.
      '1ccc cccc cccc cccc' Indicates an error with the requested command indicated by the c field.
      All other values are reserved.
   32:47, PSL_ID: PSL identifier. The PSL identifier is used to select which PSL assigned to service the scheduled processes must perform the operation. When the sw_command_status word is written by system software, the PSL_ID
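The status encoding in the table above lends itself to a small decoder. This is an illustrative sketch only: the function names and the returned strings are hypothetical, while the numeric values and the rule that a set most significant bit signals an error come from the table.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The most significant bit of the 16-bit status field flags an error. */
bool status_is_error(uint16_t status)
{
    return (status & 0x8000u) != 0;
}

/* For an error status, the low 15 bits name the failed command. */
uint16_t status_error_command(uint16_t status)
{
    return (uint16_t)(status & 0x7FFFu);
}

/* Short description of a status value from the table above. */
const char *status_name(uint16_t status)
{
    if (status_is_error(status))
        return "error";
    switch (status) {
    case 0x0000: return "operation pending";
    case 0x0001: return "terminated";
    case 0x0002: return "safe to remove";
    case 0x0003: return "suspended";
    case 0x0004: return "resumed";
    case 0x0005: return "add acknowledged";
    case 0x0006: return "update acknowledged";
    default:     return "reserved";
    }
}
```

With this decoder, the example status x'8001' from the table is reported as an error whose command field is x'0001', the terminate command.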
139. t the process.
   • The process to start or resume is the value of the tail_pointer read in substep a.
   • If the suspend flag is set in the software status field, continue with substep 4.
   • If the suspend flag is not set in the software status field, perform a context restore, if indicated by the process_state read in substep a, and start the process. Continue with the next substep.
d. Continue running the process until either the context time has expired or the process completes.
   • If the process completes, continue with step 1.
   • If the context timer expires, request the AFU to perform a context save operation.
   • If a context save is performed by the AFU, wait until the operation is completed and set the process_state to indicate a context restore is required. Continue with step 4.
4. Place the process element into the PSL queue of processes waiting to be started or resumed.
a. Perform a read of the cache line containing the head_pointer and tail_pointer such that the cache line is owned by the PSL.
   • The PSL must prevent any other PSL from accessing the cache line until substep c.
b. Write the link to the process element and its status to the PSL queue of processes waiting to be restarted.
   • Write the process element handle (PSL_PEHandle_An[PE_Handle]) and the process_state to the memory location pointed to by the
140. te_unlock or unlock returns NLock if they are attempted when an address is not locked. An accelerator holding a lock is required to release its lock and wait for the write_unlock or unlock command to complete before it can proceed with commands to other addresses. While a lock is active, commands to other addresses can be terminated with the response NLock. Note that command ordering within the PSL can cause a command issued before the read_cl_lck to be executed after the lock is obtained, causing that command to be terminated with response NLock. If this is a problem, the AFU should wait until all previous commands have completed before starting a lock sequence. 5.1.4 Request for Interrupt Service. The intreq command is used to generate an interrupt request to the system. Address bits [53:63] indicate the source of the interrupt. Only values 1 to 2043 are supported. A second interrupt request using the same source must not be generated to the system until the first request has been serviced. The PSL generates a PSL response DONE when the interrupt request has been presented to the upstream logic. The response provides no indication of interrupt service. The PSL generates a PSL response FAILED if an invalid source number is used, as defined in PSL_IVTE_LIMIT_An. 5.1.5 Parity Handling for the Command Interface. P
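Since address bits [53:63] are the 11 least significant bits in the manual's numbering, the interrupt source of an intreq command can be read with a simple mask. The sketch below uses hypothetical function names and checks only the fixed range stated in the text. The additional limit configured in PSL_IVTE_LIMIT_An is not modeled because its format is not described here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt source number carried in address bits [53:63] of an intreq
 * command, that is, the 11 least significant address bits. */
unsigned intreq_source(uint64_t ea)
{
    return (unsigned)(ea & 0x7FFu);
}

/* Range check from the text: only sources 1 to 2043 are supported. */
bool intreq_source_in_range(unsigned source)
{
    return source >= 1 && source <= 2043;
}
```

An AFU model could use the range check before issuing intreq to avoid a FAILED response for a source number that can never be valid.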
141. terminated. The full queue is searched.
1. Performs a read of the cache line containing the head_pointer and tail_pointer such that the cache line is owned by the PSL. The PSL must prevent any other PSL from accessing the cache line until substep 6.
  • Save the head_pointer location to an initial_head_pointer internal register.
2. Removes the process from the PSL queue.
  • Read the process_element_link and process_state pointed to by the tail_pointer.
  • Add 8 to the tail_pointer (tail_pointer equals tail_pointer + 8).
  • If the tail_pointer is greater than end_of_PSL_queue_area, tail_pointer equals start_of_PSL_queue_area.
3. Compares the process_element_link read in substep 2 with link_of_element_to_terminate.
  • If the links match, continue with substep 5.
  • If the links do not match, continue with the next substep.
4. Puts the process_element_link and process_state back on the PSL queue.
  • Write the process_element_link and process_state to the memory location pointed to by the head_pointer.
  • Add 8 to the head_pointer (head_pointer equals head_pointer + 8).
  • If the head_pointer is greater than end_of_PSL_queue_area, head_pointer equals start_of_PSL_queue_area.
5. Compares the tail_pointer to the initial_head_pointer.
  • If the tail_pointer is not equal to initial_hea
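The queue search above can be sketched as a small simulation. This is an illustrative model only: memory is a Python dict keyed by byte address, pointers are treated as plain addresses, and the loop assumes that when the tail_pointer has not yet reached the saved initial_head_pointer the procedure returns to substep 2 (the snippet is cut off before it says so). The names `advance` and `remove_from_queue` are hypothetical.

```python
ENTRY_SIZE = 8  # each queue entry occupies 8 bytes


def advance(pointer: int, start: int, end: int) -> int:
    """Add 8 to a queue pointer and wrap to the start when it passes the end."""
    pointer += ENTRY_SIZE
    if pointer > end:
        pointer = start
    return pointer


def remove_from_queue(memory, head, tail, start, end, link_to_terminate):
    """Walk the whole queue once, re-queuing every entry except the terminated one."""
    initial_head = head                      # substep 1: remember where the scan must stop
    while tail != initial_head:              # substep 5 (assumed loop condition)
        link, state = memory[tail]           # substep 2: read the entry at the tail
        tail = advance(tail, start, end)
        if link != link_to_terminate:        # substep 3: compare links
            memory[head] = (link, state)     # substep 4: put the entry back at the head
            head = advance(head, start, end)
    return head, tail


if __name__ == "__main__":
    mem = {0: (10, "ready"), 8: (11, "ready"), 16: (12, "ready")}
    head, tail = remove_from_queue(mem, head=24, tail=0, start=0, end=31,
                                   link_to_terminate=11)
    print(head, tail, mem[24], mem[0])
```

Because surviving entries are appended at the head while the tail advances, the relative order of the remaining processes is preserved.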
142. the PSL Cache
aXh_compar  1  Acc  Odd parity for aXh_com. axh_aparen 1
aXh_cabt  3  Acc  PSL translation ordering behavior. See Table 5-5 aXh_cabt Translation Ordering Behavior on page 66.
aXh_cea  64  Acc  Effective byte address for the command. Addresses for cl commands must be sent as 128-byte aligned addresses. Addresses for write_ must be naturally aligned according to the given aXh_csize.
aXh_ceapar  1  Acc  Odd parity for aXh_cea. axh_aparen 1
aXh_cch  16  Acc  Context handle used to augment aXh_cea in AFU-directed context mode. Drive to 0 in dedicated-process mode.
aXh_csize  12  Acc  Number of bytes for partial line commands. Read/write commands require the size to be a power of 2 (1, 2, 4, 8, 16, 32, 64, 128). The aXh_csize is binary encoded.
haX_croom  8  PSL  Number of commands that the PSL is prepared to accept. The accelerator must capture this value when it is enabled on the Accelerator Control interface. The value changes only with a policy change while the accelerator is not enabled; it is not meant to be a dynamic count from the PSL to the accelerator.

Table 5-2. PSL Command Opcodes Directed at the PSL Cache (Sheet 1 of 2)
Mnemonic  Opcode  Description
Read_cl_s, Read_cl_m  x'0A50', x'0A60'  Read a cache line and allocate the cache line in the precise cache in the shared state. This command must be used when there is an expectation of temporal locality. aXh_csize must be 1
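The parity and alignment rules in the signal descriptions can be checked with a few helper functions. This is a sketch under the usual definition of odd parity (the parity bit makes the total number of 1 bits, data plus parity, odd); the function names are illustrative and not part of the interface.

```python
VALID_SIZES = {1, 2, 4, 8, 16, 32, 64, 128}  # powers of 2 allowed for aXh_csize
CACHE_LINE = 128                              # bytes per cache line


def odd_parity(value: int) -> int:
    """Return the parity bit that makes the count of 1s in value plus parity odd."""
    ones = bin(value).count("1")
    return 1 - (ones % 2)


def is_cache_line_aligned(cea: int) -> bool:
    """cl commands must present a 128-byte aligned effective address."""
    return cea % CACHE_LINE == 0


def is_valid_partial_write(cea: int, csize: int) -> bool:
    """Partial-line writes need a power-of-2 size and a naturally aligned address."""
    return csize in VALID_SIZES and cea % csize == 0


if __name__ == "__main__":
    cea = 0x1000_0040
    print(odd_parity(cea), is_cache_line_aligned(cea), is_valid_partial_write(cea, 64))
```

Natural alignment here means the address is a multiple of the transfer size, so a 64-byte write at offset 0x40 within a line passes while the same write at offset 0x20 does not.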
143. the next substep.
2. If the process is running, the process is terminated. The AFU and PSL are allowed to complete any outstanding transactions but should not start any new transactions for the process.
3. The PSL writes a termination command to the psl_chained_command doubleword for the next PSL and watches for the termination to be complete.
  • Write the value x'00010000' || next_psl_id || link_of_element_to_terminate to the psl_chained_command.

Operations Performed by the Last PSL (PSL_ID[L] = 1)

When the terminate_element MMIO command is received, or the terminate_element command is detected by the last PSL, the PSL checks to see if the process element that is being terminated is currently running, performs any operations necessary, and sets the completion status in the software command status word. The terminate_element command is detected by monitoring the psl_chained_command doubleword.
1. The PSL notifies the AFU of the process element termination. The AFU performs any necessary operations to remove the process and then acknowledges the termination of the process element. When the acknowledgment is received, the PSL continues with the next substep.
2. If the process is running, the process is terminated. The AFU and PSL are allowed to complete any outstanding transactions but must not start any new transactions for the process.
3. The PSL sets the complete status in the software command status field to indicate that the
144. to next process element to resume. The tail pointer value is an index from the start address of the PSL queue area.

psl_chained_command
  SPA_Base + ((n + 4) x 128) + ((((n x 8) + 127) >> 7) x 128) + 128, where n = maximum number of process elements supported.
  Command for the next PSL assigned to service the process elements.

end_of_SPA_area
  SPA_Base + ((n + 4) x 128) + ((((n x 8) + 127) >> 7) x 128) + 255, where n = maximum number of process elements supported.
  End of the scheduled processes area.

3.3.1 Process Element Entry

Each process element entry is 128 bytes in length. Table 3-2 shows the format of each process element. The shaded fields in Table 3-2 correspond to privileged 1 registers, and the fields not shaded correspond to privileged 2 registers. The Software State field is an exception and does not have corresponding privileged 1 or privileged 2 registers.

Table 3-2. Process Element Entry Format
Word  Process Element Entry
0  State Register (0:31)
1  State Register (32:63)
2  EP SPOffset (most significant bits)
3  SPOffset (least significant bits)
145. to x'0'.
AFU_CR_len (bits 8:63): This field specifies the length of each AFU configuration record in multiples of 256 bytes. If more than one configuration record is present, the total length of the configuration record area is (num_of_CRs x AFU_CR_len x 256). A length of x'0' indicates that an AFU configuration record does not exist. This is a read-only field.
x'28' AFU_CR_offset (bits 0:63): This field specifies the 256-byte aligned offset of the AFU configuration record from the start of the AFU descriptor. This field contains a 64-bit pointer to the start of the AFU configuration records. The lower 8 bits of the pointer are always 0 (256-byte aligned). This is a read-only field.

Table 4-1. AFU Descriptor (Sheet 2 of 2)
Register Offset  Field Name  Bits  Description
x'30' PerProcessPSA_control (bits 0:7):
  Bits 0:5  Reserved (set to 0).
  Bit 6  Per-process problem state area required (read only).
    0  A per-process problem state area is not required.
    1  A per-process problem state area is required. The per-process problem state area is a subset of the overall problem state area. The problem state area required bit must also be set if this bit is set.
  Bit 7  Problem state area required (read only).
    0  A problem state area is not required. Only the necessary area for the AFU descrip
146. tor, configuration records, and error buffers area are mapped into the system address space.
    1  A problem state area is required.
PerProcessPSA_length (bits 8:63): If the per-process problem state area required bit is set, this field specifies the length of each per-process problem state area in multiples of 4 KB. The size of the per-process problem state area required is determined by PerProcess_area = PerProcessPSA_length x 4K x num_of_processes. If the per-process problem state area required bit is not set, this field is reserved and returns x'0'. This is a read-only field.
  Implementation Note: Operating systems using a base page size of 64 KB might require the problem state area to be a multiple of 64 KB. To assign different regions of the problem state area to each process (PerProcessPSA_control[6] = 1), each region might be required to be a multiple of 64 KB. See the target operating system details for more information.
x'38' PerProcessPSA_offset (bits 0:63): This field specifies the 4 KB aligned offset of the per-process problem state area from the start of the problem state area. This field contains a 64-bit pointer to the start of the per-process problem state area. The lower 12 bits of the pointer are always 0 (4 KB aligned). This is a read-only field.
  Implementation Note: Operating systems using a base page size of 64 KB might require the problem state area to be aligned on a 64 KB boundary. To assign different regions o
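The descriptor fields at offset x'30' and the sizing rule can be worked through in a short sketch. It assumes the doubleword is laid out with bit 0 as the most significant bit, so PerProcessPSA_control occupies the top byte and PerProcessPSA_length the low 56 bits. The function names and the 64 KB rounding helper are illustrative, and whether rounding is needed depends on the target operating system.

```python
LENGTH_MASK = (1 << 56) - 1  # bits 8:63 of the doubleword (assumed bit 0 = MSB)


def decode_psa_register(reg: int) -> dict:
    """Split the x'30' descriptor doubleword into its control flags and length."""
    control = (reg >> 56) & 0xFF
    return {
        "per_process_required": bool((control >> 1) & 1),  # control bit 6
        "psa_required": bool(control & 1),                 # control bit 7
        "length_4k_units": reg & LENGTH_MASK,
    }


def per_process_area_bytes(length_4k_units: int, num_of_processes: int) -> int:
    """PerProcess_area = PerProcessPSA_length x 4K x num_of_processes."""
    return length_4k_units * 4096 * num_of_processes


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple, for example 64 KB on some systems."""
    return ((value + multiple - 1) // multiple) * multiple


if __name__ == "__main__":
    fields = decode_psa_register((0b11 << 56) | 2)
    size = per_process_area_bytes(fields["length_4k_units"], num_of_processes=3)
    print(fields, size, round_up(size, 64 * 1024))
```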
147.
Acknowledgment: A transmission that is sent as an affirmative response to a data transmission.
AFU: Accelerator functional unit.
Adaptive lookup table.
AMOR: Authority Mask Override Register.
AMR: Authority Mask Register.
architecture: A detailed specification of requirements for a processor or computer system. It does not specify details of how the processor or computer system must be implemented; instead, it provides a template for a family of compatible implementations.
AURP: Accelerator Utilization Record Pointer.
big endian: A byte-ordering method in memory where the address n of a word corresponds to the most significant byte. In an addressed memory word, the bytes are ordered (left to right) 0, 1, 2, 3, with 0 being the most significant byte. See little endian.
cache: High-speed memory close to a processor. A cache usually contains recently accessed data or instructions, but certain cache-control instructions can lock, evict, or otherwise modify the caching of data or instructions.
caching inhibited: A memory update policy in which the cache is bypassed and the load or store is performed to or from system memory. A page of storage is considered caching inhibited when the "I" bit has a value of 1 in the page table. Data located in caching-inhibited pages cannot be cached at any memory hierarchy that is not visible to all processors and devices in the system. Stores must update the memory hierarchy to a level that is visible to all processors and
148. unning, performs any operations necessary, and sends the suspend_element command to the next PSL or sets the completion status in the software command status word.
1. The PSL notifies the AFU of the suspended process element. The AFU performs any necessary operations to suspend the process and then acknowledges the suspension of the process element. When the acknowledgment is received, the PSL continues with the next substep.
2. If the process is running, the process is suspended. The PSL sets the complete status in the software command status field to indicate that the process has been successfully suspended. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
  • The status field in the sw_command_status is set to x'0003' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be (x'00030003' || first_psl_id || link_of_element_to_suspend).
3. If the process element is not running, the PSL writes a suspend command to the psl_chained_command doubleword for the next PSL.
  • Write the value x'00030000' || next_psl_id || link_of_element_to_suspend to the psl_chained_command.
  • The PSL does not start any process with a software state of complete, suspend, or terminate. A process element with the suspend flag set can be added to the PSL queue.
149. unning, the process is terminated. The PSL sets the complete status in the software command status field to indicate that the process has been successfully terminated. The PSL is allowed to complete any outstanding transactions but must not start any new transactions for the process.
  • The status field in the sw_command_status is set to x'0001' using a caching-inhibited DMA or special memory update operation that is guaranteed not to corrupt memory if the operation fails. The final value of the sw_command_status must be (x'00010001' || first_psl_id || link_of_element_to_terminate).
2. If the process element is not running, the PSL writes a termination command to the psl_chained_command doubleword for the next PSL and watches for the termination to be complete.
  • Write the value x'00010000' || next_psl_id || link_of_element_to_terminate to the psl_chained_command.
  • While waiting for the process to be terminated, the PSL does not attempt to start the corresponding process or any process with the complete, suspend, or terminate flags set. The PSL can perform other operations. The process is terminated when the status field in the sw_command_status is x'0001'.

Operations Performed by the Last PSL (PSL_ID[L,F] = '00')

When the terminate_element command is detected
150. until the restart command has received a DONE response.

5.1.1 Command Ordering

In general, the PSL processes commands in a high-performance order. If a particular ordering is required between two commands, the application must submit the first command and wait for its completion before submitting the second command. For example, the application might want to write results and then write a doorbell indicating to other threads that the data is ready. It must submit the result write commands, wait for all of the completion responses, and then submit the doorbell write. This way, when the other threads read the doorbell value, they can subsequently read the results correctly. The PSL has multiple stages of execution, each of which can have an impact on the order in which commands are completed.

5.1.1.1 Translation Ordering

Translation ordering is affected by the state of the aXh_cabt input to the PSL. This input is an important way to control the behavior and performance of the PSL. Table 5-5 aXh_cabt Translation Ordering Behavior on page 66 lists the translation ordering behavior.

Table 5-5. aXh_cabt Translation Ordering Behavior (Sheet 1 of 2)
aXh_cabt  Mnemonic  Description
000  Strict  Translation proceeds in order relative to other aXh_cabt Strict operations. Strict means that effective to real address transl
151. ware command for the first PSL assigned to service the process element. The last PSL assigned to service the process elements returns the status.
  Note: This location must never be cached by the PSL in a modified state.
  Note: Storage in the SPA above this address must not be read by system software.

start_of_PSL_queue_area
  SPA_Base + ((n + 4) x 128), where n = maximum number of process elements supported.
  This is the start of the area in system storage used by the PSLs for the queue of process elements waiting to run.

end_of_PSL_queue_area
  SPA_Base + ((n + 4) x 128) + (n x 8) - 1, where n = maximum number of process elements supported.
  This is the end of the area in system storage used by the PSLs for the queue of process elements waiting to run.

head_pointer
  SPA_Base + ((n + 4) x 128) + ((((n x 8) + 127) >> 7) x 128), where n = maximum number of process elements supported.
  Pointer to the next location to insert a preempted process element. The head pointer value is an index from the start address of the PSL queue area.
  Note: This location is aligned to the next cache line offset following the end of the PSL queue. If the number of cache lines needed for the PSL queue area is even, this location is the next cache line plus 1.

tail_pointer
  SPA_Base + ((n + 4) x 128) + ((((n x 8) + 127) >> 7) x 128) + 8, where n = maximum number of process elements supported.
  Pointer
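The address expressions for the PSL-maintained part of the SPA can be evaluated with a short sketch. The arithmetic operators are not legible in the extracted text, so the code assumes the reading that keeps the fields consistent: the queue holds n 8-byte entries, the head_pointer sits on the first 128-byte cache line boundary after the queue, and each later field is a fixed offset from it. The function name `spa_layout` is hypothetical.

```python
def spa_layout(spa_base: int, n: int) -> dict:
    """Compute SPA field addresses for n process elements (operators assumed)."""
    queue_start = spa_base + (n + 4) * 128
    # Round the queue length (n x 8 bytes) up to whole 128-byte cache lines.
    pointer_line = queue_start + (((n * 8) + 127) >> 7) * 128
    return {
        "start_of_PSL_queue_area": queue_start,
        "end_of_PSL_queue_area": queue_start + (n * 8) - 1,
        "head_pointer": pointer_line,
        "tail_pointer": pointer_line + 8,
        "psl_chained_command": pointer_line + 128,
        "end_of_SPA_area": pointer_line + 255,
    }


if __name__ == "__main__":
    for name, address in spa_layout(spa_base=0, n=16).items():
        print(f"{name:26s} 0x{address:04X}")
```

With n = 16 the queue fills exactly one cache line, so the pointers start in the line immediately after it; with n = 17 the queue spills into a second line and the pointers move one line further out.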
152.
  • A Context Save/Restore Area Pointer (CSRP). The CSRP is the effective address of an area in the application's memory space for the AFU to save and restore the context state. This pointer is optional if no state is required to be saved between jobs or when a job is preempted. The context save/restore area must be pinned system memory.

Upon receiving the system call (syscall), the operating system verifies that the application has registered and been given the authority to use the AFU. The operating system then calls the hypervisor (hcall) with at least the following information:
  • A work element descriptor (WED)
  • An Authority Mask Register (AMR) value, masked with the current PSL_AMOR_An Register value by the PSL and optionally masked with the current UAMOR by the hypervisor
  • An effective address (EA) Context Save/Restore Area Pointer (CSRP)
  • A process ID (PID) and optional thread ID (TID)
  • A virtual address (VA) accelerator utilization record pointer (AURP)
  • The virtual address of the storage segment table pointer (SSTP)
  • A logical interrupt service number (LISN)

Upon receiving the hypervisor call (hcall), the hypervisor verifies that the operating system has registered and been given the authority to use the AFU. The hypervisor then puts the process element into the process element linked list for the corresponding AFU type. The process element
153. ystem software running on the host processor must perform a sync instruction.
8. Issue the add_element MMIO command to the first PSL. System software performs an MMIO to the PSL Linked List Command Register with the add_element command and the link to the new process being added (PSL_LLCMD_An = x'000500000000' || link_of_element_to_add).
9. Wait for the PSLs to acknowledge the process element. The process element is added when a load from the sw_command_status returns x'00050005' || first_psl_id || link_of_element_to_add.
  • If a value of all 1's is returned for the status, an error has occurred. An implementation-dependent recovery procedure must be initiated by hardware.

3.4.1.2 PSL Procedure for the Time-Sliced Programming Models

Each PSL assigned to service the scheduled processes is configured with a unique identifier and the identifier of the next PSL in the list of PSLs servicing the processes. In addition, each PSL is identified as either the first PSL, the last PSL, both first and last PSL (only one PSL servicing the queue), or neither first nor last PSL. The PSL ID Register contains the PSL unique identifier and the settings for first and last.

Operations Performed by the First PSL (PSL_ID[L,F] = '01')

When the add_element MMIO command is received by the first PSL, the PSL performs any o
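The command and status doublewords in steps 8 and 9 are concatenations of fixed-width fields, which the following sketch packs and checks. The field widths are an assumption inferred from the constants: x'000500000000' is 48 bits, leaving 16 bits for the link in a 64-bit register, and x'00050005' is 32 bits, leaving 16 bits each for the PSL identifier and the link. All function names are hypothetical.

```python
ADD_ELEMENT = 0x0005          # command code taken from the constants in the text
ALL_ONES = (1 << 64) - 1      # status value that signals an error


def llcmd_value(command: int, link: int) -> int:
    """Build command || 32 zero bits || link (16-bit fields assumed)."""
    return ((command & 0xFFFF) << 48) | (link & 0xFFFF)


def status_value(command: int, status: int, psl_id: int, link: int) -> int:
    """Build command || status || psl_id || link as four 16-bit fields (assumed)."""
    return (((command & 0xFFFF) << 48) | ((status & 0xFFFF) << 32)
            | ((psl_id & 0xFFFF) << 16) | (link & 0xFFFF))


def add_completed(observed: int, first_psl_id: int, link: int) -> bool:
    """True once sw_command_status shows add_element acknowledged; raises on error."""
    if observed == ALL_ONES:
        raise RuntimeError("sw_command_status returned all 1's")
    return observed == status_value(ADD_ELEMENT, ADD_ELEMENT, first_psl_id, link)


if __name__ == "__main__":
    print(hex(llcmd_value(ADD_ELEMENT, 0x0010)))
    print(add_completed(0x0005000500010010, first_psl_id=1, link=0x0010))
```

System software would poll with something like `add_completed` after the MMIO write, treating any other value as "not yet acknowledged" rather than as a failure.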