
Introduction to CUDA

Contents

1. CS195V Week 10: Introduction to CUDA

- There's about a month before reading period
- We plan to spend one or two lectures on CUDA introductions, then... anyone want to present?
- The intro CUDA project will go out soon
  - Video filtering
  - Should be pretty simple
- With the introduction of the unified device architecture, GPU hardware was made generic
  - No longer specific hardware for each shader stage
  - The same hardware is responsible for all shader stages
  - Much more general-purpose hardware
- OpenCL and CUDA are the main GPGPU languages today; DirectCompute was recently introduced by MS
  - Very similar languages; people tend to use CUDA more even though it is NVIDIA-specific (more support, slightly better performance)
- Compute Unified Device Architecture: first released by NVIDIA in 2007
  - Only supported on 8-series cards and later
- A framework for general-purpose computation on the GPU
- Written mostly like C, so it's easy to convert code over, but there are lots of little caveats
  - The CUDA C programming guide tells you pretty much all you need to know, but it's long and boring
- The warp schedulers dispatch warps (groups of 32 threads) to the shader cores
- The GTX 460 has 336 shader cores, or stream processors, at 1350 MHz
- You need the CUDA toolkit from NVIDIA
  - /contrib/projects/cuda-toolkit (toolkit)
  - /contrib/projects/cuda-sdk (SDK)
- Your kernel and kernel-launching code goes in a .cu file (headers can go in .cuh), compiled separately by nvcc
2. Different numbers of blocks and threads can have different performance (more on this later)

- Note that the hardware limits the number of blocks in a launch to 65,535, and the number of threads per block depends on the GPU (about 512)
- Thread: a single execution thread, as you would expect; runs on a single GPU microprocessor
- Thread block: a group of threads that will be issued the same instructions at the same time; they can also share some memory
  - Note that the order of block execution is random
- Grid: the group of all of the thread blocks that you are running
- Usually you know how many total threads you want to launch, so the question is how do you partition them into blocks?

A Simple Kernel

    #include <cuda.h>

    __global__ void kernelfunction(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = a[idx] + 1;
    }

(The slide's diagram illustrates the indexing with blockDim.x = 4 and blockIdx.x = 0, 1, 2, ..., so that idx = blockDim.x * blockIdx.x + threadIdx.x.)

- Given an array of floats and the array length, this kernel adds one to each value
- blockIdx, blockDim, and threadIdx are built-in variables
  - blockIdx.{x,y,z} is the current block index
  - blockDim.{x,y,z} is the block dimension (size of the block)
  - threadIdx.{x,y,z} is the current thread in the block
- So each thread in each block adds one to a value in a
- We need to check if idx < N to prevent writing past the array
  - If we start the kernel with more total threads than there are array elements (the number of elements may not divide evenly into blocks x threads per block), the extra threads must not write anything
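Not from the slides, but worth noting: a grid-stride loop is a common way to let a fixed-size launch cover an arbitrarily large array (and to stay under the 65,535-block limit). A minimal sketch, with a made-up kernel name:

    // Sketch only: each thread processes every (gridDim.x * blockDim.x)-th
    // element, so the grid doesn't have to cover the whole array in one pass.
    __global__ void kernelfunction_strided(float *a, int N)
    {
        int stride = gridDim.x * blockDim.x;   // total number of launched threads
        for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < N; idx += stride)
            a[idx] = a[idx] + 1;               // same work as the simple kernel above
    }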
3. The launch configuration determines the structure of your threads; this structure can be important from a performance perspective

- There are also some more optional launch arguments
  - Shared memory allocations (more on this later)
- Grids and blocks can be 1D, 2D, or 3D
  - You may want different dimensions depending on the format of your data
  - In this case we've used a 1D grid
- Specify the dimensions of your grid and blocks using int or dim3
- The kernel call is then kernel<<<gridDim, blockDim>>>(args)
- You can access the block index within the grid with blockIdx, the block dimensions with blockDim, and the thread index in the block with threadIdx

A Sample 2D Kernel

    __global__ void invert(float *image)
    {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        int idx = y * blockDim.x * gridDim.x + x;
        image[idx] = 1.0 - image[idx];
    }

    void launch_kernel(float *image)
    {
        // ... move host memory to device, producing d_image ...
        int threadsPerBlock = 16;
        dim3 blocks(512 / threadsPerBlock, 512 / threadsPerBlock);
        dim3 threads(threadsPerBlock, threadsPerBlock);
        invert<<<blocks, threads>>>(d_image);
    }

- So how many threads and blocks do you want?
- The number of threads per block should be a multiple of 32 because of the architecture
  - How many depends on how much sharing needs to happen between threads in a block
- The total number of blocks depends on the size of a block and the size of your data
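The launch above assumes a 512 x 512 image whose dimensions divide evenly by the block size. A sketch of a more general version (width, height, and d_image are assumed parameters, not from the slide):

    // Sketch: round the grid size up so partial blocks cover the image edges,
    // and bounds-check in the kernel so those extra threads do nothing.
    __global__ void invert_general(float *image, int width, int height)
    {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        if (x < width && y < height)
            image[y * width + x] = 1.0f - image[y * width + x];
    }

    void launch_kernel_general(float *d_image, int width, int height)
    {
        dim3 threads(16, 16);
        dim3 blocks((width  + threads.x - 1) / threads.x,   // ceiling division
                    (height + threads.y - 1) / threads.y);
        invert_general<<<blocks, threads>>>(d_image, width, height);
    }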
4. Separately allocated device memory can still be referenced across allocations via pointers

- Use cudaMalloc and cudaFree to create/destroy linear memory, and cudaMemcpy to transfer between host and device memory
- You can also use cudaMallocPitch or cudaMalloc3D for 2D or 3D arrays; recommended for performance because the allocation is appropriately padded
- Memory allocated with these is considered global
- You can create __shared__ memory inside your kernels
- There are more memory types (page-locked, portable, write-combined, mapped), but we won't go into these
- Using SLI requires some additional work, like transferring between devices, but we won't go into it

More notes

- Use cudaGetLastError or cudaPeekAtLastError to get errors; for asynchronous calls like kernel launches, call cudaDeviceSynchronize first
- There is direct interoperability with GL and DX, where you can map resources directly into CUDA so that it can read or modify them
- You can malloc and free inside your kernels, and the allocated memory will persist across different kernels
- In general you want a multiple of 32 threads per block (shared memory can play into this), and the number of blocks depends on the size of your problem; for scheduling, see the occupancy spreadsheet if you want to optimize more
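A short sketch of the error-checking pattern for asynchronous kernel launches described above (the kernel and its launch configuration are placeholders):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical kernel, for illustration only.
    __global__ void mykernel(float *data, int n) { /* ... */ }

    void checked_launch(float *d_data, int n)
    {
        mykernel<<<(n + 255) / 256, 256>>>(d_data, n);

        // Launch-configuration errors are available immediately.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            printf("launch failed: %s\n", cudaGetErrorString(err));

        // Execution errors only show up after the device finishes the kernel.
        err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
            printf("kernel failed: %s\n", cudaGetErrorString(err));
    }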
5. The specification states that no thread will advance to the next instruction until every thread in the block reaches __syncthreads()

- Recall that in the case of a branch, the GPU feeds threads through one condition while the others wait, then the remaining threads complete the other branch
  - What happens if the __syncthreads lies in a divergent branch?

Divergent Branch with Sync

    __global__ void vec_dot(float *a, float *b)
    {
        __shared__ float cache[threadsPerBlock];
        int cacheIndex = threadIdx.x;
        cache[cacheIndex] = a[threadIdx.x + blockIdx.x * blockDim.x];

        int i = blockDim.x / 2;
        while (i != 0) {
            if (cacheIndex < i) {
                cache[cacheIndex] += cache[cacheIndex + i];
                __syncthreads();      // inside the divergent branch!
            }
            i /= 2;
        }
    }

- The above code will cause the GPU to stall indefinitely
- Normally divergent branches result in some idling threads, but in the case of __syncthreads the results are somewhat tragic
  - Since all threads within a block must reach __syncthreads before continuing, the GPU ends up waiting forever

Matrix multiplication (oh boy)

- Split it up into smaller matrices and assign a block to each section
  - For each block, copy the parts of the multiplicands that you are interested in into shared memory, then use that for your computations
- Will probably have to draw this out

Matrix Multiply Example (the code for this example is shown in section 11, where this page continues)
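The slide doesn't show the fix, but a minimal sketch would hoist __syncthreads() out of the divergent branch so every thread in the block reaches it on every iteration (the partial parameter and the final write are additions for completeness):

    // Sketch of a corrected reduction: every thread executes __syncthreads(),
    // whether or not it did any work on this iteration.
    __global__ void vec_dot_fixed(float *a, float *partial)
    {
        __shared__ float cache[threadsPerBlock];   // threadsPerBlock: compile-time constant
        int cacheIndex = threadIdx.x;
        cache[cacheIndex] = a[threadIdx.x + blockIdx.x * blockDim.x];
        __syncthreads();                           // make sure the whole tile is loaded

        for (int i = blockDim.x / 2; i != 0; i /= 2) {
            if (cacheIndex < i)
                cache[cacheIndex] += cache[cacheIndex + i];
            __syncthreads();                       // outside the if: no deadlock
        }

        if (cacheIndex == 0)
            partial[blockIdx.x] = cache[0];        // one result per block
    }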
6. Thread blocks are not independent, and their scheduling order is not guaranteed

- Clearly, threads can access and write to any location in global memory
- CUDA provides atomic functions to safely update the same data across multiple threads
  - e.g. atomicAdd, atomicMin
- Constant memory has some optimizations
  - Use the __constant__ keyword, optionally with __device__
  - However, with compute 2.0 and later, if you malloc as usual and pass the data to your kernel as a const pointer, it will do the same thing
- Local memory is local to threads (like registers), but it's actually as slow as global memory
- So far we have only talked about global memory, accessible across all threads
  - This is fine, but it is the slowest memory available on the GPU
- For speed, you'll want to make use of shared memory
  - Shared memory is private to a single thread block but can be accessed by all threads in the block
  - Many times faster than global memory
  - The amount of shared memory must be determinable at kernel launch time
  - Shared memory has the lifetime of the kernel block
- Since shared memory is private per thread block, it's useful for communicating data between threads in a block
- To synchronize threads across a block, use __syncthreads()
  - Note that this does not synchronize all threads globally, but only the threads within that block
  - Useful for reading/writing shared memory
- __syncthreads is useful, but be careful when using it
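As a quick sketch of the atomics mentioned above (a simple histogram; the names are illustrative, not from the slides):

    // Sketch: many threads may hit the same bin at once, so the increment
    // must be atomic to avoid lost updates.
    __global__ void histogram(const unsigned char *data, int n, unsigned int *bins)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < n)
            atomicAdd(&bins[data[idx]], 1u);   // safe concurrent update
    }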
7. Use cudaGetDeviceCount and cudaGetDeviceProperties to learn more about the specifics of your system

- If you really want to optimize, check out the occupancy calculator

Memory types

- Global memory: read and write; slow, but has a cache
- Texture memory: read only; cache optimized for 2D access patterns
- Constant memory: slow, but with a cache
- Shared memory: 48 KB per multiprocessor; fast, but kind of special
- Local memory: thread-local; as slow as global memory (see the notes above)

Device memory

- CUDA kernels can only operate on device memory, not host memory
  - So we need to copy the relevant data from the host to the device
- There are 2 general types: linear memory and CUDA arrays
  - We will focus on linear memory for now
  - Linear memory works like the normal C memory you're used to
- Use cudaMalloc and cudaFree to allocate and free linear device memory
- Use cudaMemcpy to move data between host and device memory
- Device memory is in a 32-bit address space for compute level 1.x cards and a 40-bit space for 2.x cards (we are 2.1)
  - Thus you can refer to memory via pointers as you usually would
- You can also allocate 2D and 3D memory using cudaMallocPitch and cudaMalloc3D
  - You will want to use these if you can, because they are properly optimized and padded for performance
  - It might make sense to also use 2D and 3D thread blocks to operate on such a memory arrangement
- You can also malloc and free inside your kernels, and such allocations will persist across different kernels
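A hedged sketch of using cudaMallocPitch (the kernel and sizes are illustrative; the pitch returned is the padded row size in bytes):

    #include <cuda_runtime.h>

    // Sketch: rows of a pitched allocation are addressed through the pitch,
    // which is in bytes, hence the char* cast.
    __global__ void scale_rows(float *data, size_t pitch, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height) {
            float *row = (float *)((char *)data + y * pitch);
            row[x] *= 2.0f;
        }
    }

    void allocate_and_launch(int width, int height)
    {
        float *d_data;
        size_t pitch;
        cudaMallocPitch((void **)&d_data, &pitch, width * sizeof(float), height);

        dim3 threads(16, 16);
        dim3 blocks((width + 15) / 16, (height + 15) / 16);
        scale_rows<<<blocks, threads>>>(d_data, pitch, width, height);

        cudaFree(d_data);
    }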
8. The .cu files are compiled separately by nvcc

- Tell your makefile what is CUDA code and what is C/C++ code
  - We do it for you, but the NVIDIA code samples have example makefiles (don't worry, they're short and simple)
- Host: the regular computer
- Device: the GPU
- Kernel: a function made to be executed many times in parallel on the GPU
  - Origins in stream processor programming, where a kernel function is applied to each element in the stream (SIMD)
  - Shaders are a specific type of kernel
- nvcc: the NVIDIA compiler for CUDA code
- ptx: GPU assembly language
  - nvcc compiles your kernel to ptx first, and then compiles the ptx to binary code
- Most CUDA programs operate like this:
  - Copy some data to device memory
  - Kernel launch (run the kernel function on some data)
  - Wait for the kernels to finish
  - Copy the data back to host memory
  - Keep going on your merry way
- It's a simple model, but you can do a lot of things with it
  - Parallelization is implicit: your kernel function is launched thousands of times, once for each piece of data
  - Also, it means that much of your CUDA code will look similar (or copy-pasted verbatim)
- CUDA programs are a hierarchy of concurrent threads
  - Threading is subdivided into blocks, each of which is then subdivided into threads
  - The choice of subdivision is up to you
  - Note that (number of blocks) x (threads per block) = total number of threads
- Choosing the number of threads per block and the number of blocks can be tricky
  - It largely depends on your data, but different choices can have different performance
9. We may launch the kernel with more total threads than there are array elements (i.e., the number of elements may not be evenly divisible into blocks x threads per block), hence the bounds check

- The __global__ keyword defines a function as a kernel: it runs on the device and can be launched from the host
  - __device__ can be used for subroutines launched from the device and run on the device (compiled by nvcc rather than gcc)
- Note that recursion is supported only on newer devices (compute capability 2.x)
  - But you should avoid using it (you may see up to a 30% performance hit)
- Now, to run the kernel:

Launching the Kernel

    int kernel_launch(float *a_h, int N)
    {
        float *a_d;                      // pointer to device memory
        size_t size = N * sizeof(float);

        // allocate array on device
        cudaMalloc((void **)&a_d, size);

        // copy data from host to device
        cudaMemcpy(a_d, a_h, sizeof(float) * N, cudaMemcpyHostToDevice);

        // do calculation on device:
        // compute execution configuration
        int blockSize = 4;
        int nBlocks = N / blockSize + (N % blockSize == 0 ? 0 : 1);

        // kernel launch
        kernelfunction<<<nBlocks, blockSize>>>(a_d, N);

        // retrieve result from device
        cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);

        // cleanup
        cudaFree(a_d);
    }

- cudaMalloc, cudaMemcpy, and cudaFree work like you'd expect, except that they work on device memory (more on this later)
- You see the C-style function launch with arguments as usual, but sandwiched in between are angle brackets; these determine the structure of your threads
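A small sketch of the __global__ / __device__ split described above (the helper name is made up):

    // Sketch: a __device__ helper is callable only from device code;
    // the __global__ kernel that uses it is launched from the host.
    __device__ float plus_one(float x)
    {
        return x + 1.0f;
    }

    __global__ void kernelfunction_helper(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] = plus_one(a[idx]);
    }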
10. Alternatively, you can allocate your shared memory statically inside the kernel

- __shared__ float data[DATA_SIZE];
- But the data size needs to be known at compile time
- Useful if the amount of shared memory required by a block is the same for all blocks, like in the matrix multiply example

CUDA Boilerplate & Utility

- CUDA-gdb is installed for debugging CUDA programs
  - Check the toolkit
- It allows for realtime debugging of a CUDA application on GPU hardware; it should be very similar to GDB
- See the user manual for instructions
- Including cudart.h provides you with simple device functionality checks
- It's good practice to use CUDA_INIT(argc, argv) at the beginning of your program to check the device
- Use CUDA_EXIT(argc, argv) when you're done
- The cutil.h header provides several utility functions for error checking
- It's good practice to wrap your cudaMalloc, cudaMemcpy, and other calls with the cudaSafeCall function
  - It checks whether an error occurs when calling that function
- In your C/C++ code you can use cudaGetLastError and cudaPeekAtLastError to get error data after any synchronous CUDA call
  - For asynchronous calls like kernel launches, call cudaThreadSynchronize and then check for errors
  - cudaThreadSynchronize blocks until the device has completed all calls, including kernel calls, and returns an error if something failed
- Use cudaGetErrorString to translate an error into something readable
- In all the previous examples we have used explicit copying to move memory from host to device
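If cutil.h isn't handy, a minimal hand-rolled equivalent of cudaSafeCall might look like this sketch (the macro name is ours, not NVIDIA's):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Sketch of a cudaSafeCall-style wrapper: check the return value of any
    // synchronous CUDA runtime call and bail out with a readable message.
    #define CUDA_CHECK(call)                                             \
        do {                                                             \
            cudaError_t err = (call);                                    \
            if (err != cudaSuccess) {                                    \
                fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                        __FILE__, __LINE__, cudaGetErrorString(err));    \
                exit(EXIT_FAILURE);                                      \
            }                                                            \
        } while (0)

    // Usage:
    //   CUDA_CHECK(cudaMalloc((void **)&a_d, size));
    //   CUDA_CHECK(cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice));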
11. Matrix Multiply Example (continuing the code from section 5)

    // Loop over the sub-matrices of A and B
    for (int m = 0; m < (A.width / BLOCK_SIZE); ++m) {
        // Get sub-matrix Asub of A
        Matrix Asub = GetSubMatrix(A, blockRow, m);

        // Get sub-matrix Bsub of B
        Matrix Bsub = GetSubMatrix(B, m, blockCol);

        // Shared memory used to store Asub and Bsub respectively
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load Asub and Bsub from device memory to shared memory;
        // each thread loads one element of each sub-matrix
        As[row][col] = GetElement(Asub, row, col);
        Bs[row][col] = GetElement(Bsub, row, col);

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int e = 0; e < BLOCK_SIZE; ++e)
            Cvalue += As[row][e] * Bs[e][col];

        // Synchronize to make sure that the preceding computation is done
        // before loading two new sub-matrices of A and B in the next iteration
        __syncthreads();
    }

    // Each thread writes one element of the result
    SetElement(Csub, row, col, Cvalue);

- You can add a parameter to the kernel launch: <<<gridDim, blockDim, sharedBytes>>>
  - This allocates a certain number of shared bytes that you can access with something like: extern __shared__ float data[];
  - You can only do it once, so if you want multiple shared items dynamically allocated, you have to do:

        extern __shared__ float data[];
        float  *d1 = data;
        int    *d2 = (int *)&data[32];     // if you want d1 to hold 32 floats
        double *d3 = (double *)&data[64];  // if d2 holds 32 ints

  - Watch out for memory alignment
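A brief usage sketch of the dynamic shared-memory parameter (the kernel and names are illustrative):

    // Sketch: request blockDim.x floats of dynamic shared memory per block.
    __global__ void copy_to_shared(const float *a)
    {
        extern __shared__ float data[];    // sized by the third launch parameter
        data[threadIdx.x] = a[blockIdx.x * blockDim.x + threadIdx.x];
        __syncthreads();
        // ... operate on data[] here ...
    }

    void launch(const float *d_a, int nBlocks, int blockSize)
    {
        size_t sharedBytes = blockSize * sizeof(float);
        copy_to_shared<<<nBlocks, blockSize, sharedBytes>>>(d_a);
    }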
12. Pinning: in CUDA 4 you can map host memory to the device to avoid a copy, with the drawback of using higher-latency memory

    float *a = (float *)malloc(sizeof(float) * 64);
    float *d_a;

    cudaHostRegister(a, sizeof(float) * 64, cudaHostRegisterMapped);
    cudaHostGetDevicePointer((void **)&d_a, (void *)a, 0);

    mykernel<<<2, 32>>>(d_a, 64);

    cudaHostUnregister(a);
    free(a);

Grids, blocks, threads

- Blocks must be independent, but threads do not have to be; they can synchronize and share memory
- Threads have local memory, blocks have shared memory, and the grid has global memory (plus constant and texture memory)
- There is a host (the CPU) and a device (the GPU); the model assumes that threads execute separately on these devices
- CUDA code compiles down to PTX assembly; you can also write directly in PTX, but we won't
- The nvcc compiler turns the PTX into binary code
- -arch=sm_10 means compute capability 1.0 (we have 2.1, for Fermi)
- You can use the __CUDA_ARCH__ macro to switch between different code paths depending on your architecture
- Use __global__ for your kernel and __device__ for your subroutines
- Kernels can only operate on device memory (i.e., GPU memory), so you have to allocate it for the kernels, either as linear memory or as CUDA arrays
- CUDA arrays are for texture stuff, which we won't use much
- Linear memory is in a 32-bit address space for compute 1.x and a 40-bit space for compute 2.x, so separately allocated regions can still reference each other via pointers
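A small sketch of the __CUDA_ARCH__ switch mentioned above (the kernel is illustrative):

    #include <cstdio>

    // Sketch: __CUDA_ARCH__ is only defined while nvcc compiles device code,
    // and its value reflects the target compute capability.
    __global__ void debug_kernel(float *a)
    {
    #if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
        // Device-side printf is only available on compute capability 2.0+.
        if (threadIdx.x == 0 && blockIdx.x == 0)
            printf("first element = %f\n", a[0]);
    #endif
        // ... the rest of the kernel works on any architecture ...
    }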
