A guide to debugging in serial and parallel
Contents
1. Look at the source code listing (very handy when isolating an IEEE exception)
   Line-by-line execution
   Insert stops or "breakpoints" at certain functional points (i.e., when critical values change)
   Ability to monitor variable values
   Look at the stack trace (or "backtrace") when code crashes

   Common attributes:
   Divided into command-line or graphical user interfaces
   Usually have to recompile your code (-g is almost a standard option to enable debugging) to utilize most debugger features
   Invocation by name of debugger and executable (e.g., gdb a.out core)

   M. D. Jones, Ph.D. (CCR/UB), Debugging in Serial & Parallel, HPC-I Fall 2013

   Command line debugging example. Consider the following code example:

     1 #include <stdio.h>
     2 #include <stdlib.h>
     3
     4 int indx;
     5
     6 void initArray(int nelem_in_array, int *array);
     7 void printArray(int nelem_in_array, int *array);
     8 int *squareArray(int nelem_in_array, int *array);
    ...
    25 /* Print the elements of array1 */
    26 printf("array1 = ");
    27 printArray(nelem, array1);
    28
    29 /* Copy array1 to array2 */
    30 array2 = array1;
    31
    32 /* Pass array2 to the function squareArray */
    33 squareArray(nelem, array2);
    34
    35 /* Compute difference between elem
2.  19 nim2 = ni - 2
    20 njm2 = nj - 2

   Rules of Life. The rules in the game of life:
   - Any live cell with fewer than two neighbours dies, as if by loneliness.
   - Any live cell with more than three neighbours dies, as if by overcrowding.
   - Any live cell with two or three neighbours lives, unchanged, to the next generation.
   - Any dead cell with exactly three neighbours comes to life.
   An initial pattern is evolved by simultaneously applying the above rules to the entire grid, and subsequently at each "tick" of the clock.

    21 !
    22 ! time iteration
    23 !
    24 time_iteration: do n = 1, nsteps
    25    do j = 1, nj
    26       do i = 1, ni
    27          !
    28          ! periodic boundaries
    29          !
    30          im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
    31          ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
    32          jm = 1 + (j+njm2) - ((j+njm2)/nj)*nj   ! if j=1, nj
    33          jp = 1 + j - (j/nj)*nj                 ! if j=nj, 1
    34          !
    35          ! for each point, add surrounding values
    36          !
    37          nsum = old(im,jp) + old(i,jp) + old(ip,jp) &
    38               + old(im,j )             + old(ip,j ) &
    39               + old(im,jm) + old(i,jm) + old(ip,jm)
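The four rules above collapse into a three-way decision on the neighbour count. The following is a minimal sketch in C, not the course's Fortran code; the function name next_state is hypothetical, and cells are assumed to be stored as 0 (dead) or 1 (alive).

```c
#include <assert.h>

/* Next state of one cell, given its current state (0 = dead, 1 = alive)
   and the number of live cells among its eight neighbours.
   next_state is an illustrative name, not part of the original example. */
int next_state(int alive, int nsum)
{
    assert(nsum >= 0 && nsum <= 8);
    if (nsum == 3)
        return 1;      /* birth, or survival with three neighbours */
    if (nsum == 2)
        return alive;  /* two neighbours: state is unchanged */
    return 0;          /* loneliness (< 2) or overcrowding (> 3) */
}
```

Writing the rule as a pure function of (alive, nsum) keeps it separate from the neighbour-index arithmetic, which is where indexing bugs tend to hide.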
3. 1024
    pipe size            (512 bytes, -p) 8
    POSIX message queues     (bytes, -q) 819200
    real-time priority              (-r) 0
    stack size              (kbytes, -s) unlimited
    cpu time               (seconds, -t) 900
    max user processes              (-u) 1024
    virtual memory          (kbytes, -v) unlimited
    file locks                      (-x) unlimited
   (for bash syntax)

   Core File Example
   Ok, so now we can use one of our previous examples and generate a core file:

    [rush:~/d_debug]$ ulimit -c unlimited
    [rush:~/d_debug]$ gcc -g -o findprimes_orig findprimes_orig.c
    [rush:~/d_debug]$ ./findprimes_orig
    enter upper bound
    20
    Segmentation fault (core dumped)
    [rush:~/d_debug]$ ls -l core*
    1 jonesm ccrstaff 196608 Sep 16 13:22 core.38729

   This particular core file is not at all large (it is a very simple code, though, with very little stored data); generally the core file size will reflect the size of the application in terms of its memory use when it crashed. Analyzing it is pretty much like we did when running this example live in gdb:

    [rush:~/d_debug]$ gdb -quiet findprimes_orig core.38729
    Reading symbols from /ifs/user/jonesm/d_debug/findprimes_orig...done.
    [New Thread 38729]
    Co
4. Using Serial Debuggers in Parallel
   Yes, you can certainly run debuggers designed for use in sequential codes in parallel. They are even quite effective. You may just have to jump through a few extra hoops to do so.

   Attaching GDB
   [Transcript: ps -u jonesm on rush lists sshd, bash, mpirun, mpiexec.hydra, pmi_proxy and two pp.gdb processes; gdb is then attached to one of them.]

    [rush:~/d_hw/d_pp]$ gdb -quiet pp.gdb -p 34517
    Reading symbols from /ifs/user/jonesm/d_hw/d_pp/pp.gdb...done.
    Attaching to program: /ifs/user/jonesm/d_hw/d_pp/pp.gdb, process 34517

   Of course, unless you put an explicit waiting point inside your code, the processes are probably happily running along when you attach to them, and you will likely want to exert some control over that.

   First, using our above example, I was running two MPI tasks on the CCR cluster front end. After attaching gdb to each process
5. Attaching GDB

    [rush:~/d_hw/d_pp]$ gdb -quiet pp.gdbwait -p 80444
    Reading symbols from /ifs/user/jonesm/d_hw/d_pp/pp.gdbwait...done.
    Attaching to program: /ifs/user/jonesm/d_hw/d_pp/pp.gdbwait, process 80444
    ...
    pp () at pp.f90:42
    42 do while (gdbWait /= 1)
    (gdb) set gdbWait=1
    (gdb) c
    Continuing.

    [rush:~/d_hw/d_pp]$ gdb -quiet pp.gdbwait -p 80445
    Reading symbols from /ifs/user/jonesm/d_hw/d_pp/pp.gdbwait...done.
    Attaching to program: /ifs/user/jonesm/d_hw/d_pp/pp.gdbwait, process 80445
    ...
    pp () at pp.f90:42
    42 do while (gdbWait /= 1)
    (gdb) set gdbWait=1
    (gdb) c
    Continuing.

   Using GDB Within MPI Task Launcher
   Last, but not least, you can usually launch gdb through your MPI task launcher. For example, using the Intel MPI task launcher, mpirun/mpiexec (note that this generally pauses at MPI_Init):

    [rush:~/d_hw/d_pp]$ mpirun -np 2 -gdb ./pp.gdb
    mpigdb: np = 2
    mpigdb: attaching to 22615 ./pp.gdb 07n05
    mpigdb: attaching to 22616 ./pp.gdb 07n05
    [0,1] (mpigdb) list 40
    [0,1] 35 if (ierr /= 0) then
    [0,1] 36 print*, 'Unable to intialize MPI.'
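The gdbWait loop in the transcripts above is a deliberate "parking spot" that holds each process until a debugger attaches and flips the flag. The following is a minimal sketch of the same idea in C, not the course's Fortran code; the function name wait_for_debugger and the max_seconds safety timeout are assumptions added for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hold the calling process until *flag becomes 1 (set from the debugger)
   or until max_seconds have elapsed. Returns the number of seconds waited.
   The flag is volatile so the compiler re-reads it on every pass.
   wait_for_debugger and max_seconds are illustrative, not from the original. */
int wait_for_debugger(volatile int *flag, int max_seconds)
{
    int waited = 0;

    assert(flag != NULL);
    fprintf(stderr, "pid %ld waiting for debugger\n", (long)getpid());
    while (*flag != 1 && waited < max_seconds) {
        sleep(1);
        waited++;
    }
    return waited;
}
```

With a global such as volatile int gdbWait = 0; passed to this function early in main, one would attach with gdb -p followed by the pid, issue set var gdbWait = 1 and then c. The timeout is a design choice so that a forgotten wait point cannot hang a batch job indefinitely.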
6. 4 4 98 6 1365672 00 01 12 nwchem openib i 9 9667 9633 9667 4 1 98 2 1370000 00 01 12 nwchem openib i 10 9668 9633 9668 4 5 98 7 1358960 00 01 13 nwchem openib i 11 9669 9633 9669 4 2 98 7 1352112 00 01 13 nwchem openib i 12 9670 9633 9670 4 6 98 7 1360200 00 01 13 nwchem openib i 13 9671 9633 9671 4 3 98 7 1359828 00 01 13 nwchem openib i 14 9672 9633 9672 4 7 98 7 1361228 00 01 13 nwchem openib i 15 9751 9749 9751 1 7 0 0 2136 00 00 00 sshd 16 9752 9751 9752 1 0 0 0 2040 00 00 00 bash 17 9828 9752 9828 1 5 0 0 1204 00 00 00 ps HPC I Fall 2013 69 90 M D Jones Ph D CCR UB Debugging in Serial amp Parallel Basic Parallel Debugging Process Checking Basic Parallel Debugging Process Checking or you can script it I called this script job_ps 1 bin sh 2 3 4 5 6 QST which squeue 7 if z QST then 8 9 exit 10 1 11 12 case in 14 1 jobid 1 16 esac 17 18 get node listing 19 20 nodelist SQST CCR UB 13 0 echo single SLURM_JOBID 15 echo single SLURM_JOBID job jobid 21 echo nodelist nodelist 22 if Snodelist 23 echo Job is not running 24 exit 25 fi echo ERROR no squeue in PATH Shell script to take a single argument ps command on each node in the job required required format then yet PEA Si Slurm job id ex ex PATH S PATH IE Ji it j
7. array1 are:
    0 0 0 0 0 0 0 0 0 0
    *** glibc detected *** ./array-ex: double free or corruption (fasttop): 0x0000000001cc7010 ***
    ======= Backtrace: =========
    /lib64/libc.so.6[0x3e1be760e6]
    ./array-ex[0x400710]
    /lib64/libc.so.6(__libc_start_main+0xfd)[0x3e1be1ecdd]
    ./array-ex[0x4004d9]
    ======= Memory map: ========

   Not exactly what we expect, is it? Array2 should contain the squares of the values in array1, and therefore the difference should be i^2 - i for i = 2, ..., 11.

    (gdb) s
    squareArray (nelem_in_array=10, array=0x601010) at array-ex.c:59
    59 for (indx = 0; indx < nelem_in_array; indx++) {
    (gdb) p indx
    $1 = 10
    (gdb) s
    60 array[indx] *= array[indx];
    (gdb) p indx
    $2 = 0
    (gdb) display indx
    1: indx = 0
    (gdb) display array[indx]
    2: array[indx] = 2
    (gdb) s
    59 for (indx = 0; indx < nelem_in_array; indx++) {
    2: array[indx] = 4
    1: indx = 0
    (gdb) s
    60 array[indx] *= array[indx];
    2: array[indx] = 3
    1: indx = 1

   Ok, that is instructive, but no closer to finding the bug.

   So what have we learned so far about the command line debugger? Useful for peeking inside source code
8. integer arithmetic errors. And the context will be slightly more interesting (see, for example, Martin Gardner's article in Scientific American 223, pp. 120-123 (1970)).

   Game of Life
   The Game of Life is one of the better known examples of cellular automata (CA), namely a discrete model with a finite number of states, often used in theoretical biology, game theory, etc. The rules are actually pretty simple, and can lead to some rather surprising self-organizing behavior. The universe in the game of life:
   - The universe is an infinite 2D grid of cells, each of which is alive or dead.
   - Cells interact only with nearest neighbors (including on the diagonals, which makes for eight neighbors).

   Sample Code: Game of Life

     1 program life
     2 !
     3 ! Conway game of life (debugging example)
     4 !
     5 implicit none
     6 integer, parameter :: ni=1000, nj=1000, nsteps=100
     7 integer :: i, j, n, im, ip, jm, jp, nsum, isum
     8 integer, dimension(0:ni,0:nj) :: old, new
     9 real :: arand, nim2, njm2
    10 !
    11 ! initialize elements of "old" to 0 or 1
    12 !
    13 do j = 1, nj
    14    do i = 1, ni
    15       CALL random_number(arand)
    16       old(i,j) = NINT(arand)
    17    enddo
    18 enddo
9. jonesm 29923 29905 64 jonesm 29923 29905 65 jonesm 29924 29905 66 jonesm 29924 29905 67 jonesm 29924 29905 68 jonesm 29924 29905 69 jonesm 29925 29905 70 jonesm 29925 29905 71 jonesm 29925 29905 72 jonesm 29925 29905 73 jonesm 29926 29905 74 jonesm 29926 29905 75 jonesm 29926 29905 76 jonesm 29926 29905 77 jonesm 29927 29905 78 jonesm 29927 29905 79 jonesm 29927 29905 80 jonesm 29927 29905 81 jonesm 29928 29905 82 jonesm 29928 29905 83 jonesm 29928 29905 84 jonesm 29928 29905 85 jonesm 30009 30007 86 jonesm 30010 30009 CPU thread Usage LWP C NLWP 29706 29883 29889 29891 29892 29895 29888 29921 29958 29959 29967 29984 29922 29960 29961 29972 29923 29954 29955 29966 29924 29956 29957 29968 29925 29964 29965 29973 29926 29950 29951 29953 29927 29962 29963 29971 29928 29969 29970 29974 30009 30010 NeCCONCCON CC ON COON COO NCC ONCCON COC emMOCOCCCCO PES ARR RAO DD D DD R RRR OR A RRR B U U UT OT PO CCR UB STIME 17 01 17 01 17 01 17 01 17 01 17 01 17 01 17 01 GDB in Parallel CMD bin bash var spool slurmd job436749 slurm_script srun srun srun srun srun srun 2 n util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util nwchem nwchem 6 util
10. rather than interactively debug:
   - Your bug may take quite a while to manifest itself.
   - You have to debug inside a batch queuing system where interactive use is difficult or curtailed.
   - You want to capture a "picture" of the code state when it crashes.

   Run-time Compiler Checks
   Most compilers support run-time checks that can quickly catch common bugs. Here is a handy short list (contributions welcome!):
   - For Intel Fortran, "-check bounds -traceback -g" will automate bounds checking and enable extensive traceback analysis in case of a crash (leave out the bounds option to get a crash report on any IEEE exception, format mismatch, etc.).
   - For PGI compilers, "-Mbounds -g" will do bounds checking.
   - For GNU compilers, "-fbounds-check -g" should also do bounds checking, but is only currently supported for the Fortran and Java front ends.

   Run-time Compiler Checks (cont'd)
   It should be noted that run-time error checking can very much slow down a code's execution, so it is not something that you will want to use all of the time.
11. (gdb) l 16
    16 scanf("%d", &UpperBound);
    17
    18 Prime[2] = 1;
    19
    20 for (N = 3; N <= UpperBound; N += 2)
    21 CheckPrime(N);
    22 if (Prime[N]) printf("%d is a prime\n", N);
    23 ...
    void CheckPrime(int K)
    (gdb) b 20
    Breakpoint 1 at 0x40052d: file findprimes.c, line 20.
    (gdb) b 22
    Breakpoint 2 at 0x400550: file findprimes.c, line 22.
    (gdb) run
    Starting program: /ifs/user/jonesm/d_debug/findprimes
    enter upper bound
    20

    Breakpoint 1, main () at findprimes.c:20
    20 for (N = 3; N <= UpperBound; N += 2)
    (gdb) c
    Continuing.

    Breakpoint 2, main () at findprimes.c:22
    22 if (Prime[N]) printf("%d is a prime\n", N);
    (gdb) p N
    $1 = 21
    (gdb)

    ...
    13 is a prime
    17 is a prime
    19 is a prime

    Program exited with code 025.
    (gdb)

   Ah, the sweet taste of success (even better, give the program a return code).

   Debugging Life Itself
   Well, ok, not exactly debugging life itself; rather the game of life. Mathematician John Horton Conway's game of life, to be exact. This example will basically be similar to the prior examples, but now we will work in Fortran, and debug some
12. 1 37 STOP 9 0 1 38 end if 10 0 1 39 CALL MPI_COMM_RANK MPI_COMM_WORLD myid ierr 11 0 1 40 CALL MPI_COMM_SIZE MPI_COMM_ WORLD Nprocs ierr 12 0 1 41 dummy pause point for gdb insertion 13 0 1 42 do while gdbWait 1 14 0 1 43 lend do 15 0 1 44 if Nprocs 2 then 16 0 mpigdb c 17 0 1 Continuing 18 Hello from proc 0 of 2 07n05 19 Number Averaged for Sigmas 2 20 Hello from proc 1 of 2 07n05 HPC I Fall 2013 83 90 M D Jones Ph D CCR UB Debugging in Serial amp Parallel GDB in Parallel Using GDB Within MPI Task Launcher Using Serial Debuggers in Parallel So you can certainly use serial debuggers in parallel in fact itis a pretty handy thing to do Just keep in mind Don t forget to compile with debugging turned on e You can always attach to a running code and you can instrument the code with that purpose in mind Beware that not all task launchers are equally friendly towards built in support for serial debuggers M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC Fall 2013 85 90 1 0 1 mpigdb list 84 2 0 79 do i my_low my_high 2 3 0 80 partial_sum_p partial_sum p 1 0_dp 2 0_dp i 1 0_dp 4 0 1 81 partial_sum_m partial_sum_m 1 0_dp 2 0_dp i 1 0_dp 5 0 1 82 end do 6 0 1 83 partial_sum partial_sum_p partial_sum_m 7 0 84 CALL MPI_REDUCE partial_sum sum 1 MPI_DOUBLE_PRECISION MPI_SUM 0 am
13. Function CheckPrime (tail of the listing):

    38 Prime[K] = 0;
    39 return;
    ...
    44 /* if we get here, then there were no divisors of K, so it is
    ...

   so now if we compile and run this code ...

    (gdb) run
    Starting program: /ifs/user/jonesm/d_debug/findprimes_orig
    enter upper bound
    20

    Program received signal SIGSEGV, Segmentation fault.
    0x0000003e1be56ed0 in _IO_vfscanf_internal () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64
    (gdb) bt
    #0 0x0000003e1be56ed0 in _IO_vfscanf_internal () from /lib64/libc.so.6
    #1 0x0000003e1be646cd in __isoc99_scanf () from /lib64/libc.so.6
    #2 0x00000000004005a0 in main () at findprimes_orig.c:16

   Array Indexing Errors
   Now, the scanf intrinsic is probably pretty safe from internal bugs, so the error is likely coming from our usage:

    (gdb) list 16
    11 int main()
    12 {
    13 int N;
    14
    15 printf("enter upper bound\n");
    16 scanf("%d", UpperBound);
    17
    18 Prime[2] = 1;
    19
    20 for (N = 3; N <= UpperBound; N += 2)

   Yeah, pretty dumb: scanf needs a pointer argument, i.e. scanf("%d", &UpperBound), and that takes care of the first bug, but let's keep r
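The fix for the scanf call is to pass the address of the variable, and it costs little to also check the return value so that bad input is caught at the call site rather than later. The following is a minimal sketch under those assumptions; it uses sscanf on a string so it can be exercised without a terminal, and the helper name parse_upper_bound is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Parse an integer upper bound from text into *upper_bound.
   Returns 1 if exactly one integer was converted, 0 otherwise.
   parse_upper_bound is an illustrative name, not from the original code. */
int parse_upper_bound(const char *text, int *upper_bound)
{
    assert(text != NULL && upper_bound != NULL);
    /* Like scanf, sscanf stores through a pointer: pass the address,
       never the (uninitialized) value itself. */
    return sscanf(text, "%d", upper_bound) == 1;
}
```

Compiling with warnings enabled (for example gcc -Wall) will usually flag a mismatched scanf argument before the program is ever run.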
14. 36 /* Compute difference between elements of array2 and array1 */
    37 for (indx = 0; indx < nelem; indx++) {
    38 del[indx] = array2[indx] - array1[indx];
    39 }
    40
    41 /* Print the computed differences */
    42 printf("The difference in the elements of array2 and array1 are: ");
    (gdb) b 37
    Breakpoint 1 at 0x400611: file array-ex.c, line 37.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/array-ex
    array1 =
    2 3 4 5 6 7 8 9 10 11

   Now that isn't right: array1 was not supposed to change. Let us go back and look more closely at the call to squareArray ...

    (gdb) l
    32
    33 /* Pass array2 to the function squareArray */
    34 squareArray(nelem, array2);
    35
    36 /* Compute difference between elements of array2 and array1 */
    37 for (indx = 0; indx < nelem; indx++) {
    38 del[indx] = array2[indx] - array1[indx];
    39 }
    40
    41 /* Print the computed differences */
    (gdb) b 34
    Breakpoint 2 at 0x400605: file array-ex.c, line 34.

   Yikes, array1 and array2 point to the same memory location! See, pointer errors like this don't happen
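Assigning one pointer to another copies the address, not the elements, which is why squaring array2 also changed array1 and why freeing both later produced a double free. One way to get an independent copy is to allocate new storage and copy the contents. The following is a minimal sketch; the helper name copy_array is hypothetical and not part of the original example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of the first nelem ints of src,
   or NULL if allocation fails. The caller owns (and must free) the result.
   copy_array is an illustrative name. */
int *copy_array(const int *src, size_t nelem)
{
    int *dst;

    assert(src != NULL);
    dst = malloc(nelem * sizeof *dst);
    if (dst != NULL)
        memcpy(dst, src, nelem * sizeof *dst);  /* copy elements, not the pointer */
    return dst;
}
```

With a copy like this, each array has its own allocation, so each can be modified and freed exactly once.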
15. 667 9633 9667 4 1 98 6 10 9668 9633 9668 4 5 98 6 11 9669 9633 9669 4 2 98 9 12 9670 9633 9670 4 6 99 1 13 9671 9633 9671 4 3 98 8 14 9672 9633 9672 4 7 98 6 15 9921 9919 9921 dl 4 0 0 16 9922 9921 9922 dl 5 2 0 17 NODE d09n29s02 my CPU thread Usage 18 PID PETO LWP NLWP PSR CPU 19 27963 27959 27963 1 4 0 0 20 28145 27963 28145 5 3 0 0 21 28149 28145 28149 1 5 0 0 22 28182 28167 28182 5 0 97 5 23 28183 28167 28183 4 4 98 0 24 28184 28167 28184 4 1 98 5 25 28185 28167 28185 4 5 98 3 26 28186 28167 28186 4 2 98 4 27 28187 28167 28187 4 6 98 1 28 28188 28167 28188 4 3 98 6 29 28189 28167 28189 4 7 98 4 30 28372 28370 28372 1 3 0 0 31 28373 28372 28373 1 4 1 0 Debugging in Serial amp Parallel HPC I Fall 2013 26 27 28 29 30 31 32 33 34 35 36 37 38 nodelist nodeset e nodelist echo expanded nodelist nodelist define ps command MYPS ps aeLf awk if 5 gt 10 print 1 2 3 4 5 9 10 MYPS ps u jonesm L o pid ppid 1lwp nlwp psr pcpu rss time comm MYPS ps u jonesm Lf echo MYPS MYPS for node in S nodelist do echo NODE node my CPU thread Usage ssh node MYPS done D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC I Fall 20 M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC Fall 2013 rush projects jonesm d_nwchem d_siosi job_
16. Debugging in Serial & Parallel
   M. D. Jones, Ph.D.
   Center for Computational Research, University at Buffalo, State University of New York
   High Performance Computing I, 2013

   Part I: Basic Serial Debugging

   Software for Debugging: Debugging Tools
   The most common method for debugging (by far) is the "instrumentation" method: one "instruments" the code with print statements to check values and follow the execution of the program. Not exactly sophisticated. One can certainly debug code in this way, but wise use of software debugging tools can be more effective.

   Debugging tools are abundant, but we will focus merely on some of the most common attributes to give you a bag of tricks that can be used when dealing with common problems.

   Basic Capabilities: Running Within
   Inside a debugger (be it using a command-line interface (CLI) or graphical front end), you have some very handy abilities:
17. Prime int K prototype for CheckPrime function x k 2 x z 11 int main Frustrating Debugging which you can find easily enough on the web 12 l1 13 int N 14 i i 15 printf enter upper bound n http heather cs ucdavis edu matloff unix html 16 scanf a UpperBound 17 18 Prime 2 1 19 20 for N 3 N lt UpperBound N 2 21 CheckPrime N 22 if Prime N printf Sd is a prime n N 23 M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC I Fall 2013 27 90 M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC I Fall 2013 28 90 Array Indexing Errors Array Indexing Errors Function FindPrime 1 rush d_debug gcc g o findprimes_orig findprimes_orig c 24 2 rush d_debug findprimes_orig 25 void CheckPrime int K 3 enter upper bound 26 int J 4 20 27 5 Segmentation fault core dumped 28 the plan see if J divides K for all values J which are 6 rush d_debug ulimit c 29 a themselves prime no need to try J if it is nonprime and 7 O 30 b less than or equal to sqrt K if K has a divisor larger 31 than this square root it must also have a smaller one 5 A 32 so no need to check for larger ones Ok let s fire up gdb and see where this code crashed 33 34 a T 2i 1 rush d_debug gdb quiet findprimes_orig 35 mela 1 2 Reading symbols from ifs user jonesm d_debug findprimes_orig done 36 af Prime J 1
18. asier.

   More Information on Debuggers
   More information on the tools that we have used/mentioned (man pages are also a good place to start):
   - gdb User Manual: http://sources.redhat.com/gdb/current/onlinedocs/gdb_toc.html
   - ddd User Guide: http://www.gnu.org/manual/ddd/pdf/ddd.pdf
   - idb Manual: http://www.intel.com/software/products/compilers/docs/linux/idb_manual_l.html
   - pgdbg Guide (locally on CCR systems): file:///util/pgi/linux86-64/[version]/doc/index.htm

   Source Code Checking Tools
   Now, in a completely different vein, there are tools designed to help identify errors pre-compilation, namely by analyzing the source code itself:
   - splint is a tool for statically checking C programs: http://www.splint.org
   - ftnchek is a tool that checks only (alas) FORTRAN 77 codes: http://www.dsm.fordham.edu/~ftnchek
   I can't say that I have found these to be particularly helpful, though.

   Strace
   Memory Allocation Too
19. break: Breakpoints
   s: Stepping through execution
   p: Print values at selected points (can also use handy printf syntax as in C)
   display: Displaying values for monitoring while stepping through code
   bt: Backtrace, or "Stack Trace"; haven't used this yet, but certainly will

    Breakpoint 1, main () at array-ex.c:37
    37 for (indx = 0; indx < nelem; indx++) {
    (gdb) disp indx
    1: indx = 10
    (gdb) disp array1[indx]
    2: array1[indx] = 49
    (gdb) disp array2[indx]
    3: array2[indx] = 49
    (gdb) s
    38 del[indx] = array2[indx] - array1[indx];
    3: array2[indx] = 4
    2: array1[indx] = 4
    1: indx = 0
    (gdb) s
    37 for (indx = 0; indx < nelem; indx++) {
    3: array2[indx] = 4
    2: array1[indx] = 4
    1: indx = 0
    (gdb) s
    38 del[indx] = array2[indx] - array1[indx];
    3: array2[indx] = 9
    2: array1[indx] = 9
    1: indx = 1

   Digging Out the Bug
   What we have learned is enough: look more closely at the line where the differences between array1 and array2 are computed:

    (gdb) l 38
    33 /* Pass array2 to the function squareArray */
    34 squareArray(nelem, array2);
    3
20. Debugging Life Itself: Game of Life

    40 !
    41 ! set new value based on number of "live" neighbors
    42 !
    43 select case (nsum)
    44 case (3)
    45    new(i,j) = 1
    46 case (2)
    47    new(i,j) = old(i,j)
    48 case default
    49    new(i,j) = 0
    50 end select
    51 enddo
    52 enddo
    53 !
    54 ! copy new state into old state
    55 !
    56 old = new
    57 print*, 'Tick ', n, ' number of living: ', sum(new)
    58 enddo time_iteration
    ...
    62 print*, ' number of live points = ', sum(new)
    63 end program life

   Initial Run

    [bono:~/d_debug]$ ifort -g -o life life.f90
    [bono:~/d_debug]$ ./life
    Tick 1 number of living: 342946
    Tick 2 number of living: 334381
    Tick 3 number of living: 291022
    Tick 4 number of living: 263356
    Tick 5 number of living: 290940
    Tick 6 number of living: 322733
    ...
    Tick 99 number of living: 0
    Tick 100 number of living: 0
    number of live points = 0

   Hmm, everybody dies! What kind of life is that? Well, not a correct one, in this context, at least. Undoubtedly the problem lies within the neighbor calculation, so let us take a closer look at the execution ...

   Ok, so therein lay the problem: nim2 and n
21. e that our original example code missed initializing the first ae oe element of the array and the results were rather erratic in fact they will KEE evensum arr til likely be compiler and flag dependent 11 20 odd_sum arr i 12 gdb b 16 F i sy ee ee ee Initialization is just one aspect of things going wrong with array ge eget program Pu Ged nen jaresm d Hebug Ex2 indexing let us examine another common problem 17 Breakpoint 1 main argc Variable argc is not available 18 at ex2 c 16 19 16 for i 0 i lt N 1 i 20 gdb p arr 21 1 671173696 1 1 0 1 0 1 4 4 0 M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC I Fall 2013 M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC I Fall 2013 26 90 Array Indexing Errors Array Indexing Errors The Infamous Seg Fault Main code findprimes c 1 2 prime number finding program will after bugs are fixed report a list of 3 all primes which are less than or equal to the user supplied upper bound 4 5 include lt stdio h gt 6 7 define MaxPrimes 50 This example borrowed from Norman Matloff UC Davis who has a 8 int Prime MaxPrimes Prime I will be 1 if I is prime 0 otherwise i 9 UpperBound we will check up through UpperBound for primeness nice article well worth the time to read Guide to Faster Less 10 void Check
22. ents of array2 and arrayl 9 36 for indx 0 indx lt nelem indx 10 int main void 37 delfindx array2 indx arrayl indx 11 const int nelem 10 38 12 int arrayl array2 del 39 13 40 Print the computed differences 14 Allocate memory for each array 41 printf The difference in the elements of array2 and arrayl are 15 arrayl int malloc nelem sizeof int 42 printArray nelem del 16 array2 int malloc nelem sizeof int 43 17 del int malloc nelem sizeof int 44 free arrayl 18 45 free array2 19 Initialize arrayl 46 free del 20 initArray nelem arrayl 47 return 0 21 48 22 for indx 0 indx lt nelem indx 23 arrayl indx indx 2 24 M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC I Fall 2013 8 90 M D Jones Ph D CCR UB Debugging in Serial amp Parallel HPC I Fall 2013 9 90 Introduction Software for Debugging 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 OANOaARWN TAO On Oe ie ee ae ee eee eee NHCOONDARWN O void initArray const int nelem_in_array int x array for indx 0 indx lt nelem_in_array indx array indx indx 1 int squareArray const int nelem_in_array int array int indx for indx 0 indx lt nelem_in_array indx t array indx array indx return array void printArray cons
23. es as they run (useful flags: ps -u -L), even on remote nodes (rsh/ssh into them).

   Whither Goest the GUI?
   Using a GUI-based debugger gets considerably more difficult when dealing with debugging an MPI-based parallel code (not so much on the OpenMP side), due to the fact that you are now dealing with multiple processes scattered across different machines.
   The TotalView debugger is the premier product in this arena (it has both CLI and GUI support), but it is very expensive and not present in all environments. We will start out using our same toolbox as before, and see that we can accomplish much without spending a fortune. The methodologies will be equally applicable to the fancy commercial products.

   Process Checking Example

    [rush:/projects/jonesm/d_nwchem/d_siosi6]$ squeue --user jonesm
    JOBID  PARTITION NAME   USER   ST TIME NODES NODELIST(REASON)
    436728 debug     siosi6 jonesm R  0:23 2     d09n29s02,d16n02
    [rush:/projects/jonesm/d_nwchem/d_siosi6]$ ssh d16n02
    [d16n02:~]$ ps -u jonesm -o pid,ppid,lwp,nlwp,psr,pcpu,rss,time,comm
    PID  PPID LWP  NLWP PSR %CPU RSS     TIME     COMMAND
    9665 9633 9665 5    0   98.4 1722040 00:01:12 nwchem-openib-i
    9666 9633 9666
24. g Life Itself: Game of Life
   Diversion: Demo life
   http://www.radicaleye.com/lifepage
   http://en.wikipedia.org/wiki/Conway's_Game_of_Life
   Interesting repository of Conway's life and cellular automata references.

   Core Files
   Core files can also be used to instantly analyze problems that caused a code failure bad enough to "dump" a core file. Often the computer system has been set up in such a way that the default is not to output core files, however:

    [rush:~/d_debug]$ ulimit -a
    core file size          (blocks, -c) 0
    data seg size           (kbytes, -d) unlimited
    scheduling priority             (-e) 0
    file size               (blocks, -f) unlimited
    pending signals                 (-i) 2066355
    max locked memory       (kbytes, -l) 33554432
    max memory size         (kbytes, -m) unlimited
    open files                      (-n)

   Systems administrators set the core file size limit to zero by default for a good reason: these files generally contain the entire memory image of an application process when it dies, and that can be very large. End users are also notoriously bad about leaving these files lying around. Having said that, we can up the limit and produce a core file that can later be used for analysis.
25. gging Miscellany: Serial Debugging GUIs
   DDD Example
   Running one of our previous examples using ddd ...
   [Figure: ddd window showing the findprimes source (main and CheckPrime) above the gdb console, which reports "Program received signal SIGSEGV, Segmentation fault" in CheckPrime (K=3).]

   Serial Debugging GUIs
   There are, of course, a matching set of GUIs for the various debuggers. A short list:
   - ddd: a graphical front end for the venerable gdb
   - pgdbg: GUI for the PGI debugger
   - idb -gui: GUI for the Intel compiler suite debugger
   It is very much a matter of preference whether or not to use the GUI. I find the GUI to be constraining, but it does make navigation e
26. i aN retry later 7 F E no args exit too many args and launch a exit ail 1 awk print 2p Debugging in Serial amp Parallel HPC I Fall 2013 Basic Parallel Debugging Process Checking OANONAARWN M D Jones Ph D CCR UB RSS 1748340 1479024 1479352 1466844 1461372 1474016 1470640 1474296 00 00 00 00 00 00 00 00 rush projects jonesm d_nwchem d_siosi6 job_ps 436728 nodelist d09n29s02 d16n02 expanded nodelist dl6n02 d09n29s02 MYPS ps u jonesm o pid ppid lwp nlwp psr pcpu rss time comm NODE d16n02 my CPU thread Usage TIME COMMAND 03 03 03 03 03 03 03 03 2132 00 00 00 1204 00 00 00 RSS TIME 1396 00 00 00 7024 00 00 00 800 00 00 00 1750904 1477128 1472524 1456200 1488400 1459120 1470960 1465752 00 00 00 00 00 00 00 00 03 03 03 203 03 03 2 0 3 2 03 32 33 33 33 33 34 33 33 s P Cc S S S 32 33 34 34 34 33 35 34 nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i shd s OMMAND lurm script run run nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i nwchem openib i 2148 00 00 00 sshd 1204 00 00 00 ps PID PPID LWP NLWP PSR CPU 9665 9633 9665 5 0 98 2 9666 9633 9666 4 4 98 7 9
27. ib impi siosi6 incore nw

[Continuation of the ps -u jonesm -Lf listing. The CMD column is the same on every row and is omitted from the table; with its punctuation lost in extraction it reads: util nwchem nwchem 6 1 1 bin nwchem openib impi siosi6 incore nw]

    UID     PID    PPID   LWP    C   NLWP  STIME  TTY  TIME
    jonesm  11418  11382  11455  0   4     17:01  ?    00:00:00
    jonesm  11419  11382  11419  99  4     17:01  ?    00:00:59
    jonesm  11419  11382  11449  0   4     17:01  ?    00:00:00
    jonesm  11419  11382  11450  0   4     17:01  ?    00:00:00
    jonesm  11419  11382  11458  0   4     17:01  ?    00:00:00
    jonesm  11420  11382  11420  98  4     17:01  ?    00:00:59
    jonesm  11420  11382  11451  0   4     17:01  ?    00:00:00
    jonesm  11420  11382  11452  0   4     17:01  ?    00:00:00
    jonesm  11420  11382  11457  0   4     17:01  ?    00:00:00
    jonesm  11421  11382  11421  99  4     17:01  ?    00:00:59
    jonesm  11421  11382  11447  0   4     17:01  ?    00:00:00
    jonesm  11421  11382  11448  0   4     17:01  ?    00:00:00
    jonesm  11421  11382  11459  0   4     17:01  ?    00:00:00
    jonesm  11422  11382  11422  99  4     17:01  ?    00:00:59  u
28. jm2 should be integers, not real values; fix that:

    program life
    !
    ! Conway game of life (debugging example)
    !
    integer, parameter :: ni=1000, nj=1000, nsteps=100
    integer :: [variable list garbled in extraction], nim2, njm2
    integer, dimension(0:ni,0:nj) :: old, new
    real :: arand

The gdb session that exposed the problem:

    (gdb) b 25
    Breakpoint 1 at 0x402e23: file life.f90, line 25.
    (gdb) run
    Starting program: /ifs/user/jonesm/d_debug/life

    Breakpoint 1, life () at life.f90:25
    25        do j = 1, nj
    Current language: auto; currently fortran
    (gdb) s
    26        do i = 1, ni
    (gdb) s
    30        im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
    (gdb) s
    31        ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
    (gdb) print im
    $1 = 1
    (gdb) print (i+nim2)/1000
    $2 = 0.999

With i = 1 the neighbor index im should wrap around to ni = 1000, but it comes out as 1: the quotient (i+nim2)/ni is 0.999 instead of being truncated to 0, because nim2 is a real rather than an integer.

and things become a bit more reasonable:

    [bono d_debug]$ ifort -g -o life life.f90
    [bono d_debug]$ ./life
     Tick 1 number of living: 272990
     Tick 2 number of living: 253690
     ...
     Tick 99 number of living: 95073
     Tick 100 number of living: 94664
     number of live points = 94664

Debuggin
29. ls

Memory allocation problems are very common; there are some tools designed to help you catch such errors at run time:
- efence, or Electric Fence, tries to trap any out-of-bounds references (see man efence)
- valgrind is a suite of tools for analyzing and profiling binaries (see man valgrind); there is a user manual available at file:///usr/share/doc/valgrind-3.8.1/html/manual.html
- valgrind I have seen used with good success, but not particularly in the HPC arena

Other Debugging Miscellany: Source Code Checking Tools

Strace

strace is a powerful tool that will allow you to trace all system calls and signals made by a particular binary, whether or not you have source code.
- Can be attached to already running processes
- A powerful low-level tool
- You can learn a lot from it, but it is often a tool of last resort for user applications in HPC due to the copious quantity of extraneous information it outputs

Strace Example

As an example of using strace, let's peek in on a running MPI process (part of a 32-task job on U2):

    [c06n15]$ ps -u jonesm -Lf
    UID     PID    PPID   LWP    C   NLWP  STIME  TTY  TIME      CMD
    jonesm  23964  16284  23964  92  2     14:34  ?    00:04:11  util nwchem nwchem 5 0 bin
    jonesm  23964  16284  23965  99  2     14:34  ?    00:04:30  util nwchem nwchem 5 0 bi
30. n
    jonesm  23987  23986  23987  0   1     14:37  pts/0  00:00:00  bash
    jonesm  24128  23987  24128  0   1     14:39  pts/0  00:00:00  ps -u jonesm -Lf
    [c06n15]$ strace -p 23965
    Process 23965 attached - interrupt to quit
    ...
    lseek(45, 691535872, SEEK_SET) = 691535872
    read(45, [binary data], 524288) = 524288
    gettimeofday({1161107631, 126604}, {240, 1161107631}) = 0
    gettimeofday({1161107631, 128553}, {240, 1161107631}) = 0
    ...
    select(47, [arguments garbled in extraction]) = 2 (in [4], out [4])
    write(4, [binary data], 2932) = 2932
    writev(4, [binary data, garbled in extraction]

Part II: Advanced Parallel Debugging

Basic Parallel Debugging: Process Checking

First on the agenda: parallel processing involves multiple processes, threads, or both, and the first rule is to make sure that they are ending up where you think that they should be (needless to say, all too often they do not).
- Use MPI_Get_processor_name to report back on where processes are running
- Use ps to monitor process
31. [Listing residue: the CMD column of a ps -u jonesm -Lf listing, spilled out of its table and repeated once per thread. With punctuation lost in extraction, each entry reads: util nwchem nwchem 6 1 1 bin nwchem openib impi. The two remaining entries are sshd: jonesm@notty and ps -u jonesm -Lf. No other columns are recoverable from this excerpt.]
32. ortran in particular lets you use very complex indexing schemes (essentially arbitrary)

    array1 = 2 3 4 5 6 7 8 9 10 11
    The difference in the elements of array2 and array1 are: 2 6 12 20 30 42 56 72 90 110
    Program exited normally.
    (gdb)

Example: Indexing Error

    #include <stdio.h>
    #define N 10
    int main(int argc, char *argv[]) {
        int arr[N];
        int i, odd_sum, even_sum;
        [lines 6 to 9 garbled in extraction]
            else
                arr[i] = (i*i) % 5;
        [lines 12 to 16 garbled in extraction]
            if (i % 2 == 0)
                even_sum += arr[i];
            else
                odd_sum += arr[i];
        [lines 21 to 22 garbled in extraction]
        printf("odd_sum=%d, even_sum=%d\n", odd_sum, even_sum);
    }

Now try compiling with gcc and running the code:

    odd_sum=5, even_sum=671173703

Ok, that hardly seems reasonable, does it? Now let's run this example from within gdb and set a breakpoint to examine the accumulation of values to even_sum.

Array Indexing Errors

    (gdb) l 16
    11            arr[i] = (i*i) % 5;
    12
    13        [remainder of listing garbled in extraction]

So we se
33. p
    0:    85         MPI_COMM_WORLD, ierr)
    0-1:  86         t1 = MPI_Wtime()
    0-1:  87         time_delta = time_delta + (t1 - t0)
    0:    88      end do
    0:    (mpigdb) b 83
    0-1:  Breakpoint 1 at 0x401161: file pi-mpi.f90, line 83.
    0-1:  (mpigdb) run
    0-1:  Continuing.
     Greetings from proc 0 of 2 07n05
     Nterms Nperproc Nreps error time/rep
     Greetings from proc 1 of 2 07n05
    0-1:
    0-1:  Breakpoint 1, pimpi () at pi-mpi.f90:83
    0-1:  83   partial_sum = partial_sum_p [operator lost in extraction] partial_sum_m
    0-1:  (mpigdb) p my_low
    0:    $1 = 1
    1:    $1 = 65
    0-1:  (mpigdb) p my_high
    0:    $2 = 64
    1:    $2 = 128

GUI-based Parallel Debugging: TotalView

The TotalView Debugger

The "premier" parallel debugger, TotalView:
- Sophisticated commercial product (think many $$)
- Designed especially for HPC: multi-process, multi-thread
- Has both GUI and CLI
- Supports C/C++/Fortran 77/90/95 (and mixtures thereof)
- The "official" debugger of DOE's Advanced Simulation and Computing (ASC) program

GUI-based Parallel Debugging: DDT

The DDT Debugger

Allinea's commercial parallel debugger, DDT

Using TotalView at CCR

Pretty simple to start using TotalView on the CCR systems. Generally you want to load the latest
34. ps 436749
    nodelist: d09n29s02 d16n02
    expanded nodelist: d16n02 d09n29s02
    MYPS: ps -u jonesm -Lf
    NODE d16n02, my CPU/thread usage:

[The CMD column is the same on every row and is omitted from the table; with its punctuation lost in extraction it reads: util nwchem nwchem 6 1 1 bin nwchem openib impi siosi6 incore nw]

    UID     PID    PPID   LWP    C   NLWP  STIME  TTY  TIME
    jonesm  11416  11382  11416  98  5     17:01  ?    00:00:59
    jonesm  11416  11382  11441  0   5     17:01  ?    00:00:00
    jonesm  11416  11382  11442  0   5     17:01  ?    00:00:00
    jonesm  11416  11382  11454  0   5     17:01  ?    00:00:00
    jonesm  11416  11382  11465  0   5     17:01  ?    00:00:00
    jonesm  11417  11382  11417  99  4     17:01  ?    00:00:59
    jonesm  11417  11382  11445  0   4     17:01  ?    00:00:00
    jonesm  11417  11382  11446  0   4     17:01  ?    00:00:00
    jonesm  11417  11382  11460  0   4     17:01  ?    00:00:00
    jonesm  11418  11382  11418  99  4     17:01  ?    00:00:59
    jonesm  11418  11382  11439  0   4     17:01  ?    00:00:00
    jonesm  11418  11382  11440  0   4     17:01  ?    00:00:00  util nwchem nwchem 6 1 1 bin nwchem open
35. re was generated by `findprimes_orig'.
    Program terminated with signal 11, Segmentation fault.
    #0  0x0000003e1be56ed0 in _IO_vfscanf_internal () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64
    (gdb) bt
    #0  0x0000003e1be56ed0 in _IO_vfscanf_internal () from /lib64/libc.so.6
    #1  0x0000003e1be646cd in __isoc99_scanf () from /lib64/libc.so.6
    #2  0x00000000004005a0 in main () at findprimes_orig.c:16
    (gdb) l 16
    11      int main()
    12      {
    13        int N;
    14
    15        printf("enter upper bound\n");
    16        scanf("%d", UpperBound);
    17
    18        Prime[2] = 1;
    19
    20        for (N = 3; N <= UpperBound; N += 2)

Other Debugging Miscellany: More Command-line Debugging Tools

More Command-line Debuggers

We focused on gdb, but there are command-line debuggers that accompany just about every available compiler product:
- pgdbg: part of the PGI compiler suite; defaults to a GUI, but can be run as a command line interface (CLI) using the -text option
- idb: part of the Intel compiler suite; defaults to CLI (has a special option, -gdb, for using gdb command syntax)

Other Debugging Miscellany: Core Files

Summary on Core Files

So why would you want to use a core file
36. rn

Array Indexing Errors

Fixing the last bug:

    (gdb) list 40
    35        J = 2;
    36        /* while (1) { */
    37        for (J = 2; J*J <= K; J++)
    38          if (Prime[J] == 1)
    39            if (K % J == 0) {
    40              Prime[K] = 0;
    41              return;
    42            }
    43        /* J++; */
    44

Ok, now let us try to run the code:

    [d_debug]$ gcc -g -o findprimes findprimes.c
    [d_debug]$ ./findprimes
    enter upper bound
    [d_debug]$

Oh, fantastic: no primes between 1 and 20? Not hardly.

Ok, so now we will set a couple of breakpoints: one at the call to FindPrime, and the second where a successful prime is to be output.

Another gotcha: misplaced (or no) braces. Fix that:

    16        scanf("%d", &UpperBound);
    17
    18        Prime[2] = 1;
    19
    20        for (N = 3; N <= UpperBound; N += 2) {
    21          CheckPrime(N);
    22          if (Prime[N]) printf("%d is a prime\n", N);
    23        }
    24

    (gdb) run
    Starting program: /ifs/user/jonesm/d_debug/findprimes
    enter upper bound
    20
    3 is a prime
    5 is a prime
    7 is a prime
    11 is a prime
    2
37. t(int nelem_in_array, int *array) {
      printf([format string garbled in extraction]);
      for (indx = 0; indx < nelem_in_array; indx++) {
        printf("%d ", array[indx]);
      }
      printf("\n");
    }

Introduction: Software for Debugging

Now let us run the code from within gdb. Our goal is to set a breakpoint where the squared array's elements are computed, then step through the code:

    [rush d_debug]$ gdb -quiet array-ex
    Reading symbols from /ifs/user/jonesm/d_debug/array-ex...done.
    (gdb) l 34
    29
    30        /* Copy array1 to array2 */
    31        array2 = array1;
    32
    33        /* Pass array2 to the function 'squareArray()' */
    34        squareArray(nelem, array2);
    35
    36        /* Compute difference between elements of array2 and array1 */
    37        for (indx = 0; indx < nelem; indx++) {
    38          del[indx] = array2[indx] - array1[indx];
    (gdb) b 34
    Breakpoint 1 at 0x400660: file array-ex.c, line 34.
    (gdb) run
    Starting program: /ifs/user/jonesm/d_debug/array-ex
    array1 = 2 3 4 5 6 7 8 9 10 11

    Breakpoint 1, main () at array-ex.c:34
    34        squareArray(nelem, array2);

Ok, now let's compile and run this code:

    [rush d_debug]$ gcc -g -o array-ex array-ex.c
    [rush d_debug]$ ./array-ex
    array1 = 2 3 4 5 6 7 8 9 10 11
    The difference in the elements of array2 and
38. they paused, and we can easily release them using continue:

    [rush d_hw d_pp]$ gdb -quiet pp-gdb -p 34517
    Reading symbols from /ifs/user/jonesm/d_hw/d_pp/pp-gdb...done.
    Attaching to program: /ifs/user/jonesm/d_hw/d_pp/pp-gdb, process 34517
    (gdb) c
    Continuing.

and on the second process:

    [rush d_hw d_pp]$ gdb -quiet pp-gdb -p 34518
    Reading symbols from /ifs/user/jonesm/d_hw/d_pp/pp-gdb...done.
    Attaching to program: /ifs/user/jonesm/d_hw/d_pp/pp-gdb, process 34518
    (gdb) c
    Continuing.

and we used the c (continue) command to let the execution pick up again where we temporarily interrupted it.

GDB in Parallel: Attaching GDB

Using a "Waiting Point"

You can insert a "waiting point" into your code to ensure that execution waits until you get a chance to attach a debugger:

    integer :: gdbWait = 0
    ...
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, Nprocs, ierr)
    ! dummy pause point for gdb insertion
    do while (gdbWait /= 1)
    end do

and then you will find the code waiting at that point when you attach gdb, and you can release it at your leisure (after setting breakpoints, etc.).
39. til nwchem nwchem 6 1 1 bin nwchem openil siosi incore nw

[Continuation of the ps -u jonesm -Lf listing for node d16n02. For the nwchem rows the CMD column is the same on every row and is omitted; with its punctuation lost in extraction it reads: util nwchem nwchem 6 1 1 bin nwchem openib impi siosi6 incore nw]

    UID     PID    PPID   LWP    C   NLWP  STIME  TTY  TIME      CMD
    jonesm  11422  11382  11437  0   4     17:01  ?    00:00:00
    jonesm  11422  11382  11438  0   4     17:01  ?    00:00:00
    jonesm  11422  11382  11453  0   4     17:01  ?    00:00:00
    jonesm  11423  11382  11423  99  4     17:01  ?    00:00:59
    jonesm  11423  11382  11443  0   4     17:01  ?    00:00:00
    jonesm  11423  11382  11444  0   4     17:01  ?    00:00:00
    jonesm  11423  11382  11456  0   4     17:01  ?    00:00:00
    jonesm  11489  11487  11489  0   1     17:02  ?    00:00:00  sshd: jonesm@notty
    jonesm  11490  11489  11490  2   1     17:02  ?    00:00:00  ps -u jonesm -Lf

Basic Parallel Debugging: Process Checking

    NODE d09n29s02, my [...]

[Only the first three columns of this second listing survive in the excerpt.]

    UID     PID    PPID
    jonesm  29706  29702
    jonesm  29883  29706
    jonesm  29883  29706
    jonesm  29883  29706
    jonesm  29883  29706
    jonesm  29883  29706
    jonesm  29888  29883
    jonesm  29921  29905
    jonesm  29921  29905
    jonesm  29921  29905
    jonesm  29921  29905
    jonesm  29921  29905
    jonesm  29922  29905
    jonesm  29922  29905
    jonesm  29922  29905
    jonesm  29922  29905
    jonesm  29923  29905
    jonesm  29923  29905
40. too often in Fortran). Now, of course, the bug is obvious, but aren't they all obvious after you find them?

    (gdb) run
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y
    Starting program: /ifs/user/jonesm/d_debug/array-ex
    array1 = 2 3 4 5 6 7 8 9 10 11

    Breakpoint 2, main () at array-ex.c:34
    34        squareArray(nelem, array2);
    3: array2[indx] = 49
    2: array1[indx] = 49
    1: indx = 10
    (gdb) disp array2
    4: array2 = (int *) 0x501010
    (gdb) disp array1
    5: array1 = (int *) 0x501010

Both pointers hold the same address, 0x501010: the assignment array2 = array1 copied the pointer, not the elements.

Introduction: Software for Debugging

The Fix Is In

Just as an afterthought, what we ought to have done in the first place was copy array1 into array2:

    /* Copy array1 to array2 */
    [loop header garbled in extraction]
        array2[indx] = array1[indx];

which will finally produce the right output:

    (gdb) run
    Starting program: /home/jonesm/d_debug/ex1

Array Indexing Errors

Array indexing errors are one of the most common errors in both sequential and parallel codes, and it is not entirely surprising:
- Different languages have different indexing defaults
- Multi-dimensional arrays are pretty easy to reference out-of-bounds
- F
41. unning from within gdb

Array Indexing Errors

Very often we get seg faults on trying to reference an array "out of bounds", so have a look at the value of J:

    (gdb) l 37
    32        than this square root, it must also have a smaller one,
    33        so no need to check for larger ones */
    34
    35        J = 2;
    36        while (1) {
    37          if (Prime[J] == 1)
    38            if (K % J == 0) {
    39              Prime[K] = 0;
    40              return;
    [lines 36 to 38 of this slide listing, where the value of J was examined, are garbled in extraction]

Oops! That is just a tad outside the bounds (50). Kind of forgot to put a cap on the value of J.

The session that produced the fault:

    [rush d_debug]$ gcc -g -o findprimes findprimes.c
    [rush d_debug]$ gdb findprimes
    (gdb) run
    Starting program: /ifs/user/jonesm/d_debug/findprimes
    enter upper bound
    20

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000400586 in CheckPrime (K=3) at findprimes.c:37
    37          if (Prime[J] == 1)
    (gdb) bt
    #0  0x0000000000400586 in CheckPrime (K=3) at findprimes.c:37
    #1  0x0000000000400547 in main () at findprimes.c:21
    (gdb) l 37
    32        than this square root, it must also have a smaller one,
    33        so no need to check for larger ones */
    34
    35        J = 2;
    36        while (1) {
    37          if (Prime[J] == 1)
    38            if (K % J == 0) {
    39              Prime[K] = 0;
    40              retu
42. version:

    [d16n03]$ module avail totalview

Make sure that your X DISPLAY environment is working if you are going to use the GUI.

The current CCR license supports 2 concurrent users up to 8 processors (precludes usage on nodes with more than 8 cores until/unless this license is upgraded).

The DDT Debugger
- Sophisticated commercial product (think many $$)
- Designed especially for HPC: multi-process, multi-thread
- Has both GUI and CLI
- Supports C/C++/Fortran 77/90/95 (and mixtures thereof)
- CCR has a 32-token license for DDT (including CUDA and profiler support)

To find the latest installed version:

    module avail ddt

GUI-based Parallel Debugging: Eclipse PTP

Current Recommendations

CCR has licenses for Allinea's DDT and TotalView, although the current TotalView license is very small and outdated, and will be either upgraded or dropped in favor of DDT. Both are quite expensive, but stay tuned for further developments.

Note that the open-source eclipse project also has a parallel tools platform that can be used in combination with C/C++ and Fortran:

http://www.eclipse.org/ptp
43. [Listing residue: the tail of the CMD column of a ps -u jonesm -Lf listing, repeated once per thread. With punctuation lost in extraction, each entry reads: util nwchem nwchem 6 1 1 bin nwchem openib impi siosi6 incore nw.]

GDB in Parallel: Attaching GDB

Attaching GDB to Running Processes

The simplest way to use a CLI-based debugger in parallel is to "attach" it to already running processes, namely:
- Find the parallel processes using the ps command (may have to ssh into remote nodes if that is where they are running)
- Invoke gdb on each process ID

[Listing residue: only the PID column of the accompanying ps listing survives: 1772, 1773, 25814, 25815, 34507, 34512, 34513, 34517, 34518.]

er