Data Analysis in the CIMMYT Applied Biotechnology Center

1. A good program for partitioning variation between and within populations, and also between and within clusters following a cluster analysis (discussed later), is the AMOVA (analysis of molecular variance) procedure. It is very similar to the ANOVA procedure and is very commonly used, so it will not be discussed in this manual. For a complete review of AMOVA, see Excoffier et al. (1992).

One can also measure the richness of alleles for each marker, or the information that each marker imparts to the study. This can also be viewed as a measure of the usefulness of each marker in distinguishing one individual from another. Several factors affect this usefulness, including the number of alleles, the frequency of these alleles in the study, and others. Three measures of the usefulness of markers are allele richness, Polymorphic Information Content (PIC), and the discriminatory power of the markers. Allele richness can be calculated in the LCDMV software package by Dubreuil et al. (2002). This package runs on SAS and can be downloaded from the CIMMYT webpage at http://www.cimmyt.cgiar.org/ABC/Protocols/manualABC.html, along with the user's manual and source code if desired. A discussion of the calculation of the discriminatory power of markers can be found in Franco et al. (2001). An example of calculating PIC is presented here. PIC is a quantification of the number of alleles or bands that a marker has and the frequency of each of th
2. cultivars, landraces, etc. in the study. For the Excel file, name each marker and cultivar, preferably using names that are less than 8 characters long, and avoid non-alphanumeric characters such as periods, dashes, etc. The example in Table 1 corresponds to data that will be analyzed using SAS. For NTSYS, all periods (which indicate missing data) should be replaced with 9, either in the Excel table or later using Word.

Table 1. Example of an Excel data file with five different maize lines (corresponding to columns) and 10 different marker bands (corresponding to rows). 1 = band present, 0 = band absent, . = missing data.

MaizeA MaizeB MaizeC MaizeD MaizeE
AFLPA 1 1 1 0 1
AFLPA2 1 1 1 1 1
AFLPB1 0 0 1 0 1
AFLPB2 1 1 0 0
AFLPC1 1 0 1 0 0
AFLPC2 1 0 0 0
AFLPC3 0 0 1 1 1
AFLPC4 0 0 0 1 1
AFLPC5 0 1 1 1 1
AFLPC6 1 1 1 0

When all your data has been entered, check for rows or columns with too much missing data. Missing data can distort the analyses. You will need to decide how much is too much; you may wish to run some analyses on the entire data set and then again on a subset of the data after removing the individual lines or markers that contain a lot of missing data. A good rule of thumb: if more than 15% of the observations are missing for any given marker or maize line, it is TOO MUCH. For the entire data set, you want to minimize missing data overall. When you have remove
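The 15% rule of thumb for missing data can be checked mechanically before any analysis is run. The following is a minimal Python sketch, not part of the original SAS/NTSYS workflow; the function names, the use of `None` for a missing score, and the small data set are all hypothetical and chosen only for illustration.

```python
def missing_fraction(values):
    """Fraction of entries in one row or column that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def flag_missing(matrix, row_names, col_names, threshold=0.15):
    """Return the markers (rows) and lines (columns) whose missing-data
    fraction exceeds the threshold (15% by default, per the rule of thumb)."""
    bad_rows = [name for name, row in zip(row_names, matrix)
                if missing_fraction(row) > threshold]
    columns = list(zip(*matrix))  # transpose: one tuple per maize line
    bad_cols = [name for name, col in zip(col_names, columns)
                if missing_fraction(col) > threshold]
    return bad_rows, bad_cols

# Hypothetical scores: rows are marker bands, columns are maize lines.
markers = ["M1", "M2", "M3", "M4"]
lines = ["LineA", "LineB", "LineC", "LineD", "LineE"]
scores = [
    [1, 1, 0, 1, 1],
    [1, None, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [1, None, 1, 0, 0],
]
bad_markers, bad_lines = flag_missing(scores, markers, lines)
print(bad_markers, bad_lines)
```

With only five lines, a single missing score already puts a marker at 20%, so small data sets are flagged quickly; the threshold is an argument so that a different cut-off can be tried.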
3. FONT SIMPLEX RUN

Determining the approximate number of clusters using SAS

A question always raised following cluster analysis is: which groupings are the real clusters, and at what level of proximity must one draw the line to determine this? The pseudo F and t statistics may be good indicators for determining the approximate number of clusters, although they are not distributed as F and t random variables, respectively. They can be calculated for any clustering strategy as long as the data is raw data (not distance measurements), or for the Ward, Centroid, and Average clustering strategies when distance measurements are used. The SAS code for obtaining these values using the Ward method on a hypothetical distance matrix for 13 individuals (IND) is as follows:

data a(type=distance);
input (IND1 IND2 IND3 IND4 IND5 IND6 IND7 IND8 IND9 IND10 IND11 IND12 IND13) (4.2) @78 IND $ 6.;
/* 4.2 = number of places for the distance values, with two decimal places
   78 = number of places from the left column of the distance matrix to the column before the IND
   6 = places for the individuals (IND) */
datalines;
0.00 IND1
0.99 0.00 IND2
0.98 0.53 0.00 IND3
0.55 0.21 0.27 0.00 IND4
0.77 0.30 0.92 0.72 0.00 IND5
0.46 0.24 0.42 0.92 0.98 0.00 IND6
0.50 0.41 0.67 0.18 0.87 0.39 0.00 IND7
0.87 0.35 0.81 0.39 0.30 0.75 0.45 0.00 IND8
0.30 0.90 0.50 0.34 0.89 0.12 0.34 0.23 0.00 IND9
0.25 0.80 0.40 0.14 0.09 0.92 0.44 0.13 0.21 0.00 IN
4. persons have contributed. Furthermore, one of the mandates of CIMMYT is the training of our national program partners, who have also expressed interest in learning the statistical techniques we use here at CIMMYT. It may even be possible one day to combine data from different labs into one analysis. In an effort to standardize the process and the results, and as a teaching tool for interested parties, this manual was prepared to act as a set of guidelines for future diversity analyses of maize and wheat germplasm. The analysis tools will also work in other species.

Three main steps are involved in the statistical analysis of molecular data in diversity studies: (1) data collection (scoring and entry of band information into the computer); (2) data analysis using univariate and multivariate statistical approaches; and (3) interpretation of the data. Each step in the process should follow a standardized format if the output of one diversity study is to be compared to other studies and inferences drawn in this manner. Likewise, laboratory procedures must be standardized between different workers; to achieve this end, all users should read the manual entitled "Laboratory Protocols: CIMMYT Applied Molecular Genetics Laboratory", which should be followed when initiating diversity studies. This manual will provide both simple examples of all procedures in the main body of the text and real examples of data analyses in the appendices. Please refer to these exa
5. 3333 44 55 12

For some coefficients, the SIMGEND module needs to know the sample size for each population being compared. A rectangular matrix with a single row or column provides this information. This matrix can be produced by the FREQ module. An example is given below:

sample size matrix for 4 populations
1 1 4 0
25 25 25 25

Figure 4. NTSYS 2.1 window for calculating Nei's (1972) genetic distance coefficients. [Screenshot of the NTSYSpc SimGend window, copyright 2000 by Applied Biostatistics Inc.]

Clustering

The first type of clustering we will perform on the proximity matrices is the Unweighted Pair Group Method using Arithmetic Averages (UPGMA). This is a hierarchical algorithm for clustering entries (maize) into similar groups. For a more detailed description of the algorithm used to calculate the dendrogram, see the NTSYS or SAS manuals. The output of this clustering procedure is a dendrogram, or tree, with distance along the horizontal (top) axis and the maize lines listed vertically down the side (see Fig. 4 as an example; more output trees can be found in Appendix 1).

SAS calculation of clusters

The following is a SAS code, called Cluster.sas, which can be used to calculate the dendrogram for the UPGMA, Ward, or Single L
6. 770 1.000 0.937 0.666 0.666 0.666 0.708 0.708 0.687 0.729 1.000 0.687 0.729 0.729 0.770 0.770 0.750 0.750 1.000 0.666 1.000 0.708 0.875 1.000 0.750 0.708 0.791 1.000 0.666 0.750 0.833 0.875 1.000 0.562 0.729 0.812 0.729 0.854 0.729 0.687 0.770 0.979 0.854

Part 3. Dendrogram produced by NTSYS using the simple matching matrix above.
[Dendrogram. Leaf order: CML247, LP1, CML264, LP4, LP5, CML254, TS2, TS5, TS3, CML258, LP3, TS4, CML273, CML274, LP2, P21, TS1, CML268, P1. Horizontal axis: Coefficient, 0.66 to 0.98.]

Part 4. PCA output produced by NTSYS using the simple matching matrix above.
[Three-dimensional PCA plot.]

Appendix 2

Part 1. The Excel spreadsheet used to calculate Polymorphic Information Content (PIC) for two SSR markers in a sample of 7 inbred lines.

Allele scores (1 = present, 0 = absent) for inbred lines 1 to 7:
ssr1a 1 0 1 1 1 0 0
ssr1b 0 1 0 0 0 1 0
ssr1c 0 0 0 0 1 1 1
ssr1d 0 0 0 0 0 0 1
ssr2a 1 1 1 0 1 1 1
ssr2b 1 0 0 1 0 1 0
ssr2c 0 0 1 0 0 0 0

Allele counts and PIC calculation:
      1 2 3 4 5 6 7 total freq   freq2  sum    PIC
ssr1a 2 0 2 2 1 0 0 7     0.5    0.25   0.3469 0.6531
ssr1b 0 2 0 0 0 1 0 3     0.2143 0.0459
ssr1c 0 0 0 0 1 1 1 3     0.2143 0.0459
ssr1d
7. [Screenshot of the NTSYSpc SimQual window.]

Co-dominant marker types

NTSYS 2.02 and 2.1

When allelic relationships between bands are known, as in the case of RFLPs and SSRs, genetic distances can be calculated between individuals in a study. Distances such as Nei and Li (1979) and Rogers' (1972) or Modified Rogers' are examples of this type of distance. An NTSYS 2.1 example of the Nei and Li distance calculation will be shown here. NTSYS also calculates Rogers' distances, but an error in the program causes the calculations to be incorrect, so SAS or another program should be used for this instead. The following example is taken from the NTSYS 2.1 online help manual.

Matrices for gene frequency data must contain the frequencies of all the alleles (i.e., the frequencies must add up to 1 for each locus). In the example shown below, the 19 rows correspond to 19 alleles distributed over the 5 loci. The columns correspond to samples taken from four populations. The first 4 rows correspond to the alleles at the ABO locus. Thus, the column sums must be equal to 1 for the first 4 rows. The next five rows correspond to the next locus, within which the columns must sum to 1, and so on for the remai
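The requirement that allele frequencies sum to 1 within each locus and population is easy to verify before the matrix is handed to NTSYS. The Python sketch below is only illustrative: the function names, the two-locus, two-population frequency matrix, and the tolerance are hypothetical. The distance function uses the standard textbook definition of Nei's (1972) distance, D = -ln(I) with I = Σxy / sqrt(Σx² Σy²) summed over all alleles at all loci; that this matches what the SimGend module computes internally is an assumption.

```python
import math

def check_locus_sums(freqs, alleles_per_locus, tol=1e-6):
    """freqs: rows are alleles, columns are populations.
    Returns (locus, population, column_sum) for every block that does not sum to 1."""
    problems = []
    start = 0
    for locus, n_alleles in enumerate(alleles_per_locus):
        block = freqs[start:start + n_alleles]
        for pop, column in enumerate(zip(*block)):
            total = sum(column)
            if abs(total - 1.0) > tol:
                problems.append((locus, pop, total))
        start += n_alleles
    return problems

def nei_1972(freqs, pop_x, pop_y):
    """Nei's (1972) standard genetic distance between two populations (columns)."""
    jxy = sum(row[pop_x] * row[pop_y] for row in freqs)
    jx = sum(row[pop_x] ** 2 for row in freqs)
    jy = sum(row[pop_y] ** 2 for row in freqs)
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)

# Hypothetical data: two loci with two alleles each, two populations.
freqs = [
    [0.5, 0.5],   # locus 1, allele 1
    [0.5, 0.5],   # locus 1, allele 2
    [1.0, 0.0],   # locus 2, allele 1
    [0.0, 1.0],   # locus 2, allele 2
]
alleles_per_locus = [2, 2]
problems = check_locus_sums(freqs, alleles_per_locus)
distance = nei_1972(freqs, 0, 1)
print(problems, round(distance, 4))
```

An empty `problems` list means every locus block sums to 1 in every population; any entry identifies the locus and population that need to be rescored.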
8. {&I}=0 AND SUBJ{J}=0 THEN N=1; ELSE N=0;
IF SUBJ{&I}=. OR SUBJ{J}=. THEN D=0; ELSE D=1;
NUM{J}+N; DEN{J}+D;
END;
IF BAND='S9D' THEN
/* write the name of your last band in your data set; for example, in Fig. 2 it would say IF BAND='AFLPC6' */
IF DISTNC=1 THEN DO J=1 TO &N; DIST{J}=SQRT(1-NUM{J}/DEN{J}); END;
IF DISTNC=2 THEN DO J=1 TO &N; DIST{J}=1-NUM{J}/DEN{J}; END;
RUN;
DATA B; SET A (KEEP=DIST1-DIST&N FIRSTOBS=281);
/* 281 refers to the number of markers you have; change this value accordingly */
FILE 'C:\DATA\allpoly.MTX' LRECL=1030 MOD;
/* change the filename between the quotes to a name you choose for the output of the analysis, including the path */
PUT (DIST1-DIST&N) (7.4);
RUN;
%END;
%MEND DISSIMLR;

To input this file into SAS, open the SAS program and open a file by using the File menu. The opened file will appear in the Program Editor window. Submit the program by clicking on the button that looks like a little man running. Text will appear in the Log box; if there are errors, the text will be red; if there are no errors, the text will all be blue and black. The output (a square matrix, which is the same above the diagonal as below) will be saved in the file you specified. The diagonal will be 0, since it is the comparison of an individual with itself, which cannot be dissimilar. Note: If you run the same procedure more than once, erase the old output file before you start or name it som
9. 0 0 0 0 0 0 1 1     0.0714 0.0051
(sum) 2 2 2 2 2 2 2 14    1      1
ssr2a 1 2 1 0 2 1 2 9     0.6429 0.4133 0.5    0.5
ssr2b 1 0 0 2 0 1 0 4     0.2857 0.0816
ssr2c 0 0 1 0 0 0 0 1     0.0714 0.0051
(sum) 0.5 2 2 2 2 2 2 12.5 0.8929 0.7972

Part 2. Table showing the formulas that were typed into each cell of the above Excel spreadsheet to calculate the PIC values shown. Steps in the process are detailed in the text of this manual. Although we have wrapped the text in the cells displaying formulas, you must type in the formula without a space or carriage return in Excel.

Row 15 (ssr1a): total =SUM(B15:H15); freq =I15/14; freq2 =J15*J15; sum =SUM(K15:K18); PIC =1-L15
Row 16: freq =I16/14; freq2 =J16*J16
Row 17: freq =I17/14; freq2 =J17*J17
Row 18: freq =I18/14; freq2 =J18*J18
Row 19 (column sums): =SUM(B15:B18), =SUM(C15:C18), =SUM(D15:D18), =SUM(E15:E18), =SUM(F15:F18), =SUM(G15:G18), =SUM(H15:H18); freq =I19/14; freq2 =J19*J19
Row 21 (ssr2a): total =SUM(B21:H21); freq =I21/14; freq2 =J21*J21; sum =SUM(K21:K23); PIC =1-L21
Row 22: freq =I22/14
Row 23 (ssr2c): freq =I23/14; freq2 =J23*J23
Row 24 (column sums): =SUM(C21:C23), =SUM(D21:D23), =SUM(E21:E23), =SUM(G21:G23), =SUM(H21:H23); freq =I24/14; freq2 =J24*J24
10. to use
ID line;
VAR dist1-dist35;
run;

Varclus

Varclus clusters entries (maize) into varying numbers of clusters as specified by the user, usually starting with two and proceeding to a larger number (not to exceed the number of entries in the test). This program will tell you when splitting clusters into smaller groups (and thus a larger number of clusters) does not make statistical sense; you can, however, choose to use a smaller number of clusters. An example of the Varclus procedure is shown below:

OPTIONS LINESIZE=132 PAGESIZE=77;
Title 'VARClus Analysis of GBG ancestors 4/2/98';
/* change title as appropriate */
data DIST;
infile 'a:\ancestors.txt'
/* name and path of your input file. NOTE: this is an original data file, NOT the output of Mergclus.sas; therefore you need labels below */
LRECL=1050;
INPUT band $ 1-6 @7 P68600 P189930 P261474 P290116A P291306A P297500 P297544 P317335 P347560 P361067 P372415A P378664 P383276 P384469A P384471 P391594 P393999 P398763 P404157 P404161 P404188A P404192C P407654 P423950 P424159 P437909B P87588 P890612 P91091 P189930A P253665D P283331 P436682 P436684 P437697 P437851A P438206 P69507 P84657 P88310 P189893 P200485 P361006A P361075 P399016 P417510 P427088B P437578 P445837 P467307 P476352C P491548 P491579 P503338 P506920 P506945 P507295 P507296 P507373 P507543 PFC3571 P890612A P227328 P391583 P391584 P424159B P458511 P464878 P464887 P464920 P468377 P475814 LG852534 LG871991 LG921128 LG924208 LG93760
11. 0
0000100000000000000
0O000000010000000000
0000100000000000000
0000000000000001001
0000000000110000010
1000000000000000000
0100000000001001001

Part 2. Simple Matching matrix created by NTSYS using the previous input data set.

SIMQUAL input A Teaching maize txt coeff SM by Cols
3 19L 19 0
CML247 CML254 CML258 CML264 CML268 CML273 CML274 LP1 LP2 LP3 LP4 LP5 P1 P21 TS1 TS2 TS3 TS4 TS5
1.0000 0.7083 1.0000 0.6666 0.8333 1.0000 0.7916 0.7500 0.7083 0.6590 0.7045 0.7045 0.6666 0.7916 0.7083 0.6875 0.8125 0.7291 0.8409 0.7954 0.7500 0.6875 0.7708 0.6875 0.7708 0.7708 0.7708 0.7291 0.7708 0.7291 0.7083 0.7500 0.6666 0.5625 0.6875 0.6875 0.6875 0.7708 0.6875 0.6875 0.7291 0.7291 0.6875 0.8125 0.7291 0.7291 0.8541 0.8125 0.7083 0.7500 0.7083 1.000 0.7083 0.8333 0.7500 0.708 1.000 1.0000 0.6590 0.7500 0.7708 0.7500 0.8125 0.7708 0.8125 0.7916 0.6875 0.7291 0.7291 0.7708 0.7708 0.7083 0.7916 1.000 0.659 0.681 0.681 0.681 0.681 0.636 0.704 0.636 0.681 0.727 0.727 0.727 0.704 0.704 1.000 0.979 0.704 0.854 0.729 0.729 0.750 0.645 0.854 0.812 0.729 0.770 0.750 0.708 1.000 0.727 0.875 0.708 0.750 0.770 0.666 0.875 0.791 0.750 0.750 0.729 0.729 1.000 0.727 1.000 0.818 0.708 0.818 0.791 0.750 0.812 0.590 0.666 0.727 0.791 0.681 0.750 0.681 0.750 0.727 0.750 0.750 0.729 0.704 0.729 1.000 0.791 0.770 0.625 0.708 0.791 0.750 0.833 0.812 0
12. 1 %TO &N;
DATA A;
INFILE 'C:\DATA\allpoly.PRN' LRECL=340
/* change the file path and name inside the quotes to your file and the correct path; change the 340 to a larger number if your data set has a lot of individuals (make it about 10 x the number of lines you have) */
FIRSTOBS=1;
INPUT BAND $ 1-8 @9 (SUBJ1-SUBJ&N) (2.);
/* change the 1-8 to the number of spaces that your marker labels take up in your data set; for example, in Fig. 2 the marker labels take up spaces 1-7. Change the @9 to the next space after your markers; for example, in Fig. 2 it would be @8 */
ARRAY SUBJ{&N} SUBJ1-SUBJ&N;
ARRAY NUM{&N} NUM1-NUM&N;
ARRAY DEN{&N} DEN1-DEN&N;
ARRAY DIST{&N} DIST1-DIST&N;
ASSOC=3;
/* choose ASSOC=1 for Gower's/Jaccard's coefficient, ASSOC=2 for Nei and Li (Dice), and ASSOC=3 for Simple Matching (default) */
DISTNC=1;
IF ASSOC=1 THEN DO J=1 TO &N;
IF SUBJ{&I}=1 AND SUBJ{J}=1 THEN N=1; ELSE N=0;
IF SUBJ{&I}=0 AND SUBJ{J}=0 THEN D=0; ELSE IF SUBJ{&I}=. OR SUBJ{J}=. THEN D=0; ELSE D=1;
NUM{J}+N; DEN{J}+D;
END;
IF ASSOC=2 THEN DO J=1 TO &N;
IF SUBJ{&I}=1 AND SUBJ{J}=1 THEN N=2; ELSE N=0;
IF SUBJ{&I}=1 AND SUBJ{J}=1 THEN D=2; ELSE IF SUBJ{&I}=0 AND SUBJ{J}=0 THEN D=0; ELSE IF SUBJ{&I}=. OR SUBJ{J}=. THEN D=0; ELSE D=1;
NUM{J}+N; DEN{J}+D;
END;
IF ASSOC=3 THEN DO J=1 TO &N;
IF SUBJ{&I}=1 AND SUBJ{J}=1 THEN N=1; ELSE IF SUBJ
13. 4 LG937654 LG941309 A3205 S4230 CNS ILLINI MANDARIN LINCOLN DUNFIELD RICHLAND AKHARROW ARKSOY CAPITAL HABERLAN JACKSON KOREAN MUKDEN OGDEN PERRY RALSOY ROANOKE 100
proc corr data=dist noprint cov outp=covout;
proc print;
data cov(type=cov); set covout;
proc varclus data=cov maxeigen=1 initial=random short maxiter=100 maxsearch=100;
run;

Multidimensional scaling

This is a procedure for plotting the lines on a graph of two axes for the purpose of visualizing the relationships between entries and clusters. An example of a SAS MDS procedure is listed below:

OPTIONS LINESIZE=132 PAGESIZE=77;
Title 'Cluster Analysis of US and Chinese Ancestors using only polymorphic data';
/* change to appropriate title */
data DIST(type=distance);
INFILE 'C:\DATA\uscnf3.lab' LRECL=1050;
INPUT LINE $ 1-21 @22 dist1-dist35;
PROC MDS DATA=DIST LEVEL=ABSOLUTE DIM=22 OUT=OUT PINEIGVAL PININ PINIT OUTRES=RES;
/* set dim = number of dimensions you want. The final R value (printed on the last page of the SAS output) should be at least 0.95, which means you have accounted for 95% of your original variation in your analysis. If you set this number too high, it will take a LONG time to run the procedure. As it is, it takes several hours. */
ID LINE;
PROC PRINT DATA=OUT;
PROC PRINT DATA=RES;
PROC PLOT DATA=OUT VTOH=2.0;
PLOT DIM2*DIM1 $ LINE / HAXIS=BY 0.1 VAXIS=BY 0.1;
WHERE _TYPE_='CONFIG';
PROC PLOT DATA=RES VTOH=2.0;
PLOT FITDATA*FITDIST / HAXIS=BY 0.1 VAX
14. 831

Cluster History

NCL  Clusters Joined  FREQ  SPRSQ   RSQ   Tie
12   IND2   IND3      2     0.0000  1.00  T
11   IND5   IND6      2     0.0000  1.00  T
10   IND8   IND9      2     0.0000  1.00  T
9    IND11  IND12     2     0.0000  1.00  T
8    IND7   IND10     2     0.0024  .998
7    CL11   CL8       4     0.0064  .991
6    CL12   IND4      3     0.0086  .983
5    CL9    IND13     3     0.0379  .945
4    IND1   CL5       4     0.0957  .849
3    CL7    CL10      6     0.1924  .657
2    CL6    CL3       9     0.2159  .441
1    CL4    CL2       13    0.4407  .000

[The PSF and PST2 columns of the cluster history are not legible in this copy.]

[Plot of _PSF_*_NCL_ (symbol used is 'F') and plot of _PST2_*_NCL_ (symbol used is 'T'), overlaid. Horizontal axis: Number of Clusters, 1 to 13. Vertical axis: 0 to 300. NOTE: 38 obs had missing values. 1 obs hidden.]

A further output of SAS is the actual dendrogram, with the ID of the variable (IND) identified. The four clusters previously determined are clearly apparent. Note that using SAS version 6.12 or earlier, the dendrogram is shown in a totally different format.

[Dendrogram. Leaf order: IND1, IND11, IND12, IND13, IND2, IND3, IND4, IND5, IND6, IND7, IND10, IND8, IND9.]

NTSYS calculation of clusters

NTSYS 1.7

The input file for NTSYS will be the output file from the simple matching calculations. Enter into NTSYS and use the arrow keys to select the SAHN clustering option under the Cluster and graph methods heading, and you will se
15. CIMMYT Institutional Multimedia Publications Repository
http://repository.cimmyt.org
CIMMYT Genetic Resources
Data analysis in the CIMMYT applied biotechnology center: For fingerprinting and genetic diversity studies
Warburton, M. 2002

Data Analysis in the CIMMYT Applied Biotechnology Center
For Fingerprinting and Genetic Diversity Studies
Marilyn Warburton and José Crossa
August 2002
Second Edition

Table of Contents
I. Overview
II. Data Collection
III. Data Analysis
1. Partitioning variation in the sample
2. Ordination: visualizing the relationships in the samples
Proximity Matrices
Clustering
Determining approximate number of clusters using SAS
Other SAS clustering procedures
Multidimensional scaling
Principal components analysis
IV. Interpretation of the Data
Bootstrapping
V. References
Appendix 1. Sample data files
Appendix 2. Excel spreadsheet for PIC calculations

I. Overview

The molecular genetic characterization of the diversity present in the CIMMYT maize and wheat germplasm collections is an ongoing process to which many different
16. D10
0.55 0.70 0.90 0.84 0.99 0.92 0.54 0.53 0.31 0.34 0.00 IND11
0.45 0.60 0.80 0.74 0.89 0.82 0.44 0.43 0.21 0.24 0.23 0.00 IND12
0.46 0.68 0.81 0.70 0.85 0.81 0.43 0.44 0.20 0.25 0.25 0.28 0.00 IND13
;
/* distance matrix */
proc cluster data=a method=ward pseudo;
/* pseudo asks for the pseudo F and pseudo t */
id IND;
proc tree;
id IND;
proc plot;
plot _psf_*_NCL_='F' _PST2_*_NCL_='T' / overlay haxis=1 to 13 by 1 vaxis=0 to 300 by 50;
/* Plot the pseudo F and pseudo t */
RUN;

The above program plots the pseudo F and t values for each number of clusters. The place where there is a local peak should be considered as the possible number of clusters. Some peaks appearing at a larger number of clusters may not represent real clusters and should be considered with caution. If coordinate data is available, the SAS codes are the same as these, except that the lines regarding the data steps need to be changed accordingly.

The SAS outputs give the clustering history, with the values of the pseudo F and t that are plotted together. The pseudo t peaks at 3 clusters, so the number of clusters will be one greater than the level at which the large pseudo t is printed; in this case, 4 clusters. The pseudo F also peaks at 4 clusters, and further increases do not appear to represent real clusters.

The SAS System          14:51 Friday, October 6
The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis
Root-Mean-Square Distance Between Observations = 0.532
17. IS=BY 0.1;
PROC PLOT DATA=RES VTOH=2.0;
PLOT DATA*DISTance / HAXIS=BY 0.1 VAXIS=BY 0.1;
run;
PROC REG DATA=RES;
MODEL FITDATA=FITDIST;
PROC REG DATA=RES;
MODEL DATA=DISTance;
RUN;
DATA Z; SET RES;
FILE 'C:\DATA\mdsoutput.PRN' LRECL=1200 MOD;
/* the output will be VERY big; be sure to put it somewhere you have enough room */
PUT LINE $ 1-21 @22 (DIM1-DIM22) (9.4);
/* If you change dim=22 to a different number above, be sure to change it here too */
RUN;

Principal components analysis

Principal Components is an ordination technique that allows the projection of the data onto two or three axes in order to visualize the differences between the individuals and look for groups. The principal components are new, uncorrelated variables calculated from the original variables; they may not have a biological meaning, especially with molecular markers. However, they are useful, since the first two or three usually account for most of the variation of all the original variables. Whereas it would be impossible to project the data onto a graph with axes corresponding to all the variables (usually more than 100 in the case of molecular markers), using PCA you can project the data onto two or three axes. In three dimensions, you can see patterns that cannot be represented in a two-dimensional dendrogram. In order to use PCA, you must first calculate eigenvalues, which represent the amount of variance accounted for by a component, and the eigen
18. alysis many times and return a dendrogram in which the clusters are defined by the number of times the individuals within the cluster were found together in each analysis. This number can be used as a confidence limit of the clusters within the dendrograms (Felsenstein, 1985). To ensure that the accuracy of the bootstrap is 95%, 400 repetitions of the analysis must be done, and 2,000 repetitions must be done to ensure the accuracy is 99% (Hedges, 1992). We recommend the WinBoot program by Yap and Nelson (1996) as a user-friendly, free program for performing bootstrap analysis of binary data to determine the confidence limits of UPGMA-based dendrograms. However, this program only does UPGMA and does not accept missing data in the data matrix. The authors may be contacted via the Internet at the following email addresses: i.yap@cgnet.com for technical support and r.nelson@cgnet.com for distribution and general inquiries. For other dendrograms or data types, SAS routines are provided in the LCDMV software package by Dubreuil et al. (2002). This package can be downloaded from the CIMMYT webpage at http://www.cimmyt.cgiar.org/ABC/Protocols/manualABC.html, along with the user's manual and source code if desired.

V. References

Beaumont, M.A., K.M. Ibrahim, P. Boursot, and M.W. Bruford. 1998. Measuring genetic distance. p. 315-325. In A. Karp, P.G. Isaac, and D.S. Ingram (ed.) Molecular tools for screening biodiversity. London: Chapman
19. and Hall.
Dubreuil, P., C. Dillman, J. Crossa, and M. Warburton. 2002. LCDMV: Software for the Calculation of Molecular Distances between Varieties. First Edition. Mexico, D.F.: CIMMYT.
Excoffier, L., P. Smouse, and J. Quattro. 1992. Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131:479-491.
Felsenstein, J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783-791.
Franco, J., J. Crossa, J.M. Ribaut, J. Betran, M.L. Warburton, and M. Khairallah. 2001. A method for combining molecular markers and phenotypic attributes for classifying plant genotypes. TAG 103(6-7):944-952.
Hedges, S.B. 1992. The number of replications needed for accurate estimation of the bootstrap P value in phylogenetic studies. Mol. Biol. Evol. 9:366-369.
Hoisington, D., M. Khairallah, and D. Gonzalez-de-Leon. 2000. Laboratory Protocols: CIMMYT Applied Molecular Genetics Laboratory. Third Edition. Mexico, D.F.: CIMMYT.
Lewin, Benjamin. 2000. Genes VII. Oxford University Press.
Nei, M., and W. Li. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proc. Natl. Acad. Sci. USA 76:5269-5273.
NTSYSpc 2.10. 2000. Applied Biostatistics Inc.
Rohlf, F.J. 1997. NTSYSpc: Numerical Taxonomy and Multivariate Analysis System, version 2.01. Department of Ecology and Evolution, State University of New York.
Sambroo
20. ce bar to toggle to other options
Name for output matrix: path and name of output file
By rows or cols: COL is default, but we need ROW (press space bar to change)
Positive code: 1
Negative code: 0
Show matrix: NO
Listing file: CON

When all the blank spaces have been filled in (or left as the default), press F2 to start the program running. When it is finished, there will be a message on the screen; press ESC to exit to the main menu. Press ESC again to exit NTSYS when you are finished. The output (a diagonal matrix) will be saved in the file you specified; only one half is displayed (unlike SAS, it does not print both above and below the diagonal). The diagonal will be 1, since it is the comparison of an individual with itself and cannot be dissimilar.

NTSYS 2.02 and 2.1

NTSYS 2.02 has all the same options and calculations as NTSYS 1.7, but the menus have been updated to Windows. Instead of moving around the menus with the arrow keys, you can click on the window you want and then on the option you want. For similarity calculations, click on the Similarity heading, then choose SimGen for allele frequency data or SimQual for zero-and-one data. See Figure 3 for an example of the calculation of Simple Matching coefficients. Note: NTSYS 2.02 and 2.1 have an online help menu, which can be accessed by clicking the Help option on the main task bar.

Figure 3. NTSYS 2.1 window for calculating Simple Matching similarity coefficients.
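For readers who want to check a few entries of a similarity matrix by hand, the three coefficients used in this manual for dominant markers (Simple Matching, Jaccard, and Nei and Li/Dice) can be computed from two 0/1 band profiles. This is a minimal Python sketch rather than NTSYS or SAS code: the function name, the coefficient labels, the use of `None` for missing data, and the two band profiles are hypothetical. Bands missing in either line are skipped, which is an assumption about how missing data should be treated.

```python
def similarity(x, y, coeff="SM"):
    """Similarity between two 0/1 band profiles; None marks missing data.

    a = bands present in both, d = bands absent in both, b = mismatches.
    SM = (a + d) / (a + b + d), J = a / (a + b), DICE = 2a / (2a + b).
    """
    a = b = d = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue  # skip bands that are missing in either line
        if xi == 1 and yi == 1:
            a += 1
        elif xi == 0 and yi == 0:
            d += 1
        else:
            b += 1
    if coeff == "SM":
        num, den = a + d, a + b + d
    elif coeff == "J":
        num, den = a, a + b
    elif coeff == "DICE":
        num, den = 2 * a, 2 * a + b
    else:
        raise ValueError(f"unknown coefficient: {coeff}")
    if den == 0:
        raise ValueError("no informative bands shared by the two profiles")
    return num / den

# Hypothetical band profiles for two maize lines (eight bands each).
line_a = [1, 1, 0, 1, 1, None, 0, 0]
line_b = [1, 0, 0, 1, 0, 1, 0, 1]
for name in ("SM", "J", "DICE"):
    print(name, round(similarity(line_a, line_b, name), 4))
```

Comparing a profile with itself returns 1 for all three coefficients, which corresponds to the diagonal of the similarity matrix; a dissimilarity can be obtained as 1 minus the similarity.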
21. d the individuals or markers with too much missing data, save the file as a text file without the column labels. For the SAS procedures demonstrated in this manual, you want the rows to be labeled with the marker name. Do not include spaces or punctuation, and do not begin the name with a number (although you can have numbers in the name). Make sure all the names are the same length, or that you include spaces at the end of the name, so that the observations start at the same column in Word. You will want one space between each observation, and put one space at the end of each line before the return character. If you do not add this return, SAS will not accept your data. A SAS input data file example is shown in Figure 2.

Fig. 2. Input data file for SAS, saved as a text file, with 5 different maize lines and 10 different marker bands; this file corresponds to the Excel file shown in Table 1.

AFLPA 1 1 1 0 1
AFLPA2 1 1 1 1 1
AFLPB1 0 0 1 0 1
AFLPB2 1 1 0 0
AFLPC1 1 0 1 0 0
AFLPC2 1 0 0 0
AFLPC3 0 0 1 1 1
AFLPC4 0 0 0 1 1
AFLPC5 0 1 1 1 1
AFLPC6 1 1 1 0

For NTSYS versions older than 2.02, you must make sure the length of each line of data does not exceed approximately 45 columns in Word (including spaces), or the NTSYS program will not read your data properly. A heading must also be placed at the beginning of the NTSYS data file, as follows:

1 10 5 1 9

The numbers refer to, in order: prese
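The layout rules for the SAS text file (labels padded to one length, a single space between observations, and one space at the end of each line) are tedious to apply by hand in Word. The sketch below shows one way to generate such lines in Python; it is an illustration only, with a hypothetical function name and hypothetical marker names, and it writes nothing to disk. Passing `missing="9"` instead of the default period follows the manual's instruction for NTSYS files.

```python
def format_rows(labels, rows, missing="."):
    """Build fixed-layout text lines: labels padded to equal width,
    one space between observations, and one trailing space per line."""
    width = max(len(label) for label in labels)
    lines = []
    for label, row in zip(labels, rows):
        cells = [missing if value is None else str(value) for value in row]
        lines.append(label.ljust(width) + " " + " ".join(cells) + " ")
    return lines

# Hypothetical marker names and scores; None marks a missing observation.
labels = ["MK1", "MK10"]
rows = [[1, 0, None], [0, 1, 1]]
sas_lines = format_rows(labels, rows)
ntsys_lines = format_rows(labels, rows, missing="9")
for line in sas_lines:
    print(repr(line))
```

Because every label is padded to the width of the longest one, the first observation starts in the same column on every line, which is what the column-based INPUT statement relies on.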
22. d z axis: 30
Rotation around x axis: 30
Viewing distance: 99
Label the points: NO
Show the pins: YES
(all of the settings above can be changed while viewing the graph produced)
Show edges in graph: YES
Normalize scales: NO
Hardcopy device: choose your printer from list
Port of file: lpt1 (or whatever your printer port is)
Graphics paging: YES
Listing device: CON

NTSYS 2.02

This version of NTSYS apparently has a problem calculating PCA, and we have not been able to successfully use 2.02 for this purpose. Therefore, we only use NTSYS 2.1.

IV. Interpretation of the Data

When you have completed clustering using a number of different procedures, you can compare the outputs to search for consensus clusters. Many clusters contain the same individuals regardless of the clustering algorithm used; you can be fairly sure in these cases that the clusters represent genetic, biological, or geographical factors and are a useful classification of the maize lines. However, some lines will show up in a different cluster each time a different clustering procedure is used. These lines are more difficult to assign to their proper cluster, and you may need to assign them to the cluster that makes the most sense based on known pedigree, region of origin, etc. However, this is cheating a little: you are forming a hypothesis (which group does a particular line belong to?) and testing it with the same data when you do thi
23. e alleles or bands in the population of OTUs in the study. Since a marker with fewer bands has less power to distinguish several OTUs, and alleles present at low frequency also have less power to distinguish, a higher PIC is assigned to a marker with many alleles and with alleles present at roughly equal proportions in the population. We use an Excel spreadsheet to calculate PIC, a copy of which is found in Appendix 2. Remember when using Appendix 2 that several of the cells contain equations and not numbers (see Part 2 for the formulas), so you will have to adjust the equations depending on the source cells that the equations are using as data. The formula used to calculate PIC is:

$PIC = 1 - \sum_i p_i^2$

where $p_i$ is the frequency of the $i$-th allele of the marker. To use the Excel spreadsheet, perform the following steps:

Step 1. Enter the data as presence (1) or absence (0) of each allele (in rows) for each OTU (in columns).
Step 2. Change the 1 in each cell to a 2 if the OTU is homozygous for that allele; leave it as a 1 if it is heterozygous and there is another allele present for that SSR in that OTU. You can sum over all alleles for each SSR to make sure the sum is 2 in every individual for every SSR; in this way, you know that you have not misscored any individuals, as every individual will have two alleles for every SSR.
Step 3. Sum alleles over OTUs.
Step 4. Divide the sum by the total number of alleles possible at each locus to get the freq
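The PIC calculation, one minus the sum of squared allele frequencies, can be cross-checked outside Excel. The Python sketch below is only an illustration and not part of the original protocol: the function name `pic` and the dictionary layout are assumptions. The allele totals are those of the Appendix 2 example, in which 7 diploid lines give 14 alleles per SSR.

```python
def pic(allele_counts):
    """Return PIC = 1 - sum(p_i^2) for one marker, given the count of each allele."""
    total = sum(allele_counts)
    if total == 0:
        raise ValueError("marker has no scored alleles")
    freqs = [count / total for count in allele_counts]
    return 1.0 - sum(p * p for p in freqs)

# Allele totals per SSR, taken from the 'total' column of the Appendix 2 example.
markers = {
    "ssr1": [7, 3, 3, 1],  # ssr1a, ssr1b, ssr1c, ssr1d
    "ssr2": [9, 4, 1],     # ssr2a, ssr2b, ssr2c
}
pic_values = {name: round(pic(counts), 4) for name, counts in markers.items()}
print(pic_values)
```

The rounded results, 0.6531 for ssr1 and 0.5 for ssr2, agree with the PIC column of the Appendix 2 spreadsheet. A marker with a single allele gives a PIC of 0, and a marker with many equally frequent alleles approaches 1.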
24. e down arrow to find the proper printer (HP LaserJet II, for example); you can print in either portrait or landscape
LPT1 (usually, but depends on your computer)
CON

Press F2 to get the graph, then follow the instructions on the screen to print and return to the main menu. Use ESC to exit NTSYS when finished.

NTSYS 2.02

The same clustering steps as outlined for Version 1.7 are shown in Figures 5 and 6, and the resulting dendrogram is shown in the appendix (Part 3).

Figure 5. NTSYS 2.02 window for clustering calculations. [Screenshot of the NTSYSpc SAHN clustering window.]

Figure 6. NTSYS 2.02 window for drawing the cluster produced by UPGMA clustering. [Screenshot of the NTSYSpc Tree window.]

Other SAS clustering procedures

Two other, non-hierarchical clustering procedures available with SAS are Fastclus and Varclus. Examples of both are shown here.

Fastclus

This procedure allows the quick clustering of a very large data set into putat
25. e the following screen (you must fill in the parts listed in bold italics yourself):
Name of input matrix: path and name of file (is the output of the SM procedure)
Name for output matrix: path and name of output file
Method: UPGMA (toggle to change to other methods if desired)
In case of ties: WARN
Maximum no. tied trees: 25
Tie tolerance: 0
Show tree: YES
Beta: -0.25
Listing file: CON

When all the blank spaces have been filled in (or left as the default), press F2 to start the program running. When it is finished, there will be a message on the screen; press ESC to exit to the main menu. The output (an unreadable tree graphic) will be saved in the file you specified. You must follow the final instructions below to visualize it well.

Select the Tree display option under the Graphics heading. The following menu will appear; fill in the blanks as indicated by the bold italic notes:
Name of tree matrix: path and name of file (is the output of the SAHN procedure)
Title: choose yourself
Tree style: Phen (don't toggle to the other option, Clad)
Minimum for scale, Maximum for scale, Number class intervals: 0 0 is default, but you probably want 1 0
Graphics Mode: NO is default, but you need to toggle to YES
Line length (text mode): 61
Squeeze factor: 1 is default, but you may want smaller if your tree is big, i.e. 0.75
Hardcopy device, Port or file, Listing file: use th
26. ething different, because SAS appends the new data file to the end of the old one rather than overwriting it. SAS cannot use the output of this program directly for the other programs that are listed below; it must first be modified by adding the name of each maize line into the file at the beginning of each line. You can do this in Word; remember that the labels must all be the same length, or have the same number of spaces following each one, until they all have the same number of characters plus spaces. Save the file as text, because the output of this program will be used directly for cluster analysis, principal components, etc. You can also use Excel to insert one column with the labels, but you must save it as a text file with a space between each column and a space at the end of each row, which must still be done in Word.

NTSYS calculation of Similarity Matrices
Dominant marker types
NTSYS 1.7
The input file for NTSYS will be similar to the SAS input file, but with a few exceptions (see Appendix 1 for more details). You will not need to write a program to tell NTSYS what to do, since it is a menu-driven program. Simply enter NTSYS and use the arrow keys to move around the menu. Select the Qualitative option under the (Dis)Similarity Measures heading and you will see the following screen (you must fill in the parts listed in bold italics yourself):
Name of input matrix: path and name of file
Coefficient: SM is default, press spa
27. inkage methods. Parts in bold italics are notes and not part of the protocol; do not include them in the SAS program (they are shown in square brackets here). The notes tell you which part of the program must be changed according to the data set. Note that version 8.00 of SAS calculates the dendrogram automatically, so this SAS code is only needed if you use a SAS version prior to 8.00.

OPTIONS LINESIZE=132 PAGESIZE=77;
TITLE 'Cluster Analysis of GBG experimental lines'; [change title inside of quotes]
DATA DIST (TYPE=DISTANCE);
INFILE 'a:\usedata.txt' LRECL=1050; [change the file path and name inside the quotes to your file and the correct path; use the output of mergcult.sas]
INPUT LINE $ 1-12 @13 DIST1-DIST93; [the numbers refer to columns; be sure these numbers agree with the numbers in Alldist.sas]
PROC CLUSTER DATA=DIST METHOD=AVERAGE OUTTREE=TREE; [choose METHOD=AVERAGE for UPGMA (default), METHOD=WARD for Ward's, and METHOD=SINGLE for single linkage calculations]
ID LINE;
VAR DIST1-DIST93; [the numbers refer to number of markers; be sure these numbers agree with the numbers in Alldist.sas]
RUN;
PROC TREE DATA=TREE HORIZONTAL SPACES=2;
ID LINE;
RUN;
GOPTIONS HSIZE=6 VSIZE=8;
TITLE;
/* BRING THE MACRO INTO THE PROGRAM */
%INCLUDE DENDRO;
%DENDRO(FORMAT=LANDSCAPE);
RUN;
/* BRING THE MACRO INTO THE PROGRAM */
%INCLUDE GRFTREE / NOSOURCE2;
[set ITEMS = number of maize lines you have]
%GRFTREE(CLUSDSN=TREE, ITEMS=93, AXIS=D, LABEL=Genetic Dissimilarity
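For readers who want to see what the average-linkage rule selected by METHOD=AVERAGE does, here is a minimal Python sketch of UPGMA on a small distance matrix. It is an illustration under stated assumptions, not a reimplementation of PROC CLUSTER: the function name `upgma`, the labels and the toy distances are all hypothetical, ties are broken by whichever pair comes first, and the reported heights may be scaled differently from SAS output.

```python
def upgma(labels, dist):
    """Average-linkage (UPGMA) clustering of a symmetric distance matrix.

    Returns the merges in order as (members_a, members_b, height) tuples.
    """
    n = len(labels)
    clusters = {i: (labels[i],) for i in range(n)}
    # Pairwise distances keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Join the two closest clusters.
        (i, j), height = min(d.items(), key=lambda kv: kv[1])
        ni, nj = len(clusters[i]), len(clusters[j])
        updated = {}
        for k in clusters:
            if k in (i, j):
                continue
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            # Size-weighted mean = average over all pairs of original lines.
            updated[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(updated)
        merges.append((clusters[i], clusters[j], height))
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return merges


# Hypothetical dissimilarities among four lines.
toy_labels = ["A", "B", "C", "D"]
toy_dist = [
    [0.0, 0.2, 0.6, 0.7],
    [0.2, 0.0, 0.5, 0.8],
    [0.6, 0.5, 0.0, 0.3],
    [0.7, 0.8, 0.3, 0.0],
]
for members_a, members_b, height in upgma(toy_labels, toy_dist):
    print(members_a, members_b, height)
```

In this toy case A joins B first, C joins D second, and the two pairs are joined last at the mean of the four between-pair distances.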
28. ive clusters. It does not draw a dendrogram; rather, it simply lists similar entries (maize) into groups which have a higher between-group variance than within-group variance. You can then use each group as a separate data set to cluster. The advantages of this program are that it is much faster for clustering large data sets (other clustering methods can take a long time to run), and that working with a small data set appears to be preferable statistically. What may happen is that, with more entries, relationships between individual pairs get obscured or exaggerated. An individual entry may end up in a group not because it is similar to all the other members of that group, but because it is fairly similar to one of the members, which in turn is fairly similar to the others. You must specify the number of clusters you wish to end up with; you may wish to run Varclus first to get an idea of an appropriate number of clusters. An example of the Fastclus code used in SAS follows (notes are shown in square brackets and are not part of the program):

OPTIONS LINESIZE=132 PAGESIZE=77;
TITLE 'FASTClus Analysis of 123 lines using 35 core primers'; [change title as appropriate]
DATA DIST;
INFILE 'C:\DATA\core114b.MTX' LRECL=1050; [change the file path and name inside the quotes to your file and the correct path; use the output of mergcult.sas]
INPUT LINE $ 1-11 @12 dist1-dist35;
PROC FASTCLUS DATA=dist MAXITER=10 DRIFT LEAST=2 MAXC=25 OUT=out2 SUMMARY REPLACE=FULL LIST;
[Maxc = the maximum number of clusters you want SAS
29. k, J., D. Russell and J. Sambrook. 2001. Molecular Cloning: A Laboratory Manual, 4 ed. Cold Spring Harbor Laboratory.
SAS/STAT User's Guide, Version 6, Fourth Edition. SAS Institute Inc., Cary, NC.
Yap, I. and R.J. Nelson. 1996. WinBoot: a program for performing bootstrap analysis of binary data to determine the confidence limits of UPGMA-based dendrograms. IRRI Discussion Paper Series No. 14. International Rice Research Institute, P.O. Box 933, Manila, Philippines.

Appendix 1. Part 1: NTSYS data file

1 48 19L 1 9
CML247 CML254 CML258 CML264 CML268 CML273 CML274 LP1 LP2 LP3 LP4 LP5 P1 P21 TS1 TS2 TS3 TS4 TS5
0000000010000000000
0000100000000000000
0000000000001000000
0111011011111111101
0000000100000000010
1000000000000000000
0000900901000000000
0000900900110000000
1001911910000110010
0110900900001001101
1000000000000000000
0000010001000010110
0001011010111001001
0010100000000000000
0000000100000100000
0000000000001000000
1111000101100000001
0100111010010111111
0001000000000000000
1000000101110000000
0110000000000000110
0000100000001011001
0000011010000100000
0001000000000000000
0010100000001000000
1101100111110001111
0000011000000000000
0000000000000110000
1000000100000000000
0010000001000011111
0001000010111000000
0100011000000100000
0010000000000000000
0000100000000000000
1111111111110111111
0000000000110000000
0000000000001000000
1001000000000001101
0110011110100000000
0000000000001110000
000000000100000001
30. mples when questions arise regarding any procedure mentioned in this manual.

II. Data Collection
Data used in genetic diversity studies of plant species are molecular markers, namely Amplified Fragment Length Polymorphisms (AFLPs), Random Amplified Polymorphic DNA (RAPDs), Restriction Fragment Length Polymorphisms (RFLPs), and Simple Sequence Repeats (SSRs). RAPD and SSR markers are PCR-based and thus avoid the main difficulties associated with RFLP or AFLP data, specifically the cost and time involved in isolating DNA of sufficiently high quality and in visualizing the bands via radioactivity, fluorescence, or bioluminescence. It should be cautioned, however, that RAPD bands have demonstrated some problems related to repeatability. For an overview of molecular markers, we suggest GENES VII by Lewin (Oxford University Press, 2000) or the Molecular Cloning Laboratory Manual by Sambrook et al. (2001).

The data can be scored as presence/absence (1 or 0) in the case of dominant markers such as RAPDs or AFLPs, or as allele frequencies for SSRs or RFLPs. SSRs and RFLPs can also be scored as presence/absence, but some genetic information will be lost, so more markers should be used if markers will be scored this way. Presence/absence data should be entered into a spreadsheet such as Excel in the format followed in Table 1. Rows should correspond to variables or markers, and columns should correspond to the taxonomic units or lines
31. ning loci. The following example input file will be used in the example in Figure 4. More than one space is allowed between observations in this version of NTSYS. Note the two comment lines at the beginning of the file, each starting with a double quote.

" Blood group data from Cavalli-Sforza and Edwards (1967)
" 5 loci with a total of 19 alleles for 4 populations
1 19L 4L 0
A1 A2 B O CDE CDe cDE cDe Cde cdE cde MS Ms NS Ns Fya Fyb Dia Dib
Eskimo Bantu English Korean
0.2914 0.1034 0.2090 0.2208
0 0.0866 0.0696 0
0.0316 0.1200 0.0612 0.2069
0.6770 0.6900 0.6602 0.5723
0 0 0.0024 0.0082
0.4985 0.1400 0.4205 0.6197
0.4906 0.0100 0.1411 0.3148
0.0109 0.6000 0.0257 0.0573
0 0.0200 0.0098 0
0 0 0.0119 0
0 0.2300 0.3886 0
0.1719 0.0900 0.2377 0.0245
0.6703 0.4800 0.3048 0.4615
0 0.0400 0.0703 0.0646
0.1578 0.3900 0.3872 0.4494
0.7500 0.0600 0.4213 0.9950
0.2500 0.9400 0.5787 0.0050
0 0 0 0.0313
1 1 1 0.9687

For some coefficients, the SIMGEND module needs to know which alleles correspond to the same locus. This information is provided in a rectangular matrix, stored in a separate file, that contains a single row (or column) of codes indicating the locus that each allele belongs to. This information can also be used by the FREQ module. An example is shown below for the above data.

" Loci info for Blood group data from Cavalli-Sforza and Edwards (1967)
1 1 19L 0
A1 A2 B O CDE CDe cDE cDe Cde cdE cde MS Ms NS Ns Fya Fyb Dia Dib
1 1 1 1 2 2 2 2 2 2 2
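Before running SIMGEND it can be worth confirming that the allele frequencies of each locus sum to 1 within every population. The sketch below is a hypothetical Python helper, not part of NTSYS: the function name `locus_sums` is invented for illustration, and the example uses only the Eskimo column for the first two loci (the eleven locus codes that are legible above).

```python
from collections import defaultdict


def locus_sums(freqs, locus_codes):
    """Sum allele frequencies per locus code for one population."""
    if len(freqs) != len(locus_codes):
        raise ValueError("one locus code is needed per allele")
    sums = defaultdict(float)
    for freq, code in zip(freqs, locus_codes):
        sums[code] += freq
    return dict(sums)


# Eskimo frequencies for A1 A2 B O (locus 1) and CDE ... cde (locus 2).
eskimo = [0.2914, 0, 0.0316, 0.6770,
          0, 0.4985, 0.4906, 0.0109, 0, 0, 0]
codes = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2]

for locus, total in locus_sums(eskimo, codes).items():
    # Allow for rounding in the published frequencies.
    status = "ok" if abs(total - 1.0) < 1e-3 else "check"
    print(locus, status)
```

A locus flagged "check" usually points to a mistyped frequency or a locus code assigned to the wrong allele.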
32. nted here: the type of data matrix (1 = rectangular raw data matrix, as we have here); 10 = number of rows (markers or variables); 5 = number of columns (maize or entries); 1 = there is missing data (as opposed to 0, which would mean that there is no missing data in the entire file); and 9 = what we called the missing data. You can call it any number you like, but NTSYS, unlike SAS, will not accept a period. An example of an NTSYS input data file can be found in Appendix 1.

NTSYS version 2.02 has a built-in data editor where you can enter the data directly or open an Excel file for import into NTSYS. However, we have frequently had problems with this data editor: it may not recognize our Excel files, and data entered into the editor cannot be printed or exported to Excel. Therefore, we do not routinely use this data editor. More information on the data editor can be found in the NTSYS manual (version 2.02 or 2.10).

III. Data Analysis
Partitioning variation in the sample
Usually one of the first steps in a diversity study is to investigate the variation present in the sample under study, not to visualize relationships between individuals, but simply to see the overall breakdown of variation in the sample and, if it is a comparison of populations, the partitioning of diversity within and between populations. Some tools are available to quantify the variation present and how it is broken down among individuals, populations, and markers
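The header fields described above (matrix type, rows, columns, missing-data flag, missing-data code) can be generated from a scored matrix instead of being typed by hand. This is a minimal Python sketch with a hypothetical function name, `ntsys_text`; it writes no row or column labels, and it assumes that the missing-data code is listed in the header only when the missing-data flag is 1.

```python
def ntsys_text(matrix, missing_code=9):
    """Format a markers-by-lines 0/1 matrix as NTSYS-style text.

    None marks a missing observation and is written as `missing_code`.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    has_missing = any(value is None for row in matrix for value in row)
    # Header: matrix type 1 (rectangular), rows, columns, missing flag.
    header = [1, n_rows, n_cols, int(has_missing)]
    if has_missing:
        # Assumption: the code is listed only when the flag is 1.
        header.append(missing_code)
    lines = [" ".join(str(field) for field in header)]
    for row in matrix:
        lines.append(" ".join(
            str(missing_code if value is None else value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical scores: 2 markers (rows) by 3 maize lines (columns).
print(ntsys_text([[1, None, 0], [0, 1, 1]]), end="")
```

Periods used for missing data in the SAS version of the file would be read in as `None` before calling the function, so the same scores can feed both programs.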
33. ominant vs recessive alleles. In cluster analysis, many different proximity measurements can be used. In this manual, we use the Simple Matching, Jaccard's (Gower's is Jaccard), and Dice (Nei and Li) coefficients for calculating the phenotypic distance between each pair of entries (maize lines) in the diversity study. These are the three most commonly used coefficients in the literature. Other coefficients can easily be calculated by consulting the NTSYS manual; other coefficients calculated by SAS require more work, as SAS is not as user-friendly as NTSYS. One final note: the SAS procedures listed here calculate dissimilarity rather than similarity matrices, but dissimilarity is simply 1 - similarity, and the resulting dendrograms and scatter plots are identical for either one. The SAS procedure PROC CLUSTER, which we will examine later, always uses dissimilarity (distance) measurements.

SAS calculation of Dissimilarity Matrices
The following is a SAS program called Alldist.sas that can be used to calculate the proximity coefficients: Simple Matching, Jaccard's (Gower's), and Dice (Nei and Li, 1979). Parts in bold italics are notes and not part of the protocol; do not include them in the SAS program (they are shown in square brackets here). The notes tell you which part of the program must be changed according to the data set.

OPTIONS LINESIZE=132 PAGESIZE=77;
%MACRO DISSIMLR;
%LET N=35; [change the 35 to the number of lines or maize you have]
%DO I
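To make the three coefficients concrete, the sketch below computes them for one pair of lines in Python using the standard textbook definitions. It is not the Alldist.sas macro: the function names and the two score vectors are hypothetical, and missing observations (written as `None`) are assumed to be skipped for that pair.

```python
def match_counts(x, y):
    """Count band patterns between two lines; None marks missing data.

    a = band in both, b = band in x only, c = band in y only, d = in neither.
    """
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi is None or yi is None:
            continue  # skip markers with missing data in either line
        if xi == 1 and yi == 1:
            a += 1
        elif xi == 1 and yi == 0:
            b += 1
        elif xi == 0 and yi == 1:
            c += 1
        else:
            d += 1
    return a, b, c, d


def similarities(x, y):
    """Simple Matching, Jaccard and Dice similarities for one pair of lines."""
    a, b, c, d = match_counts(x, y)
    if a + b + c == 0:
        raise ValueError("no bands present in either line")
    return {
        "SM": (a + d) / (a + b + c + d),   # shared absences count as matches
        "Jaccard": a / (a + b + c),        # shared absences ignored
        "Dice": 2 * a / (2 * a + b + c),   # shared bands weighted double
    }


# Hypothetical scores for two maize lines over 10 marker bands.
line_x = [1, 1, 0, 1, 1, 0, 0, 0, None, 1]
line_y = [1, 1, 0, 1, 0, 0, 0, 1, 1, 1]
for name, sim in similarities(line_x, line_y).items():
    print(name, "similarity", sim, "dissimilarity", 1 - sim)
```

Simple Matching treats a shared absence as evidence of similarity while Jaccard and Dice do not, which is the main practical difference to keep in mind when choosing among them.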
34. s this is statistically shaky. Also, if you have no prior data on a given line, you may not be able to place it into any cluster; thus, you may not be able to include this line in the analysis. In all cases, be sure to explain why each individual was placed in the cluster you finally decide to put it in. Using NTSYS, you can compare the matrix produced by the SAHN procedure with the similarity coefficient matrix using the MXCOMP procedure; if there is a good correlation (above 0.9, for example), you can be more certain that the dendrogram produced is a good representation of the data (see the NTSYS manual for instructions). Finally, in order to visualize the data, you may wish to present the MDS or PCA graph, which gives a good three-dimensional picture of the variation. You can group the consensus clusters by drawing circles around individuals or coloring them the same color.

Bootstrapping
One final method for testing whether your data is statistically sound, and for making sure you have used enough markers in analyzing the data, is called bootstrapping. This method involves repeated analysis of the same data set to see if the resulting dendrograms change a lot following each analysis. If the program is unsure of the data, or if there are not enough markers, the algorithms used for clustering may result in clusters containing individuals that do not fit particularly well in that particular cluster. A bootstrapping program can repeat the cluster an
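The idea of repeating the analysis can be sketched in a few lines of Python. This is an illustration only, with several assumptions not stated in the text: markers are resampled with replacement (the usual form of the bootstrap for binary marker data), simple-matching distance is used, and "grouping together" is approximated by two lines being each other's nearest neighbour rather than by rebuilding a full dendrogram. All names and data are hypothetical.

```python
import random


def sm_distance(rows, i, j):
    """Simple-matching distance between lines i and j over the marker rows."""
    return sum(1 for row in rows if row[i] != row[j]) / len(rows)


def nearest(rows, i, n_lines):
    """Index of the line closest to line i (the first one wins ties)."""
    others = [k for k in range(n_lines) if k != i]
    return min(others, key=lambda k: sm_distance(rows, i, k))


def pair_support(data, i, j, n_reps=500, seed=1):
    """Fraction of bootstrap replicates in which i and j pair up."""
    rng = random.Random(seed)
    n_markers, n_lines = len(data), len(data[0])
    hits = 0
    for _ in range(n_reps):
        # Resample the markers (rows) with replacement.
        sample = [data[rng.randrange(n_markers)] for _ in range(n_markers)]
        if nearest(sample, i, n_lines) == j and nearest(sample, j, n_lines) == i:
            hits += 1
    return hits / n_reps


# Hypothetical data: 6 marker bands (rows) scored on 4 lines (columns).
scores = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
]
print(pair_support(scores, 0, 1))
print(pair_support(scores, 2, 3))
```

A pairing that appears in nearly all replicates is well supported by the markers; one that appears in only a modest fraction suggests that more markers are needed before that grouping is trusted.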
35. uency of occurrence of each allele (in this case, with 7 OTUs of diploid individuals, you have 14 possible alleles, so divide by 14). Frequencies must sum to 1.
Step 5: Square the frequency of each allele.
Step 6: Sum the squared frequencies.
Step 7: Subtract the summed squared frequencies from 1.

Ordination: visualizing relationships in the sample
The classification and/or ordination analyses performed on molecular data all use a dissimilarity or similarity matrix as input. This section will be divided according to the procedures, and will begin with the calculation of similarity matrices. Please see the SAS or the NTSYS manuals for further explanation of any of the procedures listed here. A good overview of the theory can be found in Beaumont et al. (1998).

Proximity matrices
For AFLP data and other dominant marker systems, we will calculate the similarity or dissimilarity (the two together known as proximity) between individuals using the methods for calculating diversity based on qualitative differences. Direct calculation of genetic distance is possible only for co-dominant marker data, where it is possible to calculate allelic frequencies for each marker in a population. This will be demonstrated in the following section. With dominant marker data this is impossible, since the heterozygous individuals cannot be distinguished from the dominant homozygous individuals, thus making it impossible to calculate the exact frequency of the d
36. vectors, which are the correlation between the original variable and the principal component.

NTSYS 1.7
Performing PCA using NTSYS requires the following steps (to use SAS for this procedure, please consult the SAS manual):

1. Convert the original data file (c:\inputdatamatrix.dat) to a similarity matrix (c:\simmatrix.dat), but run by ROWS (variables), not columns (see the section entitled Similarity Matrices, above).

2. Run the eigen program on the similarity matrix to generate eigenvectors and eigenvalues:
Input Matrix: C:\simmatrix.dat
Number of dimensions: 3
Sample size of mx: 0
Degrees of freedom of mx: 0
Eigenvector matrix: C:\simmatrix.vec
Eigenvalue matrix: C:\simmatrix.val
Vector scaling: SQRT(LAMBDA)
Listing file: CON

3. Run the projection program (PROJ) on the matrices to project the transformed data matrix onto the first three principal components (eigenvectors):
Name of matrix: C:\inputdatamatrix.dat
OTUs rows or cols: COL
Name of factor matrix: C:\simmatrix.vec
Projection type: Proj
Name of eigenvalue mx: C:\simmatrix.val
Name for projection matrix: C:\simmatrix.pro
Show matrix: NO
Listing file: CON

4. Use the MOD3D program to generate the graph of the output of PROJ:
Name of matrix: C:\simmatrix.pro
Direction to plot by: ROW
Variable for x axis: 1
Variable for y axis: 2
Variable for z axis: 3
Graph matrix: (leave blank)
Title: (choose your title)
Rotation aroun
