User Manual for Version 3 The Buckler Lab at Cornell University

1. Data can be sorted by clicking on the column header of interest. A secondary sort can be done by holding down the CTRL key and clicking on a second column. Data can be exported to flat files that are either comma separated (Comma Separated Values, CSV) or tab delimited. Both of these formats can then be imported into a spreadsheet program such as Excel. Tables can also be printed.

4.2 Tree Plot
Displays the results of cladogram analysis. After running Analysis > Cladogram, select the desired data set and then click Tree Plot in the Results mode (Results > Tree Plot). Trees can be visualized in either a Normal or Circular layout. These images can be printed, saved in JPEG format, or saved as a Scalable Vector Graphics (SVG) file.

4.3 2D Plot
Displays 2D plots and determines color thresholds. This function is useful for plotting associations in multiple environments. First, select the desired result set. Using the drop-down boxes provided, populate rows with Environment, columns with Site, and value with PermuteP. The cutoff value for coloring can be chosen either by inputting a value in the text box or by using the slider tool to the right of the text box.
2. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709-1723 (2008).
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355-60 (2010).
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348-54 (2010).
Thornsberry, J. M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nature Genetics 28, 286-289 (2001).
Pritchard, J. K., Stephens, M., Rosenberg & Donnelly, P. Association mapping in structured populations. American Journal of Human Genetics 67, 170-181 (2000).
Zhao, K. et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet 3, e4 (2007).
Yu, J. M. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203-208 (2006).
Casstevens & Buckler, E. S. GDPC: connecting researchers with multiple integrated data sources. Bioinformatics 20, 2839-2840 (2004).
Ware, D. et al. Gramene: a resource for comparative grass genomics. Nucleic Acids Research 30, 103-105 (2002).
Ware, D. H. et al. Gramene, a tool for grass genomics. Plant Physiology 130, 1606-1613 (2002).
Jaiswal, P. et al. Gramene: development and integration of trait and gene ontologies for rice. Comparative and Functional Genomics 3, 132-136 (2002).
3. To connect to more than one database, simply repeat the process outlined above. In the figures of the following sections, only the GDPC area will be displayed if other areas are deemed irrelevant.

6.6.2 Data Query
GDPC is equipped with several tabs to query data, namely Taxa, Taxon Parents, Loci, Genotype Experiments, Environment Experiment, and Localities. Within each tab, any retrieved data will be displayed in the Filtered List. Choose attributes by checking the desired boxes located beneath the Filtered List. After an attribute is selected, values of that attribute from the database are displayed. For example, using the Taxa tab, choose the Germplasm type field and then select a value. After clicking the Get Data button, the subset of taxa from the database that meets these criteria will appear in both the Filtered List and the Working List.

Items listed in the Working List can be modified by the user. To do so, first break the link between the Filtered List and the Working List by clicking on
4. Yamazaki & Jaiswal. Biological ontologies in rice databases: an introduction to the activities in Gramene and Oryzabase. Plant and Cell Physiology 46, 63-68 (2005).
Zhao, W. et al. Panzea: a database and resource for molecular and functional diversity in the maize genome. Nucleic Acids Research 34, D752-D757 (2008).
Canaran, Stein & Ware. Look-Align: an interactive web-based multiple sequence alignment viewer with polymorphism analysis support. Bioinformatics 22, 885-886 (2006).
Du, C. G., Buckler, E. & Muse, S. Development of a maize molecular evolutionary genomic database. Comparative and Functional Genomics 4, 246-249 (2003).
SAS. SAS Statistical Analysis Software for Windows, 9.0 ed. (Cary, NC, USA, 2002).
Hardy, O. J. & Vekemans, X. SPAGeDi: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2, 618-620 (2002).
Cover, T. & Hart, P. Nearest neighbor pattern classification. Proc. IEEE Trans. Inform. Theory 13 (1967).
Weir. Genetic Data Analysis II (Sunderland, MA, 1996).
Farnir et al. Extensive genome-wide linkage disequilibrium in cattle. Genome Res 10, 220-7 (2000).
Henderson, C. R. Best linear unbiased estimation and prediction under a selection model. Biometrics 31, 423-447 (1975).
Kang, H. M. et al. Efficient control of population structure in mod
5. $u$ is an unknown vector of random additive genetic effects from multiple background QTL for individuals/lines; $X$ and $Z$ are the known design matrices; and $e$ is the unobserved vector of random residual. The $u$ and $e$ vectors are assumed to be normally distributed with null mean and variance of

$\mathrm{Var}\begin{pmatrix} u \\ e \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}$

where $G = \sigma_a^2 K$, with $\sigma_a^2$ as the additive genetic variance and $K$ as the kinship matrix. Homogeneous variance is assumed for the residual effect, which means $R = I\sigma_e^2$, where $\sigma_e^2$ is the residual variance. The proportion of genetic variance over the total variance is defined as heritability, $h^2 = \sigma_a^2 / (\sigma_a^2 + \sigma_e^2)$.

When $K$ is derived from pedigrees, the elements of $K$ equal 2 × Probability(IBD), where IBD means that two alleles drawn at random are identical by descent. Generally, $K$ calculated from markers is an IBS matrix. The resulting multiplier is then not $\sigma_a^2$ but some unknown constant times $\sigma_a^2$. Some methods for calculating $K$, such as those implemented in SPAGeDi, actually use markers to develop an estimate of the IBD relationship matrix. For those values of $K$, the resulting variance estimate can be considered an estimate of $\sigma_a^2$ as long as the assumptions of the method used to derive $K$ are not violated for the population being analyzed. One implication is that two different $K$ matrices may give very different estimates of $\sigma_a^2$ and heritability, yet produce the same model fit and test of marker association. TASSEL implements several methods to improve statistical power and r
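The point about rescaled kinship matrices can be checked with a little arithmetic. The sketch below is only an illustration with made-up numbers: the 2 x 2 matrix `K`, the variance components, and the scale factor `c` are all hypothetical, and the helper names are not part of TASSEL. It shows that multiplying K by a constant while dividing the genetic variance by the same constant leaves G unchanged (so the model fit is the same), while the heritability computed from the variance components changes.

```python
def heritability(genetic_var, residual_var):
    """Proportion of genetic variance over total variance (h^2)."""
    total = genetic_var + residual_var
    if total <= 0:
        raise ValueError("total variance must be positive")
    return genetic_var / total


def scale_matrix(matrix, factor):
    """Multiply every element of a matrix (list of rows) by a constant."""
    return [[factor * value for value in row] for row in matrix]


# Hypothetical 2 x 2 kinship matrix and variance components.
K = [[1.0, 0.5], [0.5, 1.0]]
sigma_a2, sigma_e2 = 3.0, 1.0

# Rescaling K by c while dividing the genetic variance by c leaves G unchanged,
c = 2.0
G_original = scale_matrix(K, sigma_a2)
G_rescaled = scale_matrix(scale_matrix(K, c), sigma_a2 / c)

# but the heritability computed from the variance components differs.
h2_original = heritability(sigma_a2, sigma_e2)      # 3 / (3 + 1)
h2_rescaled = heritability(sigma_a2 / c, sigma_e2)  # 1.5 / (1.5 + 1)
print(G_original == G_rescaled, h2_original, h2_rescaled)
```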
6. Users can mouse over any box to view the value associated with that box. If P-value coloring is desired, simply check the P-value box. By checking the P-value box, Cutoff selection tools will be disabled and fields will instead be colored according to a grayscale key. This key can be shown by clicking on the ? icon next to the P-value check box.

4.4 LD Plot
Displays the results from the linkage disequilibrium analysis. After selecting the desired result from the Data Tree panel, click on the LD Plot button while in Results mode (Results > LD Plot). The graph that is generated displays LD between all possible pairs of sites. The black diagonal represents LD between each site and itself. The default setting graphs r² in the upper right and p-values in the lower left. This default can be modified by clicking on the radio buttons in the lower left. The left side of the graph contains text description of the gene or chromosome and the site within the ge
8. 1.1 Installation
The graphic version of TASSEL can be installed in one of three ways: using Java Web Start, as a stand-alone application, or using the source code.

1.1.1 Web start
TASSEL can be installed using Java Web Start technology, which automatically checks for the most recent version of TASSEL each time the application is executed. In addition, Java Web Start will ensure that the correct version of the Java Runtime Environment is running, thus avoiding complicated installation and upgrade procedures. Users should use Web Start unless they have a specific reason to use one of the other installation methods.

To begin, Java Web Start (JWS) must be installed prior to the installation of TASSEL. JWS is included as part of Java Runtime Environment (JRE) 5.0 and above. PCs and Macs will most likely have JWS already installed. If you need to install Java, the most recent version is available at http://www.java.com. The easiest way to tell if it is installed on your computer is to try running TASSEL from the following link: http://www.maizegenetics.net/tassel. If you will be using TASSEL frequently and would prefer to launch the application from your desktop rather than
9. ANALYSIS
6.3 ESTIMATION OF KINSHIP USING GENETIC MARKERS
6.4 ASSOCIATION ANALYSIS USING GLM
6.5 ASSOCIATION ANALYSIS USING MLM
6.6 IMPORTING DATA FROM A DATABASE VIA GDPC
6.6.1 CONNECTING WITH A DATABASE
6.6.2 DATA QUERY
6.6.3 IMPORTING GDPC DATA INTO TASSEL
6.6.4 SAVING GDPC QUERY RESULTS
7 APPENDIX
7.1 NUCLEOTIDE CODES DERIVED FROM IUPAC
7.2 TASSEL TUTORIAL DATA SETS
7.3 BIOGRAPHY OF TASSEL
7.4 FREQUENTLY ASKED QUESTIONS
1 WHAT DO I DO IF TASSEL MISBEHAVES?
2 WHERE DO I TURN FOR MORE INFORMATION?
3 HOW DO I JOIN THE FUN? TASSEL ON SOURCEFORGE
4 HOW DO I CHANGE THE AMOUNT OF MEMORY USED? WHAT DO I DO WHEN THE EXCEPTION JAVA.LANG.OUTOFMEMORYERROR APPEARS?
5 WHEN I CLICK ON THE MOST CURRENT VERSION OF TASSEL WEB START, A PREVIOUS VERSION APPEARS. WHAT SHOULD I DO?
6 WHAT SHOULD I SUBSTITUTE FOR MISSING VALUES IN TASSEL?
7 IS IT POSSIBLE TO CHANGE DATA NAMES IN THE DATA TREE?
8 HOW CAN I CREATE A TASSEL ICON ON DESKTOP?
9 WHY DO I GET EMPTY SQUARES IN MLM ASSOCIATION ANALYSIS?
10 WHY SHOULD I EXCLUDE ONE COLUMN OF THE POPULATION STRUCTURE?
11 CAN KINSHIP REPLACE POPULATION STRUCTURE?
12 WHY DO TASSEL AND SPAGEDI GIVE DIFFERENT KINSHIP ESTIMATES?
13 CAN I GET MARKER R SQUARE USING SAS PROC MIXED OR TASSEL MLM?
14 DOES MLM FIND MORE ASSOCIATIONS THAN GLM?
15 DO I NEED MULTIPLE TEST CORRECTION FOR THE P VALUE FROM TASSEL?
16 CAN TASSEL HANDLE DIPLOID GENOTYPE DATA?
17 HOW C
10. Tree Panel and then click the Transform button (Data > Transform). The Transform Column Data window will open. Click on the Impute tab in this window. Finally, click on the Create Data set button to create the new data set with missing values imputed. Note that missing values are now filled.

6.2 Principal Component Analysis
Principal component analysis (PCA) is a statistical tool that transforms a set of correlated variables into a smaller number of uncorrelated variables called principal components (PCs). The first PC captures as much of the variation as possible, and the succeeding PCs account for a decreasing fraction of the remaining variance. Another application of PCA is to use PCs derived from genetic markers to represent population structure. This method requires much less computing time than maximum likelihood estimation. As most marker data are characters, numericalization must be performed first. A common approach for converting character marker scores is to set one of the homozygotes to 0, the other homozygote to 2, and the heterozygote to 1. For haploids the conversion can be simply p
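The 0/1/2 coding of diploid marker scores described above can be sketched in a few lines. This is an illustration only, not TASSEL's implementation: the function name, the two-character genotype strings, the "N" missing code, and the choice of the alphabetically first allele as the homozygote coded 0 are all assumptions made for the example.

```python
def numericalize_site(calls, missing="N"):
    """Code diploid genotype calls at one biallelic site as 0, 1 or 2.

    Assumption: the alphabetically first allele observed defines the
    homozygote coded 0; the value is the count of the other allele.
    Calls containing the missing code are returned as None.
    """
    alleles = sorted({a for call in calls for a in call if a != missing})
    if len(alleles) > 2:
        raise ValueError("expected a biallelic site, found: " + "".join(alleles))
    if not alleles:
        return [None] * len(calls)
    reference = alleles[0]
    coded = []
    for call in calls:
        if missing in call:
            coded.append(None)  # leave missing data missing
        else:
            # 0 = reference homozygote, 1 = heterozygote, 2 = other homozygote
            coded.append(sum(1 for a in call if a != reference))
    return coded


print(numericalize_site(["AA", "AC", "CC", "NN", "CA"]))  # [0, 1, 2, None, 1]
```

Because PCA in TASSEL requires a data set without missing values, the `None` entries would still need to be imputed or the site dropped before analysis.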
11. algorithm as is used in determining linkage disequilibrium.

5.3 Preferences
The Quality Score Colors tab found in the Preferences dialog box allows the user to set cutoff values for visualizing quality score values on a sequence alignment or a set of called SNPs. To set a desired threshold, simply adjust the slider on the left side of the dialog. Ns, dashes, and alignments without any quality score information have a default value of -1 (minus one).

6 Tutorial
This tutorial reviews several common scenarios for using TASSEL in order to help the user better understand its capabilities for data manipulation and association analyses. The TASSEL software package includes a tutorial data set that can be downloaded from the TASSEL website (please unzip all files to a directory of your choice). This tutorial data set contains data for phenotype, genotype, population structure, and kinship.

6.1 Missing Phenotype Imputation
The phenotype file mdp_traits will be used to demonstrate the process of imputing missing data. Note that the data set contains missing values (NaN). To impute missing data, first select the mdp_traits data set in the Data
12. are not appropriate for heterozygous data. GLM or MLM fit SNPs one at a time, treating each distinct genotype as a separate class. This has the effect of fitting an additive plus dominance model. Separating the two effects is under consideration. Because handling heterozygotes as a third marker class is not appropriate for kinship or LD, those analyses should not be used for that type of data at the present time. Work to improve handling of heterozygotes is ongoing.

17. How to cite TASSEL
The paper that describes TASSEL as a software package and the papers that introduce specific methods implemented in TASSEL should be cited as appropriate, such as the unified Q + K approach, EMMA, compression of the mixed linear model, and P3D. For example:
Linkage disequilibrium (D', r² and P-value) were calculated by TASSEL.
Association analyses were performed with the mixed linear model approach implemented by TASSEL.
GWAS was performed with the compressed mixed linear model approach carried out by TASSEL, which also implemented the EMMA and P3D algorithms to reduce computing time.

REFERENCES
Bradbury, P. J. et al. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633-2635 (2007).
Zhang, Z., Buckler, E. S., Casstevens & Bradbury, P. J. Software engineering the mixed model for genome-wide association studies on large samples. Brief Bioinform 10, 664-75 (2009).
13. by revisiting the website, Java Web Start can be used to manually launch TASSEL each time and/or to create a shortcut. Access the Java Application Cache Viewer by going to Start > Settings > Control Panel > Java. From the General tab, click on Settings in the Temporary Internet Files section and then click on View, and the Java Application Cache Viewer will appear. Another way of achieving this is by going to Start > Run and typing in javaws. The TASSEL icon should now be visible and can be used to launch the application. Shortcuts can be created from the menu of the Java Application Cache Viewer (Application > Install Shortcuts).

1.1.2 Stand alone
Downloading a stand-alone version is recommended for anyone who has a slow Internet connection. While Java Web Start is a very good way of deploying software, it does not ask the user before attempting to download updates. Thus, a slow Internet connection may start a download process that requires an unreasonable amount of time to complete. If you are not interested in disabling your network connection each time before starting TASSEL, we recommend downloading the stand-alone version, which does not attempt to update the program. However, given that TASSEL is a Java application, a Java Runtime Environment version 1.6.0 or greater is still required. To get the stand-alone version, download tassel3.0_standalone.zip from the TASSEL web site. To run the stand-alone version, double click o
14. each trait by marker combination will be tested and two reports will be produced, one containing trait by marker F tests and the other containing allele estimates. To run GLM, select a data set and then click the GLM button. A dialog box will pop up to allow the user to indicate that a permutation test should be run and to allow the number of permutations to be changed. The permutation test will be run using the method suggested by Anderson and Ter Braak (2003), which calculates the predicted and residual values of the reduced model (containing all terms except markers), then permutes the residuals and adds them to the predicted values. When the GLM options dialog is closed, the user is presented with a dialog allowing the output to be saved to file rather than stored in memory and displayed by TASSEL. This option is useful when the output is expected to be very large and risks exceeding available memory.

The Marker Test output can be viewed with Results > Table. In addition to displaying the F statistics and p-values for the requested F tests, the table also contains markerR2, mean squares (MS), and degrees of freedom (DF) for the marker effect, for the model corrected for the mean, and for error. If taxa are replicated across reps or environments, then the markers are tested using the taxa within marker mean square. If taxa are unreplicated, then the residual mean square is used. Marker
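The residual-permutation idea described above can be sketched as follows. This is a deliberately simplified illustration and not TASSEL's code: it assumes the reduced model contains only the mean (no covariates), in which case permuting residuals and adding them back to the predicted values amounts to permuting the phenotypes. The function names, the one-way F statistic, and the example data are all hypothetical.

```python
import random
from statistics import mean


def marker_f_statistic(y, marker):
    """One-way ANOVA F statistic for a categorical marker."""
    groups = {}
    for value, genotype in zip(y, marker):
        groups.setdefault(genotype, []).append(value)
    grand_mean = mean(y)
    ss_marker = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_error = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    df_marker = len(groups) - 1
    df_error = len(y) - len(groups)
    return (ss_marker / df_marker) / (ss_error / df_error)


def permutation_p_value(y, marker, n_permutations=1000, seed=0):
    """Permute reduced-model residuals and compare F statistics."""
    rng = random.Random(seed)
    predicted = [mean(y)] * len(y)  # reduced model: mean only (assumption)
    residuals = [v - p for v, p in zip(y, predicted)]
    observed = marker_f_statistic(y, marker)
    hits = 0
    for _ in range(n_permutations):
        shuffled = residuals[:]
        rng.shuffle(shuffled)
        y_perm = [p + r for p, r in zip(predicted, shuffled)]
        if marker_f_statistic(y_perm, marker) >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never zero.
    return (hits + 1) / (n_permutations + 1)


phenotype = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
genotype = ["A"] * 4 + ["C"] * 4
print(permutation_p_value(phenotype, genotype))
```

With covariates in the reduced model, `predicted` would instead come from a least-squares fit of those terms; that step is omitted here to keep the sketch dependency-free.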
15. for each trait. The first line is for the model with no markers. Following that is a single line for each marker tested. The columns labeled df, F, and p are the degrees of freedom, F statistic, and p-value from the F distribution for the test of the marker. The column errordf is the degrees of freedom used for the denominator of the F test. The column labeled markerR2 is the $R^2$ for the marker, calculated based on a formula for $R^2$ for a generalized least squares (GLS) model. The columns Genetic Var, Residual Var, and -2LnLikelihood list $\sigma_a^2$, $\sigma_e^2$, and minus two times the model likelihood, respectively. When the P3D option is used, all of the values are the same for a given trait because they are only calculated once. A second table lists the estimated effects of each allele for each marker, similar to the output for GLM.

The compression results table shows the likelihood, genetic variance, and error variance for each compression level tested during the optimization process. The meaning of groups and compression is discussed above in the description of the compression method. The compression level with the lowest value of -2LnLk is used for testing markers.

3.8 Ridge Regression
This function performs ridge regression to
16. handling of s and non-standard characters
Added Sliding Haplotype functionality
Changed LD Fisher's Exact p-value to use two-sided p-value
Added ability to visualize sequence quality scores
Synonymize (match) taxa names between data sets
GLM analysis improvements
Code change preventing large data sets from being shown in JTable
Update of GDPC, which allows automatic restoration of last data source connection
Data transformation utilities added
K-Nearest Neighbor data imputation added
Association analysis with Mixed Linear Model
Taxa name synonymizer added
Basic heterozygosity handling added
Many ease-of-use improvements
Fixed problem loading genotype data
Mixed Linear Model changes: output NaN if non-converged
Fixed problem loading genotype data
Detection of duplicate ID in kinship
Correction on progress bar with MLM
Starting values of NaN from previous marker are no longer used
MLM: significant speed improvement (10 faster)
GLM: added user-defined F tests; output taxa or marker means
Principal Components Analysis
Architecture restructure and pipeline version for advanced users
Genetic marker data numerical transformation
MLM: implemented P3D algorithm; increased speed by an order of magnitude (at least ten times)
May 2009: EMMA implemented
November 2009: TASSEL Version 3 release, redesigned for large genomic data and large samples
April 2010: Compression of MLM implemented

7.4 Frequently As
17. of 3093 SNPs spread across the maize genome. For the dwarf8 gene sequence, use the joint data set created by following the tutorial for GLM. Solve the mixed linear model by highlighting the joint data set and the kinship data, then clicking the MLM button in Analysis mode. The MLM option dialog will pop up. Choose the default options, which use P3D and compression at the optimum compression level. After the Run button is clicked, the progress bar will start moving. The time required will depend on sample size, number of traits, number of markers, and the options chosen in the MLM option dialog. After the progress bar is reset to zero, indicating completion of MLM, three reports will be added to the data tree. The first two are similar to the reports created by GLM. The most significant SNP is still the same; however, the strength of association is weaker, with a P-value of 7.199×10^[…] vs. 1.1021×10^[…] from GLM, which does not pass the Bonferroni multiple test threshold (5×10^[…]). The third report contains the MLM-specific statistics, including -2 Log Likelihood, genetic variance, and residual variance components under different levels of compression. These statistics are illustrated by th
18. significance level of 1% after Bonferroni multiple test correction (0.01/3093). The association was not significant. The output labeled GLM_Allele Estimates shows the marker effects assigned to genotypes for each SNP. The GLM is also the same. For example, the first SNP at 157104 bp on chromosome 1 had three genotypes, AA, CC, and AC, coded as A, C, and M based on the IUPAC code (see Appendix: Nucleotide Codes).

6.6 Importing Data from a Database via GDPC
GDPC middleware that is integrated into TASSEL allows the user to import data from a database. To display GDPC in TASSEL, click on the GDPC button in Data mode. General rules for working with databases include: (1) establish a connection with the database; (2) define a query; (3) once the desired data is in GDPC, load the data from GDPC into TASSEL.

6.6.1 Connecting with a Database
To establish a connection with a database, click the Add Conn button followed by the button of the database you wish to add. Then click Ok. In the example below, we chose Panzea.
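The Bonferroni cutoff quoted in the GLM tutorial text above (a 1% level spread over 3093 SNPs) is a one-line calculation. The helper names below are hypothetical and only restate that arithmetic; they are not TASSEL functions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def passes_bonferroni(p_value, alpha, n_tests):
    """True if a single test's p-value is below the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


threshold = bonferroni_threshold(0.01, 3093)
print(f"{threshold:.3e}")  # 3.233e-06
```

A marker is then declared significant only if its p-value falls below this per-test threshold.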
19. taxa names can prevent proper joining. Taxa names can be made uniform by using the Synonymizer.

2.11 Intersection Join
This button joins multiple data sets by the intersection of their taxa. Taxa must be present in both data sets to be included. Select multiple data sets using the CTRL key in conjunction with mouse clicks, and then click the intersection button to join the data sets. Because this function uses taxa names to join data sets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by using the Synonymizer.

3 Analysis Mode
Analysis mode consists of the following options.

3.1 Diversity
This button executes a basic diversity analysis. Average pairwise divergence, segregating sites, and estimates of 4Nμ can be calculated, as well as sliding windows of diversity. To run a diversity analysis, click on a raw sequence alignment and then select Analysis > Diversity. In the resulting Diversity Surveys dialog box, the various site classes available for analysis are listed on the left. If the sequence has no annotation, then only the Overall and Indels options will be active. A sliding window of diversity can also be calculated across the region. To prod
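The intersection join described above, and the way a small difference in taxa names silently drops a taxon, can be illustrated with plain dictionaries. This is a sketch under assumptions: data sets are modeled as hypothetical mappings from taxon name to a list of values, and `intersect_join` is not a TASSEL function.

```python
def intersect_join(*data_sets):
    """Join data sets on taxon name, keeping taxa present in every set.

    Each data set is a dict mapping taxon name -> list of values.
    Matching is by exact name, so "Mo17" and "mo17" do not match.
    """
    if not data_sets:
        return {}
    shared = set(data_sets[0])
    for data_set in data_sets[1:]:
        shared &= set(data_set)
    return {
        taxon: [value for data_set in data_sets for value in data_set[taxon]]
        for taxon in sorted(shared)
    }


traits = {"B73": [1.2], "Mo17": [3.4], "W22": [5.6]}
genotypes = {"B73": ["A"], "mo17": ["C"], "W22": ["A"]}  # note the lower-case name

joined = intersect_join(traits, genotypes)
print(joined)  # {'B73': [1.2, 'A'], 'W22': [5.6, 'A']}
```

The taxon spelled "Mo17" in one set and "mo17" in the other is lost from the result, which is the situation the Synonymizer is meant to prevent.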
20. that the variables in this file will be used as covariates, not as dependent variables. This is the format to use for population structure covariates. Example (data rows):

33-16  0.014  0.972  0.014
38-11  0.003  0.993  0.004
4226   0.071  0.917  0.012
4122   0.035  0.854  0.111
A188   0.013  0.982  0.005

2.2.7.3 TASSEL version 2.1 formats
Version 2.1 formats for numeric data will continue to be supported to provide backward compatibility. However, that format does not identify covariates as such. As a result, any covariates imported using this format will need to be properly identified using the Trait filter function described later in the manual.

2.2.7.4 Repeated measurements
A format for repeated measurements may be implemented in the future.

2.2.8 Square Numerical Matrix
Kinship can be calculated externally from pedigrees by using SAS Proc Inbreeding, or from markers by using software packages such as SPAGeDi. The following format is provided to import the resulting kinship estimates. If n represents the number of taxa, the format for kinship files is as follows:

Taxa1Name  r11  r12  ...  r1n
Taxa2Name  r21  r22  ...  r2n
...
TaxanName  rn1  rn2  ...  rnn

Here rij is the element in the kinship matrix located at row i and column j. Missing values are not allowed in the kinship matrix. Important note: the current format is different from the format used in TASSEL version 2.0 or lower.

2.2.9 Genetic Map
Genetic Map is a l
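A file in the square-matrix layout described above can be sanity-checked before loading it. The reader below is a minimal sketch, not TASSEL's loader: it assumes whitespace-separated fields, and it additionally tolerates an optional first line holding only the taxa count, which is an assumption of this example rather than something stated in the text. It enforces the two rules the text does state: the matrix must be square and must not contain missing values.

```python
def parse_square_matrix(text):
    """Parse rows of 'TaxonName r1 r2 ... rn' into (names, matrix)."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    expected = None
    # Assumption: an optional leading line may carry the number of taxa.
    if rows and len(rows[0]) == 1 and rows[0][0].isdigit():
        expected = int(rows[0][0])
        rows = rows[1:]
    names = [row[0] for row in rows]
    n = len(names)
    if expected is not None and expected != n:
        raise ValueError(f"header says {expected} taxa but found {n} rows")
    matrix = []
    for row in rows:
        if len(row) != n + 1:
            raise ValueError(f"row {row[0]!r} has {len(row) - 1} values, expected {n}")
        values = [float(field) for field in row[1:]]
        if any(value != value for value in values):  # NaN is the only value != itself
            raise ValueError(f"row {row[0]!r} contains a missing value")
        matrix.append(values)
    return names, matrix


example = """3
A 2.0 0.5 0.0
B 0.5 2.0 0.1
C 0.0 0.1 2.0
"""
names, kinship = parse_square_matrix(example)
print(names, kinship[0][1])
```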
21. trait data or covariates. Kinship must be loaded as a square numerical matrix. Users can either specify the file type or use the Guess option to let the program determine the file type. As an example, we describe how the Guess function can be used to import all the files from the tutorial data set. The tutorial data can be downloaded from the TASSEL website or using this link: http://www.maizegenetics.net/tassel/docs/TASSELTutorialData3.zip. To use the data, the zip file must be uncompressed and saved in a folder that the user specifies. To import data, click the LOAD button. The File Loader dialog box will then pop up to let the user choose the files and specify a format. For the files in the tutorial data set, the default Guess function will load all the files correctly. Multiple files can be imported simultaneously by highlighting them first (holding the Shift or Control key while clicking) and then clicking the Open button.

2.2.1 BLOB
A Binary Large Object (BLOB) is a collection of binary data stored as a single entity. In TASSEL, BLOBs are used to compress large data sets into more manageable sizes. For sequence data, three types of BLOBs are used: SNP value BLOB, position BLOB, and SNP ID BLOB. The three BLOBs are used to store individual SNP values, SNP position within the genome, and the SNP identifiers, respectively. A BLOB is composed of two components, a header and a body. The header for each BLOB is 1024 bytes long, while the length
22. traits, the algorithm finds other taxa (neighbors) that are most like it for the non-missing traits. It uses the average of the neighbors to impute the missing data. Click on the Impute tab to display the imputation options.

2.8.4 PCA
Principal component analysis (PCA) can only be performed on a numerical data set without missing values. Two methods are available: correlation or covariance. This determines whether a correlation or covariance matrix will be used as the basis for the analysis. The default, correlation, is a reasonable choice for genetic data. The number of PCA axes in the output data set can be controlled by selecting one of the following: the minimum eigenvalue associated with each axis, the minimum percent of the variance captured by an axis, or the number of axes. The resulting axes will be sorted by the amount of variance each captures.

2.9 Synonymize Taxa Names
This button makes taxa names uniform to permit the joining of data sets. The join functions that generate fused data sets work by matching taxa names. Consequently, if multiple names exist for a given taxon (an added suffix, alternative spellings, different naming conventions, etc.), then the two data sets will not join correctly. To help remedy this, the Synonymizer function allows the taxa names of one data set to replace similar taxa names in the second data set. It relies on an algorithm that calcula
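The nearest-neighbor imputation summarized at the start of this excerpt can be sketched as follows. The details are assumptions made for illustration and may differ from TASSEL: distance is unscaled Euclidean distance over the traits observed for the taxon being imputed, neighbors must have all of those traits plus the missing one, and `None` marks a missing value. The function name and data are hypothetical.

```python
import math


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k most similar rows.

    rows: list of per-taxon trait lists, with None for missing values.
    Returns a new list; the input is left unchanged.
    """
    imputed = [row[:] for row in rows]
    for i, row in enumerate(rows):
        observed = [t for t, value in enumerate(row) if value is not None]
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue  # a neighbor must have the trait being imputed
                if any(other[t] is None for t in observed):
                    continue  # and every trait used for the distance
                distance = math.sqrt(sum((row[t] - other[t]) ** 2 for t in observed))
                candidates.append((distance, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = candidates[:k]
            if nearest:
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


traits = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, None]]
print(knn_impute(traits, k=2)[3])  # [1.05, 11.0]
```

Because the distance is unscaled, traits measured on very different scales would dominate it; standardizing the columns first is one way to avoid that.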
23. R2 is the marginal $R^2$ for the marker, calculated as SS Marker (after fitting all other model terms) / SS Total, where SS stands for sum of squares.

The Allele Estimates output can be viewed with Results > Table. For each marker and trait combination, each marker allele is listed along with the number of observations for taxa carrying that allele (Obs), the locus (usually chromosome) and locus position of that marker, the allele, and the estimate of the effect of that allele. Because of the way that GLM codes alleles, the last allele estimate for a marker is always zero and the other allele estimates are relative to that.

3.7 Mixed Linear Model
This conducts association analysis via a mixed linear model (MLM). A mixed model is one which includes both fixed and random effects. Including random effects gives MLM the ability to incorporate information about relationships among individuals. When a genetic marker based kinship matrix (K) is used jointly with population structure (Q), the approach improves statistical power compared to Q only. MLM can be described in Henderson's matrix notation as follows:

$y = X\beta + Zu + e$

where $y$ is the vector of observations; $\beta$ is an unknown vector containing fixed effects, including genetic marker and population structure (Q)
24. TE TASSEL
REFERENCES
INDEX

INTRODUCTION
While TASSEL has changed considerably since its initial public release in 2001, its primary function continues to be providing tools to investigate the relationship between phenotypes and genotypes. As indicated by its title, Trait Analysis by aSSociation, Evolution and Linkage, TASSEL has multiple functions, including association study, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, missing data imputation, and data visualization. One of the design elements driving TASSEL development has been the need to analyze ever larger sets of data. For example, the MLM (mixed linear model) function for association analysis originally used an EM (expectation maximization) algorithm, which is a common method for solving mixed models but is relatively slow. Subsequently, developers implemented the EMMA algorithm to increase computing speed. Model compression was added to that to improve speed and statistical power for association study. Another technique, which optimizes variance components once and then uses the estimates to test markers, now provides the ability to screen the large numbers of markers used in genome-wide association studies (GWAS). The method was independently described by Zhang et al. and Kang et al. in 2010. This method was
25. Use the graph type combo box to select the desired graph type (XY Plot) from the list of options. Select data to be plotted on the X and Y axes using the appropriate drop-down boxes. If two data series are plotted simultaneously on the Y axis, the 2 Y Axes checkbox will provide an axis for each.

[Figure: XY plot, DPOLL HOMESTEAD vs. DPOLL CLAYTON]

5 Menus
TASSEL includes File, Tools, GDPC, and Help menus. The File menu is mainly used to save the entire data tree, which includes the data loaded into TASSEL and the data created within TASSEL. A previously saved data tree can be loaded into TASSEL. This function gives users the capability to save their intermediate results. The Tools menu contains a contingency test and an option to set preferences. GDPC (Genomic Diversity and Phenotype Connection) is a software package to retrieve data, such as SNPs and phenotypic data, from open database sources. It can also be started using the GDPC button in Data mode. Its use is described earlier in the manual.

5.1 File Menu
Individual data sets on the data tree and the entire data tree can be saved. An individual data set is saved in the genotype format for sequence data or numerical format for ph
26. User Manual for Trait Analysis by aSSociation, Evolution and Linkage (TASSEL), Version 3
The Buckler Lab at Cornell University
August 28, 2011
www.maizegenetics.net/tassel

Disclaimer: While the Buckler Lab at Cornell University has performed extensive testing and results are, in general, reliable, correct or appropriate results are not guaranteed for any specific set of data. It is strongly recommended that users validate TASSEL results with other software.

Further help: Additional help is available beyond this document. Users are welcome to report bugs and request new features through the TASSEL website. Questions are also welcome to our current team members. For quicker and more precise answers, please address your questions to the most pertinent person.
General information: Ed Buckler (project leader), esb33@cornell.edu
Data import, pipeline: Terry Casstevens, tmc46@cornell.edu
Statistical analysis: Peter Bradbury, pjb39@cornell.edu; Zhiwu Zhang, zz19@cornell.edu

Contributors: Yogesh Ramdoss, Michael E. Oak, Karin J. Holmberg, N. Stevens, and Yang Zhang

The TASSEL project is supported by the National Science Foundation and the USDA-ARS.
Main web site: http://www.maizegenetics.net/tassel
Open source code: http://sourceforge.net/projects/tassel
A modified version of the PAL library is used: http://www.auckland.ac.nz/pal-project
Database access is achieved by GDPC middleware: http
27. ach combination of traits and markers. TASSEL provides users several options: (1) to estimate genetic and residual variance for each combination; (2) to get these estimates once for each trait, without fitting genetic markers, and then to use those estimates to test markers; (3) to use a prior heritability estimate provided by the user. The second option, named P3D (population parameters previously determined), has the same statistical power as the first option. Using the P3D method or using a prior heritability can be much faster than calculating heritability for each marker. Using MLM is very similar to using GLM. The difference is that, in addition to choosing the joint data set or numerical data set, kinship data must also be highlighted before clicking the MLM button to show the MLM option dialog. The option of No Compression is the regular MLM, which is equivalent to Custom level 1. For data sets with large numbers of taxa, the optimal compression option may be considerably slower than no compression or user-supplied compression. This is because the algorithm solves the model once for each of a series of compression levels in order to determine the optimal one. MLM analyses create two output tables: model statistics and model effects. If compression is used, the analysis creates three tables. The statistics table shows the results of the tests
28. ata set is selected, mathematical transformation, data imputation, and principal component analysis (PCA) can be performed. The Transform columns tags will be displayed in a Data dialog box with three tabs: Trans, Impute, and PCA. 2.8 Transform. 2.8.1 Genotype Numericalization. Two options are provided to transform genotypes from character to numerical, as shown in the following dialog box. 2.8.1.1 Collapse Non Major Alleles. This function assigns 1 to the major allele and 0 to any other alleles. The converted genotypes are saved in a new numerical data set. 2.8.1.2 Separate Alleles. This function assigns an indicator (1 for present and 0 for absent) for each allele. The converted genotypes are saved in a new numerical data set. 2.8.2 Transform and/or Standardize Data. The Trans dialog box is the default selection, as shown below. In the Column list, select the column(s) you wish to transform. Then select the type of transformation you wish to execute. Selecting the Standardize checkbox will transform the data by subtracting the column mean from the value of the trait and then dividing by the column's standard deviation. Clicking the Create Data set button places a data set containing only the selected columns in the Data Tree. 2.8.3 Impute Phenotype. The k-nearest-neighbor algorithm is used to impute missing phenotype data. If data is missing for a taxon for one of the
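The Standardize option described above (subtract the column mean, then divide by the column's standard deviation) can be sketched in a few lines of Python. This is a minimal illustration, not TASSEL's own code: the function name `standardize` is hypothetical, missing values are represented as `None`, and the sample (n-1) standard deviation is assumed because the manual does not say which form is used.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores for one trait column, leaving missing entries as None.

    Assumption: the sample standard deviation (n - 1 denominator) is used;
    the manual does not specify which form TASSEL applies.
    """
    observed = [v for v in values if v is not None]
    column_mean = mean(observed)
    column_sd = stdev(observed)
    return [None if v is None else (v - column_mean) / column_sd for v in values]


# Hypothetical trait column with one missing value.
z_scores = standardize([1.0, 2.0, None, 3.0])
```

Missing entries are skipped when computing the mean and standard deviation, so they do not distort the scaling of the observed values.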
29. axa from the right side. Click the arrow button to substitute the taxa. Taxa with no synonym can be identified by selecting them and then clicking No Synonym. Click OK to save the changes. [Figure: Synonymizer dialog showing the matching threshold and the table of matched taxa names] Once it has been determined that the taxa names were matched correctly, the synonyms can be applied. With the synonyms selected, hold down the CTRL key while clicking on the second data set (the data set whose names you would like to change). Then click the Synonymizer button once again to apply the new names to the data set. 2.10 Union Join. This button joins multiple data sets by a union of their taxa. Missing data will be inserted if taxa are missing from one data set. Select multiple data sets using the CTRL key in conjunction with mouse clicks, and then click the union button to join the data sets. Because this function uses taxa names to join data sets, any variation in
30. ction (GDPC). GDPC is middleware that eliminates the need for end users of data to understand various database schemas and write SQL queries to extract data. Instead, the GDPC browser provides a single easy-to-use interface which can extract genotype and phenotype data from a variety of sources. Currently GDPC has connections to the following databases: Gramene diversity for maize, wheat, and rice (http://www.gramene.org/db/diversity/diversity_view); Panzea (http://www.panzea.org); GRIN (http://www.ars-grin.gov). GDPC can be used within TASSEL or as a stand-alone application. To display GDPC in TASSEL, click on the GDPC button in Data mode. Data is available for import once the user has defined the desired filters and data is visible in either the Genotypes or Phenotypes tab. To load data, activate either the Genotypes or Phenotypes tab, depending on the data you wish to import, and then click the Load button. For additional information about GDPC, please see http://www.maizegenetics.net/gdpc. 2.2 Load. This function provides options to import files for genotypes, phenotypes, population structure, and kinship matrices. Several common sequence formats are accepted for genotype data, including BLOB, Hapmap, Plink, and Flapjack, as well as a general format for polymorphism data. Some file types used by TASSEL version 2 are also supported for backward compatibility. Phenotype and population structure can be imported as numerical
31. cts/tassel, thereby allowing anyone to access the most recent changes to the code. This setup makes it convenient for anyone to add special functionality to TASSEL if they so desire. It also serves as a good platform for anyone who wishes to become involved in a bioinformatics software development project. 4. How do I change the amount of memory used? What do I do when the exception java.lang.OutOfMemoryError appears? If you are working with very large data sets or are running memory-intensive procedures, there may be occasions when TASSEL runs out of memory. For most routine usage, however, TASSEL's memory is sufficient. Memory issues usually result from attempting to execute a procedure like LD on a raw sequence alignment instead of selected SNPs. You may also experience a memory issue if you are not sufficiently specific when retrieving information through GDPC. By default, TASSEL is allocated up to 512 MB of memory on your computer. If more is available on your computer, you can increase the amount allocated by downloading the stand-alone version of TASSEL and running it from a command line. Open a command line window (in Windows, use Start > Run and type in cmd or command), cd to the directory containing the stand-alone jar file, then start TASSEL by typing the following: java -Xms256M -Xmx768M -jar sTASSEL.jar. Here -Xms...M specifies the starting memory available and -Xmx...M specifies the maximum m
32. e Chart function in the Result mode as follows: groups vs. -2LnLk; groups vs. Var(genetic) and Var(error). In the example, 79 are included in the final analysis. When they are clustered into 44 groups, the -2 log likelihood reaches a minimum, which indicates the best model fit. The screening of SNPs was performed at this optimum compression level. Note: When two or more individuals are clustered into one group, the variance component for the random effect is not equivalent to the one without compression. Consequently, the heritability derived should not be interpreted as the individual-based heritability. To perform a Genome Wide Association Study (GWAS) on the 3093 SNPs, we need to create a new joint data set containing the filtered phenotype, population structure, and the genome-wide genotype. Highlight the new joint file and the kinship data and click the MLM button. Choose the default options in the MLM option dialog. The analysis will take a minute or two. The output report labeled MLM compression indicates that 259 lines were used in the analysis. With 74 groups the statistics from the best are as graphed below. [Figures: groups vs. -2LnLk; groups vs. Var(genetic) and Var(error)] The strongest associated SNP is at 193565357 bp on chromosome 3. The P value is 1 302710 The threshold is 3 2331x10 at
33. ection Join in Data mode to create a combined data set. 5. Association analysis. Highlight the joint data set, then click GLM in Analysis mode to perform association analysis. Two reports will be added to the data tree. [Figure: TASSEL window with the two GLM reports in the data tree] One of the reports added to the data tree is labeled GLM_Marker Test followed by the name of the joint data. In addition to the information for traits and markers, the data set contains the following statistics: marker F, the F value from the F test on the marker; marker P, the P value from the F test on the marker; markerR2, the R2 for the marker after fitting other model terms (population structure); markerDF, degrees of freedom of the marker; markerMS, mean square of the marker; errorDF, degrees of freedom of the residual error; errorMS, mean square of the residual error; modelDF, degrees of freedom of the model; modelMS, mean square of the model. Clicking marker will sort the table by P value. The smallest P value is 1 1021x10 for the SNP at position 6. The threshold is 5x10^-4 at a significance level of 1% after Bonferroni multiple test correction (0.01/20). T
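The Bonferroni correction used in this tutorial step divides the chosen significance level by the total number of SNPs tested. The sketch below is only an illustration of that arithmetic: the function names are hypothetical, and the figures 0.01 and 20 are taken from the tutorial text as reconstructed here.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def is_significant(p_value, alpha, n_tests):
    """True when a single marker's P value passes the corrected threshold."""
    return p_value < bonferroni_threshold(alpha, n_tests)


# A 1% significance level spread over 20 tested SNPs.
threshold = bonferroni_threshold(0.01, 20)
```

A marker is declared significant only when its P value is below this per-test threshold, which keeps the family-wise error rate at or below the chosen level.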
34. educe computing time. The Restricted Maximum Likelihood (REML) estimates of the genetic and residual variance components are obtained through the Efficient Mixed Model Association (EMMA) algorithm, which is much faster than the expectation and maximization (EM) algorithm. TASSEL also implements a method called compression, which reduces the dimensionality of the kinship matrix to reduce computational time and improve model fitting. When MLM is used without compression (compression = 1), each taxon belongs to its own group. At the other extreme, GLM can be interpreted as maximum compression (compression = n), with all taxa in a single group. In that case it is not possible to estimate the random effect independently of error, and the random effect is absorbed into the error term. Between these two extremes, taxa can be grouped using cluster analysis based on kinship. When n individuals are compressed into s clusters (groups), the kinship among individuals is replaced with the kinship among groups. At some grouping levels, dependent on the trait and population being analyzed, this compressed MLM has improved statistical power compared to the regular MLM. The optimum grouping, with the best model fit for MLM without fitting genetic markers, has the best statistical power for an association test of markers. TASSEL allows users to specify the compression level (average number of individuals per group) or to have the program determine the optimum grouping. Similar to GLM, MLM performs an association test for e
35. el organism association mapping. Genetics 178, 1709-23 (2008). Laird, N. M. & Ware, J. H. Random-effects models for longitudinal data. Biometrics 38, 963-974 (1982). Thornsberry, J. M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nat Genet 28, 286-9 (2001). Flint-Garcia, S. A. et al. Maize association population: a high-resolution platform for quantitative trait locus dissection. Plant J 44, 1054-64 (2005). Anderson, M. J. & Ter Braak, C. J. F. Permutation tests for multi-factorial analysis of variance. Journal of Statistical Computation and Simulation 73, 85-113 (2003). INDEX: Analysis Mode; Annotated alignment; Cladogram; Collapse Non Major Alleles; compressed MLM; Compression; compression level; Data Mode; data tree; Diversity; EM algorithm; expectation and maximization algorithm; File Menu; General Linear Model; Genome Wide Association Study; Genotype Numericalization; Hapmap; Henderson (see MLM); Heritability; Impute Phenotype; Impute SNPs; Kinship; Linkage Disequilibrium; Mixed Linear Model; Numerical data; Open source code; Panels; Plink; Population parameters previously determined; Principal component analysis; Restricted Maximum Likelihood; Specified number of rows columns and labels
36. emory available to the Java Virtual Machine. You may set the values higher or lower as your hardware dictates. Alternatively, you can modify the start_tassel.bat or start_tassel.pl file that comes with the standalone distribution. 5. When I click on the most current version of TASSEL web start, a previous version appears. What should I do? The previous version of TASSEL web start was cached on your machine. To replace it with the most current version, click the Start button in Windows, followed by Run. Type javaws and then click OK. In the window that opens, keep the most current version of TASSEL and delete the rest. 6. What should I substitute for missing values in TASSEL? For numerical data in version 3 format, use NA or NaN. For numerical data in version 2 format, use 999 for missing values. For SNP data use … For SSR data use … Kinship does not allow missing values. 7. Is it possible to change data names in the Data Tree? Yes. Click on the desired data name in the Data Tree, wait for one second, and then click it again, or immediately hit the F2 key. Rename the data set and then hit Enter to save the change. 8. How can I create a TASSEL icon on the desktop? Click Start on Microsoft Windows and select Control Panel, then double-click Java to show the Java Control Panel. In the Temporary Internet Files section, click the View button to show the Java Cache Viewer. Move the mouse over the TASSEL application a
37. ences/ecoeval/spagedi.html). Comparisons of methods for calculating kinship can be found in the literature (e.g., Stich et al. 2008). 3.6 General Linear Model. This function performs association analysis using a least squares fixed effects linear model. TASSEL utilizes a fixed effects linear model to test for association between segregating sites and phenotypes. The analysis optionally accounts for population structure using covariates that indicate degree of membership in underlying populations. A main-effects-only model is automatically built using all variables in the input data. A separate model is built and solved for each trait and marker combination. Any factors, covariates, reps, or locations are included in every model as main effects. How the data is used must be defined either in the input data files or using the Trait Filter after the data has been imported but before it has been joined with a genotype. General Linear Model (GLM) can be run using a numeric data set only or numeric data joined to genotype data. If only numeric data is selected, best linear unbiased estimates (BLUEs), or least square means, will be generated for the taxa for each trait. Note: only factors and covariates intended to control field variation should be included at this stage. Population structure covariates, which are intended to control for marker effects, should only be included when markers are also in the analysis. If numeric data with genotypes are analyzed
38. enotype within the Data folder in TASSEL. Results will look as follows. To load phenotype data from GDPC into TASSEL, first click on the GDPC button in Data mode. Then choose the Phenotypes tab, followed by the Load button. The phenotype data is then loaded into TASSEL and labeled as 4 traits environ. To view the uploaded data, select 4 traitsenviron from the Phenotypes folder in TASSEL. Results will appear as follows. [Figure: phenotype data table] 6.6.4 Saving GDPC Query Results. All query results, including both genotype and phenotype queries, can be saved as either tab-delimited text files or XML files. Results are exported as tab-delimited text files by first choosing the Query tab and then clicking on the Export button, or saved in XML format by clicking the Save As button. Location and file name must be specified in both situations. Data in XML format can be imported back into GDPC by clicking on the Open button. 7 Appendix. 7.1 Nucleotide Codes (Derived from IUPAC). [Table: Code and Meaning, including the codes for homozygous insertion, homozygous deletion, and N = Unknown] 7.2 TASSEL Tutorial Data sets. The data set contains 9 files and can be downloaded at http://www.maizegenetics.net/tassel/docs/TASSELTutorialData3.zip. File File name F
39. enotype, covariate, and kinship. The data tree is saved in a binary format. 5.1.1 Save Data Tree. This feature allows you to save the entire contents of the Data Tree panel to the default location. This is helpful when the user does not wish to recreate a Data Tree panel that is already well populated with information the next time they initialize the program. To save a Data Tree, select File > Save Data Tree. 5.1.2 Open Data Tree. To restore a Data Tree that was previously saved, select File > Open Data Tree. 5.1.3 Save Data Tree As. To save the contents of a Data Tree to a specific location or to give it a specific name, select File > Save Data Tree As. 5.1.4 Open Data Tree. To restore a Data Tree from a specific location, select File > Open Data Tree. NOTE: The information outlined above for saving a Data Tree is applicable to files that are, in general, version specific. When a new version of TASSEL is released, a data tree saved with a previous version might not load into the new version. For longer-term storage, the best practice is to save individual data sets rather than the entire data tree. 5.1.5 Save Selected As. To export data to one of the supported file types, select File > Save Selected As. 5.2 Contingency Test. This utility calculates a chi-square contingency test, or Fisher's exact test when using only the 2 x 2 table of observations, using the same
40. erformed by coding one allele as 0 and the other as 1. The TRANSFORM function in TASSEL converts the major allele to 0. All the other alleles are collapsed to a single class and coded as 1. PCA requires that all variables have variation and no missing values. As a result, filtering the genotype to eliminate monomorphic markers and imputing missing values may be necessary. Imputing missing values can be done before or after numericalization. Here we demonstrate how to generate PCs from the genotype file in the tutorial data. 1. Remove monomorphic sites: Make sure TASSEL is in Data mode. Highlight the genotype and click Site. Set the minimum frequency to 0.05 and have Remove minor SNP status checked. Click Filter. 2. Numericalization: Highlight the filtered genotype and click Transform. Use the default option of Collapse non major alleles. Click Create data set. 3. Imputation of missing values: Highlight the numerical genotype, click Transform, and then click the Impute tab. Use the default options. Click Create data set. 4. PCA: Highlight the imputed numerical genotype, click Transform, and then click the PCA tab. Change the default option to Components = 3 by choosing Components and typing 3 in the text box. Click Create data set.
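The numericalization step in this tutorial (major allele coded 0, every other allele collapsed to 1) can be sketched as follows. This is a hedged illustration rather than TASSEL's implementation: the function name `collapse_non_major`, the use of `None` for missing calls, and the `missing="N"` default are assumptions made for the example.

```python
from collections import Counter


def collapse_non_major(calls, missing="N", major_code=0, other_code=1):
    """Code one site's allele calls as major vs. non-major.

    The most frequent non-missing allele receives major_code; all other
    alleles are collapsed to other_code. Missing calls stay None so they
    can be imputed afterwards.
    """
    counts = Counter(call for call in calls if call != missing)
    if not counts:
        return [None] * len(calls)
    major_allele = counts.most_common(1)[0][0]
    coded = []
    for call in calls:
        if call == missing:
            coded.append(None)
        elif call == major_allele:
            coded.append(major_code)
        else:
            coded.append(other_code)
    return coded


# Hypothetical site scored in five taxa.
site_codes = collapse_non_major(["C", "C", "G", "N", "T"])
```

Because PCA needs complete data, the `None` entries would still have to be imputed (step 3 of the tutorial) before components are computed. Swapping `major_code` and `other_code` gives the opposite coding if a 1-for-major convention is wanted.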
41. ge is determined by calculating D' or r2 for all possible combinations of alleles and then weighting them according to the alleles' frequency. Note: It is not entirely certain that this procedure fully accounts for allele number effects. P values are determined by two methods. If only two alleles are present at both loci, then a two-sided Fisher's exact test is calculated. Note: Previous editions of TASSEL used a one-sided test, but TASSEL version 1.0.8 and later use a two-sided test. If more than two alleles are present, permutations are used to calculate the proportion of permuted gamete distributions that are less probable than the observed gamete distribution under the null hypothesis of independence. When calculating linkage disequilibrium, users have the option of employing Rapid Permutations. If this option is selected, the algorithm will compute either a fixed number of permutations or run until 10 permutations are found that are more significant than the observed P value. While this slightly reduces P values, it also saves a large amount of computational time. If an unbiased P value is desired, the user must unselect the Rapid Permutations check box. Linkage disequilibrium results can be plotted using Results > LD Plot or viewed in a table via Results > Table. 3.3 Cladogram. This function generates a tree or cladogram data set. TASSEL produces neighbor-joining trees using only simple parsimony s
42. he denominator in the Bonferroni correction is the total number of SNPs tested. The association was significant. The other data set added to the data tree is labeled GLM Allele Estimates followed by the name of the joint data. For the most significant SNP, at position 6, there were two genotypes: CC and GG. There are 62 lines with genotype CC and 10 lines with genotype GG. For the trait dpoll (days to pollination), the difference between the two homozygotes was 6.63755 days. [Figure: GLM allele estimates table] 6.5 Association analysis using MLM. Running MLM in TASSEL is similar to running GLM. The difference is that, in addition to the joint data or numerical data, MLM requires kinship data to define the relationship between individuals. The kinship matrix times a parameter equals the covariance matrix between individuals. Here we use the kinship file from the tutorial data set to fit the following statistical model: Flowering time = Population structure + Marker effect + Individuals + residual. Individuals and the residual are fit as random effects. The other terms are treated as fixed effects. With respect to the marker effect, we will demonstrate the analysis using two sets of markers. One is the dwarf8 gene sequence used in the GLM tutorial. The other is a set
43. here are several formats for numerical data to fit the requirements for modeling. Trait data (dependent variables) can be imported by starting the first line with <Trait> and following that with the trait names. Additional classifiers may also be included in subsequent header rows by starting the row with <Header name=xxx> followed by a name for each column of data. For instance, to define environments, start the second header row with <Header name=env>. Comment lines may be inserted at the beginning of the file as long as each comment line begins with the comment character. This format does not require users to provide information on the number of rows and columns. The file starts with the keyword <Trait> followed by the names of the columns. The column for line should not be labeled. 2.2.7.1 Trait format. Example 1 (simple list of trait values): 811 59 5 33 16 64 75 64 5 NA 38 11 92 25 68 5 37 897 4226 6515 59 5 32121933 4722 81 13 71 5 32 421 27 5 62 31 419. Example 2 (trait data collected in multiple environments): AT Plantat Plantit <Header name=env> 1061 Locl Locl 1002 B11 59 5 NA NA 33 16 64 75 121 5 NA 92 25 15318 37 897 83 4 4226 6515 130 1 32 21933 621 4722 81 13 165 7 32 421 90 1 A188 27 5 110 2 31 419 79 6. 2.2.7.2 Covariate Format. Covariate data uses the same format as trait data except that the first line must be <Covariate>. This line tells TASSEL
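As a rough illustration of the trait format described above, the following Python sketch writes a file whose first line starts with <Trait> followed by the trait names, with one unlabeled taxon column and tab delimiters. It is an assumption-laden sketch, not a TASSEL utility: `format_trait_file` and the trait name EarHT are hypothetical, NA is used for missing values, and names containing whitespace are rejected because whitespace acts as a delimiter on import.

```python
def format_trait_file(trait_names, rows, missing="NA"):
    """Build the text of a simple trait file.

    rows is a list of (taxon_name, values) pairs; None marks a missing value.
    Names with embedded whitespace are rejected because whitespace is
    treated as a delimiter when the file is read.
    """
    for name in list(trait_names) + [taxon for taxon, _ in rows]:
        if any(ch.isspace() for ch in name):
            raise ValueError(f"name contains whitespace: {name!r}")
    lines = ["\t".join(["<Trait>"] + list(trait_names))]
    for taxon, values in rows:
        if len(values) != len(trait_names):
            raise ValueError(f"wrong number of values for taxon {taxon!r}")
        cells = [missing if v is None else str(v) for v in values]
        lines.append("\t".join([taxon] + cells))
    return "\n".join(lines) + "\n"


# Hypothetical taxa and trait values.
text = format_trait_file(["EarHT"], [("811", [59.5]), ("33-16", [None])])
```

The returned string can be written to disk with an ordinary `open(path, "w")` call and then loaded through the Load function.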
44. Three items will be added to the data tree after running PCA. The first is the PCs, the second is the eigenvalues, and the last is the eigenvectors. Here we use the Chart function in the Result mode to graph the first three PCs, the individual eigenvalue contributions (sometimes called a scree plot), and the cumulative eigenvalue contributions. The eigenvalues are of interest because they equal the variance explained by each of the PCs. [Figures: PC vs. individual proportion and cumulative proportion; PC1 vs. PC2 and PC3] 6.3 Estimation of Kinship using genetic markers. While PCs can be used to capture major population subdivisions, kinship can be used to capture more subtle relationships. This section shows how to create a kinship matrix based on the same SNP data used to calculate the PCs. 1. Remove monomorphic sites: Highlight the genotype and click Site in Data mode. Set the threshold on MAF to 0.05, check Remove minor SNP status, then click Filter. 2. Estimate kinship: Highlight the filtered genotype and click Kinship in Data mode. A kinship matrix will be added to the data tree under the Matrix category. 6.4 Association analysis using GLM. We use th
45. inning of the file as long as any comment lines begin with the comment symbol. Columns are delimited. Numeric values are allowed but by default will be treated as classification variables, not as covariates, in analyses. In some cases a user may wish to have marker values treated as numerical covariates. If the first line of the file is <Numeric>, then the data will be imported as numeric data but used as marker data in GLM and MLM. Example 2 Marker 0. Note to TASSEL 2.1 users: The polymorphism format specified in TASSEL v2.1 is still supported to provide backward compatibility. 2.2.6 Phylip. The Phylip format used by TASSEL version 2.1 will continue to be supported. Details of the Phylip format are described at the following website: http://evolution.genetics.washington.edu/phylip/doc/sequence.html. 2.2.7 Numerical data. This type of format is used for trait and covariate data, such as population structure. Similar to sequence alignment (genotype) data, numerical data also consists of two parts: a header that defines the data structure and a body containing the main data. Tabs should be used as delimiters. However, any white space character, such as a blank, will be treated as a delimiter as well. As a result, embedded blanks in names will cause data to be imported incorrectly. We suggest representing missing values using NA or NaN. However, any text value will be interpreted as missing data. T
46. ist of markers with chromosome and map position and, optionally, physical position. It can be used by GLM and MLM to provide genetic positions in the output files. It is not used as part of the analysis. The input format is: first line as is, including the brackets; following lines: marker name, chromosome name, genetic position, physical position (actual data). Example: <> marker1 213 2456873 marker 521 52345691. There is no header line as such. Marker name, chromosome name, and genetic position are required. Physical position is optional and not used at this time. It is there because it is anticipated that information from this map may be used to convert between physical and genetic position at some time in the future. 2.3 Export. Options are provided to export sequence data (BLOB, Hapmap, Plink, Flapjack, Phylip Sequential or Interleaved). Phenotype and covariate data are exported as numerical trait data. Table reports are exported as tab-delimited tables. This button has the same function as Save Selected As on the File menu. For numerical data, the function of Export is similar to the Table function in Results mode. 2.4 Sites. The alignment can be filtered in several ways. Monomorphic sites can be eliminated, and regions of a sequence can be eliminated. Minimum Count: the minimum number of taxa in which the site must have been scored to be included in the filte
47. ked Questions. 1. What do I do if TASSEL misbehaves? TASSEL is an open source software project hosted on SourceForge and has a bug tracking list at http://sf.net/projects/tassel where you can notify the developer community of problems. In order for a bug to be fixed, we must be able to replicate the problem. Thus it is important to document the steps that produced the error. If the data you are working with is not too sensitive, please include the files which were used in the faulty procedure. If you would rather not post your data file on SourceForge, you may email it to one of the software developers. 2. Where do I turn for more information? If you are having difficulty with a certain aspect of TASSEL, you can either email one of the software developers listed at www.maizegenetics.net or check the TASSEL forum on SourceForge (http://sf.net/projects/tassel), as another user may have already addressed a similar question. There is also a TASSEL discussion group at http://groups.google.com/group/tassel. 3. How do I join the fun? TASSEL on SourceForge: TASSEL is an open source project distributed under the GNU General Public License. This means that the source code is available and the user is free to modify the code to suit their particular needs. We welcome input from developers and those who wish to become involved in the improvement of this software. The project is hosted on SourceForge (http://sf.net/proje
48. link. Plink is a whole genome association analysis toolset which comes with its own text-based data format. The data is stored in a set of two files, a map file and a ped file. The ped file contains all the SNP values and has six mandatory header columns for Family ID, Individual ID, Paternal ID, Maternal ID, Sex, and Phenotype. TASSEL only requires that the Individual ID field be filled in. Each row of the ped file describes a single germplasm line. Notice that in Plink an unknown character is represented with a 0. However, in TASSEL an unknown character is represented with an N, and 0 is used to represent a heterozygous indel. TASSEL will automatically convert between the 0 and the N. Any exported Plink files will represent the heterozygous indel with an insertion and a deletion. The map file describes all the SNPs in the associated ped file, where each row provides information on one SNP. The file must contain exactly four columns: Chromosome, rs#, Genetic distance, and Position. TASSEL does not require the Genetic distance field to be filled in. Both files should be tab delimited. For a more detailed description of the data format, please visit the Plink basic usage and data formats webpage: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml. 2.2.4 Flapjack. Flapjack is a software tool for graphical genotyping and haplotype visualization. The program is capable of outputting data in its own text-based da
49. m information content (PIC), and Haplotype PIC. Overall score is essentially an estimate of the ability to design a single base pair extension reaction in the region. These results can be exported by using a table (Results > Table). 3.5 Kinship. This function generates a kinship matrix from a set of random SNPs. To do so, first highlight the SNP data, then click on the Analysis button followed by the Kinship button. The resulting kinship data will be added as a data set on the Data Tree panel. When a genotype file is selected, the kinship matrix is generated by first using the TASSEL Cladogram function to calculate a distance matrix. Each element d_ij of the distance matrix is equal to the proportion of the SNPs which are different between taxon i and taxon j. The distance matrix is converted to a similarity matrix by subtracting all values from 2, then scaling so that the minimum value in the matrix is 0 and the maximum value is 2. Kinship can be derived from a set of random SNP data; a minimum of several hundred SNPs spread over the whole genome is recommended. Warning: This method currently works correctly only for homozygous inbred lines. The method will be modified in the near future to work with heterozygous taxa. At that point this warning will be removed. Users may also load their own kinship data using Data > Load. Kinship matrices can be calculated using the SPAGeDi software package (http://www.ulb.ac.be/sci
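The distance-to-kinship conversion described above can be sketched in plain Python. This is an illustrative reading of the text, not TASSEL's code: the function names are hypothetical, each taxon is assumed to be a homozygous inbred line represented by one character per SNP, and every site is assumed to be scored (no missing-data handling).

```python
def distance_matrix(genotypes):
    """d_ij = proportion of SNPs that differ between taxon i and taxon j."""
    n_sites = len(genotypes[0])
    return [
        [sum(a != b for a, b in zip(gi, gj)) / n_sites for gj in genotypes]
        for gi in genotypes
    ]


def kinship_from_distance(dist):
    """Subtract every distance from 2, then rescale to the range 0..2."""
    similarity = [[2 - d for d in row] for row in dist]
    lowest = min(min(row) for row in similarity)
    highest = max(max(row) for row in similarity)
    if highest == lowest:
        # All taxa identical: rescaling is undefined, so return the maximum.
        return [[2.0 for _ in row] for row in similarity]
    span = highest - lowest
    return [[2 * (s - lowest) / span for s in row] for row in similarity]


# Three hypothetical inbred lines scored at four SNPs.
lines = ["AAAA", "AAAT", "TTTT"]
kinship = kinship_from_distance(distance_matrix(lines))
```

In this toy example the diagonal is 2 (each line compared with itself) and the most distant pair receives 0, which matches the stated scaling.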
50. n the JAR file sTASSEL.jar. Alternatively, from a command prompt (in Windows, go to Start > Run and type in cmd or command), change into the tassel3.0_standalone directory and execute this command: start_tassel.bat (for Windows) or start_tassel.pl (for UNIX). 1.1.3 Open source code. Open source code for the TASSEL software package is available at http://sourceforge.net/projects/tassel. The package uses a number of other libraries that are included in the TASSEL distribution. These include a modified version of the PAL library (http://www.cebl.auckland.ac.nz/pal-project), the COLT library (http://dsd.lbl.gov/~hoschek/colt), and jFreeChart (http://www.jfree.org/jfreechart). GDPC middleware (http://www.maizegenetics.net/gdpc) provides database access. 1.2 Panels. TASSEL is organized into five main panels. 1. The Control Panel at the top contains menus and buttons to control functions. 2. The Data Tree Panel is located beneath the Control Panel on the side. This panel organizes data sets and results. Data sets displayed in the Data Tree Panel must first be selected before a desired function or analysis can be performed. To select multiple data sets, press the CTRL key while selecting the data sets. 3. The Report Panel is located below the Data Tree Panel. It displays information about a selected data set from the Data Tree Panel, such as the type of data and how it was created. 4. The Progress Monitoring Panel below the Repor
51. named P3D by Zhang et al. and EMMAX by Kang et al. TASSEL was designed for a wide range of users, including those who are not experts in statistics or computer science. A GWAS using the mixed linear model method to incorporate information about population structure and cryptic relationships can be performed in a few steps by clicking on the proper choices in a graphic interface. All the processes necessary for the analysis are performed automatically, including importing phenotypic and genotype data, imputing missing data (phenotype or genotype), filtering markers on minor allele frequency, generating principal components and a kinship matrix to represent population structure and cryptic relationships, optimizing the compression level, and performing GWAS. The command line version of TASSEL, called the Pipeline, lets users program tasks using a script instead of the graphic user interface (GUI). This feature allows researchers to define tasks using a few lines of code and makes it possible to use TASSEL as part of an analysis pipeline or to perform simulation studies. Due to the increasing availability of open data sources, TASSEL utilizes a data browser from the Genomic Diversity and Phenotype Connection (GDPC) project to provide an interface to relational databases. As a result, TASSEL users can access any data source that provides a GDPC service. Using this middleware, which provides a common graphical interface, TASSEL user
52. nd click right button and select Install Shortcuts. 9. Why do I get empty squares in MLM association analysis? An empty square means null information. The major reasons include non-convergence in the estimation of variance components, or that the statistic in question was not calculated. For example, marker F and R² are not calculated when no marker is included in the model. 10. Why should I exclude one column of the population structure? For some methods of calculating population structure, such as the software STRUCTURE, the population proportions sum to one. This produces linear dependence between the population covariates. While the algorithm used by GLM tolerates that dependency, MLM will fail because the design matrix will not be invertible. Excluding one column eliminates linear dependence between columns. Using PC axes to represent population structure does not result in linear dependency, because all PC columns are guaranteed to be independent. 11. Can kinship replace population structure? Sometimes. For some traits and populations, the K-only model may be as good as or better than the Q+K model. For others, Q+K may be superior. The Q-only model is not as effective for controlling population structure as the alternatives. Unfortunately, no general guidelines exist for predicting which model will perform best. As a result, an investigator may wish to fit all three models and compare the results. If eliminating false po
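The linear dependence described in answer 10 can be illustrated with a small sketch. It assumes a design matrix made of an intercept column plus three population proportion columns (Q1, Q2, Q3) that sum to one for every taxon. The proportions, the intercept assumption, and the helper name `matrix_rank` are hypothetical illustrations and are not part of TASSEL.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix by Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(str(x)) for x in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


# Hypothetical population proportions (Q1, Q2, Q3); each row sums to one.
q = [
    [0.2, 0.3, 0.5],
    [0.6, 0.1, 0.3],
    [0.1, 0.8, 0.1],
    [0.25, 0.25, 0.5],
    [0.7, 0.2, 0.1],
]

# Intercept plus all three Q columns: 4 columns, but Q1 + Q2 + Q3 equals the intercept.
full_design = [[1] + row for row in q]
# Intercept plus Q1 and Q2 only: the last Q column is excluded.
reduced_design = [[1] + row[:2] for row in q]

full_rank = matrix_rank(full_design)        # fewer than 4: columns are dependent
reduced_rank = matrix_rank(reduced_design)  # equals 3: columns are independent
```

With all three Q columns the rank is lower than the number of columns, so the matrix cannot be inverted; after one Q column is dropped, the rank equals the number of columns.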
53. ne or genetic position within the chromosome. At the bottom of the graph is a display of the position of each site along the gene or chromosome. This display can be hidden by deselecting the Schematic checkbox. Legends describing the color scheme appear on the right-hand side of the graph. (Figure: Linkage Disequilibrium plot.) LD plots can be printed, saved in JPEG format, or saved as a Scalable Vector Graphics (SVG) file. An SVG file is useful for creating publication-quality graphics, which can be easily resized using an editor such as Adobe Illustrator, Corel Draw, or OpenOffice.org Draw 2.0. 4.5 Chart. Chart provides a variety of graphs for visualizing numeric data. This feature can be used to display histograms, XY plots, bar charts, and/or pie charts. Any numeric table data can be charted, including LD results, phenotypic data, diversity results, and association results. Histograms: Use the graph type combo box to select the desired graph type (Histogram) from the list of options. Up to two different series of data can be plotted together. Users may specify the number of bins to be used in the histogram. (Figure: histogram of the distributions of DPOLL HOMESTEAD ID1 and DPOLL CLAYTON ID15.) Scatter plots
54. of the body depends on the type of BLOB and on the amount of data being stored. For a more detailed description of the structure and information contained within the header and body, refer to the GDPDM BLOB Specifications (http://www.maizegenetics.net/gdpdm/docs/20100526/GDPDMBlobSpecification_20100526.pdf). 2.2.2 Hapmap. Hapmap is a text-based file format for storing sequence data. All the information for a series of SNPs, as well as the germplasm lines, is stored in one file. The first row contains the header labels, and each additional row contains all the information associated with a single SNP. The first 11 columns describe attributes of the SNP, while each following column holds the SNP value for a single germplasm line. The first 12 columns of the first row should look like this, where Line 1 is the beginning of the germplasm line names: (Figure: Hapmap header row.) While all 11 header columns are required, not all 11 of the columns need to be filled in for TASSEL to correctly interpret the data. The only required fields are chrom (chromosome name) and pos (position). For TASSEL to correctly read Hapmap data, the data must be in order of chromosome and position within each chromosome, and the file should be TAB delimited. If some of the data is missing, the correct number of tabs must still be present so that TASSEL can properly assign data to columns. 2.2.3 P
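The Hapmap requirements in the excerpt above (tab-delimited rows, a constant number of fields, and SNPs ordered by position within each chromosome) can be checked before loading a file. The following is a minimal sketch, not TASSEL's own reader. It assumes that chrom and pos are the third and fourth columns, as in the commonly used HapMap layout; the function name `check_hapmap_order` and its parameters are hypothetical.

```python
def check_hapmap_order(lines, chrom_col=2, pos_col=3, n_attr_cols=11):
    """Return a list of problems found in tab-delimited Hapmap-style lines.

    chrom_col and pos_col are zero-based column indexes (assumed, not taken
    from the manual). The check requires each chromosome to form one
    contiguous block and positions to be non-decreasing within a block.
    """
    problems = []
    header = lines[0].rstrip("\n").split("\t")
    n_cols = len(header)
    if n_cols <= n_attr_cols:
        problems.append("header has no germplasm line columns")
    seen_chroms = []  # chromosomes in order of first appearance
    last_pos = None
    for row_num, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != n_cols:
            # A missing value must still leave its tab in place.
            problems.append(f"row {row_num}: expected {n_cols} fields, found {len(fields)}")
            continue
        chrom = fields[chrom_col]
        pos = int(fields[pos_col])  # pos is a required field
        if not seen_chroms or chrom != seen_chroms[-1]:
            if chrom in seen_chroms:
                problems.append(f"row {row_num}: chromosome {chrom} is not contiguous")
            seen_chroms.append(chrom)
            last_pos = None
        if last_pos is not None and pos < last_pos:
            problems.append(f"row {row_num}: position {pos} is out of order")
        last_pos = pos
    return problems
```

An empty list means that no problem was detected. The sketch does not judge the order of the chromosomes themselves, only that each one appears as a single block.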
55. ormat. File 1: d8_sequence.phy (Genotype, Phylip alignment); File 2: mdp_genotype.hmp.txt (Genotype, Hapmap alignment); File 3: mdp_genotype.flpjk.geno (Genotype, Flapjack alignment); File 4: mdp_genotype.flpjk.map; File 5: mdp_genotype.plk.ped (Genotype, Plink alignment); File 6: mdp_genotype.plk.map; File 7: mdp_kinship.txt (Kinship); File 8: mdp_population_structure.txt (Population structure); File 9: mdp_traits.txt (Phenotype, numerical trait data). File 1 is the sequence of the dwarf8 gene, with 2466 sites on 91 maize inbred lines. The data was described in the paper on the association between Dwarf8 and flowering time. Files 2 to 6 are 3093 SNPs on 281 maize association inbred lines. The data is presented in three formats: Hapmap, Plink, and Flapjack. The data was created by the PANZEA project, funded by NSF. Details of the data can be found at http://www.panzea.org. Files 3 and 4 form a pair in the Flapjack format. Files 5 and 6 form a pair in the Plink format. File 7 is the kinship matrix created by Yu et al. File 8 is the population structure of 282 maize inbred lines. File 9 is the phenotype data on three traits, including flowering time, on 282 maize inbred lines. 7.3 Biography of TASSEL. Release dates: 2001, December 2004, February 2005, March 2005, April 2005, June 2005, October 2005, January 2006, March 2006, September 2006, October 2006, September 2007, April 2008, June 2008. Release notes: First public release; Score-able SNP Extractor; Updated Main Panel; StepClade update; Fixed
56. predict phenotypes from genotypes. It is one of the methods used for genomic selection (GS). The input dataset must contain one or more phenotypes and numeric marker data. Optionally, it may also contain factors and covariates. The analysis is run by selecting the input dataset, then clicking the GS button. Because no additional user input is needed, the analysis runs immediately after the button is clicked. All traits are analyzed separately, using all of the genotypes, factors, and covariates in the dataset. The output consists of two new datasets for each trait: one contains genomic estimated breeding values (GEBVs) for each taxon, and the other contains BLUPs for each marker in the genotype file. The output datasets appear in the Numerical folder, which holds the input data as well. The output datasets can in turn be used for subsequent analysis. For example, an output dataset could be joined with the input data so that the predicted values could be graphed against the original values. Understanding the input data requirements is important to ensure that the results of the analysis will be correct and useful. Genotypes must be numeric, with one column for each marker. It is expected that the markers are bi-allelic, with the homozygotes coded as 1 and -1 and the heterozygotes coded as 0. However, any reasonable coding scheme will work. For instance, missing data could be replaced by a probability resulting from imputa
57. red data set. GAP or missing data do not count. Minimum Frequency: the minimum frequency of the minority polymorphisms for the site to be included in the filtered data set. Start Position / End Position: establishes the range of sites for filtering. Extract Indels: if selected, indels are extracted from the alignment. If not selected, only point substitutions are extracted. Remove minor SNP states: converts tertiary and rarer states to missing data, thereby forcing sites to have only two segregating states at each locus. This may help remove sequencing errors. Generate haplotypes via sliding window: creates haplotypes from an ordered set of SNPs. 2.5 Taxa. Select either genotypic, phenotypic, or population structure data from the data tree. The resulting dialog box displays the selected data in table format. By using either the CTRL or SHIFT key in conjunction with the mouse, the user can select or deselect taxa (rows). Once the desired taxa have been selected, the Capture Selected or Capture Unselected buttons will create a new data set containing only the captured taxa. 2.6 Traits. Clicking the Traits button on the Data toolbar launches the Trait Filter dialog. This dialog is used with numerical data sets to (1) change the trait type, (2) view, but not change, whether the trait is discrete or continuou
58. ree files from the tutorial data set to perform association analysis using the GLM. The first file is the dwarf8 gene sequence, with 2466 sites on 91 maize inbred lines. The second is the population structure of 282 maize inbred lines. The last is phenotypes for three traits on 282 maize inbred lines. The statistical model is: Flowering time = Population structure + Marker effect + residual. 1. Remove monomorphic sites: Highlight the genotype and click Site in Data mode. Set the threshold on MAF to 0.05, then click Filter. 2. Trait selection: Highlight the phenotype and click Trait in Data mode. Uncheck all the traits except flowering time (DPOLL). Make sure that the Type is set to Data. Click OK to create a filtered phenotype. 3. Covariate selection: The population structure is presented as the proportion of each population. There are three populations, represented as Q1, Q2, and Q3. They sum to 100%. This creates linear dependency if we use all of them as covariates. We can eliminate the dependency by removing one of them. In this demonstration, we exclude the last one. Highlight the population structure data and click Trait in Data mode. Uncheck the last population (Q3). Make sure that the Type is set to Covariate. Then click OK to create a filtered population structure data set. 4. Joining data: Highlight the three filtered data sets by holding the Control key while selecting the individual data sets. Then click Inters
59. s, and (3) drop one or more traits from the data set. In addition, the dialog can be used to view the trait properties without changing them. If the OK button is clicked, a new data set is created that incorporates the changes, the original data set remains unchanged, and the dialog closes. If the Cancel button is clicked, no data set is created, the original data set remains unchanged, and the dialog closes. Allowable trait types are data, covariate, factor, and marker. Generally, data and covariate traits will be continuous (not discrete), and factors will be discrete. Markers in a numerical data set will be continuous. Discrete-valued markers are better imported as sequence or polymorphisms. Clicking Exclude All unchecks the Include box for all traits. Clicking Include All checks the Include box for all traits. The Exclude Selected and Include Selected buttons do the same thing for traits that have been highlighted by selecting them with the mouse. Important: Once a numerical data set has been joined with genotypes, it can no longer be modified using the trait filter function. 2.7 Impute SNPs. This function is used to impute missing genotypes. A sequence data type is required to use the function. 2.8 Transform. This suite of functions allows multiple data manipulations on genotype and phenotype (numerical) data. When a genotype data set is selected, the data are transformed to numbers. When a numerical d
60. s can avoid writing SQL queries to access data. Currently, GDPC provides connections to Panzea, Gramene, Germinate, and GRIN (USDA's Germplasm Resources Information Network). TASSEL is written in Java, thereby enabling its use with virtually any operating system. It can be installed using Java Web Start technology by simply clicking on a link at www.maizegenetics.net/tassel. A stand-alone version of TASSEL can also be downloaded to use in pipeline mode or in any situation where the user wishes to start the software from a command line. 1 Getting Started. A quick way to get started using TASSEL is to load the tutorial data and try performing analyses. However, because some of the necessary steps may not be intuitive, we recommend that new users follow the tutorial at the end of this manual. The objective of this section is to provide the information necessary to install and start the TASSEL software and to provide a brief overview of the interface. Most functions are organized into three modes, Data, Analysis, and Results, which correspond to the first three buttons on the TASSEL interface, as shown below. Clicking one of these buttons changes the functions represented by the second row of buttons. These three modes are described in detail in the subsequent sections of this manual. The screenshot shows TASSEL after the tutorial files have been loaded.
61. s can either save this genotype data in several formats or upload it to TASSEL. However, before outlining these procedures, let us finish the query by exploring phenotypes. To get data from experiments conducted in 2000, first select the Environment Experiments tab, followed by the Repetition checkbox. Select the desired repetitions in 2000 as the values to be used for filtering, then click the Get Data button. The subset of data that meets these criteria is returned as follows: (Figure: filtered environment experiments.) Now extract phenotype data by clicking on the Phenotypes tab. Traits can only be extracted one at a time. Choose Days to Silk from the Ontology field. Make sure that no Taxa are selected and that all Environment Experiments retrieved in the previous step are selected. Click the Get Data button, then the Merge button, leaving only Accession checked under the Taxa Properties section. Leave Locality and Repetition checked under the Environment Experiments Properties section. Data are merged as follows: (Figure: merged phenotype data.) 6.6.3 Importing GDPC data into TASSEL. Genotype and phenotype data must be loaded in separate steps. To load genotype data, first click GDPC in Data mode. Then click on the Genotypes tab, followed by the Load button. The genotype data is then loaded into TASSEL and labeled as Genotype. To view the uploaded data, click on G
62. sitives is very important, then it may make sense to accept the most conservative model. However, if the objective is to identify candidates for further study and the cost of following up on a false lead is low, the most liberal model may be preferred. 12. Why do TASSEL and SPAGeDi give different kinship estimates? A: First, many algorithms exist to calculate kinship, and their estimates will differ from one another. Second, the algorithm in TASSEL treats each genotype as a haplotype. It is not recommended that TASSEL be used to generate a kinship matrix from heterozygous genotypes. In the near future, the TASSEL kinship algorithm will be modified to handle heterozygous diploids. 13. Can I get marker R-square using SAS Proc Mixed or TASSEL MLM? SAS Proc Mixed does not produce an R² statistic. MLM in TASSEL does. The user manual describes how it is calculated. 14. Does MLM find more associations than GLM? Sometimes. MLM has higher statistical power than GLM and may detect more true associations. When the tested genetic markers are confounded with kinship structure, GLM does not correct for that as effectively as MLM and may produce more false positives. 15. Do I need multiple test correction for the p-value from TASSEL? A: Yes. 16. Can TASSEL handle diploid genotype data? A: While TASSEL accepts most common sequence alignment formats, which handle polyploid genotype data including haploid and diploid, some analyses
63. t Panel shows the progress of running tasks and has buttons that can be used to cancel tasks. (5) The Main Panel occupies the right side of the viewing area. It displays the content of a selected data set from the Data Tree Panel. Functions in TASSEL are accessed by buttons and menus on the Control Panel. The three buttons on the top left are the Mode Selectors: Data, Analysis, and Results. The buttons below the Mode Selectors change when a new Mode Selector is clicked. The modes are described in sections 2 to 4. To the right of the Mode Selectors are the Progress Bar and the Delete, Print, Save, and Help buttons. 2 Data Mode. Data mode serves the purpose of importing and managing data. Data mode is the default mode when TASSEL starts. Click on the Data button to switch to this mode. TASSEL has two ways of importing data. One way is via GDPC, to import data from databases. The other way is via flat files formatted as genotypes (e.g., hapmap, flapjack, and plink), phenotypes (trait data), population structure, and kinship matrices. The preliminary data manipulations include filtering data by site or taxa, joining data, and data transformation. 2.1 GDPC. Genotype and phenotype data generated from numerous genomic research projects are still valuable resources for the public even after results are published. Some of these data have been migrated to several databases and can be accessed using the Genomic Diversity and Phenotype Conne
64. ta format. Like Plink, the data is stored in a set of two files: a map file and a geno (genotype) file. The genotype file contains all the SNP values. Each column in the first row contains a SNP ID, except for the first column, which is blank. The first column of the following rows contains the germplasm line names. TASSEL requires that all fields be filled in for the data to be read correctly. The map file describes all the SNPs associated with the genotype file. Each row describes a single SNP. There are three columns in the map file for Flapjack: SNP ID, Chromosome, and Position, all of which are required for TASSEL to run correctly. Both files should be TAB delimited. For a more detailed description of the Flapjack data file format, please visit the Flapjack data import website (… DataImportDialog.shtml). 2.2.5 Polymorphism. A general format that accepts almost any type of marker data can also be used. Any alphanumeric character is allowed. Diploid data can be represented by separating alleles with a colon (:), for example B:B. All loci in a file must have the same ploidy level. The first line starts with the symbol <Marker>, followed by the marker names. Subsequent lines must start with the name of the individual or taxon genotyped, followed by the marker scores in the same order as the header. Comments can be inserted at the beg
65. tes the degree of similarity between names, using the name from the first set that is most similar to that in the second data set. When using the Synonymizer, keep in mind that the order of selection matters. Always select the data set with the names you wish to use (the real names) first, and then, while holding down the CTRL key, click the second data set with the taxa names you wish to change (the synonyms). Then click on the Synonymizer button. A synonym data set will be placed on the Data Tree panel under Synonyms. Each name in the data set selected second is now listed in the TaxaSynonym column. Next to this column is a TaxaRealName column listing the highest scoring match derived from the real name data set. The MatchScore column indicates the amount of similarity between the two names, where 0 is no similarity and 1.0 is identity. Caution: Before the synonyms are applied, we strongly encourage the user to check the matches, especially for those taxa with low match scores. To do that, the user selects the synonym file and clicks the Synonymizer button. Incorrect matches, usually the ones with the lowest match scores, can be rejected at this point. Sorting on the match score column first makes this a fairly easy process. In the event that some of the taxa are not interpreted correctly, matches can be modified manually. Select the taxa you wish to modify on the left side and then choose a replacement t
66. the Link/Unlink button. The button will now appear as (icon). This activates the Add selected items, Add all items, Remove, and Remove all buttons. Remove all items from the working list, then select items with a name starting with the letter D. Click on the Add selected items button to move them to the Working List. The resulting Working List is shown as follows: (Figure: Working List.) To filter data by polymorphism type, first click on the Genotype Experiments tab, check the Polymorphism Type and Producer checkboxes, and then select SNP and Jim. Finally, click the Get Data button to reveal the subset of data that meets these criteria. Results for this example are shown below: (Figure: filtered genotype experiments.) Genotype data can be extracted from the database by clicking on the Genotypes tab, followed by the Get Data button. After a moment, genotype data will be displayed as follows: (Figure: genotype data.) User
67. tion. If any genotype data is missing, it will be imputed as the average of the marker scores across all taxa for that marker. If a user prefers to use a different method of imputation, then the missing genotypes must be imputed before importing the data into TASSEL. GEBVs will be calculated for all taxa in the dataset, including any lines that have missing phenotype data. A typical use of genomic selection is to predict GEBVs for a set of unphenotyped lines based on the performance of a training set. To do that, a dataset containing both the genotypes to be predicted and the genotypes of the training set can be joined with a dataset containing the phenotypes of the training set using a union join. All taxa in the phenotype set should have genotypes. If an individual without genotype data is included, all the marker data for that individual will be imputed, which is not generally a useful thing to do. 4 Result Mode. Results mode consists of the functions that present data as tables or graphics. 4.1 Table. Allows data to be displayed in a spreadsheet view and exported to a flat file. To create a table, select a data set from the Data Tree panel, then click on the Results button followed by the Table button (Results > Table). Shown below is an example in which diversity estimates are displayed. (Figure: table of diversity estimates.)
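The default imputation described for genomic selection above (a missing marker score is replaced by the average score for that marker across all taxa) can be sketched as follows. This is an illustration under stated assumptions, not TASSEL's code: genotypes are held as a list of rows, one per taxon, missing scores are represented by `None`, and the function name `impute_marker_means` is hypothetical.

```python
def impute_marker_means(genotypes):
    """Replace missing marker scores (None) with the per-marker mean.

    genotypes: list of rows, one row per taxon, one numeric score per marker.
    Returns a new list of rows; the input is left unchanged.
    """
    imputed = [list(row) for row in genotypes]
    n_markers = len(genotypes[0])
    for j in range(n_markers):
        observed = [row[j] for row in genotypes if row[j] is not None]
        if not observed:
            continue  # no observed score for this marker, so nothing to average
        mean = sum(observed) / len(observed)
        for row in imputed:
            if row[j] is None:
                row[j] = mean
    return imputed


# Three taxa, two markers, using the 1 / 0 / -1 coding.
scores = [
    [1, None],
    [-1, 0],
    [None, 1],
]
filled = impute_marker_means(scores)
```

A taxon with no genotype data at all would receive the marker means in every column, which is why the manual advises against including such individuals.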
68. ubstitution models. To retrieve cladogram data, first select genotypic data from the Data Tree panel, and then click on the Analysis button followed by the Cladogram button. The resulting tree data and the corresponding matrix will appear as separate data sets on the Data Tree panel. Results can be plotted using Results > Tree Plot. 3.4 SNP Extract. SNP Extract extracts SNPs from a raw sequence alignment into a useful format for export. Additionally, this function provides information for designing genotyping assays. Below is a detailed explanation of the SNP Extractor dialog. Minimum Site Frequency: the minimum frequency for which the site must have good bases. Minimum SNP Frequency: the minimum frequency of the minority polymorphisms for the site to be included in the resulting data set. Minimum Surrounding Bases: the minimum number of good bases on at least one side of the SNP. Minimum Good SBE Bases: the minimum number of good bases on at least one side of the SNP. Filter SNPs to Biallelic: converts tertiary and rarer states to missing data, thereby forcing sites to have only two segregating states at any particular locus. This helps to remove bad sequence effects. Results are displayed on the Data Tree panel and include SNPs along with their context. Additional information is also provided, including the location of the nearest polymorphisms on either side polymorphis
69. uce a sliding window, check the box next to Sliding Window and then enter the desired step size and size of the sliding window. Results can be plotted using Results > Chart or viewed in a table via Results > Table. 3.2 Linkage Disequilibrium. This button generates a linkage disequilibrium data set from SNP data. NOTE: It is important to use only filtered data sets (apply Data > Sites first) when estimating linkage disequilibrium, as a raw alignment with numerous invariant bases will take a very long time and consume a large amount of memory to calculate. Linkage disequilibrium between any set of polymorphisms can be estimated by clicking on a filtered set of polymorphisms and then clicking the Linkage Disequilibrium button in Analysis mode. At this time, D', r², and P-values will be estimated. The current version calculates LD between haplotypes with known phase only (unphased diploid genotypes are not supported; see PowerMarker or Arlequin for genotype support). D' is the standardized disequilibrium coefficient, a useful statistic for determining whether recombination or homoplasy has occurred between a pair of alleles. r² represents the correlation between alleles at two loci, which is informative for evaluating the resolution of association approaches. D' and r² can be calculated when only two alleles are present. If multiple alleles are present, a weighted average of D' or r² is calculated between the two loci. This weighted avera
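For the two-allele case mentioned in the Linkage Disequilibrium excerpt above, the statistics can be written out from phased haplotypes using the standard textbook definitions: D = p(AB) - p(A)p(B), D' = |D| / Dmax, and r² = D² / (p(A)(1 - p(A))p(B)(1 - p(B))). The sketch below is an illustration of those definitions and is not TASSEL's implementation. The function name `ld_stats` and the input layout are assumptions, and the weighted average used for multiple alleles is not covered.

```python
def ld_stats(haplotypes):
    """Return (D, D', r2) for two biallelic loci with known phase.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2) pairs.
    The alleles of the first haplotype are used as the reference alleles.
    D' is reported as an absolute value.
    """
    n = len(haplotypes)
    a_ref, b_ref = haplotypes[0]
    p_a = sum(1 for a, _ in haplotypes if a == a_ref) / n
    p_b = sum(1 for _, b in haplotypes if b == b_ref) / n
    p_ab = sum(1 for a, b in haplotypes if a == a_ref and b == b_ref) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")

    d = p_ab - p_a * p_b
    # The largest |D| attainable given the allele frequencies depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d, abs(d) / d_max, d * d / denom


# Two loci in complete association: only the A-C and G-T haplotypes occur.
complete = ld_stats([("A", "C"), ("A", "C"), ("G", "T"), ("G", "T")])
# Two loci in equilibrium: all four haplotypes are equally frequent.
independent = ld_stats([("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")])
```

In the first example both D' and r² equal 1, and in the second both equal 0, which matches the interpretation of the two statistics given in the text.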
70. www.maizegenetics.net/gdpc Table of Contents: Introduction; 1 Getting Started; 1.1 Installation; 1.1.1 Web Start; 1.1.2 Stand-alone; 1.1.3 Open source code; 1.2 Panels; 2 Data Mode; 2.2 Load; 2.2.1 BLOB; 2.2.2 Hapmap; 2.2.3 Plink; 2.2.4 Flapjack; 2.2.5 Polymorphism; 2.2.6 …; 2.2.7 Numerical data; 2.2.8 Square numerical matrix; 2.2.9 Genetic …; 2.3 Export; 2.4 Sites; 2.5 Taxa; 2.6 Traits; 2.7 Impute SNPs; 2.8 Transform; 2.8.1 Genotype numericalization; 2.8.2 Transform and/or standardize data; 2.8.3 Impute phenotype; 2.8.4 PCA; 2.9 Synonymizer; 2.10 Union Join; 2.11 Intersection Join; 3 Analysis; 3.1 Diversity; 3.2 Linkage Disequilibrium; 3.3 Cladogram; 3.4 SNP Extract; 3.5 Kinship; 3.6 General Linear Model (GLM); …; 4 Result Mode; 4.1 Table; 4.2 Tree Plot; …; 5.1 File menu; 5.1.1 Save Data Tree; 5.1.2 Open Data Tree; 5.1.3 Save Data Tree As; 5.1.4 Open Data Tree …; 5.1.5 Save Selected As; 5.2 Contingency Test; 5.3 Preferences; 6 Tutorial; 6.1 Missing phenotype imputation; 6.2 Principal component
