
User Manual for Version 5.0 The Buckler Lab at Cornell University

1. Three items will be added to the data tree after running PCA: the PCs, the eigenvalues, and the eigenvectors. Here we use the Chart function in Result mode to graph the first three PCs, the individual eigenvalue contributions (sometimes called a scree plot), and the cumulative eigenvalue contributions. The eigenvalues are of interest because they equal the variance explained by each of the PCs.

[Figure: two XY scatter charts. Left: PC vs. Individual Proportion and Cumulative Proportion. Right: PC 1 vs. PC 2 and PC 3.]

10.3 Estimation of Kinship using genetic markers

While PCs can be used to capture major population subdivisions, kinship can be used to capture more subtle
2. 11 Appendix

11.1 Nucleotide Codes (Derived from IUPAC)

[Table of nucleotide codes; surviving entries: insertion (homozygous), deletion (homozygous), Unknown]

11.2 TASSEL Tutorial Data sets

http://www.maizegenetics.net/tassel/docs/TASSELTutorialData3.zip

Filename                        Type                  Format
d8_sequence.phy                 Genotype              Phylip
mdp_genotype.hmp.txt            Genotype              Hapmap
mdp_genotype.plk.ped            Genotype              Plink
mdp_genotype.plk.map            Genotype              Plink
mdp_population_structure.txt    Population structure  Numerical trait data
mdp_traits.txt                  Phenotype             Numerical trait data
[rows for the Flapjack and kinship files are missing here]

File 1 is the sequence of the dwarf8 gene with 2466 sites on 91 maize inbred lines. The data was described by the paper on the association between Dwarf8 and flowering time. Files 2-6 are 3093 SNPs on 281 maize association inbred lines. The data is presented in three formats: Hapmap, Plink, and Flapjack. The data was created by the PANZEA project, funded by NSF. Details of the data can be found at http://www.panzea.org. Files 5 and 6 are a pair in the Plink format. File 7 is the kinship matrix created by Yu et al. File 8 is the population structure of 282 maize inbred lines. File 9 is the phenotype for three traits, including flowering time, on 282 maize inbred lines.

11.3 Frequently Asked Questions

What do I do if TASSEL misbehaves?
TASSEL is an open source software project hosted on SourceForge and has a bug tracking list at http://sf.net/projects/tassel where you can notify the developer
3. In addition to displaying the F statistics and p-values for the requested F tests, the table also contains markerR2, mean squares (MS), and degrees of freedom (DF) for the marker effect, for the model corrected for the mean, and for error. If taxa are replicated across reps or environments, then the markers are tested using the taxa-within-marker mean square. If taxa are unreplicated, then the residual mean square is used. MarkerR2 is the marginal R-squared for the marker, calculated as SS Marker (after fitting all other model terms) / SS Total, where SS stands for sum of squares.

The following table shows an example of the Allele Estimates output as viewed with Results > Table.

[Screenshot: Allele Estimates table]

For each marker and trait combination, each marker allele is listed along with the number of observations for taxa carrying that allele (Obs), the locus (usually chromosome) and locus position of that marker, the allele, and the estimate of the effect of that allele. Because of the way that GLM codes alleles, the last allele estimate for a marker is always zero and the other allele estimates are relative to that.

6.6 MLM (Mixed Linear Model)

This conducts association analysis via a mixed linear model (MLM). A mixed model is one which includes both fixed and random effects. Including random effects gives MLM the ability to incorporate information about relationships among individuals. When a genetic marker based kinship matrix
4. Min Freq of Row Data: 0.80

3.4.4 PCA

Principal component analysis (PCA) can only be performed on a numerical data set without missing values. Two methods are available: correlation or covariance. This determines whether a correlation or covariance matrix will be used as the basis for the analysis. The default, correlation, is a reasonable choice for genetic data. The number of PCA axes in the output data set can be controlled by selecting either the minimum eigenvalue associated with each axis, the minimum percent of the variance captured by an axis, or the number of axes. The resulting axes will be sorted by the amount of variance each captures.

[Screenshot: Transform Column Data dialog, PCA tab, with Method (Correlation or Covariance) and Output (Eigenvalue, Var Prop, Components) options and a Create Dataset button.]

3.5 Synonymizer

Synonymize Taxa Names. This button makes taxa names uniform to permit the joining of data sets. The join functions that generate fused data sets work by matching taxa names. Consequently, if multiple names exist for a given taxon (an added suffix, alternative spellings, different naming conventions, etc.), then the two data sets will not join correctly. To help remedy this, the Synonymizer function allows the taxa names of one data set to replace similar taxa names in the second data set. It relies on an algorithm that calculates t
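The three ways of limiting the number of PCA axes described in section 3.4.4 can be sketched in a few lines of Python. This is only an illustration of the idea, not TASSEL's implementation: the function names, the inclusive (>=) comparisons, and the example eigenvalues are all assumptions made for the sketch.

```python
def variance_proportions(eigenvalues):
    """Return (individual, cumulative) proportions of variance, largest axis first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    individual = [ev / total for ev in ordered]
    cumulative = []
    running = 0.0
    for proportion in individual:
        running += proportion
        cumulative.append(running)
    return individual, cumulative


def select_axes(eigenvalues, min_eigenvalue=None, min_proportion=None, n_components=None):
    """Keep axes under exactly one criterion (hypothetical helper, not a TASSEL API)."""
    chosen = [c for c in (min_eigenvalue, min_proportion, n_components) if c is not None]
    if len(chosen) != 1:
        raise ValueError("specify exactly one selection criterion")
    ordered = sorted(eigenvalues, reverse=True)  # axes sorted by variance captured
    if n_components is not None:
        return ordered[:n_components]
    if min_eigenvalue is not None:
        return [ev for ev in ordered if ev >= min_eigenvalue]
    total = sum(ordered)
    return [ev for ev in ordered if ev / total >= min_proportion]


# Made-up eigenvalues, for illustration only.
example = [1.0, 4.0, 1.0, 2.0]
individual, cumulative = variance_proportions(example)
kept = select_axes(example, n_components=3)
```

With these made-up values, the first axis accounts for half of the total variance, and asking for three components keeps the three largest eigenvalues.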
5. Traits button on the Data toolbar launches the Trait Filter dialog. This dialog is used with numerical data sets to (1) change the trait type, (2) view, but not change, whether the trait is discrete or continuous, and (3) drop one or more traits from the data set. In addition, the dialog can be used to view the trait properties without changing them. If the OK button is clicked, a new data set is created that incorporates the changes, the original data set remains unchanged, and the dialog closes. If the Cancel button is clicked, no data set is created, the original data set remains unchanged, and the dialog closes.

Allowable trait types are data, covariate, factor, and marker. Generally, data and covariate traits will be continuous (not discrete), and factor will be discrete. Markers in a numerical data set will be continuous. Discrete-valued markers are better imported as genotypes and filtered using the Sites filter.

Clicking Exclude All unchecks the Include box for all traits. Clicking Include All checks the Include box for all traits. The Exclude Selected and Include Selected buttons do the same thing for traits that have been highlighted by selecting them with the mouse. Type can be changed for individual traits by selecting a value in the drop-down box in the type column for that trait. Type can be changed for multiple traits by selecting those traits, then clicking one of the Change Sele
6. [Screenshot: MLM option dialog, with compression options (including Custom Level and No Compression) and Variance Component Estimation options: P3D (estimate once) or Re-estimate after each marker.]

An MLM option dialog will pop up as shown above. Choose the default options, which use P3D and compression at the optimum compression level. After the Run button is clicked, the progress bar will start moving. The time required will depend on sample size, number of traits, number of markers, and the options chosen in the MLM option dialog. After the progress bar is reset to zero, indicating completion of MLM, three reports will be added to the data tree. The first two are similar to the reports created by GLM. The most significant SNP is still the same; however, the strength of association is weaker, with a P-value of 7.199x10 vs. 1.1021x10 from GLM (exponents missing here), which does not pass the Bonferroni multiple test threshold of 5x10 (exponent likewise missing).

The third report contains the MLM-specific statistics, including -2 Log Likelihood, genetic variance, and residual variance components under different levels of compression. These statistics are illustrated by the Chart function in Result mode as follows:

[Screenshot: Chart dialog with Graph Type set to XYScatter]
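The comparison against a Bonferroni multiple test threshold mentioned above amounts to dividing the significance level by the number of tests. The sketch below is illustrative only: the function names are hypothetical, and the significance level and marker count are assumed values, not the ones used in the tutorial (whose exponents are missing above).

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def passes_bonferroni(p_value, alpha, n_tests):
    """True when a single test's p-value survives the correction."""
    return p_value <= bonferroni_threshold(alpha, n_tests)


# Assumed values for illustration: alpha = 0.05 and 100 tested markers.
threshold = bonferroni_threshold(0.05, 100)   # about 5e-4
significant = passes_bonferroni(1e-5, 0.05, 100)
```

For a scan of 3093 markers at the same assumed significance level of 0.05, the same formula gives a threshold of roughly 1.6e-5.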
7. 7.3 2D Plot

Displays 2D plots and determines color thresholds. This function is useful for plotting associations in multiple environments. First select the desired result set. Using the drop-down boxes provided, populate rows with Environment, columns with Site, and value with PermuteP. The cutoff value for coloring can be chosen either by inputting a value in the text box or by using the slider tool to the right of the text box. Users can mouse over any box to view the value associated with that box, as shown here:

[Screenshot: 2-D chart with Cutoff 0.001. Statistics: Min 0.0, Max 0.98, Mean 0.2658788, SD 0.32200772.]

If P-value coloring is desired, simply check the P-value box as shown below:

[Screenshot: 2-D chart with the P-value box checked. Statistics: Min 0.0, Max 0.98, Mean 0.2658788, SD 0.32200772.]

By checking the P-value box, Cutoff selection tools will be disabled and fields will instead be colored according to a grayscale key with thresholds at 0.01 and 0.05. This key can be
8. 9.4 Logging

10 Tutorial

This tutorial reviews several common scenarios for using TASSEL in order to help the user better understand its capabilities for data manipulation and association analyses. The TASSEL software package includes a tutorial data set that can be downloaded from the TASSEL website (please unzip all files to a directory of your choice). This tutorial data set contains data for phenotype, genotype, population structure, and kinship.

10.1 Missing Phenotype Imputation

The phenotype file (mdp_traits) will be used to demonstrate the process of imputing missing data. Note that the data set below contains missing values (NaN).

[Screenshot: TASSEL main window showing the Phenotypes table. Number of columns: 4; Number of rows: 301; Number of elements: 1204.]

To impute missing data, first select the mdp_traits data set in the Data Tree Panel and then click the Transform button (Data > Transform). The Transform Column Data window will open. Click on the Impute tab in this window. Finally, click on the Create Data set button to create the new data set with missing values imputed. Note that missing values are now filled.

[Screenshot: TASSEL main window showing the imputed data set]
9. Chromosome name and pos (Position). In the example below, genotype values are represented by 2 characters (i.e., AA). Note that you can record those as single-character values (see Nucleotide Codes in the Appendix). For TASSEL to correctly read Hapmap data, the data must be in order of position within each chromosome, and the file should be TAB-delimited (the example below is in Excel only for easy viewing). If some of the data is missing, the correct number of TABs must still be present so that TASSEL can properly assign data to columns.

rs#          alleles  chrom  pos      strand  assembly  center  protLSID  assayLSID  panel     QCcode  33-16  38-11  4226  4722  A188
PZB00859.1   A/C      1      157104   +       AGPv1     Panzea  NA        NA         maize282  NA      CC     CC     CC    CC    AA
PZA01271.1   C/G      1      1947984  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     GG     CC    GG    CC
PZA03613.2   G/T      1      2914066  +       AGPv1     Panzea  NA        NA         maize282  NA      GG     GG     GG    GG    GG
PZA03613.1   A/T      1      2914171  +       AGPv1     Panzea  NA        NA         maize282  NA      TT     TT     TT    TT    TT
PZA03614.2   A/G      1      2915078  +       AGPv1     Panzea  NA        NA         maize282  NA      GG     GG     GG    GG    GG
PZA03614.1   A/T      1      2915242  +       AGPv1     Panzea  NA        NA         maize282  NA      TT     TT     TT    TT    TT
PZA00258.3   C/G      1      2973508  +       AGPv1     Panzea  NA        NA         maize282  NA      GG     CC     CC    CG    CC
PZA02962.13  A/T      1      3205252  +       AGPv1     Panzea  NA        NA         maize282  NA      TT     TT     TT    TT    TT
PZA02962.14  C/G      1      3205262  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     CC     CC    CC    CC
PZA00599.25  C/T      1      3206090  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     TT     CC    TT    TT
PZA02129.1   C/T      1      3706018  +       AGPv1     Panzea  NA        NA         maize282  NA      TT     CC     CC    CC
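The two Hapmap requirements stated above (a constant number of TAB-separated fields on every line, and positions in ascending order within each chromosome) can be checked before loading a file. The following Python sketch is a hypothetical pre-check, not part of TASSEL; it assumes the chromosome and position are the third and fourth columns, as in the header shown above.

```python
def check_hapmap_lines(lines):
    """Return a list of problems found in tab-delimited Hapmap text lines.

    Assumes the first line is the header and that columns 3 and 4 hold the
    chromosome and the position (hypothetical helper, for illustration).
    """
    problems = []
    header = lines[0].rstrip("\n").split("\t")
    n_cols = len(header)
    last_pos = {}  # chromosome -> last position seen
    for line_no, line in enumerate(lines[1:], start=2):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != n_cols:
            problems.append(
                f"line {line_no}: expected {n_cols} columns, found {len(fields)}"
            )
            continue
        chrom, pos = fields[2], int(fields[3])
        if chrom in last_pos and pos < last_pos[chrom]:
            problems.append(
                f"line {line_no}: position {pos} is out of order on chromosome {chrom}"
            )
        last_pos[chrom] = pos
    return problems


# Usage sketch with a shortened, made-up header (real files carry more columns).
rows = [
    "rs#\talleles\tchrom\tpos\tA188\n",
    "m1\tA/C\t1\t200\tAA\n",
    "m2\tC/T\t1\t100\tCC\n",   # out of order
    "m3\tC/T\t1\t300\n",       # missing a TAB-separated field
]
report = check_hapmap_lines(rows)
```

Note that a missing genotype written as an empty field (a trailing TAB) still counts as a column here, which matches the requirement that the correct number of TABs be present.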
10. D 52 D75 2006
Canaran, P., Stein, L. & Ware, D. Look-Align: an interactive web-based multiple sequence alignment viewer with polymorphism analysis support. Bioinformatics 22, 885-886 (2006).
Du, C. G., Buckler, E. & Muse, S. Development of a maize molecular evolutionary genomic database. Comparative and Functional Genomics 4, 246-249 (2003).
SAS. SAS Statistical Analysis Software for Windows, 9.0 ed. Cary, NC, USA (2002).
Hardy, O. J. & Vekemans, X. SPAGEDI: a versatile computer program to analyse spatial genetic structure at the individual or population levels. Molecular Ecology Notes 2, 618-620 (2002).
Cover, T. & Hart, P. Nearest neighbor pattern classification. Proc. IEEE Trans. Inform. Theory 13 (1967).
Weir. Genetic Data Analysis II. Sunderland, MA (1996).
Farnir, F. et al. Extensive genome-wide linkage disequilibrium in cattle. Genome Res. 10, 220-7 (2000).
Henderson, C. R. Best Linear Unbiased Estimation and Prediction under a Selection Model. Biometrics 31, 423-447 (1975).
Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709-23 (2008).
Laird, N. M. & Ware, J. H. Random-Effects Models for Longitudinal Data. Biometrics 38, 963-974 (1982).
Thornsberry, J. M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nat. Genet. 28, 286-9 (2001).
Flint-Garcia, S. A. et al. Maize association population: a high-resolution platform for qu
11. [Figure: two XY scatter charts of the compression statistics. Left: groups vs. -2LnLk. Right: groups vs. Var_genetic and Var_error.]

In the example, 79 individuals are included in the final analysis. When they are clustered into 44 groups, the -2 Log Likelihood reaches a minimum, which indicates the best model fit. The screening of SNPs was performed at this optimum compression level.

Note: When two or more individuals are clustered into one group, the variance component for the random effect is not equivalent to the one without compression. Consequently, the heritability derived should not be interpreted as the individual-based heritability.

To perform a Genome-Wide Association Study (GWAS) on the 3093 SNPs, we need to create a new joint data set containing the filtered phenotype, population structure, and the genome-wide genotype. Highlight the new joint file and the kinship data and click the MLM button. Choose the default options on the MLM option dialog. The analysis will take a minute or two. The output report labeled MLM compression indicates that 259 lines were used in the analysis. With 74 groups, the statistics from the best fit are as graphed below.

[Figure: chart of Var_genetic against groups]
12. relationships. This section shows how to create a kinship matrix based on the same SNP data used to calculate PCs.

Remove monomorphic sites: Highlight the genotype and choose Filter > Sites on the menu bar. Set the threshold on MAF to 0.05, check Remove minor SNP states, then click Filter.

Estimate kinship: Highlight the filtered genotype and click Analysis > Kinship. Leave Scaled IBS selected in the Choose Kinship Method dialog and click OK. A kinship matrix will be added to the data tree under the Matrix category.

Alternatively, impute missing genotype data first, then create the kinship matrix using the imputed data. To impute missing data, highlight the filtered genotype, choose Data > Transform, leave Collapse Non Major Alleles selected, and click Create Dataset. A new data set with Collapse appended will appear in the Numerical folder. Highlight the collapsed data set, choose Data > Transform, select the Impute tab, then click Create Dataset. Highlight the resulting imputed data, then choose Analysis > Kinship.

[Screenshot: TASSEL 5.0.4 main window showing the kinship matrix for mdp_genotype_chr1]
13. [Screenshot: Allele Summary table for mdp_genotype. Table Title: Allele Summary; Number of columns: 4; Number of rows: 27; Number of elements: 108.]

Allele Summary of mdp_genotype
Alleles: Allele values present in data set. Single-letter values are diploid, where some letters represent heterozygous. Two-letter values are major/minor combinations with count of sites.
Number: Number of occurrences.
Proportion: Percentage the value occurs in data set.
Frequency: Percentage the value occurs in data set, not counting unknown (N) values.

[Screenshot: TASSEL 5.0.5 main window showing the Site Summary table for mdp_genotype, with columns including Site Number, Site Name, Chromosome, Physical Position, and Major Allele.]
14. [Screenshot: Alignment Distance Matrix for mdp_genotype_chr1 in the table viewer. Number of columns: 282; Number of rows: 281.]
15. [Screenshot: Synonymizer table and the Threshold for synonymizer dialog with its Apply threshold button.]

Once it has been determined that the taxa names were matched correctly, the synonyms can be applied. With the synonyms selected, hold down the CTRL key while clicking on the second synonym data set (the data set whose names you would like to change). Then once again click on the Synonymizer button to apply the new names to the data set.

3.6 Intersect Join

Command:
run_pipeline.pl -fork1 -h group1.hmp.txt -fork2 -h group2.hmp.txt -combine3 -input1 -input2 -intersect -export group1_group2_intersect.hmp.txt -runfork1 -runfork2 -runfork3

This joins multiple data sets by the intersection of their taxa. Taxa must be present in both data sets to be included. Select multiple data sets using the CTRL key in conjunction with mouse clicks, and then click on the intersection button to join the data sets. Because this function uses taxa names to join data sets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by using the Synonymizer.

3.7 Union Join

Comma
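The behaviour of the Intersect Join described in section 3.6, and the reason name variations break it, can be illustrated with plain dictionaries keyed by taxon name. This is a conceptual sketch only: the function name, the data layout, and the example values are assumptions, not TASSEL's data structures.

```python
def intersect_join(first, second):
    """Join two {taxon: values} tables, keeping only taxa named identically in both."""
    shared = [taxon for taxon in first if taxon in second]
    return {taxon: (first[taxon], second[taxon]) for taxon in shared}


# Made-up example: "B73" and "b73" differ only in case, so they do not match.
genotypes = {"A188": "CC", "B73": "AA"}
traits = {"A188": 61.0, "b73": 59.5}
joined = intersect_join(genotypes, traits)
```

Only A188 survives the join in this example; making the two spellings of the second taxon uniform first (the job of the Synonymizer) would let both taxa through.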
16. 6.5 GLM (General Linear Model)

This function performs association analysis using a least squares fixed effects linear model. TASSEL utilizes a fixed effects linear model to test for association between segregating sites and phenotypes. The analysis optionally accounts for population structure using covariates that indicate degree of membership in underlying populations.

A main-effects-only model is automatically built using all variables in the input data. A separate model is built and solved for each trait and marker combination. Any factors, covariates, reps, or locations are included in every model as main effects. How the data is used must be defined either in the input data files or using the Trait Filter after the data has been imported but before it has been joined with a genotype.

General Linear Model (GLM) can be run using a numeric data set only or using numeric data joined to genotype data. If only numeric data is selected, best linear unbiased estimates (BLUEs) or least square means will be generated for the taxa for each trait. Note: only factors and covariates intended to control field variation should be included at this stage. Population structure covariates, which are intended to control for marker effects, should only be included when markers are also in the analysis. If numeric data with genotypes are analyzed, each trait-by-marker combination will be tested and two reports will be produced: one containing trait-by-marker F tests and the other c
17. CC

rs#          alleles  chrom  pos      strand  assembly  center  protLSID  assayLSID  panel     QCcode  33-16  38-11  4226  4722  A188
PZA00393.1   C/T      1      4175293  +       AGPv1     Panzea  NA        NA         maize282  NA      TT     TT     TT    CC    TT
PZA02869.8   C/T      1      4429897  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     TT     CC    NN    CC
PZA02869.4   C/G      1      4429927  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     CC     CC    NN    GG
PZA02869.2   C/T      1      4430055  +       AGPv1     Panzea  NA        NA         maize282  NA      NN     TT     TT    CC    TT
PZA02032.1   A/T      1      4490461  +       AGPv1     Panzea  NA        NA         maize282  NA      AA     TT     AA    AA    AA
zagl1.5      A/T      1      4835434  +       AGPv1     Panzea  NA        NA         maize282  NA      AA     NN     AA    AA    AA
zagl1.2      A/C      1      4835558  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     CC     CC    CC    CC
zagl1.6      C/T      1      4835558  +       AGPv1     Panzea  NA        NA         maize282  NA      TT     TT     TT    TT    TT
PZD00081.2   C/T      1      4836542  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     CC     CC    CC    CC
zagl1.1      A/C      1      4912526  +       AGPv1     Panzea  NA        NA         maize282  NA      AA     AA     AA    AA    AA
PZB00919.1   A/C      1      5353319  +       AGPv1     Panzea  NA        NA         maize282  NA      CC     CC     CC    CC    AA
PZB00919.2   G/T      1      5353655  +       AGPv1     Panzea  NA        NA         maize282  NA      GG     GG     GG    GG    GG

3.1.2 HDF5

Hierarchical Data Format version 5: http://www.hdfgroup.org/HDF5/

3.1.3 VCF

Variant Call Format: http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42

3.1.4 Plink

Plink is a whole genome association analysis tool set, which comes with its own text-based data format. The data is stored in a set of two files, a .map file and a .ped file. The .ped file contains all the SNP values and has six mandatory header columns for Family ID, Individual ID, Paternal ID, Maternal ID, Sex, and Phenotype. TASSEL only requires that the Individual
18. Disequilibrium

This generates a linkage disequilibrium data set from SNP data. NOTE: It is important to use only filtered data sets (apply Filter > Sites first) when estimating linkage disequilibrium, as a raw alignment with numerous invariant bases will take a very long time and consume a large amount of memory to calculate.

[Screenshot: Linkage Disequilibrium dialog. Select LD type: Sliding Window; LD Window Size: 50 (Sliding Window LD with 153375 comparisons); How to treat heterozygous calls: Set to missing; Accumulate R2 Results; Run and Close buttons.]

Linkage disequilibrium between any set of polymorphisms can be estimated by clicking on a filtered set of polymorphisms and then using Analysis > Link. Diseq. At this time, D', r2, and P-values will be estimated. The current version calculates LD between haplotypes with known phase only (unphased diploid genotypes are not supported; see PowerMarker or Arlequin for genotype support).

D' is the standardized disequilibrium coefficient, a useful statistic for determining whether recombination or homoplasy has occurred between a pair of alleles.
r2 represents the correlation between alleles at two loci, which is informative for evaluating the resolution of association approaches.

D' and r2 can be calculated when only two alleles are present. If multiple alleles are present, a weighted average of D' or r2 is calculated between the two loci. This weighted average is determined by calculating D' o
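For the two-allele case with known phase, D' and r2 follow from the haplotype and allele frequencies. The sketch below uses the standard textbook definitions and is not taken from TASSEL's source: the function name and input layout are assumptions, D' is reported as an absolute value by choice, and the multi-allele weighted average mentioned above is not covered.

```python
def pairwise_ld(haplotypes):
    """Compute D, |D'| and r2 from phased two-locus haplotypes.

    `haplotypes` is a list of (allele_at_locus_1, allele_at_locus_2) pairs.
    Both loci must carry exactly two alleles (hypothetical helper).
    """
    n = len(haplotypes)
    alleles_1 = sorted({h[0] for h in haplotypes})
    alleles_2 = sorted({h[1] for h in haplotypes})
    if len(alleles_1) != 2 or len(alleles_2) != 2:
        raise ValueError("both loci must be biallelic and polymorphic")
    a, b = alleles_1[0], alleles_2[0]  # reference allele at each locus
    p_a = sum(1 for h in haplotypes if h[0] == a) / n
    p_b = sum(1 for h in haplotypes if h[1] == b) / n
    p_ab = sum(1 for h in haplotypes if h[0] == a and h[1] == b) / n
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max  # d_max > 0 because both loci are polymorphic
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Two loci in complete association: only two of the four haplotypes occur.
d, d_prime, r2 = pairwise_ld([("A", "B")] * 5 + [("a", "b")] * 5)
```

When all four haplotypes occur at equal frequency, the same function returns zero for D, D', and r2, the case of no association between the loci.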
19. GWAS (Yu et al. 2006), yet these models can be slow to solve. TASSEL has been a test bed and implements some of the best optimizations, such as EMMA (Kang et al. 2008), plus approaches that optimize variance components once, P3D (Zhang et al. 2010) and EMMAX (Kang et al. 2010). Compression algorithms are also available (Zhang et al. 2010). When used correctly, these optimizations make powerful GWAS computationally possible.
• The code is being continually optimized for larger numbers of cores and clusters. For example, we generally run imputation on 64-core machines. And while Java provides excellent interoperability between systems, its code is about 2-fold slower than optimized C libraries and 10-fold slower than GPU processing for some problems. TASSEL5 is building out connection layers directly to native code when these efficiencies are needed.

TASSEL was designed for a wide range of users, including those not expert in statistical genetics or computer science. A GWAS using the mixed linear model method to incorporate information about population structure and cryptic relationships can be performed in a few steps by clicking on the proper choices using a graphic interface. All the processes necessary for the analysis are performed automatically, including importing phenotypic and genotype data, imputing missing data (phenotype or genotype), filtering markers on minor allele frequency, generating principal components and a kinship matrix to r
20. [Screenshot: TASSEL main window showing the Data tree (Sequence, Polymorphisms, Numerical, Matrix, Fusions, Synonymizer, Result) and the genotype viewer for mdp_genotype, with site selection by Physical Positions or Site Numbers.]
21. community of problems. In order for a bug to be fixed, we must be able to replicate the problem. Thus, it is important to document the steps that were taken that produced the error. If the data you are working with is not too sensitive, please include the files which were used in the faulty procedure. If you would rather not post your data file on SourceForge, you may email it to one of the software developers.

Where do I turn for more information?
If you are having difficulty with a certain aspect of TASSEL, you can either email one of the software developers listed at www.maizegenetics.net, or you may check the TASSEL forum on SourceForge (http://sf.net/projects/tassel), as another user may have already addressed a similar question. There is also a TASSEL discussion group at http://groups.google.com/group/tassel.

How do I join the fun? (TASSEL on SourceForge)
TASSEL is an open source project distributed under the GNU General Public License. This means that the source code is available and the user is free to modify the code to suit their particular needs. We welcome input from developers and those who wish to become involved in the improvement of this software. The project is hosted on SourceForge (http://sf.net/projects/tassel), thereby allowing anyone to access the most recent changes to the code. This setup makes it convenient for anyone to add special functionality to TASSEL if they so desire. It also serves as a good platform for anyone who wishes to b
22. eliminate monomorphic markers and imputing missing values may be necessary. Imputing missing values can be done before or after numericalization. Here we demonstrate how to generate PCs from the genotype file in the tutorial data.

Remove monomorphic sites: Make sure TASSEL is in Data mode. Highlight the genotype and click Site. Set the minimum frequency to 0.05 and have Remove minor SNP states checked. Click Filter.
Numericalization: Highlight the filtered genotype and click Transform. Use the default option of Collapse non-major alleles. Click Create data set.
Imputation of missing values: Highlight the numerical genotype, click Transform, and then click the Impute tab. Use the default options. Click Create data set.
PCA: Highlight the imputed numerical genotype, click Transform, and then click the PCA tab. Change the default option to Components = 3 by choosing Components and typing 3 in the text box. Click Create data set.

[Screenshot: Filter Alignment dialog with Minimum Count, Minimum Frequency (0.05), Position Type, Start Position, End Position, Extract Indels, and Generate haplotypes via sliding window (Haplotype Length, Step Length) options.]
23. has improved statistical power compared to the regular MLM. The optimum grouping, with the best model fit for MLM without fitting genetic markers, has the best statistical power for an association test of markers. TASSEL allows users to specify the compression level (average number of individuals per group) or to have the program determine the optimum grouping.

Similar to GLM, MLM performs an association test for each combination of traits and markers. TASSEL provides users several options: (1) to estimate genetic and residual variance for each combination, (2) to get these estimates once for each trait without fitting genetic markers and then to use those estimates to test markers, or (3) to use a prior heritability estimate provided by the user. The second option, named P3D (population parameters previously determined), has the same statistical power as the first option. Using the P3D method or using a prior heritability can be much faster than calculating heritability for each marker.

Using MLM is very similar to using GLM. The difference is that, in addition to choosing the joint data set or numerical data set, kinship data must also be highlighted before clicking the MLM button to show the MLM option dialog. The option of No Compression is the regular MLM, which is equivalent to Custom level 1. For data sets with large numbers of taxa, the optimal compression option may be considerably slower than no compression or user-supplied compression
24. implemented yet.

Notes:
• This maps to the Data > Merge Genotype Tables menu in the GUI.
• Error if duplicate site names in same file (same as with other file loadings).
• Undefined taxa/site allele values are set to UNKNOWN.
• Duplicate taxa/site set to last Alignment processed.
• Sites are identified by Locus (chromosome), Physical Position, and Site Name.

3.9 Separate

This separates the selected data set into its components. For example, a genotype table would be separated into individual chromosomes.

3.10 Homozygous Genotype

This changes all heterozygous values to unknown (N).

4 Impute Menu

4.1 Genotypic Imputation

TASSEL5 contains two methods for imputing missing genotype information: one is a generalized approach suitable for all types of populations but optimized for those with higher inbreeding coefficients (FILLIN), and the other is specifically optimized for finding recombination break points in full-sib families (FSFHap). More information on these two methods can be found in Swarts et al., "FSFHap (Full-Sib Family Haplotype Imputation) and FILLIN (Fast, Inbred Line Library ImputatioN) optimize genotypic imputation for low-coverage, next-generation sequence data in crop plants", Plant Genome (in review).

FSFHap: Full-Sib Family Haplotype Imputation

FSFHap imputes missing genotypes and corrects genotyping errors for inbred individuals in full-sib families. It is very useful for calling haplotypes in low coverage GBS d
25. markers, the data set contains the following statistics:
• marker F: F value from the F test on the marker
• marker p: P value from the F test on the marker
• markerR2: R² for the marker after fitting other model terms (population structure)
• markerDF: degrees of freedom of the marker
• markerMS: mean square of the marker
• errorDF: degrees of freedom of the residual error
• errorMS: mean square of the residual error
• modelDF: degrees of freedom of the model
• modelMS: mean square of the model

[Screenshot: TASSEL 5.0.2 main window showing the GLM Marker Test table for trait dpoll, with columns including Marker, Locus, Locus pos, marker F, marker p, markerR2, markerDF, markerMS, errorDF, and errorMS]
26. minimum frequency of the minority polymorphisms for the site to be included in the filtered data set.
• Start Position / End Position: establishes the range of sites for filtering.
• Extract Indels: if selected, indels are extracted from the alignment. If not selected, only point substitutions are extracted.
• Remove minor SNP states: converts tertiary and rarer states to missing data, thereby forcing each site to have only two segregating states. This may help remove sequencing errors.
• Generate haplotypes via sliding window: creates haplotypes from an ordered set of SNPs.

Example pipeline command that removes SNPs with MAF (minor allele frequency) less than 5%:

    ./run_pipeline.pl -fork1 -h mdp_genotype.hmp.txt -filterAlign -filterAlignMinFreq 0.05 -export filtered_genotype -runfork1

5.2 Site Names
First select the genotypic data from the data tree. The resulting dialog displays the site names associated with the selected data. By using either the CTRL or SHIFT key in conjunction with the mouse, the user can select or deselect site names. Once the desired site names have been moved to the Selected window using the Add > button, the Capture Selected or Capture Unselected buttons will create a new data set containing only the desired site names.

Using the search box:
• A wildcard is always implied at the end of the search string.
• The search string is case sensitive. For example
27. properly, it will cause MLM to complain and refuse to complete your analysis. We can eliminate the dependency by removing one of the Q variables. In this demonstration we exclude the last one. Highlight mdp_population_structure and click Filter > Traits. Uncheck the last population (Q3). Make sure that the Type is set to Covariate. Then click OK to create a filtered population structure data set.

Joining data: Highlight the three filtered data sets by holding the Control key while selecting the individual data sets. Then click the menu item Data > Intersect Join to create a combined data set.

Association analysis: Highlight the joint data set, then click the menu item Analysis > GLM to perform association analysis. Two reports will be added to the data tree.

[Screenshots: the Filter Alignment dialog (Minimum Count out of 281 sequences, Minimum Frequency 0.05, Maximum Frequency 1.0, Start Position 0, End Position 3092 of 3092 sites, Remove minor SNP states, Generate haplotypes via sliding window) and the Filter Traits dialog (Modify Trait Properties, Type set to covariate)]

One of the reports added to the data tree is labeled GLM_Marker_Test_ followed by the name of the joint data. In addition to the information for traits and
28. [Screenshot: Taxa Summary table for mdp_genotype under the Genotype Summary node of the data tree. Table Title: Taxa Summary; Number of columns: 9; Number of rows: 281; Number of elements: 2529]

Taxa Summary of mdp_genotype:
• Taxa: index of taxa
• Taxa Name: name of taxa
• Number of Sites: number of sites for the taxon (same for all)
• Gametes Missing: number of gametes with unknown (N) value. Every taxa/site combination has two gametes.
• Proportion Missing: Gametes Missing / (Number of Sites * 2)
• Number Heterozygous: number of sites that are heterozygous for the taxon
• Proportion Heterozygous: Number Heterozygous / Number of Sites (not counting sites that are unknown, NN)
• Inbreeding Coefficient
• Inbreeding Coefficient Scaled by Missing

6.9 Stepwise

7 Results Menu
Results consists of the functions to present data as tables or graphics.

7.1 Table
Allows data to be displayed in a spreadsheet view and exported to a flat file. To create a table, select a data set from the Data Tree panel, then click on the menu Results > Table.

[Screenshot: TASSEL 5.0.6 Results menu listing Table, Archaeopteryx, 2D Plot, LD Plot, Chart, QQ Plot, and Manhattan Plot]
29. [Example trait data listing]

3.1.8.2 Covariate Format
Covariate data uses the same format as trait data, except that the first line must be <Covariate>. This line tells TASSEL that the variables in this file will be used as covariates, not as dependent variables. This is the format to use for population structure covariates.

    <Covariate>
    <Trait>    Q1       Q2       Q3
    33-16      0.014    0.972    0.014
    38-11      0.003    0.993    0.004
    4226       0.071    0.917    0.012

3.1.8.3 Marker Values as Numerical Covariates
In some cases, a user may wish to have marker values treated as numerical covariates. If the first line of the file is <Numeric>, then the data will be imported as numeric data but used as marker data in GLM and MLM.

    <Numeric>
    <Marker>    m1    m2    m3    m4    m5

3.1.9 Square Numerical Matrix
Kinship can be calculated externally from pedigrees by using SAS Proc Inbreeding, or from markers by using one of several available software packages. The following format is provided to import the resulting kinship estimates. If n represents the number of taxa, the format for kinship files is as follows:

    n
    Taxa1Name    r11    r12    ...    r1n
    Taxa2Name    r21    r22    ...    r2n
    ...
    TaxanName    rn1    rn2    ...    rnn

Here rij (i, j = 1, 2, ..., n) is the element in the kins
30. [Screenshot: distance matrix table]

10.4 Association analysis using GLM
We use three files from the tutorial data set to perform association analysis using the GLM. The first file, mdp_genotype.hmp.txt, is a set of SNPs scored at 3093 sites on 281 maize inbred lines. The second is the population structure of 282 maize inbred lines (mdp_population_structure.txt). The last is phenotypes for three traits on 282 maize inbred lines (mdp_traits.txt). The statistical model is:

    Flowering time = Population structure + Marker effect + residual

Remove monomorphic and low coverage sites: Highlight mdp_genotype and click Filter > Sites on the menu bar. Set Minimum Frequency to 0.05, Maximum Frequency to 1.0, and Minimum Count to 150. Click Filter to create a filtered genotype data set.

Trait selection: Highlight the phenotype and click the menu item Filter > Traits. Uncheck all the traits except flowering time (DPOLL). Make sure that the Type is set to Data. Click OK to create a filtered phenotype.

Covariate selection: The population structure is presented as the proportion of each population. There are three populations, represented as Q1, Q2, and Q3. They sum to 100%. This creates linear dependency if we use all of them as covariates. While GLM can handle that
31. [Screenshot, continued: GLM Marker Test table. Table Title: Marker Test; Number of columns: 13; Number of rows: 2559; Number of elements: 33267]

Clicking marker_p will sort the table by P value. The smallest P value is 3.5963x10. A reasonable significance threshold is 1.9x10^-5, which is 5% after Bonferroni multiple test correction (0.05/2559). The denominator in the Bonferroni correction is the total number of SNPs tested. The association was significant.

The other data set added to the data tree is labeled GLM_Allele_Estimates_ followed by the name of the joint data. F
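The Bonferroni arithmetic quoted above can be checked with a few lines of Python. This is only a sketch of the calculation; the helper name is_significant is hypothetical and is not part of TASSEL.

```python
# Bonferroni threshold for the GLM tutorial: the family-wise alpha (0.05)
# divided by the number of SNPs tested (2559, as stated in the text).
alpha = 0.05
n_tests = 2559
threshold = alpha / n_tests


def is_significant(p_value, cutoff=threshold):
    """Return True when a marker p-value passes the Bonferroni cutoff.

    Hypothetical helper for illustration only.
    """
    return p_value < cutoff


print(f"Bonferroni threshold: {threshold:.2e}")
```

Any marker whose marker_p value falls below this threshold would be declared significant at the 5% family-wise level.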
32. [Figure: LD plot. The upper triangle shows R squared and the lower triangle shows P value]

LD plots can be saved in several formats. The Save button will save the area of the graph shown on the screen, while the Save All button will save the entire graph.

7.5 Chart
Chart provides a variety of graphs for visualizing numeric data. This feature can be used to display histograms, XY plots, bar charts, and/or pie charts. Any numeric table data can be charted, including LD results, phe
33. ID field be filled in. Each row of the ped file describes a single germplasm line. Note that in Plink an unknown character is represented with a 0. In TASSEL, however, an unknown character is represented with an N, and 0 is used to represent a heterozygous indel. TASSEL will automatically convert between the 0 and the N. Any exported Plink files will represent the heterozygous indel with a + (insertion) and a - (deletion).

The map file describes all the SNPs in the associated ped file, where each row provides information on one SNP. The map file must contain exactly four columns: Chromosome, rs#, Genetic distance, and Position. TASSEL does not require the Genetic distance field to be filled in. Both files should be TAB delimited. For a more detailed description of the data format, please visit the Plink basic usage and data formats webpage: http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml

3.1.5 Projection Alignment

3.1.6 Phylip
Details on Phylip format are described at the following website: http://evolution.genetics.washington.edu/phylip/doc/sequence.html

3.1.7 FASTA

3.1.8 Numerical Data
This type of format is used for trait and covariate data, such as population structure. Similar to sequence alignment (genotype) data, numerical data also consists of two parts: a header that defines the data structure, and a body containing the main data. Tabs should be used as delimiters. However, any white space character such as blan
34. It does this by first looking for one haplotype to a threshold (2a), then two, modeling a recombination break between inbred segments (2b), and finally, to a higher threshold, it looks for two haplotypes and models the 64-site focus window as heterozygous, combining the two haplotypes together. The thresholds for 2a-c are also set differently based on whether the whole sequence of the target taxon is above or below a user-supplied heterozygosity threshold. For taxa considered outbred (above the threshold), step 2b (the Viterbi option) is never used, because it is more likely in an outbred taxon that, if two haplotypes explain a segment, it is heterozygous for those two haplotypes. If the algorithm cannot find haplotypes to satisfy any of these threshold requirements, the segment will not be imputed. The thresholds for the focus block imputation are set based on the mxInbErr and mxHybErr values entered (or defaults).

[Schematic: FILLINFindHaplotypesPlugin generates block haplotypes. FILLINImputationPlugin imputes back onto sample taxa using haplotypes by block, then imputes 64-site subsets of blocks, either with one nearest-neighbor haplotype or with the two best haplotypes using a Viterbi HMM; otherwise it does not impute]

Running FILLIN
FILLIN consists of two TASSEL plugins, FILLINFindHaplotypesPlugin and FILLINImputationPlugin, which are called sequentially. If you would like to mask your data and calculate accuracy, use the accuracy
35. K, the resulting variance estimate can be considered an estimate of σa² as long as the assumptions of the method used to derive K are not violated for the population being analyzed. One implication is that two different K matrices may give very different estimates of σa² and heritability, yet produce the same model fit and test of marker association.

TASSEL implements several methods to improve statistical power and reduce computing time. The Restricted Maximum Likelihood (REML) estimates of σa² and σe² are obtained through the Efficient Mixed Model Association (EMMA) algorithm, which is much faster than the expectation and maximization (EM) algorithm. TASSEL also implements a method called compression, which reduces the dimensionality of the kinship matrix to reduce computational time and improve model fitting. When MLM is used without compression (compression = 1), each taxon belongs to its own group. At the other extreme, GLM can be interpreted as maximum compression (compression = n), with all taxa in a single group. In that case, it is not possible to estimate the random effect independently of error, and σa² is absorbed into e. Between these two extremes, taxa can be grouped using cluster analysis based on kinship. When n individuals are compressed into s clusters (groups), the kinship among individuals is replaced with the kinship among groups. At some grouping levels, dependent on the trait and population being analyzed, this compressed MLM
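To make the idea of replacing kinship among individuals with kinship among groups concrete, here is a minimal Python sketch. It assumes the group labels are already known and uses the mean of each block as the group-level kinship; both the function name compress_kinship and the choice of the mean are assumptions for illustration and are not taken from TASSEL's implementation, which also performs the clustering itself.

```python
import numpy as np


def compress_kinship(kinship, groups):
    """Average an n x n kinship matrix into an s x s group kinship matrix.

    `groups` holds one group label per taxon. Using the block mean as the
    summary statistic is an assumption made for this sketch.
    """
    kinship = np.asarray(kinship, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    compressed = np.empty((labels.size, labels.size))
    for i, gi in enumerate(labels):
        for j, gj in enumerate(labels):
            # Sub-matrix of kinship values between members of gi and gj.
            block = kinship[np.ix_(groups == gi, groups == gj)]
            compressed[i, j] = block.mean()
    return labels, compressed


# Hypothetical kinship among 4 taxa, compressed into s = 2 groups
# (an average of 2 individuals per group).
K = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.1, 0.6, 1.0],
])
labels, K_groups = compress_kinship(K, ["a", "a", "b", "b"])
print(labels)
print(K_groups)
```

With one group per taxon the function returns the original matrix, and with a single group it collapses to one number, which mirrors the two extremes described above.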
36. • Minor Allele Gametes: number of times the minor allele occurs for the site
• Minor Allele Proportion: Minor Allele Gametes / (Number of Taxa * 2). Number of Taxa * 2 is the number of gametes for a site.
• Minor Allele Frequency: Minor Allele Gametes / ((Number of Taxa * 2) - Gametes Missing)
• Gametes Missing: number of gametes with unknown (N) value
• Proportion Missing: Gametes Missing / (Number of Taxa * 2)
• Number Heterozygous: number of taxa that are heterozygous for the site
• Proportion Heterozygous: Number Heterozygous / Number of Taxa (not counting taxa that are unknown, NN)
• Inbreeding Coefficient
• Inbreeding Coefficient Scaled by Missing

[Screenshot: TASSEL 5.0.5 showing the mdp_genotype_TaxaSummary table with columns Taxa Name, Number of Sites, Gametes Missing, Proportion Missing, and Number Heterozygous. For example, taxon 33-16 has 3093 sites, 190 missing gametes, proportion missing 0.03071, and 33 heterozygous sites; taxon 4722 has 3093 sites, 790 missing gametes, proportion missing 0.12771, and 147 heterozygous sites]
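The Proportion Missing column of the taxa summary can be reproduced by hand from the other columns. The sketch below uses the counts reported for taxa 33-16 and 4722 in the tutorial data (3093 sites, with 190 and 790 missing gametes); the function name proportion_missing is hypothetical.

```python
def proportion_missing(gametes_missing, number_of_sites):
    """Proportion Missing = Gametes Missing / (Number of Sites * 2).

    Every taxon/site combination contributes two gametes.
    """
    return gametes_missing / (number_of_sites * 2)


# Values taken from the taxa summary of mdp_genotype.
print(round(proportion_missing(190, 3093), 5))  # taxon 33-16
print(round(proportion_missing(790, 3093), 5))  # taxon 4722
```

The same formula applies to the site summary, with Number of Taxa in place of Number of Sites.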
37. [Screenshot: TASSEL data tree after phenotype imputation, showing the imputed phenotypic values and the taxa with insufficient data]

10.2 Principal Component Analysis
Principal component analysis (PCA) is a statistical tool that transforms a set of correlated variables into a smaller number of uncorrelated variables called principal components (PCs). The first PC captures as much of the variation as possible, and the succeeding PCs account for a decreasing fraction of the remaining variance. Another application of PCA is to use PCs derived from genetic markers to represent population structure. This method requires much less computing time than maximum likelihood estimation.

As most marker data are characters, numericalization must be performed first. A common approach for converting character marker scores is to set one of the homozygotes to 0, the other homozygote to 2, and the heterozygote to 1. For haploids, the conversion can be performed simply by coding one allele as 0 and the other as 1. The TRANSFORM function in TASSEL converts the major allele to 0. All the other alleles are collapsed to a single class and coded as 1. PCA requires that all variables have variation and have no missing values. As a result, filtering genotype to
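The two numericalization schemes described in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than TASSEL's code: the function names are hypothetical, genotype calls are assumed to be two-character strings such as "AG", and the homozygote for the most common allele is arbitrarily the one coded 0.

```python
from collections import Counter

MISSING = "N"  # unknown allele, as in TASSEL nucleotide data


def additive_codes(calls):
    """Code diploid calls such as 'AA', 'AG', 'GG' as 0, 1, 2.

    The value is the number of non-major alleles in the call. Which
    homozygote is coded 0 is an arbitrary choice made for this sketch.
    """
    counts = Counter(a for call in calls for a in call if a != MISSING)
    major = counts.most_common(1)[0][0]
    codes = []
    for call in calls:
        if MISSING in call:
            codes.append(None)  # must be filtered or imputed before PCA
        else:
            codes.append(sum(1 for a in call if a != major))
    return codes


def collapse_non_major(alleles):
    """Code the major allele as 0 and every other allele as 1."""
    counts = Counter(a for a in alleles if a != MISSING)
    major = counts.most_common(1)[0][0]
    return [None if a == MISSING else int(a != major) for a in alleles]


print(additive_codes(["AA", "AG", "GG", "AA", "NN"]))
print(collapse_non_major(["A", "A", "G", "T", "N"]))
```

Missing calls are returned as None here to make explicit that they still need to be filtered out or imputed, since PCA does not accept missing values.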
38. Sort Genotype File
Transform
  Genotype Numericalization
    Collapse Non Major Alleles
    Separate Alleles
  Transform and/or Standardize Data
  Impute Phenotype
  PCA
Synonymizer
  Synonymize Taxa Names
Intersect Join
  Command
Union Join
  Command
Merge Genotype Tables
  Command
  Notes
Separate
Homozygous Genotype
Impute Menu
  Genotypic Imputation
Filter Menu
  Sites
  Site Names
  Taxa Names
  Traits
Analysis Menu
  Diversity
  Linkage Disequilibrium
  Cladogram
  Kinship
  GLM (General Linear Model)
  MLM (Mixed Linear Model)
  Genomic Selection using Ridge Regression
  Geno Summary
  Stepwise
Results Menu
  Table
  Archaeopteryx Tree
  2D Plot
  LD Plot
  Chart
  QQ Plot
  Manhattan Plot
GBS Menu
Help Menu
  Help Manual
  About
  Show Memory
  Logging
Tutorial
  Missing Phenotype Imputation
  Principal Component Analysis
  Estimation of Kinship using genetic markers
  Association analysis using MLM
Appendix
  TASSEL Tutorial Data sets
  Frequently Asked Questions
REFERENCES

Introduction
While TASSEL has changed considerably since its initial public release in 2001, its primary function continues to be providing tools to investigate the relationship between phenotypes and genotypes. TASSEL has functionality for association study, evaluating evolutionary relationships, analysis of linkage disequilibrium, principal component analysis, cluster analysis, missing data imputation, and data visualization. TASSEL development has been led by a group focused on maize geneti
39. User Manual for TASSEL (Trait Analysis by aSSociation, Evolution and Linkage), Version 5.0
The Buckler Lab at Cornell University
July 17, 2014
www.maizegenetics.net/tassel

Disclaimer: While the Buckler Lab at Cornell University has performed extensive testing and results are, in general, reliable, correct, or appropriate, results are not guaranteed for any specific set of data. It is strongly recommended that users validate TASSEL results with other software.

Further help: Additional help is available beyond this document. Users are welcome to report bugs and request new features through the TASSEL website. Questions are also welcome to our current team members. For quicker and more precise answers, please address your questions to the most pertinent person:
• Tassel User Group: http://groups.google.com/group/tassel (recommended), tassel@googlegroups.com
• General Information: Ed Buckler (Project leader), esb33@cornell.edu
• Data Import, Pipeline: Terry Casstevens, tmc46@cornell.edu
• Statistical Analysis: Peter Bradbury, pjb39@cornell.edu

Contributors: Ed Buckler, Terry Casstevens, Peter Bradbury, Zhiwu Zhang, Dallas Kroon, Jeff Glaubitz, Kelly Swarts, Jason Wallace, Fei Lu, Alberto Romero, Cinta Romay, Eli Rodgers-Melnick, Alexander Lipka, Sara Miller, James Harriman, Yogesh Ramdoss, Michael Oak, Karin Holmberg, Natalie Stevens, and Yang Zhang

Citations
Overall Package: Bradbury PJ, Zhang Z, Kroon DE, Casstevens TM, Ramdoss Y, Buck
40. [Screenshots: TASSEL 3.0.39 main window with mdp_genotype, mdp_population_structure, mdp_traits, and mdp_kinship in the data tree, and a numerical genotype table for mdp_genotype (chromosome 1, positions 157104 to 148907116) with 281 rows and 719922 elements]
41. [Screenshot: File Loader dialog listing the file types to load (including Hapmap, HDF5, VCF, Plink, Projection Alignment, Phylip, FASTA, Trait data (covariates or factors), Genetic Map, Table Report, and Make Best Guess) next to a file chooser showing the TASSELTutorialData files]

3.1.1 Hapmap
Hapmap is a text-based file format for storing sequence data. All the information for a series of SNPs, as well as the germplasm lines, is stored in one file. The first row contains the header labels, and each additional row contains all the information associated with a single SNP. The first 11 columns describe attributes of the SNP, while each of the following columns describes the SNP value for a single germplasm line. The first 12 columns of the first row should look like this, where Line 1 is the beginning of the germplasm line names:

While all 11 header columns are required, not all 11 of the columns need to be filled in for TASSEL to correctly interpret the data. The only required fields are chrom
42. ang H.M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet 42, 348-54 (2010).
Thornsberry J.M. et al. Dwarf8 polymorphisms associate with variation in flowering time. Nature Genetics 28, 286-289 (2001).
Pritchard J.K., Stephens M., Rosenberg N.A. & Donnelly P. Association mapping in structured populations. American Journal of Human Genetics 67, 170-181 (2000).
Zhao K. et al. An Arabidopsis example of association mapping in structured samples. PLoS Genet 3, e4 (2007).
Yu J.M. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nature Genetics 38, 203-208 (2006).
Ware D. et al. Gramene: a resource for comparative grass genomics. Nucleic Acids Research 30, 103-105 (2002).
Ware D.H. et al. Gramene, a tool for grass genomics. Plant Physiology 130, 1606-1613 (2002).
Jaiswal P. et al. Gramene: development and integration of trait and gene ontologies for rice. Comparative and Functional Genomics 3, 132-136 (2002).
Yamazaki Y. & Jaiswal P. Biological ontologies in rice databases. An introduction to the activities in Gramene and Oryzabase. Plant and Cell Physiology 46, 63-68 (2005).
Zhao W. et al. Panzea: a database and resource for molecular and functional diversity in the maize genome. Nucleic Acids Research 34
43. antitative trait locus dissection. Plant J 44, 1054-64 (2005).
Anderson M.J. & Ter Braak C.J.F. Permutation tests for multi-factorial analysis of variance. Journal of Statistical Computation and Simulation 73, 85-113 (2003).
44. any other alleles. The converted genotypes are saved in a new numerical data set.

3.4.1.2 Separate Alleles
This function assigns an indicator (1 for present and 0 for absent) for each allele. The converted genotypes are saved in a new numerical data set.

3.4.2 Transform and/or Standardize Data
The Trans dialog box is the default selection, as shown below. In the Column list, select the column(s) you wish to transform. Then select the type of transformation you wish to execute. Selecting the Standardize checkbox will transform data by subtracting the column mean from the value of the trait and then dividing by the column's standard deviation. Clicking on the Create Data set button will result in the placement of a dataset containing only the selected columns in the Data Tree.

[Screenshot: Trans tab listing the columns EarHT, dpoll, and EarDia with their percent missing data, and the options Raise to Power, Take Log (Base 10), and Standardize]

3.4.3 Impute Phenotype
The k-nearest-neighbor algorithm is used to impute missing phenotype data. If data is missing for a taxon for one of the traits, the algorithm finds other taxa (neighbors) that are most like it for the non-missing traits. It uses the average of the neighbors to impute the missing data. Click on the Impute tab to display the following:

[Screenshot: Impute tab listing EarHT, dpoll, and EarDia, with the options Manhattan Distance or Euclid Distance, Unweighted Average or Weighted Average, and Number of Neighbors (K)]
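The k-nearest-neighbor idea described above can be sketched in Python with NumPy. This is a simplified illustration, not TASSEL's implementation: the function name knn_impute is hypothetical, the distance is the mean absolute (Manhattan-style) difference over traits observed in both taxa, and neighbors are averaged without weights. The example values are made up.

```python
import numpy as np


def knn_impute(data, k=3):
    """Fill NaNs with the unweighted mean of the k nearest taxa.

    Rows are taxa and columns are traits. A neighbor must have a value
    for the missing trait and share at least one observed trait.
    """
    data = np.asarray(data, dtype=float)
    imputed = data.copy()
    for i, j in zip(*np.where(np.isnan(data))):
        candidates = []
        for other in range(data.shape[0]):
            if other == i or np.isnan(data[other, j]):
                continue
            shared = ~np.isnan(data[i]) & ~np.isnan(data[other])
            if not shared.any():
                continue
            dist = np.abs(data[i, shared] - data[other, shared]).mean()
            candidates.append((dist, data[other, j]))
        if candidates:
            candidates.sort(key=lambda pair: pair[0])
            imputed[i, j] = np.mean([value for _, value in candidates[:k]])
    return imputed


# Hypothetical trait table: 4 taxa x 3 traits, one missing value.
traits = np.array([
    [1.0, 10.0, np.nan],
    [1.1, 10.5, 5.0],
    [0.9, 9.5, 7.0],
    [5.0, 30.0, 20.0],
])
filled = knn_impute(traits, k=2)
print(filled)
```

In practice the traits would usually be standardized first so that a trait measured on a large scale does not dominate the distance.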
45. ata. The individuals must be at least partially inbred, because the method relies on finding inbred segments to identify haplotypes. It does not use the parent genotypes directly, but including the parents may be useful for interpreting the results. The algorithms used for imputation analyze one chromosome and family at a time. As a result, a pedigree file must be supplied that indicates which entries belong to which family. Also, input genotypes must contain data for only a single chromosome. If the genotype file contains multiple chromosomes, the chromosomes can be separated using the TASSEL separate command.

Pedigree File Format
The only file format specific to FSFHap is the pedigree file. The taxa names must exactly match names in the genotype data. If the genotype data contains taxa not included in the pedigree file, only individuals listed in the pedigree file will be analyzed. The input genotypes can be in any of the formats accepted by TASSEL. The pedigree file must contain the names of the individual taxa to be analyzed, the family to which each belongs, the parents, the parent contributions, and the average inbreeding coefficient. The first row in the file must be column headers. The values in the columns should be tab delimited and are expected to be in the following order: family, taxon, parent1, parent2, parent1Contribution, parent2Contribution, F. The F value is not required, but all other columns are. Example: family taxonName parent pare
46. ata tree.

2.1.5 Set Preferences
Currently there is only one preference: whether to retain rare alleles. This is irrelevant for nucleotide data (A, C, G, T, N) because, at that number of states, no data is lost. With other types of data, the number of allele states could potentially exceed the maximum of 14 per site. If you select Retain Rare Alleles, the lower frequency allele values will be consolidated into a rare (Z) state. Otherwise, those lower frequency alleles are changed to unknown (N).

[Screenshot: Preferences dialog with the Retain Rare Alleles checkbox under Alignment Preferences]

3 Data Menu
The Data Menu has options to import and export data sets, as well as other data manipulation functions.

3.1 Load
Load provides options to import files for genotypes, phenotypes, population structure, kinship matrices, etc. The tutorial data can be downloaded from the TASSEL website at this link: http://www.maizegenetics.net/tassel/docs/TASSELTutorialData3.zip

To use the data, the zip file must be uncompressed and saved on your local machine. These tutorial files will load correctly with the Make Best Guess option. Multiple files can be imported simultaneously by highlighting them first (holding the Shift or Control key while clicking) and then clicking the Open button.

[Screenshot: File Loader dialog (Choose File Type to Load) with the file chooser open on the TASSELTutorialData folder]
47. axa for that marker. If a user prefers to use a different method of imputation, then the missing genotypes must be imputed before importing the data into TASSEL. GEBVs will be calculated for all taxa in the dataset, including any lines that have missing phenotype data. A typical use of genomic selection is to predict GEBVs for a set of unphenotyped lines based on the performance of a training set. To do that, a dataset containing both the genotypes to be predicted and the genotypes of the training set can be joined, using a union join, with a dataset containing the phenotypes of the training set. All taxa in the phenotype set should have genotypes. If an individual without genotype data is included, all the marker data for that individual will be imputed, which is not generally a useful thing to do.

6.8 Geno Summary
[Screenshots: Geno Summary dialog with the Overview, Site Summary, and Taxa Summary options, and the resulting Genotype Summary node in the data tree containing the OverallSummary, AlleleSummary, SiteSummary, and TaxaSummary tables for mdp_genotype. Table Title: Overall Summary; Number of columns: 2; Number of rows: 14; Number of elements: 28]

Overall Summary of mdp_genotype:
• Number of Taxa
• Number of Sites
• Sites x Taxa
• Number Not Missing
• Proportion Not Missing
• Number Mis
48. components, or that the statistic in question was not calculated. For example, marker F, p, and R² are not calculated when no marker is included in the model.

Why should I exclude one column of the population structure?
For some methods of calculating population structure, such as the software STRUCTURE, the population proportions sum to one. This produces linear dependence between the population covariates. While the algorithm used by GLM tolerates that dependency, MLM will fail because the design matrix will not be invertible. Excluding one column eliminates linear dependence between columns. Using PC axes to represent population structure does not result in linear dependency, because all PC columns are guaranteed to be independent.

Can kinship replace population structure sometimes?
For some traits and populations, the K-only model may be as good as or better than the Q+K model. For others, Q+K may be superior. The Q-only model is not as effective for controlling population structure as the alternatives. Unfortunately, no general guidelines exist for predicting which model will perform best. As a result, investigators may wish to fit all three models and compare the results. If eliminating false positives is very important, then it may make sense to accept the most conservative model. However, if the objective is to identify candidates for further study and the cost of following up on a false lead is l
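The linear dependence discussed in the first answer can be demonstrated numerically. The sketch below assumes a design matrix made of an intercept column plus the Q columns; the proportions are hypothetical values chosen so that each row sums to one, and the layout of the design matrix is an assumption for illustration, not a description of TASSEL internals.

```python
import numpy as np

# Hypothetical population proportions (Q1, Q2, Q3); each row sums to one.
Q = np.array([
    [0.10, 0.70, 0.20],
    [0.30, 0.30, 0.40],
    [0.60, 0.10, 0.30],
    [0.25, 0.50, 0.25],
])
intercept = np.ones((Q.shape[0], 1))

# Intercept plus all three Q columns: Q1 + Q2 + Q3 equals the intercept,
# so the columns are linearly dependent.
full = np.hstack([intercept, Q])

# Dropping Q3 removes the dependency.
reduced = np.hstack([intercept, Q[:, :2]])

full_rank = np.linalg.matrix_rank(full)
reduced_rank = np.linalg.matrix_rank(reduced)
print(f"full: rank {full_rank} with {full.shape[1]} columns")
print(f"reduced: rank {reduced_rank} with {reduced.shape[1]} columns")
```

A rank lower than the number of columns means the design matrix cannot be inverted in the usual way, which is why one Q column is excluded before running MLM.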
49. cs and genomics, and for these reasons the software has design and computational optimizations that account for the biology found in many plants and breeding situations. Compared to human genetics, many crops are highly diverse, both at the nucleotide level and in structural variation (10-50X greater than humans); inbreeding is common; large families are common; and whole genome prediction is being applied daily to real-world problems. These biological differences lead to some different optimizations that are of use to many biological systems outside of crops.

One of the design elements driving TASSEL development has been the need to analyze ever larger sets of data. TASSEL 5 has at its heart many design optimizations for big data, including:
• Bit-level encoding of nucleotides, so genetic distance and linkage disequilibrium estimates can be made very quickly (20-50X speed increases).
• Extensive use of the HDF5 file format, a robust format for matrix-style data that is used by many climate modelers.
• Tools for extracting and calling SNPs from extensive Genotyping by Sequencing data (tested for 60,000 samples by over 2.5 million SNPs and 96 million sequence alleles).
• Projection and imputation procedures that are optimized for the large families in crops. Some of these optimizations permit memory and computational improvements of >100,000-fold.
• Mixed models based on DNA relationships have come to dominate GWP (Meuwissen et al. 2001) and
50. cted Type to buttons.

Important: Once a numerical data set has been joined with genotypes, it can no longer be modified using the trait filter function.

[Figure: Filter Traits dialog (Modify Trait Properties) with the buttons Exclude Selected, Include Selected, Exclude All, Include All, Change Selected Type to Data, Change Selected Type to Covariate, Change Selected Type to Marker, OK and Cancel]

6 Analysis Menu

6.1 Diversity

This executes a basic diversity analysis. Average pairwise divergence (π), segregating sites, and θ estimates (4Nμ) can be calculated, as well as sliding windows of diversity. To run a diversity analysis, click on a raw sequence alignment and then select Analysis > Diversity.

[Figure: Diversity Surveys dialog, Start Base 0, End Base 2465]

In the resulting Diversity Surveys dialog box, the various site classes available for analysis are listed on the left. If the sequence has no annotation, then only the Overall and Indels options will be active. A sliding window of diversity can also be calculated across the region. To produce a sliding window, check the box next to Sliding Window and then enter the desired step size and size of the sliding window. Results can be plotted using Results > Chart or viewed in a table via Results > Table.

6.2 Linkage
51. e Library ImputatioN. The generalized approach, FILLIN, imputes missing genotypes in two steps: 1) haplotype generation (FILLINFindHaplotypesPlugin) and 2) imputation of the resulting haplotypes back onto the target samples (FILLINImputationPlugin).

Haplotypes are generated by collapsing low-coverage but inbred segments that share identity by state, to an optionally user-supplied threshold value, by site window (default 8k); this is performed by the first plugin, FILLINFindHaplotypesPlugin. Because short IBD segments may be replicated widely within a species, even between diverse individuals, we recommend supplying all the information available within a species for this step.

The second plugin, FILLINImputationPlugin, uses these haplotypes to impute missing genotypes in target individuals. It does so in multiple steps: it first looks for haplotypes that match the minor alleles to a threshold within the whole site window (1a in schematic below). If this fails, it looks for two haplotypes to explain the site window and, assuming this represents a recombination break point between two inbred haplotypes, uses a Viterbi HMM algorithm to model the recombination breakpoints (2a). If two haplotypes cannot be found to explain the whole site window, the algorithm next searches for haplotypes to explain a smaller focus window within the site window, centered on 64 sites at a time and searching to the right and left until enough informative minor alleles are found
52. e shown by clicking on the icon next to the P value check box.

7.4 LD Plot

Displays the results from a linkage disequilibrium analysis. After selecting the desired result from the Data Tree, choose Results > LD Plot. The graph that is generated displays LD between pairs of sites, calculated with the analysis step. The black diagonal represents LD between each site and itself. The default setting graphs r² in the upper right and p-values in the lower left. This default can be modified by clicking on the buttons in the lower left. The left side of the graph contains a text description with the Chromosome and the Site name. At the bottom of the graph is a display of the position of each site along the chromosome. This display can be hidden by deselecting the Schematic check box. Legends that describe the color scheme appear on the right-hand side of the graph.

The number of sites displayed can be selected by entering a number in the white box in the upper right corner or by moving the sliding bar next to it. To move through the graph, use the sliding bars on the right and bottom. The red box in the small white window in the upper left corner will show what portion of the graph is displayed. To move only around the diagonal, select the Lock Y Axis to X check box (recommended when visualizing a LD by sliding window analysis).

[Figure: LD Plot window with the viewable size selector]
53. [Figure: Data Tree with the Genotype Summary results for mdp_genotype (overall, site and taxa summaries) and the Site Summary table]

Table Title: Site Summary
Number of columns: 35
Number of rows: 3093
Number of elements: 108255
Site Summary of mdp_genotype

Site Number: Index of site
Site Name: Name of site
Chromosome: Chromosome
Physical Position: Physical position on chromosome
Number of Taxa: Number of taxa for site (same for all)
Major Allele: The major allele of site
Major Allele Gametes: Number of times major allele occurs for site (up to twice number of taxa)
Major Allele Proportion: Major Allele Gametes / (Number of Taxa * 2). (Number of Taxa * 2) is the Number of Gametes for a Site
Major Allele Frequency: Major Allele Gametes / ((Number of Taxa * 2) - Gametes Missing)
Minor Allele: The minor allele of site
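The difference between Major Allele Proportion and Major Allele Frequency in the column definitions above is only the denominator. The short sketch below illustrates that arithmetic under the assumption that the two definitions read as "gametes divided by all gametes" and "gametes divided by non-missing gametes"; the function name and the example counts are hypothetical, not TASSEL output.

```python
def major_allele_stats(major_allele_gametes, number_of_taxa, gametes_missing):
    """Return (proportion, frequency) for one site.

    proportion: major allele gametes over all gametes (2 per taxon)
    frequency:  major allele gametes over the gametes that were actually scored
    """
    number_of_gametes = number_of_taxa * 2
    proportion = major_allele_gametes / number_of_gametes
    frequency = major_allele_gametes / (number_of_gametes - gametes_missing)
    return proportion, frequency

# Hypothetical site: 281 taxa, 400 major allele gametes, 12 gametes missing.
proportion, frequency = major_allele_stats(400, 281, 12)
print(round(proportion, 4), round(frequency, 4))
```

When no gametes are missing the two values coincide; missing data makes the frequency larger than the proportion.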
54. ecome involved in a bioinformatics software development project.

When I click on the most current version of TASSEL web start, a previous version appears. What should I do? The previous version of TASSEL web start was cached on your machine. To replace it with the most current version, click the Start button in Windows followed by Run. Type "javaws" and then click OK. In the window that opens, keep the most current version of TASSEL and delete the rest.

What should I substitute for missing values in TASSEL? For numerical data in version 3 format, use "NA" or "NaN". For numerical data in version 2 format, use -999 for missing values. For SNP data, use N. Kinship does not allow missing values.

Is it possible to change data names in the Data Tree? Yes. Click on the desired data name in the Data Tree, wait for one second, and then click it again, or immediately hit the F2 key. Rename the data set and then hit Enter to save the change.

How can I create a TASSEL icon on the desktop? Click Start on Microsoft Windows and select Control Panel, then double-click Java to show the Java Control Panel. In the Temporary Internet Files section, click the View button to show the Java Cache Viewer. Move the mouse over the TASSEL application, click the right button, and select Install Shortcuts.

Why do I get empty squares in MLM association analysis? The empty square means null information. The major reasons include non-convergence in the estimation of variance
55. epresent population structure and cryptic relationships, optimizing compression level, and performing GWAS. The command line version of TASSEL, called the Pipeline, provides users the ability to program tasks using a script instead of the graphical user interface (GUI). This feature allows researchers to define tasks using a few lines of code and provides the ability to use TASSEL as part of an analysis pipeline or to perform simulation studies. We are also building a larger community of scientist developers who are adding functionality to this platform and working together to improve the system. So throughout this user manual, you will see how to do most things three different ways: with the GUI, with the pipeline, and with the API (application programming interface).

TASSEL is written in Java, thereby enabling its use with virtually any operating system. It can be installed using Java Web Start technology by simply clicking on a link at www.maizegenetics.net/tassel. A stand-alone version of TASSEL can also be downloaded to use in pipeline mode or in any situation where the user wishes to start the software from a command line.

Getting Started

A quick way to get started using TASSEL is to load the tutorial data and try performing analyses. However, because some of the necessary steps may not be intuitive, we recommend that new users follow the tutorial at the end of this manual.

The objective of this section is to provide information necessary to install a
56. es and output it to a new file ready for further analysis. This sort is not done automatically at load time because the computational cost of sorting large files can be very large. We feel it's better for users to know what they're getting into instead of being surprised by it. There is currently only support for sorting Hapmap and VCF files.

To sort a genotype file from the GUI, just select Data > Sort Genotype File and fill in the appropriate parameters in the popup dialog. To sort a file from the command line, use the following command:

run_pipeline.pl -SortGenotypeFilePlugin -inputFile <filename> -outputFile <filename> -fileType <Hapmap or VCF>

The fileType flag is optional and is only needed if the input file's extension doesn't match a known file extension (hmp.txt, vcf, etc.).

3.4 Transform

This suite of functions allows multiple data manipulations on genotype and phenotype (numerical) data. When a genotype data set is selected, the data are transformed to numbers. When a numerical data set is selected, mathematical transformation, data imputation, and principal component analysis (PCA) can be performed. The Transform column tags will be displayed in a Data dialog box with three tabs: Trans, Impute, and PCA.

3.4.1 Genotype Numericalization

[Figure: Genotype Numericalization dialog with the options Collapse Non Major Alleles and Separate Alleles, and a Close button]

3.4.1.1 Collapse Non Major Alleles

This function assigns 1 to the major allele and 0 to
57. false: Impute the donor file itself. Default: false
-nV <true | false>: Suppress system out. Default: false

Options for calculating accuracy:
-accuracy <true | false>: Masks input file before imputation and calculates accuracy based on masked genotypes. Default: false
-propSitesMask <Proportion of genotypes to mask if no depth>: Proportion of genotypes to mask for accuracy calculation if depth not available. Default: 0.01
-depthMask <Depth of genotypes to mask>: Depth of genotypes to mask for accuracy calculation if depth information available. Default: 9
-propDepthSitesMask <Proportion of depth genotypes to mask>: Proportion of genotypes of given depth to mask for accuracy calculation if depth available. Default: 0.2

5 Filter Menu

5.1 Sites

The genotype table can be filtered in several ways. For example, monomorphic sites can be eliminated and regions of sequence can be eliminated.

[Figure: Filter Alignment dialog with the fields Minimum Count (10 out of 281 sequences), Minimum Frequency (0.1), Maximum Frequency (1.0), Position Type (Position index), Start Position (0), End Position (3092 of 3092 sites), the check boxes Remove minor SNP states and Generate haplotypes via sliding window, Haplotype Length, and the buttons Filter, Select Chromosomes and Cancel]

Minimum Count: the minimum number of taxa in which the site must have been scored to be included in the filtered data set. GAP or missing data do not count.
Minimum Frequency: the
58. flag for FILLINImputationPlugin. If imputing maize, a donor file of haplotypes from 40k taxa can be found on the Panzea website (http://www.panzea.org/lit/data_sets.html).

FILLIN can be run either within the TASSEL GUI or through the command line. The options are the same for both. A typical command sequence for running FILLIN through the command line is as follows (replace items in <> with actual parameter values):

run_pipeline.pl -FILLINFindHaplotypesPlugin -hmp <genotypeFilename> -o <outDonorDir>
run_pipeline.pl -FILLINImputationPlugin -hmp <genotypeFilename> -d <donorDir> -o <outFile.hmp.txt.gz>

To run FILLIN from the GUI, go to Impute > FILLINFindHaplotypesPlugin or FILLINImputationPlugin.

Options for FILLINFindHaplotypesPlugin:
-hmp <Target file>: Input genotypes to generate haplotypes from. Usually best to use all available samples from a species. Accepts all file types supported by TASSEL 5. (required)
-o <Donor dir/file basename>: Output file directory name, or new directory path. Directory will be created if it doesn't exist. Outfiles will be placed in the directory and given the same name, appended with the substring .gc#s#.hmp.txt to denote chromosome and section. (required)
-mxDiv <Max divergence from founder>: Maximum genetic divergence from founder haplotype to cluster sequences. Default: 0.01
-mxHet <Max heterozygosity of output haplotypes>: Maximum hetero
59. genetics.net/tassel/docs/ExecutingTassel.pdf

1.2 Open Source Code

Open source code for TASSEL is available at https://bitbucket.org/tasseladmin/tassel-5-source. The package uses a number of other libraries that are included in the TASSEL distribution. These include a modified version of the PAL library (http://www.cebl.auckland.ac.nz/pal-project/), the COLT library (http://dsd.lbl.gov/~hoschek/colt/), jFreeChart (http://www.jfree.org/jfreechart/), Guava: Google Core Libraries (https://code.google.com/p/guava-libraries/), JUnit (http://junit.org), Archaeopteryx (https://sites.google.com/site/cmzmasek/home/software/archaeopteryx), and BioJava (http://www.biojava.org).

1.3 Software Development Tools

jProfiler (http://www.ej-technologies.com/products/jprofiler/overview.html)
install4j (http://www.ej-technologies.com/products/install4j/overview.html)
NetBeans IDE (https://netbeans.org)
Eclipse (http://www.eclipse.org)
IntelliJ (http://www.jetbrains.com/idea/)
Structure101 (http://structure101.com)
Team Viewer (http://www.teamviewer.com)
Bitbucket (https://bitbucket.org)
sourceforge (http://sourceforge.net)
JIRA (https://www.atlassian.com/software/jira)
Tower (http://www.git-tower.com)

1.4 Graphical Interface

TASSEL is organized into five main panels: 1) At the top, menus control functions. 2) The Data Tree at the top left organizes data sets and results. Data set(s) displayed in the Data Tree must first be selected before a desired function
60. groups Ling Regression 2 Y Axes groups vs 2LnLk groups vs Var genetic and Var error Var genetic 125 150 75 100 125 150 175 200 225 250 groups groups Var genetic Var error The strongest associated SNP is at 193565357 bp on chromosome 3 The P value is 1 3027x10 The threshold is 3 2331x10 at significant level of 1 after Bonferroni multiple test correction 0 01 3093 The association was not significant As illustrated below the output labeled GLM Allele Estimates shows the marker effects assigned t genotypes for each SNP The GLM is also the same For example the first SNP at 157104 bp on chromosome 1 had three genotypes AA CC and AC coded as A C and M based on the IUPAC code see Appendix Nucleotide Codes B TASSEL Trait Analysis by aSSociation Evolution and Linkage 3 0 39 File Tools Help GDPC eo s PZB00859 1 3 64912 197 Synonymizer dpoll PzB00859 1 1 157104 A ss Ji Result dpol PZBO0859 1 0 1 383 Diversity 5 5 z l LD Gee Association bed MLM statistics for Filtered n MLM compression for Filtere MLM statistics for Filtered n amp MLM compression for Filtere MLM statistics for Filtered n MLM effects for Filtered m MLM compression for Filtere k i SEITE 58 ESSERE EIE JEJEIEISISISIESE co Table Title MLM effects 4 Number of columns 7 an ut tB 2E oo B bJ IERE a 5 3S S SS BERE LJ
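The Bonferroni threshold quoted in the excerpt above (significance level of 1% divided by the 3093 SNPs tested, 0.01/3093) can be reproduced with a one-line calculation. This is a small illustrative sketch; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, number_of_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / number_of_tests

# Values taken from the text: 1% significance level, 3093 SNPs.
threshold = bonferroni_threshold(0.01, 3093)
print(f"{threshold:.4e}")
```

A marker is declared significant only if its P value falls below this per-test threshold.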
61. he degree of similarity between names, using the name from the first set which is most similar to that in the second data set.

When using the Synonymizer, keep in mind that the order of selection matters. Always select the data set with the name you wish to use (the real name) first, and then, while holding down the CTRL key, click on the second data set with the taxa names you wish to change (the synonym). Then click on the Synonymizer button. A synonym data set will be placed on the Data Tree panel under Synonyms. Each name in the data set selected second is now listed in the TaxaSynonym column. Next to this column is a TaxaRealName column listing the highest scoring match derived from the real name data set. The MatchScore column gives an indication of the amount of similarity between the two names, where 0 is no similarity and 1.0 is identity.

Caution: Before the synonyms are applied, we strongly encourage the user to check the match score, especially for those taxa with low match scores. To do that, the user selects the synonym file and clicks the Synonymizer button. The incorrect matches (usually the ones with the lowest match scores) can be rejected at this point. Sorting on the match score column first makes this a fairly easy process.

In the event that some of the taxa are not interpreted correctly, matches can be modified manually. Select the taxa you wish to modify on the left side and then choose a replacement taxa from the rig
62. he pedigree file. Value: filename
-w or -windowSize: the number of SNPs to examine for each LD cluster. Value: integer (default = 50)
-r or -minR: minimum R used to filter SNPs on LD. Value: number between 0 and 1 (default = 0.2; use 0 for no LD filter)
-m or -maxMissing: maximum proportion of missing data allowed for a SNP. Value: number between 0 and 1 (default = 0.9)
-f or -minMaf: minimum minor allele frequency used to filter SNPs. If negative, filters on expected segregation ratio from parental contribution. Value: number between -1 and 1 (default = -1)
-b or -bc1: use BC-specific filter. Value: true or false (default = true)
-n or -bcn: use multiple backcross specific filter. Value: true or false (default = false)
-logfile: the name of a file to which all logged messages will be printed. Value: filename

Options not taking a parameter value:
-cluster: use the cluster algorithm (minMaf defaults to 0.05)
-subpops: filter sites for heterozygosity in subpopulations
-nohets: delete het calls from original data before imputing
-windowld: use the window LD algorithm for finding parent haplotypes

The cluster, subpops, nohets, and windowld options do not take parameters but only act as flags that include certain features in the analysis. Of those, cluster and windowld are the most useful. When the cluster option is used, a different algorithm is used that does a better job of handling residual hete
63. hip data to define the relationship between individuals. The kinship matrix times a parameter equals the covariance matrix between individuals. Here we use the kinship file from the tutorial data set to fit the following statistical model:

Flowering time = Population structure + Marker effect + Individuals + residual

Individuals and the residual are fit as random effects. The other terms are treated as fixed effects. With respect to the marker effect, we will demonstrate the analysis using two sets of markers. One is the dwarf8 gene sequence used in the GLM tutorial. The other is a set of 3093 SNPs spread across the maize genome.

For the dwarf8 gene sequence, use the joint data set created by following the tutorial for GLM. Solve the mixed linear model by highlighting the joint data set and the kinship data, then clicking the MLM button in Analysis mode.

[Figure: TASSEL main window showing the Data Tree with the sequence, genotype, numerical and matrix data sets]
64. hip matrix located at row i and column j. Missing values are not allowed for the kinship matrix. Important note: The current format is different from the format used in TASSEL version 2.0 or lower.

3.1.10 Table Report

Data can be imported as tab-delimited text files. The first row of the file will be interpreted as column labels and the remaining rows as rows in the table.

3.1.11 TOPM (Tags on Physical Map)

3.2 Export

Options are provided to export sequence data (Hapmap, Plink, Phylip Sequential or Interleaved). Phenotypes and covariate data are exported as numerical trait data. Table Reports are exported as a tab-delimited table. For numerical data, the function of Export is similar to the Table function in Results mode.

[Figure: Export dialog, Choose File Type to Export: Write Hapmap, Write HDF5, Write VCF, Write Plink, Write Phylip (Sequential), Write Phylip (Interleaved), Write Tab Delimited; buttons OK and Cancel]

3.3 Sort Genotype File

TASSEL 5 has strict requirements for the sites in a genotype file. Each site must be unique (as defined by its locus/chromosome, position, and name), and the sites must be in order in the file. Genotype files produced by other programs, and also by earlier versions of TASSEL, often do not meet this second requirement and throw an error when TASSEL tries to load them. It can be difficult to recreate TASSEL's internal sort order by hand, so this plugin allows the user to sort an input genotype file according to TASSEL's rul
65. ht side. Click the arrow button to substitute the taxa. Taxa with no synonym can be identified by selecting them and then clicking No Synonym. Click OK to save the changes.

[Figure: TASSEL window with the Synonymizer result selected in the Data Tree. Table Title: Taxa Synonym Table; Number of columns: 4; Number of rows: 301; Number of elements: 1204; 301 unique matches, 73 unmatched. Columns: TaxaSynonym, TaxaRealName, RefIDNum, MatchScore]
66. k will be treated as a delimiter as well. As a result, embedded blanks in names will cause data to be imported incorrectly. We suggest representing missing values using "NA" or "NaN". However, any text value will be interpreted as missing data.

There are several formats for numerical data to fit the requirements for modeling. Trait data (dependent variables) can be imported by starting the first line with <Trait> and following that with the trait names. Additional classifiers can also be included in subsequent header rows by starting the row with <Header name=xxx>, followed by a name for each column of data. For instance, to define environments, start the second header row with <Header name=env>. Comment lines may be inserted at the beginning of the file. A comment line begins with the character #.

3.1.8.1 Trait format

This format does not require users to provide information on the number of rows and columns. The file starts with the keyword <Trait> followed by the names of the columns. The column for line should not be labeled.

Example 1 (simple list of trait values):

<Trait> EarHT dpoll EarDia
811 59.5 NA NA
33-16 64.75 64.5 NA
...

Example 2 (traits data collected in multiple environments):

<Trait> EarHT PlantHT EarHT PlantHT
<Header name=env> Loc1 Loc1 Loc2 Loc2
...
67. ler ES (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633-2635.

Genotyping by Sequencing: Glaubitz JC, Casstevens TM, Lu F, Harriman J, Elshire RJ, Sun Q, Buckler ES (2014) TASSEL-GBS: A High Capacity Genotyping by Sequencing Analysis Pipeline. PLoS ONE 9(2): e90346.

Mixed Model GWAS: Zhang Z, Ersoz E, Lai C-Q, Todhunter RJ, Tiwari HK, Gore MA, Bradbury PJ, Yu J, Arnett DK, Ordovas JM, Buckler ES (2010) Mixed linear model approach adapted for genome-wide association studies. Nature Genetics 42:355-360.

The TASSEL project is supported by the National Science Foundation and the USDA-ARS.

Reference Links
Main Web Site: http://www.maizegenetics.net/tassel
Open source code: https://bitbucket.org/tasseladmin/tassel-5-source
Wiki: https://bitbucket.org/tasseladmin/tassel-5-source/wiki

Table of Contents
Introduction
  Getting Started
  Executing TASSEL
  Open Source Code
  Software Development Tools
  Graphical Interface
  Pipeline Command Line Interface
  GBS Pipeline
File Menu
  Save Data Tree
  Open Data Tree
  Save Data Tree As
  Open Data Tree
  Set Preferences
Data Menu
  Load
    Hapmap
    HDF5 (Hierarchical Data Format version 5)
    VCF (Variant Call Format)
    Plink
    Projection Alignment
    Phylip
    FASTA
    Numerical Data
      Trait format
      Covariate Format
      Marker Values as Numerical Co-variates
    Square Numerical Matrix
    Table Report
    TOPM (Tags on Physical Map)
  Export
68. mum of several hundred SNPs spread over the whole genome is recommended. This ad hoc rescaling method was implemented in an earlier version of TASSEL in order to provide a reasonable estimate of additive genetic variance, but it tends to overestimate that value. Rescaling does not affect its use for correcting for population structure. It only affects the estimate of additive genetic variance and, consequently, heritability.

To provide a better estimate of additive genetic variance, an alternative method can be used by selecting "scaled IBS". This method, from Endelman and Jannink (2012), codes genotypes as 2, 1, or 0, equal to the count of one of the alleles at that locus. It then replaces missing genotype values with the average genotypic score at that locus before estimating the relationship matrix. Other methods of imputing genotypes prior to calculating Kinship may provide a better result. For instance, rather than using this default treatment of missing values, using the numerical genotype method followed by imputation (described in section 3.3) before running Kinship is a reasonable alternative. When using numerical genotypes, Kinship always applies the scaled IBS method.

Users may also load their own kinship data using Data > Load. Kinship matrices can be calculated using the SPAGeDi software package (http://www.ulb.ac.be/sciences/ecoevol/spagedi.html). Comparisons of methods for calculating kinship can be found in the literature (e.g., Stich et al. 2008
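The missing-value treatment described for the scaled IBS method (genotypes coded 2, 1, or 0, with missing values replaced by the average score at that locus) can be sketched as follows. This illustrates only the imputation step, not the relationship-matrix calculation, and it is not TASSEL code: the function name, the use of `None` for missing calls, and the example matrix are assumptions made for the illustration.

```python
def impute_with_site_mean(genotypes):
    """Replace missing calls (None) with the mean score of the observed calls at that site.

    genotypes: list of rows (one per taxon), each a list of 0/1/2 or None per site.
    Assumes every site has at least one observed call.
    """
    imputed = [row[:] for row in genotypes]
    for site in range(len(genotypes[0])):
        observed = [row[site] for row in genotypes if row[site] is not None]
        site_mean = sum(observed) / len(observed)
        for row in imputed:
            if row[site] is None:
                row[site] = site_mean
    return imputed

# Hypothetical data: 4 taxa by 3 sites, allele counts with two missing calls.
example = [
    [2, 0, 1],
    [2, None, 0],
    [0, 2, None],
    [0, 1, 2],
]
filled = impute_with_site_mean(example)
print(filled)
```

Because each missing call is set to the site mean, imputed entries contribute nothing once the site is centered on its mean, which is one reason other imputation methods may give a better result.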
69. n. This is because the algorithm solves the model once for each of a series of compression levels in order to determine the optimal one. All MLM analyses create two output tables, model statistics and model effects. If compression is used, the analysis creates three tables.

[Table: MLM statistics for Filtered mdp_traits + Filtered mdp_population_structure + mdp_genotype, with the columns Trait, Marker, Locus, Site, df, F, p, errordf, markerR2, Genetic Var, Residual Var and -2LnLikelihood]

The statistics table shows the results of the tests for each trait. The first line is for the model with no markers. Following that is a single line for each marker tested. The columns labeled "df", "F", and "p" are the degrees of freedom, F, and p-value from the F distribution for the test of the marker. The column "errordf" is the degrees of freedom used for the denominator of the F test. The column labeled "markerR2" is the R2 for the marker, calculated based on a f
70. nd: run_pipeline.pl -fork1 -h group1.hmp.txt -fork2 -h group2.hmp.txt -combine3 -input1 -input2 -union -export group1_group2_union.hmp.txt -runfork1 -runfork2

This joins multiple data sets by a union of their taxa. Missing data will be inserted if taxa are missing from one data set. Select multiple data sets using the CTRL key in conjunction with mouse clicks, and then click on the union button to join the data sets. Because this function uses taxa names to join data sets, any variation in taxa names can prevent proper joining. Taxa names can be made uniform by using the Synonymizer.

3.8 Merge Genotype Tables

Command: run_pipeline.pl -fork1 -h group1.hmp.txt -fork2 -h group2.hmp.txt -combine3 -input1 -input2 -mergeAlignments -export group1_group2_merge.hmp.txt -runfork1 -runfork2

This is the most complex merge function and can be considered as a union join across both sites and taxa. (The actual union join only works across taxa.) The resulting genotype table will contain all unique sites and all unique taxa from across the input datasets. If a specific site/taxon combination isn't present in any input dataset, the value is set to missing. If a specific site/taxon combination is present in more than one input file, the output will contain the last value processed. That is, later values overwrite earlier values, even if they conflict. There are plans to change this, but they have not been
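The merge semantics described above (all unique sites and taxa are kept, absent combinations become missing, and later values overwrite earlier ones) can be illustrated with a small sketch. This is not TASSEL's implementation: it assumes each genotype table is represented as a dictionary keyed by (taxon, site), uses "N" as the missing call, and all names are hypothetical.

```python
def merge_genotype_tables(tables, missing="N"):
    """Merge tables keyed by (taxon, site); later tables overwrite earlier ones."""
    taxa, sites = [], []
    merged = {}
    for table in tables:
        for (taxon, site), call in table.items():
            if taxon not in taxa:
                taxa.append(taxon)
            if site not in sites:
                sites.append(site)
            # Last value processed wins, even if it conflicts with an earlier one.
            merged[(taxon, site)] = call
    # Fill every taxon/site combination that no input table provided.
    return {(t, s): merged.get((t, s), missing) for t in taxa for s in sites}

# Hypothetical inputs: the two tables overlap at (B73, site1) with conflicting calls.
group1 = {("B73", "site1"): "A", ("B73", "site2"): "C"}
group2 = {("B73", "site1"): "G", ("Mo17", "site3"): "T"}

result = merge_genotype_tables([group1, group2])
print(result[("B73", "site1")], result[("Mo17", "site1")])
```

In this sketch the conflicting call from the second table replaces the first, and combinations such as (Mo17, site1) that appear in neither input are set to missing.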
71. nd start TASSEL software and to provide a brief overview of the interface.

[Figure: TASSEL main window with the File, Data, Filter, Analysis, Results and Help menus; the Data Tree lists Sequence, Polymorphisms, Numerical (mdp_population_structure, mdp_traits), Matrix (mdp_kinship), Tree and Fusions; the genotype table shows 281 sequences]

1.1 Executing TASSEL

http://www.maize
72. notypic data, diversity results, and association results.

Histograms: Use the graph type combo box to select the desired graph type (Histogram) from the list of options. Up to two different series of data can be plotted together. Users may specify the number of bins to be used in the histogram.

[Figure: Histogram of DPOLL HOMESTEAD ID1 and DPOLL CLAYTON ID15 (Series 1, Series 2, Bins)]

Scatter plots: Use the graph type combo box to select the desired graph type (XY Plot) from the list of options. Select data to be plotted on the X and Y axes using the appropriate drop-down boxes. If two data series are plotted simultaneously on the Y axis, the "2 Y Axes" checkbox will provide an axis for each.

[Figure: Scatter plot of DPOLL HOMESTEAD ID1 vs DPOLL CLAYTON ID15 with fitted regression line]

7.6 QQ Plot

7.7 Manhattan Plot

8 GBS Menu

http://www.maizegenetics.net/tassel/docs/TasselPipelineGBS.pdf

9 Help Menu

Help provides information about TASSEL and diagnostics.

9.1 Help Manual

9.2 About

9.3 Show Memory
73. nput dataset must contain one or more phenotypes and numeric marker data. Optionally, it may also contain factors and covariates. The analysis is run by selecting the input dataset, then clicking the GS button. Because no additional user input is needed, the analysis will run immediately after the button is clicked. All traits will be analyzed separately using all of the genotypes, factors, and covariates in the dataset. The output will consist of two new datasets for each trait. One of the datasets will contain genomic estimated breeding values (GEBVs) for each taxon, and the other will contain BLUPs for each marker in the genotype file. The output datasets will appear in the Numerical folder, which holds the input data as well. The output datasets can, in turn, be used for subsequent analysis. For example, they could be joined with the input data so that the predicted values could be graphed against the original values.

Understanding the input data requirements is important to ensure that the results of the analysis will be correct and useful. Genotypes must be numeric, with one column for each marker. It is expected that the markers are bi-allelic, with the homozygotes coded as 1 and -1 and the heterozygotes coded as 0. However, any reasonable coding scheme will work. For instance, missing data could be replaced by a probability resulting from imputation. If any genotype data is missing, it will be imputed as the average of the marker scores across all t
74. nt2 contribution1 contribution2 F
fam1 t0001 par1 par2 0.5 0.5 92
fam1 t0002 par1 par2 0.5 0.5 92
fam2 t0201 par1 par3 0.5 0.5 22
fam2 t0202 par1 par3 0.5 0.5 92
fam2 t0203 par1 par3 0.5 0.5 92
The values for contribution1, contribution2, and F are family means. Those values are read from the first line for a family only and then applied to the entire family.
Using the command line for FSFHap
FSFHap consists of three TASSEL plugins, CallParentAllelesPlugin, ViterbiAlgorithmPlugin, and WritePopulationAlignmentPlugin, which are called sequentially. A typical command for running FSFHap is as follows (replace items in < > with actual parameter values).
For a genotype containing a single chromosome:
    run_pipeline.pl -h <genotypeFilename> -CallParentAllelesPlugin -p <pedigreeFilename> -m 0.9 -r 0.5 -logfile <logFilename> -endPlugin -ViterbiAlgorithmPlugin -g true -endPlugin -WritePopulationAlignmentPlugin -f <outputFilename> -m false -o parents -endPlugin
For a genotype file containing multiple chromosomes:
    run_pipeline.pl -h <genotypeFilename> -separate -CallParentAllelesPlugin -p <pedigreeFilename> -m 0.9 -r 0.5 -logfile <logFilename> -endPlugin -ViterbiAlgorithmPlugin -g true -endPlugin -WritePopulationAlignmentPlugin -f <outputFilename> -m false -o parents -endPlugin
Options for CallParentAllelesPlugin
Options taking a parameter value (specified by Value):
-p or -pedigrees: t
75. ontaining allele estimates. To run GLM, select a data set and then click the GLM button. A dialog box will pop up to allow the user to indicate that a permutation test should be run and to change the number of permutations. The permutation test is run using the method suggested by Anderson and Ter Braak (2003), which calculates the predicted and residual values of the reduced model (containing all terms except markers), then permutes the residuals and adds them to the predicted values. When the GLM options dialog is closed, the user is presented with a dialog allowing the output to be saved to a file rather than stored in memory and displayed by TASSEL. This option is useful when the output is expected to be very large and risks exceeding available RAM. The following table shows an example of the Marker Test output as viewed with Results > Table.
[Screenshot: Marker Test results table with columns Locus, pos, marker F, marker p, markerR2, markerDF, markerMS, and errorDF, listing markers PZB00859.1 through PZA02962.13 at positions 157104 to 3205252, with Print, Export (CSV), and Export (Tab) buttons]
76. or analysis can be performed. To select multiple data sets, press the CTRL key (Command on Mac) while selecting the data sets.
3. The Report Panel is located below the Data Tree. It displays information about a selected data set from the Data Tree, such as the type of data and how it was created.
4. The Progress Monitoring Panel, below the Report Panel, shows the progress of running tasks and has buttons that can cancel tasks.
5. The Main Panel occupies the right side of the viewing area and displays the content of the selected data set from the Data Tree.
1.5 Pipeline Command Line Interface
http://www.maizegenetics.net/tassel/docs/TasselPipelineCLI.pdf
1.6 GBS Pipeline
http://www.maizegenetics.net/tassel/docs/TasselPipelineGBS.pdf
2 File Menu
The data tree can be saved in a binary format.
2.1.1 Save Data Tree
This feature allows you to save the entire contents of the Data Tree panel to a default location. This is helpful when the user does not wish to recreate a Data Tree panel that is already well populated with information.
NOTE: The information outlined above for saving a Data Tree applies to files that are, in general, version specific. When a new version of TASSEL is released, a data tree saved with a previous version might not load in the new version. For longer-term storage, the best practice is to save individual data sets rather than the entire d
77. or the most significant SNP, highlighted in the figure below, there were two genotypes, AA and GG. There are 220 lines with genotype AA and 41 lines with genotype GG. For the trait dpoll (days to pollination), the difference between the two homozygotes was 3.86 days.
[Screenshot: TASSEL (Trait Analysis by aSSociation, Evolution and Linkage) 5.0 window showing the GLM allele estimates for trait dpoll on chromosome 8, with columns Marker, Locus, pos, Allele, and Estimate]
10.5 Association analysis using MLM
Running MLM in TASSEL is similar to running GLM. The difference is that, in addition to the joint data or numerical data, MLM requires kins
78. ormula for $R^2$ for a generalized least squares (GLS) model, as shown here. The columns Genetic Var, Residual Var, and -2LnLikelihood list $\sigma^2_a$, $\sigma^2_e$, and minus two times the model log likelihood, respectively. When the P3D option is used, all of the values are the same for a given trait because they are only calculated once. A second table lists the estimated effects of each allele for each marker, similar to the output for GLM.
The compression results table, shown below, gives the likelihood, genetic variance, and error variance for each compression level tested during the optimization process. The meaning of groups and compression is discussed above in the description of the compression method. The compression level with the lowest value of -2LnLk is used for testing markers.
[Screenshot: MLM compression results table for Filtered mdp traits and Filtered mdp population structure, with columns groups, -2LnLk, Var genetic, and Var error]
6.7 Genomic Selection using Ridge Regression
This function performs ridge regression to predict phenotypes from genotypes. It is one of the methods used for genomic selection (GS). The i
79. otes as a third marker class is not appropriate for kinship or LD, those analyses should not be used for that type of data at the present time. Work to improve the handling of heterozygotes is ongoing.
How to cite TASSEL
The paper that describes TASSEL as a software package and the papers that introduce specific methods implemented in TASSEL should be cited as appropriate, such as the unified Q + K approach, EMMA, compression of the mixed linear model, and P3D. For example:
Linkage disequilibrium (D', $r^2$, and P-value) were calculated by TASSEL. Association analyses were performed with the mixed linear model approach implemented by TASSEL. GWAS was performed with the compressed mixed linear model approach carried by TASSEL, which also implemented the EMMA and P3D algorithms to reduce computing time (11, 12, 13, 14).
REFERENCES
Bradbury, P. J. et al. TASSEL: software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633-2635 (2007).
Zhang, Z., Buckler, E. S., Casstevens, T. M. & Bradbury, P. J. Software engineering the mixed model for genome-wide association studies on large samples. Brief Bioinform 10, 664-75 (2009).
Kang, H. M. et al. Efficient control of population structure in model organism association mapping. Genetics 178, 1709-1723 (2008).
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat Genet 42, 355-60 (2010).
K
80. ow the most liberal model may be preferred.
Why do TASSEL and SPAGeDi give different kinship estimates?
First, many algorithms exist to calculate kinship, and their estimates will differ from one another. Second, the algorithm in TASSEL treats each genotype as a haplotype. It is not recommended that TASSEL be used to generate a kinship matrix from heterozygous genotypes. In the near future, the TASSEL kinship algorithm will be modified to handle heterozygous diploids.
Can I get marker R-square using SAS Proc Mixed or TASSEL MLM?
SAS Proc Mixed does not produce an $R^2$ statistic. MLM in TASSEL does. The user manual describes how it is calculated.
Does MLM find more associations than GLM?
Sometimes. MLM has higher statistical power than GLM and may detect more true associations. When the tested genetic markers are confounded with kinship structure, GLM does not correct for that as effectively as MLM and may produce more false positives.
Do I need multiple test correction for the p-value from TASSEL?
Yes.
Can TASSEL handle diploid genotype data?
While TASSEL accepts most common sequence alignment formats, which handle polyploid genotype data including haploid and diploid, some analyses are not appropriate for heterozygous data. GLM and MLM fit SNPs one at a time, treating each distinct genotype as a separate class. This has the effect of fitting an additive plus dominance model. Separating the two effects is under consideration. Because handling heterozyg
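The FAQ answer on multiple test correction in excerpt 80 does not name a method. As one common and conservative possibility, a Bonferroni correction can be sketched as follows. The choice of Bonferroni, the function name, and the example p-values are assumptions made for illustration, not something the manual prescribes.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: compare each p-value to alpha divided by the number of tests."""
    n_tests = len(p_values)
    threshold = alpha / n_tests
    # Adjusted p-values are the raw values multiplied by the number of tests, capped at 1.
    adjusted = [min(1.0, p * n_tests) for p in p_values]
    significant = [p <= threshold for p in p_values]
    return threshold, adjusted, significant


# Hypothetical marker p-values from an association run.
marker_p = [0.00001, 0.01, 0.2, 0.04]
threshold, adjusted, significant = bonferroni(marker_p)
```

With thousands of markers the per-test threshold becomes very small, so less conservative procedures (for example, false discovery rate control) are often considered as well.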
81. r Dir> : Directory containing donor haplotype files from output of FILLINFindHaplotypesPlugin. All files with "gc" in the filename will be read in; only those with matching sites are used (required)
-o <Output filename> : Output file; hmp.txt.gz and .hmp.h5 accepted (required)
-hapSize <Preferred haplotype size> : Preferred haplotype block size in sites (use same as in FILLINFindHaplotypesPlugin). Default: 8000
-hetThresh <Heterozygosity threshold> : Threshold per-taxon heterozygosity for treating a taxon as heterozygous (no Viterbi, het thresholds). Default: 0.01
-mxInbErr <Max error to impute one donor> : Maximum error rate for applying one haplotype to entire site window. Default: 0.01
-mxHybErr <Max combined error to impute two donors> : Maximum error rate for applying Viterbi with two haplotypes to entire site window. Default: 0.003
-mnTestSite <Min sites to test match> : Minimum number of sites to test for IBS between haplotype and target in focus block. Default: 20
-minMnCnt <Min num of minor alleles to compare> : Minimum number of informative minor alleles in the search window (or 10X major). Default: 20
-mxDonH <Max donor hypotheses> : Maximum number of donor hypotheses to be explored. Default: 20
-hybNN <true | false> : If true, uses combination mode in focus block, else does not impute. Default: true
-ProjA <true | false> : Create a projection alignment for high density markers. Default: false
-impDonor <true
82. r $r^2$ for all possible combinations of alleles and then weighting them according to the allele's frequency. Note: It is not entirely certain that this procedure fully accounts for allele number effects.
P-values are determined by two methods. If only two alleles are present at both loci, then a two-sided Fisher's Exact test is calculated. (Note: Previous editions of TASSEL used a one-sided test, but TASSEL version 1.0.8 and later use a two-sided test.) If more than two alleles are present, permutations are used to calculate the proportion of permuted gamete distributions that are less probable than the observed gamete distribution under the null hypothesis of independence.
When calculating linkage disequilibrium, users have the option of employing Rapid Permutations. If this option is selected, the algorithm will compute either a fixed number of permutations or run until 10 permutations are found that are more significant than the observed P-value. While this slightly reduces P-values, it also saves a large amount of computational time. If an unbiased P-value is desired, then the user must unselect the Rapid Permutations check box.
Full Matrix LD calculates LD for every combination of sites in the alignment. Sliding Window LD calculates LD for sites within a window of sites surrounding the current site. The LD Window Size determines the width of the window on one side of the current site. Linkage disequilibrium results can be plo
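The difference between Full Matrix LD and Sliding Window LD in excerpt 82 comes down to which pairs of sites are compared. The sketch below only enumerates site index pairs; it assumes the window size counts sites on one side of the current site, as the excerpt states, and the function names are hypothetical.

```python
def full_matrix_pairs(n_sites):
    """Every unordered pair of sites (i, j) with i < j."""
    return [(i, j) for i in range(n_sites) for j in range(i + 1, n_sites)]


def sliding_window_pairs(n_sites, window_size):
    """Pairs (i, j) with i < j and j no more than window_size sites away from i."""
    pairs = []
    for i in range(n_sites):
        last = min(i + window_size, n_sites - 1)
        for j in range(i + 1, last + 1):
            pairs.append((i, j))
    return pairs


# With 3093 sites, the full matrix needs 3093 * 3092 / 2 comparisons,
# while a window of 50 sites needs fewer than 3093 * 50.
n_full = len(full_matrix_pairs(3093))
n_window = len(sliding_window_pairs(3093, 50))
```

The sliding window keeps the number of comparisons roughly linear in the number of sites, which matters because each comparison may itself involve permutations.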
83. rozygosity in the parents. However, it does not perform well for partially inbred RILs that have only been self-pollinated for one or two generations. If the RILs being imputed are F2s or F3s, the cluster option should not be used. The subpops option should only be used when imputing families of the NAM population developed by the Maize Diversity Project. The nohets option was included to test whether or not erroneous het calls result in too many hets being imputed. It appears to have only a small effect on the outcome. The windowld algorithm handles F2 and later populations effectively but can have problems when parents have some residual heterozygosity. It is recommended that the logfile option be used. The output can be used to identify and diagnose possible problems. The option -bcn true should be used for populations with two or more backcrosses. However, using the bc1 option is not necessary, as the default behavior is usually best.
Options for ViterbiAlgorithmPlugin
-g or -fillgaps: if true, then missing values flanked by SNPs from the same parent will be imputed to that parent; false otherwise. Value: true or false, default = true
-h or -phet: expected frequency of heterozygous loci. Used only if the inbreeding coefficient is not specified in the pedigree file. Value: number between 0 and 1, default = 0.07
Options for WritePopulationAlignmentPlugin
Required:
-f or -file: The base file name for the o
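The fillgaps behavior described for ViterbiAlgorithmPlugin in excerpt 83 can be illustrated with a short sketch. It assumes parent calls along a chromosome are stored as a list with `None` for missing values; the parent labels "A" and "B" and the function name are hypothetical, and this is not the plugin's actual code.

```python
def fill_gaps(calls):
    """Fill runs of missing calls (None) that are flanked on both sides by the same parent."""
    filled = list(calls)
    last_called = None  # index of the most recent non-missing call
    for i, call in enumerate(calls):
        if call is None:
            continue
        if last_called is not None and calls[last_called] == call:
            # Same parent on both flanks: impute everything in between to that parent.
            for k in range(last_called + 1, i):
                filled[k] = call
        last_called = i
    return filled


calls = ["A", None, None, "A", None, "B", None]
filled = fill_gaps(calls)
```

Gaps between different parents, and gaps at the ends of the chromosome, are left missing because the flanking calls do not agree on a single parent.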
84. sing, Proportion Missing, Number Gametes, Gametes Not Missing, Proportion Gametes Not Missing, Gametes Missing, Proportion Gametes Missing, Number Heterozygous, Proportion Heterozygous
Number of Taxa: Number of taxa in data set
Number of Sites: Number of sites in data set
Sites x Taxa: Number of sites multiplied by number of taxa
Number Not Missing: Number of allele values not unknown (NN)
Proportion Not Missing: Number Not Missing / Sites x Taxa
Number Missing: Number of unknown (NN) values
Proportion Missing: Number Missing / Sites x Taxa
Number Gametes: Number of Sites x Number of Taxa x 2
Gametes Not Missing: Number of gametes not unknown
Proportion Gametes Not Missing: Gametes Not Missing / Number Gametes
Gametes Missing: Number of unknown (N) gametes
Proportion Gametes Missing: Gametes Missing / Number Gametes
Number Heterozygous: Number of heterozygous values
Proportion Heterozygous: Number Heterozygous / Sites x Taxa
[Screenshot: Overall summary for mdp genotype, with values in the order of the list above: 281, 3093, 869133, 837722, 0.96386, 31411, 0.03614, 1.7383E6, 1.6754E6, 0.96386, 62822, 0.03614, 9622, 0.01107]
[Screenshot: TASSEL 5.0 window showing the Genotype Summary results for mdp genotype (Overall summary and Allele summary), with the allele summary columns Alleles, Number, Proportion, and Frequency]
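The summary statistics defined in excerpt 84 are simple counts and ratios. The sketch below computes them for a small, made-up genotype table; it assumes each call is a two-character diploid string with "N" for an unknown gamete, which is an illustrative encoding rather than TASSEL's internal one, and the function name is hypothetical.

```python
def genotype_summary(genotypes):
    """Compute the overall summary statistics from a taxa-by-sites table of diploid calls."""
    calls = [call for row in genotypes for call in row]
    n_taxa, n_sites = len(genotypes), len(genotypes[0])
    sites_x_taxa = n_taxa * n_sites
    n_missing = sum(1 for c in calls if c == "NN")
    n_gametes = 2 * sites_x_taxa
    gametes_missing = sum(c.count("N") for c in calls)
    # A heterozygous value has two known, different alleles.
    n_het = sum(1 for c in calls if "N" not in c and c[0] != c[1])
    return {
        "Number of Taxa": n_taxa,
        "Number of Sites": n_sites,
        "Sites x Taxa": sites_x_taxa,
        "Number Not Missing": sites_x_taxa - n_missing,
        "Proportion Not Missing": (sites_x_taxa - n_missing) / sites_x_taxa,
        "Number Missing": n_missing,
        "Proportion Missing": n_missing / sites_x_taxa,
        "Number Gametes": n_gametes,
        "Gametes Not Missing": n_gametes - gametes_missing,
        "Proportion Gametes Not Missing": (n_gametes - gametes_missing) / n_gametes,
        "Gametes Missing": gametes_missing,
        "Proportion Gametes Missing": gametes_missing / n_gametes,
        "Number Heterozygous": n_het,
        "Proportion Heterozygous": n_het / sites_x_taxa,
    }


# Hypothetical table: 2 taxa by 3 sites.
summary = genotype_summary([["AA", "AG", "NN"], ["GG", "AN", "AA"]])
```

The same ratios reproduce the values listed for the tutorial data: 281 taxa times 3093 sites gives 869133 calls, and 837722 / 869133 rounds to 0.96386.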
85. Shown below is an example in which the Taxa Summary is displayed.
[Screenshot: Taxa Summary table, including an Inbreeding Coefficient column]
Data can be sorted by clicking on the column header of interest. A secondary sort can be done by holding down the CTRL key and clicking on a second column. Data can be exported to flat files that are either comma-separated (CSV) or tab-delimited. Both of these formats can then be imported into a spreadsheet program such as Excel. Tables can also be printed.
7.2 Archaeopteryx Tree
Select a Tree data set to use.
https://sites.google.com/site/cmzmasek/home/software/archaeopteryx
[Screenshot: Archaeopteryx 0.955 beta (2010-01-15) window displaying the tree for mdp_genotype, with display options such as Phylogram, Node Name, Taxonomy Code, and Confidence Value]
86. tted using Results > LD Plot or viewed in a table via Results > Table.
6.3 Cladogram
[Screenshot: Create Tree dialog with a Save distance matrix checkbox, a Clustering Method selector (Neighbor Joining), and Run and Close buttons]
This function generates a tree or cladogram data set. TASSEL produces neighbor-joining trees using only simple parsimony substitution models. To retrieve cladogram data, first select genotypic data from the Data Tree and then click Analysis > Cladogram. The resulting tree data and the corresponding matrix will appear as separate data sets on the Data Tree. Results can be plotted using Results > Archaeopteryx Tree.
6.4 Kinship
This function generates a kinship matrix from a genotype. To do so, first highlight SNP data, then click on the Analysis > Kinship submenu. The resulting dialog box will then provide the option to select scaled IBS or pairwise IBS. Clicking OK generates a kinship matrix. When a genotype file is selected with pairwise IBS, each element ij of the kinship matrix is derived from the proportion of the SNPs that differ between taxon i and taxon j. Distance is calculated for each pair of taxa, ignoring any sites that have a missing value for one of the taxa. The distance matrix is converted to a similarity matrix by subtracting all values from 2 and then scaling to a fixed range whose minimum value is 0. Kinship can be derived from a set of random SNP data a mini
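The pairwise IBS calculation outlined in section 6.4 of excerpt 86 can be sketched as below. Several details are assumptions for illustration: each taxon is a string with one haploid call per site and "N" for missing, the function names are hypothetical, and the upper end of the rescaled range (`max_value`) is a parameter whose default of 2.0 is a guess, since the excerpt does not show that value.

```python
def pairwise_distance(genotypes, missing="N"):
    """Proportion of differing sites for each pair of taxa, skipping sites missing in either."""
    n = len(genotypes)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            compared = differing = 0
            for a, b in zip(genotypes[i], genotypes[j]):
                if a == missing or b == missing:
                    continue
                compared += 1
                if a != b:
                    differing += 1
            if compared == 0:
                raise ValueError("taxa %d and %d share no non-missing sites" % (i, j))
            dist[i][j] = dist[j][i] = differing / compared
    return dist


def distance_to_kinship(dist, max_value=2.0):
    """Subtract distances from 2, then rescale so the minimum is 0 and the maximum is max_value."""
    similarity = [[2.0 - d for d in row] for row in dist]
    low = min(min(row) for row in similarity)
    high = max(max(row) for row in similarity)
    scale = max_value / (high - low) if high > low else 0.0
    return [[(s - low) * scale for s in row] for row in similarity]


# Hypothetical taxa with four sites each.
taxa = ["AACG", "AACT", "GGNT"]
dist = pairwise_distance(taxa)
kinship = distance_to_kinship(dist)
```

Because the diagonal distances are 0, the diagonal of the similarity matrix always holds the largest value, so each taxon is most similar to itself after rescaling.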
87. uput; .hmp.txt will be appended. Value: filename
Optional:
-m or -merge: if true, then families are merged into a single file; if false, then each family is output to a separate file. Value: true or false, default = false
-o or -outputType: if value = parents, then output parent calls; if value = nucleotides, then output nucleotides; if value = both, then output both in separate files. Default = both
-d or -diploid: if true, output is AA/CC/AC; if false, output is A/C/M. Value: true or false, default = false
-c or -minCoverage: the minimum coverage for a monomorphic SNP to be included in the nucleotide output. Value: number between 0 and 1, default = 0.1
-x or -maxMono: the maximum minor allele frequency used to call monomorphic SNPs. Default = 0.01
For individual families, only polymorphic SNPs are imputed. When merge = false, only those SNPs appear in the output. When merge = true, SNPs that are polymorphic in any family will be written to output. For any site, if SNP coverage is high enough in a family to determine with confidence that it is monomorphic for that family, then all individuals in that family will be imputed to the monomorphic value at that site. The minCoverage and maxMono options set the thresholds for deciding whether a site will be called monomorphic in a family. If either of the options is set to a value of NaN, then missing values at monomorphic sites will not be imputed.
FILLIN Fast Inbred Lin
88. use [Aa]bc to match site names beginning with Abc or abc.
• PZ[AB] will match anything starting with PZA or PZB.
[Screenshot: Site Name Filter dialog with Available and Selected lists of site names (for example PZB00859.1 and PZA03613.2), a search box containing PZ[AB], and Capture Selected, Capture Unselected, Remove, and Cancel buttons]
5.3 Taxa Names
First select the genotypic, phenotypic, or population structure data from the data tree. The resulting dialog displays the taxa associated with the selected data. By using either the CTRL or SHIFT key in conjunction with the mouse, the user can select or deselect taxa. Once the desired taxa have been moved to the Selected window using the Add > button, the Capture Selected or Capture Unselected buttons will create a new data set containing only the desired taxa.
Using the search box:
• A wildcard character is supported.
• The wildcard is always implied at the end of the search string.
• The search string is case sensitive. For example, use [Aa]bc to match taxa beginning with Abc or abc.
• A[56] will match anything starting with A5 or A6.
[Screenshot: Taxa Filter dialog with Available and Selected lists and Capture Selected, Capture Unselected, Remove, and Cancel buttons]
5.4 Taxa
[Screenshot: Filter Taxa by Properties dialog with fields Min Proportion of Sites Present, Min Heterozygous Proportion (0.0), and Max Heterozygous Proportion (1.0), and a Close button]
5.5 Traits
Clicking the
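The search box rules in excerpt 88 (bracketed character sets, case sensitivity, and a wildcard implied at the end of the search string) behave much like shell-style patterns. The sketch below mimics them with Python's standard `fnmatch` module. Whether TASSEL uses this exact pattern syntax is an assumption, and the function name and example names are hypothetical.

```python
import fnmatch


def filter_names(names, search):
    """Keep names matching the search string, case-sensitively, with a trailing wildcard implied."""
    pattern = search + "*"  # the wildcard is always implied at the end of the search string
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


site_names = ["PZA03613.2", "PZB00859.1", "PHM448.23", "pza00258.3"]
selected = filter_names(site_names, "PZ[AB]")

taxa_names = ["Abcd", "abc", "ABC", "A554", "A619", "A7"]
abc_taxa = filter_names(taxa_names, "[Aa]bc")
a5_a6_taxa = filter_names(taxa_names, "A[56]")
```

Using `fnmatchcase` rather than `fnmatch` keeps the comparison case sensitive on every platform, which matches the rule stated in the excerpt.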
89. x K is used jointly with population structure (Q), the Q + K approach improves statistical power compared to Q only. MLM can be described in Henderson's matrix notation as follows:
$y = X\beta + Zu + e$
where $y$ is the vector of observations; $\beta$ is an unknown vector containing fixed effects, including genetic marker and population structure (Q); $u$ is an unknown vector of random additive genetic effects from multiple background QTL for individuals/lines; $X$ and $Z$ are the known design matrices; and $e$ is the unobserved vector of random residuals. The $u$ and $e$ vectors are assumed to be normally distributed with null mean and variance of
$\mathrm{Var}\begin{pmatrix} u \\ e \end{pmatrix} = \begin{pmatrix} G & 0 \\ 0 & R \end{pmatrix}$
where $G = \sigma^2_a K$, with $\sigma^2_a$ as the additive genetic variance and $K$ as the kinship matrix. Homogeneous variance is assumed for the residual effect, which means $R = \sigma^2_e I$, where $\sigma^2_e$ is the residual variance. The proportion of genetic variance over the total variance is defined as heritability:
$h^2 = \dfrac{\sigma^2_a}{\sigma^2_a + \sigma^2_e}$
When $K$ is derived from pedigrees, the elements of $K$ equal 2 × Probability(IBD), where IBD means that two alleles drawn at random are identical by descent. Generally, $K$ calculated from markers is an IBS matrix. The resulting multiplier is then not $\sigma^2_a$ but some unknown constant times $\sigma^2_a$. Some methods for calculating $K$, such as those implemented in SPAGeDi, actually use markers to develop an estimate of the IBD relationship matrix. For those value of
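One consequence of the model in excerpt 89 that may help when reading the variance components is the implied covariance of the observations. The short derivation below is a standard result for mixed models with independent random effects and residuals, written with the same symbols; the total variance symbol $\sigma^2_T$ is introduced here only for illustration and does not appear in the manual.

```latex
% Covariance of y under y = X\beta + Zu + e, with Cov(u, e) = 0,
% Var(u) = G = \sigma^2_a K and Var(e) = R = \sigma^2_e I:
\begin{aligned}
\mathrm{Var}(y) &= Z\,G\,Z^{\top} + R \\
                &= \sigma^2_a\, Z K Z^{\top} + \sigma^2_e\, I .
\end{aligned}

% Writing \sigma^2_T = \sigma^2_a + \sigma^2_e and h^2 = \sigma^2_a / \sigma^2_T:
\mathrm{Var}(y) = \sigma^2_T \left[\, h^2\, Z K Z^{\top} + (1 - h^2)\, I \,\right].
```

In this form, a heritability near 0 leaves only the identity term, so the model approaches an ordinary fixed-effects fit, while a heritability near 1 makes the kinship structure dominate.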
90. zygosity of output haplotype> : Heterozygosity results from clustering sequences that either have residual heterozygosity or clustering sequences that do not share all minor alleles. Default: 0.01
-minSites <Min sites to cluster> : The minimum number of sites present in two taxa to compare genetic distance to evaluate similarity for clustering. Default: 50
-mxErr <Max combined error to impute two donors> : The maximum genetic divergence allowable to cluster taxa. Default: 0.05
-hapSize <Preferred haplotype size> : Preferred haplotype block size in sites (minimum 64); will use the closest multiple of 64 at or below the supplied value. Default: 8192
-minPres <Min sites to test match> : Minimum number of present sites within input sequence to do the search. Default: 500
-maxHap <Max haplotypes per segment> : Maximum number of haplotypes per segment. Default: 3000
-minTaxa <Min taxa to generate a haplotype> : Minimum number of taxa to generate a haplotype. Default: 2
-maxOutMiss <Max frequency missing per haplotype> : Maximum frequency of missing data in the output haplotype. Default: 0.4
-nV <true | false> : Suppress system out. Default: false
-extOut <true | false> : Details of taxa included in each haplotype to system out. Default: false
Options for FILLINImputationPlugin
-hmp <Target file> : Input HapMap file of target genotypes to impute. Accepts all file types supported by TASSEL 5 (required)
-d <Dono
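The hapSize rule in excerpt 90 (the closest multiple of 64 at or below the supplied value, with a minimum of 64) is easy to check by hand or in code. The sketch below assumes that values under 64 are raised to 64, which is one reading of "minimum 64"; the function name is hypothetical.

```python
def effective_hap_size(requested):
    """Round the requested block size down to a multiple of 64, never going below 64."""
    return max(64, (requested // 64) * 64)


sizes = {requested: effective_hap_size(requested) for requested in (8192, 8000, 8191, 100, 50)}
```

Both 8192 (128 x 64) and 8000 (125 x 64) are already multiples of 64, so either value passes through this rounding unchanged.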
