Home
BATWING USER GUIDE Bayesian Analysis of Trees With Internal
Contents
1. 1 2 Modelling assumptions Underlying BATWING are various modelling assumptions many of which are also assumptions underpinning coalescent theory see e g Nordborg 2001 In the case that the data are drawn from a single population two key assump tions are that the data form a random sample from the population and that the population is random mating For structured data drawn from k subpop ulations BATWING assumes a common random mating ancestral population which split into two isolated random mating subpopulations If k gt 2 then one or both of these subpopulations split again and so on until k random mating subpopulations existed at the time of sampling Three models are available for the population size 0 constant 1 pure exponential growth and 2 exponential growth from a constant ancestral pop ulation size For structured data the start of growth time in model 2 is the same across all subpopulations and there is also a common growth rate under models 1 and 2 Natural populations are unlikely to satisfy BATWING s modelling assump tions In particular BATWING assumes that population splitting events are instantaneous with no subsequent migration whereas in reality splits may be gradual and followed by migration However more realistic models will almost certainly have too many parameters for useful inferences to be drawn We have tried to find a satisfactory compromise between reality and tractability so that
2. The starting state of the genealogical tree and other parameters may be chosen automatically by BAT WING or a starting tree may be specified in Newick for mat see Section 2 10 1 using an initialfile assignment in the infile The 11 latter option is useful for resuming a BATWING run which has been interrupted either by the user or by a computer failure initialfile contains complete information on an instance of a tree from a previous MCMC run such as those output as outfile iit outfile end or as outfile x where x is an integer these files consist of information about the tree in Newick format and information about the other parameters and the population tree if appropriate New MCMC settings can be specified via the infile as usual If seed is specified in the infile it takes precedence over the initialfile settings and a command line seed takes precedence over an infile setting All other settings in the infile e g priors are over ruled by the settings specified in the initialfile If the initial tree is not specified it is generated by BATWING with the badness parameter controlling the suitability of this tree for the data 1 spec ifies a random tree chosen independently of the data while O specifies a tree obtained via a parsimony heuristic with each coalescence occurring between between the two nodes with the most similar haplotypes A value between 0 and 1 allows a mixture between these two extremes See Wilson
3. inftype 0 22 initialfile used optional default The name of a iit file containing complete information on an instance of a tree from a previous MCMC run Used to set up all information on the data model settings and prior setting as specified in the file and it sets the random number seed to the same value that it was when the iit file was saved MCMC settings still can be specified in the infile e g samples Nbetsamp treebetN warmup seed If seed is specified in the infile it takes precedence over the initialfile settings If seed is specified in the command line it takes precedence over the infile settings All other set tings in the infile e g priors etc are over ruled by the settings specified in the initialfile example initialfile kalleles used whenever we have STR loci default NULL Determines whether a k alleles model is used for mutations of the STR loci Value is a list of the numbers of alleles permissible at each locus Under the k alleles model a mutation causes a change to any of the k alleles including the original allele with equal probability If not set the SMM is used in which a mutation causes the allele repeat number to increase or decrease by one unit with equal probability example kalleles 4 4 4 4 for 4 STR loci kappaprior used see above default none The prior for kappa NO0xalphaxbeta kappa is the natural log of the ratio of current population size to initial population size e
4. Data settings The location of the observed data is specified by a line of the form datafile lt lt path gt filename gt in the infile If path is omitted the current directory is assumed otherwise the path should be specified relative to the current directory default setting is datafile Data files have one line per haplotype with one or more spaces separating the alleles at distinct loci Following the remainder of the line is ignored the whole line is ignored if is the first non space character First come the UEP alleles which may be coded by any two single alphanu meric characters e g 0 and 1 or A and T Next come the microsatel lite or STR short tandem repeat alleles coded by an integer value giving the number of tandem repeats at that locus a constant added to each allele length at a locus has no effect on inferences also arbitrary numeric allele labels are allowed under the k alleles model Section 2 6 1 Missing STR data can be specified using 1 If the data are drawn from several distinct populations i e migmodel 1 see Section 2 5 2 the positive integer label specifying the source subpopulation is stored in a file locationfile in the same way as for datafile Rows of the locationfile should correspond to the rows of the datafile Subpopu lation codes may be any positive integers Missing location information can be specified using 1 if this is used the total number of s
5. L 5 85066 258 997 4 60517 27 7094 0 632908 2 82459 4 40921 266 008 4 60517 60 3782 0 537683 2 36435 5 01031 255 684 4 60517 47 6446 0 536507 2 51192 3 6966 242 024 4 60517 38 9361 0 536507 2 157 Figure 4 Listing of output files out par and out first five lines for example See Figure 3 for infile 20 gres lt function x c mean x median x quantile x probs c 0 025 0 975 sets up a functions that calculates mean median and interval apply 0 2 qres applys function over all columns see splus r documentation VVVV NM note that the symbol again indicates a comment obtain the means medians and equal tailed 95 posterior intervals shown in Table mean median 95 Interval lltimes 8 16 7 86 14 45 3 702 llmut 246 0 245 6 268 4 225 6 theta 16 83 15 31 7 685 35 14 T 1 172 1 566 0 6643 3 65 L 5 33 5 041 2 34 10 11 Table 4 Posterior 95 equal tailed intervals obtained by post processing the results in out 21 5 Alphabetical listing of infile options alphaprior used see above default none The prior for alpha the population growth rate per generation Currently only positive values of alpha are allowed If migmodel 1 all subpopulations continue to grow at the same rate example alphaprior gamma 1 100 ancestralinf used when infsites 40 default NULL If set constrains the root node to have the UEP haplotype specified value should be a list of s
6. amp Balding 1998 for more details BATWING allows warmup to be set in the infile However this merely specifies an additional number of outputs to be generated No portion of a BATWING run is automatically discarded the user must explicitly choose how many of the initial outputs to discard as burn in Therefore use of warmup is optional and if used a warning message is generated One important diagnostic tool for an MCMC run are the plots of the au tocorrelation function ACF of the output values both for the parameters of interest and for the log likelihood Ideally the ACFs should be as small as possible However it is usually not worthwhile investing in reducing the ACF provided that the values decrease approximately monotonically to zero and be come negligible at lag much less than the square root of the number of outputs The tw BATWING parameters controlling the thinning of the BATWING output and hence the ACF are Nbetsamp which gives the number of changes that are made to model parameters between sampling occasions and treebetN the number of changes to the tree attempted between changes in the model pa rameters We make more changes to the tree than to the model parameters because tree changes are computationally cheap The appropriate balance between Nbetsamp and treebetN requires experi mentation and depends on the number of loci and the sample size The effects of varying them are illustrated by the three infile s
7. executable files batwing exe and MacBatwing hqx are also available at the same site but batwing tar gz is still needed for documentation and other files 2 1 1 Unpacking on Unix The downloaded file batwing tar gz is unpacked using these command gunzip c batwing tar gz tar xf on a command line To produce the executable file type make or alternatively on some systems gmake This should make the batwing sample and prior programs and put them in the bin subdirectory 2 1 2 Unpacking on Windows and Macintosh Under these systems any decompression software such as winzip should unpack batwing tar gz 2 2 The Distribution The distribution when unpacked will extract the files into a directory batwing with subdirectories bin executable files data data files examples example files doc documents including this Guide sample files for programs to sample from the model and src and R source files To run the programs from a command line you will need to add the bin directory to your path or copy the program into the directory with your input files In this document we use this font to indicate BATWING parameters and filenames and this font for variable names File listings use a fixed width font The first line of a file listing gives the path to the file in the distribution to enable readers to find the file in the BATWING distribution The symbol is used to denote a comment which is ignored by BATWING 2 3 BA
8. it may be reasonable to assume that the ancestral state is known For SNP sites modelled as a UEP the ancestral state is typically unknown If the parameter ancestralinf is assigned a UEP haplotype this is taken to be the haplotype at the root node i e the haplotype of the MRCA at the UEP loci To use this feature UEP alleles must be spec ified in the same way both in the datafile and for ancestralinf Currently an ancestral state can be assigned either to all or none of the UEP loci 2 7 Time scaling and parameterisation options BATWING follows the coalescent theory convention of working with times such as the TMRCA or a population split time expressed in units of N generations Time 0 corresponds to the present and large positive times correspond to far back in the past The value of N used by BATWING for the time scaling is the current effective population size N under model 1 but the ancestral effective population size Na under model 2 The reason for this choice is that Na is not defined under model 1 whereas under model 2 it is Na that has the most important effect on the data Ne is realised only instantaneously BATWING users can choose to work either with the unscaled growth rate alpha and mutation rate mu or the scaled growth rate omega N x alpha and scaled mutation rate theta 2N x mu the 2 appears in the definition of theta for historical reasons A feature of working with scaled rates and times is that t
9. of the sample in coalescent units multiply by N to obtain TMRCA in genera tions L the total branch length of the tree in coalescent units if sizemodel 0 the population growth rate either unscaled alpha or scaled omega if sizemodel 2 beta the time at which growth starts if sizemode1 2 kappa N x alpha x beta omega x beta if infsites gt 0 the next infsites columns give the binary states of the UEP loci at the root node of the tree will always equal ancestralinf if set if migmodel 1 the next k columns k subpopulations give the rel ative subpopulation sizes 14 15 if migmodel 1 the next 2 k 2 columns are pairs of numbers for each population merging event starting with the most recent one and ending with the penultimate one The first number indicates which clades merge at this point To interpret this first convert the number into binary form with k digits in the order corresponding to the locationfile codes with Pop 1 as the least significant digit and Pop k as the most significant digit 1s indicate which populations are beneath this node The second number of the pair gives the time of the merging event After these k 2 pairs of columns a final column gives the time of the final coalescence event 16 if infsites gt 0 and UEPtimes 1 the next infsitesx2 columns are pairs of numbers for each UEP locus giving the descendent and ancestral node times of the branch on which the U
10. prior distribution and hence not a unique posterior computational power is now available to explore a range of priors often the posterior is dominated by the likelihood so that the posterior distribution is essentially invariant over a wide range of priors If not the investigator learns the crucial information that his her data are only weakly informative about some parameters of interest and hence background information must be chosen with care and accurately reported in any publication BATWING offers a range of probability distributions from which investiga tors can choose priors for mutation rates effective population sizes and growth rates and population splitting times see Section 2 8p For unknowns which are properties of the genealogical tree such as the TMRCA BATWING of fers several models from coalescent theory outlined above in Section 1 2 and in more detail below in Section 2 5 Note that we speak of coalescent models even though they specify priors for the unknown genealogical tree Although model and prior are usually thought of as distinct the distinction is not fundamental and is a matter of custom and convenience 2 The BATWING software 2 1 Downloading and Installation The main distribution and the only available for unix systems is in the file batwing tar gz a zipped tar file The first step is to download this file which is available at If you work under Windows or Macintosh MacOS 8 0 or above systems
11. that a single historical mutation event was responsible for the observed polymorphism These may include insertion deletion polymorphisms such as an Alu insertion or single nucleotide polymorphisms SNP The number of UEP sites should be assigned to infsites default 0 There is an ascertainment problem common to many genetic data types but which is often particularly acute for UEP data Such sites may have been included in a genetic survey because they were known in advance to be poly morphic possibly they were known to be highly polymorphic in several popu lations Inferences which may validly be drawn from a site ascertained in this way may differ substantially from valid inferences had exactly the same data been observed at a site which was chosen at random To allow some investigation of ascertainment effects BATWING allows three ways to incorporate UEP sites into the analysis selected by setting inftype default is 0 0 Conditions on there being a single mutation at each UEP site Hence UEP positions on the tree contribute to the tree likelihood and the posterior density but no inference is drawn about the mutation rate 1 Assumes the same UEP mutation rate for all UEP loci with a uniform prior UEP positions within the tree contribute to the tree likelihood and posterior density 2 Only trees consistent with the UEP data are permitted but UEP data are not used in any other way For insertion deletion polymorphisms
12. the most important features of the data are exploited to quantify the major underlying demographic events It must be recognised that some questions of interest about historic demography cannot be answered from present day genetic data alone 1 3 The Bayesian paradigm BATWING implements the Bayesian paradigm for statistical inference Most scientists have been trained in the classical paradigm which dominated 20th century science The Bayesian approach is older having been the predominant mode of inference in the 19th century but has returned to prominence in recent years due in large part to the availability of fast computing and algorithms such as MCMC A key feature of the Bayesian paradigm is that a probability distribution is associated with all the unknowns The investigator specifies both a prior dis tribution representing knowledge about the unknowns before the present data are taken into account and a likelihood function which assigns probabilities to the data given values for the unknowns Bayes Theorem can then be employed to convert the prior distribution and likelihood into a posterior distribution Information used in formulating the prior may come from previous studies of related systems theory and informal judgement Sometimes scientists choose to downweight or ignore available prior information and formulate a diffuse prior giving support to a wide range of values for the unknowns Although there is not usually a unique
13. 1 v2 if X has this distribution then log X has the normal v1 v2 distribution gamma vl v2 mode vl 1 v2 mean v1 v2 All the parameters within one of the parameterisation options shown in Table 2 must be assigned a prior distribution from the above list In addition in models with population structure split and prop must also be assigned a prior The latter is a multivariate parameter and the only prior distribution available is the dirichlet v1 v2 un The case vl v2 vn 1 gives the multivariate uniform distribution The case vl v2 vn 2 is the default prior When n 2 the dirichlet is also known as the beta distribution Prior distributions for scaled genealogical times are assigned implicitly via choice of the demographic model for example under model 0 the standard coalescent the prior TMRCA when the sample size is two has the exponential 10 Parameter Description and comments N Effective population size haploid current Ne under model 1 ancestral Na under model 2 mu Mutation rate a list of prior distributions is required if locustypes 1 theta theta 2N x mu alpha Unscaled growth rate per generation only positive values allowed if migmodel 1 all subpopulations have the same growth rate omega omega N x alpha kappa kappa log N Na Na x alpha x beta beta beta the time at which population growth starts split Time of the first population s
14. 2 2 8 8 summary 10 samples with 10 STR loci and O infinite sites theta 10 height 2 02684 length 6 66922 Figure 2 Listing of the samplein file used in the example top and the resulting sampout and sampout data files middle and bottom 18 m 1 2 6 8 3 4 2 4 3 3 1 2 1 7 6 2 3 3 2 3 2 1 m 3 1 5 2 2 3 6 6 2 4 2 4 1 4 2 3 3 6 6 2 1 3 L 5 2 5 1 3 1 5 6 2 4 4 6 5 5 4 7 9 1 2 3 4 3 7 5 5 4 7 9 1 2 3 4 3 m 10 4 4 2 1 7 9 1 1 11 12 8 3 3 2 1 7 8 2 3 10 11 9 3 3 1 5 7 8 2 3 11 13 examples session1 infile input file for analysis of sample output datafile sampout tree_concensus 1 switch on concensus output picgap 500 samples 10000 warmup 1000 treebetN 10 Nbetsamp 20 thetaprior uniform 0 100 seed 11 Figure 3 Tree output from treeview using sampout data file top and listing of infile for example bottom 19 examples session1 out par output file from example datafile sampout outfile out infsites 0 Analysis Details sizemodel O migmodel O Priors thetaprior uniform 0 100 Program Control Details badness 0 01 seed 11 samples 11000 Nbetsamp 20 treebetN 10 pop_consensus O tree_consensus 1 proportion accepted cutjoin 0 00898727 times 0 256097 haplotype 0 34313 theta 0 0701045 examples session1 out output file from example lltimes llmut llprior theta T
15. BATWING USER GUIDE AMAS Bayesian Analysis of Trees With Internal Node Generation lan Wilson David Balding and Mike Weale Correspondence address lan Wilson Department of Mathematical Sciences University of Aberdeen King s College Aberdeen AB24 3UE UK Email I WilsonQmaths abdn ac uk BATWING Home Page http www maths abdn ac uk ijw This guide gives more details of the program described in Wilson Weale z Balding 2003 Inferences from DNA data population histories evolutionary processes and forensic match probabilities Journal of the Royal Statistical Society Series A Statistics in Society 166 155 188 COPYRIGHT NOTICE 2002 by Ian Wilson David Balding and Mike Weale Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed Guide v1 03 Last Edited 26 June 2003 Contents a es ree E a ese ados ts E bob 2 The BATWING software 2 1 Downloading and Installation o 2 1 1 Unpacking on Unix o 2 1 2 Unpacking on Windows and Macintosh A gone den ae ee A O a ae ee pie Nec ede a ge ee ge ee ae ee PR ck ee OMe He BA E 2 5 1 Population size 2 ee ee a ee a rr a E 2 6 Mutation model settings 2 22000 2 6 1 STR microsatellite loci o corras 26 2 UEP SITES 2 24 s ge sea bd Pa ae ee Ee 2 7 Time scal
16. EP mutation occurred 17 if countcoals 1 the number of coalescences occurring more recently than the start of growth When BATWING has finished running it creates an outfile par text file with details of model settings used in the analysis and acceptance rates for the following Metropolis Hastings proposals cutjoin moving a node to somewhere else in the tree times changing the time of a node haplotype changing the STR haplotype of a node splitprop if migmodel 1 changing the subpopulation size proportions splittime if migmodel 1 changing one of the subpopulation split times mu changing the value of the STR mutation rate N changing the value of N if used alpha if sizemodel 0 changing the value of alpha or omega growth if sizemodel 2 changing the time of start of growth infroot if infsites gt 0 changing the MRCA UEP haplotype In addition to the outfile par file BATWING also generates a number of text files providing a detailed description the current state of the genealogical tree outfile 1 outfile xz describe the tree at regular intervals where 7 samples picgap outfile iit and outfile end describe the tree at the start and at the end of the process Each of these files starts with a description of the current instance of the tree in newick format suitable for viewing by Treeview and other packages The information for each node includes its ID number the UEP and STR haplot
17. TWING command line The unix and windows versions of BATWING are command line driven al though BATWING can be run by double clicking under a Windows environ ment Syntax batwing infile outfile lt seed gt The lt gt notation indicates that seed is optional infile and outfile are re quired and will be prompted for if not supplied infile The input filename the default infile is assumed if not specified e g if BATWING is run by double clicking batwing exe in a Windows environ ment Can contain settings for 1 data file path 2 models and prior distributions and 3 MCMC control parameters such as the number of outputs and the number of update steps between outputs All settings are made via the syntax name value ona separate line Default values exist for many settings see Sections 2 4 to 2 9 and Section outfile The output filename stem default out Various output files are created with this stem and different extensions outfile par contains a copy of the infile information and information about MCMC acceptance rates outfile iit outfile end and outfile z where x is an integer give states of the genealogical tree and other parameters visited by the MCMC algorithm while outfile without an extension gives the parameter val ues output by the algorithm See Section 2 10 seed Seed for the random number generator default value 1 Can also be specified in the infile but a command line seed takes priority 2 4
18. coalescent theory to generate approximate random samples from the posterior distributions of parameters such as mutation rates effective popula tion sizes and growth rates and times of population splitting events It also generates approximate posterior samples of the entire genealogical tree underly ing the sample including the tree height which corresponds to the Time since the Most Recent Common Ancestor TMRCA BATWING does not model the effects of either recombination or selection and hence implicitly assumes that these effects are negligible Note also that BATWING is intended for within species data and not between species data for which many phylogenetic software packages are available BATWING is a direct descendant of MICSAT described in Wilson amp Bald ing Genetics 150 pp 499 510 1998 The principal differences are 1 in addition to the stepwise mutation model SMM for microsatellite loci there are mutation models for unique event polymorphisms UEP and 2 the popu lation demography is extended from constant size random mating populations the standard coalescent model to include population growth and subdivision A manuscript giving detailed illustrations of BATWING based analyses Inferences from DNA data population histories evolutionary processes and forensic match probabilities by Wilson Weale and Balding is currently under review for publication a preprint is available from any of the authors
19. es of the unknowns at one instant of the BATWING run The first line of outfile gives names of the output variables which we now briefly describe I 10 11 12 13 14 11_times the log likelihood of the coalescent times of all interior nodes in the genealogical tree if there are STR loci ll mut the log likelihood of all the STR mutation events that occur between all nodes in the genealogical tree if infsites gt 0 11_inf the log likelihood of all the UEP mutation events that occur between all nodes in the genealogical tree If inftype is set to 2 in infile the value 0 is always output see 2 6 2 ll_allpriors the log prior probability of the current values of all the model hyperparameters This column is also a catch all for all things not covered in the first 3 columns For example if population splitting is specified in the infile then the log likelihood of all population splitting times is also included here The sum of the first 4 columns gives the log posterior probability up to a constant the STR mutation rate either unscaled mu or scaled theta If separate rates have been specified for different STR loci using locustypes then they all appear here in separate columns if a prior has been assigned N the effective population size Ne if sizemodel 1 Na if sizemodel 2 if inftype 1 the UEP mutation rate T the TMRCA or time since the most recent common ancestor
20. es out zx for 1 2 220 We post process the parameter values in out using either R or Splus using the commands gt o lt read table out header T read in the data gt o lt o c 1 1000 remove the first 1000 rows as a burn in gt median o T calculate the median of T gt mean o theta calculate the mean of theta 17 examples sessioni samplein input file for example 1 samplesize 10 theta 10 nsTR 10 seed 16 examples session sampout output file generated from above input 33217823 10 11 5547912343 2513156244 33157823 11 13 5547912343 1762332321 1423366213 2683424331 44217911 11 12 1522366242 examples session sampout data output file generated from above input CCC 1 2 6 8 3 4 2 4 3 3 1 0 0820825 2 1 7 6 2 3 3 2 3 2 1 0 0820825 3 2 6 7 3 4 2 3 3 3 1 0 91005 3 1 5 2 2 3 6 6 2 4 2 0 0751013 4 1 4 2 3 3 6 6 2 1 3 0 0587582 5 2 5 1 3 1 5 6 2 4 4 0 0587582 6 2 5 2 3 3 6 6 2 4 4 0 0163431 6 2 5 2 3 3 6 6 2 4 4 0 917031 6 4 5 3 2 5 3 3 2 6 5 0 0627099 6 5 5 4 7 9 1 2 3 4 3 0 0216645 7 5 5 4 7 9 1 2 3 4 3 0 02166 45 8 5 5 4 7 9 1 2 3 4 3 1 03318 8 5 5 3 3 4 3 3 1 6 5 0 972001 8 3 3 2 1 7 8 2 3 10 11 0 143024 9 3 3 1 5 7 8 2 3 11 13 0 143024 10 3 3 2 2 7 8 2 2 11 12 0 0448991 10 4 4 2 1 7 9 1 1 11 12 0 187923 11 3 3 2 2 7 9 3 2 11 12 1 83892 11 8 1 6 1 5 6
21. he likelihood under BATWING s modelling assumptions does not depend on N which may thus be eliminated from any analyses This reduction in the number of unknowns may lead to more precise inferences about the remaining unknowns However if N is unknown then interpretation of the results is severely limited because times and rates cannot be expressed in terms of generations or years For this reason it is often preferable to work with unscaled rates together with explicit assignment of prior and posterior distributions to N A scaled time output by BATWING can then be converted into an unscaled time in generations via multiplication by the prevailing value of N a further multi plication by generation time leads to a time in years Note however that inferences about unscaled parameters usually depend more sensitively on the prior specification than do inferences for the scaled parameters For example when the sample size is very large the data may convey substantial information about N x mu so that the posterior for theta may be effectively independent of the prior but how this information is allocated to N and mu separately may depend sensitively on their joint prior distribution A further parameterisation option is offered by BATWING under growth model 2 where users can choose to work either with a scaled or unscaled growth rate or with kappa the natural logarithm of current to ancestral pop ulation sizes Since N Ne eee it fo
22. hown at the top of Fig ure 1 Resulting autocorrelation plots for theta and L up to lag 30 are shown at the bottom of the figure Increasing treebetN at the cost of a proportional reduction in Nbetsamp has a slightly deleterious effect on the autocorrelations while decreasing computation times by about 25 All the ACF shown in the figure would usually be regarded as acceptable the number of outputs is the default 1000 12 examples thinning thin1 in datafile data examples ex1 data treebetN 2 Nbetsamp 50 examples thinning thin2 in datafile data examples ex1 data treebetN 5 Nbetsamp 20 examples thinning thin3 in datafile data examples ex1 data treebetN 10 Nbetsamp 10 Series o1 theta Series o2 theta c 2 J o o o Le it O O lt lt Ty o Na gt Hb S PU pi T T T T T T T 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Lag Lag Series o1 L Series o2 L e J o o e J o UL We 3 QO lt Z Y o NA _ o S ias I a I e F e 0 5 10 15 20 25 30 0 5 10 15 20 25 30 Lag Lag 13 ACF ACF 02 04 06 08 1 0 0 0 02 04 06 08 1 0 0 0 Series o3 theta T T T T T T T 15 20 25 30 Lag Series o3 L 15 20 25 30 Lag 2 10 Output BATWING generates a text file outfile of which each line after the first contains a space separated list of floating point or integer numbers giving the valu
23. ing and parameterisation options 2 8 Prior distributions e 2 9 MCMC control settings o 0202000202 2 10 Output eda 4 fa eee ee REE BOR Ee EE ee BG 2 10 1 The newick file format 0 0 00 2 10 2 Post processing 2 o eee ee ee 3 Data Simulation 4 Example session 5 Alphabetical listing of infile options 16 17 22 BATWING Bayesian Analysis of Trees With Internal Node Generation MCMC Markov chain Monte Carlo MRCA Most recent common ancestor SMM Stepwise mutation model SNP Single nucleotide polymorphism STR Short tandem repeat microsatellite TMRCA Time since MRCA UEP Unique event polymorphism Table 1 Acronyms used in the text 1 Introduction 1 1 General description BATWING is a program written in C for the analysis of population genetic data written by lan Wilson now at the University of Aberdeen Its initial development was in collaboration with David Balding now at Imperial College London and formed part of a research project funded by the UK EPSRC under the Stochastic Modelling in Science and Technology programme BATWING is described in Wilson Weale amp Balding JRSS A 166 pp 155 188 Please send comments on this User Guide to L Wilson maths abdn ac uk BATWING reads in multi locus haplotype data and model and prior distri bution specifications and uses a Markov chain Monte Carlo MCMC method based on
24. llows that kappa Ng x alpha x beta omega x beta The parameterisation options available within BATWING under the three population size models are summarised in Table 2 An option is selected by assigning a prior distribution to the corresponding parameters see Section Population Parameter size model code N mu theta alpha omega kappa beta 0 U x x S 1 U x x x S x x 2 U x U x S x g Table 2 Parameterisations available within BATWING under the different models for population size see Table 3 for brief definitions of the parameters Users must choose one of the parameterisation options and assign a prior dis tribution to each of the parameters indicated by a in the corresponding row The code U denotes that N is included as a parameter and hence inferences may be obtained for unscaled rates and times S denotes that only inferences about scaled rates and times are possible Under model 2 the denotes that kappa is included instead of a growth rate parameter below A brief summary of all BATWING parameters that may need to be assigned a prior distribution is given in Table 3 2 8 Prior distributions The distributions supported by BATWING for univariate parameters are uniform lt vl v2 gt uniform on the interval v1 v2 default interval is 0 00 which specifies an improper prior constant v1 or just v1 normal v1 v2 mean v1 SD v2 lognormal v
25. megaprior used see above default none The prior for N x alpha example omegaprior uniform outroot used when infsites gt 0 default none if this is set then the root haplotype is output example outroot 0 picgap used always default 100 The number of BATWING parameter outputs between each output of the tree to a outfile z file example picgap 1000 pop _consensus used if there is a population tree default 0 If this is set then an additional output file outfile pcsn is output giving a list of the population tree shapes in newick format This can be used as input to a program like CONSENSE from Phylip to produce a consensus tree example pop consensus 1 propprior used if migmodel 1 default Dirichlet 2 2 2 The prior for proportion of the total population size taken up by each geo graphic group Must be Dirichlet example propprior dirichlet 4 4 4 24 samples used always default 1000 The number of outputs from a BATWING run beyond any specified by warmup Each output is written on a separate line in the outfile N B samples has nothing to do with data sample size example samples 2000 seed used always default 1 A long unsigned integer specifying the random number generator seed The order of precedence for seed assignment is Command Line gt infile gt initialfile gt default example seed 123 sizemodel used always default 0 Code for the population growth model 0 constant po
26. nd appendix 2 gives shows their use for our example analysis It is recommended that analysis of output from BATWING is done using S plus commercial or R freely available from The file batwing r in the distribution has functions for input and analysis of output from batwing and organises the output into a more usable format as well as producing tables for input into latex documents and word processors BATWING allows the output of trees from the posterior distribution and in order to look at these a tree drawing package is needed A number of programs are available for tree drawing some are listed at http evolution genetics washington edu phylip software html These programs were designed for phylogenetic species trees but can also be used for genealogical trees 3 Data Simulation The two programs SAMPLE and PRIOR distributed with BATWING in the directory sample can be used to simulate data for testing They use an infile similar to that for BATWING but without a datafile assignment Instead they require samplesize to be assigned a list of sample sizes for the simulated data if the list is of length one an unstructured sample is simulated For SAMPLE we also need to assign a value to nSTR the number of STR loci An optional parameter is height which if assigned a positive real number hval results in a tree with approximate height hval obtained by sampling trees until one is within 0 1 of hval The output file
27. pace separated 0 1 characters with length infsites example ancestralinf 0 1 0 0 0 5 UEP loci badness used if initialfile not set default 0 01 A real number used to obtain the initial genealogical tree of the MCMC chain A value gt 1 means the initial tree joins the data sample nodes at random without regard to their haplotype A 0 value means that the first coalescence going back in time happens between the two most similar nodes and so on A value between O and 1 gives a compromise between these two extremes See Wilson amp Balding 1998 for more details example badness 0 1 betaprior used see above default The prior for beta the time at which population growth starts example betaprior gamma 2 1 countcoals used if sizemodel 2 default 0 This is a tool for investigating the convergence behaviour of the chain If set to 1 the number of coalescences more recent than the start of growth is included in each output example countcoals 1 infsites used always default 0 The number of UEP loci example infsites inftype used whenever we have UEP loci default 0 Code for treatment of UEP loci 0 Use UEP data to condition permissible genealogical trees but not to affect the tree likelihood or posterior density in any other way 1 assume the same UEP mutation rate for all UEP loci with a uniform prior UEP positions within the tree contribute to the tree likelihood and posterior density example
28. plit prop Proportion of the total population size taken up by each subpopulation dirichlet is only permitted prior family and dirichlet 2 2 2 is the default Table 3 Summary of the BATWING parameters that may need assignment of a prior distribution according to the parameterisation option selected from those given in Table 2 The final two parameters are used whenever migmodel 1 distribution with expectation and SD both equal to 1 coalescent unit as the sample size gets large the prior expectation and SD of the TMRCA approach 2 and 1 08 units respectively The prior for unscaled genealogical times is assigned implicitly by choice of demographic model and the choice of prior for N the prior expectation of the unscaled TMRCA in generations of a large sample is approximately twice the prior expectation of N To assign a prior distribution append prior to the parameter name shown in Table 3 followed by a colon space and the distribution all on one line of the infile For example thetaprior gamma 20 2 assigns a gamma prior to theta The prior mean is 10 and the prior mode is 9 5 If the ancestral states at the UEP loci are known then these values can be fixed in the BATWING analyses via ancestralinf For example ancestralinf 00000 assigns O to the ancestral state of the five UEP loci If ancestralinf is not set the two states at each locus are a priori equally likely 2 9 MCMC control settings
29. pulation size 1 ex ponential growth at rate alpha at all times 2 exponential growth at rate alpha from a constant size population with growth starting at time beta example sizemodel 2 splitprior used if migmodel 1 default none The prior for the time of the first population split example splitprior gamma 1 1 thetaprior used see above default none The prior for 2N x mu example thetaprior uniform treebetN used always default 10 The number of times that changes to the genealogical tree are attempted before any changes to the hyperparameters are attempted Thus BATWING outputs are separated by treebetNxNbetsamp attempted tree updates example treebetN 10 tree consensus used if there is a population tree default 0 If this is set then an additional output file outfile tcsn is generated listing the shapes of the subpopulation tree in newick format This can be used as input to a program like CONSENSE from Phylip to produce a consensus tree example tree_consensus 1 warmup used always default 0 The number of additional outputs to be generated to allow for burn in of the MCMC algorithm However the additional outputs are not automati cally discarded this must be done explicitly by the user and the number eventually discarded need bear no relation to the value of warmup example warmup 200 25 26
30. s This is an improper prior but always leads to a proper posterior distribution Although negative allele lengths are not excluded under our model positive observed states and a positive ancestral state means that intermediate negative states are extremely unlikely An alternative is the k allele mutation model under which a mutant allele is equally likely to take any one of the k possible states irrespective of the previous state To specify this model kalleles should be assigned a list separated by spaces of the numbers of alleles possible at each locus the loci having the same order as in datafile Under either SMM or k alleles model by default all loci have the same mutation rate this default can be explicitly set by assigning locustypes the integer 1 If locustypes is assigned an integer equal to the number of STR loci there is a distinct mutation rate for each locus The only other setting permitted for locustypes is a list of positive integers whose sum is the number of STR loci If for example there are 11 STR loci and the list is 3 6 2 then there are three distinct mutation rates one common to the first three loci one shared by the next six loci and one shared by the final two loci Clearly the loci have to be ordered so that loci with the same mutation rate are neighbouring 2 6 2 UEP sites Unique Event Polymorphism UEP sites are polymorphic sites at which only two alleles are observed and the investigator assumes
31. s for SAMPLE are outfile a text file with data in a format for input into batwing lt out gt data a text file includes the tree in newick format and includes more informa tion on other parameters PRIOR gives output in the same form as BATWING but is samples from the prior distribution only ignoring any observed data 4 Example session This session is an example of a test session using BATWING SAMPLE and R and Treeview for postprocessing SAMPLE was used to simulate data using syntax sample samplein sampout The input file samplein specifies 10 STR loci no UEP loci a sample size of 10 haplotypes Figure 2 top Two output files are produced sampout which lists the simulated data Figure 2 middle and sampout data which is a newick format file containing information about the total length of the tree and the height in the last line Figure 2 bottom The tree can be displayed using the program treetool solaris or treeview windows or other programs Figure 3 We analyse the data in sampout using BATWING with syntax batwing infile out The infile Figure 3 bottom calls for 11000 BATWING parameter outputs and 220 11000 500 tree outputs The file out par repeats information from infile together with MCMC acceptance rates Figure 4 The file out consists of 11000 rows each giving values for the six parameters 1ltimes llmut llprior theta T and L Figure 4 bottom The 220 tree outputs are given in fil
32. ubpopulations must be assigned to Npopulations in the infile 2 5 Population model settings 2 5 1 Population size The population growth model is specified via sizemodel which can be assigned one of three values default 0 0 Constant effective population size N chromosomes in this case BATWING implements the standard coalescent model 1 Pure exponential growth at rate alpha to a current population size N 2 Exponential growth at rate alpha from a constant size ancestral popula tion of size N with growth starting at scaled time beta before present 2 5 2 Population structure Two options are available for population structure Setting migmodel 0 speci fies that the data are drawn from a single population the default If migmodel 1 then the subpopulation from which a haplotype in datafile is drawn must be specified by a positive integer on the corresponding row of locationfile with 1 indicating missing data If there are missing data then Npopulations must be assigned the number of subpopulations If no data are missing Npopulations is set automatically to the number of distinct codes used in locationfile 2 6 Mutation model settings 2 6 1 STR microsatellite loci The default mutation model is the SMM under which the repeat number changes by one at each mutation event with decreases and increases being equally likely The default prior distribution on the ancestral allele size is uni form on the positive integer
33. xample kappaprior gamma 5 1 locustypes used always default 1 If 1 all STR loci have the same mutation rate If STR loci each STR locus has a distinct mutation rate If a list of integers whose sum STR loci e g 3 6 2 then the first 3 STR loci have the same mutation rate then the next 6 have their own mutation rate and then the final 2 have theirs example locustypes 3 6 2 meantime used always default 0 If set to 1 the mean pairwise coalescence times between individuals in the sample is output example meantime 1 migmodel used always default 0 0 No population substructure 1 samples drawn from subpopulations specified in locationfile example migmodel 23 muprior used see above default none The prior distribution s for the STR mutation rate Value is a list of dis tributions of length according to value of locustypes example muprior gamma 1 1000 Nbetsamp used always default 10 The number of times that changes to hyperparameters are attempted be tween outputs example Nbetsamp 10 npopulations used if missing data in locationfile default 0 Specifies the number of sub populations Normally this is assigned auto matically to the number of distinct calues in locationfile example npopulations 3 Nprior used see above default none The prior for N the effective population size Ne if sizemodel 1 Na if sizemodel 2 example Nprior lognormal 9 1 o
34. ypes and the time in coalescence units Following the newick tree description there is information on the current state of the random number generator and values of the model parameters This information allows 15 a further BATWING run to continue from exactly the same place using the initialfile option in the infile Following the hyperparameter information a final line provides details of the current tree in exactly the same format as it would appear in a line of the outfile If outroot is set in the infile by including a line outroot in the infile then the MRCA haplotype will be output If UEPtimes is set then the minimum and maximum time at which each UEP mutation could have occurred are output If countcoals is set the number of coalescences more recent than the start of growth is output 2 10 1 The newick file format The newick file format is a way of describing trees in computer files A number of documents about it are available at http evolution genetics washington edu phylip newick_doc html 2 10 2 Post processing The use of Splus commercial or R freely available at http cran r project org is recommended for post processing of the output however most simple analyses can be performed using spreadsheet software such as excel A set of S R functions is available for reading and writing output from BATWING and is included in the distribution as R batwing R Appendix 1 of this document describes their use a
Download Pdf Manuals
Related Search
Related Contents
V7 Projector Lamp for selected projectors by TOSHIBA LaCie Big Disk Extreme with Triple Interface 600GB Imprimer le CV Avaya 6408D+ Telephone User Manual (HC-2010032 136 138 单)DRS Philips BDP7500S2 Philips Saeco Energica Focus 0151 6491500 Modelling Dicota Pep.Up Junior Copyright © All rights reserved.
Failed to retrieve file