
Catchall Version 4.0 User Manual

1. below (henceforth SF).
Tau: the upper frequency cutoff. Analysis is based on the frequency counts up to $\tau$; the remaining counts are added ex post facto (see SF).
Observed Sp: the number of species (counts) with frequencies up to $\tau$ only.
Estimated total Sp: the final estimate of the total number of species in the population, including those with sample frequencies > $\tau$.
SE: standard error of the preceding estimate.
Lower CB, Upper CB: lower and upper 95% confidence bounds. Note that the confidence interval is asymmetric (see SF).
GOF0: a raw or naive Pearson goodness-of-fit p-value.
GOF5: Pearson goodness-of-fit p-value with adjacent cells concatenated to achieve a minimum expected cell count of 5, for the asymptotic approximation (see SF).
Best Parm Model, Parm Model 2a, 2b, 2c: These are the parametric models (and choices of $\tau$) selected as optimal by CatchAll according to various goodness-of-fit criteria (see SF). If no best model appears, it is because the stringent GOF criteria required for best status were not satisfied by any model; in this case the user may consider the second-best models (2a to 2c) or the other procedures, including nonparametric estimates and lower bounds.
WLRM: The weighted linear regression model (see SF).
Parm Max Tau, WLRM Max Tau: The best parametric model and the WLRM computed on the entire dataset, i.e., with no discarding of large outlying frequencies, so that $\tau$ = the maximum frequency in the data.
Best Discounted
2. Color worksheet.
3. Bubble Graph Color. This graph shows the behavior of the estimates as $\tau$ is increased, i.e., as outliers (large frequencies) are progressively added to the data. Typically the nonparametric estimates diverge as $\tau$ increases, while the parametric estimates converge. The bubble sizes are proportional to SE/2 in each case; points are not plotted (they are blanked out) if either (i) $\hat{C}$ > 100c or (ii) SE > 10. On the worksheet Bubble Graph Data, click Insert Bubble Graph Data and navigate to the file datasetname_BubblePlot.csv. This imports the required data and automatically generates the bubble plot on the Bubble Graph Color worksheet.
4. Bubble Color No Non-Parametric. This is the bubble plot with the nonparametric sequence deleted, for easier visual comparison of the parametric estimates.
Note that all Excel functions are available under CatchAll.display.xlsm; in particular, one can change the scale of axes, alter colors or shapes, delete plotted data sequences, etc., at will. Thus the program gives the user interactive control over the graphical displays. For more information see the appropriate Excel help screens.

Summary of main output display
Here we briefly describe the main output, as found either in the GUI output screen or (equivalently) in the file datasetname_BestModelsAnalysis.csv.
Total number of observed species: self-explanatory.
Model: the fitted models are described in Statistical Foundations
3. The best parametric model with the low-frequency (high-diversity) component removed, i.e., discounted, to account for uncertainty in the low-frequency sample counts such as singletons. This is usually too drastic a reduction (see SF).
Non-P 1: The statistic known as Chao1, which is a nonparametric lower bound for the total number of species (see SF). Chao1 only uses the first two frequency counts, hence $\tau$ = 2.
Non-P 2: Chao's Abundance-based Coverage Estimator (ACE) or its high-diversity variant ACE1, selected according to whether the coefficient of variation of the data (CV_rare) is < 0.8 (ACE) or > 0.8 (ACE1), at $\tau$ = 10 or the largest $\tau$ < 10. These are nonparametric estimates of the total number of species (see SF).
Non-P 3: Chao's Abundance-based Coverage Estimator (ACE) at $\tau$ = 10 or the largest $\tau$ < 10, regardless of the ACE/ACE1 selection in Non-P 2 (see SF).

CatchAll Version 3.0 Statistical Foundations
by Linda Woodard, Sean Connolly and John Bunge, Cornell University
Sponsored by NSF Grant 0816638
June 7, 2011

1 Introduction
CatchAll is a set of programs for analyzing frequency count data arising from abundance- or incidence-based samples. Given the data, CatchAll estimates the total number of species (or individuals), observed plus unobserved, and provides a variety of competing model fits and model assessments, along with interactive graphical displays of the data, fitted models, and comparisons of estimates. The first
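4. $$\hat{C} = \frac{c_-(\tau)}{1 - f_1/n_-(\tau)} + c_+(\tau) + \frac{f_1}{1 - f_1/n_-(\tau)} \times \tilde{\gamma}^2_{rare} = \text{Good-Turing} + \frac{f_1}{1 - f_1/n_-(\tau)} \times \tilde{\gamma}^2_{rare},$$
where
$$\tilde{\gamma}^2_{rare} = \max\left\{ \gamma^2_{rare} \left( 1 + \frac{\left(f_1/n_-(\tau)\right) \sum_{i=1}^{\tau} i(i-1) f_i}{\left(1 - f_1/n_-(\tau)\right)\left(n_-(\tau) - 1\right)} \right),\ 0 \right\}.$$
ACE is preferred when $\gamma_{rare}$ < 0.8; otherwise ACE1 is preferred. CatchAll makes this selection automatically.
5. Chao-Bunge gamma-Poisson estimator:
$$\hat{C} = c_+(\tau) + \frac{\sum_{i=2}^{\tau} f_i}{1 - f_1 \sum_{i=1}^{\tau} i^2 f_i \big/ n_-(\tau)^2}.$$
This is known to be consistent when the stochastic abundance model is the gamma distribution, i.e., when the sample counts follow the negative binomial distribution.
In each case we also compute a standard error based on an asymptotic approximation due to Chao. The variance for one of these estimators $\hat{C}$ is given by the approximate formula
$$\widehat{\mathrm{Var}}(\hat{C}) \approx \sum_{i \geq 1} \sum_{j \geq 1} \frac{\partial \hat{C}}{\partial f_i} \frac{\partial \hat{C}}{\partial f_j}\, \widehat{\mathrm{cov}}(f_i, f_j),$$
where
$$\widehat{\mathrm{cov}}(f_i, f_j) = \begin{cases} f_i \left(1 - f_i/\hat{C}\right), & i = j, \\ -f_i f_j/\hat{C}, & i \neq j. \end{cases}$$
The empirical standard error of $\hat{C}$ is then $\sqrt{\widehat{\mathrm{Var}}(\hat{C})}$. Thus the problem is to calculate $\partial \hat{C}/\partial f_i$, which in turn depends on the formula for $\hat{C}$ in each case. We omit the specific details here.
Finally we display two analyses of the full dataset, that is, with no right truncation ($\tau = \max \tau$). These are the minimum-AICc parametric model and the preferred ACE/ACE1 choice.

New Features in CatchAll v. 3.0
John Bunge and Linda Woodard
June 7, 2011

3 The weighted linear regression model
This is an approach to analyzing frequency count data based on a novel, completely different concept from either the parametri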
5. of the parameter vector $\theta$, and from this we obtain an estimate $p(0;\hat{\theta})$ of the zero probability $p(0;\theta)$, which is the probability that an arbitrary species is unobserved in the sample. Our final estimate is then
$$\hat{C} = \frac{c}{1 - p(0;\hat{\theta})},$$
where c is the number of observed species in the sample. This estimate has an associated standard error given by
$$\mathrm{SE}(\hat{C}) = \left( \hat{C} \times \left( a_{00} - a_0^T A^{-1} a_0 \right)^{-1} \right)^{1/2},$$
where $a_{00} = (1 - p(0;\theta))/p(0;\theta)$, $a_0 = (1/p(0;\theta))\, \nabla_\theta (1 - p(0;\theta))$, and $A = \mathrm{Info}(\theta; X)$, the Fisher information about $\theta$ in X, all evaluated at $\theta = \hat{\theta}$.
Actually the situation is slightly more complicated. Because frequency count data typically exhibits a large number of rare species (graphically, a steep slope upward to the left) and a small number of very abundant species (a long right-hand tail of outliers), parametric models typically do not fit the entire dataset. Instead some outliers must be deleted. Specifically, we fit a parametric model up to some maximum frequency $\tau$, deleting all of the frequency count data for frequencies > $\tau$, obtaining an estimate that depends on $\tau$: $\hat{C}(\tau)$. To complete the estimate we add the number of species with counts greater than $\tau$, $c_+(\tau)$, and the final estimate is then $\hat{C} = \hat{C}(\tau) + c_+(\tau)$. Similarly, the SE is only computed on the data excluding outliers, i.e., on the frequency counts up to $\tau$. Essentially we regard the frequencies > $\tau$ as constants or fixed points for the purposes of the analysis. This mean
6. CatchAll Version 4.0 User Manual
by Linda Woodard, Sean Connolly and John Bunge
Sponsored by NSF Grant 0816638
July 2013

To cite CatchAll: Bunge, J., Woodard, L., Bohning, D., Foster, J., Connolly, S., and Allen, H. (2012). Estimating population diversity with CatchAll. Bioinformatics 28, 1045-7. doi:10.1093/bioinformatics/bts075. See this paper for a brief account of the operation and statistical theory of the program.

System requirements
There are two types of programs available: the main analysis program, in a variety of flavors (CatchAllName.exe), and an interactive graphics module, CatchAll.display.xlsm, written in Excel 2007, which uses macros that need to be enabled. The graphics module runs only on a Windows platform, assuming Excel 2007 or later is installed. (Apple decided not to enable macros in this version of Excel.) The GUI version of the executable, CatchAllGUI.exe, will only run under Windows. There is also a Windows command-line version, CatchAllcmdW.exe. The .NET Framework must be installed to run either Windows version. In addition, there is a command-line version, CatchAllcmdL.exe, that will run on Mac OS and Linux platforms, provided the appropriate version of Mono has been installed.

Input data
CatchAll is a set of two programs for analyzing data derived from experiments or observations of species abundances or multiple recapture counts. For simplicity we will use the species abundance terminology throug
7. UI version. This is to be read into the Summary Analysis worksheet in CatchAll.display.xlsm (see below).
datasetname_BestModelsFits.csv: Fitted values for the best models as selected by the model selection algorithm (see Statistical Procedures). This is to be read into the Best Fits Data worksheet in CatchAll.display.xlsm (see below).
datasetname_BubblePlot.csv: Analysis data to generate the bubble plot display; this is to be read into the Bubble Graph Data worksheet in CatchAll.display.xlsm (see below).

Graphical Display
The Microsoft Excel-based module CatchAll.display.xlsm generates four displays. To view these, open CatchAll.display.xlsm by double-clicking. Near the top of the screen, click Options > Enable Macros.
1. Summary analysis. This is a copy of the CatchAll output window, formatted for columns. On the worksheet Summary Analysis, click Import Summary Analysis and navigate to the file datasetname_BestModelsAnalysis.csv. This copies the CatchAll summary output display to the worksheet in column-formatted form.
2. Best Fits Color. This is a scatterplot showing the frequency count data (as points) and the various fitted models (best, 2a, 2b, 2c) as curved lines. On the worksheet Best Fits Data, click Import Best Fit Data and navigate to the file datasetname_BestModelsFits.csv. This imports the fitted values from the best selected models and automatically generates the comparative plot on the Best Fits
8. c or the coverage-based nonparametric methods discussed above. The approach is discussed in detail in Rocchetti, Bunge and Bohning (2011), Population size estimation based upon ratios of recapture probabilities, Annals of Applied Statistics (in press as of this writing). Basically, the frequency count data is converted to (adjusted) ratios of successive counts:
$$r(i) := \frac{(i+1) f_{i+1}}{f_i}, \quad i = 1, 2, \ldots$$
Under the unmixed Poisson and the gamma-mixed Poisson (or negative binomial) models, the ratios r(i) form an approximately linear function of i. It is conjectured that under mild departures from these models the linearity is preserved to some degree, i.e., the model is somewhat robust to such departures (this is a topic of current research). It is therefore reasonable to consider linear regression of r(i) on i, that is,
$$r(i) = \frac{(i+1) f_{i+1}}{f_i} = \beta_0 + \beta_1 i + \epsilon_i, \quad i = 1, 2, \ldots$$
Having fit such a regression model in the usual way, one can then project the model downwards so as to obtain an estimate (prediction) of $f_0$, and hence an estimate of the total diversity. The same procedure yields standard errors and goodness-of-fit assessments. There are three secondary considerations here.
1. This model is inherently heteroscedastic, and consequently weighted linear regression must be used (hence the name); the weights are computed automatically by CatchAll according to the specification in Rocchetti et al.
2. In some cases a log-transformed reg
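The ratio construction can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not CatchAll's implementation: the function names are hypothetical, the fit is unweighted unless weights are passed in (the weights specified by Rocchetti et al. are not reproduced here), and the projection step assumes that extending the ratio definition to i = 0 gives r(0) = f_1/f_0, so that f_0 is estimated by f_1 divided by the fitted intercept.

```python
def successive_ratios(freq):
    """Return [(i, r(i))] with r(i) = (i + 1) * f_{i+1} / f_i.

    freq maps frequency i to the number of species f_i. The loop stops at
    the first gap so that no ratio divides by zero.
    """
    ratios = []
    i = 1
    while freq.get(i, 0) > 0 and freq.get(i + 1, 0) > 0:
        ratios.append((i, (i + 1) * freq[i + 1] / freq[i]))
        i += 1
    return ratios


def fit_line(points, weights=None):
    """(Weighted) least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    if weights is None:
        weights = [1.0] * len(points)  # assumption: unweighted illustration
    total = sum(weights)
    mean_x = sum(w * x for w, (x, _) in zip(weights, points)) / total
    mean_y = sum(w * y for w, (_, y) in zip(weights, points)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, (x, _) in zip(weights, points))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, (x, y) in zip(weights, points))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Synthetic counts whose ratios are exactly linear: r(i) = 0.5 + 0.5 * i.
counts = {1: 1000, 2: 500, 3: 250, 4: 125}
b0, b1 = fit_line(successive_ratios(counts))
f0_hat = counts[1] / b0                   # projected number of unseen species
total_hat = sum(counts.values()) + f0_hat
```

With real data the intercept can be small or negative, in which case this naive projection is not usable.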
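9. e five progressively more complicated models, which we refer to as order 0, 1, 2, 3, and 4.
0. Poisson. Here the stochastic abundance distribution is a point mass at a fixed $\lambda$, i.e., all of the species sizes are assumed to be equal. This is rarely if ever realistic and almost never fits real data, but it provides a readily computable lower-bound benchmark, since heterogeneous species sizes will render this model downwardly biased.
1. Single exponential-mixed Poisson. The stochastic abundance distribution is exponential:
$$f(\lambda;\theta) = \frac{1}{\theta} e^{-\lambda/\theta}, \quad \lambda > 0,\ \theta > 0.$$
The mixed Poisson distribution of the frequency counts is then the geometric:
$$p(j;\theta) = P(X = j;\theta) = \frac{1}{1+\theta} \left( \frac{\theta}{1+\theta} \right)^j, \quad j = 0, 1, 2, \ldots;\ \theta > 0.$$
2. Mixture of two exponentials-mixed Poisson. The stochastic abundance distribution is a mixture of two exponentials, and the mixed Poisson distribution is then a mixture of two geometrics:
$$p(j;\theta) = \theta_3 \frac{1}{1+\theta_1} \left( \frac{\theta_1}{1+\theta_1} \right)^j + (1 - \theta_3) \frac{1}{1+\theta_2} \left( \frac{\theta_2}{1+\theta_2} \right)^j,$$
$j = 0, 1, 2, \ldots$; $\theta_1, \theta_2 > 0$; $0 < \theta_3 < 1$.
3. Mixture of three exponentials-mixed Poisson. The stochastic abundance distribution is a mixture of three exponentials, and the mixed Poisson distribution is then a mixture of three geometrics:
$$p(j;\theta) = \theta_4 \frac{1}{1+\theta_1} \left( \frac{\theta_1}{1+\theta_1} \right)^j + \theta_5 \frac{1}{1+\theta_2} \left( \frac{\theta_2}{1+\theta_2} \right)^j + (1 - \theta_4 - \theta_5) \frac{1}{1+\theta_3} \left( \frac{\theta_3}{1+\theta_3} \right)^j,$$
$j = 0, 1, 2, \ldots$; $\theta_1, \theta_2, \theta_3 > 0$; $0 < \theta_4, \theta_5 < 1$.
4. Mixture of four exponentials-mixed Poisson. The stochastic abundance distribution is a mixture of four exponentials, and the m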
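For the order 1 model the fit has a closed form, which makes a convenient worked example. Conditional on being observed, a geometric count has mean 1 + θ, so the zero-truncated maximum likelihood estimate is θ = n/c - 1, where n is the number of individuals and c the number of species at frequencies up to the cutoff; the zero probability is then 1/(1 + θ) = c/n. The Python sketch below applies this to counts up to a cutoff and adds back the species above it. The function name and the sample counts are hypothetical, and this is not a description of CatchAll's numerical routines.

```python
def geometric_model_estimate(freq, tau):
    """Order 1 (single exponential-mixed Poisson) estimate of total species.

    freq maps frequency i to the number of species f_i; only frequencies
    up to tau are fitted, and species above tau are added back afterwards.
    """
    c_minus = sum(f for i, f in freq.items() if i <= tau)
    c_plus = sum(f for i, f in freq.items() if i > tau)
    n_minus = sum(i * f for i, f in freq.items() if i <= tau)
    if n_minus <= c_minus:
        raise ValueError("all fitted species are singletons; estimate is unbounded")
    theta_hat = n_minus / c_minus - 1   # zero-truncated geometric MLE
    p0_hat = 1 / (1 + theta_hat)        # probability a species is unobserved
    c_hat_tau = c_minus / (1 - p0_hat)  # estimate from the fitted range
    return theta_hat, p0_hat, c_hat_tau + c_plus


theta_hat, p0_hat, total_hat = geometric_model_estimate({1: 6, 2: 3, 3: 1, 50: 1}, tau=10)
```

In this made-up example 10 species and 15 individuals fall at or below the cutoff, so θ is estimated as 0.5, two thirds of the fitted species are estimated to be unseen, and the final estimate is 30 + 1 = 31.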
10. hout this manual, but the same methods can be applied to the total counts (row sums) of recaptures in a multiple-recapture or multiple-list study.
The fundamental dataset consists of frequency counts. This is a list of frequencies of occurrence, followed by the number of species occurring the given number of times in the sample. For example, in the following dataset:

1,295
2,63
3,30
4,6
5,4
6,6
7,1
9,6
11,1
12,2
13,1
14,1
17,1
21,1
25,1
30,1
31,1
55,1
69,1
86,1

there are 295 species with exactly one representative in the sample (called singletons), 63 species with two representatives, 30 with 3, and so on; and then there are some large or very abundant species in the right tail: 1 species with 55 representatives, 1 with 69, and 1 (the largest or most abundant) with 86. This structure, with a large number of rare species and a small number of very abundant species, is typical. The dataset must be in this comma-delimited format (frequency,count), with filename equal to datasetname.csv or datasetname.txt.

Analysis with the GUI version (CatchAllGUI.exe)
To read in the data, start CatchAll by double-clicking on CatchAll.exe; use the Locate Input Data button to navigate to your dataset, and double-click on the appropriate file. CatchAll then displays the first 10 lines of your dataset in a small window for verification; click OK. Once the data are loaded, perform the analysis by clicking one of the Run Program buttons. After a sh
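Before handing a file to CatchAll it can be useful to check that it really is in the frequency,count layout. The following Python sketch is one way to do that; it is not part of CatchAll, the function name is hypothetical, and it assumes one comma-separated pair per line with no header row.

```python
import csv
import io


def read_frequency_counts(text):
    """Parse comma-delimited 'frequency,count' lines into sorted (frequency, count) pairs."""
    pairs = []
    for row in csv.reader(io.StringIO(text)):
        if not row or not any(cell.strip() for cell in row):
            continue  # skip blank lines
        frequency, count = int(row[0]), int(row[1])
        if frequency < 1 or count < 1:
            raise ValueError(f"invalid frequency/count line: {row}")
        pairs.append((frequency, count))
    return sorted(pairs)


# First three lines and last three lines of the example dataset above.
SAMPLE = "1,295\n2,63\n3,30\n55,1\n69,1\n86,1\n"

pairs = read_frequency_counts(SAMPLE)
observed_species = sum(count for _, count in pairs)       # number of species seen
individuals = sum(freq * count for freq, count in pairs)  # number of individuals seen
```

For a file on disk, the same function can be applied to the result of reading the file as text.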
11. ixed Poisson distribution is then a mixture of four geometrics:
$$p(j;\theta) = \theta_5 \frac{1}{1+\theta_1} \left( \frac{\theta_1}{1+\theta_1} \right)^j + \theta_6 \frac{1}{1+\theta_2} \left( \frac{\theta_2}{1+\theta_2} \right)^j + \theta_7 \frac{1}{1+\theta_3} \left( \frac{\theta_3}{1+\theta_3} \right)^j + (1 - \theta_5 - \theta_6 - \theta_7) \frac{1}{1+\theta_4} \left( \frac{\theta_4}{1+\theta_4} \right)^j,$$
$j = 0, 1, 2, \ldots$; $\theta_1, \theta_2, \theta_3, \theta_4 > 0$; $0 < \theta_5, \theta_6, \theta_7 < 1$.
CatchAll computes all five models at every value of $\tau$ having non-zero frequency count in the data. This generates a combinatorial explosion of analyses, one for each (model, $\tau$) combination, which then must be sifted to find a best model, or at least a collection of best models. We do this according to the following algorithm, which combines statistical principles and heuristic decisions based on empirical experience.

Model selection algorithm
1. (Statistical) Eliminate (model, $\tau$) combinations for which GOF5 < 0.01.
2. (Statistical) For each $\tau$, select the model with minimum AICc (Akaike Information Criterion, corrected where necessary for small sample sizes).
3. (Heuristic) Eliminate (model, $\tau$) combinations for which estimate > 100 × ACE1, where ACE1 is the estimate at $\tau$ = 10.
4. (Heuristic) Eliminate (model, $\tau$) combinations for which SE > estimate/2.
5. (Heuristic) Then:
- Best model: Select the largest $\tau$ for which GOF0 > 0.01.
- Model 2a: Select the $\tau$ with maximum GOF0.
- Model 2b: Select the largest $\tau$.
- Model 2c: Select $\tau$ as close as possible to, but $\leq$, 10.
6. (Heuristic)
- If all (model, $\tau$) combinations are eliminated, allow GOF5 above 0.001 but keep SE < estimate/2.
- If all combinations are still eliminated, allow GOF5 above 0.001 and allow SE up to the estimate.
- If there are still no combinations, require calculable (computable) GOF5 but impose no restrictions on SE.

2.2 Nonparametric procedures
We compute five nonparametric estimates of C. All derive directly or indirectly from the coverage-based approach, under which the estimate of the total number of species is based on an estimate of the coverage of the sample: the proportion of the population represented by the sampled species. We compute the nonparametric estimates at every $\tau$, as we do for the parametric estimates, but we report them only for $\tau$ = 10 (or the nearest possible value), because they tend to be highly sensitive to outliers; see the section on CatchAll.display.xlsm for more discussion of this point.
1. Good-Turing (also called the homogeneous model, i.e., equal species sizes; same assumption as Model 0 (Poisson) above):
$$\hat{C} = \frac{c_-(\tau)}{1 - f_1/n_-(\tau)} + c_+(\tau), \quad \text{where } n_-(\tau) = \sum_{i=1}^{\tau} i f_i.$$
2. Chao1:
$$\hat{C} = \begin{cases} c + f_1 (f_1 - 1)/2, & f_2 = 0, \\ c + f_1^2/(2 f_2), & f_2 > 0. \end{cases}$$
This is generally regarded as a lower bound for C.
3. ACE (Abundance-based Coverage Estimator):
$$\hat{C} = \frac{c_-(\tau)}{1 - f_1/n_-(\tau)} + c_+(\tau) + \frac{f_1}{1 - f_1/n_-(\tau)} \times \gamma^2_{rare} = \text{Good-Turing} + \frac{f_1}{1 - f_1/n_-(\tau)} \times \gamma^2_{rare},$$
where
$$\gamma^2_{rare} = \max\left\{ \frac{c_-(\tau)}{1 - f_1/n_-(\tau)} \cdot \frac{\sum_{i=1}^{\tau} i(i-1) f_i}{n_-(\tau) \left( n_-(\tau) - 1 \right)} - 1,\ 0 \right\}.$$
4. ACE1 (Abundance-based Coverage Estimator for highly heterogeneous cases):
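The first three estimators in this list are simple enough to compute by hand, and the Python sketch below does so using their standard published forms. It is an illustration only: the function name and return keys are hypothetical, the default cutoff of 10 follows the reporting convention described above, and the standard-error calculations are not included.

```python
def nonparametric_estimates(freq, tau=10):
    """Good-Turing, Chao1 and ACE estimates of the total number of species.

    freq maps frequency i to the number of species f_i observed i times.
    """
    rare = {i: f for i, f in freq.items() if i <= tau}
    c_minus = sum(rare.values())                         # species at or below tau
    c_plus = sum(f for i, f in freq.items() if i > tau)  # species above tau
    n_minus = sum(i * f for i, f in rare.items())        # individuals at or below tau
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    coverage = 1 - f1 / n_minus if n_minus else 0.0
    if coverage <= 0 or n_minus < 2:
        raise ValueError("estimated coverage is zero; estimators are undefined")

    c = c_minus + c_plus
    chao1 = c + f1 * f1 / (2 * f2) if f2 > 0 else c + f1 * (f1 - 1) / 2
    good_turing = c_minus / coverage + c_plus
    spread = sum(i * (i - 1) * f for i, f in rare.items())
    gamma2 = max((c_minus / coverage) * spread / (n_minus * (n_minus - 1)) - 1, 0.0)
    ace = good_turing + (f1 / coverage) * gamma2
    return {"good_turing": good_turing, "chao1": chao1, "ace": ace, "gamma2_rare": gamma2}


example = nonparametric_estimates({1: 3, 3: 1})
```

When the estimated squared coefficient of variation is zero, ACE reduces to the Good-Turing estimate, which is a quick sanity check on any implementation.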
13. odel. Once the first CatchAll run is complete, we obtain (as detailed above) a best fitted parametric model at a selected value of $\tau$. This may be a mixture model of order 0, 1, 2, 3, or 4. If the selected order is 4, 3, or 2, CatchAll deletes the highest-diversity component, i.e., the component associated with fitting the lowest frequency counts. The resulting model is one order lower (4, 3, or 2 converts to 3, 2, or 1, respectively), and this step-down model is reported as the best discounted model. Specifically the formulae are:
- Step down from four-mixed to three-mixed: $\tilde{C}(\tau) = (1 - \hat{\theta}_5) \times \hat{C}(\tau)$, $\widetilde{SE} = (1 - \hat{\theta}_5) \times SE$.
- Step down from three-mixed to two-mixed: $\tilde{C}(\tau) = (1 - \hat{\theta}_4) \times \hat{C}(\tau)$, $\widetilde{SE} = (1 - \hat{\theta}_4) \times SE$.
- Step down from two-mixed to one-mixed: $\tilde{C}(\tau) = (1 - \hat{\theta}_3) \times \hat{C}(\tau)$, $\widetilde{SE} = (1 - \hat{\theta}_3) \times SE$.
Here $\tilde{C}(\tau)$ is the step-down (reduced model) estimate of total diversity based on the frequency count data up to $\tau$, and $\hat{C}(\tau)$ is the original estimate of total diversity based on the frequency count data up to $\tau$. Counts for frequencies > $\tau$ are added in ex post facto as usual. A graphical display of an example, using viral diversity data, is given in Figure 1.

Figure 1: Best discounted model, stepping down from order 3 to order 2 (component 1 is deleted).
[Plot: fitted curves for Component 3, Components 2+3, and Components 1+2+3 together with the observed counts; logarithmic horizontal axis from 1 to 100.]
15. ort time, ranging from < 1 sec to < 5 minutes, the Model Analysis Completed button will appear; click OK. (N.B. the 4 Mixed Exponential Model will take longer to calculate.) A summary of the analysis appears in the Best Models window, and the OUTPUT FILES window displays the pathnames for the files used by the interactive graphics program CatchAll.display.xlsm (see below for details). Other files created by the program are located in the same folder. See Output below for a detailed description of these files.

Analysis with the Windows command-line version (CatchAllcmdW.exe)
At least two parameters must be supplied to the Windows command-line version: the input filename (complete path if not in the same directory as the executable) and the path to the directory where the output files will be written. If no such folder exists it will be created. See Output below for a detailed description of these files. Optionally, you can include a flag that controls whether the program calculates the 4 Mixed Exponential Model; the default is to calculate it. (N.B. the 4 Mixed Exponential Model will take longer to calculate.)

Calculate 4 Mixed Exponential Model:
    CatchAllcmdW.exe inputfilename outputpath
or
    CatchAllcmdW.exe inputfilename outputpath 1
Don't calculate 4 Mixed Exponential Model:
    CatchAllcmdW.exe inputfilename outputpath 0

Analysis with the Linux command-line version (CatchAllcmdL.exe)
Mono must be in
16. program is CatchAll.exe, which performs the necessary statistical and numerical analysis, and the second is CatchAll.display.xlsm, which is a Microsoft Excel-based program that generates the graphical displays. We first discuss the statistical procedures underlying the main program.

2 Statistical procedures implemented by CatchAll
2.1 Parametric models
We fit a suite of five parametric models to the data. These are increasingly complex versions of the standard model for species estimation. For full mathematical details see, e.g., Bunge and Barger (2008); here we give a sketch intended to briefly explain the computations performed by the program.
We assume that there is a fixed number of species $C < \infty$ in the population. The ith species contributes a random number $X_i$ of individuals to the sample, where $X_i = 0, 1, 2, \ldots$. If $X_i = 0$ then the ith species is unobserved. $X_i$ has a Poisson distribution with mean $E(X_i) = \lambda_i$, $i = 1, \ldots, C$, and in general we assume that $\lambda_1, \ldots, \lambda_C$ are distributed according to a stochastic abundance model, that is, a probability distribution with probability density function, say, $f(\lambda)$. The stochastic abundance distribution depends on some number of parameters (in our implementation there are at most 7 parameters), called $\theta$. The observed frequency count data is then unconditionally distributed as zero-truncated f-mixed Poisson. We fit this distribution to the data via maximum likelihood, which yields an estimate $\hat{\theta}$
17. ression must be used to avoid certain edge effects which can lead to negative predictions for $f_0$. CatchAll automatically selects between the log-transformed and the untransformed (original) models and reports the selected model in the summary output. However, all models are reported in the copious output.
3. As with all of the procedures implemented by CatchAll, the results depend to varying degrees on the right truncation point $\tau$. CatchAll automatically selects an optimal $\tau$ based on goodness-of-fit criteria and reports the corresponding results in the summary output; as usual, results at all values of $\tau$ are reported in the copious output. Note that in order to avoid division by zero, the set of frequency counts used by the weighted linear regression procedure must be contiguous, that is, gaps in the frequency counts are not allowed. Hence the maximum possible $\tau$ for this model is the maximum of the contiguous frequency counts, which may not be the actual maximum frequency count.

4 Best discounted model
This procedure is intended to address the scenario in which the low frequency counts may be inaccurately recorded. That is, for example, the number of singletons or other very low frequency counts may be artificially inflated due to errors in measurement or registration, or to other experimental or observational artifacts. To address this we compute a best discounted model. This is based on the multiple-component parametric mixture m
18. s that we compute every model at every possible value of $\tau$ and compare the results; we return to this issue below.
For confidence intervals we do not use the Wald or normal-approximation interval, $\hat{C} \pm 1.96\,SE$, for various reasons. Instead we implement an asymmetric interval based on a lognormal transformation, proposed by Chao (1987), Estimating the population size for capture-recapture data with unequal catchability, Biometrics 43(4), 783-791. Let $c_-(\tau)$ denote the total number of species with frequency counts $\leq \tau$, so that $c_-(\tau) + c_+(\tau) = c$ for all $\tau$. The lognormal-based interval is then
$$\left( c + \frac{\hat{C}(\tau) - c_-(\tau)}{d},\ \ c + \left( \hat{C}(\tau) - c_-(\tau) \right) \times d \right),$$
where
$$d = \exp\left\{ 1.96 \sqrt{ \log\left( 1 + \frac{SE^2}{\left( \hat{C}(\tau) - c_-(\tau) \right)^2} \right) } \right\}.$$
We also compute two goodness-of-fit measures. GOF0 is the p-value for the Pearson $\chi^2$ goodness-of-fit test comparing the observed frequencies to the expected frequencies under the fitted model. This measure uses no adjustment for low cell counts, that is, every frequency is compared to its corresponding expected frequency. Since the $\chi^2$ test is based on an asymptotic approximation requiring cell counts $\geq 5$ (although there is not a consensus on this value), we also compute a p-value for the Pearson $\chi^2$ test after concatenating adjacent cells so as to achieve a minimum expected cell count of 5 under the fitted model; this is GOF5. Since the null hypothesis in both cases is that the model fits, larger p-values support the choice of model.
We comput
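The lognormal interval is easy to evaluate once the estimate and its standard error are known. The Python sketch below is a minimal illustration using the interval in the form given by Chao (1987), written in terms of the estimated number of unseen species; the function name, argument names and the numbers in the example are all hypothetical.

```python
import math


def lognormal_interval(c_hat_tau, se, c_minus, c_plus, z=1.96):
    """Asymmetric lognormal confidence interval for the total number of species.

    c_hat_tau: estimate based on frequencies up to tau
    se:        standard error of that estimate
    c_minus:   observed species with frequency <= tau
    c_plus:    observed species with frequency > tau
    """
    unseen = c_hat_tau - c_minus  # estimated number of unobserved species
    if unseen <= 0 or se < 0:
        raise ValueError("need an estimate above the observed count and se >= 0")
    c = c_minus + c_plus          # all observed species
    d = math.exp(z * math.sqrt(math.log(1 + (se / unseen) ** 2)))
    return c + unseen / d, c + unseen * d


# Made-up inputs: 100 species at or below tau, 5 above, estimate 150, SE 20.
lower, upper = lognormal_interval(c_hat_tau=150.0, se=20.0, c_minus=100, c_plus=5)
```

The lower bound can never fall below the number of species actually observed, and the upper bound lies further from the point estimate than the lower bound does, which is the asymmetry the summary output refers to.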
19. voked to run this executable. At least two parameters must be supplied to the Linux/Mac command-line version: the input filename (complete path if not in the same directory as the executable) and the path to the directory where the output files will be written. If no such folder exists it will be created. See Output below for a detailed description of these files. Optionally, you can include a flag that controls whether the program calculates the 4 Mixed Exponential Model; the default is to calculate it. (N.B. the 4 Mixed Exponential Model will take longer to calculate.)

Calculate 4 Mixed Exponential Model:
    mono CatchAllcmdL.exe inputfilename outputpath
or
    mono CatchAllcmdL.exe inputfilename outputpath 1
Don't calculate 4 Mixed Exponential Model:
    mono CatchAllcmdL.exe inputfilename outputpath 0

Output
Running the analysis program generates a number of files. If you use the GUI version, a folder called Output is created in the same directory as the input file. If you use either command-line version, these files are put in the folder you designate. This folder contains the following files:
datasetname_Analysis.csv: This is a complete listing of all information from all analyses performed by CatchAll. See the section Statistical Procedures below for details.
datasetname_BestModelsAnalysis.csv: Column-formatted copy of summary analysis output as displayed in the main CatchAll window when using the G
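When many datasets are analyzed in a batch, it can help to assemble the command line programmatically. The Python sketch below only builds the argument list described above and the two output file names listed so far; it does not run anything. The helper names are hypothetical, and the executable spellings CatchAllcmdW.exe and CatchAllcmdL.exe are assumed from the section headings.

```python
def catchall_command(input_file, output_dir, four_mixed=True, use_mono=False):
    """Build the argument list for the command-line version of CatchAll.

    four_mixed=False appends the flag 0, which skips the 4 Mixed Exponential Model.
    use_mono=True targets the Linux/Mac executable, which is launched through mono.
    """
    executable = "CatchAllcmdL.exe" if use_mono else "CatchAllcmdW.exe"
    command = ["mono", executable] if use_mono else [executable]
    return command + [input_file, output_dir, "1" if four_mixed else "0"]


def summary_file_names(dataset_name):
    """Names of the full listing and the summary file for a given dataset name."""
    return [f"{dataset_name}_Analysis.csv", f"{dataset_name}_BestModelsAnalysis.csv"]


cmd = catchall_command("viral.csv", "results", four_mixed=False, use_mono=True)
files = summary_file_names("viral")
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the executable.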
