
4.3 Automated Convergence Criteria

Gelman and Shirley (2011) describe recommended practice for assessing pseudo-convergence of a Markov chain. The most common technique is to run several Markov chains and look at the ratio of within-chain variance to between-chain variance. This ratio is called R̂, and it comes in both a univariate (one parameter at a time) and a multivariate version. It is natural to look at parameters with the same name together (e.g., μ₀, β₀, τ₀ and γ₀). Using the multivariate version of R̂ with α₀ and π requires some care, because calculating the multivariate R̂ involves inverting a matrix that does not have full rank when the parameters are restricted to a simplex. The work-around is to only look at the first K − 1 values of these parameters.

The other commonly used test for pseudo-convergence is to look at a trace plot for each parameter: a time series plot of the sampled parameter value against the MCMC cycle number. Multiple chains can be plotted on top of each other using different colors. If the chain has converged and the sample size is adequate, the traceplot should look like white noise. If this is not the case, a bigger MCMC sample is needed. While it is not particularly useful for an automated test of pseudo-convergence, the traceplot is very useful for diagnosing what is happening when the chains are not converging. Some sample traceplots are given below. Even if the chains have reached pseudo conve
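As a supplement to this discussion, a minimal sketch of how such an automated check might be scripted with the coda package; the 1.1 and 100 thresholds are the defaults mentioned in this paper, and the object name samp is an assumption:

    library(coda)

    # samp: an mcmc.list containing the monitored cross-student parameters
    checkConvergence <- function(samp, Rhat.max = 1.1, neff.min = 100) {
      Rhat <- gelman.diag(samp, multivariate = FALSE)$psrf[, "Point est."]
      neff <- effectiveSize(samp)
      list(converged = all(Rhat < Rhat.max) && all(neff > neff.min),
           Rhat = Rhat, neff = neff)
    }

    # Traceplots and density plots for visual diagnosis when chains are not converging
    plot(samp, trace = TRUE, density = TRUE)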
A Comparison of Two MCMC Algorithms for Hierarchical Mixture Models

Russell G. Almond, Florida State University

Abstract

Mixture models form an important class of models for unsupervised learning, allowing data points to be assigned labels based on their values. However, standard mixture model procedures do not deal well with rare components. For example, pause times in student essays have different lengths depending on what cognitive processes a student engages in during the pause. However, instances of student planning (and hence very long pauses) are rare, and thus it is difficult to estimate those parameters from a single student's essays. A hierarchical mixture model eliminates some of those problems by pooling data across several of the higher-level units (in the example, students) to estimate parameters of the mixture components. One way to estimate the parameters of a hierarchical mixture model is to use MCMC. But these models have several issues, such as non-identifiability under label switching, that make them difficult to estimate just using off-the-shelf MCMC tools. This paper looks at the steps necessary to estimate these models using two popular MCMC packages: JAGS (random walk Metropolis algorithm) and Stan (Hamiltonian Monte Carlo). JAGS, Stan and R code to estimate the models, and model fit statistics, are published along with the paper.

Key words: Mixture Models, Markov Chain Monte Carlo, JAGS, Stan
(Eds.). (2011). Handbook of Markov chain Monte Carlo. CRC Press.

Deane, P. (2012). Rethinking K-12 writing assessment. In N. Elliot & L. Perelman (Eds.), Writing assessment in the 21st century: Essays in honor of Edward M. White (pp. 87-100). Hampton Press.

Frühwirth-Schnatter, S. (2001). Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association, 96(453), 194-209.

Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis (3rd ed.). Chapman and Hall. (The third edition is revised and expanded and has material that the earlier editions lack.)

Gelman, A., & Shirley, K. (2011). Inference from simulations and monitoring convergence. In S. Brooks, A. Gelman, G. L. Jones, & X. Meng (Eds.), Handbook of Markov chain Monte Carlo (pp. 163-174). CRC Press.

Geyer, C. J. (2011). Introduction to Markov chain Monte Carlo. In S. Brooks, A. Gelman, G. L. Jones, & X. Meng (Eds.), Handbook of Markov chain Monte Carlo (pp. 3-48). CRC Press.

Gruen, B., & Leisch, F. (2008). FlexMix version 2: Finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software, 28(4), 1-35. Retrieved from http://www.jstatsoft.org/v28/i04/

Li, T. (2013). A Bayesian hierarchical mixture approach to model timing data with application to writing assessment. Unpubli
Unsurprisingly, the higher the value of K (number of components in the estimated model), the longer the run took. The runs for K = 2 took between 10-15 minutes in JAGS and from 30-90 minutes in Stan, while the runs for K = 4 took just less than an hour in JAGS and between 2-3 hours in Stan. The time difference between JAGS and Stan is due to the difference in the algorithms: HMC uses a more complex proposal distribution than RWM, so naturally each cycle takes longer. The goal of the HMC is to trade the longer run times for a more efficient (lower autocorrelation) sample, so that the total run time is minimized. The constrained version of Stan seemed to take longer than the unconstrained.

In all 27 runs, chain lengths of 15,000 cycles were not sufficient to reach pseudo-convergence. The maximum (across all cross-student parameters) value for R̂ varied between 1.3 and 2.2 depending on the run. It seemed to be slightly worse for cases in which K (number of components for data generation) was even and K (estimated number of components) was odd, or vice versa. The values of R̂ for the constrained Stan model were substantially worse, basically confirming the findings of Frühwirth-Schnatter (2001).

The effective sample sizes were much better for Stan and HMC. For the K = 2 case, the smallest effective sample size ranged from 17-142, while for the unconstrained Stan model it ranged from 805-3084. Thus, roughly 3 times the C
L_i(Y_i | π_i, μ_i, σ_i) = ∏_{j=1}^{J_i} ∑_{k=1}^{K} π_{i,k} φ((Y_{i,j} − μ_{i,k}) / σ_{i,k}) ,   (1)

where φ(·) is the unit normal density.

Although conceptually simple, there are a number of issues that mixture models can have. The first issue is that the component labels cannot be identified from data. Consider the case with two components. The model created by swapping the labels for Components 1 and 2, with new parameters π'_{i,1} = π_{i,2}, π'_{i,2} = π_{i,1}; μ'_{i,1} = μ_{i,2}, μ'_{i,2} = μ_{i,1}; and σ'_{i,1} = σ_{i,2}, σ'_{i,2} = σ_{i,1}, has an identical likelihood. For the K-component model, any permutation of the component labels produces a model with identical likelihood. Consequently, the likelihood surface is multimodal. A common solution to the problem is to identify the components by placing an ordering constraint on one of the three parameter vectors: π_i, μ_i or σ_i. Section 4.1 returns to this issue in practice.

A second issue involves degenerate solutions, which contain only a single data point. If a mixture component has only a single data point, its standard deviation σ_{i,k} will go to zero and π_{i,k} will approach 1/J_i, which is close to zero for large Level 1 data sets. Note that if either π_{i,k} → 0 or σ_{i,k} → 0 for any k, then the likelihood will become singular.

Estimating σ_{i,k} requires a minimum number of data points from Component k (π_{i,k}·J_i > 5 is a rough minimum). If π_{i,k} is believed to be small for some k, then a large Level 1 sample is needed. As K increases, the smallest value of π_{i,k}
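To make the mixture likelihood concrete, here is a minimal R sketch (not from the published scripts; the function and argument names are my own) that evaluates the log-likelihood of one student's data under a K-component normal mixture, using R's dnorm for the component densities:

    # Log-likelihood of one student's data under a K-component normal mixture
    # y     : vector of observations (e.g., log pause times) for one student
    # pi    : vector of K mixing probabilities (sums to 1)
    # mu    : vector of K component means
    # sigma : vector of K component standard deviations
    mixLogLik <- function(y, pi, mu, sigma) {
      ll <- sapply(y, function(yj) {
        # Sum the component densities weighted by the mixing probabilities
        log(sum(pi * dnorm(yj, mean = mu, sd = sigma)))
      })
      sum(ll)
    }

    # Example: a common fast component and a rare, long-pause component
    mixLogLik(rnorm(100), pi = c(0.9, 0.1), mu = c(0, 2), sigma = c(0.5, 1))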
WAIC. Paper presented at the Bayesian Application Workshop at Uncertainty in Artificial Intelligence Conference 2014, Quebec City, Canada.

1 Introduction

Mixture models (McLachlan & Peel, 2000) are a frequently used method of unsupervised learning. They sort data points into clusters based on just their values. One of the most frequently used mixture models is a mixture of normal distributions. Often the mean and variance of each cluster is learned along with the classification of each data point.

As an example, Almond, Deane, Quinlan, Wagner, and Sydorenko (2012) fit a mixture of lognormal distributions to the pause times of students typing essays as part of a pilot writing assessment. (Alternatively, this model can be described as a mixture of normals fit to the log pause times.) Almond et al. found that mixture models seemed to fit the data fairly well. The mixture components could correspond to different cognitive processes used in writing (Deane, 2012), where each cognitive process takes different amounts of time (i.e., students pause longer when planning than when simply typing).

Mixture models are difficult to fit because they display a number of pathologies. One problem is component identification. Simply swapping the labels of Components 1 and 2 produces a model with identical likelihoods. Even if a prior distribution is placed on the mixture component parameters, the posterior is multimodal. Second, it is
the log posterior distribution of the data (or, if the prior distribution is flat, it maximizes the log likelihood). It comes up with a single parameter estimate, but does not explore the entire space of the distribution. In contrast, MCMC algorithms explore the entire posterior distribution by taking samples from the space of possible values. Although the MCMC samples are not independent, if the sample is large enough it converges in distribution to the desired posterior. Two approaches to MCMC estimation are the random walk Metropolis algorithm (RWM, used by JAGS; Section 3.2) and the Hamiltonian Monte Carlo algorithm (HMC, used by Stan; Section 3.3).

3.1 EM Algorithm

McLachlan and Krishnan (2008) provide a review of the EM algorithm with emphasis on mixture models. The form of the EM algorithm is particularly simple for the special case of a non-hierarchical mixture of normals. It alternates between an E-step, where p(Z_{i,j} = k) = p_{i,j,k} is calculated for every observation j and every component k, and an M-step, where the maximum likelihood values for π_{i,k}, μ_{i,k} and σ_{i,k} are found by taking moments of the data set Y_i weighted by the component probabilities p_{i,j,k}.

A number of different problems can crop up with using EM to fit mixture models. In particular, if π_{i,k} goes to zero for any k, that component essentially disappears from the mixture. Also, if σ_{i,k} goes to zero, the mixture component concentrates on a single point.
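As an illustration of this kind of off-the-shelf EM fit, a minimal sketch using the mixtools package (the data and number of components here are made up for the example):

    library(mixtools)

    # Simulated log pause times for one student: a common fast component
    # and a rarer slow component
    y <- c(rnorm(180, mean = -1, sd = 0.4), rnorm(20, mean = 1.5, sd = 0.8))

    # Fit a 2-component normal mixture by EM
    fit <- normalmixEM(y, k = 2)

    # Estimated mixing proportions, component means and standard deviations
    fit$lambda
    fit$mu
    fit$sigma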
DeviancePlots.pdf

[Figure 5: Traceplots and density plots for μ_{0,1}, for JAGS and the Stan unconstrained model (N = 15,000 iterations each). R̂ (JAGS) = 1.14; R̂ (Stan) = 1.15; effective sample size (Stan) = 3680. Note: to reduce the file size of this paper, this is a bitmap picture of the traceplot. The original PDF version is available at http://pluto.coe.fsu.edu/mcmc/hierMM/mu0Plots.pdf]

[Figure 6: Traceplots and density plots for α_{0,1}, for JAGS and the Stan unconstrained model. R̂ (JAGS) = 1.3; effective sample size (JAGS) = 1596; R̂ (Stan) = 1.57; effective sample size (Stan) = 5207. Note: the original PDF version is available at http://pluto.coe.fsu.edu/mcmc/hierMM/alpha0Plots.pdf]

shoulder on the density plot also points towards multiple modes. The plot for Stan looks even worse: after iteration 5,000 (i.e., after the chain was restarted) the red chain has moved
complete pooling model, and if ν → ∞ then the partial pooling model is the same as the no pooling model.

Li (2013) builds a hierarchical mixture model for the essay pause data. Figure 1 shows the no pooling mixture model.

[Figure 2: Hierarchical Mixture Model. The graphical model of Figure 1 with cross-student parents (α₀, μ₀, β₀, τ₀, γ₀) added for the student-specific parameters, and plates over components k = 1, ..., K, pauses j = 1, ..., J_i, and students i = 1, ..., I.]

To make the hierarchical model, add across-student prior distributions for the student-specific parameters π_i, μ_i and τ_i. Because JAGS parameterizes the normal distribution with precisions (reciprocal variances) rather than standard deviations, τ_i are substituted for the standard deviations σ_i. Figure 2 shows the hierarchical model. Completing the model requires specifying distributions for the three Level 1 (student-specific) parameters. In particular, let

π_i = (π_{i,1}, ..., π_{i,K}) ~ Dirichlet(α_{0,1}, ..., α_{0,K})   (2)

μ_{i,k} ~ N(μ_{0,k}, β_{0,k})   (3)

log(τ_{i,k}) ~ N(log(τ_{0,k}), γ_{0,k})   (4)

This introduces new Level 2 (across-student) parameters: α₀, μ₀, β₀, τ₀ and γ₀. The likelihood for a single student, L_i(Y_i | π_i, μ_i, τ_i), is given, with a suitable change of variable, by Equation 1. To get the complete data likelihood, multiply across the students (units), or, to avoid numeric overflow, sum the log likelihoods. If we let Ω be the complete parameter (π_i, μ_i and τ_i for each student, plus α₀, μ₀, β₀, τ₀ and γ₀), then

log L(Y | Ω) = ∑_{i=1}^{I}
log L_i(Y_i | π_i, μ_i, τ_i)   (5)

Hierarchical models have their own pathologies which require care during estimation. If either of the standard deviation parameters β_{0,k} or γ_{0,k} gets too close to zero or infinity, then this could cause the log posterior to go to infinity. These cases correspond to the no pooling and complete pooling extremes of the hierarchical model. Similarly, the variance of the Dirichlet distribution is determined by α_N = ∑_k α_{0,k}. If α_N is too close to zero, this produces no pooling in the estimates of π_i, and if α_N is too large, then there is nearly complete pooling in those estimates. Again, those values of α_N can cause the log posterior to become infinite.

2.3 Reparameterization

It is often worthwhile to reparameterize a model in order to make MCMC estimation more efficient. Both random walk Metropolis (Section 3.2) and Hamiltonian Monte Carlo (Section 3.3) work by moving a point around the parameter space of the model. The geometry of that space will influence how fast the Markov chain mixes, that is, moves around the space. If the geometry is unfavorable, the point will move slowly through the space and the autocorrelation will be high (Neal, 2011). In this case a very large Monte Carlo sample will be required to obtain reasonable Monte Carlo error for the parameter estimates.

Consider once more the Eight Schools problem, where μ_i ~ N(μ₀, ν). Assume that we have a sampler that works by updating the
some of that output processing into BUGS.

Although BUGS played an important role in encouraging data analysts to use MCMC, it is no longer actively supported. This means that the latest developments and improvements in MCMC do not get incorporated into its code. Rather than use BUGS, analysts are advised to use one of the two successor software packages: OpenBUGS (Thomas, O'Hara, Ligges, & Sturtz, 2006) or JAGS (just another Gibbs sampler; Plummer, 2012). The R package rjags allows JAGS to be called from R, and hence allows R to be used as a scripting language for JAGS, which is important for serious analytic efforts. Similar packages exist for BUGS and OpenBUGS.

3.3 Hamiltonian Monte Carlo (HMC), used by Stan

Hamiltonian Monte Carlo (HMC; Neal, 2011) is a variant on the Metropolis algorithm which uses a different proposal distribution than RWM. In HMC, the current draw from the posterior is imagined to be a small particle on a hilly surface (the posterior distribution). The particle is given a random velocity and is allowed to move for several discrete steps in that direction. The movement follows the laws of physics, so the particle gains speed when it falls down hills and loses speed when it climbs back up the hills. In this manner a proposal is generated that can be a great distance from the original starting point. The proposed point is then accepted or rejected according to the Metropolis rule.

The software pa
walk Metropolis Algorithm (RWM), used by JAGS

Geyer (2011) gives a tutorial summary of MCMC. The basic idea is that a mechanism is found for constructing a Markov chain whose stationary distribution is the desired posterior distribution. The chain is run until the analyst is reasonably sure it has reached the stationary distribution (these early draws are discarded as burn-in). Then the chain is run some more, until it is believed to have mixed throughout the entire posterior distribution. At this point it has reached pseudo-convergence. (Geyer calls this pseudo-convergence because, without running the chain for an infinite length of time, there is no way of telling if some part of the parameter space was never reached.) At this point the mean and standard error of the parameters are estimated from the observed mean and standard deviation of the parameter values in the MCMC sample.

There are two sources of error in estimates made from the MCMC sample. The first arises because the observed data are a sample from the universe of potential observations. This sampling error would be present even if the posterior distribution could be computed exactly. The second is the Monte Carlo error, which comes from the estimation of the posterior distribution with the Monte Carlo sample. Because the draws from the Markov chain are not statistically independent, this Monte Carlo error does not fall at the rate of 1/√R (where R is the numbe
CMC run does not provide a maximum likelihood estimate. A pseudo-AIC can be computed by substituting Ω̄, the posterior mean of the parameters, for Ω̂. Note that Ω̄ must be calculated after the parameters in each MCMC cycle are permuted to identify the components.

Gelman et al. (2013) advocate using a new measure of model selection called WAIC. WAIC makes two substitutions in the definition of AIC above. First, it uses D(Ω̄) in place of D(Ω̂). Second, it uses a new way of calculating the dimensionality of the model. There are two possibilities:

p_WAIC1 = 2 ∑_{i=1}^{I} ∑_{j=1}^{J_i} [ log E(L_{i,j}(Y_{i,j} | Ω)) − E(log L_{i,j}(Y_{i,j} | Ω)) ]   (21)

p_WAIC2 = ∑_{i=1}^{I} ∑_{j=1}^{J_i} Var(log L_{i,j}(Y_{i,j} | Ω))   (22)

The expectation and variance are theoretically defined over the posterior distribution, and are approximated using the MCMC sample. The final expression for WAIC is the deviance at the posterior mean plus twice the effective number of parameters, WAIC = D(Ω̄) + 2 p_WAIC.

For all of the model fit statistics discussed here (AIC, WAIC1 and WAIC2), the model with the lowest value is the best. One way to choose the best value of K is to calculate all of these statistics for multiple values of K and choose the value of K which has the lowest value for all of the statistics. Hopefully, all of the statistics will point at the same model. If that is not true, it is often evidence that the fit of two different models is too close to call.

In the 27 runs using the data from known values of K, simply using the lowest WAIC value was not sufficient to recover the model. As the Markov chains have not yet reached pseudo co
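A minimal R sketch of how p_WAIC and WAIC might be computed from an MCMC sample, assuming a matrix logLik of pointwise log-likelihood values (rows = MCMC draws, columns = data points Y_{i,j}) has already been assembled after component identification. The matrix name and layout are assumptions, not part of the published scripts, and this follows the Gelman et al. (2013) formulation based on the log pointwise predictive density; the variant in the text substitutes the deviance at the (identified) posterior mean:

    # logLik: S x N matrix; logLik[s, n] = log L(Y_n | Omega^(s))
    waic <- function(logLik) {
      # log pointwise predictive density: log of the posterior mean likelihood
      lppd <- sum(log(colMeans(exp(logLik))))
      # Two versions of the effective number of parameters
      pWAIC1 <- 2 * sum(log(colMeans(exp(logLik))) - colMeans(logLik))
      pWAIC2 <- sum(apply(logLik, 2, var))
      # WAIC on the deviance scale (lower is better)
      c(WAIC1 = -2 * lppd + 2 * pWAIC1,
        WAIC2 = -2 * lppd + 2 * pWAIC2)
    }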
Furthermore, if μ_{i,k} = μ_{i,k'} and σ_{i,k} = σ_{i,k'} for any pair of components, the result is a degenerate solution with K − 1 components.

As the posterior distribution for the mixture model is multimodal, the EM algorithm only finds a local maximum. Running it from multiple starting points may help find the global maximum; however, in practice it is typically run once. If the order of the components is important, the components are typically relabeled after fitting the model according to a predefined rule (e.g., increasing values of μ_{i,k}).

Two packages are available for fitting non-hierarchical mixture models using the EM algorithm in R (R Core Team, 2014): FlexMix (Gruen & Leisch, 2008) and mixtools (Benaglia, Chauveau, Hunter, & Young, 2009). These two packages take different approaches to how they deal with degenerate solutions. FlexMix will combine two mixture components if they get too close together, or if the probability of one component gets too small (by default, if π_{i,k} < .05). Mixtools, on the other hand, retries from a different starting point when the EM algorithm converges on a degenerate solution. If it exceeds the allowed number of retries, it gives up.

Neither mixtools nor FlexMix provides standard errors for the parameter estimates. The mixtools package recommends using the bootstrap (resampling from the data distribution) to calculate standard errors, and provides a function to facilitate this.

3.2 Random
PU time is producing a 5-fold decrease in the Monte Carlo error.

The following graphs compare the JAGS and Stan (unconstrained model) outputs for four selected cross-student parameters. The output is from coda, and provides a traceplot on the left and a density plot for the corresponding distribution on the right. Recall that the principal difference between JAGS and Stan is the proposal distribution: JAGS uses a random walk proposal (with a special distribution for the mixing parameters) and Stan uses the Hamiltonian proposal. Also note that for Stan the complete run of 15,000 samples is actually made up of a short run of length 5,000 and a longer run of length 10,000, hence there is often a discontinuity at iteration 5,000.

Although not strictly speaking a parameter, the deviance (twice the negative log likelihood) can easily be monitored in JAGS. Stan does not offer a deviance monitor, but instead monitors the log posterior, which is similar. Both give an impression of how well the model fits the data. Figure 4 shows the monitors for the deviance (or log posterior). This traceplot is nearly ideal white noise, indicating good convergence for this value. The value of R̂ is less than the heuristic threshold of 1.1 for both chains, and the effective sample size is about 1/6 of the 45,000 total Monte Carlo observations.

Figure 5 shows the grand mean of the log pause times for the first component. The JAGS output (upper row) shows a clas
Z_{i,j} = k). Let π_{i,k}^(cr), μ_{i,k}^(cr) and σ_{i,k}^(cr) be the draws of the Level 1 parameters from Cycle r of Chain c, after the data have been sorted to identify the components. Now define

p̃_{i,j,k}^(cr) = π_{i,k}^(cr) φ((Y_{i,j} − μ_{i,k}^(cr)) / σ_{i,k}^(cr))   (17)

and set p_{i,j,k}^(cr) = p̃_{i,j,k}^(cr) / ∑_{k'} p̃_{i,j,k'}^(cr). The distribution for Z_{i,j} for MCMC draw (c, r) is then (p_{i,j,1}^(cr), ..., p_{i,j,K}^(cr)). The MCMC estimate for p_{i,j,k} is just the average over all of the draws of p_{i,j,k}^(cr), that is, p̄_{i,j,k} = (1 / CR) ∑_c ∑_r p_{i,j,k}^(cr). If desired, Z_{i,j} can be estimated as the maximum over k of p̄_{i,j,k}, but for most purposes simply looking at the probability distribution p̄_{i,j,·} provides an adequate, if not better, summary of the available information about the component from which Y_{i,j} was drawn.

4.6 Additional Level 2 Units (students)

In the pause time application, the goal is to be able to estimate fairly quickly the student-level parameters for a new student working on the same essay. As the MCMC run is time consuming, a fast method for analyzing data from a new student would be useful. One approach would be to simply treat Equations 2, 3 and 4 as prior distributions, plugging in the point estimates for the cross-student parameters as hyperparameters, and find the posterior mode of the observations from the new data. There are two problems with this approach. First, this method ignores any remaining uncertainty about the cross-student parameters. Second, the existing mixture fitting software packages (FlexMix and mixtools) do not provide any way to input the
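Returning to the labeling computation above, a minimal R sketch of averaging p_{i,j,k} over the identified draws for one student (array names and layout are assumptions; the published scripts may organize the draws differently):

    # pi, mu, sigma: matrices of identified MCMC draws for one student,
    # with dimensions (number of draws) x K
    # y: vector of that student's observations
    labelProbs <- function(y, pi, mu, sigma) {
      S <- nrow(pi); K <- ncol(pi)
      pbar <- matrix(0, nrow = length(y), ncol = K)
      for (s in 1:S) {
        # Unnormalized component weights for each observation
        p <- sapply(1:K, function(k)
          pi[s, k] * dnorm(y, mean = mu[s, k], sd = sigma[s, k]))
        # Normalize across components, then accumulate the average over draws
        pbar <- pbar + p / rowSums(p)
      }
      pbar / S
    }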
ailor-made for the identification issue. If an order based on μ₀ is desired, the declaration

ordered[K] mu0;

enforces the ordering constraint in the MCMC sampler. JAGS contains a sort function which can achieve a similar effect. Frühwirth-Schnatter (2001) recommends against this solution, because the resulting Markov chains often mix slowly. Some experimentation with the ordered restriction in Stan confirmed this finding: the MCMC runs took longer and did not reach pseudo-convergence as quickly as when the ordered restriction was not used.

Frühwirth-Schnatter (2001) instead recommends letting the Markov chains mix across all of the possible modes, and then sorting the values according to the desired identification constraints post hoc. JAGS provides special support for this approach, offering a special distribution, dnormmix, for mixtures of normals. This uses a specially chosen proposal distribution to encourage jumping between the multiple modes in the posterior. The current version of Stan does not provide a similar feature.

One result of this procedure is that the MCMC sample will have a variety of orderings of the components; each draw could potentially have a different labeling. For example, the MCMC sample for the parameter μ_{i,1} will in fact be a mix of samples from μ_{i,1}, ..., μ_{i,K}. The average of this sample is not likely to be a good estimate of μ_{i,1}. Similar problems arise when looking at pseudo-convergence statistic
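A minimal R sketch of the post-hoc sorting idea for a single set of draws (a simplified stand-in for the full Figure 3 algorithm described elsewhere in the paper; object names are assumptions). In the hierarchical case, the sorting permutation would be chosen from a cross-student parameter such as μ₀ and the same permutation applied to the corresponding student-level draws:

    # pi, mu, sigma: matrices of draws (rows = MCMC cycles, columns = components)
    # Relabel each cycle so that the components are in increasing order of mu
    relabelByMu <- function(pi, mu, sigma) {
      for (s in 1:nrow(mu)) {
        ord <- order(mu[s, ])        # permutation that sorts this cycle's means
        pi[s, ] <- pi[s, ord]
        mu[s, ] <- mu[s, ord]
        sigma[s, ] <- sigma[s, ord]
      }
      list(pi = pi, mu = mu, sigma = sigma)
    }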
are completed. Here result1 is a variable that gathers together the portion of the results to keep, and runName is a character string that provides a unique name for the run.

Assume that all of the commands necessary to perform the MCMC analysis are in a file script.R. To run this from a command shell, use the command

R CMD BATCH --slave script.R

This will run the R code in the script, using the .RData file in the current working directory, and put the output into script.Rout. The --slave switch performs two useful functions. First, it does not save the .RData file at the end of the run, avoiding potential memory problems. Second, it suppresses echoing the script to the output file, making the output file easier to read. Under Linux and Mac OS X, the command

nohup R CMD BATCH --slave script.R &

runs R in the background. The user can log out of the computer, but as long as the computer remains on, the process will continue to run.

4 Model Estimation

An MCMC analysis always follows a number of similar steps, although the hierarchical mixture model requires a couple of additional steps. Usually it is best to write these as a script, because the analysis will often need to be repeated several times. The steps are as follows:

1. Set up parameters for the run. This includes which data to analyze, how many mixture components are required (i.e., K), how long to run the Markov chain, how many chains to run, and what the prior hyperpara
becomes smaller, so the minimum sample size increases.

2.2 Hierarchical mixtures

Gelman et al. (2013) explain the concept of a hierarchical model using a SAT coaching experiment that took place at 8 different schools. Let X_i ~ N(μ_i, σ_i) be the observed effect at each school, where the school-specific standard error σ_i is known (based mainly on the sample size used at that school). There are three ways to approach this problem: (1) No pooling. Estimate each μ_i separately, with no assumption about the similarity of μ across schools. (2) Complete pooling. Set μ_i = μ₀ for all i and estimate μ₀. (3) Partial pooling. Let μ_i ~ N(μ₀, ν) and now jointly estimate μ₀, μ₁, ..., μ₈.

The no pooling approach produces unbiased estimates for each school, but it has the largest standard errors, especially for the smallest schools. The complete pooling approach ignores the school-level variability, but has much smaller standard errors. In the partial pooling approach, the individual school estimates are shrunk towards the grand mean μ₀, with the amount of shrinkage related to the size of the ratio between ν and σ_i; in particular, there is more shrinkage for the schools which were less precisely measured. Note that the higher-level standard deviation ν controls the amount of shrinkage: the smaller ν is, the more the individual school estimates are pulled towards the grand mean. At the limits, if ν = 0 the partial pooling model is the same as the
ckage Stan (Stan Development Team, 2013) provides support for HMC. As with BUGS and JAGS, the model of Section 2 is written in pseudo-code, although this time the syntax looks more like C than R. Rather than translate the model into interpreted code, Stan translates it into C++, then compiles and links it with existing Stan code to run the sampler. This has an initial overhead for the compilation, but afterwards each cycle of the sampler runs faster. Also, as HMC generally has lower autocorrelation than random walk Metropolis, smaller run lengths can be used, making Stan considerably faster than JAGS in some applications. A package, rstan, is available to link Stan to R, but because of the compilation step it requires that the user have the proper R development environment set up.

HMC has more tuning parameters than random walk Metropolis: the mass of the particle, the distribution of velocities, and the number of steps to take in each direction must be selected to set up the algorithm. Stan uses a warm-up phase to do this adaptation. The recommended procedure is to use approximately 1/2 of the samples for warm-up, as a longer warm-up produces lower autocorrelations when actively sampling.

Stan has some interesting features that are not present in BUGS or JAGS. For one, it does not require every parameter to have a proper prior distribution, as long as the posterior is proper. It will simply put a uniform prior over the space of possibl
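A minimal hedged sketch of driving Stan from R with rstan; the model file name, data list contents and run lengths here are illustrative assumptions, not the published configuration:

    library(rstan)

    # Data list whose names must match the data block of the Stan model
    standata <- list(K = 2, I = length(unique(studentID)),
                     N = length(y), y = y, sid = studentID)

    # Compile (first call only) and sample: 3 chains, with half the iterations
    # used as warm-up, as recommended in the text
    fit <- stan(file = "hierMM.stan", data = standata,
                chains = 3, iter = 10000, warmup = 5000)

    print(fit, pars = c("mu0", "alpha0"))   # cross-student summaries, Rhat, n_eff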
creasing K should always improve the fit of the model, even if the extra component is just used to trivially fit one additional data point. Consequently, analysts use penalized measures of model fit, such as AIC, to choose among the models. Chapter 7 of Gelman et al. (2013) gives a good overview of the various information criteria, so the current section will provide only brief definitions.

Recall Equation 5, which described the log likelihood of the complete data Y as a function of the complete parameter Ω. By convention, the deviance is twice the negative log likelihood: D(Ω) = −2 log L(Y | Ω). The lower the deviance, the better the fit of the data to the model, evaluated at that parameter. As students are independent given the cross-student parameters, and pause times are independent given the student-level parameters, D(Ω) for some value of the parameters Ω can be written as a sum:

D(Ω) = −2 ∑_{i=1}^{I} ∑_{j=1}^{J_i} log L_{i,j}(Y_{i,j} | Ω) ,   (18)

where L_{i,j}(Y_{i,j} | Ω) is the likelihood of data point Y_{i,j}, that is,

L_{i,j}(Y_{i,j} | Ω) = ∑_{k=1}^{K} π_{i,k} φ((Y_{i,j} − μ_{i,k}) / σ_{i,k}) .   (19)

Note that the cross-student parameters drop out of these calculations.

The Akaike information criterion (AIC) takes the deviance as a measure of model fit and penalizes it for the number of parameters in the model. Let Ω̂ be the maximum likelihood estimate of the parameter. Then AIC is defined as

AIC = D(Ω̂) + 2 m(K) ,   (20)

where m(K) = 2IK + I(K − 1) + 4K + (K − 1) + 1 is the number of free parameters in the model. Note that the M
dexes k'_1, ..., k'_K so that ω_{0,k'_1}^(cr) ≤ ... ≤ ω_{0,k'_K}^(cr)
    for ω in the Level 2 parameters (α₀, μ₀, β₀, τ₀, γ₀) do
      Replace ω_{0,k}^(cr) with ω_{0,k'_k}^(cr)
    end for (Level 2 parameter)
    if inferences about Level 1 parameters are desired then
      for ω in the Level 1 parameters (π, μ, τ) do
        for each Level 2 unit (student) i do
          Replace ω_{i,k}^(cr) with ω_{i,k'_k}^(cr)
        end for (Level 1 unit)
      end for (Level 1 parameter)
    end if
  end for (Cycle)
end for (Markov Chain)

Figure 3: Component Identification Algorithm

reach convergence more quickly. Gelman et al. (2013) recommend starting at the maximum likelihood estimate when that is available. Fortunately, off-the-shelf software is available for finding the maximum likelihood estimate of the individual student mixture models. These estimates can be combined to find starting values for the cross-student parameters. The initial values can be set by the following steps:

1. Fit a separate mixture model for each student using maximum likelihood. The package mixtools is slightly better for this purpose than FlexMix, as it will try multiple times to get a solution with the requested number of components. If the EM algorithm does not converge in several tries, set the values of the parameters for that student to NA and move on.

2. Identify components. Sort the student-level parameters π_i, μ_i, σ_i and τ_i according to the desired criteria (see Section 4.1).

3. Calculate the Level 2 initial values. Most
e problem that frequently arises in estimating mixture models is determining how many mixture components to use. What is commonly done is to estimate models for K = 2, 3, 4, ... up to some small maximum number of components (depending on the size of the data). Then a measure of model-data fit, such as AIC, DIC or WAIC (see Gelman et al., 2013, Chapter 7), is calculated for each model, and the model with the best fit index is chosen. These methods look at the deviance (minus twice the log likelihood of the data) and adjust it with a penalty for model complexity. Both DIC and WAIC require Markov chain Monte Carlo (MCMC) to compute, and require some custom coding for mixture models because of the component identification issue.

This paper is a tutorial for replicating the method used by Li (2013). The paper walks through a script written in the R language (R Core Team, 2014), which performs most of the steps. The actual estimation is done using MCMC, using either Stan (Stan Development Team, 2013) or JAGS (Plummer, 2012). The R scripts, along with the Stan and JAGS models and some sample data, are available at http://pluto.coe.fsu.edu/mcmc/hierMM/

2 Mixture Models

Let i = 1, ..., I be a set of indexes over the second-level units (students in the example), and let j = 1, ..., J_i be the first-level units (pause events in the example). A hierarchical mixture model is formed by adding a Level 2 (across-student) distribution over the parameters
e values for any parameter not given a prior distribution. However, using explicit priors has some advantages for the application to student pause data. In particular, when data for a new student become available, the posterior parameters for the previous run can be input into Stan or JAGS, and the original calibration model can be reused to estimate the parameters for the new student (Mislevy, Almond, Yan, & Steinberg, 1999).

3.4 Parallel Computing and Memory Issues

As most computers have multiple processors, parallel computing can be used to speed up MCMC runs. Often multiple Markov chains are run, and the results are compared to assess pseudo-convergence and then combined for inference (Gelman & Shirley, 2011). This is straightforward using the output processing package coda. It is a little bit trickier using the rstan package, because many of the graphics require a full stanfit object. However, the conversion from Stan to coda format for the MCMC samples is straightforward.

In the case of hierarchical mixture models there is an even easier way to take advantage of multiple processes. If the number of components K is unknown, the usual procedure is to take several runs with different values of K and compare the fit. Therefore, if 4 processors were available, one could run all of the chains for K = 2 on one processor, all of the chains for K = 3 on another, and all of the chains for K = 4 on a third, leaving one free to handle operating system tasks.

In most modern computers, the bottlen
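A minimal sketch of that division of labor using the parallel package (runAnalysis is a hypothetical wrapper around the model-fitting script, not a function from the published code):

    library(parallel)

    # Hypothetical wrapper: fit the hierarchical mixture model for a given K
    # and return the summaries worth keeping
    runAnalysis <- function(K) {
      # ... set up data, run JAGS or Stan, compute fit statistics ...
      list(K = K)
    }

    # One worker per value of K, leaving a processor free for the OS
    # (mclapply forks processes; on Windows, parLapply would be used instead)
    results <- mclapply(2:4, runAnalysis, mc.cores = 3)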
easy to get a pathological solution in which a mixture component consists of a single point. These solutions are not desirable, and some estimation tools constrain the minimum size of a mixture component (Gruen & Leisch, 2008). Furthermore, if a separate variance is to be estimated for each mixture component, several data points must be assigned to each component in order for the variance estimate to have a reasonable standard error. Therefore, fitting a model with rare components requires a large data size.

Almond et al. (2012) noted these limitations in their conclusions. First, events corresponding to the highest-level planning components in Deane (2012)'s cognitive model would be relatively rare, and hence would be lumped in with other mixture components due to size restrictions. Second, some linguistic contexts (e.g., between-sentence pauses) were rare enough that fitting a separate mixture model to each student would not be feasible.

One work-around is a hierarchical mixture model. As with all hierarchical models, it requires units at two different levels; in this example, students (or essays) are Level 2 and individual pauses are Level 1. The assumption behind the hierarchical mixture model is that the mixture components will look similar across the second-level units. Thus, the mean and variance of Mixture Component 1 for Student 1 will look similar to those for Student 2. Li (2013) tried this on some of the writing data. On
eck is usually not available CPU cycles, but available memory. For running 3 chains with J = 100 students and R = 5000 MCMC samples in each chain, the MCMC sample can take up to 0.5 GB of memory. Consequently, it is critically important to monitor memory usage when running multiple MCMC runs. If the computer starts requiring swap (disk-based memory) in addition to physical memory, then running fewer processes will probably speed up the computations.

Another potential problem occurs when storing the result of each run between R sessions in the .RData file. Note that R attempts to load the entire contents of that data file into memory when starting R. If the results of several MCMC runs are in there, the .RData file can grow to several GB in size, and R can take several minutes to load. In case this problem arises, it is recommended that you make a copy of the .RData file, after the data have been cleaned and all the auxiliary functions loaded but before starting the MCMC runs, and put it someplace safe. If the .RData file gets to be too large, it can simply be replaced with the backup.

In order to prevent the .RData file from growing unmanageably large, it is recommended that the workspace not be saved at the end of each run. Instead, run a block of code like this:

    assign(runName, result1)
    outfile <- gzfile(paste(runName, "R", "gz", sep = "."), open = "wt")
    dump(runName, file = outfile)
    close(outfile)

after all computations
ection. There is an obvious trade-off between working with smaller data sets and being able to estimate rare events; J = 100 seemed like a good compromise.

5 Model Selection

A big question in hierarchical mixture modeling is: what is the optimal number of components, K? Although there is a way to add a prior distribution

[Figure 7: Traceplots and density plots for γ_{0,1}, for JAGS and the Stan unconstrained model (N = 15,000 iterations each). Note: to reduce the file size of this paper, this is a bitmap picture of the traceplot. The original PDF version is available at http://pluto.coe.fsu.edu/mcmc/hierMM/gamma0Plots.pdf]

over K and learn this as part of the MCMC process, it is quite tricky, and the number of parameters varies for different values of K. Geyer (2011) describes the Metropolis-Hastings-Green algorithm, which is required for variable-dimension parameter spaces. In practice, analysts often believe that the optimal value of K is small (less than 5), so the easiest way to determine a value of K is to fit separate models for K = 2, 3, 4, ... and compare the fit of the models. Note that in
effective sample sizes are greater than a certain minimum (by default, 100) for all cross-student parameters. If these criteria are not met, new chains of twice the length of the old chain can be run. The traceplots are saved to a file for later examination, as are some other statistics such as the mean value of each chain. (These traceplots and outputs are available at the web site mentioned above.)

It is generally not worthwhile to restart the chain for a third time. If pseudo-convergence has not been achieved after the second, longer run, usually a better solution is to reparameterize the model.

It is easier to extend the MCMC run using JAGS than using Stan. In JAGS, calling update again samples additional points from the Markov chain, starting at the previous values, extending the chains. The current version of Stan (2.2.0) saves the compiled C++ code but not the warm-up parameter values, so the second run is a new run, starting over from the warm-up phase.¹ In this run, both the warm-up phase and the sampling phase should be extended, because a longer warm-up may produce a lower autocorrelation.

¹ The Stan development team have stated that the ability to save and reuse the warm-up parameters is a high priority for future versions of Stan. Version 2.3 of Stan was released between the time of first writing and publication of this paper, but the release notes do not indicate that the restart issue was addressed.

In either case, the data
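A hedged sketch of what extending a JAGS run looks like with rjags (the model file name, data, initial values and monitored parameters here are illustrative assumptions):

    library(rjags)

    jm <- jags.model("hierMM.bug", data = jagsData, inits = initVals,
                     n.chains = 3, n.adapt = 1000)
    update(jm, n.iter = 5000)                    # burn-in
    samp <- coda.samples(jm, n.iter = 5000,
                         variable.names = c("alpha0", "mu0", "beta0",
                                            "tau0", "gamma0"))

    # If pseudo-convergence has not been reached, continue the same chains
    # from where they stopped rather than starting over
    samp2 <- coda.samples(jm, n.iter = 10000,
                          variable.names = c("alpha0", "mu0", "beta0",
                                             "tau0", "gamma0"))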
em it has for this application is the assumption that each pause is independent. This does not correspond with what is known about the writing process: typically, writers spend long bursts of activity simply doing transcription (i.e., typing), and only rarely pause for higher-level processing (e.g., spelling, grammar, word choice, planning, etc.). In particular, rather than getting this model to work for this application, it may be better to look at hidden Markov models for individual student writings. The slow mixing of the hierarchical mixture model (Gary Feng and Paul Deane, private communication) for large data sets indicates that there may be additional transformations of this model that are necessary to make it work properly. Hopefully, the publication of the source code will allow others to explore this model more fully.

References

Almond, R. G., Deane, P., Quinlan, T., Wagner, M., & Sydorenko, T. (2012). A preliminary analysis of keystroke log data from a timed writing task (Research Report No. RR-12-23). Educational Testing Service. Retrieved from http://www.ets.org/research/policy_research_reports/publications/report/2012/jgdg

Benaglia, T., Chauveau, D., Hunter, D. R., & Young, D. (2009). mixtools: An R package for analyzing finite mixture models. Journal of Statistical Software, 32(6), 1-29. Retrieved from http://www.jstatsoft.org/v32/i06/

Brooks, S., Gelman, A., Jones, G. L., & Meng, X.
from both runs can be pooled for inferences.

4.4 A simple test

To test the scripts and the viability of the model, I created some artificial data consistent with the model in Section 2. To do this, I took one of the data sets used by Li (2013) and ran the initial parameter algorithm described in Section 4.2 for K = 2, 3 and 4. I took the cross-student parameters from this exercise and generated random parameters and data sets for 10 students. This produced three data sets (for K = 2, 3 and 4) which are consistent with the model and reasonably similar to the observed data. All three data sets and the sample parameters are available on the web site, http://pluto.coe.fsu.edu/mcmc/hierMM/

For each data set, I fit three different hierarchical mixture models (K = 2, 3 and 4, where K is the value used for estimation), using three different variations of the algorithm: JAGS (RWM), Stan (HMC) with the cross-student means constrained to be increasing, and Stan (HMC) with the means unconstrained. For the JAGS and Stan unconstrained runs, the chains were sorted (Section 4.1) on the basis of the cross-student means, μ₀, before the convergence tests were run.

In each case, the initial run of 3 chains with 5,000 observations was followed by a second run of 3 chains with 10,000 observations when the first one did not converge. All of the results are tabulated at the web site, including links to the complete output files and all traceplots.
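A minimal sketch of the kind of data generation described here, assuming cross-student parameter values are already in hand; the parameter values, object names, and use of MCMCpack's rdirichlet are assumptions, not the published simulation code:

    # Cross-student parameters (illustrative values, not the ones from Li, 2013)
    K <- 2; I <- 10
    alpha0 <- c(7, 3)                        # Dirichlet parameters for mixing weights
    mu0 <- c(-1, 1.5); beta0 <- c(0.3, 0.3)  # means and SDs of component means
    tau0 <- c(4, 1); gamma0 <- c(0.2, 0.2)   # geometric means and log-SDs of precisions

    library(MCMCpack)                        # for rdirichlet (gtools also provides one)
    simStudent <- function(J) {
      pi <- as.vector(rdirichlet(1, alpha0))
      mu <- rnorm(K, mu0, beta0)
      tau <- exp(rnorm(K, log(tau0), gamma0))
      z <- sample(1:K, J, replace = TRUE, prob = pi)
      rnorm(J, mean = mu[z], sd = 1 / sqrt(tau[z]))   # JAGS-style precision
    }
    simData <- lapply(rep(100, I), simStudent)        # 100 pauses per student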
gamma distribution has about the right shape for the prior distribution of α_N, and we expect it to be about the same size as J, the number of Level 2 units. The choice of prior is a chi-squared distribution with 2·J degrees of freedom:

α_N ~ χ²(2·J)   (14)

The two remaining parameters we don't need to constrain too much. For μ₀ we use a diffuse normal prior (one with a high standard deviation), and for α₀ we use a Jeffreys prior (uniform on the logistic scale), which is the Dirichlet distribution with all values set to 1/2:

μ_{0,k} ~ N(0, 1000)   (15)

α₀ ~ Dirichlet(0.5, ..., 0.5)   (16)

The informative priors above are not sufficient to always keep the Markov chain out of trouble. In particular, it can still reach places where the log posterior distribution is infinite. There are two different places where these seem to occur. One is associated with high values of α_N. Putting a hard limit of α_N < 500 seems to avoid this problem (when J = 100). Another possible degenerate spot is when α_{0,k} ≈ 0 for some k. This means that π_{i,k} will be essentially zero for all students. Adding 0.01 to all of the α₀ values in Equation 2 seems to fix this problem:

π_i = (π_{i,1}, ..., π_{i,K}) ~ Dirichlet(α_{0,1} + 0.01, ..., α_{0,K} + 0.01)

3 Estimation Algorithms

There are generally two classes of algorithms used with both hierarchical and mixture models. The expectation maximization (EM) algorithm (Section 3.1) searches for a set of parameter values that maximizes
ill be high. If the step size is too small, the chain will move very slowly through the space and the autocorrelation will be high. Gibbs sampling, where the step is chosen using a conjugate distribution so the Metropolis-Hastings ratio always accepts, is not necessarily better. Often the effective step size of the Gibbs sampler is small, resulting in high autocorrelation.

Most packages that use RWM do some adaptation on the step size, trying to get an optimal rejection rate. During this adaptation phase the Markov chain does not have the correct stationary distribution, so those observations must be discarded, and a certain amount of burn-in is needed after the adaptation finishes.

MCMC and the RWM algorithm were made popular by their convenient implementation in the BUGS software package (Thomas, Spiegelhalter, & Gilks, 1992). With BUGS, the analyst can write down the model in a form very similar to the series of equations used to describe the model in Section 2, with a syntax derived from the R language. BUGS then compiles the model into pseudo-code which produces the Markov chain, choosing to do Gibbs sampling or random walk Metropolis for each parameter depending on whether or not a convenient conjugate proposal distribution was available. The output could be exported in a form that could be read by R, and the R package coda (Plummer, Best, Cowles, & Vines, 2006) could be used to process the output. Later, WinBUGS would build
into the lower end of the support of the distribution. This gives a large shoulder in the corresponding density plot.

The chains for the variance of the log precisions for the first component, γ_{0,1} (Figure 7), are far from pseudo-convergence, with R̂ = 1.73. In the JAGS chains (top row), the black chain seems to prefer much smaller values for this parameter. In the Stan output, the green chain seems to be restricted to the smaller values in both the first and second run. However, between the first and second run the black and red chains swap places. Clearly this is the weakest spot in the model, and perhaps a reparameterization here would help.

Looking at Figures 6 and 7 together, a pattern emerges. There seem to be at least two distinct modes. One mode (black chain in JAGS, green chain in Stan) has a higher value for α₀ and a lower value for γ_{0,1}, and the other one has the reverse pattern. This is an advantage of the MCMC approach over an EM approach: in particular, if the EM algorithm were only run once from a single starting point, the existence of the other mode might never be discovered.

4.5 Data Point Labeling

For some applications, the analysis can stop after estimating the parameters for the students (units): π_i, μ_i, and τ_i or σ_i, as desired. In other applications it is interesting to assign the individual pauses to components, that is, to assign values to Z_{i,j}. When π_i, μ_i and σ_i are known, it is simple to calculate p_{i,j,k} = Pr(
log(τ_{i,k}) = log(τ_{0,k}) + η_{i,k} γ_{0,k}   (9)

η_{i,k} ~ N(0, 1)   (10)

2.4 Prior distributions and parameter constraints

Rarely does an analyst have enough prior information to completely specify a prior distribution. Usually, the analyst chooses a convenient functional form and chooses the hyperparameters of that form based on available prior information (if any). One popular choice is the conjugate distribution of the likelihood. For example, if the prior for a multinomial probability π_i follows a Dirichlet distribution with hyperparameter α₀, then the posterior distribution will also be a Dirichlet, with a hyperparameter equal to the sum of the prior hyperparameter and the data. This gives the hyperparameter of a Dirichlet prior a convenient interpretation as pseudo-data. A convenient functional form is one based on a normal distribution (sometimes on a transformed parameter, such as the log of the precision), whose mean can be set to a likely value of the parameter and whose standard deviation can be set so that all of the likely values for the parameter are within two standard deviations of the mean. Note that for location parameters the normal distribution is often a conjugate prior distribution.

Proper prior distributions are also useful for keeping parameter values away from degenerate or impossible solutions. For example, priors for the standard deviations β_{0,k} and γ_{0,k} must be strictly positive. Also, the group precisions for each student, τ_{i,k}, mus
ks, W. R. (1992). BUGS: A program to perform Bayesian inference using Gibbs sampling. In J. M. Bernardo, J. O. Berger, A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics 4 (pp. 837-842). Clarendon Press.

Acknowledgments

Tingxuan Li wrote the first version of much of the R code as part of her Master's thesis. Paul Deane and Gary Feng of Educational Testing Service have been collaborators on the essay keystroke log analysis work which is the inspiration for the present research. Various members of the JAGS and Stan users groups and development teams have been helpful in answering emails and postings related to these models, and Kathy Laskey and several anonymous reviewers have given me useful feedback (to which I should have paid closer attention) on the draft of this paper.
meters are.

2. Clean the data. In the case of hierarchical mixture models, student data vectors which are too short should be eliminated. Stan does not accept missing values, so NAs in the data need to be replaced with a finite value. For both JAGS and Stan, the data are bundled with the prior hyperparameters to be passed to the MCMC software.

3. Set initial values. Initial values need to be chosen for each Markov chain (see Section 4.2).

4. Run the Markov chain. For JAGS, this consists of four substeps: (a) run the chain in adaptation mode, (b) run the chain in normal mode for the burn-in, (c) set up monitors on the desired parameters, and (d) run the chain to collect the MCMC sample. For Stan, the compilation, warm-up and sampling are all done in a single step.

5. Identify the mixture components. Ideally, the Markov chains have visited all of the modes of the posterior distribution, including the ones which differ only by a permutation of the component labels. Section 4.1 describes how the component labels are permuted so that they are the same in each MCMC cycle.

6. Check pseudo-convergence. Several statistics and plots are calculated to see if the chains have reached pseudo-convergence and the sample size is adequate. If the sample is inadequate, then additional samples are collected (Section 4.3).

7. Draw inferences. Summaries of the posterior distribution for the parameters of interest a
nvergence, the WAIC values may not be correct, but there was fairly close agreement on both WAIC1 and WAIC2 for both the JAGS and Stan unconstrained models. However, the values of both WAIC1 and WAIC2 were also fairly close for all values of K (number of estimated components). For example, for the data set generated with K = 2, the WAIC1 value was 1132, 1132 and 1135 for K = 2, 3 and 4 respectively when estimated with JAGS, and 1132, 1131 and 1132 when estimated with Stan. The results for WAIC2 were similar. It appears as if the common practice of just picking the best value of WAIC to determine K is not sufficient. In particular, it may be possible to approximate the data quite well with a model which has an incorrect value of K.

6 Conclusions

The procedure described above is realized as code in R, Stan and JAGS, and is available for free at http://pluto.coe.fsu.edu/mcmc/hierMM/ Also available are the simulated data for trying out the models. These scripts contain the functions necessary to automatically set starting parameters for these problems, to identify the components, and to calculate WAIC model fit statistics.

The scripts seem to work fairly well for a small number of students (units), 5 < J < 10. With this sample size, although JAGS has the faster run time, Stan (Hamiltonian Monte Carlo) has a larger effective sample size. With large sample sizes (J = 100), even the HMC has high autocorrelation, and both JAGS and Stan are extremely slow. He
of the cross-student parameters are either means or variances of the student-level parameters. The initial value for α₀ is the mean of π_i across the students i (ignoring NAs). The initial values for μ₀ and β₀ are the mean and standard deviation of μ_i. The initial values for log(τ₀) and γ₀ are the mean and standard deviation of log(τ_i).

4. Impute starting values for maximum likelihood estimates that did not converge. These are the NAs from Step 1. For each student i that did not converge in Step 1, set π_i = α₀, μ_i = μ₀ and τ_i = τ₀.

5. Compute initial values for θ_i and η_i. These can be computed by solving Equations 7 and 9 for θ_i and η_i.

6. Set the initial value of α_N. Only one parameter was not given an initial value in the previous steps, that is, the effective sample size for α₀. Set this equal to the number of Level 2 units, J. The initial values of α₀ can now be calculated as well.

This produces a set of initial values for a single chain. Common practice for checking for pseudo-convergence involves running multiple chains and seeing if they arrive at the same stationary distribution. Different starting values for the second and subsequent chains can be found by sampling some fraction λ of the Level 1 units from each Level 2 unit. When J is large, λ = .5 seems to work well; when J is small, λ = .8 seems to work better. An alternative would be to take a bootstrap sample (a new sample of size J drawn with replacement).
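A small sketch of the subsampling idea for generating dispersed starting values; dataList and fitStudentMixtures are hypothetical names (the latter standing in for Steps 1-6 above), and the published scripts may structure this differently:

    # dataList: a list with one vector of pause times per student
    # lambda: fraction of each student's pauses to keep for this chain's starts
    dispersedStart <- function(dataList, lambda = 0.5) {
      sub <- lapply(dataList, function(y)
        sample(y, size = ceiling(lambda * length(y))))
      # fitStudentMixtures: hypothetical wrapper that runs Steps 1-6 above
      # (per-student EM fits, identification, and Level 2 summaries)
      fitStudentMixtures(sub)
    }

    # Full data for the first chain, subsamples for the others
    inits <- lapply(c(1, 0.5, 0.5), function(lam) dispersedStart(dataList, lam))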
of the Level 1 (within-student) mixture model. Section 2.1 describes the base Level 1 mixture model, and Section 2.2 describes the Level 2 model. Often MCMC requires reparameterization to achieve better mixing (Section 2.3). Also, there are certain parameter values which result in infinite likelihoods; Section 2.4 describes prior distributions and constraints on parameters which keep the Markov chain away from those points.

[Figure 1: Non-hierarchical Mixture Model. Graphical model with plates over components k = 1, ..., K, pauses j = 1, ..., J_i, and students i = 1, ..., I; Z_{i,j} and Y_{i,j} depend on the student-specific parameters π_i, μ_i and σ_i.]

2.1 Mixture of Normals

Consider a collection of observations Y_i = (Y_{i,1}, ..., Y_{i,J_i}) for a single student i. Assume that the process that generated these data is a mixture of K normals. Let Z_{i,j} ~ cat(π_i) be a categorical latent index variable indicating which component Observation j comes from, and let Y*_{i,j,k} ~ N(μ_{i,k}, σ_{i,k}) be the value of Y_{i,j} which would be realized when Z_{i,j} = k.

Figure 1 shows this model graphically. The plates indicate replication over categories (k), Level 1 (pauses, j) and Level 2 (students, i) units. Note that there is no connection across plates, so the model fit to each Level 2 unit is independent of all the other Level 2 units. This is what Gelman et al. (2013) call the no pooling case.

The latent variables Z and Y* can be removed from the likelihood for Y by summing over the possible values of Z. The likelihood for one student's data, Y_i, is
prior information. A better approach is to do an additional MCMC run, using the posterior distributions for the cross-student parameters from the previous run as priors (Mislevy et al., 1999). If the hyperparameters for the cross-student priors are passed to the MCMC sampler as data, then the same JAGS or Stan model can be reused for additional student pause records. Note that, in addition to the MCMC code, Stan provides a function that will find the maximum of the posterior. Even if MCMC is used at this step, the number of Level 2 units, J, will be smaller, and the chains will reach pseudo-convergence more quickly.

If the number of students, J, is large in the original sample (as is true in the original data set), then a sample of students can be used in the initial model calibration, and student-level parameters for the others can be estimated later using this method. In particular, there seems to be a paradox in estimating models using large data sets with MCMC: the larger the data, the slower the run. This is not just because each data point needs to be visited within each cycle of each Markov chain. It appears, especially with hierarchical models, that more Level 2 units cause the chain to have higher autocorrelation, meaning larger runs are needed to achieve acceptable Monte Carlo error. For the student keystroke logs, the initial data used for calibration was a sample of 100 essays from the complete coll
r of Monte Carlo samples). It is also related to the autocorrelation (correlation between successive draws) of the Markov chain. The higher the autocorrelation, the lower the effective sample size of the Monte Carlo sample, and the higher the Monte Carlo error.

Most methods for building the Markov chain are based on the Metropolis algorithm (see Geyer, 2011, for details). A new value for one or more parameters is proposed, and the new value is accepted or rejected randomly according to the ratio of the posterior distribution at the old and new points, with a correction factor for the mechanism used to generate the new sample. This is called a Metropolis or Metropolis-Hastings update (the latter contains a correction for asymmetric proposal distributions). Gibbs sampling is a special case in which the proposal is chosen in such a way that it will always be accepted.

As the form of the proposal distribution does not matter for the correctness of the algorithm, the most common method is to go one parameter at a time and add a random offset (a step) to its value, accepting or rejecting it according to the Metropolis rule. As this distribution is essentially a random walk over the parameter space, this implementation of MCMC is called random walk Metropolis (RWM). The step size is a critical tuning parameter. If the average step size is too large, the value will be rejected nearly every cycle and the autocorrelation w
re computed and reported. Note that JAGS offers some possibilities here that Stan does not. In particular, JAGS can monitor just the cross-student parameters (α₀, α_N, μ₀, β₀, log(τ₀) and γ₀) for a much longer time to check pseudo-convergence, and then a short additional run can be used to draw inferences about the student-specific parameters (π_i, μ_i and τ_i), for a considerable memory savings.

8. Data point labeling. In mixture models, it is sometimes of interest to identify which mixture component each observation Y_{i,j} comes from (Section 4.5).

9. Calculate model fit index. If the goal is to compare models for several different values of K, then a measure of model fit such as DIC or WAIC should be calculated (Section 5).

4.1 Component Identification

The multiple modes in the posterior distribution for mixture models present a special problem for MCMC. In particular, it is possible for a Markov chain to get stuck in a mode corresponding to a single component labeling and never mix to the other component labelings. This is especially problematic when coupled with the common practice of starting several parallel Markov chains. Frühwirth-Schnatter (2001) describes this problem in detail.

One solution is to constrain the parameter space to follow some canonical ordering (e.g., μ_{i,1} ≤ μ_{i,2} ≤ ... ≤ μ_{i,K}). Stan allows parameter vectors to be specified as ordered, that is, restricted to be in increasing order. This seems t
43. …the ability of JAGS to allow the user to selectively monitor parameters and to extend the chains without starting over allows for better memory management.

The code supplied at the web site solves two key technical problems: the post-run sorting of the mixture components (Frühwirth-Schnatter, 2001) and the calculation of the WAIC model fit indexes. Hopefully this code will be useful for somebody working on a similar application.

The model proposed in this paper has two problems, even with data simulated according to the model. The first is the slow mixing. This appears to be a problem with multiple modes and indicates some possible non-identifiability in the model specifications; thus further work is needed on the form of the model. The second problem is that the WAIC test does not appear to be sensitive enough to recover the number of components in the model. As the original purpose of moving to the hierarchical mixture model was to discover whether there were rare components that could not be detected in the single-level hierarchical model, the current model does not appear to be a good approach for that purpose.

In particular, returning to the original application of fitting mixture models to student pauses while writing, the hierarchical part of the model seems to be creating as many problems as it solves. It is not clearly better than fitting the non-hierarchical mixture model individually to each student. Another problem…
44. …convergence, the size of the Monte Carlo error could still be an issue. Both rstan and coda compute an effective sample size for the Monte Carlo sample. This is the size of a simple random sample from the posterior distribution that would have the same Monte Carlo error as the obtained dependent sample; it is different for different parameters. If the effective sample size is too small, then additional samples are needed.

Note that there are K(3J + 5) parameters whose convergence needs to be monitored. It is difficult to achieve pseudo-convergence for all of these parameters, and exhausting to check them all. A reasonable compromise seems to be to monitor only the 5K cross-student parameters α_0, μ_0, β_0, log τ_0, and γ_0. JAGS makes this process easier by allowing the analyst to pick which parameters to monitor: using JAGS, only the cross-student parameters can be monitored during the first MCMC run, and then a second, shorter sample of student-level parameters can be obtained through an additional run. The rstan package always monitors all parameters, including the θ and other auxiliary parameters, which are not of particular interest.

When I and J are large, a run can take several hours to several days. As the end of a run might occur during the middle of the night, it is useful to have an automated test of convergence. The rule I have been using is to check whether all R̂ are less than a certain maximum (by default 1.1) and whether all…
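Such a rule is easy to script with the coda package. In this sketch the object fit.coda (an mcmc.list of several parallel chains restricted to the monitored cross-student parameters), the node names, and the thresholds are illustrative assumptions, not the code posted with the paper.

    library(coda)
    # Automated pseudo-convergence check: all R-hat values below a maximum and all
    # effective sample sizes above a minimum.
    converged <- function(fit.coda, max.rhat = 1.1, min.neff = 100) {
      rhat <- gelman.diag(fit.coda, multivariate = FALSE)$psrf[, "Point est."]
      neff <- effectiveSize(fit.coda)
      all(rhat < max.rhat) && all(neff > min.neff)
    }
    # With rjags, the first run can monitor only the cross-student parameters, e.g.
    # fit.coda <- coda.samples(model, c("alpha0", "mu0", "beta0", "log.tau0", "gamma0"),
    #                          n.iter = 5000)
    # converged(fit.coda)   # if FALSE, extend the chains and check again

Wrapping the check in a loop that extends the chains until converged() returns TRUE (or a wall-clock limit is hit) gives the kind of overnight automation described here.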
45. …which are related to the variances of the parameters. To fix this problem, the component labels need to be permuted separately for each cycle of the MCMC sample. With a one-level mixture model, it is possible to identify the components by sorting on one of the student-level parameters: π_{i,k}, μ_{i,k}, or τ_{i,k}. For the hierarchical mixture model, one of the cross-student parameters α_{0,k}, μ_{0,k}, or τ_{0,k} should be used. Note that in the hierarchical models some of the student-level parameters might not obey the constraints. In the case of the pause-time analysis this is acceptable: some students may behave differently from most of their peers, and the ultimate goal of the mixture modeling is to identify students who may benefit from different approaches to instruction. Choosing an identification rule involves picking which parameter should be used for identification at Level 2 (cross-student); the corresponding parameter is then used for student-level parameter identification. Component identification is then straightforward: let ω be the parameter chosen for model identification; Figure 3 describes the algorithm.

4.2 Starting Values

Although the Markov chain should eventually reach its stationary distribution no matter where it starts, starting places that are closer to the center of the distribution are better, in the sense that the chain should…

[Figure 3 (fragment): for each Markov chain c do; for each MCMC sample r in Chain c do; find a permutation of indexes…]
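The per-cycle relabeling can be sketched in R as follows. The JAGS/Stan-style column names (e.g., "mu0[1]"), the parameter names, and the choice of μ_0 as the identification parameter ω are illustrative assumptions rather than the code posted at the paper's web site.

    # Post-run sorting of mixture components, one MCMC cycle at a time.
    # draws: a matrix of one chain's samples with columns named "mu0[1]", ..., "mu0[K]",
    #        "alpha0[1]", ..., etc.
    relabel <- function(draws, K, id.par = "mu0",
                        pars = c("alpha0", "mu0", "beta0", "tau0", "gamma0")) {
      id.cols <- paste0(id.par, "[", 1:K, "]")
      for (r in seq_len(nrow(draws))) {
        ord <- order(draws[r, id.cols])           # canonical (increasing) order this cycle
        for (p in pars) {
          cols <- paste0(p, "[", 1:K, "]")
          draws[r, cols] <- draws[r, cols][ord]   # same permutation for every
        }                                         # component-indexed parameter
      }
      draws
    }
    # Example use: sorted <- relabel(as.matrix(fit.coda[[1]]), K = 3)

Applying relabel() to each chain before computing summaries keeps every draw within a cycle under the same component labeling, which is the point of the algorithm in Figure 3.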
46. …Unpublished master's thesis, Florida State University.

McLachlan, G., & Krishnan, T. (2008). The EM algorithm and extensions (2nd ed.). Wiley.

McLachlan, G., & Peel, D. (2000). Finite mixture models. Wiley.

Mislevy, R. J., Almond, R. G., Yan, D., & Steinberg, L. S. (1999). Bayes nets in educational assessment: Where the numbers come from. In K. B. Laskey & H. Prade (Eds.), Uncertainty in artificial intelligence 99 (pp. 437-446). Morgan Kaufmann.

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In S. Brooks, A. Gelman, G. L. Jones, & X. Meng (Eds.), Handbook of Markov chain Monte Carlo (pp. 113-162). CRC Press.

Plummer, M. (2012, May). JAGS version 3.2.0 user manual (3.2.0 ed.) [Computer software manual]. Retrieved from http://mcmc-jags.sourceforge.net

Plummer, M., Best, N., Cowles, K., & Vines, K. (2006). CODA: Convergence diagnosis and output analysis for MCMC. R News, 6(1), 7-11. Retrieved from http://CRAN.R-project.org/doc/Rnews/

R Core Team (2014). R: A language and environment for statistical computing [Computer software manual]. Vienna, Austria. Retrieved from http://www.R-project.org

Stan Development Team (2013). Stan: A C++ library for probability and sampling (Version 2.2.0) [Computer software manual]. Retrieved from http://mc-stan.org

Thomas, A., O'Hara, B., Ligges, U., & Sturtz, S. (2006). Making BUGS open. R News, 6(1), 12-17. Retrieved from http://www.rni.helsinki.fi/poh/publications/Rnews2006-1.pdf

Thomas, A., Spiegelhalter, D. J., & Gil…
47. …classic slow-mixing pattern: the chains are crossing but moving slowly across the support of the parameter space. Thus, for JAGS, even though the chains have nearly reached pseudo-convergence (R̂ is just slightly greater than 1.1), the effective sample size is only 142 observations; a longer chain might be needed for a good estimate of this parameter. The Stan output (bottom row) looks much better, and the effective sample size is a respectable 3,680.

The Markov chains for α_{0,1}, the proportion of pause times in the first component, have not yet reached pseudo-convergence, but they are close; a longer run might get them there. Note that the black chain often ranges above the values in the red and green chains in the JAGS run (upper row). This is an indication that the chains may be stuck in different modes.

[Figure 4: Deviance (JAGS) and Log Posterior (Stan) traceplots and density plots. Panel annotations: R̂ (JAGS) = 1, effective sample size (JAGS) = 7992; R̂ (Stan) = 1.08, effective sample size (Stan) = 7388; each traceplot shows 15,000 iterations. Note: To reduce the file size of this paper, this is a bitmap picture of the traceplot; the original pdf version is available at http://pluto.coe.fsu.edu/mcmc/hierMM]
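The phenomenon described here, chains that overlap (so R̂ looks acceptable) but move slowly (so the effective sample size is small), is easy to reproduce with a toy autocorrelated chain. The AR(1) simulation below is an illustrative stand-in, not output from the paper's models.

    library(coda)
    # Two chains of strongly autocorrelated draws, mimicking a slow-mixing parameter.
    set.seed(2)
    ar.chain <- function(n, rho = 0.95) as.mcmc(as.numeric(arima.sim(list(ar = rho), n)))
    fit.coda <- mcmc.list(ar.chain(5000), ar.chain(5000))
    traceplot(fit.coda)        # chains overlaid in different colors
    effectiveSize(fit.coda)    # far smaller than 2 * 5000 because of autocorrelation
    gelman.diag(fit.coda)      # R-hat near 1 does not guarantee a large effective sample

For real output, an mcmc.list from coda.samples (JAGS) or from rstan::As.mcmc.list applied to a stanfit object can be passed to the same three functions to produce plots comparable to Figure 4.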
48. …must be strictly positive. The natural conjugate distribution for a precision is a gamma distribution, which is strictly positive. The lognormal distribution used in Equations 4 and 9 has a similar shape, but its parameters can be interpreted as a mean and a standard deviation on the log scale. The mixing probabilities π must be defined over the unit simplex (i.e., they must all be between 0 and 1, and they must sum to 1), as must the expected mixing probabilities α_0; the Dirichlet distribution satisfies this constraint and is also a natural conjugate. Finally, α_N must be strictly positive; the choice of the chi-squared distribution as a prior ensures this.

There are other, softer restraints on the parameters. If β_{0,k} or γ_{0,k} gets too high or too low for any value of k, the result is a no-pooling or complete-pooling solution on that mixture component. For both of these parameters, [0.1, 100] seems like a plausible range. A lognormal distribution which puts the bulk of its probability mass in that range should keep the model away from those extremes. Values of τ that are too small represent the collapse of a mixture component onto a single point. Again, if τ_0 is mostly in the range [0.1, 100], the chance of an extreme value of τ should be small. This yields the following priors:

    log β_{0,k} ∼ N(0, 1),   (11)
    log γ_{0,k} ∼ N(0, 1),   (12)
    log τ_{0,k} ∼ N(0, 1).   (13)

High values of α_N also correspond to a degenerate solution. In general, the…
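A quick way to see that these priors respect the soft constraints is to compute the ranges they imply. The check below is a sketch based on the plausible range discussed above, not code from the paper.

    # Range implied by a lognormal(0, 1) prior, as used for beta0, gamma0, and tau0.
    qlnorm(c(0.025, 0.975), meanlog = 0, sdlog = 1)   # central 95% interval: about 0.14 to 7.1
    plnorm(100, 0, 1) - plnorm(0.1, 0, 1)             # about 0.99 of the prior mass in [0.1, 100]

Both quantities confirm that the prior keeps these scale parameters well away from the degenerate extremes while still covering the plausible range.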
49. …value of the μ_i's one at a time and then updating the values of μ_0 and v. When updating μ_i, if v is small, then values of μ_i close to μ_0 will have the highest conditional probability. When updating v, if all of the values of μ_i are close to μ_0, then small values of v will have the highest conditional probability. The net result is a chain in which the movement from states with small v and the μ_i's close together to states with large v and the μ_i's far apart takes many steps.

A simple trick produces a chain which moves much more quickly. Introduce a series of auxiliary variables θ_i ∼ N(0, 1) and set μ_i = μ_0 + √v θ_i. Note that the marginal distribution of μ_i has not changed, but the geometry of the (θ_i, μ_0, v) space is different from the geometry of the (μ_i, μ_0, v) space, resulting in a chain that moves much more quickly.

A similar trick works for modeling the relationship between α and π. Setting α_k = α_{0,k} α_N, where α_0 has a Dirichlet distribution and α_N has a gamma distribution, seems to work well for the parameters α of the Dirichlet distribution. In this case the parameters have a particularly easy-to-interpret meaning: α_0 is the expected value of the Dirichlet distribution, and α_N is the effective sample size of the Dirichlet distribution.

Applying these two tricks to the parameters of the hierarchical model yields the following augmented parameterization:

    α_k = α_{0,k} α_N,   (6)
    μ_{i,k} = μ_{0,k} + θ_{i,k} β_{0,k},   (7)
    θ_{i,k} ∼ N(0, 1),   (8)
    log τ_{i,k}…
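A quick numerical check, with illustrative values of μ_0 and v, confirms the claim that the auxiliary-variable parameterization leaves the marginal distribution of μ_i unchanged.

    # mu0 + sqrt(v) * theta with theta ~ N(0, 1) has the same distribution as a
    # direct draw from N(mu0, v).  The values of mu0 and v are illustrative.
    set.seed(1)
    mu0 <- 2; v <- 4
    mu.centered <- rnorm(1e5, mu0, sqrt(v))              # direct (centered) draws
    theta       <- rnorm(1e5, 0, 1)                      # auxiliary variables
    mu.expanded <- mu0 + sqrt(v) * theta                 # expanded (non-centered) draws
    round(c(mean(mu.centered), mean(mu.expanded)), 2)    # both near 2
    round(c(sd(mu.centered),   sd(mu.expanded)), 2)      # both near 2

Only the geometry seen by the sampler changes, which is why the expanded parameterization mixes faster without altering the model.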
