GEMMA User Manual

1. -k [filename] specifies relatedness matrix file name; -o [prefix] specifies output file prefix.

Notice that -k [filename] could be replaced by -d [filename] and -u [filename], where -d [filename] specifies the eigen value file and -u [filename] specifies the eigen vector file. The BIMBAM mean genotype file and/or the relatedness matrix file (or the eigen vector file) can be provided in a gzip compressed format.

4.5.2 Detailed Information

The algorithm calculates either the REML or the MLE estimate of $\lambda$ in the evaluation interval from $1 \times 10^{-5}$ (corresponding to almost pure environmental effect) to $1 \times 10^{5}$ (corresponding to almost pure genetic effect). Although unnecessary in most cases, one can expand or reduce this evaluation interval by specifying -lmin and -lmax (e.g. -lmin 0.01 -lmax 100 specifies an interval from 0.01 to 100). The log-scale evaluation interval is further divided into 10 equally spaced regions, and optimization is carried out in each region where the first derivatives change sign. Although also unnecessary in most cases, one can increase or decrease the number of regions by using -region (e.g. -region 100 uses 100 equally spaced regions on the log scale), which may yield more stable or faster performance, respectively.

For binary traits, one can label controls as 0 and cases as 1, and follow our previous approaches to fit the data with a linear mixed model by tre
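The division of the evaluation interval into log-scale regions can be sketched as follows. This is only an illustration of the arithmetic described above, not GEMMA's own code; the function name `lambda_regions` and its argument names are hypothetical, and the defaults simply mirror the interval and region count quoted in the text.

```python
import math


def lambda_regions(lmin=1e-5, lmax=1e5, n_regions=10):
    """Split [lmin, lmax] into n_regions intervals equally spaced on the log10 scale.

    Hypothetical helper for illustration; defaults follow the values quoted in the text.
    """
    if lmin <= 0 or lmax <= lmin or n_regions < 1:
        raise ValueError("need 0 < lmin < lmax and n_regions >= 1")
    lo, hi = math.log10(lmin), math.log10(lmax)
    step = (hi - lo) / n_regions
    # Region edges are evenly spaced in log10(lambda), not in lambda itself.
    edges = [10 ** (lo + i * step) for i in range(n_regions + 1)]
    return list(zip(edges[:-1], edges[1:]))


if __name__ == "__main__":
    for low, high in lambda_regions():
        print(f"{low:.0e} .. {high:.0e}")
```

With the default settings each region spans one order of magnitude; a narrower interval with more regions gives a finer grid in which to look for sign changes of the first derivative.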
2. 4.6 Association Tests with Multivariate Linear Mixed Models

4.6.1 Basic Usage

The basic usages for association analysis with either the PLINK binary ped format or the BIMBAM format are:

    gemma -bfile [prefix] -k [filename] -lmm [num] -n [num1] [num2] [num3] -o [prefix]
    gemma -g [filename] -p [filename] -a [filename] -k [filename] -lmm [num] -n [num1] [num2] [num3] -o [prefix]

This is identical to the above univariate linear mixed model association test, except that an -n option is employed to specify which phenotypes in the phenotype file are used for association tests. The values after the -n option should be separated by a space.

Notice that -k [filename] could be replaced by -d [filename] and -u [filename], where -d [filename] specifies the eigen value file and -u [filename] specifies the eigen vector file. The BIMBAM mean genotype file and/or the relatedness matrix file (or the eigen vector file) can be provided in a gzip compressed format.

4.6.2 Detailed Information

Although the number of phenotypes used for analysis can be arbitrary, it is highly recommended to keep the number of phenotypes small, say fewer than ten. In addition, when a small proportion of phenotypes are partially missing, one can impute these missing values before association tests:

    gemma -bfile [prefix] -k [filename] -predict -n [num1] [num2] [num3] -o [prefix]
    gemma -g [filenam
3. -n [num], where -n 1 uses the original sixth column as phenotypes, -n 2 uses the seventh column, and so on.

GEMMA codes alleles as 0/1 according to the PLINK webpage on the binary PLINK format (pngu.mgh.harvard.edu/~purcell/plink/binary.shtml). Specifically, column 5 of the .bim file is the minor allele and is coded as 1, while column 6 of the .bim file is the major allele and is coded as 0. The minor allele in column 5 is therefore the effect allele (notice that GEMMA version 0.92 and earlier treat the major allele as the effect allele).

GEMMA will read the phenotypes as provided and will recognize either -9 or NA as missing phenotypes. If the phenotypes in the .fam file are disease status, it is recommended to label controls as 0 and cases as 1, as the results will be easier to interpret. For example, the predicted values from a linear BSLMM can be directly interpreted as the probability of being a case. In addition, the probit BSLMM will only recognize 0/1 as control/case labels.

For prediction problems, it is recommended to list all individuals in the file but label those individuals in the test set as missing. This will facilitate the use of the prediction function implemented in GEMMA.

3.2 BIMBAM File Format

GEMMA also recognizes the BIMBAM file format (http://stephenslab.uchicago.edu/software.html) [1], which is particularly useful for imputed genotypes as well as for general covaria
4. -bslmm 3 fits a probit BSLMM using MCMC. Therefore, option -bslmm 2 is much faster than the other two options, and option -bslmm 1 is faster than -bslmm 3. For MCMC-based methods, one can use -w [num] to choose the number of burn-in iterations that will be discarded, and -s [num] to choose the number of sampling iterations that will be saved. In addition, one can use -smax [num] to choose the maximum number of SNPs to include in the model (i.e. SNPs that have additional effects), which may also be needed for the probit BSLMM because of its heavier computational burden. It is up to the users to decide these values for their own data sets, in order to balance computation time and computation accuracy.

For binary traits, one can label controls as 0 and cases as 1, and follow our previous approach to fit the data with a linear BSLMM by treating the binary case control labels as quantitative traits [5]. This approach can be justified by recognizing the linear model as a first order Taylor approximation to a generalized linear model. One can of course choose to fit a probit BSLMM, but in our limited experience, we do not find appreciable prediction accuracy gains in using the probit BSLMM over the linear BSLMM for binary traits (see a brief discussion in [5]). This of course could be different for a different data set.

The genotypes, phenotypes (except for the probit BSLMM), as well as the relatedness matrix will be
5. a known n by n relatedness matrix, $I_{n \times n}$ is an n by n identity matrix, $V_g$ is a d by d symmetric matrix of genetic variance component, $V_e$ is a d by d symmetric matrix of environmental variance component, and $MN_{n \times d}(0, V_1, V_2)$ denotes the n × d matrix normal distribution with mean 0, row covariance matrix $V_1$ (n by n), and column covariance matrix $V_2$ (d by d).

GEMMA performs tests comparing the null hypothesis that the marker effect sizes for all phenotypes are zero, $H_0: \beta = 0$, where $0$ is a d-vector of zeros, against the general alternative $H_1: \beta \neq 0$. For each SNP in turn, GEMMA obtains either the maximum likelihood estimate (MLE) or the restricted maximum likelihood estimate (REML) of $V_g$ and $V_e$, and outputs the corresponding p value. In addition, GEMMA estimates the genetic and environmental correlations among phenotypes.

1.3.3 Bayesian Sparse Linear Mixed Model

GEMMA can fit a Bayesian sparse linear mixed model in the following form, as well as a corresponding probit counterpart:

$$y = 1_n \mu + X\beta + u + \epsilon; \quad \beta_i \sim \pi N(0, \sigma_a^2 \tau^{-1}) + (1 - \pi)\delta_0, \quad u \sim MVN_n(0, \sigma_b^2 \tau^{-1} K), \quad \epsilon \sim MVN_n(0, \tau^{-1} I_n),$$

where $1_n$ is an n-vector of 1s, $\mu$ is a scalar representing the phenotype mean, X is an n × p matrix of genotypes measured on n individuals at p genetic markers, $\beta$ is the corresponding p-vector of the genetic marker effects, and other parameters are the same as defined in the standard linear mixed model in the previous section.

In the special case $K = XX^T/p$ (default in GEMMA), the SNP
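For the special case $K = XX^T/p$ mentioned above, a short worked step may help show why the random effect $u$ can be read as a sum of small effects from every SNP. The symbol $\sigma_b^2\tau^{-1}$ for the scale of the covariance of $u$, and the prior scale placed on $\alpha$ below, are assumptions chosen so that the covariances match; treat this as a sketch rather than a statement of the manual's own derivation.

```latex
% Assumption: each small effect has prior variance \sigma_b^2 \tau^{-1} / p.
\alpha \sim MVN_p\!\left(0,\; \frac{\sigma_b^2 \tau^{-1}}{p}\, I_p\right)
\quad\Longrightarrow\quad
\mathrm{Cov}(X\alpha) = X\,\mathrm{Cov}(\alpha)\,X^T
= \sigma_b^2 \tau^{-1}\, \frac{XX^T}{p}
= \sigma_b^2 \tau^{-1} K .
% Hence u = X\alpha has the same distribution as u \sim MVN_n(0, \sigma_b^2 \tau^{-1} K).
```

Under this reading, a random effect with relatedness matrix $XX^T/p$ is equivalent to giving every marker its own small normal effect.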
6. 4.5 Association Tests with Univariate Linear Mixed Models
    4.5.1 Basic Usage
    4.5.2 Detailed Information
    4.5.3 Output Files
4.6 Association Tests with Multivariate Linear Mixed Models
    4.6.1 Basic Usage
    4.6.2 Detailed Information
    4.6.3 Output Files
4.7 Fit a Bayesian Sparse Linear Mixed Model
    4.7.1 Basic Usage
    4.7.2 Detailed Information
    4.7.3 Output Files
    4.8.1 Basic Usage
    4.8.2 Detailed Information

1 Introduction

1.1 What is GEMMA

GEMMA is the software implementing the Genome-wide Efficient Mixed Model Association algorithm [6] for a standard linear mixed model and some of its close relatives for genome-wide association studies (GWAS). It fits a univariate linear mixed model (LMM) for marker association tests with a single phenotype to account for population stratification and sample structure, and for estimating the proportion of variance in phenotypes explained (PVE) by typed genotypes (i.e. chip heritability) [6]. It fits a multivariate linear mixed model (mvLMM) for testing marker associations with multiple phenotypes simultan
7. can accommodate values other than SNP genotypes. One can use the -notsnp option to disable the minor allele frequency cutoff and to use any numerical values as covariates.

3.2.2 Phenotype File

This file contains phenotype information. Each line is a number indicating the phenotype value for each individual in turn, in the same order as in the mean genotype file. Notice that only numeric values are allowed; characters will not be recognized by the software. Missing phenotype information is denoted as NA. The number of rows should be equal to the number of individuals in the mean genotype file. An example phenotype file with five individuals and one phenotype is as follows:

    1.2
    NA
    2.7
    0.2
    3.3

One can include multiple phenotypes as multiple columns in the phenotype file, and specify a different column for association tests by using -n [num], where -n 1 uses the original first column as phenotypes, -n 2 uses the second column, and so on. An example phenotype file with five individuals and three phenotypes is as follows:

    1.2 0.3 1.5
    NA 1.5 0.3
    2.7 1.1 NA
    0.2 0.7 0.8
    3.3 2.4 2.1

For binary traits, it is recommended to label controls as 0 and cases as 1, as the results will be easier to interpret. For example, the predicted values from a linear BSLMM can be directly interpreted as the probability of being a case. In addition, the probit BSLMM will only recognize 0/1 as contro
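The column selection described for the phenotype file can be sketched in a few lines. This is an illustrative reader, not part of GEMMA; the function name `read_phenotypes` is hypothetical, whitespace-delimited input is an assumption, and the sample values are patterned on the three-phenotype example above.

```python
import math


def read_phenotypes(text, n=1):
    """Return phenotype column n (1-based) from phenotype-file text.

    Hypothetical helper: assumes whitespace-delimited columns and maps the
    missing-value code NA to float('nan').
    """
    values = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        token = fields[n - 1]
        values.append(math.nan if token == "NA" else float(token))
    return values


EXAMPLE = """1.2 0.3 1.5
NA 1.5 0.3
2.7 1.1 NA
0.2 0.7 0.8
3.3 2.4 2.1
"""

if __name__ == "__main__":
    print(read_phenotypes(EXAMPLE, n=2))
```

Individuals stay in file order, so the returned list lines up row by row with the mean genotype file.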
8. centered when fitting the models. The estimated values in the output files are thus for these centered values. Therefore, proper prediction will require genotype means and phenotype means from the individuals in the training set, and one should always use the same phenotype file (and the same phenotype column) and the same genotype file, with individuals in the test set labeled as missing, to fit the BSLMM and to obtain predicted values as described in the next section.

4.7.3 Output Files

There will be five output files, all inside an output folder in the current directory. The prefix.log.txt file contains some detailed information about the running parameters and computation time. In addition, prefix.log.txt contains the PVE estimate and its standard error in the null linear mixed model (not the BSLMM).

The prefix.hyp.txt file contains the posterior samples for the hyper-parameters ($h$, PVE, $\rho$, PGE, $\pi$ and $|\gamma|$) for every 10th iteration. An example file with a few samples is shown below:

    h            pve          rho          pge          pi           n_gamma
    4.777635e-01 5.829042e-01 4.181280e-01 4.327976e-01 2.106763e-03 25
    5.278073e-01 5.667885e-01 3.339020e-01 4.411859e-01 2.084355e-03 26
    5.278073e-01 5.667885e-01 3.339020e-01 4.411859e-01 2.084355e-03 26
    6.361674e-01 6.461678e-01 3.130355e-01 3.659850e-01 2.188401e-03 25
    5.479237e-01 6.228036e-01 3.231856e-01 4.326231e-01 2.164183e-03 27

The prefix.param.txt file contains the posterior mean estimates for the effect size parameters ($\alpha$, $\beta \mid \gamma = 1$ a
9. the relatedness value between the ith and jth individuals. An example relatedness matrix file with three individuals is as follows:

    0.3345 0.0227 0.0103
    0.0227 0.3032 0.0253
    0.0103 0.0253 0.3531

The second relatedness matrix format is a three-column id id value format, where the first two columns show two individual id numbers, and the third column shows the relatedness value between these two individuals. Individual ids are not required to be in the same order as in the .fam file, and relatedness values not listed in the relatedness matrix file will be considered as 0. An example relatedness matrix file with the same three individuals above is shown below:

    id1 id1 0.3345
    id1 id2 0.0227
    id1 id3 0.0103
    id2 id2 0.3032
    id2 id3 0.0253
    id3 id3 0.3531

As BIMBAM mean genotype files do not provide individual ids, the second format only works with the PLINK binary ped format. One can use -km [num] to choose which format to use, i.e. use -km 1 or -km 2 to accompany the PLINK binary ped format, and use -km 1 to accompany the BIMBAM format.

3.3.2 Eigen Value and Eigen Vector Format

GEMMA can also read the relatedness matrix in its decomposed forms. To do this, one should supply two files instead of one: one file containing the eigen values and the other file containing the corresponding eigen vectors. The eigen value file contains one column of $n_a$ elements, with each element corresponding to an eigen value. The eigen
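The relationship between the two relatedness matrix formats can be illustrated with a small converter from the three-column id id value layout to a full square matrix. This is a sketch only; the function name `pairs_to_matrix` is hypothetical, and mirroring each listed pair into both triangles is an assumption based on the example listing each pair once.

```python
def pairs_to_matrix(lines, ids):
    """Build an n x n relatedness matrix from 'id id value' lines.

    Hypothetical helper. Pairs that are not listed stay at 0, as the text
    describes; each listed pair is mirrored (an assumption) so the result
    is symmetric.
    """
    index = {name: i for i, name in enumerate(ids)}
    n = len(ids)
    matrix = [[0.0] * n for _ in range(n)]
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        i, j = index[fields[0]], index[fields[1]]
        value = float(fields[2])
        matrix[i][j] = value
        matrix[j][i] = value
    return matrix


EXAMPLE_PAIRS = [
    "id1 id1 0.3345",
    "id1 id2 0.0227",
    "id1 id3 0.0103",
    "id2 id2 0.3032",
    "id2 id3 0.0253",
    "id3 id3 0.3531",
]

if __name__ == "__main__":
    for row in pairs_to_matrix(EXAMPLE_PAIRS, ["id1", "id2", "id3"]):
        print(" ".join(f"{v:.4f}" for v in row))
```

The order of `ids` fixes the row and column order of the output, which is what ties the square format to the order of individuals in the genotype files.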
10. 3 fits a probit BSLMM; -bfile [prefix] specifies PLINK binary ped file prefix; -g [filename] specifies BIMBAM mean genotype file name; -p [filename] specifies BIMBAM phenotype file name; -a [filename] (optional) specifies BIMBAM SNP annotation file name; -o [prefix] specifies output file prefix.

Notice that the BIMBAM mean genotype file can be provided in a gzip compressed format.

4.7.2 Detailed Information

Notice that a large amount of memory is needed to fit BSLMM (e.g. it may need 20 GB for a data set with 4000 individuals and 400,000 SNPs), because the software has to store the whole genotype matrix in physical memory. The float version (gemmaf) can be used to save about half of the memory requirement without noticeable loss of accuracy.

By default, GEMMA does not require the user to provide a relatedness matrix explicitly. It internally calculates and uses the centered relatedness matrix, which has the nice interpretation that each effect size $\beta_i$ follows a mixture of two normal distributions a priori. Of course, one can choose to supply a relatedness matrix by using the -k [filename] option. In addition, GEMMA does not take a covariates file when fitting BSLMM. However, one can use the BIMBAM mean genotype file to store these covariates and use the -notsnp option to use them.

The option -bslmm 1 fits a linear BSLMM using MCMC, -bslmm 2 fits a ridge regression/GBLUP with a standard non-MCMC method, and
11. GEMMA User Manual
Xiang Zhou
October 31, 2014

Contents

1 Introduction
    1.1 What is GEMMA
    1.2 How to Cite GEMMA
    1.3 Models
        1.3.1 Univariate Linear Mixed Model
        1.3.2 Multivariate Linear Mixed Model
        1.3.3 Bayesian Sparse Linear Mixed Model
    1.4 Missing Data
        1.4.1 Missing Genotypes
        1.4.2 Missing Phenotypes
2 Installing and Compiling GEMMA
3 Input File Formats
    3.1 PLINK Binary PED File Format
    3.2 BIMBAM File Format
        3.2.1 Mean Genotype File
        3.2.2 Phenotype File
        3.2.3 SNP Annotation File (optional)
    3.3 Relatedness Matrix File Format
        3.3.1 Original Matrix Format
        3.3.2 Eigen Value and Eigen Vector Format
    3.4 Covariates File Format (optional)
4 Running GEMMA
    4.1 A Small GWAS Example Dataset
    4.2 SNP filters
    4.3 Estimate Relatedness Matrix from Genotypes
        4.3.1 Basic Usage
        4.4.1 Basic Usage
        4.4.2 Detailed Information
        4.4.3 Output Files
12. IMBAM and PLINK binary formats. For demonstration purposes, for CD8 we randomly divided the 85 families into two sets, where each set contained roughly half of the individuals (i.e. inter-family split), as in [5]. We also created artificial binary phenotypes from the quantitative phenotypes CD8, by assigning the half of the individuals with higher quantitative values to 1 and the other half to 0, as in [5]. Therefore, the phenotype files contain six columns of phenotypes. The first column contains the quantitative phenotypes CD8 for all individuals. The second column contains quantitative phenotypes CD8 for individuals in the training set. The third column contains quantitative phenotypes CD8 for individuals in the test set. The fourth and fifth columns contain binary phenotypes CD8 for individuals in the training set and test set, respectively. The sixth column contains the quantitative phenotypes MCH for all individuals.

A demo.txt file inside the same folder shows detailed steps on how to use GEMMA to estimate the relatedness matrix from genotypes, how to perform association tests using both the univariate linear mixed model and the multivariate linear mixed model, how to fit the Bayesian sparse linear mixed model, and how to obtain predicted values using the output files from fitting BSLMM. The output results from GEMMA for all the examples are also available inside the result subfolder.

4.2 SNP filters

There are a few SNP filters implemented in the s
13. ating the binary case control labels as quantitative traits [6, 5]. This approach can be justified partly by recognizing the linear model as a first order Taylor approximation to a generalized linear model, and partly by the robustness of the linear model to model misspecification [5].

4.5.3 Output Files

There will be two output files, both inside an output folder in the current directory. The prefix.log.txt file contains some detailed information about the running parameters and computation time. In addition, prefix.log.txt contains the PVE estimate and its standard error in the null linear mixed model.

The prefix.assoc.txt file contains the results. An example file with a few SNPs is shown below:

    chr rs         ps      n_miss allele1 allele0 af    beta         se           l_remle      p_wald
    1   rs3683945  3197400 0      A       G       0.443 7.788665e-02 6.193502e-02 4.317993e+00 2.087616e-01
    1   rs3707673  3407393 0      G       A       0.443 6.654282e-02 6.210234e-02 4.316144e+00 2.841271e-01
    1   rs6269442  3492195 0      A       G       0.365 5.344241e-02 5.377464e-02 4.323611e+00 3.204804e-01
    1   rs6336442  3580634 0      A       G       0.443 6.770154e-02 6.209267e-02 4.315713e+00 2.757541e-01
    1   rs13475700 4098402 0      A       C       0.127 5.659089e-02 7.175374e-02 4.340145e+00 4.304306e-01

The eleven columns are: chromosome numbers, SNP ids, base pair positions on the chromosome, number of missing values for a given SNP, minor allele, major allele, allele frequency, beta estimates, standard errors for beta, REMLE estimates for lambda, and p values from the Wald test
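Because the association output is a plain whitespace-delimited table with a header row, it can be post-processed with a few lines of code. The sketch below is not part of GEMMA; the helper names `read_assoc` and `top_hits` are hypothetical, and the column names are assumed to match the header shown in the example (allele1, l_remle, p_wald and so on).

```python
def read_assoc(text):
    """Parse assoc-style text into a list of dicts keyed by the header names.

    Hypothetical helper; assumes a header row followed by whitespace-delimited rows.
    """
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    records = []
    for fields in body:
        rec = dict(zip(header, fields))
        for key in ("af", "beta", "se", "l_remle", "p_wald"):
            rec[key] = float(rec[key])
        for key in ("ps", "n_miss"):
            rec[key] = int(rec[key])
        records.append(rec)
    return records


def top_hits(records, k=1):
    """Return the k records with the smallest Wald p values."""
    return sorted(records, key=lambda r: r["p_wald"])[:k]


SAMPLE = """chr rs ps n_miss allele1 allele0 af beta se l_remle p_wald
1 rs3683945 3197400 0 A G 0.443 7.788665e-02 6.193502e-02 4.317993e+00 2.087616e-01
1 rs3707673 3407393 0 G A 0.443 6.654282e-02 6.210234e-02 4.316144e+00 2.841271e-01
1 rs13475700 4098402 0 A C 0.127 5.659089e-02 7.175374e-02 4.340145e+00 4.304306e-01
"""

if __name__ == "__main__":
    best = top_hits(read_assoc(SAMPLE))[0]
    print(best["rs"], best["p_wald"], best["beta"] / best["se"])
```

The ratio beta / se printed at the end is the Wald statistic on the z scale, which is a quick way to check that a row's p value is consistent with its estimate and standard error.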
14. c association of complex traits in heterogeneous stock mice. Nature Genetics, 38:879-887, 2006.

[5] Xiang Zhou, Peter Carbonetto, and Matthew Stephens. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics, 9:e1003264, 2013.

[6] Xiang Zhou and Matthew Stephens. Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44:821-824, 2012.

[7] Xiang Zhou and Matthew Stephens. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods, 11:407-409, 2014.
15. ctor of the corresponding coefficients including the intercept; x is an n-vector of marker genotypes; $\beta$ is the effect size of the marker; u is an n-vector of random effects; $\epsilon$ is an n-vector of errors; $\tau^{-1}$ is the variance of the residual errors; $\lambda$ is the ratio between the two variance components; K is a known n × n relatedness matrix and $I_n$ is an n × n identity matrix. $MVN_n$ denotes the n-dimensional multivariate normal distribution.

GEMMA tests the alternative hypothesis $H_1: \beta \neq 0$ against the null hypothesis $H_0: \beta = 0$ for each SNP in turn, using one of the three commonly used test statistics (Wald, likelihood ratio or score). GEMMA obtains either the maximum likelihood estimate (MLE) or the restricted maximum likelihood estimate (REML) of $\lambda$ and $\beta$, and outputs the corresponding p value. In addition, GEMMA estimates the PVE by typed genotypes, or chip heritability.

1.3.2 Multivariate Linear Mixed Model

GEMMA can fit a multivariate linear mixed model in the following form:

$$Y = WA + x\beta^T + U + E; \quad U \sim MN_{n \times d}(0, K, V_g), \quad E \sim MN_{n \times d}(0, I_{n \times n}, V_e),$$

where Y is an n by d matrix of d phenotypes for n individuals; $W = (w_1, \cdots, w_c)$ is an n × c matrix of covariates (fixed effects) including a column of 1s; A is a c by d matrix of the corresponding coefficients including the intercept; x is an n-vector of marker genotypes; $\beta$ is a d-vector of marker effect sizes for the d phenotypes; U is an n by d matrix of random effects; E is an n by d matrix of errors; K is
16. e] -p [filename] -a [filename] -k [filename] -predict -n [num1] [num2] [num3] -o [prefix]

4.6.3 Output Files

There will be two output files, both inside an output folder in the current directory. The prefix.log.txt file contains some detailed information about the running parameters and computation time. In addition, prefix.log.txt contains genetic correlation estimates and their standard errors in the null multivariate linear mixed model.

The prefix.assoc.txt file contains the results, and is in a very similar format to the result file from the univariate association tests. The number of columns will depend on the number of phenotypes used for analysis. The first few columns are: chromosome numbers, SNP ids, base pair positions on the chromosome, number of missing values for a given SNP, minor allele, major allele and allele frequency. The last column contains p values from the association tests. The middle columns contain beta estimates and the variance matrix for these estimates.

4.7 Fit a Bayesian Sparse Linear Mixed Model

4.7.1 Basic Usage

The basic usages for fitting a BSLMM with either the PLINK binary ped format or the BIMBAM format are:

    gemma -bfile [prefix] -bslmm [num] -o [prefix]
    gemma -g [filename] -p [filename] -a [filename] -bslmm [num] -o [prefix]

where the -bslmm [num] option specifies which model to fit, i.e. -bslmm 1 fits a standard linear BSLMM, -bslmm 2 fits a ridge regression/GBLUP, and -bslmm
17. e the -predict [num] option specifies whether the predicted values need additional transformation with the normal cumulative distribution function (CDF), i.e. -predict 1 obtains predicted values, -predict 2 obtains predicted values and then transforms them using the normal CDF to the probability scale; -bfile [prefix] specifies PLINK binary ped file prefix; -g [filename] specifies BIMBAM mean genotype file name; -p [filename] specifies BIMBAM phenotype file name; -epm [filename] specifies the output estimated parameter file (i.e. the prefix.param.txt file from BSLMM); -emu [filename] specifies the output log file which contains the estimated mean (i.e. the prefix.log.txt file from BSLMM); -ebv [filename] specifies the output estimated breeding value file (i.e. the prefix.bv.txt file from BSLMM); -k [filename] specifies relatedness matrix file name; -o [prefix] specifies output file prefix.

4.8.2 Detailed Information

GEMMA will obtain predicted values for individuals with missing phenotypes, and this process will require genotype means and phenotype means from the individuals in the training set. Therefore, use the same phenotype file (and the same phenotype column) and the same genotype file as used in fitting BSLMM.

There are two ways to obtain predicted values: use $\hat{u}$ and $\hat{\beta}$, or use $\hat{\alpha}$ and $\hat{\beta}$. We note that $\hat{\alpha}$ and $\hat{u}$ are estimated in slightly different ways, and so even in the special case $K = XX^T/p$, $\hat{\alpha}$ ma
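The extra transformation applied by the second prediction mode can be sketched with the standard normal CDF. This is an illustration only, not GEMMA's implementation; the function names are hypothetical, and using a standard normal (mean 0, variance 1) is an assumption, since the text says only "the normal CDF".

```python
import math


def normal_cdf(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def to_probability_scale(predicted):
    """Map predicted values on the latent scale to values in (0, 1).

    Hypothetical helper; assumes a standard normal CDF.
    """
    return [normal_cdf(value) for value in predicted]


if __name__ == "__main__":
    print(to_probability_scale([-1.0, 0.0, 1.0]))
```

A predicted value of 0 maps to 0.5, and larger predicted values map monotonically toward 1, which is why the transformed output can be read on a probability scale.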
18. ed in a gzip compressed format.

4.4.2 Detailed Information

GEMMA extracts the matrix elements corresponding to the analyzed individuals (who may be fewer than the total number of individuals), centers the matrix, and then performs an eigen-decomposition.

4.4.3 Output Files

There will be three output files, all inside an output folder in the current directory. The prefix.log.txt file contains some detailed information about the running parameters and computation time, while the prefix.eigenD.txt and prefix.eigenU.txt files contain the eigen values and eigen vectors of the estimated relatedness matrix, respectively.

4.5 Association Tests with Univariate Linear Mixed Models

4.5.1 Basic Usage

The basic usages for association analysis with either the PLINK binary ped format or the BIMBAM format are:

    gemma -bfile [prefix] -k [filename] -lmm [num] -o [prefix]
    gemma -g [filename] -p [filename] -a [filename] -k [filename] -lmm [num] -o [prefix]

where the -lmm [num] option specifies which frequentist test to use, i.e. -lmm 1 performs the Wald test, -lmm 2 performs the likelihood ratio test, -lmm 3 performs the score test, and -lmm 4 performs all three tests; -bfile [prefix] specifies PLINK binary ped file prefix; -g [filename] specifies BIMBAM mean genotype file name; -p [filename] specifies BIMBAM phenotype file name; -a [filename] (optional) specifies BIMBAM SNP annotation file name;
19. effect sizes can be decomposed into two parts: $\alpha$ that captures the small effects that all SNPs have, and $\beta$ that captures the additional effects of some large effect SNPs. In this case, $u = X\alpha$ can be viewed as the combined effect of all small effects, and the total effect size for a given SNP is $\alpha_i + \beta_i$.

There are two important hyper-parameters in the model: PVE, being the proportion of variance in phenotypes explained by the sparse effects ($X\beta$) and random effects terms ($u$) together, and PGE, being the proportion of genetic variance explained by the sparse effects terms ($X\beta$). These two parameters are defined as follows:

$$PVE(\beta, u, \tau) := \frac{V(X\beta + u)}{\tau^{-1} + V(X\beta + u)}, \qquad PGE(\beta, u) := \frac{V(X\beta)}{V(X\beta + u)},$$

where

$$V(x) := \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$

GEMMA uses MCMC to estimate $\beta$, $u$ and all other hyper-parameters including PVE, PGE and $\pi$.

1.4 Missing Data

1.4.1 Missing Genotypes

As mentioned before [6], the tricks used in the GEMMA algorithm rely on having complete or imputed genotype data at each SNP. That is, GEMMA requires the user to impute all missing genotypes before association testing. This imputation step is arguably preferable to simply dropping individuals with missing genotypes, since it can improve power to detect associations [1]. Therefore, for fitting either LMM or BSLMM, missing genotypes are recommended to be imputed first. Otherwise, any SNPs with missingness above a certain threshold (default 5%) will not be analyzed, and missing genotypes for SNPs that do not exceed this threshold will be simply replaced with the est
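The fallback handling of un-imputed genotypes described above (drop SNPs whose missingness is above the threshold, otherwise fill the gaps) can be sketched as below. This is an illustration under stated assumptions, not GEMMA's code: the function name `impute_snp` is hypothetical, missing genotypes are represented as None, and the fill-in value is assumed to be the mean of the observed genotypes at that SNP.

```python
def impute_snp(genotypes, miss_threshold=0.05):
    """Apply a missingness filter, then mean-fill one SNP's genotypes.

    Hypothetical helper. `genotypes` is a list of floats with None for
    missing calls. Returns None when the SNP would be excluded, i.e. when
    its missingness is strictly above miss_threshold.
    """
    n = len(genotypes)
    observed = [g for g in genotypes if g is not None]
    if n == 0 or not observed:
        return None
    missingness = (n - len(observed)) / n
    if missingness > miss_threshold:
        return None  # SNP not analyzed
    mean = sum(observed) / len(observed)  # assumed fill-in value
    return [mean if g is None else g for g in genotypes]


if __name__ == "__main__":
    snp = [None, 0.0, 2.0] + [1.0] * 17  # 1 of 20 genotypes missing
    print(impute_snp(snp))
```

The comparison is strict, so a SNP sitting exactly at the threshold is kept; whether the real software treats that boundary the same way is not stated in the text.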
20. eously while controlling for population stratification, and for estimating genetic correlations among complex phenotypes [7]. It fits a Bayesian sparse linear mixed model (BSLMM) using Markov chain Monte Carlo (MCMC) for estimating PVE by typed genotypes, predicting phenotypes, and identifying associated markers by jointly modeling all markers while controlling for population structure [5]. It is computationally efficient for large scale GWAS and uses freely available open-source numerical libraries.

1.2 How to Cite GEMMA

• Software tool and univariate linear mixed models:
  Xiang Zhou and Matthew Stephens (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44: 821-824.
• Multivariate linear mixed models:
  Xiang Zhou and Matthew Stephens (2014). Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nature Methods 11: 407-409.
• Bayesian sparse linear mixed models:
  Xiang Zhou, Peter Carbonetto and Matthew Stephens (2013). Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genetics 9(2): e1003264.

1.3 Models

1.3.1 Univariate Linear Mixed Model

GEMMA can fit a univariate linear mixed model in the following form:

$$y = W\alpha + x\beta + u + \epsilon; \quad u \sim MVN_n(0, \lambda\tau^{-1}K), \quad \epsilon \sim MVN_n(0, \tau^{-1}I_n),$$

where y is an n-vector of quantitative traits (or binary disease labels) for n individuals; $W = (w_1, \cdots, w_c)$ is an n × c matrix of covariates (fixed effects) including a column of 1s; $\alpha$ is a c-ve
21. fault 1e5)
• -region [num] specify the number of regions used to evaluate lambda (default 10)

Bayesian Sparse Linear Mixed Model Options
• -bslmm [num] specify analysis choice (default 1; valid value 1-3; 1: linear BSLMM; 2: ridge regression/GBLUP; 3: probit BSLMM)
• -hmin [num] specify minimum value for h (default 0)
• -hmax [num] specify maximum value for h (default 1)
• -rmin [num] specify minimum value for rho (default 0)
• -rmax [num] specify maximum value for rho (default 1)
• -pmin [num] specify minimum value for log10(pi) (default log10(1/p), where p is the number of analyzed SNPs)
• -pmax [num] specify maximum value for log10(pi) (default log10(1))
• -smin [num] specify minimum value for |gamma| (default 0)
• -smax [num] specify maximum value for |gamma| (default 300)
• -gmean [num] specify the mean for the geometric distribution (default 2000)
• -hscale [num] specify the step size scale for the proposal distribution of h (value between 0 and 1, default min(10/sqrt(n),1))
• -rscale [num] specify the step size scale for the proposal distribution of rho (value between 0 and 1, default min(10/sqrt(n),1))
• -pscale [num] specify the step size scale for the proposal distribution of log10(pi) (value between 0 and 1, default min(5/sqrt(n),1))
• -w [num] specify burn-in steps (default 100,000)
• -s [num] specify sampling steps (default 1,000,000)
• -rpace [num] specify recording pace, record one state in every
22. ignore
• -ebv [filename] specify input estimated random effect (breeding value) file name
• -emu [filename] specify input log file name containing estimated mean
• -mu [filename] specify estimated mean value directly, instead of using -emu file
• -snps [filename] specify input snps file name to only analyze a certain set of snps, contains a column of snp ids
• -pace [num] specify terminal display update pace (default 100000)
• -o [prefix] specify output file prefix (default result)

SNP Quality Control Options
• -miss [num] specify missingness threshold (default 0.05)
• -maf [num] specify minor allele frequency threshold (default 0.01)
• -r2 [num] specify r-squared threshold (default 0.9999)
• -hwe [num] specify HWE test p value threshold (default 0; no test)
• -notsnp minor allele frequency cutoff is not used and so all real values can be used as covariates

Relatedness Matrix Calculation Options
• -gk [num] specify which type of kinship/relatedness matrix to generate (default 1; valid value 1-2; 1: centered matrix; 2: standardized matrix)

Eigen Decomposition Options
• -eigen specify to perform eigen decomposition of the relatedness matrix

Linear Mixed Model Options
• -lmm [num] specify frequentist analysis choice (default 1; valid value 1-4; 1: Wald test; 2: likelihood ratio test; 3: score test; 4: all 1-3)
• -lmin [num] specify minimal value for lambda (default 1e-5)
• -lmax [num] specify maximum value for lambda (de
23. iles, containing genotypes, phenotypes, relatedness matrix and (optionally) covariates. Genotype and phenotype files can be in two formats, either both in the PLINK binary ped format or both in the BIMBAM format. Mixing genotype and phenotype files from the two formats (for example, using PLINK files for genotypes and using BIMBAM files for phenotypes) will result in unwanted errors. BIMBAM format is particularly useful for imputed genotypes, as PLINK codes genotypes using 0/1/2, while BIMBAM can accommodate any real values between 0 and 2 (and any real values if paired with the -notsnp option).

Notice that the BIMBAM mean genotype file and/or the relatedness matrix file can be provided in compressed gzip format, while other files should be provided in uncompressed format.

3.1 PLINK Binary PED File Format

GEMMA recognizes the PLINK binary ped file format (http://pngu.mgh.harvard.edu/~purcell/plink/) [3] for both genotypes and phenotypes. This format requires three files: .bed, .bim and .fam, all with the same prefix. The .bed file should be in the default SNP-major mode (beginning with three bytes). One can use the PLINK software to generate binary ped files from standard ped files using the following command:

    plink --file [file_prefix] --make-bed --out [bedfile_prefix]

For the .fam file, GEMMA only reads the second column (individual id) and the sixth column (phenotype). One can specify a different column as the phenotype column by using
24. imated mean genotype of that SNP. For predictions, though, all SNPs will be used regardless of their missingness. Missing genotypes in the test set will be replaced by the mean genotype in the training set.

1.4.2 Missing Phenotypes

Individuals with missing phenotypes will not be included in the LMM or BSLMM analysis. However, all individuals will be used for calculating the relatedness matrix, so that the resulting relatedness matrix is still an n × n matrix regardless of how many individuals have missing phenotypes. In addition, predicted values will be calculated for individuals with missing values, based on individuals with non-missing values.

For relatedness matrix calculation, because missingness and minor allele frequency for a given SNP are calculated based on analyzed individuals (i.e. individuals with no missing phenotypes and no missing covariates), if all individuals have missing phenotypes, then no SNP and no individuals will be included in the analysis, and the estimated relatedness matrix will be full of nan values.

2 Installing and Compiling GEMMA

If you have downloaded a binary executable, no installation is necessary. In some cases, you may need to use chmod a+x gemma before using the binary executable. In addition, notice that the end-of-line coding in Windows (DOS) is different from that in Linux, and so you may have to convert input files using the utility dos2unix or unix2dos in order to use them in a different p
25. l case labels. For prediction problems, one is recommended to list all individuals in the file, but label those individuals in the test set as missing. This will facilitate the use of the prediction function implemented in GEMMA.

3.2.3 SNP Annotation File (optional)

This file contains SNP information. The first column is SNP id, the second column is its base-pair position, and the third column is its chromosome number. The rows are not required to be in the same order as the mean genotype file, but must contain all SNPs in that file. An example annotation file with four SNPs is as follows:

rs1, 1200, 1
rs2, 1000, 1
rs3, 3320, 1
rs4, 5430, 1

If an annotation file is not provided, the SNP information columns in the output file for association tests will have "-9" as missing values.

3.3 Relatedness Matrix File Format

GEMMA, as a linear mixed model software, requires a relatedness matrix file in addition to both genotype and phenotype files. The relatedness matrix can be supplied in two different ways: either use the original relatedness matrix, or use the eigen values and eigen vectors of the original relatedness matrix.

3.3.1 Original Matrix Format

GEMMA takes the original relatedness matrix file in two formats. The first format is an n × n matrix, where each row and each column corresponds to individuals in the same order as in the *.fam file or in the mean genotype file, and the ith row and jth column is a number indicating
26. latform. The binary executable of GEMMA works well for a reasonably large number of individuals (say, for example, the "-eigen" option works for at least 45,000 individuals). Due to the outdated computation environment the software was compiled on, however, for larger sample sizes and for improved computation efficiency, it is recommended to compile GEMMA on the user's own modern computer system.

If you want to compile GEMMA by yourself, you will need to download the source code, and you will need a standard C/C++ compiler such as GNU gcc, as well as the GSL and LAPACK libraries. You will need to change the library paths in the Makefile accordingly. A sample Makefile is provided along with the source code. For details on installing the GSL library, please refer to [...]. For details on installing the LAPACK library, please refer to www.netlib.org/lapack/.

If you are interested in fitting BSLMM for a large-scale GWAS data set but have limited memory to store the entire genotype matrix, you could compile GEMMA in float precision. A float precision binary executable, named "gemmaf", is available inside the "bin" folder in the source code. To compile a float precision binary by yourself, you can first run the "d2f.sh" script inside the "src" folder, and then enable the "FORCE_FLOAT" option in the Makefile. The float version could save about half of the memory without appreciable loss of accuracy.

3 Input File Formats

GEMMA requires four input f
27. n algorithm and evoke GSL errors. To avoid this, one can either regress the phenotypes on the covariates and use the residuals as new phenotypes, or use only SNPs that are not identical to any of the covariates for the analysis. The latter can be achieved, for example, by performing a standard linear regression in the genotype data, but with covariates as phenotypes.

4 Running GEMMA

4.1 A Small GWAS Example Dataset

If you downloaded the GEMMA source code recently, you will find an "example" folder containing a small GWAS example dataset. This data set comes from the heterogeneous stock mice data, kindly provided by the Wellcome Trust Centre for Human Genetics on the public domain (http://mus...), with details described in [4]. The data set consists of 1904 individuals from 85 families, all descended from eight inbred progenitor strains. We selected two phenotypes from this data set: the percentage of CD8+ cells, with measurements in 1410 individuals; and mean corpuscular hemoglobin (MCH), with measurements in 1580 individuals. A total of 1197 individuals have both phenotypes. The phenotypes were already corrected for sex, age, body weight, season and year effects by the original study, and we further quantile normalized the phenotypes to a standard normal distribution. In addition, we obtained a total of 12,226 autosomal SNPs, with missing genotypes replaced by the mean genotype of that SNP in the family. Genotype and phenotype files are in both B
28. nce), then the standardized genotype matrix is preferred. If the SNP effect size does not depend on its minor allele frequency, then the centered genotype matrix is preferred. In our previous experience, based on a limited number of examples, we typically find that the centered genotype matrix provides better control for population structure in lower organisms, and the two matrices seem to perform similarly in humans.

4.3.3 Output Files

There will be two output files, both inside an output folder in the current directory. The prefix.log.txt file contains some detailed information about the running parameters and computation time, while the prefix.cXX.txt or prefix.sXX.txt contains an n × n matrix of estimated relatedness.

4.4 Perform Eigen-Decomposition of the Relatedness Matrix

4.4.1 Basic Usage

The basic usages to perform an eigen-decomposition of the relatedness matrix with either the PLINK binary ped format or the BIMBAM format are:

gemma -bfile [prefix] -k [filename] -eigen -o [prefix]
gemma -g [filename] -p [filename] -k [filename] -eigen -o [prefix]

where "-bfile [prefix]" specifies the PLINK binary ped file prefix; "-g [filename]" specifies the BIMBAM mean genotype file name; "-p [filename]" specifies the BIMBAM phenotype file name; "-k [filename]" specifies the relatedness matrix file name; "-o [prefix]" specifies the output file prefix. Notice that the BIMBAM mean genotype file and/or the relatedness matrix file can be provid
29. nd γ). An example file with a few SNPs is shown below:

chr  rs          ps       n_miss  alpha         beta          gamma
1    rs3683945   3197400  0       7.314495e-05  0.000000e+00  0.000000e+00
1    rs3707673   3407393  0       7.314495e-05  0.000000e+00  0.000000e+00
1    rs6269442   3492195  0       3.412974e-04  0.000000e+00  0.000000e+00
1    rs6336442   3580634  0       8.051198e-05  0.000000e+00  0.000000e+00
1    rs13475700  4098402  0       1.200246e-03  0.000000e+00  0.000000e+00

Notice that the beta column contains the posterior mean estimate for $\beta_i \mid \gamma_i = 1$ rather than $\beta_i$. Therefore, the effect size estimate for the additional effect is $\beta_i \gamma_i$, and in the special case $K = XX^T/p$, the total effect size estimate is $\alpha_i + \beta_i \gamma_i$.

The prefix.bv.txt contains a column of breeding value estimates. Individuals with missing phenotypes will have values of "NA".

The prefix.gamma.txt contains the posterior samples for the gamma, for every 10th iteration. Each row lists the SNPs included in the model in that iteration, and those SNPs are represented by their row numbers (+1) in the prefix.param.txt file.

4.8 Predict Phenotypes Using Output from BSLMM

4.8.1 Basic Usage

The basic usages for phenotype prediction with either the PLINK binary ped format or the BIMBAM format are:

gemma -bfile [prefix] -epm [filename] -emu [filename] -ebv [filename] -k [filename] -predict [num] -o [prefix]
gemma -g [filename] -p [filename] -epm [filename] -emu [filename] -ebv [filename] -k [filename] -predict [num] -o [prefix]

wher
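Since the beta column is conditional on inclusion, combining the three columns takes one extra step. The following is a minimal sketch, assuming whitespace-separated rows with the seven columns shown in the example file and the special case in which the default centered relatedness matrix was used; the function names are hypothetical and the sample values are made up for illustration.

```python
def total_effect(alpha: float, beta: float, gamma: float) -> float:
    """Total effect size estimate: alpha + beta * gamma (special case K = XX^T / p)."""
    return alpha + beta * gamma


def parse_param_line(line: str):
    """Parse one data row of a prefix.param.txt-style file (header row excluded).

    Assumed column order: chr, rs, ps, n_miss, alpha, beta, gamma.
    Returns the SNP id and its total effect size estimate.
    """
    _chrom, rs, _ps, _n_miss, alpha, beta, gamma = line.split()
    return rs, total_effect(float(alpha), float(beta), float(gamma))


# Illustrative row with made-up values: a SNP included in half of the samples.
snp_id, effect = parse_param_line("1 rsExample 1000 0 0.001 0.5 0.5")
```

Here effect is 0.001 + 0.5 × 0.5; a SNP whose gamma is zero contributes only its alpha term.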
30. [num] steps (default 10)
• -wpace [num] specify writing pace, write values down in every [num] recorded steps (default 1000)
• -seed [num] specify random seed, a random seed is generated by default
• -mh [num] specify number of MH steps in each iteration (default 10); requires 0/1 phenotypes and -bslmm 3 option

Prediction Options
• -predict [num] specify prediction options (default 1; valid value 1-2; 1: predict for individuals with missing phenotypes; 2: predict for individuals with missing phenotypes, and convert the predicted values using normal CDF)

References

[1] Yongtao Guan and Matthew Stephens. Practical issues in imputation-based association mapping. PLoS Genetics, 4:e1000279, 2008.
[2] Bryan N. Howie, Peter Donnelly, and Jonathan Marchini. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genetics, 5:e1000529, 2009.
[3] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. R. Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I. W. de Bakker, Mark J. Daly, and Pak C. Sham. PLINK: a toolset for whole-genome association and population-based linkage analysis. The American Journal of Human Genetics, 81:559-575, 2007.
[4] William Valdar, Leah C. Solberg, Dominique Gauguier, Stephanie Burnett, Paul Klenerman, William O. Cookson, Martin S. Taylor, J. Nicholas P. Rawlins, Richard Mott, and Jonathan Flint. Genome-wide geneti
31. oftware.
• Polymorphism. Non-polymorphic SNPs will not be included in the analysis.
• Missingness. By default, SNPs with missingness above 5% will not be included in the analysis. Use "-miss [num]" to change. For example, "-miss 0.1" changes the threshold to 10%.
• Minor allele frequency. By default, SNPs with minor allele frequency below 1% will not be included in the analysis. Use "-maf [num]" to change. For example, "-maf 0.05" changes the threshold to 5%.
• Correlation with any covariate. By default, SNPs with r² correlation with any of the covariates above 0.9999 will not be included in the analysis. Use "-r2 [num]" to change. For example, "-r2 0.999999" changes the threshold to 0.999999.
• Hardy-Weinberg equilibrium. Use "-hwe [num]" to specify. For example, "-hwe 0.001" will filter out SNPs with Hardy-Weinberg p values below 0.001.
• User-defined SNP list. Use "-snps [filename]" to specify a list of SNPs to be included in the analysis.

Calculations of the above filtering thresholds are based on analyzed individuals (i.e. individuals with no missing phenotypes and no missing covariates). Therefore, if all individuals have missing phenotypes, no SNP will be analyzed and the output matrix will be full of "nan"s.

4.3 Estimate Relatedness Matrix from Genotypes

4.3.1 Basic Usage

The basic usages to calculate an estimated relatedness matrix with either the PLINK binary ped format or
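The first three filters can be pictured as a single pass over one SNP's genotypes. This is a rough sketch and not GEMMA's implementation: it assumes the missingness threshold is an upper bound and the minor allele frequency threshold a lower bound (as the "-miss 0.1" and "-maf 0.05" examples suggest), that genotypes count copies of one allele on a 0 to 2 scale, and that missing values are represented by None. The function name and the handling of values exactly on a threshold are illustrative choices.

```python
def passes_default_filters(genotypes, miss_max=0.05, maf_min=0.01):
    """Decide whether one SNP survives the polymorphism, missingness and MAF filters.

    genotypes: per-individual values in [0, 2] for the analyzed individuals only,
    with None marking a missing genotype.
    """
    n_total = len(genotypes)
    observed = [g for g in genotypes if g is not None]
    if n_total == 0 or not observed:
        return False  # nothing to analyze
    if len(set(observed)) == 1:
        return False  # non-polymorphic: every observed genotype is identical
    missingness = 1.0 - len(observed) / n_total
    if missingness > miss_max:
        return False
    allele_freq = sum(observed) / (2.0 * len(observed))
    maf = min(allele_freq, 1.0 - allele_freq)
    return maf >= maf_min
```

Because the input holds analyzed individuals only, individuals with missing phenotypes or covariates have to be dropped before calling the function, which mirrors the note that thresholds are computed on analyzed individuals.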
32. ory. The prefix.log.txt file contains some detailed information about the running parameters and computation time, while the prefix.prdt.txt contains a column of predicted values for all individuals. In particular, individuals with missing phenotypes will have predicted values, while individuals with non-missing phenotypes will have "NA"s.

5 Questions and Answers

1. Q: I want to perform a cross validation with my data, and want to fit BSLMM in the training data set and obtain predicted values for individuals in the test data set. How should I prepare the phenotype file?

A: One should always use the same phenotype and genotype files for both fitting BSLMM and obtaining predicted values. Therefore, one should combine individuals in the training set and test set into a single phenotype and genotype file before running GEMMA. Specifically, in the phenotype file, one should label individuals in the training set with the true phenotype values, and label individuals in the test set as missing (e.g. "NA"). Then, one can fit BSLMM with the files, as BSLMM only uses individuals with non-missing phenotypes (i.e. individuals in the training set). Afterwards, one can obtain predicted values using the "-predict" option on the same files, and the predicted values will be obtained only for individuals with missing phenotypes (i.e. individuals in the test set). Notice that the software will still output "NA" for individuals with non-mis
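The phenotype preparation described in this answer amounts to masking the test-set entries of a single phenotype column. A minimal sketch, assuming one phenotype value per individual in file order and 0-based test-set indices; the function name is hypothetical.

```python
def mask_test_phenotypes(phenotypes, test_indices, missing="NA"):
    """Return phenotype column entries with test-set individuals labelled as missing.

    phenotypes: one value per individual, in the same order as the genotype file.
    test_indices: 0-based positions of the individuals held out for prediction.
    """
    held_out = set(test_indices)
    return [
        missing if index in held_out else str(value)
        for index, value in enumerate(phenotypes)
    ]


# Three individuals; the second one is held out as the test set.
column = mask_test_phenotypes([1.2, -0.3, 0.8], test_indices=[1])
```

Writing one entry of column per line yields a phenotype file that can be used unchanged for both the BSLMM fit and the later prediction step, so the row order never drifts between the two runs.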
33. sing phenotypes, so that the number of individuals in the output prefix.prdt.txt file will match the total sample size. Please refer to the GWAS sample data set and some demo scripts included with the GEMMA source code for detailed examples.

6 Options

File I/O Related Options
• -bfile [prefix] specify input plink binary file prefix; require .fam, .bim and .bed files
• -g [filename] specify input bimbam mean genotype file name
• -p [filename] specify input bimbam phenotype file name
• -n [num] specify phenotype column in the phenotype file (default 1); or to specify which phenotypes are used in the mvLMM analysis
• -a [filename] specify input bimbam SNPs annotation file name (optional)
• -k [filename] specify input kinship/relatedness matrix file name
• -km [num] specify input kinship/relatedness file type (default 1; valid value 1 or 2)
• -d [filename] specify input eigen value file name
• -u [filename] specify input eigen vector file name
• -c [filename] specify input covariates file name (optional); an intercept term is needed in the covariates file
• -epm [filename] specify input estimated parameter file name
• -en [n1] [n2] [n3] [n4] specify values for the input estimated parameter file (with a header) (default 2 5 6 7 when no -ebv -k files, and 2 0 6 7 when -ebv and -k files are supplied; n1: rs column number; n2: estimated alpha column number (0 to ignore); n3: estimated beta column number (0 to ignore); n4: estimated gamma column number (0 to
34. tes other than SNPs. BIMBAM format consists of three files: a mean genotype file, a phenotype file, and an optional SNP annotation file. We explain these files in detail below.

3.2.1 Mean Genotype File

This file contains genotype information. The first column is SNP id, the second and third columns are allele types with minor allele first, and the remaining columns are the posterior/imputed mean genotypes of different individuals, numbered between 0 and 2. An example mean genotype file with two SNPs and three individuals is as follows:

rs1, A, T, 0.02, 0.80, 1.50
rs2, G, C, 0.98, 0.04, 1.00

GEMMA codes alleles exactly as provided in the mean genotype file, and ignores the allele types in the second and third columns. Therefore, the minor allele is the effect allele only if one codes the minor allele as 1 and the major allele as 0.

One can use the following bash command (in one line) to generate a BIMBAM mean genotype file from IMPUTE genotype files (http://www.stats.ox.ac.uk/~marchini/software/gwas/file_format.html) [2]:

cat [impute filename] | awk -v s=[number of samples/individuals] '{ printf $2 "," $4 "," $5; for(i=1; i<=s; i++) printf "," $(i*3+3)*2+$(i*3+4); printf "\n" }' > [bimbam filename]

Notice that one may need to manually input the two quote symbols ' . Depending on the terminal, a direct copy/paste of the above line may result in "-bash: syntax error near unexpected token '('" errors.

Finally, the mean genotype file
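For readers who prefer not to depend on shell quoting, the same conversion can be sketched in Python. This is an illustration under stated assumptions rather than a replacement shipped with GEMMA: it assumes the IMPUTE layout of five leading columns (SNP id, rs id, position, allele A, allele B) followed by three genotype probabilities per individual, computes the mean genotype as 2·P(AA) + P(AB) as the awk one-liner does, and formats values to two decimals, which is a presentation choice. The function name is hypothetical.

```python
def impute_line_to_bimbam(line: str, n_samples: int) -> str:
    """Convert one IMPUTE genotype line into one BIMBAM mean genotype line."""
    fields = line.split()
    rs_id, allele_a, allele_b = fields[1], fields[3], fields[4]
    dosages = []
    for i in range(n_samples):
        p_aa = float(fields[5 + 3 * i])  # probability of genotype AA
        p_ab = float(fields[6 + 3 * i])  # probability of genotype AB
        dosages.append(2.0 * p_aa + p_ab)  # expected count of allele A
    values = [f"{dosage:.2f}" for dosage in dosages]
    return ", ".join([rs_id, allele_a, allele_b] + values)


# Made-up IMPUTE line for three individuals.
example = impute_line_to_bimbam(
    "snp1 rs1 1200 A T 0 0.02 0.98 0.1 0.6 0.3 0.5 0.5 0", n_samples=3
)
```

Note that the resulting value counts copies of the first listed allele, so whether it corresponds to the minor allele depends on how the IMPUTE file orders its alleles.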
35. the BIMBAM format are:

gemma -bfile [prefix] -gk [num] -o [prefix]
gemma -g [filename] -p [filename] -gk [num] -o [prefix]

where the "-gk [num]" option specifies which relatedness matrix to estimate, i.e. "-gk 1" calculates the centered relatedness matrix while "-gk 2" calculates the standardized relatedness matrix; "-bfile [prefix]" specifies the PLINK binary ped file prefix; "-g [filename]" specifies the BIMBAM mean genotype file name; "-p [filename]" specifies the BIMBAM phenotype file name; "-o [prefix]" specifies the output file prefix. Notice that the BIMBAM mean genotype file can be provided in a gzip compressed format.

4.3.2 Detailed Information

GEMMA provides two ways to estimate the relatedness matrix from genotypes, using either the centered genotypes or the standardized genotypes. We denote $X$ as the $n \times p$ matrix of genotypes, $x_i$ as its ith column representing genotypes of the ith SNP, $\bar{x}_i$ as the sample mean and $v_{x_i}$ as the sample variance of the ith SNP, and $1_n$ as an $n \times 1$ vector of 1's. Then the two relatedness matrices GEMMA can calculate are as follows:

$$G_c = \frac{1}{p} \sum_{i=1}^{p} (x_i - 1_n \bar{x}_i)(x_i - 1_n \bar{x}_i)^T,$$

$$G_s = \frac{1}{p} \sum_{i=1}^{p} \frac{1}{v_{x_i}} (x_i - 1_n \bar{x}_i)(x_i - 1_n \bar{x}_i)^T.$$

Which of the two relatedness matrices to choose will largely depend on the underlying genetic architecture of the given trait. Specifically, if SNPs with lower minor allele frequency tend to have larger effects (which is inversely proportional to its genotype varia
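The two relatedness matrices defined in section 4.3.2 differ only in a per-SNP weight, which a small pure-Python sketch makes explicit. It is illustrative only: it assumes complete genotypes with no missing values, divides by n when computing the sample variance (the manual does not state which denominator GEMMA uses), and is far slower than GEMMA's own implementation. The function name is hypothetical.

```python
def relatedness(genotypes, standardized=False):
    """Estimate an n x n relatedness matrix from p SNPs.

    genotypes: list of p SNPs, each a list of n genotype values (no missing data).
    standardized=False gives the centered matrix G_c; True gives G_s.
    """
    p = len(genotypes)
    n = len(genotypes[0])
    matrix = [[0.0] * n for _ in range(n)]
    for snp in genotypes:
        mean = sum(snp) / n
        centered = [value - mean for value in snp]
        variance = sum(c * c for c in centered) / n  # assumed denominator: n
        if variance == 0.0:
            raise ValueError("non-polymorphic SNP; filter it out first")
        weight = 1.0 / variance if standardized else 1.0
        for a in range(n):
            for b in range(n):
                matrix[a][b] += weight * centered[a] * centered[b] / p
    return matrix


# Two SNPs typed in three individuals.
snps = [[0, 1, 2], [2, 0, 1]]
g_c = relatedness(snps)
g_s = relatedness(snps, standardized=True)
```

Both SNPs in this toy input have the same variance, so g_s is simply a rescaled g_c; with real data the standardized matrix up-weights low-frequency SNPs, which is the modelling difference the text describes.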
36. vector file contains an $n_a \times n_a$ matrix, with each column corresponding to an eigen vector. The eigen vector in the ith column of the eigen vector file should correspond to the eigen value in the ith row of the eigen value file. Both files can be generated from the original relatedness matrix file by using the "-eigen" option in GEMMA. Notice that $n_a$ represents the number of analyzed individuals, which may be smaller than the number of total individuals n.

3.4 Covariates File Format (optional)

One can provide a covariates file for fitting LMM if necessary. GEMMA fits a linear mixed model with an intercept term if no covariates file is provided, but does not internally provide an intercept term if a covariates file is available. Therefore, if one has covariates other than the intercept and wants to adjust for those covariates (W) simultaneously, one should provide GEMMA with a covariates file containing an intercept term explicitly. The covariates file is similar to the above BIMBAM multiple phenotype file, and must contain a column of 1's if one wants to include an intercept. An example covariates file with five individuals and three covariates (the first column is the intercept) is as follows:

1 1 1.5
1 2 0.3
1 2 0.6
1 1 0.8
1 1 2.0

It can happen, especially in a small GWAS data set, that some of the covariates will be identical to some of the genotypes (up to a scaling factor). This can cause problems in the optimizatio
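A quick way to spot a covariate that duplicates a genotype up to scaling is to compute their squared correlation. The sketch below is illustrative, not GEMMA's code: the 0.9999 cut-off mirrors the default "-r2" filter threshold listed among the SNP filters, constant columns such as the intercept are treated as uncorrelated, and the function names are hypothetical.

```python
def squared_correlation(x, y):
    """Squared Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant column (e.g. the intercept) carries no correlation
    return sxy * sxy / (sxx * syy)


def collinear_with_covariates(genotype, covariates, r2_max=0.9999):
    """Return True if the SNP is (nearly) a rescaled copy of any covariate column."""
    return any(squared_correlation(genotype, column) > r2_max for column in covariates)


genotype = [0, 1, 2, 1, 0]
intercept = [1, 1, 1, 1, 1]
rescaled = [1.0, 1.5, 2.0, 1.5, 1.0]  # 1 + 0.5 * genotype
other = [1, 2, 2, 1, 1]
```

SNPs flagged this way are the ones that the text suggests excluding, or handling by regressing the phenotype on the covariates first.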
37. y not equal to X. However, in this special case, these two approaches typically give similar results based on our previous experience. Therefore, if one used the default matrix in fitting the BSLMM, then it may not be necessary to supply the "-ebv [filename]" and "-k [filename]" options, and GEMMA can use only the estimated parameter file and log file to obtain predicted values by the second approach. But of course, one can choose to use the first approach, which is more formal, and when doing so, one needs to calculate the centered matrix based on the same phenotype column used in BSLMM (i.e. to use only SNPs that were used in the fitting). On the other hand, if one did not use the default matrix in fitting the BSLMM, then one needs to supply the same relatedness matrix here again.

The option "-predict 2" should only be used when a probit BSLMM was used to fit the data. In particular, for binary phenotypes, if one fitted the linear BSLMM, then one should use the option "-predict 1", and use the option "-predict 2" only if one fitted the data with the probit BSLMM.

Here, unlike in previous sections, all SNPs that have estimated effect sizes will be used to obtain predicted values, regardless of their minor allele frequency and missingness. SNPs with missing values will be imputed by the mean genotype of that SNP in the training data set.

4.8.3 Output Files

There will be two output files, both inside an output folder in the current direct
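The options list describes "-predict 2" as converting the predicted values using the normal CDF. The sketch below shows what such a conversion looks like, assuming the standard normal CDF is applied directly to each predicted value; whether GEMMA applies exactly this transformation internally is an assumption, and the function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def to_probability_scale(predicted):
    """Map predicted values to (0, 1); entries without a prediction (None) are kept as None."""
    return [None if value is None else normal_cdf(value) for value in predicted]


# First individual has no prediction; the others are mapped to probabilities.
probabilities = to_probability_scale([None, 0.0, 1.96])
```

A predicted value of zero maps to 0.5, which is why this conversion is only meaningful when the model was fitted on the probit scale in the first place.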
