User's Guide - solutionmetrics.com.au

1. [Figure 4.12: Pre- and post-normalized boxplots of the swimming mice data.]

Additional MvA plots are generated for the other chips, both before and after normalization, but they are not displayed here.

Chapter 4, An Example: Affymetrix MAS Data

DIFFERENTIAL EXPRESSION TESTING

Multiple Comparisons

We are now ready to compute the differential expression tests. From the main menu, open the testing dialog by clicking ArrayAnalyzer > Differential Expression Analysis > Multiple Comparisons Test. The procedures implemented in this dialog provi
2. [Figure 4.22: Before and after normalization plots for the Melanoma data (logged expression intensities). Panels: Before Normalization, After Normalization; chips CGa, CGb, CG24a, CG24b.]

M vs A Plots

We can do an M vs A plot of the logged expression intensities in LCG.N with the mva.pairs function. For MAS4/5 data, this function plots all pairwise scatter plots of M vs A for each treatment condition and replicate combination. Because there are over 12,000 probes on each chip, we randomly sample 2,000 of them before plotting, and because the intensities have already been logged, we turn logging off in the plotting function:

> mva.pairs(LCG.N[sample(dim(LCG.N)[1], 2000), ], log = F)

The resulting plots are displayed on the following page in Figure 4.23.

Differential Expression Testing From The Command Line

[MVA plot; panels CGa, CGb, CG24a, CG24b; x-axis A]
Figure 4.23: M versus A plots for the Melanoma experim
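To make the quantities on an M vs A plot concrete, here is a minimal sketch in Python (not S-PLUS, and not the mva.pairs implementation). It assumes the two input vectors are already on the log2 scale, as in the text; the names `mva_values` and `sample_rows` are hypothetical helpers introduced only for illustration.

```python
import random


def mva_values(log_x, log_y):
    """Return (M, A) for two equal-length vectors of log2 intensities.

    M is the per-probe log ratio (a difference, because the inputs are
    already logged); A is the per-probe average log intensity.
    """
    if len(log_x) != len(log_y):
        raise ValueError("vectors must have the same length")
    m = [x - y for x, y in zip(log_x, log_y)]
    a = [(x + y) / 2.0 for x, y in zip(log_x, log_y)]
    return m, a


def sample_rows(n_rows, k, seed=0):
    """Pick k distinct row indices at random (all rows if k >= n_rows)."""
    if k >= n_rows:
        return list(range(n_rows))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), k))
```

Sampling row indices once and reusing them for every chip keeps the same probes in each pairwise panel, which is the effect of indexing the rows of the data frame before plotting.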
3. [Figure 8.13: Chromosome plot of the human genome for Affymetrix's HG-U95A chip, with the 10 most differentially expressed genes displayed in color. Hovering the mouse over the colored spots displays the gene ID in the upper right corner.]

Chapter 8, Differential Expression Testing

Multiple Comparisons Specific Plots

For the Multiple Comparisons Test dialog, in addition to the volcano, heat map, and chromosome plots, you may also generate a Q-Q Normal Probability plot of the test statistics. This plot provides a visual assessment of the distribution of the test statistics relative to the standard normal distribution, as shown in Figure 8.14.

[Figure 8.14: Q-Q Normal Probability plots of the test statistics generated by the Multiple Comparisons Test dialog. x-axis: Quantiles of Standard Normal.]

Differential Expression Analysis Plots

LPE Specific Plots

For the LPE Test dialog, in addition to the volcano, heat map, and chromosome plots, you may also generate plots of the local pooled error variance versus the overall intensity within experimental conditions. Two plots are produced, one for each experimental condition, as shown in Figure 8.15.

[Figure 8.15 (Graph Window 9); x-axes: A for 0hr, A for 24hr]
4. [Figure 4.4: Browsing for data files. The File Selection page of the Import Affymetrix Data dialog, showing a Single Factor 2 Level Design with 6 replicates.]

You can find the swimming mice example data by navigating to your splus61/module/ArrayAnalyzer/examples directory and selecting the s01.txt file. Repeat for the other eleven .txt files, entering one file per cell.

File Type

Note that the File Type (e.g., MAS5 Summary Data), listed in the bottom left corner of the dialog, is automatically detected once a file is selected. The dialog is designed to prohibit mixing file types.

Selecting the Chip Name

The chip name is a required field. You must select the name that corresponds to the Affymetrix chip you used for your experiment. Some common examples are hgu133a and hgu95a. Click the drop-down button and select mgu74av2, as shown in Figure 4.5.

Figure 4.5: Selecting mgu74av2 as
5. Figure 2.1: Once data is obtained from a microarray experiment, several steps are required to prepare and analyze differential expression intensities and to annotate the results with gene descriptions available in public databases like LocusLink or UniGene. This workflow shows the steps that S+ARRAYANALYZER follows when doing differential expression analysis.

Microarray technology is complex, and experiments using microarrays are resource intensive. As such, there is an urgent need for rigorous statistical design and analysis of microarray experiments.

Statistical issues in microarray experiments include:
• Experimental design
• Pre-processing (e.g., normalization)
• Differential expression testing
• Clustering and prediction
• Annotation

All of these issues may be addressed with the use of modern statistical methods. Care is required, however, and detailed collaborations between biologists and statisticians are a sound recipe for successful use of microarrays.

Insightful is pleased to offer the S+ARRAYANALYZER module for microarray data analysis. S+ARRAYANALYZER provides off-the-shelf functionality for microarray data analysis, as well as a toolkit and development environment for custom microarray analysis solutions. Key packages are included from the Bioconductor project, located at http://www.bioconductor.org. Report
6. Differential Expression Analysis for Experiments with More than Two Experimental Conditions

# This is a computationally intensive procedure that could take some
# time to finish, depending on the hardware. May want to run in BATCH mode.
geneModelFun <- function(gene) {
  # Fit the gene model
  geneModel <- try(lme(fixed = Residuals ~ strain - 1,
    data = dropUnusedLevels(yeastData[yeastData$gene == gene,
      c("Residuals", "strain", "spotInArray")]),
    random = ~ 1 | spotInArray, method = "REML"))
  # Check if a model was fit and if there are enough degrees of
  # freedom for testing
  if(class(geneModel) != "Error" && geneModel$fixDF$X[1] > 0) {
    # Obtain the degrees of freedom for the t-test
    fixDF <- geneModel$fixDF$X[1]
    # Construct the all pairwise comparisons contrast matrix
    p <- length(fixef(geneModel))
    Lmat <- matrix(0, p, p * (p - 1)/2)
    [for loop filling the columns of Lmat: illegible in the source]
    # Determine which comparisons can be made
    comparison <- t(outer(names(fixef(geneModel)), names(fixef(geneModel)),
      paste, sep = [illegible in the source]))[lower.tri(matrix(0, p, p))]
    # Compute the estimated fold changes
    foldChange <- t(Lmat) %*% fixef(geneModel)

Chapter 9, Using the S-PLUS Command Line to Analyze Microarray Data

    # Compute the standard errors for those estimated differences
    stderr <- sqrt(diag
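The all-pairwise contrast matrix that the S-PLUS code above constructs can be illustrated with a small Python sketch. This is not the manual's code: the sign convention (+1 for the first level of a pair, -1 for the second), the column order, and the helper names `pairwise_contrasts` and `apply_contrasts` are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_contrasts(p):
    """Build a p x p(p-1)/2 contrast matrix for all pairwise comparisons.

    Column c compares levels (i, j) with +1 in row i and -1 in row j.
    Returns the matrix (list of rows) and the list of (i, j) pairs.
    """
    pairs = list(combinations(range(p), 2))
    lmat = [[0] * len(pairs) for _ in range(p)]
    for col, (i, j) in enumerate(pairs):
        lmat[i][col] = 1
        lmat[j][col] = -1
    return lmat, pairs


def apply_contrasts(lmat, effects):
    """Compute t(Lmat) %*% effects: one estimated difference per column."""
    n_cols = len(lmat[0]) if lmat else 0
    return [
        sum(lmat[row][col] * effects[row] for row in range(len(effects)))
        for col in range(n_cols)
    ]
```

With three strain effects, for example, the matrix has three columns, and applying it to the fixed-effect estimates yields the three pairwise differences that play the role of the estimated fold changes on the log scale.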
7. 1165-1188.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2002). Multiple hypothesis testing in microarray experiments. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 110.

Dudoit, S., Yang, Y., Callow, M., and Speed, T. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12, 111-139.

Efron, B., Tibshirani, R., Storey, J. D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, 1151-1160.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800-802.

Hochberg, Y. and Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley.

Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scand. J. Statist. 6, 65-70.

Hsu, J. C. (1996). Multiple Comparisons: Theory and Methods. London: Chapman and Hall.

Lee, J. K. and O'Connell, M. (2003). An S-PLUS library for the analysis of differential expression. In The Analysis of Gene Expression Data: Methods and Software. Edited by G. Parmigiani, E. S. Garrett, R. A. Irizarry, and S. L. Zeger. Springer, New York.

Moore, D. S. and McCabe, G. P. (1999). Introduction to the Practice of Statistics, 3rd ed. New York: W. H. Freeman and Company.

Snedecor, G. W. and Cochran, W. G. (1980). Statistical Methods, 7th ed. Ames, Iowa: Iowa
8. Creating an Expression Intensity Data Frame From The Command Line

[Figure 4.21: Boxplots of control versus noncontrol spots for the melanoma data. One panel per chip (0 hr and 24 hr replicates); y-axis: Log 2 Expression Intensities; groups: controls, noncontrols.]

Now extract the expression intensities from each chip in preparation for normalization and differential expression testing. Extract the avg.diff column and add it to a data frame named CG. For MAS5 data, this is the signal column.

> CG <- data.frame(CGa = cga$avg.diff, CGb = cgb$avg.diff, CG24a = cg24a$avg.diff, CG24b = cg24b$avg.diff)

Logging Expression Intensities

Compute the base 2 log transformation of the intensity values as follows. Any intensity values less than one will be negative or missing after taking logs, so we set them explicitly to one in the ifelse function call.

### Threshold and log adjusted average differences
> LCG <- CG
> for(i in names(LCG)) LCG[[i]] <- logb(ifelse(CG[[i]] < 1, 1, CG[[i]]), base = 2)

Removing Controls

Now remove the cont
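The thresholded log transformation described above can be sketched in Python as follows. This is an illustration only, not the S-PLUS code: the function name `threshold_log2` is hypothetical, and missing values are assumed to be represented as `None` or NaN.

```python
import math


def threshold_log2(values, floor=1.0):
    """Log2-transform intensities after raising values below `floor` to it.

    Values below the floor (including zero and negatives) map to
    log2(floor), which is 0 for the default floor of 1. Missing values
    (None or NaN) stay missing and are returned as NaN.
    """
    out = []
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            out.append(float("nan"))
        else:
            out.append(math.log2(max(v, floor)))
    return out
```

Flooring at one before taking logs means that every non-positive or sub-unit intensity becomes exactly zero on the log scale instead of a negative or undefined value.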
9. North, Central, and South America

Contact Technical Support at Insightful Corporation:
Telephone: 206.283.8802 or 1.800.569.0123, ext. 235
Monday-Friday, 6:00 a.m. PST (9:00 a.m. EST) to 5:00 p.m. PST (8:00 p.m. EST)
Fax: 206.283.8691
E-mail: support@insightful.com
Web: http://www.insightful.com/support

Chapter 1, Welcome to S+ARRAYANALYZER

All Other Locations

Contact the European Headquarters of Insightful Corporation:
Christoph Merian-Ring 11
4153 Reinach, Switzerland
Telephone: +41 61 717 9340
Fax: +41 61 717 9341
E-mail: info.ch@insightful.com

INTRODUCTION TO MICROARRAY DATA
  Genomics and Differential Expression
  Microarray Data
  Affymetrix Arrays
  Custom cDNA Arrays

Chapter 2, Introduction To Microarray Data

GENOMICS AND DIFFERENTIAL EXPRESSION

DNA microarrays are the most widely used tools in the analysis of gene expression and the study of functional genomics. Microarrays comprise gene-specific sequences (probes) immobilized to a solid state matrix, which are queried with mRNA from biological samples under study. Since many changes in cells are related to changes in mRNA levels for some genes, microarrays can be effectively used in a wide variety of applications, including identification and validation of drug targets, characterization and screening of drug toxicities, exploration of biological pathways, and development of molecular diagnostics.

[Figure: INSIGHTFUL S+ARRAYANALYZER WORKFLOW] Prepare
10. casefold(names(cgb))
> names(cg24a) <- casefold(names(cg24a))
> names(cg24b) <- casefold(names(cg24b))

Now let's find the control spots. All the chips are the same, so we can work off one of the data sets.

Extracting Probe Names and Finding Controls

### Extract probe names
> cg.probes <- cga$probe.set.name
### Find control spots
> prefix <- substring(cg.probes, 1, 4)
> controls <- prefix == "AFFX"

Removing Genes With Few Good Spots

You can eliminate genes with few spots used in their summarization by a simple subset operation. We repeat it for each chip object.

### Set avg.diff to missing wherever pairs.used < 7
> cga$avg.diff <- ifelse(cga$pairs.used < 7, NA, cga$avg.diff)
> cgb$avg.diff <- ifelse(cgb$pairs.used < 7, NA, cgb$avg.diff)
<same for the other two chips>

Comparing Controls and Non-controls

One example exploratory plot is the comparison of control and non-control spots. We can generate boxplots as follows:

> par(mfrow = c(2, 2))
> boxplot(list(controls = logb(cga$avg.diff[controls], 2), noncontrols = logb(cga$avg.diff[!controls], 2)), ylab = "Log 2 Expression Intensities")
> title("0 hr Replicate A")

The above expression creates a pair of boxplots for the first 0-hour replicate. By repeating the commands for the other three chips, we produce the remaining plots in Figure 4.21.
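The two filtering steps above (flagging control probes by name prefix and blanking out summaries built from too few probe pairs) can be sketched in Python. This is an illustration rather than the manual's code; `find_controls` and `mask_low_pairs` are hypothetical names, and `None` stands in for the S-PLUS missing value NA.

```python
def find_controls(probe_names, prefix="AFFX"):
    """Return one boolean per probe: True when the name starts with prefix."""
    return [name[:len(prefix)] == prefix for name in probe_names]


def mask_low_pairs(avg_diff, pairs_used, min_pairs=7):
    """Replace avg_diff by None wherever fewer than min_pairs pairs were used."""
    if len(avg_diff) != len(pairs_used):
        raise ValueError("vectors must have the same length")
    return [v if n >= min_pairs else None for v, n in zip(avg_diff, pairs_used)]


def split_by_flag(values, flags):
    """Split values into (flagged, unflagged) lists, dropping missing values."""
    flagged = [v for v, f in zip(values, flags) if f and v is not None]
    unflagged = [v for v, f in zip(values, flags) if not f and v is not None]
    return flagged, unflagged
```

The two lists returned by `split_by_flag` correspond to the controls and noncontrols groups that are compared in the boxplots.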
11. where $n_1$ and $n_2$ are the numbers of replicates for the samples compared, and $s_i^2(\mathrm{Med}_i)$, $i = 1, 2$, is the error estimate from the $i$-th LPE baseline error distribution at each median $\mathrm{Med}_i$. For more details, see Lee and O'Connell (2003). Note that the LPE statistic based on medians is robust to outliers if there are three or more replicates.

Raw P-Values

Running any of the above statistical procedures produces raw p-values: the p-values associated with the individual statistical tests. To make confident statements about differential expression for the entire experiment, you need to compute adjusted p-values, which control the family-wise error rate or false discovery rate. See the section Controlling The False Positive Rate for more details.

CONTROLLING TYPE I ERROR RATES

When testing for differential expression across many genes simultaneously, numerous genes may be identified as significantly differentially expressed by chance alone, even if there is no real differential expression. For example, if you test 10,000 genes for differential expression at a significance level of 0.05, you can expect to misidentify about 500 genes as significant even when there is no real difference in gene expression. Multiple testing corrections adjust the individual p-values to account for the inflated false positive rate due to multiple testing. Because there are
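The arithmetic behind the 10,000-gene example can be checked with a short Python sketch. The function names are hypothetical, and the per-test threshold shown is the simple Bonferroni correction, used here only as one familiar way of controlling the family-wise error rate.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def bonferroni_threshold(n_tests, alpha):
    """Per-test significance level that keeps the family-wise rate at alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    # 10,000 genes tested at the 0.05 level, none truly differentially expressed
    print(expected_false_positives(10000, 0.05))
    print(bonferroni_threshold(10000, 0.05))
```

Under the Bonferroni correction, each of the 10,000 tests would have to reach a raw p-value of about 0.000005 to be declared significant at an overall level of 0.05.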
12. Affymetrix Data

The data can be plotted by typing the following:

# pre-normalized data box plot
# log transform the data for nicer plots
> boxplot(data.frame(log2(Dilution.exprSet@exprs)), ylim = c(0, 15), style.bxp = "att")
# post-normalized data
> boxplot(data.frame(log2(cbind(DilutionEsetNormTmt1, DilutionEsetNormTmt2))), style.bxp = "att", ylim = c(0, 15))

Note: When creating box plots from the normalization dialog, the log intensity is used on the y-axis.

To continue working with an exprSet object, we can create a new exprSet object which has the normalized intensity information:

# normalize the data without subsetting
> DilutionEsetNorm.matrix <- medianIQR.norm(Dilution.exprSet@exprs)
# create new exprSet object with normalized intensities
> DilutionEsetNorm <- Dilution.exprSet
> DilutionEsetNorm@exprs <- DilutionEsetNorm.matrix

affy.scalevalue.exprSet shifts the mean intensity value of the chips to the same specified point. The default reference value is 500. The function accepts exprSet objects and returns an exprSet object. Similar to medianIQR, affy.scalevalue.exprSet can be used to normalize summarized data as follows:

> DilutionEsetScaleTmt1 <- affy.scalevalue.exprSet(Dilution.exprSet[, 1:2], sc = 100)
> DilutionEsetScale <- affy.scalevalue.exprSet(Dilution.exprSet, sc = 100)

MvA and box plots for Affymetrix summarized data are available through the Normal
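As a rough picture of what scaling each chip to a common reference value involves, here is a Python sketch. It is not the affy.scalevalue.exprSet implementation: it assumes a multiplicative rescaling based on the plain arithmetic mean, whereas the real function may use a trimmed mean, and the name `scale_to_target` is hypothetical.

```python
def scale_to_target(chip, target=500.0):
    """Rescale one chip's intensities so that their mean equals `target`.

    Assumes a multiplicative scale factor computed from the plain mean.
    """
    if not chip:
        raise ValueError("chip has no intensities")
    mean = sum(chip) / len(chip)
    if mean == 0:
        raise ValueError("cannot scale a chip whose mean intensity is zero")
    factor = target / mean
    return [value * factor for value in chip]


def scale_all(chips, target=500.0):
    """Apply scale_to_target to every chip (a list of intensity lists)."""
    return [scale_to_target(chip, target) for chip in chips]
```

After scaling, every chip has the same mean intensity, so differences in overall brightness between chips no longer dominate comparisons between them.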
13. Differential Expression Testing

Heat Map Plot

A heat map plot (Figure 5.17) displays a two-way layout of the most differentially expressed genes along the vertical axis versus the experimental conditions on the horizontal axis. This graph is also hyperlinked to the annotation information.

[Figure 5.17: A heatmap plot shows differentially expressed genes as a function of experimental conditions. The map is hyperlinked to annotation databases. Example tooltip: Sample cg2b.CEL, Gene 33543_s_at, Exp Value 1.26.]

Chapter 5, An Example: Affymetrix Probe Level Data

Chromosome Plot

A chromosome plot displays the entire chromosome with differential expression marked (up for positive, down for negative) for each gene represented on the chip. The top 10 differentially expressed genes are highlighted in orange to indicate their location on the chromosome. Hovering the mouse over one of the colored active points displays the gene ID in the upper right-hand corner of the graph, as shown in Figure 5.18.

[Figure 5.18: Chromosome plot; the active point shown is 1262_s_at.]
14. In The Analysis of Gene Expression Data: Methods and Software. Edited by G. Parmigiani, E. S. Garrett, R. A. Irizarry, and S. L. Zeger. Published by Springer-Verlag, New York.

Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., and Speed, T. P. (2003b). Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. Accepted for publication in Biostatistics.

Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., and Speed, T. P. (2002). Summaries of Affymetrix GeneChip Probe Level Data. Nucleic Acids Research 31(4), e15.

Lazaridis, E., Sinibaldi, D., Bloom, G., Mane, S., and Jove, R. (2002). A simple method to improve probe set estimates from oligonucleotide arrays. Mathematical Biosciences 176(1), 53-58.

Lee, J. K. and O'Connell, M. (2003). An S-PLUS library for the analysis of differential expression. In The Analysis of Gene Expression Data: Methods and Software. Edited by G. Parmigiani, E. S. Garrett, R. A. Irizarry, and S. L. Zeger. Published by Springer, New York.

Li, C. and Wong, W. (2001a). Model-based analysis of oligonucleotide arrays: Expression index computation and outlier detection. Proceedings of the National Academy of Sciences U.S.A. 98, 31-36.

Li, C. and Wong, W. (2001b). Model-based analysis of oligonucleotide arrays: model validation, design issues and standard error application. Genome Biology 2(8), research0032.1-0
15. This method divides the chip into a given number of zones and uses the lowest 2% of the intensity values to compute the background intensity within each zone. Smoothing across zones is done by computing a zone weight, which is based on the distances of spots to zone centers. The background at each cell location (x, y) is computed using these weights. A similar computation is made for the noise at each cell.

Chapter 7, Pre-Processing and Normalization

The background-corrected value is computed as a function of the background at (x, y), the noise at (x, y), and the threshold and floor noise values at each (x, y) cell location, based on the noise at (x, y), such that the cell intensity remains positive.

An Example With bg.correct

bg.correct takes an AffyBatch object and returns an AffyBatch object. Following are some examples of background correcting a sample extracted from the Dilution experiment. Please refer to the Dilution help file for more details on the data.

> tmp <- bg.correct(Dilution, method = "mas")
> tmp <- bg.correct.rma(Dilution)
> tmp <- bg.correct.mas(Dilution)

PM correct methods

One may wish to correct the PM intensities in a Probeset for non-specific binding (hybridization that occurred at random). Affymetrix chips provide a mechanism for measuring non-specific binding through the mismatch probes (MM). The amount of binding that occurs at these spots is a measure of the amount of random bin
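The zone-based background estimate described above can be sketched in Python. This is a simplified illustration, not the bg.correct code: it assumes each zone's background is the mean of its lowest 2% of intensities and that zone weights are inverse squared distances to zone centers plus a smoothing constant (assumed here to be 100). The function names are hypothetical.

```python
def zone_background(intensities, fraction=0.02):
    """Estimate a zone's background as the mean of its lowest intensities."""
    if not intensities:
        raise ValueError("zone has no intensities")
    ordered = sorted(intensities)
    k = max(1, round(len(ordered) * fraction))
    lowest = ordered[:k]
    return sum(lowest) / len(lowest)


def smoothed_background(x, y, zone_centers, zone_backgrounds, smooth=100.0):
    """Weighted average of zone backgrounds at cell location (x, y).

    Each zone's weight is 1 / (squared distance to its center + smooth),
    so nearer zones contribute more and the estimate changes smoothly.
    """
    weights = [
        1.0 / ((x - cx) ** 2 + (y - cy) ** 2 + smooth)
        for cx, cy in zone_centers
    ]
    total = sum(weights)
    return sum(w * b for w, b in zip(weights, zone_backgrounds)) / total
```

A cell that lies halfway between two zone centers receives the average of the two zone backgrounds, while a cell close to one center is dominated by that zone's estimate.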
16. chips to support estimation of random and fixed effects, and an experimental design that provides enough degrees of freedom on error for informative estimation of the treatment effects and comparisons of treatment levels.

REFERENCES

Alizadeh, A. A., Eisen, M. B., Davis, R. E., Ma, C., Lossos, I. S., Rosenwald, A., Boldrick, J. C., Sabet, H., Tran, T., Yu, X., Powell, J. I., Yang, L., Marti, G. E., Moore, T., Hudson, T. Jr., Lu, L., Lewis, D. B., Tibshirani, R., Sherlock, G., Chan, W. C., Greiner, T. C., Weisenburger, D. D., Armitage, J. O., Warnke, R., Levy, R., Wilson, W., Grever, M. R., Byrd, J. C., Botstein, D., Brown, P. O., and Staudt, L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403, 503-511.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological) 57, 289-300.

Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple hypothesis testing under dependency. Annals of Statistics.

Do, K., Broom, and Wen (2003). GeneClust. To appear in The Analysis of Gene Expression Data: Methods and Software. Edited by G. Parmigiani, E. S. Garrett, R. A. Irizarry, and S. L. Zeger. Published by Springer, New York.

Dudoit, S., Yang, Y. H., Luu, P., Lin, D. M., Peng, V., Ngai, J., and Speed, T. P. (2002). Normalization for cDNA microarray data: a robust composite method addressing single
17. first two chips in set
> swirl.normGmed <- maNormMain(swirl[, 1:2], f.loc = list(maNormMed(x = NULL, y = "maM")))

# Global median normalization over all chips in swirl
> swirl.normGmed <- maNormMain(swirl, f.loc = list(maNormMed(x = NULL, y = "maM")))

# 2D spatial location normalization of array 93
> swirl.norm2D <- maNormMain(swirl[, 3], f.loc = list(maNorm2D()))

# A normalization that is a weighted average of the loess normalization
# over the chip and the loess normalization over the print-tip groups
> swirl.norm <- maNormMain(swirl[, 1], f.loc = list(maNormLoess(x = "maA", y = "maM", z = NULL, span = 0.5), maNormLoess(x = "maA", y = "maM", z = "maPrintTip")), a.loc = maCompNormA())

Normalization Functions maNorm, maNormScale

Simple wrapper functions to marrayNormMain are provided by maNorm and maNormScale. These wrappers send default accessor methods and settings to marrayNormMain, as outlined in Table 7.3 and Table 7.4. cDNA normalization from the S+ARRAYANALYZER normalization dialog uses these functions and associated method names.

Table 7.3: The norm parameter of maNorm results in the following normalization methods and settings being passed to maNormMain.

[Table fragment: x = "maSpotRow", y = "maSpotCol", z = "maM", g = "maPrintTip", w = NULL, subset = subset, span = span; column heading: Normalization Method]
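Global median normalization, the simplest of the location normalizations listed above, can be illustrated with a short Python sketch. This is not the maNormMed implementation; the name `median_normalize` is hypothetical, and missing M values are assumed to be represented as `None`.

```python
import statistics


def median_normalize(m_values):
    """Subtract the median of the non-missing M values from every M value.

    After this step the log ratios of the array are centered at zero.
    Missing values (None) are ignored when computing the median and are
    returned unchanged.
    """
    observed = [m for m in m_values if m is not None]
    if not observed:
        raise ValueError("no non-missing M values")
    center = statistics.median(observed)
    return [None if m is None else m - center for m in m_values]
```

Intensity-dependent (loess) and spatial normalizations replace the single median by a fitted value that depends on A or on spot position, but the subtraction step has the same form.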
18. [Figure 9.9: Results of searching the Gene Ontology (GO) site with GO IDs for the first gene identified in the gene filtering analysis described above, i.e., Gene M58459, Human ribosomal protein (RPS4Y) isoform mRNA.]

GO terms shown in the figure (Term Name: Definition):
• cytosolic small ribosomal subunit (sensu Eukarya): The small subunit of a eukaryotic cytosolic ribosome; has a sedimentation coefficient of 40S.
• protein biosynthesis: The formation from simpler components of a protein, rather than of proteins in general.
• RNA binding: Interacting selectively with an RNA molecule or a portion thereof.
• structural constituent of ribosome: The action of a molecule that contributes to the structural integrity of the ribosome.

[Figure: AmiGO Tree View in Microsoft Internet Explorer. Tree shown: GO:0003673 Gene Ontology; GO:0008150 biological process; GO:0005575 cellular component; GO:0003674 molecular function; GO:0005198 structural molecule; GO:0003735 structural constituent of ribosome.]
Figur
19. Figure 5.9: Specifying Robust Multichip Analysis with a single checkbox.

Figure 5.11 displays the M vs A plot for the 24-hour samples. The interpretation is the same as that for Figure 5.10. Figure 5.12 displays boxplots of logged expression summaries for each sample chip. Visual inspection shows the distributions are well aligned at their centers and quartiles. Although normalization may be applied repeatedly to summarized expression intensities, there is little need to apply more normalization to CGExprSet.rma.

After applying normalization and summarization procedures to the raw expression intensities, a log base 2 transformation is applied. Consequently, the returned summarized object contains expression intensities on a logged scale. The log transformation is computed as

$$ \begin{cases} \log_2(E) & \text{if } E > 1 \\ 0 & \text{if } E \le 1 \end{cases} $$

[MvA for 0hr; chips cg2a.CEL, cg2b.CEL; x-axis A]
Figure 5.10: M versus A plot for the two replicate samples measured at 0 hours. The value in the lower left panel of the plot is the interquartile range of M.

Expression Summaries

[MvA for 24hr; chips cg24a.CEL, cg24b.CEL; x-axis A]
Figure 5.11: M versus A plot for the two replicate samples measured at 24 hours. The value in the lower left panel of the plot is the interquartile
20. wildtype are estimated, and hypotheses regarding the significance of these are tested. We represent these contrasts as a linear sum with estimates

$$ \hat{S} = \sum_{k} l_k \hat{\tau}_k $$

and standard errors

$$ \mathrm{se}(\hat{S}) = \sqrt{\sum_{j} \sum_{k} l_j l_k C_{jk}} $$

where $C_{jk}$ is the $(j, k)$ element of the estimated variance-covariance matrix of $\hat{\tau}$. t-statistics are formed from these estimates as

$$ t = \hat{S} / \mathrm{se}(\hat{S}) $$

The statistics need to be adjusted for control of family-wise error rate (FWER) or false discovery rate (FDR). In this case, we present a Bonferroni adjustment in accordance with the analysis of Wolfinger et al. (2001). We also plot the p-values versus fold change for the individual contrasts in a trellis display of volcano plots. Gene lists for each contrast are also readily displayed as output.

Better adjustments of the p-values may be obtained in S+ARRAYANALYZER using the function mt.rawp2adjp. Options for control of FWER in this function are:
• Bonferroni
• Holm, based on Holm (1979)
• Hochberg, based on Hochberg (1988)
• SidakSS and SidakSD

Options for control of FDR in this function are:
• BH, based on Benjamini and Hochberg (1995)
• BY, based on Benjamini and Yekutieli (2001)

# Create non-interactive volcano plots of the results
# All pairwise comparisons
> graphsheet()
> print(xyplot(-log10(pValue) ~ foldChange, data = yeastResults,
    panel = function(x, y, critPValue) {
      panel.xyplot(x, y)
      abline(h = -log10(critPValue), col = 4, lw
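To show how some of the listed adjustments differ, here is a Python sketch of the Bonferroni, Holm, and Benjamini-Hochberg (BH) adjusted p-values. It follows the textbook definitions of these procedures and is not the mt.rawp2adjp code; the function names are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni: multiply every p-value by the number of tests (cap at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down: multiply the i-th smallest p-value by (m - i + 1)
    and enforce monotonicity with a running maximum."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted


def benjamini_hochberg(pvals):
    """BH step-up: multiply the i-th smallest p-value by m / i and enforce
    monotonicity with a running minimum taken from the largest p-value down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running = 1.0
    for k, i in enumerate(order):
        rank = m - k
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted
```

Holm's adjusted p-values are never larger than Bonferroni's, and the BH values are never larger than Holm's, which reflects that FDR control is a less stringent criterion than FWER control.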
21. 1. The Data > Select Data menu item on the main S-PLUS menu bar.
2. Through the S-PLUS Object Explorer.
3. The Command line.

Differential Expression Summary Table Output

From The Data Menu Item

Open the Select Data dialog by selecting Data > Select Data from the main S-PLUS menu bar, and select the test summary objects from the Existing Data Name drop-down list.

[Figure 8.17: Selecting the complete gene list from the Select Data dialog.]

Clicking OK opens a data sheet containing the summary information.

MultTestSumm (Read Only)

    Row          gName        mean.0hr  mean.24hr  foldChange  testStat  rawp  adjp  signif
1   35704_at     35704_at     9.24      0.54       8.70        188.04    0.00  0.11  T
2   37023_at     37023_at     8.74      0.54       8.20        174.87    0.00  0.11  T
3   33532_at     33532_at     7.78      10.90      3.13        3621.54   0.00  0.11  T
4   37712_g_at   37712_g_at   8.47      0.54       7.93        163.33    0.00  0.11  T
5   31979_at     31979_at     7.21      0.54       6.67        142.71    0.00  0.11  T
6   1837_at      1837_at      7.44      0.54       6.90        164.14    0.00  0.11  T
7   41848_f_at   41848_f_at   8.59      0.54       8.05        150.70    0.00  0.12  T
8   1984_s_at    1984_s_at    8.43      0.54       7.89        199.13    0.00  0.12  T
9   41231_f_at   41231_f_at   12.93     13.80      0.87        115.31    0.00  0.13  T
10  36250_at     36250_at     8.62

From The Object Explorer
22. 1 details the experimental conditions and the associated data files.

Table 6.1: Experimental design and file association for the swirl cDNA experiment.

Cy3         Cy5         Replicate   File Name
swirl       wild type   1           swirl.1.spot
wild type   swirl       2           swirl.2.spot
swirl       wild type   3           swirl.3.spot
wild type   swirl       4           swirl.4.spot

Chapter 6, An Example: Two-Color cDNA Data

IMPORTING DATA

To import cDNA data, go to the main S-PLUS menu and select ArrayAnalyzer > Import Data > From cDNA Array.

[Figure 6.1: Menu selection to import cDNA data.]

Import cDNA Data Dialog

This launches the Import cDNA Data dialog with the File Selection page displayed. The primary task of the import process is to associate data files with experimental conditions and to select the variables and corresponding columns that are used in the data analysis.

[Figure: The File Selection page of the Import cDNA Data dialog (Single Factor Design; Chip Layout: Agilent Layout).]
23. 6.22: Post-normalized M vs A plot for the swirl data.

We can also do boxplots as a function of print-tip groups as follows:

> par(mfrow = c(1, 2))
> maBoxplot(swirl.raw[, 3], main = "Pre-normalization")
> maBoxplot(swirl.norm[, 3], main = "Post-normalization")

The resulting graph is shown in Figure 6.23.

[Figure 6.23: Before and after scale print-tip MAD normalization of the swirl data. Panels: Pre-normalization, Post-normalization (Scale Print Tip MAD); x-axis: PrintTip.]

Before we move on to differential expression testing, note that the slots of the marrayRaw object are:

> getSlots(swirl.raw)
    maRf      maGf      maRb      maGb      maW
"matrix"  "matrix"  "matrix"  "matrix"  "matrix"
      maLayout     maGnames    maTargets      maNotes
"marrayLayout" "marrayInfo" "marrayInfo"  "character"

Differential Expression Testing: Paired t-test From The Command Line

Each of the first four slots is a raw intensity matrix with dimensions equal to the number of genes x the number of chips, after controls have been removed. For this example, 7680 x 4:

> dim(swirl.raw@maRf)
[1] 7680    4

Once we apply the normalization procedures, background correction is done, the raw intensities are converted to M and A values, and the normalized object has different sl
24. [Figure: NCBI LocusLink page for ODC1 (Homo sapiens), with a link to display mRNA-genomic alignments spanning 8122 bps.]

REFERENCES

Fox, J. W., Dragulev, B., Fox, N., Mauch, C., and Nischt, R. (2001). Identification of ADAM9 in human melanoma: Expression, regulation by matrix and role in cell-cell adhesion. Proceedings of International Proteolysis Society Meeting.

Lee, J. K. and O'Connell, M. (2003). An S-PLUS Library for the Analysis of Differential Expression. To appear in The Analysis of Gene Expression Data: Methods and Software. Edited by G. Parmigiani, E. S. Garrett, R. A. Irizarry, and S. L. Zeger. Published by Springer, New York.

AN EXAMPLE: TWO-COLOR CDNA DATA
  cDNA Data Analysis Workflow
    Swirl cDNA Data Set
  Importing Data
    Import cDNA Data Dialog
    Create Layout Dialog
    Other Data Formats
  Normalization
    The Normalization Dialog
  Differential Expression Testing
    The Options Group
  Annotation
  From The Command Line
    Importing Data
    Normalization
    Differential Expression Testing

CDNA DATA AN
25. Comparisons Test Input

The dialog is arranged in four main groups:
• Data
• Options
• Graph Options
• Output

Data

The Data group allows you to select the expression object for testing. You start by selecting the data type in Show Data of Type as one of Affymetrix or cDNA, and then selecting a data object (an expression object created by importing, expression summarization for Affy CEL, and normalization) from the Data drop-down list box.

[Figure 8.1: The Multiple Comparisons Test dialog.]

Once a data object is selected, the chip name is filled in the Chip Name field. For custom 2-color cDNA or non-Affymetrix oligonucleotide chips, the chip name may be <undetermined>.

GUI for Multiple Comparisons Testing

Options

The Options group, displayed in Figure 8.2, allows you to specify various options for the statistical tests:
1. Select the statistical test (default is Welch's t-test).
2. Specify the alternative (default is Not equal).
3. Input the FWER or FDR.
4. Select the p-value adjustment procedure.
5. Specify the number
26. [Figure 3.2: The File Selection page of the Import Affymetrix Data dialog allows you to specify the design and data files for each design point.]

The second page collects MIAME (Minimum Information About a Microarray Experiment) data. This information is used as the default labeling on plots and other output.

[Figure 3.3: The MIAME page of the Import Affymetrix Data dialog contains information describing the experiment. The fields shown are Experimenter's Name, Laboratory, Contact Information, Experiment Title, Experiment Description, and Existing Notes.]

Chapter 3 GUI Overview

The third page is Variable Selection & Filtering, which allows you to choose the columns to use for gene expression and gene names. You can also adjust fil
27. [Screenshot: UniGene results, items 1-2 of 2. (1) M58459: Human ribosomal protein RPS4Y isoform mRNA, complete cds. (2) AI936826: wp69h10.x1 NCI_CGAP_Bm25 Homo sapiens cDNA clone IMAGE:2467075 3', similar to SW:GP39_HUMAN 043194 PUTATIVE G PROTEIN-COUPLED RECEPTOR GPR39, mRNA sequence.]

Figure 9.7: UniGene annotation for the two genes identified in the gene filtering analysis described above.

Chapter 9 Using the S-PLUS Command Line to Analyze Microarray Data

Screenshot (Entrez PubMed in Microsoft Internet Explorer, items 1-8 of 8): Uechi T, Tanaka T, Kenmochi N. A complete map of the human ribosomal protein genes: assignment of 80 genes to the cytogenetic map and implications for human disorders. Genomics 2001 Mar 15;72(3):223-30. PMID 11401437. Kenmochi N, Kawaguchi T, Rozen S, Davis E, Goodman N, Hudson TJ, Tanaka T, Page DC. A map of 75 human ribosomal protein genes. Genome Res 1998 May;8(5):509-23. PMID 958
28. Level Data

The key task is to convert probe-level data to one expression value for each gene transcript, which can then be used to test for differential gene expression. This is typically achieved through the following sequence of steps:
1. exploratory data analysis and diagnostics
2. background correction
3. probe-specific background correction (e.g., subtracting MM)
4. normalization
5. summarizing the probe set values into one expression measure and, in some cases, a standard error for this summary

As discussed in the section Workflow on page 141, normalization can be done before and/or after summarizing probe-level data. Steps 2-5 above can be done using separate functions, or together using functions such as expresso and express. These functions, as well as functions for plotting probe-level data for exploratory data analysis, are discussed in the next sections. In S+ARRAYANALYZER, the expresso function provides many options to handle the tasks in steps 2-5 above. Examples are given in section Summarization in S+ARRAYANALYZER on page 167.

CDF Libraries

In order to compute expression summaries and/or normalization of Affymetrix probe-level data, you will need to have the Affymetrix CDF information available. In R, this CDF information is stored in an R environment. In S-PLUS, the information is stored in a named list. Each CDF libr
29. Figure 9.14: Interactive volcano plot for the 227 significant genes. The plot can be interactively changed as described in the text.

Now just drag the comparison column on top of this graph. First highlight the comparison column, then grab it in the body of the column and drag it onto the graph of log10p vs. foldChange. You will see a dotted square at the top of the graph (see screen shot below). Release your mouse in this square, and you have a trellis plot showing log10p vs. foldChange for each of the contrasts in separate panels. The resulting trellis plot is interactive. For example, when you hover over points you see their gene name, and you can highlight points for filtering or additional analysis. By clicking on any point, you can change the appearance of most details of the graphical presentation. In this plot we change the Data Tips and Point Labels to show the gene name, fold change, and log10adjp value. I
30. PLUS Command Line to Analyze Microarray Data
  Introduction
  Clustering Microarray Data using S-PLUS
  Annotation of Microarray Data using S-PLUS
  Differential Expression Analysis for Experiments with More than Two Experimental Conditions
  References
Appendix S+ARRAYANALYZER Data Libraries
Index

WELCOME TO S+ARRAYANALYZER
Welcome
  Features
  Libraries
Supported Platforms and System Requirements
Installing and Running S+ARRAYANALYZER
Online Help
Online Reference
Technical Support

Chapter 1 Welcome to S+ARRAYANALYZER

WELCOME

Features

S+ARRAYANALYZER is an S-PLUS module that provides you with a powerful tool for analyzing Affymetrix MAS and CEL data and cDNA microarray data. Using either the graphical user interface (GUI) dialogs or the Commands window, you can perform statistical analysis to determine differential gene expression in microarrays, which is fundamental to the rapidly growing field of functional genomics. In S+ARRAYANALYZER, you can access functions in a collection of libraries based on the Bioconductor project, a repository for current microarray and genomics research developed by leading statisticians.

[Figure 1.1: Sample volcano plot using LPE func in S+ARRAYANALYZER (axes: log10 LPE p-value vs. fold change). This plot was generated using Affymetrix data.]

The S+ARRAYANALYZER module helps you analyze microarray data using the
31. PLUS has finished generating the output: one for the summary table and another for the Graphlet.

DIFFERENTIAL EXPRESSION ANALYSIS PLOTS

Common Plots

The differential expression summary plots are designed to give you easy access to annotation data in public databases. Two of the plots, the volcano plot and the heat map, have embedded hyperlinks, so you can click on a point and bring up annotation from NCBI databases. There are three plots common to both testing dialogs:
  Volcano plot
  Heat map
  Chromosome plot

Each of the dialogs optionally produces one additional plot. The Multiple Comparisons Test dialog produces a Q-Q Normal Probability plot of the test statistics, and the LPE Test dialog produces a Variance plot displaying a graph of the baseline variance estimates as a function of the average expression intensity for each experimental condition. Each of these types of plots is discussed in the following sections.

Chapter 8 Differential Expression Testing

Volcano Plot

A volcano plot displays the logarithm of the adjusted p-value versus average fold change. The vertical lines indicate average fold change values of plus or minus two, and the horizontal line indicates a significant adjusted p-value. Points located in the lower outer sextants are those with large absolute fold change and small (significant) p-value. Each of those points is active, so you can click an individual
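The volcano plot logic described above can be sketched in a few lines. This is a minimal illustration in Python rather than S-PLUS, and it is not the module's implementation. It assumes a signed average fold change and an already adjusted p-value per gene. The function name `volcano_points`, the fold change cutoff of 2, and the significance level 0.05 are hypothetical choices made for the example.

```python
import math


def volcano_points(fold_changes, adj_pvalues, fc_cutoff=2.0, alpha=0.05):
    """Return (fold change, log10 adjusted p-value, flagged) for each gene.

    A gene is flagged when it lies outside the vertical fold change lines
    and below the horizontal significance line. Adjusted p-values must be
    greater than zero, because log10(0) is undefined.
    """
    points = []
    for fc, p in zip(fold_changes, adj_pvalues):
        flagged = abs(fc) >= fc_cutoff and p <= alpha
        points.append((fc, math.log10(p), flagged))
    return points


# Four hypothetical genes: only the first and last are in the outer regions.
points = volcano_points([3.0, -2.5, 1.2, -4.0], [0.001, 0.2, 0.0001, 0.04])
flags = [flagged for _, _, flagged in points]
```

The third gene shows why both criteria matter: its p-value is very small, but its fold change of 1.2 keeps it between the vertical lines.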
32. [Figure 8.7: The Local Pooled Error Test, or LPE test, dialog.]

Once a data object is selected, the chip name is filled in the Chip Name field. For custom two-color cDNA or non-Affymetrix oligonucleotide chips, the chip name may be <undetermined>.

Options

The Options group contains the procedures for controlling the FWER and FDR, as shown in the drop-down list in Figure 8.8. The procedures correspond to those described in section Controlling The False Positive Rate and section FDR Procedures. Both FWER and FDR procedures are included in the drop-down list. Select one, then enter the target error rate (the family-wise error rate for an FWER procedure, or the false discovery rate for an FDR procedure) in the FWER editable field.

[Figure 8.8: Setting the p-value adjustment procedure for controlling the FWER.]

Variance Estimation

The Variance Estimation group in the upper right-hand corner of the dialog control
33. [Figure 6.9: The Variable Selection & Filtering page allows you to set the variable and row selections.]

In this example, select Gmean as the Green Foreground and Rmean as the Red Foreground to complete the required fields. Optionally, select bgGmean as the Green Background and bgRmean for the Red Background. The Weights field is for specifying a column of spot quality weights. These weights are used in subsequent computations to down-weight poor quality spots during normalization. See Chapter 7, Pre-Processing and Normalization, for more detail.

[Figure 6.10: Selecting the Red Foreground and Red Background intensities. The foreground intensities for red and green are both required.]

Clicking OK

Once you have completed the Variable Selection & Filtering page, click OK to begin importing the files. The object resulting from the import step is of class marrayRaw, which is saved as an S-PLUS object with the name you entered on the first page of the Import cDNA Data dialog (swirlmarrayRaw).

Other Data Formats

Some scanning equipment generates layout information as part of the data file. For example, some scanners
34. State University Press.

Storey, J.D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64, 479-498.

Westfall, P.H. and Young, S.S. (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons.

USING THE S-PLUS COMMAND LINE TO ANALYZE MICROARRAY DATA
Introduction
Clustering Microarray Data using S-PLUS
  Example Lymphoma Classification
Annotation of Microarray Data using S-PLUS
  Annotation examples
Differential Expression Analysis for Experiments with More than Two Experimental Conditions
References

INTRODUCTION

S+ARRAYANALYZER is designed as an add-on module to S-PLUS. It has a rich set of cutting-edge methodologies for loading microarray data, pre-processing (e.g., normalization and expression intensity summarization), differential expression analysis, and annotation. Most of these methods have been implemented through the GUI installed with the module. However, there are times when you need to go beyond the GUI to get a job done. One key advantage of S+ARRAYANALYZER is that it sits on top of S-PLUS. S-PLUS is the richest environment available today for statistics, graphical data analysis, and enterprise deployment of best analyti
35. This graph is also hyperlinked to the annotation information.

[Figure 6.17: A heat map plot shows differentially expressed genes as a function of experimental conditions.]

QQ Normal Quantiles Plot

[Figure 6.18: QQ Normal quantiles plot (horizontal axis: Quantiles of Standard Normal).]

The QQ Normal quantiles plot displays the test statistics for all genes versus the standard normal quantiles. This plot gives some sense of the distribution of the test statistics and is used primarily for diagnostic purposes. This particular plot shows one extreme test statistic (about 1200); the other statistics are less than 100.

ANNOTATION

Annotation for cDNA arrays is not automatic the way it is for Affymetrix chips. In order to produce volcano and heat map plots with interactive annotation, you need to create a couple of S-PLUS objects first. With the current release, you will need to do this through the command line. Once this is done, the annotation graphics will use the objects to link directly to either GenBank or LocusLink databases. To create the annotation objects, proceed as follows. Create a data frame that links the probe names with GenBank's Accession Number and LocusLink's ID. The data frame should look
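The coordinates behind a QQ Normal quantiles plot such as the one described above can be sketched as follows. This is a minimal Python illustration, not the code the module uses. The function name `normal_qq_points` is hypothetical, and the plotting positions (i - 0.5)/n are an assumption, since the manual does not say which convention is applied.

```python
from statistics import NormalDist


def normal_qq_points(test_statistics):
    """Pair each sorted test statistic with a standard normal quantile.

    Assumption: plotting positions (i - 0.5) / n for i = 1..n.
    """
    n = len(test_statistics)
    ordered = sorted(test_statistics)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    quantiles = [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


# Three hypothetical test statistics.
qq = normal_qq_points([2.0, -1.0, 0.5])
```

If the test statistics were roughly normal, the points would fall near a straight line. A single statistic far above the others, as in the plot described above, shows up as an isolated point at the upper right.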
36. and Normalization

> swirl.norm <- maNormMain(swirl, f.loc = list(maNormLoess(span = 0.5)))

Table 7.2: cDNA scale and location normalization methods performed through maNormMain.

Normalization method, default settings, and description:

maNormMed. Defaults: x = NULL, y = "maM", subset = TRUE. Location normalization using the global median of intensity log-ratios for a group of spots.

maNormLoess. Defaults: x = "maA", y = "maM", z = "maPrintTip", w = NULL, subset = TRUE, span = 0.4. Location normalization to a fitted loess curve, usually for M vs. A.

maNormMAD. Defaults: x = NULL, y = "maM", geo = TRUE, subset = TRUE. Scale normalization using the median absolute deviation (MAD) of intensity log-ratios for a group of spots.

maNorm2D. Defaults: x = "maSpotRow", y = "maSpotCol", z = "maM", g = "maPrintTip", w = NULL, subset = TRUE, span = 0.4. 2D spatial location normalization. Normalizes to the smoothed intensity surface (loess surface) by print-tip group at each (x, y) coordinate.

Examples With maNormMain

Let's normalize the swirl data using a variety of methods in the maNormMain function. The normalization methods will be applied to the set of chips given. If you don't want to normalize across treatment conditions, then the marrayRaw objects can be subset as shown below.

# The swirl dataset. For description type ?swirl or help(swirl)
> swirl
# Global median normalization for arrays 81 and 82 dt
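To make the location and scale methods in Table 7.2 concrete, here is a minimal Python sketch of global median location normalization and MAD scale normalization of log-ratios M. It illustrates the idea only and does not call the marray functions. The helper names are hypothetical, the MAD is the raw median absolute deviation with no consistency constant, and dividing by the geometric mean of the MADs is my reading of the geo = TRUE default, which is an assumption.

```python
import math
from statistics import median


def median_location_normalize(m_values):
    """Subtract the global median from one array's log-ratios M."""
    center = median(m_values)
    return [m - center for m in m_values]


def mad(values):
    """Raw median absolute deviation about the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


def mad_scale_normalize(arrays):
    """Rescale each array so that all arrays share the same MAD.

    Each array is divided by its MAD relative to the geometric mean of
    the MADs across arrays. Every array must have a MAD greater than zero.
    """
    mads = [mad(a) for a in arrays]
    geo_mean = math.exp(sum(math.log(s) for s in mads) / len(mads))
    return [[m / (s / geo_mean) for m in a] for a, s in zip(arrays, mads)]


centered = median_location_normalize([1.0, 2.0, 4.0])
scaled = mad_scale_normalize([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0]])
```

In the second call the two hypothetical arrays differ only in spread, so after scaling they become identical.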
37. and multiple slide systematic variation. Nucleic Acids Research, Vol. 30, No. 4, e15.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences USA, 95(25), 14863-14868.

Fox, J.W., Dragulev, B., Fox, N., Mauch, C., and Nischt, R. (2001). Identification of ADAM9 in human melanoma: Expression, regulation by matrix and role in cell-cell adhesion. Proceedings of International Proteolysis Society Meeting.

Fraley, C. and Raftery, A.E. (2002). MCLUST: Software for Model-Based Clustering, Discriminant Analysis and Density Estimation. Technical Report no. 415, Department of Statistics, University of Washington.

Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., and Lander, E.S. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439), 531-537.

Hastie, T., Tibshirani, R., Eisen, M.B., Alizadeh, A., Levy, R., Staudt, L., Chan, W.C., Botstein, D., and Brown, P. (2000). Gene Shaving as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1, research0003.1-research0003.21.

Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika, 75, 800-802.

Holm S
38. as follows:

> annoDF[1:10, ]
       ll        an       probes
 1  12502    M18228    100001_at
 2  16426    X70393    100002_at
 3  20190    D38216    100003_at
 4  77065  AW120890    100004_at
 5  22032    X92346    100005_at
 6  12552    D21253    100006_at
 7 272359  AI837573    100007_at
 8  20674    X94127  100009_r_at
 9  16599    U36340    100010_at
10  16599  AI851658    100011_at

Creating the data frame is not difficult, and it can even be read from an existing Excel file or text file, either through the dialog generated by File > Import Data > From File from the main menu bar of S-PLUS or by using the S-PLUS read.table function from the command line. If you read the data frame from either the GUI or the read.table function, be sure to specify no conversion of strings to factors. This option is available on the Options tab of the import dialog (the Strings as factors check box should be unchecked) or as an argument to the read.table function (stringsAsFactors=F).

Once you've created the annotation data frame, you need to create two objects. The object name will be a concatenation of the chip or array name and the strings LOCUSID and ACCNUM. For the swirl example, I would have swirlLayoutLOCUSID and swirlLayoutACCNUM for names. These two objects are created as follows:

#### Create LocusLink named list
> swirlLayoutLOCUSID <- as.list(annoDF$ll)
> names(swirlLayoutLOCUSID) <- annoDF$probes
#### Create GenBank Accession Number named list
> sw
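The structure of the two annotation objects described above can be sketched outside S-PLUS. The following Python illustration uses dictionaries in place of named lists and the first three rows of the annoDF table shown above. The function names are hypothetical, and the sketch does not create objects that S+ARRAYANALYZER can read. It only shows the probe-to-identifier mapping and the naming rule.

```python
# First three rows of the annotation table: (LocusLink ID, accession, probe).
ANNO_ROWS = [
    (12502, "M18228", "100001_at"),
    (16426, "X70393", "100002_at"),
    (20190, "D38216", "100003_at"),
]


def build_annotation_maps(rows):
    """Return two probe-keyed mappings: LocusLink IDs and accession numbers."""
    locus_by_probe = {probe: locus_id for locus_id, _, probe in rows}
    accession_by_probe = {probe: accession for _, accession, probe in rows}
    return locus_by_probe, accession_by_probe


def annotation_object_names(chip_name):
    """Concatenate the chip or array name with LOCUSID and ACCNUM."""
    return chip_name + "LOCUSID", chip_name + "ACCNUM"


locus_map, accession_map = build_annotation_maps(ANNO_ROWS)
object_names = annotation_object_names("swirlLayout")
```

Keying both mappings by probe name mirrors the S-PLUS step that assigns annoDF$probes as the names of the list.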
39. been done, i.e., multiplying all expression values on a chip by a single scalar such that the scaled mean expression values on each chip are the same. This simple normalization is not enough to account for much extraneous variability (see Bolstad et al., 2002).

In this chapter, we step through the analysis of an experiment designed to improve understanding of the effect of chronic conditioning on the mass build-up of the left ventricular muscle of the heart. A study was conducted on mice that were regularly exercised by swimming. Over the course of 10 days, exercise was increased from 10 minutes twice a day to 90 minutes twice a day. Conditioning of the mice continued for 4 weeks. For more details, see http://cardiogenomics.med.harvard.edu/groups/proj1/pages/swim_home.html.

This simple experimental design thus involved one factor (amount of conditioning) at two levels (0 and 4 weeks), with expression being measured six times (replicate arrays) at each time point. The main hypothesis of interest involves discovering genes showing differential expression between the two time points, because these genes are believed to be relevant to the enlargement of ventricular mass during chronic conditioning. The chips and data files are listed in Table 4.1.

Table 4.1: Experimental design and file association for the melanoma cancer study.

Experimental Condition Replicate
40. between-cluster variability, based on the chosen between-cluster dissimilarity measure, at each stage of the process. The mclust method assumes that data are generated from an underlying mixture of probability distributions (e.g., Gaussian distributions) and provides insight into the number of clusters, a quantity that is derived from a model selection process in its probability framework. The divisive method diana starts by finding the most disparate object and splitting it into a splinter group.

All cluster methods are very sensitive to the choice of distance or dissimilarity between points (i.e., samples or genes). S-PLUS includes two commonly used functions for creating distances or dissimilarities between points, namely dist and daisy. The correlation function cor may also be used, and 1 - cor(x) produces a matrix representing the dissimilarities between columns (samples) of a matrix x.

The dist function simply constructs distances between rows as Euclidean, Manhattan, maximum, or binary. If the data are normalized with mean zero and variance one prior to calling dist, the resulting matrix is equivalent to a dissimilarity matrix produced using cor.

Hierarchical methods like agnes and hclust have been widely used for the cluster analysis of microarray data. Yeung et al. (2001) discuss the benefits of model-based clustering for microarray analysis. Results from the hierarchical methods a
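The equivalence between Euclidean distance on standardized data and the correlation-based dissimilarity can be checked numerically. For two vectors of length n standardized with the sample standard deviation, the squared Euclidean distance equals 2(n - 1)(1 - r), where r is their Pearson correlation, so both measures rank pairs identically. The Python sketch below verifies this for two hypothetical expression profiles. The helper names are illustrative and do not correspond to S-PLUS functions.

```python
from statistics import mean, stdev


def standardize(x):
    """Center to mean zero and scale to unit sample variance."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]


def pearson(x, y):
    """Pearson correlation computed from standardized values."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


# Two hypothetical expression profiles measured on five samples.
profile_a = [1.0, 2.0, 3.0, 4.0, 5.0]
profile_b = [2.0, 1.0, 4.0, 3.0, 6.0]

n = len(profile_a)
d2 = squared_euclidean(standardize(profile_a), standardize(profile_b))
r = pearson(profile_a, profile_b)
expected_d2 = 2 * (n - 1) * (1 - r)
```

Because the relation is monotone in r, hierarchical clustering on either dissimilarity produces the same merge order for methods that depend only on the ranking of distances, such as single linkage.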
41. contains the locations on the chip for the perfect match and mismatch probes. S+ARRAYANALYZER functions need to access this named list when doing probe-level operations. If the list is not available, S+ARRAYANALYZER attempts to load the library; if it cannot find the library, an error occurs.

S+ARRAYANALYZER includes three of these named lists in the S+ARRAYANALYZER affy library: hgu95acdf, hgu95av2cdf, and hgu133acdf. If you are working with these chips (hgu95a, hgu95av2, or hgu133a), then you do not need to do anything, as the S+ARRAYANALYZER functions that operate on the CEL data find the named lists.

The CDF information for other Affymetrix chips is available on the S+ARRAYANALYZER CD under DataLibs/CDFLibs. There is a zip file for each chip, and each zip file unpacks to create a library. The libraries can be installed in the library directory under the top-level S-PLUS installation directory (run getenv("SHOME") at the S-PLUS command line to find your S-PLUS installation directory). Alternatively, you can install the libraries in any location and use the lib.loc argument to the library function when attaching them.

Example

For example, if you are working with the mgu74a chip:
1. Find the file mgu74acdf.zip under DataLibs/CDFLibs on the S+ARRAYANALYZER CD or from the S+ARRAYANALYZER Web site above, and copy it to your computer.
2. Unzip mgu74acdf.zip into SHOME/library. The directory contains the files README.txt, DESCRIPTION, an
42. deviation (MAD); this allows between-slide scale normalization [code column, truncated: geo = geo, subset = subset].

printTipMAD. Code: f.loc = NULL, f.scale = list(maNormMAD(x = "maPrintTip", y = "maM", geo = geo, subset = subset)). Within-print-tip-group scale normalization using the median absolute deviation.

Let's look at some examples using maNorm and maNormScale.

# scalePrintTipMAD performs both location and scale normalization
> swirl.PrintTipMAD <- maNorm(swirl, norm = "scalePrintTipMAD")
# print-tip loess
> swirl.ptloess <- maNorm(swirl, norm = "printTipLoess")
# globalMAD
> swirl.gMAD <- maNormScale(swirl, norm = "globalMAD")
> swirl.gMADS <- maNormScale(swirl[, c(2, 4)], norm = "globalMAD")
# printTipMAD
> swirl.ptMAD <- maNormScale(swirl, norm = "printTipMAD")

Chapter 7 Pre-Processing and Normalization

PRE-PROCESSING AND NORMALIZATION FOR AFFYMETRIX PROBE LEVEL DATA

Affymetrix data typically arrives as DAT, CEL, and CHP files. The DAT files contain the raw images as processed by the scanner. The CEL files contain expression measures for each individual probe on the chip. The CHP files contain summaries of the individual probe-level data for each gene transcript. This section discusses methods for analyzing, correcting, summarizing, and normalizing the CEL probe-level data. Examples of these procedures from the GUI can be found in Chapter 5, An Example Affymetrix Probe
43. directory of the ArrayAnalyzer module.

> swirl.samples <- read.marrayInfo(file.path(AApath, "SwirlSample.txt"))

The resulting object shows the information stored in a rectangular array:

> swirl.samples
Object of class marrayInfo.
  maLabels  # of slide  Names         experiment Cy3
1 81        81          swirl.1.spot  swirl
2 82        82          swirl.2.spot  wild type
3 93        93          swirl.3.spot  swirl
4 94        94          swirl.4.spot  wild type
  experiment Cy5  date       comments
1 wild type       2001/9/20  NA
2 swirl           2001/9/20  NA
3 wild type       2001/11/8  NA
4 swirl           2001/11/8  NA
Number of labels: 4
Dimensions of maInfo matrix: 4 rows by 6 columns
Notes: C:\PROGRAM FILES\INSIGHTFUL\splus61\module\ArrayAnalyzer\examples\SwirlSample.txt

Reading Gene ID's

We are now ready to read the gene ID's.

> swirl.gnames <- read.marrayInfo(file.path(AApath, "fish.gal"), info.id = 4:5, labels = 5, skip = 21)
> swirl.gnames
Object of class marrayInfo.
   maLabels  ID       Name
1  geno1     control  geno1
2  geno2     control  geno2
3  geno3     control  geno3
4  3XSSC     control  3XSSC
5  3XSSC     control  3XSSC
6  EST1      control  EST1
7  geno1     control  geno1
8  geno2     control  geno2
9  geno3     control  geno3
10 3XSSC     control  3XSSC
Number of labels: 8448
Dimensions of maInfo matrix: 8448 rows by 2 columns
Notes: C:\PROGRAM FILES\INSIGHTFUL\splus61\module\ArrayAnalyzer\examples\fish.gal

Reading the Raw Data

The additional argumen
44. [Figure 5.4: Browsing for data files.]

You can find the melanoma example CEL data by navigating to your splus61/module/ArrayAnalyzer/examples directory and selecting the cg2a.CEL file. Repeat for the other three CEL files, entering one file per cell.

File Type

Note that the File Type (e.g., Probe Level CEL) listed in the bottom left corner of the dialog is automatically detected from the first file selected. The dialog is designed to prohibit mixing file types.

The Chip Name

For a CEL data file, the chip name is automatically detected. You typically don't need to change this selection. S+ARRAYANALYZER has pre-loaded the chip definition files (CDF) and the gene annotation information for chips hgu95a, hgu95av2, and hgu133a. If you are using other chips, you may want to refer to the Appendix, S+ARRAYANALYZER Data Libraries, to see how to load the CDF and annotation information for your chip.

[Figure 5.5: File Type and Chip Name are auto-detected and filled for Affy CEL files. The Save As field specifies the name for the resulting S-PLUS object.]

Saving the Data Object

To save the data object, type a name in the Save As field in the lower right corner of
45. four subpopulations. However, absolute values of the average silhouette width are fairly small in all cases.

The partitioning around medoids analysis and graphical summaries are presented in Figure 9.3 and Figure 9.4. Figure 9.3 shows the two clusters projected onto a biplot of the first two principal components. A silhouette plot for two subpopulations is provided in Figure 9.4.

# partitioning: 2 classes (compare to 3 and up)
> mat3a.2.pam <- pam(t(mat3a), 2)
> plot(mat3a.2.pam)

[Figure 9.3: Partitioning around medoids analysis summary for the Alizadeh et al. (2000) lymphoma data. The two clusters are projected onto a biplot of the first two principal components (axes: Component 1, Component 2). These two components explain 45.12% of the point variability.]

[Figure 9.4: Silhouette plot for the two-subpopulation partitioning around medoids clustering for the Alizadeh et al. (2000) lymphoma data (horizontal axis: silhouette width, 0.0 to 1.0). Average silhouette width: 0.19.]

ANNOTATION OF MICROARRAY DATA USING S-PLUS

The results of the hierarchical analysis of the Alizadeh et al. (2000) data may also be displayed as an S-PLUS Graphlet created through the S-PLUS 6.1 Java graphics device, java.graph. This implementation produces a lightweight interactive applet, typically less than 3
46. library hgu95aAnnoData. But annotation data on 26 genes that are on the hgu95a chip but not on the hgu95av2 chip will be missing. Other chips may have more differences in genes between the original and the V2 version.

Appendix S+ARRAYANALYZER Data Libraries

Table A.1: Information contained in an annotation library (continued).

Suffix: Description
LOCUSID: Unique integer id for locus
MAP: The chromosome assignment
CHR: Chromosome number
PMID: A sub-set of PubMed unique ids
GRIF: PubMed unique identifier
SUMFUNC: Summary of the function of genes
GO: Gene ontology id
CHRLOC: Chromosomal location of genes
CHRORI: Chromosomal orientation of genes
ENZYME: Enzyme Commission (EC) identifier
PATH: Pathway name
AFFYCOUNTS: Total number of Affymetrix ids
ENZYME2AFFY: Mapping from EC to Affymetrix probe id
GO2AFFY: Mapping from GO id to Affymetrix id
PATH2AFFY: Mapping from pathway name to Affymetrix id
PMID2AFFY: Mapping from PubMed id to Affymetrix id
GO2ALLAFFY: Mapping from GO id to Affymetrix id counts

The GenBank accession number and LocusLink annotation information is used by S+ARRAYANALYZER when doing differential expression testing using the dialogs. If HTML is selected as the output, the resulting plots contain links
47. mismatch (MM) and perfect match (PM) values have been summarized using a Tukey biweight procedure, as described in the Affymetrix document Statistical Algorithms Description Document (SADD) and section mas on page 167. If requested by the user, the software scales the signal using a trimmed mean (2% of the data at either end is trimmed away before the mean is computed). The output intensity for MAS 5.0 data is termed Signal.

Affymetrix version 4.0 software adjusts the probe-level data by:
  Subtracting the global background signal and noise, as described in section mas on page 167.
  Summarizing the 11-20 mismatch (MM) and perfect match (PM) values using a simple trimmed average difference procedure (see section avgdiff on page 166).
The output intensity for MAS 4.0 data is termed Avg Diff.

Normalization Methods for Affymetrix MAS Data

The summarized Affymetrix data from both MAS4 and MAS5 have not been suitably normalized for differential expression testing. Note that the MAS software allows a very simple global scaling in which the user enters a target value (TGT value). With this method, the average signal across all probes is calculated for each chip, and a scale factor (SF) is determined for each chip such that chip mean × SF = TGT. Thus, the signals on each chip are scaled by a single number, a crude form of normalization. S+ARRAYANALYZER provides two meth
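The global scaling rule above, chip mean × SF = TGT, can be written out directly. The Python sketch below is illustrative only and is not the MAS software's algorithm. The function names are hypothetical, the 2% trimming follows the description of MAS 5.0 above, and the target value of 500 is an arbitrary number chosen for the example.

```python
def trimmed_mean(values, trim_fraction=0.02):
    """Mean after dropping trim_fraction of the values at each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return sum(kept) / len(kept)


def scale_factor(signals, target, trim_fraction=0.02):
    """Solve chip mean * SF = TGT for the scale factor SF."""
    return target / trimmed_mean(signals, trim_fraction)


def scale_chip(signals, target, trim_fraction=0.02):
    """Multiply every signal on one chip by that chip's scale factor."""
    sf = scale_factor(signals, target, trim_fraction)
    return [s * sf for s in signals]


# A hypothetical chip with 100 probe set signals, scaled to a target of 500.
chip = [float(v) for v in range(1, 101)]
scaled_chip = scale_chip(chip, target=500.0)
```

Because every signal on a chip is multiplied by the same number, the scaling aligns chip averages but leaves intensity-dependent differences between chips untouched, which is why the text calls it a crude form of normalization.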
48. not, and dendrograms then reflect that imposed structure. There are at least a couple of ways that the value of such analyses can be assessed. The cophenetic distance between two observations i and j is defined to be the intergroup distance at which the observations are first put into the same cluster. The extent to which cophenetic distances reflect the true distances relates to the usefulness of the dendrogram as a tool for visualization. This agreement can be assessed by the cophenetic correlation coefficient, that is, the correlation between the true distances and the cophenetic distances.

The silhouette distance measures how well individual samples are classified into a discrete set of classes. This is a particularly relevant measure in assessing the value of a partitioning cluster analysis, but it can be applied to a hierarchical analysis by cutting the tree at some point and classifying samples into the groups defined by the cut. This is described further below.

The partitioning methods, primarily kmeans and pam, are appropriate when distinct sets of subpopulations are hypothesized. Results from using the partitioning methods are typically represented with cluster biplots and silhouette plots. Cluster biplots show the subpopulations separated in the first two principal component dimensions, whereas silhouette plots show how well individual samples are classified. In silhouette plots, for each object i (a sample or experimental condition, typically), the s
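As an illustration of what a silhouette plot displays, the sketch below computes silhouette widths using the usual definition s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from object i to the other members of its own cluster and b(i) is the smallest mean distance from i to the members of any other cluster. This is a Python sketch under that standard definition and is not the S-PLUS implementation. The function name is hypothetical, at least two clusters are required, and giving singleton clusters a width of 0 is a convention assumed here.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of every object, given a distance matrix and labels."""
    n = len(labels)
    clusters = set(labels)
    widths = []
    for i in range(n):
        def mean_distance_to(cluster):
            members = [j for j in range(n) if labels[j] == cluster and j != i]
            if not members:
                return None
            return sum(dist[i][j] for j in members) / len(members)

        a = mean_distance_to(labels[i])
        if a is None:  # singleton cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        b = min(mean_distance_to(c) for c in clusters if c != labels[i])
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Four hypothetical samples on a line, forming two well-separated pairs.
positions = [0.0, 1.0, 10.0, 11.0]
distances = [[abs(p - q) for q in positions] for p in positions]
widths = silhouette_widths(distances, [0, 0, 1, 1])
average_width = sum(widths) / len(widths)
```

Widths near 1 indicate well-classified samples, values near 0 indicate samples lying between clusters, and negative values suggest a sample is closer to another cluster than to its own.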
49. of iterations, along with a random seed, if you select one of the permutation methods.

[Figure 8.2: The Options group of the Multiple Comparisons Test dialog.]

Statistical Tests

The statistical tests and p-value adjustment procedures are all described in the section Statistical Tests and the section Controlling Type I Error Rates. The key words or phrases used to select one of these options match those used in the descriptive text.

[Figure 8.3: Selecting a statistical test procedure in the Options group.]

FWER and FDR Control

The procedures for controlling the FWER and FDR are shown in the drop-down list of Figure 8.4. The procedures correspond to those described in the section Controlling The False Positive Rate and the section FDR Procedures. Both FWER and FDR procedures are included in the drop-down list. For something other than the default (Bonferroni correction with FWER = 0.05), select an adjustment procedure from the drop-down list and input the overall error rate in the FWER editable field.

Figure 8.4: Procedures for contr
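To show what a p-value adjustment procedure does, here is a Python sketch of three widely used adjustments: Bonferroni and Holm, which control the FWER, and Benjamini-Hochberg, which controls the FDR. These are textbook formulas written for illustration. They are not taken from the module's source code, and the drop-down list may use different names or offer other procedures. The function names are hypothetical.

```python
def bonferroni(pvalues):
    """Single-step FWER adjustment: multiply each p-value by m."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def holm(pvalues):
    """Step-down FWER adjustment, applied from the smallest p-value up."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvalues[i]))
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Step-up FDR adjustment, applied from the largest p-value down."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # 1-based rank among the ascending p-values
        running_min = min(running_min, min(1.0, pvalues[i] * m / rank))
        adjusted[i] = running_min
    return adjusted


# Four hypothetical raw p-values.
raw = [0.01, 0.04, 0.03, 0.005]
adj_bonferroni = bonferroni(raw)
adj_holm = holm(raw)
adj_bh = benjamini_hochberg(raw)
```

A gene is then declared significant when its adjusted p-value does not exceed the error rate entered in the dialog. Holm is never more conservative than Bonferroni, and the FDR adjustment is typically the least conservative of the three.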
50. point to access annotation from LocusLink or UniGene databases.
Figure 8.10: A volcano plot shows the logarithm of the adjusted p-value vs. average fold change (horizontal axis: Mean Log2 Fold Change). This is displayed using an S-PLUS Java graphics device. It may also be displayed in the browser.
The volcano plot, complete with hyperlinks, can be sent to an HTML file for later viewing. It can also be sent to an S-PLUS graphics window. Figure 8.10 shows a typical volcano plot with the interactive menu generated by clicking a point in the differential expression region of the plot.
When the plot is viewed in an S-PLUS graphics window, the active points are not hyperlinked to the annotation databases. However, hovering the mouse over active points shows the gene name ID in the upper right corner of the graphic, as it does for the HTML display, as shown in Figure 8.11.
Figure 8.11: Finding the gene name on the graph for differentially expressed genes (the example shows Gene Name 38428_at).
Heat map
A heat map plot shows a 2-D image plot of the 300 genes with lowest p-values (by default) along the vertical axis versus the experimental conditions on the horizontal axis. See Figure 8.12. This graph is also hyperlinked to public annotation data
51. range of M
Chapter 5 An Example Affymetrix Probe Level Data
Expression Summaries
Figure 5.12: Boxplot of log2 expression intensities for the four samples after applying the composite RMA procedure (sample labels as printed: cg2a CEL, cg2b CEL, cg24a CEL, cg24b CEL).
DIFFERENTIAL EXPRESSION TESTING
Local Pooled Error Test
After summarizing the probe level data, we are now ready to compute differential expression tests. From the main menu, open the testing dialog by clicking ArrayAnalyzer > Differential Expression Analysis > LPE Test.
Figure 5.13: Selecting LPE test to open the Local Pooled Error Test dialog.
(Screenshot of the Local Pooled Error Test dialog, with Data, Variance Estimation, Graph Options, Options, and Output groups.)
Figure 5
52. read from a HU6800 CEL file.
> pps <- probeset(affybatch.example, geneNames(affybatch.example)[1:2])[[1]]
> pps.subtractmm <- pmcorrect.subtractmm(pps)
If no subsetting is desired, we can simply use the AffyBatch object in the correction procedure:
> pmCor.mas <- pmcorrect.mas(affybatch.example)
We can replace the original PM values with the corrected PM values by typing:
> affybatch.example.tmp <- affybatch.example
> pm(affybatch.example.tmp) <- pmCor.mas
Like cDNA arrays, the spot intensities on Affymetrix arrays include variations due to sample preparation, manufacturing of the arrays, and array processing (labeling, hybridization, and scanning). Many researchers have pointed out the need for normalizing Affymetrix arrays. See, for example, Bolstad et al. (2002) and Irizarry et al. (2003a). S+ARRAYANALYZER provides a variety of normalization methods for cell-level data.
Location normalization methods:
• constant
• contrasts
• invariantset
• loess
Scale normalization methods:
• qspline
• quantiles
• quantiles.robust
Chapter 7 Pre-Processing and Normalization
normalize Function
The main function for normalizing AffyBatch objects is normalize. The normalize function accepts AffyBatch objects and returns AffyBatch objects. AffyBatch objects store the experimental information about the probe-level data. Please refer to the affy library documentation splus61 library affy a
53. space after the quote and type BATCH a TextData myjob.txt output.txt error.txt
Chapter 9 Using the S-PLUS Command Line to Analyze Microarray Data
Click OK to close the dialog box and then restart S-PLUS by double-clicking the shortcut. After the batch processing finishes, check the project folder where S-PLUS was working. To find what this directory is, type:
> getenv("S_PROJ")
The file output.txt will contain the text of the commands listed in myjob.txt, and the file error.txt contains error messages for any commands that failed to execute.
3. Fit GENE models
3.1 Remove non-important gene information. Note that you need to load the ArrayAnalyzer module in order to access the function dropUnusedLevels.
> yeastData <- dropUnusedLevels(yeastData[!is.element(yeastData$gene, c("EMPTY", "NORF")), ])
3.2 Create a variable for spot within array.
> yeastData$spotInArray <- factor(paste(as.character(yeastData$spot), as.character(yeastData$array), sep = "in"))
3.3 Create a data object to store the results of the model fitting.
> yeastResults <- data.frame(gName = character(0), comparison = character(0), foldChange = double(0), pValue = double(0), adjp = double(0), signif.p = logical(0))
3.4 Obtain the gene names.
> genes <- levels(yeastData$gene)
3.5 Fit a gene model for each gene and estimate if the pairwise contrasts between experimental conditions
54. t Lmat geneModel varFix Lmat i
# Compute the p-values using a two-sided t-test.
# Correct for 0s in p-values.
tValue <- foldChange / stderr
pValue <- 2 * (1 - pt(abs(tValue), fixDF))
pValue[pValue == 0] <- min(c(1e-17, pValue[pValue > 0]), na.rm = T)
data.frame(gName = genes[index], comparison = comparison, foldChange = foldChange, pValue = pValue, adjp = pValue, signif.p = pValue < 0.5)
} else data.frame(gName = character(0), comparison = character(0), foldChange = double(0), pValue = double(0), adjp = double(0), signif.p = logical(0))
}
> for(index in 1:length(genes)) {
+   cat("Gene", index, genes[index], "\n")
+   yeastResults <- rbind(yeastResults, geneModelFun(genes, index))
+ }
Differential Expression Analysis for Experiments with More than Two Experimental Conditions
Determine the p-value Bonferroni-adjusted cutoff.
> critPValue <- 0.05 / nrow(yeastResults)
> yeastResults$adjp <- yeastResults$pValue * nrow(yeastResults)
> yeastResults$signif.p <- yeastResults$pValue < critPValue
Count the number of significant fold changes and tabulate the significant genes.
> numSignif <- sum(yeastResults$pValue < critPValue, na.rm = T)
> signif.out <- yeastResults[yeastResults$signif.p, ]
In the analysis outlined above, the ten pairwise contrasts between the five treatment levels conserved unconserved rich minimal
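The Bonferroni step in the commands above (a cutoff of 0.05 divided by the number of tests, and adjusted p-values equal to the raw p-value times the number of tests) can be illustrated on its own. The following is a minimal Python sketch rather than S-PLUS, so that it runs standalone; the function name `bonferroni` and the example p-values are hypothetical. One deliberate difference from the commands above is labeled in the code: adjusted values are capped at 1, which the S-PLUS commands do not do.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (cutoff, adjusted p-values, significance flags)."""
    n = len(p_values)
    # Per-test cutoff so that the family-wise error rate is at most alpha.
    cutoff = alpha / n
    # Adjusted p-value: raw p-value times the number of tests.
    # Assumption of this sketch: cap at 1 so the result stays a probability.
    adjusted = [min(1.0, p * n) for p in p_values]
    significant = [p < cutoff for p in p_values]
    return cutoff, adjusted, significant


raw = [0.00001, 0.004, 0.03, 0.2, 0.9]
cutoff, adjusted, significant = bonferroni(raw)
num_signif = sum(significant)
```

Comparing a raw p-value with alpha / n and comparing the adjusted p-value with alpha give the same decision, which is why the commands above can compute the significance flag from either quantity.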
55. the Command line, you need the object name. The default output names for the Multiple Comparison Test and LPE Test dialogs are MultTestSumm and LPESumm, respectively. For an object with FDR set to 0.20 and Benjamini-Hochberg adjustment, the first 5 rows of one object named BHMultTestSumm look as follows:
> BHMultTestSumm[1:5, ]
            gName       mean.0hr     mean.24hr
35704_at    35704_at    9.237412     0.5398517
37023_at    37023_at    8.742761     0.5398517
33532_at    33532_at    [illegible]  10.9035727
37712_g_at  37712_g_at  8.466196     0.5398517
31979_at    31979_at    7.211200     0.5398517
            foldChange  testStat   rawp
35704_at    8.697560    188.0414   0.00002828013
37023_at    8.202909    174.8744   0.00003290512
33532_at    3.125856    3621.5378  0.00003708125
37712_g_at  7.926344    163.3330   0.00004028347
31979_at    6.671348    142.7131   0.00004926020
            adjp       signif.p  Locus.Link  Acc.Num
35704_at    0.1135119  T         11145       X92814
37023_at    0.1135119  T         3936        J02923
33532_at    0.1135119  T         8092        U31986
37712_g_at  0.1135119  T         4208        557212
31979_at    0.1135119  T         5210        D49818
REFERENCES
Benjamini, Y., Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289-300.
Benjamini, Y., Yekutieli, D. (2001). The control of the false discovery rate in multiple hypothesis testing under dependency. Annals of Statistics 29 4
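The five genes in the listing above share one adjusted p-value (0.1135119) even though their raw p-values differ. That pattern is consistent with the usual Benjamini-Hochberg step-up adjustment, in which each adjusted value is the minimum of p * n / rank over the gene's own rank and all larger ranks. The sketch below is a generic Python illustration of that textbook procedure, not the S+ARRAYANALYZER implementation; the function name `bh_adjust` and the example p-values are hypothetical.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, carrying a running minimum so
    # that the adjusted values never decrease as the raw p-values increase.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005, 0.5]
adj = bh_adjust(raw)
# Genes declared significant when the false discovery rate is set to 0.05.
flags = [a <= 0.05 + 1e-12 for a in adj]
```

In this toy example the two smallest raw p-values both receive 0.025 and the next two both receive 0.05, showing how the running minimum produces ties of the kind seen in the listing.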
56. the Save As field, the name swirlMarrayRaw.norm will be generated by default. You can edit the name in this field if you wish. Our example uses the default object name for the normalized expression data.
Normalization
Now set the other options on the right side of the Normalization dialog. The normalization methods are listed in the Normalization drop-down list:
• median
• loess
• twoD
• printTipLoess
• scalePrintTipMAD
• global MAD
• printTipMAD
In this example, select median as the normalization method. Select the Box Plot check box, the MvA check box, and the Before & After radio button for pre- and post-normalization plots. Click OK or Apply to produce the normalized data and create the pre- and post-normalization plots shown in Figures 6.13 and 6.14.
Figure 6.13: MvA plot of normalized data (panel title: After median Normalization).
Figure 6.14: Before and after median normalized swirl data (panels: Before Normalization, After Normalization; samples swirl.1.spot, swirl.2.spot, swirl.3.spot, swirl.4.spot).
Chapter 6 An Example Two Color cDNA Data
DIFFERENTIAL EXPRESSION TESTING
The Options Group
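To make the median option concrete, here is a minimal Python sketch of global median normalization for one two-color array. It assumes the common formulation in which M = log2(red / green) is computed for every spot and the array's median M is subtracted, so that the normalized log ratios are centered at zero. The function name `median_normalize` and the intensities are made up for illustration; the method behind the dialog may differ in details such as background handling or missing values.

```python
from math import log2
from statistics import median


def median_normalize(red, green):
    """Center the log2 ratios of one array at zero by subtracting their median."""
    # M value (log ratio) for each spot.
    m_values = [log2(r / g) for r, g in zip(red, green)]
    center = median(m_values)
    return [m - center for m in m_values]


# Toy intensities whose ratios are exact powers of two.
red = [200.0, 400.0, 800.0, 1600.0, 100.0]
green = [100.0, 100.0, 100.0, 100.0, 100.0]
normalized = median_normalize(red, green)
```

Because a single constant is subtracted, this option shifts the whole MvA plot vertically without changing its shape, whereas intensity-dependent methods such as loess also remove curvature.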
57. the dialog. Remember this name, as it is used in the next step to summarize and/or normalize the expression data. For our example, enter CGAffyBatch as the object name.
Saving the Design
Once you've entered all the information on this tab, you can save it for later use by clicking the Save Design button at the top of the dialog. A .txt file is written to the directory of your choice with the number of factors, number of levels, repetitions, and the full-path file names and their associated factor levels.
Reading Designs
This design file can be reused for another experiment with the same design by modifying the file locations, names, and factor levels as needed. In fact, if you have many chips in your experiment, you can create a file with all the design content and read it with the Read Design button, which will set the reps indicator and fill the file name fields and their associated factor levels.
MIAME Page
MIAME is an acronym for Minimum Information About a Microarray Experiment, and this information can be entered on the second page of the Import Affymetrix Data dialog. This information is not required, but it is used in table output and graphics, so it is to your advantage to complete the information on this page. Once you've entered MIAME information for any experiment, the first three fields are saved and are filled automatically the next time you open this dialog. This di
58. to the command line adds great flexibility to the set of features available through the ArrayAnalyzer GUI and opens the door to additional analyses. The flexibility and feature-rich S-PLUS language make it an ideal platform for exploratory analysis, statistical testing, and modeling of gene expression data. This section is designed to expose you to the critical functions for differential expression testing of microarray data. If you have no interest in running your analyses from the command line, you can skip this section.
The relevant information for a cDNA microarray is:
1. layout of the chip
2. experimental design
3. gene IDs
4. expression intensities
All of this information must be read into S-PLUS and assembled into a single object for further analysis. The primary convenience of importing data through the GUI is the coordination of the following three functions, which read the above information:
read.marrayLayout: This function creates objects of class marrayLayout to store layout parameters for two-color cDNA microarrays.
read.marrayInfo: This function creates objects of class marrayInfo. The marrayInfo class is used to store information regarding the target mRNA samples co-hybridized on the arrays or the spotted probe sequences (e.g., data frame of gene names, annotations, and other identifiers).
read.marrayRaw: This function reads in cDNA microarray data from a directory and creates objects of class marrayRaw from spot quantifica
59. typically many genes represented in a microarray experiment, managing the side effects of multiple statistical tests is important in differential expression testing. Consequently, a number of procedures have been implemented in S+ARRAYANALYZER for controlling family-wise error rate (FWER) and false discovery rate (FDR).
Table 8.1: Errors in statistical testing.
Truth                         | Significant Test                      | Not Significant Test
Differentially Expressed      | S                                     | FN (False Negative, Type II Error)
Not Differentially Expressed  | FP (False Positive, Type I Error)     | NS
Total                         | S + FP                                | NS + FN
Controlling The False Positive Rate
Notation
Suppose the significance level of the test procedure is α and the number of genes being tested is N. A procedure is said to control the family-wise error rate (FWER) if it adjusts the significance level so that the overall error rate is at most α. Without adjusting the significance level, there may be as many as N false positives. For arrays with many genes, the number of false positives without correcting for multiple tests can be quite large.
FWER Procedures
Consequently, a number of procedures have been implemented in S+ARRAYANALYZER for controlling FWER and FDR. The results of the procedures are summarized using adjusted p-values, which reflect for each gene the overall Type I error rate when genes with a smaller p-value are declared diffe
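A short calculation shows why unadjusted testing is a problem at microarray scale. The Python sketch below uses the gene count quoted for the melanoma example in this guide (12,558) and a significance level of 0.05. Two simplifying assumptions are made purely for illustration and are not claims from the guide: no gene is truly differentially expressed, and the tests are independent. The function names are hypothetical.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


def fwer_if_independent(n_tests, per_test_alpha):
    """Probability of at least one false positive among independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


n_genes = 12558  # gene count quoted for the melanoma example
alpha = 0.05

# Testing every gene at level alpha with no adjustment.
unadjusted_fp = expected_false_positives(n_genes, alpha)
unadjusted_fwer = fwer_if_independent(n_genes, alpha)

# Bonferroni: test every gene at level alpha / N instead.
bonferroni_level = alpha / n_genes
bonferroni_fwer = fwer_if_independent(n_genes, bonferroni_level)
```

Under these assumptions the unadjusted analysis expects roughly 628 false positives and is almost certain to produce at least one, while the Bonferroni level keeps the family-wise error rate just under 0.05.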
60. INDEX
A: Amigo; annotate library; ANOVA
B: BATCH; BH; Bonferroni; Box Plot
C: cDNA normalization, median; Chromosome plot
F: FDR; FWER
G: GenBank
H: heat map
L: LocusLink
M: medianIQR; MIAME; MvA Plot
O: Object Explorer
P: paired t; platforms, supported
Q: QQ Norm Plot
R: requirements, system
S: supported platforms; system requirements
T: technical support
V: Variance plot; volcano plot
W: Wilcoxon test
61. Figure 9.1: Imported data from Figure 3a of Alizadeh et al. (2000).
Note that this is not the actual raw data, but rather data as summarized by Cluster (Eisen et al. 1998) and prepared for viewing in TreeView (Eisen et al. 1998). We treat it as raw data to show the cluster methods in S-PLUS, but the resulting output should not be directly compared with Figure 3a of Alizadeh et al. (2000).
We then standardize this data frame, calculate the distances between points in the column and row spaces, and fit the hierarchical cluster models. Note that since the data are normalized with mean zero and variance one prior to calling dist, the resulting matrix is equivalent to a dissimilarity matrix produced using cor.
> module(ArrayAnalyzer)
> fileName <- file.path(getenv("SHOME"), "module", "ArrayAnalyzer", "examples", "figure3a.cdt")
> mat3a <- importData(fileName, rowNamesCol = 1, colNameRow = 1, drop = c(2, 4), startRow = 3, type = "ASCII")
> stand.norm <- function(x) (x - mean(x, na.rm = T)) / sqrt(var(x, na.method = "available"))
> aliz.cmat <- apply(mat3a, 1, stand.norm)
# cluster rows
> aliz.dist1 <- dist(t(aliz.cmat))
> aliz.hclust1 <- hclust dist al
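The equivalence noted above between Euclidean distance on standardized data and a correlation-based dissimilarity can be checked numerically. For two vectors of length n, each standardized to mean zero and sample variance one, the squared Euclidean distance equals 2(n - 1)(1 - r), where r is their Pearson correlation, so ordering pairs by distance is the same as ordering them by 1 - r. The Python sketch below verifies the identity on made-up vectors; it assumes complete data, whereas the S-PLUS commands above also handle missing values, and the helper names are illustrative.

```python
from statistics import mean, stdev


def standardize(x):
    """Scale a vector to mean zero and sample variance one."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]


def pearson(x, y):
    """Pearson correlation computed from the standardized vectors."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


def squared_distance(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 9.0]
n = len(x)

d2 = squared_distance(standardize(x), standardize(y))
r = pearson(x, y)
identity_rhs = 2 * (n - 1) * (1 - r)
```

The identity follows because each standardized vector has a sum of squares equal to n - 1 and the cross-product of the two standardized vectors equals (n - 1) r.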
62. 032 11
Parmigiani, G., Garrett, E.S., Irizarry, R.A., and Zeger, S.L. (2003). The analysis of gene expression data: an overview of methods and software. In The Analysis of Gene Expression Data: Methods and Software, edited by G. Parmigiani, E.S. Garrett, R.A. Irizarry, and S.L. Zeger. Springer, New York.
S-PLUS 2000 Guide to Statistics, Volume 1. Data Analysis Products Division, MathSoft, Seattle, WA.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., and Speed, T.P. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30(4).
Yang, Y.H., Dudoit, S., Luu, P., and Speed, T.P. (2001). Normalization for cDNA microarray data. In M.L. Bittner, Y. Chen, A.N. Dorsel, and E.R. Dougherty, editors, Microarrays: Optical Technologies and Informatics, volume 4266 of Proceedings of SPIE, May 2001.
Yang, Y.H., Dudoit, S. (2003). Bioconductor's marrayNorm Package. Bioconductor marrayNorm library documentation, January 23, 2003, p. 3.
DIFFERENTIAL EXPRESSION TESTING
Introduction
Statistical Tests
  Within Gene Two Sample Comparisons
  Local Pooled Error Test
  Raw P Values
Controlling Type I Error Rates
  Controlling The False Positive Rate
GUI for Multiple Comparisons Testing
  Multiple Comparisons Testing Dialog Input
GUI for LPE Testing
  LPE Tes
63. 0KB in size in a browser, with mouseover metadata showing gene and sample information and expression intensity for each spot on the set of arrays. Genes are shown as rows and samples as columns. By clicking on a particular spot, the gene's accession number is sent to the NCBI UniGene database, and annotation information regarding that gene is returned in a lower browser frame. A screen shot showing the heat map, dendrogram, and analysis of the gene TNFRSF7 is shown in Figure 9.5.
Figure 9.5: Heat map and dendrogram for DLBCL samples corresponding to Figure 3a of Alizadeh et al. (2000). The gene highlighted is TNFRSF7 (tumor necrosis factor receptor superfamily, member 7). The heat map and dendrogram are drawn using an S-PLUS graphlet produced using the java.graph device in S-PLUS 6.1. This features interactive metadata in the top right hand corner o
64. 14: The LPE Test dialog.
To set up the Local Pooled Error Test dialog, follow these steps:
1. In the Show data of type field, select Affymetrix.
2. In the Data field, select CGExprSet.rma.
3. The Chip Name field should be automatically updated to hgu95av2.
4. Enter LPESumm in the Save Summary As field for saving the test result object.
Options
The Options group allows you to set the family-wise error rate (FWER) or the false discovery rate (FDR) to control the overall Type I error rate (false positive rate), based on adjusting individual test p-values to account for multiple tests. In our melanoma example, there are 12,558 genes, so the increase in Type I error is substantial without adjusting the p-values. There are many options for adjusting the p-values to achieve the FWER. We describe them in more detail in Differential Expression Testing. Here we leave the default setting as Bonferroni.
There are four options in the Graph Options group:
1. Volcano plot
2. Heat map
3. Chromosome plot
4. Variance plots
In addition, a summary table is output to the graphlet with the 10 most differentially expressed genes.
Top 10 Summary Table
The first page of the output is the Top 10 gene list. Included in the table are the raw p-value, the adjusted p-value, and the fold change. When the output is to HTML and annotation data is available for the chip, eac
65. Figure 4.2: The Import Affymetrix Data dialog.
Importing Data
File Selection Page
The Import Affymetrix Data dialog has three pages:
• File Selection: This page must be completed in order to create a data object for continued analysis.
• MIAME: Completing this page is optional but highly recommended, because information on the MIAME tab is used for labeling tables and graphs.
• Variable Selection & Filtering: This page has default settings depending on the type of data files (e.g., MAS4 or MAS5) you select.
Before we can begin to associate data files with experimental conditions, we need to set up the experimental conditions in S+ARRAYANALYZER. We start by setting the replications in the experiment.
Setting the Replications (Reps)
The swimming mice experiment has six replicates. Click the up arrow or enter the number in the field so that six replicates show, then click the Reset Grid button. This generates the experimental design points that populate the grid in the center of the dialog. You see the grid
66. Figure 9.8: PubMed articles for the first gene identified in the gene filtering analysis described above, i.e., Gene M58459, Human ribosomal protein (RPS4Y) isoform mRNA.
Note that in searching the Gene Ontology databases (http://www.geneontology.org), we need to cut and paste the IDs obtained from the hgu95aGO mel gnames vector into a GO search engine, e.g., Amigo (http://www.godatabase.org/cgi-bin/go.cgi). If the Advanced Query option on the Amigo page is chosen, multiple GO accession ID numbers can be entered at one time into the search field. In this case, each gene is placed on a separate line with a carriage return in between. For example, the four GO IDs for Gene M58459, Human ribosomal protein (RPS4Y) isoform mRNA, are shown below and pasted into Amigo, with results shown in the Amigo search page in Figure 9.9.
GO:0003735
GO:0003723
GO:0006412
GO:0005843
Annotation of Microarray Data using S-PLUS
67. 22 genes, so the Type I error is substantial without adjusting the p-values. You can choose the type of test you want to perform: Welch's t for unequal variance (the default), an equal variance t, or the nonparametric Wilcoxon test. For the swimming mice example, leave the setting as the default t-test (Welch's). There are many options for adjusting the p-values to achieve the FWER. Here we leave the default setting as Bonferroni.
There are three options in the Graph Options group:
1. Volcano plot
2. Heat map
3. Normal Quantile-Quantile plot (QQ Norm plot)
Note that chromosome plots are not available for chips other than hgu95a.
Figure 4.14 displays the volcano plot with Bonferroni FWER correction. Most of the p-values for the genes have been adjusted to one. Let's try a less conservative adjustment procedure. We'll use the Benjamini and Hochberg FDR procedure, which maintains a small percentage of false positives amongst only those genes which are significant. Select BH from the Adjustment drop-down list. The resulting volcano plot is displayed in Figure 4.15. There are about 50 significant genes in the plot resulting from the BH correction, compared to 6 for the Bonferroni correction. Even with an 8-fold increase in significant genes, the BH correction maintains a low false positive rate of 5% amongst the significant genes. This translates to, on average, about 2-3 genes not really differentially expressed amongst those genes tagg
68. Chapter 7 Pre-Processing and Normalization
  Summarization Methods
  Normalization Methods for Affymetrix MAS Data
  Normalization Methods
  Diagnostic Plots for Summarized Affymetrix Data
  References
INTRODUCTION
Before differential expression testing can be performed, microarray data must be pre-processed and normalized. Pre-processing refers to the process of correcting the measured spot intensities for background signal and non-specific binding and, for probe-level data, summarizing the multiply cloned gene expression measurements into one expression measure. As described in Parmigiani et al. (2003), data from microarrays are subject to many sources of extraneous variability, including manufacturing, preparation of mRNA from experimental samples, hybridization, scanning, and imaging. These sources of variability are often called technical sources of variability. The removal and balancing of extraneous technical variability before analysis allows for more confident interpretation of the estimated differential expression effects as true differential expression and not a result of systematic experimental artifacts.
Pre-processing of probe-level data primarily involves summarizing data from probe sets into a single measure per gene transcript. The Affymetrix MAS software provi
69. 4 94 swirl.4.spot wild type
  experiment Cy5   date        comments
1 wild type        2001/9/20   NA
2 swirl            2001/9/20   NA
3 wild type        2001/11/8   NA
4 swirl            2001/11/8   NA
Number of labels: 4
Dimensions of maInfo matrix: 4 rows by 6 columns
Notes:
C:\PROGRAM FILES\INSIGHTFUL\splus61\module\ArrayAnalyzer\examples\SwirlSample.txt
C) Summary statistics for log-ratio distribution:
              Min.   1st Qu.  Median  Mean   3rd Qu.      Max.
swirl.1.spot  -2.74  -0.79    -0.58   -0.48  -0.29        4.42
swirl.2.spot  -2.72  -0.15    0.03    0.03   [illegible]  2.39
swirl.3.spot  -2.29  -0.75    -0.46   -0.42  -0.12        2.65
swirl.4.spot  -3.21  -0.46    -0.26   -0.27  -0.06        2.90
D) Notes on intensity data:
We can extract the controls with the subsetting method for marrayRaw objects as follows:
#### Extract controls
> swirl.raw.controls <- swirl.raw[controls, ]
> swirl.raw <- swirl.raw[!controls, ]
Comparative plots are produced with maPlot and maBoxplot. We can create the plots either within print groups for a single chip or for all chips, disregarding print-tip groups.
#### Boxplots of controls vs. noncontrols by print-tip group
> graphsheet()
> par(mfrow = c(1, 2))
> maBoxplot(swirl.raw.controls[, 3], main = "Controls by Print Tip Group", srt = 90)
> maBoxplot(swirl.raw[, 3], main = "Non-controls by Print Tip Group", srt = 90)
Figure 6.19 displays the resulting graph (panel titles: Controls by Print Tip, Non-controls by Print Tip).
70. Figure 9.16: The interactive trellised volcano plots for the 227 significant genes, trellising on the 10 contrasts (horizontal axis: foldChange). The mouse is over the single significant gene from the contrasts between strains within the rich media. The Tooltip metadata shows the gene name, fold change, and log10 adjusted p-value. The plot can be interactively changed as described in the text.
In closing, we note that many complex designs can be analyzed in the mixed model formulation described above. This includes time course experiments and multi-factorial experiments. Additional random effects may be included to model additional error strata (e.g., technical and biological replicates). Also, more general covariance structures can be readily accommodated in this modeling framework (e.g., random coefficient models and correlated error structures). The contrast structure is likewise rich and can be exploited to capture information regarding expression patterns between levels of experimental factors and test related hypotheses. For example, with drug treatments in known pathways, expected effects of inhibition and efficacy can be set up as contrasts, and genes whose expression patterns match such contrasts can be accurately identified. All of this depends, of course, on the microarray experiment having enough replicates
71. Figure 8.18: The first few rows of the gene list summary table generated by the Multiple Comparisons Test dialog.
Open the S-PLUS Object Explorer by clicking the Object Explorer tool bar button displayed in Figure 8.19.
Figure 8.19: S-PLUS Object Explorer tool bar button.
Under the Data tree in the Object Explorer, double-click the summary object of your choice. See Figure 8.20.
Figure 8.20: The Object Explorer in S-PLUS allows you to browse your data files and open them in a grid for viewing (the example shows the contents of BHMultTestSumm).
From the Command Line
To access the gene list from
72. A simple sequentially rejective multiple test procedure. Scand. J. Statist. 6, 65-70 (1979).
Kaufman, L., Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, New York.
Kerr, M.K., Churchill, G.A. (2001). Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences USA 98, 8961-8965.
Kerr, M.K., Churchill, G.A. (2001b). Statistical design and the analysis of gene expression microarray data. Genetical Research 77, 123-128.
Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M., Pergamenschikov, A., Lee, J.C., Lashkari, D., Shalon, D., Myers, T.G., Weinstein, J.N., Botstein, D., Brown, P.O. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics 24(3), 227-35.
Scherf, U., Ross, D.T., Waltham, M., Smith, L.H., Lee, J.K., Kohn, K.W., Reinhold, W.C., Myers, T.G., Andrews, D.T., Scudiero, D.A., Eisen, M.B., Sausville, E.A., Pommier, Y., Botstein, D., Brown, P.O., Weinstein, J.N. (2000). A cDNA microarray gene expression database for the molecular pharmacology of cancer. Nature Genetics 24(3), 236-244.
Storey, J.D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B 64, 479-498.
Sudarsanam, P., Vishwanath, R.I., Brown, P.O., and Winston, F. (2000). Whole genome expression analysis of snf swi m
73. ALYSIS WORKFLOW
The entire process of analyzing differential expression for custom cDNA arrays can be done through the S+ARRAYANALYZER menu and dialogs. To obtain differential expression test results from two-color cDNA microarray data, we go through four steps:
1. Importing and filtering the data
2. Adjusting for background noise
3. Normalizing the data
4. Performing differential expression analysis
The cDNA data we use is summarized (means and medians) across all spots with identical probes. When we import the data, we specify the background intensity value as a preliminary to normalization and testing.
Swirl cDNA Data Set
In this chapter, we examine two-color microarray data from a developmental biology experiment. The data are included with the Bioconductor distribution and were originally provided by Katrin Wuennenberg-Stapleton from the Ngai Lab at UC Berkeley. The experiment was designed to study the early development of vertebrates using zebra fish as a model organism. Zebra fish embryos from two genetic strains were used: a swirl mutant and a normal wild type. The goal was to identify genes with differential expression between the two strains. Refer to the swirl help file for more details.
The experiment consisted of two sets of dye-swap experiments, resulting in a total of four replicates. Each pair of experiments swapped the color labels between the swirl and wild-type samples. Table 6
74. ARRAYANALYZER Interface
  Import Data
  AffyMetrix Expression Summary
  Normalization
  Differential Expression Analysis
  Annotation
Chapter 4 An Example Affymetrix MAS Data
  Affymetrix Data Analysis Workflow
  Importing Data
  Normalization
  Differential Expression Testing
  From the Command Line
  References
Chapter 5 An Example Affymetrix Probe Level Data
  Affymetrix Probe Level Data Analysis Workflow
  Importing Data
  Normalization of Probe Level Data
  Expression Summaries
  Differential Expression Testing
  References
Chapter 6 An Example Two Color cDNA Data
  cDNA Data Analysis Workflow
  Importing Data
  Normalization
  Differential Expression Testing
  Annotation
  From The Command Line
Chapter 7 Pre-Processing and Normalization
  Introduction
  Normalization
  Ideas in Normalization
  Diagnostic Plots
  Normalization Methods for cDNA Data
  Pre-Processing And Normalization for Affymetrix Probe Level Data
  Normalization Methods for Affymetrix MAS Data
  References
Chapter 8 Differential Expression Testing
  Introduction
  Statistical Tests
  Controlling Type I Error Rates
  GUI for Multiple Comparisons Testing
  GUI for LPE Testing
  Differential Expression Analysis Plots
  Differential Expression Summary Table Output
  References
Chapter 9 Using the S
75. Channel1Fun and sudarsanamChannel2Fun to define aspects of the data needed for the read.

1. Read in data

1.1 Create helper functions

> sudarsanamArrayFun <- function(file, prefix, suffix)
    match(file, c("snf2ypda.txt", "snf2ypdc.txt", "snf2ypdd.txt",
                  "snf2mina.txt", "snf2minc.txt", "snf2mind.txt",
                  "swi1ypda.txt", "swi1ypdc.txt", "swi1ypdd.txt",
                  "swi1mina.txt", "swi1minc.txt", "swi1mind.txt"))
> sudarsanamChannel1Fun <- function(array)
    ifelse(array <= 3, "snf2rich",
      ifelse(array <= 6, "snf2mini",
        ifelse(array <= 9, "swi1rich", "swi1mini")))
> sudarsanamChannel2Fun <- function(array)
    rep("wildtype", length(array))

1.2 Set the data file location variable (dataFileLocation) to the directory where the files have been downloaded

> dataFileLocation <- "data/sudarsanam"

1.3 Read data. Note that you need to load the ArrayAnalyzer module in order to access the function readScanAlyzeData.

> yeastData <- readScanAlyzeData(path = dataFileLocation,
    prefix = "", suffix = "",
    arrayFUN = sudarsanamArrayFun,
    channel1FUN = sudarsanamChannel1Fun,
    channel2FUN = sudarsanamChannel2Fun)

1.4 Remove rows with missing values in log intensity

> yeastData <- dropUnusedLevels(yeastData[!is.na(yeastData$logi), ])

Now that the data are read in, we fit a simple normalization
76. Figure 6.2: The Import cDNA Data dialog.

File Selection Page

The Import cDNA Data dialog has three pages:

• File Selection: This page must be completed in order to create a data object for continued analysis.
• MIAME: Completing this page is optional but highly recommended, because information on the MIAME page is used for labeling tables and graphs.
• Variable Selection & Filtering: Red and green foreground colors, and optionally background colors, must be selected on this page.

Before we can begin to associate data files with experimental conditions, we need to set up the experimental conditions in S+ARRAYANALYZER. We start by setting the replications in the experiment.

Setting the Replications (Reps)

The swirl experiment has four replicates, which is the default setting for the Reps indicator. If your experimental design has more or fewer than four replications, click the up and down arrows to match the number of replicates you have, and then click the Reset Grid button. This generates the experimental design points that populate the grid in the center of the dialog. You see the grid control change to include factor levels (e.g., A1, A2) repeated as many times as there are replicates. Note that this dialog assumes two-color cDNA arrays with dye swapping. Hence, the factor level associated with the dye color alternates with each replicate. If your experi
77. Document. Affymetrix, Santa Clara, CA.

Affymetrix (2001). Affymetrix MicroArray Suite Version 5.0 User's Guide. Santa Clara, CA.

Bolstad, B.M. (2002a). Comparing the effects of background, normalization and summarization on gene expression estimates. Unpublished manuscript.

Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2002). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19(2):185-193.

Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368):829-836.

Dudoit, S. and Yang, Y.H. (2003). Bioconductor R packages for exploratory analysis and normalization of cDNA microarray data. In The Analysis of Gene Expression Data: Methods and Software, G. Parmigiani, E.S. Garrett, R.A. Irizarry, and S.L. Zeger, editors. Springer, New York.

Dudoit, S., Yang, Y.H., Callow, M.J., and Speed, T.P. (2002). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Statistica Sinica 12(1):111-139.

Fox, J.W., Dragulev, B., Fox, N., Mauch, C., and Nischt, R. (2001). Identification of ADAM9 in human melanoma: Expression, regulation by matrix and role in cell-cell adhesion. Proceedings of International Proteolysis Society Meeting.

Irizarry, R.A., Gautier, L., and Cope, L.P. (2003a). Analysis of Affymetrix Probe-level Data
78. The p-values have been sorted from smallest to largest, so printing the first 10 rows prints the 10 most statistically significant differentially expressed genes. It is worth noting that fold change values less than two in absolute value can be significant if their standard errors are relatively small. In the top 10 list of this experiment, there are two genes with very significant differential expression but with absolute fold change less than two. These genes would not have made the cut using a straight fold change approach to gene discovery.

REFERENCES

Fox, J.W., Dragulev, B., Fox, N., Mauch, C., and Nischt, R. (2001). Identification of ADAM9 in human melanoma: Expression, regulation by matrix and role in cell-cell adhesion. Proceedings of International Proteolysis Society Meeting.

Lee, J.K. and O'Connell, M. (2003). An S-PLUS Library for the Analysis of Differential Expression. To appear in The Analysis of Gene Expression Data: Methods and Software, edited by G. Parmigiani, E.S. Garrett, R.A. Irizarry, and S.L. Zeger. Springer, New York.

AN EXAMPLE AFFYMETRIX PROBE LEVEL DATA

Affymetrix Probe Level Data Analysis Workflow
  Melanoma Probe Level Data Set
Importing Data
  Import Affymetrix Data Dialog
Normalization of Probe Level Data
  Normalization Dialog
Expression Summaries
  RMA Summary
  RMA Output
  Logging Expression Intensities
D
79. For these procedures, you can also set the maximum number of permutations and a random seed for reproducibility and testing.

FWER and FDR

The Options group allows you to set the family-wise error rate (FWER) or the false discovery rate (FDR) to maintain an overall Type I error rate (false positive rate) by adjusting individual test p-values to account for multiple tests. In our swirl mutant example there are 8448 genes, so the increase in Type I error is substantial without adjusting the p-values.

You can choose the type of test you want to perform: paired t (the default for two-color arrays), Welch's t for unequal variance, an equal-variance t, and the nonparametric Wilcoxon test. For the swirl example, leave the setting as the default paired t-test.

The Alternative Hypothesis drop-down list provides three options:

1. Greater than
2. Less than
3. Not equal (default)

These hypotheses refer to the alternative to the null hypothesis for the statistical tests, which is that there is no differential expression. Significant differential expression for any given gene means that factor level 1 (A1) is greater than, less than, or not equal to factor level 2 (A2).

There are many options for adjusting the p-values to achieve the FWER. We describe them in more detail in the Differential Expression Testing chapter. Typically we would start with the default Bonferroni procedure, but here we select the BH (Benjamini-Hochberg) procedure instead, which is less conservative
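To make the difference between the two adjustments concrete, the following is a minimal Python sketch (not S-PLUS, and not the code used by the Multiple Comparisons Test dialog) of the textbook Bonferroni and Benjamini-Hochberg adjustments. The function names and the five p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjustment (textbook form)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.20]  # hypothetical raw p-values
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
for p, b, h in zip(raw, bonf, bh):
    print(p, round(b, 5), round(h, 5))
```

In this toy example every BH-adjusted value is no larger than its Bonferroni counterpart, which is the sense in which BH is less conservative. With 8448 genes the gap is usually much wider.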
80. Figure 3.10: Annotation page from LocusLink providing information on differentially expressed genes.

AN EXAMPLE AFFYMETRIX MAS DATA

Affymetrix Data Analysis Workflow
  Swimming Mice MAS Data Set
Importing Data
  Import Affymetrix Data Dialog
Normalization
  Normalization Dialog
Differential Expression Testing
  Multiple Comparisons Test Dialog
From The Command Line
  Importing Data
  Data Manipulation
  Normalization
  Differential Expression Testing
References

AFFYMETRIX DATA ANALYSIS WORKFLOW

Swimming Mice MAS Data Set

The entire process of analyzing gene expression data with Affymetrix MAS 4/5 or CEL file data can be done through the S+ARRAYANALYZER menu and dialogs. To obtain differential expression information from probe-level (CEL file) microarray data, we perform the following five steps:

1. Importing and filtering the data
2. Adjusting for background noise
3. Summarizing the data
4. Normalizing the data
5. Differential expression analysis

MAS 4/5 data has already been corrected for background noise and summarized, so we can skip steps 2 and 3. However, if chips have been analyzed together with MAS 4/5 software, only simple normalization has
81. Insightful S+ARRAYANALYZER 1.1 User's Guide
April 2003
Insightful Corporation, Seattle, Washington

Proprietary Notice

Insightful Corporation owns the S+ARRAYANALYZER software program and its documentation. Both the program and documentation are copyrighted with all rights reserved by Insightful Corporation. S+ARRAYANALYZER provides access to the Bioconductor R packages for microarray analysis, which are free software. The affy, annodata, annotate, biobase, edd, geneplotter, genefilter, LPETest, marrayClasses, marrayInput, marrayNorm, marrayPlots, multtest, and ROC libraries are copyrighted 2003 by Insightful Corporation. These libraries are free software that are redistributed and modified under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, version 2.1 of the License. The S+ARRAYANALYZER software is covered by a separate license agreement.

The correct bibliographical reference for this document is as follows: S+ARRAYANALYZER 1.1 User's Guide, Insightful Corporation, Seattle, WA. Printed in the United States.

Copyright Notice

Copyright 1987-2003, Insightful Corporation. All rights reserved. Insightful Corporation, 1700 Westlake Avenue N, Suite 500, Seattle, WA 98109-3044, USA.

Trademarks

S-PLUS is a registered trademark, and StatServer, S-PLUS Analytic Server, S+ARRAYANALYZER, S+FINMETRICS, S+SDK, S+SPATIALSTATS, S+DOX, S+GARCH, S+SEQTRIAL, and S+WAVELET
82. MAS 5.0

To obtain a summary similar to MAS 5.0, use:

> eset <- expresso(affybatch.example, normalize = FALSE,
    bgcorrect.method = "mas", pmcorrect.method = "mas",
    summary.method = "mas")
> eset <- affy.scalevalue.exprSet(eset)

Notice that in this case we normalize after we obtain summarized expression measures. The function affy.scalevalue.exprSet performs a normalization similar to that described in the MAS 5.0 manual (see section affy.scalevalue.exprSet on page 172). This is a simple global scaling in which the user enters a target (TGT) value. The average signal across all probes is calculated for each chip, and a scale factor (SF) is determined for each chip such that chip mean × SF = TGT. Thus, the signals on each chip are scaled by a single number for each chip, a crude form of normalization.

Li and Wong (2001) MBEI

To obtain a probe-level normalized summary similar to Li and Wong's MBEI, one can use the following. This is computationally intensive.

> eset <- expresso(Dilution, normalize.method = "invariantset",
    bg.correct = FALSE, pmcorrect.method = "pmonly",
    summary.method = "liwong")

This gives the current PM-only default. The reduced model (previous default) can be obtained using pmcorrect.method = "subtractmm".

RMA method of Irizarry et al. (2002)

The RMA method of Irizarry et al. (2002) can be obtained using expresso as follows:

> eset <- expresso(affybatch.example, normalize.method = "qu
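The global scaling relation chip mean × SF = TGT can be illustrated with a few lines of Python. This is only a sketch of the arithmetic as described in the text, using a plain mean. The actual affy.scalevalue.exprSet function may compute the chip average differently (for example with trimming), and the chip names and signal values below are made up.

```python
def scale_to_target(chips, target):
    """Scale each chip by its own factor SF so that chip mean * SF == target."""
    scaled = {}
    for name, signals in chips.items():
        chip_mean = sum(signals) / len(signals)
        scale_factor = target / chip_mean  # SF = TGT / chip mean
        scaled[name] = [s * scale_factor for s in signals]
    return scaled


# Hypothetical summarized signals for two chips.
chips = {
    "chipA": [100.0, 200.0, 300.0],  # mean 200, so SF = 2.5
    "chipB": [40.0, 100.0, 160.0],   # mean 100, so SF = 5.0
}
scaled = scale_to_target(chips, target=500.0)
for name, signals in scaled.items():
    print(name, signals, sum(signals) / len(signals))
```

After scaling, both chips have the same mean signal, but nothing has been done about differences in spread or shape, which is why the text calls this a crude form of normalization.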
83. S+ARRAYANALYZER stores the CDF information in individual data libraries, and it also stores annotation information for Affymetrix chips in individual libraries. This annotation information is used in S+ARRAYANALYZER when you click a data point in a graph, as these data points are linked to genetic databases (e.g., LocusLink and UniGene) on the Web. The CDF and annotation information for Affymetrix chips is available on the S+ARRAYANALYZER CD under DataLibs. An up-to-date collection of libraries can be found on the S+ARRAYANALYZER Web site, http://www.insightful.com/support/ArrayAnalyzer. Click the data libraries link on the right side of the page.

In order to compute expression summaries and/or normalization of Affymetrix probe-level data, you need to have the Affymetrix CDF information available. In R, this CDF information is stored in an R environment; in S-PLUS, the information is stored in a named list. In S+ARRAYANALYZER, the CDF library that matches the Affymetrix chip you are analyzing must be loaded. The name of each library is the chip name in all lowercase letters, with no hyphens or underscores, and with the suffix cdf added. For example, if you are working with mgu74av2 chips, you need to have the library mgu74av2cdf available.

Loading a CDF library

Each CDF library contains a single named list, and the name of the list is the same as the name of the library. The list
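The library naming rule described above (lowercase, no hyphens or underscores, suffix cdf) is simple enough to sketch in Python. The helper name cdf_library_name is hypothetical and is not part of S+ARRAYANALYZER, and the chip name spellings in the examples are assumed inputs.

```python
def cdf_library_name(chip_name):
    """Derive a CDF library name: lowercase, drop hyphens/underscores, append 'cdf'."""
    cleaned = chip_name.lower().replace("-", "").replace("_", "")
    return cleaned + "cdf"


# Example chip names as they might appear in a CEL file header (assumed spellings).
for chip in ["MG_U74Av2", "HG-U133A", "HG_U95Av2"]:
    print(chip, "->", cdf_library_name(chip))
```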
84. S are trademarks of Insightful Corporation. S and New S are trademarks of Lucent Technologies, Inc. Intel is a registered trademark, and Pentium a trademark, of Intel Corporation. Microsoft, Windows, MS-DOS, and Excel are registered trademarks, and Windows NT is a trademark, of Microsoft Corporation. Other brand and product names referred to are trademarks or registered trademarks of their respective owners.

ACKNOWLEDGMENTS

The Insightful ArrayAnalyzer uses Bioconductor packages that represent state-of-the-art work from a collection of leading statisticians. Insightful would like to recognize these contributors:

• affy: Rafael A. Irizarry, Laurent Gautier, and Leslie M. Cope
• AnnBuilder: Jianhua Zhang
• annotate: Robert Gentleman
• Biobase: Robert Gentleman and Vincent Carey
• edd: Vincent Carey
• genefilter: Robert Gentleman and Vincent Carey
• geneplotter: Robert Gentleman
• marrayNorm: Sandrine Dudoit, Yee Hwa (Jean) Yang
• marrayClasses: Sandrine Dudoit, Yee Hwa (Jean) Yang
• marrayInput: Sandrine Dudoit, Yee Hwa (Jean) Yang
• marrayPlots: Sandrine Dudoit, Yee Hwa (Jean) Yang
• multtest: Yongchao Ge, Sandrine Dudoit
• rhdf5: Byron Ellis, Robert Gentleman
• ROC: Vincent Carey

CONTENTS

Acknowledgments

Chapter 1 Welcome to S+ARRAYANALYZER
  Welcome
  Supported Platforms and System Requirements

Chapter 2 Introduction To Microarray Data
  Genomics and Differential Expression
  Microarray Data

Chapter 3 GUI Overview
  The S
85. SION SUMMARY

If you work with Affymetrix probe-level data (CEL files), you can apply various normalization and correction methods to the raw intensities and then summarize the intensities across all the spots for a given probe before proceeding to differential expression testing. The summarization dialog provides numerous options for normalization, background correction, PM-MM correction, and summarization of the expression intensities.

Figure 3.6: The Affymetrix Expression Summary dialog allows you to normalize for systematic biases in the measurements of the raw probe-level intensities, correct for background noise and differences in PM and MM spots, and summarize expression intensities across all the spots for a given probe.

NORMALIZATION

Regardless of whether you work with Affymetrix probe-level data, Affymetrix MAS 4/5 data, or data from custom cDNA microarrays, you will likely apply normalization methods to eliminate systematic measurement errors and biases from the expression intensity measurements. This step is accomplished through the Normalization dialog.
86. The Data Group

Show data of type
The Normalization dialog requires you to select the type of data you are working with. Click the drop-down button on the Show Data of Type field and select one of the choices. For the melanoma example, select Affymetrix Summary.

Figure 4.10: Selecting the data type for normalization.

Data
Click the drop-down button to the right of the Data field and select the expression object created during the import step, MouseSwimExprSet.

Save As
Enter the object name for saving the normalized expression data in the Save As field. By default, this is set to MouseSwimExprSet.norm, the name of the object you select in the Data field with norm attached as a suffix.

Normalization
In the Normalization group, set the Normalization field to medianIQR, select the MvA Plot check box, select the Box Plot check box, and click the radio button to select Before & After for pre- and post-normalization boxplots. The normalization procedures for MAS 4/5 summary data are described in greater detail in Chapter 4, An Example Affymetrix MAS Data. For this example, we select the default setting, medianIQR, which adjusts the location and scale of the data so exp
87. The Multiple Comparisons Test dialog provides traditional testing methods (e.g., paired t-test, t-test, Wilcoxon test) with a host of correction methods to control the family-wise error rate and false discovery rate (Type I error). To set up the Multiple Comparisons Test dialog, follow these steps:

1. In the Show data of type field, select cDNA.
2. In the Data field, select swirlMarrayRaw.norm.
3. The Chip Name field should be automatically updated to <undetermined>.
4. Enter MultTestSumm in the Save Summary As field to name the saved test result object.

Figure 6.15: The Multiple Comparisons Test dialog for the swirlMarrayRaw expression object.

The Options group allows you to specify the statistical test and alternative hypothesis, a procedure for controlling the family-wise error rate (FWER) or the false discovery rate (FDR), and the maximum error rate. Additionally, certain procedures estimate statistical tests and p-values by permutation sampling
88. U K

> options(contrasts = c(factor = "contr.SAS", ordered = "contr.poly"))

2.2.a Linear mixed effects normalization model

> normalizationModelLME <- lme(fixed = logi ~ strain,
    data = yeastData, random = ~ 1 | array/strain,
    method = "REML")

Estimates from the linear mixed effects model (lme) are provided below. This model provides a simple global scaling normalization and is formulated and parameterized similarly to that described in Wolfinger et al. (2001). A faster way of fitting the same model in S-PLUS is to use the varcomp function. Code and output from this model are also provided below. In either case, the result from this step is the residuals from the normalization model. These are used as input to the gene expression model, also fit by lme, as described below.

Variance Component Estimates

> VarCorr(normalizationModelLME)
            Variance   StdDev
array =     pdSymm(1)
(Intercept) 1.86985604 1.3674268
strain =    pdSymm(1)
(Intercept) 0.02984196 0.1727483
Residual    4.22194776 2.0547379

2.2.b Variance components normalization model (is faster)

First, specify array as a random component.

> is.random(yeastData)
array gene name strain
    F    F    F      F
> is.random(yeastData) <- c(T, F, F, F)
> is.random(yeastData)
array gene name strain
    T    F    F      F
> normalizationModelVarComp <- varcomp(logi ~ strain + array + array:strain, data = yea
89. Figure 6.3: Selecting the factor level names.

Selecting Files

To associate data files with the design points, right-click the uppermost File Name field and select Browse for File. You can find the swirl example data by navigating to your splus61\module\ArrayAnalyzer\examples directory, selecting SPOT (*.spot) as the Files of type, and selecting swirl.1.spot. Repeat for the other three spot files, entering one file per cell, as shown in Figure 6.4.

Figure 6.4: Selecting files for import.

Saving the Data Object

Enter swirlMarrayRaw as the object name in the Save As field in the lower right corner of the dialog. Remember this name, as it is used for normalizing the expression data. The object resulting from the import step is of class marrayRaw.
90. USID mel gnames
> locuslinkByID(llnames)

In this example, two genes were identified in the filtering, and the LocusLink information was obtained using the S-PLUS function locuslinkByID. This returns the Web page shown in Figure 9.6. Note that the View field in LocusLink is populated with the two genes identified in the analysis.

Figure 9.6: LocusLink annotation for two genes identified in the gene filtering analysis described above. Note that the View field
91. a dialog has three pages:

• File Selection: This page must be completed in order to create a data object for continued analysis.
• MIAME: Completing this page is optional but highly recommended, because information on the MIAME page is used for labeling tables and graphs.
• Variable Selection & Filtering: This page has default settings depending on the type of data (e.g., MAS4 or MAS5) you select.

Before associating data files with experimental conditions, we set up the experimental conditions in S+ARRAYANALYZER. We start by setting the replications in the experiment.

Setting the Replications (Reps)

The melanoma experiment has two replicates, which is the default setting for the Reps indicator. If your experimental data does not have two replicates, click the up and down arrows (or enter the number in the field) to match the number of replicates you have, and then click the Reset Grid button. This generates the experimental design points that populate the grid in the center of the dialog. You see the grid control Factor1 column change to include factor levels (e.g., A1, A2) repeated as many times as there are replications. For this experiment, there are two replicates, so you can leave the Reps field at the default setting.

Setting the Factor Levels

You set the factor level names by clicking one of the factor level fields in the right column of the file association box (Factor1) and typing in the new name. Note tha
92. alog is shown in Figure 5.6.

Figure 5.6: Entering experiment information in the MIAME page. The page has fields for Experimenter's Name, Laboratory, Contact Information, Experiment Title, Experiment Description, and Existing Notes. The Experiment Description in the example reads: MM5 melanoma cell line in which a gel matrix that simulates the in vivo cellular condition and progression of melanoma was added for 0 and 24 hours later (Fox et al., 2001). This simple experimental design thus involved one factor, matrix condition, at two levels, 0 and 24.

Variable Selection & Filtering Page

The third page in the Import Affymetrix Data dialog is for variable and row selection for Affymetrix MAS data. It is inactive and not used when importing probe-level data. Ignore this page for CEL data.

Once all the pages of the Import Affymetrix Data dialog have been completed, click OK. The data is imported and ready for use in S+ARRAYANALYZER.

NORMALIZATION OF PROBE LEVEL DATA

Normalization Dialog

Normalization procedures may be applied to both raw probe set intensities and to summarized expression intensities. For examples of normalizing expression summary data, see An Example Affymetrix MAS Data and the Normalization chapter. In this sectio
93. is a smoothing method for summarizing multivariate data using general curves and surfaces. The smoothing is achieved by fitting a linear or quadratic function of the predictor variables locally to the response data (M in this case). The loess procedure fits polynomials over contiguous subsets (intervals in one dimension) of the predictor values using iteratively weighted least squares. Yang and Dudoit (2003) write that in the context of microarray experiments, robust local regression allows us to capture the non-linear dependence of the intensity log ratio M on the overall average intensity A, while at the same time ensuring that computed normalization values are not driven by a small number of differentially expressed genes with extreme log ratios.

Normalizing to Many Points

Expression data may vary between chips not only in the median of the data but also in its spread around that median value. Variability may be due to such things as scanning settings and different concentrations of mRNA across slides. In order to compare expression across slides, these extraneous effects must be minimized. The spread of the data in some range can be scaled to be the same between groups by specifying that the data between groups match at more than one point. For example, we could specify that the IQR of the data be the same. This requires that the data be scaled so that the spread of the m
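As a rough illustration of matching chips at more than one point, the Python sketch below shifts and rescales each chip so that its median and interquartile range hit common targets. It is a sketch under stated assumptions, not the medianIQR procedure implemented in S+ARRAYANALYZER: the quantile rule (linear interpolation), the function names, and the data are all choices made here for illustration.

```python
import statistics


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list (0 <= q <= 1)."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def median_iqr_normalize(chip, target_median, target_iqr):
    """Shift and scale one chip so its median and IQR equal the targets."""
    ordered = sorted(chip)
    med = statistics.median(ordered)
    iqr = quantile(ordered, 0.75) - quantile(ordered, 0.25)
    return [target_median + (x - med) * target_iqr / iqr for x in chip]


# Hypothetical logged intensities: chip_b is more spread out than chip_a.
chip_a = [6.0, 7.0, 8.0, 9.0, 10.0]
chip_b = [5.0, 7.5, 10.0, 12.5, 15.0]
norm_a = median_iqr_normalize(chip_a, target_median=8.0, target_iqr=2.0)
norm_b = median_iqr_normalize(chip_b, target_median=8.0, target_iqr=2.0)
print(norm_a)
print(norm_b)
```

In this contrived example the two chips differ only in location and scale, so they coincide after normalization. Real chips would agree at the median and quartiles but still differ elsewhere.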
94. Figure 9.12: Volcano plots of genes trellised on the contrasts. This plot allows the user to examine each contrast separately and to identify genes that show significant differential expression for a given contrast of interest.

Figure 9.13: Volcano plot of genes for just the significant contrasts (plot title: The 227 Significant Comparisons; horizontal axis: log2 fold change). These 227 genes are explored further below.

Note that these plots can also be created interactively from the S-PLUS GUI. We illustrate this with the data frame containing the significant genes for each of the contrasts, as held in the object signif.out created above. Simply click this object in the Object Explorer to view the data frame. Create an additional column by right-clicking any column and selecting Insert Column. Use the expression -log10(adjp) for the new column. Now highlight the two columns foldChange and the new column (we have called this log10adjp). You can highlight two columns by holding down the CTRL key. Then click the scatter plot in the 2-D graph palette (top left-hand corner). Results from this are shown below.
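The coordinates of a volcano plot are easy to compute by hand. The Python sketch below assumes the usual convention that the vertical axis is the negative base-10 logarithm of the adjusted p-value. The names foldChange and adjp follow the column names mentioned in the text, while the function name, the 0.05 cutoff, and the values are made up for illustration.

```python
import math


def volcano_points(fold_changes, adj_pvalues, alpha=0.05):
    """Pair each log2 fold change with -log10(adjusted p) and flag significance."""
    points = []
    for fc, p in zip(fold_changes, adj_pvalues):
        points.append({
            "foldChange": fc,
            "neg_log10_adjp": -math.log10(p),
            "significant": p < alpha,  # alpha is an assumed cutoff
        })
    return points


# Hypothetical values for three genes.
points = volcano_points([2.1, -0.3, -1.8], [0.001, 0.5, 0.01])
for pt in points:
    print(pt)
```

Genes with small adjusted p-values end up high on the plot, and genes with large fold changes in either direction end up far to the left or right, which gives the plot its characteristic shape.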
95. antiles", bgcorrect.method = "rma", pmcorrect.method = "pmonly",
    summary.method = "medianpolish")

Equivalently, the rma function can be used, and is faster for this series of operations:

> eset <- rma(affybatch.example)

NORMALIZATION METHODS FOR AFFYMETRIX MAS DATA

Affymetrix data typically arrive as DAT, CEL, and CHP files. The DAT files contain the raw images as processed by the scanner. The CEL files contain expression measures for each individual probe on the chip; analysis of these probe-level data is described in Chapter 5, An Example Affymetrix Probe Level Data, and in this chapter in section Pre-Processing And Normalization For Affymetrix Probe Level Data. The CHP files contain summaries of the individual probe-level data for each gene transcript. Analysis of these summarized data is described in this section. These data have been background adjusted and summarized into a single expression value per gene transcript using the Affymetrix MAS software.

Affymetrix version 5.0 software has adjusted the probe-level intensity values as follows:

• Global background signal and noise have been subtracted and thresholded as described in the Affymetrix Statistical Algorithms Description Document (SADD), available from Affymetrix. This method is also described in section mas on page 159.
• The 11-20
96. ary contains a single named list; the name of the list is the same as the name of the library. The list contains the locations on the chip for the perfect match and mismatch probes. S+ARRAYANALYZER functions need to access this named list when doing probe-level operations. If the list is not available, S+ARRAYANALYZER attempts to load the library; if it cannot find the library, an error occurs.

S+ARRAYANALYZER includes three of these named lists in the S+ArrayAnalyzer affy library: hgu95acdf, hgu95av2cdf, and hgu133acdf. If you are working with these chips (hgu95a, hgu95av2, or hgu133a), then you do not need to do anything; the S+ARRAYANALYZER functions that operate on the CEL data will find the named lists. The CDF information for other Affymetrix chips is available on the S+ARRAYANALYZER CD under DataLibs\CDFLibs. There is a zip file for each chip, and each zip file unpacks to create a library. Please refer to the Appendix, S+ARRAYANALYZER Data Libraries, for more details on loading additional CDF libraries.

Both MvA plots and box plots are available from the expression summary and normalization dialogs under the S+ARRAYANALYZER menu. Affymetrix data uses one treatment condition per chip. Comparisons can be made between treatments and within treatments.

To compare expression within treatment conditions, the intensity log ratio M and the average intensity A are commo
97. as NCImelanoma

Add the corrected PM values back into the melanoma object:

> pm(NCImelanoma) <- tmp

The default method for normalize is quantiles. In this next example, we subset the AffyBatch object by treatment (0 and 24 hours), normalize each subset, then merge the objects into a single normalized AffyBatch object.

> mel.norm.quantiles <- merge(normalize(NCImelanoma[1:2]),
    normalize(NCImelanoma[3:4]))

We can normalize each replicate set to the median of one of the chips by typing:

> mel.norm.constant <- merge(
    normalize(NCImelanoma[1:2], method = "constant"),
    normalize(NCImelanoma[3:4], method = "constant"))

Arguments to the normalization methods can be passed through normalize as optional arguments. This example normalizes all four chips in the melanoma experiment:

> mel.norm.loess <- normalize(NCImelanoma, method = "loess", span = .5)

The corrections and normalization can be done in one step using the expresso function. This function also summarizes the probe-level data. The resulting object is of class exprSet.

> melanoma.exprSet <- expresso(NCImelanoma,
    bgcorrect.method = "mas", pmcorrect.method = "mas",
    normalize.method = "constant", summary.method = "mas")

Summarization Methods

Affymetrix and some other high-density oligonucleotide arrays include multiple spots per gene transcript. In Affymetrix arra
98. as used by Alizadeh et al. (2000). The columns of our heat map are thus in approximately reverse order to those presented in Alizadeh et al. (2000). Also note that individuals within nodes are paired by their original order in S-PLUS, while this ordering is at random in the package Cluster (Eisen et al., 1998). Further, Alizadeh et al. (2000) use a weighting function that is not well documented in the Cluster (Eisen et al., 1998) manual.

Figure 9.2: Heat map and dendrogram based on data from Alizadeh et al. (2000). Note that this is not the actual raw data, but rather data as summarized by Cluster (Eisen et al., 1998) and prepared for viewing in TreeView (Eisen et al., 1998). We treat it as raw data to show the cluster methods in S-PLUS, but the resulting output should not be directly compared with Figure 3a of Alizadeh et al. (2000).

Since Alizadeh et al. (2000) were interested in identifying two specific subpopulations within the DLBCL samples, they may have used a partitioning clustering method. We use the partitioning around medoids method, pam. This analysis provides some evidence for the existence of two subpopulations, rather than three, four, or five subpopulations, based on average silhouette width. The average silhouette width for two subpopulations is 0.19, compared to 0.15 and 0.08 for three and
99. ata set is also available in the swirl help file; type help(swirl).
> maBoxplot(swirl[, 3], main = "Swirl array 93 pre-normalized")
Figure 7.3: Box plot for chip 93 of swirl dataset before it is normalized. [Plot title: Swirl array 93 pre-normalized; x-axis: PrintTip, groups (1,1) through (4,4).]
Plots for cDNA data are available from the command line using the functions in Table 7.1. Following are examples of how to use these functions. Please refer to the marrayPlots library pdf (splus61\library\marrayPlots\marrayPlots.pdf) and the marrayPlots library help files for detailed descriptions of the function arguments. The section Notes For Command Line Users on page 144 discusses the meaning of the x, y, and z parameters used in the plotting functions.
Table 7.1: Plotting functions available for cDNA data.
Plotting Function | Default Parameter Settings | Description
maPlot | x = "maA", y = "maM", z = "maPrintTip" | Produces scatter plots of microarray spot statistics for the classes marrayRaw, marrayNorm, and marrayTwo. Creates plot for first chip given.
maBoxplot | x = "maPrintTip", y = "maM" | Produces box plots of microarray spot statistics for the classes marrayRaw, marrayNorm, and marrayTwo. Plots b
100. background noise. 2. Normalization of corrected PM probes using quantile normalization (Bolstad et al., 2002). 3. Calculation of expression measures using median polish.
This sequence of steps is available by simply checking the RMA checkbox in the upper right corner of the Affymetrix Expression Summary dialog. Open the dialog by selecting Affymetrix Expression Summary from the ArrayAnalyzer menu. Then select the CGAffyBatch object in the CEL Data drop-down list and check the RMA Composite checkbox. The result of the computation is an expression summary object. We set the name to be CGExprSet.rma by typing it into the Save As field.
RMA Output
A sequence of graphs is produced as output by the RMA procedure. Figure 5.10 displays the M vs A plot for the two samples taken at 0 hours. The value 0.0338 in the lower left panel is the inter-quartile range (IQR) of the values of M across all summarized expression values. A small value indicates there is little difference on the log2 scale for the middle 50% of the expression values for the two chips. For replicate chips there is no real differential expression, so the IQR is expected to be small.
[Dialog: Affymetrix Expression Summary. CEL Data: CGAffyBatch; RMA Composite checked; Save As: CGExprSet.rma; Normalization: quantiles; Graphics: MvA Plot, Box Plot.]
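The IQR reading described above can be sketched numerically. The Python below is an illustration only, not S-PLUS code. The helper names m_and_a and iqr, the quartile convention (the "inclusive" method of the Python standard library), and the toy log2 values are assumptions, so the numbers will not reproduce the 0.0338 shown in the figure.

```python
import statistics


def m_and_a(log2_chip1, log2_chip2):
    """Return M (difference) and A (average) of two logged intensity vectors."""
    m = [x - y for x, y in zip(log2_chip1, log2_chip2)]
    a = [(x + y) / 2 for x, y in zip(log2_chip1, log2_chip2)]
    return m, a


def iqr(values):
    """Inter-quartile range: 75th percentile minus 25th percentile."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical log2 expression summaries for two replicate chips.
chip1 = [8.0, 9.0, 10.0, 11.0, 12.0]
chip2 = [8.1, 8.9, 10.0, 11.2, 11.9]

m_values, a_values = m_and_a(chip1, chip2)
m_iqr = iqr(m_values)  # small for well-behaved replicates
```

A small m_iqr says that the middle half of the genes differ very little between the two chips, which is the expectation for replicates.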
101. bases and displays the gene identifier in the upper right corner of the plot. Left-clicking in a colored rectangle exposes the menu for making an annotation database choice. The left and top margins of the graph contain a dendrogram resulting from applying hierarchical clustering to the expression intensity values and treatment conditions, respectively.
Figure 8.12: Heat map plot for differentially expressed genes. This graphlet may be displayed through a Web browser or an S-PLUS Java graphics device. Pixels colored red signify positive expression values; those colored green signify negative expression values. The brighter the color, the larger the intensity in absolute value. [Graphlet readout: Gene 41126_at, Exp Value 1.02; annotation menu: Accession Number, LocusLink; gene list: Significant Genes.]
Chromosome Plot
A chromosome plot displays the human genome for Affymetrix's HG U95A chip. Differential expression is marked for up-regulation and down-regulation for each gene represented on the chip. The top 10 differentially expressed genes are highlighted with color (orange) to indicate their location on the chromosome. Hovering the mouse over one of the colored active points displays the gene ID in the upper right-hand corner of the graph, as shown in Figure 8.13. [Graph Window showing gene ID 37420_i_at.]
102. c practices. Fundamental to the S-PLUS environment is the S programming language. The S language is the reason that virtually all new statistical methodology is developed in S-PLUS, or its freeware clone R, before any other environment. By taking advantage of the S-PLUS programming environment, you can extend the capabilities currently offered through the GUI. Some typical examples of extensions that our customers count on are:
1. Create script files for repeated use, batch processing, or simply documenting your analysis. Loading data, normalization, and summarization can all be done as a BATCH process in S-PLUS.
2. Develop general programs (functions) which extend the functionality of S+ARRAYANALYZER, or even S-PLUS for that matter. You will see an example of creating a function for combining a heat map of differential expression with a hierarchical clustering of the most differentially expressed genes.
3. Data manipulation at the command line is richer than through the GUI. The GUI may not have implemented everything you might ever want to do with microarray data. We will show you an example, in a few lines of S-PLUS code, of how to create new annotation objects for custom microarrays so the graphics generated during differential expression testing are hyperlinked to annotation databases.
4. By working at the command line you can keep up with the latest trends in microarray analysis by coding them yourself or by getting them from
103.                chip label   File Name
0 weeks   1         s01          s01.txt
0 weeks   2         s02          s02.txt
0 weeks   3         s03          s03.txt
0 weeks   4         s04          s04.txt
0 weeks   5         s05          s05.txt
0 weeks   6         s06          s06.txt
4 weeks   1         s4w1         s4w1.txt
4 weeks   2         s4w2         s4w2.txt
4 weeks   3         s4w3         s4w3.txt
4 weeks   4         s4w4         s4w4.txt
4 weeks   5         s4w5         s4w5.txt
4 weeks   6         s4w6         s4w6.txt
These data have been obtained from the CardioGenomics PGA Public Data Web site, located at http://cardiogenomics.med.harvard.edu/public-data, and are used here for the purpose of this example only. The data are available for free public download but have also been included with the distribution of S+ARRAYANALYZER.
IMPORTING DATA
To import Affymetrix data, from the main S-PLUS menu select ArrayAnalyzer > Import Data > From Affymetrix.
Figure 4.1: Menu selection to import Affymetrix data.
Import Affymetrix Data Dialog
Figure 4.2 shows the Import Affymetrix Data dialog with the File Selection page displayed. The primary task of the import process associates data files with experimental conditions and selects the variable columns that are used in subsequent analysis.
Dialog: Import Affymetrix Data; pages File Selection, MIAME, Variable Selection & Filtering; group Associate Files with Design Points; Single Factor
104. colleagues or collaborative projects such as Bioconductor. This is exactly what we have done in implementing S+ARRAYANALYZER. In addition to the examples in this chapter, we also provide command line examples for each of the example chapters: Chapter 4, An Example: Affymetrix MAS Data, and Chapter 6, An Example: Two-Color cDNA Data. You can find the command line example scripts by navigating to your splus61\module\ArrayAnalyzer\examples directory and selecting one of the .ssc files located there.
CLUSTERING MICROARRAY DATA USING S-PLUS
Clustering approaches have been widely applied to the analysis of gene expression data (Eisen et al., 1998; Scherf et al., 2000). In particular, the method of visualizing gene expression data based on cluster order, or cluster image map analysis using hierarchical clustering, has been found to be an efficient approach for summarizing thousands of gene expression values and assisting in the identification of interesting gene expression patterns. Partitioning clustering methods such as K-means are used to identify candidate subgroups in experiments involving multiple samples and/or experimental conditions. Both hierarchical and partitioning clustering have been used, for example, in the identification of novel sub-types of cancers. Additional gene information is also extremely useful for discovering meaningful clustering patterns. Pri
105. control Factor1 column change to include factor levels (e.g., A1, A2) repeated as many times as there are replications.
Setting the Factor Levels
You set the factor level names by clicking one of the factor level fields in the right column of the file association box (Factor1) and typing in the new name. Note that changing the factor level name in one place changes all the factor level names with the same name. Enter 0weeks for the s01.txt file and enter 4weeks for the s4w1.txt (the seventh row) file, as shown in Figure 4.3.
Figure 4.3: Setting the factor levels to 0weeks and 4weeks. [Dialog: Import Affymetrix Data, File Selection page; Single Factor 2 Level Design; columns File Name ("Type filename or right-click to browse") and Factor1; Chip Name: mySet.]
Selecting Files
To associate data files with the design points, right-click a Filename field and then click the Browse for File button, as in Figure 4.4.
106. correction is p 1 Ge ppi. All genes with adjusted p-value $\tilde{p}$ less than $\alpha$ are significant, with an overall FWER of at most $\alpha$.
minP
The Westfall and Young (1993) minP step-down procedure is computed as
$$\tilde{p}_{r_j} = \max_{k = 1, \ldots, j} \left\{ \Pr\left( \min_{l \in \{r_k, \ldots, r_N\}} P_l \le p_{r_k} \,\middle|\, H_0 \right) \right\}.$$
For each $p_{r_j}$, $\tilde{p}_{r_j}$ is the resampling-based probability of obtaining a p-value no larger than $p_{r_j}$ from simulated probability distributions generated by the decreasing sets $\{p_{r_1}, \ldots, p_{r_N}\}, \{p_{r_2}, \ldots, p_{r_N}\}, \ldots, \{p_{r_j}, \ldots, p_{r_N}\}$.
Chapter 8 Differential Expression Testing
maxT
The Westfall and Young (1993) maxT step-down procedure is computed as
$$\tilde{p}_{r_j} = \max_{k = 1, \ldots, j} \left\{ \Pr\left( \max_{l \in \{r_k, \ldots, r_N\}} |T_l| \ge |t_{r_k}| \,\middle|\, H_0 \right) \right\}.$$
For each $p_{r_j}$, $\tilde{p}_{r_j}$ is the resampling-based probability of obtaining a test statistic at least as large as $|t_{r_j}|$ from simulated distributions of the test statistics generated by the decreasing sets $\{t_{r_1}, \ldots, t_{r_N}\}, \{t_{r_2}, \ldots, t_{r_N}\}, \ldots, \{t_{r_j}, \ldots, t_{r_N}\}$.
The minP and maxT procedures are only available for the permute versions of the test procedures: t.permute, t.equalvar.permute, and wilcoxon.permute. Furthermore, the permute versions of the test statistics only have access to these two procedures for p-value adjustment. The other adjustment procedures are implemented for all the non-permutation testing procedures described in the section Statistical Tests.
The results of the procedures are summarized using adjusted p-values, which reflect for each gene the overall experiment Type I error rate when gen
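The maxT step-down idea above can be sketched as follows. This is a Python illustration under stated assumptions, not the S+ARRAYANALYZER implementation: the function name maxt_adjusted_pvalues is hypothetical, the permutation statistics are supplied by the caller as a plain list of lists, and the simple proportion count/B is used where production code may treat the observed labeling as one of the permutations.

```python
def maxt_adjusted_pvalues(observed, permuted):
    """Step-down maxT adjusted p-values.

    observed: list of N test statistics, one per gene.
    permuted: list of B lists of length N, the statistics recomputed
              after permuting the treatment labels.
    """
    n_genes = len(observed)
    # r_1, ..., r_N: gene indices ordered by decreasing |t|.
    order = sorted(range(n_genes), key=lambda g: abs(observed[g]), reverse=True)

    counts = [0] * n_genes
    for perm in permuted:
        running_max = 0.0
        # Walk from the least to the most significant gene so that
        # running_max is the maximum |T| over the set {r_k, ..., r_N}.
        for k in range(n_genes - 1, -1, -1):
            running_max = max(running_max, abs(perm[order[k]]))
            if running_max >= abs(observed[order[k]]):
                counts[k] += 1

    raw = [count / len(permuted) for count in counts]

    # Enforce monotonicity: take the maximum over k = 1, ..., j.
    adjusted = [0.0] * n_genes
    running = 0.0
    for k, gene in enumerate(order):
        running = max(running, raw[k])
        adjusted[gene] = running
    return adjusted


# Hypothetical statistics for three genes and four label permutations.
observed_t = [4.0, 1.0, 2.5]
permuted_t = [
    [0.5, 1.2, 0.3],
    [3.0, 0.2, 1.0],
    [1.0, 0.8, 2.6],
    [0.1, 0.4, 0.9],
]
adjusted_p = maxt_adjusted_pvalues(observed_t, permuted_t)
```

With so few permutations the adjusted p-values are very coarse. The example is only meant to show the successive-maxima bookkeeping and the final monotonicity step.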
107. croarray, which may occupy one or more slides. At each spot, one gene (or an active segment of a gene) is represented tens of thousands of times by cloning it and fixing all the duplicates to the spot on the slide.
Figure 2.2: Microarray experiments produce gene expression images like the one pictured here. These images must be converted to numbers (the quantification step) before analysis can proceed. Scanners like those from GenePix and Agilent produce raw intensity data files, which form the starting point for differential expression analysis in S+ARRAYANALYZER.
A gene expression experiment entails washing microarrays with concentrated cellular material and quantifying how much cellular substance binds to the gene spots. Lots of binding at a spot indicates that gene is active in the cell; that is, the gene is being expressed in that cell or tissue. Knowing which genes are being expressed or not expressed, and how that expression changes under different experimental conditions, is of great importance in functional genomics and in developing new diagnostics, therapeutics, or treatment strategies.
Chapter 2 Introduction To Microarray Data
S+ARRAYANALYZER is designed to work with data from different commercial microarrays. In particular, it works with data from Affymetrix microarrays and from custom cDNA microarrays available through several suppliers. We describe in more detail the differences between these two ba
108. cross the spots within or between arrays, and can vary according to overall spot intensity, location on the array, plate origin, and possibly other variables. Some causes of the imbalances may be the following:
• Labeling efficiencies and scanning properties of the Cy3 and Cy5 dyes
• Amounts of Cy3- and Cy5-labeled mRNA
• Scanning parameters, such as PMT settings
• Print-tip, spatial, and plate effects
The GUI performs default-setting normalization for a batch of arrays. The GUI includes the methods listed in Table 7.3 and Table 7.4. Please refer to the section Normalization for examples showing how to use the GUI to produce diagnostic plots and normalize cDNA data.
The normalization and plotting functions for cDNA data make heavy use of the accessor methods for the different marray classes. The input parameters to the functions are labeled x, y, and z. Each function uses the x, y, and z parameters differently; refer to the help files for specifics. In general, these parameters give the accessor methods for the marrayRaw or marrayNorm class objects. These accessor methods are then used to extract the desired information from the data object for use in the normalization computation or plotting function. Useful methods are maM and maA, for obtaining the intensity log-ratios and average log-intensities, respectively, and the maPrintTip method, which computes the print-tip grid coordinates for the spots. maPrintTip and maPlate are used to strat
109. d = 3)
abline(v = -1, col = 3, lwd = 3)
abline(v = 1, col = 3, lwd = 3)
}, critPValue = critPValue, xlab = "log2 fold change", ylab = "log10 p", main = list("All Pairwise Comparisons For Each Gene", cex = 1.5)))
All pairwise comparisons in separate plot:
> print(xyplot(log10(pValue) ~ foldChange | comparison, data = yeastResults,
    panel = function(x, y, critPValue) {
      panel.xyplot(x, y)
      abline(h = log10(critPValue), col = 4, lwd = 1)
      abline(v = -1, col = 3, lwd = 1)
      abline(v = 1, col = 3, lwd = 1)
    }, critPValue = critPValue, layout = c(5, 2), xlab = "log2 fold change", ylab = "log10 p", main = list("All Pairwise Comparisons For Each Gene", cex = 1.5)))
> print(xyplot(log10(pValue) ~ foldChange, data = yeastResults, subset = signif.p,
    panel = function(x, y, critPValue) {
      panel.xyplot(x, y)
      abline(v = -1, col = 3, lwd = 3)
      abline(v = 1, col = 3, lwd = 3)
    }, xlab = "log2 fold change", ylab = "log10 p", main = list(paste("The", numSignif, "Significant Comparisons"), cex = 1.5)))
Figure 9.11: Volcano plot of all genes for all contrasts. [Plot title: All Pairwise Comparisons For Each Gene; axes: log10 p versus log2 fold change.]
[Plot title: All Pairwise Comparisons For Each Gene; axes: log10 p versus log2 fold ch
110. d a Data folder. 3. In S-PLUS, before working with your mgu74a files, type:
> library(mgu74acdf)
Note that you can attach the CDF library once and make a local copy of the CDF named list. As long as that named list is in your path when working with that CEL data, you don't need to attach the CDF library.
Example 2: Get the CDF library mgu74acdf and install it in directory C:\CDFLIB.
> library(mgu74acdf, lib.loc = "C:\\CDFLIB")
Make a local copy of the mgu74a CDF list so you don't need to attach the library any more:
> mgu74acdf <- mgu74acdf
Annotation Libraries
The annotation libraries contain named lists of annotation information for various genome databases. Each chip has its own annotation library (1), and the name of each library is the chip name in all lower-case letters with no hyphens or underscores and the suffix AnnoData added on. Within each library are named lists; the names of these lists are the chip name with a suffix related to the annotation data. Table A.1 shows the annotation data that is available in each library.
Table A.1: Information contained in an annotation library.
Suffix | Description
ACCNUM | GenBank accession number
SYMBOL | The symbol used for gene reports
GENENAME | The gene description used for genes
UNIGENE | UniGene cluster ids
1. In some cases, chips can share annotation information. For example, the chips hgu95a and hgu95av2 can use the same annotation
111. d the background noise levels may not be consistent over the chip. Background correction aims to quantify and subtract this background signal from the expression intensities. S+ARRAYANALYZER provides two methods through the function bg.correct for correcting Affymetrix probe-level chips for background signal and inconsistencies. The available background correct methods can be obtained by typing:
> bgcorrect.methods
[1] "mas" "none" "rma" "rma2"
rma
The rma background adjustment assumes the PMs are a convolution of the normal and exponential distributions. According to Bolstad (2002a), we can write this as $O = S + N$, where $N$ is the background and $S$ is the signal. It is assumed that $S$ is distributed $\exp(\alpha)$ and $N$ is distributed $N(\mu, \sigma^2)$. The background-corrected PM values returned for each chip in the object are then $E(s \mid O = o)$. This expectation is equal to
$$a + b \, \frac{\phi(a/b) - \phi\big((o - a)/b\big)}{\Phi(a/b) + \Phi\big((o - a)/b\big) - 1},$$
where $a = o - \mu - \sigma^2 \alpha$, $b = \sigma$, and $\phi$ and $\Phi$ are the normal density and cumulative distribution function, respectively.
Caution: The rma method adjusts the PM values but leaves the MM values intact. This is problematic if a PM correction is done after the background correction using MM values, which have not been background corrected.
mas
The mas background method performs the noise correction described in the Statistical Algorithms Description Document (SADD), a white paper from Affymetrix
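The conditional expectation in the rma description can be evaluated directly once the three model parameters are known. The Python sketch below is an illustration, not the bg.correct implementation. It assumes the standard form of the expectation with a = o - mu - sigma^2 * alpha and b = sigma, takes mu, sigma, and alpha as given rather than estimating them from the chip, and uses hypothetical parameter values and function names.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def rma_background_adjust(o, mu, sigma, alpha):
    """E(S | O = o) for O = S + N, S ~ exp(alpha), N ~ N(mu, sigma^2)."""
    a = o - mu - sigma ** 2 * alpha
    b = sigma
    numerator = normal_pdf(a / b) - normal_pdf((o - a) / b)
    denominator = normal_cdf(a / b) + normal_cdf((o - a) / b) - 1.0
    return a + b * numerator / denominator


# Hypothetical parameters: background mean 100, sd 20, signal rate 0.01.
bright = rma_background_adjust(500.0, mu=100.0, sigma=20.0, alpha=0.01)
dim = rma_background_adjust(90.0, mu=100.0, sigma=20.0, alpha=0.01)
```

For a bright probe the adjustment is close to subtracting a constant (here about 104). For a probe below the background mean the corrected value stays positive, which is the main practical difference from a plain subtraction.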
112. de traditional testing methods (e.g., Student's t-test, Welch's t-test, Wilcoxon test) with a host of correction methods to control the Type I error rate. For more details on the Multiple Comparisons Test dialog, as well as the testing procedures and Type I error correction procedures, see Chapter 8, Differential Expression Testing.
Test Dialog
To set up the Multiple Comparisons Test dialog, follow these steps:
1. In the Show Data of Type field, select Affymetrix.
2. In the Data field, select MouseSwimExprSet.norm.
3. The Chip Name field should be automatically updated to mgu74av2.
4. Enter MultTestSumm in the Save Summary As field for saving the test result object.
Figure 4.13: The Multiple Comparisons Test dialog. [Dialog groups: Data (Show Data of Type, Data, Chip Name), Graph Options (Volcano Plot, Heat Map, Chromosome Plot), Options (FWER, Test, Alt. Hypothesis: Not equal, Adjustment: Bonferroni), Output (Display Output in S-PLUS, Save Output as HTML, Save Summary As: MultTestSumm).]
Options
The Options group allows you to set the family-wise error rate (FWER) and false discovery rate (FDR) to control the overall Type I error (false positive) rate, based on adjusting individual test p-values to account for multiple tests. In our swimming mice example there are 12 4
113. des a way of doing this based on a one-step Tukey biweight procedure. Other approaches, including the MBEI method of Li and Wong (2001) and the RMA method of Irizarry et al. (2003b), have been shown to provide improved extraction of biological information from probe-level data (Irizarry et al., 2002, 2003b). This chapter addresses the variety of methods available in S+ARRAYANALYZER for correcting, normalizing, and summarizing probe-level microarray data. The S-PLUS script splus61\module\ArrayAnalyzer\examples\scripts\normalization_chapter.ssc includes the example code presented in this chapter.
1. Affymetrix GeneChip microarrays represent each gene with an oligonucleotide 25-mer probe spotted at typically 16-20 pairs of spots (32-40 spots in all). Each probe pair consists of a spot for the probe, called a perfect match (PM), and a spot for a slight alteration of the probe, called a mismatch (MM). The collection of the PM and MM spots for a specific gene is called a probe set.
NORMALIZATION
Many factors can modify spot intensity other than differential gene expression, and each type of microarray chip has its own inherent systematic variations that need to be taken into account. Normalization can be thought of as a series of corrections for these systematic effects. In general, normalization is needed to ensure tha
114. ding that is occurring in the experiment. The available methods can be obtained by typing:
> pmcorrect.methods
[1] "mas" "pmonly" "subtractmm"
These are the PM correction methods performed by Affymetrix MAS 4.0 (subtractmm) and MAS 5.0 (mas) software. subtractmm simply returns the difference between the PM and MM intensity values. This can lead to negative values for the intensity. pmonly simply returns the PM intensity values from the ProbeSet PM slot. The mas correction allows for the possibility that the MM intensity is larger than the PM intensity for a particular probe pair within a probe set. The mas method is described in the Affymetrix Statistical Algorithms Description Document (SADD), available from Affymetrix.
1. A ProbeSet object is the collection of cloned PM and MM spot intensities for one gene.
An Example With pmcorrect
The input to the pmcorrect functions can be either a ProbeSet or AffyBatch object, and the return value is a matrix of corrected PM values for each chip in the input object. An object of class ProbeSet contains the PM and MM data for a probe set from one or more samples. ProbeSet objects can be created by applying the method probeset to instances of AffyBatch. We illustrate the procedure using the example data affybatch.example in the affy library data directory. This data set gives a subset of the values
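The two simplest PM corrections described above reduce to element-wise arithmetic on a probe set. The following Python sketch is an illustration only. The function names mirror the method names but are hypothetical helpers, not the S-PLUS functions, and the mas correction is deliberately omitted because its details are in the SADD rather than in this text.

```python
def pmcorrect_subtractmm(pm, mm):
    """Return PM minus MM for each probe pair (may be negative)."""
    return [p - m for p, m in zip(pm, mm)]


def pmcorrect_pmonly(pm, mm):
    """Return the PM intensities unchanged; MM is ignored."""
    return list(pm)


# Hypothetical intensities for one probe set with three probe pairs.
pm_values = [120.0, 80.0, 200.0]
mm_values = [100.0, 95.0, 50.0]

diff = pmcorrect_subtractmm(pm_values, mm_values)
pm_only = pmcorrect_pmonly(pm_values, mm_values)
```

The second probe pair shows the problem mentioned in the text: when MM exceeds PM, the difference is negative and cannot be logged without further treatment.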
115. e 9.10: Tree view of the results of searching the Gene Ontology (GO) site with GO ID for the first gene identified in the gene filtering analysis described above (i.e., Gene M58459, Human ribosomal protein RPS4Y isoform mRNA).
DIFFERENTIAL EXPRESSION ANALYSIS FOR EXPERIMENTS WITH MORE THAN TWO EXPERIMENTAL CONDITIONS
The S+ARRAYANALYZER GUI provides simple workflows for identification of genes that are differentially expressed across two experimental conditions. Similar analyses of experiments with multiple conditions may be done at the S-PLUS command line. Examples include time course experiments, where conditions are multiple time points, and factorial designs, where the effects of multiple factors, interactions, and contrasts between factors and levels are explored simultaneously. Standard ideas from analysis of variance and mixed models can be used in this context to incorporate randomness in the spots and in the arrays. These models have been suggested by Churchill and coworkers (e.g., Kerr and Churchill, 2001b) and by Wolfinger et al. (2001), who fit these models using SAS.
We illustrate the analysis of experiments with multiple conditions with a cDNA two-channel microarray experiment with a 2x2 factorial control structure. The experiment was designed to explore differential expression of two strains of yeast mutants deleted for a gene encoding one conse
116. e for the number of samples and compares the resulting silhouette plots. One can then select the number of clusters yielding the highest average silhouette width. If the highest average silhouette width is small (e.g., below 0.2), one may conclude that no substantial structure has been found.
In cancer diagnostics there is considerable interest in subpopulations of cancer tissue samples. For example, distinct subpopulations identified within the collection of samples may have different etiologies and may be candidates for different clinical interventions. Alizadeh et al. (2000) characterized variability in gene expression among tumors in lymphoma patients using a customized cDNA lympho chip. This chip included genes expressed in lymph cells and genes that play an important role in cancer. They ran samples from the three most common adult lymphomas on the lympho chip, namely diffuse large B-cell lymphoma (DLBCL), follicular lymphoma (FL), and chronic lymphocytic leukemia (CLL), and a variety of other lymphoma and leukemia cell lines. Each chip had a reference sample, with Cy5 labeling used for the experimental samples and Cy3 for the reference samples. Alizadeh et al. (2000) identified two distinct subtypes of DLBCL from a hierarchical cluster analysis of the resulting data; the relevant heat map and dendrogram from this analysis are given in Figure 3a of Alizadeh e
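The average silhouette width used above to choose the number of clusters can be computed from any clustering, not just one produced by pam. The Python sketch below is an illustration under assumptions: it uses Euclidean distance, assigns a width of zero to singleton clusters, takes the cluster labels as given instead of running a partitioning algorithm, and uses made-up two-dimensional points. The function name is hypothetical.

```python
import math


def average_silhouette_width(points, labels):
    """Mean silhouette width s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Four hypothetical samples forming two obvious groups.
samples = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
good_split = average_silhouette_width(samples, [0, 0, 1, 1])
bad_split = average_silhouette_width(samples, [0, 1, 0, 1])
```

Comparing the value across candidate numbers of clusters, and checking it against a threshold such as the 0.2 mentioned above, follows the selection rule described in the text.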
117. ect to have the same mean or median as a given reference chip.
contrasts | span = 2/3, choose.subset | Performs a modified loess normalization using contrasts to create a linear combination of all pairwise combinations of chips.
invariantset | prd.td = c(0.003, 0.007) | The chip with the median mean intensity for the set is chosen as the reference chip. Based on the rank of the intensities, a group of invariant genes is chosen for each chip. A smooth spline is fit to this invariant set of genes. This is a pairwise normalization for each chip in the set to the reference chip.
loess | subset = sample(1:(dim(mat)[2]), 5000), epsilon = 10^-2, maxit = 1, log.it = T, verbose = T, span = 2/3, family.loess = "symmetric" | Normalizes the chips with respect to each other by forcing log ratios to be scattered around the same constant curve. This is accomplished on more than two arrays by averaging the pairwise loess curves.
Table 7.6: Normalization methods available through the normalize function.
Normalization Methods | Default Function Values | Description
qspline | target = NULL, samples = NULL, fit.iters = 5, min.offset = 5, spline.method = "natural", smooth = TRUE, spar = 0, p.min = 0, p.max = 1 | The quantiles from each array and the target are used to fit a system of cubic splines to normalize the data.
118. ed as significant by the correction procedure.
Figure 4.14: Volcano plot resulting from Bonferroni FWER correction. The Bonferroni correction is very conservative, and most of the p-values have been pushed to one. [Graph Window; x-axis: Mean Log2 Fold Change; pages: Summary, Volcano Plot.]
Volcano Plot
A volcano plot displays the logarithm of p-value versus fold change, as shown in Figure 4.15. The vertical lines indicate fold change values of plus or minus two, and the horizontal line indicates a significant LPE Test p-value after doing the Bonferroni correction. Points located in the lower outer sextants are those with large absolute fold change and small (significant) p-value. Each of those points is active, so you can click an individual point to access annotation information from LocusLink or GenBank.
Figure 4.15: A volcano plot, which is the logarithm of p-value versus fold change. [Graph Window; Gene Name: 102426_at; x-axis: Mean Log2 Fold Change; pages: Summary, Volcano Plot, Heatmap.] The Benjamini-Hochberg FDR correction method was used for the swimming mouse data at an FDR of 0.05. The BH correction is less conservative than the Bonferroni procedure yet maintains a small proportion of false positives amongst those
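The contrast between the Bonferroni and Benjamini-Hochberg corrections can be seen on a handful of p-values. The Python sketch below is an illustration with made-up p-values and hypothetical function names. It uses the standard textbook definitions of both adjustments and is not the code used by the Multiple Comparisons Test dialog.

```python
def bonferroni(pvalues):
    """Bonferroni adjusted p-values: multiply by the number of tests, cap at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg (step-up) adjusted p-values."""
    n = len(pvalues)
    # Visit p-values from largest to smallest, tracking a running minimum.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw_p = [0.001, 0.008, 0.02, 0.03, 0.6]
bonf_p = bonferroni(raw_p)
bh_p = benjamini_hochberg(raw_p)

n_bonf = sum(p < 0.05 for p in bonf_p)  # genes called at FWER 0.05
n_bh = sum(p < 0.05 for p in bh_p)      # genes called at FDR 0.05
```

On these toy values Bonferroni keeps two genes and BH keeps four, which mirrors the qualitative difference between Figure 4.14 and Figure 4.15.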
119. ed to location and possibly scale before differential expression analysis.
DIAGNOSTIC PLOTS
Diagnostic plots of intensity data can help identify printing, hybridization, and scanning artifacts, and other sources of unwanted variability, which can be removed before analysis of differential gene expression. The S+ARRAYANALYZER GUI provides box plots and scatter plots to help identify such unwanted variability and guide subsequent adjustments and modeling procedures. Histograms, spatial image plots, and RNA degradation plots are also available from the S+ARRAYANALYZER command line.
Box Plots
Boxplots show side-by-side graphical summaries of intensity information from each array. The summary consists of the median and the upper and lower quartiles (75th and 25th percentiles, respectively) of the data. The central box in the plot represents the inter-quartile range (IQR), which is defined as the difference between the 75th percentile and the 25th percentile. The median is represented by a line or a dot in the middle of the box. By default, the upper and lower whiskers on the box plots are placed at the most extreme observation not exceeding plus and minus 1.5 times the IQR from the quartiles. Data outside the whiskers are plotted separately. The y-axis is typically the intensity log-ratios and the x-axis is the grouping variable. Figure 7.1 on page 136 shows an example of a typ
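The box plot summary described above (quartiles, whiskers at the most extreme observations within 1.5 times the IQR of the quartiles, and separately plotted outliers) can be computed in a few lines. The Python sketch below is an illustration. The function name boxplot_summary, the quartile convention (the "inclusive" method of the Python standard library, which may differ slightly from the S-PLUS default), and the toy log-ratios are assumptions.

```python
import statistics


def boxplot_summary(values, whisker=1.5):
    """Return quartiles, whisker ends, and outliers for one box."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    inside = [v for v in data if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_whisker": inside[0],   # most extreme value within the fence
        "upper_whisker": inside[-1],
        "outliers": [v for v in data if v < lower_fence or v > upper_fence],
    }


# Hypothetical intensity log-ratios for one print-tip group.
log_ratios = [-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 2.5, -0.3]
summary = boxplot_summary(log_ratios)
```

Here the value 2.5 lies beyond the upper fence and would be drawn as a separate point, while the upper whisker stops at 0.3.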
120. els. The focus of the S+ARRAYANALYZER GUI is two-sample problems. For two-sample problems it is quite easy to do the following from the GUI:
1. Read the data.
2. Summarize probe-level Affy data.
3. Normalize.
4. Test differential expression.
5. Annotate differentially expressed genes.
The within-gene two-sample comparisons implemented through the GUI include the following methods:
• paired.t: Paired t-test
• t: Welch's t-test (unequal variance)
• t.equalvar: Student's t-test (equal variance)
• wilcoxon: Wilcoxon signed rank sum non-parametric test
• t.permute: Welch's t-test, null distribution and p-value estimated by permutation
• t.equalvar.permute: Student's t-test with null distribution and p-value estimated by permutation
• wilcoxon.permute: Wilcoxon signed rank test with null distribution and p-value estimated by permutation
The method names listed in bold are used in the GUI for specifying a particular testing method. All the basic methods (paired t, Welch's t, Student's t, Wilcoxon) are described in standard introductory statistical textbooks such as Moore and McCabe (1999) or Snedecor and Cochran (1980). The permute versions of the test procedures are based on permuting the intensity scores across treatment conditions repeatedly, re-computing the test statistic each time to form a nul
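The permutation idea behind the permute methods can be sketched for a single gene. The Python below is an illustration, not the GUI's implementation: it enumerates every reassignment of the pooled values to the two groups (feasible only for small samples, whereas practical implementations typically sample permutations at random), uses the Welch statistic, and runs on made-up expression values. The function names are hypothetical.

```python
import itertools
import math
import statistics


def welch_t(x, y):
    """Welch's t statistic (unequal variances)."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


def permutation_pvalue(x, y):
    """Two-sided p-value from the exhaustive permutation null distribution."""
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    indices = range(len(pooled))
    count = 0
    total = 0
    for chosen in itertools.combinations(indices, len(x)):
        chosen_set = set(chosen)
        group_x = [pooled[i] for i in chosen]
        group_y = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(welch_t(group_x, group_y)) >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical log2 expression for one gene, four chips per condition.
control = [7.1, 7.4, 7.3, 7.6]
treated = [8.9, 9.2, 9.0, 9.4]
p_value = permutation_pvalue(control, treated)
```

With four chips per group there are only 70 possible splits, so the smallest attainable two-sided p-value is 2/70. This is one reason permutation p-values from small experiments are coarse.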
121. eneral is the default S-PLUS working directory. If you specify no project folder when you start S-PLUS, your cmd directory is the default working directory.
> getenv("S_WORK")
[1] "D:\\Program Files\\Insightful\\splus61\\cmd"
You should see two HTML files in your working directory when S-PLUS has finished generating the output: one for the summary table and another for the Graphlet.
GUI FOR LPE TESTING
LPE Testing Dialog
The dialog for LPE testing is displayed in Figure 8.7. Open the dialog from the main S-PLUS menu by clicking ArrayAnalyzer > Differential Expression Analysis > LPE Test. The dialog is arranged in five main groups:
• Data
• Options
• Output
• Variance Estimator
• Graph Options
Input Data
The Data group allows you to select the expression object for testing. You start by selecting the data type (Show Data of Type) as one of Affymetrix or cDNA, and then selecting a data object (an expression object created by importing, expression summarization for Affy CEL, and normalization) from the Data drop-down list box.
Dialog: Local Pooled Error Test. Data: Show Data of Type Affymetrix, Data MelanomaExpr, Chip Name hgu95a. Options: FWER, Adjustment Hochberg. Variance Estimation: Smoother D.F. 10, Number of Bins, Trim. Graph Options: Volcano Plot, Heat Map, Chromosome
122. ent. For each graph, the vertical axis is M and the horizontal axis is A. M is computed from the two chips found by going horizontally left and vertically down to the first chip name you come to. For example, for the upper left scatterplot, just right of the CGa label, M is computed as the difference in logged intensities from the CGa and CGb chips. A is the average of the same logged intensities.
The cone-shaped MvA plot shows that variance decreases as a function of the log average expression intensity. Given this pattern, we will use the LPE test for differential expression, since it allows variance to be modeled as a function of the average expression intensity. We first have to create a couple of objects that are arguments to the LPE test function.
LPE Test
The LPE test function requires baseline variance estimates before computing test statistics. We compute the baseline variance or error estimate with the baseOLIG function as follows:
> OLIGgrp0 <- baseOLIG(LCG.N[, 1:2])
> OLIGgrp24 <- baseOLIG(LCG.N[, 3:4])
The required argument to baseOLIG is the expression intensities for all replicates of a given treatment condition. An optional second argument, number.bins, sets the number of bins for the partition used to compute local error estimates. The default value for number.bins is 100, implying that 1% of the intensity values wil
123. er's Guide. Our example in Chapter 3 uses Affymetrix MAS4 summary data.

Custom cDNA Arrays

cDNA, or two-color, microarrays are designed to compare two different samples (the experimental conditions) on each slide. Each sample is treated with a different color before it is added to the slide. Differential expression is computed as the difference between the color intensities of the two samples.

[Figure 2.4: Custom two-color cDNA microarrays compare treatments on each array by tagging them with different colors.]

This two-color design provides a way of estimating differential expression independent of chip-to-chip variability. cDNA microarrays may be customized (both gene content and layout) by the experimenter. Consequently, the layout and gene content must be provided at the time of the analysis. This makes data import more complex than for Affymetrix chips, for which there are many standard fixed-layout descriptions. We provide a cDNA example in Chapter 4 to illustrate the steps involved in the analysis.

GUI OVERVIEW

The S+ARRAYANALYZER Interface
Import Data
  Import Affymetrix Data
  Import cDNA Data
Affymetrix Expression Summary
Normalization
Differential Expression Analysis
Annotation

THE S+ARRAYANALYZER INTERFACE

S+ARRAYANALYZER p
124. er library LPEtest is provided by Insightful.

SUPPORTED PLATFORMS AND SYSTEM REQUIREMENTS

S+ARRAYANALYZER is supported on the following platforms:

• Windows NT 4.0 (Service Pack 6 or later)
• Windows 2000
• Windows XP Professional
• Windows ME
• Windows 98

The minimum recommended system configuration is a Pentium II/300 processor, at least 512MB of RAM, and an SVGA or better graphics card and monitor. You must have at least 225MB of free disk space for the typical installation and, even if you are not installing on drive C, an additional 2MB of free disk space on drive C to unpack the distribution.

Installing and Running S+ARRAYANALYZER

To install S+ARRAYANALYZER, insert the S+ARRAYANALYZER CD, double-click the setup.exe file on the CD-ROM drive in Microsoft Windows Explorer, and follow the step-by-step installation instructions. In S-PLUS, load the S+ARRAYANALYZER module from the command line by entering:

> module(ArrayAnalyzer)

You can also load S+ARRAYANALYZER by choosing File > Load Module and selecting ArrayAnalyzer from the menu. To detach or unload S+ARRAYANALYZER, type:

> detach("ArrayAnalyzer")

Online Help

S+ARRAYANALYZER also includes an online HTML Help system for all the available functions. After you have loaded the S+ARRAYANALYZER module, you can get help for any command by using the ? or help function. For example, if you want help on the maDotsMatch function, simpl
125. erformed for the set of chips provided, i.e., all those read into one object during the data import phase. Thus, if the whole experiment is supplied, normalization will be done across treatment groups. From the command line, the user has more control. For example, the user may choose to normalize within experimental conditions and merge the resulting normalized data.

When chips are normalized to an average reference, it is assumed that there is a common underlying intensity distribution on each chip. For this reason, pairwise normalization, where one chip is a target chip, may be preferable when there are just two chips. But pairwise normalization when there are more than two chips has been shown to give variable results depending on which chip is chosen to be the reference chip (Bolstad et al., 2002).

Normalization of microarray data is currently an active research topic. We leave it to the researcher to decide the best approach for their data. Examples shown in this chapter are for demonstration purposes only.

Workflow

Data corrections and normalizations can be done in series. The suggested workflow is as follows.

For Affymetrix probe-level data:

• Background correcting
• Probe-level summary: summarizing the 11-20 probe-pair sets into a single value for each transcript
• Location/scale normalization

Affymetrix summary data (e.g., CHP file data from Affymetrix 4/5) and cDNA summary data (e.g., GPR file data from GenePix) are typically normaliz
126. erimposed on the scatter plot. The plot shows a non-linear dependence of the log ratio of red to green intensity (M) on the average log intensity (A). In section Normalization Methods for cDNA Data on page 144, we examine a number of normalization methods for this dataset to correct this systematic variation.

[Figure 7.2: MvA plot for chip 81 of the swirl dataset before normalization, with loess curves for each print-tip group.]

Normalizing in ArrayAnalyzer

S+ARRAYANALYZER provides a wide variety of normalization methods, depending on the form of the data prior to normalization. For cDNA chips, the scanning software typically accounts for background noise and adjusts for control information. For Affymetrix probe-level data (CEL files), the analyst needs to account for background noise and may make use of controls (often the mismatch probes, MM, are used for this) to correct for random non-specific binding. The analyst must also summarize the probe-level data into a single value per gene transcript. Affymetrix summarized data (e.g., chp files output from MAS 4/5 software) has already been background adjusted and perhaps mildly normalized, and the probe-level data have been summarized into a single intensity value per gene transcript.

Because of the inherent differences in cDNA data versus Affymetrix data, the specifics of the normalization m
127. [Figure 5.8: Pre- and post-normalized probe-level data. The normalized intensities are plotted on the log scale.]

EXPRESSION SUMMARIES

Once we've imported the data files, we need to convert the raw probe-level expression intensities to expression summaries before testing for differential expression. This is usually done in a series of steps, including some combination of the following:

• Background correction
• Normalization
• Probe-specific background correction (e.g., subtracting MM)
• Summarizing the probe set values into one expression measure (and sometimes a standard error for this measure)

An assortment of procedures is available for completing these steps. You can find much more detail in the Normalization chapter. In addition to normalization in the context of summarizing raw intensities, you can also normalize without the summarization step. For more detail, see Pre-Processing and Normalization.

RMA Summary

In this chapter, we focus on one sequence of steps, referred to as robust multichip analysis, or RMA for short. This procedure completes the following steps:

1. Probe-specific correction of the PM probes using a model based on observed intensity being the sum of signal and
128. es with a smaller p-value are declared differentially expressed. Adjusted p-values may be obtained either from the nominal distribution of the test statistics or by permutation.

Controlling Type I Error Rates

The false discovery rate (FDR) is defined as the proportion of genes expected to be identified by chance relative to the total number of genes with significant tests of difference. That is, FDR = FP/(S + FP) in Table 8.1. Controlling the FDR has the advantage of maintaining a small number of false positives amongst only those tests which are significant.

BH: The Benjamini and Hochberg procedure computes

$$\tilde{p}_{(i)} = \min_{k = i, \ldots, N} \left\{ \min\left( \frac{N}{k}\, p_{(k)},\ 1 \right) \right\}$$

Any $\tilde{p}_{(i)} < \alpha$ is significant, with an overall FDR for the experiment not greater than $\alpha$. This procedure provides a good balance between discovery of significant genes and protection against false positives, since occurrence of the latter is held to a small proportion of the significant gene list.

BY: The Benjamini and Yekutieli procedure computes

$$\tilde{p}_{(i)} = \min_{k = i, \ldots, N} \left\{ \min\left( \frac{N \sum_{j=1}^{N} 1/j}{k}\, p_{(k)},\ 1 \right) \right\}$$

Any $\tilde{p}_{(i)} < \alpha$ is significant, with an overall FDR for the experiment not greater than $\alpha$.

GUI FOR MULTIPLE COMPARISONS TESTING

Multiple Comparisons Testing Dialog

The dialog for Multiple Comparisons testing is displayed in Figure 8.1. Open the dialog from the main S-PLUS menu by clicking ArrayAnalyzer > Differential Expression Analysis > Multiple
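The Benjamini and Hochberg (BH) and Benjamini and Yekutieli (BY) step-up adjustments described above can be sketched in a few lines. This is a minimal Python illustration of the standard published formulas, not the S+ARRAYANALYZER code; the function name adjust_fdr and its method argument are hypothetical names chosen for the example.

```python
def adjust_fdr(pvalues, method="BH"):
    """Return FDR-adjusted p-values in the same order as the input.

    method="BH": Benjamini-Hochberg; method="BY": Benjamini-Yekutieli.
    Hypothetical helper for illustration; names are not from S+ArrayAnalyzer.
    """
    n = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    # BY multiplies by the harmonic sum 1 + 1/2 + ... + 1/n; BH uses 1.
    c = sum(1.0 / j for j in range(1, n + 1)) if method == "BY" else 1.0
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, carrying the running minimum so
    # that the adjusted values are monotone in the raw p-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, min(n * c / rank * pvalues[i], 1.0))
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
bh = adjust_fdr(raw, "BH")
by = adjust_fdr(raw, "BY")
```

A gene is then called significant at FDR level alpha when its adjusted p-value falls below alpha. BY is more conservative than BH because of the extra harmonic-sum factor, and it remains valid under arbitrary dependence between tests.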
129. ethods differ between data types. However, normalization in general can be thought of as either normalization to a point (location normalization) or scaling of the variability of the data (scale normalization). Having a visual representation of the data is very useful in the normalization process. S+ARRAYANALYZER includes a variety of diagnostic plots, and the sections that follow discuss the diagnostic plots and specific methods for normalizing cDNA and Affymetrix data.

IDEAS IN NORMALIZATION

Normalization typically involves adjusting distributional summaries of data from each chip to common reference values. Sometimes reference values are supplied by the user. Alternatively, some methods assume one chip is the reference chip, and the other chips in the set are then normalized to the target chip's reference values.

Normalizing to One Point

A median is a robust estimate of the center of the data distribution: just under 50% of the data on either side of the median can be moved to infinity and the median value will not be affected. It is a quantity that defines the center of the data: 50% of the data are above the median and 50% are below the median. Consequently, the median is often used as a reference value. The inter-quartile range (IQR) estimates the spread, or variability, of the data and is computed as the range of the middle 50% of the data. The IQR is a robust estimator of spread i
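Location and scale normalization to a common median and IQR can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names median_iqr and normalize_median_iqr are hypothetical, and the quartile convention (the "inclusive" method of the Python standard library) may differ from the one S-PLUS uses.

```python
import statistics


def median_iqr(values):
    """Return (median, inter-quartile range) of a list of intensities."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q2, q3 - q1


def normalize_median_iqr(chip, ref_median, ref_iqr):
    """Shift and scale one chip so its median and IQR match the reference.

    Hypothetical helper for illustration; not an S+ArrayAnalyzer function.
    """
    med, iqr = median_iqr(chip)
    return [(v - med) / iqr * ref_iqr + ref_median for v in chip]


chip = [2, 4, 6, 8, 10]
normalized = normalize_median_iqr(chip, ref_median=0.0, ref_iqr=1.0)

# Robustness: pushing the largest value far out changes neither summary.
outlier_chip = [2, 4, 6, 8, 1e9]
```

Because both summaries ignore the extreme tails, a few saturated or defective spots have little influence on the adjustment, which is why this is considered a mild normalization.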
130. examples include the following:

1. Different labeling efficiencies and scanning properties of the Cy3 and Cy5 dyes
2. Print-tip effects
3. Spatial (within-slide) effects
4. Between-slide effects

Furthermore, normalization allows for the use of control spots on the array or spiked into the mRNA samples. In Chapter 7, Pre-Processing and Normalization, we provide more detail on different methods of normalization. Here we list them briefly and work through a simple example.

To normalize cDNA data, go to the main S-PLUS menu and select ArrayAnalyzer > Normalization.

[Figure 6.11: Selecting Normalization from the main S-PLUS menu.]

Show data of type: Select the type of data you are normalizing. In this example, select cDNA in the Show data of type field for the swirl example, as shown in Figure 6.12.

[Figure 6.12: Selecting cDNA before selecting for a list of cDNA data.]

Data: In the Data field, select swirlMarrayRaw from the drop-down list as the object of class marrayRaw created during the import step.

Save As: In
131. f the data frame had identifiers for annotation databases, it would be a simple matter to annotate genes of interest. Please refer to the annotation section in this chapter to see how this can be done.

The trellis plot showing the significant genes for each of the contrasts is shown below. We are hovering over the single gene that was found to be significant for the contrast of strains in the rich media. The metadata shows the gene name, fold change, and log10adjp value for this gene.

[Figure 9.15: Creating the interactive trellised volcano plot for the 227 significant genes, trellising on the 10 contrasts.]
132. f the graph and links to UniGene annotation information for genes (rows) via clicking on the heat map. S-PLUS Graphlets are typically deployed using a simple Web user interface with the S-PLUS engine on a server, e.g., the Insightful Analytic Server on UNIX platforms and the Insightful StatServer on Windows platforms. In these server solutions, the data are read from a database or some other source, e.g., a Microsoft Excel file. Code snippets for StatServer and Analytic Server deployment are available from the Insightful Web site at http://www.insightful.com/products/demos.asp.

ANNOTATION OF MICROARRAY DATA USING S-PLUS

Annotation of genes in S+ARRAYANALYZER analyses is handled primarily through the library annotate. This package is designed to provide experiment-level annotation data suitable for the analysis of individual experiments or combinations of experiments.

A microarray experiment typically involves a set of known identifiers corresponding to the probes used. These identifiers are typically unique for any manufacturer. This holds true for any of the standard databases such as LocusLink. Note that when the identifiers from one source are linked to the identifiers from another, there does not need to be a one-to-one relationship. For example, several different Affymetrix accession numbers correspond to a single LocusLink identifier. Thus, when going one direction (Affymetrix to LocusLink), we have no prob
133. ffy.pdf or the AffyBatch class and exprSet class help files for more details.

The normalize function is a generic wrapper which calls the normalize.AffyBatch.method functions. These functions extract the intensity columns for the given set of chips and pass each intensity column or matrix (depending on whether the normalization is over a chip or across a group of chips) to the functions normalize.method. The normalize.method functions can be called directly by the user by passing in the appropriate intensity vector or matrix.

You can obtain a list of normalization methods for an object by typing:

> normalize.methods(Dilution)

This list of methods for AffyBatch objects can also be obtained by typing:

> normalize.AffyBatch.methods
[1] "constant"         "contrasts"        "invariantset"
[4] "loess"            "qspline"          "quantiles"
[7] "quantiles.robust"

The normalization methods available in S+ARRAYANALYZER for Affymetrix CEL files (AffyBatch objects) are shown in Table 7.6.

Table 7.6: Normalization methods available through the normalize function.

T, subset size 5000, verbose T, family symmetric

Columns: Normalization Methods / Default Function Values / Description

Location Normalization Methods

constant
  Default function values: ref=1, FUN=mean, na.rm=T
  Description: Normalizes one chip to have a given mean or median value, or normalizes a set of chips if the object is an AffyBatch obj
134. Columns: Normalization Method / f.loc Value / Summary

median
  f.loc value: f.loc=list(maNormMed(x=NULL, y="maM", subset=subset))
  Summary: Median normalization by chip.

loess
  f.loc value: f.loc=list(maNormLoess(x="maA", y="maM", z=NULL, w=NULL, subset=subset, span=span))
  Summary: Normalization to the loess curve of the chip's M vs. A.

twoD
  f.loc value: f.loc=list(maNorm2D(x ...
  Summary: 2D spatial location normalization. Normalizes to the smoothed intensity surface (loess surface) by print-tip group at each (x, y) coordinate.

printTipLoess
  f.loc value: f.loc=list(maNormLoess(x="maA", y="maM", z="maPrintTip", w=NULL, subset=subset, span=span))
  Summary: Normalizes to the loess curve of M vs. A within each print-tip group on each chip in the object.

scalePrintTipMAD
  f.loc value: f.loc=list(maNormLoess(x="maA", y="maM", z="maPrintTip", w=NULL, subset=subset, span=span)); f.scale=list(maNormMAD(x="maPrintTip", y="maM", geo=TRUE, subset=subset))
  Summary: Normalizes to the loess curve of M vs. A within each print-tip group, followed by within-print-tip-group scale normalization using the median absolute deviation.

Table 7.4: The norm parameter of maNormScale results in the following normalization methods and settings being passed to maNormMain.

Columns: Normalization Method / f.loc Value / Summary

globalMAD
  f.loc value: f.loc=NULL, f.scale
  Summary: Scale normalization over each chip using the median absolute

Examples With maNorm and maNormScale
135. fort to minimize and track variability in their arrays. As in many assay formats, the vendors of microarray technology compete based on how well their manufacturing and deployment processes control extraneous variability and provide reproducible results.

Why Do We Normalize Data?

Let's look at the swirl data set that has already been discussed in section Swirl cDNA Data Set on page 90. Figure 7.1 and Figure 7.2 show a box plot for each chip in the experiment and an M vs. A plot of one of the chips. Figure 7.1 shows that there are significant differences in the median log intensity differences. If the probes are placed randomly on the slide and the experimental conditions are well controlled, we would expect the medians of each print-tip group to be similar. However, the experimental conditions are not perfectly controlled, as shown by the negative values for all of the print-tip groups in Figure 7.1. The log ratio of intensities is a measure of the difference between the red and green fluorescence. The fact that this quantity is always negative suggests an imbalance in intensities of the two dyes, Cy3 and Cy5.

[Figure 7.1: Box plot for swirl experiment before normalization.]

Figure 7.2 is an M vs. A plot for one chip in the swirl set. The loess curves for each print tip are sup
136. generate expression intensity files with columns containing Row and Column layout information for the spots on the array. When your data files are organized this way, you should be able to read the data through the GUI by selecting one of the data files as the layout file for the Create Layout dialog and then reusing that same file on the File Selection page of the Import cDNA Data dialog. The Create Layout dialog picks up only the layout, probe names, and control information. When you use the file the second time on the File Selection page, the import operation picks up the expression intensity columns.

Some arrayers don't have a double grid layout as we describe in the swirl example. Agilent, with its Inkjet technology for printing arrays, produces one large spot matrix. In this case, set the Grid Rows and Grid Columns to one on the Create Layout dialog.

NORMALIZATION

The Normalization Dialog

Normalization is designed to remove artifacts and systematic variation resulting from the preparation and measurement process. The goal is to remove variability not due to differential expression, so that differential expression is estimated accurately for each gene. Note that we need to be careful not to normalize so aggressively as to wash out signal. For cDNA data, normalization corrects for various types of dye bias as well as print-tip and substratum irregularities. Some
137. genes tagged as significant.

Heat Map

A heat map plot, shown in Figure 4.16, shows a two-way layout of the most differentially expressed genes along the vertical axis versus the experimental conditions on the horizontal axis. This graph is also hyperlinked to the annotation information.

[Figure 4.16: A heat map plot shows differentially expressed genes as a function of experimental conditions.]

Q-Q Norm Plot

[Figure 4.17: Normal quantiles plot. Horizontal axis: Quantiles of Standard Normal.]

The Q-Q Norm plot displays the test statistics for all genes versus the standard normal quantiles. This plot gives some sense of the distribution of the test statistics and is used primarily for diagnostic purposes.

Annotation

Clicking one of the hyperlinked points in either the volcano plot or the heat map pops up a menu for selecting the database to query for annotation information. Selecting either one opens an HTML page in your default web browser, displaying a brief description of the gene with a hyperlink to more detailed information. Figure 4.18 shows an example page from LocusLin
138. good statistical testing is obtaining good estimates of the standard error of differential expression for each gene. In some studies (e.g., with few replicates), specialized methods may be required to improve the power of the statistical test. We describe one approach to doing this in the section GUI for LPE Testing.

STATISTICAL TESTS

The S-PLUS environment is rich in methods for statistical modeling and hypothesis testing. Virtually all the traditional modeling and testing methodology is available in S-PLUS through either its GUI or its Command line, and many methods are available through both. Furthermore, because of the ease of programming S-PLUS, many new methods quickly find their way into the S-PLUS environment. This makes S-PLUS ideal for doing microarray analysis, where many traditional methods (e.g., t-test, Wilcoxon test, ANOVA) are used, but where the advantage of using cutting-edge methods (loess normalization, invariant set normalization, local pooled error test) may provide a big pay-off by reducing false positives and negatives.

The focus of this chapter is on methods primarily supported through the GUI for differential expression testing. However, there are many techniques not covered here that are accessible through other sections of the GUI or through the Command line. See Chapter 9, Using the S-PLUS Command Line to Analyze Microarray Data, for examples on clustering and mixed effects mod
139. h gene identifier at the beginning of each row of the table is hyperlinked to one or more annotation databases.

[Figure 5.15: Summary of top 10 differentially expressed genes (Summary Output for LPE Test with Bonferroni Adjustment, with columns Accession Number, Raw p-Value, Adj p-Value, and Fold Change). Each gene is hyperlinked to annotation databases.]

Volcano Plot

A volcano plot displays the logarithm of adjusted p-value versus fold change, as shown in Figure 5.16. The vertical lines indicate fold change values of plus or minus two, and the horizontal line indicates a significant test p-value after doing the Bonferroni correction. Points located in the lower outer sextants are those with large absolute fold change and small (significant) p-value. Each of those points is active, so you can click on a point to access annotation information from LocusLink or GenBank.

[Figure 5.16: A volcano plot, which is the logarithm of p-value versus fold change (horizontal axis: Mean Log2 Fold Change). Points below the horizontal line are hyperlinked to annotation databases.]
140. ical box plot for an experiment with two replicate slides for each dye-swap condition.

Scatter plots of spot statistics allow the user to highlight and annotate subsets of points on the plot and assess patterns of differences in intensities between channels or chips. Such patterns may be visualized via fitted curves from robust local regression or other smoothing procedures. The MvA plot shows the log ratio of the intensities (the difference of the log intensities, usually termed M) between channels or chips against the average of the log intensities (usually termed A) for the channels or chips. Figure 7.2 shows an MvA plot for one chip in the swirl dataset (cDNA data), with loess curves overlaid for each print-tip group.

Note: MvA plots from the GUI plot a maximum of 2000 genes. If there are more than 2000 genes in the experiment, then only 2000 randomly sampled genes are plotted.

Chip-Specific Plots

Detailed information about these plots for the different chip types is available in the following sections. The sections that follow also describe other types of plots that are available from the command line.

NORMALIZATION METHODS FOR CDNA DATA

There is often systematic variation and imbalance of the red and green fluorescence intensities in cDNA data. This variation is usually not constant a
141. iddle 50% of the data is identical across the groups.

We can extend this idea of normalization so that the data match at a sequence of points, not just one or two points. For example, deciles (10th, 20th, 30th, ..., 90th percentiles) may be aligned via this type of normalization.

The quantiles method in S+ARRAYANALYZER extends the many-point approach described above to a fine-scale granularity such that each ordered individual gene expression value is aligned. The method assumes there is an underlying common distribution of intensities across all chips in the set, and that disparate datasets can be transformed to the same distribution by transforming the quantiles (at the level of individual values) of each to have the same value. Details of this transformation can be found in the normalize.quantiles help file references. The drawback of this method is that extreme values in the tails are normalized to the same values, thus possibly losing the differential expression information. Empirical evidence, however, suggests that this is not a problem (see Bolstad et al., 2002).

Cautions in Normalizing

Some care is required when normalizing across treatment groups so as not to wash out signal, particularly for aggressive normalization approaches such as the quantile method. This is not much of an issue with mild normalization approaches such as lining up medians and IQRs. In the S+ARRAYANALYZER GUI, normalization is p
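The quantile idea can be made concrete with a short sketch. This is an illustration in Python, not the normalize.quantiles implementation: the function name quantile_normalize is hypothetical, the reference distribution is assumed to be the mean across chips at each rank, and ties are broken by input order rather than averaged.

```python
def quantile_normalize(chips):
    """Give every chip the same empirical distribution of intensities.

    chips: list of equal-length lists, one list of intensities per chip.
    Assumption for this sketch: the common reference distribution is the
    mean of the sorted intensities across chips at each rank.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Reference value for rank r: average of the r-th smallest value per chip.
    reference = [
        sum(sc[r] for sc in sorted_chips) / len(chips) for r in range(n)
    ]
    normalized = []
    for chip in chips:
        # Rank each probe within its chip, then replace it by the reference
        # value at that rank.
        order = sorted(range(n), key=lambda i: chip[i])
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


result = quantile_normalize([[5, 2, 3], [4, 1, 9]])
```

After this transformation every chip contains exactly the same set of values, only in a different probe order. This makes the drawback noted above easy to see: the largest value on every chip is mapped to the same number, whatever it was originally.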
142. ies, respectively. Although it is possible to change the variables in the Probe Name and Expr. Intensities fields in this dialog, it is not recommended. These fields correspond to the columns read from the files and are used in subsequent analyses. The dialogs that follow in the data analysis (e.g., normalization and differential expression testing) expect expression data without control rows.

[Figure 4.8: The Variable Selection & Filtering page of the Import Affymetrix Data dialog.]

Note the Apply Log Base 2 check box, which by default takes $\log_2$ of the expression intensities before saving them in the resulting object. The actual computation is $\log_2(E)$ if $E > 1$ and 0 if $E < 1$.

One field that you may be interested in changing is in the Remove Rows group. By default, the Remove Rows Column Name is set to Stat Pairs Used (MAS5) or Min Pairs Matched (MAS4), and the If Less Than field is set to 7. When the MAS 4/5 software computes the expression summary data, it counts the number of probe pairs that are used in the computations. The Remove Rows group is implemented to let you specify the minimum number of pairs the su
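The two import-time operations described here, the floored log transform and the row filter, can be sketched as follows. This is a Python illustration only: the function names log2_floor and keep_row are hypothetical, and treating an intensity of exactly 1 as 0 is an assumption (it is harmless, because the base-2 logarithm of 1 is also 0).

```python
import math


def log2_floor(intensity):
    """Log base 2 of an intensity, with values at or below 1 mapped to 0.

    This keeps negative or near-zero MAS intensities from producing
    undefined or very large negative logs.
    """
    return math.log2(intensity) if intensity > 1 else 0.0


def keep_row(pairs_used, if_less_than=7):
    """Mimic the Remove Rows rule: drop a probe set when the number of
    probe pairs used in its summary is less than the threshold."""
    return pairs_used >= if_less_than


# (probe-set name, intensity, pairs used); made-up values for illustration.
rows = [("probe_a", 8.0, 16), ("probe_b", 0.5, 12), ("probe_c", 64.0, 5)]
imported = [
    (name, log2_floor(e)) for name, e, pairs in rows if keep_row(pairs)
]
```

In this made-up example, probe_c is dropped because only 5 probe pairs were used, and probe_b is kept with a logged intensity of 0.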
143. ifferential Expression Testing
  Local Pooled Error Test
References

AFFYMETRIX PROBE-LEVEL DATA ANALYSIS WORKFLOW

The process of analyzing Affymetrix probe-level gene expression data can be done through the S+ARRAYANALYZER menu. To obtain differential expression information from probe-level microarray data, we perform the following six steps:

1. Import and filter the data
2. Adjust for background noise
3. Mis-match correction
4. Summarize
5. Normalize
6. Differential expression analysis

Melanoma Probe-Level Data Set

In this chapter, we step through the analysis of an experiment using an MM5 melanoma cell line in which a gel matrix that simulates the in vivo cellular condition and progression of melanoma was added for 0 and 24 hours later (Fox et al., 2001). This simple experimental design thus involved one factor (matrix condition) at two levels (0 and 24 hours), with expression being measured twice (duplicated arrays) for each time point. The main hypothesis of interest involves discovering genes showing differential expression at the two time points, because these genes are believed to be relevant to tumor invasion and metastasis. The chips and data files are in Table 5.1.

Table 5.1: Experimental design and file association for the melanoma cancer study.

Experimental Cond
144. ify the data by print-tip groups or by chips. Please refer to the marrayClasses documentation (splus61/library/marrayClasses/marrayClasses.pdf) or the marrayRaw and marrayNorm help files for additional options.

cDNA Diagnostic Plots

Normalization usually begins with exploratory data analysis and diagnostic plots. cDNA data typically include two treatment conditions on one chip. Dudoit et al. (2002) and Yang et al. (2001) suggest that the most useful way to view such data, in order to identify spot artifacts and for normalization purposes, is via an M vs. A plot of the intensity log ratio, $M = \log_2(R/G)$, vs. the mean log intensity, $A = \log_2\sqrt{RG}$. This amounts to a 45-degree counter-clockwise rotation of the $(\log_2 G, \log_2 R)$ coordinate system followed by a scaling of the coordinates. This plot highlights the difference between the red and green channels as a function of average intensity across the two channels. Figure 7.2 on page 137 shows an example of an MvA plot for cDNA data.

Box plots are also available from the normalization dialog. The y-axis typically shows the intensity log ratio M. The x-axis shows the grouping (chip or print tip). Please see section Box Plots on page 142 for additional information on box plots. From the command line, we can create a box plot by print-tip groups for chip 93 of the swirl dataset (discussed in section Swirl cDNA Data Set on page 90) as follows. Information on the swirl d
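The relationship between M, A, and the 45-degree rotation can be checked numerically. The following Python sketch is illustrative only; the function names m_and_a and rotate_45 are hypothetical and are not part of the marray libraries.

```python
import math


def m_and_a(red, green):
    """Return (M, A) for one spot from its red and green intensities.

    M = log2(R/G) is the log ratio; A = log2(sqrt(R*G)) is the mean of the
    two log intensities.
    """
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a


def rotate_45(x, y):
    """Coordinates of the point (x, y) in axes rotated 45 degrees
    counter-clockwise."""
    s = math.sqrt(2.0)
    return (x + y) / s, (y - x) / s


red, green = 8.0, 2.0
m, a = m_and_a(red, green)

# Rotate (log2 G, log2 R), then rescale: A is the first rotated coordinate
# divided by sqrt(2), and M is the second multiplied by sqrt(2).
x_rot, y_rot = rotate_45(math.log2(green), math.log2(red))
a_from_rotation = x_rot / math.sqrt(2.0)
m_from_rotation = y_rot * math.sqrt(2.0)
```

For this spot, the red channel is four times brighter than the green, so M is 2 and A is 2. A spot with no dye imbalance and no differential expression would sit on the line M = 0.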
145. iles for information about the Dilution dataset.

Plots from the GUI show one pairs plot per treatment condition. From the GUI, a maximum of 2000 genes are plotted. If the chips have more than 2000 genes, then a random sample of 2000 genes is plotted.

[Figure 7.4: MvA plot of one treatment group of Dilution experiment.]

Plots From The Command Line

Table 7.5 lists the diagnostic plots available in S+ARRAYANALYZER from the command line. The functions boxplot, hist, and image work on AffyBatch objects. The input to mva.pairs is the matrix of expression measures (usually the log intensity matrix is used).

Table 7.5: Exploratory data analysis plots available from the command line for Affymetrix probe-level data.

Function Name     Description
boxplot           Box plot of log base 2 of intensity matrix.
hist              Calls plotDensity. Plots the non-parametric density estimates of the given matrix.
mva.pairs         MvA plots.
image             Raw image plots; can be used to detect spatial artifacts.
plotAffyRNAdeg    Requires object returned from AffyRNAdeg. RNA degradation plots aid in assessment of RNA quality.

Background Correction

Expression intensity measurements are summaries of the fluorescence intensities for the pixels contained within each chip spot. The background of the chip contributes to this signal an
146. ilhouette value $s(i)$ is computed and then represented in the plot as a bar of length $s(i)$. If $A$ denotes the cluster to which object $i$ belongs, we define

$a(i)$ = average dissimilarity of $i$ to all other objects of $A$.

Now consider any cluster $C$ different from $A$ and define

$d(i, C)$ = average dissimilarity of $i$ to all objects of $C$.

After computing $d(i, C)$ for all clusters $C$ not equal to $A$, we take the smallest of them:

$$b(i) = \min_{C \neq A} d(i, C)$$

The cluster $B$ that attains this minimum, namely $d(i, B) = b(i)$, is called the neighbor of object $i$. This is the second-best cluster for object $i$. The value $s(i)$ can now be defined:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \qquad (9.1)$$

We see that $s(i)$ always lies between $-1$ and $1$. The value $s(i)$ may be interpreted as follows:

$s(i) \approx 1$: object $i$ is well classified
$s(i) \approx 0$: object $i$ lies between two clusters
$s(i) \approx -1$: object $i$ is badly classified

The silhouette of a cluster is a plot of the $s(i)$, ranked in decreasing order. The silhouette plot shows the silhouettes of all clusters next to each other, so the quality of the clusters can be compared. The average silhouette width of a partitioning cluster analysis is the average of all the $s(i)$ from every cluster. This is a measure of quality, or goodness, of the cluster analysis. One typically runs pam several times using a different number of clusters within a specified range appropriat
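The silhouette definition in equation 9.1 translates directly into code. The sketch below is a Python illustration, not the S-PLUS pam implementation: the function name silhouette_values is hypothetical, it assumes at least two clusters, and it assigns 0 to an object that is alone in its cluster (a common convention, stated here as an assumption).

```python
def silhouette_values(dissimilarity, labels):
    """Compute s(i) for every object.

    dissimilarity: square matrix (list of lists) of pairwise dissimilarities.
    labels: cluster label for each object; at least two clusters are assumed.
    """
    n = len(labels)
    clusters = sorted(set(labels))
    values = []
    for i in range(n):
        # Average dissimilarity of object i to each cluster, excluding i.
        averages = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                averages[c] = sum(dissimilarity[i][j] for j in members) / len(
                    members
                )
        if labels[i] not in averages:
            values.append(0.0)  # object is alone in its cluster (assumption)
            continue
        a = averages[labels[i]]
        b = min(v for c, v in averages.items() if c != labels[i])
        values.append((b - a) / max(a, b))
    return values


# Four objects on a line forming two tight, well-separated clusters.
points = [0.0, 1.0, 10.0, 11.0]
dissimilarity = [[abs(p - q) for q in points] for p in points]
s = silhouette_values(dissimilarity, [0, 0, 1, 1])
average_width = sum(s) / len(s)
```

Here every s(i) is close to 1, so the two-cluster solution is a good one. Assigning the second object to the wrong cluster would drive its s(i) negative and lower the average silhouette width, which is how the statistic helps choose the number of clusters.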
147. in LocusLink is populated with the two genes identified in the analysis. There are several other ways to annotate these genes simply by using the gene list that we are holding in the S-PLUS variable mel.gnames. We illustrate by obtaining UniGene information, PubMed articles, and ontology information from the Gene Ontology consortium (GO). S-PLUS code for this annotation follows, and results from the queries are presented in Figures 9.6 and 9.7. Chapter 9, Using the S-PLUS Command Line to Analyze Microarray Data.
### get unigene accession id numbers and make a call to
### Unigene
> uids <- unlist(hgu95aACCNUM[foldChange.Ls.001.10$gName])
> genbank(uids, type = "accession", disp = "browser")
### get Pubmed accession id number and make a call to Pubmed for articles on the
### second gene
> pmedids <- hgu95aPMID[[mel.gnames[1]]]
> pubmed(pmedids, disp = "browser")
### get GO accession id numbers; need to cut and paste these into GO search
### e.g. Amigo
> goids <- hgu95aGO[mel.gnames]
> goids
412714 at
[1] GO:0003735 GO:0003723 GO:0006412 GO:0005843
38749_at
[1] GO:0004930 GO:0007186 GO:0005887
Annotation of Microarray Data using S-PLUS. [Screenshot: Entrez Nucleotide results page in Microsoft Internet Explorer for accessions M58459 and A1936826.]
148. incl.ends = TRUE, converge = FALSE, verbose = TRUE, na.rm = FALSE).
quantiles: Assuming an underlying common distribution, the set of chips is normalized so that their quantiles have the same value.
quantiles.robust (x, weights = NULL, remove.extreme = "variance", n.remove = 1, approx.meth = FALSE): Quantile normalization with options to:
• Eliminate chips with high variability.
• Eliminate chips with means too disparate from others.
• Down-weight particular chips in the computation of the mean.
An Example with normalize: Below we use the melanoma data set (Fox et al., 2001) to demonstrate various normalization procedures. The melanoma dataset is discussed in section Melanoma Probe Level Data Set on page 64. We first read in the data. This can be done through the GUI as shown in section Importing Data on page 65 of Chapter 5, An Example Affymetrix Probe Level Data.
> directory <- paste(getenv("SHOME"), "module", "ArrayAnalyzer", "examples", sep = "/")
> NCImelanoma <- ReadAffy(celfile.path = directory)
Pre-Processing And Normalization For Affymetrix Probe Level Data. The data should be corrected for specific binding and background noise. One way to do this is to simulate the Affymetrix MAS 5.0 software as follows:
# correct melanoma CEL data
# background correct
> NCImelanoma <- bg.correct(NCImelanoma, method = "mas")
# Correct using MM as controls
> tmp <- pmcorrect.m
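The idea behind the quantiles method described above (force every chip to share the same set of quantiles) can be sketched as follows. This is an illustrative Python sketch, not the S+ARRAYANALYZER or affy code: the name `quantile_normalize`, the use of the across-chip mean at each rank as the reference distribution, and the arbitrary handling of ties are assumptions made here.

```python
def quantile_normalize(chips):
    """Give every chip the same distribution of values.

    chips -- list of equal-length lists, one list of intensities per chip.
    Returns new lists; within each chip the rank order of probes is kept.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    # Assumed reference distribution: mean across chips at each rank.
    reference = [
        sum(col[rank] for col in sorted_chips) / len(chips) for rank in range(n)
    ]
    normalized = []
    for chip in chips:
        # Probe indices ordered by increasing intensity (ties broken arbitrarily).
        order = sorted(range(n), key=lambda k: chip[k])
        out = [0.0] * n
        for rank, k in enumerate(order):
            out[k] = reference[rank]
        normalized.append(out)
    return normalized


# Hypothetical intensities for three chips and four probes.
chips = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
norm = quantile_normalize(chips)
print(norm)
```

After normalization the sorted values of every chip are identical, which is what "their quantiles have the same value" means in practice; only the assignment of those values to probes differs between chips.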
149. ing medianIQR and then performing differential expression testing using LPEtest. Please refer to Chapter 4, An Example Affymetrix MAS Data, and Chapter 5, An Example Affymetrix Probe Level Data, for details on this process. There are three sections in this analysis:
1. Setup the data. For the melanoma experiment, we use data that has been read in at the GUI and analyzed for differential expression.
2. Filter the genes to identify interesting genes for annotation. We use a filter based on fold change and LPE p-value below, but this could use any of the functions in the genefilter library.
3. Annotate these genes using functions from the annotate library.
### Load libraries for annotation
> module(arrayanalyzer)
> library(hgu95aAnnoData)
### 1. Setup the data for analysis
### Make a copy of the ExprSet object created by reading in the
### Melanoma data from the GUI and analyzing
### for differential expression using LPE
> summObj <- melanoma.LPESumm
### 2. Filter the data to identify interesting genes with fold change greater
### than 10 and LPE p-value less than 0.001
> foldChange.Ls.001.10 <- summObj[summObj$adjp < 0.001 & summObj$foldChange > 10, ]
> mel.gnames <- foldChange.Ls.001.10$gName
### 3. Get LocusLink ID numbers and make a call to locuslink
> llnames <- as.numeric(unlist(hgu95aLOC
150. irlLayoutACCNUM <- as.list(annoDF$an
> names(swirlLayoutACCNUM) <- annoDF$probes
The resulting objects are named lists that look as follows:
> swirlLayoutACCNUM[1:5]
$"100001_at": [1] "M18228"
$"100002_at": [1] "X70393"
$"100003_at": [1] "D38216"
$"100004_at": [1] "AW120890"
$"100005_at": [1] "X92346"
swirlLayoutLOCUSID is similar. The probe names are the component names of the list, and the Accession Numbers or LocusLink IDs are the values of the components of the list. Now, to hook this up to the graphics, set the Chip Name field on the testing dialogs to the base name (swirlLayout in this example) used for naming the annotation named-list objects. Once the graphics are generated and displayed in HTML, they will be hyperlinked to the annotation databases. You will be able to access the annotation databases from the Top 10 gene list, the volcano plot, or the heat map plot. Annotation: Note that there isn't anything magical about using the name of the layout object in this process. We do it here because, by doing so, we tie the annotation objects to the layout object that contains the probe names. Any additional experiments you do using the same layout and probes will be able to use these same annotation objects. Chapter 6, An Example Two Color cDNA Data. FROM THE COMMAND LINE. Importing Data: All of the analysis done through the GUI can be done from the S-PLUS command line. Having access
151. [Dialog residue: Multiple Comparisons Test dialog, Save Summary As: MultTestSumm.] Figure 3.8: The Multiple Comparisons Test dialog implements standard statistical methods with p-value adjustments to maintain the user-specified overall family-wise error rate or false discovery rate. ANNOTATION: Interactive annotation to public databases such as NCBI's LocusLink or GenBank is provided through S-PLUS graphlets. These interactive graphs are hyperlinked to the databases, so information about a differentially expressed gene is truly only a click away. [Figure 3.9: Volcano plots of log adjusted p-value vs. average log fold change (x-axis: Mean Log2 Fold Change; tabs: Summary, Volcano Plot, Heatmap). The points below the horizontal line correspond to genes which are significantly differentially expressed and are automatically hyperlinked to public annotation databases.] Chapter 3, GUI Overview. [Screenshot: NCBI LocusLink page for Hs CTGF.]
152. ition | Repetition | chip label | File Name
0 hours | 1 | cga | cg2a.CEL
0 hours | 2 | cgb | cg2b.CEL
24 hours | 1 | cg24a | cg24a.CEL
24 hours | 2 | cg24b | cg24b.CEL
IMPORTING DATA. Import Affymetrix Data Dialog: To import Affymetrix data, from the main S-PLUS menu select ArrayAnalyzer > Import Data > From Affymetrix. [Figure 5.1: Menu selection to import Affymetrix data.] Figure 5.2 shows the Import Affymetrix Data dialog with the File Selection page displayed. The primary task of the import process associates data files with experimental conditions and selects the variable columns that are used in subsequent analysis. [Figure 5.2: The Import Affymetrix Data dialog.] Chapter 5, An Example Affymetrix Probe Level Data. File Selection Page: The Import Affymetrix Dat
153. iz.dist1, method = "average")
# cluster cols
aliz.dist2 <- dist(as.matrix(aliz.cmat))
aliz.hclust2 <- hclust(dist = aliz.dist2, method = "average")
# color 6 = GC B-like, color 5 = Activated B-like, color 1 = GC centroblasts
array3a.colors <- c(rep(6, 16), rep(1, 2), rep(6, 6), rep(5, 23))
# plot heatmap and dendrograms
par(mai = c(0, 0, 0, 0), omi = c(0, 2.7, 1.4, 1.1))
image(aliz.cmat[aliz.hclust2$order, aliz.hclust1$order], axes = F, bty = "n")
par(new = T, omi = c(6.55, 2.75, 0, 1.15))
plclust2.fn(aliz.hclust2, cex = 1, rotate.me = F, lty = 1, colors = array3a.colors[aliz.hclust2$order])
par(new = T, omi = c(0.02, 0.95, 1.42, 7.75))
plclust2.fn(aliz.hclust1, cex = 1, rotate.me = T, lty = 1)
The cluster analysis can also be done from the S-PLUS menu system using Statistics > Cluster Analysis > Agglomerative Hierarchical, although this does not produce the heat map with overlaid dendrograms as visual output like the above code snippet. Note that the above code snippet can be easily saved as an S-PLUS function for repeated use as follows:
cluster.heat <- function(cluster.data, sample.colors = rep(1, dim(cluster.data)[1])) {
  stand.norm <- function(x) (x - mean(x, na.rm = T)) / sqrt(var(x, na.method = "available"))
  cmat <- apply(cluster.data, 1, stand.norm)
  # cluster rows
  dist1 <- dist(t(as.matrix(cmat)))
  hclust1 <- hclust(dist = dist1, method = "average")
  # cluster cols
  dist2 <- dist(as.matrix(cmat))
  hclus
154. ization dialog of S+ARRAYANALYZER GUI menu. An example of an MvA plot can be seen in section Affymetrix Diagnostic plots on page 155. Other plots, such as histograms and qqplots, can be obtained for any summarized data (data of class exprSet) by extracting the exprs slot values from the exprSet object. (Normalization Methods for Affymetrix MAS Data.) We demonstrate this using the previously medianIQR-normalized data. The expression values are extracted and the log transform is taken before the box plot is created.
> par(mfcol = c(1, 2)) # two box plots on one page
> boxplot(data.frame(log2(Dilution.exprSet@exprs)), ylim = c(0, 15), style.bxp = "att")
> boxplot(data.frame(log2(cbind(DilutionEsetNormTmt1, DilutionEsetNormTmt2))), style.bxp = "att", ylim = c(0, 15))
From the plots in Figure 7.5, we can see that after normalization there are differences in the average expression levels between the two treatment groups. [Figure 7.5: Before and after normalization box plots of summarized Dilution dataset (panels: Intensity Distribution Before Normalization, Intensity Distribution After Normalization; chips X20A, X20B, X10A, X10B).] Please refer to Chapter 9, Using the S-PLUS Command Line to Analyze Microarray Data, for more command line plotting examples. REFERENCES: Affymetrix (2002). Statistical Algorithm Description
155. k with annotation for one of the differentially expressed genes in the melanoma example. [Figure 4.18: Annotation information from LocusLink (screenshot of the NCBI LocusLink entry for Mus musculus Rab10: RAB10, member RAS oncogene family; LocusID 19325).] FROM THE COMMAND LINE. Importing Data, General GUI Import: All of the analysis done through the GUI can be done from the S-PLUS command line. Having access to the command line adds great flexibility to the set of features available through the S+ARRAYANALYZER GUI and opens the door to additional analyses. The flexible, feature-rich S-PLUS language makes it an ideal platform for exploratory analysis, statistical testing, and modeling of gene expression data. This section is designed to expose you to the critical functions for differential expression testing of microarray data. If you have no interest in running your analyses from the command line, you can skip this section. S-PLUS has several command line function
156. l be in each bin. We are now ready to compute the LPE test. The function used is lpeOLIG, which takes three arguments: the baseline variance objects for each treatment condition and the size of the sample (i.e., number of chips) for each treatment condition. For the Melanoma data, the call is
> LPEObj <- lpeOLIG(OLIGgrp0, OLIGgrp24, sample = c(2, 2))
lpeOLIG computes raw p-values, but you can adjust them to control the family-wise error rate (FWER) or False Discovery Rate (FDR) with the mt.rawp2adjp function. mt.rawp2adjp takes the vector of raw p-values plus a character string indicating the adjustment procedure. The function call is
> adjpObj <- mt.rawp2adjp(LPEObj$pvalue, proc = "Bonferroni")
We can plot the results in a Graphlet similar to what is obtained from the GUI. We compute fold change first and then call the graphlet function:
> foldChange <- rowMeans(LCG.N[, 1:2]) - rowMeans(LCG.N[, 3:4])
> LPESumm <- lpetest.graphlet(LCG.N, adjpObj$adjp[, 2], LPEObj$pvalue[adjpObj$index], foldChange, procedure = procedure, chip.name = "hgu95a", volcano.plot = T, heatmap.plot = T, chromosome.plot = T, html.output = F, variance.plot = T, smoother.df = 10, trim = 5, OLIGgrp1 = OLIGgrp0, OLIGgrp2 = OLIGgrp24, var.xlabs = c("A for cg", "A for cg24"), summary.name = "LPESumm", open.browser = F)
The Output Table From The Command Line: The first six critical arguments to lpetest.graphlet are: 1. the expression set object (e.g., LCG.N); 2. the vec
157. l distribution of the test statistic. The p-value is then obtained by quantifying the frequency of seeing a test statistic as extreme or more so than the one observed for the data. Using permutation methods for reasonable sample sizes (10 or more per experimental condition) can produce more accurate p-value estimates for data which may not satisfy the assumptions of the test procedure. In particular, tests for skewed intensity values may benefit from computing p-values by permutation rather than from the theoretical symmetrical distribution. The permute versions of the tests should be used with caution for low-replicate studies, since the p-values are based on the total number of possible test statistics for permuted data. For example, in a two-sample study with two replicates (four arrays), the total number of different test statistics is at most six. This means that the smallest possible p-value for a two-sided alternative is 0.333. When the number of replicates increases to 10 per sample, the minimum p-value drops to 0.00001, and when there are 20 arrays per sample, the minimum p-value drops to 0.00000000001. The local pooled error (LPE) test is an experimental procedure designed for low-replicate studies. When there are few replicates in a study, the degrees of freedom for estimating the standard error of differential expression within genes may be as low as one or two. In this context, estimates of within-gene standard errors are impreci
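The minimum p-values quoted above follow from counting how many ways the arrays can be split into two equal groups. The sketch below, written in Python rather than S-PLUS, reproduces that arithmetic; the helper name `min_two_sided_p` is hypothetical, and it assumes a balanced design and a symmetric test statistic, so that the observed split and its mirror image are equally extreme.

```python
from math import comb


def min_two_sided_p(n_per_group):
    """Smallest attainable two-sided permutation p-value for a balanced
    two-sample design with n_per_group arrays in each sample."""
    # Number of distinct ways to assign the arrays to the two samples.
    n_splits = comb(2 * n_per_group, n_per_group)
    # Assumption: the observed labeling and its mirror image give the same
    # absolute statistic, so at least two splits are "as extreme".
    return 2 / n_splits


for n in (2, 10, 20):
    print(n, comb(2 * n, n), min_two_sided_p(n))
```

With two replicates per sample there are only comb(4, 2) = 6 splits, so no permutation p-value can fall below 2/6, which is why the permute versions are of little use for very small studies.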
158. l example, there are four rows and four columns. Enter 4 for both the Grid rows and Grid columns in the Create Layout dialog. Chapter 6, An Example Two Color cDNA Data.
• Spots: The spot matrix size in rows and columns. The spot matrix refers to the layout of the spots within a print-tip group. In the schematic in Figure 6.6, there are four rows and six columns of spots within each print-tip group. In the swirl example, there are 22 rows and 24 columns of spots. Enter 22 for the Spot rows and 24 for the Spot columns in the Create Layout dialog.
• Control Column: A column indicating which spots are control spots. Enter ID in this example.
• Gene Name Col: A column with the gene names. Enter Name for this example.
[Figure 6.7: The completed layout for the swirl example (Create Layout dialog: Layout File fish.gal; Spot Rows 22; Spot Columns 24; Grid Rows 4; Grid Columns 4; Control Column ID; Control Value control; Gene Name Col Name; Save Chip Layout swirlLayout).]
When you finish entering layout information in the dialog, your input should look like what is shown in Figure 6.7. Click OK to create and save the layout object. This returns you to the Import cDNA Data dialog, where you can finish the import dialog. Importing Data, MIAME Page: MIAME is an acronym for Minimal Info
159. lem, but when going the other we need some mechanism for dealing with the multiplicity of matches. There is a great deal of annotation data available for any given gene. Examples include LocusLink, UniGene, chromosome number, chromosomal location (cytoband or bp), KEGG pathway information, and the Gene Ontology (GO) categorizations. Other information, such as syntenic regions or orthologous grouping, can also be obtained. We provide some data with the annotate library, and we have added the hgu133aAnnoData and hgu95aAnnoData libraries, which contain some of the annotation access points. The DataLibs directory at the top level of the S+ARRAYANALYZER CD-ROM provides additional annotation data. We also provide data for downloading from the S+ARRAYANALYZER Web site, http://www.insightful.com/support/ArrayAnalyzer. Researchers with special needs should feel free to contact Insightful or the Bioconductor project regarding the production of annotation data specialized to their needs. In the following example, we show how to produce a simple Web page with links to LocusLink, UniGene, and PubMed at NCBI for genes that were selected according to some criteria. This example uses the Melanoma data (Fox et al., 2001) and picks up after the data have been read in through the GUI and analyzed for differential expression. The object melanoma.LPESumm in the S+ARRAYANALYZER database was created by reading in the summarized data files and normalizing the data us
160. ls
Dimensions of spot matrices: 22 rows by 24 cols
Currently working with a subset of 8448 spots.
Control spots: There are 7681 types of controls:
control fb16a01 fb16a02 fb16a03 fb16a04 fb16a05 fb16a06
768 1 1 1 1 1 1
Notes on layout: C:\PROGRAM FILES\INSIGHTFUL\splus61\module\ArrayAnalyzer\examples\fish.gal
Chapter 6, An Example Two Color cDNA Data. Note that the column that had control indicators also had gene IDs; we need to correct that in the swirl.layout object. We do that by using a couple of utility functions, maNspots and maControls:
> controls <- rep("Control", maNspots(swirl.layout))
> controls[maControls(swirl.layout) != "control"] <- "N"
> maControls(swirl.layout) <- factor(controls)
> swirl.layout
Array layout: Object of class marrayLayout.
Total number of spots: 8448
Dimensions of grid matrix: 4 rows by 4 cols
Dimensions of spot matrices: 22 rows by 24 cols
Currently working with a subset of 8448 spots.
Control spots: There are 2 types of controls:
Control N
768 7680
Notes on layout: C:\PROGRAM FILES\INSIGHTFUL\splus61\module\ArrayAnalyzer\examples\fish.gal
The functions maControls and maNspots return the control vector and the number of spots, respectively, for an object of class marrayLayout. Reading Experiment Information: This step reads the file with the experiment information. It's contained in the file SwirlSample.txt, located in the examples
161. lues
rawp <- testStat$pValue
testObj <- mt.rawp2adjp(rawp, proc = c("Bonferroni", "BH"))
We can now print the top 10 genes:
> testObj$adjp[1:10, ]
              rawp Bonferroni        BH
 [1,] 0.0002952891          1 0.6673955
 [2,] 0.0004345541          1 0.6673955
 [3,] 0.0005307666          1 0.6673955
 [4,] 0.0005736545          1 0.6673955
 [5,] 0.0006389848          1 0.6673955
 [6,] 0.0007114647          1 0.6673955
 [7,] 0.0007657569          1 0.6673955
 [8,] 0.0009233798          1 0.6673955
 [9,] 0.0009371951          1 0.6673955
[10,] 0.0010859229          1 0.6673955
We create several objects that get passed to the plotting functions:
adjp <- testObj$adjp
index <- testObj$index
### Set up for Bonferroni adjustment plot
testDF <- data.frame(gName = sw.probes, testStat = testStat[, "testStat"], foldChange = foldChange, stringsAsFactors = F)
row.names(testDF) <- sw.probes
testDF <- testDF[index, ]
testDF[, "rawp"] <- rawp[index]
testDF$adjp <- adjp[, "Bonferroni"]
### Setup for BH adjustment plot
testDFBH <- data.frame(gName = sw.probes, testStat = testStat[, "testStat"], foldChange = foldChange, stringsAsFactors = F)
row.names(testDFBH) <- sw.probes
testDFBH <- testDFBH[index, ]
testDFBH[, "rawp"] <- rawp[index]
testDFBH[, "adjp"] <- adjp[, "BH"]
Volcano plots (From The Command Line):
volcanoPlot(testDF, fwer = 5)
volcanoPlot(testDFBH, fwer = 5)
Graphlets: Now the Graphlets. The diffExpr object contains raw p-values and adjusted p-values for each procedure. diffExpr l
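The two adjustment procedures requested above, Bonferroni and BH (Benjamini-Hochberg), can be sketched as follows. This Python sketch is an illustration of the standard definitions, not the mt.rawp2adjp source; the function names and the five raw p-values are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests,
    capping the result at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjustment (controls the false discovery rate)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, pvals[k] * m / rank)
        adjusted[k] = running_min
    return adjusted


# Hypothetical raw p-values, already sorted for readability.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(bonf)
print(bh)
```

The monotonicity step is why BH-adjusted values are often tied across several genes (here the third and fourth values coincide), the same pattern as the repeated 0.6673955 in the BH column printed above.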
162. median: $m = \mathrm{median}\{x_1, \ldots, x_n\}$, $\mathrm{MAD} = \mathrm{median}\{|x_1 - m|, \ldots, |x_n - m|\}$. The main function for location and scale normalization of cDNA microarray data is maNormMain. Normalization is performed for each chip independently in a given batch of arrays, using location and scale normalization procedures specified by the lists of functions f.loc and f.scale. Typically, only one function is given in each list; otherwise, composite normalization is performed using the weights computed by the functions a.loc and a.scale. When both location and scale normalization functions (f.loc and f.scale) are passed, location normalization is performed before scale normalization. That is, scale values are computed for the location-normalized log ratios. maNormMain operates on an object of class marrayRaw (or possibly marrayNorm, if normalization is performed in several steps) and returns an object of class marrayNorm. maNormMain accepts any of the normalization methods listed in Table 7.2. The default parameters for these methods are also listed. The default normalization parameters can be changed by supplying the parameters as arguments in the normalization method call as follows:
# cDNA normalization
# Within print-tip group loess location normalization of swirl dataset
# Default normalization
> swirl.norm <- maNormMain(swirl, f.loc = list(maNormLoess()))
# Change the default span parameter of the loess normalization
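The ordering described above, location normalization first and scale values computed on the location-normalized log ratios, can be illustrated with a small Python sketch. It is not the maNormMain implementation: median centering stands in for the loess location step, the scale factor (each group's MAD divided by the geometric mean of all MADs) is one common choice assumed here, and the function names and print-tip data are hypothetical.

```python
from math import prod
from statistics import median


def mad(values):
    """Median absolute deviation: median of |x - m| with m = median(x)."""
    m = median(values)
    return median([abs(v - m) for v in values])


def location_then_scale(groups):
    """groups: dict mapping a print-tip group name to its log ratios M."""
    # Location step (assumed here: subtract each group's median).
    centered = {g: [v - median(vals) for v in vals] for g, vals in groups.items()}
    # Scale step: MADs are computed on the location-normalized values.
    mads = {g: mad(vals) for g, vals in centered.items()}
    geo_mean = prod(mads.values()) ** (1.0 / len(mads))
    return {
        g: [v / (mads[g] / geo_mean) for v in vals] for g, vals in centered.items()
    }


# Hypothetical log ratios for two print-tip groups with different spreads.
groups = {
    "tip1": [0.5, 0.7, 0.3, 0.9, 0.1],
    "tip2": [-1.0, 1.0, 0.0, 3.0, -3.0],
}
normalized = location_then_scale(groups)
print({g: mad(vals) for g, vals in normalized.items()})
```

After both steps the groups share a median of zero and a common MAD, so differences in spread between print tips no longer masquerade as differential expression.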
163. ment has no dye swapping, don't associate any data files with the dye-swapped rows, and specify twice as many replications as you have data files. Chapter 6, An Example Two Color cDNA Data. Setting the Factor Levels: Set the factor level names by clicking a cell in one of the factor columns (the default column names are Cy3 and Cy5, respectively) and typing in the new factor level name. Note that changing the factor level name in one place changes all the factor level names with the same name, and that the factor level names alternate for each gene. This approach is used for any dye-swapping experiment, and we increase the number of rows proportionate to the number of colors. In the Cy3 column, enter swirl in the top cell and wild type in the second cell. Note that all the associated factor level names change accordingly due to the dye swapping, as shown in Figure 6.3. [Figure 6.3: screenshot of the Import cDNA Data dialog, File Selection page.]
164. milar to that of the MAS4.0 software. This involves forming the differences PM - MM for each probe pair, calculating the mean and standard deviation (sd) of these differences, removing pairs with a difference of greater than 3 standard deviations from the mean, and recalculating the mean from the trimmed set.
liwong: The liwong method fits the model described in Li and Wong (2001a, 2001b). The default setting gives the current PM-only default. The reduced model (previous default) can be obtained using pmcorrect.method = "subtractmm".
[Table, continued: Summarization in S+ARRAYANALYZER; remaining rows: mas, medianpolish, playerout. Pre-Processing And Normalization For Affymetrix Probe Level Data.]
mas: The mas method implements an approach similar to that of the MAS5.0 software. This includes forming the differences PM - MM for each probe pair and then condensing these within a probe pair set in a robust manner. Outlier probe pairs are not dropped as in the avgdiff calculation; they are down-weighted. The median of the probe pair differences within a probe pair set is calculated, and each probe pair difference is down-weighted as a function of its distance from the median. The probe pair differences are then combined in a one-step version of the Tukey biweight procedure.
medianpolish: The medianpolish algorithm works by alternately removing the row and column medians, and continues until the proportional reduction in the sum of absolute residuals is less than eps or until there have been maxiter ite
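The MAS4.0-style average-difference calculation described at the start of this passage (form PM - MM, trim pairs more than 3 standard deviations from the mean, re-average) can be sketched as below. This is a Python illustration, not the avgdiff code shipped with the software: the function name, the use of the sample standard deviation, and the probe values are assumptions.

```python
from statistics import mean, stdev


def avgdiff(pm, mm, n_sd=3.0):
    """Trimmed average difference for one probe pair set.

    pm, mm -- perfect-match and mismatch intensities, one value per probe pair
    n_sd   -- pairs farther than n_sd standard deviations from the mean are dropped
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    center = mean(diffs)
    spread = stdev(diffs)  # assumption: sample standard deviation
    kept = [d for d in diffs if abs(d - center) <= n_sd * spread]
    return mean(kept)


# Hypothetical probe set: 15 well-behaved pairs and one wild outlier.
pm = [200] * 15 + [5100]
mm = [100] * 16
print(avgdiff(pm, mm))
```

With 16 pairs the single outlier sits 3.75 standard deviations from the mean and is removed, so the summary falls back to the value shared by the other 15 pairs. With very few probe pairs no point can lie 3 standard deviations out, which is one reason the later mas method down-weights outliers instead of dropping them.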
165. mmaries should be based on in order to be included in the resulting expression object. The more pairs included in the expression summary, the better. The maximum is all of the probe pair sets, typically 11, 16, or 20. The default value for If Less Than is 7. Press OK when you have completed the dialog and the data are imported. It is now ready for use in S+ARRAYANALYZER. Chapter 4, An Example Affymetrix MAS Data. NORMALIZATION. Normalization Dialog: Now that the data has been imported, we are ready to move to the next step of the analysis procedure: Normalization. The Normalization dialog is designed to remove artifacts and systematic variation resulting from the measurement process. The goal is to remove variability not due to differential expression, so that differential expression is estimated accurately for each gene. Note that we need to be careful not to normalize so aggressively as to wash out signal. Typically, this is accomplished by normalizing within experimental conditions, although some forms of normalization may be comfortably applied across experimental conditions. For our swimming mouse example, this translates to normalizing within each level of our treatment (0 and 4 weeks) separately. To normalize the data, select ArrayAnalyzer > Normalization from the main menu. [Figure 4.9: Selecting the Normalization menu item.]
166. model as in Wolfinger et al. (2001), in which the individual gene expression data on each chip are adjusted by an overall ANOVA mean for each chip. There are two approaches that can be used in S-PLUS to fit such a model: a linear mixed effects (lme) approach and a variance components (varcomp) approach. Note that these two-channel data would be more appropriately normalized using the non-linear smoother (loess) methods, as described by Yang et al. (2002) and Dudoit et al. (2002). The normalization ANOVA model is $y = \mu + A + T + AT + \varepsilon$, where $\mu$ represents an overall mean value, $A$ is the main effect for arrays (random effect), $T$ is the main effect for treatments, $AT$ is the interaction effect of arrays and treatments (random effect), and $\varepsilon$ is stochastic error. We assume $A \sim N(0, \sigma_A^2)$, $AT \sim N(0, \sigma_{AT}^2)$, and $\varepsilon \sim N(0, \sigma_\varepsilon^2)$. (Differential Expression Analysis for Experiments with More than Two Experimental Conditions.) As pointed out by Wolfinger et al. (2001), the $AT$ term models a channel effect, which is often necessary because of the arbitrary manual intensity scaling done with programs like ScanAlyze (Eisen et al., 1998). Also note that there is no main effect for dyes in the model, since wild type was always labeled with Cy5, and therefore the treatment effect $T$ is already accounting for differences between dyes.
# 2. Normalization of microarray data
# 2.1 May like to set contrasts to match SAS (controversial in some circles, e.g. the
167. n that you can move just under 25% of the data at either end of the distribution to infinity and the IQR remains unchanged. The robust properties of the median and IQR make them good reference values in normalization procedures. There are also methods which do not require a target or reference chip. These methods use the information from the chip set to be normalized to create average reference values for the chip set. We can think of normalizing groups of data to a reference point, that is, bringing the median of a data group to a fixed reference point through a shift of the values. This reference point can be as simple as a given constant intensity value (constant or median normalization) or as complicated as fitting a locally weighted least squares regression (loess normalization) through the data. We call this type of normalization location normalization. Location normalization is necessary to correct for spatial variation, e.g., when the slide is slightly tilted during hybridization, which results in more mRNA available for binding at different locations on the slide. One of the most common location normalization methods for microarray data is to normalize the data to a loess curve fit through the MvA plot. The loess method fits a curve to the data using robust locally weighted regression, as discussed in Cleveland (1979), Yang et al. (2001, 2002), and the S-PLUS 6 Guide to Statistics. Local regression
168. n we focus on normalizing probe set data without summarizing it. Open the Normalization dialog by selecting Normalization from the ArrayAnalyzer drop-down menu. [Figure 5.7: The Normalization dialog (Show Data of Type: Affymetrix CEL; Data: CGAffyBatch; Normalization: quantiles; Save As: CGAffyBatch.norm; options for MvA Plot, Box Plot, When to Show: Before & After or Only After; Probe Set: PM or PM and MM).] Select Affymetrix CEL from the Show Data of Type drop-down list and choose the CGAffyBatch data object. Explore the normalization procedures available in the Normalization drop-down list. The quantiles procedure allows you to normalize only the PM intensities or both PM and MM intensities. Save As: The Save As field takes an object name for saving the normalized affyBatch probe-level object. Clicking OK creates the normalized affyBatch object and plots pre- and post-normalization boxplots for comparison. The plot is on a log2 scale, but the expression intensities are saved on the original raw intensity scale. Normalization of Probe Level Data. [Plot residue: boxplot panels titled Before quantiles Normalization and After quantiles Normalization.]
169. nly defined as follows: $M = \log_2\!\left(\mathrm{Trt}_1\mathrm{Rep}_i / \mathrm{Trt}_1\mathrm{Rep}_j\right)$ and $A = \tfrac{1}{2}\log_2\!\left(\mathrm{Trt}_1\mathrm{Rep}_i \cdot \mathrm{Trt}_1\mathrm{Rep}_j\right)$, for $i \ne j$; $i, j = 1, \ldots,$ reps in treatment 1. To compare expression between treatment conditions, the intensity log ratio M and the average intensity A are commonly defined as follows: $M = \log_2\!\left(\mathrm{Trt}_1\mathrm{Rep}_i / \mathrm{Trt}_2\mathrm{Rep}_j\right)$ and $A = \tfrac{1}{2}\log_2\!\left(\mathrm{Trt}_1\mathrm{Rep}_i \cdot \mathrm{Trt}_2\mathrm{Rep}_j\right)$, for $i = 1, \ldots,$ reps in treatment 1; $j = 1, \ldots,$ reps in treatment 2. Since Affymetrix probe-level data is on the raw scale, box plots from the GUI will plot the logs of the intensities. Please refer to section Box Plots on page 142 for a general discussion of box plots. MvA plots (Pre-Processing And Normalization For Affymetrix Probe Level Data): The function mva.pairs, or the MvA plots from the GUI, show pairwise graphical comparisons, e.g., between replicates of a treatment condition. The axes of these plots are the log ratio intensities (M) between a replicate chip pair vs. the average log sum (A) intensities of the chip pair. The pairwise scatter plots are shown on the top right half of the graph, and the inter-quartile range (IQR) of the log ratios is shown on the bottom left half of the graph. The chip labels are given on the diagonal. These plots can be particularly useful in diagnosing problems in replicate sets of arrays. Figure 7.4 shows an MvA plot for one treatment condition of the Dilution experiment (there are two replicates of this condition). Please refer to the help f
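The M and A values plotted by an MvA plot can be computed directly from two chips on the raw intensity scale. The sketch below is in Python rather than S-PLUS and is not the mva.pairs code; the function name `m_and_a`, the base-2 logarithm, and the replicate intensities are assumptions made for illustration.

```python
from math import log2


def m_and_a(x, y):
    """Per-probe M and A for two chips on the raw intensity scale.

    M = log2(x / y)         (log ratio)
    A = 0.5 * log2(x * y)   (average log intensity)
    """
    m_vals = [log2(a / b) for a, b in zip(x, y)]
    a_vals = [0.5 * log2(a * b) for a, b in zip(x, y)]
    return m_vals, a_vals


# Hypothetical raw intensities for two replicate chips (three probes).
rep1 = [1024.0, 256.0, 64.0]
rep2 = [512.0, 256.0, 128.0]
M, A = m_and_a(rep1, rep2)
print(M)  # [1.0, 0.0, -1.0]
print(A)  # [9.5, 8.0, 6.5]
```

For well-behaved replicates the M values scatter around zero at every A; a trend in M as A changes is the intensity-dependent bias that loess normalization is meant to remove.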
170. normalization using the loess function.
printTipLoess: within-print-tip-group intensity-dependent location normalization using the loess function.
scalePrintTipMAD: within-print-tip-group intensity-dependent location normalization followed by within-print-tip-group scale normalization using the median absolute deviation (MAD).
The default normalization method is printTipLoess. To normalize swirl.raw with the scalePrintTipMAD method, the function call is
> swirl.norm <- maNorm(swirl.raw, norm = "s")
Note that only the first letter of the method ("s" in this case) is needed. We can do a series of plots to compare before and after normalization. First, we do a pair of M vs. A plots. The maPlot function handles all the details. We pick off one of the arrays (the third one) for simplicity.
> maPlot(swirl.raw[, 3], main = "Pre-normalization MvA Plot")
> maPlot(swirl.norm[, 3], main = "Post-normalization MvA
Chapter 6, An Example Two Color cDNA Data. [Figure 6.21: Pre-normalized M vs. A plot for the swirl data (panel title: Pre-normalization MvA Plot).] From The Command Line. [Plot residue: panel titled Post-normalization MvA Plot, Scale Print Tip MAD.] Figure
171. ods for normalizing this summarized data: medianIQR and affy.scalevalue.exprSet.

Dilution.exprSet is a sample exprSet object (1) available in the S+ARRAYANALYZER database. Dilution.exprSet is a summarized version of the Dilution experiment object Dilution. Please refer to the help files for more details. We use Dilution.exprSet to demonstrate normalization of summarized microarray data.

medianIQR normalization scales the summarized chip data so that each chip has the same inter-quartile range as the maximum IQR for the set, and the median of each chip's data is shifted to the maximum median of the chip set. medianIQR takes as input an expression intensity matrix (each column is one chip's values) and returns a matrix of the same dimensions, one column for each chip in the set. medianIQR can be used from the command line as follows:

### Normalizing each treatment group separately
> DilutionEsetNormTmt1 <- medianIQR.norm(Dilution.exprSet[,1:2]@exprs)
> DilutionEsetNormTmt2 <- medianIQR.norm(Dilution.exprSet[,3:4]@exprs)

(1) An object of class exprSet contains information for experiments where the probe-level data has already been summarized into one expression value for each gene. Please refer to the Biobase documentation for more details (splus61\library\Biobase\Biobase.pdf) or the exprSet class help file.

affy.scalevalue.exprSet

Diagnostic Plots for Summarized
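The median and IQR adjustment described above can be sketched in a few lines of base S (also valid R). The function name sketchMedianIQR is hypothetical and this is not the source of medianIQR.norm; it only follows the verbal description: rescale every chip to the largest IQR in the set, then shift every chip to the largest median.

```r
sketchMedianIQR <- function(x) {
  # x: expression matrix, one column per chip.
  iqr <- function(v) diff(quantile(v, c(0.25, 0.75), na.rm = TRUE))
  meds <- apply(x, 2, median, na.rm = TRUE)
  iqrs <- apply(x, 2, iqr)
  # Center each chip at zero, stretch to the largest IQR,
  # then move every chip to the largest median.
  centred <- sweep(x, 2, meds, "-")
  scaled <- sweep(centred, 2, max(iqrs) / iqrs, "*")
  scaled + max(meds)
}

# Two hypothetical chips on different scales.
chips <- cbind(chipA = c(1, 2, 3, 4, 5), chipB = c(10, 20, 30, 40, 50))
normed <- sketchMedianIQR(chips)
```

After the call, both columns share one median and one IQR, which is the property the before-and-after box plots are meant to show.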
172. olcano Plot Heatmap Chromosome Variance Plot

Figure 5.18: The chromosome plot displays the entire chromosome with differential expression marked up for positive, down for negative, for each gene on the chip. The orange color indicates the location of the top 10 most significant differentially expressed genes.

Variance Plots

The variance plots display the variance estimates used for the LPE test as a function of differential expression for each treatment condition. In this example, the plot shows the variance decreasing dramatically as differential expression increases, as shown in Figure 5.19.

Figure 5.19: Variance plots for the 0 hour and 24 hour data.

Annotation

Clicking one of the hyperlinked points in the Top 10 Summary, the volcano plot, or the heat map pops up a menu for selecting the database to query for annotation information. Selecting either one opens an HTML page in your default web browser displaying a brief description of the gene, with a hyperlink to more detailed information. Figure 5.20 shows an example page from LocusLink with annotation for one of the differentially expressed genes in the melanoma example.
173. olling Type I Error rates. Note that the minP and maxT procedures are only available for the permute versions of the test statistics. When you use the permutation methods, you can specify the number of permutations used in the p-value estimation and provide a seed to the random number generator for repeatability of results in testing or validation studies. The permutation and minP and maxT adjustment procedures should not be used in the context of few replicates; the results may be misleading. For more information, see the cautionary note in the section Within-Gene Two-Sample Comparisons of section Statistical Tests.

The Graph Options group is a list of check boxes for selecting which graphs you want as output: Volcano Plot, Heat Map, Chromosome Plot, and QQ Norm Plot.

Figure 8.5: The Graph Options group in the Multiple Comparisons Test dialog.

Each of these options is described in detail in the section Differential Expression Analysis Plots.

Output

The Output group controls where the graphs are displayed and where the gene list table is saved after the testing step is complete.

Figure 8.6: The Output group of the Multiple Comparisons Test dialog.

Display Output in S-PLUS: Displays the selected graphics in an S-PLUS graphic device.
Save Output as HTML: Saves the S-PLUS Graphlet
174. olor cDNA Data

Saving the Design

Once you've entered all the information on this tab, you can save it for later use by clicking the Save Design button at the top of the dialog. A .txt file is written to the directory of your choice with the number of factors, number of levels, repetitions, and the full path file names and their associated factor levels.

Reading Designs

This design file can be reused for another experiment with the same design by modifying the file locations, names, and factor levels as needed. In fact, if you have many chips in your experiment, you can create a file with all the design content and read it with the Read Design button, which sets the reps indicator and fills the file name fields and their associated factor levels.

Create Layout Dialog

Now that you have set the import parameters, you need to create or find a layout file to complete the data import. To specify the chip layout for a cDNA array, you need to select a layout that has been previously created or create a new one. Create one for the swirl example by clicking the Create Layout button at the bottom of the Import cDNA Data dialog. This displays the Create Layout dialog, as shown in Figure 6.5.

Figure 6.5: The Create Modif
175. ontaining probe sequence information
targets: object of class marrayInfo containing target sample information

Note that, besides file names and location and the columns indicating foreground and background intensities, we need to supply the objects we just created: swirl.layout, swirl.gnames, and swirl.samples. The command-line call for reading the four data files is as follows:

> fnames <- paste("swirl.", 1:4, ".spot", sep="")
> swirl.raw <- read.marrayRaw(fnames, path=AApath, name.Gf="Gmean", name.Gb="morphG", name.Rf="Rmean", name.Rb="morphR", layout=swirl.layout, gnames=swirl.gnames, targets=swirl.samples)
> swirl.raw
Pre-normalization intensity data: Object of class marrayRaw.
Number of arrays: 4 arrays.
A) Layout of spots on the array:
Array layout: Object of class marrayLayout.
Total number of spots: 8448
Dimensions of grid matrix: 4 rows by 4 cols
Dimensions of spot matrices: 22 rows by 24 cols
Currently working with a subset of 8448 spots.
Control spots:
There are 2 types of controls:
Control N
768 7680
Notes on layout:
C:\PROGRAM FILES\INSIGHTFUL\splus61\module\ArrayAnalyzer\examples\fish.gal
B) Samples hybridized to the array:
Object of class marrayInfo.
maLabels # of slide Names experiment Cy3
1 81 81 swirl.1.spot swirl
2 82 82 swirl.2.spot wild type
3 93 93 swirl.3.spot swirl
4 9
176. or to clustering, it is also recommended to triage genes based on their statistical significance in differential expression and to confirm consistent expression patterns within replicates (e.g., Ross et al. 2000).

A variety of partitioning and hierarchical cluster analysis methods are available in S-PLUS 6.1, including a library of algorithms described in Kaufman and Rousseeuw (1990). The partitioning methods include K-means, partitioning around medoids (pam), and a fuzzy clustering method in which the probability of membership of each class is estimated. A method for large datasets, clara, is also included; it is based on pam. The key difference between kmeans and pam is that kmeans uses a mean as the center of each cluster, while pam uses an actual data point as the center.

The hierarchical methods include agglomerative methods, which start from individual points and successively merge clusters until one cluster representing the entire dataset remains, and divisive methods, which consider the whole dataset and split it until each object is separate. The available agglomerative methods are agnes, mclust, and hclust. The available divisive methods are diana and mona. The agglomerative hierarchical methods agnes and hclust use several measures for defining between-cluster dissimilarity, including a group average, nearest neighbor (or single linkage), and farthest neighbor (complete linkage). These methods proceed by merging the two clusters with the smallest
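The difference between a mean center and a medoid center can be seen on a single hypothetical cluster containing one outlier. The sketch below uses only base S functions (also valid R) and does not call kmeans or pam; it just computes the two kinds of center directly.

```r
# One hypothetical cluster of expression values with an outlier.
x <- c(1, 2, 3, 4, 100)

# Center used by a mean-based method such as kmeans.
centre.mean <- mean(x)

# Medoid: the observed point with the smallest total distance to the others.
d <- abs(outer(x, x, "-"))
medoid <- x[order(apply(d, 1, sum))[1]]
```

The mean is pulled toward the outlier and matches no observed value, while the medoid stays at one of the data points in the bulk of the cluster.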
177. [Normalization dialog screenshot]

Figure 3.7: The Normalization dialog allows you to adjust for systematic biases in the measurements of expression intensities for probe-level or summarized expression intensity values. There are many methods available for normalization, including print-tip-specific methods for custom cDNA arrays and methods for Affymetrix probe-level data.

DIFFERENTIAL EXPRESSION ANALYSIS

Tests for differential expression are implemented in two dialogs:

1. Standard statistical testing procedures such as the t-test for equal and unequal variances, the paired t-test, and Wilcoxon's signed rank sum test.
2. Local pooled error (LPE) test, designed for low-replicate studies.

Both of these procedures provide numerous methods for adjusting the raw p-values to account for the hundreds or thousands of statistical tests computed for any given experiment. You can control the family-wise error rate or the false discovery rate by specifying an overall error rate and a raw p-value adjustment procedure.

[Multiple Comparisons Test dialog screenshot]
178. ot names
> getSlots(swirl.norm)
maA maM maMloc maMscale maW maLayout maGnames maTargets maNotes maNormCall
"matrix" "matrix" "matrix" "matrix" "matrix" "marrayLayout" "marrayInfo" "marrayInfo" "character" "call"

Given that we have a dye-swapped cDNA experiment, we'll use a traditional paired t-test to test for differences in expression. We first have to create a couple of objects that are arguments to the aa.teststat function. First extract M, the logged intensity ratios.

### Extract M's, log2(R/G), for each chip
M <- maM(swirl.norm)

Now compute fold change in preparation for doing a paired t-test. See the code for aa.teststat for details on how this is done in general.

### Compute fold change, prep to a paired t-test
foldChange <- rowMeans(M[, c(1,3)] - M[, c(2,4)])

Get gene names and label the rows of the M matrix.

### extract gene names
sw.probes <- maLabels(maGnames(swirl.norm))
dimnames(M)[[1]] <- sw.probes

Set up a factor which indicates which cell type is colored red.

### Set the factors; factor indicates which cell type is colored Red
gfac <- factor(c("swirl", "wild type", "swirl", "wild type"))

### Compute test statistics
testStat <- aa.teststat(M, gfac, test="pairt")

Adjusted p-values and Graphics

Compute adjusted p-values for both Bonferroni and Benjamini and Hochberg methods.

### Compute adjusted p va
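For orientation, a paired t statistic can be computed gene by gene from a matrix of within-pair differences using base S functions (also valid R). The function name pairedTSketch and the example matrix are hypothetical; this is not the source of aa.teststat, only the textbook statistic it is described as providing.

```r
pairedTSketch <- function(d) {
  # d: matrix of within-pair differences, one row per gene, one column per pair.
  n <- ncol(d)
  m <- apply(d, 1, mean)
  s <- sqrt(apply(d, 1, var))
  # Mean difference divided by its standard error.
  m / (s / sqrt(n))
}

# Two hypothetical genes measured on three dye-swap pairs.
d <- rbind(gene1 = c(1, 2, 3), gene2 = c(-1, 0, 1))
tt <- pairedTSketch(d)
```

With only two dye-swap pairs, as in the swirl example, each statistic has a single degree of freedom, so the resulting p-values should be read with caution.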
179. ound plot
> maImage(swirl, x="maM") # log intensity ratio plot

Location and Scale Normalization

S+ARRAYANALYZER supports both location and scale normalization for cDNA data. One of the most common location normalization methods is loess normalization. This method normalizes the data to the loess curve on the MvA plot. Please refer to the section Normalization Using Loess on page 139 for more details about loess curves. For cDNA data, the intensity log ratio $M = \log_2(R/G)$ and the overall intensity $A = \log_2\sqrt{RG}$ are most commonly used to create the loess regression curve. Like the median normalization, which shifts the median of each chip to zero, the loess normalization effectively shifts the loess curve of the data to zero. The spatial 2D normalization fits a loess surface to the intensities at the x and y spot coordinates. This surface is then subtracted from the pre-normalized values to center the data. This procedure is often used separately on each print-tip group on a chip.

Median Absolute Deviation Normalization (MAD)

Scale normalization attempts to align the variability of the expression intensity across chips. Yang et al. (2001, 2002) suggest that, for scale normalization, a robust estimate such as the median absolute deviation (MAD) may be used. For a collection of numbers $x_1, \ldots, x_n$, the MAD is the median of their absolute deviations from the
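A short worked example of the MAD, assuming the usual definition in which deviations are taken from the median. The numbers are hypothetical and the code uses only base S functions (also valid R); no scaling constant is applied.

```r
# Hypothetical log ratios for one print-tip group, with one outlier.
x <- c(2, 4, 4, 5, 7, 9, 30)

# Median of the absolute deviations from the median (raw, unscaled MAD).
mad.raw <- median(abs(x - median(x)))

# The standard deviation, for comparison, is inflated by the outlier.
sd.x <- sqrt(var(x))
```

Because a single extreme spot barely moves the MAD, it is a safer yardstick than the standard deviation when rescaling print-tip groups or chips.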
180. rations. In combination with the bg.correct.rma background correction method and the quantiles normalization method, this forms the RMA probe-level analysis method of Irizarry et al. (2002, 2003b). The playerout method computes a weighted mean of the PM values based on the method described by Lazaridis et al. (2002).

Summarization in S+ARRAYANALYZER can be done through the Affymetrix Expression Summary dialog, as demonstrated in Chapter 5, An Example: Affymetrix Probe-Level Data. From the command line, summarization can be done as a separate step using the wrapper function computeExprSet on an AffyBatch object, or through the functions generateExprVal.method.<method>. These functions require a matrix of probe intensities, with rows representing probes and columns representing samples. Examples and details of using these functions can be found in the help files; for example, type

> ?generateExprVal.method.avgdiff

The correcting, normalization, and summarization steps can be done in one step using the express and expresso functions. Details on these functions can be found in the help files:

> ?express

and

> ?expresso

Summarization Examples

In these examples, we correct the data for background signals and noise, normalize the data at the probe level, and summarize the probe-level data into one value per gene transcript. We do this all using the expresso function.

Affymetrix
181. re typically represented with a dendrogram showing the hierarchy from all samples to individual samples, or from all genes to individual genes. Genes with obviously non-significant expression values should be omitted from the clustering analyses. Genes included in the clustering analyses may be chosen using the statistical hypothesis tests for differential expression described earlier.

It is important to understand that hierarchical approaches do not directly provide any reliable measure of confidence for clustered expression patterns. A hierarchical clustering method heuristically reorganizes the genes based on its predefined association distance and allocation algorithm, which only aids us in discerning co-expression patterns visually. Therefore, a validation step is required for such hierarchical clustering discoveries before further inference can be drawn. For example, a bootstrapping method can be used for assessing the reliability of clustering classifications of a fixed, known number of groups (Kerr and Churchill 2001).

Hierarchical clustering results are typically summarized with a dendrogram, in which samples or genes are joined in a tree structure where the branches successively join the samples or genes that are most similar. We note that this needs to be interpreted with care, since hierarchical clustering imposes structure whether it is there or
182. rentially expressed. Define $p_{(1)} \le p_{(2)} \le \cdots \le p_{(N)}$, where $N$ is the number of comparisons (genes tested), as the ordered p-values, from smallest to largest, resulting from the statistical tests; $t_{(j)}$, $j = 1, \ldots, N$, as the ordered test statistics, from largest to smallest; and $H_0$ as the null hypothesis (no differential expression). Then the p-value adjustment procedures are defined below.

Bonferroni

The Bonferroni correction is $\tilde p_i = \min(N p_i, 1)$ for each $i$. All genes with adjusted p-values $\tilde p_i$ less than $\alpha$ are significant, with an overall FWER of at most $\alpha$. Note that the raw p-values have simply been multiplied by the number of comparisons.

Hochberg

The Hochberg (1988) step-up correction is $\tilde p_{(i)} = \min_{k = i, \ldots, N} \{ \min((N - k + 1)\, p_{(k)}, 1) \}$. The procedure sequentially computes

$\tilde p_{(1)} = \min(N p_{(1)}, (N-1) p_{(2)}, \ldots, p_{(N)}, 1)$
$\tilde p_{(2)} = \min((N-1) p_{(2)}, (N-2) p_{(3)}, \ldots, p_{(N)}, 1)$
$\ldots$
$\tilde p_{(N-1)} = \min(2 p_{(N-1)}, p_{(N)}, 1)$
$\tilde p_{(N)} = \min(p_{(N)}, 1)$

and stops at the first adjusted p-value $\tilde p_{(i)}$ that exceeds $\alpha$.

Holm

The Holm (1979) step-down correction is $\tilde p_{(i)} = \max_{k = 1, \ldots, i} \{ \min((N - k + 1)\, p_{(k)}, 1) \}$. The procedure sequentially computes

$\tilde p_{(1)} = \min(N p_{(1)}, 1)$
$\tilde p_{(2)} = \max\{ \min(N p_{(1)}, 1), \min((N-1) p_{(2)}, 1) \}$
$\ldots$

and stops at the first adjusted p-value $\tilde p_{(i)}$ that exceeds $\alpha$.

Sidak SS

The Sidak single-step (SS) correction is $\tilde p_i = 1 - (1 - p_i)^N$. All genes with adjusted p-value $\tilde p_i$ less than $\alpha$ are significant, with an overall FWER of at most $\alpha$.

Sidak SD

The Sidak free step-down (SD)
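The Bonferroni, Holm, Hochberg, and Sidak single-step formulas above can be sketched directly with base S functions (also valid R). The function name adjustSketch and the p-values are hypothetical; this is an illustration of the formulas, not the routine the multtest library uses.

```r
adjustSketch <- function(p) {
  N <- length(p)
  o <- order(p)          # positions of p sorted from smallest to largest
  ps <- p[o]
  mult <- N:1            # the factor N - k + 1 for k = 1, ..., N
  stepwise <- pmin(mult * ps, 1)
  # Holm: running maximum from the smallest p-value upward.
  holm.sorted <- cummax(stepwise)
  # Hochberg: running minimum from the largest p-value downward.
  hoch.sorted <- rev(cummin(rev(stepwise)))
  holm <- numeric(N)
  hoch <- numeric(N)
  holm[o] <- holm.sorted   # put results back in the original gene order
  hoch[o] <- hoch.sorted
  list(bonferroni = pmin(N * p, 1),
       holm = holm,
       hochberg = hoch,
       sidak.ss = 1 - (1 - p)^N)
}

# Four hypothetical raw p-values.
p <- c(0.01, 0.04, 0.03, 0.005)
adj <- adjustSketch(p)
```

For any gene, the Hochberg value is never larger than the Holm value, which in turn is never larger than the Bonferroni value, so the three procedures form a ladder of increasing conservatism.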
183. ression values on all chips have equal medians and equal inter-quartile ranges (i.e., the spread between the 25th and 75th percentiles is the same). The data normalization plots can be viewed before and after normalization, or just after, by selecting the appropriate choice. Note that the Probe Set radio buttons are disabled for Affymetrix MAS data. These will be discussed in Chapter 5, An Example: Affymetrix Probe-Level Data.

Click OK or Apply to produce the normalized data and generate the pre- and post-normalization MvA pairs plots and box plots, as shown in Figures 4.11 and 4.12.

Figure 4.11: MvA scatter plot matrix (0 weeks, after medianIQR normalization). Each plot is M vs A for two chips. To determine which chips are used for each plot, go down vertically and left horizontally from the plot to the first chip names you encounter. The numbers in the boxes below the diagonal are values of the interquartile range of M for the pair of chips obtained by going up vertically and right horizontally to the first chip names you encounter.
184. Figure 6.19: Controls versus noncontrols by print-tip group.

We can also plot controls versus noncontrols with a boxplot for each chip as follows:

### Boxplots of controls vs noncontrols
> graphsheet()
> par(mfrow=c(1,2))
> maBoxplot(swirl.raw[controls, ], main="Controls by Print Tip Group", srt=90)
> maBoxplot(swirl.raw[!controls, ], main="Non-controls by Print Tip Group", srt=90)

Figure 6.20 displays the resulting graph.

Figure 6.20: Controls versus noncontrols across chips.

Normalization

The swirl.raw object resulting from the call to read.marrayRaw is an object of class marrayRaw. Removing the controls does not affect the class of the object. The normalization function we use for marrayRaw objects is maNorm. Its arguments are listed here:

mbatch: Object of class marrayRaw containing intensity data for the batch of arrays to be normalized. An object of class marrayNorm may also be passed if normalization is performed in several steps.
norm: Character string specifying the normalization procedures.

The options to the norm argument are:

none: no normalization
median: global median location normalization
loess: global intensity- or A-dependent location normalization using the loess function
twoD: 2D spatial location
185. rmation About a Microarray Experiment, and this information can be entered on the second page of the Import cDNA Data dialog. This information is not required, but it is used in table output and graphics, and thus it is to your advantage to complete the information on this page. Once you've entered MIAME information for any experiment, the first three fields are saved and are filled automatically the next time you open this dialog. This dialog is shown in Figure 6.8.

Figure 6.8: Entering chip information in the MIAME page.

Variable Selection & Filtering Page

The third page in the Import cDNA Data dialog is for variable and row selection. There are two required fields on this page, Red Foreground and Green Foreground, where you must select the columns containing red and green foreground intensities, respectively. Select the columns from the drop-down list of variable names, which is populated using the column headers in the imported data files.
186. rol rows from the expression data frame.

### Remove controls
> LCG <- LCG[-controls, ]

We normalize the intensity values by adjusting within-chip expression values to a common median and inter-quartile range using the medianIQR.norm function.

> LCG.N <- data.frame(medianIQR.norm(LCG))

Compute a summary of the resulting logged and normalized expression intensities as follows:

### Summarize Non-controls
> summary(LCG.N)
         CGa         CGb         CG24a       CG24b
Min.     0.5415185   0.000000    0.5071245   0.5725788
1st Qu.  3.2774186   3.277464    3.2121624   3.1813881
Median   6.6429682   6.642968    6.6429682   6.6429682
Mean     6.0522140   5.964245    6.0087211   5.9970227
3rd Qu.  8.8044455   8.804491    8.7391893   8.7084150
Max.     14.6903453  15.151559   15.1604125  15.1399289
NA's     8.0000000   8.000000    8.0000000   8.0000000

Note the missing values in the summary output. We'll have to take care of those before doing differential expression testing, but first let's plot normalized and unnormalized data for comparison.

### Before and after normalization boxplots
> par(mfrow=c(1,2))
> boxplot(LCG, style.bxp="att")
> title("Before Normalization")
> boxplot(LCG.N, style.bxp="att")
> title("After Normalization")

The resulting boxplots are displayed in Figure 4.22.
187. rovides the basic steps of differential expression analysis in an intuitive graphical user interface (GUI). There are dialogs for each of the following:

• Importing
• Summarizing
• Normalizing
• Differential expression testing

At each step of the analysis, an S-PLUS object is saved, which becomes the starting point for the next step. You can return to any step as often as you wish to try different options while refining your analysis. In addition, HTML output tables and graphics are hyperlinked to annotation information in the NCBI GenBank or LocusLink databases. The normalization and testing dialogs include options for plotting.

The GUI includes most of the functionality of the Bioconductor project, but not all. The following is a partial list of what is included in S-PLUS but not presented through the GUI:

• Receiver operating characteristic curves
• Variance stabilization
• Hexbinning

Also, many standard S-PLUS functions can be used for microarray data. Some examples include the following:

• Clustering of various types
  • Hierarchical
  • Partitioning
  • Model based
• ANOVA models
• Linear mixed effects models

In the User's Guide, we detail examples of many of these additional techniques from the command line, so you can get a better understanding of what is possible in S-PLUS when analyzing microarray data.

The S+ARRAYANALYZER Interface

S+ARRAYANALYZER has a GUI which extends the S-PLUS GUI. When you load the module, a new ArrayAnaly
188. rved (Snf2) or one un-conserved (Swi1) component, in minimal and rich media. The experiment is described in Sudarsanam et al. (2001), and the data are available at http://genome-www.stanford.edu/swisnf. The yeast wildtype was run as the reference channel on each array, and there were three replicates of each experimental condition, for a total of 12 chips and 24 channels of gene expression data.

The analysis we present mainly follows that presented by Wolfinger et al. (2001). The two-stage analysis comprises a simple normalization model (simple ANOVA removal of the overall mean) and a mixed model fit using lme to model gene expression. All 10 contrasts, of wildtype vs. each combination of media and strain (4) and contrasts among the media and strain combinations (6), are investigated, and p-values for each contrast with respect to each gene are obtained using within-gene error (8 d.f.).

To run this analysis, you first need to download all 12 datasets (.txt files) from the URL listed above to a directory. These datasets are snf2ypda.txt, snf2ypdc.txt, snf2ypdd.txt, snf2mina.txt, snf2minc.txt, snf2mind.txt, swi1ypda.txt, swi1ypdc.txt, swi1ypdd.txt, swi1mina.txt, swi1minc.txt, and swi1mind.txt. We use the function readScanAlyzeData to read in the data. This is available once you load the S+ARRAYANALYZER module. We need to create the helper functions sudarsanamArrayFun, sudarsanam
189. s for importing data, as well as a very general facility for importing data through the GUI. It is worth spending a little time importing data through the S-PLUS GUI because the facility is quite general and easy to use. To import a data file through the GUI, go to File > Import Data > From File. When the dialog opens, select the File Format and then browse for files. Figure 4.19 shows the Data Specs page of the S-PLUS Import From File dialog. You specify the name of the saved data object in the To Data Set field. If there is any header information in the file, you need to specify Start row so the header information is skipped. You do that on the Options page of the Import From File dialog. Figure 4.20 shows the location of the Start row field (see the position of the mouse cursor).

Figure 4.19: Importing data through the S-PLUS GUI.

Command Line Import

The command-line equivalent to the Import From File dialog is the importData function. The critical arguments are:

file: a character string specifying the name of the file to import
type: a character string specifying the type of file to import. Possible val
190. s from microarray analysis, such as summary gene lists and volcano plots, are presented using S-PLUS Graphlets, which facilitate interactive annotation of result summaries and allow you to share results via the Web.

MICROARRAY DATA

DNA microarrays are now widely used as a key experimental platform in drug discovery (e.g., functional genomics) and drug candidate evaluation (e.g., toxicogenomics). Their utility lies in the ability to simultaneously quantify the relative activity, or differential expression, of many genes under different biological conditions. Some common uses of microarray experiments are to:

• Classify diseases and their subtypes
• Identify and validate new targets for drug discovery
• Improve understanding of biological processes
• Evaluate drug candidates against drugs with known toxic side effects
• Develop personalized treatment plans tailored to genotypes

It is not our intention to discuss in depth the biology of microarrays. If you are new to this area, you should investigate the references listed at the end of the chapters in this manual, as most chapters provide references with detailed information. We give here a brief overview for those new to the area.

A microarray consists of a slide with genes, or active segments of genes, attached at spots on a regularly spaced grid. There may be anywhere from a few to tens of thousands of genes spotted on a single mi
191. s optional settings for the LPE estimator. The options are:

Smoother D.F.: The degrees of freedom used by the spline smoother to estimate the baseline variance function for each group. Default is 10.
Number of Bins: Number of bins used to compute variance estimates. These variance estimates, along with an associated average expression intensity, are the data used by the loess smoother to estimate the baseline variance function.
Trim %: Percent of pooled variances to trim from the low end of expression intensity prior to running the loess smoother.

Graph Options

The Graph Options group is a list of check boxes for selecting which graphs you want as output: Volcano Plot, Heat Map, Chromosome Plot, and QQ Norm Plot. The options are described in detail in the section Differential Expression Analysis Plots.

Figure 8.9: The Graph Options group in the LPE test dialog.

Output

The Output group controls where the graphs are displayed and where the gene list table is saved after the testing step is complete.

Display Output in S-PLUS: Displays the selected graphics in an S-PLUS graphic device.
Save Output as HTML: Saves the S-PLUS Graphlet with selected graphs and the significant gene list to HTML files to view later.
Display HTML Output: View the S-PLUS Graphlet with selected graphs in a browser. The displayed Graphlet has a hyperlink to the significant genes table. Points on the Graphlet and en
192. se, resulting in increased Type I and Type II errors. In particular, with the large number of genes on the chip, there will always be genes with low within-gene error estimates by chance, so that some signal-to-noise ratios will be large regardless of mean expression intensities and fold change. The local pooled error test attempts to avert this by combining within-gene error estimates with those of genes with similar expression intensity. In this sense, the LPE approach is similar to the SAM method of Tusher et al. (2001) and the B statistic of Lonnstedt and Speed (2002).

LPE estimates used for differential expression testing are formed by pooling variance estimates for genes with similar expression intensities. The LPE is derived by first estimating the baseline variance function for each of the compared experimental conditions, say U and V. For example, when duplicated arrays $U_1$, $U_2$ are used for condition U, the variance of M (expressed as $U_1 - U_2$) on each percentile range of A (expressed as $(U_1 + U_2)/2$) is evaluated. When there are more than duplicates, all pairwise comparisons are pooled together for such estimation. A non-parametric local regression curve is then fit to the variance estimates on the percentile subintervals (refer to Figure 8.15 as an example). The baseline variance function for condition V is similarly derived, and the LPE test statistic for comparison of median log intensities between the two samples is
193. se built-in features:

• Microarray data import (Affymetrix MAS and CEL, and cDNA chip data)
• Data normalization
• Differential expression analysis using the LPEtest and multtest libraries
• Chromosome plot creation
• S-PLUS Graphlets for annotation and exchanging information among researchers

For the latest information and support on S+ARRAYANALYZER, go to http://www.insightful.com/support/ArrayAnalyzer. This contains information regarding Insightful efforts in the genomics and bioinformatics space.

Libraries

There are 13 S-PLUS libraries in S+ARRAYANALYZER to assist in your analysis: affy, annotate, Biobase, edd, geneplotter, genefilter, LPEtest, marrayClasses, marrayInput, marrayNorm, marrayPlots, multtest, and ROC. These are loaded automatically when you load S+ARRAYANALYZER. The following shows a few examples of how S+ARRAYANALYZER can process your data:

• Analysis of Affymetrix data uses the Biobase and affy libraries for reading and normalizing data.
• Differential expression analysis uses the multtest and LPEtest libraries, and annotation is completed using the genefilter, geneplotter, and annotate libraries.
• Input and normalization of custom cDNA data uses the marrayClasses, marrayInput, marrayNorm, and Biobase libraries.

12 of the S-PLUS libraries in S+ARRAYANALYZER are based on the Bioconductor libraries, and information on these libraries is located at http://www.bioconductor.org. The oth
194. sic types of microarrays in the User's Guide. Here we introduce them briefly to aid in understanding the examples that follow.

Affymetrix Arrays

Affymetrix GeneChip microarrays represent each gene with an oligonucleotide 25-mer probe spotted at typically 11-20 pairs of spots (22-40 spots in all). Each probe pair consists of a spot for the probe, called a perfect match (PM), and a spot for a slight alteration of the probe, called a mismatch (MM). Non-specific binding may be accounted for by adjusting PM intensities to account for MM expression intensities.

Figure 2.3: Affymetrix's GeneChip is a one-color oligonucleotide array. Mass-produced, reliable, standardized microarrays like the GeneChip have fueled the bioinformatics revolution.

To analyze Affymetrix expression data, all the expression values for each probe pair are first summarized by a single value. There are numerous ways to do this. Affymetrix provides methods for such probe-level summarization in their MAS4 and MAS5 software. S+ARRAYANALYZER provides other methods for probe-level analysis, which are discussed in depth in the Us
195. stData, method = "reml")
> normalizationModelVarComp
variances:
    array  array:strain  Residuals
 1.869852    0.02984192   4.221948

3. Save the residuals for the gene model fits:

> yeastData[, "Residuals"] <- residuals(normalizationModel.LME)

The gene model is then fit to the residuals from the normalization model as

$$r_{ijk} = \mu + S_i + T_j + \gamma_{ijk}$$

where $r_{ijk}$ denotes the residuals from the normalization model, $\mu$ represents an overall mean value, $S$ is the main effect for spots (random effect), $T$ is the main effect for treatments, and $\gamma$ is stochastic error. We assume $S \sim N(0, \sigma_S^2)$ and $\gamma \sim N(0, \sigma_\gamma^2)$.

Differential Expression Analysis for Experiments with More than Two Experimental Conditions

The spot effect (array-by-gene interaction) models spot-to-spot variability. Note that the normalization and gene models have their own stochastic error components, $\varepsilon$ and $\gamma$. In the normalization model, the $\varepsilon$'s are assumed to have a constant variance, while in the gene model the $\gamma$'s represent within-gene variances. While specifying separate within-gene variances is a safe assumption biologically, in many real-world situations there may not be enough replicates, and hence degrees of freedom (d.f.) on error, to provide reliable estimates of within-gene variances. In this case there are 8 d.f. on error, 11 d.f. for spots, and 4 d.f. for treatments. Inference based on t-tests and F-tests (described below) is thus based on 8 d.f. for error, which is minimall
196. t data frame adjp
row.names(diffExpr) <- sw.probes[index]

### Bonferroni adjustment
MultSumm <- multtest.graphlet(M, diffExpr[, 2], diffExpr[, 1], index, testStat[, 1], foldChange, fwer = 0.05, procedure = "Bonferroni", chip.name = NULL, volcano.plot = T, heatmap.plot = T, chromosome.plot = F, html.output = F, qqnorm.plot = T, summary.name = "MultSumm", open.browser = F)

### BH adjustment
MultSumm <- multtest.graphlet(M, diffExpr[, 3], diffExpr[, 1], index, testStat[, 1], foldChange, fwer = 1, procedure = "BH", chip.name = NULL, volcano.plot = T, heatmap.plot = T, chromosome.plot = F, html.output = F, qqnorm.plot = T, summary.name = "MultSumm", open.browser = F)

Chapter 6 An Example: Two-Color cDNA Data

PRE-PROCESSING AND NORMALIZATION

Introduction
Normalization
Technical Sources of Variability
Why do We Normalize Data
Normalizing in S+ArrayAnalyzer
Ideas in Normalization
Normalizing to One Point
Normalizing to Many Points
Cautions in Normalizing
Workflow
Diagnostic Plots
Box Plots
MVA Scatter Plots
Chip Specific Plots
Normalization Methods for cDNA Data
Normalizing With the GUI
Notes For Command Line Users
cDNA Diagnostic Plots
Location and Scale Normalization
maNormMain Function
Normalization Functions maNorm, maNormScale
Pre-Processing And Normalization For Affymetrix Probe-Level Data
CDF Libraries
Affymetrix Diagnostic Plots
Background Correction
PM Correct Methods
Normalization
197. t observed differences in fluorescence intensities are due to differential expression and not experimental artifacts, and should be done before any analysis that compares gene expression levels within or between arrays. Effects that have been consistently removed by normalization include differential nonlinear hybridization of the two channels in two-color cDNA arrays, biases in DNA spotting due to eroded print tips, and spatial variability of signal in regions of an array.

cDNA microarrays developed within research organizations are subject to variability in all of the preparation phases, e.g., amplification, purification, and concentration of DNA clones; the amount of DNA spotted; the binding of the DNA to the array; the shape and size of the spot; and dye quality and labeling. There are several environmental factors at play during hybridization and scanning, including temperature, humidity, non-specific binding, and washing conditions. The scanning process is complex, with higher intensities giving higher signals but leading to saturation at the high end, while lower intensities remove saturation but miss signal on the low end. Imaging algorithms are likewise complex, with significant segmentation issues involved in the separation of signal from background (see Yang et al., 2001).

Commercially manufactured oligonucleotide arrays have their own variability issues, including those described above. Affymetrix, the market leader, has made a considerable ef
198. t al. (2000). Note that Alizadeh et al. (2000) focused their attention on B-cell differentiation genes based on visual examination of hierarchical cluster analysis and heat map visualization of 96 samples run on arrays of more than 10,000 genes. We do not recommend this qualitative approach; rather, we suggest genes be included in cluster analyses based on their differential expression according to a reliable statistical hypothesis testing procedure.

We provide two analyses of the subset of data presented in Figure 3a of Alizadeh et al. (2000), using hierarchical and partitioning cluster routines. We actually use the data as summarized by Cluster (Eisen et al., 1998) and prepared for viewing in TreeView (Eisen et al., 1998). Note that this is not the actual raw data. We treat it as raw data to show the cluster methods in S-PLUS, but the resulting output should not be directly compared with Figure 3a of Alizadeh et al. (2000). Our hierarchical cluster method (hclust) uses a group-average between-cluster dissimilarity measure. The partitioning method uses the partitioning around medoids method (pam).

We begin by importing the data described above. The data can be downloaded from http://llmpp.nih.gov/lymphoma/data.shtml.

Clustering Microarray Data using S-PLUS

From the Figure 3 link, download the file named figure3a.cdt into the splus61/module/ArrayAnalyzer/examples directory. The first two lines in the following code will import the data and crea
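The group-average hierarchical clustering step described above can be sketched on a toy matrix. This is a minimal illustration rather than the analysis of the Alizadeh data: the matrix toy and its values are made up, and the call assumes that hclust accepts a distance object as its first argument together with method = "average", as it does in R and, to our understanding, in S-PLUS.

```r
# Toy expression matrix: 6 hypothetical samples (rows) by 3 genes (columns).
# Rows 1-3 and rows 4-6 are deliberately far apart.
toy <- matrix(c(0.1, 0.2, 0.0,
                0.2, 0.1, 0.1,
                0.0, 0.3, 0.2,
                3.1, 2.9, 3.0,
                3.0, 3.2, 2.8,
                2.9, 3.0, 3.1), nrow = 6, byrow = TRUE)

# Euclidean distances between samples, then group-average linkage
toy.dist <- dist(toy)
toy.hclust <- hclust(toy.dist, method = "average")

# The order component gives the left-to-right arrangement of the dendrogram leaves
print(toy.hclust$order)
```

Because the two groups of rows are well separated, the leaf order keeps rows 1-3 together and rows 4-6 together; which group appears first depends on the ordering rule of the software.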
199. t changing the factor level name in one place changes all the factor level names with the same name. Enter 0hr for one of the A1 levels and enter 24hr for the A2 level (the third or fourth row), as shown in Figure 5.3.

[Screenshot: File Selection page of the Import Affymetrix Data dialog, with a single-factor, 2-level design with 2 reps and empty File Name fields]

Figure 5.3: Setting the factor levels to 0hr and 24hr.

Chapter 5 An Example: Affymetrix Probe-Level Data

Selecting Files

To associate data files with the design points, right-click in a Filename field and then click the Browse for File button, as in Figure 5.4.

[Screenshot: File Selection page of the Import Affymetrix Data dialog, with data files associated with the 0hr and 24hr factor levels]
200. t2 <- hclust(dist = dist2, method = "average")
# plot heatmap and dendrograms
par(mai = c(0, 0, 0, 0), omi = c(0.2, 7, 1.4, 1.1))
image(cmat[hclust2$order, hclust1$order], axes = F, bty = "n")
par(new = T, omi = c(6.55, 2.75, 0, 1.15))
plclust2.fn(hclust2, cex = 1, rotate.me = F, lty = 1, colors = sample.colors[hclust2$order])
par(new = T, omi = c(0.02, 0.95, 1.42, 7.75))
plclust2.fn(aliz.hclust1, cex = 0.1, rotate.me = T, lty = 1)
}

This function could then be called to analyze the Alizadeh data as follows. Of course, the function could do with some error checking if it were planned to be used by others.

> cluster.heat(mat3a, sample.colors = c(rep(6, 16), rep(1, 2), rep(6, 6), rep(5, 23)))

Note that this function, and the function plclust2.fn, which allows a dendrogram to be rotated and easily laid next to a heat map, are included in the S+ARRAYANALYZER module.

Results of the hierarchical cluster analysis are presented below in Figure 9.2. Overall, we produce a clustering result similar to that of Alizadeh et al. (2000). Note that the data we worked with are not the actual raw data. We treat it as raw data to show the cluster methods in S-PLUS, but the resulting output should not be directly compared with Figure 3a of Alizadeh et al. (2000). We have ordered the columns based on the default rule in S-PLUS, namely that at each merge the subtree with the tightest cluster is placed to the left. This is the opposite of the ordering used by the package Cluster (Eisen et al., 1998), which w
201. te a data frame, which we call mat3a.

[Screenshot: S-PLUS data window showing the mat3a data frame, with genes (row names of the form GENE...X) in rows and DLCL samples in columns]
202. tering options to eliminate rows (genes) which have too few good spots in the summary calculations, and to eliminate control spots.

[Screenshot: Variable Selection & Filtering page of the Import Affymetrix Data dialog, with Data Column Names fields (Probe Name, Expr Intensities), an Apply Log Base 2 option, and Remove Rows fields (Column Name, If Less Than, Control Prefix)]

Figure 3.4: The Variable Selection & Filtering page in the Import Affymetrix Data dialog allows you to choose columns for gene expression and gene name. For Affymetrix MAS4/5 data, these fields are filled automatically.

Import Data

Import cDNA Data

The Import cDNA Data dialog for cDNA data is similar to the Import Affymetrix Data dialog in Figures 3.2, 3.3, and 3.4, providing many of the same options. In addition, for cDNA data there is a dialog for specifying the microarray layout.

[Screenshot: Create Layout dialog, with Layout File, Save Chip Layout, Grid Rows and Grid Columns (4 each), Spot Rows (22), Spot Columns (24), and control column fields]

Figure 3.5: The Create Layout dialog for cDNA arrays allows you to specify the layout of the chip and select columns containing control information and probe names.

Chapter 3 GUI Overview

AFFYMETRIX EXPRES
203. than Bonferroni and controls the false discovery rate rather than the family-wise error rate. See Chapter 8, Differential Expression Testing, for more detail. There are three options in the Graph Options group to display any of the following: a volcano plot, a heat map, or a normal quantile-quantile plot (QQ-Norm plot). Clicking OK or Apply produces the output plots, which are discussed in the following section.

Volcano Plot

A volcano plot displays the logarithm of p-value versus fold change, as shown in Figure 6.16. The vertical lines indicate fold change values of plus or minus two, and the horizontal line indicates a significant LPE test p-value after doing the Bonferroni correction. Points located in the lower outer sextants are those with large absolute fold change and small (significant) p-value. Each of those points is active, so you can click an individual point to access annotation information from LocusLink or GenBank.

[Graphlet window: volcano plot with Mean Log2 Fold Change on the horizontal axis]

Figure 6.16: A volcano plot, which is the logarithm of p-value versus fold change.

Heat Map

A heat map plot shows a two-way layout of the most differentially expressed genes along the vertical axis versus the experimental conditions on the horizontal axis
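The two cutoffs drawn on the volcano plot can be sketched numerically. The following is only an illustration under stated assumptions: the fold changes and p-values are made up, the vertical lines are taken to mark a twofold change (plus or minus 1 on the log2 scale), and the horizontal line is taken to sit at the Bonferroni threshold of 0.05 divided by the number of genes tested.

```r
# Hypothetical log2 fold changes and raw p-values for five genes
log2.fc <- c(-2.3, 0.4, 1.7, -0.2, 1.1)
raw.p <- c(0.00001, 0.30, 0.004, 0.80, 0.02)

alpha <- 0.05
n.genes <- length(raw.p)

# Bonferroni: a raw p-value is significant if it is at most alpha / n.genes
p.cut <- alpha / n.genes

# Assumption: the vertical lines mark a twofold change, i.e. +/-1 on the log2 scale
fc.cut <- 1

# Genes in the outer sextants beyond the significance line
flagged <- (abs(log2.fc) >= fc.cut) & (raw.p <= p.cut)
print(data.frame(log2.fc, raw.p, flagged))
```

With these values, the first and third genes are flagged. The fifth gene has a large enough fold change, but its p-value does not survive the Bonferroni cutoff.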
204. the Affymetrix chip name. S+ARRAYANALYZER has pre-loaded the gene annotation information for chips hgu95a, hgu95av2, and hgu133a. If you are using other chips, you may want to refer to the Appendix, S+ARRAYANALYZER Data Libraries, to see how to load the annotation information for your chip.

Saving the Data Object

To save the data object, type a name in the Save As field in the lower right corner of the dialog. Remember this name, as it is used in the next step to normalize the expression data. For our example, enter MouseSwimExprSet as the object name.

Figure 4.6: Saving the imported data as MouseSwimExprSet.

Saving the Design

Once you've entered all the information on this tab, you can save it for later use by clicking the Save Design button at the top of the dialog. A .txt file is written to the directory of your choice with the number of factors, number of levels, repetitions, and the full-path file names and their associated factor levels.

Reading Designs

This design file can be reused for another experiment with the same design by modifying the file locations and names, and the factor levels, as needed. In fact, if you have many chips in your experiment, you can create a file with all the design content and read it with the Read Design button, which will set the reps indicator and fill the file name fields and their associated factor levels.

Importing Data

MIAME Page

MIAME is an acronym for Minimal Information Abo
205. ting Dialog Input
Differential Expression Analysis Plots
Common Plots
Differential Expression Summary Table Output
Top 10 List
Complete Gene List
References

Chapter 8 Differential Expression Testing

INTRODUCTION

Differential expression testing in S+ARRAYANALYZER is defined as statistical testing of the difference in expression intensities between the treatment conditions under study. Effectively, this means that a separate hypothesis test is computed for each gene or probe on the chip. In two-sample problems (e.g., two tissue types, or treatment vs. control), this boils down to a t-test or other two-sample test for each gene. When computing tests for chips with many probes, setting a usual Type I error (false positive) rate for individual tests will result in many false positives. One key ingredient to good expression testing is controlling the family-wise error rate (FWER) or false discovery rate (FDR). There is a rich historical literature on this topic in statistical journals and texts; see, for example, Hochberg and Tamhane (1987), Westfall and Young (1993), or Hsu (1996). The net effect of poor FWER control is wasted time and money in discovering many genes that really aren't differentially expressed. This topic is discussed in more detail in the section Controlling The False Positive Rate. Another key ingredient to
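The difference between FWER and FDR control can be made concrete with a hand computation of Bonferroni and Benjamini-Hochberg (BH) adjusted p-values. This is a sketch in plain S with made-up raw p-values; it does not call the multtest library, whose own functions should be preferred in a real analysis.

```r
# Hypothetical raw p-values from five per-gene tests
raw.p <- c(0.0001, 0.004, 0.019, 0.03, 0.2)
n <- length(raw.p)

# Bonferroni (controls the FWER): multiply by the number of tests, cap at 1
bonf.p <- pmin(1, n * raw.p)

# Benjamini-Hochberg (controls the FDR): scale the i-th smallest p-value by n/i,
# then enforce monotonicity working down from the largest p-value
ord <- order(raw.p)
bh.sorted <- pmin(1, rev(cummin(rev(n * raw.p[ord] / (1:n)))))
bh.p <- numeric(n)
bh.p[ord] <- bh.sorted

print(cbind(raw.p, bonf.p, bh.p))
```

With these values, two genes have an adjusted p-value of at most 0.05 under Bonferroni and four under BH, which illustrates why BH is described as the less conservative procedure.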
206. tion data files obtained from image analysis software or databases.

Reading Layout Information From The Command Line

We'll step through an example using each of the functions to give you a flavor of their use. They are listed in typical order of use.

Layout files describe the structure of the microarray. They include information on the arrangement of the spots (e.g., the number of rows and columns on the array), where each gene is located, which spots are control spots, etc. A typical layout file is the file fish.gal, provided as an example for the swirl data. It has 21 lines of header information and then starts a data table. The file is located in the examples directory of the ArrayAnalyzer module. You can find the location of the file by doing:

> AApath <- file.path(getenv("SHOME"), "module", "ArrayAnalyzer", "examples")
> AApath
C:\PROGRAM FILES\INSIGHTFUL\splus61\module\ArrayAnalyzer\examples

By scanning the file in Notepad or WordPad, you can see that the spots are arranged in 16 blocks (a 4 x 4 grid) of 22 rows and 24 columns. Note that the ID column (column 4) indicates which spots are controls. We are now ready to read in the layout file:

> swirl.layout <- read.marrayLayout(file.path(AApath, "fish.gal"), ngr = 4, ngc = 4, nsr = 22, nsc = 24, skip = 21, ctl.col = 4)
> swirl.layout
Array layout: Object of class marrayLayout.
Total number of spots: 8448
Dimensions of grid matrix: 4 rows by 4 co
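The spot count reported for the layout follows directly from the four layout arguments. The arithmetic below reproduces it; the last step, which assigns a print-tip group to each spot, assumes that spots are stored one print-tip group after another, which is an illustrative assumption and not something the layout file description above guarantees.

```r
# Layout of the swirl arrays as passed to read.marrayLayout
ngr <- 4    # grid rows (rows of print-tip groups)
ngc <- 4    # grid columns
nsr <- 22   # spot rows within each print-tip group
nsc <- 24   # spot columns within each print-tip group

n.groups <- ngr * ngc
spots.per.group <- nsr * nsc
n.spots <- n.groups * spots.per.group
print(c(n.groups, spots.per.group, n.spots))

# Assumption: spots are stored print-tip group by print-tip group
group.id <- (0:(n.spots - 1)) %/% spots.per.group + 1
print(table(group.id))
```

The total of 16 groups of 528 spots each matches the 8448 spots printed for the swirl layout object.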
207. to GenBank and LocusLink annotation information on the Web. To create the links, the annotation library for the chip being used must be installed. The annotation libraries are not attached when the S+ARRAYANALYZER module is loaded. However, if you request plots that need the annotation data, the plotting functions attempt to load a library: the function pastes AnnoData on to the chip name and tries to attach a library with that name, if it exists. If it cannot find a library with that name and the chip name ends in v2, then it attempts to attach the library with the chip name, minus the v2, pasted with AnnoData. If both of those fail, a message that the annotation data cannot be found is printed.

If the libraries are installed in the library directory under the top-level S-PLUS installation directory (type getenv("SHOME") at the S-PLUS command line to find your S-PLUS installation directory), then they are automatically loaded when needed by S+ArrayAnalyzer.

Example

For example, if you are working with the mgu74a chip:

1. Find the file mgu74aAnnoData.zip under DataLibs/AnnotationLibs on the S+ARRAYANALYZER CD or from the S+ARRAYANALYZER Web site above, and copy it to your computer.

Appendix S+ARRAYANALYZER Data Libraries

2. Unzip mgu74aAnnoData.zip into SHOME/library. The directory contains the files README.txt and DESCRIPTION, and a Data folder. When annotation information is requested in plots, this library is automatically loaded.
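The library lookup rule described above can be sketched as a small function that lists, in order, the library names that would be tried for a given chip. The function name anno.candidates is hypothetical and is not part of S+ARRAYANALYZER; it only mirrors the naming rule in the text.

```r
# Hypothetical helper: candidate annotation library names for a chip,
# in the order in which the text says they are tried
anno.candidates <- function(chip.name) {
    first <- paste(chip.name, "AnnoData", sep = "")
    n <- nchar(chip.name)
    # If the chip name ends in "v2", also try the name without the "v2"
    if (n > 2 && substring(chip.name, n - 1, n) == "v2") {
        second <- paste(substring(chip.name, 1, n - 2), "AnnoData", sep = "")
        return(c(first, second))
    }
    first
}

print(anno.candidates("mgu74a"))
print(anno.candidates("hgu95av2"))
```

For mgu74a the only candidate is mgu74aAnnoData, the library installed in the example above. For hgu95av2 the candidates are hgu95av2AnnoData followed by hgu95aAnnoData.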
208. tor of adjusted p-values
3. the vector of raw p-values
4. the order vector from mt.rawp2adjp, for sorting from most differentially expressed to least
5. the vector of fold change values
6. the family-wise error rate (default 0.05)
7. the p-value adjustment procedure
8. the chip name (e.g., hgu95a)

The rest of the arguments are mostly for controlling output: which plots are created, whether it should be HTML output or not, the smoothing parameters for the loess smoother used in the variance plots, etc. See the help file for lpetest.graphlet for more detail. The resulting plot is a Java graphlet in an S-PLUS graphics device. Note that HTML output is turned off (html.output = F). The graphlet displayed in S-PLUS and the browser are the same, except that points are not linked to annotation databases in the S-PLUS display.

The output of the lpetest.graphlet function contains information for annotation and for ranking the genes based on the magnitude of differential expression:

> LPESumm[1:10, 1:5]
            foldChange rawp adjp Locus.Link  Acc.Num
31498_f_at    3.804011    0    0       2578   U19147
31609_s_at    1.440115    0    0       5118   L33799
31682_s_at    3.623148    0    0       1462   D32039
31953_f_at    3.927491    0    0       2575   U19144
31954_f_at    4.242013    0    0       2579 AA447559
31960_f_at    3.908833    0    0       2574   U19143
32395_r_at    1.786837    0    0       9349   X55954
33671_f_at    3.577786    0    0       2576   U19145
33680_f_at    3.323057    0    0       2579 AF058988
32314_g_at    3.720782    0    0       7169   M12125

Chapter 4 An
209. tries in the significant gene list are automatically hyperlinked to LocusLink and UniGene annotation databases.
• Save Summary As: Name used for saving the S-PLUS data frame containing the complete gene list, including test statistics and p-values. The default name is LPESumm.

Output Files

There are two output files generated when you select Save Output as HTML:

1. A significant genes summary table. The name of this table is generated by taking the name supplied in the Save Summary As field and then adding LPEOutSumm.html. The default supplied name is LPESumm, so the default output table name is LPESummLPEOutSumm.html.
2. An S-PLUS Graphlet with selected graphs. The name of the output file is generated by taking the name supplied in the Save Summary As field and then adding LPEOut.html. The default Graphlet name is LPESummLPEOut.html.

Location of Output Files

The location of these output files is determined by your S-PLUS working directory. To determine your working directory, just type:

> getenv("S_WORK")
D:\arrayanalyzer\users\lenk\test

The location of dumped files in general is the default S-PLUS working directory. If you specify no project folder when you start S-PLUS, your cmd directory is the default working directory:

> getenv("S_WORK")
D:\Program Files\Insightful\splus61\cmd

You should see two HTML files in your working directory when S
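The naming rule for the two HTML files is simple string concatenation, sketched below. The helper name lpe.output.files is hypothetical and only restates the rule given above; it does not write any files.

```r
# Hypothetical helper: file names produced for a given Save Summary As name
lpe.output.files <- function(summary.name = "LPESumm") {
    c(table = paste(summary.name, "LPEOutSumm.html", sep = ""),
      graphlet = paste(summary.name, "LPEOut.html", sep = ""))
}

print(lpe.output.files())            # default summary name
print(lpe.output.files("SwimLPE"))   # made-up summary name
```

With the default name, this gives LPESummLPEOutSumm.html for the table and LPESummLPEOut.html for the Graphlet, as stated above.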
210. ts to read.marrayInfo are:

info.id: gene ID numbers and associated labels.
labels: the column number in fname which contains the names that the user would like to use to label spots or arrays (e.g., for default titles in maPlot).
skip: the number of lines of the data file to skip before beginning to read data.

We can now read in the data. The four swirl files are located in the examples directory of the ArrayAnalyzer module, like the other file read in the previous section. The function read.marrayRaw takes the following arguments:

fnames: a vector of character strings containing the file names of each spot quantification data file. These typically end in .spot for the software Spot or .gpr for the software GenePix.
path: a character string representing the data directory. By default this is set to the current working directory. In the case where fnames contains the full path name, path should be set to NULL.
name.Gf: character string for the column header for green foreground intensities.
name.Gb: character string for the column header for green background intensities.
name.Rf: character string for the column header for red foreground intensities.
name.Rb: character string for the column header for red background intensities.
name.W: character string for the column header for spot quality weights.
layout: object of class marrayLayout, containing microarray layout parameters.
gnames: object of class marrayInfo c
211. ues are listed here; the case of the character string is ignored.
startRow: an integer specifying the first row to be imported from the data file. This argument is available only when importing data from a spreadsheet.

To see descriptions of other arguments, check the help file:

> help(importData)

Data Manipulation

For the melanoma data, we do a series of four importData commands to read the four files.

[Screenshot: Import From File dialog, showing the Start row field among the import options]

Figure 4.20: Setting the Start row option of the Import From File dialog.

> cga <- importData("0hA.xls", type = "Excel")
> cgb <- importData("0hB.xls", type = "Excel")
> cg24a <- importData("24hA.xls", type = "Excel")
> cg24b <- importData("24hB.xls", type = "Excel")

The resulting objects are data frames, so you can do whatever you want to do in the way of data summaries and exploratory plots. First, we'll take care of some organizational details, such as converting all column names to lower case to make typing easier.

### Change column names to all lower case
> names(cga) <- casefold(names(cga))
> names(cgb) <-
212. Figure 8.15: Plots of the local pooled error variance within treatments versus the overall intensity within treatments.

DIFFERENTIAL EXPRESSION SUMMARY TABLE OUTPUT

In addition to graphical output, the differential expression testing functions generate summary tables ordering the gene list from most to least differentially expressed. The Graphlet output provides a list of the top 10 genes, but you get the complete gene list as well.

Top 10 List

The 10 genes with the lowest p-values are displayed in the Graphlet with the volcano plot, heat map, and other plots.

[Graphlet window: Summary Output for Mult Test with BH Adjustment. The Top 10 Genes table has columns Test Statistic, Raw p-Value, Adj p-Value, and Fold Change, and lists probe sets 35704_at, 37023_at, 33532_at, 37712_g_at, 31979_at, 1837_at, 41848_f_at, 1984_s_at, 41231_f_at, and 36250_at.]

Figure 8.16: 10 most differentially expressed genes.

Complete Gene List

The complete gene list is saved in an S-PLUS object. For more details, see section Output in section Multiple Comparisons Testing Dialog Input and in section LPE Testing Dialog Input in this chapter. You can access the gene list in three different ways
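Extracting a top 10 list from a complete gene list amounts to sorting on the adjusted p-value. The sketch below uses a made-up data frame with hypothetical probe names and values; the column names are illustrative and are not necessarily those of the summary object saved by the dialog.

```r
# Hypothetical complete gene list with 12 probes
gene.list <- data.frame(
    probe = c("p01", "p02", "p03", "p04", "p05", "p06",
              "p07", "p08", "p09", "p10", "p11", "p12"),
    adj.p = c(0.20, 0.001, 0.5, 0.03, 0.0004, 0.9,
              0.07, 0.011, 0.6, 0.15, 0.002, 0.33))

# Order from most to least differentially expressed (smallest adjusted p first)
ranked <- gene.list[order(gene.list$adj.p), ]

# Keep the first 10 rows
top10 <- ranked[1:10, ]
print(top10)
```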
213. ut a Microarray Experiment, and this information can be entered on the second page of the Import Affymetrix Data dialog. This information is not required, but it is used in table output and graphics, and thus it is to your advantage to complete the information in this page. Once you've entered MIAME information for any experiment, the first three fields are saved and are filled automatically the next time you open this dialog. This dialog is shown in Figure 4.7.

[Screenshot: MIAME page of the Import Affymetrix Data dialog, filled in for the swimming mice experiment. Experimenter's Name: Izumo Laboratory. Laboratory: Cardio Genomics, Harvard University. Contact Information: http://cardiogenomics.med.harvard.edu/groups/proj1/pages/swim_home.html. Experiment Title: Exercise Induced Hypertrophy. Experiment Description: Chronic conditioning of mice over 4 weeks, monitoring ventricular mass buildup.]

Figure 4.7: Entering chip information in the MIAME page.

Chapter 4 An Example: Affymetrix MAS Data

Variable Selection & Filtering Page

The third page in the Import Affymetrix Data dialog is for variable and row selection. When reading MAS 4/5 data, this page is automatically filled. The Probe Name and Expr Intensities drop-down fields are for selecting the columns in the data files corresponding to the probe names and expression intensit
214. utants in Saccharomyces cerevisiae. Proc. Natl. Acad. Sci. 97:3364-3369.

References

Tibshirani R, Hastie T, Narasimhan B, Chu G (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 99:6567-6572.

Westfall PH, Young SS (1993). Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons.

Wolfinger RD, Gibson G, Wolfinger E, Bennett L, Hamadeh H, Bushel P, Afshari C, Paules RS (2001). Assessing gene significance from cDNA microarray expression data via mixed models. Journal of Computational Biology 8:625-637.

Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J, Speed T (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30(4):e15.

Yeung KY, Fraley C, Murua A, Raftery A, Ruzzo WL (2001). Model-Based Clustering and Data Transformations for Gene Expression Data. Technical Report 396, Department of Statistics, University of Washington, Seattle, WA.

Chapter 9 Using the S-PLUS Command Line to Analyze Microarray Data

APPENDIX: S+ARRAYANALYZER DATA LIBRARIES

CDF Libraries

When working with probe-level (CEL) data from Affymetrix microarrays, the chip definition format (CDF) information for the particular chip is required. This information is auto-detected for Affymetrix MAS and cDNA data, but it is not for probe-level data
215. with selected graphs and the significant gene list to HTML files to view later.
• Display HTML Output: View the S-PLUS Graphlet with selected graphs in a browser. The displayed Graphlet has a hyperlink to the significant genes table. Points on the Graphlet and entries in the significant gene list are automatically hyperlinked to LocusLink and UniGene annotation databases.
• Save Summary As: Name used for saving the S-PLUS data frame containing the complete gene list, including test statistics and p-values. The default name is MultTestSumm.

Output Files

There are two output files generated when you select Save Output as HTML:

1. A significant genes summary table. The name of this table is generated by taking the name supplied in the Save Summary As field and then adding MultOutSumm.html. The default supplied name is MultTestSumm, so the default output table name is MultTestSummMultOutSumm.html.
2. An S-PLUS Graphlet with selected graphs. The name of the output file is generated by taking the name supplied in the Save Summary As field and then adding MultOut.html. The default Graphlet name is MultTestSummMultOut.html.

Location of Output Files

The location of these output files is determined by your S-PLUS working directory. To determine your working directory, just type:

> getenv("S_WORK")
D:\arrayanalyzer\users\lenk\test

The location of dumped files in g
216. y Layout dialog.

The Create/Modify Layout dialog requires you to fill in the following information:

• Layout File: A layout file name. Enter the path in the Layout File field, or click the Browse button to locate the layout file. The file should be a text file with columns for the gene names and control indicator. In the example, navigate to the fish.gal file located in your splus61/module/ArrayAnalyzer/examples directory.
• Saved Chip Layout: A name for the saved layout object. You need to name the layout object, which can be reused for all arrays with the same layout. Enter swirlLayout as the object name in the Save Chip Layout field, which you can use when we return to the Import cDNA Data dialog.
• Grid: The print-tip group grid size, in rows and columns. The schematic in Figure 6.6 represents a cDNA array. The spots are arranged in large blocks, or print-tip groups. Within each print-tip group are spots where the cDNA is fixed. To specify the array layout, you must specify the size and arrangement of both the print-tip groups (the grid) and the spot matrix within each group. It is assumed that the spot matrix size is the same in each print-tip group.

Figure 6.6: Schematic of a cDNA array layout.

The print-tip grid refers to the layout of the print-tip groups. You specify this layout in terms of rows and columns. In the schematic in Figure 6.6, there are two rows and three columns of print-tip groups. In the swir
217. y acceptable. Note, however, that the wild type is replicated on each chip, so that inference regarding the wild type is more informative than that for the other treatments.

Note that the array and channel effects $A$ and $AT$ may be considered as random effects in the normalization model. Similarly, the spot effects $S$ can be considered random in the gene model, since they represent random amounts of hybridization on each spot. The two-channel cDNA array is actually an incomplete block design with blocks of size 2. While this particular experimental design does not exploit this in a balanced manner, recovery of interblock information is automatic in the lme model fitting, and this is important with the small block sizes of 2. Using residual maximum likelihood estimation (REML), informative estimates of the treatment effects $T$ are obtained in this mixed model formulation.

Note that REML on more than 6000 gene models requires a fair amount of computation. As such, you may want to fit the gene models in batch mode. To do this, write your S-PLUS commands in a text file and then use the BATCH (Windows) or BATCH (UNIX) command to indicate a batch process. In Windows, create a text file with the job and save it as myjob.txt. Then right-click the S-PLUS shortcut and select Properties from the context menu. In the resulting dialog box, go to the Shortcut page and find the field labeled Target. After the command that executes S-PLUS (that is, splus.exe), leave a
218. y print-tip groups if given one chip. Creates one box plot for each chip if given more than one.

maImage(x = "maM", subset = TRUE, col, contours = FALSE, bar = TRUE)
Creates spatial images of shades of gray or colors that correspond to the values of a statistic for each spot on the array. The statistic can be the intensity log-ratio M, a spot quality measure (e.g., spot size or shape), or a test statistic. This function can be used to explore whether there are any spatial effects in the data, for example, print-tip or cover-slip effects. Creates a plot for the first chip given.

maDotPlots(data, x = list("maA"), id = "ID", pch, col, nrep = 3)
A dot plot showing the values of replicated control genes.

Chapter 7 Pre-Processing and Normalization

Location and Scale Normalization

Loess Normalization

# Additional examples of diagnostic plots

# Boxplots of all chips in the swirl dataset (pre-normalized swirl dataset)
> maBoxplot(swirl)

# Boxplots of pre-normalization red foreground intensities for each grid row for the Swirl 81 array
> maBoxplot(swirl[, 1], x = "maGridRow", y = "maRf", main = "Swirl array 81: pre-normalization red foreground intensity")

# MvA plot of chip 93, overlaid with loess curves for each print-tip group (default)
> maPlot(swirl[, 3])

# Image plots of chip 81 (the first chip in the object)
> maImage(swirl, x = "maGf", green foregr
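The intensity log-ratio M and the average log intensity A shown in these diagnostic plots can be computed directly from the two channel intensities. The sketch below uses made-up red and green foreground values, assumes base-2 logarithms, and skips background correction, so it illustrates the definitions rather than the exact values the marray accessor functions would return.

```r
# Hypothetical red (R) and green (G) foreground intensities for four spots
R <- c(1200, 800, 150, 5000)
G <- c(600, 800, 300, 5000)

# Base-2 logs written with natural logs, so the code needs no log2 function
log2.R <- log(R) / log(2)
log2.G <- log(G) / log(2)

# M: log ratio of the channels; A: average log intensity
M <- log2.R - log2.G
A <- (log2.R + log2.G) / 2

print(cbind(R, G, M, A))
```

A spot with twice as much red as green signal has M = 1, equal channels give M = 0, and half as much red gives M = -1, regardless of the overall brightness captured by A.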
219. y, type > help(maDotsMatch) at the command line.
HTML Help: HTML Help in S-PLUS is based on Microsoft Internet Explorer and uses an HTML window to display the help files. You can access help on any function or GUI dialog in S+ARRAYANALYZER from the main menu by selecting Help > Available Help > arrayanalyzer. The HTML Help system includes a table of contents organized by library, an index, and a Search button. You can also get help on any S-PLUS function from either the command line or from Help > Available Help > Language Reference. If you need help on the S-PLUS GUI, click the Help button at the bottom of any dialog or navigate to Help > Available Help > S-PLUS Help. In addition to the online help, you can access a PDF of the User's Guide by going to Help > Online Manuals > ArrayAnalyzer User's Guide. The S+ARRAYANALYZER User's Guide is particularly helpful for those new to S-PLUS and microarray analysis. You can also access versions of the Bioconductor library PDFs in S+ARRAYANALYZER. The individual library PDFs are located at the top level of each library; for example, the Biobase library PDF is available at splus61\library\Biobase\Biobase.pdf. Just double-click the file to launch the PDF. Note: these PDFs are current only up to this release. For updated information, please visit the Bioconductor Web site
220. ys there are 11, 16, or 20 probe pairs in a probe pair set, with each probe pair consisting of a perfect match (PM) and a mismatch (MM) probe. The oligos are 25-mers, and the MM probe uses the Watson-Crick complement at the 13th position. A key data operation is the summary of the 11-20 probe pair set intensities into a single value for each gene transcript that faithfully represents the expression of that gene transcript. The Affymetrix MAS 4.0 software did a poor job at this summarization by simply taking the average difference of the PM and MM values for each probe pair set. The Affymetrix MAS 5.0 software does a better job of this summarization; this is described below. Several other summary methods for probe pair sets have emerged, most notably those of Li and Wong (2001) and Irizarry et al. (2003b). This is an active area of research and, as stated by Parmigiani et al. (2003), "there is mounting evidence that alternative summarization to the defaults currently implemented by Affymetrix may provide improved ability to detect biological signal." The available summary methods can be obtained by typing:
> express.summary.stat.methods
[1] "avgdiff" "liwong" "mas" "medianpolish" "playerout"
Note that the avgdiff and mas methods refer to the methods described in the Affymetrix manual (versions 4.0 and 5.0) and the Affymetrix Statistical Algorithms Description Document (SADD), available from Affymetrix. The avgdiff method implements an approach si
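Two statements in excerpt 220 lend themselves to a short standalone sketch: the MM probe differs from the PM probe only by the complemented base at position 13 of the 25-mer, and the average-difference summary is the mean of PM minus MM over a probe pair set. The function names, the sequence, and the intensities below are hypothetical, and the mean is deliberately plain; any outlier handling or other refinements in the Affymetrix software or in the avgdiff method are not modeled.

```python
# Watson-Crick complement for each DNA base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm_seq):
    """Return the MM probe for a 25-mer PM probe: same sequence,
    with the 13th base (index 12) replaced by its complement."""
    if len(pm_seq) != 25:
        raise ValueError("expected a 25-mer")
    middle = COMPLEMENT[pm_seq[12]]
    return pm_seq[:12] + middle + pm_seq[13:]


def average_difference(pm, mm):
    """Plain mean of PM - MM over the probe pairs of one probe pair set.
    Assumption: no outlier trimming or other refinements."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical 25-mer and made-up intensities, for illustration only
# (four probe pairs for brevity; real sets have 11, 16, or 20).
pm_seq = "ACGTACGTACGTACGTACGTACGTA"
mm_seq = mismatch_probe(pm_seq)
pm = [1200.0, 950.0, 1430.0, 800.0]
mm = [400.0, 350.0, 630.0, 500.0]
print(mm_seq)
print(average_difference(pm, mm))
```

Because a plain mean is sensitive to a single badly behaved probe pair, it gives some intuition for why the alternative summary methods listed above were developed.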
221. zer item on the main S-PLUS menu bar is added. Clicking ArrayAnalyzer opens the drop-down menu, where you can select any of the following menu items:
• Import Data
• Affymetrix Expression Summary
• Normalization
• Differential Expression Analysis
Figure 3.1: Loading S+ARRAYANALYZER from the command line or the GUI adds an ArrayAnalyzer menu item to the main S-PLUS menu bar.
The ArrayAnalyzer menu reflects the usual order of microarray data analysis. The results of each menu item can be used as input to the remaining analysis tasks. By breaking down the data analysis into these steps, it is possible to break the work session and return where you left off. It is also possible that not all the steps are necessary for your data (e.g., normalization). We briefly describe each of the S+ARRAYANALYZER dialogs in the following sections.
IMPORT DATA
Import Affymetrix Data
The Import Affymetrix Data dialog is multi-tabbed. On the first page, you specify the experimental design, associate data files with each design point, and input the Affymetrix GeneChip name.
[Screenshot of the Import Affymetrix Data dialog: tabs File Selection, MIAME, and Variable Selection & Filtering]
