Home

limma: Linear Models for Microarray Data User's Guide (Now

1. 3 4 24 24 daa 112 17 3 7 Differential expression between cell types 113 17 3 8 Signature genes for luminal progenitor cellS 113 17 4 Time Course Effects of Corn Oil on Rat Thymus with Agilent 4x44K Arrays 114 18 RNA Seq Case Studies 118 18 1 Profiles of Unrelated Nigerian Individuals 118 18 1 1 Backero nd a soea be E ee oe A ee EE OR ES he 118 181 2 Loading th d ta aad a ake A AA A A a eo 118 18 1 3 DEE List object La Pe oe Re Re 119 ISA Pan de a cas ras Gee AA ae da he 119 18 1 5 Scale normalization iba da alee a ee a e ocd as 120 18 1 6 Linear modeling AA ae Oy he ye kar an iberg age phere eg 120 18 17 Gene Sey Esta ers Se ee ed a a id rd gee 122 18 2 Differential Splicing after Pasilla Knockdown 125 1821 Backergund Tera rs Roane Gon hae kn Oho A hie hd 125 18 2 2 GEO samples and SRA Files a ide Shay a eb eA ltda Ta 125 18 2 3 Mapping reads to the reference genome 126 18 2 4 Read counts for exons ere ri eis Sa hh ee BO oe OE oe ot ok 127 18 2 5 Assemble DGEList and sum counts for technical replicates 127 18 2 6 Gene annotation 22 0 4 ed oe Bis oY Bre ee eee ed 128 18 27 Pepin ote A one ee oe oft 129 18 2 8 Scale normalization 200 thane amp wl Bd E pd de E gids de 129 120 Linear modelling sr e as a a dl ds GE 130 18 2 10 Alternate splicing II IAN 132 18 2 11 Session information o ooa a es Rie S
2. gt Group lt factor targets Group gt design lt model matrix Group X gt fit lt lmFit y design gt fit lt eBayes fit This creates a model with 12 parameters with the last 5 corresponding to interaction i e to differences in the curves between groups To detect genes with different time trends for treatment vs control gt topTable fit coef 8 12 This conducts a moderated F test for each gene on 5 df which can detect very general differ ences between the treatment and control curves Note that for this analysis it is not necessary to have replicates nor is it necessary for the two treatment groups to be observed at identical time points 50 9 7 Multi level Experiments We have considered paired comparisons and we have considered comparisons between two independent groups There are however experiments that combine both of these types of comparisons Consider a single channel experiment with the following targets frame FileName Subject Condition Tissue File01 1 Diseased A File02 1 Diseased B File03 2 Diseased A File04 2 Diseased B File05 3 Diseased A File06 3 Diseased B File07 4 Normal A File08 4 Normal B File09 5 Normal A File10 5 Normal B File11 6 Normal A File12 6 Normal B This experiment involves 6 subjects including 3 patients who have the disease and 3 normal subjects From each subject we have expression profiles of two tissue types A and B In analysing thi
3. gt ex lt diffSplice fit geneid GeneID 73 Then gt topSplice ex coef 2 level gene will find genes that show evidence of differential splicing associated with the second coefficient in the linear model The output is similar that from the limma topTable function More detail can be obtained by gt topSplice ex coef 2 level exon which will show individual exons that are enriched or depleted relative to other exons in the same gene To display the pattern of exons in the top genes gt plotSplice ex 74 Chapter 16 Two Color Case Studies 16 1 Swirl Zebrafish A Single Group Experiment In this section we consider a case study in which two RNA sources are compared directly on a set of replicate or dye swap arrays The case study includes reading in the data data display and exploration as well as normalization and differential expression analysis The analysis of differential expression is analogous to a classical one sample test of location for each gene In this example we assume that the data is provided as a GAL file called fish gal and raw SPOT output files and that these files are in the current working directory The data used for this case study can be downloaded from http bioinf wehi edu au limmaGUI DataSets html gt dirQ 1 fish gal swirl 1 spot swirl 2 spot swirl 3 spot swirl 4 spot 6 SwirlSample txt Background The experiment was carried out using zebrafish as a model or
4. Limma provides a strong suite of functions for reading exploring and pre processing data from two color microarrays The Bioconductor package marray provides alternative functions for reading and normalizing spotted two color microarray data The marray package provides flexible location and scale normalization routines for log ratios from two color arrays The limma package overlaps with marray in functionality but is based on a more general concept of within array and between array normalization as separate steps If you are using limma in conjunction with marray see Section 6 4 Limma can read output data from a variety of image analysis software platforms including GenePix ImaGene etc Either one channel or two channel formats can be processed The Bioconductor package affy provides functions for reading and normalizing Affymetrix microarray data Advice on how to use limma with the affy package is given throughout the User s Guide see for example Section 8 2 and the E coli and estrogen case studies Functions for reading and pre processing expression data from Illumina BeadChips were introduced in limma 3 0 0 See the case study in Section 17 3 for an example of these Limma can also be used in conjunction with the vst or beadarray packages for pre processing Illumina data From version 3 9 19 limma includes functions to analyse RNA Seq experiments demon strated in Case Study 11 8 The approach is to convert a table of sequence read co
5. locale 1 LC_COLLATE English_Australia 1252 LC_CTYPE English_Australia 1252 3 LC_MONETARY English_Australia 1252 LC_NUMERIC C 5 LC_TIME English_Australia 1252 attached base packages 1 splines stats graphics grDevices utils datasets methods base other attached packages 1 statmod_1 4 11 org Mm eg db_2 5 0 RSQLite_0 9 4 DBI_0 2 5 5 AnnotationDbi_1 14 0 Biobase_2 12 0 limma_3 9 8 16 4 Bobi Mutant Mice Arrays With Duplicate Spots In this section we consider a case study in which all genes ESTs and controls are printed more than once on the array This means that there is both within array and between array replication for each gene The structure of the experiment is therefore essentially a randomized block experiment for each gene The approach taken here is to estimate a common correlation for all the genes for between within array duplicates The theory behind the approach is explained in 38 This approach assumes that all genes are replicated the same number of times on the array and that the spacing between the replicates is entirely regular In this example we assume that the data is available as an RGList 96 Background This data is from a study of transcription factors critical to B cell maturation by Lynn Corcoran and Wendy Dietrich at the WEHI Mice which have a targeted mutation in the Bob1 aka Pou2afl or OBF 1 transcription factor display a number of abnormalities in the B lymphocyte compartment of the imm
6. FileName Strain Treatment Filel WT U File2 WT S File3 Mu U File4 Mu S File5 Mu S The two experimental dimensions or factors here are Strain and Treatment Strain specifies the genotype of the mouse from which the cells are extracted and Treatment specifies whether the cells are stimulated or not All four combinations of Strain and Treatment are observed so this is a factorial design It will be convenient for us to collect the Strain Treatment combinations into one vector as follows gt TS lt paste targets Strain targets Treatment sep gt TS 1 WT U WT S Mu U Mu S Mu S It is especially important with a factorial design to decide what are the comparisons of interest We will assume here that the experimenter is interested in 1 which genes respond to stimulation in wild type cells 2 which genes respond to stimulation in mutant cells and 3 which genes respond differently in mutant compared to wild type cells as these are the questions which are most usually relevant in a molecular biology context The first of these questions relates to the WT S vs WT U comparison and the second to Mu S vs Mu U The third relates to the difference of differences i e Mu S Mu U WT S WT U which is called the interaction term 9 5 2 Analysing as for a Single Factor We describe first a simple way to analyze this experiment using limma commands in a similar way to that in which two sample designs were analyz
7. If you use limma to pre process lumina BeadChip data please cite Shi W Oshlack A and Smyth GK 2010 Optimizing the noise versus bias trade off for Illumina Whole Genome Expression BeadChips Nucleic Acids Re search 38 e204 http nar oxfordjournals org content 38 22 e204 This article describes the read ilmn nec and negc functions If you use the separate channel analysis 1mscFit please cite Smyth GK and Altman NS 2013 Separate channel analysis of two channel microarrays recovering inter spot information BMC Bioinformatics 14 165 http www biomedcentral com 1471 2105 14 165 If you use limma for RNA seq analysis voom please cite Law CW Chen Y Shi W Smyth GK 2014 Voom precision weights unlock linear model analysis tools for RNA seq read counts Genome Biology 15 R29 http genomebiology com 2014 15 2 R29 The limma software itself can be cited as Smyth G K 2005 Limma linear models for microarray data In Bioinfor matics and Computational Biology Solutions using R and Bioconductor R Gen tleman V Carey S Dudoit R Irizarry W Huber eds Springer New York pages 397 420 The above article describes the software package in the context of the Bioconductor project and surveys the range of experimental designs for which the package can be used including spot specific dye effects The pre processing capabilities of the package are also described but more briefly with examp
8. will find those genes responding to stimulation in mutant mice Finally we could extract the interaction contrast Diff considered above by gt fit2 lt contrasts fit fit c 0 0 1 1 gt fit2 lt eBayes fit2 gt topTable fit2 This finds genes that respond differently to the stimulus in mutant vs wild type mice 9 5 4 Classic Interaction Models The analysis of factorial designs has a long history in statistics and a system of factorial model formulas has been developed to facilitate the analysis of complex designs It is important to understand though that the above three molecular biology questions do not correspond to any of the classic parametrizations used in statistics for factorial designs Hence we generally recommend the approaches already considered above for microarray analysis Suppose for example that we proceed in the usual statistical way gt design lt model matrix Strain Treatment This creates a design matrix which defines four coefficients with the following interpretations Coefficient Comparison Interpretation Intercept WT U Baseline level of unstimulated WT StrainMu Mu U WT U Difference between unstimulated strains TreatmentS WT S WT U Stimulation effect for WT StrainMu TreatmentS Mu S Mu U WT S WT U Interaction This is called the treatment contrast parametrization Notice that one of our comparisons of interest Mu S Mu U is not represented and instead the comparison Mu U WT U w
9. 1146 3 4 18 fb22a12 2 123 1 05 13 7 10 2 1 57e 05 0 004242 3 76 157 1 7 13 fb38a01 6 11 1 82 10 8 10 2 1 58e 05 0 004242 3 75 The top gene is BMP2 which is significantly down regulated in the Swirl zebrafish as it should be because the Swirl fish are mutant in this gene Other positive controls also appear in the top 30 genes in terms In the table t is the empirical Bayes moderated t statistic the corresponding P values have been adjusted to control the false discovery rate and B is the empirical Bayes log odds of differential expression gt plotMA fit gt top30 lt order fit lods decreasing TRUE 1 30 gt text fit Amean top30 fit coef top30 labels fit genes top30 Name cex 0 8 col blue 85 l i Eu 16 2 Apoal Knockout Mice A Two Group Common Reference Experiment In this section we consider a case study where two RNA sources are compared through a common reference RNA The analysis of the log ratios involves a two sample comparison of means for each gene In this example we assume that the data is available as an RGList in the data file Apoa1 RData The data used for this case study can be downloaded from http bioinf wehi edu au limma Background The data is from a study of lipid metabolism by 5 The apolipoprotein AI Apoal gene is known to play a pivotal role in high density lipoprotein HDL metabolism Mice which have the Apoal gene knocked out have very low HDL cholesterol
10. 39755_at XBP1 1 68 12 13 15 1 2 05e 07 3 58e 04 7 45 1824_s_at PCNA 1 91 9 24 14 9 2 27e 07 3 58e 04 7 37 1126_s_at CD44 1 78 6 88 13 8 4 12e 07 5 78e 04 6 89 1536_at CDC6 2 66 5 94 13 3 5 80e 07 7 32e 04 6 61 981_at MCM4 1 82 7 78 13 1 6 46e 07 7 42e 04 6 52 33252_at MCM3 1 74 8 00 12 6 8 86e 07 9 20e 04 6 25 1505_at TYMS 2 40 8 76 12 5 9 48e 07 9 20e 04 6 19 34363_at SEPP1 1 75 5 55 12 2 1 14e 06 1 03e 03 6 03 1884_s_at PCNA 2 80 9 03 12 1 1 26e 06 1 06e 03 5 95 36134_at OLFM1 2 49 8 28 11 8 1 50e 06 1 19e 03 5 79 37485_at SLC27A2 1 61 6 67 11 4 1 99e 06 1 48e 03 5 55 239_at CTSD 1 57 11 25 10 4 4 07e 06 2 66e 03 4 90 38116_at KIAAO101 2 32 9 51 10 4 4 09e 06 2 66e 03 4 90 40533_at BIRC5 1 26 8 47 10 4 4 21e 06 2 66e 03 4 87 Estrogen effect at 48 hours gt topTable fit2 coef E48 n 20 Symbol logFC AveExpr t P Value adj P Val B 910_at TKi 3 86 9 66 29 2 8 27e 10 1 04e 05 11 61 107 31798_at TFF1 3 60 12 12 21 0 1 28e 08 7 63e 05 9 89 1854_at MYBL2 3 34 8 53 20 2 1 81e 08 7 63e 05 9 64 38116_at KIAA0101 3 76 9 51 16 9 8 12e 08 2 51e 04 8 48 38065_at HMGB2 2 99 9 10 16 2 1 12e 07 2 51e 04 8 21 39755_at XBP1 1 77 12 13 15 8 1 36e 07 2 51e 04 8 05 1592_at TOP2A 2 30 8 31 15 8 1 39e 07 2 51e 04 8 03 41400_at TKi 2 24 10 04 15 3 1 81e 07 2 75e 04 7 81 33730_at GPRC5A 2 04 8 57 15 1 1 96e 07 2 75e 04 7 74 1651_at UBE2C 2 97 10 50 14 8 2 39e 07 3 02e 04 7 57 38414_at CDC20 2 02 9 46 14 6 2 66e 07 3 05e 04 7 48 1943_at CCNA2 2 19 7 60 14 0
11. K Michaud J and Scott H 2005 The use of within array replicate spots for assessing differential expression in microarray experiments Bioinformat ics 21 9 2067 2075 http bioinformatics oxfordjournals org content 21 9 2067 The above article describes the theory behind the duplicateCorrelation function If you use limma for normalization of two color microarray data please cite Smyth G K and Speed T P 2003 Normalization of cDNA microarray data Methods 31 265 273 The above article describes the functions read maimages normalizeWithinArrays normalize BetweenArrays etc including the use of spot quality weights If you use the backgroundCorrect function please cite Ritchie M E Silver J Oshlack A Silver J Holmes M Diyagama D Holloway A and Smyth G K 2007 A comparison of background correction methods for two colour microarrays Bioinformatics 23 2700 2707 http bioinformatics oxfordjournals org content 23 20 2700 This paper in particular describes the normezxp background correction method If you use limma to estimate array quality weights please cite Ritchie M E Diyagama D Neilson van Laar R J Dobrovic A Holloway A and Smyth G K 2006 Empirical array quality weights in the analysis of microarray data BMC Bioinformatics 7 261 http www biomedcentral com 1471 2105 7 261 The above article describes the functions arrayWeights arrayWeightsSimple etc
12. LP luminal ML Stromal cells were also profiles as a comparison group There were therefore four cell popu lations from each person RNA was extracted from freshly sorted cells making twelve RNA samples in total 109 17 3 3 The expression profiles The RNA samples were hybridized to two Illumina HumanWG 6 version 3 BeadChips Each BeadChip is able to accommodate six samples The BeadChips images were scanned and summarized using BeadStudio Un normalized summary probe profiles were exported from from BeadStudio to tab delimited text files Separate files were written for regular probes and for control probes The file probe profile txt contains the expression profiles for regular probes designed to interrogate the expression levels of genes control probe profile txt contains the profiles of control probes including negative control probes Note that BeadStudio by default writes the profiles for all the samples to the same two files We read in the expression profiles for both regular and control probes telling read ilm that we wish to read the detection p values as well as the expression values gt x lt read ilmn files probe profile txt ctrlfiles control probe profile txt other columns Detection Reading file probeprofile txt Reading file controlprobeprofile txt This reads a EListRaw object There are about 750 negative probes and about 49 000 regular probes gt table x genes Status BIOTIN CY3
13. Now find genes differentially expression between male and females Positive log fold changes mean higher in males The highly ranked genes are mostly on the X or Y chro mosomes gt fit lt 1lmFit v design gt fit lt eBayes fit gt options digits 3 gt topTable fit coef 2 n 16 sort p Symbol Chr logFC AveExpr t P Value adj P Val B ENSG00000229807 XIST X 9 857 3 8076 32 5 1 21e 44 2 10e 40 66 5 ENSG00000099749 CYorfi5A Y 4 260 0 3139 27 1 2 40e 39 2 08e 35 64 8 ENSG00000157828 RPS4Y2 Y 3 280 3 3074 26 7 5 88e 39 3 39e 35 73 8 ENSG00000233864 TTTY1B Y 4 896 0 5546 25 4 1 62e 37 7 00e 34 59 6 ENSG00000198692 EIFIAY Y 2 397 2 6799 20 5 1 04e 31 3 60e 28 59 1 ENSG00000131002 CYorf1i5B Y 5 439 0 1717 20 2 2 46e 31 7 09e 28 51 1 ENSGO0000213318 RP11 331F4 1 16 4 295 2 2647 19 4 3 04e 30 7 51e 27 55 0 ENSG00000165246 NLGN4Y Y 5 330 0 4924 17 9 3 59e 28 7 78e 25 45 9 ENSG00000129824 RPS4Y1 Y 2 779 4 7110 17 8 6 17e 28 1 19e 24 52 2 ENSG00000183878 UTY Y 1 877 2 7422 16 6 3 35e 26 5 81e 23 47 9 ENSG00000012817 KDM5D Y 1 470 4 7039 14 7 2 33e 23 3 67e 20 42 3 ENSG00000243209 ACO10889 1 Y 2 536 0 0186 14 0 2 97e 22 4 28e 19 36 5 ENSG00000146938 NLGN4X X 4 472 0 7808 13 8 6 47e 22 8 62e 19 34 8 ENSG00000067048 DDX3Y Y 1 670 5 3070 13 3 4 54e 21 5 61e 18 37 3 ENSGO0000006757 PNPLA4 X 0 989 2 5333 10 4 5 09e 16 5 87e 13 25 8 ENSG00000232928 RP13 204A15 4 xX 1 434 3 2499 10 2 1 43e 15 1 54e 12 25 0 gt fit df prior 1 4 6
14. gt summary decideTests fit Intercept Gendermale 1 292 43 0 913 17251 1 16105 16 121 gt chrom lt fit genes Chr gt plot fit coef 2 status chrom values c X Y col c red blue main Male vs Female legend bottomright gt abline h 0 col darkgrey Male vs Female logFC oe lt x AveExpr 18 1 7 Gene set testing The tweeDEseqCountData package includes a list of genes belonging to the male specific region of chromosome Y and a list of genes located in the X chromosome that have been reported to escape X inactivation We expect genes in the first list to be up regulated in males whereas genes in the second list should be up regulated in females gt Ymale lt rownames v in msYgenes gt Xescape lt rownames v in XiEgenes Roast gene set tests confirm that the male specific genes are significantly up as a group in our comparison of males with females whereas the X genes are significantly down as a group 49 gt index lt list Y Ymale X Xescape gt mroast v index design nrot 9999 NGenes PropDown PropUp Direction PValue FDR PValue Mixed FDR Mixed Y 12 0 0833 0 9167 Up 2e 04 1le 04 1e 04 5e 05 X 46 0 5217 0 0435 Down 2e 04 1e 04 1e 04 5e 05 The results from competitive camera gene sets tests are even more convincing 50 gt camera v index design NGenes Correlation Direction PValue FDR Y 12 0 0836 Up 2 12e 28 4 23e 28 X 46 0 0206 Down 1 14e
15. 01 80e 01 81e 01 67e 01 80e 01 49e 01 65e 01 65e 01 80e 01 95e 01 64e 01 83e 04 49e 01 95e 01 80e 01 95e 01 49e 01 49e 01 01e 08 01e 02 48e 08 92e 08 45e 07 48e 08 92e 08 56e 08 81e 07 64e 06 92e 08 26e 01 99e 04 O4e 04 81e 05 48e 02 65e 03 13e 04 34e 02 531 919 198 566 147 972 351 415 301 198 157 139 410 458 549 909 003 982 961 836 371 572 692 113 607 585 769 323 888 710 962 350 0 530 704 453 072 943 815 45320 trol do o Y D lt O D Oo Qu LL o D o MT LODMNMIOMD MNOOAN MDOSOD ODtOOMOTOODOOLOMADL OLODO TOONOTTOONT TOTTMOMNOOD TLOONNODOTSTOLOOORN WOOO OODDDDOOORRNERERANROO MMM MMMM MMMMMMMCIMOCC NANANAAAAANAAAA AOA AAAI Exon Start These two figures can be compared to Supplementary Figure 8A in Brooks et al 3 The gene trol was identified by Brooks et al to have a novel set of coordinately regulated exons Brooks et al said A particularly striking example involving these unannotated splice junctions is found in the trol gene in which a group of nine contiguous exons which are an notated as being constitutive are coordinately skipped in untreated cells but coordinately included in the ps RNAi cells 18 2 11 Session information gt sessionInfo R version 3 1 0 2014 04 10 Platform x86_64 w64 mingw32 x64 64 bit locale 1 L
16. 10 1 14e 10 122 We can highlight the male specific and X escape genes on the MA plot status lt rep Other nrow fit status Ymale lt Ymale status Xescape lt Xescape plot fit coef 2 status status values c Xescape Ymale col c red blue main Male vs Female legend bottomright abline h 0 col darkgrey V tvVVV Vv Male vs Female logFC Xescape Ymale T T T 0 5 10 10 AveExpr Another way to view the gene sets it to view the ranking of the Y and X specific genes in the differential expression results This shows that with a couple of exceptions the Y specific genes are highly ranked positive while the X specific genes are low ranked negative gt barcodeplot fit t 2 Ymale Xescape labels c Up in male Down in male gt legend top legend c Ymale genes Xescape genes lty 1 col c red blue 123 7 Ymale genes Xescape genes Enrichment 0 Up in male Down in male 0 Enrichment 4 1 pe ee eee ees i statistics gt sessionInfo R Under development unstable 2014 03 07 r65143 Platform x86_64 w64 mingw32 x64 64 bit locale 1 LC_COLLATE English_Australia 1252 LC_CTYPE English_Australia 1252 3 LC_MONETARY English_Australia 1252 LC_NUMERIC C 5 LC_TIME English_Australia 1252 attached base packages 1 parallel stats graphics grDevices utils datasets methods base other attach
17. 16 1 Swirl Zebrafish A Single Group Experiment 75 16 2 Apoal Knockout Mice A Two Group Common Reference Experiment 86 16 3 Weaver Mutant Mice A Composite 2x2 Factorial Experiment 89 16 3 1 Backerotind Ei fastest ca Ge eas tase eta Ae e a A 89 16 3 2 Sample Preparation and Hybridizations 90 16 334 Data Inputa rr a gn ak he A eae eg de aE a 90 16 34 Annotation sosea ae e eS eh a AA RS 91 16 3 5 Quality Assessment and Normalization 92 16 3 6 Setting Up the Linear Model 2 2 2 ee es 94 16 3 7 Probe Filtering and Array Quality Weights 95 16 3 8 Differential expression 244 3 4 aca a ed ke Ge Pek a Goel a se 95 16 4 Bobl Mutant Mice Arrays With Duplicate Spots 96 17 Single Channel Case Studies 101 17 1 Lrp Mutant Ecoli Two Group Experiment with Affymetrix Arrays 101 17 2 Effect of Estrogen on Breast Cancer Tumor Cells A 2x2 Factorial Experiment with Pry TORTI Array S co 6 ot ede oh fs Baan hes Ad Gs hye i Goth r e Oe hoes ale 1S NA 104 17 3 Comparing Mammary Progenitor Cell Populations with Illumina BeadChips 108 NN barato A IN Gr e toe ode 2m Se aod Gok 108 17 3 2 The target RNA samples e oko ao a a 109 17 3 3 The expression profiles Lia pe a ed Boada Tes ide de 110 17 3 4 How many probes are truly expressed 111 17 3 5 Normalization and filtering a a a e 111 17 3 6 Within patient correlations
18. 51e 12 5 19e 08 18 1 ILMN_1706051 PLD5 4 00 5 67 27 8 4 20e 12 5 19e 08 17 7 ILMN_1656369 C8orf4 5 24 11 26 25 1 1 36e 11 1 06e 07 16 7 ILMN_2413323 GRP 6 60 6 82 24 3 2 01e 11 1 06e 07 16 4 ILMN_1787750 CD200 5 68 6 30 23 9 2 43e 11 1 06e 07 16 2 ILMN_1669819 LOC402569 2 52 5 50 23 8 2 57e 11 1 06e 07 16 2 ILMN_1777998 ARHGAP25 4 78 6 26 23 1 3 50e 11 1 15e 07 15 9 ILMN_1701933 SNCA 5 26 6 27 22 8 4 19e 11 1 15e 07 15 7 ILMN_1777199 GRP 5 47 6 36 22 8 4 23e 11 1 15e 07 15 7 ILMN_1708303 CYP4F22 4 09 5 94 22 5 4 76e 11 1 15e 07 15 6 17 3 8 Signature genes for luminal progenitor cells Now we find genes uniquely expressed in LP cells as compared to MS and ML We refit the linear model making LP the reference cell type gt ct lt relevel ct ref LP gt design lt model matrix ct gt fit lt lmFit y design block targets Donor correlation dupcor consensus correlation gt fit2 lt fit c ctMS ctML gt fit2 lt eBayes fit2 trend TRUE The we find all those genes that are up regulated in LP vs both MS and ML using a 2 fold change and 5 FDR gt results lt decideTests fit2 lfc 1 gt vennDiagram results include c up down 113 ctMS ctML 1998 487 don 53015 There are 197 positive signature genes and 413 negative To see the top positive signature genes with their fold changes gt LP sig lt rowSums results gt 0 2 gt topTable fit2 LP sig SY
19. Array3 0 1 Array4 0 1 ArrayS 0 1 Differentially expressed genes can be found by fit lt lmFit eset design cont matrix lt makeContrasts MUvsWT MU WT levels design fit2 lt contrasts fit fit cont matrix fit2 lt eBayes fit2 topTable fit2 adjust BH VVVV oy For the first approach the treatment contrasts parametrization the design matrix can be computed by gt design lt cbind WT 1 MUvsWT c 0 0 1 1 1 or by gt Group lt factor targets Target levels c WT Mu gt design lt model matrix Group gt colnames design lt c WT MUvsWT 42 For the second approach the group means parametrization the design matrix can be com puted by gt design lt cbind WT c 1 1 0 0 0 MU c 0 0 1 1 1 or by gt design lt model matrix 0 Group gt colnames design lt c WT MU 9 3 Several Groups The above approaches for two groups extend easily to any number of groups Suppose that three RNA targets to be compared Suppose that the three targets are called RNA1 RNA and RNA3 and that the column targets Target indicates which one was hy bridized to each array An appropriate design matrix can be created using gt f lt factor targets Target levels c RNA1 RNA2 RNA3 gt design lt model matrix 0 f gt colnames design lt c RNA1 RNA2 RNA3 To make all pair wise comparisons between the three groups one could proceed gt fit lt
20. Data Reading single channel data is similar to two color data except that the argument green only TRUE should be added to tell read maimages not to expect a red channel Single channel Agilent intensities as produced by Agilent s Feature Extraction software can be read by gt x lt read maimages files source agilent green only TRUE or gt x lt read maimages targets source agilent green only TRUE As for two color data the path argument is used gt x lt read maimages files source agilent path lt directory gt green only TRUE if the data files are not in the current working directory The green only argument tells read maimages to output an EList object instead an RGList The raw intensities will be stored in the E component of the data object and can be checked for example by gt summary x E Agilent s Feature Extraction software has the ability to estimate the foreground and back ground signals for each spot using either the mean or the median of the foreground and back ground pixels The default for read maimages is to read the median signal for both foreground and background Alternatively gt x lt read maimages targets source agilent mean green only TRUE 19 would read the mean foreground signal while still using median for the background The possible values for source are Source Foreground Background agilent Median Signal Median Signal agilent mean Mean Signal Media
21. NIA15k H3121 0 441 10 48 11 31 4 73e 07 0 000443 6 88 2838 NIA15k H32838 1 640 12 74 11 26 4 92e 07 0 000443 6 84 6999 NIA15k H36999 0 381 9 91 11 19 5 21e 07 0 000443 6 79 132 NIA15k H3132 0 370 10 10 11 17 5 31e 07 0 000443 6 77 6207 NIA15k H36207 0 393 8 53 11 06 5 82e 07 0 000443 6 69 7168 NIA15k H37168 0 391 9 88 10 76 7 51e 07 0 000493 6 45 1831 NIA15k H31831 0 374 9 62 10 63 8 41e 07 0 000493 6 35 2014 NIA15k H32014 0 363 9 65 10 49 9 54e 07 0 000493 6 23 7558 NIA15k H37558 0 532 11 42 10 49 9 58e 07 0 000493 6 22 4471 NIA15k H34471 0 353 8 76 10 41 1 02e 06 0 000493 6 16 126 NIA15k H3126 0 385 10 59 10 40 1 03e 06 0 000493 6 15 4360 NIA15k H34360 0 341 9 37 10 22 1 21e 06 0 000545 6 00 6794 NIA15k H36794 0 472 11 33 10 11 1 35e 06 0 000570 5 90 329 NIA15k H3329 0 413 11 37 9 97 1 53e 06 0 000612 5 78 5017 NIA15k H35017 0 434 11 41 9 90 1 63e 06 0 000618 5 72 2678 NIA15k H32678 0 461 10 44 9 74 1 90e 06 0 000618 5 57 2367 NIA15k H32367 0 409 10 21 9 72 1 93e 06 0 000618 5 56 1232 NIA15k H31232 0 372 8 72 9 70 1 96e 06 0 000618 5 54 111 NIA15k H3111 0 369 10 42 9 69 1 98e 06 0 000618 5 53 2159 NIA15k H32159 0 418 10 19 9 67 2 03e 06 0 000618 5 51 4258 NIA15k H34258 0 299 9 11 9 62 2 12e 06 0 000622 5 47 3192 NIA15k H33192 0 410 9 46 9 55 2 26e 06 0 000638 5 41 6025 NIA15k H36025 0 427 10 37 9 47 2 45e 06 0 000654 5 33 5961 NIA15k H35961 0 362 8 50 9 46 2 49e 06 0 000654 5 31 1404 NIA15k H31404 0 474 11 34 9 26 3 00e 06 0 000722 5
22. Normal A DiseasedvsNormalForTissueB Diseased B Normal B TissueAvsTissueBForNormal Normal B Normal A TissueAvsTissueBForDiseased Diseased B Diseased A levels design Then compute these contrasts and moderated t tests gt fit2 lt contrasts fit fit cm gt fit2 lt eBayes fit2 Then gt topTable fit2 coef DiseasedvsNormalForTissueA will find those genes that are differentially expressed between the normal and diseased subjects in the A tissue type And so on This experiment has two levels of variability First there is the variation from person to person which we call the between subject strata Then there is the variability of repeat measurements made on the same subject the within subject strata The between subject variation is always expected to be larger than within subject because the latter is adjusted for baseline differences between the subjects Here the comparison between tissues can be made within subjects and hence should be more precise than the comparison between diseased and normal which must be made between subjects 52 Chapter 10 Two Color Experiments with a Common Reference 10 1 Introduction Now consider two color microarray experiments in which a common reference has been used on all the arrays If the same channel has been used for the common reference throughout the experiment then the expression log ratios may be analysed exactly as if they were log expression val
23. RNA3 For this experiment the design matrix could be formed by gt design lt modelMatrix targets ref Ref after which the analysis would be exactly as for the equivalent single channel experiment in Section 9 3 For example to make all pair wise comparisons between the three groups one could proceed gt fit lt ImFit eset design gt contrast matrix lt makeContrasts RNA2 RNA1 RNA3 RNA2 RNA3 RNA1 levels design gt fit2 lt contrasts fit fit contrast matrix gt fit2 lt eBayes fit2 and so on 56 Chapter 11 Direct Two Color Experimental Designs 11 1 Introduction Direct two color designs are those in which there is no common reference but the RNA samples are instead compared directly by competitive hybridization on the same arrays Direct two color designs can be very efficient and powerful but they require the most statistical knowledge to choose the appropriate design matrix 11 2 Simple Comparisons 11 2 1 Replicate Arrays The simplest possible microarray experiment is one with a series of replicate two color arrays all comparing the same two RNA sources For a three array experiment comparing wild type wt and mutant mu RNA the targets file might contain the following entries FileName Cy3 Cy5 Filel wt mu File wt mu File3 wt mu A list of differentially expressed probes might be found for this experiment by gt fit lt lmFit MA gt fit lt eBayes fit
24. Ref wild type 8 ca c8 spot Ref wild type 9 k1 kl spot Ref ApoAl KO 10 k2 k2 spot Ref ApoAl KO 11 k3 k3 spot Ref ApoAl KO 12 k4 k4 spot Ref ApoAl KO 13 k5 k5 spot Ref ApoAl KO 14 k6 k6 spot Ref ApoAl KO 15 k k 7 spot Ref ApoAl KO k8 spot Ref ApoAl KO lat This column can be used to created row names for the targets frame by 16 gt targets lt readTargets targets txt row names Name The row names can be propagated to become array names in the data objects when these are read in For ImaGene files the FileName column is split into a FileNameCy3 column and a FileNameCy5 because ImaGene stores red and green intensities in separate files This is a short example El Microsoft Excel maTargetsImaGene19and20 txt m oj xj File Edit View Insert Format Tools Data Window Help 8 x EASA Gay tee o z 2 K17 f A B E 5 SlideNumber FileNameCy3 FileNameCy5 Cy3 Cy5 19 slide19w595 txt slide19w685 txt WT Mutant 20 slide20w595 txt slide20w685 txt Mutant WT gt pIfimaTargetsImaGene1l9and20 4 Ready 4 4 Reading Two Color Intensity Data Let files be a character vector containing the names of the image analysis output files The foreground and background intensities can be read into an RGList object using a command of the form gt RG lt read maimages files source lt imageanalysisprogram gt path lt directory gt where lt imageanalysisprogram gt is the name of the
25. be used for the normaliza tion step For downstream analysis it is usually worthwhile to remove probes that appear not be expressed in any of the experimental conditions This is called filtering We generally rec ommend that this is done before the linear modelling and empirical Bayes steps but after normalization There are a number of ways that filtering can be done One way is to keep probes that are expressed above background on at least n arrays where n is the smallest number of replicates assigned to any of the treatment combinations See for example Case studies 15 3 or 15 4 in the limma User s Guide Note that filtering methods involving variances should not be used The limma algorithm analyses the spread of the genewise variances Any filtering method based on genewise vari ances will change the distribution of variances will interfere with the limma algorithm and hence will give poor results 35 Chapter 8 Linear Models Overview 8 1 Introduction The package limma uses an approach called linear models to analyze designed microarray experiments This approach allows very general experiments to be analyzed just as easily as a simple replicated experiment The approach is outlined in 36 54 The approach requires one or two matrices to be specified The first is the design matrix which indicates in effect which RNA samples have been applied to each array The second is the contrast matrix which specifies which comparisons
26. contrasts fit fit contrast matrix gt fit2 lt eBayes fit2 A list of top genes differential expressed in group2 versus groupl can be obtained from gt topTable fit2 coef 1 adjust BH The outcome of each hypothesis test can be assigned using 37 gt results lt decideTests fit2 A Venn diagram showing numbers of genes significant in each comparison can be obtained from gt vennDiagram results 8 3 Common Reference Designs Now consider two color microarray experiments in which a common reference has been used on all the arrays Such experiments can be analyzed very similarly to Affymetrix experiments except that allowance must be made for dye swaps The simplest method is to setup the design matrix using the modelMatrix function and the targets file As an example we consider part of an experiment conducted by Jo lle Michaud Catherine Carmichael and Dr Hamish Scott at the Walter and Eliza Hall Institute to compare the effects of transcription factors in a human cell line The targets file is as follows gt targets lt readTargets runxtargets txt gt targets SlideNumber Cy3 Cy5 1 2144 EGFP AML1 2 2145 EGFP AML1 3 2146 AML1 EGFP 4 2147 EGFP AML1 CBFb 5 2148 EGFP AML1 CBFb 6 2149 AML1 CBFb EGFP 7 2158 EGFP CBFb 8 2159 CBFb EGFP 9 2160 EGFP AML1 CBFb 10 2161 AML1 CBFb EGFP 11 2162 EGFP AML1 CBFb 12 2163 AML1 CBFb EGFP 13 2166 EGFP CBFb 14 2167 CBFb EGFP In the experiment green fluorescent pr
27. dye effects gt design lt cbind Dye 1 design gt design Dye Piimt Pi1wt P2imt P21wt cbmut 3 1 0 1 0 0 cbmut 4 1 1 0 0 0 94 cbmut 5 cbmut 6 cbmut 15 cbmut 16 cb 1 cb 2 cb 3 cb 4 l oOoOOoOorroO0OO0OOO l Oorroroore Oooroooo l FPrROOOrRF oO RRRR BR Rp l m 16 3 7 Probe Filtering and Array Quality Weights First we remove control probes leaving only the regular probes of the Riken library gt regular lt MA genes Status Riken gt MA2 lt MA regular gt MA2 genes Status lt NULL Then we estimate array quality weights gt aw lt arrayWeights MA2 design options digits 3 gt aw 1 2 3 4 5 6 7 8 9 10 1 175 1 457 0 852 1 216 0 371 1 087 1 325 0 856 1 418 0 869 Vv The array weights multiply the spot weights already in the data object gt library statmod gt w lt matvec MA2 weights aw 16 3 8 Differential expression Fit the linear model gt fit lt ImFit MA2 design weights w Now extract all possible comparisons of interest as contrasts We look for the mutant vs wt comparisons at 11 and 21 days the time effects for mutant and wt and the interaction terms gt cont matrix lt makeContrasts WT11 MT11 Piimt Piiwt WT21 MT21 P2imt P21wt WT11 WT21 P21wt P11wt MT11 MT21 P2imt Piimt Int P21mt P11mt P21wt P11wt levels design gt fit2 lt contrasts fit fit cont matrix gt fit2 lt eBayes fit2 Adjustment for
28. extra information which is borrowed from the ensemble of genes for inference about each individual gene The output from eBayes contains the tik as fit t with corresponding p values in fit p value 13 3 Multiple Testing Across Contrasts The output from topTable includes adjusted p values i e it performs multiple testing for the contrast being considered If several contrasts are being tested simultaneously then the issue arises of multiple testing for the entire set of hypotheses being considered across contrasts as well as probes The function decideTests offers a number of strategies for doing this The simplest multiple testing method is method separate This method does multiple testing for each contrast separately This method is the default because it is equivalent to using topTable Using this method testing a set of contrasts together will give the same results as when each contrast is tested on its own The great advantage of this method is that it gives the same results regardless of which set of contrasts are tested together The disadvantage of this method is that it does not do any multiple testing adjustment between contrasts Another disadvantage is that the raw p value cutoff corresponding to significance can be very different for different contrasts depending on the number of DE probes This method is recommended when different contrasts are being analysed to answer more or less independent questions method global i
29. file for each array and that a targets file targets txt has been prepared with a column containing the names of the gpr files gt library limma gt targets lt readTargets targets txt Set up a filter so that any spot with a flag of 99 or less gets zero weight gt f lt function x as numeric x Flags gt 99 Read in the data gt RG lt read maimages targets source genepix wt fun f The following command implements a type of adaptive background correction This is optional but recommended for GenePix data gt RG lt backgroundCorrect RG method normexp offset 50 Print tip loess normalization gt MA lt normalizeWithinArrays RG Estimate the fold changes and standard errors by fitting a linear model for each gene The design matrix indicates which arrays are dye swaps gt fit lt ImFit MA design c 1 1 1 1 Apply empirical Bayes smoothing to the standard errors 12 gt fit lt eBayes fit Show statistics for the top 10 genes gt topTable fit The second example assumes Affymetrix arrays hybridized with either wild type wt or mutant mu RNA There should be three or more arrays in total to ensure some replication The targets file is now assumed to have another column Genotype indicating which RNA source was hybridized on each array gt library gcrma gt library limma gt targets lt readTargets targets txt Read and pre process the Affymetrix CEL
30. four treatment combinations The microarray images were scanned using SPOT image analysis software 16 3 3 Data input The data used for this case study can be downloaded from http bioinf wehi edu au limma data weaverfull rar The data are provided courtesy of Drs Jean Yang and Elva Diaz First read in the targets frame gt library limma gt targets lt readTargets targets txt gt rownames targets lt removeExt targets FileName gt targets FileName Tissue Mouse Cy5 Cy3 cbmut 3 cbmut 3 spot Cerebellum Weaver Pliwt Pool cbmut 4 cbmut 4 spot Cerebellum Weaver Piimt Pool cbmut 5 cbmut 5 spot Cerebellum Weaver P21mt Pool cbmut 6 cbmut 6 spot Cerebellum Weaver P21wt Pool cbmut 15 cbmut 15 spot Cerebellum Weaver P21wt Pool cbmut 16 cbmut 16 spot Cerebellum Weaver P21imt Pool cb 1 cb 1 spot Cerebellum Weaver Pliwt Piimt 90 cb 2 cb 2 spot Cerebellum Weaver Plimt P2imt cb 3 cb 3 spot Cerebellum Weaver P21mt P21wt cb 4 cb 4 spot Cerebellum Weaver P21wt Piiwt Exploratory analysis showed that the segmented area for spots for these arrays was quite variable with a median spot area just over 50 pixels A small proportion of spots had very small segmented sizes suggesting that the intensities for these spots might be unreliable It was therefore decided to set a spot quality weight function so any spot with an area less than 50 pixels will get reduced weight The function is set so that any spot with zero area will get
31. gene symbols corresponding to the GenBank accession numbers by using the mouse organism package constructed from the NCBI database Symbols can be found for only a little over 5000 of the probes gt library org Mm eg db First we find the Entrez Gene ID for each accession number gt EG AN lt toTable org Mm egACCNUM gt i lt match RG genes GenBank EG AN accession gt EntrezID lt EG AN i gene_id Then convert Entrez Gene IDs to symbols gt EG Sym lt toTable org Mm egSYMBOL gt i lt match EntrezID EG Sym gene_id gt RG genes Symbol lt EG Sym i symbol 16 3 5 Quality Assessment and Normalization We also read in a spot types file and set a range of control spots gt spottypes lt readSpotTypes spottypes txt gt spottypes SpotType RikenID col cex 1 Control green 1 0 2 Riken Zx black 0 2 3 Buffer 3x SSC yellow 1 0 4 CerEstTitration cer est lightblue 1 0 5 LysTitration Lys M orange 1 0 6 PheTitration Phe orange 1 0 7 RikenTitration Riken est blue 1 0 8 ThrTitration Thr orange 1 0 9 188 188 0 15ug ul pink 1 0 10 GAPDH GAPDH 0 15 ug ul red 1 0 11 Lysine Lysine 0 2 ug ul magenta 1 0 12 Threonine Threonine 0 2ug ul lightgreen 1 0 13 Tubulin Tubulin 0 15 ug ul green 1 0 gt RG genes Status lt controlStatus spottypes RG Matching patterns for RikenID Found 19200 Control Found 16896 Riken Found 710 Buffer Found 192 Ce
32. levels The purpose of this experiment is to determine how Apoal deficiency affects the action of other genes in the liver with the idea that this will help determine the molecular pathways through which Apoal operates Hybridizations The experiment compared 8 Apoal knockout mice with 8 normal C57BL 6 black six mice the control mice For each of these 16 mice target mRNA was obtained from liver tissue and labelled using a Cy5 dye The RNA from each mouse was hybridized to a separate microarray Common reference RNA was labelled with Cy3 dye and used for all the arrays The reference RNA was obtained by pooling RNA extracted from the 8 control mice Number of arrays Red Green 8 Normal black six mice Pooled reference 8 Apoal knockout Pooled reference 86 This is an example of a single comparison experiment using a common reference The fact that the comparison is made by way of a common reference rather than directly as for the swirl experiment makes this for each gene a two sample rather than a single sample setup gt load Apoal RData gt objects 1 RG gt names RG 1 rg ou Rb Gp printer genes targets gt RG targets FileName alkoci spot alkoc2 spot alkoc3 spot alkoc4 spot alkoc5 spot alkoc6 spot alkoc7 spot alkoc8 spot alkok1 spot alkok2 spot alkok3 spot alkok4 spot alkok5 spot alkok6 spot alkok7 spot alkok8 spot c1 c2 c3 c4 c5 c6 c7 c8 k1 k2 k3 k4 Cy
33. lmFit eset design gt contrast matrix lt makeContrasts RNA2 RNA1 RNA3 RNA2 RNA3 RNA1 levels design gt gt fit2 lt contrasts fit fit contrast matrix fit2 lt eBayes fit2 A list of top genes for RNA2 versus RNA1 can be obtained from gt topTable fit2 coef 1 adjust BH The outcome of each hypothesis test can be assigned using gt results lt decideTests fit2 A Venn diagram showing numbers of genes significant in each comparison can be obtained from gt vennDiagram results The statistic it2 F and the corresponding fit2 F p value combine the three pair wise comparisons into one F test This is equivalent to a one way ANOVA for each gene except that the residual mean squares have been moderated between genes To find genes which vary between the three RNA targets in any way look for genes with small p values To find the top 30 genes gt topTableF fit2 number 30 43 9 4 Additive Models and Blocking 9 4 1 Paired Samples Paired samples occur when we compare two treatments and each sample given one treatment is naturally paired with a particular sample given the other treatment This is a special case of blocking with blocks of size two The classical test associated with this situation is the paired t test Suppose an experiment is conducted to compare a new treatment T with a control C Six dogs are used from three sib ships For each sib pair one dog is given the treatment while the
34. read maimages function The individual function help pages are especially important for listing all the arguments which a function will accept and what values the arguments can take A key to understanding R is to appreciate that anything that you create in Ris an object Objects might include data sets variables functions anything at all For example gt x lt 2 will create a variable x and will assign it the value 2 At any stage of your R session you can type 11 gt objects to get a list of all the objects you have created You can see the contents of any object by typing the name of the object at the prompt for example either of the following commands will print out the contents of x gt show x gt x We hope that you can use limma without having to spend a lot of time learning about the R language itself but a little knowledge in this direction will be very helpful especially when you want to do something not explicitly provided for in limma or in the other Bioconductor packages For more details about the R language see An Introduction to R which is available from the online help For more background on using R for statistical analyses see 6 3 2 Sample limma Session This is a quick overview of what an analysis might look like The first example assumes four replicate two color arrays the second and fourth of which are dye swapped We assume that the images have been analyzed using GenePix to produce a gpr
35. spots and 4 arrays gt colnames RG will give the names of the filenames or arrays in the object while if fit is an MArrayLM object then gt colnames fit would give the names of the coefficients in the linear model fit Objects of any of these classes may be subsetted so that RG j means the data for array j and RG i means the data for probes indicated by the index i Multiple data objects may be combined using cbind rbind or merge Hence gt RG1 lt read maimages files 1 2 source genepix gt RG2 lt read maimages files 3 5 source genepix gt RG lt cbind RG1 RG2 is equivalent to gt RG lt read maimages files 1 5 source genepix Alternatively if control status has been set in the MAList object then gt i lt MA genes Status Gene gt MA i might be used to eliminate control spots from the data object prior to fitting a linear model 14 Chapter 4 Reading Microarray Data 4 1 Scope of this Chapter This chapter covers most microarray types other than Affymetrix To read data from Affymetrix GeneChips please use the affy gcrma or aroma affymetrix packages to read and normalize the data 4 2 Recommended Files We assume that an experiment has been conducted with one or more microarrays all printed with the same library of probes Each array has been scanned to produce a TIFF image The TIFF images have then been processed using an image analysis program such a ArrayVisi
36. the data are very noisy one can apply the same between array normalization methods as would be used for microarrays for example gt v lt voom counts design plot TRUE normalize quantile After this the usual limma pipelines for differential expression can be applied for example Vv fit lt lmFit v design gt fit lt eBayes fit topTable fit coef ncol design Vv Or to give more weight to fold changes in the ranking one could use say Vv fit lt treat fit lfc log2 1 2 topTreat fit coef ncol design Vv 15 4 Differential splicing limma can also detect genes that how evidence of differential splicing between conditions One can test for differential splicing associated with any contrast for a linear model In this case the matrix of counts should be at the exon level with a row for each exon For example gt dge lt DGEList counts counts gt dge genes GeneID lt GeneID where counts is a matrix of exon level counts and GeneID identifies which gene each exon belongs to Then filter and normalize gt A lt rowSums dge counts gt dge lt dge A gt 10 keep lib sizes FALSE gt dge lt calcNormFactors dge Then apply the voom transformation and fit a linear model gt v lt voom dge design plot TRUE gt fit lt lmFit v design Now we can test for differential splicing associated with any coefficient in the linear model First run the diffSplice function
37. these controls turns out to be about 0 00014 gt i lt grep AFFX featureNames eset gt summary fit2 F p value i Min ist Qu Median Mean 3rd Qu Max 0 0001391 0 1727000 0 3562000 0 4206000 0 6825000 0 9925000 So a cutoff p value of 0 0001 say would conservatively avoid selecting any of the control probe sets as differentially expressed gt results lt classifyTestsF fit2 p value 0 0001 gt summary results E10 E48 1 40 76 O 12469 12410 1 116 139 gt table E10 results 1 E48 results 2 E48 E10 1 0 1 1 29 11 0 0 47 12370 52 1 0 29 87 gt vennDiagram results include up gt vennDiagram results include down 106 We see that 87 genes were up regulated at both 10 and 48 hours 29 only at 10 hours and 52 only at 48 hours Also 29 genes were down regulated throughout 11 only at 10 hours and 47 only at 48 hours No genes were up at one time and down at the other topTable gives a detailed look at individual genes Estrogen effect at 10 hours gt options digits 3 gt topTable fit2 coef E10 n 20 Symbol logFC AveExpr t P Value adj P Val B 39642_at ELOVL2 2 94 7 88 23 7 4 74e 09 3 13e 05 9 97 910_at TKi 3 11 9 66 23 6 4 96e 09 3 13e 05 9 94 31798_at TFF1 2 80 12 12 16 4 1 03e 07 3 51e 04 7 98 41400_at TKi 2 38 10 04 16 2 1 11e 07 3 51e 04 7 92 40117_at MCM6 2 56 9 68 15 7 1 47e 07 3 58e 04 7 71 1854_at MYBL2 2 51 8 53 15 2 1 95e 07 3 58e 04 7 49
38. you without giving you to opportunity to veto the download gt library limma gt library affy Welcome to Bioconductor Vignettes contain introductory material To view simply type openVignette 101 For details on reading vignettes see the openVignette help page gt Data lt ReadAffy gt eset lt rma Data Background correcting Normalizing Calculating Expression gt pData eset n E FO pad 0 nolrp_1 CEL nolrp_2 CEL nolrp_3 CEL nolrp_4 CEL wt_1 CEL wt_2 CEL wt_3 CEL wt_4 CEL onon PpP UNBE Now we consider differential expression between the lrp and lrp strains gt strain lt c lrp lrp lrp lrp lrp lrp lrp lrp gt design lt model matrix factor strain gt colnames design lt c lrp lrp vs gt design lrp lrp vs 1 1 0 2 1 0 3 1 0 4 1 0 5 1 1 6 a 1 7 1 1 8 f 1 attr assign 1 O 1 attr contrasts attr contrasts factor strain 1 contr treatment The first coefficient measures log expression of each gene in the lrp strain The second coefficient measures the log fold change of lrp over Irp i e the log fold change induced by Irp gt fit lt ImFit eset design gt fit lt eBayes fit gt options digits 2 gt topTable fit coef 2 n 40 adjust BH ID logFC AveExpr t P Value adj P Val B 4282 IG_821_1300838_1300922 fwd_st 3 32 12 4 23 1 7 2e 09 5 3e 05 8 017 5365 serA_b2913_st
39. you would like to make between the RNA samples For very simple experiments you may not need to specify the contrast matrix The philosophy of the approach is as follows You have to start by fitting a linear model to your data which fully models the systematic part of your data The model is specified by the design matrix Each row of the design matrix corresponds to an array in your experiment and each column corresponds to a coefficient that is used to describe the RNA sources in your experiment With Affymetrix or single channel data or with two color with a common reference you will need as many coefficients as you have distinct RNA sources no more and no less With direct design two color data you will need one fewer coefficient than you have distinct RNA sources unless you wish to estimate a dye effect for each gene in which case the number of RNA sources and the number of coefficients will be the same Any set of independent coefficients will do providing they describe all your treatments The main purpose of this step is to estimate the variability in the data hence the systematic part needs to be modelled so it can be distinguished from random variation In practice the requirement to have exactly as many coefficients as RNA sources is too restrictive in terms of questions you might want to answer You might be interested in more or fewer comparisons between the RNA source Hence the contrasts step is provided so that you can take the ini
40. zero weight gt wtfun lt function x pmin x area 50 1 Then read the SPOT files containing the intensity data using file names recorded in the targets file The data files are stored in the subdirectory spot gt RG lt read maimages targets source spot path spot wt fun wtfun Read spot cbmut 3 spot Read spot cbmut 4 spot Read spot cbmut 5 spot Read spot cbmut 6 spot Read spot cbmut 15 spot Read spot cbmut 16 spot Read spot cb 1 spot Read spot cb 2 spot Read spot cb 3 spot Read spot cb 4 spot Finally we set the print tip layout These arrays were printed using a print head with 8 rows and 4 columns of print tips gt RG printer lt list mgrid r 8 ngrid c 4 nspot r 25 nspot c 24 16 3 4 Annotation Probe annotation is contained a separate file The rows in the annotation file are as for the intensity data Columns give Riken chip rearray IDs GenBank accession numbers and UniGene information gt Annotation lt read delim 091701RikenUpdatev3 txt comment char quote check names FALSE stringsAsFactors FALSE gt names Annotation 1 ReArrayID Accession GenBank description Riken 4 Cluster ID UniGene 2nd description UniGene For our purposes we will keep the Riken IDs and GenBank accessions putting these into the data object gt RG genes lt Annotation c 1 2 gt colnames RG genes lt c RikenID GenBank 91 Where possible we find
41. 0 792 b3914_st 0 77 9 5 5 7 3 9e 04 1 0e 01 0 565 5008 pntB_b1602_st 1 47 12 8 5 6 4 1e 04 1 0e 01 0 496 4610 livM_b3456_st 1 04 8 5 5 5 4 7e 04 1 1e 01 0 376 5097 ptsG_b1101_st 1 16 12 2 5 5 4 8e 04 1 1e 01 0 352 4886 nupC_b2393_st 0 79 9 6 5 5 4 9e 04 1 1e 01 0 333 4898 ompT_b0565_st 2 67 10 5 5 4 5 6e 04 1 2e 01 0 218 5482 tdh_b3616_st 1 61 10 5 5 3 6 3e 04 1 3e 01 0 092 1927 IG_13_14080_14167_fwd_st 0 55 8 4 5 3 6 4e 04 1 3e 01 0 076 6320 yeeF_b2014_st 0 88 9 9 5 3 6 5e 04 1 3e 01 0 065 196 atpG_b3733_st 0 60 12 5 5 2 7 2e 04 1 4e 01 0 033 954 cydB_b0734_st 0 76 11 0 5 0 9 3e 04 1 8e 01 0 272 1186 fimI_b4315_st 1 15 8 3 5 0 9 5e 04 1 8e 01 0 298 4013 IG_58_107475_107629_fwd_st 0 49 10 4 4 9 1 1e 03 2 0e 01 0 407 The column logFC gives the log fold change while the column AveExpr gives the average log intensity for the probe set Positive M values mean that the gene is up regulated in lrp negative values mean that it is repressed It is interesting to compare this table with Tables III and IV in 11 Note that the top ranked gene is an intergenic region IG tRNA gene The knock out gene itself is in position four Many of the genes in the above table including the ser glt liv opp lys ilv and fim families are known targets of Irp 103 17 2 Effect of Estrogen on Breast Cancer Tumor Cells A 2x2 Factorial Experiment with Affymetrix Ar rays This data is from the estrogen package on Bioconductor A subset of the da
42. 1 SRRO31718 GSM461179 S2_DRSC_CG8144_RNAi 1 7 15 08 308T2AAXX SE 45 12 SRRO31719 GSM461179 S2_DRSC_CG8144_RNAi 1 7 18 08 308UEAAXX SE 45 13 SRRO31720 GSM461179 S2_DRSC_CG8144_RNAi 1 8 15 08 30AYWAAXX SE 45 14 SRRO31721 GSM461179 S2_DRSC_CG8144_RNAi 1 8 15 08 30AYWAAXX SE 45 15 SRRO31722 GSM461179 S2_DRSC_CG8144_RNAi 1 8 15 08 30AYWAAXX SE 45 16 SRRO31723 GSM461179 S2_DRSC_CG8144_RNAi 1 8 21 08 308A0AAXX SE 45 17 SRRO31724 GSM461180 S2_DRSC_CG8144_RNAi 3 12 23 08 30M2BAAXX PE 37 18 SRRO31725 GSM461180 S2_DRSC_CG8144_RNAi 3 12 23 08 30M2BAAXX PE 37 19 SRRO31726 GSM461181 S2_DRSC_CG8144_RNAi 4 12 23 08 30M2BAAXX PE 37 20 SRRO31727 GSM461181 S2_DRSC_CG8144_RNAi 4 12 23 08 30M2BAAXX PE 37 Run dates are in American format so that 7 15 08 denotes 15 July 2008 All runs in July and August 2008 used single end SE sequencing with 45 base pair reads while the later runs used paired end PE sequencing with 37 bp reads 18 2 3 Mapping reads to the reference genome The reads were mapped to the reference D melanogaster genome using subread 17 First the SRA format files need to be converted to FASTQ format which is the de facto standard for representing high throughput sequencing data The was done using the SRA Toolkit version 2 3 4 Specifically the Unix SRA toolkit command fastq dump split spot split files was used to produce one FASTQ file for each single end SRA file and two FASTQ files for each paired end SRA file Second FASTA files rep
43. 13 gt volcanoplot fit 99 Log Odds Log Fold Change 100 15 Chapter 17 Single Channel Case Studies 17 1 Lrp Mutant Ecoli Two Group Experiment with Affymetrix Arrays The data are from experiments reported in 11 and are available from the www site http visitor ics uci edu genex cybert tutorial index html The data is also available from the ecoliLeucine data package available from the Bioconductor www site under Experi mental Data Hung et al 11 state that The purpose of the work presented here is to identify the network of genes that are differentially regulated by the global E coli regulatory protein leucine responsive regulatory protein Lrp during steady state growth in a glucose supplemented minimal salts medium Lrp is a DNA binding protein that has been reported to affect the expression of approximately 55 genes Gene expression in two E coli bacteria strains labelled lrp and lrp were compared using eight Affymetrix ecoli chips four chips each for lrp and Irp The following code assumes that the data files for the eight chips are in your current working directory gt dir 1 Ecoli CDF nolrp_1 CEL nolrp_2 CEL 4 nolrp_3 CEL nolrp_4 CEL wt_1 CEL 7 wt_2 CEL wt_3 CEL wt_4 CEL The data is read and normalized using the affy package The package ecolicdf must also be installed otherwise the rma function will attempt to download and install it for
44. 14 49 2 22e 11 7 09e 08 12 4699 5356 CATECHOLO METHYLTRAN 1 873 12 93 13 16 1 10e 10 2 34e 07 11 5734 4139 EST WeaklysimilartoC 0 981 12 61 11 71 7 28e 10 1 16e 06 10 3623 1739 ApoCIII lipid Img 0 933 13 74 10 58 3 66e 09 4 67e 06 9 4155 1496 est 0 949 12 23 9 92 9 85e 09 1 05e 05 8 6905 2537 ESTs Highlysimilarto 1 011 13 63 9 56 1 75e 08 1 60e 05 8 2587 70 4941 similartoyeaststerol 0 873 13 29 6 88 1 93e 06 1 54e 03 4 6875 947 EST WeaklysimilartoF 0 566 10 54 5 08 7 78e 05 5 52e 02 1 6112 2812 5 similartoPIR 55501 0 514 11 65 4 30 4 31e 04 2 75e 01 0 1242 6073 estrogenrec 0 412 9 79 4 21 5 27e 04 3 06e 01 0 0497 1347 Musmusculustranscrip 0 412 10 18 4 07 7 09e 04 3 47e 01 0 3106 634 MDB1376 0 380 9 32 4 07 7 11e 04 3 47e 01 0 3123 2 Cy5RT 0 673 10 65 4 04 7 61e 04 3 47e 01 0 3745 5693 Meox2 0 531 9 77 3 84 1 19e 03 4 74e 01 0 7649 For example the moderated t statistic of the top ranked gene Apoal which has been knocked out in this experiment increases in absolute terms from 23 98 when equal weights are used to 25 64 with print tip weights The t statistic of the related gene ApoCIII also increases in absolute value moderated t statistic of 9 83 before weighting and 10 58 after This analysis provides a further example that a graduated approach to quality control can improve power to detect differentially expressed genes 14 4 When to Use Array Weights Array weights are generally useful when there is some re
45. 2 78 12 2 15 8 1 6e 07 6 0e 04 6 603 1389 gltD_b3213_st 3 03 10 9 13 3 6 4e 07 1 6e 03 5 779 4 2 3 4625 1rp_b0889_st 2 30 9 3 11 3e 06 4 0e 03 4 911 102 1388 gltB_b3212_st 3 24 10 1 11 1 2 8e 06 4 0e 03 4 766 4609 livK_b3458_st 2 35 9 9 10 8 3 5e 06 4 0e 03 4 593 4901 oppB_b1244_st 2 91 10 7 10 6 4 0e 06 4 0e 03 4 504 4903 oppD_b1246_st 1 94 10 4 10 5 4 4e 06 4 0e 03 4 434 5413 sodA_b3908_st 1 50 10 3 9 7 8 0e 06 6 5e 03 3 958 4900 oppA_b1243_st 2 98 13 0 9 1 1 3e 05 9 2e 03 3 601 5217 rmf_b0953_st 2 71 13 6 9 0 1 5e 05 9 3e 03 3 474 7300 ytfK_b4217_st 2 64 11 1 8 9 1 5e 05 9 3e 03 3 437 5007 pntA_b1603_st 1 58 10 1 8 3 2 5e 05 1 4e 02 3 019 4281 IG_820_1298469_1299205_fwd_st 2 45 10 7 8 1 3 1e 05 1 6e 02 2 843 4491 ilv1_b0077_st 0 95 10 0 7 4 6 3e 05 2 9e 02 2 226 5448 stpA_b2669_st 1 79 10 0 7 4 6 4e 05 2 9e 02 2 210 611 b2343_st 2 12 10 8 7 1 7 9e 05 3 4e 02 2 028 5930 ybfA_b0699_st 0 91 10 5 7 0 8 7e 05 3 5e 02 1 932 1435 grxB_b1064_st 0 91 9 8 6 9 1 0e 04 3 8e 02 1 810 4634 lysU_b4129_st 3 30 9 3 6 9 1 1e 04 3 9e 02 1 758 4829 ndk_b2518_st 1 07 11 1 6 7 1 2e 04 4 3e 02 1 616 2309 IG_1643_2642304_2642452_rev_st 0 83 9 6 6 7 1 3e 04 4 3e 02 1 570 4902 oppC_b1245_st 2 15 10 7 6 3 1 9e 04 5 9e 02 1 238 4490 ilvH_b0078_st 1 11 9 9 5 9 2 9e 04 8 8e 02 0 820 1178 fimA_b4314_st 3 40 11 7 5 9 3 2e 04 8 8e 02 0 743 6224 ydgR_b1634_st 2 35 9 8 5 8 3 3e 04 8 8e 02 0 722 4904 oppF_b1247_st 1 46 9 9 5 8 3 3e 04 8 8e 02 0 72
46. 3 Pool Pool Pool Pool Pool Pool Pool Pool Pool Pool Pool Pool Pool Pool Pool Pool Cy5 C57BL 6 C57BL 6 C57BL 6 C57BL 6 C57BL 6 C57BL 6 C57BL 6 C57BL 6 ApoAI ApoAI ApoAI ApoAI ApoAI ApoAI ApoAI ApoAI M values MA lt normalizeWithinArrays RG cols lt MA targets Cy5 cols cols C57BL 6 lt blue cols cols ApoAI lt yellow boxplot MA M col MA M names rownames MA targets col cols xlab Mouse ylab M values 000 commen ameno o oo oo o o coooa gamo op T T T T T T c4 c5 c6 c7 c8 ki Mouse k2 k3 k5 k6 Since the common reference here is a pool of the control mice we expect to see more differences from the pool for the knock out mice than for the control mice In terms of the above plot 87 this should translate into a wider range of M values for the knock out mice arrays than for the control arrays and we do see this Since the different arrays are not expected to have the same range of M values between array scale normalization of the M values is not appropriate here Now we can go on to estimate the fold change between the two groups In this case the design matrix has two columns The coefficient for the second column estimates the parameter of interest the log ratio between knockout and control mice gt design lt cbind Control Ref 1 KO Control MA targets Cy5 ApoAI gt design Control
47. 3 3355011 NT_033778 3 19542 20989 1448 5 3355011 NT_033778 3 18852 19484 633 8 3355011 NT_033778 3 18693 18773 81 9 3355011 NT_033778 3 18827 19484 658 The annotation includes Entrez ID and the length chromosome and start and stop position of each exon Resort back to original SRA order gt y all lt y all SRA SRA Then we collapse the data so that there is a single column for each GEO sample by summing the counts over the technical replicates gt y lt sumTechReps y all ID SRA GEO gt y samples group lib size norm factors GSM461176 1 28726876 1 GSM461177 1 12204569 1 GSM461178 1 14104800 1 GSM461179 1 26041043 1 GSM461180 1 13898076 1 GSM461181 1 15189911 1 gt colnames y La c N1 N3 N4 D1 D3 D4 18 2 6 Gene annotation Annotation for D melanogaster genes was downloaded from ftp ftp ncbi nlm nih gov gene DATA GENE_INFO Invertebrates We now add selected annotation columns to the DGEList object ncbi Li lt readLines Drosophila_melanogaster gene_info n 1 ncbi colname lt unlist strsplit substring ncbi L1 10 234 ncbi lt read delim Drosophila_melanogaster gene_info skip 1 header FALSE stringsAsFactors FALSE colnames ncbi lt ncbi colname m lt match y genes GeneID ncbi GeneID y genes Chr lt ncbi chromosome m y genes Symbol lt ncbi Symbol m y genes Strand lt NULL head y genes VVVVVV FV VV 128 GeneID Chr Start End Length Symbol 3
48. 3 72e 07 3 69e 04 7 18 40117_at MCM6 2 28 9 68 14 0 3 80e 07 3 69e 04 7 17 40533_at BIRC5 1 64 8 47 13 5 4 94e 07 4 45e 04 6 93 39642_at ELOVL2 1 61 7 88 13 0 6 71e 07 5 18e 04 6 65 34851_at AURKA 1 96 9 96 12 8 7 51e 07 5 18e 04 6 55 1824_s_at PCNA 1 64 9 24 12 8 7 95e 07 5 18e 04 6 50 35995_at ZWINT 2 76 8 87 12 7 8 32e 07 5 18e 04 6 46 893_at UBE2S 1 54 10 95 12 7 8 43e 07 5 18e 04 6 45 40079_at GPRC5A 2 41 8 23 12 6 8 62e 07 5 18e 04 6 42 gt sessionInfo R version 3 0 0 2013 04 03 Platform 1386 w64 mingw32 i386 32 bit locale 1 LC_COLLATE English_Australia 1252 LC_CTYPE English_Australia 1252 3 LC_MONETARY English_Australia 1252 LC_NUMERIC C 5 LC_TIME English_Australia 1252 attached base packages 1 parallel stats graphics grDevices utils datasets methods base other attached packages 1 hgu95av2 db_2 9 0 org Hs eg db_2 9 0 RSQLite_0 11 3 4 DBI_0 2 6 annotate_1 38 0 hgu95av2cdf_2 12 0 7 AnnotationDbi_1 22 2 affy_1 38 1 Biobase_2 20 0 10 BiocGenerics_0 6 0 limma_3 17 7 BiocInstaller_1 10 0 loaded via a namespace and not attached 1 affyio_1 28 0 IRanges_1 18 0 preprocessCore_1 22 0 4 stats4_3 0 0 tools_3 0 0 XML_3 96 1 1 7 xtable_1 7 1 zlibbioc_1 6 0 17 3 Comparing Mammary Progenitor Cell Populations with Illumina BeadChips 17 3 1 Introduction This case study examines the expression profiles of adult mammary stem cells and of progenitor and mature mammary lumina cells The data was firs
49. 32723 32723 30063 1 0 0 1749 It appears that only the 10 ml kg treatment is different from the saline control however this is quite different Same analysis but now averaging probes for each gene 116 VVVV MV 0 1 yave lt avereps y0 ID y0 genes SystematicName fit lt ImFit yave design fit lt eBayes fit trend TRUE plotSA fit main Gene level summary decideTests fit 1 Treatment2 ml kg corn oil Treatment5 ml kg corn oil Treatment10 ml kg corn oil 1 log2 sigma 1 2 3 4 0 19112 0 Probe level lowess prior Average log expression log2 sigma 117 0 407 19112 17878 0 827 Gene level lowess Edo E prior o o j 4 T T T T T T Average log expression Chapter 18 RNA Seq Case Studies 18 1 Profiles of Unrelated Nigerian Individuals 18 1 1 Background RNA Seq profiles were made from lymphoblastoid cell lines generated as part of the Interna tional HapMap project from 69 unrelated Nigerian individuals 25 RNA from each individual was sequenced on at least two lanes of the Illumina Genome Analyser 2 platform and mapped reads to the human genome using MAQ v0 6 8 The study group here is essentially an opportunity sample and the individuals are likely to be genetically diverse In this analysis we look at genes that are differentially expressed between males and females This case study requires
50. 355011 2R 18479 18629 151 CG17683 3355011 2R 18681 19484 804 CG17683 3355011 2R 19542 20989 1448 CG17683 3355011 2R 18852 19484 633 CG17683 3355011 2R 18693 18773 81 CG17683 3355011 2R 18827 19484 658 CG17683 OoOWONnNWNF 18 2 7 Filtering Keep exons that have more than 1 cpm in at least 3 samples gt isexpr lt rowSums cpm y gt 1 gt 3 gt y lt ylisexpr keep lib sizes FALSE 18 2 8 Scale normalization Apply TMM normalization gt y lt calcNormFactors y gt y sample group lib size norm factors Ni 1 28604956 0 954 N3 1 12132492 1 028 N4 1 13991699 0 975 Di 1 25901171 1 009 D3 1 13768063 1 022 D4 1 15105975 1 014 An MDS plot shows clear separation of the Pasilla down vs normal samples but also a batch effect associated with sequencing type and date gt plotMDS y 129 Z Di D3 D4 LO E S Lo O LL D o lo o D Cc 40 oO LO a o l o N4 N1 7 N3 1 0 0 5 0 0 0 5 1 0 1 5 Leading logFC dim 1 18 2 9 Linear modelling Create design matrix gt Batch lt factor c 1 3 4 1 3 4 gt Pasilla lt factor GEO Pasilla levels c Normal Down gt design lt model matrix Batch Pasilla Apply voom to convert the read counts to log2 cpm with associated weights gt v lt voom y design plot TRUE 130 voom Mean variance trend LO E 2 o 5 0 gt D O jo k O O Y n E o O O N o e log2 count size 0 5 Linear modelling g
51. 3697 3702 141 25 26 21 28 29 30 31 32 33 34 35 36 37 Pickrell J K Marioni J C Pai A A Degner J F Engelhardt B E Nkadori E Veyrieras J B Stephens M Gilad Y and Pritchard J K 2010 Understanding mechanisms underlying human gene expression variation with RNA sequencing Nature 464 768 72 Rainer J Sanchez Cabo F Stocker G Sturn A and Trajanoski Z 2006 CAR MAweb comprehensive r and bioconductor based web service for microarray data anal ysis Nucleic Acids Res 34 Web Server issue W498 503 Reiner A Yekutieli D and Benjamini Y 2003 Identifying differentially expressed genes using false discovery rate controlling procedures Bioinformatics 19 368 375 Renn S C P Aubin Horth N and Hofmann H A 2004 Biologically meaningful ex pression profiling across species using heterologous hybridization to a cDNA microarray BMC Genomics 5 Ritchie M Diyagama D Neilson J Van Laar R Dobrovic A Holloway A and Smyth G 2006 Empirical array quality weights in the analysis of microarray data BMC bioinformatics 7 261 Ritchie M Silver J Oshlack A Holmes M Diyagama D Holloway A and Smyth G 2007 A comparison of background correction methods for two colour microarrays Bioinformatics 23 2700 2707 Robinson M D and Oshlack A 2010 A scaling normalization method for differential expressio
52. 5 18 1 3 DGEList object Form a DGEList object combining the counts and associated annotation gt library edgeR gt y lt DGEList counts Counts genes annot rownames Counts 18 1 4 Filtering Keep genes with least 1 count per million reads cpm in at least 20 samples gt isexpr lt rowSums cpm y gt 1 gt 20 119 Keep only genes with defined annotation gt hasannot lt rowSums is na y genes 0 gt y lt ylisexpr amp hasannot keep lib sizes FALSE gt dim y 1 17310 69 18 1 5 Scale normalization Apply scale normalization gt y lt calcNormFactors y 18 1 6 Linear modeling Use voom 15 to convert the read counts to log cpm with associated weights ready for linear modelling gt library limma gt design lt model matrix Gender gt v lt voom y design plot TRUE voom Mean variance trend 1 5 Sqrt standard deviation 1 0 0 5 T T T 5 10 15 log2 count size 0 5 Some separation of female and male profiles is evident from an MDS plot although clear differences seem to rely on only a few score genes gt plotMDS v top 50 labels substring Gender 1 1 col ifelse Gender male blue red gene selection common 120 f o 4 f m a m m my m f m f m a f f So 7 f f A my m 5 m A a pe f m man pm gt Ei Ty m mar f nm f fo of ff f f f f f T f ff f f T T T T 2 0 2 4 Dimension 1
53. 658 0 56389 0 2951 0 124 0 3408 0 7447 0 0537 0 4680 0 3986 0 747 ILMN_1806310 0 2618 0 248 0 2589 0 12778 0 2196 0 264 0 1955 0 2920 0 1963 0 2382 0 1220 0 203 ILMN_1779670 0 0850 0 113 0 1644 0 10417 0 1469 0 224 0 0461 0 0691 0 0621 0 0655 0 0709 0 058 We can go further than this and estimate the overall proportion of the regular probes that correspond to expressed transcript using the method of Shi et al 34 gt pe lt propexpr x gt dim pe lt c 4 3 gt dimnames pe lt list CellType c MS Stroma ML LP Donor c 1 2 3 gt pe Donor CellType 1 2 3 MS 0 557 0 504 0 529 Stroma 0 518 0 514 0 514 ML 0 549 0 495 0 535 LP 0 555 0 518 0 517 The proportion of probes that are expressed varies from 50 56 The average is 52 5 17 3 5 Normalization and filtering Background correction and normalize 111 gt y lt negc x The negc functions performs normexp background correction using negative controls then quantile normalizes and finally logy transforms 35 It also automatically removes the control probes leaving only the regular probes in y gt dim y 1 48803 12 Filter out probes that are not expressed We keep probes that are expressed in at least three arrays according to a detection p values of 5 gt expressed lt rowSums y other Detection lt 0 05 gt 3 gt y lt ylexpressed gt dim y 1 24691 12 A multi dimensional scaling plot shows that the cell types are cell s
54. 7 D14 1 35 13 8 13 5 2 29e 06 0 002067 5 54 515 1 22 11 fc22a09 27 E17 1 27 13 2 13 4 2 44e 06 0 002067 5 48 5075 10 14 11 fb85f09 18 G18 1 28 14 4 13 4 2 46e 06 0 002067 5 48 7307 14 19 11 fc10h09 24 H18 1 20 13 4 13 2 2 67e 06 0 002067 5 40 319 1 14 7 fb85a01 18 E1 1 29 12 5 13 1 2 91e 06 0 002067 5 32 2961 6 14 9 fb85d05 18 F10 2 69 10 3 13 0 3 04e 06 0 002067 5 29 4032 8 14 24 fb87d12 18 N24 1 27 14 2 12 8 3 28e 06 0 002067 5 22 6903 14 2 15 control Vox 1 26 13 4 12 8 3 35e 06 0 002067 5 20 4546 9 14 10 fb85e07 18 G13 1 23 14 2 12 8 3 42e 06 0 002067 5 18 683 2 7 11 fb37b09 6 E18 1 31 13 3 12 4 4 10e 06 0 002182 5 02 1697 4 5 17 fb26b10 3 I20 1 09 13 3 12 4 4 30e 06 0 002182 4 97 7491 15 5 3 fb24g06 3 D11 1 33 13 6 12 3 4 39e 06 0 002182 4 96 4188 8 21 12 fc18d12 26 F24 1 25 12 1 12 2 4 71e 06 0 002209 4 89 4380 9 7 12 fb37e11 6 G21 1 23 14 0 12 0 5 19e 06 0 002216 4 80 3726 8 2 6 control fli 1 1 32 10 3 11 9 5 40e 06 0 002216 4 76 2679 6 2 15 control Vox 1 25 13 4 11 9 5 72e 06 0 002216 4 71 5931 12 6 3 fb32f06 5 C12 1 10 13 0 11 7 6 24e 06 0 002216 4 63 7602 15 9 18 fb50g12 9 L23 1 16 14 0 11 7 6 25e 06 0 002216 4 63 2151 5 2 15 control vent 1 40 12 7 11 7 6 30e 06 0 002216 4 62 3790 8 4 22 fb23d08 2 N16 1 16 12 5 11 6 6 57e 06 0 002221 4 58 7542 15 7 6 fb36g12 6 D23 1 12 13 5 11 0 9 23e 06 0 003000 4 27 4263 9 2 15 control vent 1 41 12 7 10 8 1 06e 05 0 003326 4 13 6375 13 2 15 control vent 1 37 12 5 10 5 1 33e 05 0 004026 3 91
55. 7_Sep09_1_3 txt Read GSM819074_US10283824_252828210181_S01_GE1_107_Sep09_1_2 txt Read GSM819073_US10283824_252828210180_S01_GE1_107_Sep09_1_4 txt Read GSM819072_US10283824_252828210180_S01_GE1_107_Sep09_1_3 txt Read GSM819071_US10283824_252828210180_S01_GE1_107_Sep09_1_2 txt Read GSM819070_US10283824_252828210180_S01_GE1_107_Sep09_1_1 txt Read GSM819069_US10283824_252828210179_S01_GE1_107_Sep09_1_4 txt Read GSM819068_US10283824_252828210179_S01_GE1_107_Sep09_1_3 txt Read GSM819067_US10283824_252828210179_S01_GE1_107_Sep09_1_2 txt Read GSM819066_US10283824_252828210179_S01_GE1_107_Sep09_1_1 txt Read GSM819065_US10283824_252828210178_S01_GE1_107_Sep09_1_4 txt Read GSM819064_US10283824_252828210178_S01_GE1_107_Sep09_1_3 txt Read GSM819063_US10283824_252828210178_S01_GE1_107_Sep09_1_2 txt Read GSM819062_US10283824_252828210178_S501_GE1_107_Sep09_1_1 txt Read GSM819061_US10283824_252828210177_S01_GE1_107_Sep09_1_4 txt Read GSM819060_US10283824_252828210177_S01_GE1_107_Sep09_1_3 txt Read GSM819059_US10283824_252828210177_S01_GE1_107_Sep09_1_2 txt Read GSM819058_US10283824_252828210177_S01_GE1_107_Sep09_1_1 txt Background correct and normalize gt y lt backgroundCorrect x method normexp Array 1 corrected Array 2 corrected Array 3 corrected Array 4 corrected 115 Array 5 corrected Array 6 corrected Array 7 corrected Array 8 corrected Array 9 corrected Array 10 corrected Array 11 corrected Array 12 corrected Array 13 corrected Arr
56. 9 141240 141239 141238 141237 141236 141235 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 45320 Ps PS PS PS PS PS PS PS OS PS PS OPS PS ODS PS PS OPS OS DS PS OPS PS PS PS PS ODS PS PS PS PS ODS PS bd bd OP PS OS ODS 2369449 2369857 2374609 2374767 2375103 2375450 2376030 2376492 2377157 2377519 2378366 2378650 2380090 2380960 2382114 2384489 2385352 2385514 2386379 2386688 2387021 2388437 2389074 2389981 2390494 2390809 2391351 2392055 2392516 2393244 2395660 2407676 2415567 2416257 2417438 2417672 2417919 2438498 2369721 2370062 2374941 2374941 2375352 2375909 2376424 2376669 2377419 2378122 2378546 2378792 2380275 2382046 2384350 2384875 2385445 2385647 2386453 2386918 2387248 2388670 2389271 2390066 2390626 2390958 2391521 2392148 2392664 2393711 2395770 2407835 2415821 2416454 2417562 2417849 2418435 2438795 273 206 333 175 250 460 395 178 263 604 181 143 186 1087 2237 387 94 134 75 231 228 234 198 86 133 150 171 94 149 468 111 160 255 198 125 178 517 298 trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol trol tro
57. A design gt fit lt eBayes fit The genes which show dye effects can be seen by gt topTable fit coef DyeEffect The genes which are differentially expressed in the mutant are obtained by gt topTable fit coef MUvsWT The fold changes and significant tests in this list are corrected for dye effects Including the dye effect in the model in this way uses up one degree of freedom which might otherwise be used to estimate the residual variability but it is valuable if many genes show non negligible dye effects 58 11 3 A Correlation Approach to Technical Replication In the previous sections we have assumed that all arrays are biological replicates Now consider an experiment in which two wild type and two mice from the same mutant strain are compared using two arrays for each pair of mice The targets might be FileName Cy3 Cyd Filel wtl mul File2 wtl mul File3 wt2 mu2 Filed wt2 mu2 The first and second and third and fourth arrays are technical replicates It would not be correct to treat this experiment as comprising four replicate arrays because the technical replicate pairs are not independent in fact they are likely to be positively correlated One way to analyze these data is the following gt biolrep lt c 1 1 2 2 gt corfit lt duplicateCorrelation MA ndups 1 block biolrep gt fit lt ImFit MA block biolrep cor corfit consensus gt fit lt eBayes fit gt topTable fit adju
58. C_COLLATE English_Australia 1252 LC_CTYPE English_Australia 1252 3 LC_MONETARY English_Australia 1252 LC_NUMERIC C 5 LC_TIME English_Australia 1252 136 attached base packages 1 stats graphics grDevices utils datasets methods base other attached packages 1 statmod_1 4 19 edgeR_3 6 1 limma_3 21 2 loaded via a namespace and not attached 1 tools_3 1 0 18 2 12 Acknowledgements Thanks for Yang Liao for mapping the reads and running featureCounts 137 Notes Acknowledgements Thanks to Jean Yee Hwa Yang and Sandrine Dudoit for the first three data sets The Swirl zebrafish data were provided by Katrin Wuennenburg Stapleton from the Ngai Lab at UC Berkeley Laurent Gautier made the ecoliLeucine data set available on Bioconductor Lynn Corcoran provided the Bobl Mutant data Andrew Holloway Ryan van Laar and Dileepa Diyagama provided the quality control data set Authors of particular functions in limma include Belinda Phipson Charity Law and Yifang Hu The limma package has benefited from many people who have made suggestions or re ported bugs including Naomi Altman Henrik Bengtsson Lourdes Pena Castillo Dongseok Choi Marcus Davy Par Engstrom Ramon Diaz Uriarte Robert Gentleman Wolfgang Hu ber William Kenworthy Kevin Koh Erik Kristiansson Mette Langaas Michael Lawrence Gregory Lefebvre Andrew Lynch James MacDonald Martin Maechler Steffen Moeller Ron Ophir Francois Pepin Hubert Rehrauer Ken Simp
59. E48 c 0 0 0 1 gt design lt model matrix treatments gt colnames design lt c Intercept Time E10 E48 The second coefficient picks up the effect of time in the absence of estrogen The third and fourth coefficients estimate the log fold change for estrogen at 10 hours and 48 hours respectively gt fit lt ImFit eset design We are only interested in the estrogen effects so we choose a contrast matrix which picks these two coefficients out gt cont matrix lt cbind E10 c 0 0 1 0 E48 c 0 0 0 1 gt fit2 lt contrasts fit fit cont matrix gt fit2 lt eBayes fit2 We can examine which genes respond to estrogen at either time using the moderated F statistics on 2 degrees of freedom The moderated F p value is stored in the component fit2 F p value What p value cutoff should be used One way to decide which changes are significant for each gene would be to use Benjamini and Hochberg s method to control the false discovery rate across all the genes and both tests gt results lt decideTests fit2 method global Another method would be to adjust the F test p values rather than the t test p values 105 gt results lt decideTests fit2 method nestedF Here we use a more conservative method which depends far less on distributional assumptions which is to make use of control and spike in probe sets which theoretically should not be differentially expressed The smallest p value amongst
60. In normal mice the terminally differentiated granule cells migrate to the internal granule layer but in mutant mice the cells die before doing so meaning that the mutant mice have strongly reduced numbers of cells in the internal granule layer The expression level of any gene which is specific to mature granule cells or is expressed in response to granule cell derived signals is greatly reduced in the mutant mice 16 3 2 Sample Preparation and Hybridizations At each time point P11 11 days postnatal and P21 21 days postnatal cerebella were isolated from two wild type and two mutant littermates and pooled for RNA isolation RNA was then divided into aliquots and labelled before hybridizing to the arrays This means that aliquots are technical replicates arising from the same mice and RNA extraction In our analysis here we will ignore this complication and will instead treat the aliquots as if they were biological replicates See Yang and Speed 2002 for a detailed discussion of this issue in the context of this experiment A pool of RNA was also made by combining the different RNA samples There are four different treatment combinations Pllwt P11mt P21wt and P21mt com prising a 2x2 factorial structure The RNA samples were hybridized to ten two color microar rays spotted with a 20k Riken clone library There are six arrays comparing the four different RNA sources to the RNA pool and four arrays making direct comparisons between the
61. MBOL ctMS ctML AveExpr F P Value adj P Val ILMN_1716370 TNS4 4 87 1 47 6 81 156 3 3 31e 09 4 81e 07 ILMN_1810054 CNN1 6 41 1 42 8 47 146 0 4 88e 09 4 81e 07 ILMN_1713744 Ci4orf132 3 85 1 74 7 34 103 5 3 37e 08 2 21e 06 ILMN_1720513 SETBP1 4 17 1 27 8 27 92 5 6 28e 08 3 09e 06 ILMN_1767662 LASS6 2 60 1 98 10 01 85 2 9 88e 08 3 46e 06 ILMN_1681984 GALNT10 3 54 1 23 8 09 84 2 1 06e 07 3 46e 06 ILMN_1731374 CPE 4 23 3 01 9 12 81 9 1 23e 07 3 46e 06 ILMN_1789639 FMOD 3 57 2 24 8 21 75 7 1 89e 07 4 65e 06 ILMN_1743933 TSHZ3 3 94 1 22 8 93 72 6 2 38e 07 5 21e 06 ILMN_1721876 TIMP2 3 57 1 38 10 10 69 0 3 13e 07 6 17e 06 17 4 Time Course Effects of Corn Oil on Rat Thymus with Agilent 4x44K Arrays This case study analyses a time course experiment using single channel Agilent Whole Rat Genome Microarray 4x44K v3 arrays Background The experiment concerns the effect of corn oil on gene expression in the thymus or rats The data was submitted by Hong Weiguo to ArrayExpress as series E GEOD 33005 The description of the experiment reads To investigate the effects of corn oil CO common drug vehicle on the gene expression profiles in rat thymus with microarray technique Female Wistar Rats were administered daily with normal saline NS CO 2 5 10 ml kg for 14 days respectively Then the thymus samples of rats were collected for microarray test and histopathology examination The microarray data showed that 0 40 458 dif ferentially express
62. Ref KO Control 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 1 gt fit lt lmFit MA design gt fit coef 1 5 Control Ref K0 Control PRRERPRRPRRRRPRRRR BR BR SR RRRPRARPRRARRRRPROOOOOOOOoO 1 0 6595 0 6393 2 0 2294 0 6552 3 0 2518 0 3342 4 0 0517 0 0405 5 0 2501 0 2230 gt fit lt eBayes fit gt options digits 3 Normally at this point one would just type gt topTable fit coef 2 However the gene annotation is a bit wide for the printed page so we will tell topTable to show just one column of the annotation information gt topTable fit coef 2 number 15 genelist fitfgenes NAME ID logFC AveExpr t P Value adj P Val B 2149 ApoAI lipid Img 3 166 12 47 23 98 4 77e 15 3 05e 11 14 927 540 EST HighlysimilartoA 3 049 12 28 12 96 1 57e 10 5 02e 07 10 813 5356 CATECHOLO METHYLTRAN 1 848 12 93 12 44 3 06e 10 6 51e 07 10 448 4139 EST WeaklysimilartoC 1 027 12 61 11 76 7 58e 10 1 21e 06 9 929 88 1739 ApoCIII lipid Img 0 933 13 74 9 84 1 22e 08 1 56e 05 8 192 2537 ESTs Highlysimilarto 1 010 13 63 9 02 4 53e 08 4 22e 05 7 305 1496 est 0 977 12 23 9 00 4 63e 08 4 22e 05 7 290 4941 similartoyeaststerol 0 955 13 29 7 44 7 04e 07 5 62e 04 5 311 947 EST WeaklysimilartoF 0 571 10 54 4 55 2 49e 04 1 77e 01 0 563 5604 0 366 12 71 3 96 9 22e 04 5 29e 01 0 553 4140 APXL2 5q Img 0 420 9 79 3 93 9 96e 04 5 29e 01 0 619 6073 estrogenr
63. _HYB HOUSEKEEPING LABELING 2 6 7 2 LOW_STRINGENCY_HYB NEGATIVE regular 8 759 48803 The component E contains the expression value for each probe gt options digits 3 gt head x E 1 2 3 4 5 6 7 8 9 10 11 12 ILMN_1762337 52 3 46 1 54 0 47 7 54 8 47 4 67 4 47 9 40 5 44 7 80 6 42 5 ILMN_2055271 69 9 73 9 58 6 72 4 77 1 82 1 69 1 81 3 79 1 82 5 87 0 60 9 ILMN_1736007 57 5 53 7 53 4 49 4 58 6 59 9 56 4 51 6 58 7 51 7 58 4 43 9 ILMN_2383229 53 6 57 5 48 2 48 2 61 8 64 5 52 7 43 5 65 5 49 8 53 9 39 3 ILMN_1806310 58 1 55 1 50 5 60 0 64 2 58 4 58 0 52 3 56 6 55 6 65 3 46 4 ILMN_1779670 64 5 61 2 52 8 61 9 67 9 59 7 68 2 63 1 65 1 65 7 69 6 52 0 The intensities vary from about 5 to 14 on the log scale gt boxplot log2 x E range 0 ylab log2 intensity 110 12 14 4 f log2 intensity 10 fi 17 3 4 How many probes are truly expressed The detection values contain p values for testing whether each probe is more intense than the negative control probes Small values are evidence that the probe corresponds to a truly expressed gene gt head x other Detection 1 2 3 4 5 6 T 8 9 10 11 12 ILMN_1762337 0 5585 0 675 0 1370 0 60139 0 5776 0 782 0 0503 0 4781 0 9082 0 7145 0 0000 0 460 ILMN_2055271 0 0306 0 000 0 0493 0 00278 0 0364 0 000 0 0391 0 0000 0 0000 0 0000 0 0000 0 000 ILMN_1736007 0 2772 0 292 0 1534 0 48611 0 4112 0 220 0 2318 0 3145 0 1554 0 3774 0 2539 0 360 ILMN_2383229 0 4735 0 187 0 3
64. a from multiple arrays and in that intensities for control probes are written to a separate file from the regular probes There are other features of these files that can optionally be used for pre processing and filtering Illumina probe summary files can be read by the read ilmn function A typical usage is gt x lt read ilmn probe profile txt ctrlfiles control probe profile txt where probe profile txt is the name of the main probe summary profile file and control probe profile txt is the name of the file containing profiles for control probes If there are multiple probe summary profiles to be read and the samples are summarized in a targets frame then the read ilmn targets function can be used Reading the control probe profiles is optional but recommended If the control probe profiles are available then the Illumina data can be favorably background corrected and normalized using the negc or nec functions Otherwise Illumina data is background corrected and normalized as for other single channel platforms See Section 17 3 for a fully worked case study with lumina microarray data 20 4 7 Image derived Spot Quality Weights Image analysis programs typically output a lot of information in addition to the foreground and background intensities which provides information on the quality of each spot It is sometimes desirable to use this information to produce a quality index for each spot which can be used in the subsequent an
65. a two way experiment in which time course profiles are to be compared for two genotypes Consider the targets frame 48 FileName Target Filel wt 0hr File2 wt 0hr File3 wt 6hr File4 wt 24hr File mu 0Ohr File6 mu Ohr File7 mu 6hr File8 mu 24hr The targets are RNA samples collected from wild type and mutant animals at 0 6 and 24 hour time points This can be viewed as a factorial experiment but a simpler approach is to use the group mean parametrization gt lev lt c wt Ohr wt 6hr wt 24hr mu Ohr mu 6hr mu 24hr gt f lt factor targets Target levels lev gt design lt model matrix O f gt colnames design lt lev gt fit lt lmFit eset design Which genes respond at either the 6 hour or 24 hour times in the wild type We can find these by extracting the contrasts between the wild type times gt cont wt lt makeContrasts wt 6hr wt Ohr wt 24hr wt 6hr levels design gt fit2 lt contrasts fit fit cont wt gt fit2 lt eBayes fit2 gt topTableF fit2 adjust BH Any two contrasts between the three times would give the same result The same gene list would be obtained had wt 24hr wt 0hr been used in place of wt 24hr wt 6hr for example Which genes respond i e change over time in the mutant cont mu lt makeContrasts mu 6hr mu Ohr mu 24hr mu 6hr levels design fit2 lt contrasts fit fit cont mu fit2 lt eBayes fit2 topTabl
66. al GSM461177 S2_DRSC_Untreated 3 Normal GSM461178 S2_DRSC_Untreated 4 Normal GSM461179 S2_DRSC_CG8144_RNAi 1 Down GSM461180 S2_DRSC_CG8144_RNAi 3 Down GSM461181 S2_DRSC_CG8144_RNAi 4 Down DO 0 gt SU0Ne There are three untreated biological samples in which ps should be expressed at normal levels and three treated biological samples in which ps should be expressed at reduced levels While GEO records the sample information the sequencing data file are actually held on the NCBI Short Read Archive SRA The RNA samples were sequenced on an Illumina Genome Analyzer II Multiple sequencing runs were used for several of the samples resulting in a total of 20 SRA files gt SRA lt readTargets SRA Files csv sep gt SRA SRA GEO Title RunDate FlowCellID Type ReadLength 1 SRRO31708 GSM461176 S2_DRSC_Untreated 1 7 15 08 308T2AAXX SE 45 2 SRRO31709 GSM461176 S2_DRSC_Untreated 1 7 15 08 308T2AAXX SE 45 3 SRRO31710 GSM461176 S2_DRSC_Untreated 1 8 15 08 30AYWAAXX SE 45 125 4 SRRO31711 GSM461176 S2_DRSC_Untreated 1 8 15 08 30AYWAAXX SE 45 5 SRRO31712 GSM461176 S2_DRSC_Untreated 1 8 15 08 30AYWAAXX SE 45 6 SRRO31713 GSM461176 S2_DRSC_Untreated 1 8 15 08 30AYWAAXX SE 45 7 SRRO31714 GSM461177 S2_DRSC_Untreated 3 11 14 08 30MNEAAXX PE 37 8 SRRO31715 GSM461177 S2_DRSC_Untreated 3 12 23 08 30M2BAAXX PE 37 9 SRRO31716 GSM461178 S2_DRSC_Untreated 4 12 23 08 30M2BAAXX PE 37 10 SRRO31717 GSM461178 S2_DRSC_Untreated 4 12 23 08 30M2BAAXX PE 37 1
67. alysis steps One approach is to remove all spots from consideration which do not satisfy a certain quality criterion A more sophisticated approach is to produce a quantitative quality index which can be used to up or downweight each spot in a graduated way depending on its perceived reliability limma provides an approach to spot weights which supports both of these approaches The limma approach is to compute a quantitative quality weight for each spot Weights are treated similarly in limma as they are treated in most regression functions in R such as 1m A zero weight indicates that the spot should be ignored in all analysis as being unreliable A weight of 1 indicates normal quality A spot quality weight greater or less than one will result in that spot being given relatively more or less weight in subsequent analyses Spot weights less than zero are not meaningful The quality information can be read and the spot quality weights computed at the same time as the intensities are read from the image analysis output files The computation of the quality weights is defined by the wt fun argument to the read maimages function This argument is a function which defines how the weights should be computed from the information found in the image analysis files Deriving good spot quality weights is far from straightforward and depends very much on the image analysis software used limma provides a few examples which have been found to be useful by some re
68. arlier block of twelve exons which become relatively more highly expressed when Pasilla is down Most exons in the first half of the gene behave similarly to each other This gene is on the negative strand so transcription is from right to left By going back to the absolute fold changes for the top gene we can confirm that the early exons are not differentially expressed whereas the second block is up regulated and the last block is down regulated gt trol lt fit de genes Symbol trol gt tab lt topTable fit de trol coef PasillaDown n Inf gt o lt order tab Start gt tab o GeneID Chr Start End Length Symbol logFC AveExpr 141270 45320 X 2364493 2365227 735 trol 0 0845 4 3283 141600 45320 X 2365041 2365227 187 trol 0 0449 2 7849 141269 45320 X 2366985 2367132 148 trol 0 0183 3 5476 141268 45320 X 2367208 2367624 417 trol 0 0996 5 0373 141267 45320 X 2367693 2368393 701 trol 0 0944 5 4923 141266 45320 X 2368463 2369388 926 trol 0 0452 5 6816 134 O O OGO oo t P Value adj P Val 724 4 81e 01 6 80e 01 286 7 79e 01 8 65e 01 135 8 95e 01 8 95e 01 846 4 12e 01 6 25e 01 940 3 63e 01 5 80e 01 478 6 40e 01 8 28e 01 033 965 153 039 010 368 141265 141264 141958 141263 141262 141261 141260 141259 141258 141257 141256 141255 141779 141254 141253 141252 141250 141249 141248 141546 141247 141286 141285 141329 141328 141327 141326 141325 141324 141246 141241 14152
69. ason to expect variable array quality For example RNA samples from human clinical patients are typically variable in quality so array weights might be used routinely with human in vivo data see for example Ellis et al 9 If array quality plots suggest a problem then array weights are indicated If RNA is plentiful e g from cell lines or model organisms and quality plots of the arrays don t suggest problems then array weights are usually not needed In gross cases where an array is clearly bad or wrong it should be removed rather than downweighted However this action should be reserved for extreme cases If most genes are not differentially expressed then the design matrix for arrayWeights does not need to be as complex as for the final linear model For example in a two group comparison with just 2 replicates in each group the array weights should be estimated with the default intercept design matrix otherwise each array is compared only to its partner rather than to the other 3 arrays 71 Chapter 15 RNA seq Data 15 1 Introduction The limma approach to RNA seq explained in the article by Law et 15 The voom trans formation is applied to the read counts This converts the counts to log counts per million with associated precision weights After this the RNA seq data can be analyzed as if it was microarray data This means for example that any of the linear modelling or gene set testing methods in the limma package can b
70. ause they do not require this prior knowledge The eBayes function computes one more useful statistic The moderated F statistic F combines the t statistics for all the contrasts into an overall test of significance for that gene The F statistic tests whether any of the contrasts are non zero for that gene i e whether that gene is differentially expressed on any contrast The denominator degrees of freedom is the same as that of the moderated t Its p value is stored as fit F p value It is similar to the ordinary F statistic from analysis of variance except that the denominator mean squares are moderated across genes A frequently asked question relates to the occasional occurrence that all of the adjusted p values are equal to 1 This is not an error situation but rather an indication that there is no evidence of differential expression in the data after adjusting for multiple testing This can occur even though many of the raw p values may seem highly significant when taken as individual values This situation typically occurs when none of the raw p values are less than 1 G where G is the number of probes included in the fit In that case the adjusted p values are typically equal to 1 using any of the adjustment methods except for adjust none 13 2 Fitted Model Objects The output from 1mFit is an object of class MArrayLM This section gives some mathematical details describing what is contained in such objects This section can be skip
71. ay 14 corrected Array 15 corrected Array 16 corrected Array 17 corrected Array 18 corrected Array 19 corrected gt y lt normalizeBetweenArrays y method quantile Now filter out control probes and low expressed probes To get an idea of how bright expression probes should be we compute the 95 percentile of the negative control probes on each array We keep probes that are at least 10 brighter than the negative controls on at least four arrays because there are four replicates gt neg95 lt apply y Ely genes ControlType 1 2 function x quantile x p 0 95 gt cutoff lt matrix 1 1 neg95 nrow y ncol y byrow TRUE gt isexpr lt rowSums y E gt cutoff gt 4 gt table isexpr isexpr FALSE TRUE 11500 32754 Regular probes are code as 0 in the ControlType column gt yO lt yly genes ControlType 0 amp isexpr Now we can find genes differentially expressed for the corn oil treatments compared to the saline control gt Treatment lt SDRF Characteristics treatment gt levels lt c 10 ml kg saline 2 ml kg corn oil 5 ml kg corn oil 10 ml kg corn oil gt Treatment lt factor Treatment levels levels gt design lt model matrix Treatment gt fit lt lmFit y0 design gt fit lt eBayes fit trend TRUE gt plotSA fit main Probe level gt summary decideTests fit 1 Treatment2 ml kg corn oil Treatment5 ml kg corn oil Treatment10 ml kg corn oil 1 0 0 911
72. be negative for the first pair but positive for the second In this case there is no good alternative to treating the technical replicates as if they were biological so that that the experiment would be analysed as a simple comparison with dye swaps Beware however that treating technical replicates as biological gives p values that are smaller than they should be 60 Chapter 12 Separate Channel Analysis of Two Color Data Separate channel analysis is a way to analyse two color data in terms of the individual channel intensities 41 In effect separate channel analysis converts a two color experiment into a single channel experiment with twice as many arrays but with a technical pairing between the two channels that originated from the same array Consider an experiment comparing young and old animals for both both wild type and mutant genotypes FileName Cy3 Cy5 Filel wt young wt old File2 wt old wt young File3 mu young mu old File4 mu old mu young Each of the arrays in this experiment makes a direct comparison between young and old RNA targets There are no arrays which compare wild type and mutant animals This is an example of an unconnected design in that there are no arrays linking the wild type and mutant targets It is not possible to make comparisons between wild type and mutant animals on the basis of log ratios alone So to do this it is necessary to analyze the red and green channels intensities separately i e t
73. cs 21 2067 2075 Smyth G and Speed T 2003 Normalization of cDNA microarray data Methods 31 265 273 Smyth G Yang Y and Speed T 2003 Statistical issues in cDNA microarray data analysis Methods in Molecular Biology 224 111 136 Smyth G K and Altman N S 2013 Separate channel analysis of two channel microar rays recovering inter spot information BMC Bioinformatics 14 165 Uppalapati S R Ayoubi P Weng H Palmer D A Mitchell R E Jones W and Bender C L 2005 The phytotoxin coronatine and methyl jasmonate impact multiple phytohormone pathways in tomato The Plant Journal 42 201 217 Vaquerizas J M Dopazo J and D az Uriarte R 2004 DNMAD web based diagnosis and normalization for microarray data Bioinformatics 20 3656 3658 Visvader J E 2009 Keeping abreast of the mammary epithelial hierarchy and breast tumorigenesis Genes amp development 23 2563 2577 Weng H and Ayoubi P 2004 GPAP GenePix Pro Auto Processor for online pre processing normalization and statistical analysis of primary microarray data Software package Microarray Core Facility Oklahoma State University http darwin biochem okstate edu gpap3 Wettenhall J M Simpson K M Satterley K and Smyth G K 2006 affylmGUI a graphical user interface for linear modeling of single channel microarray data Bioinfor matics 22 897 899 Wettenhall J M and Smyth G K 2004 limmaGUI a graph
74. der to use reliably and they are not completely described in the main R documentation If we only wanted to test the first two questions above an easy to to setup the design matrix would be to use a nested interaction term gt Strain lt factor targets Strain levels c WT Mu gt Treatment lt factor targets Treatment levels c U S gt design lt model matrix StraintStrain Treatment gt colnames design 1 Intercept StrainMu StrainWT TreatmentS StrainMu TreatmentS The first term in the model formula is an effect for Strain This introduces an intercept column to the design matrix which estimates the average log expression level for wild type unstimulated cells and a column for Strain which estimates the mutant vs wildtype difference in the unstimulated state The second term in the model formula represents the interaction between stimulation and strain Because there is no main effect for treatment in the model the interaction is fitted in a nested sense It introduces a third and a fourth column to the design matrix which represent the effect of stimulation for wild type and for mutant mice respectively exactly the same as the contrasts SvsUinWT and SvsUinMu define in the previous section After gt fit lt lmFit eset design gt fit lt eBayes fit then 46 gt topTable fit coef 3 will find those genes responding to stimulation in wild type mice and gt topTable fit coef 4
75. ducted by Dr Mireille Lahoud at the Walter and Eliza Hall Institute to compare gene expression in three different populations of dendritic cells DC CD4 Z B DN cbs Arrow heads represent Cy5 i e arrows point in the Cy3 to Cy5 direction This experiment involved six cDNA microarrays in three dye swap pairs with each pair used to compare two DC types The design is shown diagrammatically above The targets file was as follows 39 gt targets SlideNumber FileName Cy3 Cy5 m112med 12 m112med spot CD4 CD8 mli3med 13 mli3med spot CD8 CD4 ml14med 14 mli4med spot DN CD8 mlid5med 15 mlidmed spot CD8 DN ml16med 16 mli6med spot CD4 DN ml117med 17 mli7med spot DN CD4 There are many valid choices for a design matrix for such an experiment and no single correct choice We chose to setup the design matrix as follows gt design lt modelMatrix targets ref CD4 Found unique target names CD4 CD8 DN gt design CD8 DN mli2med 1 0 mli3med 1 0 mli4med 1 1 mli5med 1 1 mli6med 0 1 mli7med 0 1 In this design matrix the CD8 and DN populations have been compared back to the CD4 population The coefficients estimated by the linear model will correspond to the log ratios of CD8 vs CD4 first column and DN vs CD4 second column After appropriate normalization of the expression data a linear model was fit using gt fit lt lmFit MA design The linear model can now be interrogated to answer any questions of intere
76. ductor Please send requests for gen eral assistance and advice to the mailing list rather than to the individual authors Users posting to the mailing list for the first time should read the helpful posting guide at http www bioconductor org doc postingGuide html Note that each function in limma has it s own online help page as described in the next section Mailing list etiquette requires that you read the relevant help page carefully before posting a problem to the list 10 Chapter 3 Quick Start 3 1 A brief introduction to R R is a program for statistical computing It is a command driven language meaning that you have to type commands into it rather than pointing and clicking using a mouse In this guide it will be assumed that you have successfully downloaded and installed R from http www r project org A good way to get started is to type gt help start at the R prompt or if you re using R for Windows to follow the drop down menu items Help gt Html help Following the links Packages gt limma from the html help page will lead you to the contents page of help topics for functions in limma Before you can use any limma commands you have to load the package by typing gt library limma at the R prompt You can get help on any function in any loaded package by typing and the function name at the R prompt for example gt read maimages or equivalently gt help read maimages for detailed help on the
77. e gt RG printer lt getLayout RG genes Image plots It is interesting to look at the variation of background values over the array Consider image plots of the red and green background for the first array gt imageplot log2 RG Rb 1 RG printer low white high red gt imageplot log2 RG Gb 1 RG printer low white high green Image plot of the un normalized log ratios or M values for the first array gt MA lt normalizeWithinArrays RG method none gt imageplot MA M 1 RG printer zlim c 3 3 78 The imageplot function lies the slide on its side so the first print tip group is bottom left in this plot We can see a red streak across the middle two grids of the 3rd row caused by a scratch or dust on the array Spots which are affected by this artefact will have suspect M values The streak also shows up as darker regions in the background plots MA plots An MA plot plots the log ratio of R vs G against the overall intensity of each spot The log ratio is represented by the M value M log R log G and the overall intensity by the A value A log R log2 G 2 Here is the MA plot of the un normalized values for the first array gt plotMA MA swirl 1 123 4 1 3 The red streak seen on the image plot can be seen as a line of spots in the upper right of this plot Now we plot the individual MA plots for each of the print tip groups on this array together with the
78. e eee ee See ees 136 18 2 12 Acknowledgements 10d e dae bh he A cw Ae a 137 Chapter 1 Introduction Limma is a package for the analysis of gene expression data arising from microarray or RNA Seq technologies A core capability is the use of linear models to assess differential expression in the context of multifactor designed experiments Limma provides the ability to analyze comparisons between many RNA targets simultaneously It has features that make the anal yses stable even for experiments with small number of arrays this is achieved by borrowing information across genes It is specially designed for analysing complex experiments with a variety of experimental conditions and predictors The linear model and differential expres sion functions are applicable to data from any microarray platform including single channel or two color microarray platforms The methods are also applicable to expression data from non microarray platforms such as quantitative PCR or RNA Seq provided that a suitable matrix of expression values can be provided This guide gives a tutorial style introduction to the main limma features but does not describe every feature of the package A full description of the package is given by the individual function help documents available from the R online help system To access the online help type help package limma at the R prompt or else start the html help system using help start or the Windows drop down help menu
79. e applied to RNA seq data 15 2 Making a count matrix RNA seq data usually arrives in the form of FastQ or BAM files of unaligned reads The reads need to be mapped to a reference genome or transcriptome then summarized at the exon or gene level to produce a matrix of counts We find the Rsubread package 18 to be convenient fast and effective for this purpose Other popular methods include RSEM 16 and HTseq A runnable example with complete code showing how to use subread and featureCounts with limma is provided at http bioinf wehi edu au RNAsegCaseStudy 15 3 Differential expression Suppose that a matrix of read counts counts has been created with rows for genes and columns for samples The limma voom method assumes that rows with zero or very low counts have been removed It is usual to apply scale normalization to RNA seq read counts and the TMM normaliza tion method 31 in particular has been found to perform well in comparative studies To apply TMM normalization it is convenient to create a DGEList object using the edgeR package gt dge lt DGEList counts counts gt dge lt calcNormFactors dge The voom transformation is then applied 72 v lt voom dge design plot TRUE The voom transformation uses the experiment design matrix and produces an EList object It is also possible to give a matrix of counts directly to voom without TMM normalization by gt v lt voom counts design plot TRUE If
80. e following is an example MA Plot for an Incyte array with various spike in and other controls Data courtesy of Dr Steve Gerondakis Walter and Eliza Hall Institute of Medical Research The data shows high quality data with long comet like pattern of non differentially expressed probes and a small proportion of highly differentially expressed probes The plot was produced using gt spottypes lt readSpotTypes gt RG genes Status lt controlStatus spottypes RG gt plotMA RG Example MA Plot with Spot Type Highlighting Gene Unknown e Y Control D03 u10 D10 u25 e p25 a Sensifivity 3 Buffer e E 25 The array includes spike in ratio controls which are 3 fold 10 fold and 25 fold up and down regulated as well as non differentially expressed sensitivity controls and negative controls The background intensities are also a useful guide to the quality characteristics of each array Boxplots of the background intensities from each array gt boxplot data frame log2 RG Gb main Green background gt boxplot data frame log2 RG Rb main Red background will highlight any arrays unusually with high background intensities Spatial heterogeneity on individual arrays can be highlighted by examining imageplots of the background intensities for example gt imageplot log2 RG Gb 1 RG printer plots the green background for the first array The function
81. e on each array which can be extracted using exprs eset Affymetrix and other single channel microarray data may be analyzed very much like ordinary linear models or anova models The difference with microarray data is that it is almost always necessary to extract particular contrasts of interest and so the standard parametrizations provided for factors in R are not usually adequate There are many ways to approach the analysis of a complex experiment in limma A straightforward strategy is to set up the simplest possible design matrix and then to extract from the fit the contrasts of interest Suppose that there are three RNA sources to be compared Suppose that the first three arrays are hybridized with RNA1 the next two with RNA2 and the next three with RNAS Suppose that all pair wise comparisons between the RNA sources are of interest We assume that the data has been normalized and stored in an ExpressionSet object for example by gt data lt ReadAffy gt eset lt rma data An appropriate design matrix can be created and a linear model fitted using Vv design lt model matrix 0 factor c 1 1 1 2 2 3 3 3 gt colnames design lt c group1 group2 group3 gt fit lt lmFit eset design To make all pair wise comparisons between the three groups the appropriate contrast matrix can be created by gt contrast matrix lt makeContrasts group2 group1 group3 group2 group3 group1 levels design gt fit2 lt
82. eF fit2 adjust BH VYN Which genes respond differently over time in the mutant relative to the wild type cont dif lt makeContrasts Dif6hr mu 6hr mu Ohr wt 6hr wt Ohr Dif24hr mu 24hr mu 6hr wt 24hr wt 6hr levels design fit2 lt contrasts fit fit cont dif fit2 lt eBayes fit2 topTableF fit2 adjust BH VVvV 4 lt q lt v 49 The method of analysis described in this section was used for a six point time course experiment on histone deacetylase inhibitors 24 9 6 2 Many time points Now we consider an example with many time points for each group When there are many time points it is reasonable to assume that expression changes smoothly over time rather than making discrete jumps from one time point to another This type of time course can be analysed by fitting a temporal trend using a regression spline or a polynomial Consider the following targets frame with 32 rows FileName Group Time Filel Control 1 File2 Control 2 File16 Control 16 Filel7 Treat 1 File18 Treat 2 File32 Treat 16 It might be reasonable to represent a time course for a particular gene in a particular condition using a cubic spline curve with a modest number of knots Choosing effective degrees of freedom to be in range 3 5 is reasonable Setup a basis for a natural regression spline gt library splines gt X lt ns targets Time df 5 Then fit separate curves for the control and treatment groups
83. ec 0 421 9 79 3 91 1 03e 03 5 29e 01 0 652 1337 psoriasis associated 0 838 11 66 3 89 1 08e 03 5 29e 01 0 687 954 Caspase heart Img 0 302 12 14 3 86 1 17e 03 5 30e 01 0 757 563 FATTYACID BINDINGPRO 0 637 11 62 3 81 1 29e 03 5 30e 01 0 839 Notice that the top gene is Apoal itself which is heavily down regulated Theoretically the M value should be minus infinity for Apoal because it is the knockout gene Several of the other genes are closely related The top eight genes here were confirmed by independent assay subsequent to the microarray experiment to be differentially expressed in the knockout versus the control line gt volcanoplot fit coef 2 highlight 8 names fit genes NAME main KO vs Control KO vs Control 15 ApoAl EST H CATEC ESTW 10 ApeCl ESES simil Log Odds 5 5 3 2 1 0 Log Fold Change 16 3 Weaver Mutant Mice A Composite 2x2 Factorial Experiment 16 3 1 Background This case study considers a more involved two color analysis in which the RNA sources have a factorial structure with two factors 89 The study examined the development of neurons in wild type and weaver mutant mice 7 The weaver mutation affects cerebellar granule neurons the most numerous cell type in the central nervous system Weaver mutant mice are characterized by a weaving gait Granule cells are generated in the first postnatal week in the external granule layer of the cerebellum
84. ed Then we will go on to describe the more classical statistical approaches using factorial model formulas All the approaches considered are equivalent and yield identical bottom line results The most basic approach is to fit a model with a coefficient for each of the four factor combinations and then to extract the comparisons of interest as contrasts gt TS lt factor TS levels c WT U WT S Mu U Mu S gt design lt model matrix 0 TS gt colnames design lt levels TS gt fit lt lmFit eset design 45 This fits a model with four coefficients corresponding to WT U WT S Mu U and Mu S respectively Our three contrasts of interest can be extracted by gt cont matrix lt makeContrasts SvsUinWT WT S WT U SvsUinMu Mu S Mu U Diff Mu S Mu U WT S WT U levels design gt fit2 lt contrasts fit fit cont matrix gt fit2 lt eBayes fit2 We can use topTable to look at lists of differentially expressed genes for each of three contrasts or else gt results lt decideTests fit2 gt vennDiagram results to look at all three contrasts simultaneously This approach is recommended for most users because the contrasts that are being tested are formed explicitly 9 5 3 A Nested Interaction Formula Model formulas in R are very flexible and offer lots of possible shortcuts for setting up the design matrix However they also require a high level of statistical understanding in or
85. ed genes DEGs in 2 5 10 ml kg CO group compared to NS 114 group respectively The altered genes were associated with immune response cel lular response to organic cyclic substance regulation of fatty acid beta oxidation et al However no obvious histopathologic change was observed in the three CO dosage groups These data show that 10 ml kg CO that dosage has been deter mined as the vehicle in drug safety assessment can cause obvious influence on gene expression in rat thymus Our study suggest that the dosage of CO gavage as the vehicle for water in soluble agents in drug development should be no more than 5 ml kg if agents molecular effects in thymus want to be assessed Gene expression in thymus from female Wistar rats daily administered with 2 5 10 ml kg of corn oil or 10 ml kg of saline by gavage for 14 consecutive days were measured using Agilent Rat Whole Genome 8x60K array Data availability All files files were downloaded from http www ebi ac uk arrayexpress experiments E GEOD 33005 Read the sample and data relationship format SDRF file This is equivalent to what is known as the targets file in limma gt SDRF lt read delim E GEOD 33005 sdrf txt check names FALSE stringsAsFactors FALSE Read data gt x lt read maimages SDRF Array Data File source agilent green only TRUE Read GSM819076_US10283824_252828210181_S01_GE1_107_Sep09_1_4 txt Read GSM819075_US10283824_252828210181_S01_GE1_10
86. ed packages 1 edgeR_3 5 27 limma_3 19 27 tweeDEseqCountData_1 0 11 4 Biobase_2 23 6 BiocGenerics_0 9 3 loaded via a namespace and not attached 1 tools_3 1 0 124 18 2 Differential Splicing after Pasilla Knockdown 18 2 1 Background This case study re analyzes data produced by Brooks et al 3 The proteins NOVA1 and NOVA2 are known to regulate splicing in mammals Brooks et al used Drosophila melanogaster as a model system and used an RNA interference system RNAi to knock down the expression of the D melanogaster ortholog of NOVA1 and NOVA2 which is Pasilla ps The experiment compared treated and untreated cells from the S2 DRSC cell line In this case study we find exons and genes that are differentially expressed after ps knock down and we find genes are that are differentially spliced in the knockdown samples as compared to wild type For completeness we will describe all the steps in the analysis from raw data files including mapping and alignment If you want to skip the mapping and read counting steps go straight to Section 18 2 6 where the statistical analysis begins 18 2 2 GEO samples and SRA Files Brooks et al 3 deposited data for six RNA samples to GEO http www ncbi nlm nih gov geo We have prepared the GEO accession numbers and titles in a csv file gt library limma gt GEO lt readTargets GE0 samples csv gt GEO sep i GEO Title Pasilla GSM461176 S2_DRSC_Untreated 1 Norm
87. eparated gt plotMDS y labels targets CellType MS MS a 4 MS a Ty E 2 2 LP a oJ LP LP ML ML Strom sima T T T T T T 29 1 0 1 2 3 Dimension 1 17 3 6 Within patient correlations The study involves multiple cell types from the same patient Arrays from the same donor are not independent so we need to estimate the within dinor correlation gt ct lt factor targets CellType gt design lt model matrix Ot ct gt colnames design lt levels ct gt dupcor lt duplicateCorrelation y design block targets Donor gt dupcor consensus correlation 1 0 134 As expected the within donor correlation is small but positive 112 17 3 7 Differential expression between cell types Now we look for differentially expressed genes We make all possible pairwise comparisons between the epithelial cell types allowing for the correlation within donors fit lt lmFit y design block targets Donor correlation dupcor consensus correlation contrasts lt makeContrasts ML MS LP MS ML LP levels design fit2 lt contrasts fit fit contrasts fit2 lt eBayes fit2 trend TRUE summary decideTests fit2 method global ML MS LP MS ML LP 1 3074 2836 1631 0 18088 18707 21307 1 3529 3148 1753 VVVV MV Top ten differentially expressed probes between ML and MS gt topTable fit2 coef 1 SYMBOL logFC AveExpr t P Value adj P Val B ILMN_1766707 IL17B 4 19 5 94 29 0 2
88. er Bioconductor packages there are always two versions of limma Most users will use the current official release version which will be installed by biocLite if you are using the current version of R There is also a developmental version of limma that includes new features due for the next official release The developmental version will be installed if you are using the developmental version of R The official release version always has an even second number for example 3 6 5 whereas the developmental version has an odd second number for example 3 7 7 Limma is updated frequently Once you have installed limma the change log can also be viewed from the R prompt To see the most recent 20 lines type gt changeLog n 20 2 3 How to get help Most questions about limma will hopefully be answered by the documentation or references If you ve run into a question which isn t addressed by the documentation or you ve found a conflict between the documentation and software itself then there is an active support community which can offer help The authors of the package always appreciate receiving reports of bugs in the package functions or in the documentation The same goes for well considered suggestions for im provements Any other questions or problems concerning limma should be sent to the Bioconduc tor mailing list bioconductor stat math ethz ch To subscribe to the mailing list see https stat ethz ch mailman listinfo biocon
89. er of genes Number of genes with 1 exon 36371 8272 1715 Mean number of exons in a gene 4 Max number of exons in a gene gt summary decideTests ex PasillaDown f 127 0 34341 1 188 57 Show top exons that don t follow the pattern of the other exons in the same gene gt topSplice ex level exon GeneID 11333 44548 11314 44548 141286 45320 95854 44448 141328 45320 16382 36542 41964 3771968 141285 45320 141327 45320 141246 45320 Chr 2R 2R X Start 6407125 6403340 2388437 3R 22426969 X 2R 2L Ps Ps PS 2390494 9771600 5158455 2389074 2390809 2393244 End 6408782 6403556 2388670 22429773 2390626 9782097 5185964 2389271 2390958 2393711 Length 1658 217 234 2805 133 10498 27510 198 150 468 Symbol lola lola trol scrib trol shot Msp 300 trol trol trol Show top genes that show differential splicing gt topSplice ex level gene n 20 132 logFC 2 49 29 3 56 15 5 88 61 46 4 01 5 49 2 73 22 14 12 13 11 afis iZ 11 10 10 AOrF ON OWA OO O ct P Value FDR 2 29e 33 7 95e 29 8 85e 23 1 53e 18 4 73e 22 5 47e 18 1 04e 20 8 99e 17 2 25e 20 1 56e 16 2 12e 19 1 23e 15 3 31e 19 1 64e 15 6 73e 19 2 92e 15 1 39e 18 5 34e 15 2 09e 18 7 24e 15 5 75e 09 19 4 6 49e 09 19 2 1 16e 08 18 4 1 05e 08 18 3 GeneID Chr Symbol NExons F P Value FDR 141235 45320 X trol 44 30 71 1 63e 40 1 07e 36 11214 44548 2R lo
90. es i e to within spot and whole spot contrasts on the log scale Usually created from an RGList using MA RG or normalizeWithinArrays Objects of this class contain one row for each spot There may be more than one spot and therefore more than one row for each probe MArrayLM MicroArray Linear Model Store the result of fitting gene wise linear models to the normalized intensities or log ratios Usually created by lmFitQ Objects of this class normally contain one row for each unique probe TestResults Store the results of testing a set of contrasts equal to zero for each probe Usually created by decideTests Objects of this class normally contain one row for each unique probe All these objects can be treated like any list in R For example MA M extracts the matrix of M values if MA is an MAList object or fit coef extracts the coefficient estimates if fit is an MArrayLM object names MA shows what components are contained in the object For those who are familiar with matrices in R all these objects are also designed to obey many analogies with matrices In the case of RGList and MAList rows correspond to spots and columns to arrays In the case of MarrayLM rows correspond to unique probes and columns to parameters or contrasts The functions summary dim length ncol nrow dimnames rownames colnames have methods for these classes For example gt dim RG 1 11088 4 shows that the RGList object RG contains data for 11088
91. es between the different background correction methods we consider one cDNA array which was self self hybridized i e the same RNA source was hy bridized to both channels For this array there is no actual differential expression The array was printed with a human 10 5k library and hybridized with Jurkatt RNA on both channels Data courtesy Andrew Holloway and Dileepa Diyagama Peter MacCallum Cancer Centre Melbourne The array included a selection of control spots which are highlighted on the plots Of particular interest are the spike in ratio controls which should show up and down fold changes of 3 and 10 The first plot displays data acquired with GenePix software and background corrected by subtracting the median local background which is the default with 27 GenePix data The plot shows the typical wedge shape with fanning of the M values at low intensities The range of observed M values dominates the spike in ratio controls The are also 1148 spots not shown on the plot because the background corrected intensities were zero or negative GenePix median background The second plot shows the same array background corrected with method normexp and offset 50 The spike in ratio controls now standout clearly from the range of the M values All spots on the array are shown on the plot because there are now no missing M values GenePix normexp background The third plot shows the same array quantified w
92. es of the files containing the green and red intensities respectively An example is given in Section 4 3 Typical code with ImaGene data might be gt targets lt readTargets gt files lt targets c FileNameCy3 FileNameCy5 gt RG lt read maimages files source imagene For ImaGene data the files argument to read maimages is expected to be a 2 column matrix of filenames rather than a vector The following table gives the default estimates used for the foreground and background intensities Source Foreground Background agilent agilent mean agilent median Median Signal Mean Signal Median Signal Median Signal Median Signal Median Signal bluefuse AMPCH None genepix F Mean B Median genepix median F Median B Median genepix custom Mean B imagene Signal Mean Signal Median or Signal Mean if auto segmentation has been used quantarray Intensity Background scanarrayexpress Mean Median smd old I MEAN B_MEDIAN smd Intensity Mean Background Median spot mean morph spot close open mean morph close open The default estimates can be over ridden by specifying the columns argument to read maimages Suppose for example that GenePix has been used with a custom background method and you wish to use median foreground estimates This combination of foreground and background is not provided as a pre set choice in limma but you can specify it by gt RG lt read maimages files source genepi
93. f interest to us namely the interaction Our three contrasts of interest could be extracted using gt fit lt ImFit eset design gt cont matrix lt cbind SvsUinWT c 0 0 2 2 SvsUinMu c 0 0 2 2 Diff c 0 0 0 4 gt fit2 lt contrasts fit fit cont matrix gt fit2 lt eBayes fit2 The results will be identical to those for the previous three approaches The various approaches described here for the 2 x 2 factorial problem are equivalent and differ only in the parametrization chosen for the linear model The fitted model objects fit will differ only in the coefficients and associated components The residual standard deviations fit sigma residual degrees of freedom fit df residual and all components of fit2 will be identical regardless of parametrization used Since the approaches are equivalent users are free to choose whichever one is most intuitive or convenient 9 6 Time Course Experiments 9 6 1 Replicated time points Time course experiments are those in which RNA is extracted at several time points after the onset of some treatment or stimulation How best to analyse a time course experiment depends on the nature of the experiment and especially on the number of distinct time points We consider first experiments with a relative small number of replicated time points Simple time course experiments of this type are similar to experiments with several groups covered in Section 9 3 As an example we consider here
94. fault for SPOT output is that Rmean and Gmean are used as foreground intensities and morphR and morphG are used as background intensities gt RG lt read maimages targets source spot Read swirl 1 spot Read swirl 2 spot Read swirl 3 spot Read swirl 4 spot gt RG An object of class RGList R swirl 1 swirl 2 swirl 3 swirl 4 1 19538 470 16138 720 2895 1600 14054 5400 2 23619 820 17247 670 2976 6230 20112 2600 3 21579 950 17317 150 2735 6190 12945 8500 4 8905 143 6794 381 318 9524 524 0476 5 8676 095 6043 542 780 6667 304 6190 8443 more rows G swirl 1 swirl 2 swirl 3 swirl 4 1 22028 260 19278 770 2727 5600 19930 6500 2 25613 200 21438 960 2787 0330 25426 5800 3 22652 390 20386 470 2419 8810 16225 9500 4 8929 286 6677 619 383 2381 786 9048 5 8746 476 6576 292 901 0000 468 0476 8443 more rows Rb swirl 1 swirl 2 swirl 3 swirl 4 1 174 136 82 48 2 174 133 82 48 3 174 133 76 48 4 163 105 61 48 5 140 105 61 49 8443 more rows Gb swirl 1 swirl 2 swirl 3 swirl 4 1 182 175 86 97 2 171 183 86 85 3 153 183 86 85 4 153 142 71 87 5 153 142 71 87 8443 more rows 76 targets SlideNumber FileName Cy3 Cy5 Date 81 swirl 1 spot swirl wild type 2001 9 20 82 swirl 2 spot wild type swirl 2001 9 20 93 swirl 3 spot swirl wild type 2001 11 8 94 swirl 4 spot wild type swirl 2001 11 8 BPwWDN source 1 spot The arrays The microarrays used
95. file data gt ab lt ReadAffy filenames targets FileName gt eset lt gcrma ab Form an appropriate design matrix for the two RNA sources and fit linear models The design matrix has two columns The first represents log expression in the wild type and the second represents the log ratio between the mutant and wild type samples See Section 9 2 for more details on the design matrix gt design lt cbind WT 1 MUvsWT targets Genotype mu gt fit lt ImFit eset design gt fit lt eBayes fit gt topTable fit coef MUvsWT This code fits the linear model smooths the standard errors and displays the top 10 genes for the mutant versus wild type comparison 3 3 Data Objects There are six main types of data objects created and used in limma EListRaw Raw Expression list A class used to store single channel raw intensities prior to normalization Intensities are unlogged Objects of this class contain one row for each probe and one column for each array The function read ilmn for example creates an object of this class EList Expression list Contains background corrected and normalized log intensities Usually created from an EListRaw objecting using normalizeBetweenArrays Or neqc RGList Red Green list A class used to store raw two color intensities as they are read in from an image analysis output file usually by read maimages 13 MAList Two color intensities converted to M values and A valu
96. ftware may use other column names Here is a STF below appropriate for arrays with Lucidea Universal ScoreCard control spots El Microsoft Excel mli40503SpotTypes txt iol xj H File Edit View Insert Format Tools Data Window Help 4 Xx DR ey bo 6zx 140 gt F11 v fe A SpotType ID Name Color gene la black ratio Ratio red calibration Calib blue utility Utility pink negative Negative brown buffer Buffer orange blank blank i yellow M 4 gt rh mi1405035potTypes ial Ready If the STF has default name SpotTypes txt then it can be read using gt spottypes lt readSpotTypes It is typically used as an argument to the controlStatus function to set the status of each spot on the array for example gt RG genes Status lt controlStatus spottypes RG 24 Chapter 5 Quality Assessment An essential step in the analysis of any microarray data is to check the quality of the data from the arrays For two color array data an essential step is to view the MA plots of the unnormalized data for each array The plotMA function produces plots for individual arrays The plotMA3by2 function gives an easy way to produce MA plots for all the arrays in a large experiment This functions writes plots to disk as png files 6 plots to a page The usefulness of MA plots is enhanced by highlighting various types of control probes on the arrays and this is facilited by the controlStatus funtion Th
97. g R 2 3 0 and limma 2 7 5 gt library limma gt load Bob RData gt objects 1 design RG gt design 1 1 1 1 1 gt names RG 1 R q Rb Gb genes printer gt RG genes 1 40 Library ID Control cDNA1 500 Control cDNA1 500 Control Printing buffer Control Printing buffer Control Printing buffer Control Printing buffer Control Printing buffer Control Printing buffer 9 Control cDNA1 500 10 Control cDNA1 500 11 Control Printing buffer 12 Control Printing buffer ANOoOFPWNE 97 13 Control Printing buffer 14 Control Printing buffer 15 Control Printing buffer 16 Control Printing buffer 17 Control cDNA1 500 18 Control cDNA1 500 19 Control Printing buffer 20 Control Printing buffer 21 Control Printing buffer 22 Control Printing buffer 23 Control Printing buffer 24 Control Printing buffer 25 Control cDNA1 500 26 Control cDNA1 500 27 NIA15k H31 28 NIA15k H31 29 NIA15k H32 30 NIA15k H32 31 NIA15k H33 32 NIA15k H33 33 NIA15k H34 34 NIA15k H34 35 NIA15k H35 36 NIA15k H35 37 NIA15k H36 38 NIA15k H36 39 NIA15k H37 40 NIA15k H37 Although there are only four arrays we have a total of eight spots for each gene and more for the controls Naturally the two M values obtained from duplicate spots on the same array are highly correlated The problem is how to make use of the duplicate spots in the best way The approach taken here is to estimate the spatial correlation between the ad
98. g2 intensity col red par mfrow c 1 1 VVVV MV 93 Green background Red background 8 14 E o o 14 o o o g 2 o gt 12 o 9 o gt 2 o 8a els z o 8 o E 10 EZ j a T robo ie a 10 D EO T e l en E D 2 8 See TL 2 E JH Be 8 6 Poor gp tt 6 NAO COIN ONO NAIL ONOV NN eer 2 0o00200 20 2 2 22 000 EEEES ZJ EEEESD GSGAEaAGAAEE Saag EE O O O O 2 2 o O O GOG 2 6 o oO oOo oO Later on we will investigate setting array quality weights Now normalize the data The Riken titration library being based on a pool of a large number of non specific genes should not be differentially expressed We can take advantage of this by upweighting these probes in the print tip normalization step Here we give double weight to the titration library probes although higher weights could also be considered gt w lt modifyWeights RG weights RG genes Status RikenTitration 2 gt MA lt normalizeWithinArrays RG weights w 16 3 6 Setting Up the Linear Model The experiment has a composite design with some arrays comparing back to the RNA pool as a common reference and other arrays making direct comparisons between the treatment conditions of interest The simplest design matrix is that which compares all the RNA samples back to the RNA pool gt design lt modelMatrix targets ref Pool Found unique target names Piimt Pilwt P21mt P21wt Pool We also add an intercept term to extract probe specific
99. ganism to study the early development in vertebrates Swirl is a point mutant in the BMP2 gene that affects the dorsal ventral body axis The main goal of the Swirl experiment is to identify genes with altered expression in the Swirl mutant compared to wild type zebrafish The hybridizations Two sets of dye swap experiments were performed making a total of four replicate hybridizations Each of the arrays compares RNA from swirl fish with RNA from normal wild type fish The experimenters have prepared a tab delimited targets file called SwirlSamples txt which describes the four hybridizations gt library limma gt targets lt readTargets SwirlSample txt gt targets SlideNumber FileName Cy3 Cy5 Date 81 swirl 1 spot swirl wild type 2001 9 20 82 swirl 2 spot wild type swirl 2001 9 20 93 swirl 3 spot swirl wild type 2001 11 8 94 swirl 4 spot wild type swirl 2001 11 8 e UNBE 75 We see that slide numbers 81 82 93 and 94 were used to make the arrays On slides 81 and 93 swirl RNA was labelled with green Cy3 dye and wild type RNA was labelled with red Cy5 dye On slides 82 and 94 the labelling was the other way around Each of the four hybridized arrays was scanned on an Axon scanner to produce a TIFF image which was then processed using the image analysis software SPOT The data from the arrays are stored in the four output files listed under FileName Now we read the intensity data into an RGList object in R The de
100. gt topTable fit where MA holds the normalized data The default design matrix used here is just a single column of ones The experiment here measures the fold change of mutant over wild type Genes which have positive M values are more highly expressed in the mutant RNA while genes with negative M values are more highly expressed in the wild type The analysis is analogous 57 to the classical single sample t test except that we have used empirical Bayes methods to borrow information between genes 11 2 2 Dye Swaps A simple modification of the above experiment would be to swap the dyes for one of the arrays The targets file might now be FileName Cy3 Cy5 Filel wt mu File mu wt File3 wt mu Now the analysis would be gt design lt c 1 1 1 gt fit lt lmFit MA design gt fit lt eBayes fit gt topTable fit Alternatively the design matrix could be set replacing the first of the above code lines by gt design lt modelMatrix targets ref wt where targets is the data frame holding the targets file information If there are at least two arrays with each dye orientation then it is possible to estimate and adjust for any probe specific dye effects The dye effect is estimated by an intercept term If the experiment was FileName Cy3 Cy5 Filel wt mu File mu wt File3 wt mu File4 mu wt then we could set gt design lt cbind DyeEffect 1 MUvsWT c 1 1 1 1 gt fit lt lmFit M
101. h a Common Reference 10 1 Introduction be ae che A Eber ys ee ES ak I ah A WOM TOUS A o at ae ae Lad oe een O ea a link 10 3 Several Groups sbi dat eld ack boy ek Sg os Sh Seay ed 11 Direct Two Color Experimental Designs TLI Er OCU OTs e ae ets ah Se aba id Bi De ues e Bs Be ok ee 11 2 Simple Comparisons a4 Sk Sa Be Rw ee Rew da E be Ba 11 2 1 Replicate Arrays seso Be tito RR eb eae es PS MA Dyes wa pen e ae ot BO oe oe Ee Gre Glee ey 11 3 A Correlation Approach to Technical Replication 12 Separate Channel Analysis of Two Color Data 13 Statistics for Differential Expression 13 1 Summary Top Tables ida EA ea we 13 2 Fitted Model Objects bir BRS Be ee 13 3 Multiple Testing Across Contrasts 04 36 36 37 38 39 Al 41 41 43 44 44 44 44 44 45 46 47 48 48 50 91 53 53 53 59 57 57 57 57 58 59 61 14 Array Quality Weights 67 14 1 Titroduction ptes dl de rs Bas Gee PA he to eee Ua a ash ae re N 67 14 2 Example lot e a Bld ee ee Se te Ee ES RS 67 14 3 Example e A AA RA Bie A Be ee ee en ee ee 69 14 4 When to Use Array Weights e e ee ee 71 15 RNA seq Data 72 15 1 Tntroduction s 31202 4622 4 8 Bote a a aa a g adi y a iay free Ar 72 15 2 Making a count matrik s x Leeda data Sh ge tg key ah A 72 15 3 Differential expression dr ek oe da Md oe e aes te a 72 15 4 Differential splicing gs ta e and E Sy che os Boy che od 73 16 Two Color Case Studies 75
102. he printer component of the MAList 14 3 Example 2 Below is an example of applying this method to the Apoal data gt ptw lt printtipWeights MA design layout MA printer gt zlim lt c min ptw max ptw gt par mfrow c 3 2 69 gt for i in seq 7 12 by 1 imageplot ptw i layout MA printer zlim zlim main colnames MA i 2 range 0 7 to 2 saturation 0 2 2 1 z range 0 6 to 2 1 saturation 0 2 2 1 2 range 0 2 to 1 2 saturation 0 2 2 1 z range 0 7 to 1 3 saturation 0 2 2 1 2 range 0 7 to 1 3 saturation 0 2 2 1 2 range 0 5 to 1 2 saturation 0 2 2 1 Image plots of the print tip weights for arrays 7 through to 12 are shown above with lighter shades indicating print tip groups which have been assigned lower weights A corner of array 9 alkok1 is measured to be less reproducible than the same region on other arrays which may be indicative of a spatial artefact Using these weights in the linear model analysis increases the t statistics of the top ranking genes compared to an analysis without weights compare the results table below with the table in section 16 2 gt fitptw lt ImFit MA design weights ptw gt fitptw lt eBayes fitptw gt options digits 3 gt topTable fitptw coef 2 number 15 genelist fitptw genes NAME ID logFC AveExpr t P Value adj P Val B 2149 ApoAI lipid Img 3 151 12 47 25 64 1 21e 15 7 73e 12 16 4206 540 EST HighlysimilartoA 2 918 12 28
103. he printer layout information has been set If you ve used readGAL to set the genes component you may also use getLayout to set the printer information by gt RG printer lt getLayout RG genes Note this will work only for GenePix GAL files not for general gene lists 4 10 The Spot Types File The Spot Types file STF is another optional tab delimited text file that allows you to identify different types of probes from the entries appearing in the gene list It is especially useful for identifying different types of control probes The STF is used to set the control status of each probe on the arrays so that plots may highlight different types of spots in an appropriate way It is typically used to distinguish control probes from regular probes corresponding to genes and to distinguish positive from negative controls ratio from calibration controls and so on The STF should have a SpotType column giving the names of the different spot types One or more other columns should have the same names as columns in the gene list and should contain patterns or regular expressions sufficient to identify the spot type Any other columns are assumed to contain plotting attributes such as colors or symbols to be associated with the spot types There is one row for each spot type to be distinguished The STF uses simplified regular expressions to match patterns For example AA means any string starting with AA AA means any code ending with AA AA
104. hich might not be of direct interest is included We need to use contrasts to extract all the comparisons of interest gt fit lt ImFit eset design gt cont matrix lt cbind SvsUinWT c 0 0 1 0 SvsUinMu c 0 0 1 1 Diff c 0 0 0 1 gt fit2 lt contrasts fit fit cont matrix gt fit2 lt eBayes fit2 This extracts the WT stimulation effect as the third coefficient and the interaction as the fourth coefficient The mutant stimulation effect is extracted as the sum of the third and fourth coefficients of the original model This analysis yields exactly the same results as the previous analysis An even more classical statistical approach to the factorial experiment would be to use the sum to zero parametrization In R this is achieved by 47 gt contrasts Strain lt contr sum 2 gt contrasts Treatment lt contr sum 2 gt design lt model matrix Strain Treatment This defines four coefficients with the following interpretations Coefficient Comparison Interpretation Intercept WT U WT S Mu U Mu S 4 Grand mean Strainl WT U WT S Mu U Mu S 4 Strain main effect Treatment1 WT U WT S Mu U Mu S 4 Treatment main effect Strainl Treatmentl WT U WT S Mu U Mu S 4 Interaction This parametrization has many appealing mathematical properties and is the classical parametriza tion used for factorial designs in much experimental design theory However it defines only one coefficient which is directly o
105. ical Analysis of Gene Expression Microarray Data pages 35 91 Chapman amp Hall CRC Press Yang Y H and Thorne N P 2003 Normalization for two color cDNA microarray data In D R Goldstein editor Science and Statistics A Festschrift for Terry Speed pages 403 418 Institute of Mathematical Statistics Lecture Notes Monograph Series Volume 40 144
106. ical user interface for linear modeling of microarray data Bioinformatics 20 3705 3706 Wolfinger R D Gibson G Wolfinger E D Bennett L Hamadeh H Bushel P Afshari C and Paules R S 2001 Assessing gene significance from cDNA microarray expression data via mixed models Journal of Computational Biology 8 625 637 Wu D Lim E Vaillant F Asselin Labat M Visvader J and Smyth G 2010 ROAST rotation gene set tests for complex microarray experiments Bioinformatics 26 2176 2182 Wu D and Smyth G 2012 Camera a competitive gene set test accounting for inter gene correlation Nucleic Acids Research 40 e133 Xia X McClelland M and Wang Y 2005 Webarray an online platform for mi croarray data analysis BMC Bioinformatics 6 306 143 52 53 54 55 Yang Y H Dudoit S Luu P Lin D M Peng V Ngai J and Speed T P 2002 Normalization for cDNA microarray data a robust composite method addressing single and multiple slide systematic variation Nucleic Acids Research 30 e15 Yang Y H Dudoit S Luu P and Speed T P 2001 Normalization for cDNA mi croarray data In M L Bittner Y Chen A N Dorsel and E R Dougherty editors Microarrays Optical Technologies and Informatics pages 141 152 Proceedings of SPIE Volume 4266 Yang Y H and Speed T P 2003 Design and analysis of comparative microarray experiments In T P Speed editor Statist
107. image analysis program and lt directory gt is the full path of the directory containing the files If the files are in the current R working directory then the argument path can be omitted see the help entry for setwd for how to set the current working directory The file names are usually read from the Targets File For example the Targets File Targets txt is in the current working directory together with the SPOT output files then one might use gt targets lt readTargets gt RG lt read maimages targets FileName source spot Alternatively and even more simply one may give the targets frame itself in place of the files argument as gt RG lt read maimages targets source spot In this case the software will look for the column FileName in the targets frame If the files are GenePix output files then they might be read using gt RG lt read maimages targets source genepix given an appropriate targets file Consult the help entry for read maimages to see which other image analysis programs are supported Files are assumed by default to be tab delimited although other separators can be specified using the sep argument 17 Reading data from ImaGene software is a little different to that of other image analysis programs because the red and green intensities are stored in separate files This means that the targets frame should include two filename columns called say FileNameCy3 and FileNameCy5 giving the nam
108. imageplot3by2 gives an easy way to automate the production of plots for all arrays in an experiment If the plots suggest that some arrays are of lesser quality than others it may be useful to estimate array quality weights to be used in the linear model analysis see Section 14 26 Chapter 6 Pre Processing Two Color Data 6 1 Background Correction The default background correction action is to subtract the background intensity from the foreground intensity for each spot If the RGList object has not already been background corrected then normalizeWithinArrays will do this by default Hence gt MA lt normalizeWithinArrays RG is equivalent to gt RGb lt backgroundCorrect RG method subtract gt MA lt normalizeWithinArrays RGb However there are many other background correction options which may be preferable in certain situations see Ritchie et al 30 For the purpose of assessing differential expression we often find gt RG lt backgroundCorrect RG method normexp offset 50 to be preferable to the simple background subtraction when using output from most image analysis programs This method adjusts the foreground adaptively for the background in tensities and results in strictly positive adjusted intensities i e negative or zero corrected intensities are avoided The use of an offset damps the variation of the log ratios for very low intensities spots towards zero To illustrate some differenc
109. in this experiment were printed with 8448 probes spots including 768 control spots The array printer uses a print head with a 4x4 arrangement of print tips and so the microarrays are partitioned into a 4x4 grid of tip groups Each grid consists of 22x24 spots that were printed with a single print tip Unlike most image analysis software SPOT does not store probe annotation in the output files so we have to read it separately The gene name associated with each spot is recorded in a GenePix array list GAL file gt RG genes lt readGAL fish gal gt RG genes 1 30 Block Row Column ID Name 1 1 1 1 control genol 2 1 1 2 control geno2 3 1 1 3 control geno3 4 1 1 4 control 3XSSC 5 1 1 5 control 3XSSC 6 1 1 6 control EST1 7 1 1 7 control genol 8 1 1 8 control geno2 9 1 1 9 control geno3 10 1 1 10 control 3XSSC 11 1 1 11 control 3XSSC 12 1 1 12 control 3XSSC 13 1 1 13 control EST2 14 1 1 14 control EST3 15 1 1 15 control EST4 16 1 1 16 control 3XSSC 17 1 1 17 control Actin 18 1 1 18 control Actin 19 1 1 19 control 3XSSC 20 1 1 20 control 3XSSC 21 1 1 21 control 3XSSC 22 1 1 22 control 3XSSC 23 1 1 23 control Actin 24 1 1 24 control Actin 25 1 2 1 control ath1 26 1 2 2 control Cad 1 27 1 2 3 control DeltaB 28 1 2 4 control D1x4 TT 29 1 2 5 control ephrinA4 30 1 2 6 control FGF8 Because we are using SPOT output the 4x4x22x24 print layout also needs to be set The easiest way to do this is to infer it from the GAL fil
110. ing values in log ratios which might arise from negative or zero corrected intensities The function backgroundCorrect gives a number of useful options For the purposes of this section the data has been corrected using the minimum method gt RG b lt backgroundCorrect RG method minimum plotDensities displays smoothed empirical densities for the individual green and red chan nels on all the arrays Without any normalization there is considerable variation between both channels and between arrays gt plotDensities RG b 31 RG densities 0 4 f Ta ed rae 5 tgs y EZ Density Intensity After loess normalization of the M values for each array the red and green distributions be come essentially the same for each array although there is still considerable variation between arrays gt MA p lt normalizeWithinArrays RG b gt plotDensities MA p RG densities Density 0 2 Intensity Loess normalization doesn t affect the A values Applying quantile normalization to the A values makes the distributions essentially the same across arrays as well as channels 32 gt MA pAq lt normalizeBetweenArrays MA p method Aquantile gt plotDensities MA pAg RG densities 0 30 1 0 25 1 Density 0 15 1 0 05 1 T T T T T T 6 8 10 12 14 16 Intensity Applying quantile normalization directly to the individual red and green in
111. ion 5 Suppose for example that B 1 5 The odds of differential expression is 63 exp 1 5 4 48 i e about four and a half to one The probability that the gene is differ entially expressed is 4 48 1 4 48 0 82 i e the probability is about 82 that this gene is differentially expressed A B statistic of zero corresponds to a 50 50 chance that the gene is differentially expressed The B statistic is automatically adjusted for multiple testing by assuming that 1 of the genes or some other percentage specified by the user in the call to eBayes are expected to be differentially expressed The p values and B statistics will normally rank genes in the same order In fact if the data contains no missing values or quality weights then the order will be precisely the same As with all model based methods the p values depend on normality and other mathemat ical assumptions which are never exactly true for microarray data It has been argued that the p values are useful for ranking genes even in the presence of large deviations from the assumptions 40 38 Benjamini and Hochberg s control of the false discovery rate assumes independence between genes although Reiner et al 27 have argued that it works for many forms of dependence as well The B statistic probabilities depend on the same assumptions but require in addition a prior guess for the proportion of differentially expressed probes The p values may be preferred to the B statistics bec
112. istics for comparing mutant to wt could be computed by gt ordinary t lt fit coef fit stdev unscaled fit sigma We prefer though to use empirical Bayes moderated t statistics which are computed below Now create an MA plot of the average M and A values for each gene gt plotMA fit gt abline 0 0 col blue 83 Empirical Bayes analysis We will now go on and compute empirical Bayes statistics for differential expression The moderated t statistics use sample standard deviations which have been shrunk towards a pooled standard deviation value gt fit lt eBayes fit gt qqt fit t df fit df prior fit df residual pch 16 cex 0 2 gt abline 0 1 Student s t Q Q Plot 15 1 a Sample Quantiles 10 5 0 xa 15 20 5 0 5 Theoretical Quantiles Visually there seems to be plenty of genes which are differentially expressed We will obtain a summary table of some key statistics for the top genes 84 gt options digits 3 gt topTable fit number 30 adjust BH Block Row Column ID Name logFC AveExpr t P Value adj P Val B 3721 8 2 1 control BMP2 2 21 12 1 21 1 1 03e 07 0 000357 7 96 1609 4 2 1 control BMP2 2 30 13 1 20 3 1 34e 07 0 000357 7 78 3723 8 2 3 control D1x3 2 18 13 3 20 0 1 48e 07 0 000357 7 71 1611 4 2 3 control D1x3 2 18 13 5 19 6 1 69e 07 0 000357 7 62 8295 16 16 15 b94h06 20 L12 1 27 12 0 14 1 1 74e 06 0 002067 5 78 7036 14 8 4 fb40h07
113. ith SPOT software and with morph back ground subtracted This background estimator produces a similar effect to that with normexp 28 SPOT morph background The effect of using morph background or using method normexp with an offset is to sta bilize the variability of the M values as a function of intensity The empirical Bayes methods implemented in the limma package for assessing differential expression will yield most ben efit when the variabilities are as homogeneous as possible between genes This can best be achieved by reducing the dependence of variability on intensity as far as possible 30 6 2 Within Array Normalization Limma implements a range of normalization methods for spotted microarrays Smyth and Speed 39 describe some of the most commonly used methods The methods may be broadly classified into methods which normalize the M values for each array separately within array normalization and methods which normalize intensities or log ratios to be comparable across arrays between array normalization This section discusses mainly within array normaliza tion which all that is usually required for the traditional log ratio analysis of two color data Between array normalization is discussed further in Section 6 3 Print tip loess normalization 53 is the default normalization method and can be performed by gt MA lt normalizeWithinArrays RG There are some notable cases where this i
114. jacent spots using REML and then to conduct the usual analysis of the arrays using generalized least squares First normalize the data using print tip loess regression The SPOT morph background ensures that the default background subtraction can be used without inducing negative in tensities gt MA lt normalizeWithinArrays RG Then remove the control probes gt MA2 lt MA MA genes Library NIA15k Now estimate the spatial correlation We estimate a correlation term by REML for each gene and then take a trimmed mean on the atanh scale to estimate the overall correlation This command will probably take at least a few minutes depending on the speed of your computer 98 gt options digits 3 gt corfit lt duplicateCorrelation MA2 design ndups 2 A slow computation Loading required package statmod gt corfit consensus correlation 1 0 575 gt boxplot tanh corfit atanh correlations d 10 05 00 05 10 gt fit lt lmFit MA2 design ndups 2 correlation corfit consensus gt fit lt eBayes fit gt topTable fit n 30 adjust BH Library ID logFC AveExpr t P Value adj P Val B 4599 NIA15k H34599 0 404 10 93 12 92 1 34e 07 0 000443 8 02 1324 NIA15k H31324 0 520 7 73 12 24 2 23e 07 0 000443 7 57 3309 NIA15k H33309 0 420 10 99 11 99 2 71e 07 0 000443 7 39 440 NIA15k H3440 0 568 9 96 11 64 3 60e 07 0 000443 7 13 6795 NIA15k H36795 0 460 10 78 11 56 3 85e 07 0 000443 7 07 121
115. l trol trol trol trol trol trol trol trol 1571 1013 1207 1467 1466 1379 0441 3391 0333 0792 0207 2030 4925 0325 0167 1009 0279 0627 0610 9552 4228 6270 0867 8028 9650 5739 9213 9700 5055 8011 0795 0040 1382 3530 4370 5505 5906 5333 gt plotExons fit de geneid 45320 exon identifier 135 Start DODOFNWENWNHKPDHOOCWWHPWAHA KH HP PBS o WOPRWNHOHNWOWN OO 5131 0798 3153 2557 4477 1931 1535 1965 5543 3762 5837 5462 8566 0676 9637 6787 8850 0326 9856 0867 8149 2685 5596 3276 5677 3320 0219 1280 7069 3177 2446 1555 7962 4022 1375 3020 5478 2835 262 809 988 195 059 939 384 962 277 701 153 433 273 348 188 934 179 366 397 108 066 423 791 906 704 937 721 256 276 156 508 201 061 711 306 678 321 309 TNrRFNWORrRRFPRANANKFPNAINWBAAADANWDAWBONrFPRFPRABRBNNNWWNWEBEDND 27e 01 32e 01 40e 01 52e 01 O9e 01 69e 01 07e 01 48e 02 86e 01 95e 01 81e 01 74e 01 13e 04 33e 01 54e 01 66e 01 61e 01 20e 01 97e 01 83e 09 23e 03 14e 10 32e 09 64e 08 20e 10 86e 09 04e 08 80e 08 00e 07 29e 09 19e 01 27e 04 12e 04 54e 06 66e 02 42e 03 02e 04 16e 03 FPWMHMOAWHWOAONNWANrFKFRNKFNATAWAMAAUNTAADWBWAAHRAArFrAUNHUHKRUAAD A 55e 01 33e
116. la 29 42 31 1 87e 32 6 12e 29 32242 37893 2R slik 19 26 42 1 35e 18 2 94e 15 16060 36542 2R shot 38 9 39 1 49e 17 2 43e 14 95956 44448 3R scrib 30 11 50 1 08e 16 1 42e 13 19880 36773 2R Dg 15 18 75 4 93e 13 5 38e 10 41795 3771968 2L Msp 300 29 8 00 2 04e 12 1 91e 09 107883 44030 3L msn 23 9 77 4 90e 12 4 01e 09 150694 32008 X Ant2 4 135 61 5 86e 12 4 27e 09 82117 42130 3R osa 18 12 10 1 52e 11 9 96e 09 150710 32007 X sesB 7 44 73 1 80e 11 1 07e 08 166083 33098 X CG32521 6 53 49 3 71e 11 2 03e 08 52817 34652 2L vir 1 9 23 69 3 01e 10 1 52e 07 115767 38879 3L pbl 12 15 02 8 21e 10 3 85e 07 11040 36104 2R G oalpha47A 12 14 27 1 59e 09 6 96e 07 2032 2768716 2R mim 24 6 81 2 14e 09 8 78e 07 134207 40464 3L Ten m 12 13 65 2 82e 09 1 09e 06 47338 34170 2L Akap200 8 21 75 4 22e 09 1 54e 06 36154 33367 2L CG7337 23 6 70 5 59e 09 1 93e 06 118209 39048 3L Tequila 20 7 47 7 55e 09 2 48e 06 We can plot all the exons for the most differentially spliced gene trol gt plotSplice ex 133 45320 trol co 5 o wt RL wn gt Cc O o N lt O LL D L2 o MT LODMNMIOM D MNOOA MDOSOD O YODO FLO IWOLONO LOL TOONOTTOONTH TOTTrMOMNOSOD TOON ODOTSOLOOORN WOOO ODODDDDOOORRNERERANROO MMMM MM MMMMMMMMMMMOCC NANANAAAAANAAAA AOA AAAI Exon Start The plot highlights exons that are individually significant We can see that a block of five or six exons at the end of the gene are relatively lost when Pasilla is down at the expense of an e
117. lation MA design gt fit lt ImscFit MA design correlation corfit consensus Subsequent steps proceed as for log ratio analyses For example if we want to compare wild type young to mutant young animals we could extract this contrast by gt cont matrix lt makeContrasts mu young wt young levels design gt fit2 lt contrasts fit fit cont matrix gt fit2 lt eBayes fit2 gt topTable fit2 adjust BH 62 Chapter 13 Statistics for Differential Expression 13 1 Summary Top Tables Limma provides functions topTable and decideTests which summarize the results of the linear model perform hypothesis tests and adjust the p values for multiple testing Results include log2 fold changes standard errors t statistics and p values The basic statistic used for significance analysis is the moderated t statistic which is computed for each probe and for each contrast This has the same interpretation as an ordinary t statistic except that the standard errors have been moderated across genes i e shrunk towards a common value using a simple Bayesian model This has the effect of borrowing information from the ensemble of genes to aid with inference about each individual gene 36 Moderated t statistics lead to p values in the same way that ordinary t statistics do except that the degrees of freedom are increased reflecting the greater reliability associated with the smoothed standard errors The effectiveness of the m
118. le and allows you to construct your own 21 weights The wt fun argument can be any function which takes a data set as argument and computes the desired weights For example if you wish to give zero weight to all GenePix flags less than 50 you could use gt myfun lt function x as numeric x Flags gt 50 5 gt RG lt read maimages files source genepix wt fun myfun The wt fun facility can be used to compute weights based on any number of columns in the image analysis files For example some researchers like to filter out spots if the foreground mean and median from GenePix for a given spot differ by more than a certain threshold say 50 This could be achieved by gt myfun lt function x threshold 50 okred lt abs x F635 Median x F635 Mean lt threshold okgreen lt abs x F532 Median x F532 Mean lt threshold as numeric okgreen amp okred gt RG lt read maimages files source genepix wt fun myfun Then all the bad spots will get weight zero which in limma is equivalent to flagging them out The definition of myfun here could be replaced with any other code to compute weights using the columns in the GenePix output files 4 8 Reading Probe Annotation The RGList read by read maimages will almost always contain a component called genes containing the IDs and other annotation information associated with the probes The only exceptions are SPOT data source spot
119. les of background correction spot quality weights and filtering with control spots Finally if you are using one of the menu driven interfaces to the software please cite the appropriate one of Wettenhall J M and Smyth G K 2004 limmaGUI a graphical user interface for linear modeling of microarray data Bioinformatics 20 3705 3706 Wettenhall J M Simpson K M Satterley K and Smyth G K 2006 affylmGUI a graphical user interface for linear modeling of single channel mi croarray data Bioinformatics 22 897 899 2 2 Installation Limma is a package for the R computing environment and it is assumed that you have already installed R See the R project at http www r project org To install the latest version of limma you will need to be using the latest version of R Limma is part of the Bioconductor project at http www bioconductor org Prior to R 2 6 0 limma was also available from the R project CRAN site It is one of a default set of packages installed by biocLite You can install a set of core Bioconductor packages by gt source http www bioconductor org biocLite R gt biocLite Q To get just limma alone much quicker you can use gt biocLite limma This will allow you do to perform many basic analyses although you ll probably want gt biocLite statmod as well Bioconductor works on a 6 monthly official release cycle lagging each major R release by a short time As with oth
120. limma Linear Models for Microarray Data User s Guide Now Including RNA Seq Data Analysis Gordon K Smyth Matthew Ritchie Natalie Thorne James Wettenhall Wei Shi and Yifang Hu Bioinformatics Division The Walter and Eliza Hall Institute of Medical Research Melbourne Australia First edition 2 December 2002 Last revised 17 June 2014 This free open source software implements academic research by the authors and co workers If you use it please support the project by citing the appropriate journal articles listed in Section 2 1 Contents 1 Introduction 2 Preliminaries 21 Citing MITA tsa i og etait ee ee A Ut ce ge 22 installation RA A ge AE ade MT eth as 2 3 How A 3 Quick Start LLA prief introd ction to Rade a ERA ADA A AS A 3 2 Sample limma Session cada da ra ae a Bro Data Opjects p r Soa ge a o A O A A a 4 Reading Microarray Data Al cope Or This Chaplel sy Ye roedor os he Rew de te ld FG hd RG he 4 2 Recommended Piles fe 151 rta is BY tr arte en oi ee Tey ee e La ne Targets Eran a NS A iaa ti 4 4 Reading Two Color Intensity Data o 4 5 Reading Single Channel Agilent Intensity Data 4 6 Reading Illumina BeadChip Data o 4 7 Image derived Spot Quality Weights 4 8 Reading Probe Annotation vc a a ad a ALO Printer Layoat ade rd A e tod Okc oda ae ade a ea hs 4 10 The Spot Types File threat a Se
121. limma 3 9 19 or later 18 1 2 Loading the data Read counts summarized by Ensembl gene identifiers are available in the tweeDEseqCountData package gt library tweeDEseqCountData gt data pickrell1 gt Counts lt exprs pickrell1 eset gt dim Counts 1 38415 69 gt Counts 1 5 1 5 NA18486 NA18498 NA18499 NA18501 NA18502 ENSGO0000127720 6 32 14 35 14 ENSGO0000242018 20 21 24 22 16 ENSGO0000224440 0 0 0 0 0 ENSGO0000214453 0 0 0 0 0 ENSG00000237787 0 0 1 0 0 The library sizes vary from about 5 million to over 15 million gt barplot colSums Counts 1e 6 names 1 69 ylab Library size millions 118 Library size millions 15 10 4 5 9 13 18 23 28 33 38 43 48 53 58 63 68 To demonstrate limma on RNA Seq data we will compare female with male individuals gt Gender lt pickrell1 eset gender gt table Gender Gender female male 40 29 gt rm pickrell1 eset gt data genderGenes Annotation for each Ensemble gene is also available from the tweeDEseqCountData package gt data annotEnsemb163 gt annot lt annotEnsemb163 c Symbol Chr gt annot 1 5 ENSGO0000252775 ENSGO0000207459 ENSGO0000252899 ENSGO0000201298 ENSGO0000222266 gt rm annotEnsemb163 Symbol Chr U7 5 U6 5 U7 5 U6 5 U6
122. loess curves which will be used for normalization 79 gt plotPrintTipLoess MA Tip Column 6 68 10 14 6 8 10 14 jj 1 1 1 1 5 a ma FR 6 8 10 14 6 8 10 14 A Normalization Print tip loess normalization gt MA lt normalizeWithinArrays RG gt plotPrintTipLoess MA Tip Column N o ad H 2024 6 8 10 14 6 8 10 14 We have normalized the M values with each array A further question is whether normalization is required between the arrays The following plot shows overall boxplots of the M values for the four arrays gt boxplot MA M col MA M names colnames MA M 80 did 1 17 000 osm E oo T T swirl 1 swirl 2 swirl 3 swirl 4 There is evidence that the different arrays have different spreads of M values so we will scale normalize between the arrays gt MA lt normalizeBetweenArrays MA method scale gt boxplot MA M col MA M names colnames MA M i i 1 1 J Q oq E E E ES ae 5 o o T T swirl 1 swirl 2 swirl 3 swirl 4 Note that scale normalization is not done routinely for all two color data sets in fact it is rarely done with newer platforms However it does give good results on this data set It should only be done when there is good evidence of a scale difference in the M values Linear model First setup an ap
123. m to give greater weight to probes which are sig nificance for two or more contrasts Most multiple testing methods tend to underestimate the number of such probes There is some practical experience to suggest that method nestedF gives less conservative results when finding probes which respond to several different contrasts at once However this method should still be viewed as experimental It provides formal false discovery rate control at the probe level only not at the contrast level 66 Chapter 14 Array Quality Weights 14 1 Introduction Given an appropriate design matrix the relative reliability of each array in an experiment can be estimated by measuring how well the expression values for that array follow the linear model This empirical approach of assessing array quality can be applied to both two color and single channel microarray data and is described in 29 The method is implemented in the arrayWeights function which fits a heteroscedastic model to the expression values for each gene by calling the function 1m wfit See also arrayWeightsSimple which does the same calculation more quickly when there are no probe level quality weights The dispersion model is fitted to the squared residuals from the mean fit and is set up to have array specific coefficients which are updated in either full REML scoring iterations or using an efficient gene by gene update algorithm The final estimates of these array variances are conve
124. means exactly these two letters AA means any string containing AA AA means AA followed by exactly one other character and AA means exactly AA followed by a period and no other characters For those familiar with regular expressions any other regular expressions are allowed but the codes for beginning of string and for end of string should be excluded Note that the patterns are matched sequentially from first to last so more general patterns should be included first The first row should specify the default spot type and should have pattern for all the pattern matching columns Here is a short STF appropriate for the ApoAl data 23 El Microsoft Excel ApoAISpotTypes txt E oj xj H File Edit View Insert Format Tools Data Window Help E EX OSHAR SY Blo exr H M0 gt F10 y fe A B SpotType ID Color cDNA black BLANK BLANK brown Blank Blank i orange Control Control blue Lal 1 ma In this example the columns ID and Name are found in the gene list and contain patterns to match The asterisks are wildcards which can represent anything Be careful to use upper or lower case as appropriate and don t insert any extra spaces The remaining column gives colors to be associated with the different types of points This code assumes of that the probe annotation data frame includes columns ID and Name This is usually so if GenePix has been used for the image analysis but other image analysis so
125. multiple testing using Benjamini and Hochberg s method to control the false discovery rate at 5 across all genes and all contrasts leads to the following 95 gt results lt decideTests fit2 method global gt summary results WT11 MT11 WT21 MT21 WT11 WT21 MT11 MT21 Int zi 28 136 455 765 74 0 16835 16540 16102 15377 16692 1 33 220 339 754 130 The probes that show significant interactions are those which develop differently in the mutant compared to the wildtype between days 11 and 21 To see these gt topTable fit2 coef Int RikenID GenBank Symbol logFC AveExpr t P Value adj P Val B 14395 ZX00005I07 AVO38977 lt NA gt 2 71 10 52 11 40 8 56e 08 0 00145 7 66 1473 ZX00028J17 AVO76735 lt NA gt 1 92 11 37 10 11 3 18e 07 0 00215 6 65 14288 ZA00003115 AVO10442 lt NA gt 2 21 9 50 9 94 3 81e 07 0 00215 6 50 1184 ZX00004J09 AKO005530 Tnfrsf12a 2 02 8 47 9 43 6 74e 07 0 00237 6 03 13843 ZX00003J07 AVO33559 lt NA gt 2 97 12 08 9 33 7 51e 07 0 00237 5 95 14898 ZX00003H24 AK005382 Tnfrsf12a 2 41 9 40 9 23 8 43e 07 0 00237 5 86 13520 ZX00020K15 AV088030 lt NA gt 1 83 9 93 9 09 9 88e 07 0 00238 5 73 10916 ZX00023L14 AV104464 lt NA gt 2 90 10 49 8 93 1 19e 06 0 00252 5 37 12532 ZX00026E06 AV140268 lt NA gt 2 75 10 55 8 63 1 71e 06 0 00322 5 27 14250 ZX00048F23 AV122249 lt NA gt 1 89 10 27 8 41 2 23e 06 0 00377 5 04 gt sessionInfo R version 2 13 0 Patched 2011 04 25 r55638 Platform i386 pc mingw32 i386 32 bit
126. n Signal agilent median Median Signal Median Signal As for two color data the default choices for the foreground and background estimates can be over ridden by specifying the columns argument to read maimages Agilent Feature Extraction output files contain probe annotation columns as well as in tensity columns By default read maimages will read the following annotation columns if they exist Row Col Start Sequence SwissProt GenBank Primate GenPept ProbeUID ControlType ProbeName GeneName SystematicName Description See Section 17 4 for a complete worked case study with single channel Agilent data 4 6 Reading Illumina BeadChip Data Illumina whole genome BeadChips require special treatment Illumina images are scanned by BeadScan software and Illumina s BeadStudio or GenomeStudio software can be used to export probe summary profiles The probe summary profiles are tab delimited files containing the intensity data Typically all the arrays processed at one time are written to a single file with several columns corresponding to each array We recommend that intensities should be exported from GenomeStudio without background correction or normalization as these pre processing steps can be better done by limma functions GenomeStudio can also be asked to export profiles for the control probes and we recommend that this be done as well Illumina files differ from other platforms in that each image output file contains dat
127. n analysis of RNA seq data Genome Biology 11 R25 Rodriguez M W Paquet A C Yang Y H and Erle D J 2004 Differential gene expression by integrin 87 and 87 memory T helper cells BMC Immunology 5 Royal Institute of Technology Sweden 2005 KTH package for microarray data analysis Software package http www biotech kth se molbio microarray pages kthpackagetransfer html Shi W De Graaf C Kinkel S Achtman A Baldwin T Schofield L Scott H Hilton D and Smyth G 2010 Estimating the proportion of microarray probes ex pressed in an RNA sample Nucleic acids research 38 2168 2176 Shi W Oshlack A and Smyth G 2010 Optimizing the noise versus bias trade off for lumina whole genome expression beadchips Nucleic Acids Research 38 e204 e204 Smyth G 2004 Linear models and empirical Bayes methods for assessing differential expression in microarray experiments Statistical applications in genetics and molecular biology 3 Article 3 Smyth G 2005 Individual channel analysis of two colour microarray data invited session IPM 11 Computational tools for microarray analysis In 55th Session of the International Statistics Institute 142 38 39 40 41 42 43 44 45 46 47 48 49 50 5l Smyth G Michaud J and Scott H 2005 Use of within array replicate spots for assessing differential expression in microarray experiments Bioinformati
128. ne and exon annotation were downloaded from ftp ftp ncbi nlm nih gov genomes Drosophila_melanogaster RELEASE_5_48 There is one GFF file for each chromosome arm as shown above The five gff files were concatenated into one file and repeated exons instances of the same exon same start and stop position were removed to create a data frame of start stop positions called unique gff SE reads were counted by gt fc_SE lt featureCounts SE_bam_files annot ext unique gff isGTFAnnotationFile TRUE GTF featureType exon GTF attrType 1D useMetaFeatures FALSE allowMulti0verlap TRUE where SE_bam_files is a vector of BAM file names for the SE reads PE reads were counted by gt fc_SE lt featureCounts PE_bam_files annot ext unique gff isGTFAnnotationFile TRUE GTF featureType exon GIF attrType ID useMetaFeatures FALSE allowMultiOverlap TRUE isPairedEnd TRUE where PE bam files is a vector of BAM file names for the PE reads Note that the RSubread package is not available for Windows 18 2 5 Assemble DGEList and sum counts for technical replicates We assemble all the counts into an edgeR DGEList object gt library edgeR gt y all lt DGEList counts cbind fc_SE counts fc_PE counts genes fc_SE annotation gt dim y all 1 72687 20 127 gt head y all genes GeneID Chr Start End Strand Length 1 3355011 NT_033778 3 18479 18629 151 2 3355011 NT_033778 3 18681 19484 804
129. o analyze log intensities instead of log ratios It is possible to do this using a mixed model representation which treats each spot as a randomized block 48 37 Limma implements mixed model methods for separate channel analysis which make use of shrinkage methods to ensure stable and reliable inference with small numbers of arrays 37 Limma also provides between array normalization to prepare for separate channel analysis for example gt MA lt normalizeBetweenArrays MA method Aquantile scales the intensities so that A values have the same distribution across arrays The first step in the differential expression analysis is to convert the targets frame to be channel rather than array orientated gt targets2 lt targetsA2C targets gt targets2 61 channel col FileName Target File1 1 1 Filel wt young File1 2 2 Filel wt old File2 1 1 File2 wt old File2 2 2 File2 wt young File3 1 1 File3 mu young File3 2 2 File3 mu old File4 1 1 File4 mu old File4 2 2 File4 mu young The following code produces a design matrix with eight rows and four columns gt u lt unique targets2 Target gt f lt factor targets2 Target levels u gt design lt model matrix 0 f gt colnames design lt u Inference proceeds as for within array replicate spots except that the correlation to be es timated is that between the two channels for the same spot rather than between replicate spots gt corfit lt intraspotCorre
130. obi nostat induces clinical responses with associated alterations in gene expression profiles in cutaneous T cell lymphoma Clinical Cancer Research 14 4500 4510 Golden T R and Melov S 2004 Microarray analysis of gene expression with age in individual nematodes Aging Cell 3 111 124 140 11 12 13 14 15 16 17 18 19 20 pal 22 23 24 Hung S Baldi P and Hatfield G W 2002 Global gene expression profiling in Es cherichia coli K12 The effects of leucine responsive regulatory protein Journal of Bio logical Chemistry 277 40309 40323 Irizarry R 2005 From CEL files to annotated lists of interesting genes In R Gentleman V Carey S Dudoit R Irizarry and W Huber editors Bioinformatics and Computa tional Biology Solutions using R and Bioconductor pages 431 442 Springer New York Kendziorski C Irizarry R A Chen K S Haag J D and Gould M N 2005 On the utility of pooling biological samples in microarray experiments PNAS 102 4252 4257 Kooperberg C Aragaki A Strand A D and Olson J M 2005 Significance testing for small microarray experiments Statistics in Medicine 24 2281 2298 Law C W Chen Y Shi W and Smyth G K 2014 Voom precision weights unlock linear model analysis tools for RNA seq read counts Genome Biology 15 R29 Li B and Dewey C 2011 RSEM accurate transcript quantification from RNA Seq data wi
131. oderated t approach has been demonstrated on test data sets for which the differential expression status of each probe is known 12 A number of summary statistics are presented by topTable for the top genes and the selected contrast The logFC column gives the value of the contrast Usually this represents a logs fold change between two or more experimental conditions although sometimes it repre sents a log2 expression level The AveExpr column gives the average log2 expression level for that gene across all the arrays and channels in the experiment Column t is the moderated t statistic Column P Value is the associated p value and adj P Value is the p value adjusted for multiple testing The most popular form of adjustment is BH which is Benjamini and Hochberg s method to control the false discovery rate 1 The adjusted values are often called q values if the intention is to control or estimate the false discovery rate The meaning of BH q values is as follows If all genes with q value below a threshold say 0 05 are selected as differentially expressed then the expected proportion of false discoveries in the selected group is controlled to be less than the threshold value in this case 5 This procedure is equivalent to the procedure of Benjamini and Hochberg although the original paper did not formulate the method in terms of adjusted p values The B statistic lods or B is the log odds that the gene is differentially expressed 36 Sect
132. on ImaGene GenePix QuantArray or SPOT to acquire the red and green foreground and back ground intensities for each spot The spot intensities have then been exported from the image analysis program into a series of text files There should be one file for each array or in the case of Imagene two files for each array You will need to have the image analysis output files In most cases these files will include the IDs and names of the probes and possibly other annotation information A few image analysis programs for example SPOT do not write the probe IDs into the output files In this case you will also need a genelist file which describes the probes It most cases it is also desirable to have a targets file which describes which RNA sample was hybridized to each channel of each array A further optional file is the spot types file which identifies special probes such as control spots 4 3 The Targets Frame The first step in preparing data for input into limma is usually to create a targets file which lists the RNA target hybridized to each channel of each array It is normally in tab delimited text format and should contain a row for each microarray in the experiment The file can 15 have any name but the default is Targets txt If it has the default name it can be read into the R session using gt targets lt readTargets Once read into R it becomes the targets frame The targets frame normally contains a FileName column gi
133. onSet or EList object called eset Such an object will have an slot containing the log expression values for each gene on each array which can be extracted using exprs eset 9 2 Two Groups The simplest possible single channel experiment is to compare two groups Suppose that we wish to compare two wild type Wt mice with three mutant Mu mice FileName Target Filel WT File2 WT File3 Mu File4 Mu File5 Mu There are two different ways to form the design matrix We can either 1 create a design matrix which includes a coefficient for the mutant vs wild type difference or 41 2 create a design matrix which includes separate coefficients for wild type and mutant mice and then extract the difference as a contrast For the first approach the treatment contrasts parametrization the design matrix should be as follows gt design WT MUvsWT Array1 1 0 Array2 1 0 Array3 1 1 Array4 1 1 Arrayb 1 1 Here the first coefficient estimates the mean log expression for wild type mice and plays the role of an intercept The second coefficient estimates the difference between mutant and wild type Differentially expressed genes can be found by gt fit lt ImFit eset design gt fit lt eBayes fit gt topTable fit coef MUvsWT adjust BH where eset is an ExpressionSet or matrix object containing the log expression values For the second approach the design matrix should be WT MU Array1 1 0 Array2 1 0
134. or when reading generic data source generic without setting the annotation argument annotation NULL Try gt names RG genes to see if the genes component has been set If the genes component is not set the probe IDs will need to be read from a separate file If the arrays have been scanned with an Axon scanner then the probes IDs will be available in a tab delimited GenePix Array List GAL file If the GAL file has extension gal and is in the current working directory then it may be read into a data frame by gt RG genes lt readGAL Non GenePix gene lists can be read into R using the function read delim from R base 4 9 Printer Layout The printer layout is the arrangement of spots and blocks of spots on the arrays Knowing the printer layout is especially relevant for old style academic spotted arrays printed with a mechanical robot with a multi tip print head The blocks are sometimes called print tip groups 22 or pin groups or meta rows and columns Each block corresponds to a print tip on the print head used to print the arrays and the layout of the blocks on the arrays corresponds to the layout of the tips on the print head The number of spots in each block is the number of times the print head was lowered onto the array Where possible for example for Agilent GenePix or ImaGene data read maimages will set the printer layout information in the component printer Try gt names RG printer to see if t
135. ore and after applying array weights use the following Vvv tv VV Vv fit lt ImFit MAlms fit lt eBayes fit boxplot fit t MAlms genes Status at 1 5 0 2 col 5 boxwex 0 4 xlab control type ylab moderated t statistics pch ylim c 70 70 medlwd 1 boxplot fitw t MAlms genes Status at 1 5 0 2 col 6 boxwex 0 4 add TRUE yaxt n xaxt n medlwd 1 pch abline h 0 col black lty 2 lwd 1 legend 0 5 70 legend c Equal weights Array weights fill c 5 6 cex 0 8 68 60 20 moderated t statistics 20 0 40 60 D03 D10 DR u03 U10 control type The boxplots show that the t statistics for all classes of ratio controls D03 D10 U03 and U10 move further from zero when array weights are used while the distribution of t statistics for the dynamic range controls DR does not noticeably change This demonstrates that the array quality weights increase statistical power to detect true differential expression without increasing the false discovery rate The same heteroscedastic model can also be fitted at the print tip group level using the printtipWeights function If there are p print tip groups across n arrays the model fitting procedure described in 29 is repeated p times to produce a weight for each print tip group on each array for use in 1mFit This method can be applied to two color microarray data where the probes are organized into print tip groups whose size is specified by t
136. os P C Moffat I D Franc M A Tijet N Tuomisto J Pohjanvirta R and Okey A B 2004 Identification of the DRE II gene battery by phylogenetic footprinting Biochem Biophys Res Commun 321 707 715 Brooks A N Yang L Duff M O Hansen K D Park J W Dudoit S Brenner S E and Graveley B R 2011 Conservation of an RNA regulatory map between Drosophila and mammals Genome Res 21 193 202 Buness A Huber W Steiner K Siiltmann H and Poustka A 2005 arrayMagic two colour cDNA microarray quality control and preprocessing Bioinformatics 21 554 556 Callow M J Dudoit S Gong E L Speed T P and Rubin E M 2000 Microarray expression profiling identifies genes with altered expression in HDL deficient mice Genome Research 10 2022 2029 Dalgaard P 2002 Introductory Statistics with R Springer New York Diaz E Ge Y Yang Y H Loh K C Serafini T A Okazaki Y Hayashizaki Y Speed T P Ngai J and Scheiffele P 2002 Molecular analysis of gene expression in the developing pontocerebellar projection system Neuron 36 417 434 Durinck S Allemeersch J Carey V J Moreau Y and De Moor B 2004 Importing MAGE ML format microarray data into BioConductor Bioinformatics 20 3641 3642 Ellis L Pan Y Smyth G George D McCormack C Williams Truax R Mita M Beck J Burris H Ryan G et al 2008 Histone deacetylase inhibitor pan
137. otein EGFP has been used as a common reference An adenovirus system was used to transport various transcription factors into the nuclei of HeLa cells Here we consider the transcription factors AML1 CBFbeta or both A simple design matrix was formed and a linear model fit gt design lt modelMatrix targets ref EGFP gt design AML1 AML1 CBFb CBFb 1 1 0 0 2 1 0 0 3 0 0 4 0 1 0 5 0 1 0 6 0 si 0 38 7 0 0 1 8 0 Oo 9 0 1 0 10 0 f 0 11 0 1 0 12 0 1 0 13 0 0 1 14 0 Oo 1 gt fit lt lmFit MA design It is of interest to compare each of the transcription factors to EGFP and also to compare the combination transcription factor with AML1 and CBFb individually An appropriate contrast matrix was formed as follows gt contrast matrix lt makeContrasts AML1 CBFb AML1 CBFb AML1 CBFb AML1 AML1 CBFb CBFb levels design gt contrast matrix AML1 CBFb AML1 CBFb AML1 CBFb AML1 AML1 CBFb CBFb AML1 1 0 0 1 0 AML1 CBFb 0 0 1 1 1 CBFb 0 1 0 0 1 The linear model fit can now be expanded and empirical Bayes statistics computed gt fit2 lt contrasts fit fit contrasts matrix gt fit2 lt eBayes fit2 8 4 Direct Two Color Designs Two color designs without a common reference require the most statistical knowledge to choose the appropriate design matrix A direct design is one in which there is no single RNA source which is hybridized to every array As an example we consider an experiment con
138. other dog is a control This produces the targets frame FileName SibShip Treatment Filel 1 File2 File3 File4 File5 File6 wwnNnnre 40000 A moderated paired t test can be computed by allowing for sib pair effects in the linear model SibShip lt factor targets SibShip Treat lt factor targets Treatment levels c C T design lt model matrix SibShip Treat fit lt lmFit eset design fit lt eBayes fit topTable fit coef TreatT VVVNVV MV 9 4 2 Blocking The above approach used for paired samples can be applied in any situation where there are batch effects or where the experiment has been conducted in blocks The treatments can be adjusted for differences between the blocks by using a model formula of the form gt design lt model matrix Block Treatment In this type of analysis the treatments are compared only within each block 9 5 Interaction Models 2 x 2 Factorial Designs 9 5 1 Questions of Interest Factorial designs are those where more than one experimental dimension is being varied and each combination of treatment conditions is observed Suppose that cells are extracted from wild type and mutant mice and these cells are either stimulated S or unstimulated U RNA 44 from the treated cells is then extracted and hybridized to a microarray We will assume for simplicity that the arrays are single color arrays such as Affymetrix Consider the following targets frame
139. ped by readers not interested in such details The linear model for gene 7 has residual variance 05 with sample value 5 and degrees of freedom d The output from 1mFit fit say holds the s in component fit sigma and the dj in fit df residual The covariance matrix of the estimated iso C XEVA G where V is a weight matrix determined by prior weights any covariance terms introduced by correlation structure and any iterative weights introduced by robust estimation The square roots of the diagonal elements of C X7V X C are called unscaled standard deviations and 64 are stored in fit stdev unscaled The ordinary t statistic for the kth contrast for gene j is tik Bjk Ujkrsj Where uj is the unscaled standard deviation The ordinary t statistics can be recovered by gt tstat ord lt fit coef fit stdev unscaled fit sigma after fitting a linear model if desired The empirical Bayes method assumes an inverse Chisquare prior for the 05 with mean s and degrees of freedom dy The posterior values for the residual variances are given by z2 _ dys djs P hka where d is the residual degrees of freedom for the jth gene The output from eBayes contains s and do as fit s2 prior and fit df prior and the 55 as fit s2 post The moderated t statistic is N Bix tik UjkSj This can be shown to follow a t distribution on d d degrees of freedom if Bj 0 36 The extra degrees of freedom fo represent the
140. propriate design matrix The negative numbers in the design matrix indicate the dye swaps 81 gt design lt modelMatrix targets ref wild type Found unique target names swirl wild type gt design swirl swirl 1 1 swirl 2 1 swirl 3 Sd swirl 4 1 Now fit a simple linear model for each gene This has the effect of estimating the average M value for each gene adjusting for the dye swaps gt fit lt lmFit MA design gt fit An object of class MArrayLM coefficients 1 0 3556298 0 3283455 0 3455845 0 2254783 0 3175470 8443 more rows rank 1 1 assign NULL qr qr 1 1 2 0 2 0 5 3 0 5 4 0 5 qraux 1 1 5 pivot 1 1 tol 1 1e 07 rank 1 1 df residual 1133333 8443 more elements sigma 82 1 0 2873571 0 3115307 0 3699258 0 3331054 0 2689609 8443 more elements cov coefficients 1 1 0 25 stdev unscaled 1 0 5 0 5 0 5 0 5 0 5 8443 more rows pivot 1 1 genes Block Row Column ID Name 1 1 1 1 control genol 2 1 1 2 control geno2 3 1 1 3 control geno3 4 1 1 4 control 3XSSC 5 1 1 5 control 3XSSC 8443 more rows Amean 1 13 44500 13 65700 13 40297 10 71737 10 83383 8443 more elements method 1 1s design 1 1 1 2 1 3 1 4 1 In the above fit object coefficients is the average M value for each gene and sigma is the sample standard deviations for each gene Ordinary t stat
141. rEstTitration Found 224 LysTitration Found 260 PheTitration Found 160 RikenTitration Found 224 ThrTitration Found 64 185 Found 64 GAPDH Found 32 Lysine Found 32 Threonine Found 64 Tubulin Setting attributes values col cex 92 MA plots were examined for all the arrays Here we give the plot for array 9 only gt plotMA RG array 9 xlim c 4 15 5 cb 3 Riken Buffer _ CerEstTitratio Lys Titration he Titration RikenTitration ThrTitration 185 Here Buffer is an obvious negative control while 185 GAPDH Lysine Threonine and Tubulin are single gene positive controls sometime called house keeping genes RikenTitration is a titration series of a pool of the entire Riken library and can be reasonably expected to be non differentially expressed CerEstTitration is a titration of a pool of a cerebellum EST library This will show higher expression in later mutant tissues The Lys Phe and Thr series are single gene titration series which were not spiked in in this case and can therefore be treated as negative controls The negative control probe intensities are quite high especially for the red channel and especially for array 7 negative lt RG genes Status in c Buffer LysTitration PheTitration ThrTitration par mfrow c 1 2 boxplot log2 RG G negative las 2 main Green background ylab log2 intensity col green boxplot 10g2 RG R negative las 2 main Red background ylab lo
142. rays RG weights NULL The output object MA will still contain any spot quality weights found in RG but these weights are not used in the normalization step It is often useful to make use of control spots to assist the normalization process For exam ple if the arrays contain a series of spots which are known in advance to be non differentially expressed these spots can be given more weight in the normalization process Spots which are known in advance to be differentially expressed can be down weighted Suppose for exam ple that the controlStatus has been used to identify spike in spots which are differentially expressed and a titration series of whole library pool spots which should not be differentially expressed Then one might use gt w lt modifyWeights RG weights RG genes Status c spikein titration c 0 2 gt MA lt normalizeWithinArrays RG weights w to give zero weight to the spike in spots and double weight to the titration spots This process is automated by the control normalization method for example 30 gt csi lt RG genes Status titration gt MA lt normalizeWithinArrays RG method control controlspots csi In general csi is an index vector specifying the non differentially expressed control spots 23 The idea of up weighting the titration spots is in the same spirit as the composite nor malization method proposed by 52 but is more flexible and generally applicable The above code as
143. resenting the D melanogaster genome were downloaded from ftp ftp ncbi nlm nih gov genomes Drosophila_melanogaster RELEASE_5_48 There are five FASTA files corresponding to chromosome arms gt DM lt readTargets FASTA GFF files csv sep gt DM FASTA GFF Chromosome 1 CHR_2 NT_033778 fna CHR_2 NT_033778 gff Chr2R 2 CHR_2 NT_033779 fna CHR_2 NT_033779 gff Chr2L 3 CHR_3 NT_033777 fna CHR_3 NT_033777 gff Chr3R 4 CHR_3 NT_037436 fna CHR_3 NT_037436 gff Chr3L 5 CHR_4 NC_004353 fna CHR_4 NC_004353 gff Chr4 6 CHR_X NC_004354 fna CHR_X NC_004354 gff ChrX The five fna files were concatenated into a single FASTA format file Drosophila ref fasta and an index file was built using Rsubread gt library Rsubread gt buildindex basename index reference Drosophila_ref fasta 126 Third reads were aligned to the genome The SE reads were mapped using gt align index SE_fastq_files input_format FASTQ where SE_fastq_files is a character vector containing the names of the SE FASTQ files The PE reads were mapped using gt align index PE_fastq_files1 PE_fastq_files2 input_format FASTQ where PE fastq_files1 and PE_fastq_files2 are vectors containing the names of the PE FASTQ files These commands produce BAM files containing aligned reads SE or frag ments PE 18 2 4 Read counts for exons Next we counted the number of reads or fragments overlapping each annotated exon of each gene GFF files containing ge
144. rted to weights which can be used in 1mFit This method offers a graduated approach to quality assessment by allowing poorer quality arrays which would otherwise be discarded to be included in an analysis but down weighted 14 2 Example 1 We consider the array quality weights applied to the spike in controls from a quality control data set courtesy of Andrew Holloway Ryan van Laar and Dileepa Diyagama from the Peter MacCallum Cancer Centre in Melbourne This collection of arrays described in 29 consists of 100 replicate hybridizations and we will use data from the first 20 arrays The object MAlms stores the loess normalized data for the 120 spike in control probes on each array Since these arrays are replicate hybridizations the default design matrix of a single column of ones is used gt arrayw lt arrayWeights MAlms gt barplot arrayw xlab Array ylab Weight col white las 2 gt abline h 1 lwd 1 1ty 2 67 2 0 7 Weight TNO LO Oro Ltn The empirical array weights vary from a minimum of 0 16 for array 19 to a maximum of 2 31 for array 8 These weights can be used in the linear model analysis gt gt fitw lt ImFit MAlms weights arrayw fitw lt eBayes fitw In this example the ratio control spots should show three fold or ten fold changes while the dynamic range spots should not be differentially expressed To compare the moderated t statistics bef
145. s filename gt targets filename estrogen time h low10 1 lowi0 1 cel absent 10 lowi0 2 low10 2 cel absent 10 high10 1 highi0 1 cel present 10 highi0 2 high10 2 cel present 10 low48 1 low48 1 cel absent 48 low48 2 low48 2 cel absent 48 high48 1 high48 1 cel present 48 high48 2 high48 2 cel present 48 Now read the cel files into an AffyBatch object and normalize using the rma function from the affy package 104 gt ab lt ReadAffy filenames targets filename celfile path datadir gt eset lt rma ab Background correcting Normalizing Calculating Expression By default the only probe set annotation contained in eset is the Affymetrix ID number We will add gene symbols to the data object now so that they will be automatically included in limma results tables later on gt library annotate Loading required package AnnotationDbi gt library hgu95av2 db Loading required package org Hs eg db Loading required package DBI gt ID lt featureNames eset gt Symbol lt getSYMBOL ID hgu95av2 db gt fDataleset lt data frame Symbol Symbol There are many ways to construct a design matrix for this experiment Given that we are interested in the early and late estrogen responders we can choose a parametrization which includes these two contrasts gt treatments lt factor c 1 1 2 2 3 3 4 4 labels c e10 E10 e48 E48 gt contrasts treatments lt cbind Time c 0 0 1 1 E10 c 0 1 0 0
146. s experiment we want to compare the two tissue types This comparison can be made within subjects because each subject yields a value for both tissues We also want to compare diseased subjects to normal subjects but this comparison is between subjects If we only wanted to compare the two tissue types we could do a paired samples com parison If we only wanted to compared diseased to normal we could do an ordinary two group comparison Since we need to make comparisons both within and between subjects it is necessary to treat Patient as a random effect This can be done in limma using the duplicateCorrelation function The two experimental factors Condition and Tissue could be handled in many ways Here we will assume that it is convenient to join the two into a combined factor gt Treat lt factor paste targetsfCondition targetsfTissue sep gt design lt model matrix 0 Treat gt colnames design lt levels Treat Then we estimate the correlation between measurements made on the same subject gt corfit lt duplicateCorrelation eset design block targets Subject corfit consensus Vv Then this inter subject correlation is input into the linear model fit gt fit lt lmFit eset design block targets Subject correlation corfit consensus 51 Now we can make any comparisons between the experimental conditions in the usual way example gt cm lt makeContrasts DiseasedvsNormalForTissueA Diseased A
147. s not appropriate For example Agilent arrays do not have print tip groups so one should use global loess normalization instead gt MA lt normalizeWithinArrays RG method loess Print tip loess is also unreliable for small arrays with less than say 150 spots per print tip group Even larger arrays may have particular print tip groups which are too small for print tip loess normalization if the number of spots with non missing M values is small for one or more of the print tip groups In these cases one should either use global loess normalization or else use robust spline normalization 29 gt MA lt normalizeWithinArrays RG method robustspline which is an empirical Bayes compromise between print tip and global loess normalization with 5 parameter regression splines used in place of the loess curves Loess normalization assumes that the bulk of the probes on the array are not differentially expressed It doesn t assume that that there are equal numbers of up and down regulated genes or that differential expression is symmetric about zero provided that the loess fit is implemented in a robust fashion but it is necessary that there be a substantial body of probes which do not change expression levels Oshlack et al 23 show that loess normalization can tolerate up to about 30 asymmetric differential expression while still giving good results This assumption can be suspect for boutique arrays where the total number of uniq
148. s recommended when a set of closely related contrasts are being tested This method simply appends all the tests together into one long vector of tests i e it treats all the tests as equivalent regardless of which probe or contrast they relate to An advantage is that the raw p value cutoff is consistent across all contrasts For this reason method global is recommended if you want to compare the number of DE genes found for different contrasts 65 for example interpreting the number of DE genes as representing the strength of the contrast However users need to be aware that the number of DE genes for any particular contrasts will depend on which other contrasts are tested at the same time Hence one should include only those contrasts which are closely related to the question at hand Unnecessary contrasts should be excluded as these would affect the results for the contrasts of interest Another more theoretical issue is that there is no theorem which proves that adjust method BH in combination with method global will correctly control the false discovery rate for combi nations of negatively correlated contrasts however simulations experience and some theory suggest that the method is safe in practice The hierarchical method offers power advantages when used with adjust method holm to control the family wise error rate However its properties are not yet well understood with adjust BH method nestedF has a more specialised ai
149. s the fitted values will be just the WIvsREF coefficient which is correct For the remaining arrays the fitted values will be WTvsREF MUvsWT which is equivalent to mutant vs reference also correct For reasons that will be apparent later this is sometimes called the treatment contrasts parametrization Differentially expressed genes can be found by gt fit lt lmFit MA design gt fit lt eBayes fit gt topTable fit coef MUvsWT adjust BH There is no need here to use contrasts fit because the comparison of interest is already built into the fitted model This analysis is analogous to the classical pooled two sample t test except that information has been borrowed between genes For the second approach the design matrix should be WT MU Arrayi 1 0 Array2 1 0 Array3 0 1 Array4 0 1 ArrayS 0 1 The first coefficient now represents wild type vs the reference and the second represents mutant vs the reference Our comparison of interest is the difference between these two coefficients We will call this the group means parametrization Differentially expressed genes can be found by fit lt ImFit MA design cont matrix lt makeContrasts MUvsWT MU WT levels design fit2 lt contrasts fit fit cont matrix fit2 lt eBayes fit2 topTable fit2 adjust BH VVVNV NM The results will be exactly the same as for the first approach The design matrix can be constructed 54 1 manually 2 using the limma func
150. searchers Some image analysis programs produce a quality index as part of the output For exam ple GenePix produces a column called Flags which is zero for a normal spot and takes increasingly negative values for different classes of problem spot If you are reading GenePix image analysis files the call gt RG lt read maimages files source genepix wt fun wtflags weight 0 cutoff 50 will read in the intensity data and will compute a matrix of spot weights giving zero weight to any spot with a Flags value less than 50 The weights are stored in the weights com ponent of the RGList data object The weights are used automatically by functions such as normalizeWithinArrays which operate on the RG list Sometimes the ideal size in terms of image pixels is known for a perfectly circular spot In this case it may be useful to downweight spots which are much larger or smaller than this ideal size If SPOT image analysis output is being read the following call gt RG lt read maimages files source spot wt fun wtarea 100 gives full weight to spots with area exactly 100 pixels and down weights smaller and larger spots Spots which have zero area or are more than twice the ideal size are given zero weight The appropriate way to computing spot quality weights depends on the image analysis program used Consult the help entry QualityWeights to see what quality weight functions are available The wt fun argument is very flexib
151. son Laurentiu Adi Tarca Gregory Theiler Bjorn Usadel Chris Wilkinson Jean Yee Hwa Yang John Zhang Conventions Where possible limma tries to use the convention that class names are in upper CamelCase i e the first letter of each word is capitalized while function names are in lower camelCase i e first word is lowercase When periods appear in function names the first word should be an action while the second word is the name of a type of object on which the function acts Software Projects Using limma The limma package is used as a building block or as the underlying computational engine by a number of software projects designed to provide user interfaces for microarray data analysis including RMAGEML 8 arrayMagic 4 DNMAD 43 GPAP GenePix Pro Auto Processor 45 the KTH Package 33 SKCC WebArray 51 CARMAweb 26 and Pomelo II 22 The LCBBASE project provides a limma plug in for the BASE database 20 The 138 Stanford Microarray Database http genome www5 stanford edu calls out to limma for background correction options Citations Biological studies using the limma package include 10 32 28 2 24 42 Methodological studies using the limma package include 13 14 139 Bibliography 10 Benjamini Y and Hochberg Y 1995 Controlling the false discovery rate a practical and powerful approach to multiple testing Journal of the Royal Statistical Society Series B 57 289 300 Boutr
152. st BH The vector biolrep indicates the two blocks corresponding to biological replicates The value corfit consensus estimates the average correlation within the blocks and should be positive This analysis is analogous to mixed model analysis of variance 21 Chapter 18 except that information has been borrowed between genes Information is borrowed by constraining the within block correlations to be equal between genes and by using empirical Bayes methods to moderate the standard deviations between genes 38 If the technical replicates were in dye swap pairs as FileName Cy3 Cyd Filel wtl mul File2 mul wti File3 wt2 mu2 Filed mu wt2 then one might use design lt c 1 1 1 1 corfit lt duplicateCorrelation MA design ndups 1 block biolrep fit lt lmFit MA design block biolrep cor corfit consensus fit lt eBayes fit topTable fit adjust BH VVVV MV In this case the correlation corfit consensus should be negative because the technical repli cates are dye swaps and should vary in opposite directions This method of handling technical replication using duplicateCorrelation is somewhat limited for two color experiments If for example one technical replicate was dye swapped and the other not 59 FileName Cy3 Cyd Filel wtl mul File2 mul wti File3 wt2 mu2 Filed wt2 mu2 then there is no way to use duplicateCorrelation because the technical replicate correlation will
153. st For this exper iment it was of interest to make all pairwise comparisons between the three DC populations This was accomplished using the contrast matrix gt contrast matrix lt cbind CD8 CD4 c 1 0 DN CD4 c 0 1 CD8 DN c 1 1 gt rownames contrast matrix lt colnames design gt contrast matrix CD8 CD4 DN CD4 CD8 DN CD8 1 0 1 DN 0 1 1 The contrast matrix can be used to expand the linear model fit and then to compute empirical Bayes statistics gt fit2 lt contrasts fit fit contrast matrix gt fit2 lt eBayes fit2 40 Chapter 9 Single Channel Experimental Designs 9 1 Introduction Unlike the early days of microarrays most data is now of the single channel type Single channel data is generated from popular microarray technologies such as Affymetrix Illumina or Agilent The new technology of RNA Seq also generates single channel data so everything in this chapter can be applied to RNA Seq analyses when the data has been pre processed using the voom function 15 Single channel data may be analyzed very much like ordinary univariate linear models or analysis of variance The difference with microarray data is that it is almost always necessary to extract particular contrasts of interest and so the standard parametrizations provided for factors in R are not usually adequate We will assume for our examples here that the data has been suitably pre processed nor malized and is available as an Expressi
154. sumes that RG already contains spot quality weights If not one could use gt w lt modifyWeights array 1 dim RG RG genes Status c spikein titration c 0 2 gt MA lt normalizeWithinArrays RG weights w instead Limma contains some more sophisticated normalization methods In particular some between array normalization methods are discussed in Section 6 3 of this guide 6 3 Between Array Normalization This section explores some of the methods available for between array normalization of two color arrays A feature which distinguishes most of these methods from within array normal ization is the focus on the individual red and green intensity values rather than merely on the log ratios These methods might therefore be called individual channel or separate channel normalization methods Individual channel normalization is typically a prerequisite to indi vidual channel analysis methods such as that provided by lmscFit Further discussion of the issues involved is given by 55 This section shows how to reproduce some of the results given in 55 The Apoal data set from Section 16 2 will be used to illustrate these methods We assume that the the Apoal data has been loaded and background corrected as follows gt load Apoal RData An important issue to consider before normalizing between arrays is how background correction has been handled For between array normalization to be effective it is important to avoid miss
155. t fit lt lmFit v design Test for differentially expressed exons gt fit de lt eBayes fit robust TRUE gt topTable fit de coef PasillaDown GeneID Chr Start End Length Symbol logFC AveExpr 150709 32007 X 10674926 10676128 1203 sesB 3 26 6 51 31 150713 32007 X 10675026 10676128 1103 sesB 3 26 6 51 31 96435 43230 3R 22694960 22695852 893 BM 40 SPARC 2 47 10 60 27 150697 32008 X 10672987 10673728 742 Ant2 2 85 5 63 27 96434 43230 3R 22695915 22696094 180 BM 40 SPARC 2 27 8 21 26 96433 43230 3R 22697648 22697717 70 BM 40 SPARC 2 17 6 37 23 131 RERO0O0RR P Value adj P Val 31e 14 31e 14 61e 14 12e 14 78e 13 11e 12 Ze 39e 10 83e 10 83e 10 29e 09 75e 09 o e o N 39e 10 23 23 21 21 21 19 awo oaoa w 107856 11333 52822 150695 44030 44548 34652 32008 3L 2561932 2562843 2R 6407125 6408782 2L 12403270 12404177 X 10674230 10674559 gt summary decideTests fit de 912 1658 908 330 Intercept Batch3 Batch4 PasillaDown 1 422 0 3237 1 32712 5104 5791 25887 24787 5380 5793 1983 32603 1785 18 2 10 Alternate splicing Test for differences in exon retention between Pasilla down vs normal msn lola vir 1 Ant2 2 46 2 Ze 2 25 15 96 23 1 1 08e 12 22 6 1 43e 12 21 3 3 40e 12 21 7 2 59e 12 gt ex lt diffSplice fit PasillaDown geneid GeneID exonid Start Total number of exons Total numb
156. t published by Lim et al 19 who used 108 the expression profiles to show that limina progenitor cells are the likely cell of origina for basal like breast cancer The data files used in this case study can be downloaded from http bioinf wehi edu au marray IlluminaCaseStudy The expression data files are provided as gzip files and will need to be uncompressed before they can be used for a limma analysis To run the analysis described here download and unzip the data files and set the working directory of R to the folder containing the files 17 3 2 The target RNA samples The data consists of three files gt dirQ 1 control probe profile txt 2 probe profile txt 3 Targets txt First read the targets file describing the target RNA samples gt library limma gt targets lt readTargets gt targets Donor Age CellType 1 1 39 MS 2 1 39 Stroma 3 1 39 ML 4 1 39 LP 5 2 57 MS 6 2 57 Stroma 7 2 57 ML 8 2 57 LP 9 3 21 MS 10 3 21 Stroma 11 3 21 ML 12 3 21 LP Breast tissue was obtained from three healthy human donors who were undergoing reduction mammoplasties Epithelial cells were sorted into three subpopulations enriched for mammary stem cells MS luminal progenitor cells LP and mature luminal cells ML 19 The MS LP and ML cells representative a lineage of luminal cells use to construct the ducts used to transport milk in the breast 44 Mammary Luminal Mature stem cell MS progenitor
157. t tip scale normalization Although there is some overlap between the normal ization functions in the two packages both providing print tip loess normalization the two approaches are largely complementary Marray also provides highly developed functions for graphical display of two color microarray data Read functions in marray produce objects of class marrayRaw while normalization produces objects of class marrayNorm Objects of these classes may be converted to and from limma data objects using the convert package marrayRaw objects may be converted to RGList objects and marrayNorm objects to MAList objects using the as function For example if Data is an marrayNorm object then gt library convert gt MA lt as Data MAList converts to an MAList object marrayNorm objects can also be used directly in limma without conversion and this is generally recommended If Data is an marrayNorm object then gt fit lt lmFit Data design fits a linear model to Data as it would to an MAList object One difference however is that the marray read functions tend to populate the maW slot of the marrayNorm object with qualitative spot quality flags rather than with quantitative non negative weights as expected by limma If this is so then one may need gt fit lt lmFit Data design weights NULL to turn off use of the spot quality weights 34 Chapter 7 Filtering We usually recommend that all probes on the microarray platform
158. ta is also analyzed in the factDesign package vignette To repeat this case study you will need to have the R packages affy estrogen and hgu95av2cdf installed The data gives results from a 2x2 factorial experiment on MCF breast cancer cells using Affymetrix HGU95av2 arrays The factors in this experiment were estrogen present or absent and length of exposure 10 or 48 hours The aim of the study is the identify genes which respond to estrogen and to classify these into early and late responders Genes which respond early are putative direct target genes while those which respond late are probably downstream targets in the molecular pathway First load the required packages gt library limma gt library affy Welcome to Bioconductor Vignettes contain introductory material To view simply type openVignette For details on reading vignettes see the openVignette help page gt library hgu95av2cdf The data files are contained in the extdata directory of the estrogen package gt datadir lt file path find package estrogen extdata gt dir datadir 1 OOIndex bad cel high10 1 cel highi0 2 cel high48 1 cel 6 high48 2 cel low10 1 cel low10 2 cel low48 1 cel low48 2 cel 11 phenoData txt The targets file is called phenoData txt We see there are two arrays for each experimental condition giving a total of 8 arrays gt targets lt readTargets phenoData txt path datadir sep row name
159. tensities pro duces a similar result but is somewhat noisier gt MA q lt normalizeBetweenArrays RG b method quantile gt plotDensities MA q col black Warning message number of groups 2 not equal to number of col in plotDensities MA q col black RG densities 0 20 0 25 0 30 1 1 1 Density 0 00 1 Intensity 33 There are other between array normalization methods not explored here For example normalizeBetweenArrays with method vsn gives an interface to the variance stabilizing nor malization methods of the vsn package 6 4 Using Objects from the marray Package The package marray is a well known R package for pre processing of two color microarray data Marray provides functions for reading normalization and graphical display of data Marray and limma are both descendants of the earlier and path breaking sma package available from http www stat berkeley edu users terry zarray Software smacode html but limma has maintained and built upon the original data structures whereas marray has converted to a fully formal data class representation For this reason Limma is backwardly compatible with sma while marray is not Normalization functions in marray focus on a flexible approach to location and scale nor malization of M values rather than the within and between array approach of limma Marray provides some normalization methods which are not in limma including 2 D loess normaliza tion and prin
160. th or without a reference genome BMC Bioinformatics 12 323 Liao Y Smyth G K and Shi W 2013 The Subread aligner fast accurate and scalable read mapping by seed and vote Nucleic Acids Research 41 e108 Liao Y Smyth G K and Shi W 2014 featureCounts an efficient general purpose read summarization program Bioinformatics 30 923 930 Lim E Vaillant F Wu D Forrest N Pal B Hart A Asselin Labat M Gyorki D Ward T Partanen A et al 2009 Aberrant luminal progenitors as the candidate target population for basal tumor development in brcal mutation carriers Nature Medicine 15 907 913 Linnaeus Centre for Bioinformatics Uppsala University S Milliken G A and Johnson D E 1992 Analysis of Messy Data Volume 1 Designed Experiments Chapman Hall New York Morrissey E R and Diaz Uriarte R 2009 Pomelo II finding differentially expressed genes Nucleic Acids Research 37 W581 W586 Oshlack A Emslie D Corcoran L and Smyth G 2007 Normalization of boutique two color microarrays with a high proportion of differentially expressed probes Genome Biology 8 R2 Peart M Smyth G Van Laar R Bowtell D Richon V Marks P Holloway A and Johnstone R 2005 Identification and functional significance of genes regulated by structurally different histone deacetylase inhibitors Proceedings of the National Academy of Sciences of the United States of America 102
161. tial coefficients and compare them in as many ways as you want to answer any questions you might have regardless of how many or how few these might be If you have data from Affymetrix experiments from single channel spotted microarrays or from spotted microarrays using a common reference then linear modeling is the same as ordinary analysis of variance or multiple regression except that a model is fitted for every gene With data of this type you can create design matrices as one would do for ordinary 36 modeling with univariate data If you have data from spotted microarrays using a direct design i e a connected design with no common reference then the linear modeling approach is very powerful but the creation of the design matrix may require more statistical knowledge For statistical analysis and assessing differential expression limma uses an empirical Bayes method to moderate the standard errors of the estimated log fold changes This results in more stable inference and improved power especially for experiments with small numbers of arrays 36 For arrays with within array replicate spots limma uses a pooled correlation method to make full use of the duplicate spots 38 8 2 Single Channel Designs Affymetrix data will usually be normalized using the affy package We will assume here that the data is available as an ExpressionSet object called eset Such an object will have an slot containing the log expression values for each gen
162. tion modelMatrix or 3 using the built in R function model matrix Let Group be the factor defined by gt Group lt factor c WT WT Mu Mu Mu levels c WT Mu For the first approach the treatment contrasts parametrization the design matrix can be computed by gt design lt cbind WIvsRef 1 MUvsWT c 0 0 1 1 1 or by gt param lt cbind WIvsRef c 1 1 0 MUvsWT c 0 1 1 gt rownames param lt c Ref WT Mu gt design lt modelMatrix targets parameters param or by gt design lt model matrix Group gt colnames design lt c WIvsRef MUvsWT all of which produce the same result For the second approach the group means parametriza tion the design matrix can be computed by gt design lt cbind WT c 1 1 0 0 0 MU c 0 0 1 1 1 or by gt param lt cbind WT c 1 1 0 MU c 1 0 1 gt rownames param lt c Ref WT Mu gt design lt modelMatrix targets parameters param or by gt design lt model matrix 0 Group gt colnames design lt c WT Mu all of which again produce the same result 10 3 Several Groups The above approaches for two groups extend easily to any number of groups Suppose that the experiment has been conducted to compare three RNA sources RNA1 RNA and RNA3 For example the targets frame might be 59 FileName Cy3 Cy5 Filel Ref RNAI File2 RNA1 Ref File3 Ref RNA2 Filed RNA2 Ref Filed Ref
163. ue genes on the array is small say less than 150 particularly if these genes have been selected for being specifically expressed in one of the RNA sources In such a situation the best strategy is to include on the arrays a series of non differentially expressed control spots such as a titration series of whole library pool spots and to use the up weighting method discussed below 23 A whole library pool means that one makes a pool of a library of probes and prints spots from the pool at various concentrations 52 The library should be sufficiently large than one can be confident that the average of all the probes is not differentially expressed The larger the library the better Good results have been obtained with library pools with as few as 500 clones In the absence of such control spots normalization of boutique arrays requires specialist advice Any spot quality weights found in RG will be used in the normalization by default This means for example that spots with zero weight flagged out will not influence the normal ization of other spots The use of spot quality weights will not however result in any spots being removed from the data object Even spots with zero weight will be normalized and will appear in the output object such spots will simply not have any influence on the other spots If you do not wish the spot quality weights to be used in the normalization their use can be over ridden using gt MA lt normalizeWithinAr
164. ues from a single channel experiment In these cases the design matrix can be formed as for a single channel experiment When the common reference is dye swapped the simplest method is to setup the design matrix using the modelMatrix function and the targets file 10 2 Two Groups Suppose now that we wish to compare two wild type Wt mice with three mutant Mu mice using arrays hybridized with a common reference RNA Ref FileName Cy3 Cy5 Filel Ref WT File2 Ref WT File3 Ref Mu Filed Ref Mu File Ref Mu The interest here is in the comparison between the mutant and wild type mice There are two major ways in which this comparison can be made We can either 1 create a design matrix which includes a coefficient for the mutant vs wild type difference or 2 create a design matrix which includes separate coefficients for wild type and mutant mice and then extract the difference as a contrast 53 For the first approach the design matrix should be as follows gt design WIvsREF MUvsWT Arrayl 1 0 Array2 1 0 Array3 1 1 Array4 1 1 Array5b 1 1 Here the first coefficient estimates the difference between wild type and the reference for each probe while the second coefficient estimates the difference between mutant and wild type For those not familiar with model matrices in linear regression it can be understood in the following way The matrix indicates which coefficients apply to each array For the first two array
165. une system Immature B cells that have emigrated from the bone marrow fail to differentiate into full fledged B cells resulting in a notable deficit of mature B cells Arrays Arrays were printed at the Australian Genome Research Facility with expressed sequence tags ESTs from the National Institute of Aging 15k mouse clone library plus a range of positive negative and calibration controls The arrays were printed using a 48 tip print head and 26x26 spots in each tip group Data from 24 of the tip groups are given here Every gene ESTs and controls was printed twice on each array side by side by rows The NIA15k probe IDs have been anonymized in the output presented here Hybridizations A retrovirus was used to add Bob back to a Bob deficient cell line Two RNA sources were compared using 2 dye swap pairs of microarrays One RNA source was obtained from the Bob deficient cell line after the retrovirus was used to add GFP green fluorescent protein a neutral protein The other RNA source was obtained after adding both GFP and Bob protein RNA from Bob GFP was labelled with Cy5 in arrays 2 and 4 and with Cy3 in arrays 1 and 4 Image analysis The arrays were image analyzed using SPOT with morph background estimation The data used for this case study can be downloaded from http bioinf wehi edu au limma The file should be placed in the working directory of your R session This case study was last updated on 29 June 2006 usin
166. unts into an expression object which can then be analysed as for microarray data This guide describes limma as a command driven package Graphical user interfaces to the most commonly used functions in limma are available through the packages limmaGUI 47 for two color data or affylmGUI 46 for Affymetrix data Both packages are available from Bioconductor This user s guide should be correct for R Versions 2 8 0 through 3 1 0 and limma versions 2 16 0 through 3 20 0 The limma homepage is http bioinf wehi edu au limma Chapter 2 Preliminaries 2 1 Citing limma Limma is an implementation of a body of methodological research by the authors and co workers Please try to cite the appropriate methodological papers when you use results from the limma software in a publication as such citations are the main means by which the authors receive professional credit for their work If you use limma for differential expression analysis please cite Smyth G K 2004 Linear models and empirical Bayes methods for assessing differential expression in microarray experiments Statistical Applications in Ge netics and Molecular Biology Vol 3 No 1 Article 3 http www bepress com sagmb vo13 iss1 art3 The above article describes the linear modeling approach implemented by ImFit and the empirical Bayes statistics implemented by eBayes topTable etc If you use limma with duplicate spots or technical replication please cite Smyth G
167. ving the name of the image analysis output file a Cy3 column giving the RNA type labelled with Cy3 dye for that slide and a Cy5 column giving the RNA type labelled with Cy5 dye for that slide Other columns are optional The targets file can be prepared using any text editor but spreadsheet programs such as Microsoft Excel are convenient The targets file for the Swirl case study includes optional SlideNumber and Date columns El Microsoft Excel SwirlSample txt f F ol x 8 File Edit view Insert Format Tools Data Window Help S xX OSH lSQY Blo 6 ME gt H11 y f SlideNumber FileName Cy3 Cy5 Date 81 swirl 1 spot swirl wild type 20 09 2001 82 swirl 2 spot wild type swirl 20 09 2001 93 swirl 3 spot swirl wild type 8 11 2001 94 swirl 4 spot wild type swirl 8 11 2001 1x1 It is often convenient to create short readable labels to associate with each array for use in output and in plots especially if the file names are long or non intuitive A column containing these labels can be included in the targets file for example the Name column used for the Apoal case study El Microsoft Excel ApoAlTargets txt F iol x E File Edit View Insert Format Tools Data Window Help ES DSBsea4kRy Bb o z me gt G25 M f C SlideNumber Name FileName Cy3 Cy5 1 c1 cl spot Ref wild type 2 c2 c2 spot Ref wild type 3 c3 c3 spot Ref wild type 4 c4 c4 spot Ref wild type 5 c5 c5 spot Ref wild type 6 cb cb spot Ref wild type Vic c7 spot
168. we i e Gd ev Ee Bea 5 Quality Assessment 6 Pre Processing Two Color Data 6 1 Background Correction oa oaoa ad E Ay a SE eee 6 2 Within Array Normalization seta ess de pe de e ds ok Kia BE ER 6 3 Between Array Normalization ca o oo ae PO eee wa ee de 6 4 Using Objects from the marray Package o 7 Filtering 15 15 15 15 17 19 20 21 22 22 23 25 27 27 29 31 34 35 8 Linear Models Overview Oud Introduction a guh ate de le o ty Se Pde eb dade te on 8 2 Single Channel Designs us aus DA e DAG Sees GA 8 3 Common Reference Designs ss sda e ks au eee ae cae As 8 4 Direct Two Color Designs o o 9 Single Channel Experimental Designs A e A AS A A ee d Mone 02 TWO GTOUDS lada da A Anas dl 9 3 Several Groups A Beek da ee Bde Beh Ane eee aoa te 9 4 Additive Models and Blocking 9 4 1 Paired Samples 2 4 03 2 se Bae A GA a ee OA Os Blocs dr A ae ae Ad OG RSs 9 5 Interaction Models 2 x 2 Factorial Designs 9 5 1 Questions of Interest 2 per eb we ee 9 5 2 Analysing as for a Single Factor 9 5 3 A Nested Interaction Formula 9 5 4 Classic Interaction Models 9 6 Time Course Experiments 1 2 diarrea e 9 6 1 Replicated time points ura cae oe ee oe ee 9 6 2 Many time points 0 au vet a eee ed ee eee ed 9 7 Multi level Experiments 42a Adee eae ee ee ea 10 Two Color Experiments wit
169. x columns list R F635 Median G F532 Median Rb B635 Gb B532 What should you do if your image analysis program is not in the above list If the image output files are in standard format then you can supply the annotation and intensity column names yourself For example gt RG lt read maimages files columns list R F635 Mean G F532 Mean Rb B635 Median Gb B532 Median annotation c Block Row Column ID Name 18 is exactly equivalent to source genepix Standard format means here that there is a unique column name identifying each column of interest and that there are no lines in the file following the last line of data Header information at the start of the file is acceptable but extra lines at the end of the file will cause the read to fail It is a good idea to look at your data to check that it has been read in correctly Type gt show RG to see a print out of the first few lines of data Also try gt summary RG R to see a five number summary of the red intensities for each array and so on It is possible to read the data in several steps If RG1 and RG2 are two data sets corre sponding to different sets of arrays then gt RG lt cbind RG1 RG2 will combine them into one large data set Data sets can also be subsetted For example RG 1 is the data for the first array while RG 1 100 is the data on the first 100 genes 4 5 Reading Single Channel Agilent Intensity

limma: Linear Models for Microarray Data User's Guide (Now

Contents

Download Pdf Manuals

Related Search

Related Contents