
CASAVA v1.8.2 User Guide (15011196) - Support

1. Installing CASAVA
Appendix B Using Parallelization
  Make Utilities
Appendix C Reference Files (CASAVA)
  Introduction
  ELAND Reference Files
  Variant Detection and Counting Reference Files
  Getting Reference Files
Appendix D Algorithm Descriptions
  Introduction
  ELANDv2 and ELANDv2e
  Variant Detection
  readBases Counting Method
Appendix E Qseq Conversion
  Introduction
  Qseq Converter Input Files
  Running Qseq Converter
  Qseq Converter Parameters
  Qseq Converter Output Data
Appendix F Export to SAM Conversion
  Introduction
  SAM Format
Glossary
Index
Technical Assistance
Part # 15011196 Rev D
2. Reference Files (CASAVA)

Introduction

CASAVA needs a number of special reference files to run analysis, especially for RNA sequencing. This chapter describes the reference files that are needed to run ELANDv2e and CASAVA variant detection, and provides instructions on how to generate these files for other species and builds. As of CASAVA 1.8, ELAND squashes genome files automatically when it starts. Genome sequence files for most commonly used model organisms are available through iGenome (see Getting Reference Files on page 128).

ELAND Reference Files

ELAND needs the following file to perform an alignment:
  - Unsquashed genome sequence files. As of CASAVA 1.8, ELAND squashes genome files automatically when it starts.

In addition, eland_rna needs two types of files to analyze RNA Sequencing data:
  - Abundant sequences files: mitochondrial DNA, ribosomal region sequences, 5S RNA (optional), and other contaminants
  - refFlat.txt.gz file (UCSC type) or seq_gene.md.gz file (NCBI type)

Reference Genome

CASAVA uses a reference genome in FASTA format. Both single sequence FASTA and mult
3. If eland_pair analysis has been specified for one or more lanes, then two Expanded Lane Results Summaries are produced, one for each read. All lanes for which analysis has been specified are represented in the Read 1 table, but only those for which eland_pair analysis has been specified contribute statistics to the Read 2 table.

Per Tile Statistics

Below the Expanded Sample Summary is a link to a file containing per-tile statistics. The displayed metrics are similar to the expanded Lane 1 Read 1 tables in the CASAVA 1.8 configureAlignment summary files.

IVC Plots

Next is a link to IVC plots. The IVC.htm file (Intensity versus Cycle) contains plots that display lane averages for samples.

All: This is the lane average of the data displayed in All.htm. It plots each channel (A, C, G, T) separately as a different colored line. Means are calculated over all clusters, regardless of base calling. If all clusters are T, then channels A, C, and G will be zero. If all bases are present in the sample at 25% of total and a well-balanced matrix is used for analysis, the graph will display all channels with similar intensities. If intensities are not similar, the results could indicate either poor cross-talk correction or poor absolute intensity balance between each channel.

Called: This plot is similar to All, except means are calculated for each channel using clusters that the base caller has called in that channel. If all b
4. # Standard flow cell level variables
USE_BASES y*,y*
CHROM_NAME_VALIDATION off
ANALYSIS eland_rna
ELAND_FASTQ_FILES_PER_PROCESS 2

# Flow cell level ELAND_GENOME variable set for all data sets with Reference HumanNCBI37ELAND
REFERENCE HumanNCBI37ELAND ELAND_GENOME /home/user/genomes/archive/UCSChg18/fasta

# Flow cell level SAMTOOLS_GENOME variable set for all data sets with Reference AMPLICONS180111JRB ELAND
REFERENCE AMPLICONS180111JRB ELAND SAMTOOLS_GENOME /home/user/genomes/AMPLICONS180111JRB/AMPLICONS180111JRB.fa

# Flow cell level SAMTOOLS_GENOME variable set for all data sets with Reference TSC1 ELAND
REFERENCE TSC1 ELAND SAMTOOLS_GENOME /illumina/user/TSC1/TSC1.fa

# Overrides global ANALYSIS with eland_extended if the reference is TSC1 ELAND
REFERENCE TSC1 ELAND ANALYSIS eland_extended

# If the reference is unknown (default for Undetermined barcode data sets), sets the analysis to none.
# Only affects lanes 1, 2, 3, and 4.
1234:REFERENCE unknown ANALYSIS none

# Alternative way of ensuring Undetermined barcode data sets do not get aligned.
# Only affects lanes 5, 6, 7, and 8.
5678:BARCODE Undetermined ANALYSIS none

Specific Scenarios

Below, a number of scenarios are written out, assuming SampleSheet.csv has two projects: idxProj and noIdxProj.

Analyze only data for idxProj, not noIdxProj

Disable analysis by default:
ANALYSIS none
Then the following analysis specifications only affect sample sheet entries that hav
5. CASAVA is the part of Illumina's sequencing analysis software that performs alignment of a sequencing run to a reference genome and subsequent variant analysis and read counting. The basic pieces of functionality of Illumina's sequencing analysis cascade are described below.

Analysis of Sequencing Data

After the sequencing platform generates the sequencing images, the data are analyzed in five steps: image analysis, base calling, bcl conversion, sequence alignment, and variant analysis and counting. CASAVA performs the bcl conversion, sequence alignment, and variant analysis and counting steps, and demultiplexes multiplexed samples during the bcl conversion step.

  1. Image analysis: Uses the raw images to locate clusters, and outputs the cluster intensity, X,Y positions, and an estimate of the noise for each cluster. The output from image analysis provides the input for base calling. Image analysis is performed by the instrument control software.
  2. Base calling: Uses cluster intensities and noise estimates to output the sequence of bases read from each cluster, a confidence level for each base, and whether the read passes filtering. Base calling is performed by the instrument control software's Real Time Analysis (RTA) or the Off-Line Basecaller (OLB).
  3. Bcl conversion: Converts *.bcl files into *.fastq.gz files (compressed FASTQ files) in CASAVA. Multiplexed samples are demultiplexed during this step.
  4. Sequence alignment: Aligns samples to
6. No.  Label: Description
  1   seq_name: Reference sequence label.
  2   pos: Sequence position of the site/snp.
  3   bcalls_used: Basecalls used to make the genotype call for this site.
  4   bcalls_filt: Basecalls mapped to the site but filtered out before genotype calling.
  5   ref: Reference base.
  6   Q(snp): A Q-value expressing the probability of the homozygous reference genotype, subject to the expected rate of haplotype difference as expressed by the (Watterson) theta parameter (see New Variant Calling Parameter Theta on page 150).
  7   max_gt: The most likely genotype (subject to theta, as above).
  8   Q(max_gt): A Q-value expressing the probability that the genotype is not the most likely genotype above (subject to theta).
  9   max_gt|poly_site: The most likely genotype assuming this site is polymorphic with an expected allele frequency of 0.5 (theta is still used to calculate the probability of a third allele, i.e. the chance of observing two non-reference alleles).
  10  Q(max_gt|poly_site): A Q-value expressing the probability that the genotype is not the most likely genotype above, assuming this site is polymorphic.
  11  A_used: A basecalls used.
  12  C_used: C basecalls used.
  13  G_used: G basecalls used.
  14  T_used: T basecalls used.

Indels.txt Files

Indels for each chromosome are summarized within each chromosome directory in a file called indels.txt. This file contains indels which have been called in each reference sequence by the small variant caller and fil
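The 14-column layout listed in this excerpt can be read into named fields with a few lines of Python. This is a minimal sketch, not part of CASAVA: it assumes tab-delimited rows with exactly these 14 columns, the Python identifiers (for example Q_snp for Q(snp)) are illustrative names chosen here, and the sample row is made up for the example.

```python
# Field names mirror the column table; the identifiers are illustrative.
SNP_COLUMNS = [
    "seq_name", "pos", "bcalls_used", "bcalls_filt", "ref", "Q_snp",
    "max_gt", "Q_max_gt", "max_gt_poly_site", "Q_max_gt_poly_site",
    "A_used", "C_used", "G_used", "T_used",
]
COUNT_COLUMNS = ("pos", "bcalls_used", "bcalls_filt",
                 "A_used", "C_used", "G_used", "T_used")
QUALITY_COLUMNS = ("Q_snp", "Q_max_gt", "Q_max_gt_poly_site")


def parse_site_line(line):
    """Split one tab-delimited row into a dict keyed by column label."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(SNP_COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(SNP_COLUMNS), len(fields))
        )
    record = dict(zip(SNP_COLUMNS, fields))
    for name in COUNT_COLUMNS:
        record[name] = int(record[name])
    for name in QUALITY_COLUMNS:
        # Parsed as float so that non-integer Q-values would not fail.
        record[name] = float(record[name])
    return record


def passes_q_snp(record, min_q_snp=20):
    """True if the site meets a minimum Q(snp), e.g. for summary reporting."""
    return record["Q_snp"] >= min_q_snp


# Hypothetical row, invented for illustration only.
example = "chr1\t1000\t30\t2\tC\t45\tCT\t45\tCT\t60\t0\t16\t0\t14"
site = parse_site_line(example)
```

Keeping the column list in one place means a change in the file layout only needs to be reflected in SNP_COLUMNS.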
7. configureAlignment Parameters: Detailed Description

configureAlignment can be run in various analysis modes. Customize your analysis by specifying variables, parameters, and options.

ANALYSIS Variables

Set the ANALYSIS variable to define the type of analysis you want to perform for each lane. The various analysis modes include default, eland_extended, eland_pair, eland_rna, and none. You can mix and match analyses between lanes.

Table 5 ANALYSIS Variables

ANALYSIS eland_extended
    Alignment Program: ELANDv2
    Application: Single reads
    Description: Aligns single-read data against a target using ELANDv2e alignments.
      - Works well with reads > 32 bases.
      - Each alignment is given a confidence value based on its base quality scores.
      - A single file of sorted alignments is produced for each lane.
    For a detailed description, see configureAlignment Input Files on page 48.

ANALYSIS eland_pair
    Alignment Program: ELANDv2
    Application: Paired reads
    Description: Aligns paired-end reads against a target using ELANDv2 alignments. A single-read alignment is done for each half of the pair, and then the best scoring alignments are compared to find the best paired-read alignment.
    For a detailed description, see Using ANALYSIS eland_pair on page 69.

ANALYSIS eland_rna
    Alignment Program: ELANDv2
    Application: Single reads
    Description: Aligns each read against a large referenc
8. The most scalable, with the highest performance. They have a very high bandwidth and support many simultaneous clients, but are complex to manage and significantly more expensive.

Server Configurations

You can use either a single multi-processor, multi-core computer running Linux, or a cluster of Linux servers with a head node. CASAVA can take advantage of clustered and multi-processing servers.

  - Single multi-processor, multi-core server: Simple but not scalable. It can only analyze data from one sequencing platform, or two depending on power and your turn-around requirements.
  - Linux Cluster: Highly scalable and capable of running multiple jobs simultaneously. It requires one server as a management node and a minimum number of computational nodes to be as efficient as a standalone server. By adding computational nodes, the cluster can service more instruments.

NOTE: We test our software with SGE; other cluster configurations like LSF or PBS are not recommended.

Analysis Computer

Illumina supports running CASAVA only on Linux operating systems. It may be possible to run CASAVA on other 64-bit Unix variants if all of the prerequisites described in this section are met. Illumina recommends the IlluminaCompute data processing solution for CASAVA. IlluminaCompute is available as a multi-tier option, with the volume of instrument data output per week determining the recommended Tier level. For more information con
9. This tells configureAlignment it needs to perform eland_rna, and communicates the locations of the genome, splice junction, and contaminant files. The following table describes the parameters for ANALYSIS eland_rna.

Table 10 Parameters for ANALYSIS eland_rna

ELAND_GENOME
    Must point to the reference genome, just as for a standard ELANDv2e analysis.
ELAND_RNA_GENOME_ANNOTATION
    Must point to the refFlat.txt.gz file (gzip compressed) or seq_gene.md.gz file (gzip compressed).
ELAND_RNA_GENOME_CONTAM
    Must point to the files of ultra-abundant sequences, generally ribosomal and mitochondrial. Any read that hits to these is ignored.

Considerations When Running eland_rna

When running eland_rna, bear in mind the following points:

The above parameters may be specified on a lane-by-lane basis in the usual fashion. For example, to do lanes one, two, and four, enter the following:

    124:ANALYSIS eland_rna
    124:ELAND_GENOME /data/Genome/ELAND/hg18
    124:ELAND_RNA_GENOME_ANNOTATION /data/Genome/ELAND_RNA/Human/refFlat.txt.gz
    124:ELAND_RNA_GENOME_CONTAM /data/Genome/ELAND_RNA/Human/MT_Ribo_Filter

The output file export.txt.gz has the same format as those generated by eland_extended; for a description, see Export.txt.gz on page 79. The existing code RM (repeat masked) denotes all reads that hit to abundant sequences or with any other unresolvable ambiguity.
11. Targeted Resequencing

Since targeted resequencing only sequences part of a genome, we recommend using the option --variantsNoCovCutoff to turn off high coverage filtration of SNPs and indels.

Examples

The CASAVA installation provides examples of common use cases, such as:
  - E. coli Single End
  - E. coli Paired End
  - RNA sequencing

The details of these examples are available on the configureBuild.pl help page. Go to the CASAVA installation directory and type configureBuild.pl. The examples are listed at the bottom of the help page.

Variant Detection and Counting Output Files

Once the post-alignment build is complete, all relevant information is listed in the build directory, such as:
  - Build summary html pages: The build summary html pages are located in the buildDir/html folder and provide access to run information and graphs of important statistics.
  - Variant calls and counts: The CASAVA build contains sequence, SNP, indels, and (for RNA Sequencing) counts information, and is located in buildDir/Parsed_DATE.
  - Computer-readable statistics: Computer-readable statistics are located in buildDir/stats.
  - Configuration files: CASAVA configuration files are located in buildDir/conf.

These files are described below.

Build Directory

An outline of the CASAVA build directory is shown below.
12. In the case of overlapping indels, max_gtype refers to the most likely copy number of the indel. Note that indel calls where ref is the most likely genotype will be reported. These correspond to indels with very low Q(indel) values.

The remaining field descriptions in this excerpt, in table order:
  - Phred-scaled quality score of the most probable indel genotype, which refers to the probability that the genotype of the indel is not that given as max_gtype. The Q-values given only reflect those error conditions which can be represented in the indel calling model, which is not comprehensive. See also Quality Scores on page 148.
  - Except for right-side breakpoints, this field reports the depth of the position preceding the left-most indel breakpoint. For right-side breakpoints, this is the depth of the position following the breakpoint.
  - Number of reads strongly supporting either the reference path or an alternate indel path.
  - Number of reads strongly supporting the indel path.
  - Number of reads intersecting the indel, but not strongly supporting either the reference or any one indel path.
  - The smallest repeating sequence unit within the inserted or deleted sequence. For breakpoints, this field is set to the value N/A.
  - Number of times the repeat_unit sequence is contiguously repeated starting from the indel start position in the reference case.
  - Number of times the repeat_unit sequence is contiguously repeated starting from the indel start position in the indel case.
13. Quality score Q(A)    Error probability P(~A)
    10                    0.1
    20                    0.01
    30                    0.001

Quality Scores Encoding

Quality scores are encoded into a compact form in FASTQ files, which uses only one byte per quality value. In this encoding, the quality score is represented as the character with an ASCII code equal to its value + 33 (as of CASAVA 1.8). The following table demonstrates the relationship between the encoding character, the character's ASCII code, and the quality score represented.

WARNING: Quality score encoding schemes in previous versions of CASAVA used an Illumina-specific offset value of 64.

Table 1 ASCII Characters Encoding Q-scores 0-40

    Symbol  ASCII  Q-       Symbol  ASCII  Q-       Symbol  ASCII  Q-
            Code   Score            Code   Score            Code   Score
    !       33     0        /       47     14       =       61     28
    "       34     1        0       48     15       >       62     29
    #       35     2        1       49     16       ?       63     30
    $       36     3        2       50     17       @       64     31
    %       37     4        3       51     18       A       65     32
    &       38     5        4       52     19       B       66     33
    '       39     6        5       53     20       C       67     34
    (       40     7        6       54     21       D       68     35
    )       41     8        7       55     22       E       69     36
    *       42     9        8       56     23       F       70     37
    +       43     10       9       57     24       G       71     38
    ,       44     11       :       58     25       H       72     39
    -       45     12       ;       59     26       I       73     40
    .       46     13       <       60     27

Read Segment Quality Control Metric

A number of factors can cause the quality of base calls to be low at the end of a read. For example, phasing artifacts can degrade signal quality in som
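The offset-33 quality encoding covered in this excerpt can be decoded in a few lines of Python. This is an illustrative sketch rather than CASAVA code: the function names are invented here, and the error probability uses the standard Phred relationship P = 10^(-Q/10), which matches the Q10/Q20/Q30 values quoted in this excerpt.

```python
CASAVA_18_OFFSET = 33   # offset used as of CASAVA 1.8
LEGACY_OFFSET = 64      # Illumina-specific offset in earlier versions


def decode_quality(quality_string, offset=CASAVA_18_OFFSET):
    """Convert a FASTQ quality string into a list of integer Q-scores."""
    return [ord(symbol) - offset for symbol in quality_string]


def error_probability(q_score):
    """Phred relationship: probability that the base call is wrong."""
    return 10 ** (-q_score / 10)


# '!' is ASCII 33, '5' is ASCII 53, 'I' is ASCII 73.
scores = decode_quality("!5I")
print(scores)  # [0, 20, 40]
```

Passing offset=LEGACY_OFFSET decodes quality strings written by the older encoding, so the same helper covers both schemes.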
14. those reads aligned to an alternate alignment by the variant caller. The BAM filename is sorted.realigned.bam:

    Project_Dir/Parsed_NN-NN-NN/c1/bam/realigned/sorted.realigned.bam

Statistics for coverage, as well as snp and indel calls for all reference sequences, are found in the stats directory:

    Project_Dir/stats/coverage.summary.txt
    Project_Dir/stats/snps.summary.txt
    Project_Dir/stats/indels.summary.txt

A summary of the same information is also available on the following html pages:

    Project_Dir/html/coverage.html
    Project_Dir/html/snps.html
    Project_Dir/html/indels.html

To summarize the snps and indels in the stats and html directories above, quality thresholds are used to select a subset of snps and indels for summary reporting. The default thresholds are Q(snp) >= 20 and Q(indel) >= 20. These values may be changed using the options --variantsSummaryMinQsnp and --variantsSummaryMinQindel.

snps.txt and sites.txt Files

The snps.txt files contain the SNP calls sorted by position, while the sites.txt files provide depth and single-position genotype call scores for every mapped site. There is one snps.txt file for each chromosome, stored in the chromosome-specific directory under the Parsed_dd-mm-yy directory. The snps.txt and sites.txt files are tab-delimited text files that contain the same columns, which are the following
15. Technical Assistance

For technical assistance, contact Illumina Customer Support.

Table 29 Illumina General Contact Information

    Illumina Website    http://www.illumina.com
    Email               techsupport@illumina.com

Table 30 Illumina Customer Support Telephone Numbers

    Region            Contact Number
    North America     1 800 809 4566
    Austria           0800 296575
    Belgium           0800 81102
    Denmark           80882346
    Finland           0800 918363
    France            0800 911850
    Germany           0800 180 8994
    Ireland           1 800 812949
    Italy             800 874909
    Netherlands       0800 0223859
    Norway            800 16836
    Spain             900 812168
    Sweden            020790181
    Switzerland       0800 563118
    United Kingdom    0800 917 0041
    Other countries   44 1799 534000

MSDSs

Material safety data sheets (MSDSs) are available on the Illumina website at http://www.illumina.com/msds.

Product Documentation

You can obtain PDFs of additional product documentation from the Illumina website. Go to http://www.illumina.com/support and select a product. To download documentation, you will be asked to log in to MyIllumina. After you log in, you can view or save the PDF. To register for a MyIllumina account, please visit https://my.illumina.com/Account/Register.
16. which is the minimum. Example: --verbose 1

--version (SE, PE)
    Prints version information. Example: --version
-w, --workflow (SE, PE)
    Instead of running CASAVA, generates the workflow definition file tasks.DATA.txt. Example: -w
-wa, --workflowAuto (SE, PE)
    Generates the workflow definition file and runs it. See jobsLimit. Example: --workflowAuto
--workflowFile <FILE> (SE, PE)
    Overrides workflow file name. Default is tasks.<date>.txt. Example: --workflowFile FILENAME.txt

Table 18 Global Analysis Options for Variant Detection and Counting

Options for Target sort

--QVCutoff <NUMBER> (PE)
    Sets the paired-end alignment score threshold to NUMBER (default 90). Example: --QVCutoff 60
--QVCutoffSingle <NUMBER> (SE, PE)
    Sets the single-read alignment score threshold to NUMBER (default 10). Example: --QVCutoffSingle 60
--read NUMBER (PE)
    Limits input to the specified read only. Forces single-ended analysis on one read of a double-ended data set. Example: --read 1
--singleScoreForPE VALUE (PE)
    Sets the variant caller to filter reads with single score below QVCutoffSingle in PE mode (YES|NO). Default NO. Example: --singleScoreForPE YES
--sortKeepAllReads (SE, PE)
    Generates an archive BAM file. Keeps all purity-filtered, duplicate, and unmapped reads in the build.
--toNMScore <NUMBER> (SE, PE)
--ignoreUnanchored (PE)
17. Installing CASAVA

Starting with CASAVA 1.8, CASAVA must be built outside of the source directory.

NOTE: For more information on the installation procedure, see the file src/INSTALL in the CASAVA 1.8.2 source tree.

The installation procedure is as follows:

  1. The Boost library (1.44.0) is bundled in the CASAVA distribution and will be automatically built when necessary. If you want to use a preinstalled Boost library, declare the BOOST_ROOT bash variable by typing the following at the command prompt prior to running the CASAVA configure script:
         export BOOST_ROOT=/path/to/compiled/boost/directory/boost_1_44_0
  2. Download CASAVA v1.8 and copy it to a temporary directory such as /tmp (you will not need to keep it once the installation is done).
  3. Untar CASAVA v1.8 using the following commands:
         cd /tmp
         tar xvjf CASAVA_v1.8.2.tar.bz2
  4. Prepare to build CASAVA:
         mkdir CASAVA_v1.8.2-build
         cd CASAVA_v1.8.2-build
  5. Prepare the CASAVA installation directory:
         mkdir /illumina/software/CASAVA_v1.8.2
  6. Configure CASAVA so it will first be built and then installed where you want (in this example, in /illumina/software/CASAVA_v1.8.2):
         ../CASAVA_v1.8.2/src/configure --prefix=/illumina/software/CASAVA_v1.8.2
  7. Build CASAVA:
         make
  8. Finally, install it:
         make install

NOTE: For more information on the configuration options CASAVA_v1.8.2/src/c
18. Yield: The sum of all bases in clusters that passed filtering for the entire project.
% PF: The percentage of clusters that passed filtering.
% of Lane: Percentage of reads in the sample compared to total number of reads in that lane.
% Perfect Index Reads: Percentage of index reads in this sample which perfectly matched the given index.
% One Mismatch Reads (Index): Percentage of index reads in this sample which had 1 mismatch to given index.
% of >= Q30 Bases: Yield of bases with Q30 or higher from clusters passing filter, divided by total yield of clusters passing filter.
Mean Quality Score: The total sum of quality scores of clusters passing filter, divided by total yield of clusters passing filter.
Recipe: Recipe used during sequencing.
Operator: Name or ID of the operator.
Directory: Full path to the directory.

Below the sample information are links to the IVC plots.

Finding Demultiplexed Samples

The key to finding the location of demultiplexed data is looking at the Demultiplex_Stats.htm file in the BaseCalls_Stats directory. The Directory column will indicate the project/sample output directory. The FASTQ files within the directory contain the index and lane as part of the name. Alternatively, it can be inferred from the project name and the sample id, as described in FASTQ Files on page 39.
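The two quality metrics defined in this excerpt (percentage of bases at Q30 or higher, and mean quality score) reduce to simple ratios over the bases from clusters passing filter. The sketch below only illustrates those definitions; the function name and the flat list of per-base Q-scores are assumptions made for the example, not how CASAVA stores or computes its statistics.

```python
def quality_summary(pf_quality_scores):
    """Return (percent of bases >= Q30, mean quality score).

    pf_quality_scores: iterable of per-base Q-scores taken only from
    clusters that passed filtering.
    """
    total_bases = 0
    q30_bases = 0
    quality_sum = 0
    for q in pf_quality_scores:
        total_bases += 1
        quality_sum += q
        if q >= 30:
            q30_bases += 1
    if total_bases == 0:
        # No passing-filter yield: report zeros instead of dividing by zero.
        return 0.0, 0.0
    return 100.0 * q30_bases / total_bases, quality_sum / total_bases


percent_q30, mean_q = quality_summary([30, 40, 20, 10])
print(percent_q30, mean_q)  # 50.0 25.0
```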
20. Qseq Conversion

Introduction

As of CASAVA 1.8, configureAlignment uses FASTQ files as input. If you have _qseq.txt files that you want to analyze using CASAVA 1.8, use the Qseq Converter, which converts _qseq.txt files into FASTQ files. The script has the following features:
  - Creates a makefile to convert a directory of _qseq.txt files to a directory tree of compressed FASTQ files following CASAVA 1.8 filename and directory structure conventions.
  - If detected, configuration data used by configureAlignment are also transferred to the output directory.
  - This script will not configure demultiplexing. The input directory must contain _qseq.txt files which are either non-demultiplexed or already demultiplexed by another utility.

This appendix provides instructions to run the Qseq Converter.

Qseq Converter Input Files

The Qseq Converter needs the following input files:

A BaseCalls directory with _qseq.txt files. The Qseq Converter is specifically designed to convert _qseq.txt files produced by OLB. It expects the _qseq.txt files to follow the OLB naming conventions:

    s_<lane>_<read>_<tile>_qseq.txt

With:
  - <lane>: the lane number on the flow cell (1-8)
  - <read>: the read number (1 or 2)
  - <tile>: the tile number, left-padded with 0 to 4 digits

For example: s_1_1_0001_qseq.txt

These files have the following format:

Field Description Ma
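The OLB naming convention for _qseq.txt files given in this excerpt can be checked with a regular expression. The following is a hedged sketch: the pattern follows the lane (1-8), read (1 or 2), and 4-digit tile rules as stated here, the helper name parse_qseq_name is hypothetical, and files with other read numbers would need a relaxed pattern.

```python
import re

# s_<lane>_<read>_<tile>_qseq.txt, as described in the naming convention.
QSEQ_NAME = re.compile(
    r"^s_(?P<lane>[1-8])_(?P<read>[12])_(?P<tile>\d{4})_qseq\.txt$"
)


def parse_qseq_name(filename):
    """Return lane, read, and tile as integers, or None if the name does not match."""
    match = QSEQ_NAME.match(filename)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


print(parse_qseq_name("s_1_1_0001_qseq.txt"))  # {'lane': 1, 'read': 1, 'tile': 1}
```

Returning None for non-matching names makes it easy to skip unrelated files when scanning a BaseCalls directory.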
21. FASTA files should not be squashed for CASAVA.
PATH to a single samtools-style reference file.

Table 17 Behavioral Options for Variant Detection and Counting

-a, --applicationType TYPE (SE, PE)
    Type of analysis: DNA or RNA (default is DNA). Example: -a RNA
-f, --force (SE, PE)
    Ignore errors from previous CASAVA execution. Example: -f
--help TARGET (SE, PE)
    Prints on-screen usage guide. If TARGET is specified, prints usage guide for the corresponding plugin target. Example: --help bam
--jobsLimit (SE, PE)
    Limits the number of parallel jobs. Defaults: unlimited for --sgeAuto, 1 for --workflowAuto. Do not set it to the maximum number of processors, as this might cause the terminal to become unresponsive.
--postRunCmd <CMDLINE> (SE, PE)
    Post-run commands can be launched after CASAVA completes by including the --postRunCmd option followed by the commands to be launched.
-sa, --sgeAuto (SE, PE)
    Generates the workflow definition file and runs it on SGE (use with --sgeQueue).
--sgeQsubFlags (SE, PE)
    Extra parameters to be passed to SGE qsub by the taskServer.pl.
--sgeQueue (SE, PE)
    SGE queue name (used with --sgeAuto or --workflow), e.g. all.q.
--targets LIST (SE, PE)
    Space-separated list of targets to run (see Targets on page 96). Default is all. Example: --targets sort bam
--tempDir (SE, PE)
    Overrides default path for local temporary files.
--verbose <NUMBER> (SE, PE)
    Sets the verbose level (default is 0,
22. Gigabit recommended, or other data transfer mechanism.

A suitably large holding area for the analysis output (1 TB per run). As there will almost certainly be some overlap between copying, analysis, and possible reanalysis, 2-3 TB is an absolute minimum. You need to consider which parts of the data you want to store long term, and what storage infrastructure you want to provide. CASAVA provides the option to perform loss-less data compression.

Storage Configurations

You can configure your analysis server with either local storage or external network storage.

Local server storage can be internal to the server or Direct Attached Storage (DAS), which is a separate chassis attached to the server.
  - Internal: Simple but not scalable. Results data must be moved off to network storage at some point to make room for subsequent runs.
  - DAS: External chassis that is scalable, since more than one DAS can be connected to the server. The server is an application server running CASAVA and a file server providing access to results and receiving incoming raw data files.

External network storage is either Network Attached Storage (NAS) or Storage Area Network (SAN). NAS and SAN are functionally equivalent, but SAN is larger, with higher performance, more connections, and more management options.
  - NAS: External chassis connected via Ethernet to the server, instrument PC, and other clients on the network. NAS devices are scalable and highly optimized.
  - SAN
23. Page

The Barcode_Lane_Summary.htm file provides similar metrics as the Sample Summary page, with the following differences:
  - The results are displayed for each barcoded sample in a lane, instead of for samples.
  - Tables are named accordingly; the equivalents for the Sample Results Summary and Expanded Sample Summary are named Barcode Lane Results Summary and Expanded Barcode Lane Summary.

The Barcode Lane Summary page contains a Barcode Lane Summary, described below. For a description, see the equivalent section in the Sample Summary Page description (Barcode Lane Summary on page 75).

Flow Cell Summary

For each run, a FlowCellSummary_FCID.htm file is produced, which contains the Project Summaries and Sample Results Summaries of all projects. This provides an overview of the most relevant metrics for the entire run. It is located in the Aligned folder. For a description of Project Summaries and Sample Results Summaries, see Sample Summary Page on page 74.

Analysis Results

The output files for each lane of a flow cell are named using the format _export.txt.gz. For paired-read analysis, there are two parallel output files, one for each read. The files are named using the format:

    <sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>_export.txt.gz

The files are found in the Aligned/Project_ID/Sample_ID folder of a finished analysis run.

Export.txt.gz

The sta
24. VALIDATION off

WARNING: You may run into problems with downstream analysis if you disable chromosome name validation.

NOTE: If ELAND finds two alignments with identical alignment scores, ELAND will pick the first alignment (in the single-end case) or combination of alignments (in the paired-end case) that exhibit the highest observed alignment quality. These are the alignments that make it into the export files, which only contain the best alignment for each read. In practice, post-alignment CASAVA ignores these reads because of the low alignment qualities. Using a reference with lexicographic chromosome names (like chr1) will yield slightly different results compared to a reference with numerical chromosome names (like 1) for these reads, since the hits are sorted in a different way.

Reference Sequence Blocks

For reasons of efficiency, ELAND treats the reference sequence as being in blocks of 16 MB, of which there can be at most 240. This limits the total length of DNA that ELAND can match against in a single run. In a single ELAND run, you can match against:
  - One file of at most 240 x 16 MB
  - 239 files, each up to 16 MB in size
  - Something in between, such as 24 files of up to 160 MB each (the NCBI human genome will fit)

Abundant Sequences Files (eland_rna)

eland_rna uses these files to mask hits to abundant or contaminant sequenc
25. conversion for read 1 is complete. For instructions, see Starting Alignment for Read 1 on page 64.
configureAlignment Output: configureAlignment output is a flat text file called _export.txt.gz, containing each read and information about its alignment to the reference. In addition, configureAlignment produces statistics and diagnostic plots that can be used to assess data quality. These are presented in the form of html pages, found in the Aligned output folder. As a result of running the configureAlignment.pl script, a new directory is created in the run folder. This directory is named using the format Aligned. If you want to rerun the analysis with different parameters, rerun configureAlignment with the new parameters and specify a new alignment directory (OUT_DIR). CASAVA 1.8 also contains a script that converts export.txt files to SAM files; see Introduction on page 168 and SAM Format on page 169.
Alignment Algorithms: CASAVA provides the alignment algorithm Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). ELAND is very fast and should be used to match a large number of reads against the reference genome. ELAND has been improved a number of times. CASAVA 1.6 introduced a new version of ELAND, ELANDv2. The most important improvements of ELANDv2 are its ability to perform multiseed and gapped alignments. As of CASAVA 1.8, a new version of ELANDv2 is available, ELANDv2e. The most importan
26. diverse applications, the CASAVA variant caller does not filter out low-confidence calls, and thus prints all sites where Q(snp) is greater than zero to the snps.txt file. Summary statistics for SNPs are generated for a subset of higher-confidence SNPs: by default, any SNP with Q(snp) of 20 or greater is summarized in CASAVA's reports. Note that for calls with a very low Q(snp) score, it is possible that the most likely genotype will be that of the homozygous reference (e.g., max_gt will be CC for a position with a reference value of C). This can be interpreted to mean that there is a non-trivial probability of a heterozygous SNP existing at this site, but that the homozygous reference genotype is still more likely than that of any non-reference variant.
Indel Caller Reporting: Indels for each chromosome are summarized within each chromosome directory in a file called indels.txt. This file contains indels which have been called in each chromosomal bin segment using the small variant caller from CASAVA's callSmallVariants module. These indel calls have been filtered to remove those calls which are found at a depth greater than a certain multiple of the mean chromosomal depth. By default this multiple is set to 3. The purpose of this filtration is to remove indel calls in regions close to centromeres and other high copy number regions. Three categories of indels are reported: Insertions, Deletions, Breakpoints. Breakpoint calls correspon
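The depth filter described above for indel calls can be sketched as follows. This is a simplified illustration rather than CASAVA's implementation: the record layout (a dict with a "depth" key) and the function name are hypothetical, and the mean chromosomal depth is taken as a given input.

```python
def filter_high_depth_calls(calls, mean_chrom_depth, max_depth_multiple=3.0):
    """Keep calls whose depth does not exceed max_depth_multiple x mean chromosomal depth.

    calls: list of dicts with a numeric "depth" entry (hypothetical layout).
    mean_chrom_depth: mean depth for the chromosome, computed elsewhere.
    """
    depth_limit = max_depth_multiple * mean_chrom_depth
    # Calls at a depth greater than the limit are removed; calls at the limit are kept.
    return [call for call in calls if call["depth"] <= depth_limit]
```

With the default multiple of 3 and a mean depth of 30, a call at depth 90 is kept and a call at depth 91 is dropped.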
27. five stages:
  1 Compute clusterings of non-aligned orphan reads.
  2 Compute clusterings of anomalous read pairs with an insert size that is anomalously large (possible deletion) or small (possible insertion).
  3 Combine clusters that appear to correspond to the same event.
  4 Assemble them into contigs.
  5 Align the contigs back to the genome, using the positions of associated singleton reads to narrow the search to a couple of thousand bp or so.
Figure 25 assembleIndels Algorithm (panels: 1 Cluster Orphan Reads; 2 Cluster Anomalous Reads, insert too long; 3 Merge Clusters from Same Event; 4 Assemble Cluster into Contig; 5 Align Contig to Reference, Potential New Deletion)
assembleIndels Components: The assembleIndels module contains the following components.
IndelFinder: The IndelFinder component takes a sorted bam file from a CASAVA build and extracts: Any reads containing gapped alignments An
28. for assembleIndels
Options (Application: PE for all seven): --indelsSpReadThresholdIndels NUMBER, --indelsPrasThreshold NUMBER, --indelsAlignScoreThresh NUMBER, --indelsSdFlankWeight NUMBER, --indelsMinGroupSize NUMBER, --indelsSpReadThresholdClusters NUMBER, --indelsMinCoverage NUMBER
Descriptions, in table order:
  --indelsSpReadThresholdIndels: Spanning read score threshold. The higher the single-read alignment score before realignment, the more unlikely it is to see this pattern of mismatches given the read's quality values. Default threshold value is 25. Drop this value to add more reads into the indel finding process, at the possible expense of introducing noise. For an alignment with no mismatches this option should be set at zero.
  --indelsPrasThreshold: Paired-read alignment score threshold. If a read has a paired-read alignment score of at least this, then it is used to update the base quality stats for that sample prep. Default is calculated based off the data.
  --indelsAlignScoreThresh: If an alignment score for a read exceeds this threshold after realignment, then the output file is updated to incorporate this new alignment. Otherwise the read's entry remains as per the input file. Default value is 120. A low value will cause some reads to be wrongly placed, albeit within a small interval.
  --indelsSdFlankWeight: Number of standard deviations to use when defining the genomic interval to align the read to (default 1).
  --indelsMinGroupSize: Only output clusters if they contain at least this many r
29. in the table below.
Option: Description
  --read1=FILENAME: Read 1 export file (mandatory). File may be gzipped (with .gz extension).
  --read2=FILENAME: Read 2 export file. File may be gzipped (with .gz extension).
  --nofilter: Include reads that failed the basecaller purity filter.
  --qlogodds: Assume export file(s) use logodds quality values, as reported by Pipeline prior to v1.3.
  --version: Prints version information.
  --help: Prints on-screen usage guide.
Example: An example of illumina_export2sam.pl use is as follows:
  /path-to-CASAVA/bin/illumina_export2sam.pl --read1=NA10831_ATCACG_L001_R1_001_export.txt.gz --read2=NA10831_ATCACG_L001_R2_001_export.txt.gz > s_2_converted.sam
This will write an output file, s_2_converted.sam, that contains the paired-end reads from s_2_1_export.txt and s_2_2_export.txt.
Glossary
B
Bayesian model: A Bayesian model provides a means to update a prior hypothesis based on evidence. As an example, in a Bayesian genotype model we may have a prior hypothesis that our sample genotype matches that from a reference sample with probability q. After accounting for evidence of the sample genotype in the form of sequencing reads which are inconsistent with the reference genotype, our hypothesis is updated such that the probability of the sample genotype matching the reference has been reduced to a value less than q.
D
De Bruijn graph: A De Bru
30. indel genotype calling. Whenever an indel larger than this size is nominated by a de novo assembly contig, it is handled as two independent breakpoints. Note that increasing this value should lead to an approximately linear increase in variant caller memory consumption. The default value is 300 for paired-end builds and 50 for single-end builds. Example: --variantsMaxIndelSize 200
readBases Counting Method: This method is for exon and gene counts. Before counting, CASAVA converts each alignment to a splice junction into two shorter genomic alignments. Then CASAVA counts the number of bases, not the number of reads, that belong to exons and genes. Bases within both the original genomic reads and the shorter genomic reads derived from spliced alignments participate in the exon and gene counts.
NOTE: Junction counts (in reads, not bases) are provided for convenience. Because alignments to the junctions are converted to genomic reads before the counting, bases within reads aligned to splice junctions are counted only once for exon and gene counts. For splice junctions, counts are provided as the number of reads that cover the junction point.
The number of bases that fall into the exonic regions of each gene is summed to obtain gene-level counts. The normalized values are calculated as RPKM (Reads Per KiloBase per Million of mapped reads). Since the base counts rather than read counts are used th
31. of run folders places the names in chronological order.
  2 The second field specifies the name of the sequencing machine. It may consist of any combination of upper or lower case letters, digits, or hyphens, but may not contain any other characters (especially not an underscore). It is assumed that the sequencing platform is synonymous with the PC controlling it, and that the names assigned to the instruments are unique across the sequencing facility.
  3 The third field is a four-digit counter specifying the experiment ID on that instrument. Each instrument should be capable of supplying a series of consecutively numbered experiment IDs (incremental unique index) from the onboard sample tracking database or a LIMS.
NOTE: It is desirable to keep Experiment IDs (or Sample ID) and instrument names unique within any given enterprise. You should establish a convention under which each machine is able to allocate run folder names independently of other machines to avoid naming conflicts.
A run folder named 070108_instrument1_0147 indicates experiment number 147, run on instrument 1, on the 8th of Jan 2007. While the date and instrument name specify a unique run folder for any number of instruments, the addition of an experiment ID ensures both uniqueness and the ability to relate the contents of the run folder back to a laboratory notebook or LIMS. Additional information is captured in the run folder name in fields separated by an underscore from t
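The run folder naming convention above lends itself to a small parser. The sketch below is illustrative only: the function name is hypothetical, and the YYMMDD date layout is an assumption inferred from the example 070108_instrument1_0147 (8th of Jan 2007).

```python
import re
from datetime import datetime

# date (assumed YYMMDD) _ instrument (letters, digits, hyphens) _ 4-digit experiment ID,
# optionally followed by further underscore-separated fields.
RUN_FOLDER = re.compile(
    r"^(?P<date>\d{6})_(?P<instrument>[A-Za-z0-9-]+)_(?P<experiment>\d{4})"
    r"(?:_(?P<extra>.+))?$"
)


def parse_run_folder(name):
    """Split a run folder name into date, instrument name, and experiment ID."""
    match = RUN_FOLDER.match(name)
    if match is None:
        raise ValueError(f"not a valid run folder name: {name!r}")
    return {
        "date": datetime.strptime(match.group("date"), "%y%m%d").date(),
        "instrument": match.group("instrument"),
        "experiment_id": int(match.group("experiment")),
        "extra": match.group("extra"),  # None when no additional fields are present
    }
```

Because the instrument field may not contain an underscore, a name such as 070108_bad_name_0147 is rejected rather than silently misparsed.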
32. produced, consisting of all sites with Q(snp) > 0.
  8 A final filtration step is taken to remove potentially spurious SNP calls near the centromeres and within high copy number regions. This is done by calculating the mean used depth for each chromosome and filtering out all SNP calls which occur at a used depth greater than 3 times this chromosomal mean.
Variant Detection Q-Scores
Quality Scores: A quality score (or Q-score) expresses an error probability. In particular, it serves as a convenient and compact way to communicate very small error probabilities. Given an assertion A, the probability that A is not true, P(~A), is expressed by a quality score Q(A) according to the relationship
  Q(A) = -10 log10(P(~A))
where P(~A) is the estimated probability of an assertion A being wrong. The relationship between the quality score and error probability is demonstrated with the following table:
  Quality score Q(A)    Error probability P(~A)
  10                    0.1
  20                    0.01
  30                    0.001
Variant Genotypes: In the context of resequencing a diploid individual, a genotype for a single site or indel indicates the two alleles that are present. The set of diploid site genotypes considered by the CASAVA v1.8 model for SNPs is: AA, CC, GG, TT, AC, AG, AT, CG, CT, GT. For example, given a site in the genome with a reference base of C, the homozygous reference genotype is CC. A prediction of a SNP at that site is an assertion that th
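The quality score relationship given above, Q(A) = -10 log10(P(~A)), can be checked numerically. The following is a minimal sketch with hypothetical function names; it reproduces the rows of the table (Q 10, 20, 30 for error probabilities 0.1, 0.01, 0.001).

```python
import math


def error_probability_to_q(p_error):
    """Convert an error probability P(~A) into a quality score Q(A)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def q_to_error_probability(q_score):
    """Convert a quality score Q(A) back into an error probability P(~A)."""
    return 10.0 ** (-q_score / 10.0)
```

Each step of 10 in the quality score corresponds to a tenfold drop in the error probability, which is why Q-scores are a compact way to express very small probabilities.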
33. sort and bam modules instead. This section describes the usage of the illumina_export2sam.pl script. The script is located in CASAVA's bin directory and is an update to the SAMtools script export2sam.pl, redistributed in CASAVA under the MIT license (see http://sourceforge.net/projects/samtools/develop).
NOTE: Use CASAVA's illumina_export2sam.pl script instead of the SAMtools script. The illumina_export2sam.pl script has a number of updates that are important for proper conversion of ELANDv2e alignments. See the script header for a full list of these updates.
SAM Format: The Sequence Alignment/Map (SAM) format is a generic format for storing large nucleotide sequence alignments. SAM files have a .sam extension and consist of one header section and one alignment section. The whole header section can be absent, but keeping the header is recommended. This section provides the information relevant for the SAM files generated by CASAVA; a detailed description of the generic SAM format is available from samtools.sourceforge.net. To generate a SAM file, see Introduction on page 168.
Header Section: The Illumina SAM files start with @PG, which indicates that the first line is a header line (@) of the program type (PG). The line is TAB-delimited, and each data field has an explicit field tag, which is represented using two ASCII characters, as described below:
Tag: Description
  ID: Program name
  VN: Program ver
34. the analysis is done, review the analysis for each sample. See Demultiplex_Stats File on page 42.
Example Bcl Conversion and Demultiplexing: An example of a demultiplexing run is as follows:
  1 Enter: /path-to-CASAVA/bin/configureBclToFastq.pl --input-dir <BaseCalls_dir> --output-dir <Unaligned> --sample-sheet <input_dir>/SampleSheet.csv
  2 Go to the <Unaligned> folder.
  3 Run: nohup make -j 3
Step one will produce a set of directories in the Unaligned directory. Reads with an unresolved or erroneous index are placed in the Undetermined_indices directory.
Options for Bcl Conversion and Demultiplexing: The options for demultiplexing are described below.
Options: --fastq-cluster-count, --input-dir, --output-dir, --positions-dir, --positions-format, --filter-dir, --intensities-dir, --sample-sheet, --tiles, --use-bases-mask
Descriptions, in table order:
  --fastq-cluster-count: Maximum number of clusters per output FASTQ file. Do not go over 16000000, since this is the maximum number of reads we recommend for one ELAND process. Specify 0 to ensure creation of a single FASTQ file. Defaults to 4000000.
  --input-dir: Path to a BaseCalls directory. Defaults to current dir.
  --output-dir: Path to demultiplexed output. Defaults to <run_folder>/Unaligned. Note that there can be only one Unaligned directory by default. If you want multiple Unaligned directories, you will h
35. to identify any issues which may be specific to a certain lane or group of tiles.
Cluster Density Box Plots: These plots show the raw cluster densities per lane and the clusters passing filter.
NOTE: Many of the run quality metrics are depicted as box plots. In these plots, the red line shows the median, the box delimits the middle 50% of the data (interquartile range), and the error bars indicate the sample minimum and maximum.
The sections below describe a number of examples of good runs and bad runs.
Excellent Quality Metrics: The figure below shows a screen shot from SAV displaying a run with excellent quality metrics. Note the trend of high Q-scores (> Q30) across each cycle (left side) and the cumulative distribution of > Q30 among the reads (right side).
Figure 3 SAV Screenshot Showing Excellent Quality Metrics (panels: Data By Cycle, > Q30, Lane 4, Both Surfaces; QScore Distribution, Lane 4, Both Surfaces, All Cycles)
Low Diversity Samples: The figure below shows a screen shot from SAV displaying the percent base per cycle for a low diversity sample, which might result from sequencing a small number of PCR artifacts.
Figure 4 Low Diversity Samples (panel: Data By Cycle, % Base, Lane 5, Both Surfaces, All Bases)
In contrast, the figure below shows the percent base per cycle g
36. ungapped_alignment mismatches gapped alignment mismatches gapped alignment If the ratio for a given alignment exceeds a certain value (set to 3 1 by default), we insert a gap. If either of the two conditions is not satisfied, we return an ungapped alignment as the result.
ELANDv2e Alignment Improvements: CASAVA 1.8 features ELANDv2e. This updated alignment program includes the following new features: Better repeat resolution. A new orphan aligner. Shorter run times, with a new version of alignmentResolver.
Figure 21 ELANDv2e Workflow (CASAVA v1.8). Panels: Finding seed hits (Stage 0, single seed; Stage 1, overlapping/multiple seeds; increase sensitivity; resolve repeats). Gapped alignment (extract 5 bases, marked in orange, on either side of a hit; perform a banded global alignment to account for indels). Resolving orphans (increases alignment; improves indel finding; if one read anchors the read pair, do a local realignment of the other read in the vicinity of the anchored read; read 2 has multiple mappings, shown in red; do local realignment using read 1, green, as an anchor). Scoring alignments (estimate insert size distribution from uniquely aligning reads and score reads; score read pairs according to mismatches to the re
37. value of 100 means that 100 consecutive bases match the reference.
  • Mismatched bases are indicated by a base (A, C, G, T, N), where the letter indicates the reference base.
  • Insertions and deletions start with a ^ character and are closed with a $ character. A number indicates an insertion in the read of that size; a base or number of bases indicate the sequence of the reference that was deleted in the read.
  For example, the string 30^1$28G means the following:
  • 30: 30 bases matching reference
  • ^1$: one base insertion in read
  • 28: 28 bases matching reference
  • G: reference base G is mismatched in read
Tag XC (Value Field): Provides read status information normally conveyed in the chromosome field of the export.txt file for unmapped reads. Specifically, XC:Z:QC is used to mark an ELAND QC failure read, XC:Z:RM is used to mark an ELAND repeat mask read, and XC:Z:CONTROL is used to mark a control read. No optional field is added to reads which are marked as no match (NM) in the export file; it is understood that this is the default status of an unmapped read.
Usage: For export to SAM conversion, enter the following:
  /path-to-CASAVA/bin/illumina_export2sam.pl --read1=FILENAME [options] > outputfile.sam
Make sure to specify an output file, else the output gets written to the screen. The options are described
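The match descriptor string described above (runs of matching bases, single mismatched reference bases, and delimited insertions or deletions) can be decoded with a short tokenizer. This sketch is not taken from CASAVA: the function name is hypothetical, and it assumes the indel delimiters are the ^ and $ characters, so the guide's example would read 30^1$28G.

```python
import re

# One token is either an indel block (^...$), a run of matching bases, or one mismatch.
# Assumption: indels open with '^' and close with '$'.
TOKEN = re.compile(r"\^(\d+|[ACGTN]+)\$|(\d+)|([ACGTN])")


def parse_match_descriptor(descriptor):
    """Decode a match descriptor into a list of (event, value) tuples."""
    events = []
    position = 0
    while position < len(descriptor):
        match = TOKEN.match(descriptor, position)
        if match is None:
            raise ValueError(f"unexpected character at offset {position} in {descriptor!r}")
        indel, run, base = match.groups()
        if indel is not None:
            if indel.isdigit():
                events.append(("insertion", int(indel)))  # bases inserted in the read
            else:
                events.append(("deletion", indel))        # reference bases missing from the read
        elif run is not None:
            events.append(("match", int(run)))            # consecutive bases matching the reference
        else:
            events.append(("mismatch", base))             # letter is the reference base
        position = match.end()
    return events
```

Under these assumptions, the example descriptor decodes to 30 matches, a one-base insertion, 28 matches, and a mismatch against reference base G.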
38. want all reads in a FASTQ file, use the --with-failed-reads option.
Control Values: The tenth column (<control number>) is zero if the read is not identified as a control. If the read is identified as a control, the number is greater than zero, and the value specifies what kind of control it is. The value is the decimal representation of a bit-wise encoding scheme, with bit 0 having a decimal value of 1, bit 1 a value of 2, bit 2 a value of 4, and so on. The bits are used as follows:
  • Bit 0: always empty (0)
  • Bit 1: was the read identified as a control?
  • Bit 2: was the match ambiguous?
  • Bit 3: did the read match the phiX tag?
  • Bit 4: did the read align to match the phiX tag?
  • Bit 5: did the read match the control index sequence?
  • Bits 6, 7: reserved for future use
  • Bits 8..15: the report key for the matched record in the controls.fasta file (specified by the REPORT_KEY metadata)
Quality Scores: A quality score (or Q-score) expresses an error probability. In particular, it serves as a convenient and compact way to communicate very small error probabilities. Given an assertion A, the probability that A is not true, P(~A), is expressed by a quality score Q(A) according to the relationship
  Q(A) = -10 log10(P(~A))
where P(~A) is the estimated probability of an assertion A being wrong. The relationship between the quality score and error probability is demonstrated with the following table
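The bit-wise control number layout listed above can be unpacked with shifts and masks. The sketch below follows the bit assignments as stated in the guide; the function name and the dictionary keys are hypothetical labels chosen for illustration.

```python
def decode_control_number(value):
    """Unpack the decimal control number into its documented bit fields."""
    return {
        "is_control": bool(value & (1 << 1)),             # bit 1
        "ambiguous_match": bool(value & (1 << 2)),        # bit 2
        "matched_phix_tag": bool(value & (1 << 3)),       # bit 3
        "aligned_to_phix_tag": bool(value & (1 << 4)),    # bit 4
        "matched_control_index": bool(value & (1 << 5)),  # bit 5
        "report_key": (value >> 8) & 0xFF,                # bits 8..15
    }
```

A value of 0 decodes to all flags cleared, matching the statement that zero means the read is not identified as a control.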
39. you are using for alignment, and are available from iGenome for the most common model organisms (Getting Reference Files on page 128).
Running Variant Detection and Counting: The major use cases for running CASAVA variant detection and counting are listed below. Set additional options to define the type of analysis you want to perform for each project. The options are listed in the next section.
Major Use Cases
SNP and Indel Calling: To run CASAVA with callSmallVariants and assembleIndels, enter:
  /path-to-CASAVA/bin/configureBuild.pl [options]
SNP and Indel calling without large indel assembly: To run CASAVA with callSmallVariants but without assembleIndels, enter:
  /path-to-CASAVA/bin/configureBuild.pl --targets all noassembleIndels --variantsSkipContigs [options]
SNP and Indel calling, Single-end Build: To run CASAVA with callSmallVariants for a single-end build, enter:
  /path-to-CASAVA/bin/configureBuild.pl [options]
RNA Sequencing: To run CASAVA for RNA Sequencing, enter:
  /path-to-CASAVA/bin/configureRnaBuild.pl [options]
Variant Detection and Counting, Other Use Cases
Help: To get the CASAVA Help for callSmallVariants, enter:
  /path-to-CASAVA/bin/configureBuild.pl --help callSmallVariants
Rerun callSmallVariants: In any pre-existing build in which the sort module was previously completed (and the assembleIndels module, for a paired-end build), small variant calling may be
40. you will find the config.xml file that records any information specific to the generation of the subfolders. This contains a tag-value list describing the cycle-image folders used to generate each folder of intensity and sequence files. In the BaseCalls folder there is another config.xml file, containing the meta-information about the base caller runs.
Adapter Sequences File: The adapter sequences FASTA file contains the Illumina adapter sequences, and needs to be provided if the adapter masking option is used. FASTA files for various Illumina adapters are available from the Illumina website through iCom.
Bcl Conversion and Demultiplexing: Generating the Sample Sheet. The user-generated sample sheet (SampleSheet.csv file) describes the samples and projects in each lane, including the indexes used. The sample sheet should be located in the BaseCalls directory of the run folder. You can create, open, and edit the sample sheet in Excel. The sample sheet contains the following columns:
Column Header: Description
  FCID: Flow cell ID
  Lane: Positive integer, indicating the lane number (1-8)
  SampleID: ID of the sample
  SampleRef: The reference used for alignment for the sample
  Index: Index sequences. Multiple index reads are separated by a hyphen (for example, ACCAGTAA-GGACATGA).
  Description: Description of the sample
  Control: Y indicates this lane is a control lane, N means sample
  Recipe: Recipe used during sequencing
  Operator: Name or ID of the
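The sample sheet columns listed above can be read and sanity-checked with a standard CSV parser. This is a hedged sketch: it uses only the column headers named in this excerpt (the full table may list more), the function name is hypothetical, and the sample values in the usage example are invented for illustration.

```python
import csv
import io


def read_sample_sheet(text):
    """Parse sample sheet text and validate the Lane, Control, and Index columns."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        lane = int(row["Lane"])
        if not 1 <= lane <= 8:
            raise ValueError(f"lane out of range 1-8: {lane}")
        if row["Control"] not in ("Y", "N"):
            raise ValueError(f"Control must be Y or N, got {row['Control']!r}")
        row["Lane"] = lane
        # Multiple index reads are separated by a hyphen.
        row["Index"] = row["Index"].split("-") if row["Index"] else []
        rows.append(row)
    return rows


# Hypothetical sample sheet content, for illustration only.
EXAMPLE_SHEET = (
    "FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator\n"
    "FC001,1,Sample1,human,ACCAGTAA-GGACATGA,test sample,N,R1,operator1\n"
    "FC001,8,PhiX,phix,,control lane,Y,R1,operator1\n"
)
```

Calling read_sample_sheet(EXAMPLE_SHEET) returns two rows; the first carries two index reads and the second, a control lane, carries none.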
41. Qseq Conversion
Qseq Converter Parameters: The Qseq Converter parameters that can be entered are listed below.
Parameter: Description
  --input-dir DIRECTORY: Path to _qseq.txt directory. No default.
  --output-dir DIRECTORY: Path to root of CASAVA 1.8 unaligned directory structure. Directory will be created if it does not exist. Default: <input-dir>/QseqToFastq/Unaligned
  --fastq-cluster-count INTEGER: Maximum number of fastq records per fastq file. Default: 4 000
  --config-file FILENAME: Specify the Bustard config file to be copied to the fastq directory. Default: <input-dir>/config.xml
  --flowcell-id STRING: Use the specified string as the flow cell id. Default value is parsed from the config file.
Qseq Converter Output Data: The Qseq Converter generates the following output: gzipped FASTQ files in the directory structure configureAlignment expects (configureAlignment Input Files on page 48). If found, Qseq Converter copies the basecalling config.xml to the root of the FASTQ directory structure and renames it DemultiplexedBustardConfig.xml, which is the file expected by configureAlignment. Qseq Converter also creates a default sample sheet in the destination directory. IVC.htm and corresponding plots are in the same directory where the qseq files are.
NOTE: configureAlignment in CASAVA 1.8 will fail if you try to run it a
42. Parameter: Description
  r1: Runs Bcl conversion for read 1. Can be started once the last read has started sequencing.
  POST_RUN_COMMAND_R1: A Makefile variable that can be specified either on the make command line or as an environment variable to specify the post-run commands after completion of read one, if needed. Typical use would be triggering the alignment of read 1.
  POST_RUN_COMMAND: A Makefile variable that can be specified on the make command line to specify the post-run commands after completion of the run.
  KEEP_INTERMEDIARY: The option KEEP_INTERMEDIARY tells CASAVA not to delete the intermediary files in the Temp dir after Bcl conversion is complete. Usage: KEEP_INTERMEDIARY=yes
NOTE: If you specify one of the more specific workflows and then run a more general one, only the difference will get processed. For instance, make -j N r1 followed by make -j N will do read 1 in the first step and read 2 in the second one.
Starting Bcl Conversion for Read 1: If you want to start Bcl to FASTQ conversion before completion of the run, use the makefile target r1 at any time after the last read has started (for multiplexed runs, this is after completion of the indexing read).
  1 Enter the following command to create a makefile for Bcl conversion: /path-to-CASAVA/bin/configureBclToFastq.pl [options]
  2 Move into the newly created Unaligned folder specified by --output-dir.
  3 Type the make r1 command: make -j 8
43. Using ANALYSIS eland_pair: Based heavily on ANALYSIS eland_extended, ANALYSIS eland_pair allows the analysis of a paired-read run using ELANDv2e alignments. As part of the analysis it will: Align both read 1 and read 2 to the reference genome. Determine the insert size distribution of the sample. Use the insert size distribution to resolve repeats and ambiguities.
The export.txt.gz files are meant to contain all information necessary for downstream processing of the alignment data. Other files produced that may be useful in some circumstances are s_N_1_eland_extended.txt and s_N_2_eland_extended.txt; these contain the candidate alignments for each of read 1 and read 2. The software chooses from these possibilities in attempting to pick the best alignment of the read pair. For a detailed description of the export.txt files, see Text Based Analysis Results.
Multiseed, Gapped, Repeat, Orphan Alignment: ANALYSIS eland_pair performs the following alignment features, implemented in ELANDv2 and ELANDv2e: By default, performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately. Uses a gapped alignment method to extend each candidate alignment to the full length, which allows for gaps (indels) of up to 10 bases. Aligns reads in repeat regions using two new modes, semi-repeat resolution and full repeat resolution. Full repeat resolution is more sensitive and places more reads in repeat regions but will r
44. 18 Sequence Chromosomes PROJECT Project1 ANALYSIS eland_pair PROJECT Project1 USE_BASES y*n,y*n
Assignment by SAMPLE: If you just want to align the samples from your sample named Sample1, generate the following config.txt file:
  ELAND_GENOME <GenomesFolder>/iGenomes/Homo_sapiens/UCSC/hg18/Sequence/Chromosomes
  SAMPLE Sample1 ANALYSIS eland_pair
  SAMPLE Sample1 USE_BASES y*n,y*n
Assignment by REFERENCE: If you want to align the samples assigned to a human reference in the sample sheet, generate the following config.txt file:
  ELAND_GENOME <GenomesFolder>/iGenomes/Homo_sapiens/UCSC/hg18/Sequence/Chromosomes
  REFERENCE human ANALYSIS eland_pair
  REFERENCE human USE_BASES y*n,y*n
The requirements and options for the configureAlignment configuration file are described in configureAlignment Configuration File on page 54.
Full Size Example: A full-sized example of a config.txt is shown below (file paths are reproduced as they appear in this excerpt):
  123456:ANALYSIS eland_pair
  78:ANALYSIS eland_rna
  ELAND_GENOME data pipeline in genomes human hg19 fasta
  123456:USE_BASES Y*n,Y*n
  78:USE_BASES YOOn n
  REFERENCE human ELAND_GENOME data pipeline in genomes human hg19 fasta
  REFERENCE human ELAND_RNA_GENOME_ANNOTATION data pipeline in genomes human humanrefflat refFlat.txt.gz
  REFERENCE human ELAND_RNA_GENOME_CONTAM data pipeline in genomes human contams fasta
  REFERENCE phix ANALYSIS eland_pair
  REFERENCE phix ELAND_GENOME data pipeline in genomes phi
45. 2xN 11 (where N is the cluster index): The bits are used as follows:
  • Bit 0: always empty (0)
  • Bit 1: was the read identified as a control?
  • Bit 2: was the match ambiguous?
  • Bit 3: did the read match the phiX tag?
  • Bit 4: did the read align to match the phiX tag?
  • Bit 5: did the read match the control index sequence?
  • Bits 6, 7: reserved for future use
  • Bits 8..15: the report key for the matched record in the controls.fasta file (specified by the REPORT_KEY metadata)
Position Files: The BCL to FASTQ converter can use different types of position files and will expect a type based on the version of RTA used:
  locs: the locs files can be found in the Intensities directory.
  clocs: the clocs files are compressed versions of locs files and can be found in the Intensities directory.
  pos.txt: the pos files can be found in the Intensities directory. The pos.txt files are text files with 2 columns and a number of rows equal to the number of clusters. The first column is the X-coordinate and the second column is the Y-coordinate. Each line has a <cr><lf> at the end.
RunInfo.xml File: The top-level Run Folder contains a RunInfo.xml file. The file RunInfo.xml (normally generated by SCS/HCS) identifies the boundaries of the reads (including index reads). The XML tags in the RunInfo.xml file are self-explanatory.
config.xml Files: In the Intensities folder
46. [Table residue: per-lane summary statistics; the column layout is not recoverable.] Figure 17 Coverage Graph in Home.html (CASAVA 1.8.0a1 101019 PE DNA Seq analysis: coverage for all reference sequences; panels: mean depth at known sites, fraction of known sites mapped; x-axis: reference sequences c1.fa through cY.fa)
48. A Human seq_gene.md.gz
  --seqGeneMdGroupLabel (Application: SE): The group label specifies which assembly to use in the seq_gene file, and is found in column 13 of the file (seq_gene files can hold entries for multiple assemblies). Required for RNA counting when you use the annotation seqGeneMd file from NCBI. Example: --seqGeneMdGroupLabel GRCh37.p2-Primary Assembly
Options for Target bam: The options described below are used to specify analysis for target bam.
Table 21 Analysis Options for bam
Option (Application): Description
  --bamChangeChromLabels OFF|NOFA|UCSC (SE, PE): Change chromosome labels in the bam plugin output. The available behaviors are:
    OFF: Use unmodified CASAVA chromosome labels (default behavior).
    NOFA: Remove any .fa suffix found on each chromosome label. For example, c11.fa is changed to c11.
    UCSC: Remove any .fa suffix found on each chromosome label and attempt to map the result to the corresponding UCSC human chromosome label. For example, c11.fa is changed to chr11.
  --bamSkipRefSeq (SE, PE): Do not generate a reference sequence file with each bam file. The default behavior can be restored with no bamSkipRefSeq.
Configuring Multiple Runs: To add multiple runs, you can modify the run configuration file run.conf.xml:
  a Go to Human/conf/run.conf.xml (see Run.conf.xml on page 93).
  b Add the additional entries to the run.conf.xml file.
  c Then run the configuration again by executing configureBuild.pl p Human
50. [Figure, repeat resolution example, steps 3 and 4. Step 3: Align unmapped reads using overlapping seeds. Step 4: Report seeds that hit a non-repetitive sequence.]
Orphan Alignment
ELANDv2e performs orphan alignment by identifying read pairs for which only one of the reads aligns. ELANDv2e tries to align the other read in a defined window (by default 450 bp). If the number of mismatches is < 10% of the read length, ELANDv2e reports the alignment.
Figure 23 Orphan Alignment
1. Identify orphan read pairs. One read maps well (green), the other has multiple mappings (red).
2. With the mapped read as anchor, generate a 450 bp window.
3. Do local realignment of the unmapped read within the window.
Alignment Performance Improvements
The multiple component updates in CASAVA were designed to improve overall alignment performance. To assess the performance change, alignment percentage, mismatch rates, and CPU run times were compared for three different configurations: CASAVA v1.7, CASAVA v1.8 with semi-repeat resolution, and CASAVA v1.8 with full repeat resolution. The data set consisted of three lanes of HiSeq data from a single sample sequenced with TruSeq v3 chemistry. The analysis was performed o
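The orphan alignment rule (search a 450 bp window, report the hit only if mismatches are below 10% of the read length) can be sketched as follows. This is a simplified illustration, not ELANDv2e's implementation: it scans ungapped positions instead of doing a local realignment, and the function names and the placement of the window (starting at the anchor position) are assumptions.

```python
def count_mismatches(read: str, segment: str) -> int:
    """Number of positions where the read and the reference segment differ."""
    return sum(a != b for a, b in zip(read, segment))


def rescue_orphan(read, reference, anchor_pos, window=450, max_fraction=0.10):
    """Return (position, mismatches) of the best ungapped hit in the window, or None.

    Assumption: the window starts at anchor_pos and extends `window` bases.
    """
    start = max(0, anchor_pos)
    end = min(len(reference), anchor_pos + window)
    best = None
    for pos in range(start, end - len(read) + 1):
        mm = count_mismatches(read, reference[pos:pos + len(read)])
        # Accept only hits with fewer mismatches than 10% of the read length.
        if mm < max_fraction * len(read) and (best is None or mm < best[1]):
            best = (pos, mm)
    return best


reference = "A" * 100 + "GATTACAGATTACACCGT" + "A" * 100
orphan = "GATTACAGATTACACCGA"  # one mismatch against the embedded motif
print(rescue_orphan(orphan, reference, anchor_pos=0))
```

A read that exceeds the mismatch threshold everywhere in the window yields None, which corresponds to leaving the orphan unaligned.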
51. Figure 15 CASAVA Build Directory
ProjectDir: project directory
Parsed_xx-xx-xx: current build directory (final files are here)
notMapped: non-mapping reads (only in archival builds)
c1.fa: build chromosome directory
NNNN: chromosome bin directory
conf: configuration directory
snps.txt: file with SNPs
indels.txt: file with indels
Chromosome_exon_count.txt: file with exon counts (RNA Sequencing only)
Chromosome_splice_count.txt: file with splice counts (RNA Sequencing only)
Chromosome_genes_count.txt: file with gene counts (RNA Sequencing only)
stats: directory with stats text reports
html: directory with html reports
bam: chromosome bam directory; sorted.bam: file with sorted sequence reads
genome: genome directory; bam: bam directory; sorted.bam: file with whole genome in BAM format
The most important folders for downstream analysis are listed below.
Html Folder: The html folder contains the build summary html pages (see Build Html Page on page 104), which provide access to run information and graphs of important statistics.
Parsed_xx-xx-xx folder: The Parsed_xx-xx-xx folder contains most of the sequencing information, such as sorted alignments, SNP and indel calls, and, for RNA Sequencing, gene counts, exon counts, and splice junction counts (see CASAVA Build on page 105). This information is organized in chromosome folders, named c1 or c2, for exampl
52. Chapter 1 Overview: CASAVA Features; What's New; Frequently Asked Questions
Chapter 2 Interpretation of Run Quality: Introduction; Quality Tables and Graphs
Chapter 3 Bcl Conversion and Demultiplexing: Introduction; Bcl Conversion Input Files; Running Bcl Conversion and Demultiplexing; Bcl Conversion Output Folder
Chapter 4 Sequence Alignment: Introduction; configureAlignment Input Files; Running configureAlignment; configureAlignment Output Files; Running ELAND as a Standalone Program
Chapter 5 Variant Detection and Counting: Introduction; Variant Detection Input Files; Running Variant Detection and Counting; Variant Detection and Counting Output Files
Appendix A Requirements and Software Installation: Hardware and Software Requirements
53. FOR RESEARCH USE ONLY. ILLUMINA PROPRIETARY. Part # 15011196 Rev. D, December 2011. This document and its contents are proprietary to Illumina, Inc. and its affiliates ("Illumina"), and are intended solely for the contractual use of its customer in connection with the use of the product(s) described herein and for no other purpose. This document and its contents shall not be used or distributed for any other purpose and/or otherwise communicated, disclosed, or reproduced in any way whatsoever without the prior written consent of Illumina. Illumina does not convey any license under its patent, trademark, copyright, or common-law rights nor similar rights of any third parties by this document. The instructions in this document must be strictly and explicitly followed by qualified and properly trained personnel in order to ensure the proper and safe use of the product(s) described herein. All of the contents of this document must be fully read and understood prior to using such product(s). FAILURE TO COMPLETELY READ AND EXPLICITLY FOLLOW ALL OF THE INSTRUCTIONS CONTAINED HEREIN MAY RESULT IN DAMAGE TO THE PRODUCT(S), INJURY TO PERSONS, INCLUDING TO USERS
54. [Figure residue: intensity plot, Cycle 5, Base A.]
Summary Tab
Another tab in the status.htm page or SAV that you should examine is the Summary tab. The key parameters are listed in the following sections, along with conditions, possible causes for those conditions, and suggested actions to correct the condition.
Clusters
This column contains the average number of clusters per tile detected in the first cycle images.
Condition | Possible Cause | Suggested Action
Fewer clusters than expected: few bright clusters | Problem with cluster formation on the flow cell |
Fewer clusters than expected: blurred images | Poor focus or dirty flow cell surface | Reanalyze with new default offsets in OLB. You will need cif files for that.
Fewer clusters than expected: lots of clusters visible | Cluster density or size is too great to distinguish individual objects |
More clusters than expected: too many clusters | Problem with cluster formation on the flow cell |
More clusters than expected: very large clusters | Double counting |
Average First Cycle Intensity
Generally brighter is better, but this result is instrument and sample dependent.
Condition | Possible Cause
Low intensity | Problem with cluster formation or poor focus
Percentage of First Cycle Intensity Remaining After 20 Cycles of Sequencing
Generally, the higher the better. The intensity remaining can be sample dependent.
Condition | Possible Cause | Suggested Action
Low value deduced | A correct measure of rapid signal decay | Check expe
55. ELAND treats the reference sequence as being in blocks of 16 MB, of which there can be at most 240. This limits the total length of DNA that ELAND can match against in a single run. In a single ELAND run, you can match against:
One file of at most 3824 MB
239 files, each up to 16 MB in size
Something in between, such as 24 files of up to 160 MB each
The NCBI human genome will fit.
Additional eland_rna Input Files
The following additional files are needed for eland_rna:
refFlat.txt.gz or seq_gene.md.gz file: as of CASAVA 1.7, eland_rna uses the refFlat.txt.gz or seq_gene.md.gz file to generate the splice junction set automatically. The refFlat.txt.gz file is available from UCSC, while the seq_gene.md.gz file is from NCBI. They should be provided gzip compressed, and should be from the same build as the reference files you are using for alignment. This removes the need to provide separate splice junction sets, as in earlier versions of CASAVA. The parameter to use for either one is ELAND_RNA_GENOME_ANNOTATION.
WARNING: Do not change the names of the refFlat.txt.gz or seq_gene.md.gz file. CASAVA uses the name to determine the type of file.
A set of contaminant sequences for the genome, typically the mitochondrial and ribosomal sequences. These must be in single FASTA format. The parameter to use to direct to the contaminant seque
56. ND process, needed to ensure that the memory usage stays below 2 GB. The optimal value is such that there are approximately 10 to 13 million lines (reads) in one set. Only available for ANALYSIS eland_extended, ANALYSIS eland_pair, and ANALYSIS eland_rna. See ELAND_FASTQ_FILES_PER_PROCESS on page 65 for more information. Default value is 3.
See ANALYSIS Variables on page 61.
WARNING: Default for USE_BASES is Y*n, which means perform a single-read alignment and ignore the last base. If running ANALYSIS eland_pair, make sure to specify the USE_BASES option for two reads, for example USE_BASES Y*n,Y*n.
Optional Parameters
Table 3 configureAlignment Configuration File Optional Parameters
Parameter | Definition
SINGLESEED | If SINGLESEED is set to singleseed, ELANDv2e aligns only in singleseed mode. Only available for ANALYSIS eland_extended and ANALYSIS eland_pair, for which multiseed alignment is the default. See ELANDv2 Algorithm Description on page 133 for more information.
UNGAPPED | If UNGAPPED is set to ungapped, ELANDv2e aligns only in ungapped mode. See ELANDv2 Algorithm Description on page 133 for more information.
INCREASED_SENSITIVITY | If you specify INCREASED_SENSITIVITY sensitive, ELANDv2e aligns in full repeat mode. Semi-repeat resolution alignment is the default. You can also use INCREASED_SENSITIVITY sensitive on the command line. See Repeat Resolution on page 137 for more information.
57. Note that, as a consequence of the candidate indel discovery process, indels can be called using either gapped alignments or Grouper contig alignments as input, and the evidence from these two sources will be combined if both are available. Typically, gapped alignments can be used to efficiently identify relatively small indels (roughly 1-10 bases in length), whereas local contig assembly can efficiently identify much larger indels. The greatest indel sensitivity can be achieved by generating candidate indels from both of these sources. The parameters described for candidate indel filtration above are configurable, as described in the CASAVA User Guide. Accepting too many candidate indels increases runtime and can lead to occasional spurious indel calls or poorly realigned reads in noisy regions of the genome.
Realignment and Indel Calling
For the second stage of indel calling, the variant caller realigns all intersecting reads to each candidate indel, in addition to aligning the read to the reference and any alternate indel candidates at the same site. It is common for reads that intersect the indel location to support the indel and reference alignments equally well, so the model is designed in such a way that these reads do not affect the genotype call. The relative likelihoods of all alignments for each read are used to assign probabilities to each of three possible indel genotypes: homozygous, heterozygous, or not present. The result of this calcu
58. OR OTHERS, AND DAMAGE TO OTHER PROPERTY. ILLUMINA DOES NOT ASSUME ANY LIABILITY ARISING OUT OF THE IMPROPER USE OF THE PRODUCT(S) DESCRIBED HEREIN (INCLUDING PARTS THEREOF OR SOFTWARE) OR ANY USE OF SUCH PRODUCT(S) OUTSIDE THE SCOPE OF THE EXPRESS WRITTEN LICENSES OR PERMISSIONS GRANTED BY ILLUMINA IN CONNECTION WITH CUSTOMER'S ACQUISITION OF SUCH PRODUCT(S). FOR RESEARCH USE ONLY. © 2009-2011 Illumina, Inc. All rights reserved. Illumina, illuminaDx, BaseSpace, BeadArray, BeadXpress, cBot, CSPro, DASL, DesignStudio, Eco, GAIIx, Genetic Energy, Genome Analyzer, GenomeStudio, GoldenGate, HiScan, HiSeq, Infinium, iSelect, MiSeq, Nextera, Sentrix, SeqMonitor, Solexa, TruSeq, VeraCode, the pumpkin orange color, and the Genetic Energy streaming bases design are trademarks or registered trademarks of Illumina, Inc. All other brands and names contained herein are the property of their respective owners.
Revision History
Part # | Revision | Date | Description of Change
15011196 | D | December 2011 | Updates in FASTQ file control column description
15011196 | C | October 2011 | Supports dual indexing and adapter masking for CASAVA v1.8.2
15011196 | B | May 2011 | Supports CASAVA v1.8
15011196 | A | March 2010 |
1509919 | A | November 2009 |
Table of Contents: Revision History; Table of Contents
59. Bcl Conversion and Demultiplexing
Option | Description
--use-bases-mask | The read masks are separated by commas. The format for dual indexing is as follows: --use-bases-mask Y*,I*,I*,Y* or variations thereof as specified above. If this option is not specified, the mask will be determined from the RunInfo.xml file in the run directory. If it cannot do this, you will have to supply the --use-bases-mask.
--no-eamss | Disable the masking of the quality values with the Read Segment Quality control metric filter.
--mismatches | Comma-delimited list of number of mismatches allowed for each read (for example: 1,1). If a single value is provided, all index reads will allow the same number of mismatches. Default is 0.
--flowcell-id | Use the specified string as the flowcell id (default value is parsed from the config file).
--ignore-missing-stats | Fill in with zeros when stats files are missing.
--ignore-missing-bcl | Interpret missing bcl files as no call.
--ignore-missing-control | Interpret missing control files as not-set control bits.
--with-failed-reads | Include failed reads into the FASTQ files (by default, only reads passing filter are included).
--adapter-sequence | Path to a FASTA adapter sequence file. If there are two adapter sequences specified in the FASTA file, the second adapter will be used to mask re
60. RALD
Running ELAND as a standalone program does not perform all of the various steps that are included during a configureAlignment run. The most important differences are:
ELAND standalone does not generate many of the statistics.
ELAND standalone is not massively parallel like configureAlignment.
If you require any or all of the above, it is best to create a modified config file to align to a different genome and rerun configureAlignment. For more information, see Running configureAlignment on page 53.
FASTQ Format
Any FASTQ file will be supported, but the CASAVA FASTQ file format is optimal for populating the appropriate fields. The format is (note the space between y-pos and read number):
@<instrument name>:<run ID>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
The elements are described below.
Element | Requirements | Description
@ | @ | Each sequence identifier line starts with @
<instrument name> | Characters allowed: a-z, A-Z, 0-9 | Instrument name
<run ID> | Characters allowed: a-z, A-Z, 0-9 | Run ID
<flowcell ID> | Characters allowed: a-z, A-Z, 0-9 | Flowcell ID
<lane> | Numerical | Lane number
<tile> | Numerical | Tile number
<x-pos> | Numerical | X coordinate of cluster
<y-pos> | Numerical | Y coordinate of cluster
<read number> | Numerical | Is usually 1 or 2 for paired end
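As an illustration of the identifier layout described above, the following sketch splits a sequence identifier line into its named fields. It assumes the fields before the space are colon-separated, as are the fields after it; the class and function names are hypothetical and the example identifier is made up for illustration. The flag fields are kept as raw strings and not interpreted.

```python
from typing import NamedTuple


class CasavaHeader(NamedTuple):
    instrument: str
    run_id: str
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: str
    control_number: str
    barcode: str


def parse_casava_header(line: str) -> CasavaHeader:
    """Parse one FASTQ identifier line in the layout sketched above."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # The single space separates cluster coordinates from per-read fields.
    left, right = line[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = left.split(":")
    read, is_filtered, control, barcode = right.split(":")
    return CasavaHeader(instrument, run_id, flowcell, int(lane), int(tile),
                        int(x), int(y), int(read), is_filtered, control, barcode)


header = parse_casava_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(header.flowcell_id, header.lane, header.read_number, header.barcode)
```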
62. NOTE: eland_rna does not support paired-end cDNA reads yet.
Prerequisites
Four sets of data files are needed:
A genome sequence file: FASTA files of all chromosomes for on-the-fly splice junction generation.
refFlat.txt.gz from UCSC or seq_gene.md.gz file from NCBI: as of CASAVA 1.7, eland_rna uses the refFlat.txt.gz or seq_gene.md.gz file to generate the splice junction set automatically. These files come from the following sources: The refFlat.txt.gz file is available from UCSC. The seq_gene.md.gz file is available from NCBI. They should be provided gzip compressed, and should be from the same build as the reference files you are using for alignment. This removes the need to provide separate splice junction sets, as in previous versions of CASAVA.
A set of contaminant sequences for the genome, typically the mitochondrial and ribosomal sequences.
Description of the eland_rna Algorithm
The algorithm aligns the reads to each of three targets:
Contaminants
Genome
Splice junctions (alignments need to span the splice junction)
Then a script decides which of the alignments is most likely for each read. The following steps are taken, in order:
1. If a read aligns to the contaminants, then the read is discarded. It is marked in the export file as RM, for repeat masked.
2. If the read aligns to the genome and/or splice junctions: If there is a unique alignment to the genome or splice junctions, then that alignment is printed. If there are multiple
63. List of Tables
Table 1: ASCII Characters Encoding Q-scores 0-40
Table 2: GERALD Configuration File Core Parameters
Table 3: configureAlignment Configuration File Optional Parameters
Table 4: configureAlignment Configuration File Paired-End Analysis Options
Table 5: ANALYSIS Variables
Table 6: USE_BASES Options
Table 7: Parameters for KAGU_PAIR_PARAMS and KAGU_PARAMS
Table 8: Parameters for KAGU_PAIR_PARAMS Only
Table 9: Parameters for ANALYSIS eland_extended
Table 10: Parameters for ANALYSIS eland_rna
Table 11: Intermediate Output File Descriptions
Table 12: Intermediate Output File Formats
Table 13: Required Parameters for ELAND_standalone.pl
Table 14: Options for ELAND_standalone.pl
Table 15: Targets for Variant Detection and Counting
Table 16: Major File Options for Variant Detection and Counting
Table 17: Behavioral Options for Variant Detection and Counting
Table 18: Global Analysis Options for Variant Detection and Counting
64. Note that for a read to strongly support either the reference or the indel alignment, it must overlap an indel breakpoint by at least 6 bases, and the probability of the read's alignment following either the reference or the indel path must be at least 0.999.
Count.txt Files
There are three different types of count.txt files, for exon, gene, or splice junction:
Chromosome_exon_count.txt: The _exon_count.txt file provides counts for the number of times a particular exon has been detected in a sample.
Chromosome_genes_count.txt: The _genes_count.txt file provides counts for the number of times a particular gene has been detected in a sample.
Chromosome_splice_count.txt: The _splice_count.txt file provides counts for the number of reads that align over a particular splice junction.
_count.txt files are generated by RNA Sequencing, sorted by position, and there is one of each type per chromosome (for example, c19_exon_count.txt). The _count.txt files are stored in the chromosome-specific directory under the Parsed_dd-mm-yy directory, and contain the following columns:
1. Chromosome, starting with a "c": The chromosome on which the exon resides. cM indicates a mitochondrial DNA alignment.
2. Start: The start of the gene.
3. End: The end of the gene.
4. Genes: The gene symbol.
5. Normalized count: RPKM = 10^9 x raw count / (feature length x number of mapped bases)
6. Raw count: sum of coverages for each bas
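The normalized count in column 5 can be illustrated with a short worked example. This sketch assumes the scaling factor is 10^9 (the exponent is not legible in the extracted text) and that the raw count is the sum of per-base coverages over the feature, as column 6 describes; the function name is hypothetical.

```python
def normalized_count(raw_count: int, feature_length: int, mapped_bases: int) -> float:
    """RPKM-style normalization: 10^9 * raw count / (feature length * mapped bases).

    Assumption: the scaling constant is 10**9.
    """
    return 10**9 * raw_count / (feature_length * mapped_bases)


# A 1000-base exon with summed coverage 5000, in a sample with 10^9 mapped bases.
print(normalized_count(raw_count=5000, feature_length=1000, mapped_bases=10**9))
```

Dividing by both the feature length and the total mapped bases makes values comparable across features of different sizes and across samples of different depth.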
65. CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name. Direct CASAVA to a folder containing the FASTA files using the option --refSequences PATH for variant detection and counting.
Multi-Sequence FASTA Files
As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single genome, SAM-compliant, unsquashed file for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in the header for each sequence. Direct CASAVA to the multi-sequence FASTA file using the option --samtoolsRefFile FILE for variant detection and counting.
WARNING: GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files.
Chromosome Naming Restrictions
CASAVA does not accept the following characters in the chromosome name: [...]
refFlat.txt.gz or seq_gene.md.gz File
CASAVA 1.8 generates the non-overlapping exon coordinates set automatically, using the refFlat.txt.gz file from UCSC or the seq_gene.md.gz file from NCBI. They should be from the same build as the reference files you are using for alignment, and are available from iGenome for the most common model organisms (Getting Reference Fil
67. Illumina. Headquartered in San Diego, California, U.S.A. +1.800.809.ILMN (4566); +1.858.202.4566 (outside North America); techsupport@illumina.com; www.illumina.com
68. a reference sequence using the compressed FASTQ files.
5. Variant analysis and counting: Calls Single Nucleotide Polymorphisms (SNPs) and indels, and performs read counting for RNA sequencing.
After variant analysis and counting are finished, the results can be viewed and analyzed further in the GenomeStudio software, or the result files can be analyzed using third-party software.
Figure 1 Sequencing Data Analysis Workflow
[Figure: analysis steps and analysis systems. Generating sequencing images: HiSeq, Genome Analyzer, HiScanSQ. Performing image analysis and base calling: Real Time Analysis (RTA). FASTQ generation and demultiplexing, aligning, detecting variants and counting: CASAVA. Viewing results (analysis files: txt, html), visualization and analysis: GenomeStudio.]
Default Analysis Workflow
Several analysis software products can be used for the analysis cascade. The default workflow uses these software products:
HiSeq Control Software (HCS) and Real Time Analysis (RTA), or Genome Analyzer's Sequencing Control Software (SCS) and RTA. The instrument computer running this software performs the following in real time: image analysis; base calling.
CASAVA 1.8.2, running on a Linux analysis server, performs: Bcl conversion and demultiplexing; off-line sequence alignment; SNP calling and indel detection; read counting for RNA sequencing.
NOTE: As of 1.8, CASAVA uses bcl as pri
69. ach read, resulting in the length of each read being set to the number of sequencing cycles associated with it, minus one. The two reads do not need to be of the same length.
USE_BASES nY*,nY*: Ignore the first base of each read and perform a paired-read alignment, resulting in the length of each read being set to the number of sequencing cycles associated with it, minus one. The two reads do not need to be of the same length.
USE_BASES nY*: Ignore the first base and perform a single-read alignment.
USE_BASES n*,Y*n: Ignore the first read and perform a single-read alignment with the second read, ignoring the last base.
USE_BASES Y*n,n*: Perform a single-read alignment with the first read, ignoring the last base, and ignore the second read.
ELAND_FASTQ_FILES_PER_PROCESS
CASAVA requires a minimum of 2 GB RAM per core. The parameter ELAND_FASTQ_FILES_PER_PROCESS (optional in the configureAlignment config.txt) specifies the maximum number of FASTQ files aligned by each ELAND process, to limit the per-core memory consumption.
NOTE: ELAND_FASTQ_FILES_PER_PROCESS supersedes the ELAND_SET_SIZE parameter used in CASAVA 1.7 and earlier.
The optimal value leads to approximately 10 to 13 million clusters in one set. Since the FASTQ file size (in reads) is determined by the Bcl conversion option --fastq-cluster-count, while the maximum number of files per process is determined by ELAND_FASTQ_FILES_PER_PROCESS, the product of these op
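The sizing rule can be checked with simple arithmetic: the number of clusters handled by one ELAND process is the clusters-per-FASTQ-file setting multiplied by the files-per-process setting. The sketch below uses hypothetical helper names, and the value of 4,000,000 clusters per file is only an illustrative assumption.

```python
def clusters_per_process(fastq_cluster_count: int, files_per_process: int = 3) -> int:
    """Clusters aligned by one ELAND process: file size in clusters times file count."""
    return fastq_cluster_count * files_per_process


def in_recommended_range(clusters: int, low: int = 10_000_000, high: int = 13_000_000) -> bool:
    """True when the set size falls in the approximately 10 to 13 million range."""
    return low <= clusters <= high


# Assumed example: 4,000,000 clusters per FASTQ file, 3 files per process.
total = clusters_per_process(4_000_000, 3)
print(total, in_recommended_range(total))
```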
70. ad 2. Else, the same adapter will be used for all reads. Default: None (no masking).
--man | Print a manual page for this command.
-h, --help | Produce help message and exit.
Examples for --use-bases-mask:
Use first 50 bases for first read (Y50)
Ignore the next (n)
Use 6 bases for index (I6)
Ignore next (n)
Use 50 bases for second read (Y50)
Ignore next (n)
Examples for the other options: --no-eamss; --mismatches 1; --flowcell-id flow_cell_id; --ignore-missing-stats; --ignore-missing-bcl; --ignore-missing-control; --with-failed-reads; --adapter-sequence <adapter dir>/adapter.fa; --man
Makefile Options for Bcl Conversion and Demultiplexing
The options for make usage in demultiplexing analysis are described below.
Parameter | Description
nohup | Use the Unix nohup command to redirect the standard output and keep the make process running even if your terminal is interrupted or if you log out. The standard output will be saved in a nohup.out file and stored in the location where you are executing the makefile. nohup make -j n & The optional & tells the system to run the analysis in the background, leaving you free to enter more commands. We suggest always running nohup to help troubleshooting if issues arise.
-j N | The -j option specifies the extent of parallelization, with the options depending on the setup of your computer or computing cluster. For a description of parallelization, see Using Parallelization on page 119.
71. ads passing filter aligned uniquely to the splice junctions.
8. genomeUsable: number of reads passing filter aligned uniquely to the genome.
NOTE: Sum of spliceUsable and genomeUsable is equal to Usable.
9. In the last rows, numbers are provided for the number of passing-filter reads aligned to each reference sequence file within the AbundantSequences directory. The names are derived from the FASTA headers (up to the first space) used to list each reference in the multi-FASTA abundant sequences file. If you want more descriptive names, like ribosomal, E. coli, or phiX, you should modify the FASTA headers in the abundant sequences file.
NOTE: Difference between repeatMasked and sum of all abundant sequences gives the number of reads that do not have unique alignments.
contam_export.txt.gz
Contains unique alignments to sequences in the CONTAM directory in the export format (see Export.txt.gz on page 79).
Intermediate Output Data Files
Intermediate output files are found in the Aligned folder, and contain data used to build the more meaningful results files described in Pipeline Analysis Output on page 43.
CAUTION: Do not use the intermediate files as input for custom scripts. These files may not be generated anymore in future CASAVA versions.
The files are named using one of the following formats:
s_N_TTTT_name.txt, where N is the lane number, T
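The two notes above describe simple accounting identities that can be checked on parsed values. The sketch below assumes the counts have already been read into a dictionary keyed by field name; the parsing step, the function name, and the example numbers are all hypothetical.

```python
def check_rna_qc(counts: dict, abundant: dict) -> int:
    """Verify spliceUsable + genomeUsable == Usable and return the number of
    reads without unique alignments (repeatMasked minus abundant-sequence hits)."""
    if counts["spliceUsable"] + counts["genomeUsable"] != counts["Usable"]:
        raise ValueError("spliceUsable + genomeUsable does not equal Usable")
    return counts["repeatMasked"] - sum(abundant.values())


# Made-up numbers, for illustration only.
counts = {"Usable": 900, "spliceUsable": 150, "genomeUsable": 750, "repeatMasked": 80}
abundant = {"ribosomal": 30, "phiX": 20}
print(check_rna_qc(counts, abundant))
```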
72. aired-read sample prep. Unlike these short-insert pairs, which have a predominance of opposite and inwardly facing read pairs (R+: >R1 R2<), the large-insert mate pair libraries are expected to produce a predominance of opposite and outwardly facing read pairs (R-: <R2 R1>). High frequencies of paired reads having the same orientation (F-: >R2 R1> or F+: >R1 R2>) may be indicative of a sample preparation problem, or evidence of an adapter read-through problem, found when the read lengths are long relative to the library insert size.
Insert Size Statistics
Statistics are derived from the insert sizes of those pairs in which both reads were individually uniquely aligned and have the predominant relative orientation. First, the median is determined. Then, a standard deviation value is determined independently for those values below the median and those above it. The lower and upper thresholds for acceptable insert sizes are then defined as three of the relevant standard deviations below and above the median, respectively.
Insert Statistics (% of individually uniquely alignable pairs)
This table shows the number of inserts, out of those used to calculate insert size statistics, considered acceptable in size, and of those falling outside the thresholds displayed in the Insert Size Statistics table. The percentages are relative to the original number of pairs in which both reads were individually uniquely aligned.
Barcode Lane Summary
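The threshold calculation in Insert Size Statistics can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: the function name is hypothetical, and each one-sided standard deviation is taken here as the root mean square deviation from the median, since the text does not spell out the exact estimator.

```python
import math
import statistics


def insert_size_thresholds(sizes, n_sd=3.0):
    """Return (median, lower threshold, upper threshold) for a list of insert sizes."""
    med = statistics.median(sizes)
    below = [s for s in sizes if s < med]
    above = [s for s in sizes if s > med]

    def spread(values):
        # Assumed estimator: RMS deviation from the median on one side only.
        if not values:
            return 0.0
        return math.sqrt(sum((v - med) ** 2 for v in values) / len(values))

    return med, med - n_sd * spread(below), med + n_sd * spread(above)


print(insert_size_thresholds([190, 190, 200, 220, 220]))
```

Computing the spread separately on each side lets the acceptance window be asymmetric, which suits insert size distributions with a longer upper tail.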
73. ally at high read lengths; High but constant mismatch rates from cycle 1.
Possible Cause: Bubbles; Rapid focus fluctuations; Dirty flow cell surface; Low intensity at start; High decay rate; High phasing or prephasing; Adapter read-through; Genomic contamination.
Running ELAND as a Standalone Program
You can run ELAND without the rest of configureAlignment as a post-analysis step. ELAND can be run as a standalone program for the following reasons:
To test the effect of different filter parameters
To test alignment targets
To test applications that read export files
To run ELAND as a standalone program, use the script <Path to CASAVA1.8>/bin/ELAND_standalone.pl:
<Path to CASAVA1.8>/bin/ELAND_standalone.pl -if read1.fastq -if read2.fastq -ref /lustre/data01/Mondas/software/Genomes/E_coli_ELAND
Table 13 Required Parameters for ELAND_standalone.pl
Option | Short Form | Description
--input-file <input file> | -if | Specify at least one file for single reads and two files for paired reads (mandatory).
--ref-sequences <path to ref genome dir> | -ref | Full path of a genome directory (mandatory).
Table 14 Options for ELAND_standalone.pl
Option | Short Form | Description
--bam | | Enables BAM output.
--base-quality <value> | -bq | Assumes all bases have this quality when in FASTA mode (default is set to 30).
--copy-references | -cr | Copies the references to the output directory. Use this option if your reference sequ
74. alysis. 22, Filtering: Did the read pass filtering? N = No, Y = Yes. Additional configureAlignment Output Files. s_N_TTTT_rescore.txt: The txt score and rescore files are produced by tile. The corresponding XML summaries are by lane. They contain various breakdowns of base mismatches within aligned reads (e.g., by cycle, called base, and reference base), along with associated statistics. Tabular text format, header data included. rnagc.txt: The rnagc.txt files in the Aligned folder provide the following information on alignment distribution for eland_rna: 1. totalClusters: number of total clusters. 2. PFClusters: number of clusters passing purity filter. 3. Usable: number of reads passing filter and aligned uniquely to the genome plus splice junctions. 4. QC: number of reads passing filter that were not aligned because too many bases were not called (QC in the 11th field of the export file). 5. noMatch: number of reads passing filter that did not match anything, including repeat masked; these reads have the NM label in the 11th field of the export file. 6. repeatMasked: number of reads passing filter that were masked by eland_rna (RM label in the 11th field of the export file). These are reads mapping to abundant sequences and reads that do not have unique alignments to the genome or splice junctions. NOTE: The sum of Usable, QC, noMatch, and repeatMasked reads is equal to the number of reads reported in PFClusters. 7. spliceUsable: number of re
75. ample, you would not want to use a read with 10 mismatches for SNP calling, even if it is the only candidate found. The same applies to a read of poor base quality. Gapped Alignment Scoring: Given a read, ELANDv2e determines positions in the genome to which substrings of the read (seeds of length 32 bp) match with at most two errors. It then takes x additional bases before and after the hit position (the default value for x is 5) to account for potential gaps in the alignment phase. It then computes a global alignment between the read and the reference, which means that the entire read is aligned to the reference. The alignment uses affine gap penalties: opening a gap is more expensive than prolonging an existing gap. The alignment algorithm is furthermore banded, i.e., it is restricted to a maximal length of an expected insertion or deletion; this value is set to 10. Conditions for Opening a Gap: ELANDv2 tries to be conservative about when to open a gap. There are two main conditions that have to be satisfied to open a gap: 1. A gap corrects at least five mismatches downstream; this means that the difference in the number of mismatches between the ungapped and the gapped alignment is at least five. 2. The numbers of mismatches in the gapped and ungapped alignments are set in relation to each other. The reason is that we want to distinguish gaps that improve noisy ungapped alignments from real small insertions or deletions. To this end we define the _noise ratio_ as mismatches
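As a rough illustration of the affine gap model and the first gap-opening condition, here is a minimal Python sketch. The penalty values `gap_open=5` and `gap_extend=1` and both function names are hypothetical; the guide gives only the five-mismatch rule and the maximal indel length of 10, not the actual penalties used by ELANDv2e.

```python
MAX_INDEL_LENGTH = 10  # banding limit stated in the text


def affine_gap_cost(length, gap_open=5, gap_extend=1):
    """Cost of one gap: opening is charged once, each extra base is cheaper.

    gap_open and gap_extend are illustrative values, not ELAND's.
    """
    if length <= 0:
        return 0
    if length > MAX_INDEL_LENGTH:
        raise ValueError("gap longer than the banded alignment allows")
    return gap_open + (length - 1) * gap_extend


def gap_corrects_enough(ungapped_mismatches, gapped_mismatches, min_corrected=5):
    """Condition 1: the gap must remove at least five mismatches."""
    return ungapped_mismatches - gapped_mismatches >= min_corrected
```

With these illustrative penalties, one gap of length 3 costs 7, whereas three separate gaps of length 1 cost 15, so the model prefers a single contiguous indel.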
76. ane and not demultiplexed. Each directory can be analyzed independently with CASAVA and contains the files necessary for alignment, variant analysis, and counting. NOTE: Some of the files needed for the alignment are at the top level of the Unaligned directory. At the same time, CASAVA also separates multiplexed samples (demultiplexing). Multiplexed sequencing allows you to run multiple individual samples in one lane. The samples are identified by index sequences that were attached to the template during sample prep. The multiplexed samples are assigned to projects and samples based on the sample sheet, and stored in corresponding project and sample directories as described above. At this stage, adapter masking may also be performed. With this feature, CASAVA checks whether a read has proceeded past the genomic insert and into adapter sequence. If adapter sequence is detected, the corresponding basecalls are changed to N in the resultant FASTQ file. WARNING: The CASAVA 1.8 directory organization differs considerably from the directory organization used in CASAVA 1.7. NOTE: You cannot start Bcl conversion, demultiplexing, and alignment in one step using CASAVA. Bcl Conversion/Demultiplexing Directory Structure: Bcl conversion and demultiplexing is done in a single step and generates a new directory in the Run folder called Unaligned, which contains all of the dem
77. archival and non-archival versions of the build. Fast creation of whole-genome BAM files: After the sort module has completed, 30-40x whole-genome BAM files can now be created and indexed in approximately 1 hour. Spliced alignments are now represented in BAM using the same format as TopHat, allowing visualization of splice junctions in IGV. Archival Build: Archival builds, turned on with the option --sortKeepAllReads, include all reads given as input to the build in their entirety. Purity-filtered and duplicate reads are stored in the primary BAM files with the appropriate bit settings to identify them. These are ignored by variant calling and RNA read counting. To handle various types of unmapped reads, the CASAVA 1.7 NMNM directory has been renamed notMapped. Reads within this directory are classified into separate BAM files for the following categories: noMatch, qcFail, nonUnique, repeatMasked, mixed. In any situation where reads were trimmed in CASAVA 1.7, they are now soft-clipped. In some cases where a read would be removed in non-archival mode due to some anomalous condition, that read is now marked as unmapped and stored in the build instead. Note that the small variant caller is designed to preserve any soft-clip regions from an input read, though it may expand them as part of local realignment. NOTE: This is independent of the bam files produced by the target bam, which aggregates all reads into a single BAM file with
78. ases are present in the sample at 25% with pure signal (zero intensity in the non-called channels), the Called intensity will be four times that of All, as the intensities will only be averaged over 25% of the clusters. For impure clusters, the Called intensity will be less than four times that of All. The Called intensities are independent of base representation, so a well-balanced matrix will display all channels with similar intensities. Base Calls: The percentage of each base called as a function of cycle. Ideally this should be constant for a genomic sample, reflecting the base representation of the sample. In practice, later cycles often show some bases more than others. As the signal decays, some bases may start to fall into the noise while others still rise above it. Matrix adjustments may help to optimize data. % All and % Called: Exactly the same as All and Called, but expressed as a percentage of the total intensities. These plots make it easier to see changes in relative intensities between channels as a function of cycle by removing any intensity decay. All Intensity Plots: The link to the All.htm file gives a representation of the mean matrix-adjusted intensity of clusters plotted as a function of cycle. It plots each channel (A, C, G, T) separately as a different colored line. Means are calculated over all clusters, regardless of base calling. If all clusters are T, channels A, C, and G will be at zero. If all bases are present in th
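The factor of four between Called and All intensities can be checked with a toy calculation. The Python sketch below is illustrative only: the cluster representation and the names `pure_cluster` and `all_and_called` are assumptions, and every cluster is modeled as perfectly pure with an arbitrary signal of 100.

```python
CHANNELS = "ACGT"


def pure_cluster(base, signal=100.0):
    """A cluster with signal only in the channel of its called base."""
    return base, {ch: (signal if ch == base else 0.0) for ch in CHANNELS}


def all_and_called(clusters):
    """Mean intensity per channel over all clusters and over called clusters."""
    all_mean, called_mean = {}, {}
    for ch in CHANNELS:
        all_mean[ch] = sum(ints[ch] for _, ints in clusters) / len(clusters)
        called = [ints[ch] for base, ints in clusters if base == ch]
        called_mean[ch] = sum(called) / len(called) if called else 0.0
    return all_mean, called_mean


# 100 clusters, each base represented at 25%.
clusters = [pure_cluster(base) for base in "ACGT" * 25]
all_mean, called_mean = all_and_called(clusters)
```

In this toy data the All mean for each channel is 25 and the Called mean is 100, because the Called average ignores the 75% of clusters that contribute zero to that channel.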
79. at analysis. Standard configureAlignment Analysis: The standard way to run configureAlignment is to set the parameters in a configuration file, create a makefile, and start the analysis with the make command. 1. Edit the configureAlignment configuration file as described in configureAlignment Configuration File on page 54. 2. Check the analysis by running the configureAlignment.pl command without --make: <path to CASAVA>/bin/configureAlignment.pl config.txt --EXPT_DIR <path to Unaligned folder> 3. Enter the configureAlignment.pl command, but now with --make. This creates the makefile for sequence alignment: <path to CASAVA>/bin/configureAlignment.pl config.txt --EXPT_DIR <path to Unaligned folder> --make 4. Move into the newly created Aligned folder under the Run folder (see configureAlignment Output Files on page 73). Type the make command for basic analysis: make. NOTE: You may prefer to use the parallelization option as follows: make -j 3 all. The extent of the parallelization depends on the setup of your computer or computing cluster. For a description of parallelization, see Using Parallelization on page 119. 5. After the analysis is done, review the analysis: a. View the analysis results of your run. See Analysis Summary on page 74 and Analysis Results on page 79. b. Interpret the run quality. See Interpretation of configureAlignment Ru
80. ated value containing the set of column labels in the following data segment. The data segment contains one entry per line, where each line is a set of tab-delimited columns. Wherever appropriate, columns for sequence name and position number are included so that the files are tabix-compatible. The following files are generated by CASAVA variant detection and counting. Depth and single-position genotype call scores for every mapped site in the reference genome are saved in each bin directory in the gzipped file sites.txt.gz: Project_Dir/Parsed_NN-NN-NN/c1/0000/sites.txt.gz. Note that this output can be omitted with the --variantsNoSitesFiles option. The SNPs for each reference sequence are aggregated and filtered according to the variantsSnpCovCutoff setting and summarized in the chromosome-level file snps.txt: Project_Dir/Parsed_NN-NN-NN/c1/snps.txt. The indels for each reference sequence are aggregated and filtered according to the variantsIndelCovCutoff setting and summarized in the chromosome-level file indels.txt: Project_Dir/Parsed_NN-NN-NN/c1/indels.txt. If any SNPs and indels are removed by the high-depth filter, they can be found in their corresponding bin directory as Project_Dir/Parsed_NN-NN-NN/c1/0000/snps.removed.txt and Project_Dir/Parsed_NN-NN-NN/c1/0000/indels.removed.txt. When the --variantsWriteRealigned option is selected, there will also be a BAM file written to each reference sequence realigned bam directory containing only
81. ave to use this option to generate a different output directory. Path to a directory containing positions files; the default depends on the RTA version that is detected. Format of the input cluster positions information; options: .locs, .clocs, _pos.txt; defaults to .clocs. Path to a directory containing filter files; the default depends on the RTA version that is detected. Path to a valid Intensities directory; defaults to the parent of base_calls_dir. Path to the sample sheet file; defaults to <input_dir>/SampleSheet.csv. The --tiles option takes a comma-separated list of regular expressions to match against the expected s_<lane>_<tile> pattern, where <lane> is the lane number (1-8) and <tile> is the 4-digit tile number (left-padded with 0s). The --use-bases-mask string specifies how to use each cycle: An n means ignore the cycle. A Y (or y) means use the cycle. An I means use the cycle for the index read. A number means that the previous character is repeated that many times. Examples: --fastq-cluster-count 6000000; --input-dir <BaseCalls_dir>; --output-dir <run folder>/Unaligned; --positions-dir <positions_dir>; --positions-format .locs; --filter-dir <filter_dir>; --intensities-dir <intensities_dir>; --sample-sheet <input_dir>/SampleSheet.csv; --tiles s_[2468]_[0-9][0-9][02468]5,s_1_0001; --use-bases-mask vyo
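The repeat-count rule for the bases mask can be illustrated with a small Python sketch. It covers only what the text above states (the characters n, Y/y, I, and a trailing repeat count) for the mask of a single read; the function name `expand_bases_mask` is hypothetical and this is not the parser CASAVA uses.

```python
import re


def expand_bases_mask(mask):
    """Expand one read's mask to one character per cycle, e.g. 'Y3n' -> 'YYYn'.

    Only the characters described in the text (n, Y/y, I) are accepted;
    anything else raises ValueError.
    """
    if not re.fullmatch(r"(?:[YyNnIi]\d*)+", mask):
        raise ValueError(f"unsupported mask: {mask!r}")
    expanded = []
    for char, count in re.findall(r"([YyNnIi])(\d*)", mask):
        # A number repeats the previous character that many times.
        expanded.append(char * (int(count) if count else 1))
    return "".join(expanded)
```

For example, `expand_bases_mask("I6n")` gives six index cycles followed by one ignored cycle.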
82. be called a SNP, and acts to reduce the rate of any false-positive SNP predictions made by the model. For this reason, the genomic prior is used to calculate the genotype probability distribution used for Q(snp) and Q(max_gt). Polymorphic Prior: When considering a subset of sites from a genome that are known to be polymorphic in a population, there is a much different prior expectation of the genotype distribution than in the scenario described in the previous section for all sites in the genome. A principal difference in this scenario is that the expectation that each site will be homozygous for the reference allele is much lower. These sites also need to be examined to distinguish strong evidence for the homozygous reference genotype from a site where no observations have been made. The polymorphic prior is used to compute the polymorphic site genotype quality score, Q(max_gt|poly_site): the probability that the true genotype is not the highest scoring, if this site is known to be polymorphic. New Variant Calling Parameter Theta: The parameter theta, as used in the variant calling model, refers to the expected proportion of differing sites between two chromosomes sampled from the population. For site genotyping, it is set by default to 1/1000, a value appropriate for human resequencing. Raising this value to, e.g., 1/100 would have the effect of increasing the prior expectation of a non-reference genotype and increasing Q(snp) values. The param
83. bout the genotype distribution at the site before sequencing. The CASAVA 1.8 SNP caller expresses this notion of prior expectation based on a reference sequence using its genomic prior distribution, which is used to calculate Q(snp) and Q(max_gt). A specialized polymorphic prior distribution is also used to compute Q(max_gt|poly_site), which is applicable to sites where there is a greater prior expectation of polymorphism, such as a set of sites from dbSNP. Genomic Prior: When resequencing an individual from a given population, there is a strong prior expectation that a randomly selected site in the sample assembly will be homozygous for an allele at the same locus in a reference chromosome from the same population. This expectation of similarity to a reference sequence in most portions of the genome is referred to below as the genomic prior for the model. For example, suppose that on average 1 in 1000 sites in a sample chromosome are expected to differ from a reference chromosome. If the reference at a particular site is A, then the Q-score for the reference genotype AA will be approximately 30 in the absence of any sample observations. Because of this prior, the most likely genotype would still be AA even after observing a single non-reference basecall of modest quality. Thus the genomic prior has the effect of increasing the evidence required for a site to
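The "approximately 30" figure follows from the usual Phred-style conversion, Q = -10 log10(p). The Python sketch below is a simplification, not the caller's actual genotype model: it assumes that, with no observations, the prior probability of a non-reference genotype is roughly theta, and the function name `phred` is illustrative.

```python
import math


def phred(error_probability):
    """Phred-scaled quality: Q = -10 * log10(p)."""
    return -10.0 * math.log10(error_probability)


# Assumption: prior probability that the site is not homozygous reference
# is about theta when no reads have been observed.
q_default_theta = phred(1 / 1000)
q_raised_theta = phred(1 / 100)
```

Under this simplification the default theta of 1/1000 gives a prior Q of about 30 for the reference genotype, and a theta of 1/100 gives about 20, so less evidence is needed to overturn the reference call.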
84. by or combined with target sort. This BAM file is independent of the archival bam file, which can be produced using the option --sortKeepAllReads (see Archival Build on page 90). gsIndex: Pre-compute the Genome Studio linear index for all reads in the build. If you run a target other than the default target all, make sure to read the help written for the target. This will help you identify any dependencies for the target you want to run. Target help can be accessed by typing: <Path to CASAVA>/bin/configureBuild.pl --help <target>. NOTE: Prefixing any target name with no will exclude it from the targets list. Example: <path to CASAVA>/bin/configureBuild.pl --targets all noassembleIndels --variantsSkipContigs [options]. Target callSmallVariants Usage: The callSmallVariants module is designed to use the results of the assembleIndels module if available, so a new paired-end build could be run with the following minimum set of targets: --targets sort assembleIndels callSmallVariants. If assembleIndels (Grouper) cannot be run, an alternative workflow is: --targets sort callSmallVariants --variantsSkipContigs. NOTE: To have the plugin provide a BAM file containing all reads that have had their alignments altered during realignment, add the following to the configuration command line: --variantsWriteRealigned. These reads will appear in the file sorted.realigned.bam in the chromosome realigned bam directory. The primary option
85. c2.fa c3.fa c4.fa c5.fa c6.fa c7.fa cX.fa c8.fa c9.fa c10.fa c11.fa c12.fa c13.fa c14.fa c15.fa c16.fa c17.fa c18.fa c19.fa c20.fa cY.fa c22.fa c21.fa cMT.fa fraction of known sites mapped
name    sites        known sites   bases mapped at known sites   mean depth at known sites   fraction of known sites mapped
c1.fa   247249719    224999719     7772224483                    34.54326                    0.96684
c2.fa   242951149    237709794     8638053015                    36.33865                    0.97455
c3.fa   199501827    194704822     7248243403                    37.22683                    0.98373
c4.fa   191273063    187297063     7003263067                    37.39121                    0.97914
c5.fa   180857866    177702766     6499205556                    36.57346                    0.97129
c6.fa   170899992    167273991     6235259537                    37.27573                    0.98115
c7.fa   158821424    154952424     5351528045                    34.53659                    0.96809
cX.fa   154913754    151058754     2615296049                    17.31310                    0.94708
CASAVA Build: The CASAVA build, containing sequence, SNP, indel, and (for RNA Sequencing) counts information, is located in the buildDir/Parsed_xx-xx-xx folder. Sorted.bam Files: The sorted.bam file is a binary file that contains sorted sequence alignments. There is one sorted.bam file for each chromosome, stored in the bam subdirectory under each chromosome-specific directory. BAM Format: The Binary Alignment/Map (BAM) file is the binary equivalent of a SAM file and is compressed in the BGZF format. Each BAM file is much smaller than its SAM equivalent, yet it can be easily converted to SAM, e.g., with samtools using samtoo
86. ccur, the model does not strictly report a genotype; rather, the max_gt call reflects the copy number for each of the two indel alleles and the probability of that copy number. Each indel allele of the two overlapping indels is reported on a separate line by the model. Due to the approximate nature of this model and the independent evaluation of each overlapping indel allele, it is possible that the most likely copy number for each allele could conflict (e.g., max_gt will not be het for both indel alleles); in the rare cases where this occurs, the associated Q(max_gt) scores are typically very low. Calling SNPs: Once the indels are called and the reads are realigned to take into account the discovered indels, site genotyping and SNP calling are conducted using the following steps: 6. Given the set of filtered and realigned reads, the variant caller next runs certain types of filtration on base calls within these reads. First, any contiguous trailing sequence of N base calls is effectively treated as trimmed off the ends of reads for the purpose of genotyping and depth calculation. Next, the mismatch density filter is run on all reads to mask out sections of the read having an unexpectedly high number of disagreements with the reference. The current default mismatch density filter behavior is as follows: Base calls are ignored where more than 2 mismatches to th
87. ce. Soft clip on the read (clipped sequence present in <seq>). Hard clip on the read (clipped sequence NOT present in <seq>). Padding (silent deletion from the padded reference sequence). For example, the CIGAR string 30M1I69M means 30 bases aligning to the reference (30M), a 1-base insert (1I), and 69 bases aligning (69M). Optional Fields: Optional fields are in the format <TAG>:<VTYPE>:<VALUE>, for example XD:Z:73T26. Each tag is encoded in two alphanumeric characters and appears only once for an alignment. Illumina SAM files may use some or all of the following optional fields (Tag, Description). SM: ELAND single-read alignment score. AS: ELAND paired-read alignment score. XD: String for mismatching positions. XC: Provides information to distinguish different unmapped read types. The <VTYPE> describes the value type in the optional field. Valid types in SAM are described in the following table (Type, Description). A: Printable character. i: Signed 32-bit integer. f: Single-precision float number. Z: Printable string. H: Hex string (high nybble first). The <VALUE> field format is defined by the tag (Tag, Value Field). SM: The <VALUE> field contains the ELAND single-read alignment score. AS: The <VALUE> field contains the ELAND paired-read alignment score. XD: The <VALUE> field contains the string for mismatching positions. Matching bases are numbered. For example a
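The CIGAR example can be decoded mechanically. The following Python sketch is a minimal illustration using the operation letters defined by the SAM format; the names `parse_cigar` and `read_length` are hypothetical helpers, not part of CASAVA or samtools.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that consume bases of the read sequence in SAM.
CONSUMES_READ = "MIS=X"


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Re-assemble to verify that nothing in the string was skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return ops


def read_length(ops):
    """Number of read bases implied by the parsed operations."""
    return sum(n for n, op in ops if op in CONSUMES_READ)


ops = parse_cigar("30M1I69M")
```

For the string from the text, the parsed operations are 30 matches, a 1-base insertion, and 69 matches, which together account for a 100-base read.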
88. chine name: Identifier of the sequencer. Run number: Number to identify the run on the sequencer. Lane number: Positive integer (currently 1-8). Tile number: Positive integer. X: X coordinate of the spot. Integer. As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded. Y: Y coordinate of the spot. Integer. As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded. Index: Index sequence or 0. For no indexing, or for a file that has not been demultiplexed yet, this field should have a value of 0. Read Number: 1 for single reads; 1 or 2 for paired ends or multiplexed single reads; 1, 2, or 3 for multiplexed paired ends. Sequence: Called sequence of read. Quality: The calibrated quality string. Filter: Did the read pass filtering? 0 = No, 1 = Yes. The Qseq Converter also looks for files that configureAlignment needs and transfers them to its output directory. These files are: the config.xml file in the Basecalls folder. Requirements to Run configureAlignment: To run CASAVA 1.8 configureAlignment on FASTQ files generated by the Qseq Converter, the following is required: The input _qseq
89. chromosome re-labeling; see Targets on page 96. This is independent of the archival bam file, which can be produced using the option --sortKeepAllReads (see Archival Build on page 90). Methods: CASAVA uses a number of methods to efficiently assemble indel candidates, call SNPs and indels, and provide counts. This section explains the methods. Variant Detection: Post-alignment, CASAVA performs variant detection using two modules. The assembleIndels module (Grouper) detects candidate indels using singleton, orphan, and anomalous read pairs. The assembleIndels module works well for detecting larger indels. The candidate indels detected by the assembleIndels module are passed on to the small variant caller for consolidation and genotyping. The callSmallVariants module genotypes and provides quality scores for SNPs and indels. Indels can be called from candidate indel evidence provided both by ELAND gapped read alignments (for smaller indels) and by the assembleIndels module (for larger indels). For each SNP or indel call, the probability of both the called genotype and any non-reference genotype is provided as a quality score (Q-score). Reads are realigned around candidate indels to improve the quality of SNP calls and site coverage summaries. The callSmallVariants module also generates files that summarize the depth and genotype probabilities for every site in the genome. As a final step, it produces tables and html for
90. clusters with base call T (double). Byte 76: Number of clusters with base call A (integer). Byte 80: Number of clusters with base call C (integer). Byte 84: Number of clusters with base call G (integer). Byte 88: Number of clusters with base call T (integer). Byte 92: Number of clusters with base call X (integer). Byte 96: Number of clusters with intensity for A (integer). Byte 100: Number of clusters with intensity for C (integer). Byte 104: Number of clusters with intensity for G (integer). Byte 108: Number of clusters with intensity for T (integer). Filter Files: The filter files can be found in the BaseCalls directory. The filter files are binary files containing filter results; the format is described below (Bytes, Description). Bytes 0-3: Zero value (for backwards compatibility). Bytes 4-7: Filter format version number. Bytes 8-11: Number of clusters. Bytes 12 to (N+11): Unsigned 8-bit integer, where N is the cluster number; bit 0 is pass or failed filter. Control Files: The control files can be found in the BaseCalls directory: <run directory>/Data/Intensities/BaseCalls/L00<lane>. They are named as follows: s_<lane>_<tile>.control. The control files are binary files containing control results; the format is described below (Bytes, Description). Bytes 0-3: Zero value (for backwards compatibility). Bytes 4-7: Format version number. Bytes 8-11: Number of clusters. Bytes 12
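The filter file layout described above maps directly onto a short parser. The Python sketch below is illustrative only and makes two assumptions that the text does not state: the three header fields are little-endian unsigned 32-bit integers, and a set bit 0 means the cluster passed the filter. The function name `parse_filter_bytes` is hypothetical.

```python
import struct


def parse_filter_bytes(data):
    """Parse filter-file content into (version, list of pass flags).

    Assumptions: little-endian 32-bit header fields; bit 0 set = passed.
    """
    zero, version, n_clusters = struct.unpack_from("<III", data, 0)
    if zero != 0:
        raise ValueError("bytes 0-3 should be zero")
    flags = data[12:12 + n_clusters]
    if len(flags) != n_clusters:
        raise ValueError("file is shorter than the cluster count implies")
    # One unsigned byte per cluster; only bit 0 carries the filter result.
    return version, [bool(byte & 1) for byte in flags]


# Synthetic example: version 3, four clusters, the second one fails.
sample = struct.pack("<III", 0, 3, 4) + bytes([1, 0, 1, 1])
version, passed = parse_filter_bytes(sample)
```

To read a real file, the same function could be applied to the bytes returned by `open(path, "rb").read()`, after confirming the byte order against actual data.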
91. completes a major portion of the post-alignment analysis pipeline. The first module, sort, bins aligned reads into separate regions of the reference genome, sorts these reads, optionally removes PCR duplicates (for paired-end reads), and finally converts these reads into BAM format. In a paired-end analysis, the next module, assembleIndels, is used to search for clusters of poorly aligned and anomalous reads. These clusters of reads are de novo assembled into contigs, which are aligned back to the reference to produce candidate indels. Subsequently, the callSmallVariants module uses the sorted BAM files and the candidate indels predicted by the assembleIndels module to perform local read realignment and to genotype SNPs and indels under a diploid model. In an RNA-Seq build, the rnaCounts module is also run to calculate gene and exon counts. Other optional modules can be added to the build process to perform additional functions. CASAVA automatically generates a range of statistics, such as mean depth and percentage chromosome coverage, to enable comparison with previous builds or other individuals. Moreover, CASAVA provides expression levels for exons, genes, and splice junctions in the RNA Sequencing analysis. Use Cases: The application has three basic use cases: DNA Sequencing for large genomes; DNA Sequencing for small genomes/data sets; RNA Sequencing. All types of analysis take export files from configureAlignment as input and pro
92. converter. The path should always be to the Unaligned directory, even when the run only contains one project. For a description of the run folder, see Bcl Conversion Output Folder on page 37. USE_BASES nY*n: Ignore the first and last base of the read. The USE_BASES string contains a character for each cycle. If the character is Y, the cycle is used for alignment. If the character is n, the cycle is ignored. Wild cards (*) are expanded to the full length of the read. USE_BASES should not be used for masking custom index cycles; use the --use-bases-mask option instead (see Options for Bcl Conversion and Demultiplexing on page 33). Default is USE_BASES Y*n, which means perform a single-read alignment and ignore the last base. For a detailed description of USE_BASES syntax, see USE_BASES Option on page 62. ELAND_GENOME /home/user/Genomes/Eland/BAC_plus_vector: Specify the single FASTA files that you want to use as genome reference for alignment with ELANDv2e. SAMTOOLS_GENOME: Direct CASAVA to a multi-sequence FASTA reference file. ANALYSIS eland_extended: Specify the type of alignment that should be performed. Available options are: ANALYSIS eland_extended; ANALYSIS eland_pair; ANALYSIS eland_rna; ANALYSIS none. The default is ANALYSIS none. See ANALYSIS Variables on page 61 for more information. ELAND_FASTQ_FILES_PER_PROCESS N: The maximum number of files analyzed by each ELA
93. cribes how to perform Bcl conversion and demultiplexing in CASAVA 1.8. Usage of configureBclToFastq.pl: The standard way to run Bcl conversion and demultiplexing is to first create the necessary Makefiles, which configure the run. Then you run make on the generated files, which executes the calculations. 1. Enter the following command to create a makefile for demultiplexing: <path to CASAVA>/bin/configureBclToFastq.pl [options]. NOTE: The options have changed significantly between CASAVA 1.7 and 1.8. See Options for Bcl Conversion and Demultiplexing on page 33. 2. Move into the newly created Unaligned folder specified by --output-dir. 3. Type the make command. Suggestions for make usage, depending on your workflow, are listed below (Make Usage, Workflow). nohup make -j N: Bcl conversion and demultiplexing (default). nohup make -j N r1: Bcl conversion and demultiplexing for read 1. The -j option specifies the extent of parallelization, with the options depending on the setup of your computer or computing cluster (see Using Parallelization on page 119). The Unix nohup command redirects the standard output and keeps the make process running even if your terminal is interrupted or if you log out. See Makefile Options for Bcl Conversion and Demultiplexing on page 34 for an explanation of the options. NOTE: The ALIGN option, which kicked off configureAlignment after demultiplexing was done in CASAVA 1.7, is no longer available. 4. After
94. d dependent upon the false-positive tolerance of a user's workflow. For this reason, summary statistics about the called SNPs are created at a higher average application threshold, which can be set using this option (default is 20). Example: --variantsSummaryMinQsnp 25. Table 28: Indel Options for callSmallVariants (Option, Application, Description). --variantsIndelTheta FLOAT (SE, PE): The frequency with which indels are expected between two unrelated haplotypes (default is 0.0001). See New Variant Calling Parameter Theta on page 150 for more explanation. Example: --variantsIndelTheta 0.0002. --variantsIndelCovCutoff FLOAT (SE, PE): Indels are filtered out of the final output if the local sequence depth is greater than this value times the mean chromosomal depth. The sequence depth of the indel is approximated by the depth of the site 5' of the indel (default 3.0). The filter may be disabled for targeted resequencing or other applications by setting this value to -1 (or any negative number). Example: --variantsIndelCovCutoff 4. --variantsCanIndelMin INTEGER (SE, PE): Unless an indel is observed in at least this many gapped or assembleIndels reads, the indel cannot become a candidate for realignment and genotype calling (default 3). Example: --variantsCanIndelMin 4. --variantsCanIndelMinFrac FLOAT (SE, PE): Unless an indel is observed in at least this fraction of intersecting reads, the indel cannot become a candidate for realignment and genotyp
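The depth cutoff described for variantsIndelCovCutoff amounts to a single comparison. The Python sketch below is a hedged illustration of that rule as worded in the text; the function name `passes_depth_filter` and its arguments are hypothetical and do not correspond to CASAVA internals.

```python
def passes_depth_filter(local_depth, mean_chrom_depth, cov_cutoff=3.0):
    """Return True if a call is kept by the high-depth filter.

    A negative cutoff disables the filter, as described in the text.
    A call is removed only when its depth is greater than
    cov_cutoff times the mean chromosomal depth.
    """
    if cov_cutoff < 0:
        return True
    return local_depth <= cov_cutoff * mean_chrom_depth
```

With a mean chromosomal depth of 35 and the default cutoff of 3.0, a call at depth 100 is kept and a call at depth 120 is removed; passing a cutoff of -1 keeps both, which is the behavior suggested for targeted resequencing.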
95. d have a value of 0. Read number: 1 for single reads; 1 or 2 for paired ends or multiplexed single reads; 1, 2, or 3 for multiplexed paired ends. Called sequence of read. Quality string: In symbolic ASCII format (ASCII character code = quality value + 64). Match chromosome: Name of chromosome match OR code indicating why no match resulted. ee:ss:dd: too many hits, where ee is the number of exact hits, ss is the number of hits with a single mismatch, and dd is the number of hits with a double mismatch. NM: no match. QC: QC failure. RM: repeat masked (for example, match against abundant sequences). Match Contig: Gives the contig name if there is a match and the match chromosome is split into contigs. Blank if no match found. Match Position: Always with respect to forward strand; numbering starts at 1. Blank if no match found. Match Strand: F for forward, R for reverse. Blank if no match found. Match Descriptor: Concise description of alignment. Blank if no match found. A numeral denotes a run of matching bases. A letter denotes substitution of a nucleotide. For a 35-base read, 35 denotes an exact match and 32C2 denotes substitution of a C at the 33rd position. The escape sequence ^..$ represents an indel. An integer in the indel escape sequence (e.g., 10^2$18) indicates an insertion relative to reference of the specified size. A sequence in the indel escape sequence (e.g., 10^AG$20) indicates a deletion relative to reference
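The quality encoding and the substitution-only form of the match descriptor can be decoded as follows. This Python sketch is illustrative: `decode_quality` and `substitutions` are hypothetical helper names, and the descriptor parser deliberately rejects indel escape sequences rather than guessing at their handling.

```python
import re


def decode_quality(quality_string):
    """Quality values from symbolic ASCII: character code minus 64."""
    return [ord(char) - 64 for char in quality_string]


def substitutions(descriptor):
    """List (1-based read position, letter) for each substitution.

    Handles only numerals and nucleotide letters; indel escape
    sequences are out of scope for this sketch.
    """
    if not re.fullmatch(r"[0-9ACGTN]+", descriptor):
        raise ValueError("only substitution descriptors are supported")
    position = 0
    found = []
    for run, base in re.findall(r"(\d+)|([ACGTN])", descriptor):
        if run:
            position += int(run)  # run of matching bases
        else:
            position += 1  # one substituted base
            found.append((position, base))
    return found
```

For the example in the text, `substitutions("32C2")` reports a C at position 33, and `substitutions("35")` reports no substitutions for an exact 35-base match.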
96. d to one of the two ends of a large indel or structural variant for which the complete variant is either unknown or cannot be represented by the small variant caller. Breakpoint events are reported as either left or right breakpoints. A left breakpoint corresponds to a haplotype that can be mapped on the left side of the breakpoint location but not on the right. A right breakpoint indicates that a haplotype can be mapped to the right of the breakpoint location but not to the left. If a simple insertion or deletion were represented as two breakpoint calls, then they would occur on the forward strand as a left breakpoint followed by a right breakpoint. The figure below illustrates how two breakpoint calls could potentially be called, corresponding to a large insertion in the sample relative to a population reference. Figure 26: Left and Right Breakpoints (panels: Reference; Sample, with an insertion that is hard to resolve with the small variant caller; Sequencing Sample, with reads overlapping breakpoints; Mapping Reads to Reference, reported as left breakpoint and right breakpoint). Advanced Options for Variant Detection: This section lists advanced options for variant detection, which will help you fine-tune the variant calling. Global Analysis Options: The options described below are global options
97. dary analysis. To assess a run, you can use either the RTA-based output or the Sequence Analysis Viewer (SAV). The SAV is an Illumina software package available on the Illumina website (iCom) and can be used to view the performance metrics of a sequencing run. You can download it as part of the HiSeq Control Software (HCS) package:
1. Log on to icom.illumina.com.
2. Click on Downloads.
3. Search for SAV.
4. Click on the HCS Software link.
5. Download the Installers zip file.
6. Extract the SAV x x xx x msi file from the zip file.
7. Run the SAV x x xx x msi file and follow the installation instructions.
In general, using a PhiX or other balanced, suitable control sample (such as human genomic DNA sequencing) as a guide helps when interpreting these graphs.
Quality Tables and Graphs
Before beginning an analysis run, you should check the following tables and graphs in Status.htm or SAV.
Run Info: You can view basic information on the run's configuration, read length, and control specifications on the Run Info tab of the Status.htm output or the Summary tab of the SAV window.
Data by Cycle: These graphs help you examine intensities, focus metrics (FWHM), percent base, qscores, error rates, and other metrics per cycle and per lane. You can identify sample properties or instrument-related events that affect the data.
Data by Tile Charts: These graphics show run metrics by cycle and by lane and tile. These can be used
98. detailed below for the options variantsSnpCovCutoff and variantsIndelCovCutoff. This setting is recommended for targeted resequencing and RNA-Seq. Note: it is already set by default for RNA-Seq. Example: variantsNoCovCutoff
Table 27 SNP Options for callSmallVariants
Option (Application): Description
variantsSnpTheta FLOAT (SE, PE): The frequency with which single base differences are expected between two unrelated haplotypes (default is 0.001). Example: variantsSnpTheta 0.002
variantsSnpCovCutoffAll (SE, PE): By default, the mean chromosomal depth filter is based on used depth (the number of basecalls used by the SNP caller after filtration), calculated from all known sites (non-N) in the reference sequence. When this option is set, the threshold and the filtration use the full depth at all known sites in the reference sequence. Example: variantsSnpCovCutoffAll
variantsSnpCovCutoff FLOAT (SE, PE): SNPs are filtered out of the final output if the depth of reads used for that site is greater than this value times the mean chromosomal used depth (default 3.0). The filter may be disabled for targeted resequencing or other applications by setting this value to -1 or any negative number. Example: variantsSnpCovCutoff 4
variantsMDFilterCount INTEGER (SE, PE): The mismatch density filter removes all basecalls from consideration during SNP calling where greater than variantsMDFilterCount mismatches to the reference occur o
99. digit set number export gz
Alignment Algorithms
CASAVA provides the alignment algorithm Efficient Large-Scale Alignment of Nucleotide Databases (ELAND). ELAND is very fast and should be used to match a large number of reads against the reference genome. ELAND has been improved a number of times. CASAVA 1.6 introduced a new version of ELAND, ELANDv2. The most important improvements of ELANDv2 are its ability to perform multiseed and gapped alignments. As of CASAVA 1.8, a new version of ELANDv2 is available, ELANDv2e. The most important improvements of ELANDv2e are improved repeat resolution and implementation of orphan alignment. A short description of these improvements is provided below; more information about ELANDv2 is available in Algorithm Descriptions on page 131.
Multiseed and Gapped Alignment
ELANDv2e performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately. After this, ELANDv2e extends each candidate alignment to the full length of the read using a gapped alignment method that allows for gaps (indels) of up to 10 bases. ELANDv2e then picks the best alignment based on alignment scores.
Repeat Resolution
ELANDv2e aligns reads in repeat regions using two new modes: semi-repeat resolution and full-repeat resolution. Both modes take repetitive hits into account for the multiseed pass of ELAND. Full-repeat resolution is more sensitive and places more reads in repeat regions, but
100. duce SNP and indel calls, but note that gapped alignments are required for indel calls in RNA and single-ended DNA builds. In addition, RNA Sequencing analysis provides counts for exons, genes, and splice junctions.
DNA Sequencing Analysis for Large Genomes
DNA Sequencing whole genome analysis can be used for large genomes and high coverage (like the human genome at 30x coverage) and for both single-read and paired-end runs. CASAVA can take the large numbers of aligned single-read or paired-end sequences from multiple experiments, arrange them into a genome build, and describe differences from the reference sequence. For big data sets (30x coverage human genome), the process can take between 5 hours and several days, depending on available infrastructure.
NOTE: Large projects like human genome resequencing require high performance computer clusters; see Hardware and Software Requirements on page 112.
DNA Sequencing Analysis for Small Genome
DNA Sequencing for small genomes, such as whole genome sequencing of bacteria or targeted resequencing, is very similar to DNA Sequencing for large genomes, with the only difference being that it may process data from one lane or less. Thus, a single computer is enough to make the build.
RNA Sequencing Analysis
RNA Sequencing analysis supports whole transcriptome sequencing projects. In addition to SNP and indel calls, there are a few more data types produced: Ex
101. duces tables and html-formatted reports of SNP and indel calls.
assembleIndels Module Improvements
The major changes for the assembleIndels module (Grouper) are:
assembleIndels uses an additional method to identify indels. It finds read pairs that map anomalously (for example, with unexpected insert size) and identifies potential indels.
assembleIndels merges indel calls detected through anomalous read pairs with those identified through singleton/orphan reads, and combines clusters that appear to correspond to the same event.
Figure 24 Changes to the assembleIndels Workflow (stages shown: IndelFinder extracts reads, using gapped alignments, alignments worse than expected, singleton/shadow reads, and the anomalous read pair method introduced in CASAVA v1.8; AlignCandidates performs localized alignment of extracted reads and replaces the previous alignment if better; ClusterFinder clusters reads; ClusterMerger, a module introduced in CASAVA v1.8, merges clusters from different types; SmallAssembler assembles clusters into contigs and updates alignment details for assembled reads; AlignContig aligns contigs to the genome).
assembleIndels Algorithm
The assembleIndels module (Grouper) runs only during paired-read DNA CASAVA builds. In CASAVA v1.8, it uses orphan reads and anomalous read pairs to detect indels. Grouper detects indels in
102. e Stats Folder
The stats folder contains statistical information in computer-readable form, such as the runs_summary.xml file, which shows which lanes from which run were aggregated and called for a CASAVA build.
Conf Folder
The conf folder contains information about the configuration of the project, such as the project.conf file.
Build Html Page
The build html page is located in buildDir/html. When you open the file Home.html, you will find a list of all runs and a link to statistics.
Figure 16 Build Html Page (screenshot: report for a paired-end DNA sequencing analysis, listing for each run, lane, and read the cluster counts, clusters passing filter, alignment and error rates, yield, read length, and status, together with the Report Menu link).
Report Menu link: The Report Menu link on the build html page will lead you to graphs and tables for important statistics: Coverage, Duplicates, Indels statistics, and SNPs statistics.
103. e found in the file snps.removed.txt in each chromosomal bin directory.
The SNP caller implemented in this module employs a probabilistic model which ultimately produces probability distributions over all diploid genotypes for each site in the genome. The primary values summarized from these distributions are a set of quality scores:
Q(snp): The value of Q(snp) expresses the probability that the genotype at this site is not the homozygous reference state.
Q(max_gt): The value of Q(max_gt) expresses the probability of the most likely genotype state at this site, reported as the value max_gt. Note that the value Q(max_gt) corresponds to a value referred to as consensus quality in SNP calling methods such as samtools pileup.
Q(max_gt|poly_site): One additional score is provided by the SNP caller, which can be used to look at sites for which there is a strong expectation that the site is polymorphic. This value is Q(max_gt|poly_site), which expresses the probability of the most likely genotype state at the site, assuming the site is polymorphic. This state is separately reported as the value max_gt|poly_site. This genotype value and quality score provide greater sensitivity when looking at, for example, a particular set of polymorphic sites from dbSNP. This value should not be used to evaluate the genotype for every position in the genome, as this would result in a high number of false positive SNP predictions. To accommodate
104. e sample at a rate of 25% and a well-balanced matrix is used for analysis, the graph will display all channels with similar intensities. If intensities are not similar, the results could indicate either poor cross-talk correction or poor absolute intensity balance among the channels. A genome rich in GC content may not provide a balanced matrix for accurate cross-talk correction and absolute intensity balance.
Mismatch Graphs
The Mismatch Graphs link leads to a file with graphs of error rates on a flow cell. The red bar shows the percentage of bases at each cycle that are wrong, as calculated based on alignment to the reference sequence. Issues such as focus or fluidics problems manifest themselves as spikes in the graph. ELANDv2e is capable of aligning against large genomes, such as human, in reasonable time. However, it allows only two errors per seed. This means that error rates based on ELANDv2e alignments are underestimated.
Mismatch Curves
The Mismatch Curves link leads to a file with graphs of the proportion of reads in a tile that have 0, 1, 2, 3, or 4 errors by the time they get to a given cycle.
Additional Paired Statistics
For samples for which eland_pair analysis was performed, there is a table called Additional Paired Statistics. This table provides statistics about the alignment outcomes of the two reads individually and as a pair; the
105. Parameter (with example value): Definition
OUT_DIR: Path to configureAlignment output. The path must be to a directory not already present. Defaults to <run_folder>/Aligned. Note that there can be only one Aligned directory by default. If you want multiple Aligned directories, you will have to use this option to generate a different output directory.
DATASET_POST_RUN_COMMAND /yourPath/yourCommand yourArgs: Allows user-defined scripts to be run after all configureAlignment targets have been built. Invoked per barcode/lane for multiplexed samples, and per lane for non-multiplexed samples. See also Using DATASET_POST_RUN_COMMAND on page 66.
EMAIL_LIST user@example.com user2@example.com, EMAIL_SERVER mailserver, EMAIL_DOMAIN example.com: Send a notification to the user at the end of an analysis run. For more information on email notification, see Setting Up Email Reporting on page 116.
WEB_DIR_ROOT file://server.example.com/share: Include hyperlinks with a specific prefix to the run folder.
NUM_LEADING_DIRS_TO_STRIP: Specifies the number of directories to strip from the start of the full run folder path before prepending the WEB_DIR_ROOT.
ELAND_RNA_GENOME_ANNOTATION
ELAND_RNA_GENE_MD_GROUP_LABEL
KAGU_PARAMS
ELAND_RNA_GENOME_CONTAM: Points to the folder containing a set of contaminant sequences for the genome, typically the mitochondrial and ribosomal sequences. The files must be in si
106. e RPKM for exons and genes is calculated slightly differently than RPKM for splice junctions. The normalized values for genes and exons are calculated as follows:
Exons, genes: $\mathrm{RPKM} = 10^{9} \times C_b / (N_b \times L)$
With: RPKM = Reads Per Kilobase of exon model per Million mapped reads; $C_b$ = the number of bases that fall on the feature; $N_b$ = total number of mapped bases in the experiment; $L$ = the length of the feature in base pairs.
The normalized values for splice junctions are calculated as follows:
Splice junctions: $\mathrm{RPKM} = 10^{9} \times C_r / (N_r \times L)$
With: $C_r$ = the number of reads that cover the junction point; $N_r$ = total number of mapped reads in the experiment; $L$ = the length of the feature in base pairs.
Only the reads with alignment score > QVCutoffSingle are considered. Exons that have overlapping exons from other genes on the forward or reverse strand are excluded from counting, and are also not included when computing the total gene length.
Reference: Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 5:585-7.
Qseq Conversion
Introduction; Qseq Converter Input Files; Running Qseq Converter; Qseq Converter Parameters; Qseq Converter Output Data.
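The normalization above can be written as a small function. This is a minimal sketch under stated assumptions: the garbled factor in the formula is read as 10^9 (as in Mortazavi et al.), and the function and argument names are hypothetical, not CASAVA identifiers. The same arithmetic applies to exons and genes (counts in bases, total in mapped bases) and to splice junctions (counts in reads, total in mapped reads).

```python
def rpkm(feature_count, total_mapped, feature_length_bp):
    """Normalized count: 1e9 * C / (N * L).

    feature_count     -- C: bases (exons, genes) or reads (junctions) on the feature
    total_mapped      -- N: total mapped bases or reads in the experiment
    feature_length_bp -- L: length of the feature in base pairs
    """
    if total_mapped <= 0 or feature_length_bp <= 0:
        raise ValueError("total_mapped and feature_length_bp must be positive")
    return 1e9 * feature_count / (total_mapped * feature_length_bp)


# Illustrative numbers only: 1,000 reads on a 2,000 bp feature,
# out of 10 million mapped reads.
print(rpkm(1_000, 10_000_000, 2_000))
```

With these illustrative inputs the arithmetic is 10^12 / (2 x 10^10), so the function returns 50.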
107. e calling (default 0.02). Example: variantsCanIndelMinFrac 0.01
variantsSmallCanIndelMinFrac FLOAT (SE, PE): In addition to the above filter for all indels, for indels of size 4 or less, unless the indel is observed in at least this fraction of intersecting reads, the indel cannot become a candidate for realignment and genotype calling (default 0.1). Example: variantsSmallCanIndelMinFrac 0.2
variantsIndelErrorRate FLOAT (SE, PE): Set the indel error rate used in the indel genotype caller to a constant value of f (0 <= f <= 1). The default indel error rate is taken from an empirical function accounting for homopolymer length and indel type (i.e., insertion or deletion). This flag overrides the default behavior with a constant error rate for all indels. Example: variantsIndelErrorRate 0.5
variantsSummaryMinQindel INTEGER (SE, PE): The indels.txt files contain all positions where Q(indel) > 0; however, it is expected that only a higher Q(indel) subset of these will be used, dependent upon the false positive tolerance of a user's workflow. For this reason, summary statistics about the called snps are created at a higher average application threshold, which can be set using this option (default is 20). Example: variantsSummaryMinQindel 25
variantsMaxIndelSize INTEGER (SE, PE): Sets the maximum indel size for realignment and
108. e genome, splice junctions, and contaminants, using ELANDv2e. For more information on eland_rna, see Using ANALYSIS eland_rna on page 70.
None (Any): Omits the indicated lane from the analysis application. Setting the parameter 8:ANALYSIS none ignores lane 8.
WARNING: Default for USE_BASES is Y*n, which means perform a single-read alignment and ignore the last base. If running ANALYSIS eland_pair, make sure to specify the USE_BASES option for two reads, for example Y*n,Y*n.
USE_BASES Option
The USE_BASES option identifies which bases of a full read produced by a sequencing run should be used for the alignment analysis. A fully expanded USE_BASES value is a string with one character per sequencing cycle, but more compact formats can be used as described in USE_BASES Option on page 62. Each character in the string identifies whether the corresponding cycle should be aligned. The following notation is used:
A lower-case n means ignore the cycle.
NOTE: Prephasing correction cannot be applied to the last base, since you need to know the next base in the sequence. Thus, there will be a minor error increase at the last base. Ignoring the last base from the sequence analysis can reduce alignment errors somewhat. For this reason, Illumina recommends that if n bases of sequence are desired, n+1 cycles should be run.
An upper-case Y means use the cycle for the alignment.
A comma denotes a
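The effect of a fully expanded USE_BASES value on a single read can be illustrated as follows. This is a minimal sketch, not CASAVA code: the function name is hypothetical, only the two characters described above (Y and n) are supported, and compact formats and multi-read masks are not handled.

```python
def apply_use_bases(read, expanded_mask):
    """Keep the cycles marked 'Y' and drop the cycles marked 'n'.

    Assumes a fully expanded mask: one character per sequencing cycle.
    """
    if len(read) != len(expanded_mask):
        raise ValueError("mask must have one character per cycle")
    unexpected = set(expanded_mask) - {"Y", "n"}
    if unexpected:
        raise ValueError(f"unsupported mask characters: {sorted(unexpected)}")
    return "".join(base for base, flag in zip(read, expanded_mask) if flag == "Y")


# Five cycles were run and the last base is ignored,
# matching the n+1 cycle recommendation in the note above.
print(apply_use_bases("ACGTA", "YYYYn"))
```

Ignoring the last cycle in this way leaves the four bases ACGT for alignment.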
109. e>.bcl
The bcl files are binary base call files with the format described below.
Bytes: Description (Data type)
Bytes 0-3: Number N of cluster (unsigned 32 bits little-endian integer).
Bytes 4-(N+3), where N is the cluster index: Bits 0-1 are the bases, respectively [A, C, G, T] for [0, 1, 2, 3]; bits 2-7 are shifted by two bits and contain the quality score. All bits 0 in a byte is reserved for no-call (unsigned 8 bits integer).
Stats Files
The stats files can be found in the BaseCalls directory: <RunDirectory>/Data/Intensities/BaseCalls/L00<lane>/C<cycle>.1. They are named as follows: s_<lane>_<tile>.stats
The stats file is a binary file containing base calling statistics; the content is described below. The data is for clusters passing filter only.
Start: Description (Data type)
Byte 0: Cycle number (integer)
Byte 4: Average Cycle Intensity (double)
Byte 12: Average intensity for A over all clusters with intensity for A (double)
Byte 20: Average intensity for C over all clusters with intensity for C (double)
Byte 28: Average intensity for G over all clusters with intensity for G (double)
Byte 36: Average intensity for T over all clusters with intensity for T (double)
Byte 44: Average intensity for A over clusters with base call A (double)
Byte 52: Average intensity for C over clusters with base call C (double)
Byte 60: Average intensity for G over clusters with base call G (double)
Byte 68: Average intensity for T over
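The bcl byte layout described at the start of this section can be decoded with the Python standard library. This is a minimal sketch under stated assumptions: it works on an in-memory bytes object rather than a real run folder, the function name is hypothetical, and reporting a no-call as base "N" with quality 0 is a choice made for this example, not something the format description specifies.

```python
import struct

BASES = "ACGT"  # bit values 0, 1, 2, 3 map to A, C, G, T


def parse_bcl(data):
    """Decode bcl content into a list of (base, quality) tuples.

    Bytes 0-3 hold the cluster count N (unsigned 32-bit little endian).
    Each following byte holds one cluster: bits 0-1 are the base,
    bits 2-7 are the quality score. A byte of 0 means no call.
    """
    (cluster_count,) = struct.unpack_from("<I", data, 0)
    body = data[4:4 + cluster_count]
    if len(body) != cluster_count:
        raise ValueError("data is shorter than the declared cluster count")
    calls = []
    for byte in body:
        if byte == 0:
            calls.append(("N", 0))          # no call (representation assumed)
        else:
            calls.append((BASES[byte & 0b11], byte >> 2))
    return calls


# Synthetic example: three clusters (no call, C at Q30, T at Q40).
example = struct.pack("<I", 3) + bytes([0, (30 << 2) | 1, (40 << 2) | 3])
print(parse_bcl(example))
```

The synthetic input is built with the same bit layout, so the example round-trips without needing an actual bcl file.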
110. e idxProj as their project field
PROJECT idxProj ANALYSIS eland_pair
PROJECT idxProj USE_BASES Y*n,Y*n
PROJECT idxProj ELAND_GENOME /x/y/z/G1
Align only PhiX of idxProj (assuming there are 2 references for idxProj: hum and PhiX)
# Disable analysis by default so that anything not explicitly described is not analysed
ANALYSIS none
# Disable analysis for noIdxProj. This will take priority over REFERENCE scope attributes below
PROJECT noIdxProj ANALYSIS none
# Set REFERENCE scope variables so that when the data belongs to PhiX they have an effect. noIdxProj will not be analysed, as PROJECT scope has higher priority
REFERENCE phix ANALYSIS eland_pair
REFERENCE phix USE_BASES Y*n,Y*n
REFERENCE phix ELAND_GENOME /x/y/z/GP
Align only human for Lane 2 (assuming 2 references for idxProj: human, PhiX)
# Disable analysis by default so that anything not explicitly described is not analysed
ANALYSIS none
# Notice that everything below is set only for lane 2, so the rest of the data has ANALYSIS none from above
# Disable analysis for noIdxProj. This will take priority over REFERENCE scope attributes below
2:PROJECT noIdxProj ANALYSIS none
# Set REFERENCE scope variables so that when the data belongs to PhiX they have an effect. noIdxProj will not be analysed, as PROJECT scope has higher priority
2:REFERENCE hum ANALYSIS eland_pair
2:REFERENCE hum USE_BASES Y*n,Y*n
2:REFERENCE hum ELAND_GENOME /x/y/z/GH
Samples Without Index
Unless otherw
111. e pairs, expressed as a percentage of the total number of non-orphaned clusters passing filters, must exceed a certain number (set as decimal, for example muf 0.1). Otherwise, no pairing is attempted and the two reads are effectively treated as two sets of single reads. By default this threshold is set to 0. For some applications it may be useful to switch off the pairing completely by specifying muf 1.0.
Minimum percentage of Consistent Fragments (set as decimal, for example mcf 0.6): Of the unique pairs, the vast majority should have the same orientation with respect to each other. If they don't, it is indicative of one of the following problems: sample prep, or a reference sequence that is extremely diverged from the sample data. In such cases no pairing is attempted and the two reads are effectively treated as two sets of single reads. By default the threshold for this parameter is set to 0.7.
Minimum Fragment Alignment Quality: For each cluster, all possible pairings of alignments between the two reads are compared. This is the score of the best one. Since we are considering the two reads as one fragment, both reads in a cluster get the same paired-read alignment score. The alignment score is nominally on a Phred scale. However, it is probably not safe to assume the calibration is perfect. Nevertheless, it is a good discriminator between good and bad alignments. The score must exceed this threshold to go in the export.txt.gz fi
112. e predicted genotype is not CC. The CASAVA 1.8 model for indels comprises three possible indel genotypes: homozygous, heterozygous, or not present.
NOTE: It is possible to have high confidence that the genotype is not the reference without having high confidence in exactly what the genotype is at the site. In this situation there is strong evidence of a SNP, but the exact genotype at the site is less certain.
Q(snp)
The SNP caller's site genotyping methods take a set of base calls and associated qualities for each site and produce a probability distribution over the 10 diploid genotype states (AA, CC, GG, TT, AC, AG, AT, CG, CT, GT). Given this probability distribution, the value Q(snp) is a Q-score expressing the probability that the site genotype is not that of the homozygous reference.
NOTE: The diploid genotypes are printed out as two-letter codes representing two unphased single base alleles. For each heterozygous genotype, the two alleles are provided in alphabetical order (e.g., CT will be used and not TC).
For example, if the reference base is C and the probability of the reference genotype CC is 0.001, the value for Q(snp) is 30, reflecting a relatively high confidence that at least one non-reference allele exists at this site.
Prior Probabilities and Quality Scores
An important component of the SNP calling model is the prior probability distribution over diploid genotypes. The prior distribution expresses the information available a
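The worked Q(snp) example above follows from the standard Phred-style conversion, Q = -10 log10(P). A minimal sketch, assuming that conversion applies here and using a hypothetical function name; P is the posterior probability of the homozygous reference genotype:

```python
import math


def q_snp(prob_reference_genotype):
    """Phred-scaled confidence that the site is NOT homozygous reference."""
    if not 0.0 < prob_reference_genotype <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -10.0 * math.log10(prob_reference_genotype)


# Reference base C, P(CC) = 0.001, as in the example in the text.
print(round(q_snp(0.001)))
```

A probability of 0.001 for the reference genotype gives a score of 30, and each further tenfold drop in that probability adds 10 to the score.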
113. e provides the location of the configureAlignment data (export.txt.gz) for each flow cell run and describes the properties of each flow cell run. There is one run section for each flow cell run, one set section for each Aligned folder in each flow cell run, and one lane section for each lane in each set. The run.conf.xml file can be provided (created) by the user, or CASAVA will generate it automatically based on command line options. The run.conf.xml file should be placed in buildDirectory/conf.
Pair.xml
The pair.xml file provides information about pair distribution in the configureAlignment output (only for paired-end sequencing). Pair.xml is required for paired samples to be treated as paired. You do not need to point to it specifically, since it should have been placed in the Aligned Project Sample folder for your sample by configureAlignment.
Genomesize.xml
The genomesize.xml file contains names of reference genomes and is required for variant detection. You do not need to point to it specifically, since it should have been placed in the Aligned Project Sample folder for your sample by configureAlignment.
Reference Genome
CASAVA uses a reference genome in FASTA format. Both single-sequence FASTA and multi-sequence FASTA genome files are supported. Genome sequence files for most commonly used model organisms are available through iG
114. e reads, and the affected portions of these reads have high error rates and unreliable base calls. Typically, the increase in phasing causes quality scores to be low in these regions, and thus these unreliable bases are scored correctly. However, the occurrence of phasing artifacts may not always correlate with segments of high miscall rates and biased base calls, and therefore these low quality segments are not always reliably detected by our current quality scoring methods. We therefore mark all reads that end in a segment of low quality, even though not all marked portions of reads will be equally error prone. The read segment quality control metric identifies segments at the end of reads that may have low quality and unreliable quality scores. If a read ends with a segment of mostly low quality (Q15 or below), then all of the quality values in the segment are replaced with a value of 2 (encoded as a single character in Illumina's text-based encoding of quality scores), while the rest of the quality values within the read remain unchanged. We flag these regions specifically because the initially assigned quality scores do not reliably predict the true sequencing error rate. This Q2 indicator does not predict a specific error rate, but rather indicates that a specific final portion of the read should not be used in further analyses. This is not a read-level filter: the occurrence of consecutive Q2 values in a read does not indicate that the read itself i
115. e reference sequence occur within 20 bases of the call. Note that this filter treats each insertion or deletion as a single mismatch. If the call occurs within the first or last 20 bases of a read, then the mismatch limit is applied to the 41 base window at the corresponding end of the read. The mismatch limit is applied to the entire read when the read length is 41 or shorter. All bases marked by the mismatch density filter, together with any N base calls which remain after the end-trimming step, are filtered out by the variant caller. These filtered base calls are not used for site genotyping, but appear in the filtered base call counts in the variant caller's output for each site.
7. All remaining base calls are used for site genotyping. The genotyping method heuristically adjusts the joint error probability calculated from multiple observations of the same allele on each strand of the genome, to account for the possibility of error dependencies between these observations. The method accomplishes this by treating the highest quality base call of each allele from each strand as independent observations, leaving their associated base call quality scores unmodified. However, subsequent base calls for each allele and strand have their qualities adjusted to increase the joint error probability of that allele above the error expected from independent base call observations. After running the site genotyper on all positions, a set of unfiltered SNP sites is
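The windowing rules of the mismatch density filter described above can be sketched as follows. This is an illustration only, not CASAVA's implementation: the function name and the limit value in the example are hypothetical, the input is a per-base list of mismatch flags (with each indel already represented as a single flag), and counting the call's own position inside its window is an assumption made for this sketch.

```python
def mismatch_density_mask(mismatch_flags, limit, flank=20):
    """Return one boolean per base: True if the base call is filtered.

    A call is filtered when more than `limit` mismatches fall in the window
    of `flank` bases on either side of it (41 bases for flank=20). Near a
    read end the window is shifted to cover the 41 bases at that end; for
    reads of 41 bases or fewer the window is the entire read.
    """
    read_length = len(mismatch_flags)
    window = 2 * flank + 1
    filtered = []
    for index in range(read_length):
        start = min(max(index - flank, 0), max(read_length - window, 0))
        end = min(start + window, read_length)
        filtered.append(sum(mismatch_flags[start:end]) > limit)
    return filtered


# 50 base read with three mismatches at its first three positions, limit of 2.
flags = [True, True, True] + [False] * 47
mask = mismatch_density_mask(flags, limit=2)
print(mask.count(True))
```

In this example, calls at the first 21 positions share a window containing all three mismatches and are filtered, while later calls see at most two mismatches and are kept.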
116. e within the feature.
NOTE: For overlapping genes with different gene names, only the non-overlapping portions for each gene participate in count generation.
Exon counts are the sum of base coverages from genomic and spliced reads. Therefore, gene counts are the sum of exon counts. Junction counts in reads are provided for historical reasons and for alternative splicing analysis. An example of a chromosome genes count.txt file opened in Excel is shown below.
Figure 18 Chromosome genes count.txt File Opened in Excel (screenshot: one row per gene, giving start coordinate, end coordinate, and gene name).
Requirements and Software Installation
Hardware and Software Requirements; Installing CASAVA.
Hardware and Software Requirements
Network Infrastructure
The large data volumes generated and moved when running CASAVA mean that you will need the following:
1. A high-throughput ethernet connection 1
117. eads. Spanning read score threshold. This is calculated in exactly the same way as indelsSpReadThresholdIndels. However, it is used in the opposite way. Here the point is to find reads with few or no mismatches, which are presumed to arise from repeats and not from indels, and exclude them from the clustering process.
Minimum coverage to extend contig (default 3).
Option (Application): Description
indelsMinContext NUMBER (PE): Demand at least x exact matching bases either side of variant (default is 6). The idea here is to ensure that an indel has a minimum number of exactly matching bases on either side. Setting this to zero might be good for finding reads which align to breakpoints.
indelsSaveTempFiles (PE): Add this flag to save intermediate output files from each stage of the indel assembly process.
Options for Target callSmallVariants
The options described below are used to specify analysis for target callSmallVariants.
Table 24 Workflow Options for callSmallVariants
Option (Application): variantsSkipContigs (PE); variantsNoSitesFiles (SE, PE); variantsNoReadTrim (SE, PE); variantsWriteRealigned (SE, PE)
Table 25 Read Mapping Options for callSmallVariants
Option (Application): variantsIncludeAnomalous (PE); variantsIncludeSingleton (PE); variantsSEMapScoreRescue (PE)
Description: By default, information from the assembleIndels module is used and required in pair
118. ed end DNA Sequencing analysis. This option disables use of indel contigs during variant calling and only uses gapped alignment to find indels. Example: variantsSkipContigs
variantsNoSitesFiles: Do not write out the sites.txt.gz files. Example: variantsNoSitesFiles
variantsNoReadTrim: By default, the ends of reads can be trimmed if the alignment path through an indel is ambiguous. This option disables read trimming and chooses the ungapped sequence alignment for any ambiguous read segment. Note that this can trigger spurious SNP calls near indels. Example: variantsNoReadTrim
variantsWriteRealigned: Write only those reads which have been realigned to the bam file sorted.realigned.bam for each reference sequence. Example: variantsWriteRealigned
Description (Table 25 Read Mapping Options for callSmallVariants):
variantsIncludeAnomalous: Include paired-end reads which have anomalous insert size or orientation. Note that variantsSEMapScoreRescue must also be specified, because ELAND gives anomalous reads a PE mapping score of zero.
variantsIncludeSingleton: Include paired-end reads which have unmapped mate reads. Note that variantsSEMapScoreRescue must also be specified, because ELAND gives singleton reads a PE mapping score of zero.
variantsSEMapScoreRescue: Include reads if they have an SE mapping score equal to or above that set by the QVCutoffSingle option, even if the read pair fails the PE mapping score threshold.
Table 26 SNP and Indel Options for callSmallVariants
Option (Application): Description
variantsNoCovCutoff (SE, PE): Disables the SNP and indel coverage filters
119. ence directory is write-protected.
--force: Forces existing output files to be overwritten.
--input-type <input type> (-it): Type of input file (FASTQ, FASTA, export, or qseq format).
--log <path to log> (-l): The path to the log file. Default: ELAND_standalone.log
--output-directory <output dir> (-od): The output directory.
--output-prefix <prefix> (-op): Produces a set of output files with a prefix of this value (default value is reanalysis).
--kagu-options <options> (-ko): Indicates paired-read analysis parameters to pass to alignmentResolver (e.g., -ko -c enables circular reference sequence support). Multiple arguments must be contained in quotation marks.
--remove-temps (-rt): Removes all files except exports, BAM files, and log files upon successful completion.
--seed-length <value> (-sl): Length of read substring (seed) used for ELAND alignment; defaults to the lower of read length and 32. Use twice for paired-end data sets.
--use-bases <value> (-ub): Expanded mask to apply to the FASTQ file (two values if paired analysis). Defaults to Y*n.
--help (-h): Shows help text.
NOTE: The orphan aligner is always enabled when performing paired-end analysis with ELAND standalone, just like configureAlignment.
The orphan aligner is always enabled when performing paired end analysis with ELAND standalone just like GE
120. end (eland_extended), paired-end (eland_pair), and single-end RNA (eland_rna) analysis. The default behavior of configureAlignment.pl is to perform a multiseed gapped alignment. This allows small indels (up to 10 nt) to be identified during alignment: a gap of up to 10 bases can be opened during seed extension. DNA: The eland_extended and eland_pair analysis modes can be used to align reads to a genome. The types of experiments supported include genome resequencing, exome capture, targeted capture, and ChIP-Seq. Methylation: There is currently no support for aligning Bisulfite-Seq data with ELAND. RNA: eland_rna aligns transcriptome data. Transcript data is limited to single reads that cross at most one splice junction. eland_rna cannot align paired-end data. For paired-end transcriptome data, a third-party tool such as Bowtie/TopHat is recommended. Variant Analysis: Variant analysis and RNA counting are controlled by the configureBuild.pl script. The script can be used to describe the following types of variation. Site genotypes and SNPs: Homozygous and heterozygous single nucleotide variants (SNPs) are called using a Bayesian site genotyping model, which takes into account the base calls, quality scores, and alignment scores of the reads at the given position. Indels: Indels are called using a two-stage process. First, contigs are assembled from poorly aligned anomalous reads and aligned back to the
121. enome (Getting Reference Files on page 128). Single-Sequence FASTA Files: CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name. Direct CASAVA to a folder containing the FASTA files using the option --refSequences PATH for variant detection and counting. Multi-Sequence FASTA Files: As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single-genome, SAM-compliant, unsquashed file for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in the header for each sequence. Direct CASAVA to the multi-sequence FASTA file using the option --samtoolsRefFile FILE for variant detection and counting. WARNING: GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files. Chromosome Naming Restrictions: CASAVA does not accept the following characters in the chromosome name: [character list illegible]. refFlat.txt.gz or seq_gene.md.gz File: CASAVA 1.8 generates the non-overlapping exon coordinates set automatically using the refFlat.txt.gz file from UCSC or the seq_gene.md.gz file from NCBI. They should be from the same build as the reference files
122. ent Input Files, 48; Running configureAlignment, 53; configureAlignment Output Files, 73; Running ELAND as a Standalone Program, 85. Introduction: The CASAVA module configureAlignment performs sequence alignments. This chapter describes running configureAlignment, parameters, analysis variables, configuration file options, and ELANDv2e alignments. NOTE: For installation instructions, see Requirements and Software Installation on page 111. Configuring configureAlignment: You can define configureAlignment analysis parameters in a configuration file or on the command line. Command-line arguments take precedence over parameters set in the configuration file. For a full description of analysis parameters and variables, see configureAlignment Parameters Detailed Description on page 61. configureAlignment uses multiple analysis parameters. Therefore, it is recommended to include the parameters in a configuration file and provide that file as input to configureAlignment. configureAlignment and Align As You Go: Bcl conversion supports alignment of the first read of a paired-end run before completion of the run (align as you go). You can kick off alignment for read 1 using the target r1 when running make, at any time after Bcl
123. ented in a range of locations, the caller attempts to report it in the left-most position possible. String summarizing the indel type. One of: nI, insertion of length n (e.g. 10I is a 10-base insertion); nD, deletion of length n (e.g. 10D is a 10-base deletion); BP_LEFT, left-side breakpoint; BP_RIGHT, right-side breakpoint. Segment of the reference sequence 5' of the indel event. For right-side breakpoints this field is set to the value N/A. Equal-length sequences corresponding to the reference and indel alleles which span the indel event. The character "-" indicates a gap in the sequence of the reference or the indel allele. Segment of the reference sequence 3' of the indel event. For left-side breakpoints this field is set to the value N/A. Phred-scaled quality score of the indel, which refers to the probability that this indel does not exist at the given position. The Q values given reflect only those error conditions that can be represented in the indel calling model, which is not comprehensive. See also Quality Scores on page 148. By default, the variant caller reports all indels with Q(indel) > 0. Most probable indel genotype. The indel genotype categories are as follows: hom refers to a homozygous indel; het refers to a heterozygous indel; ref refers to no indel at this position. Note that these do not refer to true genotypes where indels overlap, because the model is not capable of jointly calling overlapping indels
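The indel type strings and Phred-scaled Q values described in this entry can be decoded in a few lines. The following Python sketch is illustrative only: the function names `parse_indel_type` and `phred_to_error_prob` are hypothetical and are not part of CASAVA. It assumes that the type field holds exactly one of the four forms listed (nI, nD, and the two breakpoint labels, written here with an underscore) and that Q follows the standard Phred relation Q = -10 * log10(p).

```python
import re


def parse_indel_type(type_field):
    """Split an indel type string into (kind, length).

    Assumption: breakpoint labels are spelled BP_LEFT / BP_RIGHT.
    Breakpoints carry no length, so None is returned for them.
    """
    if type_field in ("BP_LEFT", "BP_RIGHT"):
        return type_field, None
    match = re.fullmatch(r"(\d+)([ID])", type_field)
    if match is None:
        raise ValueError("unrecognized indel type: %r" % type_field)
    length = int(match.group(1))
    kind = "insertion" if match.group(2) == "I" else "deletion"
    return kind, length


def phred_to_error_prob(q):
    """Convert a Phred-scaled quality into the probability it encodes."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    print(parse_indel_type("10I"))
    print(parse_indel_type("10D"))
    print(phred_to_error_prob(20))
```

Under these assumptions, a Q(indel) of 20 corresponds to a 1 in 100 probability that the reported indel does not exist at that position.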
124. er of reads against a genome. As of CASAVA 1.6, a new version of ELAND is available: ELANDv2. The most important improvement of ELANDv2 is its ability to perform multiseed and gapped alignments. As a consequence, ELANDv2 handles indels and mismatches better. CASAVA 1.8 also contains a new version of ELAND, ELANDv2e, with an orphan aligner, repeat resolution, and performance enhancements. Input and Output Files: For a detailed description of the input and output files for ELANDv2e, see configureAlignment Input Files on page 48 and configureAlignment Output Files on page 73. ELANDv2 Algorithm Description. Multiseed and Gapped Alignment: ELANDv2 introduces multiseed and gapped alignments. Multiseed alignment works by aligning the first seed of 32 bases and consecutive seeds separately. Gapped alignment extends each candidate alignment to the full length of the read using a gapped alignment method that allows for gaps of up to 10 bases. A match descriptor string in the output file (see Output File Formats on page 1) encodes which bases in the read matched the genome and which were mismatches, and reports the gaps using the escape sequence (see Export.txt.gz on page 79). The differences between gapped and ungapped alignments, and between singleseed and multiseed alignments, are illustrated below. Figure 19: Ungapped Versus Gapped Alignment. Ungapped Alignmen
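To make the idea of "the first seed of 32 bases and consecutive seeds" concrete, the Python sketch below cuts a read into consecutive, non-overlapping seeds. It is not ELAND's implementation: the function name `split_into_seeds` is hypothetical, and the handling of the final partial seed (kept only if it has at least `min_seed` bases, with 16 chosen as an illustrative default) is an assumption made for this example.

```python
def split_into_seeds(read, seed_length=32, min_seed=16):
    """Cut a read into consecutive, non-overlapping seeds.

    Returns a list of (offset, seed) tuples. A trailing fragment shorter
    than min_seed is dropped; that cutoff is an assumption of this sketch.
    """
    seeds = []
    for start in range(0, len(read), seed_length):
        seed = read[start:start + seed_length]
        if len(seed) >= min_seed:
            seeds.append((start, seed))
    return seeds


if __name__ == "__main__":
    read = "ACGT" * 25  # a 100-base read
    for offset, seed in split_into_seeds(read):
        print(offset, len(seed))
```

Each seed would then be aligned separately, and each candidate hit extended to the full read length with the gapped method described in the text.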
125. erate the Aligned analysis folder and subsequently run the analysis. Rerunning the Analysis: The config.txt file used to generate an analysis is copied to the analysis folder, so it can be used by configureAlignment if a reanalysis of the same data is required. Parallelization Switch: If your system supports automatic load sharing to multiple CPUs, you can parallelize the analysis run to <n> different processes by using the make utility parallelization switch: make all -j n. For more information on parallelization, see Using Parallelization on page 119. Nohup Command: You should use the Unix nohup command to redirect the standard output and keep the make process running even if your terminal is interrupted or if you log out. The standard output is saved in a nohup.out file in the location where you are executing the makefile. nohup.out can be used by Illumina Technical Support for troubleshooting should problems arise. nohup make all -j n & The optional & tells the system to run the analysis in the background, leaving you free to enter more commands. Starting Alignment for Read 1: If you want to start alignment before completion of the run, use the makefile target r1. This can be started once Bcl conversion for read 1 has finished (Starting Bcl Conversion for Read 1 on page 35). Set up a regular configureAlignment analysis, but run make using the r1 target, for example: nohup make -j 16
126. es. The files can be derived from the following sources: mitochondrial DNA; ribosomal repeat region sequences; 5S RNA; optionally, other contaminants (for example phiX, if phiX spikes are used). eland_rna squashes the provided FASTA files automatically at the start, similar to the genome sequence files. refFlat.txt.gz or seq_gene.md.gz File (eland_rna): As of CASAVA 1.7, eland_rna uses the refFlat.txt.gz or seq_gene.md.gz file to generate the splice junction set automatically. The refFlat.txt.gz file comes from UCSC, while the seq_gene.md.gz file comes from NCBI; both are available through iGenomes. They should be provided gzip-compressed and should be from the same build as the reference files you are using for alignment. This removes the need to provide separate splice junction sets, as in previous versions of CASAVA. Variant Detection and Counting Reference Files: CASAVA variant detection and counting needs two types of files to analyze RNA Sequencing data: genome sequence files, and a refFlat.txt.gz or seq_gene.md.gz file (RNA Seq). NOTE: CASAVA for DNA sequencing needs only the genome sequence files. Reference Genome: CASAVA uses a reference genome in FASTA format. Both single-sequence FASTA and multi-sequence FASTA genome files are supported. Genome sequence files for most commonly used model organisms are available through iGenome (Getting Reference Files on page 128). Single-Sequence FASTA Files: CASA
127. es on page 128. Getting Reference Files: To run CASAVA, you need to download genome and other reference files. You can use iGenome for most commonly used model organisms, as explained in this section. Illumina-Provided Genomes: Illumina provides a number of commonly used genomes at ftp.illumina.com, along with a reference annotation: Arabidopsis_thaliana, Bos_taurus, Caenorhabditis_elegans, Canis_familiaris, Drosophila_melanogaster, Equus_caballus, Escherichia_coli_K_12_DH10B, Escherichia_coli_K_12_MG1655, Gallus_gallus, Homo_sapiens, Mus_musculus, Mycobacterium_tuberculosis_H37RV, Pan_troglodytes, PhiX, Rattus_norvegicus, Saccharomyces_cerevisiae, Sus_scrofa. You can log in using the following credentials: Username: igenome; Password: G3nom3s4u. For example, download the FASTA, annotation, and bowtie index files for the human hg18 genome with the following command: wget --ftp-user=igenome --ftp-password=G3nom3s4u ftp://ftp.illumina.com/Homo_sapiens/UCSC/hg18/Homo_sapiens_UCSC_hg18.tar.gz Unpack the tar file: tar xvzf Homo_sapiens_UCSC_hg18.tar.gz Unpacking will make its own folder: Homo_sapiens/UCSC/hg18 Abundant Sequence Files (RNA Seq): Process the abundant sequence files the following way: 1. Generate a folder for abundant sequences. 2. Collect FASTA files for abundant sequences in the abundant sequences folder fo
128. esult in longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full repeat resolution can be turned on with the option INCREASED_SENSITIVITY. Performs orphan alignment by identifying read pairs for which only one of the reads aligns. ELANDv2e then tries to align the other read in a defined window (by default 450 bp). Configuring a Paired-Read Analysis: The alignments of the two reads that provide input to the pairing process may be varied by setting ELAND_SEED_LENGTH and ELAND_MAX_MATCHES. Both parameters may be set lane by lane, but the same values apply to each of the two reads in a lane. The paired-read analysis may be configured by passing options to alignmentResolver. This is done by setting the parameter KAGU_PAIR_PARAMS in the configureAlignment configuration file. For additional information, see KAGU_PAIR_PARAMS and KAGU_PARAMS on page 65. KAGU_PAIR_PARAMS can be specified lane by lane. All of the options must be specified on a single line and space-separated, as in the following example: 8:KAGU_PAIR_PARAMS circular muf 0 Using ANALYSIS eland_rna: eland_rna is the ELAND module built specifically for RNA Sequencing and is required to provide the input files for CASAVA. eland_rna delivers the following information: read alignments to the genome; read alignments to splice junctions; read alignments to contaminants. NO
129. eter theta for indels is set to a default value of 1/10,000. Q(indel): Once the candidate indels are identified, the variant caller realigns all intersecting reads to each candidate indel, in addition to aligning the read to the reference and to any alternate indel candidates at the same site. The relative likelihoods of all alignments for each read are used to assign probabilities to each of three possible indel genotypes: homozygous, heterozygous, or not present. The associated quality score, Q(indel), expresses the probability that the non-reference indel allele referred to in the indel call exists in the sample as either a heterozygous or homozygous variant, analogous to Q(snp). SNP Caller Reporting: The SNP caller reports the following files. snps.txt: SNPs for each chromosome are summarized within each chromosome directory in a file called snps.txt. This file contains SNPs that have been called by CASAVA's callSmallVariants module. sites.txt.gz: As part of the SNP calling process, the variant caller also outputs information on coverage and consensus genotype for every mapped site in the genome. These results are found in each chromosome bin directory in a gzip-compressed file called sites.txt.gz. snps.removed.txt: As a final noise filtration step, the SNP calls in the snps.txt files have been filtered to remove SNP calls in regions close to centromeres and other high copy number regions. The SNP calls filtered out by this procedure can b
130. exed samples need to be processed. The FASTQ files for both multiplexed and non-multiplexed samples are organized using the Project and Sample concepts, as governed by the sample sheet. configureAlignment uses the sample sheet to identify projects and samples, and the sample organization described in the sample sheet should always match the actual Unaligned folder organization. As a result of these changes, configureAlignment expects the following input files: an Unaligned directory with fastq.gz files (even for cases where only one project exists); a config.txt file, which specifies the analysis; a base calling config.xml file (DemultiplexedBustardConfig.xml); a FASTA reference genome for alignments. For RNA applications, additional files are required. This section explains these files. For file locations, see the figure below. Note that the reference files may be located in a different location, depending on your CASAVA installation. Figure 11: Locations of configureAlignment Input Files [figure: run folder <ExperimentName>/YYMMDD_machinename_XXXX_FC with Data from RTA or OLB (Intensities and BaseCalls by lane, bcl files by cycle) and the Unaligned file structure generated by Bcl conversion and demultiplexing; iGenomes reference files (Homo_sapiens/UCSC/hg18) with chromosome FASTA and contaminant sequence files]
131. fault. If you want multiple Aligned directories, you have to use the option OUT_DIR to generate a different output directory. Analysis Summary: The results of an analysis are summarized as web pages that enable a large number of graphs to be viewed as thumbnail images. This section is intended to help you interpret the various graphs that appear in an analysis directory. For each project, a Sample_Summary.htm file is produced, which contains comprehensive results and performance measures of your analysis run for a project, per sample. It is located in the Aligned project folder and provides an overview of quality metrics for a project, with links to more detailed information in the form of pages of graphs. For each project, a Barcode_Lane_Summary.htm file is produced, which contains comprehensive results and performance measures of your analysis run for a project, per barcode and lane. It is located in the Aligned project folder and provides an overview of quality metrics for a project, with links to more detailed information in the form of pages of graphs. For each run, a FlowCellSummary.htm file is produced, which contains comprehensive results and performance measures of your entire analysis run across all projects. It is located in the Aligned folder. Sample_Summary Page: For each sample, a Sample_Summary.htm file and a Barcode_Lane_Summary.htm file are produced, which contain comprehensive results and performance measures of your a
132. ference sequence name. The field contains the export.txt.gz file match chromosome value and, if the export.txt.gz file match contig field is not empty, the SAM RNAME field is appended with a character followed by the match contig name. See Export.txt.gz on page 79. MAPQ: Mapping quality (Phred-scaled posterior probability that the mapping position of this read is incorrect). CIGAR: Extended CIGAR string. For a description, see Extended CIGAR Format on page 170. MRNM: Mate Reference sequence NaMe; "=" if the same as <RNAME>. MPOS: 1-based leftmost mate position of the clipped sequence. ISIZE: Inferred insert size. SEQ: query SEQuence; "=" for a match to the reference; n/N/. for ambiguity; cases are not maintained. QUAL: query QUALity; ASCII-33 gives the Phred base quality. TAG: TAG for an optional field. For a description, see Optional Fields on page 171. VTYPE: Value TYPE for an optional field. For a description, see Optional Fields on page 171. VALUE: Match <VTYPE> for an optional field. For a description, see Optional Fields on page 171. Bitwise Flag Values: The FLAG field in the alignment section is a bitwise flag. The meaning of predefined bits is shown in the following table (hexadecimal value, decimal value, description). 0x0001, 1: The read is paired in sequencing, no matter whether it is mapped in a pair. 0x0002, 2: The read is mapped in a proper pair (depends on the protocol, n
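Because FLAG is a bitwise field and QUAL is ASCII-offset text, both are easy to decode programmatically. The Python sketch below is a hedged illustration, not CASAVA code: `decode_flag` and `decode_qual` are hypothetical helper names. Only the first two bits (0x0001 and 0x0002) appear in the excerpt above; the remaining bit meanings are taken from the public SAM specification, and the short labels are chosen here for readability.

```python
# Bit meanings: 0x0001 and 0x0002 are described in the text above;
# the rest follow the SAM specification.
SAM_FLAGS = {
    0x0001: "paired",
    0x0002: "proper_pair",
    0x0004: "unmapped",
    0x0008: "mate_unmapped",
    0x0010: "reverse_strand",
    0x0020: "mate_reverse_strand",
    0x0040: "first_in_pair",
    0x0080: "second_in_pair",
    0x0100: "not_primary",
    0x0200: "fails_qc",
    0x0400: "duplicate",
}


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def decode_qual(qual):
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    return [ord(char) - 33 for char in qual]


if __name__ == "__main__":
    print(decode_flag(99))
    print(decode_qual("II5#"))
```

A FLAG of 99, for example, is 1 + 2 + 32 + 64, so it decodes to a paired read in a proper pair whose mate is on the reverse strand and which is the first read of the pair.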
133. ferences and their insert size: chr20.fa F 14812275 922 108M; chr20.fa R 14812492 922 108M; chr5.fa F 99771317 966 108M; chr5.fa R 99771540 966 108M. Repeat Resolution: ELANDv2e aligns reads in repeat regions using two new modes, semi-repeat resolution and full repeat resolution. Both modes take repetitive hits into account for the multiseed pass of ELAND. Full repeat resolution is more sensitive and places more reads in repeat regions, but results in longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full repeat resolution can be turned on with the option INCREASED_SENSITIVITY. Figure 22: Changes between CASAVA 1.7 and 1.8 in multiseed ELAND alignment [figure panels: 1. Align the first 32 bp of reads; 2. Identify reads that do not align or hit a repetitive sequence; 3. CASAVA v1.7: Align unmapped reads using multiple seeds]
134. for each read which intersect (include) a candidate indel. Select the most likely read realignment for subsequent site counting and genotyping. Further filter individual base calls based on mismatch density or ambiguity (N). Use all remaining base calls to predict site genotypes and SNPs. Filter to remove SNP and indel calls near the centromeres and within high copy number regions. readBases Counting Method: As of version 1.6, CASAVA uses the readBases counting method. This method is used for exon and gene counts, and counts the number of bases that belong to each feature. Both reads that map to the genome and reads that map to splice junctions contribute to the exon base coverage value. NOTE: Before counting, CASAVA splits alignments to a splice junction into two shorter genomic reads. Counts for splice junctions are provided for convenience and correspond to the number of reads that cover the junction point. Bases within reads aligned to the junction are counted only once in the exon counts. The number of bases that fall into the exonic regions of each gene is summed to obtain gene-level counts, normalized according to feature size, and expressed as RPKM (Reads Per Kilobase per Million mapped reads). Exons that overlap exons from other genes on the forward or reverse strand are excluded from counting and are also not included when computing the total gene length. Variant Caller and Counting Detailed Description: For a detailed description
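The RPKM normalization mentioned above can be illustrated with the conventional definition: reads per kilobase of feature per million mapped reads. The Python sketch below uses that textbook formula with a hypothetical function name, `rpkm`. It takes a read count as input; how CASAVA converts its per-feature base counts into the value that is normalized is not stated in this excerpt, so this should be read as an illustration of the unit, not of CASAVA's exact computation.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Textbook RPKM: reads per kilobase of feature per million mapped reads.

    Equivalent to read_count / (length_kb * mapped_reads_in_millions).
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("feature length and mapped read total must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


if __name__ == "__main__":
    # 500 reads on a 2,000 bp gene, in a run with 10 million mapped reads
    print(rpkm(500, 2000, 10_000_000))
```

In this example the gene is 2 kb long and the run has 10 million mapped reads, so 500 reads correspond to 500 / (2 * 10) = 25 RPKM.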
135. g, 99; Analysis Options for sort, 99; Analysis Options for rnaCounts, 100; Analysis Options for [illegible], 100; Global Analysis Options for Variant Detection and Counting, 152; Options for assembleIndels, 153; Workflow Options for callSmallVariants, 154; Read Mapping Options for callSmallVariants, 154; SNP and Indel Options for callSmallVariants, 155; SNP Options for callSmallVariants, 155; Indel Options for callSmallVariants, 156; Illumina General Contact Information, 179; Illumina Customer Support Telephone Numbers, 179. Overview: [illegible entry], 2; CASAVA Features, 5; What's New, 9; Frequently Asked Questions, 10. Introduction: This user guide documents CASAVA 1.8.2 (short for Consensus Assessment of Sequence And VAriation
136. he candidate indel contigs produced by assembleIndels. The procedure is outlined below: Read in read alignments and candidate indel contigs. Filter out read alignments based on quality checks, paired-end anomalies, or ELAND alignment score. Filter out contig alignments containing adjacent insertion/deletion events. Consolidate indel evidence from read and contig alignments to produce a set of candidate indels. Perform local read realignment using candidate indels. Call indels based on the set of alignments for each read which intersect (include) a candidate indel. Select the most likely read realignment for subsequent site counting and genotyping. Further filter individual base calls based on mismatch density or ambiguity (N). Use all remaining base calls to predict site genotypes and SNPs. Filter to remove SNP and indel calls near the centromeres and within high copy number regions. Read Filtering: The variant caller performs an initial read filtering to remove reads from both SNP and indel calling based on the following criteria. Any reads marked as failing primary analysis quality checks (e.g. failing the purity filter) or marked as PCR or optical duplicates are removed from consideration. For paired-end reads, any reads that are not marked as being part of a proper pair are removed from consideration. This is intended to remove any reads from chimeric pairs, with unmapped mates, or with an anomalous pair inser
137. he first three fields. For example, you may want to capture the flow cell number in the run folder name as follows: YYMMDD_machinename_XXXX_FCYYY. NOTE: When publishing the data to a public database, it is desirable to extend the exclusivity globally, for instance by prefixing each machine with the identity of the sequencing center. BaseCalls Directory: Demultiplexing requires a BaseCalls directory as generated by RTA or OLB (Off-Line Basecaller), which contains the binary base call files (.bcl files). NOTE: As of 1.8, CASAVA no longer uses _qseq.txt files as input. The BCL to FASTQ converter needs the following input files from the BaseCalls directory: .bcl files; .stats files; .filter files; .control files; .clocs, .locs, or _pos.txt files (the BCL to FASTQ converter determines which type of position file it looks for based on the RTA version that was used to generate them); RunInfo.xml file (the RunInfo.xml is at the top level of the run folder); config.xml file. RTA is configured to copy these files off the instrument computer to the BaseCalls directory on the analysis server. The files are described below. Bcl Files: The .bcl files can be found in the BaseCalls directory: <run directory>/Data/Intensities/BaseCalls/L<lane>/C<cycle>.1 They are named as follows: s_<lane>_<til
138. hese reads will be ignored during variant calling. Example: --sortKeepAllReads. Minimum SE alignment score to put a read to NM. Default: -1 (-1 means the option is turned off). Ignore unanchored read pairs in indel assembly and variant calling. Unanchored read pairs have a single-read alignment score of 0 for both reads. Example: --ignoreUnanchored. The options described below are used to specify analysis for target sort. Table 19: Analysis Options for sort. rmDup YES|NO (PE): Turn on/off PCR duplicate marking/removal for paired-end reads (default YES). sortBufferSize INTEGER (SE, PE): Buffer size used by the read sorting process, in megabytes (default 1984). sortKeepAllReads (SE, PE): Run the sort module in archival mode instead of the default filtered mode. See Archival Build on page 90. Example: --sortKeepAllReads. Options for Target rnaCounts: The options described below are used to specify analysis for target rnaCounts. Table 20: Analysis Options for rnaCounts. refFlatFile (SE): Name and location of the UCSC refFlat.txt.gz file. The file must be gz-compressed. Example: --refFlatFile /data/Genome/ELAND_RNA/Human/refFlat.txt.gz. seqGeneMdFile (SE): Name and location of the NCBI seq_gene.md.gz file. Example: --seqGeneMdFile /data/Genome/ELAND_RN
139. human8 human TGACCA myTest N 32 7 CB testPRC2. The examples below illustrate the use of DATASET_POST_RUN_COMMAND. DATASET_POST_RUN_COMMAND limited to a PROJECT: the following config file, for PROJECT selection: ANALYSIS none; PROJECT testPRC1 ANALYSIS eland_extended; PROJECT testPRC1 USE_BASES Y*n; PROJECT testPRC1 ELAND_GENOME /illumina/scratch/iGenomes/PhiX/Illumina/RTA/Sequence/Squashed-PhiX-Illumina-RTA; PROJECT testPRC1 DATASET_POST_RUN_COMMAND echo $project $sample $barcode >> out.DPRC.txt; will generate out.DPRC.txt in the Aligned folder, with one row per data set of project testPRC1 [output rows illegible]. DATASET_POST_RUN_COMMAND limited to a LANE: the following config file, for LANE selection: ANALYSIS none; 1:ANALYSIS eland_extended; 1:USE_BASES Y*n; 1:ELAND_GENOME /illumina/scratch/iGenomes/PhiX/Illumina/RTA/Sequence/Squashed-PhiX-Illumina-RTA; 1:DATASET_POST_RUN_COMMAND echo $project $sample $barcode $lane >> out.DPRC.txt; will generate the following out.DPRC.txt in the Aligned folder: testPRC1 phix1 TTAGGC 1; Undetermined_indices lane1 Undetermined 1; testPRC1 human1 GCCAAT 1; testPRC2 human1 CTTGTA 1. POST_RUN_COMMAND: You can also run the workflow-wide POST_RUN_COMMAND from the make command line, for example:
140. i-sequence FASTA genome files are supported. Genome sequence files for most commonly used model organisms are available through iGenome (Getting Reference Files on page 128). NOTE: As of CASAVA 1.8, you no longer need to squash the reference genome. Single-Sequence FASTA Files: CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name. Direct CASAVA to a folder containing the FASTA files using the option ELAND_GENOME for configureAlignment. Multi-Sequence FASTA Files: As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single-genome, SAM-compliant, unsquashed file for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in the header for each sequence. Direct CASAVA to a multi-sequence FASTA file using the option SAMTOOLS_GENOME for configureAlignment. WARNING: GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files. Chromosome Naming Restrictions: CASAVA does not accept the following characters in the FASTA chromosome name header: [character list illegible]. This validation can be disabled in configureAlignment using the following option: CHROM_NAME
141. make all POST_RUN_COMMAND="echo everything is done" Using ANALYSIS eland_extended: ANALYSIS eland_extended is an improved version of the ANALYSIS eland mode that existed in Pipeline and is now deprecated. ANALYSIS eland could align reads longer than 32 bases, but demanded that the first 32 bases of the read have a unique best match in the genome. The position of this match was used as a seed to extend the match along the full length of the read. ANALYSIS eland_extended removes the uniqueness restriction by considering multiple 32-base matches and extending them. Multiseed, Gapped, and Repeat Alignment: ANALYSIS eland_extended provides the following alignment features, implemented in ELANDv2 and ELANDv2e. By default, performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately. Uses a gapped alignment method to extend each candidate alignment to the full length of the read, allowing for gaps (indels) of up to 10 bases. Aligns reads in repeat regions using two new modes, semi-repeat resolution and full repeat resolution. Full repeat resolution is more sensitive and places more reads in repeat regions, but results in longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full repeat resolution can be turned on with the option INCREASED_SENSITIVITY. Configuring ANALYSIS eland_extended: There are three parameters that affect the output of the al
142. ignment: ELAND_SEED_LENGTH1, ELAND_SEED_LENGTH2, and ELAND_MAX_MATCHES. These parameters can be specified lane by lane. The following table describes the parameters for ANALYSIS eland_extended. Table 9: Parameters for ANALYSIS eland_extended. ELAND_SEED_LENGTH1, ELAND_SEED_LENGTH2: By default, the first 32 bases of the read are used as a seed alignment. Setting ELAND_SEED_LENGTH1 to 25 will use 25 bases in read 1 instead of the maximum of 32 for the initial seed alignment. This should increase the sensitivity, since two errors per 25 bases is less stringent than two errors per 32 bases. A read is more likely to be repetitive at the 25-base level than at the 32-base level, so a decrease in ELAND_SEED_LENGTH should probably be used in conjunction with an increase in ELAND_MAX_MATCHES. Setting this to very low values will drastically slow down the alignment and will probably result in many poor-confidence alignments. ELAND_MAX_MATCHES: By default, ANALYSIS eland_extended will consider at most ten alignments of each read. ELAND_MAX_MATCHES allows the maximum number of alignments considered per read to be varied between 1 and 255. Both ANALYSIS eland_extended and ANALYSIS eland_pair produce export files that contain all read, quality value, and alignment information for the analysis. For a detailed description of the export.txt.gz files, see Text Based Analysis Results on page 51.
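The remark that two errors per 25 bases is less stringent than two errors per 32 bases is simple arithmetic on the tolerated mismatch fraction. The Python sketch below makes it explicit; `tolerated_mismatch_fraction` is a hypothetical helper, and the limit of two errors per seed is taken from the sentence above, not from any other knowledge of ELAND internals.

```python
def tolerated_mismatch_fraction(seed_length, max_errors=2):
    """Fraction of seed bases that may mismatch while the seed still matches."""
    if seed_length <= 0:
        raise ValueError("seed length must be positive")
    return max_errors / seed_length


if __name__ == "__main__":
    for seed_length in (32, 25):
        fraction = tolerated_mismatch_fraction(seed_length)
        print(seed_length, "bases:", round(fraction * 100, 2), "% mismatches tolerated")
```

With a 32-base seed the tolerated fraction is 2/32 = 6.25%, and with a 25-base seed it is 2/25 = 8%, which is why a shorter seed admits more candidate alignments and is best paired with a higher ELAND_MAX_MATCHES.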
143. ijn graph of m symbols is a graph representing overlaps between sequences. De Bruijn graphs are used for de novo assembly of short read sequences into a genome.
deprecated: Deprecated refers to software features that are superseded and should be avoided. Although deprecated features remain in the current version, their use may raise warning messages, and deprecated features may be removed in the future. Features are deprecated rather than removed in order to provide backward compatibility and to give programmers who have used the feature time to bring their code into compliance with the new standard.
K
kmer hashing: Hashing refers to the use of subfragments of a particular read to find matching pieces of DNA in a hash table; k-mer means the size of the fragment used for hashing.
O
orphan reads: An orphan read is the unaligned part of paired reads for which only one read aligned. Identical to shadow read.
S
shadow read: A shadow read is the unaligned part of paired reads for which only one read aligned. Identical to orphan read.
singleton: A singleton is the aligned part of paired reads for which only one read aligned.
W
wrapper script: A wrapper script is a script whose main purpose is to call a second function in a computer program, with little or no additional computation.
Index
A: abundant sequences, All.htm file, analysis output
144. [Figure: configureAlignment input files. An Unaligned folder with Project and Sample directories holding fastq.gz files, an Undetermined_Indices directory, Basecall_Stats_FC, SampleSheet.csv, DemultiplexConfig.xml, and BustardSummary.xml, plus annotation (genes) and FASTA genome files.]
Sequence Files
configureAlignment needs an Unaligned directory as generated by the BCL to FASTQ converter, which contains the gzipped sequence files (fastq.gz files).
NOTE: As of CASAVA 1.8, configureAlignment uses FASTQ input files instead of _qseq.txt files. For a description of the FASTQ files, see FASTQ Files on page 39.
Configuration File
The configureAlignment configuration file, generally named config.txt, specifies what analysis should be done for each lane. The requirements and options for the configureAlignment configuration file are described in configureAlignment Configuration File on page 54.
Sample Sheet
The SampleSheet.csv file describes the samples and projects in each lane, including the indexes used. It is derived from the user-generated sample sheet that is required for bcl conversion and demultiplexing. The sample sheet should be located in the Unaligned directory of the run folder. The sample sheet has to match the directory structure created during the bcl conversion a
145. illumina CASAVA v1.8.2 User Guide [cover page; decorative DNA sequence artwork]
146. is the default value if no port number is specified. The utility nmap, if installed, may help you identify which port on a server is hosting an SMTP service.
5 Test your email reporting by entering the following from the machine where you are running configureAlignment:
telnet yourserver yourPortNumber
If you don't get a friendly message, then email reporting will not work. You can run runReport.pl directly in test mode by entering:
runReport.pl test yourserver 25 yourdomain.com anything your.name@yourdomain.com
You should receive a test email. If you do not, the transcript it generates should identify the problem.
NOTE: The optional email reporting feature depends on how your SMTP servers are set up locally. Email reporting is not required to run configureAlignment to a successful completion.
Using Parallelization
Make Utilities
Parallelization is built around the ability of the standard make utility to execute in parallel across multiple processes on the same computer. configureAlignment also provides a series of checkpoints and hooks that enable you to customize the parallelization for your computing setup. See Customizing Parallelization on page 121 for details.
Standard Make
The standard make utility has many limitations but i
147. is the tile number.
<sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>_name.txt
Table 11 Intermediate Output File Descriptions
Output file: _eland_extended.txt. configureAlignment analysis mode: ANALYSIS eland_extended, ANALYSIS eland_pair. Description: Contains the corrected alignment positions and the full alignment descriptions for >32 base reads. This file is not purity filtered.
Output file: _extended_contam.txt. configureAlignment analysis mode: ANALYSIS eland_rna. Description: Alignments to the ELAND_RNA_GENOME_CONTAM.
Output file: _extended_splice.txt. configureAlignment analysis mode: ANALYSIS eland_rna. Description: Alignments to the splice junctions.
Table 12 Intermediate Output File Formats
Output file: s_N_TTTT_align.txt. Format: Deprecated sequence alignment format.
Output file: s_N_TTTT_realign.txt. Format: Space-separated text values:
1 Sequence
2 Best score
3 Number of hits at that score
4 The following columns only appear if hits equal 1 (a single unique match):
5 Target:pos
6 Strand
7 Target sequence
8 Next best score
Interpretation of configureAlignment Run Quality
After the analysis of a run is complete, you need to interpret the data in the report summary and various graphical outputs. This section describes a standard, systematic way to examine your data. The starting point is to know what a standard run of acceptable quality looks like. This is something of a moving target and is dependent on individual in
148. ise specified in the sample sheet, samples without index will end up in the project folder Undetermined_indices and in a sample folder named after the lane (e.g. Sample_lane1). If you want to specify analysis for these samples without index other than the global analysis, you can use the identifiers PROJECT Undetermined_indices or SAMPLE lane1.
NOTE: Normally you would want to use PROJECT Undetermined_indices ANALYSIS none or REFERENCE unknown ANALYSIS none to avoid wasting CPU time on the Undetermined_indices data, which often is of poor quality.
Config.txt Examples
The configureAlignment configuration file, generally named config.txt, specifies what analysis should be done for each lane. Some examples for DNA Sequencing analysis are shown below.
Assignment by Lane
If you want to:
Use as reference single FASTA files from human genome build hg18 in your <GenomesFolder>
Align paired-end data from lanes 1, 2, and 3
Use all bases except the last one for both reads
Generate the following config.txt file:
ELAND_GENOME <GenomesFolder>/iGenomes/Homo_sapiens/UCSC/hg18/Sequence/Chromosomes
123:ANALYSIS eland_pair
123:USE_BASES Y*n,Y*n
Assignment by PROJECT
If you instead want to align the samples from your project named Project1, generate the following config.txt file:
ELAND_GENOME <GenomesFolder>/iGenomes/Homo_sapiens/UCSC/hg
149. ismatches. The alignment score of a read is computed from the p-values of the candidate alignments. The candidate with the highest p-value is the best candidate, and its alignment score is its p-value as a fraction of the sum of the p-values of all the candidates. This is also known as a Bayes' Theorem inversion. The alignment score is expressed on the Phred scale, i.e. Q20 corresponds to a 1% chance of the alignment being wrong, Q30 to 0.1%, etc.
For example, if there are two candidates for a read with p-values 0.9 and 0.3, the alignment score calculation would be as follows:
0.9 / (0.9 + 0.3) = 0.75 = chance of highest scoring alignment being right
1 - 0.75 = 0.25 = chance of highest scoring alignment being wrong
Expressed on the Phred scale: Alignment score = -10 log10(0.25) ≈ 6
NOTE: The alignment score of a read and the p-values of the candidate alignments for the read are not the same. The former is computed from the latter.
Rest-of-Genome Correction
If only one candidate alignment is found, the scoring scheme above would give an infinite Phred score. MAQ deals with this by giving such cases an arbitrary high score of 255. ELANDv2e uses a constant, known as the rest-of-genome correction, that depends on the average base quality of the read, the read length, and the size of the genome. This gives a scoring scheme with the following properties: Single candidate alignments for longer reads will score more highly than single candidate alignments of sho
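The worked example above (p-values 0.9 and 0.3) can be reproduced with a short Python sketch. This is an illustration of the arithmetic only, not CASAVA code: the function name alignment_score is hypothetical, and the rest-of-genome correction is not modeled, so a single candidate simply yields infinity.

```python
import math


def alignment_score(p_values):
    """Phred-scaled alignment score from candidate p-values (hypothetical helper).

    The best candidate's p-value is taken as a fraction of the sum of all
    candidate p-values; the score is -10 * log10 of the chance of being wrong.
    """
    if not p_values:
        raise ValueError("at least one candidate alignment is required")
    p_right = max(p_values) / sum(p_values)
    p_wrong = 1.0 - p_right
    if p_wrong <= 0.0:
        # Single candidate: the plain formula diverges. The rest-of-genome
        # correction described in the guide is NOT modeled in this sketch.
        return math.inf
    return -10.0 * math.log10(p_wrong)


score = alignment_score([0.9, 0.3])
print(round(score, 2))
```

For the two candidates in the text the printed value rounds to the score of about 6 given above.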
150. l,human,ATCACG,desc1,N,R1,name,Proj1
12345AAXX,1,sample2,human,CGATGT,desc2,N,R1,name,Proj1
12345AAXX,2,sample3,rat,TTAGGC,desc3,N,R1,name,Proj2
12345AAXX,2,sample4,mouse,TGACCA,desc4,N,R1,name,Proj3
then this will initiate an eland_pair analysis for all human samples (sample1 and sample2) and use the global analysis eland_rna for all other samples (sample3 and sample4). This allows you to set the analysis, reference genome, and all other ELAND parameters project by project, reference by reference, sample by sample, or barcode by barcode.
Combining Specificity
It is also possible to combine specific analyses, as in this example:
12:REFERENCE human ANALYSIS eland_pair
which tells configureAlignment to perform eland_pair analysis on the human reference samples from lanes 1 and 2.
Priority
If multiple specific settings conflict, configureAlignment uses the following order of priority:
1 PROJECT
2 REFERENCE
3 SAMPLE
4 BARCODE
5 Lane
6 Global settings
This means PROJECT settings override any other settings, while REFERENCE settings can only be overruled by PROJECT settings, and so on.
WARNING: The attribute cannot be set for more than one scope at a time. In other words, the following is not allowed:
PROJECT test BARCODE ACGT ANALYSIS eland_extended
Additional Examples
Some more examples are listed below
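As an aside on the Priority rules above (this sketch is not one of the guide's own examples), the lookup order can be modeled in a few lines of Python. The names PRIORITY and resolve_setting are hypothetical, and the dictionary stands in for whatever settings a config.txt assigns to a given dataset.

```python
# Scopes in the order of priority listed in the text, highest first.
PRIORITY = ["PROJECT", "REFERENCE", "SAMPLE", "BARCODE", "LANE", "GLOBAL"]


def resolve_setting(values_by_scope):
    """Return (scope, value) for the highest-priority scope that set a value.

    values_by_scope maps a scope name to the value configured at that scope
    for one dataset, e.g. {"GLOBAL": "eland_rna", "REFERENCE": "eland_pair"}.
    """
    for scope in PRIORITY:
        if scope in values_by_scope:
            return scope, values_by_scope[scope]
    raise KeyError("no scope provides a value for this setting")


# A human sample: the REFERENCE-level analysis overrides the global one.
print(resolve_setting({"GLOBAL": "eland_rna", "REFERENCE": "eland_pair"}))
# A rat sample: only the global analysis applies.
print(resolve_setting({"GLOBAL": "eland_rna"}))
```

This mirrors the sample sheet case above, where human samples get eland_pair and all other samples fall back to the global eland_rna.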
151. l jobs belonging to the step have finished. Finally, hooks are provided upon completion of the step to issue user-defined external commands.
Parallelization Limitations
The analysis works on a per-file basis, so the maximum degree of parallelization achievable is equal to the total number of files generated during demultiplexing. However, some parts of configureAlignment operate on a per-lane basis and a few parts on a per-run basis, which means that scaling will cease to be linear at some stage for more than 8-way parallelization.
The parameter ELAND_FASTQ_FILES_PER_PROCESS affects the maximum level of parallelism available for ELAND. If all sequence information is stored in a total of 64 files, a value of 32 will lead to 2 processes, 8 to 8 processes, 4 to 16 processes, etc. These numbers of processes are doubled for paired-end runs.
Memory Limitations
CASAVA requires a minimum of 2 GB RAM per core. The parameter ELAND_FASTQ_FILES_PER_PROCESS in the configureAlignment config.txt specifies the maximum number of tiles aligned by each ELAND process. The optimal value is such that there are approximately 10 to 13 million lines (reads) in one set. For additional information, see Sequence Alignment on page 45.
Reference Files
Introduction
ELAND Reference Files
Variant Detection and Counting Reference Files
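The relationship between ELAND_FASTQ_FILES_PER_PROCESS and the number of ELAND processes, described under Parallelization Limitations, can be checked with a small Python sketch. The function name eland_process_count is hypothetical, and rounding up when the file count is not an exact multiple is an assumption; the guide only gives evenly divisible examples.

```python
import math


def eland_process_count(total_files, files_per_process, paired_end=False):
    """Estimate the number of ELAND processes (hypothetical helper).

    total_files       -- number of FASTQ files holding the sequence data
    files_per_process -- value of ELAND_FASTQ_FILES_PER_PROCESS
    paired_end        -- the text says process counts double for paired-end runs
    """
    if total_files < 0 or files_per_process < 1:
        raise ValueError("need total_files >= 0 and files_per_process >= 1")
    # Assumption: a partially filled last set still needs its own process.
    processes = math.ceil(total_files / files_per_process)
    return 2 * processes if paired_end else processes


for value in (32, 8, 4):
    print(value, eland_process_count(64, value), eland_process_count(64, value, paired_end=True))
```

With 64 files this reproduces the 2, 8, and 16 processes quoted in the text for single reads, and twice those numbers for paired-end runs.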
152. Condition: ELANDv2 from other genetic material, resulting in an inability to align data. Possible cause: l show as early cycle error rates. If error rates remain fairly constant with cycle, then the correct genome has probably sequenced correctly. Non-smooth error rate plots or IVC plots indicate the presence of specific tags or sequences.
Percentage Mismatch Rate of Clusters Passing Filters
This value should be as low as possible, but it is very dependent on read length. If there is a sudden rise beyond cycle 32, then it is likely that ELANDv2e has effectively filtered out many clusters with more than two errors, thus suppressing the true error rate up to this point. The percentage aligning will also be low.
IVC.htm
For a detailed description of the plots found in the IVC.htm file, see IVC Plots on page 76.
Condition: Intensity curves are not smooth. Possible cause: Cycle-to-cycle focus or fluidics problems.
Condition: Called intensities are not equal. Called may be 5% out without major problems. Possible cause: Poor fluidics or poorly blocked flow cell. If from cycle 1, initial matrix estimate may also be in error.
All.htm and Mismatch.htm
The results in both files should show consistency from tile to tile down a lane, and from lane to lane if the results are from the same sample.
Condition: Tile variability
Condition: Rising mismatch rates. Rates will always rise eventu
153. lation is reported as two quality scores. The first of these scores, Q(indel), expresses the probability that the indel is present in the sample as either a heterozygous or homozygous variant. The second score, Q(max_gt), expresses the probability that the most probable indel genotype, reported as the value max_gt, is correct. To accommodate diverse applications, the CASAVA variant caller does not filter out low-confidence calls and thus prints out all indels with a Q(indel) value of 1 or greater. Summary statistics for indels are generated for a subset of higher-confidence indels; by default, any indel with Q(indel) of 20 or greater is summarized in CASAVA's reports.
Note that for calls with a very low Q(indel) score, it is possible that the most likely genotype will be ref, indicating that the indel is not present. This should be interpreted to mean that there is a non-trivial probability of the indel existing as a heterozygous variant at this site, but that the indel is more likely to be absent from the sample than present.
The predicted Q-scores reflect only those error conditions that are represented in the genotype calling model, which is not comprehensive. The model accounts for basecalling error, diploid chromosome sampling, a spurious indel rate, and an approximation of read mapping error. However, note that artifactual indel signatures could still arise due to complex overlapping variants, atypical sample prepa
154. latter, including relative orientation and separation (insert size) of partner read alignments. If the criteria for paired alignment are not met, the subset of tables reporting paired alignment results are replaced with the statement "Paired alignment not performed". When this happens, CASAVA builds for these paired reads cannot be performed without first rerunning configureAlignment.pl and adjusting parameters such as min percent unique pairs and min percent consistent pairs to produce acceptable paired data and summaries.
The following sections are displayed in Additional Paired Statistics.
Relative Orientation Statistics
The relative orientation of a pair is the orientation of read 2 relative to the orientation of read 1, based on the definition that the read 1 orientation is forward. The relative orientation is defined as positive if the read 2 position is greater than the read 1 position. These statistics are given only for those pairs in which both reads were individually uniquely aligned, since these are the reads used to determine the predominant relative orientation. Other orientations are considered anomalous and are filtered out. The symbols used in the column headings are intended as a visual reminder of the definitions of the four possible relative orientations. In the example below, the nominal orientation is correctly computed as the two reads pointing to each other, as expected for the standard Illumina short insert p
155. le. The default value is 4.
Using DATASET_POST_RUN_COMMAND
DATASET_POST_RUN_COMMAND will be invoked at completion of DATASET alignment and may be constructed of a single shell call or multiple shell calls (the latter separated by semicolons). The following variables derived from the SampleSheet will be available (please use brackets properly): project, sample, barcode, lane.
Assuming we use the following SampleSheet:
FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,Project
B809UWABXX 1TILE DMX,1,human1,human,GCCAAT,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,1,human1,human,CTTGTA,myTest,N,32 7,CB,testPRC2
B809UWABXX 1TILE DMX,1,phix1,phix,TTAGGC,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,2,human1,human,GCCAAT,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,2,phix2,phix,TTAGGC,myTest,N,32 7,CB,testPRC2
B809UWABXX 1TILE DMX,3,human1,human,TGACCA,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,4,phix4,phix,TTAGGC,myTest,N,32 7,CB,testPRC3
B809UWABXX 1TILE DMX,5,human1,human,TGACCA,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,5,human5,human,GCCAAT,myTest,N,32 7,CB,testPRC2
B809UWABXX 1TILE DMX,6,human1,human,CGATGT,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,7,human1,human,CGATGT,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,8,human1,human,CGATGT,myTest,N,32 7,CB,testPRC1
B809UWABXX 1TILE DMX,8
156. ls view -h file.bam
When a BAM file is created for each chromosome, these files are placed in the bam directory immediately under the Parsed chromosome directory. For example, the BAM file for chromosome 1 in a human build would be located here:
Project_Dir/Parsed_NN-NN-NN/c1.fa/bam
When one BAM file is created for the entire genome using the target bam, it can be found in:
Project_Dir/genome/bam
A set of auxiliary files is created with the whole-genome BAM file to facilitate use in downstream packages such as SAMtools or the Broad IGV. These files are:
sorted.bam: the bam file itself
sorted.bam.bai: index of the bam file
sorted.bam.fa.gz: gzipped fasta file containing the reference sequence(s)
For a description of the BAM format, see samtools.sourceforge.net. The format of SAM files is described in SAM Format on page 169. BAM files are the binary equivalent of SAM files, and Illumina's BAM convention has the following features:
The new private optional tag XC has been added to provide read status information normally conveyed in the chromosome field of the export.txt file for unmapped reads. Specifically, XC:Z:QC is used to mark an ELAND QC failure read, XC:Z:RM is used to mark an ELAND repeat mask read, and XC:Z:CONTROL is used to mark a control read. No optional field is added to reads which are marked as no match (NM) in the export file; it is under
157. KEEP_INTERMEDIARY Option
The option KEEP_INTERMEDIARY tells CASAVA not to delete the intermediary alignment files in the alignment Temp dir after alignment is complete. This is a make option and needs to be used when you run make. For example:
nohup qmake -cwd -v PATH -q genexpr -- -j 32 all KEEP_INTERMEDIARY=yes &
configureAlignment Output Files
The configureAlignment output files contain run information, statistical analysis, sequence information, and alignment information. They are described below.
Figure 12 Run Folder after configureAlignment Analysis
[Figure: run folder <ExperimentName>/YYMMDD_machinename_XXXX_FC, showing input files from RTA or OLB (Intensities, BaseCalls), the Unaligned file structure generated by Bcl conversion and demultiplexing (Project and Sample directories with fastq.gz files, Basecall_Stats_FC, Undetermined_Indices), and the file structure generated by alignment (Flowcell Summary htm, Project and Sample directories with export.txt.gz files, Summary_Stats_FC with Barcode_Lane Summary htm).]
NOTE: There can be only one Aligned directory by de
158. mary input, and does not support the _qseq.txt format. For _qseq.txt files, use an older version of CASAVA or convert the _qseq.txt format as described in Qseq Conversion on page 159.
Supporting Software
There are a number of software applications that support CASAVA:
The Off-Line Basecaller (OLB) is an alternative for the on-instrument base calling by RTA.
The Analysis Visual Controller (AVC) provides a GUI for running CASAVA, and is especially convenient for users not proficient with running applications through the Linux command line.
GenomeStudio contains modules for viewing the data analysis results in the genomic context. These modules are the GenomeStudio ChIP Sequencing Module, DNA Sequencing Module, and RNA Sequencing Module.
The Sequencing Analysis Viewer (SAV) allows you to view primary analysis metrics from the sequencing instrument.
To download these applications and their documentation, go to http://www.illumina.com or https://icom.illumina.com.
NOTE: If you do not have an Illumina customer account, register as a new user. It may take up to three business days for initial review of the application.
CASAVA Features
The CASAVA 1.8 package processes sequencing reads provided by RTA or OLB. CASAVA can generate the following data:
Sample-specific reads from multiplexed flow cells
Aligned reads
SNP calls
Indel calls
Expres
159. mation.
-cwd tells the job to run in the current directory.
-v PATH passes the job the path to the executables needed for CASAVA.
-- tells the job to pass everything after the -- to the make command.
-j tells qmake how many tasks to run at the same time. Qmake will then submit this number of tasks to the SGE queue. As tasks finish, more tasks will be submitted.
The number after the -j should be adjusted depending on the size of the system and the number of users sharing it. This method uses resources efficiently, but job monitoring and management is harder. If you need to kill a job, you have to kill each of these tasks individually.
Slots Dedicated Upfront
The second method uses a parallel environment, where a number of slots are dedicated upfront and the tasks are run on these slots.
1 Move into the output folder.
2 Create a script file which contains the following:
qmake -cwd -v PATH -inherit -- all
3 Submit the jobs to the SGE:
qsub -cwd -v PATH -pe make 32 <script file>
In addition to the options described above, this method uses the following options:
-pe make says to run in the parallel environment; make is the default one.
The number after the word make says how many slots the job needs to run. If you set this number too high, you may have to wait a long time for them all to become free. It will never run if you set it to more slots than you have on your system. The more slots you use, the quicker y
160. matted reports of SNP and indel calls.
assembleIndels Algorithm
The assembleIndels module (Grouper) runs only during paired-read DNA CASAVA builds. In CASAVA v1.8 it uses orphan reads and anomalous read pairs to detect indels. Grouper detects indels in five stages:
1 Compute clusterings of non-aligned orphan reads.
2 Compute clusterings of anomalous read pairs with an insert size that is anomalously large (possible deletion) or small (possible insertion).
3 Combine clusters that appear to correspond to the same event.
4 Assemble them into contigs.
5 Align the contigs back to the genome, using the positions of associated singleton reads to narrow the search to a couple of thousand bp or so.
Variant Caller Methods
The callSmallVariants module calls SNPs and small indels from the sorted alignment files (sorted.bam) and, optionally, also from the candidate indel contigs produced by assembleIndels. The procedure is outlined below:
Read in read alignments and candidate indel contigs.
Filter out read alignments based on quality checks, paired-end anomalies, or ELAND alignment score.
Filter out contig alignments containing adjacent insertion/deletion events.
Consolidate indel evidence from read and contig alignments to produce a set of candidate indels.
Perform local read realignment using candidate indels.
Call indels based on the set of alignments
161. n Input Files
Running Bcl Conversion and Demultiplexing
Bcl Conversion Output Folder
Bcl Conversion and Demultiplexing
Introduction
As of CASAVA 1.8, configureAlignment uses FASTQ files as input. Since Illumina sequencing instruments generate bcl files as primary sequencing output, CASAVA contains a BCL to FASTQ converter that combines these per-cycle bcl files from a run and translates them into FASTQ files.
NOTE: As of 1.8, CASAVA uses bcl as primary input, and does not support the _qseq.txt format. For _qseq.txt files, use an older version of CASAVA or convert the _qseq.txt format as described in Qseq Conversion on page 159.
CASAVA 1.8 can start with bcl conversion and alignment as soon as the first read has been sequenced completely. In addition to generating FASTQ files, CASAVA uses a user-created sample sheet to divide the run output in projects and samples, and stores these in separate directories. If no sample sheet is provided, all samples will be put in the Undetermined_Indices directory by l
162. n Quality on page 83.
configureAlignment Configuration File
This section describes the features and parameters of the configureAlignment configuration text file. The configureAlignment configuration file specifies the analysis for each lane, sample, project, reference, or index (barcode). The configureAlignment configuration file is a text file, and the path to the file should be the first argument after the configureAlignment.pl command. configureAlignment translates the analysis in the configuration file into a makefile. The makefile specifies exactly what commands will be executed to carry out the requested analysis. As part of the creation of the Aligned output folder, the configureAlignment configuration file is copied to the Aligned output folder using the filename config.txt. Some sites use standard configuration files, which may be stored in a central repository.
Config File Parameter List
The following tables list the parameters that can be specified in a configureAlignment configuration file. The section configureAlignment Parameters Detailed Description on page 61 provides a detailed description of these parameters.
Core Parameters
Table 2 GERALD Configuration File Core Parameters
Parameter: EXPT_DIR data/110113_ILMN-1_0217_FC1234/Unaligned
Definition: Provide the path to the experiment (demultiplexed) directory in the run folder, if not specified on the command line. Usually the output folder from the BCL to FASTQ
163. n a read within a window of 1 + 2 × variantsMDFilterFlank positions encompassing the current position. The default value for variantsMDFilterCount is 2, and for variantsMDFilterFlank is 20. Set either value to less than 0 to disable the filter. Example: variantsMDFilterCount 3
Option: variantsMDFilterFlank INTEGER. Application: SE, PE. Description: The mismatch density filter removes all basecalls from consideration during SNP calling where greater than variantsMDFilterCount mismatches to the reference occur on a read within a window of 1 + 2 × variantsMDFilterFlank positions encompassing the current position. The default value for variantsMDFilterCount is 2, and for variantsMDFilterFlank is 20. Set either value to less than 0 to disable the filter. Example: variantsMDFilterFlank 25
Option: variantsIndependentErrorModel. Application: SE, PE. Description: This switch turns off all error dependency terms in the SNP calling model, resulting in a simpler model where each basecall at a site is treated as an independent observation. Example: variantsIndependentErrorModel
Option: variantsMinQbasecall INTEGER. Application: SE, PE. Description: The minimum basecall quality used for SNP calling (default is 0). Example: variantsMinQbasecall 10
Option: variantsSummaryMinQsnp INTEGER. Application: SE, PE. Description: The snps.txt files contain all positions where Q(snp) > 0; however, it is expected that only a higher Q(snp) subset of these will be use
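To make the mismatch density filter described above concrete, here is a minimal Python sketch of the windowing rule. It is not CASAVA's implementation: the function name md_filter_mask is hypothetical, the window is assumed to be centered on the current position, and windows are assumed to be clipped at the ends of the read.

```python
def md_filter_mask(is_mismatch, count=2, flank=20):
    """Return one boolean per read position: True if the basecall is filtered.

    is_mismatch -- per-position flags, True where the read mismatches the reference
    count       -- stands in for variantsMDFilterCount (default 2 in the text)
    flank       -- stands in for variantsMDFilterFlank (default 20 in the text)
    """
    n = len(is_mismatch)
    if count < 0 or flank < 0:
        # The text says a value below 0 for either option disables the filter.
        return [False] * n
    # Prefix sums let each window total be computed in constant time.
    prefix = [0]
    for flag in is_mismatch:
        prefix.append(prefix[-1] + (1 if flag else 0))
    mask = []
    for i in range(n):
        lo = max(0, i - flank)          # window is 1 + 2 * flank wide,
        hi = min(n, i + flank + 1)      # clipped to the read (assumption)
        mask.append(prefix[hi] - prefix[lo] > count)
    return mask


read = [True, True, True, False, False]
print(md_filter_mask(read, count=2, flank=1))
print(md_filter_mask(read, count=2, flank=2))
```

With flank=1 only the middle of the three adjacent mismatches sees more than two mismatches in its window; widening the flank to 2 extends the filtered region to the first three positions.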
164. n an iCompute cluster with -j 32; metrics are for reads passing filtering.
Alignment and Mismatch Rates
          v1.7                  v1.8 semi-repeat      v1.8 full-repeat
          % Align  % Mismatch   % Align  % Mismatch   % Align  % Mismatch
Read 1    84.56    0.70         88.29    0.72         90.17    0.73
Read 2    81.92    1.39         85.81    1.44         87.56    1.44
CASAVA v1.8 aligns a higher percentage of reads, with full-repeat alignment performing best. This higher alignment rate results from the improved ability to align in more challenging repeat regions. Remarkably, even with more reads aligned in repeat regions, mismatch rates are still very similar.
CPU Run Time Comparison
                                  v1.7          v1.8 semi-repeat   v1.8 full-repeat
                                  CPU hours     CPU hours          CPU hours
ELAND                             523.28        518.40             855.40
orphanAligner                     N/A           54.17              31.20
PickBestPair/alignmentResolver    200.77        14.67              14.97
produceAlignStats                 21.65         12.43              14.55
Other Processes                   25.99         0.17               0.20
Total                             ≈771.7        599.85             916.33
While CASAVA v1.8 provides the highest percentage of aligned reads, this level of performance does require additional computational time (Table 2). For the ELAND step, v1.8 full-repeat resolution takes quite a bit longer to run than semi-repeat resolution (520 hours versus 855 hours). Therefore, researchers should consider the trade-off between higher performance and slower run time to select the type of analysis best suited for their project. Other algori
165. n the index sequence. This is done the following way for each cluster:
1 Get the raw index for each index read from the bcl file.
2 Identify the appropriate directory for the index based on the sample sheet.
3 Optional: Detect and correct up to one error on the barcode, and identify the appropriate directory. If there are multiple index reads, detect and correct up to one error in each index read.
4 Optional: Detect the presence of adapter sequence at the end of the read. If adapter sequence is detected, mask the corresponding basecalls with N.
5 For each read:
a Write the index sequence into the index field.
b Append the read to the appropriate new FASTQ file in the selected directory.
6 If the index cannot be identified, the data is written into the Undetermined_indices directory, unless the sample sheet specifies a project and sample for reads without index.
Updating Statistics and Reporting
The sample demultiplexer updates the following files:
Generates statistics. While splitting the FASTQ files, CASAVA recalculates the base calling analysis statistics that were computed during base calling for the unsplit lanes. These files (Demultiplex_Stats.htm and IVC.htm) are stored in the Unaligned/Basecall_Stats_FCID folder.
Regenerates the analysis plots for each multiplexed sample.
Updates config.xml for each multiplexed sample.
Copies raw matrix and phasing files.
Updates sample sheet. The sample demultiplexer strips all the non-relevant indexes fr
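Steps 2, 3, and 6 of the per-cluster procedure above can be sketched in Python as a barcode lookup that tolerates up to one mismatch. This is an illustration under stated assumptions, not the demultiplexer's actual code: the names hamming and assign_directory, the example directory names, and the choice to treat an ambiguous one-mismatch hit as undetermined are all hypothetical.

```python
def hamming(a, b):
    """Number of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def assign_directory(raw_index, barcode_to_dir, max_mismatches=1,
                     undetermined="Undetermined_indices"):
    """Pick the output directory for one cluster from its raw index read."""
    # Step 2: an exact match against the sample sheet wins outright.
    if raw_index in barcode_to_dir:
        return barcode_to_dir[raw_index]
    # Step 3 (optional): accept a barcode within max_mismatches errors.
    hits = [bc for bc in barcode_to_dir
            if len(bc) == len(raw_index) and hamming(bc, raw_index) <= max_mismatches]
    if len(hits) == 1:
        return barcode_to_dir[hits[0]]
    # Step 6: unidentified (or, by assumption here, ambiguous) indexes.
    return undetermined


# Hypothetical sample sheet mapping, using barcodes that appear in the guide.
sheet = {"ATCACG": "Project_Proj1/Sample_sample1",
         "CGATGT": "Project_Proj1/Sample_sample2",
         "TTAGGC": "Project_Proj2/Sample_sample3"}

print(assign_directory("ATCACG", sheet))  # exact match
print(assign_directory("ATCACC", sheet))  # one error, corrected to ATCACG
print(assign_directory("NNNNNN", sheet))  # cannot be identified
```

Single-error correction of this kind is only unambiguous when every pair of barcodes on the sample sheet differs in at least three positions, which is why the sketch refuses to guess when more than one barcode is within range.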
166. nalysis run for a project. It is located in the Aligned/Project_<ProjectName>/Summary_Stats folder and provides an overview of quality metrics for a run per sample, with links to more detailed information in the form of pages of graphs. The metrics are described below.
Project Summary
The Project Summary contains general project information:
Project Name
Machine name
Run Folder: full path to the run folder
Flow Cell ID
Platform: instrument type
Control Software and version
Primary Analysis software and version
Secondary Analysis software and version
Project Results Summary
This table displays a summary of project-wide performance statistics for the run.
Clusters: Original number of detected clusters.
Clusters (PF): Number of clusters that passed quality filtering.
Yield: The sum of all bases (in Mb) in clusters that passed filtering for the entire project.
Barcode Lane Summary
The Barcode Lane Summary records information about the barcoded samples in each flow cell lane and the analysis that has been specified for them.
Barcode Lane: The identity of the barcoded sample in a lane. The identity has the following format: <SampleName> <Barcode> <Lane>
Sample: Sample name.
Barcode: The sequence of the barcode index.
Lane
Species: The reference sequence against which the sample was aligned. Depending on the analysis mode, this may be the name of a folder containing one or more se
167. nce files is ELAND_RNA_GENOME_CONTAM.
Running configureAlignment
When running configureAlignment, two concepts are important to understand: the configureAlignment configuration file that specifies the analysis, and the make utility that manages the analysis.
configureAlignment Configuration File
configureAlignment uses a text-based configuration file containing all parameters required for alignment, visualization, and filtering. These parameters specify the type of analysis to perform, which bases to use for alignment, and the reference files for a sequence alignment. Analysis can be specified by lane, index (barcode), sample, reference, or project.
Make Utility
configureAlignment is a collection of Perl scripts and C executables and is managed by the make utility. The make utility is commonly used to build executables from source code and is designed to model dependency trees by specifying dependency rules for files. These dependencies are stored in a file called a makefile. The configureAlignment.pl script is used to generate a makefile config containing variable definitions, which uses static makefiles as required. These static makefiles, including the main Makefile, have fixed content, so they can be included in the distribution and do not have to be regenerated for every run. When running configureAlignment, the configureAlignment configuration file specifies the analysis, and the make utility manages th
168. nd demultiplexing. If you need to change the sample sheet, it is best to rerun the bcl conversion and demultiplexing.
DemultiplexedBustardConfig.xml File
The base calling configuration file DemultiplexedBustardConfig.xml in the demultiplexed directory includes the start and end cycles of each read. The DemultiplexedBustardConfig.xml file is derived from the config.xml file generated during base calling, but renamed and moved by the BCL to FASTQ converter.
Reference Genome
CASAVA uses a reference genome in FASTA format. Both single-sequence FASTA and multi-sequence FASTA genome files are supported. Genome sequence files for most commonly used model organisms are available through iGenome (Getting Reference Files on page 128).
NOTE: As of CASAVA 1.8, you no longer need to squash the reference genome.
Single-Sequence FASTA Files
CASAVA accepts single-sequence FASTA files as genome reference, which should be provided unsquashed for both alignment and post-alignment steps. The chromosome name is derived from the file name. Direct CASAVA to a folder containing the FASTA files using the option ELAND_GENOME for configureAlignment.
Multi-Sequence FASTA Files
As of version 1.8, CASAVA accepts a multi-sequence FASTA file as genome reference. This should be provided as a single genome, SAM-compliant, unsquashed file for both alignment and post-alignment steps. The chromosome name is derived directly from the first word in
169. nd the DemultiplexedBustardConfig.xml is not there.
Export to SAM Conversion
Introduction
CASAVA 1.8 provides two SAM/BAM conversion pathways:
Running the post-alignment sort and bam modules (see Targets on page 96). Running the post-alignment sort and bam modules together offers sorting, PCR duplicate removal, indexing, and automatic chromosome renaming options, and by default writes out a reference sequence file with chromosome labels that have been synchronized to the labels used in the BAM file. If the sort module is run in archival mode, the BAM file created will contain all of the reads provided in the export.txt.gz files given as input.
The illumina_export2sam.pl script. The illumina_export2sam.pl script provides basic conversion from export to SAM format, without sorting, duplicate removal, conversion to BAM format, or indexing. This script is intended to be used as one component in a custom post-alignment pipeline. Users desiring a turn-key BAM creation method (e.g. to rapidly view reads in IGV) are encouraged to use the post alignment
170. ndard naming format for _export.txt.gz files is <sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>_export.txt.gz, as in NA10831_ATCACG_L001_R1_001_export.txt.gz. The _export.txt.gz files are saved as compressed (gzipped) files. The content of the _export.txt.gz files is described below; not all fields are relevant to a single-read analysis.
NOTE: The old Illumina-specific transformation (ASCII offset of 64) will still be used in the export files, but export.txt.gz is meant to be an internal file format.
1. Machine: Parsed from run folder name.
2. Run Number: Parsed from run folder name.
3. Lane
4. Tile
5. X Coordinate of cluster: As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, plus 1000, and then rounded.
6. Y Coordinate of cluster: As of RTA 1.6, OLB 1.6, and CASAVA 1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, plus 1000, and then rounded.
7. Index sequence or 0: For no indexing, or for a file that has not been demultiplexed yet, this field shoul
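The coordinate transformation and the ASCII offset of 64 described above can be illustrated with a short sketch. The function names are hypothetical, and the use of Python's built-in round() is an assumption: the text only says the result is "rounded", without specifying how exact halves are handled.

```python
def export_coordinate(old_coordinate):
    """Apply the transformation described above: old value times 10, plus 1000, rounded.

    Assumption: Python's round() (which rounds exact halves to the even
    neighbour) stands in for whatever rounding the software really uses.
    """
    return round(old_coordinate * 10 + 1000)


def decode_export_qualities(quality_string):
    """Convert a quality string that uses an ASCII offset of 64 into integer scores."""
    return [ord(character) - 64 for character in quality_string]


print(export_coordinate(12.34))        # 12.34 * 10 + 1000 = 1123.4, rounded
print(decode_export_qualities("h@T"))  # 'h' is ASCII 104, '@' is 64, 'T' is 84
```

Scaling by 10 before rounding keeps one decimal digit of the original position, and the added 1000 keeps slightly negative coordinates positive after the transformation.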
171. ngle FASTA format. Path to transcripts mapping to the genome (refFlat.txt.gz or seq_gene.md.gz). See also Using ANALYSIS eland_rna on page 70. The group label above specifies which assembly to use in the seq_gene file and is found in column 13 of the file (seq_gene files can hold entries for multiple assemblies). Example: ELAND_RNA_GENE_MD_GROUP_LABEL GRCh37.p2-Primary Assembly
KAGU_PARAMS passes options to the alignmentResolver through the configureAlignment configuration file. For additional information, see KAGU_PAIR_PARAMS and KAGU_PARAMS on page 65.
Paired-End Analysis Options
Table 4 configureAlignment Configuration File Paired-End Analysis Options
Parameter: Definition
ANALYSIS eland_pair: Use the paired-end alignment mode of ELANDv2e to align paired reads against a target.
USE_BASES: Use all bases of the first read and ignore the first and last base of the second read.
6:USE_BASES: Ignore the first base on both the first and second read of lane 6, use 25 bases each, and ignore any other bases, for lane 6 only.
KAGU_PAIR_PARAMS: KAGU_PAIR_PARAMS passes options for paired-end runs to the alignmentResolver through the configureAlignment configuration file. For additional information, see KAGU_PAIR_PARAMS and KAGU_PARAMS on page 65.
For more information on USE_BASES syntax, see USE_BASES Option on page 62.
Specifying Analysis
Analysis can be specified by project, reference, sam
172. ns Frequently asked questions are available online. Go to http://www.illumina.com/FAQs and click on Software.
Reporting Problems
When reporting an issue, it is critical to capture all the output and error messages produced by a run. This is done by redirecting the output using nohup or the facilities of a cluster management system. For an explanation of nohup, see Nohup Command on page 24. Provide a description of the error/bug/feature along with the following information, if available:
Demultiplexing/Bcl Conversion: the configureBclToFastq.pl command line; nohup.out from the make execution; SampleSheet.csv; support.txt file in the Unaligned folder.
Alignment: the configureAlignment.pl command line; nohup.out from the make execution; config.txt; support.txt file in the Aligned folder.
Variant Detection/Counting: the command line; CASAVA log; conf/project.conf.
Interpretation of Run Quality
Introduction
Before beginning a secondary analysis, you should assess the performance metrics of the sequencing run. This can help reveal any issues which may affect the secon
173. ntrol files; .bcl files; .stats files; Unaligned
Sample Sheet
The sample sheet (SampleSheet.csv file) directs the software how to assign reads to samples and samples to projects. The sample sheet specifies, for every index in every lane, which sample and which project it belongs to. Lanes with samples that were not indexed can also be assigned to samples and projects using the sample sheet. Projects can consist of multiple samples, and samples can consist of multiple lanes and multiple indexes. The sample sheet contains the following columns:
Column: Description
FCID: Flow cell ID
Lane: Positive integer indicating the lane number (1-8)
SampleID: ID of the sample
SampleRef: The name of the reference
Index: Index sequence(s)
Description: Description of the sample
Control: Y indicates this lane is a control lane, N means sample
Recipe: Recipe used during sequencing
Operator: Name or ID of the operator
SampleProject: The project the sample belongs to
NOTE: The column SampleProject is new in CASAVA 1.8 and links samples to projects.
Every project in the sample sheet is linked to a corresponding project directory. Each sample belonging to that project is linked to a corresponding sample directory within that project directory. Reads are stored in the FASTQ files located in the projec
174. o any alternate indels at the same site. The relative probabilities of these alignments for each read are used to call the indel's genotype and calculate the associated quality score.
Candidate Indel Search
For the first stage of indel calling, candidate indels are identified from two sources of evidence. The first is small indels already present in the input reads in the form of gapped alignments. The second is alignments of locally assembled contigs to the reference, provided by the assembleIndels module. Every indel present in a conventional read alignment or assembleIndels contig is stored in a pool of potential indels. Support for each of these potential indels is measured as the number of read alignments that contain the indel. These alignments may be from the primary alignment or from reads used by Grouper to assemble each contig. If the number of reads supporting a potential indel is less than 3, or less than 2% of the total depth at the indel site, the indel cannot become a candidate. Additionally, for short indels of length 4 or less, if the number of supporting reads is less than 10% of the total depth, the indel cannot become a candidate. These cases are retained as private indels in the read alignments that support them. All other potential indels become candidate indels, subject to realignment and indel calling
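The candidate thresholds described above can be written as a small predicate. This is only a sketch of the stated rules, not the variant caller's code: the function and parameter names are hypothetical, and it assumes the two depth thresholds are percentages (2% and 10%) of the total depth at the indel site.

```python
def is_candidate_indel(supporting_reads, total_depth, indel_length):
    """Return True if a potential indel passes the candidate thresholds.

    Assumed rules (from the text above):
      - at least 3 supporting reads,
      - support of at least 2% of the total depth at the site,
      - for indels of length 4 or less, support of at least 10% of the depth.
    """
    if supporting_reads < 3:
        return False
    if supporting_reads < 0.02 * total_depth:
        return False
    if indel_length <= 4 and supporting_reads < 0.10 * total_depth:
        return False
    return True


# Three supporting reads at depth 100: enough for a 10-base indel,
# but not for a 2-base indel, which needs at least 10 reads here.
print(is_candidate_indel(3, 100, 10))
print(is_candidate_indel(3, 100, 2))
```

Potential indels that fail the predicate are not discarded from the reads; as the text notes, they stay as private indels in the alignments that support them.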
175. of the variant caller algorithm, see Variant Detection on page 141.
Variant Detection Input Files
The variant detection and counting input files come from the configureAlignment module, using the following eland modules:
eland_extended (configureAlignment Input Files on page 48) for single-read DNA sequencing projects
eland_pair (Using ANALYSIS eland_pair on page 69) for paired-end DNA sequencing projects
eland_rna (Using ANALYSIS eland_rna on page 70) for single-read RNA sequencing projects (paired-end RNA sequencing projects are not supported)
The configureAlignment input files for CASAVA variant detection can be found in the Aligned directory of the run folder and are described below. In addition, CASAVA variant detection and counting uses annotation files: genome sequence files and a refFlat.txt.gz or seq_gene.md.gz file.
Figure 14 CASAVA Input Files [directory tree: <ExperimentName> run folder (YYMMDD_machinename_XXXX_FC) with Aligned > Project > Sample holding export.txt files, config.xml file, and pair.xml files (only for paired-read alignments); Genome Directory > Species holding genome FASTA files and genomesize.xml file]
export.txt.gz Files
The export.txt.gz files contain the aligned sequence information from the configureAlignment module and are required. The export.txt.gz files are tab-delimited text files; for a detailed description, see Output File Formats.
run.conf.xml
The run.conf.xml fil
176. oject directory called Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane.
No multiplexed samples present, with sample sheet: Reads are placed within the directory structure directed by the sample sheet, based on the lane information.
No multiplexed samples present, without sample sheet: Reads are placed in a project directory named after the flow cell, and sample directories based on the lane number.
Bcl Conversion As You Go
Bcl conversion supports alignment of the first read of a paired-end run before completion of the run (align as you go). You can kick off Bcl conversion for read 1 using the target r1 when running make at any time after the last read has started (for multiplexed runs, this is after completion of the indexing read). You can also start alignment using the target r1 when running make for configureAlignment, or you can use the POST_RUN_COMMAND_R1 variable to automatically start the alignment of read 1 at the end of the Bcl conversion. For instructions, see Starting Bcl Conversion for Read 1 on page 35.
Demultiplexing Methods
Demultiplexing involves splitting the FASTQ files and updating the statistics and reporting files. This section describes these two steps.
Splitting FASTQ Files
The first step of demultiplexing in CASAVA is splitting the base call files based o
177. om the original sample sheet and places the stripped-out version in the appropriate directory.
Creates the Demultiplex Stats FCID csv file in the Unaligned folder to indicate in which subdirectory each index has been written.
For a description of these files, see Bcl Conversion Output Folder on page 37.
Bcl Conversion Input Files
Demultiplexing needs a BaseCalls directory and a sample sheet to start a run. These files are described below. See also the image below.
NOTE: For installation instructions, see Requirements and Software Installation on page 111.
Figure 10 Bcl Conversion Input Files [directory tree showing the <ExperimentName> run folder (YYMMDD_machinename_XXXX_FC), RunInfo.xml, Data/Intensities with its config.xml file and per-lane folders, and BaseCalls with the SampleSheet.csv file, config.xml file, and per-lane, per-cycle .bcl and .stats files]
Folder and File Naming
The top-level run folder name is generated using three fields to identify the <ExperimentName>, separated by underscores. For example: YYMMDD_machinename_NNNN
You should not deviate from the run folder naming convention, as this may cause the software to stop.
1. The first field is a six-digit number specifying the date of the run. The YYMMDD ordering ensures that a numerical sort
178. on counts, splice junction counts, and gene counts can be used to determine gene expression levels and expressed splice variants.
TIP: As long as a gapped alignment is performed, small indels (up to 10 nucleotides) can be called from RNA Sequencing builds.
NOTE: RNA Sequencing only supports single-read runs.
Post-Alignment Workflow
The CASAVA workflow for variant detection and counting is illustrated below.
Figure 13 CASAVA Variant Detection and Counting Workflow [workflow diagram: export.txt files; chromosome BAM files; indelAssembler; smallVariantCaller; snp.txt and indels.txt files; rnaCounts; counts and genotypes files; whole-genome BAM files; stats reports; summary tables (.htm); post-sort BAM for third-party tools]
CASAVA has a number of changes in the way files are handled in the post-alignment workflow:
CASAVA 1.8 operates entirely on BAM files after the sort module in the post-alignment workflow has completed; sorted.txt files are no longer created or stored. This significantly reduces the build size: the combined changes in the new variant caller and BAM files for CASAVA reduce the size of human DNA Sequencing post-alignment builds by 75-80%.
Archival mode: CASAVA can be run so that all input reads are retained in the build in their entirety. Variant calling and RNA counting results are identical in
179. onfigure help
Setting Up Email Reporting
The script Gerald/runReport.pl is called at the end of a run and sends you an email when a run successfully completes. To use email notification, set up an SMTP server and set the following parameters in the configureAlignment configuration file. For additional information, see configureAlignment Configuration File on page 54.
1. Enter a space-separated list of the email addresses that should receive the run completion notification:
EMAIL_LIST your.name@domain.com that.name@domain.com
2. Indicate the path to the Aligned folder. The software assumes it can create a valid URL from the Aligned folder path by omitting a number of leading path elements, as specified by NUM_LEADING_DIRS_TO_STRIP (by default two), and prepending WEB_DIR_ROOT:
WEB_DIR_ROOT http://server/SHARE
For example, if the path is /mnt/yourDrive/folder/folder/Aligned and WEB_DIR_ROOT is http://server/SHARE, the software will write the links as http://server/SHARE/folder/Aligned/File.htm
3. Identify your domain. Your SMTP server may refuse to accept emails from, or send emails to, addresses that do not end in yourdomain.com:
EMAIL_DOMAIN yourdomain.com
4. Identify your IP address:
EMAIL_SERVER yourserver:25
where yourserver is the name or IP address of a mail server that will accept SMTP email requests from you, and 25 is the port number of the SMTP service on that server. Generally this will be 25. This
180. operator
SampleProject: The project the sample belongs to
You can generate the sample sheet using Excel or another text editing tool that allows .csv files to be saved. Enter the columns specified above for each sample, and save the Excel file in the .csv format. If the sample you want to specify does not have an index sequence, leave the Index field empty.
Illegal Characters
Project and sample names in the sample sheet cannot contain illegal characters not allowed by some file systems. The characters not allowed are the space character and the following: [character list illegible in source]
Multiple Index Reads
If multiple index reads were used, each sample must be associated with an index sequence for each index read. All index sequences are specified in the Index field. The individual index read sequences are separated with a hyphen character (-). For example, if a particular sample was associated with the sequence ACCAGTAA in the first index read and the sequence GGACATGA in the second index read, the index entry would be ACCAGTAA-GGACATGA.
Samples Without Index
As of CASAVA 1.8, you can assign samples without index to projects, sampleIDs, or other identifiers by leaving the Index field empty.
Running Bcl Conversion and Demultiplexing
Bcl conversion and demultiplexing is performed by one script, configureBclToFastq.pl. This section des
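The hyphen convention for the Index field described above can be illustrated with two small helpers. These are hypothetical functions written for this example, not part of CASAVA; they only show how one Index entry relates to the per-read index sequences, including the empty field used for samples without index.

```python
def join_index_reads(index_sequences):
    """Build the Index field from one sequence per index read."""
    return "-".join(index_sequences)


def split_index_field(index_field):
    """Split an Index field back into one sequence per index read.

    An empty field (sample without index) yields an empty list.
    """
    return index_field.split("-") if index_field else []


entry = join_index_reads(["ACCAGTAA", "GGACATGA"])
print(entry)                     # the dual-index entry from the example above
print(split_index_field(entry))  # back to one sequence per index read
print(split_index_field(""))     # sample without index
```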
181. or large projects such as human genome resequencing, we recommend using highly distributed disk storage like Lustre or Isilon.
The space requirements for ELAND temporary files inside the Aligned directory, as long as you stay at <13M reads per eland set size, are as follows:
Eight bytes per match. This should equate to less than 0.6 GB per process.
This is less than 5 GB for 8 ELAND processes.
If tmp space is an issue, perform the following:
Increase space for tmp.
Decrease ELAND_FASTQ_FILES_PER_PROCESS (see ELAND_FASTQ_FILES_PER_PROCESS on page 63). Setting the right value for ELAND_FASTQ_FILES_PER_PROCESS is very important, because too low a value may result in decreased performance.
Memory Requirements
CASAVA requires a minimum of 2 GB RAM per core. The parameter ELAND_FASTQ_FILES_PER_PROCESS in the configureAlignment config.txt specifies the maximum number of files aligned by each ELAND process. The optimal value is such that there are approximately 10 to 13 million lines (reads) in one set.
NOTE: Peak memory usage occurs during the ELANDv2e portion of configureAlignment.
Software Requirements
CASAVA has been primarily developed and tested on CentOS 5, Illumina's recommended and supported platform. It may be possible to install and run CASAVA on other 64-bit Linux distributions, particularly on similar dis
182. or the percentage of molecules in a cluster for which sequencing falls behind the current position (cycle) within a read.
% Prephasing: The estimated value (specification is not recommended) used for the percentage of molecules in a cluster for which sequencing jumps ahead of the current position (cycle) within a read.
% Mismatch Rate (raw): The percentage of called bases in aligned reads from all detected clusters that do not match the reference.
% PF Clusters: The percentage of clusters that passed filtering.
Cycle 2-4 Av Int (PF): The intensity averaged over cycles 2, 3, and 4 for clusters that passed filtering.
Cycle 2-10 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 2-10 (derived from a best-fit straight line for log intensity versus cycle number).
Cycle 10-20 Av % Loss (PF): The average percentage intensity drop per cycle over cycles 10-20 (derived from a best-fit straight line for log intensity versus cycle number).
% Align (PF): The percentage of reads passing filter that were uniquely aligned to the reference.
% Mismatch Rate (PF): The percentage of called bases in aligned reads passing filter that do not match the reference.
% >=Q30 bases (PF): Yield of bases with Q30 or higher from clusters passing filter, divided by total yield of clusters passing filter.
Mean Quality Score (PF): The total sum of quality scores of clusters passing filter, divided by total yield of clusters passing filter.
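The last two definitions above are simple ratios over the bases of clusters passing filter, which the following sketch makes explicit. The function name and the input layout (one list of integer quality scores per passing-filter read) are assumptions for illustration; CASAVA computes these values internally from its own files.

```python
def quality_summary(pf_read_qualities):
    """Return (percent of bases with Q30 or higher, mean quality score).

    pf_read_qualities: hypothetical input, one list of integer quality
    scores per read from a cluster passing filter.
    """
    total_bases = sum(len(read) for read in pf_read_qualities)
    if total_bases == 0:
        return 0.0, 0.0
    # Yield of bases with Q30 or higher.
    q30_bases = sum(1 for read in pf_read_qualities for q in read if q >= 30)
    # Total sum of quality scores over all passing-filter bases.
    quality_sum = sum(sum(read) for read in pf_read_qualities)
    return 100.0 * q30_bases / total_bases, quality_sum / total_bases


# Two reads of two bases each: half the bases reach Q30, mean quality is 25.
print(quality_summary([[30, 40], [20, 10]]))
```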
183. ormally inferred during alignment)
0x0004 (4): The query sequence itself is unmapped
0x0008 (8): The mate is unmapped
0x0010 (16): Strand of the query (0 for forward, 1 for reverse strand)
0x0020 (32): Strand of the mate
0x0040 (64): The read is the first read in a pair
0x0080 (128): The read is the second read in a pair
0x0100 (256): The alignment is not primary (a read having split hits may have multiple primary alignment records)
0x0200 (512): The read fails platform/vendor quality checks
0x0400 (1024): The read is either a PCR duplicate or an optical duplicate
NOTE: The bitwise flag means that if multiple conditions are true, the values are added, and only the total value is reported. For example, if a read is paired in sequencing (value 1), the mate is unmapped (value 8), and the read is the first read in a pair (value 64), a total of 1 + 8 + 64 = 73 is reported.
Extended CIGAR Format
A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows for three types of operations: M for match or mismatch, I for insertion, and D for deletion. The extended CIGAR format further allows four more operations, as shown in the following table, to describe clipping, padding, and splicing.
Operation: Description
M: Alignment match (can be a sequence match or mismatch)
I: Insertion to the reference
D: Deletion from the reference
N: Skipped region from the referen
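The bitwise flag arithmetic in the note above can be checked with a short decoder. The bit values follow the table above and the SAM format; the function name and the wording of the descriptions are this example's own, and the first two entries (0x0001 and 0x0002) are taken from the SAM specification because their table rows fall outside this excerpt.

```python
SAM_FLAG_BITS = {
    0x0001: "read is paired in sequencing",
    0x0002: "read is mapped in a proper pair",
    0x0004: "query sequence is unmapped",
    0x0008: "mate is unmapped",
    0x0010: "query is on the reverse strand",
    0x0020: "mate is on the reverse strand",
    0x0040: "first read in a pair",
    0x0080: "second read in a pair",
    0x0100: "alignment is not primary",
    0x0200: "fails platform/vendor quality checks",
    0x0400: "PCR or optical duplicate",
}


def decode_flag(flag):
    """List the conditions whose bits are set in a SAM flag value."""
    return [text for bit, text in SAM_FLAG_BITS.items() if flag & bit]


# 73 = 1 + 8 + 64: paired, mate unmapped, first read in a pair.
for condition in decode_flag(73):
    print(condition)
```

Because each condition occupies its own bit, adding the values and testing with a bitwise AND are two views of the same encoding, so any total decodes back to a unique set of conditions.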
184. our job will run up to the parallelization limit, but the correct number to use depends on how big the system is, the number of other users, and the number of jobs you want to run at any one time. This method can have some inefficiency if there are fewer tasks than slots at any point, but it allows easy job monitoring and management. If you need to kill your job, this is much easier with this method. When you submit the job, the command will return the SGE job id. You can get information about the state of your job with qstat -j <job_id>, or by viewing it with qmon.
Customizing Parallelization
Many parts of configureAlignment are intrinsically parallelizable by lane or tile. However, some parts of configureAlignment cannot be parallelized completely. configureAlignment has a series of additional hooks and check points for customization. The configureAlignment run can be divided into a series of steps with different levels of scalability, where synchronization barriers cause configureAlignment to wait for each of the tasks within a step to finish before going to the next step. You can parallelize the steps at the run level (no parallelization), the lane level (up to eight jobs in parallel), and the tile level (up to thousands of jobs in parallel). Each step is initiated by a make target. After completion of each of these steps, configureAlignment produces a file, or a series of files at the lane/tile level, that determines whether al
185. ple, index, or lane, which is explained in this section.
Lane-Specific Analysis
By adding the lane number(s) followed by a colon in front of an analysis option, you state that the analysis option is only for samples from that lane. The lane number is only valid for the configureAlignment settings on that same line. For example:
567:ANALYSIS eland_extended
tells configureAlignment that eland_extended should be run on samples from lanes 5, 6, and 7.
Sample-Specific Analysis
The config.txt file has some keywords that enable you to specify analysis for project, reference, sample, or index: PROJECT, REFERENCE, SAMPLE, and BARCODE. These keywords refer to the SampleProject, SampleRef, SampleID, and Index specified in the SampleSheet.csv file located in the Unaligned directory of the run folder. Lines starting with PROJECT, REFERENCE, SAMPLE, and BARCODE override any default settings specified in the config.txt file, but only for those samples for which the SampleProject, SampleRef, SampleID, or Index matches the PROJECT, REFERENCE, SAMPLE, or BARCODE. The override is only valid for the configureAlignment settings on that same line.
Example: Sample-Specific Analysis
For example, if the config.txt file describes the following analysis:
ANALYSIS eland_rna
REFERENCE human ANALYSIS eland_pair
with the following sample sheet:
FCID, Lane, SampleID, SampleRef, Index, Description, Control, Recipe, Operator, SampleProject
12345AAXX 1 sample
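The lane prefix described above (lane digits, a colon, then the option) can be read with a small parser. This is a hypothetical sketch, not the parser configureAlignment uses: the function name, the regular expression, and the choice to treat an unprefixed line as applying to lanes 1 through 8 are assumptions.

```python
import re

# Optional lane digits (1-8) followed by a colon, then a keyword and its value.
_OPTION_LINE = re.compile(r"^(?:(?P<lanes>[1-8]+):)?(?P<key>\S+)\s+(?P<value>.+)$")


def parse_option(line):
    """Split a configuration line into (lanes, keyword, value)."""
    match = _OPTION_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized configuration line: {line!r}")
    if match.group("lanes"):
        lanes = sorted({int(digit) for digit in match.group("lanes")})
    else:
        # Assumption: no prefix means the setting applies to every lane.
        lanes = list(range(1, 9))
    return lanes, match.group("key"), match.group("value")


print(parse_option("567:ANALYSIS eland_extended"))
print(parse_option("ANALYSIS eland_rna"))
```

Keeping the lane list next to each parsed option mirrors the rule that a lane number is only valid for the settings on that same line.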
186. possible alignments to the genome and splice junctions, then the read is marked as RM and discarded as above.
3. If there is no alignment to either the contaminants, the genome, or the splice junctions, then the read is marked as NM (for not matched).
Multiseed/Repeat Alignment
ANALYSIS eland_rna provides the following alignment features implemented in ELANDv2 and ELANDv2e:
By default, performs multiseed alignment by aligning consecutive sets of 16 to 32 bases separately.
Aligns reads in repeat regions using two new modes, semi-repeat resolution and full-repeat resolution. Full-repeat resolution is more sensitive and places more reads in repeat regions, but will result in longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full-repeat resolution can be turned on with the option INCREASED_SENSITIVITY.
Running an eland_rna Analysis
The configureAlignment configuration file specifies how the sequences from a flow cell are processed, which is described in configureAlignment Configuration File on page 54. The ANALYSIS parameter within the configureAlignment configuration file specifies what analysis to perform on the sequences; you will need to set up this parameter the following way (example shown):
ANALYSIS eland_rna
ELAND_GENOME /data/Genome/ELAND/hg18
ELAND_RNA_GENOME_ANNOTATION /data/Genome/ELAND_RNA/Human/refFlat.txt.gz
ELAND_RNA_GENOME_CONTAM /data/Genome/ELAND_RNA/Human/MT_Ribo_Filter
187. put for configureAlignment. The files are located in the Unaligned/Project_<ProjectName>/Sample_<SampleName> directories.
NOTE: Reads that were identified as sample prep controls in the control files are not saved in the FASTQ files.
Naming
Illumina FASTQ files use the following naming scheme:
<sample name>_<barcode sequence>_L<lane (0-padded to 3 digits)>_R<read number>_<set number (0-padded to 3 digits)>.fastq.gz
For example, the following is a valid FASTQ file name:
NA10831_ATCACG_L002_R1_001.fastq.gz
In the case of non-multiplexed runs, <sample name> will be replaced with the lane numbers (lane1, lane2, ..., lane8) and <barcode sequence> will be replaced with NoIndex.
Set Size
The FASTQ files are divided into files with the file size set by the --fastq-cluster-count command line option of configureBclToFastq.pl. The different files are distinguished by the 0-padded 3-digit set number.
If you need to generate one unique gzipped FASTQ file for use in a third-party tool, you can set the --fastq-cluster-count option to 0.
Compression
FASTQ files are saved compressed in the GNU zip format (an open source file compression program). This is indicated by the .gz file extension. CASAVA automatically unzips the files before using them.
Format
Each entry in a FASTQ file consists of four lines: Sequence identifier; Sequence; Quality score identifier line (consisting of a +); Quality
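The naming scheme above lends itself to a regular-expression parser, sketched below. The pattern and function name are hypothetical; in particular, restricting the barcode field to the letters A, C, G, T, N, hyphens, or the literal NoIndex is an assumption made so that underscores inside a sample name remain unambiguous.

```python
import re

_FASTQ_NAME = re.compile(
    r"^(?P<sample>.+)_(?P<barcode>[ACGTN-]+|NoIndex)"
    r"_L(?P<lane>\d{3})_R(?P<read>\d)_(?P<set>\d{3})\.fastq\.gz$"
)


def parse_fastq_name(file_name):
    """Split a FASTQ file name into its naming-scheme fields."""
    match = _FASTQ_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not a recognized FASTQ file name: {file_name!r}")
    return {
        "sample": match.group("sample"),
        "barcode": match.group("barcode"),
        "lane": int(match.group("lane")),
        "read": int(match.group("read")),
        "set": int(match.group("set")),
    }


print(parse_fastq_name("NA10831_ATCACG_L002_R1_001.fastq.gz"))
print(parse_fastq_name("lane1_NoIndex_L001_R1_001.fastq.gz"))
```

Sorting files by the parsed (lane, read, set) values reproduces the order of the 0-padded names, which is convenient when concatenating the sets for a third-party tool.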
188. quence files or the name of an individual file. The acceptable file formats also depend on the analysis mode.
Analysis Type: Contains the analysis mode for reads from this lane.
Length: The number of bases used per read, excluding any bases masked out using USE_BASES. Where multiple reads are produced per cluster and a distinction is maintained between them during analysis (as in eland_pair analysis of paired-end reads), their respective lengths will be listed.
Num Tiles: The number of tiles from the lane that are used in the analysis.
Genome Directory: Full path to the genome directory.
Sample Results Summary
This table displays basic data quality metrics for each sample displayed on the Summary sample page.
Sample Yield: The sum of all bases (in Mb) in clusters that passed filtering for the sample.
Clusters (raw): The number of clusters detected by the image analysis module.
Clusters (PF): The number of detected clusters that meet the filtering criterion.
1st Cycle Int (PF): The average of the four intensities (one per channel or base type) measured at the first cycle, averaged over filtered clusters.
% Intensity after 20 cycles (PF): The corresponding intensity statistic at cycle 20, as a percentage of that at the first cycle.
% PF Clusters: The percentage of clusters passing filtering.
% Align (PF): The percentage of reads passing filter that were uniquely aligned to the reference. For eland_rna, it is the number of PF reads aligned to
189. r example files containing these sequences:
  The chrM.fa file from the genome folder.
  A ribosomal sequences FASTA file. You will need to find it for your genome of interest, for example from GenBank.
  A 5S RNA FASTA file (optional). You will need to find it for your genome of interest, for example from GenBank.
  A contaminants file. You can use the same newcontam.fa file as for human, mouse, or rat.
You do not need to have all of the files listed above, but you need at least one file for eland_rna to work properly. You can add other abundant sequences FASTA files if desired.
NOTE: Abundant sequence files need to be single FASTA files (multi-FASTA files are not allowed).
Algorithm Descriptions: Introduction; ELANDv2 and ELANDv2e; Variant Detection; readBases Counting Method.
Introduction: This appendix explains the algorithms used in CASAVA for the following functions: alignment using ELAND; indel detection and small variant genotyping; RNA sequencing counting methods.
ELANDv2 and ELANDv2e: Efficient Large-Scale Alignment of Nucleotide Databases (ELAND) is a very fast aligner and should be used to match a large numb
190. rFinder separately clusters each type of anomalous read. The resulting clusters are labeled in the output file as: Shadow SemiAligned (orphan, semi-aligned); DeletionPair (insert size anomalously large); InsertionPair (insert size anomalously small).
ClusterMerger: This stage combines clusters of the different types above that appear to correspond to the same event. One anticipated case is that of two Shadow SemiAligned clusters and a DeletionPair cluster corresponding to the same medium or larger scale deletion. The currently supported merging mechanism is the combination of clusters of different types that share reads. This is possible because a read may be detected both as SemiAligned and as one partner in an anomalously mapped read pair. Apart from its role in merging related clusters, this step also ensures that reads are not multiply represented in the subsequent assembleIndels stages and downstream analysis.
SmallAssembler: SmallAssembler takes the output of ClusterMerger and assembles clusters of reads into contigs. It uses an approach based on k-mer hashing and a de Bruijn graph. If a read is successfully assembled into a contig, the read's alignment details are updated to describe its position in the contig.
AlignContig: AlignContig does a dynamic programming alignment of contig to genome.
Variant Caller Methods: The callSmallVariants module calls SNPs and small indels from both the sorted alignment files (sorted.bam) and optionally also from t
191. raph for a more diverse sample. Note the low diversity for cycles 102 to 109; this was a multiplexed sample and these are the index read cycles, so this is normal.
[Figure 5: Proper Diversity Samples (Data By Cycle, Lane 1, Both Surfaces)]
Cluster Density: The figure below shows a screen shot from SAV displaying cluster densities for lanes 1 to 8 of a flow cell. The cluster density of lanes 7 and 8 is very low; if any of these lanes is set as the control lane for the run, you might need to repeat basecalling using OLB with a more successful control lane. Note that the raw cluster density for lane 1 is too high, resulting in a lower percentage of clusters passing filter (the green box), although the total number of clusters passing filter is still acceptable.
[Figure 6: Cluster Density (Data By Lane)]
Fluidics Leak: The figure below depicts a flow cell with spatial variability in intensity. Typically we would expect intensity to be nearly even within each lane. This variability might indicate a fluidics issue, such as a large volume of bubbles moving through the flow cell.
[Figure 7: Fluidics Leak (Flowcell Chart
192. ration, large-scale structural variants, or other phenomena not accounted for by the model. The Q-score provided by the model should be interpreted with respect to these limitations.
Homopolymers: The indel calling model accounts for the probability of a spurious indel error as a function of homopolymer length and indel type. This spurious indel correction causes simple expansions and contractions of homopolymers to be predicted as less likely as homopolymer length increases. The spurious indel error probabilities are calculated from empirical observations. There is an option available in the small variant caller to replace these values with a single constant indel error probability, to be used for all homopolymer lengths and indel types.
Overlapping Indels: Note that the model handles overlapping indels in an approximate fashion, by evaluating the probability of each indel allele compared to either the reference or any other indel allele at the same site. Thus, it does not explicitly enumerate all possible pairs of alignment paths at the site to calculate the joint probability of the path for both haplotypes of a diploid sample; instead, the method considers the current indel allele compared to all other possible alignment paths at the site. This approximation effectively handles most simple overlapping indels, but will tend to undercall indels in regions with very high indel error rates. A consequence of this model is that where overlapping indels o
193. read boundary used for multiple reads. An asterisk (*) means fill up the read as far as possible with the preceding character. A number means that the previous character is repeated that many times. Unspecified cycles are set to n by default. If USE_BASES is not specified at all, every cycle is used for the alignment. Note that the symbol I for indexing is no longer accepted syntax for USE_BASES.
NOTE: Default is USE_BASES Y*n, which means perform a single-read alignment and ignore the last base. If running ANALYSIS eland_pair, make sure to specify the USE_BASES option for two reads, for example USE_BASES Y*n,Y*n.
The following table describes examples of USE_BASES options.
Table 6: USE_BASES Options
  Option | Definition
  USE_BASES nYYY | Ignore the first base and use bases 2 to 4.
  USE_BASES Y30 | Align the first 30 bases.
  USE_BASES nY30 | Ignore the first base and align the next 30 bases.
  USE_BASES nY30n | Ignore the first base, align the next 30 bases, and ignore the last base.
  USE_BASES nY*n | Ignore the first base, perform a single-read alignment, and ignore the last base. The length of read is automatically set to the number of sequencing cycles minus two.
  USE_BASES Y*n | Perform a single-read alignment and ignore the last base. Default for single-read alignment.
  USE_BASES Y*n,Y*n | Perform a paired-read alignment, but ignore the last base of e
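To make the mask rules concrete, here is a minimal Python sketch that expands a single-read USE_BASES mask into one character per cycle. It is not CASAVA code: the function name `expand_use_bases` is hypothetical, and the sketch assumes that the fill symbol is `*`, that a count such as `Y30` gives 30 cycles in total for that character, and that at most one `*` appears per read.

```python
import re


def expand_use_bases(mask, cycles):
    """Expand a single-read mask such as 'nY*n' to one character per cycle.

    'Y' marks a used cycle and 'n' an ignored one. Hypothetical helper
    written from the rules described in the text.
    """
    tokens = re.findall(r"([Yn])(\*|\d+)?", mask)
    # Reject anything the token pattern did not consume, e.g. the old 'I'.
    if "".join(char + rep for char, rep in tokens) != mask:
        raise ValueError(f"unsupported USE_BASES mask: {mask!r}")
    if sum(1 for _, rep in tokens if rep == "*") > 1:
        raise ValueError("this sketch allows only one '*' per read")

    # Cycles consumed by everything except the '*' token.
    fixed = sum(int(rep) if rep else 1 for _, rep in tokens if rep != "*")
    fill = cycles - fixed
    if fill < 0:
        raise ValueError("mask is longer than the number of cycles")

    parts = []
    for char, rep in tokens:
        if rep == "*":
            parts.append(char * fill)  # fill as far as possible
        elif rep:
            parts.append(char * int(rep))  # explicit repeat count
        else:
            parts.append(char)
    # Unspecified trailing cycles are set to 'n'.
    return "".join(parts).ljust(cycles, "n")


if __name__ == "__main__":
    for mask in ("nYYY", "Y30", "nY30n", "nY*n", "Y*n"):
        print(mask, expand_use_bases(mask, 36))
```

For a paired-end mask, the same function could be applied to each comma-separated part with that read's own cycle count.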
194. reads, or 1 for single-end reads. The field can support more than two reads.
  <is filtered> (Y or N): Y if the read is filtered, N otherwise.
  <control number>: 0 when none of the control bits are on (reserved for future use).
  <barcode sequence> (ACGT): Represents the USE_BASES-masked barcode sequence, empty otherwise.
An example is shown below:
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Variant Detection and Counting: Introduction; Methods; Variant Detection Input Files; Running Variant Detection and Counting; Variant Detection and Counting Output Files.
Introduction: This chapter explains how to use CASAVA 1.8 to detect Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (indels), and to count hits on transcripts for RNA sequencing. CASAVA generates a CASAVA build, which is a post-sequencing analysis of data from reads aligned to a reference genome by configureAlignment. The CASAVA build process is divided into several modules or targets, each of which
195. reference genome to produce indel candidates, and then the variant caller consolidates these candidates, performs local realignment, and genotypes the indel. Indels of up to 300 bases in length can be genotyped using this process. Small indels (up to 10 bases) can be detected directly from the gapped alignment.
RNA counting: The number of bases that fall into the exonic regions of each gene are summed to obtain gene-level counts, normalized according to feature size and expressed as RPKM (Reads Per Kilobase per Million of mapped reads). Only splice sites from known splice variants are reported, one at a time. If a read represents a new splice variant or spans multiple splice junctions, it will not be counted.
What's New: Important Changes in CASAVA 1.8.2
Bcl Conversion and Demultiplexing: Supports dual and single indices. Supports adapter masking. CASAVA 1.8.2 FASTQ files contain only reads that passed filtering. If you want all reads in a FASTQ file, use the --with-failed-reads option.
For more information, see the Release Notes for CASAVA 1.8.2 or the Changes file in <CASAVAInstallationDirectory>/share/CASAVA-1.8.2.
New Options: The new options for release 1.8.2 are listed below.
Bcl Conversion and Demultiplexing (for descriptions, see Options for Bcl Conversion and Demultiplexing on page 33): --adapter-sequence; --with-failed-reads.
Overview: Frequently Asked Questio
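As an illustration of the RPKM normalization named above, the sketch below uses the standard definition: counts divided by feature length in kilobases and by mapped reads in millions. This is an assumption for illustration; the excerpt does not show how CASAVA converts summed exonic bases into the count that enters the formula, and the function name `rpkm` is hypothetical.

```python
def rpkm(count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads (standard definition).

    count: reads (or read-equivalents) assigned to the feature
    feature_length_bp: summed exonic length of the feature in bases
    total_mapped_reads: number of mapped reads in the sample
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and mapped-read total must be positive")
    kilobases = feature_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return count / (kilobases * millions)


if __name__ == "__main__":
    # 500 reads on a 2 kb gene in a sample with 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

Normalizing by both length and sequencing depth is what makes values comparable between genes of different sizes and between runs of different yield.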
196. rerun using:
/path-to-CASAVA/bin/configureBuild.pl -od PROJECT_DIR --targets callSmallVariants
NOTE: We only support data sets originating from the same version of the software.
Generate BAM File with Altered Alignments: An advanced option useful for variant diagnosis is to create BAM files for those reads which had their alignments altered by the variant caller during local realignment. This may be done by adding the option --variantsWriteRealigned to any command line which runs the variant caller.
Targets: The targets that define CASAVA analysis are listed in the tables below.
Table 15: Targets for Variant Detection and Counting
  Option | Description
  all | Run all pre-configured targets for the given analysis type (default), except for target bam.
  sort | Bin reads and sort by position. Remove PCR duplicates for paired-end data.
  assembleIndels | Search for candidate indels from paired-end reads via de novo assembly of contigs, which are aligned back to the reference.
  callSmallVariants | Call SNPs and indels from locally re-aligned reads. Candidate indels from the assembleIndels target can be used to improve indel results. See also Target callSmallVariants Usage on page 97.
  rnaCounts | Calculate gene and exon counts in an RNA-Seq build.
  bam | Aggregate all reads into a single BAM file with chromosome re-labeling. This target is not part of target all and is therefore not done by default. Must be preceded
197. riment fluidics or from intensity plots temperature control
  Condition: Problem with cycle 20, deduced from intensity plots. Suggested Action: Check fluidics and focus for this cycle.
  Condition: Exceptionally high value. Possible Cause: Low first cycle intensity. Suggested Action: Check first cycle focus.
Percentage of Clusters Passing Filters: To remove the least reliable data from the analysis, the raw data can be filtered to remove any clusters that have too much intensity corresponding to bases other than the called base. By default, the purity of the signal from each cluster is examined over the first 25 cycles and calculated as Chastity = Highest_Intensity / (Highest_Intensity + Next_Highest_Intensity) for each cycle. The new default filtering implemented at the base calling stage allows at most one cycle that is less than the Chastity threshold.
The higher the value, the better. This value is very dependent on cluster density, since the major cause of an impure signal in the early cycles is the presence of another cluster within a few micrometers.
  Condition: Very few clusters passing filter.
  Possible Cause: Poor flow cell (perhaps unblocked DNA); faint clusters; out of focus.
  Suggested Action: Some of the causes may be at a single cycle. If the problem is isolated to these early cycles, it is possible that this filtering throws away very good data. Base calling errors may be limited to affected cycles and, as early cycles are fairly resistan
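The chastity filter described above can be sketched in a few lines of Python. This is an illustration rather than the base caller's implementation: the function names are hypothetical, and the threshold of 0.6 is an assumption taken from commonly cited Illumina documentation, since this excerpt does not state the threshold value.

```python
def chastity(intensities):
    """Chastity = highest / (highest + next highest) for one cycle."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """Return True if at most `max_failures` of the first `window` cycles
    fall below the chastity threshold.

    cycles: list of per-cycle intensity lists (one value per channel).
    threshold: assumed value of 0.6; not given in the text above.
    """
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


if __name__ == "__main__":
    clean = [[100.0, 20.0, 10.0, 5.0]] * 25
    mixed = [[100.0, 90.0, 10.0, 5.0]] * 2 + clean[2:]
    print(chastity(clean[0]), passes_filter(clean), passes_filter(mixed))
```

A cluster that overlaps a neighbor tends to show two strong channels in many cycles, which pushes chastity toward 0.5 and is why this value depends so strongly on cluster density.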
198. r1
NOTE: The -j <n> command line option is supported to indicate up to <n> processes in parallel. However, for Bcl conversion the maximum level of parallelization is 8.
Starting Alignment: You can also start alignment before completion of the run using the target r1 when running make for configureAlignment. See Starting Alignment for Read 1 on page 64. Alternatively, you can use the POST_RUN_COMMAND_R1 variable to automatically start the alignment of read 1 at the end of the Bcl conversion. For example:
make -j 8 r1 POST_RUN_COMMAND_R1="cd Aligned; make -j 16 r1"
Starting the Second Read: To start Bcl conversion of the second read, use the regular make command in the Unaligned folder. Perform the following:
1. Move into the Unaligned folder specified by --output-dir.
2. Type the regular make command: make -j 8
3. After the analysis is done, review the analysis for each sample. See Demultiplex_Stats File on page 42.
Bcl Conversion Output Folder: The Bcl Conversion output directory has the following characteristics: The project and sample directory names are derived from the sample sheet. The Demultiplex_Stats file shows where the sample data are saved in the directory structure. The Undetermined_indices directory contains the reads with an unresolved or erroneo
199. r1
NOTE: The -j <n> command line option is supported to indicate up to <n> processes in parallel.
Starting the Second Read: To start alignment of the second read, use the regular make command in the Aligned folder. Perform the following:
1. Move into the Aligned folder.
2. Type the regular make command: make -j n
KAGU_PAIR_PARAMS and KAGU_PARAMS: The parameters KAGU_PARAMS (for all runs) and KAGU_PAIR_PARAMS (for paired-end runs) pass options to the alignmentResolver through the configureAlignment configuration file. For additional information, see configureAlignment Configuration File on page 54. The parameters can be specified lane by lane. All of the options must be specified on a single line and space separated, as in the following examples: StKAGU PRIR PARAMS Circular mui 0 OT 8 KAGU PARAMS mmag 4
The following tables describe the parameters.
Table 7: Parameters for KAGU_PAIR_PARAMS and KAGU_PARAMS
  Parameter | Description
  mmaq | Minimum Mate Alignment Quality. Each read is given a single-read alignment score. This is identical to the alignment score from an eland_extended analysis. If a read has a zero paired-read alignment score but a single-read alignment score that exceeds this threshold, its alignment will still go in the export.txt.gz files. If the alignments of the two reads cannot be paired (resulting in a zero paired score) and only one of the reads ha
200. rter reads. Single candidate alignments for better quality reads will score more highly than single candidate alignments of lower quality reads. Single candidate alignments to shorter genomes will score more highly than single candidate alignments to longer genomes.
Unreported Unique Alignments: A line in an export file will only contain alignment information if the alignment score for that read exceeds a threshold. The primary purpose of this threshold is to retain only alignments that are markedly better than any other possible alignment for the read. configureAlignment reduces alignment quality to a single confidence score, and read quality, the number of mismatches in the best alignment, and the presence of other candidate alignments all contribute to the calculation of that score. Therefore, changes in any of these three variables will affect whether the alignment passes the alignment quality threshold. So, even if only a single candidate alignment has been found for a read, it may still fail the alignment quality threshold for one of two reasons and not be reported in export.txt.gz:
  Low base quality values.
  Excessive number of mismatches in the candidate alignment. There will be at most 2 mismatches in the seed, but potentially there can be any number of mismatches in the remainder of the read.
For most applications this is the right thing in both cases. For ex
201. s an alignment exceeding min single read alignment score, the read pair is treated as a singleton. The alignment of the orphan read is unreliable enough to be ignored. The default value is 4.
Table 8: Parameters for KAGU_PAIR_PARAMS Only
  Parameter | Description
  --circular | This causes alignmentResolver to treat each chromosome as circular and not linear, enabling it to detect valid pairings that wrap around when the two alignments are mapped onto the linear representation of the chromosome.
  --circular <my_mitochondria_file.fa> | Treat alignments to my_mitochondria_file.fa as circular, but other chromosomes as linear, as you might want to do when, for example, aligning to the whole human genome.
  mag bed | Minimum percentage of Unique Fragments. A unique pair is defined as a read pair such that its constituent reads can each be aligned to a unique position in the genome without needing to make use of the fact that they are paired. alignmentResolver works in a two-pass fashion:
    1. On the first pass, it looks for all clusters that pass the quality filter and have a unique alignment of each of their two reads, then uses this information to determine the nominal insert size distribution and the relative orientation of the two reads.
    2. On a second pass, this information is used to resolve repeats and other ambiguous cases.
  The number of uniqu
202. s that define CASAVA variant detection and counting analysis are listed in the tables below, with SE = single-end (single-read) and PE = paired-end. The primary options that define CASAVA variant detection and counting analysis are listed on the next pages. Advanced options for fine-tuning the variant calling are listed in Advanced Options for Variant Detection on page 152.
NOTE: The option --outDir is mandatory for all analysis types. CASAVA will not run if this option is missing. CASAVA will only run without --inSampleDir if the build has already been configured with --inSampleDir before.
Global Options: The options described below are global options used to specify analysis across different targets.
Table 16: Major File Options for Variant Detection and Counting
  Option | Application | Description
  -id, --inSampleDir PATH | SE, PE | PATH to the aligned sample input directory. Example: -id TestData/Aligned/Project_<SampleProject>/Sample_<SampleID>
  -od, --outDir PATH | SE, PE | PATH to the build sample output directory. Example: -od /home/user name/data/Project 01
  -ref, --refSequences PATH | SE, PE | PATH of the reference genome sequences. Default is buildDir/genomes. Example: -ref /data/Genome/CASAVA/hg18
  --samtoolsRefFile FILE | SE, PE | The
203. s unreliable, but rather that only the base calls flagged with Q2 are unreliable. Note, however, that these regions are included in the Gerald error rate calculations for aligned reads. In typical sequencing runs, most reads are reliable over their entire length and are not marked with Q2 indicators. Of the reads that are marked with the Q2 indicator, most are flagged only in the final few cycles.
Demultiplex_Stats File: The Demultiplex_Stats.htm file provides statistics about demultiplexing and shows where samples are saved in the directory structure. The Demultiplex_Stats file is located in the Unaligned/Basecall_Stats_FCID directory. The file contains the sample information from the sample sheet, with added rows for reads that end up in the Undetermined_indices directory. If no sample sheet exists, CASAVA generates rows for each lane. The Demultiplex_Stats file has a number of additional columns that display demultiplexing statistics and show the directory the samples are saved in. The Demultiplex_Stats file contains the following fields:
  Field | Description
  Lane | Positive integer, indicating the lane number (1 to 8).
  SampleID | ID of the sample.
  SampleRef | The reference sequence for the sample.
  Index | Index sequence.
  Description | Description of the sample.
  Control | Y indicates this lane is a control lane, N means sample.
  Project | The project the sample belongs to.
  # Reads | Number of reads (equals total number of lines in fastq files
204. score
Each sequence identifier, the line that precedes the sequence and describes it, needs to be in the following format:
@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
The elements are described below.
  Element | Requirements | Description
  @ | @ | Each sequence identifier line starts with @.
  <instrument> | Characters allowed: a-z, A-Z, 0-9, and underscore | Instrument ID.
  <run number> | Numerical | Run number on instrument.
  <flowcell ID> | Characters allowed: a-z, A-Z, 0-9 | Flowcell ID.
  <lane> | Numerical | Lane number.
  <tile> | Numerical | Tile number.
  <x-pos> | Numerical | X coordinate of cluster.
  <y-pos> | Numerical | Y coordinate of cluster.
  <read> | Numerical | Read number. 1 can be single read or read 2 of paired-end.
  <is filtered> | Y or N | Y if the read is filtered, N otherwise.
  <control number> | Numerical | 0 when none of the control bits are on, otherwise it is an even number. See below.
  <index sequence> | ACTG | Index sequence.
An example of a valid entry is as follows; note the space preceding the read number element:
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
NOTE: CASAVA 1.8.2 FASTQ files contain only reads that passed filtering. If you
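The identifier layout above lends itself to a small parser. The Python sketch below is illustrative only: the function name `parse_identifier` is hypothetical, and it assumes that the fields are colon-separated with a single space before the read number, which is how the CASAVA 1.8 header layout is generally documented.

```python
import re

# Assumed layout: @<instrument>:<run>:<flowcell>:<lane>:<tile>:<x>:<y>
#                 <read>:<is filtered>:<control number>:<index sequence>
IDENTIFIER = re.compile(
    r"^@(?P<instrument>[A-Za-z0-9_]+):(?P<run>\d+):(?P<flowcell>[A-Za-z0-9]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r" (?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>[ACGTN]*)$"
)

NUMERIC_FIELDS = ("run", "lane", "tile", "x", "y", "read", "control")


def parse_identifier(line):
    """Parse one sequence identifier line into a dict (hypothetical helper)."""
    match = IDENTIFIER.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"not a recognized sequence identifier: {line!r}")
    fields = match.groupdict()
    for key in NUMERIC_FIELDS:
        fields[key] = int(fields[key])
    # Expose the Y/N flag as a boolean for convenience.
    fields["filtered"] = fields["filtered"] == "Y"
    return fields


if __name__ == "__main__":
    print(parse_identifier("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG"))
```

Splitting on the single space first and then on colons would work as well; a regular expression is used here so that malformed identifiers are rejected rather than silently misread.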
205. sent in CASAVA 1.7 has been integrated in the bcl conversion step.
Demultiplexing: Multiplexed sequencing allows you to run multiple samples per lane. The samples are identified by index sequences (barcodes) that are attached to the template during sample preparation. For TruSeq dual indexing, you can analyze up to 96 individual samples per lane, while TruSeq multiplexing with a single index allows up to 12 samples in one lane. Multiplexed sequencing runs from SCS 2.4 and later versions set the index reads as separate reads. Sample demultiplexing in CASAVA creates several subdirectories to dispatch the data associated with the different barcodes. Each subdirectory has a structure similar to the original BaseCalls directory.
Aligning Reads: CASAVA performs sequence alignment using the configureAlignment module, which is a set of utilities supplied as source code and scripts. The output data produced by configureAlignment are stored in a hierarchical folder structure called the run folder. The run folder includes all data folders generated from the sequencing platform and the data analysis software. For the alignment step, the standard input files for reads are the compressed FASTQ files: <sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>.fastq.gz. The standard output files for reads are the export files: <sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3
206. sion levels for exons, genes, and splice junctions in the RNA Sequencing analysis. In addition, CASAVA automatically generates a range of statistics, such as mean depth and percentage chromosome coverage, to enable comparison with previous builds or other samples.
CASAVA analyzes sequencing reads in three stages: FASTQ file generation and demultiplexing; alignment to a reference genome; variant detection and counting. These three stages are explained below.
[Figure 2: CASAVA Workflow. Input: bcl files. FASTQ Generation and Demultiplexing: convert bcl files into compressed FASTQ files; separate multiplexed sequence runs by index. Aligning: align to reference genome. Detecting Variants and Counting: build consensus sequence; call SNPs; detect indels and structural variants; count RNA reads. Output: CASAVA output files (build) for GenomeStudio.]
Bcl Conversion: CASAVA 1.8 uses bcl files as primary sequence input. The first step, bcl conversion, performs the following:
  Generates compressed FASTQ files that can be used by configureAlignment.
  Organizes the output in Project and Sample folders based on the sample sheet (if provided).
  Demultiplexes samples into that same run folder organization based on the sample sheet.
NOTE: The separate demultiplexing step pre
207. sion used to generate the file.
  CL | The configureBuild.pl command line used to execute or create the workflow for the SAM target.
An example of a header line is shown below:
@PG ID:CASAVA VN:CASAVA-1.8.0 CL:/home/user1/CASAVA_20091209/bin/configureBuild.pl -p testBaseMiniBAM --targets bam
Alignment Section: The alignment section consists of multiple TAB-delimited lines, with each line describing an alignment. Each line is:
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> [<TAG>:<VTYPE>:<VALUE> [...]]
An example of a line in an alignment section is shown below:
HW1 EAS568 9096 2115 512 204 99 Cczz ta 14483804 29 76M6I118M 14484254 550 AGAAATGTTCTAAAATTAAATTGTAGTGATGTCTGCACAACTTTGTAAGT TTATAAAAAATAATTGACTTGTACACTTAATATTAATGAGTTGTATGGCA HGFGHGHHGHHIHEGHHHEHHHFEECBBFBGFHHHEHHHEHHGHHHDHHD HEHDEFHH CC C6HHHEED FFFHHHF HEHH HHH HGHHGBHFBD KDE TOO lo SMST 1511429
The format of each field is explained in the following table.
  Field | Description
  QNAME | Query pair name if paired, or query name if unpaired. This consists of the following sequence: <Machine>_<Run number>:<Lane>:<Tile>:<X coordinate of cluster>:<Y coordinate of cluster>
  FLAG | Bitwise flag. For a description, see Bitwise Flag Values on page 170.
  RNAME | Re
208. stood that this is the default status of an unmapped read.
Reads which cross a splice junction are annotated as a single record using the SAM CIGAR N (SKIP) character. For example, a 75-base read spanning a 1000-base intron may have the CIGAR string 35M1000N40M.
Chromosome names cannot be changed in the chromosome BAM files on which CASAVA operates. The bam module may be used to create a whole-genome BAM file with translated chromosome labels. Note that whole-genome BAM files are now the only option for the bam module.
Converting SAM and BAM Files: If you want to convert SAM files into BAM files, enter the following:
samtools view -b -h -S -o output.bam <in.sam>
Where:
  -b | Output in the BAM format.
  -h | Include the header in the output.
  -S | Input is in SAM. If @SQ header lines are absent, the -t option is required.
  -o FILE | Output file [stdout].
If you want to convert BAM files into SAM files, enter the following:
samtools view -h -o output.sam <in.bam>
Where:
  -h | Include the header in the output.
  -o FILE | Output file [stdout].
For more information, see samtools.sourceforge.net.
Variant Detection Output Files: All variant caller output files are written in a text format composed of one header segment followed by one data segment. All lines in the header segment begin with the # character. Header lines beginning with the sequence #$ contain a key/value pair. The reserved key COLUMNS has an associ
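The spliced CIGAR example above (35M1000N40M) can be checked with a short Python sketch. The helper names are hypothetical; the operator semantics follow the SAM specification, in which M, D, N, = and X consume reference bases and M, I, S, = and X consume read bases.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Return a list of (length, operation) tuples for a CIGAR string."""
    ops = [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]
    # Make sure the whole string was consumed by valid operations.
    if "".join(f"{length}{op}" for length, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops


def reference_span(cigar):
    """Number of reference bases covered, including skipped introns (N)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


def read_length(cigar):
    """Number of read bases described by the CIGAR string."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MIS=X")


if __name__ == "__main__":
    cigar = "35M1000N40M"
    print(read_length(cigar), reference_span(cigar))
```

For the example string, the read length is 75 bases while the alignment spans 1075 reference bases, which is how a single record can represent a read crossing a 1000-base intron.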
209. struments, instrument configuration, genomic sample type, type of analysis, flow cell preparation, and the current state of the art. Therefore, the numbers shown in this section are for example only.
Summary Pages: After analysis is complete, check the FlowCellSummary_FCID.htm, Sample_Summary.htm, and Barcode_Lane_Summary.htm files. These provide metrics per flow cell, sample, and barcode lane, respectively. For a description of the tables found, see Flow Cell Summary on page 79, Sample Summary Page on page 74, and Barcode Lane Summary Page on page 78. The key parameters that you should examine are listed in Summary Tab on page 17 and in the following sections.
Percentage of Clusters Passing Filters that Align Uniquely to the Reference Genome: Optimal value depends on the genome sequenced and the read length; the higher (up to 100% max), the better. This result is genome-specific and dependent on the completeness of the reference. A failure to align could be due to repeat or missing regions, or due to indels where sample and reference do not match.
  Condition: Much lower than expected when
  Possible Cause: Fluidics or instrument problem. Suggested Action: Look for an intensity dip in IVC plots. If there is a problem and it occurs after a sufficiently useful read length, re-run ELANDv2e analysis using only the good cycles before the instrument problem using
  Possible Cause: Contamination. Suggested Action: Align a few sample tiles Genomic contamination wil
210. t to minor focus and fluidics problems, even the number of errors may be few. The filtering can always be set manually to some other values. Check before assuming all the data are poor.
  Further possible causes: bubbles in individual tiles; too many clusters; poor matrix; a fluidics or sequencing failure; large clusters; high phasing or prephasing.
Percentage of Phasing and Prephasing: Ideally, these values should be as low as possible.
  Condition: High phasing or prephasing.
  Possible Cause: Reagent issue (reagents have deteriorated); fluidics; poor flow cell; ambient temperature of system.
  Suggested Action: Check for leaks or bubbles in images, or early cycle discrepancies in intensity plots. Poor blocking can be evident as intensity in all channels from cycle 1. Check whether machine or facility temperature gets beyond recommended limits.
Standard Deviations: Many values have standard deviations associated with them. This can be the first indication as to the uniformity of the flow cell. If standard deviations are high, it indicates variability from tile to tile within a lane.
  Condition: High standard deviations.
  Possible Cause: Check poor tiles for: bubbles; focus; dirty flow cell surface.
  Suggested Action: Look at the tile-by-tile statistics that appear below the flow cell-wide summary.
After reviewing the tables in Summary.htm, examine the thumbnails.
Bcl Conversion and Demultiplexing: Introduction; Bcl Conversio
211. t Gapped Alignment
[Figure: reads spanning an indel. Singleseed alignment, ungapped: reads spanning indel do not align properly. Gapped extension: reads spanning indel align properly. Multiseed alignment: the first 32-base seed spanning the indel does not align properly, so no extension is possible; the second 32-base seed aligns properly and is extended with a gapped alignment.]
Note that a read has to have at least one seed that matches with at most 2 mismatches, and for that seed no gaps are allowed. For the whole read, we allow any number of gaps, as long as they correct at least five mismatches downstream.
Alignment Score Calculation: The base quality values and the positions of the mismatches in a candidate alignment are used to give a probability score (p-value) to each candidate. This is the probability that the candidate position in the genome aligned to would, if its bases were sequenced at error rates that correspond to the read's quality values, give rise to the observed read. This way, the contribution of each base is weighted according to its quality.
NOTE: A consequence of this is that the best alignment does not necessarily have the least number of mismatches, although an exact match will always beat any alignment containing m
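The quality-weighted probability described above can be illustrated with a small Python sketch. It is not ELAND's implementation: the function names are hypothetical, and it assumes Phred-scaled qualities (error probability 10^(-Q/10)) and that a sequencing error is equally likely to produce any of the three other bases.

```python
def phred_to_error(quality):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-quality / 10)


def candidate_probability(read, candidate, qualities):
    """Probability that `candidate` would be read as `read`, given the
    per-base error rates implied by `qualities` (illustrative model)."""
    probability = 1.0
    for read_base, genome_base, quality in zip(read, candidate, qualities):
        error = phred_to_error(quality)
        if read_base == genome_base:
            probability *= 1.0 - error
        else:
            # Assumption: the error is split evenly over the other 3 bases.
            probability *= error / 3.0
    return probability


if __name__ == "__main__":
    read = "ACGT"
    qualities = [40, 40, 2, 2]
    one_mismatch_high_q = "TCGT"   # mismatch at a Q40 base
    two_mismatch_low_q = "ACAA"    # mismatches at two Q2 bases
    print(candidate_probability(read, one_mismatch_high_q, qualities))
    print(candidate_probability(read, two_mismatch_low_q, qualities))
    print(candidate_probability(read, read, qualities))
```

Under these assumptions, the candidate with two mismatches at low-quality bases scores higher than the candidate with one mismatch at a high-quality base, while the exact match scores highest of the three, in line with the NOTE above.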
212. t and sample directories specified in the sample sheet, as illustrated below for the sample in line 4 of the sample sheet.
Figure 9 Relation between Sample Sheet and Directory Structure [Figure: a sample sheet with the columns FCID, Lane, SampleID, SampleRef, Index, Description, Control, Recipe, Operator, and SampleProject, mapped to the run folder <ExperimentName>/YYMMDD_machinename_XXXX_FC, whose Unaligned directory contains Project_A (Sample_A and Sample_B) and Project_B (Sample_C); each sample directory holds fastq.gz files.]
Bcl Conversion Demultiplexing Examples: Bcl conversion and demultiplexing support four scenarios. Multiplexed samples present, with sample sheet: Reads are placed within the directory structure specified by the sample sheet, based on the index and lane information. Reads for which the index sequence was ambiguous are placed in a project directory called Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane. Multiplexed and non-multiplexed samples present, with sample sheet: Reads are placed within the directory structure specified by the sample sheet, based on the index and lane information. Reads containing ambiguous or no barcodes will be placed in a pr
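As a rough illustration of how sample sheet rows relate to the project and sample directories, the sketch below derives a directory path from each row. The column names and the Project_/Sample_ prefixes come from the guide; the sample sheet values and the helper name sample_directory are hypothetical, and CASAVA itself creates these directories during Bcl conversion.

```python
import csv
import io
import posixpath

# Hypothetical sample sheet: column names follow the guide, values are illustrative.
SAMPLE_SHEET = """FCID,Lane,SampleID,SampleRef,Index,Description,Control,Recipe,Operator,SampleProject
FC1,1,A,hg18,ATCACG,Example,N,PE_indexing,FZ,A
FC1,1,B,hg18,CGATGT,Example,N,PE_indexing,FZ,A
FC1,2,C,hg18,ATCACG,Example,N,PE_indexing,FZ,B
"""


def sample_directory(row, unaligned="Unaligned"):
    """Return Unaligned/Project_<SampleProject>/Sample_<SampleID> for one row."""
    return posixpath.join(
        unaligned,
        "Project_" + row["SampleProject"],
        "Sample_" + row["SampleID"],
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_SHEET)))
directories = [sample_directory(row) for row in rows]
```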
213. t improvements of ELANDv2e are improved repeat resolution and implementation of orphan alignment. A short description of these improvements is provided below; more information about ELANDv2 is available in Algorithm Descriptions on page 131.
ELANDv2: The most important improvements of ELANDv2 are the following: it handles indels and mismatches better by performing multiseed and gapped alignments; it has enhanced match descriptor options to handle the gaps identified (see Export.txt.gz on page 79); and it can split queries on a per-tile basis, allowing much greater parallelization. The hashing method in ELANDv2 was optimized for performance in CASAVA 1.7. This leads to a significant improvement in running times for the seed-matching step of the alignment in CASAVA. More information about ELANDv2 is available on page 151.
ELANDv2e Alignment Improvements: CASAVA 1.8 features ELANDv2e. This updated alignment program includes the following new features: better repeat resolution; a new orphan aligner; shorter run times with a new version of alignmentResolver.
configureAlignment Input Files: The folder structure and format of configureAlignment input have changed significantly in CASAVA 1.8. The major changes are as follows: configureAlignment uses FASTQ files as sequence input. Bcl conversion and demultiplexing are merged in one step, and both multiplexed and non multipl
214. t is universally available and has a built-in parallelization switch (-j). For example, on a dual-processor, dual-core system, running make -j 4 instead of make executes the configureAlignment run in parallel over four processor cores, with an almost 4-fold decrease in analysis run time. On a system with more sockets or more cores per socket, -j 8 or more may be advisable.
Distributed Make: There are several distributed versions of make for cluster systems. Frequently used versions include qmake from Sun Grid Engine (SGE). To use qmake, a short wrapper script is required; see below for details. There are known issues with the use of lsmake that prevent parts of CASAVA from running. Therefore, Illumina does not recommend using lsmake to run CASAVA. NOTE: Distributed cluster computing may require significant system administration expertise. Illumina does not support external installations.
Using qmake: SGE has the utility qmake, which can run the tasks of a make across a cluster in parallel. There are two possible ways to run this.
Separate Jobs on Queuing System: The first generates each make task as a separate job run on the queuing system. 1. Move into the output folder. 2. Create a script file which contains the following: qmake -cwd -v PATH -- -j 32 3. Submit the jobs to the SGE: qsub -cwd -v PATH <script file> The options convey the following infor
215. t size. Reads are filtered on ELAND alignment score. For paired-end reads, the variant caller by default removes any read with a paired-end alignment score less than 90; for single-end reads, it removes those with a single-end alignment score less than 10.
Detecting Indels and Realigning: The variant caller proceeds with candidate indel discovery and generation of alternate read alignments based on these candidate indels. As part of this realignment process, the variant caller selects a representative alignment to be used for site genotype calling and depth summarization by the SNP caller. This alignment is selected to be within a certain threshold of the most likely of all alignments for a read, and any leading or trailing portions of the read with ambiguous support for 2 or more different alignments are marked as clipped. This representative alignment does not affect the indel caller: the indel calling process considers all alignments for each read, without end clipping. For diagnostic purposes, the set of reads whose alignments are altered during realignment may be written out to a separate BAM file for each chromosome using the variantsWriteRealigned flag.
Indel Caller: The indel caller finds indels using a two-stage process. In the first stage, an indel must be identified as a candidate indel. In the second stage, after indel candidates have been identified, all intersecting reads are aligned to each indel to the reference and t
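The default score filter described above can be sketched as a simple threshold test. The cutoffs (90 for paired-end, 10 for single-end) are the defaults stated in the guide; the function name keep_read and its arguments are hypothetical and do not correspond to CASAVA code.

```python
PAIRED_END_CUTOFF = 90   # default paired-end alignment score threshold
SINGLE_END_CUTOFF = 10   # default single-end alignment score threshold


def keep_read(alignment_score, paired_end):
    """Return True if the read survives the default alignment score filter.

    Reads scoring below the cutoff are removed; reads at or above it are kept.
    """
    cutoff = PAIRED_END_CUTOFF if paired_end else SINGLE_END_CUTOFF
    return alignment_score >= cutoff


kept_paired = [score for score in (45, 89, 90, 120) if keep_read(score, paired_end=True)]
kept_single = [score for score in (0, 9, 10, 50) if keep_read(score, paired_end=False)]
```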
216. tact Illumina Support. For example, for a laboratory generating 200 GB of sequence per week, the Tier 1 IlluminaCompute solution is recommended, for which the specifications are listed below (non-IlluminaCompute systems satisfying these requirements are also fully supported): 1 APC NetShelter 40U rack with 1U KMM console; 3 Dell R610 servers (8 CPU cores, 48 GB RAM); 3 Isilon IQ12000x storage modules; 1 Serial MGT Console 16; 2 Cisco 3750e switches.
Sequence alignment takes between a few hours (using our fast short-read whole-genome alignment program, ELAND) and days (using more traditional alignment programs). CASAVA parallelization is built around the multi-processor facilities of the make utility and scales very well to beyond eight nodes. Substantial speed increases are expected for parallelization across several hundred CPUs. For a detailed description, see Using Parallelization on page 119.
Disk Space Requirements: When running CASAVA without keeping temporary data (removeTemps ON): disk space needed while running is 3 x the size of the export files; disk space needed after running is 1.5 x the size of the export files. When running with all temporary files saved (removeTemps OFF): disk space needed while running is 5 x the size of the export files. For example, to generate a build from one lane of E. coli data (1 GB) with removeTemps ON, we recommend an additional 3 GB of disk space while running CASAVA and 1.5 GB for the final build directory. NOTE F
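The disk space rules above reduce to simple multiplication, sketched below. The multipliers are taken from the guide; the function name disk_space_gb and the dictionary keys are hypothetical. The guide gives no post-run figure for removeTemps OFF, so none is computed here.

```python
def disk_space_gb(export_size_gb, remove_temps=True):
    """Estimate disk space needs from the total size of the export files (in GB)."""
    if remove_temps:
        return {
            "while_running": 3 * export_size_gb,
            "after_running": 1.5 * export_size_gb,
        }
    # removeTemps OFF: only the while-running figure is given in the guide.
    return {"while_running": 5 * export_size_gb}


ecoli_lane = disk_space_gb(1.0)  # one lane of E. coli data, about 1 GB
```

For the 1 GB E. coli example this yields 3 GB while running and 1.5 GB for the final build, in line with the figures quoted above.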
217. tered to remove those indels which are found at a depth greater than a multiple of the mean chromosomal depth. By default, 3 times the mean chromosomal depth is used; this can be changed using the variantsIndelCovCutoff option. This filter is designed to remove indel calls in regions close to centromeres and other high-depth regions likely to generate spurious calls. NOTE: This filter is off for RNA variant calling, and we recommend turning it off for targeted resequencing.
The indels.txt file follows the general variant caller output file structure. The data segment of this file consists of 16 tab-delimited fields. The fields are listed below; note that all information is given with respect to the forward strand of the reference sequence.
No. Label
1 seq_name
2 pos
3 type
4 ref_upstream
5 ref/indel
6 ref_downstream
7 Q(indel)
8 max_gtype
9 Q(max_gtype)
10 depth
11 alt_reads
12 indel_reads
13 other_reads
14 repeat_unit
15 ref_repeat_count
16 indel_repeat_count
Description of seq_name: Reference sequence label. Description of pos: Except for right-side breakpoints, the reported start position of the indel is the first (left-most) reference position following the indel breakpoint. For right-side breakpoints, the reported position is the right-most position preceding the breakpoint. Also note that wherever the same indel could be repres
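The depth filter described at the start of this excerpt can be sketched as follows. The default multiple of 3 is taken from the guide; the record layout (a dictionary with a depth key) and the function name are hypothetical simplifications, not CASAVA's internal representation.

```python
def filter_indels_by_depth(indels, mean_chromosomal_depth, multiple=3.0):
    """Drop indel calls whose depth exceeds `multiple` x the mean chromosomal depth."""
    max_depth = multiple * mean_chromosomal_depth
    return [indel for indel in indels if indel["depth"] <= max_depth]


# Hypothetical indel calls on a chromosome with a mean depth of 30.
calls = [
    {"pos": 1000, "depth": 28},
    {"pos": 2000, "depth": 90},   # exactly 3 x mean depth: kept
    {"pos": 3000, "depth": 250},  # likely a high-depth artifact: removed
]
filtered = filter_indels_by_depth(calls, mean_chromosomal_depth=30)
```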
218. the genome and splice junctions. Reads aligned to abundant sequences and masked by eland_rna do not count toward this number.
Alignment Score (PF): The average filtered read alignment score; reads with multiple or no alignments effectively contribute scores of 0. For PhiX spikes, the number of reads aligning to PhiX is small, so the reported alignment score (small number of aligned reads divided by total number of PF reads) is usually small.
Mismatch Rate (PF): The percentage of called bases in aligned reads that do not match the reference.
% of >= Q30 bases (PF): Yield of bases with Q30 or higher from clusters passing filter, divided by the total yield of clusters passing filter.
Mean Quality Score (PF): The total sum of quality scores of clusters passing filter, divided by the total yield of clusters passing filter.
If eland_pair analysis has been specified for one or more lanes, two Lane Results Summaries are produced, one for each read. All lanes for which analysis has been specified are represented in the Read 1 table, but only those for which eland_pair analysis has been specified contribute statistics to the Read 2 table.
Expanded Sample Summary: This displays more detailed quality metrics for each sample. Clusters (raw): The number of clusters detected by the image analysis module. % Phasing: The estimated or specified value used f
219. the header for each sequence. Direct CASAVA to a multi-sequence FASTA file using the option SAMTOOLS_GENOME for configureAlignment. WARNING: GenomeStudio does not support the use of multi-sequence FASTA files. Therefore, if you want to analyze your output in GenomeStudio, we recommend using single-sequence FASTA reference files.
Chromosome Naming Restrictions: CASAVA does not accept certain characters in the FASTA chromosome name header. This validation can be disabled in configureAlignment using the following option: CHROM_NAME_VALIDATION off. WARNING: You may run into problems with downstream analysis if you disable chromosome name validation.
NOTE: If ELAND finds two alignments with identical alignment scores, ELAND picks the first alignment (in the single-end case) or combination of alignments (in the paired-end case) that exhibits the highest observed alignment quality. These are the alignments that make it into the export files, which contain only the best alignment for each read. In practice, post-alignment CASAVA ignores these reads because of their low alignment qualities. For these reads, using a reference with lexicographic chromosome names like chr1 yields slightly different results compared to a reference with numerical chromosome names like 1, since the hits are sorted in a different way.
Reference Sequence Blocks: For reasons of efficiency E
220. thms have been updated in CASAVA v1.8 to improve run times. The module alignmentResolver (previously called PickBestPair) has been rewritten, which has resulted in much faster run times for this step (200 hours for v1.7 versus 15 for v1.8). The best analysis type therefore depends on the project: is a shorter run time more important, or the highest number of aligned reads?
Variant Detection: Post-alignment, CASAVA performs variant detection using two modules. The assembleIndels module (Grouper) detects candidate indels using singleton, orphan, and anomalous read pairs. The assembleIndels module works well for detecting larger indels. The candidate indels detected by the assembleIndels module are passed on to the small variant caller for consolidation and genotyping. The callSmallVariants module genotypes and provides quality scores for SNPs and indels. Indels can be called from candidate indel evidence provided both by ELAND gapped read alignments (for smaller indels) and by the assembleIndels module (for larger indels). For each SNP or indel call, the probability of both the called genotype and any non-reference genotype is provided as a quality score (Q-score). Reads are realigned around candidate indels to improve the quality of SNP calls and site coverage summaries. The callSmallVariants module also generates files which summarize the depth and genotype probabilities for every site in the genome. As a final step it pro
221. tions should not exceed 16 million: ELAND_FASTQ_FILES_PER_PROCESS value x fastq-cluster-count value ≤ 16 million. NOTE: The fastq-cluster-count used during Bcl conversion can be found in Unaligned/Makefile. See the table below for set size and cluster count combinations.
CAUTION: Setting the right value for ELAND_FASTQ_FILES_PER_PROCESS is very important. A value that is too high may result in silent crashes due to excessive memory utilization and should be avoided. A value that is too low may result in decreased performance. Use is optional, and we generally recommend using default values.
fastq-cluster-count; ELAND_FASTQ_FILES_PER_PROCESS; Reads per process; Comment
12,000,000; 1; 12,000,000
6,000,000; 2; 12,000,000
4,000,000; 3; 12,000,000; Default values
3,000,000; 4; 12,000,000
2,000,000; 6; 12,000,000
1,000,000; 12; 12,000,000
NOTE: Slight differences can be expected when using different combinations of fastq-cluster-count and ELAND_FASTQ_FILES_PER_PROCESS. The fastq-cluster-count used during Bcl conversion can be found in Unaligned/Makefile.
Make Option: The make option creates Aligned output directories and makefiles. Without the option, configureAlignment.pl does not create any directories or files and only operates in a diagnostic mode. You must specify this option to gen
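The constraint above can be checked with a short calculation before configuring an alignment. The 16 million limit and the value combinations come from the table in this excerpt; the function names are hypothetical helpers, not part of CASAVA.

```python
MAX_READS_PER_PROCESS = 16_000_000  # limit stated in the guide


def reads_per_process(fastq_cluster_count, files_per_process):
    """Reads handled by one ELAND process for a given combination of settings."""
    return fastq_cluster_count * files_per_process


def is_within_limit(fastq_cluster_count, files_per_process):
    """True if the combination stays at or below the 16 million read limit."""
    return reads_per_process(fastq_cluster_count, files_per_process) <= MAX_READS_PER_PROCESS


# Combinations listed in the table above (fastq-cluster-count, files per process).
table = [(12_000_000, 1), (6_000_000, 2), (4_000_000, 3),
         (3_000_000, 4), (2_000_000, 6), (1_000_000, 12)]
per_process = [reads_per_process(count, files) for count, files in table]
```

Every combination in the table gives 12,000,000 reads per process, which leaves headroom under the limit; a combination such as 4,000,000 clusters with 5 files per process would exceed it.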
222. tributions such as RedHat and Fedora, or on other Unix variants, if all of the prerequisites described in this section are met. The required software environment is described below. CASAVA installation may not work properly with gcc versions 3.x. If you have a gcc version 3.x, install gcc 4.0.0 or newer, up to and including version gcc 4.5.2, with the exception of gcc version 4.0.2, which is not supported. Installation of CASAVA 1.8 now requires the Boost C++ library version 1.44.0 and cmake version 2.8.0 and above. These packages are included in the CASAVA installation package and are installed automatically during the configure stage if either package is not found in the user's environment.
The following software is required to run CASAVA 1.8; check whether it has been installed: GNU make (3.81 recommended); Perl >= 5.8; Python >= 2.3 and <= 2.6; PyXML; gnuplot >= 3.7 (4.0 recommended); ImageMagick >= 5.4.7; ghostscript; libxslt; libxslt-devel; libxml2; libxml2-devel; libxml2-python; ncurses; ncurses-devel; gcc 4.0.0 or newer, up to and including version gcc 4.4.x (except 4.0.2), with C++; libtiff; libtiff-devel; bzip2; bzip2-devel; zlib; zlib-devel.
Perl modules: perl-XML-Dumper; perl-XML-Grove; perl-XML-LibXML; perl-XML-LibXML-Common; perl-XML-NamespaceSupport; perl-XML-Parser; perl-XML-SAX; perl-XML-Simple; perl-XML-Twig; perldoc.
223. txt files must be non-multiplexed or already demultiplexed into separate directories. If the converter finds reads 1, 2, and 3 from a multiplexed run, it converts all three to FASTQ, but configureAlignment cannot run on these files. A config.xml file must be found in the Qseq Basecalls folder, or the config file argument to the Qseq Converter must point to an equivalent file. The config.xml file must be copied to the FASTQ root folder and renamed DemultiplexedBustardConfig.xml. NOTE: configureAlignment requires SampleSheet.csv and SampleSheet.xml files, but default versions of both files are created by the Qseq Converter.
Running Qseq Converter: To convert _qseq.txt files, run the configureQseqToFastq.pl script. This sets up the run by generating a makefile and metadata. Running make or qmake then converts the _qseq.txt files into FASTQ files. 1. Enter the following command to create a makefile for the conversion with the desired compression option: /path-to-CASAVA/bin/configureQseqToFastq.pl --input-dir DIR [options] 2. Move into the newly created output folder. Type the make command for basic analysis: make NOTE: You may prefer to use the parallelization option as follows: make -j 3 all The extent of the parallelization depends on the setup of your computer or computing cluster.
224. ule to perform local read realignment and genotype SNPs and indels under a diploid model. 4. In an RNA-Seq build, the rnaCounts module is also run to calculate gene and exon counts. Other optional modules can be added to the build process to perform additional functions.
For the variant discovery and counting step, the standard input file format for reads is the export format: <sample name>_<barcode sequence>_L<lane>_R<read number>_<0-padded 3-digit set number>_export.gz. The standard output file format for reads is the BAM format. The sorted.bam files are stored in chromosome-specific directories under the output directory. Use and properties of CASAVA's post-alignment modules are explained in Variant Detection and Counting on page 87. More information about the algorithms is available in Variant Detection on page 141.
Capabilities and Limitations: This section explains the capabilities and limitations of CASAVA when performing data analysis.
Demultiplexing: Demultiplexing is required for downstream analysis when a run is indexed. Demultiplexing processes the read data so that the reads are segregated and copied into separate directories, with the indexing read (or barcodes) parsed and removed.
Alignment: Alignment is controlled by the configureAlignment.pl wrapper script, which includes several analysis modes that initiate single
225. ultiplexed compressed FASTQ files. One level down from the Unaligned directory are the project directories, and within each project directory are the sample directories. Reads with undetermined indices are placed in the directory Undetermined_indices, unless the sample sheet specifies a specific sample and project for reads without index in that lane. NOTE: CASAVA 1.8 introduces samples and projects as the organizing principle, which differs from CASAVA 1.7, which organized output by lanes or index.
Figure 8 Typical Run Folder Structure after Bcl Conversion and Demultiplexing [Figure: two directory trees under <ExperimentName>/YYMMDD_machinename_XXXX_FC. Before Bcl Conversion: Data/Intensities (config.xml file) and BaseCalls (config.xml file, SampleSheet.csv file, L001 directories by lane containing per-cycle directories, filter files, control files), plus the RunInfo.xml file. After Bcl Conversion: the same folders plus Project_A (Sample_A, Sample_B), Project_B (Sample_C), and Undetermined_indices (Sample_lane1, Sample_lane2), each sample directory holding fastq.gz files and a SampleSheet.csv file, and Basecall_Stats_FC.]
226. us index. If no sample sheet exists, CASAVA generates a project directory named after the flow cell and sample directories for each lane. Each directory is a valid base calls directory that can be used for subsequent alignment analysis in CASAVA. NOTE: If the majority of reads end up in the Undetermined_indices folder, check the --use-bases-mask parameter syntax and the length of the index in the sample sheet. You may need to set the --use-bases-mask option to the length of the index in the sample sheet plus the character n, to account for phasing. Note that you will not be able to see which indices have been placed in the Undetermined_indices folder.
[Figure: output directory tree. <ExperimentName>/YYMMDD_machinename_XXXX_FC/Unaligned contains Project_DirA through Project_DirX (each with Sample_DirA through Sample_DirX holding fastq.gz files and a SampleSheet.csv file), Undetermined_indices (Sample_lane1 through Sample_lane8, each with a SampleSheet.csv file), and Basecall_Stats_FC (Summary.xml, Demultiplex_Stats files, IVC.htm).]
NOTE: There can be only one Unaligned directory by default. If you want multiple Unaligned directories, use the option --output-dir to generate a different output directory.
FASTQ Files: As of 1.8, CASAVA converts bcl files into FASTQ files and uses these FASTQ files as sequence in
227. used to specify analysis across different targets.
Table 22 Global Analysis Options for Variant Detection and Counting (Option; Application; Description)
QVCutoff NUMBER; PE; Sets the paired-end alignment score threshold to NUMBER (default 90). Example: QVCutoff 60
QVCutoffSingle NUMBER; SE, PE; Sets the single-read alignment score threshold to NUMBER (default 10). Example: QVCutoffSingle 60
read NUMBER; PE; Limits input to the specified read only. Forces single-ended analysis on one read of a double-ended data set. Example: read 1
singleScoreForPE VALUE; PE; Sets the variant caller to filter reads with single score below QVCutoffSingle in PE mode (YES|NO). Default: NO. Example: singleScoreForPE YES
sortKeepAllReads; SE, PE; Generates an archive BAM file. Keeps all purity-filtered, duplicate, and unmapped reads in the build. These reads are ignored during variant calling. Example: sortKeepAllReads
toNMScore NUMBER; SE, PE; Minimum SE alignment score to put a read to NM. Default: -1 (-1 means the option is turned off).
ignoreUnanchored; PE; Ignores unanchored read pairs in indel assembly and variant calling. Unanchored read pairs have a single-read alignment score of 0 for both reads. Example: ignoreUnanchored
Options for Target assembleIndels: The options described below are used to specify analysis for target assembleIndels. Table 23 Options
228. will result in longer run time. By default, ELANDv2e runs in semi-repeat resolution mode. Full repeat resolution can be turned on with the option INCREASED_SENSITIVITY.
Orphan Alignment: ELANDv2e performs orphan alignment by identifying read pairs for which only one of the reads aligns. ELANDv2e tries to align the other read in a defined window (by default 450 bp). If the number of mismatches is < 10% of the read length, ELANDv2e reports the alignment.
Variant Detection and Counting: During variant detection and counting, CASAVA generates a CASAVA build, which is a post-sequencing analysis of data from reads aligned to a reference genome by configureAlignment. The CASAVA build process is divided into several modules (or targets), each of which completes a major portion of the post-alignment analysis pipeline. 1. The first module (sort) bins aligned reads into separate regions of the reference genome, sorts these reads by alignment position, optionally removes PCR duplicates (for paired-end reads), and finally converts these reads into BAM format. 2. In a paired-end analysis, the next module (assembleIndels) is used to search for clusters of poorly aligned and anomalous reads. These clusters of reads are de novo assembled into contigs, which are aligned back to the reference to produce candidate indels. 3. Subsequently, the callSmallVariants module uses the sorted BAM files and the candidate indels predicted by the assembleIndels mod
229. with the sequence given the deleted reference sequence.
Single Read Alignment Score: Alignment score of a single-read match or, for a paired read, the alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat; -1 for orphan reads.
Paired Read Alignment Score: Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat. Note that in single-ended analyses it is always blank.
Partner Chromosome: Name of the chromosome if the read is paired and its partner aligns to another chromosome.
Partner Contig: Not blank if the read is paired and its partner aligns to another chromosome and that partner is split into contigs. Blank for single-read analysis.
Partner Offset: If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner. If the partner is an orphan read, this value is 0. If the partner aligns to a different chromosome and/or contig, the number represents the absolute position of the partner. Blank for single-read analysis, unless the record belongs to a part of a spliced RNA read.
21. Partner Strand: To which strand did the partner of the paired read align? F for forward, R for reverse, N if no match found, blank for single read an
230. y reads whose alignment is much worse than expected given their quality. Any orphan reads not thought to be due just to poor base quality. Reads from read pairs mapped anomalously. The expected relative orientation of read partners and the insert size statistics required to detect the anomalies are read per lane from the s_*_pair.xml files produced during the alignment phase by alignmentResolver. An anomalously large insert size is defined as 3 standard deviations above the median, an anomalously small one as 5 standard deviations below the median. Two types of anomalous mapping are used: insert size anomalously large (possible deletion) and insert size anomalously small (possible insertion). IndelFinder tries to exclude reads for which the bad or non-existent alignment is just a consequence of poor base quality.
AlignCandidates: The component AlignCandidates does a dynamic programming alignment of each orphan read, looking in the interval within which it is expected to sit. It takes the output of IndelFinder and does a localized alignment of each read. If this procedure finds an alignment for a read where none existed previously, or finds a better alignment than the existing one, the previous alignment is replaced.
ClusterFinder: This takes the output of AlignCandidates (a list of orphan and badly aligning reads) and tries to group them in clusters of reads that are thought to have been caused by the same indel, based on genomic location. Cluste
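The insert size thresholds described above can be sketched as a small classifier. The 3 and 5 standard deviation limits come from the guide; the sketch assumes a single standard deviation value applies on both sides of the median, which may not match how alignmentResolver reports its statistics, and the function name and return labels are hypothetical.

```python
def classify_insert_size(insert_size, median, std_dev):
    """Classify a read pair's insert size relative to the lane statistics."""
    if insert_size > median + 3 * std_dev:
        return "possible deletion"   # anomalously large insert
    if insert_size < median - 5 * std_dev:
        return "possible insertion"  # anomalously small insert
    return "normal"


# Hypothetical lane statistics: median insert of 300 bp, standard deviation of 20 bp.
labels = [classify_insert_size(size, median=300, std_dev=20)
          for size in (199, 200, 300, 360, 361)]
```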
