"user manual"

1. ^(-Q/10), where Q = quality score. So when LmP = 0, mott_limit = 10^(-Q/10).
   At limit = 1, LmP will be negative only for bases with Q < 0.
   At limit = 0.6, LmP will be negative only for bases with Q < 2.

Example:
    -methods mott -mott_lim 0.6
Will extract the highest-quality substring, with the minimum allowed base quality score being 2.

6 and 7. ncutoff and nperc
A common trimming approach is to remove all reads with N bases. However, this may remove some reads that contain only a small number of N bases and that may still be of use for assembly and mapping.
Algorithm: remove reads if the number or percentage of N bases in them exceeds a specific cutoff. Defaults are 50 and 50.
Example:
    -methods ncutoff -ncutoff_len 3
Will filter out reads with > 3 N bases.

8 and 9. 3end and 5end
NGS reads generally have decreasing quality scores (as low as 0) towards their 3' ends, and removing a specific number of bases from the 3' end or 5' end of all reads is commonly done to address this. Alternatively, these methods may be used to remove a fixed sequence at the beginning or end of reads, e.g. a primer.
Example:
    -methods 3end_5end -n3 5 -n5 10
Will remove 5 bases from the 3' end of reads and 10 bases from their 5' ends.

10 and 11. qseq0 and qseqB (qseq-format-specific methods)
- The qseq file format provides a failed_chastity_filter flag that normally marks a read for being filtered from Illumina's qseq output
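For the ncutoff and nperc rule described earlier in this part, a minimal Python sketch may help. It is an illustration only, not ngsShoRT's Perl code: the function name exceeds_n_cutoff and its keyword arguments are hypothetical, and both cutoffs are assumed to be exclusive ("more than"), following the "> 3 N bases" wording of the example.

```python
def exceeds_n_cutoff(seq, max_n_count=None, max_n_percent=None):
    """Return True if a read should be filtered for having too many N bases.

    max_n_count and max_n_percent are hypothetical stand-ins for the
    count-based and percentage-based cutoffs; either may be omitted.
    """
    n_count = seq.upper().count("N")  # counts both 'N' and 'n'
    if max_n_count is not None and n_count > max_n_count:
        return True
    if max_n_percent is not None and seq:
        if 100.0 * n_count / len(seq) > max_n_percent:
            return True
    return False


reads = ["ACGTNNNNACGT", "ACGTNACGTACG", "NNNNNNNNACGT"]
# Keep only reads with at most 3 N bases.
kept = [r for r in reads if not exceeds_n_cutoff(r, max_n_count=3)]
print(kept)
```

Keeping the count and percentage checks separate mirrors the manual's split into two methods, so either can be applied on its own.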
2. ngsShoRT 2.1 manual
Sari Khaleel (Sari S Khaleel DM AT dartmouth edu), Dartmouth Medical School
Last updated: 1/13/2014

Table of Contents
I. Basic trimming concepts applied in ngsShoRT
II. Startup tutorial (for the impatient)
III. Trimming methods: explains the algorithm behind each trimming method.
IV. Recommended sequence of methods and parameters: we list methods and their parameter values that we have found to be useful in trimming several test datasets.
V. Output files
VI. ngsShoRT's program structure: explains the object-oriented part of ngsShoRT and its module hierarchy.
VII. References and Suggested Readings

I. Basic trimming concepts applied in ngsShoRT
1. Take paired-end (PE; forward and reverse read) or single-end files and trim them using a user-specified sequence of trimming methods.
2. When trimming reads, it's important to set a minimum read length for trimmed reads. This is particularly useful with some of the de Bruijn graph assemblers (SOAPdenovo and Velvet), which discard reads shorter than the k-mer length used for assembly. So if the k-mer length for your assembly is 21, set min_rl to 21.
3. A trimmed read is "good", and will thus be printed in the final output, if it meets two conditions: its length is > min_rl AND it was not filtered out by one of the following read-filtering methods: lqr, nperc, ncutoff, 5adpt (kr), qseqB (kr), and qseq0. For PE read pairs, a pair is removed if either o
3. Atherton, R. et al. (2010). Whole genome sequencing of enriched chloroplast DNA using the Illumina GAII platform. Plant Methods 6:22.
4. in fastq format) are going to have this quality scoring as well. If you want them to be Sanger (Phred33) based, add i2s to your method list to convert from Illumina to Sanger scoring:
    perl ngsShoRT.pl -se sample_data/qseq/SRR065390_1st_2000_reads_qseq.txt -o sample_data/output_directory -methods 5adpt_i2s

Working with compressed files
- ngsShoRT auto-detects and opens files with the extensions .bz2, .gz, and .zip.
- If you want your trimmed files output to be gzipped, add -gzip to the commandline. For example:
    perl ngsShoRT.pl -se sample_data/qseq/SRR065390_1st_2000_reads_qseq.txt -o sample_data/output_directory -methods 5adpt_i2s -gzip
  Will produce the output file trimmed_SRR065390_1st_2000_reads_qseq.txt.gz

III. Trimming Methods
Our trimming methods can be divided into quality-trimming methods, which filter low-quality reads (lqr), trim low-quality 3' ends of reads (TERA), or try to extract a high-quality string from the read (Mott).
Non-quality trimming methods include 3end, 5end, nsplit, nperc, 5adpt, and qseq0. 3end and 5end simply remove a specific number of bases from the 3' and 5' ends of reads, respectively. In contrast, nsplit, nperc, and 5adpt examine the "alien" bases (Ns, adapter sequences) in the sequence line to trim reads. qseq0 is a special case that works only for qseq files: it removes reads whose filtering flag was 0, i.e. they did not pass filtering during Illumina sequencing analysis.
Non trimming m
5. but there are occasions where this filtering setting is turned off. qseq0 detects this flag in qseq input files and removes any reads still carrying it.
- Earlier versions of Illumina (< 1.8) used a score mapping based at Phred 64 (corresponding to zero quality) and included quality scores 2, 1, and 0, with the 2 score (corresponding to the 'B' character) a special indicator for unknown quality scores. qseqB was designed to trim reads that contain more B-scored bases than a cutoff.
Algorithm (qseqB): in global mode, remove a qseq read with > qB_num B-scored bases. In local mode, search for strings of > qB_num B-scored bases in a read and either trim the entire read (qB_axn kr) or just the string and all bases 3' to it (qB_axn ka). Default mode is local and default action is ka (trim only the string and the bases following it).
Example:
    -methods qseq0_qseqB -qB_mode global -qB_num 10
Will filter out qseq reads with the 0 chastity_filter flag and also filter reads with 10 B-scored bases.

12. rmHP: removing homopolymers from reads
Algorithm: search for a homopolymer sequence whose length exceeds a user-specified limit (default is 8) and consists of one of the bases specified by the user (default is all bases, agct).
Example:
    -methods rmHP -rmHP_ml 10 -rmHP_bases ag
Will remove only homopolymers of A or G bases whose length is > 10.

13. i2s and s2i
The two most common quality score settings in fastq files are S
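The local mode of qseqB described earlier in this part can be sketched in Python as follows. This is a hedged illustration, not the tool's implementation: the function qseqb_local is hypothetical, the run is assumed to be strictly longer than qb_num (the manual's ">" wording), and the leftmost qualifying run is assumed to be the one acted on. Any later min_rl check is left out.

```python
def qseqb_local(seq, qual, qb_num, action="ka"):
    """Apply the local-mode B-run rule to one read.

    Returns (seq, qual) after trimming, or None if the read is removed.
    """
    run = "B" * (qb_num + 1)      # shortest run that is longer than qb_num
    pos = qual.find(run)          # leftmost qualifying run of 'B' scores
    if pos == -1:
        return seq, qual          # no qualifying run: read is unchanged
    if action == "kr":
        return None               # "kr": remove the entire read
    return seq[:pos], qual[:pos]  # "ka": drop the run and everything 3' of it


print(qseqb_local("ACGTACGT", "hhhhBBBB", qb_num=3))
print(qseqb_local("ACGTACGT", "hhhhBBBB", qb_num=3, action="kr"))
```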
6. fQ_read object, which comes with its own set of trimming methods and properties. For PE reads, there are two single_fQ_read objects for the two paired reads, which are then processed as a PE_fQ_pair. The purpose of this somewhat complex method of managing reads is to make the trimming methods and output modules of ngsShoRT format-independent, i.e. ngsShoRT can potentially trim any read format (fastq, fasta + qual, qseq) as long as its components are loaded into a single_fQ_read object. At the moment, we do that only for fastq and qseq formats.

Threading
ngsShoRT's multithreading (using the Perl threads module) implements embarrassingly parallel processing: each thread processes a separate part of the input file, and the final trimmed thread outputs are merged in a final processing step.

Program architecture
PE files are processed using the process_PE_files.pm module, which splits the files into consecutive sections, each of which is trimmed by a separate thread running the process_PE_files_section.pm module. This module then runs the trimming modules specified by the -methods option on the reads, producing a trimmed file section along with any surviving_PE_mates fastq files. When all threads are finished trimming their sections, process_PE_files.pm runs a merging module to merge the files into one final trimmed output file.
SE files are processed using the process_SE_files.pm module in a similar fashion to the PE files.

VII. Ref
7. anger (where ASCII 33 = zero phred) and old Illumina (pre-1.8) scoring, with ASCII 64 as the base.
Algorithm: convert scoring from old Illumina to Sanger using i2s, or from Sanger to old Illumina using s2i.
Example:
    -methods i2s
This is used for a pre-1.8 Illumina-generated fastq/qseq file (phred 64-based), where the output file will be scored in Sanger (phred 33-based).

IV. Recommended Trimming Methods
We recommend the trimming methods -methods lqr_5adpt_tera for filtering low-quality reads (reads with > 50% of bases having a quality score < 2), removing their adapter/primer sequences, and trimming their low-quality 3'-end bases.

V. Output files
For PE input (-pe1 SR_foo.1.fq -pe2 SR_foo.2.fq):
    trimmed_<original filename>_foo_1.fastq
    trimmed_<original filename>_foo_2.fastq
    surviving_SE_mates.fastq    Contains surviving "widowed" mates (see prev. section)
    log.txt                     Contains used params and trimmer progress
    final_PE_report.txt         Contains total and by-method trimming stats
For SE input (-se SR_foo.fq):
    trimmed_<original filename>_foo.fastq
    log.txt                     Contains used params and trimmer progress
    final_SE_report.txt         Contains total and by-method trimming stats

VI. ngsShoRT's program structure
The single_fQ_read object
Unlike most trimming tools, ngsShoRT does not perform trimming directly on the reads file. Instead, it loads the read's sequence, quality scores, and header into a single
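Returning to the i2s and s2i conversions described at the start of this part: since the two encodings differ only in their ASCII base (64 versus 33), converting a quality string amounts to shifting every character by 31. The Python sketch below shows that arithmetic; the function names are hypothetical, and it performs no validation of whether the input really is in the assumed encoding.

```python
OFFSET_DIFF = 64 - 33  # difference between the old Illumina and Sanger ASCII bases


def illumina_to_sanger(qual):
    """Shift a Phred+64 quality string down to Phred+33 (what i2s does conceptually)."""
    return "".join(chr(ord(c) - OFFSET_DIFF) for c in qual)


def sanger_to_illumina(qual):
    """Shift a Phred+33 quality string up to Phred+64 (what s2i does conceptually)."""
    return "".join(chr(ord(c) + OFFSET_DIFF) for c in qual)


old = "hhB@"                    # Phred+64: Q40, Q40, Q2, Q0
new = illumina_to_sanger(old)   # same scores, Phred+33 encoding
print(new, sanger_to_illumina(new) == old)
```

Applying i2s to a file that is already Sanger-encoded would push characters below the printable range, so it is worth confirming the input encoding before converting.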
8. c as default) or user-specified sequences.
- fmi: the furthest matching index, i.e. how far into the read the script should search for adapter sequences. We recommend setting it to raw_read_length 10. Read length obviously depends on your PE files and their read lengths.
- 5a_mp: matching percentage. Default is 100, which uses regex matching. If 5a_mp is < 100, we use fuzzy matching, implemented by the String::Approx library.
- If an adapter sequence is matched, the method will:
  1. Remove the detected adapter sequence, and then remove a specific number of bases before and after it (the number of before and after bases is specified in the adapter_sequences file).
  2. Do one of two actions, depending on the value of 5a_axn:
     kr: kill the entire read.
     ka: kill the detected adapter sequence and all bases after it.

Available Illumina libraries:
    i-g    Illumina genomic (default)
    i-p    Illumina PE
    i-m    Illumina multiplex
    i-n    Illumina NlaIII
    i-d    Illumina DpnII
    i-r    Illumina sRNA
Available 454 pyrosequencing libraries:
    p-b    pyroseq basic
    p-r    pyroseq sRNA
    p-p    pyroseq PE
    p-a    pyroseq amplicon

Example:
    -methods 5adpt -5a_f i-g -5a_axn kr
- Will kill reads that match an adapter in the Illumina genomic library, instead of the default action of just trimming the matched sequence and all bases following it (3' to it) in the read.

Notes on approximate matching
Approximate matching is case-insensitive and is
9. done according to a user-specified match_percentage. A match_percentage of 90 means that for every 10 bases, only one mismatch is allowed, and so on. The measure of approximateness for String::Approx is Levenshtein edit distance. More detail on how String::Approx works can be found at http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm
Additional approximate matching options are:
    5a_ins INT
    5a_del INT
    5a_sub INT
They refer to the maximum allowed number of insertions, deletions, and substitutions, respectively. So, for example, 5a_mp 90 5a_ins 0 5a_del 0 means that one mismatched character is allowed in every 10 chars, but it can NOT be a deletion or an insertion. Thus, it can only be a substitution.

3. nsplit
Instead of completely removing a read with N bases (which can be done using our ncutoff and nperc methods), we split the read around the N string to save some of the read.
Algorithm:
1. Search the read for substrings of consecutive uncalled bases (N/n), which we call N-blocks, whose length is > min_Nblock_ (min_Nblock_ is a user-specified cutoff). If there are > 1 N strings that satisfy this condition, pick the longer (or leftmost) one.
2. If the read has such an N-block, delete the block and use the bases after and before it to create two new reads, and (if this is PE trimming) pair them with copies of their parent read's sister read.
Example:
    -methods nsplit -nsplit_len 5
Will split a read around a s
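The two nsplit steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function nsplit_sequence is hypothetical, it handles only the sequence line (a real implementation would split the quality string at the same coordinates and handle PE pairing), and the cutoff is treated as inclusive, which may differ from the tool's exact comparison.

```python
import re


def nsplit_sequence(seq, min_nblock):
    """Split a read around its longest N-block of at least min_nblock bases.

    Returns [seq] unchanged if no block qualifies, otherwise the two flanks.
    """
    blocks = [m for m in re.finditer(r"[Nn]+", seq)
              if len(m.group()) >= min_nblock]
    if not blocks:
        return [seq]
    # max() returns the first maximal item, so ties go to the leftmost block.
    block = max(blocks, key=lambda m: len(m.group()))
    return [seq[:block.start()], seq[block.end():]]


print(nsplit_sequence("ACGTNNNNNACGT", 5))
print(nsplit_sequence("ACNGT", 5))
```

Either flank may come out shorter than min_rl (or empty), so in practice the resulting reads would still need the usual minimum-length check.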
10. erences and Suggested Readings
Cox, M. et al. (2010). SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485.
CLC bio. CLC Genomics Workbench User Manual. http://www.clcbio.com/files/usermanuals/CLC_Genomics_Workbench_User_Manual.pdf
FASTX-Toolkit. http://hannonlab.cshl.edu/fastx_toolkit
Miller, J. R. et al. (2010). Assembly algorithms for next-generation sequencing data. Genomics 95:315-327.
Shendure, J. and Ji, H. (2008). Next-generation DNA sequencing. Nature Biotechnology 26:1135-1145.
Zerbino, D. and Birney, E. (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18(5):821-829.
Zerbino, D. (2008). Velvet Manual, version 1.1. Available online at http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf
DiGuistini, S. et al. (2009). De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biology 10:R94.
Garcia, T. I. et al. (2011). Effects of short read quality and quantity on a de novo vertebrate transcriptome assembly. Comparative Biochemistry and Physiology, Part C. ScienceDirect, In Press.
Shulaev, V. et al. (2010). The genome of woodland strawberry (Fragaria vesca). Nature Genetics 43:109-116.
Haridas, S. et al. (2011). A biologist's guide to de novo genome assembly using next-generation sequence data: A test with fungal genomes. Journal of Microbiological Methods 86(3):368-375.
11. ethods include i2s and s2i, which allow switching the quality scoring of reads from Illumina to Sanger or from Sanger to Illumina, respectively.

1. TERA: trim by the Three-End Running Average quality score
Illumina reads generally have lower base-call quality towards the 3' end of reads. Instead of indiscriminately trimming a specific number of bases from the 3' end of all reads (which can be done using 3end) without looking at their quality scores, TERA aims to trim low-quality 3' ends.
Algorithm: Trim bases from the 3' end of a read based on the running average quality score (RAQS) of its bases. Starting at the last (the most 3') base of the read, compute the RAQS of all bases until reaching a base X where the RAQS exceeds a cutoff value specified by tera_avg. If X's 5' index is < min_rl, it's set to min_rl (see the section above to understand why we do this). All bases 3' to X's index are trimmed out.
Example:
    -methods tera -tera_avg 3

2. 5'-adapter trimming (5adpt)
Removal of known and user-specified adapter/primer sequences from reads.
Algorithm: Trim adapter sequences from reads (in PE reads, this means trim them from the forward and reverse reads). To be more specific: match the adapter sequences to reads, then either remove the matched read or just trim out the matched sequence and all bases 3' to it.
- 5a_f: specifies the adapter sequences file, which can be one of our built-in adapter libraries (Illumina or 454, with illumina genomi
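The TERA algorithm described earlier in this part can be sketched in Python as follows. This is one reading of the manual's wording, not the tool's code: the function tera_keep_length is hypothetical, base X itself is assumed to be kept, and "index" is interpreted as the number of bases retained from the 5' end.

```python
def tera_keep_length(quals, tera_avg, min_rl):
    """Return how many 5' bases of a read to keep after TERA-style trimming.

    quals is a list of integer quality scores in read order (5' to 3').
    """
    total = 0
    keep = 0  # stays 0 if the running average never exceeds the cutoff
    # Walk from the most 3' base towards the 5' end.
    for n, idx in enumerate(range(len(quals) - 1, -1, -1), start=1):
        total += quals[idx]
        if total / n > tera_avg:   # running average over the n bases seen so far
            keep = idx + 1         # keep base X, trim everything 3' of it
            break
    # Never trim below min_rl (or below the read's own length if it is shorter).
    return min(len(quals), max(keep, min_rl))


quals = [30, 30, 30, 30, 2, 2, 2]
print(tera_keep_length(quals, tera_avg=3, min_rl=2))
```

Because the average is cumulative from the 3' end, a single high-quality base after a long low-quality tail can lift the average past the cutoff only slowly, which is what makes the method more conservative than a per-base threshold.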
12. f 5' adapters/primers), which by default trims known Illumina primers, and prints the output in sample_data/output_directory.
Your output files should be at sample_data/output_directory:
- trimmed_SRR065390_1__1st_2000reads.fastq: trimmed pe1 reads
- trimmed_SRR065390_2__1st_2000reads.fastq: trimmed pe2 reads
- surviving_SE_mates.fastq: trimmed pe1 or pe2 reads whose mate read was filtered out during trimming
- extracted_five_prime_adapter_sequences_at_100_percent_match.txt
- log.txt
- final_PE_report.txt: full report of ngsShoRT runtime, number of trimmed bases and reads, and method-specific trimming statistics

Commonly used options include:
- -t <number of threads> (default is 10)
- -min_rl <minimum trimmed read length> (default is 21)
- -print_discarded_reads yes (default is no)
Additional trimming tools can be added to the methods sequence, e.g. -methods lqr_5adpt will filter out low-quality reads before trimming 5' adapters.

Single End (SE) fastq file:
    perl ngsShoRT.pl -se sample_data/fastq/SRR065390_1st_2000reads.fastq -o sample_data/output_directory -methods 5adpt

You'll use the same code as for the PE files, because ngsShoRT can auto-detect fastq and qseq files:
    perl ngsShoRT.pl -se sample_data/qseq/SRR065390_1st_2000_reads_qseq.txt -o sample_data/output_directory -methods 5adpt
However, qseq file quality scoring is based at Phred64 (for Illumina < 1.8), so the output files (which will be
13. r both of its reads are "bad". If only one read is bad, its sister (surviving) read is saved to a surviving_SE_mates.fastq file. If you're assembling your trimmed reads using Velvet, you can assemble this file along with the trimmed PE reads:
    velveth output_directory <hash_length> -fastq -shortPaired <shuffled_trimmed_PE_file>.fastq -short path/surviving_SE_mates.fastq

II. Startup tutorial (for the impatient)
NOTE: if you copy the commandline off this file without replacing the dashes, you may get this error:
    ERROR main params methods no methods were specified

First, check your CPAN modules: ngsShoRT requires the Perl modules String::Approx and PerlIO::gzip, which can be installed as follows (you'll need admin permissions):
    perl -MCPAN -e shell
    cpan> install String::Approx
    cpan> install PerlIO::gzip
See http://www.cpan.org/modules/INSTALL.html for more info on installing the modules.

Download and untar ngsShoRT_2.1 in a target directory:
    tar -xvf path/ngsShoRT_2.1.tar.gz

Run ngsShoRT on the sample_data Paired End (PE) fastq files:
    cd <ngsShoRT's path>
    perl ngsShoRT.pl -pe1 sample_data/fastq/SRR065390_1_1st_2000reads.fastq.gz -pe2 sample_data/fastq/SRR065390_2_1st_2000reads.fastq.gz -o sample_data/output_directory -methods 5adpt
- This trims the gzipped paired-end files (pe1: forward reads, pe2: reverse reads) using the 5adpt (removal o
14. tring of ≥ 5 Ns, if it finds one.

4. lqr: remove low-quality reads from the PE files
Algorithm: Given a user-specified low-quality score cutoff (lqs) and a percentage cutoff for bases whose quality score is < lqs (which we call lq_p):
- Count the number of bases whose quality score is < lqs. Let's call these LQ bases.
- Label the read "bad" if the percentage of LQ bases is > LQ_perc_cutoff, and "good" otherwise.
Example:
    -methods lqr -lqs 4 -lqp 50
Will filter out a read if it has > 50% of bases with a quality score < 4.

5. mott: extract the highest-quality string of bases from the read
In other words, trim out low-quality 5' and 3' bases from the read. The Richard Mott trimming algorithm is described as follows in CLC's manual.
The algorithm:
1. For every base, convert its quality score (Q) to its corresponding Pe (Perror): Pe = 10^(-Q/10). So:
       Q = 0   =>  Pe = 1
       Q = 2   =>  Pe = 0.6
       Q = 10  =>  Pe = 0.1
       Q = 20  =>  Pe = 0.01
       Q = 30  =>  Pe = 0.001
2. For every base, calculate its LmP value, which equals Limit - Pe.
3. For every base, starting from the 3' end (for short reads), add its LmP value to a running sum. If the sum drops below zero, set it to zero.
4. When done with the entire sequence, retain the part of the sequence between the first positive running sum and the highest value of the running sum.

How to choose the limit value:
LmP = mott_limit - Perror(base), where Perror(base) = 10
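The four Mott steps above can be sketched in Python as follows. This is a minimal illustration and not the code used by ngsShoRT or CLC: the function mott_trim is hypothetical, it scans 5' to 3' for simplicity (the manual describes starting from the 3' end for short reads), and it takes the retained region to begin where the running sum last rose above zero before reaching its maximum.

```python
def mott_trim(quals, limit):
    """Return (start, end) half-open indices of the region to retain.

    quals is a list of Phred quality scores in read order; limit plays the
    role of the Mott limit. Returns (0, 0) if no base scores above the limit.
    """
    running = 0.0
    best = 0.0
    run_start = 0        # where the current positive run began
    best_span = (0, 0)
    for i, q in enumerate(quals):
        p_err = 10 ** (-q / 10)      # step 1: Q -> error probability
        running += limit - p_err     # steps 2 and 3: add LmP to the running sum
        if running <= 0:
            running = 0.0            # reset, and restart the run after this base
            run_start = i + 1
        elif running > best:
            best = running           # step 4: remember the highest running sum
            best_span = (run_start, i + 1)
    return best_span


quals = [2, 2, 30, 30, 30, 2, 2]
start, end = mott_trim(quals, limit=0.05)
print(start, end, quals[start:end])
```

With a small limit such as 0.05, a single Q2 base (Pe of about 0.63) wipes out the gain from many Q30 bases, which is why low-quality flanks are cut off while the high-quality core is kept.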