Home

RNA-MATE user manual - Expression Genomics Laboratory

1. with E Join on data 4 and data 6 3 Second query Both queries are same filetype d if un ecked Second query will be forced into format TIP it your query does not ap the pulldown menu gt R is not in interval format Use edt attributes to set chromosome start end and strand columns Intersect the intervals of two quenes Subtract the intervals of two quenes See Galaxy interval Operation Screancyete right chek to open the Meray the overlapping intervals of a query Syntax e Concatenate two queries into Both queries are exactly the same filetype will preserve all extra fel the same column ssgrments far chrom start ond and strand but will fill extra fields with a pened periods to maintain 2 truly Example Cluster the intervals of a Query Min the intervals of two Queries side by side returns flanking r every gene Page 33 of 38 refresh coapse all ENCODE Tools 10 Concatenate on Gata 8 and data Unnamed history o Query missing See TIP below Text Manipulation Group by co Convert Formats a J EASTA manipulation Filter and Sort Join Subtract ond Group Operation 1 side by side on 4 Sum J COMMON OF Ine rows On column Sutu Wwhoig Query from eau another query data by column and Round result to nearest integert perform aggrega gspneration on athar colurane The
2. Lnterzect the intervals of two Execute quenes Subtract the intervals of two quenes TIP if your query does not appear n Screencasts See Galaxy interval Operation Screencasts right cick to operi Meran the overlapping intervals of a query e Concatenate two queries into one query Syntax e Base Coverage of af intervals Where overlap coocifies the minimum overlap botwoon intervals that allows thorn to be joinod Coverage of a set of intervals Return only records that are joined returns only the records of the fest query that join to a recond in the second query This is analogous to an INNER JOIN On second set of intervals Return all records of fiest query fil null with returns af intervals of the first query and any intervals that do nat Join an interval fram the second query are filled in nd any intervals that do not jon an interval from the first query are filled in lid chrom start end or strand Cluster the intervals of a l fills on eithear the right or left with periods Note thet this may produce an query Loin the intervals of two queries side by side retums flanking Tagion s for every gene atch closest feature for Overy interval on intervals a Complement intervals ot a query Galasey Mozilla Firefox Concatenate refresh colapse all S doin on data 5 and date 7 J Unnamed history o agente opn E sR test 500K tags SOmers positive
3. Refer to the Biota Credits page fer the ist of contributors and usage restrictions associated with these data e GrameneMart Contral server clade Mammal E genome Human 3 assembly Mer 2005 3 Gunina server group Genes and Gene Predicton Tracks I track Reseqcen E Encodeds at NHGRI i van SO table EQIGRAPH server ph mene deserto tebie schema region genome ENCODE postion chr 32376712 32378734 Liookup_ _dofine regions ENCODE Tools identifiers names accessions pastels Uplondist Lif Over filter create j Text Manipulation a Convert Formats FASTA monloulation correlation Gale output format CE F Send output to Galaxy gi Extract Features utp fe Fetch Sequences fide type returned plan text gap compressed Eetch Allanments Get Genomic Scores Ouerate on Genomic Intervals Statitics Groph Display Data Kealonal variation J Multiple rearession m analy c analyses by Short alysis provides brief ine by lne descriptions of the Table Bri y Specifies whech clade the organism it n Workflows L ene sequence to uze jack ket The options correspond to the track groupings shown m the Genome Browser Select All Tracks for an All Tables to see al tables mcluding those not associated with a track tech database should be used for options m table menu Ee ee ee Oe ST EI ee Page 30 of 38 Hist
4. chr2 test_500K_tags_50mers for_wig positive wig chr2 test_500K_tags_50mers for_wig positive wig success chrM test_500K_tags_50mers for_wig negative sorted chrM test_500K_tags_50mers for_wig negative wig chrM test_500K_tags_50mers for_wig negative wig success chrM test_500K_tags_50mers for_wig positive sorted chrM test_500K_tags_50mers for_wig positive wig chrM test_500K_tags_50mers for_wig positive wig success These are temporary internal files and can be deleted after the successful completion of the wiggle plot module sub start_plot_fork input chr2 test_500K_tags_50mers for_wig negative chr2 test_500K_tags_50mers for_wig positive chrM test_500K_tags_50mers for_wig negative chrM test_500K_tags_50mers for_wig positive output chr2 test_500K_tags_50mers for_wig negative starts chr2 test_500K_tags_50mers for_wig negative starts success chr2 test_500K_tags_50mers for_wig positive starts chr2 test_500K_tags_50mers for_wig positive starts success chrM test_500K_tags_50mers for_wig negative starts chrM test_500K_tags_50mers for_wig negative starts success chrM test_500K_tags_50mers for_wig positive starts chrM test_500K_tags_50mers for_wig positive starts success These are temporary internal files and can be deleted after the successful completion of the wiggle plot module Page 25 of 38 test_500K_tags_50mers negative starts test_500K_tags_50mers positive starts These are the final output files containing the geno
5. adjacent errors as a single mismatch These are comma separated parameters with the format of length mismatches valid_adjacent length defines the length of the tag to match at mismatches defines the number of mismatches allowed and valid_adjacent is set to 1 if valid adjacent errors are to be treated as a single mismatch or 0 if they are not In the above example the recursive mapping will match at lengths 50 45 40 35 and 30 For lengths 50 40 there will be 5 mismatches allowed whereas for lengths 35 and 30 will allow only 3 mismatches Valid adjacent errors are not treated as a single mismatch For a discussion on selecting optimal parameters for analysis see the section Selecting appropriate parameters NOTE Mapping schemas must be available to do the mapping at the specified length and number of mistmatches or else the pipeline will fail ie In this example the schemas required are schema_50_5 schema_45_5 schema_40_5 schema_35_3 schema_30_3 Mapping schemas are available from http solidsoftwaretools com mask 11111111111111111111111111111111111 This setting allows you to ignore particular bases in the tag when computing the number of mismatches 1 consider this base 0 do not consider this base The length of the mask should equal the length of the longest tags max_multimatch 10 Defines the maximum number of positions to be reported for multi mapping tags The higher this number the more disk
6. are available from http g2 trac bx psu edu wiki HowTolInstall The following is a tutorial on how to assign tags to genes using Galaxy In this step particular care needs to be taken to ensure that different RNAseq protocols are processed with the strand of capture in mind For example serial ligation approaches will generate sequences from the sense strand relative to the annotated gene whereas the random primed strand specific protocols will generate tags mapping to the anti sense strand Assigning tags to the wrong strand of gene models will result in relatively low numbers of tags assigned to the gene models and subsequently very low correlations between array data and sequence data _ History refresh colapse all Table Browser Unnamed history 0 e UCSC Main table browser e UCSC Archaea table browser Use this program to retrieve the data associated with a track in text format to calculate ztersections between tracks and to retrieve DNA sequence covered by a track For help in Your history is empty Ckck Get Gat Microbial bata using thes application see Using the Table Browser for a descnption of the controls in this form the Use for general information and sample queries and the Opentlelx Table Data on the left pane to start R PEPEE Browser tutorial for a narrated presentation of the software feanaes and usage For more complex queries you may wars to use Galaxy or our public MySQL server
7. following job has been succesfully added to the queue 11 Group on date 10 You c n check the status of queued jobs and view the dao wing the When the job has been run the status will from to 4 by refreshing the History pane j change from running iced erro ae fn compare two Queries to find COMMON OF t rows Buaya Whole Quere from another query data by a column and perform aggregate operation on other enkun Page 34 of 38 All these steps can be automated into what Galaxy calls a Workflow This is particularly useful if you have multiple data sets to analyze To create a workflow from the steps generated above follow the steps below iousty stored histories anew emety history o Construct workflow from current history Lif Over Clone current history Text Manipulation Share current history of e List pren Toots The following list contains each tool that was run to create Get Data Tools which cannot be run interactively and thus cannot bo retresh colapse all ENCODE Tools Workflow name Unnamed history Lift Over inaseq to Gene Counts Text Manipulation ae Create worksiow EASTA manipulation Eiler and Sort Tool History items created Join Subtract and Group Extract Features UCSC Mein 1 UCSC Main on Human refGene genome Fetch Sequences his tool cannot be used in workflows F Trost as input dataset
8. junctions 35 fa cat junction_45_index data RNA MATEv1 1 junction_libraries hg18 junctions 35 fa index junction_40 data RNA MATEv1 1 junction_libraries hg18 junctions 30 fa cat junction_40_index data RNA MATEv1 1 junction_libraries hg18 junctions 30 fa index junction_35 data RNA MATEv1 1 junction_libraries hg18 junctions 30 fa cat junction_35_index data RNA MATEv1 1 junction_libraries hg18 junctions 30 fa index junction_30 data RNA MATEv1 1 junction_libraries hg18 junctions 25 fa cat Page 12 of 38 junction_30_index data RNA MATEv1 1 junction_libraries hg18 junctions 25 fa index These parameters define the junction libraries and their associated index files that are used in RNA MATE The ability to specify the length of the junction library used for each of the different lengths of the tag means that you can have complete control over the stringency of the junction matching In this example by using the 40mer libraries 40nt from the donor exon concatenated with 40nt from the acceptor exon with the 50 40mer tag lengths we are requiring a minimum on 10nt of the tag to overlap the exon exon boundary For the 35 30mer tag lengths we are requiring an overlap of 5nt For obvious reasons the number of nucleotides overlapping should be greater than the number of mismatches allowed in the tag For every tag length specified in the mapping parameters option there are two files required the junction_ length file asks for
9. prior to ePCR The primary motivation behind the recursive mapping method was to maximize the number of mapping tags from every sequencing run The cost of sequencing reagents is considerably more than the cost of server time so gaining additional depth between about 1 6 and 3 times the tags mapping at the longest length represents good value for money In this respect it is up to the individual user to decide whether or not to apply a recursive mapping strategy for their own analysis How to map to the genome and junctions simultaneously RNA MATE has the flexibility to other approaches to junction matching In some circumstances one may wish to consider matches to the junction library at the same time as matches to the genome This can be done by renaming the junction library against which you wish to map to follow the chromosome naming convention see Configuration options We will sometimes use chrJ as the chromosome name and therefore chrJ fa as the filename Additionally you will need to provide a small junction library file eg a string of 35 Ns as it must go through the process of matching to the junction library even though there is nothing in it to match to This is not an efficient or fully automatic process as junction BED files will need to be recreated from the raw mapping files see Post RNA MATE scripts This inefficiency will be rectified in the next major release of the software where we will mak
10. space is required to store the data and the slower the program will run Recommended size for most applications is 10 For interrogating repeat sequences such as retrotransposable elements this value may need to be set higher Page 11 of 38 expect_strand This defines the strandedness of the data For example libraries made with the SREK protocol or other direct ligation protocols will have tags that are sequenced in the sense strand relative to the expressed gene Libraries made with the SQRL protocol will have tags that are sequenced in the antisense relative to the expressed gene exp_name test_500K_tags_50mers Set the experiment name with this parameter chromosomes chrM chr2 Defines the names of the chromosomes to map against The filenames that match to these chromosomes are expected to be named as chromosome_name fa In the above example the files needed to match against should be called chrM fa chr2 fa These should be in a standard single sequence fasta format There is no requirement for the header line to contain any particular string Only the filename and the chromosome name will be used in the pipeline chr_path data matching hg18_fasta The full path of the chromosome fasta files junction_50 data RNA MATEv1 1 junction_libraries hg18 junctions 40 fa cat junction_50_index data RNA MATEv1 1 junction_libraries hg18 junctions 40 fa index junction_45 data RNA MATEv1 1 junction_libraries hg18
11. with sequencing artifacts Additionally large numbers of mismatches relative to the tag length will create spurious matching events and increase the level of noise in your results For RNAseq data ideally the proportion of tags mapping exons should be relatively constant regardless of the length and studying this for your genome of interest will provide guidelines as to what levels of mismatching is acceptable for your system For mouse and human genomes and presumably other mammalian genomes we recommend to use 3 mismatches for lengths from 30 39nt 5 mismatches for lengths gt 40nt If additional matching data is required 25nt matches can be used but caution should be Page 14 of 38 used when interpreting the results Either allow only a single mismatch to ensure specificity of mapping or filter the final wiggle plots eg only look at nucleotide positions that are covered by more than 4 tags to an extent which removes the noise in this mapping see Figure 2 100 100000000 90 10000000 80 r i 1000000 70 an 6N 7 D Ay 100000 o S 60 D 5 o e 50 10000 8 a 8 E 9 40 5 Q 1000 c a 30 1 E Single mapping tags Tags mapping within an exon 7 100 20 f Total number of tags mapping 10 10 0 t t t t t t t t t t t 1 35 0 35 1 35 2 35 3 30 0 30 1 30 2 30 3 25 0 25 1 25 2 25 3 length mismatches Figure 2 Effect of length and mismatches on the s
12. 5 2009 PROCESS Creating files for wiggle plot Wed Jun 10 04 43 31 2009 PROCESS Creating start file for wiggle plot Wed Jun 10 04 44 58 2009 SUCCESS Created start file data RNA MATEv1 1 test_results test_500K_tags_50mers positive starts data RNA MATEv1 1 test_results test_500K_tags_50mers negative starts Wed Jun 10 04 46 17 2009 SUCCESS Created wiggle plot file data RNA MATEv1 1 test_results test_500K_tags_50mers positive wiggle data RNA MATEv1 1 test_results test_500K_tags_50mers negative wiggl Wed Jun 10 04 46 17 2009 PROCESS Creating BED file for junction mapping Wed Jun 10 04 46 24 2009 SUCCESS Created junction BED files data RNA MATEv1 1 test_results test_500K_tags_50mers expect junction BED data RNA MATEv1 1 test_results test_500K_tags_50mers unexpect junction BED Wed Jun 10 04 46 24 2009 SUCCESS all done enjoy the data Checking the finished run Misconfigured config files or interruptions to the server or queue can cause RNA MATE to die prematurely however there is some error catching code that can help you to work out what has gone wrong The steps you should go through to ensure that the run has finished successfully are listed below 1 Check the log file Look for WARNING or DIED messages that will describe what has gone wrong eg grep WARNING test_500K_tags_50mers log more 2 Check the nohup out file in the direc
13. 5_index data RNA MATEv1 1 junction_libraries hg18 junctions 40 fa index junction_40 data RNA MATEv1 1 junction_libraries hg18 junctions 35 fa cat junction_40_index data RNA MATEv1 1 junction_libraries hg18 junctions 35 fa index junction_35 data RNA MATEv1 1 junction_libraries hg18 junctions 30 fa cat junction_35_index data RNA MATEv1 1 junction_libraries hg18 junctions 30 fa index junction_30 data RNA MATEv1 1 junction_libraries hg18 junctions 25 fa cat junction_30_index data RNA MATEv1 1 junction_libraries hg18 junctions 25 fa index output_dir data RNA MATEv1 1 test_results raw_csfasta data RNA MATEv1 1 test_data test_500K_tags_50mers csfasta raw_qual data RNA MATEv1 1 test_data test_500K_tags_50mers qual quality_check false run_rescue false rescue_window 10 num_parallel_rescue 2 script_chr_wig data RNA MATEv1 1 chr_wig pl script_chr_start data RNA MATEv1 1 chr_start pl f 2m data RNA MATEv1 1 f2m pl mapreads data matching mapreads rescue data RNA MATEv1 1 chr_rescueSOLiD py master_script data RNA MATEv1 1 rna mate vl 1 pl Page 10 of 38 Configuration options raw_tag_length 35 This parameter defines the longest length of the tags contained in the csfasta file mapping_parameters 50 5 0 45 5 0 40 5 0 35 3 0 30 3 0 These parameters define the lengths at which matching will occur recursively the number of mismatches permissible between the tag and the reference sequence and whether or not to treat valid
14. RNA MATE user manual preliminary documentation Version 1 1 July 2009 Contact rna mate expressiongenomics org Institute for Molecular Bioscience The University of Queensland St Lucia QLD 4072 Page 1 of 38 License Copyright 2008 2009 Nicole Cloonan Qinying Xu Geoffrey Faulkner and Sean Grimmond This program is free software you can redistribute it and or modify it under the terms of the GNU General Public License as published by the Free Software Foundation either version 3 of the License or at your option any later version This package is distributed in the hope that it will be useful but WITHOUT ANY WARRANTY without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE See the GNU General Public License for more details You should have received a copy of the GNU General Public License along with this program If not see lt http Awww gnu org licenses gt Page 2 of 38 Table of Contents License 2 The RNA MATE pipeline 5 General Description 5 Part 1 Quality checking of the tag optional 6 Part 2 Recursive alignment to the human or mouse genome 6 Part 3 Multi mapping tag rescue optional 6 Part 4 Creation of visualization files 6 Availability 7 Requirements 7 Installation instructions 7 Testing RNA MATE 8 Scripts 10 Master script rna mate v1 1 pl 10 Configuration files 10 Configuration options 11 Selecting appropriate parameters 14 How to map to the gen
15. TIP Atempting to apply a fikering condition may throw exceptions condition 6 9 attempting certan numencal csiculabions on strings condition The number of invalid skipped lines is documented in the TIP if your data is not TAB delimited use Text Manipulation gt Convert Syntax The fiter too allows you to restrict the datset using simple conditional st o Columns are referenced with and 4 number for example c1 refers to the first column of tab delienited file e Make Sure that muRi character operators contan no white space 9 iS valid while lt is not vaid When usnog eaual to operator double equal sign must be used 0 9 Cl chr1 es asiosio oo double guotas eg chant Jerators are all lower case e g cls Chex and Clf chr or nat CG e 2 is less than the value of column 4 umes 100 less than four comma separated elaments equal to 1 Drk But 2 lt 44554350 will quoted 0 g 3 exon Page 31 of 38 lt A n ih 2 bags B genome biology retrucdons to authors 10 Select Coding Exons UTR Exons History Options Unnamed history 0 critter on sata a ex 11 Select your stranded et pees 7 Gene BED What it does Estch Alignments Get Genomic Scores Ooerate on Genomic Intervals Statistics 9 Extract exons from the BED files Click Extract Features then Gene BED to Exon Intron C
16. d use less memory wiggle_plot pm This module creates strand specific wiggle plot or bedGraph files for visualization in the UCSC genome browser This module also creates start site plots which facilitates tag counting applications UCSC_Jjunction pm This module creates BED files for displaying exon junction usage in the UCSC genome browser Page 18 of 38 Log File test_500K tags_50mers log This is an example of the output log file for the test_500K_tags 50mers experiment Each status output includes two lines the first line is system time and the second is what the system doing at that time Thu May 28 PROCESS Thu May 28 SUCCESS ignore tag Thu May 28 PROCESS SUCCESS PROCESS hu May 28 SUCCESS hu May 28 PROCESS hu May 28 SUCCESS hu May 28 PROCESS hu May 28 SUCCESS hu May 28 PRPCESS hu May 28 PROCESS on Jun UCCESS on Jun ESS e Jun 2 n U w O Q E J 4 3 4 J 4 J 1 i J J J J HEMO UC W E Oe MeO UE t zal O Q za Nn n S J n as Qo Q FG NG ns 3 J c 0 q c 5 PROCESS Tue Jun 9 SUCCESS Tue Jun 9 PROCESS Wed Jun 10 12 16 57 2009 Welcome to our mapping strategy system 12 22 02 2009 Created csfasta file for tag with different tag length quality 12 22 03 2009 mapping to all chromosomes Thu May 28 12 22 09 2009 mapped to all chromosomes Thu May 28 12 22 09 2009 collating genome m
17. e the command filter_bedGraphs pl f test_500K_tags_50mers negative wiggl m 5 Assessing the specificity of mapping In order to examine the specificity of the mapping by the directionality of the library this script can be used to examine the junction BED files generated by RNA MATE The sense strand matches will be your expected strand BED file and the antisense matches will be your unexpected strand BED file Usage assess_junctions_for_directionality pl REQUIRED s name of BED file containing sense matches a name of BED file containing antisense matches 0o name of outfile For example to assess the output of the test data set use the command assess_junctions_for_directionality pl s test_500K_tags_50mers expect junction BED a test_500K_tags_50mers unexpect junction BED o test output Ideally more than 99 5 of tags should be in the expected strand A lower value indicates a problem with the mapping parameters used or less likely a problem with the cDNA library generation Page 29 of 38 Assigning tags or coverage counts to gene models This script has been deprecated as more extensive tools for manipulation of genomic regions are available from the GALAXY website at http main g2 bx psu edu Local copies of Galaxy can be installed and used useful to avoid excessive internet usage through the transfer of large files Downloading and installation guides
18. e the hierarchical junction mapping step optional How to use RNA MATE to perform non recursive mapping Although designed for recursive mapping RNA MATE can also map at a single tag length if preferred This will speed up the analysis in situations where maximum sequencing depth is not required To do this simply adjust the mapping parameters option to the length of tags desired eg mapping_parameters 50 5 0 Page 17 of 38 Modules tools_RNA pm This module includes four functions creating log files checking whether the jobs on the queue are finished creating new csfasta files and chopping tags for recursive mapping tag_quality pm This module checks tag quality making sure that each tag contains less then five nucleotides where the QV value for that basecall is less than 10 PHRED scale Currently this threshold is hardcoded Future implementations will allow user defined values at this point RNA_mapping pm This module automatically arranges genome and junction mapping for different tag lengths single_select pm This module attempts to select a single mapping position for each tag based on the mapping results at the highest stringency For example if a tag maps once with zero mismatches and 3 times with one mismatch then the tag is recorded as a single mapping tag at a stringency of zero mismatches new_rescue pm This module uses the new version of the rescue program which can parallel rescue for each chromosome an
19. ences and graphically display exon combinatorics In addition plots containing only start sites of tags are included to facilitate tag counting applications Page 6 of 38 Availability All source code documentation and associated files described in this manual are freely available for download from http grimmond imb ug edu au RNA MATE or http solidsoftwaretools com gf project rnamate requires registration which is free Requirements This pipeline is written predominantly in perl with some python thrown in for good measure and requires that you have version 5 8 8 of perl or later and python version 2 4 or later It is designed to run in a unix environment with a PBS queue manager The scripts can be modified to work with an LSF or SGE manager It is not recommended to run this pipeline on a system without access to a cluster due to the large computational requirements of mapping to mammalian genomes however the scripts could potentially be modified to do this You will need to install the ForkManager pm perl module if you do not already have it as well as Path Class 0 16 Both are available from CPAN The alignment section of this pipeline is dependant upon the mapreads tool This tool and its installation instructions are available from http solidsoftwaretools com gf project mapreads requires registration which is free Finally you will need a genome against which to map The program expects
20. ength of the tag and these will contain adaptor sequence that will not map to the genome Here we present a computational pipeline to map RNAseq data which generates both tag counting and genome browser visualization of genomic and exon junction matching results RNA MATE Mapping and Alignment Tool for Expression is designed for the rapid mapping of data from the Applied Biosystems SOLiD system Figure 1 check quality TA escue ultimappers A i No Figure 1 The RNA MATE recursive mapping pipeline The pipeline consists of 4 major components 1 The optional tag quality module filters tags based on the quality values for each basecall 2 The alignment module attempts to align tags first to the genome and then to a library of known exon junction sequences If a tag fails to align then the tag is truncated and the process is repeated 3 The optional tag rescue module is uses information derived from both single mapping and multi mapping tags to uniquely place multi mapping tags 4 Finally UCSC genome browser compatible wiggle plots and BED files are generated Page 5 of 38 Part 1 Quality checking of the tag optional Depending on the downstream applications of the matched data the quality of individual tags may need to be assessed before their inclusion in the mapping pipeline To accommodate this we have provided an optional tag quality module which assesses the tags by the number of basecalls with PHRED
21. enome ucsc edu Page 26 of 38 Re entering the RNA MATE pipeline There may be some occasions where you might wish to regenerate the wiggles or BED tracks from the collated files For example you may have initially generated the wiggles without multi mapping rescue but now you wish to generate them with rescue turned on Rather than remapping you can simply modify your config file eg rescue true and reenter the pipeline at the rescue stage using the command nohup restart_at_rescue pl c test_500K_tags_50mers conf amp To make use of this feature you must keep the collated files and the junction ID files see Description of output files Modifying the pipeline to work with other queues In order to make this program compatible with other queue managers the RNA_mapping pm module will need to be edited Specifically line 332 in the mapping subroutine that contains Scomm qsub 1 walltime 48 00 00 o S mysh out e S mysh err Smysh gt Smysh id will need to be replaced with the appropriate job submission commands and parameters that are specific to your system For SGE Till Bayer from the MPI for Evolutionary Biology in Germany has kindly provided instructions on modifying this script to work on SGE systems The line above should be changed to Scomm qsub 1 s_rt 48 00 00 o Smysh out e Smysh err Smysh gt Smysh id In addition to modifying the lines above line 325 which reads print OUT Scomm sho
22. ers 35 22 11 54 2009 collated genome mers 35 22 11 54 2009 mapping to junction mers 35 2 1 22 11 54 2009 mapped to junction mers 35 2 1 22 11 54 2009 single selecting junction position amp ID mers 35 22 25 42 2009 single selected junction position amp ID mers 35 22 25 42 2009 chopping tag 22 33 48 2009 mapping to all chromosomes p E N E CO fi 14 58 04 2009 mapped to all chromosomes 1 14 58 04 2009 collating genome mers 30 02 56 07 2009 collated genome mers 30 02 56 07 2009 mapping to junction mers 30 2 1 04 21 10 2009 mapped to junction mers 30 2 1 04 21 10 2009 single selecting junction position amp ID mers 30 04 38 16 2009 single selected junction position amp ID mers 30 04 38 16 2009 chopping tag 04 52 46 2009 mapping to all chromosomes s Ris e5 9 12 58 27 2009 mapped to all chromosomes 9 12 58 28 2009 collating genome mers 25 23 28 09 2009 collated genome mers 25 23 28 09 2009 mapping to junction mers 25 2 1 00 48 10 2009 in which we Page 19 of 38 SUCCESS mapped to junction mers 25 2 1 Wed Jun 10 00 48 10 2009 PROCESS single selecting junction position amp ID mers 25 Wed Jun 10 00 58 18 2009 SUCCESS single selected junction position amp ID mers 25 Wed Jun 10 00 58 19 2009 PROCESS combine data from different tag length and then classify into different strand and chromosome Wed Jun 10 01 20 2
23. est_500K_tags_50mers csfasta input test_500K_tags_50mers csfasta output test_500K_tags_50mers mers35 unique csfasta test_500K_tags_50mers mers30 unique csfasta These files contain your original tag sequences and IDs in csfasta format These are temporary internal files and can be deleted when the genomic_mapping subroutine of the RNA mapping module has completed successfully Page 21 of 38 RNA_mapping pm eg 35mers sub genomic_mapping my Spara 35 3 0 input test_500K_tags_50mers mers35 unique csfasta output chr2 test_500K_tags_50mers mers35 unique csfasta ma 35 3 chrM test_500K_tags_50mers mers35 unique csfasta ma 35 3 These files are the mapreads output for all tags to each chromosome as indicated in the file name The format is identical to the csfasta input format except that matches are stored in a comma separated manner after the tag ID in the header line They are temporary internal files and can be deleted as soon as the genome collation subroutine of the RNA mapping module has been completed successfully sub genome_collation my Spara 35 3 0 input chr mers35 3 output test_500K_tags_50mers mers35 genomic collated test_500K_tags_50mers mers35 genomic non_matched The collated file contains all of the genomic matches for the tags identified by chromosome The file format is tag ID lt tab gt number of hits to the genome lt tab gt matches lt newline gt Colourspace tag lt n
24. ewline gt The tag matches themselves are tab delimited zero based and are in the format chromosome strand genomic location number of mismatches eg chr1 13000056 2 is a match at chromosome 1 position 13000056 on the negative strand with two mismatches eg chrX_random 64522 0 is a match at chromosome X random position 64522 on the positive strand with zero mismatches Note that tag matches will only be reported for those where the total number of matches second column is less than or equal to the number specified with the max_multimatch parameter The collated files are very useful particularly for saving time on the regeneration of wiggle plots or other analyses and we strongly recommend that all collated files are stored with the final output files The non_matched files are a csfasta file containing the tags that did not match to the genomic locations This is a temporary internal file and can be deleted after the junction mapping subroutine of the RNA mapping module has been completed successfully Page 22 of 38 sub junction_mapping my Spara 35 3 0 input test_500K_tags_50mers mers35 genomic non_matched output hg18_junctions_best_quality test_500K_tags_50mers mers35 genomic n on_matched ma 35 3 These files are the mapreads output for all tags to the junction library specified in the configuration file The format is identical to the csfasta input format except that matches are stored in a c
25. for 35nt tags test_500K_tags_ 50mers mers40 genomic collated all genomic matches for 40nt tags test_500K_tags 50mers mers45 genomic collated all genomic matches for 45nt tags Page 8 of 38 test_500K_tags 50mers mers50 genomic collated all genomic matches for 50nt tags test SOOK tags 50mers negative starts wiggle plot of tag start sites for the ve strand test SO0K tags 50mers negative wiggle wiggle plot for the ve strand test SOOK tags S5Omers positive starts wiggle plot of tag start sites for the ve strand test_500K_tags 50mers positive wiggle wiggle plot for the ve strand Edit the configuration file so that it refers to the appropriate directories that you have set up on your system See the Configuration Options section for more details on what each of the parameters does Run the script use the following command nohup path rna mate vl 1 pl c test_500K_tags_50mers conf amp Where path is the full path to ma mate v1 1 pl eg nohup data rna mate vl 1 pl c test_500K_tags_50mers conf amp Testing will run RNA MATE on the SOLiD test data in approximately 3 hours using 3 Blades each with 16GB of RAM and 2 Dual Core AMD Opteron tm Processor 2218 4 cores running RedHat Linux 2 6 18 92 1 17 el5 x86 64 The wiggle plots and BED junction files should be compared to those provided in the test_results folder to ensure that the pipeline is working as expected Page 9 of 38 Scripts Master script rna
26. he effect of sequencing error on the mapping results Obviously this consideration does not apply for non SOLiD data sets Secondly as every iteration takes CPU time the more iterations that are done the higher the cost of the additional mapping tags Larger iterations decrease the run time but they also decrease the Page 15 of 38 sensitivity of the strategy to detect tags that lie across exon junctions Figure 3 shows the effect of alternate mapping strategies on the computational time and the proportion of mappable tags based on two different scenarios an ideally sized mRNA library and a smaller than ideal mRNA library Scenario one larger sized fragment library Scenario two smaller sized fragment library run time minutes run time minutes 8 et ae so 4s 20 35 30 25 Length of tag Length of tag Tun time mins rTun time mins run tme mins Tun time mins Tun tme mins run tme mins cumulative run ime cumulstve run time cumuative run me cumulative run ime cumulstive run me cumulative run time Scenario one larger sized fragment library Scenario two smaller sized fragment library o W proportion of mappable reads mappable tags mappabie tags mappable tags mappable tags mappadie tags mappable tags cumulstve mappable tags 4 cumufatve mappable tags cumulative mappable tags cumutative map
27. he wiggle starts files are too large to upload even when gzipped and you may need to apply a post pipeline filter to the results see Post RNA MATE scripts Description of output files RNA MATE produces a large number of files in the specified output directory most of which are temporary working files and can be deleted Unless the total storage space on your cluster is an issue it is probably best to wait until the pipeline has finished before deleting files This allows you to re enter the mapping pipeline at different points without needing to start the mapping process from scratch This can save significant amounts of time in the even of a power outage or similar computational catastrophe This section takes you through the inputs and outputs generated by running RNA MATE on the test data the contents of the output files generated from each of the modules and whether or not these should be stored or deleted The bold headings are the perl modules that are called by RNA MATE The italicized headings refer to the subroutines within each of the perl modules and the bracketed information gives examples of the paramaters that are given to this subroutine Some perl modules are called multiple times such as RNA_mapping pm and particular attention should be paid to the length at which the subroutine has been run especially if insufficient resources are available and files must be deleted mid run tools_RNA pm sub create_csfasta data t
28. hgTables command start RNA MATE makes no requirement for the minimum or maximum number of nucleotides required on the donor or acceptor side of the junction and there is no requirement to keep these lengths the same However it may be beneficial for your own analysis to ensure that these are symmetrical so that when performing an analysis you can be sure of the minimum overlap of a tag on the junction sequence ie If you require a minimum of 10nt in a 5Ont tag to cross an exon junction then the donor and acceptor sequences should be 40nt long Once the sequences and coordinates have been assembled ensuring that the unique IDs are in the same format as the provided above there are two scripts provided to format the libraries the way RNA MATE is expecting them The first script is concatenate_sequences pl and is used to convert the multi fasta format into a single concatenated fasta format eg concatenate_sequences pl f fasta file o output file h header The second is make_index pl and this script creates the index file required for decoding the matches to the concatenated junction files eg make_index pl lt fasta_file gt output file Page 38 of 38
29. ing data includes e test 500K tags 50mers conf the test configuration file e test 500K tags 50mers csfasta the SOLID testing data The testing results folder includes the wiggle plots and junction BED files that were generated from the testing data on our system e test_500K_tags 50mers expect junction BED junction BED file for UCSC genome browser test 500K_tags 50mers unexpect junction BED antisense matches to junctions test_500K_tags_ 50mers junction30 SIM negativeID antisense junction matches for 30nt tags test_500K_tags_ 50mers junction30 SIM positiveID sense junction matches for 30nt tags test_500K_tags_ 50mers junction35 SIM negativeID antisense junction matches for 35nt tags test_500K_tags_ 50mers junction35 SIM positiveID sense junction matches for 35nt tags test_500K_tags_ 50mers junction40 SIM negativeID antisense junction matches for 40nt tags test_500K_tags_ 50mers junction40 SIM positiveID sense junction matches for 40nt tags test_500K_tags 50mers junction45 SIM negativeID antisense junction matches for 45nt tags test_500K_tags 50mers junction45 SIM positiveID sense junction matches for 45nt tags test_500K_tags 50mers junction50 SIM negativeID antisense junction matches for 50nt tags test_500K_tags 50mers junction50 SIM positiveID sense junction matches for 50nt tags test_500K_tags_ 50mers mers30 genomic collated all genomic matches for 30nt tags test_500K_tags 50mers mers35 genomic collated all genomic matches
30. m the unique ID of the junction and are defined as chromosome _ first base of intron _ last base of intron _ strand Note All junction sequences are provided in the sense orientation ie 5 to 3 and all coordinates are zero based This means that hits to these libraries should be predominantly on the one strand and which strand will depend on the laboratory based library preparation method used In the case of the mammalian_exon_junction_libraries each junction contains the last 30nt of the donor exon joined to the first 30nt of the acceptor exon to create a 60mer sequence Exon sequences were derived from all known genes gene predictions mRNA evidence and EST evidence available at the time of creation early 2007 Redundant sequences were removed as were 60nt sequences that matched to the genome in their entirety or matched to other exon junctions within the library The file list for this junction set is as follows mammalian_exon_junction_libraries tar gz hg18_ junctions_best_quality fasta cat hg18_ junctions_best_quality fasta index mm9 junctions _best_quality fasta cat mm9 junctions _best_quality fasta index In the case of the junction libraries different lengths of the donor and acceptor sequences are provided to allow full customization of matching stringency the number in the file name represents the number of nucleotides from the donor and acceptor eg hg18 junctions 25 fa cat contains 25nt form the dono
31. mate v1 01 pl This script will call the required modules in order There is only one user defined parameter for this script which allows you to specify a configuration file containing all the required parameters for the entire mapping pipeline To run this script use the following command nohup path rna mate vl 1 pl c configuration file amp Where path is the full path to rma mate v1 1 pl and configuration file is the name and full path if not in the current working directory of the configuration file Configuration file The configuration file is a text file containing all the required parameters to run RNA MATE In this file directory listings must end with a there must be no other punctuation at the end of the lines and there should be no empty lines in this file An example of the configuration file is given below raw_tag_length 50 mapping_parameters 50 5 0 45 5 0 40 5 0 35 3 0 30 3 0 mask 11111111111111111111111111111111111111111111111111 max_multimatch 10 expect_strand exp_name test_500K_tags_50mers chromosomes chrM chr2 chr_path data matching hg18_fasta junction_50 data RNA MATEv1 1 junction_libraries hg18 junctions 45 fa cat junction_50_index data RNA MATEv1 1 junction_libraries hg18 junctions 45 fa index junction_45 data RNA MATEv1 1 junction_libraries hg18 junctions 40 fa cat junction_4
32. mbiguous sequences have recently been applied to high throughput sequencing data and we have refined our previously published algorithm to work efficiently with large data sets For every multi mapping tag the algorithm considers all tags that map near to each of the possible locations of the tag within a user specified window to determine the most likely mapping position of the tag Where a tag can not be unambiguously assigned a fractional weighting to the relevant positions is assigned In practice between 40 60 of multi mapping tags can be assigned a single position with gt 60 likelihood depending on the relative sequence coverage The recommended window size for shotgun sequencing is 10 Cloonan et al 2008 Nat Methods 5 613 619 though for disparate data types currently available this can vary For instance Cap Analysis of Gene Expression CAGE tags are rescued using a window of 100 nt a size previously shown to optimize mammalian promoter detection Carninci et al 2006 Nat Genet 38 626 635 Part 4 Creation of visualization files Finally UCSC genome browser compatible wiggle plots for genome mapped data and BED files for exon junction mapped data are generated automatically from the collated results The wiggle plots are strand specific single nucleotide resolution coverage plots and directly represent the number of times an individual nucleotide has been seen in the sequencing data BED files depict hits to junction sequ
33. mic locations corresponding to the start sites of each mapped tag which can be loaded into the UCSC genome browser These files should be stored sub collect_data input chr2 test_500K_tags_50mers for_wig negative starts chr2 test_500K_tags_50mers for_wig positive starts chr2 test_500K_tags_50mers for_wig negative wig chr2 test_500K_tags_50mers for_wig positive wig chrM test_500K_tags_50mers for_wig negative wig chrM test_500K_tags_50mers for_wig positive wig chrM test_500K_tags_50mers for_wig negative starts chrM test_500K_tags_50mers for_wig positive starts output test_500K_tags_50mers negative wiggl test_500K_tags_50mers positive wiggle These are the final output files containing the single nucleotide resolution coverage plots wiggle plots which can be loaded into the UCSC genome browser These files should be stored UCSC_Junction pm sub main inputs test_500K_tags_50mers junction30 SIM positiveID test_500K_tags_50mers junction35 SIM positiveID test_500K_tags_50mers junction30 SIM negativeID test_500K_tags_50mers junction35 SIM negativeID outputs test_500K_tags_50mers expect junction BED test_500K_tags_50mers unexpect junction BED These are the final output files containing the matches to the exon junction libraries in a BED format which can be loaded into the UCSC genome browser For more information on BED or wiggle bedGraph file formats please see the UCSC genome browser website at http g
34. mount of memory consumed by the rescue see the note above regarding multi mapping tag rescue and memory usage quality check true This parameter allows you to turn on or off the quality checking of tags module Acceptable values are true or false True run quality check False do not run quality check script_chr_start data matching chr_start pl script_chr_wig data matching chr_wig pl 2m data matching f 2m pl mapreads data matching mapreads rescue data matching chr_rescueSOLiD py master_script data matching rna mate vl 1 pl These parameters define the full path showing the location of the various scripts required to run RNA MATE rescue_window 10 This parameter defines the window size used for multi map tag rescue The recommended setting for shotgun sequencing data is 10 whereas the recommended setting for CAGE and other disparate data sets is 100 Selecting appropriate parameters Understanding the two major parameters the number of mismatches allowed for at every tag length and the number of nucleotides chopped at each iteration as well as the smallest mappable size for the genome is critical to maximizing the efficiency and accuracy of the recursive mapping strategy How many mismatches to allow at each length is critical to both the speed and accuracy of tag mapping The more mismatches allowed the slower the program will run however a low number of mismatches may fail to capture mappable tags
35. odon BED expander La Mom vated _ the Expression Geno E mU T Geode Shole eta crvine Gs Vege Motse _ QF mtaiessle foods O Expression Genomes VM werldtork by Trendw Pubtied Home rt 14 Select interval and the file to upload eg test_500K_ tags_50mers negative starts nen File C Documents end Sen Browse e Geg Microbial Data e ioMart Central ser er 4 Gene BED To sox e GrameneMart Centra server URL Text Exen Intron Coden BED ondata 2 Ehning server Encodes at NHGRI 3 Filter on data 1 evn e ERIGRAPH server amman 0 UR moore 15 Select the Genome E Convert Formats FASTA monioulotion miss eg hg18 Send Data Ellter and Sat m Join Subtract and Group Extract Features av Mat Tabular nd h D oes 16 Press Execute your fhe iz rik datadas data Click Get Data ci nips nS mm na nes ne cn carb then Upload File Page 32 of 38 Galasey Mozilla Firefox Sale refresh co anse all ENCODE Tools fe test_SOOK_tags_SOmers negative 3J Unnamed history o Bee test 500K taas SOmersnositive Convert Format 4 Gene BED To Exon intron Codon EASTA manipulation Second query aa i with min overtap Extract Features 1 Fetch bp Estch alignments Get Genomic Scores Oniy records that are joined INNER JOIE Operate on Genomic intervals
36. ome and junctions simultaneously 17 How to use RNA MATE to perform non recursive mapping 17 Modules 18 Log File 19 Checking the finished run 20 Description of output files 21 Re entering the RNA MATE pipeline 27 Modifying the pipeline to work with other queues 27 Optimizing performance on your cluster 28 Page 3 of 38 Post RNA MATE scripts Filtering wiggle plots Assessing the specificity of mapping Assigning tags or coverage counts to gene models Junction libraries Description of the available junction libraries How to create your own junction libraries 29 29 29 30 37 37 38 Page 4 of 38 The RNA MATE pipeline General Description For mammalian genomes there are technical challenges associated with mapping and counting short tag sequences derived from high throughput sequencing data Firstly mammalian transcripts are non contiguous due to the splicing of introns from the pre mRNA This means that there will be a portion of tags that cross exon exon boundaries that will not map directly to the genome The ability to use short tag information relies directly upon being able to place short tags uniquely within the genome The presence of genome wide repeats and other repetitive sequence in the mouse and human genomes mean that a sizeable proportion of short tags can not be placed uniquely Finally the random fragmentation of mRNA creates a distribution of sizes of which a significant proportion will be less than the full l
37. omma separated manner after the tag ID in the header line They are temporary internal files and can be deleted as soon as the junction selection subroutine of the RNA mapping module has been completed successfully sub junction_selection my Spara 35 3 0 input hg18_junctions_best_quality test_500K_tags_50mers mers35 genomic n on_matched ma 35 3 output test_500K_tags_50mers mers35 junction non_matched The non_matched file is a csfasta file containing the tags that did not match to the junction library specified in the configuration file This is a temporary internal file and can be deleted after the chop_tag subroutine of the tools RNA module has been completed successfully test_500K_tags_50mers junction35 negative stats test_500K_tags_50mers junction35 positive stats These stats files show for every tag the number of times a tag matched in total and at each level of mismatching to the junction library specified in the configuration file The file format is as follows gt tag ID lt tab gt total matches lt tab gt number of 0 mismatches lt tab gt number of 1 mismatches lt tab gt number of 2 mismatches lt tab gt number of 3 mismatches lt newline gt These are not required for any further steps in RNA MATE and can therefore be deleted if they are not required for your analysis test_500K_tags_50mers junction35 SIM negative test_500K_tags_50mers junction35 SIM positive These are temporary internal files tha
38. one file per chromosome with the filename format as chromosome_name fa Genomes can be downloaded from the UCSC genome browser website at http hgdownload cse ucsc edu downloads html Installation instructions The instructions given below in courier font are examples of the commands needed to carry out the installation 1 Move the tarball to the destination directory and decompress it eg mv RNA MATEv1 1 tar gz data eg tar xzf RNA MATEv1 1 tar gz 2 Add the path of the installation directory to INC using the command Page 7 of 38 export PERL5LIB S PERL5LIB full path RNA MATEv1 0 perl1 Where full path should be replaced with the path of the RNA MATE directory eg export PERL5LIB PERL5LIB data RNA MATEv1 1 perl This command can be added to the bash_profile or profile files depending on the shell for automatic loading or it can be added to the default profile for all users 3 Place the script mask_schemas_mapreads pl in the same folder as the mapreads program eg mv mask_schemas_mapreads pl data mapreads Testing RNA MATE To ensure that your downloads are functioning correctly please download the testing data available and testing results available from http grimmond imb uq edu au RNA MATE Please also download the hg18 version of the human genome available from http hgdownload cse ucsc edu goldenPath hg18 bigZips chromFa zip The test
39. ory Options 4 refresh colapse all eee Output refGene as BED Unnamed history o e UCSC Archaea table browser F Include custom track header o poet Get Microbial Data name ip_resGene e BioMart Central server desenptice lable browser quary on ezans e GrameneMart Central server vishitty pock E Ehmmine server a eee cEocodeds at NHGRI e EDIGRAPH server Create one BED record per Send Date Whole Gene ENCODE Tools Upstream by 200 bases Enasi Po bases et cach end Convert Formats C trons pus F bases at each end FASTA manipulation C UTR Exons Join Subtract and Group Coding Exon Pi Extract Features 3 UTR Exons Fetch Sequences Downstream by P00 bases Note fa feature os close to the begnnmg or end of a chromosome and upstream downstream bases are added they may be truncated in order to avoid extending past the edge of the Get Genomic Scores An eara a oo Savoie Lliris Sond quoy to Galaxy Statistics Tearc Egalonal variation Multiple rearession Evolution Hvehy Metanenomic analyses Short Read Analysis EMBOSS Workflows alaxcy Mozilla Firefox Send Data Filter ENCODE Tools E UCSC Main on Human refGene 9 p Lift Over Query missing See TIP below Text Manipulation Convert Formats EASTA manipulation must be used as shown above To fiter for an A Double equal signs must be used as equal to 9 c1 cbr O
40. pable tags 9 cumulative mappable tags cumulative mappable tags Figure 3 How alternate mapping strategies affect the yield of mappable tags and the computational run time In all graphs red lines represent Strategy 1 2nt interations 5Ont 30nt blue lines represent Strategy 2 Snt interations 5Ont 25nt and green lines represent Strategy 3 10nt iterations 50nt 30nt In all scenarios 5 mismatches were allowed for tag lengths ranging from 5Ont to 40nt 3 mismatches were allowed for tag lengths ranging from 39nt to 30nt and 1 mismatch was allowed for tag lengths ranging from 29nt to 25nt Scenario One is a fragment library with a mode insert size of 54nt Scenario Two is the same library with the insert size shifted to 39nt Together these graphs show that there is more benefit for a recursive strategy when the library insert size is smaller than ideal The efficiency of the recursive strategy is largely dependant on the median insert size of the RNA fragments If all fragments are longer than the read length of the tag then the recursive strategy would only additionally map the 5 end of novel splice junctions and those with poor quality 3 ends Depending on the individual sample and sequencing run this may or may not yield sufficient additional mapping tags to justify the additional computational time Page 16 of 38 In contrast assuming perfect adaptor identification and an ideal sized fragment library a vector cli
41. pecificity of mapping tags For each length and mismatch combination the proportion of tags that map uniquely within the human genome is indicated by the dark blue line left Y axis The black line indicates the proportion of tags that map to exons and shows a steep decline at 25 3 indicating a drop in specificity at this length left Y axis The light blue line shows the total number of tags mapping at a given length and mismatch combination showing a sharp increase in the number of mapping tags at 25 3 right Y axis logarithmic scale To minimize the computational time used for this approach the number of nucleotides to be chopped at each iteration should be greater than or equal to the number of mismatches allowed at any iteration We typically remove S5nt at a time for SOLID sequencing data for two reasons First the sequencing chemistry of SOLID is performed with five different primers and the number of cycles will determine the length of the tag in multiples of five For example to generate a 50mer sequence the data of ten cycles from each of the five primers is added together Typically each cycle has roughly the same error profile as the corresponding cycles from other primers ie the third cycle on primer one will have the same error rate as the third cycle on primer 5 and the error rate increases as the cycle number increases This means that typically error rates jump in multiples of Snt so excluding 5nt at a time will minimize t
42. pping method would use less than 40 of the CPU time than the recursive method for the same yield of mapping tags Unfortunately even under the best circumstances adaptor identification is not perfect and there are additional technical challenges for adaptor identification in color space Typically adaptor identification and chopping will yield under the least stringent conditions approximately 60 of the tags that will map under a recursive method On top of this the SOLiD system can use Internal Adaptor Blockers which prevent the ligation of sequencing probes to that region This causes a drop in accurate base calling which is not just based on the quality of base calls and under these circumstances successful adaptor identification can drop to just over 20 Whilst these blockers were an optional reagent in SOLiD V2 chemistry they are now premixed and therefore not optional on the V3 plates Ideally neither a recursive method nor an adaptor identification method would be required at all if we could ensure that the RNA fragment size was always going to be larger than the read size For microRNA populations where the mode size is approximately 22nt this is simply not possible Due to technical limitations on the maximum insert size possible in an emulsion PCR ePCR and the strong amplification bias of small fragments in ePCR this is not always achievable for fragmented RNA libraries either even those that have been size selected
43. r and 25nt from the acceptor In this case exon sequences were defined from UCSC known genes Refseq Ensembl Aceview GeneID GenScan and N Scan The file list for this junction set is as follows junction_libraries tar gz hg18 junctions 25 fa cat hg18 junctions 25 fa index hg18 junctions 30 fa cat hg18 junctions 30 fa index hg18 junctions 35 fa cat hg18 junctions 35 fa index hg18 junctions 40 fa cat hg18 junctions 40 fa index hg18 junctions 45 fa cat hg18 junctions 45 fa index hg18 junctions 50 fa cat hg18 junctions 50 fa index hg18 junctions 55 fa cat Page 37 of 38 hg18 junctions 55 fa index hg18 junctions 60 fa cat hg18 junctions 60 fa index hg18 junctions 65 fa cat hg18 junctions 65 fa index mm9 junctions 25 fa cat mm9 junctions 25 fa index mm9 junctions 30 fa cat mm49 junctions 30 fa index mm9 junctions 35 fa cat mm49 junctions 35 fa index mm9 junctions 40 fa cat mm9 junctions 40 fa index mm9 junctions 45 fa cat mm49 junctions 45 fa index mm9 junctions 50 fa cat mm9 junctions 50 fa index mm9 junctions 55 fa cat mm9 junctions 55 fa index mm9 junctions 60 fa cat mm9 junctions 60 fa index mm9 junctions 65 fa cat mm9 junctions 65 fa index How to create your own junction libraries To create custom exon junction libraries you need the coordinates and sequences of your exons junctions For some species you may be able to download these from the UCSC genome browser Tables pages at http genome ucsc edu cgi bin
44. rs mers35 genomic collated test_500K_tags_50mers mers30 genomic collated OULDULS test_500K_tags_50mers mers30 genomic stats test_500K_tags_50mers mers35 genomic stats These stats files show for every tag the number of times a tag matched in total and at each level of mismatching to the genome as specified in the configuration file The file format is as follows gt tag ID lt tab gt total matches lt tab gt number of 0 mismatches lt tab gt number of 1 mismatches lt tab gt number of 2 mismatches lt tab gt number of 3 mismatches lt newline gt These are not required for any further steps in RNA MATE and can therefore be deleted if they are not required for your analysis Page 24 of 38 chr2 test_500K_tags_50mers for_wig negative chr2 test_500K_tags_50mers for_wig positive chrM test_500K_tags_50mers for_wig negative chrM test_500K_tags_50mers for_wig positive These are temporary internal files and can be deleted after the successful completion of the wiggle plot module wiggle_plot pm sub parallel_wig_fork input chr2 test_500K_tags_50mers for_wig negative chr2 test_500K_tags_50mers for_wig positive chrM test_500K_tags_50mers for_wig negative chrM test_500K_tags_50mers for_wig positive output chr2 test_500K_tags_50mers for_wig negative sorted chr2 test_500K_tags_50mers for_wig negative wig chr2 test_500K_tags_50mers for_wig negative wig success chr2 test_500K_tags_50mers for_wig positive sorted
45. scores of less than 10 Tags that pass the QC are fed into the recursive alignment module If this option is disabled all tags are passed to the alignment module Part 2 Recursive alignment to the human or mouse genome Alignment of the short tags to a reference genome is done using mapreads http solidsoftwaretools com gf project mapreads an algorithm specifically designed for the rapid mapping of data from the Applied Biosystems SOLiD system ie color space data Tags are first matched against all chromosomes of the reference genome and then against a library of known exon junctions hg18 and mm9 are currently supported Tags that fail to map to the genome or junctions are chopped to user defined lengths and the genomic mapping is restarted In this way tags that have adaptor sequence or poor quality ends are recovered at their longest length The number of mismatches between the reference and tag is user defined and when mappings from all tags are collated into a single file only the mappings at the highest level of stringency are retained Part 3 Multi mapping tag rescue optional For most downstream applications tags are only informative if they can be placed uniquely within a genome Tags that align to multiple places within a genome make up a sizeable proportion of transcriptome derived tags primarily from the inherent redundancy of the genome but also from CpG islands and genome wide repeat elements Strategies to rescue a
46. t specify the start end and number of mismatches for each of the tags against the specified exon junction library These can be deleted as soon as the junction_selection subroutine of the RNA_mapping module has been completed successfully Page 23 of 38 test_500K_tags_50mers junction35 SIM negativeID test_500K_tags_50mers junction35 SIM positiveID The ID files contain all of the junction library matches for the tags The file format is gt tag ID lt tab gt start coordinate of match lt tab gt end coordinate of match lt tab gt number of mismatches lt tab gt junction ID lt tab gt start coordinate of junction lt tab gt end coordinate of junction lt newline gt The ID files are very useful particularly for saving time on the regeneration of junction BED tracks or other analyses and we strongly recommend that all ID files are stored with the final output files tools _RNA pm sub chop_tag test_500K_tags_50mers mers35 junction non_matched test_500K_tags_50mers mers30 unique csfasta 5 input test_500K_tags_50mers mers35 junction non_matched test_500K_tags_50mers mers30 unique csfasta OuLDUT test_500K_tags_50mers mers30 unique csfasta These files contain the trimmed tag sequences and IDs in csfasta format These are temporary internal files and can be deleted when the genomic_mapping subroutine of the RNA _ mapping module has completed successfully single_select pm sub main input test_500K_tags_50me
47. teich Alignments Get Genomic Scores Statistics include Filter in workflow Groph Disolay Data i Beaienal variation gt s Fiiter on data Evolution HyPhy F inctude Filter in worktiow Metagenomic analyses Gene AED To Exen itron cadon BED EMBOSS 4 Gene BED To Exen intron Coden BED on data 2 Workflows F Indude Gene BD To Exon Intron Codon BED in workflow Geren R 5 Gene BED To exon intron codon BED on data 3 inchide Gene LED To Exon Intron Codon BED in workflow F Include Jom in worktiow Page 35 of 38 Galaney Mozilla Firefox Coding Exons UTR Exons i Running workflow RNAseq to Gene Counts it Group on data 10 3J 11 Group on data 10 E from Input Dataset Input Dataset Input Dataset i Group on data 10 3 Fiter Output dataset output from step 1 Output dataset output from step 1 With following condition e sa sa Extract Coding Exons UTR Exons Output dataset out_filel from step Page 36 of 38 Junction libraries Description of the available junction libraries Each available exon junction library contains two components The first is the concatenated fasta file of the exon junction sequences and the second is the index file that details the genomic coordinates of the exonic sequences In both available sets the genomic coordinates for
48. the concatenated junction library and the junction length index file asks for the file that decodes the concatenated fasta Junction libraries for human and mouse genomes can be downloaded from http grimmond imb uq edu au RNA MATE output_dir data RNA MATEv1 1 test_results The full path of the output directory This directory must exist prior to running the pipeline raw_qual data raw tag20000 qual The full path of the file containing the QV file raw_csfasta data raw tag20000 csfasta The full path of the csfasta file to be matched run_rescue true This parameter allows you to turn on or off the rescue of multi mapping tags module Acceptable values are true or false True run multi map rescue false do not run multi map rescue NOTE multi map rescue can be a very memory intensive process Rescue for a single chromosome of a transcriptome dataset with gt 100 million mappable tags can consume more than 20 Gb of resident memory The amount of memory used will depend on the size of the data set the number of multi mapping tags versus single mapping tags the underlying complexity of the data set and the number of positions of each tag to be rescued Page 13 of 38 num_parallel_ rescue 4 This parameter allows you to adjust the number of rescue jobs that are run in parallel The settings chosen here will depend on the amount of memory available on your system the number of CPUs available and the a
49. tory where you ran RNA MATE from A clean nohup out file should contain only messages that look like found file data RNA MATEv1 1 test_results hg18_junctions_20 test_500K_tags_50mers genomic non_mat ched ma 25 2 adj_valid success sleeped 80 60 seconds which simply indicate that the mapping jobs have been found Other messages will indicate errors in the pipeline 3 Check that the mapping was completed successfully for each chromosome The mapreads output see Description of output files should be at least the same file size as the original csfasta input A smaller file represents a failed mapreads run for that chromosome These mappings can Page 20 of 38 be regenerated by submitting the appropriate shell script sh to the queue manager and then restarting the pipeline 4 Check that the final visualization files are present see next section for a description of the final output files The expect junction BED file should be much larger than the unexpect BED file The positive and negative files for both starts and wiggle should be roughly the same size The final size of the files will depend on the size of the run being analyzed but for a single slide of SOLiD data you might expect the file sizes to be in the order of 100 300MB 5 Finally check that the files upload into the UCSC genome browser without errors To minimize file size the wiggle plots and bed tracks can be gzipped Sometimes t
50. uld be changed to include the bin sh line and a newline after the actual command needs to be inserted For LSF Xuanzhong Li from the Children s Hospital Boston USA has kindly provided instructions on modifying this script to work on LSF systems Any instances of the qsub command need to be replaced with the bsub command All other parameters appear to work fine for both PBS and LSF queue managers Page 27 of 38 Optimizing performance on your cluster The entire RNA MATE pipeline including mapreads is very I O intensive and depending on the cluster setup users may find that it needs to be modified for optimal performance For example those people using NFS filesystems may find that NFS will timeout if too much is asked of it For these systems an inefficient but necessary throttle may be to request two or more CPUs per mapping job This can be done by modifying line 322 as follows changed text is in red Scomm qsub 1 walltime 48 00 00 ncpus 2 o S mysh out e Smysh err Smysh gt Smysh id Page 28 of 38 Post RNA MATE scripts Filtering wiggle plots This script is for filtering and reducing the size of the wiggle plots bedGraphs to be uploaded into the UCSC genome browser Usage filter_bedGraphs pl REQUIRED f name of file to be filtered m minimum number of tags to report For example to remove the data from all nucleotides where there isn t at least 5 tags covering them us

RNA-MATE user manual - Expression Genomics Laboratory

Contents

Download Pdf Manuals

Related Search

Related Contents