X-MATE user manual - Expression Genomics Laboratory

1. [Screenshot: Galaxy interface showing the tool menu, the history panel with a UCSC Main refGene data set, and the BED file format help text.]
2. As well as these Java library archives, freely available from Apache.org and also distributed with X-MATE:
• commons-io-1.4.jar (http://commons.apache.org/io/)
• commons-lang-2.2.jar (http://commons.apache.org/lang/)
• commons-logging-1.1.jar (http://commons.apache.org/logging/)

Within the X-MATE distribution is a directory called SamConverter. It contains a directory jar and the jar file samConverter.jar. The SamConverter directory also holds the source code for GffToSam (.cpp) and a Makefile. To install, follow these instructions:

1. First build the GffToSam utility and install it somewhere on your PATH (this may require administration privileges, depending on where you would like to install it):

    cd SamConverter
    make
    cp -p GffToSam /home/software

2. Next, add the Java archive files to your Java $CLASSPATH. Assuming that they reside in the directory /path/to/jars, you can enter the following commands:

    CLASSPATH=$CLASSPATH:/path/to/jars/commons-io-1.4.jar
    CLASSPATH=$CLASSPATH:/path/to/jars/commons-lang-2.2.jar
    CLASSPATH=$CLASSPATH:/path/to/jars/commons-logging-1.1.jar
    CLASSPATH=$CLASSPATH:/path/to/jars/com.apldbio.aga.common.jar
    CLASSPATH=$CLASSPATH:/path/to/jars/MaToGff.jar

The SamConverter utility is now ready to be used.

Testing and Configuration

Testing X-MATE: mapreads

To ensure that your downloads are functioning correctly
3. perl /path/check_matching_stats_XMATE.pl -p /path/to/results

The output of this script should look like this (assuming you mapped to hg19):

    Checking directory: output
    Checking genome and junction matches
    File: output/test_500K_tags_50mers.geno.30.3.0.collated
    File: output/test_500K_tags_50mers.geno.35.3.0.collated
    File: output/test_500K_tags_50mers.geno.40.5.0.collated
    File: output/test_500K_tags_50mers.geno.45.5.0.collated
    File: output/test_500K_tags_50mers.geno.50.5.0.collated
    File: output/test_500K_tags_50mers.junc.30.3.0.collated
    File: output/test_500K_tags_50mers.junc.35.3.0.collated
    File: output/test_500K_tags_50mers.junc.40.5.0.collated
    File: output/test_500K_tags_50mers.junc.45.5.0.collated
    File: output/test_500K_tags_50mers.junc.50.5.0.collated
    50 mers: 107698
    45 mers: 26733
    40 mers: 28452
    35 mers: 8730
    30 mers: 26204
    Total GB matched: 0.008817635

To clean up the X-MATE output directory, you can run this command:

    perl /path/clean_up_XMATE_output_directories -p pathToResults

This will delete some working files that are not necessary for standard X-MATE use. More advanced users may want to keep some of these files, so please check whether you need them first (see the section Description of output files for more information).

Testing X-MATE: ISAS Colors

If you have ISAS installed, you can also run the following tests to check that X-MATE is working prope
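The "Total GB matched" figure in the example output above can be reproduced from the per-length tag counts. The following is a minimal sketch, assuming the script reports the sum of (tag length × number of tags) divided by 10^9; that formula is inferred from the printed numbers, not taken from the script's source, and the function name is hypothetical.

```python
# Tag counts per matched length, copied from the example output above.
tag_counts = {50: 107698, 45: 26733, 40: 28452, 35: 8730, 30: 26204}


def total_gb_matched(counts):
    """Return matched bases in gigabases: sum(length * tags) / 1e9.

    Assumption: this mirrors how the stats script derives its total;
    it is inferred from the example numbers only.
    """
    total_bases = sum(length * tags for length, tags in counts.items())
    return total_bases / 1e9


print(total_gb_matched(tag_counts))
```

The same arithmetic applied to the ISAS example output elsewhere in this manual (105548, 24632, 24800, 24322 and 34300 tags at 50, 45, 40, 35 and 25nt) gives 0.00908661, which matches the total printed there.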
4. 1. Check the log file. Look for WARNING or DIED messages that will describe what has gone wrong, eg:

    grep WARNING test_500K_tags_50mers.log | more

2. Check the nohup.out file in the directory where you ran X-MATE from. A clean nohup.out file should contain only messages that look like:

    found file /data/X-MATEv1.1/test_results/hg18.junctions.20.test_500K_tags_50.matched.ma.25.2.adj_valid.success
    sleeped 80 (60) seconds

which simply indicate that the mapping jobs have been found. Other messages will indicate errors in the pipeline.

3. Check that the mapping was completed successfully for each chromosome. The mapreads output (see Description of output files) should be at least the same file size as the original csfasta input. A smaller file represents a failed mapreads run for that chromosome. These mappings can be regenerated by submitting the appropriate shell script (.sh) to the queue manager and then restarting the pipeline.

4. Check that the final visualization files are present (see next section for a description of the final output files). The expect junction BED file should be much larger than the unexpect BED file. The positive and negative files for both starts and wiggle should be roughly the same size. The final size of the files will depend on the size of the run being analyzed, but for a single slide of SOLiD data you might expect the file sizes to be
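The first and third checks above (scanning the log for WARNING or DIED messages, and comparing each mapreads output with its csfasta input) are easy to script. The following is a minimal sketch, not part of X-MATE: the function names are hypothetical, and the size rule is simply the "at least the same file size" criterion stated above.

```python
import os


def find_problem_lines(log_path, keywords=("WARNING", "DIED")):
    """Return the log lines that contain any of the given keywords."""
    with open(log_path) as handle:
        return [line.rstrip("\n") for line in handle
                if any(word in line for word in keywords)]


def mapping_looks_complete(csfasta_path, mapreads_output_path):
    """Apply the size rule from the checklist above.

    A mapreads output that is smaller than its csfasta input is treated
    as a failed run for that chromosome.
    """
    return (os.path.getsize(mapreads_output_path)
            >= os.path.getsize(csfasta_path))
```

For example, find_problem_lines("test_500K_tags_50mers.log") returns an empty list for a clean run, and any chromosome for which mapping_looks_complete returns False is a candidate for resubmitting its shell script.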
5. [Screenshot: Galaxy Filter tool applied to a BED data set, with the tool's syntax help. Columns are referenced with c and a number; for example, c1 refers to the first column of a tab-delimited file.]
6. [Screenshot: Galaxy "Operate on Genomic Intervals" Join tool applied to the test_500K_tags_50mers negative data set, with the tool's syntax help. The overlap setting specifies the minimum overlap between intervals that allows them to be joined; "Return only records that are joined" is analogous to an INNER JOIN.]
7. /data/matching/hg19_fasta/chr22.fa,
    /data/matching/hg19_fasta/chrX.fa,
    /data/matching/hg19_fasta/chrY.fa,
    /data/matching/hg19_fasta/chrM.fa

    /data/xmate/junction_libraries/hg19_junctions_45.fa.cat,
    /data/xmate/junction_libraries/hg19_junctions_40.fa.cat,
    /data/xmate/junction_libraries/hg19_junctions_35.fa.cat,
    /data/xmate/junction_libraries/hg19_junctions_30.fa.cat,
    /data/xmate/junction_libraries/hg19_junctions_25.fa.cat

    collated_file = /data/mapping/output/test_500K_tags_50mers.geno.30.3.0.collated,
    /data/mapping/output/test_500K_tags_50mers.geno.35.3.0.collated,
    /data/mapping/output/test_500K_tags_50mers.geno.40.5.0.collated,
    /data/mapping/output/test_500K_tags_50mers.geno.45.5.0.collated,
    /data/mapping/output/test_500K_tags_50mers.geno.50.5.0.collated,
    /data/mapping/output/test_500K_tags_50mers.junc.30.3.0.collated,
    /data/mapping/output/test_500K_tags_50mers.junc.35.3.0.collated,
    /data/mapping/output/test_500K_tags_50mers.junc.40.5.0.collated,
    /data/mapping/output/test_500K_tags_50mers.junc.45.5.0.collated,
    /data/mapping/output/test_500K_tags_50mers.junc.50.5.0.collated

    files = /data/dwood/test/xmate_test_data/test_500K_tags_50mers_QV.qual
    exp_name = test_500K_tags_50mers
    output_dir = /data/dwood/test/xmate_test_data/samoutput/
    ParallelNumber = 4
    MaxTagLength = 50
    MultiSAM = false

    junctions
    NameOfJunction = hg19_junctions_45, hg19_junctions_40, hg19_junc
8. The optional tag rescue module uses information derived from both single-mapping and multi-mapping tags to uniquely place multi-mapping tags. (4) Finally, UCSC genome browser compatible wiggle plots and BED files are generated. A final, optional step creates SAM files.

Figure 2: Effect of length and mismatches on the specificity of mapping tags. For each length and mismatch combination, the proportion of tags that map uniquely within the human genome is indicated by the green line. The blue line indicates the proportion of tags that map to exons and shows a steep decline at 25.3 and 25.5, indicating a drop in specificity at this length. The total number of tags mapping at a given length

Figure 3: How alternate mapping strategies affect the yield of mappable tags and the computational run time. In all graphs, red lines represent Strategy 1 (2nt iterations, 50nt to 30nt), blue lines represent Strategy 2 (5nt iterations, 50nt to 25nt) and green lines represent Strategy 3 (10nt iterations, 50nt to 30nt). In all scenarios, 5 mismatches were allowed for tag lengths ranging from 50nt to 40nt, 3 mismatches were allowed for tag lengths ranging from 39nt to 30nt, and 1 mismatch was allowed for tag lengths ranging from 29nt to 25nt. Scenario One is a fragment library with a mode insert size of 54nt. Scenario Two is the same library with the insert size shifted to 39nt. Together these graphs show tha
9. and human genomes (and presumably other mammalian genomes), we recommend using 3 mismatches for lengths from 30 to 39nt and 5 mismatches for lengths >40nt. If additional matching data is required, 25nt matches can be used, but caution should be used when interpreting the results. Either allow only a single mismatch to ensure specificity of mapping, or filter the final wiggle plots (eg only look at nucleotide positions that are covered by more than 4 tags) to an extent which removes the noise in this mapping (see Figure 2).

[Figure 2: line plot. X-axis: length.mismatches combinations from 50.0 to 25.5. Y-axis: 0 to 100. Series: tags mapping to exons, mapped, unique, multi.]

Figure 2: Effect of length and mismatches on the specificity of mapping tags. For each length and mismatch combination, the proportion of tags that map uniquely within the human genome is indicated by the green line. The blue line indicates the proportion of tags that map to exons and shows a steep decline at 25.3 and 25.5, indicating a drop in specificity at this length. The total number of tags mapping at a given length and mismatch combination (orange line) and the total number of multi-mapping tags (red line) are also shown, indicating a sharp increase in the number of mapping tags at 25.5. The highest specificity for tags mapping to exons is achieved when mapping w
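The mismatch recommendation above can be written as a small lookup. This is a sketch only, with two labeled assumptions: a tag length of exactly 40nt is given 5 mismatches (the Figure 3 settings allow 5 mismatches from 50nt down to 40nt), and lengths from 25 to 29nt are given a single mismatch (again following the Figure 3 settings and the advice to allow only one mismatch at 25nt). The function name is hypothetical.

```python
def recommended_mismatches(tag_length):
    """Return a suggested mismatch count for a given tag length.

    Assumptions (not stated as one rule in the manual): 40nt itself is
    grouped with the longer tags, and 25-29nt tags get one mismatch.
    """
    if tag_length >= 40:
        return 5
    if 30 <= tag_length <= 39:
        return 3
    if 25 <= tag_length <= 29:
        return 1
    raise ValueError("no recommendation for tags shorter than 25nt")


for length in (50, 45, 40, 35, 30, 25):
    print(length, recommended_mismatches(length))
```

Tags shorter than 25nt raise an error on purpose, since the text gives no guidance for them.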
10. [Screenshot: Galaxy "Join, Subtract and Group" tools (join side by side on a specified field, compare two queries, subtract, group data by a column) and the history panel, showing Join, Concatenate and Filter steps on the test_500K_tags_50mers positive and negative data sets.]
11. for money. In this respect, it is up to the individual user to decide whether or not to apply a recursive mapping strategy in their analysis.

How to map to the genome and junctions simultaneously

X-MATE has the flexibility to support other approaches to junction matching. In some circumstances, one may wish to consider matches to the junction library at the same time as matches to the genome. This can be done by renaming the junction library against which you wish to map so that it follows the chromosome naming convention (see Configuration options). We will sometimes use chrJ as the chromosome name and therefore chrJ.fa as the filename. If you choose to do this, simply configure X-MATE to not map against junctions using the map_junction = false optional parameter.

How to use X-MATE to perform non-recursive mapping

Although designed for recursive mapping, X-MATE can also map at a single tag length if preferred. This will speed up the analysis in situations where maximum sequencing depth is not required. To do this, simply set the recursive_maps option to the length of tags desired, eg recursive_maps = 50.5.0

X-MATE Functionality and Output Files

Log File: test_500K_tags_50mers.log

This is an example of the output log file for the test_500K_tags_50mers experiment. Each status output includes two lines: the first line is the system time and the second is what the system is doing at that time.
12. in a standard single-sequence fasta format. There is no requirement for the header line to contain any particular string. Only the filename and the chromosome name will be used in the pipeline.

    mapreads = /data/matching/mapreads

Specifies the location of the mapreads binary (optionally used as the mapping engine).

    schema_dir = /data/matching/schemas/

The location of the directory containing mapping schemas. NOTE: Mapping schemas must be available to do the mapping at the specified length and number of mismatches, or else the pipeline will fail. In the above example, the schemas required are schema_50_5, schema_45_5, schema_40_5, schema_35_3 and schema_30_3. Mapping schemas are available from http://solidsoftwaretools.com

genome ISAS (use this section if mapping using ISAS)

    isas = /data/isas/ISAScolorsNewCPU

The location of the ISAS binary to use. Replace with ISASbasesNewCPU if mapping in base space.

    global = 50.5,45.5,40.5,35.3,25.2

List of recursive maps N.M, where N = length of tag and M = number of mismatches. Note that the ISASColorsNewCPU application will not accept a global tag length between 26 and 34, hence 30.3 is an invalid parameter in this case. See the ISAS documentation for more information.

    chrName_index = /data/isas/hg19/25chr/reference_renamed_chromosomes.txt

Full path to the file containing the list of chromosomes, their index number as interpreted by ISAS and
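Because an invalid length in the ISAS global list only shows up once the pipeline runs, it can help to check the value beforehand. The sketch below is an illustration, not X-MATE code: it assumes the list is written as comma-separated N.M pairs (the separators are not certain from this manual) and that the excluded range of 26 to 34 is inclusive. The function name is hypothetical.

```python
def invalid_isas_global_entries(global_value, low=26, high=34):
    """Return the (length, mismatches) pairs that ISAScolors would reject.

    Assumptions: entries look like "50.5" and are separated by commas;
    lengths from `low` to `high` inclusive are not accepted.
    """
    rejected = []
    for item in global_value.split(","):
        length_text, mismatch_text = item.strip().split(".")
        length, mismatches = int(length_text), int(mismatch_text)
        if low <= length <= high:
            rejected.append((length, mismatches))
    return rejected


print(invalid_isas_global_entries("50.5,45.5,40.5,35.3,25.2"))
print(invalid_isas_global_entries("50.5,45.5,40.5,35.3,30.3"))
```

The first call uses the example value from the configuration above and finds nothing to reject; the second swaps in 30.3, the entry the note above describes as invalid.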
13. tags. To run the SamConverter, you will need to first create a configuration file (see the script write_sam_conversion_config.pl). Below is an example of a SamConversion configuration file; there are no optional parameters.

    ############################################################
    # SAM Conversion configuration file for X-MATE
    # automatically generated using write_sam_configuration_file.pl
    ############################################################
    inputs
    genomes = /data/matching/hg19_fasta/chr1.fa,
    /data/matching/hg19_fasta/chr2.fa,
    /data/matching/hg19_fasta/chr3.fa,
    /data/matching/hg19_fasta/chr4.fa,
    /data/matching/hg19_fasta/chr5.fa,
    /data/matching/hg19_fasta/chr6.fa,
    /data/matching/hg19_fasta/chr7.fa,
    /data/matching/hg19_fasta/chr8.fa,
    /data/matching/hg19_fasta/chr9.fa,
    /data/matching/hg19_fasta/chr10.fa,
    /data/matching/hg19_fasta/chr11.fa,
    /data/matching/hg19_fasta/chr12.fa,
    /data/matching/hg19_fasta/chr13.fa,
    /data/matching/hg19_fasta/chr14.fa,
    /data/matching/hg19_fasta/chr15.fa,
    /data/matching/hg19_fasta/chr16.fa,
    /data/matching/hg19_fasta/chr17.fa,
    /data/matching/hg19_fasta/chr18.fa,
    /data/matching/hg19_fasta/chr19.fa,
    /data/matching/hg19_fasta/chr20.fa,
    /data/matching/hg19_fasta/chr21.fa,
14. File: test500K/test_500K_tags_50mers.gen...mers25.collated
    File: test500K/test_500K_tags_50mers.gen...mers35.collated
    File: test500K/test_500K_tags_50mers.gen...mers40.collated
    File: test500K/test_500K_tags_50mers.gen...mers45.collated
    File: test500K/test_500K_tags_50mers.gen...mers50.collated
    File: test500K/test_500K_tags_50mers.jun...mers25.collated
    File: test500K/test_500K_tags_50mers.jun...mers35.collated
    File: test500K/test_500K_tags_50mers.jun...mers40.collated
    File: test500K/test_500K_tags_50mers.jun...mers45.collated
    File: test500K/test_500K_tags_50mers.jun...mers50.collated
    50 mers: 105548
    45 mers: 24632
    40 mers: 24800
    35 mers: 24322
    25 mers: 34300
    Total GB matched: 0.00908661

Mapping using ISAS should take about 2 hours on a standard blade. The majority of this time will be spent parsing and collating the mapped results. ISAS is many times faster than mapreads for mapping, but many of the processes in X-MATE are bound by input/output performance.

Testing X-MATE: ISAS Bases

If you have ISAS bases installed, you can test X-MATE using the data file xmate_test_dna_500K.tar.gz, available at http://grimmond.imb.uq.edu.au/X-MATE. Once unzipped, this package includes the following files:
• test_xmate_dna_bases_isas.conf: the ISASbases test configuration file
• test_dna_500K.fastq: illum
15. 0K_tags_50mers.junc.40.5.0.collated: all junction matches for 40nt tags
• test_500K_tags_50mers.junc.45.5.0.collated: all junction matches for 45nt tags
• test_500K_tags_50mers.junc.50.5.0.collated: all junction matches for 50nt tags
• test_500K_tags_50mers.start.negative: wiggle plot of tag start sites for the -ve strand
• test_500K_tags_50mers.wiggle.negative: wiggle plot for the -ve strand
• test_500K_tags_50mers.start.positive: wiggle plot of tag start sites for the +ve strand
• test_500K_tags_50mers.wiggle.positive: wiggle plot for the +ve strand

Edit the configuration file so that it refers to the appropriate directories that you have set up on your system. See the Configuration Options section for more details on what each of the parameters does.

Run the script using the following command:

    nohup perl /path/XMate.pl -c test_xmate_500K.conf &

Where path is the full path to XMate.pl, eg:

    nohup perl /data/XMate.pl -c test_xmate_500K.conf &

Testing will run X-MATE on the SOLiD test data in approximately 3 hours using 3 blades, each with 16GB of RAM and 2 Dual Core AMD Opteron(tm) Processor 2218 (4 cores), running RedHat Linux 2.6.18-92.1.17.el5 x86_64. Once the run has completed, the results should be compared to those provided in the test_xmate_rna_colours_mapreads_results folder. Additionally, you can check the matching stats using the script provided with X-MATE and the following command:
16. Fri Sep 3 16:39:07 2010
    PROCESS: collating genome tags mers35
    Fri Sep 3 16:39:41 2010
    PROCESS: junction mapping mers35
    Fri Sep 3 16:43:41 2010
    PROCESS: collating junction tags mers35
    Fri Sep 3 16:43:42 2010
    PROCESS: chopping tag from mers35 to mers30
    Fri Sep 3 16:43:43 2010
    PROCESS: genome mapping mers30
    Fri Sep 3 17:26:44 2010
    PROCESS: collating genome tags mers30
    Fri Sep 3 17:27:18 2010
    PROCESS: junction mapping mers30
    Fri Sep 3 17:31:18 2010
    PROCESS: collating junction tags mers30
    Fri Sep 3 17:31:19 2010
    SUCCESS: recursive mapping is done, collated files are created
    Fri Sep 3 17:31:19 2010
    PROCESS: Collecting Junction mapping data and creating BED file
    Fri Sep 3 17:31:26 2010
    SUCCESS: Created junction BED file
    Fri Sep 3 17:31:28 2010
    PROCESS: creating wiggle plot
    Fri Sep 3 17:31:29 2010
    PROCESS: creating start plot
    Fri Sep 3 17:31:29 2010
    PROCESS: creating wiggle plot
    Fri Sep 3 17:31:30 2010
    PROCESS: creating start plot
    Fri Sep 3 17:31:53 2010
    SUCCESS: All done, enjoy the data

Checking the finished run

Misconfigured config files or interruptions to the server or queue can cause X-MATE to die prematurely; however, there is some error-catching code that can help you to work out what has gone wrong. The steps you should go through to ensure that the run has finished successfully are listed below:
17. All these steps can be automated into what Galaxy calls a Workflow. This is particularly useful if you have multiple data sets to analyze. To create a workflow from the steps generated above, follow the steps below:

[Screenshot: Galaxy History Options menu, including the entry for creating a workflow from the current history.]
18. [Screenshot: UCSC Table Browser opened from Galaxy, with the output format set to BED (browser extensible data) and "Send output to Galaxy" selected, above the "Using the Table Browser" help section. The Galaxy history is empty.]
19. MATE pipeline. There may be some occasions where you might wish to regenerate the wiggles or BED tracks from the collated files. For example, you may have initially generated the wiggles without multi-mapping rescue, but now you wish to generate them with rescue turned on. Rather than remapping, you can simply modify your config file (eg rescue = true) and re-enter the pipeline at the rescue stage using the command:

    nohup restart_at_rescue.pl -c test_500K_tags_50mers.conf &

To make use of this feature, you must keep the collated files and the junction ID files (see Description of output files).

Modifying the pipeline to work with other queues

In order to make this program compatible with other queue managers, you can specify the required qsub command in the configuration file options section. Simply specify the start of the command, eg:

    qsub_command = qsub -l s_rt=48:00:00

(for SGE). Alternatively, you can modify the source code directly. The module to look at is QCMG/BaseClass/Mapping.pm. Till Bayer from the MPI for Evolutionary Biology in Germany has kindly provided instructions on modifying this script to work on SGE systems. In addition to the above configuration value, you can change line 208 to:

    $comm = "qsub -l s_rt=48:00:00 -o $mysh.out -e $mysh.err $mysh > $mysh.id";

In addition to modifying the lines above, line 206, which reads print OUT $comm, should be changed to include the /bin/s
20. X-MATE user manual
Version 1.0, September 2010
Contact: x-mate@expressiongenomics.org
Institute for Molecular Bioscience, The University of Queensland, St Lucia, QLD 4072

License

This software is copyright 2010 by the Queensland Centre for Medical Genomics. All rights reserved. This License is limited to, and you may use the Software solely for, your own internal and non-commercial use for academic and research purposes. Without limiting the foregoing, you may not use the Software as part of, or in any way in connection with, the production, marketing, sale or support of any commercial product or service, or for any governmental purposes. For commercial or governmental use, please contact licensing@qcmg.org.

In any work or product derived from the use of this Software, proper attribution of the authors as the source of the software or data must be made. The following URL should be cited: http://grimmond.imb.uq.edu.au/X-MATE

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Applied Biosystems software components distributed with this package carry their own license agreements, located at the following URLs:
• MaToGff: http://solidsoftwaretools.com/gf/project/matogff/
• GffToSam: http://solidsoftwaretools.com/gf/project/sam/
• Mapreads: http://solidsoftwaretools.com/gf/project/mapreads/

We
21. aps option, there are two files required: the junction length file asks for the concatenated junction library, and the junction length index file asks for the file that decodes the concatenated fasta. Junction libraries for human and mouse genomes can be downloaded from http://grimmond.imb.uq.edu.au/X-MATE

options

    quality_check = false

This parameter allows you to turn the quality checking of tags module on or off. Acceptable values are true or false. true: run quality check; false: do not run quality check.

    run_rescue = true

This parameter allows you to turn the rescue of multi-mapping tags module on or off. Acceptable values are true or false. true: run multi-map rescue; false: do not run multi-map rescue.

    rescue_program = /data/matching/MuMRescueLite.py

This parameter defines the location of the script to be run for multi-map tag rescue.

    rescue_window = 10

This parameter defines the window size used for multi-map tag rescue. The recommended setting for shotgun sequencing data is 10, whereas the recommended setting for CAGE and other disparate data sets is 100.

    map_junction = true

Set this to true when you are mapping RNASeq data sets and would like to map to junction libraries. Set this to false for genomic mapping, or if you'd prefer not to map against junctions.

    map_ISAS = false

Set this to true when you are using ISAS as the mapping engine, otherwise set it to
22. are to be treated as a single mismatch, or 0 if they are not. In the above example, the recursive mapping will match at lengths 50, 45, 40, 35 and 30. For lengths 50 to 40 there will be 5 mismatches allowed, whereas lengths 35 and 30 will allow only 3 mismatches. Valid adjacent errors are not treated as a single mismatch. For a discussion on selecting optimal parameters for analysis, see the section Selecting appropriate parameters.

    genomes = /data/matching/hg19_fasta/chr1.fa,
    /data/matching/hg19_fasta/chr10.fa,
    /data/matching/hg19_fasta/chr11.fa,
    /data/matching/hg19_fasta/chr12.fa,
    /data/matching/hg19_fasta/chr13.fa,
    /data/matching/hg19_fasta/chr14.fa,
    /data/matching/hg19_fasta/chr15.fa,
    /data/matching/hg19_fasta/chr16.fa,
    /data/matching/hg19_fasta/chr17.fa,
    /data/matching/hg19_fasta/chr18.fa,
    /data/matching/hg19_fasta/chr19.fa,
    /data/matching/hg19_fasta/chr2.fa,
    /data/matching/hg19_fasta/chr20.fa,
    /data/matching/hg19_fasta/chr21.fa,
    /data/matching/hg19_fasta/chr22.fa,
    /data/matching/hg19_fasta/chr3.fa,
    /data/matching/hg19_fasta/chr4.fa,
    /data/matching/hg19_fasta/chr5.fa,
    /data/matching/hg19_fasta/chr6.fa,
    /data/matching/hg19_fasta/chr7.fa,
    /data/matching/hg19_fasta/chr8.fa,
    /data/matching/hg19_fasta/chr9.fa,
    /data/matching/hg19_fasta/chrM.fa,
    /data/matching/hg19_fasta/chrX.fa,
    /data/matching/hg19_fasta/chrY.fa, etc

Defines a list of reference sequences (typically chromosomes or unassembled contigs) to map against. These should be
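The recursive map settings described above (tag length, mismatches allowed, and a 1 or 0 flag for whether valid adjacent errors count as a single mismatch) can be unpacked as follows. This is a sketch under assumptions: it takes the value to be comma-separated entries of the form length.mismatches.flag, such as 50.5.0, which matches the example values in this manual but is not confirmed as the exact syntax. MapStep and parse_recursive_maps are hypothetical names.

```python
from typing import List, NamedTuple


class MapStep(NamedTuple):
    length: int            # tag length to map at
    mismatches: int        # mismatches allowed at this length
    adjacent_as_one: bool  # True if valid adjacent errors count as one


def parse_recursive_maps(value: str) -> List[MapStep]:
    """Split an assumed "50.5.0,45.5.0,..." string into MapStep records."""
    steps = []
    for item in value.split(","):
        length, mismatches, flag = (int(part) for part in item.strip().split("."))
        steps.append(MapStep(length, mismatches, bool(flag)))
    return steps


for step in parse_recursive_maps("50.5.0,45.5.0,40.5.0,35.3.0,30.3.0"):
    print(step)
```

With the example string, the five steps cover lengths 50, 45, 40, 35 and 30, with 5 mismatches down to 40nt and 3 mismatches below that, and the adjacent-error flag off throughout.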
23. ategy 3 (10nt iterations, 50nt to 30nt). In all scenarios, 5 mismatches were allowed for tag lengths ranging from 50nt to 40nt, 3 mismatches were allowed for tag lengths ranging from 39nt to 30nt, and 1 mismatch was allowed for tag lengths ranging from 29nt to 25nt. Scenario One is a fragment library with a mode insert size of 54nt. Scenario Two is the same library with the insert size shifted to 39nt. Together these graphs show that there is more benefit from a recursive strategy when the library insert size is smaller than ideal.

The efficiency of the recursive strategy is largely dependent on the median insert size of the RNA fragments. If all fragments are longer than the read length of the tag, then the recursive strategy would only additionally map the 5' end of novel splice junctions and those tags with poor quality 3' ends. Depending on the individual sample and sequencing run, this may or may not yield sufficient additional mapping tags to justify the additional computational time. In contrast, assuming perfect adaptor identification and an ideal sized fragment library, a vector clipping method would use less than 40% of the CPU time of the recursive method for the same yield of mapping tags. Unfortunately, even under the best circumstances adaptor identification is not perfect, and there are additional technical challenges for adaptor identification in color space. Typically, adaptor identification and chopping will yield unde
24. ath

3. If the Makefile has been written successfully (there should be a file called Makefile in the source directory), run the installation with the following commands:

    make
    make install

All required perl modules should now be installed in your system's default perl library location, or in the location specified by INSTALL_BASE above. If the above INSTALL_BASE option is chosen, you may also be required to add the path of the installation directory to @INC using the command:

    export PERL5LIB=$PERL5LIB:/full/path/X-MATE/lib

Where /full/path should be replaced with the path of the X-MATE directory. This command can be added to the .bash_profile or .profile files (depending on the shell) for automatic loading, or it can be added to the default profile for all users.

Installation Instructions: SamConverter

The SamConverter utility is a Java application with dependencies on the Applied Biosystems utility GffToSam and some Java libraries. Specifically, the SamConverter utility relies on the following software:
• GffToSam
    o Binary available at http://solidsoftwaretools.com/gf/project/sam/ (requires registration)
    o Source code distributed with X-MATE for compilation if required (see below)
• Three java archive libraries distributed with X-MATE:
    o com.apldbio.aga.common.jar (courtesy Life Technologies)
    o MaToGff.jar (courtesy Life Technologies)
    o SamConverter.jar (developed by
25. can be created from the packaged fasta format junction libraries available for download from http://grimmond.imb.uq.edu.au/X-MATE. For each junction length that you would like to map against, you must build the ISAS indexes. This can be done by following the instructions in your ISAS manual. Once built, the junction libraries are treated as a different database within ISAS (eg a different genome to map against). Because of this, their names must be preceded with "chr". For example, the junction library containing 20 bases from the donor exon and 20 bases from the acceptor exon should be called chr20.fa. This can then be indexed using the ISAS chr 20 20 command to create an index file 20 20 N a bin. Once all junction libraries are created and indexed in ISAS, you can include the path to these in your configuration file (see Configuration Options for more details).
A note on ISAS performance
During testing on our system, we found ISAS to map between 3 and 50 times faster than mapreads, depending on the parameters chosen (data not shown). In particular, the ISAS parameter filter N has a considerable influence on performance. Set to 0, ISAS will map exhaustively using a non-heuristic approach; this guarantees to find all alignments for a tag using a certain mismatch threshold. Set to 10 (maximum), ISAS will use a statistical approach to filter out of the mapping step any reads likely to multi-map, thereby increa
26. custom exon junction libraries, you need the coordinates and sequences of your exon junctions. For some species you may be able to download these from the UCSC genome browser Tables pages at http://genome.ucsc.edu/cgi-bin/hgTables?command=start. X-MATE makes no requirement for the minimum or maximum number of nucleotides on the donor or acceptor side of the junction, and there is no requirement to keep these lengths the same. However, it may be beneficial for your own analysis to ensure that these are symmetrical, so that when performing an analysis you can be sure of the minimum overlap of a tag on the junction sequence; ie if you require a minimum of 10nt in a 50nt tag to cross an exon junction, then the donor and acceptor sequences should each be 40nt long. Once the sequences and coordinates have been assembled (ensuring that the unique IDs are in the same format as those provided above), there are two scripts provided to format the libraries the way X-MATE expects them. The first script is concatenate_sequences.pl, which converts the multi-fasta format into a single concatenated fasta format, eg
    concatenate_sequences.pl -f <fasta file> -o <output file> -h <header>
The second is make_index.pl, which creates the index file required for decoding the matches to the concatenated junction files, eg
    make_index.pl <fasta file> > <output file>
How to make ISAS junction libraries
Junction libraries for ISAS
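The relationship between tag length, flank length and minimum overlap in the 10nt/50nt/40nt example can be checked with a few lines of arithmetic. This is a sketch in Python under the assumption of symmetrical donor and acceptor flanks; the function names are hypothetical and do not come from the X-MATE scripts.

```python
def flank_length(tag_length: int, min_overlap: int) -> int:
    """Flank (donor or acceptor) length that forces at least
    `min_overlap` nt of a `tag_length` tag across the junction."""
    if not 0 < min_overlap < tag_length:
        raise ValueError("min_overlap must be between 1 and tag_length - 1")
    return tag_length - min_overlap


def guaranteed_overlap(tag_length: int, flank: int) -> int:
    """Minimum number of nt a tag must place across the junction when
    each side of the junction sequence is `flank` nt long."""
    # A tag no longer than one flank can sit entirely on one side.
    return max(0, tag_length - flank)


if __name__ == "__main__":
    # The example from the text: 10nt minimum overlap for a 50nt tag.
    print(flank_length(50, 10))
    print(guaranteed_overlap(50, 40))
```

Note that a tag no longer than a single flank has no guaranteed overlap at all, which is one reason to choose the flank length per tag length.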
27.
• Object::InsideOut
• Devel::StackTrace
• Class::Data::Inheritable
The alignment section of this pipeline is dependent upon the mapreads tool. This tool and its installation instructions are available from http://solidsoftwaretools.com/gf/project/mapreads (requires registration, which is free). Optionally, in place of mapreads, the alignment software ISAS can be used, and can be licensed from http://www.imagenix.com. Finally, you will need a genome against which to map. The program expects one file per chromosome, with the filename format name.fa. Genomes can be downloaded from the UCSC genome browser website at http://hgdownload.cse.ucsc.edu/downloads.html.
Installation instructions: X-MATE
The instructions given below in courier font are examples of the commands needed to carry out the installation. X-MATE source is downloaded as a single gzipped tar file.
1. Move the tarball to the destination directory, navigate to your chosen directory, and decompress X-MATE:
    mv X-MATE.tar.gz /home/software
    cd /home/software
    tar xzf X-MATE.tar.gz
    cd X-MATE
2. Create the Makefile by running the command:
    perl Makefile.PL
If you do not have administrator privileges on your system and you would like to install X-MATE in your home directory, you can optionally do this by providing the argument INSTALL_BASE=/full_path, eg
    perl Makefile.PL INSTALL_BASE=/full_p
28. elative to the expressed gene. For genomic data, set the strand to 0.

raw_tag_length 35
This parameter defines the longest length of the tags contained in the csfasta file.

genome mapping (use this section if mapping using Mapreads)

mask 11111111111111111111111111111111111
This setting allows you to ignore particular bases in the tag when computing the number of mismatches (1 = consider this base, 0 = do not consider this base). The length of the mask should equal the length of the longest tags.

max_multimatch 10
Defines the maximum number of positions to be reported for multi-mapping tags. The higher this number, the more disk space is required to store the data and the slower the program will run. The recommended size for most applications is 10. For interrogating repeat sequences such as retrotransposable elements, this value may need to be set higher.

recursive_maps 50 5 0, 45 5 0, 40 5 0, 35 3 0, 30 3 0
These parameters define the lengths at which matching will occur recursively, the number of mismatches permissible between the tag and the reference sequence, and whether or not to treat valid adjacent errors as a single mismatch. These are comma separated parameters with the format length, mismatches, valid adjacent: length defines the length of the tag to match at, mismatches defines the number of mismatches allowed, and valid adjacent is set to 1 if valid adjacent errors
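To make the structure of the recursive maps setting concrete, the following Python sketch parses a comma separated list of (length, mismatches, valid adjacent) triples. It is an illustration only and not X-MATE's own parser: the names `MapStep` and `parse_recursive_maps` are hypothetical, and the separator inside each triple is an assumption (a dot by default, passed in as `field_sep`) because the extracted text does not preserve it.

```python
from typing import List, NamedTuple


class MapStep(NamedTuple):
    length: int           # tag length to match at
    mismatches: int       # mismatches allowed at this length
    valid_adjacent: bool  # treat valid adjacent errors as one mismatch


def parse_recursive_maps(value: str, field_sep: str = ".") -> List[MapStep]:
    """Parse comma separated triples such as '50.5.0,45.5.0'.

    `field_sep` is an assumed separator within each triple.
    """
    steps = []
    for chunk in value.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        parts = chunk.split(field_sep)
        if len(parts) != 3:
            raise ValueError(f"expected three fields in {chunk!r}")
        length, mismatches, adjacent = (int(p) for p in parts)
        if adjacent not in (0, 1):
            raise ValueError(f"valid adjacent must be 0 or 1 in {chunk!r}")
        steps.append(MapStep(length, mismatches, bool(adjacent)))
    return steps


if __name__ == "__main__":
    for step in parse_recursive_maps("50.5.0,45.5.0,40.5.0,35.3.0,30.3.0"):
        print(step)
```

If the real configuration uses a different separator inside each triple, only the `field_sep` argument needs to change.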
29. [Figure: Galaxy screenshot showing the Operate on Genomic Intervals tools (Concatenate, Base Coverage, Complement, Cluster, Join, Get flanks, Fetch closest feature, Profile Annotations) and a history containing Filter, Gene BED To Exon/Intron/Codon BED, Join and Concatenate steps on the refGene data.]
30. false, and mapreads will be used (default).

base_space false (default)
Both Mapreads and ISAS have the functionality to map base space encoded sequencing runs. We have extended this functionality into X-MATE. If your data is encoded in base space, set this parameter to true. Note that for mapreads the valid input file format is FASTA, and for ISAS it is FASTQ. For color space, always use CSFASTA format. The default for X-MATE is to map in color space.

Selecting appropriate parameters
Understanding the two major parameters (the number of mismatches allowed at every tag length, and the number of nucleotides chopped at each iteration), as well as the smallest mappable size for the genome, is critical to maximizing the efficiency and accuracy of the recursive mapping strategy. How many mismatches to allow at each length is critical to both the speed and accuracy of tag mapping. The more mismatches allowed, the slower the program will run; however, a low number of mismatches may fail to capture mappable tags with sequencing artifacts. Additionally, large numbers of mismatches relative to the tag length will create spurious matching events and increase the level of noise in your results. For RNAseq data, ideally the proportion of tags mapping to exons should be relatively constant regardless of the length, and studying this for your genome of interest will provide guidelines as to what level of mismatching is acceptable for your system. For mouse
31. filter level, the more exhaustive the search. A filter level of 0 (default) is exhaustive, while a filter level of 10 is extremely fast. Please see the ISAS documentation for more information.

verbose 1
Output format for ISAS. Please use verbose 1 (normal output). Changing this may affect the downstream collation of results.

junction mapping

junction_library data matching junctions hg junctions 40 fa cat, data matching junctions hg junctions 35 fa cat, data matching junctions hg junctions 30 fa cat
junction_index data matching junctions hg junctions 40 fa index, data matching junctions hg junctions 35 fa index, data matching junctions hg junctions 30 fa index
These parameters define the junction libraries and their associated index files that are used in X-MATE. The ability to specify the length of the junction library used for each of the different tag lengths means that you have complete control over the stringency of the junction matching. In this example, by using the 40mer libraries (40nt from the donor exon concatenated with 40nt from the acceptor exon) with the 50-40mer tag lengths, we are requiring a minimum of 10nt of the tag to overlap the exon-exon boundary. For the 35-30mer tag lengths we are requiring an overlap of 5nt. For obvious reasons, the number of nucleotides overlapping should be greater than the number of mismatches allowed in the tag. For every tag length specified in the recursive m
32. h line, and a newline after the actual command needs to be inserted.

For LSF: Xuanzhong Li from the Children's Hospital Boston, USA, has kindly provided instructions on modifying this script to work on LSF systems. Any instances of the qsub command need to be replaced with the bsub command. All other parameters appear to work fine for both PBS and LSF queue managers. For example, you could use the qsub_command option like:
    qsub_command bsub -l walltime=24:00:00

Optimizing performance on your cluster
The entire X-MATE pipeline, including mapreads, is very I/O intensive, and depending on the cluster setup users may find that it needs to be modified for optimal performance. For example, those using NFS filesystems may find that NFS will time out if too much is asked of it. For these systems an inefficient but necessary throttle may be to request two or more CPUs per mapping job in the configuration file:
    qsub_command qsub -l walltime=48:00:00,ncpus=2

Creating SAM files
SAM (Sequence Alignment/Map) files are now largely the default file format for mapped data storage and interchange. SAM files can be created from X-MATE runs using the samConverter.jar Java archive file distributed with X-MATE. For installation instructions, please see the section Installation Instructions: SamConverter. The SamConverter will create entries in SAM format for both genomic mapping tags as well as junction
33. ificity of mapping
Checking the mapping statistics
Cleaning up after a mapping run
Writing SAM Converter configuration files
Assigning tags or coverage counts to gene models
Junction libraries
Description of the available junction libraries
How to create your own junction libraries
How to make ISAS junction libraries
A note on ISAS performance
Supplementary figure 2
List of Figures
Figure 1: The X-MATE recursive mapping pipeline. The pipeline consists of 4 major components. 1) The optional tag quality module filters tags based on the quality values for each basecall. 2) The alignment module attempts to align tags first to the genome, and then to a library of known exon junction sequences (if mapping RNA-Seq data). If a tag fails to align, then the tag is truncated and the process is repeated. 3
34. ified chromosome. These files are collated into the wiggle files for the negative and positive strand respectively.

chrN <expName> for_wig negative positive start
Working file for wiggle plot generation for the specified chromosome. These files are collated into the start files for the negative and positive strand respectively.

<expName> start positive
This file contains the start position of all mapped tags mapping on the sense (positive) strand. This file should be kept.

<expName> start negative
This file contains the start position of all mapped tags mapping on the antisense (negative) strand. This file should be kept.

<expName> wiggle positive
This file contains the tag count at each nucleotide mapping on the sense (positive) strand. This file should be kept.

<expName> wiggle negative
This file contains the tag count at each nucleotide mapping on the antisense (negative) strand. This file should be kept.

<expName> expect junc BED
This is a BED file of all tags mapped to junctions on the expected strand (eg the strand in the same sense as the library generation protocol). This file should be kept.

<expName> unexpect junc BED
This is a BED file of all tags mapped to junctions on the unexpected strand (eg the strand in the opposite sense to the library generation protocol). This file should be kept. A large amount of data in this file is indicative of library generation problems.

Re-entering the X
35. in the order of 100-300MB.
5. Finally, check that the files upload into the UCSC genome browser without errors. To minimize file size, the wiggle plots and BED tracks can be gzipped. Sometimes the wiggle/starts files are too large to upload even when gzipped, and you may need to apply a post-pipeline filter to the results (see Post X-MATE scripts).

Description of output files
X-MATE produces a large number of files in the specified output directory, most of which are temporary working files and can be deleted. Unless the total storage space on your cluster is an issue, it is probably best to wait until the pipeline has finished before deleting files. This allows you to re-enter the mapping pipeline at different points without needing to start the mapping process from scratch, which can save significant amounts of time in the event of a power outage or similar computational catastrophe. This section takes you through the inputs and outputs generated by running X-MATE on the test data, the contents of the output files generated from each of the modules, and whether or not these should be stored or deleted.

<expName> log
Log file for the run. This file should be kept as a record of the mapping run, and inspected to investigate any problems which may have occurred during the run.

<expName> mers <N> csfasta
The csfasta file for each recursive run. One of these will exist for each recursive mapping strategy.

<expName> mers l
36. ina DNA testing data. Modify the configuration file to correctly point X-MATE to the required locations of the ISASbases binary, the output directory and the input fastq file. You will also need to create ISASbases indices following the instructions in the ISAS user manual, and point X-MATE to these indices in the configuration file. Once this is complete, launch X-MATE. If the run completes successfully, download the results package test xmate dna bases isas results tar gz and compare your results with those in the results folder. You can also run the script check_matching_stats_XMATE.pl; the output should be equivalent to:
    Checking directory
    Checking genome and junction matches
    File test isas dna 500K geno mers25 collated
    File test isas dna 500K geno mers30 collated
    File test isas dna 500K geno mers35 collated
    File test isas dna 500K geno mers40 collated
    File test isas dna 500K geno mers45 collated
    45 mers 423327
    40 mers 12300
    35 mers 11781
    30 mers 1242
    25 mers 4436
    Total GB matched 0.02010221

Configuration file
The configuration file is a text file containing all the required parameters to run X-MATE. Example configuration files are available in the src folder in the X-MATE distribution. There are eight example configurations, corresponding with the following types of X-MATE runs:

Configuration file    Run Type    Mapping Engine    Encoding
xmate dna colours
37. ith zero mismatches (50 0, 45 0, 40 0, 35 0, 30 0 and 25 0); however, these mapping strategies also produce the lowest number of overall uniquely mapped tags. To minimize the computational time used for this approach, the number of nucleotides to be clipped at each iteration should be greater than or equal to the number of mismatches allowed at any iteration. We typically remove 5nt at a time for SOLiD sequencing data for two reasons. First, the sequencing chemistry of SOLiD is performed with five different primers, and the number of cycles will determine the length of the tag in multiples of five. For example, to generate a 50mer sequence, the data of ten cycles from each of the five primers is added together. Typically each cycle has roughly the same error profile as the corresponding cycles from other primers (ie the third cycle on primer one will have the same error rate as the third cycle on primer 5), and the error rate increases as the cycle number increases (see figure 3). This means that error rates typically jump in multiples of 5nt, so excluding 5nt at a time will minimize the effect of sequencing error on the mapping results. Obviously this consideration does not apply to non-SOLiD data sets. Secondly, as every iteration takes CPU time, the more iterations that are done, the higher the cost of the additional mapping tags. Larger iterations decrease the run time, but they also decrease the sensitivity of the strategy to detect tags that lie acr
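The clipping rule above (clip at least as many nucleotides per iteration as the largest mismatch allowance) and the resulting series of tag lengths can be sketched as follows. This is a Python illustration only; `tag_length_schedule` and `step_is_efficient` are hypothetical helper names, not functions in X-MATE.

```python
from typing import Iterable, List


def tag_length_schedule(start: int, stop: int, step: int) -> List[int]:
    """Tag lengths visited when clipping `step` nt per iteration,
    from `start` down to `stop` inclusive."""
    if step <= 0:
        raise ValueError("step must be positive")
    if stop > start:
        raise ValueError("stop must not exceed start")
    return list(range(start, stop - 1, -step))


def step_is_efficient(step: int, mismatches: Iterable[int]) -> bool:
    """True when the clip size is greater than or equal to the largest
    number of mismatches allowed at any iteration."""
    return step >= max(mismatches)


if __name__ == "__main__":
    # 5nt iterations from 50nt to 25nt, as typically used for SOLiD data.
    print(tag_length_schedule(50, 25, 5))
    print(step_is_efficient(5, [5, 5, 5, 3, 3, 1]))
```

Smaller steps give more lengths to try (more CPU time), while larger steps give fewer, which is the trade-off described above.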
38. mapreads conf    DNA    Mapreads    colorspace
xmate dna bases mapreads conf    DNA    Mapreads    basespace
xmate dna colours isas conf    DNA    ISAS    colorspace
xmate dna bases isas conf    DNA    ISAS    basespace
xmate rna colours mapreads conf    RNA    Mapreads    colorspace
xmate rna bases mapreads conf    RNA    Mapreads    basespace
xmate rna colours isas conf    RNA    ISAS    colorspace
xmate rna bases isas conf    RNA    ISAS    basespace

These configuration files should serve as a starting point for any mapping run. Parameters can be modified in these as required. The following configuration parameters are available in X-MATE.

Configuration options

Standard Parameters

exp_name test 500K tags 50mers (required)
Set the experiment name with this parameter.

output_dir data mapping output (required)
Specify the directory where X-MATE will write all output files.

raw_csfasta data unmappedData test csfasta (required)
Specify the location of the raw csfasta file.

raw_qual data raw tag20000 qual
The full path of the file containing the quality values.

expect_strand
This defines the strandedness of the data. For example, libraries made with the SREK protocol or other direct ligation protocols will have tags that are sequenced in the sense strand relative to the expressed gene. Libraries made with the SQRL protocol will have tags that are sequenced in the antisense r
39. ning sense matches
    -a name of BED file containing antisense matches
    -o name of outfile
For example, to assess the output of the test data set, use the command:
    assess_junctions_for_directionality.pl -s test 500K tags 50mers expect junction BED -a test 500K tags 50mers unexpect junction BED -o test output
Ideally, more than 99.5% of tags should be in the expected strand. A lower value indicates a problem with the mapping parameters used or, less likely, a problem with the cDNA library generation.

Checking the mapping statistics
We have provided a script to count the mapping statistics and calculate the coverage: check_matching_stats_XMATE.pl
    USAGE: perl check_matching_stats_XMATE.pl -p <full path of the directory to be checked>
This script can be run after each mapping run, and is a good means to check the quality of either the sequencing run or the mapping strategy.

Cleaning up after a mapping run
If you would like to remove many of the unnecessary files after a mapping run, simply run the script clean_up_XMATE_output_directories.pl
    USAGE: perl clean_up_XMATE_output_directories.pl -p <full path of the directory to be cleaned>

Writing SAM Converter configuration files
You can rapidly produce a configuration file for the samConverter utility using this script. Simply provide the location of the mapping run's output directory and some other relevant parameters such as
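The 99.5% directionality guideline above amounts to a simple ratio of junction hits on the expected strand to all junction hits. The Python sketch below shows that calculation on BED-style lines. It is not the assess_junctions_for_directionality.pl script; the function names are hypothetical, and it assumes one junction hit per BED record and that lines starting with "track", "browser" or "#" are headers.

```python
from typing import Iterable


def count_bed_records(lines: Iterable[str]) -> int:
    """Count data records, skipping blank lines and assumed header lines."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        count += 1
    return count


def expected_strand_percent(expected: int, unexpected: int) -> float:
    """Percentage of junction hits that fall on the expected strand."""
    total = expected + unexpected
    if total == 0:
        raise ValueError("no junction hits to assess")
    return 100.0 * expected / total


if __name__ == "__main__":
    sense = ["track name=expect", "chr1\t100\t180", "chr2\t50\t130"]
    antisense = ["track name=unexpect"]
    pct = expected_strand_percent(count_bed_records(sense),
                                  count_bed_records(antisense))
    print(f"{pct:.1f}% of junction hits on the expected strand")
```

In practice the two inputs would be read from the expected and unexpected junction BED files produced by a mapping run.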
40. nome, and then optionally against a library of known exon junctions for RNAseq data sets (hg18, hg19 and mm9 are currently supported). Tags that fail to map to the genome or junctions are chopped to user defined lengths and the genomic mapping is restarted. In this way, tags that have adaptor sequence or poor quality ends are recovered at their longest length. The number of mismatches between the reference and tag is user defined.

Figure 1: The X-MATE recursive mapping pipeline. The pipeline consists of 4 major components. 1) The optional tag quality module filters tags based on the quality values for each basecall. 2) The alignment module attempts to align tags first to the genome, and then to a library of known exon junction sequences (if mapping RNA-Seq data). If a tag fails to align, then the tag is truncated and the process is repeated. 3) The optional tag rescue module uses information derived from both single mapping and multi-mapping tags to uniquely place multi-mapping tags. 4) Finally, UCSC genome browser compatible wiggle plots and BED files are generated. A final optional step creates SAM files.

Part 3: Multi-mapping tag rescue (optional)
For most downstream applications, tags are only informative if they can be placed uniquely within a genome. Tags that align to multiple places within a genome make up a sizeable proportion of transcriptome derived tags, primarily from the inherent redundancy of the genome b
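The truncate-and-retry logic of the alignment module (Part 2, component 2 in Figure 1) can be sketched in a few lines of Python. This is a simplified illustration and not X-MATE's implementation: `map_recursively` is a hypothetical name, `try_map` is a stand-in for a call to mapreads or ISAS (genome first, then junctions), and trimming from the 3' end of the tag is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple


def map_recursively(
    tag: str,
    lengths: Sequence[int],
    try_map: Callable[[str], Optional[str]],
) -> Optional[Tuple[int, str]]:
    """Try to map `tag` at each length in `lengths` (longest first).

    Returns (length, hit) for the first, and therefore longest,
    length that maps, or None if the tag never maps.
    """
    for length in lengths:
        if length > len(tag):
            continue  # tag is shorter than this iteration's length
        hit = try_map(tag[:length])  # assumed: keep the 5' portion
        if hit is not None:
            return length, hit
    return None


if __name__ == "__main__":
    reference = "ACGTACGTAC"

    def toy_mapper(seq: str) -> Optional[str]:
        # Toy exact-match "aligner" standing in for the real engine.
        pos = reference.find(seq)
        return f"ref:{pos}" if pos >= 0 else None

    # The last four bases mimic adaptor or low quality sequence.
    print(map_recursively("ACGTACGGGG", [10, 8, 6], toy_mapper))
```

Because the lengths are tried longest first, each tag is reported at the longest length at which it maps, which matches the behaviour described above.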
41. of tags assigned to the gene models, and subsequently very low correlations between array data and sequence data.
[Figure: Galaxy screenshot of the UCSC Main table browser, with clade Mammal, genome Human, group Genes and Gene Prediction Tracks, track RefSeq Genes, table refGene and region genome selected.]
42. [Figure: Galaxy screenshots showing the analysis history and the Create Workflow dialog for the workflow named RNAseq to Gene Counts, with output item 11, Group on data 10 (16,354 lines, format tabular, database hg18, Group by c8, sum c4).]
43. orkflow is similar for both stranded and unstranded (eg genomic) data, with the exception that optional junction libraries may be mapped against for RNA-Seq data sets. Also optionally, X-MATE has been extended to allow for the use of other mapping engines, specifically the rapid alignment system ISAS (www.imagenix.com). The following sections describe each of the parts of the X-MATE workflow as shown in Figure 1.

Part 1: Quality checking of the tag (optional)
Depending on the downstream applications of the matched data, the quality of individual tags may need to be assessed before their inclusion in the mapping pipeline. To accommodate this, we have provided an optional tag quality module that assesses the tags by the number of basecalls with PHRED scores of less than 10. Tags that pass the QC are fed into the recursive alignment module. If this option is disabled, all tags are passed to the alignment module.

Part 2: Recursive alignment to the human or mouse genome
Alignment of the short tags to the reference genome is done using mapreads (http://solidsoftwaretools.com/gf/project/mapreads) or ISAS (http://www.imagenix.com). Both algorithms are specifically designed for the rapid mapping of data from the Applied Biosystems SOLiD system (ie color space data), although base space data sets (fastq file format for ISAS and fasta format for mapreads) can also be mapped using X-MATE. Tags are first matched against all chromosomes of the reference ge
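The quality check described in Part 1 counts basecalls with PHRED scores below 10. A minimal Python sketch of such a filter is shown below. The text does not state how many low quality basecalls are tolerated, so `max_low_quality` is a hypothetical parameter, and `passes_quality` is an illustrative name and not the X-MATE module itself.

```python
from typing import Sequence


def passes_quality(
    qualities: Sequence[int],
    max_low_quality: int,
    threshold: int = 10,
) -> bool:
    """Return True when the tag has at most `max_low_quality`
    basecalls with a PHRED score below `threshold`."""
    low = sum(1 for q in qualities if q < threshold)
    return low <= max_low_quality


if __name__ == "__main__":
    tag_quals = [30, 28, 9, 25, 4, 31, 27]
    # Two basecalls fall below PHRED 10 in this toy example.
    print(passes_quality(tag_quals, max_low_quality=2))
    print(passes_quality(tag_quals, max_low_quality=1))
```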
44. [Figure: Galaxy screenshot confirming that job 11, Group on data 10 (Group by c8, sum c4), was successfully added to the queue, alongside the Join, Subtract and Group tools and the history pane.]
45. oss exon junctions. Figure 3 shows the effect of alternate mapping strategies on the computational time and the proportion of mappable tags, based on two different scenarios: an ideally sized mRNA library and a smaller than ideal mRNA library.

[Figure 3 panels: Scenario one (larger sized fragment library) and Scenario two (smaller sized fragment library); run time in minutes and cumulative run time, and proportion of mappable tags and cumulative mappable tags, each plotted against length of tag.]

Figure 3: How alternate mapping strategies affect the yield of mappable tags and the computational run time. In all graphs, red lines represent Strategy 1 (2nt iterations, 50nt-30nt), blue lines represent Strategy 2 (5nt iterations, 50nt-25nt), and green lines represent Str
46. please download the testing data xmate test rna 500K tar gz and testing results test xmate rna colors mapreads results tar gz, available from http://grimmond.imb.uq.edu.au/X-MATE. Please also download the hg19 version of the human genome, available from http://hgdownload.cse.ucsc.edu.
The testing data includes:
• test xmate rna colors mapreads conf: the Mapreads test configuration file
• test xmate rna colors isas conf: the ISAS test configuration file (see next section)
• test 500K tags 50mers csfasta: the SOLiD testing data
The testing results folder includes the wiggle plots, start files, collated files and junction BED files that were generated from the testing data on our system:
• test 500K tags 50mers expect junc BED: junction BED file for UCSC genome browser
• test 500K tags 50mers unexpect junc BED: antisense matches to junctions
• test 500K tags 50mers geno 30 3 0 collated: all genomic matches for 30nt tags
• test 500K tags 50mers geno 35 3 0 collated: all genomic matches for 35nt tags
• test 500K tags 50mers geno 40 5 0 collated: all genomic matches for 40nt tags
• test 500K tags 50mers geno 45 5 0 collated: all genomic matches for 45nt tags
• test 500K tags 50mers geno 50 5 0 collated: all genomic matches for 50nt tags
• test 500K tags 50mers junc 30 3 0 collated: all genomic matches for 30nt tags
• test 500K tags 50mers junc 35 3 0 collated: all genomic matches for 35nt tags
• test 50
47. plots and directly represent the number of times an individual nucleotide has been seen in the sequencing data. BED files depict hits to junction sequences and graphically display exon combinatorics. In addition, plots containing only the start sites of tags are included to facilitate tag counting applications. Optionally, SAM (Sequence Alignment/Map) files can also be created using the SAM Converter package released with X-MATE (see section Creating SAM files for a description of how to install and use this module).

Availability
All source code, documentation and associated files described in this manual are freely available for download from http://grimmond.imb.uq.edu.au/X-MATE.

Requirements
This pipeline is written predominantly in perl (with some optional python, Java and C++ thrown in for good measure) and requires version 5.8.8 of perl or later, python 2.4 or later, Java 1.5 or later, and a recent version of g++. It is designed to run in a unix environment with a PBS queue manager, although PBS is not required for SAM conversion and ISAS mapping runs. The scripts can be modified to work with an LSF or SGE manager. It is not recommended to run this pipeline on a system without access to a cluster, due to the large computational requirements of mapping to mammalian genomes; however, the scripts could potentially be modified to do this. Required perl modules are available in CPAN:
• Parallel::ForkManager
• Path::Class
48. r the least stringent conditions approximately 60% of the tags that will map under a recursive method. On top of this, the SOLiD system can use Internal Adaptor Blockers, which prevent the ligation of sequencing probes to that region. This causes a drop in accurate base calling which is not just based on the quality of base calls, and under these circumstances successful adaptor identification can drop to just over 20%. Whilst these blockers were an optional reagent in SOLiD V2 chemistry, they are now premixed, and therefore not optional, on the V3 and V4 plates. Ideally, neither a recursive method nor an adaptor identification method would be required at all if we could ensure that the RNA fragment size was always going to be larger than the read size. For microRNA populations, where the mode size is approximately 22nt, this is simply not possible. Due to technical limitations on the maximum insert size possible in an emulsion PCR (ePCR), and the strong amplification bias towards small fragments in ePCR, this is not always achievable for fragmented RNA libraries either, even those that have been size selected prior to ePCR. The primary motivation behind the recursive mapping method was to maximize the number of mapping tags from every sequencing run. The cost of sequencing reagents is considerably more than the cost of server time, so gaining additional depth (between about 1.6 and 3 times the tags mapping at the longest length) represents good value
49. Junction libraries

Description of the available junction libraries

Each available exon junction library contains two components. The first is the concatenated FASTA file of the exon junction sequences, and the second is the index file that details the genomic coordinates of the exonic sequences. In both available sets, the genomic coordinates form the unique ID of the junction and are defined as: chromosome, first base of intron, last base of intron, strand.

Note: All junction sequences are provided in the sense orientation (i.e. 5' to 3') and all coordinates are zero based. This means that hits to these libraries should be predominantly on the one strand, and which strand will depend on the laboratory-based library preparation method used.

For the packaged junction libraries, different lengths of the donor and acceptor sequences are provided to allow f
50. [Screenshot: Galaxy "Filter and Sort" tool applied to the dataset "UCSC Main on Human: refGene (genome)" in BED format. Legible notes: when upstream or downstream bases are added, features close to the beginning or end of a chromosome may be truncated to avoid extending past the edge of the chromosome; in filter conditions, double equal signs must be used for "equal to".]
51. [Log output from a recursive X-MATE run]
Fri Sep 3 11:42:39 2010  PROCESS  welcome to our X-MATE
Fri Sep 3 11:42:39 2010  PROCESS  genome mapping mers50
Fri Sep 3 12:31:41 2010  PROCESS  collating genome tags mers50
Fri Sep 3 12:32:33 2010  PROCESS  junction mapping mers50
Fri Sep 3 12:39:33 2010  PROCESS  collating junction tags mers50
Fri Sep 3 12:39:35 2010  PROCESS  chopping tag from mers50 to mers45
Fri Sep 3 12:39:36 2010  PROCESS  genome mapping mers45
Fri Sep 3 13:49:38 2010  PROCESS  collating genome tags mers45
Fri Sep 3 13:50:19 2010  PROCESS  junction mapping mers45
Fri Sep 3 14:00:19 2010  PROCESS  collating junction tags mers45
Fri Sep 3 14:00:21 2010  PROCESS  chopping tag from mers45 to mers40
Fri Sep 3 14:00:22 2010  PROCESS  genome mapping mers40
Fri Sep 3 15:51:23 2010  PROCESS  collating genome tags mers40
Fri Sep 3 15:52:03 2010  PROCESS  junction mapping mers40
Fri Sep 3 16:05:03 2010  PROCESS  collating junction tags mers40
Fri Sep 3 16:05:04 2010  PROCESS  chopping tag from mers40 to mers35
Fri Sep 3 16:05:05 2010  PROCESS  genome mapping mers35
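The log above steps from 50-mers down to 35-mers in 5-base decrements, with unmatched tags truncated and re-mapped at each step. The following is a minimal Python sketch of that control flow, not the X-MATE implementation. The function names and the `aligns` callback are hypothetical stand-ins for a real aligner call, and truncation is assumed to remove bases from the end of the tag.

```python
def recursive_lengths(start, step, minimum):
    """Return the tag lengths tried by a recursive run, longest first."""
    if step <= 0:
        raise ValueError("step must be positive")
    lengths = []
    length = start
    while length >= minimum:
        lengths.append(length)
        length -= step
    return lengths


def recursive_map(tags, lengths, aligns):
    """Map tags at each length in turn; unmatched tags are truncated and retried.

    `aligns(sequence)` is a hypothetical callback standing in for an aligner;
    it returns True when the (possibly truncated) sequence maps.
    Returns (mapped, remaining): a dict of tag -> length at which it mapped,
    and the list of tags that never mapped.
    """
    mapped = {}
    remaining = list(tags)
    for length in lengths:
        still_unmatched = []
        for tag in remaining:
            # Assumption: truncation keeps the first `length` characters.
            if len(tag) >= length and aligns(tag[:length]):
                mapped[tag] = length
            else:
                still_unmatched.append(tag)
        remaining = still_unmatched
    return mapped, remaining
```

With `recursive_lengths(50, 5, 35)` the schedule matches the lengths seen in the log (50, 45, 40, 35); the tags left in `remaining` after one length play the role of the non-matching tags that feed the next, shorter run.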
52. rly with your ISAS installation. Note that you will first need to build the reference sequence indexes for ISAS by following the instructions in the ISAS user manual, as well as the junction libraries for ISAS by following the instructions in the section on creating junction libraries. Once these are created, you can use the same test data set as for the mapreads testing above, but use the configuration file:
• test_xmate_rna_colors_isas.conf

Edit the configuration values in the Standard Parameters section to direct X-MATE to the correct locations for files and output directories. Run the XMate.pl script as follows:

nohup perl XMate.pl -c test_isas_500K.conf &

Once the process has completed, download the ISAS colors testing results (test_xmate_rna_colors_isas_results.tar.gz), available at http://grimmond.imb.uq.edu.au/X-MATE. Compare the output with the data in the test_xmate_rna_colors_isas_results folder. Note that although this is the same data set as the mapreads tests, and both mapreads and ISAS (at a filter level of 0; see the ISAS documentation) guarantee to find all alignments within a mismatch range, the results will differ slightly due to the minor differences in mapping strategy. Running the script check_matching_stats_XMATE.pl on this directory will produce the output:

Checking directory test500K
Checking genome and junction matches
File test500K test 500K tags 50mers gen
File test500K
53. [Screenshot: Galaxy "Operate on Genomic Intervals" Concatenate tool, with first query "9: Join on data 5 and data 7" and second query "8: Join on data 4 and data 6". Legible tip: if a query does not appear in the pulldown menu, it is not in interval format; use "edit attributes" to set the chromosome, start, end and strand columns.]
54. sing overall mapping speed, however at a slight cost to mapping specificity and sensitivity (Supplementary figure 4).

[Figure 4 plot: average base quality (y-axis, 0 to 50) against base position (x-axis, 0 to 45).]

Figure 4: Quality profile for a recursive mapping run on approximately 10,000,000 base-space reads (Illumina; SRA accession ERR000099) mapped using ISAS. Each line represents a mapping run for a single recursive iteration, N.M (N = read length, M = mismatches). Although the tags are mapped at progressively shorter lengths, the quality values are reported for the complete, un-truncated length of every tag. Tags that map at shorter lengths contain, on average, lower quality bases towards their 5' end.
55. [Table of contents, continued]
Testing X-MATE ISAS Bases
Configuration file
Configuration options
Selecting appropriate parameters
How to map to the genome and junctions simultaneously
How to use X-MATE to perform non-recursive mapping
X-MATE Functionality and Output Files
[unreadable entry]
Checking the finished run
Description of output files
Re-entering the X-MATE pipeline
Modifying the pipeline to work with other queues
Optimizing performance on your cluster
Creating SAM files
Scripts
Master script X-MATE.pl
Modules
Post X-MATE scripts
Filtering wiggle plots
Assessing the spec
56. <N>.nonMatch: All tags from the recursive run immediately longer than N that did not align. For example, all unmatched tags from the mapping of test500K.mers50.csfasta will be written to test500K.mers45.nonMatch. This file will then become test500K.mers45.csfasta.

<expName>.geno.N.N.N.collated: Collated mapping results for all chromosomes for the specified recursive genomic mapping run (N.N.N). These files should be kept.

<expName>.junc.N.N.N.collated: Collated mapping results for all chromosomes for the specified recursive junction mapping run (N.N.N). These files should be kept.

<expName>.junc.N.N.N.collated.SIM: Single mapping (SIM) collated mapping results for all chromosomes for the specified recursive junction mapping run (N.N.N).

chrN.<expName>.ma.N.N.N: Matching file for the specified chromosome at the specified recursive length. These files are collated into the geno.N.N.N.collated file and can be deleted.

junctionLibraryName.<expName>.ma.N.N.N: Matching file for the specified junction library at the specified recursive length. These files are collated into the junc.N.N.N.collated file and can be deleted.

chrN.<expName>.SIM: Single match file (all unique mappers) for the genomic run for the specified chromosome. These files can be deleted once the run has completed.

chrN.<expName>.for_wig (negative and positive wiggle): Working file for wiggle plot generation for the spec
57. t there is more benefit for a recursive strategy when the library insert size is smaller than ideal.

Figure 4: Quality profile for a recursive mapping run on approximately 10,000,000 base-space reads (Illumina GAII; SRA accession ERR000099) mapped using ISAS. Each line represents a mapping run for a single recursive iteration, N.M (N = read length, M = mismatches). Although the tags are mapped at progressively shorter lengths, the quality values are reported for the complete, un-truncated length of every tag. Tags that map at shorter lengths contain, on average, lower quality bases towards their 5' end.

The X-MATE pipeline

Introduction

X-MATE is a computational pipeline designed for the rapid mapping of DNA fragment or RNA-seq data from the Applied Biosystems SOLiD system. Built and production tested to run on modest computational resources, X-MATE generates tag counts and genome browser visualizations of genomic and exon-junction matching results (Wiggle, BED), and a variety of output files, including SAM, suitable for further tertiary analysis software. X-MATE is a framework for a recursive mapping strategy, where tags are mapped against a reference genome and, if not mapped at a certain number of mismatches, truncated by a user-defined length and mapped again. Recursive mapping is optional, but recommended. A diagrammatic representation of the X-MATE workflow is shown in Figure 1. The recursive w
58. thank and acknowledge the contributions of the developers of the above packages, as well as Applied Biosystems for making the packages available with the X-MATE system.

[Table of contents]
X-MATE user manual
[unreadable entry]
List of Figures
The X-MATE pipeline
Introduction
Part 1: Quality checking of the tag (optional)
Part 2: Recursive alignment to the human or mouse genome
Part 3: Multi-mapping tag rescue
Part 4: Creation of visualization and SAM files
Availability
Requirements
Installation instructions: X-MATE
Installation instructions: SamConverter
Testing and Configuration
Testing X-MATE mapreads
Testing X-MATE ISAS Colors
59. the location of the configuration file and the quality file required for SAM conversion, and let the script write the new configuration file.

USAGE: perl write_sam_conversion_config.pl
  -x xmateMappingOutputDirectory
  -o samOutputDirectory
  -c originalXmateConfigFile
  -q qualityFile
  -l tagLength
  -t tmpDirLocation
  -p numThreads (default: [illegible])
  multiSAMFile (option letter illegible; default: 1, single SAM file)
  -g gffToSamLocation (default: PATH location of GffToSam)

Assigning tags or coverage counts to gene models

This script has been deprecated, as more extensive tools for manipulation of genomic regions are available from the Galaxy website at http://main.g2.bx.psu.edu. Local copies of Galaxy can be installed and used, which is useful to avoid excessive internet usage through the transfer of large files. Downloading and installation guides are available from http://g2.trac.bx.psu.edu/wiki/HowToInstall. The following is a tutorial on how to assign tags to genes using Galaxy. In this step, particular care needs to be taken to ensure that different RNA-seq protocols are processed with the strand of capture in mind. For example, serial ligation approaches will generate sequences from the sense strand relative to the annotated gene, whereas the random-primed strand-specific protocols will generate tags mapping to the anti-sense strand. Assigning tags to the wrong strand of gene models will result in relatively low numbers
60. their actual name. This file is used to decode the chromosome number to a chromosome name in the mapping results. An example of this file is:

chr1.fa chr1.fa
chr10.fa chr10.fa
chr11.fa chr11.fa
chr12.fa chr12.fa
chr13.fa chr13.fa
chr14.fa chr14.fa
chr15.fa chr15.fa
chr16.fa chr16.fa
chr17.fa chr17.fa
chr18.fa chr18.fa
chr19.fa chr19.fa
chr2.fa chr2.fa
chr20.fa chr20.fa
chr21.fa chr21.fa
chr22.fa chr22.fa
chrM.fa chr23.fa
chrX.fa chr24.fa
chrY.fa chr25.fa
chr3.fa chr3.fa
chr4.fa chr4.fa
chr5.fa chr5.fa
chr6.fa chr6.fa
chr7.fa chr7.fa
chr8.fa chr8.fa
chr9.fa chr9.fa

database (example: /data/isas/hg19_25chr): The name of the reference database (reference genome index) to map against. This is a directory containing all the ISAS binary files.

chr (example: 1-25): The chromosomes to map against, N-M, where N = first chromosome index and M = last chromosome index (inclusive).

mode (example: 2): The ISAS mode to use. Available modes are 0, 1, 2, 2VA, 02, 012, 02VA, 012VA, 3, 3VA, 4, 4VA and 5 (VA = Valid Adjacent). For more information, please see the description in the ISAS user manual. X-MATE typically runs best using a combination of mode 2 or 2VA with global N.M, as mentioned above. This way you can specify a full global mismatch amount for the length of the tag.

limit (example: 5): The maximum number of alignment positions to report for multi-mapping tags.

filter (example: 0): ISAS filter level (range between 0 and 10). The lower the
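As a rough illustration of how a mapping file like the one above could be used to decode numbered chromosome names back to actual names, here is a small Python sketch. It assumes a two-column, whitespace-separated layout with the actual file name first and the numbered name second, which is inferred from the example rows (chrM, chrX and chrY paired with chr23, chr24 and chr25); the function name is hypothetical and this is not code from the X-MATE distribution.

```python
def load_chromosome_names(lines):
    """Build a lookup from numbered chromosome name to actual chromosome name.

    Assumption: each non-empty line holds two whitespace-separated fields,
    '<actual name> <numbered name>', e.g. 'chrM.fa chr23.fa'.
    """
    lookup = {}
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(
                "line %d: expected 2 fields, found %d" % (line_number, len(fields))
            )
        actual_name, numbered_name = fields
        lookup[numbered_name] = actual_name
    return lookup


example = [
    "chr1.fa chr1.fa",
    "chrM.fa chr23.fa",
    "chrX.fa chr24.fa",
    "chrY.fa chr25.fa",
]
names = load_chromosome_names(example)
print(names["chr24.fa"])  # chrX.fa
```

Raising an error on malformed lines, rather than skipping them, is a design choice for the sketch: a silently dropped row would leave some mapping results undecodable later on.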
61. this script, use the following command:

nohup <path>/X-MATE.pl -c <configuration file> &

where <path> is the full path to X-MATE.pl and <configuration file> is the name (and full path, if not in the current working directory) of the configuration file. For all other scripts, please see the PerlDoc commenting in the script; you can do this by typing: perldoc scriptName

Modules

For module descriptions, please use the embedded PerlDoc by typing: perldoc <ModuleName.pm>

Post X-MATE scripts

Filtering wiggle plots

This script is for filtering and reducing the size of the wiggle plots (bedGraphs) to be uploaded into the UCSC genome browser.

Usage: filter_bedGraphs.pl
REQUIRED:
  -f name of file to be filtered
  -m minimum number of tags to report

For example, to remove the data from all nucleotides where there are not at least 5 tags covering them, use the command:

filter_bedGraphs.pl -f test_500K_tags_50mers.negative.wiggle -m 5

Assessing the specificity of mapping

In order to examine the specificity of the mapping by the directionality of the library, this script can be used to examine the junction BED files generated by X-MATE. The sense strand matches will be your "expected strand" BED file, and the antisense matches will be your "unexpected strand" BED file.

Usage: assess_junctions_for_directionality.pl
REQUIRED:
  -s name of BED file contai
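The filtering step described for wiggle plots (dropping positions covered by fewer than a minimum number of tags) can be sketched in Python as follows. This is an illustration of the idea only, not the packaged Perl script. It assumes the standard four-column bedGraph layout (chromosome, start, end, value), passes "track", "browser" and comment lines through unchanged, and compares the absolute value in case negative-strand plots store counts as negative numbers, which is an assumption.

```python
def filter_bedgraph_lines(lines, min_tags):
    """Yield header lines unchanged and data lines with |value| >= min_tags."""
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # drop blank lines
        if stripped.startswith(("track", "browser", "#")):
            yield stripped  # keep header and comment lines as they are
            continue
        fields = stripped.split()
        if len(fields) < 4:
            raise ValueError("not a bedGraph data line: %r" % stripped)
        # Assumption: negative-strand coverage may be written as negative values.
        if abs(float(fields[3])) >= min_tags:
            yield stripped


example = [
    "track type=bedGraph name=test",
    "chr1 0 10 3",
    "chr1 10 20 5",
    "chr1 20 30 -7",
]
for kept in filter_bedgraph_lines(example, 5):
    print(kept)
```

Because the function is a generator over lines, it can be fed an open file handle directly, so a large plot never needs to be held in memory at once.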
62. tions_35 hg19_junctions_30 hg19_junctions_25
IndexOfJunction /data/xmate/junction_libraries/hg19_junctions_45.fa.index /data/xmate/junction_libraries/hg19_junctions_40.fa.index /data/xmate/junction_libraries/hg19_junctions_35.fa.index /data/xmate/junction_libraries/hg19_junctions_30.fa.index /data/xmate/junction_libraries/hg19_junctions_25.fa.index
SAM tools
GffToSam /usr/local/bin/GffToSam
TempDir /data/scratch
(end of configuration file)

Make sure the paths in the configuration file point to the required files on your system, and then type the following command:

nohup java -jar samConverter.jar configurationFileLocation &

SAM conversion is backwards compatible with older RNA-MATE mapping runs. All you need are the original reference genome files, junction libraries, and the original csfasta and quality files. You can specify the number of parallel threads to run at once with the configuration parameter ParallelNumber N. Once you have a SAM file, you can then use samtools (http://samtools.sourceforge.net) to create BAM (Binary Alignment/Map) files. These are the default input format for most next-generation sequencing software.

Scripts

Master script X-MATE.pl

This script will call the required modules in order. There is only one user-defined parameter for this script, which allows you to specify a configuration file containing all the required parameters for the entire mapping pipeline. To run
63. [Screenshot: Galaxy workflow editor and history for the workflow "RNAseq to Gene Counts". Legible steps: Filter on data 1; Gene BED To Exon/Intron/Codon BED on data 2 and on data 3; uploaded files test_500K_tags_50mers positive and negative, and positive starts and negative starts; Join on data 4 and data 6; Concatenate; Group on data 10 (group by c8, sum c4; 16,354 lines, tabular format).]
64. ull customization of matching stringency; the number in the file name represents the number of nucleotides from the donor and acceptor (e.g. hg19_junctions_25.fa.cat contains 25 nt from the donor and 25 nt from the acceptor). In this case, exon sequences were defined from UCSC known genes, RefSeq, Ensembl, AceView, GeneID, GenScan and N-SCAN. The file list for this junction set is as follows:

hg19_junction_libraries.tar.gz
  hg19_junctions_25.fa.cat, hg19_junctions_25.fa.index
  hg19_junctions_30.fa.cat, hg19_junctions_30.fa.index
  hg19_junctions_35.fa.cat, hg19_junctions_35.fa.index
  hg19_junctions_40.fa.cat, hg19_junctions_40.fa.index
  hg19_junctions_45.fa.cat, hg19_junctions_45.fa.index
  hg19_junctions_50.fa.cat, hg19_junctions_50.fa.index
  hg19_junctions_55.fa.cat, hg19_junctions_55.fa.index
  hg19_junctions_60.fa.cat, hg19_junctions_60.fa.index
  hg19_junctions_65.fa.cat, hg19_junctions_65.fa.index

mm9_junction_libraries.tar.gz
  mm9_junctions_25.fa.cat, mm9_junctions_25.fa.index
  mm9_junctions_30.fa.cat, mm9_junctions_30.fa.index
  mm9_junctions_35.fa.cat, mm9_junctions_35.fa.index
  mm9_junctions_40.fa.cat, mm9_junctions_40.fa.index
  mm9_junctions_45.fa.cat, mm9_junctions_45.fa.index
  mm9_junctions_50.fa.cat, mm9_junctions_50.fa.index
  mm9_junctions_55.fa.cat, mm9_junctions_55.fa.index
  mm9_junctions_60.fa.cat, mm9_junctions_60.fa.index
  mm9_junctions_65.fa.cat, mm9_junctions_65.fa.index

How to create your own junction libraries

To create
65. ut also from CpG islands and genome-wide repeat elements. Strategies to rescue ambiguous sequences have recently been applied to high-throughput sequencing data, and we have refined our previously published algorithm to work efficiently with large data sets. For every multi-mapping tag, the algorithm considers all tags that map near to each of the possible locations of the tag (within a user-specified window) to determine the most likely mapping position of the tag. Where a tag cannot be unambiguously assigned, a fractional weighting is assigned to the relevant positions. In practice, between 40% and 60% of multi-mapping tags can be assigned a single position with >60% likelihood, depending on the relative sequence coverage. The recommended window size for shotgun sequencing is 10 (Cloonan et al., 2008, Nat Methods 5:613-619), though for the disparate data types currently available this can vary. For instance, Cap Analysis of Gene Expression (CAGE) tags are rescued using a window of 100 nt, a size previously shown to optimize mammalian promoter detection (Carninci et al., 2006, Nat Genet 38:626-635).

Part 4: Creation of visualization and SAM files

UCSC genome browser compatible wiggle plots (for genome-mapped data) and BED files (for exon-junction mapped data) are generated automatically from the collated results. The wiggle plots are strand-specific (or un-stranded, depending on experiment requirement), single nucleotide resolution coverage
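The rescue idea described above (weighting each candidate location of a multi-mapping tag by the tags mapping nearby within a window) can be illustrated with a simplified Python sketch. This is not the published algorithm or the X-MATE code: it assumes that only uniquely mapped tag positions are counted, that weights are simple proportions of neighbour counts, and that a tag with no neighbours anywhere is split evenly. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def neighbour_count(sorted_positions, position, window):
    """Count positions within +/- window of `position` in a sorted list."""
    low = bisect_left(sorted_positions, position - window)
    high = bisect_right(sorted_positions, position + window)
    return high - low


def rescue_weights(candidates, unique_hits, window):
    """Return one fractional weight per candidate location; weights sum to 1.

    candidates:  list of (chromosome, position) for one multi-mapping tag
    unique_hits: dict of chromosome -> sorted list of uniquely mapped positions
    """
    counts = [
        neighbour_count(unique_hits.get(chromosome, []), position, window)
        for chromosome, position in candidates
    ]
    total = sum(counts)
    if total == 0:
        # No supporting tags near any candidate: split the tag evenly.
        return [1.0 / len(candidates)] * len(candidates)
    return [count / total for count in counts]


unique_hits = {"chr1": [100, 105, 108, 500], "chr2": [1000]}
candidates = [("chr1", 104), ("chr2", 1003)]
weights = rescue_weights(candidates, unique_hits, window=10)
print(weights)  # [0.75, 0.25]
```

In this toy example the first location would pass a 60% likelihood cut-off and could be treated as the single assigned position, while a tag whose best weight falls below the cut-off would keep its fractional weights.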
66. [Screenshot: Galaxy "Get Data" from the UCSC Main table browser, "Output refGene as BED" page. Legible options: include custom track header; create one BED record per whole gene, upstream by 200 bases, exons plus bases at each end, introns plus 0 bases at each end, 5' UTR exons, coding exons, 3' UTR exons, or downstream by 200 bases.]
67. [Screenshot: Galaxy "Upload File" form (URL/text entry, "Convert spaces to tabs" option, genome set to Human Mar. 2006 (hg18)) with the history showing "Gene BED To Exon/Intron/Codon BED" applied to the refGene (genome) dataset.]
