Manual - CONTRA - Stanford University

Contents

1. CONTRAfold 2.02 User Manual

Contents

1 Description
2 License (BSD)
3 Installation
  3.1 *nix installation
4 Supported file formats
  4.1 Input file formats
    4.1.1 Plain text format
    4.1.2 FASTA format
    4.1.3 BPSEQ format
  4.2 Output formats
    4.2.1 FASTA format
    4.2.2 BPSEQ format
    4.2.3 Posteriors format
5 Usage
  5.1 Prediction mode
    5.1.1 A single input file
    5.1.2 Multiple input files
    5.1.3 Optional arguments
  5.2 Training mode
6 Visualization of folded RNAs
  6.1 Installation
    6.1.1 *nix installation
  6.2 Usage
  6.3 Additional options
7 Citing CONTRAfold

1 Description

CONTRAfold is a novel algorithm for the prediction of RNA secondary structure based on conditional log-linear models (CLLMs). Unlike previous secondary structure prediction programs, CONTRAfold is the first fully probabilistic algorithm to achieve state-of-the-art accuracy in RNA
2. • If γ > 1, the parsing algorithm emphasizes sensitivity.
• If 0 < γ < 1, the parsing algorithm emphasizes specificity.

In addition, if the user specifies any value of γ < 0, then CONTRAfold tries trade-off parameters of 2^k for k ∈ {-5, -4, ..., 10} and generates one output file for each trade-off parameter. Note that this must be used in conjunction with either --parens, --bpseq, or --posteriors in order to allow for writing to output files.

For example, the command

    contrafold predict seq.fasta --gamma 100000

runs the maximum expected accuracy algorithm, placing almost all emphasis on sensitivity (predicting correct base pairs).

The naming convention used by CONTRAfold when γ < 0 differs somewhat from the normal one. Running

    contrafold predict seq.fasta --gamma -1 --bpseq output

will create the files

    output/output.gamma=0.031250
    output/output.gamma=0.062500
    ...
    output/output.gamma=1024.000000

For multiple input files,

    contrafold predict seq1.fasta seq2.fasta --gamma -1 --bpseq output

will generate

    output/output.gamma=0.031250/seq1.fasta
    output/output.gamma=0.031250/seq2.fasta
    ...
    output/output.gamma=1024.000000/seq1.fasta
    output/output.gamma=1024.000000/seq2.fasta

As before, multiple types of output (parens, BPSEQ, posteriors) may be requested simultaneously.

--viterbi
This option uses the Viterbi algorithm to compute s
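The trade-off sweep described above (2^k for k from -5 to 10) and its file naming can be sketched in a few lines of Python. This is only an illustration of the naming pattern shown in the manual's examples, not CONTRAfold's own code; the function names `gamma_values` and `sweep_paths` are hypothetical.

```python
def gamma_values():
    """Trade-off parameters 2**k for k = -5, -4, ..., 10 (16 values)."""
    return [2.0 ** k for k in range(-5, 11)]


def sweep_paths(dest, inputs=()):
    """List the paths a gamma < 0 sweep would write under `dest`.

    Assumption: the layout mirrors the manual's examples, i.e.
    dest/dest.gamma=<value> for one input file, and
    dest/dest.gamma=<value>/<input name> for several input files.
    """
    paths = []
    for gamma in gamma_values():
        stem = f"{dest}/{dest}.gamma={gamma:.6f}"
        if len(inputs) > 1:
            paths.extend(f"{stem}/{name}" for name in inputs)
        else:
            paths.append(stem)
    return paths


if __name__ == "__main__":
    for path in sweep_paths("output", ["seq1.fasta", "seq2.fasta"])[:4]:
        print(path)
```

Listing the expected paths up front can help when collecting the results of a sweep in a script.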
3.     >sequence
    acguuggcu
    >structure
    [structure line not recoverable from the source]

But the following is not (it starts with the wrong header character):

    sequence
    ATGACGGT

Also, the following file is not valid because the parenthesized structure is not properly balanced:

    >sequence
    acguuggcu
    >structure
    [structure line not recoverable from the source]

Finally, the following file is not valid because the structural information header is missing:

    >sequence
    acguuggcu
    [structure line not recoverable from the source]

4.1.3 BPSEQ format

A BPSEQ format file is used for describing a single RNA sequence and its annotated secondary structure. This file format contains exactly one line for each nucleotide in an RNA sequence. The i-th line of the file contains three items separated by single spaces:

1. The integer i (with i = 1 representing the first nucleotide).
2. The i-th character of the RNA sequence, which may be A, C, G, T, U, or N, in either upper or lower case (the output of the program will retain the case of the input). Any T's are automatically converted to U's. Any other letters are automatically converted to N's. N's are treated as masked sequence positions which are ignored during all calculations (i.e., any scoring terms involving an N will be skipped).
3. The index of the character to which the i-th character base pairs, if known. If the character is known to be unpaired, then 0 appears here. If it is unknown whether this character base pairs, then a -1
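As a rough illustration of the three-column BPSEQ layout described above, the following Python sketch reads BPSEQ text into a sequence string and a list of partner indices. It is not CONTRAfold's parser; the names `normalize_base` and `parse_bpseq` are hypothetical, and keeping the original case when a letter is converted to N is an assumption.

```python
def normalize_base(ch):
    """Apply the letter rules: T -> U, any other non-ACGUN letter -> N."""
    if len(ch) != 1 or not ch.isalpha():
        raise ValueError(f"not a single letter: {ch!r}")
    if ch in "Tt":
        return "U" if ch.isupper() else "u"
    if ch.upper() in "ACGUN":
        return ch
    # Assumption: converted letters keep the case of the input.
    return "N" if ch.isupper() else "n"


def parse_bpseq(text):
    """Return (sequence, partners) from BPSEQ text.

    partners[i - 1] is the third column of line i: a positive partner
    index, 0 for unpaired, or -1 for unknown.
    """
    sequence, partners = [], []
    for expected, line in enumerate(text.strip().splitlines(), start=1):
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"line {expected}: expected three items")
        index, partner = int(fields[0]), int(fields[2])
        if index != expected:
            raise ValueError(f"line {expected}: nucleotides must appear in order")
        sequence.append(normalize_base(fields[1]))
        partners.append(partner)
    return "".join(sequence), partners


if __name__ == "__main__":
    example = "1 A 7\n2 G -1\n3 T -1\n4 C 0\n5 c -1\n6 c -1\n7 u 1"
    print(parse_bpseq(example))
```

The in-order check reflects the requirement that line i describes nucleotide i.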
4. PNG file in which the letter for each RNA nucleotide is colored according to posterior probability confidence. Black letters indicate high-confidence structure, whereas lighter gray letters indicate lower-confidence structure.

--title "title"
This option allows the user to annotate the generated RNA image with a title. Note that the title string should be surrounded with double quotation marks so as to ensure that it is interpreted as a single argument to the program.

In general, the CONTRAfold visualization tools generate RNA layouts which tend to be visually pleasing. The layout algorithm uses a simple deterministic layout rule followed by a gradient-based optimization procedure. This type of procedure is not guaranteed to generate non-overlapping layouts for all RNA structures; in practice, however, the visualization tools can provide reasonable visualizations for a large range of RNA structures.

7 Citing CONTRAfold

If you use CONTRAfold in your work, please cite:

    Do, C.B., Woods, D.A., and Batzoglou, S. (2006) CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14): e90-e98.

Other relevant references include:

    Do, C.B., Foo, C.S., Ng, A.Y. (2007) Efficient multiple hyperparameter learning for log-linear models. In Advances in Neural Information Processing Systems 20.
5. are treated as masked sequence positions which are ignored during all calculations (i.e., any scoring terms involving an N will be skipped). Other non-whitespace characters are not permitted.

3. (Optional) A structural annotation for the sequence provided above. The structural annotation requires:

   (a) A single header line beginning with the character ">" followed by a description (any text after the description is ignored).

   (b) One or more lines of parenthesized structural annotation. These lines provide a structural annotation for each nucleotide in the RNA sequence using a sequence of "(", ")", ".", and "?" characters. A nucleotide annotated with "(" pairs with the nucleotide annotated with the matching ")". A "." character indicates that the corresponding nucleotide is unpaired. Finally, a "?" indicates a position for which the proper matching (either paired or unpaired) is unknown.

Observe that the parentheses in the input file must be well-balanced, i.e., for each left parenthesis, the corresponding pairing position must be marked with a right parenthesis (not a "?"), and vice versa. Since CONTRAfold generates only non-pseudoknotted structure predictions, the proper pairing will always be unambiguous.

For example, the following is a valid FASTA file:

    >sequence
    acggagaGUGUUGAU
    CUGUGUGUUACUACU
    caucuguaguucuag
    uugua

Similarly, the following is a valid FASTA file with a structural annotation:
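The matching rule for parenthesized annotations can be illustrated with a small stack-based sketch in Python. It is a minimal example and not the program's own parser: the function name `parse_structure` is hypothetical, and the 0 / -1 encoding of unpaired and unknown positions is borrowed from the BPSEQ convention for convenience.

```python
def parse_structure(annotation):
    """Convert a '(', ')', '.', '?' annotation into partner indices.

    Returns a list where entry i - 1 is the 1-based partner of position i,
    0 if the position is unpaired ('.'), or -1 if it is unknown ('?').
    Raises ValueError when the parentheses are not well-balanced.
    """
    partners = [0] * len(annotation)
    open_positions = []  # stack of positions holding an unmatched '('
    for pos, ch in enumerate(annotation, start=1):
        if ch == "(":
            open_positions.append(pos)
        elif ch == ")":
            if not open_positions:
                raise ValueError(f"unmatched ')' at position {pos}")
            left = open_positions.pop()
            partners[left - 1] = pos
            partners[pos - 1] = left
        elif ch == ".":
            partners[pos - 1] = 0
        elif ch == "?":
            partners[pos - 1] = -1
        else:
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if open_positions:
        raise ValueError(f"unmatched '(' at position {open_positions[-1]}")
    return partners


if __name__ == "__main__":
    print(parse_structure("((.?.))"))
```

Because each ")" always closes the most recent unmatched "(", the pairing recovered this way is unique, which is why well-balanced annotations are unambiguous.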
6. prints the result to either the console or output files. The basic syntax for running CONTRAfold in prediction mode is

    $ contrafold predict [OPTIONS] INFILE(s)

5.1.1 A single input file

For single sequence prediction, CONTRAfold generates FASTA output (see Section 4.1.2) to the console (i.e., stdout) by default.

For example, suppose the file seq.fasta contains a FASTA-formatted sequence to be folded. Then the command

    contrafold predict seq.fasta

will fold the sequence and display the results to the console in FASTA format.

CONTRAfold can also write parenthesized FASTA, BPSEQ, or posteriors-formatted output to an output file. To write FASTA output to a file:

    contrafold predict seq.fasta --parens seq.parens

To write BPSEQ output to a file:

    contrafold predict seq.fasta --bpseq seq.bpseq

To write all posterior pairing probabilities greater than 0.001 to a file:

    contrafold predict seq.fasta \
        --posteriors 0.001 seq.posteriors

Note that here the backslash character "\" is used to denote that a command line is broken over several lines; it is not necessary if you type everything on a single line.

Finally, it is also possible to obtain multiple different types of output simultaneously. For example, the command

    contrafold predict seq.fasta --parens seq.parens \
        --bpseq seq.bpseq --posteriors 0.001 seq.posteriors

will generate three different output files s
7. secondary structure prediction.

The CONTRAfold program was developed by Chuong Do at Stanford University in collaboration with Daniel Woods and Serafim Batzoglou. The source code for CONTRAfold is available for download from

    http://contra.stanford.edu/contrafold

under the BSD license. The CONTRAfold logo was designed by Marina Sirota.

Any comments or suggestions regarding the program should be sent to Chuong Do (chuongdo@cs.stanford.edu).

2 License (BSD)

Copyright (c) 2006, Chuong Do
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

• Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
• Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
• Neither the name of Stanford University nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURP
8. the i-th nucleotide might pair, and p_ij is the probability that this base pairing occurs.

For example, the following is a posteriors format output:

    [example listing not recoverable from the source]

In the above, we see that nucleotide 2 has an 11% probability of pairing to nucleotide 8. Note that each pairing probability is reported only once, i.e., on the i-th line we show only the pairing probabilities to nucleotides j > i, which appear after the i-th position in the RNA sequence.

5 Usage

CONTRAfold has two modes of operation: prediction mode and training mode.

• In prediction mode, CONTRAfold folds new RNA sequences using either the default parameters or a CONTRAfold-format parameter file.
• In training mode, CONTRAfold learns new parameters from training data consisting of RNA sequences with pre-existing structural annotations.

Most users of this software will likely only ever need to use CONTRAfold's prediction functionality. The optimization procedures used in the training algorithm are fairly computationally expensive; for this purpose, the CONTRAfold program is designed to support automatic training in a parallel computing environment via MPI (Message Passing Interface).

5.1 Prediction mode

In prediction mode, CONTRAfold predicts the secondary structure of one or more unfolded input RNA sequences and
9. OSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

3 Installation

At the moment, CONTRAfold is only available for Unix-based systems (e.g., Linux). We will be porting CONTRAfold to other architectures and making the binaries available.

3.1 *nix installation

To compile CONTRAfold from the source code (for a *nix machine):

1. Download the latest version of the CONTRAfold source code from

       http://contra.stanford.edu/contrafold/download.html

2. Decompress the archive:

       $ tar zxvf contrafold_v _ .tar.gz

   where the placeholders in the file name are replaced with the appropriate version numbers for the tar.gz you want to install. This will create a subdirectory called contrafold inside of the current directory.

3. Change to the contrafold/src subdirectory and compile the program:

       cd contrafold/src
       make clean
       make

Now your installation is complete.

4 Supported file formats

In this
10. appears here.

Note: if the BPSEQ file specifies that character i base pairs with character j, then it must also specify that character j base pairs with character i.

For example, the following is a BPSEQ format file:

    1 A 7
    2 G -1
    3 U -1
    4 C 0
    5 c -1
    6 c -1
    7 u 1

in which it is known that the first and last positions base pair, and the middle position does not base pair. However, the folding of the other positions is unknown.

However, the following is not a valid BPSEQ format file:

    2 G -1
    3 U -1
    1 A 7
    4 C 0
    5 c -1
    6 c -1
    7 u 1

since all nucleotides in the file must appear in order. Finally, the following is also not a valid BPSEQ format file:

    [example listing not recoverable from the source]

since the first position is specified as pairing with the last position but not vice versa.

4.2 Output formats

The results of a CONTRAfold secondary structure prediction are given in either FASTA, BPSEQ, or posteriors format. We describe each of these in detail.

4.2.1 FASTA format

The FASTA output format is identical to the FASTA input format (see Section 4.1.2) with structures. Since CONTRAfold provides predictions for the pairing or non-pairing of every single nucleotide, no "?"s will appear in the output.

The output will always consist of exactly four lines, where the first and third lines are FASTA headers for the sequence and structure, respectively; the second line specifies
11. ently, only UNIX installation is supported.

6.1.1 *nix installation

To compile the CONTRAfold visualization tools from the source code (for a *nix machine):

1. Install the libgd graphics development library, available from

       http://www.boutell.com/gd

2. Install the libpng PNG image library, available from

       http://www.libpng.org/pub/png/libpng.html

3. Compile the visualization tools:

       make viz

6.2 Usage

Given an input FASTA file, generating an image of the predicted CONTRAfold structure involves three steps:

1. Generate a secondary structure prediction in BPSEQ format:

       contrafold predict seq.fasta --bpseq seq.bpseq

2. Run the make_coords program to generate an RNA layout:

       make_coords output.bpseq output.coords

   The resulting coordinates are placed in the output.coords file.

3. Run the plot_rna program to convert the layout into a PNG image:

       plot_rna output.bpseq output.coords --png output.png

   The resulting PNG is placed in the output.png file and can be viewed with a web browser such as Mozilla Firefox. Alternatively, EPS format output is also available:

       plot_rna output.bpseq output.coords --eps output.eps

6.3 Additional options

The plot_rna program has a couple of options which you can use to control the generated PNG files.

--posteriors posteriors_file
If a CONTRAfold posteriors file is also available, then using the above option will generate a
12. etain the case of the input). Any T's are automatically converted to U's. Any other letters are automatically converted to N's. All whitespace (space, tab, newline) is ignored. N's are treated as masked sequence positions which are ignored during all calculations (i.e., any scoring terms involving an N will be skipped). Other non-whitespace characters are not permitted.

Plain text files cannot contain any secondary structural annotation.

For example, the following is a valid plain text file:

    NACGACAGUGUAUCACUAGUACUUA
    GUAUGUACUAUC

    AGUAGUUGUUGUAGUUC

Note that the blank third line will be ignored, and the initial N character will be treated as a placeholder character which appears in the output folded RNA but makes no contribution to the computations.

4.1.2 FASTA format

A FASTA format file consists of:

1. A single header line beginning with the character ">" followed by a text description of the RNA sequence. Note that the description must fit on the same line as the ">" character.

2. One or more lines containing RNA sequence data. Each of these lines may contain the letters A, C, G, T, U, or N, in either upper or lower case (the output of the program will retain the case of the input). Any T's are automatically converted to U's. Any other letters are automatically converted to N's. All whitespace (space, tab, newline) is ignored. N's
13. imultaneously.

5.1.2 Multiple input files

For multiple input files, CONTRAfold generates FASTA output (see Section 4.1.2) to the console by default. The output is presented in the order of the input files on the command line. Using console output is not allowed when MPI is enabled or when certain other options are selected; in general, we recommend the usage of explicitly specified output files or subdirectories when dealing with multiple input files (see below).

CONTRAfold can also write FASTA, BPSEQ, or posteriors-formatted output to several output files. In particular, CONTRAfold creates a subdirectory (whose name is specified by the user) in which to store the results, and writes each prediction to a file in that subdirectory of the same name as the original file being processed.

For example, suppose that the files seq1.fasta and seq2.fasta each contain a FASTA-formatted sequence to be folded. Then the command

    contrafold predict seq1.fasta seq2.fasta --parens output

will create a subdirectory called output and will place the results in the files output/seq1.fasta and output/seq2.fasta. Alternatively,

    contrafold predict seq1.fasta seq2.fasta --bpseq output

and

    contrafold predict seq1.fasta seq2.fasta --posteriors 0.001 output

generate BPSEQ and posteriors-formatted outputs instead.

Observe that if multiple input files have the same base name, then overwriting of output may occur. F
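Since each prediction is written to a file named after its input file, two inputs that share a base name would land on the same output path. A quick pre-flight check along the following lines can catch this before running a batch. This is a hypothetical helper written for illustration (the name `find_output_collisions` is not part of CONTRAfold).

```python
import os
from collections import defaultdict


def find_output_collisions(input_paths):
    """Group input paths by base name and return the groups with duplicates.

    The result maps each clashing base name to the input paths that would
    be written to the same file inside the output subdirectory.
    """
    by_name = defaultdict(list)
    for path in input_paths:
        by_name[os.path.basename(path)].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


if __name__ == "__main__":
    clashes = find_output_collisions(["seq/input", "input", "seq1.fasta"])
    for name, paths in clashes.items():
        print(f"{name}: {', '.join(paths)}")
```

Renaming one of the clashing inputs, or running them in separate invocations with different output subdirectories, avoids the overwrite.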
14. llow the restrictions described in Section 4.1.

--params PARAMSFILE
This option uses a trained CONTRAfold parameter file instead of the default program parameters. The format of the parameter file should be the same as the contrafold.params.complementary file in the CONTRAfold source code; each line contains a single parameter name and a parameter value.

--version
Display the program version number.

--verbose
Show detailed console output.

--partition
Compute the log partition function for the input sequence. This option may be used in conjunction with the --constraints option in order to determine the CONTRAfold energy of a given RNA secondary structure (specified in a BPSEQ file). For example, to compute the energy of a Viterbi parse generated via

    contrafold predict seq.fasta --viterbi --bpseq seq.bpseq

we can simply run

    contrafold predict seq.bpseq --constraints --partition

Some quick notes regarding the partition function:

• When used in conjunction with partial constraints (i.e., only some of the mappings in the input BPSEQ file are -1's; see above), this option computes the log of the summed unnormalized probabilities for all structures consistent with the partial constraints.
• In order to compute the log of the summed probabilities which are normalized (as opposed to the quantities mentioned above), you must also run cont
15. or example, if the input files list contains two different files called seq/input and input, the output subdirectory will contain only a single file called input.

Finally, you may also generate multiple types of output simultaneously, as before. Remember, however, to use different output subdirectory names for each. The command

    contrafold predict seq1.fasta seq2.fasta \
        --parens parens_output --bpseq bpseq_output \
        --posteriors 0.001 posteriors_output

generates three different output subdirectories (parens_output, bpseq_output, and posteriors_output), each containing two files (seq1.fasta, seq2.fasta).

5.1.3 Optional arguments

CONTRAfold accepts a number of optional arguments, which alter the default behavior of the program. To use any of these options, simply pass the option to the CONTRAfold program on the command line. For example,

    contrafold predict seq.fasta --viterbi --noncomplementary

The optional arguments include:

--gamma γ
This option sets the sensitivity/specificity trade-off parameter for the maximum expected accuracy decoding algorithm. In particular, consider a scoring system in which each nucleotide which is correctly base paired gets a score of γ, and each nucleotide which is correctly not base paired gets a score of 1. Then CONTRAfold finds the folding of the input sequence with maximum expected accuracy with respect to this scoring system. Intuitively
16. rafold predict seq.bpseq --partition and subtract this log partition value from the previous log partition value described above. Note that this quantity will always be greater than or equal to the log partition above, implying that the log of the summed probabilities is necessarily non-positive (which makes sense, as probabilities are at most 1).

5.2 Training mode

In training mode, CONTRAfold infers a parameter set using RNA sequences with known (or partially known) secondary structures in BPSEQ format. By default, CONTRAfold uses the L-BFGS algorithm for optimization.

For example, suppose input/*.bpseq refers to a collection of 100 files which represent sequences with known structures. Calling

    $ contrafold train input/*.bpseq

instructs CONTRAfold to learn parameters for predicting all structures in input/*.bpseq without using any regularization. The learned parameters after each iteration of the optimization algorithm are stored in

    optimize.params.iter1
    optimize.params.iter2
    ...

in the current directory. The final parameters are stored in optimize.params.final, and a log file describing the optimization is stored in optimize.log.

In general, running CONTRAfold without regularization is almost always a bad idea because of overfitting. There are currently two ways to use regularization that are supported in the CONTRAfold program:

1. Regularization may be manually specified. The current b
17. section, we describe the input and output file formats supported by the CONTRAfold program.

4.1 Input file formats

CONTRAfold accepts input files which either contain only RNA sequences or contain both sequences and partial structural annotations.

For the file formats that support specification of partial structural annotations (in particular, FASTA and BPSEQ), the provided structures must obey the following properties:

1. Each position in the RNA sequence is marked as either unpaired, paired to some specific nucleotide, or unknown.
2. If position i is marked as pairing with position j, then position j must be marked as pairing with position i.
3. The partial structures specified must not have pseudoknots.
4. A position cannot be marked as pairing unless its specific base-pairing partner has been specified.

These structural annotations are generally ignored when performing predictions (unless the --constraints flag is specified on the command line). These structural annotations are required for training CONTRAfold.

The three specific input file formats supported by CONTRAfold are plain text, FASTA, and BPSEQ. We describe each of these formats in turn.

4.1.1 Plain text format

A plain text format file consists of one or more lines containing RNA sequence data. Each of these lines may contain the letters A, C, G, T, U, or N, in either upper or lower case (the output of the program will r
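The symmetry and no-pseudoknot properties listed above can be checked mechanically. The sketch below is a minimal Python example under stated assumptions, not the validation CONTRAfold performs: the function name `check_partial_structure` is hypothetical, and positions are encoded as 1-based partner indices with 0 for unpaired and -1 for unknown.

```python
def check_partial_structure(partners):
    """Validate a partial structure given as a list of partner indices.

    partners[i - 1] is the 1-based partner of position i, 0 if unpaired,
    or -1 if unknown. Raises ValueError if a pairing is not symmetric or
    if two pairs cross (a pseudoknot); returns True otherwise.
    """
    n = len(partners)
    pairs = []
    for i, j in enumerate(partners, start=1):
        if j in (0, -1):
            continue
        if not 1 <= j <= n or j == i:
            raise ValueError(f"position {i}: invalid partner {j}")
        if partners[j - 1] != i:
            raise ValueError(f"positions {i} and {j}: pairing is not symmetric")
        if i < j:
            pairs.append((i, j))

    # Pairs arrive in order of increasing i; keep a stack of the closing
    # positions of the pairs that are still open at the current position.
    open_closers = []
    for i, j in pairs:
        while open_closers and open_closers[-1] < i:
            open_closers.pop()
        if open_closers and j > open_closers[-1]:
            raise ValueError(f"pair ({i}, {j}) crosses another pair")
        open_closers.append(j)
    return True


if __name__ == "__main__":
    print(check_partial_structure([7, -1, -1, 0, -1, -1, 1]))
```

Two pairs (i, j) and (k, l) cross when i < k < j < l, which is exactly the case the stack comparison detects.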
18. the sequence data, and the fourth line specifies the parenthesized structure. If a FASTA file is provided as input, then the header in the FASTA input file will be used as the first-line header in the output file; otherwise, the relative path to the input file is used as the header. The FASTA header for the structure will always be "structure". Since CONTRAfold generates only non-pseudoknotted structure predictions, the proper pairing will always be unambiguous.

For example, the following parenthesized structure is a completion of the valid BPSEQ file from Section 4.1.3, assuming that the input file is specified in the file data/input:

    >data/input
    AGUCccu
    >structure
    [structure line not recoverable from the source]

4.2.2 BPSEQ format

The BPSEQ output format is identical to the BPSEQ input format (see Section 4.1.3). Since CONTRAfold provides predictions for the pairing or non-pairing of every single nucleotide, no -1's will appear in the output.

4.2.3 Posteriors format

The posteriors output format is distinct from the BPSEQ and FASTA formats in that it does not provide a single prediction of RNA secondary structure. Instead, it provides a sparse representation of the base-pairing posterior probabilities for pairs of letters in the RNA sequence. Specifically, the i-th line contains:

1. The integer i.
2. The i-th character of the file.
3. A space-separated list of base-pairing probabilities of the form j:p_ij, where j > i is the index of the nucleotide to which
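To make the sparse layout concrete, here is a small Python sketch that reads posteriors-formatted text into a dictionary keyed by (i, j). It is an illustration only: the function name `parse_posteriors` is hypothetical, the colon between j and the probability is assumed from the "j:p" form, and the sample values in the usage block are made up rather than real program output.

```python
def parse_posteriors(text):
    """Read posteriors-format text into {(i, j): probability}.

    Each line holds the index i, the i-th character, and zero or more
    entries of the assumed form j:p with j > i.
    """
    probabilities = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) < 2:
            raise ValueError(f"malformed line: {line!r}")
        i = int(fields[0])
        for entry in fields[2:]:
            j_text, p_text = entry.split(":")
            j, p = int(j_text), float(p_text)
            if j <= i:
                raise ValueError(f"line {i}: partner {j} is not after position {i}")
            probabilities[(i, j)] = p
    return probabilities


if __name__ == "__main__":
    # Hypothetical sample data, for illustration only.
    sample = "1 G 9:0.5\n2 G 8:0.11 7:0.02\n3 A"
    for (i, j), p in sorted(parse_posteriors(sample).items()):
        print(i, j, p)
```

Because each probability is listed only on the line of the lower index, a lookup for an arbitrary pair should order the two indices before consulting the dictionary.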
19. tructures, rather than the maximum expected accuracy (posterior decoding) algorithm. The structures generated by the Viterbi option tend to be of slightly lower accuracy than posterior decoding, so this option is not enabled by default.

--noncomplementary
This option uses a folding model that allows non-{AU, CG, GU} pairings in the CONTRAfold output. This option is slower and generally slightly less accurate than the default option of allowing only canonical base pairings.

--constraints
This option requires the use of BPSEQ format input files. By default, any base pairings that are included in the BPSEQ file are ignored. However, if the --constraints flag is used, then any base pairings in an input BPSEQ file are treated as constraints on the allowed structures. In particular:

1. A nucleotide mapping to a positive index i is constrained to base pair with nucleotide i.
2. A nucleotide mapping to 0 is constrained to be unpaired.
3. A nucleotide mapping to -1 is unconstrained.

For example, given the following input BPSEQ file

    [example listing not recoverable from the source]

and the --constraints flag, CONTRAfold will assume that positions 4 and 7 are constrained to be base pairing, while positions 5 and 6 are constrained to be unpaired. The base pairing of the remaining positions is decided by CONTRAfold. The constraints must fo
20. uild of CONTRAfold uses 15 regularization hyperparameters, each of which is used for some subset of the parameters. To specify a single value shared between all of the regularization hyperparameters manually, one can use the --regularize flag. For example,

       contrafold train --regularize 1 input/*.bpseq

   uses a regularization constant of 1 for each hyperparameter. In general, we recommend that you do not perform training yourself unless you know what you are doing (also, do not hesitate to ask us).

2. The recommended usage is to use CONTRAfold's holdout cross-validation procedure to automatically select regularization constants. To reserve a fraction p of the training data as a holdout set, run CONTRAfold with the --holdout p flag. For example, to reserve 1/4 of the training set for holdout cross-validation, use

       contrafold train --holdout 0.25 input/*.bpseq

Note that the --holdout and --regularize flags should not be used simultaneously.

6 Visualization of folded RNAs

Besides the main program, the CONTRAfold package contains some additional tools for visualization of folded RNAs:

• make_coords generates a set of coordinates for plotting a CONTRAfold BPSEQ file.
• plot_rna converts a set of coordinates and a BPSEQ file into a viewable PNG.

In the following subsections, we describe the installation and use of these two tools for RNA visualization.

6.1 Installation

Curr
