AAstretch Project

1. the option above, since only proteins matching the keywords will be processed by AAstretch.

data_source: This points to the genomic file that AAstretch will use as input.

Synchronization with codons

To synchronize, in AAstretch, means to locate which part of a coding sequence codes for a given segment of a polypeptide sequence. The main AAstretch program extracts polyAA repeats and their flanking sequences, and AAprepare generates, for each gene transcript in an organism, both the coding sequence and the corresponding protein sequence. The AAsync program takes the output of the AAstretch program and converts the polypeptide sequences into coding sequences. This step is fundamental for investigating the role of codons in amino acid repeats. The AAexplore program (see below) can in fact analyze the frequencies of both residues and codons, provided that the two input files are synchronized.

AAexplore

This program has been written for directly investigating the results of AAstretch, which basically emits a tab-separated text file with stretch-specific results, one per line. A screenshot of AAexplore is shown below.

[Screenshot: AAexplorer v1.0 window (by Matteo Ramazzotti, unifi.it, 2010), showing the control blocks File, Visualization, Selection and Records, with the Input/output, Y axis control, Background and Graph mode regions.]
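The synchronization idea described above can be illustrated with a minimal sketch. It is not the AAsync implementation: the function name `stretch_to_codons` and the 0-based, end-exclusive coordinates are assumptions made for illustration, and the sketch assumes that the coding sequence starts in frame at the first codon of the protein, so that residue i is encoded by nucleotides 3i to 3i+2.

```python
def stretch_to_codons(cds, start, end):
    """Return the codons encoding protein residues start..end-1.

    Hypothetical helper: coordinates are 0-based and end-exclusive,
    and the CDS is assumed to be in frame with the protein.
    """
    if start < 0 or end < start or 3 * end > len(cds):
        raise ValueError("stretch lies outside the coding sequence")
    segment = cds[3 * start:3 * end]
    # Split the nucleotide segment into consecutive triplets.
    return [segment[i:i + 3] for i in range(0, len(segment), 3)]


# Toy example: protein MAQQQQK, with a polyQ stretch at residues 2..5.
cds = "ATGGCTCAGCAACAGCAAAAG"
codons = stretch_to_codons(cds, 2, 6)
print(codons)
```

In this toy case the four glutamines map to an alternation of CAG and CAA codons, which is the kind of codon-level detail that only becomes visible once protein and coding sequences are synchronized.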
2. AAstretch Project User Manual

Philosophy of AAstretch

AAstretch is a collaborative project of the University of Florence (Italy) and the CNRS (France) aimed at a systematic evaluation of amino acid (AA) residue repeats in genomes across evolution. As a bioinformatics project, it arises from the need to expand the classical concept of poly-residue repeat, usually defined as a consecutive stretch of the same residue, to a more biologically meaningful definition that takes into account relatively small insertions and percentage thresholds to define the beginning and the end of a stretch.

The project started with a precise biological aim in mind: to discover the features, and deepen the knowledge, of the poly-glutamine (Q) repeats typical of triplet expansion diseases such as Huntington's disease, spinocerebellar ataxia and many others. Aggregation of polyQ-containing proteins in fact leads to the formation of fibrils that impair the cellular machinery and lead to organic or systemic dysfunctions.

To accomplish this task we conceived a computer program, AAstretch, that is able to scan a properly formatted set of protein sequences (and their corresponding coding sequences), extract from them pure or impure poly-residue stretches, and emit a tabular text output in which a set of features is reported for each stretch. Such features include their positioning on the sequence, their flanking regions, the annotations and GO terms of the contain
3. codon analysis must have been performed with the program AAsync.

The Stretches tab reports a main upper bar plot with the genomic ratios of all residues against the background (which can be adjusted) or simple counts (see the description of the upper block), and a lower bar plot describing, for each residue, the length of the pure stretches in the main stretch. This is useful for investigating whether the main stretch is interrupted by homopolymers.

The Gaps tab reports something similar to the Stretches tab, except that the residue specified in AAstretch is excluded from the counts, both in the background and in the stretches, in order to obtain an unbiased ratio and fully appreciate the over- and under-representation of the various residues.

The Left Flanks tab reports the same graphs seen for the stretches, but in this case the counts are based on the region that the user selected as flanking the stretch at the N-terminal. The length of this region can be changed in AAstretch, not in AAexplore. An additional graph reports the bias (genomic ratio) of selectable residues at the different positions of the flanks (note that they all share the same length), which is important for evaluating the topology of residues in the flank.

The Right Flanks tab is identical to the left one, except that in this case the C-terminal is taken into account. Please note that the numbering indicates the position from where the s
4. homopolymer of a residue inside a protein sequence. This imposes the minimal length of the seed.
seed_seed_max_size: This imposes the maximal length of the seed.
seed_stretch_min_aa_perc: The minimal % of the selected residue in the whole stretch.
seed_gap_max_size: The maximal length of gaps within homopolymers of the desired residue.

Rich engine
rich_win_length: The length of the sliding window over which the % of the desired residue is calculated.
rich_gap_tolerance: The length of a gap that can be bypassed even if the % falls below the lower limit imposed.
rich_stretch_min_size: The minimal length of the whole stretch.
rich_stretch_min_aa_perc: The minimal % of the selected residue in the whole stretch.

Patt engine
patt_stretch_min_size: The minimal size of a stretch.
patt_stretch_max_size: The maximal size of a stretch.
patt_gap_min_size: The minimal size of the gap, used to exclude homopolymers from the analysis.
patt_extrem_len: This controls the length of the homopolymers at the C and N termini of the stretch.
patt_gap_max_number: This controls how many gaps are tolerated.

3 Parameters for the workflow control

The workflow decides which programs to run when an instance of AAstretch is launched. The three programs involved are AAstretch, AAsync and AAexplore. A full analysis is performed when all opti
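The sliding-window percentage that rich_win_length refers to can be sketched as follows. This is only an illustration of the general technique, not the Rich engine itself: the function name `window_percentages` is hypothetical, and the gap tolerance and stretch-size rules described above are deliberately left out.

```python
def window_percentages(sequence, residue, win_length):
    """Percentage of `residue` in every window of `win_length` residues.

    Hypothetical helper: returns one value per window start position.
    """
    if win_length <= 0 or win_length > len(sequence):
        raise ValueError("window length must be between 1 and the sequence length")
    # Count the residue in the first window, then update the count
    # incrementally as the window slides one position at a time.
    count = sequence[:win_length].count(residue)
    percentages = [100.0 * count / win_length]
    for i in range(win_length, len(sequence)):
        count += (sequence[i] == residue) - (sequence[i - win_length] == residue)
        percentages.append(100.0 * count / win_length)
    return percentages


# Toy example: % of Q in windows of 3 residues along QQQAAQ.
values = window_percentages("QQQAAQ", "Q", 3)
print([round(v, 1) for v in values])
```

A scanner built on this idea would compare each window value with a minimal percentage threshold, which is where parameters such as rich_gap_tolerance would come into play.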
5. ing protein and much other information needed for a correct biological interpretation of the presence of that stretch in the sequence. If coding sequences are available for the proteins, this kind of information can also be investigated using what we call the synchronization output, i.e. a cds version of the AAstretch output (see below). This rather large panel of data can be graphically displayed in AAexplore, a GUI-based tool specifically developed to get the best out of AAstretch results.

Even if in principle any sequence or sequence set can be scanned with AAstretch, the best results are obtained by working on whole genomic sets. We therefore developed an automatic builder, AAprepare, that links to the EnsEMBL genomic database and, taking advantage of the BioMart services, prepares organism-specific annotated gene sets for the analysis with AAstretch. You can find ready-to-use files for a number of organisms spanning the different kingdoms of life in the Organisms section of the AAstretch website. Since this section is intended to be updated yearly, following EMBL genome releases, possibly all you need to do is download AAstretch, download your preferred organism and start analyzing. A full genome scan takes a minute or two on modern computers.

We do not have enough funding to maintain a web-based engine, so we kept AAstretch very simple to run: edit a configuration file to change rather intuitive parameters, then run the analysis through the i
6. ng and coloring the bars in the graphs.
Graph mode: the two modes of visualization, residue-based or codon-based, can be switched according to the investigation of interest.

Selection (with 2 regions)
Stretch length: from here one can limit the size of the stretches to be investigated.
Filter: from here one has access to the different elements of the AAstretch output, including annotations and detailed descriptions of each stretch, in order to selectively exclude from the analysis those stretches with unwanted properties (e.g. the % of AA, the position on the protein, or the fact that it is contained in the nucleus).

Records: this is a simple reminder that counts the total number of stretches available and those remaining after filtering.

The graphic block instead presents several tabs, each containing different output modes and different portions of the stretch.

The Residue Summary tab describes, with three graphs, some general features of the whole output:
A bar plot, with a dedicated binner, for investigating the distribution of the stretch lengths.
A scatter plot for evaluating whether the location of the stretch depends on the stretch length.
A scatter plot for evaluating whether there is a relationship between the stretch length and the % of the residue of interest in the stretch.

The Codon Summary tab is basically the same as above, except that all considerations are based on codons instead of AA residues (to view this section a
7. nteractive menu.

Installation

To install the programs, simply extract the files in the downloaded zip file into a working folder of your choice. This will leave you with three files: AAstretch.pl (the main scanner), AAsync.pl (for codon synchronization) and AAstretch.conf (the configuration file). All the .pl scripts must be placed in the working directory, along with the genomic files (see below). AAexplore.pl should be placed in the AAstretch working directory, but any other location is acceptable. AAprepare.pl can be placed anywhere and will put the prepared genomic files into its current directory.

Dependencies and modules

The AAstretch programs are entirely written in Perl, so Perl needs to be installed on the machine. On Linux and Mac OS X, Perl usually comes preinstalled with the system. On Windows, ActiveState maintains a well-curated Perl distribution (others are available, but this one is suggested). Only base modules are used in AAstretch, so a minimal Perl installation is sufficient for most purposes.
AAprepare additionally requires the Net::FTP, Archive::Extract and IO::Compress::Gzip modules.
AAexplore additionally requires the Tk, GD and GD::Graph modules.

Obtaining genomic files

Genomic files are Ensembl-derived fasta files containing protein sequences and the corresponding coding sequences, provided with transcript, protein and gene codes, descriptions, gene ontology annotations and OMIM annotations (in human only). They can be downloaded in zip format from the AAst
8. ons are set to 1, but in some cases it is useful to exclude some parts (e.g. in explorative sessions synchronization is not appropriate): this is done by setting the desired option to zero.
scan: If set to 1, instructs AAstretch to perform the scan according to the parameters specified above.
sync: If set to 1, after the proteome-based search for stretches it loads the coding sequences and creates an alternative version of the AAstretch output containing the coding sequences of the corresponding stretches and flanks.
explore: After the analysis is complete, automatically launches AAexplore for the investigation of the results obtained from the scan.

4 Parameters for the additional options

Additional options are so called because they do not affect the operative procedures of the programs, but they are far from being less important. In fact, from here one can set the input files and configure protein filters. Filters allow, for example, restricting the analysis to a smaller portion of the genome, given some previous results.
ignore: This is used to filter out entries whose description matches the words included here. Some basic knowledge of pattern matching is needed to take full advantage of this option: e.g. "ignore hypothetical" will discard all proteins annotated as hypothetical, while "ignore hypothetical|predicted" will additionally exclude predicted proteins, the | sign being the boolean OR.
Somewhat the contrary of
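The description filter discussed above can be sketched with a regular expression. This is an illustration only: the function name `filter_out` and the sample descriptions are hypothetical, the boolean OR is written as `|` as in Perl and Python regular expressions, and whether AAstretch matches case-sensitively is not stated in the manual (the sketch is case-sensitive).

```python
import re


def filter_out(descriptions, ignore_pattern):
    """Keep only the descriptions that do NOT match `ignore_pattern`.

    Hypothetical helper mimicking an "ignore"-style filter.
    """
    pattern = re.compile(ignore_pattern)
    return [d for d in descriptions if not pattern.search(d)]


# Made-up descriptions, for illustration only.
descriptions = ["hypothetical protein", "predicted kinase", "huntingtin"]

only_hypothetical_removed = filter_out(descriptions, "hypothetical")
both_removed = filter_out(descriptions, "hypothetical|predicted")
print(only_hypothetical_removed)
print(both_removed)
```

With a single word only the hypothetical entry is dropped; adding a second alternative with `|` drops the predicted entry as well.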
9. retch Project page. All the genomes available in Ensembl have already been processed and prepared, so they are ready to use with AAstretch. They only need to be extracted into the working folder, and AAstretch has to be configured to use them in the .conf file. For genomes absent from the organism list on the website (e.g. due to delays in the update after a new EnsEMBL release), AAprepare.pl can be run to build a brand new genomic file. Its use is straightforward: launch the program, select the database by typing the appropriate number, then select the organism name from the list, again by typing the correct number, and wait until the protein and the cds files are created.

Basic usage

Once both the scripts and the organism files are in the same folder, open the terminal prompt and cd into that directory. Launch AAstretch with "perl AAstretch.pl" and an interactive menu appears, listing the various possibilities:
1 Run: launches the scanner according to the parameters specified in the AAstretch.conf file.
2 Restart: used to delete previous analyses and recreate the starting environment.
3 Clean: used to delete previous analyses, including the organism files.
4 Isolate: moves all the files of the current analysis (results) into a new timestamped folder.
5 Help: offers a brief explanation of these points.
6 Quit: shuts off the program (same as ctrl-c).

At the first run, the program spends time loading the proteome into memory and, if req
10. [Screenshot of the AAexplore window, continued: tabs Residue Summary, Codon Summary, Stretches, Gaps, Left flanks, Right flanks, Flanks, GO/MIM, Text, Selected; main plot "Genomic ratio"; lower plots "Poly P in the right flanks" (x axis: Length of poly P) and "Topology of P in the right flanks" (x axis: Position of P).]

The window is organized in two blocks: the control block on the top and the graph block on the bottom. The control block is divided into 4 sub-blocks.

File (with 2 regions)
Input/Output: from there one can load input files and save graphs. Note that saving graphs is also possible by shift-clicking on the graphs themselves.
Background: from there one can change the default background model used when frequencies are calculated (see the statistics section below).

Visualization (with 2 regions)
Y axis control: from there one can change the Y axis of the main graph in the graph block and the threshold for evidenci
11. tretch left the sequence, so a distance of 2 means that it is 2 residues away from where the stretch ended (right flank) or started (left flank).

The Flanks tab is a simple combination of the left and the right flanks: the counts are summed up and expressed as a single entity called flanks.

The GO/MIM tab shows a bar plot of the distribution of the different GO terms or MIM descriptions in the stretches. It is basically useful for evaluating whether there is some kind of functional or topological enrichment in the results.

The Selected tab lists all the entries from the main AAstretch output that are currently under investigation, since some or many stretches may have been removed from the analysis due to filtering or uninteresting lengths.

In the right part there is a window that fills and refreshes automatically when one double-clicks on a graph, showing the data used to produce that graph. Basically, this is useful for reproducing the AAexplore images in external software since, as the name says, AAexplore is not intended to produce high-quality images but rather to rapidly explore the output of the AAstretch program.
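The position numbering of the flanks described above can be illustrated with a small sketch that counts a residue at each distance from the stretch. It is not the AAexplore code: the function name `residue_counts_by_distance` is hypothetical, flanks are assumed to be written N-terminal to C-terminal, and distance 1 is assumed to be the residue closest to the stretch.

```python
def residue_counts_by_distance(flanks, residue, left=False):
    """Count `residue` at each distance from the stretch.

    Hypothetical helper: index 0 of the result is distance 1.
    Left flanks are reversed so that distance grows away from the stretch.
    """
    if not flanks:
        return []
    length = len(flanks[0])
    if any(len(f) != length for f in flanks):
        raise ValueError("all flanks must share the same length")
    if left:
        flanks = [f[::-1] for f in flanks]
    return [sum(1 for f in flanks if f[i] == residue) for i in range(length)]


# Made-up flanks of length 3, for illustration only.
right_counts = residue_counts_by_distance(["PPA", "PAG", "APP"], "P")
left_counts = residue_counts_by_distance(["APP", "GAP"], "P", left=True)
print(right_counts)
print(left_counts)
```

Dividing such counts by a background expectation would give a per-position bias; how AAexplore defines that background is not covered in this part of the manual, so it is left out here.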
12. uested, to create isoforms details. Then the analysis starts and, according to the configuration, in the end it emits several files as output. They can be investigated in several ways. Once a run has finished, the programs

The AAstretch.conf file

This is the very heart of the AAstretch procedure: this simple text file configures AAstretch.pl to process the genomic files in the desired way. It is ideally divided into four main sections, which are described here.

1 General parameters
residue: (one-letter code) the residue whose stretches must be searched.
flank_start: the position, before the N-term or after the C-term of the stretch, at which the flanking regions begin.
flank_length: the length of the regions on the left and right side of a stretch.
scanmode: can be set to rich, seed or patt; this decides the core engine to be used (see below for details).
isoform_check: on or off; this removes duplication in stretches due to protein isoforms (see below for a working definition of isoform).

2 Parameters for the specific engines

AAstretch contains three search engines (seed, patt and rich) with different performances and scopes. The engine parameters are easily identifiable thanks to the prepended flag (seed_, rich_, patt_). The engine is selected above, so there is no need to put comment marks in this section to hide the configuration of the engines other than the one in use.

Seed engine
seed_seed_min_size: A seed is an
