Introduction to Bioinformatics Course

1. Bioinformatics Course August 2012

5 Post-translational modifications

After translation has occurred, proteins may undergo a number of post-translational modifications. These can include the cleavage of the pro-region to release the active protein, the removal of the signal peptide, and numerous covalent modifications such as acetylations, glycosylations, hydroxylations, methylations and phosphorylations. Post-translational modifications such as these may alter the molecular weight of your protein and thus its position on a gel. There are many programs available for predicting the presence of post-translational modifications; we will take a look at one for the prediction of type O-glycosylation sites in mammalian proteins. Remember that these programs work by looking for consensus sites, and just because a site is found does not mean that a modification definitely occurs.

NetOGlyc http://www.cbs.dtu.dk/services/NetOGlyc/

Prediction of type O-glycosylation sites in mammalian proteins. This program works by comparing the input sequence to a database of known and verified mucin-type O-glycosylation sites extracted from O-GLYCBASE.

Example: Human CD1D (sp|P15813|CD1D_HUMAN). You can paste the gene sequence (cd1d) from the course website.

At ExPASy > Post-translational modification:
• Click on the link to NetOGlyc.
• Paste your sequence in the box provided, in FASTA format.
• Check "generate graphics" and click the submit button.
• T
2. [Screenshot of the SRS start page: search box, search tips, and a News and Announcements panel.]

... SRS database(s) that you wish to search. The databases may be of various types, including:
• Sequence: Swissprot, sptrembl, PIR (protein) or EMBL, emblnew (DNA)
• Sequence-related: prosite, blocks, prints (protein motifs and alignments), repbase, restriction enzymes
• Protein3Dstructure: PDB, HSSP

For more information about the contents of a database, click on the relevant blue underlined hypertext link (UniProt, say).
• Click the box to the left of UniProtKB.
• Click on the Query Form tab at the top of the page. This will move you to a Query Form page that permits you to submit particular queries such as have been suggested at the beginning of this
3. RL Proc. Natl. Acad. Sci. U.S.A. 77:2611-2615 (1980)

b) GenBank
  LOCUS       ECRECA 1391 bp DNA BCT 12-SEP-1993
  DEFINITION  E. coli recA gene
  ACCESSION   V00328 J01672
  KEYWORDS
  SOURCE      Escherichia coli
  ORGANISM    Escherichia coli
              Eubacteria; Proteobacteria; gamma subdiv.; Enterobacteriaceae; Escherichia
  REFERENCE   1 (bases 1 to 1374)
  AUTHORS     Sancar A., Stachelek C., Konigsberg W. and Rupp W.D.
  TITLE       Sequences of the recA gene and protein
  JOURNAL     Proc. Natl. Acad. Sci. U.S.A. 77 (5), 2611-2615 (1980)

You can see that these two are obviously talking about the same sequence from E. coli, but the information is encoded in a rather different way. This makes no difference to us reading the text, but it causes problems when writing a program to interrogate a database.

Each database entry has a name (called ID or LOCUS) which tries to be mnemonic and marginally informative. More importantly, each has an accession number, which is arbitrary but which remains attached to the sequence for the rest of time. The organism might become reclassified and the gene may get renamed, so the ID is subject to change, but by noting the accession number you should always be able to identify and retrieve the sequence.

Note also that the original publication is cited. Usually there will be other papers documenting functional analysis, mutations, allelic variations, 3-D structure and so on.

Further down in the entry is annotation abo
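The point about programs having to cope with two encodings of the same record can be illustrated with a short Python sketch. It is only a minimal illustration, not how SRS or any real parser works: the function name `parse_header` is hypothetical, and the punctuation in the two sample records is assumed from the usual flat-file layout because the extracted text above has lost it.

```python
def parse_header(record):
    """Return (entry_name, accessions) from an EMBL or GenBank style header.

    Hypothetical helper: it only looks at the ID/LOCUS and AC/ACCESSION
    lines and ignores everything else in the record.
    """
    name = None
    accessions = []
    for line in record.splitlines():
        fields = line.split()
        if not fields:
            continue
        tag = fields[0]
        if tag in ("ID", "LOCUS"):            # entry name: EMBL vs GenBank tag
            name = fields[1].rstrip(";")
        elif tag in ("AC", "ACCESSION"):      # accession numbers
            accessions.extend(f.rstrip(";") for f in fields[1:])
    return name, accessions


# Sample headers; the punctuation is an assumption, not taken from the source.
EMBL_RECORD = """ID   ECRECA standard; DNA; PRO; 1391 BP.
AC   V00328; J01672;
DE   E. coli recA gene"""

GENBANK_RECORD = """LOCUS       ECRECA 1391 bp DNA BCT 12-SEP-1993
DEFINITION  E. coli recA gene
ACCESSION   V00328 J01672"""

print(parse_header(EMBL_RECORD))
print(parse_header(GENBANK_RECORD))
```

Both calls return the same name and accession list, which is the practical reason for keying on the accession number rather than on the layout of any one database.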
4. Translate your nucleotide (DNA/RNA) sequence to a protein sequence on the ExPASy server. Input type: DNA/RNA sequence.
• GeneMark: Predict ORFs in your sequence. Input type: DNA sequence.

Protein
• Entrez: Retrieve a protein sequence from GenBank at the NCBI server. Input type: key words.
• ProtParam: Calculate different physico-chemical parameters of a protein sequence. Input type: protein sequence.
• PSIPRED: Secondary structure prediction. Input type: protein sequence.
• ScanProsite: Search for PROSITE patterns in your protein sequence. Input type: protein sequence.

Database
• BLASTN: Search a nucleotide sequence against GenBank on the NCBI server. Query type: DNA sequence.
• BLASTP: Search a protein sequence against the protein sequences on the NCBI server. Query type: protein sequence.
• Blocks Searcher: Search the BLOCKS database for similarity. Input type: protein or DNA sequence.
• PUBMED: Search the bibliographic database at NCBI. Input type: key word.

• Primer3: Create PCR primers and hybridization oligos from a DNA sequence. Input: DNA sequence.
• CODEHOP: Pick primers from a multiple alignment of protein sequences. Input type: BLOCKS.

Alignment Tools
• ClustalW: Align multiple sequences. Input type: DNA or protein sequences.
• Block Maker: Finds conserved blocks in a group of two or more unaligned protein sequences. At least two protein sequences must be provided to make blocks. Each sequence must have a u
5. http://mbcf.dfci.harvard.edu/docs/oligocalc.html

Tool to calculate the length, GC content, melting temperature (Tm: the midpoint of the temperature range at which the nucleic acid strands separate), molecular weight, and what an OD of 1 is in picomolar, for your input nucleic acid sequence. Many of these parameters are useful in primer design (see next section) and in other areas of molecular biology.
• Go to the URL above.
• Paste the phosphoglycerate kinase gene sequence from Trypanosoma brucei in the box provided and click Calculate.

Example: >Tb927.1.700 phosphoglycerate kinase, Trypanosoma brucei
  Length: 1323
  GC content: 49%
  Tm: 84 °C
  Molecular weight: 409839 daltons (g/M)
  OD of 1 = 69 picomolar

EMBOSS
• dan: Calculates DNA, RNA/DNA melting temperature.
• eprimer3: Picks PCR primers and hybridization oligos.

Protein Sequence Analysis

TOPICS
• Physico-chemical properties
• Cellular localization
• Signal peptides
• Transmembrane domains
• Post-translational modifications
• Motifs & domains
• Secondary structure
• Other resources

ExPASy http://www.expasy.ch

The ExPASy (Expert Protein Analysis System) protein and proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures. Besides the tools that we will introduce in this manual t
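Two of the quantities reported above, length and GC content, are simple enough to compute by hand. The Python sketch below shows them, together with the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C), which is only a rough estimate for short oligos. The function names are hypothetical, and this is not a description of how the web tool or EMBOSS dan calculates Tm.

```python
def gc_content(seq):
    """Percentage of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def wallace_tm(oligo):
    """Rough melting temperature (deg C) for a short oligo.

    Wallace rule: 2 degrees per A/T plus 4 degrees per G/C. It is a rule of
    thumb for oligos of roughly 14-20 bases, not for whole genes.
    """
    oligo = oligo.upper()
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


example = "GATTCGTACG"
print(len(example), gc_content(example), wallace_tm(example))
```

For a full-length gene such as the 1323 bp example, tools use salt- and length-dependent formulas instead, so the numbers from this sketch should not be expected to match the Tm reported above.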
6. [Screenshot: Ensembl view of an exon sequence aligned to the genome, with the amino acid translation shown beneath each row.]

[Screenshot: Ensembl "BLAST and BLAT aligners" page. Important notice: BLAT is now the default DNA search, which makes queries faster. Enter the query sequence by pasting sequences (max 30, in FASTA or plain text), by uploading a file containing one or more FASTA sequences, or by entering a sequence ID or accession (EMBL, UniProt, RefSeq).]

Ensembl Exercises
1. Open the home page of Ensembl. Search for the human gene GFAP. Select the Ensembl gene ENSG00000131095.
2. How many transcripts does this gene have?
3. What is the genomic location of the gene?
4. What is the length of each transcript, and how many exons does each have?
5. Choose the first transcript that has supporting evidence. Copy the sequence for the first exon, including the UTR (untranslated region).
7. ProfileScan http://hits.isb-sib.ch/cgi-bin/PFSCAN

Example: Human CFTR (sp|P13569|CFTR_HUMAN). You can paste the gene sequence (cftr) from the course website.
• Go to the URL above.
• Paste your sequence in the box provided. The sequence must be written using the one-letter amino acid code.
• Tick the motif databases you wish to search; other parameters should be OK.
• Press the scan button.

The output for this program is too large to show here, but it gives lots of detail about motifs in the CFTR protein, identifying potential ABC transporters family signature, ATP/GTP-binding site motif A (P-loop), protein kinase C phosphorylation sites, N-glycosylation sites, casein kinase II phosphorylation site, N-myristoylation sites, cAMP- and cGMP-dependent protein kinase phosphorylation site, bipartite nuclear localization signal, NACHT-NTPase domain profile, guanylate kinase domain profile, etc.

Remember that these programs only tell you that a motif is present, and thus that there is the potential for these modifications and functions to occur. It is up to you to determine experimentally which are real, but at least you now know what to look for.

7 Secondary Structure Prediction

If protein structure, even secondary structure, can be accurately predicted from the now abundantly available gene and protein sequences, such sequences become immensely more valuable for the understanding of drug desi
8. Submit the sequence as an answer to this question.
6. Now use BLAT and search with this same sequence of that first exon against the human genome. On the result, click on [C] (ContigView) for the alignment that is 100% identical. Does your BLAT hit correspond to the first exon? Export the picture and submit it as an answer to this question.
7. Again choose the same transcript that has the supporting evidence. Find the protein translation for this transcript. Copy and submit it as an answer to this question. What is the protein length? How many residues?
8. Click on the gene tab. How many paralogues and how many orthologues have been found for this gene? Explain the difference between orthologous and paralogous genes.
9. Click on Orthologues. Can you find a dog orthologue for this gene? If so, what is the chromosome location in the dog genome?
10. Click on the transcript tab and then click on Protein ID. What can you find about the protein family of this gene? Click on Domains & Features. Follow other links that might interest you and find out more about this gene.

SRS http://srs.ebi.ac.uk

The DNA databases are enormously rich information resources, partly because they are so big, but a database would make little sense if it consisted only of a long list of As, Ts, Cs and Gs. There are millions of individual entries in EMBL. An entry could be a fragment as short as 3 base pairs (e.g. M23994) or a large contig consisting o
9. html

Databases

Databases are of course the core resource for bioinformatics. There is plenty of software for analysing one or a few sequences, but many of the computationally interesting and biologically informative programs access databases of information. Frequently used classes are the biological sequence databases. These include:
• EMBL (European Mol. Biol. Lab.)
• GenBank
• DDBJ (DNA DB of Japan)

These three DNA databases exchange their data on a daily basis and so should be identical as to content. They are, however, rather different in format. Each of the databases cited above consists of a very large number of entries, each consisting of a single sequence preceded by a quantity of annotation that puts the sequence in its biological, functional and historical context. Without the annotation, GenBank would be a meaningless string of 32 billion As, Ts, Cs and Gs. Compare and contrast the two extracts from a) EMBL and b) GenBank (DDBJ has the same look and feel as GenBank).

a) EMBL
  ID   ECRECA standard; DNA; PRO; 1391 BP
  AC   V00328; J01672
  DT   09-JUN-1982 (Rel. 01, Created)
  DT   12-SEP-1993 (Rel. 36, Last updated, Version 4)
  DE   E. coli recA gene
  KW
  OS   Escherichia coli
  OC   Bacteria; Proteobacteria; gamma subdiv.; Enterobacteriaceae;
  OC   Escherichia
  RN   [1]
  RP   1-1374
  RX   MEDLINE; 80234673
  RA   Sancar A., Stachelek C., Konigsberg W., Rupp W.D.
  RT   Sequences of the recA gene and protein
10. we can easily determine the sequence of an unknown strand of DNA if its matching strand is known. For example, if one strand of a double helix has the nucleotide sequence GATTCGTACG, then its complementary strand will be CTAAGCATGC, forming a double helix:

  GATTCGTACG
  ||||||||||
  CTAAGCATGC

2 Translating DNA in 6 frames

Why six frames? DNA codes for amino acids using a three-letter genetic code (see Appendix II for the complete genetic code). Since we do not know where to start reading a DNA sequence, we need to look at six different options. For example, the sequence

  GATTCGTACG
  ||||||||||
  CTAAGCATGC

can be translated into six different amino acid strings. Looking at each strand separately (the lower strand is written here left to right for simplicity; translation tools read it 5' to 3', as the reverse complement):

  GATTCGTACG                         CTAAGCATGC
  1  GAT TCG TAC G   Asp Ser Tyr     4  CTA AGC ATG C   Leu Ser Met
  2  G ATT CGT ACG   Ile Arg Thr     5  C TAA GCA TGC   Stop Ala Cys
  3  GA TTC GTA CG   Phe Val         6  CT AAG CAT GC   Lys His

Translate tool http://www.expasy.ch/tools/dna.html

This tool allows the 6-frame translation of a nucleotide (DNA/RNA) sequence to a protein sequence, in order to locate open reading frames in your sequence.
• Go to the URL above.
• You can use the following phosphoglycerate kinase gene sequence from Trypanosoma brucei, or select from phospho kinase txt.

>Tb927.1.700 phosphoglycerate kinase Trypanosoma brucei
ATGACCCTTA ATCCGTGTTG CGATCAGCTC AGCC
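The complement and six-frame translation described in this section can be sketched in a few lines of Python. This is an illustrative sketch, not the code behind the ExPASy Translate tool; the function names are hypothetical. It reads the three reverse frames 5' to 3' along the reverse complement, which is the usual convention, and builds the codon table from the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(seq):
    """Base-by-base complement, written in the same left-to-right order."""
    return seq.upper().translate(COMPLEMENT)


def reverse_complement(seq):
    """The opposite strand read 5' to 3'."""
    return complement(seq)[::-1]


def translate(seq):
    """Translate complete codons from the start of seq ('*' marks a stop)."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Return the three forward and three reverse translations."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(seq[offset:])
        frames["-%d" % (offset + 1)] = translate(rc[offset:])
    return frames


print(complement("GATTCGTACG"))
for name, protein in sorted(six_frames("GATTCGTACG").items()):
    print(name, protein)
```

On a real gene you would then look for the frame with the longest stretch between a Met and a stop codon, which is what the verbose output of a translation tool helps you to see.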
11. GenPept

a) Swissprot
  ID   RECA_ECOLI STANDARD; PRT; 352 AA
  AC   P03017; P26347; P78213
  DT   21-JUL-1986 (REL. 01, CREATED)
  DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
  DT   15-DEC-1998 (REL. 37, LAST ANNOTATION UPDATE)
  DE   RECA PROTEIN
  GN   RECA OR LEXB OR UMUB OR RECH OR RNMB OR TIF OR ZAB
  OS   ESCHERICHIA COLI, AND SHIGELLA FLEXNERI
  OC   BACTERIA; PROTEOBACTERIA; GAMMA SUBDIVISION; ENTEROBACTERIACEAE;
  OC   ESCHERICHIA
  CC   FUNCTION: RECA PROTEIN CAN CATALYZE THE HYDROLYSIS OF ATP IN THE
  CC   PRESENCE OF SINGLE-STRANDED DNA, THE ATP-DEPENDENT UPTAKE OF
  CC   SINGLE-STRANDED DNA BY DUPLEX DNA, AND THE ATP-DEPENDENT
  CC   HYBRIDIZATION OF HOMOLOGOUS SINGLE-STRANDED DNAS. IT INTERACTS
  CC   WITH LEXA CAUSING ITS ACTIVATION AND LEADING TO ITS AUTOCATALYTIC
  CC   CLEAVAGE
  CC   INDUCTION: IN RESPONSE TO LOW TEMPERATURE. SENSITIVE TO
  CC   TEMPERATURE THROUGH CHANGES IN THE LINKING NUMBER OF THE DNA
  CC   DATABASE: NAME=E. coli recA Web page;
  CC   WWW=http://monera.ncl.ac.uk:80/protein/final/reca.htm
  KW   DNA DAMAGE; DNA RECOMBINATION; SOS RESPONSE; ATP-BINDING; DNA-BINDING;
  KW   3D-STRUCTURE
  FT   INIT_MET      0      0
  FT   NP_BIND      66     73       ATP
  FT   CONFLICT    112    112       D -> E (IN REF. 5)
  FT   TURN          4      4
  FT   HELIX         5     21
  FT   HELIX        23     25
  FT   TURN         29     30
  etc. etc.

b) PIR
  >P1;RQECA
  recA protein - Escherichia coli
  C;Species: Escherichia coli
  C;Date: 31-Jul-1980 #sequence_revision 14-Nov-1997 #text_change 14-Nov-1997
  C;Accession: G65049; A93847; A93
12. [NetOGlyc tabular output: per-residue columns of prediction scores; the values are not recoverable from the extracted text.]

[Figure: NetOGlyc 3.1, predicted O-glycosylation sites in sp P15813 C. Y-axis: O-glycosylation potential. X-axis: sequence position.]

6 Motifs and Domains

If you want to determine the function of a protein, the first tool of choice is homology searching (see day 4). Unless this finds you a match with a well-characterized protein covering the entire length of yours, you should look for motifs and domains in your protein. To determine if your protein sequence contains known motifs or conserved domain structures, you should search the protein against one of the motif or profile databases. There are many of these available, but we will discuss ProfileScan (now called myHits), which allows you to search both the Prosite and Pfam databases simultaneously. See the documentation for more details.
13. Q1 and Q2 may refer to earlier queries in this SRS session (osteonectin), so use good judgement.

You have just used a boolean logical expression to yield sequences which a) are human and b) have calmodulin in the SwissProt description. This shows you how unreliable it can be to depend on the annotation to get homologous sequences. Nevertheless, the list should contain the SwissProt entry for CALM_HUMAN.

Questions
1. Can you think of a better way to find other mammalian calmodulin genes?
2. If you do a search in SwissProt for calmodulin using the AllText descriptor instead of Description, you find many more entries. Why do you think you get more entries under this search?
3. There are more entries in SwissProt under Organism dog than Author dog, but more for Author wolf than Organism wolf. Why do you think this is so?
4. Searching Organism mouse in SwissProt yields some plant sequences; prove this by finding sequences matching Organism mouse & Taxon viridiplantae. Why is this so? (Clue: append wildcard.)

You should be able to reveal the full SwissProt entry for any protein sequence. If you do this you will see several blue underlined hypertext links to related databases. Almost certainly at least one of these will be to EMBL and one to Medline. Probably one will be to the prosite motif database. If the 3-D structure is known, one link will be to PDB. Investigate these other databases to get as much relevant informa
14. (Nucleotide symbols, continued: U, M, R, W, S, Y, K, V, H, D, B, X = G or A or T or C)

Amino Acids

  SYMBOL  MEANING      CODONS                     IUB code
  A       Ala          GCT GCC GCA GCG            GCX
  B       Asp, Asn     GAT GAC AAT AAC            RAY
  C       Cys          TGT TGC                    TGY
  D       Asp          GAT GAC                    GAY
  E       Glu          GAA GAG                    GAR
  F       Phe          TTT TTC                    TTY
  G       Gly          GGT GGC GGA GGG            GGX
  H       His          CAT CAC                    CAY
  I       Ile          ATT ATC ATA                ATH
  K       Lys          AAA AAG                    AAR
  L       Leu          TTG TTA CTT CTC CTA CTG    TTR, CTX, YTR
  M       Met          ATG                        ATG
  N       Asn          AAT AAC                    AAY
  P       Pro          CCT CCC CCA CCG            CCX
  Q       Gln          CAA CAG                    CAR
  R       Arg          CGT CGC CGA CGG AGA AGG    CGX, AGR, MGR
  S       Ser          TCT TCC TCA TCG AGT AGC    TCX, AGY
  T       Thr          ACT ACC ACA ACG            ACX
  V       Val          GTT GTC GTA GTG            GTX
  W       Trp          TGG                        TGG
  X       Unknown                                 XXX
  Y       Tyr          TAT TAC                    TAY
  Z       Glu, Gln     GAA GAG CAA CAG            SAR
  *       Terminator   TAA TAG TGA                TAR, TRA

APPENDIX II: The Universal Genetic Code

  Second base U:  UUU UUC Phe;  UUA UUG Leu;  CUU CUC CUA CUG Leu;  AUU AUC AUA Ile (isoleucine);  AUG Met;  GUU GUC GUA GUG Val (valine)
  Second base C:  UCU UCC UCA UCG Ser (serine);  CCU CCC CCA CCG Pro;  ACU ACC ACA ACG Thr (threonine);  GCU GCC GCA GCG Ala (alanine)
  Second base A:  UAU UAC Tyr;  UAA UAG ter (STOP);  CAU CAC His;  CAA CAG Gln;  AAU AAC
15. letters and five digits, e.g. AAA12345.
• TrEMBL (Translated EMBL): O, P or Q followed by 5 letters/digits.
• PDB protein structure records: 1 digit and three letters (1HBA, 1TUP).

More recently, an attempt has been made to reduce the redundancy in the databases (there were 180 copies of D. melanogaster alcohol dehydrogenase, each with its own accession number). One result is RefSeq, NCBI's reference sequence database.

RefSeq: two letters, an underscore and six digits.
• mRNA records (NM): NM_000492
• Genomic DNA contigs (NT): NT_000347
• Curated annotated genomic regions (NG): NG_000567
• Protein sequence records (NP): NP_000483

We will see how RefSeq is becoming the central resource for gene characterization, expression studies and polymorphism discovery. Because of the high level of curation it requires, it is not anywhere close to being comprehensive, even for those species that are included.

Accession numbers give the community a unique label to attach to a biological entity, so we all know we are talking about the same thing. Sequences in databases evolve as their real biological counterparts do. They need to be updated, corrected and merged, and we need to know which version of the sequence entry is being referred to. GenBank has used gi numbers, and more recently version numbers, for this. Each small change made to a GenBank record gets the next gi number (e.g. 916995995) and so is totally arbit
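The accession formats described in this section can be turned into a small pattern matcher. The Python sketch below is only an illustration of those rules as stated in the text (it is not an official validator, and real accession formats have since been extended); the names `PATTERNS` and `candidate_databases` are hypothetical. Because several databases share the "one letter plus five digits" layout, the function returns every database whose pattern fits rather than a single answer.

```python
import re

# Patterns follow the descriptions in the text; they are assumptions of this
# sketch, not authoritative definitions.
PATTERNS = [
    ("RefSeq", re.compile(r"^(NM|NT|NG|NP)_\d{6}$")),
    ("GenBank/EMBL", re.compile(r"^([A-Z]\d{5}|[A-Z]{2}\d{6})$")),
    ("SwissProt", re.compile(r"^[OPQ]\d{5}$")),
    ("TrEMBL", re.compile(r"^[OPQ][A-Z0-9]{5}$")),
    ("PIR", re.compile(r"^[A-Z]\d{5}$")),
    ("GenPept", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("PDB", re.compile(r"^\d[A-Z]{3}$")),
]


def candidate_databases(accession):
    """Return the names of all databases whose accession pattern matches."""
    accession = accession.strip().upper()
    return [name for name, pattern in PATTERNS if pattern.match(accession)]


for acc in ("NM_000492", "AL234556", "P23445", "B93303", "AAA12345", "1HBA"):
    print(acc, candidate_databases(acc))
```

The ambiguous cases are the interesting ones: B93303 fits both the GenBank/EMBL and PIR patterns, which is the confusion between databases that the text warns about.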
16. the S score changes from a high to a low value. For each sequence SignalP will report the maximal C, S and Y scores, and the mean S score between the N-terminal and the predicted cleavage site. These values are used to distinguish between signal peptides and non-signal peptides. If your sequence is predicted to have a signal peptide, the cleavage site is predicted to be immediately before the position with the maximal Y score.

The human beta-defensin protein has a predicted signal peptide from position 1 to 21, and a potential cleavage site exists between positions 21 and 22. These predictions correspond exactly to the SWISS-PROT annotation for this protein (accession Q09753).

SignalP-NN result:
  >Sequence   length = 68
  Measure   Position   Value         Cutoff        Signal peptide?
  max. C    22         [illegible]   [illegible]   YES
  max. Y    22         0.761         [illegible]   YES
  max. S    14         0.998         0.87          YES
  mean S    1-21       0.943         0.48          YES
  D         1-21       0.952         [illegible]   YES
  Most likely cleavage site between pos. 21 and 22: ASG-GN

SignalP-HMM result:
  >Sequence
  Prediction: Signal peptide
  Signal peptide probability: 1.000
  Signal anchor probability: 0.000
  Max cleavage site probability: 0.818 between pos. 21 and 22

EMBOSS
• sigcleave: Reports protein signal cleavage sites.

4 Transmembrane domains

TMpred http://www.ch.embnet.org/software/TMPRED_form.html

The TMpred program makes a prediction of membrane-spanning regions and their orient
17. 0-87893-452-9
Fundamentals of Molecular Evolution. D. Graur and W.-H. Li. Sinauer, 2000. ISBN 0-87893-266-6.
PAUP* 4.0: Phylogenetic Analysis Using Parsimony (and other methods), Manual. David L. Swofford. Sinauer, 1999. ISBN 0-87893-801-X.
Introduction to Bioinformatics. T.K. Attwood & D.J. Parry-Smith. Addison Wesley Longman, 1999. ISBN 0582-32788-1.
Molecular Evolution: a phylogenetic approach. R.D.M. Page and E.C. Holmes. Blackwell, 1998. ISBN 0-86542-889-1.
Bioinformatics for Dummies. Notredame and Claverie, 2003.

Articles
Baldauf SL (2003) Phylogeny for the faint of heart: a tutorial. TIG 19(6):345-351.

APPENDIX I: Nucleotide and Amino Acid Codes

Nucleotides

  Nucleotide                        Abbreviation
  Adenosine                         A
  Thymidine                         T
  Cytosine                          C
  Guanosine                         G
  Uridine                           U
  Any nucleotide (A, T, C or G)     N
  G or A                            R
  A or T                            W
  C or T                            Y
  A or C                            M
  G or T                            K
  G or C                            S
  Not G (A or C or T)               H
  Not A (C or G or T)               B
  Not T (A or C or G)               V
  Not C (A or G or T)               D

Amino Acids

  Amino acid       One-letter   Three-letter
  Alanine          A            Ala
  Arginine         R            Arg
  Asparagine       N            Asn
  Aspartic acid    D            Asp
  Cysteine         C            Cys
  Glutamine        Q            Gln
  Glutamic acid    E            Glu
  Glycine          G            Gly
  Histidine        H            His
  Isoleucine       I            Ile
  Leucine          L            Leu
  Lysine           K            Lys
  Methionine       M            Met
  Phenylalanine    F            Phe
  Proline          P            Pro
  Serine           S            Ser
  Threonine        T            Thr
  Tryptophan       W            Trp
  Tyrosine         Y            Tyr
  Valine           V            Val

SEQUENCE SYMBOLS

Nucleotides

  IUB code   MEANING              COMPLEMENT
  A          A                    T
  C          C                    G
  G          G                    C
  T/U        T                    A
  M          A or C               K
  R          A or G               Y
  W          A or T               W
  S          C or G               S
  Y          C or T               R
  K          G or T               M
  V          A or C or G          B
  H          A or C or T          D
  D          A or G or T          H
  B          C or G or T          V
  X/N        G or A or T or C     X
  .          not A, C, G, T
18. 846; S11931; 563525; 563979; A03548
  C;Comment: The recA protein plays an essential role in homologous recombination, in induction of the SOS response, and in initiation of stable DNA replication.
  C;Genetics:
  A;Gene: recA
  A;Map position: 58 min
  C;Superfamily: recA protein
  C;Keywords: ATP; DNA binding; DNA recombination; DNA repair; P-loop; SOS response
  F;67-75/Region: nucleotide-binding motif A (P-loop)
  F;141-145/Region: nucleotide-binding motif B
  F;73/Binding site: ATP (Lys) #status predicted

Note that these two entries refer to the same gene from E. coli, despite differences in the way the data are encoded. However, in contrast to the difference between EMBL and GenBank, the quality of the annotation is quite different. The 3-D structure of this protein has been worked out, and this information is reflected in the SwissProt entry, where the position of every alpha helix and beta sheet is noted. In general, the quality of the annotation and the minimization of internal redundancy make SwissProt the preferred database to use. However, note that PIR records the genetic map position of the gene, so it is probably good to scrutinize both databases to extract maximal information.

SwissProt also gives added value by incorporating a large number of DR (database reference) tags pointing to equivalent information in other databases:

a) SwissProt
  DR   EMBL; V00328; G42673
  DR   EMBL; X55553; NOT_ANNOTATED_CDS
  DR
19. Second base A (continued):  AAG Lys;  GAU GAC Asp (aspartic acid);  GAA GAG Glu (glutamic acid)
  Second base G:  UGU UGC Cys;  UGA ter;  UGG Trp;  CGU CGC CGA CGG Arg;  AGU AGC Ser;  AGA AGG Arg;  GGU GGC GGA GGG Gly (glycine)
  [Axis label: Third Base of Codon]

Exceptions to the Universal Code

1. Yeast Mitochondrial Code: CUN=T, AUA=M, UGA=W
2. Mitochondrial Code of Vertebrates: AGR=*, AUA=M, UGA=W
3. Mitochondrial Code of Filamentous fungi: UGA=W
4. Mitochondrial Code of Insects and platyhelminths: AUA=M, UGA=W, AGR=S
5. Nuclear Code of Candida cylindracea (see Nature 341:164): CUG=S
6. Nuclear Code of Ciliata: UAR=Q
7. Nuclear Code of Euplotes: UGA=C
8. Mitochondrial Code of Echinoderms: UGA=W, AGR=S, AAA=N
9. Mitochondrial Code of Ascidaceae: UGA=W, AGR=G, AUA=M
10. Mitochondrial Code of Platyhelminthes: UGA=W, AGR=S, UAA=Y, AAA=N
11. Nuclear Code of Blepharisma: UAG=Q
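An exception table like the one above is easiest to read as "the universal code plus a few overrides". The Python sketch below shows that idea for the vertebrate mitochondrial code; it is a minimal illustration with hypothetical names (`make_code`, `translate`), and it assumes that AGR in that code denotes a stop, written here as `*`.

```python
from itertools import product

# Universal code with RNA codons, ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
UNIVERSAL = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Overrides for the vertebrate mitochondrial code (AGR = AGA and AGG).
VERTEBRATE_MITO_CHANGES = {"AGA": "*", "AGG": "*", "AUA": "M", "UGA": "W"}


def make_code(changes):
    """Copy the universal code and apply a dictionary of codon overrides."""
    code = dict(UNIVERSAL)
    code.update(changes)
    return code


def translate(rna, code=UNIVERSAL):
    """Translate complete codons from the start of rna ('*' marks a stop)."""
    rna = rna.upper().replace("T", "U")
    return "".join(code[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


message = "AUGUGAAUAAGA"
print(translate(message))                                      # universal code
print(translate(message, make_code(VERTEBRATE_MITO_CHANGES)))  # vertebrate mitochondria
```

The same message gives a different protein under the two codes, which is why translation tools ask which genetic code to use for mitochondrial or ciliate sequences.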
20. [Sequence columns as extracted; the column order and some characters were damaged in extraction.]
ACCTCG GGCEGLEOTTIC GAACTGCTAT TCTAAGATGT GGCAGCAAGA GTTTACATCA CCAAAGATTT GCTAAGGTAC AGCGACAAGA GGTGCAATGG GAGGAAAGTA CAGGTTATTC TTGATAACTG ACTATTGAAA ATGGGTGTAT GGTCGAGGAA GCAGCTGAGT TCTTTGGAAC GCGGTTGTGT TAA
ACGAGAAGAA ACTTTAATGT TGCCAACGCT GGAGGCCGAA OGGGTTICCA TGAGGCCCGT CTCCGGGCGA AGGCAAAAGA GTGATGCTTT TGGGCAACGG TTGGTAACCC TCCAACTTCT CATACACATT AACTTGAATT TTCCAATTGA AGGATCAAAA AATATGTTCA TTGAAATGGT CTCACGAGCA TGAGCGGTGA ICCTCGAGGG CGTATGCE LC GAGCATTAAT TCCCGTGAAA CAAGAAGGTT AGGTATTCCC ACAGAAGGCA CACATTCGCA DOTTGOTTICTG ACGTGAAGCC TGGTACAGCT TELLGCCGET SCCGCGTCCG GGATAACATG TCTGAAGGCT TGCTCGATCC TOATGITIGC CATCCCTGAA GACGATTGGG TOCCTITATLTOO IGGACTCATG GGCGAAGCGC CAAAACGCTT TGCAGGTACT
GAATGCGATC AACGGTAAGA CTCACAGAAG ATGGCGCAAG ACACTCAAAC QGUIGACTGOUC CTTGAAAATG ATGGCCAAGA CACCGTGACA TATTTGATGG CIGGLIGUDA TTGCAGCGCA CAGGGTTACA CTGCTGAAGA CACACGGAAT GGACATATGG AAGTGTAAGA AAAGGTACAT AGTATCATCG AT

• Paste your sequence in the box provided & click TRANSLATE SEQUENCE.
• You can choose 3 options:
  o Verbose: puts Met & Stop to highlight start & stop codons.
  o Compact: useful if you want to use the output in other programs.
  o Includes nucleotide sequence: the nucleotide sequence is shown above the translation.
• This returns a 6-frame translation of your sequence. You can then choose the correct frame.

EMBOSS
• transeq: Translate nucleic acid sequences.
21. Joint BecA Hub and UNESCO Advanced Genomics and Bioinformatics: Viral/Bacterial Metagenomics and Next-generation Sequencing Workshop

Introduction to Bioinformatics Course
13th-17th August 2012, ILRI-BecA, Nairobi

Etienne de Villiers
International Livestock Research Institute, Nairobi, Kenya

Acknowledgements

This course was adapted from a course designed and implemented by David Lynn and Andrew Lloyd while working at the Education and Research Centre (ERC) at St. Vincent's University Hospital, Dublin. The original course and manual implemented by David Lynn grew naturally from The ABC Bioinformatics Course, an earlier Irish National Centre for BioInformatics (INCBI) project based on GCG and the WWW, to which Aoife McLysaght (TCD) was a major contributor. That in turn owes a debt of gratitude to the ABCT tutorial designed by Rodrigo Lopez when he was the Norwegian EMBnet node. This course would never have got off the ground without the encouragement of Cliona O'Farrelly, the Research Director at the Education and Research Centre (ERC) at St. Vincent's University Hospital. The development of the original course was funded by the Dublin Molecular Medicine Centre and the Conway Institute, University College Dublin. The Multiple Alignment and Phylogenetics sections were adapted from a course develope
22. CAGAAAAATAGGAGGAATTCGTGGGCTTATTGTTCGGTCATTG
GATTGGACCCTTCGCCGGTTTTCTGCGCATATAGTGCAGTGTAATTGGATTTACCAGAA
GTCTAGCGGACCCTCTTTACTATTCCCAGCTGTGATGGCGGTTTGTTGTGGTCCTACAA
GGGTCTTTGTGGACCGGGGTTTTTAGAATACGGACAATCCTTTGCCGATCGCCTGCGGC
GAGCGCCCCATCGGTTTGTGGTCAGAATATTGTCAATGCCATCGAGAGATCGGAGAATG
GATTGCGGATAAGGGTGGACTCGAGTTGGTCAAATTGCATTGGTACTCGACCAGTTCGT
GCACTATATGCAGAGCAATGATGAAGAGGATCAAGAT

This sequence is written in Fasta format (see below for sequence formats). A computer could do it quicker, but it is still trivial to do it by eye, especially as one of the sites has been picked out in bold. Can you find the other(s)?

Sequence analyses impossible without a computer include, but are not limited to, most operations that involve the sequence databases. The DNA databases (GenBank, EMBL, DDBJ) are curated by three different groups in Bethesda MD, Hinxton UK and Mishima JP, but because they exchange information on a daily basis they should be effectively the same in content. The DNA databases are doubling in size about every year; they currently (Oct 2008) comprise >90 million sequences and 99,116,431,942 base pairs. So finding all of the EcoRI sites in GenBank, or even in the whole of a printed copy of the human genome (3,200,000,000 bp), would take more than a few minutes.

This course will introduce you to some of the more commonly used bioinformatics tools, tell you how to use them and, more importantly, how to use them correctly or at least more effectiv
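Searching for a restriction site by computer rather than by eye can be sketched as a plain substring search. The Python example below assumes the EcoRI recognition sequence GAATTC and uses a hypothetical function name, `find_sites`; it reports 1-based start positions on the strand given (EcoRI's site is palindromic, so the same positions apply to the other strand).

```python
def find_sites(seq, site="GAATTC"):
    """Return the 1-based start positions of every occurrence of site in seq."""
    seq = seq.upper().replace(" ", "").replace("\n", "")
    site = site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + 1)          # convert 0-based index to 1-based
        start = seq.find(site, start + 1)    # continue just past this hit
    return positions


# A short made-up test sequence with two EcoRI sites.
print(find_sites("AAGAATTCGGGAATTCTT"))
```

Pasting the sequence from this page into `find_sites` gives the positions you were asked to find by eye; the same loop works on a whole genome, only more slowly.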
23. EMBL; AE000354; G61789051
  DR   EMBL; D90892; G61800085
  DR   PIR; A03548; RQECA
  DR   PIR; S11931; S11931
  DR   PDB; 1REA; 31-OCT-93
  DR   PDB; 2REB; 31-OCT-93
  DR   PDB; 2REC; 01-APR-97
  DR   PDB; 1AA3; 23-JUL-97
  DR   SWISS-2DPAGE; P03017; COLI
  DR   ECO2DBASE; C039.3; 6TH EDITION
  DR   ECOGENE; EG10823; RECA
  DR   PROSITE; PS00321; RECA; 1
  DR   PFAM; PF00154; recA; 1

When these are used as hypertext links, they can enable a WWW browser to locate an extraordinary depth of detail about a given entry: 3-D structure (PDB), protein motifs (Prosite), families of related genes (Pfam), the DNA sequence (EMBL) and a couple of specialist E. coli added-value databases. SRS is one program that makes these hypertext links. The PIR cross-references are far fewer and less explicit: its reference to GenBank (GB:U00096) refers to the whole E. coli genome, whereas SwissProt points specifically to the gene (DR EMBL; V00328).

b) PIR
  A;Cross-references: GB:AE000354; GB:U00096; NID:g2367149; PID:g1789051; UWGP:b2699

All these databases are made up of entries concatenated one after the other in plain readable text. As such they are far bigger than necessary if you are trying to analyze the sequence rather than interrogate or browse the annotation. For these purposes, special highly compressed databases can be constructed. Frequently these are not readable by humans, because they have been optimized for speed-reading computers. One of the simplest compression protocol
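The way a program such as SRS can turn DR tags into links starts with pulling the database name and identifiers out of each line. The Python sketch below illustrates that first step only; the function name `cross_references` is hypothetical, and the semicolon and full-stop punctuation in the sample lines is assumed from the usual SwissProt flat-file layout, since the extracted text above has lost it.

```python
def cross_references(entry_text):
    """Group the DR (database reference) lines of an entry by database name.

    Returns a dictionary mapping each database to a list of identifier
    lists, one per DR line.
    """
    refs = {}
    for line in entry_text.splitlines():
        if not line.startswith("DR "):
            continue
        body = line[2:].strip().rstrip(".")
        fields = [f.strip() for f in body.split(";")]
        database, identifiers = fields[0], [f for f in fields[1:] if f]
        refs.setdefault(database, []).append(identifiers)
    return refs


# Sample lines; the punctuation is an assumption, not taken from the source.
SAMPLE = """DR   EMBL; V00328; G42673; -.
DR   PIR; A03548; RQECA.
DR   PDB; 2REB; 31-OCT-93.
DR   PDB; 2REC; 01-APR-97.
DR   PROSITE; PS00321; RECA; 1."""

for database, hits in cross_references(SAMPLE).items():
    print(database, hits)
```

A link generator would then map each database name to a URL template and fill in the first identifier, which is an accession number that stays stable over time.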
24. GTCTCATG CCCEGCGCTIG GGAACTCTTT TTAAGGGAAA TCACCAACGA GCGGCAGTTG CTGACAAAAT CGGTAGCCAA TGAATGCTGC TACGCTTTTA FETE GTGCTACCAT AGAAGGAGAT TOGITGGTEG TCGATTATCT GCATTGGAAA AGGCGGAGGA TCAAAGCTGT CTCTGGATAT GCGCCATTTG TTGCAATTGC GTGGTGGTGA TTTCAACTGG CAGTATTGGA CTAACCGGTG GAAGGTTCTT CTACCGAATC TOGTTCTCATG ACGGAGCACT SLETTA AGATGTCGTC CAAAGAAGAG ATATGGTGAT GACCGGAATT TTCATACTTC AGCGAAAGTG CTTAATTGGT ATCGAAGTGC CCGCAAGGTG GGATTCTCCA TGGTCCCAAG GAACGGTCCC GAAAGCCATG CAGCGCAAGT TGGTGGTGCG CGAAAAGTCG GAGLUTOTETT

3 Reverse Complement & other tools

There are many cases where you might want to obtain the reverse complement of a DNA sequence; for example, the reverse complement is needed as a negative control when doing a DNA hybridisation experiment.

SEquence analysis using WEb Resources (SeWeR) http://www.bioinformatics.org/SeWeR/

SeWeR is an integrated portal to common web-based services in bioinformatics. It has a large number of tools available online.

[Screenshot of the SeWeR page: boxes for "Accession from Genbank" and "Pubmed Query", and a menu with Home, Nucleic Acid, Protein, Database, PCR, Alignment Tools and Bookmarklets.]

Nucleic Acid
• Entrez: Retrieve a DNA sequence from GenBank at the NCBI server. Input type: key words.
• Webcutter: One of the best programs for restriction analysis. Input type: DNA.
• Translate
25. PKAEIEGE

2 Staden: named after Rodger Staden, an early but still extant software writer. Same as raw sequence.

MAIDENKQKALAAALGQIEK ALGAGGLPMGRIVEIYGPES TPKAEIEGE X

3 NBRF/PIR: named after the protein database.

>P1;ecrgcg.pep
ecrgcg.pep, 354 bases, 218 checksum.
MAIDENKQKA LAAALGQIEK ALGAGGLPMG RIVEIYGPES TPKAEIEGE X

Accession numbers

The information above makes you aware of the diversity of ways in which something as simple as a one-dimensional sequence may be represented. Another source of confusion is the variety of identifying numbers attached to sequences, and knowing to which database they refer. Accession numbers are used as unique and unchanging identifiers. They are not mnemonic, although databases also have a less stable, more memorable nomenclature: HBB_HUMAN, HSHBB, HUMHBB and 2HBB are all human beta-globin IDs in various databases.

• GenBank/EMBL: accession numbers were originally a letter followed by 5 digits (X32152, M22239). When the number of sequences exceeded 2,600,000, the format became 2 letters followed by 6 digits (AL234556, BF345788).
• SwissProt: still one letter followed by 5 digits; the letter is O, P or Q (P23445).
• PIR (the other protein database): one letter followed by 5 digits, but the numbers are confusable with EMBL/GenBank: B93303 is chimp haemoglobin in PIR but a random genomic clone fragment in EMBL.
• GenPept: conceptual translations from DNA that have not yet been annotated well enough to get into SwissProt; three
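The accession-number rules listed in this excerpt can be expressed as simple patterns. The following is a minimal sketch in Python, assuming only the formats as stated in the text (real accession formats have since been extended, so this is not a validator); the names `PATTERNS` and `possible_sources` are hypothetical and chosen for illustration.

```python
import re

# Patterns follow the rules as described in the text above; they are a
# simplification and do not cover later extensions of these formats.
PATTERNS = [
    ("GenBank/EMBL, original style (1 letter + 5 digits)", re.compile(r"[A-Z]\d{5}")),
    ("GenBank/EMBL, newer style (2 letters + 6 digits)", re.compile(r"[A-Z]{2}\d{6}")),
    ("SwissProt (O, P or Q + 5 digits)", re.compile(r"[OPQ]\d{5}")),
    ("PIR (1 letter + 5 digits)", re.compile(r"[A-Z]\d{5}")),
]

def possible_sources(accession):
    """Return every database whose stated format matches the accession."""
    acc = accession.strip().upper()
    return [name for name, pattern in PATTERNS if pattern.fullmatch(acc)]

for example in ("X32152", "AL234556", "P23445", "B93303"):
    print(example, "->", possible_sources(example))
```

An accession such as B93303 matches more than one pattern, which illustrates the point made in the text: the format alone does not tell you which database an accession number belongs to.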
26. SAA gene sequence for Canis familiaris
2. What Prosite motif defines the recA family of prokaryotic proteins? Which Dublin-based phylogeneticists used multiple sequence alignment to define this motif?
3. What are the first and last 5 bases in the intron of the yeast actin gene with EMBL accession number V01288?
4. What is the map position of one of the human SAA genes (SwissProt P02735)? What cross-reference database is most likely to have the map position?
5. What mutation at what position causes phenylketonuria (PKU)? (Hint: EMBL K03020, but then try SwissProt P00439.)
6. What bases define the ribosome binding site of the Bacteroides fragilis glnA gene? Perhaps start from the E. coli homolog (SwissProt P06711).
7. Why is the name Saarinen associated with life-threatening cardiac arrhythmias? (Hint: not because of architectural flaws; try voltage-gated potassium channels.)
8. Are there more publicly available DNA sequences from rodents or prokaryotes? What about protein sequences?
9. Get a sample of mammalian introns. See what common features they have. Think how these common features might help in splicing out the introns.

Entrez
http://www.ncbi.nlm.nih.gov/Entrez

Entrez is the US equivalent of SRS and is available from the NCBI webpage. You will most likely be familiar with Entrez for interrogating Medline, but the same engine can be pointed at DNA and protein databases. It is handy if you are familiar with the Entrez system and you want a
27. Signal peptides
4 Transmembrane domains
5 Post translational modifications
6 Motifs and Domains
7 Secondary Structure Prediction
Printed sources about Bioinformatics and the Internet
APPENDIX
APPENDIX 2

Bioinformatics Course August 2012

Introduction

This course is designed to impress upon you that computers and the Internet can not only make your work as a biologist easier and more productive, but also enable you to answer questions that would be impossible without computational help. Thus there are some computational analyses that you could conceivably do on the back of an envelope or with a pocket calculator, and there are others so computationally demanding that you would not attempt them without electronic help. An example of the first would be to scan the following DNA sequence for EcoRI restriction endonuclease sites (GAATTC):

>Adhr D. melanogaster
ATGTTCGATTTGACGGGCAAGCATGTCTGCTATGTGGCGGATTGCGGAGGGAGACCAGC
AAGGTTCTCATGACCAAGAATATAGCGAAACTGGCCATTCGGAAAATCCCCAGGCCATC
GCTCAGTTGCAGTCGATAAAGCCGAGTACTTCTGGACCTACGACGTGACCATGGCAAGA
ATTCATATGAAGAAGTACTGATGGTCCAAATGGACTACATCGATGTCCTGATCAATGGT
GCTACGCTGATAACATTGATGCCACCATCAATACAAATCTAACGGGAATGATGAACACG
TGTTACCCTATATGGA
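The back-of-an-envelope scan for EcoRI sites described above can also be written as a few lines of code. This is a minimal sketch in Python, not part of the course material; the function name `find_sites` is hypothetical, and the fragment used is two consecutive lines copied from the Adhr sequence above. Note that the site spans a line break, which is why whitespace is removed before searching.

```python
import re

def find_sites(seq, site="GAATTC"):
    """Return the 1-based start positions of every occurrence of `site`."""
    # Drop whitespace, digits and line breaks so sites that span lines are found.
    cleaned = re.sub(r"[^A-Za-z]", "", seq).upper()
    site = site.upper()
    positions = []
    start = cleaned.find(site)
    while start != -1:
        positions.append(start + 1)  # convert 0-based index to 1-based position
        start = cleaned.find(site, start + 1)
    return positions

# Two consecutive lines taken from the Adhr sequence in the text.
fragment = """GCTCAGTTGCAGTCGATAAAGCCGAGTACTTCTGGACCTACGACGTGACCATGGCAAGA
ATTCATATGAAGAAGTACTGATGGTCCAAATGGACTACATCGATGTCCTGATCAATGGT"""

print(find_sites(fragment))
```

Positions are reported relative to the start of whatever text is passed in, so scanning the whole gene rather than a fragment gives coordinates in the full sequence.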
28. You should also be aware of the InterPro project, which incorporates and sorts data from a diversity of protein motif and domain databases into one searchable meta-database.

Sequence formats

As we have seen comparing database entries above, there are dozens of different ways in which you can store or represent the same fundamental information. Databases are often compiled in highly conventionalized, readable English text. Computers, being not so bright, will have difficulty reading and interpreting the information unless the conventions are quite rigidly obeyed. There are a very large number of ways you can write, store and transmit simple one-dimensional sequence files. A common sequence interchange program called readseq recognizes at least 22 different file formats. If a computer program does not recognize the format of an input sequence it may not work or, worse, misinterpret header lines as sequence data or otherwise mangle your analysis. The EMBOSS package can also convert between different sequence formats.

EMBOSS
seqret: Reads and writes (returns) sequences in different formats. It can also read in a sequence from a database and write it to a file.

Some commonly used sequence file formats are shown below.

1 Fasta: named for a widely used homology searching program; single title line beginning ">".

>ECRGCG TRANSLATE of ecrgcg 1 to 1062
MAIDENKQKALAAALGQIEK ALGAGGLPMGRIVEIYGPES T
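To make the Fasta convention concrete (one title line beginning ">" followed by sequence lines), here is a minimal sketch of a reader in Python. It is an illustration only, not the method used by readseq or EMBOSS seqret; the function name `parse_fasta` is hypothetical, and the example record is the shortened one shown in the text.

```python
def parse_fasta(text):
    """Split Fasta-formatted text into a list of (title, sequence) pairs."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new title line closes the previous record, if there was one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines may contain spaces between blocks; remove them.
            chunks.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records

example = """>ECRGCG TRANSLATE of ecrgcg 1 to 1062
MAIDENKQKALAAALGQIEK ALGAGGLPMGRIVEIYGPES
"""

for title, sequence in parse_fasta(example):
    print(title, len(sequence), sequence)
```

Lines that appear before the first ">" are ignored here; a stricter reader might instead report them as a format error, which is the kind of misinterpretation the text warns about.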
29. ation. The algorithm is based on the statistical analysis of TMbase, a database of naturally occurring transmembrane proteins. The prediction is made using a combination of several weight matrices for scoring. The presence of transmembrane domains suggests that the protein is embedded in a membrane, for example at the cell surface.

Example: Human chemokine receptor 4 protein sequence (NP_003458.1). You can paste the gene sequence (chemo4) from the course website.

• At ExPASy > Topology prediction, click on the link to TMpred.
• Paste your sequence in the box provided in one of the supported formats, e.g. plain text, SwissProt ID or AC, etc.
• You may change the minimal and maximal length of the hydrophobic part of the transmembrane helix, but unless you have reason to do so you should accept the defaults, i.e. 17 and 33 (22 residues is the same length as the width of a lipid bilayer).
• Click the Run TMpred button to start the search.

The output is given in 3 parts (1, 2 and 3, see below). Part 1 lists all the significant predictions of possible transmembrane helices. In this case there are 7 helices predicted, but at this stage we do not know the orientation of the helices, so there are 2 tables: the first with the helices orientated from the inside to the outside, and vice versa for the second. Part 2 shows which inside->outside helices correspond to the outside->inside helices and indicates which orientation is mos
30. ation sites of proteins from their amino acid sequences. This program makes use of the fact that proteins destined for particular subcellular localizations have distinct amino acid properties, particularly in their N-terminal regions. These properties can be used to predict whether a protein is localized in the cytoplasm, nucleus or mitochondria, is retained in the ER, or is destined for the lysosome (vacuole) or the peroxisome. There is a detailed page of output that we can probably ignore. At the end of the output the percentage likelihood of each subcellular localization is given. If you want to learn more about the output and how subcellular localization is determined, please see the user manual at http://psort.nibb.ac.jp/helpwww2.html

Example: Human ETS-1 protein. You can paste the gene sequence (ets-1) from the course website.

• Go to http://psort.nibb.ac.jp/form2.html
• Paste your sequence in the box provided. The sequence must be written using the one-letter amino acid code.
• Press the submit button.

The output for this sequence is shown below. There are a number of parameters measured by this program, which you can read about by following the links from the output file. By scrolling to the bottom of the output you can see the probability that this sequence is nuclear, cytoplasmic, peroxisomal, vacuolar or cytoskeletal. PSORT predicts that ETS-1 is nuclear with a high probability. The fact that ETS-1 is
31. chapter to the databases. At the top of this page will be a note of which database(s) you have chosen to search, and a block of four text insert boxes which you can use to enter your question.

[Screenshot: SRS Query Form at EMBL-EBI. In a single field you can separate multiple values by & or |. The Description field is set to "serum resistance associated" and the Taxonomy field to "Trypanosoma". Result display options let you choose the view (UniprotView), the fields to display (ID, EntryName, AccessionNumber, Primary Accession Number, Sequence Version, Creation Date) and the sequence format (swiss).]

To the left you will see some things you can change, including:
1. Reset, which clears the screen.
2. Combine search terms: & (AND), which you can change to apply other logical (Boolean) operators.
3. Use wildcards, which means that bact will be interpreted as bact* and look for bacteria, bacteriophage, etc.
4. Number of entries to display per page; the default is 30.

Your question can be entered into one or more of the text insert boxes, thus: Click All text, change to Descrip
32. d by Hans Henrik and Anders Gorm Pedersen, SLU, Sweden. The Ensembl section was adapted from Ensembl tutorials and worked examples on the Ensembl website.

Bioinformatics Course August 2012

Table of Contents

Introduction
Introduction to Bioinformatics
[illegible entry]
Sequence formats
Accession numbers
Interrogating sequence databases
Ensembl
SRS
Entrez http://www.ncbi.nlm.nih.gov/Entrez
Nucleic Acid Sequence Analysis
1 Nucleic acids and the genetic code
2 Translating DNA in 6 frames
3 Reverse Complement & other tools
4 Oligo Calculator http://www.pitt.edu/~rsup/OligoCalc.html
Protein Sequence Analysis
1 Physico chemical properties
2 Cellular localization
3
33. e must be written using the one-letter amino acid code. It is recommended that only the N-terminal part (not more than 50-70 amino acids) of the sequence is submitted. A longer sequence will increase the risk of false positives and make the graphical output difficult to read. The new version now automatically truncates input sequences.

• Choose one or more groups of organisms for the prediction by clicking the check box next to the group(s). If no groups are indicated, predictions from all three groups will be returned.
• A graphical output in PostScript format of the prediction will be available if the Include graphics button is checked.
• Press the Submit sequence button.
• A WWW page will return the results when the prediction is ready. Response time depends on system load.

The output for this sequence is shown below.

C-score (raw cleavage site score): the output score from networks trained to recognize cleavage sites vs. other sequence positions. Trained to be high at position +1 (after the cleavage site) and low at all other positions.

S-score (signal peptide score): the output score from networks trained to recognize signal peptide vs. non-signal-peptide positions. Trained to be high at positions before the cleavage site and low at all other positions.

Y-score (combined cleavage site score): the prediction of cleavage site location is optimized by observing where the C-score is high and
34. ely. Most of the analysis will be carried out on the World Wide Web (WWW). This is partly because it is available to all comers without requiring direct access to the computers which serve as database and software repositories. But it is also partly because a well designed Web site can be particularly user-friendly and intuitive in its operations. There are likely to be network-related problems when trying to make 25 simultaneous connections over the Internet to the same site. Try doing the course exercises late in the evening, early in the morning (best for speed) or at weekends.

This module in bioinformatics is designed to give you a flavour of what analytical and informative tools are available on the World Wide Web. The software used in the course is varied. We have tried to put links to all of it on the course website: http://hpc.ilri.cgiar.org/training/BecA2012/Welcome.html

A few overall points for the course:
• Take the opportunity to compare and contrast different methods of doing a particular analysis.
• By all means take the defaults, but be aware that changing them will almost certainly get more or better information.
• The Web is free and you get what you pay for, so use the Web with care and caution.
• As with lab work, it takes time to get the protocol working. Once you have one that works for you, write it down, bookmark it and remember it. But note the Web changes rapidl
35. erimentally determined, it will be in here and you can follow the link to PDB for information on the structure of your protein. If your protein is in PDB you can view its secondary structure using RasMol. To download RasMol, see the course website for a link.
• Once you have RasMol running you can open your structure in it and view it using a number of different options.
• Otherwise, continue with prediction. The program may take a long time, so you can save a bookmark and return to your results later, or choose to have your results e-mailed to you.
• There are a number of options to view the output; view your output in HTML format (option 4). The complete output is too large to show here (see webpage).
• Scroll down through the output until you get to "Jpred" output. The line of output beside this is the consensus secondary structure for your sequence: H = helices, E = strands, C = coils.

Printed sources about Bioinformatics and the Internet

Briefings in Bioinformatics: a journal aimed at users rather than developers, with useful review and how-to articles.

Books
Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Andreas Baxevanis & B. F. Francis Ouellette (Eds). John Wiley & Sons, 2nd Ed., 2001. ISBN 0471 38390 2. The Course text book.
Fundamentals of Molecular Evolution. W. H. Li and D. Graur. Sinauer, 1991. ISBN
36. es to display many layers of genome annotation in a simplified view for the ease of the user. The picture above shows the Region in Detail page for the BRCA2 gene in human. The example shows blocks of conserved sequence reflecting conservation scores of sequence identity on a base pair level across 34 species. Conserved regions are displayed as dark blocks that represent local regions of alignment. One of the blocks is circled in red. You would only have to click on this block to see more details.

Also in this figure are proteins from UniProtKB aligned to the same genomic region. Filled yellow blocks show where these UniProtKB proteins align to the genome, and gaps in the alignment are shown as empty yellow blocks. Note that in this case the UniProtKB proteins support most of the exons shown in the Ensembl BRCA2-001 transcript (in gold). Both Ensembl and Vega/Havana transcripts are portrayed as exons (boxes) and introns (connecting lines). In fact, filled boxes show coding sequence and empty boxes reflect UnTranslated Regions (UTRs). This Region in Detail view is useful for comparing Ensembl gene models with current proteins and mRNAs in other databases like NCBI RefSeq, EMBL-Bank and, in the example above, UniProtKB. Everything in this view is aligned to the genome.

37. [Figure: The Region in Detail view, 1000 Genomes track, showing the MTAP transcript with Ensembl/Havana gene models and sequence variants.]

The Region in Detail view can be configured using the Configure this page tool button to show regulatory features, sequence variation and more. Click on any vertical line in the variation track for a menu about the SNP (single nucleotide polymorphism) or InDel (insertion-deletion mutation). Clicking on Variation properties in the pop-up box will bring you to an information page for the genetic variation, including links to population frequencies if known. You can do the same for any regulatory feature.

An index page is provided for each species, with information about the source of the genomic sequence assembly, a karyotype if available, and a link to past or archive sites. The picture below shows the Ensembl homepage for human. Links to the human karyotype, a summary of gene and genome information, and the most common InterPro domains in the genome are found at the left of this index page.

[Screenshot text: Description: Human (Homo sapiens). About the genome sequence. Assembly: This site provides a data set based on the February 2009 Homo sapiens high coverage assembly GRCh37 from the Genome Reference Consortium. The data set consists of gene models built from the genewise alignments of the human pr
38. f many genes, including complete eukaryotic chromosomes (e.g. X59720). The value of the database lies substantially in the quality of the annotation that puts the sequence in its biological context. As a biologist you may need to be able to interrogate the database to find particular sequences, or a set of sequences matching given criteria, such as:
• The sequence published in Cell 31: 375-382
• All sequences from Aspergillus nidulans
• Sequences submitted by Peter Arctander
• Flagellin or fibrinogen sequences
• The glutamine synthetase gene from Haemophilus influenzae
• The upstream control region of Bacillus subtilis Spo0A

SRS (Sequence Retrieval System) is a very powerful WWW-based tool, developed by Thure Etzold at EMBL and subsequently managed by Lion Biosciences, for interrogating databases and abstracting information from them. One of the neatest features of SRS is the fact that interrelated databases can be cross-referenced with WWW hypertext links. This means that you can discover the protein sequence, the cognate DNA sequence, a family of related proteins in other species, a Medline reference to read an abstract of the original publication and a 3-D structure, all with a few points and clicks of the mouse. There are several SRS servers on the Web. We will be using http://srs.ebi.ac.uk at the EBI in England because a) it has a large number of interlinked databases, b) connectivity to the UK is g
39. gn, the genetic basis of disease, the role of protein structure in its enzymatic, structural and signal transduction functions, and basic physiology from molecular to cellular to fully systemic levels. In short, the solution of the protein structure prediction problem (and the related protein folding problem) will bring on the second phase of the molecular biology revolution (Munson et al., 1994).

JPRED
http://www.compbio.dundee.ac.uk/www-jpred

Jpred is an Internet web server that takes either a protein sequence or a multiple alignment of protein sequences and predicts secondary structure. It works by combining a number of modern, high quality prediction methods to form a consensus. Please be aware that secondary structure prediction is an extremely complex problem that is under intensive research, and we are still at a relatively primitive stage. We cannot discuss the details of protein secondary structure here, but if you are interested in this area we recommend that you take a look at any major biochemistry textbook. Essentially, protein secondary structure consists of 3 major conformations: the α-helix, the β-pleated sheet and the coil conformation.

Example: Human alpha 1 hemoglobin (NP_000549.1). You can paste the gene sequence (hbb) from the course website.

• Go to the website.
• Paste your sequence in the box provided.
• The defaults are OK.
• Click Make predictions.
• If your sequence already has had its structure predicted or exp
40. he output for this program is shown below (graphics not shown).
• This program predicts potential O-glycosylation sites at Threonine 64 and Serine 214.

NetOGlyc 3.1 Prediction Results

Name: sp_P15813_C   Length: 335
MGCLLFLLLWALLQAWGSAEVPQRLFPLRCLQISSFANSSWTRTDGLAWLGELQTHSWSNDSDTVRSLKPW
SQGTFSDQQWETLQHIFRVYRSSFTRDVKEFAKMLRLSYPLELQVSAGCEVHPGNASNNFFHVAFQGKDIL
SFQGTSWEPTQEAPLWVNLAIQVLNQDKWTRETVQWLLNGTCPQFVSGLLESGKSELKKQVKPKAWLSRGP
SPGPGRLLLVCHVSGFYPKPVWVKWMRGEQEQQGTQPGDILPNADETWYLRATLDVVAGEAAGLSCRVKHS
SLEGQDIVLYWGGSYTSMGLIALAVLACLLFLLIVGFTSRFKRQTSYQGVL

Name          S/T   Pos   G-score   I-score   Y/N   Comment
sp_P15813_C   S     18    0.075     0.079     .
sp_P15813_C   S     34    0.198     0.051     .
sp_P15813_C   S     35    0.177     0.037     .

[The remaining rows of the results table are illegible in the source; the legible position values include 210, 214, 227, 248 and 260.]
41. here are many other applications available at this website that you should take some time to have a look at.

1 Physico chemical properties

ProtParam tool
http://www.expasy.ch/tools/protparam.html
Or use http://www.bioinformatics.org/SeWeR

Calculates lots of physico-chemical parameters of a protein sequence. The computed parameters include the molecular weight, theoretical pI, amino acid composition, atomic composition, extinction coefficient, estimated half-life, instability index, aliphatic index and grand average of hydropathicity (GRAVY).

Example: Human BRCA1. You can paste the gene sequence (brca1) from the course website.

• At ExPASy > Proteomics and sequence analysis tools > Primary structure analysis, click on the ProtParam link.
• Paste your sequence in the box provided.
• The sequence must be written using the one-letter amino acid code.
• Press the Compute parameters button.

The output for this sequence is shown below.

Number of amino acids: 1863
Molecular weight: 207720.8
Theoretical pI: 5.29

Amino acid composition:
Ala (A)   84    4.5%
Arg (R)   76    4.1%
Etc. etc.
Thr (T)  111    6.0%
Trp (W)   10    0.5%
Tyr (Y)   31    1.7%
Val (V)  101    5.4%
Asx (B)    0    0.0%
Glx (Z)    0    0.0%
Xaa (X)    0    0.0%

Total number of negatively charged residues (Asp + Glu): 283
Total number of positively charged residues (Arg + Lys): 213

Atomic composition:
Carbo
42. localized in the nucleus has been previously experimentally determined.

Results of Subprograms
PSG: a new signal peptide prediction method
N-region: length 8; pos. chg 2; neg. chg 1
H-region: length 6; peak value 1.89
PSG score: 25

Results of the k-NN Prediction
k = 9/23
73.9 %: nuclear
13.0 %: cytoplasmic
 4.3 %: peroxisomal
 4.3 %: vacuolar
 4.3 %: cytoskeletal
>> prediction for QUERY is nuc (k=23)

3 Signal peptides

Proteins destined for secretion, for operation within the endoplasmic reticulum or lysosomes, and many transmembrane proteins, are synthesized with leading (N-terminal) 13-36 residue signal peptides.

SignalP
http://www.cbs.dtu.dk/services/SignalP

The SignalP WWW server can be used to predict the presence and location of signal peptide cleavage sites in your proteins. It can be useful to know whether your protein has a signal peptide, as it indicates that it may be secreted from the cell. Furthermore, proteins in their active form will have their signal peptides removed; if you can determine the length of the signal peptide, then you can calculate the size of the protein minus the signal peptide.

Example: Human Beta defensin (sp|Q09753|BD01_HUMAN). You can paste the gene sequence (HBD1) from the course website.

• At ExPASy > Post translational modification prediction, click on the SignalP link.
• Paste your sequence in the box provided. The sequenc
43. [Screenshot: Ensembl home page, with favourite genomes Human (GRCh37) and Mouse (NCBIM37), a species selector, and a link to EnsemblGenomes for plants, bacteria, fungi, protists and metazoa.]

Introduction to Ensembl

Ensembl is a joint project between the EBI (European Bioinformatics Institute) and the Wellcome Trust Sanger Institute that annotates chordate genomes (i.e. vertebrates and closely related invertebrates with a notochord, such as sea squirt). Gene sets from model organisms such as yeast and worm are also imported for comparative analysis by the Ensembl compara team. Most annotation is updated every two months, leading to increasing Ensembl versions (such as version 62); however, the gene sets are determined less frequently. A sister browser at www.ensemblgenomes.org is set up to access non-chordates, namely bacteria, plants, fungi, metazoa and protists.

[Figure: The Region in Detail view, with tracks for chromosome bands, 34-way GERP scores, conserved sequence, UniProt protein alignments and the BRCA2 protein coding transcript.]

The vast amount of information associated with the genomic sequence demands a way to organise and access that information. This is where genome browsers come in. Ensembl striv
44. n (C) 8908
Hydrogen (H) 14246
Nitrogen (N) 2554
Oxygen (O) 3014
Sulfur (S) 74

Formula: C8908 H14246 N2554 O3014 S74
Total number of atoms: 28796

Extinction coefficients:
Conditions: 6.0 M guanidium hydrochloride, 0.02 M phosphate buffer, pH 6.5.
Extinction coefficients are in units of M^-1 cm^-1.
The first table lists values computed assuming ALL Cys residues appear as half cystines, whereas the second table assumes that NONE do.

                     276 nm   278 nm   279 nm   280 nm   282 nm
Ext. coefficient     102140   102194   100935    99220    95840
Abs 0.1% (=1 g/l)     0.492    0.492    0.486    0.478    0.461

                     276 nm   278 nm   279 nm   280 nm   282 nm
Ext. coefficient      98950    99400    98295    96580    93200
Abs 0.1% (=1 g/l)     0.476    0.479    0.473    0.465    0.449

Estimated half-life:
The N-terminal of the sequence considered is M (Met).
The estimated half-life is: 30 hours (mammalian reticulocytes, in vitro); >20 hours (yeast, in vivo); >10 hours (Escherichia coli, in vivo).

Instability index:
The instability index (II) is computed to be 54.68. This classifies the protein as unstable.

Aliphatic index: 69.01
Grand average of hydropathicity (GRAVY): -0.785

EMBOSS
pepinfo: Plots simple amino acid properties
pepstats: Protein statistics
charge: Protein charge plot
iep: Calculates the isoelectric point of a protein

2 Cellular localization

PSORT
http://psort.nibb.ac.jp/form2.html

PSORT: a program to predict the subcellular localiz
45. nce of this orientation.

inside->outside          outside->inside
39-62 (24) 1962          47-63 (17) 2568
78-105 (28) 1623         78-96 (19) 1331
[The remaining rows of the correspondence table are illegible in the source.]

3 Suggested models for transmembrane topology

These suggestions are purely speculative and should be used with extreme caution, since they are based on the assumption that all transmembrane helices have been found. In most cases, the Correspondence Table shown above or the prediction plot that is also created should be used for the topology assignment of unknown proteins.

2 possible models considered, only significant TM segments used.

STRONGLY preferred model: N-terminus outside
7 strong transmembrane helices, total score: 14594
Legible rows (from, to, length, score): 47-63, 17, 2568; 78-105, 28, 1623; 204-223, 20, 2404; 240-261, 22, 2840. The other rows and the orientation column are illegible in the source.

Alternative model
7 strong transmembrane helices, total score: 11172
Legible rows (from, to, length, score): 39-62, 24, 1962. The other rows are illegible in the source.

EMBOSS
tmap: Displays membrane spanning regions
46. nique name of 10 characters or less. All sequences must be of the same format (FASTA). Input type: protein sequences.
• ReadSeq: Automatically recognizes the input sequence type and converts it into a format of choice. Input type: DNA/protein sequence(s) of different formats.
• CAP: Contig assembly program (CAP). Input type: DNA sequences in FASTA format.
• Clean / Inverse complement: You can inverse-complement the sequence or clean the sequence. In either case SeWeR will keep only A, T, G, C and N from the query. All spaces, numbers and line breaks will be removed from the sequence.

EMBOSS
revseq: Reverse and complement a sequence
eprimer3: Picks PCR primers and hybridization oligos
primersearch: Searches DNA sequences for matches with primer pairs
restrict: Finds restriction enzyme cleavage sites
transeq: Translate nucleic acid sequences
prettyseq: Output sequence with translated ranges
plotorf: Plot potential open reading frames
showorf: Pretty output of DNA translations
splitter: Split a sequence into overlapping smaller sequences

Exercise
Paste in the phosphoglycerate kinase gene sequence from Trypanosoma brucei for each application. Pay particular attention to the options available; these will give you clues about standard practice. See if you can repeat the exercise using the EMBOSS program(s). See Appendix and Appendix 2 for details about the genetic code.

4 Oligo Calculator
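The Clean and Inverse complement operations described in this excerpt are simple enough to sketch in a few lines. The following Python is an illustration under the behaviour stated in the text (keep only A, T, G, C and N, then reverse and complement); it is not SeWeR's or EMBOSS revseq's actual code, and the function names `clean` and `reverse_complement` are hypothetical.

```python
# Translation table mapping each base to its complement; N stays N.
COMPLEMENT = str.maketrans("ATGCN", "TACGN")

def clean(seq):
    """Keep only A, T, G, C and N; drop spaces, numbers and line breaks."""
    return "".join(base for base in seq.upper() if base in "ATGCN")

def reverse_complement(seq):
    """Clean the sequence, complement each base, then reverse the result."""
    return clean(seq).translate(COMPLEMENT)[::-1]

raw = "1 atggcc 11 ttaagn\n"
print(clean(raw))
print(reverse_complement(raw))
```

A palindromic restriction site is a useful sanity check: the EcoRI site GAATTC is its own reverse complement, so `reverse_complement("GAATTC")` should return the input unchanged.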
47. ontinue to display a gene set based on the merge between the automatic annotation from Ensembl and the manually curated annotation from Havana. This refined gene set corresponds to GENCODE release 6. The Consensus Coding Sequence (CCDS) identifiers have also been mapped to the annotations. More information about the CCDS. Vega: Additional manual annotation of this genome can be found in Vega.]

[Screenshot: assembly statistics table for GRCh37; legible values include 3 279 005 676 and 3 101 804 739, with an Ensembl full genebuild dated Mar 2009.]

Ensembl devotes separate pages and views in the browser to display a variety of information types, using a tabbed structure.

[Screenshot of tabs: Human (GRCh37); Location: 13:32,889,611-32,973,347; Gene: BRCA2; Transcript: BRCA2-001; Variation: rs80358836.]

View genotype information in the variation tab, gene trees in the gene tab, a chromosomal region in the location tab, and cDNA sequence alongside the protein translation in the transcript pages. Compare conserved regions with the position of genes and population variation in the Region in Detail view. See homology relationships in the gene page, or perform a BLAST or BLAT search against any species in Ensembl.

[Figure panels: Transcript Sequence with Variations; Genes, SNPs and Conserved Regions.]
48. ood, c) they are attempting to interconnect their SRS server with their ClustalW server and BLAST server. With experience and practice you will get to use as much of SRS's power as necessary to obtain the results you need.

Below, as a worked example, is a series of instructions to obtain the sequences of serum resistance associated proteins in Trypanosoma brucei in SwissProt and download them locally to carry out a multiple sequence alignment using, say, ClustalW. It should also be possible to do the multiple alignments on the EBI ClustalW server.

Use your browser (Netscape) to go to http://srs.ebi.ac.uk or one of the other SRS servers at the top of the Course page. You should see the following options. Click on Library Page.

[Screenshot: SRS start page at EMBL-EBI, with the Quick Text Search box, the Library Page and Query Form tabs, and links to the Help Center, the SRS@EBI FAQ and the Linking to SRS guide.]
49. oteome, as well as from alignments of human cDNAs using the cDNA2genome model of exonerate. [illegible] release of [illegible]. The assembly has the following properties: 27478 contigs; contig length total 3.2 Gb; chromosome length total 3.1 Gb. It also includes nine haplotype regions, mainly in the MHC region of chromosome 6. As the GRC maintains and improves the assembly, patches are being introduced. Patch release two (GRCh37.p2) was included in Ensembl release 60. Currently assembly patches are of two types. Novel patch: new sequences that add alternative sequence at loci and will remain as haplotypes in the next major assembly release by GRC. Fix patch: sequences that correct the reference sequence and will replace the given region of the reference assembly at the next major assembly release by GRC. The addition of the patches allows the annotation of some genes that cannot be annotated correctly on the reference genome, such as the ABO blood group gene, which can now be annotated as a protein coding gene. To convert your old data from Human assembly NCBI36 to GRCh37, click on Manage your data on any human page and select Assembly converter from the left hand menu. A preliminary assembly of the Neanderthal (Homo sapiens neanderthalensis) genome is available via the Neandertal Genome Browser, an Ensembl powered project based at the Max Planck Institute. Previous assemblies: NCBI36, May 2009 (58). Annotation: in release 61 (January 2011) we c
50. puters. In particular, types of analysis that require large amounts of computational power/time are best carried out off the web. Analyses of many genes are also often better done in an environment where a computer program does the pointing and clicking for you. For the record, the EMBOSS package is a suite of programs which carry out almost all the analyses that a molecular biologist might want to do with DNA or protein sequences: secondary structure prediction, two-sequence alignment, conceptual translation of DNA, restriction site analysis and primer design, as well as homology searching, multiple sequence alignment, etc. For phylogenetic inference and tree drawing, the PHYLIP package (versions available for PCs, Macs and Unix) will answer most needs. Both of these software packages, and a variety of other sequence analysis packages, are available for download from the Internet. The web, by contrast, is a total mess: the same program is implemented with different defaults at different sites, it is often not clear what those defaults, options and parameters are, and the results are not easily transferred to a different program. So it is free, but there is a cost. You are advised to validate any analysis against the results yielded by other sites. For a good introduction to Bioinformatics, read the first chapter of Developing Bioinformatics Computer Skills (Cynthia Gibas & Per Jambeck), available online at http://oreilly.com/catalog/bioskills/chapter/ch01
51. rary. Version numbers are appended to the accession number after a dot: V00234.2, NM_000492.2. Interrogating sequence databases. Ensembl (http://www.ensembl.org): Ensembl provides genes and other annotation such as regulatory regions, conserved base pairs across species, and sequence variations. The Ensembl gene set is based on protein and mRNA evidence in UniProtKB and NCBI RefSeq databases, along with manual annotation from the VEGA/Havana group. All the data are freely available and can be accessed via the web browser at www.ensembl.org. Perl programmers can directly access Ensembl databases through an Application Programming Interface (Perl API). Gene sequences can be downloaded from the Ensembl browser itself, or through the use of the BioMart web interface, which can extract information from the Ensembl databases without the need for programming knowledge by the user. [Screenshot: Ensembl home page, with a search box (e.g. human gene BRCA2, or rat X:100000..200000, or coronary heart disease) and a Browse a Genome panel linking to species home pages.]
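Because the version is simply appended to the accession after a dot, the two parts are easy to separate when you need to compare records. This is a minimal sketch; the function name split_accession is ours and does not belong to any database toolkit.

```python
from typing import Optional, Tuple


def split_accession(identifier: str) -> Tuple[str, Optional[int]]:
    """Split 'ACCESSION.VERSION' into its parts.

    Returns (accession, version); version is None when no dot is present.
    """
    accession, dot, version = identifier.strip().partition(".")
    return accession, (int(version) if dot else None)


for identifier in ("V00234.2", "NM_000492.2", "V00234"):
    print(identifier, "->", split_accession(identifier))
```

Two identifiers with the same accession but different versions refer to the same record at different revisions, so comparing only the first element of the tuple tells you whether they are the same entry.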
52. rm a helix. Each strand consists of alternating nucleotides. Each nucleotide consists of a phosphate (PO4) and a pentose sugar (2-deoxyribose), and attached to the sugar is a nitrogenous base, which can be adenine, thymine, guanine or cytosine. The four nucleotides are given one-letter abbreviations as shorthand for the four bases: A is for adenine, G is for guanine, C is for cytosine, T is for thymine. See Appendix for more details. [Figure: DNA molecule, with labels for sugar, phosphate group and bases (cytosine, thymine, adenine).] Hence DNA is a ladder-like helical structure. The two DNA strands are joined together at the center by pairing bases lined up with one another. Adenine pairs with thymine and guanine with cytosine. A and T are connected by two hydrogen bonds; G and C are connected by three hydrogen bonds. DNA is often described structurally as a twisting ladder. In this ladder the rungs are the pairs of bases linked together, and the sides are the two separate sugar and phosphate backbones. The double helix is important because it preserves all of the information-carrying features of a single DNA strand while at the same time introducing elements that make it easier for living cells to make copies of their DNA. Because every base pair in the double helix must match its pairing partner (A with T, C with G
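The pairing rules above (A with T, G with C; two hydrogen bonds for A-T, three for G-C) are enough to work out the partner strand of any sequence. Here is a small sketch; the function names and the example sequence GATTACA are illustrative only.

```python
# Watson-Crick pairing: A<->T and G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
# Hydrogen bonds contributed by the base pair each base takes part in.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'-to-3' direction."""
    return strand.upper().translate(COMPLEMENT)[::-1]


def hydrogen_bonds(strand: str) -> int:
    """Count the hydrogen bonds holding this strand to its partner."""
    return sum(H_BONDS[base] for base in strand.upper())


seq = "GATTACA"
print(reverse_complement(seq))  # TGTAATC
print(hydrogen_bonds(seq))      # 16
```

Taking the reverse complement twice gives back the original strand, which is a quick check that neither strand holds information the other lacks.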
53. rotKB:Q8T309_TRYBR (Q8T309): serum resistance associated VSG protein, Trypanosoma brucei rhodesiense. [Screenshot residue: Save results, selected results only / unselected results only.] Under Display options, change UniprotView to FastaSeqs. Click Save. Set Save as type to Text File (txt). Click Save. Change the selection wgetz to serum.pro and then click Save. This should dump the concatenated FASTA-format protein sequences into a local file called serum.pro. You can use this file as input for ClustalW. There may be local security difficulties with downloading sequences onto a public terminal: check with your neighbours or your demonstrator. Query manager: a powerful tool. A quick example will show how you can combine very complex queries to zero in on the sequence(s) you need. Having selected your database(s), go to the Query Form Page and enter Description: calmodulin (you should get about 1140 entries). Click the QUERY tab at the top of the page to get a new page and enter Organism name: human, or indeed Homo sapiens (this will get you a large number of sequences). Click the RESULTS tab at the top of the page. A new window should appear with the results for all the queries you have entered in the current SRS session. In the top box of this page enter "Q1 & Q2" (leave off the quotes). Note: your mileage may vary here
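Before handing the saved file to an alignment program, it can be worth checking that it really holds one FASTA record per sequence. The sketch below reads concatenated FASTA text into (title, sequence) pairs. The two records in the example are made up for illustration; they are not the real Trypanosoma sequences.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Split concatenated FASTA text into (title line, sequence) pairs."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        elif title is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Made-up example text standing in for the saved file.
example = """>example_1 first made-up record
MKTAYIAK
QRQISFVK
>example_2 second made-up record
MSQESDNNKR
"""

for title, sequence in parse_fasta(example):
    print(title, len(sequence))
```

To use it on a real download, read the file into a string first (for example with open(path).read()) and pass that string to parse_fasta.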
54. s is called Fasta format, in which the annotation is edited down to a single title line followed by the sequence. The sequence at the top of the chapter is in Fasta format. All protein databases use the one-letter amino acid code: can you think why this might be? Sequence Related Databases. Not all biologically relevant databases consist of sequences and annotation. There are databases of journal abstracts, taxonomy, 3-D structures, mutations and metabolic pathways. Some of the most useful of these are databases which specialise in particular entities that can be found dispersed in the whole-sequence databases. You will notice that one of the cross references for the SwissProt entry is DR PROSITE; PS00321; RECA; 1. Prosite is a database of protein motifs. PS00321 is a family of proteins that all have the motif PA A-L-K-F-[FY]-[STA]-[STAD]-[VM]-R and are all believed to bind DNA, hydrolyze ATP and act as a recombinase. One of the members of this family is the recA gene in E. coli, which gives its name to PS00321. In the pattern above the residues within square brackets are alternatives. Convince yourself that ALKFFAAVR could belong to the family but ALKFAAAVR could not. There are more than 1000 other families classified in a similar way. Finding a Prosite link in a SwissProt entry is a great help in finding other proteins related by structure and/or function. InterPro (http://www.ebi.ac.uk/interpro)
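One way to convince yourself about the two test peptides is to turn the motif into a regular expression and let the computer do the matching. The sketch below makes two assumptions: that the leading "PA" is the Prosite pattern line code rather than two residues, and that the bracket grouping is A-L-K-F-[FY]-[STA]-[STAD]-[VM]-R, reconstructed from the damaged line in the handout. The converter handles only literal residues and bracketed alternatives, not the full Prosite syntax.

```python
import re

# Assumed reconstruction of the PS00321 motif quoted in the text.
RECA_MOTIF = "A-L-K-F-[FY]-[STA]-[STAD]-[VM]-R"


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple Prosite pattern to a Python regular expression.

    Only literal residues and [alternatives] are supported; both map
    directly onto regex literals and character classes.
    """
    return "".join(pattern.rstrip(".").split("-"))


def matches_motif(peptide: str, pattern: str = RECA_MOTIF) -> bool:
    """True if the whole peptide is an instance of the motif."""
    return re.fullmatch(prosite_to_regex(pattern), peptide.upper()) is not None


print(prosite_to_regex(RECA_MOTIF))  # ALKF[FY][STA][STAD][VM]R
print(matches_motif("ALKFFAAVR"))    # True
print(matches_motif("ALKFAAAVR"))    # False
```

The second peptide fails at the fifth position, where the motif allows only F or Y and the peptide has A.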
55. sequence whose name or accession number you already know. At the top of the Entrez page, change the Search choice box from PubMed to the appropriate sort of database; the available options are listed on the Entrez page. If you want the sequence alone to paste into some analysis page, change the Display choice box to FASTA, then click on Save or Display depending on whether you want a permanent or transitory copy of your proteins. Entrez has a more complex syntax for less straightforward queries. Nucleic Acid Sequence Analysis. TOPICS: 1. Nucleic acids and the genetic code; 2. Translating DNA in 6 frames; 3. Reverse complement & other tools; 4. Calculating some properties of DNA/RNA sequences; 5. Primer design. 1. Nucleic acids and the genetic code. Nucleic acids may be in the form of deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), molecules containing the genetic information important for all cellular functions and heredity. DNA is a long polymer of nucleotides that codes for the sequence of amino acids during protein synthesis. DNA is said to carry the genetic blueprint, since it contains the instructions or information (called genes) needed to construct cellular components like proteins and RNA molecules. [Figure: The structure of DNA, with labels: one helical turn = 3.4 nm; sugar-phosphate backbone; base; hydrogen bonds.] DNA is composed of two strands that twist together to fo
56. t likely. Part 3 proposes the strongly preferred model for the transmembrane domain structure of the protein, and also an alternative model. A graphic of the prediction is also available (not shown here). These predictions correspond well, but not exactly, to the SWISS-PROT annotation for this protein (accession P30991).
TMpred output
Sequence: MEG...HSS, length: 352
Prediction parameters: TM-helix length between 17 and 33
1. Possible transmembrane helices
The sequence positions in brackets denominate the core region. Only scores above 500 are considered significant.
Inside to outside helices: 7 found
  from        to       score  center
  39 ( 46)    62 ( 62)  1962    54
  78 ( 85)   105 (103)  1623    95
  [two rows garbled in source: LLAS d LLA oS Cy AA Joe sea 199 AST X5 0 113 1716 165]
 204 (206)   223 (223)  2052   214
 240 (240)   261 (259)  2940   251
 286 (286)   305 (305)  1241   295
Outside to inside helices: 7 found
  from        to       score  center
  47 ( 47)    63 ( 63)  2568    55
  78 ( 78)    96 ( 96)  1331    86
 111 (114)   132 (132)  1740   122
  [row garbled in source: VES L BET 1738 0175 BT 165]
 204 (204)   223 (223)  2404   214
 240 (242)   259 (259)  2057   291
  [row garbled in source: 203 E 286 305 305 1703 294]
2. Table of correspondences
Here is shown which of the inside->outside helices correspond to which of the outside->inside helices. Helices shown in brackets are considered insignificant. A "+" symbol indicates a preference of this orientation. A "++" symbol indicates a strong prefere
57. tion and insert serum resistance associated in the box. Note: it does not have to be serum resistance associated; it could be ubiquitin, or haemoglobin or hemoglobin, or actin & alpha. Separate keywords in the same box have to be linked by a logical (Boolean) operator such as and (&), or, but not. Click the next All text, change it to Taxonomy and insert Trypanosoma in the box. Click Search. A new window appears with: Query uniprot Description: serum resistance associated & uniprot Taxonomy: Trypanosoma, found 4 entries. This is how SRS interprets what you have entered in the boxes, and the number of hits found. [Screenshot: SRS results page at EMBL-EBI listing UniProtKB entries Q70MW9_TRYBR (serum resistance associated protein, Trypanosoma brucei rhodesiense), Q70MX0_TRYBR (serum resistance associated protein, Trypanosoma brucei rhodesiense) and Q8T308_TRYBR (VSG protein, Trypanosoma brucei), with result options such as Launch analysis tool: NCBI BLASTP.] UniP
58. tion as possible about your sequence. Aside: Displaying 3-D structures is not fitted as standard on all terminals. You may need to get a copy of the RasMol 3-D structure viewer and install it in such a way that your Netscape/IE will recognise it and connect a suitable 3-D sequence file to it. To display a PDB entry of 3-D coordinates as a rotatable, colorable model, you need to click on the save button. Then change the "use mime type" choice box to chemical/x-pdb and click on the save box. This should fire up CHIME, a WWW implementation of RasMol. Your mileage may vary. It is this interlinked-databases aspect of SRS which gives it a large part of its power. You can extend your search to include other sequences related in some particular or peculiar way. The Prosite link allows you to find members of a protein family. The EMBL link allows you to find the introns and the intron splice junctions, not to mention the ribosome binding site, the stop codon and the journal reference for the original sequence. The Medline link will give you an abstract, etc. You will probably find that the PubMed server at http://www.ncbi.nlm.nih.gov/Entrez is a far better tool for browsing Medline than what is offered with SRS. Especially powerful is its facility for finding Related entries. Additional questions. Effective researchers know how to find things out. 1. Who submitted the serum amyloid A
59. ut the sequence itself, so that the sequence is parsed into meaningful bits, called a features table.
a) EMBL
FT   source          1..1391
FT                   /organism="Escherichia coli"
FT                   /db_xref="taxon:562"
FT   mRNA            191..>1391
FT                   /note="messenger RNA"
FT   RBS             229..233
FT                   /note="ribosomal binding site"
FT   CDS             239..1300
FT                   /db_xref="SWISS-PROT:P03017"
FT                   /transl_table=11
FT                   /gene="recA"
FT                   /product="recA gene product"
FT                   /protein_id="CAA23618.1"
FT   mutation        353..353
FT                   /note="g to a in recA441; E to K"
FT   mutation        720..720
FT                   /note="g to a in recA1; G to D"
b) GenBank
FEATURES             Location/Qualifiers
     source          1..1391
                     /organism="Escherichia coli"
                     /db_xref="taxon:562"
     mRNA            191..>1391
                     /note="messenger RNA"
     RBS             229..233
                     /note="ribosomal binding site"
     gene            239..1300
                     /gene="recA"
     CDS             239..1300
                     /gene="recA"
                     /codon_start=1
                     /transl_table=11
                     /product="recA gene product"
                     /db_xref="SWISS-PROT:P03017"
     mutation        353
                     /gene="recA"
                     /note="g to a in recA441; E to K"
     mutation        720
                     /gene="recA"
                     /note="g to a in recA1; G to D"
Again you can see that the information exchange between GenBank and EMBL includes all significant portions of the annotation. Such useful signals and data as the open reading frame (CDS, for CoDing Sequence), the ribosome binding site, intron boundaries, signal peptides, variants and mutations may be recorded. Protein databases: SwissProt; PIR (Protein Information Resource)
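Feature locations are plain base coordinates, so simple arithmetic on them is often all you need. The sketch below works out the length of the CDS (positions 239 to 1300) and of the ribosome binding site (229 to 233) from the table above. It assumes the usual start..end notation for locations, which the extracted text may have lost; the function name span_length is ours.

```python
def span_length(location: str) -> int:
    """Length in bases of a simple 'start..end' feature location.

    The '<' and '>' markers for partial features are ignored, and a
    single position such as '353' counts as one base.
    """
    cleaned = location.replace("<", "").replace(">", "")
    start, _, end = cleaned.partition("..")
    return int(end or start) - int(start) + 1


cds_length = span_length("239..1300")
print(cds_length)               # 1062 bases
print(cds_length // 3)          # 354 codons
print(span_length("229..233"))  # 5 bases
```

By the usual convention the CDS span includes the stop codon, so 354 codons would correspond to a 353-residue protein; treat that as a rule of thumb to check against the entry rather than a guarantee.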
60. y and you cannot afford to use outmoded technology for long. • Where applicable we will also introduce you to the same tool implemented in the EMBOSS package. EMBOSS is a free Open Source software analysis package specially developed for the needs of the molecular biology user community. EMBOSS integrates a range of currently available packages and tools for sequence analysis into a seamless whole. The EMBOSS package will be described in detail in a separate course module. Introduction to Bioinformatics. Bioinformatics has been described as the storage, retrieval and analysis of biological sequence information. In this short course we will be taking a broader definition: how computers can maximise the biological information available to you. This will touch on determining the 3-D structure of bio-molecules and trying to relate this to their function, as well as accessing the relevant literature. I hope that by the end of the course everyone will be adopting a more explicitly evolutionary understanding of their molecule. The formal course practicals can be carried out entirely on the World Wide Web using Netscape or another Web browser. Nevertheless, we recommend using locally installed FREE software for the phylogenetic trees part of the course. You should note that several important types of bioinformatic analysis are not freely accessible on the Web but are available on various password-controlled com

