Combined Workbench Manual

1. [Screenshot residue: BLAST result for query NP_058652 with a summary table of hits; columns Hit, Description, E-value, Score, Bit score, Identity.]
Figure 2.25: Output of a BLAST search. By holding the mouse pointer over the lines you can get information about the sequence.
A box appears containing information about the sequence and the match scores obtained from the BLAST algorithm. The lines in the BLAST view are the actual sequences which are downloaded. This means that you can zoom in and see the actual alignment:
Zoom in in the Tool Bar | Click in the BLAST view a number of times until you see the residues
Now we will focus our attention on sequence P02042, the BLAST hit that is second from the top of the list. To open sequence P02042:
right-click the line representing sequence P02042 | Download Full Hit Sequence from NCBI
This opens the sequence. However, the sequence is not saved yet. Drag and drop the sequence into the Navigation Area to save it. [Chapter 2: Tutorials, p. 50] This homologous sequence is now stored in the CLC Combined Workbench and you can use it to gain information about the query sequence by using the various tools of the workbench, e.g. by studying its textu
2. [Figure residue: an RNA secondary structure view with an "Elements of structure" list that gives the free energy contribution (ΔG, in kcal/mol) of each structural element: dangling ends, stems and their base pairs, stacking of adjacent base pairs, interior loops, multi-loops and helix ends.]
3. Figure 2.48: The history of the contig, showing that an "A" has been substituted with a "g" in read1 at position 651.
Figure 2.49: The history of the consensus sequence which has been extracted from the contig. Clicking the blue text will find the saved contig in the Navigation Area.
2.13 Tutorial: In silico cloning
In this tutorial you will see how to insert a sequence fragment into a cloning vector and create a circular map of the vector. The sequence is a PCR fragment which has been created using primers with restriction sites at the ends. The sequences are located in the Example data in the Nucleotide folder under Cloning:
• PCR fragment with EcoRV restriction sites
• PCR fragment with SphI restriction sites (used in this tutorial)
• The commonly known pBR322 cloning vector (used in this tutorial)
[Chapter 2: Tutorials, p. 65]
We choose to insert the fragment into the tetracycline resistance gene of pBR322, which will enable us to select for tetracycline-sensitive clones (the tetracycline resistance gene is marked by the blue annotation with the label "tet").
2.13.1 The cloning editor
Cloning in CLC Combined Workbench is carried out in the Cloning Editor, which can contain a number of se
4. Figure 12.25: BLAST graphical view. A simple graphical overview of the hits found aligned to the query sequence. The alignments are color coded ranging from black to red as indicated in the color label at the top.
[Screenshot residue: NCBI table "Sequences producing significant alignments" (click headers to sort columns) with the columns Description, Max score, Total score, Query coverage, E value, Max ident and Links.]
5. Figure 12.9: Output options for BLAST.
At the top, you can choose two different ways of getting the results of the BLAST search:
• Create overview BLAST table. This will create one table containing and summarizing all the BLAST results. See section 12.3.1.
• Create one BLAST result per query. This will create a BLAST result for each query sequence, which can be opened in a table (see section 12.3.3) or in the graphical alignment view (see section 12.3.2).
12.3.1 Overview BLAST table
In the overview BLAST table shown in figure 12.10, there is one row for each query sequence. Each row represents the BLAST result for this query sequence.
[Chapter 12: BLAST Search, p. 180]
[Screenshot residue: "Multi BLAST" table (12 rows) with the columns Query, Number of hits, Top hit E-value and Top hit.]
6. [Screenshot residue: circular view of pBR322, 4361 bp.]
Figure 10.13: A molecule shown in a circular view.
You cannot zoom in to see the residues in the circular molecule. If you wish to see these details, split the view with a linear view of the sequence. In the Annotation Layout, you also have the option of showing the labels as Stacked. This means that there are no overlapping labels and that all labels of both annotations and restriction sites are adjusted along the left and right edges of the view.
10.2.1 Using split views to see details of the circular molecule
In order to see the nucleotides of a circular molecule you can open a new view displaying a linear view of the molecule:
[Chapter 10: Viewing and Editing Sequences, p. 147]
Press and hold the Ctrl button (⌘ on Mac) | click Show Sequence at the bottom of the view
This will open a linear view of the sequence below the circular view. When you zoom in on the linear view you can see the residues as shown in figure 10.14.
[Screenshot residue: circular and linear views of pBR322.]
Figure 10.14: Two views showing the same sequence. The bottom view is zoomed in.
Note! If you make a selection in one of the views, the other view will also make the corresponding selection, providi
7. [Screenshot residue: contig view and a "Contig ambiguities" table (5 rows) with the columns Position, Consensus residue, Other residues, IUPAC, Status and Notes.]
Figure 18.12: The graphical view of a contig is displayed at the top. At the bottom the conflicts are shown in a table. At the conflict at position 637, the user has entered a comment in the table. This comment is now also reflected on the tooltip of the conflict annotation in the graphical view above.
The table has the following columns:
• Position. The position of the conflict measured from the starting point of the contig sequence.
• Consensus residue. The contig's residue at this position. The residue can be edited in the graphical view of the contig as described above.
• Other residues. Lists the residues of the reads. Inside the brackets, you can see the number of reads having this residue at this position. In the example in figure 18.12, you can see that at position 637 there is a "C" in the top read in the graphical view. The other two reads have a "T". Therefore, the table displays the following text: "C (1), T (2)".
[Chapter 18: Sequencing Data Analyses and Assembly, p. 316]
• IUPAC. The ambiguity code for this position
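To illustrate how the "Other residues" and "IUPAC" columns relate to the reads, the following Python sketch tallies the residues found at one alignment column and looks up the matching ambiguity code. It is not taken from the workbench: the function names `other_residues` and `iupac_code` are hypothetical, the "C (1), T (2)" output format is assumed from the example above, and only the standard IUPAC nucleotide codes are used.

```python
from collections import Counter

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: set of observed bases -> ambiguity code.
CODE_BY_BASES = {frozenset(bases): code for code, bases in IUPAC.items()}


def other_residues(residues):
    """Format the residues of the reads at one position, e.g. 'C (1), T (2)'."""
    counts = Counter(r.upper() for r in residues)
    return ", ".join(f"{base} ({n})" for base, n in sorted(counts.items()))


def iupac_code(residues):
    """Return the IUPAC ambiguity code covering all residues at one position."""
    return CODE_BY_BASES[frozenset(r.upper() for r in residues)]


# Position 637 in the example: one read has C, two reads have T.
column = ["C", "T", "T"]
print(other_residues(column), iupac_code(column))
```

Gaps and non-nucleotide characters are deliberately not handled here; a real implementation would need a rule for them.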
8. Furthermore, if this is the case, you will see the names of the other enzymes in the Conflicting Enzymes column.
[Chapter 19: Cloning and Cutting, p. 341]
• Sequence. The name of the sequence, which is relevant if you have performed restriction map analysis on more than one sequence.
• Length. The length of the fragment. If there are overhangs of the fragment, these are included in the length (both 3' and 5' overhangs).
• Region. The fragment's region on the original sequence.
• Overhangs. If there is an overhang, this is displayed with an abbreviated version of the fragment and its overhangs. The two rows of dots represent the two strands of the fragment, and the overhang is visualized on each side of the dots with the residue(s) that make up the overhang. If there are only the two rows of dots, it means that there is no overhang.
• Left end. The enzyme that cuts the fragment to the left (5' end).
• Right end. The enzyme that cuts the fragment to the right (3' end).
• Conflicting enzymes. If more than one enzyme cuts at the same position, or if an enzyme's recognition site is cut by another enzyme, a fragment is displayed for each possible combination of cuts. At the same time, this column will display the enzymes that are in conflict. If there are conflicting enzymes, they will be colored red to alert the user. If the same experiment were performed in the lab, conflicting enzymes could lead to wrong results. For this reason, this functi
9. [Screenshot residue: table of standard primers (41 rows) with a right-click menu offering Open Primer(s), Save Primer(s), Mark Primer Annotation on Sequence, Open Fragment and Save Fragment.]
Figure 2.41: The options available in the right-click menu. Here, "Mark primer annotation on sequence" has been chosen, resulting in two annotations on the sequence above (labeled "Oligo").
... the Example data. To assemble the files:
Toolbox in the Menu Bar | Sequencing Data Analyses | Assemble Sequences
Click Next to go to the second step of the assembly, where you choose to trim the sequences. Click Next and you will be able to specify how this trimming should be performed (see figure 2.42). It is possible to trim based on the quality of the chromatogram traces and you can also trim
[Chapter 2: Tutorials, p. 61]
[Screenshot residue: "Trim Sequences" dialog, step 2 "Set trim parameters". Sequence trimming: Ignore existing trim information; Trim using quality scores; Trim using ambiguous nucleotides. Vector trimming: Trim contamination from vectors in UniVec database.]
10. [Contents residue for Chapter 22: 22.4 Structure Scanning Plot, 390; 22.4.1 Selecting sequences for scanning, 391; 22.4.2 The structure scanning result, 392; 22.5 Bioinformatics explained: RNA structure prediction by minimum free energy minimization, 392; 22.5.2 Structure elements and their energy contribution, 396.]
Ribonucleic acid (RNA) is a nucleic acid polymer that plays several important roles in the cell. As for proteins, the three-dimensional shape of an RNA molecule is important for its molecular function. A number of tertiary RNA structures are known from crystallography, but de novo prediction of tertiary structures is not possible with current methods. However, as for proteins, RNA tertiary structures can be characterized by secondary structural elements, which are hydrogen bonds within the molecule that form several recognizable "domains" of secondary structure like stems, hairpin loops, bulges and internal loops. A large part of the functional information is thus contained in the secondary structure of the RNA molecule, as shown by the high degree of base-pair conservation observed in the evolution of RNA molecules. [Chapter 22: RNA Structure, pp. 374-375]
Computational prediction of RNA secondary structure is a well define
11. [Dialog residue: "On-line verification. CLC Gene Workbench is presently not able to contact CLC bio's license server. Check your internet connection and press OK to try again."]
Figure 1.1: This dialog appears when an online license check is conducted by CLC Combined Workbench and the computer is offline, either at start-up or after 24 hours.
You can then connect to the Internet and retry, or you can save your work and close the program. You can run the workbench again later as long as you are connected to the Internet at start-up. If being online while evaluating is a problem, please contact support@clcbio.com.
1.4.2 Getting and activating the demo license
When you start the program for the first time, you will be presented with the dialog shown in figure 1.2. If you connect to the internet via a proxy server, click the proxy settings button. Otherwise, just click the Request evaluation license button in order to get a license key for a demo of CLC Combined Workbench. Now our server will issue an evaluation license. This process might take a while depending on your internet connection. If you get an error while requesting a license, please see section 1.4.2.
[Chapter 1: Introduction to CLC Combined Workbench, p. 18]
[Dialog residue: License Assistant, CLC Combined Workbench, with the steps Get license, Accept agreement and Activate license.] A license is required. In order to use this application you will need a valid license key file. If you already have a ke
12. [Dialog residue: Trim contamination from saved sequences (to be chosen in the next step).]
Figure 2.42: Specifying how sequences should be trimmed for vector contamination.
If you place the mouse cursor on the parameters, you will see a brief explanation. For now we leave these settings at their default. Click Finish.
2.12.1 Getting an overview of the contig
The result of the assembly is a Contig, which is an alignment of the five reads. Click Fit width to see an overview of the contig. To help you determine the coverage, display a coverage graph (see figure 2.43):
Alignment info in Side Panel | Coverage | Graph
[Screenshot residue: Contig 1 with the consensus, five traces and a coverage graph; Side Panel group "Alignment info".]
Figure 2.43: An overview of the contig with the coverage graph.
This overview can be an aid in determining whether coverage is satisfactory, and if not, which regions a new sequencing effort should focus on. Next, we go into the details of the contig.
[Chapter 2: Tutorials, p. 62]
2.12.2 Finding and editing inconsistencies
Click Zoom to 100% to zoom in
13. Figure 10.4: Selecting enzymes.
[Screenshot residue: enzyme list with the columns Name, Overhang, Methylation and Popularity, and a tooltip for SacII showing the recognition site pattern CCGCGG and a list of suppliers.]
Figure 10.5: Showing additional information about an enzyme like recognition sequence or a list of commercial vendors.
Show enzymes cutting inside/outside selection
Section 19.2.1 describes how to add more enzymes to the list in the Side Panel based on the name of the enzyme, overhang, methylation sensitivity etc. However, you will often find yourself in a situation where you need a more sophisticated and explorative approach.
An illustrative example: you have a selection on a sequence, and you wish to find enzymes cutting within the selection, but not outside. This problem often arises during design of cloning experiments. In this case, you do not know the name of the enzyme, so you want the Workbench to find the enzymes for you:
right-click the selection | Show Enzymes Cutting Inside/Outside Selection
This will display the dialog sho
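The "cutting inside but not outside the selection" question can be sketched in a few lines of Python. This is only an illustration of the idea, not the Workbench's implementation: the function names and the small enzyme table are assumptions made for the example, the test is based on where the recognition site lies rather than on the exact cut position, and the listed sites are palindromic so searching one strand is sufficient.

```python
import re

# Recognition sites (5'->3') of a few well-known palindromic enzymes.
ENZYMES = {
    "EcoRV": "GATATC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "SacII": "CCGCGG",
}


def site_starts(sequence, site):
    """Return 0-based start positions of every occurrence of site."""
    # The lookahead also finds overlapping occurrences.
    return [m.start() for m in re.finditer(f"(?={site})", sequence.upper())]


def enzymes_cutting_only_inside(sequence, start, end, enzymes=ENZYMES):
    """Names of enzymes with at least one site in sequence[start:end] and none outside."""
    result = []
    for name, site in enzymes.items():
        starts = site_starts(sequence, site)
        inside = [s for s in starts if start <= s and s + len(site) <= end]
        if starts and len(inside) == len(starts):
            result.append(name)
    return sorted(result)


# Hypothetical test sequence: BamHI sites flank a region with one EcoRV and one HindIII site.
seq = "AAA" + "GGATCC" + "TTT" + "GATATC" + "TTT" + "AAGCTT" + "TTT" + "GGATCC"
print(enzymes_cutting_only_inside(seq, 10, 28))
```

With the selection covering positions 10 to 28, BamHI is excluded because both of its sites lie outside the selection, and SacII is excluded because it does not cut at all.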
14. Annotation Layout and Annotation Types. See section 10.3.1.
Restriction sites. See section 10.1.2.
Residue coloring. These preferences make it possible to color both the residue letter and set a background color for the residue.
• Non-standard residues. For nucleotide sequences this will color the residues that are not C, G, A, T or U. For amino acids only B, Z, and X are colored as non-standard residues.
  - Foreground color. Sets the color of the letter. Click the color box to change the color.
  - Background color. Sets the background color of the residues. Click the color box to change the color.
• Rasmol colors. Colors the residues according to the Rasmol color scheme. See http://www.openrasmol.org/doc/rasmol.html
[Chapter 10: Viewing and Editing Sequences, p. 134]
  - Foreground color. Sets the color of the letter. Click the color box to change the color.
  - Background color. Sets the background color of the residues. Click the color box to change the color.
• Polarity colors (only protein). Colors the residues according to the polarity of amino acids.
  - Foreground color. Sets the color of the letter. Click the color box to change the color.
  - Background color. Sets the background color of the residues. Click the color box to change the color.
• Trace colors (only DNA). Colors the residues according to the color conventions of chromatogram traces: A=green, C=blue, G=black, and T=red.
  - Foreground color. Sets th
15. [Dialog residue: "Create Protein Charge Plot", step 1 "Select a protein", with the Projects tree on the left and Selected Elements (CAA24102, CAA32220) on the right.]
Figure 16.6: Choosing protein sequences to calculate protein charge.
If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. [Chapter 16: Protein Analyses, p. 249] Use the arrows to add or remove sequences or sequence lists from the selected elements.
You can perform the analysis on several protein sequences at a time. This will result in one output graph showing protein charge graphs for the individual proteins.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.
16.2.1 Modifying the layout
Figure 16.7 shows the electrical charges for three proteins. In the Side Panel to the right, you can modify the layout of the graph.
[Screenshot residue: protein charge graph with the Side Panel group "Graph preferences" (Lock axes, Frame, X-axis at zero, Y-axis at zero, Tick type, Tick lines at).]
16. [Screenshot residue: two alignment views of hemoglobin sequences, the lower one labeled "Realigned with".]
Figure 20.7: Realigning using fixpoints. In the top view, fixpoints have been added to two of the sequences. In the view below, the alignment has been realigned using the fixpoints. The three top sequences are very similar, and therefore they follow the one sequence (number two from the top) that has a fixpoint.
One example would be three sequences A, B and C, where sequences A and B have one copy of a domain while sequence C has two copies of the domain. You can now force sequence A to align to the first copy and sequence B to align to the second copy of the domains in sequence C. This is done by inserting fixpoints in sequence C for each domain, and naming them "fp1" and "fp2" (for example). Now, you can insert a fixpoint in each of sequences A and B, naming them "fp1" and "fp2", respectively. Now, when aligning the three sequences using fixpoints, sequence A will align to the first copy of the domain in sequence C, while sequence B will align to the second copy of the domain in sequence C.
You can name fixpoints by:
[Chapter 20: Sequence Alignment, p. 353]
right-click the Fixpoint annotation | Edit Annotation | type the name in the "Name" field
20.2 View alignments
17. • Mask lower case. With this option selected you can cut and paste a FASTA sequence in upper case characters and denote areas you would like filtered with lower case. This allows you to customize what is filtered from the sequence during the comparison to the BLAST databases.
• Expect. The statistical significance threshold for reporting matches against database sequences: the default value is 10, meaning that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). If the statistical significance ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported. Increasing the threshold shows less stringent matches. Fractional values are acceptable.
• Word Size. BLAST is a heuristic that works by finding word-matches between the query and database sequences. You may think of this process as finding "hot-spots" that BLAST can then use to initiate extensions that might lead to full-blown alignments. For nucleotide-nucleotide searches (i.e. "BLASTn") an exact match of the entire word is required before an extension is initiated, so that you normally regulate the sensitivity and speed of the search by increasing or decreasing the wordsize. For other BLAST searches non-exact word matches are taken into account based upon the similarity between words. The amount of similarity can b
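The effect of the Expect threshold described above can be sketched as follows. This is a minimal illustration, not the code BLAST or the workbench runs: the function names are hypothetical, hits are assumed to be simple (name, E-value) pairs, and the `k` and `lam` arguments stand for the Karlin-Altschul parameters K and lambda, whose real values depend on the scoring system and are not given in this manual.

```python
import math


def expect_value(raw_score, query_len, db_len, k, lam):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def report_hits(hits, expect_threshold=10.0):
    """Keep hits whose E-value does not exceed the threshold, best (lowest) first."""
    kept = [hit for hit in hits if hit[1] <= expect_threshold]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical hits as (name, E-value) pairs.
hits = [("hit_b", 3.0), ("hit_a", 1e-80), ("hit_c", 25.0)]
print(report_hits(hits))          # default threshold of 10
print(report_hits(hits, 1e-5))    # a more stringent threshold
```

Lowering `expect_threshold` removes the weaker matches first, which is why lower thresholds are described as more stringent.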
18. [Screenshot residue: lower part of the overview BLAST table with the Open BLAST Output button and the Side Panel settings Column width and Show column (Query, Number of hits, Top hit E-value, Top hit).]
Figure 12.10: An overview BLAST table summarizing the results for a number of query sequences.
Double-clicking a row will open the BLAST result for this query sequence, allowing more detailed investigation of the result. You can also select one or more rows and click the Open BLAST Output button at the bottom of the view. In the overview table, the following information is shown:
• Query. Since this table displays information about several query sequences, the first column is the name of the query sequence.
• Number of hits. The number of hits for this query sequence.
• Top hit E-value. The E-value of the top hit is shown here. The top hit is defined as the hit with the lowest E-value.
• Top hit. The description of the top hit. If there is no description, it will just be the name of the top hit.
If you wish to save some of the BLAST results as individual elements in the Navigation Area, open them and click Save As in the File menu.
12.3.2 BLAST graphics
The BLAST editor shows the sequence hits which were found in the BLAST search. The hit sequences are represented by colored horizontal lines, and when hovering the mouse
19. Problems with online activation
If you have problems activating the license online, CLC Combined Workbench also offers you an opportunity to manually activate your license key. The problem is most likely to occur if CLC Combined Workbench is unable to establish contact with our server. This may be due to problems with your internet connection or because your computer has restricted access to the internet. In this case, you will see a dialog similar to the one shown in figure 1.9.
[Dialog residue: "Unable to Request Evaluation License. The program was unable to request a license directly from CLC bio. Please click the button below to go to the CLC license web site and submit your request." Button: Request license through web site. "If you are unable to request your license through the web site, please contact support@clcbio.com and include the following information in your email", followed by the activation key.]
Figure 1.5: If you cannot get a license automatically.
In this case, click Request license through web site to go to a web page where you can make a request for a license. Please fill out the form on the web site, and we will send you an email with a pre-activated license as soon as possible.
If you know that you are using a proxy server to connect to the internet, click Cancel and click Proxy Settings in the license dialog.
1.4.3 Fixed license
Unlike the demo version, the fixed license is fully functional offline. You can purc
20. [Index, p. 427]
ps format, export, 120
PubMed references, search, 171
PubMed references, search, 400
Quality of chromatogram trace, 302
Quality of trace, 304
Quality score of trace, 304
Quick start, 26
Rasmol colors, 133
Reading frame, 237
Realign alignment, 401
Reassemble contig, 316
Rebase, restriction enzyme database, 344
Rebuild index, 98
Recognition sequence, insert, 327
Recycle Bin, 79
Redo alignment, 350
Redo/Undo, 83
Reference sequence, 403
References, 418
Region types, 145
Remove
  annotations, 154
  sequences from alignment, 358
  terminated processes, 88
Rename element, 79
Report program errors, 25
Report, protein, 400
Request new feature, 24
Reset license, 18
Residue coloring, 133
Restore
  deleted elements, 79
  size of view, 85
Restriction enzymes
  filter, 139, 141, 331, 333, 336, 345
  from certain suppliers, 139, 141, 331, 333, 336, 345
Restriction enzyme list, 344
Restriction enzyme, star activity, 344
Restriction enzymes, 328
  compatible ends, 143, 335
  cutting selection, 140, 332
  isoschizomers, 143, 335
  methylation, 139, 141, 331, 333, 336, 345
  number of cut sites, 137, 330
  overhang, 139, 141, 331, 333, 336, 345
  separate on gel, 342
  sorting, 137, 330
Restriction sites, 328, 401
  enzyme database Rebase, 344
  select fragment, 144
  number of, 337
  on sequence, 133, 329
  parameters, 335
  tutorial, 44
Results handling, 127
Reverse complement, 235, 401
Reverse contig, 311
Reverse translation, 265, 401
  Bioinformatics e
21. [Screenshot residue: enzyme list with methylation sensitivity entries and the Finish and Cancel buttons.]
Figure 19.17: Choosing enzymes to be considered.
... the Add button. If you e.g. wish to use EcoRV and BamHI, select these two enzymes and add them to the right side panel.
If you wish to use all the enzymes in the list:
Click in the panel to the left | press Ctrl + A (⌘ + A on Mac) | Add
The enzymes can be sorted by clicking the column headings, i.e. Name, Overhang, Methylation or Popularity. This is particularly useful if you wish to use enzymes which produce e.g. a 3' overhang. In this case, you can sort the list by clicking the Overhang column heading, and all the enzymes producing 3' overhangs will be listed together for easy selection.
When looking for a specific enzyme, it is easier to use the Filter. If you wish to find e.g. HindIII sites, simply type HindIII into the filter, and the list of enzymes will shrink automatically to only include the HindIII enzyme. This can also be used to only show enzymes producing e.g. a 3' overhang, as shown in figure 19.33.
[Screenshot residue: "Restriction Map Analysis" dialog, step 2 "Enzymes to be considered in calculation", with the options Enzyme list and Use existing enzyme list and an enzyme table with the columns Name, Overhang, Methylation and Popularity.]
22. [Screenshot residue: enzyme list with methylation sensitivity entries.]
Figure 19.23: Selecting enzymes.
[Screenshot residue: enzyme list with a tooltip for SacII showing the recognition site pattern CCGCGG and a list of suppliers.]
Figure 19.24: Showing additional information about an enzyme like recognition sequence or a list of commercial vendors.
Number of cut sites
Clicking Next confirms the list of enzymes which will be included in the analysis, and takes you to the dialog shown in figure 19.25.
[Screenshot residue: "Restriction Map Analysis" dialog, step 3 "Number of cut sites", with "Display enzymes with" options such as No restriction site (0), One restriction site (1) and Three restriction sites (3).]
23. [Screenshot residue: list of selected protein sequences.]
Figure 12.2: Choose one or more sequences to conduct a BLAST search.
[Screenshot residue: "BLAST Against NCBI Databases" dialog, step 2 "Set program parameters", with the fields Program (blastp: Protein sequence and database), Database and Genetic code.]
Figure 12.3: Choose a BLAST Program and a database for the search.
• BLASTx: Translated DNA sequence against Protein database. If you want to search in protein databases, this BLAST method allows for automated translation of the DNA input sequence and searching in various protein databases.
• tBLASTx: Translated DNA sequence against Translated DNA database. Here, both the input DNA sequence and the searched DNA database are automatically translated.
BLAST search for protein sequences:
• BLASTp: Protein sequence against Protein database. This is the most common BLAST method used when searching for homologous protein sequences with a protein sequence as search input.
• tBLASTn: Protein sequence against Translated DNA database. Here, the protein sequence is searched against an automatically translated DNA database.
Depending on whether you choose a protein or a DNA sequence, a number of different databases can be searched. A complete list of these databases can be found in Appendix B.
[Chapter 12: BLAST Search, p. 175]
When nr appea
24. • Search full domains and fragments. This option allows you to search both for full domains and for partial domains. This could be the case if a domain extends beyond the ends of a sequence.
• Search full domains only. Selecting this option only allows searches for full domains.
• Search fragments only. Only partial domains will be found.
• Database. Only the 100 most frequent domains are included as default in CLC Combined Workbench, but additional databases can be downloaded and installed as described in section 16.6.2.
• Set significance cutoff. The E-value (expectation value) is the number of hits that would be expected to have a score equal to or better than this value, by chance alone. This means that a good E-value, which gives a confident prediction, is much less than 1. E-values around 1 are what is expected by chance. Thus, the lower the E-value, the more specific the search for domains will be. Only positive numbers are allowed.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.
This will open a view showing the found domains as annotations on the original sequence (see figure 16.17). If you have selected several sequences, a corresponding number of views will be opened.
Each found domain will be represented as an annotation of the type Region. More information on each found domain is available through the tooltip, including detailed information on the identity score which is the ba
25. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.
Chapter 17 Primers
Contents
17.1 Primer design - an introduction, 276
  17.1.1 General concept, 276
  17.1.2 Scoring primers, 278
17.2 Setting parameters for primers and probes, 278
  17.2.1 Primer parameters, 279
17.3 Graphical display of primer information, 280
  17.3.1 Compact information mode, 281
  17.3.2 Detailed information mode, 281
17.4 Output from primer design, 282
  17.4.1 Saving primers, 283
  17.4.2 Saving PCR fragments, 283
  17.4.3 Adding primer binding annotation, 283
17.5 Standard PCR, 283
  17.5.1 User input, 284
  17.5.2 Standard PCR output table, 286
17.6 Nested PCR, 287
  17.6.1 Nested PCR output table, 289
17.7 TaqMan, 289
  17.7.1 TaqMan output table, 291
17.8 Sequencing primers, 291
  17.8.1 Sequencing primers output table, 291
17.9 Alignme
26. at the top

3.3 Zoom and selection in View Area

The mode toolbar items in the right side of the Toolbar apply to the function of the mouse pointer. When e.g. Zoom Out is selected, you zoom out each time you click in a view where zooming is relevant (texts, tables and lists cannot be zoomed). The chosen mode is active until another mode toolbar item is selected. (Fit Width and Zoom to 100% do not apply to the mouse pointer.)

Figure 3.15: The mode toolbar items.

3.3.1 Zoom In

There are four ways of zooming in:

Click Zoom In in the toolbar | click the location in the view that you want to zoom in on
or Click Zoom In in the toolbar | click and drag a box around a part of the view | the view now zooms in on the part you selected
or Press '+' on your keyboard

The last option for zooming in is only available if you have a mouse with a scroll wheel:

or Press and hold Ctrl (⌘ on Mac) | Move the scroll wheel on your mouse forward

When you choose the Zoom In mode, the mouse pointer changes to a magnifying glass to reflect the mouse mode.

Note! You might have to click in the view before you can use the keyboard or the scroll wheel to zoom.

If you press the Shift button on your keyboard while clicking in a View, the zoom function is reversed. Hence, clicking on a sequence in this way while the Zoom In mode toolbar item is select
27.
• In the Toolbox you will find the other way of doing restriction site analyses. This way provides more control of the analysis and gives you more output options, e.g. a table of restriction sites, and you can perform the same restriction map analysis on several sequences in one step.

This chapter first describes the dynamic restriction sites, followed by the toolbox way. This section also includes an explanation of how to simulate a gel with the selected enzymes. The final section in this chapter focuses on enzyme lists, which represent an easy way of managing restriction enzymes.

19.2.1 Dynamic restriction sites

If you open a sequence, a sequence list etc., you will find the Restriction Sites group in the Side Panel. As shown in figure 19.12, you can display restriction sites as colored triangles and lines on the sequence. The Restriction Sites group in the Side Panel shows a list of enzymes, represented by different colors corresponding to the colors of the triangles on the sequence. By selecting or deselecting the enzymes in the list, you can specify which enzymes' restriction sites should be displayed.
28. Figure 7.3: Selecting to export whole view or to export only the visible area.

Figure 7.4: A circular sequence as it looks on the screen.

When selecting Export visible area, the exported file will only contain the part of the sequence that is visible in the view. The result from exporting the view from figure 7.4 and choosing Export visible area can be seen in figure 7.5.

Figure 7.5: The exported graphics file when selecting Export visible area.

On the other hand, if you select Export whole view, you will get a result that looks like figure 7.6. This means that the graphics file will also include the part of the sequence which is not visible when you have zoomed in.

For 3D structures, this first step is omitted and you will always export what is shown in the view (equivalent to selecting Export visible area).

Click Next when you have chosen which part of the view to export.

Figure 7.6: The exported graphics file when selecting Export whole view. The whole sequence is shown, even though the view is zoomed in on a part of the sequence.

7.3.2 Save location and file formats

In this step you can choose name and save locatio
29. Figure 17.17: Search parameters for finding primer binding sites.

The adjustable parameters for the search are:
• Exact match. Choose only to consider exact matches of the primer, i.e. all positions must base pair with the template.
• Minimum number of base pairs required for a match. How many nucleotides of the primer must base pair to the sequence in order to cause priming/mispriming.
• Number of consecutive base pairs required in 3' end. How many consecutive 3' end base pairs in the primer MUST be present for priming/mispriming to occur. This option is included since 3' terminal base pairs are known to be essential for priming to occur.
• Select primer to search for. A primer is a normal DNA sequence but can only have a maximum length of 100 nucleotides.

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.

After clicking Finish, the sequences where the primer binds to a subsequence will be annotated with a Primer Binding Site containing information about the primer binding to this subsequence. An example of the result is shown in figure 17.18.

Figure 17.18: Annotation showing a primer match.

17.12 Order primers

To facilitate the orderi
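The three search parameters above can be illustrated with a small sketch. This is not the workbench's implementation: the function name `find_binding_sites` and its arguments are hypothetical, and the sketch assumes that "base pairing" can be modelled as the primer being identical to the scanned strand at a position (so that it pairs with the complementary strand there). Only one strand is scanned.

```python
def find_binding_sites(primer, sequence, min_base_pairs,
                       min_3prime_run, exact=False):
    """Return (start, end, mismatches) for each candidate site, 1-based.

    Hypothetical helper: a position counts as base paired when the primer
    base equals the base of the scanned strand at that position.
    """
    primer = primer.upper()
    sequence = sequence.upper()
    n = len(primer)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        paired = [p == s for p, s in zip(primer, window)]
        n_paired = sum(paired)

        # Count the uninterrupted run of paired bases at the 3' end.
        run_3prime = 0
        for is_paired in reversed(paired):
            if not is_paired:
                break
            run_3prime += 1

        if exact and n_paired != n:
            continue
        if n_paired >= min_base_pairs and run_3prime >= min_3prime_run:
            hits.append((start + 1, start + n, n - n_paired))
    return hits


PRIMER = "CCATGGTTTCCTTCCTCT"
perfect = "AAAA" + PRIMER + "GGGG"
mismatch_5prime = "AAAA" + "T" + PRIMER[1:] + "GGGG"
mismatch_3prime = "AAAA" + PRIMER[:-1] + "A" + "GGGG"
```

With a minimum of 15 base pairs and 3 consecutive 3' base pairs, a single mismatch at the 5' end is still reported as a site, whereas a mismatch in the last 3' base is rejected, which mirrors the reasoning given for the 3' end parameter.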
30. Manage enzymes

By default, the list of restriction enzymes contains 20 of the most popular enzymes, but you can easily modify this list and add more enzymes by clicking the Manage enzymes button. This will display the dialog shown in figure 19.14.

At the top, you can choose to Use existing enzyme list. Clicking this option lets you select an enzyme list which is stored in the Navigation Area. See section 19.4 for more about creating and modifying enzyme lists.

Below there are two panels:
• To the left, you see all the enzymes that are in the list selected above. If you have not chosen to use an existing enzyme list, this panel shows all the enzymes available. (The CLC Combined Workbench comes with a standard set of enzymes based on http://www.rebase.org.)
• To the right, there is a list of the enzymes that will be used.
31. Figure 2.6: The protein alignment as it looks when you open it with background color according to the Rasmol color scheme and automatically wrapped.

is kept on the same line. To see more of the alignment, you now have to scroll horizontally.

Next, expand the Annotation Layout group and select Show Annotations. Set the Offset to "More offset" and set the Label to "On annotation".

Expand the Annotation Types group. Here you will see a list of the types of annotation that are carried by the sequences in the alignment (see figure 2.7).
32. The Genetic Code represents translations of all 64 different codons into 20 different amino acids or a stop signal. Therefore, it is no problem to translate a DNA/RNA sequence into a specific protein. But due to the degeneracy of the genetic code, several codons may code for only one specific amino acid. This can be seen in figure 16.23.

After the discovery of the genetic code, it has been concluded that different organisms (and organelles) have genetic codes which are different from the "standard genetic code". Moreover, the amino acid alphabet is no longer limited to 20 amino acids. The 21st amino acid, selenocysteine, is encoded by a UGA codon, which is normally a stop codon. The discrimination of a selenocysteine over a stop codon is carried out by the translation machinery. Selenocysteines are very rare amino acids.

Figures 16.23 and 16.24 represent the Standard Code, which is the default translation table.

AAs    = FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG
Starts = ---M---------------M---------------M----------------------------
Base1  = TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG
Base2  = TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG
Base3  = TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG

Figure 16.23: The Standard Code for translation.

Second base in codon
U      C      A      G
Phe    Ser    Tyr    Cys
Phe    Ser    Tyr    Cys
Leu    Ser    STOP   STOP
Leu    Ser    STOP
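The compact five-row layout used for the Standard Code can be read column by column: the three base rows spell a codon and the amino acid row gives its translation. The sketch below builds a lookup table that way and translates a reading frame. It is only an illustration; the names `STANDARD_CODE` and `translate` are hypothetical and are not part of the workbench, and the amino acid row is the standard table as published by NCBI, with `*` marking a stop codon.

```python
AAS   = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
BASE1 = "TTTTTTTTTTTTTTTTCCCCCCCCCCCCCCCCAAAAAAAAAAAAAAAAGGGGGGGGGGGGGGGG"
BASE2 = "TTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGGTTTTCCCCAAAAGGGG"
BASE3 = "TCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAGTCAG"

# Each column of the three base rows is one codon; the same column of the
# amino acid row is its translation.
STANDARD_CODE = {
    b1 + b2 + b3: aa
    for aa, b1, b2, b3 in zip(AAS, BASE1, BASE2, BASE3)
}


def translate(dna):
    """Translate a DNA or RNA string in frame 1; '*' marks a stop codon."""
    dna = dna.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3  # ignore an incomplete trailing codon
    return "".join(
        STANDARD_CODE[dna[i:i + 3]] for i in range(0, usable, 3)
    )


protein = translate("ATGGCCTAA")
```

Note that this plain lookup translates UGA as a stop; recoding UGA to selenocysteine depends on sequence context handled by the translation machinery and is not modelled here.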
33. The peak is called by changing the residue to an ambiguity character and by adding an annotation at this position.

To call secondary peaks:

select sequence(s) | Toolbox in the Menu Bar | Sequencing Data Analyses | Call Secondary Peaks

This opens a dialog where you can alter your choice of sequences. When the sequences are selected, click Next. This opens the dialog displayed in figure 18.14.

Figure 18.14: Setting parameters for secondary peak calling.

The following parameters can be adjusted in the dialog:
• Percent of max peak height for calling. Adjust this value to specify how high the secondary peak must be to be called.
• Use IUPAC code / N for ambiguous nucleotides. When a secondary peak is called, the residue at this position can either be replaced by an N or by an ambiguity character based on the IUPAC codes (see section F).
• Add annotations. In addition to changing the actual sequence, annotations can be added for each base which has been called.

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This wil
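A rough sketch of how the calling parameters could interact is shown below. It assumes that the percentage is applied to the height of the primary (highest) peak at the position, which is an interpretation of the parameter name rather than documented behaviour; the function `call_secondary_peak` is hypothetical. The two-base ambiguity letters are the standard IUPAC nucleotide codes.

```python
# Standard IUPAC codes for positions where two bases are possible.
IUPAC_TWO_BASE = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def call_secondary_peak(primary, secondary, primary_height,
                        secondary_height, percent=60.0, use_iupac=True):
    """Return the residue to store at a position with two trace peaks.

    Assumption: the secondary peak is called when its height is at least
    `percent` percent of the primary peak height.
    """
    threshold = primary_height * percent / 100.0
    if primary == secondary or secondary_height < threshold:
        return primary            # secondary peak too low: keep the base
    if not use_iupac:
        return "N"                # option "Use N for ambiguous nucleotides"
    return IUPAC_TWO_BASE[frozenset((primary, secondary))]
```

For example, with the default of 60 percent, an A peak of height 100 with a G peak of height 70 underneath would be stored as R, while a G peak of height 50 would leave the A unchanged.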
34. Figure 2.2: The HUMDINUC file is imported and opened.

File type            Suffix           File format used for
ACE files            .ace             contigs
Phylip Alignment     .phy             alignments
GCG Alignment        .msf             alignments
Clustal Alignment    .aln             alignments
Newick               .nwk             trees
FASTA                .fsa/.fasta      sequences
GenBank              .gbk/.gb/.gp     sequences
GCG sequence         .gcg             sequences (only import)
PIR (NBRF)           .pir             sequences (only import)
Staden               .sdn             sequences (only import)
DNAstrider           .str/.strider    sequences
Swiss Pr
35. Figure 18.10: The view of a contig. Notice that you can zoom to a very detailed level in contigs.

You can see that the color of the residues and trace at the end of one of the reads has been faded. This indicates that this region has not contributed to the contig. This may be due to trimming before or during the assembly, or due to misalignment to the other reads. You can easily adjust the trimmed area to include more of the read in the contig: simply drag the edge of the faded area as shown in figure 18.11.

Figure 18.11: Dragging the edge of the faded area.

If reads have been reversed, this is indicated by red. Otherwise, the residues are colored green. The colors can be changed in the Side Panel as described in section 18.6.1.

If you find out that the reversed reads should have been the forward reads and vice versa, you can reverse the whole contig:

right-click the empty white area of the contig | Reverse Contig

18.6.1 Contig view settings in the Side Panel

Apart from this, the view resembles that of alignments (see section 20.2) but has some extra preferences in the Side Panel:
• Assembly Layout. A new preference group located at the top of the Side Panel:
Gather sequences at top. Enabling this option affects the view that is shown when scrolling horizontally along a contig. If selected, the sequenc
36. in the Toolbar

The plug-in manager has four tabs at the top:
• Manage Plug-ins. This is an overview of plug-ins that are installed.
• Download Plug-ins. This is an overview of available plug-ins on CLC bio's server.
• Manage Resources. This is an overview of resources that are installed.
• Download Resources. This is an overview of available resources on CLC bio's server.

To install a plug-in, click the Download Plug-ins tab. This will display an overview of the plug-ins that are available for download and installation (see figure 1.15).

[Figure 1.15: screenshot of the plug-in manager, listing plug-ins such as Bookmark Navigator, Additional Alignments and Extract Annotations]
37. mamembrane translocation in prokaryotes or is routed through the endoplasmic reticulum in eukaryotic cells. The signal peptide is removed from the resulting mature protein during translocation across the membrane.

For prediction of signal peptides, we query SignalP [Nielsen et al., 1997, Bendtsen et al., 2004b] located at http://www.cbs.dtu.dk/services/SignalP/. Thus, an active internet connection is required to run the signal peptide prediction. Additional information on SignalP and the Center for Biological Sequence analysis (CBS) can be found at http://www.cbs.dtu.dk and in the original research papers [Nielsen et al., 1997, Bendtsen et al., 2004b].

In order to predict potential signal peptides of proteins, the D-score from the SignalP output is used for discrimination of signal peptide versus non-signal peptide (see section 16.1.3). This score has been shown to be the most accurate [Klee and Ellis, 2005] in an evaluation study of signal peptide predictors.

In order to use SignalP, you need to download the SignalP plug-in using the plug-in manager (see section 1.7.1).

When the plug-in is downloaded and installed, you can use it to predict signal peptides:

Select a protein sequence | Toolbox in the Menu Bar | Protein Analyses | Signal Peptide Prediction
or right-click a protein sequence | Toolbox | Protein Analyses | Signal Peptide Prediction

If a sequence was selected before choosing the Toolbox action, this sequence is no
38. Figure 20.12: The joining of the alignments results in one alignment containing rows of sequences corresponding to the number of uniquely named sequences in the joined alignments.

20.4.1 How alignments are joined

Alignments are joined by considering the sequence names in the individual alignments. If two sequences from different alignments have identical names, they are considered to have the same origin and are thus joined. Consider the joining of alignments A and B. If a sequence named "in-A-and-B" is found in both A and B, the spliced alignment will contain a sequence named "in-A-and-B" which represents the characters from A and B joined in direct extension of each other. If a sequence with the name "in-A-not-B" is found in A but not in B, the spliced alignment will contain a sequence named "in-A-not-B". The first part of this sequence will contain the characters from A, but since no sequence information is available from B, a number of gap characters will be added to the end of the sequence, corresponding to the number of residues in B. Note that the function does not require that the individual alignments contain an equal number of sequences.

20.5 Pairwise comparison

For a given set of aligned sequences (see chapter 20), it is possible to make a pairwise comparison in which
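The joining rule in section 20.4.1 can be sketched in a few lines. The representation is an assumption made for illustration: each alignment is a dictionary mapping a sequence name to its aligned row, all rows of one alignment have the same length, and `join_alignments` is a hypothetical helper, not a workbench function. The case of a sequence found only in B (padded with leading gaps) is not spelled out in the text and is inferred here by symmetry.

```python
def join_alignments(a, b, gap="-"):
    """Join two alignments given as {name: aligned_row} dictionaries."""
    len_a = len(next(iter(a.values()))) if a else 0
    len_b = len(next(iter(b.values()))) if b else 0

    joined = {}
    for name, row in a.items():
        # Same name in both: rows are joined in direct extension.
        # Only in A: pad the end with gaps, one per column of B.
        joined[name] = row + b.get(name, gap * len_b)
    for name, row in b.items():
        if name not in joined:
            # Only in B (assumed symmetric case): pad the start with gaps.
            joined[name] = gap * len_a + row
    return joined


alignment_a = {"in-A-and-B": "AC-GT", "in-A-not-B": "ACCGT"}
alignment_b = {"in-A-and-B": "TT-A", "in-B-not-A": "TTGA"}
result = join_alignments(alignment_a, alignment_b)
```

The result has one row per uniquely named sequence, and every row has the combined length of the two input alignments, so the two inputs need not contain the same number of sequences.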
39. Figure 10.7: Selecting enzymes.

Figure 10.8: Showing additional information about an enzyme like recognition sequence or a list of commercial vendors.

Figure 10.9: Deciding number of cut sites inside and outside the selection.

be selected
40. Numerous methods for prediction of protein targeting and signal peptides have been developed; some of them are mentioned and cited in the introduction of the SignalP research paper [Bendtsen et al., 2004b]. However, no prediction method will be able to cover all the different types of signal peptides. Most methods predict classical signal peptides targeting to the general secretory pathway in bacteria or the classical secretory pathway in eukaryotes. Furthermore, a few methods for prediction of non-classically secreted proteins have emerged [Bendtsen et al., 2004a, Bendtsen et al., 2005].

Prediction of signal peptides and subcellular localization

In the search for accurate prediction of signal peptides, many approaches have been investigated. Almost 20 years ago, the first method for prediction of classical signal peptides was published [von Heijne, 1986]. Nowadays, more sophisticated machine learning methods, such as neural networks, support vector machines and hidden Markov models, have arrived along with the increasing computational power, and they all perform better than the old weight-matrix-based methods [Menne et al., 2000]. Many other "classical" statistical approaches have also been carried out, often in conjunction with machine learning methods. In the following sections, a wide range of different signal peptide and subcellular prediction methods will be described.

Most signal peptide prediction methods require the presence of the correct N term
41. ProData and MolData. These folders are necessary when we import the data into CLC Combined Workbench.

In order to import all DNA/RNA, protein and oligo sequences, if a default database directory is installed:

select File in the Menu Bar | Import VectorNTI Data | select Yes if you want to import the default database | confirm the information
or select File in the Menu Bar | Import VectorNTI Data | select No to choose a database | select a database directory | Import | confirm the information

After the import there is a new folder called Vector NTI Data in the Navigation Area. In Vector NTI Data you can see three folders: DNA/RNA, containing the DNA and RNA sequences; Protein, containing all protein sequences; and oligo, containing all oligo sequences (see figure 7.2). The folders and all sequences are automatically saved.

Figure 7.2: The Vector NTI Data folder containing all imported sequences of the Vector NTI Database.

If for some reason the import fails, an alternative approach is to export all the files from Vector NTI and import them as described in the previous sections. You can export a selection of files as a Vector NTI archive (.ma4, .pa4) which can be imported into the CLC Combined Workbenc
42. You can append a wildcard character by clicking the checkbox at the bottom. This means that you only have to enter the first part of the search text, e.g. searching for "prot" will find both "protein" and "protease".

The following parameters can be added to the search:
• All fields. Text, searches in all parameters in the NCBI structure database at the same time.
• Organism. Text.
• Author. Text.
• PdbAcc. The accession number of the structure in the PDB database.

The search parameters are the most recently used. The All fields parameter allows searches in all parameters in the database at the same time. All fields also provides an opportunity to restrict a search to parameters which are not listed in the dialog. E.g. writing 'gene[Feature key] AND mouse' in All fields generates hits in the GenBank database which contain one or more genes and where 'mouse' appears somewhere in the GenBank file. NB: the 'Feature Key' option is only available in GenBank when searching for nucleotide structures. For more information about how to use this syntax, see http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

When you are satisfied with the parameters you have entered, click Start search.

Note! When conducting a search, no files are downloaded. Instead, the program produces a list of links to the files in the NCBI database. This ensures a much faster search
43. and selecting another element while holding down the <Shift> key selects all the elements listed between the two locations (the two end locations included).
• Selecting one element and moving the cursor with the arrow keys while holding down the <Shift> key enables you to increase the number of elements selected.

3.1.5 Moving and copying elements

Elements can be moved and copied in several ways:
• Using Copy, Cut and Paste from the Edit menu.
• Using Ctrl + C (⌘ + C on Mac), Ctrl + X (⌘ + X on Mac) and Ctrl + V (⌘ + V on Mac).
• Using Copy, Cut and Paste in the Toolbar.
• Using drag and drop to move elements.
• Using drag and drop while pressing Ctrl / Command to copy elements.

In the following, all of these possibilities for moving and copying elements are described in further detail.

Copy, cut and paste functions

Copies of elements and folders can be made with the copy/paste function, which can be applied in a number of ways:

select the files to copy | right-click one of the selected files | Copy | right-click the location to insert files into | Paste
or select the files to copy | Ctrl + C (⌘ + C on Mac) | select where to insert files | Ctrl + V (⌘ + V on Mac)
or select the files to copy | Edit in the Menu Bar | Copy | select where to insert files | Edit in the Menu Bar | Paste

If there is already an element of that name the
44. click the Save/Restore Settings button at the top of the Side Panel and click Save Settings (see figure 2.9).

Figure 2.9: Saving the settings of the Side Panel.

This will open the dialog shown in figure 2.10. In this way you can save the current state of the settings in the Side Panel so that you can apply them to alignments later on. If you check Always apply these settings, these settings will be applied every time you open a view of the alignment.

Type "My settings" in the dialog and click Save.

Figure 2.10: Dialog for saving the settings of the Side Panel.

2.3.2 Applying saved settings

When you click the Save/Restore Settings button again and select Apply Saved Settings, you will see "My settings" in the menu together with some pre-defined settings that the workbench has created for you (see figure 2.11).

Figure 2.11: Menu for applying saved settings.

Whenever you open an alignment, you will be able to apply these settings. Each kind of view has its own list of settings that can be applied. At the bottom of the list you will
45. Figure 14.17: Selecting two sequences to be joined.

If you have selected some sequences before choosing the Toolbox action, they are now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences from the selected elements.

Clicking Next opens the dialog shown in figure 14.18. In step 2 you can change the order in which the sequences will be joined. Select a sequence and use the arrows to move the selected sequence up or down.

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. The result is shown in figure 14.19.

Figure 14.18: Setting the order in which sequences are joined.

Figure 14.19: The result of joining sequences is a new sequence containing the annotations of the joined sequences (they each had a HBB annotation).

14.6 Motif Search

CLC Combined Workbench offers advanced and versatile options to search for unknown sequence patterns or known motifs represented either by a literal string or a regular expression. These advanced search capabilities are available for use in both DNA and protein sequ
46. it is possible to edit the template to introduce mismatches which may affect the melting temperature. At each side of the template sequence, a text field is shown. Here, the dangling ends of the template sequence can be specified. These may have an important effect on the melting temperature [Bommarito et al., 2000].

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. The result is shown in figure 17.16.

Figure 17.16: Properties of a primer from the Example Data.

In the Side Panel you can specify the information to display about the primer. The information parameters of the primer properties table are explained in section 17.5.2.

17.11 Find binding sites on sequence

In CLC Combined Workbench you have the possibility of matching a known primer against one or more DNA sequences or a list of DNA sequences. This can be applied to test whether a primer used in a previous experiment is applicable to amplify e.g. a homologous region in a
47.
20 Sequence alignment
20.1 Create an alignment
20.2 View alignments
20.3 Edit alignments
20.4 Join alignments
20.5 Pairwise comparison
20.6 Bioinformatics explained: Multiple alignments
21 Phylogenetic trees
21.1 Inferring phylogenetic trees
21.2 Bioinformatics explained: phylogenetics
22 RNA structure
22.1 RNA secondary structure prediction
22.2 View and edit secondary structures
22.3 Evaluate structure hypothesis
22.4 Structure Scanning Plot
22.5 Bioinformatics explained: RNA structure prediction by minimum free energy minimization
IV Appendix
A Comparison of workbenches
B BLAST databases
B.1 Peptide sequence databases
B.2 Nucleotide sequence databases
B.3 SNP BLAST databases
C Proteolytic cle
48. of the Preferences dialog (figure 1.19) and enter the appropriate information. The Preferences dialog is opened from the Edit menu.

You have the choice between an HTTP proxy and a SOCKS proxy. CLC Combined Workbench only supports the use of a SOCKS proxy that does not require authorization.

If you have any problems with these settings, you should contact your systems administrator.

1.9 The format of the user manual

This user manual offers support to Windows, Mac OS X and Linux users. The software is very similar on these operating systems. In areas where differences exist, these will be described separately. However, the term "right-click" is used throughout the manual, but some Mac users may have to use Ctrl + click in order to perform a "right-click" (if they have a single-button mouse).

The most recent version of the user manuals can be downloaded from http://www.clcbio.com/usermanuals.

The user manual consists of four parts.
• The first part includes the introduction and some tutorials showing how to apply the most significant functionalities of CLC Combined Workbench.
• The second part describes in detail how to operate all the program's basic functionalities.
• The third part digs deeper into some of the bioinformatic features of the program. In this part, you will also find our "Bioinformatics explained" sections. These sections elaborate on the algorithms and analyses of CLC
49. primers are represented as DNA sequences in the Navigation Area:

Toolbox in the Menu Bar | Primers and Probes | Analyze Primer Properties

If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove a sequence from the selected elements.

Clicking Next generates the dialog seen in figure 17.15.

Figure 17.15: The parameters for analyzing primer properties.

In the Concentrations panel, a number of parameters can be specified concerning the reaction mixture, which influence melting temperatures:
• Primer concentration. Specifies the concentration of primers and probes in nanomolar (nM).
• Salt concentration. Specifies the concentration of monovalent cations (Na+, K+ and equivalents) in millimolar (mM).

In the Template panel, the sequences of the chosen primer and the template sequence are shown. The template sequence is by default set to the reverse complement of the primer sequence, i.e. as perfectly base pairing. However
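As an illustration of the default template, the reverse complement of a primer can be computed as sketched below. The helper name `reverse_complement` is hypothetical, and the 18-nt primer is taken from the example annotation shown for a primer match. Read 3' to 5' (i.e. without the reversal), the same template is simply the base-by-base complement of the primer, which is how a perfectly pairing template lines up underneath it.

```python
# Watson-Crick partners for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


primer = "CCATGGTTTCCTTCCTCT"
template_5_to_3 = reverse_complement(primer)
template_3_to_5 = template_5_to_3[::-1]  # same strand, written under the primer
```

Here `template_3_to_5` is GGTACCAAAGGAAGGAGA, so every position pairs with the primer above it; editing a base in the template would introduce a mismatch of the kind that can change the melting temperature.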
50. Figure 4.2: Search results.

If there are many hits, only the first 50 hits are immediately shown. At the bottom of the pane you can click Next (>) to see the next 50 hits (see figure 4.3). If a search gives no hits, you will be asked if you wish to search for matches that start with your search term. If you accept this, an asterisk (*) will be appended to the search term.

Pressing the Alt key while you click a search result will highlight the search hit in its folder in the Navigation Area.

In the preferences (see 5), you can specify the number of hits to be shown.

Figure 4.3: Page two of the search results.

4.2.2 Special search expressions

When you write a search term in the search field, you can get help to write a more advanced search expression by pressing Shift + F1. This will reveal a list of guides as shown in figure 4.4.

• Wildcard search
• Search related words
• Include both terms (AND)
• Include either term (OR)
• Any field search (contents)
• Name search (name)
• Length search (length, START TO END)
• Organism search (organis
51. you can enter different protein patterns from the PROSITE database (protein patterns using regular expressions and describing specific amino acid sequences). The PROSITE database contains a great number of patterns and has been used to identify related proteins (see http://www.expasy.org/cgi-bin/prosite-list.pl).

In order to search for a known motif:

Select DNA or protein sequence(s) | Toolbox in the Menu Bar | General Sequence Analyses | Motif Search
or Right-click DNA or protein sequence(s) | Toolbox | General Sequence Analyses | Motif Search

Figure 14.20: Setting parameters for the motif search. See text for details.

If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.

You can perform the analysis on several DNA or several protein sequences at a time. If the analysis is performed on several sequences at a time, the method will search for patterns in the sequences and open a new view for each of the sequences. Click Next to adjust p
52. 1
• User: The user who performed the operation. If you import some data created by another person in a CLC Workbench, that person's name will be shown.
• Parameters: Details about the action performed. This could be the parameters that were chosen for an analysis.
• Origins from: This information is usually shown at the bottom of an element's history. Here you can see which elements the current element originates from. If you have e.g. created an alignment of three sequences, the three sequences are shown here. Clicking the element selects it in the Navigation Area, and clicking the history link opens the element's own history.
• Comments: By clicking Edit you can enter your own comments regarding this entry in the history. These comments are saved.

8.1.1 Sharing data with history

The history of an element is attached to that element, which means that exporting an element in CLC format (.clc) will export the history too. In this way, you can share folders and files with others while preserving the history. If an element's history includes source elements (i.e. if there are elements listed in Origins from), they must also be exported in order to see the full history. Otherwise, the history will have entries named "Element deleted". An easy way to export an element with all its source elements is to use the Export Dependent Objects function described in section 7.1.2.

The history view can be prin
53. 1 How alignments are joined
20.5 Pairwise comparison
20.5.1 Pairwise comparison on alignment selection
20.5.2 Pairwise comparison parameters
20.5.3 The pairwise comparison table
20.6 Bioinformatics explained: Multiple alignments
20.6.1 Use of multiple alignments
20.6.2 Constructing multiple alignments

CLC Combined Workbench can align nucleotides and proteins using a progressive alignment algorithm (see section 20.6 or read the White paper on alignments in the Science section of http://www.clcbio.com). This chapter describes how to use the program to align sequences. The chapter also describes alignment algorithms in more general terms.

CHAPTER 20 SEQUENCE ALIGNMENT

20.1 Create an alignment

Alignments can be created from sequences, sequence lists (see section 10.7), existing alignments and from any combination of the three. To create an alignment in CLC Combined Workbench:

select elements to align | Toolbox in the Menu Bar | Alignments and Trees | Create Alignment
or select elements to align | right-click any selected sequence | Toolbox | Alignments and Trees | Create Alignment

This opens the dialog shown in figure 20.1.
54. 1.5.2 Report program errors
1.5.3 Free vs. commercial workbenches
1.6 When the program is installed: Getting started
1.6.1 Quick start
1.6.2 Import of example data
1.7 Extending the workbench with plug-ins
1.7.1 Installing plug-ins
1.7.2 Uninstalling plug-ins
1.7.3 Updating plug-ins
1.7.4 Resources
1.8 Network configuration
1.9 The format of the user manual
1.9.1 Text formats

CHAPTER 1 INTRODUCTION TO CLC COMBINED WORKBENCH

Welcome to CLC Combined Workbench, a software package supporting your daily bioinformatics work. We strongly encourage you to read this user manual in order to get the best possible basis for working with the software package.

1.1 Contact information

The CLC Combined Workbench is developed by:

CLC bio A/S
Science Park Aarhus
Gustav Wieds Vej 10
8000 Aarhus C
Denmark
http://www.clcbio.com
VAT no.: DK 28 30 50 87
Telephone: +45 70 22 55 09
Fax: +45 70 22 55 19
E-mail: info@clcbio.com

If you have questions or comments regarding the program yo
55. 19.34 or use the view of enzyme lists (see 19.4).

At the bottom of the dialog, you can select to save this list of enzymes as a new file. In this way, you can save the selection of enzymes for later use.

When you click Finish, the enzymes are added to the Side Panel and the cut sites are shown on the sequence. If you have specified a set of enzymes which you always use, it will probably be a good idea to save the settings in the Side Panel (see section 3.2.7) for future use.

The CLC Combined Workbench comes with a standard set of enzymes based on http://www.rebase.org.
56. Figure 10.12:
• Region 1: A single residue.
• Region 2: A range of residues including both endpoints.
• Region 3: A range of residues starting somewhere before 30 and continuing up to and including 40.
• Region 4: A single residue somewhere between 50 and 60 inclusive.
• Region 5: A range of residues beginning somewhere between 70 and 80 inclusive and ending at 90 inclusive.
• Region 6: A range of residues beginning somewhere between 100 and 110 inclusive and ending somewhere between 120 and 130 inclusive.
• Region 7: A site between residues 140 and 141.
• Region 8: A site between two residues somewhere between 150 and 160 inclusive.
• Region 9: A region that covers ranges from 170 to 180 inclusive and 190 to 200 inclusive.
• Region 10: A region on the negative strand that covers ranges from 210 to 220 inclusive.
• Region 11: A region on the negative strand that covers ranges from 230 to 240 inclusive and 250 to 260 inclusive.
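The region types listed above can be pictured as a small data structure. The following is only an illustrative sketch in Python, not how the workbench stores annotations: the `Region` class, its fields and its methods are hypothetical names, and only exact, inclusive ranges (as in Regions 9 to 11) are modelled, not the uncertain endpoints of Regions 3 to 6.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    """Hypothetical model of a region: inclusive 1-based ranges plus a strand flag."""
    ranges: Tuple[Tuple[int, int], ...]
    negative_strand: bool = False

    def covers(self, position: int) -> bool:
        # A position is covered if it falls inside any of the inclusive ranges.
        return any(start <= position <= end for start, end in self.ranges)

    def length(self) -> int:
        # Both endpoints count, hence the +1 per range.
        return sum(end - start + 1 for start, end in self.ranges)


# Region 9: two ranges on the positive strand.
region_9 = Region(ranges=((170, 180), (190, 200)))
# Region 11: two ranges on the negative strand.
region_11 = Region(ranges=((230, 240), (250, 260)), negative_strand=True)
```

With this sketch, `region_9.covers(185)` is false because position 185 lies in the gap between the two ranges.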
57. 14.1 Shuffle sequence
14.2 Dot plots
14.3 Local complexity plot
14.4 Sequence statistics
14.5 Join sequences
14.6 Motif Search
14.7 Pattern Discovery
15 Nucleotide analyses
15.1 Convert DNA to RNA
15.2 Convert RNA to DNA
15.3 Reverse complements of sequences
15.4 Translation of DNA or RNA to protein
15.5 Find open reading frames
16 Protein analyses
16.1 Signal peptide prediction
16.2 Protein charge
16.3 Transmembrane helix prediction
16.4 Antigenicity
16.5 Hydrophobicity
16.6 Pfam domain search
16.7 Secondary structure prediction
16.8 Protein report
16.9 Reverse translation from protein into DNA
16.10
58. 3.2.4 Save changes in a view

When changes are made in a view, the text on the tab appears bold and italic (on Mac it is indicated by an * before the name of the tab). This indicates that the changes are not saved. The Save function may be activated in two ways:

Click the tab of the view you want to save | Save in the toolbar
or Click the tab of the view you want to save | Ctrl + S (⌘ + S on Mac)

If you close a view containing an element that has been changed since you opened it, you are asked if you want to save.

When saving a new view that has not been opened from the Navigation Area (e.g. when opening a sequence from a list of search hits), a save dialog appears (figure 3.10).

Figure 3.10: Save dialog.

In the dialog, you select the folder in which you want to save the element. After naming the element, press OK.

3.2.5 Undo/Redo

If you make a change in a view, e.g. remove an annotation in a sequence or modify a tree, you can undo the action. In general, Undo applies to all changes you can make when right-clicking in a view. Undo is done by:

Click undo in the Toolbar
or Edit | Undo
or Ctrl + Z

If you
59. 321 SNP 185 annotation 185 403 BLAST 185 search for 403 SNP annotation parameters 185 results 187 SNP annotation using BLAST 185 SNP BLAST 403 databases 405 Sort sequences 158 sequences alphabetically 358 sequences by similarity 358 Sort folders 76 Source element 125 Species display name 78 Staden file format 35 113 409 Standard layout trees 369 Standard Settings CLC 105 Star activity 344 Start Codon 238 Start up problems 25 Statistics about sequence 400 protein 222 sequence 220 Status Bar 88 89 illustration 73 str file format 112 Structure scanning 402 Structure prediction 262 Style sheet preferences 103 Support mail 13 Surface probability 135 svg format export 120 Swiss Prot 164 search see UniProt Swiss Prot file format 35 113 409 Swiss Prot TrEMBL 400 swp file format 112 System requirements 16 Table of fragments 340 Tabs use of 80 Tags insert into sequence 327 TaqMan primers 402 tBLASTn 174 tBLASTx 173 Terminated processes 88 Text format 137 user manual 31 view sequence 155 Text file format 35 113 409 tifformat export 120 Tips for BLAST searches 50 TMHMM 251 Toolbar illustration 73 preferences 100 Toolbox 88 89 illustration 73 show hide 89 Topology layout trees 369 Trace colors 134 Trace data 301 403 quality 304 Traces scale 302 Translate a selection 134 along DNA sequence 134 annotation to protei
60. Graph: Displays the conservation level as a graph at the bottom of the alignment. The bar (default view) shows the conservation of all sequence positions. The height of the graph reflects how conserved that particular position is in the alignment. If one position is 100% conserved, the graph will be shown in full height.
  • Height: Specifies the height of the graph.
  • Type: The type of the graph.
    Line plot: Displays the graph as a line plot.
    Bar plot: Displays the graph as a bar plot.
    Colors: Displays the graph as a color bar using a gradient like the foreground and background colors.
  • Color box: Specifies the color of the graph for line and bar plots, and specifies a gradient for colors.

• Gap fraction: The fraction of the sequences in the alignment that have gaps. The gap fraction is only relevant if there are gaps in the alignment.
  Foreground color: Colors the letter using a gradient, where the left side color is used if there are relatively few gaps, and the right side color is used if there are relatively many gaps.
  Background color: Sets a background color of the residues using a gradient in the same way as described above.
  Graph: Displays the gap fraction as a graph at the bottom of the alignment.
  • Height: Specifies the height of the graph.
  • Type: The type of the graph.
    Line plot: Displays the graph as a line plot.
    Bar plot: Displays the graph as a bar plot.
    Colors: Displays the graph as a co
61. 4 Getting an overview of the inconsistencies
2.12.5 Documenting your changes
2.12.6 Using the result for further analyses
2.13 Tutorial: In silico cloning
2.13.1 The cloning editor
2.13.2 Cutting the PCR fragment with the SphI enzyme
2.13.3 Inserting the fragment in the vector
2.14 Tutorial: Folding RNA molecules

This chapter contains tutorials representing some of the features of CLC Combined Workbench. The first tutorials are meant as a short introduction to operating the program. The last tutorials give examples of how to use some of the main features of CLC Combined Workbench. The tutorials are also available as interactive Flash tutorials on http://www.clcbio.com/tutorials.

2.1 Tutorial: Getting started

This brief tutorial will take you through the most basic steps of working with CLC Combined Workbench. The tutorial introduces the user interface, shows how to create a folder, and demonstrates how to import your own existing data into the program.

When you open CLC Combined Workbench for the first time, the user interface looks like figure 2.1.

At this stage, the important issues are the Navigation Area and the View Area.

The Navigation Area to the left is where you keep all your data
62.
Amino acid   Mammalian    Yeast        E. coli
Ala (A)      4.4 hour     >20 hours    >10 hours
Cys (C)      1.2 hours    >20 hours    >10 hours
Asp (D)      1.1 hours    3 min        >10 hours
Glu (E)      1 hour       30 min       >10 hours
Phe (F)      1.1 hours    3 min        2 min
Gly (G)      30 hours     >20 hours    >10 hours
His (H)      3.5 hours    10 min       >10 hours
Ile (I)      20 hours     30 min       >10 hours
Lys (K)      1.3 hours    3 min        2 min
Leu (L)      5.5 hours    3 min        2 min
Met (M)      30 hours     >20 hours    >10 hours
Asn (N)      1.4 hours    3 min        >10 hours
Pro (P)      >20 hours    >20 hours    ?
Gln (Q)      0.8 hour     10 min       >10 hours
Arg (R)      1 hour       2 min        2 min
Ser (S)      1.9 hours    >20 hours    >10 hours
Thr (T)      7.2 hours    >20 hours    >10 hours
Val (V)      100 hours    >20 hours    >10 hours
Trp (W)      2.8 hours    3 min        2 min
Tyr (Y)      2.8 hours    10 min       2 min

Table 14.2: Estimated half-life. Half-life of proteins where the N-terminal residue is listed in the first column and the half-life in the subsequent columns for mammals, yeast and E. coli.

amino acid composition is important when calculating the extinction coefficient. The extinction coefficient is calculated from the absorbance of cysteine, tyrosine and tryptophan using the following equation:

Ext(Protein) = count(Cystine) × Ext(Cystine) + count(Tyr) × Ext(Tyr) + count(Trp) × Ext(Trp)

where Ext is the extinction coefficient of the amino acid in question. At 280 nm the extinction coefficients are: Cys: 120, Tyr: 1280 and Trp: 5690.

This equation is only valid under the following conditions:
• pH 6.5
• 6.0 M guanid
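The extinction coefficient equation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the workbench's own code: the function name `extinction_coefficient` is hypothetical, the three constants are the 280 nm values quoted in the text, and the number of cystines has to be supplied by the caller because it cannot be read from the sequence alone.

```python
# Extinction coefficients at 280 nm, as quoted in the text.
EXT_CYSTINE = 120
EXT_TYR = 1280
EXT_TRP = 5690


def extinction_coefficient(protein: str, cystine_count: int = 0) -> int:
    """Sum the contributions of cystines, tyrosines (Y) and tryptophans (W).

    `cystine_count` is an assumption supplied by the caller; the sequence
    itself does not say how many cysteines are paired into cystines.
    """
    seq = protein.upper()
    return (cystine_count * EXT_CYSTINE
            + seq.count("Y") * EXT_TYR
            + seq.count("W") * EXT_TRP)


# One Trp, two Tyr and one assumed cystine: 5690 + 2 * 1280 + 120
example = extinction_coefficient("MWYYCKC", cystine_count=1)
```

For the example sequence, the three terms are 5690 (one Trp), 2560 (two Tyr) and 120 (one cystine).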
63. Figure 12.12: Setting parameters for the local BLAST database.

• Select Input Source: Lets you choose whether to include sequences from the Navigation Area or from the computer's file system (External FASTA file).
• Sequence type: If you choose to import sequences from an external FASTA file into the database, you must choose whether the sequences are nucleotide or protein sequences.
• Input Sequences: Depending on the choice of Select Input Source above, clicking the button will let you browse the Navigation Area or the external file system for the sequences which you want to include in the database.
• Save BLAST database: Lets you browse your external file system for a suitable place to save the database.

The location of the local database can be defined by the user, but by default all databases are stored in the following locations:
• Windows: My Documents\CLCdatabases\<databasename.db>
• Mac: /users/username/<databasename.db>
• Linux: /users/username/<databasename.db>

where <databasename.db> is the name entered in the dialog in figure 12.13.

When a database is deleted from the navigation area in the workbench i
64. Figure 19.14: Adding or removing enzymes from the Side Panel.

Select enzymes in the left side panel and add them to the right panel by double-clicking or clicking the Add button. If you e.g. wish to use EcoRV and BamHI, select these two enzymes and add them to the right side panel.

If you wish to use all the enzymes in the list:

Click in the panel to the left | press Ctrl + A (⌘ + A on Mac) | Add

The enzymes can be sorted by clicking the column headings, i.e. Name, Overhang, Methylation or Popularity. This is particularly useful if you wish to use enzymes which produce e.g. a 3' overhang. In this case, you can sort the list by clicking the Overhang column heading, and all the enzymes producing 3' overhangs will be listed together for easy selection.

When looking for a specific enzyme, it is easier to use the Filter. If you wish to find e.g. HindIII sites, simply type HindIII into the filter, and the list of enzymes will shrink automatically to only include the HindIII enzyme. This can also be used to only show enzymes producing e.g. a 3' overhang, as shown in figure 19.33.
65. Clicking Select Database opens the dialog shown in figure 12.7.

Figure 12.7: Select a BLAST database or a set of sequences.

In this dialog, you can either choose a database (see section 12.4) or you can select a set of sequences which will be used as the database to BLAST against. If you select sequences instead of an existing database, it may take a little longer to perform the BLAST search, since a temporary database is created on the fly before the actual BLAST begins. If you often BLAST against the same set of sequences, it will be faster to create the database first (see section 12.4).

When a database or a set of sequences has been selected, click Next. This opens the dialog seen in figure 12.8.

Figure 12.8: Examples of different
66. Combined Workbench and provide more general knowledge of bioinformatic concepts.
• The fourth part is the Appendix and Index.

Each chapter includes a short table of contents.

1.9.1 Text formats

In order to produce a clearly laid-out content in this manual, different formats are applied:
• A feature in the program is in bold starting with capital letters. (Example: Navigation Area)
• An explanation of how a particular function is activated is illustrated by "|" and bold. (E.g. select the element | Edit | Rename)

Chapter 2

Tutorials

Contents
2.1 Tutorial: Getting started
2.1.1 Creating a folder
2.1.2 Import data
2.1.3 Supported data formats
2.2 Tutorial: View sequence
2.3 Tutorial: Side Panel Settings
2.3.1 Saving the settings in the Side Panel
2.3.2 Applying saved settings
2.4 Tutorial: GenBank search and download
2.4.1 Searching for matching objects
2.4.2 Saving the sequence
2.5 Tutorial: Align protein sequences
2.5.1 The alignment dialog
2.6 Tutorial: Creat
67. Combined Workbench you can translate a nucleotide sequence into a protein sequence using the Toolbox tools. Usually, you use the +1 reading frame, which means that the translation starts from the first nucleotide. Stop codons result in an asterisk being inserted in the protein sequence at the corresponding position. It is possible to translate in any combination of the six reading frames in one analysis. To translate:

select a nucleotide sequence | Toolbox in the Menu Bar | Nucleotide Analyses | Translate to Protein
or right-click a nucleotide sequence | Toolbox | Nucleotide Analyses | Translate to Protein

This opens the dialog displayed in figure 15.4.

Figure 15.4: Choosing sequences for translation.

If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.

Clicking Next generates the dialog seen in figure 15.5.

Here you have the following options: Re
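The translation behaviour described at the start of this section (start at the first nucleotide for the +1 frame, insert an asterisk at stop codons) can be sketched in Python. This is an illustrative sketch only: the function `translate` and its `frame` argument are hypothetical, the standard genetic code is assumed, only the three forward frames are handled, and incomplete trailing codons are simply dropped.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}


def translate(nucleotides: str, frame: int = 1) -> str:
    """Translate DNA or RNA in forward frame 1, 2 or 3; stop codons become '*'."""
    if frame not in (1, 2, 3):
        raise ValueError("this sketch only handles the forward frames 1, 2 and 3")
    seq = nucleotides.upper().replace("U", "T")[frame - 1:]
    # Step through complete codons only; a trailing partial codon is ignored.
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


protein_frame_1 = translate("ATGGCCTAAGGG")           # ATG GCC TAA GGG
protein_frame_2 = translate("ATGGCCTAAGGG", frame=2)  # TGG CCT AAG (GG dropped)
```

Translating the three reverse frames would amount to applying the same function to the reverse complement of the sequence.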
68. Export graphics to files
7.4 Copy/paste view output
8 History log
8.1 Element history
9 Handling of results
9.1 How to handle results of analyses

III Bioinformatics

10 Viewing and editing sequences
10.1 View sequence
10.2 Circular DNA
10.3 Working with annotations
10.4 Sequence information
10.5 View as text
10.6 Creating a new sequence
10.7 Sequence Lists
11 Online database search
11.1 GenBank search
11.2 UniProt (Swiss-Prot/TrEMBL) search
11.3 Search for structures at NCBI
11.4 Sequence web info
12 BLAST search
12.1 BLAST Against NCBI Database
12.2 BLAST Against Local Database
12.3 Output from BLAST search
12.4 Create Local BLAST Database
12.5 SNP annotation using BLAST
12.6 Bioinformatics explained: BLAST
13 3D molecule viewing
13.1 Importing structure files
13.2 View structure files
13.3 The structure table
13.4 Options through the preference panel
13.5 3D output
14 General sequence analyses
69. Combinations of keys and mouse movements are listed below:

Action                     Windows/Linux   Mac OS X   Mouse movement
Maximize View                                         Double-click the tab of the View
Restore View                                          Double-click the View title
Reverse zoom function      Shift           Shift      Click in view
Select multiple elements   Ctrl            ⌘          Click elements
Select multiple elements   Shift           Shift      Click elements

On Linux, changing tabs is accomplished using Ctrl + Page Up/Page Down.

Chapter 4

Searching your data

Contents
4.1 What kind of information can be searched?
4.2 Quick search
4.2.1 Quick search results
4.2.2 Special search expressions
4.2.3 Quick search history
4.3 Advanced search
4.4 Search index

There are two ways of doing text-based searches of your data, as described in this chapter:
• Quick search directly from the search field in the Navigation Area.
• Advanced search, which makes it easy to make more specific searches.

In most cases, quick search will find what you need, but if you need to be more specific in your search criteria, the advanced search is preferable.

4.1 What kind of infor
70. Figure 18.5: Setting parameters for trimming.

The following parameters can be adjusted in the dialog:

• Ignore existing trim information. If you have previously trimmed the sequences, you can check this to remove existing trimming annotation prior to analysis.
• Trim using quality scores. If the sequence files contain quality scores from a base-caller algorithm, this information can be used for trimming sequence ends. The program uses the modified-Mott trimming algorithm for this purpose (Richard Mott, personal communication).
• Trim using ambiguous nucleotides. This option trims the sequence ends based on the presence of ambiguous nucleotides (typically N). Note that the automated sequencer generating the data must be set to output ambiguous nucleotides in order for this option to apply. The algorithm takes as input the maximal number of ambiguous nucleotides allowed in the sequence after trimming. If this maximum is set to e.g. 3, the algorithm finds the maximum-length region containing 3 or fewer ambiguities and then trims away the ends not included in this region.
• Trim contamination from vectors in UniVec database. If selected, the program will match the sequence reads against all vectors in the UniVec database and remove sequence ends with significant matches (the database is included when you install the CLC Combined Workbench). A list of all the vectors in the UniVec database can be
71. Figure 22.15: The structure has now been rotated.

Press Reset layout in the Side Panel to reset the layout to the way it looked when the structure was predicted.

22.2.2 Tabular view of structures and energy contributions

There are three main reasons to use the Secondary structure table:
• If more than one structure is predicted (see section 22.1), the table provides an overview of all the structures which have been predicted.
• With multiple structures, you can use the table to determine which structure should be displayed in the Secondary structure 2D view (see section 22.2.1).
• The table contains a hierarchical display of the elements in the structure with detailed information about each element's energy contribution.

To show the secondary structure table of an already open sequence, click the Show Secondary Structure Table button at the bottom of the sequence view. If the sequence is not open, click Show and select Secondary Structure Table.

This will open a view similar to the one shown in figure 22.16.

On the left side, all computed structures are listed with the information about structure name, when the structure was created, the free energy of the structure, and the probability of the structure if the partition function was calculated. Selecting a row (equivalent to a structure) will display a tree of the contained substructures with their contributions to the total structure free energy. Each substructure contains a
72. Figure 2.58: A split view showing the secondary structure table at the bottom and the Secondary structure 2D view at the top. You might need to Zoom out to see the structure.

The secondary structure now looks very similar to figure 2.54. By adjusting the layout, we can make it look exactly the same in the Side Panel of the 2D view unde
73. Figure 2.7: The Annotation Layout and the Annotation Types in the Side Panel.

Check the Region annotation type, and you will see the regions as red annotations on the sequences.

Next, we will change the way the residues are colored. Click the Alignment Info group and under Conservation, check Background color. This will use a gradient as background color for the residues. You can adjust the coloring by dragging the small arrows above the color box.

2.3.1 Saving the settings in the Side Panel

Now the alignment should look similar to the one shown in figure 2.8.

Figure 2.8: The alignment when all the above settings have been changed.

At this point, if you just close the view, the changes made to the Side Panel will not be saved. This means that you would have to perform the changes again next time you open the alignment. To save the changes to the Side Panel
74. Figure 12.26: BLAST table view. A table view with one row per hit, showing the accession number and description field from the sequence file together with BLAST output scores.

12.6.7 I want to BLAST against my own sequence database, is this possible?

It is possible to download the entire BLAST program package and use it on your own computer, institution computer cluster or similar. This is preferred if you want to search in proprietary sequences or sequences unavailable in the public databases stored at NCBI. The downloadable BLAST package can either be installed as a web-based tool or as a command line tool. It is available for a wide range of different operating systems.

The BLAST package can be downloaded free of charge from the following location: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml
75. If a file or another element is dropped on a folder, it is placed at the bottom of the folder. If it is dropped on another element, it will be placed just below that element. If the element already exists in the Navigation Area, you will be asked whether you wish to create a copy.
3.1.2 Create new folders
In order to organize your files, they can be placed in folders. Creating a new folder can be done in two ways: right-click an element in the Navigation Area | New | Folder, or File | New | Folder. If a folder is selected in the Navigation Area when adding a new folder, the new folder is added at the bottom of this folder. If an element is selected, the new folder is added right above that element. You can move the folder manually by selecting it and dragging it to the desired destination.
3.1.3 Sorting folders
You can sort the elements in a folder alphabetically: right-click the folder | Sort Folder. On Windows, subfolders will be placed at the top of the folder, and the rest of the elements will be listed below in alphabetical order. On Mac, both subfolders and other elements are listed together in alphabetical order.
3.1.4 Multiselecting elements
Multiselecting elements means that you select more than one element at the same time. This can be done in the following ways:
• Holding down the <Ctrl> key (⌘ on Mac) while clicking on multiple elements selects the elements that have been clicked.
• Selecting one element
76. Java regular expression syntax (see http://java.sun.com/docs/books/tutorial/essential/regex/index.html). Below are listed some of the most important syntax rules, which are also shown in the help pop-up when you press Shift + F1:
• [A-Z] will match the characters A through Z (Range). You can also put single characters between the brackets: the expression [AGT] matches the characters A, G or T.
• [A-D[M-P]] will match the characters A through D and M through P (Union). You can also put single characters between the brackets: the expression [AG[M-P]] matches the characters A, G and M through P.
• [A-M&&[H-P]] will match the characters between A and M lying between H and P (Intersection). You can also put single characters between the brackets: the expression [A-M&&[HGTDA]] matches the characters A through M that are H, G, T, D or A.
• [^A-M] will match any character except those between A and M (Excluding). You can also put single characters between the brackets: the expression [^AG] matches any character except A and G.
• [A-Z&&[^M-P]] will match any character A through Z except those between M and P (Subtraction). You can also put single characters between the brackets: the expression [A-P&&[^CG]] matches any character between A and P except C and G.
• The symbol . matches any character.
• X{n} will match a repetition of an element indicated by following that element with a numerical v
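The character-class rules listed in this excerpt can be tried out with a short script. The sketch below uses Python rather than Java, so it rests on one stated assumption: Python's `re` module has no nested classes and no `&&` operator, so union is written as a flat class, intersection is emulated with a lookahead, and subtraction with a negative lookahead. The helper name `chars_matching` is made up for this illustration.

```python
import re

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def chars_matching(pattern, alphabet=ALPHABET):
    """Return the letters of `alphabet` that the single-character pattern accepts."""
    return "".join(c for c in alphabet if re.fullmatch(pattern, c))

# Range: [A-D]
RANGE = chars_matching(r"[A-D]")

# Union: Java [A-D[M-P]] is written as one flat class in Python.
UNION = chars_matching(r"[A-DM-P]")

# Intersection: Java [A-M&&[H-P]]; emulated here with a lookahead.
INTERSECTION = chars_matching(r"(?=[H-P])[A-M]")

# Excluding: [^A-M]
EXCLUDING = chars_matching(r"[^A-M]")

# Subtraction: Java [A-P&&[^CG]]; emulated here with a negative lookahead.
SUBTRACTION = chars_matching(r"(?![CG])[A-P]")

if __name__ == "__main__":
    for name in ("RANGE", "UNION", "INTERSECTION", "EXCLUDING", "SUBTRACTION"):
        print(name, globals()[name])
```

Inside the workbench itself the Java forms apply; the lookahead forms are only a way to reproduce the same sets outside Java.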
77. Figure 2.1: The user interface as it looks when you start the program for the first time (Windows version of CLC Combined Workbench). The interface is similar for Mac and Linux.
Name the folder 'My folder' and press Enter.
2.1.2 Import data
Next, we want to import a sequence called HUMDINUC.fsa (FASTA format) from our own Desktop into the new 'My folder'. (This file is chosen for demonstration purposes only; you may have another file on your desktop which you can use to follow this tutorial. You can import all kinds of files.) In order to import the HUMDINUC.fsa file: Select 'My folder' | Import in the Toolbar | navigate to HUMDINUC.fsa on the desktop | Select. The sequence is imported into the folder that was selected in the Navigation Area before you clicked Import. Double-click the sequence in the Navigation Area to view it. The final result looks like figure 2.2.
2.1.3 Supported data formats
CLC Combined Workbench can import and export the following formats:
78. Figure 20.3: The first 50 positions of two different alignments of seven calpastatin sequences. The top alignment is made with cheap end gaps, while the bottom alignment is made with end gaps having the same price as any other gaps. In this case it seems that the latter scoring scheme gives the best result.
Figure 20.4: The alignment of the coding sequence of bovine myoglobin (NM_173881_CDS) with the full mRNA of human gamma globin (NM_000559). The top alignment is made with free end gaps, while the bottom alignment is made with end gaps treated as any other. The yellow annotation is the coding sequence in both sequences. It is evident that free end gaps are ideal in this situation, as the start codons are aligned correctly in the top alignment. Treating end gaps as any other gaps in the case of aligning distant homologs where one sequence is partial leads to a spreading out of the short sequence, as in the bottom alignment.
For a comprehensive explanation of the alignment algorithms, see section 20.6.
20.1.3 Aligning alignments
If you have selected an existing alignment in the first step (20.1), you have to decide how this alignment should be treat
79. Note: Search results are downloaded before they are saved. Downloading and saving several files may take some time. However, since the process runs in the background (displayed in the Status bar), it is possible to continue other tasks in the program. Like the search process, the download process can be stopped. This is done in the Toolbox in the Processes tab.
11.3.3 Save structure search parameters
The search view can be saved either by dragging the search tab and dropping it in the Navigation Area or by clicking Save. When saving the search, only the parameters are saved, not the results of the search. This is useful if you have a special search that you perform from time to time. Even if you don't save the search, the next time you open the search view, it will remember the parameters from the last time you did a search.
[Chapter 11: Online Database Search, p. 170]
11.4 Sequence web info
CLC Combined Workbench provides direct access to web-based search in various databases and on the Internet using your computer's default browser. You can look up a sequence in the databases of NCBI and UniProt, search for a sequence on the Internet using Google, and search for PubMed references at NCBI. This is useful for quickly obtaining updated and additional information about a sequence. The functionality of these search functions depends on the information that the sequence contains. You can see this information by viewing the sequence a
80. Figure 3.3: In this example the location called 'CLC_Data' points to the folder at C:\Documents and settings\clcuser\CLC_Data.
Adding locations
Per default, there is one location in the Navigation Area called CLC_Data. It points to the following folder:
• On Windows: C:\Documents and settings\<username>\CLC_Data
• On Mac: ~/CLC_Data
• On Linux: /homefolder/CLC_Data
You can easily add more locations to the Navigation Area: File | New | Location. This will bring up a dialog where you can navigate to the folder you wish to use as your new location (see figure 3.4). When you click Open, the new location is added to the Navigation Area as shown in figure 3.5.
[Chapter 3: User Interface, p. 75]
Figure 3.4: Navigating to a folder to use as a new location.
Figure 3.5: The
81. PCR conditions: nearest-neighbor corrections for Mg2+, deoxynucleotide triphosphate, and dimethyl sulfoxide concentrations with comparison to alternative empirical formulas. Clin Chem, 47(11):1956-1961.
[von Heijne, 1986] von Heijne, G. (1986). A new method for predicting signal sequence cleavage sites. Nucl Acids Res, 14:4683-4690.
[Welling et al., 1985] Welling, G. W., Weijer, W. J., van der Zee, R., and Welling-Wester, S. (1985). Prediction of sequential antigenic regions in proteins. FEBS Lett, 188(2):215-218.
[Wootton and Federhen, 1993] Wootton, J. C. and Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers in Chemistry, 17:149-163.
[Workman and Krogh, 1999] Workman, C. and Krogh, A. (1999). No evidence that mRNAs have lower folding free energies than random sequences with the same dinucleotide distribution. Nucleic Acids Res, 27(24):4816-4822.
[Yang and Rannala, 1997] Yang, Z. and Rannala, B. (1997). Bayesian phylogenetic inference using DNA sequences: a Markov Chain Monte Carlo Method. Mol Biol Evol, 14(7):717-724.
[Zuker, 1989a] Zuker, M. (1989a). On finding all suboptimal foldings of an RNA molecule. Science, 244(4900):48-52.
[Zuker, 1989b] Zuker, M. (1989b). The use of dynamic programming algorithms in RNA secondary structure prediction. Mathematical Methods for DNA Sequences, pages 159-184.
[Zuker and Sankoff, 1984] Zuker, M. and Sankoff, D. (1984). Rna secondary str
82. PERH3BC sequence from the 'Nucleotide' folder under 'Sequences' of the Example data. First, open the sequence in the Primer Designer: Select the PERH3BC sequence | Show | Primer Designer. Now the sequence is opened, and we are ready to begin designing primers.
2.11.1 Specifying a region for the forward primer
In this example we know where we want the primers to be located. If you have annotated your sequence, these annotations are also shown in the primer designer to help guide you to where the primers should be located. In this tutorial, we want the forward primer to be in a region between positions 20 and 45. Select this region, right-click and choose 'Forward primer region here' (see figure 2.34).
Figure 2.34: Right-clicking a selection and choosing 'Forward primer region here'.
This will add an annotation to this region, and five rows of red and green dots are seen be
83. Prohibited Stem to the sequence (see figure 22.6).
Figure 22.6: Prohibit the selected bases from forming a stem.
[Chapter 22: RNA Structure, p. 380]
Using this procedure to add base pairing constraints will force the algorithm to compute minimum free energy and structure without a stem in the selected region. Again, the two selected regions must be of equal length.
To prohibit a region from being part of any base pair, open the sequence and: Select the bases you don't want to base pair | right-click the selection | Add Structure Prediction Constraints | Prohibit From Forming Base Pairs. This will add an annotation labeled 'No base pairs' to the sequence (see figure 22.7).
Figure 22.7: Prohibiting any of the selected bases from pairing with other bases.
Using this procedure to add base pairing constraints will force the algorithm to compute minimum free energy and structure without a base pair containing any residues in the selected region. When you click Predict secondary structure and click Next, check Apply base pairing constraints in order to force or prohibit stem regions or prohibit regions from forming base pairs. You can add multiple base pairing constraints, e.g. simultaneously adding forced stem regions, prohibited stem regions, and regions prohibited from forming base pairs.
22.1.5 Structure
84. Figure 2.31: Selecting trypsin as the cleaving enzyme.
Click Next to go to Step 3 of the dialog. In Step 3 you can adjust the parameters for which fragments of the cleavage you want to include in the table output of the analysis: Type '10' in the Exclude fragments shorter than field | Check the box Exclude fragments longer than | enter '15' in the corresponding text field. These parameter adjustments are shown in figure 2.32.
Figure 2.32: Adjusting the output from the cleavage to include fragments which are between 10 and 15 amino acids long.
Click Finish to make the analysis. The result of the analysis can be seen in figure 2.33.
[Figure residue: screenshot for figure 2.33, showing sequence CAA32220 with trypsin cleavage sites and the fragment table.]
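The length filter set in Step 3 of the tutorial (keep only fragments of 10 to 15 residues) can be sketched in a few lines. This is a minimal illustration rather than the workbench's own implementation. It assumes the common simplified trypsin rule, cutting after K or R unless the next residue is P, which the excerpt does not spell out. The function name `trypsin_fragments` and the toy sequence are invented for this example.

```python
import re

def trypsin_fragments(protein, min_len=10, max_len=15):
    """Digest `protein` and keep fragments whose length is within [min_len, max_len].

    Assumption: simplified trypsin rule, cut after K or R unless followed by P.
    Returns (start, end, fragment) tuples with 1-based inclusive positions.
    """
    cut_points = [m.end() for m in re.finditer(r"[KR](?!P)", protein)]
    bounds = [0] + cut_points
    if bounds[-1] != len(protein):
        bounds.append(len(protein))  # trailing piece after the last cut
    fragments = []
    for start, end in zip(bounds, bounds[1:]):
        if min_len <= end - start <= max_len:
            fragments.append((start + 1, end, protein[start:end]))
    return fragments

# Toy sequence (hypothetical): pieces of length 2, 11, 13 and 21.
# The "RP" inside the third piece is not cut under the assumed rule.
TOY = "MK" + "A" * 10 + "K" + "GGRP" + "G" * 8 + "K" + "T" * 20 + "R"

if __name__ == "__main__":
    for start, end, fragment in trypsin_fragments(TOY):
        print(start, end, len(fragment), fragment)
```

Changing `min_len` and `max_len` corresponds to the two 'Exclude fragments' fields in the dialog.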
85. Proteolytic cleavage detection
CONTENTS
17 Primers
17.1 Primer design - an introduction
17.2 Setting parameters for primers and probes
17.3 Graphical display of primer information
17.4 Output from primer design
17.5 Standard PCR
17.6 Nested PCR
17.7 TaqMan
17.8 Sequencing primers
17.9 Alignment-based primer and probe design
17.10 Analyze primer properties
17.11 Find binding sites
17.12 Order primers
18 Sequencing data analyses and Assembly
18.1 Importing and viewing trace data
18.2 Trim sequences
18.3 Assemble sequences
18.4 Assemble to reference sequence
18.5 Add sequences to an existing contig
18.6 View and edit contigs
18.7 Reassemble contig
18.8 Secondary peak calling
19 Cloning and cutting
19.1 Molecular cloning
19.2 Restriction site analysis
19.3 Gel electrophoresis
19.4 Restriction enzyme lists
86. Figure 22.3: The marginal base pair probability of all possible base pairs.
• Include coaxial stacking energy rules. Include free energy increments of coaxial stacking for adjacent helices [Mathews et al., 2004].
• Apply base pairing constraints. With base pairing constraints, you can easily add experimental constraints to your folding algorithm. When you are computing suboptimal structures, it is not possible to apply base pair constraints. The possible base pairing constraints are:
  - Force two equal length intervals to form a stem.
  - Prohibit two equal length intervals to form a stem.
  - Prohibit all nucleotides in a selected region to be a part of a base pair.
  Base pairing constraints have to be added to the sequence before you can use this option (see below).
• Maximum distance between paired bases. Forces the algorithms to only consider RNA structures of a given upper length by setting a maximum distance between the base pair that opens a structure.
Specifying structure constraints
Structure constraints can serve two purposes in CLC Combined Workbench: they can act as experimental constraints imposed on the MFE structure prediction algorithm, or they can form a structure hypothesis to be evaluated using the partition function (see section 22.1.3).
To force two regions to form a stem, open a normal sequence view and: Select the two regions you wa
87. 16.4.2 Antigenicity graphs along sequence
16.5 Hydrophobicity
16.5.1 Hydrophobicity plot
16.5.2 Hydrophobicity graphs along sequence
16.5.3 Bioinformatics explained: Protein hydrophobicity
16.6 Pfam domain search
16.6.1 Pfam search parameters
16.6.2 Download and installation of additional Pfam databases
16.7 Secondary structure prediction
16.8 Protein report
16.8.1 Protein report output
16.9 Reverse translation from protein into DNA
16.9.1 Reverse translation parameters
16.9.2 Bioinformatics explained: Reverse translation
16.10 Proteolytic cleavage detection
16.10.1 Proteolytic cleavage parameters
16.10.2 Bioinformatics explained: Proteolytic cleavage
CLC Combined Workbench offers a number of analyses of proteins as described in this chapter.
[Chapter 16: Protein Analyses, p. 242]
16.1 Signal peptide prediction
Signal peptides target proteins to the extracellular environment either through direct plas
88. Figure 3.1: The user interface consists of the Menu Bar, Toolbar, Status Bar, Navigation Area, Toolbox, and View Area.
3.1 Navigation Area
The Navigation Area is located in the left side of the workbench, under the Toolbar (see figure 3.2). It is used for organizing and navigating data. Its behavior is similar to the way files and folders are usually displayed on your computer.
3.1.1 Data structure
The data in the Navigation Area is organized into a number of Locations. When you start the CLC Combined Workbench for the first time, there is one location called CLC_Data. A location represents a folder on your computer: the data shown under a location in the Navigation Area is stored on your computer in the folder which the location points to. This is explained visually in figure 3.3.
Figure 3.2: The Navigation Area.
[Figure residue: screenshot for figure 3.3, showing the CLC_Data location next to the folder C:\Documents and Settings\clcuser\CLC_Data.]
89. The plug-in will not be ready for use before the workbench is restarted.
1.7.2 Uninstalling plug-ins
Plug-ins are uninstalled using the plug-in manager: Help in the Menu Bar | Install Plug-ins, or Plug-ins in the Toolbar. This will open the dialog shown in figure 1.16.
Figure 1.16: The plug-in manager with plug-ins installed. (The screenshot lists three plug-ins from CLC bio: Additional Alignments, version 1.02; Annotate with GFF file, version 1.03; Extract Annotations, version 1.02.)
The installed plug-ins are shown in this dialog. To uninstall: Click the plug-in | Uninstall. If you do not wish to completely uninstall the plug-in, but you don't want it to be used next time you start the Workbench, click the Disable button. When
90. The ambiguity code reflects the residues in the reads, not in the consensus sequence. The IUPAC codes can be found in section F.
• Status. The status can either be conflict or resolved:
  - Conflict. Initially, all the rows in the table have this status. This means that there are one or more differences between the sequences at this position.
  - Resolved. If you edit the sequences in the contig, e.g. if there was an error in one of the sequences, and they now all have the same residue at this position, the status is set to Resolved.
• Note. Can be used for your own comments on this conflict. Right-click in this cell of the table to add or edit the comments. The comments in the table are associated with the conflict annotation on the contig. Therefore, the comments you enter in the table will also be attached to the annotation on the contig sequence (the comments can be displayed by placing the mouse cursor on the annotation for one second, see figure 18.12). The comments are saved when you save the contig.
By clicking a row in the table, the corresponding position is highlighted in the graphical view of the contig. Clicking the rows of the table is another way of navigating the contig, apart from using the Find Inconsistencies button or using the Space bar. You can use the up and down arrow keys to navigate the rows of the table.
18.7 Reassemble contig
If you have edited a contig, changed trimmed regions, or added or removed reads, you may
91. These panels offer a lot of flexibility for combining the number of cut sites inside and outside the selection, respectively. To give a hint of how many enzymes will be added based on the combination of cut sites, the preview panel at the bottom lists the enzymes which will be added when you click Finish. Note that this list is dynamically updated when you change the number of cut sites.
[Chapter 10: Viewing and Editing Sequences, p. 143]
If you have selected more than one region on the sequence (using Ctrl or ⌘), they will be treated as individual regions. This means that the criteria for cut sites apply to each region.
Show enzymes with compatible ends
Besides what is described above, there is a third way of adding enzymes to the Side Panel and thereby displaying them on the sequence. It is based on the overhang produced by cutting with an enzyme and will find enzymes producing a compatible overhang: right-click the restriction site | Show Enzymes with Compatible Ends. This will display the dialog shown in figure 19.21.
[Figure residue: screenshot of the 'Show Enzymes with Compatible Ends' dialog for TaqI.]
92. Threonine
W   Trp   Tryptophan
Y   Tyr   Tyrosine
V   Val   Valine
B   Asx   Aspartic acid or Asparagine
Z   Glx   Glutamine or Glutamic acid
X   Xaa   Any amino acid

Appendix F: IUPAC codes for nucleotides
(Single-letter codes based on International Union of Pure and Applied Chemistry.) The information is gathered from http://www.iupac.org and http://www.dna.affrec.go.jp/misc/MPsrch/InfoIUPAC.html.

Code   Description
A      Adenine
C      Cytosine
G      Guanine
T      Thymine
U      Uracil
R      Purine (A or G)
Y      Pyrimidine (C, T, or U)
M      C or A
K      T, U, or G
W      T, U, or A
S      C or G
B      C, T, U, or G (not A)
D      A, T, U, or G (not C)
H      A, T, U, or C (not G)
V      A, C, or G (not T, not U)
N      Any base (A, C, G, T, or U)

Bibliography
[Altschul and Gish, 1996] Altschul, S. F. and Gish, W. (1996). Local alignment statistics. Methods Enzymol, 266:460-480.
[Altschul et al., 1990] Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215(3):403-410.
[Andrade et al., 1998] Andrade, M. A., O'Donoghue, S. I., and Rost, B. (1998). Adaptation of protein surfaces to subcellular location. J Mol Biol, 276(2):517-525.
[Bachmair et al., 1986] Bachmair, A., Finley, D., and Varshavsky, A. (1986). In vivo half-life of a protein is a function of its amino-terminal residue. Science, 234(4773):179-186.
[Bateman et al., 2004] Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Gr
93. Figure 16.24: The standard genetic code, showing amino acids for all 64 possible codons.
Challenge of reverse translation
A particular protein follows from the translation of a DNA sequence, whereas the reverse translation need not have a specific solution according to the Genetic Code. The Genetic Code is degenerate, which means that a particular amino acid can be translated into more than one codon. Hence, there are ambiguities of the reverse translation.
Solving the ambiguities of reverse translation
In order to solve these ambiguities of reverse translation, you can define how to prioritize the codon selection, e.g.:
• Choose a codon randomly.
• Select the most frequent codon in a given organism.
• Randomize a codon, but with respect to its frequency in the organism.
As an example, we want to translate an alanine to the corresponding codon. Four different codons can be used for this reverse translation: GCU, GCC, GCA or GCG. By picking either one by random choice, we will get an alanine.
The most frequent codon coding for an alanine in E. coli is GCG, encoding 33.7% of all alanines. Then co
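The three codon-selection strategies listed in this excerpt can be sketched as follows. Only the GCG share (33.7% of alanines in E. coli) comes from the text. The other three frequencies are illustrative placeholders chosen so that the table sums to 1, and the table and function names are hypothetical.

```python
import random

# Alanine codon usage. GCG = 0.337 is taken from the text; the remaining
# values are placeholders for illustration only, not measured frequencies.
ALA_CODON_FREQ = {"GCG": 0.337, "GCC": 0.260, "GCA": 0.220, "GCU": 0.183}

def pick_random(freqs, rng):
    """Strategy 1: choose any synonymous codon with equal probability."""
    return rng.choice(sorted(freqs))

def pick_most_frequent(freqs):
    """Strategy 2: always choose the most frequent codon in the organism."""
    return max(freqs, key=freqs.get)

def pick_weighted(freqs, rng):
    """Strategy 3: choose randomly, weighted by codon frequency."""
    codons = sorted(freqs)
    weights = [freqs[c] for c in codons]
    return rng.choices(codons, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same picks
    print("random:       ", pick_random(ALA_CODON_FREQ, rng))
    print("most frequent:", pick_most_frequent(ALA_CODON_FREQ))
    print("weighted:     ", pick_weighted(ALA_CODON_FREQ, rng))
```

With the weighted strategy, GCG would be expected in roughly one third of the reverse-translated alanines over many draws, while the second strategy always yields GCG.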
94. [Figure residue: sequence NP_058652 with trypsin cleavage sites, and the table of remaining fragments based on parameter settings (13 rows; columns: Start, End position, Length, Mass, pI, C-end Name, Fragment, N-end Name).]
Figure 16.28: The result of the proteolytic cleavage detection.
Proteins often undergo proteolytic processing by specific proteolytic enzymes (proteases/peptidases) before final maturation of the protein. Proteins can also be cleaved as a result of intracellular processing of, for example, misfolded proteins. Another example of proteolytic processing of proteins is secretory proteins or proteins targeted to organelles, which have their signal peptide removed by specific signal peptidases before release to the extracellular environment or specific organelle.
Below, a few processes are listed where proteolytic enzymes act on a protein substrate:
• N-terminal methionine residues are often removed after translation.
• Signal peptides or targeting sequences are removed during translocation through a membrane.
• Viral proteins that were translated from a monocistronic mRNA are cleav
95. a number of different options which can be changed in order to obtain the best possible result. Changing these parameters can have a great impact on the search result. It is not the scope of this document to comment on all of the options available, but merely the options which can be changed with a direct impact on the search result.
The E-value
The expect value (E-value) can be changed in order to limit the number of hits to the most significant ones. The lower the E-value, the better the hit. The E-value is dependent on the length of the query sequence and the size of the database. For example, an alignment obtaining an E-value of 0.05 means that there is a 5 in 100 chance of the match occurring by chance alone. E-values are very dependent on the query sequence length and the database size. Short identical sequences may have a high E-value and may be regarded as 'false positive' hits. This is often seen if one searches for short primer regions, small domain regions, etc. The default threshold for the E-value on the BLAST web page is 10. Increasing this value will most likely generate more hits. Below are some rules of thumb which can be used as a guide but should be considered with common sense:
• E-value < 10e-100: Identical sequences. You will get long alignments across the entire query and hit sequence.
• 10e-100 < E-value < 10e-50: Almost identical sequences. A long stretch of the query protein is matched to the database.
• 10e-10 < E-value
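The dependence of the E-value on query length and database size mentioned in this excerpt can be made concrete with the standard Karlin-Altschul relation, E = m * n * 2^(-S'), where S' is the bit score. The sketch below is a rough illustration, assuming m and n are the effective query and database lengths; real BLAST applies edge-effect corrections that are omitted here, and the function name is hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Assumption: query_len (m) and db_len (n) are effective lengths, and
    BLAST's length corrections are ignored.
    """
    return query_len * db_len * 2.0 ** (-bit_score)

if __name__ == "__main__":
    # The same 40-bit hit becomes less significant as the database grows.
    for db_len in (1_000_000, 10_000_000, 100_000_000):
        print(db_len, expect_value(40.0, 100, db_len))
    # A short query needs a proportionally higher score for the same E-value.
    print(expect_value(40.0, 20, 1_000_000), expect_value(40.0, 200, 1_000_000))
```

This also shows why a perfect match to a short primer can carry a high E-value: a short alignment can only reach a low bit score, so the product m * n dominates.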
96. a sequence. The Create Sequence dialog (figure 10.21) reflects the information needed in the GenBank format, but you are free to enter anything into the fields. The following description is a guideline for entering information about a sequence:
• Name. The name of the sequence. This is used for saving the sequence.
• Common name. A common name for the species.
• Latin name. The Latin name for the species.
• Type. Select between DNA, RNA and protein.
• Circular. Specifies whether the sequence is circular. This will open the sequence in a circular view as default (applies only to nucleotide sequences).
• Description. A description of the sequence.
• Keywords. A set of keywords separated by semicolons (;).
• Comments. Your own comments to the sequence.
• Sequence. Depending on the type chosen, this field accepts nucleotides or amino acids. Spaces and numbers can be entered, but they are ignored when the sequence is created. This allows you to paste (Ctrl + V on Windows and ⌘ + V on Mac) in a sequence directly from a different source, even if the residue numbers are included. Characters that are not part of the IUPAC codes cannot be entered. At the top right corner of the field, the number of residues is counted. The counter does not count spaces or numbers.
Clicking Finish opens the sequence. It can be saved by clicking Save or by dragging the tab of the sequence view into the Navigation Area.
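The behavior described for the Sequence field (spaces and numbers are ignored, non-IUPAC characters are rejected, and the counter counts residues only) can be imitated with a small helper. This is a sketch under stated assumptions, not the workbench's code: the function name is invented, input is upper-cased, and an invalid character raises an error rather than being blocked while typing.

```python
# IUPAC nucleotide codes (single-letter), as listed in the manual's appendix.
IUPAC_NUCLEOTIDES = frozenset("ACGTURYMKWSBDHVN")

def clean_pasted_sequence(text, alphabet=IUPAC_NUCLEOTIDES):
    """Drop whitespace and digits from pasted text and validate the residues.

    Assumptions: lower-case input is accepted and upper-cased; characters
    outside `alphabet` raise ValueError.
    """
    residues = [ch.upper() for ch in text if not (ch.isspace() or ch.isdigit())]
    invalid = sorted({ch for ch in residues if ch not in alphabet})
    if invalid:
        raise ValueError("characters outside the IUPAC codes: " + "".join(invalid))
    return "".join(residues)

if __name__ == "__main__":
    pasted = "  1 acgtacgtnn ryacgt\n 17 acg\n"  # numbered, GenBank-like paste
    sequence = clean_pasted_sequence(pasted)
    print(sequence, len(sequence))  # the length plays the role of the residue counter
```

A protein field would use the same function with the amino acid alphabet passed as `alphabet`.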
97. above columns, excluding the score, are represented twice: once for the forward primer (designated by the letter F) and once for the reverse primer (designated by the letter R).
[Chapter 17: Primers, p. 287]
Before these, and following the score of the primer pair, are the following columns pertaining to primer pair information available:
• Pair annealing: the number of hydrogen bonds found in the optimal alignment of the forward and the reverse primer in a primer pair.
• Pair annealing alignment: a visualization of the optimal alignment of the forward and the reverse primer in a primer pair.
• Pair end annealing: the maximum score of consecutive end base pairings found between the ends of the two primers in the primer pair, in units of hydrogen bonds.
• Fragment length: the length (number of nucleotides) of the PCR fragment generated by the primer pair.
17.6 Nested PCR
Nested PCR is a modification of Standard PCR, aimed at reducing product contamination due to the amplification of unintended primer binding sites (mispriming). If the intended fragment cannot be amplified without interference from competing binding sites, the idea is to seek out a larger outer fragment which can be unambiguously amplified and which contains the smaller intended fragment. Having amplified the outer fragment to large numbers, the PCR amplification of the inner fragment can proceed and will yield amplification of this with minimal contamination.
Primer design
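The 'Pair annealing' column described in this excerpt, the number of hydrogen bonds in the optimal alignment of the two primers, can be approximated with a brute-force scan. The sketch makes assumptions the excerpt does not confirm: both primers are written 5' to 3', only ungapped alignments are considered, and A-T pairs count 2 bonds while G-C pairs count 3. The workbench's actual scoring may differ, and the function name is hypothetical.

```python
# Hydrogen bonds per Watson-Crick pair (A-T: 2, G-C: 3).
H_BONDS = {("A", "T"): 2, ("T", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}

def pair_annealing(forward, reverse):
    """Best hydrogen-bond count over all ungapped antiparallel alignments.

    Both primers are given 5'->3'. The reverse primer is read 3'->5' so that
    it runs antiparallel to the forward primer, then slid across it.
    """
    opposite = reverse[::-1]
    best = 0
    for offset in range(-(len(opposite) - 1), len(forward)):
        score = 0
        for i, base in enumerate(opposite):
            j = offset + i
            if 0 <= j < len(forward):
                score += H_BONDS.get((forward[j], base), 0)
        best = max(best, score)
    return best

if __name__ == "__main__":
    print(pair_annealing("ACGTTGCAAC", "GTTGCAACGT"))
```

High values flag primer pairs likely to form primer dimers. 'Pair end annealing' would restrict the same idea to consecutive pairings at the primer ends.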
98. [Figure residue: license dialog, 'Activate license'. The license must be activated before the application can be used. The activation has to be done online, and therefore you need to be connected to the internet during the activation.]
Figure 1.8: Activate the license key online. Your computer must be connected to the internet in order to activate the license. Once the license is activated, you can work offline.
It will take a few seconds to activate the license key. When the license key is activated, CLC Combined Workbench will start. A license is locked to a specific computer, and therefore it can be used by anyone using that computer. If at some time you want to transfer the license to another computer, please contact support@clcbio.com.
(Footnote 3: If you have received a pre-activated license key, this step will not be shown.)
[Chapter 1: Introduction to CLC Combined Workbench, p. 21]
Problems with online activation
If you have problems activating the license online, CLC Combined Workbench also offers you an opportunity to manually activate your license key. The problem is most likely to occur if CLC Combined Workbench is unable to establish contact with our server. This may be due to problems with your internet connection or because your computer has restricted access to the internet. In this case you will see a dialog similar to the one shown in figure 1.9.
Unable to Request Evaluation License The progra
99. also be the most represented structure at equilibrium. The objective of minimum free energy (MFE) folding is therefore to identify Smin amongst all possible structures. In the following, we only consider structures without pseudoknots, i.e. structures that do not contain any non-nested base pairs. Under this assumption, a sequence can be folded into a single coherent structure or several sequential structures that are joined by unstructured regions. Each of these structures is a union of well-described structure elements (see below for a description of these). The free energy for a given structure is calculated by an additive nearest-neighbor model. Additive means that the total free energy of a secondary structure is the sum of the free energies of its individual structural elements. Nearest-neighbor means that the free energy of each structure element depends only on the residues it contains and on the most adjacent Watson-Crick base pairs. The simplest method to identify Smin would be to explicitly generate all possible structures, but it can be shown that the number of possible structures for a sequence grows exponentially with the sequence length (Zuker and Sankoff, 1984), making this approach infeasible. Fortunately, a two-step algorithm can be constructed which implicitly surveys all possible structures without explicitly generating them (Zuker and Stiegler, 1981). The first step determines the free energy for each possible sequence f
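The pseudoknot-free restriction mentioned in this passage (no non-nested base pairs) can be illustrated with a small check. This is only a sketch: the function name `has_pseudoknot` and the representation of a structure as a list of (i, j) index pairs are assumptions made for illustration, not part of the Workbench.

```python
def has_pseudoknot(pairs):
    """Return True if any two base pairs cross (are non-nested).

    `pairs` is a hypothetical representation: a list of (i, j) sequence
    positions that are paired. Two pairs (i, j) and (k, l) cross when
    i < k < j < l.
    """
    # Normalize so that the smaller index comes first, then sort by it.
    norm = sorted((min(i, j), max(i, j)) for i, j in pairs)
    for idx, (i, j) in enumerate(norm):
        for k, l in norm[idx + 1:]:
            if i < k < j < l:
                return True
    return False


nested = [(1, 10), (2, 9), (3, 8)]   # a simple stem
crossing = [(1, 5), (3, 8)]          # 3 opens inside (1, 5) but closes outside
print(has_pseudoknot(nested), has_pseudoknot(crossing))
```

The quadratic comparison is fine for illustration; it is not meant to reflect how a folding algorithm enforces nesting internally.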
100. as annotation. You can choose to add the elements of the best structure as annotations (see figure 22.8).
Figure 22.8: Annotations added for each structure element.
This makes it possible to use the structure information in other analyses in the CLC Combined Workbench. You can, e.g., align different sequences and compare their structure predictions. Note that any existing structure annotation will be removed when a new structure is calculated and added as annotations. If you generate multiple structures, only the best structure will be added as annotations. If you wish to add one of the sub-optimal structures as annotations, this can be done from the Show Secondary Structure Table described in section 22.2.2.
22.2 View and edit secondary structures
When you predict RNA secondary structure (see section 22.1), the resulting predictions are attached to the sequence and can be shown as:
• Annotations in the ordinary sequence views (Linear sequence view, Annotation table, etc.). This is only possible if this has been chosen in the dialog in figure 22.2. See an example in figure 22.8.
• Symbolic representation below the sequence (see section 22.2.3).
• A graphical view of the secondary structure (see section 22.2.1).
• A tabular view of
101. author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work. SOME RIGHTS RESERVED. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.
14.3 Local complexity plot
In CLC Combined Workbench it is possible to calculate local complexity for both DNA and protein sequences. The local complexity is a measure of the diversity in the composition of amino acids within a given range (window) of the sequence. The K2 algorithm is used for calculating local complexity (Wootton and Federhen, 1993). To conduct a complexity calculation, do the following:
Select sequences in Navigation Area | Toolbox in Menu Bar | General Sequence Analyses | Create Complexity Plot
This opens a dialog. In Step 1 you can change, remove and add DNA and protein sequences. When the relevant sequences are selected, clicking Next takes you to Step 2. This step allows you to adjust the window size from which the complexity plot is calculated. Default is set to 11 amino acids, and the number should always be odd. The higher the number, the less volatile the graph. Figure 14.14 shows an example of a local complexity plot.
[Figure: complexity plot of CAA24102, with the Graph preferences (Lock axes, Frame, Tick type) shown in the Side Panel.]
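To make the role of the window size concrete, the following sketch computes a sliding-window complexity profile. It uses a normalized Shannon entropy of the residue composition as the per-window measure; whether this matches the Workbench's K2 implementation and its scaling is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def window_entropy(window, alphabet_size):
    """Normalized Shannon entropy of the residue composition of one window."""
    n = len(window)
    counts = Counter(window)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    # The largest possible entropy is limited by both window and alphabet size.
    return h / math.log(min(n, alphabet_size))


def local_complexity(seq, window=11, alphabet_size=20):
    """Return (center position, complexity) for every full window.

    Positions are 1-based. The window must be odd so that it can be
    centered on a residue, mirroring the requirement stated in the text.
    """
    if window < 3 or window % 2 == 0:
        raise ValueError("window must be an odd number >= 3")
    half = window // 2
    return [
        (i + 1, window_entropy(seq[i - half:i + half + 1], alphabet_size))
        for i in range(half, len(seq) - half)
    ]


profile = local_complexity("MKVLAAGIVGLLLAAAAAAAAAAAAQPKRST", window=11)
print(profile[0], profile[-1])
```

A larger window averages over more residues, which is why the resulting graph becomes less volatile.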
102. base pairs and computes the minimum free energy and associated secondary structure constrained to contain a specified base pair. These structures are then sorted by their minimum free energy, and the most optimal are reported, given the specified number of structures. Note that two different sub-optimal structures can have the same minimum free energy. Further information about suboptimal folding can be found in (Zuker, 1989a).
22.1.3 Partition function
The predicted minimum free energy structure gives a point estimate of the structural conformation of an RNA molecule. However, this procedure implicitly assumes that the secondary structure is at equilibrium, that there is only a single accessible structure conformation, and that the parameters and model of the energy calculation are free of errors. Obvious deviations from these assumptions make it clear that the predicted MFE structure may deviate somewhat from the actual structure assumed by the molecule. This means that rather than looking at the MFE structure, it may be informative to inspect statistical properties of the structural landscape to look for general structural properties which seem to be robust to minor variations in the total free energy of the structure (see Mathews et al., 2004). To this end, CLC Combined Workbench allows the user to calculate the complete secondary structure partition function using the algorithm described in (Mathews et al., 2004), which is an extension of the se
103. below.
Drag and drop from UniProt search results
The sequences from the search results can be opened by dragging them into a position in the View Area. Note: A sequence is not saved until the View displaying the sequence is closed. When that happens, a dialog opens: Save changes of sequence x? (Yes or No). The sequence can also be saved by dragging it into the Navigation Area. It is possible to select more sequences and drag all of them into the Navigation Area at the same time.
Download UniProt search results using right-click menu
You may also select one or more sequences from the list and download using the right-click menu (see figure 11.2). Choosing Download and Save lets you select a folder or location where the sequences are saved when they are downloaded. Choosing Download and Open opens a new view for each of the selected sequences.
Copy/paste from UniProt search results
When using copy/paste to bring the search results into the Navigation Area, the actual files are downloaded from UniProt. To copy/paste files into the Navigation Area:
select one or more of the search results | Ctrl + C (⌘ + C on Mac) | select location or folder in the Navigation Area | Ctrl + V
Note: Search results are downloaded before they are saved. Downloading and saving several files may take some time. However, since the process runs in the background (displayed in the Toolbox under the Processes tab), it
104. brighter annotations are the ORFs with a length of at least 100 amino acids. On the positive strand around position 11,000, a gene starts before the ORF. This is due to the use of the standard genetic code rather than the bacterial code. This particular gene starts with CTG, which is a start codon in bacteria. Two short genes are entirely missing, while a handful of open reading frames do not correspond to any of the annotated genes. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. Finding open reading frames is often a good first step in annotating sequences such as cloning vectors or bacterial genomes. For eukaryotic genes, ORF determination may not always be very helpful, since the intron/exon structure is not part of the algorithm.
Chapter 16
Protein analyses
Contents
16.1 Signal peptide prediction
16.1.1 Signal peptide prediction parameter settings
16.1.2 Signal peptide prediction output
16.1.3 Bioinformatics explained: Prediction of signal peptides
16.2 Protein charge
16.2.1 Modifying the layout
16.3 Transmembrane helix prediction
16.4 Antigenicity
16.4.1 Plot of antigenicity
105. combinations of primers and single primer parameters for all four primers in a solution (see section on Standard PCR for an explanation of the available primer pair and single primer information). The fragment length in this mode refers to the length of the PCR fragment generated by the inner primer pair, and this is also the PCR fragment which can be exported.
17.7 TaqMan
CLC Combined Workbench allows the user to design primers and probes for TaqMan PCR applications. TaqMan probes are oligonucleotides that contain a fluorescent reporter dye at the 5' end and a quenching dye at the 3' end. Fluorescent molecules become excited when they are irradiated and usually emit light. However, in a TaqMan probe the energy from the fluorescent dye is transferred to the quencher dye by fluorescence resonance energy transfer as long as the quencher and the dye are located in close proximity, i.e. when the probe is intact. TaqMan probes are designed to anneal within a PCR product amplified by a standard PCR primer pair. If a TaqMan probe is bound to a product template, the replication of this will cause the Taq polymerase to encounter the probe. Upon doing so, the 5' exonuclease activity of the polymerase will cleave the probe. This cleavage separates the quencher and the dye, and as a result the reporter dye starts to emit fluorescence. The TaqMan technology is used in Real-Time quantitative PCR. Since the accumulation of fluorescence mirrors the accumulatio
106. consider adjusting view settings (e.g. Wrap for sequences) in the Side Panel before printing. As explained in the beginning of this chapter, the printed material will look like the view on the screen, and therefore these settings should also be considered when adjusting Page Setup.
6.2.1 Header and footer
Click the Header/Footer tab to edit the header and footer text. By clicking in the text field for either Custom header text or Custom footer text, you can access the auto formats for header/footer text in Insert at caret position. Click either Date, View name, or User name to include the auto format in the header/footer text. Click OK when you have adjusted the Page Setup. The settings are saved so that you do not have to adjust them again next time you print. You can also change the Page Setup from the File menu.
6.3 Print preview
The preview is shown in figure 6.7. The Print preview window lets you see the layout of the pages that are printed. Use the arrows in the toolbar to navigate between the pages. Click Print to show the print dialog, which lets you choose e.g. which pages to print. The Print preview window is for preview only; the layout of the pages must be adjusted in the Page Setup.
Figure 6.7: Print preview.
Chapter 7
Import/export of data and graphics
Contents
7.1 Bioinformatic data
107. covering more than one sequence. When you have made the selection, the mouse pointer turns into a horizontal arrow, indicating that the selection can be moved (see figure 20.9). Note: Residues can only be moved when they are next to a gap.
Figure 20.9: Moving a part of an alignment. Notice the change of mouse pointer to a horizontal arrow.
20.3.2 Insert gaps
The placement of gaps in the alignment can be changed by modifying the parameters when creating the alignment. However, gaps can also be added manually after the alignment is created. To insert extra gaps:
select a part of the alignment | right-click the selection | Add gaps before/after
If you have made a selection covering e.g. five residues, a gap of five will be inserted. In this way you can easily control the number of gaps to insert. Gaps will be inserted in the sequences that you selected. If you make a selection in two sequences in an alignment, gaps will be inserted into these two sequences. This means that these two sequences will be displaced compared to the other sequences in the alignment.
20.3.3 Delete residues and gaps
Residues or gaps can be deleted for individual sequences or for the whole alignment. For individual sequences: select the part o
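The effect of inserting gaps into only some sequences, as described in section 20.3.2, can be sketched on a toy alignment. The dictionary representation, the function name `insert_gaps`, and the choice to pad the unselected sequences with trailing gaps so that all rows keep the same length are assumptions made for this illustration; the text does not say how the Workbench treats the unselected rows.

```python
def insert_gaps(alignment, selected, column, count):
    """Insert `count` gaps before `column` (0-based) in the selected rows.

    `alignment` maps sequence names to aligned strings of equal length.
    Unselected rows are padded at the end (an assumption) so that the
    alignment stays rectangular while the selected rows are displaced.
    """
    result = {}
    for name, row in alignment.items():
        if name in selected:
            result[name] = row[:column] + "-" * count + row[column:]
        else:
            result[name] = row + "-" * count
    return result


aln = {"seq1": "AGGGAGTCAT", "seq2": "AGGGAGCAGT", "seq3": "ATGGTGCACC"}
shifted = insert_gaps(aln, selected={"seq1", "seq2"}, column=3, count=5)
for name, row in shifted.items():
    print(name, row)
```

Selecting five residues corresponds to `count=5` here: the two selected rows shift to the right by five columns relative to the third.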
108. database. The NCBI search view is opened in this way (figure 11.1):
Search | Search for Sequences at NCBI, or Ctrl + B (⌘ + B on Mac)
This opens the following view:
[Figure: the NCBI search view, with the Choose database options (Nucleotide, Protein), three All Fields search parameters (human, hemoglobin, complete), and a Search results table with the columns Accession, Definition and Modification Date. The buttons Download and Open, Download and Save, and Open at NCBI are shown below the table; total number of hits: 245.]
Fig
109. database, the standard BLAST parameters will have to be reconfigured. Using the parameters described below, you are likely to be able to identify whether antigenic determinants will cross-react to other proteins.

Purpose            Program  Word size  Low complexity filter  Expect value  Scoring matrix
Standard BLAST     blastp   3          On                     10            BLOSUM62
Remote homologues  blastp   2          Off                    20000         PAM30

These settings are shown in figure 2.30.
2.9.4 Further reading
A valuable source of information about BLAST can be found at http://www.ncbi.nlm.nih.gov/blast/producttable.shtml. Remember that BLAST is a heuristic method; thus you cannot trust BLAST to be accurate. For very accurate results you should use Smith-Waterman. You can read Bioinformatics explained
Figure 2.29: Settings for searching for primer binding sites.
[Figure: Local BLAST dialog, step "Set program parameters", with Expect set to 20000 and Word size set to 2.]
110. • Fragment length.
• Fragment region on the original sequence.
• Enzymes cutting at the left and right ends, respectively.
Figure 19.30: A sequence list shown as a gel.
Figure 19.31: Five lanes showing fragments of five sequences cut with restriction enzymes.
For gels comparing whole sequences, you will see the sequence name and the length of the sequence. Note: You have to be in Selection or Pan mode in order to get this information. It can be useful to add markers to the gel, which enables you to compare the sizes of the bands. This is done by clicking Show marker ladder in the Side Panel. Markers can be entered into the text field, separated by commas.
Modifying the layout
The background of the lane and the colors of the bands can be changed in the Side Panel. Click the colored box to display a dialog for picking a color. The slider Scale band spread can be used to adjust the effective time of separation on the gel, i.e. how much the bands will be spread over
111. enzymes, either place your mouse cursor on an enzyme for one second to display additional information (see figure 19.34), or use the view of enzyme lists (see 19.4). Clicking Next will show the dialog in figure 19.20. At the top of the dialog you see the selected region, and below are two panels:
• Inside selection: Specify how many times you wish the enzyme to cut inside the selection. In the example described above, One cut site (1) should be selected to only show enzymes cutting once in the selection.
• Outside selection: Specify how many times you wish the enzyme to cut outside the selection (i.e. the rest of the sequence). In the example above, No cut sites (0) should
Footnote: The CLC Combined Workbench comes with a standard set of enzymes based on http://www.rebase.org.
[Figure: Restriction Map Analysis dialog, step "Enzymes to be considered in calculation", showing the "Use existing enzyme list" option and an enzyme table (Name, Overhang and further columns).]
112. for more about creating and modifying enzyme lists. Below there are two panels:
• To the left, you see all the enzymes that are in the list selected above. If you have not chosen to use an existing enzyme list, this panel shows all the enzymes available.
• To the right, there is a list of the enzymes that will be used.
Select enzymes in the left side panel and add them to the right panel by double-clicking or clicking
Footnote: The CLC Combined Workbench comes with a standard set of enzymes based on http://www.rebase.org.
[Figure: Show Enzymes Cutting Inside/Outside Selection dialog, step "Enzymes to be considered in calculation". The left panel lists the enzymes in the Popular enzymes list and the right panel lists the enzymes to be used (e.g. EcoRV, EcoRI, SmaI, SalI, PstI), each with Name, Overhang and further columns.]
113. for nested PCR thus involves designing two primer pairs: one for the outer fragment and one for the inner fragment. In Nested PCR mode, the user must therefore define four regions:
• a Forward primer region (the outer forward primer),
• a Reverse primer region (the outer reverse primer),
• a Forward inner primer region, and
• a Reverse inner primer region.
These are defined by making a selection on the sequence and right-clicking the selection. If areas are known where primers must not bind (e.g. repeat-rich areas), one or more No primers here regions can be defined. It is required that the Forward primer region is located upstream of the Forward inner primer region, that the Forward inner primer region is located upstream of the Reverse inner primer region, and that the Reverse inner primer region is located upstream of the Reverse primer region. In Nested PCR mode, the Inner melting temperature menu in the Primer parameters panel is activated, allowing the user to set a separate melting temperature interval for the inner and outer primer pairs. After exploring the available primers (see section 17.3) and setting the desired parameter values in the Primer parameters preference group, the Calculate button will activate the primer design algorithm. After pressing the Calculate button, a dialog will appear (see figure 17.9). The top and bottom parts of this dialog are identical to the Standard PCR dialog for designing primer pairs described above. The centr
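The ordering requirement on the four Nested PCR regions can be written down as a simple check. This is a sketch under stated assumptions: the `Region` type and the function name are hypothetical, and "located upstream of" is interpreted strictly (one region must end before the next one starts), which may be stricter than what the Workbench enforces.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """A selected region on the sequence, 1-based and inclusive."""
    start: int
    end: int


def is_valid_nested_layout(fwd_outer, fwd_inner, rev_inner, rev_outer):
    """Check the upstream ordering: outer forward, inner forward,
    inner reverse, outer reverse."""
    order = [fwd_outer, fwd_inner, rev_inner, rev_outer]
    # Assumption: "upstream" means no overlap between consecutive regions.
    return all(a.end < b.start for a, b in zip(order, order[1:]))


ok = is_valid_nested_layout(
    Region(10, 40), Region(60, 90), Region(400, 430), Region(470, 500)
)
swapped = is_valid_nested_layout(
    Region(60, 90), Region(10, 40), Region(400, 430), Region(470, 500)
)
print(ok, swapped)
```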
114. found at http://www.ncbi.nlm.nih.gov/VecScreen/replist.html.
Trim contamination from saved sequences: This option lets you select a specific vector sequence that you know might be the cause of contamination. If you select this option, you will be able to select one or more sequences when you click Next.
Hit limit: Specifies how strictly vector contamination is trimmed. Since vector contamination usually occurs at the beginning or end of a sequence, different criteria are applied for terminal and internal matches. A match is considered terminal if it is located within the first 25 bases at either sequence end. Three match categories are defined according to the expected frequency of an alignment with the same score occurring between random sequences, as calculated by NCBI VecScreen:
• Weak: Expect 1 random match in 40 queries of length 350 kb.
  * Terminal match with Score 16 to 18.
  * Internal match with Score 23 to 24.
• Moderate: Expect 1 random match in 1,000 queries of length 350 kb.
  * Terminal match with Score 19 to 23.
  * Internal match with Score 25 to 29.
• Strong: Expect 1 random match in 1,000,000 queries of length 350 kb.
  * Terminal match with Score ≥ 24.
  * Internal match with Score ≥ 30.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This will start the trimming process. Views of each trimmed sequence will be shown, and you can inspect the result by looking at the
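The match categories listed above translate directly into a small classifier. The sketch below is illustrative only: the function name, the 1-based coordinates, and the reading of "terminal" as a match that starts within the first 25 bases or ends within the last 25 bases are assumptions, and the lower bounds of the Strong category are taken as inclusive so that the score ranges are contiguous.

```python
def classify_vector_match(score, start, end, seq_length, terminal_window=25):
    """Classify a vector match as 'strong', 'moderate', 'weak' or None.

    `start` and `end` are 1-based, inclusive positions of the match on
    the query sequence (an assumed representation).
    """
    terminal = start <= terminal_window or end > seq_length - terminal_window
    # Minimum scores per category, taken from the list in the text.
    if terminal:
        thresholds = [("strong", 24), ("moderate", 19), ("weak", 16)]
    else:
        thresholds = [("strong", 30), ("moderate", 25), ("weak", 23)]
    for label, minimum in thresholds:
        if score >= minimum:
            return label
    return None


print(classify_vector_match(17, 1, 20, 500))      # terminal match, low score
print(classify_vector_match(17, 100, 120, 500))   # same score, internal match
```

The same score can therefore fall into different categories depending on where the match lies, which is the point of applying separate criteria to terminal and internal matches.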
115. found in position P1 but not if a D or E is found in position P1 at the same time (see figure 16.30).
Figure 16.30: Hydrolysis of the peptide bond between two amino acids. Trypsin cleaves nonspecifically at lysine or arginine residues, whereas thrombin cleaves at arginines if aspartate or glutamate is absent.
Bioinformatics approaches are used to identify potential peptidase cleavage sites. Fragments can be found by scanning the amino acid sequence for patterns which match the corresponding cleavage site for the protease. When identifying cleaved fragments, it is relatively important to know the calculated molecular weight and the isoelectric point.
Other useful resources
The Peptidase Database: http://merops.sanger.ac.uk
Creative Commons License
All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display, and use the work for educational purposes, under the following conditions: You must attribute the work in its original form, and CLC bio has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work. SOME RIGHTS RESERVED
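As a rough illustration of the pattern-scanning approach to cleavage sites described in this section, the sketch below cuts a protein after every lysine (K) or arginine (R), following the simple trypsin rule given in the caption of figure 16.30. The function name is hypothetical, and commonly used refinements of the trypsin rule are deliberately left out, so this should not be read as the Workbench's cleavage model.

```python
def trypsin_fragments(protein):
    """Split a protein (one-letter codes) after each K or R residue.

    Simplified rule taken from the figure caption: cleavage on the
    C-terminal side of lysine or arginine, with no exceptions applied.
    """
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Remaining C-terminal piece that does not end in K or R.
        fragments.append(protein[start:])
    return fragments


peptides = trypsin_fragments("MVHLTPEEKSAVTALWGKVNVDEVGGEALGR")
print(peptides)
```

Each fragment could then be passed to a molecular weight or isoelectric point calculation, which the text notes is relevant when identifying cleaved fragments.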
116. given position in the contig. The level of coverage is relative to the overall number of sequence reads that are included in the contig.
  * Foreground color: Colors the letters using a gradient, where the left side color is used for low coverage and the right side is used for maximum coverage.
  * Background color: Colors the background of the letters using a gradient, where the left side color is used for low coverage and the right side is used for maximum coverage.
  * Graph: The coverage is displayed as a graph beneath the contig.
    Height: Specifies the height of the graph.
    Type: The graph can be displayed as Line plot, Bar plot, or as a Color bar.
    Color box: For Line and Bar plots, the color of the plot can be set by clicking the color box. If a Color bar is chosen, the color box is replaced by a gradient color box as described under Foreground color.
• Residue coloring: There is one additional parameter:
  - Assembly Colors: This option lets you use different colors for the residues of the contig and the forward and reverse reads. It is particularly useful for getting an overview of forward and reverse reads in the contig.
    * Contig color: Colors the residues of the contig sequence with the specified color (can be changed by clicking the colored box).
    * Forward color: Colors the residues of forward reads with the specified color (can be changed by clicking the colored box).
    * R
117. handle the results (see section 9.1). If not, click Finish. The result can be seen in figure 16.12. In CLC Combined Workbench it is possible to change the layout of the hydrophobicity plot through the Side Panel. The drop-down menus are opened by clicking the black triangular arrows. There are two kinds of view preferences: the graph preferences and preferences for the kind of hydrophobicity scale used to calculate the graph (e.g. Kyte-Doolittle). The Graph preferences include:
• Lock axis: This will always show the axis even though the plot is zoomed to a detailed level.
• Frame: Toggles the frame of the graph.
• X-axis at zero: Toggles the x-axis at zero.
• Y-axis at zero: Toggles the y-axis at zero.
• Tick type: outside, inside.
• Tick lines at: Shows a grid behind the graph (none, major ticks).
• Show as histogram: For some data series it is possible to see it as a histogram rather than a line plot.
The preferences for the different scales are identical and include the following:
• Dot type: Lets you choose the marking of dots in the graph.
• Dot color: Lets you choose the color of the dots.
• Line width: Sets the width of the line connecting the dots.
• Line type: Sets the type of the line connecting the dots.
• Line color: Lets you choose the color of the line connecting the dots.
16.5.2 Hydrophobicity graphs along sequence
Hydrophobicity graphs alo
118. how the result of the restriction map analysis should be presented:
• Add restriction sites as annotations to sequence(s): This option makes it possible to see the restriction sites on the sequence (see figure 19.27) and save the annotations for later use.
• Create restriction map: When a restriction map is created, it can be shown in three different ways:
  - As a table of restriction sites, as shown in figure 19.28. If more than one sequence was selected, the table will include the restriction sites of all the sequences. This makes it easy to compare the result of the restriction map analysis for two sequences.
  - As a table of fragments, which shows the sequence fragments that would be the result of cutting the sequence with the selected enzymes (see figure 19.29).
  - As a virtual gel simulation, which shows the fragments as bands on a gel (see figure 19.31). For more information about gel electrophoresis, see section 19.3.
The following sections will describe these output formats in more detail. In order to complete the analysis, click Finish (see section 9.1 for information about the Save and Open options).
Restriction sites as annotation on the sequence
If you chose to add the restriction sites as annotation to the sequence, the result will be similar to the sequence shown in figure 19.27. See section 10.3 for more information about viewing
[Figure: sequence PERH3BC shown with restriction site annotations.]
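The relationship between a table of restriction sites and a table of fragments can be sketched for a linear sequence. The function name and the convention that a cut position c means "cut after base c" (1-based) are assumptions for illustration; circular sequences and enzyme-specific overhangs are not handled.

```python
def linear_fragments(seq_length, cut_positions):
    """Return (start, end, length) for each fragment of a linear sequence.

    `cut_positions` holds 1-based positions; a value c means the backbone
    is cut between base c and base c + 1 (an assumed convention).
    """
    # Ignore duplicates and positions that do not split the sequence.
    cuts = sorted({c for c in cut_positions if 0 < c < seq_length})
    bounds = [0] + cuts + [seq_length]
    return [
        (left + 1, right, right - left)
        for left, right in zip(bounds, bounds[1:])
    ]


for start, end, length in linear_fragments(4361, [1425, 3609, 651]):
    print(f"{start:>5} {end:>5} {length:>5}")
```

Sorting the resulting lengths is what a virtual gel view would effectively display as bands.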
119. import

SCF3                   .scf                      trace files (only import)
Phred                  .phd                      trace files (only import)
mmCIF                  .cif                      structure (only import)
PDB                    .pdb                      structure (only import)
BLAST Database         .phr, .nhr                BLAST database (import)
Vector NTI Database                              sequences (import of whole database)
Vector NTI archives    .ma4, .pa4, .oa4          sequences (only import)
Gene Construction Kit  .gcc                      sequences (only import)
RNA Structure          .ct, .col, .rnaml, .xml   RNA structures
Preferences            .cpf                      CLC workbench preferences

Note: CLC Combined Workbench can import external files too. This means that all kinds of files can be imported and displayed in the Navigation Area, but the above-mentioned formats are the only ones whose contents can be shown in CLC Combined Workbench. The CLC Combined Workbench offers a lot of possibilities to handle bioinformatic data. Read the next sections to get information on how to import different file formats or to import data from a Vector NTI database.
Import using the import dialog
Before importing a file, you must decide where you want to import it, i.e. which location or folder. The imported file ends up in the location or folder you selected in the Navigation Area.
select location or folder | click Import in the Toolbar
This will show a dialog similar to figure 7.1 (depending on which platform you use). You can change which kind of file types should be shown by selecting a
120. in the file that are specific to the individual file formats. If the file type is not recognized, it will be imported as an external file. In most cases, automatic import will yield a successful result, but if the import goes wrong, the next option can be helpful:
Force import as type: This option should be used if CLC Combined Workbench cannot successfully determine the file format. By forcing the import as a specific type, the automatic determination of the file format is bypassed, and the file is imported as the type specified.
Force import as external file: This option should be used if a file is imported as a bioinformatics file when it should just have been an external file. It could be an ordinary text file which is imported as a sequence.
Import using drag and drop
It is also possible to drag a file from e.g. the desktop into the Navigation Area of CLC Combined Workbench. This is equivalent to importing the file using the Automatic import option described above. If the file type is not recognized, it will be imported as an external file.
Import using copy/paste of text
If you have e.g. a text file or a browser displaying a sequence in one of the formats that can be imported by CLC Combined Workbench, there is a very easy way to get this sequence into the Navigation Area:
Copy the text from the text file or browser | Select a folder in the Navigation Area | Paste
This will crea
121. in figure 5.5. This opens a menu where the following options are available:
Figure 5.3: The Side Panel of a sequence contains several groups: Sequence layout, Annotation types, Annotation layout, etc. Several of these groups are present in more views. E.g. Sequence layout is also in the Side Panel of alignment views.
Figure 5.4: The Sequence layout is expanded.
• Save Settings: This brings up a dialog as shown in figure 5.6, where you can enter a name for your settings. Furthermore, by clicking the checkbox Always apply these settings, you can choose to use these settings every time you open a new view of this type. If you wish to change which settings should be used per default, open the Preferences dialog (see section 5.2).
• Delete Settings: Opens a dialog to select which of
122. installation process is done in one of the following ways:
If you have downloaded an installer: Locate the downloaded installer and double-click the icon. The default location for downloaded files is your desktop.
If you are installing from a CD: Insert the CD into your CD-ROM drive and open it by double-clicking on the CD icon on your desktop. Launch the installer by double-clicking on the CLC Combined Workbench icon.
Installing the program is done in the following steps:
• On the welcome screen, click Next.
• Read and accept the License agreement and click Next.
• Choose where you would like to install the application and click Next.
• Choose if CLC Combined Workbench should be used to open CLC files and click Next.
• Choose whether you would like to create a desktop icon for launching CLC Combined Workbench and click Next.
• Choose if you would like to associate .clc files to CLC Combined Workbench. If you check this option, double-clicking a file with a .clc extension will open the CLC Combined Workbench.
• Wait for the installation process to complete, choose whether you would like to launch CLC Combined Workbench right away, and click Finish.
When the installation is complete, the program can be launched from your Applications folder or from the desktop shortcut you chose to create. If you like, you can drag the application icon to the dock for easy access.
1.2.4
123. instead of using BLAST. However, there are other situations where you either do not know the name of the gene or the genomic sequence is poorly annotated. In these cases, the approach described in this tutorial can be very productive.
2.9.2 BLAST for primer binding sites
You can adjust the BLAST parameters so that it becomes possible to match short primer sequences against a larger sequence. It is then easy to examine whether existing lab primers can be reused for other purposes, or whether the primers you designed are specific.
Purpose | Program | Word size | Low complexity filter | Expect value
Standard BLAST | blastn | 11 | On | 10
Primer search | blastn | 7 | Off | 1000
These settings are shown in figure 2.29.
Figure 2.28: Verification of the result. At the top, a view of the whole BLAST result. At the bottom, the same view is zoomed in on exon 3 to show the amino acids.
2.9.3 Finding remote protein homologues
If you look for short identical peptide sequences in a
124. isoleucine. An increase in the aliphatic index increases the thermostability of globular proteins. The index is calculated by the following formula:
Aliphatic index = X(Ala) + a · X(Val) + b · X(Leu) + b · X(Ile)
X(Ala), X(Val), X(Ile) and X(Leu) are the amino acid compositional fractions. The constants a and b are the relative volumes of the valine (a = 2.9) and leucine/isoleucine (b = 3.9) side chains compared to the side chain of alanine [Ikai, 1980].
Estimated half-life
The half-life of a protein is the time it takes for the pool of that particular protein to be reduced by half. The half-life of a protein is highly dependent on the identity of its N-terminal amino acid, and so is overall protein stability [Bachmair et al., 1986, Gonda et al., 1989, Tobias et al., 1991]. The importance of the N-terminal residues is generally known as the N-end rule. The N-end rule, and consequently the N-terminal amino acid, simply determines the half-life of proteins. The estimated half-life of proteins has been investigated in mammals, yeast and E. coli (see Table 14.2). If leucine is found N-terminally in mammalian proteins, the estimated half-life is 5.5 hours.
Extinction coefficient
This measure indicates how much light is absorbed by a protein at a particular wavelength. The extinction coefficient is measured by UV spectrophotometry, but can also be calculated. The
CHAPTER 14. GENERAL SEQUENCE ANALYSES
Amino acid | Mammalian | Yeast | E. coli
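The aliphatic index formula can be tried out with a few lines of Python. This is a minimal sketch, not the Workbench's implementation: the function name is hypothetical, and it assumes that the compositional fractions X are expressed as mole percent (100 times the fraction of residues), which is an assumption on top of the text. Only the constants a = 2.9 and b = 3.9 come from the text.

```python
from collections import Counter

A_VAL = 2.9      # relative volume of the valine side chain (from the text)
B_LEU_ILE = 3.9  # relative volume of the leucine/isoleucine side chains (from the text)


def aliphatic_index(protein: str) -> float:
    """Aliphatic index of a protein given as one-letter amino acid codes.

    Assumption: X(...) is taken as mole percent of each residue.
    """
    seq = protein.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)

    def mole_percent(residue: str) -> float:
        return 100.0 * counts[residue] / len(seq)

    return (
        mole_percent("A")
        + A_VAL * mole_percent("V")
        + B_LEU_ILE * (mole_percent("L") + mole_percent("I"))
    )
```

A sequence with no alanine, valine, leucine or isoleucine gets an index of 0, and the index grows as the share of these aliphatic residues grows.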
125. level of coverage is relative to the overall number of hits included in the result.
  Foreground color. Colors the letters using a gradient, where the left side color is used for low coverage and the right side is used for maximum coverage.
  Background color. Colors the background of the letters using a gradient, where the left side color is used for low coverage and the right side is used for maximum coverage.
  Graph. The coverage is displayed as a graph beneath the contig.
    Height. Specifies the height of the graph.
    Type. The graph can be displayed as Line plot, Bar plot or as a Color bar.
    Color box. For Line and Bar plots, the color of the plot can be set by clicking the color box. If a Color bar is chosen, the color box is replaced by a gradient color box as described under Foreground color.
The remaining View preferences for BLAST Graphics are the same as those of alignments. See section 20.2.
Some of the information available in the tooltips is:
• Name of sequence. Some additional information about the sequence that was found is shown here. This line corresponds to the description line in GenBank (if the search was conducted on the nr database).
• Score. This shows the bit score of the local alignment generated through the BLAST search.
• Expect. Also known as the E-value. A low value indicates a homologous sequence. Higher E-values indicate that BLAST found a less homologous sequence.
• Identities. This number shows
126. licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes, under the following conditions: You must attribute the work in its original form and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work.
SOME RIGHTS RESERVED
See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.
Chapter 22
RNA structure
Contents
22.1 RNA secondary structure prediction
  22.1.1 Selecting sequences for prediction
  22.1.2 Structure output
  22.1.3 Partition function
  22.1.4 Advanced options
  22.1.5 Structure as annotation
22.2 View and edit secondary structures
  22.2.1 Graphical view and editing of secondary structure
  22.2.2 Tabular view of structures and energy contributions
  22.2.3 Symbolic representation in sequence view
  22.2.4 Probability-based coloring
22.3 Evaluate structure hypothesis
  22.3.1 Selecting sequences for evaluation
127. < 10e-50: Closely related sequences; could be a domain match or similar.
• 10e-6 < E-value < 1: Could be a true homologue, but it is a gray area.
• E-value > 1: Proteins are most likely not related.
• E-value > 10: Hits are most likely junk unless the query sequence is very short.
Gap costs
For blastp it is possible to specify gap costs for the chosen substitution matrix. There is only a limited number of options for these parameters. The open gap cost is the price of introducing a gap in the alignment, and the extension gap cost is the price of every extension past the initial opening gap. Increasing the gap costs will result in alignments with fewer gaps.
Filters
It is possible to set different filter options before running the BLAST search. Low-complexity regions have a very simple composition compared to the rest of the sequence and may result in problems during the BLAST search [Wootton and Federhen, 1993]. A low-complexity region of a protein can, for example, look like this: 'fftfflllsss', which in this case is a region that is part of a signal peptide. In the output of the BLAST search, low-complexity regions will be marked in lowercase gray characters (default setting). A low-complexity region cannot be regarded as a significant match; thus, disabling the low-complexity filter is likely to generate more hits to sequences which are not truly related.
CHAPTER 12. BLAST SEARCH
Word size
Change of the word size has a great
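The gap cost description can be made concrete with a small calculation. The sketch below follows the wording above literally: the open cost is charged once per gap, and the extension cost is charged for each position past the first. This convention is an assumption; some programs also charge the extension cost for the first gap position. The function names and the example costs are hypothetical.

```python
def gap_cost(gap_length: int, open_cost: int, extension_cost: int) -> int:
    """Cost of a single gap under an affine (open + extension) scheme.

    Assumed convention: the opening cost covers the first gap position,
    and every further position costs `extension_cost`.
    """
    if gap_length < 0:
        raise ValueError("gap length must not be negative")
    if gap_length == 0:
        return 0
    return open_cost + (gap_length - 1) * extension_cost


def total_gap_cost(gap_lengths, open_cost: int, extension_cost: int) -> int:
    """Summed cost of all gaps in an alignment."""
    return sum(gap_cost(n, open_cost, extension_cost) for n in gap_lengths)


# One gap of 4 positions is cheaper than four separate gaps of 1 position,
# which is why higher open costs favour alignments with fewer gaps.
one_long_gap = total_gap_cost([4], open_cost=11, extension_cost=1)
four_short_gaps = total_gap_cost([1, 1, 1, 1], open_cost=11, extension_cost=1)
```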
128. more of the set criteria. For more detailed information, place the mouse cursor over the circle representing the primer of interest. A tooltip will then appear on screen, displaying detailed information about the primer in relation to the set criteria. To locate the primer on the sequence, simply left-click the circle.
The various primer parameters can now be varied to explore their effect, and the view area will dynamically update to reflect this, allowing for a high degree of interactivity in the primer design process.
After having explored the potential primers, the user may have found a satisfactory primer and can choose to export it directly from the view area by right-clicking the primer's information point. This approach does not take into account any design information concerning the properties of primer/probe pairs or sets, e.g. primer pair annealing and Tm difference between primers. If the latter is desired, the user can use the Calculate button at the bottom of the Primer parameter preference group. This will activate a dialog, the contents of which depend on the chosen mode. Here the user can set primer-pair-specific settings, such as allowed or desired Tm difference, and view the single-primer parameters which were chosen in the Primer parameters preference group.
CHAPTER 17. PRIMERS
Upon pressing Finish, an algorithm will generate all possible primer sets and rank these based on their characteristics and the chosen
129. move to the next step in the dialog, where you can choose between the neighbor joining and the UPGMA algorithms for making trees. You also have the option of including a bootstrap analysis of the result. Leave the parameters at their default and click Finish to start the calculation, which can be seen in the Toolbox under the Processes tab. After a short while, a tree appears in the View Area (figure 2.17).
Figure 2.17: After choosing which algorithm should be used, the tree appears in the View Area. The Side Panel in the right side of the view allows you to adjust the way the tree is displayed.
2.6.1 Tree layout
Using the Side Panel in the right side of the view, you can change the way the tree is displayed. Click Tree Layout and open the Layout drop-down menu. Here you can choose between standard and topology layout. The topology layout can help to give an overview of the tree if some of the branches are very short.
When the sequences include the appropriate annotation, it is possible to choose between the accession number and the species names at the leaves of the tree. Se
130. nanomolar (nM).
    Salt concentration. Specifies the concentration of monovalent cations (Na+, K+ and equivalents) in units of millimolar (mM).
    Magnesium concentration. Specifies the concentration of magnesium cations (Mg2+) in units of millimolar (mM).
    dNTP concentration. Specifies the concentration of deoxynucleotide triphosphates in units of millimolar (mM).
    DMSO concentration. Specifies the concentration of dimethyl sulfoxide in units of volume percent (vol.%).
  GC content. Determines the interval of GC content (percentage of C and G nucleotides in the primer) within which primers must lie, by setting a maximum and a minimum GC content.
  Self annealing. Determines the maximum self annealing value of all primers and probes. This determines the amount of base pairing allowed between two copies of the same molecule. The self annealing score is measured in number of hydrogen bonds between two copies of primer molecules, with A-T base pairs contributing 2 hydrogen bonds and G-C base pairs contributing 3 hydrogen bonds.
  Self end annealing. Determines the maximum self end annealing value of all primers and probes. This determines the amount of consecutive base pairs allowed between the ends of two copies of the same molecule. This score is also calculated in units of hydrogen bonds between two copies of identical primer molecules.
  Secondary structure. Determines the maximum score of the optimal seco
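The GC content and the hydrogen-bond scoring of self annealing can be illustrated as follows. This is only a sketch under stated assumptions: the Workbench's actual search procedure is not described here, so the code simply slides two antiparallel copies of the primer past each other and reports the offset with the most hydrogen bonds, counting every complementary pair whether or not it is consecutive. The function names are hypothetical; the bond values (A-T = 2, G-C = 3) come from the text.

```python
# Hydrogen bonds per Watson-Crick base pair, as stated in the text.
HYDROGEN_BONDS = {("A", "T"): 2, ("T", "A"): 2, ("G", "C"): 3, ("C", "G"): 3}


def gc_content(primer: str) -> float:
    """Fraction of G and C nucleotides in the primer (0.0 to 1.0)."""
    primer = primer.upper()
    if not primer:
        raise ValueError("empty primer")
    return sum(primer.count(base) for base in "GC") / len(primer)


def self_annealing(primer: str) -> int:
    """Maximum number of hydrogen bonds between two copies of the primer.

    Assumption: the two copies are aligned antiparallel without gaps, so
    position i of one copy faces position k - i of the other for a shift k.
    """
    primer = primer.upper()
    n = len(primer)
    best = 0
    for k in range(2 * n - 1):
        score = 0
        for i in range(max(0, k - n + 1), min(n, k + 1)):
            score += HYDROGEN_BONDS.get((primer[i], primer[k - i]), 0)
        best = max(best, score)
    return best
```

A palindromic primer such as GAATTC pairs perfectly with a second copy of itself, so it receives a high self annealing score, whereas a primer like AAAA cannot pair with itself at all.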
131. negatively charged DNA, which may regulate gene expression or help to fold the DNA. Nuclear proteins often have a low percentage of aromatic residues [Andrade et al., 1998].
Amino acid distribution
Amino acids are the basic components of proteins. The amino acid distribution in a protein is simply the percentage of the different amino acids represented in a particular protein of interest. Amino acid composition is generally conserved through family classes in different organisms, which can be useful when studying a particular protein or enzyme across species borders. Another interesting observation is that amino acid composition varies slightly between proteins from different subcellular localizations. This fact has been used in several computational methods for prediction of subcellular localization.
Annotation table
This table provides an overview of all the different annotations associated with the sequence and their incidence.
Dipeptide distribution
This measure is simply a count, or frequency, of all the observed adjacent pairs of amino acids (dipeptides) found in the protein. It is only possible to report neighboring amino acids. Knowledge of dipeptide composition has previously been used for prediction of subcellular localization.
Creative Commons License
All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and us
132. nodes (also called leaves or tips of the tree) represent extant species of Hominidae and are the operational taxonomical units (OTUs). The internal nodes, which here represent extinct common ancestors of the great apes, are termed hypothetical taxonomical units since they are not directly observable.
Figure 21.4: A proposed phylogeny of the great apes (Hominidae). Different components of the tree are marked; see text for description.
The ordering of the nodes determines the tree topology and describes how lineages have diverged over the course of evolution. The branches of the tree represent the amount of evolutionary divergence between two nodes in the tree and can be based on different measurements. A tree is completely specified by its topology and the set of all edge lengths.
The phylogenetic tree in figure 21.4 is rooted at the most recent common ancestor of all Hominidae species, and therefore represents a hypothesis of the direction of evolution, e.g. that the common ancestor of gorilla, chimpanzee and man existed before the common ancestor of chimpanzee and man. If this information is absent, trees can be drawn as unrooted.
21.2.2 Modern usage of phylogenies
Besides evolutionary biology and systemat
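The statement that a tree is completely specified by its topology and its edge lengths maps directly onto a small data structure. The sketch below is illustrative only: the `Node` class and helper functions are hypothetical, and the branch lengths are made-up numbers, not measurements. Only the branching order follows the rooted great ape tree described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""        # empty for internal (hypothetical) nodes
    length: float = 0.0   # length of the edge leading to this node
    children: list = field(default_factory=list)


def leaves(node: Node) -> list:
    """Names of the terminal nodes (OTUs) below `node`, left to right."""
    if not node.children:
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaves(child))
    return names


def internal_nodes(node: Node) -> int:
    """Number of internal nodes (hypothetical taxonomical units), root included."""
    if not node.children:
        return 0
    return 1 + sum(internal_nodes(child) for child in node.children)


def total_edge_length(node: Node) -> float:
    """Sum of all edge lengths in the tree."""
    return node.length + sum(total_edge_length(child) for child in node.children)


# Rooted topology; all branch lengths are invented for illustration.
great_apes = Node(children=[
    Node("Orangutan", 3.0),
    Node(length=1.0, children=[
        Node("Gorilla", 2.0),
        Node(length=0.5, children=[
            Node("Human", 1.5),
            Node("Chimpanzee", 1.5),
        ]),
    ]),
])
```

Changing the nesting changes the topology, while changing only the `length` values keeps the topology and alters the amount of divergence the branches represent.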
133. of commercial vendors.
At the bottom of the dialog, you can select to save this list of enzymes as a new file. In this way, you can save the selection of enzymes for later use.
When you click Finish, the enzymes are added to the Side Panel and the cut sites are shown on the sequence.
If you have specified a set of enzymes which you always use, it will probably be a good idea to save the settings in the Side Panel (see section 3.2.7) for future use.
Show enzymes cutting inside/outside selection
Section 19.2.1 describes how to add more enzymes to the list in the Side Panel based on the name of the enzyme, overhang, methylation sensitivity, etc. However, you will often find yourself in a situation where you need a more sophisticated and explorative approach.
An illustrative example: you have a selection on a sequence, and you wish to find enzymes cutting within the selection but not outside. This problem often arises during design of cloning experiments. In this case, you do not know the name of the enzyme, so you want the Workbench to find the enzymes for you:
right-click the selection | Show Enzymes Cutting Inside/Outside Selection
This will display the dialog shown in figure 19.17, where you can specify which enzymes should initially be considered.
At the top, you can choose to Use existing enzyme list. Clicking this option lets you select an enzyme list which is stored in the Navigation Area. See section 19.4
134. of the information can be edited by clicking the blue Edit text. This means that you can add your own information to sequences that do not derive from databases.
10.5 View as text
A sequence can be viewed as text without any layout and text formatting. This displays all the information about the sequence in the GenBank file format. To view a sequence as text:
select a sequence in the Navigation Area | Show in the Toolbar | As text
This way it is possible to see background information about e.g. the authors and the origin of DNA and protein sequences. Selections, or the entire text of the Sequence Text Viewer, can be copied and pasted into other programs.
Much of the information is also displayed in the Sequence info, where it is easier to get an overview (see section 10.4).
In the Side Panel, you find a search field for searching the text in the view.
CHAPTER 10. VIEWING AND EDITING SEQUENCES
10.6 Creating a new sequence
A sequence can either be imported, downloaded from an online database or created in the CLC Combined Workbench. This section explains how to create a new sequence:
New in the toolbar | Create Sequence
Figure 10.21: Creating
135. on the residues at the beginning of the contig. Click the Find Inconsistency button at the top of the Side Panel, or press the Space key, to find the first position where there is disagreement between the reads (see figure 2.44).
Figure 2.44: Using the Find Inconsistency button highlights inconsistencies.
In this example, the first and the third reads have a "T", whereas the second line has a "C" (marked with a light-pink background color). The gray color of the residues in the fourth line indicates that this region has been trimmed, based on the criteria in figure 2.42, and that this information is not included in the creation of the contig.
Since the majority of the reads show a "T" in this position, we settle on this in the consensus. In order to show that there has been a disagreement in this position, type a lower-case "t" (see figure 2.45).
Figure 2.45: Just press the key to replace the residue.
Clicking the Find Inconsistency button again will find the next inconsistency.
2.12.3 Inspecting the traces
Here it is read1 which is different from read3 and read4. There are two peaks: black and green. In order to see the details, we zoom
136. opens the dialog shown in figure 14.4.
Notice! Calculating dot plots takes up a considerable amount of memory. Therefore, you see a warning if the sum of the number of nucleotides/amino acids in the sequences is higher than 8000. If you insist on calculating a dot plot with more residues, the Workbench may shut down, allowing you to save your work first. However, this depends on your computer's memory configuration.
Adjust dot plot parameters
There are two parameters for calculating the dot plot:
• Distance correction (only valid for protein sequences). In order to treat evolutionary transitions of amino acids, a distance correction measure can be used when calculating the dot plot. These distance correction matrices (substitution matrices) take into account the likelihood of one amino acid changing to another.
• Window size. A residue-by-residue comparison (window size = 1) would undoubtedly result in a very noisy background due to a lot of similarities between the two sequences of interest. For DNA sequences, the background noise will be even more dominant, as a match between only four nucleotides is very likely to happen. Moreover, a residue-by-residue comparison (window size = 1) can be very time-consuming and computationally demanding. Increasing the window size will make the dot plot smoother.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.
CHAPTER 14 GENERAL SEQUEN
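The effect of the window size can be seen in a minimal dot plot calculation. The sketch below is an illustration, not the Workbench's algorithm: it uses simple identity matching instead of a substitution matrix, and it places a dot only when every position in the window matches unless a lower threshold is given. The function names and the `min_matches` parameter are assumptions.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 1, min_matches=None):
    """Boolean matrix with a dot at [i][j] when the windows starting at
    seq_a[i] and seq_b[j] share at least `min_matches` identical positions.

    Assumption: identity scoring only; by default all positions must match.
    """
    if window < 1:
        raise ValueError("window size must be at least 1")
    if min_matches is None:
        min_matches = window
    rows = max(len(seq_a) - window + 1, 0)
    cols = max(len(seq_b) - window + 1, 0)
    plot = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            row.append(matches >= min_matches)
        plot.append(row)
    return plot


def dot_count(plot) -> int:
    """Number of dots in the plot."""
    return sum(1 for row in plot for cell in row if cell)


# Comparing a short sequence with itself: a window of 1 yields many
# off-diagonal dots, while a window of 2 leaves only the main diagonal.
noisy = dot_count(dot_plot("AAGA", "AAGA", window=1))
smooth = dot_count(dot_plot("AAGA", "AAGA", window=2))
```

The nested loops also show why the calculation becomes demanding for long sequences: the number of cells grows with the product of the two sequence lengths.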
137. options (see figure 19.6). "Restriction site" in the list below indicates the name of the selected restriction site. This could, for example, be EcoRV.
• Cut this sequence at this EcoRV site. This will cut the sequence at this particular site and only this site.
• Cut this sequence at all EcoRV sites. This will cut the sequence at all identical restriction sites but at no other sites.
• Cut all sequences at all EcoRV sites. This will cut all sequences in the cloning editor with this particular restriction enzyme. This can potentially generate a lot of sequence fragments.
• Insert sequence at this EcoRV site. This will insert a sequence from a list of the other sequences into this particular site.
• Add as Annotation. This will add an annotation to the sequence indicating the recognition site and cut site of the enzyme. By doing this, the cut information will be retained on the sequence in other contexts.
• Show Enzymes with Compatible Ends. See section 19.2.1.
When a restriction site is double-clicked, the recognition site is marked on the sequence and the cut site is marked by arrows. When a sequence region between two restriction sites is double-clicked, the entire region will automatically be selected. This makes it very easy to make a new sequence from a fragment created by cutting with two restriction sites (right-click the selection and choose Duplicate selection).
CHAPTER 19. CLONING AND CUTTING
Cut Sequ
138. or sections mentioned above to learn more about the significance of the parameters.
In Step 3 you can adjust parameters for sequence statistics:
• Individual Statistics Layout. Comparative is disabled because reports are generated for one protein at a time.
• Include Background Distribution of Amino Acids. Includes distributions from different organisms. Background distributions are calculated from UniProt (www.uniprot.org) version 6.0, dated September 13, 2005.
In Step 4 you can adjust parameters for hydrophobicity plots:
• Window size. Width of window on sequence (must be an odd number).
• Hydrophobicity scales. Lets you choose between different scales.
In Step 5 you can adjust a parameter for complexity plots:
• Window size. Width of window on sequence (must be odd).
In Step 6 you can adjust parameters for dot plots:
• Score model. Different scoring matrices.
• Window size. Width of window on sequence.
In Step 7 you can adjust parameters for BLAST search:
• Program. Lets you choose between different BLAST programs.
• Database. Lets you limit your search to a particular database.
CHAPTER 16. PROTEIN ANALYSES
16.8.1 Protein report output
An example of a Protein report can be seen in figure 16.20.
[Figure 16.20: protein report for CAA24102, opening with section 1, Protein statistics, and 1.1, Sequence information.]
139. Figure 14.14: An example of a local complexity plot.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.
The values of the complexity plot approach 1.0 as the distribution of amino acids becomes more complex.
14.3.1 Local complexity view preferences
There are two groups of preferences for the local complexity view: Graph preferences and Local complexity preferences.
The Graph preferences apply to the whole graph:
• Lock axis. This will always show the axis, even though the plot is zoomed to a detailed level.
• Frame. Toggles the frame of the graph.
• X-axis at zero. Toggles the x-axis at zero.
• Y-axis at zero. Toggles the y-axis at zero.
• Tick type. Options: outside, inside.
• Tick lines at. Shows a grid behind the graph. Options: none, major ticks.
• Show as histogram. For some data series it is possible to see the data as a histogram rather than a line plot.
The Local complexity preferences include:
• Dot type. Options: none, cross, plus, square, diamond, circle, triangle, reverse triangle, dot.
• Dot color. Allows you to choose between many different colors.
• Line width. C
140. parameters. A list will appear displaying the 100 highest-scoring sets and information pertaining to these. The search result can be saved to the navigator. From the result table, suggested primers or primer/probe sets can be explored, since clicking an entry in the table will highlight the associated primers and probes on the sequence. It is also possible to save individual primers or sets from the table through the right-click menu. For a given primer pair, the amplified PCR fragment can also be opened or saved using the right-click menu.
17.1.2 Scoring primers
CLC Combined Workbench employs a proprietary algorithm to rank primer and probe solutions. The algorithm considers both the parameters pertaining to single oligos, such as the secondary structure score, and parameters pertaining to oligo pairs, such as the oligo pair annealing score. The ideal score for a solution is 100, and solutions are thus ranked in descending order. Each parameter is assigned an ideal value and a tolerance. Consider, for example, oligo self annealing: here the ideal value of the annealing score is 0, and the tolerance corresponds to the maximum value specified in the Side Panel. The contribution to the final score is determined by how much the parameter deviates from the ideal value, and is scaled by the specified tolerance. Hence, a large deviation from the ideal and a small tolerance will give a large deduction in the final score, and a small deviation fro
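Because the ranking algorithm is proprietary, its exact formula is not available. The sketch below only illustrates the general idea stated above (start from an ideal score of 100 and deduct for each parameter according to its deviation from the ideal value, scaled by the tolerance). The linear penalty, the `PENALTY_WEIGHT` constant and the function name are all hypothetical and should not be read as the Workbench's scoring.

```python
PENALTY_WEIGHT = 10.0  # hypothetical weight; the real weighting is not documented here


def solution_score(parameters) -> float:
    """Illustrative score for a primer solution.

    `parameters` is a list of (value, ideal, tolerance) tuples. Each parameter
    deducts PENALTY_WEIGHT * |value - ideal| / tolerance from the ideal 100.
    """
    score = 100.0
    for value, ideal, tolerance in parameters:
        if tolerance <= 0:
            raise ValueError("tolerance must be positive")
        score -= PENALTY_WEIGHT * abs(value - ideal) / tolerance
    return max(score, 0.0)


# Same deviation from the ideal self annealing score of 0, two tolerances:
# the smaller tolerance leads to the larger deduction.
loose = solution_score([(9, 0, 18)])
strict = solution_score([(9, 0, 9)])
```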
141. part of the sequence.
6.2 Page setup
No matter whether you have chosen to print the visible area or the whole view, you can adjust the page setup of the print. An example of this can be seen in figure 6.5.
Figure 6.5: Page Setup.
In this dialog, you can adjust both the setup of the pages and specify a header and a footer by clicking the tab at the top of the dialog.
You can modify the layout of the page using the following options:
• Orientation.
    Portrait. Will print with the paper oriented vertically.
    Landscape. Will print with the paper oriented horizontally.
• Paper size. Adjust the size to match the paper in your printer.
• Fit to pages. Can be used to control how the graphics should be split across pages (see figure 6.6 for an example).
    Horizontal pages. If you set the value to e.g. 2, the printed content will be broken up horizontally and split across 2 pages. This is useful for sequences that are not wrapped.
    Vertical pages. If you set the value to e.g. 2, the printed content will be broken up vertically and split across 2 pages.
CHAPTER 6. PRINTING
Figure 6.6: An example where Fit to pages horizontally is set to 2, and Fit to pages vertically is set to 3.
Note! It is a good idea to
142. particularly useful if you wish to use enzymes which produce e.g. a 3' overhang. In this case, you can sort the list by clicking the Overhang column heading, and all the enzymes producing 3' overhangs will be listed together for easy selection.
When looking for a specific enzyme, it is easier to use the Filter. If you wish to find e.g. HindIII sites, simply type HindIII into the filter, and the list of enzymes will shrink automatically to only include the HindIII enzyme. This can also be used to only show enzymes producing e.g. a 3' overhang, as shown in figure 19.33.
If you need more detailed information and filtering of the enzymes, either place your mouse cursor on an enzyme for one second to display additional information (see figure 19.34), or use the view of enzyme lists (see 19.4).
Footnote 3: The CLC Combined Workbench comes with a standard set of enzymes based on http://www.rebase.org.
[Figure: Restriction Map Analysis dialog, step 2 (Enzymes to be considered in calculation), showing the enzyme list with Name, Overhang, Methylation and Popularity columns.]
143. pasted element will be renamed by appending a number at the end of the name.
Elements can also be moved instead of copied. This is done with the cut/paste function:
select the files to cut | right-click one of the selected files | Cut | right-click the location to insert files into | Paste
or select the files to cut | Ctrl + X (⌘ + X on Mac) | select where to insert files | Ctrl + V (⌘ + V on Mac)
When you have cut the element, it is grayed out until you activate the paste function. If you change your mind, you can revert the cut command by copying another element.
Move using drag and drop
Using drag and drop in the Navigation Area, as well as in general, is a four-step process:
click the element | click on the element again, and hold left mouse button | drag the element to the desired location | let go of mouse button
This allows you to:
• Move elements between different folders in the Navigation Area.
• Drag from the Navigation Area to the View Area: A new view is opened in an existing View Area if the element is dragged from the Navigation Area and dropped next to the tab(s) in that View Area.
• Drag from the View Area to the Navigation Area: The element, e.g. a sequence, alignment, search report etc., is saved where it is dropped. If the element already exists, you are asked whether you want to save a copy. You drag from the View Area by dragging the tab of the desired element.
CHAPTER 3. USER INTERFACE
U
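144. patterns. Below is a compiled list of proteolytic enzymes used in CLC Combined Workbench.
APPENDIX C. PROTEOLYTIC CLEAVAGE ENZYMES
Name | P4 | P3 | P2 | P1 | P1' | P2'
Cyanogen bromide (CNBr) | - | - | - | M | - | -
Asp-N endopeptidase | - | - | - | - | D | -
Arg-C | - | - | - | R | - | -
Lys-C | - | - | - | K | - | -
Trypsin | - | - | - | K, R | not P | -
Trypsin | - | - | W | K | P | -
Trypsin | - | - | M | R | P | -
Trypsin | - | - | C, D | K | D | -
Trypsin | - | - | C | K | H, Y | -
Trypsin | - | - | C | R | K | -
Trypsin | - | - | R | R | H, R | -
Chymotrypsin-high spec. | - | - | - | F, Y | not P | -
Chymotrypsin-high spec. | - | - | - | W | not M, P | -
Chymotrypsin-low spec. | - | - | - | F, L, Y | not P | -
Chymotrypsin-low spec. | - | - | - | W | not M, P | -
Chymotrypsin-low spec. | - | - | - | M | not P, Y | -
Chymotrypsin-low spec. | - | - | - | H | not D, M, P, W | -
o-Iodosobenzoate | - | - | - | W | - | -
Thermolysin | - | - | - | not D, E | A, F, I, L, M or V | -
Post-Pro | - | - | H, K, R | P | not P | -
Glu-C | - | - | - | E | - | -
Asp-N | - | - | - | - | D | -
Proteinase K | - | - | - | A, E, F, I, L, T, V, W, Y | - | -
Factor Xa | A, F, G, I, L, T, V | D, E | G | R | - | -
Granzyme B | - | E | P | D | - | -
Thrombin | - | - | G | R | G | -
Thrombin | A, F, G, I, L, T, V | A, F, G, L, T, V, W, A | P | R | not D, E | not D, E
TEV (Tobacco Etch Virus) | - | Y | - | Q | G, S | -
Appendix D
Formats for import and export
D.1 List of bioinformatic data formats
Below is a list of bioinformatic data formats, i.e. formats for importing and exporting sequences, alignments and trees.
APPENDIX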
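To show how a cleavage pattern from such a list translates into fragments, here is a minimal sketch for the basic trypsin pattern only (cut after K or R at P1 unless P is at P1'). It deliberately ignores the other trypsin rows in the list, so it is a simplification rather than the Workbench's implementation, and the function name is hypothetical.

```python
def trypsin_fragments(protein: str) -> list:
    """Split a protein after every K or R that is not followed by P.

    Assumption: only the basic trypsin pattern is applied; the additional
    trypsin patterns that depend on the P2 residue are not modelled.
    """
    protein = protein.upper()
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        if residue in "KR" and not is_last and protein[i + 1] != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])
    return fragments


# The K in "AK" is cleaved, the R in "RP" is protected by the proline.
example = trypsin_fragments("AKRPGK")
```

The number of cuts for an enzyme is then simply the number of fragments minus one, which is the quantity that filters on minimum and maximum numbers of cleavage sites operate on.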
145. peptide. But in borderline cases it is often convenient to have more information than just a yes/no answer. Here a graphical output can help in interpreting the correct answer. An example is shown in figure 16.5.
Figure 16.5: Graphical output from the SignalP method for Swiss-Prot entry SFMA_ECOLI (plot of the C-score, S-score and Y-score against sequence position). Initially this seemed like a borderline prediction, but closer inspection of the sequence revealed an internal methionine at position 12, which could indicate an erroneously annotated start of the protein. Later this protein was re-annotated by Swiss-Prot to start at the M in position 12. See the text for a description of the scores.
CHAPTER 16. PROTEIN ANALYSES
The graphical output from the SignalP neural network comprises three different scores: C, S and Y. Two additional scores are reported in the SignalP3-NN output, namely the S-mean and the D-score, but these are only reported as numerical values.
For each organism class in SignalP (Eukaryote, Gram-negative and Gram-positive), two different neural networks are used: one for predicting the actual signal peptide, and one for predicting the position of the signal peptidase I (SPase I) cleavage site. The S-score for th
146. pointer over a BLAST hit sequence, a tooltip appears, listing the characteristics of the sequence. By default, the query sequence is fitted to the window width, but it is possible to zoom in and see the actual sequence alignments returned from the BLAST server.
There are several settings available in the BLAST Graphics view:
• BLAST Layout. You can choose to Gather sequences at top. Enabling this option affects the view that is shown when scrolling horizontally along a BLAST result. If selected, the sequence hits which do not contribute to the visible part of the BLAST graphics will be omitted, whereas the found BLAST hits will automatically be placed right below the query sequence.
• BLAST hit coloring. You can choose whether to color hit sequences, and you can adjust the coloring.
• Compactness. In the Sequence Layout in the Side Panel, you can control the level of sequence detail to be displayed:
    Not compact. Full detail and spaces between the sequences.
    Low. The normal setting, where the residues are visible when zoomed in, but with no extra spaces between the sequences.
    Medium. The sequences are represented as lines and the residues are not visible. There is some space between the sequences.
    Compact. Even less space between the sequences.
• Coverage. In the Alignment info in the Side Panel, you can visualize the number of hit sequences at a given position on the query sequence. The
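Coverage, as described above, is the number of hit sequences at each position of the query. A minimal sketch of that count follows; the representation of a hit as a 1-based, inclusive (start, end) pair on the query is an assumption, and both function names are hypothetical. The second function expresses coverage relative to the total number of hits, which is the kind of value a color gradient could be mapped onto.

```python
def coverage(query_length: int, hits) -> list:
    """Number of hits covering each query position.

    `hits` is a list of (start, end) pairs, 1-based and inclusive (assumed
    representation). Parts of a hit outside the query are ignored.
    """
    depth = [0] * query_length
    for start, end in hits:
        for pos in range(max(start, 1), min(end, query_length) + 1):
            depth[pos - 1] += 1
    return depth


def relative_coverage(depth, n_hits: int) -> list:
    """Coverage as a fraction of the total number of hits (0.0 to 1.0)."""
    if n_hits == 0:
        return [0.0 for _ in depth]
    return [d / n_hits for d in depth]


hits = [(1, 3), (2, 5)]
depth = coverage(5, hits)
fractions = relative_coverage(depth, len(hits))
```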
147. protein sequence, but you only want to show which enzymes cut between two and four times. Then you should select "The enzymes has more cleavage sites than" and enter 2, and select "The enzyme has less cleavage sites than" and enter 4. In the next step, you should simply select all enzymes. This will result in a view where only enzymes which cut 2, 3 or 4 times are presented. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. The result of the detection is displayed in figure 16.28. Depending on the settings in the program, the output of the proteolytic cleavage site detection will display two views on the screen. The top view shows the actual protein sequence with the predicted cleavage sites indicated by small arrows. If no labels are shown on the arrows, they can be enabled by setting the labels in the annotation layout in the preference panel. The bottom view shows a text output of the detection, listing the individual fragments and information on these.
16.10.2 Bioinformatics explained: Proteolytic cleavage
Proteolytic cleavage is basically the process of breaking the peptide bonds between amino acids in proteins. This process is carried out by enzymes called peptidases, proteases or proteolytic cleavage enzymes.
[Figure: protein sequence NP_058652 (MVHLTDAEKSAVSCLWAKVNPDEVGGEALGRLLVVYPWTQRYFDSFGDL) with predicted Trypsin cleavage sites marked above the sequence.]
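The cut-count filter described above can be sketched in a few lines of Python. This is a minimal illustration and not the workbench's own implementation: the function names are hypothetical, and the trypsin rule used here (cut after K or R unless the next residue is P) is a common simplification assumed for the example. The bounds are treated as inclusive because the text states that enzymes cutting 2, 3 or 4 times are kept. The sequence is the 49-residue fragment of NP_058652 shown in the figure residue.

```python
def trypsin_sites(protein):
    """Return 1-based positions after which trypsin is assumed to cut.

    Assumption: simplified rule, cut after K or R unless followed by P.
    """
    sites = []
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    return sites


def fragments(protein, sites):
    """Split the protein into the peptides produced by the given cut sites."""
    bounds = [0] + sites + [len(protein)]
    return [protein[a:b] for a, b in zip(bounds, bounds[1:])]


def cuts_within(count, minimum, maximum):
    """Inclusive cut-count filter (hypothetical stand-in for the dialog)."""
    return minimum <= count <= maximum


# Fragment of NP_058652 as shown in the manual's figure.
SEQ = "MVHLTDAEKSAVSCLWAKVNPDEVGGEALGRLLVVYPWTQRYFDSFGDL"

sites = trypsin_sites(SEQ)
peptides = fragments(SEQ, sites)
keep_enzyme = cuts_within(len(sites), 2, 4)
```

With these assumptions the enzyme would be kept, since the number of predicted sites falls inside the 2 to 4 window; an enzyme with a single site or with five sites would be filtered out.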
148. [Figure: dialog for selecting the number of restriction sites.]
Figure 19.25: Selecting number of cut sites.
If you wish the output of the restriction map analysis to include only restriction enzymes which cut the sequence a specific number of times, use the checkboxes in this dialog:
• No restriction site (0)
• One restriction site (1)
• Two restriction sites (2)
• Three restriction sites (3)
• N restriction sites: Minimum, Maximum
• Any number of restriction sites (> 0)
The default setting is to include the enzymes which cut the sequence one or two times. You can use the checkboxes to perform very specific searches for restriction sites, e.g. if you wish to find enzymes which do not cut the sequence, or enzymes cutting exactly twice.
Output of restriction map analysis
Clicking Next shows the dialog in figure 19.26.
[Figure: Restriction Site Analysis dialog, output options step, including Create list of cutting enzymes, result handling and log handling.]
Figure 19.26: Choosing to add restriction sites as annotations or creating a restriction map.
This dialog lets you specify
149. scanning. The scanning is started from the Toolbox:
Toolbox | RNA Structure | Evaluate Structure Hypothesis
This opens the dialog shown in figure 22.25.
[Figure: Scan for Local Structure dialog, step 1 (select nucleotide sequences), with AB030907 listed under Selected Elements.]
Figure 22.25: Selecting RNA or DNA sequences for structure scanning.
If you have selected sequences before choosing the Toolbox action, they are now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. Click Next to adjust scanning parameters (see figure 22.26). The first group of parameters pertains to the methods of sequence resampling. There are four ways of resampling, all described in detail in Clote et al., 2005:
• Mononucleotide shuffling. Shuffle method generating a sequence with exactly the same mononucleotide frequency.
• Dinucleotide shuffling. Shuffle method generating a sequence with exactly the same dinucleotide frequency.
• Mononucleotide sampling from zero order Markov chain. Resampling method generating a sequence with the same expected mononucleotide frequency.
• Dinucleotide
150. see the CLC Standard Settings, which are the default settings for the view.
2.4 Tutorial: GenBank search and download
The CLC Combined Workbench allows you to search the NCBI GenBank database directly from the program, giving you the opportunity to open, view, analyze and save the search results without using any other applications. To conduct a search in NCBI GenBank from CLC Combined Workbench, you must be connected to the Internet. This tutorial shows how to find a complete human hemoglobin DNA sequence in a situation where you do not know the accession number of the sequence. To start the search:
Search | Search for Sequences at NCBI
This opens the search view. We are searching for a DNA sequence, hence:
Nucleotide
Now we are going to adjust parameters for the search. By clicking Add search parameters, you activate an additional set of fields where you can enter search criteria. Each search criterion consists of a drop-down menu and a text field. In the drop-down menu, you choose which part of the NCBI database to search, and in the text field, you enter what to search for:
Click Add search parameters until three search criteria are available | choose Organism in the first drop-down menu | write human in the adjoining text field | choose All Fields in the second drop-down menu | write hemoglobin in the adjoining text field | choose All Fields in the third drop-down menu | write complet
151. server at NCBI is: http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
Note! Be careful to specify a valid URL, otherwise BLAST will not work.
5.4 Export/import of preferences
The user preferences of the CLC Combined Workbench can be exported to other users of the program, allowing other users to display data with the same preferences as yours. You can also use the export/import preferences function to back up your preferences. To export preferences, open the Preferences dialog (Ctrl + K, or ⌘ + K on Mac) and do the following:
Export | Select the relevant preferences | Export | Choose location for the exported file | Enter name of file | Save
Note! The format of exported preferences is .cpf. This extension must be added to the name of the exported file in order for the exported file to work.
Before exporting, you are asked which of the different settings you want to include in the exported file. One of the items in the list is User Defined View Settings. If you export this, only the information about which of the settings is the default setting for each view is exported. If you wish to export the Side Panel Settings themselves, see section 5.2.1.
The process of importing preferences is similar to exporting:
Press Ctrl + K (⌘ + K on Mac) to open Preferences | Import | Browse to and select the .cpf file | Import and apply preferences
5.4.1 The different options for export and importin
152. set contains the following: a set of primers which are general to all sequences in the alignment, a TaqMan probe which is specific to the set of included sequences (sequences where selection boxes are checked), and a TaqMan probe which is specific to the set of excluded sequences (marked by [...]). Otherwise, the table is similar to that described above for TaqMan probe prediction on single sequences.
[Figure: Calculation parameters dialog. Chosen parameters: maximum and minimum primer length, G/C content and melting temperature; maximum self annealing, self end annealing and secondary structure; 3' end and 5' end G/C requirements. Probe parameters: minimum number of mismatches, minimum number of mismatches in central part. Primer combination parameters: max percentage point difference in G/C content, max difference in melting temperatures within a primer pair, max hydrogen bonds between pairs and between pair ends, minimum difference in melting temperature (primers/probes), maximum length of amplicon.]
Figure 17.14: Calculation dialog shown when designing alignment-based TaqMan probes.
17.10 Analyze primer properties
CLC Combined Workbench can calculate and display the properties of predefined primers and probes: select a primer sequence
153. setup
6.2.1 Header and footer
6.3 Print preview
CLC Combined Workbench offers different choices of printing the result of your work. This chapter deals with printing directly from the workbench. Another option for using the graphical output of your work is to export graphics (see chapter 7.3) in a graphic format, and then import it into a document or a presentation. All the kinds of data that you can view in the View Area can be printed. The CLC Combined Workbench uses a WYSIWYG principle: What You See Is What You Get. This means that you should use the options in the Side Panel to change how your data, e.g. a sequence, looks on the screen. When you print it, it will look exactly the same way on print as on the screen. For some of the views, the layout will be slightly changed in order to be printer-friendly. It is not possible to print elements directly from the Navigation Area. They must first be opened in a view in order to be printed. To print the contents of a view:
select relevant view | Print in the toolbar
This will show a print dialog (see figure 6.1). In this dialog, you can:
• Select which part of the view you want to print.
• Adjust Page Setup.
• See a print Preview window.
These three options are described in the three following sections.
154. shown in sequence views and in the Navigation Area.
• Description. A description of the sequence.
• Comments. The author's comments about the sequence.
[Figure: sequence info fields, some with Edit links: Name, Description, Comments, KeyWords, Db Source, Gb Division, Length, Modification Date, Latin name, Common name, Taxonomy name.]
Figure 10.20: The initial display of sequence info for the HUMHBB DNA sequence from the Example data.
• Keywords. Keywords describing the sequence.
• Db source. Accession numbers in other databases concerning the same sequence.
• Gb Division. Abbreviation of GenBank divisions. See section 3.3 in the GenBank release notes for a full list of GenBank divisions.
• Length. The length of the sequence.
• Modification date. Modification date from the database. This means that this date does not reflect your own changes to the sequence. See the history (section 8) for information about the latest changes to the sequence after it was downloaded from the database.
• Organism. Scientific name of the organism (first line) and taxonomic classification levels (second and subsequent lines).
The information available depends on the origin of the sequence. Sequences downloaded from databases like NCBI and UniProt (see section 11) have this information. On the other hand, some sequence formats, like FASTA format, do not contain this information. Some
155. system, 183
create database from Navigation Area, 183
create local database, 183
database file format, 35, 113, 409
graphics output, 180
list of databases, 404
parameters, 175
search, 173
SNP, 185
specify server URL, 102
table output, 182
tips for specialized searches, 50
tutorial, 47, 50
URL, 102
BLAST DNA sequence: BLASTn, 173; BLASTx, 173; tBLASTx, 173
BLAST Protein sequence: BLASTp, 174; tBLASTn, 174
BLAST result: search in, 183
BLAST search: Bioinformatics explained, 190
BLOSUM: scoring matrices, 214
Bootstrap values, 372
Borrow floating license, 22
Browser: import sequence from, 115
Bug reporting, 25
C/G content, 134
CDS: translate to protein, 144
Chain flexibility, 135
Cheap end gaps, 349
Chromatogram traces: scale, 302
.cif, file format, 112, 199
Circular molecules, 328
Circular view of sequence, 145, 400
.clc, file format, 112, 117
CLC Standard Settings, 105
CLC Workbenches, 24
CLC, file format, 35, 113, 409; associating with CLC Combined Workbench, 14
Cleavage, 269; the Peptidase Database, 273
Cloning, 319, 403; circular view, 328; insert fragment, 326; navigation, 321; restriction enzymes, 325
Close view, 82
Clustal, file format, 35, 113, 409
Coding sequence: translate to protein, 144
Codon: frequency tables, reverse translation, 267; usage, 268
.col, file format, 112
Color residues, 354
Comments, 154
Compare workbenches, 400
Compatible ends, 143, 335
Complexity plot, 218
Configure n
156. tabs and opening selection in new view. We will be working with the protein sequence NP_058652, located in the Protein folder under Sequences. Double-click the sequence in the Navigation Area to open it. The sequence is displayed with annotations above it (see figure 2.3).
[Figure: CLC Combined Workbench 3.0 main window with the Menu Bar, Toolbar, Navigation Area, Toolbox, and protein sequence NP_058652 (with annotation Hbb-b2) opened in a view next to the Sequence layout preferences in the Side Panel.]
Figure 2.3: Protein sequence NP_058652 opened in a view.
As def
157. the enzymes considered should have an exact match or not. Since a number of restriction enzymes have ambiguous cut patterns, there will be variations in the resulting overhangs. When choosing All matches, you cannot be 100% sure that the overhang will match, and you will need to inspect the sequence further afterwards. We advise trying Exact match first, and using All matches as an alternative if a satisfactory result cannot be achieved. At the bottom of the dialog, the list of enzymes producing compatible overhangs is shown. Use the arrows to add the enzymes which will be displayed on the sequence when you press Finish. When you have added the relevant enzymes, click Finish, and the enzymes will be added to the Side Panel and their cut sites displayed on the sequence.
19.2.2 Restriction site analysis from the Toolbox
Besides the dynamic restriction sites, you can do a more elaborate restriction map analysis with more output formats using the Toolbox:
Toolbox | Cloning and Restriction Sites | Restriction Site Analysis
This will display the dialog shown in figure 19.22. If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.
[Figure: Restriction Map Analysis dialog, step 1: Select DNA/RNA sequence(s).]
158. the lane. In a real electrophoresis experiment, this property will be determined by several factors, including time of separation, voltage and gel density. You can also choose how many lanes should be displayed:
• Sequences in separate lanes. This simulates that a gel is run for each sequence.
• All sequences in one lane. This simulates that one gel is run for all sequences.
You can also modify the layout of the view by zooming in or out. Click Zoom in or Zoom out in the Toolbar and click the view. Finally, you can modify the format of the text heading each lane in the Text format preferences in the Side Panel.
19.4 Restriction enzyme lists
CLC Combined Workbench includes all the restriction enzymes available in the REBASE database. However, when performing restriction site analyses, it is often an advantage to use a customized list of enzymes. In this case, the user can create special lists containing e.g. all enzymes available in the laboratory freezer, all enzymes used to create a given restriction map, or all enzymes that are available from the preferred vendor. In the example data (see section 1.6.2), under Nucleotide > Restriction analysis, there are two enzyme lists: one with the 50 most popular enzymes, and another with all enzymes that are included in the CLC Combined Workbench. This section describes how you can create an enzyme list and how you can modify it.
19.4.1 Create enzyme list
CLC Combined Workbench uses e
159. the most recently used. The All fields option allows searches in all parameters in the NCBI database at the same time. All fields also provides an opportunity to restrict a search to parameters which are not listed in the dialog. E.g. writing gene[Feature key] AND mouse in All fields generates hits in the GenBank database which contain one or more genes and where mouse appears somewhere in the GenBank file. NB: the Feature key option is only available in GenBank when searching for nucleotide sequences. For more information about how to use this syntax, see http://www.ncbi.nlm.nih.gov/entrez/query/static/help/helpdoc.html#Writing_Advanced_Search_Statements
When you are satisfied with the parameters you have entered, click Start search.
Note! When conducting a search, no files are downloaded. Instead, the program produces a list of links to the files in the NCBI database. This ensures a much faster search.
11.1.2 Handling of GenBank search results
The search result is presented as a list of links to the files in the NCBI database. The View displays 50 hits at a time. This can be changed in the Preferences (see chapter 5). More hits can be displayed by clicking the More... button at the bottom right of the View. Each sequence hit is represented by text in three columns:
• Accession
• Description
• Modification date
It is possible to exclude one or more of these columns by adjusting the View preferences for the database search view. Further
160. the number of identical residues or nucleotides in the obtained alignment.
• Gaps. This number shows whether the alignment has gaps or not.
• Strand. This is only valid for nucleotide sequences and shows the direction of the aligned strands. Minus indicates a complementary strand.
• Query. This is the sequence (or part of the sequence) which you have used for the BLAST search.
• Sbjct (subject). This is the sequence found in the database.
The numbers of the query and subject sequences refer to the sequence positions in the submitted and found sequences. If the subject sequence has number 59 in front of the sequence, this means that 58 residues are found upstream of this position, but these are not included in the alignment. By right-clicking the sequence name in the Graphical BLAST output, it is possible to download the full hit sequence from NCBI with accompanying annotations and information. It is also possible to just open the actual hit sequence in a new view.
12.3.3 BLAST table
In addition to the graphical display of a BLAST result, it is possible to view the BLAST results in a tabular view. The tabular view gives a quick overview of the results. Here you can also select multiple sequences and download or open all of these in one single step. Moreover, there is a link from each sequence to the sequence at NCBI. These possibilities are either available through a right click with the m
161. the saved settings to delete.
• Apply Saved Settings. This is a submenu containing the settings that you have previously saved. By clicking one of the settings, they will be applied to the current view. You will also see a number of pre-defined view settings in this submenu. They are meant to be examples of how to use the Side Panel and provide quick ways of adjusting the view to common usages. At the bottom of the list of settings you will see CLC Standard Settings, which represent the way the program was set up when you first launched it.
Figure 5.5: At the top of the Side Panel you can: Expand all groups, Collapse all preferences, Dock/Undock preferences, Help, and Save/Restore preferences.
[Figure: Save Settings dialog asking for a name for the user settings, with an Always apply these settings checkbox.]
Figure 5.6: The save settings dialog.
[Figure: Side Panel menu with Save Settings, Delete Settings and Apply Saved Settings; the submenu lists Compact, Non-compact no wrap, Non-compact with translations, Rasmol colors, Show translation and CLC Standard Settings.]
Figure 5.7: Applying saved settings.
The settings are specific to the type of view. Hence, when you save settings of a circular view, they will not b
162. words between the query sequence and the hit sequence(s). Only regions with a word hit will be used to build on an alignment.
[Figure: a query sequence with one query word of size W = 3 highlighted.]
Figure 12.21: Generation of exact BLAST words with a word size of W = 3.
BLAST will start out by making words for the entire query sequence (see figure 12.21). For each word in the query sequence, a compilation of neighborhood words which exceed the threshold of T is also generated. A neighborhood word is a word obtaining a score of at least T when compared using a selected scoring matrix (see figure 12.22). The default scoring matrix for blastp is BLOSUM62 (for an explanation of scoring matrices, see www.clcbio.com/be). The compilation of exact words and neighborhood words is then used to match against the database sequences.
[Figure: the query word and its neighborhood words listed with their scores from the BLOSUM62 matrix, down to the threshold for neighborhood words, T = 13.]
Figure 12.22: Neighborhood BLAST words based on the BLOSUM62 matrix. Only words where the threshold T exceeds 13 are included in the initial seeding.
After the initial finding of words (seeding), the BLAST algorithm will extend the (only 3 residues long) alignment in both directions (see figure 12.23). Each time the alig
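The seeding step described above (exact words of size W, then neighborhood words scoring at least T) can be sketched as follows. This is a toy illustration and not the BLAST implementation: the function names are hypothetical, the scoring function is a placeholder rather than BLOSUM62, and the four-letter alphabet and threshold are chosen only to keep the example small.

```python
from itertools import product


def exact_words(query, w=3):
    """Return every overlapping word of length w with its 0-based start."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


def neighborhood_words(word, score, alphabet, threshold):
    """Return (candidate, total) pairs whose score against `word` is >= threshold."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append(("".join(candidate), total))
    # Best-scoring words first, ties broken alphabetically.
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


def toy_score(a, b):
    """Placeholder scoring, NOT BLOSUM62: +5 for identity, -1 otherwise."""
    return 5 if a == b else -1


words = exact_words("PQGQR", w=3)
neighbors = neighborhood_words("PQG", toy_score, "PQGR", threshold=9)
```

In a real search the score function would look up a substitution matrix such as BLOSUM62 and the alphabet would contain all 20 amino acids; the enumeration and the thresholding work the same way.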
163. you change the Sequence name in the Sequence Layout view preferences, you will have to ask the program to sort the sequences again. The sequences can also be sorted by similarity, grouping similar sequences together:
Right-click the name of a sequence | Sort Sequences by Similarity
20.3.6 Delete, rename and add sequences
Sequences can be removed from the alignment by right-clicking the label of a sequence:
right-click label | Delete Sequence
This can be undone by clicking Undo in the Toolbar. A sequence can also be renamed:
right-click label | Rename Sequence
This will show a dialog letting you rename the sequence. This will not affect the sequence that the alignment is based on. Extra sequences can be added to the alignment by creating a new alignment where you select the current alignment and the extra sequences (see section 20.1). The same procedure can be used for joining two alignments.
20.3.7 Realign selection
If you have created an alignment, it is possible to realign a part of it, leaving the rest of the alignment unchanged:
select a part of the alignment to realign | right-click the selection | Realign selection
This will open Step 2 in the Create alignment dialog, allowing you to set the parameters for the realignment (see section 20.1). It is possible for an alignment to become shorter or longer as a result of the realignment of a region. This is because gaps may have
164. zoom and keys will also zoom on all sequences.
19.1.4 Manipulate sequences
All manipulations of sequences are done manually, giving you full control over how the sequence is constructed. Manipulations are performed through right-click menus, which have three different appearances depending on where you click, as visualized in figure 19.3:
• Right-click the sequence name (to the left) to manipulate the whole sequence.
• Right-click a selection to manipulate the selection.
• Right-click a restriction site to use this specific restriction site or this restriction enzyme for manipulation.
[Figure: sequence pBR322 with the restriction site, sequence label and selection circled.]
Figure 19.3: The red circles mark the three places you can use for manipulating the sequences.
The three menus are described in the following.
Manipulate the whole sequence
Right-clicking the sequence name at the left side of the view reveals several options on sorting, opening and editing the sequences in the view (see figure 19.4).
[Figure: right-click menu with Open Sequence in Circular View, Duplicate Sequence, Reverse Complement Sequence, Digest and Create Restriction Map, Rename Sequence, Select Sequence, Delete Sequence, Open Copy of Sequence in New View, Open This Sequence in New View, Make Sequence Linear and Sort Sequence List by Name.]
165. [Figure: protein charge plotted against pH (0 to 14) for CAA24102.]
Figure 16.7: View of the protein charge.
Graph preferences
The Graph preferences apply to the whole graph:
• Lock axis. This will always show the axis, even though the plot is zoomed to a detailed level.
• Frame. Toggles the frame of the graph.
• X-axis at zero. Toggles the x-axis at zero.
• Y-axis at zero. Toggles the y-axis at zero.
• Tick type: outside, inside.
• Tick lines at. Shows a grid behind the graph: none, major ticks.
• Show as histogram. For some data series, it is possible to show the data as a histogram rather than a line plot.
Preferences for each protein
Underneath the Graph preferences, you will find a set of preferences for each protein in the graph. These preferences only apply to the curve for the specific protein:
• Dot type: none, cross, plus, square, diamond, circle, triangle, reverse triangle, dot.
• Dot color. Allows you to choose between many different colors.
• Line width: thin, medium, wide.
• Line type: none, line, long dash, short dash.
• Line color. Allows you to choose between many different colors.
These settings will apply to both the curve and the legend.
Modifying labels and legends
Click the title of the graph, the axis titles or the legend to edit the text.
16.3 Transmembr
166. 0 4
• Length search (length:[START TO END]). Search for sequences of a specific length. E.g. search for sequences between 1000 and 2000 residues: length:[1000 TO 2000]
If you do not use this special syntax, you will automatically search for both name, description, organism etc., and search terms will be combined as if you had put OR between them.
4.2.3 Quick search history
You can access the 10 most recent searches by clicking the icon next to the search field (see figure 4.5).
[Figure: drop-down list of recent quick searches, including a length search from 100 to 150.]
Figure 4.5: Recent searches.
Clicking one of the recent searches will conduct the search again.
4.3 Advanced search
As a supplement to the Quick search described in the previous section, you can use the more advanced search:
Search | Local Search, or Ctrl + F (⌘ + F on Mac)
This will open the search view as shown in figure 4.6. The first thing you can choose is which location should be searched. All the active locations are shown in this list. You can also choose to search all locations. Read more about locations in section 3.1.1. Furthermore, you can specify what kind of elements should be searched:
• All sequences
• Nucleotide sequences
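The two search behaviours described above, a length range and plain terms combined with OR across several fields, can be emulated on a small in-memory list. This is only a sketch of the semantics as the text states them, not the workbench's search index: the Element class, the function names and the sample data are hypothetical, and both the inclusive bounds and the case-insensitive substring matching are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical stand-in for a stored sequence and its searchable fields."""
    name: str
    description: str
    organism: str
    length: int


def length_between(elements, start, end):
    """Emulate a length range search; bounds assumed to be inclusive."""
    return [e for e in elements if start <= e.length <= end]


def quick_search(elements, terms):
    """OR-combine plain terms across name, description and organism."""
    wanted = [t.lower() for t in terms]

    def matches(element):
        haystack = " ".join(
            (element.name, element.description, element.organism)
        ).lower()
        return any(term in haystack for term in wanted)

    return [e for e in elements if matches(e)]


# Made-up records, for illustration only.
DATA = [
    Element("seqA", "beta globin region", "Homo sapiens", 5200),
    Element("seqB", "insulin mRNA", "Mus musculus", 1500),
    Element("seqC", "packaging signal", "Coronavirus", 120),
]

by_length = length_between(DATA, 1000, 2000)
by_terms = quick_search(DATA, ["globin", "coronavirus"])
```

Because the terms are OR-combined, adding more terms can only widen the result list, which matches the description of a search without the special syntax.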
167. [Figure: Create Alignment dialog, step 1 (select sequences of same type), with three protein sequences listed under Selected Elements.]
Figure 20.1: Creating an alignment.
If you have selected some elements before choosing the Toolbox action, they are now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences, sequence lists or alignments from the selected elements. Click Next to adjust alignment algorithm parameters. Clicking Next opens the dialog shown in figure 20.2.
[Figure: Create Alignment dialog, step 2 (set parameters). Gap settings: Gap open cost 10, Gap extension cost 1, End gap cost As any other. Alignment: Fast (less accurate) or Slow (very accurate).]
Figure 20.2: Adjusting alignment algorithm parameters.
20.1.1 Gap costs
The alignment algorithm has three parameters concerning gap costs: Gap open cost, Gap extension cost and End gap cost. These parameters can be specified to one decimal place.
• Gap open cost. The price for introducing gaps in an alignment.
• Gap extension cost. The price for every extension past the ini
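To see how the gap open and gap extension costs interact, here is a minimal sketch of an affine gap cost using the dialog's default values (open 10, extension 1). The function name is hypothetical, and the formula assumes that the first gap position pays only the open cost and each further position pays the extension cost, which is how the two descriptions above read; the alignment algorithm itself may account for gaps differently.

```python
def gap_cost(length, open_cost=10.0, extension_cost=1.0):
    """Cost of one contiguous gap of `length` positions (assumed affine model)."""
    if length <= 0:
        return 0.0
    # The first position opens the gap; every further position extends it.
    return open_cost + extension_cost * (length - 1)


one_long_gap = gap_cost(4)                  # a single gap of four positions
two_short_gaps = gap_cost(2) + gap_cost(2)  # the same four positions, split
```

Under this model one gap of four positions is cheaper than two gaps of two positions, so a high open cost relative to the extension cost favours fewer, longer gaps.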
168. 2.10 Tutorial: Proteolytic cleavage detection
2.11 Tutorial: Primer design
2.12 Tutorial: Assembly
2.13 Tutorial: In silico cloning
2.14 Tutorial: Folding RNA molecules
Core Functionalities
User interface
3.1 Navigation Area
3.2 View Area
3.3 Zoom and selection in View Area
3.4 Toolbox and Status Bar
3.5 Workspace
3.6 List of shortcuts
Searching your data
4.1 What kind of information can be searched
4.2 Quick search
4.3 Advanced search
4.4 Search index
User preferences and settings
5.1 General preferences
5.2 Default View preferences
5.3 Advanced preferences
5.4 Export/import of preferences
5.5 View settings for the Side Panel
Printing
6.1 Selecting which part of the view to print
6.2 Page Setup
6.3 Print preview
Import/export of data and graphics
7.1 Bioinformatic data formats
7.2 External files
7.3
169. 11.3.2 Handling of NCBI structure search results
The search result is presented as a list of links to the files in the NCBI database. The View displays 50 hits at a time (this can be changed in the Preferences, see chapter 5). More hits can be displayed by clicking the More... button at the bottom right of the View. Each structure hit is represented by text in the following columns:
• Accession
• Description
• Resolution
• Method
• Protein chains
• Release date
It is possible to exclude one or more of these columns by adjusting the View preferences for the database search view. Furthermore, your changes in the View preferences can be saved (see section 5.5). Several structures can be selected, and by clicking the buttons at the bottom of the search view, you can do the following:
• Download and open. Download and open immediately.
• Download and save. Lets you choose a location for saving the structure.
• Open at NCBI. Open additional information on the selected structure at NCBI's web page.
Double-clicking a hit will download and open the structure. The hits can also be copied into the View Area or the Navigation Area from the search results by drag and drop, copy/paste, or by using the right-click menu, as described below.
Drag and drop from structure search results
The structures from the search results can be opened by dragging them into a position in the View Area.
Note! A struc
170. Figure 22.17: A split view showing a structure table to the right and the secondary structure 2D view to the left.
[Figure: secondary structure view of AB009835 next to its structure table (11 rows; columns Name, Created, ΔG). The selected structure is expanded into its elements, including a multi loop, a dangling nucleotide and stem base pairs, each with its ΔG contribution.]
Figure 22.18: Now the Stem with bifurcation node has been selected in the table, and a corresponding selection has been made in the view of the secondary structure to the left.
The correspondence between the table and the structure editor makes it easy to inspect the thermodynamic details of the structure while keeping a visual overview, as shown in the above figures.
Handling multiple structures
The table to the left offers a number of t
171. 15.2 Convert RNA to DNA
15.3 Reverse complements of sequences
15.4 Translation of DNA or RNA to protein
15.4.1 Translate part of a nucleotide sequence
15.5 Find open reading frames
15.5.1 Open reading frame parameters
CLC Combined Workbench offers different kinds of sequence analyses which only apply to DNA and RNA.
15.1 Convert DNA to RNA
CLC Combined Workbench lets you convert a DNA sequence into RNA, substituting the T residues (thymine) with U residues (uracil):
select a DNA sequence in the Navigation Area | Toolbox in the Menu Bar | Nucleotide Analyses | Convert DNA to RNA
or right-click a sequence in the Navigation Area | Toolbox | Nucleotide Analyses | Convert DNA to RNA
This opens the dialog displayed in figure 15.1. If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.
Note: You can select multiple DNA sequences and sequence lists at a time. If the sequence list contain
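The substitution described in section 15.1 can be illustrated outside the workbench with a few lines of Python. This is only a minimal sketch: the function name dna_to_rna is made up for this example and is not part of CLC Combined Workbench, and it assumes the input contains plain sequence letters.

```python
def dna_to_rna(dna: str) -> str:
    """Return the RNA form of a DNA sequence by replacing T with U.

    Hypothetical helper for illustration; upper- and lower-case letters
    are both handled, and all other characters are left unchanged.
    """
    return dna.translate(str.maketrans("Tt", "Uu"))


print(dna_to_rna("ATGCTTAG"))
```

Characters other than T are passed through untouched, so ambiguity codes and gap characters survive the conversion.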
172. Figure 13.1: It is possible to open a structure file directly from the output of a conducted BLAST search by clicking the Open Structure button.
13.2 Viewing structure files
The usual view area is used to display the actual structure. See figure 13.2 for an example of the structure view. At the bottom of the view area you will find a table displaying the polymer subunits of the structure, along with additional compounds and, in some cases, water molecules. It is possible to copy polymer sequence information to the navigator area for further sequence analysis by the integrated workbench tools. To view the contents of a polymer subunit, right-click on t
173. Table 14.1: The BLOSUM62 matrix. A tabular view of the BLOSUM62 matrix containing all possible substitution scores [Henikoff and Henikoff, 1992].
Figure 14.13: Relationship between scoring matrices.
The BLOSUM62 has become a de facto standard scoring matrix for a wide range of alignment programs. It is the default matrix in BLAST.
Other useful resources
Calculate your own PAM matrix: http://www.bioinformatics.nl/tools/pam.html
BLOCKS database: http://blocks.fhcrc.org/
NCBI help site: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Scoring2.html
Creative Commons License
All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display, and use the work for educational purposes, under the following conditions: You must attribute the work in its original form and "CLC bio" has to be clearly labeled as
174. 12.3.2 BLAST graphics
12.3.3 BLAST table
12.4 Create Local BLAST Database
12.4.1 Import of BLAST databases
12.5 SNP annotation using BLAST
12.5.1 SNP annotation search parameters
12.5.2 Result of SNP annotation
12.5.3 Bioinformatics explained: Single Nucleotide Polymorphisms (SNPs)
12.6 Bioinformatics explained: BLAST
12.6.1 Examples of BLAST usage
12.6.2 Searching for homology
12.6.3 How does BLAST work?
12.6.4 Which BLAST program should I use?
12.6.5 Which BLAST options should I change?
12.6.6 Explanation of the BLAST output
12.6.7 I want to BLAST against my own sequence database, is this possible?
12.6.8 What you cannot get out of BLAST
12.6.9 Other useful resources
CLC Combined Workbench offers to conduct BLAST searches on protein and DNA sequences. In short, a BLAST search identifies homologous sequences by searching one or more databases hosted by NCBI (http://www.ncbi.nlm.nih.gov/) on your query sequence [McGinnis and Madden, 2004]. BLAST (Basic Local Alignme
176. In order to understand protein function, it is often valuable to see the actual three-dimensional structure of the protein. This is of course only possible if the structure of the protein has been resolved and published. CLC Combined Workbench has an integrated viewer of structure files. Structure files are usually deposited at the Protein DataBank (PDB, www.rcsb.org), where protein structure files can be searched and downloaded.
13.1 Importing structure files
There are different ways to import three-dimensional structure files for viewing. The supported file formats are PDB and mmCIF, both of which can be downloaded from the Protein DataBank (http://www.rcsb.org) and imported through the import menu (see section 7.1.1).
Another way to import structure files is through a direct search in the GenBank structure database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure). Read more about searching for structures in section 11.3.
It is also possible to make a BLAST search against the PDB database. In the latter case, structure files can be downloaded directly to the navigation area by clicking the Open structure button below all the BLAST hits. Downloading structure files from a conducted BLAST search is only possible if the results are shown in a BLAST table. See figure 13.1. How to conduct a BLAST search is described in section 12.1.
177. Figure 2.26: Placement of translated nucleotide sequence hits on the Human beta globin.
Why did we find, on the protein level, three identical regions between our query protein sequence and the nucleotide database? The beta globin gene is known to have three exons, and this is exactly what we find in the BLAST search. Each translated exon will hit the corresponding sequence on the chromosome.
In the table you can also see the Hit start and Hit end positions. These are the corresponding posi
178. Figure 10.1: Showing restriction sites of ten restriction enzymes.
Figure 10.2: Buttons to sort restriction enzymes.
• Sort enzymes by overhang. This will divide the enzymes into three groups:
  Blunt: Enzymes cutting both strands at the same position.
  3': Enzymes producing an overhang at the 3' end.
  5': Enzymes producing an overhang at the 5' end.
There is a checkbox for each group which can be used to hide/show all the enzymes in a group.
Manage enzymes
The list of restriction enzymes contains by default 20 of the most popular enzymes, but you can easily modify this list and add more enzymes by clicking the Manage enzymes button. This will display the dialog shown in figure 19.14.
At the top, you can choose to Use existing enzyme list. Clicking this option lets you select an enzyme list which is stored in the Navigation Area. See section 19.4 for more about creating and modifying enzyme lists.
Below there are two panels
Figure 10.3: Adding or removing enzymes from the Side Pan
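The three overhang groups used for sorting can be told apart from the two cut positions alone. The sketch below is an illustration, not the workbench's implementation: the function name classify_overhang and the convention of measuring both cuts in bases from the 5' end of the recognition site on the top strand are assumptions made for this example. The cut positions used are the commonly documented ones for EcoRI (G^AATTC), PstI (CTGCA^G) and EcoRV (GAT^ATC).

```python
def classify_overhang(top_cut: int, bottom_cut: int) -> str:
    """Classify a restriction cut as "Blunt", "5'" or "3'".

    Both positions count bases from the 5' end of the recognition site on
    the top strand (an assumed convention for this sketch):
      top_cut    - where the top strand is cut
      bottom_cut - where the bottom strand is cut, in the same coordinates
    """
    if top_cut == bottom_cut:
        return "Blunt"  # both strands are cut at the same position
    # If the top strand is cut closer to its 5' end, the single-stranded
    # ends that remain carry a free 5' end; otherwise a free 3' end.
    return "5'" if top_cut < bottom_cut else "3'"


# For a palindromic site of length n, bottom_cut equals n - top_cut.
examples = {
    "EcoRI": (1, 5),  # G^AATTC
    "PstI": (5, 1),   # CTGCA^G
    "EcoRV": (3, 3),  # GAT^ATC
}
for name, (top, bottom) in examples.items():
    print(name, classify_overhang(top, bottom))
```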
179. Figure 16.20: A protein report. There is a Table of Contents in the Side Panel that makes it easy to browse the report.
By double-clicking a graph in the output, this graph is shown in a different view (CLC Combined Workbench generates another tab). The report output and the new graph views can be saved by dragging the tab into the Navigation Area.
The content of the tables in the report can be copy/pasted out of the program, e.g. into Microsoft Excel. To do so:
Select content of table | Right-click the selection | Copy
16.9 Reverse translation from protein into DNA
A protein sequence can be back-translated into DNA using CLC Combined Workbench. Due to the degeneracy of the genetic code, most amino acids can be encoded by several different codons (there are only 20 amino acids but 64 different codons). Thus, the program offers a number of choices for determining which codons should be used. These choices are explained in this section.
In order to make a reverse translation:
Select a protein sequence | Toolbox in the Menu Bar | Protein Analyses | Reverse Translate
or right-click a protein sequence | Toolbox | Protein Analyses | Reverse translate
This opens the dialog display
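To make the degeneracy argument concrete, the following sketch counts how many different DNA sequences could encode a given protein under the standard genetic code. It is not how the workbench chooses codons; the names CODON_TABLE, CODONS_PER_AA and count_reverse_translations are invented for this illustration. Note that 3 of the 64 codons are stop codons, so 61 codons are shared among the 20 amino acids.

```python
from collections import Counter
from itertools import product
from math import prod

# Standard genetic code, codons enumerated with bases in the order T, C, A, G
# (first base varies slowest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
# Number of codons that encode each amino acid (or stop).
CODONS_PER_AA = Counter(CODON_TABLE.values())


def count_reverse_translations(protein: str) -> int:
    """Return the number of DNA sequences that translate to `protein`.

    Letters that are not in the standard code contribute a factor of 0.
    """
    return prod(CODONS_PER_AA[aa] for aa in protein.upper())


print(CODONS_PER_AA["M"], CODONS_PER_AA["L"])   # codons for Met and Leu
print(count_reverse_translations("MLS"))        # 1 * 6 * 6 possibilities
```

The count grows multiplicatively with sequence length, which is why a reverse translation tool has to apply some rule for picking one codon per residue.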
180. Figure 3.7: A View Area can enclose several views; each view is indicated with a tab (see right view, which shows protein P68225). Furthermore, several views can be shown at the same time (in this example, four views are displayed).
view. The view that was already open can be brought to front by clicking its tab.
Note: If you right-click an open tab of any element, click Show, and then choose a different view of the same element, this new view is automatically opened in a split view, allowing you to see both views.
See section 3.1.5 for instructions on how to open a view using drag and drop.
3.2.2 Show element in another view
Each element can be shown in different ways. A sequence, for example, can be shown as linear, circular, text, etc.
In the following example, you want to see a sequence in a circular view. If the sequence is already open in a view, you can change the view to a circular view:
Click Show As Circular at the lower left part of the view
The buttons used for switching views are shown in figure 3.8.
Figure 3.8: The buttons shown at the bottom of a view of a nucleotide sequence. You can click the buttons to change the view to e.g. a circular view or a history view.
If the s
183. Figure 19.5: Right-click on a sequence selection in the cloning view.
• Replace Selection with Sequence. This will replace the selected region with a sequence. The sequence to be inserted can be selected from a list containing all sequences in the cloning editor.
• Insert Sequence before Selection. Insert a sequence before the selected region. The sequence to be inserted can be selected from a list containing all sequences in the cloning editor.
• Insert Sequence after Selection. Insert a sequence after the selected region. The sequence to be inserted can be selected from a list containing all sequences in the cloning editor.
• Cut Sequence before Selection. This will cleave the sequence before the selection and will result in two smaller fragments.
• Cut Sequence after Selection. This will cleave the sequence after t
184. Figure 2.18: Showing restriction sites of ten restriction enzymes.
The restriction sites are shown on the sequence with an indication of cut site and recognition sequence. In the list of enzymes in the Side Panel, the number of cut sites is shown in parentheses for each enzyme (e.g. EcoRV cuts three times). If you wish to see the recognition sequence of the enzyme, place your mouse cursor on the enzyme in the list for a short moment, and a tool tip will appear.
You can add or remove enzymes from the list by clicking the Edit enzymes button. However, there is a very smart way of adding enzymes:
make a selection on the sequence | right-click the selection | Show Enzymes only Cutting Selection
This will show a dialog where you can specify criteria for the enzymes to be added to the list in the Side Panel. When you click OK, the selection will be scanned for restriction sites according to the settings in the dialog, and the relevant enzymes will be added to the list in the Side Panel.
185. Appendix F. Note that conflicts will always be highlighted no matter which of the options you choose. Furthermore, each conflict will be marked as an annotation on the contig sequence and will be present if the contig sequence is extracted for further analysis. As a result, the details of any experimental heterogeneity can be maintained and used when the results of single-sequence analyses are interpreted. Read more about conflicts in section 18.6.4.
• Create full contigs, including trace data. This will create a contig where all the aligned reads are displayed below the contig sequence. (You can always extract the contig sequence without the reads later on.) For more information on how to use the contigs that are created, see section 18.6.
• Show tabular view of contigs. A contig can be shown in both a graphical and a tabular view. If you select this option, a tabular view of the contig will also be opened. (Even if you do not select this option, you can show the tabular view of the contig later on by clicking Table at the bottom of the view.) For more information about the tabular view of contigs, see section 18.6.6.
• Create only consensus sequences. This will not display a contig but will only output the assembled contig sequences as single nucleotide sequences. If you choose this option, it is not possible to validate the assembly process and edit the contig based on the traces.
If you have chosen to Trim sequences, click Next and you w
186. BI describing the sequence.
• E-value: Measure of quality of the match. Higher E-values indicate that BLAST found a less homologous sequence.
• Score: Shows the score of the local alignment generated through the BLAST search.
• Bit score: Shows the bit score of the local alignment generated through the BLAST search. Bit scores are normalized, which means that the bit scores from different alignments can be compared, even if different scoring matrices have been used.
• Hit start: Shows the start position in the hit sequence.
• Hit end: Shows the end position in the hit sequence.
• Hit length: The length of the hit.
• Query start: Shows the start position in the query sequence.
• Query end: Shows the end position in the query sequence.
• Identity: Shows the number of identical residues in the query and hit sequence.
• %Identity: Shows the percentage of identical residues in the query and hit sequence.
• Positive: Shows the number of similar but not necessarily identical residues in the query and hit sequence.
In the BLAST table view you can handle the hit sequences. Select one or more sequences from the table, and apply one of the following functions:
• NCBI: Opens the corresponding sequence(s) at GenBank at NCBI. Additional information regarding the selected sequence(s) is stored here. The default Internet browser is used for this purpose.
• Open sequence: Opens the selected sequence(s) in
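The relationship between the two identity columns can be shown with a short sketch. It assumes the usual BLAST convention that the percentage is the number of identical positions divided by the alignment length, gap columns included; the function name identity_stats and the example alignment are made up for illustration and are not taken from the workbench.

```python
def identity_stats(query_aln: str, hit_aln: str) -> tuple[int, float]:
    """Return (identical residues, percent identity) for one local alignment.

    Both strings are rows of the same alignment, with "-" for gaps, so
    they must have equal length. A column counts as identical only when
    both rows hold the same residue and it is not a gap.
    """
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned sequences must have the same length")
    if not query_aln:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for q, h in zip(query_aln, hit_aln) if q == h and q != "-"
    )
    percent = 100.0 * identical / len(query_aln)
    return identical, percent


count, percent = identity_stats("ACGT-A", "ACCTTA")
print(count, round(percent, 1))
```

Because the denominator is the alignment length rather than the full query length, a short, perfect local hit can show a high percentage even when most of the query is not covered; the Query start and Query end columns give that context.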
187. 3.5 Workspace
3.5.1 Create Workspace
3.5.2 Select Workspace
3.5.3 Delete Workspace
3.6 List of shortcuts
This chapter provides an overview of the different areas in the user interface of CLC Combined Workbench. As can be seen from figure 3.1, this includes a Navigation Area, View Area, Menu Bar, Toolbar, Status Bar and Toolbox.
[Figure 3.1: screenshot of the CLC Combined Workbench window with the Menu Bar, Toolbar, Navigation Area, View Area and Toolbox labeled]
188. Figure 16.10: The result of the antigenicity plot calculation and the associated Side Panel.
• Frame: Toggles the frame of the graph.
• X-axis at zero: Toggles the x-axis at zero.
• Y-axis at zero: Toggles the y-axis at zero.
• Tick type: outside, inside
• Tick lines at: Shows a grid behind the graph. Options: none, major ticks
• Show as histogram: For some data series it is possible to see it as a histogram rather than a line plot.
The preferences for the different scales are identical and include the following:
• Dot type: Lets you choose the marking of dots in the graph.
• Dot color: Lets y
189. Figure 17.6: Proposed primers.
The columns in the output table can be sorted by the information they contain. For example, the user can choose to sort the available primers by their score (default) or by their self-annealing score, simply by right-clicking the column header. The output table interacts with the accompanying primer editor such that when a proposed combination of primers and probes is selected in the table, the primers and probes in this solution are highlighted on the sequence.
17.4.1 Saving primers
Primer solutions in a table row can be saved by selecting the row and using the right-click mouse menu. This opens a dialog that allows the user to save the primers to the desired location. Primers and probes are saved as DNA sequences in the program. This means that all available DNA analyses can be performed on the saved primers, including BLAST. Furthermore, the primers can be edited using the standard sequence viewer to introduce e.g. mutations and restriction sites.
17.4.2 Saving PCR fragments
The PCR fragment generated from the primer pair in a given table row can also be saved by selecting the row and using the right-click mouse m
190. Figure 14.4: Setting the dot plot parameters. (The dialog shows Score model: BLOSUM62 and Window size: 9.)
14.2.2 View dot plots
A view of a dot plot can be seen in figure 14.5. You can select Zoom in in the Toolbar and click the dot plot to zoom in to see the details of particular areas.
Figure 14.5: A view is opened showing the dot plot.
The Side Panel to the right lets you specify the dot plot preferences. The gradient color box can be adjusted to get the appropriate result by dragging the small pointers at the top of the box. Moving the slider from the right to the left lowers the thresholds, which can be seen directly in the dot plot, where more diagonal lines will emerge. You can also choose another color gradient by clicking on the gradient box and choosing from the list.
Adjusting the sliders above the gradient box is also practical when producing an output for printing. (Too much background color might not be desirable.) By crossing one slider over the other (the two sliders change side), the colors are inve
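The effect of the window size and of lowering the threshold can be explored with a small sketch. It is a simplification, not the workbench's algorithm: it scores a window by counting identical residues instead of summing BLOSUM62 scores, it anchors each window at its start position, and the names dot_plot and apply_threshold are hypothetical.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 9) -> list[list[int]]:
    """Return a matrix of window scores for two sequences.

    scores[i][j] is the number of identical residues when the window
    starting at seq_y[i] is compared with the window starting at seq_x[j].
    Identity counting stands in for a real scoring matrix here.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n_x = max(len(seq_x) - window + 1, 0)
    n_y = max(len(seq_y) - window + 1, 0)
    scores = []
    for i in range(n_y):
        row = []
        for j in range(n_x):
            matches = sum(
                1 for k in range(window) if seq_x[j + k] == seq_y[i + k]
            )
            row.append(matches)
        scores.append(row)
    return scores


def apply_threshold(scores: list[list[int]], minimum: int) -> list[list[bool]]:
    """Mark the cells that would be drawn for a given minimum score."""
    return [[score >= minimum for score in row] for row in scores]


scores = dot_plot("ACGT", "ACGT", window=2)
for row in apply_threshold(scores, minimum=2):
    print("".join("x" if cell else "." for cell in row))
```

Lowering `minimum` marks more cells, which mirrors the behavior described above where moving the slider to the left makes more diagonal lines emerge.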
191. Figure 2.19: This will add enzymes that cut this selection to the Side Panel.
2.7.2 The Toolbox way of finding restriction sites
Suppose you are working with sequence PERH3BC from the example data, and you wish to know which restriction enzymes will cut this sequence exactly once and create a 3' overhang. Do the following:
select the PERH3BC sequence from the Nucleotide folder under Sequences | Toolbox in the Menu Bar | Cloning and Restriction Sites | Restriction Site Analysis
Click Next to set parameters for the restriction map analysis.
In the next step, you write 3' into the filter to the left. Then you click in the list of enzymes to the left and press Ctrl + A (⌘ + A on Mac). Then click the Add button. The result should be like in figure 2.20.
192. CLC Combined Workbench User manual
User manual for CLC Combined Workbench 3.5 (Windows, Mac OS X and Linux)
December 6, 2007
CLC bio, Gustav Wieds Vej 10, DK-8000 Aarhus C, Denmark
Contents
Introduction
1 Introduction to CLC Combined Workbench
1.1 Contact information
1.2 Download and installation
1.3 System requirements
1.4 Licenses
1.5 About CLC Workbenches
1.6 When the program is installed: Getting started
1.7 Extending the workbench with plug-ins
1.8 Network configuration
1.9 The format of the user manual
2 Tutorials
2.1 Tutorial: Getting started
2.2 Tutorial: View sequence
2.3 Tutorial: Side Panel Settings
2.4 Tutorial: GenBank search and download
2.5 Tutorial: Align protein sequences
2.6 Tutorial: Create and modify a phylogenetic tree
2.7 Tutorial: Find restriction sites
2.8 Tutorial: BLAST search
2.9 Tutorial: Tips for specialized BLAST searches
2
193. CLC Combined Workbench are listed below.
Action | Windows/Linux | Mac OS X
Adjust selection | Shift + arrow keys | Shift + arrow keys
Change between tabs | Ctrl + Tab | Ctrl + Page Up/Down
Close | Ctrl + W | ⌘ + W
Close all views | Ctrl + Shift + W | ⌘ + Shift + W
Copy | Ctrl + C | ⌘ + C
Cut | Ctrl + X | ⌘ + X
Delete | Delete | Delete or ⌘ + Backspace
Exit | Alt + F4 | ⌘ + Q
Export | Ctrl + E | ⌘ + E
Export graphics | Ctrl + G | ⌘ + G
Find Next Inconsistency | Space or . | Space or .
Find Previous Inconsistency | , | ,
Help | F1 | F1
Import | Ctrl + I | ⌘ + I
Maximize/restore size of View | Ctrl + M | ⌘ + M
Move gaps in alignment | Ctrl + arrow keys | ⌘ + arrow keys
Navigate sequence views | left/right arrow keys | left/right arrow keys
New Folder | Ctrl + Shift + N | ⌘ + Shift + N
New Sequence | Ctrl + N | ⌘ + N
View | Ctrl + O | ⌘ + O
Paste | Ctrl + V | ⌘ + V
Print | Ctrl + P | ⌘ + P
Redo | Ctrl + Y | ⌘ + Y
Rename | F2 |
Save | Ctrl + S |
Search local data | Ctrl + F |
Search in an open sequence | Ctrl + Shift + F |
Search NCBI | Ctrl + B |
Search UniProt | Ctrl + Shift + U |
Select All | Ctrl + A |
Selection Mode | Ctrl + 2 |
User Preferences | Ctrl + K |
Split Horizontally | Ctrl + T |
Split Vertically | Ctrl + J |
Show/hide Preferences | Ctrl + U |
Undo | Ctrl + Z |
Zoom In Mode | Ctrl + plus |
Zoom In (without clicking) | plus |
Zoom Out Mode | Ctrl + minus |
Zoom Out (without clicking) | minus |
194. Combined Workbench can import "external" files, too. This means that all kinds of files can be imported and displayed in the Navigation Area, but the above-mentioned formats are the only ones whose contents can be shown in CLC Combined Workbench.
D.2 List of graphics data formats
Below is a list of formats for exporting graphics. All data displayed in a graphical format can be exported using these formats. Data represented in lists and tables can only be exported in .pdf format (see section 7.3 for further details).
Format | Suffix | Type
Portable Network Graphics | .png | bitmap
JPEG | .jpg | bitmap
Tagged Image File | .tif | bitmap
PostScript | .ps | vector graphics
Encapsulated PostScript | .eps | vector graphics
Portable Document Format | .pdf | vector graphics
Scalable Vector Graphics | .svg | vector graphics
Appendix E
IUPAC codes for amino acids
(Single-letter codes based on International Union of Pure and Applied Chemistry)
The information is gathered from: http://www.dna.affrc.go.jp/misc/MPsrch/InfoIUPAC.html
One-letter abbreviation | Three-letter abbreviation | Description
A | Ala | Alanine
R | Arg | Arginine
N | Asn | Asparagine
D | Asp | Aspartic acid
C | Cys | Cysteine
Q | Gln | Glutamine
E | Glu | Glutamic acid
G | Gly | Glycine
H | His | Histidine
I | Ile | Isoleucine
L | Leu | Leucine
K | Lys | Lysine
M | Met | Methionine
F | Phe | Phenylalanine
P | Pro | Proline
U | Sec | Selenocysteine
S | Ser | Serine
T | Thr
195. Commercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes, under the following conditions: You must attribute the work in its original form, and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work.

SOME RIGHTS RESERVED. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.

Chapter 21 Phylogenetic trees

Contents
21.1 Inferring phylogenetic trees
21.1.1 Phylogenetic tree parameters
21.1.2 Tree View Preferences
21.2 Bioinformatics explained: phylogenetics
21.2.1 The phylogenetic tree
21.2.2 Modern usage of phylogenies
21.2.3 Reconstructing phylogenies from molecular data
21.2.4 Interpreting phylogenies

CLC Combined Workbench offers different ways of inferring phylogenetic trees. The first part of this chapter will briefly explain the different ways of inferring trees in CLC Combined Workbench. The second part, "Bioinformatics explained", will give a more general introduction to the concept of phylogeny and the associated bioinformatics methods.

21.1 Inf
196. File type | Suffix | File format used for
ACE files | .ace | contigs
Phylip Alignment | .phy | alignments
GCG Alignment | .msf | alignments
Clustal Alignment | .aln | alignments
Newick | .nwk | trees
FASTA | .fsa/.fasta | sequences
GenBank | .gbk/.gb/.gp | sequences
GCG sequence | .gcg | sequences (only import)
PIR (NBRF) | .pir | sequences (only import)
Staden | .sdn | sequences (only import)
DNAstrider | .str/.strider | sequences
Swiss-Prot | .swp | protein sequences
Lasergene sequence | .pro | protein sequence (only import)
Lasergene sequence | .seq | nucleotide sequence (only import)
Embl | .embl | nucleotide sequences
Nexus | .nxs/.nexus | sequences, trees, alignments, and sequence lists
CLC | .clc | sequences, trees, alignments, reports, etc.
Text | .txt | all data in a textual format
CSV | .csv | tables, each cell separated with semicolons (only export)
ABI | .abi | trace files (only import)
AB1 | .ab1 | trace files (only import)
SCF2 | .scf | trace files (only import)
SCF3 | .scf | trace files (only import)
Phred | .phd | trace files (only import)
mmCIF | .cif | structure (only import)
PDB | .pdb | structure (only import)
BLAST Database | .phr/.nhr | BLAST database (import)
Vector NTI Database | | sequences (import of whole database)
VectorNTI archives | .ma4/.pa4/.oa4 | sequences (only import)
Gene Construction Kit | .gcc | sequences (only import)
RNA Structure | .ct/.col/.rnaml/.xml | RNA structures
Preferences | .cpf | CLC workbench preferences

Note: CLC
197. Figure 19.22: Choosing sequence PERH3BC for restriction map analysis.

Selecting, sorting and filtering enzymes

Clicking Next lets you define which enzymes to use as basis for finding restriction sites on the sequence. At the top, you can choose to Use existing enzyme list. Clicking this option lets you select an enzyme list which is stored in the Navigation Area. See section 19.4 for more about creating and modifying enzyme lists.

Below there are two panels:
• To the left, you see all the enzymes that are in the list selected above. If you have not chosen to use an existing enzyme list, this panel shows all the enzymes available.
• To the right, there is a list of the enzymes that will be used.

Select enzymes in the left side panel and add them to the right panel by double-clicking or clicking the Add button. If you e.g. wish to use EcoRV and BamHI, select these two enzymes and add them to the right side panel. If you wish to use all the enzymes in the list: click in the panel to the left | press Ctrl + A (⌘ + A on Mac) | Add.

The enzymes can be sorted by clicking the column headings, i.e. Name, Overhang, Methylation or Popularity. This is
198. Figure 22.28: A number of suboptimal structures have been predicted using CLC Combined Workbench and are listed at the top left. At the right hand side, the structural components of the selected structure (multi-loops, stems, stacked base pairs, hairpin loops and dangling ends, each with its energy contribution in kcal/mol) are listed in a hierarchical structure, and on the left hand side the structure is displayed.

CHAPTER 22. RNA STRUCTURE

22.5.2 Structure elements and their energy contribution

In this section
199. E.g. if you are studying a sequence, you can click anywhere in the sequence and hold the mouse button. By moving the mouse you move the sequence in the View.

3.3.6 Selection

The Selection mode is used for selecting in a View (selecting a part of a sequence, selecting nodes in a tree etc.). It is also used for moving e.g. branches in a tree or sequences in an alignment.

When you make a selection on a sequence or in an alignment, the location is shown in the bottom right corner of your workbench. E.g. '23^24' means that the selection is between two residues, '23' means that the residue at position 23 is selected, and finally '23..25' means that 23, 24 and 25 are selected. By holding Ctrl (⌘ on Mac) you can make multiple selections.

3.4 Toolbox and Status Bar

The Toolbox is placed in the left side of the user interface of CLC Combined Workbench below the Navigation Area. The Toolbox shows a Processes tab and a Toolbox tab.

3.4.1 Processes

By clicking the Processes tab, the Toolbox displays previous and running processes, e.g. an NCBI search or a calculation of an alignment. The running processes can be stopped, paused, and resumed. Active buttons are blue. If a process is terminated, the stop, pause and play buttons of the process in question are made gray. The terminated processes can be removed by: View | Remove Terminated Processes. Running and paused processes are not
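The location strings described above can be read mechanically. The following is a minimal sketch, assuming the three forms are written as `23^24` (between two residues), `23` (a single residue) and `23..25` (a range); the function name `parse_selection` is hypothetical and is not part of the workbench.

```python
import re


def parse_selection(text):
    """Interpret a selection location string.

    Returns a (kind, positions) tuple, where kind is one of
    "between", "single" or "range". The notation itself is an
    assumption made for this illustration.
    """
    text = text.strip()

    # "23^24": the selection lies between two adjacent residues.
    match = re.fullmatch(r"(\d+)\^(\d+)", text)
    if match:
        left, right = int(match.group(1)), int(match.group(2))
        if right != left + 1:
            raise ValueError("a between-residues selection needs adjacent positions")
        return ("between", [left, right])

    # "23..25": every residue from start to end is selected.
    match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
    if match:
        start, end = int(match.group(1)), int(match.group(2))
        if end < start:
            raise ValueError("range end lies before range start")
        return ("range", list(range(start, end + 1)))

    # "23": exactly one residue is selected.
    if re.fullmatch(r"\d+", text):
        return ("single", [int(text)])

    raise ValueError(f"unrecognized selection: {text!r}")
```

For example, `parse_selection("23..25")` yields the three positions 23, 24 and 25, matching the description of the range form.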
200. 10.7 Sequence Lists

The Sequence List shows a number of sequences in a tabular format, or it can show the sequences together in a normal sequence view. Having sequences in a sequence list can help organize sequence data. The sequence list may originate from an NCBI search (chapter 11.1). Moreover, if a multiple-sequence FASTA file is imported, it is possible to store the data in a sequence list. A Sequence List can also be generated using a dialog, which is described here: select two or more sequences | right-click the elements | New | Sequence List. This action opens a Sequence List dialog.

Figure 10.22: A Sequence List dialog.

The dialog allows you to select more sequences to include in the list, or to remove already chosen sequences from the list. Clicking Finish opens the sequence list. It can be saved by clicking Save or by dragging the tab of the view into the Navigation Area. Opening a Sequence list is done by: right-click the sequence list in the Navigation Area | Show | Graphical Sequence List OR Ta
201. Figure 20.5: The top figure shows the original alignment. In the bottom panel, a single sequence with four inserted X's is aligned to the original alignment. This introduces gaps in all sequences of the original alignment. All other positions in the original alignment are fixed.

This feature is useful if you wish to add extra sequences to an existing alignment, in which case you just select the alignment and the extra sequences and choose not to redo the alignment. It is also useful if you have created an alignment where the gaps are not placed correctly. In this case, you can realign the alignment with different gap cost parameters.

20.1.4 Fixpoints

With fixpoints, you can get full control over the alignment algorithm. The fixpoints are points on the sequences that are forced to align to each other. Fixpoints are added to sequences or alignments before clicking "Create alignment". To add a fixpoint, open the sequence or alignment and: select the region you want to use as a fixpoint | right-click the selection | Set alignment fixpoint here. This will add an annotation labeled "Fixpoint" to the sequence (see figure 20.6). Use this procedure to add fixpoints to the other sequence(s) that should be forced to align to each other. When you click "Create alignment" and go to Step 2, check Use fixpoints in orde
202. Figure 2.5: The resulting two views, which are split horizontally.

click Selection in Toolbar | select a part of the sequence | right-click the selected part of the sequence in the top view | Open Selection in New View

This opens a third display of sequence NP_058652, but only of the part which was selected. To make room for displaying the selection of the sequence (the most recent view), drag the tab of the view down next to the tab of the bottom view.

2.3 Tutorial: Side Panel Settings

This brief tutorial will show you how to use the Side Panel to change the way your sequences, alignments and other data are shown. You will also see how to save the changes that you made in the Side Panel. Open the protein alignment located under Protein | More data in the Example data. The initial view of the alignment has colored the residues according to the Rasmol color scheme, and the alignment is automatically wrapped to fit the width of the view (shown in figure 2.6).

Now we are going to modify how this alignment is displayed. For this, we use the settings in the Side Panel to the right. All the settings are organized into groups, which can be expanded or collapsed by clicking the name of the group. The first group is Sequence Layout, which is expanded by default. First, select No wrap in the Sequence Layout. This means that each sequence in the alignment
203. Figure 22.24: This hypothesis has a probability of 0.338, as shown in the annotation.

22.4 Structure Scanning Plot

In CLC Combined Workbench it is possible to scan larger sequences for the existence of local conserved RNA structures. The structure scanning approach is similar in spirit to the works of Workman and Krogh (1999) and Clote et al. (2005). The idea is that if natural selection is operating to maintain a stable local structure in a given region, then the minimum free energy of the region will be markedly lower than the minimum free energy found when the nucleotides of the subsequence are distributed in random order.

The algorithm works by sliding a window along the sequence. Within the window, the minimum free energy of the subsequence is calculated. To evaluate the significance of the local structure signal, its minimum free energy is compared to a background distribution of minimum free energies obtained from shuffled sequences, using Z-scores (Rivas and Eddy, 2000). The Z-score statistic corresponds to the number of standard deviations by which the minimum free energy of the original sequence deviates from the average energy of the shuffled sequences. For a given Z-score, the statistical significance is evaluated as the probability of observing a more extreme Z-score under the assumption that Z-scores are normally distributed (Rivas and Eddy, 2000).

22.4.1 Selecting sequences for
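The window-and-shuffle procedure described above can be sketched as follows. This is only an illustration under stated assumptions, not the workbench's implementation: the real minimum free energy requires an RNA folding algorithm, so the sketch takes a caller-supplied `energy_fn` as a stand-in, and the names `z_score`, `lower_tail_p` and `scan_windows` are hypothetical. "More extreme" is read here as the lower tail of the normal distribution, since stable structures have lower energies.

```python
import random
import statistics
from math import erfc, sqrt


def z_score(energy, background):
    """Standard deviations by which `energy` deviates from the background mean."""
    mean = statistics.mean(background)
    sd = statistics.stdev(background)  # sample standard deviation
    if sd == 0:
        # No spread in the background: treat as "no signal" (a design choice).
        return 0.0
    return (energy - mean) / sd


def lower_tail_p(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * erfc(-z / sqrt(2))


def scan_windows(sequence, window, step, energy_fn, n_shuffles=100, seed=0):
    """Slide a window along `sequence` and return (start, z) per window.

    `energy_fn` is a placeholder for a minimum free energy calculation;
    it must depend on nucleotide order for shuffling to be informative.
    """
    rng = random.Random(seed)
    results = []
    for start in range(0, len(sequence) - window + 1, step):
        subsequence = sequence[start:start + window]
        observed = energy_fn(subsequence)
        background = []
        for _ in range(n_shuffles):
            letters = list(subsequence)
            rng.shuffle(letters)  # same composition, random order
            background.append(energy_fn("".join(letters)))
        results.append((start, z_score(observed, background)))
    return results
```

A strongly negative Z-score for a window then suggests that the original order of nucleotides is more stable than expected from composition alone.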
204. Figure 12.17: The graphical and tabular view of the SNP BLAST.

CHAPTER 12. BLAST SEARCH

If the option of annotating with variation annotation was chosen, the query sequence in the BLAST search object is also annotated with the dbSNP hits which passed the set criteria. In the graphical editor, auxiliary information about the hits is shown in a tooltip when the mouse is hovered over a hit sequence. In addition to the BLAST statistics, this includes the length of the original dbSNP sequence, the variation position, the database build and the type of the variation. This information is also available from the tabular view of the BLAST search.

The variation annotations on the sequence

When sequences are annotated with variation information, as shown in figures 12.18 and 12.17, the type of variation is displayed in the name of the annotation. Furthermore, if genotype information is availab
205. Figure 12.20: Identification of single nucleotide polymorphisms. In this illustration, a C/T SNP is seen in position 986 of the sequence contig.

12.6 Bioinformatics explained: BLAST

BLAST (Basic Local Alignment Search Tool) has become the de facto standard in search and alignment tools (Altschul et al., 1990). The BLAST algorithm is still actively being developed, and the paper describing it is one of the most cited ever written in this field of biology. Many researchers use BLAST as an initial screening of their sequence data from the laboratory and to get an idea of what they are working on. BLAST is far from being basic, as the name indicates; it is a highly advanced algorithm which has become very popular due to availability, speed, and accuracy. In short, a BLAST search identifies homologous sequences by searching one or more databases, usually hosted by NCBI (http://www.ncbi.nlm.nih.gov/), on the query sequence of interest (McGinnis and Madden, 2004).

BLAST is an open source program and anyone can download and change the program code. This has also given rise to a number of BLAST derivatives; WU-BLAST is probably the most commonly used (Altschu
206. The advantage of conducting a local BLAST search is the speed, and that it is possible to BLAST very long sequences. To conduct a Local BLAST search: right-click the tab of an open sequence | Toolbox | BLAST Search | Local BLAST, or click an element in the Navigation Area | Toolbox | BLAST Search | Local BLAST. This opens the dialog seen in figure 12.5.

Figure 12.5: Choose one or more sequences to conduct a Local BLAST search.

Click Next. This opens the dialog seen in figure 12.6.

Figure 12.6: Choose a BLAST program and a local database to conduct BLAST search.

In Step 2 you can choose between different BLAST methods. See section 12.1 for information about these methods. In this step you can also choose which of your local BLAST databases you want to conduct the search in
207. CHAPTER 14. GENERAL SEQUENCE ANALYSES

thin, medium, wide
• Line type: none, line, long dash, short dash
• Line color: Allows you to choose between many different colors.

14.4 Sequence statistics

CLC Combined Workbench can produce an output with many relevant statistics for protein sequences. Some of the statistics are also relevant for DNA sequences. Therefore, this section deals with both types of statistics. The required steps for producing the statistics are the same. To create a statistic for the sequence, do the following: select sequence(s) | Toolbox in the Menu Bar | General Sequence Analyses | Create Sequence Statistics. This opens a dialog where you can alter your choice of sequences which you want to create statistics for. You can also add sequence lists.

Note! You cannot create statistics for DNA and protein sequences at the same time.

When the sequences are selected, click Next. This opens the dialog displayed in figure 14.15.

Figure 14.15: Setting parameters for the sequence statistics.

The dialog offers to adjust the following parameters:
208. Figure 10.6: Choosing enzymes to be considered.

to use an existing enzyme list, this panel shows all the enzymes available.
• To the right, there is a list of the enzymes that will be used.

Select enzymes in the left side panel and add them to the right panel by double-clicking or clicking the Add button. If you e.g. wish to use EcoRV and BamHI, select these two enzymes and add them to the right side panel. If you wish to use all the enzymes in the list: click in the panel to the left | press Ctrl + A (⌘ + A on Mac) | Add.

The enzymes can be sorted by clicking the column headings, i.e. Name, Overhang, Methylation or Popularity. This is particularly useful if you wish to use enzymes which produce e.g. a 3' overhang. In this case, you can sort the list by clicking the Overhang column heading, and all the enzymes producing 3' overhangs will be listed together for easy selection.

When looking for a specific enzyme, it is easier to use the Filter. If you wish to find e.g. HindIII sites, simply type HindIII into the filter, and the list of enzymes will shrink automatically to only include the HindIII enzyme. This can also be used to only show enzymes producing e.g. a 3' overhang, as shown in figure 19.33.

If you need more detailed information and filtering of the
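The sorting and filtering behaviour described above can be imitated on a small table of enzymes. The sketch below is an illustration only: the `Enzyme` record, the `ENZYMES` list and the two functions are hypothetical names, the list contains just a handful of well-known enzymes, and matching the filter text against both name and overhang is an assumption about how such a filter might work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Enzyme:
    name: str
    overhang: str  # "5'", "3'" or "Blunt"


# A small illustrative enzyme list (not the workbench's full list).
ENZYMES = [
    Enzyme("EcoRV", "Blunt"),
    Enzyme("BamHI", "5'"),
    Enzyme("HindIII", "5'"),
    Enzyme("KpnI", "3'"),
    Enzyme("PstI", "3'"),
    Enzyme("HaeIII", "Blunt"),
]


def sort_by(enzymes, column):
    """Sort by a column name, like clicking a column heading."""
    return sorted(enzymes, key=lambda enzyme: getattr(enzyme, column))


def filter_enzymes(enzymes, text):
    """Keep enzymes whose name or overhang contains `text` (case-insensitive)."""
    needle = text.lower()
    return [
        enzyme for enzyme in enzymes
        if needle in enzyme.name.lower() or needle in enzyme.overhang.lower()
    ]
```

Sorting on `"overhang"` groups all 3' cutters together, and filtering on `"HindIII"` shrinks the list to that single enzyme, which mirrors the two use cases in the text.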
209. BIBLIOGRAPHY

[Kyte and Doolittle, 1982] Kyte, J. and Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J Mol Biol, 157(1):105-132.
[Larget and Simon, 1999] Larget, B. and Simon, D. (1999). Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol Biol Evol, 16:750-759.
[Leitner and Albert, 1999] Leitner, T. and Albert, J. (1999). The molecular clock of HIV-1 unveiled through analysis of a known transmission history. Proc Natl Acad Sci U S A, 96(19):10752-10757.
[Longfellow et al., 1990] Longfellow, C. E., Kierzek, R., and Turner, D. H. (1990). Thermodynamic and spectroscopic study of bulge loops in oligoribonucleotides. Biochemistry, 29(1):278-285.
[Maizel and Lenk, 1981] Maizel, J. V. and Lenk, R. P. (1981). Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proc Natl Acad Sci U S A, 78(12):7665-7669.
[Mathews et al., 2004] Mathews, D. H., Disney, M. D., Childs, J. L., Schroeder, S. J., Zuker, M., and Turner, D. H. (2004). Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci U S A, 101(19):7287-7292.
[Mathews et al., 1999] Mathews, D. H., Sabina, J., Zuker, M., and Turner, D. H. (1999). Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol, 288(5):911-940.
[Mathews and Turner, 2002] Math
210. Installation on Linux with an installer

Navigate to the directory containing the installer and execute it. This can be done by running a command similar to:

sh CLCCombinedWorkbench_3_JRE.sh

If you are installing from a CD, the installers are located in the "linux" directory. Installing the program is done in the following steps:
• On the welcome screen, click Next.
• Read and accept the License agreement and click Next.
• Choose where you would like to install the application and click Next. For a system-wide installation you can choose for example /opt or /usr/local. If you do not have root privileges, you can choose to install in your home directory.
• Choose where you would like to create symbolic links to the program. DO NOT create symbolic links in the same location as the application. Symbolic links should be installed in a location which is included in your environment PATH. For a system-wide installation you can choose for example /usr/local/bin. If you do not have root privileges, you can create a 'bin' directory in your home directory and install symbolic links there. You can also choose not to create symbolic links.
• Wait for the installation process to complete and click Finish.

CHAPTER 1. INTRODUCTION TO CLC COMBINED WORKBENCH

If you choose to create symbolic links in a location which is included in your PATH, the program can be executed by running the command:

clccombinedwb3

Otherwise you start the appl
211. Menu bar.

1.6.2 Import of example data

It might be easier to understand the logic of the program by trying to do simple operations on existing data. Therefore CLC Combined Workbench includes an example data set. When downloading CLC Combined Workbench you are asked if you would like to import the example data set. If you accept, the data is downloaded automatically and saved in the program. If you didn't download the data, or for some other reason need to download the data again, you have two options. You can click Install Example Data in the Help menu of the program. This installs the data automatically. You can also go to http://www.clcbio.com/download and download the example data from there. If you download the file from the website, you need to import it into the program. See chapter 7.1 for more about importing data.

1.7 Extending the workbench with plug-ins

When you install CLC Combined Workbench, it has a standard set of features. However, you can upgrade and customize the program using a variety of plug-ins. As the range of plug-ins is continuously updated and expanded, they will not be listed here. Instead we refer to http://www.clcbio.com/plug-ins for a full list of plug-ins with descriptions of their functionalities.

1.7.1 Installing plug-ins

Plug-ins are installed using the plug-in manager: Help in the Menu Bar | Install Plug-ins or Plug ins
212. NA sequence is found, the sequence can be viewed by double-clicking it in the list of hits from the search. If the desired sequence is not shown, you can click the "More" button below the list to see more hits.

2.4.2 Saving the sequence

The sequences which are found during the search can be displayed by double-clicking in the list of hits. However, this does not save the sequence. You can save one or more sequences by selecting them and clicking Download and Save, or by dragging the sequences into the Navigation Area.

2.5 Tutorial: Align protein sequences

It is possible to create multiple alignments of nucleotide and protein sequences. CLC Combined Workbench offers several opportunities to view alignments. The alignments can be used for building phylogenetic trees. The sequences must be saved in the Navigation Area in order to be included in an alignment. To save a sequence which is displayed in the View Area, click the tab of the sequence and press Ctrl + S (or ⌘ + S on Mac). In this tutorial, eight protein sequences from the Example data will be aligned (see figure 2.13).

Figure 2.13: Eight protein sequences in 'Sequences' from the 'Protein' folder of the Example data.

To align the sequences: select the sequences from the 'Protein' folder under 'Sequences' | Toolbox | Alignments and Trees | Create Alignment

2.5.1 The alignment dialog

Th
213. Figure 2.56: The initial linear view of the secondary structure prediction.

For now, we are not interested in the linear view. Click the Show Secondary Structure 2D View button at the bottom of the view to show the secondary structure. It looks as shown in figure 2.57.

Figure 2.57: The initial 2D view of the secondary structure.

This structure does not look like the one we expected (shown in figure 2.54). We now take a look at some of the other structures (we chose to compute 10 different structures) to see if we can find the classic tRNA structure. First, open a split view of the Show Secondary Structure Table: press and hold Ctrl (⌘ on Mac) | Show Secondary Structure Table. You will now see a table displaying the ten structures. Selecting a structure in the table will display this structure in the view above. Select the second structure in the table. The views should now look like figure 2.58.
214. Figure 22.13: A split view of the secondary structure view and a linear sequence view.

If you make a selection in another sequence view, this will also be reflected in the secondary structure view. The CLC Combined Workbench seeks to produce a layout of the structure where none of the elements overlap. However, it may be desirable to manually edit the layout of a structure for ease of understanding or for the purpose of publication. To edit a structure, first select the Pan mode in the Tool bar. Now place the mouse cursor on the opening of a stem, and a visual indication of the anchor point for turning the substructure will be shown (see figure 22.14).

Figure 22.14: The blue circle represents the anchor point for rotating the substructure.

Click and drag to rotate the part of the structure represented by the line going from the anchor point. In order to keep the bases in a relatively sequential arrangement, there is a restriction on how much the substructure can be rotated. The highlighted part of the circle represents the angle where rotating is allowed. In figure 22.15, the structure shown in figure 22.14 has been modified by dragging with the mouse.
215. Primer Parameters preference group, the Calculate button will activate the primer design algorithm.

When a single primer region is defined

If only a single region is defined, only single primers will be suggested by the algorithm. After pressing the Calculate button, a dialog will appear (see figure 17.7).

Figure 17.7: Calculation dialog for PCR primers when only a single primer region has been defined.

The top part of this dialog shows the parameter settings chosen in the Primer parameters preference group, which will be used by the design algorithm. The lower part contains a menu where the user can choose to include mispriming as a criterion in the design process. If this option is selected, the algorithm will search for competing binding sites of the primer within the sequence.

CHAPTER 17. PRIMERS

The adjustable parameters for the search are:
• Exact match: Choose only to consider exact matches of the primer, i.e. all positions must base pair with the template for mispriming to occur.
• Minimum numbe
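The "Exact match" option described above can be illustrated with a short search for perfect binding sites. This is a minimal sketch, not the workbench's algorithm: it assumes plain A/C/G/T sequences, looks for the primer on both strands of the template, and reports every site other than the intended one as a competing site. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence written with A, C, G and T."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def find_all(haystack, needle):
    """0-based start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = haystack.find(needle)
    while start != -1:
        positions.append(start)
        start = haystack.find(needle, start + 1)
    return positions


def find_exact_sites(template, primer):
    """Return (forward_hits, reverse_hits) for perfect matches.

    A forward hit is where the primer sequence itself occurs in the given
    strand; a reverse hit is where its reverse complement occurs, i.e. where
    the primer would base pair perfectly with the given strand.
    """
    if not primer:
        raise ValueError("primer must not be empty")
    template = template.upper()
    primer = primer.upper()
    return (
        find_all(template, primer),
        find_all(template, reverse_complement(primer)),
    )


def competing_sites(template, primer, intended_start):
    """Exact-match sites other than the intended forward site."""
    forward, reverse = find_exact_sites(template, primer)
    return ([pos for pos in forward if pos != intended_start], reverse)
```

If both returned lists are empty, the primer has no exact competing binding site in the template under these assumptions.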
216. • Individual statistics layout. If more than one sequence was selected in Step 1, this function generates separate statistics for each sequence.
• Comparative statistics layout. If more than one sequence was selected in Step 1, this function generates statistics with comparisons between the sequences.

You can also choose to include Background distribution of amino acids. If this box is ticked, an extra column with the amino acid distribution of the chosen species is included in the table output. (The distributions are calculated from UniProt, www.uniprot.org, version 6.0, dated September 13, 2005.)

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. An example of protein sequence statistics is shown in figure 14.16.

Figure 14.16: Comparative sequence statistics.

Nucleotide sequence statistics are generated using the same dialog as used for protein sequence statistics. However, the output of Nucleotide sequence statistics is less extensive than that of the protein sequence statistics.

Note! The headings of the tables change depending on whether you calculate 'individual' or co
217. drag the edge of the selection (you can see the mouse cursor change to a horizontal arrow), or press and hold the Shift key while using the right and left arrow keys to adjust the right side of the selection.

If you wish to select the entire sequence: double-click the sequence name to the left.

Selecting several parts at the same time (multiselect)

You can select several parts of a sequence by holding down the Ctrl button while making selections. Holding down the Shift button lets you extend or reduce an existing selection to the position you clicked.

To select a part of a sequence covered by an annotation: right-click the annotation | Select annotation, or double-click the annotation.

To select a fragment between two restriction sites that are shown on the sequence: double-click the sequence between the two restriction sites. (Read more about restriction sites in section 10.1.2.)

Open a selection in a new view

A selection can be opened in a new view and saved as a new sequence: right-click the selection | Open selection in New View. This opens the annotated part of the sequence in a new view. The new sequence can be saved by dragging the tab of the sequence view into the Navigation Area.

The process described above is also the way to manually translate coding parts of sequences (CDS) into protein. You simply translate the new sequence into protein. This is done by: right-click the tab of the new sequence | Toolbo
218. Since an alignment is a display of several sequences arranged in rows, the basic options for viewing alignments are the same as for viewing sequences. Therefore we refer to section 10.1 for an explanation of these basic options. However, there are a number of alignment-specific view options in the Alignment info and the Nucleotide info in the Side Panel to the right of the view. Below is more information on these view options.

Under Translation in the Nucleotide info, there is an extra checkbox: Relative to top sequence. Checking this box will make the reading frames for the translation align with the top sequence, so that you can compare the effect of nucleotide differences on the protein level.

The options in the Alignment info relate to each column in the alignment:
• Consensus. Shows a consensus sequence at the bottom of the alignment. The consensus sequence is based on every single position in the alignment and reflects an artificial sequence which resembles the sequence information of the alignment, but only as one single sequence. If all sequences of the alignment are 100% identical, the consensus sequence will be identical to all sequences found in the alignment. If the sequences of the alignment differ, the consensus sequence will reflect the most common sequences in the alignment. Parameters for adjusting the consensus sequences are described below.
  Limit. This option determines how conserved the sequences must be in order to agree on a c
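The idea of a per-column consensus can be shown with a few lines of code. The sketch below is a simplified illustration, not the workbench's method: it takes the most common symbol in each column and, as an assumption about how a conservation limit might behave, writes a placeholder symbol when that symbol's share of the column falls below `limit`. The function name and the `ambiguous` placeholder are hypothetical, and gap characters are treated like any other symbol.

```python
from collections import Counter


def consensus(rows, limit=0.5, ambiguous="X"):
    """Column-wise majority consensus of equally long aligned sequences.

    A column contributes its most common symbol if that symbol makes up
    at least `limit` of the column; otherwise `ambiguous` is emitted.
    """
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all aligned sequences must have the same length")

    result = []
    for column in zip(*rows):
        symbol, count = Counter(column).most_common(1)[0]
        result.append(symbol if count / len(rows) >= limit else ambiguous)
    return "".join(result)
```

With identical input sequences the consensus equals each of them, and raising `limit` towards 1.0 replaces every column that is not fully conserved with the placeholder.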
219. [Figure 17.12: The initial view of an alignment used for primer design.]

• TaqMan. Used when the objective is to design a primer pair and a probe set for TaqMan quantitative PCR.
• The Primer solution submenu is used to specify requirements for the match of a PCR primer against the template sequences. These options are described further below. It contains the following options: Perfect match, Allow degeneracy, Allow mismatches.

The workflow when designing alignment-based primers and probes is as follows:
• Use selection boxes to specify groups of included and excluded sequences. To select all the sequences in the alignment, right-click one of the selection boxes and choose Mark All.
• Mark either a single forward primer region, a single reverse primer region, or both on the sequence (and perhaps also a TaqMan region). Selections must cover all sequences in the included group. You can also specify that there should be no primers in a region (No Primers Here) or that a whole region should be amplified (Region to Amplify).
• Adjust parameters regarding
220. [Figure 12.27: Alignment view of BLAST results. Individual alignments are represented together with BLAST scores and more.]

Pre-formatted databases are available from a dedicated BLAST FTP site: ftp://ftp.ncbi.nlm.nih.gov/blast/db/. Moreover, it is possible to download programs/scripts from the same site, enabling automatic download of changed BLAST databases. Thus it is possible to schedule a nightly update of
221. [Figure 19.27: The result of the restriction analysis shown as annotations.]

Table of restriction sites
The restriction map can be shown as a table of restriction sites (see figure 19.28).

[Figure 19.28: The result of the restriction analysis shown as annotations. The table has the columns Sequence, Name, Pattern, Overhang, Number and Cut position(s), with one row each for the enzymes CjePI, MboII, NeuI, TsoI and Tth111II on the sequence PERH3BC.]

Each row in the table represents a restriction enzyme. The following information is available for each enzyme:
• Number of cut sites.
• Name. The name of the enzyme.
• Cut position(s). The position of each cut. If the enzyme cuts more than once, the positions are separated by commas.
• Pattern. The recognition sequence of the enzyme.
• Overhang. The overhang produced by cutting with the enzyme (3', 5' or Blunt).
• Sequence. The name of the sequence w
222. Trim annotations (they are colored red as default). If there are no trim annotations, the sequence has not been trimmed.

18.3 Assemble sequences
This section describes how to assemble a number of sequence reads into a contig without the use of a reference sequence (a known sequence that can be used for comparison with the other sequences, see section 18.4). To perform the assembly: select sequences to assemble | Toolbox in the Menu Bar | Sequencing Data Analyses | Assemble Sequences

This opens a dialog where you can alter your choice of sequences which you want to assemble. You can also add sequence lists. When the sequences are selected, click Next. This will show the dialog in figure 18.6.

[Figure 18.6: Setting assembly parameters. The dialog groups its settings under Trimming (Trim sequence ends before assembly), Alignment options (Minimum aligned read length, Alignment stringency), Conflicts (Vote (A, C, G, T), Unknown nucleotide (N), Ambiguity nucleotides (R, Y, etc.)) and Output options (Create full contigs including trace data, Show tabular view of contigs, Create only consensus sequences).]

This dialog gives you the following options for assembling:
• Trim sequence ends before assembly. If yo
223. UM62 is probably the best choice. This matrix has become the de facto standard for scoring matrices and is also used as the default matrix in BLAST searches. Selecting the wrong scoring matrix will most probably strongly influence the outcome of the analysis. In general, a few rules apply to the selection of scoring matrices:
• For closely related sequences, choose BLOSUM matrices created for highly similar alignments, like BLOSUM80. You can also select low PAM matrices such as PAM1.
• For distantly related sequences, select low BLOSUM matrices (for example BLOSUM45) or high PAM matrices such as PAM250.

The BLOSUM matrices with low numbers correspond to PAM matrices with high numbers. (See figure 14.13 for correlations between the PAM and BLOSUM matrices.) To summarize, if you want to find distantly related proteins to a sequence of interest using BLAST, you could benefit from using BLOSUM45 or similar matrices.

[Table: amino acid scoring matrix with rows and columns A R N D C Q E G H I L K M F P S T W Y V.]
224. A vector graphic is a collection of shapes. Thus what is stored is e.g. information about where a line starts and ends, and the color of the line and its width. This enables a given viewer to decide how to draw the line, no matter what the zoom factor is, thereby always giving a correct image. This format is good for e.g. graphs and reports, but less usable for e.g. dot plots. If the image is to be resized or edited, vector graphics are by far the best format to store graphics. If you open a vector graphics file in an application like Adobe Illustrator, you will be able to manipulate the image in great detail.

Graphics files can also be imported into the Navigation Area. However, no kinds of graphics files can be displayed in CLC Combined Workbench. See section 7.2 for more about importing external files into CLC Combined Workbench.

7.3.3 Graphics export parameters
When you have specified the name and location to save the graphics file, you can either click Next or Finish. Clicking Next allows you to set further parameters for the graphics export, whereas clicking Finish will export using the parameters that you set the last time you made a graphics export in that file format (if it is the first time, it will use default parameters).

Parameters for bitmap formats
For bitmap files, clicking Next will display the dialog shown in figure 7.8. Export Graphics: 1. Output options 2. Save in file 3. Export size. Choose resolution: Scree
225. [Figure 18.3: A sequence with trace data. The preferences for viewing the trace are shown in the Side Panel.]

18.2 Trim sequences
CLC Combined Workbench offers a number of ways to trim your sequence reads prior to assembly. Trimming can be done either as a separate task before assembling, or it can be performed as an integrated part of the assembly process (see section 18.3).

Trimming as a separate task can be done either manually or automatically. In both instances, trimming of a sequence does not cause data to be deleted. Instead, both the manual and automatic trimming will put a Trim annotation on the trimmed parts as an indication to the assembly algorithm that this part of the data is to be ignored (see figure 18.4). This means that the effect of different trimming schemes can easily be explored without the loss of data. To remove existing trimming from a sequence, simply remove its trim annotation (see section 10.3.2).

[Figure 18.4: Trimming creates annotations on the regions that will be ignored in the assembly process.]

18.2.1 Manual trimming
Sequence reads can be trimmed manually while inspecting their trace and quality data. Trimming sequences manually corresponds to adding annotation (see also section 10.3.2) but is special in the sense that trimming can only be applied to the ends of a sequence: double-click the
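The non-destructive trimming described in this section can be pictured with a small Python sketch. It is a hypothetical data model written for illustration only (the `Read` class, its `trims` list and the method names are not taken from the Workbench): the read keeps its full sequence, and trimming merely records end regions that a downstream step should ignore.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    name: str
    sequence: str
    # Trim annotations as (start, end) half-open index pairs.
    # The sequence itself is never modified.
    trims: list = field(default_factory=list)

    def trim_ends(self, left: int, right: int) -> None:
        """Annotate `left` residues at the start and `right` at the end as trimmed."""
        self.trims = []
        if left > 0:
            self.trims.append((0, left))
        if right > 0:
            self.trims.append((len(self.sequence) - right, len(self.sequence)))

    def untrimmed(self) -> str:
        """Return the part of the read that an assembler would actually use."""
        masked = set()
        for start, end in self.trims:
            masked.update(range(start, end))
        return "".join(c for i, c in enumerate(self.sequence) if i not in masked)
```

Because the annotations are separate from the data, clearing `trims` restores the full read, which mirrors the point that different trimming schemes can be explored without losing data.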
226. [Figure 12.14: Choosing species and database.]

• Species. The species from which the query database is constructed.
• Database. Depending on the species, specific databases are available for subsets of the genome.

Click Next to go to the next step, where you can set BLAST and annotation parameters as shown in figure 12.15.

[Figure 12.15: Setting parameters for SNP BLAST. The dialog shows the BLAST parameters Choose filter (Low complexity, Human repeats, Mask for lookup, Mask lower case), Expect, Word size, Match/Mismatch and Gap cost.]

The top part of the parameters shown in figure 12.15 pertains to the BLAST algorithm and is described in section 12.1. Click Next to go to the next step, where you can choose how the output of the SNP Annotation Using BLAST should be displayed. There are three options, as shown in figure 12.16:
• Create overview BLAST table. This will create one table containing and summarizing all the BLAST results. See section 12.3.1.
• Create one BLAST result per query. This will create a BLAST result for each query sequence, which can be opened in a table (see section 12.3.3) or in the graphical alignmen
227. 10.6 Creating a new sequence
10.7 Sequence Lists
10.7.1 Graphical view of sequence lists
10.7.2 Sequence list table
10.7.3 Extract sequences

CLC Combined Workbench offers five different ways of viewing and editing single sequences, as described in the first five sections of this chapter. Furthermore, this chapter also explains how to create a new sequence and how to assemble several sequences in a sequence list.

10.1 View sequence
When you double-click a sequence in the Navigation Area, the sequence will open automatically, and you will see the nucleotides or amino acids. The zoom options described in section 3.3 allow you to e.g. zoom out in order to see more of the sequence in one view. There are a number of options for viewing and editing the sequence, which are all described in this section. All the options described in this section also apply to alignments (further described in section 20.2).

10.1.1 Sequence settings in Side Panel
Each view of a sequence has a Side Panel located at the right side of the view. When you make changes in the Side Panel, the view of the sequence is instantly updated. To show or hide the Side Panel: select the View | Ctrl + U, or click the icon at the top right corner of the Side Panel to
228. about zooming in section 3.3.
• Rotate mode. The structure is rotated when the Pan mode is selected in the toolbar. If the pan mode is not enabled on the first view of a structure, a warning is shown.
• Zoom mode. Use the zoom buttons on the toolbar to enable zoom mode. A single click with the mouse will zoom slightly on the structure. Moreover, it is possible to zoom in and out on the structure by keeping the left mouse button pressed while moving the mouse up and down.
• Move mode. It is possible to move the structure from side to side if the Ctrl key (on Windows) or the ⌘ key (on Mac) is pressed while dragging with the mouse.

[Figure 13.2: 3D view. Structure files can be opened, viewed and edited in several ways.]

13.3 The structure table
Below the structure view you will find a table presenting information on the protein or nucleic acid subunits along with an
229. ading frames. If you wish to translate the whole sequence, you must specify the reading frame for the translation. If you select e.g. two reading frames, two protein sequences are generated.
• Translate coding regions. You can choose to translate regions marked by a CDS or ORF annotation. This will generate a protein sequence for each CDS or ORF annotation on the sequence.
• Genetic code translation table. Lets you specify the genetic code for the translation. The translation tables are occasionally updated from NCBI. The tables are not available in this printable version of the user manual. Instead, the tables are included in the Help menu in the Menu Bar (in the appendix).

[Figure 15.5: Choosing +1 and +3 reading frames and the standard translation table.]

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. The newly created protein is shown, but is not saved automatically. To save a protein sequence, drag it into the Navigation Area o
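As an illustration of what translating in a chosen reading frame involves, here is a minimal Python sketch. It assumes the standard genetic code (NCBI translation table 1), uses frames 1 to 3 for the presented strand and -1 to -3 for the reverse complement, and marks any codon containing a non-ACGT character as `X`. The function names are chosen for this example and are not part of the Workbench.

```python
# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str, frame: int = 1) -> str:
    """Translate a DNA sequence in frame 1, 2, 3 (forward) or -1, -2, -3 (reverse)."""
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be one of 1, 2, 3, -1, -2, -3")
    seq = seq.upper()
    if frame < 0:
        seq = reverse_complement(seq)
    start = abs(frame) - 1
    # Step through complete codons only; a trailing partial codon is dropped.
    codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)
```

Calling `translate` once per selected frame gives one protein string per frame, which corresponds to the statement that selecting two reading frames generates two protein sequences.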
230. ain of the structure with the ID 1A00 will be named 1A00 A. Brackets around the name indicate the child-parent relationship.

Selection on the sequence
If you select a part of the sequence in the sequence view, it will be mirrored in the 3D structure using one of two selection schemes. Most structure files allow a well-defined mapping between sequence and structure, but in some cases an unambiguous mapping is not possible. In these cases, a dialog is presented to the user when the structure is opened, stating the problem. Sequence selection may be disabled in some cases.

13.3.3 Display and coloring options
Individual subunits (polymer as well as non-polymer) may be switched on and off in the 3D view using the View in 3D checkbox. Also, when using the Entity coloring mode (see below), the colors of individual subunits may be specified by the user using the Select Entity Color color choosers.

13.4 Options through the preference panel
The view of the structure can be changed in several ways. All graphical changes are carried out through the Side Panel. All options in the Side Panel are described below.

13.4.1 Atoms & Bonds
• Non-Polymer Atoms. Show the individual atoms of non-polymer molecules as ball-shaped structures. Atom size and transparency can be varied by using the sliders (see figure 13.2). The size represents the percentage of van der Waals radii.
• Polymer Atoms. Show the individual atoms of the protein chain as ball-shaped structures. At
231. [Figure 17.10: Calculation dialog.]

In this dialog, the options to set a minimum and a desired melting temperature difference between outer and inner refer to the primer pair and the probe, respectively. Furthermore, the central part of the dialog contains an additional parameter:
• Maximum length of amplicon: determines the maximum length of the PCR fragment generated in the TaqMan analysis.

17.7.1 TaqMan output table
In TaqMan mode there are two primers and a probe in a given solution: forward primer (F), reverse primer (R) and a TaqMan probe (TP). The output table can show primer/probe pair combination parameters for all three combinations of primers, and single primer parameters for both primers and the TaqMan probe (see section on Standard PCR for an explanation of the available primer pair and single primer information). The fragment length in this mode refers to the length of the PCR fragment generated by the primer pair, and this is also the PCR fragment which can be exported.

17.8 Sequencing primers
This mode is used to design primers for DNA sequencing. In this mode the user can define a number of Forward primer regions and Reverse primer regions where a se
232. [Figure 20.14: Adjusting parameters for pairwise comparison.]

• Gaps. Calculates the number of alignment positions where one sequence has a gap and the other does not.
• Identities. Calculates the percentage of identical alignment positions to overlapping alignment positions between the two sequences.
• Differences. Calculates the number of alignment positions where one sequence is different from the other. This includes gap differences as in the Gaps comparison.
• Distance. Calculates the Jukes-Cantor distance between the two sequences. This number is given as the Jukes-Cantor correction of the proportion between identical and overlapping alignment positions between the two sequences.
• Similarity. Calculates the percentage of similar residues in alignment positions to overlapping alignment positions between the two sequences.

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.

20.5.3 The pairwise comparison table
The table shows the results of selected comparisons (see an example in figure 20.15). Since comparisons are often symmetric, the table can show the results of two comparisons at the same time, one in the upper right and on
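The Gaps, Identities, Differences and Distance measures can be sketched for a single pair of aligned nucleotide sequences as follows. This is an illustrative reading of the definitions above, not the Workbench's code: "overlapping positions" is taken to mean columns where both sequences have a residue, columns where both have a gap are ignored, and the distance uses the standard Jukes-Cantor formula for nucleotides, d = -3/4 ln(1 - 4p/3), with p the proportion of overlapping positions that differ.

```python
import math


def pairwise_comparison(a: str, b: str, gap: str = "-") -> dict:
    """Compare two rows of an alignment (equal-length strings)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    gaps = identical = overlapping = 0
    for x, y in zip(a, b):
        if (x == gap) != (y == gap):
            gaps += 1                      # gap in exactly one sequence
        elif x != gap:
            overlapping += 1               # residue in both sequences
            if x == y:
                identical += 1
    if overlapping == 0:
        raise ValueError("the sequences do not overlap")
    mismatches = overlapping - identical
    p = mismatches / overlapping
    # The Jukes-Cantor correction is undefined for p >= 0.75; report infinity.
    distance = -0.75 * math.log(1 - 4 * p / 3) if p < 0.75 else math.inf
    return {
        "gaps": gaps,
        "identities": 100.0 * identical / overlapping,
        "differences": mismatches + gaps,
        "distance": distance,
    }
```

Similarity is left out because it needs a definition of which residues count as similar, which the text above does not give.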
233. al information, by studying its annotation or by aligning it to the query sequence.

2.9 Tutorial: Tips for specialized BLAST searches
BLAST has become a central tool for identifying homologues and similar sequences, and at the same time it has evolved into a highly complex tool which can be used for many different purposes. In this tutorial you will learn how to:
• Use BLAST to find the gene of a protein on a genomic sequence.
• Find primer binding sites on genomic sequences.
• Identify remote protein homologues.

This tutorial requires some experience using the Workbench, so if you get stuck at some point, we recommend going through the more basic tutorials first.

2.9.1 Locate a protein sequence on the chromosome
If you have a protein sequence but want to see the actual location on the chromosome, this is easy to do using BLAST. In this example we wish to map the protein sequence of the Human beta-globin protein to a chromosome. We know in advance that the beta-globin is located somewhere on chromosome 11. Data used in this example can be downloaded from GenBank (Search | Search for Sequences at NCBI). Human chromosome 11 (NC_000011) consists of 134452384 nucleotides and the beta-globin (AAA16334) protein has 147 amino acids.

BLAST configuration
Next, conduct a local BLAST search: Toolbox | BLAST Search | Local BLAST. Select th
234. al part of the dialog contains parameters pertaining to primer pairs and the comparison between the outer and the inner pair. Here five options can be set:

[Figure 17.9: Calculation dialog.]

• Maximum percentage point difference in G/C content (described above under Standard PCR): this criterion is applied to both primer pairs independently.
• Maximal difference in melting temperature of primers in a pair: the number of degrees Celsius that primers in a pair are allowed to differ. This criterion is applied to both primer pairs independently.
• Maximum pair annealing score: the maximum number of hydrogen bonds allowed b
235. ally show up as lines parallel to the diagonal line.

[Figure 14.8: Direct and inverted repeats shown on an amino acid sequence generated for demonstration purposes. Direct repeats: ACDEFGHIACDEFGHIACDEFGHIACDEFGHI. Inverted repeats: ACDEFGHIIHGFEDCAACDEFGHIIHGFEDCA.]

If the dot plot shows more than one diagonal in the same region of a sequence, the corresponding regions of the other sequence are repeated. In figure 14.9 you can see a sequence with repeats.

[Figure 14.9: The dot plot of a sequence showing repeated elements. See also figure 14.8.]

Frame shifts
Frame shifts in a nucleotide sequence can occur due to insertions, deletions or mutations. Such frame shifts can be visualized in a dot plot as seen in figure 14.10. In this figure, three frame shifts for the sequence on the y-axis are found:
1. Deletion of nucleotides
2. Insertion of nucleotides
3. Mutation (out of frame)

Sequence inversions
In dot plots you can see an inversion of sequence as a contrary diagonal to the diagonal showing similarity. In figure 14.11 you can see a dot plot (window length is 3) with an inversion.

Low-complexity regions
Low-complexity regions in sequences can be found as regions around the diagonal all obtaining a high score. Low-complexity regions are calculated from the redundancy of amino acids within a limited region [Wootton and Federhe
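To see why direct repeats show up as lines parallel to the main diagonal, a dot plot can be sketched with exact window matching. This is a deliberately simplified stand-in for the scored, thresholded comparison that a real dot plot uses: the function name and the rule "put a dot where two windows are identical" are assumptions made for this example.

```python
def dot_plot(seq_x: str, seq_y: str, window: int = 3) -> set:
    """Return the set of (i, j) positions where the windows starting at
    seq_x[i] and seq_y[j] are identical (0-based indices)."""
    dots = set()
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            if seq_x[i:i + window] == seq_y[j:j + window]:
                dots.add((i, j))
    return dots
```

Comparing the direct repeat `ACDEFGHIACDEFGHI` with itself yields the main diagonal (i, i) plus two shorter diagonals offset by the repeat length of 8, which is the pattern of parallel lines described above.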
236. alue or a numerical range between the curly brackets. For example, (ACG){2} matches the string ACGACG.
• X{n,m} will match a certain number of repetitions of an element, indicated by following that element with two numerical values between the curly brackets. The first number is a lower limit on the number of repetitions and the second number is an upper limit on the number of repetitions. For example, (ACT){1,3} matches ACT, ACTACT and ACTACTACT.
• X{n,} represents a repetition of an element at least n times. For example, (AC){2,} matches all strings ACAC, ACACAC, ACACACAC, ...
• The symbol ^ restricts the search to the beginning of your sequence. For example, if you search through a sequence with the regular expression ^AC, the algorithm will find a match if AC occurs in the beginning of the sequence.
• The symbol $ restricts the search to the end of your sequence. For example, if you search through a sequence with the regular expression GT$, the algorithm will find a match if GT occurs in the end of the sequence.

Examples
The expression [ACG][^AC]G{2} matches all strings of length 4 where the first character is A, C or G, the second is any character except A and C, and the third and fourth characters are G.
The expression G.[^A]$ matches all strings of length 3 in the end of your sequence, where the first character is G, the second any character and the third any character except A.

For proteins
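The quantifiers and anchors described here behave the same way in most regular expression engines, so they can be tried out with Python's standard `re` module. The snippet below is only a way to experiment with the syntax; whether the Workbench's motif search uses exactly the same dialect is an assumption, and the example sequences are made up.

```python
import re

# X{n}: exactly n repetitions of the grouped element.
exact = re.fullmatch(r"(ACG){2}", "ACGACG") is not None

# X{n,m}: between n and m repetitions.
ranged = [s for s in ("ACT", "ACTACT", "ACTACTACT", "ACTACTACTACT")
          if re.fullmatch(r"(ACT){1,3}", s)]

# X{n,}: at least n repetitions.
at_least = [s for s in ("AC", "ACAC", "ACACACAC")
            if re.fullmatch(r"(AC){2,}", s)]

# ^ and $ anchor the match to the start and the end of the sequence.
starts_with_ac = re.search(r"^AC", "ACGTTGAC") is not None
ends_with_gt = re.search(r"GT$", "ACGTTGAC") is not None

# Character classes: [ACG] is "A, C or G", [^AC] is "anything but A or C".
four_mer = re.fullmatch(r"[ACG][^AC]G{2}", "ATGG") is not None
tail = re.search(r"G.[^A]$", "ACGTTGAC")
```

Here `ranged` keeps the first three strings and rejects four repetitions, `at_least` rejects the single `AC`, and `tail` matches the final `GAC` of the example sequence.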
237. ample of how parameters can be set.

[Figure 16.27: Setting parameters for proteolytic cleavage detection. The dialog lists the enzymes that can be included: Cyanogen bromide (CNBr), Asp-N endopeptidase, Arg-C, Lys-C, Trypsin, Chymotrypsin (high spec.), Chymotrypsin (low spec.), o-Iodosobenzoate, Thermolysin, Post-Pro, Glu-C, Asp-N, Proteinase K, Thrombin, Factor Xa and Granzyme B.]

• Min. and max. number of cleavage sites. Certain proteolytic enzymes cleave at many positions in the amino acid sequence. For instance, proteinase K cleaves at nine different amino acids, regardless of the surrounding residues. Thus, it can be very useful to limit the number of actual cleavage sites before running the analysis.
• Min. and max. fragment length. Likewise, it is possible to limit the output to only display sequence fragments within a chosen length range. Both a lower and upper limit can be chosen.
• Min. and max. fragment mass. The molecular weight is not necessarily directly correlated to the fragment length, as amino acids have different molecular masses. For that reason it is also possible to limit the search for proteolytic cleavage sites to a mass range.

Example: If you have one
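The effect of the fragment length limits can be illustrated with a short Python sketch of an in-silico digest. The cleavage rule used here is the common simplified trypsin rule (cut after K or R unless the next residue is P); it is an assumption for the example and not necessarily the rule the Workbench applies. The function names and parameters are likewise hypothetical, and mass filtering is omitted because it would need a table of residue masses.

```python
import re


def trypsin_fragments(protein: str) -> list:
    """Split a protein after K or R, except when the next residue is P."""
    # Zero-width split points require Python 3.7 or later.
    return [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]


def filter_by_length(fragments, min_len=1, max_len=None):
    """Keep only fragments whose length lies within [min_len, max_len]."""
    return [f for f in fragments
            if len(f) >= min_len and (max_len is None or len(f) <= max_len)]
```

For the made-up peptide `MKRPAKGG` the digest gives `MK`, `RPAK` and `GG` (the bond before the proline is left intact), and a minimum length of 3 keeps only `RPAK`.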
238. an be seen as spikes in the graphs (see figure 22.27).

[Figure 22.27: A plot of the Z-scores produced by sliding a window along a sequence. Plot of Z-scores for AB030907; x-axis: sequence position; y-axis: Z-score.]

22.5 Bioinformatics explained: RNA structure prediction by minimum free energy minimization
RNA molecules are hugely important in the biology of the cell. Besides their rather simple role as an intermediate messenger between DNA and protein, RNA molecules can have a plethora of biologic functions. Well-known examples of this are the infrastructural RNAs such as tRNAs, rRNAs and snRNAs, but the existence and functionality of several other groups of non-coding RNAs are currently being discovered. These include micro- (miRNA), small interfering (siRNA), Piwi-interacting (piRNA) and small modulatory RNAs (smRNA) [Costa, 2007].

A common feature of many of these non-coding RNAs is that the molecular structure is important for the biological function of the molecule. Ideally, biological function is best interpreted against a 3D structure of an RNA molecule. However, 3D structure determination of RNA molecules is time-consuming, expensive and difficult [Shapiro et al., 2007] and there is t
239. an save this arrangement using Workspaces. A Workspace remembers the way you have arranged the views, and you can switch between different workspaces. The Navigation Area always contains the same data across Workspaces. It is, however, possible to open different folders in the different Workspaces. Consequently, the program allows you to display different clusters of the data in separate Workspaces. All Workspaces are automatically saved when closing down CLC Combined Workbench. The next time you run the program, the Workspaces are reopened exactly as you left them.

Note: It is not possible to run more than one version of CLC Combined Workbench at a time. Use two or more Workspaces instead.

3.5.1 Create Workspace
When working with large amounts of data, it might be a good idea to split the work into two or more Workspaces. As default, the CLC Combined Workbench opens one Workspace, the largest window in the right side of the workbench (see figure 3.1). Additional Workspaces are created in the following way: Workspace in the Menu Bar | Create Workspace | enter name of Workspace | OK

When the new Workspace is created, the heading of the program frame displays the name of the new Workspace. Initially, the selected elements in the Navigation Area are collapsed and the View Area is empty and ready to work with (see figure 3.17).
240. and Stephens, 1990]. Nevertheless, the conservation of every position is defined as $R_{seq}$, which is the difference between the maximal entropy $S_{max}$ and the observed entropy for the residue distribution $S_{obs}$:

$$R_{seq} = S_{max} - S_{obs} = \log_2 N - \left( - \sum_{n=1}^{N} p_n \log_2 p_n \right)$$

[Figure 20.8: Ungapped sequence alignment of eleven E. coli sequences defining a start codon. The start codons start at position 1. Below the alignment is shown the corresponding sequence logo. As seen, a GTG start codon and the usual ATG start codons are present in the alignment. This can also be visualized in the logo at position 1.]

$p_n$ is the observ
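The entropy difference can be evaluated for a single alignment column as in the following Python sketch. It assumes that p_n is the relative frequency of residue n among the characters in the column and that N is the alphabet size (4 for nucleotides, 20 for amino acids). Small-sample corrections and gap handling are left out, and the function name is chosen for this example only.

```python
import math
from collections import Counter


def information_content(column: str, alphabet_size: int = 4) -> float:
    """R_seq = log2(N) - S_obs for one alignment column, in bits."""
    counts = Counter(column)
    total = len(column)
    # Observed entropy: S_obs = -sum(p_n * log2(p_n)) over residues present.
    s_obs = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(alphabet_size) - s_obs
```

A fully conserved nucleotide column scores the maximum of 2 bits, a column with all four nucleotides in equal proportion scores 0 bits, and a column split evenly between two nucleotides scores 1 bit. These values set the height of the letter stack at each position in a sequence logo.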
241. and ending residues.
• <345..500 Indicates that the exact lower boundary point of a region is unknown. The location begins at some residue previous to the first residue specified (which is not necessarily contained in the presented sequence) and continues up to and including the ending residue.
• <1..888 The region starts before the first sequenced residue and continues up to and including residue 888.
• 1..>888 The region starts at the first sequenced residue and continues beyond residue 888.
• 102.110 Indicates that the exact location is unknown, but that it is one of the residues between residues 102 and 110, inclusive.
• 123^124 Points to a site between residues 123 and 124.
• join(12..78,134..202) Regions 12 to 78 and 134 to 202 should be joined to form one contiguous sequence.
• complement(34..126) Start at the residue complementary to 126 and finish at the residue complementary to residue 34 (the region is on the strand complementary to the presented strand).
• complement(join(2691..4571,4918..5163)) Joins regions 2691 to 4571 and 4918 to 5163, then complements the joined segments (the region is on the strand complementary to the presented strand).
• join(complement(4918..5163),complement(2691..4571)) Complements regions 4918 to 5163 and 2691 to 4571, then joins the complemented segments (the region is on the strand complementary to the presented strand).

• Annotations. In this field you can add more information abo
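How the join and complement operators compose can be made concrete with a small Python sketch that extracts the residues a location refers to. It is written for illustration only: it handles plain ranges such as 12..78 (optionally with < or > boundary markers, which are simply ignored), `join(...)` and `complement(...)`, and it rejects single positions, 102.110 and 123^124 forms. The helper names are hypothetical.

```python
def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def split_top_level(text: str) -> list:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            depth += ch == "("
            depth -= ch == ")"
            current.append(ch)
    parts.append("".join(current))
    return parts


def extract(seq: str, location: str) -> str:
    """Return the residues of `seq` described by a 1-based, inclusive location."""
    loc = location.replace(" ", "")
    if loc.startswith("complement(") and loc.endswith(")"):
        return reverse_complement(extract(seq, loc[len("complement("):-1]))
    if loc.startswith("join(") and loc.endswith(")"):
        return "".join(extract(seq, p) for p in split_top_level(loc[len("join("):-1]))
    bounds = loc.split("..")
    if len(bounds) != 2:
        raise ValueError(f"unsupported location: {location}")
    start = int(bounds[0].lstrip("<"))
    end = int(bounds[1].lstrip(">"))
    return seq[start - 1:end]
```

On a short test sequence, complementing a join and joining the complements in reverse order give the same residues, which is why the last two location forms in the list above describe the same region.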
242. and structure of the N-end rule. J Biol Chem, 264(28):16700-16712.

[Han et al., 1999] Han, K., Kim, D., and Kim, H. (1999). A vector-based method for drawing RNA secondary structure. Bioinformatics, 15(4):286-297.

[Hein, 2001] Hein, J. (2001). An algorithm for statistical alignment of sequences related by a binary tree. In Pacific Symposium on Biocomputing, page 179.

[Hein et al., 2000] Hein, J., Wiuf, C., Knudsen, B., Møller, M. B., and Wibling, G. (2000). Statistical alignment: computational properties, homology testing and goodness-of-fit. J Mol Biol, 302(1):265-279.

[Henikoff and Henikoff, 1992] Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A, 89(22):10915-10919.

[Hopp and Woods, 1983] Hopp, T. P. and Woods, K. R. (1983). A computer program for predicting protein antigenic determinants. Mol Immunol, 20(4):483-489.

[Horikawa et al., 2000] Horikawa, Y., Oda, N., Cox, N. J., Li, X., Orho-Melander, M., Hara, M., Hinokio, Y., Lindner, T. H., Mashima, H., Schwarz, P. E., del Bosque-Plata, L., Horikawa, Y., Oda, Y., Yoshiuchi, I., Colilla, S., Polonsky, K. S., Wei, S., Concannon, P., Iwasaki, N., Schulze, J., Baier, L. J., Bogardus, C., Groop, L., Boerwinkle, E., Hanis, C. L., and Bell, G. I. (2000). Genetic variation in the gene encoding calpain-10 is associated with type 2 diabetes mellitus. Nat Genet, 26(2):163-175.

[Ikai, 1980] Ikai, A. (1980). Ther
243. and you will see the dialog shown in figure 18.7.
Figure 18.7: Setting assembly parameters when assembling to a reference sequence.
This dialog gives you the following options for assembling:
• Reference sequence. Click the Browse and select element icon in order to select a sequence to use as reference.
• Include reference sequence in contig(s). This will display a contig data object with the reference sequence at the top and the reads aligned below. This option is useful when comparing sequence reads to a closely related reference sequence, e.g. when sequencing for SNP characterization.
  - Only include part of the reference sequence in the contig. If the aligned sequence reads only cover a small part of the reference sequence, it may not be desirable to include the whole reference sequence in the contig data object. When selected, this option lets you specify how many residues from the reference sequence should be kept on each side of the region spanned by sequencing reads, by entering the number in the Extra residues field.
• Do not i
244. ane helix prediction
Many proteins are integral membrane proteins. Most membrane proteins have hydrophobic regions which span the hydrophobic core of the membrane bi-layer, and hydrophilic regions located on the outside or the inside of the membrane. Many receptor proteins have several transmembrane helices spanning the cellular membrane.
For prediction of transmembrane helices, CLC Combined Workbench uses TMHMM version 2.0 [Krogh et al., 2001], located at http://www.cbs.dtu.dk/services/TMHMM/; thus an active internet connection is required to run the transmembrane helix prediction. Additional information on TMHMM and the Center for Biological Sequence Analysis (CBS) can be found at http://www.cbs.dtu.dk and in the original research paper [Krogh et al., 2001].
In order to use the transmembrane helix prediction, you need to download the plug-in using the plug-in manager (see section 1.7.1). When the plug-in is downloaded and installed, you can use it to predict transmembrane helices:
Select a protein sequence | Toolbox in the Menu Bar | Protein Analyses | Transmembrane Helix Prediction
or right-click a protein sequence | Toolbox | Protein Analyses | Transmembrane Helix Prediction
If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.
The predictions obtain
245. arameters (see figure 14.20).
14.6.1 Motif search parameter settings
Various parameters can be set prior to the motif search. The parameters are listed below, and a screenshot of the parameter settings can be seen in figure 14.20.
• Motif types. You can choose a literal string (simple motif) or a Java regular expression as your motif type. For proteins, you can choose to search with a Prosite regular expression.
• Motif. If you choose to search with a simple motif, you should enter a literal string as your motif. Ambiguous amino acids and nucleotides are allowed. Example: ATGATGNNATG. If your motif type is Java regular expression, you should enter a regular expression according to the syntax rules described in section 14.6. Press Shift + F1 for options. For proteins, you can search with a Prosite regular expression, and you should enter a protein pattern from the PROSITE database.
• Accuracy. If you search with a simple motif, you can adjust the accuracy of the search string to the match on the sequence.
• Search for reverse motif. This enables searching on the negative strand on nucleotide sequences.
• Exclude unknown regions. Genome sequences often have large regions with unknown sequence. These regions are very often padded with N's. Ticking this checkbox will not display hits found in N-regions.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click F
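As a rough illustration of how a simple motif with ambiguous nucleotides, such as the example ATGATGNNATG, relates to a regular expression, the sketch below expands each IUPAC code into a character class. It uses Python's `re` module rather than Java regular expressions, and the function names `motif_to_regex` and `find_motif` are hypothetical; this is not how the workbench itself is implemented.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def motif_to_regex(motif):
    """Translate a simple nucleotide motif into a regular expression."""
    return "".join(IUPAC[base] for base in motif.upper())

def find_motif(sequence, motif):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    # The lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))")
    return [(m.start() + 1, m.group(1)) for m in pattern.finditer(sequence.upper())]

hits = find_motif("CCATGATGCAATGTT", "ATGATGNNATG")
print(hits)  # [(3, 'ATGATGCAATG')]
```

Ambiguity codes in the searched sequence itself, and the Accuracy setting, are not modelled here.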
246. ard PCR. Used when the objective is to design primers, or primer pairs, for PCR amplification of a single DNA fragment.
Nested PCR. Used when the objective is to design two primer pairs for nested PCR amplification of a single DNA fragment.
Sequencing. Used when the objective is to design primers for DNA sequencing.
TaqMan. Used when the objective is to design a primer pair and a probe for TaqMan quantitative PCR.
Each mode is described further below.
• Calculate. Pushing this button will activate the algorithm for designing primers.
17.3 Graphical display of primer information
The primer information settings are found in the Primer information preference group in the Side Panel to the right of the view (see figure 17.3).
There are two different ways to display the information relating to a single primer: the detailed and the compact view. Both are shown below the primer regions selected on the sequence.
17.3.1 Compact information mode
This mode offers a condensed overview of all the primers that are available in the selected region. When a region is chosen, primer information will appear in lines beneath it (see figure 17.4).
247. ardt, A. and Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res, 26(9):2230-2236.
[Rivas and Eddy, 2000] Rivas, E. and Eddy, S. R. (2000). Secondary structure alone is generally not statistically significant for the detection of noncoding RNAs. Bioinformatics, 16(7):583-605.
[Rose et al., 1985] Rose, G. D., Geselowitz, A. R., Lesser, G. J., Lee, R. H., and Zehfus, M. H. (1985). Hydrophobicity of amino acid residues in globular proteins. Science, 229(4716):834-838.
[Rost, 2001] Rost, B. (2001). Review: protein secondary structure prediction continues to rise. J Struct Biol, 134(2-3):204-218.
[Saitou and Nei, 1987] Saitou, N. and Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 4(4):406-425.
[Sankoff et al., 1983] Sankoff, D., Kruskal, J., Mainville, S., and Cedergren, R. (1983). Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison, chapter Fast algorithms to determine RNA secondary structures containing multiple loops, pages 93-120. Addison-Wesley, Reading, Ma.
[SantaLucia, 1998] SantaLucia, J. (1998). A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc Natl Acad Sci U S A, 95(4):1460-1465.
[Schechter and Berger, 1967] Schechter, I. and Berger, A. (1967). On the size of the active site in proteases. I. Papain. Biochem Biophys Res C
248. are included if you export the sequence in GenBank, Swiss-Prot, EMBL or CLC format. When exporting in other formats, annotations are not preserved in the exported file.
10.3.1 Viewing annotations
Annotations can be viewed in a number of different ways:
• As arrows or boxes in the sequence views:
  - Linear and circular view of sequences
  - Alignments
  - Graphical view of sequence lists
  - BLAST views (only the query sequence at the top can have annotations)
  - Cloning editor
  - Primer designer (both for single sequences and alignments)
  - Contig view
• In the table of annotations
• In the text view of sequences
In the following sections, these view options will be described in more detail. In all the views except the text view, annotations can be added, modified and deleted. This is described in the following sections.
View Annotations in sequence views
Figure 10.15 shows an annotation displayed on a sequence.
Figure 10.15: An annotation showing a coding region on a genomic DNA sequence.
The various sequence views listed in section 10.3.1 have different default settings for showing annotations. However, they all have two groups in the Sid
249. artition function calculation: one with structural constraints, and one without.
Figure 22.21: Two constraints defining a structural hypothesis.
22.3.1 Selecting sequences for evaluation
The evaluation is started from the Toolbox:
Toolbox | RNA Structure | Evaluate Structure Hypothesis
This opens the dialog shown in figure 22.22.
Figure 22.22: Selecting RNA or DNA sequences for evaluating structure hypothesis.
If you have selected sequences before choosing the Toolbox action, they are now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. Note that the selected sequences must contain a structure hypothesis in the form of manually added constraint annotations.
Click Next to adjust evaluation parameters (see figure 22.23). The partition function algorithm includes a number
250. Figure 16.25: Choosing sequence CAA32220 for proteolytic cleavage.
CLC Combined Workbench allows you to detect proteolytic cleavages for several sequences at a time. Correct the list of sequences by selecting a sequence and clicking the arrows pointing left and right. Then click Next to go to Step 2.
In Step 2 you can select proteolytic cleavage enzymes. The list of available enzymes will be expanded continuously. Presently, the list contains the enzymes shown in figure 16.26. The full list of enzymes and their cleavage patterns can be seen in Appendix section C.
Figure 16.26: Setting parameters for proteolytic cleavage detection. (Dialog fields: Enzyme criteria with min. and max. number of cleavage sites; criteria for the list of fragments with min. and max. fragment length and min. and max. fragment mass.)
Select the enzymes you want to use for detection. When the relevant enzymes are chosen, click Next.
In Step 3 you can set parameters for the detection. This limits the number of detected cleavages. Figure 16.27 shows an ex
251. ata on network drive
Search all your data

Appendix A: Comparison of workbenches (table columns: Free, Protein, DNA, RNA, Combined)

Database searches
- GenBank Entrez searches
- UniProt searches (Swiss-Prot/TrEMBL)
- Web-based sequence search using BLAST
- BLAST on local database
- Creation of local BLAST database
- PubMed lookup
- Web-based lookup of sequence data
- Search for structures (at NCBI)

General sequence analyses
- Linear sequence view
- Circular sequence view
- Text based sequence view
- Editing sequences
- Adding and editing sequence annotations
- Advanced annotation table
- Join multiple sequences into one
- Sequence statistics
- Shuffle sequence
- Local complexity region analyses
- Advanced protein statistics
- Comprehensive protein characteristics report

Nucleotide analyses
- Basic gene finding
- Reverse complement without loss of annotation
- Restriction site analysis
- Advanced interactive restriction site analysis
- Translation of sequences from DNA to proteins
- Interactive translations of sequences and alignments
- G/C content analyses and graphs

Protein analyses
- 3D molecule view
- Hydrophobicity analyses
- Antigenicity analysis
- Protein charge analysis
- Reverse translation from protein to DNA
- Proteolytic cleavage detection
- Prediction of signal
252. ation of how to export graphics.
7.1 Bioinformatic data formats
The different bioinformatic data formats are imported in the same way; therefore, the following description of data import is an example which illustrates the general steps to be followed, regardless of which format you are handling.
7.1.1 Import of bioinformatic data
Here follows a list of the formats which CLC Combined Workbench handles, and a description of which type of data the different formats support.

File type            Suffix           File format used for
ACE files            .ace             contigs
Phylip Alignment     .phy             alignments
GCG Alignment        .msf             alignments
Clustal Alignment    .aln             alignments
Newick               .nwk             trees
FASTA                .fsa/.fasta      sequences
GenBank              .gbk/.gb/.gp     sequences
GCG sequence         .gcg             sequences (only import)
PIR (NBRF)           .pir             sequences (only import)
Staden               .sdn             sequences (only import)
DNAstrider           .str/.strider    sequences
Swiss-Prot           .swp             protein sequences
Lasergene sequence   .pro             protein sequence (only import)
Lasergene sequence   .seq             nucleotide sequence (only import)
Embl                 .embl            nucleotide sequences
Nexus                .nxs/.nexus      sequences, trees, alignments, and sequence lists
CLC                  .clc             sequences, trees, alignments, reports, etc.
Text                 .txt             all data in a textual format
CSV                  .csv             tables, each cell separated with semicolons (only export)
ABI                  .abi             trace files (only import)
AB1                  .ab1             trace files (only import)
SCF2                 .scf             trace files (only
253. ault, CLC Combined Workbench displays a sequence with annotations (colored arrows on the sequence) and zoomed to see the residues.
In this tutorial we want to have an overview of the whole sequence. Hence:
click Zoom Out in the Toolbar | click the sequence until you can see the whole sequence
In the following we will show how the same sequence can be displayed in two different views:
double-click sequence NP_058652 in the Navigation Area
This opens an additional tab. Drag this tab to the bottom of the view (see figure 2.4).
The result is two views of the same sequence in the View Area, as can be seen in figure 2.5. If you want to display a part of the sequence, it is possible to select it and open it in another view
Figure 2.4: Dragging the tab down to the bottom of the view will display a gray area indicating that the tab can be dropped here and split the view.
254. avage enzymes
D Formats for import and export
  D.1 List of bioinformatic data formats
  D.2 List of graphics data formats
E IUPAC codes for amino acids
F IUPAC codes for nucleotides
Bibliography
V Index

Part I Introduction
Chapter 1 Introduction to CLC Combined Workbench
Contents
1.1 Contact information
1.2 Download and installation
  1.2.1 Program download
  1.2.2 Installation on Microsoft Windows
  1.2.3 Installation on Mac OS X
  1.2.4 Installation on Linux with an installer
  1.2.5 Installation on Linux with an RPM package
1.3 System requirements
1.4 Licenses
  1.4.1 Demo license concept
  1.4.2 Getting and activating the demo license
  1.4.3 Fixed license
  1.4.4 Floating license
  1.4.5 Upgrading or changing licenses
1.5 About CLC Workbenches
  1.5.1 New program feature request
255. benches, and provide an easy way to customize and extend their functionalities. All workbenches will be improved continuously. If you have a CLC Free Workbench or a commercial workbench, and you are interested in receiving news about updates, you should register your e-mail and contact data on http://www.clcbio.com, if you haven't already registered when you downloaded the program.
1.5.1 New program feature request
The CLC team is continuously improving the workbench with our users' interests in mind. Therefore, we welcome all requests and feedback from users, and hope you will suggest new features or more general improvements to the program on support@clcbio.com.
1.5.2 Report program errors
CLC bio is doing everything possible to eliminate program errors. Nevertheless, some errors might have escaped our attention. If you discover an error in the program, you can use the Report a Program Error function in the Help menu of the program to report it. In the Report a Program Error dialog you are asked to write your e-mail address (optional). This is because we would like to be able to contact you for further information about the error or for helping you with the problem.
Note: No personal information is sent via the error report. Only the information which can be seen in the Program Error Submission Dialog is submitted.
You can also write an e-mail to support@clcbio.com. Remember to specify how
256. ble. The two different views of the same sequence list are shown in split screen in figure 10.23.
10.7.1 Graphical view of sequence lists
The graphical view of sequence lists is almost identical to the view of single sequences (see section 10.1). The main difference is that you now can see more than one sequence in the same view. However, you also have a few extra options for sorting, deleting and adding sequences:
• To add extra sequences to the list, right-click an empty (white) space in the view, and select Add Sequences.
Figure 10.23: A sequence list of two sequences can be viewed in either a table or in a graphical sequence list.
• To delete a sequence from the list, right-click the
257. both protein and DNA. The analyses are described in this chapter.
14.1 Shuffle sequence
In some cases, it is beneficial to shuffle a sequence. This is an option in the Toolbox menu under General Sequence Analyses. It is normally used for statistical analyses, e.g. when comparing an alignment score with the distribution of scores of shuffled sequences.
Shuffling a sequence removes all annotations that relate to the residues.
select sequence | Toolbox in the Menu Bar | General Sequence Analyses | Shuffle Sequence
or right-click a sequence | Toolbox | General Sequence Analyses | Shuffle Sequence
This opens the dialog displayed in figure 14.1.
Figure 14.1: Choosing sequence for shuffling.
If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.
Click Next to determi
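The statistical use of shuffling mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the score is a toy count of identical positions in an ungapped comparison, not a real alignment score, and all function names are hypothetical. The idea is simply to compare the observed score with the scores obtained from shuffled versions of the same sequence, which keep the residue composition but destroy the order.

```python
import random

def shuffle_sequence(sequence, rng):
    """Return a random permutation of the residues (composition is preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

def identity_score(a, b):
    """Toy score (an assumption): number of identical positions, no gaps."""
    return sum(x == y for x, y in zip(a, b))

def empirical_p_value(query, target, n_shuffles=1000, seed=0):
    """Fraction of shuffled queries scoring at least as well as the real one."""
    rng = random.Random(seed)
    observed = identity_score(query, target)
    as_good = sum(
        identity_score(shuffle_sequence(query, rng), target) >= observed
        for _ in range(n_shuffles)
    )
    # Add-one correction so the estimate is never exactly zero.
    return (as_good + 1) / (n_shuffles + 1)

p = empirical_p_value("MVHLTDAEKSAVSCLWAKVN", "MVHLTPEEKSAVTALWGKVN")
print(p)
```

A small value suggests that the observed score is unlikely to arise from composition alone.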
258. called an exterior (or external) loop (see figure 22.29F). These regions do not contribute to the total free energy.
• Dangling nucleotide. A dangling nucleotide is a single-stranded nucleotide that forms a stacking interaction with an adjacent base pair. A dangling nucleotide can be a 3' or 5' dangling nucleotide, depending on the orientation (see figure 22.29G). The energy contribution is determined by the single-stranded nucleotide, its orientation and the adjacent base pair.
• Non-GC terminating stem. If a base pair other than a G-C pair is found at the end of a stem, an energy penalty is assigned (see figure 22.29H).
• Coaxial interaction. Coaxial stacking is a favorable interaction of two stems where the base pairs at the ends can form a stacking interaction. This can occur between stems in a multi-loop and between the stems of two different sequential structures. Coaxial stacking can occur between stems with no intervening nucleotides (adjacent stems) and between stems with one intervening nucleotide from each strand (see figure 22.29I). The energy contribution is determined by the adjacent base pairs and the intervening nucleotides.
Experimental constraints
A number of techniques are available for probing RNA structures. These techniques can determine individual components of an existing structure, such as the existence of a given base pair. It is possible to add such experimental con
259. cation of a putative antigenic determinant.
Find
The Find function can also be invoked by pressing Ctrl + Shift + F (⌘ + Shift + F on Mac).
The Find function can be used for searching the sequence. Clicking the find button will search for the first occurrence of the search term. Clicking the find button again will find the next occurrence, and so on. If the search string is found, the corresponding part of the sequence will be selected.
• Search term. Enter the text to search for. The search function does not discriminate between lower and upper case characters.
• Sequence search. Search the nucleotides or amino acids. For amino acids, the single-letter abbreviations should be used for searching. The sequence search also has a set of advanced search parameters:
  - Include negative strand. This will search on the negative strand as well.
  - Treat ambiguous characters as wildcards in search term. If you search for e.g. ATN, you will find both ATG and ATC. If you wish to find literally exact matches for ATN (i.e. only find ATN, not ATG), this option should not be selected.
  - Treat ambiguous characters as wildcards in sequence. If you search for e.g. ATG, you will find both ATG and ATN. If you have large regions of Ns, this option should not be selected.
• Annotation search. Searches the annotations on the sequence. The search is performed both on the labels of the annotations and on the text appearing in the tooltip that you see when yo
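The two wildcard options for the sequence search can be pictured as a per-position compatibility test. The Python sketch below is an assumption about how such a test could work, not a description of the workbench's code; `compatible` and `find_all` are hypothetical names. A search-term character and a sequence character match if they are identical, or if the sets of bases they stand for overlap once the relevant wildcard option is switched on.

```python
# Bases represented by each IUPAC nucleotide code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def compatible(term_char, seq_char, wild_in_term, wild_in_sequence):
    """Decide whether one search-term character matches one sequence character."""
    if term_char == seq_char:
        return True
    term_set = set(IUPAC.get(term_char, term_char)) if wild_in_term else {term_char}
    seq_set = set(IUPAC.get(seq_char, seq_char)) if wild_in_sequence else {seq_char}
    return bool(term_set & seq_set)

def find_all(sequence, term, wild_in_term=False, wild_in_sequence=False):
    """Return the 1-based start positions of all matches (case-insensitive)."""
    sequence, term = sequence.upper(), term.upper()
    hits = []
    for start in range(len(sequence) - len(term) + 1):
        window = sequence[start:start + len(term)]
        if all(compatible(t, s, wild_in_term, wild_in_sequence)
               for t, s in zip(term, window)):
            hits.append(start + 1)
    return hits

print(find_all("ATGATNATC", "ATN", wild_in_term=True))      # [1, 4, 7]
print(find_all("ATGATNATC", "ATN"))                          # [4]
print(find_all("ATGATNATC", "ATG", wild_in_sequence=True))  # [1, 4]
```

The last call also shows why the sequence-side option is discouraged for sequences with long runs of N: every window inside such a run would match.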
260. ce in the table.
20.6 Bioinformatics explained: Multiple alignments
Multiple alignments are at the core of bioinformatical analysis. Often the first step in a chain of bioinformatical analyses is to construct a multiple alignment of a number of homologous DNA or protein sequences. However, despite their frequent use, the development of multiple alignment algorithms remains one of the algorithmically most challenging areas in bioinformatical research.
Constructing a multiple alignment corresponds to developing a hypothesis of how a number of sequences have evolved through the processes of character substitution, insertion and deletion. The input to multiple alignment algorithms is a number of homologous sequences, i.e. sequences that share a common ancestor and most often also share molecular function. The generated alignment is a table (see figure 20.16) where each row corresponds to an input sequence and each column corresponds to a position in the alignment. An individual column in this table represents residues that have all diverged from a common ancestral residue. Gaps in the table (commonly represented by a '-') represent positions where residues have been inserted or deleted and thus do not have ancestral counterparts in all sequences.
20.6.1 Use of multiple alignments
Once a multiple alignment is constructed, it can form the basis for a number of analyses:
• The phylogenetic relationship of the sequence
261. cense server could not be located on your network. If the problem persists, please contact your local license server administrator.
Figure 1.12: Unable to contact license server.
In this case, you need to make sure that you have access to the license server, and that the server is running.
1.4.5 Upgrading or changing licenses
If you start the Workbench without a valid license, a dialog similar to the one in figure 1.2 will be shown. However, there may be situations where you wish to use another license, or see information about the license you currently use. In this case, open the license manager:
Help | License Manager
The license manager is shown in figure 1.13.
Figure 1.13: The license manager.
Besides letting you borrow licenses (see section 1.4.4), this dialog can be used to:
• See information about the license (e.g. what kind of license, when it expires).
• Configure how to connect to a license server (Configure License Server, the button at the lower left corner). Clicking this button will display a dialog similar to figure 1.10.
• Upgrade from an evaluation license to a fixed license by clicking the Upgrade license button. Follow the description in section 1.4.3.
If you wish to switch from using a floating license to a fixed license, click Configure License Server and choose not to connect to a lic
262. ces your query in the BLAST search queue. After a short while, the result is received and opened in a new view.
The output is shown in figure 2.25 and consists of a list of potential homologs that are sorted by their BLAST match score and shown in descending order below the query sequence.
Try placing your mouse cursor over a potential homologous sequence. You will see that a context
Figure 2.24: The BLAST search is limited to homo sapiens[ORGN]. The remaining parameters are left as default.
263. changed databases and have the updated BLAST database stored locally or on a shared network drive at all times. Most BLAST databases on the NCBI site are updated on a daily basis to include all recent sequence submissions to GenBank.
A few commercial software packages are available for searching your own data. The advantage of using a commercial program is obvious when BLAST is integrated with the existing tools of these programs. Furthermore, they let you perform BLAST searches and retain annotations on the query sequence (see figure 12.28). It is also much easier to batch download a selection of hit sequences for further inspection.
Figure 12.28: Snippet of alignment view of BLAST results from CLC Combined Workbench. Individual alignments are represented directly in a graphical view. The top sequence is the query sequence and is shown with a selection of annotations.
12.6.8 What you cannot get out of BLAST
Don't expect BLAST to produce the best available alignment. BLAST is a heuristic method which does not guarantee the best results, and therefore you cannot rely on BLAST if you wish to find all the hits in the database. Instead, use the Smith-Waterman algorithm for obtaining the best possible local align
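For readers who want to see what the exhaustive alternative to the BLAST heuristic looks like, here is a minimal Python sketch of the Smith-Waterman recurrence. It only returns the best local alignment score, uses a linear gap penalty, and the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration; real tools use substitution matrices and affine gap costs.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: a local alignment may start anywhere.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best

print(smith_waterman_score("ACGT", "ACGT"))  # 8
print(smith_waterman_score("AAAA", "TTTT"))  # 0
```

Every cell of the matrix is evaluated, which is why the result is guaranteed to be optimal under the chosen scoring scheme, and also why the method is much slower than BLAST on large databases.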
264. city Plot
This opens a dialog. The first step allows you to add or remove sequences. Clicking Next takes you through to Step 2, which is displayed in figure 16.11.
Figure 16.11: Step two in the Hydrophobicity Plot allows you to choose hydrophobicity scale and the window size.
Figure 16.12: The result of the hydrophobicity plot calculation and the associated Side Panel.
The Window size is the width of the window where the hydrophobicity is calculated. The wider the window, the less volatile the graph. You can choose from a number of hydrophobicity scales, which are further explained in section 16.5.3. Click Next if you wish to adjust how to
265. clude information on the background distribution of amino acids from a range of organisms.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This will open a view showing the patterns found as annotations on the original sequence (see figure 14.23). If you have selected several sequences, a corresponding number of views will be opened.
Figure 14.23: Sequence view displaying two discovered patterns.
14.7.2 Pattern search output
If the analysis is performed on several sequences at a time, the method will search for patterns in the sequences and open a new view for each of the sequences in which a pattern was discovered. Each novel pattern will be represented as an annotation of the type Region. More information on each found pattern is available through the tool-tip, including detailed information on the position of the pattern and quality scores.
It is also possible to get a tabular view of all found patterns in one combined table. Then each found pattern will be represented with various information on obtained scores, quality of the pattern and position in the sequence.
The emission values of the HMM that was actually used are presented in a table view. This model can be saved and used to search for a similar pattern in new or unknown sequences.
Chapter 15 Nucleotide analyses
Contents
15.1 Convert DNA to RNA
266. ct any inconsistencies that may exist between different reads.
First, select the five trace files (the reads) in the Assembly folder in the Nucleotide folder of
Figure 2.40: A list of primers. To the right is the Side Panel showing the available choices of information to display.
267. d in figure 16.14. Notice that you can choose the height of the graphs underneath the sequence. 16.5.3 Bioinformatics explained: Protein hydrophobicity. Calculation of hydrophobicity is important to the identification of various protein features, such as membrane-spanning regions, antigenic sites, exposed loops or buried residues. Usually these calculations are shown as a plot along the protein sequence, making it easy to identify the location of potential protein features. The hydrophobicity is calculated by sliding a fixed-size window, spanning an odd number of residues, over the protein sequence. At the central position of the window, the average hydrophobicity of the entire window is plotted (see figure 16.15). Figure 16.15: Plot of hydrophobicity along the amino acid sequence. Hydrophobic regions on the sequence have higher numbers according to the graph below the sequence; furthermore, hydrophobic regions are colored on the sequence. Red indicates regions with high hydrophobicity and blue indicates regions with low hydrophobicity. Hydrophobicity scales. Several hydrophobicity scales have been published for various uses. Many of the commonly used hydrophobicity scales are described below. Kyte-Doolittle scale. The Kyte-Doolittle scale is widely used for detecting hydrophobic regions in proteins. Regions with a posit
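The sliding-window calculation described in section 16.5.3 can be sketched in a few lines of Python. This is only an illustration and not the Workbench's own implementation: the function name hydrophobicity_profile and the default window of 9 residues are assumptions made for the example, while the scale values are the published Kyte-Doolittle hydropathy values.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(protein, window=9, scale=KYTE_DOOLITTLE):
    """Return (position, mean hydrophobicity) pairs, positions 1-based.

    The window size (default 9) is an assumption for this sketch; it must
    be odd so that the window has a well-defined central residue.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    profile = []
    # Only positions where the whole window fits on the sequence get a value.
    for center in range(half, len(protein) - half):
        residues = protein[center - half:center + half + 1]
        mean = sum(scale[aa] for aa in residues) / window
        profile.append((center + 1, mean))
    return profile


if __name__ == "__main__":
    for position, value in hydrophobicity_profile("MVHLTPEEKSAVTALWGKV", window=5):
        print(position, round(value, 2))
```

Plotting the second element of each pair against the first gives a curve of the kind described for figure 16.15. Positions closer than half a window to either end get no value in this sketch; other tools may handle the ends differently.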
268. d in figure 15.6. Figure 15.6: Create Reading Frame dialog. If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. If you want to adjust the parameters for finding open reading frames, click Next. 15.5.1 Open reading frame parameters. This opens the dialog displayed in figure 15.7. Figure 15.7: Create Reading Frame dialog. The adjustable parameters for the search are: Start Codon: AUG: Most commonly used start codon. Any. All start codons in g
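To make the effect of these parameters concrete, the following is a minimal Python sketch of an open reading frame search. It is not the Workbench's algorithm. The function names are hypothetical, the sketch works on DNA letters (ATG rather than AUG), it uses the stop codons of the standard genetic code, it ignores open-ended ORFs, and it assumes that the minimum length is counted in codons, excluding the stop codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def _scan_strand(dna, start_codons, min_codons, include_stop):
    """Find ORFs in the three reading frames of one strand."""
    hits = []
    for frame in range(3):
        orf_start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if orf_start is None:
                if codon in start_codons:
                    orf_start = pos          # first start codon opens the ORF
            elif codon in STOP_CODONS:
                codons = (pos - orf_start) // 3   # stop codon not counted
                if codons >= min_codons:
                    end = pos + 3 if include_stop else pos
                    hits.append((orf_start, end))
                orf_start = None
    return hits


def find_orfs(dna, start_codons=("ATG",), min_codons=100,
              both_strands=True, include_stop=False):
    """Return (strand, start, end) tuples, 0-based and end-exclusive,
    always expressed in coordinates of the input (forward) sequence."""
    dna = dna.upper().replace("U", "T")
    orfs = [("+", s, e)
            for s, e in _scan_strand(dna, start_codons, min_codons, include_stop)]
    if both_strands:
        reverse = dna.translate(COMPLEMENT)[::-1]
        n = len(dna)
        orfs += [("-", n - e, n - s)
                 for s, e in _scan_strand(reverse, start_codons, min_codons, include_stop)]
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTTAGCC", min_codons=3))
```

Passing a longer tuple as start_codons (for example ATG, CTG and TTG) corresponds to the choice of alternative start codons in the dialog.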
269. d on all the individual characters (nucleotides or amino acids). Parsimony. In parsimony-based methods, a number of sites are defined which are informative about the topology of the tree. Based on these, the best topology is found by minimizing the number of substitutions needed to explain the informative sites. Parsimony methods are not based on explicit evolutionary models. Maximum Likelihood. Maximum likelihood and Bayesian methods (see below) are probabilistic methods of inference. Both have the pleasing properties of using explicit models of molecular evolution and allowing for rigorous statistical inference. However, both approaches are very computer intensive. A stochastic model of molecular evolution is used to assign a probability (likelihood) to each phylogeny, given the sequence data of the OTUs. Maximum likelihood inference (Felsenstein, 1981) then consists of finding the tree which assigns the highest probability to the data. Bayesian inference. The objective of Bayesian phylogenetic inference is not to infer a single correct phylogeny, but rather to obtain the full posterior probability distribution of all possible phylogenies. This is obtained by combining the likelihood and the prior probability distribution of evolutionary parameters. The vast number of possible trees means that Bayesian phylogenetics must be performed by approximate Monte Carlo-based methods (Larget and Simon, 1999; Yang and Rannala, 1997). 21.2.4 Interpretin
270. d problem, and a large body of work has been done to refine prediction algorithms and to experimentally estimate the relevant biological parameters. In CLC Combined Workbench we offer the user a number of tools for analyzing and displaying RNA structures. These include: secondary structure prediction using state-of-the-art algorithms and parameters; calculation of the full partition function to assign probabilities to structural elements and hypotheses; scanning of large sequences to find local structure signal; inclusion of experimental constraints in the folding process; and advanced viewing and editing of secondary structures and structure information. 22.1 RNA secondary structure prediction. CLC Combined Workbench uses a minimum free energy (MFE) approach to predict RNA secondary structure. Here, the stability of a given secondary structure is defined by the amount of free energy used or released by its formation. The more negative the free energy of a structure, the more likely is its formation, since more stored energy is released by the event. Free energy contributions are considered additive, so the total free energy of a secondary structure can be calculated by adding the free energies of the individual structural elements. Hence, the task of the prediction algorithm is to find the secondary structure with the minimum free energy. As input to the algorithm, empirical energy parameters are used. These parameters summarize the free energy contribu
271. ded to the tree as the parent of the two chosen nodes. The clusters to be joined are chosen as those with minimal pairwise distance. The branch lengths are set corresponding to the distance between clusters, which is calculated as the average distance between pairs of sequences in each cluster. The algorithm assumes that the distance data has the so-called molecular clock property, i.e. the divergence of sequences occurs at the same constant rate at all parts of the tree. This means that the leaves of UPGMA trees all line up at the extant sequences and that a root is estimated as part of the procedure. Neighbor Joining. The neighbor joining algorithm (Saitou and Nei, 1987), on the other hand, builds a tree where the evolutionary rates are free to differ in different lineages, i.e. the tree does not have a particular root. Some programs always draw trees with roots for practical reasons, but for neighbor joining trees no particular biological hypothesis is postulated by the placement of the root. The method works very much like UPGMA. The main difference is that, instead of using pairwise distance, this method subtracts the distance to all other nodes from the pairwise distance. This is done to take care of situations where the two closest nodes are not neighbors in the real tree. The neighbor joining algorithm is generally considered to be fairly good and is widely used. Algorithms that improve its cubic time performance exist. The improvement is only sig
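The UPGMA procedure described above (repeatedly join the two clusters with minimal average pairwise distance, and place the new parent node at half that distance) can be sketched as follows. This is a small illustrative Python version, not the Workbench's code; the function name upgma, the nested-tuple tree representation and the example distances are all assumptions made for the example.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a UPGMA tree.

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns (tree, root_height); an inner node is a pair of
    (subtree, branch_length) tuples and a leaf is its label.
    """
    # cluster id -> (subtree, member labels, height of the subtree's root)
    clusters = {i: (name, [name], 0.0) for i, name in enumerate(names)}
    next_id = len(names)

    def average_distance(members_a, members_b):
        total = sum(dist[frozenset((a, b))] for a in members_a for b in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        # Choose the two clusters with minimal average pairwise distance.
        i, j = min(combinations(clusters, 2),
                   key=lambda pair: average_distance(clusters[pair[0]][1],
                                                     clusters[pair[1]][1]))
        tree_i, members_i, height_i = clusters.pop(i)
        tree_j, members_j, height_j = clusters.pop(j)
        # Molecular clock: the parent sits at half the cluster distance,
        # so every leaf below it ends up equally far from it.
        height = average_distance(members_i, members_j) / 2
        node = ((tree_i, height - height_i), (tree_j, height - height_j))
        clusters[next_id] = (node, members_i + members_j, height)
        next_id += 1

    (tree, _, height), = clusters.values()
    return tree, height


if __name__ == "__main__":
    d = {frozenset("AB"): 2.0, frozenset("AC"): 6.0, frozenset("BC"): 6.0}
    print(upgma(["A", "B", "C"], d))
```

In the example, A and B are joined first at height 1.0, and C joins them at height 3.0, so all three leaves lie at the same distance from the root, which is the molecular clock property mentioned above. The sketch recomputes average distances from scratch at every step for clarity; a practical implementation would update a distance matrix instead.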
272. deleted. Figure 3.16: Two running and a number of terminated processes in the Toolbox. If you close the program while there are running processes, a dialog will ask if you are sure that you want to close the program. Closing the program will stop the process, and it cannot be restarted when you open the program again. 3.4.2 Toolbox. The content of the Toolbox tab in the Toolbox corresponds to Toolbox in the Menu Bar. The Toolbox can be hidden, so that the Navigation Area is enlarged and thereby displays more elements: View | Show/Hide Toolbox. The tools in the toolbox can be accessed by double-clicking or by dragging elements from the Navigation Area to an item in the Toolbox. 3.4.3 Status Bar. As can be seen from figure 3.1, the Status Bar is located at the bottom of the window. In the left side of the bar is an indication of whether the computer is making calculations or whether it is idle. The right side of the Status Bar indicates the range of the selection of a sequence. See section 3.3.6 for more about the Selection mode button. 3.5 Workspace. If you are working on a project and have arranged the views for this project, you c
273. described below. 1.4.1 Demo license concept. We offer a fully functional demo version of CLC Combined Workbench to all users, free of charge. If you have already purchased CLC Combined Workbench, you can skip this section and go directly to section 1.4.3. Each user is entitled to a 30-day demo of CLC Combined Workbench. If you need more time for evaluating, another two weeks of demo can be requested. We use the concept of quid pro quo: the last two weeks of free demo time given to you are therefore accompanied by a short-form questionnaire where you have the opportunity to give us feedback about the program. If the 30 days plus the two weeks are not enough time for evaluating the program, you can request more demo time. The 30-day demo is offered for each major release of CLC Combined Workbench. You will therefore have the opportunity to try the next major version when it is released. If you purchase CLC Combined Workbench, the first year of updates is included. Internet connection is required for a demo license. To prevent unauthorized use of the program, you must be connected to the Internet while starting up a demo version of CLC Combined Workbench. An additional online check will be conducted 24 hours after the first start of the workbench. After running CLC Combined Workbench for 24 hours, if you are not connected to the Internet, you will be met with the dialog shown in figure 1.1
274. distinct structure elements: Stacking of base pairs: A stacking of two consecutive pairs occurs if i' - i = 1 = j - j'. Only canonical base pairs (A-U, G-C or G-U) are allowed (see figure 22.29B). The energy contribution is determined by the type and order of the two base pairs. Bulge: A bulge loop occurs if i' - i > 1 or j - j' > 1, but not both. This means that the two base pairs enclose an unpaired region of length 0 on one side and an unpaired region of length ≥ 1 on the other side (see figure 22.29C). The energy contribution of a bulge is determined by the length of the unpaired loop region and the two closing base pairs. Interior loop: An interior loop occurs if both i' - i > 1 and j - j' > 1. This means that the two base pairs enclose an unpaired region of length ≥ 1 on both sides (see figure 22.29D). The energy contribution of an interior loop is determined by the length of the unpaired loop region and the four unpaired bases adjacent to the opening and the closing base pair. Multi-loop opened: A base pair with more than two accessible base pairs gives rise to a multi-loop, a loop from which three or more stems are opened (see figure 22.29E). The energy contribution of a multi-loop depends on the number of Stems opened in multi-loop that protrude from the loop. Other structure elements: A collection of single-stranded bases not accessible from any base pair is
275. done in one of the following ways. If you have downloaded an installer: Locate the downloaded installer and double-click the icon. The default location for downloaded files is your desktop. If you are installing from a CD: Insert the CD into your CD-ROM drive. Choose Install CLC Combined Workbench from the menu displayed. If you already have Java installed on your computer, you can choose Install CLC Combined Workbench without Java. Installing the program is done in the following steps: On the welcome screen, click Next. Read and accept the License agreement and click Next. Choose where you would like to install the application and click Next. Choose a name for the Start Menu folder used to launch CLC Combined Workbench and click Next. Choose if CLC Combined Workbench should be used to open CLC files and click Next. Choose where you would like to create shortcuts for launching CLC Combined Workbench and click Next. Choose if you would like to associate .clc files to CLC Combined Workbench. If you check this option, double-clicking a file with a .clc extension will open the CLC Combined Workbench. Wait for the installation process to complete, choose whether you would like to launch CLC Combined Workbench right away, and click Finish. When the installation is complete, the program can be launched from the Start Menu or from one of the shortcuts you chose to create. 1.2.3 Installation on Mac OS X. Starting the
276. e in the adjoining text field. Figure 2.12: NCBI search view. Click Start search to commence the search in NCBI. 2.4.1 Searching for matching objects. When the search is complete, the list of hits is shown. If the desired complete human hemoglobin D
277. Figure 21.3: Method choices for phylogenetic inference. The top shows a tree found by neighbor joining, while the bottom shows a tree found by UPGMA. The latter method assumes that the evolution occurs at a constant rate in different lineages. 21.1.2 Tree View Preferences. The Tree View preferences are these: Text format: Changes the text format for all of the nodes the tree contains. Text size: The size of the text representing the nodes can be set to tiny, small, medium, large or huge. Font: Sets the font of the text of all nodes. Bold: Sets the text bold if enabled. Tree Layout: Different layouts for the tree. Node symbol: Changes the symbol of nodes into box, dot, circle, or none if you don't want a node symbol. Layout: Displays the tree layout as standard or topology. Show internal node labels: This allows you to see labels for the internal nodes. Initially, there are no labels, but right-clicking a node allows you to type a label. Label color: Changes the color of the labels on the tree nodes. Branch label color: Modifies the color
278. Figure 10.10: Enzymes with compatible ends. At the top you can choose whether the enzymes considered should have an exact match or not. Since a number of restriction enzymes have ambiguous cut patterns, there will be variations in the resulting overhangs. When choosing All matches, you cannot be 100% sure that the overhang will match, and you will need to inspect the sequence further afterwards. We advise trying Exact match first, and using All matches as an alternative if a satisfactory result cannot be achieved. At the bottom of the dialog, the list of enzymes producing compatible overhangs is shown. Use the arrows to add enzymes, which will be displayed on the sequence when you press Finish. When you have added the relevant enzymes, click Finish, and the enzymes will be added to the Side Panel and their cut sites displayed on the sequence. 10.1.3 Selecting parts of the sequence. You can select parts of a sequence: Click Selection in the Toolbar | press and hold down the mouse button on the sequence where you want the selection to start | move the mouse to the end of the selection while holding the button | release the mouse button. Alternatively, you can search for a specific interval using the find function described above. If you have made a selection and wish to adjust it
279. e sSNPs represent triplets encoding the same amino acid before and after the polymorphism arises, while nsSNPs, on the other hand, alter the encoded amino acid and may signal chain termination. The identification and description of single nucleotide polymorphisms is a growing area of research. SNPs can be identified through e.g. direct DNA sequencing of PCR products followed by assembly and contig analysis, by array analysis, or by RT-PCR. After identification of SNPs, non-synonymous mutations (nsSNPs) and their possible impacts can be described according to different criteria, such as translation from nucleotide to protein sequence and secondary structure prediction. Other useful resources: SNP fact sheet: http://www.ornl.gov/sci/techresources/Human_Genome/faq/snps.shtml The Single Nucleotide Polymorphism database (dbSNP): http://www.ncbi.nlm.nih.gov/projects/SNP Creative Commons License. All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes, under the following conditions: You must attribute the work in its original form, and CLC bio has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work.
280. e Panel in common: Annotation Layout and Annotation Types. Figure 10.16: Changing the layout of annotations in the Side Panel. The two groups are shown in figure 10.16. In the Annotation layout group you can specify how the annotations should be displayed (notice that there are some minor differences between the different sequence views): Show annotations: Determines whether the annotations are shown. Position: On sequence: The annotations are placed on the sequence. The residues are visible through the annotations if you have zoomed in to 100%. Next to sequence: The annotations are placed above the sequence. Offset: If several annotations cover the same part of a sequence, they can be spread out. Piled: The annotations are piled on top of each other. Only the one at front is visible. Little offset: The annotations are piled on top of each other, but they have been offset a little. More offset: Same as above, but with mo
281. Figure 6.1: The Print dialog. 6.1 Selecting which part of the view to print. In the print dialog you can choose to Print visible area or Print whole view. These options are available for all views that can be zoomed in and out. Figure 6.2 shows a view of a circular sequence which is zoomed in so that you can only see a part of it. Figure 6.2: A circular sequence as it looks on the screen. When selecting Print visible area, your print will reflect the part of the sequence that is visible in the view. The result from printing the view from figure 6.2 and choosing Print visible area can be seen in figure 6.3. Figure 6.3: A print of the sequence selecting Print visible area. On the other hand, if you select Print whole view, you will get a result that looks like figure 6.4. This means that you also print the part of the sequence which is not visible when you have zoomed in. Figure 6.4: A print of the sequence selecting Print whole view. The whole sequence is shown, even though the view is zoomed in on a
282. e and modify a phylogenetic tree; 2.6.1; 2.7 Tutorial: Find restriction sites; 2.7.1 The Side Panel way of finding restriction sites; 2.7.2 The Toolbox way of finding restriction sites; 2.8 Tutorial: BLAST search; 2.9 Tutorial: Tips for specialized BLAST searches; 2.9.1 Locate a protein sequence on the chromosome; 2.9.2 BLAST for primer binding sites; 2.9.3 Finding remote protein homologues; 2.9.4 Further reading; 2.10 Tutorial: Proteolytic cleavage detection; 2.11 Tutorial: Primer design; 2.11.1 Specifying a region for the forward primer; 2.11.2 Examining the primer suggestions; 2.11.3 Calculating a primer pair; 2.12 Tutorial: Assembly; 2.12.1 Getting an overview of the contig; 2.12.2 Finding and editing inconsistencies; 2.12.3 Inspecting the traces; 2.12
283. e are other settings beside CLC Standard Settings, you can use this overview to choose which of the settings should be used by default when you open a view. 5.2.1 Import and export Side Panel settings. If you have created a special set of settings in the Side Panel that you wish to share with other users of CLC Workbenches, you can export the settings to a file. The other user can then import the settings and use them on another computer. When you export and import settings, it applies to all the settings for the different views. To export the Side Panel settings, make sure you are at the bottom of the View panel of the Preferences dialog and: Export settings | select a name and location for the settings file | Save. Now the settings are saved in a file with a .vsf extension (View Settings File). This file can now be imported in a workbench on another computer. To import a Side Panel settings file, make sure you are at the bottom of the View panel of the Preferences dialog and: Import settings | locate and select the .vsf file | Import. Then you will see the dialog shown in figure 5.2. The dialog asks if you wish to overwrite existing Side Panel settings in your workbench, or if you wish to merge the imported settings into the existing ones. Note: If you choose to overwrite the existing settings, you will lose all the Side Panel settings that you have previously saved. To avoid confusion of the different import and export options, here i
284. e available if you open the sequence in a linear view. If you wish to export the settings that you have saved, this can be done in the Preferences dialog under the View tab (see section 5.2.1). The remaining icons of figure 5.5 are used to Expand all groups, Collapse all groups, and Dock/Undock Side Panel. Dock/Undock Side Panel makes the Side Panel floating (see below). 5.5.1 Floating Side Panel. The Side Panel of the views can be placed in the right side of a view, or it can be floating (see figure 5.8). By clicking the Dock icon, the floating Side Panel reappears in the right side of the view. The size of the floating Side Panel can be adjusted by dragging the hatched area in the bottom right. Figure 5.8: The floating Side Panel can be moved out of the way, e.g. to allow for a wider view of a table. Chapter 6 Printing. Contents: 6.1 Selecting which part of the view to print; 6.2 Page
285. e color of the letter. Background color: Sets the background color of the residues. Nucleotide info. These preferences only apply to nucleotide sequences. Translation: Displays a translation into protein just below the nucleotide sequence. Depending on the zoom level, the amino acids are displayed with three letters or one letter. Frame: Determines where to start the translation. +1 to -1: Select one of the six reading frames. Selection: This option will only take effect when you make a selection on the sequence. The translation will start from the first nucleotide selected. Making a new selection will automatically display the corresponding translation. Read more about selecting in section 10.1.3. All: Select all reading frames at once. The translations will be displayed on top of each other. Table: The translation table to use in the translation. For more about translation tables, see section 15.4. Only AUG start codons: For most genetic codes, a number of codons can be start codons. Selecting this option only colors the AUG codons green. Single letter codes: Choose to represent the amino acids with a single letter instead of three letters. Trace data: See section 18.1. G/C content: Calculates the G/C content of a part of the sequence and shows it as a gradient of colors or as a graph below the sequence. Window length: Determines the length of the part of the sequence to calculate. A window le
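The six reading frames mentioned under Frame can be illustrated with a short Python sketch. It is not taken from the Workbench: the function name translate_frame is hypothetical, the sketch uses only the standard genetic code (translation table 1) with single-letter amino acid codes, and it assumes that frames -1 to -3 are read from the 5' end of the reverse complement.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate_frame(sequence, frame):
    """Translate one reading frame (+1, +2, +3, -1, -2 or -3).

    Stop codons are written as '*'; codons containing letters other
    than A, C, G and T are written as 'X'.
    """
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be +1, +2, +3, -1, -2 or -3")
    dna = sequence.upper().replace("U", "T")   # accept RNA input as well
    if frame < 0:
        dna = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    start = abs(frame) - 1
    codons = (dna[pos:pos + 3] for pos in range(start, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    for frame in (1, 2, 3, -1, -2, -3):
        print(frame, translate_frame("ATGGCCATTGTAATGGGCCGC", frame))
```

Choosing All in the Side Panel corresponds to calling the function once for each of the six frames, as in the usage example at the end of the block.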
286. e in the lower left triangle. Figure 20.15: A pairwise comparison table. The following settings are present in the side panel: Contents: Upper comparison: Selects the comparison to show in the upper triangle of the table. Upper comparison gradient: Selects the color gradient to use for the upper triangle. Lower comparison: Selects the comparison to show in the lower triangle. Choose the same comparison as in the upper triangle to show all the results of an asymmetric comparison. Lower comparison gradient: Selects the color gradient to use for the lower triangle. Diagonal from upper: Use this setting to show the diagonal results from the upper comparison. Diagonal from lower: Use this setting to show the diagonal results from the lower comparison. No Diagonal: Leaves the diagonal table entries blank. Layout: Lock headers: Locks the sequence labels and table headers when scrolling the table. Sequence label: Changes the sequence labels. Text format: Text size: Changes the size of the table and the text within it. Font: Changes the font in the table. Bold: Toggles the use of boldfa
287. e navigation area. 14.7.1 Pattern discovery search parameters. Various parameters can be set prior to the pattern discovery. The parameters are listed below, and a screen shot of the parameter settings can be seen in figure 14.22. Create and search with new model: This will create a new HMM model based on the selected sequences. The found model will be opened after the run and presented in a table view. It can be saved and used later if desired. Use existing model: It is possible to use already created models to search for the same pattern in new sequences. Minimum pattern length: Here, the minimum length of patterns to search for can be specified. Maximum pattern length: Here, the maximum length of patterns to search for can be specified. Noise: Specify the noise level of the model. This parameter has influence on the level of degeneracy of patterns in the sequence(s). The noise parameter can be 1, 2, 5 or 10 percent. Number of different kinds of patterns to predict: Number of iterations the algorithm goes through. After the first iteration, we force predicted pattern positions in the first run to be members of the background. In that way, the algorithm finds new patterns in the second iteration. Patterns marked Pattern1 have the highest confidence. The maximum number of iterations is 3. Include background distribution: For protein sequences it is possible to in
288. e not accompanied by a separate license agreement or terms of use. By installing, copying, downloading, accessing or otherwise using the Software Product, You agree to be bound by the terms of this EULA. If You do not agree to the terms of this EULA, do not install, access or use the Software Product. Figure 1.3: License Agreement. Please read the License agreement carefully before clicking I accept. In the next step, shown in figure 1.4, select Activate license. Again, you might have to wait a few seconds while the license key is being activated on our server. The license is locked to your computer, and therefore it can be used by anyone using that computer. Figure 1.4: Activate the license key. Now the license key is activated on your computer, and CLC Combined Workbench starts. In all steps of the license dialog, you have an option of resetting the license. This will allow you to start over, importing another license. However, information about which licenses were used on the computer is stored to prevent unauthorized use of demo licenses.
289. e protein sequence as query sequence and click Next. Since you wish to BLAST a protein sequence against a nucleotide sequence, use tblastn, which will automatically translate the nucleotide sequence selected as database. As Target, select NC_000011 that you downloaded. If you are used to BLAST, you will know that you usually have to create a BLAST database before BLASTing, but the Workbench does this on the fly when you just select one or more sequences. Click Next, leave the parameters at their default, click Next again and then Finish. Inspect BLAST result. When the BLAST result appears, make a split view so that both the table and the graphical view are visible (see figure 2.26). This is done by pressing Ctrl (⌘ on Mac) while clicking the table view icon at the bottom of the view. In the table, start out by showing two additional columns: Positive and Query start. These should simply be checked in the Side Panel. Now sort the BLAST table view by clicking the column header Positive. Then press and hold the Ctrl button (⌘ on Mac) and click the header Query start. Now you have sorted the table first on Positive hits and then on the start position of the query sequence. Now you see that you actually have three regions with a 100% positive hit, but at different locations on the chromosome sequence (see figure 2.26).
290. e reads which did not contribute to the visible part of the contig will be omitted, whereas the contributing sequence reads will automatically be placed right below the contig. Show sequence ends: Regions that have been trimmed are shown with faded traces and residues. This illustrates that these regions have been ignored during the assembly. Find Inconsistency: Clicking this button selects the next position where there is a conflict between the sequence reads. Residues that are different from the contig are colored by default, providing an overview of the inconsistencies. Since the next inconsistency in the contig is automatically selected, it is easy to make changes. You can also use the Space key to find the next inconsistency. Sequence layout: There is one additional parameter regarding the sequence layout. Compactness: In the Sequence Layout view preferences you can control the level of sequence detail to be displayed. Not compact: The normal setting with full detail. Low: Hides the trace data and puts the reads' annotations on the sequence. Medium: The labels of the reads and their annotations are hidden, and the residues of the reads cannot be seen. Compact: Even less space between the reads. Furthermore, it is not possible to wrap contigs as you can do with alignments. Alignment info: There is one additional parameter. Coverage: Shows how many sequence reads are contributing information to a
291. e signal peptide prediction is reported for every single amino acid position in the submitted sequence, with high scores indicating that the corresponding amino acid is part of a signal peptide and low scores indicating that it is part of a mature protein. The C-score is the cleavage site score. For each position in the submitted sequence a C-score is reported, which should only be significantly high at the cleavage site. The position numbering of the cleavage site often causes confusion. When a cleavage site position is referred to by a single number, the number indicates the first residue in the mature protein. This means that a reported cleavage site between amino acids 26 and 27 corresponds to the mature protein starting at, and including, position 27. Y-max is a derivative of the C-score combined with the S-score, resulting in a better cleavage site prediction than the raw C-score alone. This is because multiple high-peaking C-scores can be found in one sequence, of which only one is the true cleavage site. The cleavage site is assigned from the Y-score where the slope of the S-score is steep and a significant C-score is found. The S-mean is the average of the S-score, ranging from the N-terminal amino acid to the amino acid assigned the highest Y-max score; thus the S-mean score is calculated over the length of the predicted signal peptide. The S-mean score was used in SignalP version 2.0 as the criterion for discrimina
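The cleavage site numbering convention described in this snippet can be illustrated with a small sketch. It is not part of the Workbench or of SignalP; the function name and the toy sequence are hypothetical, and the only rule it encodes is the one stated in the text: a single-number cleavage site position is the 1-based position of the first residue of the mature protein.

```python
def split_at_cleavage_site(sequence, first_mature_position):
    """Split a precursor protein into (signal peptide, mature protein).

    first_mature_position is 1-based and names the first residue of the
    mature protein, so a site reported "between 26 and 27" is passed as 27.
    """
    if not 1 <= first_mature_position <= len(sequence):
        raise ValueError("cleavage position lies outside the sequence")
    index = first_mature_position - 1  # convert to a 0-based index
    return sequence[:index], sequence[index:]


# Toy precursor (hypothetical): 26 signal residues followed by 4 mature ones.
precursor = "A" * 26 + "MKTL"
signal_peptide, mature_protein = split_at_cleavage_site(precursor, 27)
print(len(signal_peptide), mature_protein)
```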
292. e the work for educational purposes under the following conditions: You must attribute the work in its original form, and CLC bio has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, or build upon this work. SOME RIGHTS RESERVED. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents. CHAPTER 14 GENERAL SEQUENCE ANALYSES 226 14.5 Join sequences CLC Combined Workbench can join several nucleotide or protein sequences into one sequence. This feature can, for example, be used to construct supergenes for phylogenetic inference by joining several disjoint genes into one. Note that when sequences are joined, all their annotations are carried over to the new spliced sequence. Two or more sequences can be joined by: select sequences to join | Toolbox in the Menu Bar | General Sequence Analyses | Join sequences, or select sequences to join | right-click any selected sequence | Toolbox | General Sequence Analyses | Join sequences. This opens the dialog shown in figure 14.17. [Figure 14.17: the Join Sequences dialog, step 1, Select sequences of same type]
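The note that annotations are carried over to the joined sequence implies that their coordinates must be shifted by the length of everything placed before them. A minimal sketch of that bookkeeping follows; the data layout (tuples of name, start, end with 0-based, end-exclusive coordinates) is an assumption made for illustration and is not the Workbench's internal representation.

```python
def join_sequences(records):
    """Join sequences and carry their annotations over to the result.

    records: list of (sequence, annotations) pairs, where annotations is a
    list of (name, start, end) tuples with 0-based, end-exclusive coordinates
    (a hypothetical layout chosen for this sketch).
    """
    joined_parts = []
    joined_annotations = []
    offset = 0  # length of everything already placed in the joined sequence
    for sequence, annotations in records:
        joined_parts.append(sequence)
        for name, start, end in annotations:
            # Shift each annotation by the length of the preceding sequences.
            joined_annotations.append((name, start + offset, end + offset))
        offset += len(sequence)
    return "".join(joined_parts), joined_annotations


supergene, annotations = join_sequences([
    ("ATGC", [("geneA", 0, 4)]),
    ("GGTT", [("geneB", 1, 3)]),
])
print(supergene, annotations)
```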
293. e varied so that you normally use just the word sizes 2 and 3 for these searches. Matrix: A key element in evaluating the quality of a pairwise sequence alignment is the substitution matrix, which assigns a score for aligning any possible pair of residues. The matrix used in a BLAST search can be changed depending on the type of sequences you are searching with (see the BLAST Frequently Asked Questions). Only applicable for protein sequences or translated DNA sequences. Gap Cost: The pull-down menu shows the gap costs (the penalty to open a gap and the penalty to extend a gap). Increasing the gap costs and Lambda ratio will result in alignments with fewer gaps. The more limitations are applied to the search parameters, the faster the search will be conducted. If no limitations are applied, the BLAST search may take several minutes. 12.1.1 BLAST a selection against NCBI If you only wish to BLAST a part of a sequence, this is possible directly from the sequence view: select the region that you wish to BLAST | right-click the selection | BLAST Selection Against NCBI. This will go directly to the dialog shown in figure 12.3, and the rest of the options are the same as when performing a BLAST search with a full sequence. 12.2 BLAST Against Local Database CLC Combined Workbench lets you conduct a BLAST search in a local database. See section 12.4 for more about how to create a database. CHAPTER 12 BLAST SEARC
294. each pair of sequences are compared to each other. This provides an overview of the diversity among the sequences in the alignment. In CLC Combined Workbench this is done by creating a comparison table: Toolbox in the Menu Bar | Alignments and Trees | Pairwise Comparison, or right-click alignment in Navigation Area | Toolbox | Alignments and Trees | Pairwise Comparison. This opens the dialog displayed in figure 20.13. Figure 20.13: Creating a pairwise comparison table. If an alignment was selected before choosing the Toolbox action, this alignment is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove elements from the Navigation Area. Click Next to adjust parameters. CHAPTER 20 SEQUENCE ALIGNMENT 362 20.5.1 Pairwise comparison on alignment selection A pairwise comparison can also be performed for a selected part of an alignment: right-click on an alignment selection | Pairwise Comparison. This leads directly to the dialog described in the next section. 20.5.2 Pairwise comparison parameters There are four kinds of comparison that can be made between the sequences in the alignment, as shown in figure 20.14.
295. econdary structure prediction by free energy minimization, and it should be clear that the predicted MFE structure may deviate somewhat from the actual preferred structure of the molecule. It may therefore be informative to inspect the landscape of suboptimal structures surrounding the MFE structure, to look for general structural properties that seem robust to minor variations in the total free energy of the structure. An effective procedure for generating a sample of suboptimal structures is given in [Zuker, 1989a]. CHAPTER 22 RNA STRUCTURE 395 This algorithm works by going through all possible Watson-Crick base pairs in the molecule. For each of these base pairs, the algorithm computes the optimal structure among all the structures that contain this pair (see figure 22.28). [Figure 22.28: table of suboptimal structures, with columns for Name (the ΔG of each structure in kcal/mol) and Created (timestamp)]
296. ed Redo alignment: The original alignment will be realigned if this checkbox is checked. Otherwise, the original alignment is kept in its original form, except for possible extra equally sized gaps in all sequences of the original alignment. This is visualized in figure 20.5. [Figure 20.5: two alignments of the sequences P68873, Q6WN20, P68231, Q6H1U7 and P68945, shown with Consensus, Sequence Logo and Conservation tracks]
297. ed. Proteins or peptides can be cleaved and used as nutrients. Precursor proteins are often processed to yield the mature protein. Proteolytic cleavage of proteins has shown its importance in laboratory experiments, where it is often useful to work with specific peptide fragments instead of entire proteins. Proteases also have commercial applications. As an example, proteases can be used as detergents for cleavage of proteinaceous stains in clothing. The general nomenclature of cleavage site positions of the substrate was formulated by Schechter and Berger in 1967-68 [Schechter and Berger, 1967], [Schechter and Berger, 1968]. They designate the cleavage site as lying between P1 and P1', incrementing the numbering in the N-terminal direction of the cleaved peptide bond (P2, P3, P4, etc.). On the carboxyl side of the cleavage site the numbering is incremented in the same way (P1', P2', P3', etc.). This is visualized in figure 16.29. CHAPTER 16 PROTEIN ANALYSES 273 Cleavage site: P4 P3 P2 P1 | P1' P2' P3'. Figure 16.29: Nomenclature of the peptide substrate. The substrate is cleaved between positions P1 and P1'. Proteases often have a specific recognition site where the peptide bond is cleaved. As an example, trypsin only cleaves at lysine or arginine residues, but it does not matter (with a few exceptions) which amino acid is located at position P1' (carboxy-terminal of the cleavage site). Another example is thrombin, which cleaves if an arginine is
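The trypsin rule mentioned in this snippet (cleavage with lysine or arginine at position P1) can be sketched in a few lines. This is only an illustration, not the Workbench's cleavage algorithm: the function name is hypothetical, and the "few exceptions" concerning the residue at P1' that the text alludes to are deliberately ignored.

```python
def trypsin_digest(protein):
    """Cut a protein after every lysine (K) or arginine (R) at position P1.

    Simplified sketch: exceptions that depend on the residue at P1' are
    ignored, so every K or R is treated as a cleavage site.
    """
    fragments = []
    start = 0
    for index, residue in enumerate(protein):
        if residue in "KR":
            # The bond between P1 (this residue) and P1' (the next) is cut.
            fragments.append(protein[start:index + 1])
            start = index + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


print(trypsin_digest("MKTAYRGG"))
```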
298. ed zooms out instead of zooming in. 3.3.2 Zoom Out It is possible to zoom out, step by step, on a sequence: Click Zoom Out in the toolbar | click in the view until you reach a satisfying zoom level, or Press on your keyboard. The last option for zooming out is only available if you have a mouse with a scroll wheel: Press and hold Ctrl (⌘ on Mac) | Move the scroll wheel on your mouse backwards. When you choose the Zoom Out mode, the mouse pointer changes to a magnifying glass to reflect the mouse mode. Note: You might have to click in the view before you can use the keyboard or the scroll wheel to zoom. If you want to get a quick overview of a sequence or a tree, use the Fit Width function instead of the Zoom Out function. If you press Shift while clicking in a View, the zoom function is reversed. Hence, clicking on a sequence in this way while the Zoom Out mode toolbar item is selected zooms in instead of zooming out. 3.3.3 Fit Width The Fit Width function adjusts the content of the View so that both ends of the sequence, alignment, or tree are visible in the View in question. This function does not change the mode of the mouse pointer. 3.3.4 Zoom to 100% The Zoom to 100% function zooms the content of the View so that it is displayed with the highest degree of detail. This function does not change the mode of the mouse pointer. 3.3.5 Move The Move mode allows you to drag the content of a View
299. ed can either be shown as annotations on the sequence, in a table, or as the detailed text output from the TMHMM method: Add annotations to sequence; Create table; Text. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. You can perform the analysis on several protein sequences at a time. This will add annotations to all the sequences and open a view for each sequence if a transmembrane helix is found. If a transmembrane helix is not found, a dialog box will be presented. After running the prediction as described above, the protein sequence will show predicted transmembrane helices as annotations on the original sequence (see figure 16.8). Moreover, annotations showing the topology will be shown, that is, which parts of the protein are located on the inside or on the outside. Figure 16.8: Transmembrane segments shown as annotation on the sequence and the topology. Each annotation will carry a tooltip note saying that the corresponding annotation is predicted with TMHMM version 2.0. Additional notes can be added through the Edit annotation right-click mouse menu (see section 10.3.2). Undesired annotations can be removed through the Delete Annotation right-click mouse menu (see section 10.3.4). 16.4 Antigenicity CLC Combined Workbench can help to identify antigenic regions in protein sequences in differ
300. ed comparison we refer to http://www.clcbio.com/compare. Appendix B BLAST databases Several databases are available at NCBI, which can be selected to narrow down the possible BLAST hits. B.1 Peptide sequence databases nr: Non-redundant GenBank CDS translations, PDB, SwissProt, PIR, PRF, excluding those in env_nr. refseq: Protein sequences from NCBI Reference Sequence project (http://www.ncbi.nlm.nih.gov/RefSeq/). swissprot: Last major release of the SWISS-PROT protein sequence database (no incremental updates). pat: Proteins from the Patent division of GenBank. pdb: Sequences derived from the 3-dimensional structure records from the Protein Data Bank (http://www.rcsb.org/pdb/). env_nr: Non-redundant CDS translations from env_nt entries. month: All new or revised GenBank CDS translations, PDB, SwissProt, PIR, PRF released in the last 30 days. B.2 Nucleotide sequence databases nr: All GenBank, EMBL, DDBJ, PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer non-redundant due to computational cost. refseq_rna: mRNA sequences from NCBI Reference Sequence Project. refseq_genomic: Genomic sequences from NCBI Reference Sequence Project. est: Database of GenBank, EMBL, DDBJ sequences from EST division. est_human: Human subset of est. est_mouse: Mouse subset of est. est_others: Subset of est other than human or mouse. gss: Genome Sur
301. ed frequency of an amino acid residue or nucleotide of symbol n at a particular position, and N is the number of distinct symbols for the sequence alphabet, either 20 for proteins or four for DNA/RNA. This means that the maximal sequence information content per position is log2(4) = 2 bits for DNA/RNA and log2(20) ≈ 4.32 bits for proteins. The original implementation by Schneider does not handle sequence gaps. We have slightly modified the algorithm so that an estimated logo is presented in areas with sequence gaps. If amino acid residues or nucleotides of one sequence are found in an area containing gaps, we have chosen to show the particular residue as the fraction of the sequences. Example: if one position in the alignment contains 9 gaps and only one alanine (A), the A represented in the logo has a height of 0.1. Other useful resources: The website of Tom Schneider, http://www.lmmb.ncifcrf.gov/~toms/; WebLogo, http://weblogo.berkeley.edu/ [Crooks et al., 2004]. 20.3 Edit alignments 20.3.1 Move residues and gaps The placement of gaps in the alignment can be changed by modifying the parameters when creating the alignment (see section 20.1). However, gaps and residues can also be moved after the alignment is created: select one or more gaps or residues in the alignment | drag the selection to move. This can be done both for single sequences and for multiple sequences by making a selection
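The figures quoted in this snippet (2 bits for DNA/RNA, about 4.32 bits for proteins, and a height of 0.1 for a single A among 9 gaps) can be checked with a short sketch. The formula itself is cut off in this excerpt, so the per-column function below assumes the standard Shannon form, log2(N) minus the entropy of the observed frequencies; the function names are hypothetical, and the gap handling simply reports the residue as a fraction of all sequences, as the text describes.

```python
import math
from collections import Counter


def max_information(alphabet_size):
    """Maximal information content per position, in bits: log2(N)."""
    return math.log2(alphabet_size)


def column_information(column, alphabet_size):
    """Information content of one gap-free view of an alignment column.

    Assumes the usual Shannon form: log2(N) + sum(f_n * log2(f_n)).
    Gap characters ('-') are skipped here.
    """
    residues = [symbol for symbol in column if symbol != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def gap_scaled_fraction(column, symbol):
    """Fraction of all sequences (gaps included) showing `symbol`."""
    return column.count(symbol) / len(column)


print(max_information(4), round(max_information(20), 2))
print(gap_scaled_fraction("A---------", "A"))
```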
302. ed in figure 16.21. If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. You can translate several protein sequences at a time. Click Next to adjust the parameters for the translation. Figure 16.21: Choosing a protein sequence for reverse translation. 16.9.1 Reverse translation parameters Figure 16.22 shows the choices for making the translation. [Figure 16.22 dialog options: Codons (Use random codon; Use only the most frequent codon; Use codon based on frequency distribution); Transfer annotations (Map annotations to reverse translated sequence)] Figure 16.22: Choosing parameters for the reverse translation. Use random codon: This will randomly back-translate an amino acid to a codon without using the translation tables. Every time you perform the analysis you will
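The three codon choices listed in the dialog (random codon, most frequent codon only, and codon drawn from a frequency distribution) can be sketched as follows. This is an illustrative sketch rather than the Workbench's implementation: the function name, the mode names, and above all the codon usage fractions are made up for the example and cover only four amino acids.

```python
import random

# Hypothetical codon usage fractions, for illustration only (not real data).
CODON_USAGE = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.4, "AAG": 0.6},
    "F": {"TTT": 0.45, "TTC": 0.55},
    "W": {"TGG": 1.0},
}


def reverse_translate(protein, mode="most_frequent", rng=None):
    """Back-translate a protein using one of three codon selection modes."""
    rng = rng or random.Random()
    codons = []
    for amino_acid in protein:
        usage = CODON_USAGE[amino_acid]
        candidates = sorted(usage)
        if mode == "random":
            # Every synonymous codon is equally likely; usage is ignored.
            codons.append(rng.choice(candidates))
        elif mode == "most_frequent":
            # Deterministic: always the codon with the highest fraction.
            codons.append(max(usage, key=usage.get))
        elif mode == "frequency":
            # Draw a codon with probability proportional to its fraction.
            weights = [usage[codon] for codon in candidates]
            codons.append(rng.choices(candidates, weights=weights)[0])
        else:
            raise ValueError(f"unknown mode: {mode}")
    return "".join(codons)


print(reverse_translate("MKFW"))
print(reverse_translate("MKFW", mode="frequency", rng=random.Random(1)))
```

Only the most-frequent mode gives the same DNA sequence on every run; the other two modes can return a different sequence each time unless the random generator is seeded.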
303. ed manually. If areas are known where primers or probes must not bind (e.g. repeat-rich areas), one or more No primers here regions can be defined. The regions are defined by making a selection on the sequence and right-clicking the selection. It is required that part of the Forward primer region is located upstream of the TaqMan Probe region, and that the TaqMan Probe region is located upstream of a part of the Reverse primer region. In TaqMan mode, the Inner melting temperature menu in the primer parameters panel is activated, allowing the user to set a separate melting temperature interval for the TaqMan probe. After exploring the available primers (see section 17.3) and setting the desired parameter values in the Primer Parameters preference group, the Calculate button will activate the primer design algorithm. After pressing the Calculate button, a dialog will appear (see figure 17.10) which is similar to the Nested PCR dialog described above (see section 17.6). Calculation parameters. Chosen parameters: Maximum primer length; Minimum primer length; Maximum G/C content; Minimum G/C content; Maximum melting temperature; Minimum melting temperature; Maximum self annealing; Maximum self end annealing; Maximum secondary structure; 3' end must meet G/C requirements; 5' end must meet G/C requirements. Primer combination parameters: Max percentage point difference in G/C content; Max difference in melting temperatures within a primer p
304. el. To the left, you see all the enzymes that are in the list selected above. If you have not chosen to use an existing enzyme list, this panel shows all the enzymes available. To the right, there is a list of the enzymes that will be used. Select enzymes in the left side panel and add them to the right panel by double-clicking or clicking the Add button. If you e.g. wish to use EcoRV and BamHI, select these two enzymes and add them to the right side panel. If you wish to use all the enzymes in the list: Click in the panel to the left | press Ctrl + A (⌘ + A on Mac) | Add. The enzymes can be sorted by clicking the column headings, i.e. Name, Overhang, Methylation or Popularity. This is particularly useful if you wish to use enzymes which produce e.g. a 3' overhang. In this case, you can sort the list by clicking the Overhang column heading, and all the enzymes producing 3' overhangs will be listed together for easy selection. When looking for a specific enzyme, it is easier to use the Filter. If you wish to find e.g. HindIII sites, simply type HindIII into the filter, and the list of enzymes will shrink automatically to only include the HindIII enzyme. This can also be used to only show enzymes producing e.g. a 3' overhang, as shown in figure 19.33. If you need more detailed information and filtering of the enzymes, either place your mouse cursor on an enzyme for one second to display additional information (see figure
305. ence at This BamHI Site; Cut Sequence at All BamHI Sites; Cut All Sequences at All BamHI Sites; Insert Sequence at This BamHI Site; Add as Annotation; Show Enzymes with Compatible Ends. Figure 19.6: Right-click on a restriction enzyme annotation in the cloning view. Figure 19.7: Select a sequence for insertion. 19.1.5 Insert one sequence into another Sequences can be inserted into each other in several ways, as described in the lists above. When you choose to insert one sequence into another, you will be presented with a dialog where all sequences in the view are present (see figure 19.7). The sequence that you have chosen to insert into will be marked in bold, and the text vector is appended to the sequence name. The list furthermore includes the length of the fragment, an indication of the overhangs, and a list of enzymes that are compatible with this overhang for the left and right ends, respectively. If not all the enzymes can be shown, place your mouse cursor on the enzymes and a full list will be shown in the
306. ence belongs to the included sequences; if not, it belongs to the excluded sequences. We use the terms included and excluded here to be consistent with the section above, although a probe solution is presented for both groups. In TaqMan mode, primers are not allowed degeneracy or mismatches to any template sequence in the alignment; variation is only allowed/required in the TaqMan probes. CHAPTER 17 PRIMERS 296 Pushing the Calculate button will cause the dialog shown in figure 17.14 to appear. The top part of this dialog is identical to the Standard PCR dialog for designing primer pairs described above. The central part of the dialog contains parameters to define the specificity of TaqMan probes. Two parameters can be set. Minimum number of mismatches: the minimum total number of mismatches that must exist between a specific TaqMan probe and all sequences which belong to the group not recognized by the probe. Minimum number of mismatches in central part: the minimum number of mismatches in the central part of the oligo that must exist between a specific TaqMan probe and all sequences which belong to the group not recognized by the probe. The lower part of the dialog contains parameters pertaining to primer pairs and the comparison between the outer oligos (primers) and the inner oligos (TaqMan probes). Here, five options can be set. Maximum percentage point difference in G/C content (described above under Standard PCR). Maxima
307. ence in the Navigation Area | Show | Annotation Table, or, if the sequence is already open: Click Show Annotation Table at the lower left part of the view. This will open a view similar to the one in figure 10.18. Figure 10.18: A table showing annotations on the sequence. Each row in the table is an annotation, which is represented with the following information: Name, Type, Region, Qualifiers. The Name, Type and Region for each annotation can be edited simply by double-clicking, typing the change directly, and pressing Enter. In the Side Panel you can show or hide individual annotation types in the table. E.g. if you only wish to see gene annotations, deselect the other annotation types so that only gene is selected. This information corresponds to the information in the dialog when you edit and add annotations (see section 10.3.2). You can benefit from this table in several ways: It p
308. enetic code. Other: Here you can specify a number of start codons separated by commas. Both Strands: Finds reading frames on both strands. Open Ended Sequence: Allows the ORF to start or end outside the sequence. If the sequence studied is a part of a larger sequence, it may be advantageous to allow the ORF to start or end outside the sequence. Genetic code: translation table. Include stop codon in result: The ORFs will be shown as annotations, which can include the stop codon if this option is checked. The translation tables are occasionally updated from NCBI. The tables are not available in this printable version of the user manual. Instead, the tables are included in the Help menu in the Menu Bar (in the appendix). Minimum Length: Specifies the minimum length for the ORFs to be found. The length is specified as number of codons. Using open reading frames for gene finding is a fairly simple approach which is likely to predict genes which are not real. Setting a relatively high minimum length of the ORFs will reduce the number of false positive predictions, but at the same time short genes may be missed (see figure 15.8). Figure 15.8: The first 12,000 positions of the E. coli sequence NC_000913 downloaded from GenBank. The blue (dark) annotations are the genes while the yellow
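The interplay of the start codon, stop codon and minimum length options described in this snippet can be sketched with a simple forward-strand ORF scan. This is a minimal illustration under stated assumptions, not the Workbench's algorithm: it hard-codes the standard stop codons TAA, TAG and TGA, scans only the given strand, does not support open-ended ORFs, and counts the minimum length in codons excluding the stop codon (the manual does not say whether the stop is counted). The function and parameter names are hypothetical.

```python
def find_orfs(sequence, min_codons=3, start_codons=("ATG",), include_stop=True):
    """Return (start, end) pairs of ORFs on the given strand.

    Coordinates are 0-based and end-exclusive. An ORF runs from the first
    start codon in a frame to the next in-frame stop codon.
    """
    stop_codons = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if orf_start is None and codon in start_codons:
                orf_start = i
            elif orf_start is not None and codon in stop_codons:
                codon_count = (i - orf_start) // 3  # stop codon not counted
                if codon_count >= min_codons:
                    end = i + 3 if include_stop else i
                    orfs.append((orf_start, end))
                orf_start = None  # look for the next ORF in this frame
    return sorted(orfs)


dna = "CCATGAAATTTGGGTAACC"
print(find_orfs(dna, min_codons=3))
print(find_orfs(dna, min_codons=5))
```

Raising `min_codons` in the second call filters out the short ORF, which mirrors the trade-off described above: fewer false positives, but short genes are missed.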
309. ense server in the dialog. When you restart CLC Combined Workbench, you will be asked for a license, and you can then provide the license key file for the fixed license as described in section 1.4.3. 1.5 About CLC Workbenches In November 2005 CLC bio released two Workbenches: CLC Free Workbench and CLC Protein Workbench. CLC Protein Workbench is developed from the free version, giving it the well-tested user friendliness and look & feel. However, the CLC Protein Workbench includes a range of more advanced analyses. In March 2006, CLC DNA Workbench (formerly known as CLC Gene Workbench) and CLC Combined Workbench were added to the product portfolio of CLC bio. Like CLC Protein Workbench, CLC DNA Workbench builds on CLC Free Workbench. It shares some of the advanced product features of CLC Protein Workbench, and it has additional advanced features. CLC Combined Workbench holds all basic and advanced features of the CLC Workbenches. In June 2007, CLC RNA Workbench was released as a sister product of CLC Protein Workbench and CLC DNA Workbench. CLC Combined Workbench now also includes all the features of CLC RNA Workbench. For an overview of which features all the workbenches include, see http://www.clcbio.com/features. In December 2006, CLC bio released a Software Developer Kit which makes it possible for anybody with a knowledge of programming in Java to develop plug-ins for the workbenches. The plug-ins are fully integrated with the CLC Work
310. ensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes under the following conditions: You must attribute the work in its original form, and CLC bio has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, or build upon this work. SOME RIGHTS RESERVED. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents. Chapter 13 3D molecule viewing Contents: 13.1 Importing structure files; 13.2 Viewing structure files; 13.2.1 Moving and rotating; 13.3 The Structure table; 13.3.2 Opening sequence information; 13.3.3 Display and coloring options; 13.4 Options through the preference panel; 13.4.1 Atoms & Bonds; 13.4.2 Backbone; 13.4.4 Selection scheme; 13.4.5 General settings; 13.4.6 Performance settings; 13.5 3D Output
311. ent ways using different algorithms. The algorithms provided in the Workbench merely plot an index of antigenicity over the sequence. Two different methods are available. [Welling et al., 1985]: Welling et al. used information on the relative occurrence of amino acids in antigenic regions to make a scale which is useful for prediction of antigenic regions. This method is better than the Hopp-Woods scale of hydrophobicity, which is also used to identify antigenic regions. A semi-empirical method for prediction of antigenic regions has been developed [Kolaskar and Tongaonkar, 1990]. This method also includes information on surface accessibility and flexibility, and at the time of publication the method was able to predict antigenic determinants with an accuracy of 75%. Note: Similar results from the two methods cannot always be expected, as the two methods are based on different training sets. 16.4.1 Plot of antigenicity Displaying the antigenicity for a protein sequence in a plot is done in the following way: select a protein sequence in Navigation Area | Toolbox in the Menu Bar | Protein Analyses | Create Antigenicity Plot. This opens a dialog. The first step allows you to add or remove sequences. Clicking Next takes you through to Step 2, which is displayed in figure 16.9. The Window size is the width of the window where the antigenicity is calculated. The wider the window, the less volatile the graph. You can choose from a number
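The effect of the Window size parameter (a wider window gives a less volatile graph) can be illustrated with a plain sliding-window average over per-residue scale values. This is a sketch under assumptions: the manual does not state how the Workbench combines values inside the window, so an unweighted mean over an odd-sized, centered window is assumed, and the input values below are arbitrary numbers rather than Welling or Kolaskar-Tongaonkar scale values.

```python
def window_average(values, window):
    """Unweighted mean over a centered sliding window of odd size.

    Returns one value per position at which the full window fits, so the
    result is shorter than the input by (window - 1) positions.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    averages = []
    for center in range(half, len(values) - half):
        chunk = values[center - half:center + half + 1]
        averages.append(sum(chunk) / window)
    return averages


# Arbitrary per-residue values, for illustration only.
scores = [1, 9, 1, 9, 1, 9, 1]
print(window_average(scores, 1))
print(window_average(scores, 3))
print(window_average(scores, 5))
```

With these toy values the spread between neighbouring points shrinks as the window grows, which is the smoothing behaviour the text describes.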
312. ents are selected, click Next and you will see the dialog shown in figure 18.9. [Figure 18.9 dialog options: Alignment options (Minimum aligned read length: 50; Alignment stringency: Medium); Trimming options (Use existing trim information; generally not necessary since a reference sequence is used); Output options (Show tabular view of contigs); Conflict resolution (Update consensus sequence; manual changes will be overwritten)] Figure 18.9: Setting assembly parameters when assembling to an existing contig. The options in this dialog are similar to the options that are available when assembling to a reference sequence (see section 18.4). Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This will start the assembly process. See section 18.6 on how to use the resulting contig. CHAPTER 18 SEQUENCING DATA ANALYSES AND ASSEMBLY 311 18.6 View and edit contigs The result of the assembly process is one or more contigs where the sequence reads have been aligned (see figure 18.10).
313. enu. This opens a dialog that allows the user to save the fragment to the desired location. The fragment is saved as a DNA sequence, and the position of the primers is added as annotation on the sequence. The fragment can then be used for further analysis and included in e.g. an in silico cloning experiment using the cloning editor. 17.4.3 Adding primer binding annotation You can add an annotation to the template sequence specifying the binding site of the primer: Right-click the primer in the table and select Mark primer annotation on sequence. 17.5 Standard PCR This mode is used to design primers for a PCR amplification of a single DNA fragment. 17.5.1 User input In this mode the user must define either a Forward primer region, a Reverse primer region, or both. These are defined by making a selection on the sequence and right-clicking the selection. It is also possible to define a Region to amplify, in which case a forward and a reverse primer region are automatically placed so as to ensure that the designated region will be included in the PCR fragment. If areas are known where primers must not bind (e.g. repeat-rich areas), one or more No primers here regions can be defined. If two regions are defined, it is required that part of the Forward primer region is located upstream of the Reverse primer region. After exploring the available primers (see section 17.3) and setting the desired parameter values in the
314. 3.1.4 Multiselecting elements; 3.1.5 Moving and copying elements; 3.1.6 Change element names; 3.1.7 Delete elements; 3.1.8 Show folder elements in View; 3.1.9 Sequence properties; 3.2 View Area; 3.2.1 Open view; 3.2.2 Show element in another view; 3.2.3 Close views; 3.2.4 Save changes in a view; 3.2.5 Undo/Redo; 3.2.6 Arrange views in View Area; 3.2.7 Side Panel; 3.3 Zoom and selection in View Area; 3.3.1 Zoom In; 3.3.2 Zoom Out; 3.3.3 Fit Width; 3.3.4 Zoom to 100%; 3.3.5 Move; 3.4 Toolbox and Status Bar; 3.4.1 Processes
315. [Figure 4.6: Advanced search.]
- Protein sequences
- All data
When searching for sequences, you will also get alignments, sequence lists, etc. as results if they contain a sequence that matches the search criteria. Below are the search criteria. First, select a relevant search filter in the Add filter list. For sequences you can search for:
- Name
- Length
- Organism
See section 4.2.2 for more information on individual search terms. For all other data, you can only search for name. If you use Any field, all of the above are searched. For each search line, you can choose whether you want the exact term by selecting is equal to, or, if you only enter the start of the term you wish to find, select begins with. An example is shown in figure 4.7. This example will find nucleotide sequences with a gene starting with brca, restricted to human sequences (organism is Homo sapiens) and to sequences shorter than 10,000 nucleotides. Note that a search can be saved for later use. You do not save the search results, only the search parameters. This means that you can easily conduct the same search later on when your data has changed. CHAPTER 4 SEARCHING YOUR DATA 98
316. equence is already open in a linear view, and you wish to see both a circular and a linear view, you can split the views very easily: press Ctrl (⌘ on Mac) while you click Show As Circular at the lower left part of the view. This will open a split view with a linear view at the bottom and a circular view at the top (see 10.14). CHAPTER 3 USER INTERFACE 82. You can also show a circular view of a sequence without opening the sequence first: select the sequence in the Navigation Area | Show | As Circular. 3.2.3 Close views: When a view is closed, the View Area remains open as long as there is at least one open view. A view is closed by: right-click the tab of the View | Close, or select the view | Ctrl + W, or hold down the Ctrl button | click the tab of the view while the button is pressed. By right-clicking a tab, the following close options exist (see figure 3.9). [Figure 3.9: By right-clicking a tab, several close options are available.]
- Close: See above.
- Close Tab Area: Closes all tabs in the tab area.
- Close All Views: Closes all tabs in all tab areas. Leaves an empty workspace.
- Close Other Tabs: Closes all other tabs in the particular tab area.
317. equences ... 306
18.4 Assemble to reference sequence ... 307
18.5 Add sequences to an existing contig ... 310
18.6 View and edit contigs ... 311
18.6.1 Contig view settings in the Side Panel ... 312
18.6.2 Editing the contig ... 313
18.6.3 Sorting reads ... 314
18.6.4 Assembly conflicts ... 314
18.6.5 Output from the contig ... 314
18.6.6 Assembly variance table ... 315
18.7 Reassemble contig ... 316
18.8 Secondary peak calling ... 317
CLC Combined Workbench lets you import, trim, and assemble DNA sequence reads from automated sequencing machines. A number of different formats are supported (see section 7.1.1). This chapter first explains how to trim sequence reads. Next follows a description of how to assemble reads into contigs, both with and without a reference sequence. In the final section, the options for viewing and editing contigs are explained. 18.1 Importing and viewing trace data: A number of different binary trace data formats can be imported into the program, including Standard Chromatogram Format (SCF), ABI sequencer data files (ABI and AB1), PHRED
318. [Figure 17.8: Calculation dialog for PCR primers when two primer regions have been defined.] Again, the top part of this dialog shows the parameter settings chosen in the Primer parameters preference group, which will be used by the design algorithm. The lower part again contains a menu where the user can choose to include mispriming of both primers as a criterion in the design process (see above). The central part of the dialog contains parameters pertaining to primer pairs. Here, the following parameters can be set:
- Maximum percentage point difference in G/C content: if this is set at, e.g., 5 points, a pair of primers with 45% and 49% G/C nucleotides, respectively, will be allowed, whereas a pair of primers with 45% and 51% G/C nucleotides, respectively, will not be included.
- Maximal difference in melting temperature of primers in a pair: the number of degrees Celsius that primers in a pair are allowed to differ.
- Max hydrogen bonds between pairs: the maximum number of hydrogen bonds allowed between the forward and the reverse primer in a primer pair.
- Max hydrogen bonds between pair ends: the maximum number of hydrogen bonds allowed in the consecutive ends of the forward and the reverse primer in a primer
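The G/C and melting temperature criteria for a primer pair can be illustrated with a small sketch. This is not the Workbench's implementation: the function name `pair_passes`, its parameters, and the choice to treat the limits as inclusive are assumptions made for illustration, chosen so that the 45%/49% and 45%/51% example from the text behaves as described.

```python
def pair_passes(gc_fwd, gc_rev, tm_fwd, tm_rev,
                max_gc_diff=5.0, max_tm_diff=5.0):
    """Hypothetical pair filter; limits are assumed to be inclusive.

    gc_fwd, gc_rev: G/C content of each primer in percent.
    tm_fwd, tm_rev: melting temperature of each primer in degrees Celsius.
    """
    # Percentage point difference in G/C content between the two primers.
    gc_ok = abs(gc_fwd - gc_rev) <= max_gc_diff
    # Difference in melting temperature between the two primers.
    tm_ok = abs(tm_fwd - tm_rev) <= max_tm_diff
    return gc_ok and tm_ok


# With a 5 point limit, 45% vs 49% is kept and 45% vs 51% is rejected.
kept = pair_passes(45.0, 49.0, 60.0, 61.0)
rejected = pair_passes(45.0, 51.0, 60.0, 61.0)
```

The hydrogen bond criteria are left out here because they depend on how the program scores primer-primer annealing, which this section does not specify.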
319. [Figure 2.20: Selecting enzymes.] Click Next. In this step you specify that you want to show enzymes that cut the sequence only once. This means that you should deselect the Two restriction sites checkbox. Click Next and select that you want to Add restriction sites as annotations on sequence and Create restriction map (see figure 2.21). Click Finish to start the restriction map analysis. [Figure 2.21: Selecting output for restriction map analysis.] View restriction site: The restriction sites are shown in two views: one view is in a tabular format, and the other view displays the sites as annotations on the sequence. To see both views at once: View in the menu bar | Split Horizontally. The result is sho
320. erring phylogenetic trees. For a given set of aligned sequences (see chapter 20), it is possible to infer their evolutionary relationships. In CLC Combined Workbench this is done by creating a phylogenetic tree: Toolbox in the Menu Bar | Alignments and Trees | Create Tree, or right-click the alignment in the Navigation Area | Toolbox | Alignments and Trees | Create Tree. This opens the dialog displayed in figure 21.1. If an alignment was selected before choosing the Toolbox action, this alignment is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove elements from the Navigation Area. Click Next to adjust parameters. 21.1.1 Phylogenetic tree parameters: Figure 21.2 shows the parameters that can be set. CHAPTER 21 PHYLOGENETIC TREES 367. [Figure 21.1: Creating a Tree.] [Figure 21.2: Adjusting parameters.]
- Algorithms: The UPGMA method assumes that evolution has occurred at a constant rate
321. ertion polymorphisms (DIPs) and short tandem repeats (STRs, also termed microsatellites). A BLAST search against dbSNP produces output similar to a regular nucleotide BLAST search against NCBI. However, when searching against dbSNP, CLC Combined Workbench also offers the user the possibility to transfer the found BLAST hits to the query sequence as variation annotation. This information can then be used to interpret experimental data or to design further experiments, using either the primer designer functionality or the cloning editor of the program. To annotate with SNPs: select one or more nucleotide sequences | Toolbox in the Menu Bar | BLAST Search | SNP Annotation Using BLAST. If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. When you have selected the desired sequences, click Next. 12.5.1 SNP annotation search parameters: In this step you can choose the species- and genome-specific database for use in the BLAST search, as shown in figure 12.14. The list of databases is available at http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html. CHAPTER 12 BLAST SEARCH 186
322. erver. By checking this option, you do not have to enter more information to connect to the server.
- Manually specify license server: There can be technical limitations which mean that the license server cannot be detected automatically, and in this case you need to specify more options manually. Host name: Enter the address for the license server. Port: Specify which port to use.
- Disable license borrowing on this computer: If you do not want users of the computer to borrow a license (see section 1.4.4), you can check this option.
CHAPTER 1 INTRODUCTION TO CLC COMBINED WORKBENCH 22. [Figure 1.10: Activate the license key.] Borrow a license: A floating license can only be used when you are connected to the license server. If you wish to use the CLC Combined Workbench when you are not connected to the server, you can borrow a license. Borrowing a license means that you take one of the floating licenses available on the server and borrow it for a specified amount of time. During this time period, there will be one less floating license available on the server. At the point where you wish to borrow a license, you have to be connected to the license server. The procedure for borrowing is this: Click Help | License Manager to display the dialog shown in figure 1.13. Use the checkboxes to select the license(s) that you wish to borrow. Select how long you wish to borrow the license, and click Borrow Licen
323. es (see section 5.3). From CLC Combined Workbench it is also possible to conduct BLAST searches on a database stored locally on your computer. Local BLAST and the creation of a database for local BLAST search are described later in this chapter. If you are interested in the bioinformatics behind BLAST, there is an easy-to-read explanation of this in the last section of the chapter. 12.1 BLAST Against NCBI Database: To conduct a BLAST search: click an element in the Navigation Area | Toolbox | BLAST Search | NCBI BLAST. Alternatively, use the keyboard shortcut Ctrl + Shift + B for Windows and ⌘ + Shift + B on Mac OS. This opens the dialog seen in figure 12.2. Click Next. In Step 2 you can choose which type of BLAST search you want to conduct, and you can limit your search to a particular database (see section B in the appendix for a list of available databases). Step 2 can be seen in figure 12.3. BLAST search for DNA sequences:
- BLASTn: DNA sequence against DNA database. This BLAST method is used to identify DNA sequences homologous to your query sequence.
324. etes worldwide, affecting approximately 4% of the adult population (Horikawa et al., 2000). SNPs can also be useful as genetic markers for, e.g., association studies, where relations between specific genetic variation and phenotypic appearance are mapped. The polymorphism must appear at a certain frequency to be useful as a genetic marker, and for a single nucleotide polymorphism to be considered a SNP, the less frequent allele must occur in the population at a frequency of at least 1 percent (Brookes, 1999). Association studies are expected to speed the discovery of disease-related genes, as it is much easier to get access to DNA samples from a random set of individuals in a population than it is to do traditional pedigree analysis. The research and results within genetic diseases are thereby expanding significantly along with the identification and characterization of SNPs and research [Figure 12.19: Classification of Single Nucleotide Polymorphisms. According to their location in the genome, SNPs are classified as either iSNPs (located in intronic regions), cSNPs (in coding regions, exons), rSNPs (in regulatory regions), or gSNPs (located in intergenomic regions). cSNPs can be represented as either synonymous (s) or non-synonymous (ns) SNPs, dependent on their influenc
325. etween the forward and the reverse primer in a primer pair. This criterion is applied to all possible combinations of primers.
- Minimum difference in the melting temperature of primers in the inner and outer primer pair: all comparisons between the melting temperature of primers from the two pairs must be at least this different, otherwise the primer set is excluded. This option is applied to ensure that the inner and outer PCR reactions can be initiated at different annealing temperatures. Please note that, to ensure flexibility, there is no directionality indicated when setting parameters for melting temperature differences between inner and outer primer pair, i.e. it is not specified whether the inner pair should have a lower or higher Tm. Instead, this is determined by the allowed temperature intervals for inner and outer primers that are set in the primer parameters preference group in the Side Panel. If a higher Tm of inner primers is desired, choose a Tm interval for inner primers which has higher values than the interval for outer primers.
- Two radio buttons allowing the user to choose between a fast and an accurate algorithm for primer prediction.
17.6.1 Nested PCR output table: In nested PCR there are four primers in a solution: forward outer primer (FO), forward inner primer (FI), reverse inner primer (RI), and reverse outer primer (RO). The output table can show primer pair combination parameters for all four
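The minimum Tm difference rule for nested PCR compares every outer primer with every inner primer, without regard to direction. A minimal sketch of that check, assuming the melting temperatures are already known; the function name `nested_set_passes` and its arguments are hypothetical and do not come from the Workbench:

```python
from itertools import product


def nested_set_passes(outer_tms, inner_tms, min_diff):
    """Return True if every outer/inner Tm comparison differs by at least min_diff.

    outer_tms: melting temperatures (degrees Celsius) of the outer primers (FO, RO).
    inner_tms: melting temperatures (degrees Celsius) of the inner primers (FI, RI).
    The absolute difference is used, so no directionality is imposed.
    """
    return all(abs(outer - inner) >= min_diff
               for outer, inner in product(outer_tms, inner_tms))


# Outer primers at 58 and 59 degrees, inner primers at 66 and 67 degrees:
# the smallest difference is 7 degrees, so a 5 degree minimum is satisfied.
ok = nested_set_passes((58.0, 59.0), (66.0, 67.0), 5.0)
# Here one outer primer (63) is only 3 degrees from an inner primer (66).
excluded = nested_set_passes((58.0, 63.0), (66.0, 67.0), 5.0)
```

Whether the inner pair ends up with the higher or lower Tm is decided by the allowed Tm intervals, as the text notes, not by this check.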
326. etwork 29 Conflicting enzymes 340 Conflicts overview in assembly 315 Consensus sequence 353 401 open 353 Conservation 353 graphs 401 Contact information 13 Contig 403 ambiguities 315 create 306 reverse 311 view and edit 311 Copy 123 annotations in alignments 358 elements in Navigation Area 76 into sequence 145 search results GenBank 163 search results structure search 169 search results UniProt 166 sequence 155 156 sequence selection 235 text selection 155 cpf file format 102 Create alignment 348 dot plots 208 enzyme list 344 local BLAST database 183 new folder 76 workspace 89 csv file format 112 CSV file format 35 113 409 ct file format 112 Custom annotation types 152 Data storage location 75 Data formats bioinformatic 408 graphics 409 Data sharing 75 Data structure 73 Database GenBank 160 local 73 nucleotide 404 peptide 404 shared BLAST database 184 SNP BLAST 405 structure 166 UniProt 164 Db source 154 db_xref references 171 Delete element 79 residues and gaps in alignment 357 workspace 90 Demo license 17 Description 154 Dipeptide distribution 225 Distance pairwise comparison of sequences in alignments 362 DNA translation 236 DNAstrider file format 35 113 409 Dot plots 402 Bioinformatics explained 210 create 208 print 210 Double cutters 137 330 Double stranded DNA 132 Download and open search results GenBank 163 16
327. everse color: Colors the residues of reverse reads with the specified color (can be changed by clicking the colored box). Besides these preferences, all the functionalities of the alignment view are available. This means that you can, e.g., add annotations (such as SNP annotations) to regions of interest in the contig. However, some of the parameters from alignment views are set at a different default value in the view of contigs. Trace data of the sequencing reads are shown if present (can be enabled and disabled under the Nucleotide info preference group), and the Color different residues option is also enabled in order to provide a better overview of conflicts (can be changed in the Alignment info preference group). 18.6.2 Editing the contig: When editing contigs, you are typically interested in confirming or changing single bases, and this can be done simply by selecting the base and typing the right base. Some users prefer to use lower-case letters in order to be able to see which bases were altered when they use the contig later on. In CLC Combined Workbench, all changes to the contig are recorded in its history log (see section 8), allowing the user to quickly reconstruct the actions performed in the editing session. There are three shortcut keys for easily finding the positions where there are inconsistencies:
- Space bar: Finds the next inconsistency.
- "." (punctuation mark) key: Finds the next inconsistency.
- "," (comma) key: Fi
328. ews, D. H. and Turner, D. H. (2002). Experimentally derived nearest-neighbor parameters for the stability of RNA three- and four-way multibranch loops. Biochemistry, 41(3):869-880.
[Mathews and Turner, 2006] Mathews, D. H. and Turner, D. H. (2006). Prediction of RNA secondary structure by free energy minimization. Curr Opin Struct Biol, 16(3):270-278.
[McCaskill, 1990] McCaskill, J. S. (1990). The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29(6-7):1105-1119.
[McGinnis and Madden, 2004] McGinnis, S. and Madden, T. L. (2004). BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res, 32(Web Server issue):W20-W25.
[Menne et al., 2000] Menne, K. M., Hermjakob, H., and Apweiler, R. (2000). A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics, 16(8):741-742.
[Michener and Sokal, 1957] Michener, C. and Sokal, R. (1957). A quantitative approach to a problem in classification. Evolution, 11:130-162.
[Nielsen et al., 1997] Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. (1997). Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng, 10(1):1-6.
[Purvis, 1995] Purvis, A. (1995). A composite estimate of primate phylogeny. Philos Trans R Soc Lond B Biol Sci, 348(1326):405-421.
BIBLIOGRAPHY 417
[Reinhardt and Hubbard, 1998] Reinh
329. f the analysis should be added as annotations on the sequence or shown in a table. If both options are selected, you will be able to click the results in the table, and the corresponding region on the sequence will be selected. If you choose to add annotations to the sequence, they can be removed afterwards by clicking Undo in the Toolbar. 9.1.2 Batch log: For some analyses there is an extra option in the final step to create a log of the batch process (see e.g. figure 9.3). This log will be created at the beginning of the process and continually updated with information about the results. See an example of a log in figure 9.4. In this example, the log displays information about how many open reading frames were found. CHAPTER 9 HANDLING OF RESULTS 129. [Figure 9.3: Analyses which also generate tables.]
Log table (Rows: 9; columns: Name, Description, Type, Time):
AY738615: Found 10 reading frames (Fri Nov 17)
HUMDINUC: Found 5 reading frames (Fri Nov 17)
PERHIBA: Found 5 reading frames (Fri Nov 17)
PERHIBB: Found 7 reading frames (Fri Nov 17)
PERH2BA: Found 4 reading frames (Fri Nov 17)
PERH2BB: Fo
330. f the sequence you want to delete | right-click the selection | Edit Selection | delete the text in the dialog | Replace. The selection shown in the dialog will be replaced by the text you enter. If you delete the text, the selection will be replaced by an empty text, i.e. deleted. CHAPTER 20 SEQUENCE ALIGNMENT 358. To delete entire columns: select the part of the alignment you want to delete | right-click the selection | Delete columns. The selection may cover one or more sequences, but the Delete columns function will always apply to the entire alignment. 20.3.4 Copy annotations to other sequences: Annotations on one sequence can be transferred to other sequences in the alignment: right-click the annotation | Copy Annotation to other Sequences. This will display a dialog listing all the sequences in the alignment. Next to each sequence is a checkbox which is used for selecting which sequences the annotation should be copied to. Click Copy to copy the annotation. 20.3.5 Move sequences up and down: Sequences can be moved up and down in the alignment: drag the name of the sequence up or down. When you move the mouse pointer over the label, the pointer will turn into a vertical arrow, indicating that the sequence can be moved. The sequences can also be sorted automatically, which saves you the time of moving the sequences around. To sort the sequences alphabetically: right-click the name of a sequence | Sort Sequences Alphabetically. If
331. file format in the Files of type box. [Figure 7.1: The import dialog.] Next, select one or more files or folders to import and click Select. The imported files are placed at the location which was selected when the import was initiated. E.g., if you right-click on a file in the Navigation Area and choose import, the imported files are placed immediately below the selected file. If you right-click a folder, the imported files are placed as the last elements in this folder. If you import one or more folders, the contents of the folder are automatically imported and placed in that folder in the Navigation Area. If the folder contains subfolders, the whole folder structure is imported. In the import dialog (figure 7.1), there are three import options:
- Automatic import: This will import the file, and CLC Combined Workbench will try to determine the format of the file. The format is determined based on the file extension (e.g. SwissProt files have swp at the end of the file name) in combination with a detection of elements
332. for use in the program. Most analyses of CLC Combined Workbench require that the data is saved in the Navigation Area. There are several ways to get data into the Navigation Area, and this tutorial describes how to import existing data. The View Area is the main area to the right. This is where the data can be viewed. In general, a View is a display of a piece of data, and the View Area can include several Views. The Views are represented by tabs and can be organized, e.g. by using drag and drop. 2.1.1 Creating a folder: When CLC Combined Workbench is started, there is one element in the Navigation Area called CLC_Data. This element is a Location. A location points to a folder on your computer where your data for use with CLC Combined Workbench is stored. The data in the location can be organized into folders. Create a folder: File | New | Folder, or Ctrl + Shift + N (⌘ + Shift + N on Mac). If you have downloaded the example data, this will be placed as a folder in CLC_Data.
333. formats ... 112
7.1.1 Import of bioinformatic data ... 112
7.1.2 Export of bioinformatic data ... 116
7.2 External files ... 118
7.3 Export graphics to files ... 118
7.3.1 Which part of the view to export ... 118
7.3.2 Save location and file formats ... 120
7.3.3 Graphics export parameters ... 121
7.3.4 Exporting protein reports ... 122
7.4 Copy/paste view output ... 123
CLC Combined Workbench handles a large number of different data formats. All data stored in the Workbench are available in the Navigation Area. The data of the Navigation Area can be divided into two groups: the data is either one of the different bioinformatic data formats, or it can be an external file. Bioinformatic data formats are those formats which the program can work with, e.g. sequences, alignments, and phylogenetic trees. External files are files or links which are stored in CLC Combined Workbench but are opened by other applications, e.g. pdf files, Microsoft Word files, Open Office spreadsheet files, or links to programs and web pages. This chapter first deals with importing and exporting data in bioinformatic data formats and as external files. Next comes an explan
334. ft, press Ctrl + A (⌘ + A on Mac) | Add. The enzymes can be sorted by clicking the column headings, i.e. Name, Overhang, Methylation, or Popularity. This is particularly useful if you wish to use enzymes which produce, e.g., a 3' overhang. In this case, you can sort the list by clicking the Overhang column heading, and all the enzymes producing 3' overhangs will be listed together for easy selection. When looking for a specific enzyme, it is easier to use the Filter. If you wish to find, e.g., HindIII sites, simply type HindIII into the filter, and the list of enzymes will shrink automatically to only include the HindIII enzyme. This can also be used to only show enzymes producing, e.g., a 3' overhang, as shown in figure 19.33.
335. g. To avoid confusion of the different import and export options, here is an overview:
- Import and export of bioinformatics data such as sequences, alignments etc. (described in section 7.1.1).
- Graphics export of the views, which creates image files in various formats (described in section 7.3).
- Import and export of Side Panel Settings, as described in the next section.
- Import and export of all the Preferences except the Side Panel settings. This is described above.
5.5 View settings for the Side Panel: The Side Panel is shown to the right of all views that are opened in CLC Combined Workbench. By using the settings in the Side Panel, you can specify the layout and contents of the view. Figure 5.3 is an example of the Side Panel of a sequence view. By clicking the black triangles or the corresponding headings, the groups can be expanded or collapsed. An example is shown in figure 5.4, where the Sequence layout is expanded. The content of the groups is described in the sections where the functionality is explained. E.g., Sequence Layout for sequences is described in chapter 10.1.1. When you have adjusted a view of, e.g., a sequence, your settings in the Side Panel can be saved. When you open other sequences which you want to display in a similar way, the saved settings can be applied. The options for saving and applying are available at the top of the Side Panel (see figure 5.5). To save and apply the saved settings, click seen
336. g phylogenies. Bootstrap values: A popular way of evaluating the reliability of an inferred phylogenetic tree is bootstrap analysis. The first step in a bootstrap analysis is to re-sample the alignment columns with replacement, i.e. in the re-sampled alignment, a given column in the original alignment may occur two or more times, while some columns may not be represented in the new alignment at all. The re-sampled alignment represents an estimate of how a different set of sequences from the same genes and the same species may have evolved on the same tree. If a new tree reconstruction on the re-sampled alignment results in a tree similar to the original one, this increases the confidence in the original tree. If, on the other hand, the new tree looks very different, it means that the inferred tree is unreliable. By re-sampling a number of times, it is possible to put reliability weights on each internal branch of the inferred tree. If the data was bootstrapped 100 times, a bootstrap score of 100 means that the corresponding branch occurs in all 100 trees made from re-sampled alignments. Thus, a high bootstrap score is a sign of greater reliability. Other useful resources: The Tree of Life web project: http://tolweb.org. Joseph Felsenstein's list of phylogeny software: http://evolution.genetics.washington.edu/phylip/software.html. Creative Commons License: All CLC bio's scientific articles are
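The column re-sampling step described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Workbench's code: the alignment is assumed to be a list of equal-length strings, and the names `bootstrap_columns` and `branch_support` are hypothetical. Tree reconstruction itself is not shown.

```python
import random


def bootstrap_columns(alignment, rng):
    """Re-sample alignment columns with replacement.

    alignment: list of equal-length strings, one per aligned sequence.
    rng: a random.Random instance, passed in so runs can be reproduced.
    Returns a new alignment with the same number of rows and columns.
    """
    n_cols = len(alignment[0])
    # Draw n_cols column indices; an index may be drawn several times
    # and some indices may not be drawn at all.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # The same picks are applied to every row, so columns stay intact.
    return ["".join(row[i] for i in picks) for row in alignment]


def branch_support(trees_with_branch, n_replicates):
    """Bootstrap score: percentage of replicate trees containing the branch."""
    return 100.0 * trees_with_branch / n_replicates


alignment = ["ACGTAC", "ACGTTC", "ATGTAC"]
replicate = bootstrap_columns(alignment, random.Random(1))
```

In a full analysis, a tree would be built from each replicate and `branch_support` would be computed per internal branch of the original tree, which gives the score of 100 described in the text when a branch appears in all 100 replicate trees.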
337. get a different result.
- Use only the most frequent codon: On the basis of the selected translation table, this parameter option will assign the codon that occurs most often. When choosing this option, the results of performing several reverse translations will always be the same, contrary to the other two options.
- Use codon based on frequency distribution: This option is a mix of the other two options. The selected translation table is used to attach weights to each codon based on its frequency. The codons are assigned randomly with a probability given by the weights. A more frequent codon has a higher probability of being selected. Every time you perform the analysis, you will get a different result. This option yields a result that is closer to the translation behavior of the organism (assuming you choose an appropriate codon frequency table).
- Map annotations to reverse translated sequence: If this checkbox is checked, all annotations on the protein sequence will be mapped to the resulting DNA sequence. In the tooltip on the transferred annotations, there is a note saying that the annotation derives from the original sequence.
CHAPTER 16 PROTEIN ANALYSES 267. The Codon Frequency Table is used to determine the frequencies of the codons. Select a frequency table from the list that fits the organism you are working with. A translation table of an organism is created on the basis of counting all the codons in the coding sequences. E
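The difference between the codon selection options can be made concrete with a short sketch. Everything here is an assumption for illustration: the codon frequencies in `CODON_FREQ` are invented toy values, not a real codon frequency table, the table covers only three amino acids, and the strategy names are hypothetical labels for the options described above (the purely random option is inferred from the text's reference to "the other two options").

```python
import random

# Toy codon frequencies for three amino acids; illustrative values only.
CODON_FREQ = {
    "M": {"ATG": 1.0},
    "K": {"AAA": 0.4, "AAG": 0.6},
    "F": {"TTT": 0.45, "TTC": 0.55},
}


def reverse_translate(protein, strategy, rng=None):
    """Reverse translate a protein string using one of three strategies.

    strategy: "random" (uniform choice among codons),
              "most_frequent" (always the most frequent codon),
              "frequency" (random choice weighted by codon frequency).
    """
    rng = rng or random.Random()
    dna = []
    for amino_acid in protein:
        freqs = CODON_FREQ[amino_acid]
        codons = list(freqs)
        if strategy == "random":
            dna.append(rng.choice(codons))
        elif strategy == "most_frequent":
            dna.append(max(codons, key=freqs.get))
        elif strategy == "frequency":
            weights = [freqs[c] for c in codons]
            dna.append(rng.choices(codons, weights=weights, k=1)[0])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return "".join(dna)


deterministic = reverse_translate("MKF", "most_frequent")
weighted = reverse_translate("MKF", "frequency", random.Random(0))
```

Only the `most_frequent` strategy is repeatable across runs without fixing a seed, which matches the text's remark that the other two options give a different result each time.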
338. [Figure 22.11: Proportional layout. Length of the arc is proportional to the number of residues in the arc.] [Figure 22.12: Even spread. Stems are spread evenly around loops.] Selecting and editing: When you are in Selection mode, you can select parts of the structure like in a normal sequence view: press down the mouse button where the selection should start | move the mouse cursor to where the selection should end | release the mouse button. CHAPTER 22 RNA STRUCTURE 383. One of the advantages of the secondary structure 2D view is that it is integrated with other views of the same sequence. This means that any selection made in this view will be reflected in other views (see figure 22.13).
339. gular expressions a a E a Motif search with ProSite patterns a a E m Pattern discovery y E C nm APPENDIX A COMPARISON OF WORKBENCHES 403 Primer design Free Protein DNA RNA Combined Advanced primer design tools a a Detailed primer and probe parameters a a Graphical display of primers a E Generation of primer design output m am Support for Standard PCR a C Support for Nested PCR nm C Support for TaqMan PCR C a Support for Sequencing primers nm a Alignment based primer design a a Alignment based TaqMan probedesign a C Match primer with sequence E o Ordering of primers ul E Advanced analysis of primer properties a a Assembly of sequencing data Free Protein DNA RNA Combined Advanced contig assembly a a Importing and viewing trace data u m Trim sequences a a Assemble without use of reference sequence C C Assemble to reference sequence E a Assemble to existing contig E E Viewing and edit contigs E a Tabular view of an assembled contig easy y C data overview Secondary peak calling a C Molecular cloning Free Protein DNA RNA Combined Advanced molecular cloning E E Graphical display of in silico cloning i E Advanced sequence manipulation E E SNP annotation using BLAST Free Protein DNA RNA Combined Integrated BLAST at SNP database E E Annotate sequence with SNP s E E Virtual gel viewer Free Protein DNA RNA Combined Fully integrated virtual 1D DNA gel simulator i E For a more detail
340. h The Vector NTI import is a plug-in which is pre-installed in the Workbench. It can be uninstalled and updated using the plug-in manager (see section 1.7).

7.1.2 Export of bioinformatic data
CLC Combined Workbench can export bioinformatic data in most of the formats that can be imported. There are a few exceptions; see section 7.1.1.
To export a file:
select the element to export | Export | choose where to export to | select 'File of type' | enter name of file | Save
Note: The Export dialog decides which types of files you are allowed to export into, depending on what type of data you want to export. E.g. protein sequences can be exported into GenBank, Fasta, Swiss-Prot and CLC formats.

Export of folders and multiple elements
The zip file type can be used to export all kinds of files and is therefore especially useful in these situations:
• Export of one or more folders including all underlying elements and folders.
• If you want to export two or more elements into one file.
Export of folders is similar to export of single files. Exporting multiple files of different formats is done in zip format. This is how you export a folder:
select the folder to export | Export | choose where to export to | enter name | Save
You can export multiple files of the same type into formats other than ZIP (.zip). E.g. two DNA sequences can be exported in GenBank format se
341. h shows some of the cloning capabilities which are included in CLC Combined Workbench, but there are many other ways to manipulate sequences in the cloning editor. To see all the possibilities, press F1 and read the help section for the cloning editor. However, all of the possibilities have one thing in common: they are all accessed by right-clicking selections, restriction sites or sequence names.

2.14 Tutorial: Folding RNA molecules
In this tutorial, you will learn how to predict the secondary structure of an RNA molecule. You will also learn how to view and interact with the graphical displays of the structure. The sequence to be folded in this tutorial is a tRNA molecule with the characteristic secondary structure shown in figure 2.54.
Figure 2.54: Secondary structure of a tRNA molecule.
The goal of this tutorial is to get a nice-looking graphic result of this structure.
The sequence we are working with is a mitochondrial tRNA molecule from Drosophila melanogaster. The name is AB009835, and it is located in the example data under Nucleotide > RNA.
Select the sequence AB009835 | Toolbox | RNA Structure | Predict Secondary Structure
Since the sequence is already selected, click Next. In this dialog, choose to compute a sample of suboptimal structures and leave the rest of the settings at their default (see figure 2.55).
342. hase licenses which are valid for a specified period of time, or you can purchase a license which will be valid forever. When you buy a license for CLC Combined Workbench, we will provide you with a license key which is activated as described here:
Start CLC Combined Workbench, and the dialog shown in figure 1.6 will appear. (If the program is already activated with another license, go to the Help menu and click Upgrade License. This will bring up the dialog shown in figure 1.6.)
Choose the option Import a license key file in order to specify where your license key is located. Select the license key file provided by CLC bio. If the license key was sent to you by email, you have to save it to e.g. your desktop first. When you have selected this file, the License Agreement is shown (see figure 1.7).
[Figure: the License Assistant dialog with the steps Get license, Accept agreement and Activate license. The dialog states that a valid license key file is required. If you already have a key file, you can import it with the import button. If you do not have a license, you can request an evaluation license online with the request button while connected to the internet, or by sending an email to license@clcbio.com. If you experience any problems, contact support@clcbio.com.]
343. he parameters you have entered, click Start search.
Note: When conducting a search, no files are downloaded. Instead, the program produces a list of links to the files in the UniProt database. This ensures a much faster search.

11.2.2 Handling of UniProt search results
The search result is presented as a list of links to the files in the UniProt database. The View displays 50 hits at a time (this can be changed in the Preferences, see chapter 5). More hits can be displayed by clicking the More button at the bottom of the View.
Each sequence hit is represented by text in the following columns:
• Accession
• Name
• Description
• Organism
It is possible to exclude one or more of these columns by adjusting the View preferences for the database search view. Furthermore, your changes in the View preferences can be saved. See section 5.5.
Several sequences can be selected, and by clicking the buttons at the bottom of the search view, you can do the following:
• Download and open: does not save the sequence.
• Download and save: lets you choose a location for saving the sequence.
• Open at UniProt: searches for the sequence at UniProt's web page.
Double-clicking a hit will download and open the sequence. The hits can also be copied into the View Area or the Navigation Area from the search results by drag and drop, copy/paste or by using the right-click menu as described
344. he relevant table row and select Open Sequence. The newly opened view can be dragged onto the Navigation Area for further analysis.
Structures can be rotated and moved using the mouse and keyboard. Pan mode must be enabled in order to rotate and move the structure. When changing to the 3D view, a dialog box with the option of shifting to Pan mode is displayed if Selection mode is enabled.
Note: It is only possible to view one structure file at a time, in order to limit the amount of memory used.

13.2.1 Moving and rotating
Structure files are rotated by holding down the left mouse button while moving the mouse. This rotates the structure in the direction the mouse is moved. The structures can be freely rotated in all directions. Holding down the Ctrl key (on Windows) or the ⌘ key (on Mac) while dragging the mouse moves the structure in the direction the mouse is moved. This is particularly useful if the view is zoomed to cover only a small region of the protein structure.
Zooming in and out on the structure is done by selecting the appropriate zoom tool in the toolbar and clicking with the mouse on the view area. Alternatively, click and hold the left mouse button while using either zoom tool and move the mouse up or down to zoom out or in, respectively. The view can be restored to display the entire structure by clicking the Fit width button on the toolbar read more
345. he selection and will result in two smaller fragments.
• Make Positive Strand Single Stranded. This will make the positive strand of the selected region single stranded.
• Make Negative Strand Single Stranded. This will make the negative strand of the selected region single stranded.
• Make Double Stranded. This will make the selected region double stranded.
• Copy Selection. This will copy the selected region to the clipboard, which makes it available for use in other programs.
• Duplicate Selection. If a selection on the sequence is duplicated, the selected region will be added as a new sequence to the cloning editor with a new sequence name representing the length of the fragment.
• Open Selection in New View. This will open the selected region in the normal sequence view.
• Edit Selection. This will open a dialog box in which it is possible to edit the selected residues.
• Delete Selection. This will delete the selected region of the sequence.
• Add Annotation. This will open the Add annotation dialog box.
• Show Enzymes Only Cutting Selection. This will add enzymes cutting this selection to the Side Panel.
• Insert Restriction Sites before/after Selection. This will show a dialog where you can choose from a list of restriction enzymes (see section 19.1.6).

Manipulate using restriction sites
Right-clicking a restriction site gives you the following
346. herefore a great disparity between the number of known RNA sequences and the number of known RNA 3D structures. However, as is the case for proteins, RNA tertiary structures can be characterized by secondary structural elements. These are defined by hydrogen bonds within the molecule that form several recognizable domains of secondary structure like stems, hairpin loops, bulges and internal loops (see below). Furthermore, the high degree of base pair conservation observed in the evolution of RNA molecules shows that a large part of the functional information is actually contained in the secondary structure of the RNA molecule.
Fortunately, RNA secondary structure can be computationally predicted from sequence data, allowing researchers to map sequence information to functional information. The subject of this paper is to describe a very popular way of doing this, namely free energy minimization. For an in-depth review of algorithmic details, we refer the reader to [Mathews and Turner, 2006].

22.5.1 The algorithm
Consider an RNA molecule and one of its possible structures S. In a stable solution there will be an equilibrium between unstructured RNA strands and RNA strands folded into S. The propensity of a strand to leave a structure such as S (the stability of S) is determined by the free energy change involved in its formation. The structure with the lowest free energy, S_min, is the most stable and will
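The link between the free energy change of a structure and the equilibrium between folded and unfolded strands can be illustrated with a small Python sketch. It assumes a simple two-state model (a strand is either unstructured or folded into S) and the standard relation K = exp(-ΔG / RT); the function name, the default temperature of 310.15 K and the two-state simplification are assumptions made for this illustration and are not taken from the manual.

```python
import math

# Gas constant in kcal/(mol*K), so that free energies can be given in kcal/mol.
GAS_CONSTANT_KCAL = 1.987e-3


def folded_fraction(delta_g, temperature_k=310.15):
    """Fraction of strands folded into S at equilibrium in a two-state model.

    delta_g is the free energy change of forming S, in kcal/mol.
    The equilibrium constant is K = exp(-delta_g / (R * T)) and the folded
    fraction is K / (1 + K), written here as 1 / (1 + exp(delta_g / (R * T))).
    """
    x = delta_g / (GAS_CONSTANT_KCAL * temperature_k)
    return 1.0 / (1.0 + math.exp(x))
```

In this model a free energy change of zero gives equal amounts of folded and unfolded strands, and a more negative free energy change shifts the equilibrium further towards the folded structure.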
347. heterogeneous behavior on the different templates in the group of included sequences.
Figure 17.13: Calculation dialog shown when designing alignment based PCR primers. The dialog lists the following parameters:
Chosen parameters: Maximum primer length, Minimum primer length, Maximum G/C content, Minimum G/C content, Maximum melting temperature, Minimum melting temperature, Maximum self annealing, Maximum self end annealing, Maximum secondary structure, 3' end must meet G/C requirements, 5' end must meet G/C requirements.
Exclusion parameters: Minimum number of mismatches, Minimum number of mismatches in 3' end, Length of 3' end.
Primer combination parameters: Max percentage point difference in G/C content, Max difference in melting temperatures within a primer pair, Max hydrogen bonds between pairs, Max hydrogen bonds between pair ends, Maximum length of amplicon.

17.9.3 Alignment-based TaqMan probe design
CLC Combined Workbench allows the user to design solutions for TaqMan quantitative PCR which consist of four oligos: a general primer pair which will amplify all sequences in the alignment, a specific TaqMan probe which will match the group of included sequences but not match the excluded sequences, and a specific TaqMan probe which will match the group of excluded sequences but not match the included sequences. As above, the selection boxes are used to indicate the status of a sequence: if the box is checked, the sequ
348. hich is relevant if you have performed restriction map analysis on more than one sequence.
If the enzyme's recognition sequence is on the negative strand, the cut position is put in brackets (as the enzyme TsoI in figure 19.28, whose cut position is [134]). Some enzymes cut the sequence twice for each recognition site, and in this case the two cut positions are surrounded by parentheses.

Table of restriction fragments
The restriction map can be shown as a table of fragments produced by cutting the sequence with the enzymes:
Click the Fragments button at the bottom of the view
The table is shown in figure 19.29.
Figure 19.29: The result of the restriction analysis shown as annotations.
Each row in the table represents a fragment. If more than one enzyme cuts in the same region, or if an enzyme's recognition site is cut by another enzyme, there will be a fragment for each of the possible cut combinations. The following information is available for each fragment
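How a list of cut positions turns into a table of fragments can be sketched as follows. This is a simplified illustration for a linear sequence, written under the assumption that a cut position c means the sequence is cut directly after residue c. It does not model conflicting enzymes or the alternative cut combinations mentioned above, and the function name and the sequence length used in the example are hypothetical.

```python
def digest_linear(sequence_length, cut_positions):
    """Return fragments of a linear sequence as (start, end, length) tuples.

    Positions are 1-based and inclusive. A cut position c is assumed to cut
    directly after residue c. Duplicate cuts and cuts outside the sequence
    are ignored.
    """
    cuts = sorted({c for c in cut_positions if 0 < c < sequence_length})
    bounds = [0] + cuts + [sequence_length]
    return [(start + 1, end, end - start) for start, end in zip(bounds, bounds[1:])]


# Example: a hypothetical 200 bp linear sequence cut after positions 99 and 134.
fragments = digest_linear(200, [134, 99])
```

For this example the middle fragment covers region 100 to 134 and has length 35.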
349. hide. Click the gray Side Panel button to the right to show.
Below, each group of settings will be explained. Some of the preferences are not the same for nucleotide and protein sequences, but the differences will be explained for each group of settings.
Note: When you make changes to the settings in the Side Panel, they are not automatically saved when you save the sequence. Click Save/restore Settings to save the settings (see section 5.5 for more information).

Sequence Layout
These preferences determine the overall layout of the sequence:
• Spacing. Inserts a space at a specified interval:
  No spacing. The sequence is shown with no spaces.
  Every 10 residues. There is a space every 10 residues, starting from the beginning of the sequence.
  Every 3 residues, frame 1. There is a space every 3 residues, corresponding to the reading frame starting at the first residue.
  Every 3 residues, frame 2. There is a space every 3 residues, corresponding to the reading frame starting at the second residue.
  Every 3 residues, frame 3. There is a space every 3 residues, corresponding to the reading frame starting at the third residue.
• Wrap sequences. Shows the sequence on more than one line:
  No wrap. The sequence is displayed on one line.
  Auto wrap. Wraps the sequence to fit the width of the view, no matter whether it is zoomed in or out (displays a minimum of 10 nucleotides on each line).
  Fixed wrap. Makes it p
350. ibody by a virus-specific synthetic peptide. J Virol, 55(3):836-839.
[Engelman et al., 1986] Engelman, D. M., Steitz, T. A., and Goldman, A. (1986). Identifying nonpolar transbilayer helices in amino acid sequences of membrane proteins. Annu Rev Biophys Biophys Chem, 15:321-353.
[Felsenstein, 1981] Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol, 17(6):368-376.
[Feng and Doolittle, 1987] Feng, D. F. and Doolittle, R. F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol, 25(4):351-360.
[Forsberg et al., 2001] Forsberg, R., Oleksiewicz, M. B., Petersen, A. M., Hein, J., Botner, A., and Storgaard, T. (2001). A molecular clock dates the common ancestor of European-type porcine reproductive and respiratory syndrome virus at more than 10 years before the emergence of disease. Virology, 289(2):174-179.
[Galperin and Koonin, 1998] Galperin, M. Y. and Koonin, E. V. (1998). Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol, 1(1):55-67.
[Gill and von Hippel, 1989] Gill, S. C. and von Hippel, P. H. (1989). Calculation of protein extinction coefficients from amino acid sequence data. Anal Biochem, 182(2):319-326.
[Gonda et al., 1989] Gonda, D. K., Bachmair, A., Wünning, I., Tobias, J. W., Lane, W. S., and Varshavsky, A. (1989). Universality
351. ic regions in proteins. Regions with a positive value are hydrophobic. This scale can be used for identifying both surface-exposed regions and transmembrane regions, depending on the window size used. Short window sizes of 5-7 generally work well for predicting putative surface-exposed regions. Large window sizes of 19-21 are well suited for finding transmembrane domains if the values calculated are above 1.6 [Kyte and Doolittle, 1982]. These values should be used as a rule of thumb, and deviations from the rule may occur.
• Cornette. Cornette et al. computed an optimal hydrophobicity scale based on 28 published scales [Cornette et al., 1987]. This optimized scale is also suitable for prediction of alpha-helices in proteins.
• Engelman. The Engelman hydrophobicity scale, also known as the GES scale, is another scale which can be used for prediction of protein hydrophobicity [Engelman et al., 1986]. Like the Kyte-Doolittle scale, this scale is useful for predicting transmembrane regions in proteins.
• Eisenberg. The Eisenberg scale is a normalized consensus hydrophobicity scale which shares many features with the other hydrophobicity scales [Eisenberg et al., 1984].
• Rose. The hydrophobicity scale by Rose et al. is correlated to the average area of buried amino acids in globular proteins [Rose et al., 1985]. This results in a scale which does not show the helices of a protein, but rather the surface accessibility.
• Janin. This scale also pro
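The role of the window size in a hydrophobicity plot can be illustrated with a short Python sketch of a sliding-window average over the Kyte-Doolittle scale. The scale values are the published Kyte-Doolittle hydropathy values; the function name, the default window of 19 and the choice to report one value per full window are assumptions made for this illustration, and the Workbench may handle sequence ends differently.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy for every full window along a protein sequence.

    Returns len(sequence) - window + 1 values, one per window position.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    total = sum(scores[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]
        profile.append(total / window)
    return profile


def candidate_transmembrane_windows(sequence, window=19, threshold=1.6):
    """Start indices (0-based) of windows whose average exceeds the threshold."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value > threshold]
```

A window of 19 with a threshold of 1.6 follows the rule of thumb quoted above for transmembrane domains, while a window of 5 to 7 would be the corresponding choice for surface-exposed regions.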
352. ication by navigating to the location where you chose to install it and running the command:
clccombinedwb3

1.2.5 Installation on Linux with an RPM package
Navigate to the directory containing the rpm package and install it using the rpm tool by running a command similar to:
rpm -ivh CLCCombinedWorkbench_3_JRE.rpm
If you are installing from a CD, the rpm packages are located in the RPMS directory. Installation of RPM packages usually requires root privileges. When the installation process is finished, the program can be executed by running the command:
clccombinedwb3

1.3 System requirements
The system requirements of CLC Combined Workbench are these:
• Windows 2000, Windows XP or Windows Vista
• Mac OS X 10.3 or newer
• Linux: Redhat or SuSE
• 256 MB RAM required
• 512 MB RAM recommended
• 1024 x 768 display recommended

1.4 Licenses
When you have installed CLC Combined Workbench, three license set-ups are available:
• Demo license for evaluating CLC Combined Workbench (section 1.4.1). It is a fully functional 30-day license. Further evaluation time can be requested.
• Fixed license (section 1.4.3). With this license type, you purchase one license per computer that should run CLC Combined Workbench.
• Floating license (section 1.4.4). By installing a license server, all computers on the network can access a set of floating licenses.
The three license types are
353. ics the inference of phylogenies is central to other areas of research.
As more and more genetic diversity is being revealed through the completion of multiple genomes, an active area of research within bioinformatics is the development of comparative machine learning algorithms that can simultaneously process data from multiple species [Siepel and Haussler, 2004]. Through the comparative approach, valuable evolutionary information can be obtained about which amino acid substitutions are functionally tolerant to the organism and which are not. This information can be used to identify substitutions that affect protein function and stability, and is of major importance to the study of proteins [Knudsen and Miyamoto, 2001]. Knowledge of the underlying phylogeny is, however, paramount to comparative methods of inference, as the phylogeny describes the underlying correlation from shared history that exists between data from different species.
In molecular epidemiology of infectious diseases, phylogenetic inference is also an important tool. The very fast substitution rate of microorganisms, especially the RNA viruses, means that these show substantial genetic divergence over the time scale of months and years. Therefore, the phylogenetic relationship between the pathogens from individuals in an epidemic can be resolved and contribute valuable epidemiological information about transmission chains and epidemiologically sig
354. icularly useful when you wish to edit several annotations. To edit the information, simply double-click and you will be able to edit e.g. the name or the annotation type. If you wish to edit the qualifiers and double-click in this column, you will see the dialog for editing annotations.

10.3.4 Removing annotations
Annotations can be hidden using the Annotation Types preferences in the Side Panel to the right of the view (see section 10.3.1). In order to completely remove the annotation:
right-click the annotation | Delete Annotation
If you want to remove all annotations of one type:
right-click an annotation of the type you want to remove | Delete Annotations of This Type
If you want to remove all annotations from a sequence:
right-click an annotation | Delete All Annotations
The removal of annotations can be undone using Ctrl + Z or Undo in the Toolbar.

10.4 Sequence information
The normal view of a sequence (by double-clicking) shows the annotations as boxes along the sequence, but often there is more information available about sequences. This information is available through the Sequence info view. To view the sequence information:
select a sequence in the Navigation Area | Show in the Toolbar | Sequence info
This will display a view similar to figure 10.20. All the lines in the view are headings, and the corresponding text can be shown by clicking the text.
• Name. The name of the sequence which is also
355. ied over to the new spliced alignment. Alignments can be joined by:
select alignments to join | Toolbox | Alignments and Trees | Join Alignments
or select alignments to join | right-click either selected alignment | Toolbox | Alignments and Trees | Join Alignments
This opens the dialog shown in figure 20.10.
If you have selected some alignments before choosing the Toolbox action, they are now listed in the Selected Elements window of the dialog. Use the arrows to add or remove alignments from the selected elements. Clicking Next opens the dialog shown in figure 20.11.
Figure 20.10: Selecting two alignments to be joined.
Figure 20.11: Selecting order of concatenation.
To adjust the order of concatenation, click the name of one of the alignments and move it up or down using the arrow buttons.
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. The result is seen in figure 20.12.
356. iew Settings. In general, these are default settings for the user interface.
The Toolbar preferences let you choose the size of the toolbar icons, and you can choose whether to display names below the icons.
The Side Panel Location setting lets you choose between Dock in views and Float in window. When docked in view, view preferences will be located in the right side of the view of e.g. an alignment. When floating in window, the Side Panel can be placed anywhere on your screen, also outside the workspace, e.g. on a different screen. See section 5.5 for more about floating side panels.
The New view setting allows you to choose whether the View preferences are to be shown automatically when opening a new view. If this option is not chosen, you can press Ctrl + U (⌘ + U on Mac) to see the preferences panels of an open view.
The View Format allows you to change the way the elements appear in the Navigation Area. The following text can be used to describe the element:
• Name (this is the default information to be shown).
• Accession (sequences downloaded from databases like GenBank have an accession number).
• Latin name.
• Latin name (accession).
• Common name.
• Common name (accession).
The User Defined View Settings gives you an overview of the different Side Panel settings that are saved for each view. See section 5.5 for more about how to create and save style sheets. If ther
357. iffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E. L. L., Studholme, D. J., Yeats, C., and Eddy, S. R. (2004). The Pfam protein families database. Nucleic Acids Res, 32(Database issue):D138-D141.
[Bendtsen et al., 2004a] Bendtsen, J. D., Jensen, L. J., Blom, N., Heijne, G. V., and Brunak, S. (2004a). Feature-based prediction of non-classical and leaderless protein secretion. Protein Eng Des Sel, 17(4):349-356.
[Bendtsen et al., 2005] Bendtsen, J. D., Kiemer, L., Fausbøll, A., and Brunak, S. (2005). Non-classical protein secretion in bacteria. BMC Microbiol, 5:58.
[Bendtsen et al., 2004b] Bendtsen, J. D., Nielsen, H., von Heijne, G., and Brunak, S. (2004b). Improved prediction of signal peptides: SignalP 3.0. J Mol Biol, 340(4):783-795.
[Blobel, 2000] Blobel, G. (2000). Protein targeting (Nobel lecture). Chembiochem, 1:86-102.
[Bommarito et al., 2000] Bommarito, S., Peyret, N., and SantaLucia, J. (2000). Thermodynamic parameters for DNA sequences with dangling ends. Nucleic Acids Res, 28(9):1929-1934.
[Brookes, 1999] Brookes, A. J. (1999). The essence of SNPs. Gene, 234(2):177-186.
[Chen et al., 2004] Chen, G., Znosko, B. M., Jiao, X., and Turner, D. H. (2004). Factors affecting thermodynamic stabilities of RNA 3 x 3 internal loops. Biochemistry, 43(40):12865-12876.
[Clote et al., 2005] Clote, P., Ferré, F., Kranakis, E., and Krizanc, D. (2005). Structural RNA has lower folding energy than random RNA of the same dinucle
358. ight-click the name of the sequence and choose Digest Sequence with Selected Enzymes and Run on Gel. The views where this option is available are listed below:
• Circular view (see section 10.2).
• Ordinary sequence view (see section 10.1).
• Graphical view of sequence lists (see section 10.7).
• Cloning editor (see section 19.1).
• Primer designer (see section 17.3).
Furthermore, you can also right-click an empty part of the view of the graphical view of sequence lists and the cloning editor and choose Digest All Sequences with Selected Enzymes and Run on Gel.
Note: When using the right-click options, the sequence will be digested with the enzymes that are selected in the Side Panel. This is explained in section 10.1.2.
The view of the gel is explained in section 19.3.3.

19.3.2 Separate sequences on gel
To separate sequences without restriction enzyme digestion, first create a sequence list of the sequences in question (see section 10.7). Then click the Gel button at the bottom of the view of the sequence list.
For more information about the view of the gel, see the next section.

19.3.3 Gel view
In figure 19.31 you can see a simulation of a gel with its Side Panel to the right. This view will be explained in this section.

Information on bands / fragments
You can get information about the individual bands by hovering the mouse cursor on the band of interest. This will display a tool tip with the following information
359. igure 2.38.
Figure 2.38: The Primer parameters.
Note that the maximum melting temperature is by default set to 58, and this is the reason why the primer in figure 2.37, with a melting temperature of 58.4, does not meet the requirements and is colored red. If you raise the maximum melting temperature to 59, the primer will meet the requirements and the dot becomes green.
In figure 2.37 there is an asterisk (*) before the melting temperature. This indicates that this primer does not meet the requirements regarding melting temperature. In this way, you can easily see why a specific primer (represented by a dot) fails to meet the requirements.
By adjusting the Primer parameters, you can define primers which match your specific needs. Since the dots are constantly updated, you can immediately see how a change in the primer parameters affects the number of red and green dots.

2.11.3 Calculating a primer pair
Until now, we have been looking at the forward primer. To mark a region for the reverse primer, make a selection covering positions 125 to 157 and:
Right-click the selection | Reverse primer region here
The two regions should now be located as show
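The red and green dots described in this tutorial amount to checking each candidate primer against the current Primer parameters. The following Python sketch illustrates that idea only. The class and function names are hypothetical, the default limits are assumed values for the sketch (apart from the maximum melting temperature of 58 quoted in the text), and the melting temperature is taken as an input because the way the Workbench calculates it is not described here.

```python
from dataclasses import dataclass


@dataclass
class PrimerParameters:
    # Assumed limits for this sketch; adjust them as you would in the Side Panel.
    min_length: int = 18
    max_length: int = 22
    min_melt_temp: float = 48.0
    max_melt_temp: float = 58.0


def failed_requirements(length, melt_temp, params):
    """Return the names of the requirements a candidate primer fails.

    An empty list corresponds to a green dot, a non-empty list to a red dot.
    """
    failures = []
    if not params.min_length <= length <= params.max_length:
        failures.append("length")
    if not params.min_melt_temp <= melt_temp <= params.max_melt_temp:
        failures.append("melt_temp")
    return failures


# A hypothetical 20-residue primer with a melting temperature of 58.4.
default_result = failed_requirements(20, 58.4, PrimerParameters())
relaxed_result = failed_requirements(20, 58.4, PrimerParameters(max_melt_temp=59.0))
```

With the default maximum of 58 the primer fails only on melting temperature, which corresponds to the asterisk in the tutorial. Raising the maximum to 59 leaves no failed requirements.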
360. igure 22.1: Selecting RNA or DNA sequences for structure prediction. DNA is folded as if it were RNA.
sequence lists from the selected elements. You can use both DNA and RNA sequences (DNA will be folded as if it were RNA). Click Next to adjust secondary structure prediction parameters. Clicking Next opens the dialog shown in figure 22.2.
Figure 22.2: Adjusting parameters for secondary structure prediction. The dialog contains the groups Structure output, Partition function and Advanced options.

22.1.2 Structure output
The predict secondary structure algorithm always calculates the minimum free energy structure of the input sequence. In addition to this, it is also possible to compute a sample of suboptimal structures by ticking the checkbox labeled Compute sample of suboptimal structures. Subsequently, you can specify how many structures to include in the output. The algorithm then iterates over all permissible canonical
361. ile format 112 PHR file format 35 113 409 Phred file format 35 113 409 phy file format 112 Phylip file format 35 113 409 Phylogenetic tree 366 402 tutorial 43 Phylogenetics Bioinformatics explained 369 pir file format 112 PIR NBRF file format 35 113 409 Plot dot plot 208 local complexity 218 Plug ins 26 png format export 120 Polarity colors 134 Portrait Print orientation 109 Positively charged residues 225 PostScript export 120 Preference group 103 Preferences 99 advanced 102 export 102 General 99 import 102 style sheet 103 toolbar 100 View 100 view 86 Primer 298 analyze 297 based on alignments 291 Buffer properties 279 design 402 design from alignments 402 display graphically 280 length 279 mode 280 nested PCR 280 order 300 sequencing 280 standard 280 TaqMan 280 tutorial 56 Print 107 3D molecule view 204 dot plots 210 preview 110 visible area 108 whole view 108 pro file format 112 Problems when starting up 25 Processes 88 Protease cleavage 269 Protein charge 248 401 cleavage 269 hydrophobicity 257 Isoelectric point 223 report 263 400 report output 265 signal peptide 242 statistics 222 structure prediction 262 translation 265 Proteolytic cleavage 269 401 Bioinformatics explained 271 tutorial 54 Proteolytic enzymes cleavage patterns 406 Proxy server 29 Proxy settings license activation 17 INDEX
362. ill be able to set trim parameters (see section 18.2.2).
Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.
When the assembly process has ended, a number of views will be shown, each containing a contig of two or more sequences that have been matched. If the number of contigs seems too high or low, try again with another Alignment stringency setting. Depending on your choices of output options above, the views will include trace files or only contig sequences. However, the calculation of the contig is carried out the same way, no matter how the contig is displayed.
See section 18.6 on how to use the resulting contigs.

18.4 Assemble to reference sequence
This section describes how to assemble a number of sequence reads into a contig using a reference sequence. A reference sequence can be particularly helpful when the objective is to characterize SNP variation in the data. Note that CLC Combined Workbench allows you to annotate a reference sequence with known SNP information from the dbSNP database (see section 12.5).
To start the assembly:
select sequences to assemble | Toolbox in the Menu Bar | Sequencing Data Analyses | Assemble Sequences to Reference
This opens a dialog where you can alter your choice of sequences which you want to assemble. You can also add sequence lists.
When the sequences are selected, click Next
363. Figure 3.6: Sequence properties for the HUMDINUC sequence. For a more comprehensive view of sequence information, see section 10.4.
3.2 View Area
The View Area is the right-hand part of the workbench interface, displaying your current work. The View Area may consist of one or more Views, represented by tabs at the top of the View Area. This is illustrated in figure 3.7. The tab concept is central to working with CLC Combined Workbench, because several operations can be performed by dragging the tab of a view, and extended right-click menus can be activated from the tabs. This chapter deals with the handling of views inside a View Area. Furthermore, it deals with rearranging the views. Section 3.3 deals with the zooming and selecting functions.
3.2.1 Open view
Opening a view can be done in a number of ways: double-click an element in the Navigation Area; or select an element in the Navigation Area | File | Show | Select the desired way to view the element; or select an element in the Navigation Area | Ctrl + O (⌘ + O on Mac). Opening a view while another view is already open will show the new view in front of the other
364. impact on the seeded sequence space as described above. But one can change the word size to find sequence matches which would otherwise not be found using the default parameters. For instance, the word size can be decreased when searching for primers or short nucleotide sequences. For blastn, a suitable setting would be to decrease the default word size of 11 to 7, increase the E-value significantly (1000) and turn off the complexity filtering. For blastp, a similar approach can be used: decrease the word size to 2, increase the E-value and use a more stringent substitution matrix, e.g. a PAM30 matrix. Fortunately, the optimal search options for finding short, nearly exact matches can already be found on the BLAST web pages: http://www.ncbi.nlm.nih.gov/BLAST/.
Substitution matrix. For protein BLAST searches, a default substitution matrix is provided. If you are looking at distantly related proteins, you should choose either a high-numbered PAM matrix or a low-numbered BLOSUM matrix. See Bioinformatics Explained on scoring matrices on http://www.clcbio.com/be/. The default scoring matrix for blastp is BLOSUM62.
12.6.6 Explanation of the BLAST output
The BLAST output comes in different flavors. On the NCBI web page, the default output is html, and the following description will use the html output as an example. Ordinary text and xml output for easy computational parsing are also available. The default layout of the NCBI BLAST result is a graphical representat
365. in CLC Combined Workbench
Figure 16.14: The different ways of displaying the hydrophobicity scores, using the Kyte-Doolittle scale.
the hydrophobicity scores. You can choose one, two or all three options by selecting the boxes (see figure 16.14).
Coloring the letters and their background. When choosing coloring of letters or coloring of their background, the color red is used to indicate high scores of hydrophobicity. A color slider allows you to amplify the scores, thereby emphasizing areas with high (or low, blue) levels of hydrophobicity. The color settings mentioned are default settings. By clicking the color bar just below the color slider, you get the option of changing color settings.
Graphs along sequences. When selecting graphs, you choose to display the hydrophobicity scores underneath the sequence. This can be done either by a line plot or bar plot, or by coloring. The latter option offers you the same possibilities of amplifying the scores as applies for coloring of letters. The different ways to display the scores when choosing graphs are displaye
366. Figure 4.7: Searching for human sequences shorter than 10,000 nucleotides containing the BRCA1 or BRCA2 genes.
4.4 Search index
This section has a technical focus and is not relevant if your search works fine. However, if you experience problems with your search results (if you do not get the hits you expect), it might be because of an index error. The CLC Combined Workbench automatically maintains an index of all data in all locations in the Navigation Area. If this index becomes out of sync with the data, your searches may return unexpected results. In this case, you can rebuild the index: Right-click the relevant location | Location | Rebuild Index. This will take a while, depending on the size of your data. At any time the
367. in on this position: Zoom in in the Tool Bar | Click the selected base | Click again three times. Now you have zoomed in on the trace (see figure 2.46).
Figure 2.46: Now you can see all the details of the traces.
Since the other reads have a G, and because there is also a black peak below the green peak, we conclude that it should have been a G. To change the A to G: Select the A in read1 | Press g on your keyboard.
2.12.4 Getting an overview of the inconsistencies
Browsing the inconsistencies by clicking the Find Inconsistencies button is useful in many cases, but you might also want to get an overview of all the inconsistencies in the entire contig. This is easily achieved by showing the contig in a table view: Press and hold the Ctrl button (⌘ on Mac) | Click Show Table at the bottom of the view. This will open a table showing the inconsistencies. You can right-click the Comment field and enter your own comment, as shown in figure 2.47.
Figure 2.47: The graphical view of a contig is displayed at the top. At the bottom, the conflicts are shown in a table. At the conflict at position 637, the user has entered a comment in the table. This comment is now also reflected on the tooltip of the conflict annotation in the graphical view above.
When you edit a comment, this is also reflected in the conflict annotation on the con
368. in proteins.
Rose scale. The hydrophobicity scale by Rose et al. is correlated to the average area of buried amino acids in globular proteins (Rose et al., 1985). This results in a scale which does not show the helices of a protein, but rather the surface accessibility.
Janin scale. This scale also provides information about the accessible and buried amino acid residues of globular proteins (Janin, 1979).
Welling scale. Welling et al. used information on the relative occurrence of amino acids in antigenic regions to make a scale which is useful for prediction of antigenic regions. This method is better than the Hopp-Woods scale of hydrophobicity, which is also used to identify antigenic regions.
Kolaskar-Tongaonkar. A semi-empirical method for prediction of antigenic regions has been developed (Kolaskar and Tongaonkar, 1990). This method also includes information of surface

aa  aa              Kyte-      Hopp-   Cornette  Eisenberg  Rose   Janin   Engelman
                    Doolittle  Woods                                       (GES)
A   Alanine          1.80      -0.50    0.20      0.62      0.74    0.30     1.60
C   Cysteine         2.50      -1.00    4.10      0.29      0.91    0.90     2.00
D   Aspartic acid   -3.50       3.00   -3.10     -0.90      0.62   -0.60    -9.20
E   Glutamic acid   -3.50       3.00   -1.80     -0.74      0.62   -0.70    -8.20
F   Phenylalanine    2.80      -2.50    4.40      1.19      0.88    0.50     3.70
G   Glycine         -0.40       0.00    0.00      0.48      0.72    0.30     1.00
H   Histidine       -3.20      -0.50    0.50     -0.40      0.78   -0.10    -3.00
I   Isoleucine       4.50      -1.80    4.80      1.38      0.88    0.70     3.10
K   Lysine          -3.90       3.00   -3.10     -1.50      0.52   -1.80    -8.80
L   Leuc
369. in the different lineages. This means that a root of the tree is also estimated. The neighbor joining method builds a tree where the evolutionary rates are free to differ in different lineages. CLC Combined Workbench always draws trees with roots for practical reasons, but with the neighbor joining method, no particular biological hypothesis is postulated by the placement of the root. Figure 21.3 shows the difference between the two methods.
• To evaluate the reliability of the inferred trees, CLC Combined Workbench allows the option of doing a bootstrap analysis. A bootstrap value will be attached to each branch, and this value is a measure of the confidence in this branch. The number of replicates in the bootstrap analysis can be adjusted in the wizard. The default value is 100. For a more detailed explanation, see Bioinformatics explained in section 21.2.
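The resampling step behind a bootstrap analysis can be sketched as follows. This is a minimal illustration of the standard procedure (alignment columns are drawn with replacement to build each replicate), not a description of how the workbench implements it; the function names and the toy alignment are hypothetical, and the tree building that would follow each replicate is left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Draw as many column indices as the alignment has columns.
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[i] for i in picks) for row in alignment]


def bootstrap_replicates(alignment, n_replicates=100, seed=0):
    """Build n_replicates resampled alignments (100 mirrors the default above)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "TCGA"]  # hypothetical three-sequence alignment
    replicates = bootstrap_replicates(toy, n_replicates=3)
    for rep in replicates:
        print(rep)
```

In the usual formulation, a tree is inferred from every replicate, and the bootstrap value of a branch is the percentage of replicate trees in which that branch appears.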
370. in the excluded group to ensure that it does not prime these.
• Length of 3' end: the number of consecutive nucleotides to consider for mismatches in the 3' end of the primer.
The lower part of the dialog contains parameters pertaining to primer pairs (this is omitted when only designing a single primer). Here the following parameters can be set:
• Maximum percentage point difference in G/C content: if this is set at e.g. 5 points, a pair of primers with 45 % and 49 % G/C nucleotides, respectively, will be allowed, whereas a pair of primers with 45 % and 51 % G/C nucleotides, respectively, will not be included.
• Maximal difference in melting temperature of primers in a pair: the number of degrees Celsius that primers in a pair are allowed to differ.
• Max hydrogen bonds between pairs: the maximum number of hydrogen bonds allowed between the forward and the reverse primer in a primer pair.
• Maximum length of amplicon: determines the maximum length of the PCR fragment.
The output of the design process is a table of single primers or primer pairs, as described for primer design based on single sequences. These primers are specific to the included sequences in the alignment according to the criteria defined for specificity. The only novelty in the table is that melting temperatures are displayed with a maximum, a minimum and an average value, to reflect that degenerate primers or primers with mismatches may have
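The two pair criteria based on G/C content and melting temperature can be expressed as a simple filter. The sketch below only illustrates the rule as worded above; the function names are hypothetical, the melting temperatures are taken as given inputs rather than computed, and the workbench's own scoring is not reproduced.

```python
def gc_percent(primer):
    """Percentage of G and C nucleotides in a primer sequence."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def pair_allowed(gc_a, gc_b, tm_a, tm_b, max_gc_diff=5.0, max_tm_diff=5.0):
    """True if a primer pair passes both difference limits.

    gc_a, gc_b: G/C content in percent; tm_a, tm_b: melting temperature in
    degrees Celsius. The default limits are illustrative assumptions.
    """
    return (abs(gc_a - gc_b) <= max_gc_diff
            and abs(tm_a - tm_b) <= max_tm_diff)


if __name__ == "__main__":
    # The example from the text: 45 % vs 49 % passes a 5 point limit,
    # 45 % vs 51 % does not (melting temperatures set equal here).
    print(pair_allowed(45, 49, 60.0, 60.0))
    print(pair_allowed(45, 51, 60.0, 60.0))
```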
371. inal end of the preprotein for correct classification. As large-scale genome sequencing projects sometimes assign the 5' end of genes incorrectly, many proteins are annotated without the correct N-terminal (Reinhardt and Hubbard, 1998), leading to incorrect prediction of subcellular localization.
1 http://www.ncbi.nlm.nih.gov/entrez/
Figure 16.3: Schematic representation of various signal peptides (Sec signal peptide, twin-arginine signal peptide, lipoprotein signal peptide, prepilin-like signal peptide, bacteriocin signal peptide, non-classical secreted protein). Red color indicates n-region, gray color indicates h-region, cyan indicates c-region. All white circles are part of the mature protein. +1 indicates the first position of the mature protein. The length of the signal peptides is not drawn to scale.
These erroneous predictions can be ascribed directly to poor gene finding. Other methods for prediction of subcellular localization use information within the mature protein, and therefore they are more robust to N-terminal tr
372. ine          3.80      -1.80    5.70      1.06      0.85    0.50     2.80
M   Methionine       1.90      -1.30    4.20      0.64      0.85    0.40     3.40
N   Asparagine      -3.50       0.20   -0.50     -0.78      0.63   -0.50    -4.80
P   Proline         -1.60       0.00   -2.20      0.12      0.64   -0.30    -0.20
Q   Glutamine       -3.50       0.20   -2.80     -0.85      0.62   -0.70    -4.10
R   Arginine        -4.50       3.00    1.40     -2.53      0.64   -1.40   -12.3
S   Serine          -0.80       0.30   -0.50     -0.18      0.66   -0.10     0.60
T   Threonine       -0.70      -0.40   -1.90     -0.05      0.70   -0.20     1.20
V   Valine           4.20      -1.50    4.70      1.08      0.86    0.60     2.60
W   Tryptophan      -0.90      -3.40    1.00      0.81      0.85    0.30     1.90
Y   Tyrosine        -1.30      -2.30    3.20      0.26      0.76   -0.40    -0.70

Table 16.1: Hydrophobicity scales. This table shows seven different hydrophobicity scales which are generally used for prediction of e.g. transmembrane regions and antigenicity.

accessibility and flexibility, and at the time of publication the method was able to predict antigenic determinants with an accuracy of 75 %.
Surface Probability. Display of surface probability based on the algorithm by Emini et al. (1985). This algorithm has been used to identify antigenic determinants on the surface of proteins.
Chain Flexibility. Display of backbone chain flexibility based on the algorithm by Karplus and Schulz (1985). It is known that chain flexibility is an indication of a putative antigenic determinant.
Many more scales have been published throughout the last three decades. Even though more advanced methods have been developed for prediction of membrane-spanning regions, the simple and very fast calculations are still high
373. ing horizontally may be done this way: right-click a tab of the view | View | Split Horizontally. This action opens the chosen view below the existing view (see figure 3.13). When the split is made vertically, the new view opens to the right of the existing view. Splitting the View Area can be undone by dragging e.g. the tab of the bottom view to the tab of the top view. This is marked by a gray area on the top of the view.
Figure 3.13: A vertical split screen.
Maximize/Restore size of view. The Maximize/Restore View function allows you to see a view in maximized mode, meaning a mode where neither other views nor the Navigation Area are shown.
374. inish. This will open a view showing the motifs or patterns found as annotations on the original sequence (see figure 14.21). If you have selected several sequences, a corresponding number of views will be opened.
Figure 14.21: Sequence view displaying the pattern found. The search string was 'QRQXRXXXXQQ'.
14.6.2 Motif search output
If the analysis is performed on several sequences at a time, the method will search for patterns in the sequences and open a new view for each of the sequences. If wanted, annotations on patterns found can be added to all the sequences. Each pattern found will be represented as an annotation of the type Region. More information on each motif or pattern found is available through the tooltip, including detailed information on the position of the pattern and how similar it was to the search string. It is also possible to get a tabular view of all motifs or patterns found, in either one combined table or in individual tables if multiple sequences were selected. Then each pattern found will be represented with its position in the sequence and the obtained accuracy score.
14.7 Pattern Discovery
With CLC Combined Workbench you can perform pattern discovery on both DNA and protein sequences. Advanced hidden Markov models can help to identify unknown sequence patterns across single or even multiple sequences. In order to search for unknown patterns: Select DNA or protein sequence
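As an illustration of the motif search output in section 14.6.2, a search string such as QRQXRXXXXQQ can be treated as a pattern in which X stands for any residue. The sketch below makes that assumption and reports exact matches only; it does not reproduce the similarity (accuracy) score described above. The function names are hypothetical, and the test sequence is the fragment visible in figure 14.21.

```python
import re


def pattern_to_regex(pattern, wildcard="X"):
    """Translate a search string into a regex; the wildcard matches any residue."""
    body = "".join("." if ch == wildcard else re.escape(ch)
                   for ch in pattern.upper())
    # A lookahead lets overlapping occurrences be reported as well.
    return re.compile("(?=(" + body + "))")


def find_matches(sequence, pattern):
    """Return (start, end, matched_text) tuples with 1-based positions."""
    hits = []
    for m in pattern_to_regex(pattern).finditer(sequence.upper()):
        text = m.group(1)
        start = m.start() + 1
        hits.append((start, start + len(text) - 1, text))
    return hits


if __name__ == "__main__":
    fragment = "LELQRQKRSINLQQPRMATERGN"
    for start, end, text in find_matches(fragment, "QRQXRXXXXQQ"):
        print(start, end, text)
```

Each tuple corresponds to one row of the tabular view: the position of the pattern in the sequence and the matched residues.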
375. ion 14 Workspace 89 create 89 delete 90 save 89 select 90
376. ion and updating are the same as for plug-ins (see section 1.7.1 and section 1.7.2 for more information).
1.8 Network configuration
If you use a proxy server to access the Internet, you must configure CLC Combined Workbench to use this. Otherwise you will not be able to perform any online activities (e.g. searching GenBank). CLC Combined Workbench supports the use of an HTTP proxy and an anonymous SOCKS proxy.
Figure 1.18: Resources available for download.
Figure 1.19: Adjusting proxy preferences.
To configure your proxy settings, open CLC Combined Workbench and go to the Advanced tab
377. ion of the hits found, a table of sequence identifiers of the hits together with scoring information, and alignments of the query sequence and the hits.
The graphical output (shown in figure 12.25) gives a quick overview of the query sequence and the resulting hit sequences. The hits are colored according to the obtained alignment scores.
The table view (shown in figure 12.26) provides more detailed information on each hit, and furthermore acts as a hyperlink to the corresponding sequence in GenBank.
In the alignment view, one can manually inspect the individual alignments generated by the BLAST algorithm. This is particularly useful for detailed inspection of the sequence hit found (sbjct) and the corresponding alignment. In the alignment view, all scores are described for each alignment, and the start and stop positions for the query and hit sequence are listed. The strand and orientation for query sequence and hits are also found here. In most cases, the table view of the results will be easier to interpret than tens of sequence alignments.
Color key for alignment scores: <40, 40-50, 50-80, 80-200, >200.
378. is is useful if the vector and the insert sequences are not oriented the same way.
• Digest Sequence with Selected Enzymes and Run on Gel. See section 19.3.1.
• Rename sequence. Renames the sequence.
• Select sequence. This will select the entire sequence.
• Delete sequence. This deletes the given sequence from the cloning editor.
• Open copy of sequence in new view. This will open a copy of the selected sequence in a normal sequence view.
• Open this sequence in new view. This will open the selected sequence in a normal sequence view.
• Make sequence circular. This will convert a sequence from a linear to a circular form. If the sequence has matching overhangs at the ends, they will be merged together. If the sequence has incompatible overhangs, a dialog is displayed, and the sequence cannot be made circular. The circular form is represented by >> and << at the ends of the sequence.
• Make sequence linear. This will convert a sequence from a circular to a linear form, removing the << and >> at the ends.
• Sort sequence list by name. This will sort all the sequences in the cloning editor alphabetically by name.
• Sort sequences by length. This will sort all the sequences in the cloning editor by length.
Manipulate parts of the sequence. Right-clicking a selection reveals several options on manipulating the selection (see figure 19.5).
379. is opens the dialog shown in fig. 2.14.
Figure 2.14: The alignment dialog displaying the 8 chosen protein sequences.
It is possible to add and remove sequences from the Selected Elements list. Since we had already selected the eight proteins, just click Next to adjust parameters for the alignment. Clicking Next opens the dialog shown in fig. 2.15. Leave the parameters at their default settings. An explanation of the parameters can be found by clicking the help button. Alternatively, a tooltip is displayed by holding the mouse cursor on the parameters. Click Finish to start the alignment process, which is shown in the Toolbox under the Processes tab. When the program has finished calculating, it displays the alignment (see fig. 2.16). Note! The new alignment is not saved automatically.
Figure 2.15: The alignme
380. is possible to continue other tasks in the program. Like the search process, the download process can be stopped, paused, and resumed.
11.2.3 Save UniProt search parameters
The search view can be saved either by dragging the search tab and dropping it in the Navigation Area, or by clicking Save. When saving the search, only the parameters are saved, not the results of the search. This is useful if you have a special search that you perform from time to time. Even if you don't save the search, the next time you open the search view, it will remember the parameters from the last time you did a search.
11.3 Search for structures at NCBI
This section describes searches for three-dimensional structures from the NCBI structure database http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml. For manipulating and visualization of the downloaded structures, see section 13. The NCBI search view is opened in this way: Search | Search for structures at NCBI, or Ctrl + B (⌘ + B on Mac). This opens the view shown in figure 11.4.
381. is step (shown in figure 9.1) you have two options:
• Open. This will open the result of the analysis in a view. This is the default setting.
• Save. This means that the result will not be opened but saved to a folder in the Navigation Area. If you select this option, click Next, and you will see one more step where you can specify where to save the results (see figure 9.2). In this step you also have the option of creating a new folder or adding a location by clicking the buttons at the top of the dialog.
Figure 9.1: The last step of the analyses, exemplified by Translate DNA to RNA.
Figure 9.2: Specify a folder for the results of the analysis.
9.1.1 Table outputs
Some analyses also generate a table with results, and for these analyses the last step looks like figure 9.3. In addition to the Open and Save options, you can also choose whether the result o
382. ity of 256 mutations in 100 amino acids (see figure 14.13). There are some limitations to the PAM matrices which make the BLOSUM matrices somewhat more attractive. The dataset on which the initial PAM matrices were built is very old by now, and the PAM matrices assume that all amino acids mutate at the same rate; this is not a correct assumption.
BLOSUM. In 1992, 14 years after the PAM matrices were published, the BLOSUM matrices (BLOcks SUbstitution Matrix) were developed and published (Henikoff and Henikoff, 1992). Henikoff et al. wanted to model more divergent proteins, thus they used locally aligned sequences where none of the aligned sequences share less than 62 % identity. This resulted in a scoring matrix called BLOSUM62. In contrast to the PAM matrices, the BLOSUM matrices are calculated from alignments without gaps emerging from the BLOCKS database http://blocks.fhcrc.org/. Sean Eddy recently wrote a paper reviewing the BLOSUM62 substitution matrix and how to calculate the scores (Eddy, 2004).
Figure 14.12: The dot plot showing a low-complexity region in the sequence. The sequence is artificial, and low-complexity regions do not always show as a square.
Use of scoring matrices. Deciding which scoring matrix you should use in order to obtain the best alignment results is a difficult task. If you have no prior knowledge on the sequence, the BLOS
383. ium hydrochloride
• 0.02 M phosphate buffer
The extinction coefficient values of the three important amino acids at different wavelengths are found in (Gill and von Hippel, 1989). Knowing the extinction coefficient, the absorbance (optical density) can be calculated using the following formula:

$$\text{Absorbance(Protein)} = \frac{\text{Ext(Protein)}}{\text{Molecular weight}}$$

Two values are reported. The first value is computed assuming that all cysteine residues appear as half cystines, meaning they form disulfide bridges to other cysteines. The second number assumes that no disulfide bonds are formed.
Atomic composition. Amino acids are indeed very simple compounds. All 20 amino acids consist of combinations of only five different atoms. The atoms which can be found in these simple structures are: Carbon, Nitrogen, Hydrogen, Sulfur, Oxygen. The atomic composition of a protein can, for example, be used to calculate the precise molecular weight of the entire protein.
Total number of negatively charged residues (Asp + Glu). At neutral pH, the fraction of negatively charged residues provides information about the location of the protein. Intracellular proteins tend to have a higher fraction of negatively charged residues than extracellular proteins.
Total number of positively charged residues (Arg + Lys). At neutral pH, nuclear proteins have a high relative percentage of positively charged amino acids. Nuclear proteins often bind to the
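The two reported absorbance values can be illustrated with a short calculation. This is a sketch under stated assumptions: the per-residue extinction coefficients below are the 280 nm values commonly quoted from Gill and von Hippel (1989) and should be checked against that paper for the wavelength in use; the function names are hypothetical; and the molecular weight is passed in rather than computed.

```python
# Assumed molar extinction coefficients at 280 nm (M^-1 cm^-1); verify
# against Gill and von Hippel (1989) before relying on them.
EXT_280 = {"Y": 1280, "W": 5690, "cystine": 120}


def extinction_coefficients(sequence, table=EXT_280):
    """Return (ext_with_cystines, ext_without_cystines) for a protein.

    The first value assumes all cysteines pair up as half cystines
    (disulfide bridges); the second assumes no disulfide bonds.
    """
    seq = sequence.upper()
    base = seq.count("Y") * table["Y"] + seq.count("W") * table["W"]
    cystines = seq.count("C") // 2  # two cysteines form one cystine
    return base + cystines * table["cystine"], base


def absorbance(ext, molecular_weight):
    """Absorbance(Protein) = Ext(Protein) / Molecular weight."""
    return ext / molecular_weight


if __name__ == "__main__":
    with_ss, without_ss = extinction_coefficients("YWCC")
    print(with_ss, without_ss)
    print(absorbance(with_ss, 1000.0))  # 1000.0 is a placeholder weight
```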
384. ive value are hydrophobic. This scale can be used for identifying both surface-exposed regions as well as transmembrane regions, depending on the window size used. Short window sizes of 5-7 generally work well for predicting putative surface-exposed regions. Large window sizes of 19-21 are well suited for finding transmembrane domains if the values calculated are above 1.6 (Kyte and Doolittle, 1982). These values should be used as a rule of thumb, and deviations from the rule may occur.
Engelman scale. The Engelman hydrophobicity scale, also known as the GES scale, is another scale which can be used for prediction of protein hydrophobicity (Engelman et al., 1986). Like the Kyte-Doolittle scale, this scale is useful for predicting transmembrane regions in proteins.
Eisenberg scale. The Eisenberg scale is a normalized consensus hydrophobicity scale which shares many features with the other hydrophobicity scales (Eisenberg et al., 1984).
Hopp-Woods scale. Hopp and Woods developed their hydrophobicity scale for identification of potentially antigenic sites in proteins. This scale is basically a hydrophilic index where apolar residues have been assigned negative values. Antigenic sites are likely to be predicted when using a window size of 7 (Hopp and Woods, 1983).
Cornette scale. Cornette et al. computed an optimal hydrophobicity scale based on 28 published scales (Cornette et al., 1987). This optimized scale is also suitable for prediction of alpha helices
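The window-based use of the Kyte-Doolittle scale can be sketched as a sliding-window average. The values in the dictionary are the published Kyte-Doolittle hydropathy values; the function names, the choice of a plain (unweighted) average and the test sequences are assumptions made for illustration, and the workbench may compute its plots differently.

```python
# Kyte-Doolittle hydropathy values (Kyte and Doolittle, 1982).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_scores(sequence, window=19):
    """Average hydropathy of every full window along the sequence."""
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    return [sum(values[i:i + window]) / window
            for i in range(len(seq) - window + 1)]


def transmembrane_candidates(sequence, window=19, threshold=1.6):
    """1-based start positions of windows whose average exceeds the threshold."""
    return [i + 1 for i, score in enumerate(window_scores(sequence, window))
            if score > threshold]


if __name__ == "__main__":
    # Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
    seq = "KKKDE" + "LIVALLVAILLAVIVLLAL" + "RRKDE"
    print(transmembrane_candidates(seq, window=19))
```

A window of 5 to 7 residues can be passed instead when looking for surface-exposed regions; the 1.6 threshold is a rule of thumb for transmembrane segments only.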
385. k each of the small fragments | Delete Sequence.
2.13.3 Inserting the fragment in the vector
Find and select the SphI restriction site at position 567 of the donor plasmid pBR322. To insert the PCR fragment at this position: right-click the SphI restriction site in pBR322 | Insert Sequence at This SphI site | Select the PCR fragment from the drop-down list | Select. This will produce a result shown in figure 2.52.
Figure 2.52: The fragment has been inserted into the cloning vector. Notice that the sequence inserted is automatically selected, and both ends of the inserted fragment are shown in the sequence details.
Now the fragment has been inserted, and you can see how it breaks up the tetracycline resistance gene. If you had another sequence with an overhang different from the one created by the SphI enzyme, you would not be able to insert this. Open the sequence in a circular view and see that the tetracycline gene is disrupted by an insert of the HBG2 gene: right-click the name of the pBR322 sequence | Open Sequence in Circular View. This will show a circular view of the plasmid, as shown in figure 2.53.
Figure 2.53: A circular view of the result of the cloning tutorial.
This very short walk throug
386. king up very much disk space, you can free this disk space by emptying the Recycle Bin: Edit in the Menu Bar | Empty Recycle Bin. Note! This cannot be undone, and you will therefore not be able to recover the data present in the recycle bin when it was emptied.
3.1.8 Show folder elements in View
A location or a folder might contain a large number of elements. It is possible to view their elements in the View Area: select a folder or location | Show in the Toolbar | Contents. When the elements are shown in the view, they can be sorted by clicking the heading of each of the columns. You can further refine the sorting by pressing Ctrl (⌘ on Mac) while clicking the heading of another column. Sorting the elements in a view does not affect the ordering of the elements in the Navigation Area. Note! The view only displays one layer at a time: the content of subfolders is not visible in this view.
3.1.9 Sequence properties
Sequences downloaded from databases have a number of properties, which can be displayed using the Sequence Properties function: Right-click a sequence in the Navigation Area | Properties. This will show a dialog as shown in figure 3.6.
387. Figure 19.18: Selecting enzymes.
If you need more detailed information and filtering of the enzymes, either place your mouse cursor on an enzyme for one second to display additional information (see figure 19.34), or use the view of enzyme lists (see 19.4).
Figure 19.19: Showing additional info
388. l and Gish 1996. BLAST is highly scalable and comes in a number of different computer platform configurations, which makes usage on both small desktop computers and large computer clusters possible.
12.6.1 Examples of BLAST usage
BLAST can be used for a lot of different purposes. A few of them are mentioned below.
• Looking for species. If you are sequencing DNA from unknown species, BLAST may help identify the correct species or homologous species.
• Looking for domains. If you BLAST a protein sequence (or a translated nucleotide sequence), BLAST will look for known domains in the query sequence.
• Looking at phylogeny. You can use the BLAST web pages to generate a phylogenetic tree of the BLAST result.
• Mapping DNA to a known chromosome. If you are sequencing a gene from a known species but have no idea of the chromosome location, BLAST can help you. BLAST will show you the position of the query sequence in relation to the hit sequences.
• Annotations. BLAST can also be used to map annotations from one organism to another, or to look for common genes in two related species.
12.6.2 Searching for homology
Most research projects involving sequencing of either DNA or protein have a requirement for obtaining biological information about the newly sequenced and maybe unknown sequence. If the researchers have no prior information about the sequence and its biological content, valuable information can often be obtained u
389. l difference in melting temperature of primers in a pair: the number of degrees Celsius that primers in the primer pair are all allowed to differ.
• Maximum pair annealing score: the maximum number of hydrogen bonds allowed between the forward and the reverse primer in an oligo pair. This criterion is applied to all possible combinations of primers and probes.
• Minimum difference in the melting temperature of primer (outer) and TaqMan probe (inner) oligos: all comparisons between the melting temperature of primers and probes must be at least this different, otherwise the solution set is excluded.
• Desired temperature difference in melting temperature between outer (primers) and inner (TaqMan) oligos: the scoring function discounts solution sets which deviate greatly from this value. Regarding this and the minimum difference option mentioned above, please note that, to ensure flexibility, there is no directionality indicated when setting parameters for melting temperature differences between probes and primers, i.e. it is not specified whether the probes should have a lower or higher Tm. Instead this is determined by the allowed temperature intervals for inner and outer oligos that are set in the primer parameters preference group in the side panel. If a higher Tm of probes is required, choose a Tm interval for probes which has higher values than the interval for outer primers.
The output of the design process is a table of solution sets. Each solution
390. l start the secondary peak calling. A detailed history entry will be added to the history, specifying all the changes made to the sequence.
Chapter 19 Cloning and cutting
Contents
19.1 Molecular cloning
19.1.1 Introduction to the cloning view
19.1.2 Sequence details
19.1.3 How to navigate the cloning view
19.1.4 Manipulate sequences
19.1.5 Insert one sequence into another
19.1.6 Insert restriction site
19.1.7 Show in a circular view
19.2 Restriction site analysis
19.2.1 Dynamic restriction sites
19.2.2 Restriction site analysis from the Toolbox
19.3 Gel electrophoresis
19.3.1 Separate fragments of sequences on gel
19.3.2 Separate sequences on gel
19.3.3 Gel view
19.4 Restriction enzyme lists
19.4.1 Create enzyme list
19.4.2 View and modify enzyme list
CLC Combined Workbench offers graphically advanced in silico c
391. le are colored green. The conflict can be resolved by correcting the deviating residues in the reads as described above. A fast way of making all the reads reflect the consensus sequence is to select the position in the consensus, right-click the selection and choose Transfer Selection to All Reads. The opposite is also possible: make a selection on one of the reads, right-click and choose Transfer Selection to Contig Sequence.
18.6.5 Output from the contig
Due to the integrated nature of CLC Combined Workbench, it is easy to use the created contig sequence as input for additional analyses. If you wish to use the contig sequence for other analyses: right-click the name of the contig (to the left) | Open Copy of Sequence in New View | Save the new sequence. This will generate a new nucleotide sequence which can be used for e.g. BLAST analysis or cloning construction. In order to preserve the history of the changes you have made to the contig, the contig itself should be saved from the contig view, using either the save button or by dragging it to the Navigation Area.
18.6.6 Assembly variance table
In addition to the standard graphical display of a contig as described above, you can also see a tabular overview of the conflicts in the contig: right-click the tab of the contig | Show | Table. This will display a new view of the conflicts as shown in figure 18.12
392. [Figure residue: cloning view with restriction site labels and sequence details.] Figure 19.1: Two sequences in the cloning view. If you later in the process need additional sequences, you can easily add more sequences to the view. Just: right-click anywhere on the empty white area | Add Sequences.
19.1.1 Introduction to the cloning view
The cloning view operates with a linear representation of the sequences, even though they might be circular. Circular sequences are represented with a small << and >> at the ends of each sequence. When you have finished designing your cloning sequence, you can open it in a circular view (see section 19.1.7). When you save the content of a cloning view, it is saved as a Sequence list. See section 10.7 for more information about sequence lists. In the cloning view, most of the basic options for viewing, selecting and zooming the sequences are the same as for the standard sequence viewer. See section 10.1 for an explanation of these options. This means that features such as e.g. known SNPs, exons and other annotations can be displayed on the sequences to guide the choice of region
393. le this can be accessed in the tooltip shown when hovering the mouse on the annotation. If you show the sequence in the Annotation Table (see section 10.3.1), there is also a hyperlink to the NCBI web page describing the SNP annotation. Figure 12.18: A sequence annotated with SNPs.
12.5.3 Bioinformatics explained: Single Nucleotide Polymorphisms (SNPs)
Single nucleotide polymorphisms can be defined as any single base substitution, e.g. the alteration from AAGGCT to ATGGCT. A single nucleotide polymorphism is denoted SNP (pronounced SNiP) and represents a nucleotide variation in either coding or non-coding regions. SNPs can be further classified according to location and function (see figure 12.19). SNPs are the most abundant type of genetic variation in the human genome, accounting for more than 90% of all differences between individuals (Collins et al., 1998), and single nucleotide polymorphisms occur very frequently: once every 100-1000 bp in humans. Often, higher frequencies of SNPs are observed in intronic and intergenic regions than in coding regions, and there are variations as great as 100-fold in SNP frequency in different regions of the genome. Single nucleotide polymorphisms can be disease-causing factors. It has for example been found that genetic variation in the gene encoding calpain-10 (CAPN10) is associated with non-insulin-dependent diabetes mellitus, the most common form of diab
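The substitution example in the excerpt above (AAGGCT to ATGGCT) can be located programmatically. The following is a minimal sketch, not part of the Workbench: the function name `find_snps` is hypothetical, and it assumes two already aligned sequences of equal length that differ only by substitutions.

```python
def find_snps(reference: str, variant: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, variant base) for each mismatch.

    Assumes the two sequences are aligned and of equal length, so that only
    single base substitutions (no insertions or deletions) are present.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference, variant), start=1)
        if ref != alt
    ]


# The example from the text: one substitution at position 2 (A to T).
print(find_snps("AAGGCT", "ATGGCT"))  # [(2, 'A', 'T')]
```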
394. lect the two sequences by <Ctrl>-click (⌘-click on Mac) or <Shift>-click | Export | choose where to export to | choose GenBank (.gbk) format | enter name of the new file | Save.
Export of dependent objects
When exporting e.g. an alignment, CLC Combined Workbench can export all dependent objects, i.e. the sequences which the alignment is calculated from. This way, when sending your alignment with the dependent objects, your colleagues can reproduce your findings, with adjusted parameters if desired. To export with dependent files: select the element in Navigation Area | File in Menu Bar | Export with dependent objects | enter name of the new file | choose where to export to | Save. The result is a folder containing the exported file with dependent objects, stored automatically in a folder on the desired location of your desk.
Export history
To export an element's history: select the element in Navigation Area | Export | select History PDF (.pdf) | choose where to export to | Save. The entire history of the element is then exported in pdf format.
The CLC format
CLC Combined Workbench keeps all bioinformatic data in the CLC format. Compared to other formats, the CLC format contains more information about the object, like its history and comments. The CLC format is also able to hold several objects of different types (e.g. an alignment, a graph and a phylogenetic tree). This means that if you are exporting your data to a
395. limitations which can be set before submitting a BLAST search. See section 12.1 for information about these limitations. Additional settings in the Local BLAST wizard:
• Number of processors. It is possible to specify the number of processors which should be used if the Workbench is installed on a multi-processor system.
• Number of output alignments. Limit the number of output alignments based on the E-value.
The local BLAST in CLC Combined Workbench is NCBI BLAST version 2.2.17 (http://www.ncbi.nlm.nih.gov/BLAST).
12.2.1 BLAST a selection against a local database
If you only wish to BLAST a part of a sequence, this is possible directly from the sequence view: select the region that you wish to BLAST | right-click the selection | BLAST Selection Against Local Database. This will go directly to the dialog shown in figure 12.6, and the rest of the options are the same as when performing a BLAST search with a full sequence.
12.3 Output from BLAST search
In the last step of the BLAST searches you can specify the output options as shown in figure 12.9. [Figure residue: Local BLAST wizard, step 4 Result handling; output options: Create one overview BLAST table, Create single BLAST results; result handling: Open or Save; log handling: Make log.]
396. ll appear at the starting points of the primers which fail to meet this requirement.
17.3.2 Detailed information mode
In this mode a very detailed account is given of the properties of all the available primers. When a region is chosen, primer information will appear in groups of lines beneath it (see figure 17.5). The number of information line groups reflects the chosen length interval for primers and probes. One group is shown for every possible primer length. Within each group, a line is shown for every primer property that is selected from the checkboxes in the primer information preference group. Primer properties are shown at each potential primer starting position and are of two types. [Figure residue: sequence PERH3BC with Tm bar plots for primer lengths 18 to 22, and the Primer information preference group with Compact and Detailed modes and checkboxes for G/C content, Melting temp. (Tm), Self annealing (SA), Self end annealing (SEA), Secondary structure and 3' end G/C.] Figure 17.5: Detailed information mode. Properties with numerical values are represented by bar plots. A green bar represents the starting point of a primer that meets the set requirement, and a red bar represents the starting point of a primer that fails to meet the set requirement:
• G/C content
• Melting temperature
• Self anneali
397. [Figure residue: list of selected elements, including HUMHEB, NM000044, PERH2BD, PERH3BC and a sequence list.] Figure 7.10: Selected elements in a Folder Content view. When the elements are selected, do the following to copy the selected elements: right-click one of the selected elements | Edit | Copy. Then: right-click in the cell A1 | Paste. The outcome might appear unorganized, but with a few operations the structure of the view in CLC Combined Workbench can be produced (except the icons, which are replaced by file references in Excel).
Chapter 8 History log
Contents
8.1 Element history
8.1.1 Sharing data with history
CLC Combined Workbench keeps a log of all operations you make in the program. If e.g. you rename a sequence, align sequences, create a phylogenetic tree or translate a sequence, you can always go back and check what you have done. In this way, you are able to document and reproduce previous operations. This can be useful in several situations: it can be used for documentation purposes, where you can specify exactly how your data has been created and modified. It can also be useful if you return to a project after some time and want to refresh your memory on how the data was created. Also, if you have performed an analysis and you want to reproduce the analysis on another element, you can check the history of the analysis, which will give you all
398. loning and design of vectors for various purposes, together with restriction enzyme analysis and functionalities for managing lists of restriction enzymes. First, after a brief introduction, the cloning and vector design is explained. Next, the restriction site analyses are described.
19.1 Molecular cloning
Molecular cloning is a very important tool in the quest to understand gene function and regulation. Through molecular cloning it is possible to study individual genes in a controlled environment. Using molecular cloning it is possible to build complete libraries of fragments of DNA inserted into appropriate cloning vectors. We offer a significantly different approach for visual in silico cloning than other software tools. In CLC Combined Workbench the user is in total control of the cloning process. The in silico cloning process in CLC Combined Workbench begins with the selection of sequences to be used (typically a vector sequence and an insert): select the sequences in the Navigation Area | Toolbox in the Menu Bar | Cloning and Restriction Sites | Cloning. This will open a view of the selected sequences similar to figure 19.1. [Figure residue: cloning view of pBR322 with annotations tet, bla and ROP protein, and a restriction site panel listing non-cutters and single cutters.]
399. lor bar using a gradient like the foreground and background colors.
    * Color box. Specifies the color of the graph for line and bar plots, and specifies a gradient for colors.
• Color different residues. Indicates differences in aligned residues.
  - Foreground color. Colors the letter.
  - Background color. Sets a background color of the residues.
• Sequence logo. A sequence logo displays the frequencies of residues at each position in an alignment. This is presented as the relative heights of letters, along with the degree of sequence conservation as the total height of a stack of letters, measured in bits of information. The vertical scale is in bits, with a maximum of 2 bits for nucleotides and approximately 4.32 bits for amino acid residues. See section 20.2.1 for more details.
  - Foreground color. Color the residues using a gradient according to the information content of the alignment column. Low values indicate columns with high variability, whereas high values indicate columns with similar residues.
  - Background color. Sets a background color of the residues using a gradient in the same way as described above.
  - Logo. Displays sequence logo at the bottom of the alignment.
    * Height. Specifies the height of the sequence logo graph.
    * Color. The sequence logo can be displayed in black or Rasmol colors. For protein alignments, a polarity color scheme is also available, where hydrophobic resid
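The maxima quoted above for the sequence logo scale (2 bits for nucleotides, about 4.32 bits for amino acids) follow from log2 of the alphabet size. The sketch below shows one common way to compute the information content of a single alignment column as log2(alphabet size) minus the Shannon entropy. It is an illustration only: the function name is hypothetical, gaps are simply ignored, and no small-sample correction is applied, so the Workbench's own calculation may differ.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int) -> float:
    """Information content in bits of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies. Gap characters ('-') are ignored.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy


# A fully conserved nucleotide column reaches the 2-bit maximum,
# a column with all four bases equally frequent carries no information.
print(column_information("AAAA", 4))
print(column_information("ACGT", 4))
# A fully conserved protein column reaches log2(20), roughly 4.32 bits.
print(round(column_information("LLLL", 20), 2))
```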
400. lot. Instead of comparing single residues, it compares subsequences of the length set as window size. The score is now calculated with respect to aligning the subsequences.
• Threshold. The dot plot shows the calculated scores with a colored threshold. Hence you can better recognize the most important similarities.
Examples and interpretations of dot plots
Contrary to simple sequence alignments, dot plots can be a very useful tool for spotting various evolutionary events which may have happened to the sequences of interest. Below are shown some examples of dot plots where sequence insertions, low complexity regions, inverted repeats etc. can be identified visually.
Similar sequences
The simplest example of a dot plot is obtained by plotting two homologous sequences of interest. If very similar or identical sequences are plotted against each other, a diagonal line will occur. The dot plot in figure 14.7 shows two related sequences of the Influenza A virus nucleoproteins infecting ducks and chickens. Accession numbers of the two sequences are DQ232610 and DQ023146. Both sequences can be retrieved directly from http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi. Figure 14.7: Dot plot of DQ232610 vs. DQ023146 (Influenza A virus nucleoproteins) showing an overall similarity.
Repeated regions
Sequence repeats can also be identified using dot plots. A repeat region will typic
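The window size and threshold settings described above can be sketched as follows. This is a simplified illustration, not the Workbench's implementation: it assumes that a window is scored by counting identical positions (a real dot plot would typically use a substitution matrix), and the function name `dot_plot` is hypothetical.

```python
def dot_plot(seq_a: str, seq_b: str, window: int = 3, threshold: int = 3) -> set[tuple[int, int]]:
    """Return the (i, j) start positions whose windows score at least `threshold`.

    Each window of `window` residues from seq_a is compared with each window
    from seq_b; the score here is simply the number of identical positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(
                a == b for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if score >= threshold:
                dots.add((i, j))
    return dots


# Plotting a sequence against itself gives the main diagonal; the repeated
# ACGT motif adds off-diagonal dots at (0, 4) and (4, 0).
print(sorted(dot_plot("ACGTACGT", "ACGTACGT", window=4, threshold=4)))
```

Lowering the threshold relative to the window size admits windows with mismatches, which makes weaker similarities visible at the cost of more background dots.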
401. low, as shown in figure 2.35. Figure 2.35: Five lines of dots representing primer suggestions. There is a line for each length.
2.11.2 Examining the primer suggestions
Each line consists of a number of dots, each representing the starting point of a possible primer. E.g. the first dot on the first line (primers of length 18) represents a primer starting at the dot's position and with a length of 18 nucleotides (shown as the white area in figure 2.36). Figure 2.36: The first dot on line one represents the starting point of a primer that will anneal to the highlighted region. Position the mouse cursor upon a dot and you will see an information box providing data about this primer. Clicking the dot will select the region where the primer will anneal. See figure 2.37. [Figure residue, information box contents: Primer covering positions 20 to 40; Fraction of G and C: 0.48; Melting temperature: 58.4 °C; Self annealing: 16; Self end annealing: 2; Secondary structure: 15 (requirement not met).] Figure 2.37: Clicking the dot will select the corresponding region, and placing the cursor upon the dot will reveal an information box. Note that some of the dots are colored red. This indicates that the primer represented by this dot does not meet the requirements set in the Primer parameters. See f
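The "Fraction of G and C" value in the information box above is a simple ratio. A minimal sketch follows; the function name and the example primer are hypothetical (the primer is a made-up 21-nucleotide sequence, not the one from the tutorial), chosen so that the result matches the 0.48 shown for a primer covering 21 positions.

```python
def gc_fraction(primer: str) -> float:
    """Fraction of G and C nucleotides in a primer sequence."""
    primer = primer.upper()
    if not primer:
        raise ValueError("primer must not be empty")
    return (primer.count("G") + primer.count("C")) / len(primer)


# Hypothetical 21-nt primer with 10 G/C nucleotides: 10 / 21 is about 0.48.
example_primer = "ACGTACGTACGTACGTACGTA"
print(round(gc_fraction(example_primer), 2))  # 0.48
```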
402. lowing description of BLAST search parameters is based on information from http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml.
• Limit by Entrez query. BLAST searches can be limited to the results of an Entrez query against the database chosen. This can be used to limit searches to subsets of the BLAST databases. Any terms can be entered that would normally be allowed in an Entrez search session. Some queries are pre-entered and can be chosen in the drop-down menu.
• Choose filter
  - Low complexity. Mask off segments of the query sequence that have low compositional complexity. Filtering can eliminate statistically significant but biologically uninteresting reports from the BLAST output (e.g. hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences.
  - Human repeats. This option masks human repeats (LINEs and SINEs) and is especially useful for human sequences that may contain these repeats. Filtering for repeats can increase the speed of a search, especially with very long sequences (>100 kb) and against databases which contain a large number of repeats (htgs).
  - Mask for lookup. This option masks only for purposes of constructing the lookup table used by BLAST. BLAST searches consist of two phases: finding hits based upon a lookup table, and then extending them.
403. [Figure residue, dialog settings: No. of processors: 2; Matrix: PAM30; Gap cost: Existence 9, Extension 1; Command line options.] Figure 2.30: Settings for searching for remote homologues. BLAST versus Smith-Waterman here: http://www.clcbio.com/BE.
2.10 Tutorial: Proteolytic cleavage detection
This tutorial shows you how to find cut sites and see an overview of fragments when cleaving proteins with proteolytic cleavage enzymes. Suppose you are working with protein CAA32220 from the example data, and you wish to see where the enzyme trypsin will cleave the protein. Furthermore, you want to see details for the resulting fragments which are between 10 and 15 amino acids long: click protein CAA32220 from the Protein folder under Sequences | Toolbox | Protein Analyses | Proteolytic Cleavage. This opens Step 1 of the Proteolytic Cleavage dialog. In this step you can choose which sequences to include in the analysis. Since you have already chosen protein CAA32220, click Next. In this step you should select Trypsin. This is illustrated in figure 2.31. [Figure residue: Proteolytic Cleavage dialog, step 2 Select enzymes, listing Cyanogen bromide (CNBr), Asp-N endopeptidase, Arg-C, Lys-C, Trypsin, Chymotrypsin (high spec.), Chymotrypsin (low spec.), o-Iodosobenzoate, Thermolysin, Post-Pro, Asp N
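The tutorial scenario above (cleave with trypsin, then keep fragments between 10 and 15 amino acids long) can be sketched in a few lines. This is an illustration under a stated assumption: it uses the commonly cited simplified trypsin rule (cut after K or R unless the next residue is P), which may differ from the rule the Workbench applies. The function name and the test protein are hypothetical.

```python
import re


def trypsin_fragments(protein: str, min_len: int = 10, max_len: int = 15) -> list[str]:
    """Cleave after K or R (not before P) and keep fragments within the length range."""
    # Zero-width split: the position must follow K or R and must not precede P.
    fragments = re.split(r"(?<=[KR])(?!P)", protein)
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Made-up protein: cut after the first K; the R is followed by P, so no cut there;
# the trailing "CC" fragment is too short and is filtered out.
protein = "A" * 10 + "K" + "GR" + "P" + "L" * 11 + "K" + "CC"
for fragment in trypsin_fragments(protein):
    print(len(fragment), fragment)
```

Splitting on a zero-width pattern requires Python 3.7 or later.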
404. ly used.
Other useful resources
AAindex: Amino acid index database: http://www.genome.ad.jp/dbget/aaindex.html
Creative Commons License
All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes, under the following conditions: You must attribute the work in its original form, and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.
16.6 Pfam domain search
With CLC Combined Workbench you can perform a search for Pfam domains on protein sequences. The Pfam database at http://pfam.sanger.ac.uk/ is a large collection of multiple sequence alignments that covers approximately 9318 protein domains and protein families (Bateman et al., 2004). Based on the individual domain alignments, profile HMMs have been developed. These profile HMMs can be used to search for domains in unknown sequences. Many proteins have a unique combination of domains, which can be responsible, for instance, for the catalytic activities of enzymes. Pfam was initially developed to aid the annotation of the C. elegans genome. Annotating u
405. m Figure 4.4: Guides to help create advanced search expressions. You can select any of the guides (using mouse or keyboard arrows) and start typing. If you e.g. wish to search for sequences named BRCA1, select "Name search (name:)" and type "BRCA1". Your search expression will now look like this: "name:BRCA1". The guides available are these:
• Wildcard search (*). Appending an asterisk to the search term will find matches starting with the term. E.g. searching for "brca*" will find both brca1 and brca2.
• Search related words. If you don't know the exact spelling of a word, you can append a question mark to the search term. E.g. "brac1?" will find sequences with a brca1 gene.
• Include both terms (AND). If you write two search terms, you can define that your results have to match both search terms by combining them with AND. E.g. a search for "brca1 AND human" will find sequences where both terms are present.
• Include either term (OR). If you write two search terms, you can define that your results have to match either of the search terms by combining them with OR. E.g. a search for "brca1 OR brca2" will find sequences where either of the terms is present.
• Name search (name:). Search only the name of the element.
• Organism search (organism:). For sequences, you can specify the organism to search for. This will look in the "Latin name" field, which is seen in the Sequence Info view. See section 1
406. m the ideal, and a high tolerance will give a small deduction in the final score.
17.2 Setting parameters for primers and probes
The primer-specific view options and settings are found in the Primer parameters preference group in the Side Panel to the right of the view (see figure 17.3). [Figure residue: Primer parameters group with Length (Max 22, Min 18), Melt. temp. (°C) (Max 58, Min 48), Inner Melt. temp. (°C), Advanced parameters, Mode (Standard PCR, TaqMan, Nested PCR, Sequencing) and a Calculate button; Primer information group with Compact and Detailed modes.] Figure 17.3: The two groups of primer parameters (in the program, the Primer information group is listed below the other group).
17.2.1 Primer Parameters
In this preference group a number of criteria can be set which the selected primers must meet. All the criteria concern single primers, as primer pairs are not generated until the Calculate button is pressed. Parameters regarding primer and probe sets are described in detail for each reaction mode (see below).
• Length. Determines the length interval within which primers can be designed, by setting a maximum and a minimum length. The upper and lower lengths allowed by the program are 100 and 10 nucleotides, respectively.
• Melting temperature. Determines the temperature interval within which primers must lie. When the Nested PCR or TaqMan reaction type is chosen, the first pair of melting
407. m was unable to request a license directly from CLC bio. Please click the button below to go to the CLC license web site and submit your request. [Request license through web site] If you are unable to request your license through the web site, please contact support@clcbio.com and include the following information in your email: Activation Key: AQSBC 9JERU 2U1PK 60RTO SISZL. Figure 1.9: If you cannot get a license automatically. In this case, click Request license through web site to go to a web page where you can make a request for a license. Please fill out the form on the web site, and we will send you an email with a pre-activated license as soon as possible. If you know that you are using a proxy server to connect to the internet, click Cancel and click Proxy Settings in the license dialog.
1.4.4 Floating license
If your organization has installed a license server, you can use a floating license. The license server has a set of licenses that can be used on all computers on the network. If the server has e.g. 10 licenses, it means that a maximum of 10 computers can use a license simultaneously. To use a floating license, select Connect to a license server in the dialog shown in figure 1.6. This will bring up the dialog shown in figure 1.10. This dialog lets you specify how to connect to the license server:
• Connect to a license server. Check this option if you wish to use the license server.
• Automatically detect license s
408. mation can be searched. Below is a list of the different kinds of information that you can search for (applies to both quick search and the advanced search).
• Name. The name of a sequence, an alignment or any other kind of element. The name is what is displayed in the Navigation Area per default.
• Length. The length of the sequence.
• Organism. Sequences which contain information about organism can be searched. In this way, you could search for e.g. Homo sapiens sequences.
Only the first item in the list, Name, is available for all kinds of data. The rest is only relevant for sequences. If you wish to perform a search for sequence similarity, use Local BLAST (see section 12.2) instead.
4.2 Quick search
At the bottom of the Navigation Area there is a text field, as shown in figure 4.1. Figure 4.1: Search simply by typing in the text field and press Enter. To search, simply enter a text to search for and press Enter.
4.2.1 Quick search results
To show the results, the search pane is expanded as shown in figure 4.2. [Figure residue: Navigation Area with the search term "insulin" entered and a list of matching elements.]
409. ments (Smith and Waterman, 1981). BLAST only makes local alignments. This means that a great but short hit in another sequence may not at all be related to the query sequence, even though the sequences align well in a small region. It may be a domain or similar. It is always a good idea to be cautious of the material in the database. For instance, the sequences may be wrongly annotated; hypothetical proteins are often simple translations of a found ORF on a sequenced nucleotide sequence and may not represent a true protein. Don't expect to see the best result using the default settings. As described above, the settings should be adjusted according to what kind of query sequence is used and what kind of results you want. It is a good idea to perform the same BLAST search with different settings to get an idea of how they work. There is no final answer on how to adjust the settings for your particular sequence.
12.6.9 Other useful resources
The BLAST web page hosted at NCBI: http://www.ncbi.nlm.nih.gov/BLAST
Download pages for the BLAST programs: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml
Download pages for pre-formatted BLAST databases: ftp://ftp.ncbi.nlm.nih.gov/blast/db/
O'Reilly book on BLAST: http://www.oreilly.com/catalog/blast/
Explanation of scoring/substitution matrices and more: http://www.clcbio.com/be
Creative Commons License
All CLC bio's scientific articles are lic
410. mes GCC (25.5), GCA (20.3) and finally GCU (15.3). The data are retrieved from the Codon usage database (see below). Always picking the most frequent codon does not necessarily give the best answer. By selecting codons from a distribution of calculated codon frequencies, the DNA sequence obtained after the reverse translation holds the correct (or nearly correct) codon distribution. It should be kept in mind that the obtained DNA sequence is not necessarily identical to the original one encoding the protein in the first place, due to the degeneracy of the genetic code. In order to obtain the best possible result of the reverse translation, one should use the codon frequency table from the correct organism or a closely related species. The codon usage of the mitochondrial chromosome is often different from that of the native chromosome(s); thus, mitochondrial codon frequency tables should only be used when working specifically with mitochondria.
Other useful resources
The Genetic Code at NCBI: http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
Codon usage database: http://www.kazusa.or.jp/codon/
Wikipedia on the genetic code: http://en.wikipedia.org/wiki/Genetic_code
Creative Commons License
All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes, under the following co
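Selecting codons from a frequency distribution, as described above, amounts to weighted random sampling. The sketch below illustrates the idea only. The function name and table layout are hypothetical, and the example table contains just the three alanine codon values visible in this excerpt (the first, most frequent codon is cut off in the source and is deliberately left out), so it is not a complete codon usage table.

```python
import random


def reverse_translate(protein, codon_table, rng=None):
    """Reverse translate a protein by sampling each codon according to its frequency.

    codon_table maps an amino acid letter to a dict of {codon: frequency}.
    Frequencies act as relative weights and need not sum to any fixed total.
    """
    if rng is None:
        rng = random.Random()
    codons = []
    for amino_acid in protein:
        options = codon_table[amino_acid]
        codon = rng.choices(list(options), weights=list(options.values()), k=1)[0]
        codons.append(codon)
    return "".join(codons)


# Illustrative, incomplete table: only the alanine values quoted in the text.
example_table = {"A": {"GCC": 25.5, "GCA": 20.3, "GCU": 15.3}}

# A seeded generator makes the sampled sequence reproducible.
print(reverse_translate("AAAA", example_table, random.Random(1)))
```

Over many residues the sampled codons approach the proportions in the table, which is the property the text attributes to frequency-based reverse translation.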
411. minal work by McCaskill, 1990. There are two options regarding the partition function calculation:
• Calculate base pair probabilities. This option invokes the partition function calculation and calculates the marginal probabilities of all possible base pairs and the marginal probability that any single base is unpaired.
• Create plot of marginal base pairing probabilities. This creates a plot of the marginal base pair probability of all possible base pairs, as shown in figure 22.3.
The marginal probabilities of base pairs and of bases being unpaired are distinguished by colors, which can be displayed in the normal sequence view using the Side Panel (see section 22.2.3) and also in the secondary structure view. An example is shown in figure 22.4. Furthermore, the marginal probabilities are accessible from tooltips when hovering over the relevant parts of the structure.
22.1.4 Advanced options
The free energy minimization algorithm includes a number of advanced options:
• Avoid isolated base pairs. The algorithm filters out isolated base pairs (i.e. stems of length 1).
• Apply different energy rules for Grossly Asymmetric Interior Loops (GAIL). Compute the minimum free energy applying different rules for Grossly Asymmetric Interior Loops (GAIL). A Grossly Asymmetric Interior Loop (GAIL) is an interior loop that is 1 x n or n x 1, where n > 2 (see http://www.bioinfo.rpi.edu/~zukerm/lectures/RNAfold-html/rnafold-print.pdf).
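The GAIL definition quoted above (an interior loop that is 1 x n or n x 1 with n > 2) can be written as a small predicate. This is only a restatement of that definition in code; the function and parameter names are hypothetical, and the two arguments are taken to be the number of unpaired bases on each side of the interior loop.

```python
def is_gail(unpaired_side_a: int, unpaired_side_b: int) -> bool:
    """True if an interior loop is 1 x n or n x 1 with n > 2."""
    return (unpaired_side_a == 1 and unpaired_side_b > 2) or (
        unpaired_side_b == 1 and unpaired_side_a > 2
    )


print(is_gail(1, 3))  # True: 1 x 3 loop
print(is_gail(1, 2))  # False: n must be greater than 2
print(is_gail(2, 5))  # False: neither side has exactly one unpaired base
```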
412. ming you can check this box. It requires that the sequence reads have been trimmed beforehand (see section 18.2 for more information about trimming).

CHAPTER 18 SEQUENCING DATA ANALYSES AND ASSEMBLY 310

• Show tabular view of contigs. A contig can be shown in both a graphical and a tabular view. If you select this option, a tabular view of the contig will also be opened. (Even if you do not select this option, you can show the tabular view of the contig later on by clicking Show and selecting Table.) For more information about the tabular view of contigs, see section 18.6.6.

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This will start the assembly process. See section 18.6 on how to use the resulting contigs.

18.5 Add sequences to an existing contig

This section describes how to assemble sequences to an existing contig. This feature can be used, for example, to provide a steady work flow when a number of exons from the same gene are sequenced one at a time and assembled to a reference sequence.

To start the assembly:

select one contig and a number of sequences | Toolbox in the Menu Bar | Sequencing Data Analyses | Add Sequences to Contig

or right-click the empty white area of the contig | Add Sequences to Contig

This opens a dialog where you can alter your choice of sequences which you want to assemble. You can also add sequence lists. When the elem
413. mit is set to 500. By writing a higher number in this field, more actions can be undone. Undo applies to all changes made on sequences, alignments or trees. See section 3.2.5 for more on this topic.

CHAPTER 5 USER PREFERENCES AND SETTINGS 100

Figure 5.1: Preferences include General preferences, View preferences, Colors preferences and Advanced settings.

• Number of hits. The number of hits shown in CLC Combined Workbench when e.g. searching NCBI. (The sequences shown in the program are not downloaded until they are opened or dragged/saved into the Navigation Area.)
• Locale Setting. Specify which country you are located in. This determines how punctuation is used in numbers all over the program.
• Show Dialogs. A lot of information dialogs have a checkbox: "Never show this dialog again". When you see a dialog and check this box in the dialog, the dialog will not be shown again. If you regret this and wish to have the dialog displayed again, click the button in the General Preferences: Show Dialogs. Then all the dialogs will be shown again.

5.2 Default View preferences

There are five groups of default View settings:

1. Toolbar
2. Side Panel Location
3. New View
4. View Format
5. User Defined V
414. more, your changes in the View preferences can be saved. See section 5.5.

Several sequences can be selected, and by clicking the buttons in the bottom of the search view you can do the following:

• Download and open: doesn't save the sequence.
• Download and save: lets you choose a location for saving the sequence.
• Open at NCBI: searches for the sequence at NCBI's web page.

Double-clicking a hit will download and open the sequence. The hits can also be copied into the View Area or the Navigation Area from the search results by drag and drop, copy/paste, or by using the right-click menu as described below.

Drag and drop from GenBank search results

The sequences from the search results can be opened by dragging them into a position in the View Area.

CHAPTER 11 ONLINE DATABASE SEARCH 163

Note! A sequence is not saved until the View displaying the sequence is closed. When that happens, a dialog opens: Save changes of sequence x? (Yes or No).

The sequence can also be saved by dragging it into the Navigation Area. It is possible to select more sequences and drag all of them into the Navigation Area at the same time.

Download GenBank search results using right-click menu

You may also select one or more sequences from the list and download using the right-click menu (see figure 11.2). Choosing Download and Save lets you select a folder where the sequences are saved when they are downloaded. Choosing Download and Open opens a new view for each
415. mostability and aliphatic index of globular proteins. J Biochem (Tokyo), 88(6):1895-1898.

[Janin, 1979] Janin, J. (1979). Surface and inside volumes in globular proteins. Nature, 277(5696):491-492.

[Jukes and Cantor, 1969] Jukes, T. and Cantor, C. (1969). Mammalian Protein Metabolism, ed. H. N. Munro, chapter Evolution of protein molecules, pages 21-32. New York: Academic Press.

[Karplus and Schulz, 1985] Karplus, P. A. and Schulz, G. E. (1985). Prediction of chain flexibility in proteins. Naturwissenschaften, 72:212-213.

[Kierzek et al., 1999] Kierzek, R., Burkard, M. E., and Turner, D. H. (1999). Thermodynamics of single mismatches in RNA duplexes. Biochemistry, 38(43):14214-14223.

[Klee and Ellis, 2005] Klee, E. W. and Ellis, L. B. M. (2005). Evaluating eukaryotic secreted protein prediction. BMC Bioinformatics, 6:256.

[Knudsen and Miyamoto, 2001] Knudsen, B. and Miyamoto, M. M. (2001). A likelihood ratio test for evolutionary rate shifts and functional divergence among proteins. Proc Natl Acad Sci USA, 98(25):14512-14517.

[Kolaskar and Tongaonkar, 1990] Kolaskar, A. S. and Tongaonkar, P. C. (1990). A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett, 276(1-2):172-174.

[Krogh et al., 2001] Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol, 305(3):567-580.

BIBL
416. mparative sequence statistics

The output of comparative protein sequence statistics includes:

• Sequence information: Sequence type, Length, Organism, Name, Description, Modification Date, Weight, Isoelectric point, Aliphatic index
• Half-life
• Extinction coefficient
• Counts of Atoms
• Frequency of Atoms
• Count of hydrophobic and hydrophilic residues
• Frequencies of hydrophobic and hydrophilic residues
• Count of charged residues
• Frequencies of charged residues
• Amino acid distribution
• Histogram of amino acid distribution
• Annotation table
• Counts of di-peptides
• Frequency of di-peptides

CHAPTER 14 GENERAL SEQUENCE ANALYSES 222

The output of nucleotide sequence statistics includes:

• General statistics: Sequence type, Length, Organism, Name, Description, Modification Date, Weight
• Atomic composition
• Nucleotide distribution table
• Nucleotide distribution histogram
• Annotation table
• Counts of di-nucleotides
• Frequency of di-nucleotides

A short description of the different areas of the statistical output is given in section 14.4.1.

14.4.1 Bioinformatics explained: Protein statistics

Every protein holds specific and individual features which are unique to that particular protein. Features such as isoelectric point or amino acid composition can reveal important information about a novel protein. Many of the features described below are calculated i
417. n 144 CDS 237 coding regions 237 DNA to RNA 233 nucleotide sequence 236 ORF 237 protein 265 RNA to DNA 234 to DNA 401 to protein 236 401 Translation of a selection 134 show together with DNA sequence 134 Transmembrane helix prediction 251 401 TrEMBL search 164 Trim 303 403 Trimmed regions adjust manually 311 Tutorial Getting started 33 txt file format 112 INDEX 430 UIPAC codes Wrap sequences 132 amino acids 411 Undo limit 99 xml file format 112 Undo Redo 83 Zoom 86 UniProt 164 tutorial 36 search 164 400 search sequence in 171 ZOOM hse a piping oaths Zoom Out 87 UniVec trimming 304 UPGMA algorithm 371 402 Upgrade license 23 Urls Navigation Area 118 User defined view settings 101 User interface 73 Zoom to 100 87 Zoom 3D structure 200 Variance table assembly 315 Vector see cloning 319 Vector contamination find automatically 304 Vector design 319 Vector graphics export 120 VectorNTI file format 35 113 409 import data from 115 View 80 alignment 353 dot plots 210 GenBank format 155 preferences 86 save changes 82 sequence 131 sequence as text 155 View Area 80 illustration 73 View preferences 100 show automatically 101 style sheet 103 View settings user defined 101 Virtual gel 403 vsf file format for settings 101 Web page import sequence from 115 Wildcard append to search 161 164 167 Windows installat
418. n 1993. These are most often seen as short regions of only a few different amino acids. In the middle of figure 14.12 is a square that shows the low-complexity region of this sequence.

Creative Commons License

All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes under the following conditions: You must attribute the work in its original form and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work.

See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.

Figure 14.10: This dot plot shows various frame shifts in the sequence. See text for details.

14.2.4 Bioinformatics explained: Scoring matrices

Biological sequences have evolved throughout time, and evolution has shown that not all changes to a biological sequence are equally likely to happen. Certain amino acid substitutions (change of one amino acid to another) happen often, whereas other substitutions are very rare. For instance, tryptophan (W), which is a relatively rare amino acid, will only on very rare occasions mutate int
419. n a simple way.

Molecular weight

The molecular weight is the mass of a protein or molecule. The molecular weight is simply calculated as the sum of the atomic mass of all the atoms in the molecule. The weight of a protein is usually represented in Daltons (Da).

A calculation of the molecular weight of a protein does not usually include additional posttranslational modifications. For native and unknown proteins it tends to be difficult to assess whether posttranslational modifications such as glycosylations are present on the protein, making a calculation based solely on the amino acid sequence inaccurate. The molecular weight can be determined very accurately by mass spectrometry in a laboratory.

Isoelectric point

The isoelectric point (pI) of a protein is the pH where the protein has no net charge. The pI is calculated from the pKa values for 20 different amino acids. At a pH below the pI, the protein carries a positive charge, whereas if the pH is above the pI the protein carries a negative charge. In other words, pI is high for basic proteins and low for acidic proteins. This information can be used in the laboratory when running electrophoretic gels. Here the proteins can be separated based on their isoelectric point.

Aliphatic index

The aliphatic index of a protein is a measure of the relative volume occupied by the aliphatic side chains of the following amino acids: alanine, valine, leucine and
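The molecular weight calculation described in the fragment above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: instead of summing individual atoms it sums standard average residue masses (each amino acid minus one water) and adds one water for the free termini, which gives the same total for an unmodified chain. The table values and the function name are not taken from the manual, and posttranslational modifications are ignored, as the text notes.

```python
# Average residue masses in Daltons (amino acid minus one water molecule).
# Standard textbook values; an assumption of this sketch, not from the manual.
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water is added back for the free N- and C-termini


def molecular_weight(sequence):
    """Return the average molecular weight (Da) of an unmodified protein chain."""
    sequence = sequence.upper()
    unknown = set(sequence) - set(RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    if not sequence:
        return 0.0
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER_MASS


if __name__ == "__main__":
    # Free glycine is the smallest possible input.
    print(round(molecular_weight("G"), 3))
```

Because the result depends only on the amino acid sequence, it should be read as the mass of the bare polypeptide; a glycosylated protein would weigh more than this estimate.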
420. n alignment of multiple sequences. The primer designer for alignments can be accessed in two ways:

CHAPTER 17 PRIMERS 292

Figure 17.11: Calculation dialog for sequencing primers.

select alignment | Toolbox | Primers and Probes | Design Primers | OK

or, if the alignment is already open: Click Primer Designer at the lower left part of the view

In the alignment primer view (see figure 17.12), the basic options for viewing the template alignment are the same as for the standard view of alignments. See section 20 for an explanation of these options.

Note! This means that annotations such as e.g. known SNP's or exons can be displayed on the template sequence to guide the choice of primer regions. Since the definition of groups of sequences is essential to the primer design, the selection boxes of the standard view are shown as default in the alignment primer view.

17.9.1 Specific options for alignment based primer and probe design

Compared to the primer view of a single sequence, the most no
421. n for the graphics file (see figure 7.7).

Figure 7.7: Location and name for the graphics file.

CLC Combined Workbench supports the following file formats for graphics export:

Format                       Suffix   Type
Portable Network Graphics    .png     bitmap
JPEG                         .jpg     bitmap
Tagged Image File            .tif     bitmap
PostScript                   .ps      vector graphics
Encapsulated PostScript      .eps     vector graphics
Portable Document Format     .pdf     vector graphics
Scalable Vector Graphics     .svg     vector graphics

These formats can be divided into bitmap and vector graphics. The difference between these two categories is described below:

CHAPTER 7 IMPORT EXPORT OF DATA AND GRAPHICS 121

Bitmap images

In a bitmap image, each dot in the image has a specified color. This implies that if you zoom in on the image there will not be enough dots, and if you zoom out there will be too many. In these cases the image viewer has to interpolate the colors to fit what is actually looked at. A bitmap image needs to have a high resolution if you want to zoom in. This format is a good choice for storing images without large shapes (e.g. dot plots). It is also appropriate if you don't have the need for resizing and editing the image after export.

Vector graphics
422. n in figure 2.39.

Figure 2.39: A forward and a reverse primer region enclosing the conflicts.

Now you can let CLC Combined Workbench calculate all the possible primer pairs based on the Primer parameters that you have defined:

Click the Calculate button | Modify parameters regarding the combination of the primers (for now, just leave them unchanged) | Calculate

This will open a table showing the possible combinations of primers. To the right, you can specify the information you want to display, e.g. showing secondary structure (see figure 2.40).

Clicking a primer pair in the table will make a corresponding selection on the sequence in the view above. At this point you can either settle on a specific primer pair or save the table for later. If you want to use e.g. the first primer pair for your experiment, right-click this primer pair in the table and save the primers. You can also mark the position of the primers on the sequence by selecting Mark primer annotation on sequence in the right-click menu (see figure 2.41).

You have now reached the end of this tutorial, which has shown some of the many options of the primer design functionalities of CLC Combined Workbench. You can read much more in the program's Help function or in the user manual on http://www.clcbio.com/download.

2.12 Tutorial: Assembly

In this tutorial you will see how to assemble data from automated sequencers into a contig and how to find and inspe
423. n it is possible to annotate a sequence from a list of annotations found in a GFF file.

Figure 1.15: The plug-ins that are available for download.

Clicking a plug-in will display additional information at the right side of the dialog. This will also display a button: Download and Install.

In order to install plug-ins on Windows Vista, the Workbench must be run in administrator mode: Right-click the program shortcut and choose "Run as Administrator". Then follow the procedure described below. When you start the Workbench after installing the plug-in, it should also be run in administrator mode.

CHAPTER 1 INTRODUCTION TO CLC COMBINED WORKBENCH 28

Click the plug-in and press Download and Install. A dialog displaying progress is now shown, and the plug-in is downloaded and installed.

If the plug-in is not shown on the server, and you have it on your computer (e.g. if you have downloaded it from our web site), you can install it by clicking the Install from File button at the bottom of the dialog. This will open a dialog where you can browse for the plug-in. The plug-in file should be a file of the type ".cpa".

When you close the dialog, you will be asked whether you wish to restart the workbench
424. Figure 1.7: Read the License Agreement carefully.

Read the License Agreement carefully before clicking the accept button. In the next step, shown in figure 1.8, click the Activate license button.
425. n of PCR products, it can be monitored in real time and used to quantify the amount of template initially present in the buffer.

The technology is also used to detect genetic variation such as SNP's. By designing a TaqMan probe which will specifically bind to one of two or more genetic variants, it is possible to detect genetic variants by the presence or absence of fluorescence in the reaction.

Note! In CLC Combined Workbench it is possible to annotate sequences with SNP information from dbSNP and use this information to guide TaqMan allele-specific probe design.

A specific requirement of TaqMan probes is that a G nucleotide cannot be present at the 5' end, since this will quench the fluorescence of the reporter dye. It is recommended that the melting temperature of the TaqMan probe is about 10 degrees Celsius higher than that of the primer pair.

Primer design for TaqMan technology involves designing a primer pair and a TaqMan probe. In TaqMan, the user must thus define three regions: a Forward primer region, a Reverse primer region, and a TaqMan probe region. The easiest way to do this is to designate a TaqMan primer/probe region spanning the sequence region where TaqMan amplification is desired. This will automatically add all three regions to the sequence. If more control is desired about the placing of primers and probes, the Forward primer region, Reverse primer region and TaqMan probe region can all be defin
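The two probe rules stated in the fragment above (no G at the 5' end, and a probe melting temperature about 10 degrees Celsius above that of the primers) can be expressed as a small check. The sketch below is hypothetical: the function name is invented, melting temperatures are supplied by the caller rather than computed, and "about 10 degrees" is interpreted as "at least 10 degrees above the higher of the two primer temperatures", which is an assumption of this example only.

```python
def taqman_probe_warnings(probe, probe_tm, primer_tms, recommended_delta=10.0):
    """Return a list of warnings for a candidate TaqMan probe.

    probe       -- probe sequence written 5' to 3'
    probe_tm    -- melting temperature of the probe (degrees Celsius)
    primer_tms  -- melting temperatures of the forward and reverse primers
    """
    if not probe:
        raise ValueError("probe sequence must not be empty")
    if not primer_tms:
        raise ValueError("at least one primer melting temperature is required")

    warnings = []
    # Rule 1: a G at the 5' end quenches the reporter dye.
    if probe[0].upper() == "G":
        warnings.append("probe has a G at the 5' end")
    # Rule 2 (interpretation assumed here): probe Tm should exceed the
    # highest primer Tm by at least the recommended difference.
    if probe_tm - max(primer_tms) < recommended_delta:
        warnings.append("probe Tm is less than the recommended margin above the primers")
    return warnings


if __name__ == "__main__":
    print(taqman_probe_warnings("GACCTTGA", 70.0, [58.0, 60.0]))
    print(taqman_probe_warnings("ACCTTGAG", 70.0, [58.0, 60.0]))
```

An empty list means the candidate passes both checks; it says nothing about the other primer parameters handled by the primer designer.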
426. n resolution: 715x392 pixels, 821 kB memory usage
Low resolution: 1181x647 pixels, 2 MB memory usage
Medium resolution: 4724x2590 pixels, 35 MB memory usage
High resolution: 18898x10360 pixels, 560 MB memory usage

Figure 7.8: Parameters for bitmap formats: size of the graphics file.

You can adjust the size (the resolution) of the file to four standard sizes:

• Screen resolution
• Low resolution
• Medium resolution
• High resolution

The actual size in pixels is displayed in parentheses. An estimate of the memory usage for exporting the file is also shown. If the image is to be used on computer screens only, a low resolution is sufficient. If the image is going to be used on printed material, a higher resolution is necessary to produce a good result.

Parameters for vector formats

For pdf format, clicking Next will display the dialog shown in figure 7.9 (this is only the case if the graphics is using more than one page).

Figure 7.9: Page setup parameters for vector formats.

The settings for the page setup are shown, and clicking the Page Setup button will display a dialog where these settings can be adjusted. This dialog is described in section 6.2.

The page setup is only available if yo
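The memory estimates listed with figure 7.8 can be reproduced with a simple calculation. The sketch below assumes three bytes per pixel (for example 8-bit red, green and blue) and rounds down to whole kB or MB; both are inferences from the numbers shown in the figure, not something the manual states, and the function names are hypothetical.

```python
# Pixel dimensions and memory estimates as listed with figure 7.8.
FIGURE_7_8 = [
    (715, 392, "821 kB"),
    (1181, 647, "2 MB"),
    (4724, 2590, "35 MB"),
    (18898, 10360, "560 MB"),
]


def bitmap_memory_bytes(width, height, bytes_per_pixel=3):
    """Uncompressed size of a bitmap; 3 bytes per pixel is an assumption."""
    return width * height * bytes_per_pixel


def format_size(n_bytes):
    """Format a byte count as whole kB or MB (binary units, rounded down)."""
    if n_bytes >= 1024 ** 2:
        return f"{n_bytes // 1024 ** 2} MB"
    return f"{n_bytes // 1024} kB"


if __name__ == "__main__":
    for width, height, listed in FIGURE_7_8:
        estimate = format_size(bitmap_memory_bytes(width, height))
        print(f"{width}x{height}: estimated {estimate}, listed {listed}")
```

The estimate grows with the product of width and height, which is why the high resolution setting needs hundreds of megabytes while the screen resolution setting needs less than one.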
427. nclude reference sequence in contig(s). This will produce a contig data object without the reference sequence. The contig is created in the same way as when you make an ordinary assembly (see section 18.3), but the reference sequence is omitted in the resulting contig. In the assembly process the reference sequence is only used as a scaffold for alignment. This option is useful when performing assembly with a reference sequence that is not closely related to the sequencing reads.

Conflicts resolved with. If there is a conflict, i.e. a position where there is disagreement about the residue (A, C, T or G), you can specify how the contig sequence should reflect this conflict:

  * Unknown nucleotide (N). The contig will be assigned an 'N' character in all positions with conflicts.
  * Ambiguity nucleotides (R, Y, etc.). The contig will display an ambiguity nucleotide reflecting the different nucleotides found in the reads. For an overview of ambiguity codes, see Appendix F.
  * Vote (A, C, G, T). The conflict will be solved by counting instances of each nucleotide and then letting the majority decide the nucleotide in the contig.

Note that conflicts will always be highlighted no matter which of the options you choose. Furthermore, each conflict will be marked as annotation on the contig sequence and will be present if the contig sequence is extracted for further analysis. As a result, the de
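The three conflict resolution options described in the fragment above can be illustrated for a single alignment column. This is a minimal sketch, not the Workbench's algorithm: the function and mode names are hypothetical, gaps are not handled, and ties in the vote are broken in favor of the nucleotide seen first, which is an assumption since the text does not say how ties are treated. The ambiguity table uses the standard IUPAC nucleotide codes.

```python
from collections import Counter

# Standard IUPAC ambiguity codes for sets of two or more nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def resolve_column(column, mode):
    """Return the contig residue for one column of aligned read residues.

    column -- residues (A, C, G or T) observed in the reads at this position
    mode   -- "unknown", "ambiguity" or "vote" (names chosen for this sketch)
    """
    bases = [b.upper() for b in column]
    if not bases or any(b not in "ACGT" for b in bases):
        raise ValueError("column must contain only A, C, G or T")
    distinct = frozenset(bases)
    if len(distinct) == 1:
        return bases[0]  # no conflict at this position
    if mode == "unknown":
        return "N"
    if mode == "ambiguity":
        return IUPAC[distinct]
    if mode == "vote":
        # Ties go to the nucleotide encountered first (assumption of this sketch).
        return Counter(bases).most_common(1)[0][0]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("unknown", "ambiguity", "vote"):
        print(mode, resolve_column("AAG", mode))
```

For the column A, A, G the three modes give N, R and A respectively, which shows how much information each option keeps about the underlying reads.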
428. ndary DNA structure found for a primer or probe. Secondary structures are scored by the number of hydrogen bonds in the structure, and 2 extra hydrogen bonds are added for each stacking base pair in the structure.

• 3' end G/C restrictions. When this checkbox is selected, it is possible to specify restrictions concerning the number of G and C molecules in the 3' end of primers and probes. A low G/C content of the primer/probe 3' end increases the specificity of the reaction. A high G/C content facilitates a tight binding of the oligo to the template but also increases the possibility of mispriming. Unfolding the preference groups yields the following options:
  * End length. The number of consecutive terminal nucleotides for which to consider the C/G content.
  * Max no. of G/C. The maximum number of G and C nucleotides allowed within the specified length interval.
  * Min no. of G/C. The minimum number of G and C nucleotides required within the specified length interval.
• 5' end G/C restrictions. When this checkbox is selected, it is possible to specify restrictions concerning the number of G and C molecules in the 5' end of primers and probes. A high G/C content facilitates a tight binding of the oligo to the template but also increases the possibility of mispriming. Unfolding the preference groups yields the same options as described above for the 3' end.
• Mode. Specifies the reaction type for which primers are designed: Stand
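The end G/C restriction described in the fragment above amounts to counting G and C nucleotides in a terminal window and comparing the count with a minimum and a maximum. A minimal sketch follows; the function name and argument names are hypothetical, and the oligo is assumed to be written in 5' to 3' orientation.

```python
def end_gc_ok(oligo, end, end_length, min_gc, max_gc):
    """Check the G/C count in the terminal window of a primer or probe.

    oligo      -- sequence written 5' to 3'
    end        -- "3" for the 3' end, "5" for the 5' end
    end_length -- number of consecutive terminal nucleotides to consider
    min_gc, max_gc -- allowed range for the number of G and C nucleotides
    """
    if end not in ("3", "5"):
        raise ValueError("end must be '3' or '5'")
    if not 0 < end_length <= len(oligo):
        raise ValueError("end_length must be between 1 and the oligo length")
    window = oligo[-end_length:] if end == "3" else oligo[:end_length]
    gc_count = sum(1 for base in window.upper() if base in "GC")
    return min_gc <= gc_count <= max_gc


if __name__ == "__main__":
    primer = "CTCTTGTCAGCACTCCAT"
    # The last five nucleotides are TCCAT, which contain two C and no G.
    print(end_gc_ok(primer, "3", 5, min_gc=1, max_gc=3))
```

Setting a low maximum for the 3' end favors specificity, while a higher minimum favors tight binding, which is the trade-off the option text describes.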
429. nditions: You must attribute the work in its original form and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work.

See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents.

16.10 Proteolytic cleavage detection

CLC Combined Workbench offers to analyze protein sequences with respect to cleavage by a selection of proteolytic enzymes. This section explains how to adjust the detection parameters and offers basic information on proteolytic cleavage in general.

16.10.1 Proteolytic cleavage parameters

Given a protein sequence, CLC Combined Workbench detects proteolytic cleavage sites in accordance with detection parameters and shows the detected sites as annotations on the sequence and in textual format in a table below the sequence view.

CHAPTER 16 PROTEIN ANALYSES 270

Detection of proteolytic cleavage sites is initiated by:

right-click a protein sequence in Navigation Area | Toolbox | Protein Analyses | Proteolytic Cleavage

This opens the dialog shown in figure 16.25.
430. nds the previous inconsistency.

In the contig view, you can use Zoom in to zoom to a greater level of detail than in other views (see figure 18.10). This is useful for discerning the trace curves.

If you want to replace a residue with a gap, use the Delete key.

If you wish to edit a selection of more than one residue:

right-click the selection | Edit Selection

This will show a warning dialog, but you can choose never to see this dialog again by clicking the checkbox at the bottom of the dialog.

18.6.3 Sorting reads

If you wish to change the order of the sequence reads, simply drag the label of the sequence up and down. You can also sort the reads by right-clicking a sequence label and choosing from the following options:

• Sort Reads by Alignment Start Position. This will list the first read in the contig at the top etc.
• Sort Reads by Name. Sort the reads alphabetically.
• Sort Reads by Length. The shortest reads will be listed at the top.

18.6.4 Assembly conflicts

When the contig is created, conflicts between the reads are annotated on the consensus sequence. The definition of a conflict is a position where at least one of the reads has a different residue.

A conflict can be in two states:

• Conflict. Both the annotation and the corresponding row in the Contig table are colored red.
• Resolved. Both the annotation and the corresponding row in the Contig tab
431. ne how the shuffling should be performed. In this step, shown in figure 14.2, the following parameters can be set for nucleotides:

Figure 14.2: Parameters for shuffling.

• Mononucleotide shuffling. Shuffle method generating a sequence of the exact same mononucleotide frequency.
• Dinucleotide shuffling. Shuffle method generating a sequence of the exact same dinucleotide frequency.
• Mononucleotide sampling from zero-order Markov chain. Resampling method generating a sequence of the same expected mononucleotide frequency.
• Dinucleotide sampling from first-order Markov chain. Resampling method generating a sequence of the same expected dinucleotide frequency.

For proteins, the following parameters can be set:

• Single amino acid shuffling. Shuffle method generating a sequence of the exact same amino acid frequency.
• Single amino acid sampling from zero-order Markov chain. Resampling method generating a sequence of the same expected single amino acid frequency.
• Dipeptide shuffling. Shuffle method generating a sequence of the exact same dipep
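The difference between "exact same frequency" (shuffling) and "same expected frequency" (sampling from a zero-order Markov chain) in the fragment above can be made concrete with a short Python sketch. It covers only the mononucleotide case; the function names are hypothetical and this is not presented as the Workbench's own code.

```python
import random
from collections import Counter


def mononucleotide_shuffle(sequence, rng):
    """Permute the residues: the result has exactly the same composition."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def zero_order_sample(sequence, rng):
    """Draw each position independently from the observed residue frequencies.

    The composition matches the input only in expectation, so an individual
    result can contain more or fewer copies of each residue than the input.
    """
    counts = Counter(sequence)
    symbols = sorted(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=len(sequence)))


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed only to make the demo repeatable
    original = "ATGGCCATTGTAATGGGCCGC"
    shuffled = mononucleotide_shuffle(original, rng)
    sampled = zero_order_sample(original, rng)
    print(Counter(original) == Counter(shuffled))  # composition is preserved
    print(sorted(Counter(sampled).items()))        # composition may differ
```

The dinucleotide variants follow the same idea one order higher: shuffling preserves the exact dinucleotide counts, while first-order sampling draws each residue conditional on the previous one.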
432. nerally be tolerated at the sequence ends, improving the overall alignment. This is the default setting of the algorithm.

Finally, treating end gaps like any other gaps is the best option when you know that there are no biologically distinct effects at the ends of the sequences.

Figures 20.3 and 20.4 illustrate the differences between the different gap scores at the sequence ends.

20.1.2 Fast or accurate alignment algorithm

CLC Combined Workbench has two algorithms for calculating alignments:

• Fast (less accurate). This allows for use of an optimized alignment algorithm which is very fast. The fast option is particularly useful for datasets with very long sequences.
• Slow (very accurate). This is the recommended choice unless you find the processing time too long.

CHAPTER 20 SEQUENCE ALIGNMENT 350
433. new location has been added. The name of the new location will be the name of the folder selected for the location. To see where the folder is located on your computer, you can either place your mouse cursor on the location icon for a second, or you can right-click the location and choose Properties. This will show a dialog with the path to the location.

Sharing data is possible if you add a location on a network drive. The procedure is similar to the one described above. When you add a location on a network drive or a removable drive, the location will appear inactive when you are not connected. Once you connect to the drive again, click Update All and it will become active (note that there will be a few seconds' delay from when you connect).

Opening data

The elements in the Navigation Area are opened by:

Double-click the element

or Click the element | Show in the Toolbar | Select the desired way to view the element

This will open a view in the View Area, which is described in section 3.2.

CHAPTER 3 USER INTERFACE 76

Adding data

Data can be added to the Navigation Area in a number of ways. Files can be imported from the file system (see chapter 7). Furthermore, an element can be added by dragging it into the Navigation Area. This could be views that are open, elements on lists (e.g. search hits or sequence lists), and files located on your computer. Finally, you can add data by adding a new location (see section 3.1.1).
434. ng an easy way for you to focus on the same region in both views.

10.2.2 Mark molecule as circular and specify starting point

You can mark a DNA molecule as circular by right-clicking its name in either the sequence view or the circular view. In the right-click menu you can also make a circular molecule linear. A circular molecule displayed in the normal sequence view will have the sequence ends marked with a symbol.

The starting point of a circular sequence can be changed by:

make a selection starting at the position that you want to be the new starting point | right-click the selection | Move Starting Point to Selection Start

Note! This can only be done for sequences that have been marked as circular.

10.3 Working with annotations

Annotations provide information about specific regions of a sequence. A typical example is the annotation of a gene on a genomic DNA sequence.

Annotations derive from different sources:

• Sequences downloaded from databases like GenBank are annotated.
• In some of the data formats that can be imported into CLC Combined Workbench, sequences can have annotations (GenBank, EMBL and Swiss-Prot format).
• The result of a number of analyses in CLC Combined Workbench are annotations on the sequence (e.g. finding open reading frames and restriction map analysis).
• You can manually add annotations to a sequence (described in section 10.3.2).

CHAPTER 10 VIEWING AND EDITING SEQUENCES 148

Note! Annotations
435. ng methods for prediction of signal peptides and prediction of subcellular localization in general. After the first published method for signal peptide prediction [von Heijne, 1986], more and more methods have surfaced, although not all of them have been made publicly available.

Different types of signal peptides

Soon after Günter Blobel's initial discovery of signal peptides, more targeting signals were found. Most cell types and organisms employ several ways of targeting proteins to the extracellular environment or subcellular locations. Most of the proteins targeted for the extracellular space or subcellular locations carry specific sequence motifs (signal peptides) characterizing the type of secretion or targeting they undergo. Several new signal peptides or targeting signals have been found in recent years, and papers often describe a small amino acid motif required for secretion of that particular protein. In most of the latter cases, the identified sequence motif is only found in this particular protein and as such cannot be described as a new group of signal peptides. Describing the various types of signal peptides is beyond the scope of this text, but several review papers on this topic can be found on PubMed. Targeting motifs can either be removed from, or retained in, the mature protein after the protein has reached the correct and final destination. Some of the best characterized signal peptides are depicted in figure 16.3.
436. ng of primers and probes

CLC Combined Workbench offers an easy way of displaying and saving a textual representation of one or more primers:

select primers in Navigation Area | Toolbox in the Menu Bar | Primers and Probes | Order Primers

This opens a dialog where you can choose additional primers. Clicking OK opens a textual representation of the primers (see figure 17.19). The first line states the number of primers being ordered, and after this follow the names and nucleotide sequences of the primers in 5'-3' orientation. From the editor, the primer information can be copied and pasted to web forms or e-mails. The created object can also be saved and exported as a text file. See figure 17.19.

Figure 17.19: A primer order for 4 primers.

Chapter 18 Sequencing data analyses and Assembly

Contents
18.1 Importing and viewing trace data
18.1.1 Scaling traces
18.1.2 Trace settings in the Side Panel
18.2 Trim sequences
18.2.1 Manual trimming
18.2.2 Automatic trimming
18.3 Assemble s
437. ng score
• Self end annealing score
• Secondary structure score

Properties with Yes/No values. If a primer meets the set requirement, a green circle is shown at its starting position, and if it fails to meet the requirement, a red dot is shown at its starting position:
• C/G at 3' end
• C/G at 5' end

Common to both sorts of properties is that clicking an information point (filled circle or bar) will cause the region covered by the associated primer to be selected on the sequence.

17.4 Output from primer design

The output generated by the primer design algorithm is a table of proposed primers or primer pairs with the accompanying information (see figure 17.6). In the preference panel of the table, it is possible to customize which columns are shown in the table. See the sections below on the different reaction types for a description of the available information.

[Figure 17.6: screenshot of the table of proposed primers for PERH3BC; table contents not recoverable]
438. ng sequence can be displayed easily by activating the calculations from the Side Panel for a sequence:

right-click protein sequence in Navigation Area | Show | Sequence | open Protein info in Side Panel

or double-click protein sequence in Navigation Area | Show | Sequence | open Protein info in Side Panel

These actions result in the view displayed in figure 16.13.

The level of hydrophobicity is calculated on the basis of the different scales. The different scales assign different values to each type of amino acid. The hydrophobicity score is then calculated as the sum of the values in a window, which is a particular range of the sequence. The window length can be set from 5 to 25 residues. The wider the window, the fewer fluctuations in the hydrophobicity scores. For more about the theory behind hydrophobicity, see 16.5.3.

In the following we will focus on the different ways that CLC Combined Workbench offers to display the hydrophobicity scores. We use Kyte-Doolittle to explain the display of the scores, but the different options are the same for all the scales. Initially there are three options for displaying

Figure 16.13: The different available scales in Protein info.
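As an illustration of the windowed calculation described above, the following Python sketch sums Kyte-Doolittle values over a sliding window. It is not the workbench's implementation: the function name hydrophobicity_profile is hypothetical, the scale values are the published Kyte-Doolittle hydropathy indices, and restricting the window to odd lengths (so that it can be centered on a residue) is an assumption.

```python
# Published Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_profile(sequence, window=9):
    """Return the summed scale value for every full window in the sequence.

    Assumption: the window must be odd and within the 5 to 25 range that
    the manual gives for the Side Panel setting.
    """
    if not 5 <= window <= 25 or window % 2 == 0:
        raise ValueError("window must be an odd number between 5 and 25")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    half = window // 2
    # One score per residue that has a complete window around it.
    return [
        round(sum(values[center - half:center + half + 1]), 2)
        for center in range(half, len(values) - half)
    ]
```

A wider window averages over more residues, so neighbouring scores share most of their terms and the curve becomes smoother, which is the effect described in the text.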
439. ngth of 9 will calculate the G/C content for the nucleotide in question plus the 4 nucleotides to the left and the 4 nucleotides to the right. A narrow window will focus on small fluctuations in the G/C content level, whereas a wider window will show fluctuations between larger parts of the sequence.

– Foreground color. Colors the letter using a gradient, where the left side color is used for low levels of G/C content and the right side color is used for high levels of G/C content. The sliders just above the gradient color box can be dragged to highlight relevant levels of G/C content. The colors can be changed by clicking the box. This will show a list of gradients to choose from.
– Background color. Sets a background color of the residues using a gradient in the same way as described above.
– Graph. The G/C content level is displayed on a graph.
  ∗ Height. Specifies the height of the graph.
  ∗ Type. The graph can be displayed as Line plot, Bar plot or as a Color bar.
  ∗ Color box. For Line and Bar plots, the color of the plot can be set by clicking the color box. For Color bar, the color box is replaced by a gradient color box as described under Foreground color.

Protein info. These preferences only apply to proteins. The first nine items are different hydrophobicity scales and are described in section 16.5.2.

• Kyte-Doolittle. The Kyte-Doolittle scale is widely used for detecting hydrophob
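The G/C window described at the start of this passage can be sketched in a few lines of Python. This is an illustration only: the function name gc_content_profile is hypothetical, and shrinking the window at the sequence ends (instead of leaving those positions empty) is an assumption, since the text does not say how the ends are handled.

```python
def gc_content_profile(sequence, window=9):
    """Return the G/C fraction in a window centered on each nucleotide.

    With window=9, each value covers the nucleotide in question plus the
    4 nucleotides to the left and the 4 to the right. Near the ends the
    window is truncated (an assumption, not documented behavior).
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    seq = sequence.upper()
    half = window // 2
    profile = []
    for center in range(len(seq)):
        chunk = seq[max(0, center - half):center + half + 1]
        gc_count = sum(base in "GC" for base in chunk)
        profile.append(gc_count / len(chunk))
    return profile
```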
440. Figure 11.4: The structure search view.

11.3.1 Structure search options

Conducting a search in the NCBI Database from CLC Combined Workbench corresponds to conducting a search for structures on NCBI's Entrez website. When conducting the search from CLC Combined Workbench, the results are available and ready to work with straight away. By default, CLC Combined Workbench offers one text field where the search parameters can be entered. Click Add search parameters to add more parameters to your search.

Note: The search is an AND search, meaning that when adding search parameters to your search, you search for both (or all) text strings rather than any of the text strings.
441. nificant events [Leitner and Albert, 1999; Forsberg et al., 2001].

21.2.3 Reconstructing phylogenies from molecular data

Traditionally, phylogenies have been constructed from morphological data, but following the growth of genetic information it has become common practice to construct phylogenies based on molecular data, known as molecular phylogeny. The data is most commonly represented in the form of DNA or protein sequences, but can also be in the form of e.g. restriction fragment length polymorphism (RFLP). Methods for constructing molecular phylogenies can be distance based or character based.

Distance based methods

Two common algorithms, both based on pairwise distances, are the UPGMA and the Neighbor Joining algorithms. Thus, the first step in these analyses is to compute a matrix of pairwise distances between OTUs from their sequence differences. To correct for multiple substitutions, it is common to use distances corrected by a model of molecular evolution, such as the Jukes-Cantor model [Jukes and Cantor, 1969].

UPGMA

A simple but popular clustering algorithm for distance data is Unweighted Pair Group Method using Arithmetic averages (UPGMA) [Michener and Sokal, 1957; Sneath and Sokal, 1973]. This method works by initially having all sequences in separate clusters and continuously joining these. The tree is constructed by considering all initial clusters as leaf nodes in the tree, and each time two clusters are joined, a node is ad
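The distance correction mentioned above can be made concrete with a small Python sketch. It uses the standard Jukes-Cantor formula, d = -(3/4) ln(1 - (4/3) p), where p is the observed proportion of differing sites between two aligned nucleotide sequences. The function names are hypothetical, and the handling of gaps (not supported here) is a simplifying assumption; this is not taken from the workbench.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal length, non-empty")
    differences = sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return differences / len(seq_a)


def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The correction is undefined at saturation (p >= 3/4).
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)
```

The corrected distance is always at least as large as the observed proportion, because some sites will have changed more than once. A matrix of such values for all pairs of OTUs is the input that UPGMA or Neighbor Joining works on.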
442. nificant for quite large datasets.

Figure 21.5: Algorithm choices for phylogenetic inference. The top shows a tree found by the neighbor joining algorithm, while the bottom shows a tree found by the UPGMA algorithm. The latter algorithm assumes that the evolution occurs at a constant rate in different lineages.

Character based methods

Whereas the distance based methods compress all sequence information into a single number, the character based methods attempt to infer the phylogeny base
443. nknown sequences based on pairwise alignment methods, by simply transferring annotation from a known protein to the unknown partner, does not take domain organization into account [Galperin and Koonin, 1998]. An unknown protein may be annotated wrongly, for instance as an enzyme, if the pairwise alignment only finds a regulatory domain.

Using the Pfam search option in CLC Combined Workbench, you can search for domains in sequence data which otherwise do not carry any annotation information. The Pfam search option adds all found domains onto the protein sequence which was used for the search. If domains of no relevance are found, they can easily be removed as described in section 10.3.4. Setting a lower cutoff value will result in fewer domains.

In CLC Combined Workbench we have implemented our own HMM algorithm for prediction of the Pfam domains. Thus, we do not use the original HMM implementation, HMMER (http://hmmer.wustl.edu), for domain prediction. We find the most probable state path (alignment) through each profile HMM by the Viterbi algorithm, and based on that we derive a new null model by averaging over the emission distributions of all M and I states that appear in the state path (M is a match state and I is an insert state). From that model we arrive at an additive correction to the original bit score, as is done in the original HMMER algorithm.

In order to conduct the Pfam search:

Select a protein sequence | Toolbox in the Menu Bar | P
444. nment is extended, an alignment score is increased or decreased. When the alignment score drops below a predefined threshold, the extension of the alignment stops. This ensures that the alignment is not extended to regions where only very poor alignment between the query and hit sequence is possible. If the obtained alignment receives a score above a certain threshold, it will be included in the final BLAST result.

Figure 12.23: BLAST aligning in both directions. The initial word match is marked green.

By tweaking the word size W and the neighborhood word threshold T, it is possible to limit the search space. E.g. by increasing T, the number of neighboring words will drop and thus limit the search space, as shown in figure 12.24. This will increase the speed of BLAST significantly but may result in loss of sensitivity. Increasing the word size W will also increase the speed, but again with a loss of sensitivity.

Figure 12.24: Each dot represents a word match. Increasing the threshold of T limits the search space significantly.

12.6.4 Which BLAST program should I use?

Depending on the na
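The extension step described at the start of this passage can be sketched as follows. This is a simplified illustration, not BLAST's actual code: it extends a word hit in one direction only, uses hypothetical match and mismatch scores (+1 and -2), and stops using the common "drop-off" formulation, where the extension ends once the running score has fallen more than x_drop below the best score seen so far. All names and default values are assumptions.

```python
def extend_right(query, subject, q_start, s_start, word_size,
                 match=1, mismatch=-2, x_drop=3):
    """Extend an exact word hit to the right without gaps.

    Returns (best_score, best_length), where best_length counts the seed
    word plus the extended positions up to the best-scoring point.
    """
    # The seed word is assumed to be an exact match of word_size letters.
    score = best_score = word_size * match
    length = best_length = word_size
    qi, si = q_start + word_size, s_start + word_size
    while qi < len(query) and si < len(subject):
        score += match if query[qi] == subject[si] else mismatch
        length += 1
        if score > best_score:
            best_score, best_length = score, length
        elif best_score - score > x_drop:
            break  # score has dropped too far: stop extending
        qi += 1
        si += 1
    return best_score, best_length
```

For example, extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0, 4) returns (8, 8): the four-letter seed is extended over four more matches, and the extension is abandoned after two mismatches in a row. In a full search, the same procedure runs to the left as well, and the alignment is reported only if its final score exceeds the reporting threshold.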
445. nother species or to test for potential mispriming. When applied, the algorithm will search for competing binding sites of the primer within the sequence. You have the option of choosing the minimum number of matching nucleotides and a minimum number of nucleotides that must bind at the end of the primer. These parameters will be explained in this section.

To search for primer binding sites:

select a nucleotide sequence | Toolbox in the Menu Bar | Primers and Probes | Find Binding Sites on Sequence

or right-click a nucleotide sequence | Toolbox | Primers and Probes | Find Binding Sites on Sequence

If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. If you want to adjust the parameters for primer matching, click Next.

17.11.1 Search for primer binding sites parameters

This opens the dialog displayed in figure 17.17.

[Figure 17.17: the Find Binding Sites on Sequence dialog, where the primer is selected and the match criteria are set: exact match, minimum number of base pairs required for a match, and number of consecutive base pairs required in the 3' end]
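The two match criteria named above can be illustrated with a short Python sketch. It is an assumption-laden simplification, not the workbench's algorithm: the function name find_binding_sites is hypothetical, the primer is compared letter by letter against the given strand only (no reverse complement, no gaps), and the 3' end is taken to be the last letters of the primer as written.

```python
def find_binding_sites(primer, sequence, min_matches, min_3prime):
    """Return (start, matches) for every position satisfying both criteria.

    min_matches: minimum number of matching positions over the whole primer.
    min_3prime:  number of consecutive matching positions required at the
                 3' end of the primer.
    """
    primer, sequence = primer.upper(), sequence.upper()
    n = len(primer)
    sites = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        matches = [p == s for p, s in zip(primer, window)]
        enough_matches = sum(matches) >= min_matches
        end_binds = all(matches[n - min_3prime:])
        if enough_matches and end_binds:
            sites.append((start, sum(matches)))
    return sites
```

With the primer ACGTAC and the sequence GGACGTACGGGGACGTTCGG, requiring 5 matches and 3 consecutive matches at the 3' end reports only the exact site at position 2. Relaxing the 3' requirement to 1 also reports the site at position 12, which has a single mismatch close to the 3' end.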
446. nother CLC Workbench, you can use the CLC format to export several objects in one file, and all the objects' information is preserved.

Note: CLC files can be exported from and imported into all the different CLC Workbenches.

Backup

If you wish to secure your data from computer breakdowns, it is advisable to perform regular backups of your data. Backing up data in the CLC Combined Workbench can be done in two ways:

• Making a backup of each of the folders represented by the locations in the Navigation Area.
• Selecting all locations in the Navigation Area and exporting in .zip format. The resulting file will contain all the data stored in the Navigation Area and can be imported into CLC Combined Workbench if you wish to restore from the backup at some point.

No matter which method is used for backup, you may have to re-define the locations in the Navigation Area if you restore your data after a computer breakdown.

7.2 External files

In order to help you organize your research projects, CLC Combined Workbench lets you import all kinds of files. E.g. if you have Word, Excel or pdf files related to your project, you can import them into the Navigation Area of CLC Combined Workbench. Importing an external file creates a copy of the file, which is stored at the location you have chosen for import. The file can now be opened by double-clicking the file in the Navigation Area. The file is opened u
447. nt Search Tool identifies homologous sequences using a heuristic method which finds short matches between two sequences. After initial match, BLAST attempts to start local alignments from these initial matches.

Figure 12.1: Display of the output of a BLAST search. At the top there is a graphical representation of BLAST hits, with tooltips showing additional information on individual hits. Below is a tabular form of the BLAST results.

You can also specify that another database should be used for BLAST search
448. nt based primer and probe design
17.9.1 Specific options for alignment based primer and probe design
17.9.2 Alignment based design of PCR primers
17.9.3 Alignment based TaqMan probe design
17.10 Analyze primer properties
17.11 Find binding sites on sequence
17.11.1 Search for primer binding sites parameters
17.12 Order primers

CLC Combined Workbench offers graphically and algorithmically advanced design of primers and probes for various purposes. This chapter begins with a brief introduction to the general concepts of the primer designing process. Then follow instructions on how to adjust parameters for primers, how to inspect and interpret primer properties graphically, and how to interpret, save and analyze the output of the primer design analysis. After a description of the different reaction types for which primers can be designed, the chapter closes with sections on how to match primers with other sequences and how to create a primer order.

17.1 Primer design - an introduction

Primer design can be accessed in two ways:

select sequence | Toolbox in the Menu Bar | Primers and Probes | Design Primers | OK

or right-click sequence | Show | Primer

In the prime
449. nt dialog displaying the available parameters which can be adjusted.

Figure 2.16: The resulting alignment.

To save the alignment, drag the tab of the alignment view into the Navigation Area. If you wish to use other alignment algorithms, e.g. ClustalW, please download the Additional Alignments Module from http://www.clcbio.com/plugins.

2.6 Tutorial: Create and modify a phylogenetic tree

You can make a phylogenetic tree from an existing alignment. (See how to create an alignment in the tutorial Align protein sequences.) We use the protein alignment located in More data in the Protein folder. To create a phylogenetic tree:

click the protein alignment in the Navigation Area | Toolbox | Alignments and Trees | Create Tree

A dialog opens where you can confirm your selection of the alignment. Click Next to
450. nt to force by pressing Ctrl while selecting (use ⌘ on Mac) | right-click the selection | Add Structure Prediction Constraints | Force Stem Here

This will add an annotation labeled "Forced Stem" to the sequence (see figure 22.5).

Figure 22.4: Marginal probability of base pairs shown in linear view (top) and marginal probability of being unpaired shown in the secondary structure 2D view (bottom).

Figure 22.5: Force a stem of the selected bases.

Using this procedure to add base pairing constraints will force the algorithm to compute minimum free energy and structure with a stem in the selected region. The two regions must be of equal length.

To prohibit two regions from forming a stem, open the sequence and:

Select the two regions you want to prohibit by pressing Ctrl while selecting (use ⌘ on Mac) | right-click the selection | Add Structure Prediction Constraints | Prohibit Stem Here

This will add an annotation labeled
451. nzymes from the REBASE restriction enzyme database at http://rebase.neb.com.

To create an enzyme list of a subset of these enzymes:

right-click in the Navigation Area | New | Enzyme list

This opens the dialog shown in figure 19.32. At the top, you can choose to Use existing enzyme list. Clicking this option lets you select an enzyme list which is stored in the Navigation Area. See section 19.4 for more about creating and modifying enzyme lists.

Below there are two panels:

• To the left, you see all the enzymes that are in the list selected above. If you have not chosen to use an existing enzyme list, this panel shows all the enzymes available.
• To the right, there is a list of the enzymes that will be used.

Footnote 5: The CLC Combined Workbench comes with a standard set of enzymes based on http://www.rebase.org.

Figure 19.32: Choosing enzymes for the new enzyme list.

Select enzymes in the left side panel and add them to the right panel by double-clicking or clicking the Add button. If you e.g. wish to use EcoRV and BamHI, select these two enzymes and add them to the right side panel. If you wish to use all the enzymes in the list:

Click in the panel to the le
452. o a leucine (L).

Based on evolution of proteins, it became apparent that these changes or substitutions of amino acids can be modeled by a scoring matrix, also referred to as a substitution matrix. See an example of a scoring matrix in table 14.1. This matrix lists the substitution scores of every single amino acid. A score for an aligned amino acid pair is found at the intersection of the corresponding column and row. For example, the substitution score from an arginine (R) to a lysine (K) is 2. The diagonal shows scores for amino acids which have not changed. Most substitutions have a negative score. Only rounded numbers are found in this matrix.

Figure 14.11: The dot plot showing an inversion in a sequence. See also figure 14.8.

The two most used matrices are the BLOSUM [Henikoff and Henikoff, 1992] and PAM [Dayhoff and Schwartz, 1978].

Different scoring matrices

PAM

The first PAM matrix (Point Accepted Mutation) was published in 1978 by Dayhoff et al. The PAM matrix was built through a global alignment of related sequences, all having sequence similarity above 85% [Dayhoff and Schwartz, 1978]. A PAM matrix shows the probability that any given amino acid will mutate into another in a given time interval. As an example, PAM1 gives that one amino acid out of 100 will mutate in a given time interval. At the other end of the scale, a PAM256 matrix gives the probabil
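The row and column lookup described above for scoring an aligned pair can be sketched in Python. The scores below are a small excerpt of BLOSUM62 values for four amino acids; whether table 14.1 is exactly this matrix is an assumption, although the R to K score of 2 agrees with the text. The function names are hypothetical, and gaps are not handled.

```python
# Excerpt of BLOSUM62 for A, K, L and R. The matrix is symmetric, so each
# pair is stored once with its two letters in alphabetical order.
SCORES = {
    ("A", "A"): 4, ("A", "K"): -1, ("A", "L"): -1, ("A", "R"): -1,
    ("K", "K"): 5, ("K", "L"): -2, ("K", "R"): 2,
    ("L", "L"): 4, ("L", "R"): -2,
    ("R", "R"): 5,
}


def pair_score(residue_a, residue_b):
    """Look up the substitution score for one aligned pair."""
    key = tuple(sorted((residue_a.upper(), residue_b.upper())))
    return SCORES[key]


def alignment_score(seq_a, seq_b):
    """Sum the pair scores of two gap-free aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(seq_a, seq_b))
```

For example, aligning ARK with AKK gives 4 + 2 + 5 = 11: two unchanged residues from the diagonal plus the conservative R to K substitution.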
453. of advanced options:

• Avoid isolated base pairs. The algorithm filters out isolated base pairs (i.e. stems of length 1).
• Apply different energy rules for Grossly Asymmetric Interior Loops (GAIL). Compute the minimum free energy applying different rules for Grossly Asymmetric Interior Loops (GAIL). A Grossly Asymmetric Interior Loop (GAIL) is an interior loop that is 1 × n or n × 1 where n > 2 (see http://www.bioinfo.rpi.edu/~zukerm/lectures/RNAfold-html/rnafold-print.pdf).
• Include coaxial stacking energy rules. Include free energy increments of coaxial stacking for adjacent helices [Mathews et al., 2004].

Figure 22.23: Adjusting parameters for hypothesis evaluation.

22.3.2 Probabilities

After evaluation of the structure hypothesis, an annotation is added to the input sequence. This annotation covers the same region as the annotations that constituted the hypothesis and contains information about the probability of the evaluated hypothesis (see figure 22.24).

[Figure 22.24: screenshot of a sequence with a Structure Hypothesis annotation; contents not recoverable]
454. of an entire sequence:

select a sequence in the Navigation Area | Toolbox in the Menu Bar | Nucleotide Analyses | Create Reverse Complement

or right-click a sequence in Navigation Area | Toolbox | Nucleotide Analyses | Create Reverse Complement

This opens the dialog displayed in figure 15.3.

Figure 15.3: Creating a reverse complement sequence.

If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This will open a new view in the View Area displaying the reverse complement of the selected sequence. The new sequence is not saved automatically. To save the sequence, drag it into the Navigation Area or press Ctrl + S (⌘ + S on Mac) to activate a save dialog.

15.4 Translation of DNA or RNA to protein

In CLC
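For readers who want to check the result of the Create Reverse Complement action described above, the operation itself is simple to reproduce. The following Python sketch is an illustration, not the workbench's code: the function name is hypothetical, and only A, C, G, T and N are supported (an assumption; other IUPAC ambiguity codes are left unchanged).

```python
# Translation table mapping each base to its complement, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the order of the sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]
```

For example, reverse_complement("ATGC") gives "GCAT", and a palindromic restriction site such as AAGCTT is its own reverse complement.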
455. of antigenicity scales. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. The result can be seen in figure 16.10.

Figure 16.9: Step two in the Antigenicity Plot allows you to choose different antigenicity scales and the window size.

CLC Combined Workbench offers some View Preferences for the view of the antigenicity plot. The drop-down menus are opened by clicking the black triangular arrows. There are two kinds of view preferences: the graph preferences and preferences for the kind of antigenicity scale used to calculate the graph (e.g. Welling). The Graph preferences include:

• Lock axis. This will always show the axis, even though the plot is zoomed to a detailed level.

[Screenshot of the workbench window; contents not recoverable]
456. of the labels on the branches.
– Node color. Sets the color of all nodes.
– Line color. Alters the color of all lines in the tree.
• Labels. Specifies the text to be displayed in the tree.
– Nodes. Sets the annotation of all nodes either to name or to species.
– Branches. Changes the annotation of the branches to bootstrap, length or none if you don't want annotation on branches.

Note: Dragging in a tree will change it. You are therefore asked if you want to save this tree when the Tree Viewer is closed.

You may select part of a tree by clicking on the nodes that you want to select. Right-clicking a selected node opens a menu with the following options:

• Set root above node (defines the root of the tree to be just above the selected node).
• Set root at this node (defines the root of the tree to be at the selected node).
• Toggle collapse (collapses or expands the branches below the node).
• Change label (allows you to label or to change the existing label of a node).
• Change branch label (allows you to change the existing label of a branch).

You can also relocate leaves and branches in a tree or change the length. It is possible to modify the text on the unit measurement at the bottom of the tree view by right-clicking the text. In this way you can specify a unit, e.g. "years".

Note: To drag branches of a tree, you must first click the node one time, and then click the node again, and this
457. of the selected sequences.

Figure 11.2: By right-clicking a search result, it is possible to choose how to handle the relevant sequence.

Copy/paste from GenBank search results

When using copy/paste to bring the search results into the Navigation Area, the actual files are downloaded from GenBank. To copy/paste files into the Navigation Area:

select one or more of the search results | Ctrl + C (⌘ + C on Mac) | select a folder in the Navigation Area | Ctrl + V

Note: Search results are downloaded before they are saved. Downloading and saving several files may take some time. However, since the process runs in the background (displayed in the Status bar), it is possible to continue other tasks in the program. Like the search process, the download process can be stopped. This is done in the Toolbox in the Processes tab.

11.1.3 Save GenBank search parameters

The search view can be saved either by dragging the search tab and dropping it in the Navigation Area or by clicking Save. When saving the search, only the parameters are saved, not the results of the search. This is useful if you have a special search that you perform from time to time. Even if you don't save the search, the next time you open the search view, it will remember the parameters from the last
458. of these will be referred to as regions. Regions are generally illustrated by markings (often arrows) on the sequences. An arrow pointing to the right indicates that the corresponding region is located on the positive strand of the sequence. Figure 10.11 is an example of three regions with separate colors.

Figure 10.11: Three regions on a human beta globin DNA sequence (HUMHBB).

Figure 10.12 shows an artificial sequence with all the different kinds of regions.

10.2 Circular DNA

A sequence can be shown as a circular molecule:

select a sequence in the Navigation Area | Show in the Toolbar | As Circular

or, if the sequence is already open: Click Show As Circular at the lower left part of the view

This will open a view of the molecule similar to the one in figure 10.13. This view of the sequence shares some of the properties of the linear view of sequences as described in section 10.1, but there are some differences. The similarities and differences are listed below:

• Similarities:
– The editing options.
– Options for adding, editing and removing annotations.
– Restriction Sites, Annotation Types, Find and Text Format preferences groups.
• Differences:
– In the Sequence Layout preferences, only the following options are available in the circular view: Numbers on plus strand, Numbers on sequence and Sequence label.
459. og. The left-hand part of the dialog lists a number of Annotation types. When you have selected an annotation type, it appears in Type to the right. You can also select an annotation directly in this list. Choosing an annotation type is mandatory. If you wish to use an annotation type which is not present in the list, simply enter this type into the Type field. Note that your own annotation types will be converted to "unsure" when exporting in GenBank format. As long as you use the sequence in CLC format, your own annotation type will be preserved.
The right-hand part of the dialog contains the following text fields:
• Name. The name of the annotation, which can be shown on the label in the sequence views. Whether the name is actually shown depends on the Annotation Layout preferences (see section 10.3.1).
• Type. Reflects the left-hand part of the dialog as described above. You can also choose directly in this list or type your own annotation type.
• Region. If you have already made a selection, this field will show the positions of the selection. You can modify the region further using the conventions of DDBJ, EMBL and GenBank. The following are examples of how to use the syntax (based on http://www.ncbi.nlm.nih.gov/collab/FT/):
  467 Points to a single residue in the presented sequence.
  340..565 Points to a continuous range of residues bounded by and including the starting
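The two location forms quoted above (a single residue and a continuous range) can be read mechanically. The following is a minimal sketch in Python, assuming the usual feature-table notation in which a range is written with two dots between the start and end positions. The function name `parse_simple_location` is hypothetical, and only these two forms are handled; it is not the workbench's own parser.

```python
import re

# Matches "467" (single residue) or "340..565" (inclusive range).
_LOCATION = re.compile(r"\s*(\d+)(?:\.\.(\d+))?\s*")


def parse_simple_location(text):
    """Return (start, end), 1-based and inclusive, for 'N' or 'N..M'."""
    match = _LOCATION.fullmatch(text)
    if match is None:
        raise ValueError(f"unsupported location: {text!r}")
    start = int(match.group(1))
    # A single position is treated as a range of length one.
    end = int(match.group(2)) if match.group(2) else start
    if end < start:
        raise ValueError(f"end before start in location: {text!r}")
    return start, end


print(parse_simple_location("467"))
print(parse_simple_location("340..565"))
```

More elaborate locations (complement, join, uncertain boundaries) are deliberately left out of this sketch.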
460. ollowing options can be set:
• Show all structures. If more than one structure is predicted, this option can be used if all the structures should be displayed.
• Show first. If not all structures are shown, this can be used to determine the number of structures to be shown.
• Sort by. When you select to display e.g. four out of eight structures, this option determines which the "first four" should be: Sort by ΔG, Sort by name, or Sort by time of creation. If these three options do not provide enough control, you can rename the structures in a meaningful alphabetical way so that you can use the name to display the desired ones.
• Match symbols. How a base pair should be represented.
• No match symbol. How bases which are not part of a base pair should be represented.
• Height. When you zoom out, this option determines the height of the symbols as shown in figure 22.20 (when zoomed in, there is no need for specifying the height).
CHAPTER 22 RNA STRUCTURE 388
• Base pair probability. See section 22.2.4 below.
When you zoom in and out, the appearance of the symbols changes. In figure 22.19, the view is zoomed in. In figure 22.20 you see the same sequence zoomed out to fit the width of the sequence.
Figure 22.20: The secondary structure visualized below the sequence and with annotations shown above. The view is zoomed
461. om sizes and transparency can be varied by using the sliders (see figure 13.2).
• Non-Polymer Bonds. Show bonds between atoms in non-polymer compounds. The width of the bond can be selected from the drop-down box.
• Polymer Bonds. Show bonds between polymer atoms. The width of the bond can be selected from the drop-down box.
CHAPTER 13 3D MOLECULE VIEWING 203
13.4.2 Backbone
• None. The structure is displayed without any special indication of the backbone.
• Cartoon. Show the backbone on proteins as cartoon drawings. When using this view, it is possible to see alpha helices and beta sheets.
• Backbone. The alpha carbon atoms are connected by thick bonds.
13.4.3 Coloring
Atoms, bonds and cartoon elements are colored individually according to the list below. For the Atom Type scheme, the coloring scheme (CPK) is adapted from the visualization tool Rasmol.
• Atom type. Color the atoms individually:
  Carbon: Light grey
  Oxygen: Red
  Hydrogen: White
  Nitrogen: Light blue
  Sulphur: Yellow
  Chlorine, Boron: Green
  Phosphorus, Iron, Barium: Orange
  Sodium: Blue
  Magnesium: Forest green
  Zn, Cu, Ni, Br: Brown
  Ca, Mn, Al, Ti, Cr, Ag: Dark grey
  F, Si, Au: Goldenrod
  Iodine: Purple
  Lithium: Firebrick
  Helium: Pink
  Other: Deep pink
• Entities. This will color protein subunits and additional structures individually. Using the view table, the user may select which colors are u
462. ommun, 27(2):157-162.
[Schechter and Berger, 1968] Schechter, I. and Berger, A. (1968). On the active site of proteases. 3. Mapping the active site of papain; specific peptide inhibitors of papain. Biochem Biophys Res Commun, 32(5):898-902.
[Schneider and Stephens, 1990] Schneider, T. D. and Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res, 18(20):6097-6100.
[Schroeder et al., 1999] Schroeder, S. J., Burkard, M. E., and Turner, D. H. (1999). The energetics of small internal loops in RNA. Biopolymers, 52(4):157-167.
[Shapiro et al., 2007] Shapiro, B. A., Yingling, Y. G., Kasprzak, W., and Bindewald, E. (2007). Bridging the gap in RNA structure prediction. Curr Opin Struct Biol, 17(2):157-165.
[Siepel and Haussler, 2004] Siepel, A. and Haussler, D. (2004). Combining phylogenetic and hidden Markov models in biosequence analysis. J Comput Biol, 11(2-3):413-428.
[Smith and Waterman, 1981] Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol, 147(1):195-197.
[Sneath and Sokal, 1973] Sneath, P. and Sokal, R. (1973). Numerical Taxonomy. Freeman, San Francisco.
[Tobias et al., 1991] Tobias, J. W., Shrader, T. E., Rocap, G., and Varshavsky, A. (1991). The N-end rule in bacteria. Science, 254(5036):1374-1377.
BIBLIOGRAPHY 418
[von Ahsen et al., 2001] von Ahsen, N., Wittwer, C. T., and Schutz, E. (2001). Oligonucleotide melting temperatures under
463. Figure 17.4: Compact information mode.
The number of information lines reflects the chosen length interval for primers and probes. One line is shown for every possible primer length, and if the length interval is widened, more lines will appear. At each potential primer starting position, a circle is shown which indicates whether the primer fulfills the requirements set in the primer parameters preference group. A green circle indicates a primer which fulfills all criteria, and a red circle indicates a primer which fails to meet one or more of the set criteria. For more detailed information, place the mouse cursor over the circle representing the primer of interest. A tool tip will then appear on screen, displaying detailed information about the primer in relation to the set criteria. To locate the primer on the sequence, simply left-click the circle using the mouse.
The various primer parameters can now be varied to explore their effect, and the view area will dynamically update to reflect this. If e.g. the allowed melting temperature interval is widened, more green circles will appear, indicating that more primers now fulfill the set requirements, and if e.g. a requirement for 3' G/C content is selected, red circles wi
464. on types that are not relevant. Unchecking the checkboxes in the Annotation Layout will not remove annotations of this type from the sequence; it will just hide them from the view. Besides selecting which types of annotations should be displayed, the Annotation Types group is also used to change the color of the annotations on the sequence. Click the colored square next to the relevant annotation type to change the color. This will display a dialog with three tabs: Swatches, HSB and RGB. They represent three different ways of specifying colors. Apply your settings and click OK. When you click OK, the color settings cannot be reset. The Reset function only works for changes made before pressing OK.
Furthermore, the Annotation Types can be used to easily browse the annotations by clicking the small button next to the type. This will display a list of the annotations of that type (see figure 10.17).
Figure 10.17: Browsing the gene annotations on a sequence.
Clicking an annotation in the list will select this region on the sequence. In this way, you can quickly find a specific annotation on a long sequence.
View Annotations in a table
Annotations can also be viewed in a table: select the sequ
465. onality is useful to simulate digestions with complex combinations of restriction enzymes. If views of both the fragment table and the sequence are open, clicking in the fragment table will select the corresponding region on the sequence.
Gel
The restriction map can also be shown as a gel. This is described in section 19.3.1.
19.3 Gel electrophoresis
CLC Combined Workbench enables the user to simulate the separation of nucleotide sequences on a gel. This feature is useful when e.g. designing an experiment which will allow the differentiation of a successful and an unsuccessful cloning experiment on the basis of a restriction map. There are two main ways to simulate gel separation of nucleotide sequences:
• One or more sequences can be digested with restriction enzymes, and the resulting fragments can be separated on a gel.
• A number of existing sequences can be separated on a gel.
There are several ways to apply these functionalities, as described below.
CHAPTER 19 CLONING AND CUTTING 342
19.3.1 Separate fragments of sequences on gel
This section explains how to simulate a gel electrophoresis of one or more sequences which are digested with restriction enzymes. There are two ways to do this:
• When performing the Restriction Site Analysis from the Toolbox, you can choose to create a restriction map which can be shown as a gel. This is explained in section 19.2.2.
• From all the graphical views of sequences, you can r
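The idea of digesting a sequence and separating the fragments by size can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the molecule is linear, each cut position means "cut after this many residues", and the function name `fragment_lengths` and the example cut positions are hypothetical. It does not reproduce how the workbench finds restriction sites or draws the gel.

```python
def fragment_lengths(sequence_length, cut_positions):
    """Fragment lengths after cutting a linear sequence at the given positions.

    A cut position p means the sequence is cut after residue p (1-based),
    so only positions strictly inside the sequence produce a cut.
    """
    cuts = sorted({p for p in cut_positions if 0 < p < sequence_length})
    bounds = [0] + cuts + [sequence_length]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Hypothetical example: a 1000 bp sequence cut after positions 200 and 700.
fragments = fragment_lengths(1000, [700, 200])
# On a gel, larger fragments migrate more slowly, so list them largest first.
lanes = sorted(fragments, reverse=True)
print(fragments, lanes)
```

A circular molecule would need the first and last fragments joined, which this sketch leaves out.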
466. one or more sequence views.
• Save sequence. Downloads and saves the sequence without opening it.
• Open structure. If the hit sequence contains structure information, the sequence is opened in a text view or a 3D view (3D view in CLC Protein Workbench and CLC Combined Workbench).
You can do a text-based search in the information in the BLAST table by using the filter at the upper right part of the view. In this way you can search for e.g. species or other information which is typically included in the Description field. The table is integrated with the graphical view described in section 12.3.2, so that selecting a hit in the table will make a selection on the corresponding sequence in the graphical view.
12.4 Create Local BLAST Database
In CLC Combined Workbench you can create a local database which you can use for local BLAST. DNA, RNA and protein sequences can all be used. It is not necessary to import the sequences into CLC Combined Workbench before creating the database. The local database can be created from sequences which are stored in the Navigation Area, or the sequences can be browsed from the computer's file system. In the latter case, the files must be in FASTA format (.fsa, .fa or .fasta). To create a local BLAST database from the file system or from the Navigation Area: BLAST search in the Toolbox, then Create Local BLAST Database. This opens the dialog seen in figure 12.12.
CHAPTER 12 BLAST SEARCH 184
Create Local
467. onsensus. Here you can also choose IUPAC, which will display the ambiguity code when there are differences between the sequences. E.g. an alignment with an A and a G at the same position will display an R in the consensus line if the IUPAC option is selected. The IUPAC codes can be found in section F and E.
No gaps. Checking this option will not show gaps in the consensus.
Ambiguous symbol. Select how ambiguities should be displayed in the consensus line (as N or an alternative symbol). This option has no effect if IUPAC is selected in the Limit list above.
The Consensus Sequence can be opened in a new view, simply by right-clicking the Consensus Sequence and clicking Open Consensus in New View.
• Conservation. Displays the level of conservation at each position in the alignment. The height of the bar, or the gradient of the color, reflects how conserved that particular position is in the alignment. If one position is 100% conserved, the bar will be shown in full height, and it is colored in the color specified at the right side of the gradient slider.
Foreground color. Colors the letters using a gradient, where the right side color is used for highly conserved positions and the left side color is used for positions that are less conserved.
Background color. Sets a background color of the residues using a gradient in the same way as described above.
CHAPTER 20 SEQUENCE ALIGNMENT
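The IUPAC consensus behaviour described above (an A and a G in the same column give an R) can be sketched as follows. This is an illustrative Python example, not the workbench's implementation: the function name `iupac_consensus` is hypothetical, the input is assumed to be uppercase DNA rows of equal length, and gaps are simply ignored when choosing the code, which is an assumption rather than documented behaviour. The ambiguity table itself is the standard IUPAC nucleotide code.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def iupac_consensus(aligned_rows):
    """Build a consensus line using IUPAC codes, one symbol per column."""
    consensus = []
    for column in zip(*aligned_rows):
        bases = frozenset(symbol for symbol in column if symbol != "-")
        if not bases:
            consensus.append("-")   # column contains only gaps
        else:
            # Unknown symbols fall back to N (assumption for this sketch).
            consensus.append(IUPAC.get(bases, "N"))
    return "".join(consensus)


print(iupac_consensus(["ACGT", "GCGA"]))
```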
468. ools for working with structures. Select a structure, right-click, and the following menu items will be available:
• Open Secondary Structure in 2D Viewer. This will open the selected structure in the Secondary structure 2D view.
• Annotate Sequence with Secondary Structure. This will add the structure elements as annotations to the sequence. Note that existing structure annotations will be removed.
• Rename Secondary Structure. This will allow you to specify a name for the structure to be displayed in the table.
• Delete Secondary Structure. This will delete the selected structure.
• Delete All Secondary Structures. This will delete all the selected structures. Note that once you save and close the view, this operation is irreversible. As long as the view is open, you can Undo the operation.
22.2.3 Symbolic representation in sequence view
In the Side Panel of normal sequence views, you will find an extra group under Nucleotide info called Secondary Structure. This is used to display a symbolic representation of the secondary structure along the sequence (see figure 22.19).
Figure 22.19: The secondary structure visualized below the sequence and with annotations shown above.
The f
469. open the list, select the relevant enzymes, right-click, then Create New Enzyme List from Selection. If you combine this method with the filter located at the top of the view, you can extract a very specific set of enzymes. E.g. if you wish to create a list of enzymes sold by a particular distributor, type the name of the distributor into the filter, then select and create a new enzyme list from the selection.
Chapter 20 Sequence alignment
Contents
20.1 Create an alignment
  20.1.1 Gap costs
  20.1.2 Fast or accurate alignment algorithm
  20.1.3 Aligning alignments
  20.1.4 Fixpoints
20.2 View alignments
  20.2.1 Bioinformatics explained: Sequence logo
20.3 Edit alignments
  20.3.1 Move residues and gaps
  20.3.2 Insert gaps
  20.3.3 Delete residues and gaps
  20.3.4 Copy annotations to other sequences
  20.3.5 Move sequences up and down
  20.3.6 Delete, rename and add sequences
  20.3.7 Realign selection
20.4 Join alignments
  20.4
470. or nucleotides stacked on top of each other (see figure 20.8). The sequence logo provides a far more detailed view of the entire alignment than a simple consensus sequence. Sequence logos can help identify protein binding sites on DNA sequences and conserved residues in aligned domains of protein sequences, and they have a wide range of other applications. Each position of the alignment, and consequently of the sequence logo, shows the sequence information in a computed score based on Shannon entropy [Schneider and Stephens, 1990]. The height of the individual letters represents the sequence information content in that particular position of the alignment. A sequence logo is a much better visualization tool than a simple consensus sequence. An example hereof is an alignment where, in one position, a particular residue is found in 70% of the sequences. If a consensus sequence is used, it typically only displays the single residue with 70% coverage. In figure 20.8, an un-gapped alignment of 11 E. coli start codons including flanking regions is shown. In this example, a consensus sequence would only display ATG as the start codon in position 1, but when looking at the sequence logo, it is seen that a GTG is also allowed as a start codon.
Calculation of sequence logos
A comprehensive walk-through of the calculation of the information content in sequence logos is beyond the scope of this document, but can be found in the original paper by Schneider
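As a rough illustration of the Shannon-entropy score mentioned above, the information content of one alignment column can be computed as the maximum possible entropy minus the observed entropy. The Python sketch below makes several simplifying assumptions: it uses a four-letter nucleotide alphabet by default, ignores gaps, and omits the small-sample correction discussed in the original paper. The function names are hypothetical and this is not the workbench's own calculation.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content (bits) of one column: log2(alphabet) - entropy."""
    symbols = [s for s in column if s != "-"]
    if not symbols:
        return 0.0
    total = len(symbols)
    entropy = 0.0
    for count in Counter(symbols).values():
        frequency = count / total
        entropy -= frequency * math.log2(frequency)
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size=4):
    """Height of each letter: its frequency times the column information."""
    symbols = [s for s in column if s != "-"]
    info = column_information(column, alphabet_size)
    return {s: (n / len(symbols)) * info for s, n in Counter(symbols).items()}


# A fully conserved column carries 2 bits; a uniform column carries none.
print(column_information("AAAA"), column_information("ACGT"))
# A column where one residue is found in 70% of the sequences.
print(letter_heights("AAAAAAAGGG"))
```

In the 70% case, both letters get a non-zero height, which is exactly the detail a plain consensus line would hide.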
471. ossible to specify when the sequence should be wrapped. In the text field below, you can choose the number of residues to display on each line.
• Double stranded. Shows both strands of a sequence (only applies to DNA sequences).
• Numbers on sequences. Shows residue positions along the sequence. The starting point can be changed by setting the number in the field below. If you set it to e.g. 101, the first residue will have the position of -100. This can also be done by right-clicking an annotation and choosing Set Numbers Relative to This Annotation.
• Numbers on plus strand. Whether to set the numbers relative to the positive or the negative strand in a nucleotide sequence (only applies to DNA sequences).
• Follow selection. When viewing the same sequence in two separate views, Follow selection will automatically scroll the view in order to follow a selection made in the other view.
• Lock numbers. When you scroll vertically, the position numbers remain visible. (Only possible when the sequence is not wrapped.)
• Lock labels. When you scroll horizontally, the label of the sequence remains visible.
• Sequence label. Defines the label to the left of the sequence:
  Name (this is the default information to be shown).
  Accession (sequences downloaded from databases like GenBank have an accession number).
  Latin name.
  Latin name (accession).
  Common name.
  Common name (accession).
472. ot (.swp): protein sequences
Lasergene sequence (.pro): protein sequence (only import)
Lasergene sequence (.seq): nucleotide sequence (only import)
Embl (.embl): nucleotide sequences
Nexus (.nxs, .nexus): sequences, trees, alignments, and sequence lists
CLC (.clc): sequences, trees, alignments, reports, etc.
Text (.txt): all data in a textual format
CSV (.csv): tables, each cell separated with semicolons (only export)
ABI (.abi): trace files (only import)
AB1 (.ab1): trace files (only import)
SCF2 (.scf): trace files (only import)
SCF3 (.scf): trace files (only import)
Phred (.phd): trace files (only import)
mmCIF (.cif): structure (only import)
PDB (.pdb): structure (only import)
BLAST Database (.phr, .nhr): BLAST database (import)
Vector NTI Database: sequences (import of whole database)
Vector NTI archives (.ma4, .pa4, .oa4): sequences (only import)
Gene Construction Kit (.gcc): sequences (only import)
RNA Structure (.ct, .col, .rnaml, .xml): RNA structures
Preferences (.cpf): CLC workbench preferences
CHAPTER 2 TUTORIALS 36
Note: CLC Combined Workbench can import "external" files, too. This means that all kinds of files can be imported and displayed in the Navigation Area, but the above-mentioned formats are the only ones whose contents can be shown in CLC Combined Workbench.
2.2 Tutorial: View sequence
This brief tutorial will take you through some different ways to display a sequence in the program. The tutorial introduces zooming on a sequence, dragging
473. otein analyses which are described elsewhere in this manual. To create a protein report, do the following: right-click the protein in the Navigation Area, then Toolbox, then Protein Analyses, then Create Protein Report. This opens dialog Step 1, where you can choose which proteins to create a report for. When the correct one is chosen, click Next. In dialog Step 2, you can choose which analyses you want to include in the report. The following list shows which analyses are available and explains where to find more details.
CHAPTER 16 PROTEIN ANALYSES 264
• Sequence statistics. See section 14.4 for more about this topic.
• Plot of charge as function of pH. See section 16.2 for more about this topic.
• Plot of hydrophobicity. See section 16.5 for more about this topic.
• Plot of local complexity. See section 14.3 for more about this topic.
• Dot plot against self. See section 14.2 for more about this topic.
• Secondary structure prediction. See section 16.7 for more about this topic.
• Pfam domain search. See section 16.6 for more about this topic.
• Local BLAST. See section 12.2 for more about this topic.
• NCBI BLAST. See section 12.1 for more about this topic.
When you have selected the relevant analyses, click Next. Step 3 to Step 7 (if you select all the analyses in Step 2) are adjustments of parameters for the different analyses. The parameters are mentioned briefly in relation to the following steps, and you can turn to the relevant chapters
474. otide frequency. RNA, 11(5):578-591.
[Collins et al., 1998] Collins, F. S., Brooks, L. D., and Chakravarti, A. (1998). A DNA polymorphism discovery resource for research on human genetic variation. Genome Res, 8(12):1229-1231.
[Cornette et al., 1987] Cornette, J. L., Cease, K. B., Margalit, H., Spouge, J. L., Berzofsky, J. A., and DeLisi, C. (1987). Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J Mol Biol, 195(3):659-685.
[Costa, 2007] Costa, F. F. (2007). Non-coding RNAs: lost in translation? Gene, 386(1-2):1-10.
[Crooks et al., 2004] Crooks, G. E., Hon, G., Chandonia, J. M., and Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res, 14(6):1188-1190.
[Dayhoff and Schwartz, 1978] Dayhoff, M. O. and Schwartz, R. M. (1978). Atlas of Protein Sequence and Structure, volume 3 of 5 suppl., chapter Atlas of Protein Sequence and Structure, pages 353-358. Nat. Biomed. Res. Found., Washington D.C.
[Eddy, 2004] Eddy, S. R. (2004). Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol, 22(8):1035-1036.
[Eisenberg et al., 1984] Eisenberg, D., Schwarz, E., Komaromy, M., and Wall, R. (1984). Analysis of membrane and surface protein sequences with the hydrophobic moment plot. J Mol Biol, 179(1):125-142.
[Emini et al., 1985] Emini, E. A., Hughes, J. V., Perlow, D. S., and Boger, J. (1985). Induction of hepatitis A virus-neutralizing ant
475. ou choose the color of the dots.
• Line width. Setting the width of the line connecting the dots.
• Line type. Setting the type of the line connecting the dots.
• Line color. Lets you choose the color of the line connecting the dots.
The level of antigenicity is calculated on the basis of the different scales. The different scales add different values to each type of amino acid. The antigenicity score is then calculated as the sum of the values in a window, which is a particular range of the sequence. The window length can be set from 5 to 25 residues. The wider the window, the fewer fluctuations in the antigenicity scores.
16.4.2 Antigenicity graphs along sequence
Antigenicity graphs along the sequence can be displayed using the Side Panel. The functionality is similar to hydrophobicity (see section 16.5.2).
16.5 Hydrophobicity
CLC Combined Workbench can calculate the hydrophobicity of protein sequences in different ways, using different algorithms (see section 16.5.3). Furthermore, hydrophobicity of sequences can be displayed as hydrophobicity plots and as graphs along sequences. In addition, CLC Combined Workbench can calculate hydrophobicity for several sequences at the same time, and for alignments.
16.5.1 Hydrophobicity plot
Displaying the hydrophobicity for a protein sequence in a plot is done in the following way: select a protein sequence in the Navigation Area, then Toolbox in the Menu Bar, then Protein Analyses, then Create Hydrophobi
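The window-based score described in section 16.4 (per-residue scale values summed over a window of 5 to 25 residues) can be sketched in a few lines of Python. The function name `window_scores` and the scale values in `TOY_SCALE` are hypothetical and chosen only for illustration; they are not one of the published antigenicity or hydrophobicity scales, and the sketch does not claim to match how the workbench positions each score along the sequence.

```python
def window_scores(sequence, scale, window=7):
    """Sum of per-residue scale values in each full window along the sequence."""
    if not 5 <= window <= 25:
        raise ValueError("window length must be between 5 and 25 residues")
    values = [scale[residue] for residue in sequence]
    if len(values) < window:
        return []
    running = sum(values[:window])
    scores = [running]
    # Slide the window one residue at a time: add the new value, drop the old.
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        scores.append(running)
    return scores


# Made-up scale values for three amino acids (illustration only).
TOY_SCALE = {"A": 1.0, "K": 3.0, "L": -2.0}
print(window_scores("AKLAKLA", TOY_SCALE, window=5))
```

Because each score averages over more residues as the window grows, neighbouring scores share more terms, which is why wider windows give smoother curves.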
476. ouse or by using the buttons below the table. If the BLAST table view was not selected in Step 4 of the BLAST search, the table can be shown in the following way: click the Show BLAST Table button at the bottom of the view. Figure 12.11 is an example of a BLAST Table.
Figure 12.11: Display of the output of a BLAST search in the tabular view. The hits can be sorted by the different columns, simply by clicking the column heading.
The BLAST Table includes the following information:
• Query sequence. The sequence which was used for the search.
• Hit. The name of the sequences found in the BLAST search.
• Description. Text from NC
477. out to fit the width of the sequence.
22.2.4 Probability based coloring
In the Side Panel of both linear and secondary structure 2D views, you can choose to color structure symbols and sequence residues according to the probability of base pairing / not base pairing, as shown in figure 22.4. In the linear sequence view, this is found in Nucleotide info under Secondary structure, and in the secondary structure 2D view, it is found under Residue coloring. For both paired and unpaired bases, you can set the foreground color and the background color to a gradient, with the color at the left side indicating a probability of 0 and the color at the right side indicating a probability of 1.
22.3 Evaluate structure hypothesis
Hypotheses about an RNA structure can be tested using CLC Combined Workbench. A structure hypothesis H is formulated using the structural constraint annotations described in section 22.1.4. By adding several annotations, complex structural hypotheses can be formulated (see figure 22.21). Given the set S of all possible structures, only a subset of these, S_H, will comply with the formulated hypotheses. We can now find the probability of H as:
\[ P(H) = \frac{\sum_{s_H \in S_H} P(s_H)}{\sum_{s \in S} P(s)} = \frac{PF_H}{PF_{full}} \]
where PF_H is the partition function calculated for all structures permissible by H (S_H), and PF_full is the full partition function. Calculating the probability can thus be done with two passes of the p
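To make the ratio of the two partition functions concrete, the sketch below computes it by explicit enumeration over a tiny, made-up set of structures. It assumes the standard Boltzmann weighting, where each structure contributes exp(-ΔG / RT), and a temperature of 37 °C; both the function name `hypothesis_probability` and the example energies are hypothetical. A real implementation would not enumerate structures but would compute both partition functions with dynamic programming.

```python
import math

# Gas constant in kcal/(mol*K) times 310.15 K (37 degrees Celsius); assumption.
RT_KCAL_PER_MOL = 0.0019872 * 310.15


def hypothesis_probability(energies, complies):
    """P(H) = PF_H / PF_full for an explicitly enumerated set of structures.

    energies: free energy (kcal/mol) of every structure in S.
    complies: for each structure, True if it satisfies hypothesis H.
    """
    weights = [math.exp(-energy / RT_KCAL_PER_MOL) for energy in energies]
    pf_full = sum(weights)
    pf_h = sum(w for w, ok in zip(weights, complies) if ok)
    return pf_h / pf_full


# Three made-up structures; only the first two comply with the hypothesis.
print(hypothesis_probability([-9.9, -8.0, -9.0], [True, True, False]))
```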
478. output files (.phd) and PHRAP output files (.ace) (see section 7.1.1).
CHAPTER 18 SEQUENCING DATA ANALYSES AND ASSEMBLY 302
After import, the sequence reads and their trace data are saved as DNA sequences. This means that all analyses which apply to DNA sequences can be performed on the sequence reads, including e.g. BLAST and open reading frame prediction. You can see additional information about the quality of the traces by holding the mouse cursor on the imported sequence. This will display a tool tip as shown in figure 18.1.
Figure 18.1: A tooltip displaying information about the quality of the chromatogram (in the example: trace of read2.scf, length 560, low quality 88, medium quality 135, high quality 337).
If the trace file does not contain information about quality, only the sequence length will be shown. To view the trace data, open the sequence read in a standard sequence view.
18.1.1 Scaling traces
The traces can be scaled by dragging the trace vertically, as shown in figure 18.2. The Workbench automatically adjusts the height of the traces to be readable, but if the trace height varies a lot, this manual scaling is very useful. The height of the area available for showing traces can be adjusted in the Side Panel, as described in section 18.1.2.
Figure 18.2: Grab the traces to scale.
18.1.2 Trace settings in the Side Panel
In
479. ox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This will open a new view in the View Area displaying the new DNA sequence. The new sequence is not saved automatically. To save the sequence, drag it into the Navigation Area or press Ctrl + S (⌘ + S on Mac) to activate a save dialog.
CHAPTER 15 NUCLEOTIDE ANALYSES 235
Note: You can select multiple RNA sequences and sequence lists at a time. If the sequence list contains DNA sequences as well, they will not be converted.
15.3 Reverse complements of sequences
CLC Combined Workbench is able to create the reverse complement of a nucleotide sequence. By doing that, a new sequence is created which also has all the annotations reversed, since they now occupy the opposite strand of their previous location. To quickly obtain the reverse complement of a sequence or part of a sequence, you may select a region on the negative strand and open it in a new view: right-click a selection on the negative strand, then Open selection in New View. By doing that, the sequence will be reversed. This is only possible when the double stranded view option is enabled. It is possible to copy the selection and paste it in a word processing program or an e-mail. To obtain a reverse complement
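The reverse complement operation from section 15.3, including what happens to annotation positions, can be sketched as follows. This is a minimal Python illustration, assuming plain DNA letters (A, C, G, T) and 1-based inclusive annotation regions; the function names `reverse_complement` and `flip_region` are hypothetical and this is not the workbench's implementation.

```python
# Translation table mapping each base to its complement (both cases).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement every base, then reverse the order."""
    return sequence.translate(_COMPLEMENT)[::-1]


def flip_region(start, end, length):
    """Map a 1-based inclusive region onto the reverse-complemented sequence.

    The region also changes strand, which is why annotations end up on the
    opposite strand of their previous location.
    """
    return length - end + 1, length - start + 1


sequence = "ATGC"
print(reverse_complement(sequence))
# An annotation covering residues 1..2 of a 4-residue sequence.
print(flip_region(1, 2, len(sequence)))
```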
480. pair.
• Maximum length of amplicon. Determines the maximum length of the PCR fragment.
17.5.2 Standard PCR output table
If only a single region is selected, the following columns of information are available:
• Sequence. The primer's sequence.
• Score. Measures how much the properties of the primer (or primer pair) deviate from the optimal solution in terms of the chosen parameters and tolerances. The higher the score, the better the solution.
• Region. The interval of the template sequence covered by the primer.
• Self annealing. The maximum self annealing score of the primer, in units of hydrogen bonds.
• Self annealing alignment. A visualization of the highest-scoring self annealing alignment.
• Self end annealing. The maximum score of consecutive end base pairings allowed between the ends of two copies of the same molecule, in units of hydrogen bonds.
• GC content. The fraction of G and C nucleotides in the primer.
• Melting temperature of the primer-template complex.
• Secondary structure score. The score of the optimal secondary DNA structure found for the primer. Secondary structures are scored by adding the number of hydrogen bonds in the structure, and 2 extra hydrogen bonds are added for each stacking base pair in the structure.
• Secondary structure. A visualization of the optimal DNA structure found for the primer.
If both a forward and a reverse region are selected, a table of primer pairs is shown, where the
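Of the columns listed above, GC content is the simplest to reproduce by hand: it is the number of G and C nucleotides divided by the primer length. The Python sketch below shows this under the assumption that the primer contains only unambiguous DNA letters; the function name `gc_fraction` and the example primer are hypothetical.

```python
def gc_fraction(primer):
    """Fraction of G and C nucleotides in a primer (0.0 to 1.0)."""
    primer = primer.upper()
    if not primer:
        raise ValueError("primer must not be empty")
    gc_count = primer.count("G") + primer.count("C")
    return gc_count / len(primer)


# Hypothetical 8-mer with four G/C and four A/T nucleotides.
print(gc_fraction("GGCCAATT"))
```

The other scores in the table (self annealing, secondary structure) depend on alignment and folding steps that this sketch does not attempt.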
481. parameters you set. This chapter will describe how to use the History functionality of CLC Combined Workbench.
8.1 Element history
You can view the history of all elements in the Navigation Area except files that are opened in other programs (e.g. Word and pdf files). The history starts when the element appears for the first time in CLC Combined Workbench. To view the history of an element: select the element in the Navigation Area, then Show in the Toolbar, then History. Or, if the element is already open: click History at the bottom left part of the view. This opens a view that looks like the one in figure 8.1. When an element's history is opened, the newest change is listed at the top of the view.
CHAPTER 8 HISTORY LOG 125
Figure 8.1: An element's history.
The following information is available:
• Title. The action that the user performed.
• Date and time. Date and time for the operation. The date and time are displayed according to your locale settings (see section 5
482. peptide prediction output
After running the prediction as described above, the protein sequence will show the predicted signal peptide as annotations on the original sequence (see figure 16.2).
Figure 16.2: N-terminal signal peptide shown as annotation on the sequence.
Each annotation will carry a tooltip note saying that the corresponding annotation is predicted with SignalP version 3.0. Additional notes can be added through the Edit annotation right-click mouse menu (see section 10.3.2). Undesired annotations can be removed through the Delete Annotation right-click mouse menu (see section 10.3.4).
16.1.3 Bioinformatics explained: Prediction of signal peptides
Why the interest in signal peptides?
The importance of signal peptides was shown in 1999, when Günter Blobel received the Nobel Prize in physiology or medicine for his discovery that "proteins have intrinsic signals that govern their transport and localization in the cell" [Blobel, 2000]. He pointed out the importance of defined peptide motifs for targeting proteins to their site of function. Performing a query to PubMed reveals that thousands of papers have been published regarding signal peptides, secretion and subcellular localization, including knowledge of using signal peptides as vehicles for chimeric proteins for the biomedical and pharmaceutical industries. Many papers describe statistical or machine learni
483. peptides SignalP Transmembrane helix prediction TMHMM Secondary protein structure prediction PFAM domain search APPENDIX A COMPARISON OF WORKBENCHES 402 Sequence alignment Free Protein DNA RNA Combined Multiple sequence alignments Two algo E a a C E rithms Advanced re alignment and fix point align a wl E a ment options Advanced alignment editing options a a a E a Join multiple alignments into one a E E m Consensus sequence determination and a a a E E management Conservation score along sequences u E wl a a Sequence logo graphs along alignments a E E a Gap fraction graphs E m E m Copy annotations between sequences in a a E E alignments Pairwise comparison u a a E E RNA secondary structure Free Protein DNA RNA Combined Advanced prediction of RNA secondary struc E E ture Integrated use of base pairing constraints E C Graphical view and editing of secondary struc E a ture Info about energy contributions of structure E E elements Prediction of multiple sub optimal structures E E Evaluate structure hypothesis C E Structure scanning E E Partition function E r Dot plots Free Protein DNA RNA Combined Dot plot based analyses y a C E Phylogenetic trees Free Protein DNA RNA Combined Neighbor joining and UPGMA phylogenies a a a a a Pattern discovery Free Protein DNA RNA Combined Search for sequence match a 2 a E C Motif search for basic patterns C nm E C Motif search with re
484. ported/imported between the different platforms in the same easy way as when exporting/importing between two computers with e.g. Windows.

1.6 When the program is installed: Getting started

CLC Combined Workbench includes an extensive Help function, which can be found in the Help menu of the program's Menu bar. The Help can also be shown by pressing F1. The help topics are sorted in a table of contents, and the topics can be searched.

We also recommend our Online presentations, where a product specialist from CLC bio demonstrates features of the workbenches. This is a very easy way to get started using the program. Read more about online presentations here: http://clcbio.com/presentation

1.6.1 Quick start

When the program opens for the first time, the background of the workspace is visible. In the background are three quick start shortcuts, which will help you get started. These can be seen in figure 1.14.

Figure 1.14: The three Quick start shortcuts available in the background of the workspace.

The function of the three quick start shortcuts is explained here:

• Import data: Opens the Import dialog, which lets you browse for and import data from your file system.
• New sequence: Opens a dialog which allows you to enter your own sequence.
• Read tutorials: Opens the tutorials menu with a number of tutorials. These are also available from the Help menu in the
485. process can be stopped in the process area (see section 3.4.1).

Chapter 5 User preferences and settings

Contents
5.1 General preferences
5.2 Default View preferences
5.2.1 Import and export Side Panel settings
5.3 Advanced preferences
5.3.1 Default persistence location
5.3.2 URL to use for BLAST
5.4 Export/import of preferences
5.4.1 The different options for export and importing
5.5 View settings for the Side Panel
5.5.1 Floating Side Panel

The first three sections in this chapter deal with the general preferences that can be set for CLC Combined Workbench using the Preferences dialog. The next section explains how the settings in the Side Panel can be saved and applied to other views. Finally, you can learn how to import and export the preferences.

The Preferences dialog offers opportunities for changing the default settings for different features of the program. The Preferences dialog is opened in one of the following ways and can be seen in figure 5.1:

Edit | Preferences

or Ctrl + K (⌘ + K on Mac)

5.1 General preferences

The General preferences include:

• Undo Limit: As default the undo li
486. protein sequences from the protein databank (http://www.rcsb.org/pdb/), a hidden Markov model (HMM) was trained and evaluated for performance. Machine learning methods have proven superior when it comes to prediction of secondary structure of proteins [Rost, 2001]. By far the most common structures are alpha helices and beta sheets, which can be predicted. Predicted structures are automatically added to the query as annotations, which can later be edited.

In order to predict the secondary structure of proteins:

Select a protein sequence | Toolbox in the Menu Bar | Protein Analyses | Predict secondary structure

or right-click a protein sequence | Toolbox | Protein Analyses | Predict secondary structure

This opens the dialog displayed in figure 16.18.

If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.

You can perform the analysis on several protein sequences at a time. This will add annotations to all the sequences and open a view for each sequence.

Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish.
487. quences

To open the two sequences (the pBR322 vector and the PCR fragment) in the cloning editor (additional sequences can be added later):

select pBR322 and PCR fragment with SphI sites | Toolbox | Cloning and Restriction Sites | Cloning | OK

This will show an information dialog which tells you that the two sequences have been converted to a sequence list, which can be saved in the Navigation Area. Click OK and the cloning editor will open as shown in figure 2.50.

Figure 2.50: Two sequences in the cloning editor: the vector and the PCR fragment. The small colored triangles represent restriction sites.

Restriction sites for 10 different enzymes are shown per default, but for now we a
488. quences downloaded from GenBank, for example, have this information. The Labels preferences allow these different node annotations as well as different annotations on the branches. The branch annotation includes the bootstrap value, if this was selected when the tree was calculated. It is also possible to annotate the branches with their lengths.

2.7 Tutorial: Find restriction sites

This tutorial will show you how to find restriction sites and annotate them on a sequence. There are two ways of finding and showing restriction sites. In many cases, the dynamic restriction sites found in the Side Panel of sequence views will be useful, since this is a quick and easy way of showing restriction sites. In the Toolbox you will find the other way of doing restriction site analyses. This way provides more control of the analysis and gives you more output options, e.g. a table of restriction sites and a list of restriction enzymes that can be saved for later use. In this tutorial, the first section describes how to use the Side Panel to show restriction sites, whereas the second section describes the restriction map analysis performed from the Toolbox.

2.7.1 The Side Panel way of finding restriction sites

When you open a sequence, there is a Restriction sites setting in the Side Panel. By default, 20 of the most popular restriction enzymes are shown (see figure 2.18).
489. quencing primer can start. These are defined by making a selection on the sequence and right-clicking the selection. If areas are known where primers must not bind (e.g. repeat-rich areas), one or more No primers here regions can be defined. No requirements are imposed on the relative position of the regions defined.

After exploring the available primers (see section 17.3) and setting the desired parameter values in the Primer Parameters preference group, the Calculate button will activate the primer design algorithm. After pressing the Calculate button, a dialog will appear (see figure 17.11). Since design of sequencing primers does not require the consideration of interactions between primer pairs, this dialog is identical to the dialog shown in Standard PCR mode when only a single primer region is chosen. See section 17.5 for a description.

17.8.1 Sequencing primers output table

In this mode, primers are predicted independently for each region, but the optimal solutions are all presented in one table. The solutions are numbered consecutively according to their position on the sequence, such that the forward primer region closest to the 5' end of the molecule is designated F1, the next one F2, etc. For each solution, the single primer information described under Standard PCR is available in the table.

17.9 Alignment-based primer and probe design

CLC Combined Workbench allows the user to design PCR primers and TaqMan probes based on a
490. r 11 Online database search

Contents
11.1 GenBank search
11.1.1 GenBank search options
11.1.2 Handling of GenBank search results
11.1.3 Save GenBank search parameters
11.2 UniProt (Swiss-Prot/TrEMBL) search
11.2.1 UniProt search options
11.2.2 Handling of UniProt search results
11.2.3 Save UniProt search parameters
11.3 Search for structures at NCBI
11.3.1 Structure search options
11.3.2 Handling of NCBI structure search results
11.3.3 Save structure search parameters
11.4 Sequence web info
11.4.1 Google sequence
11.4.2 NCBI
11.4.3 PubMed References
11.4.4 UniProt
11.4.5 Additional annotation information

CLC Combined Workbench offers different ways of searching data on the Internet. You must be online when initiating and performing the following searches.

11.1 GenBank search

This section describes searches for sequences in GenBank, the NCBI Entrez
491. r Secondary Structure, choose the Proportional layout strategy. You will now see that the appearance of the structure changes. Next, zoom in on the structure to see the residues. This is easiest if you first close the table view at the bottom:

Zoom in | Click the structure until you see the residues

If you wish to make some manual corrections of the layout of the structure, first select the Pan mode in the Tool bar. Now place the mouse cursor on the opening of a stem, and a visual indication of the anchor point for turning the substructure will be shown (see figure 22.14).

Figure 2.59: The blue circle represents the anchor point for rotating the substructure.

Click and drag to rotate the part of the structure represented by the line going from the anchor point. In order to keep the bases in a relatively sequential arrangement, there is a restriction on how much the substructure can be rotated. The highlighted part of the circle represents the angle where rotating is allowed. In figure 22.15, the structure shown in figure 22.14 has been modified by dragging with the mouse.

Figure 2.60: The structure has now been rotated.

The view can of course be printed or exported as graphics.

Part II Core Functionalities

Chapter 3 User interface

Contents
3.1 Navigation Area
3.1.1 Data structure
3.1.2 Create new folders
492. r of base pairs required for a match: How many nucleotides of the primer must base pair to the sequence in order to cause mispriming.
• Number of consecutive base pairs required in 3' end: How many consecutive 3' end base pairs in the primer MUST be present for mispriming to occur. This option is included since 3' terminal base pairs are known to be essential for priming to occur.

Note: Including a search for potential mispriming sites will prolong the search time substantially if long sequences are used as template and if the minimum number of base pairs required for a match is low. If the region to be amplified is part of a very long molecule and mispriming is a concern, consider extracting part of the sequence prior to designing primers.

When both forward and reverse regions are defined

If both a forward and a reverse region are defined, primer pairs will be suggested by the algorithm. After pressing the Calculate button, a dialog will appear (see figure 17.8).

The dialog lists the chosen calculation parameters: Maximum primer length, Minimum primer length, Maximum G/C content, Minimum G/C content, Maximum melting temperature, Minimum melting temperature, Maximum self annealing, Maximum self end annealing, Maximum secondary structure, 3' end must meet G/C requirements, 5' end must meet G/C requirements. It also lists the primer combination parameters: Max percentage point difference in G/C content, Max difference in melting temperatures within a prim
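The two mispriming parameters described above can be illustrated with a minimal Python sketch. This is not the workbench's algorithm: the function name `mispriming_sites` and its arguments are hypothetical, and the sketch assumes an ungapped comparison of the primer against one strand of the template, counting identical positions as a stand-in for base pairs with the complementary strand.

```python
def mispriming_sites(primer, template, min_matches, min_3prime):
    """Return 0-based template positions where the primer could misprime.

    A position qualifies when at least `min_matches` primer positions match
    the template window AND the last `min_3prime` positions (the 3' end of
    the primer) all match. Both thresholds are illustrative assumptions.
    """
    primer, template = primer.upper(), template.upper()
    n = len(primer)
    sites = []
    for pos in range(len(template) - n + 1):
        window = template[pos:pos + n]
        hits = [p == t for p, t in zip(primer, window)]
        # hits[n - min_3prime:] is empty when min_3prime == 0, so all() is True
        if sum(hits) >= min_matches and all(hits[n - min_3prime:]):
            sites.append(pos)
    return sites
```

Lowering `min_matches` or `min_3prime` makes more windows qualify, which mirrors the note above that a low match requirement on a long template increases the amount of work.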
493. r press Ctrl + S (⌘ + S on Mac) to activate a save dialog.

15.4.1 Translate part of a nucleotide sequence

If you want to make separate translations of all the coding regions of a nucleotide sequence, you can check the option Translate CDS and ORF in the translation dialog (see figure 15.5).

If you want to translate a specific coding region which is annotated on the sequence, use the following procedure:

Open the nucleotide sequence | right-click the ORF or CDS annotation | Translate CDS/ORF | choose a translation table | OK

If the annotation contains information about the translation, this information will be used, and you do not have to specify a translation table. The CDS and ORF annotations are colored yellow as default.

15.5 Find open reading frames

CLC Combined Workbench has a basic functionality for gene finding in the form of open reading frame (ORF) determination. The ORFs will be shown as annotations on the sequence. You have the option of choosing translation table, start codons, minimum length and other parameters for finding the ORFs. These parameters will be explained in this section.

To find open reading frames:

select a nucleotide sequence | Toolbox in the Menu Bar | Nucleotide Analyses | Find Open Reading Frames

or right-click a nucleotide sequence | Toolbox | Nucleotide Analyses | Find Open Reading Frames

This opens the dialog displaye
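To make the idea of ORF determination concrete, the following Python sketch scans the three forward reading frames for stretches that begin with a start codon and end at the first in-frame stop codon. It is only an illustration under stated assumptions: the function `find_orfs`, its parameters, the fixed standard-code stop codons and the restriction to the forward strand are all choices made for this example, not a description of how the workbench implements the analysis.

```python
# Stop codons of the standard genetic code (other translation tables differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, start_codons=("ATG",), min_codons=3):
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is 0-based, end is exclusive, and the stop codon is included in
    the span. min_codons counts the start and stop codons as well.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in start_codons:
                orf_start = i  # first start codon opens the ORF
            elif orf_start is not None and codon in STOP_CODONS:
                if (i + 3 - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None  # look for the next ORF in this frame
    return orfs
```

For example, `find_orfs("CCATGAAATTTTAGGG")` reports one ORF of four codons in the third reading frame; raising `min_codons` to 5 filters it out.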
494. r to force the alignment algorithm to align the fixpoints in the selected sequences to each other. In figure 20.7 the result of an alignment using fixpoints is illustrated. You can add multiple fixpoints, e.g. adding two fixpoints to the sequences that are aligned will force their first fixpoints to be aligned to each other, and their second fixpoints will also be aligned to each other.

Advanced use of fixpoints

Fixpoints with the same names will be aligned to each other, which gives the opportunity for great control over the alignment process. It is only necessary to change any fixpoint names in very special cases.

Figure 20.6: Adding a fixpoint to a sequence in an existing alignment. At the top you can see a fixpoint that has already been added.
495. r view (see figure 17.1), the basic options for viewing the template sequence are the same as for the standard sequence viewer. See section 10.1 for an explanation of these options. Note: This means that annotations such as known SNPs or exons can be displayed on the template sequence to guide the choice of primer regions. Also, traces in sequencing reads can be shown along with the structure to guide e.g. the re-sequencing of poorly resolved regions.

Figure 17.1: The initial view of the sequence used for primer design.

17.1.1 General concept

The concept of the primer view is that the user first chooses the desired reaction type for the session in the Primer Parameters preference group, e.g. Standard PCR. Reflecting the choice of reaction type, it is now possible to select one or more regions on the sequence and to use the right-click mouse menu to designate these as primer or probe regions (see figure 17.2).

Fo
496. ragment, starting with the shortest fragments. Here, the lowest free energy for longer fragments can be expediently calculated from the free energies of the smaller sub-sequences they contain. When this process reaches the longest fragment, i.e. the complete sequence, the MFE of the entire molecule is known. The second step is called traceback, and uses all the free energies computed in the first step to determine S_min, the exact structure associated with the MFE. Acceptable calculation speed is achieved by using dynamic programming, where sub-sequence results are saved to avoid recalculation. However, this comes at the price of a higher requirement for computer memory. The structure element energies that are used in the recursions of these two steps are derived from empirical calorimetric experiments performed on small molecules, see e.g. [Mathews et al., 1999].

Suboptimal structures determination

A number of known factors violate the assumptions that are implicit in MFE structure prediction. [Schroeder et al., 1999] and [Chen et al., 2004] have shown experimental indications that the thermodynamic parameters are sequence dependent. Moreover, [Longfellow et al., 1990] and [Kierzek et al., 1999] have demonstrated that some structural elements show non-nearest-neighbor effects. Finally, single-stranded nucleotides in multi-loops are known to influence stability [Mathews and Turner, 2002]. These phenomena can be expected to limit the accuracy of RNA s
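The two-step scheme described above (fill a table from the shortest fragments upwards, then trace back) can be sketched with a deliberately simplified stand-in. The Python example below implements the classic Nussinov algorithm, which maximizes the number of base pairs rather than minimizing an empirically derived free energy, so it is not the MFE method itself. The function name `fold`, the `min_loop` parameter and the allowed pair set are assumptions made for this illustration.

```python
# Watson-Crick pairs plus G-U wobble pairs.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def fold(seq, min_loop=3):
    """Return (number_of_pairs, sorted list of (i, j) pairs), 0-based."""
    n = len(seq)
    # best[i][j]: maximum number of pairs in seq[i..j]; stored so that
    # longer fragments can reuse the results of the fragments they contain.
    best = [[0] * n for _ in range(n)]

    # Step 1 (fill): shortest fragments first.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):  # case: k pairs with j
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    # Step 2 (traceback): recover one structure achieving the optimum.
    pairs = []
    stack = [(0, n - 1)] if n else []
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # fragment too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if left + 1 + best[k + 1][j - 1] == best[i][j]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return (best[0][n - 1] if n else 0), sorted(pairs)
```

The table `best` is the memory cost mentioned above: it grows quadratically with sequence length, in exchange for never recomputing a fragment.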
497. re displayed:

right-click any element or folder in the Navigation Area | Sequence Representation | select format

This will only affect sequence elements; the display of other types of elements, e.g. alignments, trees and external files, will not be changed. If a sequence does not have this information, there will be no text next to the sequence icon.

Rename element

Renaming a folder or an element can be done in three different ways:

select the element | Edit in the Menu Bar | Rename

or select the element | F2

or click the element once | wait one second | click the element again

When you can rename the element, you can see that the text is selected and you can move the cursor back and forth in the text. When the editing of the name has finished, press Enter or select another element in the Navigation Area. If you want to discard the changes instead, press the Esc key.

3.1.7 Delete elements

Deleting a folder or an element can be done in two ways:

right-click the element | Delete

or select the element | press Delete key

This will cause the element to be moved to the Recycle Bin, where it is kept until the recycle bin is emptied. This means that you can recover deleted elements later on.

Restore Deleted Elements

The elements in the Recycle Bin can be restored by dragging the elements with the mouse into the folder where they used to be. If you have deleted large amounts of data ta
498. re only interested in the SphI sites, which are not shown. To hide the other restriction sites and add the SphI enzyme:

Restriction sites in the Side Panel | Deselect all | Edit enzymes | find and double-click SphI in the upper list | Finish

Notice that there are two SphI sites at the ends of the PCR fragment and that there is one SphI site in the middle of the tetracycline resistance gene of pBR322. We are going to insert the fragment at this SphI site. There are two steps in this. First, we have to cut the fragment with the SphI enzyme. This will produce a new fragment with sticky ends, since the SphI enzyme creates a 3' overhang. Second, this fragment has to be inserted at the vector's SphI site.

2.13.2 Cutting the PCR fragment with the SphI enzyme

To cut the PCR fragment with the SphI enzyme:

right-click one of the SphI restriction sites | Cut Sequence at All SphI Sites

This will cut the sequence at the two SphI sites and generate three new fragments, as seen in figure 2.51.

Figure 2.51: The PCR fragment cut with the SphI enzyme. The single-stranded regions are illustrated with blue annotations labeled SS.

We do not need the two small leftovers, so these can just be deleted:

right clic
499. re spreading.
  - Most offset: The annotations are placed above each other with a little space between. This can take up a lot of space on the screen.
• Label: The name of the annotation can be shown as a label. Additional information about the sequence is shown if you place the mouse cursor on the annotation and keep it still.
  - No labels: No labels are displayed.
  - On annotation: The labels are displayed in the annotation's box.
  - Over annotation: The labels are displayed above the annotations.
  - Before annotation: The labels are placed just to the left of the annotation.
  - Flag: The labels are displayed as flags at the beginning of the annotation.
  - Stacked: The labels are offset so that the text of all labels is visible. This means that there is varying distance between each sequence line to make room for the labels.
• Show arrows: Displays the end of the annotation as an arrow. This can be useful to see the orientation of the annotation (for DNA sequences). Annotations on the negative strand will have an arrow pointing to the left.
• Use gradients: Fills the boxes with gradient color.

In the Annotation Types group, you can choose which kinds of annotations should be displayed. This group lists all the types of annotations that are attached to the sequence(s) in the view. For sequences with many annotations, it can be easier to get an overview if you deselect the annotati
500. rect or inverted repeats or regions with low sequence complexity. Various smoothing algorithms can be applied to the dot plot calculation to avoid a noisy background in the plot. Moreover, various substitution matrices can be applied in order to take the evolutionary distance of the two sequences into account.

To create a dot plot:

Toolbox | General Sequence Analyses | Create Dot Plot

or select one or two sequences in the Navigation Area | Toolbox in the Menu Bar | General Sequence Analyses | Create Dot Plot

or select one or two sequences in the Navigation Area | right-click in the Navigation Area | Toolbox | General Sequence Analyses | Create Dot Plot

This opens the dialog shown in figure 14.3.

Figure 14.3: Selecting sequences for the dot plot.

If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove elements from the selected elements. Click Next to adjust dot plot parameters. Clicking Next
501. rmation about an enzyme, like recognition sequence or a list of commercial vendors. Clicking Next will show the dialog in figure 19.20.

Figure 19.20: Deciding number of cut sites inside and outside the selection.

At the top of the dialog you see the selected region, and below are two panels:

• Inside selection: Specify how many times you wish the enzyme to cut inside the selection. In the example described above, One cut site (1) should be selected to only show enzymes cutting once in the selection.
• Outside selection: Specify how many times you wish the enzyme to cut outside the selection (i.e. the rest of the sequence). In the example above, No cut sites (0) should be selected.

These panels offer a lot of flexibility for combining number of cut sites inside and outside the selection, respectively. To give a hint of how many enzymes will be added based on the combination of cut sites, the preview panel a
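The inside/outside filtering described in the two panels above can be sketched in a few lines of Python. This is an illustration only: the helper names `count_sites` and `filter_enzymes` are hypothetical, the enzyme table holds four well-known palindromic recognition sequences chosen for the example, and the rule that a site counts as "inside" only when it lies entirely within the selection is an assumption rather than documented workbench behavior.

```python
import re

# Recognition sequences (5'->3'); all four are palindromic, so scanning one
# strand is sufficient for this example.
ENZYMES = {
    "HindIII": "AAGCTT",
    "KpnI": "GGTACC",
    "NdeI": "CATATG",
    "EcoRI": "GAATTC",
}


def count_sites(seq, site, sel_start, sel_end):
    """Count recognition sites inside and outside seq[sel_start:sel_end]."""
    inside = outside = 0
    # A lookahead pattern also finds overlapping occurrences.
    for match in re.finditer(f"(?={re.escape(site)})", seq.upper()):
        start = match.start()
        if sel_start <= start and start + len(site) <= sel_end:
            inside += 1
        else:
            outside += 1
    return inside, outside


def filter_enzymes(seq, enzymes, sel_start, sel_end, inside_ok, outside_ok):
    """Return names of enzymes whose cut counts fall in the allowed sets."""
    keep = []
    for name, site in enzymes.items():
        inside, outside = count_sites(seq, site, sel_start, sel_end)
        if inside in inside_ok and outside in outside_ok:
            keep.append(name)
    return sorted(keep)
```

Passing `inside_ok={1}` and `outside_ok={0}` corresponds to ticking One cut site (1) in the first panel and No cut sites (0) in the second; passing larger sets corresponds to ticking several boxes.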
502. rotein Analyses | Pfam Domain Search

or right-click a protein sequence | Toolbox | Protein Analyses | Pfam Domain Search

If a sequence was selected before choosing the Toolbox action, this sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.

You can perform the analysis on several protein sequences at a time. This will add annotations to all the sequences and open a view for each sequence.

Click Next to adjust parameters (see figure 16.16).

Figure 16.16: Setting parameters for Pfam domain search.

16.6.1 Pfam search parameters

• Choose database and search type: When searching for Pfam domains, it is possible to choose different databases and specify the search for full domains or fragments of domains. Only the 100 most frequent domains are included as default in CLC Combined Workbench. Additional databases can be downloaded directly from CLC bio's web site at http://www.clcbio.com/resources
503. rovides an intelligible overview of all the annotations on the sequence.
• You can use the filter at the top to search the annotations. Type e.g. "UCP" into the filter and you will find all annotations which have "UCP" in either the name, the type, the region or the qualifiers. Combined with showing or hiding the annotation types in the Side Panel, this makes it easy to find annotations or a subset of annotations.
• You can copy and paste annotations, e.g. from one sequence to another.
• If you wish to edit many annotations consecutively, the double-click editing makes this very fast (see section 10.3.2).

10.3.2 Adding annotations

Adding annotations to a sequence can be done in two ways:

open the sequence in a sequence view (double-click in the Navigation Area) | make a selection covering the part of the sequence you want to annotate | right-click the selection | Add Annotation

or select the sequence in the Navigation Area | Show | Annotations | Add Annotation

This will display a dialog like the one in figure 10.19.

Figure 10.19: The Add Annotation dialog
504. rozygotes discover via secondary peaks 317 Hide show Toolbox 89 History 124 export 117 preserve when exporting 125 source elements 125 Homology pairwise comparison of sequences in alignments 362 Hydrophobicity 254 401 Bioinformatics explained 257 Chain Flexibility 259 Cornette 135 258 Eisenberg 135 258 Emini 135 Engelman GES 135 258 Hopp Woods 135 258 Janin 135 258 Karplus and Schulz 135 Kolaskar Tongaonkar 135 258 Kyte Doolittle 135 258 Rose 258 Surface Probability 259 Welling 135 258 Import bioinformatic data 113 115 existing data 34 FASTA data 34 from a web page 115 list of formats 408 preferences 102 raw sequence 115 Side Panel Settings 101 using copy paste 115 Vector NTI data 115 Index for searching 98 Infer Phylogenetic Tree 366 Information point primer design 277 Insert gaps 357 Insert restriction site 327 Installation 13 Isoelectric point 223 Isoschizomers 143 335 IUPAC codes nucleotides 412 Join alignments 359 sequences 226 jpg format export 120 Keywords 154 Label of sequence 132 Landscape Print orientation 109 Lasergene sequence protein file format 35 113 409 sequence file format 35 113 409 Length 154 License 16 License server 21 License server access offline 22 Links from annotations 153 Linux installation 15 installation with RPM package 16 List of restriction enzymes 344 List of sequences 157 Load enzyme li
505. rresponding to the colors of the triangles on the sequence. By selecting or deselecting the enzymes in the list, you can specify which enzymes' restriction sites should be displayed. The color of the restriction enzyme can be changed by clicking the colored box next to the enzyme's name. The name of the enzyme can also be shown next to the restriction site by selecting Show name flags above the list of restriction enzymes.

Sort enzymes

Just above the list of enzymes there are three buttons to be used for sorting the list (see figure 19.13):

• Sort enzymes alphabetically: Clicking this button will sort the list of enzymes alphabetically.
• Sort enzymes by number of restriction sites: This will divide the enzymes into four groups:
  - Non-cutters
  - Single cutters
  - Double cutters
  - Multiple cutters
  There is a checkbox for each group which can be used to hide/show all the enzymes in a group.
506. rs in the Database parameter drop-down menu. The search will include all relevant databases at NCBI. The nr database is the most complete, but also the most redundant, database that can be searched. Searches can be limited to less complete databases. As an example, when choosing pdb, only sequences with a known structure are searched. If sequences homologous to the query sequence are found, these can be downloaded and opened with the 3D viewer of CLC Protein Workbench or CLC Combined Workbench.

When choosing BLASTx or tBLASTx to conduct a search, you get the option of selecting a translation table for the genetic code. The standard genetic code is set as default. This is particularly useful when working with organisms or organelles which have a genetic code that differs from the standard genetic code.

In Step 3 you can limit the BLAST search by adjusting the parameters seen in figure 12.4.

Figure 12.4: Examples of different limitations which can be set before submitting a BLAST search.

The fol
507. rt Sequence List by Length, Web Info
Figure 19.4: Right-click on the sequence in the cloning view.
- Open sequence in circular view. Opens the sequence in a new circular view. If the sequence is not circular, you will be asked if you wish to make it circular or not. (This will not forge ends with matching overhangs together; use Make Sequence Circular instead.)
- Duplicate sequence. Adds a duplicate of the selected sequence. The new sequence will be added to the list of sequences shown on the screen.
- Insert sequence after this sequence. Insert another sequence after this sequence. The sequence to be inserted can be selected from a list which contains the sequences present in the cloning editor. The inserted sequence remains on the list of sequences. If the two sequences do not have blunt ends, the ends' overhangs have to match each other. Otherwise a warning is displayed.
CHAPTER 19 CLONING AND CUTTING 323
- Insert sequence before this sequence. Insert another sequence before this sequence. The sequence to be inserted can be selected from a list which contains the sequences present in the cloning editor. The inserted sequence remains on the list of sequences. If the two sequences do not have blunt ends, the ends' overhangs have to match each other. Otherwise a warning is displayed.
- Reverse complement sequence. Creates the reverse complement of a sequence and replaces the original sequence in the list. Th
508. rted, allowing for a white background if you choose a color gradient which includes white. See figure 14.5.
14.2.3 Bioinformatics explained: Dot plots
Realization of dot plots
Dot plots are two-dimensional plots where the x-axis and y-axis each represents a sequence, and the plot itself shows a comparison of these two sequences by a calculated score for each position of the sequence.
CHAPTER 14 GENERAL SEQUENCE ANALYSES 211
Figure 14.6: Dot plot with inverted colors, practical for printing.
If a window of fixed size on one sequence (one axis) matches the other sequence, a dot is drawn on the plot. Dot plots are one of the oldest methods for comparing two sequences (Maizel and Lenk, 1981). The scores that are drawn on the plot are affected by several issues:
- Scoring matrix for distance correction. Scoring matrices (BLOSUM and PAM) contain substitution scores for every combination of two amino acids. Thus, these matrices can only be used for dot plots of protein sequences.
- Window size. The single-residue comparison (bit-by-bit comparison, window size = 1) in dot plots will undoubtedly result in a noisy background of the plot. You can imagine that there are many successes in the comparison if you only have four possible residues, like in nucleotide sequences. Therefore you can set a window size which is smoothing the dot p
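To make the effect of the window size concrete, here is a minimal Python sketch of a dot plot. It is not the Workbench's implementation: it scores a window by counting identical residues instead of using a BLOSUM or PAM matrix, and the names dot_plot, window and threshold are assumptions made for this example.

```python
def dot_plot(seq_x, seq_y, window=1, threshold=None):
    """Return the (x, y) start positions where two windows match.

    A dot is placed when at least `threshold` of the `window` residues
    are identical. By default every residue in the window must match.
    """
    if threshold is None:
        threshold = window
    dots = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            # Compare the two windows position by position.
            matches = sum(
                a == b
                for a, b in zip(seq_x[i:i + window], seq_y[j:j + window])
            )
            if matches >= threshold:
                dots.append((i, j))
    return dots


# With window size 1 a low-complexity nucleotide sequence is very noisy;
# a larger window keeps only the longer matching stretches.
noisy = dot_plot("ACGTAC", "ACGTAC", window=1)
smooth = dot_plot("ACGTAC", "ACGTAC", window=3)
print(len(noisy), len(smooth))
```

Lowering threshold below window allows a few mismatches inside each window, which is one simple way to trade noise against sensitivity.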
509. ructure for primers, 280
Select: exact positions, 136; in sequence, 143; parts of a sequence, 143; workspace, 90
Select annotation, 144
Selection mode in the toolbar, 88
Selection, adjust, 143
Selection, expand, 143
Selection, location on sequence, 88
Self annealing, 279
Self end annealing, 280
Separate sequences on gel, 342; using restriction enzymes, 342
Sequence: alignment, 347; analysis, 206; display different information, 78; extract from sequence list, 159; find, 136; information, 154; join, 226; layout, 132; lists, 157; logo, 401; logo, Bioinformatics explained, 355; new, 156; region types, 145; search, 136; select, 143; shuffle, 206; statistics, 220; view, 131; view as text, 155; view circular, 145; view format, 78; web info, 170
Sequence details, 321
Sequence logo, 354
Sequencing data, 403
Sequencing primers, 402
Share data, 75, 400
Share Side Panel Settings, 101
Shared BLAST database, 184
Shortcuts, 90
Show enzymes cutting selection, 140, 332
Show dialogs, 100
Show enzymes with compatible ends, 143, 335
Show/hide Toolbox, 89
Shuffle sequence, 206, 400
Side Panel, tutorial, 37
Side Panel Settings: export, 101; import, 101; share with others, 101
Side Panel, location of, 100
Signal peptide, 242, 243, 401
SignalP, 242; Bioinformatics explained, 243
Similarity, pairwise comparison of sequences in alignments, 362
Single base editing: in contigs, 313; in sequences, 145
INDEX 429
Single cutters, 137, 330
Single stranded view
510. rward primer region here, Reverse primer region here, Region to amplify, No primers here, Copy Selection, Open Selection in New View, Edit Selection, Delete Selection, Add Annotation, Add Enzymes Cutting Selection to Panel, Insert Restriction Site After Selection, Insert Restriction Site Before Selection, Base Pair Constraint, Set Numbers Relative to This Selection, Blast Selection Against NCBI, Blast Selection Against Local Database
Figure 17.2: Right-click menu allowing you to specify regions for the primer design.
When a region is chosen, graphical information about the properties of all possible primers in this region will appear in lines beneath it. By default, information is shown using a compact mode, but the user can change to a more detailed mode in the Primer information preference group. The number of information lines reflects the chosen length interval for primers and probes. In the compact information mode, one line is shown for every possible primer length, and each of these lines contains information regarding all possible primers of the given length. At each potential primer starting position, a circular information point is shown which indicates whether the primer fulfills the requirements set in the Primer parameters preference group. A green circle indicates a primer which fulfills all criteria, and a red circle indicates a primer which fails to meet one or
511. ry will also be written in the history when you click Finish. When you click Finish and the sequence is inserted, it will be marked with a selection.
Figure 19.9: One sequence is now inserted into the cloning vector. The sequence inserted is automatically selected.
19.1.6 Insert restriction site
If you make a selection on the sequence and right-click, you find this option for inserting the recognition sequence of a restriction enzyme before or after the region you selected. This will display a dialog as shown in figure 19.10. At the top, you can select an existing enzyme list, or you can use the full list of enzymes (default). Select an enzyme, and you will see its recognition sequence in the text field below the list (GTCTAC). If you wish to insert additional residues such as tags etc., these can be typed into the text fields adjacent to the recognition sequence. Clicking OK will insert the sequence before or after the selection. If the enzyme selected was not already present in the list in the Side Panel, it will now be added and selected. Furthermore, a restriction site annotation is added.
CHAPTER 19 CLONING AND CUTTING 328
Insert restriction site after selection. 1. Please choose enzymes. Enzyme list: Use existing enzyme list (All enzymes). Filter. Name, Overhang, Me
512. s | Toolbox in the Menu Bar | General Sequence Analyses | Pattern Discovery
or right-click DNA or protein sequence(s) | Toolbox | General Sequence Analyses | Pattern Discovery
If a sequence was selected before choosing the Toolbox action, the sequence is now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements.
CHAPTER 14 GENERAL SEQUENCE ANALYSES 231
You can perform the analysis on several DNA or several protein sequences at a time. If the analysis is performed on several sequences at a time, the method will search for patterns which are common between all the sequences. Annotations will be added to all the sequences, and a view is opened for each sequence. Click Next to adjust parameters (see figure 14.22).
[Dialog: Pattern Discovery, step 2 (Set parameters). Define model: Create and search with new model, or Use existing model. Set motif parameters: Pattern length (Min 4, Max 9), Noise, Number of patterns to predict (1), Include background distribution.]
Figure 14.22: Setting parameters for the pattern discovery. See text for details.
In order to search unknown sequences with an already existing model: Select to use an already existing model, which is seen in figure 14.22. Models are represented with the following icon in th
513. Table of remaining fragments based on parameter settings

Start  End  Length  Mass      pI     C-end  Name     Fragment    N-end  Name
32     41   10      1,288.54  9.72   R      START    LLIVYPWTQR  F      Trypsin
46     60   15      1,562.75  10.62  K      Trypsin  FGNLSSAQ    I      Trypsin
68     77   10      1,000.23  10.06  K      Trypsin  VLTSLGLAVK  N      Trypsin
84     96   13      1,529.67  5.49   K      Trypsin  ETFAHLSEL   L      Trypsin
122    133  12      1,379.47  4.6    K      Trypsin  EFTAEAQA    L      Trypsin
134    145  12      1,194.42  10.07  K      Trypsin  ILWVGVATAL  Y      Trypsin

Figure 2.33: The output of the proteolytic cleavage shows the cleavage sites as annotations in the protein sequence. The accompanying table lists all the fragments which are between 10 and 15 amino acids long.
Note! The output of proteolytic cleavage is two related views. The sequence view displays annotations where the sequence is cleaved. The table view shows information about the fragments satisfying the parameters set in the dialog. Consequently, if you have restricted the fragment parameters, you might have more annotations on the sequence than fragments in the table. If you conduct another proteolytic cleavage on the same sequence, the output consists of (possibly) new annotations on the original sequence and an additional table view listing all fragments.
2.11 Tutorial: Primer design
In this tutorial you will see how to use CLC Combined Workbench for finding primers for PCR amplification of a specific region. We use the
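The relationship between cleavage sites and the filtered fragment table above can be illustrated with a short Python sketch. It is not the Workbench's algorithm: the cleavage rule used here (cut after K or R unless the next residue is P) is a common textbook simplification for trypsin and is an assumption, as are the function name trypsin_fragments and the test sequence.

```python
def trypsin_fragments(protein, min_len=10, max_len=15):
    """Cleave after K or R (not before P) and keep fragments by length.

    Returns (start, end, fragment) tuples with 1-based positions.
    """
    fragments = []
    start = 0
    for i, residue in enumerate(protein):
        is_last = i == len(protein) - 1
        # Simplified trypsin rule: an assumption, not taken from the manual.
        cleave = residue in "KR" and not is_last and protein[i + 1] != "P"
        if cleave or is_last:
            fragment = protein[start:i + 1]
            if min_len <= len(fragment) <= max_len:
                fragments.append((start + 1, i + 1, fragment))
            start = i + 1
    return fragments


# Made-up sequence containing the first fragment shown in the table.
print(trypsin_fragments("AAAKLLIVYPWTQRFG"))
```

Every cleavage site would still be annotated on the sequence, but only fragments within the length limits reach the table, which is why the sequence can carry more annotations than the table has rows.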
514. s RNA sequences as well, they will not be converted.
CHAPTER 15 NUCLEOTIDE ANALYSES 234
[Dialog: Convert DNA to RNA, step 1 (Select DNA sequences). The Projects tree shows CLC_Data, Example data, Nucleotide and Sequences; PERH3BC is listed under Selected Elements.]
Figure 15.1: Translating DNA to RNA.
15.2 Convert RNA to DNA
CLC Combined Workbench lets you convert an RNA sequence into DNA, replacing the U residues (Uracil) with T residues (Thymine):
select an RNA sequence in the Navigation Area | Toolbox in the Menu Bar | Nucleotide Analyses | Convert RNA to DNA
or right-click a sequence in Navigation Area | Toolbox | Nucleotide Analyses | Convert RNA to DNA
This opens the dialog displayed in figure 15.2.
[Dialog: Convert RNA to DNA, step 1 (Select RNA sequences). The Projects tree shows CLC_Data, Example data, Nucleotide and Sequences; NM_000044 is listed under Selected Elements.]
Figure 15.2: Translating RNA to DNA.
If a sequence was selected before choosing the Toolb
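The conversion itself is a plain character substitution. A minimal Python sketch, with the hypothetical function names rna_to_dna and dna_to_rna and no claim about how the Workbench implements it:

```python
# Translation tables that keep upper and lower case intact.
_RNA_TO_DNA = str.maketrans("Uu", "Tt")
_DNA_TO_RNA = str.maketrans("Tt", "Uu")


def rna_to_dna(rna):
    """Replace every U (uracil) with T (thymine)."""
    return rna.translate(_RNA_TO_DNA)


def dna_to_rna(dna):
    """Replace every T (thymine) with U (uracil)."""
    return dna.translate(_DNA_TO_RNA)


print(rna_to_dna("AUGGCUUAA"))
```

All other residues, including ambiguity codes, are passed through unchanged in this sketch.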
515. s an overview
CHAPTER 5 USER PREFERENCES AND SETTINGS 102
[Dialog: Merge or Overwrite. How do you want to import? Merge into existing styles, or Overwrite existing styles.]
Figure 5.2: When you import settings, you are asked if you wish to overwrite existing settings or if you wish to merge the new settings into the old ones.
- Import and export of bioinformatics data such as sequences, alignments etc. (described in section 7.1.1).
- Graphics export of the views, which creates image files in various formats (described in section 7.3).
- Import and export of Side Panel Settings as described above.
- Import and export of all the Preferences except the Side Panel settings. This is described in the previous section.
5.3 Advanced preferences
The Advanced settings include the possibility to set up a proxy server. This is described in section 1.8.
5.3.1 Default persistence location
If you have more than one location in the Navigation Area, you can choose which location should be the default location. The default location is used when you, for example, import a file without selecting a folder or element in the Navigation Area first. The imported element will then be placed in the default location.
Note! The default location cannot be removed. You have to select another location as default first.
5.3.2 URL to use for BLAST
It is possible to specify an alternate server URL to use for BLAST searches. The standard URL for the BLAST
516. s can be investigated by tree building methods based on the alignment.
- Annotation of functional domains, which may only be known for a subset of the sequences, can be transferred to aligned positions in other un-annotated sequences.
- Conserved regions in the alignment can be found which are prime candidates for holding functionally important sites.
- Comparative bioinformatical analysis can be performed to identify functionally important regions.
Figure 20.16: The tabular format of a multiple alignment of 24 Hemoglobin protein sequences. Sequence names appear at the beginning of each row, and the residue position is indicated by the numbers at the top of the alignment columns. The level of sequence conservation is shown on a color scale, with blue residues being the least conserved and red residues being the most conserved.
CHAPTER 20 SEQUENCE ALIGNMENT 365
20.6.2 Constructing multiple alignments
Whereas the optimal solution to the pairwise alignment problem can be found in reasonable time, the problem of constructing a multiple alignment is much harder. The first major challenge in the multiple alignment procedure is how to rank different alignments, i.e. which scoring function to use. Since the sequences have a shared history, they are correlated through their phylogeny, and the scoring function should ideally take this into acco
517. Figure 3.14: A maximized view. The function hides the Navigation Area and the Toolbox.
Maximizing a view can be done in the following ways:
select view | Ctrl + M
or select view | View | Maximize/restore View
or select view | right-click the tab | View | Maximize/restore View
or double-click the tab of view
The following restores the size of the view:
Ctrl + M
or View | Maximize/restore View
or double-click title of view
CHAPTER 3 USER INTERFACE 86
3.2.7 Side Panel
The Side Panel allows you to change the way the contents of a view are displayed. The options in the Side Panel depend on the kind of data in the view, and they are described in the relevant sections about sequences, alignments, trees etc. The Side Panel is activated in this way:
select the view | Ctrl + U (⌘ + U on Mac)
or right-click the tab of the view | View | Show/Hide Side Panel
Note! Changes made to the Side Panel will not be saved when you save the view. See how to save the changes in the Side Panel in chapter 5.
The Side Panel consists of a number of groups of preferences (depending on the kind of data being viewed), which can be expanded and collapsed by clicking the header of the group. You can also expand or collapse all the groups by clicking the icons
518. s text. See section 10.5. In the following sections, we will explain this in further detail. The procedure for searching is identical for all four search options (see also figure 11.6):
Open a sequence or a sequence list | Right-click the name of the sequence | Web Info | select the desired search function
Figure 11.6: Open webpages with information about this sequence.
This will open your computer's default browser searching for the sequence that you selected.
11.4.1 Google sequence
The Google search function uses the accession number of the sequence as search term on http://www.google.com. The resulting web page is equivalent to typing the accession number of the sequence into the search field on http://www.google.com.
11.4.2 NCBI
The NCBI search function searches in GenBank at NCBI (http://www.ncbi.nlm.nih.gov) using an identification number (when you view the sequence as text, it is the GI number). Therefore, the sequence file must contain this number in order to look it up in NCBI. All sequences downloaded from NCBI have this number.
CHAPTER 11 ONLINE DATABASE SEARCH 171
11.4.3 PubMed References
The PubMed references search option lets you look up PubMed articles based on references con
519. s to clone. However, the cloning view has many additional interaction possibilities compared to the normal sequence view, and there are several extra visual aids to help you manipulate the sequences. All of this is described in the following.
19.1.2 Sequence details
When you make a selection on the sequence, you will see details of the residues and restriction sites as illustrated in figure 19.2.
Figure 19.2: Sequence details of a selection. At the top, the sequence is zoomed out and represented as a black line with annotations, and below, the residues are shown double-stranded with detailed visualization of restriction sites.
The Sequence details are particularly useful when the sequences have overhangs, as shown at the right side end of the sequence in figure 19.2, which has a CTAG overhang. If you have not made a selection, the details of the ends of the sequences will automatically be shown. The sequence details can be turned on and off by clicking Show in the Sequence details group at the top of the Side Panel.
19.1.3 How to navigate the cloning view
The zoom function in the cloning view works on the individual sequence and not the entire view. In that way you can show a long plasmid and short sequence fragments in the same view. However, Fit Width and Zoom to 100% apply to all the sequences in the view and can thus be used to reset different zoom levels of the individual sequences. Using the keyboard to
520. s to the number of possible primer combinations formed by a degenerate primer. Thus, if a primer covers two 4-fold degenerate sites and one 2-fold degenerate site, the total fold of degeneracy is 4 × 4 × 2 = 32, and the primer will, when supplied from the manufacturer, consist of a mixture of 32 different oligonucleotides. When scoring the available primers, degenerate primers are given a score which decreases with the fold of degeneracy.
- Allow mismatches. Designs primers which are allowed a specified number of mismatches to the included template sequences. The melting temperature algorithm employed includes the latest thermodynamic parameters for calculating Tm when single-base mismatches occur.
When in Standard PCR mode, clicking the Calculate button will prompt the dialog shown in figure 17 LS. The top part of this dialog shows the single-primer parameter settings chosen in the Primer parameters preference group, which will be used by the design algorithm. The central part of the dialog contains parameters pertaining to primer specificity (this is omitted if all sequences belong to the included group). Here, three parameters can be set:
- Minimum number of mismatches: the minimum number of mismatches that a primer must have against all sequences in the excluded group to ensure that it does not prime these.
- Minimum number of mismatches in 3' end: the minimum number of mismatches that a primer must have in its 3' end against all sequences
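The fold-of-degeneracy arithmetic described above (4 × 4 × 2 = 32) can be checked with a small Python sketch. It assumes that degenerate positions are written with the standard IUPAC nucleotide ambiguity codes; the function names fold_degeneracy and expand_primer and the example primer are made up for illustration and are not part of the Workbench.

```python
import itertools
import math

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def fold_degeneracy(primer):
    """Multiply the number of possible bases at every position."""
    return math.prod(len(IUPAC[base]) for base in primer.upper())


def expand_primer(primer):
    """List every oligonucleotide the degenerate primer stands for."""
    choices = (IUPAC[base] for base in primer.upper())
    return ["".join(combo) for combo in itertools.product(*choices)]


# Two 4-fold sites (N) and one 2-fold site (R): 4 * 4 * 2 = 32.
primer = "ACNGNTR"
print(fold_degeneracy(primer), len(expand_primer(primer)))
```

The expanded list corresponds to the mixture of oligonucleotides mentioned in the text; a scoring scheme that penalizes degeneracy could, for example, subtract a term that grows with fold_degeneracy.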
521. sampling from first order Markov chain. Resampling method generating a sequence of the same expected dinucleotide frequency.
The second group of parameters pertains to the scanning settings and includes:
- Window size. The width of the sliding window.
- Number of samples. The number of times the sequence is resampled to produce the background distribution.
- Step increment. Step increment when plotting sequence positions against scoring values.
The third parameter group contains the output options:
- Z-scores. Create a plot of Z-scores as a function of sequence position.
- P-values. Create a plot of the statistical significance of the structure signal as a function of sequence position.
[Dialog: Scan for Local Structure, step 2 (Set parameters). Resampling methods: Mononucleotide shuffling, Mononucleotide sampling from zero order Markov chain, Dinucleotide shuffling, Dinucleotide sampling from first order Markov chain. Scanning parameters: Window size (must be odd): 101, Number of samples, Step increment. Output options: Z-scores, P-values.]
Figure 22.26: Adjusting parameters for structure scanning.
22.4.2 The structure scanning result
The output of the analysis is plots of Z-scores and probabilities as a function of sequence position. A strong propensity for local structure c
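How a Z-score is obtained from resampling can be sketched as follows. This is a toy illustration under loud assumptions: the Workbench evaluates predicted structure stability, whereas the stand-in scoring function toy_hairpin_score below merely counts complementary end-to-end pairs, and only the simplest resampling method (mononucleotide shuffling) is shown. The names z_score and toy_hairpin_score are hypothetical.

```python
import random
import statistics

PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A")}


def toy_hairpin_score(seq):
    """Toy stand-in for a structure score: count positions i that could
    pair with their mirror position in a perfect hairpin."""
    n = len(seq)
    return sum((seq[i], seq[n - 1 - i]) in PAIRS for i in range(n // 2))


def z_score(sequence, score, n_samples=100, seed=0):
    """Compare the observed score with scores of shuffled sequences."""
    rng = random.Random(seed)
    observed = score(sequence)
    background = []
    for _ in range(n_samples):
        residues = list(sequence)
        rng.shuffle(residues)  # mononucleotide shuffling
        background.append(score("".join(residues)))
    mean = statistics.mean(background)
    spread = statistics.pstdev(background)
    if spread == 0:
        return 0.0  # no variation in the background, nothing to compare
    return (observed - mean) / spread


window = "GGGGGGAAAACCCCCC"  # one window of the scan, chosen for illustration
print(z_score(window, toy_hairpin_score))
```

In a scan, the same calculation would be repeated for a window centered on each plotted position, moving by the step increment. Note that with this toy score a higher value means more pairing, so strong structure gives a positive Z-score; with an energy-based score the sign convention would be reversed.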
522. sapiens using the Basic Local Alignment Search Tool (BLAST) algorithm. Please note that your computer must be connected to the Internet to complete this tutorial. Start out by:
select protein NP_058652 from the Protein folder under Sequences | Toolbox | BLAST Search | NCBI BLAST
In Step 1 you can choose which sequence to use as query sequence. Since you have already chosen the sequence, it is displayed in the Selected Elements list. Click Next. In Step 2 (figure 2.23), choose the default BLAST program, BLASTp: Protein sequence against Protein database, and select the Swiss-Prot database in the Database drop-down menu.
[Dialog: BLAST Against NCBI Databases, step 2 (Set program parameters). Choose program and database: Program (blastp: Protein sequence and database), Database, Genetic code.]
Figure 2.23: Choosing BLAST program and database.
Click Next. In the Limit by Entrez query in Step 3, choose Homo sapiens[ORGN] from the drop-down menu to arrive at the search configuration seen in figure 2.24. Including this term limits the query to proteins of human origin. Click Next to set how the output of the BLAST search should be displayed. Leave these options at their default. Click Finish to accept the parameter settings and begin the BLAST search. The computer now contacts NCBI and pla
523. se of drag and drop is supported throughout the program. Further description of the function is found in connection with the relevant functions.
Copy using drag and drop
To copy instead of move using drag and drop, hold the Ctrl (⌘ on Mac) key while dragging:
click the element | click on the element again, and hold left mouse button | drag the element to the desired location | press Ctrl (⌘ on Mac) while you let go of mouse button | release the Ctrl/⌘ button
3.1.6 Change element names
This section describes two ways of changing the names of sequences in the Navigation Area. In the first part, the sequences themselves are not changed; it is their representation that changes. The second part describes how to change the name of the element.
Change how sequences are displayed
Sequence elements can be displayed in the Navigation Area with different types of information:
- Name (this is the default information to be shown).
- Accession (sequences downloaded from databases like GenBank have an accession number).
- Latin name.
- Latin name (accession).
- Common name.
- Common name (accession).
Whether sequences can be displayed with this information depends on their origin. Sequences that you have created yourself or imported might not include this information, and you will only be able to see them represented by their name. However, sequences downloaded from databases like GenBank will include this information. To change how sequences a
524. search fields you can choose which database to search:
- Swiss-Prot. This is believed to be the most accurate and best quality protein database available. All entries in the database have been curated manually, and data are entered according to the original research paper.
- TrEMBL. This database contains computer-annotated protein sequences, thus the quality of the annotations is not as good as in the Swiss-Prot database.
As default, CLC Combined Workbench offers one text field where the search parameters can be entered. Click Add search parameters to add more parameters to your search.
Note! The search is an "and" search, meaning that when adding search parameters to your search, you search for both (or all) text strings rather than "any" of the text strings.
You can append a wildcard character by checking the checkbox at the bottom. This means that you only have to enter the first part of the search text, e.g. searching for "genom" will find both "genomic" and "genome".
The following parameters can be added to the search:
- All fields. Text; searches in all parameters in the UniProt database at the same time.
CHAPTER 11 ONLINE DATABASE SEARCH 165
- Organism. Text.
- Description. Text.
- Created Since. Between 30 days and 10 years.
- Feature. Text.
The search parameters listed in the dialog are the most recently used. The All fields parameter allows searches in all parameters in the UniProt database at the same time. When you are satisfied with t
525. sed to color subunits.
- Rainbow. This color mode will color the structure with rainbow colors along the sequence.
- Secondary structure. The structure is colored according to secondary structures. Alpha helices are colored light blue, while beta sheets are colored light green. All other atoms are colored grey.
CHAPTER 13 3D MOLECULE VIEWING 204
13.4.4 Selection scheme
When a polymer sequence from a structure is opened, selections made on the sequence will be mirrored by the 3D elements of the structure. The selection scheme specifies how atoms are highlighted.
- Inverse Transparency. Non-selected elements are rendered transparent, while highlighted atoms will retain their original appearance. This scheme is useful for large complex molecules or for selections deep within the molecule. Note that the transparency slider is not functional when this scheme is set.
- Uniform Color. All selected elements are colored yellow.
13.4.5 General settings
- Show table. The table containing sequence information etc. may be turned off using this checkbox.
- Background color. The background color may be changed using this color chooser. Default color is black.
13.4.6 Performance settings
- Cartoon as wireframe. If the selected drawing method is cartoon, selecting this checkbox will render the drawing in a wireframe mode.
- Anti-aliasing on wireframe. Enable anti-aliasing.
- Rendering quality. You may specify the image quality by using
526. sensus sequence. This means that when you use this sequence later on, you will easily be able to see the comments you have entered. The comment could be e.g. your interpretation of the conflict.
2.12.5 Documenting your changes
Whenever you make a change, like replacing a T for a t, it will be noted in the contig's history. To open the history:
Right-click the tab of the contig | Show | History
In the history you can see the details of each change (see figure 2.48).
2.12.6 Using the result for further analyses
When you have finished editing the contig, it can be saved, and you can also extract and save the consensus sequence:
Right-click the name Consensus | Open Copy of Sequence in New View | Save
This will make it possible to use this sequence for further analyses in the CLC Combined Workbench. All the conflict annotations are preserved, and in the sequence's history you will find a reference to the original contig. As long as you also save the contig, you will always be able to go back to it by clicking the reference in the consensus sequence's history (see figure
CHAPTER 2 TUTORIALS 64
[History view of Contig 1:
User: smoensted. Parameters: Position = 637, Removed = T, Inserted = c. Modified element: Consensus. Comments: No Comment.
User: smoensted. Parameters: Position = 651, Removed = A, Inserted = g. Modified element: readl. Comments: No Comment.
User: smoensted. Parameters: Position 550
527. sequence's name and select Delete Sequence.
- To sort the sequences in the list, right-click the name of one of the sequences and select Sort Sequence List by Name or Sort Sequence List by Length.
- To rename a sequence, right-click the name of the sequence and select Rename Sequence.
10.7.2 Sequence list table
Each sequence in the table sequence list is displayed with:
- Name
- Accession
- Description
- Modification date
- Length
In the View preferences for the table view of the sequence list, columns can be excluded, and the view preferences can be saved in a style sheet. See section 5.5. The sequences can be sorted by clicking the column headings. You can further refine the sorting by pressing Ctrl while clicking the heading of another column.
CHAPTER 10 VIEWING AND EDITING SEQUENCES 159
10.7.3 Extract sequences
It is possible to extract individual sequences from a sequence list in two ways. If the sequence list is opened in the tabular view, it is possible to drag (with the mouse) one or more sequences into the Navigation Area. This allows you to extract specific sequences from the entire list. Another option is to extract all sequences found in the list to a preferred location in the Navigation Area:
select a sequence list in the Navigation Area | File | Extract Sequences
Select a location for the sequences and click OK. Copies of all the sequences in the list are now placed in the location you selected.
Chapte
528. sequence to trim in the Navigation Area | select the region you want to trim | right-click the selection | Trim sequence left/right to determine the direction of the trimming
This will add trimming annotation to the end of the sequence in the selected direction.
18.2.2 Automatic trimming
Sequence reads can be trimmed automatically based on a number of different criteria. Automatic trimming is particularly useful in the following situations:
- If you have many sequence reads to be trimmed.
- If you wish to trim vector contamination from sequence reads.
- If you wish to ensure that the trimming is done according to the same criteria for all the sequence reads.
To trim sequences automatically:
select sequence(s) or sequence lists to trim | Toolbox in the Menu Bar | Sequencing Data Analyses | Trim Sequences
This opens a dialog where you can alter your choice of sequences. When the sequences are selected, click Next. This opens the dialog displayed in figure 18.5.
[Dialog: Trim Sequences, step 2 (Set trim parameters). Sequence trimming: Ignore existing trim information; Trim using quality scores (Limit: 0.05); Trim using ambiguous nucleotides (Residues). Vector trimming: Trim contamination from vectors in UniVec database; Trim contamination from saved sequences (to be chosen in the next step).]
529. ses. You can now go offline and work with CLC Combined Workbench.
5. When the borrow time period has elapsed, you have to connect to the license server again to use CLC Combined Workbench.
6. When the borrow time period has elapsed, the license server will make the floating license available for other users.
Note that the time period is not the period of time that you actually use the Workbench.
Note! When your organization's license server is installed, license borrowing can be turned off. In that case, you will not be able to borrow licenses.
CHAPTER 1 INTRODUCTION TO CLC COMBINED WORKBENCH 23
[Dialog: License Server Error. CLC Network Licensing: No license available at the moment. All licenses obtainable from the server are currently in use. If the problem persists, please contact your local license server administrator. Additional licenses can be purchased by contacting the CLC bio sales team on sales@clcbio.com.]
Figure 1.11: No more licenses available on the server.
No license available
If all the licenses on the server are in use, you will see a dialog as shown in figure 1.11 when you start the Workbench. In this case, please contact your organization's license server administrator. To purchase additional licenses, contact sales@clcbio.com.
If your connection to the license server is lost, you will see a dialog as shown in figure 1.12.
License Server Error. CLC Network Licensing: A li
530. sing BLAST. The BLAST algorithm will search for homologous sequences in predefined and annotated databases of the user's choice. In an easy and fast way, the researcher can gain knowledge of gene or protein function and find evolutionary relations between the newly sequenced DNA and well-established data. After the BLAST search, the user will receive a report specifying found homologous sequences and their local alignments to the query sequence.
12.6.3 How does BLAST work?
BLAST identifies homologous sequences using a heuristic method which initially finds short matches between two sequences; thus, the method does not take the entire sequence space into account. After the initial match, BLAST attempts to start local alignments from these initial matches. This also means that BLAST does not guarantee the optimal alignment, so some sequence hits may be missed. In order to find optimal alignments, the Smith-Waterman algorithm should be used (see below). In the following, the BLAST algorithm is described in more detail.
Seeding
When finding a match between a query sequence and a hit sequence, the starting point is the words that the two sequences have in common. A word is simply defined as a number of letters. For blastp the default word size is 3 (W = 3). If a query sequence has a QWRTG, the searched words are QWR, WRT and RTG. See figure 12.21 for an illustration of words in a protein sequence. During the initial BLAST seeding, the algorithm finds all common
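The word extraction in the seeding step above can be illustrated with a few lines of Python. This is a simplified sketch, not BLAST's actual implementation: it only looks for exact shared words, and the function names words and common_words and the hit sequence are made up for the example.

```python
def words(sequence, word_size=3):
    """Return every overlapping word of length `word_size`, in order."""
    return [
        sequence[i:i + word_size]
        for i in range(len(sequence) - word_size + 1)
    ]


def common_words(query, hit, word_size=3):
    """Return (position in hit, word) for each hit word found in the query."""
    query_words = set(words(query, word_size))
    return [
        (position, word)
        for position, word in enumerate(words(hit, word_size))
        if word in query_words
    ]


print(words("QWRTG"))                    # the example from the text
print(common_words("QWRTG", "AAWRTGA"))  # hypothetical hit sequence
```

Each shared word gives a starting point from which a local alignment can be extended, which is why hits without any shared word can be missed.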
531. sing the default application for this file type (e.g. Microsoft Word for .doc files and Adobe Reader for .pdf). External files are imported and exported in the same way as bioinformatics files (see section 7.1.1). Bioinformatics files not recognized by CLC Combined Workbench are also treated as external files. 7.3 Export graphics to files CLC Combined Workbench supports export of graphics into a number of formats. This way, the visible output of your work can easily be saved and used in presentations, reports etc. The Export Graphics function is found in the Toolbar. CLC Combined Workbench uses a WYSIWYG principle for graphics export: What You See Is What You Get. This means that you should use the options in the Side Panel to change how your data, e.g. a sequence, looks in the program. When you export it, the graphics file will look exactly the same way. It is not possible to export graphics of elements directly from the Navigation Area. They must first be opened in a view in order to be exported. To export graphics of the contents of a view: select tab of View | Graphics on Toolbar. This will display the dialog shown in figure 7.3. 7.3.1 Which part of the view to export In this dialog you can choose to:
• Export visible area, or
• Export whole view
These options are available for all views that can be zoomed in and out. In figure 7.4 is a view of a circular sequence which is zoomed in so that you can only see a part of it
532. single primers in the preference panel
• Click the Calculate button
17.9.2 Alignment based design of PCR primers In this mode, a single or a pair of PCR primers are designed. CLC Combined Workbench allows the user to design primers which will specifically amplify a group of included sequences but not amplify the remainder of the sequences, the excluded sequences. The selection boxes are used to indicate the status of a sequence: if the box is checked, the sequence belongs to the included sequences; if not, it belongs to the excluded sequences. CHAPTER 17 PRIMERS 294 To design primers that are general for all sequences in an alignment, simply add them all to the set of included sequences by checking all selection boxes. Specificity of priming is determined by criteria set by the user in the dialog box which is shown when the Calculate button is pressed (see below). Different options can be chosen concerning the match of the primer to the template sequences in the included group:
• Perfect match: Specifies that the designed primers must have a perfect match to all relevant sequences in the alignment. When selected, primers will thus only be located in regions that are completely conserved within the sequences belonging to the included group.
• Allow degeneracy: Designs primers that may include ambiguity characters where heterogeneities occur in the included template sequences. The allowed fold of degeneracy is user defined and correspond
533. Figure 19.12: Showing restriction sites of ten restriction enzymes. The color of the restriction enzyme can be changed by clicking the colored box next to the enzyme's name. The name of the enzyme can also be shown next to the restriction site by selecting Show name flags above the list of restriction enzymes. CHAPTER 19 CLONING AND CUTTING 330 Sort enzymes: Just above the list of enzymes there are three buttons to be used for sorting the list (see figure 19.13). Figure 19.13: Buttons to sort restriction enzymes.
• Sort enzymes alphabetically: Clicking this button will sort the list of enzymes alphabetically.
• Sort enzymes by number of restriction sites: This will divide the enzymes into four groups: Non-cutters, Single cutters, Double cutters, Multiple cutters. There is a checkbox for each group which can be used to hide/show all the enzymes in a group.
• Sort enzymes by overhang: This will divide the enzymes into three groups: Blunt (enzymes cutting both strands at the same position), 3' (enzymes producing an overhang at the 3' end), 5' (enzymes producing an overhang at the 5' end). There is a checkbox for each group which can be used to hide/show all the enzymes in a group.
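The grouping by number of restriction sites can be illustrated with a short Python sketch. It is not the workbench's own code: the function names are hypothetical, and it assumes a linear sequence, recognition sequences without ambiguity codes, and a search on one strand only (which is sufficient for palindromic recognition sequences such as the three used in the example).

```python
def count_sites(sequence, recognition):
    """Count occurrences (overlaps included) of a recognition sequence."""
    seq, rec = sequence.upper(), recognition.upper()
    count = 0
    start = seq.find(rec)
    while start != -1:
        count += 1
        start = seq.find(rec, start + 1)
    return count


def group_by_cut_count(sequence, enzymes):
    """Sort enzymes into the four groups named in the Side Panel.

    enzymes maps an enzyme name to its recognition sequence.
    """
    labels = ["Non-cutters", "Single cutters", "Double cutters"]
    groups = {label: [] for label in labels + ["Multiple cutters"]}
    for name in sorted(enzymes):
        n = count_sites(sequence, enzymes[name])
        label = labels[n] if n < 3 else "Multiple cutters"
        groups[label].append(name)
    return groups


if __name__ == "__main__":
    enzymes = {"EcoRI": "gaattc", "HindIII": "aagctt", "BamHI": "ggatcc"}
    for label, names in group_by_cut_count("GAATTCAAGCTTAAGCTT", enzymes).items():
        print(label, names)
```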
534. sis for the prediction. For a more detailed description of the scores provided through the tooltip, look at http://www.sanger.ac.uk/Software/Pfam/help/scores.shtml. CHAPTER 16 PROTEIN ANALYSES 262 Figure 16.17: Domain annotations based on Pfam. 16.6.2 Download and installation of additional Pfam databases Additional databases can be downloaded as a resource using the Plug-in manager (see section 1.7.4). If you are not able to download directly from the Plug-in manager 16.7 Secondary structure prediction An important issue when trying to understand protein function is to know the actual structure of the protein. Many questions that are raised by molecular biologists are directly targeted at protein structure. The alpha-helix forms a coiled rodlike structure, whereas a beta-sheet shows an extended sheet-like structure. Some proteins are almost devoid of alpha-helices, such as chymotrypsin (PDB_ID: 1AB9), whereas others, like myoglobin (PDB_ID: 101M), have a very high content of alpha-helices. With CLC Combined Workbench one can predict the secondary structure of proteins very fast. Predicted elements are alpha-helix, beta-sheet (same as beta-strand) and other regions. Based on extracted
535. st 138 330 Local BLAST 176 Local BLAST Database 183 Local complexity plot 218 400 Local Database BLAST 176 Locale setting 100 Location INDEX 425 search in 96 of selection on sequence 88 path to 75 Side Panel 100 Locations multiple 400 Log of batch processing 128 Logo sequence 354 401 ma4 file format 112 Mac OS X installation 14 Manipulate sequences 403 Manual format 30 Marker in gel view 343 Maximize size of view 85 Melting temperature DMSO concentration 279 dNTP concentration 279 Magnesium concentration 279 Melting temperature 279 Cation concentration 279 298 Inner 279 Primer concentration 279 298 Menu Bar illustration 73 MFold 402 mmCIF file format 35 113 409 Mode toolbar 86 Modification date 154 Modify enzyme list 346 Modules 26 Molecular weight 223 Motif search 227 402 Mouse modes 86 Move content of a view 87 elements in Navigation Area 76 sequences in alignment 358 msf file format 112 Multiple alignments 364 401 Multiselecting 76 Name 154 Navigate 3D structure 200 Navigation Area 73 create local BLAST database 183 illustration 73 NCBI 160 search for structures 166 search sequence in 170 search tutorial 40 Negatively charged residues 225 Neighbor Joining algorithm 371 Neighborjoining 402 Nested PCR primers 402 Network configuration 29 Network drive shared BLAST database 184 Never show this dialog again 100 New feat
536. straints to the secondary structure prediction based on free energy minimization (see figure 22.30), and it has been shown that this can dramatically increase the fidelity of the secondary structure prediction (Mathews and Turner, 2006). Figure 22.30: Known structural features can be added as constraints to the secondary structure prediction algorithm in CLC Combined Workbench. Creative Commons License All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distribute, display and use the work for educational purposes under the following conditions: You must attribute the work in its original form and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents. Part IV Appendix Appendix A Comparison of workbenches Below we list a number of functionalities that differ between CLC Workbenches:
• CLC Free Workbench
• CLC Protein Workbench
• CLC DNA Workbench
• CLC RNA Workbench
• CLC Combined Workbench
Data handling Free Protein DNA RNA Combined Add multiple locations to Navigation Area a a a a Share d
537. t is not deleted on the native file system. The location of the database can be seen in the history of the BLAST database. Note: On some file systems there is a 2 GB file size limit. After having adjusted all these settings, click Next, which opens the dialog seen in figure 12.13. Click Next to complete the creation of the database. 12.4.1 Import of BLAST databases Already existing databases can be imported to the workbench. Select to import the .phr or the .nhr files, which are databases for proteins and nucleotides respectively. When a database is imported, the BLAST database is not moved to the above-mentioned locations. Thus it is possible to store a BLAST database on a shared network drive and access the same database through a number of workbenches. CHAPTER 12 BLAST SEARCH 185 Figure 12.13: Choose where the access point to your local BLAST database is saved in the Navigation Area. 12.5 SNP annotation using BLAST CLC Combined Workbench can perform a BLAST search against the databases in the dbSNP database at NCBI (http://www.ncbi.nlm.nih.gov/SNP). The dbSNP database is a central repository for both single base nucleotide substitutions (SNPs) and other types of small-scale variations including e.g. short deletion and ins
538. t nucleotide EA sequences 2 Set parameters structure output Partition Function Calculate base pair probabilities eate plot of mar Advanced options V Avoid isolated base pairs V Apply different energy rules for Grossly Asymmetric Interior Loops GAIL Include coaxial stacking energy rules a Maximum distance between paired bases OO eres e ee tee Figure 2 55 Selecting to compute 10 suboptimal structures Click Finish and you will see a linear view of the sequence with structure information for the ten structures below the sequence and the elements of the best structure are shown as annotations above the sequence see figure 2 56 Interior loop Stem Interior loop Interior loop Bulge Hairpin loop AB009835 CAT TAGATGACTGAAAGCAAGTACTGGTCTCTTAAACCATTTAATAGTAAATTAGCACT TACT TCTAATGACCA AG 9 9kcalmol Creer MO OOOO COLOCO COCO AG 9 7kcalmol Ur er yD it aes A A Y EE PINTA e ENTRO DI II DG 9 4kcalmol LAMA rt Ut A aca ce Ro a COG Ges soe ss DIDIIIIIIIDD ee AG 92kcalmol ACCC ee CALLE E O VDDD DERE DPE 5 gt AG 9Akcalimol AO ee rere eee VENIS Areca a s are LEEN eeta DI DDE ee AG 9tkcalmol AAAA e OC ree DDD Mis 2 sss eens fears eas A PROTA POD ee AG 8 9kcalmol C0 essere ee COC OOOO COLOCO COCO ee AG B6kcalimol 100000 T E DEA E TEES C
539. t the bottom lists the enzymes which will be added when you click Finish Note that this list is dynamically updated when you change the number of cut sites If you have selected more than one region on the sequence using Ctrl or they will be treated as individual regions This means that the criteria for cut sites apply to each region CHAPTER 19 CLONING AND CUTTING 335 Show enzymes with compatible ends Besides what is described above there is a third way of adding enzymes to the Side Panel and thereby displaying them on the sequence It is based on the overhang produced by cutting with an enzyme and will find enzymes producing a compatible overhang right click the restriction site Show Enzymes with Compatible Ends 1 I This will display the dialog shown in figure 19 21 d Show Enzymes with Compatible Ends to Tag 1 Please choose enzymes Nice SALES Enzyme list O Exact matches only O Al matches Select enzymes to be added to Side Panel th compatible ends Enzymes added to Side Panel Filter Overhang Methylation Popul Name Overhang Methylation Popul cg N methyl mer a r N methyl k cg N6 methyl Pere t N6 methy Peer cg S methylcy ee a ce S methyley Pee g S methylcy ee S methylcy ver 8 ca S methylcy c S s 5 Sc c p Ste Sc Sc pz Sc Sc 5 5s 88888888 c c Figure 19 21 Enzymes with compatible ends At the top you can choose whether
540. t view (see section 12.3.2). Figure 12.16: Output options for
• Add annotations to input sequences: This will add the variation annotations found in the BLAST search to the sequence that was chosen in the first step. If multiple sequences were selected, a BLAST search is conducted for each of the sequences.
The first two options represent two different ways of showing the BLAST result, and if neither is selected, you will not be able to see and save the BLAST result. The result of the BLAST search is described more elaborately below. 12.5.2 Result of SNP annotation The SNP BLAST hits The graphical BLAST output of a SNP BLAST search is shown in figure 12.17.
541. table difference is that the alignment primer view has no available graphical information Furthermore the selection boxes found to the right of the names in the alignment play an important role in specifying the oligo design process This is elaborated below The Primer Parameters group in the Side Panel has the same options for specifying primer requirements but differs by the following see figure 17 12 e In the Mode submenu which specifies the reaction types the following options are found Standard PCR Used when the objective is to design primers or primer pairs for PCR amplification of a single DNA fragment CHAPTER 17 PRIMERS 293 5 small nucleot mn i A Primer De signer Settings PERH2BD O GTGAGTCTGA TGGGTCTGCC CATGGTTTTC TTCCTCTAGT 40 PERH3BC O GTGAGTCTGA TGGGTCTGCC CATGGTTTCC TTCCTCTAGT 40 y Primer parameters Consensus GTGAGTCTGA TGGGTCTGCC CATGGTTINC TTCCTCTAGT Lenath seamos GTGAGTOTGA TOGGTCTGCG CATUGTTTSC TTGCTCTAGT ma 228 Min 18 1 1 PERH2BD O TTCTGGGGTT ACCTTCCTAT CAGAAGGAAA GGGGAAGAGA 80 Melt temp C PERH3BC O TTCTGGGCTT ACCTTCCTAT CAGAAGGAAA TGGGAAGAGA 80 Max 58 gt Consensus TTCTGGGNTT ACCTTCCTAT CAGAAGGAAA NGGGAAGAGA Min 48 seawwcetoo TTCTGGUETT ACCTTCCTAT CAGAAGGAAA JGGGAAGAGA ER Max E pa PERH2BD O TTCTAGGGAG TCATTTAAAC AGATGGTGTT TGCTTATTCC 120 Min PERH3BC O TTCTAGGGAG CAGTTTAGAT GGAAGGTATC TGCTTGTICC 120 Advanced parameters Consensus TTCTAGGGAG NNNTT
542. tails of any experimental heterogeneity can be maintained and used when the result of single-sequence analyses is interpreted. When the parameters have been adjusted, click Next to see the dialog shown in figure 18.8. Figure 18.8: Different options for the output of the assembly. In this dialog, you can specify more options: Minimum aligned read length: The minimum number of nucleotides in a read which must be successfully aligned to the contig. If this criterion is not met by a read, the read is excluded from the assembly. Alignment stringency: Specifies the stringency of the scoring function used by the alignment step in the contig assembly algorithm. A higher stringency level will tend to produce contigs with fewer ambiguities, but will also tend to omit more sequencing reads and to generate more and shorter contigs. Three stringency levels can be set: Low, Medium, High. Use existing trim information: When using a reference sequence, trimming is generally not necessary, but if you wish to use trim
543. tained in the sequence file (when you view the sequence as text, it contains a number of "PUBMED" lines). Not all sequences have these PubMed references, but in this case you will see a dialog and the browser will not open. 11.4.4 UniProt The UniProt search function searches in the UniProt database (http://www.ebi.uniprot.org) using the accession number. Furthermore, it checks whether the sequence was indeed downloaded from UniProt. 11.4.5 Additional annotation information When sequences are downloaded from GenBank, they often link to additional information on taxonomy, conserved domains etc. If such information is available for a sequence, it is possible to access additional accurate online information. If the db_xref identifier line is found as part of the annotation information in the downloaded GenBank file, it is possible to easily look up additional information on the NCBI web site. To access this feature, simply right-click an annotation and see which databases are available. Chapter 12 BLAST search Contents
12.1 BLAST Against NCBI Database 173
12.1.1 BLAST a selection against NCBI 176
12.2 BLAST Against Local Database 176
12.2.1 BLAST a selection against a local database 179
12.3 Output from BLAST search 179
12.3.1 Overview BLAST table 179
12
544. te a new sequence based on the text copied. This operation is equivalent to saving the text in a text file and importing it into the CLC Combined Workbench. If the sequence is not formatted, i.e. if you just have a text like this: "ATGACGAATAGGAGTTC TAGCTA", you can also paste this into the Navigation Area. Note: Make sure you copy all the relevant text, otherwise CLC Combined Workbench might not be able to interpret the text. Import of Vector NTI data CLC Combined Workbench can import DNA, RNA, and protein sequences from a Vector NTI Database. The import can be done for Vector NTI Advance 10 for Windows machines and Vector NTI Suite 7.1 for Mac OS X for Panther and former versions. A new folder will be placed in the Navigation Area, and you can find all sequences in subfolders ready to work with. In order to import all DNA/RNA, protein and oligo sequences: select File in the Menu Bar | Import VectorNTI Data | select a database directory | Import | confirm the information. Note: The default installation of the VectorNTI program for the database home is
• C:\VNTI Database for Windows machines and
• /Library/Application Support/VNTI Database for Mac OS X for Panther
Therefore the CLC Combined Workbench will check if there is a default installation and will ask whether you want to use the default database directory or another directory. Note: Make sure that the Vector NTI database directory (default or backup) contains folders like
545. ted. To do so, click the Print icon. The history can also be exported as a pdf file: Select the element in the Navigation Area | Export | in "File of type" choose History PDF | Save. Chapter 9 Handling of results Contents
9.1 How to handle results of analyses 127
9.1.1 Table outputs 128
9.1.2 Batch log 128
Most of the analyses in the Toolbox are able to perform the same analysis on several elements in one batch. This means that analyzing large amounts of data is very easily accomplished. If you e.g. wish to translate a large number of DNA sequences to protein, you can just select the DNA sequences and set the parameters for the translation once. Each DNA sequence will then be treated individually, as if you performed the translation on each of them. The process will run in the background, and you will be able to work on other projects at the same time. 9.1 How to handle results of analyses All the analyses in the Toolbox are performed in a step-by-step procedure. First, you select elements for analyses, and then there are a number of steps where you can specify parameters (some of the analyses have no parameters, e.g. when translating DNA to RNA). The final step concerns the handling of the results of the analysis, and it is almost identical for all the analyses, so we explain it in this section in general. In th
546. temperature interval settings relate to the outer primer pair, i.e. not the probe. Melting temperatures are calculated by a nearest-neighbor model which considers stacking interactions between neighboring bases in the primer-template complex. The model uses state-of-the-art thermodynamic parameters (SantaLucia, 1998) and considers the important contribution from the dangling ends that are present when a short primer anneals to a template sequence (Bommarito et al., 2000). A number of parameters concerning the reaction mixture, which influence melting temperatures, can be adjusted (see below). Melting temperatures are corrected for the presence of monovalent cations using the model of (SantaLucia, 1998), and temperatures are further corrected for the presence of magnesium, deoxynucleotide triphosphates (dNTP) and dimethyl sulfoxide (DMSO) using the model of (von Ahsen et al., 2001).
• Inner melting temperature: This option is only activated when the Nested PCR or TaqMan mode is selected. In Nested PCR mode, it determines the allowed melting temperature interval for the inner/nested pair of primers, and in TaqMan mode it determines the allowed temperature interval for the TaqMan probe.
• Advanced parameters: A number of less commonly used options. Buffer properties: A number of parameters concerning the reaction mixture which influence melting temperatures. Primer concentration: Specifies the concentration of primers and probes in units of
547. the Nucleotide info preference group the display of trace data can be selected and unselected. When selected, the trace data information is shown as a plot beneath the sequence. The appearance of the plot can be adjusted using the following options (see figure 18.3):
• Nucleotide trace: For each of the four nucleotides, the trace data can be selected and unselected.
• Show confidence: If confidence information was provided by the base calling algorithm, this can be displayed as a bar plot behind the trace plots. The confidence data is displayed as the log-transformed value of the probability of a given nucleotide position being correctly assigned. CHAPTER 18 SEQUENCING DATA ANALYSES AND ASSEMBLY 303
• Show as probabilities: Displays confidence data as probabilities on a 0-1 scale, i.e. not log-transformed.
• Scale traces: A slider which allows the user to scale the height of the trace area. Scaling the traces individually is described in section 18.1.1.
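The relation between a log-transformed confidence value and a probability on a 0-1 scale can be sketched as follows. The manual does not state which transformation the workbench uses, so this sketch assumes the widely used Phred convention, Q = -10 log10(P_error), where P_error is the probability that the base call is wrong. The function names are hypothetical.

```python
import math


def phred_to_probability_correct(q):
    """Convert a Phred-style quality value to P(base call is correct).

    Assumption: Q = -10 * log10(P_error), the usual Phred definition.
    """
    return 1.0 - 10.0 ** (-q / 10.0)


def probability_correct_to_phred(p):
    """Convert P(base call is correct) back to a Phred-style quality value."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return -10.0 * math.log10(1.0 - p)


if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_probability_correct(q))
```

Under this convention a quality of 20 corresponds to a 99% chance that the base is correct, and every additional 10 units reduces the error probability by a factor of ten.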
548. the dropdown list. Lower quality images render faster, but may not display well under high zoom factors. 13.5 3D Output The output of the 3D viewer is rendered on the screen in real time, and changes to the preferences are visible immediately. From CLC Combined Workbench you can export the visible part of the 3D view to different graphic formats by pressing the Graphics button on the Menu bar. This will allow you to export in the following formats:
Format                       Suffix  Type
Portable Network Graphics    .png    bitmap
JPEG                         .jpg    bitmap
Tagged Image File            .tif    bitmap
PostScript                   .ps     vector graphics
Encapsulated PostScript      .eps    vector graphics
Portable Document Format     .pdf    vector graphics
Scalable Vector Graphics     .svg    vector graphics
Printing is not fully implemented with the 3D editor. Should you wish to print a 3D view, this can be done by either exporting to a graphics format and printing that, or by using the scheme below. CHAPTER 13 3D MOLECULE VIEWING 205
Windows:
• Adjust your 3D view in CLC Combined Workbench
• Press Print Screen on your keyboard (or Alt + Print Screen)
• Paste the result into an image editor, e.g. Paint or GIMP (http://www.gimp.org)
• Crop/edit the screenshot
• Save in your preferred file format and/or print
Mac:
• Set up your 3D view
• Press ⌘ + shift + 3 (or ⌘ + shift + 4) to take a screen shot
• Open the saved file (pdf or png) in an image editor, e.g. GIMP (http://www.gimp.org)
• Crop/edit
549. the energy contributions of the elements in the structure. If more than one structure has been predicted, the table is also used to switch between the structures shown in the graphical view. The table is described in section 22.2.2. 22.2.1 Graphical view and editing of secondary structure To show the secondary view of an already open sequence, click the Show Secondary Structure 2D View button at the bottom of the sequence view. If the sequence is not open, click Show and select Secondary Structure 2D View. This will open a view similar to the one shown in figure 22.9. Figure 22.9: The secondary structure view of an RNA sequence zoomed in. Like the normal sequence view, you can use Zoom in and Zoom out. Zooming in will reveal the residues of the structure as shown in figure 22.9. For large structures, zooming out will give you an overview of the whole structure. Side Panel settings The settings in the Side Panel are a subset of the settings in the normal sequence
550. the program error can be reproduced. All errors will be treated seriously and with gratitude. We appreciate your help. Start in safe mode If the program becomes unstable on start-up, you can start it in Safe mode. This is done by pressing and holding down the Shift button while the program starts. When starting in safe mode, the user settings (e.g. the settings in the Side Panel) are deleted and cannot be restored. Your data stored in the Navigation Area is not deleted. When the workbench has been started in safe mode, some of the functionalities are missing, and you will have to restart the workbench again (without pressing Shift). 1.5.3 Free vs. commercial workbenches The advanced analyses of the commercial workbenches, CLC Protein Workbench, CLC RNA Workbench and CLC DNA Workbench, are not present in CLC Free Workbench. Likewise, some advanced analyses are available in CLC DNA Workbench but not in CLC RNA Workbench or CLC Protein Workbench, and vice versa. All types of basic and advanced analyses are available in CLC Combined Workbench. However, the output of the commercial workbenches can be viewed in all other workbenches. This allows you to share the result of your advanced analyses from e.g. CLC Combined Workbench with people working with e.g. CLC Free Workbench. They will be able to view the results of your analyses, but not redo the analyses. The CLC Workbenches are developed for Windows, Mac and Linux platforms. Data can be ex
551. the screenshot
• Save in your preferred file format and/or print
Linux:
• Set up your 3D view
• e.g. use GIMP to take the screen shot (http://www.gimp.org)
• Crop/edit the screenshot
• Save in your preferred file format and/or print
Chapter 14 General sequence analyses Contents
14.1 Shuffle Sequence 206
14.2 Dot plots 208
14.2.1 Create dot plots 208
14.2.2 View dot plots 210
14.2.3 Bioinformatics explained: Dot plots 210
14.2.4 Bioinformatics explained: Scoring matrices 214
14.3 Local complexity plot 218
14.3.1 Local complexity view preferences 219
14.4 Sequence statistics 220
14.4.1 Bioinformatics explained: Protein statistics 222
14.5 Join sequences 226
14.6 Motif Search 227
14.6.1 Motif search parameter settings 229
14.6.2 Motif search output 230
14.7 Pattern Discovery 230
14.7.1 Pattern discovery search parameters 231
14.7.2 Pattern search output 232
CLC Combined Workbench offers different kinds of sequence analyses which apply to
552. thylation Popularity Hind IIT 5 agct N methyladenosine Smal Blunt N4 methyleytosine Xbal 5 ctag N6 methyladenosine Sall 5 tega N6 methyladenosine EcoRV Blunt N6 methyladenosine EcoRI S aatt N6 methyladenosine BglII gate N4 methylcytosine Xhol tega N methyladenosine PstI toca N6 methyladenosine BamHI gate N4 methylcytosine Kool f ob N methyladenosine Sequence to be inserted S additional Sall 3 additional GTCGAC Figure 19 10 Inserting the Sall recognition sequence 19 1 7 Show in a circular view The sequences stored in the cloning view can be saved to a sequence list and later be opened again for further editing A sequence list is represented by the following icon in the Navigation Area After finishing the in silico cloning in a linear mode the newly formed cloning vector or plasmid can easily be visualized in circular mode Simply verify that the molecule is circular right click the sequence name and right click the sequence name and press open sequence in circular view Then you have a circular view as displayed in figure 19 11 bl bl ROP protei Conflic Conflic Figure 19 11 Final circular view of the plasmid 19 2 Restriction site analysis There are two ways of finding and showing restriction sites e In many cases the dynamic restriction sites found in the Side Panel of sequence views will be useful since it is a quick and easy way of showing restriction sites
553. thylation sensitivity Star activity 7 Column width EcoRV gatate Blunt GE Healthc N methyladenosine Yes af Automatic Bali agatct 5 gate GE Healthc N4 methylcytosine No Sr Sall alegar 5 toga GE Healthc N6 methyladenosine Yes Name XhoI ctegag 5 tcga GE Healthc N methyladenosine No Recognition sequence HindIII aagctt S agt GE Healthc N6 methyladenosine Yes xbal tctaga S ctag GE Healthc N6 methyladenosine Yes Srna EcoRI gaatte S aatt GE Healthc N6 methyladenosine Yes Suppliers PstI ctgcag 3 tgca GE Healthc N methyladenosine Yes Y de BamHI ogatee gate GE Healthc N methylcytosine Yes Menoni Clal atcgat S cg GE Healthc N6 methylas No O Recognizes palindrome NotI gcggccge 5 ggcc GE Healthc N4 methyic No ira Star activity NdeI catatg 5 ta GE Healthc N6 methylac Yes SacI gagcte 3 agct GE Healthc S methylcytosine Yes C Popularity Pyull cagctg Blunt GE Healthc N4 methylcytosine Yes a Select All Create New Enzyme List from Selection Add Remove Enzymes Deselect All a Figure 19 35 An enzyme list and you can use the filter at the top right corner to search for specific enzymes recognition sequences etc If you wish to remove or add enzymes click the Add Remove Enzymes button at the bottom of the view This will present the same dialog as shown in figure 19 32 with the enzyme list shown to the right If you wish to extract a subset of an enzyme list
554. tial gap. If you expect a lot of small gaps in your alignment, the Gap open cost should equal the Gap extension cost. On the other hand, if you expect few but large gaps, the Gap open cost should be set significantly higher than the Gap extension cost. However, for most alignments it is a good idea to make the Gap open cost quite a bit higher than the Gap extension cost. The default values are 10.0 and 1.0 for the two parameters, respectively.
• End gap cost: The price of gaps at the beginning or the end of the alignment. One of the advantages of the CLC Combined Workbench alignment method is that it provides flexibility in the treatment of gaps at the ends of the sequences. There are three possibilities: Free end gaps: Any number of gaps can be inserted in the ends of the sequences without any cost. Cheap end gaps: All end gaps are treated as gap extensions, and any gaps past 10 are free. End gaps as any other: Gaps at the ends of sequences are treated like gaps in any other place in the sequences.
When aligning a long sequence with a short partial sequence, it is ideal to use free end gaps, since this will be the best approximation to the situation. The many gaps inserted at the ends are not due to evolutionary events, but rather to partial data. Many homologous proteins have quite different ends, often with large insertions or deletions. This confuses alignment algorithms, but using the Cheap end gaps option, large gaps will ge
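The trade-off between the Gap open cost and the Gap extension cost can be made concrete with a small Python sketch. It assumes one common affine convention, in which the first position of a gap is charged the open cost and every further position the extension cost; the manual excerpt does not spell out the exact formula used by the workbench, so treat the function `gap_cost` as a hypothetical illustration.

```python
def gap_cost(length, gap_open=10.0, gap_extension=1.0):
    """Cost of a single gap of the given length under an affine model.

    Assumption: first gap position costs gap_open, each additional
    position costs gap_extension. Defaults follow the text (10.0 and 1.0).
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if length == 0:
        return 0.0
    return gap_open + (length - 1) * gap_extension


if __name__ == "__main__":
    # Six gap positions placed as one long gap or as three short gaps.
    print(gap_cost(6))                     # one gap of length 6
    print(3 * gap_cost(2))                 # three gaps of length 2
    print(gap_cost(6, 1.0, 1.0), 3 * gap_cost(2, 1.0, 1.0))
```

With the default values, one gap of length 6 costs 15.0 while three gaps of length 2 cost 33.0, so the aligner favors few, long gaps. When both costs are equal (here 1.0), the two layouts cost the same, which matches the advice to use equal costs when many small gaps are expected.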
555. tide frequency. Dipeptide sampling from first-order Markov chain: Resampling method generating a sequence of the same expected dipeptide frequency. For further details of these algorithms, see (Clote et al., 2005). In addition to the shuffle method, you can specify the number of randomized sequences to output. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. This will open a new view in the View Area displaying the shuffled sequence. The new sequence is not saved automatically. To save the sequence, drag it into the Navigation Area or press Ctrl + S (⌘ + S on Mac) to activate a save dialog. 14.2 Dot plots. Dot plots provide a powerful visual comparison of two sequences. Dot plots can also be used to compare regions of similarity within a sequence. This chapter first describes how to create and second how to adjust the view of the plot. 14.2.1 Create dot plots. A dot plot is a simple, yet intuitive way of comparing two sequences, either DNA or protein, and is probably the oldest way of comparing two sequences (Maizel and Lenk, 1981). A dot plot is a 2-dimensional matrix where each axis of the plot represents one sequence. By sliding a fixed-size window over the sequences and marking a sequence match by a dot in the matrix, a diagonal line will emerge if two identical (or very homologous) sequences are plotted against each other. Dot plots can also be used to visually inspect sequences for di
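The sliding-window idea behind a dot plot can be sketched as follows. This is a minimal illustration, not the workbench's implementation: the function names, the default window size and the `max_mismatches` parameter are assumptions made for the example.

```python
def dot_plot(seq_a, seq_b, window=3, max_mismatches=0):
    """Return the set of (i, j) window start positions (0-based) that match.

    A dot is placed at (i, j) when the window starting at i in seq_a and
    the window starting at j in seq_b differ in at most `max_mismatches`
    positions.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            mismatches = sum(
                1 for k in range(window) if seq_a[i + k] != seq_b[j + k]
            )
            if mismatches <= max_mismatches:
                dots.add((i, j))
    return dots


def render(dots, n_rows, n_cols):
    """Draw the dot matrix as text: one row per window position in seq_a."""
    return [
        "".join("*" if (i, j) in dots else "." for j in range(n_cols))
        for i in range(n_rows)
    ]


if __name__ == "__main__":
    # Two identical sequences give an unbroken diagonal.
    for row in render(dot_plot("ACGTAC", "ACGTAC"), 4, 4):
        print(row)
```

Plotting a sequence against itself produces the main diagonal; additional diagonals off the main one would indicate repeated regions within the sequence.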
556. time hold the mouse button. In order to change the representation: Rearrange leaves and branches by: Select a leaf or branch | Move it up and down (Hint: The mouse turns into an arrow pointing up and down). Change the length of a branch by: Select a leaf or branch | Press Ctrl | Move left and right (Hint: The mouse turns into an arrow pointing left and right). Alter the preferences in the Side Panel for changing the presentation of the tree. 21.2 Bioinformatics explained: phylogenetics. Phylogenetics describes the taxonomical classification of organisms based on their evolutionary history, i.e. their phylogeny. Phylogenetics is therefore an integral part of the science of systematics that aims to establish the phylogeny of organisms based on their characteristics. Furthermore, phylogenetics is central to evolutionary biology as a whole, as it is the condensation of the overall paradigm of how life arose and developed on earth. CHAPTER 21 PHYLOGENETIC TREES 370 21.2.1 The phylogenetic tree. The evolutionary hypothesis of a phylogeny can be graphically represented by a phylogenetic tree. Figure 21.4 shows a proposed phylogeny for the great apes, Hominidae, taken in part from Purvis (Purvis, 1995). The tree consists of a number of nodes (also termed vertices) and branches (also termed edges). These nodes can represent either an individual, a species, or a higher grouping and are thus broadly termed taxonomical units. In this case the terminal
557. time you did a search. CHAPTER 11 ONLINE DATABASE SEARCH 164 11.2 UniProt (Swiss-Prot/TrEMBL) search. This section describes searches in UniProt and the handling of search results. UniProt is a global database of protein sequences. The UniProt search view (figure 11.3) is opened in this way: Search | Search UniProt. Figure 11.3: The UniProt search view. 11.2.1 UniProt search options. Conducting a search in UniProt from CLC Combined Workbench corresponds to conducting the search on UniProt's website. When conducting the search from CLC Combined Workbench, the results are available and ready to work with straight away. Above the
558. tion associated with a large number of structural elements. A detailed structure overview can be found in 22.5. In CLC Combined Workbench, structures are predicted by a modified version of Professor Michael Zuker's well-known algorithm (Zuker, 1989b), which is the algorithm behind a number of RNA-folding packages including MFOLD. Our algorithm is a dynamic programming algorithm for free energy minimization which includes free energy increments for coaxial stacking of stems when they are either adjacent or separated by a single mismatch. The thermodynamic energy parameters used are from the latest Mfold version 3, see http://www.bioinfo.rpi.edu/~zukerm/rna/energy/. 22.1.1 Selecting sequences for prediction. Secondary structure prediction can be accessed in the Toolbox: Toolbox | RNA Structure | Predict Secondary Structure. This opens the dialog shown in figure 22.1. If you have selected sequences before choosing the Toolbox action, they are now listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or CHAPTER 22 RNA STRUCTURE 376
559. tion of secretory and non-secretory proteins. The D-score is introduced in SignalP version 3.0 and is a simple average of the S-mean and Y-max score. The score shows superior discrimination performance of secretory and non-secretory proteins to that of the S-mean score, which was used in SignalP version 1 and 2. For non-secretory proteins, all the scores represented in the SignalP3-NN output should ideally be very low. The hidden Markov model calculates the probability of whether the submitted sequence contains a signal peptide or not. The eukaryotic HMM model also reports the probability of a signal anchor, previously named uncleaved signal peptides. Furthermore, the cleavage site is assigned by a probability score together with scores for the n-region, h-region, and c-region of the signal peptide, if it is found. Other useful resources: http://www.cbs.dtu.dk/services/SignalP/ PubMed entries for some of the original papers: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=pubmed&cmd=Retrieve&dopt=AbstractPlus&list_uids=9051728&query_hl=1&itool=pubmed_docsum http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=15223320&dopt=Citation CHAPTER 16 PROTEIN ANALYSES 248 Creative Commons License: All CLC bio's scientific articles are licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License. You are free to copy, distrib
560. tions on the chromosome sequence. Notice that the hit start position has a lower number than the hit end position. This is because the gene is located on the complementary strand. If you place the mouse cursor on the sequence hits in the graphical view, you can see the reading frame, which is 1, 2 and 3 for the three hits, respectively. Verify the result: Open NC_000011 in a view, go to the Hit start position (5,204,729) and zoom to see the blue gene annotation. You can now see the exon structure of the Human beta globin gene showing the three exons on the reverse strand (see figure 2.27). Figure 2.27: Human beta globin exon view. If you wish to verify the result, make a selection covering the gene region and open it in a new view: right-click | Open Selection in New View | Save. Save the sequence and perform a new BLAST search: Use the new sequence as query. Use BLASTx. Use the protein sequence AAA16334 as database. Using the genomic sequence as query, the mapping of the protein sequence to the exons is visually very clear, as shown in figure 2.28. In theory you could use the chromosome sequence as query, but the performance would be very bad: it would take a long time, and the computer might run out of memory. In this example, you have used well-annotated sequences where you could have searched for the name of the gene
561. to be inserted in or deleted from the sequences not selected for realignment. This will only occur for entire columns of gaps in these sequences, ensuring that their relative alignment is unchanged. Realigning a selection is a very powerful tool for editing alignments in several situations. Removing changes: If you change the alignment in a specific region by hand, you may end up being unhappy with the result. In this case you may of course undo your edits, but another option is to select the region and realign it. Adjusting the number of gaps: If you have a region in an alignment which has too many gaps in your opinion, you can select the region and realign it. By choosing a relatively high gap cost, you will be able to reduce the number of gaps. Combine with fixpoints: If you have an alignment where two residues are not aligned, but you know that they should have been, you can set an alignment fixpoint on each of the two residues, select the region and realign it using the fixpoints. Now the two residues are aligned with each other, and everything in the selected region around them is adjusted to accommodate this change. 20.4 Join alignments. CLC Combined Workbench can join several alignments into one. This feature can for example be used to construct supergenes for phylogenetic inference by joining alignments of several disjoint genes into one spliced alignment. Note that when alignments are joined, all their annotations are carr
562. tool tip. Select the sequence you wish to insert and click Next. This will show the dialog in figure 19.8. At the top is a button to reverse complement the inserted sequence. Below is a visualization of the insertion details. The inserted sequence is at the middle, shown in red, and the vector has been split at the insertion point, with the ends shown at each side of the inserted sequence. If the overhangs of the sequence and the vector do not match, you can blunt end or fill in the overhangs using the drag handles. Whenever you drag the handles, the status of the insertion point is indicated below: CHAPTER 19 CLONING AND CUTTING 327 Figure 19.8: Drag the handles to adjust overhangs. The overhangs match. The overhangs do not match: In this case, you will not be able to click Finish. Drag the handles to make the overhangs match. At the bottom of the dialog is a summary field which records all the changes made to the overhangs. The contents of the summa
563. tructure elements, simply expand the Stem with bifurcation node (see figure 22.18). The multi-loop structure element is a union of three Stem with hairpin substructures and contributions to the multi-loop opening considering multi-loop base pairs and multi-loop arcs. Selecting an element in the table to the right will make a corresponding selection in the Show Secondary Structure 2D View if this is also open and if the Follow structure selection has been set in the editor's side panel. In figure 22.18 the Stem with bifurcation is selected in the table, and this part of the structure is highlighted in the Secondary Structure 2D view.
564. ture and an energy contribution. Three examples of mixed substructure elements are Stem base pairs, Stem with bifurcation and Stem with hairpin. The Stem base pairs substructure is simply a union of stacking elements. It is given by a joined set of base pair positions and an energy contribution displaying the sum of all stacking element energies. The Stem with bifurcation substructure defines a substructure enclosed by a specified base pair and with energy contribution ΔG. The substructure contains a Stem base pairs substructure and a nested bifurcated substructure (multi-loop). Also, bulge and interior loops can occur separating stem regions. The Stem with hairpin substructure defines a substructure starting at a specified base pair with an enclosed substructure energy given by ΔG. The substructure contains a Stem base pairs substructure and a hairpin loop. Also, bulge and interior loops can occur separating stem regions. In order to describe the tree ordering of different substructures, we use an example as a starting point (see figure 22.17). The structure is a disjoint nested union of a Stem with bifurcation substructure and a dangling nucleotide. The nested substructure energies add up to the total energy. The Stem with bifurcation substructure is again a disjoint union of a Stem base pairs substructure joining positions 1-7 with 64-70 and a multi-loop structure element opened at base pair (7, 64). To see these s
565. ture is not saved until the View displaying the structure is closed. When that happens, a dialog opens: Save changes of structure x (Yes or No). The structure can also be saved by dragging it into the Navigation Area. It is possible to select more structures and drag all of them into the Navigation Area at the same time. Download structure search results using right-click menu: You may also select one or more structures from the list and download using the right-click menu (see figure 11.5). Choosing Download and Save lets you select a folder or location where the structures are saved when they are downloaded. Choosing Download and Open opens a new view for each of the selected structures. Figure 11.5: By right-clicking a search result, it is possible to choose how to handle the relevant structure. The selected structures are not downloaded from the NCBI website but are downloaded from the RCSB Protein Data Bank (http://www.rcsb.org/pdb/home/home.do) in mmCIF format. Copy/paste from structure search results: When using copy/paste to bring the search results into the Navigation Area, the actual files are downloaded. To copy/paste files into the Navigation Area: select one or more of the search results | Ctrl + C (⌘ + C on Mac) | select location or folder in the Navigation Area | Ctrl + V
566. ture of the sequence, it is possible to use different BLAST programs for the database search. There are five versions of the BLAST program: blastn, blastp, blastx, tblastn, tblastx.

Option    Query Type   DB Type      Comparison              Note
blastn    Nucleotide   Nucleotide   Nucleotide-Nucleotide
blastp    Protein      Protein      Protein-Protein
tblastn   Protein      Nucleotide   Protein-Protein         The database is translated into protein
blastx    Nucleotide   Protein      Protein-Protein         The queries are translated into protein
tblastx   Nucleotide   Nucleotide   Protein-Protein         The queries and database are translated into protein

The most commonly used method is to BLAST a nucleotide sequence against a nucleotide database (blastn) or a protein sequence against a protein database (blastp). But often another BLAST program will produce more interesting hits. E.g. if a nucleotide sequence is translated before the search, it is more likely to find better and more accurate hits than just a blastn search. One of the reasons for this is that protein sequences are evolutionarily more conserved than nucleotide sequences. Another good reason for translating the query sequence before the search is that you get protein hits which are likely to be annotated. Thus you can directly see the protein function of the sequenced gene. CHAPTER 12 BLAST SEARCH 194 12.6.5 Which BLAST options should I change? The NCBI BLAST web pages and the BLAST command line tool offer
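The table of BLAST programs in this section amounts to a lookup from query type and database type to the applicable programs. A minimal sketch of that lookup is shown below; the dictionary and function names are hypothetical and only encode the five rows of the table, they are not part of any BLAST tool.

```python
# (query type, database type) -> BLAST programs listed in the table.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("protein", "nucleotide"): ["tblastn"],
    ("nucleotide", "protein"): ["blastx"],
}


def blast_programs(query_type, db_type):
    """Return the BLAST programs that fit a query/database combination."""
    key = (query_type.strip().lower(), db_type.strip().lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError("no BLAST program for %s query against %s database" % key)
    return BLAST_PROGRAMS[key]


if __name__ == "__main__":
    # A nucleotide query against a protein database is translated first.
    print(blast_programs("Nucleotide", "Protein"))
```

For a nucleotide query against a nucleotide database, two programs apply: blastn compares the sequences directly, whereas tblastx translates both query and database before comparing.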
567. u are welcome to contact our support function: E-mail: support@clcbio.com. 1.2 Download and installation. The CLC Combined Workbench is developed for Windows, Mac OS X and Linux. The software for either platform can be downloaded from http://www.clcbio.com/download/. Furthermore, the program can be sent on a CD-ROM by regular mail. To receive the program by regular mail, please write an e-mail to support@clcbio.com, including your postal address. 1.2.1 Program download. The program is available for download on http://www.clcbio.com/download/. Before you download the program, you are asked to fill in the Download dialog. In the dialog you must choose: Which operating system you use. Whether you want to include Java or not (this is necessary if you haven't already installed Java). Whether you would like to receive information about future releases. Depending on your operating system and your Internet browser, you are taken through some download options. When the download of the installer (an application which facilitates the installation of the program) is complete, follow the platform-specific instructions below to complete the installation procedure. You must be connected to the Internet throughout the installation process unless you have a pre-activated license (see section 1.4.2). CHAPTER 1 INTRODUCTION TO CLC COMBINED WORKBENCH 14 1.2.2 Installation on Microsoft Windows. Starting the installation process is
568. u have not previously trimmed the sequences, this can be done by checking this box. If selected, the next step in the dialog will allow you to specify settings for trimming (see section 18.2.2). Minimum aligned read length: The minimum number of nucleotides in a read which must be successfully aligned to the contig. If this criterion is not met by a read, the read is excluded from the assembly. Alignment stringency: Specifies the stringency of the scoring function used by the alignment step in the contig assembly algorithm. A higher stringency level will tend to produce contigs with fewer ambiguities but will also tend to omit more sequencing reads and to generate more and shorter contigs. Three stringency levels can be set: Low, Medium, High. Conflicts: If there is a conflict, i.e. a position where there is disagreement about the residue (A, C, T or G), you can specify how the contig sequence should reflect the conflict. CHAPTER 18 SEQUENCING DATA ANALYSES AND ASSEMBLY 307 Vote (A, C, G, T): The conflict will be solved by counting instances of each nucleotide and then letting the majority decide the nucleotide in the contig. Unknown nucleotide (N): The contig will be assigned an N character in all positions with conflicts. Ambiguity nucleotides (R, Y, etc.): The contig will display an ambiguity nucleotide reflecting the different nucleotides found in the reads. For an overview of ambiguity codes, see
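The three conflict options described in this section can be illustrated with a short Python sketch that resolves one column of aligned reads. The function name and mode strings are hypothetical, the ambiguity table is the standard IUPAC nucleotide code, and the alphabetical tie-break in the vote is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter

# Standard IUPAC nucleotide codes, indexed by the set of bases they stand for.
_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
IUPAC = {frozenset(bases): code for code, bases in _CODES.items()}


def resolve_conflict(column, mode="vote"):
    """Return the contig character for one alignment column of read bases."""
    if mode not in ("vote", "unknown", "ambiguity"):
        raise ValueError("unknown conflict mode: %r" % mode)
    bases = [b.upper() for b in column if b.upper() in "ACGT"]
    if not bases:
        return "N"
    counts = Counter(bases)
    if len(counts) == 1:
        # All reads agree, so there is no conflict to resolve.
        return bases[0]
    if mode == "vote":
        top = max(counts.values())
        # Assumption: ties are broken alphabetically.
        return min(b for b, c in counts.items() if c == top)
    if mode == "unknown":
        return "N"
    return IUPAC[frozenset(counts)]
```

For a column with the reads A, A and G, the vote gives A, the unknown nucleotide option gives N, and the ambiguity option gives R (A or G).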
569. u have selected to export the whole view. If you have chosen to export the visible area only, the graphics file will be on one page with no headers or footers. 7.3.4 Exporting protein reports. It is possible to export a protein report using the normal Export function, which will generate a pdf file with a table of contents: Click the report in the Navigation Area | Export in the Toolbar | select pdf. You can also choose to export a protein report using the Export graphics function, but in this way you will not get the table of contents. CHAPTER 7 IMPORT/EXPORT OF DATA AND GRAPHICS 123 7.4 Copy/paste view output. The content of tables, e.g. in reports, folder lists and sequence lists, can be copy/pasted into different programs, where it can be edited. CLC Combined Workbench pastes the data in tabulator-separated format, which is useful if you use programs like Microsoft Word and Excel. There is a huge number of programs in which the copy/paste can be applied. For simplicity, we include one example of the copy/paste function from a Folder Content view to Microsoft Excel. First step is to select the desired elements in the view: click a line in the Folder Content view | hold Shift button | press arrow down/up key. See figure 7.10.
570. u keep the mouse cursor fixed. If the search term is found, the part of the sequence corresponding to the matching annotation is selected. Below this option you can choose to search for translations as well. Sequences annotated with coding regions often have the translation specified, which can lead to undesired results. Position search: Finds a specific position on the sequence. In order to find an interval, e.g. from position 500 to 570, enter 500..570 in the search field. This will make a selection from position 500 to 570 (both included). Notice the two periods (..) between the start and end number. CHAPTER 10 VIEWING AND EDITING SEQUENCES 137 Include negative strand: When searching the sequence for nucleotides or amino acids, you can search on both strands. This concludes the description of the View Preferences. Next, the options for selecting and editing sequences are described. Text format: These preferences allow you to adjust the format of all the text in the view (both residue letters, sequence name and translations if they are shown). Text size: Five different sizes. Font: Shows a list of fonts available on your computer. Bold residues: Makes the residues bold. 10.1.2 Restriction sites in the Side Panel. As shown in figure 19.12, you can display restriction sites as colored triangles and lines on the sequence. The Restriction sites group in the side panel shows a list of enzymes represented by different colors co
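The position search syntax described in this section (a single position, or two positions separated by two periods, both included) can be sketched as a small parser. The function names are hypothetical and the sketch assumes 1-based positions, which matches the inclusive 500 to 570 example in the text but is not stated explicitly.

```python
import re


def parse_position_query(query):
    """Parse '500' or '500..570' into an inclusive (start, end) pair."""
    match = re.fullmatch(r"\s*(\d+)\s*(?:\.\.\s*(\d+)\s*)?", query)
    if match is None:
        raise ValueError("expected a position or an interval such as 500..570")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else start
    if start < 1 or end < start:
        raise ValueError("positions must be 1-based with start <= end")
    return start, end


def select(sequence, query):
    """Return the residues covered by the query, both ends included."""
    start, end = parse_position_query(query)
    # Assumption: positions are 1-based, so shift the start for slicing.
    return sequence[start - 1:end]
```

Because both ends are included, the query 500..570 selects 71 residues rather than 70.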
571. uctures and their prediction Bulletin of Mathemetical Biology 46 591 621 Zuker and Stiegler 1981 Zuker M and Stiegler P 1981 Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information Nucleic Acids Res 9 1 133 148 Part V Index Index 3D molecule view 199 export graphics 204 navigate 200 output 204 rotate 200 zoom 200 AB1 file format 35 113 409 Abbreviations amino acids 411 ABI file format 35 113 409 About CLC Workbenches 24 Accession number display 78 ace file format 112 ACE file format 35 113 409 Activate license commercial 20 demo 18 Add annotations 152 400 sequences to alignment 358 sequences to contig 310 Structure Prediction Constraints 377 Adjust selection 143 Adjust trim 311 Advanced preferences 102 Advanced RNA options Apply base pairing constraints 377 Avoid isolated base pairs 377 389 Coaxial stacking 377 389 GAIL rule 377 389 Advanced search 96 Algorithm alignment 347 neighbor joining 371 UPGMA 371 Align alignments 350 protein sequences tutorial 41 sequences 401 Alignment see Alignments Alignment Primers Degenerate primers 293 294 PCR primers 292 Primers with mismatches 293 294 Primers with perfect match 293 294 TaqMan Probes 293 Alignment based primer design 291 Alignments 347 401 add sequences to 358 compare 361 create 348 design primers for 291 edit 357 fast algorithm 34
572. uences. Difference between Motif Search and Pattern Discovery: In motif search, the user has some predefined knowledge about the pattern (motif) of interest. This motif is defined by the user, and the algorithm runs through the entire sequence and looks for identical or degenerate patterns. Motif search handles ambiguous characters in the way that two residues are different if they do not have any residues in common. For example: For nucleotides, N matches any character and R matches A, G. For proteins, X matches any character and Z matches E, Q. Our pattern discovery algorithm (see section 14.7) is based on proprietary hidden Markov models (HMM) and scans the entire sequence (one or more) for patterns which may be unknown to the user. Motifs: If you have a known motif represented by a literal string or a sequence pattern of interest, you can search for them using the CLC Combined Workbench. Patterns and motifs can be searched with different levels of degeneracy in both DNA and protein sequences. You can also search for matches with known motifs represented by a regular expression. A regular expression is a string that describes or matches a set of strings according to certain syntax rules. They are usually used to give a concise description of a set without having to list all elements. CHAPTER 14 GENERAL SEQUENCE ANALYSES 228 The simplest form of a regular expression is a literal string. The syntax used for the regular expressions is the
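The rule for ambiguous characters stated in this section (two residues differ only if they have no residues in common) can be sketched for nucleotides as follows. The function names are hypothetical, the code table is the standard IUPAC nucleotide code, and the sketch is an illustration of the matching rule rather than the workbench's search algorithm.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
NUCLEOTIDE_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def residues_match(a, b, codes=NUCLEOTIDE_CODES):
    """Two residues match if the sets of bases they stand for overlap."""
    return bool(set(codes[a.upper()]) & set(codes[b.upper()]))


def find_motif(sequence, motif, codes=NUCLEOTIDE_CODES):
    """Return the 0-based start positions where the motif matches."""
    width = len(motif)
    return [
        i
        for i in range(len(sequence) - width + 1)
        if all(residues_match(sequence[i + k], motif[k], codes) for k in range(width))
    ]
```

Searching the sequence GATCGGTC for the motif GRTC finds matches at 0-based positions 0 (GATC) and 4 (GGTC), since R stands for A or G. A protein variant would only need a different code table, for example one in which Z stands for E and Q.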
573. ues are shown in black color, hydrophilic residues as green, acidic residues as red and basic residues as blue. 20.2.1 Bioinformatics explained: Sequence logo. In the search for homologous sequences, researchers are often interested in conserved sites/residues or in positions in a sequence which tend to differ a lot. Most researchers use alignments (see Bioinformatics explained: multiple alignments) for visualization of homology on a given set of either DNA or protein sequences. In proteins, active sites in a given protein family are often highly conserved. Thus, in an alignment these positions (which are not necessarily located in proximity) are fully or nearly fully conserved. On the other hand, antigen binding sites in the Fab unit of immunoglobulins tend to differ quite a lot, whereas the rest of the protein remains relatively unchanged. In DNA, promoter sites or other DNA binding sites are highly conserved (see figure 20.8). This is also the case for repressor sites, as seen for the Cro repressor of bacteriophage λ. When aligning such sequences, regardless of whether they are highly variable or highly conserved at specific sites, it is very difficult to generate a consensus sequence which covers the actual variability of a given position. In order to better understand the information content or significance of certain positions, a sequence logo can be used. The sequence logo displays the information content of all positions in an alignment as residues
574. uncation and gene finding errors. The SignalP method: One of the most cited and best methods for prediction of classical signal peptides is the SignalP method (Nielsen et al., 1997; Bendtsen et al., 2004b). In contrast to other methods, SignalP also predicts the actual cleavage site, thus the peptide which is cleaved off during translocation over the membrane. Recently, an independent research paper has rated SignalP version 3.0 to be the best standalone tool for signal peptide prediction. It was shown that the D-score, which is reported by the SignalP method, is the best measure for discriminating secretory from non-secretory proteins (Klee and Ellis, 2005). SignalP is located at http://www.cbs.dtu.dk/services/SignalP/. Figure 16.4: Sequence logo of eukaryotic signal peptides, showing conservation of amino acids in bits (Schneider and Stephens, 1990). Polar and hydrophobic residues are shown in green and black, respectively, while blue indicates positively charged residues and red negatively charged residues. The logo is based on an ungapped sequence alignment fixed at the 1 position of the signal peptides. What do the SignalP scores mean? Many bioinformatics approaches or prediction tools do not give a yes/no answer. Often the user is facing an interpretation of the output, which can be either numerical or graphical. Why is that? In clear-cut examples there is no doubt: yes, this is a signal
575. Figure 9.4: An example of a batch log when finding open reading frames. The log will either be saved with the results of the analysis or opened in a view with the results, depending on how you chose to handle the results. Part III Bioinformatics. Chapter 10 Viewing and editing sequences. Contents:
10.1 View sequence
10.1.1 Sequence settings in Side Panel
10.1.2 Restriction sites in the Side Panel
10.1.3 Selecting parts of the sequence
10.1.4 Editing the sequence
10.1.5 Sequence region types
10.2 Circular DNA
10.2.1 Using split views to see details of the circular molecule
10.2.2 Mark molecule as circular and specify starting point
10.3 Working with annotations
10.3.1 Viewing annotations
10.3.2 Adding annotations
10.3.3 Edit annotations
10.3.4 Removing annotations
10.4 Sequence information
10.5 View as text
576. union of nested structure elements and other substructures (see a detailed description of the different structure elements in section 22.5.2). Each substructure contributes a free energy given by the sum of its nested substructure energies and energies of its nested structure elements. Figure 22.16: The secondary structure table with the list of structures to the left and, to the right, the substructures of the selected structure. The substructure elements to the right are ordered after their occurrence in the sequence; they are described by a region (the sequence positions covered by this substruc
577. unt. Doing so is, however, not straightforward, as it increases the number of model parameters considerably. It is therefore commonplace to either ignore this complication and assume sequences to be unrelated, or to use heuristic corrections for shared ancestry. The second challenge is to find the optimal alignment given a scoring function. For pairs of sequences this can be done by dynamic programming algorithms, but for more than three sequences this approach demands too much computer time and memory to be feasible. A commonly used approach is therefore to do progressive alignment (Feng and Doolittle, 1987), where multiple alignments are built through the successive construction of pairwise alignments. These algorithms provide a good compromise between time spent and the quality of the resulting alignment. Presently, the most exciting development in multiple alignment methodology is the construction of statistical alignment algorithms (Hein, 2001; Hein et al., 2000). These algorithms employ a scoring function which incorporates the underlying phylogeny and use an explicit stochastic model of molecular evolution, which makes it possible to compare different solutions in a statistically rigorous way. The optimization step, however, still relies on dynamic programming, and practical use of these algorithms thus awaits further developments. Creative Commons License: All CLC bio's scientific articles are licensed under a Creative Commons Attribution-Non
578. ure 11.1: The GenBank search view. 11.1.1 GenBank search options. Conducting a search in the NCBI Database from CLC Combined Workbench corresponds to conducting the search on NCBI's website. When conducting the search from CLC Combined Workbench, the results are available and ready to work with straight away. You can choose whether you want to search for nucleotide sequences or protein sequences. As default, CLC Combined Workbench offers one text field where the search parameters can be entered. Click Add search parameters to add more parameters to your search. Note: The search is an "and" search, meaning that when adding search parameters to your search, you search for both (or all) text strings rather than any of the text strings. You can append a wildcard character by checking the checkbox at the bottom. This means that you only have to enter the first part of the search text, e.g. searching for genom will find both genomic and genome. The following parameters can be added to the search: All fields: Text, searches in all parameters in the NCBI database at the same time. Organism: Text. Description: Text. Modified Since: Between 30 days and 10 years. Gene Location: Genomic DNA/RNA, Mitochondrion, or Chloroplast. Molecule: Genomic DNA/RNA, mRNA or rRNA. Sequence Length: Number for maximum or minimum length of the sequence. Gene Name: Text. The search parameters are
579. ure request 24 folder 76 folder tutorial 33 sequence 156 New sequence create from a selection 144 Newick file format 35 113 409 nexus file format 112 Nexus file format 35 113 409 nhr file format 112 NHR file format 35 113 409 Non standard residues 133 nr BLAST databases 175 Nucleotide info 134 sequence databases 404 Nucleotides UIPAC codes 412 Numbers on sequence 132 nwk file format 112 xs file format 112 0a4 file format 112 Online check of demo license 17 Open consensus sequence 353 from clipboard 115 Open reading frame determination 237 Open ended sequence 238 Order primers 300 402 ORF 237 Organism 154 Origins from 125 Overhang of fragments from restriction digest 340 Overhang find restriction enzymes based on 139 141 331 333 336 345 Overhang visualization of 321 pa4 file format 112 INDEX 426 Page heading 110 Page number 110 Page setup 109 Pairwise comparison 361 PAM scoring matrices 214 Parameters search 161 164 167 Partition function 377 402 Paste text to create a new sequence 115 Paste copy 123 Pattern Discovery 230 Pattern discovery 402 Pattern Search 227 PCR primers 402 pdb file format 112 199 seq file format 112 PDB file format 35 113 409 pdf format export 120 Peak call secondary 317 Peptidase 269 Peptide sequence databases 404 Personal information 25 Pfam domain search 260 401 phr f
580. ut the annotation, like comments and links. Click the Add qualifier key button to enter information. Select a qualifier which describes the kind of information you wish to add. If an appropriate qualifier is not present in the list, you can type your own qualifier. The pre-defined qualifiers are derived from the GenBank format. You can add as many qualifier key lines as you wish by clicking the button. Redundant lines can be removed by clicking the delete icon. The information entered on these lines is shown in the annotation table (see section 10.3.1) and in the yellow box which appears when you place the mouse cursor on the annotation. If you write a hyperlink in the Key text field, e.g. www.clcbio.com, it will be recognized as a hyperlink. Clicking the link in the annotation table will open a web browser. Click OK to add the annotation. Note: The annotation will be included if you export the sequence in GenBank, Swiss-Prot or CLC format. When exporting in other formats, annotations are not preserved in the exported file. [CHAPTER 10. VIEWING AND EDITING SEQUENCES, page 154] 10.3.3 Edit annotations. To edit an existing annotation from within a sequence view: right-click the annotation | Edit Annotation. This will show the same dialog as in figure 10.19, with the exception that some of the fields are filled out, depending on how much information the annotation contains. There is another way of quickly editing annotations which is part
581. ute, display and use the work for educational purposes under the following conditions: You must attribute the work in its original form, and "CLC bio" has to be clearly labeled as author and provider of the work. You may not use this work for commercial purposes. You may not alter, transform, nor build upon this work. SOME RIGHTS RESERVED. See http://creativecommons.org/licenses/by-nc-nd/2.5/ for more information on how to use the contents. 16.2 Protein charge. In CLC Combined Workbench you can create a graph of the electric charge of a protein as a function of pH. This is particularly useful for finding the net charge of the protein at a given pH. This knowledge can be used e.g. in relation to isoelectric focusing on the first dimension of 2D gel electrophoresis. The isoelectric point (pI) is found where the net charge of the protein is zero. The calculation of the protein charge does not include knowledge about any potential post-translational modifications the protein may have. The pKa values reported in the literature may differ slightly, thus resulting in different-looking graphs of the protein charge plot compared to other programs. In order to calculate the protein charge: Select a protein sequence | Toolbox in the Menu Bar | Protein Analyses | Create Protein Charge Plot, or right-click a protein sequence | Toolbox | Protein Analyses | Create Protein Charge Plot. This opens the dialog displayed in figure 16.6
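The relation between pH, net charge and the isoelectric point can be sketched in Python with the Henderson-Hasselbalch equation. This is only an illustration under stated assumptions. The pKa values below are assumed round numbers (as the text notes, published values differ), the function names are hypothetical, and the Workbench's own calculation may differ.

```python
# Assumed, illustrative pKa values; literature values differ slightly.
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
N_TERMINUS_PKA = 9.0
C_TERMINUS_PKA = 2.0


def net_charge(sequence, ph):
    """Net charge of an unmodified protein at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - N_TERMINUS_PKA))    # free amino terminus
    charge -= 1.0 / (1.0 + 10 ** (C_TERMINUS_PKA - ph))   # free carboxyl terminus
    for residue in sequence.upper():
        if residue in POSITIVE_PKA:
            charge += 1.0 / (1.0 + 10 ** (ph - POSITIVE_PKA[residue]))
        elif residue in NEGATIVE_PKA:
            charge -= 1.0 / (1.0 + 10 ** (NEGATIVE_PKA[residue] - ph))
    return charge


def isoelectric_point(sequence, tolerance=1e-4):
    """Bisection on pH 0..14 for the point where the net charge crosses zero."""
    low, high = 0.0, 14.0
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the zero crossing lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


if __name__ == "__main__":
    fragment = "MVHLTPEEKSAVTALWGKVN"
    print(round(isoelectric_point(fragment), 2))
```

Bisection works because the net charge only decreases as pH rises. Evaluating `net_charge` over a range of pH values gives the points of a charge plot.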
582. very codon in a Codon Frequency Table has its own count, frequency (per thousand) and fraction, which are calculated in accordance with the occurrences of the codon in the organism. Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. The newly created nucleotide sequence is shown, and if the analysis was performed on several protein sequences, there will be a corresponding number of views of nucleotide sequences. The new sequence is not saved automatically. To save the sequence, drag it into the Navigation Area or press Ctrl + S (⌘ + S on Mac) to show the save dialog. 16.9.2 Bioinformatics explained: Reverse translation. In all living cells containing hereditary material such as DNA, a transcription to mRNA and a subsequent translation to proteins occur. This is of course simplified, but it is in general what happens in order to have a steady production of the proteins needed for the survival of the cell. In bioinformatics analysis of proteins it is sometimes useful to know the ancestral DNA sequence in order to find the genomic localization of the gene. Thus, the translation of proteins back to DNA/RNA is of particular interest, and is called reverse translation or back-translation. The Genetic Code. In 1968 the Nobel Prize in Medicine was awarded to Robert W. Holley, Har Gobind Khorana and Marshall W. Nirenberg for their interpretation of the Genetic Code: http://nobelprize.org/medicine/laureates/1968
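One simple way to use a codon frequency table for reverse translation is to pick, for each amino acid, the codon with the largest fraction. The Python sketch below shows that idea only. The table covers four amino acids, its fractions are hypothetical values invented for the example, and the function name is an assumption. It is not presented as the Workbench's method.

```python
# Hypothetical codon fractions for four amino acids (illustrative values only).
CODON_FRACTIONS = {
    "M": {"ATG": 1.0},
    "W": {"TGG": 1.0},
    "K": {"AAA": 0.4, "AAG": 0.6},
    "F": {"TTT": 0.45, "TTC": 0.55},
}


def reverse_translate(protein, codon_fractions=CODON_FRACTIONS):
    """Back-translate a protein using the most frequent codon for each residue."""
    codons = []
    for amino_acid in protein.upper():
        choices = codon_fractions[amino_acid]  # KeyError if the residue is not in the table
        codons.append(max(choices, key=choices.get))
    return "".join(codons)


if __name__ == "__main__":
    print(reverse_translate("MKFW"))
```

Because most amino acids have several codons, the result is one plausible nucleotide sequence among many, not a recovery of the original gene.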
583. vey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences. htgs: Unfinished High Throughput Genomic Sequences: phases 0, 1 and 2. Finished, phase 3 HTG sequences are in nr. pat: Nucleotides from the Patent division of GenBank. pdb: Sequences derived from the 3-dimensional structure records from Protein Data Bank. They are NOT the coding sequences for the corresponding proteins found in the same PDB record. month: All new or revised GenBank, EMBL, DDBJ and PDB sequences released in the last 30 days. alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. See "Alu alert" by Claverie and Makalowski, Nature 371: 752 (1994). dbsts: Database of Sequence Tag Site entries from the STS division of GenBank, EMBL and DDBJ. chromosome: Complete genomes and complete chromosomes from the NCBI Reference Sequence project. It overlaps with refseq_genomic. wgs: Assemblies of Whole Genome Shotgun sequences. env_nt: Sequences from environmental samples, such as uncultured bacterial samples isolated from soil or marine samples. The largest single source is the Sargasso Sea project. This does overlap with nucleotide nr. B.3 SNP BLAST databases. The list of databases for SNP Annotation Using BLAST is available at http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_accessible_blastdblist.html#8. Appendix C: Proteolytic cleavage enzymes. Most proteolytic enzymes cleave at distinct
584. vides information about the accessible and buried amino acid residues of globular proteins (Janin, 1979).
• Hopp-Woods. Hopp and Woods developed their hydrophobicity scale for identification of potentially antigenic sites in proteins. This scale is basically a hydrophilic index where apolar residues have been assigned negative values. Antigenic sites are likely to be predicted when using a window size of 7 (Hopp and Woods, 1983).
[CHAPTER 10. VIEWING AND EDITING SEQUENCES, page 136]
• Welling (Welling et al., 1985). Welling et al. used information on the relative occurrence of amino acids in antigenic regions to make a scale which is useful for prediction of antigenic regions. This method is better than the Hopp-Woods scale of hydrophobicity, which is also used to identify antigenic regions.
• Kolaskar-Tongaonkar. A semi-empirical method for prediction of antigenic regions has been developed (Kolaskar and Tongaonkar, 1990). This method also includes information on surface accessibility and flexibility, and at the time of publication the method was able to predict antigenic determinants with an accuracy of 75%.
• Surface Probability. Display of surface probability based on the algorithm by Emini et al. (1985). This algorithm has been used to identify antigenic determinants on the surface of proteins.
• Chain Flexibility. Display of backbone chain flexibility based on the algorithm by Karplus and Schulz (1985). It is known that chain flexibility is an indi
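Scales like these are usually applied by averaging the per-residue values over a sliding window, such as the window size of 7 mentioned for Hopp-Woods. The Python sketch below shows only that averaging step. The function name is an assumption, and the two-letter scale in the example is a toy placeholder, not the published Hopp-Woods values.

```python
def windowed_average(sequence, scale, window=7):
    """Average the scale values over every full window along the sequence."""
    values = [scale[residue] for residue in sequence.upper()]
    averages = []
    for start in range(len(values) - window + 1):
        averages.append(sum(values[start:start + window]) / window)
    return averages


if __name__ == "__main__":
    # Toy scale (hypothetical values): positive = hydrophilic, negative = apolar.
    toy_scale = {"K": 3.0, "L": -1.8}
    print(windowed_average("KKKKLLLLLLL", toy_scale))
```

Each returned value belongs to the residue at the centre of its window, so a sequence of length n yields n - window + 1 values. Sequences shorter than the window yield an empty list.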
585. [Figure 3.17: An empty Workspace.] 3.5.2 Select Workspace. When there is more than one Workspace in the workbench, there are two ways to switch between them: Workspace in the Toolbar | Select the Workspace to activate, or Workspace in the Menu Bar | Select Workspace | choose which Workspace to activate | OK. The name of the selected Workspace is shown after "CLC Combined Workbench" at the top left corner of the main window (in figure 3.17 it says: default). 3.5.3 Delete Workspace. Deleting a Workspace can be done in the following way: Workspace in the Menu Bar | Delete Workspace | choose which Workspace to delete | OK. Note: Be careful to select the right Workspace when deleting. The delete action cannot be undone. However, no data is lost, because a workspace is only a representation of data. It is not possible to delete the default workspace. 3.6 List of shortcuts. The keyboard shortcuts in
586. view, described in section 10.1.1. However, there are two additional groups of settings unique to the secondary structure 2D view: Secondary structure. [CHAPTER 22. RNA STRUCTURE, page 382]
• Follow structure selection. This setting pertains to the connection between the structures in the secondary structure table. If this option is checked, the structure displayed in the secondary structure 2D view will follow the structure selections made in this table. See section 22.2.2 for more information.
• Layout strategy. Specify the strategy used for the layout of the structure. In addition to these strategies, you can also modify the layout manually, as explained in the next section. Auto: The layout is adjusted to minimize overlapping structure elements (Han et al., 1999). This is the default setting (see figure 22.10). Proportional: Arc lengths are proportional to the number of residues (see figure 22.11). Nothing is done to prevent overlap. Even spread: Stems are spread evenly around loops, as shown in figure 22.12.
• Reset layout. If you have manually modified the layout of the structure, clicking this button will reset the structure to the way it was laid out when it was created.
[Figure 22.10: Auto layout. Overlaps are minimized.]
587. w listed in the Selected Elements window of the dialog. Use the arrows to add or remove sequences or sequence lists from the selected elements. Click Next to set parameters for the SignalP analysis. 16.1.1 Signal peptide prediction parameter settings. It is possible to set different options prior to running the analysis (see figure 16.1). An organism type should be selected. The default is eukaryote.
• Eukaryote (default)
• Gram-negative bacteria
• Gram-positive bacteria
You can perform the analysis on several protein sequences at a time. This will add annotations to all the sequences and open a view for each sequence if a signal peptide is found. If no signal peptide is found in the sequence, a dialog box will be shown. The predictions obtained can either be shown as annotations on the sequence, listed in a table, or be shown as the detailed and full text output from the SignalP method. This can be used to interpret borderline predictions:
• Add annotations to sequence
• Create table
• Text
[CHAPTER 16. PROTEIN ANALYSES, page 243] [Figure 16.1: Setting the parameters for signal peptide prediction.] Click Next if you wish to adjust how to handle the results (see section 9.1). If not, click Finish. 16.1.2 Signal
588. want to undo several actions, just repeat the steps above. To reverse the undo action: Click the redo icon in the Toolbar, or Edit | Redo, or Ctrl + Y. Note: Actions in the Navigation Area, e.g. renaming and moving elements, cannot be undone. However, you can restore deleted elements (see section 3.1.7). You can set the number of possible undo actions in the Preferences dialog (see section 5). 3.2.6 Arrange views in View Area. Views are arranged in the View Area by their tabs. The order of the views can be changed using drag and drop, e.g. drag the tab of one view onto the tab of another. The tab of the first view is now placed at the right side of the other tab. If a tab is dragged into a view, an area of the view is made gray (see fig. 3.11), illustrating that the view will be placed in this part of the View Area. [CHAPTER 3. USER INTERFACE, page 84] [Figure 3.11: When dragging a view, a gray area indicates where the view will be shown.] The result of this action is illustrated in figure 3.12. [Figure 3.12: A horizontal split screen. The two views split the View Area.] You can also split a View Area horizontally or vertically using the menus. Splitt
589. we classify the structure elements defining a secondary structure and describe their energy contribution. [Figure 22.29: The different structure elements of RNA secondary structures predicted with the free energy minimization algorithm in CLC Combined Workbench. See text for a detailed description.] Nested structure elements. The structure elements involving nested base pairs can be classified by a given base pair and the other base pairs that are nested and accessible from this pair. For a more elaborate description we refer the reader to Sankoff et al. (1983) and Zuker and Sankoff (1984). If the nucleotides with position numbers (i, j) form a base pair and i < k < l < j, then we say that the base pair (k, l) is accessible from (i, j) if there is no intermediate base pair (i', j') such that i < i' < k < l < j' < j. This means that (k, l) is nested within the pair (i, j) and there is no other base pair in between. [CHAPTER 22. RNA STRUCTURE, page 397] Using the number of accessible base pairs, we can define the following distinct structure elements: 1. Hairpin loop: A base pair with 0 other accessible base pairs forms a hairpin loop. The energy contribution of a hairpin is determined by the length of the unpaired (loop) region and the two bases adjacent to the closing base pair, which is termed a terminal mismatch (see figure 22.29A). 2. A base pair with 1 accessible base pair can give rise to three
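The definition of accessibility given in this excerpt translates directly into code. The Python sketch below, with hypothetical function names and a made-up example structure, lists the base pairs accessible from a closing pair and uses the count to recognise a hairpin loop. It illustrates the definition only and computes no energies.

```python
def accessible_pairs(closing, pairs):
    """Base pairs (k, l) nested in closing = (i, j) with no intermediate pair in between."""
    i, j = closing
    nested = [(k, l) for (k, l) in pairs if i < k < l < j]
    result = []
    for (k, l) in nested:
        # (k, l) is blocked if another nested pair (p, q) encloses it: i < p < k < l < q < j.
        blocked = any(p < k and l < q for (p, q) in nested if (p, q) != (k, l))
        if not blocked:
            result.append((k, l))
    return sorted(result)


def is_hairpin(closing, pairs):
    """A base pair with 0 accessible base pairs closes a hairpin loop."""
    return len(accessible_pairs(closing, pairs)) == 0


if __name__ == "__main__":
    # Made-up structure: one outer stem enclosing two hairpins.
    structure = [(1, 20), (2, 19), (5, 10), (12, 17)]
    for pair in structure:
        print(pair, accessible_pairs(pair, structure), is_hairpin(pair, structure))
```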
590. wish to reassemble the contig. This can be done in two ways: Toolbox in the Menu Bar | Sequencing Data Analyses | Reassemble Contig | select the contig and click Next, or right-click the empty white area of the contig | Reassemble contig. This opens a dialog as shown in figure 18.13. [Figure 18.13: Re-assembling a contig.] In this dialog, you can choose: [CHAPTER 18. SEQUENCING DATA ANALYSES AND ASSEMBLY, page 317]
• De novo assembly. This will perform a normal assembly in the same way as if you had selected the reads as individual sequences. When you click Next, you will follow the same steps as described in section 18.3. The consensus sequence of the contig will be ignored.
• Reference assembly. This will use the consensus sequence of the contig as reference. When you click Next, you will follow the same steps as described in section 18.4.
When you click Finish, a new contig is created, so you do not lose the information in the old contig. 18.8 Secondary peak calling. CLC Combined Workbench is able to detect secondary peaks (a peak within a peak) to help discover heterozygous mutations. Looking at the height of the peak below the top peak, the CLC Combined Workbench considers all positions in a sequence, and if a peak is higher than the threshold set by the user, it will be called
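The thresholding idea behind secondary peak calling can be sketched in a few lines of Python. The excerpt does not say how the threshold is expressed, so this sketch assumes it is a fraction of the top peak height at the same position. The function name, inputs and default value are hypothetical.

```python
def call_secondary_peaks(top_heights, second_heights, threshold=0.5):
    """Positions where the second-highest peak exceeds threshold * top peak height.

    Assumption: the threshold is a fraction of the top peak at the same position.
    """
    called = []
    for position, (top, second) in enumerate(zip(top_heights, second_heights)):
        if top > 0 and second / top > threshold:
            called.append(position)
    return called


if __name__ == "__main__":
    print(call_secondary_peaks([100, 100, 80], [10, 60, 40]))
```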
591. wn in figure 19.17, where you can specify which enzymes should initially be considered. At the top, you can choose to Use existing enzyme list. Clicking this option lets you select an enzyme list which is stored in the Navigation Area. See section 19.4 for more about creating and modifying enzyme lists. Below there are two panels:
• To the left, you see all the enzymes that are in the list selected above. If you have not chosen
[CHAPTER 10. VIEWING AND EDITING SEQUENCES, page 141] [Figure: the Show Enzymes Cutting Inside/Outside Selection dialog, step 1 (Enzymes to be considered in calculation), with the Use existing enzyme list option and two enzyme panels.]
592. wn in figure 2.22. [Figure 2.22: The result of the restriction map analysis is displayed in a table at the bottom and as annotations on the sequence in the view at the top.] The restriction map at the bottom can also be shown as a table of fragments produced by cutting the sequence with the enzymes: Click the Fragments button at the bottom of the view. In a similar way, the fragments can be shown on a virtual gel: Click the Gel button at the bottom of the view. 2.8 Tutorial: BLAST search. This tutorial shows you how to perform a BLAST search using CLC Combined Workbench. [CHAPTER 2. TUTORIALS, page 48] Suppose you are working with the NP_058652 protein, which constitutes the beta part of the hemoglobin molecule that is expressed in the adult house mouse, Mus musculus. To obtain more information about this molecule, you wish to query the Swiss-Prot database to find homologous proteins in humans, Homo
593. x | Nucleotide Analyses | Translate to Protein. A selection can also be copied to the clipboard and pasted into another program: make a selection | Ctrl + C (⌘ + C on Mac). Note: The annotations covering the selection will not be copied. A selection of a sequence can be edited as described in the following section. 10.1.4 Editing the sequence. When you make a selection, it can be edited by: right-click the selection | Edit Selection. A dialog appears displaying the sequence. You can add, remove or change the text and click OK. [CHAPTER 10. VIEWING AND EDITING SEQUENCES, page 145] The original selected part of the sequence is now replaced by the sequence entered in the dialog. This dialog also allows you to paste text into the sequence using Ctrl + V (⌘ + V on Mac). If you delete the text in the dialog and press OK, the selected text on the sequence will also be deleted. Another way to delete a part of the sequence is to: right-click the selection | Delete Selection. If you wish to correct only one residue, this is possible by simply making the selection cover only one residue and then typing the new residue. Another way to edit the sequence is by inserting a restriction site. See section 19.1.6. 10.1.5 Sequence region types. The various annotations on sequences cover parts of the sequence. Some cover an interval, some cover intervals with unknown endpoints, some cover more than one interval, etc. In the following, all
594. xplained 267 Right click on Mac 30 RNA secondary structure 402 RNA structure partition function 377 RNA structure prediction by minimum free en ergy minimization Bioinformatics explained 392 RNA structure file format 35 113 409 RNA translation 236 rnaml file format 112 Rotate 3D structure 200 Safe mode 25 Save changes in a view 82 sequence 41 style sheet 103 view preferences 103 workspace 89 Save enzyme list 138 330 Scale traces 302 SCF2 file format 35 113 409 SCF3 file format 35 113 409 Score BLAST search 181 Scoring matrices Bioinformatics explained 214 BLOSUM 214 PAM 214 Scroll wheel to zoom in 86 to zoom out 87 Search 96 in one location 96 INDEX 428 BLAST 173 for structures at NCBI 166 GenBank 160 GenBank file 155 handle results from GenBank 162 handle results from NCBI structure DB 168 handle results from UniProt 165 hits number of 100 in a sequence 136 in annotations 136 in Navigation Area 94 Local BLAST 176 local data 400 options GenBank 161 options GenBank structure search 167 options UniProt 164 parameters 161 164 167 patterns 227 230 Pfam domains 260 PubMed references 171 sequence in UniProt 171 sequence on Google 170 sequence on NCBI 170 sequence on web 170 TrEMBL 164 troubleshooting 98 UniProt 164 Secondary peak calling 317 Secondary structure predict RNA 402 Secondary structure prediction 262 401 Secondary st
595. [Figure 16.18: Choosing one or more protein sequences for secondary structure prediction.] After running the prediction as described above, the protein sequence will show predicted alpha helices and beta sheets as annotations on the original sequence (see figure 16.19). [Figure 16.19: Alpha helices and beta strands shown as annotations on the sequence.] Each annotation will carry a tooltip note saying that the corresponding annotation is predicted with CLC Combined Workbench. Additional notes can be added through the Edit Annotation right-click mouse menu. See section 10.3.2. Undesired alpha helices or beta sheets can be removed through the Delete Annotation right-click mouse menu. See section 10.3.4. 16.8 Protein report. CLC Combined Workbench is able to produce protein reports that allow you to easily generate different kinds of information regarding a protein. Actually, a protein report is a collection of some of the pr
596. [Figure 19.33: Selecting enzymes.] If you need more detailed information and filtering of the enzymes, either place your mouse cursor on an enzyme for one second to display additional information (see figure 19.34), or use the view of enzyme lists (see 19.4). Click Finish to open the enzyme list. [CHAPTER 19. CLONING AND CUTTING, page 346] [Figure 19.34: Showing additional information about an enzyme, like recognition sequence or a list of commercial vendors.] 19.4.2 View and modify enzyme list. An enzyme list is shown in figure 19.35. The list can be sorted by clicking the columns [Figure: table of restriction enzymes, 1362 rows.]
597. y compounds complexed with the protein in the resolved structure. 13.3.1 Identification. ID specifies an identifier for the subunit or compound as specified by the PDB or mmCIF record, while Type specifies the nature of the compound in question. Protein chains and RNA/DNA chains are specified as Polymers, while all other molecules, including water, are specified as Non-Polymers. The Name of the compound is also displayed, as specified by the PDB or mmCIF record. The ID is appended to the structure identifier when opening sequence information (see below). [CHAPTER 13. 3D MOLECULE VIEWING, page 202] 13.3.2 Opening sequence information. Only Polymer sequences can be opened in a sequence view. This is done by right-clicking the appropriate table element and selecting Open Sequence. Editing a sequence directly is not allowed, in order to preserve consistency between the displayed 3D structure and the sequence. A number of analyses can be performed on the sequence when it is opened in a new view, e.g. finding Pfam domains or motifs, which can be added to the sequence as any other annotation. If amino acids in the sequence view are colored in gray, they are not present in the structure view. A structure file imported into the Workbench often carries linear sequence data which is not present in the structure data, and this is indicated by the gray color. The sequence is named according to the structure, with the ID of the subunit appended. For example, the A ch
598. y file containing a valid license, you can import it by clicking the import button below. If you do not have a license, you can request an evaluation license online by clicking the request button below while being connected to the internet, or by sending an email to license@clcbio.com. If you experience any problems, please contact support@clcbio.com. [Buttons: Request evaluation license; Import a license key file; Configure network license.] Figure 1.2: Selecting Request evaluation license. When the license key is received, you will be asked to accept the License agreement shown in figure 1.3. [License Assistant, CLC Combined Workbench: Get license, Accept agreement, Activate license.] END USER LICENSE AGREEMENT FOR CLC BIO SOFTWARE, CLC Combined Workbench 3.5. 1 Recitals. 1.1 This End-User License Agreement ("EULA") is a legal agreement between you (either an individual person or a single legal entity, who will be referred to in this EULA as "You") and CLC bio A/S, CVR no. 28 30 50 8, for the software products that accompanies this EULA, including any associated media, printed materials and electronic documentation (the "Software Product"). 1.2 The Software Product also includes any software updates, add-on components, web services and/or supplements that CLC bio may provide to You or make available to You after the date You obtain Your initial copy of the Software Product, to the extent that such items ar
599. [Figure 19.15: Selecting enzymes.] If you need more detailed information and filtering of the enzymes, either place your mouse cursor on an enzyme for one second to display additional information (see figure 19.34), or use the view of enzyme lists (see 19.4). [CHAPTER 19. CLONING AND CUTTING, page 332] Figure 19.16: Showing additional information about an enzyme, like recognition sequence or a list
600. you close the dialog, you will be asked whether you wish to restart the workbench. The plug-in will not be uninstalled before the workbench is restarted. [CHAPTER 1. INTRODUCTION TO CLC COMBINED WORKBENCH, page 29] 1.7.3 Updating plug-ins. If a new version of a plug-in is available, you will get a notification when the Workbench starts, as shown in figure 1.17. [Figure 1.17: Plug-in updates.] In this list, select which plug-ins you wish to update, and click Install Updates. If you press Cancel, you will be able to install the plug-ins later by clicking Check for Updates in the Plug-in manager (see figure 1.16). 1.7.4 Resources. Resources are downloaded, installed, un-installed and updated the same way as plug-ins. Click the Download Resources tab at the top of the plug-in manager, and you will see a list of available resources (see figure 1.18). Currently, the only resources available are PFAM databases (for use with CLC Protein Workbench and CLC Combined Workbench). Because procedures for downloading, installation, uninstallat
