PhenoLink user guide

1. PhenoLink removes all but one of the highly correlated features. Features with similar or identical values across all observations (that is, with very low variance; the default cutoff is 0.05) decrease classification accuracy, so such features are also discarded by default. Additionally, in phenotype data many strains may exhibit the same phenotype (the dominating phenotype) while only a few have a different phenotype. Such imbalance in phenotype data is decreased by bagging, for which two procedures are available: multiple down-sizing and multiple covering.

PhenoLink offers two procedures to identify relevant features based on predictive scores generated by the Random Forest algorithm: (i) select only relevant features, or (ii) discard irrelevant features. The selection procedure is applied iteratively until no more than a certain number of features (default of 5) are removed. Once the final set of relevant features is selected, features that are highly correlated with any feature in this set are added to the list of relevant features.

Links identified by PhenoLink are visualized to allow better identification of relations between features and phenotypes, among features, and among phenotypes. Additionally, this enhanced visualization allows users to search and sort feature names, hide columns, and limit the number of displayed rows.

In the following sections, PhenoLink is demonstrated with omics and phenotype data of 42 Lactobacillus plantarum strains, used in actual r
2. Figure 10. PhenoLink results page with links to results, visualization of results, and preprocessed files. The screenshot lists links in each experiment (with and without annotation), a classification report for each experiment, and the preprocessed data sets used in the analysis.

Visualization (Fig. 11, 12 and 13)

There are three different kinds of plots, two of which visualize the links found by PhenoLink. These visualizations show relations to all phenotypes (see Fig. 11) and to the phenotypes of a single experiment (see Fig. 12). Columns of these tables can be hidden by clicking the tick marks shown below the phenotype names. Classification performance for each experiment is shown as a bar plot like the one in Fig. 13.

Legend:
Entry is important for a phenotype and it is sufficiently present in strains of this phenotype.
Entry is not important for a phenotype but it is sufficiently present in strains of this phenotype.
Entry is important for a phenotype and it is sufficiently absent in strains of this phenotype.
Entry is not important for a phenotype but it is sufficiently absent in strains of this phenotype.
Entry is important for a phenotype but it is not sufficiently present or absent in strains of this phenotype.
Entry is neither important for a phenotype nor it is suffici
3. kept
Fields shown in the screenshot (continued): Merge feature contribution scores (Median); Feature selection procedure (Keep significant features); Percentage of instances (strains) a feature must be present (floating point number); Percentage of instances (strains) a feature must be absent (floating point number); Visualize links to phenotypes of each experiment as separate HTML files; Visualize classification results for each experiment (Yes); Proceed.

Figure 6. Parameter settings page for PhenoLink. Note that since this web page is large, its screenshot image is shown as two separate figures: Figure 5 (see above) and this figure.

Figure 7. (A) Enabling the binarization option shows a text box (B) to enter a cutoff value. Fields shown in the screenshot: Characters that represent missing values (comma-delimited characters); Binarize continuous feature values; Cutoff to binarize continuous values; Phenotypes to be discarded (comma-delimited characters).

Figure 8. Disabling the bagging option hides all bagging-related parameters. Fields shown in the screenshot: Minimum variance (0.05; floating point number, 0 to 0.1); Classification: bagging, Use bagging (No); Classification: feature selection, Error cutoff (floating point number).

Run
4. usage for each bagging procedure. In case of multiple down-sizing, this number of bags will be created. In the multiple covering procedure, at least this number times a number defined in Fig. 5J bags would be created. The recommended value for large data sets is smaller because each bag is classified separately, requiring substantial computational resources. For small data sets even the maximum value of 100 should not be a problem with multiple down-sizing.

An imbalance in phenotype data can be detected by comparing the number of instances with each phenotype. The phenotype with the maximum number of instances is the dominating phenotype, and the phenotype with the minimum number of instances is the repressed phenotype. We define phenotype data as imbalanced if the dominating phenotype has at least r times as many instances as the repressed phenotype. The recommended value of 2 for the cutoff r can be changed in the text box shown in Fig. 5J.

Instances (here strains) of phenotypes with few instances are prone to misclassification. Thus, phenotypes with fewer than a predefined number of instances are not used in classification. This cutoff is 4 by default, but it can be changed in the text box shown in Fig. 5K.

Phenotype data given as continuous values are binned prior to classification. For large data sets, more bins would result in a more accurate description of phenotypic measurements; however, for small data sets, e.g. for L. plantarum
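The imbalance rule and the minimum-instance cutoff described above can be sketched in a few lines of Python. This is only an illustration, not PhenoLink's own code: the function names are hypothetical, and it assumes that phenotypes with too few instances are dropped before the dominating and repressed phenotypes are compared, an order the guide does not state.

```python
from collections import Counter


def usable_phenotypes(labels, min_instances=4):
    """Count strains per phenotype and keep phenotypes with enough strains.

    min_instances mirrors the default of 4 (Fig. 5K).
    """
    counts = Counter(labels)
    return {p: n for p, n in counts.items() if n >= min_instances}


def is_imbalanced(labels, ratio_cutoff=2, min_instances=4):
    """Return True if the dominating phenotype has at least ratio_cutoff
    times as many strains as the repressed phenotype (cutoff r, Fig. 5J).

    Assumption: small phenotypes are filtered out first.
    """
    counts = usable_phenotypes(labels, min_instances)
    if len(counts) < 2:
        return False  # nothing to compare against
    dominating = max(counts.values())
    repressed = min(counts.values())
    return dominating >= ratio_cutoff * repressed
```

For example, 30 strains labelled Yes, 10 labelled No and 2 labelled Maybe would leave two usable phenotypes and be flagged as imbalanced, because 30 is at least 2 times 10.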
5. 1
Fields shown in the screenshot (continued): Spearman's cutoff (floating point number, 0.8 to 1); Minimum variance (floating point number, 0 to 0.1).
Classification: bagging. Use bagging (Yes); Bagging procedure (Multiple down-sizing); Number of bags to create / Number of times instances of majority phenotype is sampled (10; integer, 5 to 100); Ratio of largest phenotype size to smallest phenotype size (2; integer, 1 to 10); Minimum number of organisms with any phenotype (4; integer); Bin continuous phenotype measurements into this number of bins (3; integer); Bin names (comma-separated) or a bin class name prefix (characters).

Figure 5. Parameter settings page for PhenoLink. Note that since this web page is large, its screenshot image is shown as two separate figures: this figure and Figure 6 (see below).

Classification: feature selection. Minimum classification accuracy (0.6; floating point number, 0 to 1); Multiply mtry parameter with (1; floating point number, 1 to 10); Take the top N features with highest importance for accurately classified phenotype (50; integer, 10 to 100); Number of trees for the Random Forest algorithm (integer, 50 to 5000); Percentage of instances (strains) [text illegible] important (0.1; floating point number, 0 to 0.7); Feature irrelevance/relevance count (integer, 1 to 10) to be removed
6. Figure 13. Classification performance using data from the D-Turanose sugar utilization experiment. Horizontal axis: the number of bags generated. Vertical axis: strain names with their phenotypes as suffixes; growth on this sugar is indicated by the suffix Yes and no growth by the suffix No. The length of a bar represents how many times a strain with a particular phenotype has been used in classification, and colors represent how many times a strain was correctly (black) or incorrectly (gray) classified.
7. Brief introduction

PhenoLink is an easily accessible web tool to link phenotypes to omics data. It requires both omics data (see Fig. 3D) and phenotype data (see Fig. 3E) as tab-delimited text files (see Fig. 1A and Fig. 2). The first column of these files must contain information about strains; thus, for a given strain the same identifier must be used in both files. For strains with public GenBank (NCBI) files, one can select the corresponding file from the GenBank files list shown in Fig. 3A, and the selected files will be used to add annotation information to the genes uploaded in the omics data set. When there is no GenBank file for the uploaded omics data, or the omics data do not contain information about genes, one can upload a tab-delimited annotation file (see Fig. 2C and Fig. 3B).

PhenoLink can be used in actual mode (see Fig. 3C) or demo mode (see Fig. 3F). Input data are only necessary in actual mode. In demo mode, Lactobacillus plantarum data are used; these data were also used to demonstrate the applicability of PhenoLink. After selecting input data and run mode, click the Upload File(s) button (see Fig. 3H) to go to the Settings page.

The default parameter settings are often sufficient for linking omics to phenotype data. However, the following parameters might need to be adapted to the uploaded data: discarded phenotypes (see Fig. 5C), bin count and names of bins for continuous values (see Fig. 5L and Fig. 5M), and visualization of links to ph
8. Fields shown in the screenshot (continued): Required: upload tab-delimited omics file (the first columns of the omics and phenotypes files must be the same). Required: upload tab-delimited phenotypes file (the first columns of the omics and phenotypes files must be the same). Run in demo mode: will use omics data for 42 Lactobacillus plantarum strains and their growth on different sugars (based on API tests) and nitrogen dioxide production. Note: if you only select L. plantarum genomes and/or plasmids, then genes that were linked to phenotypes will have additional information, namely the gene's start and end positions, strand, function, and gene name. Genotype and phenotype data of L. plantarum used in demo mode can be downloaded from the links shown below.

Demo data sets listed in the screenshot (columns: omics data type, phenotype data type, species):
Gene occurrence; Sugar growth and NO2 production test; Lactobacillus plantarum
cDNA array hybridization results at 3 time points (3h, 9h and 15h); Transposon mutant library and time point information; Streptococcus pneumoniae

Figure 4. Start page of PhenoLink.

Uploading phenotype and omics data sets (Fig. 4)

In this guide we use the presence/absence of genes in 42 L. plantarum strains and phenotypic assessments of these strains under various experimental conditions. These data sets can be downloaded by right-clicking the link Presence/absence file (see Fig. 4G) and then choosing the Save Link As command. In the same way, download phenoty
9. Figure 3. Start page of PhenoLink.

Association analysis with PhenoLink

PhenoLink is used to identify links to phenotypes from omics data, as briefly described in the previous section. These data sets are often large, which makes identifying links to phenotypes difficult. Therefore, we use the Random Forest algorithm to select features that are relevant for a phenotype. Since this algorithm builds an ensemble of trees, highly correlated features would get predictive scores that are biased towards their selection order in tree building. A pair of features is highly correlated if their correlation is above a certain threshold based on Pearson's (default of 0.98) and Spearman's (default of 0.95) correlation metrics.
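As an illustration of the correlation rule above, the following Python sketch flags a pair of features as highly correlated and keeps only one feature per correlated group. It is a minimal sketch, not PhenoLink's implementation: the function names are hypothetical, it assumes that both the Pearson and the Spearman cutoff must be exceeded, and it arbitrarily keeps the first feature of each group.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant feature: correlation undefined, treated as 0
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def highly_correlated(x, y, pearson_cutoff=0.98, spearman_cutoff=0.95):
    # Assumption: a pair counts as highly correlated only if BOTH cutoffs
    # are exceeded; the guide does not spell out how they are combined.
    return pearson(x, y) > pearson_cutoff and spearman(x, y) > spearman_cutoff


def drop_correlated(features):
    """features maps a feature name to its values across strains.

    Returns the names kept, retaining the first feature of each
    highly correlated group.
    """
    kept = []
    for name, values in features.items():
        if not any(highly_correlated(values, features[k]) for k in kept):
            kept.append(name)
    return kept
```

With two identical gene presence/absence profiles and one different profile, only one of the identical pair and the different profile would be kept.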
10. and lack of easy to use tools. We present an easily accessible web tool, PhenoLink. It preprocesses input data to decrease noise and uses classification-based feature selection to accurately find features that are linked to phenotypes. It identifies links to phenotypes more accurately than correlation-based methods and works much faster than Bayesian-based association algorithms. Additionally, visualization of links allows quick identification of relations (i) between features and phenotypes, (ii) among features, (iii) among phenotypes, and (iv) features and organisms which use different feature sets to exhibit the same phenotype. Visualizing classification accuracy for each experiment separately would allow detecting noisy measurements. Identified links might be used to improve feature annotations in selected cases without experimental validation. PhenoLink therefore allows researchers to quickly screen large data sets for new leads to phenotype associations.

Data Submission Form: Use this form to choose GenBank files from the list of available GenBank files. GenBank files are only necessary (1) if you uploaded omics data where features are genes, such as in gene expression or gene presence/absence data; (2) if you are interested in adding extra information besides gene names in the visualization. Your data will be stored on our server for up to three weeks and
11. data, the default bin count (defined in the text box shown in Fig. 5L) should be sufficient. For large data sets, e.g. phenotype data with more than 100 instances (here strains), a bin count of 4 or above would be more adequate. By default, bins are named following this convention: class1, class2, ..., classN. Here N is the number of bins defined in the previous step. However, the naming can be changed to obtain more meaningful names, e.g. for 3 bins: low, medium, high. If multiple names are used, they should be separated by commas in the text box shown in Fig. 5M.

Classification: feature selection

1. The Random Forest algorithm estimates the classification error for each class (phenotype), which indicates how many instances (here strains) of a phenotype have been correctly identified. Only the results of the association analysis for phenotypes with a classification error below the default cutoff of 40% (defined in the text box in Fig. 6A) are listed.

In the Random Forest algorithm, for each split in a tree, m features (m being the square root of the number of features) are chosen randomly. For omics data sets with many features, multiplying this number by a factor larger than the default of 1 (defined in the text box in Fig. 6B) allows more features to be considered for each split, which can increase classification accuracy.

Feature selection based on the Random Forest algorithm decreases the number of possibly relevant features for a phenotype. However, for some pheno
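The two settings described above, the mtry multiplier and the classification error cutoff, amount to simple arithmetic. The sketch below is a hedged illustration with hypothetical function names; rounding down and capping at the number of features are assumptions, since the guide only states that m is the square root of the number of features times a multiplier.

```python
import math


def mtry(n_features, multiplier=1.0):
    """Number of features tried at each split.

    Assumption: sqrt(n_features) * multiplier is rounded down, and the
    result is kept between 1 and n_features.
    """
    value = math.floor(math.sqrt(n_features) * multiplier)
    return max(1, min(n_features, value))


def reported_phenotypes(class_errors, error_cutoff=0.4):
    """Names of phenotypes whose classification error is below the cutoff.

    class_errors maps a phenotype name to its error as a fraction;
    0.4 corresponds to the 40% default.
    """
    return [name for name, err in class_errors.items() if err < error_cutoff]
```

For a data set with 2500 features, the default multiplier of 1 gives 50 features per split and a multiplier of 2 gives 100.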
12. ed features
Finished removing correlated features
Started imputing omics data
There are no missing values in omics data
Finished imputing omics data
Started feature selection process
Phase details: Classifying phenotype data for an experiment: API_K Gluconate
Refreshing in 5 seconds

Figure 9. Run phase in PhenoLink shows each step involved in the association analysis.

Results (Fig. 10)

In the Results page, links to downloadable files are shown, which include the results of the association analysis (Fig. 10A) and links to the visualization of the results (Fig. 10B); clicking See displays the visualization in a new page. Fig. 10C shows links to the preprocessed omics and phenotype data; clicking See displays the content of the file in a new page.

Screenshot text (PhenoLink 3/3 RESULTS): Please bookmark this page if you decide to check back later. Note: PhenoLink runs on a Quad Core 3 GHz. Depending on the load, it takes about 10 min to complete a run. Started at Mon Dec 19 17:31:35 CET 2011; ended at Mon Dec 19 17:34:12 CET 2011. Sections: Parameters used for this run; Results of association analysis; Visualization of results (links in all experiments without annotation; links in all experiments with annotation)
13. enotypes for each experiment (see Fig. 6K). If the supplied omics data are not binary, change the option shown in Fig. 5B to Yes, which will show another text box below this drop-down box (see Fig. 7). Enter a cutoff value in this new text box. However, binarizing continuous feature values is only necessary for visualization of identified relations. Bagging is enabled by default to minimize imbalance in phenotype data; it can be disabled (see Fig. 5G and Fig. 8), though this is not recommended. All these parameters are explained in detail in the Modifying process settings section of this guide below.

Once all parameters are set, the association analysis can be started by clicking the Proceed button (see Fig. 6M), and information about each step in the analysis is shown (see Fig. 9). The typical run time of PhenoLink for the L. plantarum genotype and phenotype data is around 10 minutes; however, it differs depending on the data uploaded. After the association analysis has finished successfully, links to results are displayed (see Fig. 10). These links include visualization of relations between features and all phenotypes (see Fig. 11), visualization of relations between features and phenotypes of a single experiment (see Fig. 12), and classification performance for each experiment (see Fig. 13).

Flow diagram text (Figure 1): Remove homogeneous features / highly correlated features; Decrease class imbalance by bagging. (C) Classify omics data for each experime
14. ently present or absent in strains of this phenotype.

Figure 11. Visualization of relations between features (rows) and all phenotypes (columns). Columns of the table can be hidden by clicking the tick marks shown below the phenotype names. Column headers in the screenshot are phenotype names with a class suffix, for example L_Rhamnose_Yes, L_Arabinose_Yes, D_Turanose_No, NO2production_Yes, SucrosePerc10_class1 and pH4_class2; the first column is FeatureId.

Meaning:
Entry is important for a phenotype and it is present in a strain.
Entry is not important for a phenotype but it is present in a strain.
Entry is important for a phenotype and it is absent in a strain.
Entry is not important for a phenotype but it is absent in a strain.
Strains with this phenotype have not been accurately classified.
15. Data Submission Form (screenshot text, continued): Your data will be stored on our server for up to three weeks and will be kept confidential. Select GenBank files for each strain of which gene content information is used in omics data. Entries legible in the screenshot: Lactobacillus plantarum ST-III uid53537 NC_014554 Chromosome; Lactobacillus reuteri DSM 20016 uid58471 NC_009513; Lactobacillus reuteri JCM 1112 uid58875 NC_010609 Chromosome; Lactobacillus reuteri SD2112 uid55357 NC_015697 Chromosome; Lactobacillus reuteri SD2112 uid55357 NC_015698 Plasmid pLR585. Optional: upload a tab-delimited annotation file, which will be used in visualization; this could be useful if no GenBank file is available, for instance for GC-MS data. The first column must have information about at least one feature (e.g. a peak value) that you supplied in the omics data. Run in actual mode
16. he text box shown in Fig. 5C empty (default); otherwise, write the phenotypes that should be discarded in this text box.

Features with Pearson's and Spearman's correlation scores above certain cutoff values are assumed to be highly correlated. These cutoff values are by default 0.98 and 0.95 for Pearson's and Spearman's metrics, respectively (see Fig. 5D and Fig. 5E).

Features that have similar or the same value across many or all observations, i.e. features with low variance, are not used in classification. The minimum variance can be defined in the text box shown in Fig. 5F. Setting this value to 0 (zero) would use such features in classification.

Classification: bagging

Imbalance in phenotype data can be decreased by either of the two bagging procedures. It is recommended to always enable bagging, even if there is no imbalance in the phenotype data, because for such a data set bagging will not create any bags. Though it is not recommended, bagging can be disabled by choosing the No option from the drop-down box shown in Fig. 5G (see also Fig. 8). There are two types of bagging procedures to create bags, Multiple down-sizing and Multiple covering, as shown in Fig. 5H. The latter procedure guarantees that each member of a phenotype with many instances is used at least a predefined number of times. However, the former method is recommended to create bags (see Manuscript text). The number shown in the text box in Fig. 5I has different
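The minimum-variance filter described above can be illustrated with a short Python sketch. The function name is hypothetical, and the use of the population variance is an assumption; the guide does not say which variance estimator PhenoLink uses.

```python
from statistics import pvariance


def remove_low_variance(features, min_variance=0.05):
    """Keep features whose variance across strains reaches min_variance.

    features maps a feature name to its values across strains.
    Assumption: population variance is used. With min_variance=0 every
    feature is kept, matching the note that a cutoff of zero uses such
    features in classification.
    """
    return {
        name: values
        for name, values in features.items()
        if pvariance(values) >= min_variance
    }
```

A gene present in 39 of 40 strains has a variance of about 0.024 and would be removed at the default cutoff of 0.05, whereas a gene present in half of the strains (variance 0.25) would be kept.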
17. ibed in the Brief introduction section, the first column of this file should contain information about the organisms used in this study.

Screenshot text (PhenoLink 1/3 DATA UPLOAD), by Jumamurat R. Bayjanov, Douwe Molenaar, Roland J. Siezen and Sacha A. F. T. van Hijum.
18. nt; Select/discard features with m times positive/negative contributions; At least k features are removed. (D) Visualize links to phenotypes.

Figure 1. Flow diagram of PhenoLink.

(A) Omics data. Column headers: NizoName, lp_0001, lp_0002, lp_0004, lp_0005. Rows: CIP102359, CIP104448, NCDO1193, NIZO1836, NIZO1837, NIZO1838, NIZO1839, NIZO1840. The cell values are not recoverable from the extracted text.

(B) Phenotype data:
NizoName     NO2production   D_Arabinose   L_Arabinose
CIP102359    Yes             No            Yes
CIP104448    No              No            No
NCDO1193     No              No            Yes
NIZO1836     Yes             No            Yes
NIZO1837     No              NA            NA
NIZO1838     No              No            No
NIZO1839     No              No            No
NIZO1840     No              No            Yes

(C) Annotation data (the first column with feature identifiers and the first start value are not legible):
start         stop   gene name   function
(illegible)   1365   dnaA        chromosomal replication initiation protein DnaA
1546          2682   dnaN        DNA-directed DNA polymerase III, beta chain
3210          3440   lp_0004     unknown
3444          4565   recF        DNA repair and genetic recombination protein RecF
4565          6508   gyrB        DNA gyrase, B subunit
6676          9234   gyrA        DNA gyrase, A subunit

Figure 2. Omics (A), phenotype (B), and annotation (C) data should be uploaded as tab-delimited text files. Uploading an annotation file is optional.

Screenshot text (PhenoLink 1/3 DATA UPLOAD), by Jumamurat R. Bayjanov, Douwe Molenaar, Roland J. Siezen and Sacha A. F. T. van Hijum: Linking phenotypes to large omics data sets is essential for generating leads to understand the underlying mechanism of a phenotype. Often such analysis is hindered by the scale of data
19. p down box shown in Fig. 5B if the supplied omics data are already binary. Enabling binarization of omics data by choosing the Yes option will show a new text box just below this drop-down box (see Fig. 7), and you can define a cutoff to binarize the data in this text box (read the next step). With the default setting of No, continuous values are binarized by using a cutoff which is the average of the maximum and minimum values in the omics data.

2. Continuous values below a predefined cutoff value are treated as zero (e.g. absent or lowly expressed), and values above or equal to the cutoff value are treated as one (e.g. present or highly expressed). The default cutoff value is calculated as the average of the maximum and minimum values in the data set. This cutoff value can be changed in the field shown in Fig. 7B to suit your needs.

3. Sometimes the phenotype of an organism cannot be reliably determined. For instance, in the L. plantarum phenotype data, the phenotype of certain strains could not be identified reliably in some experiments, resulting in the phenotype Maybe. Strains with such ambiguous phenotypes should not be used in the association analysis, to increase classification accuracy. If there are several ambiguously defined phenotypes (e.g. Maybe, Putative), they can be discarded by listing the names of all these phenotypes, separated by commas. If there are no such phenotypes, or you want to include them in the association analysis, then leave t
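The binarization rule from step 2 above (values below the cutoff become 0, values at or above it become 1, with a default cutoff halfway between the minimum and maximum) can be written as a small Python sketch. The function names are hypothetical, and the sketch assumes the data contain no missing values.

```python
def default_cutoff(matrix):
    """Average of the maximum and minimum values in the whole data set."""
    values = [v for row in matrix for v in row]
    return (max(values) + min(values)) / 2


def binarize(matrix, cutoff=None):
    """Map values below the cutoff to 0 and values at or above it to 1.

    matrix is a list of rows (one row per strain); if no cutoff is given,
    the default described in the guide is used.
    """
    if cutoff is None:
        cutoff = default_cutoff(matrix)
    return [[1 if v >= cutoff else 0 for v in row] for row in matrix]
```

For expression values ranging from 0.0 to 4.0, the default cutoff is 2.0, so a value of exactly 2.0 is treated as present (1).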
20. pe data from the link Phenotype information file (see Fig. 4G). Note: the Save Link As command shown in Firefox might be named differently in other browsers. Having downloaded these files, click the Browse button shown in Fig. 4D and select the presence/absence file you have just downloaded; for the phenotypes file, upload the second file you downloaded by clicking the Browse button shown in Fig. 4E. PhenoLink runs in actual mode by default; make sure actual mode is chosen (see Fig. 4C). Click the Upload File(s) button shown in Fig. 4H to proceed to the next step.

Modifying process settings (Fig. 5 and Fig. 6)

Parameter settings for data preprocessing and phenotype-to-omics association analysis can be changed on the web interface (Fig. 5 and Fig. 6). Generally, the predefined values should be sufficient for typical omics and phenotype data. So, before modifying any parameter, it is recommended to read more about each parameter by clicking the link shown in Fig. 5A and by reading further in this guide. In the following subsections we explain what each parameter is and how to change it to optimize the association analysis for your own needs.

Data upload and preprocessing

1. Features in a given omics data set might have continuous values (e.g. gene expression data). However, binary values are used only for visualization purposes. There is no need to change the default option of No in a dro
21. phase (Fig. 9)

Once all parameters are configured, PhenoLink starts the association analysis and the web page is refreshed every 5 seconds, showing each step of the association analysis phase. The run phase for association analysis using L. plantarum gene presence/absence and phenotype data is shown in Fig. 9. Some processes may take longer, so their sub-processes are shown in the phase details section (see Fig. 9A). Once the process is finished, the phase details section is no longer shown. After the association analysis finishes, typically requiring around 10 minutes, the results of the association analysis would be comparable to those of Fig. 10.

Screenshot text (PhenoLink 3/3 RESULTS): Please bookmark this page if you decide to check back later. Note: PhenoLink runs on a Quad Core 3 GHz. Depending on the load, it takes about 30 min to complete a run. Started at Wed May 11 12:39:37 CEST 2011. Parameters used for this run.

Run phase
Started removing inconsistent rows
Finished removing inconsistent rows
Started validating features file
Finished validating features file
Started validating responses file
Finished validating responses file
Started removing features with variance below 0.050000
Finished removing features with standard deviation below 0.050000
Started removing correlat
22. relations.

7. The contribution of each feature to correctly classifying a strain of a phenotype is determined by the Random Forest algorithm. However, in case of bagging, where strains of a phenotype are generally used more than once, the contribution scores for each strain in multiple classifications are merged to obtain a general contribution score of a feature for a given strain. The default method to merge contribution scores takes the median of all scores (defined in the drop-down box shown in Fig. 6G). This method is more robust than averaging the contribution scores because, when there is a single positive contribution score and all other contribution scores are zero, averaging would still result in a positive score.

8. In PhenoLink, the feature selection/elimination process can be defined either as using only relevant features or as discarding irrelevant features in the next classification step. Both procedures (shown in Fig. 6H) give similar results.

Visualization

1. There are three types of visualizations, two of which can be disabled or enabled in the settings page. Visualization of links to all phenotypes is always provided. A feature is considered sufficiently present if it is present in at least a predefined percentage of strains of a phenotype. This cutoff can be defined in the text box shown in Fig. 6I. The sufficient presence level of a feature is used in visualization, where it is merged with the feature's phenotype importance, i.e. the sum of the feature
23. s contribution score to classify each strain of a phenotype.

2. Similar to the previous step, a feature is considered sufficiently absent if it is absent in at least a predefined percentage of strains of a phenotype. This cutoff can be defined in the text box shown in Fig. 6J. The sufficient absence level of a feature is used in visualization, where it is merged with the feature's phenotype importance, i.e. the sum of the feature's contribution scores to classify each strain of a phenotype.

3. Visualization of the relationship between relevant features and strains of a phenotype for each experiment is disabled by default, as shown in Fig. 6K. Enabling it allows identifying relationships between phenotypes, strains, and features.

4. Classification results for each experiment can be visualized to identify which strains were misclassified more often than others. This visualization is enabled by default (drop-down box in Fig. 6L).

Once all parameters are configured, the association analysis begins when you click the Proceed button at the bottom of the page, as shown in Fig. 6M.

Screenshot text (PhenoLink 2/3 SETTINGS), fields shown: Data upload and preprocessing; Phenotypes to be discarded (comma-delimited characters); Pearson's cutoff (floating point number, 0.8 to
24. Figure 12. Visualization of relations between features (rows) and phenotypes (columns) of a single experiment (L-Arabinose sugar utilization test). Columns of the table can be hidden by clicking the tick marks shown below the phenotype names. Column headers in the screenshot are strain names with their phenotype as suffix (for example NIZO2766_No and NIZO2741_Yes); the first column is GeneId, and the table footer reads: Showing 1 to 25 of 27 entries.

Figure 13 plot title: Classification of strains on D_Turanose. Legend: Correct, Incorrect. Vertical axis: strain names with phenotype suffixes (for example NIZO3400_No and NIZO2897_Yes).
25. types still many relevant features could be identified. This list can be reduced by selecting only the top N features based on their importance for a given phenotype. The recommended number of top features, 50, can be changed in the text box shown in Fig. 6C.

The Random Forest algorithm builds many trees to classify input data. By default, PhenoLink trains 500 trees (Fig. 6D). For typical omics and phenotype data sets this number should not be changed, but for very large data sets it can be increased to identify links to phenotypes more accurately. Increasing the number of trees also increases the time required for the association analysis.

A feature with a positive contribution to classifying a phenotype may in some cases have obtained that positive score by chance. A feature must therefore contribute positively and consistently to the classification of at least a certain percentage (default 10) of the strains of a phenotype. A large cutoff value, defined in the text box shown in Fig. 6E, decreases the number of relevant features, so that only very obvious relations are identified.

(6) To make the feature selection procedure more stable, the same data is classified 3 times by default. Only features identified as relevant in all classifications are considered relevant, which decreases the chance of identifying wrong relations. Note that higher values, defined in the text box shown in Fig. 6F, would increase the time to identify
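The two filters described above (consistent positive contribution, and agreement across repeated classifications) can be sketched in plain Python. This is a hedged illustration rather than PhenoLink's implementation: the function names, the data layout, and the example scores are assumptions, and the per-strain contribution scores are taken as given instead of being computed by a Random Forest.

```python
def consistently_contributing(scores, min_percent=10.0):
    """Keep features with a positive score in at least min_percent of strains.

    scores: dict mapping feature -> list of per-strain contribution scores
            for one phenotype (assumed layout).
    """
    kept = set()
    for feature, per_strain in scores.items():
        if not per_strain:
            continue
        positive = sum(1 for value in per_strain if value > 0)
        if 100.0 * positive / len(per_strain) >= min_percent:
            kept.add(feature)
    return kept


def stable_features(runs):
    """Return features found relevant in every repeated classification.

    runs: list of sets of relevant features, one set per repetition.
    """
    if not runs:
        return set()
    result = set(runs[0])
    for run in runs[1:]:
        result &= set(run)
    return result


# Illustrative values only.
example_scores = {"g1": [0.2, -0.1, 0.0, 0.3], "g2": [-0.1] * 10}
kept_default = consistently_contributing(example_scores)        # 10 percent cutoff
kept_strict = consistently_contributing(example_scores, 75.0)   # stricter cutoff

three_runs = [{"g1", "g2", "g3"}, {"g1", "g3"}, {"g1", "g3", "g4"}]
stable = stable_features(three_runs)
```

Taking the intersection over repetitions trades run time for stability: each additional repetition can only shrink the set of features reported as relevant.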
26. un mode of the tool. In demo run mode the same data set would be used. These data sets are described in "PhenoLink, a web tool for linking phenotype to omics data for bacteria: application to gene-trait matching for Lactobacillus plantarum strains" (manuscript submitted).

Selection of annotation information source (Fig. 4)

This step is only necessary if you want to add additional information to the visualization of links to phenotypes. A genbank file can be chosen from the genbank files list, as shown in Fig. 4A, only when (i) the uploaded omics data contains information about genes (e.g., gene presence/absence or gene expression data) and (ii) the organisms used in the design of the omics experiment (e.g., the organisms used in designing microarray probes) are listed in the genbank files list. Multiple files can be selected by holding down the Ctrl key and clicking the desired strain or plasmid names.

In this guide we use the presence/absence of genes in 42 L. plantarum strains, based on comparative genome hybridization (CGH) arrays. The probes on these arrays were based on L. plantarum WCFS1 and its three plasmids; therefore, we choose four files from the genbank files list, as shown in Fig. 4.

When there is no genbank file for an organism of your choice, or you want to add more information to the resulting visualization, you can upload a tab-delimited text file (see Fig. 2C) by clicking Browse, as shown in Fig. 4B. Note that as descr
27. will be kept confidential

[Screenshot residue: PhenoLink data upload page. Legible form labels follow.]

Visualization
Optional: Select genbank files for each strain whose gene content information is used in the omics data. The list shown includes Acaryochloris marina MBIC11017 uid58167 entries NC_009925 (chromosome) and NC_009926 to NC_009934 (plasmids pREB1 to pREB9).
Optional: Upload a tab-delimited annotation file to be used in the visualization. This can be useful if no genbank file is available, for instance for GC-MS data. The first column must contain information about at least one feature (e.g., a peak value) that you supplied in the omics data. [Browse]

Run in actual mode
Required: Upload a tab-delimited omics file. The first columns of the omics and phenotypes files must be the same. [Browse]
Required: Upload a tab-delimited phenotypes file. The first columns of the omics and phenotypes files must be the same. [Browse]

Run in demo mode
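Before uploading, the requirement that the first columns of the omics and phenotypes files be the same can be checked locally. The following Python sketch is an assumption-laden illustration, not part of PhenoLink: it assumes each file has one header row, that the first column holds the shared identifiers, and that "the same" means the same set of identifiers (a stricter check would also compare their order). The function names and example contents are hypothetical.

```python
import csv
import io


def first_column(text):
    """Return first-column values of tab-delimited text, skipping the header row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = [row for row in reader if row]
    return [row[0] for row in rows[1:]]  # assumes exactly one header row


def first_column_mismatches(omics_text, phenotype_text):
    """Return identifiers found only in the omics file and only in the phenotypes file."""
    omics_ids = set(first_column(omics_text))
    phenotype_ids = set(first_column(phenotype_text))
    return sorted(omics_ids - phenotype_ids), sorted(phenotype_ids - omics_ids)


# Illustrative file contents only.
omics_example = "strain\tgene1\ns1\t1\ns2\t0\n"
phenotype_example = "strain\tgrowth\ns1\tYes\ns3\tNo\n"

only_in_omics, only_in_phenotypes = first_column_mismatches(
    omics_example, phenotype_example
)
```

Two empty lists indicate that both files use the same identifiers in their first column; any names returned point to rows that appear in only one of the two files.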
