
CzeekS Manual

1. Kyoto Constella Technologies Co., Ltd. CzeekS Manual

Description:
    Information about the models or the interaction data registered in the DB file is displayed as a table. When no option is specified, information about the models is displayed.

Option:
    -c    The compound ID list and the number of interacting proteins are displayed.
    -p    The protein ID list and the number of interacting compounds are displayed.
    -pv   The list of all protein IDs and the number of interacting compounds are displayed.

With the -p option, the number of compounds and the protein name can be checked only for proteins that have one or more compounds. With the -pv option, all the registered proteins can be checked. The proteins listed by the -pv option can be used with the predict subcommand.
2. 0.24544712 0.19194762 0.24428153 0.23186666 0.24361162 0.24403326
    ADRB3_HUMAN   0.28057862 0.24609816 0.17238724 0.24424564 0.25030520 0.24361155 0.24620981

Both the decision function value and the normalized score are displayed when using the -v option.

    cgbvs predict -v gpcr_sample.db ADR sample_mols.csv
    compound      protein      probability  score
    ZINC00074638  ADRB1_HUMAN  0.17813841   0.26372672
    ZINC00074638  ADRB2_HUMAN  0.28596379   0.19973609
    ZINC00074638  ADRB3_HUMAN  0.15600113   0.28057862
    ZINC00075927  ADRB1_HUMAN  0.20430067   0.24563104
    ZINC00075927  ADRB2_HUMAN  0.20458141   0.24544712
    ZINC00075927  ADRB3_HUMAN  0.20357125   0.24609816
    ZINC00492910  ADRB1_HUMAN  0.95634899   0.22043506
    ZINC00492910  ADRB2_HUMAN  0.94327482   0.19194762
    ZINC00492910  ADRB3_HUMAN  0.93282221   0.17238724

In this format, the 2 types of scores for a compound-protein pair are displayed in one line.

3.3 Target Prediction

Target Prediction Using CGBVS

The preceding section explained that CGBVS enables scoring against multiple proteins. Extending this view, if scores are calculated against all available proteins, it becomes possible to search for the target protein. When all is specified as the target argument of cgbvs predict, all the compounds registered in the DB file will be score
3. the CGBVS model can be created by performing machine learning. Please refer to section 4.4 for details about machine learning.

As explained in section 3.4, the structure similarity (Tanimoto coefficient) of the compounds registered in the DB file can be calculated in CzeekS. Before calculating structure similarity, both compound descriptors and fingerprints must be registered. Fingerprint registration uses the following command:

    cgbvs import training.db training_mols.fp fingerprint    # import training_mols.fp

Refer to section 3.4 for the format of the fingerprint file and the calculation method using MACCS.

4.3 Addition of Data

This section describes how to update the CGBVS model by adding data (the user's original assay data) to the existing DB file. There are basically three types of information that must be prepared, as described in section 4.1. However, it is no longer necessary to prepare the protein descriptor information. To check whether the intended target protein is registered, execute cgbvs status with the -pv option. The -pv option also displays proteins with 0 ligands. Please refer to section 3.1 for more information.

Use the cgbvs add command to add data to the DB file. As sample data, 100 ligands of the histamine H3 receptor are prepared as a file called H3_mols.sdf. The calculated descriptors for these ligan
4. HUMAN
    1000123,ARBK1_HUMAN
    100014,CRFR1_HUMAN
    1000194,FAK2_HUMAN
    1000948,CCR6_HUMAN
    1000956,NTR1_HUMAN
    1001098,FAK2_HUMAN
    1001421,OX1R_HUMAN
    100163,PTAFR_HUMAN
    1001651,ADRB2_HUMAN

In the format above, the compound ID is shown in the first column and the protein ID in the second column, so that each line holds one compound-protein pair. In this example we used data from the ChEMBL database, selecting only compound-protein combinations with activities of 30 μM or less.

4.2 Creation of Model File (DB File)

The CGBVS model file (DB file) can be created once the required files above are prepared. Here we use the sample files introduced earlier (training_mols.csv, gpcr.csv, positive.csv). Issue the following commands:

    cgbvs create training.db                               # Creation of an empty DB file
    cgbvs import training.db training_mols.csv compound    # Registration of compound descriptors (import training_mols.csv)
    cgbvs import training.db gpcr.csv protein              # Registration of protein descriptors (import gpcr.csv)
    cgbvs import training.db positive.csv positive         # Registration of interaction information (import positive.csv)

First, an empty DB file is created. Next, the 3 required files are imported into the DB file (the files can be imported in any order). File import and DB file creation can be done simultaneously by using the appropriate options with the cgbvs create command. At this point
5. by <arg>. The file specified by <arg> should be in CSV format.

delete
    Used to remove a specific type of data from the DB file.
    Format:
        cgbvs delete <db file> <target>
    Description:
        Deletes the data type specified by the <target> argument from the DB file specified by the <db file> argument.
            compound      Compound descriptors
            protein       Protein descriptors
            positive      Positive interaction pairs (positive examples)
            negative      Negative interaction pairs (negative examples)
            fingerprint   Compound fingerprints

del_model
    Used to delete a specified SVM model from the DB file.
    Format:
        cgbvs del_model <db file> <model ID>
    Description:
        Deletes the SVM model with the number specified by the <model ID> argument from the DB file specified by the <db file> argument. The list of model numbers can be displayed with the cgbvs status command. If all is specified for the <model ID> argument, all the SVM models are deleted.

import
    Existing data in the DB file are deleted before importing new data.
    Format:
        cgbvs import <db file> <data file> <target>
    Description:
        The command imports and registers data files (CSV), such as descriptor information and interaction pair information, into the DB file. The <target> argument specifies the type (descriptor information, interaction pair information, etc.) of t
6. descriptors, the format is essentially the same as that for compounds. A sample file (gpcr.csv) is shown below.

    head gpcr.csv
    5HT1A_HUMAN,9.71564,3.317536,3.791469,3.554502,4.028436
    5HT1B_HUMAN,8.974359,2.820513,3.589744,3.333333,4.358974
    5HT1D_HUMAN,9.814324,2.917772,2.65252,3.183024,4.509284
    5HT1E_HUMAN,6.575342,3.287671,3.561644,3.287671,4.657534
    5HT1F_HUMAN,6.284153,3.005464,4.098361,4.644809,4.371585
    5HT2A_HUMAN,6.157113,3.184713,4.246285,3.821656,5.307856
    5HT2B_HUMAN,6.029106,1.663202,2.910603,4.365904,5.40545
    5HT2C_HUMAN,5.895197,2.620087,2.838428,4.803493,4.585153
    5HT4R_HUMAN,6.958763,4.639175,3.865979,3.092784,5.670103
    5HT5A_HUMAN,7.843137,2.80112,2.521008,3.921569,6.162465

The example above was calculated from a FASTA file using the PROFEAT site (link below).

    http://bidd.cz3.nus.edu.sg/cgi-bin/prof/protein/profnew.cgi

Refer to the PROFEAT site for detailed information, including the calculation method. CzeekS adopts the UniProt ID as the protein ID. Unless the protein is considered a special case, please use the _HUMAN format wherever possible.

Regarding the interaction information, the contents of the sample file (positive.csv) are shown by the command below.

    head positive.csv
    1000029,NPBW1_
7. file under directory exec, and the compounds are desalted and the charges are neutralized.

Descriptors can be calculated from a SMILES file with DRAGON6 using the command below. This command writes its result to standard output. You can use OpenBabel to convert SD files to SMILES files.

    babel -isdf sample_mols.sdf -osmi sample_mols.smi    # Execute when there is no SMILES file
    calc_dragon.sh sample_mols.smi > output.csv
    cat output.csv
    ZINC00074638,315.320,8.522,24.952,38.109,25.091
    ZINC00075927,269.300,8.416,21.796,32.563,22.216
    ZINC00492910,300.390,7.152,25.928,42.138,27.228
    ZINC02759964,339.170,10.941,21.362,32.153,21.784
    ZINC03518134,264.360,6.778,22.928,39.138,24.228

The format is comma-separated values (CSV). The descriptor file should contain exactly 1 compound per line, with the following information written in a comma-delimited manner:

    Compound ID, Descriptor1, Descriptor2, etc.

Be careful with the format, especially when not using the calc_dragon.sh script.

Scoring

Prediction calculation can be performed with the cgbvs predict command once the descriptor file has been prepared. The sample descriptor file (sample_mols.csv) included in the CzeekS installation is the same file as the one created by the command above. For example, the score calculation against the adrenaline β2 receptor can be performed using
8. > <data file> <target>
    Description:
        Use the add subcommand to append data files (CSV), such as descriptor information and interaction pair information, to the existing data in the DB file. Specify the type of the data file (descriptor information, interaction pair information, etc.) in the <target> argument. The types of targets that can be specified are as follows:
            compound      Compound descriptors
            protein       Protein descriptors
            positive      Positive interaction pairs (positive examples)
            negative      Negative interaction pairs (negative examples)
            fingerprint   Compound fingerprints

add_model
    Used to add a model created through machine learning to the DB file.
    Format:
        cgbvs add_model [option] <db file> <model file> <ID number>
    Description:
        Appends a model file created by SVM machine learning to the DB file and attaches an ID number to it. The ID number specified here is used to identify the negative example set created by the program. Keep in mind that specifying an ID number that is already in use overwrites the existing model with that ID number. By default, the command imports a model file created by the SVMlearn command. If the -l option is used, a model file created by the svm-train command of libsvm is imported.
    Option:
        -l    Used to import model files created by libsvm.

comment
    Used to input comments.
    Format:
        cgbvs comme
9. the following command, and the result is displayed on the screen.

    cgbvs predict gpcr_sample.db ADRB2_HUMAN sample_mols.csv
    compound      ADRB2_HUMAN
    ZINC00074638  0.28596379
    ZINC00075927  0.20458141
    ZINC00492910  0.94327482
    ZINC02759964  0.20639719
    ZINC03518134  0.23033582
    ZINC03912658  0.20744996
    ZINC04143221  0.20678472

Argument 2 of this command specifies the DB file of the CGBVS model, argument 3 specifies the target protein ID, and argument 4 specifies the file name of the compound descriptors. The target proteins that can be specified in argument 3 can be checked with the cgbvs status -p command. You can redirect the calculation results to a file if needed.

Scoring against multiple proteins

Scoring against multiple proteins can be performed by specifying 2 or more target proteins, separated by commas, in argument 3. There is no limit to the number of target proteins that can be specified. For example, execute the following command to calculate scores against the β1 and β2 receptors.

    cgbvs predict gpcr_sample.db ADRB1_HUMAN,ADRB2_HUMAN sample_mols.csv
    compound:     ZINC00074638 ZINC00075927 ZINC00492910 ZINC02759964 ZINC03518134 ZINC03912658 ZINC04143221
    ADRB1_HUMAN:  0.17813841 0.20430067 0.95634899 0.20634203 0.20936986 0.20745000 0.20458645
    ADRB2_HUMAN:  0.28596379 0.20458141 0.94327482 20639
10. 29 0.23952181 0.24918873 0.15109197 0.21765020 0.24860986 0.25367056 0.29825117

    cgbvs predict -v gpcr_sample.db all test.csv > out
    sort -k3 -nr out | head
    ZINC10454282  MTR1A_HUMAN  0.86593198  0.09740989
    ZINC10454282  MTR1B_HUMAN  0.82153994  0.05707536
    ZINC10454282  TSHR_HUMAN   0.71460631  0.00721098
    ZINC10454282  GRM2_HUMAN   0.71075249  0.00930970
    ZINC10454282  5HT1E_HUMAN  0.55664237  0.07965085
    ZINC10454282  CCR3_HUMAN   0.50475269  0.10143637
    ZINC10454282  ACM3_HUMAN   0.43933527  0.12913799
    ZINC10454282  ACM5_HUMAN   0.42168349  0.13881759
    ZINC10454282  HRH3_HUMAN   0.40058001  0.14602600
    ZINC10454282  ACM4_HUMAN   0.39069602  0.15187816

Information about the two proteins at the top of the list (MTR1A_HUMAN and MTR1B_HUMAN) can be displayed by issuing the command below.

    cgbvs status -pv gpcr_sample.db | grep -e MTR1
    MTR1A_HUMAN  102  P48039  Melatonin receptor type 1A
    MTR1B_HUMAN  101  P49286  Melatonin receptor type 1B

3.4 Calculation of Structure Similarity (Tanimoto Coefficient)

With CzeekS, the Tanimoto coefficient (similarity) can be calculated from compound fingerprints. The coefficient is calculated from the specified target protein and the information on the compounds in the DB file for the compounds to be evaluated. Tanimoto coefficients against multiple compounds are calculated and the maximum value is displayed. This i
11. 719 0.23033582 0.20744996 0.20678472

The scores are displayed in a tab-delimited manner. If multiple proteins are specified, screening that takes compound selectivity into account becomes possible. A wildcard sign can be used when specifying protein IDs. For example, screening against all the adrenaline receptors, including the α receptors, can be performed using the following command.

    cgbvs predict gpcr_sample.db ADA ADR sample_mols.csv
    compound      ADA1A_HUMAN  ADA1B_HUMAN  ADA1D_HUMAN  ADA2A_HUMAN  ADA2B_HUMAN
    ZINC00074638  0.12149832   0.12341347   0.13156714   0.17294950   0.17890952
    ZINC00075927  0.20223752   0.20377914   0.19969655   0.20499859   0.20498811
    ZINC00492910  0.66670499   0.58061474   0.46289849   0.12777100   0.17438357

    (continued)   ADA2C_HUMAN  ADRB1_HUMAN  ADRB2_HUMAN  ADRB3_HUMAN
    ZINC00074638  0.16551650   0.17813841   0.28596379   0.15600113
    ZINC00075927  0.20679086   0.20430067   0.20458141   0.20357125
    ZINC00492910  0.29246626   0.95634899   0.94327482   0.93282221

Display format

The way the CGBVS score is displayed can be changed through options of the cgbvs predict command. When the -d option is used, the average of the SVM decision function values is displayed instead of the normalized score.

    cgbvs predict -d gpcr_sample.db ADR sample_mols.csv
    compound:     ZINC00074638 ZINC00075927 ZINC00492910 ZINC02759964 ZINC03518134 ZINC03912658 ZINC04143221
    ADRB1_HUMAN:  0.26372672 0.24563104 0.22043506 0.24432048 0.24301969 0.24361160 0.24544375
    ADRB2_HUMAN:  8 19973609
12. Kyoto Constella Technologies Co., Ltd.
CzeekS Manual
December 4, 2014

TABLE OF CONTENTS

1 Introduction
2 Installation and Settings
    2.1 Extracting Archive Files and Placement of License File
    2.2 Setting Environmental Variables
    2.3 OpenBabel Settings
3 Compound Screening and Target Prediction
    3.1 CGBVS Model
    3.2 Compound Screening (from descriptor calculation to scoring)
    3.3 Target Prediction
    3.4 Calculation of Structure Similarity (Tanimoto Coefficient)
4 Creation of CGBVS Model and Addition of User Data
    4.1 Data and Format Required for Model Creation
    4.2 Creation of Model File (DB File)
    4.3 Addition of Data
    4.4 Machine Learning
    4.5 Others
5 cgbvs Command Reference

Trademarks

All the company and product names appearing in this manual are trademarks or regist
13.
    export CGBVS=/home/czeeks/CGBVS/exec
    export PATH=$PATH:$CGBVS
    export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
    export DRAGON6=/usr/local/bin

For the DRAGON6 environment variable, specify the directory where the DRAGON6 executable file (dragon6shell) is installed. If you want to put the license file (license.dat) in a directory other than $CGBVS, also specify its file name with a full path in the environment variable CGBVS_LICENSE.

2.3 OpenBabel Settings

Within CzeekS, OpenBabel is used to calculate compound fingerprints (MACCS) with calc_FP_MACCS and to generate SMILES from SD files. If OpenBabel is not yet installed on your system, you can install it using the following steps.

Installation of cmake

cmake is required to compile OpenBabel, so it has to be installed on the system first. It can be installed with the command yum install cmake after becoming a superuser.

Compiling and Installing OpenBabel

OpenBabel is free software (GPL v2) and can be downloaded from the following URL:

    http://openbabel.org/wiki/Get_Open_Babel

Extract the archive file after downloading it from the URL above. If the version you downloaded is 2.3.1 and the archive file is extracted using the tar command, a directory named openbabel-2.3.1 containing the extracted files will be created. Switch into the openbabel-2.3.1 directory, then compile and install OpenBabel using the fol
14. P_MACCS
    CGBVS/exec/SVMlearn
    CGBVS/exec/protein.lst

The extracted files are listed below. Copy the license file (license.dat) received from Constella into the subdirectory /home/czeeks/CGBVS/exec, overwriting the existing invalid license.dat file.

    CGBVS/example          Directory in which sample data and other files were extracted
        gpcr.csv           Descriptor vector of GPCR
        positive.csv       Positive examples
        sample_mols.csv    Descriptor file of test compounds
        sample_mols.fp     Fingerprint file of test compounds
        sample_mols.sdf    SD file of test compounds
        sample_mols.smi    SMILES file of test compounds
        training_mols.csv  Descriptor file of sample compounds for learning
        training_mols.fp   Fingerprint file of sample compounds for learning
        training_mols.sdf  SD file of sample compounds for learning
        training_mols.smi  SMILES file of sample compounds for learning
    CGBVS/exec             Directory in which executable files were extracted
        2D_894.drt         Script file for DRAGON6
        SVMlearn           SVM machine learning executable file
        calc_FP_MACCS      MACCS fingerprints calculation executable file
        calc_dragon.sh     DRAGON6 script for descriptor calculation
        cgbvs              CGBVS executable file
        license.dat        License file (invalid initially)
        protein.lst        Protein list file

2.2 Setting Environmental Variables

After extracting the files and copying your license file, set the environment variables as indicated below. Add the same settings to the .bashrc file.
15. _2 2                                # Import model_2 as ID 2
    add_model training.db model_3 3    # Import model_3 as ID 3
    add_model training.db model_4 4    # Import model_4 as ID 4
    add_model training.db model_5 5    # Import model_5 as ID 5

Imported models can be checked using the cgbvs status command.

The same approach can be used to search for the optimal SVM parameters. The following example script searches for the optimal parameters for the file input_1.

    #!/bin/sh
    # Try every combination of C and gamma on input_1.
    for c in 1 3 10 30 100
    do
        for g in 0.001 0.003 0.01 0.03 0.1
        do
            # Print C and gamma, then the prediction rate reported by SVMlearn.
            echo -ne "$c\t$g\t"
            SVMlearn -c $c -g $g input_1 model_1 | grep "cross validation" | awk '{print $6}'
        done
    done

The script above evaluates a total of 25 combinations of γ (0.001, 0.003, 0.01, 0.03, 0.1) and C (1, 3, 10, 30, 100). The output is displayed in the order C, γ, prediction rate. For each model, find the combination of C and γ that gives the highest prediction rate, then import the results into the DB file.

5 cgbvs Command Reference

Usage:
    cgbvs <subcommand> <option> <Argument>

The available subcommands are as follows:
    add
    add_model
    comment
    create
    delete
    del_model
    import
    learn
    predict
    status

Note that <option> and <Argument> may differ for every subcommand.

Subcommands

add
    Used to append data to the DB file.
    Format:
        cgbvs add <db file
16. binations of C and γ in order to find the optimal settings. An example of a parameter search is described in the next section.

4.5 Others

Section 4.4 described how to execute machine learning, where the calculation was performed by creating 5 sets of negative examples. When several machines are available, these negative example sets can also be calculated in parallel. This section describes how to run the machine learning calculation independently, in parallel, for each negative example set.

First, create the SVM input files by using the -f option with the cgbvs learn command as indicated below.

    cgbvs learn -f training.db 5
    output: input_1
    output: input_2
    output: input_3
    output: input_4
    output: input_5

Next, execute SVM machine learning on each machine as follows.

    SVMlearn ... model_1    # Execute on machine 1
    SVMlearn ... model_2    # Execute on machine 2
    SVMlearn ... model_3    # Execute on machine 3
    SVMlearn ... model_4    # Execute on machine 4
    SVMlearn ... model_5    # Execute on machine 5

If the commands above complete successfully, five files named model_1 to model_5 should exist. Import them into the DB file by using the following commands.

    add_model training.db model_1 1    # Import model_1 as ID 1
    add_model training.db model
17. d against all proteins available. Also use the -a option if you want to score against proteins that do not have registered ligands in the DB file. Available proteins can be checked with the cgbvs status -pv command. For example, scores for the compound with the ID ZINC10454282 in the sample_mols.csv file can be calculated against all available proteins as follows.

    grep ZINC10454282 sample_mols.csv > test.csv
    cgbvs predict -v gpcr_sample.db all test.csv

    compound:     ZINC10454282 (repeated on each of the 14 rows shown)
    protein:      5HT1A_HUMAN 5HT1B_HUMAN 5HT1D_HUMAN 5HT1E_HUMAN 5HT1F_HUMAN 5HT2A_HUMAN 5HT2B_HUMAN
                  5HT2C_HUMAN 5HT4R_HUMAN 5HT5A_HUMAN 5HT6R_HUMAN 5HT7R_HUMAN A4_HUMAN AA1R_HUMAN
    probability:  0.21639315 0.23220133 0.55664237 0.25697899 0.26910340 0.33050923
                  0.20230991 0.22229833 0.20269564 0.38818196 0.25142398 0.20856294 0.19385333 0.13968021
    score:        0.25578423 0.24077885 0.22949139 0.07965085 0.21507697 0.21419708 178813

In this example, the -v option is used to display the protein ID in a column. The probability scores can be sorted from highest to lowest by redirecting the output to a file and then sorting it with the commands below.
18. ds are contained in the file H3_mols.csv. The interaction information file is H3_positive.csv. As the protein descriptor is already registered, nothing needs to be added for it.

    cgbvs add training.db H3_mols.csv compound        # import H3_mols.csv
    cgbvs add training.db H3_positive.csv positive    # import H3_positive.csv

4.4 Machine Learning

After registering or adding data to the DB file, it is necessary to perform machine learning using SVM. Machine learning can be executed with the cgbvs learn command as follows.

    cgbvs learn -c 10 -g 0.01 training.db 5
    output: input_1
    SVMlearn -c 10.000000 -g 0.010000 -v 5 input_1 model_1
    itr   nSV    vKKT    Objective
    1     978    42378   4.497671328644441E+02
    2     1907   41404   8.200883693534472E+02
    3     2786   43240   1.321260914509097E+03

The example above creates five sets of negative examples, as specified by the last argument. A value of 5 to 10 is usually specified for this argument. Refer to section 3.1 for details about the negative example sets.

-c and -g are the optional SVM parameters. The parameter C, which relates to the soft margin of the SVM, is specified by -c. In CzeekS, the Gaussian RBF (Radial Basis Function) is employed as the SVM kernel function. The value γ of the RBF is specified by -g. Although machine learning is executed with C = 10 and γ = 0.01 in the example above, predictive accuracy depends on the SVM parameter values. It is recommended to check different com
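The recommendation to compare several values of C and γ can be sketched in Python. This is only an illustration under stated assumptions: the grid values are the ones used in the parameter search script of this manual, while the function names and the sample prediction rates are hypothetical and are not CzeekS output. The sketch enumerates the grid and picks the best row from results that have already been collected in the order C, γ, prediction rate.

```python
from itertools import product

# Grid values taken from the parameter search example in this manual.
C_VALUES = [1, 3, 10, 30, 100]
GAMMA_VALUES = [0.001, 0.003, 0.01, 0.03, 0.1]


def parameter_grid(c_values=C_VALUES, gamma_values=GAMMA_VALUES):
    """Return every (C, gamma) combination to be evaluated."""
    return list(product(c_values, gamma_values))


def best_parameters(results):
    """Pick the (C, gamma, rate) row with the highest prediction rate.

    `results` is an iterable of (C, gamma, prediction_rate) tuples,
    for example parsed from the output of a parameter search run.
    """
    return max(results, key=lambda row: row[2])


grid = parameter_grid()

# Hypothetical prediction rates, for illustration only.
sample_results = [
    (1, 0.001, 80.1),
    (10, 0.01, 82.3),
    (100, 0.1, 79.5),
]
best = best_parameters(sample_results)
print(len(grid), best)
```

In practice the rows passed to best_parameters would come from one search run per negative example set, so that each model gets its own best combination.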
19. e creation of a CGBVS learning model:

    1. Compound descriptor information
    2. Protein descriptor information
    3. Compound-protein pair (interaction) information

The information above must be prepared as comma-delimited (CSV) files. The file format is described below, using the sample data for model creation as an example. The contents of the sample file training_mols.csv are shown below.

    head training_mols.csv
    1000029,419.62,6.557,38.396,63.214,41.347,72.142,0.6,0.988,0.646
    1000123,279.35,8.73,21.03,32.782,21.835,36.119,0.657,1.024,0.682
    100014,377.35,8.029,30.009,46.891,32.353,53.033,0.638,0.998,0.688
    1000194,405.5,7.651,33.993,53.443,35.245,59.857,0.641,1.008,0.665
    1000948,246.24,8.794,19.009,29.047,18.875,31.495,0.679,1.037,0.674
    1000956,399.54,9.08,30.072,44.618,31.801,49.242,0.683,1.014,0.723
    1001098,216.32,6.76,19.246,31.709,20.591,36.484,0.601,0.991,0.643
    1001421,300.51,8.839,22.007,33.945,24.739,37.872,0.647,0.998,0.728
    100163,481.66,6.784,42.746,70.829,45.466,80.149,0.602,0.998,0.64
    1001651,336.37,8.204,27.59,41.698,28.159,45.741,0.673,1.017,0.687

This is the same format as the descriptor file used for scoring compounds in Section 3. The first column shows the compound ID, and the numerical values start at column 2. The file is the result of calculating descriptors from the SMILES file (training_mols.smi) using DRAGON6.

Regarding protein
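The descriptor file layout described above (an ID in the first column, numerical values from column 2) can be checked before import with a short Python sketch. The function name check_descriptor_rows is hypothetical, and the requirement that every row has the same number of descriptors is an assumption made here for illustration; it is not a documented CzeekS rule.

```python
import csv
import io


def check_descriptor_rows(text):
    """Validate comma-delimited descriptor rows.

    Returns (number_of_rows, number_of_descriptors).
    Raises ValueError if a row has no ID, no values, a non-numeric
    value, or a different number of values than the first row
    (the last check is an assumption of this sketch).
    """
    n_descriptors = None
    n_rows = 0
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        compound_id, values = row[0].strip(), row[1:]
        if not compound_id or not values:
            raise ValueError(f"line {line_no}: missing ID or descriptors")
        for value in values:
            float(value)  # raises ValueError for non-numeric fields
        if n_descriptors is None:
            n_descriptors = len(values)
        elif len(values) != n_descriptors:
            raise ValueError(f"line {line_no}: expected {n_descriptors} values")
        n_rows += 1
    return n_rows, n_descriptors


# Two shortened rows based on the sample file shown above.
sample = "1000029,419.62,6.557,38.396\n1000123,279.35,8.73,21.03\n"
print(check_descriptor_rows(sample))
```

A check like this is mainly useful when the descriptor file was not produced by the calc_dragon.sh script.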
20. ered trademarks of the respective companies. Furthermore, trademark symbols are not appended to the software and product names described in this manual.

2012 Kyoto Constella Technologies Co., Ltd. All Rights Reserved.
Copyright 2014 Kyoto Constella Technologies Co., Ltd.

1 Introduction

In recent years, it has become widely accepted that a single compound can interact with multiple target proteins. We refer to such complex compound-protein relationships as chemical genomics information. This kind of information has been built into bioactivity databases and is continuously improved by organizations such as ChEMBL. We refer to the technique of predicting and screening the activity of an unknown compound by pattern recognition of such information through machine learning as CGBVS (Chemical Genomics-Based Virtual Screening).

CzeekS is a set of tools for performing CGBVS and offers the following functions:

    Compound scoring
    Creation of CGBVS learning models
    Managing functions of learning models
    Calculation of compound fingerprints (MACCS)
    Similarity calculation with a target compound

Section 2 of this manual explains how to install CzeekS. Section 3 explains how to screen compounds using sample data; selectivity and target prediction of a compound, as advanced uses, are also explained in the same section. Section 4 explains the construction of a learning model us
21. he compound of the data file. The types of targets that can be specified are as follows:
            compound      Compound descriptors
            protein       Protein descriptors
            positive      Positive interaction pairs (positive examples)
            negative      Negative interaction pairs (negative examples)
            fingerprint   Compound fingerprints
        The difference from the add subcommand is that import deletes the data of the type specified in the <target> argument from the DB file. Use the import subcommand when you want to register descriptors that differ, for example in vector dimension, from those already registered in the DB file.
    Option:
        -m <arg>    Register the contents specified in the <arg> argument as a comment.

learn
    Used to create input files for machine learning.
    Format:
        cgbvs learn [option] <db file> <negative example number of sets>
    Description:
        Machine learning by SVM is performed after generating the negative example sets (random pairs) from the data registered in the DB file (compound descriptors, protein descriptors, and the interaction pairs of the positive examples). The model files created are then imported into the DB file. The number of machine learning calculations performed by SVM is the same as the number of negative example sets generated.
        Perform the following procedure when machine learning of the negative example sets is to be carried out on several machines. First, generate the SVM input files. Once the required number of negative example se
22. ile to be used for LIBSVM is created.

predict
    Calculates the CGBVS prediction score.
    Format:
        cgbvs predict [option] <db file> <protein ID> <compound descriptor file>
    Description:
        Using the CGBVS model specified by the <db file> argument, the prediction score of the compounds in the file specified by the <compound descriptor file> argument is calculated against the target specified by <protein ID>. The descriptors of the compounds to be analyzed must be created beforehand and stored in the appropriate file format. There is no upper limit to the number of compounds. Multiple <protein ID> values can be specified, separated by commas. A wildcard sign can be used in place of a character string, and the score is computed for all the proteins registered in the DB file by specifying the all argument. Available protein targets can be checked by attaching the -p option to the status subcommand.
    Option:
        -a          Prediction of a target without learned compound information is enabled.
        -S          Similarity (Tanimoto coefficient) with the known compound group of the specified protein is calculated.
        -d          The value of the SVM decision function is displayed.
        -v          Both the binding prediction score and the decision function value are displayed.
        -n <arg>    A score is computed using only the model ID specified by the <arg> argument.

status
    Information about the models in the DB file is displayed.
    Format:
        cgbvs status [option] <db file>
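The similarity option of predict, described above, reports a Tanimoto coefficient against the known compounds of a protein. A minimal Python sketch of that calculation follows. It uses the standard definition for binary fingerprints (bits set in both, divided by bits set in either) and takes the maximum over the known compounds, as the manual states the maximum value is displayed. Representing a fingerprint as a set of on-bit positions is an assumption of this sketch, as are the function names; the sketch does not read the CzeekS fingerprint file format.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two binary fingerprints.

    Each fingerprint is given as an iterable of on-bit positions.
    Returns 0.0 when both fingerprints are empty.
    """
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def max_tanimoto(query_fp, known_fps):
    """Highest Tanimoto coefficient between a query and known compounds."""
    return max((tanimoto(query_fp, fp) for fp in known_fps), default=0.0)


# Toy fingerprints (hypothetical bit positions, not real MACCS keys).
query = {1, 2, 3}
known = [{1, 2, 3, 4}, {7}]
print(tanimoto({1, 2, 3}, {2, 3, 4}), max_tanimoto(query, known))
```

A value of 1.0 means the two fingerprints are identical, and 0.0 means they share no on-bits.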
23. ing sample data. Section 5 is the command reference.

The following computer environment is recommended for CzeekS. Since CzeekS supports parallel computation with OpenMP, more CPU cores give better efficiency. It is also possible to run CzeekS on two or more machines.

    CPU:               Multi-core CPU with four or more cores (Intel/AMD)
    Memory:            8 GB or more
    HDD:               10 GB or more of free space
    OS:                CentOS 5.x or 6.x (64-bit, Linux kernel 2.6)
    External tool:     DRAGON ver. 6.0.30
    External library:  OpenBabel 2.3.1

Time required for machine learning of sample data (1 node)

    CPU                      Number of threads  Memory  Computation time
    Intel Xeon E5620 x 2     16                 24GB    20h 10m
    Intel Core i3 550        4                  4GB     66h 52m
    AMD Phenom II X6 1055T   6                  8GB     70h 40m

2 Installation and Settings

2.1 Extracting Archive Files and Placement of License File

Extract the archive file (CzeekS_ tgz) using the tar command as follows. Although you can extract it into any directory, it is recommended to extract it under /usr/local, or under /home/czeeks after creating a user such as czeeks. This manual assumes that the files were extracted under /home/czeeks.

    tar xvfz CzeekS_ tgz
    CGBVS/
    CGBVS/exec/
    CGBVS/exec/license.dat
    CGBVS/exec/cgbvs
    CGBVS/exec/calc_dragon.sh
    CGBVS/exec/2D_990.drt
    CGBVS/exec/calc_F
24. lowing steps.

    mkdir build
    cd build
    cmake ..
    make
    su
    make install

The procedure above is the minimum installation of OpenBabel needed for use within CzeekS. Refer to the OpenBabel manual or other sources for detailed compile settings.

3 Compound Screening and Target Prediction

3.1 CGBVS Model

Sample model files are included in CzeekS; these should not be used for actual in silico screening. The extension of a model file is .db, and hereinafter a model file may be referred to as a DB file. The sample models are created from data originating from the ChEMBL database. Those data are also included in CzeekS, and Section 4 explains them.

In CGBVS, the support vector machine (SVM) is used as the pattern recognition technique. SVM is a method for classifying data into two classes, positive examples and negative examples, and both kinds of data are required to perform machine learning. However, while there is plenty of information about interacting compound-protein pairs (positive examples), very little information about experimentally validated non-interacting compound-protein pairs (negative examples) is available in public databases. In this case, the information to be used as negative examples is generated virtually before performing machine learning. Virtual negative examples are generated by rearranging positive example pairs at random. This create
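The idea of generating virtual negative examples by randomly rearranging positive pairs can be sketched in Python as follows. This is a simplified illustration, not the CzeekS implementation: the function name, the fixed random seed, and the rule that a generated pair must not coincide with a known positive pair are all assumptions made for the example.

```python
import random


def make_virtual_negatives(positive_pairs, n_negatives, seed=0):
    """Build virtual negative (compound, protein) pairs.

    Compounds and proteins are drawn at random from those that occur
    in the positive pairs; pairs that are known positives are skipped
    (an assumption of this sketch).
    """
    positives = set(positive_pairs)
    compounds = sorted({c for c, _ in positives})
    proteins = sorted({p for _, p in positives})
    max_possible = len(compounds) * len(proteins) - len(positives)
    if n_negatives > max_possible:
        raise ValueError("not enough distinct non-positive pairs")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(compounds), rng.choice(proteins))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


# Three positive pairs in the style of the positive.csv sample file.
positives = [
    ("1000029", "NPBW1_HUMAN"),
    ("1000123", "ARBK1_HUMAN"),
    ("100014", "CRFR1_HUMAN"),
]
negatives = make_virtual_negatives(positives, n_negatives=3)
print(negatives)
```

Calling the function with different seeds gives different negative sets, which mirrors the use of several negative example sets during learning.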
25. mpounds registered

interactions
  # of positive interactions   21761   Interaction information on the positive examples
  # of negative interactions   0       Interaction information on the negative examples
details of models
  # of sampled positive interactions   21761   The number of interactions used for machine learning

  nSV     C         accuracy
  41024   10.0000   82.2664
  41026   10.0000   82.2019
  41007   10.0000   82.3506
  41023   10.2000   82.0856
  41046   10.2000   82.1124

In the "details of models" table, id indicates the ID number of the model (five models are shown in this case), nSV indicates the number of support vectors, and C and gamma are the SVM parameters. Accuracy indicates the classification accuracy obtained when cross-validation is performed for each model.

The table of proteins that are available for calculation is displayed if the -p option is used with the cgbvs status command.

    cgbvs status -p gpcr_sample.db

protein ID list
protein ID     # of compounds   accession   name
5HT1A_HUMAN    407              P08908      5-hydroxytryptamine receptor 1A
5HT1B_HUMAN    207              P28222      5-hydroxytryptamine receptor 1B
5HT1D_HUMAN    203              P28221      5-hydroxytryptamine receptor 1D
5HT1E_HUMAN    74               P28566      5-hydroxytryptamine receptor 1E
5HT1F_HUMAN    103              P30939      5-hydroxytryptamine receptor 1F
5HT2A_HUMAN    388              P28223      5-hydroxytryptamine receptor 2A
5HT2B_HUMAN    287              P41595      5-hydroxytryptamine recepto
26. nt <db file> <comment> <target>

Description
Enter a comment about the item specified in the <target> argument into the DB file specified in the <db file> argument. Although the comment is optional, you can use it to record, for example, what you used as compound or protein descriptors. The following target types can be specified:

compound      Compound descriptors
protein       Protein descriptors
positive      Positive interaction pairs (positive examples)
negative      Negative interaction pairs (negative examples)
fingerprint   Compound fingerprints

create
Used to create an empty DB file.

Format
    cgbvs create [option] <db file>

Description
Create a DB file with no registered data. If a source file is provided through an option, data such as descriptor information can be imported at the same time as the DB file is generated. Even if no option is specified here, the data can be registered later with the import subcommand.

Options
-c <arg>   Register compound descriptors from the file specified by <arg>.
-p <arg>   Register protein descriptors from the file specified by <arg>.
-i <arg>   Register interaction pairs of the positive examples from the file specified by <arg>.
-n <arg>   Register interaction pairs of the negative examples from the file specified by <arg>.
-f <arg>   Register compound fingerprints from the file specified
27. r 2B
5HT2C_HUMAN    422   P28335   5-hydroxytryptamine receptor 2C
5HT4R_HUMAN    109   Q13639   5-hydroxytryptamine receptor 4
5HT5A_HUMAN    112   P47898   5-hydroxytryptamine receptor 5A
5HT6R_HUMAN    252   P50406   5-hydroxytryptamine receptor 6
5HT7R_HUMAN    227   P34969   5-hydroxytryptamine receptor 7
A4_HUMAN       100   P05067   Amyloid beta A4 protein

The protein ID shown in the table is the protein ID used during binding prediction calculation. This ID and the accession are the same identifiers used in the protein database UniProt (http://www.uniprot.org/). The "# of compounds" column indicates the number of active compounds registered in the DB file for each protein. Although it depends on the diversity of the compound structures, a higher number of compounds generally results in more accurate prediction.

3.2 Compound Screening (from descriptor calculation to scoring)

Descriptor Calculation

Descriptors must be calculated from the compound structures (SD file) before prediction against the target protein(s) can be performed. The type of compound descriptor must match the type in the DB file. Furthermore, the compound processing conditions (desalting, charge neutralization, etc.) must be kept uniform at the time of descriptor calculation. The descriptors in the sample file included in CzeekS were calculated by DRAGON using the script
28. s multiple sets of negative examples that are used to create learning models. The scores obtained with the individual negative example sets are then averaged, and this average is what is eventually used.

CGBVS generates two types of scores. One is the average of the SVM decision function value, which ranges from -∞ to +∞. The other is the average of this decision function value after normalization by a sigmoid function, which ranges from 0 to 1. The normalized score is the one usually displayed in CzeekS. This score indicates the probability that the compound has activity against the target protein. It is not proportional to the value of the actual activity.

The information on the CGBVS model explained above can be checked with the cgbvs status command. First, check the DB file of the sample models by using the following command. The number of compounds registered in the DB file, the number of proteins, and the learned models are displayed in a list.

    cgbvs status gpcr_sample.db

compound              Dragon v.6.0.30   Software used to generate the compound descriptors
  # of data           13838             Number of compounds registered
  # of descriptors    894               Number of compound descriptors
protein               PROFEAT 2011      System used to generate the protein descriptors
  # of data           859               Number of the proteins registered
  # of descriptors    1080              Number of protein descriptors
fingerprint           MACCS             Type of fingerprints
  # of data           13838             Number of the co
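The two score types described above (an averaged decision function value and an averaged sigmoid-normalized value) can be illustrated with a short sketch. It assumes a plain logistic sigmoid, 1 / (1 + exp(-x)), applied to each model's decision value before averaging; the excerpt does not state the exact sigmoid parameters or the order of averaging and normalization that CzeekS uses, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, written so that large |x| does not overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def averaged_scores(decision_values):
    """Return (mean decision value, mean sigmoid-normalized value).

    decision_values holds one SVM decision function value per model,
    i.e. one per negative example set (assumed layout).
    """
    if not decision_values:
        raise ValueError("need at least one model output")
    n = len(decision_values)
    raw = sum(decision_values) / n
    normalized = sum(sigmoid(v) for v in decision_values) / n
    return raw, normalized


# Hypothetical decision values from three models for one compound-protein pair.
raw_score, normalized_score = averaged_scores([-1.0, 0.0, 1.0])
```

The first value is unbounded, while the second always stays strictly between 0 and 1, which is why it is the more convenient one to display.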
29. s performed by issuing the cgbvs predict -s command. The procedure is shown below.

    calc_FP_MACCS sample_mols.sdf test.fp     (fingerprint calculation; test.fp and sample_mols.fp will be the same)
    cgbvs predict -s gpcr_sample.db ADRB2_HUMAN test.fp

    compound       ADRB2_HUMAN
    ZINC00074638   0.55737705
    ZINC00075927   0.48571429
    ZINC00492910   0.71428571
    ZINC02759964   0.58108108
    ZINC03518134   0.56666667
    ZINC03912658   0.72000000
    ZINC04143221   0.72972973
    ZINC05766699   0.54385965

The contents of the fingerprint file test.fp are shown below.

    head sample_mols.fp
    ZINC00074638 42 50 57 62 72 75 76 83 85 87 89 91 92 95
    ZINC00075927 41 42 52 65 75 78 80 87 92 94 95 97 98 107 110
    ZINC00492910 54 72 82 90 92 95 97 100 104 109 110 113 117 126
    ZINC02759964 24 46 49 52 56 63 65 70 71 75 79 80 83 87 92 93
    ZINC03518134 65 72 75 83 85 90 91 92 93 95 96 104 110 111 117

Regarding the format, the first column shows the compound ID, while the remaining columns show the fingerprint. The numbers in the fingerprint part generally increase from left to right and correspond to the positions of the 1 bits within the list of binary values (bit string) created when the compound structure is evaluated against the MACCS keys.

4 Creation of CGBVS Model and Addition of User Data

4.1 Data and Format Required for Model Creation

The following are required for th
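The fingerprint file format shown in section 3 above (a compound ID followed by the positions of the 1 bits) is easy to read into sets. The sketch below parses such lines and compares two fingerprints with the Tanimoto coefficient. The Tanimoto coefficient is used here only as a common way to compare MACCS bit sets; this excerpt does not say which measure cgbvs predict -s computes, and the function names are hypothetical. The two input lines are copied from the listing as printed, which may show only part of each fingerprint.

```python
def parse_fp_line(line):
    """Split 'ID bit bit ...' into (compound_id, frozenset of on-bit positions)."""
    fields = line.split()
    if not fields:
        raise ValueError("empty fingerprint line")
    return fields[0], frozenset(int(f) for f in fields[1:])


def tanimoto(a, b):
    """Tanimoto coefficient of two on-bit sets (0.0 when both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Two lines as they appear in the listing above.
lines = [
    "ZINC00074638 42 50 57 62 72 75 76 83 85 87 89 91 92 95",
    "ZINC00075927 41 42 52 65 75 78 80 87 92 94 95 97 98 107 110",
]
fingerprints = dict(parse_fp_line(line) for line in lines)
similarity = tanimoto(fingerprints["ZINC00074638"], fingerprints["ZINC00075927"])
```

Storing only the on-bit positions keeps the file compact, and set intersection and union give the shared and total bit counts directly.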
30. ts as specified in the <negative example number of sets> argument are generated, perform SVM machine learning for each machine, then import the model files into the DB file.

Option
-c <arg>    Specify the C parameter of the soft margin of SVM (default: 10).
-g <arg>    Specify the gamma parameter of the RBF kernel (default: 0.01).
-v <arg>    Specify the number of cross-validation iterations (default: 5).
-s <arg>    Specify the upper limit of the number of compounds per protein during data sampling.
-pc <arg>   Perform principal component analysis of the compound descriptors and compress the information.
-pp <arg>   Perform principal component analysis of the protein descriptors and compress the information.

When <arg> of the above two options is an integer, it indicates the number of principal components to be selected. When <arg> is a percentage, principal components are selected until the cumulative contribution ratio reaches the specified value.

-m   Generation of negative example sets is not performed.
-n   Registered negative example sets will be used.
-r   Machine learning is performed without changing a negative example set.

When either of the following two options is specified, only file output is performed and SVM machine learning is not performed.

-f    The input file to be used for the SVMlearn command is created.
-fl   The input f
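The two ways of interpreting <arg> for the -pc and -pp options (a fixed number of principal components, or a cumulative contribution ratio) can be sketched as below. The sketch assumes the eigenvalues of the descriptor covariance matrix have already been computed, and it represents a percentage argument as a string such as "90%"; how cgbvs itself parses the argument is not shown in this excerpt, so both the representation and the function name are assumptions.

```python
def n_components(eigenvalues, arg):
    """Number of principal components implied by an integer or 'NN%' argument.

    eigenvalues: variances explained by each principal component (any order).
    arg: an int (number of components) or a string like '90%' (assumed syntax).
    """
    ordered = sorted(eigenvalues, reverse=True)
    if isinstance(arg, str) and arg.endswith("%"):
        target = float(arg[:-1]) / 100.0 * sum(ordered)
        running = 0.0
        for k, value in enumerate(ordered, start=1):
            running += value
            # Small tolerance so rounding error cannot push us one component too far.
            if running >= target - 1e-12:
                return k
        return len(ordered)
    # Integer argument: never ask for more components than exist.
    return min(int(arg), len(ordered))


# Toy eigenvalues: the first component explains 80% of the total variance.
toy_eigenvalues = [8.0, 2.0]
by_ratio = n_components(toy_eigenvalues, "70%")
by_count = n_components(toy_eigenvalues, 1)
```

A ratio-based cutoff adapts to the data set, whereas a fixed count gives models whose input dimension is known in advance.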
