
CF program User manual

1. If you choose several Y variables, the Y weights button is enabled and a weight can be assigned to each Y property. Any positive number is allowed.

[Figure: Tree growth options dialog, Variables tab, with the Y weights dialog]

1.2.2 Cases tab

Select the appropriate set for each case (compound) on the Cases tab. Possible values are training (working) set, test set, or excluded set.

[Figure: Tree growth options dialog, Cases tab, listing the cases (mtl_0001, mtl_0002, ...) with the set assigned to each]
2. is described only; Preset mode is described below in a separate chapter.

The following values should be entered in the table:

Trees: the number of trees in the Random Forest model.

Vars: the number of variables (descriptors) that will be used for splitting in each node of the trees. If the value entered is greater than the number of available descriptors, it is reduced automatically at the calculation step.

Min parent and Min child: the minimum number of cases (compounds) in a parent or child node. Neither value can be greater than 1/3 of the number of training set compounds; otherwise a warning message appears and the model is not constructed. The original algorithm has no such restriction parameters: all trees are grown to their maximum size.

Advice: we recommend using 1 as the value of the Min child and Min parent fields for classification tasks. For regression tasks, greater numbers can be assigned to these values to increase calculation speed (for example, Min parent = 5); this usually has no influence on model quality.

Models: the number of models that will be constructed with the specified settings.

When all fields in a row are filled with non-zero values, another row appears. This new row can be filled with new settings. Thus a queue (package) of tasks is formed. Press Ctrl+Del to delete the selected row in the table.

In the case of very big datasets, thousands o
3. If a txt file has been chosen to create a new project, the following dialog window is displayed. Select the appropriate settings to load the txt file. If variable (descriptor) names are absent from the first line of the file, uncheck the corresponding box and the program will assign names automatically (Var1, Var2, etc.). The same procedure applies if case names are absent.

[Figure: txt import dialog with the check boxes "Does the first row contain var names" and "Does the first column contain case names" and a delimiter selector (Space, Comma, Semicolon, Other)]

After successf
4. 2.3 Detailed statistics and results

To view them, double-click the model in the list, or select the model in the list by left click and switch to the Forest Statistics tab. The following information is displayed:

1. compound name;
2. set to which the compound belongs;
3. observed values of the investigated properties;
4. predicted values of the investigated properties (for regression models, the mean of all single-tree predictions; for classification models, the class having the majority of votes: one tree, one vote);
5. whether the compound is inside or outside the domain of applicability, each marked by its own sign (several domain of applicability measures are implemented and are discussed separately).

Additional regression model specific information:
1. standard deviation (StdDev), calculated from the set of values predicted by each tree.

Additional classification model specific information:
1. number of predictions of each class (in separate columns);
2. misclassification matrix (at the bottom of the window).

[Figure: Forest Statistics tab of the main window]
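The aggregation rules listed above (mean and StdDev over single-tree predictions for regression, majority vote and per-class counts for classification) can be illustrated with a minimal Python sketch. It is not the CF program's own code: the function names are hypothetical, the sample standard deviation is an assumption (the manual does not say whether the sample or population formula is used), and ties in the vote are resolved in favor of the class seen first, which is also an assumption.

```python
from collections import Counter
from statistics import mean, stdev


def aggregate_regression(tree_predictions):
    """Return (predicted value, StdDev) for one compound.

    tree_predictions holds one numeric prediction per tree.
    Assumption: StdDev is the sample standard deviation.
    """
    predicted = mean(tree_predictions)
    spread = stdev(tree_predictions) if len(tree_predictions) > 1 else 0.0
    return predicted, spread


def aggregate_classification(tree_votes):
    """Return (winning class, votes per class) for one compound.

    tree_votes holds one class label per tree (one tree, one vote).
    Assumption: a tie goes to the class that was seen first.
    """
    votes = Counter(tree_votes)
    winner, _ = votes.most_common(1)[0]
    return winner, dict(votes)
```

For example, three trees predicting 1.0, 2.0 and 3.0 give a predicted value of 2.0 with a StdDev of 1.0 under these assumptions.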
5. [Figure: warning window listing variables under the message "Following variables have constant variance and will not be included in model building"]

[Figure: warning window listing cases under the message "Following cases have missing values for all selected properties and will be excluded from training set"]

The progress of model construction is displayed at the bottom of the main window. After that, statistics of the obtained model are calculated for each case set.

2 View model results

2.1 General statistics

General results can be viewed on the List of forests tab. Statistics for each property are displayed.

[Figure: List of forests tab with the columns Trees count, Vars count, Min parent, Min child, Risk estimate (ws), Risk estimate (oob) and one row of values per property]
6. The full path to the set file selected in the popup menu is displayed in the status bar just under the list of cases. If the opened set file is not found in its location, a corresponding message appears in the status bar.

Statistics on the number of compounds in each set are displayed below:
ws: number of compounds in the training set;
ts: number of compounds in all test sets;
exc: number of compounds in the excluded set.

If one Y variable is selected and it has ranked or nominal type (Variables tab), the Class weights button is enabled. Click it and a window appears where the weight of each compound class can be defined. Case weights can be integers only. This window is analogous to the previously described Y weights dialog from the Variables tab.

The Select by names button works exactly like the same button on the Variables tab.

1.2.3 Forest tab

Model building settings are defined on the Forest tab.

[Figure: Tree growth options dialog, Forest tab: forest grow mode (Ordinary mode, Preset mode (statistics only)), settings table, OOB set mode (global option) with Bootstrap (classical) and Custom choices, and Y randomization (Mix)]

Here Ordinary mode
7. [Figure, continued: per-compound predictions for set ts2 and the summary statistics R2test, MSE, RMSE together with their counterparts for compounds inside the domain of applicability (DA) and the DA coverage]

4 Preset mode of model construction

[Figure: Tree growth options dialog, Forest tab, with Preset mode (statistics only) selected: number of forests of each type, possible numbers of trees in each forest, possible numbers of variables in each forest, log file, OOB set mode]

This option is needed to collect statistics for a huge number of models on the basis of predefined se
8. [Figure: Forest Statistics tab showing all properties and all sets: compound name, set, and the observed value, predicted value and prediction StdDev for each property]

Results can be filtered by property and/or set. Selecting a certain property from the list shows the detailed model statistics for that property (see the figure below).

[Figure: Forest Statistics tab filtered by a single property, listing oob compounds with their observed value, predicted value and prediction StdDev]

For regression models the following measures are calculated:
1. R2: determination coefficient (reliable for the training set only);
2. R2test: coefficient calculated as 1 - PRESS/SS (reliable for OOB, test and external sets);
3. MSE: mean square error;
4. RMSE: root mean square error.

For classification models the following measures are calculated:
1. Misclassification error rat
9. 6 Prediction of compounds properties which are in an external data file
7 General information
8 Afterword

Advice: pieces of advice are marked in this style.

1 Creation of the first RandomForest project

1.1 Load data file

To create a RandomForest project, choose menu FILE > NEW PROJECT > NEW RANDOM FOREST PROJECT. Select a file with source data in the dialog:

rfd file: the CF program's own file format;
dat file: the file format of the MDA1 program from the HiT QSAR Software package;
txt file: plain text format. Descriptors are in columns and cases (compounds) are in rows (see the example below). The first row and the first column contain descriptor names and molecule names, respectively. Missing values should be represented by the text NAN or left empty; such missing descriptor values are automatically replaced with a special NAN value. Descriptor values must be numerical only (a restriction of the current version); otherwise an error message is displayed and the file is not opened.

The program does not check for all possible errors in a txt file, so make sure that there are no errors in your data file.

[Figure: example txt data file opened in a text editor, with property columns for the H and R tissue series (blood, fat, brain, liver, muscle, kidney) and missing values shown as NAN]
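To make the txt layout described in section 1.1 concrete, here is a minimal Python sketch of a reader for such a table. It only illustrates the stated rules (names in the first row and first column, NAN or empty cells as missing values, numeric values only) and is not the loader used by the CF program. The function name `parse_table` and the tab delimiter default are assumptions; note that empty cells can only be detected with a delimiter such as tab, comma or semicolon, not with runs of spaces.

```python
import math


def parse_table(text, delimiter="\t"):
    """Parse a descriptor table given as text.

    Assumptions for this sketch: the first row holds variable names,
    the first column holds case names, and the top-left cell is ignored.
    Returns (variable names, case names, rows of floats).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    var_names = [name.strip() for name in lines[0].split(delimiter)[1:]]
    case_names, rows = [], []
    for ln in lines[1:]:
        cells = ln.split(delimiter)
        case_names.append(cells[0].strip())
        values = []
        for cell in cells[1:]:
            cell = cell.strip()
            if cell == "" or cell.upper() == "NAN":
                values.append(math.nan)  # missing value
            else:
                # float() raises ValueError for non-numerical text,
                # mirroring the "numerical only" restriction.
                values.append(float(cell))
        rows.append(values)
    return var_names, case_names, rows
```

A caller would typically read the file contents into a string first and then check that every row has as many values as there are variable names; that check is left out here for brevity.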
10. All data from this table can be copied with a right mouse button click. The case set is shown in brackets after the value name in the column caption: ws (training set), oob (out-of-bag set), ts (first test set), ts2 (second test set), and so on. The Risk estimate value is the misclassification error for classification models and the mean square error for regression models. Coefficients of determination (R2) are calculated only for regression models. R2 for the out-of-bag (OOB) and test sets is calculated by the formula R2 = 1 - PRESS/SS.

2.2 View single trees composing RF model

To view them, select the model in the list by left click and switch to the Trees tab. Each tree in the list can be selected and viewed. Because this information is of little importance, only general information is displayed.

[Figure: Trees tab listing the single trees of the selected model with one value per property]
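The statistics named in section 2.1 (R2 = 1 - PRESS/SS, mean square error, misclassification error) can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's implementation: PRESS is taken as the sum of squared differences between observed and predicted values, and SS as the sum of squared deviations of the observed values from a reference mean. The manual does not say which mean is used, so the hypothetical parameter `reference_mean` defaults to the mean of the evaluated set.

```python
from math import sqrt


def regression_stats(observed, predicted, reference_mean=None):
    """Return R2test, MSE and RMSE for one property and one case set.

    Assumption: SS is computed around reference_mean, which defaults
    to the mean of the observed values of the evaluated set.
    """
    n = len(observed)
    if reference_mean is None:
        reference_mean = sum(observed) / n
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss = sum((o - reference_mean) ** 2 for o in observed)
    mse = press / n
    return {"R2test": 1 - press / ss, "MSE": mse, "RMSE": sqrt(mse)}


def misclassification_rate(observed, predicted):
    """Ratio of erroneous predictions to the total number of predictions."""
    errors = sum(1 for o, p in zip(observed, predicted) if o != p)
    return errors / len(observed)
```

With observed values 1, 2, 3 and predictions 1, 2, 4, PRESS is 1 and SS is 2, so the sketch returns R2test = 0.5.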
11. CF program User manual for working with Random Forest projects

Changes (date, program version):

Version 1.27: Predicted values for oob set compounds can now be viewed on the Forest statistics tab. A specified model can be deleted from the model list.

Version 1.28: New chapter was inserted. Options menu with various settings was added to the program. Loading of multiple models into the same forest list is now allowed.

Version 1.29 (18.03.09): Y randomization procedure was implemented.

Version 2.00 (11.06.09): Possibility of analysis of multi-target models was added. Each Y property can have its own weight in the model construction process. Menu Statistics has been removed. Menu Rebuild forest has been disabled. Visualization of model statistics and details has been changed and can be displayed for each Y property separately. Data files can now contain missing values marked as NAN.

Version 2.03 (25.09.09): RF algorithm speed was significantly boosted. Some interface elements were optimized for working with numerous data.

Version 2.04 (05.11.09): Two domain applicability measures were implemented: 1) based on variable importance values (in descriptor space, considering their relative importance); 2) based on each tree prediction (in the space of models).

Version 2.05 (21.11.09): Multi-thread calculation was implemented, which can speed up very intensive calculation steps.

Version 2.06 (10.01.10): Statistics calculation was improved. Found memory leaks were eliminat
12. ata and are needed for successful data loading.

Advice: if an rfd file has already been created, try to use only it to create new projects for the same data set. This saves space on the HDD; otherwise a new rfd file is created each time.

5.2 Opening model(s)

To open a model, use the standard menu FILE > OPEN PROJECT. To open a model, the data file (rfd file) must be either in its initial directory, where it was first saved, or in the same directory as the rf file:
1. a model can be moved freely on the computer as long as the corresponding rfd file stays in its initial location;
2. a model can be copied to a USB stick and transferred to another computer, but all model files and the associated rfd and rfn files must be copied into the same directory.

Saved models can be added to the current forest list if they have an identical associated data file (rfd file):
1. if you open a model file whose data file name is identical to that of an already opened model, the new model is added to the list;
2. if a model has already been opened, you can select menu FILE > ADD MODELS TO THE CURRENT LIST to proceed. In the dialog that opens, only models with the matching associated data file are displayed. Selection of multiple files is allowed.

6 Prediction of compounds properties which are in an external data file

To predict compounds in an external data file, select the desired model in the model list and choose menu PREDICTION > PREDICT DATA FROM FILE. If the
13. consideration in the development of the next version.

Index:
ABOUT
ADD MODELS TO THE CURRENT LIST
ADD TREES TO FOREST
CALC DOMAIN APPLICABILITY
CALC VAR IMPORTANCE
CLEAR FORESTS
DELETE FOREST
NEW PROJECT
NEW RANDOM FOREST PROJECT
OPEN PROJECT
Ordinary mode
PREDICT DATA FROM FILE
Preset mode
SAVE PROJECT
14. d domain applicability measure.

The measure based on trees prediction is calculated by building a minimum cost tree. Distances between pairs of training set compounds in the space of models are considered. That is, for each compound the model gives T predictions, one made by each tree in the model (T is the total number of trees). Each prediction is treated as a separate dimension, so a Euclidean distance can be calculated.

The measure based on variable importance is also calculated by building a minimum cost tree. Euclidean distances between pairs of training set compounds in descriptor space are calculated, but variable importance is additionally taken into account: the more important a variable is, the less variability of the descriptor value is allowed. This procedure is more time-consuming than the previous one.

The measure based on proximities is under testing and is currently disabled.

To change the domain applicability range, change the number in the field "DA in sigma units" (Forest statistics tab), which represents the coefficient k in the following equation (k can be any non-negative real number):

DA limit = mean distance value + k x standard deviation of distance values

After the Recalc button is clicked, the DA limit and all corresponding statistics are recalculated.

[Figure: Forest Statistics tab filtered by property and by set ts2, with "DA in sigma units" set to 3]
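The DA limit equation above can be tried out with a short Python sketch. It covers only the thresholding step: how the distances themselves are obtained from the minimum cost tree is not detailed in the manual, so the sketch simply takes a list of distance values as input. The function names are hypothetical, the sample standard deviation is an assumption, and the default k = 3 only mirrors the value shown in the screenshot.

```python
from statistics import mean, stdev


def da_limit(distances, k=3.0):
    """DA limit = mean distance + k * standard deviation of distances.

    distances: distance values for the training set compounds
    (how they are derived is outside this sketch).
    k: the "DA in sigma units" coefficient, a non-negative real number.
    Assumption: the sample standard deviation is used.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    return mean(distances) + k * stdev(distances)


def inside_domain(distance, limit):
    """A compound is treated as inside the domain if its distance
    does not exceed the limit (assumed to be an inclusive bound)."""
    return distance <= limit
```

For the distances 1.0, 2.0, 3.0 and k = 3 the limit is 2.0 + 3 x 1.0 = 5.0, so a compound at distance 4.9 falls inside the domain and one at 5.1 falls outside. A larger k widens the domain and increases its coverage.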
15. e current program version.

To do these operations, simply select the variable(s) in the list and click the appropriate button.

Advice: the buttons Y, X and Excluded have the keyboard shortcuts y, x and space, respectively.

Variables in the list can be selected by name. Click the Select by names button and enter the variable names, one per line.

[Figure: Select by names dialog ("Input names. Each name on the separate line") shown over the Variables tab]

After the OK button is clicked, the specified variables are selected, and you can, for example, set all of them as excluded.

[Figure: Variables tab with the named variables selected, and the controls Y, X, Excluded, Select by names, Type (Continuous, Rank, Nominal) and weight]
16. ed. Structure of the manual was considerably revised: new chapters were added and obsolete ones were deleted.

Content

1 Creation of the first RandomForest project
1.1 Load data file
1.2 Build RF model
1.2.1 Variables tab
1.2.2 Cases tab
1.2.3 Forest tab
1.2.4 Possible warning messages
2 View model results
2.1 General statistics
2.2 View single trees composing RF model
2.3 Detailed statistics and results
3 Model (forest) routines
3.1 Variable importance calculation
3.2 Domain of applicability calculation
4 Preset mode of model construction
5 Model files routines
5.1 Saving model
5.2 Opening model(s)
17. f cases and variables, model construction consumes a considerable amount of memory. So be careful when you choose forest growth settings, and be sure that you have enough memory to complete all the operations you need.

The method used to form the training set of each tree is specified in the OOB set mode options:
Bootstrap: the classical mode, in which the training and out-of-bag sets for each tree are formed by sampling with replacement.
Custom: the user specifies the parts of cases that go to the training and out-of-bag sets, which are formed without replacement.

Experience shows that models constructed in the second (custom) mode do not differ appreciably in quality, and there is only a small difference in model construction time. So we recommend choosing the first, classical mode (bootstrap).

Each model can be constructed with randomized Y values (Y randomization). To define the part of the Y values that will be shuffled during model building, check the Mix field and choose a value from the range 0 to 100. If 100 is chosen, so that all Y values are shuffled, this is the Y-scrambling procedure, which is used to prove that the obtained model is not random.

1.2.4 Possible warning messages

After the OK button is pressed, if there are descriptors with constant and/or missing values among the X variables, a list of those descriptors' names appears in a separate window. All these descriptors are removed from the model construction process.
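The two OOB set modes and the Y randomization option described in the Forest tab section can be illustrated with a minimal Python sketch. It is not the CF program's code. All function names are hypothetical, the custom mode is assumed to take a single training fraction with the remaining cases forming the out-of-bag set, and the Mix value is assumed to be a percentage of Y values that are shuffled among themselves.

```python
import random


def bootstrap_split(n_cases, rng):
    """Classical mode: draw n_cases indices with replacement.
    Cases that were never drawn form the out-of-bag (OOB) set."""
    train = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(train)
    oob = [i for i in range(n_cases) if i not in drawn]
    return train, oob


def custom_split(n_cases, train_fraction, rng):
    """Custom mode (assumed form): sample without replacement;
    train_fraction of the cases train the tree, the rest are OOB."""
    n_train = int(round(n_cases * train_fraction))
    order = list(range(n_cases))
    rng.shuffle(order)
    return sorted(order[:n_train]), sorted(order[n_train:])


def mix_y(y, percent, rng):
    """Y randomization: shuffle the given percentage of Y values
    among themselves; 100 corresponds to full Y scrambling."""
    n_mix = int(round(len(y) * percent / 100))
    positions = rng.sample(range(len(y)), n_mix)
    values = [y[p] for p in positions]
    rng.shuffle(values)
    mixed = list(y)
    for p, v in zip(positions, values):
        mixed[p] = v
    return mixed
```

Passing a seeded generator such as `random.Random(0)` makes the splits reproducible. A model built on fully scrambled Y values should show much worse out-of-bag statistics than the real model; if it does not, the original model is suspect.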
18. io of the number of erroneous predictions to the total number of predictions.

When the domain of applicability has been calculated, the corresponding values based on the set of compounds inside the domain of applicability are also displayed.

3 Model (forest) routines

An unlimited number of trees can be added to the selected forest. To do this, choose menu FOREST > ADD TREES TO FOREST and specify the desired number of trees.

3.1 Variable importance calculation

To calculate variable importances, choose menu FOREST > CALC VAR IMPORTANCE.

[Figure: variable importance calculation dialog shown over the List of forests tab, with the option "Permutation mode (number of iterations)"]

The user has to define the calculation type of variable importances (both types can be selected simultaneously):

Sum coefficients for each descriptor: a very fast and very rough estimate (temporarily disabled).

Permutation mode: a more time-consuming process, especially for very large sets of compounds, but the results obtained are highly adequate. This calculation is based on estimating the influence of randomiza
19. [Figure: Cases tab controls: Training, test set number, External test, Excluded, Select by names, Load set, Save set; set counters (ws count, ts count, exc count); popup list of recently used set files]

The program allows up to 10 separate test sets to be defined. To assign a case to a particular test set (the second, for example), specify the corresponding number in the test set number field (2 in our case), then select the case and click the External test button.

Advice: the buttons Training, External test (n) and Excluded have the keyboard shortcuts w, t and space, respectively.

Case sets can be loaded and saved. Case sets are saved simultaneously in two formats:
rfs: the internal format of the CF program (it supports multiple test sets);
wsf: the format of the MDA1 program from the HiT QSAR Software package, kept for backward compatibility (it supports only one test set; all test sets, if there is more than one, are saved as one entire test set).

The program keeps the 10 most recently loaded and saved set files. To view the list, right-click the Load set button. The most recently used files are at the top of the list.
20. open file has a variable with the same name as a target property, the file is recognized as an external test set and the corresponding statistics are calculated. After the prediction process is complete, a new set named ext1 is added to the list of model sets on the Forest statistics tab. There you can select this set from the list, or select a certain property to see detailed statistics.

As the results of external data prediction are not saved to the model file, you may find it useful to copy and paste this information into an external editor.

[Figure: Forest Statistics tab showing observed values, predicted values and prediction StdDev for the compounds of the external file]

7 General information

To copy data from various lists and tables, you can often right-click and choose the appropriate item in the popup menu.

The current program version is displayed in the window that is called via menu ABOUT.

8 Afterword

Do not hesitate to contact us if you find mistakes, faults, unusual program behavior or program failures, or if you have any questions or ideas for improving the program algorithm or interface. Any advice is welcome and will be taken into
21. tion of each descriptor's values on the out-of-bag prediction ability of the forest. The more the statistics for the out-of-bag set decrease, the greater the importance of the descriptor. Because the permutation process is random, it is more reliable to run several iterations and average the results. The number of iterations is a fully arbitrary parameter.

Advice: the more compounds there are in the training set, the fewer iterations are needed. For huge data sets (about 1000 compounds or more), one iteration can be enough.

To view the results of the calculation, switch to the Variable importance tab. Variable importance is calculated separately for each property.

3.2 Domain of applicability calculation

To calculate domain of applicability measures, choose FOREST > CALC DOMAIN APPLICABILITY.

[Figure: "Choose DA type calculation" dialog shown over the List of forests tab, with options including "based on proximities" and "based on variable importance"]

In the opened dialog you can select desire
22. ttings; individual models cannot be saved in this mode. This procedure is useful for investigating forest behavior over a wide range of setup variables (number of trees and number of descriptors).

The following should be defined:
the number of models of each type;
the possible numbers of trees and of descriptors for splitting (one value per line);
the log file name, where all results are saved.

5 Model files routines

5.1 Saving model

A model can be saved to a file by choosing FILE > SAVE PROJECT. The model is saved into several separate files:
1. the rf file has a plain text format and contains general information that can be useful to the user;
2. the t file has a binary format and contains all trees composing the model;
3. the bin file has a binary format and contains all data concerning the model and all statistics for the training, OOB and test sets (information and statistics for an external set are not saved in the file);
4. the imp file has a binary format and contains information on variable importances (if they have been calculated).

All these files are needed to open the model and should be stored in the same directory.

If the source file of the data set is not an rfd file, then on saving you should specify the rfd file name, which will contain the data set, and then the rf file name, which will contain the model information.

The rfd file has an associated rfn file of the same name. Both of them store source d
23. ul loading, the source data is displayed on the Data tab.

[Figure: Data tab showing the loaded data table]

1.2 Build RF model

1.2.1 Variables tab

To grow a forest (build a model), choose menu FOREST > GROW FOREST. The following window appears.

[Figure: Tree growth options dialog, Variables tab, listing the variables with their types and the controls Y, X, Excluded, Select by names and Type (Continuous, Rank, Nominal)]

On the Variables tab, select the variables that will be used for model construction. Variables can be dependent (Y; several Y variables are allowed), independent (X), or excluded (not taking part in model construction). The variable type should also be chosen. A Y variable can have any of the three types, but all Y variables must have the same type. X variables can be of continuous type only, a restriction of th
