
QUEST User Manual

1. cost(no. of class|no. of class)
where cost(i|j) = cost of misclassifying class j as class i
and class label is assigned in alphabetical order

    0.0000000E+00   1.000000
    2.000000        0.0000000E+00

The altered priors are
    die   .34225
    live  .65775

P4 minimal node size: 5
use univariate split
use (unbiased) statistical tests for variable selection
alpha value: .050
split point method: exhaustive search
use Pearson chi^2

P5 use 155-fold CV sample pruning
SE-rule trees based on number of SEs = 0.00

P6
    subtree   # Terminal   complexity   current
    number    nodes        value        cost
    1         15           0.0000       0.0581
    2          9           0.0043       0.0839
    3          8           0.0065       0.0903
    4          7           0.0129       0.1032
    5          2           0.0284       0.2452
    6          1           0.1677       0.4129

P7 Size and CV misclassification cost and SE of subtrees:

    Tree   #Tnodes   Mean     SE(Mean)
    1      15        0.3355   0.4937E-01
    2       9        0.3419   0.5034E-01
    3       8        0.3290   0.5089E-01
    4**     7        0.2903   0.4911E-01
    5       2        0.3226   0.4556E-01
    6       1        0.4129   0.6502E-01

CART 0-SE tree is marked with *
CART SE-rule using CART SE is marked with **
The * and ** trees are the same

P8 Following tree is based on **
Structure of final tree

    Node   Left node   Right node   Split variable   Predicted class
    1      2           3            ALBUMIN
    2      4           5            BILIRUBIN
    4      6           7            ASCITES
    6      8           9            MALAISE
    8      * terminal node *                         live
    9      14          15           STEROID
    14     * terminal node *                         live
    15     16          17           PROTIME
    16     * terminal node *                         die
    17     * terminal node *                         live
    7      * terminal node *                         die
    5      * terminal node *                         die
    3      * terminal node *
2. 4 1 Interactive mode QUEST version 1 9 Copyright c 1997 2004 by Shih Yu Shan This version was updated on April 27 2004 Qo Input 0 to read the warrenty disclaimer 1 to run QUEST in interactive mode 2 to create input file for batch job Input 0 1 or 2 1 2 lt cr gt 1 Q1 Input name of file to store results hep out Q2 You should have a file with the following codes for each variable d dependent n numerical c categorical f frequency x excluded from analysis Use commas or spaces as delimiters Input name of variable description file enclose within quotes if it contains embedded spaces hepdsc txt Q3 Code for missing values Number of cases in data file 155 There are missing values in the learning sample Number of learning samples 155 Cases with 1 or more missing values 75 Percentage of missing values 5 67 Number of numerical variables 6 Number of categorical variables 13 Input 1 for default options 2 for advanced options 1 2 lt cr gt 1 2 Number of classes 2 Q4 Input priors 1 for estimated 2 for equal 3 for given 1 3 lt cr gt 1 1 5 QUEST manual Q5 Input Input Input Input Input Q6 Input QT Input Qs 4 1 Interactive mode misclassification costs 1 for equal 2 for given 1 2 lt cr gt 1 2 the the the the cost cost cost cost of predicting class of predicting class of predicting class of predicting class die as class die 0 000 l
3. Column Variable name Variable type 1 Class d AGE SEX STEROID ANTIVIRALS FATIGUE MALAISE ANOREXTA BIGLIVER o o 1o00 PUN Oe uoc 0 c8 22 QUEST manual 5 3 Linear combination splits 10 FIRMLIVER c 11 SPLEEN C 12 SPIDERS c 13 ASCITES c 14 VARICES c 15 BILIRUBIN n 16 ALKPHOSPHA n 17 SGOT n 18 ALBUMIN n 19 PROTIME n 20 HISTOLOGY c Number of cases in data file 155 Number of learning samples 155 Cases with 1 or more missing values 75 Percentage of missing values 5 67 Number of numerical variables 6 Number of categorical variables 13 Summary of response variable Class class frequency die 32 live 123 155 Summary of numerical variable AGE Size Obs Min Max Mean Sd 155 155 0 700E 01 0 780E 02 0 412E 02 0 126E 02 Summary of categorical variable SEX category frequency female 16 male 139 155 Summary of categorical variable STEROID category frequency no 78 yes 76 23 QUEST manual 5 3 Linear combination splits missing 1 Summary of categorical variable ANTIVIRALS category frequency no 131 yes 24 155 Summary of categorical variable FATIGUE category frequency no 54 yes 100 154 missing 1 Summary of categorical variable MALAISE category frequency no 93 yes 61 154 missing 1 Summary of categorical variable ANOREXIA category frequency no 122 yes 32 154 missing 1 Summary of categorical variable BIGLIVER category frequency no 120 24 QUEST manua
4. f: This is a frequency variable. It is the number of replications for each record and thus must be greater than or equal to 0. Only one variable can have the f indicator.

x: This indicates that the variable is excluded from the analysis.

4 Running the program

The QUEST program can be executed in interactive or batch mode. The amount of virtual memory can be changed on various platforms so that the program can run on large data sets. On Linux machines, the user can use all the memory that the system allows by typing the command unlimit. On PC Windows machines, the user can change the size of the virtual memory in the system folder in the control panel. Since the text file format on PC Windows is not the same as that on Linux, it may be helpful to convert the text format with the Linux command dos2unix if the file was originally tested on a PC. This step can avoid some potential run-time errors. An example session log for the hepatitis data (Diaconis and Efron, 1983), obtained from the UCI Repository of Machine Learning Databases (Lichman, 2013), follows.

4.1 Interactive mode

The QUEST program can be executed by simply typing its name at the prompt. Following is an annotated example session log for the Linux version (annotations are printed in italics). The PC version is similar. Whenever the user is prompted for a selection, a recommended choice is usually given. The latter may be selected by hitting the ENTER or RETURN key.

> quest
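The dos2unix step described in section 4 amounts to replacing Windows CRLF line endings with LF. Where dos2unix is not at hand, the same conversion can be sketched in a few lines of Python. The function names below are illustrative and are not part of QUEST.

```python
def crlf_to_lf(data: bytes) -> bytes:
    """Return a copy of data with Windows (CRLF) line endings turned into LF."""
    return data.replace(b"\r\n", b"\n")


def convert_file(src_path: str, dst_path: str) -> None:
    """Read src_path in binary mode and write an LF-only copy to dst_path."""
    with open(src_path, "rb") as src:
        converted = crlf_to_lf(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(converted)
```

Working in binary mode avoids any newline translation by Python itself, so the only change made to the file is the one the function names.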
5. live Number of terminal nodes of final tree 7 Total number of nodes of final tree 13 P9 Classification tree Node 1 ALBUMIN 3 850 Node 2 BILIRUBIN lt 3 700 Node 4 ASCITES no 16 QUEST manual 5 1 Annotated output Node 6 MALAISE no Node 8 live Node 6 MALAISE yes Node 9 STEROID no Node 14 live Node 9 STEROID yes Node 15 PROTIME lt 70 50 Node 16 die Node 15 PROTIME gt 70 50 Node 17 live Node 4 ASCITES yes Node 7 die Node 2 BILIRUBIN gt 3 700 Node 5 die Node 1 ALBUMIN gt 3 850 Node 3 live P10 Information for each node skeokokokek k k k k k kk kk kkk okkekek okkek 2K 2K K okke FK 2K 2K 2K K 2K FK FK FK 2K 2K 2K K K K FK FK gt K gt K Node 1 Intermediate node A case goes into Node 2 if its value of ALBUMIN lt 3 8500 Class cases Mean of ALBUMIN die 32 3 1519 live 123 3 9777 155 FE kk kkk kkk kkk k k k kk k kk KK 2 kkk k k kkk k kK Node 2 Intermediate node A case goes into Node 4 if its value of BILIRUBIN lt 3 7000 Class cases Mean of BILIRUBIN die 29 2 6222 live 32 1 3687 61 FE kkk kkk kkk kkk k k kkk k KKK K KKK kkk k kkk K kK Node 4 Intermediate node A case goes into Node 6 if its value of ASCITES 17 QUEST manual 5 1 Annotated output no Class cases Mode of ASCITES die 21 no live 32 no 53 FOO OO I kkk kk kk kk kk kk kk Node 6 Intermediate node A case goes into Node 8 if its value of MALAISE no Class case
6. Distribution files

QUEST is distributed in compiled executable files for the following computer systems:

    PC compatible:    Microsoft Windows, Linux
    Apple computer:   Mac OS X Yosemite 10.10.4

The QUEST trees are given in outline form suitable for importing into flowchart packages like allCLEAR (CLEAR Software, 1996). Alternatively, the trees may be output in LaTeX code. The public domain macro package pstricks (Goossens et al., 1997) is needed to render the LaTeX trees.

3 Input files

The QUEST program needs two text input files.

3.1 Data file

This file contains the learning (or training) samples. Each sample consists of observations on the class (or response or dependent) variable and the predictor (or independent) variables, plus any frequency variable. The entries in each sample record should be comma or space delimited. Each record can occupy one or more lines in the file, but each record must begin on a new line.

Record values can be numerical or character strings. Categorical variables can be given numerical or character values. Any character string that contains a comma or space must be surrounded by a matching pair of quotation marks (either " or ').

Please make sure that either the data file or the description file ends with a carriage return. Otherwise, the program will ignore all incomplete lines and may yield false results.

3.2 Description file

This file is used to provide information to
7. categorical variable FIRMLIVER category frequency no 84 yes 60 144 missing 11 Summary of categorical variable SPLEEN category frequency no 120 yes 30 150 missing 5 Summary of categorical variable SPIDERS category frequency no 99 yes 51 150 missing 5 Summary of categorical variable ASCITES category frequency no 130 yes 20 150 Summary of categorical variable VARICES 13 QUEST manual 5 1 Annotated output category frequency no 132 yes 18 150 missing 5 Summary of numerical variable BILIRUBIN Size Obs Min Max Mean Sd 155 149 0 300E 00 0 800E 01 0 143E 01 0 121E 01 Summary of numerical variable ALKPHOSPHATE Size Obs Min Max Mean Sd 155 126 0 260E 02 0 295E 03 0 105E 03 0 515E 02 Summary of numerical variable SGOT Size Obs Min Max Mean Sd 155 151 0 140E 02 0 648E 03 0 859E 02 0 897E 02 Summary of numerical variable ALBUMIN Size Obs Min Max Mean Sd 155 139 0 210E 01 0 640E 01 0 382E 01 0 652E 00 Summary of numerical variable PROTIME Size Obs Min Max Mean Sd 155 88 0 000E 00 0 100E 03 0 619E 02 0 229E 02 Summary of categorical variable HISTOLOGY category frequency no 85 yes 70 155 Options for tree construction estimated priors are Class prior die 0 20645 live 0 79355 The cost matrix is in the following format 14 QUEST manual 5 1 Annotated output cost 1 1 cost 1 2 cost i no of class cost 2 1 cost 2 2 cost 2 no of class cost no of class 1
8. priors are listed. If unequal costs are present, as in this example, the priors are altered using the formula in Breiman et al. (1984, pp. 114-115).

P4 Additional options selected for this run are given here.

P5 The number of SEs for the pruning rule and the number of folds of cross-validation are shown here. If the details option in Q15 is selected, the sequence of pruned subtrees is also given for each fold.

P6 This table gives the sequence of pruned subtrees. The 3rd column shows the cost-complexity value for each subtree, using the definition in Breiman et al. (1984, Definition 3.5, p. 66). The 4th column gives the current (or resubstitution) cost (error) for each subtree.

P7 This table gives the size, the estimate of misclassification cost and its standard error for each pruned subtree. The 2nd column shows the number of terminal nodes. The 3rd column shows the mean cross-validation estimate of misclassification cost, and the 4th column gives its estimated standard error, using the approximate formula in Breiman et al. (1984, pp. 306-309). The tree marked with an asterisk (*) is the one with the minimum mean cross-validation estimate of misclassification cost (also called the 0-SE tree). The tree based on the mean cross-validation estimate of misclassification cost and the number of SEs shown in P5 is marked with two asterisks (**).

P8 The structure of the tree selected by the user (the tree marked by ** in this example) is given here. The root node alw
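The prior adjustment mentioned for unequal costs can be checked by hand. The sketch below assumes the standard CART adjustment from Breiman et al. (1984): each prior is weighted by the total cost of misclassifying that class and the weights are renormalised. The function name and the dictionary layout are illustrative. With the hepatitis priors (32/155 and 123/155) and the cost matrix from the sample output, it reproduces the altered priors .34225 and .65775 printed by QUEST.

```python
def altered_priors(priors, cost):
    """Reweight class priors by misclassification cost.

    priors: dict mapping class -> prior probability.
    cost:   cost[i][j] is the cost of misclassifying class j as class i,
            matching the cost(i|j) convention of the QUEST output.
    """
    classes = list(priors)
    # C(j): total cost of misclassifying a class-j case as any other class.
    total_cost = {j: sum(cost[i][j] for i in classes if i != j) for j in classes}
    weights = {j: total_cost[j] * priors[j] for j in classes}
    norm = sum(weights.values())
    return {j: w / norm for j, w in weights.items()}


# Hepatitis example: estimated priors and the cost matrix from the output.
priors = {"die": 32 / 155, "live": 123 / 155}
cost = {
    "die": {"die": 0.0, "live": 1.0},   # cost(die|die), cost(die|live)
    "live": {"die": 2.0, "live": 0.0},  # cost(live|die), cost(live|live)
}
result = altered_priors(priors, cost)
print({k: round(v, 5) for k, v in result.items()})
```

Because misclassifying a die case costs twice as much as misclassifying a live case, the die prior is pulled up from about 0.206 to about 0.342.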
9. use either the pstricks or the TreeTeX package. The same applies to the allCLEAR code.

Q17 This allows the user to obtain a file containing the class label and terminal node for each case in the learning sample. The information is useful for extracting the learning samples from particular terminal nodes of the tree.

Q18 After the tree is built, some related information is printed to the screen.

4.3 Batch mode

If the answer in Q0 is 2, QUEST will ask for a file to store the selected options. It also checks the description file and the data file. However, it does not construct the tree. After all the questions have been asked, QUEST will display the command for running a job in batch mode.

5 Sample output files

The annotated output file hep.out is given in the following.

5.1 Annotated output

Classification tree program: QUEST version 1.9
Copyright (c) 1997-2004 by Shih, Yu-Shan
This version was updated on: April 27, 2004
Please send comments, questions, or bug reports to yshih@math.ccu.edu.tw
This job was started on: 04/27/2004 at 10:48

P1 Variable description file: hepdsc.txt
Learning sample file: hepdat.txt
Code for missing values:
Variables in data file are (variable types are d=dependent, n=numerical, c=categorical, f=frequency, x=excluded):

    Column #   Variable name   Variable type
    1          Class           d
    2          AGE             n
    3          SEX             c
    4          STEROID         c
10. you do NOT want TreeTeX LaTeX code for tree, else 2 ([1:2], <cr>=1):
Input 1 if you do NOT want allCLEAR code for tree, else 2 ([1:2], <cr>=1):
Q17 Input 1 if you do NOT want to save the class label and terminal node id for each case in the learning sample; input 2 otherwise
Input 1 or 2 ([1:2], <cr>=1): 2
Input name of file to store node ids: hep.nid
Cross validation is executing. Please wait.
(Each row of dots signifies 50 iterations completed.)
Q18 Number of terminal nodes of final tree = 7
Pstricks codes are stored in file: hep.tex
Case ids, class label, terminal ids and predicted label for the learning sample are in file: hep.nid
Results are stored in file: hep.out
elapsed time: 30.71 seconds (user: 29.43, system: 1.28)

4.2 Explanation of questions

Following is a brief explanation of the questions asked by the program. The default choice for each question is indicated by the carriage return symbol <cr>=. It can be chosen by simply hitting the carriage return key.

Q0 QUEST allows both interactive and batch modes. If the answer is 1, it starts in interactive mode. If the answer is 2, the program asks for all the options first and stores them in a file (whose name is given by the user) for running in batch mode.

Q1 This question asks for a file to store the results. If a file by that name alr
11. 2 no of class 26 QUEST manual 5 3 Linear combination splits cost no of class 1 cost no of class no of class where cost ilj cost of misclassifying class j as class i and class label is assigned in alphabetical order 0 0000000E 00 1 000000 2 000000 0 0000000E 00 The altered priors are die 34225 live 65775 minimal node size 5 use linear split split point method exhaustive search use Pearson chi 2 use 155 fold CV sample pruning SE rule trees based on number of SEs 0 00 subtree Terminal complexity current number nodes value cost 1 5 0 0000 0 0129 2 3 0 0129 0 0387 3 2 0 0387 0 0774 4 1 0 3355 0 4129 Size and CV misclassification cost and SE of subtrees Tree Tnodes Mean SE Mean 1 5 0 2581 0 4900E 01 2 3 0 2258 0 4612E 01 3 2 0 2194 0 4208E 01 4 1 0 4129 0 6502E 01 CART O SE tree is marked with CART SE rule using CART SE is marked with The and trees are the same Following tree is based on Structure of final tree 20 QUEST manual 5 3 Linear combination splits Node Left node Right node Split variable Predicted class 1 2 3 linear 2 terminal node die 3 terminal node live Number of terminal nodes of final tree 2 Total number of nodes of final tree 3 Classification tree Node 1 linear combination lt 0 1307 Node 2 die Node 1 linear combination gt 0 1307 Node 3 live Information for each node SOO k KK k kk ke ok ok Kk K Kk KK k K Kk 2K
12. ANTIVIRALS FATIGUE MALAISE ANOREXIA Oo 10 015 0 N ec xi co 0 0 B 10 QUEST manual 5 1 Annotated output 9 BIGLIVER c 10 FIRMLIVER C 11 SPLEEN c 12 SPIDERS c 13 ASCITES c 14 VARICES c 15 BILIRUBIN n 16 ALKPHOSPHA n 17 SGOT n 18 ALBUMIN n 19 PROTIME n 20 HISTOLOGY c P2 Number of cases in data file 155 Number of learning samples 155 Cases with 1 or more missing values 75 Percentage of missing values 5 67 Number of numerical variables 6 Number of categorical variables 13 P3 Summary of response variable Class class frequency die 32 live 123 155 Summary of numerical variable AGE Size Obs Min Max Mean Sd 155 155 0 700E 01 0 780E 02 0 412E 02 0 126E 02 Summary of categorical variable SEX category frequency female 16 male 139 155 11 QUEST manual 5 1 Annotated output Summary of categorical variable STEROID category frequency no 78 yes 76 154 missing 1 Summary of categorical variable ANTIVIRALS category frequency no 131 yes 24 155 Summary of categorical variable FATIGUE category frequency no 54 yes 100 154 missing 1 Summary of categorical variable MALAISE category frequency no 93 yes 61 154 missing 1 Summary of categorical variable ANOREXIA category frequency no 122 yes 32 154 missing 1 Summary of categorical variable BIGLIVER 12 QUEST manual 5 1 Annotated output category frequency no 120 yes 25 145 missing 10 Summary of
13. ERS category no yes ASCITES category no yes VARICES category no yes HISTOLOGY category no yes 5 3 Linear combination splits CRIMCOORD 0 821155E 01 0 821155E 01 CRIMCOORD 0 101177 0 101177 CRIMCOORD 0 842006E 01 0 842006E 01 CRIMCOORD 0 118011 0 118011 CRIMCOORD 0 124193 0 124193 CRIMCOORD 0 801872E 01 0 801872E 01 EEEE k kkk kk kkk kkk kkk k k k kk k KK KK KKK kkk k kk kK K Node 2 Terminal node assigned to Class die Class cases die 30 live 8 38 aa k k kk k k k k k k k k k ak A ACA akk k k k ak ak ak 3k 21 21 21 K K K KKK K K k K K K K K K Node 3 Terminal node assigned to Class live Class cases die 2 live 115 117 Classification matrix based on learning sample 30 QUEST manual REFERENCES predicted class actual class die live die 30 2 live 8 115 Classification matrix based on 155 fold CV predicted class actual class die live die 24 8 live 18 105 elapsed time 59 53 seconds user 58 45 system 1 08 This job was completed on 04 27 2004 at 10 53 The linear combination splits and the associated CRIMCOORD values for each categorical variables are given in terms of their coefficients printed at the end of each intermediate node References Breiman L Friedman J H Olshen R A and Stone C J 1984 Clas sification And Regression Trees Wadsworth Belmont CA CLEAR Software I 1996 allCLEAR User s Guide CLEAR Software Inc 199 W
14. Kk K KK FK KK k KK 2 KK ak K Node 1 Intermediate node Class cases die 32 live 123 155 A case goes into Node 2 if a linear combination of variables lt 0 1307 The coefficients in the linear combination are Variable Coefficient AGE 0 2988E 03 SEX 0 1819 STEROID 0 5505E 01 ANTIVIRALS 0 3659E 01 FATIGUE 0 2138E 01 MALAISE 0 2194 ANOREXIA 0 1964 BIGLIVER 0 7677E O1 FIRMLIVER 0 1026 SPLEEN 0 9356E 01 28 QUEST manual SPIDERS ASCITES VARICES BILIRUBIN ALKPHOSPHATE SGOT ALBUMIN PROTIME HISTOLOGY ooo 0 oo0oo0oooO 2537 1549 4411E 01 1977E 01 8270E 04 4785E 04 3183E 01 1206E 02 3936E 01 5 3 Linear combination splits The CRIMCOORD values assiciated with each categorical variable variable variable variable variable variable variable variable SEX category female male STEROID category no yes ANTIVIRALS category no yes FATIGUE category no yes MALAISE category no yes ANOREXIA category no yes BIGLIVER category no yes CRIMCOORD 0 131776 0 131776 CRIMCOORD 0 802351E 01 0 802351E 01 CRIMCOORD 0 110913 0 110913 CRIMCOORD 0 839007E 01 0 839007E 01 CRIMCOORD 0 816611E 01 0 816611E 01 CRIMCOORD 0 991144E 01 0 991144E 01 CRIMCOORD 0 109190 0 109190 29 QUEST manual variable variable variable variable variable variable FIRMLIVER category no yes SPLEEN category no yes SPID
15. QUEST User Manual

Yu-Shan Shih
Department of Mathematics
National Chung Cheng University, Taiwan
yshih@math.ccu.edu.tw

Revised: July 31, 2015

Contents

1 Introduction
2 Distribution files
3 Input files
    3.1 Data file
    3.2 Description file
4 Running the program
    4.1 Interactive mode
    4.2 Explanation of questions
    4.3 Batch mode
5 Sample output files
    5.1 Annotated output
    5.2 Explanation of annotations
    5.3 Linear combination splits

1 Introduction

QUEST stands for Quick, Unbiased, Efficient Statistical Trees and is a program for tree-structured classification. The algorithms are described in Loh and Shih (1997). The performance of QUEST compared with other classification methods can be found in Lim et al. (2000). The main strengths of QUEST are unbiased variable selection and fast computational speed. In addition, it has options to perform CART-style exhaustive search and cost-complexity cross-validation pruning (Breiman et al., 1984).

The updated versions of QUEST can be obtained from http://www.math.ccu.edu.tw/~yshih/quest.html. For detailed changes made in the latest version, please read the companion history file history.txt.

This user manual explains how the program is executed and how the output is interpreted.

2
16. Node 3: Terminal node assigned to Class live

    Class   # cases
    die      3
    live    91

P11 Classification matrix based on learning sample

                     predicted class
    actual class     die     live
    die               26        6
    live               4      119

Classification matrix based on 155-fold CV

                     predicted class
    actual class     die     live
    die               19       13
    live              19      104

P12 Pstricks codes are stored in file: hep.tex
Case ids, class label, terminal ids and predicted label for the learning sample are in file: hep.nid
elapsed time: 30.71 seconds (user: 29.43, system: 1.28)
This job was completed on: 04/27/2004 at 10:49

5.2 Explanation of annotations

P1 This paragraph shows some of the information obtained from the user during the interaction session. The names of the description and data files, the code for missing values, and the content of the description file are reported. Character strings in variable names which are longer than 10 characters are truncated.

P2 Counts are given of the total number of cases, the number of cases with non-missing dependent values, the number of cases with one or more missing values, the percentage of missing values, and the numbers of variables of each type.

P3 Summary statistics are shown for each included variable if the advanced option is selected. In addition, the
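The two classification matrices in P11 tie back to the cost figures in the pruning tables. As a rough check, the sketch below weights each cell by its misclassification cost (2 for a die case predicted as live, 1 for a live case predicted as die, as set in this session) and averages over the 155 cases. The standard error uses the plain variance of the per-case costs divided by n; that choice is an assumption made here, not a formula stated in this part of the manual, although it happens to match the reported value for the 155-fold CV run. All names are illustrative.

```python
import math


def cost_summary(matrix, cost):
    """Return (mean cost, standard error) for a classification matrix.

    matrix[actual][predicted] is a case count;
    cost[actual][predicted] is the cost charged for that outcome.
    """
    per_case = []
    for actual, row in matrix.items():
        for predicted, count in row.items():
            per_case.extend([cost[actual][predicted]] * count)
    n = len(per_case)
    mean = sum(per_case) / n
    # Population variance of per-case costs (an assumption, see lead-in).
    variance = sum((c - mean) ** 2 for c in per_case) / n
    return mean, math.sqrt(variance / n)


cost = {"die": {"die": 0, "live": 2}, "live": {"die": 1, "live": 0}}
learning = {"die": {"die": 26, "live": 6}, "live": {"die": 4, "live": 119}}
cv = {"die": {"die": 19, "live": 13}, "live": {"die": 19, "live": 104}}

learn_mean, _ = cost_summary(learning, cost)
cv_mean, cv_se = cost_summary(cv, cost)
print(round(learn_mean, 4), round(cv_mean, 4), round(cv_se, 5))
```

The learning-sample figure agrees with the current cost 0.1032 of the 7-terminal-node subtree, and the CV figures agree with its mean 0.2903 and SE 0.4911E-01 in the sample output.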
17. ays has the label 1. The total number of nodes and the number of terminal nodes are also shown.

P9 The tree structure in outline form, suitable for importing into flowchart programs such as allCLEAR, is given here. The formatted LaTeX tree using the pstricks package is shown in Figure 1.

P10 Details of the split, a summary of classes for each node, and the node assignment are given here.

P11 The classification matrices based on the learning sample and the CV procedure are reported.

P12 The file name for the pstricks tree and the file name for the terminal node ids are given here if either option is selected. The total CPU time taken by the run is also reported.

5.3 Linear combination splits

The following example shows the output file for the hepatitis data set using linear combination splits (choice 2 in Q7), with all the other options unchanged.

Classification tree program: QUEST version 1.9
Copyright (c) 1997-2004 by Shih, Yu-Shan
This version was updated on: April 27, 2004
Please send comments, questions, or bug reports to yshih@math.ccu.edu.tw
This job was started on: 04/27/2004 at 10:52

Variable description file: hepdsc.txt
Learning sample file: hepdat.txt
Code for missing values:
Variables in data file are (variable types are d=dependent, n=numerical, c=categorical, f=frequency, x=excluded):
18. eady exists, the user is asked either to overwrite it or to choose another name.

Q2 This asks for the description file. If the file is read correctly, the code for missing values and a brief summary of the learning data are printed to the screen.

Q3 This allows the user either to select all default options or to control every step of the run. If the first choice is selected, the run will skip all the later questions. The number of classes is printed to the screen.

Q4 This asks for the prior for each class. If the priors are to be given, the program will then ask the user to input them.

Q5 This asks for the misclassification costs. If the costs are to be given, the program will ask the user to input them, as in this example.

Q6 This asks for the smallest number of samples in a node during tree construction. A node will not be split if it contains fewer cases than this number. The smaller this value is, the larger the initial tree will be prior to pruning. The default value is max(5, n/100), where n is the total number of observations.

Q7 The user can choose splits on either a single variable or a linear combination of variables.

Q8 This asks the user to choose between the unbiased variable selection method described in Loh and Shih (1997) and the biased exhaustive search method used in CART.

Q9 If the unbiased method based on statistical tests is used in Q8, this asks for the alpha value at which to conduct the tests. The sugge
19. ells Avenue, Newton, MA.

Diaconis, P. and Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248, 96-108.

Goossens, M., Rahtz, S. and Mittelbach, F. (1997). The LaTeX Graphics Companion. Addison-Wesley.

Lichman, M. (2013). UCI machine learning repository. URL http://archive.ics.uci.edu/ml.

Lim, T.-S., Loh, W.-Y. and Shih, Y.-S. (2000). A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 40, 203-228.

Loh, W.-Y. and Shih, Y.-S. (1997). Split selection methods for classification trees. Statistica Sinica, 7, 815-840.

Shih, Y.-S. (1999). Families of splitting criteria for classification trees. Statistics and Computing, 9, 309-315.

[Figure: pstricks tree for the hepatitis data, with splits on ALBUMIN (3.850), BILIRUBIN (3.700), ASCITES and MALAISE, and terminal nodes labelled die or live]

Figure 1: The value beneath a terminal node is the predicted class for the node, and the numbers beside a terminal node are the numbers of learning samples for each class in the node. Their class labels, from left to right, are die, live. The splitting rule for each intermediate node is given beside the node.
20. l 5 3 Linear combination splits 145 missing 10 Summary of categorical variable FIRMLIVER category frequency no 84 yes 60 144 missing 11 Summary of categorical variable SPLEEN category frequency no 120 yes 30 150 missing 5 Summary of categorical variable SPIDERS category frequency no 99 yes 51 150 missing 5 Summary of categorical variable ASCITES category frequency no 130 yes 20 150 missing 5 Summary of categorical variable VARICES category frequency no 132 25 QUEST manual 5 3 Linear combination splits 150 missing 5 Summary of numerical variable BILIRUBIN Size Obs Min Max Mean Sd 155 149 0 300E 00 0 800E 01 0 143E 01 0 121E 01 Summary of numerical variable ALKPHOSPHATE Size Obs Min Max Mean Sd 155 126 0 260E 02 0 295E 03 0 105E 03 0 515E 02 Summary of numerical variable SGOT Size Obs Min Max Mean Sd 155 151 0 140E 02 0 648E 03 0 859E 02 0 897E 02 Summary of numerical variable ALBUMIN Size Obs Min Max Mean Sd 155 139 0 210E 01 0 640E 01 0 382E 01 0 652E 00 Summary of numerical variable PROTIME Size Obs Min Max Mean Sd 155 88 0 000E 00 0 100E 03 0 619E 02 0 229E 02 Summary of categorical variable HISTOLOGY category frequency no 85 yes 70 155 Options for tree construction estimated priors are Class prior die 0 20645 live 0 79355 The cost matrix is in the following format cost 1 1 cost 1 2 cost i no of class cost 2 1 cost 2 2 cost
21. s Mode of MALAISE die 12 yes live 28 no 40 aE kk kkk kk kkk kkk kkk k ak k kk 21 kK K K KKK kkk k kk K K K Node 8 Terminal node assigned to Class live Class cases die 3 live 18 21 FOO OO ORO o o k kk kk kk kk kk kk kk Node 9 Intermediate node A case goes into Node 14 if its value of STEROID no Class cases Mode of STEROID die 9 yes live 10 yes 19 akk kk k k CA ACA akak ak ak A ak ak ak 3K 3k 21 21 K K K 1 KKK K KK K K K Node 14 Terminal node assigned to Class live Class cases die 0 live 4 4 k k k k k k k k k kk kK k K K k FK okkek 2K K K FK FK rrr rer K FK FK es fo K FK FK gt K gt K Node 15 Intermediate node 18 QUEST manual 5 1 Annotated output A case goes into Node 16 if its value of PROTIME lt 70 500 Class cases Mean of PROTIME die 9 36 333 live 6 100 00 15 FE I A A A A AC A A A A k ak ak ak k k kK K K Kk kkk k k kk K K K Node 16 Terminal node assigned to Class die Class cases die 9 live 0 9 FEC A A k k kk ak A k ak ak ak 3k 2k 21 21 K K K KKK KK K k KK K K K Node 17 Terminal node assigned to Class live Class cases die 0 live 6 6 aE kk kkk kkk k kkk kk k k k k kk k KK KK KKK kkk k kk KK K Node 7 Terminal node assigned to Class die Class cases die 9 live 4 13 FE k kk kk kkk A AC A kkk k k k kk k KK KK Kk kkk k k kk KK K Node 5 Terminal node assigned to Class die Class cases die 8 live 0 8 aa k kkk kkk k kk k k k k k k kk k k k ak
22. st value is usually best.

Q10 For the split point, this asks the user to choose between the method using discriminant analysis (Loh and Shih, 1997) and the exhaustive search method (Breiman et al., 1984). The former is the default option if the number of classes is more than 2; otherwise the latter is the default. If the latter option is selected, the program will ask the user to choose the splitting criterion. These criteria are studied in Shih (1999). The likelihood criterion is the default option. If instead the CART-style split is used, the Gini criterion is the default.

Q11 The number of SEs controls the size of the pruned tree. 0-SE gives the tree with the smallest cross-validation estimate of misclassification cost or error.

Q12 The user can choose to select the final tree by cross-validation or by test sample pruning. Test sample estimates are available for both trees.

Q13 This asks for the value of V in V-fold cross-validation. The larger the value of V, the longer the program takes to run. 10-fold is usually recommended and is the default in CART.

Q14 The test sample estimate can be obtained for the final CV tree if it is needed.

Q15 The details of the CV tree sequences are reported if the user chooses 2. They are not reported by default.

Q16 If LaTeX source code for drawing the tree is needed, the user should choose 2 to
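The effect of the number of SEs on tree size can be illustrated with the CV table from the annotated hepatitis output. The sketch below assumes QUEST follows CART's convention, as the output label "CART SE-rule using CART SE" suggests: find the subtree with the smallest CV mean, add the chosen number of its standard errors, and take the smallest subtree whose CV mean does not exceed that threshold. The function name and tuple layout are illustrative.

```python
def select_tree(subtrees, num_se=0.0):
    """Pick a pruned subtree by the k-SE rule.

    subtrees: list of (terminal_nodes, cv_mean, cv_se) tuples.
    num_se:   number of standard errors added to the minimum CV mean.
    Returns the tuple of the smallest eligible subtree.
    """
    best = min(subtrees, key=lambda t: t[1])
    threshold = best[1] + num_se * best[2]
    eligible = [t for t in subtrees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])


# (terminal nodes, CV mean, SE) from the sample output for the hepatitis data.
subtrees = [
    (15, 0.3355, 0.04937),
    (9, 0.3419, 0.05034),
    (8, 0.3290, 0.05089),
    (7, 0.2903, 0.04911),
    (2, 0.3226, 0.04556),
    (1, 0.4129, 0.06502),
]
print(select_tree(subtrees, 0.0)[0], select_tree(subtrees, 1.0)[0])
```

With 0 SEs the rule picks the 7-terminal-node tree, which is the one used in the sample session. Under the stated assumption, 1 SE would pick the 2-terminal-node tree instead.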
23. t cr gt 0 000 0 live as class die 1 000 lt cr gt 1 000 1 die as class live 1 000 lt cr gt 1 000 2 live as class live 0 000 lt cr gt 0 000 0 minimal node size of constructed tree 1 155 lt cr gt 5 5 splitting method 1 for univariate 2 for linear 1 2 lt cr gt 1 1 Input variable selection method 1 unbiased statistical tests 2 biased exhaustive search Input 1 or 2 1 2 lt cr gt 1 1 Q9 Input the alpha value 0 1000E 02 0 9990 lt cr gt 0 5000E 01 0 05 Q10 Input method of split point selection 1 discriminant analysis 2 exhaustive search Input 1 or 2 1 2 lt cr gt 2 2 Input 1 for 2 for 3 for 4 for 5 for Input Q11 Input splitting criterion likelihood ratio G 2 Pearson chi 2 Gini MPI other members of the divergence family 1 2 3 4 or 5 1 5 lt cr gt 1 2 number of SEs for pruning 0 000 lt cr gt 1 000 0 0 QUEST manual 4 1 Interactive mode Q12 Input 1 to prune by CV 2 to prune by test sample 1 2 lt cr gt 1 1 Q13 Input number of fold 2 155 lt cr gt 10 155 Q14 Input 1 if you DO NOT want test sample estimate else 2 Input 1 or 2 1 2 lt cr gt 1 Q15 Input 1 if you do NOT want the details for CV trees else 2 Input 1 or 2 1 2 lt cr gt 1 1 Q16 Input 1 if you do NOT want Pstricks LaTeX code else 2 1 2 lt cr gt 1 2 Input name of file to store Pstricks LaTeX code hep tex Input 1 if
24. the program about the name of the data file, the names and the column locations of the variables, and their roles in the analysis. The following is an example file (hepdsc.txt) included with the distribution file.

    hepdat.txt
    [missing-value code]
    column var type
    1 Class d
    2 AGE n
    3 SEX c
    4 STEROID c
    5 ANTIVIRALS c
    6 FATIGUE c
    7 MALAISE c
    8 ANOREXIA c
    9 BIGLIVER c
    10 FIRMLIVER c
    11 SPLEEN c
    12 SPIDERS c
    13 ASCITES c
    14 VARICES c
    15 BILIRUBIN n
    16 ALKPHOSPHATE n
    17 SGOT n
    18 ALBUMIN n
    19 PROTIME n
    20 HISTOLOGY c

The content of the file is explained in the following.

1. The first line gives the name of the learning sample.

2. The second line gives the code that denotes a missing value in the data. A missing value code must be present in the second line even if there are no missing values in the data (in which case any character string not present in the data file can be used). If the string contains characters other than letters or numbers, it must be surrounded by quotation marks.

3. The third line contains three character strings to indicate column headers for the subsequent lines.

4. The position, name and role of each variable come next, with one line for each variable. The following roles for the variables are permitted:

c: This is a categorical variable.

d: This is the class (dependent) variable. Only one variable can have the d indicator.

n: This is a numerical variable.
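The layout of the description file lends itself to a quick sanity check before QUEST is run. The sketch below is a hypothetical helper, not part of QUEST. It reads the four parts described above and checks the role codes, including the f and x codes that the manual also permits. It assumes variable names contain no embedded spaces and that exactly one d variable is required. The file name toy.txt and the code NA in the usage example are made up for illustration.

```python
VALID_ROLES = {"c", "d", "n", "f", "x"}


def parse_description(text):
    """Parse description-file text into (data_file, missing_code, variables).

    variables is a list of (column, name, role) tuples.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    data_file = lines[0]
    missing_code = lines[1].strip("\"'")   # quotes are optional around the code
    variables = []
    for ln in lines[3:]:                   # lines[2] holds the column headers
        column, name, role = ln.replace(",", " ").split()
        if role not in VALID_ROLES:
            raise ValueError(f"unknown role {role!r} for variable {name}")
        variables.append((int(column), name, role))
    roles = [role for _, _, role in variables]
    if roles.count("d") != 1:
        raise ValueError("exactly one variable must have the d indicator")
    if roles.count("f") > 1:
        raise ValueError("only one variable can have the f indicator")
    return data_file, missing_code, variables


sample = """toy.txt
NA
column var type
1 Class d
2 AGE n
3 SEX c
4 WEIGHT f
5 ID x
"""
data_file, missing_code, variables = parse_description(sample)
print(data_file, missing_code, len(variables))
```

Checking the role counts up front catches the two constraints that are easiest to violate when editing the file by hand: a duplicated d line and a second f line.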
