4.7 Visualizing

WEKA's visualization section allows you to visualize 2D plots of the current relation.

4.7.1 The scatter plot matrix

When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, colour coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the point size, and to randomly jitter the data to uncover obscured points. It is also possible to change the attribute used to colour the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to sub-sample the data. Note that changes will only come into effect once the Update button has been pressed.

4.7.2 Selecting an individual 2D scatter plot

When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualization of the scatter plot you selected. (We described above how to visualize particular results in a separate window, for example classifier errors; the same visualization controls are used here.)

Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is used for the y-axis. Beneath the x-axis selector ...
5.2.1.2 Results destination

By default, an ARFF file is the destination for the results output, but you can choose between:

- ARFF file
- CSV file
- JDBC database

ARFF file and JDBC database are discussed in detail in the following sections. CSV is similar to ARFF, but it can be loaded into an external spreadsheet application.

ARFF file

If the file name is left empty, a temporary file will be created in the TEMP directory of the system. If one wants to specify an explicit results file, click on Browse and choose a filename, e.g. Experiment1.arff. Click on Save and the name will appear in the edit field next to ARFF file.
The schemes used in the experiment are shown in the columns and the datasets used are shown in the rows. The percentage correct for each of the 3 schemes is shown in each dataset row: 33.33% for ZeroR, 94.31% for OneR and 94.90% for J48. The annotation v or * indicates that a specific result is statistically better (v) or worse (*) than the baseline scheme (in this case, ZeroR) at the significance level specified (currently 0.05). The results of both OneR and J48 are statistically better than the baseline established by ZeroR. At the bottom of each column (after the first column) is a count (xx/yy/zz) of the number of times that the scheme was better than (xx), the same as (yy), or worse than (zz) the baseline scheme on the datasets used in the experiment. In this example, there was only one dataset and OneR was better than ZeroR once and never equivalent to or worse than ZeroR (1/0/0); J48 was also better than ZeroR on the dataset.

The standard deviation of the attribute being evaluated can be generated by selecting the Show std. deviations check box and hitting Perform test again. The value (10) at the beginning of the iris row represents the number of estimates that are used to calculate the standard deviation (the number of runs in this case).

5.4 ANALYSING RESULTS
...bayes.NaiveBayes, bayes.NaiveBayesUpdateable.

6.3.5 Clusterers

All of WEKA's clusterers are available.

6.3.6 Evaluation

- TrainingSetMaker - make a data set into a training set.
- TestSetMaker - make a data set into a test set.
- CrossValidationFoldMaker - split any data set, training set or test set into folds.
- TrainTestSplitMaker - split any data set, training set or test set into a training set and a test set.
- ClassAssigner - assign a column to be the class for any data set, training set or test set.
- ClassValuePicker - choose a class value to be considered as the "positive" class. This is useful when generating data for ROC style curves (see ModelPerformanceChart below and example 6.4.2).
- ClassifierPerformanceEvaluator - evaluate the performance of batch trained/tested ...
   4.2.4 Working With Filters
4.3 Classification
   4.3.1 Selecting a Classifier
   4.3.2 Test Options
   4.3.3 The Class Attribute
   4.3.4 Training a Classifier
   4.3.5 The Classifier Output Text
   4.3.6 The Result List
4.4 Clustering
   4.4.1 Selecting a Clusterer
   4.4.2 Cluster Modes
   4.4.3 Ignoring Attributes
   4.4.4 Working with Filters
   4.4.5 Learning Clusters
4.5 Associating
   4.5.1 Setting Up
   4.5.2 Learning Associations
4.6 Selecting Attributes
   4.6.1 Searching and Evaluating
   4.6.2 Options
   4.6.3 Performing Selection
4.7 Visualizing
   4.7.1 The scatter plot matrix
   4.7.2 Selecting an individual 2D scatter plot
   4.7.3 Selecting Instances
5 Experimenter
5.1 Introduction
5.2 Standard Experiments
The Bayes network is automatically laid out and drawn thanks to a graph drawing algorithm implemented by Ashraf Kibriya.

When you hover the mouse over a node, the node lights up and all its children are highlighted as well, so that it is easy to identify the relation between nodes in crowded graphs.

Saving Bayes nets. You can save the Bayes network to file in the graph visualizer. You have the choice to save as XML BIF format or as dot format. Select the floppy button, and a file save dialog pops up that allows you to select the file name and file format.

Zoom. The graph visualizer has two buttons to zoom in and out. Also, the exact zoom desired can be entered in the zoom percentage entry. Hit enter to redraw at the desired zoom level.

8.9 BAYES NETWORK GUI

Graph drawing options. Hit the 'extra controls' button to show extra options that control the graph layout settings: Layout Type (Naive Layout or Priority Layout), Layout Method (Top Down, Bottom Up, With Edge Concentration) and Custom Node Size (Width, Height). The Layout Type determines the algorithm applied to place the nodes. The Layout Method determines in...
This scheme has no modifiable properties (besides debug mode on/off), but most other schemes do have properties that can be modified by the user. The Capabilities button opens a small dialog listing all the attribute and class types this classifier can handle. Click on the Choose button to select a different scheme. The window below shows the parameters available for the J48 decision tree scheme. If desired, modify the parameters and then click OK to close the window.

The J48 parameters shown in the GenericObjectEditor are: binarySplits (False), confidenceFactor (0.25), debug (False), minNumObj (2), numFolds (3), reducedErrorPruning (False), saveInstanceData (False), seed (1), subtreeRaising (True), unpruned (False) and useLaplace (False).

The name of the new scheme is displayed in the Result generator panel.
4.2.1 Loading Data

The first four buttons at the top of the preprocess section enable you to load data into WEKA:

1. Open file. Brings up a dialog box allowing you to browse for the data file on the local file system.
2. Open URL. Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB. Reads data from a database. (Note that to make this work you might have to edit the file in weka/experiment/DatabaseUtils.props.)
4. Generate. Enables you to generate artificial data from a variety of DataGenerators.

Using the Open file button you can read files in a variety of formats: WEKA's ARFF format, CSV format, C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.

NB: This list of formats can be extended by adding custom file converters to the weka.core.converters package.

4.2.2 The Current Relation

Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation box (the current rela...
To define the dataset to be processed by a scheme, first select Use relative paths in the Datasets panel of the Setup tab, and then click on Add new to open a dialog window. Double click on the data folder to view the available datasets, or navigate to an alternate location. Select iris.arff and click Open to select the Iris dataset.
The dataset name is now displayed in the Datasets panel of the Setup tab.

Saving the Results of the Experiment

To identify a dataset to which the results are to be sent, click on the InstancesResultListener entry in the Destination panel. The output file parameter is near the bottom of the window, beside the text outputFile. Click on this parameter to display a file selection window.
...since this is done automatically. And if we now put all of this together, we can transform this more complicated command line (the java call and the CLASSPATH are omitted) into an equivalent XML options file:

<options type="class" value="weka.classifiers.meta.Stacking">
   <option name="B" type="quotes">
      <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
         <option name="W" type="hyphens">
            <options type="classifier" value="weka.classifiers.trees.J48">
               <option name="C">0.001</option>
            </options>
         </option>
      </options>
   </option>
   <option name="B" type="quotes">
      <options type="classifier" value="weka.classifiers.meta.Bagging">
         <option name="W" type="hyphens">
            <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
               <option name="W" type="hyphens">
                  <options type="classifier" value="weka.classifiers.trees.J48"/>
               </option>
            </options>
         </option>
      </options>
   </option>
   <option name="B" type="quotes">
      <options type="classifier" value="weka.classifiers.meta.Stacking">
         <option name="B" type="quotes">
            <options type="classifier" value="weka.classifiers.trees.J48"/>
         </option>
      </options>
   </option>
   <option name="t">test/datasets/hepatitis.arff</option>
</options>

Note: The type and value attributes of the outermost options tag are not used while reading the parameters...
Prior to versions 3.4.5 and 3.5.0, it looked like this:

<!DOCTYPE object [
   <!ELEMENT object (#PCDATA | object)*>
   <!ATTLIST object name      CDATA #REQUIRED>
   <!ATTLIST object class     CDATA #REQUIRED>
   <!ATTLIST object primitive CDATA "yes">
   <!ATTLIST object array     CDATA "no">
]>

Responsible Class(es):
weka.experiment.xml.XMLExperiment
for general Serialization:
weka.core.xml.XMLSerialization
weka.core.xml.XMLBasicSerialization

KOML (http://koala.ilog.fr/XML/serialization/)

The Koala Object Markup Language (KOML) is published under the LGPL (http://www.gnu.org/copyleft/lgpl.html) and is an alternative way of serializing and deserializing Java Objects in an XML file. Like the normal serialization, it serializes everything into XML via an ObjectOutputStream, including the SerialUID of each class. Even though we have the same problems with mismatching SerialUIDs, it is at least possible to edit the XML files by hand and replace the offending IDs with the new ones.

In order to use KOML, one only has to assure that the KOML classes are in the CLASSPATH with which the Experimenter is launched. As soon as KOML is present, another Filter (.koml) will show up in the Save/Open Dialog.

The DTD for KOML can be found at http://koala.ilog.fr/XML/koml12.dtd

Responsible Class(es):
weka.core.xml.KOML

The experiment class can of course read those XML files if passed as input or output file (see options of...
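As a purely hypothetical illustration of an element conforming to the old DTD above (the name and value here are invented, not taken from a real serialized experiment):

```xml
<object name="seed" class="int" primitive="yes">1</object>
```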
15.2 Paper references (Chapter 15, Research)

      b) BIBTEX="yes";;
      d) DIR=$OPTARG;;
      w) WEKA=$OPTARG;;
      h) usage; exit 0;;
      *) usage; exit 1;;
   esac
done

# either plaintext or bibtex
if [ "$PLAINTEXT" = "$BIBTEX" ]
then
   echo
   echo "ERROR: either -p or -b has to be given!"
   echo
   usage
   exit 2
fi

# do we have everything?
if [ "$DIR" = "" ] || [ ! -d "$DIR" ]
then
   echo
   echo "ERROR: no directory or non-existing one provided!"
   echo
   usage
   exit 3
fi

# generate Java call
if [ "$WEKA" = "" ]
then
   JAVA="java"
else
   JAVA="java -classpath $WEKA"
fi
if [ "$PLAINTEXT" = "yes" ]
then
   CMD="$JAVA $TECHINFO -plaintext"
elif [ "$BIBTEX" = "yes" ]
then
   CMD="$JAVA $TECHINFO -bibtex"
fi

# find packages
TMP=`find $DIR -mindepth 1 -type d | grep -v CVS | sed s/".*weka"/"weka"/g | sed s/"\/"/"."/g`
PACKAGES=`echo $TMP | sed s/" "/","/g`

# get TechnicalInformationHandlers
TECHINFOHANDLERS=`$JAVA weka.core.ClassDiscovery $TECHINFOHANDLER $PACKAGES | grep "weka" | sed s/".*weka"/"weka"/g`

# output information
echo
for i in $TECHINFOHANDLERS
do
   TMP=$i

   # exclude internal classes
   if [ ! "`echo $TMP | grep '\\$'`" = "" ]
   then
      continue
   fi

   $CMD -W $i
   echo
done

Chapter 16: Technical documentation

16.1 ANT

What is ANT? This is how the ANT homepage (http://ant.apache.org/) defines its tool: Apache Ant is a Java-based build tool. In theory, it is kind of like Make, but without Make's wrinkles.
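To make the description of ANT concrete, here is a minimal, hypothetical build file (this is not Weka's actual build.xml; target and directory names are invented):

```xml
<project name="example" default="compile" basedir=".">
   <!-- compile all sources under src into build/classes -->
   <target name="compile">
      <mkdir dir="build/classes"/>
      <javac srcdir="src" destdir="build/classes"/>
   </target>
   <!-- remove all generated files -->
   <target name="clean">
      <delete dir="build"/>
   </target>
</project>
```

Running "ant" in the directory containing this file would execute the default target (compile); "ant clean" would execute the clean target.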
  jdbc:oracle:thin:@server.my.domain:1526:orcl
  Note: @machineName:port:SID
  For the Express Edition you can use:
  jdbc:oracle:thin:@server.my.domain:1521:XE

- PostgreSQL
  jdbc:postgresql://server.my.domain:5432/MyDatabase
  You can also specify user and password directly in the URL:
  jdbc:postgresql://server.my.domain:5432/MyDatabase?user=<...>&password=<...>
  where you have to replace the <...> with the correct values.

- sqlite 3.x
  jdbc:sqlite:/path/to/database.db
  (you can access only local files)

13.3 Missing Datatypes

Sometimes (e.g., with MySQL) it can happen that a column type cannot be interpreted. In that case it is necessary to map the name of the column type to the Java type it should be interpreted as. E.g., the MySQL type TEXT is returned as BLOB from the JDBC driver and has to be mapped to String (0 represents String; the mappings can be found in the comments of the properties file):

Identifier | Java type | Java method  | Weka attribute type
0          | String    | getString()  | nominal
1          | boolean   | getBoolean() | nominal
2          | double    | getDouble()  | numeric
3          | byte      | getByte()    | numeric
4          | short     | getShort()   | numeric
5          | int       | getInt()     | numeric
6          | long      | getLong()    | numeric
7          | float     | getFloat()   | numeric
8          | date      | getDate()    | date
9          | text      | getString()  | string
10         | time      | getTime()    | date

In the props file, one now lists the type names that the database returns and what Java type each represents, via the identifier, e.g.:

CHAR=0
VARCHAR=0
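Following the TEXT example above, the mapping could be added with an entry like this in the properties file (a sketch of a single entry, not the complete file):

```properties
# map the MySQL TEXT type (reported as BLOB by the JDBC driver)
# to identifier 0, i.e. String, read as a nominal Weka attribute
TEXT=0
```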
-R <seed>
   Random number seed
-mbc
   Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
   Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q
   Use probabilistic or 0/1 scoring (default: probabilistic scoring)

weka.classifiers.bayes.net.search.global.HillClimber

-P <nr of parents>
   Maximum number of parents
-R
   Use arc reversal operation (default: false)
-N
   Initial structure is empty (instead of Naive Bayes)
-mbc
   Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
   Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q
   Use probabilistic or 0/1 scoring (default: probabilistic scoring)

weka.classifiers.bayes.net.search.global.K2

-N
   Initial structure is empty (instead of Naive Bayes)
-P <nr of parents>
   Maximum number of parents
-R
   Random order (default: false)
-mbc
   Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
   Score type...
AveragingResultProducer

An alternative to the CrossValidationResultProducer is the AveragingResultProducer. This result producer takes the average of a set of runs (which are typically cross-validation runs). This result producer is identified by clicking the Result generator panel and then choosing the AveragingResultProducer from the GenericObjectEditor.

The associated help file is shown below:

NAME
weka.experiment.AveragingResultProducer

SYNOPSIS
Takes the results from a ResultProducer and submits the average to the result listener. Normally used with a CrossValidationResultProducer to perform n x m fold cross validation.

OPTIONS
calculateStdDevs -- Record standard deviations for each run.
expectedResultsPerAverage -- Set the expected number of results to
[Screenshot: the ArffViewer showing rows of the heart-disease dataset, with a column context menu offering Rename attribute, Delete attribute, Delete attributes, Sort data (ascending), Optimal column width (current) and Optimal column width (all).]
In this experiment, the ZeroR, OneR and J48 schemes are run 10 times with 10-fold cross-validation. Each set of 10 cross-validation folds is then averaged, producing one result line for each run (instead of one result line for each fold, as in the previous example using the CrossValidationResultProducer), for a total of 30 result lines. If the raw output is saved, all 300 results are sent to the archive.

5.2 STANDARD EXPERIMENTS
16.1.1 Basics

- The ANT build file is based on XML.
- The usual name for the build file is: build.xml
- Invocation: the build file needs not be specified explicitly if it is in the current directory; if no target is specified, the default one is used.
     ant [-f <build-file>] [<target>]
- Displaying all the available targets of a build file:
     ant [-f <build-file>] -projecthelp

16.1.2 Weka and ANT

- A build file for Weka is available from subversion.
- Some targets of interest:
  - clean: Removes the build, dist and reports directories; also any class files in the source tree.
  - compile: Compile weka and deposit class files in path_modifier/build/classes.
  - docs: Make javadocs into path_modifier/doc.
  - exejar: Create an executable jar file in path_modifier/dist.

16.1.3 Links

- ANT homepage: http://ant.apache.org/
- XML: http://www.w3.org/XML/

16.2 CLASSPATH

The CLASSPATH environment variable tells Java where to look for classes. Since Java does the search in a first-come-first-serve kind of manner, you will have to take care where and what to put in your CLASSPATH. I personally never use the environment variable, since I am often working on a project in different versions in parallel. The CLASSPATH would just mess things up if you are not careful (or just forget to remove an entry). ANT (http://ant.apache.org/) offers a nice way fo...
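One way around a global CLASSPATH, sketched below, is to pass the class path per invocation with java's -classpath option, which overrides the environment variable for that call. The jar paths here are hypothetical placeholders; the command is only assembled and printed, not executed:

```shell
# build an explicit, per-invocation classpath (hypothetical jar locations)
CP="./weka.jar:./libs/koml.jar"
# -classpath overrides the CLASSPATH environment variable for this call only
CMD="java -classpath $CP weka.core.SystemInfo"
echo "$CMD"
```

Because nothing leaks into the environment, two shells can run two different Weka versions side by side without interfering with each other.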
-A <float>
   Start temperature
-U <integer>
   Number of runs
-D <float>
   Delta temperature
-R <seed>
   Random number seed
-mbc
   Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
   Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.TabuSearch

-L <integer>
   Tabu list length
-U <integer>
   Number of runs
-P <nr of parents>
   Maximum number of parents
-R
   Use arc reversal operation (default: false)
-N
   Initial structure is empty (instead of Naive Bayes)
-mbc
   Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
   Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.TAN

-mbc
   Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the...
5.4 ANALYSING RESULTS

If the test is performed on the Percent_correct field with OneR as the base scheme, the system indicates that there is no statistical difference between the results for OneR and J48. There is, however, a statistically significant difference between OneR and ZeroR.

5.4.4 Statistical Significance T...
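For reference, Weka's PairedCorrectedTTester, used for the comparisons above, implements the corrected resampled t-test of Nadeau and Bengio. A sketch of the statistic, in my own notation (the symbols below are not taken from this manual): with k result pairs, per-pair differences d_i in the comparison field, and n_1 and n_2 training and test instances per run,

```latex
t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\hat{\sigma}^2}},
\qquad
\bar{d} = \frac{1}{k}\sum_{i=1}^{k} d_i,
\qquad
\hat{\sigma}^2 = \frac{1}{k-1}\sum_{i=1}^{k}\left(d_i - \bar{d}\right)^2
```

The extra n_2/n_1 term inflates the variance estimate to compensate for the overlap between resampled training sets, which makes the test far less prone to spurious "significant" differences than the standard paired t-test.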
-d
   The model after training can be saved via this parameter. Each classifier has a different binary format for the model, so it can only be read back by the exact same classifier on a compatible dataset. Only the model on the training set is saved, not the multiple models generated via cross-validation.
-l
   Loads a previously saved model, usually for testing on new, previously unseen data. In that case, a compatible test file should be specified, i.e. the same attributes in the same order.
-p #
   If a test file is specified, this parameter shows you the predictions and one attribute (0 for none) for all test instances.
-i
   A more detailed performance description via precision, recall, and true and false positive rate is additionally output with this parameter. All these values can also be computed from the confusion matrix.
-o
   This parameter switches the human-readable output of the model description off. In the case of support vector machines or NaiveBayes this makes some sense, unless you want to parse and visualize a lot of information.

We now give a short list of selected classifiers in WEKA. Other classifiers below weka.classifiers may also be used. This is easier to see in the Explorer GUI.

- trees.J48: A clone of the C4.5 decision tree learner.
- bayes.NaiveBayes: A Naive Bayesian learner. -K switches on kernel density estimation for numerical attributes, which often improves performance.
- meta.ClassificationViaRegression: -W functions.LinearRe...
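As a sketch of how the -d and -l parameters above combine into a save/load cycle (the file names train.arff, test.arff and j48.model are hypothetical placeholders, and the commands are only assembled and printed here, not executed):

```shell
# train J48 on a training file and save the resulting model with -d
TRAIN="java weka.classifiers.trees.J48 -t train.arff -d j48.model"
# later: load the saved model with -l and print predictions for a test file
TEST="java weka.classifiers.trees.J48 -l j48.model -T test.arff -p 0"
printf '%s\n%s\n' "$TRAIN" "$TEST"
```

Note that the second command must use the exact same classifier class (J48) and a test file with the same attributes in the same order, as explained above.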
average per run. For example, if a CrossValidationResultProducer is being used (with the number of folds set to 10), then the expected number of results per run is 10.
keyFieldName -- Set the field name that will be unique for a run.
resultProducer -- Set the resultProducer for which results are to be averaged.

Clicking the resultProducer panel brings up the GenericObjectEditor for the CrossValidationResultProducer, with the options numFolds (10), outputFile (splitEvalutorOut.zip), rawOutput (False) and splitEvaluator (ClassifierSplitEvaluator). As with the other ResultProducers, additional schemes can be defined. When the AveragingResultProducer is used, the classifier property is located deeper in the Generator properties hierarchy.
The Delete Arc menu brings up a dialog with a list of all arcs that can be deleted.

The list of eight items at the bottom is active only when a group of at least two nodes is selected:

- Align Left/Right/Top/Bottom: moves the nodes in the selection such that all nodes align to the utmost left, right, top or bottom node in the selection, respectively.
- Center Horizontal/Vertical: moves nodes in the selection halfway between the left- and right-most (or top- and bottom-most) node, respectively.
- Space Horizontal/Vertical: spaces out nodes in the selection evenly between the left- and right-most (or top- and bottom-most) node, respectively.

The order in which the nodes are selected impacts the place the node is moved to.

Tools menu

(The Tools menu contains: Generate Network (Ctrl+N), Generate Data (Ctrl+D), Set Data (Ctrl+A), Learn Network (Ctrl+L), Learn CPT, Layout, Show Margins and Show Cliques.)

The Generate Network menu item allows generation of a complete random Bayesian network. It brings up a dialog to specify the number of nodes, number of arcs, cardinality and a random seed to generate a network.

The Generate Data menu item allows...
e.g. for visualization via multi-dimensional scaling:

java weka.filters.supervised.attribute.NominalToBinary \
  -i data/contact-lenses.arff -o contact-lenses-bin.arff -c last

Keep in mind that most classifiers in WEKA utilize transformation filters internally, e.g. Logistic and SMO, so you will usually not have to use these filters explicitly. However, if you plan to run a lot of experiments, pre-applying the filters yourself may improve runtime performance.

weka.filters.supervised.instance.Resample creates a stratified subsample of the given dataset. This means that overall class distributions are approximately retained within the sample. A bias towards uniform class distribution can be specified via -B.

java weka.filters.supervised.instance.Resample -i data/soybean.arff \
  -o soybean-5.arff -c last -Z 5
java weka.filters.supervised.instance.Resample -i data/soybean.arff \
  -o soybean-uniform-5.arff -c last -Z 5 -B 1

StratifiedRemoveFolds creates stratified cross-validation folds of the given dataset. This means that, by default, the class distributions are approximately retained within each fold. The following example splits soybean.arff into stratified training and test datasets, the latter consisting of 25% (1/4) of the data:

java weka.filters.supervised.instance.StratifiedRemoveFolds \
  -i data/soybean.arff -o soybean-train.arff -c last -N 4 -F 1 -V
java weka.filters.supervised.instance.Stra...
After the experiment setup is complete, run the experiment. Then, to analyse the results, select the Analyse tab at the top of the Experiment Environment window. Click on Experiment to analyse the results of the current experiment.

The number of result lines available (Got 30 results) is shown in the Source panel. This experiment consisted of 10 runs, for 3 schemes, f...
[Screenshot: the ARFF-Viewer displaying the hungarian-14-heart-disease dataset]

For convenience, it is possible to sort the view based on a column (the underlying data is not changed).
only non-abstract ones are then listed. E.g., the weka.classifiers.Classifier entry in the GOE file looks like this:

weka.classifiers.Classifier=\
 weka.classifiers.bayes.AODE,\
 weka.classifiers.bayes.BayesNet,\
 weka.classifiers.bayes.ComplementNaiveBayes,\
 weka.classifiers.bayes.NaiveBayes,\
 weka.classifiers.bayes.NaiveBayesMultinomial,\
 weka.classifiers.bayes.NaiveBayesSimple,\
 weka.classifiers.bayes.NaiveBayesUpdateable,\
 weka.classifiers.functions.LeastMedSq,\
 weka.classifiers.functions.LinearRegression,\
 weka.classifiers.functions.Logistic,\
 ...

The entry producing the same output for the classifiers in the GPC looks like this (7 lines instead of over 70):

weka.classifiers.Classifier=\
 weka.classifiers.bayes,\
 weka.classifiers.functions,\
 weka.classifiers.lazy,\
 weka.classifiers.meta,\
 weka.classifiers.trees,\
 weka.classifiers.rules

Exclusion

It may not always be desired to list all the classes that can be found along the CLASSPATH. Sometimes, classes cannot be declared abstract but still shouldn't be listed in the GOE. For that reason one can list classes, interfaces or superclasses for certain packages to be excluded from display. This exclusion is done with the following file:

weka/gui/GenericPropertiesCreator.excludes

The format of this properties file is fairly simple:

<key>=<prefix>:<class>[,<prefix>:<class>]

where the <key> corresponds to a key in the GenericPropertiesCreator.props file.
Align Bottom, Center Horizontal, Center Vertical, Space Horizontal and Space Vertical.

Unlimited undo/redo support. Most edit operations on the Bayesian network are undoable. A notable exception is learning of network and CPTs.

Cut/copy/paste support. When a set of nodes is selected, these can be placed on a clipboard (internal, so no interaction with other applications yet) and a paste action will add the nodes. Nodes are renamed by adding "Copy of" before the name, and adding numbers if necessary to ensure uniqueness of name. Only the arrows to parents are copied, not those of the children.

The Add Node menu brings up a dialog (see below) that allows you to specify the name and the cardinality of the new node. Node values are assigned the names 'Value1', 'Value2' etc. These values can be renamed (right click the node in the graph panel and select Rename Value). Another option is to copy/paste a node with values that are already properly named, and rename the node.

[Dialog: Add node, with Name and Cardinality fields]

Then a dialog is shown to select a parent. Descendants of the child node, parents of the child node and the node itself are not listed, since these cannot be selected as parent node (they would introduce cycles or already have an arc in the network).

[Dialog: Select parent node for sepallength]
    16.5.2 Examples
  16.6 XML
    16.6.1 Command Line
    16.6.2 Serialization of Experiments
    16.6.3 Serialization of Classifiers
    16.6.4 Bayesian Networks
    16.6.5 XRFF files

17 Other resources
  17.1 Mailing list
  17.2 Troubleshooting
    17.2.1 Weka download problems
    17.2.2 OutOfMemoryException
      17.2.2.1 Windows
    17.2.3 Mac OSX
    17.2.4 StackOverflowError
    17.2.5 just-in-time (JIT) compiler
    17.2.6 CSV file conversion
    17.2.7 ARFF file doesn't load
    17.2.8 Spaces in labels of ARFF files
    17.2.9 CLASSPATH problems
    17.2.10 Instance ID
      17.2.10.1 Adding the ID
      17.2.10.2 Removing the ID
    17.2.11 Visualization
    17.2.12 Memory consumption and Garbage collector
    17.2.13 GUIChooser starts but not Experimenter or Explorer
    17.2.14 KnowledgeFlow toolbars are empty

Bibliography
-A <alpha>
  Initial count (alpha)

weka.classifiers.bayes.net.estimate.BMAEstimator
-k2
  Whether to use K2 prior
-A <alpha>
  Initial count (alpha)

weka.classifiers.bayes.net.estimate.MultiNomialBMAEstimator
-k2
  Whether to use K2 prior
-A <alpha>
  Initial count (alpha)

weka.classifiers.bayes.net.estimate.SimpleEstimator
-A <alpha>
  Initial count (alpha)

Generating random networks and artificial data sets

You can generate random Bayes nets and data sets using weka.classifiers.bayes.net.BayesNetGenerator. The options are:

-B
  Generate network (instead of instances)
-N <integer>
  Nr of nodes
-A <integer>
  Nr of arcs
-M <integer>
  Nr of instances
-C <integer>
  Cardinality of the variables
-S <integer>
  Seed for random number generator
-F <file>
  The BIF file to obtain the structure from

The network structure is generated by first generating a tree, so that we can ensure that we have a connected graph. If any more arrows are specified, they are randomly added.

8.8 Inspecting Bayesian networks

You can inspect some of the properties of Bayesian networks that you learned in the Explorer, in text format and also in graphical format.

Bayesian networks in text

Below you find output typical for a 10-fold cross-validation run in the Weka Explorer, with comments where the output is specific for Bayesian nets.
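The "tree first, then random extra arcs" generation scheme described above can be sketched compactly. This is a Python sketch of the idea under my own conventions (arcs only from lower- to higher-numbered nodes to keep the graph acyclic), not BayesNetGenerator's actual Java code:

```python
import random

def random_dag(n_nodes, n_extra_arcs, seed=0):
    """Generate a connected random DAG: first a random tree (which
    guarantees connectivity), then extra arcs added at random."""
    rng = random.Random(seed)
    arcs = set()
    # tree backbone: every node except node 0 gets one parent
    # among the earlier nodes, so the graph is connected
    for child in range(1, n_nodes):
        arcs.add((rng.randrange(child), child))
    # extra arcs; orienting low -> high keeps the graph acyclic
    while len(arcs) < (n_nodes - 1) + n_extra_arcs:
        a, b = rng.sample(range(n_nodes), 2)
        arcs.add((min(a, b), max(a, b)))
    return sorted(arcs)
```

The tree contributes n_nodes - 1 arcs, so the total arc count is n_nodes - 1 plus the number of requested extra arcs.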
9 ARFF
  9.1 Overview
  9.2 Examples
    9.2.1 The ARFF Header Section
    9.2.2 The ARFF Data Section
  9.3 Sparse ARFF files
  9.4 Instance weights in ARFF files

10 XRFF
  10.1 File extensions
  10.2 Comparison
    10.2.1 ARFF
    10.2.2 XRFF
  10.3 Sparse format
  10.4 Compression
  10.5 Useful features
    10.5.1 Class attribute specification
    10.5.2 Attribute weights
    10.5.3 Instance weights

11 Converters
  11.1 Introduction
  11.2 Usage
    11.2.1 File converters
    11.2.2 Database converters

12 Stemmers
  12.1 Introduction
  12.2 Snowball stemmers
  12.3 Using stemmers
    12.3.1 Commandline
    12.3.2 StringToWordVector
  12.4 Adding new stemmers

13 Databases
  13.1 Configuration files
[Screenshot: the Setup tab, with iteration control set to "By run", J48 -C 0.25 -M 2, ZeroR and OneR -B 6 as the algorithms, and data/iris.arff as the dataset]

The number of runs is set to 1 in the Setup tab in this example, so that only one run of cross-validation for each scheme and dataset is executed. When this experiment is analysed, the following results are generated. Note that there are 30 (= 1 run times 10 folds times 3 schemes) result lines processed:

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Percent_correct
Datasets:   1
Resultsets: 3
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       21/12/05 16:47

Dataset        (1) rules.Ze | (2) rules  (3) trees
iris      (10) 33.33        | 94.00 v    96.00 v
               (v/ /*)      | (1/0/0)    (1/0/0)

Key:
(1) rules.ZeroR
(2) rules.OneR -B 6
(3) trees.J48 -C 0.25 -M 2
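The PairedCorrectedTTester used above applies a variance correction to the resampled t-test so that overlapping training sets do not inflate significance. The following Python sketch shows the form of the corrected statistic; the function name and the example value of the test/train ratio are my own illustrative assumptions, not Weka code:

```python
import math

def corrected_paired_t(diffs, test_frac=0.1):
    """Variance-corrected paired t-statistic (Nadeau/Bengio style
    correction, the idea behind PairedCorrectedTTester).

    diffs:     per-fold/run differences in a metric (e.g.
               Percent_correct) between two schemes
    test_frac: ratio n_test / n_train of the resampling scheme
               (an illustrative assumption; 1/9 for 10-fold CV)
    """
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)
    # the (1/k + n_test/n_train) factor replaces the usual 1/k
    return mean / math.sqrt((1.0 / k + test_frac) * var)
```

Compared with the classical paired t-test, the extra test_frac term widens the denominator, making it harder to declare two schemes significantly different from correlated folds.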
To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 randomized train and test runs on the Iris dataset, using 66% of the patterns for training and 34% for testing, and using the ZeroR scheme.

Click Start to run the experiment. If the experiment was defined correctly, the 3 messages shown below will be displayed in the Log panel:

16:17:12: Started
16:17:12: Finished
16:17:12: There were 0 errors

The results of the experiment are saved to the dataset Experiment1.arff. The first few lines in this dataset are shown below:

@relation InstanceResultListener

@attribute Key_Dataset {iris}
@attribute Key_Run {1,2,3,4,5,6,7,8,9,10}
@attribute Key_Scheme {weka.classifiers.rules.ZeroR,weka.classifiers.trees.J48}
@attribute Key_Scheme_options {'','-C 0.25 -M 2'}
...
[Screenshot: the Cluster panel after running EM on the weather data with Classes to clusters evaluation on the nominal attribute play. The output shows: Log likelihood: -3.54934; Classes to Clusters: 9 yes / 5 no assigned to cluster 0; Cluster 0 <-- yes; Incorrectly clustered instances: 5.0 (35.7143 %)]

4.4.1 Selecting a Clusterer

By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

4.4.2 Cluster Modes

The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split (Section 4.3.1), except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize the clusters once training is complete.
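The "Classes to clusters" evaluation above can be illustrated with a short sketch. Note that this Python code is a simplified illustration (majority vote per cluster); Weka's actual evaluation chooses the class-to-cluster assignment that minimises the total error, which coincides with majority voting in simple cases like this one:

```python
from collections import Counter

def classes_to_clusters_error(cluster_ids, class_labels):
    """Map each cluster to its most frequent class, then count the
    instances whose true class disagrees with their cluster's class."""
    by_cluster = {}
    for c, y in zip(cluster_ids, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = {c: counts.most_common(1)[0][0]
                for c, counts in by_cluster.items()}
    wrong = sum(1 for c, y in zip(cluster_ids, class_labels)
                if majority[c] != y)
    return wrong, wrong / len(class_labels)
```

For example, two clusters whose majority classes are 'yes' and 'no' with one dissenting instance each would yield 2 incorrectly clustered instances.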
[Dialog: output format options, including Plain Text, Show Average and Remove filter classnames]

5.4.2 Saving the Results

The information displayed in the Test output panel is controlled by the currently selected entry in the Result list panel. Clicking on an entry causes the results corresponding to that entry to be displayed.

The results shown in the Test output panel can be saved to a file by clicking Save output. Only one set of results can be saved at a time, but Weka permits the user to save all results to the same file by saving them one at a time and using the Append option instead of the Overwrite option for the second and subsequent saves.

5.4.3 Changing the Baseline Scheme

The baseline scheme can be changed by clicking Select base and then selecting the desired scheme. Selecting the OneR scheme causes the other schemes to be compared individually with the OneR scheme.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
  Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.LAGDHillClimber
-L <nr of look ahead steps>
  Look Ahead Depth
-G <nr of good operations>
  Nr of Good Operations
-P <nr of parents>
  Maximum number of parents
-R
  Use arc reversal operation (default false)
-N
  Initial structure is empty (instead of Naive Bayes)
-mbc
  Applies a Markov Blanket correction to the network structure after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
  Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.RepeatedHillClimber
-U <integer>
  Number of runs
-A <seed>
  Random number seed
-P <nr of parents>
  Maximum number of parents
-R
  Use arc reversal operation (default false)
-N
  Initial structure is empty (instead of Naive Bayes)
-mbc
  Applies a Markov Blanket correction to the network structure after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
  Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.SimulatedAnnealing
[Screenshot: Results Destination set to "JDBC database", with URL jdbc:mysql://localhost:3306/weka_test]

The advantage of a JDBC database is the possibility to resume an interrupted or extended experiment. Instead of re-running all the other algorithm/dataset combinations again, only the missing ones are computed.

5.2.1.3 Experiment type

The user can choose between the following three different types:

- Cross-validation (default)
  performs stratified cross-validation with the given number of folds

- Train/Test Percentage Split (data randomized)
  splits a dataset according to the given percentage into a train and a test file (one cannot specify explicit training and test files in the Experimenter), after the order of the data has been randomized and stratified
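The randomized percentage split described above amounts to a shuffle followed by a cut. A minimal Python sketch of the idea (names and the rounding choice are my own; Weka additionally stratifies the shuffled data, which this sketch omits):

```python
import random

def percentage_split(instances, train_pct=66, seed=1):
    """Shuffle the instances, then cut them into a training and a
    test portion at the given percentage."""
    rng = random.Random(seed)
    shuffled = instances[:]          # leave the caller's list intact
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_pct / 100)
    return shuffled[:cut], shuffled[cut:]
```

With 100 instances and the default 66%, this yields a 66/34 split; every instance ends up in exactly one of the two portions.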
Try to be as comprehensive as possible.

Here we define two nominal attributes, outlook and windy. The former has three values: sunny, overcast and rainy; the latter two: TRUE and FALSE. Nominal values with special characters, commas or spaces are enclosed in 'single quotes'.

@attribute outlook {sunny, overcast, rainy}
@attribute windy {TRUE, FALSE}

These lines define two numeric attributes. Instead of real, integer or numeric can also be used. While double floating point values are stored internally, only seven decimal digits are usually processed.

@attribute temperature real
@attribute humidity real

The last attribute is the default target or class variable used for prediction. In our case it is a nominal attribute with two values, making this a binary classification problem.

@attribute play {yes, no}

The rest of the dataset consists of the token @data, followed by comma-separated values for the attributes, one line per example. In our case there are five examples.

@data
sunny,FALSE,85,85,no
sunny,TRUE,80,90,no
overcast,FALSE,83,86,yes
rainy,FALSE,70,96,yes
rainy,FALSE,68,80,yes

In our example, we have not mentioned the attribute type string, which defines "double quoted" string attributes for text mining. In recent WEKA versions, date/time attribute types are also supported. By default, the last attribute is considered the class/target variable.
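The header/data structure just described is simple enough to parse by hand. The following Python sketch reads only the subset shown above (nominal and numeric attributes, comma-separated @data rows); Weka's real loader additionally handles quoting, string and date attributes, and sparse rows:

```python
def parse_arff(text):
    """Minimal ARFF reader: collect attribute names from @attribute
    lines, then split each @data row on commas."""
    attributes, data, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('%'):   # skip blanks and comments
            continue
        low = line.lower()
        if low.startswith('@attribute'):
            # "@attribute outlook {sunny,...}" -> "outlook"
            attributes.append(line.split(None, 2)[1])
        elif low.startswith('@data'):
            in_data = True
        elif in_data:
            data.append(line.split(','))
    return attributes, data
```

Feeding it a two-attribute weather fragment returns the attribute names and the raw data rows as lists of strings.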
You will also see some progress information in the Status bar and Log at the bottom of the window. Select Show plot from the popup menu of the ModelPerformanceChart under the Actions section. Here are the two ROC curves generated from the UCI dataset credit-g, evaluated on the class label "good":

[Screenshot: Model Performance Chart plotting False Positive Rate against True Positive Rate for J48 (good) and RandomForest (good) on german_credit]

6.4.3 Processing data incrementally

Some classifiers, clusterers and filters in Weka can handle data incrementally, in a streaming fashion. Here is an example of training and testing naive Bayes incrementally. The results are sent to a TextViewer, and predictions are plotted by a StripChart component.

[Screenshot: a KnowledgeFlow layout connecting ArffLoader, ClassAssigner, NaiveBayesUpdateable (Incremental), IncrementalClassifierEvaluator, TextViewer and StripChart]

- Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will change to a cross hairs).
- Next, place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area).
- Next, specify an ARFF file to load.
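Why can naive Bayes be trained incrementally at all? Because its model is just a set of counts that can be updated one instance at a time. The sketch below illustrates that idea in Python under simplifying assumptions of my own (nominal attributes only, add-one smoothing with a simplified denominator); it is not the NaiveBayesUpdateable implementation, which also handles numeric attributes:

```python
from collections import defaultdict
import math

class IncrementalNB:
    """Naive Bayes via running counts: update() consumes one streamed
    instance at a time, so the dataset never has to fit in memory."""
    def __init__(self):
        self.class_counts = defaultdict(int)
        # (class, attribute index, value) -> count
        self.attr_counts = defaultdict(int)

    def update(self, x, y):
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.attr_counts[(y, i, v)] += 1

    def predict(self, x):
        best, best_score = None, -math.inf
        total = sum(self.class_counts.values())
        for y, cy in self.class_counts.items():
            score = math.log(cy / total)          # class prior
            for i, v in enumerate(x):
                # add-one smoothed conditional probability
                score += math.log((self.attr_counts[(y, i, v)] + 1)
                                  / (cy + 2))
            if score > best_score:
                best, best_score = y, score
        return best
```

After streaming a few labelled instances through update(), predict() classifies new ones from the accumulated counts alone.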
where N'_{ij} and N'_{ijk} represent choices of priors on counts, restricted by N'_{ij} = \sum_k N'_{ijk}. With N'_{ijk} = 1 (and thus N'_{ij} = r_i), we obtain the K2 metric [18]:

Q_{K2}(B_S, D) = P(B_S) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(r_i - 1 + N_{ij})!} \prod_{k=1}^{r_i} N_{ijk}!

With N'_{ijk} = 1/(r_i \cdot q_i) (and thus N'_{ij} = 1/q_i), we obtain the BDe metric [21].

8.2.2 Search algorithms

The following search algorithms are implemented for local score metrics:

- K2 [18]: hill climbing, adding arcs with a fixed ordering of variables. Specific option: randomOrder. If true, a random ordering of the nodes is made at the beginning of the search. If false (default), the ordering in the data set is used. The only exception in both cases is that, in case the initial network is a naive Bayes network (initAsNaiveBayes set true), the class variable is made first in the ordering.

- Hill Climbing: hill climbing, adding and deleting arcs with no fixed ordering of variables. useArcReversal: if true, arc reversals are also considered when determining the next step to make.

- Repeated Hill Climber: starts with a randomly generated network and then applies the hill climber to reach a local optimum. The best network found is returned. useArcReversal option as for Hill Climber.

- LAGD Hill Climbing: does hill climbing with look ahead on a limited set of best scoring steps, implemented by Manuel Neubach. The number of look ahead steps and the number of steps considered for look ahead are configurable.
attribute. You can get the predictions from J48, along with the identifier strings, by issuing the following command at a DOS/Unix command prompt:

java weka.classifiers.meta.FilteredClassifier -F weka.filters.unsupervised.attribute.RemoveType -W weka.classifiers.trees.J48 -t train.arff -T test.arff -p 5

(all on a single line). If you want, you can redirect the output to a file by adding "> output.txt" to the end of the line.

In the Explorer GUI you could try a similar trick of using the String attribute identifiers here as well. Choose the FilteredClassifier with RemoveType as the filter and whatever classifier you prefer. When you visualize the results, you will need to click through each instance to see the identifier listed for each.

17.2.11 Visualization

Access to visualization from the ClassifierPanel, ClusterPanel and Attribute Selection panel is available from a popup menu. Click the right mouse button over an entry in the Result list to bring up the menu. You will be presented with options for viewing or saving the text output and, depending on the scheme, further options for visualizing errors, clusters, trees etc.

17.2.12 Memory consumption and Garbage collector

There is the ability to print how much memory is available in the Explorer and Experimenter, and to run the garbage collector. Just right click over the Status area in the Explorer/Experimenter.

17.2.13 GUIChooser starts but not Experimenter or Explorer
Typical use (after an "ant exejar") for BibTeX:

get_wekatechinfo.sh -d . -w dist/weka.jar -b > tech.txt

(command is issued from the same directory the Weka build.xml is located in)

Bash shell script get_wekatechinfo.sh:

#!/bin/bash
#
# This script prints the information stored in TechnicalInformationHandlers
# to stdout.
#
# FracPete, $Revision: 4582 $

# the usage of this script
function usage()
{
   echo
   echo "${0##*/} [-d <dir>] [-w <jar>] [-p] [-b] [-h]"
   echo
   echo "Prints the information stored in TechnicalInformationHandlers"
   echo "to stdout."
   echo
   echo " -h   this help"
   echo " -d <dir>"
   echo "      the directory to look for packages, must be the one just above"
   echo "      the 'weka' package (default: $DIR)"
   echo " -w <jar>"
   echo "      the weka jar to use, if not in CLASSPATH"
   echo " -p   prints the information in plaintext format"
   echo " -b   prints the information in BibTeX format"
   echo
}

# generates a filename out of the classname TMP and returns it in TMP
# uses the directory in DIR
function class_to_filename()
{
   TMP=$DIR"/"`echo $TMP | sed s/"\."/"\/"/g`".java"
}

# variables
DIR="."
PLAINTEXT="no"
BIBTEX="no"

WEKA=""
TECHINFOHANDLER="weka.core.TechnicalInformationHandler"
TECHINFO="weka.core.TechnicalInformation"
CLASSDISCOVERY="weka.core.ClassDiscovery"

# interpret parameters
while getopts ":hpbw:d:" flag
do
   case $flag in
      p) PLAINTEXT="yes"
         ;;
      b) BIBTEX="yes"
         ;;
      d) DIR=$OPTARG
         ;;
[Dialog: choosing a search algorithm from the local, global, ci or fixed packages]

Local score based algorithms have the following options in common:

initAsNaiveBayes: if set true (default), the initial network structure used for starting the traversal of the search space is a naive Bayes network structure, that is, a structure with arrows from the class variable to each of the attribute variables. If set false, an empty network structure will be used (i.e., no arrows at all).

markovBlanketClassifier: (false by default) if set true, at the end of the traversal of the search space a heuristic is used to ensure each of the attributes is in the Markov blanket of the classifier node. If a node is already in the Markov blanket (i.e., is a parent, child or sibling of the classifier node) nothing happens, otherwise an arrow is added.

scoreType: determines the score metric used (see Section 8.2.1 for details). Currently K2, BDe, AIC, Entropy and MDL are implemented.

maxNrOfParents: is an upper bound on the number of parents of each of the nodes in the network structure learned.

8.2.1 Local score metrics

We use the following conventions to identify counts in the database D and a network structure B_S. Let r_i (1 <= i <= n) be the cardinality of x_i. We use q_i to denote the cardinality of the parent set of x_i in B_S, that is, the number of different values to which the parents of x_i can be instantiated. So q_i can be calculated as the product of the cardinalities of the parents of x_i: q_i = \prod_{x_j \in pa(x_i)} r_j.
classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
  Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.ci.CISearchAlgorithm
-mbc
  Applies a Markov Blanket correction to the network structure after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
  Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm
-cardinality <num>
  When determining whether an edge exists, a search is performed for a set Z that separates the nodes. MaxCardinality determines the maximum size of the set Z. This greatly influences the length of the search (default 2).
-mbc
  Applies a Markov Blanket correction to the network structure after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
  Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.global.GeneticSearch
-L <integer>
  Population size
-A <integer>
  Descendant population size
-U <integer>
  Number of runs
-M
  Use mutation (default true)
-C
  Use cross-over (default true)
-O
  Use tournament selection (true) or maximum subpopulation (false) (default false)
-R <seed>
  Random number seed
-cp /home/johndoe/jars/mysql.jar:remoteEngine.jar:/home/johndoe/weka/weka.jar \
  -Djava.rmi.server.codebase=file:/home/johndoe/weka/weka.jar \
  weka.gui.experiment.Experimenter

Note: the database name "experiment" can still be modified in the Experimenter, this is just the default setup.

Now we will configure the experiment:

- First of all select the Advanced mode in the Setup tab.
- Now choose the DatabaseResultListener in the Destination panel. Configure this result producer:
  - HSQLDB: supply the value sa for the username and leave the password empty.
  - MySQL: provide the username and password that you need for connecting to the database.
- From the Result generator panel choose either the CrossValidationResultProducer or the RandomSplitResultProducer (these are the most commonly used ones) and then configure the remaining experiment details, e.g. datasets and classifiers.
- Now enable the Distribute Experiment panel by checking the tick box.
- Click on the Hosts button and enter the names of the machines that you started remote engines on (<Enter> adds the host to the list).
- You can choose to distribute by run or dataset.
- Save your experiment configuration.
- Now start your experiment as you would do normally.
- Check your results in the Analyse tab by clicking either the Database or Experiment buttons.

5.3.5 Troubleshooting

- If you get an error at the start of an experiment
-i data/soybean.arff -o soybean-5%.arff -Z 5

RemoveFolds creates cross-validation folds of the given dataset. The class distributions are not retained. The following example splits soybean.arff into training and test datasets, the latter consisting of 25% (1/4) of the data:

java weka.filters.unsupervised.instance.RemoveFolds \
  -i data/soybean.arff -o soybean-train.arff -c last -N 4 -F 1 -V

java weka.filters.unsupervised.instance.RemoveFolds \
  -i data/soybean.arff -o soybean-test.arff -c last -N 4 -F 1

RemoveWithValues filters instances according to the value of an attribute:

java weka.filters.unsupervised.instance.RemoveWithValues \
  -i data/soybean.arff -o soybean-without_herbicide_injury.arff \
  -V -C last -L 19

1.2.4 weka.classifiers

Classifiers are at the core of WEKA. There are a lot of common options for classifiers, most of which are related to evaluation purposes. We will focus on the most important ones. All others, including classifier-specific parameters, can be found via -h, as usual.

-t
  specifies the training file (ARFF format)
-T
  specifies the test file in ARFF format. If this parameter is missing, a crossvalidation will be performed (default: ten-fold cv).
-x
  This parameter determines the number of folds for the cross-validation. A cv will only be performed if -T is missing.
-c
  As we already know from the weka.filters section, this parameter sets the class variable with a one-based index.
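The unstratified fold splitting that RemoveFolds performs, and that the default classifier evaluation repeats for every fold, can be sketched briefly. This Python sketch (names are my own; Weka additionally randomizes and, for classification, stratifies the data) shows how fold f becomes the test set while the remaining instances form the training set:

```python
def cross_validation_indices(n, folds=10):
    """Plain k-fold splitting by index: instance i belongs to test
    fold i % folds; everything else is training data for that fold."""
    splits = []
    for f in range(folds):
        test = list(range(f, n, folds))
        train = [i for i in range(n) if i % folds != f]
        splits.append((train, test))
    return splits
```

Every instance appears in exactly one test fold, so the union of the test folds covers the whole dataset.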
drawback of this format is the possible incompatibility between different versions of Weka. A more robust alternative to the binary format is the XML format.

Previously saved experiments can be loaded again via the Open button.

5.2.1.8 Running an Experiment

To run the current experiment, click the Run tab at the top of the Experiment Environment window. The current experiment performs 10 runs of 10-fold stratified cross-validation on the Iris dataset using the ZeroR and J48 schemes.

Click Start to run the experiment. If the experiment was defined correctly, the 3 messages shown below will be displayed in the Log panel:

16:17:12: Started
16:17:12: Finished
16:17:12: There were 0 errors

The results of the experiment are saved to the dataset Experiment1.arff.

5.2.2 Advanced

5.2.2.1 Defining an Experiment

When the Experimenter is started in Advanced mode, the Setup tab is displayed. Click New to initialize an experiment. This causes default parameters to be defined for the experiment.
for connecting to the database in the User field; the same for the password. Choose the database for this DSN from the Database combobox. Click on OK.

6. Your DSN should now be listed in the User Data Sources list.

Step 2: Set up the DatabaseUtils.props file

You will need to create a file called DatabaseUtils.props. This file already exists under the path weka/experiment/ in the weka.jar file that is part of the Weka download. In this directory you will also find a sample file for ODBC connectivity, called DatabaseUtils.props.odbc. You can use that as basis, since it already contains default values specific to ODBC access.

This file needs to be recognized when the Explorer starts. You can achieve this by making sure it is in the working directory, or by replacing the version that already exists in the weka/experiment directory. A way of achieving the second alternative would be to extract the contents of the weka.jar and set your CLASSPATH to point to the directory where weka resides, rather than the jar file. The file is a text file that needs to contain the following lines:

jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:dbname

where dbname is the name you gave the user DSN. (This can also be changed once the Explorer is running.)

Step 3: Open the database

1. Start up the Weka Explorer. If you want to be sure that the DatabaseUtils.props file is in the current path, you can open a command prompt
for generating a data set from the Bayesian network in the editor. A dialog is shown to specify the number of instances to be generated, a random seed and the file to save the data set into. The file format is arff. When no file is selected (field left blank), no file is written and only the internal data set is set.

[Dialog: Generate Random Data Options, with fields Nr of instances (100), Random seed (1234) and Output file (optional)]

The Set Data menu sets the current data set. From this data set, a new Bayesian network can be learned, or the CPTs of a network can be estimated. A file chooser pops up to select the arff file containing the data.

The Learn Network and Learn CPT menus are only active when a data set is specified, either through

- the Tools/Set Data menu, or
- the Tools/Generate Data menu, or
- the File/Open menu when an arff file is selected.

The Learn Network action learns the whole Bayesian network from the data set. The learning algorithms can be selected from the set available in Weka by selecting the Options button in the dialog below. Learning a network clears the undo stack.

[Dialog: Learn Bayesian Network, with an Options button and weka.classifiers.bayes.net.estimate.SimpleEstimator -A 0.5 as the estimator]

The Learn CPT menu does not change the structure of the Bayesian network, only the probability tables. Learning the CPTs clears the undo stack.
- string
- date [<date-format>]
- relational (for multi-instance data; for future use)

where <nominal-specification> and <date-format> are defined below. The keywords numeric, real, integer, string and date are case insensitive.

Numeric attributes

Numeric attributes can be real or integer numbers.

Nominal attributes

Nominal values are defined by providing an <nominal-specification> listing the possible values:

   {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}

For example, the class value of the Iris dataset can be defined as follows:

   @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Values that contain spaces must be quoted.

String attributes

String attributes allow us to create attributes containing arbitrary textual values. This is very useful in text-mining applications, as we can create datasets with string attributes, then write Weka Filters to manipulate strings (like StringToWordVector). String attributes are declared as follows:

   @ATTRIBUTE LCC string

Date attributes

Date attribute declarations take the form:

   @attribute <name> date [<date-format>]

where <name> is the name for the attribute and <date-format> is an optional string specifying how date values should be parsed and printed (this is the same format used by SimpleDateFormat). The default format string accepts the ISO-8601 combined date and t
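Putting the declarations above together, a small made-up header exercising each attribute type could look like this (the relation and attribute names are illustrative only):

```
@RELATION example

@ATTRIBUTE sepallength numeric
@ATTRIBUTE class       {Iris-setosa,Iris-versicolor,Iris-virginica}
@ATTRIBUTE LCC         string
@ATTRIBUTE timestamp   date "yyyy-MM-dd HH:mm:ss"

@DATA
5.1, Iris-setosa, 'some text', "2001-04-03 12:12:12"
```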
hits Open, then all ARFF files will be added recursively. Files can be deleted from the list by selecting them and then clicking on Delete selected.

ARFF files are not the only format one can load; all files that can be converted with Weka's "core converters" can be used. The following formats are currently supported:
- ARFF (+ compressed)
- C4.5
- CSV
- libsvm
- binary serialized instances
- XRFF (+ compressed)

By default, the class attribute is assumed to be the last attribute. But if a data format contains information about the class attribute, like XRFF or C4.5, this attribute will be used instead.

[Screenshot: Experiment Environment — Setup panel with Results Destination, Experiment Type, Iteration Control, Datasets and Algorithms sections]

5.2.1.5 Iteration control

- Number of repetitions: In or
is a drop-down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up.

To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The 'X' and 'Y' written beside the strips show what the current axes are ('B' is used for both X and Y).

Above the attribute strips is a slider labelled Jitter, which is a random displacement given to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different to just a single lonely instance.

4.7.3 Selecting Instances

There may be situations where it is helpful to select a subset of the data using the visualization tool. A special case of this is the Us
list of the Explorer. A menu pops up, in which you select Visualize graph.

[Screenshot: Weka Explorer, Classify panel — a BayesNet classifier (K2 search, SimpleEstimator) evaluated on the iris data, with the result-list popup menu open showing, among other entries, Visualize graph]
- measureExtraArcs: extra arcs compared to reference network. The network must be provided as BIFFile to the BayesNet class. If no such network is provided, this value is zero.
- measureMissingArcs: missing arcs compared to reference network, or zero if not provided.
- measureReversedArcs: reversed arcs compared to reference network, or zero if not provided.
- measureDivergence: divergence of network learned compared to reference network, or zero if not provided.
- measureBayesScore: log of the K2 score of the network structure.
- measureBDeuScore: log of the BDeu score of the network structure.
- measureMDLScore: log of the MDL score.
- measureAICScore: log of the AIC score.
- measureEntropyScore: log of the entropy.

8.11 Adding your own Bayesian network learners

You can add your own structure learners and estimators.

Adding a new structure learner

Here is the quick guide for adding a structure learner:

1. Create a class that derives from weka.classifiers.bayes.net.search.SearchAlgorithm. If your searcher is score based, conditional independence based or cross-validation based, you probably want to derive from ScoreSearchAlgorithm, CISearchAlgorithm or CVSearchAlgorithm instead of deriving from SearchAlgorithm directly. Let's say it is called weka.classifiers.bayes.net.search.local.MySearcher, derived from ScoreSearchAlgorithm.

2. Implement the method

   public void buildStructure
[Screenshot: ArffViewer displaying rows of the hungarian-14-heart-disease dataset]

7.1 Menus

The ArffViewer offers most of its functionality either through the main menu or via popups (table header and table cells). Short description of the available menus:

- File
  [Screenshot: File menu — Save as, Close, Close all, Properties, Exit]
  contains options for opening and closing files, as well as viewing properties about the current file.

- Edit
  [Screenshot: Edit menu — Undo, Copy, Search, Clear search, Rename attribute, Attribute as class, Delete attribute, Delete attributes, Delete instance]
not it will be possible to visualize the clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem, it may be helpful to disable this option.

4.4.3 Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

4.4.4 Working with Filters

The FilteredClusterer meta-clusterer offers the user the possibility to apply filters directly before the clusterer is learned. This approach eliminates the manual application of a filter in the Preprocess panel, since the data gets processed on the fly. It is useful if one needs to try out different filter setups.

4.4.5 Learning Clusters

The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only two visualization options: Visualize cluster ass
not use the AttributeSelectedClassifier from the classifier panel, it is best to use the AttributeSelection filter (a supervised attribute filter) in batch mode (-b) from the command line or in the SimpleCLI. The batch mode allows one to specify an additional input and output file pair (options -r and -s) that is processed with the filter setup that was determined based on the training data (specified by options -i and -o).

Here is an example for a Unix/Linux bash:

   java weka.filters.supervised.attribute.AttributeSelection \
      -E "weka.attributeSelection.CfsSubsetEval" \
      -S "weka.attributeSelection.BestFirst -D 1 -N 5" \
      -b \
      -i <input1.arff> \
      -o <output1.arff> \
      -r <input2.arff> \
      -s <output2.arff>

Notes:
- The backslashes at the end of each line tell the bash that the command is not finished yet. Using the SimpleCLI one has to use this command in one line without the backslashes.
- It is assumed that WEKA is available in the CLASSPATH, otherwise one has to use the -classpath option.
- The full filter setup is output in the log, as well as the setup for running regular attribute selection.

4.7 Visualizing

[Screenshot: Explorer Visualize panel — scatter plot matrix of the weather data (outlook, temperature, humidity, windy)]
set is automatically filtered and a warning is written to STDERR.¹

Inference algorithm

To use a Bayesian network as a classifier, one simply calculates argmax_y P(y|x) using the distribution P(U) represented by the Bayesian network. Now note that

   P(y|x) = P(U)/P(x) ∝ P(U) = ∏_{u∈U} p(u | pa(u))     (8.1)

And since all variables in x are known, we do not need complicated inference algorithms, but can just calculate (8.1) for all class values.

Learning algorithms

The dual nature of a Bayesian network makes learning a Bayesian network a natural two-stage process: first learn a network structure, then learn the probability tables.

There are various approaches to structure learning, and in Weka the following areas are distinguished:

¹ If there are missing values in the test data, but not in the training data, the values are filled in in the test data with a ReplaceMissingValues filter based on the training data.

- local score metrics: Learning a network structure B_S can be considered an optimization problem, where a quality measure of a network structure given the training data, Q(B_S|D), needs to be maximized. The quality measure can be based on a Bayesian approach, minimum description length, information and other criteria. Those metrics have the practical property that the score of the whole network can be decomposed as the sum (or product) of the score of the individual nodes. This allows for local scoring and th
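Equation (8.1) can be exercised on a toy network in a few lines. This is a plain-Python sketch, not Weka code; the two-node network (class → petal) and its CPT values are made up for illustration:

```python
# Classify with a toy Bayesian network by evaluating equation (8.1):
# argmax_y prod_{u in U} p(u | pa(u)).  Network: class -> petal.
p_class = {"setosa": 0.5, "versicolor": 0.5}
p_petal_given_class = {
    ("short", "setosa"): 0.9, ("long", "setosa"): 0.1,
    ("short", "versicolor"): 0.2, ("long", "versicolor"): 0.8,
}

def classify(petal):
    # Score each class value by the joint probability; P(x) is the same
    # for every class value, so comparing the joints is enough.
    scores = {c: p_class[c] * p_petal_given_class[(petal, c)]
              for c in p_class}
    return max(scores, key=scores.get)

print(classify("long"))   # -> versicolor
print(classify("short"))  # -> setosa
```

Because every attribute in x is observed, no message passing or junction-tree machinery is needed — just one product per class value.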
that instance and enclosing the value in curly braces, e.g.:

   @data
   0, X, 0, Y, "class A", {5}

For a sparse instance, this example would look like:

   @data
   {1 X, 3 Y, 4 "class A"}, {5}

Note that any instance without a weight value specified is assumed to have a weight of 1, for backwards compatibility.

Chapter 10

XRFF

The XRFF (Xml attribute Relation File Format) is a format for representing the data in XML that can store comments, attribute and instance weights.

10.1 File extensions

The following file extensions are recognized as XRFF files:
- .xrff: the default extension of XRFF files
- .xrff.gz: the extension for gzip-compressed XRFF files (see Compression section for more details)

10.2 Comparison

10.2.1 ARFF

In the following, a snippet of the UCI dataset iris in ARFF format:

   @relation iris
   @attribute sepallength numeric
   @attribute sepalwidth numeric
   @attribute petallength numeric
   @attribute petalwidth numeric
   @attribute class {Iris-setosa,Iris-versicolor,Iris-virginica}
   @data
   5.1,3.5,1.4,0.2,Iris-setosa
   4.9,3,1.4,0.2,Iris-setosa

10.2.2 XRFF

And the same dataset represented as XRFF file:

   <?xml version="1.0" encoding="utf-8"?>
   <!DOCTYPE dataset
   [
      <!ELEMENT dataset (header,body)>
      <!ATTLIST dataset name CDATA #REQUIRED>
      <!ATTLIST dataset version CDATA "3.5.4">
      <!ELEMENT header (notes,attributes)>
      <!ELEMENT body (instances)>
      <!ELE
to a FilteredClassifier used in the Classify panel. Left-clicking on any of these gives an opportunity to alter the filter's settings. For example, the setting may take a text string, in which case you type the string into the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required. Information on the options is provided in a tool tip if you let the mouse pointer hover over the corresponding field. More information on the filter and its options can be obtained by clicking on the More button in the About panel at the top of the GenericObjectEditor window.

Some objects display a brief description of what they do in an About box, along with a More button. Clicking on the More button brings up a window describing what the different options do. Others have an additional button, Capabilities, which lists the types of attributes and classes the object can handle.

At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open and Save, allow object configurations to be stored for future use. The Cancel button backs out without remembering any changes that have been made. Once you are happy with the object and settings you have chosen, click OK to return to the main Explorer window.

Applying Filters

Once you have selected and configured a filter, you can apply it to the data by pressing the Apply button at the right end of the Fil
type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q  Use probabilistic or 0/1 scoring (default probabilistic scoring)

weka.classifiers.bayes.net.search.global.RepeatedHillClimber
-U <integer>  Number of runs
-A <seed>  Random number seed
-P <nr of parents>  Maximum number of parents
-R  Use arc reversal operation (default false)
-N  Initial structure is empty (instead of Naive Bayes)
-mbc  Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]  Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q  Use probabilistic or 0/1 scoring (default probabilistic scoring)

weka.classifiers.bayes.net.search.global.SimulatedAnnealing
-A <float>  Start temperature
-U <integer>  Number of runs
-D <float>  Delta temperature
-R <seed>  Random number seed
-mbc  Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]  Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q  Use probabilistic or 0/1 scoring (default probabilistic scoring)

weka.classifiers.bayes.net.search.global.Ta
used as follows:

- Loader
  They take one argument, which is the file that should be converted, and print the result to stdout. You can also redirect the output into a file:

     java <classname> <input-file> > <output-file>

  Here's an example for loading the CSV file iris.csv and saving it as iris.arff:

     java weka.core.converters.CSVLoader iris.csv > iris.arff

- Saver
  For a Saver you specify the ARFF input file via -i and the output file in the specific format with -o:

     java <classname> -i <input> -o <output>

  Here's an example for saving an ARFF file to CSV:

     java weka.core.converters.CSVSaver -i iris.arff -o iris.csv

A few notes:
- Using the ArffSaver from the commandline doesn't make much sense, since this Saver takes an ARFF file as both input and output. The ArffSaver is normally used from Java for saving an object of weka.core.Instances to a file.
- The C45Loader either takes the .names file or the .data file as input; it automatically looks for the other one.
- For the C45Saver one specifies as output file a filename without any extension, since two output files will be generated (.names and .data are automatically appended).

11.2.2 Database converters

The database converters are a bit more complex, since they also rely on additional configuration files besides the parameters on the commandline. The setup for the database connection is stored in the following props file: DatabaseUti
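As a rough illustration of what a CSV-to-ARFF conversion like the CSVLoader call above involves (this is a plain-Python sketch, not Weka's implementation): read a CSV with a header row, guess numeric vs. nominal per column, and emit an ARFF header plus data.

```python
# Minimal CSV -> ARFF sketch: numeric columns become "numeric",
# everything else becomes a nominal attribute listing its values.
import csv
import io

def csv_to_arff(csv_text, relation="data"):
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col in range(len(header)):
        values = {r[col] for r in data}
        try:                       # numeric column? all values parse as float
            for v in values:
                float(v)
            attr_type = "numeric"
        except ValueError:         # otherwise treat as nominal
            attr_type = "{" + ",".join(sorted(values)) + "}"
        lines.append("@attribute %s %s" % (header[col], attr_type))
    lines += ["", "@data"] + [",".join(r) for r in data]
    return "\n".join(lines)

print(csv_to_arff("petalwidth,class\n0.2,Iris-setosa\n1.8,Iris-virginica\n"))
```

The real converters do considerably more (quoting, missing values, type options), but the header/data split is the essence of the format.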
variable, i.e., the attribute which should be predicted as a function of all other attributes. If this is not the case, specify the target variable via -c. The attribute numbers are one-based indices, i.e., -c 1 specifies the first attribute.

Some basic statistics and validation of given ARFF files can be obtained via the main routine of weka.core.Instances:

   java weka.core.Instances data/soybean.arff

weka.core offers some other useful routines, e.g. converters.C45Loader and converters.CSVLoader, which can be used to import C45 datasets and comma/tab-separated datasets respectively, e.g.:

   java weka.core.converters.CSVLoader data.csv > data.arff
   java weka.core.converters.C45Loader c45_filestem > data.arff

1.2.2 Classifier

Any learning algorithm in WEKA is derived from the abstract weka.classifiers.Classifier class. Surprisingly little is needed for a basic classifier: a routine which generates a classifier model from a training dataset (= buildClassifier) and another routine which evaluates the generated model on an unseen test dataset (= classifyInstance), or generates a probability distribution for all classes (= distributionForInstance).

A classifier model is an arbitrarily complex mapping from all-but-one dataset attributes to the class attribute. The specific form and creation of this mapping, or model, differs from classifier to classifier. For example, ZeroR's (weka.classifie
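The buildClassifier/classifyInstance split can be seen in miniature with ZeroR's idea: the "model" is just the majority class. This is a plain-Python sketch of that idea, not Weka's Java implementation:

```python
# ZeroR sketch: build = count classes, classify = return the majority.
from collections import Counter

class ZeroR:
    def build_classifier(self, instances):
        # instances: list of (attribute_values, class_value) pairs
        counts = Counter(cls for _, cls in instances)
        self.majority = counts.most_common(1)[0][0]

    def classify_instance(self, instance):
        # Ignores the attributes entirely -- the model is nothing but
        # the most frequent class value seen during training.
        return self.majority

train = [([5.1, 3.5], "Iris-setosa"),
         ([6.0, 2.2], "Iris-versicolor"),
         ([6.3, 2.5], "Iris-versicolor")]
zr = ZeroR()
zr.build_classifier(train)
print(zr.classify_instance([4.9, 3.0]))  # -> Iris-versicolor
```

Any real classifier replaces the body of both methods with something attribute-dependent, but keeps exactly this interface.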
weka.experiment.Experiment and weka.experiment.RemoteExperiment

16.6.3 Serialization of Classifiers

The options for models of a classifier, -l for the input model and -d for the output model, now also support XML serialized files. Here we have to differentiate between two different formats:

- built-in
  The built-in serialization captures only the options of a classifier, but not the built model. With -l one still has to provide a training file, since we only retrieve the options from the XML file. It is possible to add more options on the command line, but no check is performed whether they collide with the ones stored in the XML file. The file is expected to end with .xml.

- KOML
  Since the KOML serialization captures everything of a Java Object, we can use it just like the normal Java serialization. The file is expected to end with .koml.

The built-in serialization can be used in the Experimenter for loading/saving options from algorithms that have been added to a Simple Experiment. Unfortunately it is not possible to create a hierarchical structure like the one mentioned in Section 16.6.1. This is because of the loss of information caused by the getOptions() method of classifiers: it returns only a flat String array and not a tree structure.

Responsible class(es):
   weka.core.xml.KOML
   weka.classifiers.xml.XMLClassifier

16.6.4 Bayesian Networks

The GraphVisualizer (weka.gui.graphvisualizer.GraphVisu
which direction nodes are considered. The Edge Concentration toggle allows edges to be partially merged. The Custom Node Size can be used to override the automatically determined node size.

When you click a node in the Bayesian net, a window with the probability table of the node clicked pops up. The left side shows the parent attributes and lists the values of the parents; the right side shows the probability of the node clicked, conditioned on the values of the parents listed on the left.

[Screenshot: probability table for the node sepallength, conditioned on class]

So the graph visualizer allows you to inspect both network structure and probability tables.

8.9 Bayes Network GUI

The Bayesian network editor is a stand-alone application with the following features:
- Edit Bayesian network completely by hand, with unlimited undo/redo stack, cut/copy/paste and layout support
- Learn Bayesian network from data, using learning algorithms in Weka
- Edit structure by hand and learn conditional probability tables (CPTs) using learning algorithms in Weka
- Generate dataset from Bayesian network
- Inference (using the junction tree method) of evidence through the network, interactively changing values of nodes
- Viewing cliques in junction tree
- Accelerator key support for most co
  allows one to delete attributes/instances, rename attributes, choose a new class attribute, search for certain values in the data and, of course, undo the modifications.

- View
  [Screenshot: View menu — Values, Optimal column width (current/all)]
  brings either the chosen attribute into view or displays all the values of an attribute.

After opening a file, by default, the column widths are optimized based on the attribute name and not the content. This is to ensure that overlong cells do not force an enormously wide table, which the user would have to reduce with quite some effort.

In the following, screenshots of the table popups:

[Screenshot: table header and table cell popup menus — e.g. Get mean, Set missing values to — shown over the hungarian-14-heart-disease data]
[Screenshot: ArffViewer displaying further rows of the hungarian-14-heart-disease dataset]

Chapter 8

Bayesian Network Classifiers

8.1 Introduction

Let U = {x_1, ..., x_n}, n ≥ 1, be a set of variables. A Bayesian network B over a set of variables U is a network structure B_S, which is a directed acyclic graph (DAG) over U, and a set of probability tables B_P = {p(u | pa(u)) | u ∈ U}, where pa(u) is the set of parents of u in B_S. A Bayesian network represents a probability distribution P(U) = ∏_{u∈U} p(u | pa(u)).

Below, a Bayesian network is shown for the variables in the iris data set. Note that the links between the nodes class, petallength and petalwidth do not form a directed cycle, so the graph is a proper DAG.

[Screenshot: Weka Classifier Graph Visualizer showing a network over the iris attributes]

Thi
5.2.1 Simple . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
   5.2.1.1 New experiment . . . . . . . . . . . . . . . . . 52
   5.2.1.2 Results destination . . . . . . . . . . . . . . . 52
   5.2.1.3 Experiment type . . . . . . . . . . . . . . . . . 54
   5.2.1.4 Datasets . . . . . . . . . . . . . . . . . . . . . 56
   5.2.1.5 Iteration control . . . . . . . . . . . . . . . . 57
   5.2.1.6 Algorithms . . . . . . . . . . . . . . . . . . . . 57
   5.2.1.7 Saving the setup . . . . . . . . . . . . . . . . . 59
   5.2.1.8 Running an Experiment . . . . . . . . . . . . . . 60
5.2.2 Advanced . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
   5.2.2.1 Defining an Experiment . . . . . . . . . . . . . . 61
   5.2.2.2 Running an Experiment . . . . . . . . . . . . . . 64
   5.2.2.3 Changing the Experiment Parameters . . . . . . . . 66
   5.2.2.4 Other Result Producers . . . . . . . . . . . . . . 73
5.3 Remote Experiments . . . . . . . . . . . . . . . . . . . . . . . 78
   5.3.1 Preparation . . . . . . . . . . . . . . . . . . . . . . . . 78
   5.3.2 Database Server Setup . . . . . . . . . . . . . . . . . . . 78
   5.3.3 Remote Engine Setup . . . . . . . . . . . . . . . . . . . . 79
   5.3.4 Configuring the Experimenter . . . . . . . . . . . . . . . 80
   5.3.5 Troubleshooting . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Analysing Results . . . . . . . . . . . . . . . . . . . . . . . . 83
   5.4.1 Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
   5.4.2 Saving the Results . . . . . . . . . . . . . . . . . . . . 86
   5.4.3 Changing the Baseline Scheme . . . . . . . . . . . . . . . 86
   5.4.4 Statistical Significance . . . . . . . . . . . . . . . . . 87
   5.4.5 Summary Test . . . . . . . . . . . . . . . . . . . . . . . 87
   5.4.6 Ranking Test . . . . . . . . . . . . . . . . . . . . . . . 88

6 KnowledgeFlow
   6.1 Introduction
   6.2 Features
   6.3 Components
      6.3.1 Da
2000

- Troubleshooting: "Error Establishing Socket with JDBC Driver". Add TCP/IP to the list of protocols, as stated in the following article:
  http://support.microsoft.com/default.aspx?scid=kb;en-us;313178

- "Login failed for user 'sa'. Reason: Not associated with a trusted SQL Server connection." For changing the authentication to mixed mode, see the following article:
  http://support.microsoft.com/kb/319930/en-us

- MS SQL Server 2005: TCP/IP is not enabled for SQL Server, or the server or port number specified is incorrect. Verify that SQL Server is listening with TCP/IP on the specified server and port. This might be reported with an exception similar to: "The login has failed. The TCP/IP connection to the host has failed." This indicates one of the following:
  - SQL Server is installed but TCP/IP has not been installed as a network protocol for SQL Server, by using the SQL Server Network Utility for SQL Server 2000, or the SQL Server Configuration Manager for SQL Server 2005.
  - TCP/IP is installed as a SQL Server protocol, but it is not listening on the port specified in the JDBC connection URL. The default port is 1433.
  - The port that is used by the server has not been opened in the firewall.

- The "Added driver" output on the commandline does not mean that the actual class was found, but only that Weka will attempt to load the class later on, in order to establish a database connectio
2008 weka/gui/beans/
       0 Wed Feb 20 13:56:36 NZDT 2008 weka/gui/beans/icons/
    2812 Wed Feb 20 14:01:20 NZDT 2008 weka/gui/beans/icons/KettleInput.gif
    2812 Wed Feb 20 14:01:18 NZDT 2008 weka/gui/beans/icons/KettleInput_animated.gif
    1839 Wed Feb 20 13:59:08 NZDT 2008 weka/gui/beans/KettleInput.class
     174 Tue Feb 19 15:27:24 NZDT 2008 weka/gui/beans/KettleInputBeanInfo.class

   cygnus:~ mhall$ more /Users/mhall/.knowledgeFlow/plugins/kettle/Beans.props
   # Specifies the tools to go into the Plugins toolbar
   weka.gui.beans.KnowledgeFlow.Plugins=weka.gui.beans.KettleInput

Chapter 7

Arff Viewer

The ArffViewer is a little tool for viewing ARFF files in a tabular format. The advantage of this kind of display over the file representation is that attribute name, type and data are directly associated in columns and not separated into definition and data part. The viewer is not only limited to viewing multiple files at once, but also provides simple editing functionality, like sorting and deleting.

[Screenshot: ArffViewer displaying the hungarian-14-heart-disease dataset]
[Screenshot: CPT editing dialog with probability entries and Randomize/Ok/Cancel buttons]

The whole table can be filled with randomly generated distributions by selecting the Randomize button.

The popup menu shows a list of parents that can be added to the selected node. The CPT for the node is updated by making copies for each value of the new parent.

[Screenshot: node popup menu — Set evidence, Rename, Delete node, Edit CPT, Add parent, Delete parent, Delete child, Add value, Rename value, Delete value]

The popup menu shows a list of parents that can be deleted from the selected node. The CPT of the node keeps only the one conditioned on the first value of the parent node.

[Screenshot: node popup menu with the Delete parent submenu open]

The popup menu shows a list of children that can be deleted from the selected node. The CPT of the child node keeps only the one conditioned on the first value of the parent node.

[Screenshot: node popup menu with the Delete child submenu open]

Selecting the Add Value fr
5-10-22:

   http://www.cs.waikato.ac.nz/ml/weka/stemmers/snowball.jar

2. You can compile the stemmers yourself with the newest sources. Just download the following ZIP file, unpack it and follow the instructions in the README file (the zip contains an ANT (http://ant.apache.org/) build script for generating the jar archive):

   http://www.cs.waikato.ac.nz/ml/weka/stemmers/snowball.zip

   Note: the patch target is specific to the source code from 2005-10-19.

12.3 Using stemmers

The stemmers can either be used
- from the commandline, or
- within the StringToWordVector (package weka.filters.unsupervised.attribute).

12.3.1 Commandline

All stemmers support the following options:
- -h: for displaying a brief help
- -i <input-file>: the file to process
- -o <output-file>: the file to output the processed data to (default stdout)
- uses lowercase strings, i.e., the input is automatically converted to lowercase

12.3.2 StringToWordVector

Just use the GenericObjectEditor to choose the right stemmer and the desired options (if the stemmer offers additional options).

12.4 Adding new stemmers

You can easily add new stemmers, if you follow these guidelines (for use in the GenericObjectEditor):
- they should be located in the weka.core.stemmers package (if not, then the GenericObjectEditor.props/GenericPropertiesCreator.props files need to be updated), and
- they must implement the interface weka.co
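To make concrete what a stemmer does to its input, here is a toy illustration in plain Python — NOT the Snowball/Porter algorithms shipped in the snowball jar, just a crude suffix stripper with the lowercase behaviour described above:

```python
# Toy stemmer sketch: lowercase the word, then strip one common
# English suffix if enough of the stem remains.
def toy_stem(word, lowercase=True):
    if lowercase:
        word = word.lower()
    for suffix in ("ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word

print(toy_stem("Stemming"))  # -> stemm
print(toy_stem("cats"))      # -> cat
```

Real stemmers apply many ordered rule steps with measure conditions; the point here is only the shape of the operation: one word in, one (shorter, lowercased) stem out.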
[Screenshot: ArffViewer rows of the hungarian-14-heart-disease dataset]

7.2 Editing

Besides the first column, which is the instance index, all cells in the table are editable. Nominal values can be easily modified via dropdown lists; numeric values are edited directly.

[Screenshot: editing a nominal cell of the hungarian-14-heart-disease data via its dropdown list]
[Screenshot: database results browser listing runs of ZeroR and J48 (ClassifierSplitEvaluator) on the iris dataset]

The contents of the first run are:

   ClassifierSplitEvaluator: weka.classifiers.trees.J48 -C 0.25 -M 2 (version 217733168393644444)
   Classifier model:
   J48 pruned tree
   ------------------
   petalwidth <= 0.6: Iris-setosa (33.0)
   petalwidth > 0.6
   |   petalwidth <= 1.5: Iris-versicolor (31.0/1.0)
   |   petalwidth > 1.5: Iris-virginica (35.0/3.0)

   Number of Leaves  :  3
   Size of the tree  :  5

   Correctly Classified Instances
   Incorrectly Classified Instances
   Kappa statistic
   Mean absolute error
   Ro
(BayesNet bayesNet, Instances instances)

Essentially, you are responsible for setting the parent sets in bayesNet. You can access the parent sets using bayesNet.getParentSet(iAttribute), where iAttribute is the number of the node/variable. To add a parent iParent to node iAttribute, use bayesNet.getParentSet(iAttribute).addParent(iParent, instances), where instances need to be passed for the parent set to derive properties of the attribute.

Alternatively, implement

   public void search(BayesNet bayesNet, Instances instances)

The implementation of buildStructure in the base class will call search after initializing the parent sets and, if the initAsNaiveBayes flag is set, it will start with a naive Bayes network structure. After calling search in your custom class, it will add arrows, if the markovBlanketClassifier flag is set, to ensure all attributes are in the Markov blanket of the class node.

3. If the structure learner has options that are not default options, you want to implement

   public Enumeration listOptions()
   public void setOptions(String[] options)
   public String[] getOptions()

and the get and set methods for the properties you want to be able to set.

NB 1: do not use the -E option, since that is reserved for the BayesNet class to distinguish the extra options for the SearchAlgorithm class and the Estimator class. If the -E option is used, it will not be passed to your SearchAlgorithm (and p
CLASSIFIERS

 (default true)
-C
 Use cross-over. (default true)
-O
 Use tournament selection (true) or maximum subpopulation (false). (default false)
-R <seed>
 Random number seed
-mbc
 Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
 Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.HillClimber
-P <nr of parents>
 Maximum number of parents
-R
 Use arc reversal operation. (default false)
-N
 Initial structure is empty (instead of Naive Bayes)
-mbc
 Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
 Score type (BAYES, BDeu, MDL, ENTROPY and AIC)

weka.classifiers.bayes.net.search.local.K2
-N
 Initial structure is empty (instead of Naive Bayes)
-P <nr of parents>
 Maximum number of parents
-R
 Random order. (default false)
-mbc
 Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.

8.7 RUNNING FROM THE COMMAND LINE 125
Choose database:

• Microsoft Access
  (a) Note: Make sure your database is not open in another application before following the steps below.
  (b) Choose the Microsoft Access driver and click Finish.
  (c) Give the source a name by typing it into the Data Source Name field.
  (d) In the Database section, choose Select.
  (e) Browse to find your database file, select it and click OK.
  (f) Click OK to finalize your DSN.

• Microsoft SQL Server 2000 (Desktop Engine)
  (a) Choose the SQL Server driver and click Finish.
  (b) Give the source a name by typing it into the Name field.
  (c) Add a description for this source in the Description field.
  (d) Select the server you're connecting to from the Server combobox.

180 CHAPTER 14. WINDOWS DATABASES

  (e) For the verification of the authenticity of the login ID, choose With SQL Server.
  (f) Check Connect to SQL Server to obtain default settings and supply the user ID and password with which you installed the Desktop Engine.
  (g) Just click on Next until it changes into Finish, and click this too.
  (h) For testing purposes, click on Test Data Source — the result should be TESTS COMPLETED SUCCESSFULLY.
  (i) Click on OK.

• MySQL
  (a) Choose the MySQL ODBC driver and click Finish.
  (b) Give the source a name by typing it into the Data Source Name field.
  (c) Add a description for this source in the Description field.
  (d) Specify the server you're connecting to in Server.
  (e) Fill in the user to use
[screenshot: the GenericObjectEditor for BayesNet, showing the BIFFile, searchAlgorithm, estimator, useADTree and debug options]

The BIFFile option can be used to specify a Bayes network stored in file in BIF format. When the toString() method is called after learning the Bayes network, extra statistics (like extra and missing arcs) are printed comparing the network learned with the one on file.

The searchAlgorithm option can be used to select a structure learning algorithm and specify its options.

The estimator option can be used to select the method for estimating the conditional probability distributions (Section 8.6).

When setting the useADTree option to true, counts are calculated using the ADTree algorithm of Moore [23]. Since I have not noticed a lot of improvement for small data sets, it is set off by default. Note that this ADTree algorithm is different from the ADTree classifier algorithm from weka.classifiers.trees.ADTree.

The debug option has no effect.

8.2. LOCAL SCORE BASED STRUCTURE LEARNING 113

8.2 Local score based structure learning

Distinguish score metrics (Section 2.1) and search algorithms (Section 2.2). A local score based structure learning algorithm can be selected by choosing one in the weka.classifiers.bayes.net.search.local package:

[package tree: weka.classifiers.bayes.net.search.local — GeneticSearch, HillClimber, K2, LAGDHillClimber, RepeatedHillClimber, SimulatedAnnealing, TabuSearch, TAN]
ERIMENTER

• Using a corrupt or incomplete DatabaseUtils.props file can cause peculiar interface errors, for example disabling the use of the "User" button alongside the database URL. If in doubt, copy a clean DatabaseUtils.props from Subversion [10].

• If you get NullPointerException at java.util.Hashtable.get() in the Remote Engine, do not be alarmed. This will have no effect on the results of your experiment.

5.4. ANALYSING RESULTS

5.4 Analysing Results

5.4.1 Setup

Weka includes an experiment analyser that can be used to analyse the results of experiments (in this example, the results were sent to an InstancesResultListener). The experiment shown below uses 3 schemes, ZeroR, OneR and J48, to classify the Iris data in an experiment using 10 train and test runs, with 66% of the data used for training and 34% used for testing.

[screenshot: the Experiment Environment Setup tab, with an InstancesResultListener as destination and a RandomSplitResultProducer (66%) using a weka.experiment.ClassifierSplitEvaluator]
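The repeated 66%/34% random split used by this experiment can be sketched in plain Java. This is an illustrative stand-in, not WEKA's RandomSplitResultProducer; the class and method names here are invented for the example:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Illustrative sketch of a repeated percentage split: each run reshuffles
// the instance indices with a run-specific seed and cuts them into a
// 66% training / 34% test partition.
public class PercentageSplit {

    // Returns {trainIndices, testIndices} for one run.
    public static List<List<Integer>> split(int numInstances, double trainPercent, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));            // run-specific reordering
        int cut = (int) Math.round(numInstances * trainPercent / 100.0);
        List<List<Integer>> result = new ArrayList<>();
        result.add(new ArrayList<>(indices.subList(0, cut)));       // training set
        result.add(new ArrayList<>(indices.subList(cut, numInstances))); // test set
        return result;
    }
}
```

Running ten such splits with seeds 1..10 and averaging the per-run accuracies mirrors the "10 train and test runs" setup described above.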
[screenshot: node context menu showing attribute names petallength, sepalwidth, sepallength and the entries Set Group, Position, Action]

View menu

The view menu allows for zooming in and out of the graph panel. Also, it allows for hiding or showing the status and toolbars.

[screenshot: the View menu with Zoom in / Zoom out and the View toolbar / View statusbar toggles]

The help menu points to this document.

[screenshot: the Help menu]

8.9. BAYES NETWORK GUI 143

Toolbar

The toolbar allows a shortcut to many functions. Just hover the mouse over the toolbar buttons and a tooltip text pops up that tells which function is activated. The toolbar can be shown or hidden with the View/View Toolbar menu.

Statusbar

At the bottom of the screen, the statusbar shows messages. This can be helpful when an undo/redo action is performed that does not have any visible effects, such as edit actions on a CPT. The statusbar can be shown or hidden with the View/View Statusbar menu.

Click right mouse button

Clicking the right mouse button in the graph panel outside a node brings up the following popup menu. It allows you to add a node at the location that was clicked, or add/select a parent to add to all nodes in the selection. If no node is selected, or no node can be added as parent, this function is disabled.

[popup menu: Add node, Add
MENT notes ANY>

 <!ELEMENT attributes (attribute+)>
 <!ELEMENT attribute (labels?,metadata?,attributes?)>
 <!ATTLIST attribute name CDATA #REQUIRED>
 <!ATTLIST attribute type (numeric|date|nominal|string|relational) #REQUIRED>
 <!ATTLIST attribute format CDATA #IMPLIED>
 <!ATTLIST attribute class (yes|no) "no">
 <!ELEMENT labels (label*)>
 <!ELEMENT label ANY>
 <!ELEMENT metadata (property*)>
 <!ELEMENT property ANY>
 <!ATTLIST property name CDATA #REQUIRED>

 <!ELEMENT instances (instance*)>
 <!ELEMENT instance (value*)>
 <!ATTLIST instance type (normal|sparse) "normal">
 <!ATTLIST instance weight CDATA #IMPLIED>
 <!ELEMENT value (#PCDATA|instances)*>
 <!ATTLIST value index CDATA #IMPLIED>
 <!ATTLIST value missing (yes|no) "no">

 <dataset name="iris" version="3.5.3">
   <header>
     <attributes>
       <attribute name="sepallength" type="numeric"/>
       <attribute name="sepalwidth" type="numeric"/>
       <attribute name="petallength" type="numeric"/>
       <attribute name="petalwidth" type="numeric"/>
       <attribute class="yes" name="class" type="nominal">
         <labels>
           <label>Iris-setosa</label>
           <label>Iris-versicolor</label>
           <label>Iris-virginica</label>
         </labels>

10.3. SPARSE FORMAT 163

       </attribute>
     </attributes>
   </header>
   <body>
NOT changed. Via Edit / Sort data one can sort the data permanently. This enables one to look for specific values, e.g., missing values. To better distinguish missing values from empty cells, the background of cells with missing values is colored grey.

[screenshot: ARFF-Viewer displaying the hungarian-14-heart-disease dataset sorted on an attribute, with missing-value cells shaded grey]
[package tree: weka.classifiers.bayes.net.search.global — HillClimber, K2, RepeatedHillClimber, SimulatedAnnealing, TabuSearch, TAN — and ...fixed]

Common options for cross-validation based algorithms are initAsNaiveBayes, markovBlanketClassifier and maxNrOfParents (see Section 8.2 for a description). Further, for each of the cross-validation based algorithms, the CVType can be chosen out of the following:

• Leave one out cross-validation (loo-cv) selects m = N training sets, simply by taking the data set D and removing the i-th record for training set D_i. The validation set consists of just the i-th single record. Loo-cv does not always produce accurate performance estimates.

• K-fold cross-validation (k-fold cv) splits the data D in m approximately equal parts D_1, ..., D_m. Training set D_i is obtained by removing part D_i from D. Typical values for m are 5, 10 and 20. With m = N, k-fold cross-validation becomes loo-cv.

• Cumulative cross-validation (cumulative cv) starts with an empty data set and adds instances item by item from D. After each time an item is added, the next item to be added is classified using the then current state of the Bayes network.

Finally, the useProb flag indicates whether the accuracy of the classifier should be estimated using the zero-one loss (if set to false) or using the estimated probability of the class.

120 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

weka.gui.GenericObjectEditor — weka.classifiers.bayes.net.se
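The partitioning behind these CVType variants can be sketched in a few lines of plain Java (an illustration only, not the weka.classifiers.bayes.net code). Note how k-fold partitioning degenerates to leave-one-out when m equals the number of instances:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative k-fold partitioner: splits indices 0..n-1 into m nearly
// equal folds; training set i is everything outside fold i.
// With m == n this is exactly leave-one-out cross-validation.
public class KFold {

    public static List<List<Integer>> folds(int n, int m) {
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < m; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % m).add(i); // round-robin assignment
        return folds;
    }

    public static List<Integer> trainingSet(int n, int m, int heldOutFold) {
        List<Integer> train = new ArrayList<>();
        for (int i = 0; i < n; i++)
            if (i % m != heldOutFold) train.add(i);           // drop the held-out fold
        return train;
    }
}
```

A real implementation would shuffle (and, for stratified CV, balance class labels across) the folds before assignment; the round-robin split above just shows the structure of the estimate.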
THE UNIVERSITY OF WAIKATO
Te Whare Wananga o Waikato

WEKA Manual for Version 3.6.0

Remco R. Bouckaert, Eibe Frank, Mark Hall, Richard Kirkby, Peter Reutemann, Alex Seewald, David Scuse

December 18, 2008

(c) 2002-2008 University of Waikato, Hamilton, New Zealand
Alex Seewald (original Command-line primer)
David Scuse (original Experimenter tutorial)

Contents

I  The Command-line

1  A command-line primer
   1.1  Introduction
   1.2  Basic concepts
        1.2.1  Dataset
        1.2.2  Classifier
        1.2.3  weka.filters
        1.2.4  weka.classifiers
   1.3  Examples

II  The Graphical User Interface

2  Launching WEKA

3  Simple CLI
   3.1  Commands
   3.2  Invocation
   3.3  Command redirection
   3.4  Command completion

4  Explorer
   4.1  The user interface
        4.1.1  Section Tabs
        4.1.2  Status Box
        4.1.3  Log Button
        4.1.4  WEKA Status Icon
        4.1.5  Graphical output
   4.2  Preprocessing
        4.2.1  Loading Data
        4.2.2  The Current Relation
        4.2.3  Working With Attributes
Then it

116 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

steps to the least worse candidate in the neighborhood. However, it does not consider points in the neighborhood it just visited in the last tl steps. These steps are stored in a so-called tabu-list.

[screenshot: GenericObjectEditor for weka.classifiers.bayes.net.search.local.TabuSearch, with options initAsNaiveBayes, markovBlanketClassifier, maxNrOfParents, runs, scoreType, tabuList and useArcReversal]

Specific options: runs is the number of iterations used to traverse the search space; tabuList is the length tl of the tabu list.

• Genetic search applies a simple implementation of a genetic search algorithm to network structure learning. A Bayes net structure is represented by an array of n * n (n = number of nodes) bits, where bit i * n + j represents whether there is an arrow from node j to node i.

[screenshot: GenericObjectEditor for weka.classifiers.bayes.net.search.local.GeneticSearch, with options descendantPopulationSize, markovBlanketClassifier, populationSize and runs]
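The bit-array encoding used by the genetic search can be sketched as a small stand-alone class (a simplified stand-in, not WEKA's implementation): bit i * n + j is set iff there is an arrow from node j to node i.

```java
// Simplified stand-in for the genetic search's network encoding:
// an n*n boolean array where bit i*n + j encodes an arrow j -> i.
public class StructureBits {
    private final boolean[] bits;
    private final int n;

    public StructureBits(int numNodes) {
        this.n = numNodes;
        this.bits = new boolean[numNodes * numNodes];
    }

    public void addArrow(int from, int to) { bits[to * n + from] = true; }

    public boolean hasArrow(int from, int to) { return bits[to * n + from]; }

    // Number of parents of a node = number of set bits in its "row".
    public int numParents(int node) {
        int count = 0;
        for (int j = 0; j < n; j++)
            if (bits[node * n + j]) count++;
        return count;
    }
}
```

This flat encoding is what makes the genetic operators cheap: mutation flips single bits, and cross-over splices two such arrays.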
ad by first right-clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and browse to the location of your ARFF file.

• Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.

• Now connect the ArffLoader to the ClassAssigner: first right-click over the ArffLoader and select the dataSet under Connections in the menu. A rubber band line will appear. Move the mouse over the ClassAssigner component and left-click — a red line labeled dataSet will connect the two components.

• Next right-click over the ClassAssigner and choose Configure from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).

• Now grab a NaiveBayesUpdateable component from the bayes section of the Classifiers panel and place it on the layout.

• Next connect the ClassAssigner to NaiveBayesUpdateable using an instance connection.

• Next place an IncrementalClassifierEvaluator from the Evaluation panel onto the layout and connect NaiveBayesUpdateable to it using an incrementalClassifier connection.

100 CHAPTER 6. KNOWLEDGEFLOW

• Next place a TextViewer component from the Visualization panel on the layout. Connect the IncrementalClassifierEvaluator to it using a text connection. N
alizer can save graphs into the Interchange Format for Bayesian Networks (BIF), http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/. If started from the command line with an XML file name as first parameter, and not from the Explorer, it can display the given file directly.

The DTD for BIF is this:

 <!DOCTYPE BIF [
  <!ELEMENT BIF ( NETWORK )*>
  <!ATTLIST BIF VERSION CDATA #REQUIRED>
  <!ELEMENT NETWORK ( NAME, ( PROPERTY | VARIABLE | DEFINITION )* )>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT VARIABLE ( NAME, ( OUTCOME | PROPERTY )* )>
  <!ATTLIST VARIABLE TYPE (nature|decision|utility) "nature">
  <!ELEMENT OUTCOME (#PCDATA)>
  <!ELEMENT DEFINITION ( FOR | GIVEN | TABLE | PROPERTY )*>
  <!ELEMENT FOR (#PCDATA)>
  <!ELEMENT GIVEN (#PCDATA)>
  <!ELEMENT TABLE (#PCDATA)>
  <!ELEMENT PROPERTY (#PCDATA)>
 ]>

204 CHAPTER 16. TECHNICAL DOCUMENTATION

Responsible Class(es):

 weka.classifiers.bayes.BayesNet#toXMLBIF03()
 weka.classifiers.bayes.net.BIFReader
 weka.gui.graphvisualizer.BIFParser

16.6.5 XRFF files

With Weka 3.5.4 a new, more feature-rich, XML-based data format got introduced: XRFF. For more information, please see Chapter 10.

Chapter 17

Other resources

TODO

17.1 Mailing list

The WEKA Mailing list can be found here:

• http://list.scms.waikato.ac.nz/mailman/listinfo/wekalist for subscribing/unsubscribing the list

• https://list.scms.waik
all estimates on test data and computing the average and standard deviation of accuracy.

A more elaborate method is cross-validation. Here, a number of folds n is specified. The dataset is randomly reordered and then split into n folds of equal size. In each iteration, one fold is used for testing and the other n-1 folds are used for training the classifier. The test results are collected and averaged over all folds. This gives the cross-validation estimate of the accuracy. The folds can be purely random or slightly modified to create the same class distributions in each fold as in the complete dataset. In the latter case, the cross-validation is called stratified. Leave-one-out (loo) cross-validation signifies that n is equal to the number of examples. Out of necessity, loo-cv has to be non-stratified, i.e. the class distributions in the test set are not related to those in the training data. Therefore loo-cv tends to give less reliable results. However, it is still quite useful in dealing with small datasets, since it utilizes the greatest amount of training data from the dataset.

1.2. BASIC CONCEPTS 15

1.2.3 weka.filters

The weka.filters package is concerned with classes that transform datasets — by removing or adding attributes, resampling the dataset, removing examples and so on. This package offers useful support for data preprocessing, which is an important step in machine learning.

All filters offer the options -i for specifying the input d
an handle all the selected Capabilities (black), the ones that cannot (grey) and the ones that might be able to handle them (blue), e.g., meta classifiers which depend on their base classifier(s).

16.5 Properties

A properties file is a simple text file with this structure:

 <key>=<value>

Comments start with the hash sign #. To make a rather long property line more readable, one can use a backslash to continue on the next line. The Filter property, e.g., looks like this:

 weka.filters.Filter= \
  weka.filters.supervised.attribute, \
  weka.filters.supervised.instance, \
  weka.filters.unsupervised.attribute, \
  weka.filters.unsupervised.instance

16.5.1 Precedence

The Weka property files (extension .props) are searched for in the following order:

• current directory
• the user's home directory (*nix $HOME, Windows %USERPROFILE%)
• the class path (normally the weka.jar file)

If Weka encounters those files, it only supplements the properties, never overrides them. In other words, a property in the property file of the current directory has a higher precedence than the one in the user's home directory.

Note: Under Cygwin (http://cygwin.com/), the home directory is still the Windows one, since the java installation will be still one for Windows.

16.5.2 Examples

• weka/gui/LookAndFeel.props
• weka/gui/GenericPropertiesCreator.props
• weka/gui/beans/Beans.props

16.6 XML 199

16.6 XML

Weka now supports XML (http://www.w3c.or
application, it may make sense to output "don't know" below a certain threshold. WEKA also outputs a trailing newline.

If we had chosen a range of attributes via -p, e.g., -p first-last, the mentioned attributes would have been output afterwards as comma-separated values, in parentheses. However, the zero-based instance id in the first column offers a safer way to determine the test instances.

If we had saved the output of -p in soybean-test.preds, the following call would compute the number of correctly classified instances:

 cat soybean-test.preds | awk '$2==$3&&$0!=""' | wc -l

Dividing by the number of instances in the test set, i.e., wc -l < soybean-test.preds minus one (trailing newline), we get the training set accuracy.

1.3. EXAMPLES 21

1.3 Examples

Usually, if you evaluate a classifier for a longer experiment, you will do something like this (for csh):

 java -Xmx1024m weka.classifiers.trees.J48 -t data.arff -i -k \
   -d J48-data.model >& J48-data.out &

The -Xmx1024m parameter for maximum heap size ensures your task will get enough memory. There is no overhead involved — it just leaves more room for the heap to grow. -i and -k gives you some additional information, which may be useful, e.g., precision and recall for all classes. In case your model performs well, it makes sense to save it via -d — you can always delete it later. The implicit cross-validation gives a more reasonable estimate of the expected accuracy on unseen data than the training set acc
[screenshot: GenericObjectEditor for weka.classifiers.bayes.net.search.global.K2, showing the CVType choices (LOO-CV, k-Fold-CV, Cumulative-CV) and the options initAsNaiveBayes, markovBlanketClassifier, maxNrOfParents, randomOrder and useProb]

The following search algorithms are implemented: K2, HillClimbing, RepeatedHillClimber, TAN, Tabu Search, Simulated Annealing and Genetic Search. See Section 8.2 for a description of the specific options for those algorithms.

8.5 Fixed structure learning

The structure learning step can be skipped by selecting a fixed network structure. There are two methods of getting a fixed structure: just make it a naive Bayes network, or reading it from a file in XML BIF format.

[package tree: weka.classifiers.bayes.net.search.fixed — FromFile, NaiveBayes]

8.6 Distribution learning

Once the network structure is learned, you can choose how to learn the probability tables, selecting a class in the weka.classifiers.bayes.net.estimate

8.6. DISTRIBUTION LEARNING 121

package:

[package tree: weka.classifiers.bayes.net.estimate — BayesNetEstimator, BMAEstimator, MultiNomialBMAEstimator, SimpleEstimator]

The SimpleEstimator class produces direct estimates of the conditional probab
as the product of cardinalities of nodes in pa(x_i), q_i = prod_{x_j in pa(x_i)} r_j.

114 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

Note pa(x_i) = {} implies q_i = 1. We use N_ij (1 <= i <= n, 1 <= j <= q_i) to denote the number of records in D for which pa(x_i) takes its j-th value. We use N_ijk (1 <= i <= n, 1 <= j <= q_i, 1 <= k <= r_i) to denote the number of records in D for which pa(x_i) takes its j-th value and for which x_i takes its k-th value. So, N_ij = sum_k N_ijk. We use N to denote the number of records in D.

Let the entropy metric H(B_S, D) of a network structure and database be defined as

  H(B_S, D) = -N sum_{i=1}^{n} sum_{j=1}^{q_i} sum_{k=1}^{r_i} (N_ijk / N) log(N_ijk / N_ij)    (8.2)

and the number of parameters K as

  K = sum_{i=1}^{n} (r_i - 1) * q_i    (8.3)

AIC metric: The AIC metric Q_AIC(B_S, D) of a Bayesian network structure B_S for a database D is

  Q_AIC(B_S, D) = H(B_S, D) + K    (8.4)

A term P(B_S) can be added [14] representing prior information over network structures, but will be ignored for simplicity in the Weka implementation.

MDL metric: The minimum description length metric Q_MDL(B_S, D) of a Bayesian network structure B_S for a database D is defined as

  Q_MDL(B_S, D) = H(B_S, D) + (K / 2) log N    (8.5)

Bayesian metric: The Bayesian metric of a Bayesian network structure B_S for a database D is

  Q_Bayes(B_S, D) = P(B_S) prod_{i=0}^{n} prod_{j=1}^{q_i} [ Gamma(N'_ij) / Gamma(N'_ij + N_ij) ] prod_{k=1}^{r_i} [ Gamma(N'_ijk + N_ijk) / Gamma(N'_ijk) ]

where P(B_S) is the prior on the network structure (taken to be constant, hence ignored in the Weka implementation
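Given precomputed counts N_ijk, the entropy-based scores above are mechanical to evaluate. A self-contained sketch (not the WEKA code; the counts array is assumed to be supplied by the caller, and K must be computed separately as sum_i (r_i - 1) * q_i):

```java
// Illustrative computation of the entropy, AIC and MDL scores defined above.
// counts[i][j][k] = N_ijk: number of records where the parents of node i
// take their j-th configuration and node i takes its k-th value.
public class NetworkScores {

    // H(B_S, D) = -sum_{i,j,k} N_ijk * log(N_ijk / N_ij)
    public static double entropy(int[][][] counts) {
        double h = 0.0;
        for (int[][] node : counts)
            for (int[] parentConfig : node) {
                int nij = 0;                           // N_ij = sum_k N_ijk
                for (int nijk : parentConfig) nij += nijk;
                for (int nijk : parentConfig)
                    if (nijk > 0)
                        h -= nijk * Math.log((double) nijk / nij);
            }
        return h;
    }

    // Q_AIC = H + K, where K is the number of free parameters.
    public static double aic(int[][][] counts, int numParameters) {
        return entropy(counts) + numParameters;
    }

    // Q_MDL = H + (K/2) * log N, where N is the number of records.
    public static double mdl(int[][][] counts, int numParameters, int numRecords) {
        return entropy(counts) + numParameters / 2.0 * Math.log(numRecords);
    }
}
```

(Natural logarithms are used here; any fixed base only rescales the scores and leaves their ranking unchanged.)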
ata contains for this attribute.

5. Unique. The number (and percentage) of instances in the data having a value for this attribute that no other instances have.

Below these statistics is a list showing more information about the values stored in this attribute, which differ depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data — the minimum, maximum, mean and standard deviation. And below these statistics there is a coloured histogram, colour-coded according to the attribute chosen as the Class using the box above the histogram. (This box will bring up a drop-down list of available selections when clicked.) Note that only nominal Class attributes will result in a colour-coding. Finally, after pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window.

Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by clicking on them individually. The four buttons above can also be used to change the selection:

4.2. PREPROCESSING 37

1. All. All boxes are ticked.
2. None. All boxes are cleared (unticked).
3. Invert. Boxes that are ticked become unticked and vice versa.
4. Pattern. Enables the user to select attributes based on a P
ataset and -o for specifying the output dataset. If any of these parameters is not given, this specifies standard input resp. output for use within pipes. Other parameters are specific to each filter and can be found out via -h, as with any other class. The weka.filters package is organized into supervised and unsupervised filtering, both of which are again subdivided into instance and attribute filtering. We will discuss each of the four subsections separately.

weka.filters.supervised

Classes below weka.filters.supervised in the class hierarchy are for supervised filtering, i.e., taking advantage of the class information. A class must be assigned via -c; for WEKA default behaviour, use -c last.

weka.filters.supervised.attribute

Discretize is used to discretize numeric attributes into nominal ones, based on the class information, via Fayyad & Irani's MDL method, or optionally with Kononenko's MDL method. At least some learning schemes or classifiers can only process nominal data, e.g., weka.classifiers.rules.Prism; in some cases discretization may also reduce learning time.

 java weka.filters.supervised.attribute.Discretize -i data/iris.arff \
   -o iris-nom.arff -c last
 java weka.filters.supervised.attribute.Discretize -i data/cpu.arff \
   -o cpu-classvendor-nom.arff -c first

NominalToBinary encodes all nominal attributes into binary (two-valued) attributes, which can be used to transform the dataset into a purely numeric representation.
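The basic idea behind NominalToBinary can be illustrated with a tiny one-hot encoder. This is only a sketch of the general technique (a k-valued nominal attribute becomes k 0/1 indicator attributes); WEKA's actual filter applies more refined rules, e.g., taking the class attribute into account in the supervised version:

```java
import java.util.List;

// Sketch of nominal-to-binary conversion for a single value: the label's
// position in the attribute's label list determines which indicator is set.
public class OneHot {

    public static int[] encode(List<String> labels, String value) {
        int[] out = new int[labels.size()];
        int idx = labels.indexOf(value);
        if (idx >= 0) out[idx] = 1;   // unknown/missing values encode as all zeros here
        return out;
    }
}
```

For example, encoding "green" against the label list ["red", "green", "blue"] sets only the middle indicator.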
ato.ac.nz/pipermail/wekalist (mirrors: http://news.gmane.org/gmane.comp.ai.weka, http://www.nabble.com/WEKA-f435.html) for searching previously posted messages.

Before posting, please read the Mailing List Etiquette: http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html

17.2 Troubleshooting

Here are a few of the things that are useful to know when you are having trouble installing or running Weka successfully on your machine.

NB: these java commands refer to ones executed in a shell (bash, command prompt, etc.) and NOT to commands executed in the SimpleCLI.

17.2.1 Weka download problems

When you download Weka, make sure that the resulting file size is the same as on our webpage. Otherwise things won't work properly. Apparently some web browsers have trouble downloading Weka.

17.2.2 OutOfMemoryException

Most Java virtual machines only allocate a certain maximum amount of memory to run Java programs. Usually this is much less than the amount of RAM in your computer. However, you can extend the memory available for the virtual machine by setting appropriate options. With Sun's JDK, for example, you can go

 java -Xmx100m ...

to set the maximum Java heap size to 100MB. For more information about these options, see http://java.sun.com/docs/hotspot/VMOptions.html.

206 CHAPTER 17. OTHER RESOURCES

17.2.2.1 Windows

Book version

You have to modify the JVM invocation in the RunWeka.bat batch file in your installation d
ator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.

4.6.2 Options

The Attribute Selection Mode box has two options:

1. Use full training set. The worth of the attribute subset is determined using the full set of training data.

2. Cross-validation. The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data.

As with Classify (Section 4.3.1), there is a drop-down box that can be used to specify which attribute to treat as the class.

4.6.3 Performing Selection

Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the classify panel. It is also possible to Visualize

4.6. SELECTING ATTRIBUTES 47

reduced data, or, if you have used an attribute transformer such as Principal Components, Visualize transformed data. The reduced/transformed data can be saved to a file with the Save reduced data or Save transformed data option.

In case one wants to reduce/transform a training and a test at the same time and
bs are active — clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the Weka bird) stays visible regardless of which section you are in.

The Explorer can be easily extended with custom tabs. The Wiki article "Adding tabs in the Explorer" [6] explains this in detail.

4.1.2 Status Box

The status box appears at the very bottom of the window. It displays messages that keep you informed about what's going on. For example, if the Explorer is busy loading a file, the status box will say that.

TIP — right-clicking the mouse anywhere inside the status box brings up a little menu. The menu gives two options:

34 CHAPTER 4. EXPLORER

1. Memory information. Display in the log box the amount of memory available to WEKA.

2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer needed and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running as a background task anyway.

4.1.3 Log Button

Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a record of what has happened. For people using the command line or the SimpleCLI, the log now also contains the full setup strings for classificat
buSearch

-L <integer>
 Tabu list length
-U <integer>
 Number of runs
-P <nr of parents>
 Maximum number of parents
-R
 Use arc reversal operation. (default false)
-N
 Initial structure is empty (instead of Naive Bayes)
-mbc
 Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
 Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q
 Use probabilistic or 0/1 scoring. (default probabilistic scoring)

weka.classifiers.bayes.net.search.global.TAN
-mbc
 Applies a Markov Blanket correction to the network structure, after a network structure is learned. This ensures that all nodes in the network are part of the Markov blanket of the classifier node.
-S [LOO-CV|k-Fold-CV|Cumulative-CV]
 Score type (LOO-CV, k-Fold-CV, Cumulative-CV)
-Q
 Use probabilistic or 0/1 scoring. (default probabilistic scoring)

weka.classifiers.bayes.net.search.fixed.FromFile
-B <BIF File>
 Name of file containing network structure in BIF format

weka.classifiers.bayes.net.search.fixed.NaiveBayes

8.7 RUNNING FROM THE COMMAND LINE 131

No options.

Overview of options for estimators:

• weka.classifiers.bayes.net.estimate.BayesNetEstimator
but without the hassle of the CLASSPATH (it facilitates the one with which Weka was started). It offers a simple Weka shell with separated commandline and output.

Welcome to the WEKA SimpleCLI

Enter commands in the textfield at the bottom of the window. Use the up and down arrows to move through previous commands. Command completion for classnames and files is initiated with <Tab>. In order to distinguish between files and classnames, file names must be either absolute or start with './' or '~/'. <Alt+BackSpace> is used for deleting the text in the commandline in chunks.

> help
Command must be one of:
	java <classname> <args>
	break
	kill
	cls
	exit
	help <command>

3.1 Commands

The following commands are available in the Simple CLI:

• java <classname> [<args>] - invokes a java class with the given arguments (if any)
• break - stops the current thread, e.g., a running classifier, in a friendly manner
• kill - stops the current thread in an unfriendly fashion
• cls - clears the output area
• exit - exits the Simple CLI
• help [<command>] - provides an overview of the available commands if without a command name as argument, otherwise more help on the specified command

3.2 Invocation

In order to invoke a Weka class, one has only to prefix the class with "java". This command tells the Simple CLI to loa
instances (if no test instances provided), along with attributes (0 for none).

-distribution
	Outputs the distribution instead of only the prediction in conjunction with the '-p' option (only nominal classes).
-r
	Only outputs cumulative margin distribution.
-g
	Only outputs the graph representation of the classifier.
-xml filename | xml-string
	Retrieves the options from the XML-data instead of the command line.

Options specific to weka.classifiers.bayes.BayesNet:

-D
	Do not use ADTree data structure.
-B <BIF file>
	BIF file to compare with.
-Q weka.classifiers.bayes.net.search.SearchAlgorithm
	Search algorithm.
-E weka.classifiers.bayes.net.estimate.SimpleEstimator
	Estimator algorithm.

The search algorithm option -Q and estimator option -E are mandatory. Note that it is important that the -E option is used after the -Q option. Extra options can be passed to the search algorithm and the estimator after the class name. For example:

java weka.classifiers.bayes.BayesNet -t iris.arff -D
	-Q weka.classifiers.bayes.net.search.local.K2 -- -P 2 -S ENTROPY
	-E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 1.0

Overview of options for search algorithms:

• weka.classifiers.bayes.net.search.local.GeneticSearch

-L <integer>
	Population size
-A <integer>
	Descendant population size
-U <integer>
	Number of runs
-M
	Use mutation
cross-validation run using a supplied evaluator.

• numFolds - Number of folds to use in cross validation.
• outputFile - Set the destination for saving raw output. If the rawOutput option is selected, then output from the splitEvaluator for individual folds is saved. If the destination is a directory, then each output is saved to an individual gzip file; if the destination is a file, then each output is saved as an entry in a zip file.
• rawOutput - Save raw output (useful for debugging). If set, then output is sent to the destination specified by outputFile.
• splitEvaluator - The evaluator to apply to the cross validation folds. This may be a classifier, regression scheme, etc.

As with the RandomSplitResultProducer, multiple schemes can be run during cross validation by adding them to the Generator properties panel.

[Screenshot: the Experiment Environment Setup tab (Advanced mode) with CrossValidationResultProducer -X 10 selected as result generator and ClassifierSplitEvaluator as split evaluator.]
d a class and execute it with any given parameters. E.g., the J48 classifier can be invoked on the iris dataset with the following command:

java weka.classifiers.trees.J48 -t c:/temp/iris.arff

This results in the following output:

[Screenshot: the SimpleCLI output of the run, ending with:]

 50  0  0 |  a = Iris-setosa
  0 49  1 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

=== Stratified cross-validation ===

Correctly Classified Instances         144       96      %
Incorrectly Classified Instances         6        4      %
Kappa statistic                          0.94
Mean absolute error                      0.035
Root mean squared error                  0.1586
Relative absolute error                  7.8705 %
Root relative squared error             33.6353 %
Total Number of Instances              150

=== Confusion Matrix ===

  a  b  c   <-- classified as
 49  1  0 |  a = Iris-setosa
  0 47  3 |  b = Iris-versicolor
  0  2 48 |  c = Iris-virginica

3.3 Command redirection

Starting with this version of Weka one can perform a basic redirection:

java weka.classifiers.trees.J48 -t test.arff > j48.txt

Note: the '>' must be preceded and followed by a space, otherwise it is not recognized as redirection, but as part of another parameter.

3.4 Command completion

Commands starting with java support completion for classnames and filenames via Tab. Alt+BackSpace deletes parts of the command again. In case that there are several matches, Weka lists all possible matches:

• package name completion

java weka.cl<Tab>

resu
d, and the highlighting removed again. To change to a decision tree scheme, select J48 in subgroup trees.

[Screenshot: the GenericObjectEditor showing weka.classifiers.trees.J48 ("Class for generating a pruned or unpruned C4...") with its options: binarySplits, confidenceFactor 0.25, debug, minNumObj 2, numFolds 3, reducedErrorPruning, saveInstanceData, seed 1, subtreeRaising, unpruned, useLaplace.]

The new scheme is added to the Generator properties panel. Click Add to add the new scheme.

[Screenshot: the Experiment Environment Setup tab, with both ZeroR and J48 -C 0.25 -M 2 listed in the Generator properties panel of the RandomSplitResultProducer.]
d from a terminal window.

• explorer - The command that's executed if one double-clicks on an ARFF or XRFF file.

In order to change the maximum heap size for all those commands, one only has to modify the maxheap placeholder. For more information, check out the comments in the INI file.

16.2.3 java -jar

When you're using the Java interpreter with the -jar option, be aware of the fact that it overwrites your CLASSPATH and does not augment it. Out of convenience, people often only use the -jar option to skip the declaration of the main class to start. But as soon as you need more jars, e.g., for database access, you need to use the -classpath option and specify the main class.

Here's once again how you start the Weka Main-GUI with your current CLASSPATH variable and 128MB for the JVM:

• Linux

	java -Xmx128m -classpath $CLASSPATH:weka.jar weka.gui.Main

• Win32

	java -Xmx128m -classpath "%CLASSPATH%;weka.jar" weka.gui.Main

16.3 Subversion

16.3.1 General

The Weka Subversion repository is accessible and browseable via the following URL:

https://svn.scms.waikato.ac.nz/svn/weka/

A Subversion repository usually has the following layout:

root
  |
  +- trunk
  |
  +- tags
  |
  +- branches

where trunk contains the main trunk of the development, tags snapshots in time of the repository (e.g., when a new version got released) and branches development branches that forked off the main trunk at so
d in a directory called .knowledgeFlow/plugins in the user's home directory. If this directory does not exist, you must create it in order to install plugins. Plugins are installed in subdirectories of the .knowledgeFlow/plugins directory. More than one plugin component may reside in the same subdirectory. Each subdirectory should contain jar file(s) that contain and support the plugin components. The KnowledgeFlow will dynamically load jar files and add them to the classpath. In order to tell the KnowledgeFlow which classes in the jar files to instantiate as components, a second file called Beans.props needs to be created and placed into each plugin subdirectory. This file contains a list of fully qualified class names to be instantiated. Successfully instantiated components will appear in a Plugins tab in the KnowledgeFlow user interface. Below is an example plugin directory, the listing of the contents of the jar file and the contents of the associated Beans.props file:

cygnus{mhall}% ls -l $HOME/.knowledgeFlow/plugins/kettle
total 24
-rw-r--r--  1 mhall mhall  117 20 Feb 10:56 Beans.props
-rw-r--r--  1 mhall mhall 8047 20 Feb 14:01 kettleKF.jar

cygnus{mhall}% jar tvf /Users/mhall/.knowledgeFlow/plugins/kettle/kettleKF.jar
     0 Wed Feb 20 14:01:34 NZDT 2008 META-INF/
    70 Wed Feb 20 14:01:34 NZDT 2008 META-INF/MANIFEST.MF
     0 Tue Feb 19 14:59:08 NZDT 2008 weka/
     0 Tue Feb 19 14:59:08 NZDT 2008 weka/gui/
     0 Wed Feb 20 13:55:52 NZDT 2008
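The actual Beans.props contents for the kettle plugin are not reproduced above; as a purely illustrative sketch (the property key and the component class name below are assumptions for the example, not taken from that plugin), such a file simply lists the fully qualified classes the KnowledgeFlow should instantiate:

```
# Beans.props - tells the KnowledgeFlow which classes in the plugin
# jar(s) to instantiate as components (names here are illustrative)
weka.gui.beans.KnowledgeFlow.Plugins=some.package.MyPluginComponent
```

Multiple components would be listed as a comma-separated value of the same key.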
d models

• StripChart - component that can pop up a panel that displays a scrolling plot of data (used for viewing the online performance of incremental classifiers)

6.4 Examples

6.4.1 Cross validated J48

Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48 (WEKA's C4.5 implementation):

[Flow diagram: ArffLoader -> ClassAssigner -> CrossValidationFoldMaker -> J48 -> ClassifierPerformanceEvaluator -> TextViewer]

• Click on the DataSources tab and choose ArffLoader from the toolbar (the mouse pointer will change to a cross hairs).

• Next place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area).

• Next specify an ARFF file to load by first right-clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and browse to the location of your ARFF file.

• Next click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.

• Now connect the ArffLoader to the ClassAssigner: first right-click over the ArffLoader and select the dataSet under Connections in the menu. A rubber band line will appear. Move the mouse over the ClassAssigner componen
declarations are case-insensitive.

9.2 Examples

Several well-known machine learning datasets are distributed with Weka in the $WEKAHOME/data directory as ARFF files.

9.2.1 The ARFF Header Section

The ARFF Header section of the file contains the relation declaration and attribute declarations.

The @relation Declaration

The relation name is defined as the first line in the ARFF file. The format is:

@relation <relation-name>

where <relation-name> is a string. The string must be quoted if the name includes spaces.

The @attribute Declarations

Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement which uniquely defines the name of that attribute and its data type. The order in which the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared, then Weka expects that all that attribute's values will be found in the third comma delimited column.

The format for the @attribute statement is:

@attribute <attribute-name> <datatype>

where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name, then the entire name must be quoted.

The <datatype> can be any of the four types supported by Weka:

• numeric
• integer is treated as numeric
• real is treated as numeric
• <nominal-specification>
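Putting the @relation and @attribute declarations together, a minimal ARFF file (relation and attribute names invented for the example; the weather-style attributes are only illustrative) looks like this:

```
% a tiny example dataset
@relation 'weather example'

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
```

Lines starting with % are comments; the quoted relation name is required here because it contains a space.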
In order to get statistically meaningful results, the default number of iterations is 10. In case of 10-fold cross-validation this means 100 calls of one classifier with training data, tested against test data.

• Data sets first / Algorithms first
As soon as one has more than one dataset and algorithm, it can be useful to switch from datasets being iterated over first to algorithms. This is the case if one stores the results in a database and wants to complete the results for all the datasets for one algorithm as early as possible.

5.2.1.6 Algorithms

New algorithms can be added via the Add new... button. Opening this dialog for the first time, ZeroR is presented, otherwise the one that was selected last.

[Screenshot: the GenericObjectEditor with weka.classifiers.rules.ZeroR selected ("Class for building and using a 0-R classifier").]

With the Choose button one can open the GenericObjectEditor and choose another classifier.

[Screenshot: the classifier tree of the GenericObjectEditor, showing the packages bayes, functions, lazy, meta, mi, misc, trees and rules, the latter expanded to ConjunctiveRule, DecisionTable, JRip, M5Rules, NNge, OneR, PART, Prism, Ridor and ZeroR.]
[Screenshot: the GenericObjectEditor for weka.experiment.InstancesResultListener ("Outputs the received results in arff format to a Writer"), with outputFile set to the default weka_experiment25619.arff.]

[Screenshot: the FileEditor dialog for choosing the output file, with file name Experiment1.arff and file type "All Files".]

Type the name of the output file, click Select, and then click close (x). The file name is displayed in the outputFile panel. Click on OK to close the window.

[Screenshot: the GenericObjectEditor for weka.experiment.InstancesResultListener ("Takes results from a result producer and assembles them into a set of instances"), with outputFile set to Experiment1.arff.]

The dataset name is displayed in the Destination panel of the Setup tab.

[Screenshot: the Experiment Environment Setup tab.]
window, a file save-as dialog pops up that allows you to select the file name to save to.

• Java: create a BayesNet and call BayesNet.toXMLBIF03(), which returns the Bayes network in BIF format as a String.

• Command line: use the -g option and redirect the output on stdout into a file.

How do I compare a network I learned with one in BIF format?

Specify the -B <bif-file> option to BayesNet. Calling toString() will produce a summary of extra, missing and reversed arrows. Also, the divergence between the network learned and the one on file is reported.

How do I use the network I learned for general inference?

There is no general purpose inference in Weka, but you can export the network as an XML BIF file (see above) and import it in other packages, for example JavaBayes, available under GPL from http://www.cs.cmu.edu/~javabayes.

8.13 Future development

If you would like to add to the current Bayes network facilities in Weka, you might consider one of the following possibilities:

• Implement more search algorithms, in particular:
	- general purpose search algorithms (such as an improved implementation of genetic search),
	- structure search based on equivalent model classes,
	- implement those algorithms both for local and global metric based search algorithms,
	- implement more conditional independence based search algorithms.
• Implement score metrics that can handle sparse
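Returning to the Java export route listed above, a minimal sketch of learning a network and writing it out as XML BIF (this is not runnable without weka.jar from the 3.5 line on the CLASSPATH; the file names are placeholders and error handling is omitted):

```java
import java.io.FileWriter;
import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ExportBIF {
    public static void main(String[] args) throws Exception {
        // load a dataset and set the class attribute (last attribute here)
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // learn the network, then export it in XML BIF format
        BayesNet net = new BayesNet();
        net.buildClassifier(data);
        FileWriter w = new FileWriter("iris.xml");
        w.write(net.toXMLBIF03());
        w.close();
    }
}
```

The resulting iris.xml can then be imported into packages such as JavaBayes, as described above.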
e changed by clicking on the outputFile panel in the window. Now when the experiment is run, the result of each processing run is archived, as shown below:

[Screenshot: the contents of splitEvaluatorOut.zip, containing for each of the runs 1 to 10 one rules.ZeroR entry and one trees.J48_-C_0.25_-M_2 entry, stored under paths of the form <run>/iris/ClassifierSplitEvaluator/.]
[Screenshot: the Classify panel after training J48 on the weather data, showing evaluation statistics (root mean squared error, relative absolute error, root relative squared error, total number of instances), the detailed accuracy by class (TP Rate, FP Rate, Precision, Recall, F-Measure, ROC Area) and the confusion matrix, plus the Start/Stop buttons and the result list (right-click for options).]

4.3.1 Selecting a Classifier

At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier, and its options. Clicking on the text box with the left mouse button brings up a GenericObjectEditor dialog box, just the same as for filters, that you can use to configure the options of the current classifier. With a right-click (or Alt+Shift+left-click) you can once again copy the setup string to the clipboard or display the properties in a GenericObjectEditor dialog box. The Choose button allows you to choose one of the classifiers that are available in WEKA.

4.3.2 Test Options

The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:

1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it was trained on.

2. Supplied test set. The classifier is evaluated on
115. e relative pat Can t edit Wataliris arff Click Select property and expand splitEvaluator so that the classifier entry is visible in the property list click Select Select a property x 7 Available properties Ey outputFile y randomizeData D rawOutput c splitEvaluator O attributelD E classForlRStatistics gt blassi er D predTargetColumn y trainPercent Select Cancel The scheme name is displayed in the Generator properties panel 70 CHAPTER 5 EXPERIMENTER Weka Experiment Environment Setup Run Analyse xperiment Configuration Mode Open Destination InstancesResultListener O Experimentt arf Choose Result generator Choose RandomSplitResuttProducer P 66 0 O splitEvalutorOutzip W weka experiment ClassifierSplitEvaluator W weka classifier Runs Distribute experiment Generator properties Enabled Choose From 1 To 10 Hosts O Byrun Select property By data set Iteration control 8 Data sets first Datasets Custom generator first Add new v Use relative pat Edit selecte Delete select dataliris artt To add another scheme click on the C
e to the Weka KnowledgeFlow. The KnowledgeFlow presents a data-flow inspired interface to WEKA. The user can select WEKA components from a tool bar, place them on a layout canvas and connect them together in order to form a knowledge flow for processing and analyzing data. At present, all of WEKA's classifiers, filters, clusterers, loaders and savers are available in the KnowledgeFlow, along with some extra tools.

The KnowledgeFlow can handle data either incrementally or in batches (the Explorer handles batch data only). Of course, learning from data incrementally requires a classifier that can be updated on an instance by instance basis. Currently in WEKA there are ten classifiers that can handle data incrementally:

• AODE
• IB1
• IBk
• KStar
• NaiveBayesMultinomialUpdateable
• NaiveBayesUpdateable
• NNge
• Winnow

And two of them are meta classifiers:

• RacedIncrementalLogitBoost - can use any regression base learner to learn from discrete class data incrementally
• LWL - locally weighted learning

6.2 Features

The KnowledgeFlow offers the following features:

• intuitive data flow style layout
• process data in batches or incrementally
• process multiple batches or streams in parallel (each separate flow executes in its own thread)
• chain filters together
• view models produced by classifiers for each fold in a cross validation
• visualize performance of incremental clas
117. ected by default Output predictions The predictions on the evaluation data are output Note that in the case of a cross validation the instance numbers do not correspond to the location in the data Output additional attributes If additional attributes need to be out put alongside the predictions e g an ID attribute for tracking misclassi fications then the index of this attribute can be specified here The usual Weka ranges are supported first and last are therefore valid indices as well example first 3 6 8 12 last Cost sensitive evaluation The errors is evaluated with respect to a cost matrix The Set button allows you to specify the cost matrix used Random seed for xval Split This specifies the random seed used when randomizing the data before it is divided up for evaluation purposes Preserve order for Split This suppresses the randomization of the data before splitting into train and test set Output source code If the classifier can output the built model as Java source code you can specify the class name here The code will be printed in the Classifier output area 4 3 3 The Class Attribute The classifiers in WEKA are designed to be trained to predict a single class attribute which is the target for prediction Some classifiers can only learn nominal classes others can only learn numeric classes regression problems still others can learn both By default the clas
[Screenshot: the Setup tab with Experiment Type set to Train/Test Percentage Split (order preserved), train percentage 66, number of repetitions 10.]

Notes:

• Train/Test Percentage Split (order preserved): because it is impossible to specify an explicit train/test files pair, one can abuse this type to un-merge a previously merged train and test file into the two original files; one only needs to find out the correct percentage.

[Screenshot: the Setup tab with Experiment Type set to Train/Test Percentage Split (data randomized), train percentage 66.]

Addi
13.2.2 Setup
13.3 Missing Datatypes
13.4 Stored Procedures
13.5 Troubleshooting
14 Windows databases

IV Appendix

15 Research
15.1 Citing Weka
15.2 Paper references
16 Technical documentation
16.1 ANT
16.1.2 Weka and ANT
16.2 CLASSPATH
16.2.1 Setting the CLASSPATH
16.2.2 RunWeka.bat
16.2.3 java -jar
16.3 Subversion
16.3.1 General
16.3.2 Source code
16.3.3 JUnit
16.3.4 Specific version
16.3.5 Clients
16.4 GenericObjectEditor
16.4.1 Introduction
16.4.2 File Structure
16.4.3 Exclusion
16.4.4 Class Discovery
16.4.5 Multiple Class Hierarchies
16.4.6 Capabilities
16.5 Properties
16.5.1 Precedence
16.5
120. en calling Java For example if the jar file is located at c weka 3 4 weka jar you can use java cp c weka 3 4 weka jar weka classifiers etc See also Section 16 2 17 2 10 Instance ID People often want to tag their instances with identifiers so they can keep track of them and the predictions made on them 17 2 10 1 Adding the ID A new ID attribute is added real easy one only needs to run the AddID filter over the dataset and it s done Here s an example at a DOS Unix command prompt java weka filters unsupervised attribute AddID i data_without_id arff o data_with_id arff all on a single line Note the AddID filter adds a numeric attribute not a String attribute to the dataset If you want to remove this ID attribute for the classifier in a FilteredClassifier environment again use the Remove filter instead of the RemoveType filter same package 17 2 10 2 Removing the ID If you run from the command line you can use the p option to output predic tions plus any other attributes you are interested in So it is possible to have a string attribute in your data that acts as an identifier A problem is that most classifiers don t like String attributes but you can get around this by using the RemoveType this removes String attributes by default Here s an example Lets say you have a training file named train arff a testing file named test arff and they have an identifier String attribute as their 5th
entioned here will work.

Part II

The Graphical User Interface

Chapter 2

Launching WEKA

The Weka GUI Chooser (class weka.gui.GUIChooser) provides a starting point for launching Weka's main GUI applications and supporting tools. If one prefers an MDI ("multiple document interface") appearance, then this is provided by an alternative launcher called Main (class weka.gui.Main).

The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus:

[Screenshot: the Weka GUI Chooser window, with the menus Program, Visualization, Tools and Help and the buttons Explorer, Experimenter, KnowledgeFlow and SimpleCLI. Waikato Environment for Knowledge Analysis, Version 3.5.8, (c) 1999-2008 The University of Waikato, Hamilton, New Zealand.]

The buttons can be used to start the following applications:

• Explorer - An environment for exploring data with WEKA (the rest of this documentation deals with this application in more detail).

• Experimenter - An environment for performing experiments and conducting statistical tests between learning schemes.

• KnowledgeFlow - This environment supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.

• SimpleCLI - Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own c
UserClassifier in the Classify panel, which lets you build your own classifier by interactively selecting instances.

Below the y-axis selector button is a drop-down list button for choosing a selection method. A group of data points can be selected in four ways:

1. Select Instance. Clicking on an individual data point brings up a window listing its attributes. If more than one point appears at the same location, more than one set of attributes is shown.

2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.

3. Polygon. You can build a free-form polygon that selects the points inside it. Left-click to add vertices to the polygon, right-click to complete it. The polygon will always be closed off by connecting the first point to the last.

4. Polyline. You can build a polyline that distinguishes the points on one side from those on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as opposed to a polygon, which is always closed).

Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area. Clicking on the Clear button erases the selected area without affecting the graph.

Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous remova
123. erl 5 Regular Expression E g _id selects all attributes which name ends with _id Once the desired attributes have been selected they can be removed by clicking the Remove button below the list of attributes Note that this can be undone by clicking the Undo button which is located next to the Edit button in the top right corner of the Preprocess panel 4 2 4 Working With Filters Program Applications Tools Visualization Windows Help E Explorer ji Preprocess Classity Cluster Associate Select attributes Visualize Open file Open URL Open DB Generate Undo C filters Selected attribute D AlFilter Name outlook Type Nominal D MuttiFitter Missing 0 0 Distinct 3 Unique 0 0 EI supenised 9 Ci unsupenised sunny 7 attribute F overcast O Ada rainy O AddCluster i O AddExpression D Aadio O AdgNoise D Adavalues Class play Nom D center C ChangeDateFormat D ClassAssigner O ClusterMembership C copy O Discretize O Firstorder _Label JE Count Filter jl Remove fiter Close OK The preprocess section allows filters to be defined that transform the data in various ways The Filter box is used to set up the filters that are required At the left of the Filter box is a Choose button By clicking this button it is possible to select one of the filters in WEKA Once a filter has been selected it
es datasets for learning curves, i.e., creating a 75% training set and 25% test set from a given dataset, then successively reducing the test set by factor 1.2 (83%), until it is also 25% in size. All this is repeated thirty times, with different random reorderings (-S) and the results are written to different directories. The Experimenter GUI in WEKA can be used to design and run similar experiments.

#!/bin/csh
foreach f ($*)
  set run = 1
  while ( $run <= 30 )
    mkdir $run >& /dev/null
    java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -c last -i ../$f -o $run/t_$f
    java weka.filters.supervised.instance.StratifiedRemoveFolds -N 4 -F 1 -S $run -V -c last -i ../$f -o $run/t0$f
    foreach nr (0 1 2 3 4 5)
      set nrp1 = $nr
      @ nrp1++
      java weka.filters.supervised.instance.Resample -S 0 -Z 83 -c last -i $run/t$nr$f -o $run/t$nrp1$f
    end
    echo Run $run of $f done.
    @ run++
  end
end

If meta classifiers are used, i.e., classifiers whose options include classifier specifications (for example, StackingC or ClassificationViaRegression), care must be taken not to mix the parameters. E.g.:

java weka.classifiers.meta.ClassificationViaRegression
	-W weka.classifiers.functions.LinearRegression -S 1
	-t data/iris.arff -x 2

gives us an illegal options exception for -S 1. This parameter is meant for LinearRegression, not for ClassificationViaRegression, but WEKA does not know this b
es network class. The discretization algorithm chooses its values based on the information in the data set. However, these values are not stored anywhere. So, reading an arff file with continuous variables using the File/Open menu allows one to specify a network, then learn the CPTs from it, since the discretization bounds are still known. However, opening an arff file, specifying a structure, then closing the application, reopening and trying to learn the network from another file containing continuous variables may not give the desired result, since the discretization algorithm is re-applied and new boundaries may have been found. Unexpected behavior may be the result.

Learning from a dataset that contains more attributes than there are nodes in the network is ok. The extra attributes are just ignored.

Learning from a dataset with differently ordered attributes is ok. Attributes are matched to nodes based on name. However, attribute values are matched with node values based on the order of the values. The attributes in the dataset should have the same number of values as the corresponding nodes in the network (see above for continuous variables).

8.10 Bayesian nets in the experimenter

Bayesian networks generate extra measures that can be examined in the experimenter. The experimenter can then be used to calculate mean and variance for those measures. The following metrics are generated:
[Screenshot: the Setup tab in advanced mode, showing the results destination (InstancesResultListener, Experiment1.arff), the result generator (RandomSplitResultProducer, -P 66.0, ClassifierSplitEvaluator), runs, distributed experiment settings and iteration control, with data/iris.arff listed under Datasets.]

Saving the Experiment Definition

The experiment definition can be saved at any time. Select Save... at the top of the Setup tab. Type the dataset name with the extension .exp (or select the dataset name if the experiment definition dataset already exists) for binary files, or choose Experiment configuration files (*.xml) from the file types combobox (the XML files are robust with respect to version changes).

[Screenshot: the Save dialog with File Name "Experiment1.exp" and file type "Experiment configuration files (*.exp)".]

The experiment can be restored by selecting Open in the Setup tab and then selecting Experiment1.exp in the dialog window.

5.2.2.2 Running an
[Screenshot: the Datasets panel with buttons Add new..., Edit selected..., Delete selected, Up, Down, a Use relative paths checkbox, and data/iris.arff listed.]

Now when the experiment is run, results are generated for both schemes. To add additional schemes, repeat this process. To remove a scheme, select the scheme by clicking on it and then click Delete.

Adding Additional Datasets

The scheme(s) may be run on any number of datasets at a time. Additional datasets are added by clicking Add new... in the Datasets panel. Datasets are deleted from the experiment by selecting the dataset and then clicking Delete Selected.

Raw Output

The raw output generated by a scheme during an experiment can be saved to a file and then examined at a later time. Open the ResultProducer window by clicking on the Result generator panel in the Setup tab.

[Screenshot: the GenericObjectEditor for weka.experiment.RandomSplitResultProducer ("Performs a random train and test using a supplied evaluator"), with properties outputFile (splitEvalutorOut.zip), randomizeData (True), rawOutput, splitEvaluator (ClassifierSplitEvaluator) and trainPercent (66.0).]

Click on rawOutput and select the True entry from the drop-down list. By default, the output is sent to the zip file splitEvaluatorOut.zip. The output file can b
eve better performance on test data by increasing the margins on the training data.

• Visualize threshold curve. Generates a plot illustrating the trade-offs in prediction that are obtained by varying the threshold value between classes. For example, with the default threshold value of 0.5, the predicted probability of 'positive' must be greater than 0.5 for the instance to be predicted as 'positive'. The plot can be used to visualize the precision/recall trade-off, for ROC curve analysis (true positive rate vs false positive rate), and for other types of curves.

• Visualize cost curve. Generates a plot that gives an explicit representation of the expected cost, as described by [4].

• Plugins. This menu item only appears if there are visualization plugins available (by default: none). More about these plugins can be found in the WekaWiki article "Explorer visualization plugins".

Options are greyed out if they do not apply to the specific set of results.

4.4 Clustering

[Screenshot: the Cluster panel of the Explorer with EM (-I 100 -N -1 -M 1.0E-6 -S 100) selected as clusterer, the cluster mode options, and the Clusterer output area showing discrete estimator counts for the weather data.]
ext, place a StripChart component from the Visualization panel on the layout and connect IncrementalClassifierEvaluator to it using a chart connection. Display the StripChart's chart by right-clicking over it and choosing Show chart from the pop-up menu. Note: the StripChart can be configured with options that control how often data points and labels are displayed.

Finally, start the flow by right-clicking over the ArffLoader and selecting Start loading from the pop-up menu.

[Screenshot: the Strip Chart window plotting performance statistics over the incoming instance stream.]

Note that in this example, a prediction is obtained from naive Bayes for each incoming instance before the classifier is trained (updated) with the instance. If you have a pre-trained classifier, you can specify that the classifier not be updated on incoming instances by unselecting the check box in the configuration dialog for the classifier. If the pre-trained classifier is a batch classifier, i.e. it is not capable of incremental training, then you will only be able to test it in an incremental fashion.

[Screenshot: the configuration dialog for the naive Bayes classifier (debug, displayModelInOldFormat, useKernelEstimator, useSupervisedDiscretization), with the "Update classifier on incoming instance stream" check box.]

6.5 Plugin Facility

The KnowledgeFlow offers the ability to easily add new components via a plugin mechanism. Plugins are installe
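The predict-then-update ordering described above (each instance is first tested on, then used for training) can be sketched as follows. The incremental "classifier" here is a deliberately trivial majority-class counter, not a WEKA API — the point is only the interleaving:

```python
# Sketch of interleaved test-then-train evaluation over a stream:
# each instance is first used for testing, then for updating the model.
from collections import Counter

def prequential_accuracy(labels):
    counts = Counter()        # stands in for an updateable classifier
    correct = 0
    for y in labels:
        if counts:                                 # predict before updating
            prediction = counts.most_common(1)[0][0]
            correct += (prediction == y)
        counts[y] += 1                             # now train on the instance
    return correct / max(len(labels) - 1, 1)       # no prediction for the first

print(prequential_accuracy(["yes", "yes", "no", "yes"]))
```
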
g Associations

Once appropriate parameters for the association rule learner have been set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.

4.6 Selecting Attributes

[Screenshot: the Select attributes panel with CfsSubsetEval as attribute evaluator, BestFirst (-D 1 -N 5) as search method, attribute selection mode options, and the attribute selection output: best-first forward search, 11 subsets evaluated, merit of best subset 0.247, selected attributes 1,3 (outlook, humidity).]

4.6.1 Searching and Evaluating

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evalu
g XML (eXtensible Markup Language) in several places.

16.6.1 Command Line

WEKA now allows Classifiers and Experiments to be started using an -xml option followed by a filename, to retrieve the command line options from the XML file instead of the command line. For such simple classifiers like, e.g., J48 this looks like overkill, but as soon as one uses Meta-Classifiers or Meta-Meta-Classifiers the handling gets tricky and one spends a lot of time looking for missing quotes. With the hierarchical structure of XML files it is simple to plug in other classifiers by just exchanging tags. The DTD for the XML options is quite simple:

   <!DOCTYPE options
   [
      <!ELEMENT options (option)*>
      <!ATTLIST options type CDATA "classifier">
      <!ATTLIST options value CDATA "">
      <!ELEMENT option (#PCDATA | options)*>
      <!ATTLIST option name CDATA #REQUIRED>
      <!ATTLIST option type (flag | single | hyphens | quotes) "single">
   ]
   >

The type attribute of the option tag needs some explanations. There are currently four different types of options in WEKA:

• flag
The simplest option that takes no arguments, like, e.g., the -V flag for inverting a selection:

   <option name="V" type="flag"/>

• single
The option takes exactly one parameter, directly following after the option, e.g., for specifying the training file with -t somefile.arff. Here the parameter value is just put between the opening and closing tag. Si
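To illustrate how the flag and single types from the DTD combine in practice, a hypothetical options file (not taken from the manual; the J48 options -C and -M are standard, the file layout follows the DTD above) might look like this:

```xml
<?xml version="1.0"?>
<!-- Hypothetical sketch: options for J48 with confidence factor 0.25
     and a minimum of 2 instances per leaf, i.e. "-C 0.25 -M 2". -->
<options type="classifier" value="weka.classifiers.trees.J48">
   <option name="C" type="single">0.25</option>
   <option name="M" type="single">2</option>
</options>
```

Exchanging the value attribute and the nested option tags is all that is needed to switch to another classifier.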
g and nominal attributes are case sensitive, and any that contain space or the comment-delimiter character must be quoted. (The code suggests that double quotes are acceptable and that a backslash will escape individual characters.) An example follows:

   @relation LCCvsLCSH
   @attribute LCC string
   @attribute LCSH string
   @data
   AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
   AS262, 'Science -- Soviet Union -- History.'
   AE5, 'Encyclopedias and dictionaries.'
   AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
   AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Tables.'

Dates must be specified in the data section using the string representation specified in the attribute declaration. For example:

   @RELATION Timestamps
   @ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
   @DATA
   "2001-04-03 12:12:12"
   "2001-05-03 12:59:55"

Relational data must be enclosed within double quotes. For example, an instance of the MUSK1 dataset ("..." denotes an omission):

   MUSK-188,"42,...,30",1

9.3 Sparse ARFF files

Sparse ARFF files are very similar to ARFF files, but data with value 0 are not explicitly represented. Sparse ARFF files have the same header (i.e. @relation and @attribute tags), but the data section is different. Instead of representing each value in order, like this:

   @data
   0, X, 0, Y, "class A"
   0, 0, W, 0, "class B"

the non-zero attributes are explicitly identified by a
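In the sparse data section each entry is an attribute index followed by its value, e.g. {1 X, 3 Y, 4 "class A"}, with unmentioned indices defaulting to 0. A rough sketch (not WEKA code, and deliberately ignoring commas inside quoted values) of expanding such a line back to a dense row:

```python
# Sketch: expand one sparse ARFF data line, e.g. {1 X, 3 Y, 4 "class A"},
# into a dense list of attribute values; missing indices default to "0".
def sparse_to_dense(line, num_attributes):
    values = ["0"] * num_attributes
    for entry in line.strip().lstrip("{").rstrip("}").split(","):
        index, value = entry.strip().split(" ", 1)  # "<index> <value>"
        values[int(index)] = value.strip()
    return values

print(sparse_to_dense('{1 X, 3 Y, 4 "class A"}', 5))
```
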
g lots of examples and HOWTOs around the development and use of WEKA.

• Weka on Sourceforge
WEKA's project homepage on Sourceforge.net.

• SystemInfo
Lists some internals about the Java/WEKA environment, e.g., the CLASSPATH.

To make it easy for the user to add new functionality to the menu without having to modify the code of WEKA itself, the GUI now offers a plugin mechanism for such add-ons. Due to the inherent dynamic class discovery, plugins only need to implement the weka.gui.MainMenuExtension interface, and WEKA must be notified of the package they reside in, to be displayed in the menu under Extensions (this extra menu appears automatically as soon as extensions are discovered). More details can be found in the Wiki article "Extensions for Weka's main GUI" [5].

If you launch WEKA from a terminal window, some text begins scrolling in the terminal. Ignore this text unless something goes wrong, in which case it can help in tracking down the cause (the LogWindow from the Program menu displays that information as well).

This User Manual focuses on using the Explorer, but does not explain the individual data preprocessing tools and learning algorithms in WEKA. For more information on the various filters and learning methods in WEKA, see the book Data Mining [1].

Chapter 3

Simple CLI

The Simple CLI provides full access to all Weka classes, i.e., classifiers, filters, clusterers, etc.
gression
Multi-response linear regression.

• functions.Logistic
Logistic Regression.

• functions.SMO
Support Vector Machine (linear, polynomial and RBF kernel) with Sequential Minimal Optimization Algorithm due to [3]. Defaults to SVM with linear kernel; -E 5 -C 10 gives an SVM with polynomial kernel of degree 5 and lambda of 10.

• lazy.KStar
Instance-Based learner. -E sets the blend entropy automatically, which is usually preferable.

• lazy.IBk
Instance-Based learner with fixed neighborhood. -K sets the number of neighbors to use. IB1 is equivalent to IBk -K 1.

• rules.JRip
A clone of the RIPPER rule learner.

Based on a simple example, we will now explain the output of a typical classifier, weka.classifiers.trees.J48. Consider the following call from the command line, or start the WEKA explorer and train J48 on weather.arff:

   java weka.classifiers.trees.J48 -t data/weather.arff -i

   J48 pruned tree
   ------------------

   outlook = sunny
   |   humidity <= 75: yes (2.0)
   |   humidity > 75: no (3.0)
   outlook = overcast: yes (4.0)
   outlook = rainy
   |   windy = TRUE: no (2.0)
   |   windy = FALSE: yes (3.0)

   Number of Leaves  :     5

   Size of the tree :      8

   Time taken to build model: 0.05 seconds
   Time taken to test model on training data: 0 seconds

The first part, unless you specify -o, is a human-readable form of the training set model. In this case, it is a decision tree. outlook is at the root of the tree and dete
[Summary table: (No. of datasets where [col] >> [row]), sorted (asc.) by <default>; row a = (1) rules.ZeroR, row b = (2) rules.OneR -B 6, row c = (3) trees.J48 -C 0.25 -M 2.]

In this experiment, the first row (- 1 1) indicates that column b (OneR) is better than row a (ZeroR), and that column c (J48) is also better than row a. The number in brackets represents the number of significant wins for the column with regard to the row. A 0 means that the scheme in the corresponding column did not score a single (significant) win with regard to the scheme in the row.

5.4.6 Ranking Test

Selecting Ranking from Test base causes the following information to be generated:

[Screenshot: the Analyse tab showing the ranking test output of the PairedCorrectedTTester — Analysing: Percent_correct, Datasets: 1, Resultsets: 3, Confidence: 0.05 (two tailed), Date: 21/12/05 16:42.]
he term "statistical significance" used in the previous section refers to the result of a pair-wise comparison of schemes using either a standard T-Test or the corrected resampled T-Test [8]. The latter test is the default, because the standard T-Test can generate too many significant differences due to dependencies in the estimates (in particular when anything other than one run of an x-fold cross-validation is used). For more information on the T-Test, consult the Weka book [1] or an introductory statistics text. As the significance level is decreased, the confidence in the conclusion increases.

In the current experiment, there is not a statistically significant difference between the OneR and J48 schemes.

5.4.5 Summary Test

Selecting Summary from Test base and performing a test causes the following information to be generated:

[Screenshot: the Analyse tab showing the summary test output of the PairedCorrectedTTester — Analysing: Percent_correct, Datasets: 1, Resultsets: 3, Confidence: 0.05 (two tailed), with the table header "a b c (No. of datasets where [col] >> [row])".]
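The corrected resampled T-Test of Nadeau and Bengio replaces the usual 1/k variance factor with (1/k + n2/n1), where n1 and n2 are the train and test set sizes of each resampled split, to compensate for the overlap between resampled training sets. A minimal stdlib sketch of the statistic only (WEKA's PairedCorrectedTTester also handles the result-table bookkeeping); the input differences below are hypothetical:

```python
# Sketch of the corrected resampled t-statistic:
# t = mean(d) / sqrt((1/k + n2/n1) * var(d)), where d holds the k
# per-run performance differences between two schemes.
import math

def corrected_resampled_t(diffs, n_train, n_test):
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    return mean / math.sqrt((1.0 / k + n_test / n_train) * var)

# Hypothetical accuracy differences over 10 runs of a 66%/34% split of 150 instances:
diffs = [1.2, 0.8, 1.5, 0.9, 1.1, 1.3, 0.7, 1.0, 1.4, 0.6]
print(corrected_resampled_t(diffs, n_train=99, n_test=51))
```
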
heme_version_ID {48055541465867954,217733168393644444}
   @attribute Date_time numeric
   @attribute Number_of_training_instances numeric
   @attribute Number_of_testing_instances numeric
   @attribute Number_correct numeric
   @attribute Number_incorrect numeric
   @attribute Number_unclassified numeric
   @attribute Percent_correct numeric
   @attribute Percent_incorrect numeric
   @attribute Percent_unclassified numeric
   @attribute Kappa_statistic numeric
   @attribute Mean_absolute_error numeric
   @attribute Root_mean_squared_error numeric
   @attribute Relative_absolute_error numeric
   @attribute Root_relative_squared_error numeric
   @attribute SF_prior_entropy numeric
   @attribute SF_scheme_entropy numeric
   @attribute SF_entropy_gain numeric
   @attribute SF_mean_prior_entropy numeric
   @attribute SF_mean_scheme_entropy numeric
   @attribute SF_mean_entropy_gain numeric
   @attribute KB_information numeric
   @attribute KB_mean_information numeric
   @attribute KB_relative_information numeric
   @attribute True_positive_rate numeric
   @attribute Num_true_positives numeric
   @attribute False_positive_rate numeric
   @attribute Num_false_positives numeric
   @attribute True_negative_rate numeric
   @attribute Num_true_negatives numeric
   @attribute False_negative_rate numeric
   @attribute Num_false_negatives numeric
   @attribute IR_precision numeric
   @attribute IR_recall numeric
   @attribute F_measure numeric
   @attribute Area_under_ROC numeric
   @attribute Time_training numeric
   @attribute Time_testing numeric
   @attribute Summary {'Number of leaves: 3\nSize of the tree: 5\n','Number of leaves: 5\nSize of the tree: 9\n'}
   @data
hen output from the splitEvaluator for individual train-test splits is saved. If the destination is a directory, then each output is saved to an individual gzip file; if the destination is a file, then each output is saved as an entry in a zip file.

[Property help for weka.experiment.RandomSplitResultProducer:]

randomizeData -- Do not randomize dataset and do not perform probabilistic rounding if true.

rawOutput -- Save raw output (useful for debugging). If set, then output is sent to the destination specified by outputFile.

splitEvaluator -- The evaluator to apply to the test data. This may be a classifier, regression scheme, etc.

trainPercent -- Set the percentage of data to use for training.

outputFile -- Set the destination for saving raw output. If the rawOutput

Click on the splitEvaluator entry to display the SplitEvaluator properties.

[Screenshot: the GenericObjectEditor for weka.experiment.ClassifierSplitEvaluator ("A SplitEvaluator that produces results for a classification scheme on a nominal class attribute"), with properties attributeID (-1), classForIRStatistics (0), classifier (ZeroR) and predTargetColumn (False).]

Click on the classifier entry (ZeroR) to display the scheme properties.

[Screenshot: the GenericObjectEditor for weka.classifiers.rules.ZeroR ("Class for building and using a 0-R classifier"), with a debug property (False).]
his:

   svn co https://svn.scms.waikato.ac.nz/svn/weka/trunk/weka

SmartSVN

SmartSVN (http://smartsvn.com/) is a Java-based, graphical, cross-platform client for Subversion. Though it is not open-source/free software, the foundation version is for free.

TortoiseSVN

Under Windows, TortoiseCVS was a CVS client neatly integrated into the Windows Explorer. TortoiseSVN (http://tortoisesvn.tigris.org/) is the equivalent for Subversion.

16.4 GenericObjectEditor

16.4.1 Introduction

As of version 3.4.4 it is possible for Weka to dynamically discover classes at runtime (rather than using only those specified in the GenericObjectEditor.props (GOE) file). In version 3.5.8 and higher, this facility is not enabled by default, as it is slower than the props file approach and, furthermore, does not function in environments that do not have a CLASSPATH (e.g., application servers). If you wish to use dynamic class discovery, the relevant file to edit is GenericPropertiesCreator.props (GPC), located in weka/gui. All that is required is to change the UseDynamic property in this file from false to true.

If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH, you can generate a new GenericObjectEditor.props file and then turn dynamic class discovery off again. Just follow these steps:

• generate a new GenericObjectEditor.props file, based on your current setup (assuming the weka classes are i
hoose button to display the GenericObjectEditor window.

[Screenshot: the Setup tab in advanced mode, with the GenericObjectEditor tree of classifiers expanded — weka.classifiers.bayes, functions, lazy, meta, mi, misc, trees (ADTree, BFTree, DecisionStump, J48, M5P, NBTree, RandomForest, RandomTree, REPTree, SimpleCart, ...) — and the Filter..., Remove filter and Close buttons.]

The Filter... button enables one to highlight classifiers that can handle certain attribute and class types. With the Remove filter button all the selected capabilities will get cleare
how well it predicts the class of a set of instances loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to test on.

3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.

4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the data which is held out for testing. The amount of data held out depends on the value entered in the % field.

Note: No matter which evaluation method is used, the model that is output is always the one built from all the training data.

Further testing options can be set by clicking on the More options... button:

1. Output model. The classification model on the full training set is output so that it can be viewed, visualized, etc. This option is selected by default.

2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option is also selected by default.

3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This option is not selected by default.

4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output. This option is selected by default.

5. Store predictions for visualization. The classifier's predictions are remembered so that they can be visualized. This option is sel
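The cross-validation option above partitions the data into k folds, training on k−1 of them and testing on the held-out fold. A rough sketch of just the index bookkeeping (WEKA additionally randomizes and, for nominal classes, stratifies the folds):

```python
# Sketch: split n instance indices into k cross-validation folds;
# every instance lands in exactly one test fold.
def cv_folds(n, k):
    indices = list(range(n))
    return [indices[i::k] for i in range(k)]  # round-robin assignment

folds = cv_folds(10, 3)
for test_fold in folds:
    train = [i for i in range(10) if i not in test_fold]  # the other k-1 folds
print(folds)
```
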
ic, date, string, date

CHAR and VARCHAR are both String types, hence they are interpreted as String (identifier 0).

Note: in case database types have blanks, one needs to replace those blanks with an underscore, e.g., DOUBLE PRECISION must be listed like this: DOUBLE_PRECISION=2

13.4 Stored Procedures

Let's say you're tired of typing the same query over and over again. A good way to shorten that is to create a stored procedure.

PostgreSQL 7.4.x

The following example creates a procedure called employee_name that returns the names of all the employees in table employee. Even though it doesn't make much sense to create a stored procedure for this query, nonetheless it shows how to create and call stored procedures in PostgreSQL.

• Create:

   CREATE OR REPLACE FUNCTION public.employee_name()
     RETURNS SETOF text AS 'select name from employee'
     LANGUAGE 'sql' VOLATILE;

• SQL statement to call procedure:

   SELECT * FROM employee_name()

• Retrieve data via InstanceQuery:

   java weka.experiment.InstanceQuery \
       -Q "SELECT * FROM employee_name()" \
       -U <user> -P <password>

13.5 Troubleshooting

• In case you're experiencing problems connecting to your database, check out the WEKA Mailing List (see the Weka homepage for more information). It is possible that somebody else encountered the same problem as you, and you'll find a post containing the solution to your problem.

• Specific (MS SQL Server
ignments and Visualize tree. The latter is grayed out when it is not applicable.

4.5 Associating

[Screenshot: the Associate panel running Apriori (-N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1) on the weather data, with the sizes of the large itemsets L(1)-L(4) and the best rules found in the Associator output, e.g. outlook=overcast 4 ==> play=yes 4 (conf: 1).]

4.5.1 Setting Up

This panel contains schemes for learning association rules, and the learners are chosen and configured in the same way as the clusterers, filters, and classifiers in the other panels.

4.5.2 Learnin
ilities, that is,

   P(x_i = k | pa(x_i) = j) = (N_ijk + N'_ijk) / (N_ij + N'_ij)

where N'_ijk is the alpha parameter that can be set and is 0.5 by default. With alpha = 0, we get maximum likelihood estimates.

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.estimate.SimpleEstimator ("SimpleEstimator is used for estimating the conditional probability tables of a Bayes network once the structure has been learned"), with an alpha property (0.5).]

With the BMAEstimator, we get estimates for the conditional probability tables based on Bayes model averaging of all network structures that are sub-structures of the network structure learned [14]. This is achieved by estimating the conditional probability table of a node x given its parents pa(x) as a weighted average of all conditional probability tables of x given subsets of pa(x). The weight of a distribution P(x|S) with S ⊆ pa(x) used is proportional to the contribution of network structure ∀(y∈S) y → x to either the BDe metric or K2 metric, depending on the setting of the useK2Prior option (false and true respectively).

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.estimate.BMAEstimator ("BMAEstimator estimates conditional probability tables of a Bayes network using Bayes Model Averaging (BMA)"), with properties alpha (0.5) and useK2Prior (False).]
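Under the convention that N'_ijk equals alpha for every value k (so N'_ij is alpha times the number of values the node can take), the smoothed estimate can be sketched as follows; the counts below are hypothetical, not from any dataset in the manual:

```python
# Sketch of the SimpleEstimator-style smoothed estimate:
# P(x = k | pa = j) = (N_ijk + alpha) / (N_ij + alpha * num_values)
def estimate(counts, k, alpha=0.5):
    # counts: N_ijk for each value of x under one parent configuration j
    total = sum(counts)                      # N_ij
    return (counts[k] + alpha) / (total + alpha * len(counts))

counts = [3, 1]              # e.g. 3 of 4 instances have x = 0 for this config
print(estimate(counts, 0))   # smoothed, pulled slightly toward 0.5
print(estimate(counts, 0, alpha=0))  # maximum likelihood: 3/4
```
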
ime format "yyyy-MM-dd'T'HH:mm:ss".

Dates must be specified in the data section as the corresponding string representations of the date/time (see example below).

Relational attributes

Relational attribute declarations take the form:

   @attribute <name> relational
     <further attribute definitions>
   @end <name>

For the multi-instance dataset MUSK1, the definition would look like this ("..." denotes an omission):

   @attribute molecule_name {MUSK-jf78,...,NON-MUSK-199}
   @attribute bag relational
     @attribute f1 numeric
     ...
     @attribute f166 numeric
   @end bag
   @attribute class {0,1}

9.2.2 The ARFF Data Section

The ARFF Data section of the file contains the data declaration line and the actual instance lines.

The @data Declaration

The @data declaration is a single line denoting the start of the data segment in the file. The format is:

   @data

The instance data

Each instance is represented on a single line, with carriage returns denoting the end of the instance. A percent sign introduces a comment, which continues to the end of the line.

Attribute values for each instance are delimited by commas. They must appear in the order that they were declared in the header section (i.e. the data corresponding to the nth @attribute declaration is always the nth field of the attribute).

Missing values are represented by a single question mark, as in:

   @data
   4.4,?,1.5,?,Iris-setosa

Values of strin
iment that looks a bit like this:

   01:13:19: RemoteExperiment (//blabla.company.com/RemoteEngine)
   (sub)experiment (datataset vineyard.arff) failed :
   java.sql.SQLException: Table already exists: EXPERIMENT_INDEX
   in statement [CREATE TABLE Experiment_index ( Experiment_type
   LONGVARCHAR, Experiment_setup LONGVARCHAR, Result_table INT )]
   01:13:19: dataset :vineyard.arff RemoteExperiment
   (//blabla.company.com/RemoteEngine) (sub)experiment
   (datataset vineyard.arff) failed : java.sql.SQLException:
   Table already exists: EXPERIMENT_INDEX in statement
   [CREATE TABLE Experiment_index ( Experiment_type LONGVARCHAR,
   Experiment_setup LONGVARCHAR, Result_table INT )].
   Scheduling for execution on another host.

then do not panic. This happens because multiple remote machines are trying to create the same table and are temporarily locked out; this will resolve itself, so just leave your experiment running (in fact, it is a sign that the experiment is working).

• If you serialized an experiment and then modify your DatabaseUtils.props file due to an error (e.g., a missing type-mapping), the Experimenter will use the DatabaseUtils.props you had at the time you serialized the experiment. Keep in mind that the serialization process also serializes the DatabaseUtils class and therefore stored your props file. This is another reason for storing your experiments as XML and not in the proprietary binary format the Java serialization produces.
in order to determine the score.

• fixed structure
Finally, there are a few methods so that a structure can be fixed, for example, by reading it from an XML BIF file.²

For each of these areas, different search algorithms are implemented in Weka, such as hill climbing, simulated annealing and tabu search.

Once a good network structure is identified, the conditional probability tables for each of the variables can be estimated.

You can select a Bayes net classifier by clicking the classifier Choose button in the Weka explorer, experimenter or knowledge flow and find BayesNet under the weka.classifiers.bayes package (see below).

²See http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/ for details on XML BIF.

[Screenshot: the classifier tree with BayesNet selected under weka.classifiers.bayes, alongside AODE, ComplementNaiveBayes, HNB, NaiveBayes, NaiveBayesMultinomial, NaiveBayesSimple, NaiveBayesUpdateable and WAODE.]

The Bayes net classifier has the following options:

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.BayesNet ("Bayes Network learning using various search algorithms and quality measures"), with properties BIFFile (iris.xml), debug (False), estimator (SimpleEstimator -A 0.5), searchAlgorithm (TabuSearch -L 5 -U 10 -P 2 -S BAYES) and useA
in the literature. To test whether variables x and y are conditionally independent given a set of variables Z, a network structure with arrows ∀(z∈Z) z → y is compared with one with arrows {x → y} ∪ {∀(z∈Z) z → y}. A test is performed by using any of the score metrics described in Section 2.1.

[Screenshot: the package tree weka.classifiers.bayes.net.search with sub-packages local (CISearchAlgorithm, ICSSearchAlgorithm), global and fixed.]

At the moment, only the ICS [24] and CI algorithm are implemented.

The ICS algorithm makes two steps: first find a skeleton (the undirected graph with edges iff there is an arrow in the network structure), and second direct all the edges in the skeleton to get a DAG.

Starting with a complete undirected graph, we try to find conditional independencies ⟨x, y | Z⟩ in the data. For each pair of nodes x, y, we consider sets Z starting with cardinality 0, then 1, up to a user-defined maximum. Furthermore, the set Z is a subset of nodes that are neighbors of both x and y. If an independency is identified, the edge between x and y is removed from the skeleton.

The first step in directing arrows is to check, for every configuration x − z − y where x and y are not connected in the skeleton, whether z is in the set Z of variables that justified removing the link between x and y (cached in the first step). If z is not in Z, we can assign direction x → z ← y. Fina
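That orientation rule can be sketched on toy data structures (hypothetical representation, not WEKA's implementation): given the skeleton and the cached separating sets, every path x − z − y with x, y non-adjacent becomes a v-structure x → z ← y whenever z is absent from the set that separated x and y:

```python
# Sketch: orient v-structures from a skeleton and cached separating sets.
def v_structures(edges, sepsets):
    adjacent = lambda a, b: (a, b) in edges or (b, a) in edges
    nodes = {n for e in edges for n in e}
    found = []
    for z in nodes:
        neighbors = sorted(n for n in nodes if adjacent(n, z))
        for i, x in enumerate(neighbors):
            for y in neighbors[i + 1:]:
                # x - z - y with x, y non-adjacent and z not in sepset(x, y)
                if not adjacent(x, y) and z not in sepsets.get((x, y), set()):
                    found.append((x, z, y))       # orient x -> z <- y
    return found

# x - z - y where the empty set separated x and y => collider at z
print(v_structures({("x", "z"), ("z", "y")}, {("x", "y"): set()}))
```
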
instances in order to allow for processing large datasets.

• Implement traditional conditional independence tests for conditional independence based structure learning algorithms.

• Currently, all search algorithms assume that all variables are discrete. Search algorithms that can handle continuous variables would be interesting.

• A limitation of the current classes is that they assume that there are no missing values. This limitation can be undone by implementing score metrics that can handle missing values. The classes used for estimating the conditional probabilities need to be updated as well.

• Only leave-one-out, k-fold and cumulative cross-validation are implemented. These implementations can be made more efficient, and other cross-validation methods can be implemented, such as Monte Carlo cross-validation and bootstrap cross-validation.

• Implement methods that can handle incremental extensions of the data set, for updating network structures.

And for the more ambitious people, there are the following challenges:

• A GUI for manipulating Bayesian networks, to allow user intervention for adding and deleting arcs and updating the probability tables.

• General purpose inference algorithms built into the GUI to allow user-defined queries.

• Allow learning of other graphical models, such as chain graphs, undirected graphs and variants of causal graphs.

• Allow learning of networks with latent variables.

• Allow learning of dy
ion, clustering, attribute selection, etc.), so that it is possible to copy/paste them elsewhere. Options for dataset(s) and, if applicable, the class attribute still have to be provided by the user (e.g., -t for classifiers, or -i and -o for filters).

4.1.4 WEKA Status Icon

To the right of the status box is the WEKA status icon. When no processes are running, the bird sits down and takes a nap. The number beside the × symbol gives the number of concurrent processes running. When the system is idle it is zero, but it increases as the number of processes increases. When any process is started, the bird gets up and starts moving around. If it's standing but stops moving for a long time, it's sick: something has gone wrong. In that case you should restart the WEKA Explorer.

4.1.5 Graphical output

Most graphical displays in WEKA, e.g., the GraphVisualizer or the TreeVisualizer, support saving the output to a file. A dialog for saving the output can be brought up with Alt+Shift+left-click. Supported formats are currently Windows Bitmap, JPEG, PNG and EPS (encapsulated Postscript). The dialog also allows you to specify the dimensions of the generated image.

4.2 Preprocessing

[Screenshot: the Preprocess panel of the Explorer, with the Open file..., Open URL... and Open DB... buttons and the Filter panel.]
151. irectory.

Developer version

- up to Weka 3.5.2: just like the book version.
- Weka 3.5.3: You have to modify the link in the Windows Start menu, if you're starting the console-less Weka (only the link with console in its name executes the RunWeka.bat batch file).
- Weka 3.5.4 and higher: Due to the new launching scheme, you no longer modify the batch file, but the RunWeka.ini file. In that particular file, you'll have to change the maxheap placeholder. See section 16.2.2.

17.2.3 Mac OSX

In your Weka installation directory (weka-3-x-y.app) locate the Contents sub-directory and edit the Info.plist file. Near the bottom of the file, you should see some text like:

  <key>VMOptions</key>
  <string>-Xmx256M</string>

Alter the 256M to something higher.

17.2.4 StackOverflowError

Try increasing the stack of your virtual machine. With Sun's JDK you can use this command to increase the stacksize:

  java -Xss512k ...

to set the maximum Java stack size to 512KB. If still not sufficient, slowly increase it.

17.2.5 just-in-time (JIT) compiler

For maximum enjoyment, use a virtual machine that incorporates a just-in-time compiler. This can speed things up quite significantly. Note also that there can be large differences in execution time between different virtual machines.

17.2.6 CSV file conversion

Either load the CSV file in the Explorer, or use the CSV converter on the comma
152. ironment [Screenshot: the Experimenter's Analyse tab. The test output reads: Testing with Paired T-Tester (corrected); Tester: weka.experiment.PairedCorrectedTTester; Analysing: Percent_correct; Datasets: 1; Resultsets: 3; Confidence: 0.05 (two tailed); Date: 21/12/05 16:37. On the iris dataset the resultsets (1) rules.ZeroR, (2) rules.OneR and (3) trees.J48 score 33.33(0.00), 94.31(2.52) v and 94.90(2.95) v, respectively.]

Selecting Number_correct as the comparison field and clicking Perform test generates the average number correct (out of 50 test patterns, i.e. 33% of the 150 patterns in the Iris dataset).

[Screenshot: the Analyse tab after performing the test with Number_correct as the comparison field.]
153. itEvaluator -W weka.classifiers... [Screenshot: the Setup tab (Advanced mode) with Runs from 1 to 10, the Generator properties panel set to Disabled, iteration control Data sets first, and data/iris.arff as dataset.]

Adding Additional Schemes

Additional schemes can be added in the Generator properties panel. To begin, change the drop-down list entry from Disabled to Enabled in the Generator properties panel.

[Screenshot: the Setup tab with Result generator weka.experiment.RandomSplitResultProducer (train percentage 66.0, split evaluator weka.experiment.ClassifierSplitEvaluator), and the Generator properties panel with its Select property button.]
154. ix pathnames need to be changed for Windows:

- The home directory is located at /home/johndoe.
- Weka is found in /home/johndoe/weka.
- Additional jar archives, i.e. JDBC drivers, are stored in /home/johndoe/jars.
- The directory for the datasets is /home/johndoe/datasets.

Note: The example policy file remote.policy.example is using this setup (available in weka/experiment).

5.3.2 Database Server Setup

- HSQLDB
  Download the JDBC driver for HSQLDB, extract the hsqldb.jar and place it in the directory /home/johndoe/jars.
  To set up the database server, choose or create a directory to run the database server from, and start the server with:

    java -classpath /home/johndoe/jars/hsqldb.jar \
      org.hsqldb.Server \
      -database.0 experiment -dbname.0 experiment

  Note: This will start up a database with the alias experiment (-dbname.0 <alias>) and create a properties and a log file at the current location, prefixed with experiment (-database.0 <file>).

  1) Weka's source code can be found in the weka-src.jar archive or obtained from Subversion [10].

- MySQL
  We won't go into the details of setting up a MySQL server, but this is rather straightforward and includes the following steps:
  - Download a suitable version of MySQL for your server machine.
  - Install and start the MySQL server.
  - Create a database; for our example we will use experiment as database name.
  - Download the appr
155. ki: http://weka.wiki.sourceforge.net/ Plotting multiple ROC curves.

[14] R. R. Bouckaert. Bayesian Belief Networks: from Construction to Inference. Ph.D. thesis, University of Utrecht, 1995.

[15] W. L. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Transactions on Knowledge and Data Engineering, 8:195-210, 1996.

[16] J. Cheng, R. Greiner. Comparing Bayesian network classifiers. Proceedings UAI, 101-107, 1999.

[17] C. K. Chow, C. N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, IT-14:426-467, 1968.

[18] G. Cooper, E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.

[19] Cozman. See http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/ for details on XML BIF.

[20] N. Friedman, D. Geiger, M. Goldszmidt. Bayesian Network Classifiers. Machine Learning, 29:131-163, 1997.

[21] D. Heckerman, D. Geiger, D. M. Chickering. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20(3):197-243, 1995.

[22] S. L. Lauritzen and D. J. Spiegelhalter. Local Computations with Probabilities on graphical structures and their applications to expert systems (with discussion). Journal of the Royal Statistical Society B, 50:157-224, 1988.

[23] Moore, A. and Lee, M. S. Cached Sufficient Statistics f
156. [Figure: dataset viewer showing the first rows of the heart-disease data; the columns include age, sex, chest pain type (typ_angina, atyp_angina, non_anginal, asympt), resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG, maximum heart rate, exercise induced angina, oldpeak and the class (<50 / >50).]
157. [Screenshot: the Setup tab with the Datasets panel (data/iris.arff) and the Algorithms panel (ZeroR, J48 -C 0.25 -M 2), including the Load options and Save options buttons.]

With the Load options and Save options buttons one can load and save the setup of a selected classifier from and to XML. This is especially useful for highly configured classifiers (e.g. nested meta-classifiers), where the manual setup takes quite some time, and which are used often.

One can also paste classifier settings here by right-clicking (or Alt+Shift+left-clicking) and selecting the appropriate menu point from the popup menu, to either add a new classifier or replace the selected one with a new setup. This is rather useful for transferring a classifier setup from the Weka Explorer over to the Experimenter, without having to setup the classifier from scratch.

5.2.1.7 Saving the setup

For future re-use, one can save the current setup of the experiment to a file by clicking on Save at the top of the window.

[Screenshot: the Save dialog with file type Experiment configuration files (*.exp).]

By default, the format of the experiment files is the binary format that Java serialization offers. The
158. le TAN [16, 20]. Tree Augmented Naive Bayes, where the tree is formed by calculating the maximum weight spanning tree using the Chow and Liu algorithm [17]. No specific options.

Simulated annealing [14], using adding and deleting arrows. The algorithm randomly generates a candidate network B'_S close to the current network B_S. It accepts the network if it is better than the current, i.e. Q(B'_S, D) > Q(B_S, D). Otherwise, it accepts the candidate with probability

  e^(t_i * (Q(B'_S, D) - Q(B_S, D)))

where t_i is the temperature at iteration i. The temperature starts at t_0 and decreases slowly with each iteration.

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.search.local.SimulatedAnnealing ("This Bayes Network learning algorithm uses the general purpose search method of simulated annealing to find a well scoring network structure"), with TStart 10.0, delta 0.999, markovBlanketClassifier False, runs 10000, scoreType BAYES and seed 1.]

Specific options:
- TStart: start temperature t_0.
- delta: the factor used to update the temperature, so t_{i+1} = delta * t_i.
- runs: number of iterations used to traverse the search space.
- seed: the initialization value for the random number generator.

Tabu search [14], using adding and deleting arrows. Tabu search performs hill climbing until it hits a local optimum.
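The acceptance rule and the cooling schedule can be sketched in a few lines. This is an illustration of the formula as printed above, not WEKA's SimulatedAnnealing code, and the score values used are arbitrary:

```python
import math
import random

def accept(q_candidate, q_current, temperature, rng=random.random):
    """Acceptance rule from the text: a better-scoring candidate network
    is always accepted; a worse one is accepted with probability
    e^(t_i * (Q(B'_S, D) - Q(B_S, D)))."""
    if q_candidate > q_current:
        return True
    return rng() < math.exp(temperature * (q_candidate - q_current))

def cooling(t_start=10.0, delta=0.999, runs=10000):
    """Temperature schedule matching the TStart/delta/runs options:
    t_{i+1} = delta * t_i."""
    t = t_start
    for _ in range(runs):
        yield t
        t *= delta
```

For example, at temperature 1.0 a candidate that scores one unit worse than the current network is kept with probability e^(-1), roughly 0.37.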
159. [Output excerpt: the remainder of the classifier output ("Number of leaves: 4", "Size of the tree: 7"), the attributes measureTreeSize, measureNumLeaves and measureNumRules, and the result line for weka.classifiers.rules.ZeroR on iris (33.333333% correct, 66.666667% incorrect).]

5.2.2.3 Changing the Experiment Parameters

Changing the Classifier

The parameters of an experiment can be changed by clicking on the Result generator panel. The RandomSplitResultProducer performs repeated train/test runs.

[Screenshot: the GenericObjectEditor for weka.experiment.RandomSplitResultProducer ("Performs a random train and test using a supplied evaluator"), with outputFile splitEvalutorOut.zip, randomizeData True, rawOutput False, splitEvaluator ClassifierSplitEvaluator and trainPercent 66.0.]

The number of instances (expressed as a percentage) used for training is given in the trainPercent box. The number of runs is specified in the Runs panel in the Setup tab.

A small help file can be displayed by clicking More in the About panel.

[Screenshot: the help window showing NAME, SYNOPSIS ("Performs a random train and test using a supplied evaluator") and OPTIONS for weka.experiment.RandomSplitResultProducer.]
160. [Screenshot: the Datasets and Algorithms panels with Add new, Edit selected and Delete selected buttons.]

Notes:
The advantage of ARFF or CSV files is that they can be created without any additional classes besides the ones from Weka. The drawback is the lack of the ability to resume an experiment that was interrupted, e.g. due to an error or the addition of datasets or algorithms. Especially with time-consuming experiments, this behavior can be annoying.

JDBC database

With JDBC it is easy to store the results in a database. The necessary jar archives have to be in the CLASSPATH to make the JDBC functionality of a particular database available.

After changing ARFF file to JDBC database, click on User... to specify JDBC URL and user credentials for accessing the database.

[Screenshot: the Database Connection Parameters dialog with Database URL jdbc:mysql://localhost:3306/weka_test, Username, Password and a Debug checkbox.]

After supplying the necessary data and clicking on OK, the URL in the main window will be updated. Note: at this point, the database connection is not tested; this is done when the experiment is started.

[Screenshot: the Setup tab with the JDBC database chosen as results destination.]
161. lly a set of graphical rules is applied [24] to direct the remaining arrows:

  Rule 1: i --> j - k  &  i and k not adjacent  =>  j --> k
  Rule 2: i --> j --> k  &  i - k  =>  i --> k
  Rule 3, Rule 4: [diagrams: two further rules that direct edges in configurations of four nodes i, j, k and m]
  Rule 5: if no edges are directed, then take a random one (the first we can find).

The ICS algorithm comes with the following options.

[Screenshot: the GenericObjectEditor for weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm ("This Bayes Network learning algorithm uses conditional independence tests to find a skeleton, finds V-nodes and applies a set of rules to find the directions of the remaining arrows"), with markovBlanketClassifier False, maxCardinality 2 and scoreType BAYES.]

Since the ICS algorithm is focused on recovering causal structure, instead of finding the optimal classifier, the Markov blanket correction can be made afterwards.

Specific options: The maxCardinality option determines the largest subset of Z to be considered in conditional independence tests I(x, y | Z). The scoreType option is used to select the scoring metric.

8.4 Global score metric based structure learning

[Diagram: the package hierarchy weka.classifiers.bayes.net.search with its local, ci and global sub-packages; the global algorithms include Genetic
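Conditional independence tests like I(x, y | Z) can be made concrete with a small stand-alone sketch. This is not WEKA's implementation; the stratified chi-squared statistic below is just one common way to test independence of x and y given a single conditioning variable z:

```python
from collections import Counter, defaultdict

def conditional_chi2(triples):
    """Chi-squared statistic for testing x independent of y given z,
    summed over the strata of z. `triples` is a list of (x, y, z)
    observations; a large statistic is evidence against independence."""
    strata = defaultdict(list)
    for x, y, z in triples:
        strata[z].append((x, y))
    chi2 = 0.0
    for pairs in strata.values():
        n = len(pairs)
        joint = Counter(pairs)
        xs = Counter(x for x, _ in pairs)   # marginal counts of x
        ys = Counter(y for _, y in pairs)   # marginal counts of y
        for x in xs:
            for y in ys:
                observed = joint.get((x, y), 0)
                expected = xs[x] * ys[y] / n
                chi2 += (observed - expected) ** 2 / expected
    return chi2
```

Within each stratum the statistic is the usual contingency-table chi-squared; summing over strata conditions on z.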
162. lopment distribution:

  java -classpath weka.jar weka.gui.experiment.Experimenter

Note: Windows users have to replace the ':' with ';'.

16.4.5 Multiple Class Hierarchies

In case you're developing your own framework, but still want to use your classifiers within WEKA, that wasn't possible so far. With the release 3.4.4 it is possible to have multiple class hierarchies being displayed in the GUI. If you've developed a modified version of NaiveBayes, let's call it DummyBayes, and it's located in the package dummy.classifiers, then you'll have to add this package to the classifiers list in the GPC file like this:

  weka.classifiers.Classifier=\
   weka.classifiers.bayes,\
   weka.classifiers.functions,\
   weka.classifiers.lazy,\
   weka.classifiers.meta,\
   weka.classifiers.trees,\
   weka.classifiers.rules,\
   dummy.classifiers

Your java call for the Experimenter might look like this:

  java -classpath weka.jar:dummy.jar weka.gui.experiment.Experimenter

Starting up the GUI, you'll now have another root node in the tree view of the classifiers, called root, and below it the weka and the dummy package hierarchy, as you can see here:

[Screenshot: the class tree with root > weka > classifiers (bayes, functions, lazy, meta, mi, misc, trees, rules) and root > dummy > classifiers > DummyBayes, plus the Filter, Remove filter and Close buttons.]

16.4.6 Capabilities

Version 3.5.3 of Weka int
163. ls and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.

Chapter 5 Experimenter

5.1 Introduction

The Weka Experiment Environment enables the user to create, run, modify and analyse experiments in a more convenient manner than is possible when processing the schemes individually. For example, the user can create an experiment that runs several schemes against a series of datasets and then analyse the results to determine if one of the schemes is (statistically) better than the other schemes.

The Experiment Environment can be run from the command line using the Simple CLI. For example, the following commands could be typed into the CLI to run the OneR scheme on the Iris dataset using a basic train and test process. (Note that the commands would be typed on one line into the CLI.)

  java weka.experiment.Experiment -r -T data/iris.arff
    -D weka.experiment.InstancesResultListener
    -P weka.experiment.RandomSplitResultProducer
    -W weka.experiment.ClassifierSplitEvaluator
    -W weka.classifiers.rules.OneR

While commands can be typed directly into the CLI, this technique is not particularly convenient and the experiments are not easy to modify. The Experimenter comes in two flavours: either with a simple interface that provides most of the functionality one needs for experiments, or with an interface wi
164. ls.props. The default file can be found here:

  weka/experiment/DatabaseUtils.props

- Loader
  You have to specify at least a SQL query with the -Q option (there are additional options for incremental loading):

    java weka.core.converters.DatabaseLoader -Q "select * from employee"

- Saver
  The Saver takes an ARFF file as input, like any other Saver, but then also the table where to save the data to, via -T:

    java weka.core.converters.DatabaseSaver -i iris.arff -T iris

Chapter 12 Stemmers

12.1 Introduction

Weka now supports stemming algorithms. The stemming algorithms are located in the following package:

  weka.core.stemmers

Currently, the Lovins stemmer (+ iterated version) and support for the Snowball stemmers are included.

12.2 Snowball stemmers

Weka contains a wrapper class for the Snowball (homepage: http://snowball.tartarus.org/) stemmers, containing the Porter stemmer and several other stemmers for different languages. The relevant class is:

  weka.core.stemmers.Snowball

The Snowball classes are not included; they only have to be present in the classpath. The reason for this is that the Weka team doesn't have to watch out for new versions of the stemmers and update them.

There are two ways of getting hold of the Snowball stemmers:

1. You can add the following pre-compiled jar archive to your classpath and you're set (based on source code from 2005-10-19, compiled 200
165.   <instances>
          <instance>
             <value>5.1</value>
             <value>3.5</value>
             <value>1.4</value>
             <value>0.2</value>
             <value>Iris-setosa</value>
          </instance>
          <instance>
             <value>4.9</value>
             <value>3</value>
             <value>1.4</value>
             <value>0.2</value>
             <value>Iris-setosa</value>
          </instance>
       </instances>
     </body>
   </dataset>

10.3 Sparse format

The XRFF format also supports a sparse data representation. Even though the iris dataset does not contain sparse data, the above example will be used here to illustrate the sparse format:

  <instances>
     <instance type="sparse">
        <value index="1">5.1</value>
        <value index="2">3.5</value>
        <value index="3">1.4</value>
        <value index="4">0.2</value>
        <value index="5">Iris-setosa</value>
     </instance>
     <instance type="sparse">
        <value index="1">4.9</value>
        <value index="2">3</value>
        <value index="3">1.4</value>
        <value index="4">0.2</value>
        <value index="5">Iris-setosa</value>
     </instance>
  </instances>

In contrast to the normal data format, each sparse instance tag contains a type attribute with the value sparse:

  <instance type="sparse">

And each value tag needs to specify the inde
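A sparse instance like the ones above can be expanded back into a dense row with a few lines of standard-library code. This sketch only illustrates the index attribute; it is not WEKA's XRFF reader (which, like sparse ARFF, treats omitted values as 0, whereas None is used here to make the gaps visible):

```python
import xml.etree.ElementTree as ET

SPARSE_INSTANCE = """
<instance type="sparse">
  <value index="1">5.1</value>
  <value index="2">3.5</value>
  <value index="5">Iris-setosa</value>
</instance>
"""

def read_sparse_instance(xml_text, num_attributes):
    """Expand one sparse XRFF instance into a dense list: the 1-based
    index attribute of each value tag says which attribute it sets."""
    dense = [None] * num_attributes
    for value in ET.fromstring(xml_text).findall("value"):
        dense[int(value.get("index")) - 1] = value.text
    return dense
```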
166. lts in the following output of possible matches of package names Possible matches weka classifiers weka clusterers e classname completion java weka classifiers meta A lt Tab gt lists the following classes Possible matches weka classifiers meta AdaBoostM1 weka classifiers meta AdditiveRegression weka classifiers meta AttributeSelectedClassifier e filename completion In order for Weka to determine whether a the string under the cursor is a classname or a filename filenames need to be absolute Unix Linx some path file Windows C Some Path file or relative and starting with a dot Unix Linux some other path file Windows Some Other Path file 32 CHAPTER 3 SIMPLE CLI Chapter 4 Explorer 4 1 The user interface 4 1 1 Section Tabs At the very top of the window just below the title bar is a row of tabs When the Explorer is first started only the first tab is active the others are greyed out This is because it is necessary to open and potentially pre process a data set before starting to explore the data The tabs are as follows 1 Preprocess Choose and modify the data being acted on 2 Classify Train and test learning schemes that classify or perform regres sion 3 Cluster Learn clusters for the data 4 Associate Learn association rules for the data 5 Select attributes Select the most relevant attributes in the data 6 Visualize View an interactive 2D plot of the data Once the ta
167. me weight gt 0 9 lt property gt lt metadata gt lt attribute gt 10 5 USEFUL FEATURES 165 10 5 3 Instance weights Instance weights are defined via the weight attribute in each instance tag By default the weight is 1 Here is an example lt instance weight 0 75 gt lt value gt 5 1 lt value gt lt value gt 3 5 lt value gt lt value gt 1 4 lt value gt lt value gt 0 2 lt value gt lt value gt Iris setosa lt value gt lt instance gt 166 CHAPTER 10 XRFF Chapter 11 Converters 11 1 Introduction Weka offers conversion utilities for several formats in order to allow import from different sorts of datasources These utilities called converters are all located in the following package weka core converters For a certain kind of converter you will find two classes e one for loading classname ends with Loader and e one for saving classname ends with Saver Weka contains converters for the following data sources e ARFF files ArffLoader ArffSaver e C4 5 files C45Loader C45Saver e CSV files CSVLoader CSVSaver e files containing serialized instances SerializedInstancesLoader Serial izedInstancesSaver JDBC databases DatabaseLoader DatabaseSaver libsvm files LibSVMLoader LibSVMSaver XRFF files XRFFLoader XRFFSaver text directories for text mining TextDirectoryLoader 167 168 CHAPTER 11 CONVERTERS 11 2 Usage 11 2 1 File converters File converters can be
168. me stage e g legacy versions that still get bugfixed 16 3 SUBVERSION 193 16 3 2 Source code The latest version of the Weka source code can be obtained with this URL https svn scms waikato ac nz svn weka trunk weka If you want to obtain the source code of the book version use this URL https svn scms waikato ac nz svn weka branches book2ndEd branch weka 16 3 3 JUnit The latest version of Weka s JUnit tests can be obtained with this URL https svn scms waikato ac nz svn weka trunk tests And if you want to obtain the JUnit tests of the book version use this URL https svn scms waikato ac nz svn weka branches book2ndEd branch tests 16 3 4 Specific version Whenever a release of Weka is generated the repository gets tagged e dev X Y Z the tag for a release of the developer version e g dev 3 5 8 for Weka 3 5 8 https svn scms waikato ac nz svn weka tags dev 3 5 8 e stable X Y Z the tag for a release of the book version e g stable 3 4 13 for Weka 3 4 13 https svn scms waikato ac nz svn weka tags stable 3 4 13 16 3 5 Clients Commandline Modern Linux distributions already come with Subversion either pre installed or easily installed via the package manager of the distribution If that shouldn t be case or if you are using Windows you have to download the appropriate client from the Subversion homepage http subversion tigris org A checkout of the current developer version of Weka looks like t
169. meters It is merely for documentation purposes so that one knows which class was actually started from the command line Responsible Class es weka core xml XMLOptions 16 6 2 Serialization of Experiments It is now possible to serialize the Experiments from the WEKA Experimenter not only in the proprietary binary format Java offers with serialization with this you run into problems trying to read old experiments with a newer WEKA version due to different SerialUIDs but also in XML There are currently two different ways to do this e built in The built in serialization captures only the necessary informations of an experiment and doesn t serialize anything else It s sole purpose is to save the setup of a specific experiment and can therefore not store any built models Thanks to this limitation we ll never run into problems with mismatching SerialUIDs This kind of serialization is always available and can be selected via a Filter xml in the Save Open Dialog of the Experimenter The DTD is very simple and looks like this for version 3 4 5 lt DOCTYPE object lt ELEMENT object PCDATA object gt lt ATTLIST object name CDATA REQUIRED gt lt ATTLIST object class CDATA REQUIRED gt lt ATTLIST object primitive CDATA no gt lt ATTLIST object array CDATA no gt lt ATTLIST object null CDATA no gt lt ATTLIST object version CDATA 3 4 5 gt 1 gt 202 CHAPTER 16 TECHNICAL DOCUMENTATION
170. mmon operations.

The Bayes network GUI is started as:

  java weka.classifiers.bayes.net.GUI [bif file]

The following window pops up when an XML BIF file is specified (if none is specified, an empty graph is shown):

[Screenshot: the Bayes Network Editor window (File, Edit, Tools, View and Help menus) showing a loaded network with its conditional probability tables.]

Moving a node

Click a node with the left mouse button and drag the node to the desired position.

Selecting groups of nodes

Drag the left mouse button in the graph panel. A rectangle is shown, and all nodes intersecting with the rectangle are selected when the mouse is released. Selected nodes are made visible with four little black squares at the corners (see screenshot above). The selection can be extended by keeping the shift key pressed while selecting another set of nodes. The selection can be toggled by keeping the ctrl key pressed: all nodes in the selection (selected in the rectangle) are de-selected, while the ones not in the selection but intersecting with the rectangle are added to the selection. Groups of nodes can be moved by keeping the left mouse pressed on one of the selected nodes and dragging the group to the desired position.

File me
171. mparison field Percent_correct A Significance 0 05 J gt lt gt lt Resultset Sorting asc by lt default gt y l 1l O trees J48 C 0 25 M 2 217733168393644444 1 1l Q rules OneR B 6 2459427002147861445 est basa Select 2 0 2 rules ZeroR 48055541465867954 Displayed Columns Columns Show std deviations C Output Format Select Perform test Save output Result list 16 42 48 Percent_correct Ranking 4 il gt The ranking test ranks the schemes according to the total number of sig nificant wins gt and losses lt against the other schemes The first column gt lt is the difference between the number of wins and the number of losses This difference is used to generate the ranking Chapter 6 KnowledgeF low 6 1 Introduction The KnowledgeFlow provides an alternative to the Explorer as a graphical front end to WEKA s core algorithms The KnowledgeF low is a work in progress so some of the functionality from the Explorer is not yet available On the other hand there are things that can be done in the KnowledgeFlow but not in the Explorer IB SF bataSources DataSinks Filters Classifiers Clusterers Associations Evaluation Visualization Plugins h E Z p DataSources 0 211 iS A l EA 5 mmni E pp Knowledge Flow Layout Log Component Parameters Time Status KnowledgeFlow 0 0 49 Welcom
172. n e The error message No suitable driver can be caused by the following The JDBC driver you are attempting to load is not in the CLASS PATH Note using jar in the java commandline overwrites the CLASSPATH environment variable Open the SimpleCLI run the command java weka core SystemInfo and check whether the prop erty java class path lists your database jar If not correct your CLASSPATH or the Java call you start Weka with The JDBC driver class is misspelled in the jdbcDriver property or you have multiple entries of jdbcDriver properties files need unique keys The jdbcURL property has a spelling error and tries to use a non existing protocol or you listed it multiple times which doesn t work either remember properties files need unique keys 178 CHAPTER 13 DATABASES Chapter 14 Windows databases A common query we get from our users is how to open a Windows database in the Weka Explorer This page is intended as a guide to help you achieve this It is a complicated process and we cannot guarantee that it will work for you The process described makes use of the JDBC ODBC bridge that is part of Sun s JRE 1 3 and higher The following instructions are for Windows 2000 Under other Windows versions there may be slight differences Step 1 Create a User DSN 1 Go to the Control Panel 2 Choose Adminstrative Tools Choose Data Sources ODBC At the User DSN tab choose Add oe AeA 0
173. n your CLASSPATH and you re currently just above the root package java weka gui GenericPropertiesCreator weka gui GenericPropertiesCreator props HOME GenericObjectEditor props this will generate a new props file in your home directory Windows users have to replace the HOME with USERPROFILE e edit the GenericPropertiesCreator props file and set UseDynamic to false Like with the GOE file the GPC can be either modified in its original position inside the source tree or you can place a copy of it in your home directory and modify this one which makes installing WEKA updates easier by just replacing the weka jar A limitation of the GOE was so far that additional classifiers filters etc had to fit into the same package structure as the already existing ones i e all had to be located below weka WEKA can now display multiple class hierarchies in the GUI which makes adding new functionality quite easy as we will see later in an example it is not restricted to classifiers only but also works with all the other entries in the GPC file 16 4 2 File Structure The structure of the GOE is a key value pair separated by an equals sign The value is a comma separated list of classes that are all derived from the su 16 4 GENERICOBJECTEDITOR 195 perclass superinterface key The GPC is slightly different instead of declar ing all the classes interfaces one need only to specify all the packages de scendants are located in
174. namic Bayesian networks so that time series data can be handled 152 CHAPTER 8 BAYESIAN NETWORK CLASSIFIERS Part III Data 153 Chapter 9 ARFF An ARFF Attribute Relation File Format file is an ASCII text file that describes a list of instances sharing a set of attributes 9 1 Overview ARFF files have two distinct sections The first section is the Header informa tion which is followed the Data information The Header of the ARFF file contains the name of the relation a list of the attributes the columns in the data and their types An example header on the standard IRIS dataset looks like this 1 Title Iris Plants Database 2 Sources a Creator R A Fisher b Donor Michael Marshall MARSHALL PLUCio arc nasa gov c Date July 1988 RELATION iris ATTRIBUTE sepallength NUMERIC ATTRIBUTE sepalwidth NUMERIC ATTRIBUTE petallength NUMERIC ATTRIBUTE petalwidth NUMERIC ATTRIBUTE class Iris setosa Iris versicolor Iris virginica The Data of the ARFF file looks like the following DATA 5 1 3 5 1 4 0 2 lris setosa 4 9 3 0 1 4 0 2 Iris setosa 4 7 3 2 1 3 0 2 Iris setosa 4 6 3 1 1 5 0 2 Iris setosa 5 0 3 6 1 4 0 2 Iris setosa 5 4 3 9 1 7 0 4 Iris setosa 155 156 CHAPTER 9 ARFF 4 6 3 4 1 4 0 3 Iris setosa 5 0 3 4 1 5 0 2 Iris setosa 4 4 2 9 1 4 0 2 Iris setosa 4 9 3 1 1 5 0 1 Iris setosa Lines that begin with a are comments The RELATION CATTRIBUTE and DATA
175. nce single is the default value for the type tag, we don't need to specify it explicitly:

  <option name="t">somefile.arff</option>

- hyphens
  Meta-Classifiers like AdaBoostM1 take another classifier as option with the -W option, where the options for the base classifier follow after the --. And here it is where the fun starts: where to put parameters for the base classifier, if the Meta-Classifier itself is a base classifier for another Meta-Classifier?

  E.g., does -W weka.classifiers.trees.J48 -- -C 0.001 become this:

    <option name="W" type="hyphens">
       <options type="classifier" value="weka.classifiers.trees.J48">
          <option name="C">0.001</option>
       </options>
    </option>

  Internally, all the options enclosed by the options tag are pushed to the end (after the --), if one transforms the XML into a command line string.

- quotes
  A Meta-Classifier like Stacking can take several -B options, where each single one encloses other options in quotes (this itself can contain a Meta-Classifier). From -B "weka.classifiers.trees.J48" we then get this XML:

    <option name="B" type="quotes">
       <options type="classifier" value="weka.classifiers.trees.J48"/>
    </option>

With the XML representation one doesn't have to worry anymore about the level of quotes one is using, and therefore doesn't have to care about the correct escaping, i.e.
8.7 Running from the command line

These are the command line options of BayesNet.

General options:

-t <name of training file>
        Sets training file.
-T <name of test file>
        Sets test file. If missing, a cross-validation will be performed on the training data.
-c <class index>
        Sets index of class attribute (default: last).
-x <number of folds>
        Sets number of folds for cross-validation (default: 10).
-no-cv
        Do not perform any cross validation.
-split-percentage <percentage>
        Sets the percentage for the train/test set split, e.g., 66.
-preserve-order
        Preserves the order in the percentage split.
-s <random number seed>
        Sets random number seed for cross-validation or percentage split (default: 1).
-m <name of file with cost matrix>
        Sets file with cost matrix.
-l <name of input file>
        Sets model input file. In case the filename ends with '.xml', the options are loaded from the XML file.
-d <name of output file>
        Sets model output file. In case the filename ends with '.xml', only the options are saved to the XML file, not the model.
-v
        Outputs no statistics for training data.
-o
        Outputs statistics only, not the classifier.
-i
        Outputs detailed information retrieval statistics for each class.
-k
        Outputs information theoretic statistics.
-p <attribute range>
        Only outputs predictions for test instances (or the train instan
ndline as follows:

   java weka.core.converters.CSVLoader filename.csv > filename.arff

17.2.7 ARFF file doesn't load

One way to figure out why ARFF files are failing to load is to give them to the Instances class. At the command line, type the following:

   java weka.core.Instances filename.arff

where you substitute 'filename' for the actual name of your file. This should return an error if there is a problem reading the file, or show some statistics if the file is ok. The error message you get should give some indication of what is wrong.

17.2.8 Spaces in labels of ARFF files

A common problem people have with ARFF files is that labels can only have spaces if they are enclosed in single quotes, i.e., a label such as "some value" should be written either 'some value' or some_value in the file.

17.2.9 CLASSPATH problems

Having problems getting Weka to run from a DOS/UNIX command prompt? Getting java.lang.NoClassDefFoundError exceptions? Most likely your CLASSPATH environment variable is not set correctly: it needs to point to the Weka.jar file that you downloaded with Weka (or the parent of the Weka directory if you have extracted the jar). Under DOS this can be achieved with:

   set CLASSPATH=c:\weka-3-4\weka.jar;%CLASSPATH%

Under UNIX/Linux something like:

   export CLASSPATH=/home/weka/weka.jar:$CLASSPATH

An easy way to avoid setting the variable is to specify the CLASSPATH wh
nformation:

   Scheme: weka.classifiers.bayes.BayesNet -D -B iris.xml -Q weka.classifiers.bayes.n

Options for BayesNet include the class names for the structure learner and for the distribution estimator.

   Relation:  iris-weka.filters.unsupervised.attribute.Discretize-B2-M-1.0-Rfirst-last
   Instances: 150
   Attributes: 5
              sepallength
              sepalwidth
              petallength
              petalwidth
              class
   Test mode: 10-fold cross-validation

   === Classifier model (full training set) ===
   Bayes Network Classifier
   not using ADTree

Indication whether the ADTree algorithm [23] for calculating counts in the data set was used.

   #attributes=5 #classindex=4

This line lists the number of attributes and the number of the class variable for which the classifier was trained.

   Network structure (nodes followed by parents)
   sepallength(2): class
   sepalwidth(2): class
   petallength(2): class sepallength
   petalwidth(2): class petallength
   class(3):

This list specifies the network structure. Each of the variables is followed by a list of parents, so the petallength variable has parents sepallength and class, while class has no parents. The number in braces is the cardinality of the variable. It shows that in the iris dataset there are three class values. All other variables are made binary by running them through a discretization filter.

   LogScore Bayes: -374.9942769685747
   LogScore BDeu: -351.85811477631626
   LogScore MDL: -416.86897021246466
   LogSc
nu:

[Screenshot: File menu with entries New, Load (Ctrl+L), Save (Ctrl+S), Save As, Print (Ctrl+P), Export, Exit]

The New, Save, Save As and Exit menu entries provide functionality as expected. The file format used is XML BIF [19].

There are two file formats supported for opening:

• .xml for XML BIF files. The Bayesian network is reconstructed from the information in the file. Node width information is not stored, so the nodes are shown with the default width. This can be changed by laying out the graph (menu Tools/Layout).

• .arff Weka data files. When an arff file is selected, a new empty Bayesian network is created with nodes for each of the attributes in the arff file. Continuous variables are discretized using the weka.filters.supervised.attribute.Discretize filter (see note at end of this section for more details). The network structure can be specified and the CPTs learned using the Tools/Learn CPT menu.

The Print menu works (sometimes) as expected.

The Export menu allows for writing the graph panel to image (currently supported are bmp, jpg, png and eps formats). This can also be activated using the Alt-Shift-Left-Click action in the graph panel.

Edit menu

[Screenshot: Edit menu with entries Undo (Ctrl+Z), Redo (Ctrl+Y), Select All, Delete Node (Delete), Cut (Ctrl+X), Copy (Ctrl+C), Paste, Add Node, Add Arc, Delete Arc, Align Left, Align Right, Align Top
oldMaker component from the Evaluation toolbar and place it on the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right-clicking over ClassAssigner and selecting dataSet from under Connections in the menu.

• Next click on the Classifiers tab at the top of the window and scroll along the toolbar until you reach the J48 component in the trees section. Place a J48 component on the layout.

• Connect the CrossValidationFoldMaker to J48 TWICE by first choosing trainingSet and then testSet from the pop-up menu for the CrossValidationFoldMaker.

• Repeat these two steps with the RandomForest classifier.

• Next go back to the Evaluation tab and place a ClassifierPerformanceEvaluator component on the layout. Connect J48 to this component by selecting the batchClassifier entry from the pop-up menu for J48. Add another ClassifierPerformanceEvaluator for RandomForest and connect them via batchClassifier as well.

• Next go to the Visualization toolbar and place a ModelPerformanceChart component on the layout. Connect both ClassifierPerformanceEvaluators to the ModelPerformanceChart by selecting the thresholdData entry from the pop-up menu for ClassifierPerformanceEvaluator.

• Now start the flow executing by selecting Start loading from the pop-up menu for ArffLoader. Depending on how big the data set is and how long cross-validation takes, you will see some animation from some of the icons in the layout.
om the popup menu brings up this dialog, in which the name of the new value for the node can be specified. The distribution for the node assigns zero probability to the new value. Child node CPTs are updated by copying distributions conditioned on the new value.

[Dialog: Node sepallength, New value: Value4]

The popup menu shows the list of values that can be renamed for the selected node.

[Popup menu: Set evidence, Rename, Delete node, Edit CPT, Add parent, Delete parent, Delete child, Add value, Rename value ((-inf-2.45], (2.45-4.75], (4.75-inf)), Delete value]

Selecting a value brings up the following dialog, in which a new name can be specified.

[Dialog: New name for value (-inf-5.55]]

The popup menu shows the list of values that can be deleted from the selected node. This is only active when there are more than two values for the node (single-valued nodes do not make much sense). By selecting the value, the CPT of the node is updated in order to ensure that the CPT adds up to unity. The CPTs of children are updated by dropping the distributions conditioned on the value.

[Popup menu: Set evidence, Rename, Delete node, Edit CPT, Add parent, Delete parent, Delete child, Add value, Rename value, Delete value ((-inf-2.45], (2.45-4.75], (4.75-inf))]

A note on CPT learning

Continuous variables are discretized by the Bay
ommand line interface.

The menu consists of four sections:

1. Program

   • LogWindow: Opens a log window that captures all that is printed to stdout or stderr. Useful for environments like MS Windows, where WEKA is normally not started from a terminal.
   • Exit: Closes WEKA.

2. Tools: Other useful applications.

   • ArffViewer: An MDI application for viewing ARFF files in spreadsheet format.
   • SqlViewer: Represents an SQL worksheet for querying databases via JDBC.
   • Bayes net editor: An application for editing, visualizing and learning Bayes nets.

3. Visualization: Ways of visualizing data with WEKA.

   • Plot: For plotting a 2D plot of a dataset.
   • ROC: Displays a previously saved ROC curve.
   • TreeVisualizer: For displaying directed graphs, e.g., a decision tree.
   • GraphVisualizer: Visualizes XML BIF or DOT format graphs, e.g., for Bayesian networks.
   • BoundaryVisualizer: Allows the visualization of classifier decision boundaries in two dimensions.

4. Help: Online resources for WEKA can be found here.

   • Weka homepage: Opens a browser window with WEKA's homepage.
   • HOWTOs, code snippets, etc.: The general WekaWiki [2], containing
ontaining the textual output.

Load model: Loads a pre-trained model object from a binary file.

Save model: Saves a model object to a binary file. Objects are saved in Java 'serialized object' form.

Re-evaluate model on current test set: Takes the model that has been built and tests its performance on the data set that has been specified with the Set button under the Supplied test set option.

Visualize classifier errors: Brings up a visualization window that plots the results of classification. Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares.

Visualize tree or Visualize graph: Brings up a graphical representation of the structure of the classifier model, if possible (i.e., for decision trees or Bayesian networks). The graph visualization option only appears if a Bayesian network classifier has been built. In the tree visualizer, you can bring up a menu by right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in. The graph visualizer should be self-explanatory.

Visualize margin curve: Generates a plot illustrating the prediction margin. The margin is defined as the difference between the probability predicted for the actual class and the highest probability predicted for the other classes. For example, boosting algorithms may achi
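The margin definition above can be computed directly from a predicted class distribution. A small sketch (the class and method names are illustrative, not part of the Weka API):

```java
// Margin of a single prediction: probability assigned to the actual
// class minus the highest probability assigned to any other class.
// A positive margin means a correct, confident prediction; a negative
// margin means the instance was misclassified.
public class MarginSketch {
    public static double margin(double[] classProbs, int actualClass) {
        double bestOther = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < classProbs.length; i++) {
            if (i != actualClass) {
                bestOther = Math.max(bestOther, classProbs[i]);
            }
        }
        return classProbs[actualClass] - bestOther;
    }

    public static void main(String[] args) {
        // distribution over three classes, actual class is index 0
        System.out.println(margin(new double[]{0.7, 0.2, 0.1}, 0));
    }
}
```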
opriate JDBC driver, extract the JDBC jar and place it as mysql.jar in /home/johndoe/jars.

5.3.3 Remote Engine Setup

• First, set up a directory for scripts and policy files:

   /home/johndoe/remote_engine

• Unzip the remoteExperimentServer.jar (from the Weka distribution, or build it from the sources2 with ant remotejar) into a temporary directory.

• Next, copy remoteEngine.jar and remote.policy.example to the /home/johndoe/remote_engine directory.

• Create a script, called /home/johndoe/remote_engine/startRemoteEngine, with the following content (don't forget to make it executable with chmod a+x startRemoteEngine when you are on Linux/Unix):

  – HSQLDB

   java -Xmx256m \
     -classpath /home/johndoe/jars/hsqldb.jar:remoteEngine.jar \
     -Djava.security.policy=remote.policy \
     weka.experiment.RemoteEngine &

  – MySQL

   java -Xmx256m \
     -classpath /home/johndoe/jars/mysql.jar:remoteEngine.jar \
     -Djava.security.policy=remote.policy \
     weka.experiment.RemoteEngine &

• Now we will start the remote engines that run the experiments on the remote computers (note that the same version of Java must be used for the Experimenter and remote engines):

  – Rename the remote.policy.example file to remote.policy.
  – For each machine you want to run a remote engine on:
    ∗ ssh to the machine.
    ∗ cd to /home/johndoe/

2Weka's source code can be found in the weka-src.jar archive or obtained from Subversion [10].
[Screenshot: Analyse panel. Tester: weka.experiment.PairedCorrectedTTester, analysing Percent_correct with significance 0.05 (two-tailed), sorted by date (21/12/05, 16:51). Key: (1) rules.ZeroR, (2) rules.OneR -B 6, (3) trees.J48 -C 0.25 -M 2; OneR and J48 (around 93.53 and 94.73 percent correct) are each marked 'v' as significantly better than ZeroR.]

5.3 Remote Experiments

Remote experiments enable you to distribute the computing load across multiple computers. In the following we will discuss the setup and operation for HSQLDB [11] and MySQL [12].

5.3.1 Preparation

To run a remote experiment you will need:

• A database server.
• A number of computers to run remote engines on.
• To edit the remote engine policy file included in the Weka distribution to allow Java class and dataset loading from your home directory.
• An invocation of the Experimenter on a machine somewhere (any will do).

For the following examples, we assume a user called johndoe with this setup:

• Access to a set of computers running a flavour of Un
or 1 dataset, for a total of 30 result lines. Results can also be loaded from an earlier experiment file by clicking File and loading the appropriate .arff results file. Similarly, results sent to a database (using the DatabaseResultListener) can be loaded from the database.

Select the Percent_correct attribute from the Comparison field and click Perform test to generate a comparison of the 3 schemes.

[Screenshot: Analyse panel. Source: "Got 30 results" (File/Database/Experiment). Test output: "Testing with paired T-Tester (corrected)", Tester: weka.experiment.PairedCorrectedTTester, analysing Percent_correct, 1 dataset, 3 resultsets, confidence 0.05 (two-tailed), sorted by date (21/12/05, 16:37). Key: (1) rules.ZeroR, (2) rules.OneR -B 6, (3) trees.J48 -C 0.25 -M 2.]
or Efficient Machine Learning with Large Datasets. JAIR, Volume 8, pages 67-91, 1998.

Verma, T. and Pearl, J.: An algorithm for deciding if a set of observed independencies has a causal explanation. Proc. of the Eighth Conference on Uncertainty in Artificial Intelligence, 323-330, 1992.
ore ENTROPY: -366.76261727150217
   LogScore AIC: -386.76261727150217

These lines list the logarithmic score of the network structure for various methods of scoring.

If a BIF file was specified, the following two lines will be produced (if no such file was specified, no information is printed):

   Missing: 0 Extra: 2 Reversed: 0
   Divergence: -0.0719759699700729

In this case, the network that was learned was compared with the file iris.xml, which contained the naive Bayes network structure. The number after 'Missing' is the number of arcs that was in the network in the file but is not recovered by the structure learner. Note that a reversed arc is not counted as missing. The number after 'Extra' is the number of arcs in the learned network that are not in the network on file. The number of reversed arcs is listed as well.

Finally, the divergence between the network distribution on file and the one learned is reported. This number is calculated by enumerating all possible instantiations of all variables, so it may take some time to calculate the divergence for large networks.

The remainder of the output is standard output for all classifiers:

   Time taken to build model: 0.01 seconds

   === Stratified cross-validation ===
   === Summary ===
   Correctly Classified Instances     116        77.3333 %
   Incorrectly Classified Instances    34        22.6667 %
   etc...

Bayesian networks in GUI

To show the graphical structure, right click the appropriate BayesNet in result
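The manual does not spell out the divergence formula. As an illustration only (an assumption on my part, not necessarily Weka's exact measure), a KL-style divergence between two network distributions can be computed by enumerating every joint instantiation, which is why the cost grows with network size:

```java
// KL-style divergence: sum over all joint instantiations x of
// p(x) * ln(p(x) / q(x)). Here p and q are flattened joint
// distributions, one entry per instantiation of all variables.
// This is a sketch of the "enumerate all instantiations" idea,
// not Weka's actual divergence computation.
public class DivergenceSketch {
    public static double divergence(double[] p, double[] q) {
        double d = 0.0;
        for (int x = 0; x < p.length; x++) {
            if (p[x] > 0) {
                d += p[x] * Math.log(p[x] / q[x]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // two binary variables -> 4 joint instantiations
        double[] p = {0.5, 0.25, 0.125, 0.125};
        double[] q = {0.25, 0.25, 0.25, 0.25};
        System.out.println(divergence(p, q));
        System.out.println(divergence(p, p));  // identical distributions: 0.0
    }
}
```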
[Result output (continued): Root mean squared error, Relative absolute error, Root relative squared error, Total Number of Instances, measureTreeSize 5.0, measureNumLeaves 3.0, measureNumRules 3.0, with values such as 0.8824, 0.0723, 0.2191, 16.2754, 46.4676, 92.1569.]

5.2.2.4 Other Result Producers

Cross-Validation Result Producer

To change from random train and test experiments to cross-validation experiments, click on the Result generator entry. At the top of the window, click on the drop-down list and select CrossValidationResultProducer. The window now contains parameters specific to cross-validation, such as the number of partitions/folds. The experiment performs 10-fold cross-validation instead of train and test in the given example.

[Screenshot: weka.experiment.CrossValidationResultProducer options. "Performs a cross validation run using a supplied evaluator" (More). numFolds: 10, outputFile: splitEvalutorOut.zip, rawOutput: False, splitEvaluator: ClassifierSplitEvaluator]

The Result generator panel now indicates that cross-validation will be performed. Click on More to generate a brief description of the CrossValidationResultProducer.

[Information dialog: NAME weka.experiment.CrossValidationResultProducer; SYNOPSIS: Performs a
ou need to rename it first.

13.2 Setup

Under normal circumstances you only have to edit the following two properties:

• jdbcDriver
• jdbcURL

Driver

jdbcDriver is the classname of the JDBC driver, necessary to connect to your database, e.g.:

• HSQLDB: org.hsqldb.jdbcDriver
• MS SQL Server 2000 (Desktop Edition): com.microsoft.jdbc.sqlserver.SQLServerDriver
• MS SQL Server 2005: com.microsoft.sqlserver.jdbc.SQLServerDriver
• MySQL: org.gjt.mm.mysql.Driver (or com.mysql.jdbc.Driver)
• ODBC (part of Sun's JDKs/JREs, no external driver necessary): sun.jdbc.odbc.JdbcOdbcDriver
• Oracle: oracle.jdbc.driver.OracleDriver
• PostgreSQL: org.postgresql.Driver
• sqlite 3.x: org.sqlite.JDBC

URL

jdbcURL specifies the JDBC URL pointing to your database (can be still changed in the Experimenter/Explorer), e.g. for the database MyDatabase on the server server.my.domain:

• HSQLDB: jdbc:hsqldb:hsql://server.my.domain/MyDatabase
• MS SQL Server 2000 (Desktop Edition): jdbc:microsoft:sqlserver://server.my.domain:1433 (Note: if you add ;databasename=db-name you can connect to a different database than the default one, e.g., MyDatabase)
• MS SQL Server 2005: jdbc:sqlserver://server.my.domain:1433
• MySQL: jdbc:mysql://server.my.domain:3306/MyDatabase
• ODBC: jdbc:odbc:DSN_name (replace DSN_name with the DSN that you want to use)
• Oracle (thin driver):
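Putting the two properties together, the relevant lines of a DatabaseUtils.props for MySQL might look like the following. The property names are from the text above; the server and database names are placeholders to adapt to your own setup:

```properties
# JDBC driver class and connection URL (illustrative values;
# replace server.my.domain and MyDatabase with your own)
jdbcDriver=com.mysql.jdbc.Driver
jdbcURL=jdbc:mysql://server.my.domain:3306/MyDatabase
```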
[Submenu: Add parent > sepalwidth, petallength, petalwidth]

Clicking the right mouse button on a node brings up a popup menu. The popup menu shows the list of values that can be set as evidence for the selected node. This is only visible when margins are shown (menu Tools/Show margins). By selecting 'Clear', the value of the node is removed and the margins are calculated based on the CPTs again.

[Popup menu: Set evidence ((-inf-2.45], (2.45-4.75], (4.75-inf), Clear), Rename, Delete node, Edit CPT, Add parent, Delete parent, Delete child, Add value, Rename value, Delete value]

A node can be renamed by right click and select Rename in the popup menu. The following dialog appears, which allows entering a new node name.

[Dialog: sepallength]

The CPT of a node can be edited manually by selecting a node, right click/Edit CPT. A dialog is shown with a table representing the CPT. When a value is edited, the values of the remainder of the table are updated in order to ensure that the probabilities add up to 1. It attempts to adjust the last column first, then goes backward from there.

[CPT table for petallength conditioned on sepalwidth and class, with rows of probabilities such as 0.143, 0.143, 0.949, 0.026, 0.873, 0.111, 0.016, 0.343, 0.463, 0.194 over the value ranges (-inf-2.45], (2.45-4.75], (4.75-inf)]
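The repair rule just described (keep the edited cell, then restore the row sum to 1 by adjusting the other cells starting from the last column and walking backward) can be sketched as follows. This is an illustrative reimplementation of the behaviour described in the text, not the editor's actual code:

```java
// Renormalize one CPT row after a cell edit: the edited cell keeps its
// new value; the remaining cells absorb the excess (or deficit),
// starting from the last column and walking backward, with each
// probability clamped to [0, 1].
public class CptRowSketch {
    public static double[] adjust(double[] row, int edited, double newValue) {
        double[] r = row.clone();
        r[edited] = newValue;
        double sum = 0.0;
        for (double v : r) sum += v;
        double excess = sum - 1.0;   // positive: remove mass; negative: add mass
        for (int i = r.length - 1; i >= 0; i--) {
            if (i == edited || excess == 0.0) continue;
            double adjusted = Math.min(1.0, Math.max(0.0, r[i] - excess));
            excess -= r[i] - adjusted;   // whatever this cell could not absorb
            r[i] = adjusted;
        }
        return r;
    }

    public static void main(String[] args) {
        double[] r = adjust(new double[]{0.2, 0.3, 0.5}, 0, 0.4);
        System.out.println(r[0] + " " + r[1] + " " + r[2]);
    }
}
```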
perimenter (or Explorer). The GUIChooser starts, but Explorer and Experimenter don't start and output an Exception like this in the terminal:

   /usr/share/themes/Mist/gtk-2.0/gtkrc:48: Engine "mist" is unsupported, ignoring
   ---Registering Weka Editors---
   java.lang.NullPointerException
     at weka.gui.explorer.PreprocessPanel.addPropertyChangeListener(PreprocessPanel.java:519)
     at javax.swing.plaf.synth.SynthPanelUI.installListeners(SynthPanelUI.java:49)
     at javax.swing.plaf.synth.SynthPanelUI.installUI(SynthPanelUI.java:38)
     at javax.swing.JComponent.setUI(JComponent.java:652)
     at javax.swing.JPanel.setUI(JPanel.java:131)

This behavior happens only under Java 1.5 and Gnome/Linux; KDE doesn't produce this error. The reason for this is that Weka tries to look more "native" and therefore sets a platform-specific Swing theme. Unfortunately, this doesn't seem to be working correctly in Java 1.5 together with Gnome. A workaround for this is to set the cross-platform Metal theme.

In order to use another theme, one only has to create the following properties file in one's home directory:

   LookAndFeel.props

With this content:

   Theme=javax.swing.plaf.metal.MetalLookAndFeel
pt window, change to the directory where the DatabaseUtils.props file is located, make sure your CLASSPATH environment variable is set correctly (or set it with the -cp option to java), and launch the Explorer with the following command:

   java weka.gui.explorer.Explorer

• Choose Open DB...
• Edit the query field to read "select * from tablename", where tablename is the name of the database table you want to read (or you could put a more complicated SQL query here instead).
• The databaseURL should read "jdbc:odbc:dbname", where dbname is the name you gave the user DSN.
• Click OK.

At this point the data should be read from the database.

Part IV

Appendix

Chapter 15

Research

15.1 Citing Weka

If you want to refer to Weka in a publication, please cite the data mining book. The full citation is:

   Ian H. Witten and Eibe Frank (2005). Data Mining: Practical machine learning tools and techniques. 2nd Edition, Morgan Kaufmann, San Francisco, 2005.

15.2 Paper references

Due to the introduction of the weka.core.TechnicalInformationHandler interface it is now easy to extract all the paper references via weka.core.ClassDiscovery and weka.core.TechnicalInformation.

The script listed at the end extracts all the paper references from Weka, based on a given jar file, and dumps them to stdout. One can either generate simple plain text output (option -p) or BibTeX compliant one (option
r Close.

The Filter... button enables one to highlight classifiers that can handle certain attribute and class types. With the Remove filter button all the selected capabilities will get cleared and the highlighting removed again.

Additional algorithms can be added again with the Add new... button, e.g., the J48 decision tree.

[Screenshot: weka.gui.GenericObjectEditor for weka.classifiers.trees.J48 ("Class for generating a pruned or unpruned C4..."), with options binarySplits: False, confidenceFactor: 0.25, debug: False, minNumObj: 2, numFolds: 3, reducedErrorPruning: False, saveInstanceData: False, seed: 1, subtreeRaising: True, unpruned: False, useLaplace: False]

After setting the classifier parameters, one clicks on OK to add it to the list of algorithms.

[Screenshot: Experimenter Setup tab (Simple mode) with the configured experiment: Results Destination ARFF file, Experiment Type cross-validation with 10 folds, 10 repetitions, Data sets first
r building and separating source code and class files of Java projects. But still, if you're only working on totally separate projects, it might be easiest for you to use the environment variable.

16.2.1 Setting the CLASSPATH

In the following we add the mysql-connector-java-3.1.8-bin.jar to our CLASSPATH variable (this works for any other jar archive) to make it possible to access MySQL databases via JDBC.

Win32 (2k and XP)

We assume that the mysql-connector-java-3.1.8-bin.jar archive is located in the following directory:

   C:\Program Files\Weka-3-5

In the Control Panel click on System (or right click on My Computer and select Properties) and then go to the Advanced tab. There you'll find a button called Environment Variables, click it. Depending on whether you're the only person using this computer or it's a lab computer shared by many, you can either create a new system-wide (you're the only user) environment variable or a user-dependent one (recommended for multi-user machines). Enter the following name for the variable:

   CLASSPATH

and add this value:

   C:\Program Files\Weka-3-5\mysql-connector-java-3.1.8-bin.jar

If you want to add additional jars, you'll have to separate them with the path separator, the semicolon ";" (no spaces!).

Unix/Linux

We make the assumption that the mysql jar is located in the following directory:

   /home/johndoe/jars/

Open a shell and execute the following command, depending on
196. raphy 211 CONTENTS Part I The Command line Chapter 1 A command line primer 1 1 Introduction While for initial experiments the included graphical user interface is quite suf ficient for in depth usage the command line interface is recommended because it offers some functionality which is not available via the GUI and uses far less memory Should you get Out of Memory errors increase the maximum heap size for your java engine usually via Xmx1024M or Xmx1024m for 1GB the default setting of 16 to 64MB is usually too small If you get errors that classes are not found check your CLASSPATH does it include weka jar You can explicitly set CLASSPATH via the cp command line option as well We will begin by describing basic concepts and ideas Then we will describe the weka filters package which is used to transform input data e g for preprocessing transformation feature generation and so on Then we will focus on the machine learning algorithms themselves These are called Classifiers in WEKA We will restrict ourselves to common settings for all classifiers and shortly note representatives for all main approaches in machine learning Afterwards practical examples are given Finally in the doc directory of WEKA you find a documentation of all java classes within WEKA Prepare to use it since this overview is not intended to be complete If you want to know exactly what is going on take a look at the mostly well documen
197. re stemmers Stemmer Chapter 13 Databases 13 1 Configuration files Thanks to JDBC it is easy to connect to Databases that provide a JDBC driver Responsible for the setup is the following properties file located in the weka experiment package DatabaseUtils props You can get this properties file from the weka jar or weka src jar jar archive both part of a normal Weka release If you open up one of those files you ll find the properties file in the sub folder weka experiment Weka comes with example files for a wide range of databases DatabaseUtils props hsql HSQLDB 3 4 1 DatabaseUtils props mssqlserver MS SQL Server 2000 j 3 4 9 3 5 4 DatabaseUtils props mssqlserver2005 MS SQL Server 2005 j 3 4 11 i 3 5 6 DatabaseUtils props mysql MySQL j 3 4 9 3 5 4 DatabaseUtils props odbc ODBC access via Sun s ODBC JDBC bridge e g for MS Access j 3 4 9 3 5 4 see the Windows databases chapter for more information DatabaseUtils props oracle Oracle 10g 3 4 9 3 5 4 DatabaseUtils props postgresql PostgreSQL 7 4 3 4 9 3 5 4 DatabaseUtils props sqlite3 sqlite 3 x 3 4 12 3 5 7 The easiest way is just to place the extracted properties file into your HOME directory For more information on how property files are processed check out this article Note Weka only looks for the DatabaseUtils props file If you take one of the example files listed above y
198. remote engine Run home johndoe startRemoteEngine to enable the remote engines to use more memory modify the Xmx option in the startRemoteEngine script 5 3 4 Configuring the Experimenter Now we will run the Experimenter e HSQLDB Copy the Database Utils props hsql file from weka experiment in the weka jar archive to the home johndoe remote_engine directory and rename it to Database Utils props Edit this file and change the jdbcURL jdbc hsqldb hsql server_name database_name entry to include the name of the machine that is running your database server e g jdbcURL jdbc hsqldb hsql dodo company com experiment Now start the Experimenter inside this directory java cp home johndoe jars hsqldb jar remoteEngine jar home johndoe weka weka jar Djava rmi server codebase file home johndoe weka weka jar weka gui experiment Experimenter e MySQL Copy the Database Utils props mysql file from weka experiment in the weka jar archive to the home johndoe remote_engine direc tory and rename it to Database Utils props Edit this file and change the jdbcURL jdbc mysal server_name 3306 database_name entry to include the name of the machine that is running your database server and the name of the database the result will be stored in e g jdbcURL jdbc mysql dodo company com 3306 experiment Now start the Experimenter inside this directory java
199. rmines the first decision In case it is overcast we ll al ways play golf The numbers in parentheses at the end of each leaf tell us the number of exam ples in this leaf If one or more leaves were not pure all of the same class the number of mis classified examples would also be given after a slash As you can see a decision tree learns quite fast and is evalu ated even faster E g for a lazy learner testing would take far longer than training 1 2 BASIC CONCEPTS 19 Error on training data Correctly Classified Instance 14 100 Incorrectly Classified Instances 0 0 k z 4 z Kappa statistic 1 This is quite boring our clas Mean absolute error 0 sifier is perfect at least on the Root mean squared error 0 aa a Relativa abeoluta cerros 0 y training data all instances were Root relative squared error o classified correctly and all errors Total Number of Instances 14 Detailed Accuracy By Class are zero As is usually the case the training set accuracy is too TP Rate FP Rate Precision Recall F Measure Class optimistic The detailed accu E i yes racy by class which is output via no i and the confusion matrix is So Contusion Me tritio similarily trivial ab lt classified as 90 a yes 05 b no Stratified cross validation The stratified cv paints a more Correctly Classified Instances 9 64 2857 listi A Th Incorrectly Classified Instances 5 35 7143 realistic pic
200. robably causes problems in the BayesNet class NB 2 make sure to process options of the parent class if any in the get setOpions methods Adding a new estimator This is the quick guide for adding a new estimator 1 Create a class that derives from weka classifiers bayes net estimate BayesNetEstimator Let s say it is called weka classifiers bayes net estimate MyEstimator 2 Implement the methods public void initCPTs BayesNet bayesNet 8 12 FAQ 149 public void estimateCPTs BayesNet bayesNet public void updateClassifier BayesNet bayesNet Instance instance and public double distributionForInstance BayesNet bayesNet Instance instance 3 If the structure learner has options that are not default options you want to implement public Enumeration listOptions public void setOptions String options public String getOptions and the get and set methods for the properties you want to be able to set NB do not use the E option since that is reserved for the BayesNet class to distinguish the extra options for the SearchAlgorithm class and the Estimator class If the E option is used and no extra arguments are passed to the SearchAlgorithm the extra options to your Estimator will be passed to the SearchAlgorithm instead In short do not use the E option 8 12 FAQ How do I use a data set with continuous variables with the BayesNet classes Use the class weka filters unsupervised attribute Discretize to di
roduced the notion of Capabilities. Capabilities basically list what kind of data a certain object can handle, e.g., one classifier can handle numeric classes, but another cannot. In case a class supports capabilities, the additional buttons Filter... and Remove filter will be available in the GOE. The Filter... button pops up a dialog which lists all available Capabilities.

[Filtering Capabilities dialog. The caption reads: "Classifiers have to support at least the following capabilities (the ones highlighted don't meet these requirements, the ones highlighted blue possibly meet them)". Below it are checkboxes for: Nominal attributes, Binary attributes, Unary attributes, Empty nominal attributes, Numeric attributes, Date attributes, String attributes, Relational attributes, Missing values, No class, Nominal class, Binary class, Unary class, Date class, String class, Relational class, Missing class values, Only multi-Instance data; plus OK and Cancel buttons.]

198 CHAPTER 16. TECHNICAL DOCUMENTATION

One can then choose those capabilities an object, e.g. a classifier, should have. If one is looking for a classification problem, then the Nominal class Capability can be selected. On the other hand, if one needs a regression scheme, then the Capability Numeric class can be selected. This filtering mechanism makes the search for an appropriate learning scheme easier. After applying that filter, the tree with the objects will be displayed again and lists all objects that c-
rs.rules.ZeroR model just consists of a single value: the most common class, or the median of all numeric values in case of predicting a numeric value (= regression learning). ZeroR is a trivial classifier, but it gives a lower bound on the performance of a given dataset which should be significantly improved by more complex classifiers. As such it is a reasonable test on how well the class can be predicted without considering the other attributes.

Later, we will explain how to interpret the output from classifiers in detail; for now just focus on the Correctly Classified Instances in the section Stratified cross-validation and notice how it improves from ZeroR to J48:

java weka.classifiers.rules.ZeroR -t weather.arff
java weka.classifiers.trees.J48 -t weather.arff

There are various approaches to determine the performance of classifiers. The performance can most simply be measured by counting the proportion of correctly predicted examples in an unseen test dataset. This value is the accuracy, which is also 1 - ErrorRate. Both terms are used in literature.

The simplest case is using a training set and a test set which are mutually independent. This is referred to as hold-out estimate. To estimate variance in these performance estimates, hold-out estimates may be computed by repeatedly resampling the same dataset, i.e., randomly reordering it and then splitting it into training and test sets with a specific proportion of the examples, collecting
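The repeated hold-out idea can be illustrated with a small self-contained sketch (made-up 0/1 labels and a trivial ZeroR-like majority-class predictor, not WEKA code): shuffle the dataset, split off a held-out portion, measure accuracy on it, and repeat to obtain a set of estimates whose mean and spread can then be inspected:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class HoldOutEstimate {
    // Returns hold-out accuracies from `runs` random 2/3-1/3 splits.
    // Labels are 0/1; the "classifier" predicts the training majority class.
    public static double[] holdOutAccuracies(int[] labels, int runs, long seed) {
        Random rnd = new Random(seed);
        List<Integer> data = new ArrayList<>();
        for (int l : labels) data.add(l);
        double[] acc = new double[runs];
        for (int r = 0; r < runs; r++) {
            Collections.shuffle(data, rnd);           // randomly reorder the dataset
            int cut = data.size() * 2 / 3;            // train/test proportion
            int ones = 0;
            for (int i = 0; i < cut; i++) ones += data.get(i);
            int majority = (2 * ones >= cut) ? 1 : 0; // "train" on the first part
            int correct = 0;
            for (int i = cut; i < data.size(); i++)   // evaluate on the unseen part
                if (data.get(i) == majority) correct++;
            acc[r] = correct / (double) (data.size() - cut);
        }
        return acc;
    }

    public static double mean(double[] xs) {
        double s = 0;
        for (double x : xs) s += x;
        return s / xs.length;
    }
}
```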
[GeneticSearch option sheet: runs 10, scoreType BAYES, seed 1, useCrossOver True, useMutation True, useTournamentSelection False; Open, Save and OK buttons.]

Specific options:

- populationSize is the size of the population selected in each generation.
- descendantPopulationSize is the number of offspring generated in each generation.
- runs is the number of generations to generate.
- seed is the initialization value for the random number generator.
- useMutation is a flag to indicate whether mutation should be used. Mutation is applied by randomly adding or deleting a single arc.
- useCrossOver is a flag to indicate whether cross-over should be used. Cross-over is applied by randomly picking an index k in the bit representation and selecting the first k bits from one and the remainder from another network structure in the population. At least one of useMutation and useCrossOver should be set to true.
- useTournamentSelection: when false, the best performing networks are selected from the descendant population to form the population of the next generation. When true, tournament selection is used. Tournament selection randomly chooses two individuals from the descendant population and selects the one that performs best.

8.3 Conditional independence test based structure learning

Conditional independence tests in Weka are slightly different from the standard tests described
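The cross-over and mutation operators described above can be sketched on a plain bit array (a hypothetical stand-in for the arc encoding of a network structure, not WEKA's actual representation):

```java
import java.util.Random;

public class GeneticOps {
    // One-point cross-over: the first k bits from a, the remainder from b.
    public static boolean[] crossOver(boolean[] a, boolean[] b, int k) {
        boolean[] child = new boolean[a.length];
        for (int i = 0; i < a.length; i++) child[i] = (i < k) ? a[i] : b[i];
        return child;
    }

    // Mutation: flip a single randomly chosen bit, i.e., randomly
    // add or delete one arc in the structure encoding.
    public static boolean[] mutate(boolean[] bits, Random rnd) {
        boolean[] copy = bits.clone();
        int i = rnd.nextInt(copy.length);
        copy[i] = !copy[i];
        return copy;
    }
}
```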
s name and options are shown in the field next to the Choose button. Clicking on this box with the left mouse button brings up a GenericObjectEditor dialog box. A click with the right mouse button (or Alt+Shift+left click) brings up a menu where you can choose either to display the properties in a GenericObjectEditor dialog box, or to copy the current setup string to the clipboard.

The GenericObjectEditor Dialog Box

The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the available options. Right-clicking (or Alt+Shift+Left-Click) on such a field will bring up a popup menu, listing the following options:

38 CHAPTER 4. EXPLORER

1. Show properties... has the same effect as left-clicking on the field, i.e., a dialog appears allowing you to alter the settings.

2. Copy configuration to clipboard copies the currently displayed configuration string to the system's clipboard and therefore can be used anywhere else in WEKA or in the console. This is rather handy if you have to setup complicated, nested schemes.

3. Enter configuration... is the receiving end for configurations that got copied to the clipboard earlier on. In this dialog you can enter a classname followed by options (if the class supports these). This also allows you to transfer a filter setting from the Preprocess panel
s.bayes.NaiveBayes -K -t soybean-train.arff -T soybean-test.arff -p 0

 0 diaporthe-stem-canker 0.9999672587892333 diaporthe-stem-canker
 1 diaporthe-stem-canker 0.9999992614503429 diaporthe-stem-canker
 2 diaporthe-stem-canker 0.999998948559035 diaporthe-stem-canker
 3 diaporthe-stem-canker 0.9999998441238833 diaporthe-stem-canker
 4 diaporthe-stem-canker 0.9999989997681132 diaporthe-stem-canker
 5 rhizoctonia-root-rot 0.9999999395928124 rhizoctonia-root-rot
 6 rhizoctonia-root-rot 0.999998912860593 rhizoctonia-root-rot
 7 rhizoctonia-root-rot 0.9999994386283236 rhizoctonia-root-rot

The values in each line are separated by a single space. The fields are the zero-based test instance id, followed by the predicted class value, the confidence for the prediction (estimated probability of predicted class), and the true class. If we had chosen a range of attributes via -p, e.g. -p first-last, the mentioned attributes would also be output.

All these are correctly classified, so let's look at a few erroneous ones:

32 phyllosticta-leaf-spot 0.7789710144361445 brown-spot
39 alternarialeaf-spot 0.6403333824349896 brown-spot
44 phyllosticta-leaf-spot 0.893568420641914 brown-spot
46 alternarialeaf-spot 0.5788190397739439 brown-spot
73 brown-spot 0.4943768155314637 alternarialeaf-spot

In each of these cases, a misclassification occurred, mostly between classes alternarialeaf-spot and brown-spot. The confidences seem to be lower than for correct classification, so for a real-life
s is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

4.3.4 Training a Classifier

Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button.

When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below; but first we investigate the text that has been output.

4.3.5 The Classifier Output Text

The text in the Classifier output area has scroll bars allowing you to browse the results. Clicking with the left mouse button into the text area, while holding Alt and Shift, brings up a dialog that enables you to save the displayed output in a variety of formats (currently, BMP, EPS, JPEG and PNG). Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:

1. Run information. A list of information giving the learning scheme options, relation name, instances, attributes and test mode that were invol-
s picture just shows the network structure of the Bayes net, but for each of the nodes a probability distribution for the node given its parents is specified as well. For example, in the Bayes net above there is a conditional distribution

109 110 CHAPTER 8. BAYESIAN NETWORK CLASSIFIERS

for petallength given the value of class. Since class has no parents, there is an unconditional distribution for sepalwidth.

Basic assumptions

The classification task consists of classifying a variable y = x0 called the class variable given a set of variables x = x1 ... xn, called attribute variables. A classifier h : x -> y is a function that maps an instance of x to a value of y. The classifier is learned from a dataset D consisting of samples over (x, y). The learning task consists of finding an appropriate Bayesian network given a data set D over U.

All Bayes network algorithms implemented in Weka assume the following for the data set:

- all variables are discrete finite variables. If you have a data set with continuous variables, you can use the following filter to discretize them: weka.filters.unsupervised.attribute.Discretize
- no instances have missing values. If there are missing values in the data set, values are filled in using the following filter: weka.filters.unsupervised.attribute.ReplaceMissingValues

The first step performed by buildClassifier is checking if the data set fulfills those assumptions. If those assumptions are not met, the data
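The classifier h : x -> y can be made concrete with a tiny sketch. For the simplest network structure (naive Bayes, where every attribute's only parent is the class), h returns the class maximizing P(y) * P(x1|y) * ... * P(xn|y). The prior and CPT numbers below are made up purely for illustration:

```java
public class TinyBayes {
    // P(y) for two classes, and P(x_i = v | y) for two binary attributes.
    // All numbers are illustrative, not learned from any dataset.
    static double[] prior = {0.6, 0.4};
    static double[][][] cpt = { // cpt[attribute][y][value]
        {{0.8, 0.2}, {0.3, 0.7}},
        {{0.5, 0.5}, {0.9, 0.1}},
    };

    // h maps an instance of x to a value of y: argmax over the class posterior.
    public static int classify(int[] x) {
        int best = 0;
        double bestScore = -1;
        for (int y = 0; y < prior.length; y++) {
            double score = prior[y];
            for (int i = 0; i < x.length; i++) score *= cpt[i][y][x[i]];
            if (score > bestScore) { bestScore = score; best = y; }
        }
        return best;
    }
}
```

For x = (0, 0) the scores are 0.6*0.8*0.5 = 0.24 vs. 0.4*0.3*0.9 = 0.108, so class 0 wins; for x = (1, 0) class 1 wins.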
scretize them. From the command line you can use

java weka.filters.unsupervised.attribute.Discretize -B 3 -i infile.arff -o outfile.arff

where the -B option determines the cardinality of the discretized variables.

How do I use a data set with missing values with the BayesNet classes?

You would have to delete the entries with missing values or fill in dummy values.

How do I create a random Bayes net structure?

Running from the command line

java weka.classifiers.bayes.net.BayesNetGenerator -B -N 10 -A 9 -C 2

will print a Bayes net with 10 nodes, 9 arcs and binary variables in XML BIF format to standard output.

How do I create an artificial data set using a random Bayes net?

Running

java weka.classifiers.bayes.net.BayesNetGenerator -N 15 -A 20 -C 3 -M 300

will generate a data set in arff format with 300 instances from a random network with 15 ternary variables and 20 arrows.

How do I create an artificial data set using a Bayes net I have on file?

Running

java weka.classifiers.bayes.net.BayesNetGenerator -F alarm.xml -M 1000

will generate a data set with 1000 instances from the network stored in the file alarm.xml.

How do I save a Bayes net in BIF format?

- GUI: In the Explorer, learn the network structure, right click the relevant run in the result list, choose Visualize graph in the pop up menu, click the floppy button in the Graph Visualizer win-
sifiers during processing
- scrolling plots of classification accuracy, RMS error, predictions, etc.
- plugin facility for allowing easy addition of new components to the KnowledgeFlow

92 CHAPTER 6. KNOWLEDGEFLOW

6.3 Components

Components available in the KnowledgeFlow:

6.3.1 DataSources

All of WEKA's loaders are available.

[Toolbar screenshot: the DataSources tab, showing among others the CSV, Database, Serialized Instances, TextDirectory and XRFF loaders.]

6.3.2 DataSinks

All of WEKA's savers are available.

[Toolbar screenshot: the DataSinks tab, showing among others the Arff, CSV, Database, LibSVM, Serialized Instances and XRFF savers.]

6.3.3 Filters

All of WEKA's filters are available.

[Toolbar screenshot: the Filters tab with supervised and unsupervised sections, showing among others AttributeSelection, ClassOrder, Discretize, NominalToBinary, Resample, SpreadSubsample and StratifiedRemoveFolds.]

6.3.4 Classifiers

All of WEKA's classifiers are available.

[Toolbar screenshot: the Classifiers tab.]
solution.

17.2.15 Links

- Java VM options: http://java.sun.com/docs/hotspot/VMOptions.html

Bibliography

[1] Witten, I.H., and Frank, E. (2005). Data Mining: Practical machine learning tools and techniques. 2nd edition, Morgan Kaufmann, San Francisco.

[2] Weka Wiki: http://weka.wiki.sourceforge.net/

[3] J. Platt (1998). Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf, C. Burges and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning.

[4] Drummond, C., and Holte, R. (2000). Explicitly representing expected cost: An alternative to ROC representation. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

[5] Extensions for Weka's main GUI, on WekaWiki: http://weka.wiki.sourceforge.net/Extensions for Weka%27s main GUI

[6] Adding tabs in the Explorer, on WekaWiki: http://weka.wiki.sourceforge.net/Adding tabs in the Explorer

[7] Explorer visualization plugins, on WekaWiki: http://weka.wiki.sourceforge.net/Explorer visualization plugins

[8] Bengio, Y., and Nadeau, C. (1999). Inference for the Generalization Error.

[9] Ross Quinlan (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, CA.

[10] Subversion: http://weka.wiki.sourceforge.net/Subversion

[11] HSQLDB: http://hsqldb.sourceforge.net/

[12] MySQL: http://www.mysql.com/

[13] Plotting multiple ROC curves, on WekaWi-
t and left click; a red line labeled dataSet will connect the two components. Next, right-click over the ClassAssigner and choose Configure from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).

Next, grab a CrossValidationFoldMaker component from the Evaluation toolbar and place it on the layout. Connect the ClassAssigner to the CrossValidationFoldMaker by right-clicking over ClassAssigner and selecting dataSet from under Connections in the menu.

Next, click on the Classifiers tab at the top of the window and scroll along the toolbar until you reach the J48 component in the trees section. Place a J48 component on the layout.

96 CHAPTER 6. KNOWLEDGEFLOW

Connect the CrossValidationFoldMaker to J48 TWICE by first choosing trainingSet and then testSet from the pop-up menu for the CrossValidationFoldMaker.

Next, go back to the Evaluation tab and place a ClassifierPerformanceEvaluator component on the layout. Connect J48 to this component by selecting the batchClassifier entry from the pop-up menu for J48.

Next, go to the Visualization toolbar and place a TextViewer component on the layout. Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the text entry from the pop-up menu for ClassifierPerformanceEvaluator.

Now start the flow executing by selecting Start loading from the pop-up menu for ArffLoader. Depending on how big the data set is and how long cross-validation takes, you will see some animation from some of the icons in the layout: J48's tree will grow in the icon and the ticks will animate on the ClassifierPerformanceEvaluator. You will also see some progress information in the Status bar and Log at the bottom of the window.

When finished, you can view the results by choosing Show results from the pop-up menu for the TextViewer component.

Other cool things to add to this flow: connect a TextViewer and/or a GraphViewer to J48 in order to view the textual or graphical representations of the trees produced for each fold of the cross-validation (this is something that is not possible in the Explorer).

6.4.2 Plotting multiple ROC curves

The KnowledgeFlow can draw multiple ROC curves in the same plot window, something that the Explorer cannot do. In this example we use J48 and RandomForest as classifiers. This example can be found on the WekaWiki as well [13].

[Screenshot: a KnowledgeFlow layout in which an ArffLoader feeds a ClassAssigner and a ClassValuePicker into a CrossValidationFoldMaker, whose trainingSet and testSet outputs go to both J48 and RandomForest; each classifier connects to its own ClassifierPerformanceEvaluator, and both thresholdData outputs feed a single ModelPerformanceChart.]

- Click on the DataSources tab and choose ArffLoader from the
taSources
      6.3.2 DataSinks
      6.3.3 Filters
      6.3.4 Classifiers
      6.3.5 Clusterers
      6.3.6 Evaluation
      6.3.7 Visualization
6.4 Examples
      6.4.1 Cross validated J48
      6.4.2 Plotting multiple ROC curves
      6.4.3 Processing data incrementally
6.5 Plugin Facility

7 ArffViewer
7.1 Menus
7.2 Editing

8 Bayesian Network Classifiers
8.1 Introduction
8.2 Local score based structure learning
      8.2.1 Local score metrics
      8.2.2 Search algorithms
8.3 Conditional independence test based structure learning
8.4 Global score metric based structure learning
8.5 Fixed structure learning
8.6 Distribution learning
8.7 Running from the command line
8.8 Inspecting Bayesian networks
8.9 Bayes Network GUI
8.10 Bayesian nets in the experimenter
8.11 Adding your own Bayesian network learners
8.12 FAQ
8.13 Future development

III Data

9
ted classifiers
- IncrementalClassifierEvaluator: evaluate the performance of incrementally trained classifiers
- ClustererPerformanceEvaluator: evaluate the performance of batch trained/tested clusterers
- PredictionAppender: append classifier predictions to a test set. For discrete class problems, can either append predicted class labels or probability distributions

94 CHAPTER 6. KNOWLEDGEFLOW

6.3.7 Visualization

[Toolbar screenshot: the Visualization tab, showing DataVisualizer, ScatterPlotMatrix, AttributeSummarizer, ModelPerformanceChart, TextViewer, GraphViewer and StripChart.]

- DataVisualizer: component that can pop up a panel for visualizing data in a single large 2D scatter plot
- ScatterPlotMatrix: component that can pop up a panel containing a matrix of small scatter plots (clicking on a small plot pops up a large scatter plot)
- AttributeSummarizer: component that can pop up a panel containing a matrix of histogram plots, one for each of the attributes in the input data
- ModelPerformanceChart: component that can pop up a panel for visualizing threshold (i.e. ROC style) curves
- TextViewer: component for showing textual data. Can show data sets, classification performance statistics, etc.
- GraphViewer: component that can pop up a panel for visualizing tree base-
ted source code, which can be found in weka-src.jar and can be extracted via the jar utility from the Java Development Kit (or any archive program that can handle ZIP files).

12 CHAPTER 1. A COMMAND LINE PRIMER

1.2 Basic concepts

1.2.1 Dataset

A set of data items, the dataset, is a very basic concept of machine learning. A dataset is roughly equivalent to a two-dimensional spreadsheet or database table. In WEKA, it is implemented by the weka.core.Instances class. A dataset is a collection of examples, each one of class weka.core.Instance. Each Instance consists of a number of attributes, any of which can be nominal (= one of a predefined list of values), numeric (= a real or integer number) or a string (= an arbitrarily long list of characters, enclosed in double quotes). Additional types are date and relational, which are not covered here but in the ARFF chapter. The external representation of an Instances class is an ARFF file, which consists of a header describing the attribute types and the data as a comma-separated list. Here is a short, commented example. A complete description of the ARFF file format can be found here.

% This is a toy example, the UCI weather dataset.
% Any relation to real weather is purely coincidental.

(Comment lines at the beginning of the dataset should give an indication of its source, context and meaning.)

@relation golfWeatherMichigan_1988/02/10_14days

(Here we state the internal name of the dataset.)
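The way attribute values are stored can be sketched with a minimal stand-in for the Instances/Instance idea (hypothetical names; the real weka.core classes are far richer): nominal values are kept as indexes into the attribute's list of possible values, numeric values directly as numbers:

```java
import java.util.List;

public class MiniDataset {
    // A nominal attribute lists its legal values; a numeric one has none.
    static class Attribute {
        final String name;
        final List<String> nominalValues; // null for numeric attributes

        Attribute(String name, List<String> nominalValues) {
            this.name = name;
            this.nominalValues = nominalValues;
        }

        boolean isNominal() { return nominalValues != null; }
    }

    // Render a stored double either as the indexed nominal value or
    // as the number itself, mirroring the internal representation.
    static String valueAsString(Attribute att, double v) {
        return att.isNominal()
                ? att.nominalValues.get((int) v)
                : Double.toString(v);
    }
}
```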
ter panel in the Preprocess panel. The Preprocess panel will then show the transformed data. The change can be undone by pressing the Undo button. You can also use the Edit... button to modify your data manually in a dataset editor. Finally, the Save... button at the top right of the Preprocess panel saves the current version of the relation in file formats that can represent the relation, allowing it to be kept for future use.

Note: Some of the filters behave differently depending on whether a class attribute has been set or not (using the box above the histogram, which will bring up a drop-down list of possible selections when clicked). In particular, the supervised filters require a class attribute to be set, and some of the unsupervised attribute filters will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which case no class is set.

4.3 Classification

[Screenshot: the Explorer's Classify panel in Weka 3.5.4, with J48 -C 0.25 -M 2 chosen as classifier, Test options offering Use training set, Supplied test set, Cross-validation (Folds 10) and Percentage split, and the Classifier output area showing a summary with correctly/incorrectly classified instances, kappa statistic and mean absolute error.]
th full access to the Experimenter's capabilities. You can choose between those two with the Experiment Configuration Mode radio buttons:

- Simple
- Advanced

Both setups allow you to setup standard experiments, that are run locally on a single machine, or remote experiments, which are distributed between several hosts. The distribution of experiments cuts down the time the experiments will take until completion, but on the other hand, the setup takes more time. The next section covers the standard experiments (both simple and advanced), followed by the remote experiments and finally the analysing of the results.

52 CHAPTER 5. EXPERIMENTER

5.2 Standard Experiments

5.2.1 Simple

5.2.1.1 New experiment

After clicking New, default parameters for an Experiment are defined:

[Screenshot: the Experiment Environment Setup tab. Results Destination: ARFF file with an empty Filename field and a Browse... button. Experiment Type: Cross-validation, Number of folds 10, Classification selected. Iteration Control: Number of repetitions 10, Data sets first. Empty Datasets and Algorithms lists, each with Add new..., Edit selected... and Delete selected buttons, plus a Use relative paths checkbox.]
the shell you're using:

- bash:
  export CLASSPATH=$CLASSPATH:/home/johndoe/jars/mysql-connector-java-3.1.8-bin.jar

- c shell:
  setenv CLASSPATH $CLASSPATH:/home/johndoe/jars/mysql-connector-java-3.1.8-bin.jar

Cygwin

The process is like with Unix/Linux systems, but since the host system is Win32 and therefore the Java installation also a Win32 application, you'll have to use the semicolon as separator for several jars.

16.2.2 RunWeka.bat

From version 3.5.4, Weka is launched differently under Win32. The simple batch file got replaced by a central launcher class (= RunWeka.class) in combination with an INI file (= RunWeka.ini). The RunWeka.bat only calls this launcher class now with the appropriate parameters. With this launcher approach it is possible to define different launch scenarios, but with the advantage of having placeholders, e.g., for the max heap size, which enables one to change the memory for all setups easily.

The key of a command in the INI file is prefixed with cmd_, all other keys are considered placeholders:

cmd_blah=java ...     command "blah"
bloerk= ...           placeholder "bloerk"

A placeholder is surrounded in a command with #:

cmd_blah=java #bloerk#

Note: The key wekajar is determined by the -w parameter with which the launcher class is called.

By default, the following commands are predefined:

- default: The default Weka start, without a terminal window.
- console: For debugging purposes. Useful as Weka gets starte-
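The placeholder mechanism described above can be sketched as a simple string substitution (a hypothetical re-implementation for illustration, not the actual launcher code): every #key# occurrence in a command is replaced by that key's value before the command is executed:

```java
import java.util.Map;

public class IniExpand {
    // Replace each #key# in the command string with its value from the map.
    public static String expand(String command, Map<String, String> placeholders) {
        String result = command;
        for (Map.Entry<String, String> e : placeholders.entrySet()) {
            result = result.replace("#" + e.getKey() + "#", e.getValue());
        }
        return result;
    }
}
```

Changing a single placeholder value (say, the max heap size) then changes every command that references it, which is the point of the INI-file design.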
tiesCreator.props file and the <prefix> can be one of the following:

- S = Superclass: any class derived from this will be excluded
- I = Interface: any class implementing this interface will be excluded
- C = Class: exactly this class will be excluded

Here are a few examples:

# exclude all ResultListeners that also implement the ResultProducer
# interface (all ResultProducers do that!)
weka.experiment.ResultListener=\
    I:weka.experiment.ResultProducer

# exclude J48 and all SingleClassifierEnhancers
weka.classifiers.Classifier=\
    C:weka.classifiers.trees.J48,\
    S:weka.classifiers.SingleClassifierEnhancer

16.4.4 Class Discovery

Unlike the Class.forName(String) method that grabs the first class it can find in the CLASSPATH, and therefore fixes the location of the package it found the class in, the dynamic discovery examines the complete CLASSPATH you're starting the Java Virtual Machine (= JVM) with. This means that you can have several parallel directories with the same WEKA package structure, e.g. the standard release of WEKA in one directory (distribution/weka.jar) and another one with your own classes (development/weka/...), and display all of the classifiers in the GUI. In case of a name conflict, i.e. two directories contain the same class, the first one that can be found is used. In a nutshell, your java call of the Experimenter can look like this:

java -classpath deve
tifiedRemoveFolds -i data/soybean.arff -o soybean-test.arff -c last -N 4 -F 1

weka.filters.unsupervised

Classes below weka.filters.unsupervised in the class hierarchy are for unsupervised filtering, e.g., the non-stratified version of Resample. A class attribute should not be assigned here.

weka.filters.unsupervised.attribute

StringToWordVector transforms string attributes into word vectors, i.e., creating one attribute for each word which either encodes presence or word count (-C) within the string. -W can be used to set an approximate limit on the number of words. When a class is assigned, the limit applies to each class separately. This filter is useful for text mining.

Obfuscate renames the dataset name, all attribute names and nominal attribute values. This is intended for exchanging sensitive datasets without giving away restricted information.

Remove is intended for explicit deletion of attributes from a dataset, e.g., for removing attributes of the iris dataset:

java weka.filters.unsupervised.attribute.Remove -R 1-2 -i data/iris.arff -o iris-simplified.arff
java weka.filters.unsupervised.attribute.Remove -V -R 3-last -i data/iris.arff -o iris-simplified.arff

weka.filters.unsupervised.instance

Resample creates a non-stratified subsample of the given dataset, i.e., random sampling without regard to the class information. Otherwise it is equivalent to its supervised variant.

java weka.filters.unsupervised.instance.Resample -i data
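The difference between the unsupervised Resample and its supervised (stratified) variant can be sketched with a hypothetical helper (not the WEKA filter itself): plain resampling draws instances regardless of class, while stratified sampling draws the same fraction from each class so the class proportions are preserved:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import java.util.TreeSet;

public class ResampleSketch {
    // Non-stratified: sample without regard to the class information.
    public static List<Integer> resample(List<Integer> labels, int n, Random rnd) {
        List<Integer> copy = new ArrayList<>(labels);
        Collections.shuffle(copy, rnd);
        return copy.subList(0, n);
    }

    // Stratified: sample the same fraction from every class value,
    // so the subsample keeps the original class distribution.
    public static List<Integer> stratified(List<Integer> labels, double fraction, Random rnd) {
        List<Integer> out = new ArrayList<>();
        for (int cls : new TreeSet<>(labels)) {
            List<Integer> ofClass = new ArrayList<>();
            for (int l : labels) if (l == cls) ofClass.add(l);
            Collections.shuffle(ofClass, rnd);
            out.addAll(ofClass.subList(0, (int) Math.round(ofClass.size() * fraction)));
        }
        return out;
    }
}
```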
tion (= the currently loaded data), which can be interpreted as a single relational table in database terminology, has three entries:

1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.
2. Instances. The number of instances (data points/records) in the data.
3. Attributes. The number of attributes (features) in the data.

4.2.3 Working With Attributes

Below the Current relation box is a box titled Attributes. There are four buttons, and beneath them is a list of the attributes in the current relation. The list has three columns:

1. No. A number that identifies the attribute in the order they are specified in the data file.
2. Selection tick boxes. These allow you to select which attributes are present in the relation.
3. Name. The name of the attribute, as it was declared in the data file.

When you click on different rows in the list of attributes, the fields change in the box to the right titled Selected attribute. This box displays the characteristics of the currently highlighted attribute in the list:

1. Name. The name of the attribute, the same as that given in the attribute list.
2. Type. The type of attribute, most commonly Nominal or Numeric.
3. Missing. The number (and percentage) of instances in the data for which this attribute is missing (unspecified).
4. Distinct. The number of different values that the d
tionally, one can choose between Classification and Regression, depending on the datasets and classifiers one uses. For decision trees like J48 (Weka's implementation of Quinlan's C4.5 [9]) and the iris dataset, Classification is necessary; for a numeric classifier like M5P, on the other hand, Regression. Classification is selected by default.

Note: if the percentage splits are used, one has to make sure that the corrected paired T-Tester still produces sensible results with the given ratio [8].

5.2.1.4 Datasets

One can add dataset files either with an absolute path or with a relative one. The latter makes it often easier to run experiments on different machines, hence one should check Use relative paths before clicking on Add new....

[Screenshot: a file chooser opened on the data directory, listing contact-lenses.arff, cpu.arff, cpu.with.vendor.arff, iris.arff, segment-challenge.arff, segment-test.arff, soybean.arff, weather.arff and weather.nominal.arff, with iris.arff selected and Open and Cancel buttons.]

After clicking Open, the file will be displayed in the datasets list. If one selects a directory and
toolbar; the mouse pointer will change to a cross-hairs.

- Next, place the ArffLoader component on the layout area by clicking somewhere on the layout (a copy of the ArffLoader icon will appear on the layout area).
- Next, specify an ARFF file to load by first right-clicking the mouse over the ArffLoader icon on the layout. A pop-up menu will appear. Select Configure under Edit in the list from this menu and browse to the location of your ARFF file.
- Next, click the Evaluation tab at the top of the window and choose the ClassAssigner (allows you to choose which column to be the class) component from the toolbar. Place this on the layout.
- Now connect the ArffLoader to the ClassAssigner: first, right-click over the ArffLoader and select the dataSet under Connections in the menu. A rubber band line will appear. Move the mouse over the ClassAssigner component and left-click; a red line labeled dataSet will connect the two components.
- Next, right-click over the ClassAssigner and choose Configure from the menu. This will pop up a window from which you can specify which column is the class in your data (last is the default).
- Next, choose the ClassValuePicker (allows you to choose which class label to be evaluated in the ROC) component from the toolbar. Place this on the layout and right-click over ClassAssigner and select dataSet from under Connections in the menu and connect it with the ClassValuePicker.
- Next, grab a CrossValidationF
ttribute number and their value, stated like this:

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

Each instance is surrounded by curly braces, and the format for each entry is <index> <space> <value>, where index is the attribute index (starting from 0).

Note that the omitted values in a sparse instance are 0, they are not missing values! If a value is unknown, you must explicitly represent it with a question mark (?).

Warning: There is a known problem saving SparseInstance objects from datasets that have string attributes. In WEKA, string and nominal data values are stored as numbers; these numbers act as indexes into an array of possible attribute values (this is very efficient). However, the first string value is assigned index 0: this means that, internally, this value is stored as a 0. When a SparseInstance is written, string instances with internal value 0 are not output, so their string value is lost (and when the arff file is read again, the default value 0 is the index of a different string value, so the attribute value appears to change). To get around this problem, add a dummy string value at index 0 that is never used whenever you declare string attributes that are likely to be used in SparseInstance objects and saved as Sparse ARFF files.

160 CHAPTER 9. ARFF

9.4 Instance weights in ARFF files

A weight can be associated with an instance in a standard ARFF file by appending it to the end of the line for
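The key semantic point (omitted values are 0, not missing) can be sketched with a small index-to-value map, a hypothetical stand-in for weka.core.SparseInstance:

```java
import java.util.HashMap;
import java.util.Map;

public class SparseSketch {
    private final Map<Integer, Double> values = new HashMap<>();
    private final int numAttributes;

    public SparseSketch(int numAttributes) {
        this.numAttributes = numAttributes;
    }

    public void set(int index, double value) {
        values.put(index, value);
    }

    // An omitted entry is 0 by definition; a truly missing value would
    // have to be stored explicitly (the '?' marker in the file format).
    public double get(int index) {
        if (index < 0 || index >= numAttributes) throw new IndexOutOfBoundsException();
        return values.getOrDefault(index, 0.0);
    }
}
```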
ture accuracy (around 64%).

Kappa statistic                          0.186
Mean absolute error                      0.2857
Root mean squared error                  0.4818
Relative absolute error                 60      %
Root relative squared error             97.6586 %
Total Number of Instances               14

The kappa statistic measures the agreement of prediction with the true class; 1.0 signifies complete agreement. The following error values are not very meaningful for classification tasks; however, for regression tasks, e.g. the root of the mean squared error per example would be a reasonable criterion. We will discuss the relation between the confusion matrix and other measures in the text.

=== Detailed Accuracy By Class ===

TP Rate   FP Rate   Precision   Recall   F-Measure   Class
 0.778     0.6       0.7         0.778    0.737      yes
 0.4       0.222     0.5         0.4      0.444      no

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no

The confusion matrix is more commonly named contingency table. In our case we have two classes, and therefore a 2x2 confusion matrix; the matrix could be arbitrarily large. The number of correctly classified instances is the sum of the diagonal elements in the matrix; all others are incorrectly classified (class a gets misclassified as b exactly twice, and class b gets misclassified as a three times).

The True Positive (TP) rate is the proportion of examples which were classified as class x, among all examples which truly have class x, i.e. how much part of the class was capt
uracy. The output both of standard error and standard output should be redirected, so you get both the errors and the normal output of your classifier. The last & starts the task in the background. Keep an eye on your task via top and, if you notice the hard disk working hard all the time (for Linux), this probably means your task needs too much memory and will not finish in time for the exam. In that case, switch to a faster classifier or use filters, e.g. Resample to reduce the size of your dataset, or StratifiedRemoveFolds to create training and test sets (for most classifiers, training takes more time than testing).

So, now you have run a lot of experiments: which classifier is best? Try

cat *.out | grep -A 3 "Stratified" | grep "Correctly"

This should give you all cross-validated accuracies. If the cross-validated accuracy is roughly the same as the training set accuracy, this indicates that your classifier is presumably not overfitting the training set.

Now you have found the best classifier. To apply it on a new dataset, use e.g.

java weka.classifiers.trees.J48 -l J48-data.model -T new-data.arff

You will have to use the same classifier to load the model, but you need not set any options. Just add the new test file via -T. If you want, -p first-last will output all test instances with classifications and confidence, followed by all attribute values, so you can look at each error separately.

The following more complex csh script creat
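The pipeline suggested above, cat *.out | grep -A 3 "Stratified" | grep "Correctly", can be tried on a mock output file. This sketch fabricates one stand-in file (the demo.out name and its contents are illustrative; real runs would produce *.out files from your redirected classifier output):

```shell
# Mock of one redirected Weka output file, standing in for a real *.out file.
cat > demo.out <<'EOF'
=== Stratified cross-validation ===

Correctly Classified Instances          95               95      %
Incorrectly Classified Instances         5                5      %
EOF

# The pipeline from the text: take the 3 lines after "Stratified",
# then keep only the cross-validated accuracy line.
cat demo.out | grep -A 3 "Stratified" | grep "Correctly"
```

Note that grep is case sensitive, so the capital-C pattern "Correctly" does not also match the "Incorrectly Classified Instances" line.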
ured. It is equivalent to Recall. In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e. 7/(7+2) = 0.778 for class yes and 2/(3+2) = 0.4 for class no in our example.

The False Positive (FP) rate is the proportion of examples which were classified as class x, but belong to a different class, among all examples which are not of class x. In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes, i.e. 3/5 = 0.6 for class yes and 2/9 = 0.222 for class no.

The Precision is the proportion of the examples which truly have class x among all those which were classified as class x. In the matrix, this is the diagonal element divided by the sum over the relevant column, i.e. 7/(7+3) = 0.7 for class yes and 2/(2+2) = 0.5 for class no.

The F-Measure is simply 2*Precision*Recall/(Precision+Recall), a combined measure for precision and recall.

These measures are useful for comparing classifiers. However, if more detailed information about the classifier's predictions is necessary, -p <attribute-range> outputs just the predictions for each test instance, along with a range of one-based attribute ids (0 for none). Let's look at the following example. We shall assume soybean-train.arff and soybean-test.arff have been constructed via weka.filters.supervised.instance.StratifiedRemoveFolds as in a previous example.

java weka.classifier
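The per-class measures just described can be recomputed from the confusion matrix. A minimal sketch (plain Python, not Weka code; the function name is illustrative), using the 2x2 matrix from the example:

```python
# Per-class TP rate, FP rate, Precision and F-Measure from a confusion
# matrix cm, where cm[row][col] = actual class row, predicted class col.
def per_class_measures(cm, i):
    tp = cm[i][i]
    row_sum = sum(cm[i])                       # all examples truly of class i
    col_sum = sum(row[i] for row in cm)        # all examples predicted as class i
    rest = sum(sum(row) for j, row in enumerate(cm) if j != i)
    tp_rate = tp / row_sum                     # equivalent to Recall
    fp_rate = (col_sum - tp) / rest
    precision = tp / col_sum
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    return (round(tp_rate, 3), round(fp_rate, 3),
            round(precision, 3), round(f_measure, 3))

cm = [[7, 2], [3, 2]]                          # the yes/no matrix from the text
print(per_class_measures(cm, 0))               # class yes: (0.778, 0.6, 0.7, 0.737)
print(per_class_measures(cm, 1))               # class no:  (0.4, 0.222, 0.5, 0.444)
```

The printed values match the Detailed Accuracy By Class table in the evaluation output.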
us local search methods.

- conditional independence tests: These methods mainly stem from the goal of uncovering causal structure. The assumption is that there is a network structure that exactly represents the independencies in the distribution that generated the data. Then it follows that, if a conditional independency can be identified in the data between two variables, there is no arrow between those two variables. Once the locations of the edges are identified, the direction of the edges is assigned such that the conditional independencies in the data are properly represented.

- global score metrics: A natural way to measure how well a Bayesian network performs on a given data set is to predict its future performance by estimating expected utilities, such as classification accuracy. Cross-validation provides an out-of-sample evaluation method to facilitate this by repeatedly splitting the data in training and validation sets. A Bayesian network structure can be evaluated by estimating the network's parameters from the training set, with the resulting Bayesian network's performance determined against the validation set. The average performance of the Bayesian network over the validation sets provides a metric for the quality of the network.

Cross-validation differs from local scoring metrics in that the quality of a network structure often cannot be decomposed into the scores of the individual nodes. So the whole network needs to be considered.
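The cross-validated scoring loop described above can be sketched generically. This is a toy (the "model" is just majority-class prediction standing in for Bayesian network parameter estimation; the function name and data are illustrative): the point is the split/fit/score/average structure, not the learner.

```python
# k-fold cross-validated score: fit on k-1 folds, score on the held-out
# fold, and average the per-fold scores over all validation sets.
def cross_val_score(labels, k):
    folds = [labels[i::k] for i in range(k)]          # simple round-robin split
    scores = []
    for i in range(k):
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        test = folds[i]
        majority = max(set(train), key=train.count)   # "fit" on the training part
        scores.append(sum(x == majority for x in test) / len(test))
    return sum(scores) / k                            # average over validation sets

print(cross_val_score(["a"] * 8 + ["b"] * 2, k=5))    # 0.8
```

For a Bayesian network, the majority-class line would be replaced by parameter estimation on the training folds and likelihood or accuracy measurement on the validation fold.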
ved in the process.

2. Classifier model (full training set): A textual representation of the classification model that was produced on the full training data.

3. The results of the chosen test mode are broken down thus:

4. Summary: A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.

5. Detailed Accuracy By Class: A more detailed per-class breakdown of the classifier's prediction accuracy.

6. Confusion Matrix: Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.

7. Source code (optional): This section lists the Java source code if one chose Output source code in the More options dialog.

4.3.6 The Result List

After training several classifiers, the result list will contain several entries. Left-clicking the entries flicks back and forth between the various results that have been generated. Pressing Delete removes a selected entry from the results. Right-clicking an entry invokes a menu containing these items:

1. View in main window: Shows the output in the main window (just like left-clicking the entry).

2. View in separate window: Opens a new independent window for viewing the results.

3. Save result buffer: Brings up a dialog allowing you to save a text file c
with the Paired T-Tester.

[Screenshot: the Analyse panel, with Tester set to weka.experiment.PairedCorrectedTTester, analysing Number_correct (1 dataset, 3 resultsets: rules.ZeroR, rules.OneR and trees.J48 on iris), significance 0.05, two-tailed, sorted by date 21.12.05.]

Clicking on the button for the Output format leads to a dialog that lets you choose the precision for the mean and the standard deviations, as well as the format of the output. Checking the Show Average checkbox adds an additional line to the output, listing the average of each column. With the Remove filter classnames checkbox, one can remove the filter name and options from processed datasets (filter names in Weka can be quite lengthy). The following formats are supported:

- CSV
- GNUPlot
- HTML
- LaTeX
- Plain text (default)
- Significance only

[Screenshot: the Output Format dialog, with fields for Mean Precision and StdDev Precision.]
x attribute, which contains the 1-based index of this value:

<value index="1">5.1</value>

10.4 Compression

Since the XML representation takes up considerably more space than the rather compact ARFF format, one can also compress the data via gzip. Weka automatically recognizes a file as being gzip compressed if the file's extension is .xrff.gz instead of .xrff. The Weka Explorer, Experimenter and command line allow one to load/save compressed and uncompressed XRFF files (this applies also to ARFF files).

10.5 Useful features

In addition to all the features of the ARFF format, the XRFF format contains the following additional features:

- class attribute specification
- attribute weights

10.5.1 Class attribute specification

Via the class="yes" attribute in the attribute specification in the header, one can define which attribute should act as class attribute: a feature that can be used on the command line as well as in the Experimenter (which now can also load other data formats), removing the limitation of the class attribute always having to be the last one. Snippet from the iris dataset:

<attribute class="yes" name="class" type="nominal">

10.5.2 Attribute weights

Attribute weights are stored in an attribute's meta-data tag in the header section. Here is an example of the petalwidth attribute with a weight of 0.9:

<attribute name="petalwidth" type="numeric">
<metadata>
<property na
y itself. One way to clarify this situation is to enclose the classifier specification, including all parameters, in double quotes, like this:

java weka.classifiers.meta.ClassificationViaRegression \
  -W "weka.classifiers.functions.LinearRegression -S 1" \
  -t data/iris.arff -x 2

However, this does not always work, depending on how the option handling was implemented in the top-level classifier. While for Stacking this approach would work quite well, for ClassificationViaRegression it does not: we get the dubious error message that the class weka.classifiers.functions.LinearRegression -S 1 cannot be found. Fortunately, there is another approach: all parameters given after -- are processed by the first sub-classifier; another -- lets us specify parameters for the second sub-classifier, and so on.

java weka.classifiers.meta.ClassificationViaRegression \
  -W weka.classifiers.functions.LinearRegression \
  -t data/iris.arff -x 2 -- -S 1

In some cases, both approaches have to be mixed, for example:

java weka.classifiers.meta.Stacking \
  -B "weka.classifiers.lazy.IBk -K 10" \
  -M "weka.classifiers.meta.ClassificationViaRegression -W weka.classifiers.functions.LinearRegression -- -S 1" \
  -t data/iris.arff -x 2

Notice that while ClassificationViaRegression honors the -- parameter, Stacking itself does not. Sadly, the option handling for sub-classifier specifications is not yet completely unified within WEKA, but hopefully one or the other approach m
yout menu runs a graph layout algorithm on the network and tries to make the graph a bit more readable. When the menu item is selected, the node size can be specified, or left to be calculated by the algorithm based on the size of the labels (by deselecting the custom node size check box).

[Screenshot: the Graph Layout Options dialog, with a Custom Node Size check box (Width 154, Height 32) and a Layout Graph button.]

The Show Margins menu item makes marginal distributions visible. These are calculated using the junction tree algorithm [22]. Marginal probabilities for nodes are shown in green next to the node. The value of a node can be set (right-click the node, choose set evidence, and select a value); the color is then changed to red to indicate that evidence is set for the node. Rounding errors may occur in the marginal probabilities.

[Screenshot: the Bayes Network Editor on the iris network, with evidence set for the class node (Iris-setosa at probability 0.9999) and marginal distributions shown next to the other nodes.]

The Show Cliques menu item makes visible the cliques that are used by the junction tree algorithm. Cliques are visualized using colored undirected edges. Both margins and cliques can be shown at the same time, but that makes for rather crowded graphs.

[Screenshot: the Bayes Network Editor