WEKA Explorer User Guide for Version 3-3

1. The University of Waikato (Te Whare Wānanga o Waikato)

WEKA Explorer User Guide for Version 3-3-4
Richard Kirkby
July 3, 2002
© 2002 University of Waikato

Contents

1 Launching WEKA
2 The WEKA Knowledge Explorer
    Section Tabs
    Log Box
    Status Box
    WEKA Status Icon
3 Preprocessing
    Opening files
    The Base Relation and the Working Relation
    Working With Attributes
    Working With Filters
4 Classification
    Selecting a Classifier
    Test Options
    The Class Attribute
    Training a Classifier
    The Classifier Output Text
    The Result List
5 Clustering
    Selecting a Clusterer
    Cluster Modes
    Ignoring Attributes
    Learning Clusters
6 Associating
    Setting Up
    Learning Associations
7 Selecting Attributes
    Searching and Evaluating
    Options
    Performing Selection
8 Visualizing
    Changing the Vi
2. anges the y-axis. The "X" and "Y" written beside the strips show what the current axes are ("B" is used for "both X and Y").

Above the attribute strips is a slider labelled Jitter, which is a random displacement given to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different to just a single lonely instance.

Selecting Instances

There may be situations where it is helpful to select a subset of the data using the visualization tool. (A special case of this is the UserClassifier, which lets you build your own classifier by interactively selecting instances.)

Below the y-axis selector button is a drop-down list button for choosing a selection method. A group of data points can be selected in four ways:

  1. Select Instance: Clicking on an individual data point brings up a window listing its attributes. If more than one point appears at the same location, more than one set of attributes is shown.
  2. Rectangle: You can create a rectangle, by dragging, that selects the points inside it.
  3. Polygon: You can build a free-form polygon that selects the points inside it. Left-click to add vertices to the polygon, right-click to complete it. The polygon will always be closed off by connecting the first point to the last.
  4. Polyline: You can build a polyline that distinguishes the poi
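
The effect of the Jitter slider described above can be illustrated with a small sketch. The guide does not say how WEKA draws the displacement, so the uniform offset used here, the function name `jitter_points` and the `amount` parameter are all assumptions made for illustration only.

```python
import random


def jitter_points(points, amount, seed=0):
    """Return a copy of `points` with each coordinate shifted by a random
    offset drawn uniformly from [-amount, amount].

    Assumption: a uniform offset. WEKA's own jitter may be computed
    differently; this only shows why jitter separates overlapping points.
    """
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Five instances that sit on exactly the same spot in the plot.
points = [(1.0, 2.0)] * 5
jittered = jitter_points(points, amount=0.1)

# Without jitter there is one visible location; with jitter there are five.
print(len(set(points)), len(set(jittered)))
```

With `amount` set to zero the points are returned unchanged, which corresponds to leaving the slider at its leftmost position.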
3. een set, click the Start button. When complete, right-clicking on an entry in the result list allows the results to be viewed or saved.

7 Selecting Attributes

Searching and Evaluating

Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute evaluator and a search method. The evaluator determines what method is used to assign a worth to each subset of attributes. The search method determines what style of search is performed.

Options

The Attribute Selection Mode box has two options:

  1. Use full training set: The worth of the attribute subset is determined using the full set of training data.
  2. Cross-validation: The worth of the attribute subset is determined by a process of cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when shuffling the data.

As with Classify (Section 4), there is a drop-down box that can be used to specify which attribute to treat as the class.

Performing Selection

Clicking Start starts running the attribute selection process. When it is finished, the results are output into the result area, and an entry is added to the result list. Right-clicking on the result list gives several options. The first three (View in main window, View in separate window and Save result buffer) are the same as for the classi
4. ew
    Selecting Instances

1 Launching WEKA

The WEKA GUI Chooser window is used to launch WEKA's graphical environments. At the bottom of the window are three buttons:

  1. Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands. The book Data Mining (Witten and Frank 2000) covers the command-line interface to WEKA.
  2. Explorer: An environment for exploring data with WEKA. This User Manual focuses on using the Explorer.
  3. Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes. There is a separate tutorial document for this environment.

If you launch WEKA from a terminal window, some text begins scrolling in the terminal. Ignore this text unless something goes wrong, in which case it can help in tracking down the cause.

2 The WEKA Knowledge Explorer

Section Tabs

At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started only the first tab is active; the others are greyed out. This is because it is necessary to open a data set before starting to explore the data.

The tabs are as follows:

  1. Preprocess: Choose and modify the data being acted on.
  2. Classify: Train and test learning schemes that classify or perform regression.
  3. Cluster: Learn clusters for the data.
  4. Associate: Learn association rules for the data.
  5. Select attributes: Select
5. fy panel. If you have used an attribute transformer such as PrincipalComponents, a fourth item is active: Visualize transformed data.

8 Visualizing

WEKA's visualization section allows you to visualize a 2D plot of the current working relation. We described above how to visualize particular results in a separate window; the same visualization controls are used.

Changing the View

Data points are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is used for the y-axis.

Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up.

To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an attribute strip changes the x-axis to that attribute, whereas right-clicking ch
6. he clusters once training is complete. When dealing with datasets that are so large that memory becomes a problem, it may be helpful to disable this option.

Ignoring Attributes

Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.

Learning Clusters

The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list. These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up a similar menu, except that it shows only one visualization option: Visualize cluster assignments.

6 Associating

Setting Up

WEKA presently has just one scheme for learning associations, called Apriori. Clicking the text field in the Associator box at the top of the window brings up the settings for Apriori; there are no other associators to choose from. Also, there are no extra options for testing the learning scheme.

Learning Associations

Once appropriate parameters for Apriori have b
7. ions, relation name, instances, attributes and test mode that were involved in the process.
  2. Classifier model (full training set): A textual representation of the classification model that was produced on the full training data.
  3. The results of the chosen test mode are broken down thus:
  4. Summary: A list of statistics summarizing how accurately the classifier was able to predict the true class of the instances under the chosen test mode.
  5. Detailed Accuracy By Class: A more detailed per-class breakdown of the classifier's prediction accuracy.
  6. Confusion Matrix: Shows how many instances have been assigned to each class. Elements show the number of test examples whose actual class is the row and whose predicted class is the column.

The Result List

After training several classifiers, the result list will contain several entries. Left-clicking the entries flicks back and forth between the various results that have been generated. Right-clicking an entry invokes a menu containing these items:

  1. View in main window: Shows the output in the main window (just like left-clicking the entry).
  2. View in separate window: Opens a new independent window for viewing the results.
  3. Save result buffer: Brings up a dialog allowing you to save a text file containing the textual output.
  4. Load model: Loads a pre-trained model object from a binary file.
  5. Save model: Saves a model object to a binary file. Objects are saved in Ja
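
The row and column convention of the Confusion Matrix section above (actual class as the row, predicted class as the column) can be sketched as follows. The function name `confusion_matrix`, the label order and the toy predictions are hypothetical; this is not WEKA's own code.

```python
def confusion_matrix(actual, predicted, labels):
    """Count test examples by (actual class, predicted class).

    Rows are indexed by the actual class and columns by the predicted
    class, matching the convention described in the guide.
    """
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


# Hypothetical test results for a two-class problem.
actual = ["yes", "yes", "no", "no", "no"]
predicted = ["yes", "no", "no", "no", "yes"]
matrix = confusion_matrix(actual, predicted, ["yes", "no"])

# Correct predictions lie on the diagonal.
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / len(actual)
print(matrix, accuracy)
```

Reading the first row of `matrix`: of the two instances whose actual class is "yes", one was predicted "yes" and one "no".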
8. ng relation changes, but the base relation does not. When we go on to perform other actions, such as building a classifier or visualizing the data, we are always acting on the working relation.

The boxes describing the relations have three entries:

  1. Relation: The name of the relation, as given in the file it was loaded from. Filters (described below) modify the name of a relation.
  2. Instances: The number of instances (data points/records) in the data.
  3. Attributes: The number of attributes (features) in the data.

Working With Attributes

Below the Base relation box is a box titled Attributes in base relation. There are three buttons, and beneath them is a list of the attributes in the current base relation. The list has three columns:

  1. No.: A number that identifies the attribute in the order they are specified in the data file.
  2. Selection tick boxes: These allow you to select which attributes are present in the working relation.
  3. Name: The name of the attribute, as it was declared in the data file.

When you click on different rows in the list of attributes, the fields change in the box to the right titled Attribute information for base relation. This box displays the characteristics of the currently highlighted attribute in the list:

  1. Name: The name of the attribute, the same as that given in the attribute list.
  2. Type: The type of attribute, most commonly Nominal or Numeric.
  3. Missing: The number and
9. nts on one side from those on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as opposed to a polygon, which is always closed).

Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area. Clicking on the Clear button erases the selected area without affecting the graph.

Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous removals and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.

References

Drummond, C. and Holte, R. (2000) Explicitly representing expected cost: An alternative to ROC representation. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Witten, I. H. and Frank, E. (2000) Data Mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco.
10. percentage of instances in the data for which this attribute is missing (unspecified).
  4. Distinct: The number of different values that the data contains for this attribute.
  5. Unique: The number (and percentage) of instances in the data having a value for this attribute that no other instances have.

Below these statistics is a list showing more information about the values stored in this attribute, which differ depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute along with the number of instances that have that value. If the attribute is numeric, the list gives four statistics describing the distribution of values in the data: the minimum, maximum, mean and standard deviation.

Returning to the attribute list, to begin with all the tick boxes are ticked. They can be toggled on/off by clicking on them individually. The three buttons above can also be used to change the selection:

  1. All: All boxes are ticked.
  2. None: All boxes are cleared (unticked).
  3. Invert: Boxes that are ticked become unticked and vice versa.

Note: The attribute tick boxes are treated like a filter that is applied before any other filters. As with any filter (see below), any changes you make do not take effect until the Apply Filters button is clicked.

Working With Filters

The preprocess section allows filters to be defined that transform the data in various ways. The Filters box is used to set up the filte
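
The Missing, Distinct and Unique statistics defined above can be reproduced for a single attribute with a short sketch. The function name `attribute_stats` is hypothetical, and two assumptions are made: missing values are marked with "?" (the marker used in ARFF files), and percentages are taken over all instances of a non-empty column.

```python
from collections import Counter


def attribute_stats(values, missing="?"):
    """Summarize one attribute column (assumed non-empty).

    missing:  instances with no value for the attribute
    distinct: number of different values present
    unique:   instances whose value is shared with no other instance
    """
    present = [v for v in values if v != missing]
    counts = Counter(present)
    n = len(values)
    n_missing = n - len(present)
    n_unique = sum(1 for c in counts.values() if c == 1)
    return {
        "missing": n_missing,
        "missing_pct": 100.0 * n_missing / n,
        "distinct": len(counts),
        "unique": n_unique,
        "unique_pct": 100.0 * n_unique / n,
    }


# Hypothetical nominal attribute with one missing value.
stats = attribute_stats(["sunny", "rainy", "sunny", "?", "overcast"])
print(stats)
```

Here "sunny" occurs twice, so it counts towards Distinct but not towards Unique.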
11. poses.

The Class Attribute

The classifiers in WEKA are designed to be trained to predict a single "class" attribute, which is the target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes (regression problems); still others can learn both.

By default, the class is taken to be the last attribute in the data. If you want to train a classifier to predict a different attribute, click on the box below the Test options box to bring up a drop-down list of attributes to choose from.

Training a Classifier

Once the classifier, test options and class have all been set, the learning process is started by clicking on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the training process at any time by clicking on the Stop button.

When training is complete, several things happen. The Classifier output area to the right of the display is filled with text describing the results of training and testing. A new entry appears in the Result list box. We look at the result list below, but first we investigate the text that has been output.

The Classifier Output Text

The text in the Classifier output area has scroll bars allowing you to browse the results. Of course, you can also resize the Explorer window to get a larger display area. The output is split into several sections:

  1. Run information: A list of information giving the learning scheme opt
12. rs that are required. At the top of the Filters box is a text field with the name of a filter followed by some options. Clicking on this box brings up a GenericObjectEditor dialog box.

The GenericObjectEditor Dialog Box

The GenericObjectEditor dialog box lets you choose a filter and configure its options. The same kind of dialog box is used for other objects, such as classifiers and clusterers (see below).

A single left-click on the filter name at the top of the window brings up a drop-down list of all of the filters. Click on the one you want. When a filter is chosen, the fields in the window change to reflect the available options. Clicking on any of these gives an opportunity to alter its setting. For example, the setting may take a text string, in which case you type the string into the text field provided. Or it may give a drop-down box listing several states to choose from. Or it may do something else, depending on the information required.

Some objects display a brief description of what they do in an About box, along with a More button. Clicking on the More button brings up a window describing what the different options do.

At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open and Save, allow object configurations to be stored for future use. The Cancel button backs out without remembering any changes that have been made. Once you are happy with the object and settings you have chosen, click OK to ret
13. sifier is evaluated on how well it predicts the class of a set of instances loaded from a file. Clicking the Set button brings up a dialog allowing you to choose the file to test on.
  3. Cross-validation: The classifier is evaluated by cross-validation, using the number of folds that are entered in the Folds text field.
  4. Percentage split: The classifier is evaluated on how well it predicts a certain percentage of the data, which is held out for testing. The amount of data held out depends on the value entered in the % field.

Further testing options can be set by clicking on the More options button:

  1. Output model: The classification model on the full training set is output so that it can be viewed, visualized, etc.
  2. Output per-class stats: The precision/recall and true/false statistics for each class are output.
  3. Output entropy evaluation measures: Entropy evaluation measures are included in the output.
  4. Output confusion matrix: The confusion matrix of the classifier's predictions is included in the output.
  5. Store predictions for visualization: The classifier's predictions are remembered so that they can be visualized.
  6. Cost-sensitive evaluation: The errors are evaluated with respect to a cost matrix. The Set button allows you to specify the cost matrix used.
  7. Random seed for xval / % Split: This specifies the random seed used when randomizing the data before it is divided up for evaluation pur
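
The interaction between the Folds field and the random seed described above can be sketched as follows. This is a simple unstratified split written for illustration; the function name `cross_validation_folds` is hypothetical, and WEKA's own fold assignment may differ (for example, it may balance classes across folds).

```python
import random


def cross_validation_folds(n_instances, n_folds, seed=1):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    The instance indices are shuffled once using `seed`, then dealt into
    `n_folds` groups. Each group is held out for testing exactly once.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Ten instances, five folds: each round trains on 8 and tests on 2.
splits = list(cross_validation_folds(10, 5, seed=1))
for train, test in splits:
    print(sorted(test), len(train))
```

Running the split again with the same seed reproduces the same folds, which is why the seed is exposed as an option.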
14. the most relevant attributes in the data.
  6. Visualize: View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on which the respective actions can be performed. The bottom area of the window (from the log box downwards) stays visible regardless of which section you are in.

Log Box

Near the bottom of the window is the log box. It contains a scrollable text field. Each line of text is stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a record of what has happened.

Status Box

The status box appears at the very bottom of the window, below the log box. It displays messages that keep you informed about what's going on. For example, if the Explorer is busy loading a file, the status box will say that.

TIP: right-clicking the mouse anywhere inside the status box brings up a little menu. The menu gives two options:

  1. Available memory: Display in the log box the amount of memory available to WEKA.
  2. Run garbage collector: Force the Java garbage collector to search for memory that is no longer needed and free it up, allowing more memory for new tasks. Note that the garbage collector is constantly running as a background task anyway.

WEKA Status Icon

To the right of the status box is the WEKA status icon. When no processes are running, the bird sits down and takes a nap. The number beside the x symbol gives the n
15. umber of concurrent processes running. When the system is idle it is zero, but it increases as the number of processes increases. When any process is started, the bird gets up and starts moving around. If it's standing but stops moving for a long time, it's sick: something has gone wrong! In that case you should restart the WEKA Explorer.

3 Preprocessing

Opening files

The first three buttons at the top of the preprocess section enable you to load data into WEKA:

  1. Open file: Brings up a dialog box allowing you to browse for the data file on the local filesystem.
  2. Open URL: Asks for a Uniform Resource Locator address for where the data is stored.
  3. Open DB: Reads data from a database.

The easiest and most common way of getting data into WEKA is to store it as an Attribute-Relation File Format (ARFF) file, and load it using the Open file button. ARFF files typically have a .arff extension.

The Base Relation and the Working Relation

Just below the row of buttons there are two boxes, titled Base relation and Working relation. The base relation is the unmodified relation (or data) that has been loaded into WEKA. The working relation is a copy of the base relation, complete with any modifications that have been made with filters in the preprocess panel.

When a relation is first loaded into WEKA, the working relation is the same as the base relation. As soon as any filters are applied to the data, the worki
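
For readers who have not seen an ARFF file, the sketch below embeds a very small one and counts the three quantities that the Base relation box reports: relation name, instances and attributes. The relation, attribute names and data rows are illustrative only, and `summarize_arff` is a hypothetical toy reader that ignores quoting, sparse data and other features of the real format.

```python
# A minimal ARFF file: a header (@relation, @attribute lines) then @data rows.
ARFF_TEXT = """\
% Lines starting with a percent sign are comments.
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny,85,no
overcast,83,yes
rainy,70,yes
"""


def summarize_arff(text):
    """Return (relation name, [(attribute name, type)], instance count).

    Toy reader for illustration: assumes one declaration or data row per
    line and no quoted names containing spaces.
    """
    relation, attributes, n_instances = None, [], 0
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            n_instances += 1
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, n_instances


relation, attributes, n_instances = summarize_arff(ARFF_TEXT)
print(relation, n_instances, len(attributes))
```

The last declared attribute (`play` here) is the one WEKA treats as the class by default, as noted in the Classification section.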
16. urn to the main Explorer window.

Applying Filters

The general process for setting up filters is to choose the desired filter and its options, then click on the Add button to add it to the list. The filters are only applied when the Apply Filters button is clicked, and they are applied in the order shown in the list. At the bottom of the list, the Delete button removes any selected filters from the list.

The Replace button at the top of the preprocess section replaces the base relation with the current working relation, making the changes permanent (at least until a new file is loaded). Finally, the Save button at the top right of the window saves the working relation to an ARFF file, allowing it to be kept for future use.

4 Classification

Selecting a Classifier

At the top of the classify section is the Classifier box. This box has a text field that gives the name of the currently selected classifier, and its options. Clicking on the text box brings up a GenericObjectEditor dialog box, just the same as for filters. This allows you to choose one of the classifiers that are available in WEKA and configure it.

Test Options

The result of applying the chosen classifier will be tested according to the options that are set by clicking in the Test options box. There are four test modes:

  1. Use training set: The classifier is evaluated on how well it predicts the class of the instances it was trained on.
  2. Supplied test set: The clas
17. ust be greater than 0.5 for the instance to be predicted as "positive". The plot can be used to visualize the precision/recall tradeoff, for ROC curve analysis (true positive rate vs. false positive rate), and for other types of curves.
  11. Visualize cost curve: Generates a plot that gives an explicit representation of the expected cost, as described by Drummond and Holte (2000).

Options are greyed out if they do not apply to the specific set of results.

5 Clustering

Selecting a Clusterer

By now you will be familiar with the process of selecting and configuring objects. Clicking on the clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor dialog with which to choose a new clustering scheme.

Cluster Modes

The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first three options are the same as for classification: Use training set, Supplied test set and Percentage split (Section 4), except that now the data is assigned to clusters instead of trying to predict a specific class. The fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.

An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines whether or not it will be possible to visualize t
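
The threshold rule stated at the top of this passage (predict "positive" only when the predicted probability is greater than the threshold) is what a threshold curve sweeps over. The sketch below computes one point of such a curve. The function name `threshold_point`, the toy probabilities and the convention of reporting precision as 1.0 when nothing is predicted positive are assumptions, not taken from WEKA.

```python
def threshold_point(probabilities, actual_positive, threshold=0.5):
    """Return precision, recall (true positive rate) and false positive
    rate obtained when an instance is predicted positive only if its
    predicted probability is greater than `threshold`."""
    tp = fp = fn = tn = 0
    for p, is_positive in zip(probabilities, actual_positive):
        predicted_positive = p > threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return {
        # Assumed convention: precision is 1.0 when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 1.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "fp_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Hypothetical predicted probabilities of "positive" and the true classes.
probs = [0.9, 0.8, 0.6, 0.4, 0.2]
truth = [True, True, False, True, False]

default = threshold_point(probs, truth)        # threshold 0.5
lenient = threshold_point(probs, truth, 0.3)   # more instances called positive
strict = threshold_point(probs, truth, 0.7)    # fewer instances called positive
print(default, lenient, strict)
```

In this toy data, lowering the threshold raises recall at the cost of precision relative to the strict setting, which is the tradeoff the plot makes visible.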
18. va "serialized object" form.
  6. Re-evaluate model on current test set: Takes the model that has been built and tests its performance on the data set that has been specified with the Set button under the Supplied test set option.
  7. Visualize classifier errors: Brings up a visualization window that plots the results of classification. Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as squares.
  8. Visualize tree: Brings up a graphical representation of the structure of the classifier model, if possible. This is only possible with certain classifiers. You can bring up a menu by right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in.
  9. Visualize margin curve: Generates a plot illustrating the prediction margin. The margin is defined as the difference between the probability predicted for the actual class and the highest probability predicted for the other classes. For example, boosting algorithms may achieve better performance on test data by increasing the margins on the training data.
  10. Visualize threshold curve: Generates a plot illustrating the tradeoffs in prediction that are obtained by varying the threshold value between classes. For example, with the default threshold value of 0.5, the predicted probability of "positive" m
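
The margin definition given under Visualize margin curve translates directly into a small calculation. The sketch below assumes the classifier's output for one instance is available as a mapping from class label to predicted probability; the function name `prediction_margin` and the example probabilities are hypothetical.

```python
def prediction_margin(class_probabilities, actual_class):
    """Margin = probability predicted for the actual class minus the
    highest probability predicted for any other class.

    A positive margin means the instance is classified correctly;
    a negative margin means some other class was preferred.
    """
    p_actual = class_probabilities[actual_class]
    p_best_other = max(
        p for label, p in class_probabilities.items() if label != actual_class
    )
    return p_actual - p_best_other


# Hypothetical predicted class distribution for one instance.
distribution = {"a": 0.7, "b": 0.2, "c": 0.1}

margin_correct = prediction_margin(distribution, "a")  # about 0.5
margin_wrong = prediction_margin(distribution, "b")    # about -0.5
print(margin_correct, margin_wrong)
```

A margin curve would plot these values over all instances, so a classifier with larger margins on the training data shows a curve shifted towards the right.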
