
LightSIDE - School of Computer Science



LightSIDE Text Mining and Machine Learning User's Manual
Elijah Mayfield and Carolyn Penstein Rosé
Spring 2012

LightSIDE Text Mining and Machine Learning User's Manual, © 2012 Carnegie Mellon University. The user manual's co-authors can be contacted at the following addresses: Elijah Mayfield (elijah@cmu.edu) and Carolyn Rosé (cprose@cs.cmu.edu). Work related to this project was funded through the Pittsburgh Science of Learning Center, the Office of Naval Research Cognitive and Neural Sciences Division, the National Science Foundation, Carnegie Mellon University, and others. Special thanks to David Adamson, Philip Gianfortoni, Gregory Dyke, Moonyoung Kang, Sourish Chaudhuri, Yi-Chia Wang, Mahesh Joshi, and Eric Rosé for collaboration and contributions to past versions of SIDE.

SIDE is released under the GPL, version 3. The GNU General Public License is a free, copyleft license for software and other kinds of works. This manual is released under the GFDL, version 1.3. The GNU Free Documentation License is a form of copyleft intended for use on a manual, textbook, or other document, to assure everyone the effective freedom to copy and redistribute it, with or without modifications, either commercially or non-commercially. These licenses are available in full at http://www.gnu.org/licenses.

Table of Contents

A Message from the Author
Installation and Setup
Using LightSIDE: Feature Extraction
Lesson 1: Formatting
To export your feature space to another format, to use in another program such as Weka, click the Export button.

Using LightSIDE: Model Building

Lesson 4: Training a machine learning model from a feature table

1. First, follow the instructions through Lesson 2 to build a feature table.
2. Then, at the top of the LightSIDE window, switch to the Build Model tab.
3. From here, first select the machine learning plugin you wish to use. LightSIDE comes with the Weka toolkit installed by default, as shown in the top left corner of this screen. Weka is an extremely robust machine learning package that offers dozens of implementations of various learning algorithms. The three that are most important to remember are the following:

NaiveBayes (in the bayes folder) assumes independence between variables and is based on a generative probabilistic model of your feature table.

SMO (in the functions folder) is an implementation of support vector machines. This also assumes independence between variables, but is a discriminative classifier based on assigning weight to variables.

J48 (in the trees folder) is an implementation of decision trees, which can model dependence between variables, but are often overwhelmed by very large feature tables such as those used in text mining.

Other algorithms which are commonly used, and may be helpful for certain tasks, are Logistic, MultilayerPerceptron, and Winnow in the functions
Figure 7: Labeling new data with LightSIDE

Lesson 6: Error analysis of a machine learning model in LightSIDE

In some cases, a basic model utilizing unigrams alone, or another simple representation, is sufficient for a model's accuracy. However, in many cases we wish to improve upon this performance. One of the best ways to do this is to understand what shortcomings the current representation has, build new features (either individually or as a class of many features), and use them in addition to the simple representation. LightSIDE offers many tools for understanding the strengths and weaknesses of a model.

1. First, follow the instructions through Lesson 4 in order to have a built model open in LightSIDE.
2. The first step in understanding a model's errors is to check the confusion matrix. This is visible in the top center of the Model Building window. In the confusion m
Figure 2: Loading files into LightSIDE

To load a data set into LightSIDE:

1. Ensure that the CSV file that you are loading follows the conventions described at the start of this lesson.
2. Open LightSIDE, following the instructions in Chapter 2 of this user's manual.
3. You will see the Feature Extraction tab of LightSIDE open. In the top left corner, click the Add button to open a file chooser menu. LightSIDE automatically opens its data folder, which contains sample data files. Select the CSV files that you want to open. In this tutorial, we will open the MovieReviews.csv file.

If at any point you wish to switch datasets without closing LightSIDE, click the Clear button. This will allow you to open new files and create new feature tables independently of ones you have already opened.

Once you select a file, LightSIDE loads it into internal memory and attempts to guess the annotation you are trying to predict and the column that contains text. Its assumptions are given in the Annotation and Text Field dropdown menus.

Opening these menus allows you to tell LightSIDE to use another column instead, by selecting another column. Alternatively, if your data is only columns in the CSV and does not contain text, the Text Field dropdown has a final option, No Text, which will inform LightSIDE of this.

Now that you have loaded your data into LightSIDE, we can build a featur
Figure 8: Feature construction and deactivation in LightSIDE

ing this boolean operator. For instance, combining awful and bad with an OR operator would create a new feature which recognizes either word in a document, effectively merging the two features.

Seq: Takes as arguments two features, and checks whether they occur in consecutive instances. This feature assumes that your data is sequential, such as in a conversation or forum thread. In a set of unrelated data, such as detached movie reviews which are not related to one another, such a feature is not useful.

Ineq: Takes as an argument a single numeric feature. When clicked, this button will open a popup asking you to define a threshold, for instance >0.5. It will then produce a boolean feature which is true every time the numeric value of the selected feature matches that threshold.

To combine features in the Feature Lab, highlight all of the features you would like to combine, then click the button of the appropriate operator.

9. Note that the quality of these features is automatically calculated in columns as soon as they are created. A key feature of these combination functions, especially the boolean operators, is that they can be constructed into larger trees; for instance, a feature such as AND good
For our example feature table, however, model building should only take a few seconds.

10. To stop a model prematurely, because it is taking too long, using too much memory, or performing poorly in cross-validation, click the red button.
11. When the model is finished, your model will appear in the bottom left corner, in the Trained Models list.
12. At the same time, the right-hand side of the window will populate with data about the model. For now, the most important information is in the bottom panel, giving the accuracy (Correctly Classified Instances) and improvement over chance (measured by Kappa statistic). If you followed the instructions in Lesson 2, your performance will be around 75% accuracy and 0.5 kappa.
13. To clean out models that you don't intend to use again, use the Delete or Clear buttons in the bottom left corner.
14. To save a model for future use, or to load a model that you've trained previously, click the Save and Load buttons in the bottom right corner.
15. For more detailed analysis of your model's performance, move on to Lesson 5.

Lesson 5: Adding labels to unlabeled data with a trained model

A key feature of models built with LightSIDE is that they are not just for evaluation of your labeled data set. The models that you train can be used to annotate more data in a fraction of the time that it would otherwise take a trained expert, at roughly the accuracy given in the summary from t
NOT bad. In Figure 8, a rudimentary lexicon of negative terms has been built: OR(awful, bad, hideous, terrible). This feature turns out to have a higher kappa than any of the unigrams individually, which intuitively makes sense: this allows grouping of many weak sources of evidence into a single, stronger source of evidence.

10. To delete component parts that are no longer necessary and are cluttering your view of the lab, highlight them and click the Delete button.
11. When a feature has been built to your satisfaction, move it back into the feature table with the Move to Table button.
12. The revised feature table can now be trained using the same steps outlined in Lesson 4.

Using LightSIDE: Plugin Writing

For users with some Java programming experience and an idea for a more complicated representation of their documents, plugin writing is the next step in the feature space development process, if TagHelper Tools, metafeatures, and the Feature Lab are insufficient.

Begin by opening a new project in Eclipse or another IDE and adding the LightSIDE source folder to your build path. This will give your new project access to the interfaces that are necessary to write a plugin.

LightSIDE internally views a feature table as a set of objects of type FeatureHit. These objects contain within them a Feature object; a value, which can be any object but is usually a double or a String; and an int documentIndex. Th
Figure 1: The standard CSV format that SIDE is designed to read

tation data may not be processed by LightSIDE as you expect.

Finally, consider the segmentation of data that you want to use for cross-validation. This is a process, described at length in Lesson 4, that allows a user to estimate the effectiveness of their trained model on unseen data. One option that LightSIDE gives is to cross-validate by file, allowing the user to explicitly delimit the subsets to be tested independently. For instance, examples might be separated out by occupation of the writer or speaker. If you wish to make such non-random subsets of your data, create multiple CSVs before loading data into LightSIDE.
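Because files with missing cells may not load as expected, it can help to check a CSV mechanically before loading it. The sketch below is not part of LightSIDE; it is a hypothetical validator, written for this manual's conventions only (a header row with text and class columns, every cell filled in), and it splits naively on commas without handling quoted fields.

```java
public class CsvCheckDemo {
    // Check the conventions Lesson 1 describes: a header row containing
    // "text" and "class" columns, the same number of cells in every row,
    // and no empty cells (LightSIDE assumes every cell has some value).
    static boolean followsConventions(String[] rows) {
        String[] header = rows[0].split(",", -1);
        boolean hasText = false, hasClass = false;
        for (String h : header) {
            if (h.equals("text")) hasText = true;
            if (h.equals("class")) hasClass = true;
        }
        if (!hasText || !hasClass) return false;
        for (int i = 1; i < rows.length; i++) {
            String[] cells = rows[i].split(",", -1);
            if (cells.length != header.length) return false;
            for (String c : cells) if (c.isEmpty()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String[] good = {"text,class", "a charming film,POS", "a dreadful mess,NEG"};
        String[] bad  = {"text,class", "a charming film,"}; // missing class value
        System.out.println(followsConventions(good)); // true
        System.out.println(followsConventions(bad));  // false
    }
}
```

A real CSV may contain quoted fields with embedded commas, which this naive split does not handle; it is only meant to illustrate the row-and-column assumptions.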
else, you will have to navigate to it yourself. Type run.sh.

Fedora Core 11 Linux

Click Applications, then Administration. Click Terminal. Type cd Desktop/SIDE to navigate to the location where LightSIDE was extracted. If you saved this folder somewhere else, you will have to navigate to it yourself. Type run.sh.

Using LightSIDE: Feature Extraction

Lesson 1: Formatting your data and loading it into LightSIDE

SIDE reads corpora in the form of CSV files, to ensure maximum portability to different systems or users. In addition, multiple files may be added at once, so long as their column names are identical.

The first row of your CSV file should be the header names for each column. If your data has text, it should be labelled with the column header text. Likewise, the most common name for a class value to predict is class. If your data does not follow these conventions, LightSIDE will do its best to determine which field is your text column and which is your class value; if it is wrong, you will be able to adjust this through the UI, as explained in Lesson 2.

In your CSV file, a row is assumed to correspond to one instance; that row will be assigned one label by the machine learning model that you build in LightSIDE. Before using LightSIDE, ensure that your segmentation into rows is the way that you want it to be; for instance, if you want to classify each sent
your data and loading it into LightSIDE
Lesson 2: Converting your data set into a LightSIDE feature table
Lesson 3: Exploring and analyzing a feature table in LightSIDE
Using LightSIDE: Model Building
Lesson 4: Training a machine learning model from a feature table
Lesson 5: Adding labels to unlabeled data with a trained model
Lesson 6: Error analysis of a trained model in LightSIDE
Lesson 7: Constructing features with LightSIDE's feature lab
Using LightSIDE: Plugin Writing

A Message from the Author

Thanks for using LightSIDE. For beginning or intermediate users of text mining and machine learning, I believe it's the package with the best tradeoff between usability and power of anything created. If you find any outstanding bugs, with the software or with this manual, please contact me and I'll work with you to figure it out. If there's something you want to do with the software that you don't think it currently supports, please contact me and we can talk about how to make it work, and whether it can be added to the next released version. If there's something unusual you want to do that LightSIDE isn't normally designed to handle, please contact me; it's likely I've already tried and found a way to do it. A lot of people have used this software for a lot of unusual things, and I've talked to most of them. You will not be wasting my time; LightSIDE is my job. Instead, hopefully I can make your li
IForSubclass: This method returns any Swing component, and will display it in the bottom right corner for users to edit configuration options. At feature table creation time, a method uiToMemory is called, which by default does nothing. If you wish to move information from the UI to your extraction code, it should be done by overriding this method.

Your plugin also knows about a field, boolean halt. This field is set to true when the red button is clicked to stop feature extraction. To make that button functional, you must set up points in your code for your plugin to fail gracefully and return null.

You've reached the end of the manual.

Language Technologies Institute
School of Computer Science
Carnegie Mellon University
www.lti.cs.cmu.edu
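To make the plugin-writing chapter concrete, here is a compact, self-contained sketch of the extraction loop a plugin performs. The FeatureHit and Feature classes below are simplified stand-ins invented for this example so that it compiles on its own; they are not LightSIDE's real API, whose actual types and methods (FeaturePlugin, extractFeatureHits, Feature.fetchFeature, getConfigurationUIForSubclass) are the ones described in this chapter.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch of a feature extraction loop. The two tiny classes
// below are simplified stand-ins for LightSIDE's FeatureHit and Feature
// types, not the real API.
public class PluginSketch {
    static class Feature { final String name; Feature(String n) { name = n; } }
    static class FeatureHit {
        final Feature feature; final Object value; final int documentIndex;
        FeatureHit(Feature f, Object v, int i) { feature = f; value = v; documentIndex = i; }
    }

    // Emit one boolean hit per document that contains an exclamation mark,
    // a toy stand-in for extractFeatureHits(DocumentListInterface, JLabel).
    static List<FeatureHit> extractHits(List<String> documents) {
        Feature exclaim = new Feature("contains_exclamation");
        List<FeatureHit> hits = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            if (documents.get(i).contains("!")) {
                hits.add(new FeatureHit(exclaim, Boolean.TRUE, i));
            }
            // A real plugin would also poll its halt flag here and return
            // null to stop gracefully when the red button is clicked, and
            // would create Feature objects via the caching fetchFeature
            // factory rather than a constructor.
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("what a film!", "it was fine");
        System.out.println(extractHits(docs).size()); // document 0 fires, document 1 does not
    }
}
```

At feature table construction time, each instance receives the values of its hits, and every dimension with no hit defaults to 0 or false, which is why the loop only emits hits where the feature is present.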
Java 6 VM.

Mac OS X v10.6

Open Finder. Click Applications, then Utilities. Double-click Terminal. Type java -version and press Enter. If an appropriate version of Java is installed on your computer, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing LightSIDE. Otherwise, skip to the next section, Installing the Java 6 VM.

Fedora Core 11 Linux

Click Applications, then Administration. Click Terminal. Type java -version and press Enter. If an appropriate version of Java is installed on your computer, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing LightSIDE. Otherwise, skip to the next section, Installing the Java 6 VM.

Installing the Java 6 VM

If you are using a computer running Mac OS X, then you can install the Java 6 VM through the Software Update utility. Open this program by clicking on the Apple icon in the top left corner and selecting the Software Update option. Install the JRE 6 with the highest update version available for your computer.

If you are using a computer running Windows, Linux, or any other operating system, you will need to download the appropriate file direct
When clicked, the bottom confusion matrix in LightSIDE displays the following information: the distribution of the feature becomes, broken down by actual and predicted label.

What does this tell us? The distribution shows that among the 113 documents correctly identified as NEG, the term becomes only appeared in 15.9% of those documents. On the other hand, in documents correctly classified as positive, the term occurs in 24.6% of the documents. And for documents in the error cell that we've highlighted? It occurs in a much higher 35.1% of the 37 documents, working out to 13 of those 37. This is likely to mean that the word becomes has a tendency to occur in contexts which are predictive of sentiment, but can be deceptive.

Vertical Comparison performs this same comparison, but instead of measuring against the diagonal across from an error cell, it measures against the diagonal cell above or below the highlighted cell.

5. One way to better understand these aggregate statistics is to look at examples in text. For this, click the Selected Documents Display tab at the bottom right of the LightSIDE window.
6. This display shows a list of instances on the left-hand side, and focuses in on a single clicked instance on the right-hand side. To test this, click a line in the list. The LightSIDE window can be stretched to give more room to the instance being highlighted.
7. The left-hand list shows the predicted and actual label fo
atrix, columns corresponding to the possible labels you are trying to classify show your model's predictions. Rows correspond to the same labels, but show the true value of the data. For instance, a cell at the intersection of row POS and column NEG shows the number of instances that were originally labeled POS, but that the model incorrectly classified as NEG. In our test example, there were 36 such cases.

3. Click a cell in the confusion matrix. This populates the list on the top right of the Model Building window. This list now shows the most confusing features, as measured by multiple metrics. The features in this list identify those features which make the instances in the highlighted cell most deceiving. For instance, a feature which is highly predictive of POS might sometimes occur in a different context, in NEG documents. If this happened, but not frequently enough to decrease the model's certainty of a feature's predictiveness, it is likely to occur unusually frequently in the clicked cell, bringing these features to the top of the list.

4. Click a feature in the top right list to populate the bottom confusion matrix, located in the center of the LightSIDE window. This shows the distribution of that feature in the instances in each cell.

Example: When reading the value of Horizontal Comparison near the top of the list, for predicted POS documents which were actually NEG, we find the word becomes.
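The rows-are-actual, columns-are-predicted layout can be sketched in a few lines. The counts below are hypothetical, chosen only to include the 36 misclassified POS instances from the example; the other cells are made up for illustration.

```java
public class ConfusionDemo {
    // Rows hold the true labels, columns the model's predictions,
    // matching the layout of LightSIDE's confusion matrix.
    static final String[] LABELS = {"NEG", "POS"};

    static int cell(int[][] matrix, String actual, String predicted) {
        return matrix[index(actual)][index(predicted)];
    }

    static int index(String label) {
        for (int i = 0; i < LABELS.length; i++) if (LABELS[i].equals(label)) return i;
        throw new IllegalArgumentException(label);
    }

    public static void main(String[] args) {
        int[][] m = { {120, 30},    // actual NEG: 120 correct, 30 predicted POS
                      {36, 114} };  // actual POS: 36 predicted NEG, 114 correct
        // Row POS, column NEG: instances labeled POS but classified as NEG.
        System.out.println(cell(m, "POS", "NEG")); // 36
    }
}
```

The diagonal cells hold correct classifications; everything off the diagonal is an error cell worth clicking on.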
e table, move on to Lesson 2.

Lesson 2: Converting your data set into a LightSIDE feature table

The process of converting text data into a feature table in LightSIDE is done through extractor plugins. Three plugins come available by default. They are the TagHelperTools plugin, for extracting n-grams from text; the Metafeatures plugin, for converting additional columns from a CSV file into features; and the Regex Search plugin, for converting individual searches into features. Each will be explained by the end of this lesson.

To build a feature table in LightSIDE:

1. First, follow the instructions from Lesson 1 to open a data set in LightSIDE.
2. In the middle of the left-hand panel, for now, ensure that TagHelper Tools is selected in the Feature Extractor Plugins list.

In the Name field, type in the name that you intend to refer to the extracted feature table by. The default name is simply features.

The Threshold field sets the minimum number of documents that a feature must appear in to be included in your feature table, speeding up processing. The default threshold requires a feature to appear in at least 5 documents.

When TagHelper Tools is selected, study the Configure Plugins pane. The following options are available.

First, in the left-hand column, are types of features to extract:

Unigrams: Extracts a bag-of-words representation from your text field. Each feature corresponds to a single word, with a value o
ed creating searches, create a new feature table, or append the searches to your existing feature table, as in steps 7 and 9 above. Now that you have extracted features from your data, you may explore the resulting table in more detail (Lesson 3) or move straight to model building (Lesson 4).

Lesson 3: Exploring and analyzing a feature table in LightSIDE

A key goal of LightSIDE is the ability not just to perform machine learning, but to understand the effectiveness of different approaches to representation of data. This lesson introduces the metrics used by LightSIDE to judge feature quality, and the filtering options for searching for features in a table.

Understanding feature quality metrics in LightSIDE

1. First, follow the instructions from Lessons 1 and 2 to create a feature table. In this lesson, we are using a unigram and bigram feature space.
2. Now examine the feature table in the top right section of LightSIDE. The window can be stretched to show more columns. The following feature information and metrics are given:

From: The shortened name of the extractor plugin which created this feature.
Feature Name: The name of the feature in this row.
Type: The type of value that this feature can store. Options are boolean (true/false), numeric, and nominal (one of a set number of possible options).
Predictor Of: The simplest metric, this simply tells which annotation label the presence of this feature is m
ence of a document separately, each sentence must be separated into rows prior to loading your CSV into LightSIDE.

If you would like to include preprocessed features in your machine learning model, they should be included as additional columns in your CSV file. These can later be read in to LightSIDE using the Metafeature extractor, as explained in Lesson 2.

A key assumption made by LightSIDE is that in your data, every cell is filled in with some value. Files with missing anno
ese feature hits are specific to a given data file, passed to your plugin as a DocumentListInterface. A Feature object is a more abstract concept, representing the dimension in your feature table that this feature hit corresponds to. They need to know their String featureName; their Type, from the options NUMERIC, BOOLEAN, and NOMINAL; and, if they are nominal, they need to know what possible values they can take.

Each instance in a document list can be represented as the set of all feature hits with that instance's document index. At feature table construction time, that instance will be given the value of those hits in its feature space; all other dimensions will be set to 0 or false.

To write a feature extraction plugin, create a new class extending FeaturePlugin. This is an abstract class that already includes some functionality. However, you must provide the following methods:

String getOutputName(): The short prefix string displayed throughout the LightSIDE GUI for features created by your plugin.

List<FeatureHit> extractFeatureHits(DocumentListInterface documents, JLabel update): This method must iterate through your data file, extracting FeatureHit objects for each instance, creating Feature objects as necessary. Creating a new feature should be done through the static method Feature.fetchFeature(String prefix, String name, Feature.Type type), in order to allow caching.

Component getConfigurationU
f new features which incorporate relationships between existing features.

Deactivating features from a feature table

1. Before beginning, ensure that a feature table has been built in LightSIDE.
2. If a feature seems to be detrimental to performance, highlight it in the feature table exploring interface in the Extract Features tab, then click the De/activate button. It will turn red.
3. Reversing this decision can be done by clicking the De/activate button to reactivate any highlighted features.
4. Once you have settled on what features to remove, click Freeze. A new feature table will now appear in the bottom left corner, which does not contain the deactivated features.
5. This new feature table can be passed on to the Model Building window.

Constructing new features with the Feature Lab

6. If you have an idea for a new feature combining multiple sources of evidence, click the Feature Lab tab on the bottom of the screen.

Now, for each component of your new feature, find it in the top panel (filtering will be helpful), highlight it, and click the Move to Lab button.

The Feature Lab allows combinations of features using various tools:

OR, AND, XOR, NOT: Takes as arguments any number of features, and combines them us
f true if that feature is present, and false if it is not.

Bigrams: Identical functionality to unigrams, but checks for adjacent pairs of words.

POS Bigrams: Identical functionality to bigrams, but checks the part-of-speech tags for each word, rather than the surface form of the word itself.

Punctuation: Creates features for periods, quotation marks, and a variety of other punctuation marks, functioning identically to n-gram features.

Line length: Creates a single numeric feature representing the number of words in an instance.

Contains NON-Stopwords: Creates a single boolean feature representing whether an instance has any contentful words in it.

Second, in the right-hand column, are configuration options about how these features should be extracted:

Treat above features as binary: When unchecked, instead of a boolean feature, the extracted features will be numeric, and represent the number of times a feature occurs in an instance.

Remove Stopwords: This prevents features from being extracted which correspond to around 100 common words, such as the or and, which tend to be unhelpful for classification.

Stem: Consolidates features which represent the same word in different forms, such as run, running, and runs, into a single feature.

For now, leave the options set to the defaults: unigrams and treat above features as binary should be selected. To create a new feature table, click
fe easier, by ensuring that you get the best possible experience out of the program. Happy trails,

Elijah Mayfield
elijah@cmu.edu

Installation and Setup

Checking your Java VM

In order to use LightSIDE, your computer must have a Java Virtual Machine installed, with support for at least Java 6. You must first ensure that you have the appropriate JVM installed. Below you will find instructions for checking your JVM on Windows XP, Mac OS X v10.5, and Fedora Core 11 Linux. Other operating systems should follow a similar general process.

Windows XP

Click Start, then Run. In the Run dialog, type cmd and click OK. Type java -version and press Enter. If an appropriate version of Java is installed on your computer, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing LightSIDE. Otherwise, skip to the next section, Installing the Java 6 VM.

Windows 7

Open the Start menu, then search for cmd. Click the cmd icon. Type java -version and press Enter. If an appropriate version of Java is installed on your computer, you will receive a response which includes, somewhere in the text, java version "1.6.0". If your computer gives a similar response to this, you may proceed to installing LightSIDE. Otherwise, skip to the next section, Installing the
folder, and JRip in the rules folder.

Figure 6: Model building in LightSIDE

Advanced users may wish to use AttributeSelectedClassifier, Bagging, Stacking, or AdaBoostM1, located in the meta folder, which allow you to perform ensemble learning or feature selection within LightSIDE.

4. For this tutorial, we will choose the SMO classifier. Ensure that it is selected in Weka's Choose menu.
5. Select the feature table that you would li
he Model Building panel.

1. First, follow the instructions through Lesson 4 to train a machine learning model.
2. Then, at the top of the LightSIDE window, switch to the Predict Labels tab.
3. Select the CSV file that you wish to annotate in the top left corner, with the Add button. If you are extracting metafeatures from the CSV, column titles must match exactly in order for feature spaces to align between documents.
4. Ensure that the Text Field option has correctly guessed the column to use for text input, or choose No Text if you are not using text feature extraction.
5. Choose which model will annotate your data in the Model to apply field.
6. Select a name for the predicted column in the Annotation Name field.
7. Click Predict to use the selected model to annotate the documents in the selected file.
8. When annotation is finished, the predicted labels will be displayed in the right window, next to the document they correspond to.
9. If you wish to export these predicted labels to a new CSV document, click the Export button in the bottom right corner.

If the performance that you're getting so far is insufficient for your liking, continue to Lessons 6 and 7 to improve your models.
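Conceptually, the Predict Labels workflow runs a trained model over unlabeled rows and appends a predicted column. The sketch below is not a LightSIDE model: its "model" is a toy classifier built from the manual's own negative-word lexicon (awful, bad, hideous, terrible), invented here purely to show the shape of the annotate-and-export step.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stand-in for Predict Labels: score each unlabeled document with a
// hypothetical lexicon "model" and attach a predicted annotation column.
public class PredictDemo {
    static final List<String> NEGATIVE = List.of("awful", "bad", "hideous", "terrible");

    // A real run would use a model trained in the Build Model tab instead.
    static String predict(String document) {
        for (String w : document.toLowerCase().split("\\s+"))
            if (NEGATIVE.contains(w)) return "NEG";
        return "POS";
    }

    // Produce CSV-style rows with the new annotation column appended,
    // as the Export button would write them.
    static List<String> annotate(List<String> docs, String annotationName) {
        List<String> rows = new ArrayList<>();
        rows.add("text," + annotationName);
        for (String d : docs) rows.add(d + "," + predict(d));
        return rows;
    }

    public static void main(String[] args) {
        for (String row : annotate(List.of("a hideous mess", "a charming film"), "predict_class"))
            System.out.println(row);
    }
}
```

The accuracy of such annotation is roughly the cross-validation accuracy reported when the model was built, which is why error analysis (Lesson 6) matters before annotating at scale.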
6. To filter based on a numeric column, type that column's name, followed by the operator you want to use, followed by the threshold you wish to set (no spaces anywhere), then press the Enter key or click the filter button. For instance, filtering for kappa>0.04 will remove thousands of features from view in a unigram feature space, as all the features that are no more predictive of a movie review's sentiment than chance (like "handed" or "chief") are hidden from view.

7. These filters can be combined in a single query, separated by spaces, as shown in Figure 5.

Figure 5: Feature quality metrics and filtering in LightSIDE. The feature table is filtered with the query "him kappa>0.04", showing features such as him, him_and, of_him, to_him, and himself, with their kappa, precision, recall, f-score, and accuracy values.

8. To save a feature table that you wish to use in another session, or to load a feature table that you previously built, use the Save and Load buttons in the bottom right corner.
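The filter behavior described above — substring terms plus numeric comparisons, combined with spaces — can be sketched as follows. This is a guess at the semantics, not LightSIDE's actual query parser.

```python
import re

# Sketch (not LightSIDE's parser): apply a filter query such as
# "him kappa>0.04" to a list of feature rows. Each row is a dict with a
# "name" key plus numeric metric columns like "kappa".

def apply_filter(rows, query):
    ops = {">": lambda a, b: a > b,
           "<": lambda a, b: a < b,
           "=": lambda a, b: a == b}
    for term in query.split():
        m = re.match(r"(\w+)([<>=])([-\d.]+)", term)
        if m:  # numeric filter, e.g. kappa>0.04
            col, op, threshold = m.group(1), m.group(2), float(m.group(3))
            rows = [r for r in rows if ops[op](r[col], threshold)]
        else:  # substring filter on the feature name
            rows = [r for r in rows if term in r["name"]]
    return rows
```

Each space-separated term narrows the result further, matching the manual's description of combined filters.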
6. Now choose the type of cross-validation that you would like to use to test the validity of the model.

CV by Fold: The default setting performs N-fold cross-validation: N models are built, each on (N-1)/N of the data, and tested on the remaining 1/N. Instances are chosen in a round-robin fashion; for instance, in 5-fold cross-validation, the 1st, 6th, 11th, etc. instances are chosen for testing in the first fold, followed by the 2nd, 7th, 12th, etc. held out in the second fold. 10-fold cross-validation is selected by default.

CV by File: This setting assumes that your feature table was built from multiple files. In each fold of cross-validation, all but one file is used for training and the remaining file is used for testing. This repeats through each file in your data set.

Supplied Test Set: One model is built on your full training set, and it is evaluated on a second feature table from a file that you select here.

7. For this tutorial, leave the CV by fold option selected and the number of folds set at 10.

8. If you wish to specifically name this model, change it in the Name text field.

9. Click Build Model to train and evaluate a model. For large feature tables (either many documents or many features) this may take a large amount of memory (greater than 1 GB) and may take several minutes to complete. Be patient, and close other programs as necessary.
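The round-robin fold assignment described for CV by Fold can be sketched as follows (an illustration of the scheme, not LightSIDE's implementation):

```python
# Round-robin fold assignment: in 5-fold CV, instances 0, 5, 10, ...
# (the 1st, 6th, 11th, ...) form the test set of the first fold;
# instances 1, 6, 11, ... form the second fold's test set, and so on.

def round_robin_folds(n_instances, n_folds):
    """Return a list of (train_ids, test_ids) pairs, one per fold."""
    folds = []
    for fold in range(n_folds):
        test = [i for i in range(n_instances) if i % n_folds == fold]
        train = [i for i in range(n_instances) if i % n_folds != fold]
        folds.append((train, test))
    return folds
```

Each instance appears in exactly one test set, and each of the N models trains on the remaining (N-1)/N of the data.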
8. For surface features such as n-grams, which have a defined start and end point, the bottom right window indicates their location in a document. Selecting a feature in the top right interface highlights that feature in the bottom right panel.

9. For the next step, if you trained a model using Weka's SMO classifier, switch back to the Extract Features tab.

10. In the feature table list, scroll to the far right. You will now see a new column titled weights1. This corresponds to the SVM weight for each feature.

What does the SVM weight tell us about a feature? Primarily, whether it is actually having an impact on a model. If a feature looks highly confusing by horizontal comparison but is being given a weight near zero, then it is unlikely to be a true source of confusion for the model.

Using these tools can make it easier to recognize what a model is doing, which instances are being misclassified, and for what reasons. The next step is to attempt to change a feature table to improve its ability to represent your data. For instruction on how to do this, move on to Lesson 7.

Lesson 7: Constructing features with LightSIDE's feature lab

LightSIDE gives you the opportunity to make many changes to a feature table. Here we highlight two different possibilities: removing features that are superfluous or detrimental to performance, and introducing new ones.
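The near-zero-weight heuristic above amounts to a simple filter over the weights1 column. A minimal sketch, with a hypothetical threshold; LightSIDE itself does not expose such a function:

```python
# Sketch of the heuristic above: set aside features whose SVM weight is
# near zero, since they are unlikely to be true sources of confusion.
# The 0.05 threshold is an arbitrary illustrative choice.

def likely_influential(feature_weights, threshold=0.05):
    """Return names of features whose absolute weight exceeds the threshold."""
    return [name for name, w in feature_weights.items()
            if abs(w) > threshold]
```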
directly from Sun's official website: http://java.sun.com/javase/downloads. Once you select the appropriate file here, you should open it and follow the instructions it gives.

Installing & Running LightSIDE

Now that Java 6 is installed on your computer, you can start using LightSIDE. All the files that you will need for basic use are available in a single package located at the following website: http://www.cs.cmu.edu/~emayfiel/SIDE.html

Save the file to your desktop or downloads folder for easy access. Now extract the package to a folder using your favorite archive manager. To run LightSIDE, open this folder. Depending on the operating system you are using, you will need to follow different steps to run LightSIDE. Once you have completed these steps, LightSIDE will be running and you can begin to learn to use the software.

Windows XP:
1. Open the LightSIDE folder.
2. Double-click the run icon.

Windows 7:
1. Open the start menu and search for cmd. Click the cmd icon.
2. Type cd Desktop\SIDE to navigate to the location where LightSIDE was extracted. If you saved this folder somewhere else, you will have to navigate to it yourself.
3. Type run.bat.

Mac OS X v10.6:
1. Open Finder. Click Applications, then Utilities. Double-click Terminal.
2. Type cd Desktop/SIDE to navigate to the location where LightSIDE was extracted. If you saved this folder somewhere else, you will have to navigate to it yourself.
Predictor Of: The class label that this feature is most likely to predict, compared to random guessing.

Kappa: This metric measures the added value of this feature, from 0 to 1, for predicting the class given in the previous column, compared to random guessing. A Kappa value of 1 means that the feature is perfectly correlated with a label, while a negative Kappa value would represent a feature with worse accuracy than flipping a coin.

Precision, Recall, and F-Score: These closely related metrics give an impression of false positives (precision), false negatives (recall), and the harmonic mean of these two metrics (f-score).

Accuracy: This simple metric tells how well a classifier would perform using this feature alone to predict class labels for your data set.

Hits: The total number of documents that this feature appears in across your data set.

3. Features can be sorted by these metrics by clicking the title of a column.

Filtering the features shown in the user interface

4. Sometimes you will want to check the statistics of a certain subset of your features. To do this, click in the Filter text field at the top of the window.

5. To filter by name, simply type in the string of characters you want to see, then press the Enter key or click the filter button. For instance, filtering for him will give not only the unigram him, but also bigrams such as to_him, or longer words such as himself, which contain that substring.
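The metrics above can all be computed from a boolean feature's confusion counts against the gold labels. A sketch follows; Kappa is computed here as Cohen's kappa, which matches the "added value over chance" description, though LightSIDE's exact formula may differ.

```python
# Sketch: quality metrics for one boolean feature treated as a predictor
# of a single target label. Not LightSIDE code.

def feature_metrics(feature_present, gold, target_label):
    n = len(gold)
    tp = sum(1 for f, g in zip(feature_present, gold) if f and g == target_label)
    fp = sum(1 for f, g in zip(feature_present, gold) if f and g != target_label)
    fn = sum(1 for f, g in zip(feature_present, gold) if not f and g == target_label)
    tn = n - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    accuracy = (tp + tn) / n
    # Cohen's kappa: agreement above what chance alone would give
    p_chance = (((tp + fp) / n) * ((tp + fn) / n)
                + ((fn + tn) / n) * ((fp + tn) / n))
    kappa = (accuracy - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0
    hits = tp + fp  # documents the feature appears in
    return {"kappa": kappa, "precision": precision, "recall": recall,
            "f_score": f_score, "accuracy": accuracy, "hits": hits}
```

A perfectly correlated feature gets kappa 1.0, while a feature no better than chance gets kappa 0.0, matching the description above.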
This section will be studied in Lesson 3.

8. There should currently be 3571 features in your feature table, as shown at the top of the LightSIDE window. Now deselect the unigrams option and select punctuation.

9. If you wish to stop feature extraction partway through (aborting the extraction process), click the red button next to the status bar.

10. Click the Extract Features Same Table button in the feature extractor plugin menu. Instead of creating a new feature table, this adds the currently configured plugin's features to the currently open feature table.

11. The remaining plugins are simpler to configure. The Metafeatures plugin has no configuration options; it simply pulls all non-class-value columns from your CSV file into a feature table.

12. To create features based on regular expressions, first click the Regex Search extractor in the feature extractor plugins menu. You will see a text field and an empty list.

Figure 4: Regular expression feature extractor configuration, showing several example searches added to the regex list.

13. To create a regex feature, type the search into the top bar, then click Add. Your search appears in the bottom list.

14. To remove a search, highlight it and click Delete.

15. When you are finished
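Under the hood, a regex feature is a boolean per document: does the pattern match anywhere in its text? A sketch of that behavior (not LightSIDE internals; the patterns below are illustrative, not the ones shown in Figure 4):

```python
import re

# Sketch: turn a list of regex searches into boolean features, one per
# search per document, as the Regex Search plugin does conceptually.

def regex_features(documents, patterns):
    """Return one {pattern: matched?} dict per document."""
    compiled = [(p, re.compile(p)) for p in patterns]
    return [{p: bool(rx.search(doc)) for p, rx in compiled}
            for doc in documents]
```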
[Screenshot: the Extract Features tab with the Regex Search plugin selected, showing combined feature-lab features such as "lab OR hideous terrible bad awful" and "lab AND NOT not bad" in the feature table, alongside their kappa, precision, and recall values.]
Documents are displayed for each instance, and can be filtered in three different ways:

All: This option simply shows all instances, in the order they were listed in the original CSV.

By Error Cell: The documents which occur in the error cell that is currently highlighted at the top of the screen will be displayed.

Figure 7: Error analysis in LightSIDE's Model Building pane, showing the confusion matrix, a highlighted feature distribution, and the selected documents display.

By Feature: Only instances which contain the feature you currently have selected in the top right corner of LightSIDE will be displayed.
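The By Error Cell view selects exactly the instances falling in one cell of the confusion matrix. A sketch of that selection (a hypothetical helper, not part of LightSIDE):

```python
# Sketch: pick out the instances in one confusion-matrix cell, e.g. the
# documents the model predicted POS that were actually NEG.

def error_cell(predicted, actual, pred_label, act_label):
    """Return indices of instances in the given (predicted, actual) cell."""
    return [i for i, (p, a) in enumerate(zip(predicted, actual))
            if p == pred_label and a == act_label]
```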
7. Click the Extract Features New Table button.

Figure 3: Feature extraction in LightSIDE, showing the Extract Features tab with the CSV file list, annotation and text field settings, the feature table with quality metrics, and the feature extractor plugin configuration (unigrams, bigrams, POS bigrams, punctuation, line length, stemming, and stopword options).

This takes a few seconds to process; when it finishes, the top right section of the screen will fill with the contents of the feature table.
