1. Simple Method: Using a Classification Model on the Data

Description

Using the InfiniteInsight Modeler Regression/Classification feature (previously known as K2R), you will generate a predictive model to determine whether the sales revenue of an auction is higher than the sales revenue of its category. This model will be generated by using the data provided in your database as is.

SAP InfiniteInsight 6.5 SP4 Explorer Text Coding User Guide

To Start a Classification/Regression Model

1. On the InfiniteInsight start panel, click the option Classification/Regression in the Modeler section.

[Screenshot: InfiniteInsight start panel, with the Explorer, Social, Modeler and Toolkit sections]

The InfiniteInsight Modeler Regression/Classification feature (previously known as
2. [Screenshot: KTC Parameters Settings panel. Dictionary Construction Parameters: Stop Words Removing, Stemming Reduction, Concept Merging, Synonym Replacement, Maximum Generated Root Number (1,000). Encoding Parameters: Boolean, Term Frequency (TF), TF Inverse Document Frequency, Term Count (TC), TC Inverse Document Frequency]

Dictionary Construction Parameters

The dictionary is made of roots, that is, meaningful words or terms. You can set the following parameters of the dictionary construction:

Stop Words Removing: when this option is checked, the stop words are removed from the list of roots.
Stemming Reduction: when this option is checked, the affixes are removed to limit the number of roots.
Concept Merging: this option allows you to use an external file that associates terms (groups of words designating a single concept, such as "the White House" or "credit card") with concepts. Because it treats groups of words, this option is applied before the removal of the stop words and the stemming.

You can create your own concepts dictionary by creating a text file named ConceptList_<LanguageCode> (without extension), which contains on each line a group of words associated with the corresponding concept. For example, you can create a concept list for an airline company:

word              concept
business class    BusinessClass
[...]
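The order of operations described above (concept merging first, then stop word removal, then encoding) can be sketched in a few lines of Python. This is only an illustration of the idea, not the Text Coding implementation: the function names, the in-memory dictionary of concepts, and the "boolean" and "count" modes are assumptions made for the example, and stemming is left out.

```python
import re
from collections import Counter


def merge_concepts(text, concepts):
    """Replace each multi-word term with its concept token (longest terms first)."""
    merged = text.lower()
    for term in sorted(concepts, key=len, reverse=True):
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        # A function is used as the replacement so the concept name is inserted literally.
        merged = re.sub(pattern, lambda _m, c=concepts[term]: c, merged)
    return merged


def extract_roots(text, concepts, stop_words):
    """Merge concepts before tokenizing, then drop the stop words."""
    merged = merge_concepts(text, concepts)
    tokens = re.findall(r"\w+", merged)
    return [t for t in tokens if t.lower() not in stop_words]


def encode(roots, dictionary, mode="boolean"):
    """Encode a list of roots against a dictionary (hypothetical 'boolean' or 'count' mode)."""
    counts = Counter(roots)
    if mode == "boolean":
        return [1 if counts[r] > 0 else 0 for r in dictionary]
    return [counts[r] for r in dictionary]


if __name__ == "__main__":
    # Hypothetical concept list, mirroring the airline example in the text.
    concepts = {"business class": "BusinessClass", "white house": "WhiteHouse"}
    roots = extract_roots("Flying business class to the White House", concepts, {"to", "the"})
    print(roots)
    print(encode(roots, ["BusinessClass", "flying", "hotel"]))
```

Merging concepts before removing stop words matters here: if "the" were removed first, a term such as "the White House" could no longer be matched as a group of words.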
3. [Screenshot: the Using the Model screen]

The screen Using the Model presents the various options for using a model. These options allow you to:

Display the information relating to the model just generated or opened (Display section): the model curve plots, the contributions by variables, the variables themselves, the HTML statistical reports, the table debriefing, and the model parameters.
Apply the model just generated or opened to new data, run simulations, and refine the model by performing an automatic selection of the explanatory variables to be taken into consideration (Run section).
Save the model, or generate the source code (Save/Export section).

Taking a Closer Look at the Model

From the screen Using the Model, you can display a suite of plotting tools that allow you to analyze and understand the generated model in detail. The three most useful tools are described in the table below.

On the screen...              You can observe and analyze...
Profit Curves                 The performance of the model with respect to a hypothetical perfect model and a random model.
Contributions by Variables    The contribution of each explanatory variable with respect to the target variable.
Significance of Categories    The significance of the categories of each variable with respect to the target variable.

On the screen Contributions by Var
4. [Screenshot: model overview. Engine name: Kxen.RobustRegression; monotonic variables detected: yes; nominal target frequencies: 0 = 53.14%, 1 = 46.86%; selection process: last iteration 4, KI 0.663, KR 0.965, 31 variables kept]

The table below compares these results with the ones obtained for the previous methods.

Method                                             KI       KR
Simple Method                                      0.468    0.970
Intermediate Method                                0.547    0.969
Advanced Method                                    0.667    0.963
Advanced Method without Stop Words and Stemming    0.663    0.965

There is no significant change in the KI or the KR. You can conclude that using the German stop words and stemming rules does not really add anything to the model.

Adapted Method: Defining a Specific Language for the Domain

Description

The results of disabling the German stop words and stemming rules show that they have no real impact on the model quality. After viewing the data, that makes sense: the content of the listing_title variable cannot be considered natural language in the strict sense, but rather a language specific to a smaller domain. So in this last method, you will define stop words and stemming rules that are based on German but relevant to this domain only. This comes down to creating a specific language, which you will name dmc.

Modeling Process

For this meth
5. Metadata

Metadata are stored in the same place as the data source.

By default, the Predefined mode is set to the Random without test cutting strategy. To get other values, refer to the drop-down list of available cutting strategies. By selecting the Custom mode, you can use the Customized cutting strategy.

[Screenshot: Cutting Strategy dialog box, with the Predefined option (Random without test) and the Custom option (Estimation, Validation and Test fields, each with a Browse button)]

2. Click the strategy that you want to use.
   Note: In order to use the Customized cutting strategy, you must have previously prepared three files corresponding to the three data subsets: estimation, validation and test.
3. Click OK.
4. Back on the Select a Data Source panel, click the Next button. The screen Data Description will appear.
5. Go to the section Describing the Data Selected.

Describing the Data

Why Describe the Data Selected?

In order for InfiniteInsight features to interpret and analyze your data, the data must be described. To put it another way, the description file must specify the nature of each variable, determining their:

Storage format: number (number), integer (integer), character string (string), date and time (datetime) or
6. Key    Parameter        Value
[...]  Name             ge
sp     Name             sp
it     Name             it
dmc    Name             dmc
dmc    ConceptList      ConceptList_dmc
dmc    StemmingRules    StemmingRules_dmc
du     StemmingRules    StemmingRules_du
en     StemmingRules    StemmingRules_en
fr     StemmingRules    StemmingRules_fr
ge     StemmingRules    StemmingRules_ge
sp     StemmingRules    StemmingRules_sp
it     StemmingRules    StemmingRules_it
dmc    StopList         StopList_dmc
du     StopList         StopList_du
en     StopList         StopList_en
fr     StopList         StopList_fr
ge     StopList         StopList_ge
sfr    StopList         StopList_sfr
sp     StopList         StopList_sp
it     StopList         StopList_it
dmc    SynonymList      SynonymList_dmc

Notes:
The <Key> or language name has to be entered in the configuration file KxLanguage.cfg to be taken into account in the interface. If it is not set up there, the language will not appear in the interface.
If no language name is specified for a <Key>, the name of the language will be <Key>.
If different <Key> entries have the same name, only the first <Key> will be treated.
The referenced files have to be in the current directory.

KTC Dictionary and Encoding Parameters

The second panel, KTC Parameters Settings, allows you to set the construction parameters for the dictionary and the encoding parameters.
7. You can assume that the results of auctions are better on weekends, at the beginning of the month, or in some months rather than others. To make the most of the date variables, you will create new variables, for example by separating the days of the week, so that they can be used as input in the modeling.

Additionally, to make use of the two other most important variables, you will extract more information from start_price and buy_it_now_price by calculating the ratio between the starting price and the sales mean for the category, and the ratio between the Buy It Now price and the sales mean for the category.

Intermediate Method: Adding Information with the Data Manipulations

Modeling Process

The process of building a predictive model on a data set containing added time data is approximately the same as the one you used for building the model on the original data. The only additional step is to create new columns for both variables listing_start_date and listing_end_date: one for each day of the week, one for the day of the month, and one for the month of the year.

The modified data set contains the following added columns, extracted from the original variable listing_start_date: listing_start_monday, listing_start_tuesday, listing_start_wednesday, listing_start_thursday, listing_start_frid
8. date (date).
Type: continuous, nominal, ordinal or textual.

Warning: When creating a text coding model, you need to define at least one variable as textual to be able to go to the next panel.

For more information about data description, see the InfiniteInsight User Guide.

How to Describe Selected Variables

To describe your data, you can:

Either use an existing description file, that is, one taken from your information system or saved from a previous use of InfiniteInsight features,
Or create a description file using the Analyze option available in the InfiniteInsight Modeling Assistant. In this case, it is important that you validate the description file obtained. You can save this file for later reuse. If you name the description file KxDesc_<SourceFileName>, it will be automatically loaded when you click the Analyze button.

Important: The description file obtained using the Analyze option results from the analysis of the first 100 lines of the initial data file. To avoid bias, we encourage you to shuffle your data set before performing this analysis.

Each variable is described by the fields detailed in the following table.

The Field...    Gives information on...
Name            The variable name (which cannot be modified).
Storage         The type of values stored in this variable:
                Number: the variable contains only computable numbers (be careful: a telephone number or an account number should not be considered
9. numbers).
                String: the variable contains character strings.
                Datetime: the variable contains date and time stamps.
                Date: the variable contains dates.
Value           The value type of the variable:
                Continuous: a numeric variable from which mean, variance, etc. can be computed.
                Nominal: a categorical variable, which is the only possible value for a string.
                Ordinal: a discrete numeric variable where the relative order is important.
                Textual: a textual variable containing phrases, sentences or complete texts.
Key             Whether this variable is the key variable or identifier for the record:
                0: the variable is not an identifier.
                1: primary identifier.
                2: secondary identifier.
Order           Whether this variable represents a natural order. There must be at least one variable set as Order in the Event data source.
                Warning: If the data source is a file and the variable stated as a natural order is not actually ordered, an error message will be displayed before model checking or model generation.
Missing         The string used in the data description file to represent missing values (e.g. "999" or "Empty", without the quotes).
Group           The name of the group to which the variable belongs.
Description     An additional description label for the variable.

For this Scenario
10. the Analyze button to obtain the data description.

Task(s): Selecting the Target Variable and a Weight Variable
Screen: Selecting the Target Variable
Settings: Select gms_greater_avg as the target variable. Do not select a weight variable.

Task(s): Selecting Explanatory Variables
Screen: Selecting Variables
Settings: Exclude the variable gms from the list of variables to be used for modeling.

Selecting a Data Source

After selecting the type of model that you want to generate, you must select:

The data source that you want to use as the training data set.
A cutting strategy to cut your training data set into the three subsets: estimation, validation and test. For more information on cutting strategies, see the InfiniteInsight User Guide.

For this Scenario

In the panel Select a Data Source, select the options Use a File or a Database Table and Text File in Data Type.
In the Data Set field, specify the data source to be used by selecting the file dmc2006.txt.

To Select a Data Source

2. Click the Browse button. The following dialog box will appear.

[Screenshot: Data Source Selection dialog box]

3. Double-click the Samples folder, then the KTC folder. On the screen Select a Da
11. variables tc_listing_title_2gb and tc_listing_title_2 exist, and yet they contain the same information.

[Figure: Maximum Smart Variable Contributions chart for the model gms_greater_avg_dmc2006_enriched. The variables shown include tc_listing_title_2gb, buy_it_now_price_div_mean_category, start_price_div_mean_category, listing_start_monthofyear, item_leaf_category_name, tc_listing_title_neu, tc_listing_title_4gb, tc_listing_title_EffectiveRoot, tc_listing_title_video, listing_type_code, tc_listing_title_defekt, tc_listing_title_weiss, tc_listing_title_ovp, listing_end_monday and tc_listing_title_farbdisplay]

When building a model, Text Coding automatically generates two variables:

tc_<variable name>_EffectiveRoot: this variable counts the final number of roots in the textual field.
tc_<variable name>_CountInformation: this variable counts the number of roots before filtering.

Advanced Method without Stop Words and Stemming Rules

Description

In the results of the model using Text Coding, you can see that the variables created by Text Coding have brought information into the final model. For example, tc_listing_title_2gb is the most contributive variable. You have seen that some of these variables contain the same information and should be grouped. However, before grouping similar terms, you have to measur
12. [Figure: excerpt of the stemming rules file for the dmc language. The rules shown cover terms such as gb, generation, docking_station, mp3, neu, ipod, weiss, apple, garantie, photo, song, blau and deutsch]

Summary of the Modeling Settings to Use

The table below summarizes the modeling settings you must use for the final method. The Text Coding specific steps are grayed in the table below, and the steps different from the previous model are indicated in green. The other settings are similar to the ones used for the advanced method.

For detailed procedures and more information, see the Modeling Process section of the Simple Method section.

Task(s): Specifying the Data Source; Selecting a Cutting Strategy
Screen: Data to be Modeled
Settings: Select the option Text Files in Data Type. In the Folder field, select the folder Samples/KTC. In the Data Set field, select the file dmc2006_enriched.txt. Cutting strategy: Random Wi

Remaining tasks: Describing the Data; Text Coding: Setting the Language Definition; Text Coding: Setting the Dictionary and Encoding Parameters; Selecting the Target [...]
Remaining screens: Data Description; KTC Parameters Settings; KTC Parameters Settings 2; Selecting the Target [...]
13. [Screenshot: model overview. Building date: 2012-08-07 15:50:35; learning time: 11 s; engine name: Kxen.RobustRegression; monotonic variables detected: yes; nominal target: gms_greater_avg; target frequencies: 0 = 53.14%, 1 = 46.86%]

The table below compares these results with the ones obtained for the simple method.

Method                 KI       KR
Simple Method          0.468    0.970
Intermediate Method    0.547    0.969

The created variables give a better model. The KI has increased from 0.468 to 0.547, while the KR has remained practically unchanged (0.970 against 0.969). Adding data derived from already existing variables has led to a model of better quality with no meaningful loss of robustness.

Taking a Closer Look at the Model

On the screen Statistical Reports > Model Performance > KI & KR (see below), you notice that the added variables have made a difference in the model, since some appear among the variables with the highest individual KI. The individual KI represents the capacity of a variable to predict the target if only this variable were available.

[Screenshot: Statistical Reports screen for the model gms_greater_avg_dmc2006_enriched]
14. Create the data description by clicking the Analyze button.

To Create a Description File

1. On the screen Data Description, click the Analyze button. The data description will appear.
2. Check that the description obtained is correct. Once the data description has been validated, you can save it by clicking the Save button.
3. Click the Next button to go to the following step. The screen Selecting the Target Variable will appear.
4. Go to the section Selecting a Target Variable.

A Comment about Database Keys

For data and performance management purposes, the data set to be analyzed must contain a variable that serves as a key variable. Two cases should be considered:

If the initial data set does not contain a key variable, a variable index KxIndex is automatically generated by Text Coding. It corresponds to the row number of the processed data.
If the file contains one or more key variables, they are not recognized automatically. You must specify them manually in the data description. See the procedure To Specify that a Variable is a Key. On the other hand, if your data is stored in a database, the key will be automatically recognized.

To Specify that a Variable is a Key

1. In the Key column, click the box corresponding to the row of the key variable.
2. T
15. End User Documentation

Table of Contents

Welcome to this Guide
    About this Document
    Who Should Read this Document
    [...]
    What this Document Covers
    How to Use this Document
    [...]
    Files and Documentation Provided with this Guide
    [...]
General Introduction to Scenario
    Scenario
    Introduction to Sample Files
    Introduction to InfiniteInsight
Extracting Information from Textual Data
    Simple Method: Using a Classification Model on the Data
        Description
        Modeling Process
        Results
    Intermediate Method: Adding Information with the Data Manipulations
        Description
        Modeling Process
        Results
    Advanced Method: Using Text Coding to Extract Information from the Textual Variables
        Description
        Modeling Process
        Results
    Advanced Method without Stop Words and Stemming Rules
        Description
        Modeling Process
        Results
    Adapted Method: Defin
16. [Screenshot: model overview. Engine name: Kxen.RobustRegression; monotonic variables detected: yes; nominal target: gms_greater_avg; target frequencies: 0 = 53.14%, 1 = 46.86%]

The table below compares these results with the ones obtained for the previous methods.

Method                                             KI       KR
Simple Method                                      0.468    0.970
Intermediate Method                                0.547    0.969
Advanced Method                                    0.667    0.963
Advanced Method without Stop Words and Stemming    0.663    0.965
Adapted Method                                     0.695    0.970

The KI has largely improved and the KR remains very high. Going from a simple Classification/Regression model built on the original data set to a Text Coding Classification/Regression model built on improved data with a specialty language defined, you have gained a lot in model quality (25% in KI) without losing model robustness.

The increased quality of the model is clearly apparent on the model graphs below.

[Figures: model graph for the simple method; model graph for the adapted method]
17. InfiniteInsight Modeling Assistant presented to you and the concepts required for their application.

To Display the Contextual Help

1. Click the Help button located in the lower left corner of the screen.
2. Click the Previous button to go back to the original screen.

Contact Us

We are interested in your feedback and welcome your questions and comments. The following table provides a list of e-mail addresses that you may use to contact us.

If you...                                                                       Contact our team in...    Send an email to the following address
Want more business application information                                      Marketing                 info@kxen.com
Have technical questions related to the integration and use of KXEN products    Support                   United States: support usa kxen com
                                                                                                          Canada: support ca kxen com
                                                                                                          France: support fr kxen com
                                                                                                          United Kingdom: support uk kxen com
                                                                                                          Europe, Middle East and Africa: support emea kxen com
Have comments or questions concerning the KXEN documentation                    Documentation             documentation@kxen.com

General Introduction to Scenario

IN THIS CHAPTER
Scenario
Introduction to Sample Files
Introduction to InfiniteInsight
18. K2R allows you to create explanatory and predictive models. The first step in the modeling process consists of defining the modeling parameters:

1. Select a data source to be used as training data set (see Selecting a Data Source).
2. Select a cutting strategy (see Selecting a Cutting Strategy).
3. Describe the data set selected (see Describing the Data).
4. Select the target variable, and possibly a weight variable (see Selecting the Target Variable and a Weight Variable).
5. Select the explanatory variables (see Selecting Explanatory Variables).

Summary of the Modeling Settings to Use

The table below summarizes the modeling settings that you must use for the simple method. It should be sufficient for users who are already familiar with the InfiniteInsight Modeling Assistant.

For detailed procedures and more information, see the following sections.

Task(s): Specifying the Data Source; Selecting a Cutting Strategy
Screen: Data to be Modeled
Settings: Select the option Use a File or a Database. In the Folder field, select the folder Samples/KTC. In the Data Set field, select the file dmc2006.txt. Cutting strategy: Random Without Test.

Task(s): Describing the Data
Screen: Data Description
Settings: Use
19. \    general escape character with several uses
^    assert start of subject (or line, in multiline mode)
$    assert end of subject (or line, in multiline mode)
.    match any character except newline (by default)
[    start character class definition
]    end character class definition
|    start of alternative branch
(    start subpattern
)    end subpattern
?    extends the meaning of (; also 0 or 1 quantifier; also quantifier minimizer
*    0 or more quantifier
+    1 or more quantifier
{    start min/max quantifier
}    end min/max quantifier

www.sap.com/contactsap

© 2013 SAP AG or an SAP affiliate company. All rights reserved. No part of this publication may be reproduced or transmitted in any form or for any purpose without the express permission of SAP AG. The information contained herein may be changed without prior notice.

Some software products marketed by SAP AG and its distributors contain proprietary software components of other software vendors. National product specifications may vary.

These materials are provided by SAP AG and its affiliated companies ("SAP Group") for informational purposes only, without representation or warranty of any kind, and SAP Group shall not be liable for errors or omissions with respect to the materials. The only warranties for SAP Group products and services are those that are set forth in the express warranty statements accompanying such products and services, if any. Not
20. [...]

Task(s): Specifying the Data Source; Selecting a Cutting Strategy
Screen: Data to be Modeled
Settings: Select the option Text Files in Data Type. In the Folder field, select the folder Samples/KTC. In the Data Set field, select the file dmc2006_enriched.txt. Cutting strategy: Random With No Test.

Task(s): Describing the Data
Screen: Data Description
Settings: Select desc_dmc2006_enriched_textual.txt as the description file. Check that listing_title is set as textual.

Task(s): Text Coding: Setting the Language Definition
Screen: KTC Parameters Settings
Settings: Keep the default Language Definition Repository (blank). Select the User Defined Language option as the Language Recognition Mode. Select ge (German) in the combo box as the User Defined Language.

Task(s): Text Coding: Setting the Dictionary and Encoding Parameters
Screen: KTC Parameters Settings 2
Settings: Keep the default settings.

Task(s): Selecting the Target Variable and a Weight Variable
Screen: Selecting the Target Variable
Settings: Select gms_greater_avg as the target variable. Do not select a weight variable.

Task(s): Selecting Explanatory Variables
Screen: Selecting Variables
Settings: Exclude the variables KxIndex and gms from the list of variables to be used for modeling.

Setting KTC Parameters

KTC Languages Parameters Settings

The first panel, KTC Parameters Settings, allows you to choose the language settings.

Define the location of the Language Definition Repository
1. Select the list of Supported Languages
2. Select the list o
21. ariable in the screen section Target(s) Variable(s), and click the button < to move the variables back to the screen section Explanatory variables selected.

Selecting Explanatory Variables

By default, and with the exception of key variables (such as KxIndex), all variables contained in your data set are taken into consideration for the generation of the model. You may exclude some of these variables.

For this Scenario

Exclude gms from the list of variables to be used for modeling. This variable contains the actual amount the auction reached: it answers the question directly, and so would produce a perfect model if used. Retain all the other variables.

To Select Variables for Data Analysis

1. On the screen Selecting Variables, in the section Explanatory variables selected (left-hand side), select the variable to be excluded.
   Note: On the screen Selecting Variables, variables are presented in the same order as that in which they appear in the table of data. To sort them alphabetically, select the option Alphabetic sort, presented beneath each of the two parts of the screen.
2. Click the button > located in the center of the screen. The variable moves to the screen section Variables excluded. Also, click the button < to move the variables to the screen secti
22. ay, listing_start_saturday, listing_start_sunday, listing_start_dayofmonth, listing_start_monthofyear,

and extracted from the original variable listing_end_date: listing_end_monday, listing_end_tuesday, listing_end_wednesday, listing_end_thursday, listing_end_friday, listing_end_saturday, listing_end_sunday, listing_end_dayofmonth, listing_end_monthofyear.

You will also create two new columns in which the ratios described in the previous section will be stored:

start_price_div_mean_category, which is the result of the division of start_price by category_avg_gms.
buy_it_now_price_div_mean_category, which is the result of the division of buy_it_now_price by category_avg_gms.

To create these columns, you can use the KXEN Data Manipulation feature. However, to speed up the process for this demonstration, the modified data set is provided in the folder Samples/KTC. The data file that corresponds to the original file with the added data manipulations is dmc2006_enriched.txt.

Summary of the Modeling Settings to Use

The table below summarizes the modeling settings that you must use for the intermediate method. Except for the additional columns created in the data set, the other settings are similar to the ones used for t
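The derived columns listed above can be sketched in Python for readers who want to reproduce them outside the Data Manipulation feature. This is a minimal sketch under stated assumptions: the ISO date format, the 0/1 coding of the day-of-week columns and the function name `enrich` are all hypothetical, since the guide does not specify how the provided file encodes these columns.

```python
from datetime import datetime

# weekday() returns 0 for Monday through 6 for Sunday, matching this order.
DAYS = ["monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday"]


def enrich(row, date_format="%Y-%m-%d"):
    """Return a copy of `row` with the date-derived and ratio columns added.

    `row` is a dict holding the original columns; the date format is an assumption.
    """
    out = dict(row)
    for prefix in ("listing_start", "listing_end"):
        d = datetime.strptime(row[prefix + "_date"], date_format)
        for i, day in enumerate(DAYS):
            # One indicator column per day of the week (assumed 0/1 coding).
            out[f"{prefix}_{day}"] = 1 if d.weekday() == i else 0
        out[f"{prefix}_dayofmonth"] = d.day
        out[f"{prefix}_monthofyear"] = d.month
    # Ratios of each price to the sales mean of the category.
    mean = float(row["category_avg_gms"])
    out["start_price_div_mean_category"] = float(row["start_price"]) / mean
    out["buy_it_now_price_div_mean_category"] = float(row["buy_it_now_price"]) / mean
    return out
```

The sketch does not guard against a category mean of zero; real data would need a rule for that case.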
23. column the number of occurrences of each root in the Frequency column. Look at the roots ipod and apple, for example. When you compare their number of occurrences with the total number of records in the data set, it appears that ipod is present 7500 times and apple 5228 times in a data set that counts 8000 lines. They are clearly much too frequent to carry information.

Another way to detect stop words is to use the individual variable contributions after the Classification/Regression process, in order to see which variables have no KI.

[Screenshot: Statistical Reports screen for the model gms_greater_avg_dmc2006_enriched, showing KI and KR per variable for the target gms_greater_avg]

The stop words list is stored in a text file named StopList_<language code>. For example, the stop words list created for the specific language you are working on will be named StopList_dmc.

The stop words list for the dmc language should look like the following file:

Stopli
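The frequency check described above can be approximated with a short Python sketch. It is an illustration only: it counts the number of records in which each root appears (the guide's Frequency column may count occurrences differently), it splits titles on whitespace instead of using the Text Coding dictionary, and the 0.5 threshold and the function names are assumptions chosen for the example.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each root, the number of records in which it appears."""
    df = Counter()
    for text in documents:
        # A set is used so that a root is counted once per record.
        df.update(set(text.lower().split()))
    return df


def candidate_stop_words(documents, max_ratio=0.5):
    """Return the roots present in more than `max_ratio` of the records."""
    df = document_frequency(documents)
    total = len(documents)
    return sorted(root for root, count in df.items() if count / total > max_ratio)
```

With the figures quoted in the text, ipod appears in 7500 / 8000 = 93.75% of the lines and apple in 5228 / 8000 = 65.35%, so both would be flagged by a 50% threshold. The candidates still need a manual review before they are added to the stop words list.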
24. d lead you through all stages of the modeling process.

To Start the Modeling Assistant

1. Select Start > Programs > KXEN InfiniteInsight > KXEN InfiniteInsight. The InfiniteInsight welcome page will appear.

[Screenshot: InfiniteInsight welcome page. Explorer: Create or Edit Explorer Objects, Create a Data Manipulation, Load an Existing Data Manipulation, Perform an Event Log Aggregation, Perform a Sequence Analysis, Perform a Text Analysis. Social: Create a Social Network Analysis, Load a Social Network Analysis Model. Modeler: Create a Classification/Regression Model, Create a Clustering Model, Create a Time Series Analysis, Create Association Rules, Load a Model. Toolkit: Open the Data Viewer, Perform a Data Transfer, List Distinct Values in a Data Set, Get Descriptive Statistics for a Data Set]

2. Click the feature related to the type of model you want to create in the Modeler section.

Extracting Information from Textual Data

IN THIS CHAPTER
Simple Method: Using a Classification Model on the Data
Intermediate Method: Adding Information with the Data Manipulations
Advanced Method: Using Text Coding to Extract Information from the Textual Variables
25. e the impact of the German processing on the data set. To that effect, you will build a Text Coding model without specific German processing, in order to see its impact on the model quality.
Modeling Process
The process of using Text Coding without German-specific processing is approximately the same as the one you used for building the previous model. You will only need to change the dictionary and encoding parameters.
Summary of the Modeling Settings to Use
The table below summarizes the modeling settings you must use for the advanced method. The Text Coding specific steps are grayed in the table below, and the step that differs from the previous model is indicated in blue. The other settings are similar to the ones used for the advanced method. Text Coding steps are presented in detail in the following sections. For detailed procedures and more information, see the Modeling Process section of the Simple Method section.
Task(s) | Screen | Settings
Specifying the Data to be Modeled; Selecting a Cutting Strategy | Data Source | Select the option Text Files in Data Type. In the Folder field, select the folder Samples/KTC. In the Data Set field, select the file dmc2006_enriched.txt. Cutting strategy: Random With No Test.
Describing the Data | Data Description | Selec
26. el Learning, then paste the information into Excel and sort the data in alphabetical order. You can then identify different forms of the same word. For example, three different forms of a single word can be identified: eingeschweisst, eingeschweist, eingeschweißt. So you can create two stemming rules to manage this word:
85 3 eingeschweist nocond nocond eingeschweist eingeschweisst 4
86 3 eingeschweißt nocond nocond eingeschweißt eingeschweisst 4
These rules replace two of the identified forms by the third one, so that only one form remains.
Moreover, the file contains color names in different languages, for example blau in German and blue in English. So you can create the associated stemming rules, such as:
65 3 blue nocond nocond blue blau 4
You can also create stemming rules that merge words that often appear together in the data set, such as original and packaging, which translate in German to original verpackt. This can be managed by the following stemming rules:
41 3 original nocond nocond original original_verpackt 4
42 3 verpackt nocond nocond verpackt original_verpackt 4
Lastly, you can merge correlated roots. In the previous model, look at the Variables Correlations in Statistical Reports > De
27. f Excluded Languages.
3 Select the Language Recognition Mode.
[Screenshot: KTC Parameters Settings panel, showing the Languages Definition Repository, Supported Languages and Language Recognition Mode options]
For this Scenario
Keep the default KTC Language Definition Repository (Resources/KTCData).
You can exclude the language named en (English).
Select the User Defined Language option for the Language Recognition Mode.
If you did not exclude the English language, select ge (German) in the combo box as the User Defined Language.
The screen should look like this:
[Screenshot: KTC Parameters Settings panel with the default local KTC repository Resources/KTCData and the User Defined Language option selected]
For Advanced Users
You can create your own file to indicate the parameters on this panel. This file has to be named KxLanguage.cf
28. g and needs to be structured as follows:
<Key> Name <Value>
<Key> ConceptList <File Name>
<Key> StemmingRules <File Name>
<Key> StopList <File Name>
<Key> SynonymList <File Name>
To add comments, begin these lines with
<Key> refers to the defined language.
The configuration file KxLanguage.cfg should look like the following one:
[Screenshot: KxLanguage.cfg opened in Notepad. The comment header states that the goal of the file is to define the languages and configuration files available to KTC. Besides Name, ConceptList, StemmingRules, StopList and SynonymList, it lists the options DLLLaunchOptions (parameter to run when the DLL is loaded) and DLLName (the name of the DLL without extension; if no name is specified, the name of the language will be <Key>). It also notes that if different <Key> entries have the same Name only the first one is treated, and that, except if the language is in a library, the referenced files must be in the current directory.] Name du en Name en fr Name fr ge
29. he simple method. For detailed procedures and more information, see the Modeling Process section of the Simple Method section.
Task(s) | Screen | Settings
Specifying the Data to be Modeled; Selecting a Cutting Strategy | Data Source | Select the option Text Files in Data Type. In the Folder field, select the folder Samples/KTC. In the Data Set field, select the file dmc2006_enriched.txt. Cutting strategy: Random With No Test.
Describing the Data | Data Description | Select desc_dmc2006_enriched_no_textual.txt as the description file.
Selecting the Target Variable and a Weight Variable | Selecting the Target Variable | Select gms_greater_avg as the target variable. Do not select a weight variable.
Selecting Explanatory Variables | Selecting Variables | Exclude the variables KxIndex and gms from the list of variables to be used for modeling.
Results
The screen below shows the quality (KI) and robustness (KR) indicators obtained for the model generated with the additional columns added to the original data set.
[Screenshot: Training the Model, model gms_greater_avg_dmc2006_enriched] Overview: Data Set dmc2006_enriched.txt, Initial Number of Variables 47, Number of Selected Variables 44, Number of Records
30. hing herein should be construed as constituting an additional warranty. SAP and other SAP products and services mentioned herein, as well as their respective logos, are trademarks or registered trademarks of SAP AG in Germany and other countries. Please see for additional trademark information and notices.
31. iables (see below), you notice that listing_end_date is among the variables that contribute the most to the explanation of the target variable. From this result and the knowledge of how auctions work, you can infer that calendar time has an impact on auction results, so you may want to break this variable down into more informative elements, such as the day of the week, the month and so on. This leads you to the intermediate method.
[Chart: Contributions by Variables, Maximum Smart Variable Contributions, for the model gms_greater_avg_dmc2006. The variables shown include start_price, listing_title, buy_it_now_price, listing_end_date, item_leaf_category_name, category_avg_gms, listing_durtn_days, feedback_score_at_listing_time, listing_subtitle, gallery_fee_flag, auct_id, bold_fee_flag and ipix_featured_fee_flag, together with date-derived variables such as listing_end_date_DoW and listing_start_date_DoM.]
Intermediate Method: Adding Information with the Data Manipulations
Description
The result of the simple method has highlighted the fact that dates play an important role in the modeling. It seems logical for time information to have an impact on auctions, such as the day of the week, the day of the month, the month of the year
32. ilding Date: 2012-08-07 16:23:58. Engine Name: Kxen.RobustRegression. Monotonic Variables Detected: Yes. Nominal Targets: TargetKey 1; category 0 frequency 53.14%, category 1 frequency 46.86%. Selection Process: Last Iteration 5, KI 0.667, KR 0.963, Nb Variables Kept 24.
The table below compares these results with the ones obtained for the first two methods.
                      KI      KR
Simple Method         0.468   0.970
Intermediate Method   0.547   0.969
Advanced Method       0.667   0.963
The analysis of textual variables gives a better model: the KI has increased from 0.547 to 0.667, while the KR remains high. Using InfiniteInsight Explorer Text Coding has led to a model with better quality and high robustness.
Taking a Closer Look at the Model
On the screen Contributions by Variables (see below), you notice that the variables created by Text Coding are important in the final model. For example, tc_listing_title_2gb has the highest maximum smart variable contribution. From this debriefing, you can see that 25 variables are displayed, 14 of which have been generated by Text Coding.
However, after studying the roots listed in the panel KTC Model Learning, you can see that some of them are similar and should probably be merged. For example both
33. ill need to use a data encoding feature. That is where Text Coding comes into play. Text Coding is a data encoding feature that builds a representative vector of the textual entries: it splits texts into word units and extracts roots from the data set. Text Coding is automatically included when textual attributes are declared.
Modeling Process
Compared to using only InfiniteInsight Modeler Regression/Classification, as you did for the first two methods, using InfiniteInsight Explorer Text Coding means performing the two additional steps below:
Setting the language parameters
Setting the dictionary and encoding parameters
Selecting the Type of Model to Create
1 In the InfiniteInsight main menu, select the option Perform a Text Analysis in the Explorer section.
[Screenshot: InfiniteInsight main menu]
The screen Add a Modeling Fea
34. ing a Specific Language for the Domain
Description
Modeling Process
Results
Annex
Regular Expression Reminder
Welcome to this Guide
IN THIS CHAPTER
About this Document
Before Beginning
About this Document
Who Should Read this Document
This document is addressed to business users who wish to perform tasks using predictive information about their customers or prospects through the powerful InfiniteInsight engine. No technical data mining knowledge is required.
Prerequisites
Before reading this guide, you should read chapters 2 and 3 of the InfiniteInsight User Guide, which present respectively:
An introduction to InfiniteInsight
The essential concepts related to the use of the InfiniteInsight features
When following the scenario described in this user guide, you will have to use the KXEN Data Manipulation feature. No prior knowledge of SQL is required to use KXEN Data Manipulation, only knowledge of how to work with tables and columns accessed through ODBC sources. Furthermore, users must have read access on these ODBC sources. To use the Java graphical interface, users need write access on the tables KxAdmin and ConnectorsTable, which are used to sto
35. mal auction, multi auction
Variable | Description | Example of Values
Feedback_score_at_listing_time | Feedback score of the seller at listing time of auction | An integer value
Start_price | Start price in EUR | A numerical value with n decimals
Buy_it_now_price | Buy it now price in EUR for buy | A numerical value with n decimals
Buy_it_now_listed_flag | Auction listing with buy it now option | 1 if the information is true
Bold_fee_flag | Auction listing with boldface | 1 if the information is true
Featured_fee_flag | Auction listing as homepage top offer | 1 if the information is true
Category_featured_fee_flag | Auction listing as category top offer | 1 if the information is true
Gallery_fee_flag | Auction listing with gallery image | 1 if the information is true
Gallery_featured_fee_flag | Auction listing with gallery just in gallery view | 1 if the information is true
Ipix_featured_fee_flag | Auction listing with ipix (additional xxl pic show pack) | 1 if the information is true
Reserve_fee_flag | Auction listing with reserve price | 1 if the information is true
Highlight_fee_flag | Auction listing with background color in list view | 1 if the information is true
Schedule_fee_flag | Auction listing with determination of start time | 1 if the information is true
Border_fee_flag | Auction listing with border | 1 if the informati
36. [Screenshot: Statistical Reports tree with the Maximum Smart Variable Contribution report for the target gms_greater_avg. The variables shown include start_price_div_mean_category, buy_it_now_price_div_mean_category, listing_start_monthofyear and listing_end_monthofyear.]
You can see that both variables listing_start_monthofyear and listing_end_monthofyear appear in the top ten variables. When looking at the significance of their categories, you will notice that auctions happening in December (indicated as 12 on the graph below) have a better chance of selling higher than the average. This can be explained by the fact that people buy more around Christmas than during any other period of the year.
[Screenshots: Category Significance for the variables listing_start_monthofyear and listing_end_monthofyear of the model gms_greater_avg_dmc2006_enriched, showing the influence of each category on the target]
When
37. Scenario
This scenario demonstrates how to use the InfiniteInsight Explorer Text Coding feature for creating a standard model.
The file dmc2006.txt is the sample data file that you will use to follow the scenario described in this user guide. It is the contest file from the Data Mining Cup 2006 (http://www.data-mining-cup.com/2006/wettbewerb/aufgabe/1165919250), a German eBay file containing auctions, in full conformance with data privacy protection. The data used in this scenario are online auctions from the category Audio & Hi-Fi MP3 Player Apple iPod.
The purpose of this scenario is to predict, for new auctions, whether the actual sales revenue is higher than the average sales revenue of the product category.
Introduction to Sample Files
InfiniteInsight is provided with sample data files that allow you to evaluate the Text Coding feature and take your first steps in using it.
The variables contained in the sample file dmc2006.txt are described in the following table.
Variable | Description | Example of Values
auct_id | ID number of auction | An index value
Item_leaf_category_name | Product category | numerical value with two decimals
Listing_title | Title of auction |
Listing_subtitle | Subtitle of auction |
Listing_start_date | Start date of auction | A date in the format such as
Listing_end_date | End date of auction |
Listing_durtn_days | Duration of auction | Specification in days
Listing_type_code | Type of auction | nor
38. [Screenshot: the StemmingRules_dmc file opened in Notepad, with the columns CondWord, CondR1, CondR2, Match, Replace and StepAfter. Most conditions are set to nocond. The legible entries include rules for roots such as dock, neu, new, station, green, weiss, restgarantie, foto, musik, songs and blue.]
39. ng-blue=FlyingBlue
Or you can apply the concept of creditcard to any credit card, such as American Express or Visa Card:
credit-card=creditcard
american-express=creditcard
visa-card=creditcard
mastercard=creditcard
Notes:
You have to put an equals sign between the words and the concepts, replace the blanks or any other separator by dashes, and write the words in lower case letters, since the concept merging is applied after the removal of all upper case letters.
You have to define the concept merging for both the singular and plural forms of the words to cover all the occurrences.
Since the concept list is language dependent, the appropriate list is automatically selected once the language has been either automatically identified or selected by the user.
Synonyms Replacement: this option allows you to use an external file defining synonymic roots. It is used to replace some roots by a root selected by the user. This option is applied after the stop words have been removed and the stemming rules have been applied. You can create your own synonyms dictionary by creating a text file named SynonymList_<LanguageCode>, which contains on each line a root found by Text Coding associated with the syno
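The formatting notes for concept lists (equals sign, dashes instead of blanks, lower case words) can be illustrated with a small Python helper. This is a sketch under assumptions: the helper name `concept_line` is hypothetical, "dashes" is read literally as the hyphen character, and the exact file syntax should be checked against the ConceptList_dmc file shipped with the product.

```python
import re


def concept_line(phrase, concept):
    """Build one concept-list entry from a group of words and its concept.

    Words are lower-cased, any run of separators becomes a single dash,
    and an equals sign links the words to the concept name.
    """
    # [\W_]+ splits on blanks, punctuation and underscores (Unicode-aware).
    words = [w for w in re.split(r"[\W_]+", phrase.lower()) if w]
    return "-".join(words) + "=" + concept


# Singular and plural forms each need their own entry, as the notes point out.
for phrase in ("Credit Card", "Credit Cards", "American Express"):
    print(concept_line(phrase, "creditcard"))
```

Generating the entries from a reviewed list of phrases keeps the lower-casing and separator handling consistent across the file.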
40. nt importance to the values obtained for the KI and KR of a model.
The model generated on the provided data gives the following results:
KI: 0.468
KR: 0.970
[Screenshot: Training the Model, model gms_greater_avg_dmc2006. Target gms_greater_avg with category 0 at 53.14% and category 1 at 46.86%. Selection Process: KI 0.468, KR 0.970, Nb Variables Kept 17.]
Presentation of the InfiniteInsight User Menu
Once the model has been generated, click the Next button. The screen Using the Model will appear.
[Screenshot: Using the Model screen] Display: Model Overview, Model Graphs, Contributions by Variables, Category Significance, Statistical Reports, Scorecard, Confusion Matrix. Run: Analyze
41. With each method, you have been able to uncover more and more information from your data. When looking at the Maximum Smart Variable Contributions below, you can see that the majority of the most contributive variables come from the textual analysis of the data. The variable that contributes the most to the target is tc_listing_title_capacity_2gb.
[Chart: Maximum Smart Variable Contributions. The variables shown include tc_listing_title_capacity_2gb, start_price_div_mean_category, item_leaf_category_name, tc_listing_title_capacity_30gb, tc_listing_title_state_neu, auct_id, tc_listing_title_capacity_4gb, tc_listing_title_color_weiss, buy_it_now_price, buy_it_now_listed_flag, category_avg_gms and tc_listing_title_state_defekt.]
Annex
Regular Expression Reminder
The regular expressions engine used for the stemming rules is a PCRE engine (Perl Compatible Regular Expressions). The following table summarizes the main elements that can be used in the regular expressions: general escape character with sever
42. nym root, as shown below:
<found_root> <replacement_root>
Since the synonyms list is language dependent, the appropriate list is automatically selected once the language has been either automatically identified or selected by the user.
Maximum Generated Root Number: this option allows you to select how many roots you want to keep in the dictionary. By default, the roots with the highest frequencies are kept, but you can select a percentage of the most frequent roots to exclude by clicking the Advanced button.
Each root is converted into a variable, and when the root appears in a text, its presence can be encoded in one of the following ways:
Boolean: the presence of the word is encoded 1 and its absence is encoded 0.
Term Frequency (TF): the number of occurrences of the root in the current text.
TF-Inverse Document Frequency: a measure of the general importance of a root in the current document relative to the whole set of documents, based on Term Frequency.
TF-IDF = TF * log10(TotalNumberOfDocuments / NumberOfDocumentsContainingTheRoot)
Term Count (TC): the number of times the root appears in the current text.
TC-Inverse Document Frequency: a measure of the general importance of a root in the current document relative to the whole set of documents, based on Term Count.
TC-IDF = TC * log10(TotalNumberOfDocuments / NumberOfDocumentsContainingTheRoot)
For this Scenario
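To make the encodings above concrete, here is a minimal Python sketch that computes the Boolean value, the raw count and the count weighted by inverse document frequency for each root. It is an illustration, not the product's implementation: the function name `encode` and the sample documents are made up, and the formula assumes the usual form count * log10(N / n), since the operators are not legible in the extracted text. The guide describes Term Frequency and Term Count in the same words, so the sketch uses the raw count for both.

```python
import math
from collections import Counter


def encode(documents):
    """Encode each document (a list of roots) as root -> (boolean, count, count_idf).

    Roots absent from a document are simply missing from its dict,
    which corresponds to an encoding of 0.
    """
    total = len(documents)
    # Number of documents containing each root at least once.
    doc_freq = Counter(root for doc in documents for root in set(doc))
    encoded = []
    for doc in documents:
        counts = Counter(doc)
        encoded.append({
            root: (1, n, n * math.log10(total / doc_freq[root]))
            for root, n in counts.items()
        })
    return encoded


# Hypothetical listing titles, already reduced to roots.
docs = [["ipod", "2gb"], ["ipod", "4gb"], ["ipod", "ipod", "neu"]]
for row in encode(docs):
    print(row)
```

A root that appears in every document, like ipod here, gets an IDF weight of log10(1) = 0, which is the numerical counterpart of treating it as a stop word.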
43. od, you will have to create a list of stop words specific to the current domain, as well as stemming rules adapted to this domain.
The process is the same as the one you used for the advanced method; you will only need to set the language to the one you will create in the following steps. The two sections below describe how to create a stop words list and stemming rules. However, since creating the stop words list and the stemming rules can be lengthy, both are provided as an example in InfiniteInsight. Thus the new language dmc will appear in the list of languages.
How to Detect Stop Words
Stop words are words that bring no information because they are either too frequent or, on the contrary, too rare. Typically, stop words are link words, such as aber, ob, ich, so, am, auf in German. However, other words can also be defined as stop words. The panel KTC Model Learning obtained by the advanced method without stop words and stemming rules can give you insight into which words can be considered stop words.
[Screenshot: KTC Model Learning panel for the textual variable listing_title]
This panel lists, for each textual variable, the identified roots in the Root
44. on Explanatory variables selected
[Screenshot: New Regression/Classification Model, Selecting Variables screen, with gms_greater_avg as the target variable]
3 Click the Next button. The screen Summary of the Modeling Parameters will appear.
Results
Model Performance Indicators
Once the model has been generated, you must verify its validity by examining the performance indicators.
The quality indicator KI allows you to evaluate the explanatory power of the model, that is, its capacity to explain the target variable when applied to the training data set. A perfect model would have a KI equal to 1 and a completely random model would have a KI equal to 0.
The robustness indicator KR defines the degree of robustness of the model, that is, its capacity to achieve the same explanatory power when applied to a new data set. In other words, the degree of robustness corresponds to the predictive power of the model applied to an application data set.
To see how the KI and KR indicators are calculated, see KI, KR and Profit Curves in the InfiniteInsight User Guide.
Note: Validation of the model is a critically important phase in the overall process of data mining. Always be sure to assign significa
45. on is true
Qty_available_per_listing | Quantity of offered articles for multi auctions | An integer value
Gms | Achieved sales revenue in EUR (for multi auctions, average price of sold articles) | A numerical value with n decimals
Category_avg_gms | Average sales revenue in EUR of product category item_leaf_category_name | A numerical value with n decimals
Gms_greater_avg | Target | 0 if gms < category_avg_gms, 1 if gms > category_avg_gms
The file desc_dmc2006_without_textual.txt is the description file corresponding to the data file dmc2006.txt.
The file dmc2006_enriched.txt is an enriched version of the dmc2006.txt data set. The KXEN Data Manipulation feature has been used to create new variables from the ones already existing in the original data set.
The file desc_dmc2006_enriched_no_textual.txt is the description file corresponding to the data file dmc2006_enriched.txt, with no variable declared as string textual.
The file desc_dmc2006_enriched_textual.txt is the description file corresponding to the data file dmc2006_enriched.txt, with the listing_title variable declared as string textual.
Introduction to InfiniteInsight
To accomplish the scenario, you will use the Java-based graphical interface of InfiniteInsight. It will allow you to select the feature with which you will work an
46. pecific to the scenario presented in this guide | For this Scenario
Before Beginning
Files and Documentation Provided with this Guide
Sample Data Files
Both the evaluation version and the registered version of InfiniteInsight are supplied with sample data files. These files allow you to take your first steps using various features of InfiniteInsight and to evaluate them.
During installation of InfiniteInsight, the following sample files for Text Coding are saved under the folder Samples/KTC:
dmc2006.txt
desc_dmc2006_without_textual.txt
dmc2006_enriched.txt
desc_dmc2006_enriched_no_textual.txt
desc_dmc2006_enriched_textual.txt
To obtain a detailed description of these files, see Introduction to Sample Files.
The folder Samples/KTC is located:
for Windows, in the folder Program Files\KXEN\InfiniteInsight6.1.0\Samples\KTC
for UNIX, in the folder Samples/KTC located in the folder where you have decompressed the KXENAF archive file (that is, .tar.Z or .tar.gz)
Supported Languages Files
The InfiniteInsight Explorer Text Coding feature comes packaged with rules for several languages and can be easily extended to other languages. The pre-packaged set that comes with the installation includes Dutch (Du), English (En), French (Fr), German (Ge), Spanish (Sp) and Italian (It).
The folder Resources/KTCData is located
47. re representations of data manipulations.
For more technical details regarding InfiniteInsight, please contact us. We will be happy to provide you with more technical information and documentation.
What this Document Covers
This document introduces you to the main functionalities of the InfiniteInsight Explorer Text Coding feature. Using the application scenario, you can create your first models with confidence.
InfiniteInsight Explorer Text Coding, previously known as KTC, lets you build predictive models from data containing textual fields. Thanks to Text Coding models, you can:
Improve your models with textual processing
Handle some text mining problems, such as text categorization or mail rerouting
Do automatic language recognition
To know more about the basic concepts underpinning InfiniteInsight, read the InfiniteInsight User Guide.
How to Use this Document
Organization of this Document
This document is subdivided into three chapters.
This chapter, Welcome to this Guide, serves as an introduction to the rest of the guide. This is where you will find information pertaining to the reading of this guide and information that will allow you to contact us.
Chapter 2, General Introduction to Scenario, provides a summary of the Text Coding application scenario. It also int
48. for Windows, in the folder Program Files\KXEN\InfiniteInsight6.1.0\Resources\KTCData
for UNIX, in the folder Resources/KTCData located in the folder where you have decompressed the KXENAF archive file (that is, .tar.Z or .tar.gz)
[Screenshot: content of the folder Resources/KTCData. The files listed are ConceptList_dmc, KxLanguage.cfg, StemmingRules_dmc, StemmingRules_du, StemmingRules_en, StemmingRules_fr, StemmingRules_ge, StemmingRules_it, StemmingRules_sp, StopList_dmc, StopList_du, StopList_en, StopList_fr, StopList_ge, StopList_it, StopList_jp, StopList_sp and SynonymList_dmc.]
Documentation
Full Documentation
Complete documentation is included with InfiniteInsight. This documentation covers:
The operational use of InfiniteInsight features
The architecture and integration of the InfiniteInsight API
The Java graphical user interface
Contextual Help
Each screen in the Modeling Assistant is accompanied by contextual help that describes the options the
49. roduces the user interface and the data files used in this scenario.
Chapter 3, Standard Modeling with Text Coding, presents the InfiniteInsight Explorer Text Coding feature. It describes how to create five different predictive models by adding data to the original data set, using only Classification/Regression for the first two models and then Text Coding combined with Classification/Regression for the last three models. You will then be able to compare the results obtained with each model.
A summary and a detailed table of contents located at the beginning of the guide, and cross-references throughout the document, allow you to find the information that you need quickly and easily.
If you want more information on InfiniteInsight and on the essential concepts of modeling data, read the InfiniteInsight User Guide provided with the KXEN software.
Conventions Used in this Document
To facilitate reading, certain publishing conventions are applied throughout this guide. These are presented in the following table.
The following information items | Are presented using | For example
Graphical interface features and file names | Arial bold | Click Next.
The titles of particularly useful sections | Garamond italicized bold | See Operations
The titles of procedures | | To Select the Target Variable
The titles of sections s
50. scriptive Statistics [Screenshot: Descriptive Statistics panel; the variables shown include tc_listing_title_mp3 and tc_listing_title_player.] You can see that the roots mp3 and player are highly correlated, so you can create stemming rules that merge those roots into a single one:
43 3 mp3 nocond nocond mp3 mp3_player 4
44 3 player nocond nocond player mp3_player 4
The stemming rules are listed in a text file named StemmingRules_<language code>. For example, the stemming rules created for the specific language you are working on are stored in the file StemmingRules_dmc. The stemming rules list for the dmc language should look like the following file. [Screenshot: StemmingRules_dmc open in Notepad.]
51. st_dmc [Notepad window screenshot]
How to Build Simple Stemming Rules
According to the words displayed in the panel KTC Model Learning, you can build some simple stemming rules. The first thing that appears is that some words, such as 20gb, 20g and 20, can be merged into a single word, 20 GB. This can be defined by the following stemming rules:
7 3 20gb nocond nocond 20gb 20 GB 4
8 3 20g nocond nocond 20g 20 GB 4
Note: See Regular Expression Reminder on page 49 in the Annex.
The syntax of a stemming rule is the following:
Rule Step CondWord CondR1 CondR2 Match Replace StepAfter
The columns represent:
Rule: the number of the rule
Step: the step the rule belongs to
CondWord: the condition applied to the word
CondR1: the condition applied to the first region
CondR2: the condition applied to the second region
Match: the part of the word to select for replacement
Replace: the string that replaces the matched part
StepAfter: the step to go to if the rule has been applied
So the stemming rule 7 3 20gb nocond nocond 20gb 20 GB 4 says: if the word is 20gb, then replace 20gb by 20 GB and go to the stemming rules of step 4, if they exist. Another way to create stemming rules is to use the copy button in the panel KTC Mod
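The rule syntax described above can be illustrated with a minimal Python sketch. It is not the Text Coding engine: the class StemmingRule and the function apply_rules are hypothetical names, CondWord and Match are assumed to behave like Python regular expressions (CondWord matched against the whole word), and the region conditions CondR1 and CondR2 are only supported when set to nocond.

```python
import re
from dataclasses import dataclass

NOCOND = "nocond"  # value used in the guide's examples for "no condition"


@dataclass(frozen=True)
class StemmingRule:
    """One rule, with one field per column of the documented syntax."""
    rule: int        # Rule: number of the rule
    step: int        # Step: step the rule belongs to
    cond_word: str   # CondWord: condition applied to the word
    cond_r1: str     # CondR1: condition applied to the first region
    cond_r2: str     # CondR2: condition applied to the second region
    match: str       # Match: part of the word to select for replacement
    replace: str     # Replace: string that replaces the matched part
    step_after: int  # StepAfter: step to go to once the rule has been applied


def apply_rules(word, rules, start_step, max_hops=100):
    """Apply the rules step by step; stop when no rule of the current step fires.

    Assumption: rules of a step are tried in ascending rule number.
    max_hops only guards this sketch against rule sets that loop forever.
    """
    step = start_step
    for _ in range(max_hops):
        candidates = sorted((r for r in rules if r.step == step), key=lambda r: r.rule)
        for r in candidates:
            if r.cond_r1 != NOCOND or r.cond_r2 != NOCOND:
                raise NotImplementedError("region conditions are not modeled in this sketch")
            if r.cond_word != NOCOND and re.fullmatch(r.cond_word, word) is None:
                continue  # the word does not satisfy CondWord
            # A callable replacement keeps the Replace text literal.
            new_word, hits = re.subn(r.match, lambda _m, rep=r.replace: rep, word, count=1)
            if hits:
                word, step = new_word, r.step_after
                break  # a rule fired: continue with the rules of StepAfter
        else:
            return word  # no rule of this step applied (or the step has no rules)
    return word


# The two example rules from the guide, transcribed field by field.
EXAMPLE_RULES = [
    StemmingRule(7, 3, "20gb", NOCOND, NOCOND, "20gb", "20 GB", 4),
    StemmingRule(8, 3, "20g", NOCOND, NOCOND, "20g", "20 GB", 4),
]

if __name__ == "__main__":
    for w in ("20gb", "20g", "ipod"):
        print(w, "->", apply_rules(w, EXAMPLE_RULES, start_step=3))
```

With these two rules, 20gb and 20g are both rewritten to 20 GB and then handed to step 4, which has no rules here, so processing stops; a word such as ipod is returned unchanged.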
52. t desc_dmc2006_enriched_textual.txt as the description file. Check that listing_title is set as textual.
Task: Text Coding, Setting the Language Definition. Screen: KTC Parameters Settings. Settings: Keep the default Language Definition Repository (Resources/KTCData). Select the User Defined Language option as the Language Recognition Mode. Select ge (German) in the combo box as the User Defined Language.
Task: Text Coding, Setting the Dictionary and Encoding Parameters. Screen: KTC Parameters Settings 2. Settings: Uncheck Stop Word Removing. Uncheck Stemming Reduction.
Task: Selecting the Target Variable and a Weight Variable. Screen: Selecting the Target Variable. Settings: Select gms_greater_avg as the target variable. Do not select a weight variable.
Task: Selecting Explanatory Variables. Screen: Selecting Variables. Settings: Exclude the variables KxIndex and gms from the list of variables to be used for modeling.
Results
The screen below shows the quality (KI) and robustness (KR) indicators obtained for the model generated with Text Coding.
[Screenshot: Training the Model, overview. Model: gms_greater_avg_dmc2006_enriched; Number of Records: 7998; Building Date: 2012-08-07 16:31:01; Learning Time: 43 s.]
53. ta Source, after selecting the option Use a File or a Database Table, select the option Text files in Data Type to choose the data source format to be used. Note: Depending on your environment, the Samples folder may or may not appear directly at the root of the list of folders. If you selected the default settings during the installation process, you will find the Samples folder in C:\Program Files\KXEN\KXENCompV3. 4. Select the file dmc2006.txt, then click OK. The name of the file appears in the Data Set field. [Screenshot: New Regression/Classification Model, Select a Data Source. Use a File or a Database Table is selected, Data Type is Text Files, and metadata are stored in the same place as the data source.] 5. Select a Cutting Strategy. Selecting a Cutting Strategy. For this scenario: select the Random without test cutting strategy. To Select a Cutting Strategy: 1. Once you have selected your data source, click the Cutting Strategy button. [Screenshot: New Regression/Classification Model, Select a Data Source, with Data Set dmc2006.txt and the Cutting Strategy button.]
54. textual Another of the top variables that already appeared in the previous model is listing_title. Looking at the variable categories, you can see that each category contains many varied elements, as shown in the screenshot below. [Screenshot: Variable listing_title, Influence on Target.] You can infer that this variable contains information that has yet to be exploited. Since this variable is a string, the best way to extract this hidden information is to use Text Coding, which leads you to the advanced method. Advanced Method: Using Text Coding to Extract Information from the Textual Variables. Description. Although the intermediate method resulted in a model that was both accurate and robust, you still have textual data not yet exploited. Since InfiniteInsight Modeler Regression/Classification is not designed to process such data, you w
55. th No Test
Select desc_dmc2006_enriched_textual.txt as the description file. Check that listing_title is set as textual.
Keep the default Language Definition Repository (Resources/KTCData). Select the User Defined Language option as the Language Recognition Mode. Select dmc in the combo box as the User Defined Language.
Check Stop Word Removing. Check Stemming Reduction. Check Concept Merging. Check Synonym Replacement.
Task: Selecting the Target Variable and a Weight Variable. Screen: Selecting the Target Variable. Settings: Select gms_greater_avg as the target variable. Do not select a weight variable.
Task: Selecting Explanatory Variables. Screen: Selecting Variables. Settings: Exclude the variables KxIndex and gms from the list of variables to be used for modeling.
Results
The screen below shows the quality (KI) and robustness (KR) indicators obtained for the model generated with Text Coding adapted to the specific domain of application.
[Screenshot: Training the Model, overview. Model: gms_greater_avg_dmc2006_enriched; Data Set: dmc2006_enriched.txt; Initial Number of Variables: 47; Number of Selected Variables: 1046; Number of Records: 7998; Building Date: 2012-08-07 17:27:44.]
56. ture is displayed. [Screenshot: Add a Modeling Feature panel.] Add a Classification/Regression: analyzes the textual data, generates the corresponding variables and builds a Classification/Regression model on them. Add a Clustering: analyzes the textual data, generates the corresponding variables and builds a Clustering model on them. Standalone Data Transformation: analyzes the textual data and generates the corresponding variables. For this scenario: 2. Click Add a Classification/Regression. Summary of the Modeling Settings to Use. The table below summarizes the modeling settings you must use for the advanced method. Except for the Text Coding specific steps, which are grayed in the table below, the settings are similar to the ones used for the intermediate method. The Text Coding steps are presented in detail in the following sections. For detailed procedures and more information, see the Modeling Process section of the Simple Method section.
Task(s): Screen
Specifying the Data to be Modeled: Data Source
Selecting a Cutting Strategy
Describing the Data: Data Description
Text Coding Setting KTC Parameters the L
57. Keep the default parameters. Click the Next button. The panel KTC Model Learning is displayed. [Screenshot: New Model with Text Coding, KTC Model Learning panel.] This panel lists the roots identified by Text Coding in the analyzed textual variable (here listing_title) with their respective frequency of occurrence in the data set. It allows you to identify the most frequent roots and to decide whether these roots are really meaningful for your problem. Results. The screen below shows the quality (KI) and robustness (KR) indicators obtained for the model generated with InfiniteInsight Explorer Text Coding. [Screenshot: Training the Model, overview. Model: gms_greater_avg_dmc2006_enriched; Data Set: dmc2006_enriched.txt; Initial Number of Variables: 47; Number of Selected Variables: 1046; Number of Records: 7998.]
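To make the idea of a root list with frequencies more concrete, here is a minimal Python sketch. It only approximates what the KTC Model Learning panel shows: the function root_frequencies is a hypothetical name, the tokenizer (lowercasing and splitting on non-alphanumeric characters) is an assumption, frequency is assumed to mean the share of records containing the root, and no stop word removal, stemming, concept merging or synonym replacement is performed. The sample titles are illustrative, not taken from the dmc2006 data set.

```python
import re
from collections import Counter


def root_frequencies(titles):
    """Return (root, record_count, share_of_records) tuples, most frequent first.

    Each root is counted at most once per title, so the share is the
    fraction of records that contain the root.
    """
    if not titles:
        return []
    counts = Counter()
    for title in titles:
        # Assumed tokenizer: lowercase, keep runs of letters and digits.
        roots = set(re.findall(r"[a-z0-9]+", title.lower()))
        counts.update(roots)
    total = len(titles)
    # Sort by descending count, then alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(root, n, n / total) for root, n in ordered]


if __name__ == "__main__":
    sample_titles = [
        "Apple iPod mini 4GB",
        "Apple iPod Nano 2GB",
        "iPod Nano NEU",
    ]
    for root, n, share in root_frequencies(sample_titles):
        print(f"{root:<8} {n:>3} {share:6.1%}")
```

A list like this makes it easy to spot roots that occur in nearly every record, and are therefore unlikely to discriminate, as well as spelling variants that are candidates for a stemming rule.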
58. ype in the value 1 to define this as a key variable. [Screenshot: data description table with the columns Name, Storage, Value, Key, Order, Missing, Group and Description; the rows shown include item_leaf_category_name, listing_title and listing_start_date.] Selecting the Target Variable and a Weight Variable. For this scenario: select the variable gms_greater_avg as your target variable. Do not select any weight variable. To Select the Target Variable: 1. On the screen Selecting Variables, in the section Explanatory Variables Selected (left hand side), select the variable you want to use as Target Variable. [Screenshot: New Regression/Classification Model, Selecting Variables, with gms_greater_avg in the Target Variables section.] Note: On the screen Selecting Variables, variables are presented in the same order as that in which they appear in the table of data. To sort them alphabetically, select the option Alphabetic Sort presented beneath each of the variable lists. 2. Click the button > located on the left of the screen section Target(s) Variable(s) (upper right hand side). The variable moves to the screen section Target(s) Variable(s). Also select a v