Data extraction tools
Figure 10: Dapper screenshot (field preview and renaming dialog over search results for "Barcelona")

Dapper is one of the easiest tools to use, as its interface is totally graphical. Apart from extracting data, it allows you to create Flash widgets or alerts from the extracted information. Linking a Dapp's output to another Dapp's input to create new Dapps is another useful feature.

Robomaker

Robomaker is a Web 2.0 developer platform for creating mashups. The tool lets the user create RSS feeds, REST Web services or Webclips in a few steps. It comes with powerful programming features, including interactive visual programming, full debugging capabilities, an overview of the program state and easy access to context-sensitive online help; these features make it really complete and dynamic. It can be used on both Windows and Linux platforms.

Figure: Kapow RoboMaker 6.3 screenshot (robot editing view with the project library and action toolbars)
Figure 33: Google results with Dapper

Yahoo Search

In this case Dapper is able to extract all the entries without problems. We have to mention that Yahoo Search uses a live search input form with AJAX code, but in this case it does not affect our data extraction. What happens is that we cannot join all the description fields into a single description item, because the HTML structure of the description does not include the previously commented extra fields (URL and cached). In conclusion, Dapper passes this test.

Figure: data mapping of the Yahoo Search results (each item with its title, description and URL fields, e.g. "Barcelona - Wikipedia, the free encyclopedia")
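The join problem described above could be handled by a post-processing step outside the tool. A minimal sketch, assuming the extracted record arrives as a dictionary with numbered description fragments; the field names and sample values are illustrative, not Dapper's actual output format:

```python
def merge_descriptions(record):
    """Join all description* fragments of one extracted record into one field."""
    fragments = [v for k, v in sorted(record.items())
                 if k.startswith("description") and v]
    merged = {k: v for k, v in record.items()
              if not k.startswith("description")}
    merged["description"] = " ".join(fragments)
    return merged

# illustrative record, mimicking the split fields seen in the test
record = {
    "title": "Barcelona - Wikipedia, the free encyclopedia",
    "description1": "Provides an overview of the history and culture",
    "description2": "of the Spanish city of Barcelona.",
}
clean = merge_descriptions(record)
```

Sorting the keys keeps the fragments in their original order before joining.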
The cases analyzed are: Dapper to Dapper; Dapper to Web Content Extractor; Web Content Extractor to Dapper; Web Content Extractor to Web Content Extractor.

5.1 Dapper to Dapper

When talking about using the output of one Dapp as the input of another Dapp, we can already use a built-in option from Dapper: the Dapp Linker. To use it, we only have to select a first Dapp that is going to provide us an output, and then select another one that is going to use this data as input. As the Dapp Linker page itself explains: "Use this page to link a Dapp with one or more other Dapps. For instance, if you have a Dapp which returns movies, you can link it to a Dapp which returns ratings for a given movie. The Dapp Linker will do the work for you, and the end result is a Dapp which you can use like any other."

Figure 80: Dapp Linker screenshot

Our first Dapp was configured to extract data from a previous example of our documentation: the Kings of Sun 2008 Contest of Chapter 2.4. We have configured this Dapp to extract almost all of the given data.

Figure 81: Highlighted data extracted by the first Dapp

A second Dapp has been created to extract data from the same source, but this time we have only extracted the names of the players. By linking these two wrappers we obtained a new output. The format of this output pairs each entry of the first wrapper with all of the entries of our second wrapper.
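The linking behaviour just described — for each entry of the first wrapper, all of the entries of the second — amounts to a cross product of the two result sets. A minimal sketch; the data values are illustrative stand-ins for the two Dapps' outputs:

```python
def link_wrappers(first_entries, second_entries):
    """Pair every entry of the first wrapper with all entries of the second."""
    return [
        {"first": f, "second": s}
        for f in first_entries
        for s in second_entries
    ]

# illustrative outputs of the two Dapps from the Kings of Sun example
first = [{"title": "Top 10 Users Classification"}]
second = [{"player": "Player 1"}, {"player": "Player 2"}]
linked = link_wrappers(first, second)
```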
Figure 34: Yahoo Search results with Dapper

Live Search

We have extracted all the entries, but, as happened with Yahoo Search, we cannot join all the description fields into a single description item. Still, we conclude that Dapper is able to perform searches through Microsoft Live Search without problems.

Figure 35: MSN Live Search results with Dapper (data mapping of each item, e.g. title "Barcelona - Wikipedia, the free encyclopedia" with its description and URL fields)

Robomaker

As with Dapper, we have chosen to use an RSS feed output. To realize a correct data extraction from these Web search engines, we have to design a flow of actions.
4.1 Overview of tests

To realize these tests, we decided to use well-known and frequently visited Web pages where an enterprise or a single user could find interest in extracting data. This set of tests tries to embrace several aspects of the data extraction process, considering several situations with specific characteristics. A general view of them, presenting the sources used and the goals of each one, is shown in the next table.

Figure 20: General overview of the tests used

The reason why we have selected a set of tests to drive the extractions is that we need a process to qualify the extraction of each of the tools, analyzing several of the features presented in section 3.4, and to get tangible results from which to elaborate the final categorization of the tools. Other types of tests could be used to achieve similar results, but the ones selected take a global view of the extraction features and are suitable for this purpose. The general tests embrace several scenarios, from basic data extraction to dynamic content pages. They are representative and give an idea of the behavior of our tools in several situations. Resilience and precision tests have been introduced to evaluate these two particular features, both really important when extracting Web data.

4.2 Methodology

To realize these tests, a methodology has been elaborated to get final results and to know which steps we are going to follow.
We can carry out this process from the simplest data extractions to the most complex ones. It is true that at the beginning we have to learn how to use the tools, and realizing complex data extractions means spending some time experimenting with them. In conclusion, they are, in a general view, the best professional option to extract data from Web sites.

7 Conclusions

In this last chapter we draw conclusions from the entire developed work, pointing out the most important features of a data extraction tool, taking into account some criteria and user profiles. Next, we explain the problems that we have faced doing all the tests, the documentation and other general aspects. To conclude, possible future work and some ideas to go further with this project are explained to finalize this document.

From all the features analyzed in chapter 3.4, we are going to explain which are the most important and why. In fact, all of them are really useful, but due to the scenarios that we can find on the Web and due to usability, some stand out.

Interface: Using a GUI to accomplish extractions gives the user a high degree of ease when configuring and performing the data extractions. As a matter of fact, when extracting data from visual sources like Web sites, the use of a GUI becomes a natural and logical way to deal with the content and to select the sources. The absence of a GUI makes things harder, as we don't have a d
        <list>
            <xpath expression="//div[matches(@class,'Estilo2')]/text()">
                <html-to-xml>
                    <http url="http://www.dedicom.net/test/sample_test.htm"/>
                </html-to-xml>
            </xpath>
            <xpath expression="//div[contains(@class,'Estilo6')]/div/text()">
                <html-to-xml>
                    <http url="http://www.dedicom.net/test/sample_test.htm"/>
                </html-to-xml>
            </xpath>
            <xpath expression="//td[contains(@height,'5')]/text()">
                <html-to-xml>
                    <http url="http://www.dedicom.net/test/sample_test.htm"/>
                </html-to-xml>
            </xpath>
        </list>
        <body>
            <var name="link"/>
        </body>
    </loop>
</config>

The first XPath expression extracts the title from the Web page, the second one extracts the description, and the third one all the names of the players. We have to use the information in the HTML tags to guide the tool to extract the data. The results are shown here:

Figure 30: Data extracted by Web-Harvest (the contest title, the description "This is the final table result of the Kings of Sun 2008 Contest. This strategy game was created by Likstorh Software in 2005 and, due to the growth of online players, each..." and the list Player 1 to Player 10)

In this e
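For comparison, the three selections performed by the configuration above can be reproduced outside Web-Harvest with the Python standard library. A minimal sketch: xml.etree only supports exact attribute matches (unlike the contains()/matches() predicates above), and the page snippet is a stand-in for the real test page:

```python
import xml.etree.ElementTree as ET

# stand-in for the HTML of the sample test page, already cleaned to XML
page = """<body>
  <div class="Estilo2">Kings of Sun 2008 Contest</div>
  <div class="Estilo6"><div>This is the final table result</div></div>
  <table><tr><td height="5">Player 1</td><td height="5">Player 2</td></tr></table>
</body>"""

root = ET.fromstring(page)
# title: the div carrying class Estilo2
title = root.find(".//div[@class='Estilo2']").text
# description: the inner div of the Estilo6 container
description = root.find(".//div[@class='Estilo6']/div").text
# players: every td whose height attribute is 5
players = [td.text for td in root.findall(".//td[@height='5']")]
```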
Figure 84: Web Content Extractor final results

As shown in the previous figure, Web Content Extractor has realized a successful data extraction from the output of Dapper. In conclusion, we can concatenate the results of Dapper to the input of Web Content Extractor.

5.3 Web Content Extractor to Dapper

In this case, the first program, producing the first output, will be Web Content Extractor, and the tool receiving the input, Dapper. Let's configure WCE: we are going to extract almost all of the data of our Kings of Sun 2008 Contest.

Figure 85: Web Content Extractor HTML output

Once we receive all the results in HTML, we are going to use Dapper to extract some of these fields. Configuring it to extract the title, the subtitle and the description will not generate problems, and the output will look like this:

title: Top 10 Users Classification
subtitle: Kings of Sun 2008 Contest
description: This is the final table result

Figure 86: Dapper final results

5.4 Web Content Extractor to Web Content Extractor

In this last case, we are going to use Web Content Extractor twice to extract information, linking the output to the input. The first HTML output data looks like Figure 85 from the previous case, and the final output is really similar, as we are only extracting some fields of all the information.

Figure 87: Web Content Extractor final results
Figure 40: Web Content Extractor results screenshot (the data columns with their HTML paths, e.g. First Name at TABLE[1]/TBODY[1]/TR[1]/TD[2]/H2 and Second Name at DIV[1]/H2[1], and the result preview for the Google search "barcelona")

Goldseeker

With Goldseeker, the data extraction process is based on a substring method which is configured in a file. To realize an extraction we need the name of the container tag that holds the information we want to extract; depending on the search performed, this result is going to vary, so the tool is focused on dealing with static content. We cannot take advantage of it in Web search engines.

Webharvest

To realize an extraction with Webharvest we have to edit an entire configuration file. We have used XPath expressions
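The substring method attributed to Goldseeker above — find the named container tag and pull out whatever sits between its opening and closing markers — can be sketched as follows. This is an illustration of the idea, not Goldseeker's actual code:

```python
def extract_container(html, tag):
    """Return the contents of the first <tag ...>...</tag> pair, or None."""
    open_pos = html.find("<" + tag)
    if open_pos == -1:
        return None
    start = html.find(">", open_pos) + 1          # end of the opening tag
    end = html.find("</" + tag + ">", start)      # matching closing tag
    return html[start:end]

# illustrative static page fragment
html = '<html><h1 class="title">Kings of Sun 2008 Contest</h1></html>'
title = extract_container(html, "h1")
```

Because the method depends on fixed tag positions in the source, it works on static pages but, as noted above, breaks down on result pages that vary with each search.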
Figure 82: Dapp Linker final results (the title "Top 10 Users Classification" paired with each player entry: Player 1, Player 2, ...)

5.2 Dapper to Web Content Extractor

In this case we are going to combine the output and the input of two different tools. One of the output formats of Dapper is HTML; we are going to take advantage of this feature and use the resulting HTML code as the input of the Web Content Extractor tool. A part of the output given by Dapper is the following:

Figure 83: Dapper HTML output (title, subtitle, the contest description, and one entry per user with military, technology, religion, social and total scores, e.g. Player 1: military 359, technology 566, religion 45, social 411, total 1381)

After that, we are going to use this output with Web Content Extractor. We are going to extract the title, the subtitle and the description of this Web site.
tool functions and other source code. Then we use two files as parameters for the constructor of the GSParser: Kings.data, which is the Web page containing all the HTML structure (in this case the file is directly provided by the local server), and Kings.gs, which contains the configuration in the tool's file format and configures it to extract the data. Without problems, we extracted the data using Goldseeker; here we present a small part of the output:

Array
(
    [0] => Array
        (
            [name] => Title
            [instances] => Array
                (
                    [0] => Array
                        (
                            [contents] => Kings of Sun 2008 Contest
                            [position] => 1166
                        )
                )
        )
    [1] => Array
        (
            [name] => Description
            [instances] => Array
                (
                    [0] => Array
                        (
                            [contents] => This is the final table result of the Kings of Sun 2008 Contest...
                            [position] => 1404
                        )
                )
        )
)
Done

Figure 29: Final output using Goldseeker

Webharvest

In this case we used XPath expressions to extract data from our scenario. The configuration file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
    <loop item="link" index="i" filter="unique">
6 Categorization of the data extraction tools

This chapter deals with the necessity of categorizing the data extraction tools. It can be seen as a final conclusion after realizing tests, achieving results, reading documentation and all the work done in this project. The goal of the categorization is to give the final user an idea of the kinds of scenario in which one tool is better than another, of the advantages and disadvantages of each tool, and to realize a final conclusion by analyzing several characteristics. We also want to give a qualitative assessment of some of the characteristics that our tools have.

It is true that tools having a GUI offer facilities to the user compared to the non-GUI tools. Anyway, it is really useful to analyze all the cases. Let's present an example. We want to extract certain data from a Web page, and we want the output in a concrete format. For an enterprise no problem will occur: the license can be bought, and a GUI tool will be comfortable. For an individual user it could be too expensive to buy a license to realize only a few or very specific data extractions; for this user a free non-GUI tool will be better.

A categorization of the tools is going to be constructed considering qualitative characteristics derived from the tests conducted in this document. This process enables us to draw conclusions and to select the best types of scenarios for each tool, knowing what the strong and
Figure: RoboMaker extraction flow ("For Each" loop with Extract Title, Extract URL and Extract Description steps, returning one item per result) applied to a search for "Lamborghini"; the preview lists results such as "Lamborghini Gallardo - Wikipedia", "Automobili Lamborghini Holding SpA" (official US distributor site), "Lamborghini Traktoren" and "Lamborghini Bilder".
Our data extraction tools read the HTML code to perform extractions. All the static content is written in HTML; this is not the case for dynamic content such as Javascript, AJAX or Flash. Our data extraction tools cannot parse or treat this information like normal HTML: it doesn't follow the same syntax; sometimes it has to be preprocessed before displaying a result; in other cases the result is only visual, or changes can be introduced every time the page is loaded. Some of our tools have support for dynamic content, especially Javascript, but this kind of content often generates difficulties when performing data extractions.

2.4 Ideal characteristics of a Web page for data extraction: an example

Having analyzed the characteristics of the extraction process and the problems that could generate difficulties, we are going to construct a sample page to extract data from. The aim of this chapter is to reflect what the ideal Web page, one that gives facilities to our tools to extract data, looks like. As is easy to imagine, this sample page is going to be constructed avoiding all the previously commented problems. It will have the following characteristics:

- Structured data representation
- HTML code following the W3C standard
- No nested data elements
- Structure containing the same type of elements
- Flash or scripts used don't contain data to be extracted
- Use of CSS styles to identify and give format to elements

Taking a
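The payoff of such an ideal structure is that even a trivial wrapper can extract the data reliably. A minimal sketch with the standard library's HTML parser; the markup is an illustrative stand-in for the sample page, with one flat, repeated cell per result and a CSS class identifying each element:

```python
from html.parser import HTMLParser

class PlayerExtractor(HTMLParser):
    """Collect the text of every <td class="player"> cell."""
    def __init__(self):
        super().__init__()
        self.in_player = False
        self.players = []

    def handle_starttag(self, tag, attrs):
        if tag == "td" and ("class", "player") in attrs:
            self.in_player = True

    def handle_data(self, data):
        if self.in_player:
            self.players.append(data.strip())
            self.in_player = False

# flat, non-nested, W3C-style structure: same element type per row
page = """<table>
  <tr><td class="player">Player 1</td><td>1381</td></tr>
  <tr><td class="player">Player 2</td><td>1204</td></tr>
</table>"""

parser = PlayerExtractor()
parser.feed(page)
```

With nested or irregular markup, this kind of single-pass extraction would need far more bookkeeping, which is exactly the point of the characteristics listed above.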
Figure 22: Final output using Dapper (the description and the players list, Player 1 to Player 10)

Robomaker

Robomaker presented no problems when extracting simple data. We only had to select the title and the description to extract these fields, and introduce a loop to select all the players. The final results follow:

Figure 23: Final output using Robomaker (each player, Player 1 to Player 10, paired with the contest title "Kings of Sun 2008 Contest" and its description)

Lixto

First of all, we have to know that Lixto VD only delivers its results in XML format, so we have to create a Lixto Data Model to specify the shape of the output XML file. As we are going
Software. It provides businesses with effective, user-friendly and time-critically viable wrapping, integration and delivery of information, all in the same product.

Figure: Lixto Visual Developer screenshot (a project extracting Google results for the query "Barcelona"; the embedded browser shows German-language results such as "Barcelona - Wikipedia" and various Barcelona travel-guide and hotel links)
XML-based Web sites, which still make up the vast majority of the Web content.

Figure 16: Web-Harvest screenshot (the configuration editor with an XPath-based config for downloading Google Images search results, and the processor execution log below)

Goldseeker

Goldseeker is a data extraction tool, specifically a script under the GNU LGPL license. It was built to extract formatted data from HTML files, but it can be used with all kin
Figure 76: Final results extracting data from formatted text (columns: all the information of the row; date of the last publication; year of the last publication; 2 last digits of the year of the last publication)

The only column that has changed is the second one, as now some of our programs can split up the content by taking advantage of the <strong> tag.

4.5.4 Extracting data using styled text

It is very common to use CSS to give a style to our text. We can use a CSS style to identify the elements that appear on the page, taking the information from the class attribute. For example, we are going to format only the date of the "Last published edition" field using the following CSS entry:

.date {
    font-family: Verdana, Arial, Helvetica, sans-serif;
    font-size: 14px;
}

Then the HTML source of the data would probably be as follows:

<div align="center" class="Estilo9"><span class="date">1998-07-07</span> First edition</div>

With this kind of tagging, our tools can recognize the date and separate it from the rest of the field. The resulting table is the following:

Figure 77: Final results from extracting data using styled text (for each tool — Dapper, Robomaker, WinTask, Automation Anywhere, Goldseeker, Webharvest — which of the four columns it could extract: all the information of the row, the date of the last publication, the year, or the 2 last digits of the year)

4.5.5 Extracting data from CSV-formatted text

Now we are going to try it with CSV data. It is a file type
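A wrapper keying on the class="date" span shown above can split the date from the rest of the field. A minimal stand-alone sketch using that exact snippet (parsed as XML, which works here because the snippet is well-formed):

```python
import xml.etree.ElementTree as ET

snippet = ('<div align="center" class="Estilo9">'
           '<span class="date">1998-07-07</span> First edition</div>')

div = ET.fromstring(snippet)
span = div.find("span[@class='date']")   # locate the element by its CSS class
date = span.text                          # text inside the span
rest = (span.tail or "").strip()          # text following the span in the div
```

The class attribute is what makes the separation trivial; without it, the date and "First edition" are one undifferentiated text run.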
and directly to this path we could not extract information. This tool therefore doesn't pass the test.

Webharvest

As we are using XPath expressions to refer to the extracted data fields, and we used the information contained in the SPAN tags, this alteration of the content caused no problems for the extraction. Due to this fact, Webharvest passed this test.

Figure 61: Final results of the first resilience test (Dapper, Robomaker, Lixto, Web Content Extractor, Webharvest)

4.4.4 Test 2: Delete previous content from the extracted data

The second attempt to modify the structure of the page consists of deleting the first div container and a table. Together they represent the dark area of the next image.

Figure: the Amazon results page for "jungle", with the category sidebar (Books, Home & Garden, MP3 Downloads, Music, ...) and the first result ("The Jungle (Enriched Classics) by Upton Sinclair, Mass Market Paperback, April 27, 2004; Buy new: $5.95; 54 Used & new from $2.67"); the area to be deleted is highlighted.
at the same time, they have to be configured in a manual way. We are going to concentrate our effort on HTML-aware tools, which have a high degree of automation but can only extract information from HTML. Explained briefly, these are the main characteristics of each type of tool found in the taxonomy:

Languages for wrapper development: This was one of the first initiatives to assist users in constructing wrappers. These languages were proposed as alternatives to general-purpose languages such as Perl or Java, which were prevalent for this task so far. Some of the best-known tools that adopt this approach are Minerva, TSIMMIS and Web-OQL.

Ontology-based: This type of tool relies directly on the data to perform extractions. Given a specific application domain, an ontology can be used to locate constants present in the page and to construct objects with them. The most representative tool of this approach is BYU.

NLP-based: This type of tool uses Natural Language Processing (NLP) techniques to learn extraction rules for extracting relevant data from natural-language documents. They use rules based on syntactic and semantic constraints that help to identify relevant information within a document. The most representative tools of this approach are RAPIER, SRV and WHISK.

Wrapper induction: These tools generate delimiter-based extraction rules derived from a given set of training examples. The main distinction between the
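The delimiter-based rules that wrapper induction derives from training examples can be illustrated with a heavily simplified sketch: learn the longest common left and right delimiters around a labelled target value, then reuse them on an unseen page. This is only an illustration of the idea, not any specific tool's algorithm:

```python
def learn_delimiters(examples):
    """examples: (page, target) pairs where target occurs once per page."""
    lefts, rights = [], []
    for page, target in examples:
        i = page.index(target)
        lefts.append(page[:i])
        rights.append(page[i + len(target):])
    # longest common suffix of the left contexts
    left = lefts[0]
    for s in lefts[1:]:
        while not s.endswith(left):
            left = left[1:]
    # longest common prefix of the right contexts
    right = rights[0]
    for s in rights[1:]:
        while not s.startswith(right):
            right = right[:-1]
    return left, right

def apply_rule(page, left, right):
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]

# two labelled training pages (made-up markup)
train = [
    ("<b>Price:</b> <i>5.95</i><br/>", "5.95"),
    ("<b>Price:</b> <i>12.50</i><br/>", "12.50"),
]
left, right = learn_delimiters(train)
value = apply_rule("<b>Price:</b> <i>2.67</i><br/>", left, right)
```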
categorization using user profiles.

About the problems that we encountered developing all the work, we could mention some installation and configuration difficulties at the beginning, especially with some Linux tools. Some of the tools had a trial license, and we had to execute all the tests within the license period; in fact, at one point we had to reinstall the entire operating system to continue using these tools. Some of the non-GUI tools need a higher level of configuration, and the time spent configuring them was considerably greater than for the GUI tools. As also explained in chapter 4.3, we could not execute our tests with XWRAP and RoadRunner.

Making a global critique of our tools, we emphasize that none of them could achieve good extractions from dynamic content pages. This is an important point, as more and more dynamic content is introduced nowadays and the use of AJAX and Javascript technologies has become usual.

With this job finished, we leave the way open for future work. Executing more tests over more scenarios could give more precision to our conclusions. On the other hand, increasing the number of data extraction tools would extend the information elaborated in this document and expand the set of tools. Taking tools from other taxonomies could also widen the final conclusions and the tests done in this document.

8 References

[1] Amazon, http://www.amazon.com
found.

4.4.1 Testing the resilience of our tools

In this section we are going to study how a change in the HTML code can affect the correctness of the data extracted by our tools. We have prepared a stage for this purpose: we have downloaded all the HTML code and required files from a book search on Amazon.com, using the input value "Jungle" to perform the search. Once downloaded, we upload these files to a test server. This way we ensure that the content is static and that no new changes are introduced while performing our tests.

Using our tools, we are going to extract some fields of this book search, concretely: title, book format, new price and valuation. New tests to evaluate resilience will then be performed, changing the HTML code of this Amazon search and replacing the original content. We can then compare whether the extracted data is the same as before or new errors have been generated.

Figure: the downloaded Amazon results page for "jungle" (first result: "The Jungle (Enriched Classics) by Upton Sinclair, Mass Market Paperback, April 27, 2004; Buy new: $5.95; 54 Used & new from $2.67; eligible for FREE Super Saver Shipping")
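The before/after comparison at the heart of this methodology can be sketched as a small harness: extract once from the original page, mutate the HTML, extract again, and diff the results. Here extract() is a hypothetical stand-in for any one tool's wrapper, and the markup is illustrative:

```python
def extract(html):
    """Toy wrapper: pull the text of the first <h2>...</h2> element."""
    start = html.find("<h2>") + len("<h2>")
    return html[start:html.find("</h2>", start)]

original = "<div><h2>The Jungle (Enriched Classics)</h2></div>"
# simulate a structural change: extra content inserted before the target
mutated = original.replace("<div>", "<div id='extra'><span>new</span>")

before = extract(original)
after = extract(mutated)
resilient = (before == after)   # did the wrapper survive the change?
```

Running the same comparison per tool and per mutation yields exactly the pass/fail tables reported in the resilience tests.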
...> <span>5.95</span>&nbsp;&nbsp;<span class="usedAndNewPriceBlock"><span class="priceType">...

Figure 64: HTML code with the highlighted SPAN tags

All the tools except Webharvest were able to extract the book title, even though we made a change. This happens because the span tag is inside an <a> tag, and this is the one that our tools use to carry out the identification. On the other hand, the XPath expression of Webharvest used a SPAN tag, so we were not able to extract this field. Likewise, all the tools except Web Content Extractor failed when extracting the price field, as they were not able to relate the old content with the current content. With Web Content Extractor the opposite happened: it does not care about the class attribute of the id and div tags, so it could find all the information, and no problems were encountered, as the structure remained the same.

Figure 65: Final results of the third resilience test (Dapper, Robomaker, Lixto, Web Content Extractor, Webharvest)

4.4.6 Test 4: Duplicating extracted data

The fourth attempt to modify the HTML structure consists of duplicating one of the elements that appears on the page and that we want to extract. More than a test, it is a way to see how our tools react to these changes.

Figure: the modified Amazon results page for "jungle" with the first result ("The Jungle (Enriched Classics) by Upton Sinclair") duplicated.
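The contrast seen in test 3 — keying on a class attribute, which a renamed class breaks, versus keying on pure tag structure, which survives it — can be sketched as follows. Illustrative code, not the tools' own; the markup mimics the Amazon price block:

```python
import xml.etree.ElementTree as ET

def by_class(root):
    """Class-based matching, as most of the tools did."""
    node = root.find(".//span[@class='price']")
    return node.text if node is not None else None

def by_position(root):
    """Structure-based matching: second <span> of the cell, any attributes."""
    return root.find(".//td")[1].text

old = ET.fromstring('<tr><td><span class="label">Buy new:</span>'
                    '<span class="price">$5.95</span></td></tr>')
# the mutation: same structure, but the class attribute is renamed
new = ET.fromstring('<tr><td><span class="label">Buy new:</span>'
                    '<span class="listprice">$5.95</span></td></tr>')
```

by_class succeeds on the old page and returns nothing on the new one, while by_position keeps working, mirroring the behaviour of Web Content Extractor in this test.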
hierarchy, for example:

Figure 5: XPath expression to navigate through the HTML hierarchy

Due to this structure, the maximum precision when extracting information from a Web page is found in the content of a leaf. Beyond that, depending on the functionality of the data extraction tool, more precision can be obtained by processing the content. All the content placed in a tag is suitable to be extracted. We can differentiate these tags by the identifier, the style of the tag (when we use CSS) and the tag attributes. This information is used by the data extraction tools to realize an extraction; depending on the tool, we have to proceed in a specific way to realize a good configuration of the wrapper.

2.3 HTML problems for data extraction

As HTML has semi-structured content, we can find problems in the structure that can produce errors at the time of extracting data. These errors can be categorized in several groups; we are now going to comment on each of them.

Presentation of the data without following a structure: Normally the content of a Web page is presented following structured patterns. This structure supplies the user an easy and logical way to find the information, avoiding wasting his time, and a good structure also helps the data extraction tools to do a good job. A suitable example could be the scenario of a digital newspaper, where we can find a table that contains all the news ordered by time. Each row is co
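An XPath-style location path of the kind Figure 5 refers to walks the tag hierarchy from the root down to a leaf, where extraction precision is highest. A stand-alone illustration with the standard library over a made-up page:

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<div><p>navigation menu</p></div>"
    "<div><p>The article text lives in this leaf.</p></div>"
    "</body></html>"
)

# path in the style of /html/body/div[2]/p: the second div's paragraph
leaf = page.find("body/div[2]/p")
```

Each step of the path narrows the selection by tag name and position, which is why a leaf node pins down exactly one piece of content.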
is, of all the tools, the most basic one and in the earliest phase of development. The data extraction process is reduced to editing a small configuration file with some of the commands that can be found in the readme.txt file. In conclusion, we do not recommend this tool for data extraction, as in some cases we could not carry out even basic extractions. It is a kind of basic development tool that was interesting to test, but it has no real utility when professional results are expected. As a final conclusion for the entire group, we conclude the same as for the first two groups: we can only perform extractions on Web pages with a simple structure.

Dapper, Automation Anywhere: GUI tools without using scripts or expressions

The main characteristic of this group is the ease of configuring the data extraction process. Having a GUI and not using scripts or expressions turns the process into an easy sequence of steps that a non-advanced user can follow. On the other hand, the weak point of these tools is precisely that they do not use scripts or expressions. Because of this, advanced data extractions cannot be carried out, although normal or complete data extractions are a real possibility. There are two tools in this group; although they have similarities, they also have important differences. Dapper is more complete, as it is focused solely on the data extraction process an
random scenario. For this Web page we have built a Top 10 page of the users that gained the most points in a strategy game. The next screenshot gives an accurate idea of how it looks:

[Figure 7: Screenshot of the used scenario, with its annotated elements: title and subtitle ("Top 10 Users Classification"), description ("This is the final table result of the Kings of Sun 2008 Contest. This strategy game was created by Likstorh Software in 2005 and, due to the growth of online players, online competitions take place each year. The user has to use his strategy abilities to be the best king of his land, which includes having a growing population, constructing temples, studying new technologies and beginning wars to extend territory."), results table, Flash banner ("Welcome to my web site") and link to the main menu]

The structure of the data follows a logical order. From the top, the following elements are shown: the main title and the subtitle, a short description of the contest, the result table, a banner in Flash, and a link to the main menu of the page. Observing this structure, we conclude that no elements are mixed; we mean, for example, that a second part of the description is not placed after the result table, and that the Flash banner is not located between some of the rows. We have designed this HTML code in accordance with the W3C standard and, as is visible, no nested elements appear. The page is also static, which means that the structure of the content is not going to change. If thing
similar characteristics, but from a global view our selection lets us reach our goal of producing a final categorization. All these 10 tools can be found in the section "Overview of the tools", where a brief description of each tool with its main characteristics is given. Next we are going to introduce a taxonomy to characterize them and give the reader a general overview.

3.2 A taxonomy for characterizing Web data extraction tools

The general taxonomy presented here is based on the main technique used by each tool to generate a wrapper, which leads us to the following groups of tools: Languages for Wrapper Development, HTML-aware Tools, NLP-based Tools, Modeling-based Tools and Ontology-based Tools.

[Figure 9: Classification of Web data extraction tools by degree of flexibility (from standard HTML documents up to resilience and adaptiveness; text/non-HTML vs. HTML) and degree of automation (manual, semi-automatic, automatic), from [24]]

As shown in Figure 9, a classification using the flexibility degree and the automation degree can be constructed. Generally, the more automated a tool is, the less flexible it is. We can find different grades of flexibility, from treating standard HTML documents up to strong resilience and adaptiveness properties. There are also several grades of automation, varying from manual to automatic. Ontology-based tools are the ones that have the best flexibility, but
[Figure 44: Ebay results with Robomaker: for each found item a title (e.g. "Aigner Exclusiver Roller Ball Blue", "Greetings from Blue Ball PA sterling silver & resin ball pen", "CD Marcia Ball Blue House", "Amiga Boing Ball Sticker blue"), a price in EUR, a shipping price in EUR, and the remaining auction time in days, hours and minutes]

Lixto

As we are going to ext
that Robomaker executes sequentially. As we want to extract more than one element of our performed search, we have to use a flow step that iterates through the tag which identifies a result element. When we act in this way, we sometimes experience problems: together with the results we encounter other annoying elements, such as sponsors, images or videos, that are of no interest for us. To avoid this, we can use one of the Robomaker steps that allows us to remove these annoying tags before performing the data extraction. Since most likely these elements will not all appear together, we have to ignore the errors generated by the absence of one of these tags. Something similar happens when we iterate over entries that do not have the same structure: we have to perform more than one extraction step per element afterwards to avoid data loss and, for the same reason as before, we have to ignore the produced errors.

Google

Before extracting our data we remove three types of annoying tags. The first two are advertising, and the third one refers to images related to the performed search. We iterate through the elements of the result container and extract the title, URL and description field of each result element. We have to ignore errors from the description field, as we can find elements without a description. This extraction includes all the information grouped: text, cached, content size.

Load Page, Click search
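A rough sketch of this flow, removing the annoying tags first and then iterating over the result elements while tolerating a missing description instead of failing (all class names here are invented, not Google's real markup):

```python
from xml.etree import ElementTree as ET

page = ET.fromstring(
    '<div>'
    '<div class="sponsor">Ad: buy now</div>'
    '<div class="result"><h3>Title A</h3><p>Description A</p></div>'
    '<div class="result"><h3>Title B</h3></div>'  # an entry without description
    '</div>'
)

# Step 1: remove the annoying tags before extracting, as Robomaker does.
for junk in page.findall('./div[@class="sponsor"]'):
    page.remove(junk)

# Step 2: iterate over the result container; ignore an absent field instead
# of failing, mirroring the "ignore errors" option of the extraction step.
items = []
for result in page.findall('./div[@class="result"]'):
    desc = result.find("p")
    items.append({"title": result.find("h3").text,
                  "description": desc.text if desc is not None else None})
print(items)
```

The second entry simply carries an empty description rather than aborting the whole iteration, which is the behavior the text argues for.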
that stores tabular data and uses a comma to separate values. We are going to place all the information in the same column and separate the fields with commas. The new HTML page will look like this one:

LIST OF PUBLISHED BOOKS
Name / Author / Last published edition
How to begin with Computers, Andrew Moss, 1998-07-07 First edition
Spain: The guide, Roberto Diaz, 1995-02-04 Second edition
The book of Manchester United, John Henley, 2003-06-18 First edition
How to survive in Africa, Kate Nebit, 1991-01-25 First edition
Red apple blue sky, Marko Owen, 2006-12-07 Second edition
Love in the mountain, Katja Müller, 2000-05-19 Fourth edition
Bash programming guide, John Harker, 2001-11-23 Second edition
The 100 best horror films, Jack Ismay, 1995-04-22 Second edition
Speak freanch in 1 month, Henry Petit, 1997-03-19 First edition
Welcome to the reality, Robert Morel, 2005-10-10 First edition
Discrete mathematics, Vera Beltran, 1999-30-05 Second edition
Planes and boats, Naomi Michel, 1997-08-03 First edition
Second world war image collection, Juan Espada, 2002-03-12 Third edition
Discovering Poland, Anja Tomaka, 2003-06-22 Second edition

Figure 78: Fourth constructed scenario for the precision tests

In the following, we use our tools to extract the same content as before. Tools with data transformation features and more extraction accuracy extracted our desired information better. All the information, the 2 last digits of the year of the last publi
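As a sketch of the CSV target format described above, using two sample rows from the scenario table (Python's csv module is our illustrative choice, not something the thesis tools themselves use):

```python
import csv
import io

# Sample rows taken from the constructed scenario page.
books = [
    ("How to begin with Computers", "Andrew Moss", "1998-07-07", "First edition"),
    ("Spain: The guide", "Roberto Diaz", "1995-02-04", "Second edition"),
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["name", "author", "date", "edition"])  # header row
writer.writerows(books)                                  # comma-separated fields
print(buf.getvalue())
```

Every record becomes one line, with a comma separating the fields, which is exactly the one-column layout the test scenario simulates.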
the results change depending on the importance of the content, the number of searches and other factors. Alternatively, we can use the GET value contained in the URL to perform a search in one step. We may experience small data errors with some particular input searches that produce a customized output, for example the description field of YouTube videos. It is really important to select a good sample for the data extraction, that is, one that includes a varied structure of the output. Using Google, if we use the input value "Barcelona" we experience data loss in some cases, as we do not include all the possible result structures. On the other hand, if we use the input value "Lamborghini", the results have a more varied HTML structure, which decreases the possibility of losing data. But certainly, with Web search engines we cannot be 100% sure that all the data is going to be perfectly extracted, as the result page does not have a static structure and can experience changes. In the next stage we are going to test each suitable tool with each search engine and comment on the way they perform the data extraction and the problems that we encounter.

Dapper

Dapper lets us define an input variable to update the content of the search. This feature is really useful: if we want to perform a new search, we only have to change the value of this input variable. To configure the data extraction process we have to do a first search wi
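Performing a search in one step through the GET value in the URL can look like this; `q` is Google's query parameter, while the helper function itself is a hypothetical sketch:

```python
from urllib.parse import urlencode

def search_url(base, query):
    # Encode the search term as a GET parameter, so a single request
    # replaces filling in the search form interactively.
    return base + "?" + urlencode({"q": query})

print(search_url("http://www.google.com/search", "Barcelona"))
# http://www.google.com/search?q=Barcelona
```

Changing the query string is then all a wrapper needs to do to rerun the extraction for a different input value.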
to extract three fields of the result items, we create the following data model with a root node at the top:

[Figure 24: Data model used by Lixto for the simple data extraction: root with the children title, description and player]

This data model is used by Lixto to specify the format of the XML output. The next step consists of defining the actions that Lixto should carry out before extracting the data:

[Figure 25: Action sequence to extract data with Lixto: load the sample test page, then run a data extractor with filters for root, title, description and player]

1. Go to the Web page of our source data.
2. Use a data extractor together with our data model and filters to extract the information.

Once the filters to extract the data were configured, we could extract all the fields correctly. Here is the result:

<?xml version="1.0" encoding="UTF-8"?>
<document>
  <root>
    <title>Kings of Sun 2008 Contest</title>
    <description>This is the final table result of the Kings of Sun 2008 Contest. This strategy game was created by Likstorh Software in 2005 and, due to the growth of online players, online competitions take place each year. The user has to use his strategy abilities to be the best king of his land, which includes having a growing population, constructing temples, stud
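The data model fixes the shape of the XML output; a minimal sketch of producing the same root/title/description/player structure (field values shortened here, and the dictionary stands in for whatever the Lixto filters deliver):

```python
from xml.etree import ElementTree as ET

# Field values as the extraction filters would deliver them (shortened).
extracted = {"title": "Kings of Sun 2008 Contest",
             "description": "This is the final table result ...",
             "player": "..."}

root = ET.Element("root")
for field in ("title", "description", "player"):  # order fixed by the data model
    ET.SubElement(root, field).text = extracted[field]

print(ET.tostring(root, encoding="unicode"))
```

The data model thus acts as a schema: the output always nests the same child elements under the same root, regardless of the concrete page.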
    <url><link>http://de.wikipedia.org/wiki/Barcelona</link></url>
    <description>Dieser Artikel behandelt die katalanische Stadt Barcelona, zu anderen gleichnamigen Bedeutungen siehe Barcelona (Begriffsklärung). de.wikipedia.org/wiki/Barcelona - 117k - Im Cache - Ähnliche Seiten</description>
  </root>

Yahoo Search

Using the same structure as with Google, Lixto is able to extract all the result information without taking data from the sponsor container. The description does not include all the information, but the extraction works without problems.

<lixto-extractor>
  <root>
    <title>Barcelona hotels, apartments, football tickets, city guide of Barcelona</title>
    <url><link>http://rds.yahoo.com/_ylt=A0geu9cUyBIgxkBXRpXNyoA/SIG=11djgduo4/EXP=1210164508/**http%3a//www.barcelona.com</link></url>
    <description>Travel and city guide for Barcelona Spain</description>
  </root>
  <root>
    <title>Barcelona - Wikipedia, the free encyclopedia</title>
    <url><link>http://rds.yahoo.com/_ylt=A0geu9cUyBIgxkBXxpXNyoA/SIG=1lglukckh/EXP=1210164508/**http%3a//en.wikipedia.org/wiki/Barcelona</link></url>
    <description>Provides an overview of the history and cu
use extraction patterns that are often based on tokens and delimiters, for example the HTML tags. We are going to explain the various possibilities to extract information and how they work. Basically, there exist three different ways to perform the extraction:

- Manual extraction of the data
- Use of a provided API
- Use of a (semi-)automatic wrapper

Manual extraction is the most precise option to extract data, as we directly choose the data fields of our interest. However, the need to treat elements individually takes a lot of time when handling large amounts of data, which rules this option out as not viable. It can be a good option for small and specific data extractions, but that is not the most common scenario in Web data extraction; for this reason, extractions should be performed in a more automated way. An API, on the other hand, belongs to the owner of the Web page from which we want to extract data. Normally we can find APIs only for a small number of specific Web pages, and their use and supply are limited by the specifications of the owner. To use them, we have to consult the owner's documentation and method list. A wrapper lets the end user use a set of methods without needing support from the owner of the Web page and independently of the content. It can be seen as a procedure designed for extracting content of a particular information source and deliveri
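A toy illustration of a wrapper in this sense, a reusable procedure whose extraction rules are tied to one source's layout; both the page layout and the function are invented for the example:

```python
import re

def city_page_wrapper(html):
    # Source-specific rules: this wrapper only understands one page layout,
    # but can be reused for every page that follows it.
    title = re.search(r"<h1>(.*?)</h1>", html).group(1)
    summary = re.search(r'<p class="summary">(.*?)</p>', html).group(1)
    return {"title": title, "summary": summary}

page = '<h1>Barcelona</h1><p class="summary">City in Spain.</p>'
print(city_page_wrapper(page))
```

Unlike manual extraction, the rules are written once and applied to any number of pages; unlike an API, nothing is required from the owner of the source.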
weak points are. The following table lists a set of features, some obtained from the tool itself and others obtained through our test executions, that will help us to carry out the final categorization.

Legend: Poor/Low, O Good/High, NR No Result

[Figure 88: Tools categorization using qualitative features: for each tool (Robomaker, RoadRunner, XWRAP, Lixto, WebHarvest, GoldSeeker, WinTask, Automation Anywhere, Web Content Extractor), ratings for Web Search Engines, Ebay, Dynamic Content, Precision, Output Formats, Resilience, Impression and Total]

Once we obtained this information, we were able to group our tools and analyze the features that can be used in a concrete scenario. This is the main goal of the categorization, as it helps us to select the best tool knowing its strong and weak points. General features like the complexity, ease of use or the output formats have been used here, as they are relevant when expecting results from our tools. Other columns were filled with information from our tests: resilience, Web search engines, Ebay, dynamic content and precision. The column "impression" includes other general aspects not shown in the other columns, like installation, configuration, presentation and number of options. Using this table and our knowledge of each of the tools, we start with the categorization.

XWRAP: Non-GUI tools without editing fi
[Figure 54: Amazon result page to be used, showing among 128,055 items "Jungle: A Harrowing True Story of Survival" by Yossi Ghinsberg (Hardcover, Sep 1, 2005; buy new $23.95 $16.29; 52 used & new from $10.04) and "The Jungle: The Uncensored Original Edition" by Upton Sinclair, Earl Lee and Kathleen DeGrave (Paperback, April 1, 2003; buy new $12.00 $9.60; 47 used & new from $6.25), both eligible for FREE Super Saver Shipping, plus a "Surprise me" random-page excerpt]

We have tried to extract all these fields with all of our visual tools, but we experienced problems with some of them. We are going to explain the problems that we encountered.

Automation Anywhere

With this tool we could not correctly extract some fields, for example the book format or the price. Although this is not a problem for evaluating resilience as such, this tool only allows the user to extract and save the data. This is a problem when we want to test the resilience property, as each time we want to perform a new extraction we have to select the data again, which makes the test not us
2. Cyberneko HTML Parser homepage, http://sourceforge.net/projects/nekohtml
3. Ebay, http://www.ebay.com
4. Google, http://www.google.com
5. Html2xhtml, http://www.it.uc3m.es/jaf/html2xhtml
6. Microsoft Live Search, http://www.live.com
7. Netcraft Ltd., http://www.netcraft.com
8. Pageflakes, http://www.pageflakes.com
9. Searchenginewatch.com, http://searchenginewatch.com
10. Wikipedia, the free Encyclopedia, http://en.wikipedia.org
11. World Wide Web Consortium, http://www.w3.org
12. Yahoo Search, http://search.yahoo.com
13. Abiteboul, S.: Querying Semi-Structured Data. ICDT 1997
14. Ashish, N., Knoblock, C. A.: Wrapper Generation for Semi-structured Internet Sources. SIGMOD Record 26(4), 1997
15. Aumüller, D., Thor, A.: Mashup-Werkzeuge zur Ad-hoc-Datenintegration im Web. 2008
16. Baumgartner, R., Ceresna, M., Ledermüller, G.: DeepWeb Navigation in Web Data Extraction. CIMCA/IAWTIC 2005
17. Baumgartner, R., Flesca, S., Gottlob, G.: Visual Web Information Extraction with Lixto. VLDB 2001
18. Chang, C., Kayed, M., Girgis, M. R., Shaalan, K. F.: A Survey of Web Information Extraction Systems. IEEE Trans. Knowl. Data Eng. (TKDE) 18(10), 2006
19. Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB 2001
20. Eikvil, L.: Information Extraction from World Wide Web: a Survey. Technical Report 945, Norwegian Computing Center, 1999
21. Fiumara, G.: A
Bash programming guide, John Harker, 2001-11-23, Second edition
The 100 best horror films, Jack Ismay, 1995-04-22, Second edition
Speak freanch in 1 month, Henry Petit, 1997-03-19, First edition
Welcome to the reality, Robert Morel, 2005-10-10, First edition
Discrete mathematics, Vera Beltran, 1999-30-05, Second edition
Planes and boats, Naomi Michel, 1997-08-03, First edition
Second world war image collection, Juan Espada, 2002-03-12, Third edition
Discovering Poland, Anja Tomaka, 2003-06-22, Second edition

Figure 73: First constructed scenario for the precision tests

The final results for each tool can be found in the next table:

[Figure 74: Final results extracting data from simple text: for each tool (Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor, Goldseeker, Webharvest), whether it could extract the complete information, the date of the last publication, the year of the last publication, and the 2 last digits of the year of the last publication]

The difference when using Lixto, Robomaker and Web Content Extractor compared with the other tools is that we can use extra features that allow us to extract data in a more accurate way. This means applying a concrete format to the data or performing transformations on it before producing the final content. The substring method of Goldseeker is also useful for adjusting the precision. As shown in the table, these four tools passed all the tests.

4.5.3 Extracting data from formatted text

In this second test we are going to highlight
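The substring-style transformation that these tools use to sharpen precision, deriving the year and its last two digits from the extracted date field, amounts to the following (the sample row is taken from the scenario table):

```python
# Raw field as extracted from the page (sample row of the scenario).
date = "2001-11-23"

year = date[:4]       # year of the last publication
last_two = date[2:4]  # the 2 last digits of that year
print(year, last_two)  # 2001 01
```

Tools without such a substring or formatting step can only deliver the full field as it appears on the page, which is why they fail the finer-grained precision tests.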
Dapper is an online tool which allows the user to extract information from Web sites. To use it, all we need is an Internet browser and an Internet connection, as this service is only available online. Dapper is at the moment in beta phase, but it is fully functional. The usage of Dapper is totally free; we only need to create a new account to use it. We can create our own wrappers, or "Dapps" as they are called, or use wrappers already created by other registered users.

[Screenshot: the Dapper Dapp Factory. The user clicks on the content to select it (clicking again removes the selection), collects sample pages, selects content (here a Google search for "barcelona", returning about 168,000,000 results, including sponsored links for Barcelona hotels and apartments), previews the feed, and saves each highlighted field, such as the title or URL of the results]
Universität Leipzig
Fakultät für Mathematik und Informatik
Abteilung Datenbanken

A comparison of HTML-aware tools for Web data extraction

Diplomarbeit

Leipzig, September 2008

vorgelegt von Xavier Azagra Boronat
Master-Studiengang Informatik

Betreuender Hochschullehrer: Prof. Dr. Erhard Rahm
Betreuer: Dr. Andreas Thor

Index

1 Introduction ................................................................ 5
2 Data extraction process ..................................................... 6
  2.1 Characteristics of the data extraction process ......................... 7
  2.2 Representation of Web page elements ................................... 10
  2.3 HTML problems to extract data ......................................... 11
  2.4 Ideal characteristics for a Web page to extract data .................. 13
3 Data extraction tools ...................................................... 15
  3.1 Related work ........................................................... 15
  3.2 A taxonomy for characterizing Web data extraction tools ............... 16
  3.3 Overview of tools ...................................................... 17
  3.4 Descriptive comparison of HTML-based tools ............................ 22
4 Tests using the data extraction tools ..................................... 27
  4.1 Overview of tests ...................................................... 27
  4.2 Methodology ............................................................ 28
  4.3 Problems with some of our tools ....................................... 29
  4.4 General data extraction tests ..........................................
a extraction when performing our tests and our final tool categorization, introducing the main problems and the main techniques to extract data.

2.1 Characteristics of the data extraction process

Nowadays we can find several services and tools based on data extraction techniques that allow end users to extract information from Web pages. The process of extracting structured data from Web sites is not a trivial task, as most of the information is formatted in the Hypertext Markup Language (HTML), a format designed for presentation purposes and not for automated data extraction. Most of the HTML content of the Web is semi-structured: pages with this type of content are in an intermediate position between structured and unstructured format, and they do not conform to a description of the types of data published therein. This situation is unlikely to change in the short or even medium term, for at least two reasons: the simplicity and power of HTML authoring tools, together with a considerable inertia against changing the markup language. A vast quantity of semi-structured data stored in electronic form is not present in HTML pages but in text files such as e-mails, program code, documentation and configuration files. Therefore it is very important that data extraction tools are also able to extract this kind of information. However, in real-life scenarios, data extraction capabilities are only one half of the
a paper named "A Brief Survey of Web Data Extraction Tools" [24]. This paper contains a categorization of the data extraction tools and explains the characteristics of each group. Searches on Google have not always helped to find a suitable tool; through Sourceforge, on the other hand, we could find several useful open source tools, though sometimes unfinished or unsuitable projects. In the end, we decided to work with HTML-aware tools. These kinds of tools, whose features are explained in the next section, are characterized by their level of automation: we do not have to spend a large amount of time in the configuration process to carry out extractions, but, on the other hand, most of them can only extract information from HTML files. However, we are more interested in the automation degree than in the source structure of our data, as we are going to focus our work on HTML extractions. Data extractions normally work with big amounts of data, and this kind of tool is designed to automate that process. We made a heterogeneous selection of these tools, which means that we considered a variety of characteristics among them: commercial and non-commercial tools, GUI and non-GUI tools, Linux and Windows tools. The aim of this variety is to have a general sample set of tools and to see whether their different characteristics affect the data extraction features. Other possible groups could be formed with
a project of the database departments of the Università di Roma Tre and the Università della Basilicata. This tool generates a wrapper from the analysis of similarities and differences between several sample files of the same class. Here, a class is a set of pages generated by the same script, so they are structurally the same while their content differs. The wrapper is a representation of the investigated sample files in the form of a regular expression, a so-called union-free regular expression (UFRE).

XWRAP

XWRAP is a tool that was developed at the Georgia Institute of Technology. Its developers describe it as an XML-enabled wrapper construction system for Web information sources. The toolkit includes three components: object and element extraction, filter interface extraction, and code generation. The wrappers are generated as Java classes. To use it, we have to enter the URL of our desired Web site, and the customization of the extraction process and results is done via the Web by XWRAP. XWRAP requires a separate Web server such as Apache Tomcat.

Webharvest

Webharvest is an open source Web data extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well-established techniques and technologies for text/XML manipulation, such as XSLT, XQuery and regular expressions. Web-Harvest mainly focuses on HTML
as developed learning algorithms for a spectrum of wrappers. This kind of wrapper requires minimal intervention of human experts: the system goes through a training phase where it is fed with training examples, and in many cases this learning has to be supervised. Generally, the steps to extract information using a wrapper are the following:

- Load the information of the source page
- Transform the source page for its posterior treatment
- Identify the appearing elements
- Filter these elements
- Export the final data to an output format

The first and last steps are common to all types of wrappers, as we need a data input and a data output to perform a data extraction. Depending on the wrapper type used, the intermediate steps can vary. We can find several types of wrappers following the taxonomy of [24]. This taxonomy is based on the main technique used by the tool to generate a wrapper, which leads us to the following groups of tools: Languages for Wrapper Development, HTML-aware tools, NLP-based tools and Ontology-based tools. More details can be found in the paper. Of all of these kinds of tools, we are going to concentrate on the most modern and practical ones: HTML-aware tools.

2.2 Representation of Web page elements

As explained before, most Web pages follow the HTML syntax, independent of their content (images, Flash, scripts). The main elements that construct the structure are the HTML tags. They are identified b
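The five generic wrapper steps listed in section 2.1 can be strung together as a minimal sketch; the regex-based identification and the JSON export are our illustrative choices, not a particular tool's method:

```python
import re
import json

def run_wrapper(source_html):
    # 1. Load the information of the source page (here: already a string).
    page = source_html
    # 2. Transform the source page for later treatment (normalise whitespace).
    page = re.sub(r"\s+", " ", page)
    # 3. Identify the appearing elements (every list item).
    elements = re.findall(r"<li>(.*?)</li>", page)
    # 4. Filter these elements (keep only the book entries).
    books = [e for e in elements if e.startswith("Book:")]
    # 5. Export the final data to an output format (JSON).
    return json.dumps(books)

html = "<ul><li>Book: The Jungle</li><li>Ad banner</li></ul>"
print(run_wrapper(html))  # ["Book: The Jungle"]
```

Only the first and last steps (input and output) are fixed; a different wrapper type would swap out the transformation, identification and filtering in the middle.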
[Figure 42: Resulting page of an Ebay search for "red ball": 5 articles found, with category filters (Musik, Kleidung & Accessoires, Sammeln & Seltenes, Sport) and, for each article, its title (e.g. "LABAN red cream ladies ball pen EXPRESSION new"), the number of bids, the current price in EUR, the shipping price and the remaining time]

To execute this test we have to consider that if a search is not specific enough, we retrieve a more general page with categories that does not follow the same structure. This can be a problem, because we do not always receive results.

Dapper

After several executions with different values, it sometimes happened that, for unknown reasons, searches returning correct results were not extracted by Dapper. We conclude that Dapper occasionally experiences problems extracting information from the Ebay product search.

[Screenshot: Ebay result page with the Kaufen / Verkaufen / Mein eBay navigation]
code to locate information and afterwards extract it. It is very common that a Web page structure varies to extend the content, to improve the visual design, or to introduce new Web technologies. All these changes can produce data loss or errors in our already built extraction wrappers. In this chapter we are going to talk about the resilience property against changes to the HTML code, that is, how well our wrapper continues extracting the correct data when changes are introduced. These changes are a problem when using the data extraction tools: if we have a feed that receives information from a concrete Web page, and at a certain moment the HTML structure is modified, we can lose the flow of data to this feed. Furthermore, we have to be aware when this situation happens, as it is a critical point. We can control it by monitoring the data source and checking that the resulting information is correct. The detection of problems can be automated to some extent by using scripts, or small programs that test whether we receive information and whether it follows a certain structure. When a problem appears, some kind of alert can be sent, like a mail to the administrator, so that the problem is fixed as fast as possible. What is sure is that we will have to reconfigure our data extraction tool to keep it up to date. We can classify these changes into categories:

Changes to the structure of the Web page:
o Changing the order of the elements
o Erasin
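The monitoring idea sketched above, a small script that tests whether the feed still delivers well-formed items and raises an alert otherwise, could look like this (all names are hypothetical, and `notify` stands in for any alert channel such as a mail to the administrator):

```python
def feed_is_healthy(items, required_fields=("title", "url")):
    # A wrapper that silently breaks usually yields no items at all, or
    # items with missing fields, so both conditions are checked.
    if not items:
        return False
    return all(all(f in item for f in required_fields) for item in items)

def check_and_alert(items, notify):
    # notify() is the alert channel, e.g. sending a mail to the admin.
    if not feed_is_healthy(items):
        notify("extraction feed broken: reconfigure the wrapper")

alerts = []
check_and_alert([], alerts.append)                                  # broken feed
check_and_alert([{"title": "Barcelona", "url": "http://..."}], alerts.append)
print(alerts)  # one alert, for the empty feed only
```

Run periodically against the wrapper's output, such a check turns a silent data loss into an explicit signal that the wrapper needs reconfiguring.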
d has a lot of output formats. As for Automation Anywhere, it can perform other types of tasks, but only simple data extractions. Of these two tools, the use of Dapper is recommended.

WinTask, Web Content Extractor: GUI tools using scripts or expressions and without full data extraction support

In this group we find tools using a GUI and expressions or scripts to perform extractions. The use of scripts and expressions gives these tools more possibilities to extract specific information, but at the same time introduces a more complex data extraction process. Compared to the previous group, they are more suitable for professional use, as we can extract information in a more precise way. The main problem of these tools is that they are not built to focus only on the data extraction process; they are built to automate tasks and other kinds of work, and often more functionality is required. WinTask and Web Content Extractor belong to this group, and both are built for the Windows operating system.

Robomaker, Lixto: GUI tools using scripts or expressions and with full data extraction support

This last group of tools uses a GUI, uses scripts and expressions, and has full data extraction support. For this reason it is not difficult to conclude that these tools are the most powerful ones, recommended for all types of data extractions. Robomaker and Lixto belong to this group
…d of files. Its behavior is defined by a rule-based configuration file. It can process files on the local server or directly get Web pages via the Internet. It is a development version: uncommented, undebugged and unfinished. Nevertheless, it can already be used for simple extractions.

3.4 Descriptive comparison of HTML-based tools

Once each of the tools has been introduced, we present some of their characteristics. Two tables have been produced: the first shows a basic overview of the tools, the second their data extraction features. A categorization of the tools using distinguishing features is presented as well.

[Overview table (continued in Figure 17); recoverable entries:
Dapper: online execution through the Internet browser, easy; several output formats; allows input variables; free (yes).
RoadRunner: local installation, complex configuration; static content; GPL license.
XWRAP: online configuration through the browser (Tomcat); works with static content; medium complexity; license to use.
Lixto: local installation, GUI with Internet browser, medium complexity; Web recording tool; scripts usage; allows input variables; several formats; license.
GoldSeeker: configuration through scripts, simple, in development, PHP support; free (GNU LGPL). …]
Figure 55: HTML parsing tree structure of the Amazon test

4.4.3 Test 1: Deleting a table column next to the extracted data

Figure 56: Columns of data for this resilience test

The first attempt to modify the structure of the page consists of deleting the first column of the first row of the second table of the HTML document. The column where our interesting data is placed then becomes the first one. We made this modification using Adobe Dreamweaver.

Dapper. We used the same Dapp as before to extract the same information. This modification of the HTML code does not produce any errors, so we can state that Dapper is robust against this kind of modification.

title: The Jungle (Enriched Classics), format: Mass Market Paperback, price_new: 5.95, stars: …; title: Jungle: A Harrowing True Story of Survival, format: Hardcover
Figure 57: Dapper results for this test

Robomaker. We did not experience any problem using the same structure as before to extract content, so Robomaker is robust against this kind of modification too. [Screenshot text of the Amazon result page omitted.]
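Why the tools survive this test can be illustrated with a small sketch. The HTML below is a simplified, hypothetical stand-in for the Amazon row (class names like `price_new` are assumptions); Python's standard-library `ElementTree` stands in for the tools' engines. A purely positional lookup breaks when the first column is deleted, while a lookup through the `class` attribute, which is effectively what Dapper, Robomaker and Lixto do here, still finds the data:

```python
import xml.etree.ElementTree as ET

# Two versions of the same simplified product row: in the second one
# the first column has been deleted, as in this resilience test.
BEFORE = """<table><tr>
  <td class="image">cover</td>
  <td class="info"><span class="price_new">$5.95</span></td>
</tr></table>"""
AFTER = """<table><tr>
  <td class="info"><span class="price_new">$5.95</span></td>
</tr></table>"""

def by_position(doc):
    # Fragile: expects the price in the second column (absolute path).
    row = doc.find("tr")
    cell = row[1] if len(row) > 1 else None
    return cell[0].text if cell is not None and len(cell) else None

def by_class(doc):
    # Robust: locates the element through its class attribute,
    # wherever it sits in the tree.
    node = doc.find(".//span[@class='price_new']")
    return node.text if node is not None else None
```

The positional wrapper returns nothing after the change; the attribute-based one is unaffected.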
…[Screenshot text: Amazon category sidebar and the second result row ("Jungle: A Harrowing True Story of Survival" by Yossi Ghinsberg, Hardcover, Sep 1 2005) omitted.]
Figure 62: Row to be deleted for this resilience test

As four of our five tested data extraction tools (Dapper, Robomaker, Lixto and Webharvest) passed these tests without problems, we are not going to comment on each one individually: all of these tools are robust against this kind of modification. With Web Content Extractor the same happened as before: changes to the HTML structure prevented a successful extraction.

Figure 63: Final results of the second resilience test (Dapper, Robomaker, Lixto, Web Content Extractor, Webharvest)

4.4.5 Test 3: Modifying DIV and SPAN tags

The third attempt to modify the HTML structure consists of making changes to DIV and SPAN tags. Most of our data extraction tools use them to identify the Web elements we want to extract. We are going to make these two cha…
…[Screenshot text of the Amazon result page omitted.]
Figure 58: Robomaker results for this test

Lixto. With Lixto the same happened as with the two previously tested tools: the test is passed and this modification of the HTML code does not produce any error.

[Data model: Page Class "start", Action Sequence (http://www.dedicom.net/test/…), Data Extractor, root with children title, format, new_price and stars, each with a Filter]
Figure 59: Data model used by Lixto for this test

[Screenshot text of the Lixto selection omitted.]
Figure 60: Lixto data selection for this test

Web Content Extractor. This tool uses an absolute path to identify the elements that appear in the HTML source. We can specify a first data row where the content begins and then, from this point, select all the suitable data to be extracted. As we saw, changes that affect the HTML structure…
…website. "All the latest news about the club's football team and the various sporting sections: Basketball, Handball, Roller Hockey." / http://www.zoobarcelona.com, Parc Zoològic de Barcelona S.A.: "Fichas de animales, revista, visita virtual y webcams de algunos animales en directo."

The only problem that occurred is that the data was retrieved in a different order than expected. With the other two Web search engines the results were successful and no problems occurred.

The following table summarizes the final results of our tests.

Figure 41: Final Web search engine test results (Google Search, Yahoo Search, MS Live Search against Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor, Goldseeker, Webharvest)

4.3.2 Data extraction from Ebay

The second test we are going to execute with our data extraction tools consists of extracting data from an Ebay product search. Ebay is the most important auction shop on the Web and is famous all over the world. This test is useful because of its use of input values and non-static result pages. The information is organized in fields, rows and columns of data, which is its most outstanding feature. For each resulting product we are going to extract the following fields: product name, price, shipping price, remaining time.

[Screenshot text: German Ebay page header omitted.]
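The row-and-column organisation described above is what makes this scenario tractable. The sketch below uses a hypothetical, simplified result table (the class names are assumptions, not Ebay's real markup) to show the shape of the extraction every tool has to perform: one record per row, one field per cell.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for an Ebay-style result table.
PAGE = """<table class="results">
  <tr><td class="name">Ball Pen</td><td class="price">EUR 18,99</td>
      <td class="shipping">EUR 2,49</td><td class="left">3T 04Std</td></tr>
  <tr><td class="name">Handball</td><td class="price">EUR 9,50</td>
      <td class="shipping">EUR 1,00</td><td class="left">1T 12Std</td></tr>
</table>"""

FIELDS = ("name", "price", "shipping", "left")

def extract_products(html):
    """Return one dict per result row, keyed by field name."""
    root = ET.fromstring(html)
    products = []
    for row in root.findall(".//tr"):
        products.append({f: row.find("td[@class='%s']" % f).text for f in FIELDS})
    return products
```

The input-value aspect of the test (filling the search form) is what the GUI tools automate with recorded actions; the structured part is just this loop.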
…used as well to elaborate the final categorization section.

4.3 Problems with some of our tools

Before starting, we have to mention that we experienced problems with XWRAP and RoadRunner. As explained in the tools introduction chapter, XWRAP is a tool developed at the Georgia Institute of Technology. It was installed without problems on our computer; we configured it to perform data extractions from our scenario and obtained a resulting Java file to execute the wrapper. As stated on the Web page of the XWRAP project, there are only three ways to run a data extraction: register the wrapper in the GT Wrapper Repository; download the wrapper package and integrate it into our own Java program; or download the wrapper package and run the wrapper on the command line. Unfortunately, we could not achieve a satisfactory execution of any of these options: neither the GT Wrapper Repository nor the link to download the wrapper package was available through the XWRAP Elite Home Page. In conclusion, we could configure the Web data extractions but could not retrieve any final results, due to the unavailability of sources from the XWRAP Elite Home Page. The last update of this Web page was in April 2000; up to now no further updates have been carried out. Because of this we could not reach our own conclusions about the quality of the extractions performed by XWRAP, and we are going to exclude this tool fro…
…ful: we mean that we cannot compare two different extractions.

WinTask. With this tool we could not correctly extract the fields we wanted. We configured it to extract them, but the available precision only allowed extracting the information found in the book cell. Because of this absence of precision, the resilience test we are going to apply would not be meaningful, so we rule this tool out of the resilience tests.

Goldseeker. Due to the simplicity of this tool, it was not possible to extract all the fields; only part of the information, mixed with other data not of our interest, was extracted. Writing scripts that work by searching for concrete strings makes it difficult to extract concrete data from a large amount of HTML code. Still, as we know how this tool works, we can reach conclusions about its resilience property. Locating the content of interest between two surrounding strings makes this tool resilient in some cases. If the content we want to extract is placed between two strings that are not modified, the extraction is performed without problems. If the strings used to locate the content are themselves modified, the extraction fails, as the content is no longer found, and problems will appear.
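Goldseeker's string-pair approach, as described above, can be sketched in a few lines (this is an illustration of the technique, not Goldseeker's actual PHP code): the wanted content is located purely by the text surrounding it, with no use of the HTML structure at all.

```python
def between(html, start_marker, end_marker):
    """Return the text between two marker strings, or None if a marker
    is missing (i.e. the locating string was modified or removed)."""
    start = html.find(start_marker)
    if start == -1:
        return None
    start += len(start_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return None
    return html[start:end]

page = "<b>Price:</b> 23.95 <i>EUR</i>"
```

This makes the resilience trade-off concrete: any change to the page that leaves the two markers intact is invisible to the extractor, while the smallest edit to either marker breaks it completely.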
…developed as a solution to this fact. They are specialized programs that can extract data in a manual, semi-automatic or automatic way. They use the structure of the data sources and give a final output of the extracted data. We are going to use a set of tools that have been specifically designed for this purpose. First we explain the data extraction process; then we characterize each of these tools and run several tests in some constructed scenarios. The main motivation of this document is to produce a categorization of the tools, explaining their weak and strong points. We will find out which of them is suitable for which scenarios.

2 Data extraction process

In this chapter we explain the process used to achieve data extractions. This is significant because here we describe how the whole process works and what the possibilities to extract data are. We detail each aspect of the data extraction process, from the main purposes of Web data extraction to the main problems that can be found when performing extractions. Including this chapter lets the reader get an overview of the state of the data extraction field. We talk about querying data on the Web and give an idea of what a wrapper, our selected method to extract data, is. In this way we intend to ease the understanding of some characteristics of the dat…
…[Sample page text (German Amazon listing: "Datenbanken Implementierungstechniken, Neu kaufen EUR 49,95, 81 Angebote, EUR 39,95, Lieferung bis Mittwoch 23. Januar, Overnight Express") omitted.]
Figure 6: Example of nested data elements, from [27]

We want to extract the part of the information that is related to the auction. What happens here is that the second element is not new, so this type of information is displaced to the beginning, and this produces errors. Similar examples of this kind can be found all over the Web.

Problems choosing the correct Web page source: example. This problem shows up when choosing a Web page whose content structure can change depending on some factor. One real example of this kind is the result page of Web search engines. If we perform a search using an input value, we get a result page with some entries. Depending on this value, the resulting Web page changes: we may get image snapshots, video snapshots or some advertising related to the value. If the structure changes depending on the input value, we cannot try our data extraction tool with every possible value to be sure it uses exactly the best source. Because of this, for this kind of page it is really important to select a good sample, to ensure the minimum number of errors during the data extraction process.

Problems using scripts or dynamic content…
…[Screenshot text: German Ebay result listing omitted.]

item: product_name: STEPHENSON NEU OVP, price: EUR 18,99, remaining time: 53T 04Std 50Min
Figure 43: Ebay results with Dapper

Robomaker. With Robomaker we have actions to solve the specialization-grade problem of the search, for example by using branches. Robomaker detects two different types of prices at data extraction time, so we use two different extraction steps, although we obtain a single result depending on the structure of the price field. We have to ignore errors, as empty rows that do not contain information can appear.

[Screenshot text: Robomaker result table (Marcia Ball "Blue House", Alpas Handball, Dunhill Sidecar ball-point pen, …) omitted.]
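The branch idea just described, trying each known price structure in turn and skipping rows where none matches, can be sketched as follows. The markup and class names are hypothetical; the point is the fallback logic, not Robomaker's actual step types.

```python
import xml.etree.ElementTree as ET

# Hypothetical result rows: listings where the price sits in one of two
# different structures ("buy now" vs. auction), plus an empty filler row.
ROWS = """<results>
  <tr><td class="name">Pen</td><td><span class="buy_now">EUR 18,99</span></td></tr>
  <tr><td class="name">Ball</td><td><span class="auction">EUR 2,49</span></td></tr>
  <tr><td class="name">Empty row</td><td></td></tr>
</results>"""

def extract_price(row):
    """Try each known price structure in turn (the branch idea) and
    return None for rows where no branch matches, instead of failing."""
    for selector in (".//span[@class='buy_now']", ".//span[@class='auction']"):
        node = row.find(selector)
        if node is not None:
            return node.text
    return None

prices = [extract_price(r) for r in ET.fromstring(ROWS).findall("tr")]
```

One extraction step per structure, a single merged result per row, and empty rows ignored rather than treated as errors.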
…step will not find this target. If we change the Tag Path manually to "div", we immediately improve the resilience of this step, because the step now looks for a div with the class "notice" anywhere on the page. The same happens with the "next" button: if we change the Tag Path to "a", the robot will look for this tag anywhere on the page. This procedure can be applied to other element steps when constructing our robot, giving a high level of resilience against structure changes. Making this kind of change has consequences, however: if the page contained two "next" buttons and we were only interested in the one with the initial Tag Path "div div div div a", this improvement could not be applied.

4.5 Precision in extracted data

Another Web data extraction aspect to consider is the precision our tools offer when extracting data, i.e. whether they can extract exactly the part of the data we expect. Our data can be structured in several ways: it can be placed in a table row, distributed over several table cells, or, worse, mixed with other content. In this section we want to test the accuracy of our tools in these situations. Generally they take advantage of the structure and the information of the HTML tags, but the structure of the HTML code may generate problems, so that the user does not receive the data he is expecting.
…getting the JavaScript to execute. [Error dialog: "Error loading JavaScript: Unknown protocol for URL javascript:void(0)"]
Figure 49: Pageflakes results with Robomaker

Lixto. With Lixto none of the dynamic resources load correctly and we cannot extract the desired data: it fails this test.

[Screenshot text: Pageflakes page with widgets stuck at "Loading" omitted.]
Figure 50: Pageflakes results with Lixto

WinTask. WinTask uses the Internet browser directly, and is therefore able to display the dynamic page content without problems. The problem appears when we want to extract data: an error message reports that the data we want to extract cannot be found. This happens because WinTask needs the content of the tag to locate the data correctly. For example, for the temperature:

CaptureHTML DIV CONTENT 69 captured_string

In this case, when the temperature value changes, the content of the tag changes too, and this produces an error. In conclusion, WinTask does not pass this test.

Automation Anywhere. As with WinTask, Automation Anywhere is able to record the actions that we perform in our browser and to show the dynamic content. We used 4 variables to extract dynamic information without problems. It worked fine for one day, but when the information was renewed the next day we expe…
…extraction. The aim of introducing these basic tests is to see whether our tools can extract data from basic Web sources. With this kind of test we want to cover all the usual extractions that can be found in a normal HTML file; it does not matter whether the data is retrieved directly from a URL or from a file. We used a previous scenario for the tests: the "Kings of Sun 2008 Contest" of chapter 2.4. We selected this scenario because it has a basic HTML structure, which makes things easier and clearer when presenting the results. From all the information found on this page we are going to extract the title, the short description and the list of player names. By extracting these fields we can draw conclusions about basic HTML extractions.

Dapper. With Dapper, after following the standard steps to select the content of interest, we received all the information without problems. We grouped the information, distinguishing the main title, the description and the players list:

title: Kings of Sun 2008 Contest; description: "This is the final table result of the Kings of Sun … Contest. This strategy game was created by Likstorh Software in 2005 and, due to the growth of online players, each year online competitions take place. The user has to use his … abilities to be … king …"; players: Player 1, Player 2, Player 3, Player 4, Player 5, Player 6, Player 7, …
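The same basic extraction (title, description, player list) can be written down as a short sketch. The markup below is a simplified, hypothetical stand-in for the contest page, not its real source; Python's standard-library `ElementTree` plays the role of the extraction engine.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the "Kings of Sun 2008 Contest" test page.
PAGE = """<html><body>
  <h1>Kings of Sun 2008 Contest</h1>
  <p class="description">This is the final table result of the contest.</p>
  <ul class="players">
    <li>Player 1</li><li>Player 2</li><li>Player 3</li>
  </ul>
</body></html>"""

def extract(html):
    """Pull out the three fields used in the basic test."""
    root = ET.fromstring(html)
    return {
        "title": root.find(".//h1").text,
        "description": root.find(".//p[@class='description']").text,
        "players": [li.text for li in root.findall(".//ul[@class='players']/li")],
    }
```

Grouping the repeated `li` elements into a single `players` list mirrors what Dapper does when the selected items share a structure.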
4.5.1 Precision extracting a date field

To build a suitable scenario we designed an HTML page and uploaded it to a server. It consists of a list of books with title, author and publication date. We are going to extract data from the "Last Published Edition" column, each time with a different precision: all the information of the row; the date of the last publication; the year of the last publication; the last 2 digits of the year of the last publication. This process shows how flexible our tools are in letting the user extract information in a more accurate way. With each test we increase the acuteness of the extracted data and test the accuracy property.

4.5.2 Extracting data from simple text

In this first test we construct the HTML source page without using span or div tags that specifically identify the elements. This is useful because we can then conclude how important these tags are for identifying data elements and extracting data from them.

LIST OF PUBLISHED BOOKS (title, author, last published edition):
How to begin with Computers, Andrew Moss, 1998-07-07, First edition
Spain: The guide, Roberto Diaz, 1995-02-04, Second edition
The book of Manchester United, John Henley, 2003-06-18, First edition
How to survive in Africa, Kate Nebit, 1991-01-25, First edition
Red apple blue sky, Marko Owen, 2006-12-07, Second edition
Love in the mountain, Katja Muller, 2000-05-19, Fourth edition
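The four precision levels can be made concrete with a small sketch. The cell text and the `YYYY-MM-DD` date format are assumptions about the test page (the extraction damage makes the original separator uncertain); each function narrows the previous one's result, exactly as the tests tighten the precision step by step.

```python
import re

CELL = "1998-07-07 First edition"  # one "Last Published Edition" cell

def whole_cell(text):
    """Level 1: all the information of the row cell."""
    return text.strip()

def date_only(text):
    """Level 2: just the publication date."""
    m = re.search(r"\d{4}-\d{2}-\d{2}", text)
    return m.group(0) if m else None

def year_only(text):
    """Level 3: just the year."""
    d = date_only(text)
    return d[:4] if d else None

def year_last_two_digits(text):
    """Level 4: the last two digits of the year."""
    y = year_only(text)
    return y[2:] if y else None
```

A tool "passes" a precision level when it lets the user express the corresponding narrowing, whether through sub-element selection, filters or expressions.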
…erasing old content; introducing new content; changes to the style tags of the Web; changes to the visual design of the Web; other types of changes.

Changes to the Web structure can generate errors depending on where our interesting data is placed: e.g. if it is deleted, errors will appear. On the other hand, if new content is introduced and it does not affect the initial structure of the Web, no problems appear. Changing the order of the elements can introduce errors if our data extraction tool only considers the position of the HTML tags in the parsing tree. Some tools get information from the DIV or SPAN tags, specifically from the class attribute; this helps to locate the data wherever it is placed, independently of the HTML structure. If the tool relies on this information and changes are introduced to those tags, we can experience problems. Changes to the visual design, for example changing the background of the page, of the tables or of the cells, or changing font colors, will not create errors for our data extraction tools. What commonly happens is that these types of changes are introduced together with changes to the Web structure, and then there are more possibilities of generating errors. Further errors can be introduced by other types of changes: for example, placing the data we want to extract inside a Flash object, in a pop-up JavaScript window, or in a file available through a link. More examples of this kind can be…
…game. We can find password-protected sites, cookies, session IDs, JavaScript and dynamic changes on Web sites that make Web data extraction difficult in real-life application scenarios. Two of the most important purposes of Web data extraction are information retrieval (e.g. feeds, Web search engines, information services) and economic uses (e.g. stock market, shopping comparison).

In order to perform Web data extractions we are going to use a set of tools designed for this purpose. Normally, to specify the input we provide our tools with one or more Web page sources. The most common way to access the information is by giving the URL where these Web pages are located; alternatively, some tools can directly take a path to a file and extract its data. Once the tool knows where the source information is, the user configures the data extraction process. Regarding the data output, several formats can be found depending on the tool used. The most common formats for the extracted data are XML, HTML, RSS/Atom feeds or plain text, with XML the most used. Some tools are designed to directly transform the extracted data into other, more specific Web formats such as modules for Web portals or proprietary formats. Options that some of the tools offer are embedding the extracted data in a Flash object or sending it directly by email. The data extraction is only one step in the process of getting data fr…
…[Screenshot text: MSN Live Search result entries for "Lamborghini" (Lamborghini Feser Graf Gruppe, Lamborghini Verkauf Nürnberg, Lamborghini Reventón on focus.de, Lamborghini Berlin, Lamborghini Chemnitz, Automobili Lamborghini S.p.A. on autoscout24.de, …) omitted.]
Figure 37: MSN Live Search results with Robomaker

Lixto. As we are going to extract three fields from the resulting search items, we create the following data model, with a root node at the top.

[Data model: root with children title, url (containing link) and description]
Figure 38: Data model used by Lixto for the Web search engine tests

In this case we want to extract the title, the URL and the description of the result entries. From the URL we want the link itself and not the text; this is why we use an interna…
…harvest. With this tool, as we do not have to display the dynamic information, we face no problem with the presentation of the content: we can directly construct an XPath expression to extract the desired data. What happened in this case is that we could not find an XPath expression to extract the desired fields, because the tags did not carry enough information to refer to the fields of our interest: the interesting data is placed between div tags with no attributes. Because of this we cannot extract the data correctly from Pageflakes, as we receive more data than desired:

<div>Tuesday</div> <div><img src="Pageflakes_files/33.png" width="63" border="0" height="63"></div> <div>13 C<br></div> </td> <td style="width:255; vertical-align:top; text-align:center"> <div>Wednesday</div> <div><img src="Pageflakes_files/11.png" width="63" border="0" height="63"></div> <div>23/13 C<br><span style="line-height:100%"></span></div>

The following table summarizes the final results of our tests.

Figure 53: Final Pageflakes test results (Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor, Goldseeker, Webharvest)

4.4 Resilience against changing HTML code

One of the main characteristics of the data extraction tools is that they use the structure of the HTML…
…has been created by selecting a set of characteristics that can be evaluated in all of our tools. The aim of this table is to give the user a general view of them, to enable a first comparison and to give an idea of the main differences that exist. The "input variables" field matters for introducing information into form fields; the user may need this feature when expecting results from a non-static Web page. Through scripts the tools have a powerful way of treating the information, and the property of working with non-static content pages makes a tool usable on a larger number of real Web pages with dynamic content. Another important feature is whether a tool is able to extract information from more than one Web page at the same time; this is useful for joining several single-page extractions. General features found in most programs also appear in the data extraction tools: fields like complexity, error treatment, execution time, or input and output formats. Next, the fields of the data extraction features table are commented on in detail.

Input variables: whether we can use an input variable in form fields to obtain dynamic results. By changing the value of this variable we obtain new results. This is really useful when performing searches, for example with Web search engines. Values: Yes / No.

Scripts usage: the use of scripts gives a tool more flexibility to interact with the ex…
…have with XWRAP. This means that we have to select basic kinds of pages that allow us to extract information from them. Relying only on a specific configuration file sometimes makes it difficult to configure the data extraction process.

Web-Harvest: non-GUI tools with XPath expressions. In this group of tools we find Web-Harvest, characterized by its use of XPath expressions to extract data. XPath expressions are quite common and it is really easy to find information about their usage. Sometimes it can be difficult to find the correct expression to extract data, and we may have to concatenate several of them; it may also happen that no suitable expression can be found for the data of our interest. To configure a data extraction we have to consult the user manual found on the project homepage and find all the possible methods and expressions that can be used to carry out concrete actions. As in the two previous groups of tools, it is recommended to perform data extractions in basic scenarios; however, it is also true that with more complex configurations we can extract information from non-basic Web sites.

Goldseeker: non-GUI tools with their own scripting. This group of tools is really similar to the one using configuration files; the difference consists of using scripts in place of several lines containing configuration methods. Goldseeker…
…[Overview table rows (continued): WinTask: local installation, Internet browser, Web recording tool, works with static content; Automation Anywhere: local installation, Internet browser, Web recording tool, input variables; Web Content Extractor: local installation, Internet browser, several output formats; Robomaker: online, yes.]
Figure 17: Data extraction tools overview

A first categorization of the tools has been made by selecting distinguishing features, organizing groups that contain sets of tools with the same characteristics. We use a tree structure to represent this categorization, as it reflects the result in a visual and clear way; it is shown in the following figure.

[Tree diagram of the categorization: GUI tools (Dapper, Automation Anywhere, Robomaker, WinTask, Lixto, Web Content Extractor) versus non-GUI tools (XWRAP, GoldSeeker, RoadRunner, Web-Harvest)]
Figure 18: Tool categorization using distinguishing features

In the first level of the tree we split the tools into two groups according to the GUI. This characteristic directly gives us two fully distinguished groups: the tools that have a GUI and those that do not. Having a GUI makes things easier for the user: he has more options and menus to interact with, and a real-time visualization of the elements being selected for extraction. For the GUI tools, the next distinguishing feature is the use of scripts and expressions. This feature makes a tool more powerful and lets it extract data in a more precise way, so it is a really important point to take into account. Within the group that uses expressions or scripts, the next characteristic for a further separation is full support for data extraction. For the non-GUI tools, the main separation is the necessity of editing a configuration file to prepare the data extraction process; when this property holds, a further separation can be made according to the type of configured file. This tree representation is useful for constructing groups according to the structural characteristics of our tools.
…direct relation between the content that we see and the HTML code or configuration files that we work with.

Engine. Another really important point to consider is the engine used when performing data extractions. Some tools base their extractions directly on the parsing tree derived from the HTML sources; others take into account the type of tag where the data of our interest is placed; others use alternative methods, like Goldseeker with its substring system. In conclusion, the best results are achieved with a hybrid engine combining these methods, as this really helps to increase the level of resilience.

Scripts and expressions. As explained in the last chapter, using scripts and expressions gives the tools more chances and possibilities to extract and to treat concrete information. Tools with this feature are more powerful when executing data extractions. On the other hand, it is true that, depending on the user profile and the data extraction needs, one tool will be more suitable than another. Considering that a final user profile is either a single user or an enterprise or research group, and taking into account the complexity of the extractions and the price of the license, we construct a table. The presented order of the tools gives priority to the best ones.

[Table header: single user / enterprise or research group, each split into basic extractions / complex extractions]
Figure 89: Tools…
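The hybrid-engine conclusion drawn above can be illustrated with a short sketch that chains the three methods mentioned: tree position first, tag attributes second, and a Goldseeker-style substring search as a last resort. The markup is hypothetical; standard-library `ElementTree` and `re` stand in for the engines.

```python
import re
import xml.etree.ElementTree as ET

def extract_price(html):
    """Try three extraction methods in order of increasing robustness,
    so that a page change defeating one method does not defeat them all."""
    try:
        root = ET.fromstring(html)
    except ET.ParseError:
        root = None  # malformed page: only the substring method can help
    if root is not None:
        # 1. Position in the parsing tree (fast, but fragile).
        node = root.find("div/span")
        if node is not None and node.get("class") == "price":
            return node.text
        # 2. Tag attributes, anywhere in the tree.
        node = root.find(".//span[@class='price']")
        if node is not None:
            return node.text
    # 3. Substring search over the raw source.
    m = re.search(r"class='price'[^>]*>([^<]*)<", html)
    return m.group(1) if m else None
```

Each fallback trades speed and precision for resilience, which is exactly why combining them raises the overall resilience level.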
70. ists of scripts and expressions This feature makes a tool more powerful and lets extract data in a more precise way so that is a really important point to take into account Once realized a group that use expressions or scripts the next characteristic to make a new separation is the full support to the data extraction On the other hand when speaking about the non GUI tools we can realize a main separation by the necessity to edit a configuration file to prepare the data extraction process When the property is true a new separation could be realized taking care of the configured file type This tree representation is useful to construct groups taking into account structural characteristics of our tools 24 EXTRACTION FEATURES Extract Input Scripts Don tale contents from Error Execution HTML or other 2 Output Formats Complexity content variables Usage pages more than one treatment Time documents Dapper HTML XML RSS HTML Google Gadget Netvibes Module PageFlake Google L Maps Image Loop Icalendar Atom Feed CSV JSON XSL YAML email BEE Be gt Eu Robomaker Yes ee SS Se ua Sana Medium Yes Yes Yes Very Good HTML Javascript Web Clip documents documents MCI File Excel DB EXE Low Yes Good ul ine Anywhere documents Web Content File Excel DB SQL script File MySQL Extractor pono to script File HTML XML HTTP submit ON ach DE Figure 19 Data extraction tools features 25 This table
l node. The next step consists of defining the actions that Lixto should perform before extracting data. As there are not many differences between our Web search engines, we are going to explain this only for Google.

Page Class: start
  Action Sequence
    1 www.google.com
    2 Key Action
    3 GoogleSuche
    4 Data Extractor
      root: title (Filter), url (Filter), link (Filter), description (Filter)

Figure 39 Action sequence to extract data by Lixto

1. Go to the Web page of our selected search engine.
2. Write the search value into the input form.
3. Click the search button.
4. Use a data extractor together with our data model and filters to extract the information.

Google

All the data of the obtained results has been correctly extracted. We have to use an XPath expression to select both the Google Maps entries and all the regular results together. The description includes all the information, so we can say that Lixto works well with this Web search engine.

<lixto:extractor>
<root>
<title>Barcelona Spanien</title>
<url><link>http://maps.google.de/maps?hl=de&q=Barcelona+Spanien&um=1&ie=UTF-8&sa=X&oi=geocode_result&resnum=1&ct=title</link></url>
<description>maps.google.de</description>
</root>

<root>
<title>Barcelona Wikipedia</title>
<
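Lixto's data extractor pairs the data model (root with title, url, link and description) with one filter per node, which ultimately amounts to XPath-style selections over the parsed result page. The sketch below reproduces that idea with Python's standard library on an invented, well-formed stand-in document; the element names and sample values are assumptions for illustration, not Google's real markup.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for one search-result page.
SAMPLE = """<results>
  <item>
    <title>Barcelona - Wikipedia</title>
    <link>http://en.wikipedia.org/wiki/Barcelona</link>
    <description>Overview of the history and culture of Barcelona.</description>
  </item>
  <item>
    <title>Barcelona Travel Guide</title>
    <link>http://www.lonelyplanet.com/spain/barcelona</link>
    <description>Travel information from Lonely Planet.</description>
  </item>
</results>"""

def extract(doc):
    """Select title/url/description for every result item, the way a
    data extractor applies one filter per node of the data model."""
    root = ET.fromstring(doc)
    records = []
    for item in root.findall(".//item"):        # the repeated "root" pattern
        records.append({
            "title": item.findtext("title"),    # one filter per field
            "url": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return records

records = extract(SAMPLE)
```

The same loop-plus-selection structure is what the visual filters generate behind the scenes.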
les. In this group we find tools which use neither a GUI nor a configuration file. Strictly speaking this is not totally true, as we use an Internet browser to configure the data extraction process; but we can consider this not a true GUI, as we only fill in forms and use buttons to send orders.

In concrete terms, XWRAP is used through a sequence of steps that allow us to configure the data extraction process. In each step we configure specific characteristics like element identification, data tagging and refinements. Although we could not run executions due to the library support, the recommended scenario for this tool is the set of Web pages with a simple structure. By simple structure we mean no dynamic content, a logical structure and no input variables. It is mainly designed to extract data from plain HTML files.

Non-GUI tools with configuration files

In this group we find tools that do not use a GUI but take their input from configuration files. This approach is more logical and more common than the previous one, as the absence of a GUI forces us to feed our tools through configuration files. Roadrunner belongs to this group of tools: we only use configuration files and the Linux shell to run the data extraction process. Although we have not used it in our tests, the recommended scenario is similar to the one that we
Figure 36 Google results with Robomaker

Yahoo Search

As we mentioned before, this engine uses an AJAX live search in the input form for the searched item. Although Robomaker has a step to execute Javascript, we could not obtain the result page. For this reason we could not extract any information from this page.

Live Search

With Microsoft Live Search we directly have a container that includes all the results. This means we do not have to worry about annoying elements: we can iterate directly through the result elements and get our desired data. As we can find results without a description, we have to ignore the errors this may generate. As happened with Dapper, we do not take all the information elements from the description (URL, cached).

The robot's step sequence is: Load Page, Enter Value, Click Submit, For Each (Extract Title, Extract URL, Extract Description), Return item.
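The ignore-errors handling of result rows without a description can be sketched like this; the records below are invented stand-ins for the per-result containers the robot iterates over:

```python
# Toy records standing in for the per-result containers; some lack a
# description, just as on the real result page.
raw_results = [
    {"title": "Result A", "url": "http://a.example", "description": "First hit"},
    {"title": "Result B", "url": "http://b.example"},           # no description
    {"title": "Result C", "url": "http://c.example", "description": "Third hit"},
]

def iterate_results(results):
    """Emit one clean item per container, tolerating missing fields
    instead of aborting the whole extraction (the ignore-error setting
    on the description step)."""
    for r in results:
        yield {
            "title": r["title"],
            "url": r["url"],
            "description": r.get("description", ""),  # default instead of error
        }

items = list(iterate_results(raw_results))
```

Without the default, the second record would raise an error and stop the run; with it, all three rows survive.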
llow. The next figure illustrates the methodology used when performing our data extraction tests.

Figure 21 Used methodology for the data extraction tests

The first step consists of creating or selecting a Web page source from which we want to extract data. After that we select the data extraction tool with which we are going to perform the test. Most of the selected Web page sources can be found on the Web; however, self-made Web pages have been created to focus on some of the features that we want to test. To elaborate these self-made sources we have used Adobe Macromedia Dreamweaver to create and edit the content, together with a private Web server to host the files. To upload the data files we have used an FTP client. Next we configure our tool to extract the data; this process varies depending on the selected tool. Then we receive an output from this tool, and the resulting extracted data is compared with the correct extracted data. This comparison allows us to qualify the data extraction results of the analyzed tool. Several degrees of qualification have been used, for example a poor, good or very good data extraction. We can also give an explanation of why the data has not been extracted correctly. These are possible ways to draw a conclusion from the test. Once we get all the final results, a conclusion table with a summary of all of them is presented, giving the reader a general view. These conclusion results are ind
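The comparison step can be sketched as a small helper that maps the share of correctly extracted records onto the qualitative scale; the 50% boundary below is an illustrative assumption, since the text does not fix exact thresholds.

```python
def qualify(extracted, expected):
    """Compare an extraction result against the hand-checked reference
    and map the hit ratio onto the qualitative scale (poor / good /
    very good). The 50% threshold is an assumption for illustration."""
    hits = sum(1 for row in expected if row in extracted)
    ratio = hits / len(expected) if expected else 1.0
    if ratio == 1.0:
        return "very good"
    if ratio >= 0.5:
        return "good"
    return "poor"

# Invented reference records for a sample run.
expected = [
    ("Barcelona - Wikipedia", "http://en.wikipedia.org/wiki/Barcelona"),
    ("Barcelona Travel Guide", "http://www.lonelyplanet.com/spain/barcelona"),
]

grade_full = qualify(expected, expected)      # every record found
grade_half = qualify(expected[:1], expected)  # one of two records found
grade_none = qualify([], expected)            # nothing extracted
```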
lture of the Spanish city of Barcelona</description>
</root>

Live Search

Lixto is able to extract all the requested data, avoiding videos and sponsored content. As happens with Yahoo Search, the description does not contain the complete information. So we can conclude that Lixto works without problems with Microsoft Live Search.

<lixto:extractor>
<root>
<title>Barcelona.de Reiseführer Hotel Flug Barcelona Card buchen</title>
<url><link>http://www.barcelona.de</link></url>
<description>Information über die Hauptstadt Kataloniens. Mit Hinweisen zu Sehenswürdigkeiten, Hotels, Gastronomie, Kunst und Kultur, Natur und Umgebung. Zusätzlich gibt es ein</description>
</root>

<root>
<title>Barcelona.de Hotel Flug und Mietwagen buchen</title>
<url><link>http://www.barcelona.de/de/2.php</link></url>
<description>Barcelona, kulturelle Hauptstadt Spaniens. Ein Reiseführer. Sie erhalten auf dieser Seite einen Überblick über die vielfältigen Sehenswürdigkeiten in Barcelona</description>
</root>

WinTask

WinTask can use HTML descriptors to detect the data that we want to search for in the document and then perform an extraction. It is not able to extract dynamic information, as its engine works using the name of the container that holds the information that we want to extract. This represe
m our tests. We could perform extractions and achieve results using Roadrunner. This tool infers a grammar from the HTML code to generate a wrapper for a set of HTML pages, and then uses this grammar to parse the pages and extract pieces of data. That is to say, it does not rely on user-specified examples and does not require any interaction with the user during the wrapper generation process. This means that wrappers are generated, and data is extracted, in a completely automatic way. The system works with two HTML pages at a time, and pattern discovery is based on the study of similarities and dissimilarities between the pages. The tests presented in the following section are designed to extract concrete data from a set of HTML page sources; the same occurs when evaluating the resilience and precision properties. Due to the way this tool proceeds, we cannot draw conclusions about it from our tests, and therefore we are going to exclude it.

4.3 General data extraction tests

In this section we are going to use all the extraction tools to perform several general tests. The aim of this section is to test most of the general features of our data extraction tools. Each performed test explains which features we are going to test.

4.3.1 Basic data extractions

In this section we are going to extract information from a simple Web page. This means we are not going to perform extractions that require specific features to perform an
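Roadrunner's matching of two sample pages can be illustrated with a deliberately tiny sketch. Real Roadrunner aligns HTML token streams and also discovers optional and repeated sections; this toy version only aligns whitespace-separated tokens of two equally long pages and marks every mismatch as a data slot.

```python
def infer_template(page_a, page_b):
    """Minimal sketch of Roadrunner's idea: align two pages of the same
    class and turn every mismatching token into a data slot (#PCDATA).
    Assumes both pages tokenize to the same length; the real system
    handles insertions, options and repetitions as well."""
    out = []
    for ta, tb in zip(page_a.split(), page_b.split()):
        out.append(ta if ta == tb else "#PCDATA")
    return " ".join(out)

# Two invented pages of the same class, differing only in the data.
a = "<b> Title: </b> Datenbanken <i> Price: </i> 19.95"
b = "<b> Title: </b> Netzwerke <i> Price: </i> 24.90"
template = infer_template(a, b)
```

The common tokens survive as the template, and the two mismatching positions become the extraction slots.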
mposed by a headline and a brief description of the news. This way of structuring is simple, and if we represent it as a tree we will see that some elements appear repeatedly. This will help our tools to extract the information. Let us imagine the opposite example: a digital newspaper that does not use a main table with all the news and does not follow a rule to present the information. Some news items could have photos, others videos, and the information would be presented in a cell of a specific size and location that produces a nice final view for the user. This kind of structure is more likely to cause problems for our data extraction tools.

Badly constructed HTML source documents

A well-built HTML document must follow some rules. Although most browsers can visualize the content of a page with some errors in its structure, it is highly recommended to follow the W3C standard for HTML. Some of these errors are badly placed tags, meaninglessly repeated tags and unclosed tags. All these kinds of mistakes can make our data extraction harder.

Nested data elements

These kinds of elements nest data, and so differences can appear from element to element. An example is shown in Figure 6.

Datenbanken kompakt. Neu kaufen: EUR 19,95. 76 Angebote ab EUR 14,95. Lieferung bis Mittwoch, 23. Januar. Bestellen Sie innerhalb... Overnight Express. Bücher: Alle 734 Artikel ansehen. Datenbanken Konzepte und Sprach
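The effect of such nesting on a wrapper can be shown with a small sketch; the two entries below are invented stand-ins for the Figure 6 listings, and the flat field layout is an assumption for illustration.

```python
# Two invented product entries: the second nests an extra delivery note,
# so the fields no longer sit at fixed positions.
entry_plain = ["Datenbanken kompakt", "EUR 19,95"]
entry_nested = ["Datenbanken kompakt",
                "Lieferung bis Mittwoch, 23. Januar",
                "EUR 19,95"]

def price_by_position(entry):
    """Brittle: assumes the price is always the second field."""
    return entry[1]

def price_by_pattern(entry):
    """More resilient: recognise the price by its shape, wherever it nests."""
    for field in entry:
        if field.startswith("EUR"):
            return field
    return None

broken = price_by_position(entry_nested)  # picks up the delivery note
robust = price_by_pattern(entry_nested)   # still finds the price
```

A positional rule works on the regular entry but silently returns the wrong field on the nested one, which is exactly the failure mode this section describes.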
(The screenshot shows Kapow RoboMaker 6.3 with the google.de search page loaded and an Enter Text step selected; the step help reads: this action enters a text into a text field in a form.)

Figure 11 Robomaker Screenshot

Lixto

The Lixto Visual Developer (VD) is a software tool that allows the user to define wrappers which visually access data in a structured way, as well as to configure the necessary Web connectors. The program originates from a research project of the Technical University of Vienna that later became the Lixto
nclair, Mass Market Paperback, April 27, 2004. 54 Used & new from $2.67. Get it by Wednesday, April 30 if you order in the next 10 hours and choose one-day shipping. Eligible for FREE Super Saver Shipping. Excerpt, page 7: "THE JUNGLE. Jurgis, of all men, to Jurgis Rudkus, he with the..." Surprise me! See a random page in this book. Books: See all 128,055 items. Jungle: A Harrowing True Story of Survival by Yossi Ghinsberg, Hardcover, Sep 1, 2005. 52 Used & new from $10.04. Get it by Wednesday, April 30 if you order in the next 12 hours and choose one-day shipping. Eligible for FREE Super Saver Shipping. Books: See all 128,055 items.

Figure 66 Duplicated data for this resilience test

Dapper: it has extracted the duplicated new price entry twice.
Robomaker: it has extracted only one new price entry.
Lixto: it has extracted the duplicated new price entry twice.
Web Content Extractor: it has extracted only one new price entry.
Webharvest: it has extracted the duplicated new price entry twice.

Figure 67 Final test results (fourth test of resilience: Dapper, Robomaker, Lixto, Web Content Extractor, Webharvest)

4.4.7 Test 5: Changing the order of extracted data

The fifth attempt to modify the HTML structure consists of changing the order in which the data to be extracted appears. In this case we have placed the first row of the two first bo
ng the content of interest in a self-describing representation. Its target should be to convert information implicitly stored as an HTML document into information explicitly stored as a data structure for further processing. Due to these characteristics, we are going to choose this kind of tool to perform data extractions from the Web. A wrapper for a Web source accepts queries about information in the pages of that source, fetches the relevant pages, extracts the requested information and returns the result. The construction of a wrapper can be done manually or by using a semi-automatic or automatic approach. The manual generation of a wrapper involves the writing of ad hoc code. The creator has to spend quite some time understanding the structure of the document and translating it into program code. The task is not trivial, and hand coding can be tedious and error-prone. On the other hand, semi-automatic wrapper generation benefits from support tools that help design the wrapper. Using a graphical interface, the user can describe which data fields are important to extract. A specific configuration of the wrapper has to be done for each Web page source, as the content structure varies from one to another. Expert knowledge in wrapper coding is not required at this stage, and it is also less error-prone than coding. On the other hand, automatic wrapper generation uses machine learning techniques, and the wrapper research community h
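A minimal hand-coded wrapper makes the trade-off concrete; the regular expression and page fragment below are invented for illustration and are tied to exactly one layout, which is why manual wrapper writing is described as tedious and error-prone.

```python
import re

# A hand-written wrapper is just ad hoc code bound to one page layout.
# Any change to the markup (extra attributes, reordered cells) breaks
# the pattern, so every source needs its own hand-tuned variant.
ROW = re.compile(r"<td>(?P<title>[^<]+)</td><td>(?P<price>[^<]+)</td>")

def manual_wrapper(html):
    """Return one dict per matched table row."""
    return [m.groupdict() for m in ROW.finditer(html)]

# Invented sample page fragment.
page = "<tr><td>Barcelona Guide</td><td>EUR 12,50</td></tr>"
rows = manual_wrapper(page)
```

Semi-automatic and automatic generators exist precisely to avoid writing and maintaining this kind of per-source pattern by hand.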
nges:

- Change the class attribute of the span tag that identifies the price of a new product: sr_price becomes amazon_price.
- Change the class attribute of the span tag that identifies the name of a product: srTitle becomes amazonTitle.

An abbreviated excerpt of the modified HTML source:

<td><a href="http://www.amazon.com/Jungle-Enriched-Classics-Upton-Sinclair/dp/0743487621/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1209462435&sr=8-1"><img src=".../413VD80VTI6L_002.jpg" alt="The Jungle (Enriched Classics)" border="0" height="115" width="115"></a></td>
...
<td><a href="..."><span class="amazonTitle">The Jungle (Enriched Classics)</span></a> by Upton Sinclair <span class="bindingBlock">(<span class="binding">Mass Market Paperback</span> - April 27, 2004)</span></td>
<tr><td class="priceBlockWithTopPadding"><span class="priceType"><a href="...">Buy new</a
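The effect of this attribute rename on a class-anchored wrapper can be sketched as follows; the markup is a simplified, invented stand-in for the Amazon rows, and the fallback illustrates the hybrid-engine idea discussed earlier rather than any particular tool's implementation.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the price span before and after the rename.
BEFORE = '<p><span class="sr_price">EUR 5,95</span></p>'
AFTER = '<p><span class="amazon_price">EUR 5,95</span></p>'

def by_class(doc, cls):
    """Selector anchored on the class attribute only."""
    node = ET.fromstring(doc).find(f".//span[@class='{cls}']")
    return None if node is None else node.text

def hybrid(doc):
    """Hybrid fallback: try the known class first, then accept any
    price-shaped span, wherever it sits."""
    text = by_class(doc, "sr_price")
    if text is not None:
        return text
    for span in ET.fromstring(doc).iter("span"):
        if span.text and span.text.startswith("EUR"):
            return span.text
    return None

lost = by_class(AFTER, "sr_price")  # None: the rename broke the wrapper
kept = hybrid(AFTER)                # the fallback still finds the price
```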
........................ 30
4.3.1 Basic data extractions ........................ 30
4.3.2 Data extraction from Web search engines ........................ 36
4.3.2 Data extraction from Ebay ........................ 45
4.3.3 Data extraction from dynamic content Web pages ........................ 49
4.4 Resilience against changing HTML code ........................ 53
4.4.1 Testing the resilience of our tools ........................ 54
4.4.2 Structure ........................ 56
4.4.3 Test 1: Delete a table column next to the extracted data ........................ 57
4.4.4 Test 2: Delete previous content from the extracted data ........................ 59
4.4.5 Test 3: Making modifications to DIV and SPAN tags ........................ 60
4.4.6 Test 4: Duplicating extracted data ........................ 61
4.4.7 Test 5: Changing order of extracted data ........................ 62
4.4.8 A concrete example: Improving resilience with Robomaker against structure changes ........................ 65
4.5 Precision in extracted data ........................ 66
4.5.1 Precision extracting a date field ........................ 66
4.5.2 Extracting data from simple text ........................ 66
4.5.3 Extracting data from formatted text ........................ 68
4.5.4 Extracting data using styled text ........................
ns(@class,'g b')]/text() | //span[contains(@class,'ship')]/text() | //span[contains(@class,'time')]/text()">
    <html-to-xml>
      <http url="http://shop.ebay.de/items/_W0QQ_nkwZ${search}QQ_armrsZ1QQ_fromZZQQ_mdoZ"/>
    </html-to-xml>
  </xpath>
</var-def>
</config>

As happened with the Web search engine tests, the limits of the extraction depend on the limits of the XPath expressions we can form. In this case Ebay presented no problems for the extraction. A part of the final output is shown next:

Cyrkle The Red Rubber Ball A Collection CD OVP new | EUR 8,49 | EUR 2,90 | 13Std 13Min
RED HOT N BLUE Havin a Ball in RED VINYL RARE | EUR 1,00 | EUR 4,00 | 2T 12Std 34Min
Neil Diamond La Bamba Red Rubber Ball 1973 7" | EUR 2,49 | EUR 1,80 | 2T 17Std 9Min
THE CYRKLE RED RUBBER BALL 1 HIT USA APRIL 1966 | EUR 1,00 | EUR 2,20 | 4T 22Std 32Min

The following table summarizes the final results of our Ebay search tests for Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor, Goldseeker and Webharvest.

Figure 47 Final Ebay test results

4.3.3 Data extraction from dynamic content Web pages

Pageflakes

The aim of Pageflakes is to create a personalized Web page where all its users can keep up to date with the many blogs and news sources that they read frequently. This stage was chosen to test our tools with d
nts a problem, as it is focused on treating static content. With this tool the user is able to write complex scripts; however, it is basically built for the automation of tasks, so we cannot take advantage of it for Web search engines.

Automation Anywhere

This tool lets the user record a set of actions to perform a customized search. To perform extractions in the Web search engine field, the tool does not give us enough resources and we experienced difficulties. Like WinTask, this tool is more oriented towards the task automation field. For this reason we could not extract all the data either.

Web Content Extractor

This tool does not allow the user to insert information into an HTML form and pick up the resulting data after clicking the submit button. We have to use a GET method to generate a result page directly. Regarding the data extraction, it is able to choose the tag that contains information and save it in a result column. It does not use iterators; it is only guided by the tags that the page contains. It is very slow to develop this work manually, but it guarantees the correctness of the extracted data, as it uses the page structure. As we have to select the data element by element, this program will work with all the Web search engines; the problem is that configuring the extraction can require a lot of time. In conclusion, this tool passes the test.

Extraction Pattern. Web Site Address (URL): http://www
Figure 1 Growth of the number of hostnames, from [7]

Figure 1 illustrates the growth of hostnames in the last years. As shown, the curve has a kind of exponential form, which means that the growth tendency is going to increase further. The same happens when talking about Wikipedia, an online and free encyclopedia: the number of articles is increasing with an equivalent exponential form; in other words, more and more information is inserted into the Web.

Figure 2 Growth of the number of Wikipedia articles (increase per day), from [10]

In particular we are going to concentrate on the information we find in Web pages. These have evolved, introducing dynamic content, animations, video, visual content, audio, etc. One of the main problems that we are faced with is how to structure this big amount of information. At the beginning the Web was designed as a source of data for human use. It was built
(The WinTask screenshot shows a script that selects the Pageflakes "Get it Together" page with UsePage and captures DIV content with CaptureHTML into result strings, together with the clipboard and log panes reporting that execution ended successfully.)

Figure 13 WinTask Screenshot (clipboard and log)

Automation Anywhere

Automation Anywhere is a Windows tool that lets the user record clicks and mouse movements and create desktop tasks that can interact with our programs. It can also record from the Web; this basically consists of creating a navigation sequence and extracting data of our interest. We can also use templates to perform concrete tasks, or use the task editor, which lets the user create a task using predefined actions, conditions, scripts, and mouse and keyboard activity. This tool is only available as a trial version; if we want full functionality we have to buy it.
ojects/disl/XWRAPElite
37 Web-Harvest Homepage: http://web-harvest.sourceforge.net
38 Goldseeker Project Homepage: http://goldseeker.sourceforge.net

10 Declaration of authorship

I certify that I have produced the present work independently and only with the use of the cited sources and aids; in particular, direct and paraphrased quotations are marked as such. I am aware that any violation can lead, even retrospectively, to the revocation of the degree.

Place, date                                  Signature
ok entries, where the title and the book format appear, to the last row of the table.

jungle. Showing Top Results. Previous | Page 1 2 3 | Next
1. Buy new: $5.95. 54 Used & new from $2.67. Get it by Wednesday, April 30 if you order in the next 10 hours and choose one-day shipping. Eligible for FREE Super Saver Shipping. Excerpt, page 7: "THE JUNGLE. Jurgis, of all men, to Jurgis Rudkus, he with the..." Surprise me! See a random page in this book. Books: See all 128,055 items.
2. Buy new: $23.95 $16.29. 52 Used & new from $10.04. Get it by Wednesday, April 30 if you order in the next 12 hours and choose one-day shipping. Eligible for FREE Super Saver Shipping. Books: See all 128,055 items.

Figure 68 Data order changed for this resilience test

Dapper: Dapper displays the following error message, so we cannot extract data from this modified HTML page: "Error While trying to run Events chain: null".

Robomaker: Robomaker has not extracted these two changed fields in a correct way; it has taken information from other rows. Thus we experience problems because of this type of change.
om the Web. This data is queried by human users or by applications; in the latter case we access the stored data of other computers. The data can be stored in files, in databases or directly in HTML documents. When a user performs a standard query, he uses a Web browser to access the HTML Web data sources directly. Another possibility is to perform an extraction process when we want to extract concrete information from the sources, and then an integration process when we retrieve information from more than one data source. This last process is responsible for joining the information in order to deal with unified data.

Figure 3 Querying data from the Web

A classification regarding the structure type of the data exists:

Free text. This type of text is found in natural language texts, for example magazines or pharmaceutical research abstracts. Patterns involving syntactic relations between words, or semantic classes of words, are used to extract data from this type of source.

Structured text. This type of text is defined as textual information in a database or file following a predefined and strict format. To extract this kind of data we have to use the format description.

Semi-structured text. This type of text is placed at an intermediate point between unstructured collections of textual documents and fully structured tuples of typed data. To extract data we
on v entry description

This is the final table result of the Kings of Sun users (Player 1, 2008 Contest; the recoverable column values are 357, 566, 45 and 411, with a total of 1381). This strategy game was created by Likstorh Software in 2005 and, due to the yearly growth of online players, online technology competitions take place. The user has to use his strategy abilities to be the best king of his region's land, which includes having a social life, constructing temples, studying new technologies and beginning wars to extend territory.

Figure 8 Some results of the data extraction

3 Data extraction tools

In this chapter we present all the information referring to the Web data extraction tools. First of all we shortly describe all of their characteristics. At the end, a brief tool comparison is presented to directly compare the features of all of them.

3.1 Related work

Creating the group of tools used to perform extractions is an important decision in our work. Selecting a specific group of tools with some characteristics, or another one with other characteristics, could lead to different results. We decided to work with a group of ten tools. The methodology followed for the selection focuses on searching in papers and related documents about Web data extraction tools, and on searches through Google and through Sourceforge. A good deal of the information that helped us make the selection has been extracted from papers. Especially we used
nclusion, this tool will have a good resilience property depending on the change made to the HTML structure. As there are more possibilities to modify other content than the strings used to identify the field, we are going to categorize this tool as having a good resilience property.

4.4.2 Structure

If we take a look at the original structure of this Web page, we can see that the content of our interest is located in the second column of the first row of the second table of our HTML code. We want to extract all the information from the rows of this column. Each one of them represents a publication, and in it we can find all the information that we want to extract. We can make a first test of extracting information using this structure. After that, we are going to make modifications to this Web page that represent possible changes a Webmaster could apply to update the content and that could lead to data extraction errors.

(The accompanying tree view shows the parsed HTML: html with head and body; body contains a div with children span[0] to span[11], further div nodes, several script and comment nodes, and then the table nodes, starting with table[0].)
ract four fields of the resulting search items, we create the following data model with a root node at the top:

Data Model: root with name, price, shipping_price, remaining_time

Figure 45 Data model used by Lixto for Ebay tests

With this structure, the actions that we are going to perform to extract data with Lixto are the following:

Page Class: start
  Action Sequence
    1 http://listings.ebay.de/...
    2 Mouse Action
    3 Key Action
    4 Finden
    5 Data Extractor
      root: name (Filter), price (Filter), shipping_price (Filter), remaining_time (Filter)

Figure 46 Action sequence to extract data by Lixto

1. Go to the Web page of Ebay.
2. Click on the input form.
3. Write the product value into the input form.
4. Click the search button.
5. Use a data extractor together with our data model and filters to extract the information.

We get the result page in the same way as with Dapper or Robomaker: we get a category page or a product page depending on the input product. If the search does not output any result, then the tags of our XML file are empty. Nevertheless, the data has been extracted without problems.

<lixto:extractor>
<root>
<name>CD Marcia Ball Blue House NEU</name>
<price>EUR 16,01</price>
<shipping_price>EUR 2,99</shipping_price>
<remaining_time>21Std 57Min</remaining_time>
</root>
<root
Figure 14 Automation Anywhere Screenshot

Web Content Extractor

Web Content Extractor is a Windows tool that allows the user to create a project for a particular site, extract data from it and store it in the current project's database. The extracted data can be exported to a variety of formats, including Microsoft Excel, CSV, Access, TXT, HTML, XML, SQL script or MySQL script. As with the two tools analyzed before, we could only download the trial version of Web Content Extractor.

Figure 15 Web Content Extractor Screenshot

Roadrunner

Roadrunner is
ressions to extract all the different fields from Web search engine results. The extraction process consists of looking into the HTML code and searching for the specific tags and attributes that identify each of the fields. We are going to analyze the three different types of Web search engines together, as the way to proceed is similar and only some tags in the configuration file vary. The following code lines extract all the data fields from the Google search:

<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
  <var-def name="search" overwrite="false">barcelona</var-def>
  <var-def name="all">
    <xpath expression="//h3[contains(@class,'r')]//text() | //h3//a[contains(@class,'l')]/@href | //div[contains(@class,'s')]//text()">
      <html-to-xml>
        <http url="http://www.google.com/search?hl=es&q=${search}&btnG=Buscar+con+Google&lr=lang_en"/>
      </html-to-xml>
    </xpath>
  </var-def>
</config>

Here is displayed a share of the output given by Webharvest when extracting data from the Google search:

http://maps.google.com/maps?hl=es&q=barcelona&lr=lang_en&um=1&ie=UTF-8&sa=X&oi=geocode_result&resnum=1&ct=title
http://www.fcbarcelona.com
FCBarcelona.cat | fcbarcelona.cat
http://www.fcbarcelona.com/web/english/
FCBarcelona.cat | The official FCBarcelona w
rienced problems. Automation Anywhere showed the error: "Unable to find Y control in the webpage. An error occurred at line number 2. Please open the task in the Task Editor to view the action at line number 2."

Figure 51 Pageflakes results with Automation Anywhere

■ Web Content Extractor

This program allows us to extract information from dynamic Web pages. We extracted 3 of the 4 fields that we wanted; the last one did not return any result. Executing the same task the day after, we did not retrieve any result at all. We can therefore conclude that this program does not work correctly with this type of dynamic content.

Figure 52 Pageflakes results with Web Content Extractor

■ Goldseeker

With this tool we arrive at the same conclusion as with the Web search engines and Ebay: this tool does not pass this test.

■ Web
s to perform the data extraction are the following: Test Tag, Click, Next.

Figure 72 Robomaker step sequence

The robot follows this step sequence:

1. We load the www.digg.com page, use the input variable to perform the search, and click the search button.
2. Then we use a Test Tag step with the Tag Path div div div div, looking for the class attribute "notice". If we do not find such an attribute in the previously defined path, it means that we do not retrieve any result and it makes no sense to extract information from the rows.
3. Once we know that results are going to be found, we extract the title, the URL and the description of each entry.
4. We repeat step 3 for a concrete number of pages by clicking the Next button each time. We find the Next button by using the Tag Path div div div div a, having the class attribute "nextprev" and using the tag pattern ">Next", which tests that the text in the link tag starts with the text "Next".

Once we have presented this example, we are going to improve its resilience. Let us start by taking a closer look at the Tag Finder configuration of the Test Tag step. In this step the Tag Path is div div div, and it does not help the resilience property: this Tag Path points to any DIV tag contained inside two other DIV tags. If Digg changes its structure and the DIV we are looking for is no longer contained within two other DIVs, then the st
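The fragility being described can be sketched in a few lines. This is not Robomaker itself, just a stand-in using Python's standard library on hypothetical markup: a position-based path (like the Tag Path div div div) breaks when a wrapper DIV disappears, while an attribute-based match survives.

```python
import xml.etree.ElementTree as ET

original = "<div><div><div class='notice'>No results</div></div></div>"
# After a hypothetical redesign, one wrapper div is removed:
redesigned = "<div><div class='notice'>No results</div></div>"

def find_fragile(html):
    # mimics a depth-based Tag Path: relies on exact nesting
    return ET.fromstring(html).find("./div/div")

def find_robust(html):
    # relies on the class attribute instead of the position
    return ET.fromstring(html).find(".//div[@class='notice']")

assert find_fragile(original) is not None
assert find_fragile(redesigned) is None      # breaks after the redesign
assert find_robust(redesigned) is not None   # still matches
```

Anchoring the Tag Finder on an attribute value rather than on nesting depth is exactly the kind of "few easy changes" that makes the robot survive layout changes.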
s were more complicated, and if we decided to realize a query to a database, getting the 10 users with the highest points, and use PHP to write these results to the table, no problems would appear. This happens because we are only modifying the data; the structure and all the other characteristics of our HTML source remain unchanged. We have also inserted dynamic content, specifically a banner in Flash; in this case we do not care about it, as it does not contain important data to extract. On the other hand, CSS styles have been used to format the different types of content. This will help our tools to separate the content by using the class attribute of DIV or SPAN tags:

    <style type="text/css">
    body { background-image: url(stripeOf0fS5alals4d3lesdbbserorflle2z96.png); }
    .title { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 24px; color: #FFFFFF; }
    .subtitle { font-family: Verdana, Arial, Helvetica, sans-serif; }
    .description { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 12px; }
    .table_data { font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 12px; }
    .link { color: #EFEEFF; font-family: Verdana, Arial, Helvetica, sans-serif; font-size: 12px; }
    </style>

After describing all the characteristics of the page, we have realized an extraction using one of our data extraction tools under test, and all the data has been extracted correctly. descripti
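The way class attributes help a tool separate content can be sketched with Python's standard-library HTML parser. The class names below mirror the stylesheet above; the page fragment is a hypothetical stand-in, not the actual test page.

```python
from html.parser import HTMLParser

# Collect text into buckets keyed by the class attribute of the
# enclosing DIV or SPAN, which is what class-based extraction does.
class ClassCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag in ("div", "span"):
            self.current = dict(attrs).get("class")

    def handle_endtag(self, tag):
        if tag in ("div", "span"):
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.fields.setdefault(self.current, []).append(data.strip())

page = """
<div class="title">LIST OF PUBLISHED BOOKS</div>
<span class="description">How to begin with Computers</span>
<span class="description">Spain: The guide</span>
"""

collector = ClassCollector()
collector.feed(page)
print(collector.fields)
```

Without the class attributes, every piece of text would land in the same undifferentiated stream, which is why the CSS formatting incidentally makes the page easier to extract from.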
se tools and those based on NLP is that they do not rely on linguistic constraints but rather on formatting features that implicitly delineate the structure of the pieces of data found. Some tools of this approach are WIEN, SoftMealy and STALKER.

Modeling based: These tools are based on the fact that, given a target structure for objects of interest, they try to locate in Web pages portions of data that implicitly conform to that structure. The structure is provided according to a set of modeling primitives (e.g. tuples, lists, etc.) that conform to an underlying data model. Tools that adopt this approach are NoDoSE and DEBYE.

HTML aware: This type of tool relies on the inherent structural features of HTML documents to accomplish data extraction. The source document is transformed into a parsing tree that reflects its HTML tag hierarchy. Extraction rules are then generated either semi-automatically or automatically and applied to the tree. In this document we are going to use tools that follow this approach; some of them are RoadRunner, XWRAP or Robomaker.

In the following sections we are going to explain the main characteristics of the set of HTML-aware tools that we have selected.

3.3 Overview of tools

The aim of this section is to give the reader a general view of each of the tools used in this document. Their main features are shown here, and for almost all of them a screenshot is presented.

Dapper
shed edition / year of the last publication: Dapper, Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor, Goldseeker, Webharvest

Figure 79 Final results from extracting data from CSV formatted text

5 Concatenating the input/output of our tools

All of our HTML-aware data extraction tools produce an output once the data has been extracted. This output can be given in several formats, and it is particularly interesting that we can reuse the outputted data in HTML again as the input of our programs. Such a feature can be useful to extract a part of the data that causes problems with one of these tools: we can use one tool to extract some part of the data and another tool to extract the rest. Another useful characteristic of this process is that we can separate the extraction process into steps, increasing the precision of the extracted data each time. If we take a look at the table of our data tools' features, we can see that Dapper, Roadrunner and Web Content Extractor can use the HTML format both for input and output. In this chapter we are going to carry out some tests combining two of these tools to see if this process can fix some problems in the data extraction process or might be useful to obtain several levels of precision of the extracted data. We are going to realize the following combinations with our programs
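The chaining idea can be sketched independently of any particular tool: a first pass narrows the document to the region of interest and re-emits HTML, and a second pass pulls the individual fields from that intermediate HTML. The markup and field names below are hypothetical stand-ins for any HTML-in/HTML-out extractor pair.

```python
import re

page = """<html><body>
<div id="results">
  <p class="entry"><b>Barcelona Wikipedia</b> Overview of the city.</p>
  <p class="entry"><b>Barcelona Airport</b> Flight information.</p>
</div>
<div id="ads">Buy now!</div>
</body></html>"""

def pass_one(html):
    # coarse step: keep only the results region; the output is HTML again,
    # so it can be fed to a second extractor
    return re.search(r'<div id="results">.*?</div>', html, re.S).group(0)

def pass_two(html):
    # fine step: extract (title, description) pairs from the narrowed HTML
    return re.findall(r'<b>(.*?)</b>\s*(.*?)</p>', html, re.S)

intermediate = pass_one(page)
records = pass_two(intermediate)
print(records)
```

The first pass has already discarded the advertising block, so the second pass never has to disambiguate it; this is the precision gain that splitting the extraction into steps buys.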
4.5.5 Extracting data from CSV formatted text ......................... 69
5 Concatenating the input/output of our tools ......................... 70
5.1 Dapper to Dapper ......................... 71
5.2 Dapper to Web Content Extractor ......................... 72
5.3 Web Content Extractor to Dapper ......................... 73
5.4 Web Content Extractor to Web Content Extractor ......................... 74
6 Categorization of the data extraction tools ......................... 75
7 Conclusions ......................... 79
8 References ......................... 81
9 Tools ......................... 82
10 Declaration of authorship ......................... 84

1 Introduction

Nowadays we live in a world where information is present everywhere in our daily life. In recent years the amount of information that we receive has grown, and the media through which it is distributed have changed: from conventional newspapers or the radio to mobile phones, digital television or the Web. In this document we refer to the information that we can find on the Web, a really big source of data which is still developing.
    <name>Alpas Handball Hand Ball Magic Blue 777 Handbälle Bälle</name>
    <price>EUR 18,95</price>
    <shipping_price>EUR 5,90</shipping_price>
    <remaining_time>1T 08Std 51Min</remaining_time>
    </root>

■ WinTask

As the Ebay search structure is similar to that of the Web search engines, we have the same problem as in the previous test. This tool does not allow the user to handle dynamic content correctly; it only works fine with static content.

■ Automation Anywhere

With this tool we arrive at the same conclusion as with the Web search engines.

■ Web Content Extractor

With this tool we arrive at the same conclusion as with the Web search engines.

■ Goldseeker

With this tool we arrive at the same conclusion as with the Web search engines.

■ Webharvest

We have extracted successfully all the fields presented in the Ebay search, using XPath expressions again. The process of extracting the information consists of looking into the HTML code and searching for the specific tags and attributes that identify each of the fields that we want to extract. The following code lines show the configuration file used to realize the extraction:

    <?xml version="1.0" encoding="UTF-8"?>
    <config charset="UTF-8">
        <var-def name="search" overwrite="false">red%20ball</var-def>
        <var-def name="all">
            <xpath expression="//div[contains(@class,'ttl')]/text() | //div[contai
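A small side note on the search value in the config: the term "red ball" appears as red%20ball because it is embedded directly in a URL and must be URL-encoded first. As a sketch (Python's standard library, not part of Webharvest):

```python
from urllib.parse import quote, quote_plus

# The space in the search term becomes %20 in path-style encoding,
# or + in form-style encoding; both are common in search URLs.
term = "red ball"
print(quote(term))       # percent-encoding
print(quote_plus(term))  # form-encoding variant
```

Whichever variant a search engine expects, the configuration file has to carry the already-encoded value, since Webharvest substitutes it verbatim into the URL.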
t Extractor

With Web Content Extractor we could not extract the information, due to the changes in the HTML structure. It has extracted wrong data, and we can conclude this tool does not pass the test.

    id | title1 | format1 | price1 | title2 | format2 | price2 | stars

Figure 70 Web Content Extractor results

■ Webharvest

We are using XPath expressions again to refer to the extracted data fields, and we used the information contained in the SPAN tags. This change of order caused no problems when realizing an extraction. Due to this fact, Webharvest passes this test.

5th test of resilience: Dapper, Robomaker, Lixto, Web Content Extractor, Webharvest

Figure 71 Final test results

4.4.8 A concrete example: improving resilience with Robomaker against structure changes

Of all of our tools, Robomaker is quite surely the one with the highest degree of functionality. This is the reason why we selected it to improve the resilience of a data extraction from a Web page. To do this we need to go into some of the more advanced and powerful features of this tool. It is certainly impossible to make a robot that can handle all thinkable and unthinkable scenarios, but a few easy changes can make a robot much more able to handle minor changes such as layout changes or added content. We are going to create a robot that extracts all the entries of the first two pages of results of www.digg.com, using an input value to perform a search. The sequence step
th an input variable and add the resulting page to the basket of sample pages. After that we have to select interactively the fields of data that we want to extract, in this case all the information of each resulting entry. As explained before, we have to take care to select a good sample for the input value. These tests have been built using the RSS feed output.

Google Search

Dapper is able to extract all the entries without problems. It extracts the Google Maps entries, normal links and nested links (see the third link of the left screenshot of figure 33). The description takes all the information: text, cached, content size. It is suitable to be used to perform Google searches.

Data Mapping (excerpt):

item
    title: Barcelona DC
    description: Tourist guide to Barcelona, Catalonia and its Costa Brava. www.about-barcelona.com, 88k

item
    title: The best of Barcelona
    description: Barcelona Direct Connect, the free adverts site for everything in Barcelona, Spain. Advertise your apartment for rent, cars for sale, offices to let, jobs. www.barcelona-dc.com, 24k
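The data mapping shown above groups flat extracted fields into per-result items. The grouping logic can be sketched in a few lines; the field values here are taken from the excerpt above, and the rule "a new title starts a new item" is an assumption about how such mappers typically segment records, not a description of Dapper's internals.

```python
# Flat (field, value) pairs in document order, as a scraper emits them.
pairs = [
    ("title", "Barcelona DC"),
    ("description", "Tourist guide to Barcelona, Catalonia and its Costa Brava"),
    ("title", "The best of Barcelona"),
    ("description", "Barcelona Direct Connect, the free adverts site"),
]

items, current = [], {}
for field, value in pairs:
    if field == "title" and current:  # a new title opens a new item
        items.append(current)
        current = {}
    current[field] = value
items.append(current)

for item in items:
    print(item)
```

This segmentation step is where tools differ most: choosing the wrong separator field is exactly the failure Lixto shows later in the resilience tests.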
the values; we only have to select the specific content to assign to these variables, and then our results are extracted.

Figure 27 Saving a field with Automation Anywhere

■ Web Content Extractor

Web Content Extractor presented no problems when extracting these fields. With this tool we only have to select the Web page source and select the fields we want to extract. We have to name each of the extracted fields so that it can be referenced:

    title, description, player1: Player 1, player2: Player 2, ..., player10: Player 10

Figure 28 Final output using Web Content Extractor

■ Goldseeker

Using this tool to configure the data extraction process, we have to edit the sample.php file:

    <?php
    include("GSparser.php");
    $dm = new GSParser("kings.gs", "kings_data", "singleFile");
    $dm->parse();
    ?>

In this file we indicate to include GSparser.php, which is the file that has all the
the date of the Last Published edition field, applying a bold style. With this process we are placing a <strong> tag into our HTML code, and this will help our tools to distinguish between the two parts of the Last Published edition field.

LIST OF PUBLISHED BOOKS (Title | Author | Last Published edition):

    How to begin with Computers | Andrew Moss | 1998-07-07 First edition
    Spain: The guide | Roberto Díaz | 1995-02-04 Second edition
    The book of Manchester United | John Henley | 2003-06-18 First edition
    How to survive in Africa | Kate Nebit | 1991-01-25 First edition
    Red apple blue sky | Marko Owen | 2006-12-07 Second edition
    Love in the mountain | Katja Müller | 2000-05-19 Fourth edition
    Bash programming guide | John Harker | 2001-11-23 Second edition
    The 100 best horror films | Jack Ismay | 1995-04-22 Second edition
    Speak french in 1 month | Henry Petit | 1997-03-19 First edition
    Welcome to the reality | Robert Morel | 2005-10-10 First edition
    Discrete mathematics | Vera Beltran | 1999-05-30 Second edition
    Planes and boats | Naomi Michel | 1997-08-03 First edition
    Second world war image collection | Juan Espada | 2002-03-12 Third edition
    Discovering Poland | Anja Tomaka | 2003-06-22 Second edition

Figure 75 Second constructed scenario for the precision tests

The resulting table is the following:

All the information of the row / Date of the last publication: Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor, Goldseeker, Webharvest
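The effect of the inserted tag can be sketched directly: once the date is wrapped in <strong>, a combined cell splits cleanly into its two parts, something no tool could do reliably when the cell was a single undifferentiated text run.

```python
import re

# One cell of the "Last Published edition" column after the bold style
# is applied; the surrounding markup is a minimal stand-in.
cell = "<strong>1998-07-07</strong> First edition"

match = re.match(r"<strong>(.*?)</strong>\s*(.*)", cell)
date, edition = match.groups()
print(date, "/", edition)
```

This is the general trick behind the precision tests: a small, deliberate change to the source markup turns a one-field extraction problem into a two-field one.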
to guarantee that the content and the information can easily be understood and read by humans, but not prepared to be used as data that can be treated by other applications. Because of this fact, this kind of representation is not the most appropriate one for extracting data, and sometimes we have to deal with difficulties. When talking about the use of information, this data will possibly not be useful for every profile of user, or its excess could produce information saturation; what is more, maybe we are only interested in a particular share of it. On the other hand, it could be useful to transform this information to deal with it later or to use it in other areas. This is where the data extraction process gains importance: specific data can be extracted from all these Web sources in order to be used by other users or applications. The capacity to get specific information makes it possible to summarize the big amount of data located on the Web and to use it for concrete purposes. So the importance of Web data extraction resides in the fact that we can realize extractions over all this content. At the same time, such extraction presents problems when considering which data we want to extract and how we realize this extraction. One possibility is a manual extraction of this data, but it is not viable because of the big amount of information that we have to deal with. We have to find another solution to this problem. Several data extraction tools have already been dev
tracted data and to perform transformations. Sometimes it is hard to become familiar with the syntax, but once learned it is useful to perform complex tasks that are difficult to realize in a visual way. Values: Yes / No

• Output formats: The list of the output formats that the tool can export.

• Complexity: Measures the complexity of using the tool to perform extractions. Values: Low / Medium / High

• Non-static content pages: Whether the tool is suitable to extract data from pages whose content is subject to change, for example the result pages from Web search engines. Values: Yes / No

• More than one page: Whether we can get data from more than one page at the same time, useful for example for search results that span several pages. Values: Yes / No

• Error treatment: Whether the application has a way to treat errors when performing data extractions. Values: Yes / No

• Execution time: How much time the tool requires to perform data extractions. Values: Very Poor / Poor / Good / Very Good

• HTML or other documents: Whether we can only extract data from HTML sources or also from others. Values: HTML documents

4 Tests using the data extraction tools

To evaluate the quality of the extraction tools, a set of tests has been developed. Our goal is to see the behavior of these tools and whether they can extract the data that we expect
Figure 12 Lixto VD Screenshot

WinTask

WinTask is a Windows tool used to automate repetitive tasks or actions which should run at a certain moment. One of its features is data extraction from Web sites: WinTask can launch the URL to load, send a userid and an encrypted password if it is a secure site, conduct searches and navigate to the different pages where some field contents have to be extracted. This tool is only available as a trial version; if we want full functionality we have to buy it. It works by using its own scripts, so at the beginning it can be a little hard to become familiar with the syntax.

Figure 13 WinTask Screenshot (a script using UseWindow, SendKeys and CaptureHTML against pageflakes.com)
utomated Information Extraction from Web Sources: a Survey. Between Ontologies and Folksonomies Workshop, 3rd International Conference on Communities and Technology, 2007.

22. Hammer, J., Garcia-Molina, H., Cho, J., Aranha, R., Crespo, A.: Extracting Semistructured Information from the Web. Workshop on Management of Semistructured Data, 1997.

23. Kuhlins, S., Tredwell, R.: Toolkits for Generating Wrappers. NetObjectDays, 2002.

24. Laender, A. H. F., Ribeiro-Neto, B. A., da Silva, A. S., Teixeira, J. S.: A Brief Survey of Web Data Extraction Tools. SIGMOD Record 31(2), 2002.

25. Liu, B.: WWW 2005 Tutorial: Web Content Mining. Fourteenth International World Wide Web Conference (WWW 2005), 2005.

26. Liu, B., Grossman, R., Zhai, Y.: Mining Data Records in Web Pages. KDD, 2003.

27. Meiling, S., Kempe, C.: Vergleich von IE-Werkzeugen, 2008.

28. Myllymaki, J.: Effective Web Data Extraction with Standard XML Technologies. Computer Networks 39(5), 2002.

9 Tools

29. Dapper Homepage: http://www.dapper.net
30. Lixto Software GmbH Homepage: http://www.lixto.com
31. Openkapow Homepage: http://openkapow.com
32. WinTask: http://www.wintask.com
33. Automation Anywhere: http://www.tethyssolutions.com/automation-software.htm
34. Web Content Extractor: http://www.newprosoft.com/web-content-extractor.htm
35. The RoadRunner project Homepage: http://www.dia.uniroma3.it/db/roadRunner
36. XWRAP Elite Homepage: http://www.cc.gatech.edu/pr
    4: title "The Jungle Book 40th...", format DVD, price_new 16.99, stars 4.3
    5: title "Ragga Jungle Dubs", format Audio CD, price_new 11.98

Figure 69 Robomaker results for this test

■ Lixto

Lixto has extracted all the information without problems. What happened is that we selected the title of the book to be the separator of each group of elements (meaning one book entry). Although we have extracted all the data correctly, the structure of the output XML file is not correct. To fix this problem we can select the price to be the separator, but then it only works if we apply this change to all the book entries. In conclusion, Lixto does not pass this test.

    <lixto-extractor>
    <root>
        <new_price>5.95</new_price>
        <stars></stars>
    </root>
    <root>
        <title>The Jungle Enriched Classics</title>
        <format>Mass Market Paperback</format>
        <new_price>16.29</new_price>
        <stars></stars>
    </root>
    <root>
        <title>Jungle A Harrowing True Story of Survival</title>
        <format>Hardcover</format>
    </root>
    <root>
        <title>The Jungle The Uncensored Original Edition</title>
        <format>Paperback</format>
        <new_price>9.60</new_price>
        <stars></stars>
    </root>

■ Web Conten
xample we have only used XPath expressions to extract data, but if we take a look at the manual section of the Web-Harvest homepage we can find a large number of functions that let us perform more concrete actions, like extracting data to files, transforming HTML to XML or executing XQueries.

The following table summarizes the final results of our tests:

Ebay search: Robomaker, Lixto, WinTask, Automation Anywhere, Web Content Extractor, Goldseeker, Webharvest

Figure 31 Final basic data extractions

4.3.2 Data extraction from Web search engines (Google, Yahoo!, ...)

Web search engines are important for everyday Web searches, as they are a simple, fast and powerful way to find information of our interest. In recent years their popularity has been increasing, and nowadays they are an indispensable tool. We are going to use the three most used search engines that we can find on the Web at present:

• Google
• Yahoo Search
• Microsoft Live Search

Figure 32 illustrates their percentage of use (among the others: AOL 5.9%, Ask 5.4%, others 3.4%).

Figure 32 Percentage of use of the most important Web search engines, from [9]

This is a useful test, as it evaluates several features: to start, we need an input value to perform a search, and afterwards we receive a page with all the search results for our input value. This test is limited to the tools that let us use an input value and get data from dynamic page content, as
y a name and can contain attributes and inner content. For the correctness of its usage there is an already defined syntax that specifies the order of appearance, the available attributes, and whether a tag should have a closing tag or not. A standard created by W3C exists that exposes all the construction rules of HTML (http://www.w3.org/TR/html401). Thanks to this structure, wrappers are able to detect the elements of a Web site and extract the desired information: they can recognize repetition patterns of tags to extract similar content, read the attributes of these tags to associate elements, or extract elements in an individual way. Speaking specifically about HTML-aware tools: before performing the extraction process, these tools turn the document into a parsing tree, a representation that shows its HTML tag hierarchy.

Figure 4 HTML parsing tree

Extraction rules are then generated, either semi-automatically or automatically, and applied to the tree. In this tree each node represents a tag, while the outermost tags are leaves. A specific tag is represented by a unique node, and we can write an expression to navigate through all the
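The parsing tree from Figure 4 can be sketched with Python's standard-library HTML parser; the indentation produced below is the tag hierarchy that HTML-aware tools build before applying extraction rules. The sample markup is a minimal stand-in, not the figure's exact document.

```python
from html.parser import HTMLParser

# Print each tag indented by its depth in the tree, reproducing the
# hierarchy view that a parsing tree gives an extraction tool.
class TreePrinter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

p = TreePrinter()
p.feed("<html><head></head><body><div><iframe></iframe></div>"
       "<div></div></body></html>")
print("\n".join(p.lines))
```

An extraction rule is then just a path through this tree (for example "the second DIV under BODY"), which is why a rule keyed to structure breaks when the structure changes.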
y new technologies begin wars to extend territory</description>
    <player>Player 1</player>
    <player>Player 2</player>
    <player>Player 3</player>
    <player>Player 4</player>
    <player>Player 5</player>
    <player>Player 6</player>
    <player>Player 7</player>
    <player>Player 8</player>
    <player>Player 9</player>
    <player>Player 10</player>
    </root>
    </document>

■ WinTask

To extract data with WinTask we have to edit a script file that will extract all the fields of interest. First of all we need two commands: one to open Internet Explorer and another one to load the Web page source. Then we only have to use the graphical interface to extract all the fields. No problems have been encountered with this tool, and all the information has been extracted correctly.

    Kings of Sun 2008 Contest
    This is the final table result of the Kings of Sun 2008 Contest. This strategy game was created by Likstorh Software in 2005 and due to the ...
    Player 1, Player 2, Player 3, Player 4, Player 5, Player 6, Player 7, Player 8, Player 9, Player 10

Figure 26 Final output using WinTask

■ Automation Anywhere

With Automation Anywhere we have to create new variables to save
ynamic content Web pages, using ASP.NET, AJAX or JavaScript content. As Pageflakes is mainly constructed using this kind of content, it has been selected as a candidate Web page to extract data from. We are going to extract data from the weather widget, specifically four fields: the first and the second name of the day and their weather information.

■ Dapper

We got an error with Dapper at the time of collecting sample pages. This error means our tool does not pass this test. The page shows the message: "Something is wrong with your page. Please try refreshing the page. If the problem still persists, please let us know via email: info@pageflakes.com. We are sorry for the inconvenience."

Figure 48 Pageflakes results with Dapper

■ Robomaker

With Robomaker we cannot load the start page of Pageflakes. To solve this problem we tried to execute all the JavaScript content of the page using the Execute JavaScript step, but after doing that we received another error. So this tool fails to extract data from this Web page too: "Error from the Execute JavaScript action. Error g