The graphical interface is defined using XUL, the standard Mozilla XML format for defining user interfaces. XUL defines an overlay system, using which a new layer is defined and laid over an existing part of the application layout, extending or modifying it. The overlay system itself comes with the Mozilla stack and can be used in the IDE by default.

Figure 3.5: GUI of Selenium IDE showing the Command, Target and Value fields.

The functionality of the IDE is, however, linked to its layout, and this has to be taken into account. Selenium IDE internally defines a set of commands that can be used in scenarios. The list of default commands can be seen in a dropdown on the main screen of the IDE. This list can be extended, but the use and structure of commands is implemented internally in Selenium IDE. Addition of new commands is accomplished by extending the Selenium prototype object in a registered plugin. After the extension is processed by the internal command loader, a new set of commands is made available to the user. Commands in this system are recognized by the names under which they are assigned on the prototype object, using a set of prefixes.
Chapter 1: Introduction

During the past few years the Web has undergone several bigger or smaller revolutions:
- WEB 2.0 and the tag cloud,
- HTML5 and semantic tags,
- smartphones, tablets, responsivity and the mobile web everywhere,
- the run-out of IPv4 addresses and the nonexistent boom of IPv6,
- cloud technologies and BigData,
- Bitcoin, Tor, the anonymous internet, WikiLeaks, the NSA, Heartbleed and security concerns,
- Google Knowledge Graph, Facebook Open Graph.

These are only a few examples of the biggest recent technology booms and issues on the global network. So little can mean so much in such a global environment. The environment online is constantly changing, usually on a wave of some new useful, or sometimes terrifying, technology, or with the popularization of a new phenomenon.

The Semantic Web technologies have been described, standardized and implemented for several years now, and their tide seems to be near, though yet to come. The Semantic Web itself relates to several principles, along with their implementations, that allow users to add meaning to their data. This meaning brings not only a standardized structure but also
, as a consequence, the possibility to query and reason on data originating from multiple sources. Once given the structure, similar data can be joined into a bigger bulk. Presenting this data publicly creates a virtual cloud. Put together, these practices are called Linked Data.

The intention of this work is to bring the Semantic Web technologies closer to users. Specifically, it focuses on the process of creation of semantic data. We will propose a methodology for extracting and annotating data out of unstructured web content, along with a design of a tool to simplify the process. The design will be supported by an implementation of a prototype of the tool. The results will be confronted with real-life use cases.

1.1 Problem Statement and Motivation

Giving meaning to, i.e. the semantization of, web pages is getting more popular. Probably the most obvious example can be seen in the way the Google search engine serves its results. When possible, Google presents not only the list of pages corresponding to the searched term, but also snippets of information scraped directly from the content of the pages, such as menu fields parsed from CSS annotations or HTML5 tags, contact information, or opening hours. When applicable, Google also adds data from their own internal ontology, the Knowledge Graph [2].

1) One of the most recent standards, OWL2, was released in 2008.

What options are there to bring real semantics into a webpage?
jQuery uses the Sizzle engine to handle selectors for it. Sizzle is a very popular library for handling selectors, which also defines its own selectors, like :eq or :first. It is simpler and more expressive than CSS. Its popularity is mainly based on its involvement in the jQuery library.

Being so close to the required structure and workflow of SOWL, InfoCram 6000 served as the base implementation for it in the early stages. As can be seen at the end of this chapter, the first implementation, named SelectOWL, carries a similar user interface and makes use of several modules of the InfoCram implementation.

1) http://www.extbrain.net
2) https://addons.mozilla.org/en-US/firefox/addon/aardvark

Figure 3.4: Main window of InfoCram 6000.

3.4.2 Selenium

Selenium is a collection of tools for automated testing of web pages. These tools include:
- Selenium IDE, a Firefox plugin for creating test scenarios,
- WebDriver, a set of libraries for various languages, capable of running tests generated from Selenium scenarios.

A user of Selenium, typically a web designer, programmer or coder, would create a scenario using Selenium IDE in order to test his web server.
A crawler called crOWLer serves the need of extracting data from the web. It follows the workflow of scraping data using a manually created scenario with a given structure and a user-defined set of ontological resources. In the previous implementation, both the scenario followed by the crawler and the ontology structure (schema) are hard-coded into the crOWLer code. This requires an unnecessary load of work for each particular use case, whilst in practice all the use cases share the same workflow:
1. load the ontology,
2. add selectors to specific resources from the ontology,
3. implement the rules to follow another page,
4. run the crawling process according to the above.

1.4 Proposed Solution and Methodology

In the original crOWLer implementation it was necessary to fulfill the first three steps with actual programming. In order to perform this task we needed a programmer with knowledge of the Java programming language and of several technologies used on the web. Moreover, knowledge of the domain of the data being scraped is needed in order to correctly choose appropriate resources for annotation. There is also a huge overhead in the preparation of the development environment and in the learning time of the crOWLer implementation. The need for a more elegant and generic solution is evident.

To simplify the creation of guidelines (scenarios) for crOWLer, we will propose a tool that allows the user to select elements directly on the crawled web page, pass the created scenario to crOWLer together with all the necessary settings, and obtain the results in the form of an RDF graph.
HTML parsing is handled by the Java Jsoup library. The overall architecture then looks as follows.

Figure 4.3: A new overall architecture of the crOWLer implementation.

Chapter 5: Program Implementation and Specifications

This chapter describes the implemented prototype of the SOWL/crOWLer tool stack. The relation between the tools can be seen in diagram 5.1.

Figure 5.1: Overview of the whole stack and the files exchanged.

5.1 SOWL implementation

During the testing of various technologies and frameworks, several prototypes of the scenario creator were built. The first one, called SelectOWL, was a native Firefox addon built on XUL and calls to the low-level Firefox API. Development of SelectOWL was discontinued in favour of a new addon with the shortened name SOWL. The new addon is based on the Firefox Addon SDK. The structure of the addon is completely different from the original one, and the JavaScript of the addon runs in a different context too. The new SDK is the recommended approach now and offers more flexible functionality and a more intuitive code structure, as the user interface is defined using classical HTML instead of XUL. The original version is kept in the repository for reference.

5.1.1 Parsing Ontologies in JavaScript

Both jOWL and rdfQuery were tested on common ontologies (FOAF, Dublin Core, Good Relations). The results showed that the newer rdfQuery library more accurately implements the standard behavior for handling RDF resources.
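A sketch of how an ontology might be parsed with rdfQuery follows. The API names are taken from the rdfquery project documentation, but the exact calls may differ between versions, so treat this as an assumption rather than the prototype's actual code:

  // Load an RDF/XML ontology into an rdfQuery databank and query it.
  var databank = $.rdf.databank();
  $.get('foaf.rdf', function (xml) {
    databank.load(xml);  // parse the triples into the databank
    var classes = $.rdf({ databank: databank })
      .where('?c a <http://www.w3.org/2002/07/owl#Class>');
    console.log(classes.length + ' OWL classes loaded');
  });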
The prefixes used for Selenium commands are:
- do: the action commands, performing user actions,
- get and is: the accessor commands, testing and/or waiting for values on the page, and potentially storing them,
- assert: the assertion commands, performing the actual tests.

When a command is generated, the prefix is stripped and, according to its type, multiple versions of the command can be created. For example, do commands always have an immediate and a patient version; on this principle Selenium.prototype.doClick generates both the click and the clickAndWait command. Accessor commands are even more complex and generate eight commands for every single method: the positive and negative assertions, the store method, waitFor, etc. The implementation of the command method defines how Selenium IDE behaves when replaying the recorded scenario. Technically, it is possible to leave the implementation empty in the IDE and use the command only in WebDriver unit tests.

None of the original command types corresponds to the format of commands needed for handling semantic annotation, like adding a URI to an element, recording the creation of an individual, assigning a literal to its property, etc. A new set of commands was suggested and partially implemented, having the prefix owl. This led to changes in the core sources of Selenium IDE, which by itself is not a good practice, as it technically creates a new branch of the program. CommandBuilder had to be extended directly in the Selenium code.
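For illustration, a new command could be registered roughly as follows. This is a minimal sketch assuming the Selenium.prototype extension point described above; the owl prefix matches the suggested command set, but the owlCreateIndividual name and the recordIndividual helper are illustrative, not actual SOWL code:

  // Sketch of a Selenium IDE plugin command.
  Selenium.prototype.doOwlCreateIndividual = function (locator, typeUri) {
    // locator: identifies the HTML element the individual is created from
    // typeUri: URI of the ontological class (rdf:type) of the individual
    var element = this.browserbot.findElement(locator);
    recordIndividual(element, typeUri); // hypothetical scenario-model hook
  };
  // The internal command loader would derive the "owlCreateIndividual"
  // command (and, for a "do" command, its "...AndWait" variant) from
  // this single definition.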
rdf:type - declares a resource to be an instance of a class; a commonly accepted qname for this property is "a"
rdfs:Resource - the class of everything; all things described by RDF are resources
rdfs:Class - declares a resource as a class for other resources
rdfs:Literal - literal values such as strings and integers; property values such as textual strings are examples of RDF literals; literals may be plain or typed
rdfs:Datatype - the class of datatypes; rdfs:Datatype is both an instance of and a subclass of rdfs:Class; each instance of rdfs:Datatype is a subclass of rdfs:Literal
rdf:XMLLiteral - the class of XML literal values; rdf:XMLLiteral is an instance of rdfs:Datatype and thus a subclass of rdfs:Literal
rdf:Property - the class of properties
rdfs:domain - of an rdf:predicate; declares the class of the subject in a triple whose second component is the predicate
rdfs:range - of an rdf:predicate; declares the class or datatype of the object in a triple whose second component is the predicate
rdfs:subClassOf - allows declaring hierarchies of classes
rdfs:subPropertyOf - an instance of rdf:Property that is used to state that all resources related by one property are also related by another
rdfs:label - an instance of rdf:Property used to provide a human-readable version of a resource's name
rdfs:comment - an instance of rdf:Property used to provide a human-readable description of a resource

Table C.1: RDF and RDFS vocabulary.

Appendix D: Example of RDF/XML syntax

The original listing (the FOAF ontology header) is reproduced here abridged:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
           xmlns:owl="http://www.w3.org/2002/07/owl#"
           xmlns:foaf="http://xmlns.com/foaf/0.1/">
    <owl:Ontology rdf:about="http://xmlns.com/foaf/0.1/"/>
    <rdfs:Class rdf:about="http://xmlns.com/foaf/0.1/Person">
      <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
      <owl:equivalentClass rdf:resource="http://schema.org/Person"/>
      <owl:equivalentClass rdf:resource="http://www.w3.org/2000/10/swap/pim/contact#Person"/>
    </rdfs:Class>
  </rdf:RDF>
The Strigil XML format might still serve well, at least for developers, to keep the script compact and easily readable. The addition of a language tag, as seen in the previous chapter, is a widely used pattern that pollutes the resulting script with unnecessary overhead. A suggested improvement would separate this functionality into an extra attribute of the value-of tag, named lang. The same suggestion applies to the data type specification. Moreover, implicit parsing of known data types would not only simplify the scraping script but also help to clean up the resulting data.

Let us imagine a hypothetical scenario of two similar tables on one page, containing two sets of data in the same format. For such a case we would need to define a template on a subset of the DOM and call it twice with a different root node. The creation of dom-template and call-dom-template tags would solve this issue and would allow the scenario creator to narrow his focus down to a subpart of the scraped webpage; see the sketch below. This would be particularly useful on complicated pages with a lot of nested HTML. dom-template and call-dom-template would be defined within a single template tag, and unlike call-template they would keep the ontological context: a call of value-of within a dom-template would assign a property to the individual created by the onto-elem wrapping the current call-dom-template call.

The architecture of the Strigil distributed downloader suggests that it uses simple raw HTML pages, as they were downloaded, and uses JSOUP to extract data from them.
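A sketch of how the suggested extensions might look in a scenario step; the lang attribute and the dom-template/call-dom-template pair are the proposed additions, not part of the original Strigil XML, and all keys, selectors and URIs here are illustrative:

  // Hypothetical fragment: a dom-template reused on two tables of the
  // same page, with "lang" replacing the addLanguageInfo wrapper.
  { "command": "template", "name": "init", "steps": [
      { "command": "call-dom-template", "name": "people", "selector": "table#first" },
      { "command": "call-dom-template", "name": "people", "selector": "table#second" },
      { "command": "dom-template", "name": "people", "steps": [
          { "command": "value-of", "selector": "td.name",
            "property": "http://xmlns.com/foaf/0.1/name",
            "lang": "en" }
      ]}
  ]}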
In some use cases the ontology of the desired data is yet to be created, and the user is aware of the data structure and capable of manually spotting and selecting the data on a web page. Currently there are not many tools allowing this kind of operation. The ideal implementation, and the vision of the result of this thesis, will allow the user to partially identify the structure of a webpage, while leaving the repetitive, tedious work to a crawler following the same procedure repeatedly on all data of the page.

For such a process we need to create tools that allow users to address previously unstructured content, link it to resources of an existing ontology, and/or create these resources on the go. By using existing ontologies we would not only give meaning to our data, but also create a valuable connection to any other dataset annotated using the same ontology.

1) An anonymous search alternative to Google: http://duckduckgo.com

1.2 Use Cases

In the general case, our goal is to obtain data from a webpage in a semantic form. We have a webpage, and optionally an ontology, as the input, and an annotated set of data as the output. For a start we will focus on data having a structure defined in HTML. The data might be structured as a table, a set of paragraphs, or any other set of HTML tags, and we will handle it on the level of these tags. Some text handling might be performed using regular expressions.
The full syntax description can be found in the appendix.

4.2.4 Consequences of conversion to JSON format

Owing to the difference in syntax between XML and JSON, JSON objects do not have text content the way XML elements do. In JSON we simply reserve a property for a value that would otherwise be specified this way in the corresponding XML. Strigil, however, does not explicitly use the textual values, and everything is specified using attributes. Some elements return textual values to their parents to handle, and in these cases it might be suitable to enable textual values as constants in place of the otherwise required element.

Another syntactical distinction is that JSON does not explicitly define child nodes. Everything is a property in a JSON object, so we again assign a property to store the child nodes. Child nodes are held in an ordered list, which in JavaScript corresponds to an array. As we build a structure of scenario steps, the reserved property will simply be called steps, for every element that allows child nodes, e.g. onto-elem or template.

Technically, each JSON object quacks like a hash map with string keys and values of any JavaScript type. We can benefit from this loose structure: we can use any key to store a substep, not only the steps array. The onto-elem command benefits exactly from this difference between XML and JSON. In the original Strigil XML, the onto-elem tag allows us to specify the URI of the resulting individual (commonly denoted by the about property) by taking it from a nested command.
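A minimal sketch of a step following these rules (the selectors and URIs are illustrative):

  // "steps" holds the child nodes; "about" stores a whole substep
  // (a value-of command) instead of a plain string value.
  { "command": "onto-elem",
    "typeof": "http://xmlns.com/foaf/0.1/Person",
    "selector": "tr.person",
    "about": { "command": "value-of", "selector": "td.id" },
    "steps": [
      { "command": "value-of", "selector": "td:eq(0)",
        "property": "http://xmlns.com/foaf/0.1/firstName" }
    ]
  }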
  WebElement el = driver.findElement(By.cssSelector("a.detail"));
  String result = (String) exec.executeScript(
      "var elem = arguments[0];"
    + "var href = elem.getAttribute('href');"
    + "return href ? href : window.location.href;", el);

The previous example is simple, yet if we wanted to cover it with our scenario implementation, we would bring a lot of single-problem-specific syntax into the scenario. We would have to use a special notation for obtaining the current URL and for conditioning on values. The following code demonstrates how this functionality might look if it were covered only by the scenario syntax, without the usage of JavaScript. The getCurrentUrl function is inspired by a Strigil command:

  { "command": "condition",
    "condition": "ne",
    "param": { "command": "value-of", "selector": "a.detail" },
    "onfalse": { "command": "function", "value": "getCurrentUrl" } }

1) http://goo.gl/Hhwq31 Selenium JavascriptExecutor documentation

We have declared the condition command with an implementation of ne, the not-equal operator (for completeness we would implement all the other (in)equality operators), and the function command with an implementation of getCurrentUrl, which again is probably not the last function to be implemented. All this would require an update of the scenario parser, of the implementation of the commands and all their attributes, and thus an update of the whole backend every single time new functionality is needed.
[11] Isele, R., Umbrich, J., Bizer, C., Harth, A.: LDSpider: An open-source crawling framework for the Web of Linked Data. In: Proceedings of the 9th International Semantic Web Conference (ISWC 2010), Posters and Demos, 2010. http://iswc2010.semanticweb.org/pdf/495.pdf
[12] Semantic Web. W3C. http://www.w3.org/standards/semanticweb/
[13] Linking Open Data diagram. http://lod-cloud.net/
[14] Resource Description Framework. Wikipedia. https://en.wikipedia.org/wiki/Resource_Description_Framework
[15] SPARQL Protocol and RDF Query Language. Wikipedia. https://en.wikipedia.org/wiki/SPARQL
[16] RDF/XML. Wikipedia. https://en.wikipedia.org/wiki/RDF/XML
[17] Turtle: Terse RDF Triple Language. W3C.
[18] DBpedia. the Datahub. http://datahub.io/dataset/dbpedia
[19] Apache Jena. http://jena.apache.org/
[20] JSOUP Java HTML parser. http://jsoup.org/
[21] DataTables: Table plug-in for jQuery. http://www.datatables.net/
[22] Nečaský, M., Stárka, J., Holubová, I.: Strigil: A Framework for Data Extraction in Semi-Structured Web Documents. Paper submitted to the 15th International Conference on Information Integration and Web-based Applications & Services, Vienna, Austria, 2013.
[23] XPath: XML Path Language. http://www.w3.org/TR/xpath/
[24] jQuery. http://jquery.com/
[25] Sizzle JavaScript selector library. http://sizzlejs.com/
the presenting of malformed data, or obfuscation. In the big picture, the misinformation of people seems to be the major threat to democracy as we usually envision it. By supporting the creation of semantic data we are naturally taking part in this movement. The hope is to bring government data closer to the people, to help overcome the information gap that prevents each of us from being adequately informed about how our resources are being spent and how our countries are truly led and our offices driven. I hope this and any follow-up work will serve to support this common vision.

References

[1] Web Ontology Language. Wikipedia. https://en.wikipedia.org/wiki/Web_Ontology_Language
[2] Google Knowledge Graph. Wikipedia. https://en.wikipedia.org/wiki/Google_Knowledge_Graph
[3] Search Engine Optimization. Wikipedia. https://en.wikipedia.org/wiki/Search_engine_optimization
[4] Semantic Web. Wikipedia. https://en.wikipedia.org/wiki/Semantic_Web
[5] Linked Data: Connect Distributed Data Across the Web. http://linkeddata.org/
[6] HTML5. Wikipedia. https://en.wikipedia.org/wiki/HTML5
[7] Microformats. http://microformats.org/
[8] HTML+RDFa 1.1: Support for RDFa in HTML4 and HTML5. http://dev.w3.org/html5/rdfa/
[9] Google Structured Data Testing Tool. http://www.google.com/webmasters/tools/richsnippets
[10] RDFa Play: the RDFa data visualisation tool. http://rdfa.info/play/
this places higher requirements on the resolver being used for reasoning on our ontology, and brings additional computational complexity.

2.5 RDFa

The RDFa technology defines a concept of embedding the content of a web document, defined in HTML, with resources from some ontology. Technically, we create an invisible layer of annotations over the data that turns our content into a machine-readable record. This is accomplished by extending the original HTML with custom attributes. Tools can then be used to visualise this data.

2.6 SPARQL

SPARQL is a semantic query language for data stored in the RDF format [15]. Using the SPARQL syntax we define a pattern over the RDF graph using triples, and as a result we obtain the nodes that form a subgraph of the original graph matching the given pattern. So-called SPARQL endpoints are the main entry points through which users can obtain data from openly available datasets. Below is a simple example of a SPARQL query that returns all resources of type foaf:Person in a database, optionally together with their names:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?target ?name
  WHERE {
    ?target rdf:type foaf:Person .
    OPTIONAL { ?target foaf:name ?name . }
  }

1) https://en.wikipedia.org/wiki/Uniform_resource_identifier
2) The major part of the vocabulary is described in appendix C.
3) http://linkedgeodata.org/sparql: the LinkedGeoData SPARQL endpoint
returns a URL; the targeted template will be called on each URL.
- value: same as the previous, only it contains a single command
- selector: the URL will be taken from the text of the elements matched by this selector
- attribute: the URL will be taken from this attribute of the elements matched by the previous selector
- url: a default URL, used if none of the previous yields a value

5.3.3 onto-elem

Creates an ontological individual.
- about: contains a command returning a URI identifying the newly created individual
- typeof: contains the rdf:type of the individual
- rel: contains a URI of an object property; the individual is assigned to this property of its parent
- selector: the individual is created for each element matching this selector
- steps: a list of subcommands; they will be executed in the context of this individual and the selected HTML element

5.3.4 value-of

Returns a string value, or assigns it to a data property.
- selector: returns the text content of the first element matched by this attribute
- attribute: if specified, the value of this attribute of the selected element is used instead
- text: a constant string, returned if none of the previous targets a non-null value
- property: the resulting value is assigned to this property of the parent individual rather than returned; in combination with selector, the values of all targeted elements are assigned
- lang: a language tag appended to the string before assignment
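Putting these commands together, a small scenario for the basic use case (1.2.1) might look as follows. This is an illustrative sketch only; the selectors are not taken from the actual prototype output:

  { "name": "init",
    "url": "http://www.inventati.org/kub1x/t/",
    "steps": [
      { "command": "onto-elem",
        "typeof": "http://xmlns.com/foaf/0.1/Person",
        "selector": "table tr",
        "steps": [
          { "command": "value-of", "selector": "td:eq(0)",
            "property": "http://xmlns.com/foaf/0.1/lastName" },
          { "command": "value-of", "selector": "td:eq(1)",
            "property": "http://xmlns.com/foaf/0.1/phone" }
        ] }
    ] }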
Figure 1.9: View of a detail page on the National Transportation Safety Board webpage.

1.3 Current solution: crOWLer

The suggested base technology is being developed at our faculty.
The difficulty of use case 1.2.4, as described, lies in the AJAX-driven pagination: every page-change event dynamically updates the content of the webpage. In this specific case we do not need to be alarmed, as the pagination component is created using the jQuery DataTables plugin [21]. With this plugin, the pagination is built on top of the data table after it has been completely loaded. In the case of crOWLer the plugin is never executed, and the table remains complete and unchanged over the whole scraping process.

This is not always the case, though. Even the DataTables plugin itself supports loading data through AJAX, so alertness is more than appropriate. In the hypothetical situation where AJAX is used for data loading, crOWLer would not be able to handle the pagination and would only access the first page. The additional data would have to be loaded using a workaround similar to the one in UC2. And even if we successfully loaded the data, we still might be unable to handle them in crOWLer. The AJAX call typically serves only the new chunk of data to be inserted into the page, either in HTML or in JSON format. For HTML, we would have to extend the configuration to correctly target elements in the reduced form of the AJAX update. For JSON, a completely new selector system would have to be added to crOWLer.

The situation dramatically changes if we use a full-stack web environment with a JavaScript engine. In that case we would be able to ignore the background functionality of the page and interact with it the same way a user does.
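A sketch of that approach using the JavaScript bindings of WebDriver; the prototype itself uses the Java bindings, and the DataTables selector and the extraction step here are illustrative:

  // Drive AJAX pagination through a real browser: click the DataTables
  // "next" button until it becomes disabled, scraping each page in turn.
  const { Builder, By } = require('selenium-webdriver');

  async function scrapeAllPages(url) {
    const driver = await new Builder().forBrowser('firefox').build();
    try {
      await driver.get(url);
      while (true) {
        // ... extract rows from the currently rendered table here ...
        const next = await driver.findElements(
          By.css('a.paginate_button.next:not(.disabled)'));
        if (next.length === 0) break;  // last page reached
        await next[0].click();         // DataTables re-renders in place
      }
    } finally {
      await driver.quit();
    }
  }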
From the use cases we derive a common pattern, based on which we define the desired workflow of the data extraction. Then we briefly describe the underlying technologies used for handling semantic data. We investigate existing tools and platforms for automated data extraction based on these technologies, focusing on the tools which conform to the defined workflow. We then choose the most promising tools and analyse in depth the specific techniques used in their implementations. For each tool we describe in detail the main part of our interest, its benefits and its drawbacks. During this analysis we pay special attention to the form in which the user defines the rules for data extraction and configures the extraction process.

Additionally, we examine semantic and non-semantic libraries and platforms that might serve as a base technology for the implementation of a prototype of the proposed design. Based on the analysed techniques we research options for the best combination and improvement of each of them. Namely, we define the format of a scenario for a semantic data extractor, and design tools for scenario creation and for performing the data extraction. To support the design, we implement and describe prototypes of both tools.
Appendix K: Example of a JSON dump of the rdfquery datastore

The dump below is reproduced abridged; it shows the triples loaded for the foaf:Person class.

  { "http://xmlns.com/foaf/0.1/Person": {
      "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": [
        { "type": "uri", "value": "http://www.w3.org/2000/01/rdf-schema#Class" },
        { "type": "uri", "value": "http://www.w3.org/2002/07/owl#Class" } ],
      "http://www.w3.org/2000/01/rdf-schema#label": [
        { "type": "literal", "value": "Person" } ],
      "http://www.w3.org/2000/01/rdf-schema#comment": [
        { "type": "literal", "value": "A person." } ],
      "http://www.w3.org/2003/06/sw-vocab-status/ns#term_status": [
        { "type": "literal", "value": "stable" } ],
      "http://www.w3.org/2002/07/owl#equivalentClass": [
        { "type": "uri", "value": "http://schema.org/Person" },
        { "type": "uri", "value": "http://www.w3.org/2000/10/swap/pim/contact#Person" } ],
      "http://www.w3.org/2000/01/rdf-schema#subClassOf": [
        { "type": "uri", "value": "http://xmlns.com/foaf/0.1/Agent" },
        { "type": "uri", "value": "http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing" } ],
      "http://www.w3.org/2000/01/rdf-schema#isDefinedBy": [
        { "type": "uri", "value": "http://xmlns.com/foaf/0.1/" } ],
      "http://www.w3.org/2002/07/owl#disjointWith": [
        { "type": "uri", "value": "http://xmlns.com/foaf/0.1/Organization" },
        { "type": "uri", "value": "http://xmlns.com/foaf/0.1/Project" } ] } }

Appendix L: User manual for SOWL and crOWLer

L.1 SOWL

SOWL is installed from an XPI file.
Figure 1.7: View of a detail page on the Air Accidents Investigation Institute webpage.

1.2.4 Use Case 4: National Transportation Safety Board

http://www.ntsb.gov/investigations/AccidentReports/Pages/aviation.aspx

This use case serves mainly to demonstrate the usage of the same ontology (vocabulary) on two different data sources. Additionally, we might fill in default values in place of the missing ones in this table.
Although JavaScript is sandboxed in WebDriver, it is still running in a browser on your computer and could technically submit some data to the web. Security issues have not been considered so far, but they might become a point of interest once we take into account the option of obtaining and executing scenarios from unknown sources.

4.4 User Interface

Here the required structure of the user interface is described.

4.4.1 SOWL user interface

The user interface of SOWL shall be presented in the form of a sidebar. The sidebar shall have two parts: a scenario editor and a resources list. The scenario editor shall contain a tree-shaped structure of the steps of the scenario being created, along with a panel for editing the general settings of the scenario. The resources list shall accept dropping of ontology files, which loads them into the current dataset. Manual addition of resources shall be possible using a button. The list shall show all currently loaded resources and allow textual filtering. SOWL shall enable tag selection on the webpage being processed, by clicking or another user action.

4.4.2 crOWLer user interface

CrOWLer is a console application. It shall accept a scenario as one of its parameters. The following settings shall be available through parameters as well:
- the target directory for RDF files,
- the Sesame repository for the result storage.

4.5 Model

This section presents the proposed design of the two programs: the SOWL Firefox addon and the crOWLer Java application.
Master's Thesis

Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering

Platform for semantic extraction of the web

Jakub Podlaha
Artificial Intelligence
podlajak@fel.cvut.cz
January 2015
Supervisor: Ing. Petr Křemen, Ph.D.

Acknowledgement

I'd like to thank my parents and family for enormous support, my supervisor for endless patience and guidance, and my friends for not letting me go insane.

Declaration

I declare that I have written the submitted thesis independently and that I have listed all information sources used, in accordance with the Methodological Guideline on observing ethical principles in the preparation of university final theses. In Prague, 5 January 2015.

Abstract

This master's thesis examines the topic of semantic data extraction. The main goal of this work is to design a tool simplifying the process of annotating and collecting data from web pages. First, to specify the problem being solved and to motivate it, we define several real-life use cases concerning semantic data extraction. For each of these cases we describe where its difficulty lies. From all the cases we then derive a common pattern and determine the required extraction procedure. Subsequently, we briefly describe the basic technologies used when working with semantic data. We examine existing tools and platforms for automated data extraction.
the URL pamfond/list.php?IdReg=…. Technically, this is a form of a workaround rather than a systematic solution of the given problem. We cannot securely rely on JavaScript code within an attribute as part of the data. It is important to realize that the technique used on the webpage is rather non-standard and cannot be effectively covered by a general-purpose tool without a problem-specific solution.

With an understanding of the configuration implementation, we will now briefly analyze the rest of the use cases. crOWLer would solve UC1 (1.2.1) with a quite basic configuration. Here we present a short example:

  ClassSpec chObject = Factory.createClassSpec("foaf:Person");
  conf.addInitialDefinition(Factory.createInitialDefinition(
      chObject, Factory.createJSoupSelector("tr")));

  // First name
  chObject.addSpec(Factory.createDPSpec(
      Factory.createJSoupSelector("td:eq(0)"), "foaf:firstName"));
  // ... analogically for the rest of the properties ...

  // Link to the detail page
  chObject.addSpec(Factory.createDPSpec(
      Factory.createChainedFirstElementSelector(
          Factory.createNewDocumentSelector(conf.getEncoding(),
              Factory.createAttributePatternMatchingURLCreator(/* arguments
                  garbled in the original listing */)),
          Factory.createJSoupSelector(".nick")),
      "foaf:nickname"));

This example uses only classes from the original crOWLer. Note, at the bottom, how we define following a link to the detail page. In a proper implementation we would probably simplify the creation of the new-document selector.
JSOUP is the selector system of choice in Strigil. Many webpages, and even web applications, make use of dynamic AJAX calls to fetch additional data after the presentation layer of the web has been shown to the user. Strigil does not handle these cases by default. The internal AJAX code could be analyzed and simulated using a call-template call, but this requires deep knowledge of the webpage being processed.

In crOWLer we opted to switch from JSOUP to the WebDriver library and to use PhantomJS, a no-GUI web browser. This technology allows us to handle webpages the same way the user sees them. Usage of an actual full-stack web browser with a JavaScript engine, along with WebDriver, allows us to inject and execute arbitrary JavaScript code in the processed webpage. In order to make full use of this feature, we can define a function-def tag, which would define a JavaScript function with a name and parameters and contain its code. To execute this function we would call a function-call tag and identify the function by its name. The return value of this function can then be used the same way as the one from a value-of tag; a sketch follows below.

From the experience with development on Strigil XML we can conclude that it is tied to its intended use in a distributed downloader, and that it lacks some functionality. In SOWL we would almost necessarily modify its formal definition, and thus it is worth considering whether we cannot make use of a more appropriate format.

4.2.3 SOWL JSON

As all Firefox extensions, SOWL is written entirely in JavaScript, on top of the Addon SDK.
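For illustration, the function-def/function-call pair proposed above might be expressed in this JSON form as follows (the names, keys and function body are illustrative; these commands are suggested extensions, not existing Strigil tags):

  // Define a named JavaScript function once, then call it from a step;
  // its return value is used like the result of a value-of command.
  { "command": "function-def",
    "name": "absoluteHref",
    "params": ["elem"],
    "body": "var href = elem.getAttribute('href'); return href ? href : window.location.href;" }

  { "command": "function-call",
    "name": "absoluteHref",
    "selector": "a.detail" }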
data obtained from the web page being processed. Strigil also implements a variety of functions to help with the processing of textual data. The function addLanguageInfo, for example, is widely used in Strigil scraping scripts to add language tags to string literals. Its call can be seen below:

  <scr:function name="addLanguageInfo">
    <scr:with-param>
      <scr:value-of select="'Hello World'" />
    </scr:with-param>
    <scr:with-param>
      <scr:value-of text="en" />
    </scr:with-param>
  </scr:function>

Similarly, we can use the function addDataTypeInfo to add a datatype flag, the function generateUUID to obtain a unique identifier, or the function convertDate to convert Czech and English dates into the common xsd:date format, among several others. Some functions, like the last one mentioned, cover task-specific issues, and Strigil does not define a way to extend the list of functions.

In the early stages of SOWL development, an attempt was made to use the original Strigil XML as the format of choice. An appropriate, consistent subset was chosen that would cover the required use cases. Implementation of the simple use cases revealed some pitfalls of this decision and yielded several suggestions for improvements of the approach and of the format itself.

4.2.2 Adaptation of the Strigil XML format

Strigil creates its scraping script internally, hidden under the GUI, and leaves the user unaware of its actual content.
a root for a tree of ontological individuals linked by their properties. The tree is defined in the configuration using the ClassSpec and PropertySpec classes, which hold the definition of the type of an individual and of the assigned property, respectively. The spec classes also carry information about the selector used to find the corresponding HTML element. A collection of Selector classes is available and can be extended; JSOUP selector handling is implemented, as well as selector chaining and resolving data from a link target.

In crOWLer, an individual of an ontological object is created after all of its defined property values have been scraped within the inner loop, as the URI of the individual can be formed using one or more of these values. This way we can refer to the same object if we create an individual with the same URI on two different pages, for example.

3.2.1 Issues of crOWLer configuration

From a deeper analysis of the original crOWLer source we can observe that the whole scraping process relies on the configuration defining it: a set of Java classes implementing the predefined interfaces and using the provided API. This reveals the issue being addressed: writing a crOWLer configuration requires knowledge of the Java programming language along with knowledge of RDF technologies. The programmer also gets into the position of an ontological engineer when designing the ontological resources used in the configuration. Knowledge of web technologies is required as well.
    <foaf:phone rdf:datatype="xsd:string">603123123</foaf:phone>
    <foaf:nickname rdf:datatype="xsd:string">Jackie</foaf:nickname>
  </rdf:Description>

  <rdf:Description rdf:about="foaf:firstName">
    <rdf:type rdf:resource="owl:DatatypeProperty"/>
  </rdf:Description>
  <rdf:Description rdf:about="foaf:Person">
    <rdf:type rdf:resource="owl:Class"/>
  </rdf:Description>
  <rdf:Description rdf:about="foaf:nickname">
    <rdf:type rdf:resource="owl:DatatypeProperty"/>
  </rdf:Description>

  <rdf:Description rdf:about="kbx:scenario-201412060213045124-indiv-201412060213058113">
    <rdf:type rdf:resource="foaf:Person"/>
    <foaf:firstName rdf:datatype="xsd:string">Foo</foaf:firstName>
    <foaf:lastName rdf:datatype="xsd:string">Bar</foaf:lastName>
    <foaf:phone rdf:datatype="xsd:string">0x1AF49C70</foaf:phone>
  </rdf:Description>
  <rdf:Description rdf:about="kbx:scenario-201412060213045124-indiv-201412060213057696">
    <rdf:type rdf:resource="foaf:Person"/>
    <foaf:firstName rdf:datatype="xsd:string">John</foaf:firstName>
    <foaf:lastName rdf:datatype="xsd:string">Doe</foaf:lastName>
    <foaf:phone rdf:datatype="xsd:string">0x1AF49B01</foaf:phone>
  </rdf:Description>
The overall architecture of Strigil consists of a Data Application, in the form of a webserver, and a backend service providing the Download System for the application. The webserver offers the frontend for configuring the crawling process. The application then follows the configuration, scraping the data and storing the results, while using the backend to handle the downloading.

Strigil strongly focuses on the download process. The components of the backend form a structure of a DownloadManager, Downloaders and Proxy servers that help to distribute the load of the data being transferred. The frontend part serves a user interface for handling ontological data on top of the web being scraped. Internally it creates a scraping script (referred to as the Strigil Scraping Script, or Strigil XML) which strongly inspired the scenario format used in the actual implementation later in this work; it will be closely analyzed in chapter 4.

3.3.1 What problem does it solve

The architecture of Strigil (more in appendix H) is tailor-made for parallel processing of documents. The installation of Strigil requires a working Apache2 web server with PHP5, Tomcat, a PostgreSQL database, the OpenMQ service and several other components, before the actual deployment of Strigil into the environment. The system is designed for processing many requests on the targeted server, heavy loads of data and long-running tasks. Its complicated architecture and installation process prevents it from being effectively used in occasional, small-scale scraping tasks.
data on the National Heritage Institute webpage. For an <a> tag here, the onclick attribute does not contain a URL, but rather the JavaScript code that handles the click event. After closer investigation (figure 1.4) we can observe that in this case the function advances to the detail page of the clicked record by modifying the value of a hidden input tag and by submitting a form parametrized by that value. The relevant part of the source looks as follows (abridged):

  <tr class="list"
      onclick="document.listpf.IdReg.value = 131164; document.listpf.submit();"
      onmous...>
    <td class="list" align="left">...</td>
    <td class="list">...</td>
  </tr>

Figure 1.4: A preview of the HTML source analysis on the National Heritage Institute webpage.

If possible, we would simply simulate the user's click action to advance to the detail page, and the back action, usually performed by the Back button of the browser (or the Alt+Left keyboard shortcut), to get back and follow the next line. This approach will be analyzed further in this work. If the stated approach cannot be implemented so as to give the expected results, the original approach will be simulated by the new scenario-driven structure.
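In an environment with a JavaScript engine, replaying that click from a scenario could be as simple as reusing the page's own hidden form, as seen in the onclick handler above (a sketch; the record id is taken from the listing):

  // Replay the NHI record click: set the hidden IdReg input and submit
  // the page's own "listpf" form, which loads the detail page.
  function openDetail(idReg) {
    document.listpf.IdReg.value = idReg;
    document.listpf.submit();
  }
  openDetail(131164);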
From such a scenario, a unit test can be generated for the desired programming language and in the desired form, e.g. a JUnit test case. Such a test can simply be included in the set of tests for the web server project. The WebDriver library needed for running these tests is available through Maven. There is also the option to use PhantomJS, a no-GUI web browser, for running tests without the need for an actual browser, for cases when the tests are executed automatically in the background, or in a server environment without an X server or another form of graphical interface. The capabilities of WebDriver make it one of the most popular testing platforms for web servers nowadays.

Selenium IDE is a Firefox plugin that allows us to directly record user actions on a webpage, such as following links, storing and comparing values, and filling in and submitting forms. An attempt was made to implement SOWL as a plugin for Selenium IDE. This plugin would have two parts:
1. an extension of the graphical interface,
2. a formatter that would generate scenarios for crOWLer in some desired form.

Certain limitations were discovered during the development of this plugin. Selenium IDE, being a plugin itself, implements its own plugin system, through which it allows other developers to extend its functionality. The Selenium IDE plugin API allows us to use standard Firefox techniques along with a predefined API to extend the graphical interface and the functionality of the IDE, respectively.
1.5 Specific goals of the thesis

- define use cases for semantic data creation,
- create a syntax for the scenario used by crOWLer,
- implement a web browser extension for creating these scenarios; this extension shall:
  - load and visualise an ontology,
  - join the page structure and ontology resources in the form of a scenario,
  - serialize the scenario and the necessary ontological data,
- parse the scenario by crOWLer,
- run crOWLer following the scenario,
- store the extracted data.

1.6 Work structure

The next part of this work (chapter 2) covers the tools, technologies and related lingo used in this work and in the field. Chapter 3 describes the research on existing solutions and how they influenced the results of this work. Chapter 4 is the main part and describes the proposed design. Chapter 5 gives details about the prototype implemented according to the proposed design. Both the design and the implementation are then confronted with the real-life use cases (1.2).

Chapter 2: Principles and technologies

In the following chapter we will provide basic information about the technologies of the Semantic Web and Knowledge Representation. The terminology often used in the field will be defined, to ensure full understanding before we proceed to the design and implementation.

2.1 Technology of the Semantic Web
[26] Vanilla JS. http://vanilla-js.com/
[27] jOWL: Ontology Online. http://jowl.ontologyonline.org/
[28] Křemen, P.: Towards SPARQL-DL Evaluation in Pellet. 2007. http://weblog.clarkparsia.com/2007/10/26/towards-sparql-dl-evaluation-in-pellet/
[29] rdfQuery: RDF processing in your browser. https://code.google.com/p/rdfquery/
[30] Scraping script documentation. https://drive.google.com/file/d/0B40n-1Gb38CgW1AyZDhGbDV2TFk/edit
[31] js.js: JavaScript in JavaScript, Sandboxing Third-Party Scripts. http://goo.gl/RJESQE
[32] Open Government Data. http://opengovernmentdata.org/

Appendix A: Assignment

Czech Technical University in Prague
Faculty of Electrical Engineering
Department of Computer Science and Engineering

DIPLOMA THESIS ASSIGNMENT

Student: Bc. Jakub Podlaha
Study programme: Open Informatics
Specialisation: Artificial Intelligence
Title of Diploma Thesis: Platform for Semantic Crawling of the Web

Guidelines:
1. Become familiar with semantic web technologies, namely with the Linked Data concept and the RDF(S), OWL 2 and SPARQL languages.
2. Design a platform for creating semantic extraction scenarios for the web. The platform will provide a UI for searching a suitable ontological schema, construction of the extraction scenarios, their execution using existing tools, and visualization.
3. Test the platform.
When an element is replaced, or even removed, it becomes invalid in the Java context. A modification to an element can cause unexpected behavior of its Java reference too. The same applies to operations on the whole page: when a link is followed, the original DOM tree is dropped and all references are lost. To illustrate the underlying behavior, below is a simple test. When the link is clicked, WebDriver follows the link in the current window and the reference to the original DOM is lost:

  WebDriver wd = new FirefoxDriver();
  wd.navigate().to("http://www.inventati.org/kub1x/t/");
  WebElement a = wd.findElement(By.cssSelector("a.detail"));
  System.out.println(a.getText());   // Prints "detail"
  a.click();
  System.out.println(a.getText());
  // throws org.openqa.selenium.StaleElementReferenceException:
  //   Element does not exist in cache

This can be partially solved by sandboxing the code in a closure. By doing so we can hide some essential objects of the global scope, like window or document, and make it harder to perform inappropriate operations on the DOM. In the following example we create the described partial sandbox:

  JavascriptExecutor exec = (JavascriptExecutor) driver;
  WebElement elem = driver.findElement(By.cssSelector("div.wewant"));
  exec.executeScript(
      "return (function(elem, window, document) {" + funStr + "})(arguments[0]);",
      elem);

This technique is not completely secure; for example, the element passed as an argument does not lose its access to the document (e.g. through its ownerDocument property).
It is impossible to change the behavior of CommandBuilder through the native Selenium IDE API. Unfortunately, even though the new command type was implemented, it is not possible to change the more general concept of all commands. Every command is stored as a (name, target, value) triple, and everything is derived from this format. It is technically impossible to create a command, for example, for the creation of an ontological literal along with the assignment of its language tag, as there is simply no field for it. For the same reason we cannot create a command that creates an ontological object of some type as a property of another object. These commands relate to each other, but such behavior is not supported by the scenario editor in its current architecture.

There is also no way to alter the editor GUI for a specific command. For instance, we cannot offer autocompletion for an input field when the user enters a URI of an ontological resource. Such a feature would be an essential part of SOWL's workflow; as a consequence, these limitations are critical and prevent us from properly implementing SOWL on top of Selenium IDE.

3.5 Libraries for SOWL

Research on existing JavaScript libraries that handle RDF data resulted in two promising libraries: jOWL and rdfQuery. Both are based on the jQuery library, and both claim to be capable of parsing RDF files, which is the main requirement for us. Additionally, such a library might be used in SOWL as the storage for the loaded RDF resources.

3.5.1 jQuery
Last but not least, we can use the joined power of HTML and RDFa [8] to annotate data on a webpage with an actual ontology. This technology is part of the Semantic Web stack, and we will describe it more closely in the next chapter.

Annotating data on the server side enables users to use tools to highlight the data they are specifically interested in, to extract them, and to reason on them. Services can use the annotated data, combine them, and offer new results based on the merged knowledge obtained from multiple sources. Providing data in such a form makes the server a part of the Linked Data cloud. For completeness, let us mention some examples of utilities for extracting and testing (or scraping) structured data:
- Google Structured Data Testing Tool, i.e. rich snippets [9],
- RDFa Play, a tool for visualisation and extraction of RDFa content [10],
- LDSpider, a semantic data crawler [11].

Unfortunately, it is not always possible, or desired by the web owner, to embed semantics into the data and support it. The vast majority of the web holds plain-text data without any machine-readable meaning given to it, leaving it to human readers to understand it. To bypass the gap between the unstructured data present on the web on one side, and the rich, linked, meaningful ontologies on the other, we can take the opposite direction to the one described so far: we can take the unannotated data already present on the web and retrieve them in a form defined by some ontology structure.
2.7 RDF/XML syntax

RDF/XML is one of the formats into which we can serialize our RDF data [16]. It is a regular XML document containing elements and attributes from the RDF(S) vocabulary. RDF/XML is one of the most common formats for RDF data serialization. An example from the popular FOAF ontology can be found in appendix D.

2.8 Turtle syntax

Turtle is another popular syntax for expressing RDF. It allows an RDF graph to be completely written in a compact and natural text form, with abbreviations for common usage patterns and datatypes [17]. Its syntax suits RDF data more naturally, as it conforms to the triple pattern. An example describing the author of this work follows:

  @base <http://kub1x.org/> .
  @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .

  <me> a foaf:Person ;
      foaf:name "Jakub Podlaha" .

Chapter 3: Existing solutions

In this chapter we describe the research on existing solutions for the given task: scraping and annotating data from the web. The performed search was focused on tools directly targeting the problem, as well as on libraries and technologies that could be included in the solution, and on existing open-source programs we could build the solution on.

3.1 Semantic and non-semantic crawlers

The research of existing solutions shows that there is currently no open-source or otherwise freely available solution covering the whole given task.
The advantage of this approach is that the user does not have to know JavaScript, or understand how it is called in WebDriver, in order to use advanced conditioning and/or value formatting. It is disputable whether a set of extra commands in the scenario syntax, and hence extra controls in the scenario editor, would be more understandable than a single field for a JavaScript function. Technically, by adding conditioning and function commands we are inclining towards building a new programming language. To offer the user the best of both, implementing both is an option: basic conditioning to easily direct the scenario flow, along with a set of functions to format and modify strings and other values, as well as JavaScript execution for complex problems.

With the use of JavaScript, the same scenario step as in the previous example would look as follows:

  { "command": "value-of",
    "selector": "a.detail",
    "exec": "var href = elem.getAttribute('href');
             return href ? href : window.location.href;" }

In this case we extended only the value-of command, with a single attribute that takes a JavaScript function body. From there we have technically unlimited power for extending the functionality of crOWLer without the need to change the Java implementation. Note that, compared to the example in Java, the first line of the original JavaScript was omitted: var elem = arguments[0]; it will be automatically prepended every time we exec JavaScript on a single DOM element.
This is a simple helper and does not invalidate any user input, as in JavaScript we can redefine a variable as many times as we want. Similarly, we will predefine the variable elems when a list of elements is passed, and value when passing a string or a number to a JavaScript function.

But with great power comes a great current squared times resistance.1)

1) http://www.xkcd.com/643/

With the usage of JavaScript as suggested in the previous paragraphs, we have to take into account two major considerations. Firstly, a JavaScript function can accept any number of parameters and return an arbitrary value. In both cases, the parameters and the return value can be of any of the allowed types, as JavaScript is not a strongly typed language. We thus have to specify what exact parameters are being passed to a function and what result, of what type, is expected. We also have to implement a robust way of checking this specification and properly define the fallback-on-error behavior. This is especially important, as we might want to use a JavaScript function not only as a string filter but also, for example, as a universal selector where we struggle with the classical selectors. Any additional use has to be described separately before it can be universally applied.

More importantly, there is the second consideration. Any DOM element is accessible from any JavaScript function, using for example the document.getElementsByTagName method.
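For illustration, the automatic prepending described above might be implemented on the backend roughly like this (a sketch; wrapExecBody and the kind labels are illustrative names, not actual crOWLer code):

  // Prepend the appropriate variable definition to a user-supplied
  // "exec" body before handing it to WebDriver's executeScript.
  function wrapExecBody(body, kind) {
    var prelude = {
      element:  'var elem = arguments[0];',   // single DOM element
      elements: 'var elems = arguments[0];',  // list of elements
      value:    'var value = arguments[0];'   // string or number
    }[kind];
    return prelude + '\n' + body;
  }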
the property attribute specified. In crOWLer we can only specify which data properties will be part of the generated identifier; in Strigil we can create an arbitrary URI using value-of elements and the provided functions.

Strigil does not handle AJAX calls, and a workaround would have to be implemented for UC4. Just like crOWLer, Strigil downloads the raw HTML page and thus does not even encounter the pagination widget present on the page.

3.3.4 What inspiration it brings for crOWLer

- The scraping script specifies the template system. Compared to the loops in crOWLer, it appears more natural and well structured. It also brings extra flexibility by allowing templates to call each other. The XML format is, however, rather verbose; another, less verbose syntax might serve better while preserving most of the semantics.
- The system of provided functions gives a good set of tools for string manipulation. Sometimes we encounter a problem-specific notion, e.g. a function for the conversion of Czech and English date formats rather than a general-purpose date parser.
- If the Strigil Scraping Script gets implemented in crOWLer in some form, the suggested improvements will be incorporated in the implementation as well.

3.4 Finding a platform for the frontend

In order to develop an appropriate tool for generating scenarios, several similar tools were inspected for best practices, libraries and possible extensions. The resulting implementation draws on several of them.
We will simply select an HTML tag and use its content along with some annotation. In the friendliest cases the data we want to scrape are formed in some repetitive form, most often a table. This is the best case, as we can simply define the structure on one row of the table and repeat the same pattern over and over. Sometimes the table spreads over several pages, so we need to define a way of advancing to the next page and starting over.

The following sections contain descriptions of several use cases that shall be solvable using the design proposed in this thesis.

1.2.1 Use Case 1: basic example case

http://www.inventati.org/kub1x/t/

The first use case is the simplest task that will be covered by the implemented prototype. As you can see in the figure, it consists of a table holding values about people, and a link to a detail page for one of them. On the detail page there is a field with a nickname.

Figure 1.1: The example main page and detail page for the basic use case.

In order to fulfill this use case, SOWL shall support the following operations:

- Load the FOAF ontology that contains resources to describe data about people.
- Create a scenario with two templates: init and detail.
- Save this scenario to a file.
real semantics into a webpage. One direction to go is to annotate data on the server side, i.e. at the time it is being created and/or published. When we are in the position of the owner of the data or the server, we can help not only Google or DuckDuckGo to understand our website. To avoid confusion: this part is not focused on SEO, search engine optimization [3], even though the topics overlap in many ways. SEO primarily focuses on increasing the ranking of a webpage in the eyes of a search engine, whereas pure semantization focuses on best describing the meaning of the page's content, no matter how good or bad it appeals to anyone, as long as it is valid according to the standards of the Semantic Web [4] and Linked Data [5].

In order to perform semantization on the server side, the person or engine creating the data has to use the right tool and put in some time and effort giving the data the appropriate annotation. There are standards covering this use case. In the simplest form, HTML5 [6] brings in tags for clearer specification of the page structure, such as nav, article, section, aside and others. Microformats [7] define specialized values for the HTML class attribute to bring standardized patterns for several basic use cases with fixed structure, such as vCard or Event. The microformat approach is easy to implement, as it does not impose any extra syntax and can simply be embedded into an existing page source. As the community around microformats states, "Microformats are
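For illustration, an hCard-style microformat annotation of contact data might look roughly as follows; this is a minimal sketch, the class names follow the hCard convention, and the person and phone number are made up:

  <div class="vcard">
    <span class="fn">Jack Black</span>
    <span class="tel">123 456 789</span>
  </div>

No extra syntax is needed: the existing markup is only enriched with the standardized class values.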
and the original information is lost. This does not apply to a regular transfer to a new page using a URL, because there we can use a completely separate REST call. Technically it is identical to clicking a link versus opening it in a new tab in your browser; only in crOWLer these operations are performed internally, on a lower level.

- adding a default value if no content is found

1.3 Current solution: crOWLer

Figure 1.8: View on the list page of the National Transportation Safety Board webpage.
assigned as a property
- type: a datatype appended to the string before it is assigned as a property
- exec: a JavaScript function applied to the string before it is returned

5.3.5 narrow

This tag only narrows the HTML context to simplify selectors in child steps.

- steps: set of steps to be called on the narrowed context
- select: inner steps will be called on each of these elements
- exec: calls a JavaScript function on a set of elements to filter them

5.3.6 function

Calls a predefined function.

- name: name of the called function, one of the following:
  - conc: concatenates all strings into one
  - join: similar to the previous one; inserts the first string between all the other ones when connecting them
  - parseDate: takes a date format string as the first parameter and the date to be parsed as the second; returns the parsed date as xsd:date, or null
  - uuid: takes no parameters; returns a new UUID
  - currentUrl: takes no parameters; returns the URL of the current document
- params: an array of commands returning values, used as parameters for the function call

Chapter 6 Conclusion

This diploma thesis investigates the current situation in the field of the Semantic Web. It specifically focuses on automated semantic data extraction. At first, available tools were researched. Deeper analysis revealed useful patterns and techniques, as well as weaknesses in some of the examined tools and platforms.
There seems to be no other option, even for repetitive patterns on a webpage such as table rows, but to define the script on each one of them. By allowing the selector attribute we would bring in an intuitive structure, meaning "create an individual for each matching element". For simplification, we presume implementation of these suggestions in the following analysis.

The second use case would be solved in a similar manner as in the hardcoded crOWLer solution, i.e. by extracting a value from the onclick attribute and manually building the target URL:

  <scr:template name="init">
    <scr:onto-elem typeof="npu:MonumnetRecord" selector="...">
      <scr:call-template name="detail">
        <scr:function name="conc">
          <scr:with-param>
            <scr:value-of text="http://monumnet.npu.cz/...?idReg="/>
          </scr:with-param>
          <scr:with-param>
            <scr:value-of selector="..." attribute="onclick" regexp="\d+" replace="$0"/>
          </scr:with-param>
        </scr:function>
      </scr:call-template>
    </scr:onto-elem>
  </scr:template>

In the case of UC3 and UC4 the situation is practically identical for Strigil and for crOWLer. Just like crOWLer, Strigil natively supports setting the values used to create an identifier for an individual. In Strigil, the URI of an individual created by onto-elem is specified by the first value-of child node that returns a value, i.e. does not have the
The resulting implementation is named SOWL, short for SelectOWL, and refers to a Firefox addon for creating scenarios for crOWLer. In the following sections we will refer to SOWL as a set of requirements and an envisioned, expected result of this work. The actual implementation will be covered in later chapters.

3.4.1 InfoCram 6000 - ExtBrain

InfoCram 6000 is part of the project ExtBrain, which is developed at the Department of Computer Science. This specific part was implemented by Jiri Masek and is described as a "prototype of user interface for visual definition of extraction rules for ExtBrain Extractor". Its intended usage is very close to the usage of SOWL. It is a Firefox extension that generates rules (a scenario) for the extractor implemented as another part of the ExtBrain project. The ExtBrain extractor is implemented in JavaScript, as opposed to Java in the case of crOWLer. It extracts data according to the definitions from InfoCram 6000. The result is stored in JSON format, thus not carrying semantic information, but only a set of raw data in some form.

The main part of the extension window shows a tree view with the rules being edited. This view corresponds to the required structure of a scenario for crOWLer. An interesting part is the engine for selecting elements of a page. Its implementation is based on Aardvark, a Firefox extension that addresses this issue using mouse selection and several keyboard commands. InfoCram does not use simple CSS or XPath selectors, but includes the Sizzle library to
[Class diagram, continued: ChainedFirstElementSelector, NewDocumentSelector, JSoupSelector and AttributePatternMatchingURLCreator, with resolve(Element): Elements, generate(Element): String and related methods.]

Appendix G: crOWLer architecture

[Architecture diagram: the crOWLer core, its configuration, and output to a Sesame repository or the local filesystem.]

Appendix H: Detailed architecture of the Strigil platform

Figure H.1: Components of the Data Application part of Strigil (web application, scraper engine and data connector, data models and services, web services: OntologyRepositoryServices, ScrapingScriptRepositoryServices).

[Figure H.2, component model of the Download System, part 1: Compresser, Connection Controller, Connectors (HTTP, FTP, File System), Download Request Buffer, Downloader, Scheduler, State Loader, Statistic Helper, Source Queues, Download Manager, Device Controller, Distributor, Request Redirector, ...]
[Class diagram: FullCrawler, Crowler and JenaConnector, with fields such as connector, logger, baseURL, classCache and propertyCache, and methods such as run(Configuration), createStatement, closeModel, connect, disconnect and getModel.]

Figure 3.2: Core classes of the original crOWLer implementation.

The main program flow of crOWLer rests on a few core classes. The pair FullCrawler and Crowler (diagram B.2) forms the crawling process loop. In this loop, FullCrawler fetches the source web pages and passes them one by one to the Crowler. The NextPageResolver, which defines the list of pages to be crawled, is a structure implemented within the configuration, and thus is specific for a given problem instance. Results are stored in the outer loop after each scraped page. According to the input parameters, data are uploaded into a Sesame repository using the JenaSesame library, or stored locally in an RDF file.

The inner loop, performed by the Crowler, finds a set of HTML elements as defined by the InitialDefinition class. Each of these elements serves as
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xmlns:vs="http://www.w3.org/2003/06/sw-vocab-status/ns#"
  xmlns:foaf="http://xmlns.com/foaf/0.1/"
  xmlns:dc="http://purl.org/dc/elements/1.1/">

  <!-- describes general characteristics of the FOAF vocabulary (ontology) -->
  <owl:Ontology rdf:about="http://xmlns.com/foaf/0.1/"
    dc:title="Friend of a Friend (FOAF) vocabulary"
    dc:description="The Friend of a Friend (FOAF) RDF vocabulary,
      described using W3C RDF Schema and OWL, the Web Ontology Language." />

  <rdfs:Class rdf:about="http://xmlns.com/foaf/0.1/Person"
    rdfs:label="Person" rdfs:comment="A person." vs:term_status="stable">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
    <owl:equivalentClass rdf:resource="http://schema.org/Person"/>
    <owl:equivalentClass rdf:resource="http://www.w3.org/2000/10/swap/pim/contact#Person"/>
    <rdfs:subClassOf>
      <owl:Class rdf:about="http://xmlns.com/foaf/0.1/Agent"/>
    </rdfs:subClassOf>
    <rdfs:subClassOf>
      <owl:Class rdf:about="http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing"
        rdfs:label="Spatial Thing"/>
    </rdfs:subClassOf>
    <rdfs:isDefinedBy rdf:resource="http://xmlns.com/foaf/0.1/"/>
    <owl:disjointWith rdf:resource="http://xmlns.com/foaf/0.1/Organization"/>
    <owl:disjointWith rdf:resource="http://xmlns.com/foaf/0.1/Project"/>
  </rdfs:Class>
</rdf:RDF>
of pagination and simply simulate a click on the "Next" button. Enabling JavaScript has huge consequences and will be analyzed in a separate section (4.3).

3.2.3 Result from crOWLer analysis

The original implementation of crOWLer can solve the tasks defined by the specified use cases (1.2). The requirements put on users of crOWLer are, however, too high, and the usability is very limited. The options for extending the configuration component will be examined during the design part (4). The configuration can be either generated using a scenario, or completely replaced if the scenario defines a different crawling procedure than the current double loop. The option of incorporating JavaScript will get extra attention.

The previous sections roughly define the requirements on a scenario for a semantic crawler. To fully satisfy all considered use cases in all settings, in addition to the functionality implemented so far, we would have to cover:

- following hyperlinks on a page
- firing JavaScript and browser events
- functions for transforming scraped data, using regular expressions or key-value mapping

3.3 Strigil

Strigil is an ontological scraping system developed at the Faculty of Mathematics and Physics of Charles University in Prague. It represents "an easily configurable tool that enables users to retrieve data from textual or weakly structured documents" [22].

Figure 3.3: Overall architecture of Strigil.
Appendix E: Configuration component of the original crOWLer

[Class diagram: Configuration, with fields baseOntoPrefix, encoding, id, initialSelectors (List<InitialDefinition>), lang, nextPageResolver and schemas, methods addInitialDefinition, getBaseOntoPrefix, getEncoding, getId, setNextPageResolver and setPublisher, and TableRecordsNextPageResolver implementing hasNext/next/remove.]

Appendix F: Selector component of the original crOWLer

[Class diagram: InitialDefinition (ClassSpec, Selector), ClassSpec with addSpec(PropertySpec), getIRI and getSpecs, a document cache (WeakHashMap<String, Document>), an encoding field and a URLCreator generator; continued by the ChainedFirstElementSelector diagram in the following figure.]
have a reference to its parent, which is already a leak of the intended sandbox. Proper sandboxing would require implementing a whole JavaScript engine in JavaScript, which is probably too much for our intentions.

In crOWLer we can now distinguish between two ways of ascending to another HTML page:

1. using the call-template command,
2. using JavaScript or a user event, such as click or back.

The call-template is always called on a URL, and always creates a new web context, keeping the original one untouched. It actually behaves like a call stack, so when we return from the template call, we can follow on with the original DOM tree. Just to note: compared to the corresponding Strigil command, crOWLer persists the ontological context throughout this call, and so we can relate to it when assigning properties.

Direct interaction with the current window in any way that changes the page location will, however, irreversibly invalidate all the elements of the current DOM. This does not have to mean we cannot use this functionality altogether. Probably the best solution would be to only allow DOM-modifying operations on the bottom level of templates, i.e. within the steps property of the template command in the scenario. At this place we only hold the body of the current document, and as such we can simply replace it with the newly loaded content. In the original crOWLer implementation this would be the spot between two InitialDefinitions. Even though the
the same ontology. On a low level of the implementation we deal with a simple oriented graph. The graph structure is defined in the form of triples. Each triple consists of three parts: subject, predicate and object, which all are simply resources listed by their identifiers (URIs). In this very general form we can express basically any relationship between two resources. On the level of classes and properties we can define hierarchies, or set a class as a domain of some property. On a lower, more concrete level, we can assign a type to an individual. On the level of ontologies, in a way a meta-meta level, we can specify for instance the author, a description, and the date it was released. Each of the relations is described using triples, and together they form one complex graph.

2.2 Linked Data

Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF". Just like the Semantic Web, it is a phenomenon, a community, a set of standards created by this community, tools and programs implementing these standards, and people willing to use these tools, and of course the data being presented.

The Linked Data effort strives to solve the problem of unreachability of the majority of the knowledge present on the web, as it is not accessible in machine-readable form, doing so by defining standards and supporting implementation
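To make the triple structure above concrete, a few statements on the levels just described could be written in Turtle roughly as follows; this is a minimal sketch, and the ex: namespace and its resources are made up for illustration:

  @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix foaf: <http://xmlns.com/foaf/0.1/> .
  @prefix ex:   <http://example.org/> .

  # schema level: a hierarchy and a property domain
  ex:Employee rdfs:subClassOf foaf:Person .
  ex:worksFor rdfs:domain ex:Employee .

  # instance level: typing an individual
  ex:jack rdf:type ex:Employee .

Each line is one triple: subject, predicate, object.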
Figure 2.1: Logo of the Semantic Web.

Wikipedia defines the Semantic Web as "a collaborative movement led by international standards body the World Wide Web Consortium (W3C)" [4]. W3C itself defines the Semantic Web as a technology stack to support a "Web of data", as opposed to the "Web of documents", the web we commonly know and use [12]. Just like with the Cloud or Big Data, the proper definition tends to vary, but the notion remains the same. It is a collaborative movement led by W3C, and it does define a technology stack. It also includes users and companies using this technology, and the data itself. The technologies and languages of the Semantic Web, such as RDF, RDFa, OWL and SPARQL, are well standardized and will be described in the following sections of this chapter.

As a general logical concept, the languages of the Semantic Web are designed to describe data and metadata, give them unique identifiers so that we can address them, and form them into oriented graphs. The metadata part defines a schema of types (or classes) and properties that can be assigned to data, as well as relations between these types and properties themselves. Wrapped together, this metainformation is presented in the form of an ontology. When some data are annotated by resources from such an ontology, we gain the power to reason on this data, i.e. to resolve new relations based on known ones, and also to query our data along with any data annotated using the
Like in crOWLer, the processing of each template is performed independently in Strigil. Each template call first puts a request into the download system. The actual execution of each template is fired asynchronously, when the download of the targeted document is finished, as notified by a message from the Download System.

The inner part of a template conforms with the structure inside crOWLer configurations. It defines a tree structure of ontological classes and properties, along with selectors specifying the position of the targeted data. As a result of the different document-resolving system, there are no NewDocumentSelectors in Strigil. In place of this selector we would simply call another template on the new document. This approach is clearer than using a chained selector, especially if we handle two or more nested documents. It is required, though, to carry the ontological context from one template to another. This behavior is unfortunately neither mentioned in the Strigil documentation, nor in the examples examined.

3.3.3 Confronting Strigil with use cases

As a basic example, UC1 (1.2.1) can be solved by Strigil. We are presuming here that Strigil carries the ontological context through template calls. Notice in the following example that the value-of tag in the template named detail does not have any onto-elem defined above it. By carrying the ontological context we denote that every property specified by the children nodes of the template will be assigned to the individual
taking it from its first child, which is expected to be a value-of tag. Needless to say, this specification lowers robustness, as the position in the XML file is not enforced by the syntax and can easily be unintentionally broken by an accidental swap of two elements, although that would not invalidate the file's syntax and thus would not be captured by the script parser as an error.

In the JSON format we lack the notion of child elements. Even when we simulate it, as mentioned before, we would only cause the same indetermination. So instead, we simply reserve a property named about exactly for the described use.

4.3 JavaScript and events support

Special attention has to be paid when dealing with direct interaction with DOM elements and script execution. WebDriver supports injection and execution of JavaScript, as well as simulation of user interactions, like a click on an element or back and forward navigation. Even though it brings great power, there are considerations and great limitations to be taken into account.

[1] https://en.wikipedia.org/wiki/Duck_typing
In particular, an implementation of a prototype of the lightweight semantic crawler crOWLer was examined and documented. The research was focused on improvement of the configuration of the scraping process.

By examining Strigil, the scraping system, a new template-based approach to scraping of semantic data was revealed. The functionality of Strigil and crOWLer was compared on real-life use cases. The Strigil XML syntax for scraping scripts was examined, and several possibilities for improvements were described. Based on the original XML syntax, a new JSON-based syntax was derived and documented.

The open source Firefox addons InfoCram 6000 and Selenium IDE were chosen as potential bases for the future frontend implementation. Neither of them showed to be suitable for the intended use, but each brought new knowledge. The algorithm for selector generation and aardvark, the element selection engine later used in SOWL, originate in InfoCram 6000. Selenium IDE relates to the WebDriver engine, which was later included in the final crOWLer prototype.

Options to use JavaScript as a language for extending the scraping script functionality were thoroughly researched. Several useful patterns for JavaScript usage were revealed, and the results documented together with examples of JavaScript and Java code.

A prototype Firefox addon named SOWL was created as a tool for generating scenarios in the proposed JSON syntax. The subset of the syntax necessary to cover the example use case was involved in the implementation.
is based on SPARQL and designed in a manner that makes the resulting JavaScript code look familiar when compared to a native SPARQL query. To better show the similarity, we present rdfquery code equivalent to the SPARQL query from section 2.6, along with printing of its output:

  rdf
    .prefix('foaf', 'http://xmlns.com/foaf/0.1/')
    .where('?person a foaf:Person')
    .optional('?person foaf:name ?name')
    .each(function (i) {
      var person = this.person.value;
      var name = this.name === undefined ? 'Anonymous' : this.name.value;
      console.log(person + ' has name: ' + name);
    });

3.5.4 aardvark

Aardvark is a JavaScript engine for in-place modifications of a webpage. It allows the user to select, delete or highlight parts of an HTML page. It has been released in two forms: as a bookmarklet and as a Firefox extension. The latter was used, in a modified form, in InfoCram 6000 (3.4.1) and later in one of the SOWL (SelectOWL) prototypes. This library helps to implement the selection and serves as a framework for the selector generating algorithm.

[1] https://code.google.com/p/jowl-plugin/

Chapter 4 Program design

This chapter defines the overall behavior of the program stack, derived from the presented use cases.

4.1 Workflow

From the use cases defined, and from the analysis performed on the existing solutions, we can derive the general workflow for both the SOWL and crOWLer parts of the implementation.
is needed in order to properly target elements on the webpage using JSOUP selectors. This is one of the hardest tasks, as the selectors have to be manually extracted using, for example, the browser console. The scenario-based approach focused on in this thesis will enable the user to bypass the Java programming and focus only on matching the web structure with an ontology.

3.2.2 Confrontation with use cases: technical issues

In this section, the capabilities of the original crOWLer implementation will be confronted with the use cases specified for this work (1.2). For all use cases a separate configuration would have to be created. We will mainly focus on problems specific for each case.

The first configuration of crOWLer was created for the MonumNet webpage of the National Heritage Institute, the UC2 (1.2.2), stating that the UC2 can be, and was, solved using the hardcoded configuration.

First we will focus on the structure of the configuration. The following code is a simplified snippet of the actual configuration-building code of the original crOWLer implementation. It uses the NPU class as simple static storage for the URIs used in our ontology. According to this configuration, a monumnetRecord object is created for each table row, as defined by the initialDefinition. The second part creates a district object, with its label found in the third table column, denoted by the td:eq(2) JSOUP selector, and assigns it to the record using the hasDistrict object property. The conf object holds the configuration
additional HTML defining the graphical layout. Early stages of the implementation generated XML based on the Strigil XML format, using hardcoded XML snippets and the string-formatting approach often used on webpages with dynamically loaded content: a string holds a snippet of HTML or XML structure with a placeholder, and this placeholder is replaced by either a value or by another, already processed snippet. This way, piece by piece, the whole scenario is generated. This solution is not hard to implement, but brings poor maintainability, and with additional complexity it loses elegance and readability, and can even cause performance issues.

The original data of the scenario created by SOWL are stored naturally in a JavaScript object. Using the standard JavaScript method JSON.stringify, we can immediately generate a JSON serialization of such an object. This way we have a structure similar to the original one defined by the Strigil XML, but in a flexible form. Obviously, some adaptations are necessary. Nesting is recorded using the steps property; the header section is redesigned for the JSON structure. For example, instead of listing prefixes in a single string of an XML attribute, we define an object ontology with a map of prefix-URI pairs. The original semantics of onto-elem and value-of was preserved, only limited to its basic use: value-of serves to assign literal properties, or to retrieve textual values for its parent scenario step. An example of the scraping script
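For illustration, a minimal scenario in this JSON form, following the conventions of section 5.3 and Appendix I, might look roughly as follows; the selector values are illustrative, and this is a sketch rather than the exact output of SOWL:

  {
    "type": "scenario",
    "ontology": {
      "base": "http://kub1x.org/onto/dip/t",
      "prefix": [ { "as": "foaf", "uri": "http://xmlns.com/foaf/0.1/" } ]
    },
    "templates": [
      {
        "name": "init",
        "steps": [
          {
            "command": "onto-elem",
            "typeof": "http://xmlns.com/foaf/0.1/Person",
            "selector": [ { "value": "tr", "type": "css" } ],
            "steps": [
              {
                "command": "value-of",
                "property": "http://xmlns.com/foaf/0.1/firstName",
                "selector": [ { "value": "td:nth-child(1)", "type": "css" } ]
              }
            ]
          }
        ]
      }
    ]
  }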
created by the onto-elem containing the invoked call-template, i.e. the property assignment will bubble through the template call until it finds an onto-elem node. Unfortunately, the Strigil documentation does not state this clearly, and the examples provided do not contain the ontological-context-carrying structure.

  <scr:template name="init">
    <scr:onto-elem typeof="foaf:Person" selector="...">
      <scr:call-template name="detail">
        <scr:value-of selector=".detail @href"/>
      </scr:call-template>
    </scr:onto-elem>
  </scr:template>

  <scr:template name="detail">
    <scr:value-of property="foaf:nickname" selector=".nick"/>
  </scr:template>

Also, it is important to note that Strigil uses the JSOUP selector system, extended by the @attribute selector. In the example we target the value of the href attribute of elements with class detail. The @ tag is probably taken from XPath [23]. This kind of extension is rather unfortunate, as it combines two different syntaxes. As we primarily use JSOUP, the space in the selector string denotes "any descendant". In that case we would read the example selector as "any href attribute of any descendant of elements with class detail", which probably is not the intended meaning. We would suggest adding an attribute named attribute to the value-of element, rather than extending the JSOUP syntax (see the sketch below). The Strigil Scraping Script also does not allow the selector attribute on the onto-elem element.
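To illustrate the suggestion above, this is roughly how the two forms would differ; the second line is a sketch of the proposal, not an implemented Strigil feature:

  <!-- current form: a JSOUP selector extended with an XPath-like @href -->
  <scr:value-of selector=".detail @href"/>

  <!-- suggested form: a plain JSOUP selector plus a dedicated attribute -->
  <scr:value-of selector=".detail" attribute="href"/>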
installable as a regular Firefox extension. After installation, an icon will show next to the address bar, which opens a sidebar with SOWL.

The user interface divides into two parts: the scenario editor (top half) and the resources list (bottom half). Keyboard shortcuts and mouse controls are used to navigate through the scenario editor:

- arrows or h, j, k, l: navigate parent/down/up/child
- Ctrl+Enter: toggle editing
- A: append step as a child
- a: append step as a sibling
- I: prepend step as a parent
- i: prepend step as a sibling

To load an ontology from a file, simply drop the file onto the resources list.

To assign a selector to a step, drag the element from the webpage and drop it onto the step (the selection has to be started first, as denoted by a red border around the hovered element):

- press n / w to narrow or widen the selected element (the webpage has to be focused)

To assign a resource to a step, drag it from the resources list and drop it onto the step.

L.2 crOWLer

crOWLer depends on an instance of PhantomJS [1] running in the background. crOWLer is distributed in the form of a jarball. A run.sh script can be used to run it. crOWLer accepts several command line attributes:

- scenario <file>: the scenario file (required)
- rdfDir <path>: the path to a directory for storing RDF files
- phantom <path>: the path to the phantomjs executable
- sesameUrl <url>: an address of the Sesame server
problem. By storing such crawled data into a database, we obtain a persistent database, possibly automatically obtained by the script from the previous case. Such data is static, but can be queried over and over, and possibly re-retrieved when it becomes obsolete. Its structure is, however, based on the programmer's imagination, and needs to be described in order to understand and handle the data properly.

When a triple store is used as the database in the previous case, we obtain a one-time solution to our problem. This is technically equal to the original state of crOWLer.

When using an ontology-based solution, tailor-made for crawling and annotating data from the web, we obtain several benefits for free. The tool designed specially for this purpose makes it easy. Once the data is annotated, we can not only query it, but also automatically reason on it and obtain more, or more specific (narrower), results than with general data. The attributes and relations within an ontology that allow reasoning are usually part of the ontology definition, and as such come in naturally, without any extra effort. Last of the benefits: using an ontology from a public resource as a schema for our data can give us a correct structure without the need of building it from scratch. Also, by using some common ontology, we can join together any accessible data structured according to this ontology, and simply query the resulting super-set.

However, semantic crawling is not a silver bullet yet.
This technology is still finding its place and uses, and is constantly shaped by the needs of its users. For instance, there is always a threat of inconsistency of an ontology, when some data do not fit the rules or break the structure of an ontology. In its state from April 2014, DBpedia states there are 3.64 million resources, out of which 1.83 million are classified in a consistent ontology [18]. That is only half of the data being arguably consistent with each other. This does not mean that the rest of the data is bad; however, it might cause an inconsistency and prevent us from reasoning on the data, if we include a wrong subset of it.

Just like with the hardcoded crawling technique, semantic crawling is tightly bound to the structure of the crawled web. The web is being matched against some pattern described by selectors, and the matching element, when found, is accepted for further processing. Any change in a webpage structure can lead to broken selectors or links during the crawling process, and make the scenario partially or completely invalid.

Many web pages load their data dynamically using AJAX queries. Some pages simply change their content frequently, e.g. news pages, forums, user-content pages like video or music servers, and social web applications. Crawling content on such servers would require almost constant crawling, and would cause growth into a massive ontology of oftentimes questionable quality.

Semantic crawling is a useful
            "property": "http://xmlns.com/foaf/0.1/lastName",
              "selector": [ { "value": "td:nth-child(2)", "type": "css" } ]
            },
            {
              "command": "value-of",
              "property": "http://xmlns.com/foaf/0.1/phone",
              "selector": [ { "value": "td:nth-child(3)", "type": "css" } ]
            },
            {
              "command": "call-template",
              "name": "detail",
              "selector": {
                "type": "chained",
                "value": [
                  { "value": "td.detail a", "type": "css" },
                  { "value": "@href", "type": "xpath" }
                ]
              }
            }
          ]
        }
      ]
    },
    {
      "name": "detail",
      "steps": [
        {
          "command": "value-of",
          "property": "http://xmlns.com/foaf/0.1/nickname",
          "selector": [ { "value": ".nick", "type": "css" } ]
        }
      ]
    }
  ]
}

Appendix J: Result of a crOWLer run on UC1 (URIs in attributes were prefixified for compactness)

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:kbx="http://kub1x.org/onto/dip/t#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
  <rdf:Description rdf:about="kbx:scenario-201412060213045124">
    <rdf:type rdf:resource="owl:Ontology"/>
    <owl:imports rdf:resource="kbx:"/>
    <owl:imports rdf:resource="foaf:"/>
  </rdf:Description>
  <rdf:Description rdf:about="kbx:scenario-201412060213045124-indiv-201412060213050157">
    <rdf:type rdf:resource="foaf:Person"/>
    <foaf:firstName rdf:datatype="xsd:string">Jack</foaf:firstName>
    <foaf:lastName rdf:datatype="xsd:string">Black</foaf:lastName>
    <fo
crOWLer shall be able to perform the following tasks:

- Parse the scenario created by SOWL, and follow it while scraping data from the page.
- Store the results into RDF files.

Figure 1.2: Diagram of the general workflow as derived from the presented use case.

This use case defines the simplest functionality that has to be implemented by both programs. It covers resources handling, scenario creation and running, and finally storage of the results. It helps to define the proper behavior of the program, as it is written in simple, valid HTML5 code, without any JavaScript, and all elements can be simply targeted by CSS or XPath selectors.

1.2.2 Use Case 2: National Heritage Institute

http://monumnet.npu.cz/pamfond/hledani.php

The webpage of the National Heritage Institute of the Czech Republic gives public access to a table of damages of national monuments. This is of interest for the project MONDIS [1], partially developed at our school. Its main purpose is documentation and analysis of damages and failures of cultural heritage objects. The data were successfully crawled by the original implementation of crOWLer. The goal of the following development is to replicate the behavior with a new implementation, using a scenario-driven crawling process instead of a process driven by a hardcoded configuration.

The main challenge of this use case lies in JavaScript. Each row of the data table has the

[1] https://mondis.cz/
from loop-based to template-based. The main library for web communication was changed from JSOUP to WebDriver, which, combined with the scenario, led to a complete reimplementation of the core. The only parts derived from the original crOWLer are the Jena and JenaSesame libraries, for handling the ontological models and the storage of RDF data. The complete architecture can be better seen in the component model in the appendix.

Figure 5.3: The overall architecture of the new crOWLer implementation (scenario, WebDriver, Jena, Sesame).

A new structure was implemented, holding a Scenario object with its steps. In this form the Scenario is passed to the main loop. Instead of the FullCrawler based on JSOUP, we created a WebDriver-based solution, the WebDriverCrawler.

5.3 SOWL JSON syntax

Following is the final list of commands proposed for the crOWLer implementation. Only a subset is implemented in the prototype. Each command is described and its attributes are listed, also with a description.

5.3.1 template

A command defining a list of steps to be performed on the document passed to it.

- name: name identifying the template, referenced by the call-template command
- steps: list of steps of the template

5.3.2 call-template

A command used to call a template (illustrated in the sketch below). If no URL is specified, the template shall be called on the current context.

- name: name identifying the template to be called
- values: defines a list of commands; every command
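For illustration, the pair of commands just described appears in a scenario roughly like this (following the shape of the Appendix I listing; the URL is the basic use case page):

  {
    "command": "call-template",
    "name": "init",
    "url": "http://www.inventati.org/kub1x/t/"
  }

with the called template defined elsewhere in the scenario as:

  {
    "name": "init",
    "steps": [ ... ]
  }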
Figure 4.1: Diagram of the general workflow of the stack.

4.1.1 Main line

- user loads/creates an ontology using SOWL
- user opens a webpage with data
- user creates a scenario using SOWL
- user adds selectors to scenario steps
- user adds resources to scenario steps
- SOWL sends the scenario to crOWLer
- crOWLer crawls the web according to the scenario and stores results in a file or repository

4.1.2 Scenario creation

- user starts scenario creation in SOWL
- loop until finished:
  - user creates a step in the scenario
  - user selects an element on the page
    - a selector is generated, if applicable, on the step
  - user selects a resource
    - the resource is updated on the appropriate field of the step, if applicable

4.1.3 Additional branches to Scenario Creation

- user can navigate through the scenario by clicking scenario steps
- user can navigate through the scenario by clicking the ontological context
- user can navigate through the scenario by clicking areas on the webpage covered by the scenario
- when the user clicks on a hyperlink:
  - an existing template can be assigned to the action (no need to actually follow the link)
  - a new template can be created for the resulting action (resulting page loaded, new template created)

4.1.4 crOWLer scraping

- user runs crOWLer, passing it the created scenario
- crOWLer parses the scenario
- crOWLer scrapes data from the webpage following the scenario
- crOWLer stores
The crOWLer tool was newly implemented. Support of the new scenario syntax was added, and it replaced the original hardcoded configuration. A subset of the scenario commands was fully implemented and tested using the sample use case. The template-based approach was implemented instead of the loop-based one. The JSOUP library was replaced by WebDriver and PhantomJS in order to enable JavaScript.

The prototype of the semantic crawler was successfully created as the pair of tools SOWL + crOWLer. The rdfquery library used in SOWL enables it to handle semantic structures before we start crOWLer, or after, in the form of visual feedback using RDFa. The new architecture of crOWLer, along with WebDriver, opens possibilities for future extension and utilisation of JavaScript. But mainly, a tool was created that simplifies the process of description of the semantic content of the web for its users.

It is suitable to notice that in many cases the intentions and activities of the semantic web community focus on government data [32]. The common goal leads us to turn the web into an open, accessible source of knowledge and data of all kinds, linking the data together where possible. Naturally, the governmental data and statistics get the most attention. A government handles, collects, and is often obliged to publish, in some form, a lot of data and statistics. Not always does this form comply with the standards of the semantic web. Sometimes it might even be the case of intentional
onclick attribute defined. Unlike the classical link, also known as the anchor,

[1] https://mondis.cz/

[Screenshot: the MonumNet search result table, listing monument records with registration numbers, districts, monument descriptions and addresses.]

Figure 1.3: Partial view of the data on the National Heritage Institute webpage.
For example, the country value isn't specified for the majority of the event records, but we can determine by the State field that they happened in the United States.

We will have to deal with JavaScript again. As we can see from the URL of the site, having the .aspx suffix, we are dealing with Active Server Pages created by an ASP.NET server. The whole table, with all its sorting functionality and pagination, is generated by the server and defined by the framework used on the server side. The pagination is of our concern, as it loads data into the table using an AJAX call. This means data are loaded dynamically, and we do not have easy access to the low-level network communication happening behind the scenes. The options we have are analogous to those in the second use case (1.2.2): we can either simulate the user action of clicking on the next-page button, or deeply analyze the JavaScript behind the pagination and perform the AJAX call manually.

The situation here is slightly different from the one in UC2, though. Suppose we successfully emulate the user action for both use cases: in UC2 we will have to perform it for each line in the table, thus during creation of a consistent ontological object and within iterating the table, whereas in this use case we only perform the click when we need to load a completely new set of data. The difference might not seem so essential at first glance, but the devil is in the detail: a user action modifies or replaces the current DOM object
perform on the selected public data sets, and evaluate its potential for semantic web authoring.

Bibliography / Sources:

[1] Tom Heath and Christian Bizer (2011). Linked Data: Evolving the Web into a Global Data Space. 1st edition. Synthesis Lectures on the Semantic Web: Theory and Technology, 1:1, 1-136. Morgan & Claypool.
[2] OWL 2 Primer. http://www.w3.org/TR/owl2-primer/ (cit. 19. 12. 2013).

Diploma Thesis Supervisor: Ing. Petr Křemen, Ph.D.
Valid until the end of the summer semester of academic year 2014/2015.
doc. Ing. Filip Železný, Ph.D., Head of Department; prof. Ing. Pavel Ripka, CSc., Dean
Prague, March 3, 2014

Appendix B: Abbreviations

MDN - Mozilla Developers Network
URI - Uniform Resource Identifier
URL - Uniform Resource Locator
URN - Uniform Resource Name
RDF - Resource Description Framework
RDFS - RDF Schema; a set of classes and properties providing basic elements for the description of ontologies
OWL - Web Ontology Language
SPARQL - SPARQL Protocol and RDF Query Language; a query language for semantic databases (triplestores)
foaf - friend of a friend; a popular ontology for describing personal information and relationships

Appendix C: RDF and RDFS vocabulary

Resources: rdf:type, rdfs:Resource, rdfs:Class, rdfs:Literal, rdfs:Datatype, rdf:XMLLiteral, rdf:Property, rdfs:domain, rdfs:range, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:label, rdfs:comment

rdf:type - a property used to state that a resource is
p: <http://kub1x.org/dip/rlp#> .

  <rlp:event-xFuHbjA5> a <rlp:event> ;
    <rlp:hasEventType> "Letecká nehoda"@cs .

The motivation for the previous instantiation lies in the following use case. As it uses the same domain (flight accidents), it might use some of the resources previously defined here. For the event type it would probably use exactly the same instances, and would only add the English label to them. This should not be much of a problem, as long as we can specify a URI identifier when creating an instance of an ontological object. In the example above, the identifier is <rlp:flightAccident>. Another identifier in the example is the URI of the event, <rlp:event-xFuHbjA5>. This one was chosen from a URL of a PDF file on the page.

From the previous paragraph we derive another useful functionality: conditioning on string literals, and specifying URIs of instances directly in the scenario, either as a constant string, or obtained by combining other string values, probably in the form of a pattern.

[Screenshot: the report list on the Air Accidents Investigation Institute webpage, with category filters ("Zobrazit podle kategorie": weight classes, airplanes, helicopters, gliders, sport flying devices, parachutes, balloons and airships), date filters ("Zobrazit podle data": from/to), and a list of events with dates and the report type "Závěrečná zpráva" (final report).]
When specified, rdfDir will be ignored.

- repositoryId <repo>: an identifier of the Sesame repository

[1] http://phantomjs.org/
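Putting the attributes together, an invocation could look roughly as follows; this is only a sketch, the exact flag syntax of run.sh is assumed here, and the paths are illustrative:

  ./run.sh --scenario scenario.json --rdfDir ./rdf-output --phantom /usr/local/bin/phantomjs

To upload the results into a Sesame repository instead of local RDF files, the sesameUrl and repositoryId attributes would be passed in place of rdfDir.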
openly available solution that would directly follow the required workflow and fulfill the requirements. Existing tools named "ontology-based web crawlers" refer mostly to crawlers that rank the pages being crawled by guess-matching them against some ontology. In those programs, the user cannot specify the data that are being retrieved. Moreover, there is no way to get involved in the crawling process. The tool is solely used to automatically rank the relevance of documents, which solves a different set of problems. In the case we are trying to solve, the input is one or more documents and one or more ontologies. The result is data retrieved from the documents and annotated with resources from the ontologies.

3.1.1 Advantages and pitfalls of semantic crawlers

To properly target the benefits that the semantification of the scraped data brings to the user, let us quickly follow an evolution from the most primitive technologies for scraping data to the advanced ones. The ultimate goal is to effectively search in data and maximally utilize the knowledge it carries.

The simplest approach is manual searching for keywords, or even simple browsing of the web. That might be useful in some cases, but when there is a lot of data it becomes exhausting.

Crawling data using simple tools like wget --mirror allows us to load the data and then write a program or script to retrieve the relevant information. This approach takes a lot of energy for a one-time-only solution of a given problem.
platform
I SOWL JSON scenario solving Use Case 1
J Result of crOWLer run on UC1
K Example of JSON dump of rdfquery datastore
L User manual for SOWL and crOWLer
L.1 SOWL
L.2 crOWLer

Tables

C.1 RDF and RDFS vocabulary

Figures

A screenshot of an example main and detail page for the basic use case
An activity diagram of the general workflow of the stack
Partial view at data on National Heritage Institute web
Preview of HTML analysis on National Heritage Institute webpage
Partial view at data on National Heritage Institute web
View on list page on Air Accidents Investigation Institute
View on detail page on Air Accidents Investigation Institute
Logo of Semantic Web
Linking Open Data cloud diagram
General architecture of the original crOWLer implementation
Core classes of original crOWLer implementation
Overall Architecture of Strigil
Main Window of InfoCram
Image of Selenium IDE
Diagram of the general workflow of the stack
Components structure of the SOWL Firefox addon
A new overall architecture of the crOWLer implementation
Overview of the whole stack and files exchanged
[Figure H.2, continued: Throughput Controller, the unresolved/resolved DNS address queues, DNS Resolver, DNS Cache and Request Listener.]

Figure H.2: Components of the Download System part of Strigil.

[Deployment diagram: a local network hosting the Data Application, a Download Manager and several Downloader and Proxy devices, serving a web client.]

Figure H.3: Example deployment structure of Strigil.

Appendix I: SOWL JSON scenario solving Use Case 1

{
  "type": "scenario",
  "name": "scenario",
  "ontology": {
    "base": "http://kub1x.org/onto/dip/t",
    "imports": [],
    "prefix": [
      { "as": "foaf", "uri": "http://xmlns.com/foaf/0.1/" },
      { "as": "k", "uri": "http://kub1x.org/onto/dip/t" }
    ]
  },
  "creation-date": "2014-11-30 12:40",
  "call-template": {
    "command": "call-template",
    "name": "init",
    "url": "http://www.inventati.org/kub1x/t/"
  },
  "templates": [
    {
      "name": "init",
      "steps": [
        {
          "command": "onto-elem",
          "typeof": "http://xmlns.com/foaf/0.1/Person",
          "selector": [ { "value": "tr", "type": "css" } ],
          "steps": [
            {
              "command": "value-of",
              "property": "http://xmlns.com/foaf/0.1/firstName",
              "selector": [ { "value": "td:nth-child(1)", "type": "css" } ]
            },
            {
              "command": "value-of",
This literally specifies the address of a resource, and in many cases can be directly accessed in order to obtain the related data. In some cases we can use a URN as well. A URN, as opposed to a URL, allows us to identify a resource without specifying its location. This way we can, for example, use ISBN codes when working with books and records, or a UUID, a Universally Unique Identifier, widely used to identify data instances of any kind.

2.3.2 RDF and RDFS vocabulary

In order to work with data properly, the RDF(S) vocabulary defines several basic resources along with their semantics. These are the basic building blocks of our future RDF graphs. The semantics defined in the specification allows us to specify a class hierarchy and properties with domain and range, as well as to use this structure on individuals and literals. This is the most general standard that lies under every ontology out there.

2.4 OWL

Additionally to RDF and RDFS, OWL, the Web Ontology Language, is a family of languages for knowledge representation. OWL extends the syntax and semantics of RDF: it brings in the notion of subclasses and superclasses, a distinction between datatype properties and object properties, and defines transitivity, symmetricity and other logical capabilities of properties. When querying an OWL ontology, it allows us to use unions or intersections of classes, or cardinality of properties. All these capabilities come with well-defined semantics. Usage of each feature brought in by OWL semantics extends
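As a rough Turtle illustration of the RDFS and OWL features named above (the ex: resources are made up for the example):

  @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
  @prefix owl:  <http://www.w3.org/2002/07/owl#> .
  @prefix ex:   <http://example.org/> .

  # RDFS: a hierarchy, plus domain and range of a property
  ex:Monument rdfs:subClassOf ex:CulturalHeritageObject .
  ex:hasDistrict rdfs:domain ex:Monument ;
                 rdfs:range  ex:District .

  # OWL: an object property declared transitive
  ex:partOf a owl:TransitiveProperty .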
current document can be passed as an argument, but due to the nature of Strigil, this would create a completely separate context.

Strigil is tailor-made for parallel processing. The architecture of the Strigil system contains not only the scraping processor, but also a layer for distributed download queue processing, and a layer of proxy servers that can be used to spread the traffic and scale the download process horizontally. As the downloads are performed asynchronously, and can even be delayed due to network lags and timeouts, there is no guaranteed order in which documents will be scraped.

Each of the Strigil templates creates its own context when called. If we want to link data obtained from different template calls, we have to use some additional techniques. For example, we can assign some properly defined, non-random unique identifier to an object. This identifier has to be guaranteed to be the same for the same object through different template calls, and potentially on different pages.

To handle ontological data manipulation, the commands onto-elem and value-of are used. The first one creates an individual of a given type and, if nested into a different onto-elem, relates this new individual to its parent with some property. Literals are assigned to properties of the parent object using the value-of command with a property name specified. This command is very powerful: with usage of regular expressions, selectors, or nested calls of itself, it can create arbitrary values from constants and data
This means crOWLer will be getting the content of the onclick attribute, parsing it using a regular expression, and combining it with a predefined pattern into a URL to be directly called using call-template (a sketch of this string handling follows the lists below).

Additionally, this use case hides one more pitfall, which this time challenges the selector creation. The web page uses JavaScript to colorize table rows when the user hovers over them with the mouse cursor. Using a deeper analysis, we can figure out that table lines are given an additional CSS class on certain mouse events. This is often a sign of poor web practices, as the same behavior can be achieved by the :hover CSS selector without the need of an additional class, but it is an example of a challenge that our tool needs to overcome. In this very case we probably will not be able to generate selectors using CSS classes, and will rely only on tag names, positions and other identifiers, if applicable.

Additional requirements on SOWL to those in Use Case 1 (1.2.1):

- allow manual resources creation
- record the click event, OR
- access the onclick attribute
- enable string handling using regular expressions
- record a call-template on the resulting URL

Figure 1.5: View on the detail page of the National Heritage Institute webpage.

Additional requirements on crOWLer:

- simulate the click event, OR
- handle the attribute according to the string filters
- do a call-template on the result as a URL

The outcome of this use case and its analysis
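A sketch of the required string handling, expressed in plain JavaScript; the onclick content and the MonumNet URL pattern are simplified assumptions, not the exact values used by the site:

  // Assume the row's onclick looks something like: "detail('40604')".
  var onclick = row.getAttribute("onclick");      // row: the table row element
  var idReg = (onclick.match(/\d+/) || [""])[0];  // extract the numeric identifier
  // Combine it with a predefined pattern into the target URL:
  var url = "http://monumnet.npu.cz/pamfond/hledani.php?idReg=" + idReg;

In the scenario this corresponds to a value-of with a regular expression, wrapped in a conc function call that prepends the constant URL prefix.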
83. s the results in file or repository i 4 2 Designing scenario format One of main tasks of this work was to create format for scenario generated by SOWL and consumed by crOWLer This scenario will describe information necessary for the crawling process what operation to do create ontological object assign property to such an object perform task with webpage This task is closely related to implementation peculiarity of semantic crawler we are dealing with two separate contexts at the same time the ontological and the web context Ontological context holds current object individual to which we assign prop erties web context hold current webpage along with currently selected element on that webpage Scenario have to support operations to change each context separately and or both at the same time 30 4 2 Designing scenario format M 4 2 1 Strigil XML Strigil the scraping platform in order to solve similar problem as crOWLer introduces its own XML based Scraping Script format 30 Basis of the whole script is system of templates Each template has a name and mime type declaring type of document the template is designed for This information is needed as Strigil supports HTML and also Excel spreadsheet files Templates call each other using call template command anywhere in the script This command accepts URL as an argument from its nested commands Each template is called only with new URL thus on new document Of course URL of cur
84. sional simple yet non trivial scraping tasks Moreover its download system fetches only the raw HTML data just like the orig inal crOWLer implementation and treats it as static document This way it can not properly handle dynamic content and temporal changes in documents performed by JavaScript for the exact same reasons that applied for crOWLer E 3 3 2 Strigil vs crOVVLer Because of the difference in complexity of Strigil and crOWLer we can t correctly compare them one to one But we might find a common subset of functionality Strigil is a server with frontend scraping unit and download system crOWLer is a tool without user interface and with download system reduced to simple REST calls The common part then is the scraping unit ty http xrg ksi ms mff cuni cz software ld ldi html strigil 21 3 3 Strigil The scraping algorithm of crOWLer has been described previously in section B 2 It consist of outer loop over documents inner loop over initial definitions and tree of recursive calls forming the ontological structure while scraping data from elements on the page Strigil has a slightly different approach Instead of configuration it is guided by a scraping script The script will be closely analyzed in the following chapter but in general it defines a set of templates where one template is called at the beginning and each template can call any other template on some URL i e on document located by the URL Unl
sources: https://github.com/kub1x/selectowl/tree/master/ff-extension

5.1 SOWL implementation

Specifically, in jOWL all resources have only one type. This type is determined when parsing the input XML file by a lookup cascade: if the type is not determined by the explicit rdf:type property, the parser would look into the overlying tag name. rdfquery, on the other hand, properly stores all the data in the form of triples in its internal dataset implementation. By using this approach it offers correct results, and it is our library of choice. Even though rdfquery currently serves for parsing of input files only, we might consider utilizing its reasoning capabilities in future development.

5.1.2 Targeting elements on a webpage and generating selectors

Figure 5.2: Diagram of the selector creation algorithm.

Inspired by the InfoCram project, we decided to use the Aardvark code in order to target elements on a webpage and obtain their selectors. In early stages, the native addon code of Aardvark was used. Unfortunately, this code uses some internal Firefox API, and had to be replaced when the new Firefox SDK was used for the SOWL development. In the current implementation of SOWL we create a different type of Firefox extension, using the new SDK. Moreover, the aardvark code is injected directly into the webpage using the Content Script feature of the Firefox SDK. According to these differences, the bookmarklet
…the script gets attached to the webpage. We can pass any number of accepted arguments to these functions and they will be accessible through the standard arguments object on the JavaScript side. Types corresponding to standard JavaScript types are supported as arguments: number, boolean, String, WebElement, or a List of any combination of the previous. The second, asynchronous version returns immediately with a response object. It provides a callback as an additional argument to the JavaScript call. This callback is used for synchronization when accessing the result on the response object from Java.

    JavascriptExecutor exec = (JavascriptExecutor) driver;
    List<WebElement> labels = driver.findElements(By.tagName("label"));
    List<WebElement> inputs = (List<WebElement>) exec.executeScript(
        "var labels = arguments[0], inputs = [];" +
        "for (var i = 0; i < labels.length; i++) {" +
        "  var name = labels[i].getAttribute('for');" +
        "  inputs.push(document.getElementById(name));" +
        "}" +
        "return inputs;", labels);

In simple cases we can use JavaScript to extend the functionality of crOWLer. It might be used as a complex string formatter, a parser for nontrivial values, etc. In the following example it is used to decide, based on an attribute value of an anchor tag (<a>), whether the document location would change: if the href attribute contains a hash symbol, the link is often handled by a JavaScript function instead.

    JavascriptExecutor exec = (JavascriptExecutor) driver;
    WebElement el = driver.findElement(By.cssSelector("a"));
    // true when the href contains a hash symbol, i.e. the link is
    // likely handled by JavaScript
    Boolean handledByJs = (Boolean) exec.executeScript(
        "return arguments[0].getAttribute('href').indexOf('#') !== -1;", el);
Owing to these differences, the bookmarklet version better fits the needs and is used. The Aardvark code is included in the addon files, extended with features necessary for SOWL. Namely, the event handling was extended with drag-and-drop events and the selector creation algorithm was added. Even though it was rewritten, it behaves almost identically as in InfoCram. We simply bubble up the DOM tree until we meet our context. On each element we try to generate a selector unique within the element's parent. The last method to try is the nth-child selector, which always exists and targets the correct element, but is also the most prone to failures due to structure changes. If possible, ID or class attributes are used to target the element. As use case 2 (1.2.2) has shown, we cannot always rely on class selectors, as they are often dynamically modified by the page's JavaScript. For this reason the class selectors are disabled by default, but they are supported by crOWLer and can be manually specified in the selector field. Aardvark shows the class of a hovered element on its label to simplify this task.

5.2 crOWLer implementation

The current implementation of crOWLer forms the architecture shown in figure 5.3. Even though the overall architecture visually keeps a structure similar to the original implementation, the result is technically a brand new program. The change from the configuration system to the scenario changed the input handling and influenced the structure of the core algorithm …
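Returning to the selector-generation algorithm described at the beginning of this section, the bubbling strategy can be sketched in plain DOM JavaScript as follows (a simplified illustration, not the actual SOWL/Aardvark code; class-based selectors are omitted here, mirroring the default behaviour described above):

    // Walk up from the selected element towards the context node and emit
    // the most robust selector segment available at each step: an ID, else
    // a tag name unique among the siblings, else :nth-child as a last resort.
    function buildSelector(element, context) {
      const segments = [];
      let node = element;
      while (node && node !== context) {
        if (node.id) {
          segments.unshift('#' + node.id);   // IDs are unique, we can stop
          return segments.join(' > ');
        }
        const parent = node.parentElement;
        if (!parent) break;
        const siblings = [...parent.children];
        const sameTag = siblings.filter(s => s.tagName === node.tagName);
        if (sameTag.length === 1) {
          segments.unshift(node.tagName.toLowerCase());
        } else {
          // always exists, but most fragile w.r.t. structure changes
          const index = siblings.indexOf(node) + 1;
          segments.unshift(node.tagName.toLowerCase() + ':nth-child(' + index + ')');
        }
        node = parent;
      }
      return segments.join(' > ');
    }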
…implementation of those standards. To picture the current state of Linked Data we can take a look at the Linking Open Data cloud diagram [13]. The visualisation¹ contains a node for each ontology and shows the known connections between ontologies. The data originate from datahub.io, a popular web service for hosting semantic data. The current diagram visualises the state of the linked data cloud in April 2014. As we can see, in the center many data resources are linked to DBpedia², the semantic data extracted from Wikipedia. This best describes the notion of Linked Data: when two datasets relate to the same resource, they can be logically linked together through this connection, as this way they state that they relate to the same thing.

Figure 2.2: The Linking Open Data cloud diagram

2.3 RDF and RDFS

RDF is a family of specifications for syntax notations and data serialization formats, metadata modeling, and the vocabulary used for it [14]. We will look closely at the URI (the resource identifier), the vocabularies and semantics defined by RDF, RDFS and OWL, and the serialization into the Turtle and RDF/XML formats.

2.3.1 URI

In order to give each resource a unique identifier, a Uniform Resource Identifier is used. This mostly has the form of a URL as we commonly know it as a web address, e.g. http://www.example.org/some/place/something. This lite…

¹ http://lod-cloud.net/versions/2014-08-30/lod-cloud_colored.svg
² http://dbpedia.org
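Returning to the linking idea above, the following Turtle sketch shows how one dataset can state that its resource denotes the same thing as a DBpedia resource (the ex: resource is invented for illustration; only the DBpedia URI is real):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix ex:  <http://www.example.org/cities/> .

    # a hypothetical dataset links its record to the DBpedia resource,
    # stating that both identify the same real-world entity
    ex:Prague owl:sameAs <http://dbpedia.org/resource/Prague> .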
4.5.1 SOWL model

The current recommendation of the Mozilla Developer Network suggests developing new addons using their native SDK. It allows the creation of restartless addons, uses a new API, and limits the usage of older libraries or low-level calls by wrapping them in a consistent API. The SDK-based addons have a partially predefined structure. The background script runs in its own scope and uses the SDK API to control the addon's behavior. The content script is JavaScript code that is injected into a webpage but runs in its own sandboxed overlay, while having access to the page's DOM and JavaScript content. In SOWL the scenario editor will be placed into a sidebar. The sidebar holds a standard HTML window object in which the JavaScript code is running. All three components communicate via textual messages using the port object offered internally by Firefox; a sketch of this messaging follows below.

Figure 4.2: Component structure of the SOWL Firefox addon

4.5.2 crOWLer model

In the new implementation of the scraping backend the original JSOUP component will be replaced by WebDriver. WebDriver, with its support for JavaScript, will help to handle dynamic content and brings new possibilities for crOWLer itself. The original configuration component is replaced by a parser for the SOWL JSON scenario format. The core of crOWLer is also reimplemented according to the new set of instructions, i.e. the commands in the scenario, and the new web interface, i.e. WebDriver instead of the native JSOUP calls.
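The three-component messaging promised above can be sketched with the Firefox Add-on SDK API roughly as follows; the event name "element-selected" and the file layout are invented for illustration, so this is a sketch of the messaging style rather than SOWL's actual code.

    // main.js (background script): attach a content script and listen on its port
    var pageMod = require("sdk/page-mod");

    pageMod.PageMod({
      include: "*",
      contentScriptFile: "./content.js",
      onAttach: function (worker) {
        // receive textual messages from the content script
        worker.port.on("element-selected", function (selector) {
          console.log("user picked: " + selector);
          // ...forward the message to the sidebar editor here...
        });
      }
    });

    // content.js (injected into the page, sandboxed but with DOM access)
    document.addEventListener("click", function (event) {
      // send a textual message back through the port object
      self.port.emit("element-selected", event.target.tagName);
    }, true);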
…configuration being passed to the actual crawler:

    ClassSpec chObject = Factory.createClassSpec(NPU.monumnetRecord.getURI());
    conf.addInitialDefinition(
        Factory.createInitialDefinition(
            chObject,
            Factory.createJSoupSelector("table tbody tr.list")));

    ClassSpec sDistrict = Factory.createClassSpec(NPU.district.getURI());
    chObject.addSpec(
        Factory.createOPSpec(
            Factory.createJSoupSelector("td:eq(2)"),
            NPU.hasDistrict.getURI(),
            sDistrict));
    sDistrict.addSpec(true, Factory.createDPSpec(Vocabulary.RDFS_LABEL));

This pattern is, with some variation, repeated for all data properties and object properties. The interesting part is how crOWLer handles the detail page link. Just to remind the situation in UC2 (1.2.2): each table row of the page uses a unique onclick attribute of the following form:

    document.listpf.IdReg.value = '71311647';
    document.listpf.submit();

The numerical value IdReg corresponds to the last column of the row and holds the identification number of the national monument in the MonumNet system. As crOWLer handles every page as a static HTML document, there is no way to execute this code as a JavaScript handler. Instead, it is parsed by a regular expression and the result is used to fill in a format string creating a URL. This URL locates the detail page for each table record.

    Factory.createNewDocumentSelector(conf.getEncoding(),
        Factory.createAttributePatternMatchingURLCreator(…, MONUMNET_U…
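The regular-expression step described above can be illustrated as follows (a JavaScript sketch; the extraction pattern follows from the onclick snippet, while the URL template is an assumption, not crOWLer's actual configuration):

    // extract the IdReg value from the onclick attribute and substitute it
    // into a detail-page URL template (the template itself is hypothetical)
    var onclick = "document.listpf.IdReg.value='71311647';document.listpf.submit();";
    var match = onclick.match(/IdReg\.value='(\d+)'/);
    if (match) {
      var detailUrl = "http://monumnet.npu.cz/...?IdReg=" + match[1];
      console.log(detailUrl);
    }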
…automated data extraction based on the described technologies. We focus especially on those that correspond to the required extraction procedure, and we perform a detailed analysis of the particular techniques used in their implementation. For each tool we describe the main area of our interest, together with the benefits and shortcomings it brings. During this analysis we pay special attention to the way the user specifies the rules for data extraction and configures its process. Further, we examine libraries and platforms, semantic as well as non-semantic, that could serve as a basis for the implementation of a prototype of the proposed design. Based on the examined approaches we explore the possibilities of their combination and of their potential improvements. Specifically, we define a scenario format for a semantic data extractor and design tools for the creation of scenarios and for the extraction of data. To support the design, we create and describe a prototype of both tools.

Translation of the title: Platforma pro sémantickou extrakci webu

Abstract

This diploma thesis investigates the topic of semantic data extraction. Its main goal is to design a tool that would simplify the process of annotation and scraping of data from pages on the web. First we define several real-life use cases of the data extraction task as a problem specification and motivation. For each use case we explain what its major challenge is. From all the…
…a powerful way to effectively obtain and query data from the web, but it still has its challenges to overcome.

3.2 Analysis of crOWLer

A thorough analysis of the current program shall precede the creation of the final design. We will focus on the architecture, the dependencies and the components that will have to be reimplemented.

Figure 3.1: General architecture of the original crOWLer implementation

In the original implementation crOWLer is a prototype of a console Java application. It uses the Apache Jena library for handling ontological data and the JSOUP library for accessing webpages and addressing elements. Instead of a scenario file, crOWLer accepts Java class files containing an implementation of the ConfigurationFactory class. This factory class builds a Configuration object. In appendix E you can see the definition of the classes forming the configuration component of crOWLer. The class diagram in appendix F describes the InitialDefinition and the Selector classes that are the main building blocks of the configuration. A configuration defined using this structure specifies all the information needed for the crawling process:

- the webpages to be crawled, in the form of a list or a pagination description,
- a way to address data on each page using JSOUP selectors,
- a definition of the ontology resources used to annotate the obtained data,
- a setting of how the URI will be created for each individual.

FullCrawler, Runner, Crowl…
…document selector creation by wrapping it in a single factory method createLinkTargetSelector, which would internally create a selector for the address targeted by the href attribute of the link tag, either absolute or relative to the current document, so that we could avoid the explicit specification of the URL using the KUB1X_URL constant. If we wanted to get more properties from the resulting page, we would reuse the NewDocumentSelector in combination with a selector targeting the value of each property. crOWLer always relates selectors to the document currently referenced by the outer loop in the FullCrawler. Whenever a selector containing a NewDocumentSelector is applied during the crawling process, a REST call is performed to fetch the targeted document. On the MonumNet webpage this means hundreds of thousands of calls for each run of crOWLer: over 40,000 records, each with 16 properties on the detail page. A caching system can be implemented to reduce this amount to the necessary minimum (see the sketch below); we are still bound by the double-loop architecture, though. UC3 (1.2.3) is equal to UC1 in terms of configuration complexity. All links are implicitly specified in the form of hyperlinks, without any interruption or dynamic content change. Moreover, in the crOWLer configuration we can specify which properties will, combined together, form the URI of the ontological object we are building. This is exactly the additional functionality required by UC3. The specificity of the fourth…
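The caching system mentioned above can be as simple as a map keyed by URL; a minimal sketch follows (in JavaScript for brevity, with fetchDocument standing in for the REST call):

    // memoize fetched documents by URL so that repeated selector
    // applications on the same detail page cost only one REST call
    const documentCache = new Map();

    async function fetchDocumentCached(url, fetchDocument) {
      if (!documentCache.has(url)) {
        documentCache.set(url, await fetchDocument(url));  // first and only fetch
      }
      return documentCache.get(url);
    }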
3.5.1 jQuery

jQuery [24] is a widely used JavaScript library that simplifies general tasks like DOM manipulation or event handling. Simplified selectors can be used to target DOM elements, as jQuery internally uses the Sizzle [25] library for selector handling. Compared to vanilla JavaScript [26], jQuery produces more compact and coherent code. Developers can extend the jQuery library with their own plugins. This is the case for the two most promising JavaScript libraries handling RDF and OWL data, and so jQuery will be necessary if we decide to use either jOWL (3.5.2) or rdfQuery (3.5.3).

¹ https://code.google.com/p/selenium/source/browse/ide/main/src/content/commandBuilders.js (the CommandBuilder implementation)

3.5.2 jOWL

The jOWL library is a jQuery plugin for navigating and visualising OWL and RDFS documents [27]. It can parse and handle RDF files, store them in its internal storage and query them using a subset of the SPARQL-DL language [28]. The library was last updated in 2008.

3.5.3 rdfQuery

rdfQuery [29] is a JavaScript library for RDF-related processing. It supports parsing the RDFa, RDF and OWL formats for loading data, and it can dynamically embed RDFa data into an HTML webpage. rdfQuery is written as a jQuery plugin. The intended use of the rdfQuery library is to write queries over data stored in the rdfQuery internal datastore in a similar way as DOM objects are queried using jQuery. Moreover, the whole concept…
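As an illustration of this query style, here is a small example based on rdfQuery's documented jQuery-plugin API (the data are invented):

    // build an in-memory datastore, add one triple, and query it
    var rdf = $.rdf()
      .prefix('foaf', 'http://xmlns.com/foaf/0.1/')
      .add('<http://example.org/#me> foaf:name "Jane Doe" .')
      .where('?person foaf:name ?name');

    rdf.each(function () {
      console.log(this.name.value);   // "Jane Doe"
    });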
…analysis brings an important message: in many cases we will have to dive into the implementation of the processed webpage to find out how it behaves. In the vast majority of these cases it will require a web developer or coder to correctly and exhaustively define the scraping scenario.

1.2.3 Use Case 3: Air Accidents Investigation Institute

http://www.uzpln.cz/cs/ln_incident

This is a basic use case with a table, a detail page and pagination. Everything is present in a clear HTML form, without any interruption by JavaScript. In this case we might consider replacing repetitive values with an object instance carrying the information. For example, the table shows the column Event type (in the Czech original "Druh události"). It contains the constant values Incident, Flight accident and several more. A resource can be created to denote these types of accidents. The resource corresponding to the string scraped from the table would then be used as the value of an object property instead of the original string literal. The original literal is assigned to this resource as a label. For example, in the Turtle syntax (2.8), we can use:

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix rlp:  <http://kub1x.org/dip/rlp/> .

    <http://kub1x.org/dip/rlp/event/xFuHbjA5>
        a rlp:event ;
        rlp:hasEventType rlp:flightAccident .

    rlp:flightAccident
        rdfs:label "Letecká nehoda"@cs .

Instead of the prefix rl…
Figure 1.6: View of the list page of the Air Accidents Investigation Institute (the list shows, for each event, the report type ("Závěrečná zpráva", final report), the date, the place of the event, the event type ("Letecká nehoda", flight accident) and the type of operation, e.g. "Ostatní" (other) or "Rekreační a sportovní létání" (recreational and sport flying)).

- specifying a pattern for the creation of the URI of each instance,
- adding a language tag to all string values,
- possible usage of a geographical ontology,
- possible usage of an enumeration.

[View of the detail page: "Průvodní formulář k předběžné a závěrečné zprávě" (accompanying form to the preliminary and final report), listing the date of the event (2014-11-02), the report type, the place and type of the event, the weight category (MTOM < 2250 kg), the type of operation, the kind and type of aircraft (SLZ NIMBUS 2), the health consequences, a PDF document and a textual description.]