Documentation for Law Leecher 1.4

Contents

1. [Figure 3: The basic architecture of the program. A visible label in the figure reads: shared variables.]

4.2 Implementation Details

The code is written entirely in Ruby and in English; only the GUI output is German.

4.2.1 Threading

The Fetcher holds a list of IDs of the laws to be processed and acts as a thread dispatcher. For each ID, a new thread is started. However, there is only a fixed number of free slots, i.e., a maximum number of threads. If all threads are busy, no new thread is started. When fewer than the specified number of threads are running and law IDs are still left, the Fetcher removes the first ID from the list and starts a new thread (a ParserThread) that processes this ID (retrieval and parsing). Most of the time, all threads are busy. In that case, the dispatcher thread iterates over the list of running threads to check whether one of them has finished. If a thread has finished, its return value (a big hash array) is added to the result list. There is no mutex mechanism that would allow threads to save information into a central variable individually.

Sometimes parsing fails due to HTTP errors. Ruby's retry statement did not always help and, because of the following error, is not used. Moreover, inserting parsed laws into the results array is buggy: instead of one law, another is inserted, which then exists twice, and the first law disappears. To handle this, thread invocation takes place in two nested loops. In the inner loop, threads are started. At the end of this lo
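The dispatching scheme described in Section 4.2.1 can be sketched roughly as follows. This is a minimal illustration written for a current Ruby version, not the Fetcher's actual code: the names dispatch and fetch_and_parse are hypothetical, the retrieval step is replaced by a stub, and the retry loops for failed laws are left out.

```ruby
# Hypothetical stand-in for retrieving and parsing one law page.
def fetch_and_parse(id)
  { "ID" => id }
end

# Start one worker thread per ID, but never more than max_threads at once.
def dispatch(ids, max_threads)
  pending = ids.dup
  running = []
  results = []

  until pending.empty? && running.empty?
    # Fill free slots while IDs are left.
    while running.size < max_threads && !pending.empty?
      id = pending.shift
      running << Thread.new(id) { |law_id| fetch_and_parse(law_id) }
    end

    # Collect finished threads. Only the dispatcher appends to the
    # result list, so no mutex is needed here.
    finished, running = running.partition { |thread| !thread.alive? }
    finished.each { |thread| results << thread.value }

    # Avoid busy waiting when every slot is still occupied.
    sleep 0.01 if finished.empty?
  end

  results
end

laws = dispatch([101, 102, 103, 104, 105], 2)
```

Because each worker hands back its result as the thread's return value, the workers never write to shared state themselves, which matches the design choice of working without a mutex.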
2. Rectangle D is characterized by a green background color. It always contains four key-value pairs. The keys, which are on green background, are Fields of activity, Legal basis, Procedures, and Type of file. Values are on gray background; some of the laws do not possess a value, others contain line breaks, colons, etc. Values are always transformed to simple strings with all HTML markup removed. The names under which the information is saved are greenbox FieldsOfActivity, greenbox LegalBasis, greenbox Procedures, and greenbox TypeOfFile.

Note: In this document, the entries are called laws even if not every one of them evolved into an adopted law.

Rectangle E is the first table. All key-value pairs are read out, for example firstbox Responsible or firstbox LegalBasis. Values for the Documents key are split and joined with a delimiter.

Rectangle F is the last table. It contains several key-value pairs, but only the values of Documents, Procedures, Type of file, and NUMERO CELEX are taken, as lastbox Documents, lastbox Procedures, lastbox TypeOfFile, and lastbox NumeroCelex.

Black rectangles are used differently. As mentioned above, each table consists of a header row, which contains a date stamp and a title, and optionally a second row with some key-value pairs. These tables are to be read out while grabbing each table's date stamp, the
3. [Figure 4: Benchmark for the average runtime in seconds among three runs for close to 3000 laws. The x-axis shows the number of threads employed (1 to 30).]

With more than 8 threads, there is no substantial decrease in runtime. However, this number may depend on the total number of laws. Therefore, the default thread number is set to 20.

4.4 Pitfalls

4.4.1 Interpreter

The default Ruby interpreter is currently able to run the full program. However, the JRuby interpreter runs into memory problems with the current implementation. These problems appeared on a 2 GHz dual-core, 4 GB RAM, Ubuntu 9.04 64-bit machine with the JRuby interpreter set to -Xms32m -Xmx2048m and the -XX option UseGCOverheadLimit.

4.4.2 Output

Multi-threaded programming is hard. One symptom is that puts does not always work properly. It consists of writing the message and then a \n. If these two steps are separated by thread scheduling, the output looks odd. Therefore, print, which does not append a line break, is used in all threaded sections; the \n has to be included explicitly in the message. Furthermore, due to threading, some IDs are dispatched multiple times. The output would therefore contain exact duplicates (about 1% of the total number), which are removed explicitly after all laws have been retrieved.

4.4.3 Encoding

All
4. apcnet.cfm?CL=en. This web page contains about 30,000 laws (as of March 2010). To run statistical analyses on this data, it first has to be extracted from this web database. To retrieve this structured data, a tool named Law Leecher has been developed.

This document fulfills three purposes:
- It shows which information is extracted.
- It is a user manual.
- It serves as source code documentation.

2 Extracted Information

The details of each law are contained on that law's web page. Figure 1 shows one of these pages and the pieces of information that are relevant for the crawling. It is annotated with red and black rectangles. The web page always contains some meta information on top, a greenish area below, and finally a number of directly stacked tables that describe the progress of the discussion about the law as a timeline. Each of these tables contains one row with a date stamp and a title and, optionally, another row with key-value pairs. Generally, all HTML markup (e.g., hyperlinks, line breaks, or text formatting) is removed from the extracted values, and only plain text is saved.

Rectangle A contains information that is retrieved as is and saved under the name bluebox UpperLeftIdentifier.

Rectangle B contains information that is retrieved as is and saved under the name bluebox UpperCenterIdentifier.

Rectangle C contains information that is retrieved as is and saved under the name bluebox ShortDescription.
5. source files (*.rb) are encoded in UTF-8 without the byte order mark (BOM). Ruby 1.9 has problems with files that actually contain non-ASCII characters such as German umlauts. However, inserting the BOM at the beginning of those (or all) files causes Ruby 1.8.7 installations to stop working (under Windows, at least), since they try to interpret the BOM as code. The workaround is simply to stick to Ruby 1.8.7, where GTK2 also works.
6. title and, if it exists, the value of the key named Decision or Decision mode. The title is not saved explicitly. Instead, it serves as the prefix. In the figure's first table (which is firstbox at the same time), this would be Adoption by Commission, with the suffix date and the value 01 10 2004, and the suffix decision with the value Written procedure. The difference between Decision and Decision mode is irrelevant, so the suffix is always named decision.

However, the prefixes are not necessarily unique within one law. That is why they are extended by three-digit numbers, starting with 001 for each title, even if this specific title occurs only once in the law. They are numbered from top to bottom. In the figure, this rule creates the following keys: Adoption by Commission001 date and Adoption by Commission001 decision.

Further pieces of information are the law's type (Type) and the law's ID (ID). The ID is taken from the HTTP request; it does not appear on the page. The type can be derived from bluebox UpperCenterIdentifier: it is the abbreviation after the second slash and can have the values AVC, COD, SYN, or CNS. If the type is different or not stated in this area, it is put into the set of extracted data, but without a value.
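The key numbering and the type derivation described above can be illustrated with a short sketch. It is not taken from the Law Leecher sources: the method names, the input layout, and the sample dates are hypothetical, and the dot used to join prefix and suffix is an assumption, because the separator character is not legible in this copy.

```ruby
KNOWN_TYPES = %w[AVC COD SYN CNS].freeze

# Build timeline keys: title, three-digit counter per title, suffix.
# Assumption: prefix and suffix are joined with ".".
def timeline_keys(tables)
  counters = Hash.new(0)
  record = {}
  tables.each do |table| # tables are expected in top-to-bottom order
    counters[table[:title]] += 1
    prefix = format("%s%03d", table[:title], counters[table[:title]])
    record["#{prefix}.date"] = table[:date]
    record["#{prefix}.decision"] = table[:decision]
  end
  record
end

# The type is the abbreviation after the second slash.
# Unknown or missing types yield nil, i.e. a key without a value.
def law_type(upper_center_identifier)
  candidate = upper_center_identifier.to_s.split("/")[2]
  KNOWN_TYPES.include?(candidate) ? candidate : nil
end

# Hypothetical sample input; the date format is assumed.
tables = [
  { title: "Adoption by Commission", date: "01-10-2004", decision: "Written procedure" },
  { title: "Transmission to Council", date: "01-10-2004", decision: nil }
]
keys = timeline_keys(tables)
type = law_type("2005/0215/CNS")
```

Keeping one counter per title means that a title occurring only once still receives the suffix 001, as the rule above requires.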
7. Documentation for Law Leecher 1.4

Tobias Vogel (tobias vogel name)
July 1, 2010

This documentation comprises the goal, installation instructions, usage manual, and implementation details for Law Leecher, a tool for retrieving data from the PreLex database. Law Leecher is published under the 3-clause BSD license (see http://en.wikipedia.org/wiki/BSD_licenses, section Terms).

Contents

1 Preface
2 Extracted Information
3 Installation and Usage
  3.1 Installation
    3.1.1 Linux
    3.1.2 Windows
  3.2 Usage
    3.2.1 Graphical User Interface
    3.2.2 Command line client
4 Implementation Details
  4.1 Architecture
  4.2 Implementation Details
    4.2.1 Threading
    4.2.2 GUI Callbacks
    4.2.3 Regular Expressions
    4.2.4 Unicode
    4.2.5 GUI Implementation
    4.2.6 Default Values
  4.3 Benchmark
  4.4 Pitfalls
    4.4.1 Interpreter
    4.4.2 Output
    4.4.3 Encoding

1 Preface

The decision-making process between institutions in the European Union is documented and freely available under http://ec.europa.eu/prelex/
8. SV files in such a way. This translation has been implemented in the Saver's convertUTF8ToANSI method, which simply returns the converted string. Latin-1 is hardcoded into this method in the call to the iconv library; however, it is just a string and can be replaced by other formats. Characters that do not fit into ANSI are simply removed.

4.2.5 GUI Implementation

The GUI contains a description of all widgets in the window. It is programmed with GTK2. It also connects the widgets with the appropriate functions in the program. The GUI is kept responsive by cooperative multitasking: from time to time (more exactly, at the beginning of each law's processing, via the informUser method), the method updateWidgets is called. The provided hash contains the information to update, for example the progress bar or the status message. Afterward, a loop that handles pending events is executed, which allows the window to be moved and the recently edited text to be redrawn.

4.2.6 Default Values

The Configuration class contains global variables and their getters and setters, which are used throughout the program.

4.3 Benchmark

To find the optimal number of worker threads and to evaluate the usefulness of threads, a benchmark was run. Nearly 3000 laws were retrieved with 1 to 30 threads, and each run was repeated twice (i.e., three passes for each number of threads) to reduce bias. Figure 4 shows the results.
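The UTF-8 to Latin-1 conversion described at the beginning of this part (Section 4.2.4) relies on the iconv library, which no longer ships with current Ruby versions. The following sketch shows how the same behavior could be approximated today with String#encode. It is an assumption about an equivalent, not the code of convertUTF8ToANSI; the method name and the target parameter are hypothetical.

```ruby
# Convert UTF-8 text to Latin-1 and drop every character that Latin-1
# cannot represent. The target encoding is just a string and could be
# replaced by another encoding name.
def convert_utf8_to_latin1(text, target = "ISO-8859-1")
  text.encode(target, "UTF-8", invalid: :replace, undef: :replace, replace: "")
end

# "Grüße €5", written with escapes so the example does not depend on
# the encoding of the source file. The euro sign has no Latin-1 code point.
converted = convert_utf8_to_latin1("Gr\u00FC\u00DFe \u20AC5")
```

The umlaut and the sharp s survive as single Latin-1 bytes, while the euro sign is silently removed, which mirrors the "simply removed" behavior stated above.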
9. cher crawls laws from this start year to the current year. The default start year is 1969.

numberofthreads number
    Law Leecher is multi-threaded (cf. Section 4.2.1). By default, it uses 20 worker threads to retrieve and parse law web pages.

filename filename
    The default output file is called export.csv and is placed in the directory where main.rb is located. You can change it here.

overwriteexistingfile
    Use this option to allow Law Leecher to overwrite an existing output file.

loglevel quiet | default | verbose
    Different degrees of verbosity for stdout can be set. verbose reports when each law's parsing procedure is initiated and ended, etc. quiet does not give any output. default is the default and gives a few statements about the progress. Warnings are always printed to stderr, regardless of the log level. To get rid of those, start the program with ruby main.rb 2> /dev/null (Linux) or ruby main.rb 2> nul (Windows).

When you call the program without nogui, the GUI is started and all command line parameters are ignored.

4 Implementation Details

This section describes the architecture of the program. It does not go into too much detail, since the code is well documented. Instead, some important aspects are emphasized and the main functionality of the existing classes is explained.

4.1 Architecture

Figure 3 shows the architecture of the program. The user starts
10. ding001: { date: 12 04 05, decision: Partial agreement },
    Council agreement001: { date: 02 06 05, decision: null },
    EP opinion single rdg001: { date: 12 04 05, decision: Approval with amendments },
    Formal adoption by Council001: { date: 19 09 05, decision: null },
    Transmission to Council001: { date: 01 10 04, decision: null },
    Transmission to EP001: { date: 01 10 04, decision: null }

3 Installation and Usage

3.1 Installation

3.1.1 Linux

Install the following packages: ruby, ruby-gnome2. It might be useful to additionally install irb and to get the Ruby NetBeans IDE from http://netbeans.org/download. To install the packages, type:

    sudo apt-get install ruby irb ruby-gnome2 netbeans

3.1.2 Windows

Follow these steps to install Law Leecher on Windows (tested on Windows XP and Windows 7):

1. Download the One-Click Installer from http://rubyinstaller.org to install Ruby. If you intend to use the GUI, you should choose Ruby 1.8.7, because version 1.9 seems to be incompatible. Install it in the proposed directory (this will be c:\ruby or similar). Check all checkboxes the wizard presents to you.
2. Install the toolkit that provides the GUI. It is named GTK2. Download it from http://prdownloads.sourceforge.net/ruby-gnome2/ruby-gnome2-0.16.0-1-i386-mswin32.exe?download. When following the wizard, correct the path of the Ruby installation in case there are special characters at the end of the
11. [Figure 1: Website content of law 191763 with annotated relevant sections (screenshot of the PreLex page for 2005/0215/CNS).]

If the list of tables just c
12. ontains one table, this is not handled differently, and thus the values are extracted twice.

The full retrieved record for this law would be as follows (noted in a JSON-like notation):

    ID: 191763
    Type: CNS
    bluebox: {
      UpperLeftIdentifier: COM 2004 623
      UpperCenterIdentifier: 2005/0215/CNS
      ShortDescription: Council Decision 2005/681/JHA of 20 September 2005 establishing t
    }
    greenbox: {
      FieldsOfActivity: Justice and Home Affairs
      LegalBasis: Commission Traité CE art 30 par 1 34 par 2
      Procedures: Commission Consultation procedure Council Consultation procedure
      TypeOfFile:
    }
    lastbox: {
      Documents: CS 2005 10040 CS 2005 12242 PRES 2005 222 OJ L 2005 256 63 CS 20
      Procedures: Consultation procedure
      TypeOfFile: Decision
      NumeroCelex: null
    }
    firstbox: {
      Addressee for formal act: Council
      Decision mode: Written procedure
      Documents: OJ C 2004 323 4 Bulletin 2004 10 1 4 12 COM 2004 623 FINAL IP 200
      Legal basis: Traité CE art 30 par 1 34 par 2
      NUMERO CELEX: 52004PC06023
      Optional consultation: European Parliament
      Primarily responsible: DG Justice and Home Affairs
      Commission Proposal for a Decision Council Decision
      Procedures: Consultation procedure
      Responsible: Antonio VITORINO
      Type of file: Proposal for a Decision
    }
    timeline: {
      Adoption by Commission001: { date: 01 10 04, decision: Written procedure },
      Commission position on EP amendments on single rea
13. op all successfully retrieved laws are taken from the list of IDs. Unsuccessfully parsed laws are not taken from the ID list. If there have been unsuccessful laws, the outer loop is passed again and again with the unsuccessfully parsed laws until finally all laws are parsed.

4.2.2 GUI Callbacks

The system is designed to work with and without a GUI. To provide status information to the GUI, callbacks are used. The Core contains a method named callback, which receives textual information. This information is forwarded to the updateWidgets method of the GUI class. Information that is not intended to be sent to the GUI is simply printed with print (cf. Section 4.4.2).

4.2.3 Regular Expressions

Ruby 1.8 does not support variable-length lookbehinds. To overcome this deficit, the ParserThread class has the method parseSimple, which takes two patterns and the string to apply the regular expressions to. The first pattern contains the desired lookbehind without the ?<= specification. The second pattern should start with the part that matches the desired substring; after that, it may contain an arbitrary lookahead pattern. The method first extracts the concatenation of the first and the second pattern from the string and then replaces the first pattern with an empty string in the intermediate result.

4.2.4 Unicode

The text on the website is provided in Unicode (UTF-8). It has to be translated into ANSI (Latin-1) because Excel interprets C
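The lookbehind emulation described in Section 4.2.3 can be sketched as follows. This is a minimal illustration of the two-step technique (match both patterns together, then strip the first one), not the actual parseSimple method: the name parse_simple, the string-based pattern arguments, and the sample HTML are assumptions.

```ruby
# Emulate a (possibly variable-length) lookbehind in two steps.
# behind: pattern that must precede the wanted text (without "?<=")
# wanted: pattern for the wanted text, optionally followed by a lookahead
def parse_simple(behind, wanted, text)
  # Step 1: extract the concatenation of both patterns.
  intermediate = text[Regexp.new("(?:#{behind})#{wanted}")]
  return nil if intermediate.nil?

  # Step 2: remove the leading "lookbehind" part from the intermediate result.
  intermediate.sub(Regexp.new("\\A(?:#{behind})"), "")
end

# Hypothetical sample markup.
html = "<td>Legal basis:</td><td>Art. 30</td>"
value = parse_simple("Legal basis:</td><td>", "[^<]+(?=</td>)", html)
```

Wrapping the first pattern in a non-capturing group keeps an alternation inside it from swallowing the second pattern when the two are concatenated.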
14. path string. In case of problems, follow the instructions on http://ruby-gnome2.sourceforge.jp/hiki.cgi?Install+Guide+for+Windows.

3.2 Usage

Law Leecher provides both an easy-to-use GUI (in German) and a powerful command line interface.

3.2.1 Graphical User Interface

[Figure 2: The graphical user interface of the program. Visible labels: Dateiname (file name), Durchsuchen (browse), Vorhandene Datei ggfs. überschreiben (overwrite existing file if necessary), and a start button.]

Start the program by invoking main.rb. The window depicted in Figure 2 should become visible. In the input area, type in the path and the file name of the output file. You can use the button on the right to select it by browsing the file system. Check the check box under the input to overwrite a possibly existing file. If a file exists there and you did not check the check box, you will get a warning message before the process starts. Press the start button to start the process. You can only abort it by closing the window, which might take a while. Law Leecher takes about 8 minutes to retrieve 1000 laws on an average DSL connection.

3.2.2 Command line client

Law Leecher offers a command line client for batch processing. It can be called via ruby main.rb nogui as the base and any combination of the following optional parameters. The notation is parameter value, except for parameters which are simple flags.

startyear year
    Law Lee
15. the program Core, which controls all components and additionally enables the user to communicate with the program via the GUI. The program's task is to retrieve laws from the Internet and to save them on the disk. This is done by the Fetcher and the Saver. In the beginning, the Core starts the Fetcher to fetch all IDs of the laws to retrieve. The IDs are saved in the lawIDs variable. Next, the Core calls the Fetcher again to retrieve and parse the individual web pages. The result is written into the laws variable. To speed up the retrieval process, the Fetcher uses several threads for retrieval and scraping. Finally, the Saver writes all the laws to the file system.

Because the laws have partially different keys (e.g., depending on the number of tables in them), all keys from all laws are taken to create a huge sparse table in which each column is populated with at least one value of one law, but many columns stay mostly empty. To get all the column headers, timelineTitles and firstboxKeys are filled by the Fetcher, where the keys are sorted and rewritten according to the numbering scheme described in Section 2.

The four depicted variables can only be written by the Core. Gray dashed arrows illustrate which component conceptually writes to or reads from them. Each agent is implemented within a class of the same name. The starting method is located within main.rb, which is not a class. GUI and Core are singletons.
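How records with differing keys end up in one sparse table can be illustrated with a small sketch. It is a simplified assumption, not the Saver's code: the method name sparse_table is hypothetical, the sample records and their key separator are made up, and the columns are simply sorted alphabetically instead of following the Fetcher's own ordering.

```ruby
# Build one table from law records whose key sets differ.
# Returns the header row and one row per law; missing values are nil.
def sparse_table(laws)
  header = laws.flat_map(&:keys).uniq.sort
  rows = laws.map { |law| header.map { |key| law[key] } }
  [header, rows]
end

# Hypothetical sample records.
laws = [
  { "ID" => 1, "Type" => "CNS", "Adoption by Commission001.date" => "01-10-2004" },
  { "ID" => 2, "Type" => "COD", "Transmission to EP001.date" => "05-03-2001" }
]
header, rows = sparse_table(laws)
```

Every column of the result is filled by at least one law, while laws that lack a key contribute an empty cell, which is what makes the table sparse.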
