Term Report PDF - VTechWorks
Contents
1. [Figure: a Firefox window on the Cloudera QuickStart VM showing the Hadoop NodeManager/ResourceManager status pages (Hadoop version 2.5.0-cdh5.3.0, node healthy, 8 GB Pmem and 16.80 GB Vmem allocated for containers).] Figure 14: Hadoop is running on the virtual machine. 5. Then we install Mahout on CentOS using the command sudo yum install mahout. 6. Use the executable /usr/bin/mahout to run our analysis. Now we've insta
2.
    __profanity_words = f.read()[:-1].split('\n')
f.close()

def cleanblankspaces(clean_content):
    # several re.sub() passes that normalize stray punctuation and repeated blank spaces
    clean_content = re.sub(r'\s+', ' ', clean_content)
    clean_content = re.sub(r'^\s+', '', clean_content)
    clean_content = re.sub(r'\s+$', '', clean_content)
    return clean_content

def webcontentcleanup(content):
    UrlRegexp = r'(?P<url>https?://[a-zA-Z0-9./?=&_%:~+#-]*)'
    ProfanityRegexp = re.compile(r'(?<![a-zA-Z0-9])(' + '|'.join(__profanity_words) + r')\W',
                                 re.IGNORECASE)
    url_list = re.findall(UrlRegexp, content)
    for url in url_list:
        content = content.replace(url, ' ')
    clean_content_only = ' '.join(word for word in content.split() if word.lower() not in stop_word)
    clean_content_2 = re.sub(ProfanityRegexp, ' ', clean_content_only)
    clean_content_only = re.sub(ProfanityRegexp, 'profanity ', clean_content_only)
    clean_content_2 = ' '.join(st.stem(word) for word in clean_content_2.split())
    clean_content_only = cleanblankspaces(clean_content_only)
    clean_content_2 = re.sub(r'[^\s\w_]', ' ', clean_content_2)
    clean_content_2 = cleanblankspaces(clean_content_2)
    url_list_str = '|'.join(url_list)
    return clean_content_only, clean_content_2, url_list_str

def avrocleaning(filename1, filename2, filename3, doc_id):
    try:
        InFile = open(filename1, 'r')
        OutFile = open(filename2, 'w')
        SchemaFile = open(filename3, 'r')
        fp = open('test.dat
3. [Figure: Cloudera Manager "Command Progress" screen, 6 of 24 steps completed: ZooKeeper service initialized and started, NameNode name directories checked and HDFS formatted, HDFS service started, HDFS /tmp directory created, HBase root directory created, HBase service starting, MR2 job history and NodeManager remote application log directories pending.] Figure 15: Hadoop installation. 18. Here we skipped several tiny issues that may cause problems, since they can be easily solved by looking up the Cloudera Manager documentation. 19. Run the PiTest and WordCount programs to examine whether the installation is successful. 19 References [1] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python. O'Reilly Media, 2009. [2] NLTK Project. NLTK 3.0 documentation. http://www.nltk.org/, accessed on 02/05/2015. [3] Oracle. Java SE Development Kit 7 Downloads. http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html, accessed on 02/05/2015. [4] The Apache Software Foundation. Solr Download. http://lucene.apach
4.
        w')
    except IOError:
        print('please check the filenames in arguments')
        return
    raw_text_all = InFile.read().decode('utf8')
    InFile.close()
    schema = avro.schema.parse(SchemaFile.read())
    writer = DataFileWriter(OutFile, DatumWriter(), schema)
    regex_raw_webpage = re.compile('url:.*?contentType:.*?Content:.*?Version: 1', re.DOTALL)
    regex_webpage = re.compile('Content:.*?Version: 1', re.DOTALL)
    regex_url = re.compile(r'url: (http\S*)')
    regex_contentType = re.compile(r'contentType: (\S*)')
    regex_rubbish = re.compile('http.*?Version: 1')
    webpages = re.findall(regex_raw_webpage, raw_text_all)   # FIND THE LAST WEBPAGE, Version 1
    clean_webpage_count = 0
    html_file_count = 0
    contentTypeAll = {}
    languageAll = {}
    for raw_text in webpages:
        url = re.findall(regex_url, raw_text)[0].strip()
        contentType = re.findall(regex_contentType, raw_text)[0].strip()
        if contentType not in contentTypeAll:
            contentTypeAll[contentType] = 1
        else:
            contentTypeAll[contentType] += 1
        if contentType.find('html') < 0:
            continue
        html_file_count += 1
        raw_text = re.findall(regex_webpage, raw_text)[0]
        raw_text = re.sub(regex_rubbish, ' ', raw_text)
        readable_article = Document(raw_text).summary()
        readable_title = Document(raw_text).short_title()
        readable_title = ''.join(i if ord(i) < 128 else ' ' for i in readable_title)
        url = ''.join(i for i in url if ord(i) < 128)
        url = url.decode('utf8')
5. accessed on 02/05/2015. [23] Google. List of bad words. http://fffff.at/googles-official-list-of-bad-words/, accessed on 05/01/2015. Appendix A: AVRO Schema for Tweets and Web Pages. AVRO schema for tweet collections:
{"namespace": "cs5604.tweet.NoiseReduction",
 "type": "record",
 "name": "TweetNoiseReduction",
 "fields": [
   {"name": "doc_id", "type": "string", "doc": "original"},
   {"name": "tweet_id", "type": "string", "doc": "original"},
   {"name": "text_clean", "type": "string", "doc": "original"},
   {"name": "text_original", "type": "string", "doc": "original"},
   {"name": "created_at", "type": "string", "doc": "original"},
   {"name": "user_screen_name", "type": "string", "doc": "original"},
   {"name": "user_id", "type": ["string", "null"], "doc": "original"},
   {"name": "source", "type": ["string", "null"], "doc": "original"},
   {"name": "lang", "type": ["string", "null"], "doc": "original"},
   {"name": "favorite_count", "type": ["int", "null"], "doc": "original"},
   {"name": "retweet_count", "type": ["int", "null"], "doc": "original"},
   {"name": "contributors_id", "type": ["string", "null"], "doc": "original"},
   {"name": "coordinates", "type": ["string", "null"], "doc": "original"},
   {"name": "urls", "type": ["string", "null"], "doc": "original"},
   {"name": "hashtags", "type": ["string", "null"], "doc": "original"},
   {"name": "user_mentions_id", "type": ["string", "null"], "doc": "original"},
   {"name": "in_reply_to_user_id", "type": ["string", "null"], "doc": "original"},
6.
    for userhandle in Userhandlelist:
        tweet = tweet.replace(userhandle, ' ')
    url_list = re.findall(UrlRegexp, tweet)
    for url in url_list:
        tweet = tweet.replace(url, ' ')
    tweet = re.sub(r'[^\s\w_]', ' ', tweet)
    # task 5 included
    clean_tweet_only = tweet
    for hashtag in Hashtaglist:
        tweet = tweet + ' ' + hashtag
    for userhandle in Userhandlelist:
        tweet = tweet + ' ' + userhandle
    # task 3: validating url
    ValidUrlRegexp = re.compile(
        r'^(?:http|ftp)s?://'                                               # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'     # domain
        r'localhost|'                                                        # localhost
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'                               # ip
        r'(?::\d+)?'                                                         # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    for url in url_list:
        if ValidUrlRegexp.match(url) and validate(url):
            tweet = tweet + ' ' + url
    tweet = re.sub(' +', ' ', tweet)
    if len(clean_tweet_only) == 0:
        return None, None, None, None, None, None
    clean_tweet_1 = ''.join(w if ord(w) < 128 else ' ' for w in clean_tweet_only)
    clean_tweet_only = ' '.join(word for word in clean_tweet_only.split() if word.lower() not in stop_word)
    ProfanityRegexp = re.compile(r'(?<![a-zA-Z0-9])(' + '|'.join(__profanity_words) + r')\W', re.IGNORECASE)
    clean_tweet_only = re.sub(' +', ' ', clean_tweet_only)
7. -libjars apache-nutch-1.9.jar -D mapred.reduce.tasks=0 -input /user/cs5604s15_cluster/ebola_S_webpages/ebola_S_Crawl/segments/20150413181432/content/part-00000/data -inputformat SequenceFileAsTextInputFormat -output /user/cs5604s15_noise/clustering_webpage_small -mapper org.apache.hadoop.mapred.lib.IdentityMapper
The above script takes as input the web pages for the clustering team's collection. 2. Run the Python script to clean the text files: python webpageclean.py clustering_webpage_data_small/text_data_clustering_webpage_small/part-00000 WEBPAGES_CLEAN/clustering_small_00000_v2 webpage_new.avsc ebola_S
The above command is of the format: python <Python script file> <input file> <output file> <AVRO schema for webpages> <collection name>. 3. Load the output into HDFS: hadoop fs -put WEBPAGES_CLEAN/clustering_small_00000_v2 /user/cs5604s15_noise/WEBPAGES_CLEAN
All the small and large web page collections have been cleaned up, and they are available under the /user/cs5604s15_noise/WEBPAGES_CLEAN folder. One can check the cleaned collections with the command hadoop fs -ls /user/cs5604s15_noise/WEBPAGES_CLEAN, as shown below. [Terminal output: hadoop fs -ls /user/cs5604s15_noise/WEBPAGES_CLEAN reports "Found 30 items" followed by the file permission and size columns.]
8. {"name": "in_reply_to_status_id", "type": ["string", "null"], "doc": "original"}]}
AVRO schema for web pages:
{"namespace": "cs5604.webpage.NoiseReduction",
 "type": "record",
 "name": "WebpageNoiseReduction",
 "fields": [
   {"name": "doc_id", "type": "string", "doc": "original"},
   {"name": "text_clean", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "text_original", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "created_at", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "accessed_at", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "author", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "subtitle", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "section", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "lang", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "coordinates", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "urls", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "content_type", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "text_clean2", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "collection",
9.
             u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then',
             u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both',
             u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u'not',
             u'only', u'own', u'same', u'so', u'than', u'too', u'very', u's', u't', u'can', u'will',
             u'just', u'don', u'should']
with open('profanity_en.txt') as f:
    __profanity_words = f.read()[:-1].split('\n')

def validate(url):
    # task 4
    return 1

def cleanup(tweet):
    # task 1: remove emoticons
    try:
        # Wide UCS-4 build
        EmoRegexp = re.compile(u'['
                               u'\U0001F300-\U0001F64F'
                               u'\U0001F680-\U0001F6FF'
                               u'\u2600-\u26FF\u2700-\u27BF]+', re.UNICODE)
    except re.error:
        # Narrow UCS-2 build
        EmoRegexp = re.compile(u'('
                               u'\ud83c[\udf00-\udfff]|'
                               u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
                               u'[\u2600-\u26FF\u2700-\u27BF])+', re.UNICODE)
    tweet = EmoRegexp.sub(' ', tweet)
    tweet = re.sub(r'\s+', ' ', tweet)
    tweet = re.sub(r'^\s+', '', tweet)
    # task 2: remove non-alphanumeric characters
    HashtagRegexp = r'(?<![A-Za-z0-9])#[A-Za-z_][A-Za-z0-9_]*'
    UserhandleRegexp = r'(?<![A-Za-z0-9])@[A-Za-z_][A-Za-z0-9_]*'
    UrlRegexp = r'(?P<url>https?://[a-zA-Z0-9./?=&_%:~+#-]*)'
    Hashtaglist = re.findall(HashtagRegexp, tweet)
    for hashtag in Hashtaglist:
        tweet = tweet.replace(hashtag, ' ')
    Userhandlelist = re.findall(UserhandleRegexp, tweet)
10. Figures
Figure 1: Framework of noise reduction within the whole system context
Figure 2: Example structure format of a tweet
Figure 3: Example structure format of a web page
Figure 4: Example of a cleaned tweet
Figure 5: Overview of tweet cleanup implementation
Figure 6: Overview of web page cleanup implementation
Figure 7: The size of cleaned tweet collections on HDFS
Figure 8: The size of cleaned web page collections on HDFS
Figure 9: Total amount of URLs in the small shooting collection
Figure 10: Output from crawling web pages
Figure 11: Result after running the Python script
Figure 12: Screenshot displaying the Shooting collection
Figure 13: Import the Hadoop virtual machine into VirtualBox
Figure 14: Hadoop is running on the virtual machine
Figure 15: Hadoop installation
1 Executive Summary The corpora for which we are
11. To get a better understanding of the IDEAL project and information retrieval systems, we want to install and practice with Solr on our local machines. We installed Solr on a PC with Windows 7 and indexed the collection CSV file by following procedures based on the Solr Quickstart Tutorial [6]. 1. Download the Java version 1.7 64-bit installation file jdk-7u75-windows-x64.exe from the Oracle website [3]. Install Java and modify the related environment variables PATH, JAVA_HOME, and CLASSPATH. Download the Solr installation file solr-4.10.3.zip from the Apache Solr website [4] and unzip it in D:\Solr. Under the command line tool cmd.exe, enter the Solr directory and run the command bin\solr.cmd start -e cloud -noprompt to launch Solr. Check if Solr is running by visiting http://localhost:8983/solr with a web browser. Edit the environment variable CLASSPATH and add D:\Solr\dist\solr-core-4.10.3.jar. Copy the collection CSV file and the web pages just crawled into the directory D:\Solr\data. 8. Index the small collection by running the command curl "http://localhost:8983/solr/update?commit=true&separator=%09" -H "Content-type: application/csv" --data-binary @z540t.csv in the command line tool. 9. Index the web pages by running the command java -Dauto org.apache.solr.util.SimplePostTool data.txt. 10. Check if the collection has been indexed by visiting the link http://localhost:8983/solr/browse.
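For readers who prefer to script step 8, the same CSV update request can be issued from Python with the requests package. This is only a rough sketch, not a prescribed procedure; it assumes Solr is running locally on port 8983 as above and that z540t.csv is in the current directory:

import requests

# post the tab-separated collection file to Solr's CSV update handler;
# passing an actual tab as the separator is equivalent to separator=%09 in the curl command
with open('z540t.csv', 'rb') as f:
    resp = requests.post('http://localhost:8983/solr/update',
                         params={'commit': 'true', 'separator': '\t'},
                         headers={'Content-type': 'application/csv'},
                         data=f)
print(resp.status_code)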
12. [Figure: surrounding noise on the example web page, such as advertisements, reader offers and promotions, Crossword and Word Scrambler links, and a "Find us on Facebook / London Evening Standard (357,803 people like London Evening Standard) / Download our new and improved FREE app" widget.] Figure 3: Example structure format of a web page. 9 Implementation Strategy There are various languages and frameworks that enable us to implement our design, and more specifically the aforementioned tasks of text processing and document cleanup. Based on the project requirements and our deliverables, we have identified the following constraints and considerations that our implementation strategy has to satisfy: 1. The coding language/framework should have libraries/packages for Natural Language Processing for text cleanup. 2. The coding language/framework should have libraries/packages that are specifically designed to clean up HTML/XML content. 3. The coding language/framework should be extensible to run on a Hadoop cluster with little modification. Taking all of the above into consideration, we have decided that our implementation will be done in Python. We plan to develop a stand-alone Python script to clean the collections. 10 Tools and Frameworks Following are the tools and frameworks that we plan to leverage as a part of our system framework to achieve our goal: 1. Python script to unshorten shortened URLs (provided by Mohamed): a. There are many shortened URLs in the tweet
13. [Terminal output: urllib3 InsecureRequestWarning messages from anaconda/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py ("certificate verification is strongly advised", see https://urllib3.readthedocs.io), a <class 'requests.exceptions.ReadTimeout'> for http://t.co/OCTjV7kcK2, and the summary "Unique Orig URLs expanded: 5, Bad URLs: 6, Webpages text saved".] Figure 10: Output from crawling web pages. In total, 85 unique web pages were successfully collected and stored within the same directory as the script. [Finder listing: 78.txt through 85.txt, seedsURLs_z540t.txt, short_origURLsMapping_z540t.txt, shortURLs_z540t.txt (1.1 MB), tweet_URL_archivingFile.py, and z540t.csv (8.3 MB).] Figure 11: Result after running the Python script. 18.2 Loading the Web Page Collection to HDFS Please execute the following instructions in order to load the web page collection into HDFS: 1. Download the Shooting (Charlie Hebdo) collection, called z540t.csv, at http://nick.dlib.vt.edu/data/CS5604S15/z540t.csv. 2. Download the Python script to extract URLs, called tweet_shortToLongURL_File.py,
14. building an information retrieval system consists of tweets and web pages extracted from URL links that might be included in the tweets, which have been selected based on rudimentary string matching provided by the Twitter API. As a result, the corpora are inherently noisy and contain a lot of irrelevant information. This includes documents that are non-English or off-topic, as well as other information within them such as stop words, whitespace characters, non-alphanumeric characters, icons, broken links, HTML/XML tags, scripting code, CSS style sheets, etc. In our attempt to build an efficient information retrieval system for events through Solr, we are devising a matching system for the corpora by adding various facets and other properties to serve as dimensions for each document. These dimensions function as additional criteria that will enhance the matching, and thereby the retrieval mechanism, of Solr. They are metadata from classification, clustering, named entities, topic modeling, and social graph scores, implemented by other teams in the class. It is of utmost importance that each of these initiatives is precise, to ensure the enhancement of the matching and retrieval system. The quality of their work depends, directly or indirectly, on the quality of the data that is provided to them. Noisy data will skew the results, and each team would need to perform additional tasks to get rid of it prior to executing their core functionalities. It is our role a
15. The character sequence "RT" or "rt" is a Twitter-specific term that stands for retweet; it is often found in tweets that are re-tweets of an original tweet. Also, Dr. Fox noticed several instances of curse words or swear words in the cleaned text. He suggested that we replace these occurrences with the term "profanity", so as to standardize the text and preserve them for future text analysis such as sentiment analysis. We used Google's list of curse words or offensive words for reference [23]. Therefore, we cleaned the collections to remove occurrences of the character sequences "RT" and "rt", and replaced all curse words with "profanity". 15 Project Timeline Report 1: Solr installation on local machines. Documented the requirements for our team's work and identified the dependencies associated with other teams. Report 2: Data imported into Solr. Outlined the requirements of our team's work. Finalized the references. Report 3: Reorganized the structure of the report. Finalized the design part. Report 4: Coding for text extraction from web pages. Report 5: The code for cleanup of tweet collections will be released along with the report. Report 6: The preliminary executable code for cleanup of web pages will be released. Report 7: Cleaned tweet collections ready. Reports 8 & 9: The final executable code for web page cleanup released. Report 10: Version 2 of the web page cleanup released.
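As an illustration of the RT and profanity handling described above, a simplified regular-expression sketch follows. The file name profanity_en.txt matches the word list used in Appendix B, while the exact patterns here are only an approximation of the ones in our scripts:

import re

with open('profanity_en.txt') as f:
    profanity_words = [w.strip() for w in f if w.strip()]

def standardize(text):
    # drop the retweet marker when it appears as a separate token
    text = re.sub(r'\bRT\b', ' ', text, flags=re.IGNORECASE)
    if not profanity_words:
        return text
    # replace any word from the reference list with the token "profanity"
    pattern = r'\b(' + '|'.join(re.escape(w) for w in profanity_words) + r')\b'
    return re.sub(pattern, 'profanity', text, flags=re.IGNORECASE)

print(standardize('RT example tweet text'))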
16.
from avro.datafile import DataFileWriter
from avro.io import DatumWriter
from nltk.stem.lancaster import LancasterStemmer
from urlparse import urlparse
from langdetect import detect_langs
import uuid
import time

st = LancasterStemmer()
__doc_id = 'charlie_hebdo_S'
stop_word = [u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your',
             u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her',
             u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs',
             u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those',
             u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had',
             u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if',
             u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with',
             u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after',
             u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off',
             u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when',
             u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most',
             u'other', u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so',
             u'than', u'too', u'very', u's', u't', u'can', u'will', u'just', u'don', u'should']
with open('profanity_en.txt') as f:
17. 11.2 Webpage Cleanup Cleaning up web pages is far more challenging, because web pages are unstructured as compared to tweets. Web pages do come with the standard HTML tags; however, the HTML source also includes external content present in advertisements, navigational elements, headers, footers, etc. Therefore it is difficult to identify patterns or properties of text within a web page. Please find below the table describing the intermediate steps for web page cleanup; a brief code sketch of these steps follows below. Table 2: Steps and description for cleaning web pages. Step 1: Discard non-English web pages. Step 2: Remove advertising content, banners, and other such content from the HTML page using the Python package python-readability (include link). Step 3: Remove text within <script> tags, along with style sheets, links to images, and other <a> hyperlinks. Step 4: Clean up HTML/XML tags in the web page using the package BeautifulSoup. Step 5: Remove all remaining non-alphanumeric text using regular expressions. Step 6: Standardize the remaining text through stemming and stop word removal. Step 7: Replace curse words or swear words with the text "profanity". Step 8: Generate a universally unique identifier (UUID) for each web page. 11.3 Organize Output in AVRO As mentioned previously, the output from our cleaning module would be consumed by all of the other teams in HDFS.
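Returning to Table 2, the following is a condensed sketch of steps 1 through 4, using the same packages named there (python-readability, BeautifulSoup, and langdetect). It is illustrative only; the full version with profanity replacement, stemming, and AVRO output is in Appendix B:

from readability import Document     # python-readability (readability-lxml)
from bs4 import BeautifulSoup
from langdetect import detect

def extract_main_text(html):
    # keep only the readable article body, then strip the remaining tags
    article_html = Document(html).summary()
    soup = BeautifulSoup(article_html)
    text = ' '.join(t.strip() for t in soup.findAll(text=True) if t.strip())
    # discard pages whose dominant language is not English (step 1)
    try:
        if detect(text) != 'en':
            return None
    except Exception:
        return None
    return text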
18. [Table rows: 0 2 23 3; tunisia_B 328 162 49 6 8 9; charlie_hebdo_S 341 87 1 16 5; communities_B 76 6 1804.] Table 5: Language statistics for web page collections. [Rows: winter_storm_S 17; communities_B 45; charlie_hebdo_S 10 26.] Table 6: Various file types for each web page collection. plane_crash_S: application/xml, text/html, application/xhtml+xml, text/html, application/pdf, application/xhtml+xml. diabetes_B: audio/mpeg, application/octet-stream, application/pdf, text/html, application/xhtml+xml, application/x-shockwave-flash, video/mp4, text/plain, text/aspdotnet, text/x-php. tunisia_B: text/html, audio/mpeg, application/pdf, application/xhtml+xml. winter_storm_S: application/atom+xml, text/html, text/plain, application/octet-stream, application/pdf, application/xhtml+xml, application/xml. egypt_B: text/html, application/vnd.android.package-archive, application/xml, video/mp4, application/octet-stream, application/pdf, application/xhtml+xml. communities_B: application/vnd.google-earth.kmz, image/jpeg, audio/mpeg, video/x-ms-wmv, application/rss+xml, video/mp4, text/html, image/gif, application/octet-stream, video/x-m4v, application/pdf, application/xhtml+xml, application/x-rar-compressed, text/x-php, text/aspdotnet. charlie_hebdo_S: text/html, application/xhtml+xml. 14 Evaluation Techniques The evaluation of our collection was conducted in two phases: 1. Manual Evaluation, 2. Feedback
19. [Terminal listing: cleaned web page collections under /user/cs5604s15_noise/WEBPAGES_CLEAN, including noise_large_00000_v1, noise_large_00001_v1, noise_small_00000, noise_small_00000_v2, noise_small_00001, noise_small_00001_v2, social_00000, social_00000_v2, solr_large_00000_v1, solr_large_00001_v1, solr_small_00000, solr_small_00000_v2, solr_small_00001, and solr_small_00001_v2.] Figure 8: The size of cleaned web page collections on HDFS. 18 Developer Manual 18.1 Webpage Crawling with Python Scripts As the Noise Reduction team, we have been given a small tweet collection about the event "shooting". With the Python script provided by the RA Mohamed Magdy, we are able to crawl each of the web pages that have a URL appearing in the set of tweets. We ran the script on a MacBook with Python 2.7.8 and Anaconda 2.1.0 installed. The procedures are listed below: 1. Download Mohamed's script (please refer to Appendix C) and put it in the same directory as the small collection. 2. Modify the script: change line 17 from thresh = 10 to thresh = 0
20. Figure 1: Framework of noise reduction within the whole system context. The HBase schema for storing the cleaned tweets and web pages can be found below. The column family "original" contains information that we will be storing after cleaning the collection and, as a result, providing to the other teams; each row contains details for an individual document (tweet and/or web page). The column family "analysis" contains information that is provided by the other teams as a result of their analysis or work on the cleaned collection. For further details on the individual columns, please refer to the report submitted by the Hadoop team in VTechWorks. Tweets: column family "original" with column qualifiers tweet_id, text_original, text_clean, created_at, source, user_screen_name, user_id, lang, retweet_count, favorite_count, contributors_id, coordinates, urls, hashtags, user_mentions_id, in_reply_to_user_id, in_reply_to_status_id, text_clean2, collection; column family "analysis" with column qualifiers ner_people, ner_locations, ner_dates, ner_organizations, cluster_id, cluster_label, class, social_importance, lda_topics, lda_vectors. Webpages: rowkey is collection plus uuid; column family "original" with column qualifiers domain, source, collection, text_original, text_clean, author, subtitle, created_at, accessed_at, section, lang, urls, coordinates, tweet_source, content_type, text_clean2, appears_in_tweet_ids; column family "analysis" with column qualifiers ner_people, ner_locations, ner_dates, ner_organizations, cluster_id
21. [Terminal listing (continued): modification dates and times from April and May 2015, and paths under /user/cs5604s15_noise/WEBPAGES_CLEAN for the classification, clustering, hadoop, and ner small/large collections (v1 and v2 versions).]
22. CS5604 Information Storage and Retrieval, Term Project, Team Reducing Noise. May 13, 2015. Virginia Tech, Blacksburg, VA. Team Members: Prashant Chandrasekar, Xiangwen Wang.
Table of Contents
1 Executive Summary
2 Acknowledgements
3 VTechWorks Submission Inventory
4 Chapter/Section Summary
5 Project Requirements
6 Literature Review
7 System Design
8 Document Properties
8.1 Tweets
8.2 Web Pages
9 Implementation Strategy
10 Tools and Frameworks
11 Implementation Methodology
11.1 Tweet Cleanup
11.2 Webpage Cleanup
11.3 Organize Output in AVRO
12 Implementation
12.1 Cleaning Tweets
12.2 Cleaning Webpages
13
13.1 Tweet Cle
13.2 Web
23. [Terminal listing (hadoop fs -du): per-collection sizes of the cleaned tweet collections under /user/cs5604s15_noise/TWEETS_CLEAN for 14 collections (Jan_25_S, Malaysia_Airlines_B, bomb_B, charlie_hebdo_S, diabetes_B, ebola_S, egypt_B, election_S, plane_crash_S, shooting_B, storm_B, suicide_bomb_attack_S, tunisia_B, winter_storm_S), ranging from about 13 MB to 6.9 GB.] Figure 7: The size of cleaned tweet collections on HDFS. 17.2 Cleaning Tweets for Small Collections Using Hadoop Streaming You can clean up the large tweet collections in CSV format using Hadoop streaming and the mapper.py attached in the appendix. The only problem is that the output of this execution is NOT in our custom AVRO schema; this was an experiment to extend our Python implementation to run on HDFS. Following are the steps to execute the script: 1. First you need to download avro-1.7.7.jar and avro-mapred-1.7.7-hadoop1.jar from http://www.gtlib.gatech.edu/pub/apache
24.
    tweet = EmoRegexp.sub(' ', tweet)
    tweet = re.sub(r'\s+', ' ', tweet)
    # task 2: Remove non-alphanumeric characters
    HashtagRegexp = r'(?<![A-Za-z0-9])#[A-Za-z_][A-Za-z0-9_]*'
    UserhandleRegexp = r'(?<![A-Za-z0-9])@[A-Za-z_][A-Za-z0-9_]*'
    UrlRegexp = r'(?P<url>https?://[a-zA-Z0-9./?=&_%:~+#-]*)'
    Hashtaglist = re.findall(HashtagRegexp, tweet)
    for hashtag in Hashtaglist:
        tweet = tweet.replace(hashtag, ' ')
    Userhandlelist = re.findall(UserhandleRegexp, tweet)
    for userhandle in Userhandlelist:
        tweet = tweet.replace(userhandle, ' ')
    url_list = re.findall(UrlRegexp, tweet)
    for url in url_list:
        tweet = tweet.replace(url, ' ')
    tweet = re.sub(r'[^\s\w_]', ' ', tweet)
    # task 5 included
    clean_tweet_only = tweet
    for hashtag in Hashtaglist:
        tweet = tweet + ' ' + hashtag
    for userhandle in Userhandlelist:
        tweet = tweet + ' ' + userhandle
    # task 3: validating url
    ValidUrlRegexp = re.compile(
        r'^(?:http|ftp)s?://'                                               # http:// or https://
        r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'     # domain
        r'localhost|'                                                        # localhost or ip
        r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'
        r'(?::\d+)?'                                                         # optional port
        r'(?:/?|[/?]\S+)$', re.IGNORECASE)
    for url in url_list:
        if ValidUrlRegexp.match(url):
            tweet = tweet + ' ' + url
    tweet = re.sub(' +', ' ', tweet)
    return tweet, clean_tweet_only, Hashtaglist, Userhandlelist, url_list

# input comes from STDIN (standard
25. Additionally, the cleaned documents and any metadata that we extracted during the cleaning phase were required to be stored in HBase, so that the Solr team could extract the table data into their Solr engine. The Hadoop team was responsible for building the table structure in HBase. We collaborated with the Solr and Hadoop teams in building the schema for the AVRO file; for details of the schema, please refer to Appendix A. Since each team is working with two collections that are stored separately in the local filesystem and on the Hadoop cluster, we will be developing a mechanism for each of the scenarios: one processing files in Solr, and another that will process the collection stored in HDFS. Through our framework, each team will have access to clean and relevant content that is available and accessible in HDFS for large collections and in their local file system for small collections. 12 Implementation 12.1 Cleaning Tweets [Figure: tweet cleanup pipeline. Tweet collections in AVRO format are filtered to ASCII-only, English-only tweets; regular-expression matching extracts URLs, hashtags, and user handles; stop words and profane words are removed and the text is stemmed; the cleaned tweets are written in AVRO format with a preset schema.] Figure 5: Overview of tweet cleanup implementation. The tweet collections are initially stored in AVRO file format. After dumping them as JSON, we will only extract useful fields such as text.
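In outline, the AVRO handling for a tweet collection looks like the sketch below. The file names are placeholders, the raw field names (iso_language_code, text, id) follow the code in Appendix B or are illustrative assumptions, only a few of the Appendix A output fields are shown, and clean_tweet() stands in for the regular-expression cleanup described in Section 11.1:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# placeholder file names; the output schema is the one listed in Appendix A
schema = avro.schema.parse(open('tweet_noise_reduction.avsc').read())
reader = DataFileReader(open('tweets_raw.avro', 'r'), DatumReader())
writer = DataFileWriter(open('tweets_clean.avro', 'w'), DatumWriter(), schema)

for tweet in reader:
    if tweet.get(u'iso_language_code') != u'en':   # keep English tweets only
        continue
    # the '--' separator and the raw 'id' field are illustrative assumptions
    writer.append({'doc_id': 'charlie_hebdo_S--' + str(tweet.get(u'id')),
                   'text_original': tweet[u'text'],
                   'text_clean': clean_tweet(tweet[u'text'])})

reader.close()
writer.close()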
26. at https://scholar.vt.edu/access/content/group/5508d3d6-c97d-437f-a09d-2cfd43828a9d/Tutorials/tweet_shortToLongURL_File.py. 3. Execute the Python script: a. python tweet_shortToLongURL_File.py z540t.csv. b. This script outputs a file with a list of URLs that we can then use as a seed file for Apache Nutch (seedsURLs_z540t.txt). 4. Download Apache Nutch as per the instructions provided in the tutorial at https://scholar.vt.edu/access/content/group/5508d3d6-c97d-437f-a09d-2cfd43828a9d/Tutorials/Nutch%20Tutorial.pdf. 5. Modify the crawl script in directory bin/crawl: a. Comment out the section that indexes the web pages into Solr. 6. Execute Nutch: a. bin/crawl urls TestCrawl4 http://preston.dlib.vt.edu:8980/solr 1. i. urls is the directory where the seed URL text file is. ii. TestCrawl4 is the directory where the output is stored. iii. http://preston.dlib.vt.edu:8980/solr is the Solr instance that needs to be provided, regardless of whether the output is indexed into Solr or not. iv. 1 is the number of times you want to attempt to connect to the URLs. b. The web page file can be found in directory $nutch-1.9/TestCrawl4/segments/20150320180211/content/part-00000. c. Upload the file to the Hadoop cluster: i. scp part-00000 cs5604s15_noise@hadoop.dlib.vt.edu:/home/cs5604s15_noise. d. Upload the file to HDFS: i. hadoop fs -copyFromLocal /home/cs5604s15_noise /user/cs5604s15_noise/input. 18.3 Solr Installation and Data Importing
27. an open-source Python library to extract text from HTML and XML files. The Beautiful Soup documentation [5] will be our main reference for web page text extraction; we can also find details for each of the functions in the beautifulsoup4 Python package, along with some useful examples. When we successfully extract the text from the HTML, we will find there is still much visible text in the menus, headers, and footers, which we might want to filter out as well. We can approach this problem by using information about text versus HTML code to work out whether a line of text inside or outside of an HTML tag is worth keeping or not. We can be guided by the methods employed in the online tutorial "The Easy Way To Extract Useful Text From Arbitrary HTML" [8], which provides some ideas on how to fulfill this task; that analysis is based on a neural network method, using the open-source C library FANN, and the FANN website [16] provides a general reference manual for the library. Also, the readability website [19] provides a powerful tool, with an introduction, for extracting useful content from an HTML/XML file. Since we are currently trying to build an English-based information retrieval system, it is important for us to filter out the non-English documents, which leads to a new problem: language detection of a document. The presentation [20] provides us with an idea and an algorithm for language detection based on a Naive Bayes classifier. The
28. In Figure 2 we have highlighted some of the properties of a tweet that are common knowledge, and we have additionally identified instances of noisy data that require our attention. Please find below the table describing the intermediate steps for tweet cleanup; a short code sketch follows below. Table 1: Steps and description for cleaning tweets. Step 1: Discard non-English tweets. Step 2: Remove emoticons and other icons. Step 3: Remove non-alphanumeric characters that aren't part of the User Handle, Hashtag, or URL; this involves removing punctuation. Step 4: Remove "RT" or "rt", which signifies that a tweet is a retweet. Step 5: Replace curse words or swear words with the text "profanity". Step 6: Inspect the format of a URL; remove the URL from the tweet if the format is invalid, as in the example above. Step 7: Validate the URL; remove the URL from the tweet if the URL isn't registered or is invalid (if it returns a 404). Step 8: Standardize the text through stemming and stop word removal. Step 9: Generate a universally unique identifier (UUID) for each tweet. Step 10: Output the results in the expected format (AVRO). [Figure: a cleaned-up tweet, "RT KhaledBeydoun Ahmed Merabet was shot point blank on the sidewalk He is a Muslim and a hero CharlieHebdo ParisShooting", with the User Handle and HashTag highlighted.] Figure 4: Example of a cleaned tweet. Figure 4 illustrates the output of the tweet cleanup functionality, which has removed the invalid URL, non-alphanumeric characters not part of the User Handle or Hashtag, and icons from the original tweet.
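The sketch below illustrates the hashtag, user handle, and URL extraction together with steps 3 and 4 of Table 1. The patterns are simplified stand-ins for the ones in Appendix B, which also anchor on word boundaries and validate URLs against a full domain/IP pattern:

import re

HASHTAG_RE = re.compile(r'#\w+')
HANDLE_RE  = re.compile(r'@\w+')
URL_RE     = re.compile(r'https?://\S+')

def clean_tweet(text):
    hashtags = HASHTAG_RE.findall(text)
    handles  = HANDLE_RE.findall(text)
    urls     = URL_RE.findall(text)
    # remove the extracted pieces from the running text, then drop leftover punctuation (step 3)
    for piece in hashtags + handles + urls:
        text = text.replace(piece, ' ')
    text = re.sub(r'[^\s\w_]', ' ', text)
    # drop the retweet marker (step 4) and collapse whitespace
    text = re.sub(r'\bRT\b', ' ', text, flags=re.IGNORECASE)
    text = re.sub(r'\s+', ' ', text).strip()
    return text, hashtags, handles, urls

print(clean_tweet('RT @KhaledBeydoun: Ahmed Merabet was shot #CharlieHebdo http://t.co/Ko0D8yweJd'))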
29.
        readable_title = re.sub(r'[^\w\s]', ' ', readable_title)
        soup = BeautifulSoup(readable_article)
        texts = soup.findAll(text=True)
        all_text = ' '.join(texts).strip()
        try:
            lan = str(detect_langs(all_text)[0]).split(':')[0]
        except:
            continue
        if lan not in languageAll:
            languageAll[lan] = 1
        else:
            languageAll[lan] += 1
        if lan != 'en':
            continue
        all_text = all_text.replace('\r', ' ')
        all_text = all_text.replace('\n', ' ')
        all_text = all_text.replace('\t', ' ')
        all_text = ''.join(i if ord(i) < 128 else ' ' for i in all_text)
        all_text = re.sub(' +', ' ', all_text)
        clean_content_only, clean_content_2, url_list_str = webcontentcleanup(all_text)
        print(clean_content_2)
        domain = '{uri.netloc}'.format(uri=urlparse(url))
        webpage_id = str(uuid.uuid3(uuid.NAMESPACE_DNS, url.encode('ascii', 'ignore')))
        webpage_json = {}
        webpage_json['doc_id'] = doc_id + '--' + webpage_id
        webpage_json['text_clean'] = clean_content_only
        webpage_json['text_original'] = raw_text
        webpage_json['title'] = readable_title
        webpage_json['text_clean2'] = clean_content_2
        webpage_json['collection'] = doc_id
        webpage_json['content_type'] = 'html'
        webpage_json['urls'] = url_list_str
        webpage_json['domain'] = domain
        webpage_json['url'] = url
        writer.append(webpage_json)
        clean_webpage_count += 1
        fp.write('%s\n%s\n\n%s\n\n\n\n\n' % (webpage_id, url, clean_content_only))
    fp.close()
    SchemaFile.close()
    writer.close()
    print('%s has been cleaned up' % filename1)
30.
  echo "Link inversion"
  bin/nutch invertlinks "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
  if [ $? -ne 0 ]; then exit $?; fi

  echo "Dedup on crawldb"
  bin/nutch dedup "$CRAWL_PATH"/crawldb
  if [ $? -ne 0 ]; then exit $?; fi

  echo "Indexing $SEGMENT on SOLR index -> $SOLRURL"
  bin/nutch index -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb -linkdb "$CRAWL_PATH"/linkdb "$CRAWL_PATH"/segments/$SEGMENT
  if [ $? -ne 0 ]; then exit $?; fi

  echo "Cleanup on SOLR index -> $SOLRURL"
  bin/nutch clean -D solr.server.url=$SOLRURL "$CRAWL_PATH"/crawldb
  if [ $? -ne 0 ]; then exit $?; fi
fi

URL extraction from tweets:
import sys
import requests
import hashlib
from bs4 import BeautifulSoup, Comment
import re
import sunburnt
import pymysql
from operator import itemgetter
from contextlib import closing

requests.packages.urllib3.disable_warnings()
headers = {'User-Agent': 'Digital Library Research Laboratory (DLRL)'}

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head']:
        return False
    return True

thresh = 1
archiveID = sys.argv[1].split('.')[0]
tweetFile = sys.argv[1]
tweets = []
with open(tweetFile, 'r') as f:
    us = f.readlines()
for l in us:
    l = l.strip()
    l = l.split(',')
    tweets.append(l)
print('tweets is read from File')

# Extract short URLs from Tweets
shortURLsList = []
for row in cursor.fetchall():
    for line in tweets:
31. from other teams. 14.1 Manual Evaluation As we were saving the cleaned text into AVRO format, we additionally saved the plain-text version for a subset of the entries through random selection. We randomly selected 50 documents from each collection and scanned the text to check for any aberration in the output; we conducted the test for tweets and web pages. It was through this validation step that we learnt that some documents had text in multiple languages. Until then, we were removing all non-ASCII text; we assumed that by doing so we would get rid of the non-English text in a document. However, this wasn't enough, as we had web pages written in non-English languages (say, Japanese) which included many instances of numbers and dates within the text, and only these numbers and dates would appear in the cleaned version of the document. As a refinement, we chose to process only the documents written in English. The Python package langdetect [20] was included in our code to identify the prominent language of the document; we would filter out documents that didn't predominantly contain English text. 14.2 Feedback from Teams The cleaned collections were directly consumed by other teams. We asked the teams to let us know if their calculations or results were skewed or affected because of occurrences of any text that they considered as noise. The common concern among all the teams was the occurrence of the term "RT" or "rt" when analyzing tweet content.
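The language filter mentioned above amounts to a few lines. This sketch follows the detect_langs() usage in Appendix B; the 0.5 majority threshold is an illustrative choice of ours:

from langdetect import detect_langs

def is_predominantly_english(text, threshold=0.5):
    # detect_langs() returns ranked candidates such as [en:0.86, fr:0.14]
    try:
        top = detect_langs(text)[0]
    except Exception:
        return False
    return top.lang == 'en' and top.prob >= threshold

print(is_predominantly_english('Ahmed Merabet was shot point blank on the sidewalk'))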
32. ding rule has been set up with NAT on the master node so that the slave nodes can get access to the internet. For each Hadoop node, disable the NetworkManager service and enable the network service. Download and install Cloudera Manager with the commands wget http://archive.cloudera.com/cm5/installer/latest/cloudera-manager-installer.bin, chmod u+x cloudera-manager-installer.bin, and sudo ./cloudera-manager-installer.bin. 11. Manually assign an IP address, gateway, etc. for each node. 12. Use a web browser to install CDH by visiting the manager node through port 7180. 13. In the License section, select the Cloudera Express option. 14. In the Specify Hosts section, type in the IP addresses of each node in the cluster; the installation program will automatically install Hadoop on each node. 15. Select the basic services (HDFS, MapReduce, YARN, ZooKeeper, Oozie, Hive, Pig, etc.) and the HBase service. 16. Assign the role for each node: here we select the master node as the NameNode, SecondaryNameNode, and HBase Master, and select the five slave nodes as the DataNodes and HBase RegionServers; then manually select the corresponding services for each node. 17. Wait for the installation to complete. [Figure: Cloudera Manager at 192.168.0.104:7180, "Add Services" command progress screen; command "First Run" in progress, started Apr 13, 2015, 10:29:22 PM EDT.]
33. As seen in the figure below, the collection has been imported. [Figure: Solr browse UI at http://localhost:8983/solr/browse showing 37001 results found in 194 ms (page 1 of 3701), facet and boost controls, and sample tweet documents (ids 552943156447440897, 552943155231076353, 552943155121635328, 552943154098634752) with "More Like This" links; a note says to run Solr with java -Dsolr.clustering.enabled=true -jar start.jar to see clustered search results.] Figure 12: Screenshot displaying the Shooting collection. This small collection contains 36917 tweets about the Charlie Hebdo attack, so we can practice normalizing text and classifying text using those tweets; also, we can try to unshorten each tiny URL in those tweets. 18.4 Hadoop and Mahout Installation We initially installed Mahout to understand the input format for all of the machine learning implementations that were being used by other teams; we were able to experiment freely with our own installation. Following is a list of the procedures to build a Hadoop/Mahout environment on a virtual machine, with a Windows 7 64-bit PC as the host. 1. Download the virtualization software VirtualBox, version 4.3.22. Vi
34. /avro/avro-1.7.7/java/, and upload them to the same directory as the mapper.py on the cluster. 2. For the shell command attached below, replace MAPPER_PATH with the path of the mapper.py; for example, as the Noise Reduction team our path is /home/cs5604s15_noise/testmapreduce/mapper.py. 3. Again, replace INPUT_PATH and OUTPUT_PATH with the paths of your input and output directories; for example, our input path is /user/cs5604s15_noise/input and our output path is /user/cs5604s15_noise/output_tweet_test_20150328. 4. Then paste the command in a remote terminal and run it. The output will be stored in AVRO format in the given output directory. Shell command: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -D mapred.reduce.tasks=0 -files avro-1.7.7.jar,avro-mapred-1.7.7-hadoop1.jar -libjars avro-1.7.7.jar,avro-mapred-1.7.7-hadoop1.jar -file MAPPER_PATH -mapper MAPPER_PATH -input INPUT_PATH/*.csv -output OUTPUT_PATH -outputformat org.apache.avro.mapred.AvroTextOutputFormat 17.3 Cleaning Webpages The following process was followed to clean the web pages for each and every collection: 1. The web pages that were output from Apache Nutch are in SequenceFile format; we first need to convert them to text file format: hadoop jar /opt/cloudera/parcels/CDH/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.5.0-mr1-cdh5.3.1.jar -files apache-nutch-1.9.jar
35. 16 Conclusion & Future Work Through our work, we were able to successfully clean 14 English tweet collections and 9 English HTML-formatted web page collections. These collections were indexed and loaded into Solr for retrieval, and we were able to employ industry-standard packages and utilities to achieve our goals. Our work can be extended to extract more information about each collection as a whole, as well as about individual documents within it. For example, our cleaning process removed all emoticons, which could be used along with other text processing tools to derive sentiment for tweets; this information would be useful for collections that involve disasters such as hurricanes or fires, as well as for the Egypt uprising collection. Additionally, our work was restricted to processing entries that were in English; researchers and developers alike could relax that constraint in our code and achieve similar goals for the 30 or so languages that we detected in our collections. Finally, our current scope was defined such that we only focus on cleaning HTML-formatted or plain-text-formatted documents, so important documents that were represented in other formats could have been pruned as a result. There are many freely available tools that help convert and/or extract plain text from documents of various formats, such as PDF files and Word files; interested parties can extend our code to convert these documents to plain text f
36. e.org/solr/mirrors-solr-latest-redir.html, accessed on 02/05/2015. [5] Leonard Richardson. Beautiful Soup Documentation. http://www.crummy.com/software/BeautifulSoup/bs4/doc/, accessed on 02/05/2015. [6] The Apache Software Foundation. Solr Quick Start. http://lucene.apache.org/solr/quickstart.html, accessed on 02/05/2015. [7] Christopher Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval, Vol. 1. Cambridge: Cambridge University Press, 2008. [8] Alex J. Champandard. The Easy Way to Extract Useful Text from Arbitrary HTML. http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/, accessed on 02/05/2015. [9] HossMan, YonikSeeley, OtisGospodnetic, et al. Solr Analyzers, Tokenizers and Token Filters. https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, accessed on 02/03/2015. [10] Apache Infrastructure Team. Solr Schema. http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/solr/example/solr/conf/schema.xml, accessed on 02/03/2015. [11] Joseph Acanfora, Stanislaw Antol, Souleiman Ayoub, et al. VTechWorks CS4984 Computational Linguistics. https://vtechworks.lib.vt.edu/handle/10919/50956, accessed on 02/05/2015. [12] Arjun Chandrasekaran, Saurav Sharma, Peter Sulucz, and Jonathan Tran. Generating an Intelligent Human-Readable Summary of a Shooting Event from a Large Collection of Webpages. Report for Course CS4984. https://v
37.
    clean_tweet_only = re.sub(r'\n', ' ', clean_tweet_only)
    clean_tweet_1 = re.sub('RT', '', clean_tweet_only)
    clean_tweet_2 = re.sub(ProfanityRegexp, ' ', clean_tweet_only)
    clean_tweet_only = re.sub(ProfanityRegexp, 'profanity ', clean_tweet_only)
    clean_tweet_2 = ' '.join(st.stem(word.lower()) for word in clean_tweet_2.split())
    clean_tweet_2 = re.sub(' +', ' ', clean_tweet_2)
    clean_tweet_2 = re.sub('^ ', '', clean_tweet_2)
    return tweet, clean_tweet_2, clean_tweet_only, Hashtaglist, Userhandlelist, url_list

def checknone(str_):
    # convert an empty string to None
    if str_ == '' or str_ == ' ':
        return None
    else:
        return str_

def avrocleaning(filename1, filename2, filename3, doc_id):
    try:
        InFile = open(filename1, 'r')
        OutFile = open(filename2, 'w')
        SchemaFile = open(filename3, 'r')
    except IOError:
        print('please check the filenames in arguments')
        return
    reader = DataFileReader(InFile, DatumReader())
    schema = avro.schema.parse(SchemaFile.read())
    writer = DataFileWriter(OutFile, DatumWriter(), schema)
    tweet_count = 0
    clean_tweet_count = 0
    for full_tweet_json in reader:
        tweet_count += 1
        if tweet_count < 25000:
            continue
        if tweet_count > 100:
            break
        # remove leading and trailing whitespace
        try:
            print(full_tweet_json)
            full_tweet = json.loads(json.dumps(full_tweet_json))
        except:
            continue
        # only select tweets in English
        if full_tweet[u'iso_language_code'] != u'en':
            print('not English')
            continue
        rawtweet = full_tweet[u'text']
38. ebdo Shooting; user_mentions_id; tweet_id: 498584703352320001; urls: http://t.co/Ko0D8yweJd; collection: charlie_hebdo_S; doc_id: charlie_hebdo_S 8515a3c7-1d97-3bfa-a264-93ddb159e58e; user_screen_name: CharlieHebdo toveenb. 12.2 Cleaning Webpages [Figure: web page cleanup pipeline. Raw files and metadata in SequenceFile format are converted to consolidated text files and re-identified as individual files; Readability and BeautifulSoup keep only the raw HTML titles and useful HTML with corresponding URLs; Langdetect keeps English-only, ASCII-only text; the clean text and extracted fields are written as clean web pages in AVRO format with a preset schema.] Figure 6: Overview of web page cleanup implementation. The input to our module is the web page content that is output from Apache Nutch. The files are written in SequenceFile format; to clean up the contents of the file, we first had to convert them into text files. We were able to convert the file from SequenceFile format to text format using a script provided by the RA Mohamed; the details can be found in the Developer Manual section of this report. The format in which each web page's content is presented in the file is: url: http://21stcenturywire.com/2015/02/08/free-speech-british-police-hunt-down-buyers-of-charlie-hebdo base: http://21stcenturywire.com/2015/02/08/free-speech-british-police-hunt-down-buyers-of-charlie-hebdo contentType: application
39.
    "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "title", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "domain", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "url", "type": ["null", "string"], "default": null, "doc": "original"},
   {"name": "appears_in_tweet_ids", "type": ["null", "string"], "default": null, "doc": "original"}]}
Appendix B: Source code of the cleaning scripts. Python script for tweet cleanup:
#!/usr/bin/env python
# coding: utf-8
import sys
import re
import avro
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
import json
from nltk.stem.lancaster import LancasterStemmer
import time
import uuid

st = LancasterStemmer()
__doc_id = 'charlie_hebdo_S'
stop_word = [u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your',
             u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her',
             u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs',
             u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those',
             u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had',
             u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if',
             u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with',
             u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after',
             u'above', u'below', u'to', u'from', u'up',
40.
    print('Total webpages: %d' % len(webpages))
    print('Cleaned webpages: %d' % clean_webpage_count)
    print('Percentage cleaned: %.3f' % (100.0 * clean_webpage_count / len(webpages)))
    print('HTML webpages: %d' % html_file_count)
    print('Non-English webpages: %d' % (html_file_count - clean_webpage_count))
    print('Content Type Statistics')
    print('Language Statistics')
    return 1, contentTypeAll, languageAll

def main(argv):
    try:
        InputFile = argv[1]
        OutputFile = argv[2]
        SchemaFile = argv[3]
    except IndexError:
        print('Please specify the webpage input avro filename, output avro filename, and avro schema filename')
        return 0
    try:
        doc_id = argv[4]
    except IndexError:
        doc_id = __doc_id
    return avrocleaning(InputFile, OutputFile, SchemaFile, doc_id)

if __name__ == '__main__':
    start_time = time.time()
    main(sys.argv)
    print('%s seconds\n\n\n' % (time.time() - start_time))
    sys.exit(1)

Python code for cleaning up tweets with MapReduce (raw version):
#!/usr/bin/env python
import sys
import re

doc_id = 'charlie_hebdo_S'

def cleanup(tweet):
    # task 1: remove emoticons
    try:
        # Wide UCS-4 build
        EmoRegexp = re.compile(u'['
                               u'\U0001F300-\U0001F64F'
                               u'\U0001F680-\U0001F6FF'
                               u'\u2600-\u26FF\u2700-\u27BF]+', re.UNICODE)
    except re.error:
        # Narrow UCS-2 build
        EmoRegexp = re.compile(u'('
                               u'\ud83c[\udf00-\udfff]|'
                               u'\ud83d[\udc00-\ude4f\ude80-\udeff]|'
                               u'[\u2600-\u26FF\u2700-\u27BF])+', re.UNICODE)
41. ent initiative, thereby providing us with various resources that serve as a platform for our project. This project is sponsored by NSF grant IIS-1319578. Also, we would like to thank the Digital Libraries Research Laboratory (DLRL) for sharing the cluster where we executed our analysis. We also want to convey a special thank you to Dr. Edward A. Fox and the GTA/GRAs, Sunshin and Mohamed, for helping us throughout the project. Last but not least, we want to thank all the other teams of CS5604 for their precious feedback, and especially the Hadoop and Solr teams for their discussions on the AVRO schemas. 3 VTechWorks Submission Inventory All our work for the semester will be uploaded to VTechWorks at https://vtechworks.lib.vt.edu/handle/10919/19081. Please find below a brief description of each file that will be uploaded as a part of our submission. 1. ReportRN.pdf: a. The PDF format of the term report that describes our work for the project in detail. 2. ReportRN.docx: a. An editable Word format of the term report ReportRN.pdf. 3. PresentationRN.pdf: a. The PDF format of the presentation slides that provide an overview of our work for the project. 4. PresentationRN.pptx: a. An editable PowerPoint format of the presentation PresentationRN.pdf. 5. Code.zip: a. A compressed folder that contains the source code for our tweet and web page cleanup implementation. b. The folder contains: i. profanity_en and profanity_en_2: reference lists of cu
42. cluster_label, class, social_importance, lda_topics, lda_vectors. The AVRO file format is a JSON schema with the fields listed above as keys, with corresponding values. Using the schema, the Hadoop team will be writing a utility package that extracts the values within the AVRO file and inserts them as records into HBase. 8 Document Properties Prior to cleaning the character sequences, it is critical to understand and evaluate the structure of each of the documents (tweets and web pages). 8.1 Tweets [Figure: an example tweet, "RT @KhaledBeydoun: Ahmed Merabet was shot point blank on the sidewalk.\n\nHe is a Muslim and a hero.\n\n#CharlieHebdo #ParisShooting http://t...", annotated with Random User Handle, Emoticon/Icon, Random Characters, HashTag, and URL (Broken/Unbroken) labels.] Figure 2: Example structure format of a tweet. Web pages: the "Actual Text to Process" (sound) portion of the example page reads: "Eight in 10 say their families have seen either zero or not very much improvement in their living standards, according to pollsters Ipsos MORI. Looking to the future, less than a quarter think they will be much better off in the next 12 months, spanning the general election in May. The grim findings come a day after the Bank of England Governor trumpeted the return of real pay growth, as official figures showed wages creeping up ahead of the cost of living. Labour leader Ed Miliband, making a keynote comeback speech in London, said most families were
43. collections, which would cost additional time for analysis. This script will unshorten them and replace them with the original long URLs. The script can be found in Appendix C. 2. Packages in Python: a. Natural Language Toolkit (NLTK) is an open-source Python library which provides a variety of modules, datasets, and tutorials for research and development in natural language processing and related areas such as information retrieval and machine learning [2]. b. Beautiful Soup is an open-source package that can be used to extract plain text from HTML/XML source files, which is our first procedure in web page cleanup [5]. c. Langdetect is an open-source Python package which can detect the dominant language of a given text [20]. d. Readability is an open-source Python package that extracts the main HTML body text and cleans it up; it helps with removing content from advertisers, banners, etc. [19]. e. Re is an open-source Python package which provides regular expression matching operations [21]. f. UUID is an open-source Python package which generates UUIDs based on different algorithms (MD5, SHA-1, etc.) [22]. 11 Implementation Methodology We have designed a methodology for each of the intermediate tasks that we have identified for our system's functionality. 11.1 Tweet Cleanup We inspected the collection of tweets that was provided to us. Contents within a tweet are textual in nature; each tweet has a user handle, and may have a hashtag or a URL.
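As a quick illustration of the NLTK, re, and uuid packages listed in Section 10, the sketch below shows stemming, stop word removal, and UUID generation of the kind used during cleanup, under the Python 2.7 environment used in the report. The abbreviated stop word list and the '--' separator in the document identifier are illustrative choices:

import re
import uuid
from nltk.stem.lancaster import LancasterStemmer

st = LancasterStemmer()
stop_words = set([u'a', u'the', u'and', u'is', u'are', u'of', u'to', u'in'])   # abbreviated list

def standardize_text(text):
    # stop word removal followed by Lancaster stemming, as in Appendix B
    text = re.sub(r'\s+', ' ', text).strip()
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return ' '.join(st.stem(w) for w in kept)

def make_doc_id(collection, url):
    # deterministic UUID derived from the page URL, prefixed with the collection name
    page_uuid = uuid.uuid3(uuid.NAMESPACE_DNS, url.encode('ascii', 'ignore'))
    return collection + '--' + str(page_uuid)

print(standardize_text('He is a Muslim and a hero'))
print(make_doc_id('charlie_hebdo_S', 'http://21stcenturywire.com/2015/02/08/'))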
44. to count the total number of unique URLs in the collection. Run the script using the command python tweet_URL_archivingFile.py z540t.csv. 3. Terminate the program when it starts to send connection requests; see the figure below. [Terminal output: "tweets is read from File", "short Urls extracted: 47389", "cleaned short URLs: 47365", "Unique short URLs: 42932", "Freq short URLs > 0: 42932", followed by urllib3 InsecureRequestWarning messages about certificate verification.] Figure 9: Total amount of URLs in the small shooting collection. 4. We find that there are in total 42932 unique URLs appearing in the small tweet collection. Due to the limitations of our computers' performance and network bandwidth, we will only crawl the web pages whose URL appears more than 5 times in the set; however, on the cluster this need not be the case. 5. Modify the script: change line 17 from thresh = 0 to thresh = 5. Rerun the script using the same command, python tweet_URL_archivingFile.py z540t.csv, and wait until the program finishes running, which takes about 5 minutes, as shown in the figure below. [Terminal output after rerunning: "tweets is read from File", "short Urls extracted: 47389", "cleaned short URLs: 47365", "Unique short URLs: 42932", "Freq short URLs > 5: 196".]
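The URL-expansion step that the script performs, and that produces the InsecureRequestWarning messages seen above, boils down to following redirects with requests. The sketch below is a simplified version of the logic in Appendix C, with verify=False mirroring the original script:

import requests

requests.packages.urllib3.disable_warnings()

def expand_url(short_url, timeout=10):
    # follow redirects to recover the original long URL; return None on failure
    try:
        r = requests.get(short_url, timeout=timeout, allow_redirects=True, verify=False)
        if r.status_code == requests.codes.ok:
            return r.url
    except requests.RequestException:
        return None
    return None

print(expand_url('http://t.co/Ko0D8yweJd'))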
for row in tweets:
    line = row[1]
    regExp = r'(?P<url>https?://[a-zA-Z0-9./]+)'
    url_li = re.findall(regExp, line)            # find all short URLs in a single tweet
    while len(url_li) > 0:
        shortURLsList.append(url_li.pop())
print 'short URLs extracted:', len(shortURLsList)
surls = []
for url in shortURLsList:
    i = url.rfind('/')
    if i + 1 >= len(url):
        continue
    p = url[i + 1:]
    if len(p) < 10:
        continue
    while url.endswith('.'):                     # strip trailing punctuation
        url = url[:-1]
    surls.append(url)
print 'cleaned short URLs:', len(surls)
surlsDic = {}
for url in surls:
    if url in surlsDic:
        surlsDic[url] = surlsDic[url] + 1
    else:
        surlsDic[url] = 1
print 'Unique short URLs:', len(surlsDic)
sorted_list = sorted(surlsDic.iteritems(), key=itemgetter(1), reverse=True)
freqShortURLs = []
for surl, v in sorted_list:
    if v > thresh:
        freqShortURLs.append(surl)
print 'Freq short URLs > ' + str(thresh), len(freqShortURLs)
fs = open('shortURLs_' + archiveID + '.txt', 'w')
for surl, v in sorted_list:
    fs.write(surl + ' ' + str(v) + '\n')
fs.close()
# Expand short URLs
expanded_url_dict = {}
i = 0
e = 0
webpages = []
for url in freqShortURLs:
    try:
        with closing(requests.get(url, timeout=10, stream=True, verify=False, headers=headers)) as r:
            # page = r.text or r.content
            if r.status_code == requests.codes.ok:
                ori_url = r.url
                if ori_url:
                    # add the expanded original URLs to a Python dictionary with their count
                    if ori_url in expanded_url_dict:
                        expanded_url_dict[ori_url].appe
# input
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line
    content = line.split('\t')
    if len(content) < 14:
        continue
    if content[6] != 'en':                       # only select tweets in English
        continue
    source = content[0]
    rawtweet = content[1]
    cleanedtweet, clean_tweet_only, Hashtaglist, Userhandlelist, url_list = cleanup(content[1])
    hashtags = '|'.join(Hashtaglist)
    urls = '|'.join(url_list)
    user_mentions_id = '|'.join(Userhandlelist)
    username = content[3]
    tweetID = content[4]
    user_id = content[5]
    if content[10] == '0.0' and content[11] == '0.0':
        coordinate = ''
    else:
        coordinate = '%s,%s' % (content[10], content[11])
    timestamp = content[13]
    json_string = {}
    json_string['doc_id'] = doc_id + '_' + tweetID
    json_string['text_clean'] = clean_tweet_only
    json_string['tweet_id'] = tweetID
    json_string['text_original'] = line
    json_string['user_screen_name'] = username
    json_string['user_id'] = user_id
    json_string['created_at'] = timestamp
    json_string['source'] = source
    if hashtags != '':
        json_string['hashtags'] = hashtags
    if urls != '':
        json_string['urls'] = urls
    if user_mentions_id != '':
        json_string['user_mentions_id'] = user_mentions_id
    if coordinate != '':
        json_string['coordinates'] = coordinate
    print '%s' % json_string

Appendix C: Code provided by TAs
Crawl script for Nutch
if false; then
# note that the link inversion - indexing routine can be done within the main loop on a per segment b
ions to developers and researchers alike who are interested in using our component and possibly extending its functionality. Chapter 19 provides an exhaustive list of the references that we consulted for our work. Appendices A, B, and C are supplementary notes that provide the HBase schema, our code, and the code provided to us by our TA.
5 Project Requirements
Our goal for the project, at a high level, can be described as the following:
1. Identify and remove the noise in the data.
2. Process and standardize the sound in the data.
3. Extract and organize the data into a usable format.
As the Noise Reduction team, we are to clean up the collection of documents (tweets and web pages) by first identifying and removing character sequences that are irrelevant to the various components of the IR system. After that, we standardize the remaining text using popular Natural Language Processing techniques to convert the text to a common structured format. We then extract any valuable information stored within the text, such as Twitter handles, URLs that are out-links in web pages, etc., and store all of this information in a schema format that aids the other teams that are responsible for building the remaining components of the IR system.
6 Literature Review
Before the data cleaning procedures, several preprocessing steps are necessary. Chapter 2 of [7] introduces the method of word segmentation, true casing, and detecting the coding a
list != []:
            full_clean_tweet['urls'] = '|'.join(url_list).encode('ascii', 'ignore')
        if Userhandlelist != []:
            full_clean_tweet['user_mentions_id'] = '|'.join(Userhandlelist).encode('ascii', 'ignore')
        if Hashtaglist != []:
            full_clean_tweet['hashtags'] = '|'.join(Hashtaglist).encode('ascii', 'ignore')
        if source is not None:
            full_clean_tweet['source'] = source.encode('ascii', 'ignore')
        if in_reply_to_user_id is not None:
            full_clean_tweet['in_reply_to_user_id'] = in_reply_to_user_id.encode('ascii', 'ignore')
        print full_clean_tweet
        writer.append(full_clean_tweet)
    reader.close()
    SchemaFile.close()
    writer.close()
    print filename1, 'has been cleaned up'
    print 'total tweets: %d' % tweet_count
    print 'cleaned tweets: %d' % clean_tweet_count
    return 1

def main(argv):
    try:
        InputFile = argv[1]
        OutputFile = argv[2]
        SchemaFile = argv[3]
    except IndexError:
        print 'Please specify the tweets input AVRO filename, output AVRO filename, and AVRO schema filename'
        return 0
    try:
        doc_id = argv[4]
    except IndexError:
        doc_id = __doc_id
    return avrocleaning(InputFile, OutputFile, SchemaFile, doc_id)

if __name__ == '__main__':
    start_time = time.time()
    main(sys.argv)
    print '%s seconds\n\n' % (time.time() - start_time)

Python script for web page cleanup
#!/usr/bin/env python
# coding: utf-8
import sys
from readability.readability import Document
import re
from bs4 import BeautifulSoup
import avro
import avro.schema
from avro
lled Hadoop and Mahout on our virtual machine. Further configurations will be included in future reports.
18.5 Setting up a 6-node Hadoop cluster
In order to practice coding on Hadoop, we tried to build a 6-node Hadoop cluster with Cloudera Manager.
Hardware specifications:
- Number of nodes: 5 Hadoop nodes and 1 manager node
- Quad-core CPU on each node, 24 cores in total
- 48 GB memory in total, 8 GB on each node
- One 750 GB enterprise-level hard drive on each node
- Nodes are connected by an 8-port gigabit Ethernet switch
- The master node has two network cards, one for public access and one for the internal network
Procedures:
1. For each node, install CentOS 6.5 as the operating system.
2. In order to use Cloudera Manager, SELinux has to be disabled on the manager node by editing /etc/selinux/config.
3. For each Hadoop slave node, the iptables firewall is disabled to avoid NoRouteToHostException.
4. The ntp service needs to be enabled to synchronize time.
5. For all nodes, the sshd server is enabled, as well as SSH root login.
6. For each node, the GUI had to be turned off by editing /etc/inittab.
7. For the manager (master) node, the iptables firewall is enabled and all incoming traffic from the public interface is denied except through ports 22 and 50030, which correspond to the SSH service and the Hadoop JobTracker service. Also, traffic through port 7180 should be accepted to allow controlling the services through Cloudera Manager on the master node. A forwar
nd language of a document. Also, Section 2.2.2 provides the basic ideas for removing the stop words from a document, which are the extremely common words with little value, by using a predefined stop-word list. One of our goals is to reduce the original HTML or XML documents to smaller text documents. Chapter 10 would be very useful since it provides the concepts and techniques for retrieving information from structured documents such as XML files, HTML files, and other markup-language documents. There are a lot of existing resources in Python for text processing; thus, we also want to explore how to reduce noise with Python. A very useful and well-developed tool in Python for language processing is the open source library called the Natural Language Toolkit (NLTK). There are two references for this tool. The main reference will be Natural Language Processing with Python [1], which systematically introduces how NLTK works and how it can be used to clean up documents. We can apply most of the text processing procedures based on this book; for example, Chapter 3 of the book presents methods for raw text processing, and Chapter 6 introduces how to classify text, etc. In addition, the NLTK official website [2] provides an exhaustive introduction to the NLTK toolkit, where we can find details of specific functions that we might use. For the web page cleanup, the first step is to extract text from the source files. Beautiful Soup is
nd responsibility to remove irrelevant content or noisy data from the corpora. For both tweets and web pages, we cleaned entries that were written in English and discarded the rest. For tweets, we first extracted user handle information, URLs, and hashtags. We cleaned up the tweet text by removing non-ASCII character sequences and standardized the text using case folding, stemming, and stop-word removal. For the scope of this project, we considered cleaning only HTML-formatted web pages and entries written in plain-text file format. All other entries or documents, such as videos, images, etc., were discarded. For the valid entries, we extracted the URLs within the web pages to enumerate the outgoing links. Using the Python package readability [19], we were able to clean advertisement, header, and footer content. We were able to organize the remaining content and extract the article text using another Python package, beautifulsoup4 [5]. We completed the cleanup by standardizing the text through removing non-ASCII characters, stemming, stop-word removal, and case folding. As a result, 14 tweet collections and 9 web page collections were cleaned and indexed into Solr for retrieval. The detailed list of the collection names can be found in Section 17 of the report.
2 Acknowledgement
First, we want to thank the Integrated Digital Event Archiving and Library (IDEAL) [21] team for providing us with the wonderful opportunity to extend their curr
nd(url)
                    else:
                        expanded_url_dict[ori_url] = [url]
                        i += 1
                        page = r.content or r.text
                        soup = BeautifulSoup(page)
                        title = ''
                        text = ''
                        if soup.title:
                            if soup.title.string:
                                title = soup.title.string
                        comments = soup.findAll(text=lambda text: isinstance(text, Comment))
                        [comment.extract() for comment in comments]
                        text_nodes = soup.findAll(text=True)
                        visible_text = filter(visible, text_nodes)
                        text = ' '.join(visible_text)
                        text = title + ' ' + text
                        webpages.append((url, title, text))
            else:
                e += 1
    except:
        print sys.exc_info()[0], url
        e += 1
print 'Unique Orig URLs expanded:', i
print 'Bad URLs:', e
print 'Unique Orig URLs:', len(expanded_url_dict)
fo = open('seedsURLs_' + archiveID + '.txt', 'w')
fs = open('short_origURLsMapping_' + archiveID + '.txt', 'w')
for ourl, surls in expanded_url_dict.items():
    fs.write(ourl + ' ' + ' '.join(surls) + '\n')
    fo.write(ourl + '\n')
fs.close()
fo.close()
# Saving webpages text to file
i = 1
for wp in webpages:
    f = open(str(i) + '.txt', 'w')
    cont = wp[1] + ' ' + wp[2]
    f.write(cont.encode('utf8'))
    f.close()
    i += 1
print 'Webpages text saved'
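For readers who only need the gist of the page-text extraction in the script above, here is a minimal, self-contained sketch of the same idea with BeautifulSoup. The set of tags treated as invisible is an assumption on our part and may differ from the exact filter used in the TA script:

    from bs4 import BeautifulSoup
    from bs4.element import Comment

    def visible_text(html):
        # parse the page, drop comments and non-visible elements, and return the remaining text
        soup = BeautifulSoup(html, 'html.parser')
        for comment in soup.find_all(text=lambda t: isinstance(t, Comment)):
            comment.extract()
        for tag in soup(['script', 'style', 'head', 'title', 'meta', 'noscript']):
            tag.decompose()
        return ' '.join(soup.get_text(separator=' ').split())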
... 47
Python script for tweet cleanup ... 47
Python script for web page cleanup ... 52
Python code for cleaning up tweets with MapReduce (raw version) ... 56
Appendix C: Code provided by TAs ... 59
Crawl script for Nutch ... 59
URL extraction from tweets ... 59
List of Tables
Table 1: Steps and description for cleaning tweets ... 18
Table 2: Steps and description for cleaning web pages ... 19
Table 3: Tweet cleaning statistics ... 25
Table 4: Web page cleaning statistics ... 26
Table 5: Language statistics for web page collections ... 26
Table 6: Various file types for each web page collection ... 27
List of Figures
Figure 1  Figure 2  Figure 3  Figure 4  Figure 5  Figure 6  Figure 7  Figure 8  Figure 9
ormat for further processing.
17 User Manual
17.1 Cleaning tweets for small collections on local machine
The code has been attached. The procedures are listed below.
1. Install the Python library AVRO with the command: pip install avro --user. Save the AVRO schema file in Appendix A as tweet.avsc, save the attached Python code as tweet_cleanup.py, and upload them onto the cluster.
2. Copy the large collections from HDFS to local using the command: hadoop fs -copyToLocal /class/CS5604S15/dataset/XXXXXX/part-m-00000.avro XXXXXX/part-m-00000_ori.avro, where XXXXXX is the collection name.
3. To clean up the tweets, run the Python script with the command: nohup python tweet_cleanup.py part-m-00000_ori.avro part-m-00000.avro tweet.avsc XXXXXX &, where the argument part-m-00000_ori.avro is the input AVRO file, the argument part-m-00000.avro is the output AVRO file, and tweet.avsc is the AVRO schema file.
4. Copy the output AVRO file onto HDFS with the command: hadoop fs -copyFromLocal part-m-00000.avro YYYYYY, where YYYYYY is the expected path of the output AVRO file on HDFS.
All the small and large tweet collections have been cleaned up, and they are available under the /user/cs5604s15_noise/TWEETS_CLEAN folder. One can check the cleaned collections with the command hadoop fs -du -h /user/cs5604s15_noise/TWEETS_CLEAN, as shown below.
[cs5604s15_noise@node1]$ hadoop fs -du -h /user/cs5604s15_noise/TWE
page cleanup ... 26
14 Evaluation ... 28
14.1 ... 28
14.2 Feedback from teams ... 28
15 Timeline ... 29
16 Conclusion & Future Work ... 30
17 User Manual ... 31
17.1 Cleaning tweets for small collections on local machine ... 31
17.2 Cleaning tweets for small collections using Hadoop ... 31
17.3 Cleaning ... 32
18 Developer Manual ... 34
18.1 Webpage crawling with Python scripts ... 34
18.2 Loading Web page Collection to HDFS ... 35
18.3 Solr Installation and data importing ... 36
18.4 Hadoop and Mahout Installation ... 38
18.5 Setting up a 6-node Hadoop Cluster ... 40
19 References ... 43
Appendix A: AVRO Schema for Tweets and Web pages ... 45
AVRO schema for tweet collections ... 45
AVRO schema for web page collections ... 45
Appendix B: Source code of cleaning script
presenter implements the algorithm with a noise filter and character normalization to get a decent detection accuracy of 99.8% [12].
7 System Design
When designing our system, we focused on how the other components of the information retrieval system will consume the clean and relevant data that is produced. More specifically, our design is focused on building a system that would seamlessly integrate with the frameworks and/or methodologies designed by the other teams. For the large collections, the tweets will be stored on the Hadoop cluster. We execute a Python script, tweet_shortToLongURLFile.py, that was provided to us to extract all the URLs, which are fed into Apache Nutch to fetch the web pages, which are stored in SequenceFile format. We plan to develop a Python script that will run on the Hadoop cluster via Hadoop's Streaming API; it will process the input files stored on HDFS and output files in AVRO file format. An AVRO file is essentially a JSON file that also contains the JSON schema within the file along with the data. The schema for the JSON is provided to us by the Hadoop team, which is responsible for uploading the AVRO files into HBase. Once our system is in place and fully functional, other teams can retrieve the data from the AVRO files or from the tables in HBase.
(Figure: Webpage and Tweets Cleanup. Raw source web pages in SequenceFile format are cleaned and written out as web pages in AVRO file format.)
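To make the intended streaming workflow concrete, the sketch below shows how such a map-only job could be launched. The jar path, the mapper file name, and the output location are illustrative assumptions, not the exact command we used; also note that writing AVRO directly from a streaming job requires a custom output format (see [17] and [18]), otherwise the mapper can emit JSON lines that are converted to AVRO in a separate step.

    hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
      -files tweet_mapper.py,tweet.avsc \
      -mapper "python tweet_mapper.py" \
      -reducer NONE \
      -input /class/CS5604S15/dataset/XXXXXX \
      -output /user/cs5604s15_noise/TWEETS_CLEAN/XXXXXX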
# ...riginal tweet
        clean_tweet, clean_tweet_2, clean_tweet_only, Hashtaglist, Userhandlelist, url_list = cleanup(rawtweet)
        if clean_tweet is None:
            continue
        clean_tweet_count += 1
        full_clean_tweet = {}
        username = full_tweet[u'from_user']
        tweetID = full_tweet[u'id']
        user_id = full_tweet[u'from_user_id']
        timestamp = str(full_tweet[u'time'])          # original time is of type int
        source = checknone(full_tweet[u'archivesource'])
        in_reply_to_user_id = checknone(full_tweet[u'to_user_id'])
        geocord1, geocord2 = full_tweet[u'geo_coordinates_0'], full_tweet[u'geo_coordinates_1']
        full_clean_tweet['tweet_id'] = tweetID.encode('ascii', 'ignore')
        unique_id = uuid.uuid3(uuid.NAMESPACE_DNS,
                               user_id.encode('ascii', 'ignore') + tweetID.encode('ascii', 'ignore'))
        full_clean_tweet['doc_id'] = doc_id + '--' + str(unique_id)
        full_clean_tweet['text_clean'] = clean_tweet_only.encode('ascii', 'ignore')
        full_clean_tweet['text_clean2'] = clean_tweet_2.encode('ascii', 'ignore')
        full_clean_tweet['text_original'] = rawtweet
        full_clean_tweet['created_at'] = timestamp.encode('ascii', 'ignore')
        full_clean_tweet['user_screen_name'] = username
        full_clean_tweet['user_id'] = user_id.encode('ascii', 'ignore')
        full_clean_tweet['lang'] = 'English'
        full_clean_tweet['collection'] = doc_id
        if float(geocord1) != 0.0 or float(geocord2) != 0.0:
            coordinate = '%s,%s' % (geocord1, geocord2)
            full_clean_tweet['coordinates'] = coordinate.encode('ascii', 'ignore')
        if url
rse words or swear words;
ii) tweet.avsc: AVRO schema for tweet collections;
iii) webpage.avsc: AVRO schema for web page collections;
iv) tweet_cleanup.py: Python script to clean tweets;
v) webpageclean.py: Python script to clean web pages.
4 Chapter/Section Summary
Chapter 1 introduces the goal of our project. Chapter 5 discusses our project goal and outlines specific tasks within that goal. Chapter 7 discusses the overall system architecture. Chapter 8 provides insight into the various structures and properties of the documents in our collection. Chapters 9 and 10 provide details on the approach to our implementation and the tools we will be using to help us during the process. Chapter 11 describes the step-by-step rules that we followed to clean the tweet collections and the web page collections. Chapter 12 provides further detail on the process that we used for cleaning, via a data flow diagram and an example output of the tweet cleanup implementation. Chapter 13 describes our work in cleaning all of the collections; the chapter also includes statistics such as the various languages found in the collections, along with the various file types that we encountered while cleaning the web page collections. These details are broken down for each collection. Chapter 14 talks about the different ways in which we evaluated our work. Chapter 15 provides a timeline breakdown of our tasks for the semester. Chapters 17 and 18 provide step-by-step instruct
rtualBox-4.3.22-98236-Win.exe from the Oracle VM VirtualBox website (https://www.virtualbox.org) and install it.
2. We will use the open source Cloudera Distribution Including Apache Hadoop (CDH) as our Hadoop platform. Download the CDH demo, a VirtualBox image which has CentOS 6.4 and CDH 5.3 already built in, from the Cloudera QuickStart webpage (http://www.cloudera.com/content/cloudera/en/downloads/quickstart_vms/cdh-5-3-x.html) and unzip it.
3. From the VirtualBox menu, import the OVF file which was just unzipped.
Figure 13: Import the Hadoop virtual machine into VirtualBox. (Screenshot of the VirtualBox Manager showing the imported cloudera-quickstart-vm-5.3.0-0-virtualbox appliance: Red Hat 64-bit guest, 4096 MB base memory, 62.50 GB virtual disk.)
4. Turn on the virtual machine, check the NodeManager, and see that Hadoop is running (Figure 14).
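If the import is scripted rather than done through the GUI, VirtualBox's command-line tool can perform the same step; this is only a hedged alternative to the steps above, and the appliance file name is assumed from the VM name shown in Figure 13:

    VBoxManage import cloudera-quickstart-vm-5.3.0-0-virtualbox.ovf
    VBoxManage startvm "cloudera-quickstart-vm-5.3.0-0-virtualbox" --type gui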
s the language with the most character-sequence occurrences. For example, if a web page contained 70% English text and 30% non-English text, the library would return "en" as the language of the text. We did not process or clean text for the web pages that did not return "en" or English as the language of the web page. Additionally, the Social Networking team required all the outgoing links for each web page; they needed this information to build their social graph. Using regular expressions, we were able to extract the URLs within a web page document and store them as a list in the AVRO schema. You can find the details of our code and the results of our execution in the User Manual section of the report.
13 Implementation Output Statistics
Please find below the details of our cleanup runs for all of the collections. Collections suffixed with S are small collections, whereas collections suffixed with B are big collections.
13.1 Tweet Cleanup
Table 3: Tweet cleaning statistics
Collection Name | Number of Tweets | Number of Non-English Tweets | Percent of Tweets Cleaned | Execution Time (seconds) | Size of Input File (MB) | Size of Output File (MB)
13.2 Webpage cleanup
Table 4: Web page cleaning statistics
Collection Name | Number of Web pages | Number of Non-English Web pages | Percent of Web pages Cleaned | Execution Time (seconds) | Size of Input File (MB) | Size of Output File (MB)
plane_crash_S | 575 | 170 | 7
simply missing out altogether on a recovery that favoured the better off. For Tory MPs, the findings will fuel fears that the past five years of painful austerity are leading to a voteless recovery as households face up to repaying debts rather than enjoying spending sprees. Mr Miliband told an audience in west London that the recovery was only working for the privileged few. Other people, he warned, were asking why they are being told there is a recovery when they aren't feeling the benefits: "People working so hard but not being rewarded, young people fearing that they are going to have a worse life than their parents, people making a decent living but still unable to afford to buy a house."
The surrounding regions labelled "Other Content (Noise)" hold material such as the related-stories links and other headlines:
- Welfare cuts leave councils with huge bill to put families in hotels
- David Cameron can't explain how £7bn tax giveaway will be funded, Labour says
- Iain Duncan Smith unveils smart cards in clampdown on benefit claimants' drinking and gambling
- George Osborne pledges two-year welfare freeze to raise £3.2bn to cut the deficit
- David Cameron talks to the Standard with eight days to go until the General Election
- Will battle after north London man leaves £500k to builder who cleaned his gutters for free
- Clandon Park House fire: 80 firefighters battle huge blaze at historic stately home
- Lone
techworks.lib.vt.edu/handle/10919/51137, accessed on 02/05/2015.
[13] Apache Software Foundation. Mahout. http://mahout.apache.org, accessed on 02/13/2015.
[14] Edureka. Apache Mahout Tutorial. https://www.youtube.com/watch?v=zvfKH9YbOsO&list=PL900VrP1hQOGbSzlhdjb47SFMDc6bK6BW, accessed on 02/13/2015.
[15] Cloudera. Cloudera Installation and Upgrade. http://www.cloudera.com/content/cloudera/en/documentation/core/latest/PDF/cloudera-installation.pdf, accessed on 02/13/2015.
[16] Steffen Nissen. Reference Manual for FANN 2.2.0. http://leenissen.dk/fann/html/files/fann-h.html, accessed on 02/13/2015.
[17] Ari Pollak. Include OutputFormat for a specified AVRO schema that works with Streaming. https://issues.apache.org/jira/browse/AVRO-1067, accessed on 03/29/2015.
[18] Michael G. Noll. Using AVRO in MapReduce Jobs with Hadoop, Pig, Hive. http://www.michael-noll.com/blog/2013/07/04/using-avro-in-mapreduce-jobs-with-hadoop-pig-hive, accessed on 03/29/2015.
[19] Arc90. Readability. http://lab.arc90.com/2009/03/02/readability, accessed on 05/05/2015.
[20] Nakatani Shuyo. Language Detection Library for Java. http://www.slideshare.net/shuyo/language-detection-library-for-java, accessed on 05/06/2015.
[21] Python Software Foundation. Regular expressions. https://docs.python.org/2/library/re.html, accessed on 02/05/2015.
[22] Python Software Foundation. UUID. https://docs.python.org/2/library/uuid.html, accessed on 02/05/2015.
(Terminal output, truncated: hadoop fs listing of the cleaned tweet collection AVRO files under /user/cs5604s15_noise/TWEETS_CLEAN, owned by user cs5604s15_noise, showing file permissions, sizes, and 2015-05-02 timestamps.)
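Beyond checking sizes with hadoop fs -du, a quick way to confirm that a cleaned file is readable is to copy it locally and print a few records with the avro Python package installed above. The field names follow the tweet schema in Appendix A, and the snippet is only a minimal sketch:

    from avro.datafile import DataFileReader
    from avro.io import DatumReader

    with open('part-m-00000.avro', 'rb') as f:
        reader = DataFileReader(f, DatumReader())
        for n, record in enumerate(reader):
            print record['doc_id'], record['text_clean'][:60]   # spot-check a few cleaned tweets
            if n >= 4:
                break
        reader.close()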
xhtml+xml
metadata:
  X-Pingback: http://21stcenturywire.com/xmlrpc.php
  Expires: Fri, 20 Mar 2015 23:02:16 GMT
  nutch.segment.name: 20150320180211
  Connection: close
  X-Powered-By: W3 Total Cache/0.9.4.1
  Server: Apache
  Cache-Control: max-age=3600
  Link: <http://wp.me/p3bwni-aFh>; rel=shortlink
  Date: Fri, 20 Mar 2015 22:02:16 GMT
  Vary: Accept-Encoding,User-Agent
  nutch.crawl.score: 1.0
  Content-Encoding: gzip
  Content-Type: text/html; charset=UTF-8
Content:
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
  <head profile="http://gmpg.org/xfn/11">
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
The output file contains web page content that is sequentially written into it. For each web page, you have the following information about it (highlighted in red):
1. Original URL
2. URL (same as 1)
3. Base URL
4. Content type of the web page
5. Metadata of the web page
6. Content
For each web page, we extracted the text under the Content field using regular expressions. After that, our module employed the Python libraries BeautifulSoup4 and Readability to clean the text and save the cleaned text in AVRO files. Our biggest challenge was working with non-ASCII character sequences. For such instances, we used the Python library langdetect, which parses the text and return
xt, time, id, etc. An example of the tweet input is shown below in JSON format, where the red text is the tweet content which we will clean up:
  iso_language_code: en
  text: u'News Ebola in Canada #CharlieHebdo #CharlieHebdoShooting Suspected patient returns from Nigeria to Ontario with symptoms http://t.co/Ko0D8yweJd'
  time: 1407706831
  from_user: toveenb
  from_user_id: 81703236
  to_user_id:
  id: 498584703352320001
We have written a Python script that takes the tweets in AVRO format from HDFS, cleans the tweets, and stores the results as AVRO files in HDFS. The AVRO schema can be found in Appendix A. During the cleanup process we have extracted user handles, hashtags, usernames, tweet IDs, URLs, and timestamps from the original tweets and cleaned up the tweet content. The hashtags, urls, user_mentions_id, etc. fields in the AVRO schema might have no value or multiple values. As of now, different values are stored within the same string, separated by a delimiter. Also, we generate a UUID for each tweet as the suffix of the doc_id. Examples can be found below, as marked in red:
  lang: English
  user_id: 81703236
  text_clean: News Ebola in Canada Suspected patient returns from Nigeria to Ontario with symptoms
  text_clean2: new ebol canad suspect paty return niger ontario symptom
  created_at: 1407706831
  hashtags: CharlieH