Xpantrac Final Report.docx
COMMITting Solr index changes to http://localhost:8983/solr/update. Time spent: 0:00:00.072
dcabrera@DMBP exampledocs

Figure 13: IOException from indexing the 50docs.xml file into Solr

The issue was caused by the existence of ampersand characters (&amp;) in the XML file we tried to index. To fix this problem, we removed the ampersands and then ran the indexing command again. The files were then able to be indexed into our local Solr instance without any more issues.

Concept Map

[Figure 14 is a concept map. It shows that Xpantrac (eXPANsion + exTRACtion) combines Information Retrieval with Cognitive Informatics to retrieve the topics of a given page. It takes as input a text file containing relevant information on an article (i.e., the first 30 words in the article), contains components such as the Query Unit Builder, Topic Selector, and Symbol and Stopword Removal, and sends the input text to an External Knowledge Collector (a Solr system or API, such as the Bing API). It currently builds API queries of every 5 words with a 1-word overlap.]

Figure 14: Xpantrac concept map

Xpantrac for Yahoo Search API

For our midterm presentation, we tried to modify Yang's original Xpantrac script, which used a database, to instead use the Bing Search API. However, we ran into multiple authentication issues. As a result o
Yang.

The design of Xpantrac has two parts: Expansion and Extraction. The flow of the algorithm is shown in the figure below.

[Figure 3 is a diagram: Input Text feeds a Preprocessor (Symbol Remover and Stopword Remover), then a Query Unit Builder, then an External Knowledge Collector connected to the Web; these form the EXPANSION part. An NLP Module and a Term-Doc Matrix Builder form the EXTRACTION part, which produces the topic tags.]

Figure 3: Components of Xpantrac grouped into two parts

Because of the modular design of Xpantrac, any component can be flexibly replaced. For our project, we used a web API as the External Knowledge Collector on the first run and later replaced it with a Solr system.

Expansion

The Expansion part of the algorithm is responsible for building a derived corpus of relevant information by expanding the input text and accessing an external knowledge source. This part contains three components:
1. Preprocessor: removes symbol characters (e.g., &) and stopwords (e.g., "a" and "the").
2. Query Unit Builder: segments the preprocessed input text into uniform-sized groups of words. The words are grouped with neighboring words to keep the context.
3. External Knowledge Collector: accesses a knowledge source located outside the system to search for and retrieve information relevant to the queries sent.

Extraction

The Extraction part is where a list of words is derived from the corpus created during Expansion. This part contains three components:
1. NLP Module: applies a POS (Part of Speech) tagger to the co
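To make the Query Unit Builder concrete, here is a small illustrative sketch of the sliding-window grouping it describes. This is our own example, not Yang's code; the unit size of 5 and overlap of 1 mirror the parameters discussed later in the report.

```python
def build_query_units(words, unit_size=5, overlap=1):
    """Segment a token list into uniform-sized query units.

    Consecutive units share `overlap` words so that some context
    is preserved across unit boundaries.
    """
    if overlap >= unit_size:
        raise ValueError("overlap must be smaller than unit_size")
    units = []
    step = unit_size - overlap
    for start in range(0, len(words), step):
        unit = words[start:start + unit_size]
        units.append(" ".join(unit))
        if start + unit_size >= len(words):
            break
    return units

tokens = "mario vazquez grabbed his dog and got out of the way".split()
for unit in build_query_units(tokens):
    print(unit)
```

Each printed unit begins with the last word of the previous one, which is the 1-word overlap the report mentions.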
For example, if the current IDEAL collection in Solr is changed to suit our specifications, then you would only need to change the hostname and port to match the corresponding URL.

Configuration File

In order to help create an easier Xpantrac experience for future developers, we have created a configuration file. This file allows users to enter commonly changed variables such as hostname, port, query field, title of input documents, path to input documents, number of topics to be found, and window overlap.

    xpanconfig.ini
    [server]
    hostname = <hostname>
    port = <port>
    query_field = content
    input_documents = 0
    path_to_input_documents = <path>
    num_topics = 10
    window_overlap = 1

Figure 26: Xpantrac configuration file

Because of this configuration file, there is no longer a need to change variables directly in the Xpantrac script. This will help ensure that all variables are changed correctly when a new user wishes to use the system.

Evaluation of Extracted Topics

File Hierarchy

/project/CTR_30: A directory of 30 CTR files
/project/VARIOUS_30: A directory of 30 various files
/project/gold_ctr30.csv: The gold standard of merged human topics
/project/gold_various30.csv: The gold standard of merged human topics
/project/human_topics_CTR30.csv: Human-assigned topics for 30 CTR articles
/project/human_topics_VARIOUS30.csv: Human-assigned topics for 30 various articles
/project/xpantrac_
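As a sketch of how a script could consume such a file, the snippet below reads the same keys with Python's standard configparser module. The values here are placeholders of our own choosing; only the key names come from Figure 26.

```python
import configparser

# Placeholder values; the real xpanconfig.ini supplies the user's own settings.
config = configparser.ConfigParser()
config.read_string("""
[server]
hostname = localhost
port = 8983
query_field = content
num_topics = 10
window_overlap = 1
""")

server = config["server"]
solr_url = "http://%s:%s/solr" % (server["hostname"], server["port"])
num_topics = server.getint("num_topics")
print(solr_url, num_topics)
```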
asterisks with the queries of your choice. This link stays constant for all queries. Another option that you see is the part that says "json". Here you can change it to return json, xml, python, ruby, php, or csv.

WARC Files with IDEAL Documents

Our group collaborated with the IDEAL Pages group for the initial part of our project, since we were both working with IDEAL and Solr. The IDEAL Pages group's goal was to index the IDEAL documents into Solr. To achieve this goal, they had been given a set of WARC files containing IDEAL documents in the form of HTML pages. However, the WARC files also contained non-HTML documents that were unnecessary for our purposes. After the IDEAL Pages group created a Python script to unpack the WARC files, they sent it to us for further modification.

Python Script to Remove HTML

As stated before, the WARC files included the HTML documents we needed, but they also included a lot of other files we did not need. Figure 8 shows the Python script we created to remove all of the unnecessary documents.

    # removeAllButHTML.py
    import os
    for root, subFolders, files in os.walk(rootdir):
        for filename in files:
            filePath = os.path.join(root, filename)
            if filename.find(".html") == -1:
                os.remove(filePath)

Figure 8: Python script to remove all other files except HTML from a directory

This script recursively deletes all of the files in a root directory that do not end
performance timing ... GLOBAL window jstiming load ... navigationStart ... responseStart ... tick _wtsrt ... tick tbsd_ ... GLOBAL window chrome csi pageT ... tick _tbnd ... collection_id: 3650, id: 7_74825401865487f671bd0fd388ce2b, version: 1465938356823130000

Figure 24: A document from the IDEAL collection in Solr

As you can see, there is a lot of unnecessary text and JavaScript inside of the content field. If Xpantrac thought that this page was a match and we returned the first 30 words of the content field, it would look like:

    Google Newsvar GLOBAL window window function function d a this t this tick functiona a c b b void b b new Date getTime this t a b c this tick start null a var a new d GLOBAL window jstiming

Figure 25: First 30 words of the content field from the IDEAL collection in Solr

Therefore, we are unable to use the Solr collection at this time, because the project specifications for IDEAL Pages and Xpantrac were not the same. If an IDEAL collection is created to match our specifications, then you would only need to change the URL to match the collection URL and the field name to match the field containing the relevant content information.
is being processed
10 Topics: water, rain, california, weather, drought, los angeles, storm, fire, street
81.75 seconds
1399317596.51

Figure 18: Output from the Xpantrac_yahooWeb.py script

Xpantrac for Solr

Finding a Larger Solr Collection

After we successfully indexed our 50 CNN documents into Solr, we found out that 50 files is too small a number to enable Xpantrac to work correctly. Instead, we ended up using Yang's collection of Wikipedia articles on Solr. This collection currently holds 4.2 million documents and counting.

Removing Code from Xpantrac_yahooWeb.py

First, we removed all of the database code and db variables from the Xpantrac_yahooWeb.py script. This database held one thousand New York Times articles. Solr will replace this database, so we can remove it along with the import MySQLdb statement.

Changing the URL in Xpantrac

After obtaining the URL to Yang's Wikipedia collection in Solr, we created a new query request in Xpantrac. First, we had to import urlopen, as seen in Figure 19.

    from urllib import urlopen

Figure 19: Importing urlopen to be used for the query request

Next, we had to modify query_assembled with the correct URL and field name. The hostname and port are blacked out here; see the Special Note at the end of this report.

    for item in query_list:
        ...
        query = "+".join(items)
        ...
        query_assembled = "http://[redacted hostname:port]/solr/collection1/select?q=content:" \
                          + query + "&wt=json&indent=true"
        conn = urlopen(query_assem
17
Xpantrac for Yahoo Search API .......... 17
File Hierarchy .......... 17
Input Text Files .......... 18
Yahoo Search API Authorization .......... 19
Xpantrac for Solr .......... 20
Finding a Larger Solr Collection .......... 20
Removing Code from Xpantrac_yahooWeb.py .......... 20
Changing the URL in Xpantrac .......... 20
Handling the Content Field .......... 21
Changing the Xpantrac Parameters .......... 21
Connecting with IDEAL in the Future .......... 22
Configuration File .......... 23
Evaluation of Extracted Topics .......... 24
File Hierarchy .......... 24
How to Run .......... 24
Human Assigned Topics .......... 24
Gold Standard Files .......... 24
Evaluation Metrics .......... 25
Evaluation .......... 25
Lessons Learned .......... 27
Special Note .......... 27
Acknowledgements .......... 28
References .......... 28

Table of Figures

Figure 1: The 0.txt file used to run the Xpantrac script .......... 6
Figure 2: How to run Xpantrac from the command line with output .......... 7
Figure 3: Components of Xpantrac grouped into two parts ..........
4-04-10 09:27:27. The System panel also shows Swap Space; Versions additionally list lucene-spec 4.7.2 and lucene-impl 4.7.2 1586229 - rmuir - 2014-04-10 09:00:35. The JVM panel shows Runtime: Oracle Corporation Java HotSpot(TM) 64-Bit Server VM 1.7.0_51 (24.51-b03) and 4 Processors; Java Properties and Thread Dump menus, plus Documentation, Issue Tracker, IRC Channel, Community forum, and Solr Query Syntax links, round out the sidebar.

Figure 5: Shows the Solr administration page

Indexing

To index documents with the default setup of the Solr server, you can use the post.jar file that is located in the exampledocs folder. You can copy and paste the post.jar file into any folder and run the command: java -jar post.jar <file name here>. Once you run post.jar, it uploads the files to the server and they are indexed.

Querying

To query the files you have indexed, you choose the Solr collection to search; for the default setup, the collection is named collection1. Once you choose the collection from the administration page, you can select the Query tab to see the Query menu. From here you have a lot of options when you search. What we are most concerned about is the q box containing the query, written as *:*. The left asterisk indicates the tag you want to search in (you can leave the * to search all tags), and the right asterisk indicates the content you want to search for within the tag. Searching *:* returns all of the documents that are contained within the server.
    Server jetty-8.1.10.v20130312
    46 [main] INFO org.eclipse.jetty.deploy.providers.ScanningAppProvider - Deployment monitor /Users/dcabrera/Downloads/solr-4.7.2/example/contexts at interval ...
    57 [main] INFO org.eclipse.jetty.deploy.DeploymentManager - Deployable added: /Users/dcabrera/Downloads/solr-4.7.2/example/contexts/solr-jetty-context.xml
    1616 [main] INFO org.eclipse.jetty.webapp.StandardDescriptorProcessor - NO JSP Support for /solr, did not find org.apache.jasper.servlet.JspServlet
    1690 [main] INFO org.apache.solr.servlet.SolrDispatchFilter - SolrDispatchFilter.init()
    1708 [main] INFO org.apache.solr.core.SolrResourceLoader - JNDI not configured for solr (NoInitialContextEx)
    1708 [main] INFO org.apache.solr.core.SolrResourceLoader - solr home defaulted to 'solr/' (could not find system property or JNDI)
    1709 [main] INFO org.apache.solr.core.SolrResourceLoader - new SolrResourceLoader for directory: 'solr/'
    1865 [main] INFO org.apache.solr.core.ConfigSolr - Loading container configuration from /Users/dcabrera/Downloads/solr-4.7.2/example/solr/solr.xml

Figure 4: Shows the command to start the server and initialization output

Figure 5 shows the Solr Admin page in Chrome at localhost:8983/solr. The sidebar offers Dashboard, Logging, and Core Admin; the Instance panel reports "Start 3 minutes ago"; Versions lists solr-spec 4.7.2 and solr-impl 4.7.2 1586229 - rmuir - 201
Xpantrac Connection with IDEAL

David Cabrera (dcabrera@vt.edu), Erika Hoffman (herika6@vt.edu), Samantha Johnson (sjf2728@vt.edu), Sloane Neidig (sloane10@vt.edu)
Client: Seungwon Yang (syang20@gmu.edu)
CS4624, Edward A. Fox
Blacksburg, VA
May 8, 2014

Table of Contents

Table of Contents .......... 2
Table of Figures .......... 4
Abstract .......... 5
User's Manual .......... 6
Command Line .......... 6
Developer's Manual .......... 8
Inventory of Data Files .......... 8
Xpantrac Explained .......... 9
Expansion .......... 10
Extraction .......... 10
How to Setup Apache Solr .......... 11
Download .......... 11
Starting the Server .......... 11
Indexing .......... 12
Querying .......... 12
WARC Files with IDEAL Documents .......... 13
Python Script to Remove HTML .......... 14
Indexing Documents into Solr .......... 14
Attempting to use the IDEAL Pages Script .......... 14
Manually Indexing Documents into Solr .......... 15
Concept Map ..........
and drought now rains torment Southern California
By Kyung Lah and Ben Brumfield, CNN. Updated 3:26 PM EST, Sat March 1, 2014
Watch this video: Mudslides wreak havoc on Southern Calif.

STORY HIGHLIGHTS
Rains are the first since the weather system behind the drought collapsed
Though desperately needed, the rain has not been great news
The deluge has come down at more than an inch an hour at times
Rain and cold will move, hitting the East Coast Monday

(CNN) Mario Vazquez grabbed his dog and got out of the way as a stream of water and mud came gushing on to his streets. Since California has been in the middle of its worst drought in 100 years, it would seem that the sight of rain would be good news. But in Glendora and other towns in Los Angeles County, it wasn't. The rain has been much needed, but Friday's deluge, coming down at more than an inch an hour at times, landed on bone-dry hills scorched... With little vegetation left to stop them, walls of water have gushed into valleys below. They have spewed mud and debris into quiet resider... More could hit before Saturday is up, the National Weather Service says. It has placed Los Angeles and Ventura counties under a flash flood... By the time it's over, up to six inches will have landed on the foothills of Los Angeles County, and as much as 10 inches on the ridge line. Weather weirdness

Figure 9: Text file containing the information from a CNN article

    <id>1</id>
    <title>A
at we had a better understanding of their project goals. We had initially thought they could help us accomplish some of our tasks, so we waited for them to finish one of their deliverables so that they could share it with us. It turned out that this particular deliverable did not accomplish the same thing we needed, so we wasted time waiting on it.

Another lesson learned dealt with Apache Solr. We were very confused about the purpose of Solr when we first started our project. Additionally, we were unsure how to use it. We did not understand how to index or query files, so we had to find a lot of tutorials, some of which were misleading, or ask our primary contact. However, these tasks became clearer after we had the guest lecture from Tarek Kanan about Solr and completed the Solr assignments for homework. We hope that in the future the Solr activity will be moved toward the beginning of the semester instead of the end. We believe that we would have experienced fewer troubles if the course had been structured this way.

Overall, we gained a lot of knowledge regarding tools that were new to us, such as Solr and the Yahoo Search API. We are glad to have the experience of working with Yang's code and hope that his research can be carried on in the future.

Special Note

Yang has requested that the URL to the GMU Wikipedia Solr collection be redacted, as it should not yet be public. This explains the blackened hostname and port in Figure 20.

Ackn
bled)
    rsp = eval(conn.read())
    results = rsp["response"]["d...

Figure 20: Shows the new query_assembled with content as the field name to query in the collection. This can be found in the makeMicroCorpus function.

Handling the Content Field

In addition to changing the query field to content in the query_assembled for the request, we also had to change the field name in the configuration for the results seen later in the code. First, we changed the field name to content. Next, we returned only the first 30 words of the content field. Only the first 30 words are used because they tend to represent the key issues of an entire document. The field change can be seen in Figure 21.

    # for M_43 configuration only
    results_merged = ""
    for result in results[0:10]:
        short_result = result["content"].split()[:30]
        clean_result = " ".join(short_result).replace(...).strip()
        ...

Figure 21: Shows the return of the first 30 words of the content field

Because we are no longer using the Yahoo Search API, we also removed all of the authorization code that enabled us to access that API.

Changing the Xpantrac Parameters

With Yang's help, the number of topics for Xpantrac to find was changed to 10, the number of API results to return was set to 10, and the query unit size to 5. These changes can be seen in Figures 22 and 23.

    def main():
        num_topics = 10
        window_overlap = 1

Figure 22: num_topics represents the number of topics to be found for each input document
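The "first 30 words" step above can be sketched as a small helper. This is our own illustration; the report's Figure 21 code differs in details (it also strips symbols from the result).

```python
def first_n_words(field_value, n=30):
    """Return the first n whitespace-separated words of a Solr field value."""
    return " ".join(field_value.split()[:n])

content = ("Mario Vazquez grabbed his dog and got out of the way "
           "as a stream of water and mud came gushing on to his streets")
print(first_n_words(content, n=10))
```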
.......... 10
Figure 4: Shows the command to start the server and initialization output .......... 11
Figure 5: Shows the Solr administration page .......... 12
Figure 6: A query of *:* that returns all of the documents in the collection .......... 13
Figure 7: URL to the query response .......... 13
Figure 8: Python script to remove all other files except HTML from a directory .......... 14
Figure 9: Text file containing the information from a CNN article .......... 15
Figure 11: Command to index 50docs.xml into Solr .......... 15
Figure 12: XML file using the correct format .......... 16
Figure 13: IOException from indexing the 50docs.xml file into Solr .......... 16
Figure 14: Xpantrac concept map .......... 17
Figure 15: Creates a list of all file IDs from plain_text_ids.txt .......... 18
Figure 16: Shows how each input text file is accessed .......... 18
Figure 17: Authorization and query information for Yahoo Search API .......... 19
Figure 18: Output from Xpantrac_yahooWeb.py script .......... 20
Figure 19: Importin
ck, authorities told the news agency. No motive has been provided. A doctor with the Kunming No. 1 People's Hospital told Xinhua over the phone they're not sure of the number of casualties. Xinhua said the Kunming Railway Station is one of the largest stations in southwest China.</field>
    <field name="id">1</field>
    <field name="title">After forest fires and drought now rains torment Southern California</field>
    <field name="content">Mario Vazquez grabbed his dog and got out of the way as a stream of water and mud came gushing on to his streets. Since California has been in the middle of its worst drought in 100 years, it would seem that the sight of rain would be good news. ...</field>

Figure 12: XML file using the correct format

Initially, we had 50 separate XML files for each of the 50 articles. However, we learned that we were able to combine these into one long XML file, with each article in its own doc tag. When we tried to index the 50docs.xml file into Solr, we received the error seen in Figure 13.

    dcabrera@DMBP exampledocs $ java -jar post.jar 0.xml
    SimplePostTool version 1.5
    Posting files to base url http://localhost:8983/solr/update using content-type application/xml
    POSTing file 0.xml
    SimplePostTool: WARNING: Solr returned an error #500 Server Error
    SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 500 for URL: http://localhost:8983/solr/update
    1 files indexed.
ctr30_10topics.csv: Xpantrac-assigned topics for 30 CTR articles, 10 topics per article
/project/xpantrac_ctr30_20topics.csv: Xpantrac-assigned topics for 30 CTR articles, 20 topics per article
/project/xpantrac_various30_10topics.csv: Xpantrac-assigned topics for 30 various articles, 10 topics per article
/project/xpantrac_various30_20topics.csv: Xpantrac-assigned topics for 30 various articles, 20 topics per article
/project/computePRF1.py: Computes the precision, recall, and F1 score of the extracted topics

How to Run

    > python computePRF1.py gold_ctr30.csv xpantrac_ctr30_topics.csv
    > python computePRF1.py gold_various30.csv xpantrac_various30_topics.csv

Human Assigned Topics

Two sets of test files, CTR_30 and VARIOUS_30, were included in this project. These files have been tagged with topics by multiple human sources. The people who tagged these articles were from the Library Sciences field, so they were experienced taggers. The human-assigned topics for each file can be found in human_topics_CTR30.csv and human_topics_VARIOUS30.csv.

Gold Standard Files

The gold standard files are a merged version of the human-assigned topics. That means that if Tagger A said that a file's topics are "Florida, marsh, tropical, coast" and Tagger B said that same file's topics are "marsh, storm, Jacksonville", then those topics would be merged in the gold standard file. Therefore, the gold standard of topics for that file would be F
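The merging rule described for the gold standard amounts to a union of each tagger's topic list. A minimal sketch (our own illustration, not the project's actual merging procedure):

```python
def merge_topics(*tagger_lists):
    """Union several taggers' topic lists, keeping first-seen order
    and dropping duplicates."""
    merged = []
    for topics in tagger_lists:
        for topic in topics:
            if topic not in merged:
                merged.append(topic)
    return merged

tagger_a = ["Florida", "marsh", "tropical", "coast"]
tagger_b = ["marsh", "storm", "Jacksonville"]
print(merge_topics(tagger_a, tagger_b))
# -> ['Florida', 'marsh', 'tropical', 'coast', 'storm', 'Jacksonville']
```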
[Figure 6 screenshot: the Solr Admin query page at localhost:8983/solr/#/collection1/query, with the select Request Handler. The JSON response shows a responseHeader (status 0, QTime 3), params (indent=true, q=*:*, wt=json), and a response with numFound=51 and start=0. The docs include id 0 with title "Knife wielding mob kills 27 at China train station" (content: "At least 27 people were killed and 109 wounded when a group of people armed with knives stormed a railway station in the...") and id 1 with title "After forest fires and drought now rains torment Southern California" (content: "Mario Vazquez grabbed his dog and got out of the way as a stream of water and mud came gushing on to his streets. Since...").]

Figure 6: A query of *:* that returns all of the documents in the collection

The link at the top of the query gives you the general structure of a query if you do not want to use the Admin page.

    http://localhost:8983/solr/collection1/select?q=*:*&wt=json&indent=true

Figure 7: URL to the query response

From here, the asterisks in the link represent the things we search for, and you can replace the
ewitnesses eyewitnesses told cnn affiliate wpri
    wpri acrobats type aerial scaffolding
    scaffolding human chandelier cable snapped
    snapped payne told cnn fredricka
    fredricka whitfield apparatus multiple performances
    performances week ringling bros barnum
    barnum bailey launched legends time
    time venue equipment performer group
    group performers well performers carefully
    carefully inspected health safety performers
    performers guests seriously company safety
    safety department spends countless making
    making equipment safe effective continued
    continued circus local authorities investigating
    investigating incident payne legends began
    began short providence residency final
    final performances slated rest canceled
    canceled making determination remainder providence
    providence engagement payne
    Micro corpus is created
    Vector Space Model is applied for topic extraction
    Topics: payne, island, rhode, circus, providence, reuter, american, county, john, state

Figure 2: How to run Xpantrac from the command line with output

Developer's Manual

Inventory of Data Files

/project: Directory containing all project files
/project/Xpantrac.py: Script containing the Xpantrac algorithm to be used with Apache Solr
/project/0.txt: Sample input file to be used by the algorithm
/project/pos_tagger.py: Part-of-speech tagger. Trained using t
f these problems, we modified the original Xpantrac script to use the Yahoo Search API.

File Hierarchy

/project: Directory containing all project files
/project/Xpantrac_yahooWeb.py: Script containing the Xpantrac algorithm to be used with the Yahoo Search API
/project/plain_text_ids.txt: Text file containing a list of file IDs. Used in /project/Xpantrac_yahooWeb.py
/project/files: Directory of text files with corresponding IDs. Used in /project/Xpantrac_yahooWeb.py

Input Text Files

The Xpantrac_yahooWeb.py script used a plain_text_ids.txt file to identify all of the IDs of the text files to be used as input. These text files can be found in the /project/files directory. The IDs for the text files are simply 0-50, and the text files themselves are named 0.txt-50.txt, respectively. Figures 15 and 16 show how the files are accessed in the Xpantrac for Yahoo script.

    fi = open("plain_text_ids.txt", "r")
    li = fi.read().split()
    fi.close()

Figure 15: Creates a list of all file IDs from plain_text_ids.txt

    print("\nDocument ID %s is being processed" % doc_id)
    ...
    filename = str(filenum) + ".txt"
    # for Linux / Mac machines
    text = open("files/" + filename, "r").read()
    # for Windows machines
    text = open("files\\" + filename, "r").read()

Figure 16: Shows how each input text file is accessed

Yahoo Search API Authorization

Querying the Yahoo Search API
fter forest fires and drought now rains torment Southern California</title>
    <content>Mario Vazquez grabbed his dog and got out of the way as a stream of water and mud came gushing on to his streets.
    Since California has been in the middle of its worst drought in 100 years, it would seem that the sight of rain would be good news.
    But mud from the streets is beginning to ooze over into yards, pools, and houses. It has damaged two homes in Glendora so far, police chief Tim Staub said.</content>

Figure 10: XML file containing the information from the text file in Figure 9

Next, we tried to manually index those XML files into Solr using the command line.

    dcabrera@DMBP exampledocs $ java -jar post.jar 50docs.xml

Figure 11: Command to index 50docs.xml into Solr

However, we ran into an error. After examining Solr's schema.xml file and reviewing some tutorials, we realized that we had been formatting our XML files incorrectly for Solr. The correct formatting can be seen in Figure 12.

    50docs.xml
    <field name="id">0</field>
    <field name="title">Knife wielding mob kills 27 at China train station</field>
    <field name="content">At least 27 people were killed and 109 wounded when a group of people armed with knives stormed a railway station in the southwest Chinese city of Kunming, authorities said, according to state news agency Xinhua. It was an organized, premeditated terrorist atta
g urlopen to be used for the query request .......... 21
Figure 20: Shows the new query_assembled with content as the field name to query in the collection. This can be found in the makeMicroCorpus function .......... 21
Figure 21: Shows the return of the first 30 words of the content field .......... 21
Figure 22: num_topics represents the number of topics to be found for each input document .......... 21
Figure 23: u_size represents the query unit size and a_size represents the API return size .......... 22
Figure 24: A document from the IDEAL collection in Solr .......... 22
Figure 25: First 30 words of the content field from the IDEAL collection in Solr .......... 22
Figure 26: Xpantrac configuration file .......... 23

Abstract

Title: Integrating Xpantrac into the IDEAL software suite and applying it to identify topics for IDEAL webpages

Identifying topics is useful because it allows us to easily understand what a document is about. If we organize documents into a database, we can then search through those documents using their identified topics. Previously, our client Seungwon Yang developed an algorithm for identifying topics in a given webpage, called Xpantrac. This algorithm is based on the Expansion-Extraction approach; consequently, it is also named after this approach. In the first part, the text of a d
he CoNLL2000 corpus provided by the Natural Language Toolkit (NLTK)
/project/pos_tagger.pyc: Compiled version of /project/pos_tagger.py
/project/get-pip.py: Package installer
/project/stopwords.txt: A list of words to exclude from the topic identification
/project/custom_stops.txt: A list of words to exclude from the topic identification
/project/Xpantrac_yahooWeb.py: Script containing the Xpantrac algorithm to be used with the Yahoo Search API
/project/plain_text_ids.txt: Text file containing a list of file IDs. Used in /project/Xpantrac_yahooWeb.py
/project/files: Directory of text files with corresponding IDs. Used in /project/Xpantrac_yahooWeb.py
/project/processWarcDir.py: Unpacks a WARC file and returns only HTML files
/project/CTR_30: A directory of 30 CTR files
/project/VARIOUS_30: A directory of 30 various files
/project/gold_ctr30.csv: The gold standard of merged human topics
/project/gold_various30.csv: The gold standard of merged human topics
/project/human_topics_CTR30.csv: Human-assigned topics for 30 CTR articles
/project/human_topics_VARIOUS30.csv: Human-assigned topics for 30 various articles
/project/xpantrac_ctr30_10topics.csv: Xpantrac-assigned topics for 30 CTR articles, 10 topics per article
/project/xpantrac_ctr30_20topics.csv: Xpantrac-assigned topics for 30 CTR articles, 20 topics per article
/project/xpantrac_various30_20topics.csv: Xpantrac-assigned topics for 30 various articles, 20 t
lorida, marsh, tropical, coast, storm, Jacksonville.

Evaluation Metrics

This evaluation of topics measures precision, recall, and F1. Precision can be used to compute the proportion of matching topics (i.e., C) from all of the retrieved topics (i.e., A), by the following formula:

    precision = |C| / |A| = P(relevant | retrieved)

Recall is the proportion of the matching topics (i.e., C) from all of the relevant topics (i.e., B), which are assigned by the human topic indexers or exist as the gold standard:

    recall = |C| / |B| = P(retrieved | relevant)

Ideally, both the precision and recall values should be 1. This would mean that the sets of topics compared are exactly the same. The F1 score is used to combine precision and recall with the following formula:

    F1 = (2 * precision * recall) / (precision + recall)

Evaluation

The tables below show the evaluation of average precision, recall, and F1 of the gold standard of topics versus 10 Xpantrac topics.

    > python computePRF1.py gold_ctr30.csv xpantrac_ctr30_10topics.csv
    Evaluation: Average Precision | Average Recall | Average F1

    > python computePRF1.py gold_various30.csv xpantrac_various30_10topics.csv
    Evaluation: Average Precision | Average Recall | Average F1

Above, the number of human-assigned topics is much larger than the number of Xpantrac topics (10). Because of this, the recall value will be somewhat low. Increasing the number of Xpantrac topics from 10 t
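The three formulas above can be computed over topic sets as follows. This is a minimal sketch of the metrics themselves, not the project's computePRF1.py, which reads the CSV files and averages the scores over 30 articles.

```python
def prf1(retrieved, relevant):
    """Precision, recall, and F1 between a retrieved topic set (A)
    and a relevant/gold topic set (B); their intersection is C."""
    retrieved, relevant = set(retrieved), set(relevant)
    matching = retrieved & relevant
    precision = len(matching) / len(retrieved) if retrieved else 0.0
    recall = len(matching) / len(relevant) if relevant else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

gold = ["florida", "marsh", "tropical", "coast", "storm", "jacksonville"]
xpantrac_topics = ["marsh", "storm", "rain"]
print(prf1(xpantrac_topics, gold))
```

Note that with more retrieved topics (e.g., 20 instead of 10), recall can only stay equal or rise while precision tends to fall, matching the behavior the report observes when moving from 10 to 20 topics.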
ly inspected. We take the health and safety of our performers and our guests very seriously, and our company has a safety department that spends countless hours making sure that all of our equipment is indeed safe and effective for continued use," he said. The circus and local authorities are investigating the incident together, Payne said. Legends began a short Providence residency on Friday. The final five performances there were slated for 11 a.m., 3 p.m., and 7 p.m. on Sunday, and 10:30 a.m. and 7 p.m. on Monday. "The rest of the 11 a.m. Sunday show was canceled and we're making a determination about the remainder of the shows for the Providence engagement," Payne said.

Figure 1: The 0.txt file used to run the Xpantrac script

    PS C:\Users\sloan_000\Desktop\project> python Xpantrac.py
    Input text 0.txt is being processed
    List of queries (query size 5):
    cnn ringling bros barnum bailey
    bailey circus performers injured providence
    providence rhode island apparatus failed
    failed circus spokesman stephen payne
    payne performers fell hair hang
    hang apparatus holds performers hair
    hair failed payne performer injured
    injured ground performers hospitalized injuries
    injuries accident rhode island hospital
    hospital spokeswoman jill reuter told
    told cnn listed critical condition
    condition reuter clear victims multiple
    multiple emergency units responded accident
    accident dunkin donuts center ey
nt will be used as input to Xpantrac. To run the Xpantrac script, simply type "python Xpantrac.py". The output in the console will show the query size, each query performed, and a list of topics found in the relevant documents.

(CNN) Nine Ringling Bros. and Barnum and Bailey circus performers were among 11 people injured Sunday in Providence, Rhode Island, after an apparatus used in their act failed, circus spokesman Stephen Payne said. Eight performers fell when the hair hang apparatus, which holds performers by their hair, failed, Payne added. Another performer was injured on the ground, he said. The performers were among 11 people hospitalized with injuries related to the accident, Rhode Island Hospital spokeswoman Jill Reuter told CNN. One of those people was listed in critical condition, Reuter said. It was not immediately clear who the other two victims were. Multiple emergency units responded to the accident at the Dunkin Donuts Center. Eyewitnesses told CNN affiliate WPRI that they saw acrobats up on a type of aerial scaffolding doing a "human chandelier" when a cable snapped. Payne told CNN's Fredricka Whitfield the apparatus had been used for multiple performances each week since Ringling Bros. and Barnum & Bailey launched its Legends show in February. "Each and every time that we come to a new venue, all of the equipment that is used by this performer, this group of performers, as well as other performers, is careful
o a larger number, such as 20, will increase the recall value. Eventually, the F1 measure will increase as well. However, the precision value may decrease slightly. Below are the average precision, recall, and F1 scores for the increased number of topics (20).

    > python computePRF1.py gold_ctr30.csv xpantrac_ctr30_20topics.csv
    Evaluation: Average Precision, Average Recall, Average F1

    > python computePRF1.py gold_various30.csv xpantrac_various30_20topics.csv
    Evaluation: Average Precision, Average Recall, Average F1

As expected, the precision value has decreased and the recall value has increased. Over time, we should still expect the F1 score to increase.

Lessons Learned

This capstone project was definitely an eye-opening experience for all of us. We had never done this type of work in any of the courses from our past semesters. Because of this, we felt that we learned a lot of lessons and gained a lot of experience. While all of our group members had previous experience working in a team, none of us had ever had to coordinate with another, separate team before. Overall, we felt that there was a good deal of miscommunication between our group and the IDEAL Pages group. Throughout the semester, we were under the impression that some of our project goals overlapped with their project goals. However, this was not the case. In hindsight, we should have made our objectives more clear with the other group and ensured th
ocument is used as input into Xpantrac and is expanded into relevant information using a search engine. In the second part, the topics in each document are identified, or extracted. In his prototype, Yang used a standard data set, a collection of one thousand New York Times articles, as a search database. As our CS4624 capstone project, our group was asked to modify Yang's algorithm to search through IDEAL documents in Apache Solr. In order to accomplish this, we set up and became familiar with a Solr instance. Next, we replaced the prototype's database with the Yahoo Search API to understand how it would work with a live search engine. Then we indexed a set of IDEAL documents into Solr and replaced the Yahoo Search API with Solr. However, the number of documents we had previously indexed was far too few. In the end, we used Yang's Wikipedia collection in Solr instead. This collection has approximately 4.2 million documents and counting. We were unable to connect Xpantrac to the IDEAL collection in Solr. This issue is discussed in detail later, along with a future solution. Therefore, our deliverable is Xpantrac for Yang's Wikipedia collection in Solr, along with an evaluation of the extracted topics.

User's Manual

Command Line

In the command prompt, the user must navigate to Xpantrac's project directory. Before running the Xpantrac script, the user must ensure there is a document named 0.txt in that project directory. This docume
opics per article

/project/computePRF1.py: Computes the precision, recall, and F1 score of the extracted topics
/project/xpantrac_various30_10topics.csv: Xpantrac-assigned topics for 30 various articles, 10 topics per article

Xpantrac Explained

Xpantrac is an algorithm that combines Cognitive Informatics with the Vector Space Model to retrieve topics from an input of text. The name Xpantrac came from the Expansion-Extraction approach it takes when expanding the query and eventually extracting the topics. Consider this use case of Xpantrac in the following scenario: Rachel is a librarian working at a children's library. This library received about 100 short stories, each of which was written by young writers who recently started their literary career. To make these stories accessible online, Rachel decides to organize them based on topic tags. So she opens a Web browser and enters the URL of the Xpantrac UI. After loading documents that contain the 100 stories, she selects each document to briefly view it and then extracts suggested topic tags using the UI. After selecting several suggested tags from the Xpantrac UI, and also coming up with additional tags by herself, she enters them as the topic tags representing a story. A library patron, Jason, accesses the library homepage at home and clicks a tag, "Christmas," which lists 5 stories about Christmas. He selects a story that might be appropriate for his 4-year-old daughter and reads the story to her.
owledgements

We would first like to thank Seungwon Yang for taking the time out of his busy schedule at George Mason University to help our group better understand the Xpantrac algorithm and the goals for this capstone project. We would also like to mention Mohamed Magdy and the IDEAL Pages group, consisting of Mustafa Aly and Gasper Gulotta, for their contributions to the initial part of our project. The IDEAL Pages project's goal was to index the IDEAL documents into Solr. Lastly, we would like to thank Dr. Edward Fox for presenting us with the opportunity to work on and improve this project for our capstone class, and the National Science Foundation (NSF) for supporting the Integrated Digital Event Archiving and Library (IDEAL) organization.

References

Yang, Seungwon. "Automatic Identification of Topic Tags from Texts Based on Expansion-Extraction Approach." Diss. Virginia Polytechnic Institute and State University, 2013. 230 pages. <http://hdl.handle.net/10919/25111>
required authorization. Therefore, this script has a few extra authorization lines compared to a normal script. Figure 17 shows the necessary authorization and query information.

    if query:
        try:
            if yahoo_api_type == "web":
                url = "http://yboss.yahooapis.com/ysearch/web?q=" + query
            else:
                url = "http://yboss.yahooapis.com/ysearch/news?q=" + query
            consumer = oauth2.Consumer(key=OAUTH_CONSUMER_KEY, secret=OAUTH_CONSUMER_SECRET)
            params = {
                "oauth_version": "1.0",
                "oauth_nonce": oauth2.generate_nonce(),
                "oauth_timestamp": int(time.time()),
            }
            oauth_request = oauth2.Request(method="GET", url=url, parameters=params)
            oauth_request.sign_request(oauth2.SignatureMethod_HMAC_SHA1(), consumer, None)
            oauth_header = oauth_request.to_header(realm="yahooapis.com")
            # Get search results
            http = httplib2.Http()
            resp, content = http.request(url, "GET", headers=oauth_header)
            print resp
            print content
            results = simplejson.loads(content)

Figure 17: Authorization and query information for the Yahoo Search API

Output

See Figure 18 for instructions on how to run the Xpantrac for Yahoo script in the command prompt. This figure also shows the list of topics output for each document processed.

    PS C:\Users\sloan_000\Desktop\project> python Xpantrac_yahooweb.py
    1399317485.37: Document ID 0 is being processed
    10 Topics: station people attack news railway xinhua china train group knife
    29.3789999485 seconds
    1399317514.75: Document ID
rpus to select only nouns, verbs, or both. It also finds lemmas of the nouns or verbs to resolve singular and plural forms.

2. Term-Doc Matrix Builder: develops a term index using the unique words from the derived corpus and constructs a term-document matrix as in the Vector Space Model.

3. Topic Selector: identifies significant words representative of the input text.

How to Set Up Apache Solr

Download

In order to set up Solr, you need to have the latest Java JRE installed on your system. At the time of this writing, the current version of Java (Java 8) is fully compatible with Apache Solr, but previous versions can be used if desired. Once the latest Java is installed, you can download Apache Solr.

Starting the Server

Once Solr is downloaded, you can run the server in its template form by navigating to the example directory inside the Solr download. From here, running "java -jar start.jar" starts the server. You can then navigate to http://localhost:8983/solr. If the server is successfully started, you should be able to see the administrator page. The figure below shows the command to start the server and what a developer should see when initializing the server.

    dcabrera@DMBP example$ ls
    README.txt   example-DIH         lib        resources  solr-webapp
    contexts     example-schemaless  logs       scripts    start.jar
    etc          exampledocs         multicore  solr       webapps
    dcabrera@DMBP example$ java -jar start.jar
    0 [main] INFO org.eclipse.jetty.server
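Returning to the Extraction components listed at the top of this section, the Term-Doc Matrix Builder and Topic Selector can be sketched as below. This is a simplified illustration under our own naming; the real NLP Module also applies POS tagging and lemmatization before the matrix is built, and Xpantrac's actual selection criterion may differ from the raw frequency ranking used here:

```python
from collections import Counter

def term_doc_matrix(docs):
    """Build a term index from the unique words of the derived corpus and
    a term-document count matrix, as in the Vector Space Model."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    matrix = [[0] * len(docs) for _ in vocab]     # rows: terms, cols: docs
    for j, d in enumerate(docs):
        for t, n in Counter(d.split()).items():
            matrix[index[t]][j] = n
    return vocab, matrix

def top_topics(vocab, matrix, k=3):
    """Select the k terms most significant across the corpus, here ranked
    simply by total frequency (a stand-in for Xpantrac's Topic Selector)."""
    totals = sorted(((sum(row), t) for t, row in zip(vocab, matrix)),
                    reverse=True)
    return [t for _, t in totals[:k]]

vocab, m = term_doc_matrix(["storm hit coast", "storm flood coast storm"])
print(top_topics(vocab, m, 2))
```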
topics to be found for each input document.

    input text
    Control inputs:
        for p_size in [20, 15, 10, 5]:
            for u_size in [20, 15, 10, 5]:        # e.g., 5 -> group 5 words together
                for a_return in [50, 10, 5, 1]:   # e.g., 10 -> ask Solr to return 10 matching documents

Figure 23: u_size represents the query unit size and a_return represents the API return size. This can be found in the main function.

Connecting with IDEAL in the Future

In the future, Xpantrac should connect to the IDEAL collection in Solr. This collection can be found at http://nick.dlib.vt.edu:8080/solr/collection1/query. While this collection does contain a content field, it does not meet the specifications of our project at this time. The IDEAL Pages group was given a different specification to use for the content of their Solr collection. Their group was instructed to collect the entire content of an HTML page. This means that all of the text in the <body> of an HTML page will be put into their content field. Figure 24 shows an example of a content field.

    content: Google News var GLOBAL window window function function d a this t this tick function a c b b void 0 b b new Date getTime this t a b c this tick start null a var a new d GLOBAL window jstiming Timer d load a if GLOBAL window performance && GLOBAL window performance timing var a GLOBAL window
with the HTML extension. When running the script, the only parameter needed is the path to the root directory where the files are located. The full path to each deleted file is printed as it is removed.

Indexing Documents into Solr

Attempting to Use the IDEAL Pages Script

As mentioned before, the IDEAL Pages group's goal was to index IDEAL documents into Solr. Our group also needed to do this in order to later use IDEAL documents with Xpantrac. After speaking with our professor and primary contacts, our groups were asked to work together. The IDEAL Pages group would supply the Xpantrac group with the script to index documents into Solr, and the Xpantrac group would manually index the documents until that script was created. When the IDEAL Pages script was finally received, it would not run with our Solr instance. Our group spent a lot of time trying to fix the script and get it to run with our instance. The IDEAL Pages group was also unable to help. Eventually, we realized that we would rather spend time manually indexing the files into Solr instead of trying to fix a script that may never work for us.

Manually Indexing Documents into Solr

Initially, we had 50 text documents from CNN that were supposed to be indexed into Solr (see Figure 9). These documents would represent documents from the IDEAL collection. However, Solr needed those documents to be in XML format (see Figure 10).

    fter forest fires
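Converting a text document into Solr's XML update format means wrapping it in <add><doc> elements, and special characters such as ampersands must be handled first; as described elsewhere in this report, we removed the ampersands, though escaping them works as well. A minimal sketch of the escaping approach, with a hypothetical helper name and field layout:

```python
import html

def to_solr_add_xml(doc_id, text):
    """Wrap plain text in Solr's <add><doc> update XML, escaping &, <, >
    so that ampersands in the source text do not break indexing."""
    return ("<add><doc>"
            '<field name="id">{}</field>'
            '<field name="content">{}</field>'
            "</doc></add>").format(html.escape(str(doc_id)),
                                   html.escape(text))

print(to_solr_add_xml(0, "Barnum & Bailey launched its Legends show"))
# The "&" in the content is emitted as "&amp;" in the XML.
```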