1. User Manual
Contents
1. note that a run can consist of multiple jobs. Choose Harvest name if you want to check details on the history of a specific harvest definition. Choose JobID if you want to check details on a specific job. In case of harvest errors a Restart button will appear, and the operator can choose to resubmit that specific job to be harvested again. When a failed job is resubmitted, its status will say Resubmitted and refer to the new resubmitted job.

History of a harvestdefinition

[Screenshot: Harvest history for full harvest My_Snapshot_Harvest, with columns for run number, start time, end time, bytes harvested, documents harvested and number of jobs, and a Show jobs link.]

The history page for a harvestdefinition is the same page you reach from the frontpage with the History buttons. It gives further information for each run of the harvestdefinition: start time, end time, number of bytes harvested and number of documents harvested. The page also shows how many jobs each run consists of, how many of these failed, and how many were eventually resubmitted.

History of a domain
2. [Screenshot: Snapshot Harvest form with the fields Harvest name (my_first_snapshot), Max number of objects per domain, Max number of bytes per domain (1 000 000 000), Max number of seconds for each job (0), Comments, and Harvest only domains that were not completely harvested in a previous harvest, followed by a Save button.]

This page is used to define the name and size (max bytes per domain) of the harvest. It is now possible to use the number of objects as a harvest limit, as well as the size in bytes. The default object limit for harvests (if using object limits rather than byte limits) is -1, which means unlimited. It is recommended to systematize the naming for clarity, e.g. 2007-1, 2007-2 etc. The size of the harvest can be defined in two ways: at the harvest definition (Snapshot Harvests) or at the configuration of the single domain. It will always be the lower size limit that stops the harvesting of a domain. Comments can be added freely. Snapshot harvests can be based on previous snapshots, in the sense that a harvest can be limited to only harvest domains that hit the max number of bytes limit in a previous harvest. The domains completely finished, not hitting the max
3. A crawlertrap is a path followed blindly by the harvester, which in principle can continue forever. A typical example is a calendar. To avoid crawlertraps on a domain, the administrator can state parts of URLs that should never be harvested in any configuration. Matching URLs are omitted in all harvests of the domain, and in other domains harvested in the same job. So be careful not to give too general statements that could potentially omit things on other domains; consider always including the domain name itself in the statement. The string of text must be stated as a regular expression.

Domain statistics

[Screenshot: Domain statistics page showing the number of registered domains (17) and, per top level domain, the number of subdomains (dk: 17).]

The domain statistics page gives you information about the number of subdomains for each unique top level domain known in the system. IP numbers are counted separately. The number in the Number of subdomains column is clickable and will do a search for al
4. [Screenshot, continued: the Enter seeds box with Max number of bytes per domain (1 000 000 000), Max number of objects per domain, the Harvest template drop-down (3levels_orderxml) and the Insert button.]

Click Add seeds at the bottom of the Selective Harvest page. Enter the identified start URLs covering the event in the Enter seeds box. In Max number of bytes per domain, enter the preferred max number, e.g. 1000000000. Select a harvest template with the Harvest template drop-down box. All seeds will use the same template, so to harvest different seeds with different templates you need to add them bunch by bunch, one bunch for each template you need for your event harvest.

Pressing Insert starts the power adding function. This function runs through the entered seeds one by one and does the following with each seed: (1) finds the domain from which the seed derives; (2) creates a seedlist with the name of the harvestdefinition and the template as seedlist name; (3) creates a configuration with the name of the harvestdefinition and the template as configuration name, and selects the seedlist from (2) to use with the new configuration. If the seedlist to create in (2) or the configuration to create in (3) already exists, i.e. the power adding function has been used before with other seeds from the same domain in the same event harvest, the system will only add the new URLs to the existing seedlist. You can also use Add s
5. alias will not be harvested within the snapshot harvests. An alias is defined one year at a time and then has to be renewed.

Configurations: New configuration and Edit open a new page, Enter edit configuration (see below).
Seed lists: New seed list and Edit open a new page, Enter edit seed list (see below).
Crawler traps: Show crawler traps opens a new text box, Crawler traps (see below).
The link Show historical harvest information for opens a new page, Harvest history for domain (see Harvest History).

Editing configurations

[Screenshot: Enter edit configuration page for netarkivet.dk, with Harvest template (default_orderxml), Maximum number of objects (2 000), Maximum number of bytes (500 000 000), Comments, and Seed list (defaultseeds).]

Enter edit configuration is used to define a new configuration and to edit an existing one. A configuration contains information about which Harvest template and Seed lists are used; more than one Seed list can be selected by holding down CTRL. When a new configuration is created it is given a name that cannot be changed afterwards. Furthermore it is possible to choose between different Ha
6. and press Retrieve.

Upload

In the Upload section you can either update an existing template or create a new one. To update an existing template, select the template to update in the first select box and browse for a file on your local computer with the Browse button. By pressing Replace harvesttemplate with file from your own harddrive you will overwrite the chosen template in the database. To create a new template, give the new template a name in the Template Name box and select a file from your local computer with Browse. By pressing Create a new harvesttemplate using a file from your own harddrive you will add the new template with the given name to the database. When using the upload functions, the uploaded files are checked against certain rules to ensure that the templates contain specific elements used by the NetarchiveSuite system.

Quality Assurance

[Screenshot: Viewerproxy Status page. It shows the current viewerproxy status (currently does not collect missing URLs; the current list of missing URLs contains 0 URLs), the index in use, the note that if the frame is empty either the viewerproxy hasn't been started or your web browser has not been configured to use it, and the Missing URL collection buttons Start collecting URLs and Stop collecting URLs.]
7. [Screenshot, continued: the Enter domain(s) to add to the harvest here box with the Add domains and Save buttons, and the Event harvest links Add seeds and Add seeds from a file.]

Create a new selective harvest definition by pressing Create new selective harvestdefinition from the frontpage. Give the harvestdefinition a recognizable harvest name; you cannot change it later. If necessary, add a comment. Choose a schedule from the drop-down list. Now you can add domains to the harvestdefinition. Write the names of the domains you want to add in the box Enter domain(s) to add to the harvest here and click Add domains. The added domains will appear in the column Domain. For each added domain, choose the wanted configuration from the drop-down list. Press Save to save the harvestdefinition.

The scheduling of selective harvest definitions can be overridden by filling out the input field Override with new date. Simply set the date to whenever you wish the harvest definition to run next time. The scheduling of the harvest definition will continue from that point in time.

Easy creation of non existing domains

[Screenshot: Selective Harvest page with Harvest name (An arbitrary name), Comments and Schedule (Once_a_day).]
8. number of bytes limit, either on the configuration level or on the snapshot harvest level, in the first harvest will not be included in the second. Domains included in harvests which were aborted through the Heritrix GUI, or otherwise stopped uncleanly (for example by a crash of a harvester machine), will also not be included. All other domains will be harvested from the beginning in the second harvest.

Save saves the harvest definition and returns to Snapshot harvests. After defining a snapshot harvest, the harvest is activated with the Activate button on the snapshot frontpage. The harvest will not start until you press Activate. Status then changes to Active. Deactivate is not relevant for Snapshot Harvests because they only run once. With Edit the snapshot definition can be changed, but only before activation. The changeable parameters are size, commentary, and whether the previous harvest start point should be used. The name cannot be changed. History provides an overview of the specific harvest (see Harvest History).

Domains

• Creating Domains
• Finding Domains
• Editing Domains
• Editing configurations
• Editing seed lists
• Editing crawlertraps
• Domain statistics
• Alias summary

Creating Domains

[Screenshot: Create Domain page with the text box Enter the domain or list of domains to be created, the Create button, and the option to import domains from a file on your harddrive with Browse and Ingest.]
9. The Create Domain page is used for creating new domains in the system. It is possible to create a single domain as well as a list of domains. It is also possible to import domains from a file. To create domains directly, enter the domain names in the text box and press Create. To bulk create domains from a file, select the file from your local computer with Browse and press Ingest. The file must be a simple list of domain names, one on each line. The file must be UTF-8 encoded if it contains special characters. New domains get a default configuration when created, with the default_orderxml template and a default maximum number of bytes. New domains also get a default seed list when created. Domains that already exist in the system will not be recreated.

Finding Domains

[Screenshot: Find Domain(s) page with a search box and a note that the use of wildcards is permitted.]

Find Domain(s) is used to find domains existing in the system.
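The import file described under Creating Domains is a plain list of domain names, one per line, in UTF-8. As a minimal sketch (not part of NetarchiveSuite), such a file could be prepared with a few lines of Python. The function name `write_domain_list` and the file name `domains.txt` are illustrative assumptions; skipping blank and repeated entries is a convenience of this sketch, not a documented requirement.

```python
from pathlib import Path


def write_domain_list(path, domains):
    """Write domain names to `path`, one per line, encoded as UTF-8.

    Blank entries and repeated names are skipped (a choice made for this
    sketch). Returns the names that were written, in order.
    """
    written = []
    for name in domains:
        name = name.strip()
        if name and name not in written:
            written.append(name)
    # One domain name per line, explicitly UTF-8 so special characters survive.
    Path(path).write_text("".join(n + "\n" for n in written), encoding="utf-8")
    return written


if __name__ == "__main__":
    # Hypothetical example input; the resulting file would be selected with
    # Browse and submitted with Ingest.
    print(write_domain_list("domains.txt", ["netarkivet.dk", "kb.dk", "netarkivet.dk"]))
```

The resulting file is the one you would select with Browse before pressing Ingest.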
10. [Table of contents; legible entries include Snapshot Harvests, Heritrix GUI Access, Global Crawler Traps, Harvester Templates, Quality Assurance, System State, Bit Preservation and Alternative Ways to Get Data Out.]

User Manual

This is a manual for end user setup and control of harvests, and for controlling storage and QA. The audience for this manual will typically be curators. The basic concept in the NetarchiveSuite harvesting module is the notion of domains. A domain has a two-part name consisting of a host and a top level domain, e.g. netarchive.dk, or is an IP number. What is considered a top level domain is configurable. For most c
11. [Screenshot, continued: Missing files for SBN. There are no more missing files.]

In case of a checksum error, the error can be corrected through the interface. To replace a bad file you need to type a security password and press Replace the file in bitarchive replica XX. The bad file will not be completely removed, but moved to an attic directory on the bitarchive server holding the bad file.

Batchjob Overview

[Screenshot: Batchjob Overview page. ChecksumJob: batchjob has never been run, no output file, no error file. FileListJob: last run Mon Nov 08 10:59:58 CET 2010, Download outputfile (17 bytes, 1 line), Download errorfile (260 bytes, 4 lines).]

To get an overview of batchjobs, select Batchjob Overview on the left hand side menu. Press the ChecksumJob link in the batchjob column to get ready to run one or more checksum jobs.

Batchjob Checksum

[Screenshot: Batchjob page for dk.netarkivet.common.utils.batch.ChecksumJob (batchjob has never been run), with Choose replica (KBN BITARCHIVE, CSN CHECKSUM, SBN BITARCHIVE), Which files (Job ID; Metadata, Content or Both) and the Execute batchjob button.]

Press the Execute batchjob button to start the desired chec
12. [Screenshot, continued: system state table with the last log message per application, for example "INFO Deleted 2 running job info records" and "INFO HarvestControllerServer started" for the LOWPRIORITY HarvestControllerServer at KBN.]

The initial view is the last log message from every machine and every application. This can be narrowed down to a single Machine, Application, Instance id, Priority or Use Replica, and extended to show not only the last log message but the last 100 log messages (don't do that for the initial view of everything). To narrow the view, press any of the links on the page; when narrowed, the view can be extended again with the Show all buttons that dynamically appear in the headlines of the table.

From 3.10 it is also possible to remove an application from the system. Be careful with this feature, because removing a running application will make it disappear. The new column and button, Remove Application, are added to the right. The following is a view of one harvester instance on one specific machine, narrowed down by application.

[Screenshot: Overview of the system state, narrowed down to one HarvestControllerServer instance (LOWPRIORITY, KBN).]
13. Write a domain name in the box, e.g. kb.dk. Searching is done on the complete text string. Press Search. Left and/or right wildcards can be used. If there are several hits, a list of the found domains is given. If there is only one hit, it leads directly to the Domain page. If the search for a specific domain results in no hits, you are offered the option to create the domain in the system; accepting with Yes leads directly to the Domain page for the newly created domain.

Editing Domains

[Screenshot: Edit domain page for netarkivet.dk, showing Domain name, a comments box, Alias of, Configurations (defaultconfig, default_orderxml, defaultseeds; Edit, New configuration), Seed lists (defaultseeds, www.netarkivet.dk; Edit, New seed list), Crawler traps (Show crawler traps), Save, and Show historical harvest information for netarkivet.dk.]

Edit domain is an overview of a single domain, where it is possible to edit the domain's definition in the harvest system. There is a free commentary text box.

Alias of: here it can be stated if the domain is an alias of another domain, i.e. they are identical in content and only one of them should be harvested. Domains marked as an
14. a standard download dialog to present the txt file, e.g. in Notepad.

Alternative Ways to Get Data Out

There are alternative ways to get data out of the bitarchive, e.g. by running batch programs on the bitarchive replicas. However, there is no explicit user interface to these tools, and some technical skills are required to use them. For this reason these tools are described in the Additional Tools Manual, under the tools in the Archive Module.
15. ary time of the day for the harvest to run. In the drop-down menu you can choose between hours, days, weeks and months. Changing the unit switches the Time of day options so that hours lets you choose a specific minute of the hour, days lets you choose a specific time of day, weeks lets you choose a specific day of the week, and months lets you choose a specific day of the month. After selecting the frequency you must select Starts at the earliest, which can be either as soon as possible (default) or at a specific date and time. The last thing to determine is how long this schedule should go on. The default duration of a schedule is forever. It is also possible to choose an end date or a certain number of harvests to perform. This allows you to define schedules that only run for a shorter period, e.g. in connection with an event harvest where the date range in which to harvest is predefined.

Heritrix GUI Access

It is possible, only while a job is running, to access the Heritrix user interface on the harvester machine. Start a browser on the harvester machine and use the port specified, e.g. http://my.harvester.machine:8090. The port is defined by the setting settings.harvester.harvesting.heritrix.port. Enter the administrator name (e.g. admin) and password (e.g. adminPassword) as set in the settings.harvester.harvesting.heritrix.adminName and in settings.harvester.harvesting.heritrix.adminPasswor
16. d settings. See the Installation Manual for how to change settings. In the Heritrix GUI you can e.g. pause, stop or restart a job.

[Screenshot: Heritrix Admin Console, status as of Sep 1, 2011 12:09:39 GMT, showing one running job (CRAWLING JOBS), crawler status, memory use, rates, load, elapsed time and totals (329 URIs in 1m42s), with the Pause, Checkpoint and Terminate controls.]

Global Crawler Traps

A crawler trap is any sequence of webpages which a crawler can blindly and endlessly follow without harvesting any new information. A common example is a calendar system with hyperlinks to subsequent or previous dates. Crawler traps can be avoided by specifying, as regular expressions, URLs which the crawler is to ignore. In NetarchiveSuite one can specify crawler traps either per domain or globally. This section describes the management of global crawler traps. A list of crawler traps is just a plain text file c
17. [Screenshot, continued: the harvest definition An arbitrary name is inactive. The page shows the Override with new date field, the domain configuration netarkivet.dk (defaultconfig) with a Remove button, the warning that the domain newdomain.dk is unknown and was not added, the button Create and add to the harvest definition, the Enter domain(s) to add to the harvest here box, and the Event harvest links Add seeds and Add seeds from a file.]

When adding a domain that does not exist in the database, you are warned with The following domains are unknown and were not added. You can simply add the unknown domains to the database and to your harvestdefinition by clicking Create and add to the harvest definition.

Event harvest

Event harvests are treated almost the same as selective harvests in the system. The only difference is a power adding of domains function. This could be used for selective harvests as well, but was developed for event harvesting definitions, where the operator must fill in a larger number of URLs without having to edit configurations and seedlists on all those domains.

Adding seeds to an event harvest

[Screenshot: Event harvest page for the harvest definition An arbitrary name.]
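The power adding function for event harvests takes each seed, finds its domain, and files the seed under a seedlist and configuration named after the harvest definition and the template. The following is a minimal Python sketch of that bookkeeping, not NetarchiveSuite code. The helper names `domain_of` and `power_add`, the `harvest_template` naming pattern with an underscore, and the rule "last two host labels form the domain" are all assumptions made for illustration; in the real system what counts as a top level domain is configurable.

```python
import ipaddress
from urllib.parse import urlsplit


def domain_of(seed):
    """Return the two-part domain (or IP number) a seed URL belongs to."""
    if "://" not in seed:
        seed = "http://" + seed
    host = urlsplit(seed).hostname or ""
    try:
        ipaddress.ip_address(host)
        return host  # IP numbers count as domains in their own right
    except ValueError:
        pass
    # Assumption: the domain is the last two labels of the host name.
    return ".".join(host.split(".")[-2:])


def power_add(harvest_name, template, seeds, seedlists, configurations):
    """Add seeds to per-domain seedlists and configurations (in-memory dicts)."""
    name = f"{harvest_name}_{template}"  # hypothetical naming pattern
    for seed in seeds:
        key = (domain_of(seed), name)
        urls = seedlists.setdefault(key, [])          # create the seedlist if missing
        if seed not in urls:
            urls.append(seed)                         # otherwise only add new URLs
        configurations.setdefault(key, {"template": template, "seedlist": name})


if __name__ == "__main__":
    seedlists, configurations = {}, {}
    power_add("MyEvent", "3levels_orderxml",
              ["http://www.netarkivet.dk/news", "http://www.kb.dk/"],
              seedlists, configurations)
    print(seedlists)
```

Running the function a second time with further seeds from the same domain only extends the existing seedlist, which mirrors the behaviour described for repeated use within one event harvest.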
18. [Screenshot, continued: Upload New Global Crawler Trap List form with the fields Name (My list), Active or Inactive, Description and File To Upload, and the Create button. The page notes that there are no inactive global crawler traps.]

A list may be made active or inactive by clicking the Activate and Deactivate buttons. Lists may also be viewed (via the Retrieve button), deleted or edited. Note that the retrieved version of a crawler trap list may differ from the original uploaded version, because any duplicates in the original are removed during upload, and the order of the lines in the retrieved version will not be the same as in the original file. The Edit action allows for uploading of a new version of the list.

[Screenshot: Global Crawler Traps page. Active Crawler Traps: My list and Bobs list, each with Show as text, Retrieve, Deactivate, Edit and Delete. Inactive Crawler Traps: Alices list, with Show as text, Retrieve, Activate, Edit and Delete. Below is the Upload New Global Crawler Trap List section.]

A side effect of using global crawler trap lists is that the database will grow more rapidly, as the modified crawl template, including all the active crawler traps, is stored for every job.
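Because crawler traps are regular expressions, and duplicates and line order are not preserved on upload, it can help to check a list locally before uploading it. The sketch below is an illustration in Python, not the system's own validation. It assumes one expression per line and that an expression must match the whole URL; note also that the harvester evaluates expressions with its own regex engine, whose syntax can differ from Python's `re` in details. The function names and the example expression are hypothetical.

```python
import re


def normalise_trap_list(text):
    """Return the set of unique trap expressions found in `text`.

    Assumes one regular expression per line. Raises re.error if a line
    is not a valid expression. A set is used on purpose: duplicates
    vanish and the original order is not kept.
    """
    traps = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        re.compile(line)  # fail early on malformed expressions
        traps.add(line)
    return traps


def is_trapped(url, traps):
    """True if any expression matches the complete URL (an assumption)."""
    return any(re.fullmatch(trap, url) for trap in traps)


if __name__ == "__main__":
    # Hypothetical list: the same calendar trap entered twice.
    uploaded = r"""
    .*netarkivet\.dk/calendar/.*
    .*netarkivet\.dk/calendar/.*
    """
    traps = normalise_trap_list(uploaded)
    print(len(traps))
    print(is_trapped("http://www.netarkivet.dk/calendar/2011/09/05", traps))
    print(is_trapped("http://www.netarkivet.dk/about", traps))
```

Including the domain name in the expression, as in the example, keeps a global trap from accidentally excluding URLs on unrelated domains.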
19. [Screenshot, continued: previous runs of FileListJob with started date, ended date, output file and error file (Mon Nov 08 10:59:58 CET 2010; output file 17 bytes, 1 line; error file 260 bytes, 4 lines), followed by Choose replica (KBN BITARCHIVE, CSN CHECKSUM, SBN BITARCHIVE), Which files (Job ID; Metadata, Content or Both) and the Execute batchjob button.]

Press the Execute batchjob button to start the desired Filelist job.

[Screenshot: Executing batchjob page, listing the parameters BatchJob name (dk.netarkivet.common.utils.batch.FileListJob), Replica (KBN) and Regular expression.]

To get an overview of batchjobs, select Batchjob Overview on the left hand side menu.

[Screenshot: Batchjob Overview page. ChecksumJob: last run Tue Nov 09 10:57:44 CET 2010, output file 51 bytes, 1 line; error file 260 bytes, 4 lines. FileListJob: last run Tue Nov 09 11:22:06 CET 2010, output file 17 bytes, 1 line; error file 260 bytes, 4 lines.]

When you press the link Download outputfile, the system starts a standard download dialog to present the txt file, e.g. in Notepad. When you press the link Download errorfile, the system starts
20. eeds from a file. This allows you to upload a file with the seeds instead of entering the seeds in a text field. Otherwise the functionality is the same.

[Screenshot: Event harvest An arbitrary name, Enter seeds form with Select file (Browse), Max number of bytes per domain (1 000 000 000), Max number of objects per domain, Harvest template (3levels_orderxml) and the Insert button.]

Snapshot Harvests

• Creating/editing a snapshot harvest

[Screenshot: Snapshot Harvests page with the message No snapshot harvests defined and the link Create new snapshot harvest definition.]

On the Snapshot Harvests page, new snapshot harvests are started, harvesting all domains known to the system in their default configurations. An overview of all snapshot harvests is also provided. Create new harvestdefinition opens the form below.

Creating/editing a snapshot harvest
21. eral jobs, all the finished jobs will appear as Done, and only the ones that were actually stopped will appear as stopped due to Harvesting aborted for some domains.

Harvester Templates

• Download
• Upload

[Screenshot: Edit Harvest Templates page. The Download section has a template select box (3levels_orderxml), a method select box (Show as text) and a Retrieve button. The Upload section has a select box and Browse button with the button Replace harvesttemplate with file from your own harddrive, and a Template Name box and Select file field with the button Create a new harvesttemplate using a file from your own harddrive.]

The Edit Harvest Templates page is used for managing the harvester templates. It enables you to both download templates from and upload templates to the system database.

Download

The download part lets you view existing templates as either plain text or XML in the browser window, or download existing templates to your local computer. Select the template you want to view or download in the first select box, select the method in the second
22. [Screenshot: Schedules page listing Once_a_day, Once_a_month, Once_a_week and Once_an_hour, each with an Edit link, and the link Create new schedule.]

Schedules are only applied to selective and event harvests. A schedule defines a harvesting frequency. The minimum unit is one hour. It is possible to choose an automatically fixed start and/or end time for a specific harvest. Any number of schedules can be created. For a new schedule, click Create new schedule. To edit an existing schedule, press Edit.

[Screenshot: Edit Schedule page with Schedule name, Comments, Perform harvest (every N hours, days, weeks or months), Time of day, Starts at the earliest (as soon as possible, or at a given date and time), and Continue (forever, until a given date and time, or until a given number of harvests have been done), followed by a Save button.]

Give the schedule an easily recognizable name; note that it can't be changed once saved. If necessary, add a comment. Fill in the frequency and if necess
23. [Screenshot, continued: Details for Job 3 (harvest My_selective_harvest, status Started, object limit 2 000, byte limit 500 000 000). Under QA job selection, the link Select this job for QA with viewerproxy selects this job for the viewerproxy browse index; this will only work if your browser is set up to use the viewerproxy as web proxy. Under Included domains and configurations, the table has the columns Domain, Configuration, Bytes Harvested, Documents Harvested and Stopped due to, with the row netarkivet.dk, defaultconfig. The page also shows the Seed list (www.netarkivet.dk), the Harvest order template (based on default_orderxml) and the link Show harvest template for job 3.]

Clicking on a jobID on any of the harvest history pages will give you a very detailed report on the job. This page gives all the information available about the job itself (e.g. max bytes limit) and about the single domains included in the job. Furthermore, the page shows the complete seedlist used with the job and the complete Harvest order template, as well as detailed error information in case of errors. The latter two are mainly for advanced users debugging specific crawls where things didn't go as expected.

Details on a terminated job

If you terminate a running job in the Heritrix GUI, you can view Job Details and see that the job is stopped due to Harvesting aborted. If it is a bigger snapshot harvest that includes sev
24. ith the number of missing files. For each missing file you can select Get info. With the Change the infobox for field at the bottom of the screen you can select a number of files in one operation. Pressing Execute makes the system get a fresh status on the files and their checksums from both copies (of which one is missing) and from the administrative system, so that there are always three checksums available for each file.

[Screenshot: Missing files for SBN, with Get info selected for the file 2-metadata-1.arc. The table lists Replica, Admin State and Checksum for Admin Data (no admin checksum), KBN (UPLOAD COMPLETED), CSN (UPLOAD COMPLETED) and SBN (UPLOAD FAILED, no checksum), with the Add to replica option and the Execute button.]

If the two remaining checksums (on the screen dump 32 bytes long) are identical, the system allows you to add the missing file to the bitarchive instance that had lost it. This requires you to click Add to archive and then press Execute. Marking files for addition can be done for a number of files in one operation.

Checksum Errors

[Screenshot: Missing Files page with the message that the file 2-metadata-1.arc has been restored in replica SBN.]
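The rule for restoring a missing file, namely that the two remaining checksums must be identical before the file may be added back, can be written down in a few lines. This is a hedged Python sketch of the decision only; the function name `can_restore`, the dictionary layout and the shortened checksum strings are assumptions for illustration and do not reflect how NetarchiveSuite stores this information.

```python
from typing import Dict, Optional


def can_restore(checksums: Dict[str, Optional[str]], missing: str) -> bool:
    """Decide whether a file missing from one source may be re-added.

    `checksums` maps each source (e.g. the administrative data and the
    replicas) to the checksum it reports, or None if it has none.
    The file may be restored only if every remaining source reports a
    checksum and all of those checksums agree.
    """
    remaining = [value for source, value in checksums.items() if source != missing]
    return (
        len(remaining) >= 2
        and None not in remaining
        and len(set(remaining)) == 1
    )


if __name__ == "__main__":
    # Hypothetical status for one file that is missing from replica SBN.
    status = {"admin": "75d5563c", "KBN": "75d5563c", "SBN": None}
    print(can_restore(status, missing="SBN"))
```

If the remaining checksums disagree, the sketch returns False, which corresponds to the case where the interface does not offer to add the file.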
25. [Screen dump: Harvest history for netarkivet.dk, 3 search results, with columns Harvest name, Run number, Job ID, Configuration, Start time, End time, Bytes Harvested, Documents Harvested and Stopped due to; the rows are My_selective_Harvest (run 0, job 3), My_Harvest (run 0, job 1) and My_Snapshot_Harvest (run 0, job 2), all with configuration defaultconfig and 313 documents harvested]
If you want to see all the jobs connected to a specific domain, click on All Jobs per domain and search for the domain name. You will get a chronological list of the harvest definitions that include the chosen domain. This page gives the same history information as the other two history pages and furthermore gives a "Stopped due to" column. This column shows the operator whether a harvest was stopped unexpectedly, whether it hit the max bytes limit for the chosen domain, or whether it was stopped because of an error on the harvester machine.
Details on a job
[Screen dump: Details for Job 3]
26. [Screen dump: "Executing batchjob with the following parameters", with BatchJob name dk.netarkivet.common.utils.batch.ChecksumJob, Replica KBN and a regular expression selecting metadata files]
To get an overview of batchjobs, select Batchjob Overview in the left-hand side menu.
When you press the link Download outputfile, the system starts a standard download dialog to present the txt file, e.g. in Notepad. When you press the link Download errorfile, the system does the same for the error file.
Batchjob Filelist
[Screen dump: Batchjob Overview listing ChecksumJob (Tue Nov 09 10:57:44 CET 2010; outputfile 51 bytes, 1 line; errorfile 260 bytes, 4 lines) and FileListJob (Mon Nov 08 10:59:58 CET 2010; outputfile 17 bytes, 1 line; errorfile 260 bytes, 4 lines)]
Press the FileListJob link in the batchjob column to get ready to run one or more filelist jobs.
[Screen dump: Batchjob page with Name of batchjob dk.netarkivet.common.utils.batch.FileListJob and Number of runs 1]
27. l domains matching that top-level domain. This is only applicable to top-level domains with a limited number of subdomains, since the matching domains will be listed on one page, and that page will get very long if the system contains hundreds of thousands of domains.
Alias summary
[Screen dump: Overview of Aliases, listing under Existing aliases the domain netarkivet.dk as an alias of kb.dk, expiring Aug 31 2012 11:24:10 AM]
The alias summary page gives an overview of the domains marked as aliases of other domains in the system. Both domain names are clickable and open the domain page for the clicked domain. The Expires column shows when the alias expires; aliases expire 12 months after they are marked. The mark does not disappear from the database after 12 months, but the Overview of Aliases page shows the expired ones at the top. To renew an alias for another 12 months, you currently have to open the domain page of the marked domain (the Domain column), select renew alias, and press Save.
Schedules
[Screen dump: Schedules page]
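The alias expiry rule from the alias summary text (an alias expires 12 months after it is marked, and expired aliases are shown first) can be sketched in Python as follows. This is an illustration under assumptions, not the system's implementation: the helper names are hypothetical, and the real system may measure the 12 months as a fixed duration rather than in calendar months.

```python
import calendar
from datetime import datetime

# Taken from the manual text; how the system counts the 12 months is an assumption.
ALIAS_LIFETIME_MONTHS = 12

def add_months(moment: datetime, months: int) -> datetime:
    """Add calendar months, clamping the day to the length of the target month."""
    month_index = moment.month - 1 + months
    year = moment.year + month_index // 12
    month = month_index % 12 + 1
    day = min(moment.day, calendar.monthrange(year, month)[1])
    return moment.replace(year=year, month=month, day=day)

def alias_expiry(marked_at: datetime) -> datetime:
    """Expiry time of an alias that was marked (or renewed) at `marked_at`."""
    return add_months(marked_at, ALIAS_LIFETIME_MONTHS)

def sort_for_overview(aliases, now: datetime):
    """Order (domain, alias_of, marked_at) tuples with expired aliases first."""
    return sorted(
        aliases,
        key=lambda alias: (alias_expiry(alias[2]) > now, alias_expiry(alias[2])),
    )

# Hypothetical data: the first alias is still valid, the second has expired.
aliases = [
    ("netarkivet.dk", "kb.dk", datetime(2011, 8, 31, 11, 24, 10)),
    ("example.dk", "example.org", datetime(2010, 1, 15, 9, 0, 0)),
]
for domain, alias_of, marked_at in sort_for_overview(aliases, datetime(2011, 9, 9)):
    print(domain, "is an alias of", alias_of, "until", alias_expiry(marked_at))
```

Renewing an alias corresponds to calling `alias_expiry` again with the renewal time as `marked_at`.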
28. [Screen dump, continued: system state rows for the HarvestControllerServer applications (HIGHPRIORITY and LOWPRIORITY, replicas KBN and SBN), each with the INFO message "HarvestControllerServer started"]
If you load this page just while a harvester instance is restarting, you might get a JMX error. The same thing happens if one of the configured applications is not running or does not respond. So the system state page will, in some sense, also reveal non-functional applications.
Bit Preservation
Contents of this section: Bit Preservation, Missing Files, Checksum Errors, Batchjob Overview, Batchjob Checksum, Batchjob Filelist.
[Screen dump: Filestatus page, "Status of the replicas", each entry with an Update button. Filestatus for KBN: 24 files, 0 missing, last update Nov 5 2010 11:50:11 AM. Filestatus for CSN: 24 files, 0 missing, last update Jan 1 1970 1:00:00 AM. Filestatus for SBN: 23 files, 1 missing (with a Show missing files button), last update Nov 5 2010 3:10:00 PM. Checksum status for KBN: 0 files with error, last update Jan 1 1970 1:00:00 AM]
29. ontaining crawler trap regular expressions, one per line. Lists may be active or inactive. When NetarchiveSuite creates a new job for any harvest, all crawler traps from all active lists (excluding duplicates) are added to the crawl template for that job.
[Screen dump: Global Crawler Traps page with the sections Active Crawler Traps ("There are no active global crawler traps"), Inactive Crawler Traps ("There are no inactive global crawler traps") and Upload New Global Crawler Trap List with an Edit link]
To upload a list of global traps, first click on the Edit link and fill in a name and description for the list of crawler traps and the path where the file containing the crawler trap expressions is to be found. You can also choose whether the list should initially be active or inactive. Click Create to upload the list.
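The way crawler traps from active lists are combined for a new job can be pictured with a short Python sketch. It is only an illustration of the rule stated above (all traps from all active lists, duplicates excluded); the `TrapList` class, the function name and the example expressions are hypothetical and not part of NetarchiveSuite.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrapList:
    """Hypothetical stand-in for one global crawler trap list."""
    name: str
    active: bool
    expressions: List[str]  # one regular expression per line of the uploaded file

def traps_for_new_job(trap_lists: List[TrapList]) -> List[str]:
    """Collect the expressions of all active lists, dropping duplicates but keeping order."""
    seen = set()
    merged = []
    for trap_list in trap_lists:
        if not trap_list.active:
            continue  # inactive lists contribute nothing to the job
        for expression in trap_list.expressions:
            if expression not in seen:
                seen.add(expression)
                merged.append(expression)
    return merged

# Hypothetical example lists.
lists = [
    TrapList("calendars", True, [r".*/calendar/.*", r".*\?sessionid=.*"]),
    TrapList("sessions", True, [r".*\?sessionid=.*", r".*/print/.*"]),
    TrapList("old", False, [r".*/archive/.*"]),
]
print(traps_for_new_job(lists))
```

Keeping the first occurrence of each expression is a design choice of this sketch; the manual only states that duplicates are excluded.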
30. ountries it makes sense that the top-level domain is simply the country code, like dk or fr, while for others it makes sense to go one level further down, like co.uk.
A domain can hold multiple so-called configurations. A configuration describes how to harvest the domain or a part of the domain. So one configuration could harvest the whole domain (used by the snapshot functionality), and other configurations could take different minor parts of the same domain for selective or event harvests. A configuration basically consists of two things:
• A harvester template (predefined templates for the Heritrix web crawler)
• A number of seed lists to use with that template
A domain will always have a default configuration (selectable), and that configuration is used when starting a snapshot harvest. The snapshot harvest therefore takes all domains known in the database.
Contents: Selective Harvests, Snapshot Harvests, Domains, Schedules, Heritrix GUI Access, Global Crawler Traps, Harvest History, Harvester Templates, Quality Assurance, System State, Bit Preservation, Alternative Ways to Get Data Out
Selective Harvests
• Creating/editing a selective harvest
• Easy creation of non-existing domains
• Event harvest
• Adding seeds to an event harvest
31. [Screen dump: Quality Assurance page with the viewerproxy controls for collecting URLs and, under "Browsing jobs in the viewerproxy", the links Selective harvest history and Snapshot harvest history for selecting the index for viewerproxy browsing]
Quality assurance is done by browsing the archive for selected domains. If something is missing on the pages, the system can be set to automatically collect all the missing URLs for later transfer to the harvesting system. Before doing quality control you need to set up your browser to use a proxy server (see the Quick Start Manual). It is advisable to investigate one domain at a time, unless several domains are included in the same website complex.
Start collecting URLs: This starts the collection of URLs. The Current Viewerproxy status text box shows whether the system is collecting URLs and how many URLs have been collected so far.
Stop collecting URLs: Collection of URLs can be stopped at any time.
Clear collected URLs: The list of URLs can be cleared at any time, e.g. when investigation of a new domain starts. NB: This function cannot be undone.
Show collected URLs: The list of URLs can be viewed at any time. The list can be copied and manually added to the relevant seed lists for the relevant domains in the harvesting system.
System State
The Systemstate pages let the operator monitor the entire system (all machines and applications) from one central point.
32. [Screen dump, continued: Checksum status for CSN and SBN, each with 0 files with error, last update Jan 1 1970 1:00:00 AM, and an Update button]
The Bitpreservation interface lets you control active checks of the status of the underlying bitarchive. This only applies if your installation uses the NetarchiveSuite bitarchive application. The interface lets you initiate two types of checks on every copy of the files in the archive: Filestatus and Checksum status. In the example on the screen dump there are several bitarchive instances (e.g. SBN and KBN). The Update buttons let you update the status of both files and checksums for each bitarchive instance. The page gives you the Filestatus as "Number of files", "Missing files" and "Last update at", and the Checksum status as "Number of files with error" and "Last update at". The Filestatus checks are rather fast, because only the existence of the files is checked, whereas the Checksum status checks can take days or weeks for larger archives, depending on the number of CPUs and the I/O speed of your hard drives.
Missing Files
[Screen dump: Missing files for SBN, listing 2-metadata-1.arc with a Get info checkbox, an Execute button and the "Change the infobox for 1 files" field]
If files are missing on one instance of the bitarchive, a Show missing files button will appear right next to the line w
33. [Screen dump: Overview of the system state, with a "show" selector for Location, Instance id and Http port, and one row per application (BitarchiveServer, ChecksumFileServer, IndexServer, ViewerProxy, ArcRepository, BitarchiveMonitorServer, HarvestJobManagerApplication) giving its location, replica, timestamp and latest INFO log message]
34. rvester templates and maximum number of bytes to be harvested in each harvest of the configuration. At creation, the default number of bytes is chosen for each domain, and a default maximum number of objects is set, but this can be overwritten.
Editing seed lists
[Screen dump: Enter/edit seed list page for netarkivet.dk, with Name defaultseeds, the Seeds text box containing www.netarkivet.dk, a Comments field and a Save button]
Enter/edit seed list is used to define a new seed list or to edit an existing one. When a new seed list is created, it is given a name that cannot be changed afterwards. The Seeds text box holds the list of seeds to be harvested. Seeds can be omitted by writing a # prefix, e.g. #http://www.kb.dk. This can also be used for comments inside the seed list, e.g. #this seed is important.
Editing crawlertraps
[Screen dump: domain page for netarkivet.dk, with the fields Domain name, Comments and Alias of, and the configuration defaultconfig]
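How a seed list with omitted seeds and comments is read can be sketched in Python as below. This is a minimal illustration, assuming the prefix character is `#` (the extracted text above does not preserve it) and that blank lines are ignored; the function name `active_seeds` is hypothetical and the harvester's own parsing may differ.

```python
from typing import List

def active_seeds(seed_text: str, comment_prefix: str = "#") -> List[str]:
    """Return the seeds that would be harvested, skipping omitted seeds and comments.

    `comment_prefix` defaults to "#", which is an assumption of this sketch.
    """
    seeds = []
    for line in seed_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue  # blank line, omitted seed, or comment
        seeds.append(stripped)
    return seeds

# Hypothetical seed list content, as it might be typed into the Seeds text box.
example_seed_list = """www.netarkivet.dk
#http://www.kb.dk
#this seed is important
http://www.example.org/
"""
print(active_seeds(example_seed_list))
```

Because omitted seeds and comments share the same prefix, the sketch treats them identically: both are simply skipped.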
35. [Screen dump: Selective Harvests front page showing "No selective harvests defined" and the link Create new selective harvest definition]
The front page by default shows the list of selective harvests. You can Activate an inactive harvest definition and Deactivate an active harvest definition. If you deactivate a running harvest, the system will finish the running jobs. Click on Edit to change an existing harvest definition, or on Create new selective harvest definition to create one. Click on History if you wish to trace back all the jobs from former finished harvests.
Creating/editing a selective harvest
[Screen dump: Selective Harvest page with Harvest name "An arbitrary name", a Comments field and Schedule Once_a_day; the message "The harvestdefinition An arbitrary name is inactive. If activated it will run again on Sep 1 2011 8:55:42 AM"; the field "Override with new date (format DD MM YYYY hh mm)"; and the note "There are 1 domain configurations in this harvest definition" with domain netarkivet.dk, configuration defaultconfig and a Remove option, followed by a field for entering domains to add]
36. vest History
Contents of this section: All jobs, History of a harvestdefinition, History of a domain, Details on a job, Details on a terminated job.
All jobs
Harvest Status in the left menu by default shows a chronological list, in ascending order, of all jobs ever harvested with status Started. All Jobs shows the same list.
[Screen dump: All Jobs page with the filters Job status (Started), Harvest name, Start date, End date, Order (ascending) and Display 100 rows per page, plus Show and Reset buttons; 2 search results with columns Job ID, Harvest name, Run number, Start time, End time, Status, Harvest errors, Upload errors and Number of configurations, listing job 1 (My_Harvest, run 0, Started, 1 configuration) and job 2 (My_Snapshot_Harvest, run 0, Started, 17 configurations); and a Resubmit selected failed jobs button]
If information is wanted for jobs with other statuses (or all statuses) or in another sort order, this can be specified in the combo boxes at the top of the page and then activated by clicking the Show button. For each job the page shows information about the job and its status, as well as information about errors (harvest errors or upload errors) and the number of configurations in the job.
Choose Run number if you want to check details on a specific run of that harvestdefinition