Report on the Analysis of the 2005 Australian Domain Crawl
Executive summary
The whole domain harvest has demonstrated that it is possible to harvest a substantial part of the .au web domain in a relatively short period of time. Utilising robust archival crawler technology, it was possible to harvest within the space of six weeks around six times the amount of data harvested for the PANDORA Archive over a period of 10 years. Analysis of the content of the domain crawl is crucial in order to understand more clearly the successes and the shortcomings of large-scale crawl harvesting. Given the scale of the content, such analysis requires automated procedures and tools to be developed to assist the process. The lack of practical benchmarks against which to compare the domain harvest presents a further difficulty for analysing the crawl, and quality assurance at the specific title or site level is clearly not practical. Nevertheless, the statistical analysis done so far does suggest that the crawl technology is efficient in harvesting the content that can be found, while constraints on the duration of the crawl, together with compliance with robots.txt exclusions, are clearly factors detrimental to the comprehensiveness of such large-scale web archiving.
Background
The National Library undertook the first large-scale harvest of the Australian web domain in June and July 2005. The Australian web domain for this purpose was identified as all content on the .au domain plus some non-.au content that could be automatically identified as residing on a server located in Australia. The content of the domain harvest, which was undertaken by the San Francisco-based Internet Archive using the Heritrix crawl robot, was delivered to the Library on a Capricorn Technologies PetaBox high-density storage system in November 2005. Since that time the National Library has been developing methods to analyse the domain harvest data so as to gain a better understanding both of the content collected as a result of this particular crawl and of large-scale web archiving more generally.
The crawl ran over a period of six weeks, during which time the bulk of the web content was collected. Following the crawl, the Internet Archive did some quality checking of the content prior to delivery to the National Library, and some supplementary crawling was done at that stage. The figures for the size of the harvest, as derived from the reports produced in July 2005 following the termination of the crawl, are:


Hosts crawled              811,523
Files crawled              189,824,119
Unique files crawled       185,549,662
Size (uncompressed)        6.69 TB
Size (compressed total)    4.6 TB
Size (compressed ARC)      4.52 TB
Size (compressed DAT)      84.65 GB

Because this is the first such crawl of the Australian web domain, one major problem is what to compare it against in order to assess its success, since there are no reliable or usefully comparable metrics for the size and dimensions of the Australian web domain. Moreover, in assessing the coverage and success of the crawl, we want to understand both the breadth and depth of the harvest as well as the completeness of individual websites.


Initial Analysis Undertaken
Visual Analysis
The initial analysis, undertaken immediately following the termination of the crawl while the content was still hosted by the IA, involved browsing a small sample of the content. Given the size of the crawl, such analysis is very limited and may only be considered impressionistic, certainly not definitive. A random selection of a small sample of sites to assess the breadth of the crawl did, prima facie, suggest generally successful coverage, with around 95% of the sample .au sites being found in the archived content. Discussion of this preliminary analysis may be found in the October 2005 report on the whole domain harvest <http://pandora.nla.gov.au/documents/domain_harvest_report_public.pdf>.
A second round of visual analysis is being undertaken by Digital Archiving Section staff in March and April 2006 and involves comparing harvesting results in PANDORA with those in the domain harvest. While the small sample that this will produce may not support definitive conclusions, we hope that this analysis will help inform us, at a detailed level, of the performance of the domain crawl in respect to particular example sites, including specific file types and site delivery mechanisms (e.g. dynamic content, stylesheets, JavaScript, Flash).
Analysis of some reports provided by the Internet Archive
A number of reports were produced by the IA following the termination of the crawl, including:

1. Crawl report (the statistics quoted above)
2. MIME type report
3. Response code report
4. Per host summary
5. File exclusion report (>20 minutes to download)
6. File exclusion report (>100 MB)
7. File exclusion report (robots.txt)

The IA subsequently also provided a report identifying hosts with URLs remaining to be crawled when the crawl was terminated.


MIME types
The domain crawl identified 976 MIME (Multipurpose Internet Mail Extension) types. Typically, as has been found in the experience of the PANDORA Archive, a large number of these represent badly or incorrectly formed MIME type identifications. Of the 976 MIME types reported, 836 are associated with fewer than 1,000 files, 686 of those with fewer than 100 files and 476 with 10 or fewer. The top MIME types as reported are listed in the table below. More complex analysis would be required to obtain exact figures for broad classes of files (e.g. all image files or all audio files), since all 976 reported MIME types would have to be analysed and sorted; a simple grouping approach is sketched after the table. As might be expected, the greater part of the domain crawl content consists of text/html files and the major image file types (jpg, gif and png). Also of note is the large number of PDF files, amounting to more than three million.


MIME Type                   No. of files    % of total
text/html                   126,587,753     67%
image/jpeg                  32,414,376      17%
image/gif                   20,716,296      11%
application/pdf             3,071,252       1.6%
text/plain                  1,521,619       0.8%
image/png                   913,104         0.48%
text/css                    808,571         0.42%
application/x-JavaScript    429,700         0.22%
application/msword          392,140         0.21%
application/x-shockwave     355,840         0.18%



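The following is a minimal sketch of how the 976 reported MIME types could be rolled up into broad categories (text, image, audio, video, application) to produce the kind of aggregate figures mentioned above. The report file name and its layout (a plain integer count followed by a MIME type on each line) are assumptions for illustration only, not the actual format of the IA report.

```python
# Sketch: roll the MIME type report up into broad categories.
from collections import Counter

def summarise_mime_report(path="mimetype-report.txt"):
    totals = Counter()
    with open(path, encoding="utf-8") as report:
        for line in report:
            parts = line.split()
            if len(parts) < 2 or not parts[0].isdigit():
                continue  # skip headers, blanks and malformed lines
            count, mime = int(parts[0]), parts[1].lower()
            # badly formed MIME types (no "/") are grouped as "unknown"
            category = mime.split("/", 1)[0] if "/" in mime else "unknown"
            totals[category] += count
    return totals

if __name__ == "__main__":
    for category, count in summarise_mime_report().most_common():
        print(f"{category:<12} {count:>14,}")
```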
Response codes
The report on HTTP response codes listed 47 different codes.
The top reported response codes were as follows:


Code   Description                                                   URLs          % of total
200    OK                                                            164,350,530   86.58%
302    Found (redirect)                                              11,620,137    6.12%
404    Not found                                                     10,716,117    5.65%
301    Moved permanently                                             990,638       0.52%
500    Internal server error                                         689,972       0.36%
400    Bad request (bad syntax, request not understood by server)    576,656       0.3%
401    Unauthorised (request requires authentication)                444,214       0.23%
403    Forbidden                                                     275,910       0.15%
503    Service unavailable                                           51,388        0.03%

Summary of response code categories




Code class   Description                                                            URLs          % of total
2xx          Successful (may or may not include content)                            164,367,392   86.59%
3xx          Redirections                                                           12,629,233    6.65%
4xx          Client error (e.g. bad requests; authorisation required; forbidden)    12,067,299    6.36%
5xx          Server error                                                           759,995       0.4%

These figures give a picture of the nature of the Australian web domain in terms of the prevalence of authentication requirements, redirections and client and server errors. Such a web landscape obviously has implications for the efficiency and success of domain harvesting; for example, the significant number of client error responses includes requirements for authentication in order to access content.
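As an illustration of how the class summary above can be derived from the per-code report, the following sketch sums URL counts by the first digit of the response code. The input format (an iterable of code/count pairs) is an assumption; the real report would need its own parsing.

```python
# Sketch: roll per-code URL counts up into 2xx/3xx/4xx/5xx classes.
from collections import Counter

def summarise_response_classes(code_counts):
    classes = Counter()
    for code, count in code_counts:
        classes[f"{code // 100}xx"] += count
    total = sum(classes.values())
    return {cls: (count, 100.0 * count / total) for cls, count in classes.items()}

if __name__ == "__main__":
    # a few of the top codes from the table above, purely for illustration
    sample = [(200, 164_350_530), (302, 11_620_137), (404, 10_716_117),
              (301, 990_638), (500, 689_972)]
    for cls, (count, pct) in sorted(summarise_response_classes(sample).items()):
        print(f"{cls}  {count:>12,}  {pct:.2f}%")
```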


URLs remaining at termination of crawl


Queued URLs per host     Hosts with URLs queued    Percentage of all hosts with queued URLs
>=1 URL queued           36,944                    100%
>=100 URLs queued        11,647                    32.5%
>=100,000 URLs queued    305                       0.8%

These figures are for the hosts with URLs remaining to be crawled at the termination of the harvest. The number of URLs remaining to be crawled, per host, ranges from 1 to 9,671,112. This suggests that a large number of URLs could not be harvested within the crawl period. We may also assume that those URLs not harvested would have provided the crawler with links to additional content unknown at the termination of the crawl. The result suggests the crawl duration (approximately six weeks) was a significant limiting factor for the completeness of the harvest.
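The threshold counts above could be derived from the per-host report with a simple pass such as the sketch below. The file name and the one-host-per-line, count-in-the-last-column layout are assumptions, not the actual format of the report supplied by the IA.

```python
# Sketch: count hosts with at least N URLs still queued, for several thresholds N.
def queued_host_thresholds(path="queued-per-host.txt",
                           thresholds=(1, 100, 100_000)):
    queued = []
    with open(path, encoding="utf-8") as report:
        for line in report:
            parts = line.split()
            if len(parts) >= 2 and parts[-1].isdigit():
                queued.append(int(parts[-1]))
    total = sum(1 for q in queued if q >= 1)   # hosts with any queued URLs
    for t in thresholds:
        hosts = sum(1 for q in queued if q >= t)
        pct = 100.0 * hosts / total if total else 0.0
        print(f">={t:,} URLs queued: {hosts:,} hosts ({pct:.1f}% of hosts with queued URLs)")
```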



Analysis Undertaken Following Delivery of Content
Beginning in December 2005 further analysis of the crawl was undertaken, following the delivery of the content in November 2005. Broadly this has taken two forms: Chris Huston undertook statistical analysis based on and in reference to the harvest content itself; and Alexander Osborne has undertaken analysis comparing the domain harvest content with content from the PANDORA Archive.
Analysis based on content data
This analysis, undertaken by Chris Huston in December 2005 and January 2006, examines statistically the number of URLs identified by the crawler and the number of URLs identified and successfully crawled (harvested). These statistics also provide information about the percentage of identified URLs not crawled for various reasons such as robots.txt exclusions, server errors and crawl robot rules.
High level counts:


URLs found by Heritrix                   231,394,607
DNS URLs found by Heritrix               25,612,280
URLs successfully crawled by Heritrix    208,071,087

The percentage of successfully crawled URLs in relation to those found (identified) by Heritrix was 89.92%. This percentage does not include the DNS URL figure indicated above. The DNS URLs have been separated from the general figure for URLs found as they simply record look-ups of the DNS entry for a web server name and are therefore not relevant to the archival content.


The difference between the “URLs found” figure and the “URLs successfully crawled” figure is accounted for by a number of factors preventing the crawl from completing successfully in respect to certain URLs. The greatest proportion of these is due to robots.txt exclusions. As the crawler was set to obey robots.txt exclusions, URLs meeting exclusion parameters were not crawled; these account for 5.95% of the URLs found, or 13,756,599 URLs. Invalid URLs (unrecognised or illegal format) account for the next largest percentage of non-crawled URLs at 3.08%, or 7,129,636 URLs. The crawler was also configured with user-setting limitations to exclude files on the basis of file size (>100 MB) and download timeout (>20 minutes); these exclusions amount to only 0.09% of the URLs found, or 201,901 URLs. These figures largely reiterate the figures from the response code reports cited above.
The slightly better figure for successfully crawled URLs (89.92% in contrast to 86.59%) is a result of counting some client errors and redirections as successful. That is, in this analysis “crawled” means that an HTTP server response was obtained, not necessarily that a content file was successfully obtained for the archive. For example, in the above figures, successfully crawled includes responses indicating 404 (Not Found), 403 (Forbidden), 402 (Payment Required), 401 (Unauthorised) and other 4xx and 3xx (redirection) server responses. It includes successful server responses (200) but also successful server responses indicating that no content was delivered (204). Thus, these figures indicate a high degree of success on the part of the crawler in obtaining server responses (and presumably content where it was there to be obtained and authorisation was not required), but successful crawling does not necessarily mean that archival content was obtained.
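The distinction between “successfully crawled” (a server response of any kind) and “successfully downloaded” (a 200 response with content) can be made concrete with a sketch like the following. It works on simplified (URL, status) records; deriving those records from the actual Heritrix crawl.log columns is assumed rather than shown, and the sample records are hypothetical.

```python
# Sketch: summarise (url, status) records into found / crawled / downloaded figures.
def crawl_outcomes(records):
    found = crawled = downloaded = 0
    for _url, status in records:
        found += 1
        if 100 <= status <= 599:   # an HTTP response of some kind was obtained
            crawled += 1
            if status == 200:      # content was actually delivered
                downloaded += 1
        # non-HTTP codes (e.g. the negative values Heritrix uses for failures
        # and exclusions) count as found but not crawled
    return {
        "found": found,
        "crawled %": round(100.0 * crawled / found, 2) if found else 0.0,
        "downloaded %": round(100.0 * downloaded / found, 2) if found else 0.0,
    }

if __name__ == "__main__":
    sample = [("http://example.gov.au/", 200),        # hypothetical records
              ("http://example.gov.au/old", 404),
              ("http://example.gov.au/private", -61)]
    print(crawl_outcomes(sample))
```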
In recognition of this, Chris Huston also provided statistics specifically on successfully downloaded URLs; that is, those with a server response code of 200. These figures (following) relate more closely to the reports provided with the crawl by the IA (see above).


Text/html files             111,622,305    62.89%
Image files                 55,936,725     31.51%
Data (application) files    5,251,823      2.96%
Audio files                 221,608        0.12%
Video files                 112,641        0.0541%

A further sub-set analysed was just those URLs with the .au extension. While the figures are obviously smaller, the percentages remain much the same.




.au URLs found by Heritrix                   162,955,477
.au DNS URLs found by Heritrix               17,654,226
.au URLs successfully crawled by Heritrix    147,777,465

The percentage of successfully crawled .au URLs in respect to those found was 90.68%. Robots.txt exclusions accounted for 4.77% of the .au URLs found, invalid URLs for 3.44%, and crawler user-setting exclusions for 0.12%.


Some analysis was also done of specific sites. As an example, the Australian Bureau of Statistics domain www.abs.gov.au was analysed, since this site is known to be very large and is delivered in a dynamic way (and was assumed to be problematic to harvest). The results of this analysis suggested the same high degree of success in crawling (harvesting) URLs found by the crawler: in this case 92% were successfully crawled (15,988 URLs crawled from 17,238 found). However, a visual check of the site and URL searching using the Wayback Machine suggest a substantial amount of missing content. To try to get some indication of the extent of the missing content – that is, URLs not even found by the crawler – a comparison against Google was made. Google returned 5,990,000 URLs known and indexed for +site:abs.gov.au, a figure that does not seem genuinely accurate in terms of a single instance of the site. Such a comparison therefore does not appear to be very meaningful or useful.
Analysis comparing domain content with PANDORA content
In order to obtain meaningful comparative statistics, Alex Osborne has undertaken a comparison of the domain harvest content with the content in the PANDORA Archive. The rationale of this comparison is that PANDORA archived instances will generally represent complete versions of a specific web resource (albeit with their parameters defined by PANDORA selection practices). Websites included in PANDORA are routinely quality assessed for completeness and, when necessary, content found to be missing is added to the harvested instance of that site. By comparing URLs derived from PANDORA url.map files with the domain crawl logs, some estimate could be made of the effectiveness of the domain crawl. The instances for comparison were drawn from PANDORA harvesting done between July 2004 and July 2005; that is, during the year prior to the domain harvest. Therefore, while there was a reasonable expectation that the content should still have been available for harvest at the time of the domain crawl, the margin for error in this comparison suggests that the measured success rates, if anything, may be lower than the actual rates. Further analysis is currently underway using PANDORA data from a period that is shorter in duration and more directly contemporaneous with the domain crawl.
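A minimal sketch of the base comparison is given below: a set of URLs from the domain crawl is built and each PANDORA URL is tested against it. The file names, the one-URL-per-line inputs and the crude normalisation are illustrative assumptions; the actual url.map and crawl log formats require their own parsing.

```python
# Sketch: report how many PANDORA URLs also appear in the domain crawl.
def compare_url_lists(pandora_path="pandora-urls.txt",
                      crawl_path="domain-crawl-urls.txt"):
    def normalise(url):
        # crude normalisation so trivial differences are not counted as misses
        return url.strip().rstrip("/").lower()

    with open(crawl_path, encoding="utf-8") as f:
        crawled = {normalise(line) for line in f if line.strip()}

    tested = found = 0
    with open(pandora_path, encoding="utf-8") as f:
        for line in f:
            url = normalise(line)
            if not url:
                continue
            tested += 1
            found += url in crawled
    pct = 100.0 * found / tested if tested else 0.0
    print(f"{found:,} of {tested:,} URLs found in the domain crawl ({pct:.2f}%)")
```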
The comparison was initially done on a list of URLs edited only to remove obviously invalid URLs and non-.au URLs. It was evident from this comparison that a number of very large sites were skewing the results. A refined methodology was therefore developed in which the percentage of each website covered was calculated and these values were then averaged, so that large sites could not dominate the results. The refined methodology also addressed certain problems in retrieving all the PANDORA logs for the report period from the storage tapes by examining the HTTrack headers added to the HTML files. These headers are not present in files such as PDFs, images and other media files, so only HTML files were compared.
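The refined methodology can be sketched as follows: coverage is computed per website (grouping by hostname here, as an assumption) and the per-site percentages are then averaged, so that a handful of very large sites cannot dominate the overall figure.

```python
# Sketch: average per-site coverage so that each website contributes equally.
from collections import defaultdict
from urllib.parse import urlsplit

def average_site_coverage(pandora_urls, crawled_urls):
    crawled = set(crawled_urls)
    per_site = defaultdict(lambda: [0, 0])          # host -> [tested, found]
    for url in pandora_urls:
        host = urlsplit(url).hostname or "unknown"
        per_site[host][0] += 1
        per_site[host][1] += url in crawled
    coverages = [100.0 * found / tested for tested, found in per_site.values()]
    return sum(coverages) / len(coverages) if coverages else 0.0

if __name__ == "__main__":
    # toy example: a small site half covered and a larger site fully covered
    pandora = ["http://small.org.au/a", "http://small.org.au/b",
               "http://big.gov.au/1", "http://big.gov.au/2", "http://big.gov.au/3"]
    crawl = ["http://small.org.au/a", "http://big.gov.au/1",
             "http://big.gov.au/2", "http://big.gov.au/3"]
    print(f"average per-site coverage: {average_site_coverage(pandora, crawl):.1f}%")
```

In the toy example, a two-page site at 50% coverage and a three-page site at 100% coverage average to 75%, whereas a straight URL count would give 80%; this illustrates why the averaged figure is less sensitive to very large sites.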
Base figures from analysis


URLs tested                                                            2,121,196

Archived by domain crawl                                               940,797      44.23%
Inaccessible (missing, restricted, server error)                       83,697       3.94%
Blocked by robots.txt                                                  23,726       1.11%
Unable to connect to server                                            5.61         0.024%
Not encountered in the domain crawl or queued when crawl terminated    1,067,915    50.34%

Figures from refined methodology




URLs (HTML) tested                                             2,793,058
Percentage of each website crawled                             59.91%
Percentage of each website identified by domain crawl          64.39%
Percentage of each website queued at termination of crawl      4.48%
Percentage of each website not encountered by domain crawl     35.61%

These figures do suggest that the impressions gained from limited visual checking, and from analysis of the success rate of the crawl in relation to identified URLs, are overly optimistic. The comparative figures, which indicate a large percentage of URLs not encountered, suggest that, in terms of the depth of harvesting, there is a significant shortfall in the completeness of the harvest. However, the high percentage of successfully crawled URLs in relation to identified URLs also suggests that this shortfall in depth coverage may be addressed to some degree by strategies such as a longer crawl duration, or by focusing the crawl more narrowly to better complement a limited crawl duration.





Paul Koerbin

22 March 2006

