Library harvesting is a process used to collect content and metadata that is available without access restriction on the open web. The deposit libraries will use automated web crawling software wherever possible, especially when collecting for the UK Web Archive, but may also use manual or other methods of downloading content and metadata when necessary.
The National Library of Scotland and the other legal deposit libraries are entitled to copy UK-published material from the internet for archiving under legal deposit.
Frequently asked questions for webmasters
Why is your crawler visiting my website?
Legal deposit libraries will copy UK-published material from the internet, including freely accessible material on the open web. They will also be entitled to harvest copies of password-protected or paid-for material, but are putting alternative delivery arrangements in place for any publisher who prefers to deposit such material instead.
What do you do with the copy of my website — where can I see it?
Digital materials collected through legal deposit, including archived websites, will be accessible onsite at the legal deposit libraries — usually in the reading room facility of each institution.
A deposited work may only be displayed at one computer at a time on the premises of each legal deposit library, in compliance with the 2013 regulations.
For additional detail, see our pages on access to electronic material.
What content is available now?
With the passing into law of the regulations, the legal deposit libraries will be able to collect digital materials extensively for the first time, so the collection is expected to grow over the coming months and years.
Users will be able to access a range of electronic journal articles and other digital materials immediately. Large-scale harvesting of the UK domain websites will begin shortly, with the results of the first harvest becoming available on-site in the legal deposit libraries towards the end of 2013.
Do you respect robots.txt?
As a rule, yes: we follow the Robots Exclusion Protocol. However, in certain circumstances we may choose to overrule robots.txt, for instance if content is necessary to render a page (e.g. JavaScript, CSS) or if content is deemed of curatorial value and falls within the scope of the Legal Deposit Libraries Act 2003.
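For illustration, a typical exclusion in robots.txt looks like the following (the path shown is purely an example):

    User-agent: *
    Disallow: /private/

Under normal circumstances we honour such rules, although, as noted above, files under an excluded path may still be fetched if they are needed to render a page we are archiving (for example CSS or script files).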
What about robots META tag exclusions?
As above, these are normally respected.
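For reference, a page-level exclusion of this kind is placed in the page's HTML head, for example (the directives shown are illustrative):

    <meta name="robots" content="noindex, nofollow">

As with robots.txt, we normally respect such tags, subject to the limited exceptions described above.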
What about 'rel=nofollow' attributes?
At present we are not able to interpret these due to technical limitations of the crawler being used.
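For context, a 'nofollow' hint is attached to an individual link rather than to a whole page, for example:

    <a href="https://example.com/page" rel="nofollow">a linked page</a>

Because our current crawler does not act on this attribute, such links are likely to be treated like any other link.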
How often do you recheck robots.txt during the crawl?
A host's robots.txt file is considered valid for up to 24 hours, after which it will be reconsidered.
What crawler do you use and how does it identify itself?
We use Heritrix, and the crawler should identify itself via its User-Agent as 'bl.uk_lddc_bot'.
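If you want to confirm whether a request came from our crawler, one option is to search your web server's access log for this token, assuming your log format records the User-Agent header (the log path below is only an example):

    grep 'bl.uk_lddc_bot' /var/log/apache2/access.log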
How often will you crawl my site and how long will the crawl last?
We try to crawl every site in the UK at least once per year. There are circumstances in which sites deemed of specific curatorial value may be crawled more frequently. However, as with all our crawling activities, our intention is to keep any impact on the site to a minimum.
Will you crawl embedded and linked content hosted outside the UK?
We intend to include only the embedded content for web-based material (such as CSS, images) that is necessary to display a UK webpage in a complete, cohesive and intelligible way, regardless of where it is hosted. This would normally mean a web page that would be displayed when the user follows a link or enters a URL, including transcluded content that displays within the same tab or window as its contextual material. Linked files whose content, when opened, displays in a separate tab or window would normally be treated as separate items and will not be crawled.
Can I stop the crawling by using robots.txt or blocking your IP?
Disallowing our crawler in robots.txt will stop further crawling once we re-read the file (see above). Similarly, blocking our IP address will stop all further access from that address. However, the British Library and the other deposit libraries are entitled to copy UK-published material from the internet for this national collection. If you disallow our crawler or block our IP, you will introduce barriers to us fulfilling our legal obligations.
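For completeness, a robots.txt rule aimed specifically at our crawler would look something like the following, assuming rules are matched against the 'bl.uk_lddc_bot' token given above; it would take effect the next time we re-read the file:

    User-agent: bl.uk_lddc_bot
    Disallow: /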
Should I 'whitelist' your crawler?
Permitting our crawler greater 'Allow' access to your site would certainly be helpful and appreciated, especially if your default policy is quite restrictive.
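Conversely, a site with a restrictive default policy could grant our crawler wider access with rules along these lines (again assuming matching against the 'bl.uk_lddc_bot' token; the rules shown are only a sketch):

    User-agent: bl.uk_lddc_bot
    Allow: /

    User-agent: *
    Disallow: /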
Can I request you to crawl my site additionally outside your default crawl schedule, e.g. before a major upgrade?
We do try to crawl every site in the UK at least once per year. In some circumstances we may crawl more frequently, and we are willing to accept requests to amend our schedule. However, we cannot guarantee specific timings.
Do you archive audio / video material?
Any audio or video resources discovered as part of the crawl will be archived. However, media delivered via a streaming protocol is unlikely to be discovered. Any website whose primary function is the delivery of video and/or audio, with other content merely incidental, is excluded entirely.
See also our page on recorded sound and film.
Do you archive password-protected material?
Individual resources which are themselves password-protected, such as a PDF that requires a password to view, will be crawled, but on an 'as-is' basis, i.e. the password protection will remain in place. Websites which are wholly or partly restricted by a login page will not, at present, be crawled. In such cases we intend to contact the publisher separately, in order to agree arrangements for their deposit in accordance with the legislation.
Why is the archived copy of my site incomplete or not displaying correctly?
We intend for the archival copy to be as accurate a representation of the original as possible. We try to harvest all the resources associated with a website, including HTML, images, CSS and associated scripts. However, certain content may not be gathered due to technical limitations: support for capturing streaming media and dynamic or interactive content is limited at this time. Similarly, by default we follow the Robots Exclusion Protocol, and thus some content may be restricted from our crawling activities by the website owner.
How do you decide which websites are UK-published?
The 2013 regulations define how relevant material must be 'connected to the UK'. The Joint Committee on Legal Deposit has agreed how the legal deposit libraries will interpret and implement these regulations.
See also our notice and take down policy.