Consuming content in digital form has become the norm for many of us. We watch videos on smartphones, we skim the news on tablets, we share photos on social networks and we read books on e-readers. But in a world where digital rules, where does a traditional organisation like The British Library fit in?
The first thing most people find out about The British Library is that it holds at least one copy of every book produced in the United Kingdom and the Republic of Ireland. The Library adds some three million volumes every year, occupying roughly 11 kilometres of new shelf space.
It also owns an almost complete collection of British and Irish newspapers since 1840. Housed in its own building in Colindale, North London, the collection consists of more than 660,000 bound volumes and 370,000 reels of microfilm, containing tens of millions of newspapers.
It may have come as a surprise, therefore, when The British Library – an organisation that places such high value on paper objects – announced in May 2010 that it was teaming up with online publisher brightsolid to digitise a large portion of the British Newspaper Archive and make it available via a dedicated website.
The British Newspaper Archive
By the time The British Newspaper Archive website went live in November 2011, it offered access to up to 4 million fully searchable pages, featuring more than 200 newspaper titles from every part of the UK and Ireland. Since then, the Library has been scanning between 5,000 and 10,000 pages every day, and the digital archive now contains around 197TB of data.
The newspapers – which mainly date from the 19th century, but which include runs dating back to the first half of the 18th century – cover every aspect of local, regional and national news. The archive also offers a wealth of material for people researching family history, including family notices, announcements and obituaries.
According to Nick Townend, head of digital operations at The British Library, the idea of the project is to ensure the stability of the collection and make it available to as many people as possible.
“The library has traditionally had quite an academic research focus, but the definition of research has maybe broadened to mean everybody who's interested in doing research, and I think the library's trying to respond to that and make the collections more accessible,” said Townend.
The British Library and brightsolid have set themselves a minimum target of scanning 40 million pages over ten years. “That's actually a relatively small percentage of the total collection,” said Townend. The entire collection consists of 750 million pages, so the target covers a little over 5 percent of it.
“The digitisation project gives us a really good audit of the physical condition of the collection items,” he added. “Some of the earlier collections were made on very thin paper and it's just naturally degraded over time, so they've effectively become 'at risk' collection items. Making a digital surrogate is part of the longer term preservation of the collection.”
Eight thousand pages a day
The fragility of some items in the collection is the reason why the scanning process has to take place on-site at Colindale, according to Malcolm Dobson, chief technology officer at brightsolid. He explained that the company set up a scanning facility there at the start of the project, with five very high-spec scanners from Zeutschel.
“We do fairly high resolution scanning – 400 DPI, 24-bit colour. The full-res image sizes vary from anything from 100MB up to 600MB per page,” said Dobson. “At 400DPI these can be 12,000 pixels by 10,000 pixels – very large bitmaps. So even compressed, they are massive.”
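Dobson's figures can be checked with a little arithmetic. The sketch below assumes an uncompressed bitmap with no headers or padding, which is a simplification. The function names are illustrative and not part of brightsolid's workflow.

```python
def uncompressed_size_bytes(width_px, height_px, bits_per_pixel=24):
    """Raw bitmap size: one sample of bits_per_pixel for every pixel."""
    return width_px * height_px * bits_per_pixel // 8


def physical_size_inches(width_px, height_px, dpi=400):
    """Size of the scanned area implied by the pixel dimensions and DPI."""
    return width_px / dpi, height_px / dpi


# The page quoted in the article: 12,000 x 10,000 pixels, 24-bit colour.
size = uncompressed_size_bytes(12_000, 10_000)
print(f"{size / 1_000_000:.0f} MB uncompressed")   # 360 MB
print(physical_size_inches(12_000, 10_000))        # (30.0, 25.0)
```

At 360MB, such a page sits in the middle of the 100MB to 600MB range Dobson describes, and corresponds to a scanned area of 30 by 25 inches.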
The pages are scanned in TIF format, and then converted into JPEG 2000 files. According to Dobson, JPEG 2000 provides a good quality of compression and retains a much better representation of the image than standard JPEG.
“We throw away the TIF files because they're just too big to keep,” said Dobson. “To put it into perspective, we've probably got something like 250TB of JPEG 2000, and we have 3 copies of each file, so it's a lot of data. If we'd just been going with the uncompressed TIF, that would probably be something in excess of a petabyte and a half.”
Once scanned, the images are transferred over a Gigabit Ethernet connection to brightsolid's data centre in Dundee. The transfer happens overnight, and usually takes around five to six hours.
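To get a feel for the volumes involved, the sketch below works out the theoretical ceiling on how much data a Gigabit link could carry in that window. It assumes the full 1 Gbit/s line rate with no protocol overhead or contention, so real transfers would move less. The function name is illustrative.

```python
def max_transfer_bytes(hours, link_bits_per_second=1_000_000_000):
    """Upper bound on bytes moved in the given time at the full line rate."""
    seconds = hours * 3600
    return seconds * link_bits_per_second // 8


for hours in (5, 6):
    terabytes = max_transfer_bytes(hours) / 1e12
    print(f"{hours} hours: at most {terabytes:.2f} TB")
```

Under those assumptions, a five to six hour window tops out at roughly 2.25TB to 2.7TB per night.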
The scanned images are entered into an OCR workflow system, where they are cropped, de-skewed, and turned into searchable text using optical character recognition. They are also “zoned” using an offshore arrangement in Cambodia. This means that areas of the page are manually catalogued by content – such as births, marriages, adverts or photographs – and referenced to coordinates.
“We end up with quite a comprehensive metadata package that accompanies the image, and it's that metadata package along with the OCR information that forms the basis of the material that's then searchable,” said Townend.
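The article does not describe the actual metadata schema, so the following is only a sketch of how zoned regions might be represented and searched. The field names, categories and sample records are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A catalogued region of a page (hypothetical schema)."""
    category: str   # e.g. "births", "marriages", "adverts", "photographs"
    x: int          # left edge, in pixels
    y: int          # top edge, in pixels
    width: int
    height: int
    text: str       # OCR text recognised inside the region

    def contains(self, px, py):
        """True if the pixel (px, py) falls inside this zone."""
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


def search_zones(zones, term, category=None):
    """Return zones whose OCR text mentions term, optionally by category."""
    term = term.lower()
    return [z for z in zones
            if term in z.text.lower()
            and (category is None or z.category == category)]


# Made-up sample data for one page.
page_zones = [
    Zone("marriages", 100, 200, 900, 400, "Smith and Jones, married at Aylesbury"),
    Zone("adverts", 100, 700, 900, 300, "Smith's patent boot polish"),
]

hits = search_zones(page_zones, "smith", category="marriages")
print([z.category for z in hits])  # ['marriages']
```

Storing coordinates alongside the text means a search hit can be mapped back to the exact area of the page image in which it appears.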
Having gone through this process, one copy of the file is uploaded to The British Newspaper Archive website, and another is sent to the British Library, to be ingested into its digital library system.
Dobson explained that, while JPEG 2000 is a perfect file format for storing and transferring high resolution images, most browsers are not able to render it. The images are therefore converted into a set of JPEG tiles.
“We decided to take a tiling image server, and the format we use is something called Deep Zoom, or Seadragon. There's various resolution layers stored or created, so that when someone initially goes to view an image they're looking at a sampled smaller image, and it's served up as a number of tiles,” said Dobson.
“As you zoom in, only the tiles relating to that area you're looking at are delivered. You can zoom in further and further, so you get quite a good experience in terms of looking at a very high resolution image without the obvious latencies.”
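The tiling approach Dobson describes can be sketched as an image pyramid in which each level halves the resolution of the one above it, as in the Deep Zoom layout. The example assumes 256-pixel tiles and ignores tile overlap. These are simplifications rather than details of brightsolid's configuration, and the function names are illustrative.

```python
def max_level(width, height):
    """Highest pyramid level: ceil(log2 of the longest side)."""
    return (max(width, height) - 1).bit_length()


def level_size(width, height, level):
    """Image dimensions at a level; each step down halves both sides."""
    scale = 2 ** (max_level(width, height) - level)
    return max(1, -(-width // scale)), max(1, -(-height // scale))


def tile_grid(width, height, level, tile_size=256):
    """Number of tile columns and rows needed to cover a level."""
    w, h = level_size(width, height, level)
    return -(-w // tile_size), -(-h // tile_size)


def tiles_for_viewport(x0, y0, x1, y1, tile_size=256):
    """(column, row) of every tile touching the pixel box [x0, x1) x [y0, y1)."""
    cols = range(x0 // tile_size, (x1 - 1) // tile_size + 1)
    rows = range(y0 // tile_size, (y1 - 1) // tile_size + 1)
    return [(c, r) for r in rows for c in cols]


width, height = 12_000, 10_000
top = max_level(width, height)
cols, rows = tile_grid(width, height, top)
print(f"full resolution is level {top}: {cols} x {rows} = {cols * rows} tiles")
print(len(tiles_for_viewport(0, 0, 1024, 768)), "tiles for a 1024 x 768 view")
```

For the 12,000 by 10,000 pixel page mentioned earlier, the full-resolution level needs 1,880 tiles under these assumptions, yet a 1,024 by 768 window onto it touches only 12 of them. That is why zooming feels responsive even on very large images.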
Illustration of Queen Victoria - Supplement to the Bucks Herald, June 25 1887
The website runs on a virtualised IBM platform: HS22 blades in an IBM BladeCenter H chassis, with an IBM x3755 rack server and associated SAN fabric.
According to Michael Mauchlin, IBM's systems sales manager for Scotland, the IBM blade platform is highly energy efficient, and consumes 10 percent less energy than the nearest rival platform.
“The other key attribute is that it has no single point of failure,” said Mauchlin. “This is not the case with all blade designs, and system availability was a key reason why brightsolid chose IBM for this platform.”
The availability aspect was of particular importance when The British Newspaper Archive website first launched in 2011, and received extensive coverage in the mainstream press. Dobson said that this prompted a surge of traffic, as people visited the site to try out the free search service.
Brightsolid was keen to provide a good experience for all these people, so it used sizing exercises to model the likely peak demand, then created a suite of load-testing scripts representing various user journeys and activities and ran them on the hardware.
“So we were able to prove that we could deliver the kind of numbers of searches per second, the number of images and tiles associated with those things per second to meet that peak demand,” said Dobson. “All the evidence from our customers on that day was that the experience was good.”
Digital library system
Meanwhile, the copy sent to the British Library is ingested into its digital library system.
“We have a four node digital library system, where we have a base in Yorkshire, one in London, one in Aberystwyth and one in Edinburgh, and effectively we replicate content across all of the nodes,” explained Townend. “Each of the nodes has a slightly different technical architecture and hardware setup, so that even if one node were to develop a fault, technically it shouldn't be the same fault replicated across all four sites.”
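The article does not say how the Library checks its replicas against each other. As an illustration only, one common approach is to compare checksums of the same file across nodes and flag any copy that disagrees with the majority. The node names below come from Townend's description; the function and sample data are hypothetical.

```python
import hashlib
from collections import Counter


def find_divergent_nodes(copies):
    """Return the nodes whose copy differs from the majority SHA-256 digest.

    copies maps a node name to the bytes that node holds. If digests tie,
    the first one seen is treated as the majority, which is a simplification.
    """
    digests = {node: hashlib.sha256(data).hexdigest()
               for node, data in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, digest in digests.items()
                  if digest != majority_digest)


# Made-up example: one node holds a damaged copy of the file.
copies = {
    "Yorkshire": b"page-image-bytes",
    "London": b"page-image-bytes",
    "Aberystwyth": b"page-image-bytes",
    "Edinburgh": b"page-image-bytes-damaged",
}
print(find_divergent_nodes(copies))  # ['Edinburgh']
```

With four independent copies, a single faulty node stands out against the other three and can be repaired from them.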
The British Library's ambitions don't stop there, however. Townend explained that the organisation is always looking to make more content available in more interesting ways. For example, it has massive ambitions in terms of “born digital” material.
“The government is currently in the process of reviewing new legislation that will allow us to collect the digital items under law, but that's not in place at the moment,” explained Townend. “We're currently working on a voluntary deposit base. The legislation will allow us to capture the UK web domain, in terms of every .co.uk website, which is a fairly frightening prospect.”
It also hopes to enter into a licensing agreement with copyright owners, so that more up-to-date newspaper content can be published on the site and accessed in digital format. However, this could be a long process, according to Townend. “This is a huge challenge in terms of copyright and content management,” he said.