The Business Need
A large global information-content and technology company based in Michigan, USA, had a core mission: to make content held in various forms digitally accessible to researchers and librarians, and to provide the technology they need to organize and store that abundance of content.
Its holdings were vast and diverse: 450,000 ebooks, 20 million pages of global, national, regional and specialty newspapers (spanning 3 centuries!), a rich collection of significant journals and periodicals, and a huge repository of dissertations and theses.
This giant enabler of digital content wanted Mobius to convert part of its decades-old magazine collection into a format suitable for digitization. The client supplied the basic rules for content tagging and expected the digitized output of about 100 magazine images in XML format, for ease of storage and indexing in its database.
Challenges we faced
The scanned images of 19th-century magazines contained multiple elements, such as headlines, images of different sizes, subheads, bylines, image captions, credits, quotes, and body copy. These elements were not laid out consistently across pages, so each page had to be tagged individually.
We realized that accurate tagging required human intelligence, both to comprehend the intricacies of the tagging rules and to constantly evaluate the tagging routines. Besides English, some of the magazines were in Latin, which, coupled with their age, made character recognition a challenge.
How we solved the problem
Our approach to content digitization was broken into 2 major parts: content conversion through OCR, and content tagging. During OCR, the layout of the sections also had to be preserved in the digitized output, which meant that text from different sections must not overlap or run together.
Drawing on our wide expertise in content digitization, our team at Mobius rapidly developed a tool to identify zone coordinates and the coordinates of each word within a zone. The home-grown tool then extracted the text within the word coordinates, automating the entire extraction process.
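To make the zone-and-word idea concrete, here is a minimal Python sketch. It is not the Mobius tool, whose internals are not described here. The names `Box`, `Word`, `words_by_zone` and `zone_text` are hypothetical, the pixel coordinates are made-up sample data, and we assume the OCR engine has already returned a bounding box for every word.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page pixel coordinates (assumed layout)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def words_by_zone(zones: dict, words: list) -> dict:
    """Assign each word to the first zone that contains the centre of its box.

    Using the centre point means a word can land in at most one zone,
    so text from neighbouring sections is never merged.
    """
    grouped = {name: [] for name in zones}
    for word in words:
        cx = (word.box.left + word.box.right) / 2
        cy = (word.box.top + word.box.bottom) / 2
        for name, zone in zones.items():
            if zone.contains(cx, cy):
                grouped[name].append(word)
                break
    return grouped


def zone_text(words: list) -> str:
    """Join words in a simple reading order: top to bottom, then left to right.

    Assumes words on the same line share the same top coordinate.
    """
    ordered = sorted(words, key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in ordered)


# Made-up sample page: one headline zone above one body zone.
zones = {
    "headline": Box(0, 0, 1000, 100),
    "body": Box(0, 100, 1000, 800),
}
words = [
    Word("Gazette", Box(120, 20, 260, 60)),
    Word("The", Box(40, 20, 110, 60)),
    Word("Arrived", Box(40, 150, 160, 180)),
    Word("yesterday", Box(170, 150, 320, 180)),
]
grouped = words_by_zone(zones, words)
headline_text = zone_text(grouped["headline"])
body_text = zone_text(grouped["body"])
```

Keeping zones and words as plain coordinate boxes is what lets the extraction run without manual intervention once the zones are marked.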
Marking the coordinates ensured that the various sections within the magazine images were digitized consistently and accurately. With the conversion automated, our specialists then tagged the different elements on each OCR-ed page. This intensive tagging and indexing makes it possible to retrieve the relevant sections from the database when a keyword search is performed.
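The tagged output could look something like the sketch below, which writes one page's zones to XML using Python's standard library. The client's actual tagging rules and XML schema are not given here, so the element names (`page`, `headline`, `byline`), the `number` and `coords` attributes, and the sample text are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def page_to_xml(page_number: int, zones: list) -> str:
    """Serialize tagged zones to an XML string.

    `zones` is a list of (tag, (left, top, right, bottom), text) tuples.
    The tag and attribute names are hypothetical, not the client's schema.
    """
    page = ET.Element("page", number=str(page_number))
    for tag, box, text in zones:
        element = ET.SubElement(page, tag, coords=",".join(str(v) for v in box))
        element.text = text
    return ET.tostring(page, encoding="unicode")


# Made-up sample data for a single page.
xml_out = page_to_xml(
    1,
    [
        ("headline", (0, 0, 1000, 100), "The Gazette"),
        ("byline", (0, 100, 1000, 140), "By A. Writer"),
    ],
)
```

Because every element carries its tag name, a database can index the XML so that a keyword search returns only, say, headlines or captions rather than whole pages.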
The success of OCR-based digitization depends on how closely the digitized output conforms to the scanned input files. We achieved integrity and accuracy nearing 100% in the digitized output text, making our approach a huge success in terms of both technical competence and the time span within which digitization was attained. The highly customized, single-interface tool combined OCR, XML conversion, and coordinate identification, and was built in record time.
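One simple way to put a number on how closely OCR output conforms to the source is to compare it against a manually verified transcription. The sketch below uses the similarity ratio from Python's standard `difflib` module. This is an illustrative stand-in, not the measure used on this project, which is not specified here. The function name `character_accuracy` and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return a similarity ratio between 0.0 and 1.0.

    This is difflib's ratio (2 * matching characters / total characters
    in both strings), used here as a rough proxy for OCR accuracy.
    """
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    return matcher.ratio()


# Made-up example: one misrecognized character in a seven-letter word.
score = character_accuracy("Gazette", "Gazctte")
perfect = character_accuracy("Gazette", "Gazette")
```

A score of 1.0 means the OCR text matches the verified transcription exactly. Lower scores flag pages that need a specialist's review.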