Digitization - Content Tagging Digitization

The Business Need

A large global information-content and technology company based in Michigan, USA, had a core mission: to make content held in various forms digitally accessible to researchers and librarians, and to provide them with the technology needed to organize and store that abundance of content.

They had to deal with a wide range of content, including 450,000 ebooks; 20 million pages of global, national, regional, and specialty newspapers (across 3 centuries!); a rich collection of significant journals and periodicals; and a huge repository of dissertations and theses.

The company wanted Mobius to convert part of its decades-old magazine collection into a format suitable for digitization. The basic rules for content tagging were provided, and the digitized output of about 100 magazine images was expected in XML format for ease of storage and indexing in the company's database.

Challenges we faced

Scanned images of 19th-century magazines contained multiple elements, such as headlines, images of different sizes, subheads, bylines, image captions, credits, quotes, and body copy. These elements weren't consistent across pages and had to be tagged individually, page by page.

We realized that accurate tagging required human intelligence, both to comprehend the tagging intricacies and to constantly evaluate the tagging routines. The magazines were in Latin as well as English, which, coupled with their age, made character recognition a challenge.

How we solved the problem

Our approach to content digitization was broken into 2 major parts: content conversion through OCR, and content tagging. While applying OCR, the layout of the sections also had to be maintained in the digitized output, which meant that text from different sections must not overlap or run together.
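As a rough illustration of the layout constraint (not the actual Mobius tool), the sketch below checks a page's zones for overlapping rectangles before OCR is applied. The `Zone` structure, the pixel coordinates, and the zone labels are all assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Zone:
    """A rectangular page region in pixel coordinates (hypothetical structure)."""
    label: str
    left: int
    top: int
    right: int
    bottom: int


def zones_overlap(a: Zone, b: Zone) -> bool:
    """Return True when the two rectangles share any interior area."""
    return (
        a.left < b.right and b.left < a.right
        and a.top < b.bottom and b.top < a.bottom
    )


def find_overlaps(zones: list[Zone]) -> list[tuple[str, str]]:
    """List every pair of zone labels whose rectangles overlap."""
    return [
        (a.label, b.label)
        for a, b in combinations(zones, 2)
        if zones_overlap(a, b)
    ]


# Invented sample coordinates for one page.
page_zones = [
    Zone("headline", 50, 40, 950, 160),
    Zone("body", 50, 180, 600, 1200),
    Zone("caption", 580, 1100, 950, 1180),  # deliberately intrudes into "body"
]
overlaps = find_overlaps(page_zones)
print(overlaps)
```

A page that reports any overlapping pair could be sent back for its zones to be redrawn, so that text from two sections never ends up merged in the output.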

Drawing on our wide expertise in content digitization, our team at Mobius rapidly developed a tool to identify zone coordinates and the word coordinates within each zone. The home-grown tool then extracted the text within the word coordinates, automating the entire extraction process.
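The idea of zone coordinates and word coordinates can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the home-grown tool itself: it assumes an OCR engine has already produced a bounding box for every word, assigns each word to the zone containing its centre point, and joins the words in a crude reading order. The `Box` and `Word` types, the `line_tolerance` parameter, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in pixel coordinates (hypothetical structure)."""
    left: int
    top: int
    right: int
    bottom: int

    def contains_point(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    """One recognized word together with its bounding box."""
    text: str
    box: Box


def extract_zone_text(
    zones: dict[str, Box], words: list[Word], line_tolerance: int = 10
) -> dict[str, str]:
    """Assign each word to the zone containing its centre, then join per zone."""
    grouped: dict[str, list[Word]] = {name: [] for name in zones}
    for word in words:
        cx = (word.box.left + word.box.right) / 2
        cy = (word.box.top + word.box.bottom) / 2
        for name, zone in zones.items():
            if zone.contains_point(cx, cy):
                grouped[name].append(word)
                break  # a word belongs to at most one zone

    def reading_order(w: Word) -> tuple[int, int]:
        # Crude line bucketing: words whose tops fall in the same
        # line_tolerance-pixel band are treated as one line.
        return (w.box.top // line_tolerance, w.box.left)

    return {
        name: " ".join(w.text for w in sorted(ws, key=reading_order))
        for name, ws in grouped.items()
    }


# Invented sample data: words arrive out of reading order.
zones = {"headline": Box(50, 40, 950, 160), "body": Box(50, 180, 600, 1200)}
words = [
    Word("MAGAZINE", Box(400, 60, 700, 120)),
    Word("THE", Box(100, 62, 380, 120)),
    Word("text", Box(170, 200, 230, 220)),
    Word("Body", Box(60, 201, 150, 221)),
]
zone_text = extract_zone_text(zones, words)
print(zone_text)
```

Keying the extraction on zone membership is what keeps each section's text separate; the line bucketing here is deliberately simple and would likely need refinement for skewed or multi-column scans.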

Marking the coordinates ensured that the various sections within the magazine images were digitized consistently and accurately. With the conversion automated, our specialists then tagged the different elements on each OCR-ed page. This intensive tagging and indexing of content facilitates the retrieval of relevant sections from the database when a keyword search is performed.
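To show how tagged elements might end up as searchable XML, here is a small sketch using Python's standard `xml.etree.ElementTree` module. The client's actual XML schema is not described in this case study, so the element names (`page`, `headline`, `byline`, `body`), the `number` attribute, and the sample text are all invented for illustration.

```python
import xml.etree.ElementTree as ET


def build_page_xml(page_number: int, elements: list[tuple[str, str]]) -> ET.Element:
    """Wrap each (tag, text) pair in its own child element of a <page> node."""
    page = ET.Element("page", number=str(page_number))
    for tag, text in elements:
        ET.SubElement(page, tag).text = text
    return page


def search(page: ET.Element, keyword: str) -> list[str]:
    """Return the tags of the elements whose text mentions the keyword."""
    needle = keyword.lower()
    return [el.tag for el in page if el.text and needle in el.text.lower()]


# Placeholder content, not taken from any real magazine.
page = build_page_xml(
    1,
    [
        ("headline", "The Harvest Fair"),
        ("byline", "By A. Writer"),
        ("body", "The fair opened on Monday with a record harvest."),
    ],
)
xml_text = ET.tostring(page, encoding="unicode")
hits = search(page, "harvest")
print(xml_text)
print(hits)
```

Because every element carries its own tag, a keyword search can report not just that a page matches but which section of the page (headline, body, and so on) contains the match.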

Results

The success of OCR-based digitization depends on how closely the digitized output conforms to the input scanned files. We achieved high levels of integrity and accuracy, nearing 100%, in the digitized output text, making our approach a success in terms of both technical competence and the time span within which digitization was attained. The highly customized single-interface tool combined processes such as OCR, XML conversion, and coordinate identification, and was built in record time.
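This case study does not say how accuracy was measured. One common way to quantify how closely OCR output conforms to the source is character-level accuracy based on edit distance against a manually verified transcription, sketched below. The function names and the sample strings are assumptions for this example only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters reproduced correctly, clamped to [0, 1]."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


# "m" misread as "rn" is a classic OCR confusion on aged print.
score = character_accuracy("magazine", "rnagazine")
print(f"{score:.2%}")
```

Scoring a sample of pages this way against hand-checked text gives a concrete number to track as the tagging and recognition routines are refined.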

19th-century magazines of varying clarity

Nearly 100% accuracy in digitized text

Images to be tagged and digitized too

Mobius developed a custom tool that powered OCR conversion and intelligent content tagging

A giant content technology company digitizes 19th-century magazines with Mobius
