An algorithm identifying the class of a document using OCR
Computer Vision
About the company
Founded in January 2019 by four IIT-Bombay alumni, Khatabook has rapidly become the world's fastest-growing SaaS company and India's leading business management app for MSMEs, with 13 languages and over 50 million downloads. Led by Ravish Naresh, the app empowers merchants to track transactions securely, collect payments online, and send reminders, while the innovative Interactive Voice Response feature simplifies credit collection. With $200 million worth of transactions added daily, Khatabook's success is driven by high engagement and word-of-mouth referrals, significantly impacting India's MSME sector. Having recently secured a $60 million Series C funding round, Khatabook is an employee-first, innovation-centered organization seeking dedicated professionals to join its mission of catering to a digital Bharat.
Problem Statement
The customer had a database of 70,000+ untagged documents that had to be submitted to the Government for compliance. The submission format required a tag stating the type of each document, and that tag was missing. There were 15+ types of documents in the dataset. Adding tags to each document manually would have taken 250+ human hours. The compliance deadline was 3 days from the start of the project, so manual tagging would have required a team of 15+ people working simultaneously, which would have been an operational nightmare.
Key Challenges
[1] A major share of the documents were not scanned but photographed with a mobile camera, with poor backgrounds, incorrect alignment, illegible text, and blurriness. Because of this substandard quality, conventional text extraction methods were ineffective, so image processing became necessary.
[2] Time was a significant constraint. With less than three days available to complete the project, any image processing technique requiring approximately one second per image would need about 20 hours to process all the images once. We aimed to reduce processing time to less than 0.05 seconds per image. That brought each run across the entire pool of images down to approximately one to two hours, which let us explore various image processing features and algorithms within the deadline.
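The time budget above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes exactly 70,000 documents (the source says 70,000+); the function name `full_pass_hours` is illustrative and not part of the project.

```python
# Assumption: 70,000 documents, the lower bound quoted in the problem statement.
NUM_DOCS = 70_000


def full_pass_hours(seconds_per_image: float, num_docs: int = NUM_DOCS) -> float:
    """Hours needed for one pass over every document at a given per-image cost."""
    return seconds_per_image * num_docs / 3600


# About one second per image: close to a full day for a single run.
print(round(full_pass_hours(1.0), 1))   # 19.4

# Target of 0.05 seconds per image: roughly one hour per run.
print(round(full_pass_hours(0.05), 2))  # 0.97
```

At one second per image, a three-day deadline leaves room for only about three full runs, while the 0.05-second target allows dozens of experiments.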
Approach
[1] Optimized PDF Processing: Reading PDFs with OpenCV or pdftoimage took more than 1 second per document (~20 hours in total). We removed this bottleneck by using mutool to convert PDFs to PNG, which brought processing time down to about 0.01 seconds per PDF. For documents that were already in image format, we cut processing time further by downscaling them to lower resolutions.
[2] Feature Extraction and Classification: Our methodology involved extracting key features from documents, including logos, whitespace, barcodes, QR codes, colors, headers, and footers, for each representative class of documents. Subsequently, we processed each image to identify the presence of these features, employing heuristics to classify documents effectively.
[3] Quality Control Tool Development: Recognizing the potential for ambiguity, particularly where documents appeared similar but varied in format, we developed a manual annotation tool that let reviewers validate documents rapidly and made ambiguous instances easy to identify.
[4] Achieving High Accuracy within Timeline: Despite the stringent timeframe of less than three days, our team successfully delivered the complete document classification solution with an accuracy exceeding 99%.
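Steps [1] to [3] can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the `mutool draw` options used (`-o` for the output file, `-r` for resolution in DPI, a trailing page number) are standard MuPDF command-line options, but the choice of 72 DPI, the feature names, the class labels, and the `RULES` table are all hypothetical. Detecting the features themselves (logos, QR codes, and so on) is left out; the sketch only shows how per-document feature flags might be turned into a label, with ambiguous documents routed to manual review.

```python
import subprocess
from pathlib import Path


def mutool_command(pdf_path, out_png, dpi=72):
    """Build a `mutool draw` command that renders page 1 of a PDF to a PNG."""
    return ["mutool", "draw", "-o", str(out_png), "-r", str(dpi), str(pdf_path), "1"]


def render_first_page(pdf_path, out_dir, dpi=72):
    """Render the first page with mutool (requires MuPDF tools on PATH)."""
    out_png = Path(out_dir) / (Path(pdf_path).stem + ".png")
    subprocess.run(mutool_command(pdf_path, out_png, dpi), check=True)
    return out_png


# Hypothetical rule table: the feature values each class is expected to show.
RULES = {
    "invoice": {"has_qr_code": True, "has_logo": True},
    "id_card": {"has_barcode": True, "mostly_whitespace": False},
}


def classify(features, rules=RULES):
    """Return the single class whose rules all match, else 'needs_review'.

    Documents matching no class, or more than one, are treated as ambiguous
    and sent to manual validation rather than being guessed.
    """
    matches = [
        label
        for label, expected in rules.items()
        if all(features.get(name) == value for name, value in expected.items())
    ]
    return matches[0] if len(matches) == 1 else "needs_review"


if __name__ == "__main__":
    detected = {
        "has_qr_code": True,
        "has_logo": True,
        "has_barcode": False,
        "mostly_whitespace": False,
    }
    print(classify(detected))  # invoice
```

Rendering at a low DPI keeps the PNGs small, which fits the goal of fast per-image processing, and returning `needs_review` instead of a best guess is one simple way to feed a manual annotation tool with only the ambiguous cases.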