This disclosure describes techniques for detecting text cutoff in captured images of documents that include text. Optical character recognition (OCR) is applied to an input image, and a bounding box, defined by the x and y coordinates of its four corners, is determined for each recognized text character (OCR symbol). A feature vector that represents the spatial locations of the OCR symbols is constructed from their coordinates and provided to a trained classifier, which outputs a class label for the input document indicating whether the document includes text cutoff. Optionally, the area of the image that includes text is determined automatically and used to limit the image area passed to downstream document processing.
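As a rough illustration of how OCR symbol coordinates might be turned into a feature vector, the following minimal sketch computes the text area of an image and the normalized margins between that area and each image edge. The names `OcrSymbol`, `text_region`, and `margin_features`, and the choice of four margin features, are assumptions made for this example only; the disclosure does not specify the exact feature construction, and a real system would pass such features to a trained classifier rather than inspect them directly.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class OcrSymbol:
    """Hypothetical container for one OCR symbol and its four corner points."""
    text: str
    corners: Sequence[Point]  # four (x, y) corners in image pixel coordinates


def text_region(symbols: Sequence[OcrSymbol]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing every symbol's corners."""
    xs = [x for s in symbols for x, _ in s.corners]
    ys = [y for s in symbols for _, y in s.corners]
    return min(xs), min(ys), max(xs), max(ys)


def margin_features(symbols: Sequence[OcrSymbol], width: float, height: float) -> List[float]:
    """Assumed feature vector: normalized left, top, right, bottom margins.

    A margin close to 0 means text touches that image edge, which may hint
    at cutoff. Values are clamped to [0, 1].
    """
    if not symbols:
        # No text detected: treat every margin as maximal.
        return [1.0, 1.0, 1.0, 1.0]
    x0, y0, x1, y1 = text_region(symbols)
    raw = [x0 / width, y0 / height, (width - x1) / width, (height - y1) / height]
    return [min(1.0, max(0.0, v)) for v in raw]
```

The rectangle returned by `text_region` could also serve as the optional crop that limits the image area passed to downstream processing, although the disclosure may determine that area differently.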
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Lahiri, Avisek; Yao, Xinwei; and Yu, Tianli, "Text Cutoff Detection for Document Images", Technical Disclosure Commons, (May 01, 2022)