HP INC


This disclosure relates to synthetic document generation for training deep learning algorithms to understand document contents.

Document understanding is important for applications such as document quality enhancement and information extraction pipelines. In document quality enhancement, different computer vision techniques are applied to specific regions of the document depending on the element type (for example, text or image), for tasks such as printing and scanning. Information extraction pipelines aim to retrieve valuable knowledge from documents automatically; here too, different extractors are used depending on the element type. Machine learning (ML) techniques may be applied to decompose a document into element types: text, images, equations, charts, and diagrams. Regardless of the training regime (supervised or unsupervised), training data is necessary. One option is to obtain documents from the Internet, but this raises several problems: the lack of permissive licenses, unbalanced data (e.g. slides with only text elements), and the difficulty of extracting precise annotations for training ML models from raw documents. This disclosure presents a synthetic data generator able to create a diverse set of documents based on randomized template formats. Here we focus on slide presentations.
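The idea of a randomized template can be illustrated with a minimal sketch. The disclosure text above does not specify how templates are built, so everything below is an assumption made for illustration: the grid-based layout, the slide size, the margins, and the names `Element`, `generate_slide`, and `generate_dataset` are all hypothetical. The sketch only shows how a generator could emit exact bounding-box annotations for each element and sample element types uniformly to avoid the imbalance found in documents collected from the Internet.

```python
import random
from dataclasses import dataclass

# Element types named in the disclosure.
ELEMENT_TYPES = ["text", "image", "equation", "chart", "diagram"]


@dataclass
class Element:
    """One annotated region of a synthetic slide (hypothetical structure)."""
    kind: str
    x: int
    y: int
    width: int
    height: int


def generate_slide(rng, width=1280, height=720, margin=40, gap=20):
    """Return annotated elements for one slide laid out on a random grid.

    The grid template (1 to 3 rows and columns) is an assumption; a real
    generator could use any family of randomized templates.
    """
    rows = rng.randint(1, 3)
    cols = rng.randint(1, 3)
    cell_w = (width - 2 * margin - (cols - 1) * gap) // cols
    cell_h = (height - 2 * margin - (rows - 1) * gap) // rows

    elements = []
    for r in range(rows):
        for c in range(cols):
            # Uniform sampling keeps the element types balanced on average.
            kind = rng.choice(ELEMENT_TYPES)
            elements.append(
                Element(
                    kind=kind,
                    x=margin + c * (cell_w + gap),
                    y=margin + r * (cell_h + gap),
                    width=cell_w,
                    height=cell_h,
                )
            )
    return elements


def generate_dataset(n_slides, seed=0):
    """Generate a reproducible list of annotated synthetic slides."""
    rng = random.Random(seed)
    return [generate_slide(rng) for _ in range(n_slides)]


if __name__ == "__main__":
    for element in generate_dataset(1, seed=42)[0]:
        print(element)
```

Because the layout is generated rather than recovered from a raw document, the annotation for each element is known exactly at creation time. Rendering the content of each region (text, images, charts) is left out of this sketch.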

Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.