Abstract

This disclosure relates to synthetic document generation for training deep learning models to understand document contents.

Document understanding is important for applications such as document quality enhancement and information extraction pipelines. In document quality enhancement, different computer vision techniques are applied to specific regions of a document depending on the element type (text or image), for tasks such as printing and scanning. Information extraction pipelines aim to retrieve valuable knowledge from documents automatically, and here too the extractor used depends on the element type. Machine learning (ML) techniques may be applied to decompose a document into element types: text, images, equations, charts, and diagrams. Regardless of the training regime (supervised or unsupervised), training data is required. One option is to obtain documents from the Internet, but this raises several problems: the documents often lack a permissive license, the data is unbalanced (e.g., slides containing only text elements), and it is difficult to extract precise annotations for training ML models from raw documents. This disclosure presents a synthetic data generator that creates a diverse set of documents based on randomized template formats. The focus here is on slide presentations.
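The idea of a randomized template can be illustrated with a minimal sketch. It is not the disclosed implementation: the grid-based layout, the uniform choice of element type, and all names (`Element`, `random_template`, `generate_slide`, `generate_dataset`) are assumptions made for illustration only. Each slide is split into a random grid, every cell is assigned one of the element types listed above, and the cell's bounding box serves as the annotation.

```python
import random
from dataclasses import dataclass

# Element types named in the abstract.
ELEMENT_TYPES = ["text", "image", "equation", "chart", "diagram"]


@dataclass
class Element:
    kind: str    # one of ELEMENT_TYPES
    box: tuple   # (x0, y0, x1, y1) in normalized slide coordinates


def random_template(rng, max_rows=3, max_cols=3):
    """Hypothetical template: split the slide into a random rows x cols grid."""
    rows = rng.randint(1, max_rows)
    cols = rng.randint(1, max_cols)
    return [
        (c / cols, r / rows, (c + 1) / cols, (r + 1) / rows)
        for r in range(rows)
        for c in range(cols)
    ]


def generate_slide(rng, margin=0.02):
    """Fill each template cell with a randomly chosen element type."""
    elements = []
    for x0, y0, x1, y1 in random_template(rng):
        # Uniform sampling is an assumption; it keeps classes roughly balanced.
        kind = rng.choice(ELEMENT_TYPES)
        box = (x0 + margin, y0 + margin, x1 - margin, y1 - margin)
        elements.append(Element(kind, box))
    return elements


def generate_dataset(n_slides, seed=0):
    """Return a list of slides, each a list of annotated elements."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [generate_slide(rng) for _ in range(n_slides)]
```

Because the layout is generated rather than recovered from a raw document, each element's type and bounding box are known exactly. A real generator would additionally render content into each box.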

Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.

Recommended Citation

INC, HP, "DOCUMENT GENERATOR BASED ON RANDOMIZED TEMPLATES", Technical Disclosure Commons, (January 24, 2021)
https://www.tdcommons.org/dpubs_series/3991

Technical Disclosure Commons

Defensive Publications Series

DOCUMENT GENERATOR BASED ON RANDOMIZED TEMPLATES
