Abstract
Automated text extraction from paginated documents, such as portable document format (PDF) files, can be problematic when layout-agnostic parsers capture non-essential content like headers and footers, which can introduce noise and degrade data quality for downstream applications. A described technique may address this through a multi-stage workflow that combines heuristic analysis with large language models (LLMs). For example, a process can begin with a heuristic segmentation stage to identify potential header, footer, and main content regions based on layout cues, such as text position and font size. This initial segmentation may then be provided as a structured hint to a multimodal LLM, which can analyze both the visual page and the text to refine content boundaries. The workflow can also include an iterative refinement loop with a critic LLM and a consolidation stage to merge text across page breaks. This approach can produce a more continuous stream of core content useful for applications like knowledge management and retrieval-augmented generation systems.
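As an illustration of the heuristic segmentation stage described above, the following is a minimal sketch rather than the disclosed implementation. It assumes text blocks have already been extracted with vertical positions and font sizes. The names `TextBlock`, `segment_page`, and `to_hint`, and the thresholds `margin_ratio` and `small_font_ratio`, are hypothetical choices made for this example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TextBlock:
    """One extracted text block (hypothetical structure for this sketch)."""
    text: str
    top: float        # distance from the top of the page, in points
    bottom: float     # distance from the top of the page, in points
    font_size: float  # dominant font size of the block, in points


def segment_page(blocks, page_height, margin_ratio=0.08, small_font_ratio=0.9):
    """Label each block 'header', 'footer', or 'main' from layout cues.

    Assumed heuristic: a block is a header or footer candidate when it lies
    entirely inside the top or bottom margin band AND its font is smaller
    than the page's median font size. Everything else is treated as main.
    """
    if not blocks:
        return []
    body_size = median(b.font_size for b in blocks)
    header_limit = page_height * margin_ratio
    footer_limit = page_height * (1 - margin_ratio)

    labeled = []
    for b in blocks:
        is_small = b.font_size <= body_size * small_font_ratio
        if b.bottom <= header_limit and is_small:
            label = "header"
        elif b.top >= footer_limit and is_small:
            label = "footer"
        else:
            label = "main"
        labeled.append((label, b))
    return labeled


def to_hint(labeled):
    """Serialize the segmentation as a structured hint for a later LLM stage."""
    return [
        {"region": label, "text": b.text, "top": b.top, "bottom": b.bottom}
        for label, b in labeled
    ]
```

The output of `to_hint` is only a candidate segmentation. In the workflow described above, it would be passed to the multimodal LLM as a hint to be confirmed or corrected, not treated as a final decision.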
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Kuligin, Leonid, "Layout-Aware Text Extraction Using Heuristic Segmentation and LLM-Based Refinement", Technical Disclosure Commons, (January 28, 2026)
https://www.tdcommons.org/dpubs_series/9234