Abstract

Conventional text compression techniques rely on the typical frequencies of individual letters in the text, independent of higher-level semantics. This disclosure describes a compression scheme for text data in which a conventional coder/decoder (codec) is augmented with an additional semantic codec to achieve greater compression and throughput. The semantic codec can be implemented with a pre-trained large language model (LLM). Text data is first input to a semantic coder for semantic-based compression. Codes within the codebook are reranked based on selective erasure by the encoder LLM. Once the codebook is established, portions of the text that a decoder LLM can recover are erased. The semantically compressed text is then encoded as usual with conventional techniques. At the receiver, the data is first decoded with conventional techniques to recover the semantically coded text, which is then passed to a semantic decoder that recovers the original text by inferring, based on semantics, the portions that were erased prior to transmission.
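The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation: a small lookup table (`TOY_MODEL`, hypothetical) stands in for the pre-trained LLM shared by encoder and decoder, `zlib` stands in for the conventional codec, the `ERASED` marker is an assumed placeholder that must not occur in the input, and the codebook reranking step is omitted entirely.

```python
import zlib

# Hypothetical erasure marker; assumed never to appear in the input text.
ERASED = "\x00"

# Toy stand-in for the shared pre-trained LLM: previous word -> predicted next word.
# A real system would query an LLM here; this table is purely illustrative.
TOY_MODEL = {"the": "quick", "quick": "brown", "brown": "fox"}


def predict(prev_word):
    """Return the predicted next word, or None if there is no prediction."""
    return TOY_MODEL.get(prev_word)


def semantic_encode(text):
    """Erase every word that the shared predictor can recover from its predecessor."""
    assert ERASED not in text, "input must not contain the erasure marker"
    out, prev = [], None
    for word in text.split(" "):
        out.append(ERASED if predict(prev) == word else word)
        prev = word  # the decoder will have recovered this word, so contexts match
    return " ".join(out)


def semantic_decode(coded):
    """Fill each erased position using the same predictor as the encoder."""
    out, prev = [], None
    for word in coded.split(" "):
        if word == ERASED:
            word = predict(prev)
        out.append(word)
        prev = word
    return " ".join(out)


def compress(text):
    """Semantic coding first, then the conventional codec."""
    return zlib.compress(semantic_encode(text).encode("utf-8"))


def decompress(blob):
    """Conventional decoding first, then semantic decoding."""
    return semantic_decode(zlib.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    message = "the quick brown fox jumps over the lazy dog"
    assert decompress(compress(message)) == message
```

The sketch only works because encoder and decoder use the same deterministic predictor: a word is erased only when the decoder is guaranteed to infer it from the context it has already recovered. Whether erasure actually reduces the final size depends on the model and the conventional codec, so the sketch makes no claim about compression ratio.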

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Shin, D, "Better Text Compression Using a Large Language Model", Technical Disclosure Commons, (August 21, 2023)
https://www.tdcommons.org/dpubs_series/6155


Technical Disclosure Commons

Defensive Publications Series

Better Text Compression Using a Large Language Model
