D Shin


Conventional compression techniques for text are based on the typical frequencies of individual letters within the text, independent of higher-level semantics. This disclosure describes a compression scheme for text data in which a conventional coder/decoder (codec) is augmented with an additional semantic codec to achieve greater compression and throughput. The semantic codec can be implemented with a pre-trained large language model (LLM). Text data is first input to a semantic coder for semantics-based compression: codes within the codebook are reranked based on selective erasure by the encoder LLM, and once the codebook is established, portions of the text that can be recovered by a decoder LLM are erased. The semantically compressed data is then encoded as usual via conventional techniques. At the receiver, the data is first decoded via conventional techniques to recover the semantically coded text. The semantically coded text is further decoded using a semantic decoder that recovers the original text by inferring, based on semantics, the portions that were erased prior to transmission.
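The erase-then-compress pipeline above can be sketched in miniature. The sketch below is an illustration, not the disclosed implementation: a tiny deterministic bigram table stands in for the encoder/decoder LLM (both sides share the same model, so any token the model predicts correctly can be erased before conventional coding and regenerated after it), and `zlib` stands in for the conventional codec. The `ERASED` placeholder, the `predict` helper, and all function names are hypothetical.

```python
import zlib

# Stand-in for the shared pre-trained LLM: a bigram table known to both the
# encoder and the decoder. A token is "semantically recoverable" if the model
# predicts it exactly from the two preceding tokens.
SHARED_BIGRAMS = {
    ("the", "quick"): "brown",
    ("quick", "brown"): "fox",
    ("brown", "fox"): "jumps",
}

ERASED = "\x00"  # placeholder marking an erased (recoverable) token


def predict(prev2, prev1):
    """Deterministic next-token prediction shared by encoder and decoder."""
    return SHARED_BIGRAMS.get((prev2, prev1))


def semantic_encode(text):
    """Erase every token the shared model can infer from its context."""
    tokens = text.split()
    out = tokens[:2]  # keep a seed context the predictor can condition on
    for i in range(2, len(tokens)):
        if predict(tokens[i - 2], tokens[i - 1]) == tokens[i]:
            out.append(ERASED)          # recoverable -> erase it
        else:
            out.append(tokens[i])       # not recoverable -> transmit it
    return " ".join(out)


def semantic_decode(coded):
    """Regenerate erased tokens by re-running the shared model."""
    tokens = coded.split()
    for i in range(2, len(tokens)):
        if tokens[i] == ERASED:
            tokens[i] = predict(tokens[i - 2], tokens[i - 1])
    return " ".join(tokens)


def compress(text):
    """Semantic erasure followed by a conventional codec (zlib here)."""
    return zlib.compress(semantic_encode(text).encode())


def decompress(blob):
    """Conventional decode first, then semantic recovery of erased tokens."""
    return semantic_decode(zlib.decompress(blob).decode())
```

Because the predictor is deterministic and identical on both sides, decoding regenerates exactly the tokens the encoder erased, so the round trip is lossless; in the disclosed scheme the LLM plays this role at far higher recovery rates than a bigram table.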

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.