Abstract

This disclosure describes techniques that leverage the analysis, reasoning, and classification abilities of large language models (LLMs) to cluster text. An initial cluster is formed and represented by a text description (rather than a conventional vector in a high-dimensional space). Guided by controlled generation, the LLM operates in a loop to classify each successive input text into a new or an existing cluster. A text from the input dataset is assigned to an existing cluster if its description is similar to that cluster's description. In that case, the LLM can generate an updated cluster description that merges the existing description with the newly processed text. A text is assigned to a new cluster if its description is sufficiently dissimilar to the descriptions of all existing clusters. Relatively small clusters that remain in the long tail can be re-classified into existing clusters. The techniques yield more accurate and more stable clustering, eliminate the challenge of determining an optimal number of clusters for unsupervised text clustering, and make the generated clusters interpretable.
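
The loop described above can be illustrated with a minimal Python sketch. It is not the implementation from the disclosure: the three callables `describe`, `choose`, and `merge` are hypothetical stand-ins for LLM calls, and the `toy_*` functions replace them with simple word-overlap heuristics so that the example runs without a model. The `allow_new` flag is likewise an assumption; it mimics how controlled generation could restrict the model's answer to a fixed option set (an existing cluster index, plus "new cluster" only when permitted). The `min_size` threshold for the long tail is also illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Cluster:
    description: str  # a text description, not an embedding vector
    members: List[str] = field(default_factory=list)


# Hypothetical LLM-backed interfaces (assumptions for illustration only).
Describe = Callable[[str], str]                          # text -> initial description
Choose = Callable[[str, List[str], bool], Optional[int]]  # -> cluster index or None (new)
Merge = Callable[[str, str], str]                        # (description, text) -> description


def cluster_texts(texts: List[str], describe: Describe, choose: Choose,
                  merge: Merge, min_size: int = 2) -> List[Cluster]:
    clusters: List[Cluster] = []
    for text in texts:
        # Ask for an existing cluster or "new"; the first text always starts a cluster.
        idx = choose(text, [c.description for c in clusters], True) if clusters else None
        if idx is None:
            clusters.append(Cluster(describe(text), [text]))
        else:
            clusters[idx].description = merge(clusters[idx].description, text)
            clusters[idx].members.append(text)

    # Long-tail pass: re-classify members of small clusters into the larger ones.
    large = [c for c in clusters if len(c.members) >= min_size]
    small = [c for c in clusters if len(c.members) < min_size]
    if not large:
        return clusters
    for c in small:
        for text in c.members:
            # allow_new=False restricts the answer to existing clusters only.
            idx = choose(text, [k.description for k in large], False)
            large[idx].description = merge(large[idx].description, text)
            large[idx].members.append(text)
    return large


# --- Toy stand-ins so the sketch runs without an LLM (word overlap only) ---
def _words(s: str) -> set:
    return set(s.lower().split())


def toy_describe(text: str) -> str:
    return text


def toy_choose(text: str, descriptions: List[str], allow_new: bool) -> Optional[int]:
    scores = [len(_words(text) & _words(d)) for d in descriptions]
    best = max(range(len(scores)), key=scores.__getitem__)
    if allow_new and scores[best] < 2:
        return None  # too dissimilar to every existing cluster
    return best


def toy_merge(description: str, text: str) -> str:
    return " ".join(sorted(_words(description) | _words(text)))


texts = [
    "refund not received",
    "refund still not received",
    "app crashes on login",
    "app crashes on startup",
    "refund delayed",
]
result = cluster_texts(texts, toy_describe, toy_choose, toy_merge)
print([len(c.members) for c in result])  # [3, 2]
```

In this toy run, "refund delayed" first forms a singleton cluster because it shares only one word with the refund cluster, and the long-tail pass then folds it into that cluster. With a real LLM, `choose` would be a prompt whose output is constrained to the allowed options, and `merge` would ask the model to rewrite the cluster description.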

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Kuligin, Leonid and Tschochohei, Max, "Unsupervised Clustering with a Large Language Model Using Controlled Generation", Technical Disclosure Commons, (September 09, 2025)
https://www.tdcommons.org/dpubs_series/8569

Technical Disclosure Commons

Defensive Publications Series

Unsupervised Clustering with a Large Language Model Using Controlled Generation
