Inventor(s)

NA

Abstract

This disclosure describes techniques that leverage the analysis, reasoning, and classification abilities of large language models (LLMs) to cluster text. An initial cluster is formed and is represented by a text description rather than by a conventional vector in a high-dimensional space. Guided by controlled generation, the LLM operates in a loop to classify each successive input text into a new or an existing cluster. A text from the input dataset is assigned to an existing cluster if its description is similar to that cluster's description; in that case, the LLM can generate an updated cluster description that merges in the newly processed text. A text is assigned to a new cluster if its description is sufficiently dissimilar to the descriptions of all existing clusters. Relatively small clusters that remain in the long tail can be re-classified into existing clusters. The techniques result in more accurate and more stable clustering, remove the challenge of determining an optimal number of clusters for unsupervised text clustering, and make the generated clusters interpretable.
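
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosure's implementation: the names `cluster_texts`, `classify`, `describe`, `merge`, and `min_size` are hypothetical, and the three `toy_*` functions are simple word-overlap stand-ins for what would in practice be LLM calls with constrained (controlled) output.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Cluster:
    description: str                      # text description, not a vector
    members: List[str] = field(default_factory=list)


def cluster_texts(
    texts: List[str],
    classify: Callable[[str, List[str]], Optional[int]],  # hypothetical LLM call
    describe: Callable[[str], str],                        # hypothetical LLM call
    merge: Callable[[str, str], str],                      # hypothetical LLM call
    min_size: int = 2,                                     # assumed long-tail cutoff
) -> List[Cluster]:
    clusters: List[Cluster] = []
    for text in texts:
        # Ask which existing cluster description matches, or None for "new".
        idx = classify(text, [c.description for c in clusters])
        if idx is None:
            clusters.append(Cluster(describe(text), [text]))
        else:
            target = clusters[idx]
            target.description = merge(target.description, text)
            target.members.append(text)

    # Long tail: try to re-classify members of small clusters into larger ones.
    large = [c for c in clusters if len(c.members) >= min_size]
    small = [c for c in clusters if len(c.members) < min_size]
    if not large:
        return clusters
    kept = list(large)
    for c in small:
        leftover = []
        for text in c.members:
            idx = classify(text, [k.description for k in large])
            if idx is None:
                leftover.append(text)     # still dissimilar, keep it separate
            else:
                large[idx].members.append(text)
        if leftover:
            kept.append(Cluster(c.description, leftover))
    return kept


# Toy stand-ins so the sketch runs without an LLM (assumptions, not the method).
def _words(s: str) -> set:
    return set(s.lower().split())


def toy_describe(text: str) -> str:
    return " ".join(sorted(_words(text)))


def toy_merge(description: str, text: str) -> str:
    return " ".join(sorted(_words(description) | _words(text)))


def toy_classify(text: str, descriptions: List[str]) -> Optional[int]:
    if not descriptions:
        return None
    overlaps = [len(_words(text) & _words(d)) for d in descriptions]
    best = max(range(len(descriptions)), key=lambda i: overlaps[i])
    return best if overlaps[best] >= 2 else None   # assumed similarity threshold


if __name__ == "__main__":
    sample = [
        "refund for order",
        "refund my order please",
        "app crashes on login",
        "login crashes the app",
        "weather today",
    ]
    for c in cluster_texts(sample, toy_classify, toy_describe, toy_merge):
        print(len(c.members), "|", c.description)
```

Because the number of clusters emerges from the classify-or-create decisions, no cluster count is passed in; the cluster descriptions remain human-readable text throughout, which is where the interpretability comes from.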

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
