Abstract

TF-IDF (Term Frequency - Inverse Document Frequency) based information retrieval approaches measure the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. However, this measure is based solely on counts of individual words in the document or corpus, without consideration of higher-level semantics. Attribute extraction is another approach for identifying salient information about a document; however, the extracted attributes are often multi-word phrases and carry no indication of their salience for the document in question. This disclosure describes a simple unsupervised learning approach called Sem-TF-IDF that leverages instruction-tuned large language models (IT-LLMs) to identify salient pieces of information related to a document or an entity in the semantic space. The approach modifies the frequency definitions of classic TF-IDF to be based on topics instead of terms. Topics can include terms (used interchangeably with “words” in this document), phrases, and, broadly speaking, any piece of information. Topics within a document can be identified by providing the document to an IT-LLM along with a suitable prompt and/or by employing existing attribute extraction approaches. Thresholds and peer document or entity groups appropriate for the task can be used to filter the topics and, optionally, to summarize the corresponding information as relevant to the application and user needs. The techniques generalize the classic TF-IDF approach to the higher-level semantic space and are suitable for any information retrieval application in digital maps, search engines, etc.
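The core idea of replacing term frequencies with topic frequencies can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the `extract_topics` function is a hypothetical stand-in for an IT-LLM call (or an attribute extraction pipeline), reduced here to naive tokenization so the scoring logic is runnable.

```python
import math
from collections import Counter

def extract_topics(document):
    # Hypothetical stand-in for querying an IT-LLM with a suitable prompt
    # (or running attribute extraction). Here, whitespace-delimited lowercase
    # tokens play the role of "topics" purely for illustration.
    return document.lower().split()

def sem_tf_idf(documents):
    """Score topics per document using TF-IDF over extracted topics."""
    doc_topics = [extract_topics(d) for d in documents]
    n_docs = len(documents)
    # Topic document frequency: how many documents mention each topic.
    df = Counter()
    for topics in doc_topics:
        df.update(set(topics))
    scores = []
    for topics in doc_topics:
        tf = Counter(topics)
        total = len(topics)
        scores.append({
            topic: (count / total) * math.log(n_docs / df[topic])
            for topic, count in tf.items()
        })
    return scores
```

With a real topic extractor in place of the stub, multi-word phrases and other higher-level pieces of information would flow through the same frequency computation unchanged; the resulting scores can then be filtered with task-appropriate thresholds over a chosen peer group of documents or entities.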

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
