Abstract
An n-gram is a contiguous text fragment of n items, and the set of n-grams that a document contains serves as a summary of its content. N-grams are used in document processing applications such as indexing, clustering, and machine learning. This disclosure describes techniques to efficiently extract n-grams of a given length from a grammar specified as a nondeterministic finite automaton (NFA) with ε-moves. The algorithm described here computes the n-grams of length N from the grammar using O(N) graph traversals.
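To make the problem statement concrete, the following is a minimal illustrative sketch in Python and is not the algorithm of the disclosure itself. It assumes a hypothetical NFA representation (a dictionary mapping each state to a list of `(symbol, target)` pairs, with `None` standing for an ε-move) and assumes that every state lies on some accepting path, so that every label sequence read along a path is a fragment of an accepted string. The names `EPSILON`, `epsilon_closure`, `extract_ngrams`, and `EXAMPLE_NFA` are introduced here for illustration only. After precomputing ε-closures, the sketch makes n passes over the transitions, each pass extending the fragments readable from every state by one symbol.

```python
EPSILON = None  # assumed marker for an ε-move in this sketch


def epsilon_closure(transitions, state):
    """Return all states reachable from `state` using only ε-moves."""
    seen = {state}
    stack = [state]
    while stack:
        current = stack.pop()
        for symbol, target in transitions.get(current, []):
            if symbol is EPSILON and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def extract_ngrams(transitions, n):
    """Return the set of n-grams (as tuples of symbols) readable in the NFA.

    Assumes every state lies on some accepting path (hypothetical
    precondition of this sketch, not stated in the source).
    """
    # Collect every state, including those that only appear as targets.
    states = set(transitions)
    for edges in transitions.values():
        for _, target in edges:
            states.add(target)

    closures = {s: epsilon_closure(transitions, s) for s in states}

    # suffixes[s] holds the k-grams readable starting at state s; k starts at 0.
    suffixes = {s: {()} for s in states}
    for _ in range(n):
        extended = {s: set() for s in states}
        for s in states:
            for mid in closures[s]:
                for symbol, target in transitions.get(mid, []):
                    if symbol is EPSILON:
                        continue  # ε-moves are already covered by the closure
                    for tail in suffixes[target]:
                        extended[s].add((symbol,) + tail)
        suffixes = extended

    result = set()
    for grams in suffixes.values():
        result |= grams
    return result


# Hypothetical grammar accepting "abd" and "acd", with one ε-move (1 -> 2).
EXAMPLE_NFA = {
    0: [("a", 1)],
    1: [(EPSILON, 2)],
    2: [("b", 3), ("c", 3)],
    3: [("d", 4)],
}
```

For the example grammar, `extract_ngrams(EXAMPLE_NFA, 2)` yields the four 2-grams ab, ac, bd, and cd, and `extract_ngrams(EXAMPLE_NFA, 3)` yields abd and acd. The sketch favors clarity over efficiency: the per-state sets can grow large for big alphabets or large n.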
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Boulgakov, Alexandre, "Efficient Extraction of n-grams From a Grammar", Technical Disclosure Commons, (October 29, 2020)
https://www.tdcommons.org/dpubs_series/3721