Abstract

N-grams summarize the content of a document as the set of fixed-length text fragments it contains. They are used in document-processing applications such as indexing, clustering, and machine learning. This disclosure describes techniques to efficiently extract n-grams of a given length from a grammar specified as a nondeterministic finite automaton (NFA) with ε-moves. The algorithm described here uses O(N) graph traversals to compute the n-grams of length N from a grammar.
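
To make the problem concrete, the following is a minimal sketch of n-gram extraction from an NFA with ε-moves. It is not the algorithm of the disclosure, whose details are not given in this abstract; it is a simple layered enumeration that makes one pass over the edges per n-gram position. The edge-list representation, the names `EPS`, `epsilon_closure`, `ngrams`, and `example_edges`, and the assumption that every state lies on some accepting path (so that every path fragment is part of an accepted string) are all illustrative choices, not taken from the source.

```python
from collections import defaultdict

EPS = None  # hypothetical marker for an ε-move (an edge that consumes no symbol)


def epsilon_closure(state, edges):
    """Return all states reachable from `state` using only ε-moves."""
    seen = {state}
    stack = [state]
    while stack:
        s = stack.pop()
        for label, target in edges.get(s, ()):
            if label is EPS and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def ngrams(edges, n):
    """Return the set of length-n strings readable along some path.

    `edges` maps a state to a list of (label, target) pairs.
    Assumption: every state is on some accepting path, so each
    path fragment is a substring of an accepted string.
    """
    states = set(edges) | {t for outs in edges.values() for _, t in outs}
    closure = {s: epsilon_closure(s, edges) for s in states}

    # partial[s] holds the strings of the current length that some
    # path can have just finished reading on arrival at state s.
    partial = {s: {""} for s in states}
    for _ in range(n):  # one pass over the edges per n-gram position
        nxt = defaultdict(set)
        for s, grams in partial.items():
            for u in closure[s]:  # skip over ε-moves for free
                for label, target in edges.get(u, ()):
                    if label is not EPS:
                        for g in grams:
                            nxt[target].add(g + label)
        partial = nxt
    return set().union(*partial.values())


# Hypothetical grammar accepting "abd" and "acd", with an ε-branch at state 1.
example_edges = {
    0: [("a", 1)],
    1: [(EPS, 2), (EPS, 3)],
    2: [("b", 4)],
    3: [("c", 4)],
    4: [("d", 5)],
}

bigrams = ngrams(example_edges, 2)
trigrams = ngrams(example_edges, 3)
```

For this example grammar, `bigrams` contains "ab", "ac", "bd", and "cd", and `trigrams` contains "abd" and "acd". The sketch stores explicit string sets per state, which can grow quickly on grammars with many branches; it is meant only to show what the extraction computes.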

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.