Abstract

Name matching is a fundamental component of search and retrieval systems, where inputs often exhibit spelling variations due to pronunciation differences, transliteration across languages, typographical errors, and noise introduced by OCR or speech-based inputs. Existing approaches either rely on character-level similarity, which over-penalizes phonetic variations, or phonetic encodings, which collapse multiple variants into identical representations and fail to capture graded similarity.

The present invention introduces a phonetic-aware similarity learning framework in which similarity is modeled as a bounded distance constraint problem. Input names are segmented using vowel-anchored phonetic tokenization and mapped into a continuous embedding space. During training, pairs of names are associated with predefined similarity bands that represent different degrees of variation. A band-constrained contrastive loss function enforces that embedding distances fall within these bands for similar pairs while maintaining separation for dissimilar pairs.
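The abstract does not specify how vowel-anchored segmentation is computed, but a minimal illustrative sketch would split a name into segments anchored on vowel runs, so that phonetically similar spellings yield aligned token sequences. The regex and tokenizer below are hypothetical, not taken from the invention:

```python
# Hypothetical sketch of vowel-anchored phonetic tokenization:
# each token is an optional consonant run followed by a vowel run,
# with any trailing consonant run kept as its own token.
import re

VOWELS = "aeiouy"

def vowel_anchored_tokens(name: str) -> list[str]:
    pattern = rf"[^{VOWELS}]*[{VOWELS}]+|[^{VOWELS}]+"
    return re.findall(pattern, name.lower())

print(vowel_anchored_tokens("katherine"))  # ['ka', 'the', 'ri', 'ne']
print(vowel_anchored_tokens("catherine"))  # ['ca', 'the', 'ri', 'ne']
```

Under this scheme the spelling variants "katherine" and "catherine" differ in only one token, which is the kind of alignment the embedding model can then exploit.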

Unlike conventional contrastive learning approaches that only enforce relative proximity, the proposed method encodes multiple levels of similarity severity through explicit distance constraints. This enables learning of a calibrated and interpretable similarity space in which distances directly reflect variation severity, thereby improving ranking, disambiguation, and retrieval of names.
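One way to realize such explicit distance constraints is a two-sided hinge per similarity band: the loss is zero when the embedding distance falls inside the band assigned to the pair, and grows linearly outside it. The band names and widths below are illustrative assumptions, not values from the source:

```python
# Hypothetical sketch of a band-constrained contrastive loss.
# Each pair of names carries a band [lo, hi]; distances inside the
# band incur no loss, distances outside are penalized linearly.
import math

# Illustrative bands, ordered by variation severity (assumed values).
BANDS = {
    "typo":       (0.0, 0.2),      # minor spelling variation
    "phonetic":   (0.2, 0.5),      # pronunciation/transliteration variant
    "dissimilar": (1.0, math.inf), # unrelated names: stay at least 1.0 apart
}

def band_loss(dist: float, band: str) -> float:
    lo, hi = BANDS[band]
    # Two-sided hinge: penalize being closer than lo or farther than hi.
    below = max(lo - dist, 0.0)
    above = max(dist - hi, 0.0) if math.isfinite(hi) else 0.0
    return below + above

print(band_loss(0.1, "typo"))        # 0.0 — inside the band
print(band_loss(0.4, "dissimilar"))  # 0.6 — too close for a dissimilar pair
```

Unlike a standard contrastive margin, which only pushes dissimilar pairs apart, each band bounds the distance from both sides, which is what makes the resulting distances interpretable as variation severity.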

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
