Abstract
Techniques are described for low-latency pointwise content ranking using a fine-tuned student language model. A ranking service constructs per-candidate language-model inputs that include user-context signals and candidate item context, and sends the inputs to an inference server in coordinated batches. The inference server returns output values only for the designated positive and negative label tokens rather than for the full vocabulary, reducing accelerator-to-host transfer. A continuous relevance score is computed using token-probability normalization, score = P(pos)/(P(pos)+P(neg)), yielding a calibrated value in [0,1] for thresholding and ranking. Batch processing may include reuse of cached key/value states for user-context prefixes shared across candidates. The techniques enable scoring hundreds of candidates within tight latency and serving-cost constraints for online ranking surfaces such as feeds and search results.
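The token-probability normalization in the abstract can be illustrated with a minimal sketch. It assumes that the "output values" returned by the inference server are raw logits for the two label tokens; the abstract does not specify this. Under that assumption the softmax denominator cancels, so score = P(pos)/(P(pos)+P(neg)) reduces to a logistic function of the logit difference. The names relevance_score and rank_candidates and the candidate identifiers are hypothetical, chosen only for illustration.

```python
import math


def relevance_score(pos_logit: float, neg_logit: float) -> float:
    """Return P(pos) / (P(pos) + P(neg)) given the two label-token logits.

    Assumption: inputs are raw logits. The shared softmax denominator
    cancels, leaving sigmoid(pos_logit - neg_logit).
    """
    diff = pos_logit - neg_logit
    # Branch on the sign so that exp() is only called on non-positive values,
    # which avoids overflow for large logit gaps.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rank_candidates(label_logits: dict) -> list:
    """Score each candidate and sort by descending relevance.

    label_logits maps a candidate id to a (pos_logit, neg_logit) pair,
    i.e. the two values returned per candidate by the inference server.
    """
    scored = [
        (cid, relevance_score(pos, neg))
        for cid, (pos, neg) in label_logits.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical batch of three candidates for one user context.
    batch = {"item_a": (2.0, 0.0), "item_b": (-1.0, 1.5), "item_c": (0.3, 0.3)}
    for cid, score in rank_candidates(batch):
        print(f"{cid}: {score:.4f}")
```

Because only two values per candidate are needed, the score can be computed on the host from a very small payload, which is consistent with the reduced accelerator-to-host transfer described above. If the server instead returned probabilities, the score would be computed directly as the ratio given in the abstract.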
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Low-Latency Pointwise Language Model Ranker with Token-Probability Normalization and Coordinated Batch Inference for Online Content Ranking", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10700