Inventor(s)

Abstract

Techniques are described for low-latency pointwise content ranking using a fine-tuned student language model. A ranking service constructs per-candidate language-model inputs that include user-context signals and candidate item context, and sends the inputs to an inference server in coordinated batches. The inference server returns output values for designated positive and negative label tokens, reducing accelerator-to-host transfer. A continuous relevance score is computed using token-probability normalization, score = P(pos)/(P(pos)+P(neg)), yielding a calibrated value in [0,1] for thresholding and ranking. Batch processing may include reuse of cached key/value states for shared user-context prefixes. The techniques enable scoring hundreds of candidates within tight latency and serving-cost constraints for online ranking surfaces such as feeds and search results.
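The score normalization described in the abstract can be illustrated with a minimal sketch. It assumes the inference server's "output values" are raw logits for the two label tokens. The abstract does not specify this, and the function names `relevance_score` and `rank_candidates` are hypothetical. Under a softmax, the full-vocabulary normalizer cancels in P(pos)/(P(pos)+P(neg)), so the score reduces to a logistic function of the logit difference.

```python
import math


def relevance_score(pos_logit: float, neg_logit: float) -> float:
    """Return P(pos) / (P(pos) + P(neg)) given two label-token logits.

    Assumption: inputs are raw logits. The softmax denominator over the
    vocabulary cancels in the ratio, leaving sigmoid(pos_logit - neg_logit).
    """
    diff = pos_logit - neg_logit
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def rank_candidates(
    label_logits: dict[str, tuple[float, float]],
) -> list[tuple[str, float]]:
    """Score each candidate pointwise and sort by descending score.

    label_logits maps a hypothetical candidate id to (pos_logit, neg_logit).
    """
    scored = [
        (candidate_id, relevance_score(pos, neg))
        for candidate_id, (pos, neg) in label_logits.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For example, `rank_candidates({"a": (0.0, 1.0), "b": (2.0, 0.0)})` places `"b"` ahead of `"a"`. Because only two values per candidate are needed, the server can return a pair of numbers rather than a full vocabulary distribution, which is consistent with the reduced accelerator-to-host transfer the abstract mentions.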

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
