Inventor(s)

D Shin

Abstract

On-server vision transformer (ViT) models can suffer from substantial send-to-inference latency because the entire image is encoded at the client and transmitted in a single upload; server-side ViT inference cannot begin until the full image arrives. This disclosure describes techniques that leverage the patch-based token structure of vision transformers to stream encoded patches of an image (rather than the fully encoded image) from a client to a server. Effectively, patching and encoding are pipelined, eliminating the client-side delay of compressing the full image for a single send to the server. Early stopping can be used to accelerate classification. At the server, patches are received on the fly and are tokenized and queued at the input layer of the ViT in a just-in-time (JIT) manner. The JIT tokenization enables the attention computation to be updated as each new patch arrives, amounting to a substantial reduction in inference wait time.
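
To make the pipeline concrete, the following is a minimal NumPy sketch of the idea, not an implementation from the disclosure: the weights are random and untrained, the names client_stream_patches and server_jit_inference are hypothetical, PATCH, DIM, N_CLASSES, and THRESHOLD are illustrative choices, and for simplicity the server recomputes attention over all tokens received so far rather than updating it incrementally as a production system presumably would.

    # Illustrative sketch only: random, untrained weights stand in for a real ViT.
    import numpy as np

    rng = np.random.default_rng(0)
    PATCH, DIM, N_CLASSES, THRESHOLD = 16, 64, 10, 0.9  # assumed values

    W_embed = rng.standard_normal((PATCH * PATCH * 3, DIM)) * 0.02
    W_q = rng.standard_normal((DIM, DIM)) * 0.02
    W_k = rng.standard_normal((DIM, DIM)) * 0.02
    W_v = rng.standard_normal((DIM, DIM)) * 0.02
    W_cls = rng.standard_normal((DIM, N_CLASSES)) * 0.02
    cls_token = rng.standard_normal((1, DIM)) * 0.02

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def client_stream_patches(image):
        """Client side: encode and yield one patch at a time, so patching,
        encoding, and sending are pipelined instead of a single bulk upload."""
        h, w, _ = image.shape
        for i in range(0, h, PATCH):
            for j in range(0, w, PATCH):
                patch = image[i:i + PATCH, j:j + PATCH].reshape(-1)
                yield patch @ W_embed  # per-patch encoding

    def server_jit_inference(patch_stream):
        """Server side: queue each arriving patch as a token just-in-time,
        update attention over the tokens received so far, and stop early
        once the classifier is confident enough."""
        tokens = cls_token.copy()
        for n, tok in enumerate(patch_stream, start=1):
            tokens = np.vstack([tokens, tok])       # JIT-queue the new token
            q, k, v = tokens @ W_q, tokens @ W_k, tokens @ W_v
            attn = softmax(q @ k.T / np.sqrt(DIM)) @ v
            probs = softmax(attn[0] @ W_cls)        # classify from the [CLS] slot
            if probs.max() >= THRESHOLD:            # early stopping
                return probs.argmax(), n
        return probs.argmax(), n

    image = rng.standard_normal((224, 224, 3))
    label, n_patches = server_jit_inference(client_stream_patches(image))
    print(f"predicted class {label} after {n_patches} patches")

With trained weights, a confident prediction could be reached before all patches arrive, which is where the early-stopping check pays off; with the random weights above, the loop simply runs through every patch.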

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
