D Shin


On-server vision transformer (ViT) models can suffer from substantial send-to-inference latency because the entire image is encoded at the client before transmission, and server-side ViT inference does not begin until the upload completes. This disclosure describes techniques that leverage the patch-based token structure of vision transformers to stream encoded patches of an image (rather than the fully encoded image) from a client to a server. Effectively, patching and encoding are pipelined, which eliminates the client-side delay of compressing the full image for a single send to the server. Early stopping can be used to accelerate classification. At the server, patches are received on the fly and are tokenized and queued at the input layer of the ViT in a just-in-time (JIT) manner. JIT tokenization allows the attention computation to be updated as each new patch arrives, substantially reducing the wait time for inference.
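The pipeline described above can be sketched as a producer-consumer pair: the client encodes and streams one patch at a time, while the server tokenizes each patch as it arrives rather than waiting for the full image. The sketch below is a minimal illustration using Python threads and a queue; the patch size, the flatten-style "encoding," and all function names are illustrative assumptions, not the disclosed implementation (a real system would compress patches and feed the growing token sequence into a ViT's attention layers).

```python
import queue
import threading

PATCH_SIZE = 4  # hypothetical 4x4 patches for illustration

def split_into_patches(image, patch_size=PATCH_SIZE):
    """Split a 2-D image (list of rows) into row-major square patches."""
    patches = []
    for r in range(0, len(image), patch_size):
        for c in range(0, len(image[0]), patch_size):
            patch = [row[c:c + patch_size] for row in image[r:r + patch_size]]
            patches.append(patch)
    return patches

def encode_patch(patch):
    """Stand-in for per-patch encoding/compression (here: flatten)."""
    return [px for row in patch for px in row]

def client_stream(image, channel):
    """Encode patches one at a time and send each immediately,
    instead of compressing the whole image for a single send."""
    for patch in split_into_patches(image):
        channel.put(encode_patch(patch))
    channel.put(None)  # end-of-image marker

def server_jit_tokenize(channel):
    """Receive patches on the fly; tokenize and queue each at the
    model input just-in-time, so attention over the tokens seen so
    far could be updated before the upload completes."""
    tokens = []
    while True:
        encoded = channel.get()
        if encoded is None:
            break
        tokens.append(tuple(encoded))  # token for the latest patch
        # a real ViT would update attention over `tokens` here,
        # and early stopping could end classification at this point
    return tokens

# usage: an 8x8 image yields four 4x4 patches streamed through the pipeline
image = [[r * 8 + c for c in range(8)] for r in range(8)]
channel = queue.Queue()
sender = threading.Thread(target=client_stream, args=(image, channel))
sender.start()
tokens = server_jit_tokenize(channel)
sender.join()
```

Because the queue hands each encoded patch to the server as soon as it is produced, tokenization overlaps with transmission; the per-image wait collapses from "encode all, send all, then infer" to roughly the time of the last patch in flight.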

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.