Abstract

Image tokenization divides an image into multiple patches and embeds each patch into a vector space, which allows large language models (LLMs) to answer queries about the image effectively. A limitation of current image tokenization techniques for screenshots is that patches are chosen without regard to user interface (UI) semantics, which results in tokens with low information efficiency and in UI elements being split across tokens. This disclosure describes techniques that use UI element trees to guide screenshot tokenization, leading to higher-quality screenshot tokens and better screenshot-based LLM inference. Given a screenshot and a corresponding UI element tree, screenshot tokenization is performed by recursively traversing the UI element tree, finding subtrees (children) whose size is under a threshold, allocating an image token for each such subtree, generating a screenshot for each such subtree, and transforming that screenshot into an embedding. The tokenized output can be used by a computer control agent or a virtual assistant to perform a task with reference to the user interface that the screenshot captures.
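The recursive traversal described above can be illustrated with a minimal sketch. It is not the disclosure's implementation, and it rests on several assumptions: "size" is taken to mean the pixel area of an element's bounding box, a leaf that is still over the threshold receives one token anyway because it cannot be split further, the screenshot is a plain grid of grayscale values, and `embed` is a placeholder for whatever vision encoder a real system would use. All names (`UINode`, `select_regions`, `crop`, `embed`, `tokenize`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class UINode:
    """Hypothetical UI element: a name, a bounding box, and child elements."""
    name: str
    box: Box
    children: List["UINode"] = field(default_factory=list)


def area(box: Box) -> int:
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def select_regions(node: UINode, max_area: int) -> List[Tuple[str, Box]]:
    """Recursively pick the subtrees that each receive one image token.

    Assumption: a subtree qualifies when its bounding-box area is under
    max_area. A leaf over the threshold is kept as a single region because
    there is nothing left to descend into.
    """
    if area(node.box) < max_area or not node.children:
        return [(node.name, node.box)]
    regions: List[Tuple[str, Box]] = []
    for child in node.children:
        regions.extend(select_regions(child, max_area))
    return regions


def crop(pixels: List[List[float]], box: Box) -> List[List[float]]:
    """Cut the sub-screenshot for one region out of the full screenshot."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


def embed(patch: List[List[float]]) -> List[float]:
    """Placeholder embedding (mean intensity); stands in for a real encoder."""
    flat = [value for row in patch for value in row]
    return [sum(flat) / len(flat)] if flat else [0.0]


def tokenize(pixels: List[List[float]], root: UINode,
             max_area: int) -> List[Tuple[str, List[float]]]:
    """Return one (element name, embedding) pair per selected subtree."""
    return [(name, embed(crop(pixels, box)))
            for name, box in select_regions(root, max_area)]


# Toy 8x8 screenshot and UI tree, for illustration only.
screenshot = [[float(x + y) for x in range(8)] for y in range(8)]
root = UINode("window", (0, 0, 8, 8), [
    UINode("toolbar", (0, 0, 8, 2), [
        UINode("back_button", (0, 0, 2, 2)),
        UINode("title", (2, 0, 8, 2)),
    ]),
    UINode("content", (0, 2, 8, 8)),
])
tokens = tokenize(screenshot, root, max_area=20)
```

In this toy tree, lowering `max_area` makes the traversal descend further, so the toolbar is tokenized as separate elements rather than as one region. The token boundaries therefore follow UI element boundaries instead of a fixed grid.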

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
