Inventor(s)

Shashwat Razdan

Abstract

The ability of intelligent agents to perform complex, cross-application tasks may be constrained by a dependency on application-specific programming interfaces (APIs), which can limit an agent's ability to interact with arbitrary user interfaces and to adapt to dynamic visual changes. A system can employ a closed-loop perception-action cycle in which a multimodal intelligent agent analyzes visual data from the screen of a host device, such as a smartphone or computer, to predict a next action. A control application can translate this action into a sequence of low-level human interface device (HID) events. A peripheral hardware actuator device may receive these events and inject them into the host device's operating system by emulating a standard input peripheral, for example, a keyboard or a pointing device. This approach can provide an application-agnostic method of control, enabling an agent to perform visually grounded, multi-step tasks across a graphical user interface without a dependency on specific APIs or software integrations.
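
The following is a minimal, illustrative sketch of the closed-loop perception-action cycle described above. All names (HidEvent, predict_next_action, send_to_actuator, the action schema) are assumptions introduced here for illustration and are not part of the published disclosure; the stubs mark where screen capture, model inference, and the transport to the hardware actuator would plug in.

```python
"""Sketch of a perception-action loop driving a HID-emulating actuator.

Assumed pipeline: capture screen -> multimodal agent predicts a high-level
action -> control application translates it into low-level HID events ->
peripheral actuator injects the events into the host OS.
"""
from dataclasses import dataclass
from typing import List


@dataclass
class HidEvent:
    """A single low-level human interface device event."""
    kind: str      # e.g. "mouse_move", "mouse_click", "key_press"
    payload: dict  # event-specific fields (coordinates, key code, ...)


def capture_screen() -> bytes:
    """Capture the host device's screen as an image (platform-specific stub)."""
    raise NotImplementedError


def predict_next_action(screenshot: bytes, goal: str) -> dict:
    """Ask a multimodal agent for the next UI action (model-inference stub).

    A real implementation would send the screenshot and the task goal to a
    vision-language model and parse a structured response, for example
    {"type": "tap", "x": 540, "y": 1200} or {"type": "done"}.
    """
    raise NotImplementedError


def to_hid_events(action: dict) -> List[HidEvent]:
    """Translate a high-level action into a sequence of low-level HID events."""
    if action["type"] == "tap":
        return [
            HidEvent("mouse_move", {"x": action["x"], "y": action["y"]}),
            HidEvent("mouse_click", {"button": "left"}),
        ]
    if action["type"] == "type_text":
        return [HidEvent("key_press", {"char": c}) for c in action["text"]]
    return []


def send_to_actuator(events: List[HidEvent]) -> None:
    """Forward events to the peripheral actuator (transport stub).

    The actuator enumerates on the host as a standard USB keyboard or
    pointing device and injects each event into the operating system.
    """
    raise NotImplementedError


def run_task(goal: str, max_steps: int = 25) -> None:
    """Closed loop: observe the screen, act, and repeat until done."""
    for _ in range(max_steps):
        action = predict_next_action(capture_screen(), goal)
        if action["type"] == "done":
            break
        send_to_actuator(to_hid_events(action))
```

Because every step after the agent's prediction is expressed as generic HID events, the loop stays application-agnostic: no per-application API or software integration is required on the host.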

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
