The Language of Computer Control Agents

A current focus of frontier AI research is Computer Control Agents (CCAs).

The goal of this research is to automate workflows in web browsers and operating systems on computers and mobile devices. In our previous post, we discussed that most CCAs operate by employing large language models (LLMs).

Which language do CCAs speak?

Evolved Natural Language

The origins of natural language date back more than 100,000 years [Miyagawa et al., 2025]. Humans evolved language for coordination, planning, and teaching; for conversation, storytelling, and persuasion.

While natural language is descriptive of the real world, it is also verbose, fuzzy and imprecise.

Large language models are bootstrapped via natural language and benefit from the encoded representations of the real world. But LLMs also inherit all the flaws of human language.

When LLMs are tasked not with generating lengthy text but with performing precise actions in real or virtual worlds, these flaws become particularly apparent. The most prominent limitations are the following:

  1. Combinatorial explosion:

Planning a sequence of 5 actions with an LLM with a moderate vocabulary size of 100,000 tokens yields 100,000^5 = 10^25 possible token trajectories (assuming one token per action). Learning a policy that precisely navigates this massive search space is challenging.

  2. Form-meaning mismatch:

For an LLM-based agent, there are many ways to describe the same action: "open", "launch", or "start" can all spin up a new application. Conversely, "file" can describe both an action and an object. Such synonyms and homonyms further complicate learning an accurate policy.

  3. Order variability:

The order in which actions are executed and the order in which they appear in a sentence don't always align. The sentences "Save before closing the window" and "Close the window after saving" describe the same action sequence, yet "save" and "close" appear in reversed order.
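To make the last two limitations concrete, here is a minimal sketch of what a normalizer for them could look like. The action symbols, verb mappings, and function are invented for illustration; they are not an actual agent vocabulary:

```python
# Illustrative sketch only: a toy normalizer that maps synonymous verbs to
# canonical action symbols and undoes "before"/"after" order variation.
# All symbols and mappings below are made up for this example.
CANONICAL = {
    "open": "LAUNCH", "launch": "LAUNCH", "start": "LAUNCH",
    "save": "SAVE", "close": "CLOSE",
}

def normalize(first: str, connective: str, second: str) -> list[str]:
    """Return the canonical action sequence for 'first CONNECTIVE second'."""
    a, b = CANONICAL[first.lower()], CANONICAL[second.lower()]
    # "X before Y" executes X then Y; "X after Y" executes Y then X.
    return [a, b] if connective == "before" else [b, a]

print(normalize("save", "before", "close"))  # ['SAVE', 'CLOSE']
print(normalize("Close", "after", "save"))   # ['SAVE', 'CLOSE']
```

Both surface orders collapse to the same sequence, and the synonym table collapses many surface verbs onto one symbol, which is exactly the ambiguity a learned policy would otherwise have to absorb.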

Which language should CCAs speak instead?

Learned Agent Language

The shortcomings of natural language motivate the development of a concise, computer use-specific language for agents.

The goal is to identify the smallest possible action space that still allows the agent to complete all of its tasks.

Therefore, we turned to representation learning [Bengio et al., 2014] and developed the following recipe:

  1. Collect action trajectories from both expert users and LLM-based agents
  2. Learn a compact action vocabulary from these trajectories
  3. Train a bespoke agent to replicate the training trajectories
  4. Refine the agent policy with reinforcement learning and search on both training and novel tasks
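Step 2 of the recipe could, for instance, be sketched as a BPE-style merge over action trajectories: repeatedly fuse the most frequent adjacent action pair into a composite action. The primitive actions and trajectories below are invented for illustration, not the actual training data:

```python
from collections import Counter

# Illustrative sketch of learning an action vocabulary: one BPE-style merge
# step that fuses the most frequent adjacent action pair into a new symbol.
def merge_most_frequent_pair(trajectories):
    pairs = Counter()
    for traj in trajectories:
        pairs.update(zip(traj, traj[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged = f"{a}+{b}"
    out = []
    for traj in trajectories:
        new, i = [], 0
        while i < len(traj):
            if i + 1 < len(traj) and (traj[i], traj[i + 1]) == (a, b):
                new.append(merged)  # replace the pair with the composite action
                i += 2
            else:
                new.append(traj[i])
                i += 1
        out.append(new)
    return merged, out

trajs = [
    ["click", "type", "enter", "click"],
    ["click", "type", "enter"],
    ["scroll", "click", "type"],
]
merged, new_trajs = merge_most_frequent_pair(trajs)
print(merged)  # click+type
```

Iterating this merge until the vocabulary stops improving compression is one plausible way to arrive at a small set of composite actions tailored to computer use.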

The small learned action space of O(10^2) actions not only results in more accurate policies, but also translates into smaller models that run fast and efficiently on edge devices and keep your personal data secure.
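The back-of-the-envelope arithmetic behind this claim, again assuming one token or one action per planning step:

```python
# Search-space comparison for a 5-step plan (one token/action per step).
llm_space = 100_000 ** 5   # natural-language LLM vocabulary: 10^25 trajectories
agent_space = 100 ** 5     # learned O(10^2) vocabulary: 10^10 trajectories
print(llm_space // agent_space)  # the learned space is 10^15-fold smaller
```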

While natural language and LLMs still matter at the user-agent interface, for planning and execution, agents at Maincode speak their own language.

Stay tuned for exciting benchmark results coming soon!


[Miyagawa et al., 2025]: Linguistic capacity was present in the Homo sapiens population 135 thousand years ago. Frontiers in Psychology, 16, 1503900.

[Bengio et al., 2014]: Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828.