Microsoft just introduced Magma, a new artificial intelligence model designed to help robots see, understand and act more intelligently. Unlike traditional AI models, Magma processes different types of data all at once — an effort Microsoft is calling a big leap toward “agentic AI,” or systems that can plan and execute tasks on a user’s behalf.
The model, which combines vision and language processing, is trained on videos, images, robotics data and interface interactions, making it more versatile than previous models.
On its GitHub page, the Microsoft Research team outlined the tasks Magma can perform, such as manipulating robots and navigating user interfaces by clicking buttons.
To develop the technology, the company partnered with researchers from the University of Maryland, the University of Wisconsin-Madison and the University of Washington.
The launch comes as tech giants race to develop AI agents that can automate more aspects of daily life. Google has been advancing robotics-focused language models, while OpenAI’s Operator tool is designed to handle mundane tasks like making reservations, ordering groceries and filling out forms via typing, clicking and scrolling within a specialized browser.
Jianwei Yang, Microsoft's lead researcher on the project, told CNET that the future of AI involves more than developing multimodal foundation models for chatbots.
“We believe that the next important step for AI hinges on developing agents that can seamlessly understand and interact with both digital and physical environments,” he said.
He said Magma’s significance lies in its ability to bridge the gap for multimodal AI agents, as traditional AI models excel in verbal intelligence but often struggle with planning and real-world action.
“Robots today often rely on task-specific training on domain specific data, resulting in their limited capability to handle simple daily tasks, let alone generalizing to new tasks and environments,” he explained. “Magma changes this by significantly enhancing their verbal and spatial intelligence, allowing robots to ground their actions on top of the environments, either digital or physical, and execute actions precisely and effectively.”
Meanwhile, Craig Le Clair, a principal analyst at Forrester and author of Random Acts of Automation, said the news aligns with the market research firm's prediction that 25% of 2025 robotics projects will combine cognitive and physical automation. He said, however, that the debate continues over whether this announcement and others signify a true turning point or just more large language model entries.
“Microsoft has provided an important developer capability but now needs to demonstrate leadership in guiding productive and safe human-robot interaction,” Le Clair said.