Anthropic's latest AI update lets Claude operate a computer on its own.
Wasn't that what the Rabbit R1 was supposed to do?
Anthropic has launched a new public beta feature for its artificial intelligence model Claude 3.5 Sonnet that lets it control a computer by looking at the screen, moving the cursor, clicking buttons, and typing text. The feature, called "computer use," is available through the API, so developers can direct Claude to carry out tasks on a computer the way a person would, as shown in a demonstration video recorded on a Mac.
Other artificial intelligence tools, such as Microsoft's Copilot Vision and OpenAI's ChatGPT desktop app, have shown similar capabilities based on seeing what is on the computer screen, and Google offers comparable features in its Gemini app for Android devices. None of them, however, has yet released a tool that carries out tasks automatically, as Claude now does. Rabbit promised similar functionality for its R1, which has yet to materialize.
Anthropic warns that computer use is still experimental and can be "cumbersome and error-prone." The company says it is releasing the feature early to gather feedback from developers and expects the capability to improve rapidly.
Anthropic's developers note that Claude still cannot attempt many routine actions that people perform on computers, such as dragging or zooming. In addition, Claude views the screen by taking screenshots and piecing them together rather than watching a more detailed video stream, so it can miss short-lived actions or notifications.
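The screenshot limitation can be illustrated with a small simulation. This is only a sketch: the function names, the two-second capture interval, and the event timings are hypothetical and do not reflect how Claude actually schedules its screenshots. It shows that anything appearing and disappearing entirely between two captures is never seen.

```python
def capture_times(interval_ms: int, duration_ms: int) -> list[int]:
    """Return the moments (in ms) at which a screenshot is taken.

    Hypothetical model: one screenshot every `interval_ms`, starting at 0.
    """
    return list(range(0, duration_ms + 1, interval_ms))


def is_seen(start_ms: int, end_ms: int, captures: list[int]) -> bool:
    """An on-screen event is seen only if some screenshot falls inside it."""
    return any(start_ms <= t <= end_ms for t in captures)


# Assumed values for illustration: a screenshot every 2 seconds for 10 seconds.
captures = capture_times(2000, 10000)

# A notification visible from 2.5 s to 3.5 s falls between two captures.
print(is_seen(2500, 3500, captures))  # False

# A dialog visible from 2.5 s to 6.5 s overlaps the capture at 4 s.
print(is_seen(2500, 6500, captures))  # True
```

A continuous video stream would behave like a very small capture interval, which is why it would be less likely to miss brief events.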
Meanwhile, this version of Claude has been instructed to steer clear of social media. Anthropic says it has measures in place to monitor when Claude is asked to engage in election-related activity, along with systems that steer the model away from generating and posting content on social platforms, registering web domains, or interacting with government websites.
As for performance, the new Claude 3.5 Sonnet model improves on many benchmarks and is offered to customers at the same price and speed as its predecessor. The updated model makes significant gains in coding and tool use: its score on SWE-bench Verified rises from 33.4% to 49.0%, higher than any other publicly available model, including reasoning models such as OpenAI's o1-preview. It also improves on TAU-bench, rising from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the more challenging airline domain.