AI agents are terrible freelancers
Even the best artificial intelligence agents perform poorly at online freelance work, according to an experiment that challenges the idea that AI will replace office workers en masse.
The Remote Labor Index, a new benchmark developed by researchers at data annotation company Scale AI and the nonprofit Center for AI Safety (CAIS), measures the ability of frontier AI models to automate economically valuable work.
The researchers gave several top AI agents a range of simulated freelance tasks and found that even the best could complete less than 3 percent of the work, earning $1,810 out of a possible $143,991. Of the tools they examined, the most capable was Manus, from a Chinese startup of the same name, followed by xAI’s Grok, Anthropic’s Claude, OpenAI’s ChatGPT, and Google’s Gemini.
“I have to hope that this will give more accurate impressions of what’s going on with AI capabilities,” said Dan Hendrycks, director of CAIS. He adds that while some agents have improved significantly over the past year, that doesn’t mean the trend will continue at the same pace.
Dramatic advances in artificial intelligence have led to speculation that AI will soon surpass human intelligence and replace large numbers of workers. In March, Anthropic CEO Dario Amodei suggested that 90 percent of coding work would be automated within months.
Previous waves of artificial intelligence have inspired wild predictions about job displacement, such as the imminent replacement of radiologists by AI algorithms.
The researchers created a range of freelance tasks with the help of verified Upwork workers. The tasks span graphic design, video editing, game development, and administrative work such as data scraping. They paired a description of each task with a list of the files needed to perform it and an example of a finished project produced by a human.
While AI models have gotten better at coding, math, and logical reasoning in recent years, Hendrycks says, they still struggle to use different tools and perform complex tasks that involve multiple steps. “They don’t have long-term memory, and they can’t learn from ongoing experiences. They can’t acquire job skills like humans can,” he says.
The analysis offers a counterpoint to GDPval, a benchmark released in September by OpenAI that claims to measure economically valuable work. According to GDPval, frontier AI models such as GPT-5 are approaching human capabilities on 220 tasks across a wide range of office jobs. OpenAI did not comment.
