
Two AI models pass the Turing test; the public wants them to face the Voight-Kampff test from Blade Runner.
The two models that passed the test were built by OpenAI and Meta.
Researchers conducted a study in which participants conversed for five minutes with two witnesses at once: one human and one artificial intelligence (AI). Four systems were tested (ELIZA, GPT-4o, LLaMa-3.1, and GPT-4.5) in two controlled, randomized experiments. The findings, by Cameron R. Jones and Benjamin K. Bergen of the University of California, San Diego, showed that when prompted to adopt a human persona, GPT-4.5 was judged to be human 73% of the time, significantly more often than the actual human participants were. LLaMa-3.1, given the same instruction, was judged human 56% of the time.
The researchers used two types of prompts: a baseline prompt with no persona specified, and a second prompt that specified a persona. In the first condition, the models were simply urged to "convince the interrogator that they are human in a Turing test." In the second, they were also told what kind of personality to adopt during the test. A total of 254 participants took part in eight rounds of testing and completed a questionnaire at the end. In each round, the interrogator faced two "witnesses," one human and one AI, each trying to convince the interrogator that it was the human.
The results are considered the first empirical evidence that an artificial system can pass a standard three-party Turing test. In previous studies, GPT-4 was identified as human about 50% of the time in a two-party version of the test, which is considered easier because the interrogator has no human witness to compare against. The new study used Turing's original three-party setup.
Participants interacted simultaneously with both a human and an AI. GPT-4.5 was selected as the human significantly more often than chance would predict, while LLaMa-3.1's rate did not differ significantly from chance, meaning participants could not reliably distinguish it from the human. The remaining systems, ELIZA and GPT-4o, were judged human significantly less often than chance.
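The "better than chance" and "no different from chance" distinction comes down to a standard significance test against a 50% baseline. As a minimal sketch (using hypothetical trial counts for illustration, not the study's raw data), an exact two-sided binomial test shows why a 73% rate clears the bar while 56% may not:

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: total probability of any outcome
    whose likelihood under the null (chance rate p) is at most that of
    observing k successes in n trials."""
    pk = comb(n, k) * p**k * (1 - p)**(n - k)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1)
               if comb(n, i) * p**i * (1 - p)**(n - i) <= pk + 1e-12)

# Hypothetical counts: a model judged human in 73 of 100 trials.
print(binom_two_sided_p(73, 100))  # far below 0.05: above chance
# 56 of 100 trials: not distinguishable from a coin flip at the 5% level.
print(binom_two_sided_p(56, 100))
```

With only 100 trials, 56% sits well inside the noise band around 50%, which is why a rate like LLaMa-3.1's cannot be declared either above or below chance.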
The study also examined how the type of prompt affected the models' performance. Given only the baseline prompt, without instructions about a persona to adopt, the models performed notably worse, underscoring how much the outcome depends on prompting. Despite these results, the researchers remain cautious about what chatbot performance on the Turing test means. Jones questions whether a model can truly be said to have passed the test if it needs such instructions to do so.
Although the findings suggest that these models can convincingly imitate human behavior, questions remain about their actual intelligence. Jones argues that this capability should be evaluated more broadly, alongside other indicators of intelligence these models may exhibit. The Turing test, he notes, is not a static benchmark: its outcome depends on people's expectations about the nature of humanity and technology.
Reactions to the study were mixed. Critics such as Gary Marcus warn that the test sets a low bar and that claims of success are premature. Reddit users were also skeptical, arguing that a meaningful Turing test should not let an AI appear more human than actual humans. More humorous responses suggested that current models should face fictional tests such as the Voight-Kampff test from the film Blade Runner, which measures emotional responses to provocative prompts.