Even with limitations and caveats, which we'll cover below, the paper presents a thought-provoking comparison between AI model approaches and raises further questions about using the Turing test to evaluate AI model performance.

British mathematician and computer scientist Alan Turing first conceived the Turing test as "The Imitation Game" in 1950. Since then, it has become a famous but controversial benchmark for determining a machine's ability to imitate human conversation. In modern versions of the test, a human judge typically talks to either another human or a chatbot without knowing which is which. If the judge cannot reliably tell the chatbot from the human a certain percentage of the time, the chatbot is said to have passed the test. The threshold for passing the test is subjective, so there has never been a broad consensus on what would constitute a passing success rate.

In the recent study, listed on arXiv at the end of October, UC San Diego researchers Cameron Jones (a PhD student in Cognitive Science) and Benjamin Bergen (a professor in the university's Department of Cognitive Science) set up a website called turingtest.live, where they hosted a two-player implementation of the Turing test over the Internet with the goal of seeing how well GPT-4, when prompted different ways, could convince people it was human.

Through the site, human interrogators interacted with various "AI witnesses" representing either other humans or AI models that included the aforementioned GPT-4, GPT-3.5, and ELIZA, a rules-based conversational program from the 1960s. "The two participants in human matches were randomly assigned to the interrogator and witness roles," write the researchers. "Witnesses were instructed to convince the interrogator that they were human. Players matched with AI models were always interrogators."

The experiment involved 652 participants who completed a total of 1,810 sessions, of which 1,405 games were analyzed after excluding certain scenarios like repeated AI games (leading to the expectation of AI model interactions when other humans weren't online) or personal acquaintance between participants and witnesses, who were sometimes sitting in the same room.

Surprisingly, ELIZA, developed in the mid-1960s by computer scientist Joseph Weizenbaum at MIT, scored relatively well during the study, achieving a success rate of 27 percent. GPT-3.5, depending on the prompt, scored a 14 percent success rate, below ELIZA. GPT-4 achieved a success rate of 41 percent, second only to actual humans.

GPT-3.5, the base model behind the free version of ChatGPT, has been conditioned by OpenAI specifically not to present itself as a human, which may partially account for its poor performance. In a post on X, Princeton computer science professor Arvind Narayanan wrote, "Important context about the 'ChatGPT doesn't pass the Turing test' paper. As always, testing behavior doesn't tell us about capability." In a reply, he continued, "ChatGPT is fine-tuned to have a formal tone, not express opinions, etc, which makes it less humanlike. The authors tried to change this with the prompt, but it has limits. The best way to pretend to be a human chatting is to fine-tune on human chat logs."

Further, the authors speculate about the reasons for ELIZA's relative success in the study:

"First, ELIZA's responses tend to be conservative. While this generally leads to the impression of an uncooperative interlocutor, it prevents the system from providing explicit cues such as incorrect information or obscure knowledge. Second, ELIZA does not exhibit the kind of cues that interrogators have come to associate with assistant LLMs, such as being helpful, friendly, and verbose. Finally, some interrogators reported thinking that ELIZA was 'too bad' to be a current AI model, and therefore was more likely to be a human intentionally being uncooperative."

During the sessions, the most common strategies used by interrogators included small talk and questioning about knowledge and current events. More successful strategies involved speaking in a non-English language, inquiring about time or current events, and directly accusing the witness of being an AI model.

The participants made their judgments based on the responses they received. Interestingly, the study found that participants based their decisions primarily on linguistic style and socio-emotional traits, rather than the perception of intelligence alone.
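
To make the reported figures easier to compare, here is a minimal Python sketch that ranks the three AI witnesses by the success rates quoted in the article and checks them against a pass threshold. The article stresses that no agreed threshold exists, so the `0.30` and `0.50` values below are purely hypothetical, and the function names are illustrative rather than anything taken from the study.

```python
# Success rates quoted in the article: the fraction of games in which
# interrogators judged the witness to be human.
REPORTED_SUCCESS_RATES = {
    "GPT-4": 0.41,
    "ELIZA": 0.27,
    "GPT-3.5": 0.14,
}

# Session counts quoted in the article.
TOTAL_SESSIONS = 1810
ANALYZED_GAMES = 1405


def excluded_games(total: int, analyzed: int) -> int:
    """Number of sessions dropped before analysis."""
    return total - analyzed


def rank_witnesses(rates: dict[str, float]) -> list[str]:
    """Witness names ordered from highest to lowest success rate."""
    return sorted(rates, key=rates.get, reverse=True)


def passes(rate: float, threshold: float) -> bool:
    """True if a success rate meets a chosen (hypothetical) pass threshold."""
    return rate >= threshold


if __name__ == "__main__":
    print("Excluded sessions:", excluded_games(TOTAL_SESSIONS, ANALYZED_GAMES))
    print("Ranking:", rank_witnesses(REPORTED_SUCCESS_RATES))
    # Both thresholds are assumptions for illustration only.
    for threshold in (0.30, 0.50):
        for name, rate in REPORTED_SUCCESS_RATES.items():
            print(f"{name} at threshold {threshold:.2f}: {passes(rate, threshold)}")
```

Under these assumed thresholds, whether GPT-4 "passes" depends entirely on which cutoff is chosen, which is the point the article makes about the test's subjectivity.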