
AI Hallucinations Ranked: ChatGPT is Best, Palm-Chat Needs to Sober Up

Vectara has published an AI hallucination leaderboard that ranks leading AI chatbots by how often they ‘hallucinate.’ It’s designed to highlight the extent to which the various public large language models (LLMs) hallucinate, but what does this mean, why is it important, and how is it being measured?

One of the characteristics of AI chatbots we have become wary of is their tendency to ‘hallucinate’ — to make up facts to fill in gaps. A highly public example of this came when law firm Levidow, Levidow & Oberman got into trouble after it “submitted non-existent judicial opinions with fake quotes and citations created by the artificial intelligence tool ChatGPT.” The court noted that made-up legal decisions such as Martinez v. Delta Air Lines had some traits consistent with actual judicial decisions, but closer scrutiny revealed portions of “gibberish.”

If you think about the potential use of LLMs in areas such as health, industry, defense, and so on, it’s clearly imperative to stamp out AI hallucinations as part of any ongoing development. To observe a practical example of an AI hallucinating under controlled reference circumstances, Vectara decided to run some tests with eleven public LLMs:

(Image credit: Vectara / GitHub)
  • Feed the LLMs a stack of over 800 short reference documents.
  • Ask the LLMs to provide factual summaries of the documents, as directed by a standard prompt.
  • Feed the answers to a model that detects the introduction of data that wasn’t contained in the source(s).
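The three steps above can be sketched as a simple evaluation loop. This is only an illustrative sketch, not Vectara's actual harness: the function names (`summarize`, `is_hallucinated`), the prompt text, and the toy string-containment detector are all placeholders standing in for the real LLMs and the real hallucination-detection model.

```python
from typing import Callable, List

def evaluate_hallucination_rate(
    documents: List[str],
    summarize: Callable[[str, str], str],         # stand-in for an LLM: (prompt, doc) -> summary
    is_hallucinated: Callable[[str, str], bool],  # stand-in for the detector model
    prompt: str = "Summarize the passage using only facts it contains.",  # placeholder prompt
) -> float:
    """Return the fraction of summaries flagged as introducing unsupported facts."""
    flagged = sum(
        is_hallucinated(doc, summarize(prompt, doc)) for doc in documents
    )
    return flagged / len(documents)

# Toy stand-ins: a "model" that echoes the source stays faithful, while one
# that invents a detail is caught by a naive containment-based detector.
docs = ["The cat sat on the mat.", "Paris is the capital of France."]
faithful = lambda prompt, doc: doc
inventive = lambda prompt, doc: doc + " It happened in 1999."
naive_detector = lambda doc, summary: not all(
    s.strip() in doc for s in summary.split(".") if s.strip()
)

print(evaluate_hallucination_rate(docs, faithful, naive_detector))   # 0.0
print(evaluate_hallucination_rate(docs, inventive, naive_detector))  # 1.0
```

In the real leaderboard, the detector is itself a trained model rather than a string match, which is what makes scoring 800+ summaries per LLM tractable.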
