07 July 2025

Reliability

'Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools' by Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher D. Manning and Daniel E. Ho in (2025) Journal of Empirical Legal Studies comments 

 In the legal profession, the recent integration of large language models (LLMs) into research and writing tools presents both unprecedented opportunities and significant challenges (Kite-Jackson 2023). These systems promise to perform complex legal tasks, but their adoption remains hindered by a critical flaw: their tendency to generate incorrect or misleading information, a phenomenon generally known as “hallucination” (Dahl et al. 2024). 

As some lawyers have learned the hard way, hallucinations are not merely a theoretical concern (Weiser and Bromwich 2023). In one highly publicized case, a New York lawyer faced sanctions for citing fictitious cases invented by ChatGPT in a legal brief (Weiser 2023); many similar incidents have since been documented (Weiser and Bromwich 2023). In his 2023 annual report on the judiciary, Chief Justice John Roberts specifically noted the risk of “hallucinations” as a barrier to the use of AI in legal practice (Roberts 2023).

Recently, however, legal technology providers such as LexisNexis and Thomson Reuters (parent company of Westlaw) have claimed to mitigate, if not entirely solve, hallucination risk (Casetext 2023; LexisNexis 2023b; Thomson Reuters 2023, inter alia). They say their use of sophisticated techniques such as retrieval-augmented generation (RAG) largely prevents hallucination in legal research tasks. ... 
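In broad terms, retrieval-augmented generation has the model draw on passages retrieved from an authoritative corpus and instructs it to answer from those passages rather than from its parametric memory alone. The Python sketch below is purely illustrative: the toy corpus (with paraphrased snippets), the keyword-overlap retriever and the generate_answer stub are assumptions made for exposition, not a depiction of how Lexis+ AI, Westlaw or any other commercial system is actually built.

# Minimal, illustrative RAG loop for a legal research query.
# The corpus snippets are paraphrases included only for exposition;
# real systems retrieve from full primary-law databases.

CORPUS = {
    "Kontrick v. Ryan, 540 U.S. 443 (2004)": (
        "Time prescriptions in Bankruptcy Rules 4004 and 9006(b)(3) are "
        "claim-processing rules rather than jurisdictional limits."
    ),
    "Fed. R. Bankr. P. 9006": (
        "Rules on computing and extending time periods in bankruptcy "
        "proceedings."
    ),
}

def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Rank sources by naive keyword overlap with the query."""
    q_terms = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda item: len(q_terms & set(item[1].lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Ask the model to answer only from the retrieved, cited passages."""
    context = "\n".join(f"[{cite}] {text}" for cite, text in passages)
    return (
        "Answer the question using ONLY the sources below, and cite them.\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

def generate_answer(prompt: str) -> str:
    """Stand-in for a call to a language model."""
    return "(model output would appear here)"

query = "Are deadlines under the Federal Rules of Bankruptcy Procedure jurisdictional?"
print(generate_answer(build_prompt(query, retrieve(query, CORPUS))))

The point of the retrieval step is that the model's answer can in principle be checked against the quoted sources; as the study shows, grounding of this kind reduces, but does not eliminate, hallucination.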

However, none of these bold proclamations have been accompanied by empirical evidence. Moreover, the term “hallucination” itself is often left undefined in marketing materials, leading to confusion about which risks these tools genuinely mitigate. This study seeks to address these gaps by evaluating the performance of AI-driven legal research tools offered by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) and, for comparison, GPT-4.

Our findings, summarized in Figure 1, reveal a more nuanced reality than the one presented by these providers: while RAG appears to improve the performance of language models in answering legal queries, the hallucination problem persists at significant levels. To offer one simple example, shown in the top left panel of Figure 2, the Westlaw system claims that a paragraph in the Federal Rules of Bankruptcy Procedure (FRBP) states that deadlines are jurisdictional. But no such paragraph exists, and the underlying claim is itself unlikely to be true in light of Kontrick v. Ryan, 540 U.S. 443, 447–48 & 448 n.3 (2004), in which the Supreme Court held that FRBP deadlines under a related provision were not jurisdictional.

We also document substantial variation in system performance. LexisNexis's Lexis+ AI is the highest-performing system we test, answering 65% of our queries accurately. Westlaw's AI-Assisted Research is accurate 42% of the time, but hallucinates nearly twice as often as the other legal tools we test. And Thomson Reuters's Ask Practical Law AI provides incomplete answers (refusals or ungrounded responses; see Section 4.3) on more than 60% of our queries, the highest rate among the systems we test. ...

Our article makes four key contributions. First, we conduct the first systematic assessment of leading AI tools for real-world legal research tasks. Second, we manually construct a preregistered dataset of over 200 legal queries for identifying and understanding vulnerabilities in legal AI tools. We run these queries on LexisNexis (Lexis+ AI), Thomson Reuters (Ask Practical Law AI), Westlaw (AI-Assisted Research), and GPT-4 and manually review their outputs for accuracy and fidelity to authority. Third, we offer a detailed typology to refine the understanding of “hallucinations,” which enables us to rigorously assess the claims made by AI service providers. Last, we not only uncover limitations of current technologies, but also characterize the reasons that they fail. These results inform the responsibilities of legal professionals in supervising and verifying AI outputs, which remains an important open question for the responsible integration of AI into law.  

The rest of this work is organized as follows. Section 2 provides an overview of the rise of AI in law and discusses the central challenge of hallucinations. Section 3 describes the potential and limitations of RAG systems to reduce hallucinations. Section 4 proposes a framework for evaluating hallucinations in a legal RAG system. Because legal research commonly requires the inclusion of citations, we define a hallucination as a response that contains either incorrect information or a false assertion that a source supports a proposition. Section 5 details our methodology for evaluating the performance of AI-based legal research tools (legal AI tools). Section 6 presents our results. We find that legal RAG can reduce hallucinations compared to general-purpose AI systems (here, GPT-4), but hallucinations remain substantial, wide-ranging, and potentially insidious. Section 7 discusses the limitations of our study and the challenges of evaluating proprietary legal AI systems, which have far more restrictive conditions of use than AI systems available in other domains. Section 8 discusses the implications of our findings for legal practice and for legal AI companies. Section 9 concludes.
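That definition has two prongs: a response hallucinates if it states incorrect information, or if it falsely asserts that a source supports a proposition. The sketch below expresses the definition as a simple decision rule; the JudgedResponse structure and its field names are hypothetical stand-ins, not the authors' actual annotation scheme.

from dataclasses import dataclass

@dataclass
class JudgedResponse:
    """Hypothetical manual annotations for one system response."""
    factually_correct: bool          # is the stated legal proposition right?
    citations_support_claims: bool   # do the cited sources actually say this?

def is_hallucination(r: JudgedResponse) -> bool:
    """Incorrect information, or a false assertion of source support."""
    return not (r.factually_correct and r.citations_support_claims)

# A plausible-sounding answer resting on a nonexistent or misdescribed
# source still counts as a hallucination under this definition.
print(is_hallucination(JudgedResponse(factually_correct=True,
                                      citations_support_claims=False)))  # True

Under this definition, correctness and groundedness can fail independently, which is why the study reviews system outputs both for accuracy and for fidelity to the cited authority.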