LLM Hallucination Detection: Background with Latest Techniques
One major hurdle in adopting AI for real-life scenarios is its inability to be accurate every time. There was a sudden “hope” with the recent developments in GenAI, more specifically OpenAI’s ChatGPT. However, there is only a single letter of difference between hope and hype!
The accuracy problem (from traditional ML to complex GenAI) has taken on a new name, hallucination: the situation where LLMs (i.e. GenAI used for text generation and other tasks) generate responses that are either factually incorrect or factually inconsistent. The hallucination problem is so prevalent that I came across at least five survey papers on hallucination detection in a span of 12 months, cited below.
Recent studies show that up to 30% of summaries generated by abstractive models contain factual inconsistencies.
In this article, we will describe what hallucination is, the types of hallucination, the stages at which hallucination gets introduced, and finally some hallucination detection techniques described in various research papers from teams at Salesforce, Google, Microsoft, etc., including a detailed analysis of selected techniques that I have worked on. If you are already familiar with the definitions and terminology related to LLM hallucination, feel free to skip directly to the detection techniques.
What is hallucination?
Among all the survey papers, one that is worth spending time on is the one that discusses in detail the principles and taxonomy of LLM hallucinations. Hallucination (or LLM hallucination) is a situation where the LLM response (or the output of any GenAI use-case) is inconsistent with either real-world facts or the user’s inputs. More formally, it can be categorised into two groups: factuality hallucination, related to fact checking, and faithfulness hallucination, related to factual consistency.
Factuality hallucination emphasises the discrepancy between generated content and verifiable real-world facts, typically manifesting as factual inconsistency or fabrication. This is an important problem for generic GenAI solutions where people can ask questions about real-world topics and facts.
Faithfulness hallucination indicates the divergence of LLM-generated content from the input provided by the user, either in the form of instructions or the content itself. The user input could be the question or the context from which the LLM has to generate an answer for QnA tasks, or it could be the original article that the LLM needs to summarise.
What is not hallucination: Sometimes people confuse hallucination with completeness, much like the confusion between precision and recall. If the LLM-generated content (e.g. a summary) is faithful to the original reference or article but misses some important details from the source text, it is not hallucination but incompleteness. To some extent, hallucination relates to precision while completeness relates to recall, as the toy example below illustrates.
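Below is a minimal sketch of that analogy, treating the source and the summary as sets of atomic facts. The fact sets are hand-written for illustration; in practice you would extract them automatically.

```python
# Toy precision/recall analogy for faithfulness vs completeness.
source_facts = {"founded in 1999", "founded by Jane Doe", "based in Berlin"}
summary_facts = {"founded in 1999", "founded by Jane Doe"}

supported = summary_facts & source_facts
faithfulness = len(supported) / len(summary_facts)  # precision-like: 1.0 means no hallucination
completeness = len(supported) / len(source_facts)   # recall-like: misses "based in Berlin"

print(f"faithfulness={faithfulness:.2f}, completeness={completeness:.2f}")
# faithfulness=1.00, completeness=0.67 -> incomplete, but not hallucinated
```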
Types of hallucination: Hallucination can further be classified into two main types:
- Intrinsic hallucination refers to LLM outputs that conflict with the source input.
- Extrinsic hallucination refers to LLM generations that cannot be verified from the source content.
After working on the problem of hallucination, I can say that intrinsic hallucination is harder to detect, even though the definitions might suggest that extrinsic hallucination is the difficult one. The reason is that the dot product between the embeddings of two similar yet conflicting sentences will yield a high score, suggesting that both sentences carry similar content, whereas the same dot product will be quite low if the two pieces of content are about entirely different things.
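To make this concrete, here is a minimal sketch, assuming the sentence-transformers library and the all-MiniLM-L6-v2 checkpoint (stand-ins for whatever embedding model you use): a directly conflicting sentence still scores high on cosine similarity, while an unrelated one does not.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

source    = "The company reported a 10% increase in revenue in 2023."
intrinsic = "The company reported a 10% decrease in revenue in 2023."  # conflicts with the source
extrinsic = "The CEO enjoys hiking in the Alps on weekends."           # unrelated to the source

emb = model.encode([source, intrinsic, extrinsic])
print("source vs intrinsic:", float(cos_sim(emb[0], emb[1])))  # typically high, looks "faithful"
print("source vs extrinsic:", float(cos_sim(emb[0], emb[2])))  # typically low, easy to flag
```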
Sources of hallucination
Hallucination gets introduced at one of these three stages: data, training, or inference.
- Data can have incorrect or outdated facts, conflicting information, and biases due to duplication and social trends, like certain names being associated with specific nationalities or even genders! Inferior data utilisation is also an important source of hallucination; as they rightly say, half knowledge is dangerous. The major issue here is the problem of recalling long-tail information. Hence, data quality is of utmost importance to reduce or prevent LLM hallucinations.
- Training can introduce hallucination at either of its two stages: pre-training or fine-tuning. In the pre-training stage, there are risks of flawed architectures, attention glitches, and exposure bias resulting from the disparity between training and inference in auto-regressive generative models. In fine-tuning (or alignment to the final task), issues of capability misalignment and belief misalignment arise. The former risks pushing LLMs beyond their knowledge boundaries, while the latter reveals a disparity between the LLM’s beliefs and its outputs. An example of capability misalignment is an LLM trained on data from before 2021 being asked to answer queries about 2024. An example of belief misalignment is an LLM trained so heavily on a particular domain or task that it has formed a preconceived notion and always responds based on that belief instead of understanding the prompt and responding accordingly.
- Inference can be a source of hallucination depending on how it is done. Note that inference is nothing but the output of the decoding process, so inference-related hallucination is rooted in the decoding stage, with two main factors: the inherent randomness of decoding strategies and imperfect decoding representations. The first should be quite intuitive, given the randomness involved in output generation (a toy sketch of this randomness follows this list). The second covers both over-reliance on nearby context and the softmax bottleneck limiting the model’s ability to express diverse output probabilities.
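Here is a toy sketch of that first factor, using plain NumPy and made-up logits rather than a real LLM, to show how temperature sampling makes the same context yield different continuations across runs.

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["Paris", "Lyon", "Berlin", "banana"]
logits = np.array([4.0, 2.5, 1.0, -2.0])  # pretend next-token scores from a model

def sample_next(temperature: float) -> str:
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return str(rng.choice(vocab, p=probs))

for t in (0.2, 1.0, 2.0):
    print(f"temperature={t}:", [sample_next(t) for _ in range(8)])
# Low temperature concentrates on the top token; higher temperature spreads
# probability mass onto less likely (and possibly wrong) continuations.
```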
Hallucination Detection Techniques
Here I will focus more on faithfulness hallucination detection.
One can refer to section 4.1.1 of this survey paper for factuality hallucination detection. One method worth looking at is FacTool.
For an exhaustive list of techniques, please refer to the survey papers cited below. Here I will list only some selected techniques, and at the end I will talk in detail about one of the best performing techniques as of mid-2024.
Faithfulness hallucination detection primarily focuses on ensuring the alignment of the generated content with the given context.
Fact-based metrics measure the overlap of important facts between the generated content and the source content. Essentially, these techniques focus on finding the intersection of entities (nouns) and relationships (verbs) between the source content and the LLM-generated content. They can be categorised as n-gram based, entity-based, and relationship-based. I have covered relation-based hallucination detection in detail here, where I tried to show why a knowledge graph may not be efficient for hallucination detection. A knowledge graph is nothing but a collection of entities and relationships (in the form of triplets).
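As a rough illustration of an entity-based check (assuming spaCy with its en_core_web_sm model; the "entity precision" metric here is illustrative, not the exact measure from any particular paper), one can measure how many named entities in the summary are actually supported by the source.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_precision(source: str, summary: str) -> float:
    src_ents = {ent.text.lower() for ent in nlp(source).ents}
    sum_ents = {ent.text.lower() for ent in nlp(summary).ents}
    if not sum_ents:
        return 1.0  # nothing in the summary that could contradict the source
    return len(sum_ents & src_ents) / len(sum_ents)

source = "Acme Corp was founded in 1999 by Jane Doe in Berlin."
summary = "Acme Corp was founded in 2005 by Jane Doe."
print(entity_precision(source, summary))  # "2005" is unsupported, so the score drops below 1.0
```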
Classifier-based metrics involve classifiers trained on data comprising both task-specific hallucinated and faithful content, as well as data from related tasks or synthetically generated data. Again, these can be further classified into two types —
Entailment-based techniques are a class of Natural Language Inference (NLI) techniques that evaluate whether the generated content is entailed by the corresponding source content. Some of the techniques under the entailment or NLI umbrella are:
- This paper by Google on faithfulness and factuality in abstractive summarisation focuses on extreme summarisation and claims that fact-based measures (e.g. triplets or relations) are not suited for extreme summarisation, which tries to summarise a document with a single sentence. This makes sense, as a single sentence cannot capture all the relationships (and entities) present in the original source content, so it would be unfair to use triplet or fact-based methods for extreme summarisation. Note that applications like Google News end up using extreme summarisation, summarising an entire news article with a single sentence.
- SummaC has been a promising technique for entailment detection because of its sentence-level approach, which influenced better techniques like AlignScore. In SummaC, one breaks the summary into sentences and then evaluates each sentence against every sentence of the source content; any entailment model can be used for each sentence-to-sentence comparison. The main problem with this approach is the complexity arising from the cross product of m × n sentence comparisons. A rough sketch of this sentence-level scoring is shown below.
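The following is a rough sketch of the sentence-level NLI idea (not the official SummaC implementation), assuming Hugging Face transformers and the roberta-large-mnli checkpoint: each summary sentence is scored against every source sentence, and the maximum entailment probability is kept.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
entail_idx = model.config.label2id.get("ENTAILMENT", 2)

def sentence_scores(source_sents, summary_sents):
    """Max entailment probability of each summary sentence over all source sentences."""
    scores = []
    for claim in summary_sents:
        enc = tokenizer(source_sents, [claim] * len(source_sents),
                        return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            probs = model(**enc).logits.softmax(dim=-1)[:, entail_idx]
        scores.append(float(probs.max()))  # keep the best-supporting source sentence
    return scores

source = ["Jane founded Acme in 1999.", "The firm is based in Berlin."]
summary = ["Acme was founded by Jane.", "Acme is headquartered in Paris."]
print(sentence_scores(source, summary))  # a low score flags the second summary sentence
```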
Weakly supervised classifiers address the problem of data scarcity. Entailment-based classifiers need to be trained on task- and problem-specific data, i.e. the data should come from a similar domain and should ideally be useful for entailment detection. Kryscinski et al. from Salesforce, in their paper titled “Evaluating the Factual Consistency of Abstractive Text Summarization”, talk about generating synthetic data to train a weakly supervised model for verifying factual consistency. The training data is not generated blindly but after a thorough error analysis of tasks like text summarisation. Some of the techniques used to generate the synthetic data include:
Paraphrasing by back-translation (English to French and other languages, then back to English), sentence negation, entity and pronoun swapping, and noise injection.
This strategy is quite useful for increasing the amount of in-domain data with which to further train an entailment detection classifier. It goes without saying that in machine learning, a simple model with good data quality and quantity can be more useful than a complex model that compromises on either. An illustrative sketch of such rule-based corruptions is shown below.
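Here is an illustrative sketch of such rule-based corruptions (not the exact pipeline from the Salesforce paper): simple entity swaps and negations turn faithful sentences into labelled negatives for a weakly supervised classifier.

```python
import random

random.seed(0)

def entity_swap(sentence: str, entities: list) -> str:
    """Replace one known entity with another to create a factual inconsistency."""
    present = [e for e in entities if e in sentence]
    if not present or len(entities) < 2:
        return sentence
    old = random.choice(present)
    new = random.choice([e for e in entities if e != old])
    return sentence.replace(old, new, 1)

def negate(sentence: str) -> str:
    """Very naive negation; real pipelines rely on parsers for this."""
    return sentence.replace(" was ", " was not ", 1)

faithful = "Acme was founded in Berlin by Jane Doe."
entities = ["Berlin", "Paris", "Jane Doe", "John Roe"]

examples = [(faithful, "CONSISTENT"),
            (entity_swap(faithful, entities), "INCONSISTENT"),
            (negate(faithful), "INCONSISTENT")]
print(examples)
```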
Some other methods for faithfulness hallucination detection are:
- Question-answer based metrics check for faithfulness by generating answers to the same question from both the source and the LLM response. One can then compare whether the two answers agree.
- Uncertainty estimation relies on the observation that a model’s most uncertain responses are highly correlated with hallucination. The methods in this bucket include the model’s log probability (of the LLM response text), entropy-based, model-based, and prompting-based metrics. One can generate a response to the same prompt multiple times to check whether the model answers consistently, thereby estimating its uncertainty (a rough sketch of this sampling idea follows this list). Prompting-based techniques ask the model to evaluate the response by providing more reasoning for it, using approaches like chain-of-thought prompting.
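A minimal sketch of the sampling idea: the `generate` callable is a hypothetical stand-in for your LLM client, and the token-overlap agreement metric is illustrative, not taken from a specific paper.

```python
from itertools import combinations

def token_overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def self_agreement(prompt: str, generate, n: int = 5, temperature: float = 0.7) -> float:
    """Sample n answers and return their average pairwise agreement."""
    answers = [generate(prompt, temperature) for _ in range(n)]
    pairs = list(combinations(answers, 2))
    return sum(token_overlap(a, b) for a, b in pairs) / max(len(pairs), 1)

# Low agreement across samples means the model keeps changing its answer,
# which in practice correlates with a higher chance of hallucination.
```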
Both of the above techniques, i.e. question-answer based metrics and uncertainty estimation, are quite complex and time consuming, and may not make sense to apply in production at run time. In my opinion, it is better to avoid calling an LLM (whether the same one or a different one) to evaluate another LLM’s response, as the evaluating LLM could itself hallucinate: it is still a GenAI solution that can suffer from the same problem.
Before spilling the beans on the best model for faithfulness hallucination detection, a few points I want to highlight:
- Sentence-level detection: it makes sense to check faithfulness for each sentence separately, as in the SummaC paper.
- LLM-based evaluation techniques, like prompting another LLM for hallucination detection, may not be a good fit for production.
- For classifier-based metrics, it makes sense to spend time and resources generating more domain-specific data. In fact, the Salesforce team behind the weakly supervised data generation approach has filed a patent for it.
Promising Models
The AlignScore model is one of the best model-based metrics, built on top of RoBERTa. It measures the correspondence of information between two pieces of text.
Here are some of the highlights of this model:
- The AlignScore model is trained on datasets from a diverse set of tasks, like text summarisation, question answering, etc., to make it more general.
- One of the main value-adds of this method is the way different tasks are re-modelled as a single alignment detection problem. This leads to a uniform yet large dataset of up to 5.9M samples from 22 different tasks.
- Most importantly, the model can be run on a CPU, as neither the base nor the large variant is big enough to require a GPU. However, this comes at the cost of increased latency.
- The model handles long source text by chunking it and then aggregating the alignment scores across chunks. Each chunk matters, as the supporting information could be hidden in any one of them (a minimal sketch of this chunk-and-aggregate idea is given further below).
- It has two variants, AlignScore-base and AlignScore-large, based on RoBERTa-base (125M params) and RoBERTa-large (355M params) respectively.
- Since the model is based on RoBERTa, one can argue that it is an LLM, but it does not generate text. It is a classifier-based model that either gives a numeric score or classifies a piece of text into classes like entailment, neutral, and contradiction. AlignScore may not always top the charts on metrics like F1, but it won’t suffer from the hallucination problem itself, since it is trained specifically for alignment or entailment detection rather than text generation.
- One can fine-tune on a domain-specific dataset using the open-source code in the AlignScore repository.
- One can apply their own customisations to the above code to improve performance, optimising for both latency and metrics like F1 score.
- Even the Amazon Science team has used this model and created a customised method called RefChecker. However, they have made a simple model like AlignScore more complex by going via knowledge triplets extracted from the sentences.
The idea of RefChecker looks interesting, but the knowledge triplets one gets are just broken-up sentences that are re-joined with a delimiter like whitespace and then sent to the AlignScore model for alignment detection.
In my opinion, RefChecker will only be useful if the knowledge triplets are derived in an abstractive manner, encompassing the entire text rather than a single sentence.
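For completeness, here is a minimal sketch of the chunk-and-aggregate idea from the AlignScore highlights above. The `score_chunk(chunk, claim)` callable is a hypothetical stand-in for an alignment scorer such as AlignScore, and the word-based chunking and max aggregation are illustrative choices, not the model’s exact recipe.

```python
from typing import Callable, List

def chunk_text(text: str, max_words: int = 300) -> List[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def chunked_alignment(source: str, claim: str,
                      score_chunk: Callable[[str, str], float],
                      max_words: int = 300) -> float:
    # The supporting evidence may sit in any single chunk, so take the max score.
    return max(score_chunk(chunk, claim) for chunk in chunk_text(source, max_words))
```

Taking the max rewards the single best-supporting chunk; averaging instead would penalise claims whose evidence is concentrated in one place, which is usually not what you want.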
Suggestions to improve and build upon AlignScore:
- Synthetic data generation for your specific domain, and fine-tuning AlignScore with this data.
- Playing with the chunk size, which determines what piece of source text is sent to AlignScore along with the LLM response being evaluated for hallucination.
- Modifying the model architecture for faster inference, keeping only what’s relevant in the inference flow.
- Some future research directions: the model is limited to English, it is not interpretable, and its training data included a lot of synthetic data that could have quality issues.
Some papers that have cited AlignScore:
CoNLI, or Chain of Natural Language Inference, is a technique from Microsoft’s Responsible AI team, presented in their recent paper on hallucination detection and mitigation. A very short article highlighting the contributions of this paper is given here. Their biggest value-add is a plug-and-play framework to mitigate and correct the LLM response after hallucination detection. The only caveat is the reliance on OpenAI’s GPT-x models.
Why might the CoNLI method not be useful for hallucination detection?
In my opinion,
If someone is going to use an OpenAI GPT-x model anyway, they could use it directly for the original task of text summarisation or question answering instead of also using it for hallucination detection and making it a costly affair.
Nevertheless, I have used this technique to see how it works; its only dependency is the LLM you want to use. One can replace OpenAI’s GPT-x with lightweight open-source LLMs like Mistral-7B. Even then, the problem remains the same, i.e. the complexity of calling an LLM in production for hallucination detection. A simplified sketch of the detect-then-mitigate flow is shown below.
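Here is a heavily simplified sketch of the detect-then-mitigate flow that such frameworks follow; this is not the official CoNLI implementation. Both `nli_entailment(premise, hypothesis)` and `rewrite(sentence, source)` are hypothetical stand-ins for an NLI scorer and an LLM client respectively.

```python
from typing import Callable, List

def detect_ungrounded(source: str, response_sents: List[str],
                      nli_entailment: Callable[[str, str], float],
                      threshold: float = 0.5) -> List[str]:
    # Flag response sentences whose entailment against the source is weak.
    return [s for s in response_sents if nli_entailment(source, s) < threshold]

def mitigate(source: str, response_sents: List[str], ungrounded: List[str],
             rewrite: Callable[[str, str], str]) -> str:
    # Rewrite only the flagged sentences, keeping the grounded ones as-is.
    fixed = [rewrite(s, source) if s in ungrounded else s for s in response_sents]
    return " ".join(fixed)
```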
Kudos for not hallucinating until this point!
I hope this article helped give you some direction on tackling the interesting problem of hallucination in GenAI solutions.
EDITS:
MiniCheck is a more recent technique that reports better results than AlignScore.
References
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
- LLM hallucination detection via knowledge graphs
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models
- Can Knowledge Graphs Reduce Hallucinations in LLMs?: A Survey
- A Comprehensive Survey of Hallucination Mitigation Techniques in Large Language Models
- Leveraging Large Language Models for NLG Evaluation: A Survey
- SUMMAC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization
- Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
- Evaluating the Factual Consistency of Abstractive Text Summarization
- Chain of Natural Language Inference for Reducing Large Language Model Ungrounded Hallucinations
- Assessing The Factual Accuracy of Generated Text
- On Faithfulness and Factuality in Abstractive Summarization
- An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics
- Factool
- CoNLI source code
- Align Score source code
- RefChecker source code
- MiniCheck