LLM Hallucination Detection: Can LLM-Generated Knowledge Graphs Be Trusted?
An LLM response can be hallucinated, meaning it is factually incorrect or inconsistent with respect to the reference document. For example, while generating a summary of a news article, the LLM might state something in the summary that is inconsistent with the article. LLM hallucination detection is part of the LLM evaluation step: one can use LLM evaluation techniques to estimate the degree of hallucination in the LLM-generated summary. For example, evaluation metrics like ROUGE-x and others can be used both to evaluate the summary and to detect hallucination. Hence LLM evaluation and LLM hallucination detection can, to a great extent, be used interchangeably.
Example
Below we show a small news article with its corresponding summary, and explain how the summary is inconsistent with the article.
Title: City Introduces New Green Initiative to Combat Air Pollution
Content:
The city council has launched a new green initiative aimed at reducing air pollution and promoting environmental sustainability. The initiative includes planting 10,000 trees over the next two years, creating more bike lanes, and increasing the number of electric vehicle charging stations. Mayor Jane Smith emphasized the importance of community involvement in making the city a cleaner and healthier place to live. “We are committed to improving air quality and reducing our carbon footprint,” said Mayor Smith. The project is set to begin next month, with the first phase focusing on urban areas with the highest levels of pollution.
Summary (with inconsistency or hallucination):
The city council has announced a comprehensive green initiative designed to tackle air pollution and enhance environmental sustainability. Key components of the plan involve planting 10,000 trees, expanding bike lanes, and boosting the availability of electric vehicle charging stations. Mayor Jane Smith highlighted the critical role of community participation in ensuring the initiative’s success. Interestingly, the initiative also includes a controversial plan to reduce public transportation services, which has raised concerns among residents. The project will commence next month, starting with the most polluted urban areas.
Explanation of Inconsistency:
The summary introduces a sentence about a “controversial plan to reduce public transportation services,” which is inconsistent with the article. The original article makes no mention of reducing public transportation services; instead, it focuses solely on positive environmental measures such as planting trees, creating bike lanes, and increasing electric vehicle charging stations.
The inconsistency highlighted above could be flagged by a hallucination detection algorithm. Note that hallucination can occur in other ways too, for example factual inconsistency, which relates to the truthfulness of the LLM against world knowledge; the kind of hallucination we focus on here is about faithfulness to the reference, which is more relevant in the enterprise world.
For our discussion, we will treat such inconsistencies with respect to the reference document as hallucinations.
Assumption
I am assuming we don’t have a true summary for evaluating the LLM-predicted summary, either for hallucination or for precision-recall metrics. Otherwise one could argue that detecting hallucination is trivial: just threshold the dot product between the embeddings (e.g. BERT) of the true summary and the embeddings of the LLM-generated summary (e.g. using sentence similarity; a minimal sketch of this is shown right after this paragraph). But it is highly unlikely that such a true summary will be available at run-time in production. Hence we will use the original reference article to evaluate the summary for hallucination detection. Because of this assumption, what remains is to keep the knowledge graph (or just the triplets in the form of subject-verb-object, i.e. s-v-o, that make up the knowledge graph) of the original reference and evaluate the summary against that graph for hallucination.
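As an aside, here is why that check would be trivial if a gold summary existed. This is a minimal sketch assuming the sentence-transformers library; the model name and the threshold are my own illustrative choices, not recommendations.

# A minimal sketch of the "trivial" check described above; the model name and
# the 0.8 threshold are assumptions, not recommendations.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def looks_hallucinated(true_summary: str, generated_summary: str, threshold: float = 0.8) -> bool:
    emb_true = model.encode(true_summary, convert_to_tensor=True)
    emb_gen = model.encode(generated_summary, convert_to_tensor=True)
    similarity = util.cos_sim(emb_true, emb_gen).item()
    # Below the (tunable) threshold, flag the generated summary as suspect.
    return similarity < threshold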
Background
I know that LLM hallucination detection is possible in multiple ways (as mentioned at the beginning about ROUGE-x), and I have already written an article on the background of LLM hallucination and the latest techniques for detecting it. The point of writing this article, however, is to show the issues in using a knowledge graph to detect hallucination, especially when the knowledge graph is generated using another LLM. I also know that such an approach sounds impractical even before attempting it. But there was a scenario at my work where I had to show my manager that it is actually impractical, even though it sounds nice in theory. While implementing and experimenting with this approach, I came across multiple blogs and papers related to this article. I will refer to them as well, both to avoid redundant content and to show readers that people have tried similar approaches before.
Even though I am writing about LLM hallucination detection via knowledge graphs, consider this a self-critique article.
Methodology
Steps taken to detect LLM hallucination via knowledge graphs:
- Collect the data with reference articles and their corresponding summaries. Note that even at run-time, one needs both the reference and the corresponding summary. There is not much to say about this step, so we will only talk in depth about generating the graph and comparing the two graphs.
- Choose the LLM with which we will generate the knowledge graph of the reference article. This is critical because, ideally, one should not use the same LLM to generate both the summary and the knowledge graph. It is like the teacher and the examiner: they should be different people, else the teacher may be biased towards their (best-performing/least-performing) students. Here, the students are the concepts/sentences in the reference that could be treated with bias when the same LLM generates both the summary and the knowledge graph.
- Generate the knowledge graph, or the triplets, from the reference article using the chosen LLM. Ultimately, the triplets are what make up the knowledge graph, since a graph can be represented by the set of triplets that form it.
- Compare the summary against the set of triplets (of the reference text) to detect any hallucination in the LLM generated summary.
Though it may look simple, each step has its own complications, which we will see in the following sections.
Choosing the LLM for generating the knowledge graph
I used two LLMs: Zephyr:7b (fine-tuned from Mistral-7b) and GPT-3.5-turbo. The reason for choosing a model based on Mistral-7b was its Apache-2.0 license, which allows you to eventually use it in production, especially for enterprise use cases, without compliance issues. The main bottleneck of using AI in enterprises is not performance but compliance. The reason for using OpenAI’s GPT-x was the use of LlamaIndex in the next step. Eventually I would have to give up the idea of using OpenAI’s GPT-x due to compliance issues, but there is no harm in checking it and benchmarking our results.
Generating the knowledge graph
There are two ways I generated the triplets (of the knowledge graph): one via LlamaIndex, and the other using the code and prompt given in this blog (and its open-source GitHub repository for the knowledge base). Note that I have used this repository as-is, and I want to appreciate the hard work done by Rahul in creating it and thereby helping my experimentation. There are prompts used in the flow to query the Mistral LLM to generate the triplets; I did fine-tune the prompt to get more relevant and better triplets for my context.
Generating Knowledge graph using llama-index
There are a lot of resources on the internet showing how one can generate a knowledge graph using llama-index. Still, for completeness, I have uploaded a notebook with complete code that generates the knowledge graph for both the reference and the summary.
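The full code is in the notebook; as a rough sketch of the construction (the import paths and defaults vary across llama-index versions, and the LLM settings below are my assumptions), it looks roughly like this:

# A rough sketch assuming a legacy (pre-0.10) llama-index install; import paths,
# defaults and the LLM settings differ across versions and are assumptions.
from llama_index import Document, KnowledgeGraphIndex, ServiceContext, StorageContext
from llama_index.graph_stores import SimpleGraphStore
from llama_index.llms import OpenAI

def build_kg(text: str) -> KnowledgeGraphIndex:
    documents = [Document(text=text)]
    storage_context = StorageContext.from_defaults(graph_store=SimpleGraphStore())
    service_context = ServiceContext.from_defaults(
        llm=OpenAI(model="gpt-3.5-turbo", temperature=0)  # the "examiner" LLM
    )
    return KnowledgeGraphIndex.from_documents(
        documents,
        max_triplets_per_chunk=10,  # assumed setting
        storage_context=storage_context,
        service_context=service_context,
    )

reference_index = build_kg(reference_text)  # reference_text / summary_text assumed defined
summary_index = build_kg(summary_text)
# The underlying graph can be exported to networkx for plotting and for reading
# off the triplets (each edge carries its relationship label).
reference_graph = reference_index.get_networkx_graph()

Here are the graphs I got for the above reference and the summary.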
One of the subcomponents of the above graph is shown below, zoomed for readability.
The corresponding triplets for the above subcomponent are also shown below, where one can see that the centre node “Initiative” is repeated three times, once for each of the three edges (or relationships).
Triplets:
[('Initiative', 'Includes', 'Creating Bike lanes'),
('Initiative', 'Includes', 'Increasing charging stations'),
('Initiative', 'Includes', 'Planting trees')
]
Now let’s get the graph for the summary.
We take one of the subcomponents from the summary graph as well.
Triplets:
[('Initiative', 'Includes', 'Controversial plan'),
('Initiative', 'Raised concerns among', 'Residents'),
('Initiative', 'Includes', 'Controversial plan to reduce public transportation services'),
('Controversial plan to reduce public transportation services', 'Raised concerns among', 'Residents')
]
So what’s happening here?
The LLM (gpt-3.5-turbo in this case) has essentially broken each sentence into a subject-verb-object (s-v-o) triplet. Common entities across sentences are joined, making the graph more and more connected.
Once the triplets are generated for both the reference and the summary, we store them in order to compare them at a later stage.
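To make the joining on common entities concrete, here is a tiny illustration (networkx is my choice here, not something the notebook requires):

# Triplets that share a subject/object collapse into the same node; that is
# exactly what connects the graph.
import networkx as nx

triplets = [
    ("Initiative", "Includes", "Creating Bike lanes"),
    ("Initiative", "Includes", "Increasing charging stations"),
    ("Initiative", "Includes", "Planting trees"),
]

G = nx.DiGraph()
for subject, verb, obj in triplets:
    G.add_edge(subject, obj, relationship=verb)  # the repeated "Initiative" becomes one node

print(G.number_of_nodes(), G.number_of_edges())  # 4 nodes, 3 edges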
Generating Knowledge graph using another LLM via prompting
In this approach one can use an LLM like Mistral-7b or Zephyr (still based on Mistral-7b) with zero-shot prompting (as shown in the repo here) to generate the triplets from a piece of text. The end result is the same as in the previous approach: generate and store the triplets for both the reference and the corresponding summary.
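The exact prompt and parsing live in the repository linked above; purely as an illustration, a stripped-down version of the flow could look like the sketch below, where the prompt wording and the call_llm helper are placeholders of mine and not the repository’s code.

import json

# Illustrative only: call_llm is a hypothetical wrapper around whichever
# Mistral-7b / Zephyr endpoint is used; the prompt is a simplified placeholder.
TRIPLET_PROMPT = """Extract subject-verb-object triplets from the text below.
Return a JSON list of objects with keys "subject", "relationship" and "object".

Text:
{text}
"""

def extract_triplets(text: str, call_llm) -> list[tuple[str, str, str]]:
    response = call_llm(TRIPLET_PROMPT.format(text=text))
    records = json.loads(response)  # assumes the model returns valid JSON
    return [(r["subject"], r["relationship"], r["object"]) for r in records]

The results I got using the actual code from the repository are as follows: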
Knowledge Graph of the reference:
Knowledge Graph of the summary:
The colors are assigned according to community or cluster membership. More details can be found in the linked repository and the blog here.
Evaluation of the LLM generated summary
In a normal scenario, one can use metrics like ROUGE both to evaluate and to detect hallucination in LLM responses. A low ROUGE score may indicate hallucination; the lower the score, the higher the assumed degree of hallucination in the LLM-generated summary.
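For reference, such a baseline check is only a few lines, assuming the rouge-score package; reference_text and summary_text stand for the article and the generated summary.

# A baseline check assuming the rouge-score package; a low score against the
# reference article is treated as a rough divergence/hallucination signal.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(reference_text, summary_text)  # reference_text / summary_text assumed defined
print(scores["rougeL"].fmeasure)  # lower values suggest more divergence from the reference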
As mentioned above, we are interested in evaluating the summary using the knowledge graph approach. If you have followed this article from the top until this point, we now have the triplets (subject-verb-object) from both the reference text and the summary text. There are at least two ways of comparing the triplets from the reference and the summary:
Matching the entire triplet
Convert a triplet (from both the reference and the summary) into a sentence, as shown below:
print(" ".join(['Initiative', 'Includes', 'Controversial plan']))
# Output
# Initiative Includes Controversial plan
and then find the best matching sentence for each summary triplet (i.e. summary sentence) among the reference triplets (converted to reference sentences). This approach is shown in the notebook here, and a rough sketch of the matching is also given after the sample scores below.
# Summary Samples that were hallucinated
Matched triplets:
Summary triplet: ['Initiative', 'Includes', 'Controversial plan']
Reference triplet: ['Initiative', 'Includes', 'Planting trees']
Matching score: 0.5433995723724365
Matched triplets:
Summary triplet: ['Controversial plan to reduce public transportation services', 'Raised concerns among', 'Residents']
Reference triplet: ['City council', 'Launched', 'Green initiative']
Matching score: 0.391315758228302
Matched triplets:
Summary triplet: ['Initiative', 'Includes', 'Controversial plan to reduce public transportation services']
Reference triplet: ['Initiative', 'Includes', 'Increasing charging stations']
Matching score: 0.5326859951019287
# Summary samples that do not look hallucinated
Matched triplets:
Summary triplet: ['City council', 'Announced', 'Green initiative']
Reference triplet: ['City council', 'Launched', 'Green initiative']
Matching score: 0.971133828163147
Matched triplets:
Summary triplet: ['Green initiative', 'Designed to tackle', 'Air pollution']
Reference triplet: ['Green initiative', 'Aimed at', 'Reducing air pollution']
Matching score: 0.9727627038955688
Matched triplets:
Summary triplet: ['Green initiative', 'Designed to enhance', 'Environmental sustainability']
Reference triplet: ['Green initiative', 'Aimed at', 'Promoting environmental sustainability']
Matching score: 0.9679796099662781
One thing is clear from the above: the hallucinated samples have lower scores (less than 0.6) than the samples which are not hallucinated (around 0.9+).
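For completeness, here is a minimal sketch of this matching step, assuming the sentence-transformers library; the model name and the 0.6 threshold are assumptions and would need tuning.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def match_triplets(summary_triplets, reference_triplets, threshold=0.6):
    # Convert each triplet into a plain sentence and embed it.
    summary_sents = [" ".join(t) for t in summary_triplets]
    reference_sents = [" ".join(t) for t in reference_triplets]
    scores = util.cos_sim(
        model.encode(summary_sents, convert_to_tensor=True),
        model.encode(reference_sents, convert_to_tensor=True),
    )
    results = []
    for i, triplet in enumerate(summary_triplets):
        j = int(scores[i].argmax())  # best matching reference triplet
        best = float(scores[i][j])
        results.append({
            "summary_triplet": triplet,
            "reference_triplet": reference_triplets[j],
            "score": best,
            "hallucinated": best < threshold,  # threshold has to be tuned
        })
    return results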
Matching each entity of the triplet separately
In this approach, one has to match each entity of a triplet against both entities of the other triplet, and also compare the relationship (verb) against the corresponding relationship in the other triplet. While the previous approach required m x n comparisons, this approach requires m x n x 5 comparisons in the worst case (two entities compared against two entities gives four comparisons, plus one for the relationship).
I have not shown this approach here; one can follow the same notebook and extend it, for instance along the lines of the sketch below.
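Purely as an illustration of how that extension might look (the similarity helper and the aggregation are my own placeholders, not the notebook’s code):

# Entity-wise matching: subject and object are each compared against both
# entities of the reference triplet (4 comparisons) plus verb vs. verb (1 more),
# which is where the "x 5" factor above comes from.
# similarity(a, b) is a hypothetical helper returning a cosine similarity in [0, 1].
def triplet_pair_score(summary_triplet, reference_triplet, similarity) -> float:
    s_subj, s_verb, s_obj = summary_triplet
    r_subj, r_verb, r_obj = reference_triplet
    subj_score = max(similarity(s_subj, r_subj), similarity(s_subj, r_obj))
    obj_score = max(similarity(s_obj, r_subj), similarity(s_obj, r_obj))
    verb_score = similarity(s_verb, r_verb)
    # One of many possible ways to aggregate the five comparisons.
    return (subj_score + obj_score + verb_score) / 3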
It looks nice, but what is the problem with using an LLM-generated knowledge graph to evaluate, and hence detect hallucination in, an LLM response such as a summary?
Problems:
- Error propagation or multiplication: There is approximation in both stages, first in getting the triplets and then in comparing the triplets against each other (using the sentence-similarity transformer), so errors in the first stage compound in the second.
- High complexity: The approach might look quite innovative, but the time taken both to compute the triplets and to match them is quite high. There are better and more straightforward ways of evaluating, and detecting hallucination in, an LLM-generated response.
- Tuning the similarity threshold: Even if the two issues above are acceptable, one has to tune the threshold used in the last step to decide at what point a summary triplet is called hallucinated or not hallucinated.
- Flexible and indefinite entities: The reference example here was quite simple. In enterprise use cases, there might be a fixed set of entities that one deals with. With the approach described above, however, indefinite entities and relationships are possible. This makes it difficult to normalise the graph and makes the comparison stage more complex.
- Trivial triplets: Summaries and other LLM responses can be abstractive or extractive in nature, and the LLM-generated triplets can likewise be abstractive or extractive. If the triplets are extractive, then this approach does not help at all, for the obvious reason that one could just perform sentence-to-sentence matching between the original reference and the summary. Moreover, if the summary itself is extractive, then such a complex approach may not yield the desired efficiency in detecting hallucination.
The point of using a knowledge graph should be to capture relationships that are either far apart in the text or abstractive in nature.
How can we use LLM-generated knowledge graphs?
Fixed ontology can provide some relief
There is a research paper by a Google team arguing that if the graph ontology is fixed, then this approach could still be useful. A fixed ontology addresses the problem in the second stage of matching the triplets. However, better approaches have since emerged that avoid the need to compute a knowledge graph at all.
Pre-built knowledge graph can still be used
Organisations like Google have been in the industry for a long time and have built knowledge graphs (e.g. knowledge bases, KBs) over time. Such pre-existing KBs can be handy for validating LLM responses as well as improving their precision.
Using LLM-generated knowledge graphs to evaluate LLM responses is a complex and inefficient approach.