One source of LLM hallucination is exposure bias

Generated by DALL·E

With the release of the closed-source ChatGPT and GPT-4 and the open-source LLaMA models, LLM development has seen tremendous progress in recent months. While we are excited that these LLMs are capable of many tasks, we have also noticed again and again that they hallucinate content. Today I came across this inspiring paper, Sources of Hallucination by Large Language Models on Inference Tasks by McKenna et al., in which the authors identify two main sources of hallucination:

  • Knowledge that was memorised by the model during pre-training
  • Corpus-based heuristics such as term frequency

In my opinion, these two reasons fall into one category: exposure bias. This is because both the memorised knowledge and the frequent terms were exposed to the LLM during pre-training. The observation made in this paper is very enlightening, and it reminded me of an earlier paper of mine, in which we also concluded that the low-diversity issue of generative chatbots is caused by frequent terms in the training corpora1.

Although LLMs are becoming larger and are trained with more sophisticated techniques like RLHF, they are deeply rooted in statistical modelling. Losses are calculated over tokens and used to update the model weights, so it’s not surprising at all that trained LLMs respond differently to terms of different frequencies. In fact, it would be surprising if these LLMs learned only perfect grammar and semantics and totally shook off the frequency part. There is nothing wrong with LLMs being statistical. We humans often make decisions based on experience, and isn’t that a kind of statistical model? To make matters worse, natural languages have a statistical nature too: most of them, if not all, evolve over time, not necessarily changing the meaning of words, but definitely changing the frequency with which speakers use them.
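
To make this concrete, here is a minimal sketch with a made-up toy vocabulary and the simplifying assumption that every token occurrence incurs the same per-token cross-entropy loss. It only shows how a token’s share of the total training loss, and hence of the gradient signal, scales with its frequency:

```python
from collections import Counter

# Toy corpus: token counts roughly mimic a Zipf-like distribution.
# These tokens and counts are made up purely for illustration.
corpus = ["the"] * 500 + ["model"] * 120 + ["reason"] * 30 + ["enthymeme"] * 2
counts = Counter(corpus)
total = sum(counts.values())

# Simplifying assumption: every occurrence incurs the same loss, so a
# token's share of the total loss is proportional to its frequency.
per_token_loss = 1.0
for token, count in counts.most_common():
    share = (count * per_token_loss) / (total * per_token_loss)
    print(f"{token:>10}: {count:4d} occurrences -> {share:.1%} of the total loss")
```

Under this (admittedly crude) assumption, a handful of frequent tokens dominate the updates, which is exactly the kind of corpus-level statistic the model ends up absorbing.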

As pointed out by Konstantine Arkoudas2, GPT-4 can’t reason. I agree with this statement. I think LLMs are sophisticated statistical models, and the generation process is more like information retrieval, except that it is performed over the neural network weights and at the granularity of tokens. As Arkoudas also notes, the lack of reasoning in LLMs is connected to the hallucination problem. I agree with him and many other researchers that retrieval augmentation could serve as a “guardrail” for LLM generation, but it is unlikely to be the silver bullet that eliminates hallucination.
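
As a rough illustration of what I mean by a “guardrail”, here is a minimal sketch of retrieval-augmented prompting. The `retrieve` function is a hypothetical stub standing in for a real retriever (e.g. BM25 or a dense vector index); nothing here is a real API:

```python
def retrieve(query: str, top_k: int = 3) -> list[str]:
    # Hypothetical stub: a real implementation would query BM25 or a
    # dense vector index over a trusted document collection.
    return [f"<retrieved passage {i + 1} for: {query}>" for i in range(top_k)]


def build_prompt(question: str) -> str:
    # Ground the LLM in retrieved evidence and instruct it to abstain
    # when the evidence is insufficient.
    passages = retrieve(question)
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer the question using ONLY the passages below. "
        "If the answer is not in the passages, say you don't know.\n\n"
        f"Passages:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


print(build_prompt("Who proposed exposure bias as a source of hallucination?"))
```

The instruction narrows what the model is supposed to rely on, but it cannot stop the model from falling back on memorised or frequency-driven associations, which is why I see this as a guardrail rather than a cure.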

However, “can’t be solved” is different from “can’t be improved”. Given that more and more studies have shown how vulnerable LLMs are to the statistical nature of their training data, perhaps more effort should go into rethinking how these models are trained.

Lastly, it’s worth noting that the McKenna et al. work studies natural language inference (NLI). Although the hallucination problem is more prominent in natural language generation (NLG), it’s not straightforward how to carry out a similar analysis in the NLG setting. But if it can be done, it would attract even more attention.

Shaojie Jiang
