Document Type

Conference Proceeding

Abstract


Despite their wide application to language understanding tasks, large language models (LLMs) still face challenges such as hallucinations (the occasional fabrication of information) and alignment issues (weak grounding in human-curated world models, e.g., intuitive physics or common-sense knowledge). Additionally, the black-box nature of LLMs makes it highly challenging to train them meaningfully toward a desired behavior. In particular, adjusting an LLM's concept embedding space is largely intractable, as it requires analyzing the implicit impact on the LLM's numerous parameters and the resulting inductive biases. This paper proposes a novel architecture that wraps powerful function-approximation architectures within an outer, interpretable read-out layer, which can be scrutinized to explicitly observe the effects of concept modeling during LLM training. This contrasts with gradient-based implicit mechanisms, which rely solely on modifications to the LLM's parameters and therefore do not lend themselves to scrutiny. Through extensive experiments on both generative and discriminative language modeling tasks, we compare the abilities of the proposed architecture against state-of-the-art LLMs of comparable size. We further provide a qualitative analysis of the interpretable read-out layer and visualize the concepts it captures. Our findings show the potential of our approach for robust control of LLM hallucinations and for enhanced alignment of LLMs with human expectations.
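To make the idea of an interpretable read-out layer concrete, the following is a minimal sketch, not the authors' implementation: a linear layer over a frozen model's hidden states in which each row of the weight matrix corresponds to one human-readable concept name, so the effect of training on each concept can be inspected directly. The class name, the concept labels, and all dimensions are illustrative assumptions.

```python
import numpy as np

class ConceptReadout:
    """Hypothetical interpretable read-out layer (illustrative sketch only).

    Maps a hidden-state vector to named concept activations via a linear
    layer whose rows are tied to human-readable concept labels, so each
    concept's weights can be scrutinized individually during training.
    """

    def __init__(self, hidden_dim, concepts, seed=0):
        rng = np.random.default_rng(seed)
        self.concepts = list(concepts)
        # One weight row per concept name; small random initialization.
        self.W = rng.normal(scale=0.01, size=(len(self.concepts), hidden_dim))
        self.b = np.zeros(len(self.concepts))

    def forward(self, h):
        # h: (hidden_dim,) hidden state -> (num_concepts,) activations.
        return self.W @ h + self.b

    def top_concepts(self, h, k=3):
        # Rank concepts by activation: the explicit "scrutiny" step
        # unavailable when behavior lives only in opaque parameters.
        scores = self.forward(h)
        order = np.argsort(scores)[::-1][:k]
        return [(self.concepts[i], float(scores[i])) for i in order]
```

In this sketch, inspecting `W[i]` shows exactly which hidden-state directions drive concept `i`, in contrast to gradient-based updates distributed across all model parameters.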

APA Citation

Zi, Y., Roy, K., Narayanan, V., & Sheth, A. (2024). Exploring alternative approaches to language modeling for learning from data and knowledge [Preprint].