Abstract

Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach for mitigating hallucination in large language models (LLMs) by integrating external knowledge. Despite numerous efforts, most studies focus on a single type of external knowledge source, whereas real-world applications typically draw on diverse knowledge from multiple sources, an area that remains underexplored. The main obstacles are the lack of a suitable dataset that combines multiple knowledge sources and the absence of prior exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Building on this dataset, we further develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is the use of multi-granularity pruning strategies to optimize the integration of relevant information and minimize misleading context. Using the standardized dataset and PruningRAG, we report a series of experimental results and insightful findings. Our dataset and code are publicly available here, with the aim of advancing future research in the RAG community.

Overview

Benchmark Evaluation of RAG

Table 1: Comparative analysis of RAG performance across different external knowledge configurations.
External Knowledge   Method    Acc.    Halluc.  Miss.   Score
None                 LLM-Only  15.61%  20.42%   63.97%  -4.81%
5 Web                CRAG      7.51%   8.68%    83.81%  -1.16%
5 Web                RAG-X     24.87%  22.10%   53.03%  2.77%
Mock API             CRAG      8.53%   1.60%    89.86%  6.93%
Mock API             RAG-X     29.03%  10.86%   60.10%  18.16%
5 Web + Mock API     CRAG      15.10%  7.95%    76.95%  7.15%
5 Web + Mock API     RAG-X     32.53%  23.12%   44.35%  9.41%
50 Web + Mock API    CRAG      14.22%  8.90%    76.88%  5.32%
50 Web + Mock API    RAG-X     35.16%  26.26%   38.58%  8.90%
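
Note: throughout the tables, Score corresponds to Acc. minus Halluc.; for example, the LLM-Only baseline scores 15.61% - 20.42% = -4.81%.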

Empirical Study

Knowledge Source Selection

This subsection examines the importance of selecting the right knowledge sources for a RAG framework. The effectiveness of the generated responses depends heavily on the quality and relevance of the external knowledge sources, so we compare several selection strategies, ranging from manual settings to a predefined path and an agent-based path, to determine which works best for different types of queries.

Table 2: Performance comparison across different knowledge source selection methods.
Phase             Experiment Setting  Acc.    Halluc.  Miss.   Score
Manual Setting    LLM                 13.86%  16.70%   69.44%  -2.84%
Manual Setting    WEB                 16.63%  12.40%   70.97%  4.23%
Manual Setting    Mock API            29.18%  10.14%   55.26%  19.04%
Manual Setting    ALL                 17.07%  8.83%    74.11%  8.24%
Manual Setting    LLM+WEB             13.20%  10.50%   76.29%  2.70%
Manual Setting    LLM+Mock API        21.88%  10.21%   67.91%  11.67%
Manual Setting    LLM+ALL             17.72%  8.83%    73.45%  8.90%
Predefined Path   LLM+WEB             24.00%  22.61%   53.39%  1.39%
Predefined Path   LLM+Mock API        36.62%  25.97%   37.42%  10.65%
Predefined Path   LLM+ALL             36.62%  39.82%   23.56%  -3.20%
Agent-Based Path  Agent               25.31%  21.59%   53.10%  3.72%
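
In the predefined-path setting, each query is routed to one or more knowledge sources before retrieval. As a minimal illustration rather than the PruningRAG implementation, the sketch below asks an LLM, through a placeholder call_llm client, which of the sources from Table 2 to consult; the prompt wording and the fallback rule are assumptions.

```python
# Hypothetical knowledge source router; `call_llm` stands in for any
# chat-completion client, and the source names mirror Table 2.
from typing import Callable

ROUTING_PROMPT = (
    "Decide which knowledge sources are needed to answer the question.\n"
    "Options: WEB (web search snippets), MOCK_API (structured domain APIs), "
    "NONE (answer from the model's own knowledge).\n"
    "Question: {question}\n"
    "Reply with a comma-separated list of options."
)

def route_query(question: str, call_llm: Callable[[str], str]) -> list[str]:
    """Ask the LLM which external sources to consult for this question."""
    raw = call_llm(ROUTING_PROMPT.format(question=question))
    allowed = {"WEB", "MOCK_API", "NONE"}
    choices = [c.strip().upper() for c in raw.split(",")]
    selected = [c for c in choices if c in allowed]
    # If the answer cannot be parsed, fall back to consulting every source.
    return selected or ["WEB", "MOCK_API"]
```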

Retrieval for Knowledge Extraction

Retrieval is a crucial step in the RAG process. In this subsection, we analyze different retrieval techniques and their performance in extracting relevant information from knowledge sources. We also explore the trade-offs between speed and accuracy in various retrieval methods.

Table 3 compares the performance of the RAG system with and without a broad retrieval step, in which each query is supplemented with 50 web pages rather than the 5 used in other experiments. The results show that the broad retrieval phase sharply reduces processing time while also improving accuracy and Score. Sparse retrieval methods such as BM25 cheaply filter out irrelevant data, allowing dense retrieval to focus on a smaller, more targeted subset of the external knowledge.

Table 3: Comparison of performance with and without broad retrieval.
Setting Acc. Halluc. Miss. Score Processing Time
Broad Retrieval (w/) 31.71% 41.50% 26.79% -9.79% 8166.59s
Broad Retrieval (w/o) 28.94% 42.65% 28.42% -13.71% 93459.87s
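
To make this two-stage setup concrete, the sketch below prunes the corpus with BM25 before dense scoring. The libraries (rank_bm25, sentence-transformers) and the BGE checkpoint are illustrative choices and not necessarily the exact configuration used in our pipeline.

```python
# Illustrative broad-then-dense retrieval cascade: BM25 keeps a large candidate
# pool cheaply, and dense retrieval re-scores only those survivors.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

def two_stage_retrieve(query: str, chunks: list[str],
                       broad_k: int = 50, final_k: int = 5) -> list[str]:
    # Stage 1: sparse (lexical) retrieval prunes the corpus to broad_k candidates.
    bm25 = BM25Okapi([c.split() for c in chunks])
    candidates = bm25.get_top_n(query.split(), chunks, n=broad_k)

    # Stage 2: dense retrieval ranks the smaller candidate set, which is what
    # keeps processing time low in the "w/ broad retrieval" setting.
    encoder = SentenceTransformer("BAAI/bge-small-en-v1.5")
    q_emb = encoder.encode(query, convert_to_tensor=True)
    c_emb = encoder.encode(candidates, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]
    top = scores.topk(min(final_k, len(candidates))).indices.tolist()
    return [candidates[i] for i in top]
```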

Figure 1 illustrates the trade-offs between dense and sparse retrieval configurations in a RAG system. While higher dense retrieval ratios boost accuracy due to richer semantic understanding, they also elevate hallucination rates, suggesting a risk of contextual misalignment. Conversely, increasing sparse retrieval may lower accuracy and fails to consistently reduce hallucinations, likely because of its dependence on surface-level keyword matching.

Figure 1: Performance of varying retrieval methods in RAG.
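
One simple way to realize the dense/sparse ratio studied in Figure 1 is to blend normalized scores from the two retrievers with a mixing weight; the fusion rule below is illustrative only and is not the exact scheme behind the figure.

```python
# Illustrative score fusion: alpha = 1.0 is purely dense (semantic) retrieval,
# alpha = 0.0 purely sparse (lexical) retrieval.
import numpy as np

def hybrid_scores(dense: np.ndarray, sparse: np.ndarray, alpha: float = 0.7) -> np.ndarray:
    def normalize(x: np.ndarray) -> np.ndarray:
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x, dtype=float)

    return alpha * normalize(dense) + (1.0 - alpha) * normalize(sparse)
```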

Table 4 summarizes the impact of various reranker configurations on RAG system performance, focusing on different retrieval chunk sizes. The results show that as the number of retrieved chunks increases, hallucination rates rise, while accuracy remains stable, suggesting that the reranker performs better with smaller, more focused sets.

Table 4: Performance of reranker configurations in RAG.
Config. Acc. Halluc. Miss. Score
(3, 3) 24.07% 43.18% 32.75% -19.11%
(3, 5) 24.87% 44.06% 31.07% -19.18%
(3, 10) 24.28% 45.88% 29.83% -21.59%
(3, 20) 24.80% 46.32% 28.88% -21.52%
(3, ALL) 24.95% 46.32% 28.74% -21.37%

Note: (3, X) denotes retrieving X candidate chunks and keeping the top 3 after reranking; (3, ALL) passes all chunks directly to the reranker.
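
A minimal version of this retrieve-then-rerank step, with an off-the-shelf cross-encoder standing in for the reranker used in our experiments, looks as follows.

```python
# Sketch of the (3, X) configuration: score X candidate chunks with a
# cross-encoder and keep only the top 3. The model name is an assumed example.
from sentence_transformers import CrossEncoder

def rerank(query: str, candidates: list[str], keep: int = 3) -> list[str]:
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)
    return [chunk for _, chunk in ranked[:keep]]
```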

Knowledge Reasoning

Once the relevant knowledge is extracted, reasoning over this information becomes essential. This subsection delves into different reasoning techniques and how they can be effectively combined with retrieval results to generate coherent and accurate responses.

Figure 2 illustrates the context-dependent impact of incorporating Chain of Thought (CoT) into RAG systems. While CoT aims to enhance logical reasoning, the results indicate that it does not consistently improve performance and can sometimes reduce accuracy, especially when dealing with multiple conflicting knowledge sources.

Figure 2: Impact of CoT across knowledge sources.

Table 5 presents the impact of few-shot learning on RAG systems, particularly for questions built on false premises. On these questions the model performs best in the 0-shot setting, and performance declines as more examples are introduced, which we attribute to overfitting to the exemplars and the added noise. Overall accuracy, by contrast, improves with few-shot examples, which provide task-specific guidance. Cross-domain examples further enhance generalization and reduce hallucination rates, demonstrating the value of diverse exemplars in broadening the model's adaptability across different query types.

Table 5: Impact of few-shot learning on LLM reasoning.
Category       N  Acc.    Halluc.  Miss.   Score
Overall        0  13.20%  10.50%   76.29%  2.70%
Overall        1  16.05%  12.62%   71.33%  3.43%
Overall        2  16.12%  12.98%   70.90%  3.14%
Overall        3  15.17%  12.69%   72.14%  2.48%
False Premise  0  25.00%  5.77%    69.23%  19.23%
False Premise  1  16.03%  14.10%   69.87%  1.93%
False Premise  2  16.57%  13.46%   69.87%  3.11%
False Premise  3  17.31%  12.82%   69.87%  4.49%
Cross-Domain   2  18.02%  11.23%   70.75%  6.78%
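
For reference, a few-shot prompt in this setting can be assembled by prepending N exemplar question-answer pairs to the reference information and the query; the template below is illustrative and not the exact prompt used in our experiments.

```python
# Hypothetical few-shot prompt builder; exemplar wording and layout are assumed.
def build_few_shot_prompt(question: str, context: str,
                          exemplars: list[tuple[str, str]]) -> str:
    """Prepend (question, answer) exemplars before the reference text and query."""
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in exemplars]
    blocks.append(f"Reference information:\n{context}")
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)
```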

Figure 3 illustrates the effect of query position within the prompt on RAG system performance. The results show that placing the query after the reference information not only increases accuracy but also reduces hallucination rates, suggesting that the model benefits from having more context before addressing the query.

Figure 3: Impact of query position within prompt.
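
The two orderings compared in Figure 3 simply swap where the query appears relative to the reference information, as in the illustrative template below.

```python
# Toy prompt templates for the query-position comparison; wording is assumed.
def build_prompt(question: str, context: str, query_last: bool = True) -> str:
    if query_last:
        # Reference information first, query last: the ordering favored in Figure 3.
        return f"Reference information:\n{context}\n\nQuestion: {question}\nAnswer:"
    # Query first, reference information last.
    return f"Question: {question}\n\nReference information:\n{context}\n\nAnswer:"
```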

Figure 4 illustrates the impact of varying noise chunks on RAG system performance. The results show that as the number of noise chunks increases, accuracy initially dips but then improves, reaching its peak at moderate noise levels. Interestingly, while hallucination rates rise with the introduction of noise, they tend to stabilize and slightly decrease at higher noise levels. This suggests that a certain degree of noise may prompt the model to better filter out irrelevant information.

Figure 4: Performance comparison based on noise chunk quantity in RAG.
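
The noisy-context setting behind Figure 4 can be reproduced by mixing a fixed number of unrelated chunks into the retrieved set; the sampling scheme below is an assumption made for illustration.

```python
# Mix n_noise irrelevant chunks into the retrieved context and shuffle so that
# noise position does not become a confound.
import random

def add_noise_chunks(retrieved: list[str], noise_pool: list[str],
                     n_noise: int, seed: int = 0) -> list[str]:
    rng = random.Random(seed)
    noise = rng.sample(noise_pool, k=min(n_noise, len(noise_pool)))
    mixed = retrieved + noise
    rng.shuffle(mixed)
    return mixed
```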

Impact of Hyperparameter Configurations

The performance of RAG frameworks can be significantly influenced by the choice of hyperparameters. In this subsection, we present a detailed study on the impact of various hyperparameter configurations and provide recommendations for optimal settings based on different scenarios.
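
For concreteness, the hyperparameters varied below can be grouped into a single configuration object; the default values in this sketch are placeholders, not recommended settings.

```python
# Hyperparameters studied in this subsection; defaults are illustrative only.
from dataclasses import dataclass

@dataclass
class RAGConfig:
    embedding_model: str = "BAAI/bge-m3"  # embedding model (Table 6)
    chunk_size: int = 512                 # tokens per chunk (Figure 5)
    chunk_overlap: int = 64               # overlapping tokens between chunks (Figure 6)
    temperature: float = 0.0              # generation temperature (Figure 7)
    top_p: float = 1.0                    # nucleus sampling threshold (Table 8)
```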

Table 6: Performance comparison of embedding models.
Model Acc. Halluc. Miss. Score
BGE-small 24.65% 38.95% 36.40% -14.30%
BGE-large 22.83% 34.35% 42.82% -11.52%
BGE-M3 25.09% 41.21% 33.70% -16.12%

Figure 5: Effect of chunk size on RAG performance.

Figure 6: Impact of chunk overlap on RAG performance.

Figure 7: Performance across different temperature settings.

Table 7: Performance comparison between LLaMA 3.1 8B and Qwen2-7B with CoT prompting.
Model Acc. Halluc. Miss. Score
LLaMA 3.1 8B 13.20% 10.50% 76.29% 2.70%
Qwen2-7B 26.19% 27.43% 46.39% -1.23%

Table 8: Comparison of top-p settings on RAG performance.
Top-p Acc. Halluc. Miss. Score
0.6 22.17% 42.23% 35.59% -20.06%
0.7 25.53% 38.88% 35.59% -13.35%
0.8 25.75% 38.73% 35.52% -12.98%
0.9 25.89% 38.37% 35.74% -12.47%
1.0 26.11% 38.29% 35.59% -12.18%

Contact

If you have any questions, we encourage you to open GitHub issues or get in touch with us at rag_ustc@icanary.cn.