Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach for mitigating the hallucination of large language models (LLMs) by integrating external knowledge. Despite numerous efforts in this direction, most studies focus on a single type of external knowledge source, whereas real-world applications typically involve diverse knowledge from multiple sources; this setting remains underexplored. The main obstacle is the lack of a suitable dataset containing multiple knowledge sources, together with prior exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Based on this dataset, we further develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is the use of multi-granularity pruning strategies to optimize the integration of relevant information and minimize misleading context. Building upon the standardized dataset and PruningRAG, we also report a series of experimental results and insightful findings. Our dataset and code are publicly available here, with the aim of advancing future research in the RAG community.
External Knowledge | Method | Acc. | Halluc. | Miss. | Score |
---|---|---|---|---|---|
None | LLM-Only | 15.61% | 20.42% | 63.97% | -4.81% |
5 Web | CRAG | 7.51% | 8.68% | 83.81% | -1.16% |
5 Web | RAG-X | 24.87% | 22.10% | 53.03% | 2.77% |
Mock API | CRAG | 8.53% | 1.60% | 89.86% | 6.93% |
Mock API | RAG-X | 29.03% | 10.86% | 60.10% | 18.16% |
5 Web + Mock API | CRAG | 15.10% | 7.95% | 76.95% | 7.15% |
5 Web + Mock API | RAG-X | 32.53% | 23.12% | 44.35% | 9.41% |
50 Web + Mock API | CRAG | 14.22% | 8.90% | 76.88% | 5.32% |
50 Web + Mock API | RAG-X | 35.16% | 26.26% | 38.58% | 8.90% |
This section discusses the importance of selecting the right knowledge sources for RAG frameworks. The effectiveness of the generated responses depends heavily on the quality and relevance of the external knowledge sources. Various selection strategies are compared to determine the optimal method for different types of queries.
Phase | Experiment Setting | Acc. | Halluc. | Miss. | Score |
---|---|---|---|---|---|
Manual Setting | LLM | 13.86% | 16.70% | 69.44% | -2.84% |
Manual Setting | WEB | 16.63% | 12.40% | 70.97% | 4.23% |
Manual Setting | Mock API | 29.18% | 10.14% | 55.26% | 19.04% |
Manual Setting | ALL | 17.07% | 8.83% | 74.11% | 8.24% |
Manual Setting | LLM+WEB | 13.20% | 10.50% | 76.29% | 2.70% |
Manual Setting | LLM+Mock API | 21.88% | 10.21% | 67.91% | 11.67% |
Manual Setting | LLM+ALL | 17.72% | 8.83% | 73.45% | 8.90% |
Predefined Path | LLM+WEB | 24.00% | 22.61% | 53.39% | 1.39% |
Predefined Path | LLM+Mock API | 36.62% | 25.97% | 37.42% | 10.65% |
Predefined Path | LLM+ALL | 36.62% | 39.82% | 23.56% | -3.20% |
Agent-Based Path | Agent | 25.31% | 21.59% | 53.10% | 3.72% |
Retrieval is a crucial step in the RAG process. In this subsection, we analyze different retrieval techniques and their performance in extracting relevant information from knowledge sources. We also explore the trade-offs between speed and accuracy in various retrieval methods.
Table 3 compares the performance of the RAG system with and without a broad retrieval step in the setting where each query is supplemented with 50 web pages, far more than the 5 used in other experiments. The results show that the broad retrieval phase substantially improves efficiency, cutting processing time by more than an order of magnitude: sparse retrieval methods such as BM25 first filter out irrelevant data, so that dense retrieval only has to score a small, targeted subset of the external knowledge.
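The sparse-then-dense cascade described above can be sketched in a few lines of stdlib Python. The BM25 scorer follows the standard Okapi formula; the token-overlap `overlap` function is only a stand-in for a real embedding-based dense scorer:

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Okapi BM25 relevance score of each document in `corpus` for `query`."""
    docs = [d.lower().split() for d in corpus]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query.lower().split():
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def broad_then_dense(query, corpus, dense_score, keep=3):
    """Stage 1: cheap BM25 prefilter keeps `keep` candidates.
    Stage 2: the (expensive) dense scorer only sees those candidates."""
    sparse = bm25_scores(query, corpus)
    cand = sorted(range(len(corpus)), key=lambda i: sparse[i], reverse=True)[:keep]
    return max(cand, key=lambda i: dense_score(query, corpus[i]))

# Toy "dense" scorer: token overlap (a real system would use embeddings).
overlap = lambda q, d: len(set(q.split()) & set(d.split()))
```

The saving comes from stage 2 touching only `keep` documents instead of all 50 pages' worth of chunks.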
Setting | Acc. | Halluc. | Miss. | Score | Processing Time |
---|---|---|---|---|---|
Broad Retrieval (w/) | 31.71% | 41.50% | 26.79% | -9.79% | 8166.59s |
Broad Retrieval (w/o) | 28.94% | 42.65% | 28.42% | -13.71% | 93459.87s |
Figure 1 illustrates the trade-offs between dense and sparse retrieval configurations in a RAG system. While higher dense retrieval ratios boost accuracy due to richer semantic understanding, they also elevate hallucination rates, suggesting a risk of contextual misalignment. Conversely, increasing sparse retrieval may lower accuracy and fails to consistently reduce hallucinations, likely because of its dependence on surface-level keyword matching.
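One common way to realize such a dense/sparse mix is linear score fusion. The sketch below is illustrative only: the interpolation weight `alpha` and the min-max normalization are our assumptions, not the paper's exact method:

```python
def hybrid_scores(dense, sparse, alpha=0.7):
    """Blend dense and sparse relevance scores; `alpha` weights the dense side.

    Both score lists are min-max normalized first, so the two retrievers'
    otherwise arbitrary scales become comparable before interpolation."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    d, s = norm(dense), norm(sparse)
    return [alpha * di + (1 - alpha) * si for di, si in zip(d, s)]
```

Sweeping `alpha` from 0 to 1 reproduces the sparse-to-dense ratio axis of Figure 1.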
Table 4 summarizes the impact of various reranker configurations on RAG system performance, focusing on different retrieval chunk sizes. The results show that as the number of retrieved chunks increases, hallucination rates rise, while accuracy remains stable, suggesting that the reranker performs better with smaller, more focused sets.
Config. | Acc. | Halluc. | Miss. | Score |
---|---|---|---|---|
(3, 3) | 24.07% | 43.18% | 32.75% | -19.11% |
(3, 5) | 24.87% | 44.06% | 31.07% | -19.18% |
(3, 10) | 24.28% | 45.88% | 29.83% | -21.59% |
(3, 20) | 24.80% | 46.32% | 28.88% | -21.52% |
(3, ALL) | 24.95% | 46.32% | 28.74% | -21.37% |
Note: (3, X) denotes retrieving X candidate chunks, which the reranker then reduces to the top 3; (3, ALL) passes all chunks to the reranker directly.
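Read as a pipeline, a (3, X) configuration retrieves X candidates and lets the reranker keep the best 3. A minimal sketch, with toy scoring callables standing in for the real retriever and reranker:

```python
def retrieve_then_rerank(query, corpus, first_score, rerank_score,
                         retrieve_k, final_k=3):
    """Stage 1: the first-stage scorer keeps `retrieve_k` candidate chunks
    (pass retrieve_k=None for the (3, ALL) setting, i.e. no prefilter).
    Stage 2: the reranker orders the candidates; keep the top `final_k`."""
    if retrieve_k is None:
        cand = list(range(len(corpus)))
    else:
        cand = sorted(range(len(corpus)),
                      key=lambda i: first_score(query, corpus[i]),
                      reverse=True)[:retrieve_k]
    return sorted(cand, key=lambda i: rerank_score(query, corpus[i]),
                  reverse=True)[:final_k]
```

Table 4's finding corresponds to growing `retrieve_k` while holding `final_k=3` fixed.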
Once the relevant knowledge is extracted, reasoning over this information becomes essential. This subsection delves into different reasoning techniques and how they can be effectively combined with retrieval results to generate coherent and accurate responses.
Figure 2 illustrates the context-dependent impact of incorporating Chain of Thought (CoT) into RAG systems. While CoT aims to enhance logical reasoning, the results indicate that it does not consistently improve performance and can sometimes reduce accuracy, especially when dealing with multiple conflicting knowledge sources.
Table 5 presents the impact of few-shot examples on RAG systems, particularly for identifying false premises across various domains. The results show that the model identifies false premises most reliably in the 0-shot setting; as more examples are introduced, performance on these questions declines, likely due to overfitting to the examples and added noise. Despite this, overall accuracy improves with few-shot examples, which provide task-specific guidance. Cross-domain examples also enhance generalization and reduce hallucination rates, demonstrating the value of diverse examples in broadening the model's adaptability across different query types.
Category | N | Acc. | Halluc. | Miss. | Score |
---|---|---|---|---|---|
Overall | 0 | 13.20% | 10.50% | 76.29% | 2.70% |
Overall | 1 | 16.05% | 12.62% | 71.33% | 3.43% |
Overall | 2 | 16.12% | 12.98% | 70.90% | 3.14% |
Overall | 3 | 15.17% | 12.69% | 72.14% | 2.48% |
False Premise | 0 | 25.00% | 5.77% | 69.23% | 19.23% |
False Premise | 1 | 16.03% | 14.10% | 69.87% | 1.93% |
False Premise | 2 | 16.57% | 13.46% | 69.87% | 3.11% |
False Premise | 3 | 17.31% | 12.82% | 69.87% | 4.49% |
Cross-Domain | 2 | 18.02% | 11.23% | 70.75% | 6.78% |
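The N-shot settings in Table 5 correspond to prepending N worked examples to the prompt. A minimal prompt builder in this style (an illustrative template, not the exact prompt used in the experiments):

```python
def build_prompt(question, references, examples=()):
    """Assemble a RAG prompt: N few-shot (question, answer) examples,
    then the retrieved references, then the actual query."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append("References:\n" + "\n".join(f"- {r}" for r in references))
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

Passing `examples=()` gives the 0-shot setting; the cross-domain row corresponds to drawing the example pairs from domains other than the query's.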
Figure 3 illustrates the effect of query position within the prompt on RAG system performance. The results show that placing the query after the reference information not only increases accuracy but also reduces hallucination rates, suggesting that the model benefits from having more context before addressing the query.
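A small helper makes the two orderings concrete (purely illustrative; the field labels are our own, not the prompt format used in the experiments):

```python
def ordered_prompt(references, question, query_first=False):
    """Place the question either before or after the reference block.
    The reference-first ordering lets the model read all context
    before it encounters the query."""
    ctx = "\n".join(f"[{i + 1}] {r}" for i, r in enumerate(references))
    if query_first:
        return f"Question: {question}\n\nReferences:\n{ctx}\n\nAnswer:"
    return f"References:\n{ctx}\n\nQuestion: {question}\nAnswer:"
```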
Figure 4 illustrates the impact of varying noise chunks on RAG system performance. The results show that as the number of noise chunks increases, accuracy initially dips but then improves, reaching its peak at moderate noise levels. Interestingly, while hallucination rates rise with the introduction of noise, they tend to stabilize and slightly decrease at higher noise levels. This suggests that a certain degree of noise may prompt the model to better filter out irrelevant information.
The performance of RAG frameworks can be significantly influenced by the choice of hyperparameters. In this subsection, we present a detailed study on the impact of various hyperparameter configurations and provide recommendations for optimal settings based on different scenarios.
Embedding Model | Acc. | Halluc. | Miss. | Score |
---|---|---|---|---|
BGE-small | 24.65% | 38.95% | 36.40% | -14.30% |
BGE-large | 22.83% | 34.35% | 42.82% | -11.52% |
BGE-M3 | 25.09% | 41.21% | 33.70% | -16.12% |
LLM | Acc. | Halluc. | Miss. | Score |
---|---|---|---|---|
LLaMA 3.1 8B | 13.20% | 10.50% | 76.29% | 2.70% |
Qwen2-7B | 26.19% | 27.43% | 46.39% | -1.23% |
Top-p | Acc. | Halluc. | Miss. | Score |
---|---|---|---|---|
0.6 | 22.17% | 42.23% | 35.59% | -20.06% |
0.7 | 25.53% | 38.88% | 35.59% | -13.35% |
0.8 | 25.75% | 38.73% | 35.52% | -12.98% |
0.9 | 25.89% | 38.37% | 35.74% | -12.47% |
1.0 | 26.11% | 38.29% | 35.59% | -12.18% |
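For reference, top-p (nucleus) filtering keeps the smallest set of highest-probability tokens whose cumulative mass reaches p, then renormalizes before sampling. A stdlib-only sketch:

```python
def top_p_filter(probs, p):
    """Return {token_index: renormalized_prob} for the nucleus:
    the smallest high-probability token set with cumulative mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:          # nucleus is complete
            break
    z = sum(probs[i] for i in kept)
    return {i: probs[i] / z for i in kept}
```

Sampling then draws from the returned distribution; larger p keeps more of the low-probability tail in play.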
If you have any questions, we encourage you to open GitHub issues or get in touch with us at rag_ustc@icanary.cn.