Agentic Retrieval-Augmented Generation for Reliable Multi-Step Knowledge-Intensive Question Answering

Leon Alvarez; Haoyu Yuan

Authors

Leon Alvarez, Department of Computer Science, University of North Texas, Denton, TX, USA.
Haoyu Yuan, Department of Computer Science, University of Central Florida, Orlando, FL, USA.

Keywords:

Agentic RAG, multi-step question answering, knowledge-intensive tasks, system architecture, governance, robustness, sustainability

Abstract

The emergence of retrieval-augmented generation has substantially improved the factual grounding of large language models, yet standard RAG pipelines face critical limitations when confronted with multi-step, knowledge-intensive questions that require iterative reasoning, dynamic information seeking, and synthesis across heterogeneous sources. This paper introduces the concept of agentic retrieval-augmented generation, an architectural paradigm in which the language model is endowed with autonomous planning, tool use, memory, and self-correction capabilities, thereby transforming the retrieval-generation loop into a goal-directed agentic process. We examine the system-level design choices that underpin reliable agentic RAG, including modular orchestration versus emergent agent behavior, the role of state management and external knowledge bases, and the trade-offs between latency, accuracy, and computational cost. A central contribution is the analysis of governance and infrastructure requirements for deploying such systems in high-stakes domains, covering aspects of fairness, bias propagation, transparency, and regulatory compliance. We further discuss robustness mechanisms against error accumulation and hallucination, and evaluate the sustainability implications of repeated retrieval and generation cycles. Through cross-domain illustrations from healthcare, legal reasoning, and scientific research, we demonstrate that agentic RAG can offer superior reliability for complex question answering, provided that architectural decisions are carefully aligned with operational constraints. The paper concludes with a forward-looking perspective on the need for standardized evaluation benchmarks, interoperable agent frameworks, and policy guidelines that balance innovation with accountability. By framing agentic RAG as a socio-technical infrastructure, we highlight the interplay between algorithmic design and the broader ecosystems in which these systems are embedded.
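
The goal-directed loop described in the abstract (planning, retrieval as tool use, memory, and self-correction under a cost budget) can be illustrated with a minimal sketch. This is not the paper's implementation: the names `AgentState`, `run_agent`, `toy_plan`, `toy_retrieve`, `toy_answer`, and `toy_verify` are hypothetical, the two-sentence corpus is invented for illustration, and the rule-based stubs stand in for what would be language-model and retriever calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class AgentState:
    """Working memory carried across steps (hypothetical structure)."""
    question: str
    evidence: list[str] = field(default_factory=list)  # passages gathered so far
    trace: list[str] = field(default_factory=list)     # sub-queries issued, kept for auditing


def run_agent(
    question: str,
    plan: Callable[[AgentState], Optional[str]],
    retrieve: Callable[[str], list[str]],
    answer: Callable[[AgentState], Optional[str]],
    verify: Callable[[str, AgentState], bool],
    max_steps: int = 4,
) -> tuple[Optional[str], AgentState]:
    """Plan, retrieve, draft, and self-check until verified or the budget runs out."""
    state = AgentState(question)
    for _ in range(max_steps):  # step budget bounds latency and compute cost
        query = plan(state)
        if query is None:  # planner has nothing further to look up
            break
        state.trace.append(query)
        for passage in retrieve(query):
            if passage not in state.evidence:  # avoid storing duplicates in memory
                state.evidence.append(passage)
        draft = answer(state)
        # Self-correction gate: only return drafts supported by the evidence.
        if draft is not None and verify(draft, state):
            return draft, state
    return None, state  # abstain rather than return an unverified answer


# Toy stand-ins (assumptions for illustration only).
CORPUS = [
    "Marie Curie was born in Warsaw.",
    "Warsaw is the capital of Poland.",
]


def toy_plan(state: AgentState) -> Optional[str]:
    if not state.evidence:
        return "Marie Curie"  # first hop: look up the person
    if not any("capital of" in p for p in state.evidence):
        return "Warsaw"       # second hop: look up the birthplace
    return None


def toy_retrieve(query: str) -> list[str]:
    return [p for p in CORPUS if query in p]  # substring match, not a real index


def toy_answer(state: AgentState) -> Optional[str]:
    if any("capital of Poland" in p for p in state.evidence):
        return "Poland"
    return None


def toy_verify(draft: str, state: AgentState) -> bool:
    return any(draft in p for p in state.evidence)  # crude grounding check


result, final_state = run_agent(
    "In which country was Marie Curie born?",
    toy_plan, toy_retrieve, toy_answer, toy_verify,
)
print(result, final_state.trace)
```

In this sketch the step budget and the abstain-on-failure return are the two places where the trade-off between accuracy, latency, and cost becomes explicit: a larger `max_steps` allows more retrieval hops at higher cost, while the verification gate trades coverage for fewer unsupported answers.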

