Open-Source Large Language Models for Domain-Specific Intelligent Decision Support: A Llama 3-Based Evaluation Framework

Timothy C. Perkins; Walid Anderson; Manish M. Banerjee

Authors

Timothy C. Perkins, Department of Computer Science, University of Alabama at Birmingham, Birmingham, AL, USA.
Walid Anderson, School of Information Technology, University of Cincinnati, Cincinnati, OH, USA.
Manish M. Banerjee, Department of Computer Science, Binghamton University, Binghamton, NY, USA.

Keywords:

open-source large language models, Llama 3, intelligent decision support, domain-specific evaluation, socio-technical systems, model governance, fairness, robustness, sustainability

Abstract

The rapid proliferation of large language models has transformed the landscape of intelligent decision support across numerous domains. While proprietary models have historically dominated high-stakes applications, the emergence of open-source architectures such as Llama 3 presents new opportunities for customization, transparency, and cost-effective deployment. This paper proposes a systematic evaluation framework specifically designed for open-source large language models in domain-specific intelligent decision support contexts. The framework integrates considerations of computational infrastructure, model governance, fairness, robustness, and sustainability, moving beyond traditional accuracy-centric metrics. Through a detailed analysis of architectural trade-offs, including model size, quantization, retrieval-augmented generation, and fine-tuning strategies, we examine how Llama 3 can be adapted for specialized fields such as healthcare diagnosis, financial risk assessment, and legal document analysis. The evaluation methodology employs a multi-dimensional scoring system that captures not only task performance but also inference latency, resource consumption, interpretability, and bias mitigation. We further explore the socio-technical implications of deploying open-source models within regulated environments, highlighting issues of accountability, data privacy, and model drift. By synthesizing insights from systems engineering, artificial intelligence safety, and public policy, this paper provides a comprehensive blueprint for practitioners and researchers seeking to leverage open-source language models for robust, fair, and sustainable decision support. Our findings underscore that while Llama 3 offers significant advantages in flexibility and community-driven improvement, successful domain-specific adoption requires careful orchestration of model selection, infrastructure design, and continuous monitoring. The proposed framework serves as a foundation for future empirical studies and standardized benchmarks in the open-source large language model ecosystem.
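
As an illustration only, the multi-dimensional scoring idea described in the abstract could be sketched as a weighted aggregate of normalised measurements. The abstract names the five dimensions but gives no formula, so everything else here is an assumption: the `Dimension` class, the `normalise` and `composite_score` helpers, the weights, and the worst/best bounds are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dimension:
    """One evaluation dimension (hypothetical structure, not from the paper)."""
    weight: float  # relative importance in the aggregate
    worst: float   # raw value that maps to a score of 0
    best: float    # raw value that maps to a score of 1


def normalise(value: float, dim: Dimension) -> float:
    """Map a raw measurement onto [0, 1], where 1 is best.

    Works for lower-is-better metrics too: set worst > best
    (for example, latency with worst=2000 ms and best=200 ms).
    """
    score = (value - dim.worst) / (dim.best - dim.worst)
    return min(1.0, max(0.0, score))  # clamp out-of-range measurements


def composite_score(measurements: dict, dims: dict) -> float:
    """Weighted mean of the normalised per-dimension scores."""
    missing = set(dims) - set(measurements)
    if missing:
        raise ValueError(f"missing measurements: {sorted(missing)}")
    total_weight = sum(d.weight for d in dims.values())
    weighted = sum(d.weight * normalise(measurements[name], d)
                   for name, d in dims.items())
    return weighted / total_weight


# Dimension names follow the abstract; weights and bounds are assumed values.
EXAMPLE_DIMENSIONS = {
    "task_performance": Dimension(weight=0.40, worst=0.0, best=1.0),
    "inference_latency_ms": Dimension(weight=0.15, worst=2000.0, best=200.0),
    "gpu_memory_gb": Dimension(weight=0.15, worst=80.0, best=8.0),
    "interpretability": Dimension(weight=0.15, worst=0.0, best=1.0),
    "bias_mitigation": Dimension(weight=0.15, worst=0.0, best=1.0),
}

if __name__ == "__main__":
    # Made-up measurements for a hypothetical quantized deployment.
    candidate = {
        "task_performance": 0.82,
        "inference_latency_ms": 650.0,
        "gpu_memory_gb": 24.0,
        "interpretability": 0.60,
        "bias_mitigation": 0.75,
    }
    print(f"composite score: {composite_score(candidate, EXAMPLE_DIMENSIONS):.3f}")
```

A weighted mean is only one possible aggregation. In a regulated setting it may be preferable to treat some dimensions, such as bias mitigation, as hard thresholds rather than letting a strong task score offset them.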

