Thinking-Budget Optimization for Cost-Aware Reasoning in Large Language Model Applications

Lars Rao; Jesse Jimenez

Authors

Lars Rao, Department of Computer Science, Colorado State University, Fort Collins, CO, USA.
Jesse Jimenez, Department of Computer Science, Binghamton University, Binghamton, NY, USA.

Keywords:

large language models, reasoning, cost-aware optimization, thinking budget, system architecture, sustainability, fairness

Abstract

The deployment of large language models in applications requiring multi-step reasoning has introduced a critical tension between inference quality and computational cost. While techniques such as chain-of-thought prompting and tree-of-thought search markedly improve performance on complex tasks, they do so by increasing the number of generated tokens or iterative calls per query, thereby escalating latency, energy consumption, and financial expense. This paper formalizes the concept of a thinking budget, defined as the maximum allowable compute expenditure per reasoning episode, and proposes a system-level optimization framework that dynamically allocates this budget across tasks, models, and serving infrastructure. We examine architectural strategies for budget-constrained reasoning, including adaptive token limits, early-exit mechanisms, and hierarchical verification loops. The discussion emphasizes structural trade-offs among accuracy, latency, throughput, and fairness, and explores governance mechanisms that embed budget policies into model-serving pipelines. Sustainability considerations are addressed through the lens of carbon-aware scheduling and efficiency-aware model selection, while fairness concerns arise when budget caps differentially affect subgroups with varying reasoning difficulty. Deployment challenges such as cache management, load balancing, and cost monitoring are analyzed within the context of cloud-native and edge computing environments. The paper argues that thinking-budget optimization is not merely a technical efficiency measure but a socio-technical governance instrument that shapes access, equity, and environmental impact. We conclude by outlining policy implications and future research directions for cost-aware reasoning in the era of increasingly capable but resource-intensive language models.
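As an illustration only, the following minimal Python sketch shows how two of the strategies named above, adaptive token limits and an early-exit mechanism, might be combined under a fixed thinking budget. It is not the paper's implementation. The names `Task`, `allocate_token_limits`, and `reason_with_early_exit`, the per-task `difficulty` score, the 64-token floor, and the 0.9 confidence threshold are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Task:
    name: str
    difficulty: float  # hypothetical difficulty estimate, higher means harder


def allocate_token_limits(tasks: List[Task], total_budget: int, floor: int = 64) -> Dict[str, int]:
    """Adaptive token limits: give every task a floor, then split the
    remaining budget in proportion to estimated difficulty."""
    if total_budget < floor * len(tasks):
        raise ValueError("total budget cannot cover the per-task floor")
    spare = total_budget - floor * len(tasks)
    weight = sum(t.difficulty for t in tasks)
    limits = {}
    for t in tasks:
        # Fall back to an even split when no difficulty signal is available.
        share = t.difficulty / weight if weight > 0 else 1 / len(tasks)
        limits[t.name] = floor + int(spare * share)  # rounds down, never overspends
    return limits


def reason_with_early_exit(
    step_fn: Callable[[int, int], Tuple[int, float]],
    token_limit: int,
    confidence_threshold: float = 0.9,
) -> Tuple[int, float, int]:
    """Early exit: run reasoning steps until confidence is high enough
    or the token limit is exhausted, whichever comes first.

    step_fn(step_index, remaining_tokens) is a stand-in for one model call
    and returns (tokens_used, confidence)."""
    spent, confidence, steps = 0, 0.0, 0
    while spent < token_limit and confidence < confidence_threshold:
        remaining = token_limit - spent
        used, confidence = step_fn(steps, remaining)
        if used <= 0:
            break  # guard against a step that makes no progress
        spent += min(used, remaining)  # never account for more than the cap
        steps += 1
    return spent, confidence, steps


if __name__ == "__main__":
    limits = allocate_token_limits([Task("easy", 0.2), Task("hard", 0.8)], total_budget=1128)
    print(limits)

    # Toy step function: 100 tokens per step, confidence rises by 0.25 each step.
    toy_step = lambda i, remaining: (100, 0.5 + 0.25 * i)
    print(reason_with_early_exit(toy_step, token_limit=limits["hard"]))
```

In this sketch the budget is counted in generated tokens; a deployment could equally count model calls, latency, or estimated energy, provided the allocator and the exit check use the same unit.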

