FinBen: A Holistic Financial Benchmark for Large Language Models

Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Datasets and Benchmarks Track

Authors

Qianqian Xie, Weiguang Han, Zhengyu Chen, Ruoyu Xiang, Xiao Zhang, Yueru He, Mengxi Xiao, Dong Li, Yongfu Dai, Duanyu Feng, Yijing Xu, Haoqiang Kang, Ziyan Kuang, Chenhan Yuan, Kailai Yang, Zheheng Luo, Tianlin Zhang, Zhiwei Liu, Guojun Xiong, Zhiyang Deng, Yuechen Jiang, Zhiyuan Yao, Haohang Li, Yangyang Yu, Gang Hu, Jiajia Huang, Xiao-Yang Liu, Alejandro Lopez-Lira, Benyou Wang, Yanzhao Lai, Hao Wang, Min Peng, Sophia Ananiadou, Jimin Huang

Abstract

Large language models (LLMs) have transformed natural language processing (NLP) and shown promise in various fields, yet their potential in finance is underexplored due to a lack of comprehensive benchmarks, the rapid development of LLMs, and the complexity of financial tasks. In this paper, we introduce FinBen, the first extensive open-source evaluation benchmark, including 42 datasets spanning 24 financial tasks and covering eight critical aspects: information extraction (IE), textual analysis, question answering (QA), text generation, risk management, forecasting, decision-making, and bilingual (English and Spanish) evaluation. FinBen offers several key innovations: a broader range of tasks and datasets, the first evaluation of stock trading, novel agent and Retrieval-Augmented Generation (RAG) evaluation, and two novel datasets for regulations and stock trading. Our evaluation of 21 representative LLMs, including GPT-4, ChatGPT, and the latest Gemini, reveals several key findings. While LLMs excel in IE and textual analysis, they struggle with advanced reasoning and complex tasks like text generation and forecasting. GPT-4 excels in IE and stock trading, while Gemini is better at text generation and forecasting. Instruction-tuned LLMs improve textual analysis but offer limited benefits for complex tasks such as QA. FinBen has been used to host the first financial LLMs shared task at the FinNLP-AgentScen workshop during IJCAI-2024, attracting 12 teams. Their novel solutions outperformed GPT-4, showcasing FinBen's potential to drive innovations in financial LLMs. All datasets and code are publicly available for the research community, with results shared and updated regularly on the Open Financial LLM Leaderboard.