SmallToLarge (S2L): Scalable Data Selection for Fine-tuning Large Language Models by Summarizing Training Trajectories of Small Models

Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track


Authors

Yu Yang, Siddhartha Mishra, Jeffrey Chiang, Baharan Mirzasoleiman

Abstract

Despite the effectiveness of data selection for pretraining and instruction fine-tuning large language models (LLMs), improving data efficiency in supervised fine-tuning (SFT) for specialized domains poses significant challenges due to the complexity of fine-tuning data. To bridge this gap, we introduce an effective and scalable data selection method for SFT, SmallToLarge (S2L), which trains a small model, clusters loss trajectories of the examples, and samples from these clusters to guide data selection for larger models. We prove that during fine-tuning, samples within the same loss trajectory cluster exhibit similar gradients. Then, we show that S2L subsets have a bounded gradient error w.r.t. the full data, hence guaranteeing convergence to the neighborhood of the optimal solution. We demonstrate through extensive experiments that S2L significantly improves data efficiency in SFT for mathematical problem-solving, reducing the training data requirement to just $11$% of the original MathInstruct dataset to match full-dataset performance while outperforming state-of-the-art data selection algorithms by an average of $4.7$% across $6$ in- and out-of-domain evaluation datasets. Remarkably, selecting only 50K examples for SFT, S2L achieves a $32.7$% accuracy on the challenging MATH benchmark, improving Phi-2 by $16.6$%. In clinical text summarization on the MIMIC-III dataset, S2L again outperforms training on the full dataset using only $50$% of the data. Notably, S2L can perform scalable data selection using a reference model $100\times$ smaller than the target model, proportionally reducing the computational cost.
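
To make the selection step concrete, the sketch below illustrates the idea described in the abstract under stated assumptions: per-example loss trajectories have already been recorded while fine-tuning the small reference model, k-means is used as the clustering method, and examples are sampled evenly from the clusters. The function name `select_subset`, the clustering choice, and all hyperparameters are illustrative placeholders, not the authors' reference implementation.

```python
# Minimal sketch of the S2L selection step (illustrative, not the official code):
# 1) cluster per-example loss trajectories recorded from a small proxy model,
# 2) sample a balanced subset across clusters to train the larger target model.
import numpy as np
from sklearn.cluster import KMeans


def select_subset(loss_trajectories: np.ndarray, budget: int,
                  n_clusters: int = 100, seed: int = 0) -> np.ndarray:
    """loss_trajectories: (n_examples, n_checkpoints) losses from the small model.
    Returns indices of the examples selected for fine-tuning the large model."""
    rng = np.random.default_rng(seed)
    # Cluster examples by the shape of their loss trajectories (k-means assumed here).
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(
        loss_trajectories
    )
    per_cluster = budget // n_clusters
    selected = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        take = min(per_cluster, len(members))
        if take > 0:
            selected.extend(rng.choice(members, size=take, replace=False))
    return np.array(selected)
```

In practice, the clustering granularity and per-cluster sampling rule would be tuned; the key point is that the expensive target model never participates in the selection, only the small reference model does.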