Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Shitong Shao, Zikai Zhou, Huanran Chen, Zhiqiang Shen
Dataset condensation, a concept within $\textit{data-centric learning}$, aims to efficiently transfer critical attributes from an original dataset to a synthetic version while maintaining both the diversity and realism of the synthesized data. This approach can significantly improve model training efficiency and is adaptable across multiple application areas. Previous methods in dataset condensation have faced several challenges: some incur high computational costs, which limits scalability to larger datasets ($\textit{e.g.,}$ MTT, DREAM, and TESLA), while others are confined to suboptimal design spaces, which hinders potential improvements, especially on smaller datasets ($\textit{e.g.,}$ SRe$^2$L, G-VBSM, and RDED). To address these limitations, we propose a comprehensive design-centric framework that includes specific, effective strategies such as soft category-aware matching, an adjusted learning-rate schedule, and a smaller batch size. These strategies are grounded in both empirical evidence and theoretical analysis. Our resulting approach, $\textbf{E}$lucidate $\textbf{D}$ataset $\textbf{C}$ondensation ($\textbf{EDC}$), establishes a benchmark for both small- and large-scale dataset condensation. In our testing, EDC achieves state-of-the-art accuracy, reaching 48.6% on ImageNet-1k with a ResNet-18 model at an IPC (images per class) of 10, which corresponds to a compression ratio of 0.78%. This surpasses SRe$^2$L, G-VBSM, and RDED by margins of 27.3%, 17.2%, and 6.6%, respectively. Code is available at: https://github.com/shaoshitong/EDC.
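As a quick sanity check on the reported compression ratio (assuming the standard ImageNet-1k training split of 1,281,167 images, which is not stated in the abstract itself), the 10-IPC setting works out as follows:

$$
\frac{1000 \ \text{classes} \times 10 \ \text{images/class}}{1{,}281{,}167 \ \text{training images}} \;=\; \frac{10{,}000}{1{,}281{,}167} \;\approx\; 0.78\%.
$$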