Part of Advances in Neural Information Processing Systems 37 (NeurIPS 2024) Main Conference Track
Dmitrii Avdiukhin, Vaggos Chatziafratis, Orr Fischer, Grigory Yaroslavtsev
We study the embedding dimension of distance comparison data in two settings: contrastive learning and $k$-nearest neighbors ($k$-NN). In both cases, the goal is to find the smallest dimension $d$ of an $\ell_p$-space in which a given dataset can be represented. We show that the arboricity of the associated graphs plays a key role in designing embeddings. Using this approach, for the most frequently used $\ell_2$-distance, we get matching upper and lower bounds in both settings.

In contrastive learning, we are given $m$ labeled samples of the form $(x_i, y_i^+, z_i^-)$, representing the fact that the positive example $y_i$ is closer to the anchor $x_i$ than the negative example $z_i$. We show that for representing such a dataset in:
- $\ell_2$: $d = \Theta(\sqrt{m})$ is necessary and sufficient.
- $\ell_p$ for $p \ge 1$: $d = O(m)$ is sufficient and $d = \tilde \Omega(\sqrt{m})$ is necessary.
- $\ell_\infty$: $d = O(m^{2/3})$ is sufficient and $d = \tilde \Omega(\sqrt{m})$ is necessary.

We also give results for the more general scenario when $t$ negatives are allowed.

In $k$-NN, for each of the $n$ data points we are given an ordered set of the $k$ closest points. We show that for preserving the ordering of the $k$-NN of every point in:
- $\ell_2$: $d = \Theta(k)$ is necessary and sufficient.
- $\ell_p$ for $p \ge 1$: $d = \tilde O(k^2)$ is sufficient and $d = \tilde \Omega(k)$ is necessary.
- $\ell_\infty$: $d = \tilde \Omega(k)$ is necessary.

Furthermore, if the goal is not just to preserve the ordering of the $k$-NN but also to keep them as the nearest neighbors, then $d = \tilde O(\mathrm{poly}(k))$ suffices in $\ell_p$ for $p \ge 1$.
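For concreteness, here is a minimal sketch (in Python with NumPy; the function name, data layout, and index-based triplets are our own illustration, not from the paper) of what it means for a $d$-dimensional embedding to represent a contrastive dataset: for every labeled triple, the anchor must be strictly closer to its positive than to its negative in the chosen $\ell_p$ distance.

```python
import numpy as np

def satisfies_triplets(X, triplets, p=2):
    """Check whether an embedding X (an n x d array of points) satisfies
    every contrastive constraint ||x_a - x_pos||_p < ||x_a - x_neg||_p.

    `triplets` is an iterable of index triples (anchor, positive, negative);
    `p` may be any value accepted by np.linalg.norm, e.g. 1, 2, or np.inf.
    """
    for a, pos, neg in triplets:
        d_pos = np.linalg.norm(X[a] - X[pos], ord=p)
        d_neg = np.linalg.norm(X[a] - X[neg], ord=p)
        if not d_pos < d_neg:
            return False
    return True
```

The paper's question is then: given $m$ such triples, how small can $d$ be so that some $X \in \mathbb{R}^{n \times d}$ makes this check pass?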
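Similarly, a sketch of the two $k$-NN requirements distinguished in the abstract (again an illustrative Python/NumPy check, not the authors' code): the weaker requirement only asks that the prescribed neighbors appear in the prescribed order of distance, while the stronger one additionally asks that they remain the $k$ nearest neighbors in the embedding.

```python
import numpy as np

def preserves_knn(X, knn, p=2, keep_as_nearest=False):
    """Check whether the embedding X (an n x d array) respects prescribed k-NN data.

    knn[i] is the ordered list of the k nearest neighbors of point i.
    With keep_as_nearest=False, only the relative ordering of the listed
    neighbors must hold; with keep_as_nearest=True, the listed points must
    also remain the k nearest neighbors of i in the embedding.
    """
    n = len(X)
    for i, neighbors in enumerate(knn):
        dists = [np.linalg.norm(X[i] - X[j], ord=p) for j in neighbors]
        # Listed neighbors must appear in strictly increasing order of distance.
        if any(d1 >= d2 for d1, d2 in zip(dists, dists[1:])):
            return False
        if keep_as_nearest:
            # Every unlisted point must be farther from i than the k-th neighbor.
            others = set(range(n)) - set(neighbors) - {i}
            if any(np.linalg.norm(X[i] - X[j], ord=p) <= dists[-1] for j in others):
                return False
    return True
```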