NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
[Post Author-Response Update] I believe that my concerns about the lack of random baselines in the original submission are partially addressed by the new experiments provided in the rebuttal. While I feel that the new random baselines significantly strengthen the paper's results on CIFAR-100, random baselines are not provided for CIFAR-10, SVHN, or ImageNet. I've updated my score from a 6 to a 7, based on the random baselines for CIFAR-100 and the authors' promise to clarify their evaluation measure in the final submission.

[Originality] The search space used to derive data augmentation policies is reused from previous work (Cubuk et al.'s AutoAugment), with appropriate citations. However, Cubuk et al.'s original algorithm is extremely resource-intensive. The main contribution of this paper is an algorithm that can operate on the same search space and come up with data augmentation schemes orders of magnitude more efficiently. The most closely related work I'm aware of is Population Based Augmentation (ICML 2019), which tries to solve the same problem in a different way. It seems like there's not yet a large body of work in this area, and the submission's solution seems novel. Related work appears to be adequately cited.

[Quality] Empirical evaluations of the proposed algorithm are conducted on four different datasets, and the results appear to be technically sound. On three of the four datasets, empirical results are roughly on par with existing results from AutoAugment and Population Based Augmentation (PBA). On the fourth dataset (CIFAR-100), results are slightly worse than AutoAugment/PBA in some cases, and the submission provides a discussion of these results. This is an indication that "authors are careful and honest about evaluating both the strengths and weaknesses of their work." (Quote taken from the scoring rubric.) The submission also provides experiments showing how the empirical results change when two hyper-parameters are varied: (i) the number of data augmentation sub-policies used to train the final model, and (ii) the size of the dataset used to select augmentation sub-policies. These are likely to be of practical interest for anyone who wants to build on the proposed algorithm. On the negative side, without random baselines, it was difficult for me to tell what fraction of the quality improvements came from the search space vs. the proposed algorithm. It would be very helpful to add the accuracies of models trained on 25, 50, or 100 sub-policies selected uniformly at random from the AutoAugment search space. Basically: is the proposed algorithm able to outperform random sampling?

[Clarity] For the most part, the paper is clear, well-organized, and polished. However, I struggled to understand the exact criterion that was used to select and score augmentation sub-policies from the AutoAugment search space. Even after reading the "Search Strategy" section of the paper for a third time, I'm not entirely sure that I understand the proposed criterion. My current impression (based on Equation 3 in Section 3.2.2) is that we first train a model from scratch without data augmentation, then select an augmentation rule that minimizes the model's loss on augmented images from a held-out validation set. If this is correct, does it cause problems for a data augmentation sub-policy like CutOut regularization that makes the model's job harder by removing useful information from the input image?
(I might be missing something, but it seems like the model would have a high loss if we evaluated it on a batch of input images augmented using CutOut, and therefore CutOut would never be selected as a sub-policy.) A minimal sketch of my reading of the criterion is included at the end of this review.

[Significance] If the paper's results hold up, they are likely to be of broad interest for people who want to find better data augmentation policies on new problems and new domains.

[Notes on reproducibility checklist] I'm a bit confused about why the authors responded "yes" to the reproducibility checklist item "A description of results with central tendency (e.g. mean) & variation (e.g. stddev)." I might've missed something, but I didn't see variance/stddev numbers reported in the paper (e.g., in Tables 2, 3, 4, or 5). The reproducibility checklist indicates that source code is (or will be made) available. However, I couldn't find any source code attached to the submission.
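As promised above, here is a minimal sketch of how I currently understand the selection criterion of Equation 3. The names (`model`, `subpolicy`, `val_loader`) are my own placeholders rather than the authors' code, and the scoring rule is my interpretation of the paper, not necessarily what the submission actually implements:

```python
import torch
import torch.nn.functional as F

def score_subpolicy(model, subpolicy, val_loader, device="cpu"):
    """Score a candidate sub-policy by the average loss of a model that was
    trained WITHOUT augmentation, evaluated on AUGMENTED held-out images
    (my reading of Equation 3; lower loss => sub-policy is preferred)."""
    model.eval()
    total_loss, n = 0.0, 0
    with torch.no_grad():
        for images, labels in val_loader:
            # The candidate sub-policy is applied only to the held-out images;
            # the model itself never saw augmented data during training.
            augmented = torch.stack([subpolicy(img) for img in images])
            logits = model(augmented.to(device))
            total_loss += F.cross_entropy(
                logits, labels.to(device), reduction="sum"
            ).item()
            n += labels.size(0)
    return total_loss / n
```

Under this reading, a sub-policy that deliberately destroys information (e.g., CutOut) should raise the validation loss of the non-augmented model, rank poorly, and never be selected, which is exactly the behaviour I am asking the authors to clarify.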
Reviewer 2
This paper introduces a new search approach for learning data augmentation policies for image recognition tasks. The key difference from AutoAugment is that, during the policy learning phase, the augmentation policy is applied to the validation set rather than the training set. This modification removes the need for repeatedly training child-model weights and thus improves search efficiency.

The paper is clearly written. The experiments seem sound and are similar in setup to previous work. The performance is comparable to AutoAugment on three image classification datasets (ImageNet, CIFAR, and SVHN). Regarding the results on ImageNet, it would be interesting to see the performance of the proposed method on other types of neural network architectures (e.g., DenseNet, MobileNet, ShuffleNet, EfficientNet, etc.).

However, I have a concern about the proposed method. It is not clear to me why augmentation policies that are optimized to match the density of two training data splits can improve generalization performance. To my understanding, applying strong data augmentation increases the distance between the augmented dataset and the original dataset, yet such strong augmentation is exactly what is useful when training large neural networks. In Equation 2, the model parameters are trained on the original (not augmented) training images and the augmentation policies are applied to the validation images. However, when the learned augmentation policies are actually used, the model parameters are trained on augmented training images and the validation set is not augmented. This inconsistency looks strange to me (see the sketch of my reading at the end of this review). I am not sure whether the good results come from the proposed method or from the good search space. I hope the authors can provide more theoretical or empirical analysis explaining how the proposed method leads to better generalization ability.

[Post-Author-Feedback-Response] I increase the score from 5 to 6 based on the author feedback. But I think more thorough comparisons between FAA and random baselines must be provided in the revision.
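To make the inconsistency I am worried about explicit, here is my reading of the two phases. The notation (D_train, D_valid, T, L) is mine and may not match the paper's Equation 2 exactly:

```latex
% Search phase (my reading of Equation 2): the model is fit on the
% un-augmented split, and a candidate policy \mathcal{T} is scored on
% the augmented held-out split.
\theta^{\ast} = \arg\min_{\theta} \, \mathcal{L}\bigl(\theta \mid \mathcal{D}_{\mathrm{train}}\bigr),
\qquad
\mathcal{T}^{\ast} = \arg\min_{\mathcal{T}} \, \mathcal{L}\bigl(\theta^{\ast} \mid \mathcal{T}(\mathcal{D}_{\mathrm{valid}})\bigr)

% Final training: the roles are swapped. The deployed model is trained
% on augmented data and evaluated on un-augmented data.
\theta_{\mathrm{final}} = \arg\min_{\theta} \, \mathcal{L}\bigl(\theta \mid \mathcal{T}^{\ast}(\mathcal{D}_{\mathrm{train}})\bigr)
```

If this reading is correct, the quantity minimized during the search is not obviously a proxy for the generalization error of the final model, which is why I would like to see more analysis.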
Reviewer 3
*** After reading Author Feedback *** After reading the Author Response, I have reconsidered my assessment of the paper and increase my score from a 3 to a 4. Below, I reiterate my main reasons for the low score and how the authors have addressed them.

1. Lack of comparisons against the random search baselines. The authors provided comparisons against two random baselines, namely randomly pre-selected augmentations (RPSA) and random augmentations (RA), and hence fulfilled my request. However, these numbers (in error rates, RPSA-25: 17.42, RPSA-50: 17.50, RA: 17.60, FAA: 17.15) clearly confirm my point that FAA is not *much* better than the random baselines. Note that while the authors wrote in their Author Feedback that each experiment is repeated 20 times, they did not provide the standard deviations of these numbers. Furthermore, their FAA error rate is now 17.15, while in the submitted paper (Table 3) it was 17.30, suggesting that the variance of these experiments can be large. Taking all of this into account, I am *not* convinced that FAA is better than random search baselines. Last but not least, the comparison against the random search baselines is only provided for CIFAR-100. How about CIFAR-10 and SVHN? I can sympathize with the authors that they could not finish these baselines for ImageNet within the 1 week allowed for Author Feedback (still, a comparison should be done), but given that the improvement on CIFAR-100 is not that significant, I think the comparison should be carried out for CIFAR-10 and SVHN as well. Also, standard deviations should be reported.

2. Impractical implementation. While the authors have provided some running times in the Author Feedback, my concern about the training time remains unaddressed. Specifically, in the Author Feedback, the authors provided the training time for CIFAR-10 and CIFAR-100, but not for ImageNet. I am personally quite familiar with the implementations of the policies from AutoAugment, and I share the authors' observation, i.e., the overhead for CIFAR-10/100 and SVHN is not too bad. However, the real concern is with ImageNet, where the image size is 224x224, which makes the preprocessing time much longer than that of CIFAR-10/100 and SVHN, where the image size is 32x32. If we take this overhead into account, then the improvement that FAA delivers, in *training time*, is probably negligible.

That said, since the authors have (partially) provided the comparisons against the baseline, I think it's fair for me to increase my score to 4.

Strengths. This paper targets a crucial weakness of AutoAugment, namely, the computational resources required to find the desired augmentation policy. The Fast AutoAugment method introduced in this paper indeed reduces the required resources, whilst achieving similar *accuracy* to AutoAugment on CIFAR-10, CIFAR-100, SVHN, and ImageNet.

Weaknesses. This paper has many problems. I identify the following. The comparisons against AutoAugment are not apples-to-apples. Specifically, the total number of sub-policies differs between Fast AutoAugment and AutoAugment. From Algorithm 1 of Fast AutoAugment (FAA), FAA ultimately returns N*K sub-policies. From Lines 168-170, N=10 and K=5, and hence FAA returns a policy that has 50 sub-policies. From the open-sourced code of AutoAugment (AA), AA uses only 25 sub-policies. Using more sub-policies, especially when combined with more training epochs, can make a difference in accuracy. A missing baseline is (uniformly) random search.
In the original AutoAugment paper, Cubuk et al. (2019) showed that random search is not much worse than AA. I'm not convinced that random search is much worse than FAA. AA's contributions include the design of the search space of operations, but FAA's contribution is *a search algorithm*, so FAA should include this baseline. In fact, an obvious random search baseline is to train one model from scratch and, at each training step, for each image in a minibatch, uniformly sample a sub-policy from the search space and apply it to that image, independently (a sketch of this baseline is given at the end of this review). I believe FAA will not beat this baseline if the baseline is trained for multiple epochs. While I am aware that "training for more epochs" is an unfair comparison, in the end, what we care about is the time required for a model to reach a given accuracy, which makes this baseline very relevant.

Last but not least, I want to mention that the FAA method (as well as the AA method, on which FAA relies heavily) is *impractical* to implement. Judging from the released source code of FAA, the augmented images are generated online during the training process. I suspect this is extremely slow, perhaps slow enough to render FAA's policies not useful for subsequent work. I am willing to change my opinion about this point, should the authors provide training times for the policies found by FAA.
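For concreteness, the random baseline I have in mind is nothing more than the following training loop. This is only a sketch: `subpolicy_pool` stands for the list of sub-policies expressible in the AutoAugment search space, and `model`, `train_loader`, and `optimizer` are placeholders, not names from the authors' code:

```python
import random

import torch
import torch.nn.functional as F

def train_with_random_subpolicies(model, train_loader, subpolicy_pool,
                                  optimizer, epochs, device="cpu"):
    """Random-search baseline: no policy search at all. At every step,
    each image in the minibatch is augmented with an independently and
    uniformly sampled sub-policy from the search space."""
    model.to(device)
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            # Sample one sub-policy per image, independently and uniformly.
            augmented = torch.stack(
                [random.choice(subpolicy_pool)(img) for img in images]
            ).to(device)
            optimizer.zero_grad()
            loss = F.cross_entropy(model(augmented), labels.to(device))
            loss.backward()
            optimizer.step()
```

Since this baseline adds no search cost whatsoever, beating it in accuracy at matched wall-clock training time is, in my view, the minimum bar for a method whose stated contribution is the search algorithm.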