Submitted by
Assigned_Reviewer_3
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a simple latent factor model for
one-shot learning with continuous outputs where very few observations are
available. Specifically, it derives risk approximations in an asymptotic
regime where the number of training examples is fixed and the number of
features in the X space diverges. Building on the principal component
regression (PCR) estimator, two further estimators are proposed: a
bias-corrected estimator and a so-called "oracle" estimator, and bounds for
the risks of these estimators are derived. These bounds provide insight into
the significance of various parameters relevant to one-shot learning. The
major contribution of this paper is the bounds derived for three estimators:
the principal component regression estimator, the bias-corrected estimator,
and the oracle estimator, which assumes the first principal component is
known.
Q2: Please summarize your review in 1-2
sentences
Bounds on the risk of the PCR estimator in one-shot
learning are derived. A bias-corrected estimator based on the PCR estimator
is proposed with better consistency properties. Submitted by
Assigned_Reviewer_5
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
UPDATE: After discussion with the other reviewers,
I've lowered my score a little. The simple estimator suggested by another
reviewer does seem to perform surprisingly well in the idealized model
studied in this paper. I interpret this to mean that the model is a bit
unrealistic. However, the PCR estimator still seems interesting, and the
authors' feedback has helped clarify that.
This paper studies a
linear latent factor model, where one observes "examples" consisting of
high-dimensional vectors x1, x2, ... in R^d, and one wants to predict
"labels" consisting of scalars y1, y2, ... in R. Crucially, one is working
in the "one-shot learning" regime, where the number of training examples n
is small (say, n=2 or n=10), while the dimension d is large (say, d ->
infinity). This paper considers a well-known method, principal component
regression (PCR), and proves some somewhat surprising theoretical results:
PCR is inconsistent, but a modified PCR estimator is weakly consistent;
the modified estimator is obtained by "expanding" the PCR estimator, which
is different from the usual "shrinkage" methods for high-dimensional data.
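For concreteness, below is a minimal sketch of one-component PCR, the baseline estimator the paper starts from, in the small-n, large-d regime. The data and dimensions are illustrative placeholders, and the paper's bias-corrected ("expanded") estimator is not reproduced here because its correction factor is defined in the paper itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative one-shot regime: very few examples, many features.
n, d = 2, 500
X = rng.standard_normal((n, d))   # placeholder predictors (n x d)
y = rng.standard_normal(n)        # placeholder responses

# One-component PCR: regress y on the scores along the leading
# principal component direction of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
u_hat = Vt[0]                                # leading PC direction (unit norm)
scores = X @ u_hat                           # component scores, one per example
alpha_hat = (scores @ y) / (scores @ scores)
beta_pcr = alpha_hat * u_hat                 # plain (uncorrected) PCR estimator

# A new example x_new would be predicted as x_new @ beta_pcr.
```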
Quality: The paper appears to be correct, though I have not
checked the calculations in the supplementary material. I do have one
question, which is more conceptual: how would the methods in this paper,
which use PCR with 1 component, compare to PCR with more components?
In section 3, line 117, the authors argue that it is natural to
restrict attention to PCR with 1 component, because when one computes the
expectation of the matrix X^T X, one finds that the interesting
information (the vector u) is in the leading eigenvector. However, I am
not quite convinced by this argument, because in the "one-shot learning"
regime, where n << d, the matrix X^T X is not at all close to its
expectation.
Moreover, in practice, people have reported cases
where PCR runs into exactly this problem -- when one keeps only the first
few principal components, the resulting subspace does not contain the
desired solution (see, e.g., Hadi and Ling, Amer. Stat. 52(1), p.15, Feb.
1998). Such behavior may be ruled out by the assumptions made in this
paper -- specifically, the assumption that the vectors x_i have a strong
"signal-to-noise" ratio. Still, this seems like a significant point to
mention, to illustrate that rather strong assumptions are needed for
one-shot learning.
Clarity: The paper is clear and well organized.
Some minor suggestions: move the definition of the bias-corrected PCR
estimator to section 3 (instead of burying it in the middle of section 4);
in table 2, when n=4, in the "oracle" column, there may be a typo (0.43%
should be 4.3%); in section 8, line 424, the matrix S should possibly be
transposed.
Originality: The main novelty of the paper seems to be
the "one-shot learning" setting, and the modified PCR estimator. The
latent factor model studied in this paper (and in particular the
assumption that various distributions are Gaussian) seems standard and
unremarkable. Nonetheless, the proofs require significant work.
I
am not familiar with some of the related work mentioned in the
introduction (lines 52-55). It would be helpful if the authors stated in
more detail how this paper differs from the existing work.
Significance: It is impressive that the method works quite well
(at least on synthetic data) for specific small values of n, like n=2 and
n=9. It is also impressive that the theoretical bounds are not far off
from the actual behavior (though again, this is on synthetic data).
Overall this seems like a solid theory paper, and the open questions look
interesting. Q2: Please summarize your review in 1-2
sentences
This paper considers a well-studied method, principal
components regression, in a challenging setting, one-shot learning. This
seems like a solid theory paper. Submitted by
Assigned_Reviewer_7
Q1: Comments to author(s).
First provide a summary of the paper, and then address the following
criteria: Quality, clarity, originality and significance. (For detailed
reviewing guidelines, see
http://nips.cc/PaperInformation/ReviewerInstructions)
Update:
The point of my simple method was to
obtain a better understanding of the theoretical analysis. I ran the same
experiments the authors ran from section 7, with d=500,5000,500000 and the
simple method began to converge to the oracle method, and at d=500 it
certainly did not have an error greater than 500 (the error was roughly 7). I
ran all of the experiments with n=2. **edit** The focus of these experiments
was on weak convergence, as that was the theoretical analysis performed for
n=2.
For n=2 the simple method does perform very poorly for squared loss, but
performs better for absolute loss. **end edit**
I believe that
this one-shot idea is very interesting, but the bias correction has not
been thoroughly explored. The experiments show a lot of promise, which
is why I remain borderline on this paper.
====
***
This paper aims to provide an analysis of principal component
regression in the setting where the feature vectors $x$ are noisy
observations of a latent factor. The authors
let $x = v + e$ where $e$ is some corruption of the nominal feature
vector $v$; and $v = a u$ where $a \sim N(0,\eta^2 \gamma^2 d)$, while
the observations satisfy $y = (\theta/(\gamma \sqrt{d})) \langle v,u \rangle +
\xi$. This formulation is slightly different from the standard one
because the design vectors are noisy, which can pose challenges in
identifying the linear relationship between $x$ and $y$. Thus, using
the top principal components of $x$ is a standard method used to
help regularize the estimation. The paper is relevant to the ML
community. The key message of using a bias-corrected estimate of $y$
is interesting, but not necessarily new. Handling bias in regularized
methods is a common problem (cf. Regularization and variable selection
via the Elastic Net, Zou and Hastie, 2005). The authors present
theoretical analysis to justify their results. I find the paper
interesting; however I am not sure if the number of new results and
level of insights warrants acceptance.
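For concreteness, here is a minimal simulation sketch of the data-generating model as summarized above ($v = a u$, $x = v + e$, $y = (\theta/(\gamma \sqrt{d})) \langle v,u \rangle + \xi$). The parameter values and unit-variance noise terms are illustrative assumptions for this sketch, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters (assumptions for this sketch, not the paper's values).
n, d = 2, 500
eta, gamma, theta = 1.0, 1.0, 2.0

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                                   # latent unit direction

a = rng.normal(0.0, eta * gamma * np.sqrt(d), size=n)    # a ~ N(0, eta^2 gamma^2 d)
V = np.outer(a, u)                                       # nominal features v_i = a_i u
E = rng.standard_normal((n, d))                          # corruption e (unit variance assumed)
X = V + E                                                # observed features x_i = v_i + e_i

xi = rng.standard_normal(n)                              # outcome noise (unit variance assumed)
y = (theta / (gamma * np.sqrt(d))) * (V @ u) + xi        # y_i = theta/(gamma sqrt(d)) <v_i, u> + xi_i
```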
*** The result is
interesting, but lacks enough depth to provide a convincing argument
for its use. The authors present a very specific setting for the
design vectors: spiked covariance. The motivation of the bias
correction is interesting, but how do the authors expect it should
work under other models for $x$ and $y$? A simulation study of these
methods could prove useful in order to provide insights into the
behavior in the non-Gaussian setting. Furthermore, as the authors
admit, the signal in $x$ is very strong in the direction of the
regression vectors. It is conceivable that the signal in $x$ is strong in
multiple directions (especially in the high-dimensional setting). How
would the bias-correction change when adapting to multiple PCA
factors?
*** Overall, the model being used to perform the
analysis seems too simple to shed serious light on the problem. For
example, one can show risk consistency in the simple setting of
letting the largest principal component simply be equal to $x_1$. That
is, only perform PCA on the first example and regression on the
second.
The authors go through a lot of analysis from fundamentals
without providing intuition for why the shrinkage occurs. In
particular, the authors should explain why least squares (which
produces consistent estimators) must be corrected in this context.
Interestingly, the simple estimator of just taking $x_1$ to be the
principal component and performing regression on the remaining
examples performs better. Thus, rather than a shrinkage correction, a
more reasonable solution seems to be to learn the PC vectors on one
set of data and regress on the other. Can the authors provide more
intuition for why they expect their proposed method to perform better
in generic settings?
Q2: Please summarize your
review in 1-2 sentences
I find this idea interesting; however, the authors do
not provide enough intuition for why their bias correction should work in
other settings outside of their specific model.
Q1:Author
rebuttal: Please respond to any concerns raised in the reviews. There are
no constraints on how you want to argue your case, except for the fact
that your text should be limited to a maximum of 6000 characters. Note
however that reviewers and area chairs are very busy and may not read long
vague rebuttals. It is in your own interest to be concise and to the
point.
We thank the reviewers for their thoughtful comments
on the paper. While all of the reviewers' comments have been valuable,
we'd like to take this space-limited opportunity to respond to four
specific comments (numbered below) made by the reviewers.
1. One
of the reviewers suggested a very simple alternative methodology, where
the principal component direction is taken to be x_1 (the first predictor
vector) and the estimator \hat{\beta} is obtained by performing principal
component regression in this direction with the remaining n-1
observations. The reviewer further suggested that this elementary
estimator might out-perform the estimators discussed in the paper -- if
true, this could substantially negate the impact of our results.
In order to investigate these questions, we ran a simulation
analysis involving the method proposed by the reviewer and the other
methods discussed in the paper. We used the same settings as in Section 7
of the paper. The table below contains the empirical prediction error of
the various methods for d=500 and several values of n (based on 1000
datasets).
      PCR       BC       OR       X1
n=2   18.0748   7.6869   1.7507   595.8016
n=4   6.4769    0.9444   0.3514   7.1833
n=9   1.3757    0.3430   0.2612   3.9847
n=20  0.4466    0.2721   0.2403   4.1778
Here, PCR refers to the basic
principal component regression estimator discussed in the paper; BC refers
to the bias-corrected estimator; OR refers to the oracle estimator; and X1
is the method suggested by the reviewer. Note that BC substantially
outperforms X1 in all settings (PCR and OR also outperform X1 in all
settings). Results for d=5000 (not reported here) were similar -- BC
substantially outperformed X1 in all settings.
To help explain
these results, let X=(x_2,…,x_n)^T denote the matrix consisting of the
predictors x_2,…,x_n and let y=(y_2,…,y_n) denote the corresponding
responses. Then the estimator suggested by the reviewer has the form
\hat{\beta} = (x_1^TX^Ty/||Xx_1||^2)x_1. Notice that for large d and fixed
n, the denominator in \hat{\beta} satisfies
||Xx_1||^2 = O(dR + d^2 W_1 W_{n-1}),
where R is a bounded (in probability) non-negative
random variable, and W_1, W_{n-1} are independent chi-squared random
variables on 1 and n-1 degrees of freedom, respectively. The main point is
that the risk of \hat{\beta} involves inverse moments of ||Xx_1||^2 and
that instability of 1/W_1 (indeed, E(1/W_1) = \infty) contributes to
instability of \hat{\beta}.
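As a rough illustration of this instability (not part of the reported simulations), here is a sketch of the estimator \hat{\beta} = (x_1^T X^T y / ||X x_1||^2) x_1 applied to toy data from the latent factor model; all parameter values are assumptions for illustration only.

```python
import numpy as np

def x1_estimator(X_all, y_all):
    # Use x_1 as the component direction and regress with the remaining
    # n-1 observations: beta_hat = (x_1^T X^T y / ||X x_1||^2) x_1.
    x1 = X_all[0]
    X, y = X_all[1:], y_all[1:]
    scores = X @ x1                       # the (n-1)-vector X x_1
    return (scores @ y) / (scores @ scores) * x1

# Toy data from the latent factor model (illustrative parameters only).
rng = np.random.default_rng(2)
n, d = 2, 500
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
a = rng.normal(0.0, np.sqrt(d), size=n)
X_all = np.outer(a, u) + rng.standard_normal((n, d))
y_all = (2.0 / np.sqrt(d)) * a + rng.standard_normal(n)

beta_hat = x1_estimator(X_all, y_all)
# For n = 2 the denominator ||X x_1||^2 is driven by a single score, so
# 1/||X x_1||^2 can be very unstable across repetitions, consistent with
# the inverse chi-squared behavior discussed above.
```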
The simulation results and the above
discussion suggest that \hat{\beta} may not be a viable alternative to the
approaches discussed in the paper. On the other hand, other split sample
approaches to the problem could potentially be effective, but it seems
unclear how these approaches might compare to the methods proposed in the
paper. We believe that this is an interesting topic for future research.
2. In the submitted paper, we consider a simple latent factor
model with Gaussian data. One of the reviewers asks "Can the authors
provide more intuition for why they expect their proposed method to
perform better in generic settings?" Our key theoretical results primarily
depend on lower-order moments of the random variables involved and basic
facts about the eigenvalues and eigenvectors of random matrices. Methods
relying only on lower-order moments are frequently useful in generic
settings. Additionally, a great deal of recent work has demonstrated the
universality of many properties of eigenvalues and eigenvectors of
high-dimensional random matrices (see, for instance, Vershynin (2012)
"Introduction to the non-asymptotic analysis of random matrices" or Pillai
and Yin (2013) "Universality of covariance matrices"). Taken together,
these observations suggest that the proposed methods may perform quite
well under more relaxed distributional assumptions. (Additionally, see 3
below for a discussion of a more general multi-factor model -- this is
another more generic setting in which methods related to those proposed in
this paper may be useful.)
3. One of the reviewers asks "How would
the methods in this paper, which use PCR with 1 component, compare to PCR
with more components?" The results in this paper may be extended to a more
general k-factor model, which corresponds to PCR with more than 1
component. In the k-factor model, k > 1 components link the predictors
and outcomes, and k components are utilized in the regression. In the
extended journal version of this paper (currently in preparation), we show
that the "usual" PCR with k components is inconsistent, when n is fixed
and d \to \infty, but that a simple bias-corrected k-component PCR method
is consistent. Similar to the single component model studied here, in the
k-component model, the bias-corrected estimator involves "correcting" for
the contribution of noise in the sample eigenvectors.
4. One of
the reviewers notes that "In the 'one-shot learning' regime, where n
<< d, the matrix X^TX is not at all close to its expectation" and
questions the implications of this fact for the proposed methods. This
observation is in fact one of the key issues that necessitates
bias-correction for consistent prediction in the one-shot regime: If X^TX
were closer to its expectation (as in "large n, small d" asymptotics),
then bias-correction would not be required for consistency. One of the
major implications of our work is that the discrepancy between X^TX and
its expectation (as manifest through noise in the sample eigenvectors and
eigenvalues) can be quantified and corrected-for, in order to achieve
effective prediction in the one-shot regime.