Paper ID: 1045
Title: Small-Variance Asymptotics for Hidden Markov Models
Reviews

Submitted by Assigned_Reviewer_4

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The paper provides a small-variance asymptotics analysis for the standard HMM as well as the infinite HMM, and in particular uses the asymptotic analysis to derive "hard clustering" optimization programs for learning and inference in both kinds of HMMs.

The exposition in the paper is excellent, and it deftly explains the framework and algorithm derivation (with two approaches, providing extra intuition pumps). The paper also explains connections to existing algorithms (for standard HMMs, Viterbi re-estimation corresponds to setting the algorithm parameter lambda to 1) and previous work in small-variance asymptotics. It clearly delineates the challenges in adapting the small-variance algorithm approaches to the (infinite) HMM (Section 3.2) and explains good methods to overcome those challenges. These HMM fitting methods are likely to be very interesting to the community, and the development here (and in the supplementary materials) is really great.
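(For readers who haven't seen it, here is a minimal sketch of the classical segmental k-means / Viterbi re-estimation baseline that the lambda = 1 case recovers --- my own illustration of the standard procedure, not the authors' exact algorithm or penalty:)

    import numpy as np

    def segmental_kmeans(X, K, n_iter=20, seed=0):
        # Classical segmental k-means / Viterbi re-estimation for a
        # Gaussian-emission HMM: alternate (i) Viterbi decoding and
        # (ii) hard re-estimation of means and transition counts.
        # X: (T, D) observation sequence; K: number of states.
        rng = np.random.default_rng(seed)
        T = len(X)
        mu = X[rng.choice(T, size=K, replace=False)]   # init means from data
        logA = np.full((K, K), -np.log(K))             # uniform transitions

        for _ in range(n_iter):
            # Emission scores: negative squared error, shape (T, K).
            ll = -0.5 * ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
            # Viterbi forward pass.
            score, back = ll[0].copy(), np.zeros((T, K), dtype=int)
            for t in range(1, T):
                cand = score[:, None] + logA           # (prev, cur)
                back[t] = cand.argmax(0)
                score = cand.max(0) + ll[t]
            # Backtrace the hard state sequence.
            z = np.empty(T, dtype=int)
            z[-1] = score.argmax()
            for t in range(T - 1, 0, -1):
                z[t - 1] = back[t, z[t]]
            # Hard M-step: means are segment averages, transitions are counts.
            for k in range(K):
                if (z == k).any():
                    mu[k] = X[z == k].mean(0)
            counts = np.ones((K, K))                   # add-one smoothing
            for t in range(1, T):
                counts[z[t - 1], z[t]] += 1
            logA = np.log(counts / counts.sum(1, keepdims=True))
        return z, mu

Per the correspondence noted above, the paper's lambda = 1 setting recovers exactly this alternation.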


However, the experiments are weak. The data, both real and synthetic, are very easy, and the sole baseline (the beam sampler) seems unlikely to do justice to existing techniques.

Even though the synthetic data are especially easy, the beam sampler does very poorly. I ran the same experiment using a library that came up as the first result for 'python hmm sampling' (pyhsmm), which primarily uses weak-limit samplers (which have been used in many application papers and certainly seem well suited to these experiments), and got much better results in a fraction of the time. Direct comparison to the results in the paper is hard because no formula or citation is provided for the NMI performance metric, and the timing plot in Figure 1 has no units (just "log of the time" on the vertical axis), but the weak-limit sampler appears to decode the synthetic sequences nearly perfectly in just 0.03 seconds per Gibbs sweep on my laptop. Given the papers showing successful applications of weak-limit samplers (often without the dependence on good initializations or the slow convergence mentioned in this paper's abstract and introduction) and the easy availability of code (there are several libraries online), it seems misleading to offer no comparison or comment; this paper provides only a surprisingly weak showing of the beam sampler.
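For concreteness, here is the skeleton of the comparison I ran. The NMI I computed is the standard normalized mutual information, NMI(z, z') = I(z; z') / sqrt(H(z) H(z')), as implemented in scikit-learn. The pyhsmm calls follow my recollection of that library's README (WeakLimitHDPHMM, NIW-Gaussian observation distributions, resample_model), so treat the exact names and hyperparameters as assumptions, and the synthetic data as a stand-in for the paper's, which I could not reproduce exactly:

    import numpy as np
    from sklearn.metrics import normalized_mutual_info_score
    import pyhsmm
    import pyhsmm.basic.distributions as distributions

    # Toy stand-in data: 5 well-separated Gaussian states, sticky transitions.
    rng = np.random.default_rng(0)
    K, T, D = 5, 1000, 2
    means = 4.0 * rng.standard_normal((K, D))
    A = np.full((K, K), 0.02)
    np.fill_diagonal(A, 1.0 - 0.02 * (K - 1))
    z_true = np.zeros(T, dtype=int)
    for t in range(1, T):
        z_true[t] = rng.choice(K, p=A[z_true[t - 1]])
    data = means[z_true] + 0.3 * rng.standard_normal((T, D))

    # Weak-limit HDP-HMM sampler, truncated at Nmax states.
    Nmax = 25
    obs_distns = [distributions.Gaussian(mu_0=np.zeros(D), sigma_0=np.eye(D),
                                         kappa_0=0.25, nu_0=D + 2)
                  for _ in range(Nmax)]
    model = pyhsmm.models.WeakLimitHDPHMM(alpha=6.0, gamma=6.0,
                                          init_state_concentration=1.0,
                                          obs_distns=obs_distns)
    model.add_data(data)

    for _ in range(200):        # Gibbs sweeps; ~0.03 s each on my laptop
        model.resample_model()

    z_hat = model.stateseqs[0]  # last sample of the states (name from memory)
    print("NMI:", normalized_mutual_info_score(z_true, z_hat))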

There are other weaknesses and points to clarify in the experiments section:

(1) "For these datasets, we also observe that the EM algorithm for the standard HMM (not reported in the figure) can easily output a smaller number of states than the ground truth, which yields a smaller NMI score." Why not show the results? The synthetic dataset should be easy for EM, so the omission is strange and unexplained.

(2) "Next we demonstrate... along with comparison to the standard HMM." I don't see any lines in Figure 2 corresponding to the standard HMM methods. Were they left out accidentally?

(3) Was the time for the grid search over algorithm parameters (mentioned in the third paragraph of Section 4) included in the reported run times (which seems appropriate, since the iHMM methods are supposedly fitting concentration parameters)? What are the units for the vertical axis in the right-hand panel of Figure 2?

Timing experiments can be a rabbit hole and a line must be drawn somewhere, but it is not clear what can be learned from the given results, and as presented they do not support the paper's claims about relative algorithm performance. These issues with the experimental evaluation could easily be remedied by toning down the claims about existing methods and instead emphasizing that the experiments show the proposed method is competitive with an iHMM sampling method (and perhaps also with EM, if those results were reported).


Despite those issues, the paper remains a great treatment of an exciting new analytical and algorithmic perspective on the well-loved (i)HMM.
Q2: Please summarize your review in 1-2 sentences
The paper provides an excellent treatment of a subject of great interest to the NIPS community, though the experiments should be improved or the claims about the experiments adjusted.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
The authors use by-now well-established tools for small-variance
asymptotics of latent-variable models and apply them to Hidden
Markov Models (HMMs) and their infinite, nonparametric
counterparts. For the former model class, they recover segmental
k-means, or Viterbi re-estimation, as a special case. For the
infinite HMM they obtain an alternative training algorithm which
outperforms a fast sampling scheme in their evaluation.

The methodological results are not very surprising, but they
constitute a natural extension of small-variance asymptotics that
is relevant given the widespread use of HMMs.

The paper has two major weaknesses.

Organization: Given that small-variance asymptotics have already
been considered for several types of latent-variable models, a more
detailed description of the contribution---mostly Section 3.2, and
also the details in Section 3.1---would have been preferable. The
quality of the writing also varies: in particular, Section 3.2 is
much weaker (and more hand-wavy) than the other parts and gives
insufficient detail, to the point of not being reproducible. A more
careful balance of basics, introduction, and core contribution
would improve the paper very much.

Evaluation: I disagree with the choice of computational examples
and find the level of detail given insufficient. I would have
expected an evaluation on real problems in which HMMs perform well
and where a variable number of states is needed; as it stands,
neither computational example is particularly enlightening. For
example, in most applications of HMMs, self-transition
probabilities are not zero, and getting the state durations right
is precisely what helps in many applications with very noisy data
in biology (e.g., ArrayCGH). I do not believe the examples were
cherry-picked to show the advantage of the proposed method, but
they certainly do the authors no favors in persuading scientists
who use HMMs to try their method.

Also, I would have preferred to see the number of states inferred
by both types of algorithms, and in general more details and
further experiments (for example: what happens for larger K in the
experiment in Figure 1? The beam sampler's running time barely
changes, while the running time of the proposed method increases
drastically).

A more general comment: it is well known, in the HMM literature at
least, that modeling emissions is very important. Very fast
samplers are available for HMMs with emissions modeled by infinite
mixtures (Yau et al., J. R. Stat. Soc. Series B, 2011), which
perform very well at least on the aforementioned ArrayCGH example.
It would be interesting to see how the proposed method, which I
realize could be extended to more complicated emission
distributions, would perform in comparison.
Q2: Please summarize your review in 1-2 sentences
A natural extension of small-variance asymptotics for latent-variable models to HMMs, with an evaluation that is unlikely to convince researchers who use HMMs in real applications.

Submitted by Assigned_Reviewer_8

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)

Recent work by Kulis et al. has revisited some latent variable
models to examine their ``small-variance asymptotics'' ---
the limiting behavior of their inference algorithms as the
variance tends to zero. For example, Gaussian mixture models
have been shown to behave like the k-means algorithm in this
small-variance limit. This paper carries out the same analysis
for the HMM as well as its Bayesian nonparametric version, the
infinite HMM (with a hierarchical Dirichlet process prior). The
development follows that of previous work by Jiang et al., and
the authors present intuitive combinatorial algorithms (which
look like penalized versions of k-means) in both the traditional
and infinite settings.
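(As a one-line reminder of the mechanism --- standard material, not
taken from the submission: with isotropic covariance \sigma^2 I,
the GMM E-step responsibilities

    r_{ik} = \frac{\pi_k \exp(-\|x_i - \mu_k\|^2 / 2\sigma^2)}
                  {\sum_j \pi_j \exp(-\|x_i - \mu_j\|^2 / 2\sigma^2)}
    \longrightarrow \mathbf{1}[\, k = \arg\min_j \|x_i - \mu_j\|^2 \,]
    \quad \text{as } \sigma \to 0,

collapse to hard nearest-mean assignments, so each EM iteration
reduces to a k-means/Lloyd iteration.)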

I am not very familiar with the prior work in this thread, but
the result of the current submission seems novel enough for a
NIPS publication. The technical details required to develop the
analysis, in the infinite HMM case in particular, are quite
nontrivial to work out. Additionally, the writing was mostly
understandable, though I wish the pseudocode for Section 3.2
had been spelled out in more detail. In sum, it is a nice
contribution and I recommend acceptance.

The weakest part of the paper is the experiments section. The
simulated data experiments are quite small scale and do little
to illustrate the strengths or weaknesses of the approach.
To really show off the efficiency of these combinatorial algorithms,
it seems necessary to tackle a problem with a much larger state
space.

The financial dataset likewise does not seem like a serious
experiment, as the number of states used is again 5. We also
don't *really* believe that 5-state HMMs are the `correct' way to
model this dataset, right? But my other question about this
dataset is: why would we expect the combinatorial algorithm to
perform better than the sampling-based algorithm? Is there a good
rule of thumb for deciding when one is better than the other? I
suppose this question is a bit like asking when k-means is better
than Gaussian mixture models --- and I would think that in many
cases the opposite is true.

Finally, in light of all the new activity in the community on
spectral methods for latent variable models (see, e.g., Hsu et al. COLT '09),
how do these small-variance algorithms compare? We know that they
can still get stuck in local optima...


Q2: Please summarize your review in 1-2 sentences
This is a good extension of a line of research on ``small variance asymptotics''. Despite limited experiments, I recommend acceptance.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.