Summary: The authors propose a new model, the Sequential Neural Process (SNP), which models a sequence of stochastic processes exhibiting temporal dependency. This is done by formulating a Neural Process (NP) for each time step (with shared parameters) and modelling the dependence between the latent variables of the NPs with a state-space model. As with NPs, variational inference with the reparameterisation trick is used for learning, but with a sequential variational distribution. The model is applied to a sequence of GPs whose kernel parameters evolve through time, as well as to 2D and 3D shape environments with moving objects, and is compared empirically to CGQN.

The paper tackles a novel problem, modelling a sequence of stochastic processes, and is technically sound. The writing is clear for the most part, and the results are well organised and strong. The work addresses the difficult task of modelling a dynamic scene (i.e. one with moving objects) from images of varying viewpoints, and shows significant improvements over the CGQN baseline. The ablation study for posterior dropout learning is also convincing. Overall, this is a very strong submission. Below are some suggestions that may help to improve the paper.

1. It should be made even clearer that the task of SNPs is to model a changing sequence of stochastic processes, by emphasising that SNP does NOT model a stochastic process that is itself a sequence (a time series). The latter, being a more common problem than the former, may be what a reader expects from the title and abstract, which could cause confusion. Having said that, it would be interesting to see experimentally whether the latter can be achieved in the SNP framework, e.g. by setting C_t, D_t to be singleton data points (x_t, y_t) with x_t = t. One issue is that for extrapolation one would have to deal with empty C_t for t > T. Have you explored this avenue?

2. In Section 2, where you introduce the Neural Process, you introduce the encoder by saying it is used to define the conditional prior p(z|C) (lines 70-71). It would then be helpful to clarify what the "intractable posterior" mentioned in line 75 actually is. It is proportional to p(z|C)p(Y|X,z), and one would usually denote it p(z|C,D), but you cannot here because it is not equal to the different conditional prior p(z|C ∪ D) obtained by feeding C ∪ D to the encoder. Also, in Equation (2), the variational distribution Q_phi(z|C,D) is in practice just the conditional prior p_theta(z|C ∪ D). It can be misleading to use phi for Q and theta for p, since phi = theta in this case.

3. It would be better to move Figure 1 to Section 3.3, where you describe TGQN, and to add to Section 3.1 a figure with a graphical model of SNP instead. Placing the figure for TGQN next to the description of SNP can be confusing, e.g. because a_{t-1}, which appears in the figure, has not yet been defined.

4. The example used to show that SNPs are a meta-transfer learning method (lines 141-146) is nice, but since it is not what SNPs are applied to in the experiments, it would be better to replace it with the example of the 3D environment with moving objects. The example of games against enemies can be moved to the discussion, as a more realistic scenario for a potential application of SNPs.

5. In line 177, what is meant by the sum of C_t and D_t? These two are both sets, as far as I understand.

6. In lines 183-187, you explain why transition collapse happens. Could this be made clearer by saying something along the lines of "the information about C_t is already present in the sampled z_{
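To make the notational issue in point 2 concrete, here is the NP bound as I read it (my own rendering, following the notation around Eq. (2) of the paper; the point is that the variational distribution reuses the conditional-prior encoder on C ∪ D):

```latex
% NP training objective as I read Eq. (2); notation mine.
% Q_phi is not a separate inference network: it is the same
% conditional-prior encoder applied to the union C \cup D.
\log p_\theta(Y \mid X, C)
  \;\geq\; \mathbb{E}_{Q_\phi(z \mid C, D)}\!\left[ \log p_\theta(Y \mid X, z) \right]
  \;-\; \mathrm{KL}\!\left( Q_\phi(z \mid C, D) \,\Vert\, p_\theta(z \mid C) \right),
\qquad \text{with } Q_\phi(z \mid C, D) = p_\theta(z \mid C \cup D),\; \phi = \theta.
```

Written this way, the mismatch between Q_phi(z|C,D) and the true (intractable) posterior, which is proportional to p(z|C)p(Y|X,z), is explicit, and the phi/theta parameter-sharing issue is visible in the final equality.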
Reviewer 3
This is an interesting paper, combining the features of NPs with dynamic latent-variable modelling through RSSMs.

[Originality] The precise combination of NPs with RSSMs (or similar models) appears to be novel, as does the application to a dynamic 3D data domain.

[Quality] The formulation and algorithmic model appear to be sound. I do have one particular concern with the characterisation of the posterior-dropout ELBO: while its use for TGQNs may be apposite, the idea of sampling from the prior to prevent the posterior from overfitting is not new, especially for ELBOs. It is usually referred to as defensive importance sampling [1]. Extending this temporally is relatively trivial, given that the application is conditionally independent given the (presumably global) choice of which time steps to apply it to. Also, while the model allows for changes of query viewpoint within a temporal sequence, at least going by the examples shown in the experiments there does not appear to be any change of viewpoint within a sequence. Is this by design? To what extent does the TGQN learn a general scene representation that can be queried in the spirit of GQNs? At test time, for a fixed scene setting (fixed context C, that is), do new query viewpoints produce consistent scenes, as GQN purports to do?

[Clarity] The paper is well written and well organised. The experiments were also well structured.

[Significance] I believe the results shown here, along with the algorithmic contributions, will help push the state of the art in learning dynamic scene representations forward. I hope that source code for the model and data will be made publicly available for the community to build on.

[1] Hesterberg, T. Weighted Average Importance Sampling and Defensive Mixture Distributions. Technometrics, 1995.

**Update** I have read the authors' response and am happy with the comments. Having perhaps one example with non-fixed viewpoints would definitely help showcase the features of the proposed method. I am glad the authors are committed to releasing source code for this work. The additional results presented to analyse posterior collapse also helped answer some questions I subsequently had, thanks to the thorough work done by the other reviewers.
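For reference, the defensive mixture of [1] can be sketched as follows (the mixing weight α and the notation are mine, not taken from either paper). On this reading, posterior dropout amounts to a hard, per-time-step (0/1) version of the same mixture:

```latex
% Defensive mixture proposal (Hesterberg, 1995); notation mine.
% Mixing the learned posterior with the prior guards against
% a proposal that places too little mass where the prior does.
q_{\mathrm{def}}(z) \;=\; \alpha\, p(z) \;+\; (1 - \alpha)\, q_\phi(z \mid x),
\qquad 0 < \alpha < 1.
```

Posterior dropout, as I understand it, replaces the continuous mixture weight with a Bernoulli choice per time step: z_t is drawn either from the prior or from the posterior, which is why the temporal extension follows almost immediately given the conditional independence noted above.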