Paper ID: 480
Title: Deep Alternative Neural Network: Exploring Contexts as Early as Possible for Action Recognition
The paper describes a new architecture for action recognition in videos. The basic module of the architecture consists of 3D convolutional layers (over a spatio-temporal region) alternating with a recurrent (in time) network. These alternating layers are followed by volumetric pooling to a fixed-size vector and several fully connected layers. The paper also proposes a method for determining the temporal period of an action and uses this for augmentation.
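To make sure I have understood the basic module correctly, here is a minimal sketch of how I read the alternating 3D-convolution/recurrent design; the layer names, kernel sizes, and unrolling depth T are my own assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AlternativeLayer(nn.Module):
    """Hypothetical reading of one 'alternative' module: a volumetric (3D)
    convolution whose output is refined by a recurrence unrolled T times."""
    def __init__(self, in_ch, out_ch, T=3):
        super().__init__()
        self.T = T
        self.feedforward = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1)
        # 1x1x1 recurrent weights: each location only sees itself across steps
        self.recurrent = nn.Conv3d(out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):            # x: (N, C, T_clip, H, W)
        v = self.feedforward(x)      # feedforward drive, computed once
        u = self.act(v)
        for _ in range(self.T):      # recurrence over internal unrolling steps
            u = self.act(v + self.recurrent(u))
        return u
```

Under this reading, the full network would stack several such layers (with pooling in between) before the volumetric pyramid pooling and the fully connected classifier.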
This paper proposes an architecture that is not over-complicated and contains some nice ideas. It is fairly clearly written.

- The novelty is somewhat limited because a related alternation architecture was proposed in: "Delving Deeper into Convolutional Networks for Learning Video Representations", Nicolas Ballas, Li Yao, Chris Pal, Aaron Courville, ICLR 2016, https://arxiv.org/abs/1511.06432. This paper is not cited. The authors should discuss the relation of their architecture to this. One of the differences is that the current submission uses 3D convolutions whereas Ballas et al. do not, but the CNN-RNN alternation idea is there.
- On the proposed architecture: it was not clear to me what temporal neighborhood is used in the video when computing the 3D convolutions. The authors should clarify this. Also, some of the design choices should be better explained, e.g. why (and how) is Local Response Normalization (LRN) used? Most recent deep networks have not used LRN. Did the authors consider a bi-directional RNN?
- The method of obtaining the temporal length (section 2.1) seems fine for static cameras, but might not be very stable if the camera is tracking or zooming (as this will also generate optical flow). Can the authors comment on whether camera motion is a problem?
- I think there is a typo at line 36: it should be inter-image context (rather than image-intra context).
- The results are competitive with the state of the art, though some more recent papers at CVPR 2016 have superior performance on UCF-101 (e.g. Feichtenhofer et al., Wang et al. "Actions ~ Transformations"). However, datasets such as UCF-101 are too limited to really show the advantages of one architecture over another.
3-Expert (read the paper in detail, know the area, quite certain of my opinion)
This paper explores a custom neural network architecture designed specifically for action recognition. The key new component is an "alternative layer", which is composed of a convolutional layer followed by a recurrent layer. As the paper targets action recognition in video, the convolutional layer acts on a 3D spatio-temporal volume. Volumetric pyramid pooling, inspired by [7], is used to map arbitrary-sized video clips to the same fixed-length feature representation. The entire architecture consists of a series of alternative+pooling layers, followed by volumetric pyramid pooling, and fully connected layers for classification. A preprocessing stage based on optical flow is used to select video fragments to feed to the neural network. Specifically, the total energy of the optical flow field is computed for each video frame, and frames at temporal local extrema are extracted as landmarks. Video fragments are then extracted using these landmarks to guide their temporal bounds. Experiments present results on the UCF101 [23] and HMDB51 [14] action recognition benchmarks. Table 3 shows the proposed method performing on par with the best previous approaches.
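For clarity, the per-frame flow-energy landmark selection as I understand it can be sketched as below; the flow computation itself and the exact fragment extraction rule are left out, and the function and variable names are mine rather than the paper's.

```python
import numpy as np

def flow_energy_landmarks(flows):
    """flows: list of (H, W, 2) optical flow fields, one per frame.
    Returns indices of frames whose total flow energy is a local extremum."""
    energy = np.array([np.sum(f[..., 0] ** 2 + f[..., 1] ** 2) for f in flows])
    landmarks = []
    for t in range(1, len(energy) - 1):
        is_max = energy[t] >= energy[t - 1] and energy[t] >= energy[t + 1]
        is_min = energy[t] <= energy[t - 1] and energy[t] <= energy[t + 1]
        if is_max or is_min:
            landmarks.append(t)
    return landmarks
```

Consecutive landmarks would then serve as the temporal bounds of the extracted fragments.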
While the proposed architecture might be interesting, this paper has some significant drawbacks. The experimental results (Table 3) are not particularly compelling with respect to the state of the art. On both UCF and HMDB, classification accuracy at most matches the best current systems. On UCF, the proposed system scores 91.6% compared to 91.5% for [30]. On HMDB, performance is 1% lower than [19] (65.9 vs 66.8 for [19]).

I also have reservations about the clarity of presentation with regard to the proposed recurrent component of the alternative layers. Specifically, the text describes the purpose of these layers as integrating contextual information in larger receptive fields. However, equation (2) makes it appear as though these layers are not convolutional: the output at xyz coordinates at time t depends only on the output at the same xyz coordinate in the previous time step. To integrate context, shouldn't these outputs also depend on neighboring spatial or temporal locations? A rebuttal clarifying these points would be helpful. Related to the above point, deeper chains of convolutional layers are now a common tool for integrating contextual information and building larger receptive fields (as discussed in lines 99-108). It would be helpful to have an experiment comparing the proposed recurrent layers to a baseline of simply a deeper stack of convolutional layers.
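To make the concern concrete, my reading of the recurrence in equation (2) is, schematically (notation loosely following the paper's, details simplified):

$$u_{ij}^{xyz}(t) = \sigma\!\left( (w_j^f)^\top u_{(i-1)j}^{xyz} + (w_j^r)^\top u_{ij}^{xyz}(t-1) + b_j \right),$$

i.e. the recurrent term at location (x, y, z) only ever reads the state at the same (x, y, z). A recurrence that actually enlarges the receptive field would instead convolve the previous state over a spatio-temporal neighborhood, e.g. $u_j(t) = \sigma\big(w_j^f * u_{i-1} + w_j^r * u_j(t-1) + b_j\big)$ with $*$ a volumetric convolution.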
2-Confident (read it all; understood it all reasonably well)
The paper presents a deep neural network architecture for action recognition in videos. The proposed network stacks alternative layers consisting of a volumetric convolutional layer and a recurrent layer, and is hence named deep alternative neural network (DANN). The authors argue that this preserves the contexts of local features as early as possible and embeds them in the feature learning procedure. They also present an adaptive method to determine the temporal size of the network input based on optical flow energy, unlike other methods which set it manually. This produces outputs of variable size; to deal with this, a volumetric pyramid pooling layer is added to resize the output to a fixed size before the fully connected layers. Experiments are extensive and evaluate most of the modules well; however, a more detailed analysis of how the proposed context learning helps for different classes would be interesting.
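As a concreteness check on the variable-size issue, a volumetric pyramid pooling layer of the kind described can be sketched as follows; the pyramid levels (1, 2, 4) and the use of max pooling are my guesses rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VolumetricPyramidPooling(nn.Module):
    """Pools a (N, C, T, H, W) feature map at several fixed grid sizes and
    concatenates the results, so any input size maps to the same length."""
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList([nn.AdaptiveMaxPool3d(l) for l in levels])

    def forward(self, x):                       # x: (N, C, T, H, W), any T/H/W
        n = x.size(0)
        return torch.cat([p(x).reshape(n, -1) for p in self.pools], dim=1)
```

With levels (1, 2, 4), the output length is C·(1³ + 2³ + 4³) = 73·C regardless of the clip's temporal or spatial extent, which is what allows the fully connected layers to accept the adaptively chosen fragment lengths.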
1) It is not clear whether the proposed learning of the evolution of context helps much compared to the image-intra context learning done by other recent methods (that apply an LSTM/RNN at the end on features from a CNN). Also, for a given action the context mostly does not change much. The results do not really support this enough, as the improvement is minor (from 88.6% of [34] to 89.2% on UCF101), while the two datasets used have action classes with a lot of context, especially UCF101. Some examples of classes where the method performs well and classes where it does not perform well could be given, with analyses and explanations. One example of each is not enough; also, why does it not improve on the class 'haircut', which has a lot of context? Is the context there too simple, so that the baseline method can already learn it? Detailed explanations are missing even though there is quite a bit of space left to utilize.

2) Figure 4 is not clear: M is the number of bins and k is the number of kernels, so shouldn't it be k instead of M in the figure?

3) Some of the key state-of-the-art methods are missing in the fusion part: a) 71.3 on HMDB51 and 88.5 on UCF101 by "What do 15,000 object categories tell us about classifying and localizing actions?", CVPR'15; b) 91.3% on UCF101 by "Modeling spatial-temporal clues in a hybrid deep learning framework for video classification", ACM Multimedia'15.

4) Typo: line 34 "…at the end CNN…"
3-Expert (read the paper in detail, know the area, quite certain of my opinion)
The authors argue that utilizing context early in a deep neural network architecture for action recognition in videos is beneficial. To this end, they propose a novel architecture called deep alternative neural network (DANN) which can utilize context as early as possible. DANN uses a volumetric convolutional layer in conjunction with a recurrent layer to incorporate context into the architecture. They empirically demonstrate the benefits of their proposed model on standard datasets and compare their performance against other state-of-the-art methods. In addition to exploring contexts early in the architecture, the proposed model is also capable of handling arbitrary-sized inputs through the use of a volumetric pyramid pooling layer. They also provide a mechanism for adaptive fragmentation of a given video that can be beneficial for improved performance.
Detailed review: The problem of action recognition is well established in the literature. The authors provide an interesting deep neural network based approach for leveraging context early in the network architecture. They empirically show the advantage of exploring context as early as possible in the deep architecture. They also demonstrate that exploring contexts at different levels of the architecture is beneficial. The capability of handling arbitrary-sized input is also useful.

Novelty: Their use of context early in the network seems novel. The capability to use arbitrary-length input is also helpful.

Clarity: The explanations seem sound.

Technical Quality: The authors seem to provide the necessary technical details. They also provide implementation details about the actual experiments performed. However, the manuscript raises some questions/doubts/comments:

1) In the lower equation of Equation set 2, what is u_{(i-1)j}^{xyz} on the RHS? Is it the output of the previous layer (which can theoretically be a recurrent layer unrolled T times)?

2) In Figure 3, is the right-side dotted box the expansion of the output->output cycle in the left-side dotted box, where the cycle repeats T times?

3) An example that explains the authors' comments on lines 97-98 would have been useful.

4) Some clarification regarding line 110 would be useful.

5) An explanation regarding the claim presented in lines 144-145 would be nice. What causes the claimed 3x speedup when 2 GPUs are used instead of 1?

6) At what layers are the dropouts used?

7) How many random video clips are selected per video during data augmentation?

8) What is the benefit of using a sliding-window approach during testing while using random clips during training? If the aim is to get a video-level score, can the testing also be done using fixed-size random clips?
2-Confident (read it all; understood it all reasonably well)
In this paper, the authors propose a deep alternative neural network for action recognition, which consists of a volumetric convolutional layer and a recurrent layer. The authors also introduce an adaptive method to determine the temporal size of the input video clips. Experimental results on benchmark datasets demonstrate competitive or superior performance compared to the state of the art. In general, the paper is clear, but the theoretical contributions of the paper are incremental. It seems to integrate the well-studied methods of CNN and RNN. Besides, the difference to "Long-term Recurrent Convolutional Networks" by Jeff Donahue et al. is also not clear.
In this paper, the authors propose a deep alternative neural network for action recognition. However, it seems that the proposed algorithm is an integration of CNN and RNN, and it is also not clear what the advantages are over the previous Long-term Recurrent Convolutional Networks (LRCN). The temporal information can be well explored via the recurrent layers; then why is the volumetric convolutional layer necessary? The authors are expected to further clarify its advantages over the LRCN algorithm with an experimental comparison. The authors are also expected to report the computational complexity.
3-Expert (read the paper in detail, know the area, quite certain of my opinion)
This paper proposes a neural network based method for action recognition. The proposed network is called a deep alternative neural network. Each alternative layer has a volumetric convolutional layer and a recurrent layer. The method also proposes a new approach to select the network input based on optical flow. The experiments are carried out on the HMDB51 and UCF101 datasets, and the proposed method achieves performance comparable to state-of-the-art methods.
Pros:
1. The method achieves comparable performance to state-of-the-art methods on both UCF101 and HMDB51, two large video datasets for action recognition.
2. The paper has reasonable novelty, proposing a new approach for choosing the network input and a volumetric pyramid for pooling.

Cons:
1. Although the method achieves better performance than the state of the art on UCF101, the gap is really marginal. Similarly, the method has the same performance as TDD+iDT on HMDB51.
2. A recent paper that explores key video frames/segments in a learned fashion could be cited: Yeung et al., "End-to-end Learning of Action Detection from Frame Glimpses in Videos", CVPR 2016.
2-Confident (read it all; understood it all reasonably well)