NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Reviewer 1
- I think there are quite a few novel and interesting ideas in this work. Crop selection is not a trivial problem, and the authors use the visual code of each object for matching. This looks interesting, but I do not quite understand how well the approach works. I suggest the authors add some visualization of this matching step (a sketch of what I have in mind follows this list).
- What are u_{img} and u^{attn}? What do they refer to?
- The paper is technically interesting, but I am not sure all the components are necessary. For example, is D_{obj} really useful? I think D_{img} is probably enough, since it carries more global information than the object-level discriminator.
- How are the hyper-parameters tuned? The lambdas are supposed to be weights that balance the different terms in the loss function.
- It is not clear whether the code will be made available.
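To make the crop-selection question concrete, here is a minimal sketch, assuming matching is done by cosine similarity between a query object embedding and the visual codes of candidate crops. All names (select_crop, the candidate bank) are hypothetical illustrations and not the authors' implementation, which may differ.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of "matching by visual code": pick the candidate crop
# whose visual code is most similar to the query object's embedding.
def select_crop(query_code: torch.Tensor, candidate_codes: torch.Tensor, top_k: int = 1):
    """query_code: (D,) embedding of the target object (e.g., from the graph branch).
    candidate_codes: (N, D) visual codes of N candidate crops.
    Returns indices of the top-k candidates by cosine similarity."""
    query = F.normalize(query_code.unsqueeze(0), dim=-1)   # (1, D)
    bank = F.normalize(candidate_codes, dim=-1)            # (N, D)
    sims = (bank @ query.t()).squeeze(-1)                  # (N,) cosine similarities
    return sims.topk(top_k).indices

# Toy usage: 128-dim codes, 500 candidate crops for one object category.
query = torch.randn(128)
candidates = torch.randn(500, 128)
best = select_crop(query, candidates, top_k=3)
print(best)  # visualizing the retrieved crops for these indices would address the question
```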
Reviewer 2
2.1. Originality. The proposed method contains many components, some of which are already well known. A very similar architecture is proposed in [16]. The differences are that this work uses text, while [16] directly requires a layout with bounding boxes, and that this work requires object crops, while [16] generates objects from scratch. Overall, I think the originality of this method is only okay; the methodological novelty of the presented method is, I believe, marginal.

2.2. Quality. The presented method seems to outperform both [16] and [4]. I would consider the numbers with caution, since the presented method uses ground-truth image patches to generate images. The inception and diversity scores are therefore much higher, especially the diversity score: by sampling different patches you push these numbers up compared to works that generate pixels directly (a sketch of this effect follows this review). Section 4.4 discusses the qualitative benefits of the presented method, while this simple advantage is omitted. The overall quality of the generated results is still far from realistic. Would it be possible to generate higher-resolution samples, e.g., 256x256? I don't think we should stick to 64x64 for comparison purposes.

2.3. The paper is well written and easy to follow. At least one reference/comparison is missing [Hong et al.]. Similarly to the presented work, they also first generate bounding boxes from text; they do not use any patches, though.

2.4. Significance. The paper marginally moves the results in text-to-image synthesis forward. The paper offers no groundbreaking contributions.

Hong et al. Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis, CVPR 2018.
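The following is a minimal sketch of the effect the reviewer describes, assuming diversity is measured as a mean pairwise distance between samples generated for the same input (a stand-in for a perceptual diversity metric; the paper's actual metric may differ). The toy numbers and the pasting step are purely illustrative.

```python
import torch

# Illustration: swapping in different retrieved ground-truth patches raises a
# pairwise diversity score even if the generator itself adds little variation.
def pairwise_diversity(images: torch.Tensor) -> torch.Tensor:
    """images: (N, C, H, W) samples generated for the same scene graph.
    Returns the mean pairwise L1 distance over all sample pairs."""
    n = images.size(0)
    flat = images.reshape(n, -1)
    dists = torch.cdist(flat, flat, p=1)                    # (N, N) pairwise L1 distances
    off_diag = dists[~torch.eye(n, dtype=torch.bool)]
    return off_diag.mean()

# Toy comparison: generator noise only vs. different retrieved crops pasted in.
base = torch.rand(1, 3, 64, 64).repeat(8, 1, 1, 1)
noise_only = base + 0.02 * torch.randn_like(base)            # small generator variation
patched = noise_only.clone()
patched[:, :, 16:48, 16:48] = torch.rand(8, 3, 32, 32)       # different retrieved crops
print(pairwise_diversity(noise_only))   # small
print(pairwise_diversity(patched))      # noticeably larger, without better generation
```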
Reviewer 3
Limited novelty: The proposed approach is closely related to two lines of related work: 1) sg2im [4], which generates images from scene-graph representations, and 2) semi-parametric image synthesis [3], which leverages semantic layouts and training images to generate novel images. The key difference from sg2im is the use of image crops to perform semi-parametric synthesis; however, in comparison to prior semi-parametric work [3], as the authors suggest (lines 82-83), the primary difference is the use of a graph convolution architecture, and a similar graph convolution method was already introduced in [4]. I'd like to see more justification from the authors regarding the technical novelty of this approach in light of these two lines of work.

Limited resolution: My concern about the limited novelty is exacerbated by the fact that the generated images are still low-resolution (64x64), as in prior work [4], even though high-resolution image crops are used to aid the image generation process. In contrast, related work [3] is able to generate images of much higher resolution, e.g., 512x1024, with its semi-parametric method (which was not compared in the experiments). Could the authors comment on the possibility of using the proposed method to generate high-resolution images? The experimental results would be much stronger if the authors could demonstrate the effectiveness of the method on larger images.

Crop selection: It is unsatisfying that the crop selector relies on pretrained models from prior work [4] to rank crop candidates instead of being learned jointly with the rest of the model. Is there a way to make crop-selector training part of the final learning objective?

Relations of scene graphs: The model is trained adversarially with two discriminators, at the object level and at the image level. However, there seems to be no training objective that ensures the pairwise relationships in the generated images match the edges of the scene graphs. Is there any other learning objective that can ensure consistency between the relationships in the scene graph and the corresponding generated image? (A sketch of one possible auxiliary objective follows this review.)

Figure 2: Is there a mistake in the caption, which seems inconsistent with the main text? Does the top branch generate the new image while the bottom branch reconstructs the ground truth?
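One possible auxiliary objective of the kind the reviewer asks about is sketched below: a relation classifier that predicts each scene-graph predicate from the generated subject/object features, added to the loss with its own weight. This is a hypothetical illustration, not something proposed in the reviewed paper; the class name, architecture, and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical relationship-consistency loss: a classifier must recover the
# scene-graph predicate from features pooled over the generated objects.
class RelationConsistencyLoss(nn.Module):
    def __init__(self, feat_dim: int, num_predicates: int):
        super().__init__()
        # Predicts the predicate from concatenated subject/object features.
        self.classifier = nn.Sequential(
            nn.Linear(2 * feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_predicates),
        )

    def forward(self, obj_feats, subj_idx, obj_idx, predicates):
        """obj_feats: (N, D) features pooled from the generated image at each object's box.
        subj_idx, obj_idx: (E,) subject/object indices for each scene-graph edge.
        predicates: (E,) ground-truth predicate labels."""
        pair_feats = torch.cat([obj_feats[subj_idx], obj_feats[obj_idx]], dim=-1)  # (E, 2D)
        logits = self.classifier(pair_feats)
        return F.cross_entropy(logits, predicates)

# Toy usage: 5 generated objects, 4 edges, 20 predicate classes.
loss_fn = RelationConsistencyLoss(feat_dim=128, num_predicates=20)
obj_feats = torch.randn(5, 128)
loss = loss_fn(obj_feats, torch.tensor([0, 1, 2, 3]), torch.tensor([1, 2, 3, 4]),
               torch.tensor([3, 7, 0, 12]))
print(loss)  # could be added to the GAN objective with its own lambda weight
```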