NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Dear authors: your paper was carefully evaluated by the reviewers and discussed after we received the rebuttal. There was general agreement that this is an interesting paper worthy of acceptance at NeurIPS 2019; adversarial attacks on policy learning in RL are a very timely topic. I would like to note, however, that I solicited some outside feedback on this paper after the reviews were in, and that feedback contained both positive and negative comments. I found this fourth perspective particularly on point and worth reading carefully, so I share it below. I encourage the authors to take it, along with the other reviews, into account when preparing their final submission.

==== Additional Feedback to Authors ====

- This paper looks like an extension of the previous work on data poisoning attacks [15] (from bandits to RL), in the sense that it uses the same problem formulation: reward modification towards a target policy, with an l_p norm as the attack "cost" (a generic sketch of this formulation is given below for reference).
- Although this is a novel extension with a theoretical contribution, the theoretical and empirical results are limited to offline batch RL and simple learners (tabular / LQR), which is far from real applications. The RL community is generally more interested in online settings, where the agent keeps collecting data, and in more complex RL architectures/algorithms (e.g., policy gradient with linear function approximation). At a minimum, the paper would be stronger if it justified how realistic this attack scenario is.
- (Minor comment) Introducing the term "policy poisoning" is misleading, because the problem setup is exactly the same as in the "data poisoning" paper [15]: only the data is manipulated, not the policy itself. The paper should also refer to that previous work when describing the problem setup and objective, and clarify this in the related work section.
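For reference, here is a minimal sketch of the data-poisoning formulation the feedback alludes to (the notation r, \hat r, \pi^\dagger is illustrative and not taken verbatim from the paper or [15]): the attacker perturbs the rewards in the batch data so that the learner's resulting policy matches a chosen target, while paying an l_p cost for the perturbation,

\[
\min_{\hat r} \; \|\hat r - r\|_p
\quad \text{subject to} \quad
\pi^{*}_{\hat r} = \pi^{\dagger},
\]

where r denotes the original rewards in the batch, \hat r the poisoned rewards, \pi^{*}_{\hat r} the policy the learner computes from the poisoned data, and \pi^{\dagger} the attacker's target policy.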