NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
Originality: This is a very original contribution which studies commonly used reductions for multi-label classification. The obtained results are not only surprising, but also very important from the practical point of view. Quality: All the results are presented in a formal way. The claims are clear and theoretically justified. The empirical illustration is rather limited, but this is not the main focus of the paper (the results for OVA and OVA-N should be given in the Appendix). Clarity: The paper is clearly written. Nevertheless the list below contains some comments regarding the text: - P(0_L | x) = 0 => I would prefer to not make such assumption. For example, some benchmark datasets contain examples with no labels. - Prec@k and Rec@k: these measures have certainly been used before Lapin et al. 2018. - Lemma 3 => There are undefined quantities in the lemma and typos in its proof. The final result is correct, but readability should be improved (at least the authors should say what y_{\not i} means and that they use the fact that P(y) = P(y_i)P(y_{\not i}|y_i)). Moreover, P(y_i' = 1) should be given in both forms, i.e. as in Lemma 3 and as E_y|x [ y_i/\sum_j y_j ] (the latter form is used often in the proofs of the other results). - binary relevance: the name binary relevance was used before Dembczynski et al 2010, but not sure where for the first time :( - The analysis could mention the label powerset approach which reduces a multi-label problem to a multi-class problem with exponentially many meta-labels. In this approach, however, we do not get easily marginal quantities. - Eq. \ell_{PAN-N} => I suppose that there should be log before the link function in the sum - The analysis in this paper is methodologically similar to the one from "On label dependence and loss minimization in multi-label classification", MLJ 2012. - Appendix: please proof-read the derivations. Significance: This is a significant contribution containing somehow simple, but surprising results with high practical potential.
Reviewer 2
It is a well-written paper with clear structure, easy to follow. It contains adequate discussions of related work, and the proposed approach is experimentally validated. Regarding novelty, the paper follows the footsteps of Wydmuch et al., 2018, but the extension is substantial, as it covers a much boarder range of reduction techniques and the discussion on recall in Wydmuch et al., 2018 is not explicit. The authors also call for caution in interpreting the produced probabilities scores of the reduction techniques. But isn't it rather trivial? It is not a criticism; I'd just like to point it out in case I've missed something.
Reviewer 3
This paper analyzes a relevant machine learning problem, as it intends to provide additional insights into existing methods that have been introduced in an ad-hoc manner. From that perspective, the paper is interesting. However, it is not easy to understand the main results. I have the feeling that the write-up and design of the paper could be substantially improved. Let me give some specific comments to make this point clear. Up to Section 4 the paper is reasonably clear, but in Section 4 things get a bit messy. I have some problems to understand the losses in Equations 6 and 7, because the l_BC and l_MC are never defined explicitly. Please give formulas here. Many different ways of expressing the logistic and softmax loss exist, so I would like to have it precise here. I can’t see the usefulness of Equation 8. Is this loss reasonable? Let’s take a training example with 100 positive labels. Treating each positive label as positive gets a weight of 1/100, while treating positives as negatives gets a weight of 99/100. So, this seems to suggest that false positives should be heavily penalized. Maybe I am missing here something, but this does not make sense to me. Equation 9 is also a strange variant. Here the denominator in the sum does not depend on i, so it can be moved in front of the sum. As a result, this term just reweighs the importance of an instance, but it does not influence the risk minimizer for that instance. PAL and PAL-N should therefore have the same risk minimizer, but Corollary 7 seems to suggest a different result. I am confused here. Should Corollary 7 not be named Theorem 7? In mathematics a corollary is a direct implication from a theorem. I don’t see from which theorem the corollary would follow here. Traditionally, there are two ways to optimize task-based loss functions in machine learning: (a) optimizing a convex and differentiable approximation of the task loss during training, (b) fitting a probabilistic model at training time, followed by a loss-specific inference step at test time in a decision-theoretic framework. For me, a big point of confusion is that the approaches are somewhat mixed in this paper. Wouldn’t it be easier to analyze the different methods in a classical decision-theoretic framework? In essence this would boil down to using accuracy for l_BC and l_MC. In a nutshell, this is an interesting paper, but I think the write-up could be improved. In general the results are not very clear and counterintuitive. ---- After author rebuttal: ---- my main motivation for giving a somewhat-lower score was some technical things that were not clear to me. I am satisfied with the author's response and will raise my score.