NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
All the reviewers agreed that the work tackles an interesting and timely topic and that the methodological contribution is sound. As a result, the paper was discussed during the PC meeting and acceptance was recommended. That being said, several concerns were brought up during the discussion among reviewers, and I would encourage the authors to address them in the final version of the paper. In particular:

(i) The motivation of the fairness metric appears insufficient. A more detailed discussion of its merits would be helpful: it is unclear why the disparity is measured in only one direction (over-emphasis of higher-relevance items), with the direction based on relevance rather than group identity. As stated, the metric seems to reward "diversity" rather than "fairness".

(ii) The execution of the experimental evaluation could be significantly improved. More specifically, the Yahoo dataset and the German Credit dataset are used for entirely different types of experiments; the proposed LTR approach is compared against only a few, relatively old methods (a comparison to more recent and widely used algorithms such as LambdaRank, LambdaMART, or their successors seems necessary); and the argument that it does worse than GBDT because it belongs to a different model class is pretty weak.

(iii) Other datasets may be better suited for validating the authors' method, for example those used in:
1. Zehlike et al., "FA*IR: A Fair Top-k Ranking Algorithm."
2. Asudeh et al., "Designing Fair Ranking Schemes."
3. Biega et al., "Equity of Attention: Amortizing Individual Fairness in Rankings."