Description
The evaluation module currently provides 21 metrics across rating accuracy (RMSE, MAE, R-squared, etc.), ranking quality (Precision@k, Recall@k, NDCG@k, MAP, etc.), and beyond-accuracy dimensions (diversity, novelty, serendipity, coverage). However, it has no metrics for fairness or popularity bias -- two increasingly important evaluation dimensions in recommendation systems research.
Fairness in recommendations is a growing concern with regulatory implications (e.g., the EU AI Act requires fairness assessments for AI systems). Popularity bias is one of the most well-studied biases in recommender systems, where algorithms disproportionately recommend already-popular items at the expense of niche/long-tail content.
I'd like to contribute 10 new metrics across these two categories:
Popularity Bias Metrics (5):
| Metric | Description | Reference |
|---|---|---|
| Average Recommendation Popularity (ARP) | Average popularity of recommended items | Abdollahpouri et al., FLAIRS 2019 |
| Average Percentage of Long Tail Items (APLT) | Fraction of long-tail items in recommendations | Abdollahpouri et al., FLAIRS 2019 |
| Average Coverage of Long Tail Items (ACLT) | Coverage of long-tail catalog | Abdollahpouri et al., FLAIRS 2019 |
| Popularity Lift | Popularity amplification ratio (reco vs train) | Abdollahpouri, PhD Thesis 2020 |
| Gini Index | Inequality of item recommendation frequency | Standard inequality measure |
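To make the first two rows concrete, here is a minimal sketch of ARP and the Gini index in plain numpy/pandas. Column names (`userID`, `itemID`) and function signatures are illustrative only; the actual implementation would follow the existing conventions and pull column names from `recommenders.utils.constants`.

```python
import numpy as np
import pandas as pd

def gini_index(item_counts):
    """Gini coefficient of item recommendation frequencies.

    0 = perfectly uniform exposure across items; values approaching 1
    mean recommendations are concentrated on a few items.
    """
    counts = np.sort(np.asarray(item_counts, dtype=float))
    n = counts.size
    # Closed form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    return 2 * np.sum(np.arange(1, n + 1) * counts) / (n * counts.sum()) - (n + 1) / n

def avg_recommendation_popularity(reco_df, item_pop, user_col="userID", item_col="itemID"):
    """ARP: per-user mean training popularity of recommended items,
    averaged over users (Abdollahpouri et al., FLAIRS 2019)."""
    per_user = (
        reco_df.assign(pop=reco_df[item_col].map(item_pop))
        .groupby(user_col)["pop"]
        .mean()
    )
    return float(per_user.mean())
```

A uniform exposure vector gives Gini = 0, which is exactly the boundary condition the proposed unit tests would assert.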
Fairness Metrics (5):
| Metric | Description | Reference |
|---|---|---|
| Group Metric Disparity | Meta-metric: any existing metric's gap across user groups | Li et al., WWW 2021 |
| Demographic Parity | Equal recommendation rates across groups | Burke et al., FAccT 2018 |
| Equal Opportunity Difference | Recall@k gap across groups | Adapted from Hardt et al., NeurIPS 2016 |
| Calibration Error | KL divergence between user preferences and reco distribution | Steck, RecSys 2018 |
| Exposure Fairness | Gini of exposure across item providers | Singh and Joachims, KDD 2018 |
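As a sketch of the calibration row, the KL-divergence form from Steck (RecSys 2018) can be written in a few lines. The smoothing parameter `alpha` and the assumption that both inputs are already normalized genre distributions are illustrative choices, not the final API.

```python
import numpy as np

def calibration_error(p_pref, q_reco, alpha=0.01):
    """KL(p || q~): divergence between a user's preference distribution p
    (e.g., genre shares in their interaction history) and the distribution q
    of their recommendations. q is smoothed toward p so the KL stays finite
    when q has zero mass on a genre the user likes (as in Steck, RecSys 2018).
    """
    p = np.asarray(p_pref, dtype=float)
    q = np.asarray(q_reco, dtype=float)
    q_tilde = (1 - alpha) * q + alpha * p
    mask = p > 0  # terms with p = 0 contribute nothing to KL(p || q~)
    return float(np.sum(p[mask] * np.log(p[mask] / q_tilde[mask])))
```

When the recommendation distribution matches the preference distribution the error is 0, another of the boundary conditions listed for the unit tests.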
Expected behavior with the suggested feature
- A new file `recommenders/evaluation/python_evaluation_fairness.py` with Python/pandas implementations of all 10 metrics, following the existing coding conventions (decorators, docstrings with citations, constants from `recommenders.utils.constants`)
- A new `SparkFairnessEvaluation` class in `spark_evaluation.py` with PySpark versions of the popularity bias metrics (following the `SparkDiversityEvaluation` pattern)
- Unit tests with boundary condition checks (e.g., Gini = 0 for a uniform distribution, calibration error = 0 when the preference distribution matches the recommendation distribution)
- An example notebook `examples/03_evaluate/fairness_and_bias_evaluation.ipynb` demonstrating the metrics, comparing a biased vs. a fair recommender
- Zero new dependencies -- uses only numpy, pandas, and sklearn (already required)
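Because Group Metric Disparity is a meta-metric rather than a standalone score, a minimal sketch may help clarify the intended shape: take any per-user metric already computed (e.g., NDCG@k per user), average it per group, and report the gap. Argument names and types here are hypothetical.

```python
import pandas as pd

def group_metric_disparity(user_scores, user_groups):
    """Gap between the best- and worst-served user groups on any per-user
    metric (meta-metric in the spirit of Li et al., WWW 2021).

    user_scores: mapping userID -> metric value (e.g., per-user NDCG@k)
    user_groups: mapping userID -> group label (e.g., activity level)
    """
    df = pd.DataFrame({"score": pd.Series(user_scores), "group": pd.Series(user_groups)})
    group_means = df.groupby("group")["score"].mean()
    return float(group_means.max() - group_means.min())
```

This is what lets the one function cover "any existing metric's gap across user groups" without reimplementing each base metric.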
Proposed implementation details:
| File | Action |
|---|---|
| `recommenders/utils/constants.py` | Modify: add `DEFAULT_GROUP_COL`, `DEFAULT_PROVIDER_COL`, `DEFAULT_LONG_TAIL_THRESHOLD` |
| `recommenders/evaluation/python_evaluation_fairness.py` | Create: 10 metric functions + decorators + helpers |
| `recommenders/evaluation/spark_evaluation.py` | Modify: add `SparkFairnessEvaluation` class |
| `tests/unit/recommenders/evaluation/conftest.py` | Modify: add `fairness_data` fixture |
| `tests/unit/recommenders/evaluation/test_python_evaluation_fairness.py` | Create: 22 unit tests with boundary checks |
| `tests/unit/recommenders/evaluation/test_spark_evaluation_fairness.py` | Create: Spark parity tests |
| `examples/03_evaluate/fairness_and_bias_evaluation.ipynb` | Create: example notebook |
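To illustrate the style of boundary check the test files would contain, here is a self-contained sketch of APLT with the two edge-case assertions inline. Column names and the signature are placeholders; the real code would use the constants proposed above.

```python
import pandas as pd

def avg_pct_long_tail(reco_df, long_tail_items, user_col="userID", item_col="itemID"):
    """APLT: per-user fraction of recommended items that are long-tail,
    averaged over users (Abdollahpouri et al., FLAIRS 2019)."""
    is_lt = reco_df[item_col].isin(long_tail_items)
    return float(reco_df.assign(lt=is_lt).groupby(user_col)["lt"].mean().mean())

# Boundary checks in the style of the planned unit tests:
reco = pd.DataFrame({"userID": [1, 1, 2], "itemID": ["a", "b", "c"]})
assert avg_pct_long_tail(reco, {"a", "b", "c"}) == 1.0  # every item long-tail
assert avg_pct_long_tail(reco, set()) == 0.0            # no item long-tail
```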
Willingness to contribute
- Yes, I can contribute to this issue independently.
- Yes, I can contribute to this issue with guidance from the Recommenders community.
- No, I cannot contribute at this time.
Other Comments
References:
- H. Abdollahpouri, R. Burke, B. Mobasher. "Managing Popularity Bias in Recommender Systems with Personalized Re-ranking." FLAIRS 2019
- H. Steck. "Calibrated Recommendations." RecSys 2018
- A. Singh, T. Joachims. "Fairness of Exposure in Rankings." KDD 2018
- R. Burke, N. Sonboli, A. Ordonez-Gauger. "Balanced Neighborhoods for Multi-Sided Fairness in Recommendation." FAccT 2018
- Y. Li et al. "User-oriented Fairness in Recommendation." WWW 2021
- M. Hardt, E. Price, N. Srebro. "Equality of Opportunity in Supervised Learning." NeurIPS 2016
I have a working implementation ready with 22 passing tests and 0 regressions against the existing test suite. Happy to share early for feedback before opening the PR.