
[FEATURE] Add fairness and popularity bias evaluation metrics #2283

@bajajahsaas

Description


The evaluation module currently provides 21 metrics across rating accuracy (RMSE, MAE, R-squared, etc.), ranking quality (Precision@k, Recall@k, NDCG@k, MAP, etc.), and beyond-accuracy dimensions (diversity, novelty, serendipity, coverage). However, it has no metrics for fairness or popularity bias -- two increasingly important evaluation dimensions in recommendation systems research.

Fairness in recommendations is a growing concern with regulatory implications (e.g., the EU AI Act requires fairness assessments for AI systems). Popularity bias is one of the most well-studied biases in recommender systems, where algorithms disproportionately recommend already-popular items at the expense of niche/long-tail content.

I'd like to contribute 10 new metrics across these two categories:

Popularity Bias Metrics (5):

| Metric | Description | Reference |
|---|---|---|
| Average Recommendation Popularity (ARP) | Average popularity of recommended items | Abdollahpouri et al., FLAIRS 2019 |
| Average Percentage of Long Tail Items (APLT) | Fraction of long-tail items in recommendations | Abdollahpouri et al., FLAIRS 2019 |
| Average Coverage of Long Tail Items (ACLT) | Coverage of the long-tail catalog | Abdollahpouri et al., FLAIRS 2019 |
| Popularity Lift | Popularity amplification ratio (recommendations vs. training data) | Abdollahpouri, PhD Thesis 2020 |
| Gini Index | Inequality of item recommendation frequency | Standard inequality measure |
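To make two of these definitions concrete, here is a minimal pandas/numpy sketch of ARP and the Gini index. The function names and signatures are illustrative only, not the proposed API; the `userID`/`itemID` column defaults mirror the conventions used elsewhere in the evaluation module.

```python
import numpy as np
import pandas as pd


def avg_recommendation_popularity(train, reco, col_user="userID", col_item="itemID"):
    """ARP sketch: mean training-set popularity of each user's recommended
    items, averaged over users (name and signature are illustrative)."""
    # Item popularity = number of training interactions per item
    pop = train.groupby(col_item).size().reset_index(name="popularity")
    merged = reco.merge(pop, on=col_item, how="left").fillna({"popularity": 0})
    # Mean popularity within each user's list, then mean over users
    return merged.groupby(col_user)["popularity"].mean().mean()


def gini_index(reco, col_item="itemID"):
    """Gini index sketch: how unequally recommendations are spread over
    items; 0 means every recommended item appears equally often."""
    counts = np.sort(reco[col_item].value_counts().to_numpy())
    n, cum = counts.size, np.cumsum(counts)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n
```

The Gini sketch also illustrates the boundary condition proposed for the unit tests: a perfectly uniform recommendation frequency yields exactly 0.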

Fairness Metrics (5):

| Metric | Description | Reference |
|---|---|---|
| Group Metric Disparity | Meta-metric: any existing metric's gap across user groups | Li et al., WWW 2021 |
| Demographic Parity | Equal recommendation rates across groups | Burke et al., FAccT 2018 |
| Equal Opportunity Difference | Recall@k gap across groups | Adapted from Hardt et al., NeurIPS 2016 |
| Calibration Error | KL divergence between user preferences and the recommendation distribution | Steck, RecSys 2018 |
| Exposure Fairness | Gini of exposure across item providers | Singh and Joachims, KDD 2018 |
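As one example, calibration error compares a user's preference distribution over item categories (from their interaction history) with the distribution induced by the recommendation list. A small numpy sketch, where the function name and the smoothing constant `alpha` are illustrative rather than the proposed API:

```python
import numpy as np


def calibration_error(pref_dist, reco_dist, alpha=0.01):
    """Sketch of KL(p || q~): p is the user's category preference
    distribution, q the recommended-list distribution. q is smoothed
    toward p so the divergence stays finite when a category receives
    zero recommendations (cf. Steck, RecSys 2018)."""
    p = np.asarray(pref_dist, dtype=float)
    q = np.asarray(reco_dist, dtype=float)
    q_smooth = (1 - alpha) * q + alpha * p
    mask = p > 0  # 0 * log(0) is treated as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q_smooth[mask])))
```

When the recommendation distribution matches the preference distribution exactly, this returns 0, which is the boundary condition proposed for the unit tests below.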

Expected behavior with the suggested feature

  • A new file recommenders/evaluation/python_evaluation_fairness.py with Python/pandas implementations of all 10 metrics, following the existing coding conventions (decorators, docstrings with citations, constants from recommenders.utils.constants)
  • A new SparkFairnessEvaluation class in spark_evaluation.py with PySpark versions of the popularity bias metrics (following the SparkDiversityEvaluation pattern)
  • Unit tests with boundary condition checks (e.g., Gini=0 for uniform distribution, calibration error=0 when preference distribution matches recommendation distribution)
  • An example notebook examples/03_evaluate/fairness_and_bias_evaluation.ipynb demonstrating the metrics, comparing a biased vs fair recommender
  • Zero new dependencies -- uses only numpy, pandas, and sklearn (already required)
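For the Group Metric Disparity meta-metric, the intended shape is roughly the following sketch; the real implementation would reuse the existing decorators and column constants, and the `groups` argument shown here (a mapping from group label to user ids) is a hypothetical API.

```python
import pandas as pd


def group_metric_disparity(rating_true, rating_pred, groups, metric_fn, col_user="userID"):
    """Meta-metric sketch: evaluate any existing metric separately per
    user group and report the gap between the best- and worst-served
    groups. `groups` maps a group label to a collection of user ids."""
    per_group = {
        label: metric_fn(
            rating_true[rating_true[col_user].isin(users)],
            rating_pred[rating_pred[col_user].isin(users)],
        )
        for label, users in groups.items()
    }
    # Disparity = spread of the metric across groups
    return max(per_group.values()) - min(per_group.values())
```

Any of the existing rating or ranking metrics can be plugged in as `metric_fn`, which is what makes this a meta-metric rather than a single fixed measure.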

Proposed implementation details:

| File | Action |
|---|---|
| recommenders/utils/constants.py | Modify: add DEFAULT_GROUP_COL, DEFAULT_PROVIDER_COL, DEFAULT_LONG_TAIL_THRESHOLD |
| recommenders/evaluation/python_evaluation_fairness.py | Create: 10 metric functions + decorators + helpers |
| recommenders/evaluation/spark_evaluation.py | Modify: add SparkFairnessEvaluation class |
| tests/unit/recommenders/evaluation/conftest.py | Modify: add fairness_data fixture |
| tests/unit/recommenders/evaluation/test_python_evaluation_fairness.py | Create: 22 unit tests with boundary checks |
| tests/unit/recommenders/evaluation/test_spark_evaluation_fairness.py | Create: Spark parity tests |
| examples/03_evaluate/fairness_and_bias_evaluation.ipynb | Create: example notebook |

Willingness to contribute

  • Yes, I can contribute to this issue independently.
  • Yes, I can contribute to this issue with guidance from the Recommenders community.
  • No, I cannot contribute at this time.

Other Comments

References:

  1. H. Abdollahpouri, R. Burke, B. Mobasher. "Managing Popularity Bias in Recommender Systems with Personalized Re-ranking." FLAIRS 2019
  2. H. Steck. "Calibrated Recommendations." RecSys 2018
  3. A. Singh, T. Joachims. "Fairness of Exposure in Rankings." KDD 2018
  4. R. Burke, N. Sonboli, A. Ordonez-Gauger. "Balanced Neighborhoods for Multi-Sided Fairness in Recommendation." FAccT 2018
  5. Y. Li et al. "User-oriented Fairness in Recommendation." WWW 2021
  6. M. Hardt, E. Price, N. Srebro. "Equality of Opportunity in Supervised Learning." NeurIPS 2016

I have a working implementation ready with 22 passing tests and 0 regressions against the existing test suite. Happy to share early for feedback before opening the PR.
