Common methods for assessing the fairness of machine learning systems involve evaluating disparities in error metrics on static datasets for various inputs to the system. 
Indeed, many existing ML fairness toolkits (e.g., AIF360, fairlearn, fairness-indicators, fairness-comparison) provide tools for performing such error-metric-based analysis on existing datasets.

Assessment Methods

A standard practice in machine learning to assess the impact of a scenario like the lending problem is to reserve a portion of the data as a “<a href="https://developers.google.com/machine-learning/crash-course/training-and-test-sets/video-lecture">test set</a>” and use it to calculate relevant performance metrics. Fairness is then assessed by looking at how those performance metrics differ across salient groups. However, it is well understood that there are two main issues with using test sets like this in systems with feedback. First, if test sets are generated from existing systems, they may be incomplete or reflect the <a href="https://developers.google.com/machine-learning/glossary#bias-ethicsfairness">biases</a> inherent to those systems. Second, the actions informed by the ML system’s outputs can have effects that influence its future inputs.
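To make the static, test-set style of assessment concrete, here is a minimal sketch that computes the true positive rate (TPR) per group on a held-out set and reports the gap. The function name, the group labels, and the toy data are all illustrative assumptions; this is not the API of any of the toolkits mentioned above.

```python
from collections import defaultdict


def true_positive_rate_by_group(y_true, y_pred, groups):
    """Return {group: TPR}, where TPR = true positives / actual positives."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for label, prediction, group in zip(y_true, y_pred, groups):
        if label == 1:
            positives[group] += 1
            if prediction == 1:
                true_positives[group] += 1
    # Groups with no actual positives are skipped: TPR is undefined for them.
    return {g: true_positives[g] / positives[g] for g in positives}


# Toy held-out test set (invented for illustration).
y_true = [1, 1, 1, 0, 1, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

tpr = true_positive_rate_by_group(y_true, y_pred, groups)
tpr_gap = abs(tpr["a"] - tpr["b"])
print(tpr, tpr_gap)
```

A number like `tpr_gap` describes a single snapshot of the data. It cannot show how the decisions themselves would change the data collected afterwards.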

Deficiencies in Static Dataset Analysis

Our <a href="https://github.com/google/ml-fairness-gym/blob/master/papers/acm_fat_2020_fairness_is_not_static.pdf">paper</a> extends the analysis of two other scenarios that have been previously studied in the academic ML fairness literature. The ML-fairness-gym framework is also flexible enough to simulate and explore problems where “fairness” is under-explored. For example, in a supporting paper, “<a href="https://github.com/google/ml-fairness-gym/blob/master/papers/fairmlforhealth2019_fair_treatment_allocations_in_social_networks.pdf">Fair treatment allocations in social networks</a>,” we explore a stylized version of epidemic control, which we call the precision disease control problem, to better understand notions of fairness across individuals and communities in a <a href="https://en.wikipedia.org/wiki/Social_network">social network</a>.

Conclusion

Liu et al.’s original formulation of the lending problem examined only the short-term consequences of the bank’s policies, including short-term profit-maximizing policies (called the max reward agent) and policies subject to an equality of opportunity (EO) constraint. We use the ML-fairness-gym to extend the analysis to the long term (many steps) via simulation.

Extending the Analysis to the Long-Term

The ML-fairness-gym simulates sequential decision making using <a href="https://gym.openai.com/">OpenAI’s Gym</a> framework. In this framework, agents interact with simulated environments in a loop. At each step, an agent chooses an action that then affects the environment’s state. The environment then reveals an observation that the agent uses to inform its subsequent actions. Environments model the system and dynamics of the problem, and observations serve as data to the agent, which can be encoded as a machine learning system.
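The agent-environment loop described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not ML-fairness-gym code: `ToyLendingEnv`, `ThresholdAgent`, and `run_episode` are hypothetical names, the dynamics are invented, and the `reset()`/`step()` shape only mirrors the classic Gym convention of returning an observation, a reward, a done flag, and an info dictionary.

```python
import random


class ToyLendingEnv:
    """Hypothetical lending environment with Gym-style reset() and step()."""

    def __init__(self, num_steps=100, initial_cash=10.0, seed=0):
        self._num_steps = num_steps
        self._initial_cash = initial_cash
        self._seed = seed
        self.reset()

    def _observe(self):
        # Draw the next applicant; the score doubles as a repayment probability.
        self._score = self._rng.random()
        return {"score": self._score, "cash": self.cash}

    def reset(self):
        # Re-seeding here makes every episode reproducible.
        self._rng = random.Random(self._seed)
        self._t = 0
        self.cash = self._initial_cash
        return self._observe()

    def step(self, action):
        reward = 0.0
        if action == 1:  # 1 = approve the loan, 0 = reject
            repaid = self._rng.random() < self._score
            reward = 1.0 if repaid else -1.0
        self.cash += reward  # the action changes the environment's state
        self._t += 1
        done = self._t >= self._num_steps
        return self._observe(), reward, done, {}


class ThresholdAgent:
    """Hypothetical agent: approve whenever the observed score clears a threshold."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def act(self, observation):
        return 1 if observation["score"] >= self.threshold else 0


def run_episode(env, agent):
    """Run one episode of the observe -> act -> step loop."""
    observation = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done:
        action = agent.act(observation)
        observation, reward, done, _ = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


env = ToyLendingEnv(num_steps=100, seed=0)
agent = ThresholdAgent(threshold=0.5)
total_reward, steps = run_episode(env, agent)
print(total_reward, steps)
```

Keeping the dynamics inside the environment and the decision rule inside the agent means different policies can be compared on the same simulated world simply by swapping the agent.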

ML-fairness-gym as a Simulation Tool for Long-Term Analysis

There are cases (e.g., systems with active data collection or significant feedback loops) where the context in which the algorithm operates is critical for understanding its impact. In these cases, the fairness of algorithmic decisions ideally would be analyzed with greater consideration for the environmental and temporal context than error metric-based techniques allow.

The Anomalies

Our long-term analysis yielded two main findings. First, as Liu et al. found, the equal opportunity agent (EO agent) overlends to the disadvantaged group (group 2, which initially has a lower average credit score) by sometimes applying a lower threshold to that group than the max reward agent would. Second, equal opportunity constraints, which enforce an equalized true positive rate (TPR) between groups at each step, do not equalize TPR in aggregate over the simulation.
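The second finding can seem counterintuitive, so here is a small arithmetic sketch of one way it can arise. The counts below are invented for illustration and are not results from the paper. If the per-step TPR varies over time and the two groups have different numbers of positive cases at each step, then TPRs that match at every step can still differ once they are pooled.

```python
from fractions import Fraction

# Per step: (actual positives, true positives). Invented numbers.
counts_by_group = {
    "group_1": [(100, 90), (10, 5)],
    "group_2": [(10, 9), (100, 50)],
}


def per_step_tpr(records):
    """TPR computed separately at each step."""
    return [Fraction(tp, pos) for pos, tp in records]


def aggregate_tpr(records):
    """TPR computed after pooling the counts from every step."""
    total_pos = sum(pos for pos, _ in records)
    total_tp = sum(tp for _, tp in records)
    return Fraction(total_tp, total_pos)


for group, records in counts_by_group.items():
    print(group, per_step_tpr(records), aggregate_tpr(records))
# Both groups have per-step TPRs of 9/10 and 1/2, yet the aggregate TPR is
# 19/22 (about 0.864) for group_1 and 59/110 (about 0.536) for group_2.
```

The aggregate is a weighted average of the per-step rates, with weights given by how many positive cases each step contributes, so equal per-step rates only guarantee equal aggregates when those weights also match.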

The Results

In order to facilitate algorithmic development with this broader context, we have released <a href="https://github.com/google/ml-fairness-gym">ML-fairness-gym</a>, a set of components for building simple simulations that explore potential long-run impacts of deploying machine learning-based decision systems in social environments. In “<a href="https://github.com/google/ml-fairness-gym/blob/master/papers/acm_fat_2020_fairness_is_not_static.pdf">Fairness is not Static: Deeper Understanding of Long Term Fairness via Simulation Studies</a>” we demonstrate how the ML-fairness-gym can be used to research the long-term effects of automated decision systems on a number of established problems from current machine learning fairness literature.

ML-Fairness Gym

A deep dive into Google’s machine learning models and how they rely more on logic than data