Bias Correction
Summary
Bias correction in AI alignment research addresses and mitigates unfair biases present in datasets and machine learning models. This subtopic covers mathematical formulations of how bias arises and methods to counteract its effects. Techniques include re-weighting data points to achieve unbiased classification without changing labels, as well as regularization algorithms that aim to "unlearn" bias information during training. These approaches often use adversarial training, in which auxiliary networks are trained to predict bias from feature embeddings so that it can then be removed. The goal of bias correction is to produce fair, unbiased classifiers that perform well on diverse test sets even when trained on biased data. Research in this area reports promising results across various fairness notions and standard machine learning fairness datasets.
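The summary does not specify which re-weighting scheme is meant; one widely used instance of the idea is Kamiran–Calders-style reweighing, sketched below as an assumption. Each sample gets weight P(group) * P(label) / P(group, label), so that under the weighted distribution the sensitive group and the label are statistically independent, without altering any labels.

```python
from collections import Counter


def reweigh(groups, labels):
    """Compute per-sample weights w(g, y) = P(g) * P(y) / P(g, y).

    This is a sketch of one common re-weighting scheme (Kamiran-Calders
    reweighing), not necessarily the exact method the surveyed work uses.
    Under the returned weights, group membership and label are independent,
    which removes the dataset's group-label correlation before training.
    """
    n = len(labels)
    group_counts = Counter(groups)           # marginal counts P(g) * n
    label_counts = Counter(labels)           # marginal counts P(y) * n
    joint_counts = Counter(zip(groups, labels))  # joint counts P(g, y) * n
    return [
        (group_counts[g] / n) * (label_counts[y] / n) / (joint_counts[(g, y)] / n)
        for g, y in zip(groups, labels)
    ]


# Example: group "a" is over-represented among positive labels, so its
# positive samples are down-weighted and its negative sample up-weighted.
weights = reweigh(["a", "a", "a", "b"], [1, 1, 0, 0])
```

The resulting weights can be passed to any classifier that accepts per-sample weights (e.g. a `sample_weight` argument), which is how this technique achieves debiasing "without changing labels".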