Our work as data scientists is often focused on building predictive models. We work with vast quantities of data, and the more we have, the better our models and predictions can potentially become. When we have a high-performing model, we continue to retrain and iterate, introducing new data as required to keep the model fresh and guard against degradation. As a result, the model’s performance is largely retained and we continue delivering value for users.
But what happens if restrictions around a data set or individual data point are introduced? How then do we remove this information without compromising the model overall and without kicking off potentially intense retraining sessions? A potential answer that is gaining interest, and that we would like to explore, is machine unlearning.
Machine unlearning is a nascent field, with research and development delivering some compelling results. It offers a potential solution to many problems faced across industries, from the costly re-work required by new data laws and regulations to the challenge of spotting and mitigating bias.
Dealing With Data Non Grata
For data science teams, taking a high-performing model out of production due to legal or regulatory changes is not an uncommon problem. The process of retraining a large model, however, is extensive and costly.
Take the example of a hypothetical lending approval model in the US. Across the States, we are likely to have tens to hundreds of millions of data points, from which we have created hundreds of features that we use to train a massive neural network. Because we might be using expensive hardware (e.g. multiple GPUs), the time and cost of training this model can be considerable. Now imagine that this model has been in production for a year, delivering significant value for customers, when new privacy laws are introduced in California that prohibit the use of a particular portion of the data set.
Now we are in a difficult position, as our only option is to retrain the model. But what if there were a way to make the model forget this data without explicit retraining on the reduced data set? This is essentially what machine unlearning could do, and it would bring significant benefits for organizations as well as individuals.
Privacy is a key concern for us all. In financial services and other heavily regulated industries, such as healthcare, falling foul of privacy laws can present a mission-critical problem, so seamlessly removing data that is no longer permitted by law offers a significant get-out-of-jail-free card. For an individual, especially one in Europe, whose right to be forgotten is enshrined in GDPR, machine unlearning could also be the means by which they exercise this right.
Making Bias Disappear
Another way machine unlearning could deliver value for both individuals and organizations is the removal of biased data points that are identified after model training. Despite laws that prohibit the use of sensitive data in decision-making algorithms, there is a multitude of ways bias can find its way in through the back door, leading to unfair outcomes for minority groups and individuals. There are also similar risks in other industries, such as healthcare.
When a decision can mean the difference between life-changing and, in some cases, life-saving outcomes, algorithmic fairness becomes a social responsibility. Often, algorithms are unfair because of the data they are trained on. For this reason, financial inclusion is an area that is rightly a key focus for financial institutions, and not just for the sake of social responsibility. Challengers and fintechs continue to innovate solutions that are making financial services more accessible.
Protecting Against Model Degradation
From a model monitoring perspective, machine unlearning could also safeguard against model degradation. Models that have been in production for a long time will have been trained on data that becomes less relevant over time. A key example of this is the way in which customer behavior changed following the pandemic. In banking, for example, customers quickly moved to digital channels where once they had opted for in-person interactions. This behavioral paradigm shift made it necessary to retrain many models.
Another use case could be removing data that might expose a model to an adversarial attack, or speeding up remediation when bad data is introduced through, say, a system failure that causes a model to deliver harmful outcomes. Again, the essential driver for this use case is to reduce re-work, but also to make models and data science at large more secure.
How to Begin
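Researchers working on how to deliver machine unlearning have proposed a framework called Sharded, Isolated, Sliced, and Aggregated (SISA) training. This approach divides the training data into disjoint subsets called shards and trains a separate, smaller model on each shard in isolation; the predictions of these constituent models are then aggregated to produce the output of the overall model. If a data point needs to be removed, only the model trained on the shard that contains it needs to be retrained, and this can happen in isolation from the rest. Each shard can be further divided into slices, with checkpoints saved during training, so that retraining can resume from the checkpoint before the affected slice rather than from scratch. Retraining is still needed in small portions with SISA, but alternative research around data removal-enabled (DaRE) forests leverages caching at nodes in an attempt to forget data while removing as much of the need for explicit retraining as possible.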
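To make the sharding idea more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the researchers' implementation: the class names ShardedEnsemble and CentroidModel, the toy one-dimensional lending data, and the rule that assigns a point to a shard by its ID are all hypothetical, and the slicing and checkpointing step of SISA is left out for brevity. It shows only the core property that removing a data point retrains a single shard's model and leaves the others untouched.

```python
from collections import Counter, defaultdict


class CentroidModel:
    """Toy 1-D nearest-centroid classifier (a stand-in for a real constituent model)."""

    def fit(self, points):
        sums = defaultdict(float)
        counts = Counter()
        for x, label in points:
            sums[label] += x
            counts[label] += 1
        self.centroids = {label: sums[label] / counts[label] for label in counts}
        return self

    def predict(self, x):
        # Return the label whose centroid is closest to x.
        return min(self.centroids, key=lambda label: abs(x - self.centroids[label]))


class ShardedEnsemble:
    """Hypothetical SISA-style ensemble: one model per shard, majority-vote aggregation."""

    def __init__(self, num_shards):
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]  # point_id -> (x, label)
        self.models = [None] * num_shards

    def _shard_of(self, point_id):
        # Assumption: a fixed, deterministic assignment of points to shards.
        return point_id % self.num_shards

    def _train_shard(self, index):
        points = list(self.shards[index].values())
        self.models[index] = CentroidModel().fit(points) if points else None

    def fit(self, dataset):
        for point_id, point in dataset.items():
            self.shards[self._shard_of(point_id)][point_id] = point
        for index in range(self.num_shards):
            self._train_shard(index)

    def unlearn(self, point_id):
        """Delete one point and retrain only the shard that held it."""
        index = self._shard_of(point_id)
        del self.shards[index][point_id]
        self._train_shard(index)
        return index

    def predict(self, x):
        # Aggregate the constituent models by majority vote.
        votes = Counter(m.predict(x) for m in self.models if m is not None)
        return votes.most_common(1)[0][0]


# Toy data: point_id -> (score, decision). Purely illustrative values.
data = {i: (float(i), "decline" if i < 6 else "approve") for i in range(12)}

ensemble = ShardedEnsemble(num_shards=3)
ensemble.fit(data)
models_before = list(ensemble.models)

retrained = ensemble.unlearn(4)  # forget data point 4
untouched = [i for i in range(3) if ensemble.models[i] is models_before[i]]
print("retrained shard:", retrained, "| untouched shards:", untouched)
```

In a realistic setting each constituent model would be far more expensive to train than this toy classifier, which is where the saving comes from: the cost of forgetting a point scales with the size of one shard rather than the whole data set. The trade-off is that each model sees less data, so the number of shards has to be balanced against accuracy.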
This is promising for the data science community, and for businesses whose models deliver a significant portion of their value, because the need to remove data is likely to keep arising in a dynamic and changing environment.
It’s a key question for the data science community, which is why we wanted to discuss the areas in which we see machine unlearning delivering the most value.
So, now that you have our thoughts, we’d love to hear yours. Please do leave a comment below, or get in touch, and let’s continue this fascinating and vital conversation.