GDPR and Machine Learning: Hindrance or Enabler?
An astonishing 2.5 quintillion bytes of data are produced online every day, according to 2018 research from American software company Domo. Contributing to this figure are the reams of data created by users through emails, social media accounts, online banking and any interaction with a business’s website.
This digital mountain of data has posed a problem for regulators in recent years: the assumption that users implicitly consent to service providers using their data has drawn attention to alarming privacy shortcomings.
The adoption in 2016 of the General Data Protection Regulation (GDPR) in the European Union (EU) was a watershed moment, both within Europe and beyond. Every company that has even one digital customer within the EU is bound to protect the personal data and privacy of its users, which has required a fundamental overhaul of business practices.
GDPR, and the spotlight it has shone on data protection, has been largely well received. The wide-reaching nature of the legislation does, however, present the possibility of direct operational conflict with machine learning research.
R&D in this field encompasses a vast array of wildly differing projects, but a fundamental requirement of any project that uses machine learning algorithms is access to a large, high-quality pool of training data. The type of data used is, of course, dependent on the nature of the algorithm, but as more machine learning applications are conceived, user data continues to be a highly prized asset.
A burning question for machine learning academics, researchers and engineers, now and in the future, is whether or not GDPR in its current form will overly hinder their access to satisfactory training data. This data’s quality is paramount, because low-quality or poorly vetted data used in the formation of these complex machine learning algorithms can lead to issues such as manifestations of bias in resultant AI platforms.
So, how concerning is GDPR’s influence on the way user data is treated and accessed by machine learning developers? Binary District Journal spoke to Dimitra Kamarinou, PhD candidate and researcher in Cloud Computing Law at Queen Mary University of London. An expert in data protection, Kamarinou has published several papers on the subject, including one focusing on GDPR’s relationship with machine learning.
Developers’ Reactions to GDPR
Understanding the impact of a complex piece of legislation such as GDPR on the machine learning development community can be difficult unless one is within it, or at least talking to those within it. With her work on data protection and the legal aspects of cloud computing, Kamarinou has been perfectly positioned to witness the immediate effects in the field.
“I’ve had conversations with people who want to do machine learning research, and the legislation is definitely on their minds, so it’s definitely something they will consider now, every time they want to undertake experiments,” Kamarinou explains.
“I'm talking about engineers and computer scientists mostly – they want to do research based on personal data, so they'll definitely have to think about it, whereas, before GDPR, we’d have to advise them, ‘Maybe you should think of this – the processing of personal data. You should think of the legal framework.’ I think GDPR and the publicity around it has made those concerns very mainstream – now people are thinking of that themselves.”
The Machine-Learning/AI Mix-up
The publicity GDPR has received has certainly heightened its impact on the consciousness of machine learning developers. However, the increased attention paid to machine learning has had an unfortunate side effect: its exclusive association with AI in the mind of the public.
“If you see the regulation initiative, [the response from] the Royal Society, the reports in the Information Commissioner’s Office and their initiatives to explain what it is, you might think it’s driven by the public reaction,” Kamarinou says.
“If we take the news and the newspaper articles, then I would definitely say there's a lot of hype. We looked at this issue two years ago, and even in my team, we had to speak to some computer scientists and engineers who did machine learning to understand its actual level of capability at the time.”
Understanding the Restrictions on Training Data
There is also a lack of public understanding concerning the freedoms that machine learning developers have when operating with user data. GDPR has received heavy press coverage, and B2C emails have been sent out across subscriber lists alerting consumers to changing regulations. The average person could be forgiven for assuming that detailed access to personal data by third parties is now essentially impossible.
That is emphatically not the case, however. There are still numerous applications in machine learning R&D that can use such data, but the degree to which they can use it is governed by the type of data and its intended use.
“It will depend on what your research is about,” Kamarinou says. “So, for example, if it’s scientific, there are specific exceptions in the regulations about how you can process data for your purposes. Similarly, if it's in the public interest – say, if you process data to combat diseases, or if you process data for national security. In Cardiff last year, for example, during the Champions League final, they were using facial recognition in the stadium for security purposes.”
Accountability for Machine Learning Programs?
The use of facial recognition made possible by machine learning algorithms has proved to be a controversial topic in the UK – its success rate in Cardiff was only 8% on the day. Although no arrests of innocent people occurred, the high number of false positives returned by the program highlights the need for GDPR to increase the level of transparency from machine learning developers about how their products actually work.
“In machine learning, one of the biggest issues now being discussed in the literature is the ‘right to explanation.’ So, when a machine based on general machine learning algorithms produces a decision, can you explain how it reached that decision? That's what our paper was about, but even that was two years ago.”
“In the time since, there’s been a lot of discussion about this issue, because GDPR introduced and expanded on the transparency obligation. That means, basically, that every time you process personal data, you have to be transparent about how you do it – why are you processing the data, are you allowed to do that and what are your grounds for doing so?”
On the surface, this increase in transparency from machine learning developers should go a long way to educating the general public about the technology. Not only that, but the reassurance that any decision reached by a machine could have its reasoning explained would be a positive step towards debunking any myths behind this highly technical area of computing.
However, the sheer technical complexity of machine learning is the main obstacle to this ‘right to explanation’ becoming a universal aid to transparency. As Kamarinou points out, “There can still be difficulty in trying to explain how an algorithm reached a particular decision, especially if the algorithm itself is not very easy to explain.”
Sharing the inner workings of a machine learning algorithm to satisfy public curiosity about how it reached a decision is hardly a tenable long-term prospect, especially as these platforms grow in complexity. Not only that, but wouldn’t a developer who made their workings public run the risk of exposing sensitive industrial information to competitors?
“That's something the regulation took into account,” says Kamarinou. “It says in the regulations that, just because you have to give meaningful information about the logic involved in the decision, it doesn’t mean you have to give away trade secrets. It has to be a balance between the two: give enough information so people understand what’s being processed with the logic, without revealing the code or the underlying protected software.”
Who Will GDPR Affect the Most in the Machine-Learning Field?
As with any legislation that governs a diverse area of industry and research, it’s likely that GDPR will not be felt by every machine learning developer in the same way. With such a variety of projects within the field, will the regulations be easier for some to adhere to than others? “I think it’s definitely going to be more difficult for startups,” Kamarinou says, “because they don't have access, potentially, to legal experts to tell them they’re doing it right or not.”
As with the special provisions factored in for machine learning projects focusing on scientific research and security, GDPR stipulations also differ depending on the size of the business. “Some of the obligations under GDPR take into account company size – the threshold is 250 employees – but some of the bigger obligations don’t really take that into account,” says Kamarinou.
GDPR offers various suggestions as to how developers can adapt their methods to cope with potentially limited access to traditional training data. One such suggestion is the use of synthetic data as a substitute for data made unavailable by GDPR restrictions. However, producing viable synthetic data can be a complex, resource-intensive task, and smaller-scale developers may not be in a position to produce such data on the same scale as larger companies.
The Future of Machine Learning Regulations
GDPR is undoubtedly a strong foundation for enhanced transparency and accountability in the use of user data. As machine-learning development progresses, though, questions have arisen over whether further legislation will need to be drafted to keep pace.
“There’s a lot of talk right now about whether you actually need any new regulation,” Kamarinou says. “GDPR came into force this year, and there were discussions about that ever since it was first proposed, in 2012. My understanding is that a lot of the legislation we have already will be used to regulate machine learning. So, for example, we have legislation about liability and data protection, and product liability relating to autonomous vehicles. I can’t predict what will happen, but I think we have enough already to deal with the applications.”
“In terms of data protection, which is the field I work in most, I think GDPR is quite a heavy legislation – it’s very broad and it took forever to come into force, so I think we’re going to be dealing with the results of that for a little while yet.”
Earlier this year, the BBC ran an experiment as part of its mini-season of programming around artificial intelligence and machine learning, in which it gave a machine access to millions of hours of archive footage. Using object recognition, text recognition and movement recognition, the machine was tasked with making a show that was as ‘BBC Four-like’ as possible.
The result will make reassuring viewing for anyone worried that their creative job is under threat from the machines. What came out was mostly nonsense, with streams of miscategorisation displayed so that viewers could see the ‘thought process’. It was a fascinating experiment, but it did not come close to creating something recognisable. So, while technology may be able to help you choose what to watch next, it will be nowhere near the director’s chair any time soon.
Illustrations by Kseniya Forbender
Editor responsible for this story: Margarita Khartanovich