02 December 2024
Many of the privacy challenges raised by AI systems are familiar: plenty of non-AI systems ingest large amounts of data, scrape the internet and produce biased outputs.
But AI is fundamentally different.
This post looks at three core data protection definitions that may need to be reconsidered in relation to AI: ‘personal data,’ ‘data controller’ and ‘data processor.’
Why is AI different to traditional computing?
AI by definition involves training systems on one set of data to enable them to identify patterns in a different set of data. In practice, many AI systems are developed by one organisation (the ‘provider’) and put into operation by another (the ‘deployer’). This separation is particularly significant with AI systems, as the development of models and the processing of data to train them are often the provider’s closely guarded commercial secrets, to which deployers may not have access.
The second major difference between AI and traditional computing is that AI ‘learns.’ As a result, biases and inaccuracies can be compounded as the system incorporates its own previous results into future responses, unless adequate human intervention takes place (often through RLHF: reinforcement learning from human feedback).
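To make that feedback loop concrete, here is a deliberately tiny sketch in Python, with invented data and a placeholder human_review function standing in for the RLHF-style intervention described above. It illustrates the mechanism only, not how any real system is trained.

```python
# Illustrative sketch only (no particular vendor's pipeline): a toy "model"
# that learns word frequencies and is then retrained on its own outputs.
# Without the human_review filter, whatever skew is present in the outputs
# is fed straight back into the training data, with nothing to correct it.
import random
from collections import Counter

def generate(weights: Counter, n: int = 50) -> list[str]:
    words, counts = zip(*weights.items())
    return random.choices(words, weights=counts, k=n)

def human_review(word: str) -> bool:
    """Placeholder for RLHF-style human feedback: reject flagged outputs."""
    return word != "biased"

training_data = ["neutral"] * 9 + ["biased"]   # initial data with a small skew
weights = Counter(training_data)

for round_number in range(5):
    outputs = generate(weights)
    outputs = [w for w in outputs if human_review(w)]  # remove this filter and the skew is never corrected
    weights.update(outputs)                            # the model "learns" from its own results
    print(round_number, dict(weights))
```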
The definition of personal data: is an AI model ‘personal data’?
Let’s begin with the example of large language models (LLMs). LLMs are built on parametric models: large sets of probabilistic connections (‘parameters’) between segments of words (‘tokens’).
There’s no dispute that the input data used to create and train LLMs can constitute personal data, or that output data produced when the model is queried (or ‘prompted’) can constitute personal data. However, whether the model itself is defined as containing personal data is disputed.
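To make the three objects in this debate concrete, here is a deliberately tiny, hypothetical ‘parametric model’ in Python: a table of transition probabilities between tokens, learned from a single invented sentence about a fictional person. Real LLMs hold billions of parameters rather than a readable table, but the structural point is the same: the model stores statistical weights derived from the input data, not records, yet prompting it can still produce output that names an individual.

```python
# Illustrative only: a toy bigram "parametric model" built from invented data.
# The model stores no sentence verbatim - only probabilistic connections
# ("parameters") between tokens - yet a prompt can still elicit output that
# identifies a person who appeared in the input data.
from collections import defaultdict, Counter

input_data = "Jane Doe lives in Utrecht . Jane Doe works at Example Corp .".split()

# "Training": count which token follows which, then normalise to probabilities.
counts = defaultdict(Counter)
for current, nxt in zip(input_data, input_data[1:]):
    counts[current][nxt] += 1
parameters = {
    tok: {nxt: c / sum(following.values()) for nxt, c in following.items()}
    for tok, following in counts.items()
}
print(parameters["Doe"])   # {'lives': 0.5, 'works': 0.5} - weights, not records

# "Prompting": follow the most probable connections from a starting token.
token, output = "Jane", ["Jane"]
for _ in range(5):
    next_options = parameters.get(token, {})
    if not next_options:
        break
    token = max(next_options, key=next_options.get)
    output.append(token)
print(" ".join(output))    # -> "Jane Doe lives in Utrecht ."
```

Whether such a set of weights itself ‘contains’ personal data is precisely the point of dispute.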
A discussion paper on Large Language Models and Personal Data, published by the Hamburg Commissioner for Data Protection and Freedom of Information in August 2024, suggested that parametric models do not themselves contain personal data:
“The mere storage of an LLM does not constitute processing within the meaning of article 4(2) GDPR. This is because no personal data is stored in LLMs.”
The Hamburg paper goes on to argue that, even though “data subject rights as defined in the GDPR cannot relate to the model itself […] claims for access, erasure or rectification can certainly relate to the input and output of an AI system.”
However, there are both practical and conceptual problems with this position.
One practical problem is that individuals are unlikely to be able to exercise their rights over input data if they are unaware their data has been used to train the system. Where web-scraping is involved, this is likely to be the case.
A second practical problem concerns the interaction with the EU’s AI Act: placing the model on the EU market would trigger the AI Act, but if the model itself contains no personal data, a provider that makes only the model available in the EU would fall outside the GDPR.
The first conceptual problem is that the model itself contains probabilistic connections between tokens that are capable of producing outputs that indirectly identify individuals. This appears to meet the GDPR’s definition of ‘personal data’ in Article 4(1): any information relating to an identified or identifiable natural person, where identification may be direct or indirect.
Whilst this may look like an issue with the output data rather than the model itself, if the same reasoning were extended to other computing systems it could mean that any probabilistic system is deemed outside the scope of data protection law. This could include data processed by quantum computers.
The second conceptual problem is that if the model itself is not considered ‘personal data’ then it would be outside the scope of the GDPR. This would mean that a provider who gives access to their model to a deployer would not pass any personal data to them. The provider could not be considered a ‘data controller’ for the model, or a ‘joint-controller’ with the deployer, because the model itself would be outside the scope of GDPR.
This brings us to two more core concepts of data protection that AI challenges…
The definitions of ‘controllers’ and ‘processors’
If a model were not considered ‘personal data’ (and hence were outside the scope of the GDPR), the provider could at most be considered a ‘joint-controller’ of any personal data input by the deployer and processed by the provider. If the provider carried out no such processing (for example, where the model is an off-the-shelf, on-premise solution), the provider could not be deemed a joint-controller of the input data, or even a data processor.
This has significant implications for the liability of deployers. They would be fully responsible under data protection law for outputs produced by an AI system, but would not have the ability to alter (or possibly even understand) how that system works.
Data Protection Authorities are considering these complex issues...
The UK ICO’s March 2023 Guidance on AI (currently being updated following a series of consultations) stresses how complex it is to identify controllers in an AI supply chain, but does recognise that joint-controllership between providers and deployers may be possible.
The CNIL issued guidance on Determining the legal qualification of AI system providers on 07 June 2024, focusing on the processing of input data to train AI systems. The guidance concludes that providers “may be qualified as controllers, joint controllers or processors” depending on the circumstances.
The Irish Data Protection Commission, in its blog on AI, Large Language Models and Data Protection from 18 July 2024, argued that a user of an AI system (a ‘deployer’ under the AI Act) “could be a data controller and if so a formal risk assessment should be considered.”
An example by way of contrast
Perhaps one way to see the contrast between AI systems and more traditional algorithms drawn from large databases is the May 2024 decision of the Dutch Data Protection Authority (Autoriteit Persoonsgegevens) to fine the facial recognition provider Clearview €30.5 million for unlawful processing.
In this case, Clearview collected input data for a facial recognition system, developed the facial recognition algorithm, and carried out analysis for its users, providing them with the results. The Dutch DPA found that:
“Clearview determines the process relating to the collection of personal data, the build-up of the database, its maintenance, and the training of the Clearview facial recognition algorithm. […] Clearview also determines which technology they will then use to compare the photos uploaded by clients to all photos that are already in the database set up and maintained by Clearview.”
Here, the AI system provider (Clearview) created the training dataset, developed the algorithm and applied it to the deployer’s (user’s) input data. That is, Clearview processed the user’s input data through the algorithm.
This is different to the scenario in which an algorithm (or, in the case of AI, a parametric model) is supplied to a user/deployer, who then uses it without requiring the provider to carry out the actual analysis. In that case, it would be difficult to argue that the provider is a data controller if the underlying model were not itself considered ‘personal data.’
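The architectural difference underlying that distinction can be sketched in a few lines of Python. Both the endpoint and the local_model object below are hypothetical placeholders, not any real provider’s API: in the first pattern the deployer sends its input data to the provider, who runs it through the model; in the second the deployer runs a supplied model on its own infrastructure and the provider never touches the data.

```python
# Two hypothetical deployment patterns (names and endpoint are invented).
import json
import urllib.request

def provider_hosted_match(photo_bytes: bytes) -> dict:
    """Clearview-style pattern: the deployer's input data is sent to the
    provider, who processes it through the model and returns the result."""
    req = urllib.request.Request(
        "https://api.example-provider.test/v1/match",   # hypothetical endpoint
        data=photo_bytes,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def on_premise_match(photo_bytes: bytes, local_model) -> dict:
    """Off-the-shelf pattern: the provider ships the model; the deployer runs
    it locally and the provider never sees the input data."""
    return local_model.predict(photo_bytes)
```

In the first pattern the provider plainly processes the deployer’s input data; in the second it does not, which is why its status under the GDPR turns on whether the supplied model is itself ‘personal data.’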
This is one example of how AI is testing traditional data protection concepts, even concepts that seemed clear just a few years ago, when algorithms were simpler.
Conclusion: a pragmatic approach?
These are just three of the core definitions of data protection law that AI systems challenge. There are many other key underlying principles of privacy law that must be re-examined in light of AI.
Privacy pros should not despair, however. Courts and national data protection authorities have in the past shown flexibility and skill in applying data protection principles to new technology.
The Google Spain judgment of May 2014, which looked at how EU data protection law (then the Data Protection Directive, the GDPR’s predecessor) applied to search engines, is just one example. The European Court of Justice took a pragmatic approach, holding that a search engine must comply with data protection law only “within the framework of its responsibilities, powers and capabilities” (at para 38). Whether that flexible approach can solve some of the core data protection issues raised by large AI systems remains to be seen.
In the meantime, providers and deployers can reduce their compliance risks by clearly documenting their relationships with one another in written contracts. This is the current advice of the UK’s ICO, and it is certainly a great starting point…