07 January 2025
How does the EU’s main advisory body for data protection apply the GDPR to AI models?
The European Data Protection Board (EDPB) issued its Opinion 28/2024 “on certain data protection aspects related to the processing of personal data in the context of AI models” on 17 December 2024.
As the EDPB is the oversight, advisory and cooperation body for all EU privacy regulators under the GDPR, the Opinion will form the basis of how the EU's 27 national Data Protection Authorities (DPAs) interpret and enforce the GDPR in relation to AI.
The release of the Opinion was followed on 20 December by the Italian DPA, the Garante, announcing a €15 million fine against OpenAI for unlawfully processing personal data to train its models and in managing user registration data.
The Garante's investigation had been ongoing since it first banned the use of ChatGPT in Italy in March 2023, and its decision dates from November 2024. According to the Garante's press release, however, the decision was published only after the EDPB's Opinion was issued.
The release of the Garante’s decision after the Opinion shows that the impact of this key EDPB document is already being seen.
This post will dive into the Opinion to look at some of its key findings, underlying assumptions, and the potential impact for AI providers and deployers operating in the EU.
The final (longer) section will break down the questions in the Opinion in more detail. (The detailed part is clearly signposted so readers can decide whether to dive even deeper!)
In Summary
The good news for AI developers/deployers
Web-scraping is not prohibited, but is tightly constrained, with several safeguards expected (see paras 104-106). However, the major risks web-scraping poses to individuals are noted (see para 80).
‘Legitimate interests’ is not excluded as a legal basis for developing and deploying AI models, even where large amounts of personal data are used to train them (see para 66). However, some of the safeguards suggested to meet the balancing test for this legal ground are perhaps unrealistic for larger foundation models (these are set out below in the details section).
Arguing that an AI model of itself does not constitute ‘personal data’ (and hence is outside the scope of GDPR) is possible, but will require a developer or deployer to prove that a huge number of safeguards are in place to prevent re-identification of any individual whose data was included in the original training set (see para 34).
The documentation required by the AI Act, particularly for high risk systems, may be used to demonstrate some of the criteria for showing a legitimate interest under the GDPR, but will not be conclusive, and additional documentation will be required for GDPR compliance (para 131).
The bad news
Both developers and deployers are likely to be deemed data controllers (rather than processors) in relation to AI models. Although joint-controllership is a possibility (see para 112 for example), it is not clear whether the use of a model would involve a deployer being in a joint-controller relationship with the model developer.
Whilst theoretically possible, successfully arguing that models are not ‘personal data’ looks close to impossible in practice. The Opinion effectively creates a presumption that all models will be ‘personal data’ (and hence that the GDPR will apply) unless a controller can demonstrate otherwise.
Relying on legitimate interests as a lawful basis will be tough. In particular, legitimate interests cannot be speculative, which presents a challenge for foundation model developers (see para 68).
National DPAs can use a number of enforcement measures against developers and deployers, including ordering dataset deletion and model disgorgement (para 114).
The powers of DPAs under the GDPR are without prejudice to the enforcement powers of any national authority designated under the AI Act (para 117).
What developers and deployers should do
Clarify the roles of all parties: make sure contractual arrangements with suppliers (developers) and customers (deployers) clearly state whether the parties are data controllers, joint-controllers or (if arguable) data processors
Safeguards: Implement as many of the recommended safeguards set out in the Opinion as possible (summarised in the details section under Question 2: legitimate interests)
Document, document, document: keep records of every decision made in relation to selecting data to train your model, filtering out data, pseudonymising or masking data (or decisions not to do this), training the model, cybersecurity testing of the model, output verification and checking, and all risk assessments conducted (a sketch of one possible record structure follows this list)
Work with your privacy team: document DPIAs, feedback from your DPO, results of any training or security testing, all technical and organisational measures for security, bias reduction and accuracy, and (for developers) all documentation provided to deployers (para 58 has a list of documentation)
Be as transparent as possible: make sure any individual whose data may be used to train an AI model is aware that this may happen, for example through publicity campaigns, information on your website, social media posts, and publishing safety cards and model specifications online
Work with your national DPA: involve them early to obtain feedback on the development process where possible
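On the documentation point, below is a minimal sketch of what a single entry in a training-data decision log might look like. This is purely illustrative: the field names and example values are hypothetical and are not drawn from the Opinion, which does not prescribe any particular format.

```python
# Illustrative only: one possible shape for a per-decision record in a
# training-data log. Field names and values are hypothetical examples.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TrainingDataDecision:
    made_on: date
    dataset: str                      # e.g. source or collection batch
    decision: str                     # e.g. "excluded", "pseudonymised", "retained as-is"
    rationale: str                    # why, including why safeguards were not applied
    safeguards_applied: list[str] = field(default_factory=list)
    reviewed_by: str = "DPO"          # who signed the decision off

log = [
    TrainingDataDecision(
        made_on=date(2024, 11, 5),
        dataset="forum-scrape-batch-12",
        decision="excluded",
        rationale="High proportion of health-related posts; Article 9 risk.",
        safeguards_applied=["keyword screen", "manual spot-check"],
    )
]
```

Whatever form the record takes, the point is that each choice (including choices not to apply a safeguard) is captured with its reasoning and sign-off, so it can be produced if a DPA comes asking.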
Some issues of contention with the Opinion
Definitions differ between the Opinion and the AI Act: the Opinion acknowledges this, but it is nevertheless unhelpful for organisations trying to comply with both sets of requirements
Like the AI Act (and the GDPR itself) the Opinion focuses on the final use case (or foreseeable use) of a particular model, which presents problems for foundation model developers (some of which are acknowledged). This is particularly problematic when considering the safeguards and mitigations suggested to meet the legitimate interests test for processing.
Many of the mitigating measures to meet the balancing test for legitimate interests are unrealistic, including removing “irrelevant” personal data from any training set. Again, foundation models do not seem to have been at the forefront of the EDPB’s thinking.
The Opinion does not cover the use of sensitive personal data (SPD) or automated decision-making. Whilst the latter is perhaps more an issue for deployers, the difficulties in filtering out SPD when collecting data for training means that the failure to consider the impact of Article 9 (grounds for processing SPD) on training leaves a significant gap and uncertainty for developers.
Due to its position as an advisory board, the EDPB’s Opinion falls short of providing definitive answers and leaves many issues to national DPAs to apply their judgement on a case-by-case basis. This is not a satisfactory position for AI deployers or developers, as it leaves a huge amount of uncertainty in how the GDPR will be applied in practice.
So is this a ‘Google Spain’ moment for AI?
In the Google Spain case, the Court of Justice of the European Union (CJEU) found a pragmatic solution to applying the Data Protection Directive (the precursor to the GDPR) to search engines. Essentially, a search engine operator was required to comply with the Directive, but “must ensure, within the framework of its responsibilities, powers and capabilities, that that processing meets the requirements of Directive 95/46” (para 83, emphasis added).
Whilst the Google Spain case was controversial and many saw it as contrary to the Directive’s intentions, it was nevertheless a pragmatic solution that allowed search engines to operate an ‘opt-out’ system for de-indexing. This was more practical than requiring them to comply with all requirements of the Directive in the context of the huge amounts of data they process.
So is this Opinion a ‘Google Spain moment’ for AI models?
In short, no. The EDPB has applied the GDPR to AI models in a manner that permits case-by-case analysis, but ultimately imposes safeguards and requirements on model developers and deployers that are, in my view, practically impossible to meet (at least for foundation model developers). Whilst I entirely concur with their analysis, concerns remain over whether this Opinion can be implemented in practice without damaging the AI industry that, according to para 8, the EDPB seeks to protect.
The Opinion in detail
First, some words on terminology…
One of the key issues with the Opinion is that it uses different terms for the various actors in the AI supply chain from those used in the AI Act (the EDPB acknowledges this at paras 19-23). So, for those comparing the two laws:
‘Developer’ in the Opinion = ‘Provider’ in the AI Act
‘Deployer’ in the Opinion = ‘Deployer’ in the AI Act, but could encompass other terms such as ‘Distributors’ or ‘Importers’
‘AI model’ in the Opinion ≠ ‘AI system’ in the Act (an ‘AI system’ in the AI Act will include a model, but is a broader term including user interface)
One striking point of similarity is the focus on the “supply chain” and “life-cycle” of a model (see para 18 in particular). However, this means that the Opinion shares one of the weaknesses of the AI Act, in that it applies most of its analysis to foreseeable uses of AI models (“purposes for processing”), which is problematic with foundation models where the end use is not usually apparent.
The Opinion acknowledges this shortcoming and tries to provide suggestions in relation to foundation models (see for example para 107). Ultimately, however, as with product safety law (on which the AI Act is largely based), the GDPR is structured around assessing personal data in light of using it for a particular purpose. As a result, it may struggle to adapt to processing where there is no clear end goal in mind.
What the Opinion does NOT cover
Perhaps frustratingly, the Opinion specifically does not discuss (para 17):
Use of sensitive personal data (SPD) in AI models
Use of AI models for automated decision-making or profiling
Re-purposing data (collected for one purpose but then used to train AI models)
Data Protection Impact Assessments (DPIAs)
Privacy by design
It should also be noted that the Opinion only looks at the subset of AI models that are trained using personal data (para 26).
Whilst the original request did not specifically mention SPD, its omission is particularly frustrating, as many larger models are trained using datasets that do not filter out SPD. The EDPB has commented in the past that developers should filter out any SPD as soon as possible and not use it to train models (para 19 of the ChatGPT Task Force Report, May 2024), but acknowledged that this is difficult to achieve in practice.
In addition, the EDPB's guidelines on legitimate interests as a basis for processing state that: "a set of data that contains at least one sensitive data item is deemed sensitive data in its entirety, in particular if it is collected en bloc without it being possible to separate the data items from each other at the time of collection" (para 40, emphasis added, quoting the CJEU's judgment in Meta v. Bundeskartellamt at para 89). The omission of SPD from the Opinion is therefore especially challenging for bulk data collection to train models.
What the Opinion DOES cover
The Opinion comes in response to a request from the Irish Data Protection Commission and answers four questions, which I’ve paraphrased into three here for simplicity:
1. Definition of personal data and anonymisation for AI models
2. Legitimate interests as a legal basis for processing personal data in AI models
3. Impact of unlawful processing in development on subsequent use of AI models
Let’s break down each question in turn…
1. Definition of personal data and anonymisation for AI models
Is an AI model that has been trained on personal data itself classified as ‘personal data’ or can it be seen as ‘anonymised’?
For example, where an LLM is trained on lots of personal data scraped from the internet, but only the parametric model is used by deployers, with no access to the training data
The short answer is: yes, the model itself will almost always constitute personal data.
The Opinion divides AI models into two groups:
Models specifically designed to provide personal data about people on whose data the model has been trained
An example might be a deepfake detection system designed to identify deepfake images of political figures, trained using images of these people
These “types of models cannot be considered anonymous” (para 29)
Models not specifically designed to provide personal data about people on whose data the model has been trained
An example might be a model trained to identify video images that show someone shoplifting, trained on security footage from one shop but intended to be used by shops in other countries
These models “may still retain the original information” from the training data “absorbed in the parameters of the model.” If it is reasonably possible to obtain information relating to an identifiable individual whose data was used to train the model, “such a model is not anonymous” (para 31)
In spite of this clarity, the Opinion does not rule out the possibility of making a model anonymous and – frustratingly – leaves it to national DPAs to assess this on a case-by-case basis (para 34).
Instead, the Opinion creates a de facto presumption that models trained on personal data will constitute personal data under GDPR, rebuttable by a developer or deployer if they can demonstrate to the national DPA that: “with reasonable means: (i) personal data, related to the training data, cannot be extracted out of the model; and (ii) any output produced when querying the model does not relate to the data subjects whose personal data was used to train the model” (para 38; see para 41 for a list of criteria to determine what are "reasonable means").
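Purely by way of illustration, the sketch below shows one naive way a developer might begin to probe those two conditions: feeding the model prefixes of known training records and checking whether it completes them verbatim. The `generate` callable, prompts and records are hypothetical, and a real assessment would also need the attack-resistance testing (membership inference, model inversion and so on) discussed later in the Opinion.

```python
# Illustrative only: a naive regurgitation probe. "generate" is a stand-in
# for whatever interface the model under test exposes; the records below
# are invented examples, not a real test suite.
from typing import Callable, Iterable

def probe_for_regurgitation(
    generate: Callable[[str], str],
    training_records: Iterable[str],
    prefix_len: int = 30,
) -> list[str]:
    """Return training records whose withheld continuation the model reproduces verbatim."""
    leaked = []
    for record in training_records:
        prompt, expected_tail = record[:prefix_len], record[prefix_len:]
        output = generate(prompt)
        # A verbatim match of the withheld tail suggests the record is
        # retained in the model's parameters in an extractable form.
        if expected_tail and expected_tail in output:
            leaked.append(record)
    return leaked

# Example with a toy "model" that has memorised one record:
memorised = "Jane Doe, 12 Example Street, phone 0123 456 789"
toy_model = lambda prompt: memorised if memorised.startswith(prompt) else ""
print(probe_for_regurgitation(toy_model, [memorised, "No personal data in this record"]))
```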
These tests may be possible with small language models, but since smaller models tend to be built for use with a particular group of individuals (employees or customers, for example), it seems unlikely that a developer or deployer will be able to convince a DPA that its model is anonymous. As such, the model itself will remain subject to the full requirements of the GDPR (see paras 60-65 in particular).
In conclusion on the anonymity question, the Opinion gives an extensive list of factors for DPAs to take into account when assessing a claim that a model is anonymous and hence outside the scope of the GDPR (see paras 41-58). These include careful documentation of the model design, testing, data preparation and minimisation, and selection of data to train the model. However, given the extent of proof required to claim a model is not subject to the GDPR, developers and deployers may prefer to concentrate their resources on finding ways to adhere to the GDPR when developing and deploying AI models.
This brings us to the next question: what legal basis can developers and deployers rely upon to develop and use AI models?
2. Legitimate interests as a legal basis for processing personal data in AI models
Can a developer rely on the legal basis of legitimate interests to develop, train and adjust an AI model where personal data is involved?
Can a deployer rely on the legal ground of legitimate interests to use an AI model, including when the model has been developed by another party?
Anyone who has read this far will probably know that the GDPR requires a person or organisation processing personal data to have a legal basis for doing so, taken from the exhaustive list found in Article 6. Legitimate interests is one of these legal bases (Art 6(1)(f)).
Importantly, it is extremely difficult to rely on legitimate interests as a basis for processing SPD (which is outside the scope of the Opinion), and so any web-scraping or bulk data analysis for training models will probably be unable to rely on this basis. (Any processing of SPD will also have to meet one of the grounds in Art 9, which do not include legitimate interests: see the EDPB's comments in the ChatGPT Task Force report mentioned above.)
The Opinion repeats the criteria for relying on the legitimate interest basis from its Guidelines on legitimate interest (paras 66-68 of the Opinion):
a. The interest is legitimate, meaning it is:
i. lawful
ii. clearly and precisely articulated; and
iii. real and present, not speculative (legitimacy test)
b. Processing is necessary to pursue the legitimate interest (necessity test)
c. The legitimate interest is not overridden by the fundamental rights and freedoms of data subjects (balancing test)
With regard to (a) (legitimacy test), the ICO has already noted that breaching copyright to obtain training data may render processing unlawful. More significantly, the three aspects of defining a legitimate interest present real problems for foundation models, as they all require a clear purpose for the processing (and not a speculative one: para 68). It is telling that the Opinion gives examples of legitimate interests that relate to specific end use cases, including fraud detection and chatbots, rather than general purpose / foundation models (see para 69).
For (b) (necessity test), the Opinion gives detailed guidance (paras 70-75) but again stresses that the GDPR requires clarity of purpose and an assessment of whether less-intrusive means of achieving the purpose are possible. Data minimisation and the technical safeguards suggested to support a claim that a model is anonymous should be considered by DPAs:
When assessing whether the condition of necessity is met, SAs [DPAs] should pay particular attention to the amount of personal data processed and whether it is proportionate to pursue the legitimate interest at stake, also in light of the data minimisation principle.
Opinion, para 73
Whilst this is an important consideration, both for the legitimate interests test and general requirement to adhere to data minimisation, it may be difficult to achieve for foundation model developers in practice.
For (c) (balancing test), the Opinion outlines a number of serious risks to the rights of individuals that AI models may pose. It is interesting that these extend beyond the protection of privacy to include socioeconomic, financial, freedom of expression, mental health and anti-discrimination rights (paras 77-81). This reflects the broad range of risks identified in Art 1(1) of the AI Act.
The Opinion also looks at the nature of the data processed, the context of processing and possible further consequences for individuals (paras 82-90). Again, the “nature of the model and the intended operational uses play a key role” in this analysis, potentially presenting problems for foundation model developers.
A number of mitigating measures are proposed for developers and deployers to help meet the balancing test, including:
1. Model design and data selection
Selection of data sources, including documentation on selection criteria
Use of anonymised or pseudonymised data (or documented reasons for not using these techniques)
Using synthetic data or masked data to train models
Data minimisation techniques to restrict the amount of data used
Filtering processes to “remove irrelevant personal data”
2. Model training
Methodological choices in training the model, including choices to reduce identifiability of individuals (such as regularisation methods to reduce over-fitting and generalisation, and privacy-preserving techniques)
3. Output controls
Output restrictions to reduce the risk of personal data about individuals being unintentionally extracted from the model
4. Model audit, testing and documentation
AI model analysis, including documented governance processes, planned audits and reports of code reviews to avoid identification of individuals
AI model testing to resist attacks, including attribute and membership inference, exfiltration, regurgitation of training data, model inversion and reconstruction attacks
Documentation of all stages of model development and training
5. Transparency measures
Releasing publicly-available communications including details on collection criteria for datasets
Using non-traditional communication techniques, including different media outlets, emails, visualisations, model cards and annual transparency reports
6. Web-scraping safeguards
Excluding data from publications (such as news outlets)
Filtering out or preventing collection of certain categories of data (such as health or geolocation information)
Preventing collection from certain websites or sources, including respecting robots.txt or ai.txt files on websites (a minimal code sketch of such a check appears after this list)
Using clear criteria for collection based on time periods (presumably using older data for training, but this is unclear)
Respecting data subject rights, including maintaining opt-out lists, making the right to object clearly available to individuals whose data has been collected, and providing information on websites that the data on those sites may be scraped for training models
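As flagged above, here is a minimal sketch of how the robots.txt element of those web-scraping safeguards might be respected in a collection pipeline, using Python's standard urllib.robotparser. The crawler name, URLs and rules are invented for illustration; ai.txt files and opt-out lists have no standard-library equivalent and would need separate handling.

```python
# Illustrative only: checking robots.txt before scraping a page.
# The crawler name and URLs are hypothetical examples.
from urllib import robotparser

EXAMPLE_ROBOTS_TXT = """
User-agent: example-training-bot
Disallow: /health/
Disallow: /private/
""".strip().splitlines()

def may_scrape(url: str, user_agent: str, robots_lines: list[str]) -> bool:
    """Return True only if the site's robots.txt permits this crawler to fetch the URL."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_lines)          # in practice: parser.set_url(...); parser.read()
    return parser.can_fetch(user_agent, url)

for page in ("https://example.com/news/article-1",
             "https://example.com/health/clinic-records"):
    print(page, may_scrape(page, "example-training-bot", EXAMPLE_ROBOTS_TXT))
```

A check like this only addresses one safeguard in the list; the Opinion clearly expects it to sit alongside source selection, category filtering and the transparency measures above.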
The Opinion reiterates many of these measures in relation to the deployers of systems (paras 107-108), including the need for deployers to conduct their own legitimate interests balancing test and a recommendation that the results of these tests be published.
In conclusion on the legitimate interests legal basis, the Opinion clearly imposes a lot of requirements and safeguards on developers and deployers if this legal basis is to be relied upon successfully.
The difficulty for developers, particularly of larger models, is that other bases are unlikely to be available (see, for example, the ICO’s opinion that only legitimate interests is available as a legal basis for web-scraping).
However, the additional complexity relates to SPD. It is extremely difficult to rely on legitimate interests for the processing of SPD (and an additional ground under Art 9 is required). This suggests that filtering sensitive data at the point of collection will be a requirement for any developers of larger models, where developers do not have direct contact with the individuals involved that might allow them to rely on other grounds (like consent).
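To make the point concrete, below is a deliberately crude sketch of what a first-pass filter for SPD at the point of collection might look like. The categories and keyword lists are invented and far too simplistic for real use; in practice developers would need trained classifiers, entity recognition and human review, which is precisely why the EDPB has acknowledged that filtering SPD is difficult to achieve.

```python
# Illustrative only: a crude keyword screen for special category data
# applied at the point of collection. The terms below are invented examples;
# a production pipeline would use trained classifiers and human review.
import re

SPECIAL_CATEGORY_TERMS = {
    "health": ["diagnosis", "prescription", "medical record"],
    "political opinions": ["party membership", "voted for"],
    "religious beliefs": ["religion", "faith"],
}

def flag_special_categories(text: str) -> list[str]:
    """Return the Article 9 categories a piece of text appears to touch."""
    lowered = text.lower()
    return [
        category
        for category, terms in SPECIAL_CATEGORY_TERMS.items()
        if any(re.search(rf"\b{re.escape(term)}\b", lowered) for term in terms)
    ]

def filter_for_training(records: list[str]) -> list[str]:
    """Keep only records with no flagged special category content; in practice,
    excluded records (and the reason for exclusion) should themselves be logged."""
    return [r for r in records if not flag_special_categories(r)]

sample = ["The committee met on Tuesday to discuss the budget.",
          "Her prescription was renewed after the diagnosis."]
print(filter_for_training(sample))
```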
3. Impact of unlawful processing in development on subsequent use of AI models
If an AI model is developed by processing personal data unlawfully, what impact does that have on the lawfulness of a deployer using the model?
The good news for deployers is that the Opinion does envisage a scenario where a deployer – if it is a separate entity to the developer – may be able to use a model lawfully, even where that model was developed unlawfully for the purposes of the GDPR.
Where a developer deploys a model that has been produced unlawfully, it is unlikely that the deployment phase will be lawful under GDPR (paras 122-123). However, where a separate entity deploys the model, national DPAs should assess whether the deployer has carried out full due diligence in obtaining the model for use:
SAs [DPAs] should take into account whether the controller deploying the model conducted an appropriate assessment, as part of its accountability obligations to demonstrate compliance with Article 5(1)(a) and Article 6 GDPR, to ascertain that the AI model was not developed by unlawfully processing personal data.
Opinion, para 129
Deployers should therefore ensure that full due diligence assessments are carried out before putting AI models into use. These assessments should include examining the source of the data used to train the model, whether the training data was obtained unlawfully and whether the developer was subject to any sanctions from a DPA or court (paras 129-130).
The Opinion does note that the AI Act requires declarations of conformity from developers of high-risk systems (providers under the AI Act) which may assist deployers, but notes that this will not be conclusive in terms of GDPR compliance (para 131).
In conclusion on unlawful development affecting deployment, the Opinion does provide a basis for deployers to use models trained unlawfully. However, deployers should be careful to ensure that they conduct full due diligence on their suppliers, document every stage of the procurement process, and engage with their suppliers to obtain as much technical information on the training and adjustment of the model as possible.
Obtaining information from developers may, however, be difficult where deployers are in a weak negotiating position, for example where developers are major tech companies. Even so, demonstrating that questions were asked, and information sought, could be an important consideration if a deployer is ever investigated by a DPA. Deployers should therefore always try to obtain as much information as possible from developers.
It may be, however, that the ability of deployers to comply with their due diligence and transparency requirements under the GDPR rests on their developers’ compliance with the conformity assessment and transparency obligations of the EU’s AI Act. How these two key legal instruments are enforced alongside one another will be interesting to watch.