The rise of artificial intelligence (AI) and its widespread availability offer significant growth opportunities for businesses. However, they also call for a robust governance framework to ensure compliance with regulatory requirements, especially under the European Union’s (EU) Artificial Intelligence Act (AI Act) (see our Guide to the AI Act) and the EU General Data Protection Regulation (GDPR). GDPR compliance is so important because (personal) data is a key pillar of AI. To function effectively, AI requires abundant, good-quality data so that it can be trained to identify patterns and relationships. Additional personal data is often gathered during deployment and incorporated into AI to assist with individual decision-making.
In this series of five blog posts, we discuss GDPR compliance throughout the AI development life cycle and when using AI.
This is our second episode. The first episode is available here.
Data Protection by Design
GDPR compliance plays a key role throughout the AI development life cycle, starting from the very first stages. This reflects one of the key requirements and guiding principles of the GDPR, called data protection by design (Article 25 GDPR). Businesses are required to implement appropriate technical and organizational measures, such as pseudonymization, both when determining the means of processing and during the processing itself. These measures should aim to implement data protection principles, such as data minimization, and integrate necessary safeguards into the processing to ensure GDPR compliance and protect individuals’ data protection rights.
AI Development Life Cycle
The AI development life cycle encompasses four distinct phases: planning, design, development, and deployment. In this context, in accordance with the terminology of the EU AI Act, we will refer to both AI models and AI systems.
- AI models are a component of an AI system and are the engines that drive the functionality of AI systems. AI models require the addition of further components, such as a user interface, to become AI systems.
- AI systems present two characteristics: (1) they operate with varying levels of autonomy and (2) they infer from the input they receive how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments.
In this blog post, we focus on the second phase of the AI development life cycle: design. We already discussed the first phase (planning) in a previous blog post.
The Design Phase
The second phase of the AI development life cycle involves implementing a data strategy, focusing on data gathering and addressing potential data quality issues. It also includes converting raw data into valuable information, anonymizing and minimizing personal data, and implementing privacy-enhancing technologies. In this phase, key issues for GDPR compliance include data collection, data preparation (including the choice of training methodology), measures regarding outputs of the AI model, and the model’s or system’s architecture.
Data Collection
For AI development, (personal) data can be collected either from first-party or third-party sources.
- First-party data refers to personal data directly collected from the individuals concerned.
- Third-party data refers to personal data collected from a third party, for example, from a data broker or collected with web scraping, a commonly used technique for collecting information from publicly available online sources.
GDPR compliance requires a careful assessment of the selection of sources used to train the AI model. According to the European Data Protection Board’s (EDPB, the umbrella group of the EU’s data protection authorities) Opinion on AI Models, this includes an evaluation of “any steps taken to avoid or limit the collection of personal data, including, among other things, (i) the appropriateness of the selection criteria; (ii) the relevance and adequacy of the chosen sources considering the intended purpose(s); and (iii) whether inappropriate sources have been excluded.” Typically, web scraping can be configured to ensure that specific data categories are not collected or that certain sources, such as public social media profiles, are excluded from data collection.
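As a minimal sketch of how such exclusions might be configured, the Python snippet below checks candidate URLs against a deny-list of domains and drops excluded data categories from each scraped record before it is stored. The domain names, field names, and function names are hypothetical and purely illustrative; they are not taken from the EDPB’s opinion, and a real scraper would need broader, documented selection criteria.

```python
from urllib.parse import urlparse

# Hypothetical deny-list: sources excluded from collection (e.g. social media).
EXCLUDED_DOMAINS = {"social.example", "forum.example"}

# Hypothetical data categories that must never be stored.
EXCLUDED_FIELDS = {"email", "phone", "date_of_birth"}


def is_allowed_source(url: str) -> bool:
    """Return False for excluded domains, their subdomains, or URLs without a host."""
    host = (urlparse(url).hostname or "").lower()
    if not host:
        return False
    return not any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def strip_excluded_fields(record: dict) -> dict:
    """Return a copy of the record without the excluded data categories."""
    return {k: v for k, v in record.items() if k not in EXCLUDED_FIELDS}
```

Applying the source check before any page is fetched, rather than filtering afterwards, means that data from excluded sources is never collected in the first place.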
Data Preparation
The preparation of data for the training phase is key to GDPR compliance. This requires, according to the EDPB, careful assessment of anonymization and pseudonymization techniques, with consideration for minimization and accuracy principles. These aspects are also important when choosing an AI training methodology.
- Anonymization. Anonymous data is not subject to the GDPR, so anonymizing personal data for AI training purposes is a good way to limit the scope of application of the GDPR (see episode 1). The standard for anonymizing personal data is very high and is the subject of complex case law, especially in Breyer and SRB v EDPS (under appeal at the time of writing). To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used to identify an individual. This requires taking into account all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments (Recital 26 GDPR). The EDPB considers that AI models may be anonymous, although that is highly unlikely in its opinion (see episode 1).
- Synthetic data. An alternative to collecting and anonymizing personal data can be the use of synthetic data, which avoids the complexities associated with meeting the legal standard for anonymization. Synthetic data consists of artificial data points engineered to serve as direct substitutes for real personal data in various downstream applications. AI models learn the patterns and statistical attributes of the original data and can then be used to generate new, entirely made-up datasets. These synthetic datasets “look and feel” like the original data and contain all the statistical information but none of the personally identifiable information.
- Pseudonymization. Pseudonymization is also a good way to mitigate GDPR compliance risks. It is one of the measures identified in Article 25 of the GDPR under the data protection by design approach. Pseudonymization should be implemented considering the current technology, the implementation cost, as well as the nature, scope, context, and purposes of processing. The risks to the rights and freedoms of individuals, with varying likelihood and severity, must also be taken into account. Importantly, pseudonymous data is still personal data and therefore falls within the scope of the GDPR. However, pseudonymizing data helps mitigate risks, such as unauthorized access to the personal data in question. Pseudonymization can also serve as a mitigating measure that may tip the balance in favor of the AI developer when relying on legitimate interests as a legal basis for the processing of personal data (see episode 1).
- Minimization. Personal data must be adequate, relevant, and limited to what is necessary in relation to the purposes for which it is processed. This therefore requires a careful assessment of the personal data processed, determining whether it is necessary for AI development. AI models must be tested to prevent unintentional data memorization and reduce the risk of accidentally disclosing personal data.
- Accuracy. Personal data must be accurate and, where necessary, kept up to date. Every reasonable step must be taken to ensure that personal data that is inaccurate, having regard to the purposes for which it is processed, is erased or rectified without delay. Data accuracy is key both for input and output data. Inaccurate personal data input is not compliant with the GDPR and will lead to inaccurate output data. The GDPR transparency principle requires informing individuals about the accuracy limits of personal data generated by AI. The AI Act requires that high-risk AI systems be designed in such a way that they achieve an appropriate level of accuracy, which must be declared in the instructions for use of the AI system in question.
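To illustrate how pseudonymization and minimization might be combined when preparing training data, here is a minimal Python sketch. It keeps only the fields assumed to be necessary for the training purpose and replaces a direct identifier with a keyed hash (HMAC-SHA256). The field names, the `customer_id` identifier, and the function names are hypothetical, and keyed hashing is only one of several possible pseudonymization techniques.

```python
import hashlib
import hmac

# Hypothetical: only these fields are deemed necessary for the training purpose.
REQUIRED_FIELDS = ("age_band", "region", "purchase_total")


def pseudonymize_id(identifier: str, secret_key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def prepare_record(raw: dict, secret_key: bytes) -> dict:
    """Keep only the required fields and swap the customer ID for a pseudonym."""
    prepared = {field: raw[field] for field in REQUIRED_FIELDS if field in raw}
    prepared["subject_pseudonym"] = pseudonymize_id(raw["customer_id"], secret_key)
    return prepared
```

Anyone holding the secret key can recompute the pseudonym for a known identifier, so the key would need to be stored separately from the training data and under strict access controls. The output remains personal data within the scope of the GDPR.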
Measures Regarding Outputs
Generative AI trained on personal data might unintentionally reveal some of that data when prompted. If the AI model lacks safeguards such as response filtering or differential privacy, a user could extract personal information by crafting specific queries. It is therefore critical to adopt measures that lower the likelihood of queries returning personal data drawn from the training data.
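One simple form of response filtering can be sketched as a post-processing step that redacts recognizable identifiers from model output before it reaches the user. The Python example below is an assumption-laden illustration: the regular expressions and the `filter_response` function are hypothetical, cover only email addresses and phone-like numbers, and would miss names, addresses, and other free-text personal data.

```python
import re

# Illustrative patterns only; real deployments need broader detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def filter_response(text: str) -> str:
    """Redact email addresses and phone-like numbers from a model response."""
    text = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    return PHONE_RE.sub("[REDACTED PHONE]", text)


print(filter_response("Contact jane.doe@example.com or +32 2 555 01 23."))
# prints: Contact [REDACTED EMAIL] or [REDACTED PHONE].
```

Pattern-based filtering acts only on the output and does not change what the model has memorized. It is therefore best seen as a complement to training-time measures such as differential privacy, not a substitute for them.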
Architecture Design
In the design phase, AI engineers select the prepared data and the most suitable algorithms and techniques for the problem they are trying to solve. The architecture design should also include mechanisms for human oversight and intervention under the GDPR and the AI Act. This is quite challenging given that black-box AI models currently make up a substantial portion of the most sophisticated machine learning models on the market. These AI models are built to analyze data autonomously and in a manner that is frequently challenging to decipher from the outside. Although users can view the inputs and outputs of the system, they are unable to observe the internal workings of the AI tool that generates those outputs.
Naturally, this makes it more challenging to transparently convey the intricacy of the analytical procedures used to the affected individuals.
- GDPR and automated individual decision-making. Save for limited exceptions, the GDPR gives data subjects the right not to be subject to decisions based solely on automated processing that produce legal effects concerning them or similarly significantly affect them. Where such decisions are permitted, the individuals concerned have the right to obtain human intervention, to express their point of view, and to contest the decision. Thus, when designing AI, it is important to provide for the possibility of human intervention to comply with this provision. In addition, individuals must be provided with meaningful information about the logic involved in the automated individual decision-making.
In Dun & Bradstreet, the Court of Justice of the EU clarified that this entails an obligation to explain by means of relevant information and in a concise, transparent, intelligible, and easily accessible form, the procedure and principles applied to use personal data to obtain a specific result. The mere communication of a complex mathematical formula or algorithm is not sufficient. The explanation offered must help the data subject understand and challenge the automated decision. If disclosing such information may entail the disclosure of trade secrets, the company in question must provide the relevant information to the court or supervisory authority, which will determine on a case-by-case basis whether and what information should be supplied to the data subject.
- AI Act and human oversight for high-risk AI. Under the AI Act, high-risk AI systems must be designed and developed in such a way that they can be effectively overseen by humans (see here). Human oversight must aim to prevent or minimize the risks to health, safety, or fundamental rights – including the right to the protection of personal data – that may emerge when a high-risk AI system is used in accordance with its intended purpose or under conditions of reasonably foreseeable misuse. The oversight measures must be commensurate with the risks, level of autonomy, and context of use.
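As a rough architectural sketch of how a human intervention point could be built in, the Python example below routes any decision flagged as having a legal or similarly significant effect to a human reviewer instead of finalizing it automatically. The `Decision` class, the `significant_effect` flag, the threshold, and the function names are all hypothetical. Whether a given design meets the GDPR or AI Act requirements is a legal assessment that code alone cannot settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    subject_id: str
    model_score: float
    significant_effect: bool  # hypothetical flag: legal or similarly significant effect
    outcome: Optional[str] = None
    decided_by: Optional[str] = None


def route_decision(decision: Decision, threshold: float = 0.5) -> Decision:
    """Finalize automatically only when the decision has no significant effect."""
    if decision.significant_effect:
        decision.outcome = "pending_human_review"
        decision.decided_by = None
    else:
        decision.outcome = "approved" if decision.model_score >= threshold else "declined"
        decision.decided_by = "model"
    return decision


def record_human_review(decision: Decision, reviewer: str, outcome: str) -> Decision:
    """Store the reviewer's outcome, which may differ from the model's suggestion."""
    decision.outcome = outcome
    decision.decided_by = reviewer
    return decision
```

Recording who made the final decision creates an audit trail showing where a human was involved. For the review to be meaningful, the reviewer would also need the authority and the information to depart from the model’s score.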
For more information on this or other AI matters, please contact one of the authors.
The authors would like to thank Ekaterina Fakirova for her assistance in preparing this blog post.