29 April 2024
The Financial Times has agreed a deal with OpenAI to license content for large language models (LLMs), addressing copyright concerns. But does this present data protection issues?
According to the OpenAI website: “The Financial Times today announced a strategic partnership and licensing agreement with OpenAI, a leader in artificial intelligence research and deployment, to enhance ChatGPT with attributed content, help improve its models’ usefulness by incorporating FT journalism, and collaborate on developing new AI products and features for FT readers.”
The full details of the deal have not been disclosed, but at first glance there seems to be a difference between working to “enhance ChatGPT with attributed content” and to “help improve its models’ usefulness by incorporating FT journalism.”
Although aimed at addressing copyright concerns, the data protection issues underlying these two aspects may be quite different, because the FT’s original data processing is likely to have been conducted for the purposes of journalism. Journalism is treated differently from other processing in data protection law, to balance freedom of expression against the right to privacy.
🔹 "Enhanc[ing] ChatGPT with attributed content" suggests that ChatGPT may present FT content in a manner similar to a search engine. If this is the case, then the approach taken by courts in search engine cases like Google Spain may apply.
🔹 Helping to “improve its models’ usefulness by incorporating FT journalism” suggests that FT content might be used to train LLMs more generally, treating it like any other training data.
If this is the case, then data collected for journalistic purposes might be used to adjust the connections a model makes between tokens when producing outputs, shaping what the model “knows” about how to connect words, including in outputs that may identify individuals. In other words, journalism may be used to influence the standard response ChatGPT gives to prompts asking about individuals, with the results presented as ‘fact’.
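To make that concrete, below is a deliberately toy sketch in Python: a simple bigram model, nothing like OpenAI’s actual training pipeline, with snippet texts and source labels invented purely for illustration. It shows how, once training folds text into shared parameters, the model retains no record of which source shaped any given output.

```python
# Toy illustration only (not how ChatGPT is built): a bigram model shows how
# source text is absorbed into statistical connections between tokens, after
# which individual outputs can no longer be traced back to one source.
from collections import defaultdict
import random

def train_bigrams(corpus_by_source):
    """Fold every source's text into one shared table of token transitions."""
    transitions = defaultdict(list)
    for source, text in corpus_by_source.items():
        tokens = text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            # The source label is discarded here: only the token-to-token
            # connection survives in the model's parameters.
            transitions[prev].append(nxt)
    return transitions

def generate(transitions, start, length=8):
    """Sample a continuation; nothing records which source shaped it."""
    out = [start]
    for _ in range(length):
        options = transitions.get(out[-1])
        if not options:
            break
        out.append(random.choice(options))
    return " ".join(out)

# Hypothetical snippets standing in for licensed journalism and other data.
corpus = {
    "ft_article": "the chief executive announced record profits this year",
    "other_data": "the chief executive resigned after the announcement",
}
model = train_bigrams(corpus)
print(generate(model, "the"))  # blends both sources; attribution is lost
```

This is why per-output attribution, raised below, does not come naturally to a trained model: retrieval-style “attributed content” keeps a pointer back to a source document, whereas ordinary generation draws on parameters in which the sources have been blended together.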
The details of the deal are not public, so this is speculation. However, reading between the lines of the announcement (and thinking about how LLMs work), there could be data protection issues in allowing personal data originally processed for journalistic purposes to be re-purposed for training LLMs. In a commercial deal, these issues could affect the original controller (the FT) as well as the LLM provider (OpenAI).
Will OpenAI guarantee that every output generated by a model trained on FT data will attribute the FT as one of its sources? That would be a radical change in how LLMs work. If not, journalism may be presented as ‘fact’ in ChatGPT outputs: not necessarily problematic in itself, but no longer processing for the journalistic purposes for which the data was originally collected.
As more such agreements are reached with AI providers like OpenAI, individuals, civil society and regulators are likely to ask more data protection questions.