08 May 2024
OpenAI published its 'Approach to Data and AI' yesterday, followed by making its Model Spec available for public consultation today.
Approach to Data and AI
Although their ‘Approach to Data and AI’ is not quite a transparency statement (it wouldn't meet the requirements of the GDPR or of the AI Act), it does contain some interesting insights on the future of copyright and AI.
OpenAI says that it has been respecting web crawler permissions (such as the robots.txt standard) to restrict access to copyrighted materials when its crawlers gather large amounts of data from the web.
The company recognises that web crawler permissions are an imperfect mechanism for managing the collection of IP-protected data.
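For illustration, a site owner relying on this mechanism would add directives to their robots.txt file naming OpenAI's documented crawler user agent, GPTBot. A minimal sketch (the paths shown are hypothetical examples):

```
# Block OpenAI's training crawler from the whole site
User-agent: GPTBot
Disallow: /

# Or allow it only into selected directories
# User-agent: GPTBot
# Allow: /public-articles/
# Disallow: /
```

The limitation OpenAI acknowledges is visible here: robots.txt is voluntary, applies per-site rather than per-work, and offers rights holders no way to express preferences about content hosted on sites they don't control.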
OpenAI is creating new tools to give copyright holders greater control over how their work is ingested and processed by its models (due in 2025):
"OpenAI is developing Media Manager, a tool that will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning research and training. Over time, we plan to introduce additional choices and features."
Notably, although the statement recognises that LLMs pose significant privacy questions, only copyright holders / content creators are mentioned in OpenAI's plans to give individuals control over their data:
"Our mission is to benefit all of humanity. This encompasses not only our users, but also creators and publishers."
Quite how OpenAI's mission encompasses data subjects is not clear from the statement.
Can the tools being developed for copyright holders to exercise their rights be extended to allow individual data subjects to exercise theirs?
Model Spec for Public Consultation
OpenAI's Model Spec details how it believes its AI models should behave. The spec is aimed at researchers and developers, guiding how models are trained via Reinforcement Learning from Human Feedback (RLHF).
OpenAI welcomes comments from organisations and individuals, via its website.
The model spec makes very interesting reading, particularly as technical regulations under the AI Act and other international regulations are being shaped. Whether public engagement on how its models are built will help OpenAI navigate future regulatory pressure is unclear.
The document sets out the priorities and decisions made by OpenAI in developing its models, via a series of 'Rules' and 'Defaults,' which aim to control the outputs produced:
Rules
Follow the chain of command
Comply with applicable laws
Don't provide information hazards
Respect creators and their rights
Protect people's privacy
Don't respond with NSFW (not safe for work) content
Exception: Transformation tasks
Defaults
Assume best intentions from the user or developer
Ask clarifying questions when necessary
Be as helpful as possible without overstepping
Support the different needs of interactive chat and programmatic use
Assume an objective point of view
Encourage fairness and kindness, and discourage hate
Don't try to change anyone's mind
Express uncertainty
Use the right tool for the job
Be thorough but efficient, while respecting length limits
As always, the details are key, and many of the sections on complying with applicable laws, respecting privacy and upholding creators' rights are sadly lacking in practical detail on how compliance will be ensured.
In spite of these shortcomings, OpenAI's decisions to proactively publish its approach to data and to open its model specification to public consultation are interesting steps towards greater transparency and public engagement.
How they will influence public opinion, or future regulatory action, is, however, less certain.