Caro Robson

OpenAI publishes its approach to data and model specification

Updated: Jan 8

08 May 2024


 

Approach to Data and AI


Although OpenAI's ‘Approach to Data and AI’ is not quite a transparency statement (it would not meet the requirements of the GDPR or of the AI Act), it does contain some interesting insights into the future of copyright and AI.


  • OpenAI says that it honours web crawler permissions (such as the robots.txt standard) to restrict access to copyrighted materials when its crawlers gather large amounts of data from the web

  • The company recognises that web crawler permissions are an imperfect solution to collecting IP-related data

  • OpenAI is creating new tools (due in 2025) to give copyright holders greater control over how their work is ingested and processed by its LLMs:


"OpenAI is developing Media Manager, a tool that will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning research and training. Over time, we plan to introduce additional choices and features."
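The robots.txt mechanism mentioned above can be illustrated with Python's standard library. This is a minimal sketch: the robots.txt content and URLs are hypothetical examples, and "GPTBot" is the user-agent string OpenAI documents for its web crawler.

```python
# Sketch: how a crawler can honour robots.txt permissions before fetching a
# page, using only the Python standard library. The robots.txt body and the
# example.com URLs below are hypothetical.
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that blocks GPTBot from /private/ but lets all
# other crawlers (and GPTBot elsewhere) fetch freely.
robots_txt = """
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow:
""".splitlines()

parser = RobotFileParser()
parser.parse(robots_txt)

# A compliant crawler checks can_fetch() before requesting each URL.
print(parser.can_fetch("GPTBot", "https://example.com/articles/post.html"))  # True
print(parser.can_fetch("GPTBot", "https://example.com/private/draft.html"))  # False
```

As the post notes, this is an imperfect control: it relies entirely on the crawler voluntarily checking and obeying these rules.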


Notably, although the statement recognises that LLMs pose significant privacy questions, only copyright holders / content creators are mentioned in OpenAI's plans to give individuals control over their data:


"Our mission is to benefit all of humanity. This encompasses not only our users, but also creators and publishers."


Quite how OpenAI's mission encompasses data subjects is not clear from the statement.


Can the tools being developed for copyright holders to exercise their rights be extended to allow individual data subjects to exercise theirs?



Model Spec for Public Consultation


OpenAI’s Model Spec details how it believes its AI models should behave. The spec is aimed at researchers and developers, to guide how models are trained via Reinforcement Learning from Human Feedback (RLHF).
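The core training signal in RLHF can be sketched numerically. The snippet below is an illustrative, simplified example (not OpenAI's implementation): a reward model is trained so that responses human labellers prefer score higher than rejected ones, using a Bradley-Terry style pairwise loss; the reward values are made up.

```python
# Minimal sketch of the pairwise preference loss at the heart of RLHF reward
# modelling: -log sigmoid(r_chosen - r_rejected). Illustrative only; the
# reward values below are hypothetical.
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: low when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Loss shrinks when the model ranks the labeller-preferred response higher,
# and grows when it ranks the rejected response higher.
print(preference_loss(2.0, 0.5))  # small: model agrees with the labeller
print(preference_loss(0.5, 2.0))  # large: model contradicts the labeller
```

Rules and defaults like those listed below shape the feedback labellers give, which this loss then propagates into the model's behaviour.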


OpenAI welcomes comments from organisations and individuals via its website.


The model spec makes very interesting reading, particularly as technical regulations under the AI Act and other international regulations are being shaped. Whether public engagement on how its models are built will help OpenAI navigate future regulatory pressure is unclear. 


The document sets out the priorities and decisions made by OpenAI in developing its models, via a series of 'Rules' and 'Defaults,' which aim to control the outputs produced:


Rules

  • Follow the chain of command

  • Comply with applicable laws

  • Don't provide information hazards

  • Respect creators and their rights

  • Protect people's privacy

  • Don't respond with NSFW (not safe for work) content

  • Exception: Transformation tasks


Defaults

  • Assume best intentions from the user or developer

  • Ask clarifying questions when necessary

  • Be as helpful as possible without overstepping

  • Support the different needs of interactive chat and programmatic use

  • Assume an objective point of view

  • Encourage fairness and kindness, and discourage hate

  • Don't try to change anyone's mind

  • Express uncertainty

  • Use the right tool for the job

  • Be thorough but efficient, while respecting length limits


As always, the details are key, and many of the sections on complying with applicable laws, respecting privacy and upholding creators' rights sadly lack practical measures to ensure compliance.


In spite of their shortcomings, OpenAI’s decisions proactively to release its approach to data and to hold a public consultation on its model specification are interesting steps towards greater transparency and public engagement.


How they will influence public opinion, or future regulatory action, is however less certain.
