Training large language models (LLMs) requires diverse and extensive datasets, and the internet, with its vast reservoir of data, is an attractive source for this training material. However, using publicly available information from the internet raises significant copyright issues, especially under the fair use doctrine. This blog post explores the intersection of fair use and AI training and offers insights into how developers can navigate this complex landscape.
Understanding Fair Use in AI Training
Fair use is a legal doctrine that allows limited use of copyrighted material without permission from the copyright owner, under certain conditions. In the United States, the assessment of fair use is based on four factors:
- The purpose and character of the use: Non-commercial, educational, or transformative uses generally favor fair use.
- The nature of the copyrighted work: Use of factual works is more likely to be considered fair than use of creative works.
- The amount and substantiality of the portion used: Less extensive use of the copyrighted material, and use that is not the “heart” of the work, typically supports a fair use claim.
- The effect of the use on the potential market: If the use does not harm the market for the original work, it may be considered fair use.
AI Training and the Challenge of Fair Use
Transformative Use
One of the strongest arguments for fair use in AI training lies in the concept of transformative use. If AI training involves repurposing the data in a way that adds new meaning or value, it may support a fair use claim. For example, using publicly available text to train a model that generates medical advice could be seen as transformative, as the model is not merely republishing the text but generating new, useful content for a specific purpose.
Nature of the Copyrighted Work
AI training often requires large datasets, which may include both factual and creative content. Factual data, such as weather statistics or historical facts, may not be protected by copyright at all; where it is, its use for AI training is more likely to qualify as fair use. In contrast, using copyrighted literary works, music, or videos is more problematic and less likely to be deemed fair use.
Amount and Substantiality
The breadth of data required to effectively train AI models can complicate fair use claims. While using small excerpts from a large number of works might support a fair use argument, extracting extensively from a single source or copying the key portions of a work might not.
Market Effect
The impact of AI training on the potential market for the original works is a critical consideration. If the training data is made accessible in a way that could serve as a substitute for the original works, this weighs heavily against a fair use claim. However, if the dataset is used only internally and does not diminish the value of the original works, this factor may favor fair use.
Practical Steps for AI Developers
- Seek Permissions:
When feasible, obtaining licenses for the use of copyrighted material reduces legal risks.
- Use Public Domain or Licensed Data:
Prioritize data that is explicitly in the public domain or available under open licenses such as Creative Commons.
- Documentation and Transparency:
Maintain clear records of how data is used and the sources it comes from, which can be crucial in justifying fair use.
- Legal Consultation:
Given the complexities and evolving nature of copyright law in relation to AI, consulting with legal experts in copyright and technology law is advisable.
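As a rough illustration of the documentation step, the sketch below records where each source came from and under what terms, then separates sources that are already cleared from those that still need review. Everything in it is an assumption made for illustration: the `SourceRecord` fields, the `ALLOWED_LICENSES` allow-list, the license labels, and the `permission_ref` identifier are hypothetical, and which licenses are acceptable is a legal decision rather than a technical one.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical allow-list of license labels; the real list should come
# from legal review, not from this sketch.
ALLOWED_LICENSES = {"public-domain", "CC0-1.0", "CC-BY-4.0"}


@dataclass(frozen=True)
class SourceRecord:
    """One entry in a provenance log for training data (illustrative fields)."""
    url: str
    license: str               # license label as recorded at collection time
    retrieved: str             # ISO date the content was collected
    permission_ref: str = ""   # hypothetical ID of a negotiated license, if any


def is_cleared(record: SourceRecord) -> bool:
    """A source is cleared if its license is on the allow-list
    or a separate permission has been recorded for it."""
    return record.license in ALLOWED_LICENSES or bool(record.permission_ref)


def build_manifest(records: list[SourceRecord]) -> dict:
    """Split records into cleared sources and sources needing review."""
    cleared = [asdict(r) for r in records if is_cleared(r)]
    needs_review = [asdict(r) for r in records if not is_cleared(r)]
    return {"cleared": cleared, "needs_review": needs_review}


records = [
    SourceRecord("https://example.org/a", "CC-BY-4.0", "2024-01-15"),
    SourceRecord("https://example.org/b", "all-rights-reserved", "2024-01-16",
                 permission_ref="LIC-2024-001"),
    SourceRecord("https://example.org/c", "unknown", "2024-01-17"),
]

manifest = build_manifest(records)
print(json.dumps(manifest, indent=2))
```

Keeping the manifest as plain JSON means it can be stored alongside the dataset and handed to counsel as-is. Sources listed under `needs_review` are not necessarily unusable; they are simply the ones for which a fair use analysis or a license request is still outstanding.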
Conclusion
The use of publicly available information on the internet for training AI models presents a fertile yet complex legal landscape, centered around the doctrine of fair use. By carefully considering the purpose, nature, amount, and market effect of their use of copyrighted material, AI developers can better navigate this terrain. As the legal environment continues to evolve with technological advancements, staying informed and cautious is paramount for anyone in the field of AI development.