Wang Yong - Partner, Attorney-at-Law and Patent Attorney
The rapid development of large language model technology and industry, represented by generative artificial intelligence (AI), will greatly impact all aspects of human society in the future. Article 22 of the Interim Measures for the Administration of Generative AI Services, released by the Cyberspace Administration of China (CAC) and six other ministries and commissions on July 10, 2023, defines "the generative AI technologies and services" as "models and related technologies with capabilities of generating contents, which can be texts, images, audio and video", with the provision of generative AI services through programmable interfaces and other means included. Technically, generative AI is a natural language processing model that uses a large-scale corpus for self-supervised learning during a pre-training process, and then uses the resulting model to generate, in response to the user's prompts, new contents in the form of texts, images, audio and video, or a combination thereof. At present, models of practical use have an extremely large number of parameters (reaching a scale of hundreds of billions) and require a large amount of training data. Large, high-quality training data is the foundation for AI models to generate desirable results. The training data sources mainly include self-owned data, open-source datasets, external data, automatically collected data, and synthetic data. In terms of legal protection, however, the data fall into two classes: public information that is not protected by copyright, and works that are protected by copyright.
Theoretically, the training data used by generative AI raise many legal risks in connection with personal privacy information, personality rights, trade secrets, unfair competition, and data copyright. In recent years, applications of AI products have given rise to many lawsuits around the world. From 2023 to 2024, dozens of lawsuits involving large AI models were instituted in the United States, with the most controversial issue being the copyright in training data. For instance, in June 2024, the Recording Industry Association of America (RIAA), together with Sony Music Entertainment, Universal Music Group, and Warner Records, sued AI startups Suno and Udio, alleging that they used audio materials from recordings to train their models without authorization. In a similar vein, in December 2023, The New York Times sued OpenAI and Microsoft, accusing them of unauthorized use of newspaper articles for large model training. The underlying issue behind these legal battles is the conflict between data use and copyright protection in connection with generative AI technology, which has become an obstacle that the development of the AI industry cannot bypass. Therefore, with the development and widespread application of generative AI, the legitimacy of data sources lies at the core of the legal risk for AI models.
The following is an exploration of the copyright dilemma now facing large language model training data, and of the corresponding countermeasures to be taken, from the perspective of the current Chinese Copyright Law.
AI is an important driving force for a new round of scientific and technological revolution and industrial transformation, and many countries, including the United States, have increased investment in AI research, set up special scientific research funds to support universities, scientific research institutions and enterprises to conduct AI research, and at the same time released a national AI strategy, spelling out the development goals, key areas and policy orientation, and providing a clear roadmap for the AI development. At the same time, some large multinational high-tech enterprises have increased their investment in AI hardware and software developments, and have continuously launched all sorts of AI products, hoping to take up the leading, advantageous position in the competition raging in this new field.
How to drive the development of the new generation artificial intelligence is also a strategic issue related to whether China can seize the opportunity of a new round of scientific and technological revolution and industrial transformation. The New Generation Artificial Intelligence Development Plan released by the State Council in July 2017 is China's first strategic plan for systematic deployment in the AI field, focusing on the overall conception, strategic objectives, main tasks and supporting measures for the development of China's new generation artificial intelligence before 2030. Subsequently, the Data Security Law of the People's Republic of China, the Personal Information Protection Law of the People's Republic of China, the White Paper on Artificial Intelligence Standardization, the Guidelines for the Construction of the National New Generation Artificial Intelligence Standard System were formulated and promulgated. Additionally, the relevant sector administrative regulations, like the Provisions on the Administration of Algorithm Recommendation for Internet Information Services and the Interim Measures for the Administration of Generative AI Services were introduced.
With regard to the training data of large language models, Article 7 of the Interim Measures for the Administration of Generative AI Services expressly requires that "generative AI service providers shall carry out training data processing activities, such as pre-training and optimized training, pursuant to the law, and use data and basic models from legitimate sources."
In current AI development practice, the data sources of large language model training databases can be roughly divided into three categories: first, contents in the public domain, namely data that can be used and processed by anyone without restrictions, including contents that are not protected by law and contents that have entered the public domain after the copyrights expire; second, contents on which legal authorization contracts have been concluded, that is, effective authorizations to legitimately use the relevant data and contents have been obtained under contracts concluded with the right holders; and third, unauthorized information and contents, that is, data and contents that are themselves subject to copyright protection and that are usually obtained by using "crawlers" and other technologies to collect network data and contents, by illegally obtaining database contents, or by digitizing non-electronic data contents without permission. Training databases constructed in this third way are at risk of copyright infringement, as they involve unauthorized use of copyrighted data and contents.
Under the framework of the current Chinese Copyright Law, risks of copyright infringement exist with different acts in the use of the above-mentioned training data. The AI model learning process is roughly divided into three stages around the use of data or works: first, the data content collection stage, in which the collection and storage of training data may constitute an infringement of the right of reproduction; second, the training stage, in which data collected in the first stage are used to train the model, involving steps such as cleaning, standardization, labeling and feature extraction of the collected data, with the risk of infringing the right of adaptation; and third, the model application output stage, in which the output generated by the trained model according to user prompts or guidelines may also constitute infringement of the rights related to distribution and dissemination. Within this fairly short article, this author will focus on the copyright issues involved in the data content collection stage and the training stage.
In the traditional copyright licensing mechanism, the prior copyright authorization method is usually adopted, which simply means that before a work is used, the user needs to obtain the formal permission from the copyright owner and remunerate him or her accordingly. This authorization model is the basic model for respecting others’ intellectual achievements and maintaining the market operation in the knowledge-based economy, with the core being to give copyright owners the opportunity and ability to negotiate under the Copyright Law, so that their works can effectively circulate and they benefit from their intellectual achievements in the operation of the market, and promote the output of innovation achievements and the sharing of knowledge. This is a proactive and planned way of copyright protection, aiming to clarify the relationship between rights and obligations in advance and to avoid subsequent disputes.
However, this traditional, effective licensing model has completely failed in the face of current large-scale model training. On the one hand, such training involves a large number of works, varied sources, and different ownerships. With the method of prior authorization, it is necessary to accurately separate and extract the copyrighted works from the massive data, find the corresponding right holder of every single copyrighted work to negotiate the authorization with him or her, and pay the licensing fees at different rates, which would be an extremely long, complex and difficult process to work out. On the other hand, unlike the traditional field where remuneration could be calculated based on the use of a single work, large model training uses hundreds of millions of works, which exacerbates the problem of license fee accumulation: the accumulated fees would make the total amount too high for the commercial activities to be carried out. Therefore, the unauthorized use of copyrighted works for machine learning has become the norm in current large-scale model training, and the copyright market transaction has fallen into a dilemma.
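The fee accumulation problem described above can be illustrated with simple arithmetic. The following is a minimal sketch in Python; the function name, the per-work fee and the per-work search and negotiation cost are hypothetical figures chosen only for illustration, not data from any actual licensing practice.

```python
def total_licensing_cost(num_works: int,
                         fee_per_work: float,
                         transaction_cost_per_work: float) -> float:
    """Total cost of work-by-work prior authorization.

    All inputs are hypothetical: a flat licence fee per work plus a flat
    cost for locating and negotiating with each right holder.
    """
    return num_works * (fee_per_work + transaction_cost_per_work)


# Assumed scenario: 100 million works, a nominal 1-dollar fee per work,
# and 5 dollars of search and negotiation cost per work.
cost = total_licensing_cost(100_000_000, 1.0, 5.0)
print(f"Total cost: ${cost:,.0f}")
```

Even with a nominal fee per work, the total grows linearly with the number of works, and under these assumed figures the per-work transaction cost outweighs the fee itself.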
To promote the innovation and development of generative AI technology, so that developers of AI large models could fully and freely use works for data training without the permission of copyright owners, people have turned their attention to the copyright fair use system under the Copyright Law, and believe that the copyright fair use system could be a relatively feasible legal protection for unauthorized large-scale reproduction of copyrighted works to train artificial intelligence models. The so-called copyright fair use system refers to one of the core systems of copyright limitations and exceptions, which allows others to freely use copyrighted works without the consent of the copyright owners or corresponding remuneration to them, provided that certain conditions are met. The purpose of the fair use system is to balance the exclusive rights of copyright owners in works and the public's demand for access to works, promote innovation and cultural diversity, and protect the basic interests of the public.
China's relevant provisions on copyright fair use are spelt out in the Copyright Law of the People's Republic of China and the Implementing Regulations of the Copyright Law.
Article 24 of the Copyright Law stipulates that: "A work may be used under the following circumstances without the permission of the copyright owner and without paying remuneration to him or her, provided that the name or title of the author and the title of the work shall be indicated, and shall not affect the normal use of the work, and shall not unreasonably harm the legitimate rights and interests of the copyright owner: (1) using the published works of others for personal study, research or appreciation; … (6) translation, adaptation, compilation, broadcasting, or small number of reproductions of published works for the purpose of classroom teaching or research by scientific personnel, but not for publication or distribution purposes; … (13) other circumstances provided for by the laws and administrative regulations".
Rule 21 of the Implementing Regulations of the Copyright Law stipulates that: "Under the relevant provisions of the Copyright Law, the use of a published work that may be done without permission of the copyright owner shall not affect the normal use of the work, and shall not unreasonably harm the legitimate rights and interests of the copyright owner".
Therefore, under the above-mentioned provisions of the Copyright Law, fair use of copyright should meet one of the specific scenarios and reasons as listed in Article 24 of the Copyright Law, and also satisfy the three-step test standard (i.e., the work has been published, does not affect the normal use of the work, and does not unreasonably harm the legitimate rights and interests of the copyright owner).
The following is a detailed analysis of the preceding provisions of the Copyright Law. Obviously, the contents of subparagraphs (1) and (6) of Article 24 of the Copyright Law are somewhat related to the use of training data. With regard to Article 24 (1) of the Copyright Law, "the use of published works of others for personal study, research or appreciation", this provision and its limitations, according to current judicial practice and theory, are usually understood from the following aspects: 1) restricted purpose of use: this provision clarifies that the purpose of using another person's work must be personal study, research or appreciation, and these purposes are non-commercial in nature, mainly for personal knowledge acquisition, skill improvement or cultural enjoyment; 2) restricted works for use: the works subject to fair use are limited to "published works", which means that unpublished works are not included, and the unauthorized use of unpublished works could affect the normal use of the works and unreasonably harm the legitimate rights and interests of the copyright owners, so such use does not constitute fair use; 3) limited scope of use: although this provision does not directly restrict the method and scope of use, in practice, the scope of dissemination of a disputed work caused by the act of use is usually considered. If the dissemination of a work is limited to individual persons, it can generally be considered to satisfy the requirement of "study, research or appreciation of the individual"; on the other hand, if the work is widely disseminated, for example, by being disclosed to an unspecified public, it clearly exceeds the scope of "personal use" and does not constitute fair use; 4) non-profit principle: personal use should go together with non-profit purposes, and if the use is for personal gain, say, getting paid through mass reproduction, it is no longer a case of personal use.
Therefore, based on the above aspects, it is difficult to classify the current acts of using data or works for large model training into the scope of personal learning, research and appreciation.
With regard to Article 24 (6) of the Copyright Law, the nature of AI data training seems to be similar to the "scientific research" under this provision. Although the law does not stipulate what constitutes "small number of reproductions", the provision of "small number of reproductions" is unlikely to be compatible with such large-scale use of training data in large model training. Therefore, it is difficult to apply the provisions of subparagraph (6) to demonstrate that fair use is applicable to the use of training data.
Article 24 (13) of the Copyright Law, i.e., "other circumstances as provided for in the law and administrative regulations," is a catch-all clause intended to cover other special circumstances that would not be included in the aforesaid 12 circumstances of fair use, and was added to the Copyright Law as amended in November 2020. With the addition of this clause, the "three-step test," commonly followed for determining fair use, has been formally incorporated into the Copyright Law. The "three-step test standard", also known as the "three-step test method", means that the scope of application should be limited to special cases, the way of use should not conflict with the normal use of works, and the results of use should not unreasonably harm the legitimate rights and interests of the copyright owners.
The Qinghai Provincial Higher People's Court summarized, in the Second-instance Civil Judgment in Beijing Panorama Visual Network Technology Co., Ltd. vs. Qinghai Daily, a case of dispute over the infringement of the right of information network dissemination of Works, the common characteristics of the 12 types of fair use stipulated in the Copyright Law as follows: first, the objective of public welfare, which does not involve commercial operations, that is, does not seek profits; second, to use appropriately rather than prominently, and not to damage the integrity and beauty of the work; and third, the name and title of the author and the title of the work shall be indicated when using a work, which shall not affect the normal use of the work, and shall not harm other lawful rights and interests of the copyright owner.
With regard to fair use as provided for in other administrative regulations, the specific circumstances of fair use in information network dissemination scenarios provided for in Rules 6 and 7 of the Regulations on the Protection of the Right of Information Network Dissemination do not exceed the scope of specific fair use of works provided for in the Copyright Law. Article 7 of the Interim Measures for the Administration of Generative AI Services, jointly issued by the Cyberspace Administration of China (CAC) and six other government agencies in July 2023, sets forth the IP law-compliance requirements for machine learning, but the corresponding IP protection rules have not yet been updated or improved.
With regard to other special circumstances of fair use, although there are individual cases in judicial practice that fail to fall within the 12 circumstances stipulated in the Copyright Law and the provisions of administrative regulations, so far, there are no administrative regulations that explicitly stipulate that use of training data by large models falls within the scope of fair use.
Therefore, it is theoretically possible to include the use of training data as a special case of fair use in the Copyright Law in China, but the following analysis of the development of large model training shows that this special case somewhat lacks legitimacy.
First of all, training a general-purpose large model currently requires a lot of hardware resources, especially high-performance GPUs. ChatGPT, for example, reportedly requires more than 30,000 NVIDIA A100 GPUs, with initial investment costs of about $800 million. As for the training costs, according to the 2024 Artificial Intelligence Index Report, the training cost of OpenAI's GPT-4 is estimated at $78 million, and Google's Gemini Ultra model is estimated to have cost $191 million in computational costs. These data show that training cutting-edge large language models requires a huge amount of financial investment. In addition to the costs of hardware, there are ongoing costs associated with the operation of the large models, including expenses for power consumption and data center maintenance. It is estimated that the costs of running ChatGPT amount to about $100,000 a day, or about $3 million per month.
Obviously, the resource investments needed to build and use a large language model are not what a small or medium-size company or organization can afford. Except for a few with national financial support, a large number of the large models currently running or upcoming are controlled by large high-tech companies like Microsoft, Google, OpenAI and Baidu. They have invested huge amounts of money and launched a competition in performance in order to seize the commanding heights of AI technology.
It has been nearly 70 years now since the concept of artificial intelligence emerged in 1956, and the advent of generative AI products has only been a matter of the last three years. Prior to this, most AI research was basically limited to theoretical exploration and scientific research, with its products confined to the laboratories of enterprises and research institutes. But today, a number of large IT companies in China and abroad have launched their own generative AI products, such as Baidu's Wenxin Yiyan, iFLYTEK's Xinghuo Large Model, and Tencent's Hunyuan Large Model. OpenAI in the United States was the first to launch ChatGPT, a generative AI product, and is currently one of the most successful AI companies. Internet search giant Google owns DeepMind and products such as Gemini, and Microsoft has not only developed many of its own generative AI tools, such as Copilot, but also supported and funded OpenAI's new technology. Since these products were launched, they have been well received and have attracted a large number of users, including paying users. At the same time, the market capitalization or valuation of these companies has also been on a constant rise.
Obviously, the current frenzied competition in the performance of large models can hardly be explained as serving non-commercial purposes. The gold rush for capital and technology in large model training is almost always aimed at pursuing current and future excessive returns. Therefore, if the large models’ training and learning are considered to be fair use, the developers of these AI technologies can continue to obtain a large amount of free and high-quality copyrighted contents, continuously optimize their algorithms, improve the quality of content generation, and then obtain more lucrative benefits from the technology market. Correspondingly, countless authors would provide a huge wealth of copyrighted contents without being able to derive any benefits, and their original benefits would even be affected because the generated contents have a substitution effect in the market for works. This would not only harm the market for authors' original works, but also further harm the public rights and interests in the long run.
Balance of interests is an important principle underlying China's intellectual property legal system. Obviously, in the era of artificial intelligence, traditional IP rules, once again challenged and faced with difficulties, must be adjusted to rebalance the interests involved. The essence of the copyright legitimacy of AI large model training data is the conflict between copyright protection and technological innovation. Excessive protection of the interests of copyright owners will hinder the innovation and development of the technology industry; conversely, policies that are overly inclined to the technology industry will damage the incentive mechanism and cultural diversity in the market for works. Therefore, the regulation of AI machine learning needs to take the balance of interests into account, striking a balance between copyright protection and technological innovation, and promoting coordinated development of the copyright industry and the AI industry.
For this author, the copyright collective management mechanism in the Copyright Law is perhaps a relatively feasible way to address the issue of copyright licensing for training data within the current legal framework.
The copyright collective management mechanism refers to a system in which holders of copyright and copyright-related rights, through intermediary organizations, issue copyright licenses, collect royalties, have the collected fees distributed to them, and even launch infringement lawsuits. Such a system has these features: 1) centralized exercise of rights: copyright collective management organizations, authorized by right holders, exercise the relevant rights for them in a centralized manner and carry out the relevant activities in their own names, including use authorization, royalties collection, and filing lawsuits; 2) reducing transaction costs: the copyright collective management mechanism reduces, through centralized management, the high transaction costs caused by the individual ownership of rights and the diversified uses, and the value loss caused by cumbersome transaction procedures; and 3) user-friendly: the collective management system provides users with a one-stop licensing mechanism, allowing users to obtain the copyright authorization for most works they use at one time, avoiding the risk of infringement and meeting the commercial needs to use works on a large scale.
A comparison shows that the current large model training data and copyright collective management have some identical features: 1) there are many copyrighted works and copyright owners involved, but the transaction frequency of a single work is very low; 2) it is difficult to carry out transactions one by one, and almost impossible to address copyright issues with the copyright owners one by one in advance; 3) correspondingly, the costs for completing a transaction would be unbearable, hindering the copyright transaction. Therefore, it has become a relatively feasible natural choice to use the existing copyright collective management mechanism to address the copyright issues the large model training data involve.
To date, China has five collective copyright management organizations, respectively for music, audio-visual, literary, photographic and film works. AI R&D entities can obtain collective copyright authorization for their training data by resorting to these copyright collective management organizations. With the collective management model, the scattered individual interests are concentrated, individual authors gain an enhanced bargaining position, and the costs of work search, source identification and negotiation on individual works are much reduced by virtue of the centralized authorization under package agreements. In this way, use and dissemination of works are boosted, and the needs for large-scale use of works in machine learning scenarios are satisfied.
Therefore, in the process of copyright authorization for AI training data, it is possible for AI developers to obtain authorization of works in a certain field through a copyright collective management organization, fully leverage the protective efficacy of collective copyright management, reduce the developers' risk in connection with data legitimacy, and simplify the acquisition and use of works at the same time; it is also possible for right holders of the works to realize their economic interests. All this will also encourage enterprises to actively carry out technological innovation and use high-quality copyrighted works to develop new technology markets and promote economic growth. This will benefit the public good and achieve a balance between copyright protection and technological advancement.
Through the collective management model, the management agency can negotiate licensing conditions with each AI developer on behalf of a large number of copyright owners, and the two parties can negotiate on an equal footing to reach licensing agreements and determine the rate standards. If negotiation fails, the matter will be decided through arbitration or litigation. The government agencies can also negotiate with representative groups of copyright owners and technology developers to determine royalty standards compatible with the realities of the market. For example, on May 22, 2024, OpenAI and News Corp. reached a five-year cooperation agreement, with licensing fees of about $250 million (about RMB 1.81 billion), that allows OpenAI to obtain current and archived contents of News Corp.'s major news and information publications, including The Wall Street Journal, Barron's, The New York Post, The Times, The Sun, and more than a dozen other outlets.
Correspondingly, to ensure the collective management model works smoothly, an information disclosure system should be also put in place, requiring AI model developers to disclose the information of the works they use for data training, bringing the information and data in the data training stage out of a secret state to facilitate collective management organizations to check the lists of works used, collect royalties from the users in a timely manner, safeguard the legitimate rights and interests of the right holders, and facilitate the competent authorities to effectively perform their supervision and administrative functions to ensure the AI model training and learning to be compliant with the copyright laws and regulations.
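How a disclosed list of training works might be checked against a collective management organization's repertoire can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the work identifiers, the repertoire mapping, the function names and the flat per-use rate are all hypothetical, and no existing organization's actual procedure or tariff is implied.

```python
from collections import Counter


def reconcile(disclosed, repertoire):
    """Split disclosed work IDs into those the organization manages and those it does not."""
    covered = [w for w in disclosed if w in repertoire]
    uncovered = [w for w in disclosed if w not in repertoire]
    return covered, uncovered


def royalties_due(covered, repertoire, rate_per_use):
    """Apply a flat, hypothetical per-use rate and total the amount owed to each right holder."""
    uses = Counter(repertoire[w] for w in covered)
    return {holder: count * rate_per_use for holder, count in uses.items()}


# Hypothetical repertoire: work ID -> right holder.
repertoire = {"W1": "Author A", "W2": "Author B", "W3": "Author A"}
# Hypothetical disclosure filed by an AI developer (one entry per use).
disclosed = ["W1", "W3", "W9", "W2", "W1"]

covered, uncovered = reconcile(disclosed, repertoire)
due = royalties_due(covered, repertoire, rate_per_use=2)
print(due)        # amounts owed per right holder
print(uncovered)  # works outside the repertoire, to be cleared separately
```

In this sketch, works that fall outside the repertoire are only flagged; how they would be licensed is left open.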
To conclude, we have entered the AI era, and the rise of generative AI has brought new challenges to the existing copyright system. With the works required for training large AI models numbering in the hundreds of millions and the demand for works tremendous, how to balance the interests of the industry in technological development and the rights of authors has become an unavoidable issue facing the copyright law in the new era. The characteristics of the training data of large AI models determine that it is somewhat feasible to use the copyright collective management mechanism to address the copyright legitimacy dilemma, which not only reduces the costs of transaction between copyright owners and AI developers, search and negotiation costs included, but also delivers massive work authorization, thus making authorization more efficient, with fewer transaction entities, lower negotiation and supervision costs for the right holders, and an available one-stop licensing mechanism for AI developers. Therefore, AI developers can obtain the copyright authorization for many works at once and have their commercial needs for large-scale use of works met, which will safeguard and balance the public interests and benefit society while delivering copyright incentives and protection of private rights and interests.
Author:
Mr. Richard Yong Wang
Mr. Wang received his bachelor's degree in 1991 from the Department of Computer Science of East China Normal University and his master's degree from the Institute of Computing Technology of the Chinese Academy of Sciences in 1994. In 2005, he received a Master of Laws degree from Renmin University of China. From 1994 to 2006, Mr. Wang worked with China Patent Agent (HK) Ltd. as a patent attorney and director of the Electrical and Electronic Department. Mr. Wang joined Panawell in January 2007.
Mr. Wang is a member of the All-China Patent Attorneys Association (ACPAA), Sub-Committee of Electronic and Information Technology of ACPAA, LES China and AIPPI China, and FICPI China.
Over the years, Mr. Wang has handled thousands of patent applications for both domestic and foreign clients, and he has extensive experience in application drafting, responding to office actions, patent reexamination and invalidation proceedings, patent administrative litigation, infringement litigation, software registration and integrated circuit layout design registration. As a very experienced patent attorney and attorney-at-law, Mr. Wang has also participated as leading attorney in many patent litigation cases on behalf of a number of multinational companies. Mr. Wang's practice areas include computer hardware, computer software, communication technology, semiconductor devices and manufacturing processes, automatic control, household electrical appliances, etc.