In a recent JOLT Digest contribution, an internationally renowned team of researchers examines how European competition law, specifically the Essential Facilities Doctrine, could apply to so-called uncontaminated datasets in the field of AI. Their insights warrant further reflection, and the article is worth reading in full. A brief summary and analysis of its central ideas follows.
What is Model Collapse?
The article starts from the premise that early large language models (LLMs) were built by scraping a significant portion of the internet as it existed at the time. Their providers then began offering various AI-powered services, and users in turn created new internet content, often with the help of those AI-based offerings. This new content is scraped again and fed back into LLMs. The article argues that this leads to “contamination”: AI-generated data is distorted, for example because statistically rare content is gradually excluded. As a result, the present-day internet, and any subsequent scraping of it, carries the imprint of earlier AI errors.
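To make this mechanism concrete, here is a minimal Python sketch, not taken from the article; all distributions, sample sizes and thresholds are illustrative assumptions. A simple model is repeatedly re-fitted on its own output, and statistically rare content disappears from one generation to the next.

```python
# Minimal, purely illustrative sketch of recursive "contamination":
# a model is re-fitted on its own output and the rare tail events vanish.
import numpy as np

rng = np.random.default_rng(seed=0)

# "Pre-AI" corpus: heavy-tailed, human-generated signal (Student-t, df=3).
data = rng.standard_t(df=3, size=10_000)
print(f"original data: share of samples beyond ±4 = {np.mean(np.abs(data) > 4):.4%}")

for generation in range(1, 6):
    # Each generation fits a simple Gaussian "model" to the current corpus ...
    mu, sigma = data.mean(), data.std()
    # ... and the next corpus consists only of that model's output, standing in
    # for AI-generated content that is scraped back into the training data.
    data = rng.normal(loc=mu, scale=sigma, size=10_000)
    # Rare ("statistically minor") events disappear as the tails collapse.
    print(f"generation {generation}: fitted std = {sigma:.3f}, "
          f"share beyond ±4 = {np.mean(np.abs(data) > 4):.4%}")
```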
A potential problem arises for new LLMs entering the market: they can only train on contaminated data, because the original, uncontaminated datasets are not accessible to them. This puts newer models at a competitive disadvantage against incumbents that can still draw on pristine data from the pre-AI era, and the article suggests their performance might gradually decline as a result.
Market Entry Barriers?
The head start enjoyed by established players with access to uncontaminated data is amplified by other competitive factors. Incumbents may not only hold original data but can also refine it through human feedback, an advantage newer providers cannot easily replicate. At the same time, users may struggle to distinguish human-generated from AI-generated content, which could erode the variety of available content. This, in turn, could hollow out the informational value of the marketplace, as synthetic data would only produce more synthetic content rather than useful information.
Holders of pre-2022 datasets could thus find themselves in a privileged competitive position, potentially monopolizing the data market by offering “untainted” material.
Potential Solutions
If data access becomes a competition issue, competition (antitrust) law could intervene, alongside regulatory measures. One well-known instrument would be the Essential Facilities Doctrine.
From an antitrust perspective, the article assumes that access to uncontaminated historical datasets is crucial for training new models. Control over this access could further entrench the competitive position of established players, possibly leading to a market controlled by only a few companies holding the original dataset.
This raises concerns about exclusivity agreements that might violate Article 101 TFEU, especially if they prevent licensing to third parties or restrict data collection. Antitrust concerns also arise in the context of mergers, where access to crucial datasets must be carefully considered.
The article notes that an abuse of market dominance under Article 102 TFEU could also arise if a dominant company refuses access to crucial datasets, potentially foreclosing the market. However, the authors highlight the significant legal hurdles in proving such cases, including the difficulty of establishing clear conditions for access.
Relevance of Existing Regulatory Tools
The authors point to the increasing use of fair, reasonable and non-discriminatory (FRAND) principles in the context of data access, as reflected in Article 8 of the Data Act and in voluntary commitments related to standards. Access obligations for data holders are seen as an important strand of the ongoing debate.
From a regulatory standpoint, one suggestion is to “freeze” the supposedly uncontaminated dataset, with the EU’s existing regulations on AI and data (e.g., the Data Governance Act) serving as a potential model. The authors also speculate about imposing direct obligations on data holders under the AI Regulation (AI Act). A new data space or the use of data trustees might be helpful in this context.
Which Companies Hold Market Power?
A critical point is identifying which companies actually hold market power in relation to data access. It is unlikely that a single company controls all relevant data, and even collective dominance by several companies seems improbable, given that competition within the sector remains robust.
The article considers whether search engine indexing might play a role in designating certain companies as gatekeepers, potentially triggering the application of Article 6(11) of the Digital Markets Act (DMA). However, this would only apply if the company requesting access is itself an online search engine, which not all AI services are.
Could Markets Self-Regulate?
The article concludes with an exploration of whether the market could self-regulate. It suggests that by assigning specific responsibilities to particular companies, a market for the provision of uncontaminated data could emerge. Furthermore, labeling such data as “uncontaminated” might help formalize access and incentivize the creation of new markets, although this raises the issue of qualitative censorship: who would decide what qualifies as uncontaminated data?
A dynamic market could also develop for data correction services, in which existing datasets are monitored and corrected in real time. This might counteract model collapse by enabling continuous improvements to the datasets used by AI systems.
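What such a correction service might look like in practice is left open. The following minimal Python sketch is purely illustrative and not drawn from the article: it assumes hypothetical provenance metadata (a collection year) and a hypothetical AI-content detector score, and simply filters a corpus on those fields.

```python
# Hypothetical sketch of a dataset "correction" filter. The field names,
# cut-off year and detector threshold are illustrative assumptions, not
# anything proposed in the article.
from dataclasses import dataclass

@dataclass
class Record:
    text: str
    year: int               # year the content was collected
    synthetic_score: float  # score from a hypothetical AI-content detector (0..1)

def curate(records: list[Record],
           cutoff_year: int = 2022,
           max_synthetic_score: float = 0.5) -> list[Record]:
    """Keep records that predate the assumed contamination cut-off or that a
    detector rates as unlikely to be AI-generated."""
    return [
        r for r in records
        if r.year <= cutoff_year or r.synthetic_score < max_synthetic_score
    ]

corpus = [
    Record("human essay from the archive", 2021, 0.05),
    Record("likely model output", 2024, 0.92),
    Record("recent human forum post", 2024, 0.20),
]
print([r.text for r in curate(corpus)])  # the high-scoring 2024 record is dropped
```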
How Long Will the 2022 Datasets Matter?
The article poses a critical question: how long will datasets from 2022 remain relevant? If we follow the authors’ reasoning, the original dataset from the pre-AI era would serve as a benchmark for data integrity for decades. However, newer AI services might have less interest in outdated data that no longer reflects the latest developments.
This also raises the possibility that established companies might be required to retain the 2022 dataset indefinitely to comply with competition and regulatory requirements. The feasibility of this retention obligation remains questionable.
What Was Ever Truly Uncontaminated?
Finally, the article raises two fundamental issues:
- Competition Law’s Scope: Competition law primarily protects the competitive process itself, not the free flow of information on the internet. Antitrust intervention is only warranted if a genuine competition problem arises. User demand for “uncontaminated” information, however, is not necessarily a driving competitive force: AI services might still function commercially without it, which would make this a question for regulation rather than for antitrust.
- Defining Uncontaminated Data: Who decides what qualifies as uncontaminated data? The notion that data from 2022 is uncontaminated is debatable, especially given the prevalence of misinformation in recent years. The assumption of a perfect, uncontaminated dataset is increasingly unrealistic.
Conclusion and Critique
The article rests on several critical assumptions, including the belief that uncontaminated datasets ever existed or can be preserved. It also suggests that the very technology that caused the contamination could correct it, thereby resolving potential competition issues through market-driven solutions. Moreover, a competition law hook seems unlikely in the absence of clear market dominance. While regulatory measures to protect informational freedom are sensible, the precise targets for such regulation remain unclear.
tl;dr:
- The concept of “model collapse” resulting from AI contamination of datasets could become a significant competition issue.
- Competition law (including EU antitrust provisions) and regulatory approaches could play a key role in ensuring fair access to historical datasets.
- However, many of the assumptions about uncontaminated datasets and market dominance remain questionable.
For more information on how we can assist with data access requests and navigate these legal challenges, feel free to contact us.