The article examines how European competition law, specifically the Essential Facilities Doctrine, could apply to so-called uncontaminated datasets in the field of AI. This conclusion is drawn by an internationally renowned team of researchers in a recent JOLT Digest contribution. Their insights warrant further reflection, and the article itself is worth reading in full. Nevertheless, a brief summary and analysis of its central ideas follows.

What Is Model Collapse?

The article begins from the premise that early large language models (LLMs) were built by scraping a significant share of the internet as it existed at the time. From there, providers started offering various AI-powered services, and users created new internet content, often leveraging those AI-based offerings. This new content is in turn scraped and integrated into LLMs. The article suggests that this leads to “contamination”: AI-generated data is distorted through processes such as the exclusion of statistically rare content, so that the present-day internet, and any subsequent scraping of it, is shaped by earlier AI errors.
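
To make this mechanism concrete, here is a minimal toy simulation, not the authors’ model: the two-sigma cutoff below is an invented stand-in for the exclusion of statistically rare content described above. Each generation trains a trivial “model” on the previous generation’s output, and the spread of the data shrinks generation by generation.

```python
import numpy as np

rng = np.random.default_rng(0)

def next_generation(data: np.ndarray, n: int) -> np.ndarray:
    """One cycle of the toy 'model': drop statistically rare values
    (beyond two standard deviations), fit a Gaussian to what remains,
    and publish synthetic samples drawn from that fit."""
    typical = data[np.abs(data - data.mean()) < 2 * data.std()]
    return rng.normal(typical.mean(), typical.std(), size=n)

# Generation 0: the 'pre-AI' corpus, still containing rare (tail) content.
data = rng.normal(0.0, 1.0, size=100_000)

for gen in range(8):
    print(f"generation {gen}: std = {data.std():.3f}")
    data = next_generation(data, n=100_000)
```

In this setup the standard deviation falls by roughly twelve percent per generation: rare content, once filtered out, never returns.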

A potential problem arises for new LLMs entering the market: they face the risk of data contamination, meaning they cannot access the original, uncontaminated datasets. This creates a competitive disadvantage for newer models, which cannot draw on pristine data from the pre-AI era. The article proposes that such models might experience a gradual decline in performance due to this lack of access.

Market Entry Barriers?

The time advantage held by established players with access to uncontaminated data is further amplified by other competitive factors. Established companies may not only hold original data but can also improve it through human training, giving them an edge over newer providers who cannot replicate this human feedback. At the same time, users may struggle to differentiate between human-generated and AI-generated content, which could undermine the variety of available content. This, in turn, could lead to a collapse of meaningful value in the marketplace, as synthetic data would only produce more synthetic content rather than useful information.

Thus, holders of pre-2022 datasets could find themselves in a strong competitive position, potentially monopolizing the data market by offering “untainted” material.

Potential Solutions

If data access becomes a competition issue, competition law (antitrust law) could intervene alongside regulatory measures. A well-known candidate is the application of the Essential Facilities Doctrine.

From an antitrust perspective, the article assumes that access to uncontaminated historical datasets is crucial for training new models. Control over this access could further entrench the competitive position of established players, possibly leading to a market controlled by only a few companies holding the original dataset.

This raises concerns about exclusivity agreements that might violate Article 101 TFEU, especially if they prevent licensing to third parties or restrict data collection. Antitrust concerns also arise in the context of mergers, where access to crucial datasets must be carefully considered.

The article notes that abuse of market dominance under Article 102 TFEU could also arise if a dominant company refuses access to crucial datasets, potentially foreclosing the market. However, the authors highlight the significant legal hurdles in proving such cases, including the complexity of establishing clear conditions for access.

Relevance of Existing Regulatory Tools

The authors point to the increasing use of FRAND principles in the context of data access, as reflected in Article 8 of the Data Act and in voluntary commitments related to standards. The concept of access obligations for data holders is seen as an important aspect of the ongoing debate.

From a regulatory standpoint, one suggestion is to “freeze” the supposedly uncontaminated dataset, with the EU’s existing regulations on AI and data (e.g., the Data Governance Act) serving as a potential model. The authors speculate about imposing direct obligations on data holders under the AI Act. A new data space or the use of data trustees might be helpful in this context.
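
What “freezing” a dataset could mean technically can be sketched briefly. The following example is purely illustrative; the record format, cutoff date, and hashing scheme are assumptions, not anything prescribed by the Data Governance Act or the AI Act. It filters a corpus down to pre-cutoff documents and fingerprints the snapshot so that, for instance, a data trustee could later verify that the frozen dataset has not been altered.

```python
import hashlib
import json
from datetime import datetime, timezone

# Assumed record format: each document carries a creation timestamp.
# The cutoff approximates the end of the 'pre-AI' era.
CUTOFF = datetime(2022, 11, 30, tzinfo=timezone.utc)

corpus = [
    {"id": "doc-1", "created": "2021-06-01T12:00:00+00:00", "text": "human-written article"},
    {"id": "doc-2", "created": "2023-03-15T09:30:00+00:00", "text": "possibly AI-assisted post"},
]

# Keep only documents created before the cutoff: the 'frozen' snapshot.
frozen = [
    doc for doc in corpus
    if datetime.fromisoformat(doc["created"]) < CUTOFF
]

# Fingerprint the snapshot so its integrity can be proven later.
snapshot = json.dumps(frozen, sort_keys=True).encode("utf-8")
print("documents retained:", len(frozen))
print("snapshot hash:", hashlib.sha256(snapshot).hexdigest())
```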

Which Companies Hold Market Power?

A critical point raised is the identification of which companies actually hold market power in relation to data access. It is unlikely that a single company controls all relevant data. Even the notion of collective dominance of several companies seems improbable, given that competition within the sector remains robust.

The article considers whether search engine indexing might play a role in designating certain companies as gatekeepers, potentially triggering the application of Article 6(11) of the Digital Markets Act (DMA). However, this would only apply if the company requesting access is itself an online search engine, which not all AI services are.

Could Markets Self-Regulate?

The article concludes with an exploration of whether the market could self-regulate. It suggests that by assigning specific responsibilities to particular companies, a market for the provision of uncontaminated data could emerge. Furthermore, labeling such data as “uncontaminated” might help formalize access and incentivize the creation of new markets, although this raises the issue of qualitative censorship: who would decide what qualifies as uncontaminated data?

A dynamic market could also develop for data correction services, in which existing datasets are monitored and corrected in real time. This might counteract model collapse by enabling continuous improvement of the datasets used by AI systems.
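
How such a correction service might operate is easiest to see as a pipeline. In the sketch below, looks_synthetic is a deliberately naive placeholder of our own invention; reliable detection of AI-generated text remains an open research problem. Records flagged as likely synthetic are routed to review or correction rather than fed back into training.

```python
from typing import Iterable

def looks_synthetic(text: str) -> bool:
    """Placeholder detector that flags suspiciously repetitive text.
    A real service would need statistical or watermark-based methods."""
    words = text.split()
    return len(set(words)) < 0.5 * len(words) if words else True

def correction_pipeline(records: Iterable[dict]) -> tuple[list, list]:
    """Split a monitored dataset into records deemed safe to train on
    and records routed to human review or correction."""
    clean, flagged = [], []
    for record in records:
        (flagged if looks_synthetic(record["text"]) else clean).append(record)
    return clean, flagged

records = [
    {"id": 1, "text": "the quick brown fox jumps over the lazy dog"},
    {"id": 2, "text": "data data data data data data data data"},
]
clean, flagged = correction_pipeline(records)
print(len(clean), "clean,", len(flagged), "flagged for correction")
```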

How Long Will the 2022 Datasets Matter?

The article poses a critical question: how long will datasets from 2022 remain relevant? If we follow the authors’ reasoning, the original dataset from the pre-AI era would serve as a benchmark for data integrity for decades. However, newer AI services might have less interest in outdated data that no longer reflects the latest developments.

This also raises the possibility that established companies might be required to retain the 2022 dataset indefinitely to comply with competition and regulatory requirements. The feasibility of this retention obligation remains questionable.

What Was Ever Truly Uncontaminated?

Finally, the article raises two fundamental issues:

  1. Competition Law’s Scope: Competition law primarily protects the competitive process itself, not the free flow of information on the internet. Antitrust intervention is only warranted if a competition problem arises. However, user demand for “uncontaminated” information is not necessarily a driving force. AI services might still function without it, which could present an argument for regulation.
  2. Defining Uncontaminated Data: Who decides what qualifies as uncontaminated data? The notion that data from 2022 is uncontaminated is debatable, especially given the prevalence of misinformation in recent years. The assumption of a perfect, uncontaminated dataset is increasingly unrealistic.

Conclusion and Critique

The article identifies several critical assumptions, including the belief that uncontaminated datasets ever existed or can be preserved. It also suggests that the very technology that caused the data contamination could correct it, thereby resolving potential competition issues through market-driven solutions. Furthermore, a basis for competition law intervention seems unlikely, given the absence of clear market dominance. While regulatory measures to protect informational freedom are sensible, the key points for regulation remain unclear.

tl;dr:

  • The concept of “model collapse” resulting from AI contamination of datasets is presented as a significant competition issue.
  • Competition law (including EU antitrust provisions) and regulatory approaches could play a key role in ensuring fair access to historical datasets.
  • However, many of the assumptions about uncontaminated datasets and market dominance remain questionable.

For more information on how we can assist with data access requests and navigating these legal challenges, feel free to contact us.
