🧠 The Emerging Empires of AI & Their Hunger for Data
More is more
Give us your datasets. If you give us your datasets, we'll be very happy.
-Sam Altman, answering a question on what Indonesians can do for OpenAI at a live event in Jakarta, 2023
The race to build the largest and most powerful model is already underway, with tech giants competing aggressively against each other to seize monopoly power.
Unfortunately for us, the AI sector is one in which bigger tends to be better, and in which giants tend to have an advantage: bigger companies can afford more compute power and can train larger models. And the more parameters a large language model has, the more data it requires for effective training.
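To make the parameter-data relationship concrete, a widely cited rule of thumb from the "Chinchilla" scaling-law work (Hoffmann et al., 2022) is that a compute-optimal model wants roughly 20 training tokens per parameter. A minimal sketch - the 20× ratio is an empirical estimate, not a law, and the function name is ours:

```python
def tokens_needed(n_parameters: int, tokens_per_param: int = 20) -> int:
    """Rough compute-optimal training-data budget, in tokens.

    The default ratio of ~20 tokens per parameter is the empirical
    estimate popularised by the Chinchilla scaling-law results.
    """
    return n_parameters * tokens_per_param

# A 70-billion-parameter model would want on the order of
# 1.4 trillion training tokens under this heuristic:
print(f"{tokens_needed(70_000_000_000):,}")  # 1,400,000,000,000
```

Budgets of this size are exactly why publicly available text has become a binding constraint.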
Over recent months, model developers have reached the limits not just of publicly available web-scraped data (such as Common Crawl) but also of datasets known to be constituted from pirated material (such as Books3, a component of The Pile - the key dataset used in the training of models such as Llama).
Efforts to train models on their own outputs have met with only mixed success: research indicates that this leads, over time, to model collapse, as the original data distribution degrades with each iteration. We have reached a point at which the next generation of models is simply too large to be trained efficiently on the data to which producers currently have legal access.
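This degradation can be illustrated with a toy experiment, loosely in the spirit of the Shumailov et al. paper in the References - not the paper's actual method, just a sketch: repeatedly fit a Gaussian to samples drawn from the previous generation's fitted Gaussian, so each "model" is trained only on the previous model's outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def collapse_demo(n_samples: int = 50, generations: int = 2000) -> float:
    """Fit a Gaussian to its own finite samples, generation after
    generation, and return the final standard deviation."""
    mu, sigma = 0.0, 1.0  # the "real" data distribution: N(0, 1)
    for _ in range(generations):
        # Each generation is trained only on a finite sample produced
        # by the previous generation...
        data = rng.normal(mu, sigma, n_samples)
        # ...so estimation error compounds, and the fitted distribution
        # drifts away from the original one.
        mu, sigma = data.mean(), data.std()
    return sigma

final_sigma = collapse_demo()
print(final_sigma)  # spread after 2000 generations; it started at 1.0
```

Because every generation sees only a finite sample, the tails of the distribution are systematically under-represented, and the estimated spread collapses over time - the one-dimensional analogue of a model "forgetting" the variety in its original training data.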
In order to obtain more data, these companies are encouraging users to upload their own and others’ private information – with some corporates going so far as to announce that not only do they intend to continue training on such material, they will encourage their users to supply it, and will even cover users' legal costs in any resulting copyright lawsuits.
This escalation, from 'mere' solo misbehaviour to aiding and abetting others in doing the same, is bullying in nature, and unprecedented even by the standards of past big-tech misbehaviour. But it is clear proof of the all-consuming hunger for data and knowledge that big tech must satisfy to feed its models, and of the disadvantage at which small players sit when transacting in the AI market.
When Elephants Fight...
Many have long foreseen a new arms race between the largest companies to gain control over proprietary data. Indeed, it has already begun - but it is in the heavy reliance on knowledge & data described above that we find the greatest opportunity to start turning the situation around.
The big tech companies have no trouble paying for the compute resources they need to train ever-bigger models, co-opting legislators to make it more difficult for competitors to obtain and use high-performance GPUs, and - if necessary - securing the use of others' intellectual property for training. These advantages compound in favour of the large companies and against their suppliers, competitors, and even their users.
IP owners, whose data is generally too small to be valuable in isolation and is spread across multiple platforms, lack the bargaining power to protect their rights; they are at the mercy of the data-security policies set by secondary owners - the publishing platforms. This goes some way to explaining the recent shift among many AI companies away from ever-bigger models and towards retrieval-augmented generation (RAG), smaller custom models, and apps: the goal is to postpone training the next generation of models until extra data can be collected from current users.
Everyone who has ever contributed anything to the collective public knowledge of the Internet has probably already had that knowledge taken and absorbed into the core corpus of at least one large language model.
Currently there is no way to attribute ownership of the data & knowledge used by AI models and AI apps - and consequently no way to account for, or compensate, those contributions when interactions take place.
However, individual data owners are not the only ones being hit. AI as experienced by the majority of users is composed of three elements: the model that carries out the calculations, the data that feeds it, and the app via which outputs are presented to users. In each of these three markets, smaller providers struggle in the face of the oligopoly power held by the large tech companies.
Data providers are often excluded from the economic benefits of AI
Whether or not the interface of a given AI app makes it clear, AI works on a pay-per-query model: API subscriptions usually make this explicit, while web-client users pay a standard monthly fee that reflects the average number of monthly queries per user.
As users make queries to an AI app, credits are spent as the app interacts with AI models and Knowledge Assets to fulfill the query. These credits are an economic representation of the GPU computing power used, plus the other costs and margins of the model developer/provider. Compute providers, model designers, and shareholders all receive a share of these revenues - but the providers of the data upon which the entire edifice is built do not.
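A minimal sketch of this accounting, with every rate, name, and percentage hypothetical - the structural point is simply that the data provider's share of each query's revenue is zero:

```python
# Hypothetical cost of GPU time, expressed in credits per second.
GPU_RATE_PER_SECOND = 0.0005

# Hypothetical revenue split between the parties named in the text.
REVENUE_SPLIT = {
    "compute_provider": 0.40,
    "model_developer": 0.35,
    "app_operator": 0.25,
    "data_provider": 0.00,  # the gap the text highlights
}

def query_cost(gpu_seconds: float, margin: float = 0.30) -> float:
    """Credits charged for one query: raw GPU cost plus provider margin."""
    return gpu_seconds * GPU_RATE_PER_SECOND * (1 + margin)

def settle(credits: float) -> dict:
    """Distribute one query's credits according to the revenue split."""
    return {party: round(credits * share, 6)
            for party, share in REVENUE_SPLIT.items()}

cost = query_cost(gpu_seconds=2.0)
print(cost)          # 0.0013 credits for a 2-GPU-second query
print(settle(cost))  # the data provider's line is always 0.0
```

However the real splits are negotiated, no line item in today's settlement flows back to whoever supplied the underlying data.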
A New Arms Race over Domination of Our AI Future
That race has already begun, and the battle is to bring all levers of AI value creation under the control of single corporate entities. But it is in this heavy reliance on knowledge & data that we find the greatest opportunity to start turning the situation around.
References
Shumailov, Ilia, Zakhar Shumaylov, Yiren Zhao, Yarin Gal, Nicolas Papernot, and Ross Anderson. "Model Dementia: Generated Data Makes Models Forget." arXiv preprint arXiv:2305.17493 (2023).
Growcoot, Matt. "Midjourney Founder Admits to Using a ‘Hundred Million’ Images Without Consent", PetaPixel, 21 December 2022. Retrieved 15 November 2023: https://petapixel.com/2022/12/21/midjourny-founder-admits-to-using-a-hundred-million-images-without-consent/
Novak, Matt. "OpenAI To Pay Legal Fees Of Business Users Hit With Copyright Lawsuits", Forbes, 6 November 2023. Retrieved 15 November 2023: https://www.forbes.com/sites/mattnovak/2023/11/06/openai-to-pay-legal-fees-of-business-users-hit-with-copyright-lawsuits/?sh=1d98e71d51cd
Song, Congzheng, and Ananth Raghunathan. "Information leakage in embedding models." In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, pp. 377-390. 2020.