According to this paper in PMLR:
https://proceedings.mlr.press/v235/villalobos24a.html

The largest LLMs, such as Llama, are trained on about 15T tokens (roughly
60 TB of text). Common Crawl has 130T tokens scraped from web pages. The
authors estimate that Google indexes 500T tokens and that there are 3100T
tokens on the deep web: private emails, texts, and social media group
posts. This stock of human-generated text is growing much slower than
Moore's law, somewhere between 0 and 10% per year. These are very rough
numbers that are surprisingly difficult to estimate.
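
For scale, here is a quick back-of-envelope sketch in Python of how those
token counts translate to bytes. The 4 bytes per token figure and the
dataset labels are my own assumptions (4 bytes/token is what makes 15T
tokens come out to about 60 TB); the token counts are from the paper.

# Rough token-to-byte conversion. ASSUMPTION: ~4 bytes of raw text per
# token, chosen so that 15T tokens comes out to about 60 TB.
BYTES_PER_TOKEN = 4

token_stocks = {           # token counts as summarized above
    "Llama training set": 15e12,
    "Common Crawl":       130e12,
    "Google index":       500e12,
    "Deep web":           3100e12,
}

for name, tokens in token_stocks.items():
    terabytes = tokens * BYTES_PER_TOKEN / 1e12
    print(f"{name}: {tokens / 1e12:.0f}T tokens, roughly {terabytes:.0f} TB")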

3100T tokens divided by the world population (about 8 billion) is roughly
400K tokens per person, which compresses to about 1.6M bits. That is only
about 1/600 of human long-term memory, which is on the order of 10^9 bits.
In my 2013 paper I estimated it would cost $1 quadrillion to collect the
rest.
LLMs seem like they could replace any job that can be done remotely over
the internet, but after 2 years this hasn't even begun to happen. The
reason is that the knowledge they need to do your job isn't written down,
and won't be for a long time.
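
Here is the arithmetic behind those per-person numbers as a rough Python
sketch. The world population, characters per token, compression ratio,
and memory capacity figures are my working assumptions, not numbers from
the paper.

# Back-of-envelope arithmetic for the per-person estimate.
# ASSUMPTIONS: ~8 billion people, ~4 characters per token, text
# compresses to ~1 bit per character, and human long-term memory
# holds on the order of 1e9 bits.
deep_web_tokens = 3100e12   # deep web estimate from the paper
population      = 8e9
chars_per_token = 4
bits_per_char   = 1
ltm_bits        = 1e9

tokens_per_person = deep_web_tokens / population                        # ~400K
bits_per_person   = tokens_per_person * chars_per_token * bits_per_char # ~1.6M

print(f"{tokens_per_person:,.0f} tokens per person")
print(f"about {bits_per_person:,.0f} bits compressed, "
      f"or roughly 1/{ltm_bits / bits_per_person:.0f} of long-term memory")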
