Synthetically transformed data will fill the gap. Raw synthetic data won't, but real data transformed through synthetic processes has already been shown to improve models. I don't think we will ever run out of ways to synthesize new data. Most of Common Crawl isn't even used because its quality is so poor, and nearly all of the portion that is used runs through some synthetic process to clean or augment it (see: FineWeb).
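To make "runs through some synthetic process to clean" concrete, here is a minimal sketch of that kind of pass: normalize whitespace, drop documents that fail a crude quality heuristic, and remove exact duplicates. This is not FineWeb's actual pipeline. The function names (`normalize`, `keep`, `clean_corpus`) and the thresholds (`min_words`, `min_alpha_ratio`) are hypothetical, picked only for illustration.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def keep(text: str, min_words: int = 5, min_alpha_ratio: float = 0.6) -> bool:
    """Hypothetical quality heuristic: enough words, mostly letters or spaces."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def clean_corpus(docs):
    """Normalize, filter, and exact-deduplicate a list of raw documents."""
    seen = set()
    cleaned = []
    for raw in docs:
        text = normalize(raw)
        if not keep(text):
            continue
        # Case-insensitive exact match; real pipelines use fuzzier matching.
        digest = hashlib.sha256(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned


if __name__ == "__main__":
    raw_docs = [
        "The quick  brown fox jumps over\nthe lazy dog.",
        "the quick brown fox jumps over the lazy dog.",
        "$$$ 1234 #### 5678 %%%% 0000 !!!!",
        "too short",
    ]
    print(clean_corpus(raw_docs))
```

Real pipelines layer far more on top of this (language identification, near-duplicate detection, model-based quality scoring), but the shape is the same: real text in, transformed text out.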
On Wed, Dec 11, 2024, 7:35 PM Matt Mahoney <mattmahone...@gmail.com> wrote:

> According to this paper in PMLR.
> https://proceedings.mlr.press/v235/villalobos24a.html
>
> The largest LLM, Llama, trains on 15T tokens (60 TB of text). Common Crawl
> has 130T tokens scraped from web pages. The authors estimate that Google
> indexes 500T tokens and there are 3100T tokens on the deep web, stuff like
> private emails, texts, and social media group posts. This data set is
> growing much slower than Moore's law, between 0 and 10% per year. These are
> very rough numbers that are surprisingly difficult to estimate.
>
> 3100T tokens divided by world population is 400K per person, which
> compresses to 1.6M bits. That is only 1/600 of human long term memory. In
> my 2013 paper I estimated it will cost $1 quadrillion to collect the rest.
> LLMs seem like they could replace any job that could be done remotely over
> the internet, but after 2 years this hasn't even begun to happen. The
> reason is that the knowledge it needs to do your job isn't written down,
> and won't be for a long time.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T7a144d8141d1d5b0-M5cd781b19aa0027d1e282509
Delivery options: https://agi.topicbox.com/groups/agi/subscription