> Can you find the source dataset on Hugging Face?
> Can you give me the direct link?
Even if it were there, what would you do with it? The datasets they use are well known by now (and the links, with a bit of effort, can be found); what isn't known is, for example, the pre-processing pipeline. Take IBM Granite [1]:

granite.13b.v1 was trained on 1 trillion tokens. The individual datasets used in the training are described below.

1) arXiv: Over 1.8 million scientific paper pre-prints posted to arXiv.
2) Common Crawl: Open repository of web crawl data.
3) DeepMind Mathematics: Mathematical question and answer pairs data.
4) Free Law: Public-domain legal opinions from US federal and state courts.
5) GitHub Clean: Code data from CodeParrot covering a variety of coding languages.
6) Hacker News: News on computer science and entrepreneurship, taken between 2007-2018.
7) OpenWeb Text: Open-source version of OpenAI's Web Text corpus containing web pages through 2019.
8) Project Gutenberg (PG-19): A repository of free e-books with focus on older works for which U.S. copyright has expired.
9) Pubmed Central: Biomedical and life sciences papers.
10) SEC Filings: 10-K/Q filings from the US Securities and Exchange Commission (SEC) for the years 1934-2022.
11) Stack Exchange: Anonymized set of all user-contributed content on the Stack Exchange network, a popular collection of websites centered around user-contributed questions and answers.
12) USPTO: US patents granted from 1975 to May 2023, excluding design patents.
13) Webhose: Unstructured web content converted into machine-readable data feeds acquired by IBM.
14) Wikimedia: Eight English Wikimedia projects (enwiki, enwikibooks, enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, enwiktionary), containing extracted plain text from pages and articles.
15) Earnings Call Transcripts: Transcripts from the quarterly earnings calls that companies hold with investors. The dataset reports a collection of earnings call transcripts, the related stock prices, and the sector index.
16) EDGAR Filings: Annual reports from all the publicly traded companies in the US spanning a period of more than 25 years.
17) FDIC: The data is from the annual submissions of the FDIC.
18) Finance Text Books: A corpus from UMN’s Open Textbook Library, including a dump of all textbooks tagged as finance.
19) Financial Research Papers: Publicly available financial research paper corpus.
20) IBM Documentation: IBM redbooks and product documents.

[1] https://www.ibm.com/downloads/cas/X9W4O6BM
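To make the point about the pipeline concrete, here is a toy sketch. It is purely illustrative and is not IBM's pipeline: the function names (`normalize`, `preprocess`), the `min_words` threshold, and the sample strings are all my own assumptions. It only shows that the same raw sources yield a different training corpus depending on filtering and deduplication choices.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def preprocess(docs, min_words=5):
    """Hypothetical toy pipeline: a length filter followed by exact deduplication."""
    seen = set()
    kept = []
    for doc in docs:
        norm = normalize(doc)
        if len(norm.split()) < min_words:
            continue  # drop documents shorter than the (assumed) threshold
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates after normalization
        seen.add(digest)
        kept.append(doc)
    return kept


# Made-up sample "documents"; the second differs from the first only by whitespace.
raw = [
    "Public-domain legal opinions from US federal and state courts.",
    "Public-domain  legal opinions from US federal and state courts.",
    "Too short.",
    "Biomedical and life sciences papers from an open archive.",
]

strict = preprocess(raw, min_words=5)  # filters the short document
loose = preprocess(raw, min_words=1)   # keeps it

print(len(raw), len(strict), len(loose))
```

Same input, two different outputs: knowing the list of sources tells you little about what the model actually saw unless you also know choices like these.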