Re: [nexa] una "gemma" da Google

Antonio Sat, 29 Jun 2024 11:51:26 -0700

> Can you find the source dataset on Hugging Face?
> Can you send me the direct link?


Even if it were there, what would you do with it?
By now the datasets they use are known (and the links, with a bit of effort, 
can be found); what is not known is, for example, the pre-processing 
pipeline.
Take IBM Granite [1]:
granite.13b.v1 was trained on 1 trillion tokens. The individual datasets used 
in the training are described below.
1) arXiv: Over 1.8 million scientific paper pre-prints posted to arXiv.
2) Common Crawl: Open repository of web crawl data.
3) DeepMind Mathematics: Mathematical question and answer pairs data.
4) Free Law: Public-domain legal opinions from US federal and state courts.
5) GitHub Clean: Code data from CodeParrot covering a variety of coding 
languages.
6) Hacker News: News on computer science and entrepreneurship, taken between 
2007-2018
7) OpenWeb Text: Open-source version of OpenAI's Web Text corpus containing web 
pages through 2019.
8) Project Gutenberg (PG-19): A repository of free e-books with focus on older 
works for which U.S. copyright has expired.
9) Pubmed Central: Biomedical and life sciences papers.
10) SEC Filings: 10-K/Q filings from the US Securities and Exchange Commission 
(SEC) for the years 1934-2022.
11) Stack Exchange: Anonymized set of all user-contributed content on the Stack 
Exchange network, a popular collection of websites centered around 
user-contributed questions and answers.
12) USPTO: US patents granted from 1975 to May 2023, excluding design patents.
13) Webhose: Unstructured web content converted into machine-readable data 
feeds acquired by IBM.
14) Wikimedia: Eight English Wikimedia projects (enwiki, enwikibooks, 
enwikinews, enwikiquote, enwikisource, enwikiversity, enwikivoyage, 
enwiktionary), containing extracted plain text from pages and articles.
15) Earnings Call Transcripts: Transcripts from the quarterly earnings calls 
that companies hold with investors. The dataset reports a collection of 
earnings call transcripts, the related stock prices, and the sector index.
16) EDGAR Filings: Annual reports from all the publicly traded companies in the 
US spanning a period of more than 25 years.
17) FDIC: The data is from the annual submissions of the FDIC.
18) Finance Text Books: A corpus from UMN’s Open Textbook Library, including a 
dump of all textbooks tagged as finance.
19) Financial Research Papers: Publicly available financial research paper 
corpus.
20) IBM Documentation: IBM redbooks and product documents

A.

[1] https://www.ibm.com/downloads/cas/X9W4O6BM
