[Corpora-List] MessIRve: a large-scale Spanish Information Retrieval dataset

Francisco Valentini via Corpora Wed, 11 Sep 2024 06:53:35 -0700

We're happy to announce the release of *MessIRve*, a new *large-scale IR
dataset in Spanish!*


MessIRve* contains around *730k queries from 20 Spanish-speaking countries
and the United States*, with relevant documents sourced from Wikipedia.
MessIRve's queries reflect diverse Spanish-speaking regions, unlike other
datasets that are translated from English or do not consider dialectal
variations. The large size of the dataset allows it to cover a wide variety
of topics, unlike smaller datasets.

The dataset is available on *Hugging Face*! 🤗

   - Queries and relevance judgments: spanish-ir/messirve
   <https://huggingface.co/datasets/spanish-ir/messirve>
   - The collection of documents: spanish-ir/eswiki_20240401_corpus
   <https://huggingface.co/datasets/spanish-ir/eswiki_20240401_corpus>
   - Queries and qrels in TREC format: spanish-ir/messirve-trec
   <https://huggingface.co/datasets/spanish-ir/messirve-trec>
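
For readers who want to work with the TREC-format release, here is a minimal
Python sketch of reading relevance judgments. It assumes the qrels follow the
usual four-column TREC layout (query id, iteration, document id, relevance);
the actual file names and layout in spanish-ir/messirve-trec are not confirmed
here, and parse_trec_qrels and the sample lines are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable


def parse_trec_qrels(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Parse TREC-style qrels lines into {query_id: {doc_id: relevance}}."""
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumed columns: query id, iteration (unused), doc id, relevance
        query_id, _iteration, doc_id, relevance = line.split()
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Hypothetical sample lines, not taken from MessIRve.
sample = [
    "q1 0 doc10 1",
    "q1 0 doc42 0",
    "q2 0 doc7 1",
]
qrels = parse_trec_qrels(sample)

# Documents judged relevant (relevance > 0) for query q1
relevant_q1 = [doc for doc, rel in qrels["q1"].items() if rel > 0]
print(relevant_q1)
```

To use it on a real file, pass an open file handle instead of the sample list,
for example parse_trec_qrels(open(path, encoding="utf-8")), where path points
to a qrels file downloaded from the repository above.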

For more details, check out our *arXiv paper*: MessIRve: A Large-Scale
Spanish Information Retrieval Dataset <http://arxiv.org/abs/2409.05994>

We hope MessIRve serves to spur more work in IR for the Spanish language
and facilitate the development of efficient information access tools for
Spanish speakers.

* MessIRve means *works for me* in Spanish ("me sirve"). The reference
to Lionel Messi, player of the most popular sport in Spanish-speaking
countries, football, stresses the importance of using topics that are
relevant to Spanish speakers.

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
