We're happy to announce the release of *MessIRve*, a new *large-scale IR dataset in Spanish!*
*MessIRve* contains around *730k queries from 20 Spanish-speaking countries and the United States*, with relevant documents sourced from Wikipedia. MessIRve's queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets.

The dataset is available on *HuggingFace*! 🤗
- Queries and relevance judgments: spanish-ir/messirve <https://huggingface.co/datasets/spanish-ir/messirve>
- The collection of documents: spanish-ir/eswiki_20240401_corpus <https://huggingface.co/datasets/spanish-ir/eswiki_20240401_corpus>
- Queries and qrels in TREC format: spanish-ir/messirve-trec <https://huggingface.co/datasets/spanish-ir/messirve-trec>

For more details, check out our *arXiv paper*: MessIRve: A Large-Scale Spanish Information Retrieval Dataset <http://arxiv.org/abs/2409.05994>

We hope MessIRve serves to spur more work in IR for the Spanish language and facilitate the development of efficient information access tools for Spanish speakers.

About the name: MessIRve means *works for me* in Spanish ("me sirve"). The reference to Lionel Messi, player of the most popular sport in Spanish-speaking countries, football, stresses the importance of using topics that are relevant to Spanish speakers.
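For readers who want to work with the TREC-format release, here is a minimal sketch of reading qrels into a Python dictionary. It assumes the conventional four-column TREC qrels layout (query ID, iteration, document ID, relevance); the exact file layout in spanish-ir/messirve-trec is not described in this announcement, so please check the dataset card first. The function name `parse_qrels` and the sample lines are hypothetical and not taken from MessIRve.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines: 'query_id iteration doc_id relevance'.

    Assumes the conventional whitespace-separated four-column layout;
    the actual MessIRve files may differ.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        qrels[query_id][doc_id] = int(relevance)
    return dict(qrels)


# Made-up sample lines for illustration only (not MessIRve data).
sample = [
    "q1 0 d10 1",
    "q1 0 d11 0",
    "",
    "q2 0 d20 1",
]
qrels = parse_qrels(sample)
```

The result maps each query ID to a dictionary of document IDs and relevance labels, which is the shape most evaluation scripts expect.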
