Dear colleagues,

We are pleased to announce the last call for participation for the first Shared 
Task on Language Identification for Web Data at WMDQS/COLM 2025.

Important information:
🗓️ Registration Deadline: July 23 (AoE)
📍 Montréal, Canada
🌐 https://wmdqs.org/shared-task/

Registration:
To register, please submit a one-page document with a title, a list of authors, 
a list of provisional languages that you want to focus on, and a brief 
description of your approach. This document should be sent to 
[email protected]. You can change the list of languages or the system 
description during the shared task. This document's only purpose is to register 
your participation in the shared task. The shared task will run until the last 
week of September.

Motivation:
The lack of training data—especially high-quality data—is the root cause of 
poor language model performance for many languages. One obstacle to improving 
the quantity and quality of available text data is language identification 
(LangID or LID). LangID remains far from solved for many languages. Several of 
the commonly used LangID models were introduced in 2017 (e.g. fastText and 
CLD3). The aim of this shared task is to encourage innovation in open-source 
language identification and improve accuracy on a broad range of languages.

All participants will be invited to contribute a larger paper, which will be 
submitted to a high-impact NLP venue.

Description:
The main shared task is to submit LangID models that work well on a wide 
variety of languages on web data. We encourage participants to employ a range 
of approaches, including the development of new architectures and the curation 
of novel high-quality annotated datasets.

We recommend using the GlotLID corpus as a starting point for training data. 
Access to the data will be managed through the Hugging Face repository. Please 
note that this data should not be redistributed. We will use the same language 
label format as GlotLID: an ISO 639-3 language code plus an ISO 15924 script 
code, separated by an underscore.
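
As an illustration of the label format described above, here is a minimal 
Python sketch that splits a label into its two parts. It assumes the usual 
conventions of three lowercase letters for ISO 639-3 codes and four letters 
with an initial capital for ISO 15924 codes (for example, eng_Latn). The names 
parse_label and LangLabel are hypothetical and not part of GlotLID or the 
shared task tooling, and the check covers only the shape of a label, not 
whether the codes are actually registered.

```python
import re
from typing import NamedTuple

# Shape check only (an assumption for illustration): three lowercase letters,
# an underscore, then four letters with an initial capital.
LABEL_PATTERN = re.compile(r"([a-z]{3})_([A-Z][a-z]{3})")


class LangLabel(NamedTuple):
    language: str  # ISO 639-3 language code, e.g. "eng"
    script: str    # ISO 15924 script code, e.g. "Latn"


def parse_label(label: str) -> LangLabel:
    """Split a label such as 'eng_Latn' into language and script codes."""
    match = LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not a <language>_<script> label: {label!r}")
    return LangLabel(language=match.group(1), script=match.group(2))


if __name__ == "__main__":
    print(parse_label("eng_Latn"))
```

A stricter check would look the two codes up in the official ISO 639-3 and 
ISO 15924 code tables rather than relying on their shape alone.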

Although all systems will be evaluated on the full range of languages in our 
test set, we encourage submissions that focus on a particular language or set 
of languages, especially if those language(s) present particular challenges for 
language identification.

The shared task will take place in rounds. The first round will only include 
data from already existing datasets; subsequent rounds will include data 
annotated by the community as it is collected and processed. More languages 
will also be added in subsequent rounds.

Organizers:
For any questions, please email [email protected].

Program Chairs:
Pedro Ortiz Suarez (Common Crawl Foundation)
Sarah Luger (MLCommons)
Laurie Burchell (Common Crawl Foundation)
Kenton Murray (Johns Hopkins University)
Catherine Arnett (EleutherAI)

Organizing Committee:
Thom Vaughan (Common Crawl Foundation)
Sara Hincapié (Factored)
Rafael Mosquera (MLCommons)