[Corpora-List] [Deadline Extension] 1st Workshop on Multilingual Data Quality Signals at COLM 2025

Pedro Ortiz Suarez via Corpora Mon, 23 Jun 2025 07:02:47 -0700

Dear colleagues,


We are pleased to announce the first call for papers of the
*1st Workshop on Multilingual Data Quality Signals at COLM 2025*


Important information:
🗓️ CfP Deadline Extended to: July 3, Workshop: October 10
📍 Montréal, Canada
🌐 https://wmdqs.org


Scope


Recent research has shown that large language models (LLMs) not only need large 
quantities of data, but also need data of sufficient quality. Ensuring data 
quality is even more important in a multilingual setting, where the amount of 
acceptable training data in many languages is limited. Indeed, for many 
languages even the fundamental step of language identification remains a 
challenge, leading to unreliable language labels and thus noisy datasets for 
underserved languages.


In response to these challenges, we will be holding the first Workshop on 
Multilingual Data Quality Signals (WMDQS) in tandem with COLM. We invite the 
submission of long and short research papers related to data quality in 
multilingual data.


Even though most previous work on data quality has been targeted at LLM 
development, we believe that research in this area can also benefit other 
research communities in areas such as web search, web archiving, corpus 
linguistics, digital humanities, political science and beyond. We therefore 
encourage submissions from a wide range of disciplines.


WMDQS will also include a shared task on language identification for web text. 
We invite participants to submit novel systems which address current problems 
with language identification for web text. We will provide a training set of 
annotated documents sourced from Common Crawl to aid development.


Topics


We welcome submissions of (1) original research papers, (2) review/opinion 
papers, (3) online systems on the topics listed below, and (4) extended 
abstracts. We especially welcome work-in-progress projects and all novel ideas 
covering research in multilinguality, underserved/low-resource languages, 
under-represented linguistic communities and all types of work covering data 
quality signals. Suggested areas include:


- Data pipelines for data annotation and data filtering
- Undesirable content detection in a multilingual setting
- Multilingual or language-independent content ranking
- Human annotation platforms and systems
- Multilingual tokenization mechanisms
- Small language models and embeddings
- Linguistic studies in underserved languages
- Corpus creation and curation methods, especially for underserved languages
- Machine translation
- Digital humanities
- Historical and constructed languages


Shared task


The lack of training data—especially high-quality data—is the root cause of 
poor language model performance for many languages. One obstacle to improving 
the quantity and quality of available text data is language identification 
(LangID or LID). LangID remains far from solved for many languages. Several of 
the commonly used LangID models were introduced in 2017 (e.g. fastText and 
CLD3). The aim of this shared task is to encourage innovation in open-source 
language identification and improve accuracy on a broad range of languages.


All accepted authors will be invited to contribute a larger paper, which will 
be submitted to a high-impact NLP venue.


Important dates for the Workshop:
Workshop paper submission deadline (extended): July 3, 2025
Workshop paper acceptance notification: July 24, 2025
Workshop: October 10, 2025


Important dates for the Shared Task:
1st Deadline to contribute annotations: July 7, 2025
1st Annotations released (train split): July 14, 2025
Abstract Deadline: July 21, 2025
Decision Notification: July 24, 2025
Camera Ready Deadline: September 21, 2025


(All deadlines are 23:59 AoE.)


Organizers:
For any questions, please send an email to [email protected]


Program Chairs:
Pedro Ortiz Suarez (Common Crawl Foundation)
Sarah Luger (MLCommons)
Laurie Burchell (Common Crawl Foundation)
Kenton Murray (Johns Hopkins University)
Catherine Arnett (EleutherAI)


Organizing Committee:
Thom Vaughan (Common Crawl Foundation)
Sara Hincapié (Factored)
Rafael Mosquera (MLCommons)
