Welcome to SHROOM, a Shared-task on Hallucinations and Related Observable 
Overgeneration Mistakes!


Task description: SHROOM participants will need to detect grammatically sound 
output that contains incorrect semantic information (i.e., information that is 
unsupported by or inconsistent with the source input), with or without access 
to the model that produced the output.


Overview of the task: The modern NLG landscape is plagued by two interlinked 
problems:

On the one hand, our current neural models have a propensity to produce fluent 
but inaccurate outputs; on the other hand, our metrics are better at measuring 
fluency than correctness. As a result, neural models "hallucinate": they 
produce fluent but incorrect outputs that we currently struggle to detect 
automatically. For many NLG applications, however, the correctness of an 
output is mission-critical. For instance, a plausible-sounding translation 
that is inconsistent with the source text jeopardizes the usefulness of a 
machine translation pipeline. With this shared task, we hope to foster the 
community's growing interest in this topic.


With SHROOM we adopt a post hoc setting: models have already been trained and 
their outputs already produced. To keep the barrier to entry low, we frame the 
task as binary classification: participants must identify cases of fluent 
overgeneration hallucinations in two tracks, a model-aware track and a 
model-agnostic track. In the former, participants have access to the model 
that produced the output; in the latter, they do not. All systems will be 
rated on accuracy (the proportion of test examples correctly labeled) and 
calibration (the correlation between the probability a system assigns and the 
proportion of annotators marking a production as hallucinatory).
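As an illustration, the two metrics can be sketched in a few lines of Python. 
The label strings, toy data, and the use of plain Pearson correlation below 
are assumptions made for illustration; the official scorer's data format and 
correlation measure (e.g., Spearman) may differ.

```python
from statistics import mean

def accuracy(preds, golds):
    """Proportion of test examples whose predicted label matches the gold label."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def pearson(xs, ys):
    """Plain Pearson correlation (a stand-in; the task may score differently)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data: predicted labels, predicted P(hallucination), and the
# proportion of five annotators who marked each item as a hallucination.
pred_labels = ["Hallucination", "Not Hallucination", "Hallucination"]
gold_labels = ["Hallucination", "Not Hallucination", "Not Hallucination"]
pred_probs  = [0.9, 0.1, 0.6]
annot_props = [1.0, 0.0, 0.4]

acc = accuracy(pred_labels, gold_labels)   # 2 of 3 labels correct
rho = pearson(pred_probs, annot_props)     # how well probabilities track annotators
```

A well-calibrated system thus benefits twice: its hard labels score on 
accuracy, while its raw probabilities score on their agreement with the 
annotator distribution.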


We provide participants with a collection of checkpoints, inputs, references 
and outputs of systems covering three NLG tasks: definition modeling (DM), 
machine translation (MT), and paraphrase generation (PG), trained to varying 
degrees of accuracy. The development set provides binary annotations from five 
different annotators and a majority-vote gold label.
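To make the annotation scheme concrete, here is a minimal sketch of how a 
majority-vote gold label, and a soft target for calibration, could be derived 
from five binary annotations. The label strings and data layout are 
assumptions for illustration, not the actual dev-set schema.

```python
from collections import Counter

def gold_from_annotations(tags):
    """Majority vote over the annotator labels; with five annotators there is
    never a tie. The fraction of 'Hallucination' votes doubles as a soft
    target for the calibration metric."""
    counts = Counter(tags)
    label = counts.most_common(1)[0][0]
    p_hallucination = counts["Hallucination"] / len(tags)
    return label, p_hallucination

# A hypothetical item where three of five annotators saw a hallucination.
label, p = gold_from_annotations(
    ["Hallucination", "Hallucination", "Not Hallucination",
     "Hallucination", "Not Hallucination"])
```

On this toy item the gold label is "Hallucination" with a vote proportion of 
0.6, which is the quantity a system's predicted probability is compared 
against.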


Anyone wishing to participate in the task is welcome! Participants will have to

  *   Submit at least once during the evaluation phase next January;

  *   Write a system description paper;

  *   Review other system description papers (max. 2).


Trial, dev and train data are now available on the task website: 
https://helsinki-nlp.github.io/shroom/

Codalab competition: https://codalab.lisn.upsaclay.fr/competitions/15726

Join the mailing group: 
https://groups.google.com/u/1/g/semeval-2024-task-6-shroom

Updates on Twitter: @shroom2024 (https://twitter.com/shroom2024)


Important dates:


  *   Sample data ready: July 15th, 2023

  *   Validation data ready: September 11th, 2023

  *   Unlabeled train data ready: September 22nd, 2023

  *   Evaluation period starts (test set released): January 10th, 2024

  *   Evaluation period ends: January 31st, 2024

  *   Workshop paper submission deadline: February 29th, 2024

  *   Notification to authors: April 1st, 2024

  *   SemEval workshop: TBA (Summer 2024, co-located with a major NLP 
conference)



Task organizers


  *   Elaine Zosa, University of Helsinki, Finland

  *   Raúl Vázquez, University of Helsinki, Finland

  *   Jörg Tiedemann, University of Helsinki, Finland

  *   Vincent Segonne, Université Grenoble Alpes, France

  *   Teemu Vahtola, University of Helsinki, Finland

  *   Alessandro Raganato, University of Milano-Bicocca, Italy

  *   Timothee Mickus, University of Helsinki, Finland

  *   Marianna Apidianaki, University of Pennsylvania, USA

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]