[Corpora-List] December 2025 Newsletter - LDC

Penn LDC via Corpora Mon, 15 Dec 2025 08:16:38 -0800

In this newsletter:
LDC 2026 membership discounts now available
LDC's 1,000th corpus
Approaching deadline for Spring 2026 data scholarship applications
LDC closed for Winter Break December 25 - January 2


New publications:
2021 NIST Speaker Recognition Evaluation Development and Test Set<https://catalog.ldc.upenn.edu/LDC2025S11>
LORELEI Sinhala Incident Language Pack<https://catalog.ldc.upenn.edu/LDC2025T17>
________________________________
LDC 2026 membership discounts now available
Now through March 2, 2026, any organization that joins the Consortium or renews 
its membership will receive a 10% discount on the 2026 membership fee. 
Membership remains the most economical way to access current and past LDC 
releases. Consult Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for 
details on membership options and benefits.

LDC's 1,000th corpus
LDC is delighted to announce the release of the 1,000th corpus into the 
Catalog! This milestone represents the commitment we made over thirty years ago 
to provide large quantities of diverse data, robust research program support, 
and exceptional member services. We are grateful for the continued support and 
collaboration of our members, friends, and the community.

Approaching deadline for Spring 2026 data scholarship applications
Attention students: don't miss out on the chance to receive no-cost access to 
LDC data for your research. Applications for Spring 2026 data scholarships are 
due January 15, 2026. For more information on requirements and program rules, 
see LDC Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

LDC closed for Winter Break December 25-January 2
LDC will be closed from Thursday, December 25, 2025, through Friday, January 2, 
2026, in accordance with the University of Pennsylvania Winter Break Policy. 
Our offices will reopen on Monday, January 5, 2026. Requests received by the 
Membership Office during Winter Break will be processed when the office reopens.
________________________________
New publications:
2021 NIST Speaker Recognition Evaluation Development and Test 
Set<https://catalog.ldc.upenn.edu/LDC2025S11> was developed by LDC and NIST 
(National Institute of Standards and Technology). It contains approximately 447 
hours of Cantonese, Mandarin, and English conversational telephone speech, 
audio from video, and selfie image data for development and test, along with 
answer keys, enrollment, trial files, and documentation from the NIST-sponsored 
2021 Speaker Recognition Evaluation 
(SRE)<https://www.nist.gov/itl/iad/mig/nist-2021-speaker-recognition-evaluation-sre21>.

The SRE task is speaker detection, that is, to determine whether a specified 
target speaker was speaking during a segment of speech. SRE21 focused on 
telephone speech and audio from video and included close-up images of 
participants. The evaluation also featured cross-lingual trials, that is, 
enrollment and test segments spoken in different languages.

The data was drawn from the WeCanTalk corpus collected by LDC, in which speakers 
called friends or relatives who agreed to record their telephone conversations, 
each lasting 8 to 10 minutes. Subjects contributed multiple conversational 
telephone speech recordings and audio recordings in which they were talking, 
plus a single selfie image. Recordings were manually audited to verify speaker, 
language, and quality.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

LORELEI Sinhala Incident Language 
Pack<https://catalog.ldc.upenn.edu/LDC2025T17> was developed by LDC and 
comprises 8.1 million words of Sinhala monolingual text, 700,000 words of 
English monolingual text, 6.4 million words of parallel Sinhala-English text, 
and 50,000 words annotated for entity discovery and linking and situation 
frames. It constitutes all of the text data, annotations, supplemental 
resources, and related software tools for the Sinhala language used in the 
DARPA LORELEI / LoReHLT 2018 
Evaluation<https://www.nist.gov/itl/iad/mig/lorehlt-evaluations>.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. In the evaluation scenario, an unforeseen 
event triggered a need for humanitarian and logistical support in a region 
where the incident language had received little or no attention in NLP 
research. Evaluation participants provided NLP solutions, including information 
extraction and machine translation, with limited resources and limited 
development time.

Data was collected from news, social network, weblog, newsgroup, discussion 
forum, and reference material. Entity discovery and linking annotation 
identified entities to be detected by systems for scoring purposes. Situation 
frame analysis was designed to extract basic information about needs and 
relevant issues for planning a disaster response effort.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
