[Corpora-List] July 2025 Newsletter - LDC

Penn LDC via Corpora Tue, 15 Jul 2025 08:34:09 -0700

In this newsletter:
Fall 2025 LDC data scholarship program

New publications:
AnnoDIFP Session Audio and Transcripts<https://catalog.ldc.upenn.edu/LDC2025S06>
Penn Parsed Corpora of Historical English Second 
Release<https://catalog.ldc.upenn.edu/LDC2025T09>
LoReHLT Uzbek Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2025T08>


________________________________
Fall 2025 LDC data scholarship program
Student applications for the Fall 2025 LDC data scholarship program are being 
accepted now through September 15, 2025. This program provides eligible 
students with no-cost access to LDC data. Students must complete an application 
consisting of a data use proposal and letter of support from their advisor. For 
application requirements and program rules, visit the LDC Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) 
Session Audio and Transcripts<https://catalog.ldc.upenn.edu/LDC2025S06> was 
developed by LDC, the Florida Institute of Technology<https://www.fit.edu/> 
(FIT), and the University of New Haven<https://www.newhaven.edu/index.php> 
(UNH) to support algorithm development for predicting personality traits. It 
contains 438.34 hours of English audio and transcripts from in-person 
interviews of 366 participants paired with scores from two self-reported 
personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) 
and Short Dark Triad (SD3).

In-person interviews were recorded at LDC, FIT, and UNH. In each session, the 
participant and interviewer were in separate sound-isolated rooms with 
communication between them supplied by audio/video hardware. Sessions consisted 
of the following tasks: rapport building, a YouTube task, a map task, and a 
business task. Further details on collection methodology and session tasks are 
contained in the documentation accompanying this release.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

Penn Parsed Corpora of Historical English Second 
Release<https://catalog.ldc.upenn.edu/LDC2025T09> was developed at the 
University of Pennsylvania and consists of running texts and text samples of 
British English prose from the earliest Middle English documents (1100 CE) up 
to the period of the First World War (1914 CE). This second release corrects 
errors and inconsistencies in Penn Parsed Corpora of Historical English 
(LDC2020T16<https://catalog.ldc.upenn.edu/LDC2020T16>), further streamlines 
annotation, simplifies the directory structure, and includes updated 
documentation.

This data set contains three corpora covering traditionally recognized periods 
of English:


  *   The Penn-Helsinki Parsed Corpus of Middle English, second edition
  *   The Penn-Helsinki Parsed Corpus of Early Modern English
  *   The Penn Parsed Corpus of Modern British English, second edition

The texts are in two forms: part-of-speech tagged text and syntactically 
annotated text. Annotations were manually reviewed for accuracy and 
consistency. Included in this release are updated annotation guidelines, 
philological information for each corpus, and the CorpusSearch 2 program, which 
allows users to search the data for words, word sequences, and syntactic 
structure.

2025 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.

*

LoReHLT Uzbek Representative Language 
Pack<https://catalog.ldc.upenn.edu/LDC2025T08> was developed by LDC and 
consists of approximately 47 million words of Uzbek monolingual text, 563,000 
words of found Uzbek-English parallel text, 100,000 Uzbek words translated from 
English data, and 6.4 hours of Uzbek broadcast news and amateur web audio 
recordings. Approximately 151,000 words were annotated for named entities and 
over 28,000 words were annotated for full entities, including nominals and 
pronouns. Noun-phrase chunking was applied to more than 13,000 words. Over 
20,890 words were labeled with simple semantic annotation. Topic annotation was 
applied to the audio recordings. Data was collected from discussion forums, 
news, reference, social network, broadcast news, web audio recordings, and 
weblogs.

LoReHLT was a companion project of the DARPA LORELEI program. The LORELEI (Low 
Resource Languages for Emergent Incidents) program was concerned with building 
human language technology for low resource languages in the context of emergent 
situations. Representative languages were selected to provide broad typological 
coverage.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
