[Corpora-List] June 2025 Newsletter - LDC

Penn LDC via Corpora Mon, 16 Jun 2025 07:44:34 -0700

In this newsletter:
LDC data and commercial technology development

New publications:
Chinese Sentence Pattern Structure 
Treebank<https://catalog.ldc.upenn.edu/LDC2025T06>
IWSLT 2022-2023 Shared Task Training, Development and Test 
Set<https://catalog.ldc.upenn.edu/LDC2025S05>
KAIROS Schema Learning Complex Event 
Annotation<https://catalog.ldc.upenn.edu/LDC2025T07>


________________________________

LDC data and commercial technology development
For-profit organizations are reminded that an LDC membership is a prerequisite 
for obtaining a commercial license to almost all LDC databases. Non-member 
organizations, including non-member for-profit organizations, cannot use LDC 
data to develop or test products for commercialization, nor can they use LDC 
data in any commercial product or for any commercial purpose. LDC data users 
should consult corpus-specific license agreements for limitations on the use of 
certain corpora. Visit the 
Licensing<https://www.ldc.upenn.edu/data-management/using/licensing> page for 
further information.
________________________________

New publications:
Chinese Sentence Pattern Structure 
Treebank<https://catalog.ldc.upenn.edu/LDC2025T06> was developed at Beijing 
Normal University<https://english.bnu.edu.cn/> and Peking 
University<https://english.pku.edu.cn/>. It contains 5,016 sentences and 
119,627 tokens syntactically annotated following the concept of sentence 
constituent analysis, which emphasizes sentence pattern structure. The source 
data consists of 27 chapters extracted from modern Mandarin and ancient Chinese 
works. There are three annotation layers: lexical sense and structural mode for 
dynamic words; syntactic structure for clauses; and inter-clause relations 
within complex sentences and sentence clusters. These structures can be 
visualized using the Jbw-viewer tool<https://github.com/bnucip/jbwviewer>, which 
is included in the release.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

IWSLT 2022-2023 Shared Task Training, Development and Test 
Set<https://catalog.ldc.upenn.edu/LDC2025S05> was developed by LDC and contains 
210 hours of Tunisian Arabic 
conversational telephone speech, transcripts, English translations, speaker 
metadata, and documentation. This material constitutes the training, 
development, and test data used in the International Conference on Spoken 
Language Translation (IWSLT) Dialectal Speech Translation task 
(2022)<https://iwslt.org/2022/dialect> and the Dialectal and Low-resource track 
(2023)<https://iwslt.org/2023/low-resource>.

The telephone speech was collected by LDC in 2016-2017 from native speakers of 
Tunisian Arabic in Tunis. Speakers were recruited to make telephone calls to 
people in their social networks under a variety of noise conditions and using 
a variety of handsets. Transcripts are orthographic, following 
Buckwalter<https://catalog.ldc.upenn.edu/LDC2004L02> transliteration, and cover 
175 hours of the collected speech. IPA transcripts were added to a subset of 
the data. All transcribed segments were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

KAIROS Schema Learning Complex Event 
Annotation<https://catalog.ldc.upenn.edu/LDC2025T07> was developed by LDC to 
support the DARPA KAIROS program. It contains English and Spanish text, audio, 
video, and image data labeled for 93 real-world complex events with event, 
relation, and argument annotations linking to document provenance. Source data 
was collected from the web; 3,431 root web pages were collected and processed, 
yielding 1,919 text data files, 24,019 image files, 1,472 video files, and 16 
audio files.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over 
Schemas) program aimed to build technology capable of understanding and 
reasoning about complex real-world events in order to provide actionable 
insights to end users. KAIROS systems utilized formal event representations in 
the form of schema libraries that specified the steps, preconditions, and 
constraints for an open set of complex events; schemas were then used in 
combination with event extraction to characterize and make predictions about 
real-world events in a large, multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
