In this newsletter:
Fall 2022 LDC Data Scholarship Program
30th Anniversary Highlight: The LDC Gigawords
________________________________
New publication:
HAVIC MED Novel 2 Test - Videos, Metadata and 
Annotation<https://catalog.ldc.upenn.edu/LDC2022V02>

Fall 2022 LDC Data Scholarship Program
Student applications for the Fall 2022 LDC Data Scholarship program are being 
accepted now through September 15, 2022. This program provides eligible 
students with no-cost access to LDC data. Students must complete an application 
consisting of a data use proposal and a letter of support from their advisor. For 
application requirements and program rules, visit the LDC Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

30th Anniversary Highlight: The LDC Gigawords
Giga: a combining form meaning "billion," used in the formation of compound 
words (Source: https://www.dictionary.com/browse/giga-)

LDC's Gigaword corpora are a natural outgrowth of its vast decades-long 
multi-language newswire collection. Newswire data was originally collected, 
annotated, and distributed for use in many sponsored projects and was also 
released through the LDC catalog in tailored data sets. Then came the idea of 
making LDC's entire newswire collection available by language with a simple, 
minimal markup to support a broad range of NLP/HLT tasks. The first 
Arabic<https://catalog.ldc.upenn.edu/LDC2011T11>, 
Chinese<https://catalog.ldc.upenn.edu/LDC2011T13>, and 
English<https://catalog.ldc.upenn.edu/LDC2011T07> Gigaword editions were 
released in 2003; subsequent cumulative releases through fifth editions in 2011 
represent LDC's newswire collection spanning 1994-2010 in those languages. 
French<https://catalog.ldc.upenn.edu/LDC2011T10> and 
Spanish<https://catalog.ldc.upenn.edu/LDC2011T12> Gigawords were first 
published in 2006, culminating in the release of third editions in 2011, 
likewise covering newswire collected by LDC through 2010.

The community has used, and continues to use, these data sets in numerous ways. 
Automatic text summarization is a favorite, and current work in this area 
applies deep learning principles (see, e.g., Gao et al. 
2020<https://link.springer.com/article/10.1007/s00521-018-3946-7>, English). 
Gigawords are also useful for text source classification (Huang et al. 
2008<https://aclanthology.org/Y08-1042.pdf>, Chinese), information extraction 
(Lan et al. 2020<https://arxiv.org/pdf/2004.14519.pdf>, Arabic), knowledge 
extraction and distributional semantics (Napoles et al. 
2012<https://aclanthology.org/W12-3018.pdf>, English), and natural language 
understanding (Ganitkevitch 
2013<https://www.cs.jhu.edu/~juri/pdf/proposal-naacl-2013-srw.pdf>, English), 
among other fields. Recent variations like the 
annotated<https://catalog.ldc.upenn.edu/LDC2012T21> and concretely 
annotated<https://catalog.ldc.upenn.edu/LDC2018T20> English Gigawords add 
syntactic, semantic, and coreference annotations to this billion-word text 
collection.

All Gigaword corpora are available for licensing by Consortium members and 
non-members. Visit Obtaining Data 
<https://www.ldc.upenn.edu/language-resources/data/obtaining> for more 
information.
________________________________
New publication:

HAVIC MED Novel 2 Test - Videos, Metadata and 
Annotation<https://catalog.ldc.upenn.edu/LDC2022V02> comprises 6,200 
hours of user-generated videos with annotation and metadata developed by LDC 
for the 2015 NIST Multimedia Event Detection tasks. The data consists of videos 
of various events (event videos) and videos completely unrelated to events 
(background videos). Each event video was manually annotated with judgments 
describing its event properties and other salient features. Background videos 
were labeled with topic and genre categories.

HAVIC MED Novel 2 Test - Videos, Metadata and Annotation is distributed via 
web download.

2022 Subscription Members will automatically receive copies of this corpus. 
2022 Standard Members may request a copy as part of their 16 free membership 
corpora. This corpus is a members-only release and is not available for 
non-member licensing. Contact [email protected]<mailto:[email protected]> for 
information about membership.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
