[Corpora-List] GUM Corpus V11 - new documents and annotations

Amir.Zeldes--- via Corpora Thu, 13 Mar 2025 13:46:44 -0700

(Apologies for cross-postings)


*** The GUM Corpus - Release 11.0.0 ***

*** Georgetown University Multilayer corpus ***

Corpling@GU <https://gucorpling.org/corpling/> is happy to announce the first
release of series 11 of the Georgetown University Multilayer corpus (GUM
V11.0.0):

https://gucorpling.org/gum/

New in this version: 

*       GUM and the out-of-domain test set GENTLE have now merged!
*       New documents – the corpus now contains 268,208 tokens
*       Five different summaries per document
*       Graded salience scores (0-5) for each entity in every document

GUM is an open-source corpus of richly annotated English texts from 24 genres:

*       Main genres (available in train/dev/test):
        *       academic writing
        *       biographies
        *       courtroom transcripts
        *       essays
        *       fiction
        *       how-to guides
        *       interviews
        *       letters
        *       news
        *       online forum discussions
        *       podcasts
        *       political speeches
        *       spontaneous face-to-face conversations
        *       textbooks
        *       travel guides
        *       vlogs

*       Out-of-domain test genres (test2, aka the GENTLE partition):
        *       dictionary entries
        *       live esports commentary
        *       legal documents
        *       medical notes
        *       poetry
        *       mathematical proofs
        *       course syllabuses
        *       threat letters
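
As an illustration only, the genre split above can be written down as a small
lookup. This is a minimal Python sketch and not part of the GUM release: the
genre strings are copied from this announcement, and the names MAIN_GENRES,
GENTLE_GENRES and partitions_for are hypothetical helpers that may not match
the labels or file names used in the corpus itself.

```python
# Genre names are copied from the announcement text; they are NOT guaranteed
# to match the genre codes used in the released corpus files.
MAIN_GENRES = {
    "academic writing", "biographies", "courtroom transcripts", "essays",
    "fiction", "how-to guides", "interviews", "letters", "news",
    "online forum discussions", "podcasts", "political speeches",
    "spontaneous face-to-face conversations", "textbooks", "travel guides",
    "vlogs",
}

GENTLE_GENRES = {
    "dictionary entries", "live esports commentary", "legal documents",
    "medical notes", "poetry", "mathematical proofs", "course syllabuses",
    "threat letters",
}


def partitions_for(genre: str) -> list[str]:
    """Return the partitions in which a genre can appear (hypothetical helper)."""
    if genre in MAIN_GENRES:
        return ["train", "dev", "test"]
    if genre in GENTLE_GENRES:
        # Out-of-domain genres are held out entirely for testing.
        return ["test2"]
    raise ValueError(f"unknown genre: {genre!r}")


if __name__ == "__main__":
    print(partitions_for("news"))                  # ['train', 'dev', 'test']
    print(partitions_for("poetry"))                # ['test2']
    print(len(MAIN_GENRES) + len(GENTLE_GENRES))   # 24
```

The point of keeping the two sets separate is that a model trained on the main
genres never sees the out-of-domain genres before evaluation on test2.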

The corpus is created by students as part of the Computational Linguistics 
curriculum at Georgetown University and is available under Creative Commons 
licenses.

This is the first version of GUM series 11, containing roughly 281 documents 
annotated for:

*       Multiple POS tags (100% manual gold PTB, extended PTB, converted CLAWS5 
and UPOS) and UD morphological features
*       Manually corrected lemmatization and morphological segmentation
*       Sentence segmentation and rough speech act annotation (manual)
*       Document structure using TEI tags (paragraphs, headings, figures, 
captions etc., all manual)
*       Constituent and dependency syntax (manually corrected Universal 
Dependencies, and PTB parses from gold tags with function labels and enhanced 
dependencies)
*       Construction Grammar annotations following UCxn
*       Information status (given-active/inactive, accessible-inferable/common 
ground/aggregate, and new)
*       Entity type, graded salience (0-5) and coreference annotation 
(including non-named entities, singletons, appositions, cataphora and several 
types of bridging), as well as Centering Theory annotations
*       Entity linking (Wikification) of all named entities with Wikipedia 
articles, including their non-named and pronominal mentions
*       Discourse parses in enhanced Rhetorical Structure Theory (eRST) and 
discourse dependencies, including multiple concurrent and non-projective 
relations
*       Discourse signal annotations classified into 9 major and 45 minor types 
indicating how the presence of a relation is marked (based on the Signaling 
Corpus scheme)
*       Shallow discourse relations following the PDTB v3 scheme
*       Five abstractive summaries for each document following strict, 
comparable guidelines across genres

Note on Reddit data: token text is not contained in the release but can be 
downloaded with an included script.

For more information and to search or download the corpus online, see the
corpus website <https://gucorpling.org/gum/>.

Best wishes,

The GUM team

PS – if you like GUM, also check out our automatically annotated AMALGUM
corpus <https://github.com/gucorpling/amalgum/>!

