The Autogramm project (https://autogramm.github.io/en) invites applications for 
a 3-year PhD position starting between now and October 2023. The position is 
funded by the ANR (Agence Nationale de la Recherche), France.

Applications and questions can be sent to Sylvain Kahane <[email protected]>

Applications should include:
- Cover letter outlining interest in the position
- Names of two referees
- Curriculum Vitae (CV) with publications (if applicable)
- Copy of MA degree
- University transcripts covering at least the last two years

Databases are now available for several dozen languages, including corpora 
annotated according to shared principles, in particular corpora annotated 
with interlinear glossed text (IGT, see for example the Pangloss collection, 
https://pangloss.cnrs.fr) or with the Universal Dependencies annotation scheme 
(UD, https://universaldependencies.org, and its SUD variant, 
https://surfacesyntacticud.github.io/). These databases allow typological 
studies and have several advantages:
- the results obtained are based directly on primary data (corpora) and not 
secondary data (grammars written by linguists). (This is only partially true, 
since the results still depend on the choices made by a linguist in selecting 
the corpus and annotating it; nevertheless, these choices are visible and can 
be discussed.)
- the results are reproducible as long as the data are freely accessible;
- the nature of the data allows for quantitative results: rather than saying 
that a language is OV or VO, we can say that it has a given percentage of OV 
constructions, and we can observe directly in the data which factors 
determine the distribution between OV and VO (Levshina 2019, Gerdes et 
al. 2019, Futrell et al. 2015). (See also https://typometrics.elizia.net/#/.)

The goal of the thesis is to contribute to the development of quantitative 
typology by participating in the construction of a quantitative database 
covering a large number of typologically diverse languages and by focusing 
on the exploitation of such a dataset (Levshina 2022). The originality of the 
project lies in the fact that we are working on quantitative data and not on 
categorical features, unlike existing typological databases (see in particular 
the World Atlas of Language Structures Online, https://wals.info/, which gives 
access to data on more than 2500 languages).

The following questions can be studied:
- How can we identify cross-linguistic regularities, such as quantitative 
entailment universals, from a set of corpora of world languages (see for 
example Gerdes et al. 2021)? How can we make inferences between quantitatively 
valued features?
- What quantitative information can be extracted from a corpus that is useful 
for a typological study? Which features require prior annotation of the data, 
and what kind of annotation is needed (see for example the case of IGT for 
morphosyntactic features and treebanks for word order)?
- How can we identify the typological signature of a language from an 
annotated corpus and determine what makes it special within a group of 
languages (see Bickel & Nichols 2002 and the AutoTyp project)?
- How can we take into account the imbalance of a database that is not 
representative of the distribution of languages in the world, but includes a 
higher proportion of languages from certain regions or families (Indo-European 
languages, Semitic languages, East Asian languages, etc.) to the detriment of 
others (Papua New Guinea, Oceania, Sub-Saharan Africa, Amerindian languages, 
Aboriginal languages) (see Guzmán Naranjo & Becker 2022)?
- How can we address the commensurability of the categories used in the 
description of different languages? How can we check the consistency of 
the data? This question can be addressed by studying the consistency of 
treebanks of the same language or language family. How can we detect 
anomalies in some treebanks (categorization choices not conforming to the 
universal scheme, e.g. assignment of the subject relation in ergative 
languages, use of the ADJ category in languages without real adjectives, etc.)?
- How can we visualize multidimensional quantitative data? Linguistic data 
pose many challenges in this respect.

The work will be conducted in collaboration with the members of the ANR 
Autogramm project (https://autogramm.github.io/), who are researchers in field 
linguistics, typology, formal linguistics and natural language processing. 
With the help of engineers, it could lead to the creation of a typometric 
database accompanied by query and data visualization tools.

Bickel & Nichols 2002
Futrell et al. 2015
Gerdes et al. 2019
Gerdes et al. 2021
Guzmán Naranjo & Becker 2022
Levshina 2019
Levshina 2022