CorpAfroAs is an integrated pilot project created by field linguists for field linguists and typologists. It proposes:
- A methodology for the treatment of fieldwork textual data in underdescribed languages, from data gathering to automatic searches on the corpus,
- A free, open-source, and user-friendly piece of software, ELAN-CorpA, developed within our project from ELAN (Max Planck Institute, Nijmegen),
- A pilot corpus composed of annotated first-hand transcriptions of narrative and conversational data in twelve Afroasiatic languages (one hour per language), with accompanying sound files, list of glosses, grammatical sketches, and metadata.
The background for the project is the current lack of transcribed, sound-indexed spoken data for Afroasiatic languages. In a technical context where it is becoming easier to compile databases and to provide long-term archives for sound files, it appeared urgent to propose a methodology, a series of tools, and scientific analyses that may help the community of researchers compile their data. To that effect, we have worked extensively on segmentation and glossing. The final purpose of this pilot corpus lies not so much in the amount of data it provides as in the full set of methods and electronic tools that have been developed and are now freely available to the community (see Glosses, Tools, Manual). Ultimately, the pilot corpus is designed to grow into a reference corpus, as well as to inspire initiatives for other language phyla.
The languages represented in the project are:
Kabyle, Tamashek (Berber),
Hausa, Bata, and Zaar (Chadic),
Afar, Beja, Gawwada, Ts'amakko (Cushitic),
Moroccan and Libyan Arabic, Juba-Arabic, Hebrew (Semitic).
One hour per language has been thoroughly transcribed, glossed, translated into English, and sound-indexed. The corpus itself will be released in December 2012. In its pilot form, the corpus is not designed to present a balanced sample of languages. Rather, it covers all branches and different types of languages, giving us the opportunity to provide technical and scientific solutions for all potential types: tonal and intonational, concatenative and non-concatenative, endangered and well-described languages, with or without codeswitching, etc.
The project is organized along two axes, linked to the nature of the materials and to the aim of the project, which is typological comparability among languages: prosody and morphosyntax.
The body of data is spoken, and we have decided to take this oral dimension into account by working on segmentation. We do not use the punctuation system of written texts, because it is not adapted to the specificities of spoken language. Instead, we have adapted the widely accepted system of boundary-marking used, for instance, in C-ORAL-ROM, developed by Cresti & Moneglia.
We therefore analyze the prosodic units of our languages into minor (non-terminal) and major (terminal) units, using the software Praat. No other specification (tones, contours, etc.) is attached to these boundaries, but the fact that the transcription is indexed to the sound will allow more in-depth prosodic studies on the available data.
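The two-level boundary scheme can be sketched as follows. This is our own minimal illustration, not project code: the unit representation, field names, and placeholder data are all invented; only the minor/major (non-terminal/terminal) distinction comes from the scheme described above.

```python
from dataclasses import dataclass

@dataclass
class ProsodicUnit:
    start: float          # seconds into the sound file (text-sound indexation)
    end: float
    transcription: str
    boundary: str         # "minor" (non-terminal) or "major" (terminal)

def major_units(units):
    """Group consecutive prosodic units up to each terminal boundary."""
    groups, current = [], []
    for u in units:
        current.append(u)
        if u.boundary == "major":
            groups.append(current)
            current = []
    if current:           # trailing units with no terminal boundary yet
        groups.append(current)
    return groups

# Invented placeholder data, not taken from the corpus:
units = [
    ProsodicUnit(0.00, 0.85, "first unit", "minor"),
    ProsodicUnit(0.85, 1.60, "second unit", "major"),
    ProsodicUnit(1.60, 2.40, "third unit", "major"),
]
print(len(major_units(units)))  # prints 2
```

Because each unit carries its own time stamps, boundaries need no further specification here, while the sound indexation keeps finer prosodic analysis possible later.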
The corpus is not only translated, but also interlinearly glossed. For this purpose, we have developed a format allowing several annotation tiers. Those tiers are aimed at the automatic retrieval of a number of relevant queries concerning Afroasiatic languages: pronominal systems, case systems, nominal predicates, aspect, ideophones, demonstratives, verbal derivation, etc.
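As a sketch of how such tier-based queries can work, the example below (with invented tier names and miniature data, not drawn from the corpus) searches a gloss tier for a given label, treating the Leipzig-style delimiters "-", "=", "." and spaces as label boundaries:

```python
import re

# Invented miniature corpus; "tx" = transcription tier, "gl" = gloss tier.
corpus = [
    {"tx": "unit one", "gl": "house-PL DEM.PROX be-IPFV"},
    {"tx": "unit two", "gl": "1SG-see-PFV woman-ACC"},
]

def has_gloss(gloss_tier, label):
    """True if the gloss label occurs as a whole token in the tier.
    Leipzig-style glosses delimit labels with '-', '=', '.' or spaces."""
    return label in re.split(r"[-=.\s]+", gloss_tier)

# Retrieve all units containing a demonstrative gloss:
hits = [u["tx"] for u in corpus if has_gloss(u["gl"], "DEM")]
print(hits)  # prints ['unit one']
```

Splitting on delimiters rather than substring matching is what makes queries like "demonstratives" or "aspect" reliable: a search for "DEM" must not match inside an unrelated label.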
The theoretical question underlying our project is that of the tension between comparability and language-internal coherence: what is the optimal degree of unification of the annotations, in order both to respect the specificities of individual languages and to provide a comparative basis for typology?
The annotation system we chose is based on the Leipzig Glossing Rules, developed jointly by the Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology (Bernard Comrie, Martin Haspelmath) and by the Institute of Linguistics of the University of Leipzig (Balthasar Bickel), in order to promote convergence in glossing systems. Since the list of morphemes is open, and the rules were devised more for readability than for automatic retrieval, one of our tasks was to establish a completed list, with rules adapted to computer retrieval.
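One benefit of adapting the Leipzig Glossing Rules to computer retrieval is that their alignment constraints become mechanically checkable. The sketch below is our own illustration, not project code; it verifies rule 2 of the Leipzig Glossing Rules (the object word and its gloss must contain the same number of hyphen-separated segments), using the Latin example given in the rules themselves:

```python
def aligned(word, gloss):
    """Check Leipzig Glossing Rules, rule 2: the object word and its
    gloss line must have the same number of hyphen-separated segments."""
    return len(word.split("-")) == len(gloss.split("-"))

# Latin example taken from the Leipzig Glossing Rules:
print(aligned("insul-arum", "island-GEN.PL"))   # prints True
print(aligned("insul-arum", "island-GEN-PL"))   # prints False: 2 vs. 3 segments
```

Note that "GEN.PL" counts as one segment: the period stacks category labels on a single morpheme, which is exactly the kind of convention that must be fixed unambiguously for automatic processing.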
The corpus will be put online in the TGE-Adonis repository (Très Grand Équipement chargé des apports numériques à la recherche en sciences humaines et sociales, a large-scale facility providing digital resources for research in the humanities and social sciences).
As the data will be made available online to the community, the ethical aspects of the project were thoroughly considered before data collection. Thus, anonymization procedures, as well as control over sensitive data, have been implemented when needed. At the same time, all the relevant information was listed in order to provide rich metadata on the recordings, in the IMDI format. These metadata follow the requirements of OLAC (Open Language Archives Community) and the TEI (Text Encoding Initiative).
The recordings were all digital, with strict requirements as to the format: uncompressed .wav files, recorded at 44.1 kHz / 16 bits, with high-quality microphones and pre-amplifiers. This high recording quality is necessary, not only because one of our scientific aims is to conduct a prosodic analysis of the data, but also for conservation purposes.
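These format requirements can be verified automatically. Below is a minimal sketch using Python's standard wave module; the helper name is hypothetical and this is not part of the project's toolchain:

```python
import wave

def meets_requirements(path):
    """Check that a recording is a PCM .wav file recorded
    at 44.1 kHz with 16-bit samples."""
    # wave.open raises wave.Error on non-PCM (i.e. compressed) files,
    # so a successful open already implies the file is uncompressed.
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 44100
                and w.getsampwidth() == 2)  # 2 bytes = 16 bits
```

Such a check can be run over a whole directory of sessions before archiving, catching resampled or transcoded files early.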
The software used for the analysis of the data is Praat for segmentation, and ELAN-CorpA, or Toolbox followed by ELAN, for primary annotation. The version of ELAN we have developed is part of the deliverables of the project: it makes it possible to bypass the complex process of data treatment and annotation via Toolbox.
The technical dimension of the project naturally requires the participation of an engineer on a permanent basis (Christian Chanard, of the LLACAN laboratory), as well as a developer specially hired for the project, Coralie Villes. Regular collaboration with Han Sloetjes of the Max Planck Institute in Nijmegen has guaranteed the adaptation of this new software to the general architecture of ELAN.
It is possible to contribute to CorpAfroAs under the following conditions:
- you have a digitized recording (.wav) associated with an ELAN (.eaf) file containing, minimally, a transcription and a translation;
- those files are accompanied, for each recording, by a complete IMDI file thoroughly documenting the session;
- you have obtained the (written or recorded) informed consent of the recorded speakers/signers;
- you provide information about the format of your annotation scheme: number and type of tiers (phonetic transcription, phonological transcription, morphosyntactic glossing, other annotations, translation, etc.); type of segmentation in the text-sound indexation (based on prosodic units, on clauses, or on other types of units);
- you use the abbreviations provided in the CorpAfroAs list of glosses. If your categories are not in our list, please propose an abbreviation conforming to the CorpAfroAs Glossing Rules.
Of course we encourage contributions using the same annotation scheme as the one devised in the CorpAfroAs project. Do not hesitate to contact us for further information.