
The WOWA corpus grew out of the project Post-predicate elements in Iranian and neighbouring languages: Inheritance, contact, and information structure. It contains data that were collected and annotated by the researchers involved in that project, as well as data contributed by associated researchers.

The principal aim of WOWA is to provide an accessible and transparent source of data for corpus-based approaches to word order typology, focussing on the languages spoken in the region designated here as Western Asia.

The data sets are being made available successively; 41 were online as of July 2024.

Research Background

The focus on Western Asia is motivated by an overarching research interest in the areal diffusion of word order regularities. Specifically, we investigate the respective impacts of inheritance (the genetic affiliation of the languages concerned, e.g. Turkic, Semitic, etc.) and of neighbouring languages, related or not, in shaping word order in usage. In addition, we address the issue of which aspects of word order are stable within a particular doculect, and which display corpus-internal variability.

More generally, this is connected to the issue of integrating variation into typology. Finally, WOWA is the only cross-linguistic database of its type that includes exclusively spoken language, and thus provides an important corrective to much ongoing work in corpus-based typology, which is still largely based on written language.

Corpus design

Each data set in WOWA is based on a corpus of transcribed spoken language, usually compiled in a fieldwork setting. The sources are extremely varied; some are taken from published dialect surveys such as those undertaken by the Turkish Language Society (Türk Dil Kurumu), or published work by experts on particular language groups (e.g. Khan 2008, on the Neo-Aramaic (Christian) dialect of Barwar, northern Iraq). Others were gathered in the course of PhD projects and other initiatives in language documentation.

All data in the WOWA corpus, including supplementary materials, are published under the Creative Commons Attribution 4.0 International licence (CC BY 4.0). The text of the licence can be found online here.

The texts in WOWA contain at least 500 analysable tokens; the current mean is 650 tokens. They are digitized, if not already in digital form, segmented into syntactic segments of up to three clauses (the size of segmented units varies and is immaterial for the analysis), and imported into a spreadsheet template.

The tokens to be analysed are referential nominal expressions in non-subject positions (i.e. subjects are not included). They are coded for a range of features, including animacy, weight, role, and flagging. The dependent variable is position relative to the governing predicate, for which two values are available: (A) before the governing predicate, or (B) after the governing predicate. The details are outlined in the coding guidelines. Once fully coded, the spreadsheets are exported as TSV files, which can then be imported into R for statistical analysis.
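As a rough illustration of how a coded TSV export might be summarised before any statistical modelling, the following minimal Python sketch computes the share of post-predicate tokens per role. The column names (token_id, role, animacy, position), the value labels (pre, post) and the sample rows are assumptions made up for this example; the actual column layout and labels are defined in the coding guidelines and may well differ.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a coded TSV export. The column names and the
# values below are illustrative assumptions, not the actual WOWA template.
SAMPLE_TSV = (
    "token_id\trole\tanimacy\tposition\n"
    "1\tgoal\tinanimate\tpost\n"
    "2\tdirect_object\tinanimate\tpre\n"
    "3\tgoal\tinanimate\tpost\n"
    "4\tdirect_object\thuman\tpre\n"
    "5\tgoal\thuman\tpre\n"
    "6\tdirect_object\tinanimate\tpost\n"
)


def post_predicate_rates(tsv_file):
    """Return {role: share of tokens placed after the governing predicate}."""
    totals = Counter()  # all coded tokens per role
    post = Counter()    # tokens coded as post-predicate per role
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        totals[row["role"]] += 1
        if row["position"] == "post":
            post[row["role"]] += 1
    return {role: post[role] / totals[role] for role in totals}


# For a real file, pass an open file handle instead of the in-memory sample,
# e.g. open(path, newline="", encoding="utf-8").
rates = post_predicate_rates(io.StringIO(SAMPLE_TSV))
for role, rate in sorted(rates.items()):
    print(f"{role}: {rate:.2f}")
```

The same tabulation can be done in R after importing the TSV file; the sketch only shows the shape of the data (one row per coded token, one column per feature, and a binary position variable).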

For each data set, we minimally make available (i) metadata on the doculect and source texts, (ii) the complete coded data, in XLS and TSV formats, and, where available, (iii) the original sources including sound files.

The doculects

— Please note that a number of data sets are still in the process of being compiled. —

Missing components are marked with "—/—" in the lists below; they will be added in the near future.

Turkic

Iranian

Kartvelian

Semitic

Armenian

Hellenic

Indo-Aryan

Publications

Published papers

(NEW!) Craevschi, Alexandru. 2022. Historical contingency and typological tendencies in languages of Western Asia: A quantitative study of word order of non-subject constituents. Unpublished MA thesis, University of Bamberg.

Haig, Geoffrey & Rasekh-Mahand, Mohammad. 2019. Post-predicate elements in Iranian and neighbouring languages: Inheritance, contact, and information structure. Position paper for the project Post-predicate constituents in Iranian and neighbouring languages.

Conference talks

(NEW!) Leitner, Bettina. 2022. Word order in Khuzestani Arabic (with some notes on Bushehr and Hormozgan Arabic). Paper presented at the workshop on Post-predicate elements across the languages of Western Asia: Theoretical and empirical approaches, Bamberg, Germany, 22–23 September 2022.

(NEW!) Rasekh-Mahand, Mohammad. 2022. Forty years after Frommer: Post-predicate elements in Persian. Paper presented at the workshop on Post-predicate elements across the languages of Western Asia: Theoretical and empirical approaches, Bamberg, Germany, 22–23 September 2022.

(NEW!) Schreiber, Laurentia & Janse, Mark. 2022. Word order & post-predicate elements in Romeyka. Paper presented at the workshop on Post-predicate elements across the languages of Western Asia: Theoretical and empirical approaches, Bamberg, Germany, 22–23 September 2022.

Haig, Geoffrey. 2021. Doing corpus-based syntactic typology with spoken language corpora. Workshop held as part of the LILEC Summer School 2021: Catching Language Data, Bologna, Italy, 23–24 April 2021.

Haig, Geoffrey. 2020. Stability and adaptivity of word order in the Western Asian Transition Zone: Evidence from West Iranian. Paper presented at the Workshop on Tracing Contact in Closely Related Languages, Zürich, Switzerland, 19–20 November 2020.

References

Faghiri, Pegah & Samvelian, Pollet & Hemforth, Barbara. 2018. Is there a canonical order in Persian ditransitive constructions? In Korn, Agnes & Malchukov, Andrey (eds.), Ditransitive constructions in a cross-linguistic perspective, 165–186. Wiesbaden: Reichert.

Frommer, Paul. 1981. Post-verbal phenomena in colloquial Persian syntax. PhD dissertation, University of Southern California.

Khan, Geoffrey. 2008. The Neo-Aramaic dialect of Barwar. Leiden: Brill.

Menz, Astrid. 2013. Gagauz. Tehlikedeki Diller Dergisi [Journal of Endangered Languages] 2(2), 55–69.

Contact

For inquiries, please contact Geoffrey Haig. Please direct questions concerning this website to Nils Schiborr.

The resources presented here as well as this page are hosted on the servers of the computing centre of the University of Bamberg. Relevant legal information can be found here.