Prosogit


This website presents a corpus of spontaneous speech in German (L1) and Italian (L2). The corpus is available in audio, video and text formats; it comes with orthographic transcription and Part-of-Speech tagging, together with automatic syllabification of the data and stylized intonational patterns. The design and realization of this resource constitute the first phase of a research project in a joint PhD program between the University of Naples «L’Orientale» (Italy) and Bielefeld University (Germany).
The corpus was originally designed for prosodic analyses: specifically, it is intended for investigating prominence patterns and potential transfer phenomena in the second language acquisition process. However, it can also be used for analyses in other linguistic fields. The corpus consists of read and spontaneous speech in German and in Italian as a second language. The audio and video data are enriched with orthographic transcription, PoS tagging, labeling of (extra-)linguistic phenomena and some intonational information (specifically, a stylization of the pitch curve). This material should support analyses in many directions, for example at the prosody-syntax interface or within an NLP framework. Our main aim nevertheless remains the prosodic investigation of German and Italian L2, and the corpus design reflects this aim.
The high quality of the recordings and the considerable amount of data collected make this corpus an important resource for linguistic research. We believe it is a valuable resource for prosodic analysis of German L1 and Italian L2; additionally, it provides both read and spontaneous productions for investigations at other linguistic levels.
Another important characteristic of this corpus is that it is freely available: third parties may use it for research purposes and are encouraged to extend it, both with further annotations and with new recordings. Every part of the corpus can be downloaded from this website.