Adam Mickiewicz University Foundation (Poznan PL)

Persistent IDentifier: hdl:11041/sldr000868
SLDR/ORTOLANG id: http://sldr.org/sldr000868
OAI: oai:sldr.org:sldr000868 (olac - oai_dc - VLO - language-archives)

Type of item: Primary data (corpus)
Identifier: sldr000868 (version 1/1)
Status: source data
Description: The JURISDICT speech database is a large continuous speech database originally designed for dictated speech recognition.
The database includes more than 1500 annotated sessions recorded by speakers from 16 regions of Poland, plus another 500 experimental recordings.
The JURISDICT database is intended to provide material for both training and testing of speech dictation systems for common and legal texts, including isolated word systems, word-spotting systems and vocabulary-independent systems which use either whole-word or sub-word modeling approaches. A typical JURISDICT recording session is a mixture of semi-spontaneous (controlled dictation) and read/dictated speech. The specification is based on general language features and on the peculiarities of Polish at the different linguistic and phonetic levels. The general assumptions for the structure of the database take into account text features (semantic structure, syntactic, grammatical and acoustic-phonetic factors) and speaking style (semi-spontaneous, controlled spontaneous dictation, and elicited dictation, i.e. answering speech).

The development of the database and the first analyses of its contents were made possible by the work of several dozen contributors (researchers, engineers, annotators, and technicians), including the authors of the publications listed below.

The design, structure, and annotation of the resource are described in, for example:

Demenko, G., Grocholewski, S., Klessa, K., Ogórkiewicz, J., Wagner, A., Lange, M., ... & Cylwik, N. (2008, May). JURISDIC: Polish Speech Database for Taking Dictation of Legal Texts. In LREC.

Demenko, G., Grocholewski, S., Klessa, K., Ogórkiewicz, J., Wagner, A., Lange, M., ... & Cylwik, N. (2008, September). LVCSR speech database-JURISDIC. In Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA), 2008 (pp. 67-72). IEEE.

Klessa, K., & Demenko, G. (2009). Structure and Annotation of Polish LVCSR Speech Database. In Tenth Annual Conference of the International Speech Communication Association.

Citation information
EN: JURISDICT. Primary data (corpus). Adam Mickiewicz University Foundation (Poznan PL). Created 2014-01-09. Speech and Language Data Repository (SLDR/ORTOLANG). Identifier hdl:11041/sldr000868
FR: JURISDICT. Données primaires (corpus). Adam Mickiewicz University Foundation (Poznan PL). Création 2014-01-09. Banque de données parole et langage (SLDR/ORTOLANG). Identifiant hdl:11041/sldr000868
ES: JURISDICT. Datos primarios (corpus). Adam Mickiewicz University Foundation (Poznan PL). Creación 2014-01-09. Banco de datos de habla y lenguaje (SLDR/ORTOLANG). Identificador hdl:11041/sldr000868
ZH: JURISDICT. 语音库. Adam Mickiewicz University Foundation (Poznan PL). 创建 2014-01-09. Speech and Language Data Repository (SLDR/ORTOLANG). 标识符 hdl:11041/sldr000868
Access rights
(see documentation)
Main language of the corpus: Polish (język polski) [pol]
OLAC discourse type
OLAC linguistic data type
Languages: Polish (język polski)
Linguistic subject(s): applied_linguistics
  Katarzyna KLESSA
Link to the wiki page: http://sldr.org/wiki/jurisdict
Keywords: speech dictation, LVCSR
Keywords (Polish): mowa dyktowana, duże zasoby akustyczne dla rozpoznawania mowy
Alternate description language: Polish (język polski)
Specific extensions of text files: PLO
Users' community: sldr.org/sldr000868/com
Size: 535 Mb
480164 files
Largest file: 269.45 Mb
Data coverage (temporal): 2007-2009
Data coverage (spatial): PL
(see documentation)
author: Prof. Grazyna Demenko, Adam Mickiewicz University in Poznan
author: Dr Katarzyna Klessa, Adam Mickiewicz University in Poznan
The annotation specification for this corpus took as its starting point the SpeeCon annotation guidelines (deliverable D214), which are based on orthographic, word-level transcription. In the first step, annotators (a team of students of the Faculty of Modern Languages and Literature in Poznań, more than thirty people over the whole annotation period) manually checked the recorded speech against the input orthographic transcription, inserting the necessary adjustments, special event markers, and time boundaries. The first-step annotations were hand-validated (where necessary) by two expert phoneticians and four experienced labellers, whose inter-labeller agreement was monitored, particularly with regard to the number and types of special events, time boundary insertion, and spelling errors. Inter-labeller agreement on time boundaries was high (above 90%). Agreement on special event labels depended on the type of label: it was best for unintelligible speech markers (above 80%) and filled pause labels (approx. 70%). It was lower for speaker noise labels and mispronunciation markers because of greater variation observed for one of the labellers; after excluding that labeller's results, the agreement was up to 70%.
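To illustrate the kind of inter-labeller agreement figures quoted above, here is a minimal Python sketch. This record does not state how agreement was actually computed for JURISDICT, so everything in the sketch is an assumption made for illustration: the 20 ms matching tolerance, the overlap-based formula for label agreement, the function names, and the label names `filled_pause` and `speaker_noise`.

```python
from bisect import bisect_left


def boundary_agreement(a, b, tolerance=0.02):
    """Share of time boundaries (in seconds) that two labellers agree on.

    A boundary from `a` is matched to the nearest still-unmatched boundary
    in `b` among its two immediate neighbours, provided the two lie within
    `tolerance` seconds of each other. The tolerance value is an assumption,
    not a figure taken from the corpus documentation.
    """
    a, b = sorted(a), sorted(b)
    if not a and not b:
        return 1.0
    used = set()
    matched = 0
    for t in a:
        i = bisect_left(b, t)
        best = None
        # Only the neighbours just before and just after t can be nearest.
        for j in (i - 1, i):
            if 0 <= j < len(b) and j not in used and abs(b[j] - t) <= tolerance:
                if best is None or abs(b[j] - t) < abs(b[best] - t):
                    best = j
        if best is not None:
            used.add(best)
            matched += 1
    return matched / max(len(a), len(b))


def label_agreement(a, b, label):
    """Overlap agreement for one special-event label type.

    `a` and `b` map utterance ids to the set of labels each labeller
    assigned. The result is the number of utterances both labellers marked
    with `label`, divided by the number marked by at least one of them.
    """
    in_a = {u for u, labels in a.items() if label in labels}
    in_b = {u for u, labels in b.items() if label in labels}
    either = in_a | in_b
    if not either:
        return 1.0
    return len(in_a & in_b) / len(either)


# Hypothetical data for two labellers (not taken from the corpus).
bounds_1 = [0.50, 1.20, 2.00]
bounds_2 = [0.51, 1.21, 2.40]
labels_1 = {"u1": {"filled_pause"}, "u2": {"filled_pause", "speaker_noise"}, "u3": set()}
labels_2 = {"u1": {"filled_pause"}, "u2": {"speaker_noise"}, "u3": {"filled_pause"}}

print(boundary_agreement(bounds_1, bounds_2))
print(label_agreement(labels_1, labels_2, "filled_pause"))
print(label_agreement(labels_1, labels_2, "speaker_noise"))
```

Computing agreement separately per label type, as `label_agreement` does, mirrors the way the figures above are reported: one value for time boundaries and a separate value for each kind of special event marker.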
Derogation to the principle of open access to public archives (see documentation): AR048 (50 years) - Documents whose disclosure would undermine the protection of privacy, or which contain an appraisal or value judgment about a named or easily identifiable person, or which reveal the behavior of a person under circumstances which might cause him/her prejudice. (Code du Patrimoine, art. L. 213-2, I, 3)
Deadline for next update of access rights
Compliance with current policy on public archives: 100%
Privileged access
(see documentation)
First deposit on 2014-01-09
Last modified on 2014-01-09

Sat, 03 Jan 2015 10:53:47 GMT: Primary data (sound recordings) will be supplied in early 2015.

