ORTOLANG Deposit and sharing

Speech and Language Data Repository (SLDR/ORTOLANG)

JURISDICT
Adam Mickiewicz University Foundation (Poznan PL)

Persistent IDentifier: hdl:11041/sldr000868
SLDR/ORTOLANG id: http://sldr.org/sldr000868
OAI: oai:sldr.org:sldr000868 (olac - oai_dc - VLO - language-archives)
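
The identifiers above follow two common schemes: a Handle (hdl:) and an OAI identifier. As a minimal sketch, the Python below turns the handle into a resolver URL of the hdl.handle.net form and builds an OAI-PMH GetRecord request for one of the metadata formats listed above (olac, oai_dc). The base URL http://example.org/oai is a placeholder assumption, since this page does not state the repository's OAI-PMH endpoint, and the function names are illustrative only.

```python
from urllib.parse import urlencode

HANDLE = "hdl:11041/sldr000868"
OAI_ID = "oai:sldr.org:sldr000868"

# ASSUMPTION: placeholder endpoint; the real OAI-PMH base URL is not given on this page.
HYPOTHETICAL_BASE_URL = "http://example.org/oai"


def handle_to_url(handle: str) -> str:
    """Turn 'hdl:<prefix>/<suffix>' into a hdl.handle.net resolver URL."""
    if not handle.startswith("hdl:"):
        raise ValueError(f"not a handle: {handle!r}")
    return "http://hdl.handle.net/" + handle[len("hdl:"):]


def oai_get_record_url(base_url: str, identifier: str, prefix: str = "olac") -> str:
    """Build an OAI-PMH GetRecord request URL (string only, no network access)."""
    query = urlencode(
        {"verb": "GetRecord", "identifier": identifier, "metadataPrefix": prefix}
    )
    return f"{base_url}?{query}"


print(handle_to_url(HANDLE))
print(oai_get_record_url(HYPOTHETICAL_BASE_URL, OAI_ID))
```

The sketch only builds strings, so it can be checked without contacting any server.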


Type of item: Primary data (corpus)
Identifier: sldr000868 (version 1/1)
Status: source data
Preview: Not (yet) available from this site.
Description: The JURISDICT speech database is a large continuous speech database originally designed for dictated speech recognition.
The database includes more than 1,500 annotated sessions of speakers from 16 regions of Poland, plus another 500 experimental recordings.
The JURISDICT database is intended to provide material for both training and testing of speech dictation systems for common and legal texts, including isolated-word systems, word-spotting systems, and vocabulary-independent systems that use either whole-word or sub-word modeling approaches. A typical JURISDICT recording session is a mixture of semi-spontaneous (controlled dictation) and read/dictated speech. The specification is based on general language features and also on peculiarities of Polish at different linguistic and phonetic levels. The database structure takes into account text features (semantic structure, syntactic factors, grammatical and acoustic-phonetic factors) and speaking style (semi-spontaneous, controlled spontaneous dictation, and elicited dictation, i.e. answering speech).

The development of the database, as well as the first analyses of its contents, was possible thanks to the work of several dozen contributors (researchers, engineers, annotators, and technicians), including the authors of the publications listed below.

The design, structure, and annotation of the resource are described in, for example:

Demenko, G., Grocholewski, S., Klessa, K., Ogórkiewicz, J., Wagner, A., Lange, M., Sledzinski, D. & Cylwik, N. (2008, May). JURISDIC: Polish Speech Database for Taking Dictation of Legal Texts. In LREC.

Demenko, G., Grocholewski, S., Klessa, K., Ogórkiewicz, J., Wagner, A., Lange, M., Sledzinski, D. & Cylwik, N. (2008, September). LVCSR speech database-JURISDIC. In Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA), 2008 (pp. 67-72). IEEE.

Klessa, K., & Demenko, G. (2009). Structure and Annotation of Polish LVCSR Speech Database. In Tenth Annual Conference of the International Speech Communication Association.

Citation information
EN: JURISDICT. Primary data (corpus). Adam Mickiewicz University Foundation (Poznan PL). Created 2014-01-09. Speech and Language Data Repository (SLDR/ORTOLANG). Identifier hdl:11041/sldr000868
FR: JURISDICT. Données primaires (corpus). Adam Mickiewicz University Foundation (Poznan PL). Création 2014-01-09. Banque de données parole et langage (SLDR/ORTOLANG). Identifiant hdl:11041/sldr000868
ES: JURISDICT. Datos primarios (corpus). Adam Mickiewicz University Foundation (Poznan PL). Creación 2014-01-09. Banco de datos de habla y lenguaje (SLDR/ORTOLANG). Identificador hdl:11041/sldr000868
ZH: JURISDICT. 语音库. Adam Mickiewicz University Foundation (Poznan PL). 创建 2014-01-09. Speech and Language Data Repository (SLDR/ORTOLANG). 标识符 hdl:11041/sldr000868
Access rights: none
Main language of the corpus: Polish (język polski) [pol]
OLAC discourse type
OLAC linguistic data type
A list of languages concerned by this item: Polish (język polski)
Linguistic subject(s): applied_linguistics, phonetics, phonology
SLDR/ORTOLANG contact: Katarzyna KLESSA
Link to the wiki page: http://sldr.org/wiki/jurisdict
Keywords: speech dictation, LVCSR
Keywords (other language): mowa dyktowana, duże zasoby akustyczne dla rozpoznawania mowy
Alternate description language: Polish (język polski)
Specific extensions of text files: PLO
Users' community: sldr.org/sldr000868/com
Size of this item: 535 MB
480164 files
Largest file: 269.45 MB
Data coverage - temporal (period): start=2007; end=2009
Data coverage (spatial) (2-char country code): PL
Roles
author: Ms Natalia Cylwik, Adam Mickiewicz University in Poznan
author: Pr Grazyna Demenko, Adam Mickiewicz University in Poznan
author: Pr Stefan Grocholewski, Poznan University of Technology
author: Dr Katarzyna Klessa, Adam Mickiewicz University in Poznan
author: Mr Marek Lange, Adam Mickiewicz University Foundation
author: Mr Jerzy Ogórkiewicz, Adam Mickiewicz University Foundation
author: Dr Daniel Sledzinski, Adam Mickiewicz University in Poznan
author: Dr Agnieszka Wagner, Adam Mickiewicz University in Poznan
Labelling/tagging
The starting point for the annotation specification applied to this corpus was the SpeeCon annotation guidelines (deliverable D214), which are based on orthographic, word-level transcription. In the first step, annotators (a team of students of the Faculty of Modern Languages and Literature in Poznań, more than thirty people over the whole annotation period) manually checked the recordings against the input orthographic transcription, inserting the necessary corrections, special-event markers, and time boundaries. The first-step annotations were then hand-validated (where necessary) by two expert phoneticians and four experienced labellers, whose inter-labeller agreement was monitored, especially with regard to the number and types of special events, time-boundary insertion, and spelling errors. Inter-labeller agreement on time boundaries was high (above 90%). Agreement on special-event labels depended on the type of label: it was best for unintelligible-speech markers (above 80%) and filled-pause labels (approx. 70%). It was lower for speaker-noise labels and mispronunciation markers because of greater variation observed for one of the labellers; after excluding that labeller's results, agreement was up to 70%.
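
This page does not say how inter-labeller agreement was computed. As one plausible sketch (an assumption, not the project's documented method), agreement can be measured as the share of one labeller's time boundaries that another labeller matched within a tolerance, and as the overlap between the segments that two labellers tagged with a given special-event label. The 20 ms tolerance, the label names, and the toy data below are all hypothetical.

```python
def boundary_agreement(ref, other, tolerance=0.02):
    """Share of boundaries in `ref` (seconds) that `other` matches within `tolerance`."""
    if not ref:
        return 0.0
    matched = sum(any(abs(r - o) <= tolerance for o in other) for r in ref)
    return matched / len(ref)


def label_agreement(labels_a, labels_b, label):
    """Of the segments tagged `label` by either labeller, the share tagged so by both."""
    tagged_a = {i for i, value in enumerate(labels_a) if value == label}
    tagged_b = {i for i, value in enumerate(labels_b) if value == label}
    union = tagged_a | tagged_b
    # If nobody used the label, the two labellers trivially agree.
    return len(tagged_a & tagged_b) / len(union) if union else 1.0


# Hypothetical toy data: boundary times in seconds, one event label per segment.
boundaries_a = [0.50, 1.20, 2.00, 3.10]
boundaries_b = [0.51, 1.26, 2.00, 3.11]
events_a = ["filled_pause", "", "speaker_noise", "filled_pause", "unintelligible"]
events_b = ["filled_pause", "", "", "filled_pause", "unintelligible"]

print(boundary_agreement(boundaries_a, boundaries_b))
for name in ("filled_pause", "speaker_noise", "unintelligible"):
    print(name, label_agreement(events_a, events_b, name))
```

Computing agreement per label type, as here, mirrors the way the figures above are reported separately for each kind of marker.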
Derogation to the principle of open access to public archives: AR048 (50 years) - Documents whose disclosure undermines the protection of privacy, or which contain an appraisal or value judgment about a named or easily identifiable person, or which reveal a person's behavior in circumstances that might cause him/her prejudice. (Code du Patrimoine, art. L. 213-2, I, 3)
Deadline for next update of access rights: 2064-01-09
Compliance with current policy on public archives: 100%
Privileged access: jurisdict
OLAC
<oai:record>
<oai:header>
<oai:identifier>oai:sldr.org:sldr000868</oai:identifier>
<oai:datestamp>2018-05-26</oai:datestamp>
</oai:header>
<oai:metadata><olac:olac>
<dc:title xml:lang="en">JURISDICT</dc:title>
<dcterms:bibliographicCitation xml:lang="en">JURISDICT. Primary data (corpus). Adam Mickiewicz University Foundation (Poznan PL). Created 2014-01-09. Speech and Language Data Repository (SLDR/ORTOLANG). Identifier hdl:11041/sldr000868</dcterms:bibliographicCitation>
<dcterms:bibliographicCitation xml:lang="es">JURISDICT. Datos primarios (corpus). Adam Mickiewicz University Foundation (Poznan PL). Creación 2014-01-09. Banco de datos de habla y lenguaje (SLDR/ORTOLANG). Identificador hdl:11041/sldr000868</dcterms:bibliographicCitation>
<dcterms:bibliographicCitation xml:lang="fr">JURISDICT. Données primaires (corpus). Adam Mickiewicz University Foundation (Poznan PL). Création 2014-01-09. Banque de données parole et langage (SLDR/ORTOLANG). Identifiant hdl:11041/sldr000868</dcterms:bibliographicCitation>
<dcterms:bibliographicCitation xml:lang="zh">JURISDICT. 语音库. Adam Mickiewicz University Foundation (Poznan PL). 创建 2014-01-09. Speech and Language Data Repository (SLDR/ORTOLANG). 标识符 hdl:11041/sldr000868</dcterms:bibliographicCitation>
<dc:publisher>Adam Mickiewicz University Foundation (Poznan PL)</dc:publisher>
<dcterms:provenance>Adam Mickiewicz University Foundation (Poznan PL)</dcterms:provenance>
<dc:publisher xsi:type="dcterms:URI">http://amu.edu.pl/</dc:publisher>
<dcterms:provenance xsi:type="dcterms:URI">http://amu.edu.pl/</dcterms:provenance>
<dc:contributor xsi:type="olac:role" olac:code="author">Cylwik, Natalia Ms, Adam Mickiewicz University in Poznan</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="author">Demenko, Grazyna Pr, Adam Mickiewicz University in Poznan</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="author">Grocholewski, Stefan Pr, Poznan University of Technology</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="author">Klessa, Katarzyna Dr, Adam Mickiewicz University in Poznan</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="author">Lange, Marek Mr, Adam Mickiewicz University Foundation</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="author">Ogórkiewicz, Jerzy Mr, Adam Mickiewicz University Foundation</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="author">Sledzinski, Daniel Dr, Adam Mickiewicz University in Poznan</dc:contributor>
<dc:contributor xsi:type="olac:role" olac:code="author">Wagner, Agnieszka Dr, Adam Mickiewicz University in Poznan</dc:contributor>
<dc:creator>Adam Mickiewicz University Foundation (Poznan PL)</dc:creator>
<dc:contributor xsi:type="olac:role" olac:code="depositor">Adam Mickiewicz University Foundation (Poznan PL)</dc:contributor>
<dc:contributor xsi:type="dcterms:URI">http://amu.edu.pl/</dc:contributor>
<dc:type>info:eu-repo/semantics/dataset</dc:type>
<dc:rights>info:eu-repo/date/submitted/2014-01-09</dc:rights>
<dc:rights>info:eu-repo/semantics/embargoedAccess</dc:rights>
<dc:rights>info:eu-repo/date/embargoEnd/2064-01-09</dc:rights>
<dcterms:accessRights xml:lang="en">SLDR licence; rightsHolder = Adam Mickiewicz University Foundation (Poznan PL)</dcterms:accessRights>
<dcterms:accessRights xml:lang="en">Access granted: none</dcterms:accessRights>
<dcterms:license xsi:type="dcterms:URI">http://sldr.org/licence_v1/en</dcterms:license>
<dcterms:license xsi:type="dcterms:URI">http://sldr.org/licence_v1/es</dcterms:license>
<dcterms:license xsi:type="dcterms:URI">http://sldr.org/licence_v1/fr</dcterms:license>
<dcterms:license xsi:type="dcterms:URI">http://sldr.org/licence_v1/zh</dcterms:license>
<dcterms:provenance xml:lang="en">source data</dcterms:provenance>
<dcterms:provenance xml:lang="es">datos de origen</dcterms:provenance>
<dcterms:provenance xml:lang="fr">données source</dcterms:provenance>
<dcterms:provenance xml:lang="zh">源数据</dcterms:provenance>
<dcterms:accessRights xml:lang="fr">Restriction AR048 (50 ans à partir de 2014-01-09) - Documents dont la communication porte atteinte à la protection de la vie privée ou portant appréciation ou jugement de valeur sur une personne physique nommément désignée, ou facilement identifiable, ou qui font apparaître le comportement d'une personne dans des conditions susceptibles de lui porter préjudice. (Code du Patrimoine, art. L. 213-2, I, 3) </dcterms:accessRights>
<dcterms:accessRights xml:lang="en">Restriction AR048 (50 years from 2014-01-09) - Documents disclosure of which undermines the protection of privacy or for appreciation or value judgments about a person named or easily identifiable, or which reveal the behavior of a person under circumstances which might cause him/her prejudice. (Code du Patrimoine, art. L. 213-2, I, 3) </dcterms:accessRights>
<dcterms:accessRights xml:lang="zh">制约 AR048 (从2014-01-09准入限制组50年) - 提供破坏隐私保护或欣赏或关于容易辨认的人的价值判断的命名或, 或者在情况也许带来他或她的伤害下显露人行为的透露。 (Code du Patrimoine, 艺术。 L. 213-2, I, 3)</dcterms:accessRights>
<dcterms:accessRights xml:lang="es">Restriccion AR048 (50 years from 2014-01-09) - Documentos de divulgación de lo que perjudica la protección de la intimidad o de los juicios de valor acerca de apreciación o una persona con nombre o fácilmente identificables, o que revelan el comportamiento de una persona en circunstancias que podrían llevarle lesión. (Code du Patrimoine, art. L. 213-2, I, 3)</dcterms:accessRights>
<dcterms:extent>535656029</dcterms:extent>
<dcterms:temporal xsi:type="dcterms:Period">start=2007; end=2009</dcterms:temporal>
<dcterms:spatial xsi:type="dcterms:ISO3166">PL</dcterms:spatial>
<dc:subject xsi:type="olac:linguistic-field" olac:code="applied_linguistics"/>
<dc:subject xsi:type="olac:linguistic-field" olac:code="phonetics"/>
<dc:subject xsi:type="olac:linguistic-field" olac:code="phonology"/>
<dc:subject xml:lang="en">speech dictation</dc:subject>
<dc:subject xml:lang="en">LVCSR</dc:subject>
<dc:subject xml:lang="pl">mowa dyktowana</dc:subject>
<dc:subject xml:lang="pl">duże zasoby akustyczne dla rozpoznawania mowy</dc:subject>
<dc:subject xsi:type="olac:language" olac:code="pol"></dc:subject>
<dc:subject xsi:type="olac:language" olac:code="pol" xml:lang="en">Polish</dc:subject>
<dc:subject xsi:type="olac:language" olac:code="pol" xml:lang="es">Polaco</dc:subject>
<dc:subject xsi:type="olac:language" olac:code="pol" xml:lang="fr">polonais</dc:subject>
<dc:subject xsi:type="olac:language" olac:code="pol" xml:lang="zh">波兰语</dc:subject>
<dc:subject xsi:type="olac:language" olac:code="pol" xml:lang="pl">język polski</dc:subject>
<dc:language xsi:type="olac:language" olac:code="pol"></dc:language>
<dc:language xsi:type="olac:language" olac:code="pol" xml:lang="en">Polish</dc:language>
<dc:language xsi:type="olac:language" olac:code="pol" xml:lang="es">Polaco</dc:language>
<dc:language xsi:type="olac:language" olac:code="pol" xml:lang="fr">polonais</dc:language>
<dc:language xsi:type="olac:language" olac:code="pol" xml:lang="zh">波兰语</dc:language>
<dc:language xsi:type="olac:language" olac:code="pol" xml:lang="pl">język polski</dc:language>
<dc:description xml:lang="en">The JURISDICT speech database is a large continuous speech database originally designed for dictated speech recognition.<br />The database includes above 1500 annotated sessions of speakers from 16 regions of Poland, plus another 500 experimental recordings.<br />The JURISDICT database is intended to provide material for both training and testing of speech dictation of common and legal texts, including isolated word systems, word-spotting systems and vocabulary independent systems which use either whole word or sub-word modeling approaches. The typical JURISDICT recording session scenario is a mixture of semi-spontaneous (controlled dictation) and read/dictated speech. The specification is based on the general language features and also on peculiarities of Polish on the different linguistic as well as phonetic levels. The general assumptions for the structure of database take into account text features: semantic structure, syntactic factors, grammatical and acoustic-phonetic factors and speaking style: semi-spontaneous, controlled spontaneous dictation, elicited dictation (answering speech).<br /><br />The development of the database, as well as the first analyses of its contents were possible thanks to the work of several dozens of contributors (researchers, engineers, annotators, and technicians), including the authors of the following publications:<br /><br />The design, structure, and annotation of the resource were described e.g. in:<br /><br />Demenko, G., Grocholewski, S., Klessa, K., Ogórkiewicz, J., Wagner, A., Lange, M., Sledzinski, D. &amp; Cylwik, N. (2008, May). JURISDIC: Polish Speech Database for Taking Dictation of Legal Texts. In LREC.<br /><br />Demenko, G., Grocholewski, S., Klessa, K., Ogórkiewicz, J., Wagner, A., Lange, M., Sledzinski, D. &amp; Cylwik, N. (2008, September). LVCSR speech database-JURISDIC. In Signal Processing Algorithms, Architectures, Arrangements, and Applications (SPA), 2008 (pp. 67-72). IEEE.<br /><br />Klessa, K., &amp; Demenko, G. (2009). Structure and Annotation of Polish LVCSR Speech Database. In Tenth Annual Conference of the International Speech Communication Association.<br /><br /></dc:description>
<dc:identifier xsi:type="dcterms:URI">http://hdl.handle.net/11041/sldr000868</dc:identifier>
<dc:identifier xsi:type="dcterms:URI">http://sldr.org/logo/LogoOrtolang_small.png</dc:identifier>
<dc:identifier xsi:type="dcterms:URI">http://hdl.handle.net/11041/sldr000868?urlappend=/toc</dc:identifier>
<dc:date xsi:type="dcterms:W3CDTF">2014-01-09</dc:date>
<dcterms:created xsi:type="dcterms:W3CDTF">2014-01-09</dcterms:created>
<dcterms:modified xsi:type="dcterms:W3CDTF">2014-01-09</dcterms:modified>
<dc:type xsi:type="dcterms:DCMIType">Sound</dc:type>
</olac:olac>
</oai:metadata>
</oai:record>
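
A record like the one above can be read with a standard XML parser. The sketch below is a minimal, hedged example using Python's xml.etree.ElementTree on a shortened copy of the record. The namespace declarations are an assumption: the record as displayed omits them, so the usual OAI-PMH, OLAC 1.1, and Dublin Core namespace URIs are supplied here. The function name summarize is illustrative only.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: the displayed record shows no xmlns declarations; these are the
# commonly used namespace URIs for OAI-PMH, OLAC 1.1 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "olac": "http://www.language-archives.org/OLAC/1.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Shortened copy of the record, with the assumed declarations added.
SAMPLE = """<oai:record xmlns:oai="http://www.openarchives.org/OAI/2.0/"
 xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:dcterms="http://purl.org/dc/terms/">
<oai:header><oai:identifier>oai:sldr.org:sldr000868</oai:identifier></oai:header>
<oai:metadata><olac:olac>
<dc:title xml:lang="en">JURISDICT</dc:title>
<dc:rights>info:eu-repo/semantics/embargoedAccess</dc:rights>
<dc:rights>info:eu-repo/date/embargoEnd/2064-01-09</dc:rights>
<dcterms:extent>535656029</dcterms:extent>
</olac:olac></oai:metadata>
</oai:record>"""

EMBARGO_PREFIX = "info:eu-repo/date/embargoEnd/"


def summarize(record_xml):
    """Extract identifier, title, embargo end date and extent from one record."""
    root = ET.fromstring(record_xml)
    olac = root.find("oai:metadata/olac:olac", NS)
    rights = [element.text or "" for element in olac.findall("dc:rights", NS)]
    # The embargo end date is encoded as the tail of an info:eu-repo URI.
    embargo_end = next(
        (r[len(EMBARGO_PREFIX):] for r in rights if r.startswith(EMBARGO_PREFIX)),
        None,
    )
    return {
        "identifier": root.findtext("oai:header/oai:identifier", namespaces=NS),
        "title": olac.findtext("dc:title", namespaces=NS),
        "embargo_end": embargo_end,
        "extent": int(olac.findtext("dcterms:extent", namespaces=NS)),
    }


print(summarize(SAMPLE))
```

Passing the namespace map explicitly keeps the lookups readable and avoids hard-coding the expanded URIs in every path.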

First deposit on: 2014-01-09
This item was last modified on: 2014-01-09

Discussion
Sat, 03 Jan 2015 10:53:47 GMT: Primary data (sound recordings) will be supplied in early 2015.


This site was declared to the Commission Nationale de l’Informatique et des Libertés (CNIL) under agreement No. 1222972 on 26 March 2008. Under French law, any person cited by name has the right to access, modify, correct, and delete data relating to him/her (art. 34 of the « Informatique et Libertés » act of 6 January 1978). To exercise this right, send a message to webmaster(at)sldr.org.
