Corpora

Our linguistic corpora are available both on this website (minlang.iling-ran.ru/corpora) and on an independent platform at corpora.iling-ran.ru. Most of the corpora are based on the Tsakorpus interface. However, some of them are only integrated…

Some of the corpora are currently available only as SIL FieldWorks files or PDF files. We are working on making them available on their own software platforms with a user-friendly search interface. The list of available corpora includes the corpus of Kullui, a minority language spoken in India. Further development of our corpus project aims at wider coverage beyond the languages of Russia, and the Kullui corpus can be regarded as the first step in this direction.

If you have any questions or suggestions for improvement of our corpus platform, or if you would like your corpus to be integrated into the platform, please contact us at minlanglab@iling-ran.ru.

Corpus of North Selkup Written Texts (Legal Texts)

The corpus consists of translations into the Upper Taz dialect of the North Selkup language of a number of legal texts: the Charter (basic law) of the Yamalo-Nenets Autonomous Okrug, as well as federal laws and laws of the Yamalo-Nenets Autonomous Okrug relating to the Indigenous Small-Numbered Peoples of the North. The translations were produced as part of a project by the YNAO government and published in two books:

Charter (Basic Law) of the Yamalo-Nenets Autonomous Okrug of December 28, 1998, No. 56-zao (In the Selkup language). Salekhard, 2008.
Federal Laws and Laws of the Yamalo-Nenets Autonomous Okrug (In the Selkup language). Salekhard, 2008.

Link to the corpus

Evenki corpus

The corpus includes texts in the Northern, Southern, and Eastern dialects of the Evenki language from the multimedia archive of the Laboratory for Parallel Information Technologies, Research Computing Center (RCC) of Moscow State University / Laboratory for Study and Preservation of Minority Languages, Institute of Linguistics, RAS. These materials were recorded between 1998 and 2021 during field expeditions documenting the Evenki language under the supervision of O. Kazakevich. The corpus also contains archival Evenki texts recorded by G. Vasilevich in the 1930s–1950s and by E. Lebedeva in the 1950s–1960s.
The morphological annotation was carried out mainly by E. Klyachko with the participation of N. Mitrofanova.

The corpus was created by E. Klyachko based on the platform developed by Timofey Arkhangelsky (Tsakorpus).

Link to the corpus

Hill Mari corpus

The corpus data have been collected by the participants of the MSU field project on Hill Mari. The project is carried out at the Department of Theoretical and Applied Linguistics (Lomonosov Moscow State University, Faculty of Philology). It has been supported by the RSSF grant №16-04-18 037е and the RFBR grants №17-04-18 036е, 16-06-00 536а and 19-012-00 627.

The corpus structure and annotation are also the result of joint work by the participants of the MSU field project on Hill Mari.

Link to the corpus

Itelmen corpus

The corpus consists of 15 archival Itelmen texts recorded by V. Iokhelson in 1910–1911 and by A. Volodin in 1962–1973. The morphological annotation was carried out by K. Sheifer, S. Ganieva, and M. Plugaryov.

The software component was developed by Maksim Bazhukov.

Link to the corpus

Ket corpus

The corpus includes texts in all three dialects of Ket that were recorded during fieldwork in 2002–2014 under the direction of O. A. Kazakevich and archived in the Laboratory for Computational Lexicography (Scientific Research Computer Center, Moscow State University) / Laboratory for Study and Preservation of Minority Languages (Institute of Linguistics, Russian Academy of Sciences), as well as archival texts recorded by G. M. Korsakov in 1937. Morphological annotation was performed by Yu. E. Galyamina and E. M. Budyanskaya.

Link to the corpus

Corpus of the Bystraya dialect of Even

The corpus consists of spoken and written texts in the Bystraya dialect of Even. These include folk tales collected by K. S. Cherkanov and translated into Even by K. V. Banakanova and L. E. Banakanova (Petropavlovsk-Kamchatsky: Kamchatpress, 2018), articles from the newspaper "Aidit", which was published in the Bystrinsky District from the 1990s to 2005, and oral stories and songs recorded during HSE University student expeditions in 2019–2025 in the villages of Esso and Anavgai.

Link to the corpus

Naukan corpus

The corpus consists of texts collected during expeditions supported by the Russian Science Foundation (RSF) grant №23-18-00204, "Dynamics of the language situation in communities of the Far North: diachronic documentation of the Naukan language".

Link to the corpus

Corpus of North Selkup Oral Texts

The corpus consists of texts in various local varieties of the North Selkup language, recorded between 1996 and 2015 during expeditions documenting North Selkup local dialects, organized by the Laboratory for Computational Lexicography (Research Computing Center, Moscow State University). At present, the corpus represents the Middle Taz, Upper Taz, and Turukhan dialects. Most of the texts are life stories, but there is also one folklore text. The corpus is expected to be expanded soon.

Link to the corpus

Minority languages of the world

Kullui corpus

The corpus of Kullui, one of the Indo-Aryan languages of North India, was created by a team of scholars documenting the language: E. Renkovskaya (Institute of Linguistics, RAS), J. Mazurova (Institute of Linguistics, RAS), and A. Krylova (Institute of Oriental Studies, RAS). The software for the corpus was developed by E. Korovina (Institute of Linguistics, RAS). Currently, the corpus includes spontaneous and elicited texts in the central dialect of Kullui, recorded in 2014–2017 during field trips to the Kullu district (the villages of Naggar, Bashing, Thava, and Suma). The project was supported by the RFBR grant №19-012-00 355 (2019–2021).

Link to the corpus