„Bilingual automatic term recognition (DVITAS)“, No. P-MIP-20-282
Project No. P-MIP-20-282
Project title: “Bilingual automatic term recognition”
Project duration: from 2019-05-01 to 2022-06-30
Project coordinator: Vytautas Magnus University (Vytauto Didžiojo universitetas)
Project manager at MRU: Prof. Dr. Sigita Rackevičienė
Summary. The aim of the project was to develop a methodology for the automatic extraction of English and Lithuanian terms from corpora of a selected specialised domain and to create a publicly available bilingual terminology database based on empirical data. The project addressed the scientific problem of automatically collecting terminological data from bilingual parallel and comparable corpora where one of the languages is under-resourced and morphologically rich. It sought to develop an innovative data collection methodology based on deep learning systems – an approach that, to our knowledge, had not yet been applied in Lithuania.
Cybersecurity (CS) was chosen as the specialised domain for this research due to its dynamic nature and particular relevance in today’s information society. New documents in the CS domain constantly emerge, introducing new concepts whose designations are not yet established in Lithuanian. These terms often appear in multiple variants, frequently retaining their original (English) form or occurring as hybrids (combinations of English and Lithuanian lexical units). As a result, a CS terminology database was in high demand among drafters and translators of legal and administrative acts, IT professionals, and the general public.
Achieved results: The project successfully developed and publicly released bilingual (English-Lithuanian) cybersecurity corpora – a parallel corpus and a comparable corpus – now accessible in the CLARIN-LT repository. These corpora reflect the use of cybersecurity terminology across different genres and text types in both national and international settings. Various state-of-the-art machine learning algorithms and neural networks were explored to automate the extraction of terminological data from corpora and to enhance the overall efficiency of the process. The collected data was used to compile the Lithuanian-English Cybersecurity Termbase, which could serve as a model for developing terminology databases in other domains using advanced technologies.
The project was carried out under the “Research Group Projects” activity supported by the Lithuanian Research Council (LRC).