The AINA project seeks millions of voices so that technology understands and speaks Catalan

15 February 2022

AINA is a project based on data technologies and Artificial Intelligence promoted by the Vice Presidency of the Government and BSC to make it possible for machines to understand and speak Catalan.

The 'Our language is your voice' campaign invites speakers of all dialectal varieties of Catalan to share their speech by reading texts aloud.

The Vice Presidency will allocate €3M this year to the AINA project to, among other objectives, create the first voice corpus of Catalan and generate the second, enriched version of the text corpus

AINA is building the necessary digital resources for Catalan so that any company or entity can use them to develop solutions or services such as translators, personal assistants or conversational agents in Catalan.

Under the slogan 'Our language is your voice', the Government of Catalonia is launching on 17 February a voice-collection campaign to generate the first Catalan voice corpus, or "dictionary". The campaign is part of the AINA project, promoted by the Department of the Vice Presidency and of Digital Policies and Territory in collaboration with the Barcelona Supercomputing Center (BSC) so that technology can understand and speak Catalan.

The AINA project is building corpora (massive data sets) and models of the Catalan language so that any company or organization can use them to develop its own solutions or services (translators, personal assistants, voice synthesizers, text classifiers, etc.) that let people interact with machines in Catalan.

In short, the aim is to teach Catalan to machines so that citizens can interact with them and participate in the digital world in Catalan at the same level as speakers of a global language such as English, and thus to avoid the digital extinction of the Catalan language.

Citizens will take part in the 'Our language is your voice' voice-collection campaign through Mozilla's Common Voice initiative for Catalan, a platform where anyone can read and record as many phrases as they like (presented in batches of five, with no overall limit) to help machines learn how people speak.

Although contributions can be made completely anonymously and without any prior registration, knowing the gender, age and dialect variant of the voice "donor" makes it much easier to classify the voice data obtained and, at the same time, shows whether the full linguistic diversity of Catalan is being covered. The campaign therefore encourages citizens to register and create a profile on the platform so that the AINA project can reach its objectives more quickly.

Teaching Catalan to machines, quite a challenge

"Teaching" a language to machines so that they can not only understand us when we speak to them but also respond coherently to what we have asked or requested remains a challenge today.

If we want computers, voice assistants and other computer systems to speak and understand Catalan, it is necessary to obtain massive data on the language (in text and voice format). This data is passed to a deep neural network that learns how words are combined until it produces a model of the language capable, for example, of distinguishing the different meanings of the word "bank" from the different contexts in which it is used.

To build the language corpora (data sets) that a machine needs, it is necessary to have millions of texts and millions of hours of audio and video in that language. This data must also represent all the richness of the language, including, for example, voice recordings of people of different genders, age groups, dialect variants and registers.

Obtaining data of this volume and specificity is especially difficult for languages that are minority languages on a global scale, such as Catalan. Majority languages such as English, by contrast, have all this information readily available: you only need to go to the Internet to find millions and millions of texts, audio recordings and videos in English.

For this reason, the 'Our language is your voice' campaign invites Catalan-speaking citizens of all ages, genders, conditions and origins to "give" their voice, with the aim of obtaining voice content that captures all the richness of oral Catalan, with all its registers and dialectal varieties. Currently, the majority voice profile on Mozilla's Common Voice platform is that of men between the ages of 30 and 50 who speak Central Catalan.
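As a rough illustration of how the balance of voice profiles could be checked, the sketch below computes the share of recordings per dialect from donor metadata. The field names (`gender`, `age`, `dialect`), the helper `profile_shares` and the sample records are hypothetical and invented for this example; they are not the Common Voice data format or real campaign figures.

```python
from collections import Counter


def profile_shares(records, field):
    """Return the share of recordings per value of one metadata field."""
    # Anonymous contributions have no metadata, so group them as "unknown".
    counts = Counter(r.get(field) or "unknown" for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Made-up records for illustration only; not real campaign data.
records = [
    {"gender": "male", "age": "30-39", "dialect": "central"},
    {"gender": "male", "age": "40-49", "dialect": "central"},
    {"gender": "female", "age": "20-29", "dialect": "valencian"},
    {"gender": None, "age": None, "dialect": None},  # anonymous donor
]

dialect_shares = profile_shares(records, "dialect")
for dialect, share in sorted(dialect_shares.items()):
    print(f"{dialect}: {share:.0%}")
```

A skewed result of this kind is what would signal that more recordings are needed from under-represented groups, and the "unknown" share shows why registered profiles make the check easier.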

Creating the first voice corpus in Catalan, an AINA milestone for 2022

The creation of the first version of the Catalan voice corpus is one of the main milestones of the AINA project for 2022. This corpus will be fed by the content obtained through Mozilla's Common Voice platform, but also by contributions from the documentary repositories of the Catalan Audiovisual Media Corporation (CCMA) or the Audiovisual Council of Catalonia (CAC), among others.

In parallel, the project has also set itself the objective this year of creating the second version of the Catalan text corpus. To date, the project has a first text corpus of 1,770 million words in 95 million sentences, obtained by downloading texts from different digital sources in Catalan (web pages, files, etc.), cleaning them and removing duplicates. Work will now continue on this text corpus to generate a second, improved and enriched version that includes all the nuances of the written language, whether dialect variants or linguistic registers such as colloquial, literary or administrative.

Other objectives highlighted in the roadmap of the AINA project for 2022 are:
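The cleaning and de-duplication step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used by the AINA project: the function names `clean` and `deduplicate` are hypothetical, and exact-match de-duplication after normalization is just one simple strategy.

```python
import hashlib
import re
import unicodedata


def clean(text: str) -> str:
    """Normalize Unicode, drop zero-width spaces and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200b", "")
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(sentences):
    """Yield each cleaned sentence once, comparing case-insensitively."""
    seen = set()
    for raw in sentences:
        sentence = clean(raw)
        if not sentence:
            continue  # skip lines that are empty after cleaning
        # Hashing keeps memory use bounded per sentence on large corpora.
        key = hashlib.sha1(sentence.casefold().encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield sentence


sample = ["Bon dia a tothom.", "  Bon   dia a tothom. ", "Bona nit.", ""]
unique = list(deduplicate(sample))
print(unique)
```

At the scale of tens of millions of sentences, a real pipeline would likely also need near-duplicate detection and language identification, which this sketch leaves out.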

  • Create three basic linguistic services (anonymization, document classification and identification of entities and key concepts) necessary to build future applications and solutions for the end user.
  • Create specialized language models (courses) in a specific field (for example, health or legal) or in a specific task (for example, translation of texts), to help machines better understand and analyze the nuances and context of words in a text or conversation.
  • Create a Catalan-Spanish translation engine to improve the quality of currently available engines.
  • Implement a high-impact use case in the Catalan Public Administration to demonstrate the potential of the different components developed by AINA and their integration into real applications.

€3M budget for 2022 for a strategic project

To make this roadmap possible, the Department of the Vice Presidency and of Digital Policies and Territory will allocate €3 million of its budget to the AINA project this year through a direct grant to BSC, which will be in charge of executing it. With this contribution, 12 times the budget allocated by the Generalitat in 2021, the Government reinforces its firm commitment to this strategic project, whose ultimate objective is to guarantee that citizens can speak and interact in Catalan in the digital world at the same level as speakers of other languages such as English or Spanish. For now, these languages have their digital survival guaranteed because they have had States behind them that have invested in providing them with sufficient resources in terms of learning techniques and neural networks in Artificial Intelligence.

The AINA project, presented in December 2020, is part of the Government's digital strategy, through two initiatives led by the Department of the Vice Presidency: the Artificial Intelligence Strategy of Catalonia (Catalonia.AI), approved in February 2020, and the Interdepartmental Board of Directors for the promotion of Catalan on the Internet and in advanced digital technologies, approved in December 2018.