MareNostrum will generate a Spanish language model from millions of digital documents held by the National Library of Spain

22 June 2020
Generating new language models is vital for bringing together linguistic knowledge and artificial intelligence.

The project was commissioned to the BSC by the Secretary of State for Digital and Artificial Intelligence Advancement, within the framework of the Plan to promote language technologies.

The MareNostrum supercomputer has already started to receive a vast amount of data from the Web archive of the National Library of Spain. This data will be the basis for generating a model of Spanish and of the other languages of Spain. The Spanish web archive is the collection of websites under the .es domain (including blogs, forums, documents, images, videos, etc.), plus sites under other domains that are considered documentary heritage. It is collected to preserve the Spanish documentary heritage on the Internet and to ensure access to it. The Barcelona Supercomputing Center (BSC) will carry out the work, as commissioned by the Secretary of State for Digital and Artificial Intelligence Advancement (SEDIA), within the framework of the Plan to promote language technologies.

The task is twofold: transferring the data to the supercomputer, and processing it to generate a language model. MareNostrum began storing content some weeks ago, after a process was developed to extract textual data from the library's Web archive, which has allowed content to be transferred to the BSC promptly. Transmitting this enormous quantity of data was one of the significant challenges of the initiative. So far, the supercomputer has stored 45 terabytes.

The next step will be to process this data to generate language models using natural language processing technologies. Such a resource is already available for English; the best known is Google's BERT, which has been a milestone in natural language processing. The model the BSC is working on stands out from other Spanish language model initiatives because of the quantity of Spanish linguistic data it contains, which makes it more precise and more practical for use across different applications.

Language models and artificial intelligence

Language models reproduce language use and allow us to determine the real meaning of words, even within whole sentences, because the data is contextualized and therefore carries more information and meaning. This makes it possible to disambiguate word senses (for instance, to distinguish between the meaning of "sick" in "This is sick!" and in "I'm feeling sick"). It also allows us to interpret ideological bias, and it opens the way to handling irony and figurative meaning. It also endows artificial intelligence systems with common sense.
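
As a rough illustration of how context can separate the two senses of "sick" mentioned above, here is a minimal Python sketch. It is a toy and not the model the BSC is building: it simply counts how many context words overlap with a hand-written list of cue words for each sense. The cue lists, the sense labels and the function names are all invented for this example.

```python
import re

# Hypothetical cue words for two senses of "sick" (invented for illustration).
SENSE_CUES = {
    "ill": {"feeling", "feel", "doctor", "fever", "i'm"},
    "impressive": {"this", "that", "trick", "wow"},
}

def tokenize(sentence):
    """Lowercase the sentence and split it into simple word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())

def disambiguate(sentence, target="sick"):
    """Pick the sense whose cue words overlap most with the context words."""
    context = set(tokenize(sentence)) - {target}
    scores = {sense: len(context & cues) for sense, cues in SENSE_CUES.items()}
    # On a tie, max() returns the first sense listed in SENSE_CUES.
    return max(scores, key=scores.get)

print(disambiguate("This is sick!"))
print(disambiguate("I'm feeling sick"))
```

A real language model learns this kind of contextual evidence from very large amounts of text rather than from a fixed word list, which is why the volume of data matters so much.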

Quim Moré, a researcher in the CASE department of the BSC, and David Vicente, team manager of the Operations group, are responsible for the project. According to Quim Moré, "the generation of language models is vital to artificial intelligence. The computational application of a disambiguated language model, with a context grounded in our knowledge of the world, is a great advance towards smarter systems that are closer to people".

The applications of this model are diverse, ranging from machine translation and cybersecurity to a robot describing the content of a 15th-century painting. However, models capable of driving this revolution require computational and data resources that only a few centers and companies, such as Google or Facebook, have.

On this point, Moré highlights that "we are lucky that MareNostrum has the computing capacity we need and that, in addition, we have the huge amount of linguistic data reviewed and provided by the National Library. We have a great opportunity to be on the same level as the great centers of artificial intelligence and also to provide a computational application of linguistic knowledge to culture".

Spanish web archive

The Spanish web archive is the collection of websites under the .es domain and others (including blogs, forums, documents, images, videos, etc.) that are collected in order to preserve the Spanish documentary heritage on the Internet and to ensure access to it. December 2019 marked the 10th anniversary of the launch of the Spanish web archive project. Since that launch, the Spanish National Library has strengthened its infrastructure, policies and processes for preserving online heritage, just as the most important national libraries have been doing for years.

Video of the session for the 10th anniversary of the Spanish web archive:

https://www.youtube.com/watch?v=oySUYJdiDwY&feature=youtu.be