BSC presents AI models, protocols and the most complete anonymised corpus of medical records in Spanish

15 March 2023

CARMEN-I, the result of a collaboration between the Barcelona Supercomputing Center and the Hospital Clínic de Barcelona, will be made publicly available to clinicians, AI researchers, academics and industry in Spain and globally.

The latest resources and advances of the Language Technologies Plan applied to the field of health and biomedicine, promoted by the State Secretariat for Digitalisation and Artificial Intelligence (SEDIA), were presented at the Infoday "AI and language technologies applied to clinical data: CARMEN-I resources, systems and applications", organised by the Barcelona Supercomputing Center - Centro Nacional de Supercomputación (BSC-CNS) and the Hospital Clínic de Barcelona.

The event, a success in terms of attendance with 360 participants on-site and remote, was an opportunity for dissemination, training and interaction between the Language Technologies Plan, the Natural Language Processing and Artificial Intelligence (NLP/AI) technology sector, and representatives and experts from public and private healthcare and biomedical research.

Participants included representatives of the Clínic and the BSC as organisers, the State Secretariat for Digitalisation and Artificial Intelligence (SEDIA), the Spanish Agency for Medicines and Health Products (AEMPS) and the Carlos III Institute of Health, together with representatives of hospitals and health systems and numerous companies in the sector.

Interest in the development of artificial intelligence and natural language processing systems applied to the health and biomedical domain is constantly growing, and these technologies have a significant socio-economic impact in terms of efficiency and resource management. The global NLP market in healthcare and life sciences is projected to reach €5,029 million by 2027, with an annual growth rate of 19.4%.

"Taking into account this potential impact of NLP systems for data in Spanish and their application to the healthcare and biomedical research sector in Spain and Latin America, SEDIA's Language Technologies Plan, in which the BSC plays a key role, pays special attention to a sector of vital importance in economic terms and with great benefits for society," says Martin Krallinger, co-leader of the BSC's Text Mining team.

Likewise, new language technologies offer significant potential to improve not only patient safety, quality of life and care, but also patient privacy. Robust technologies for the anonymisation and safeguarding of clinical data would help to avoid cases such as the recent cyber-attack suffered by Hospital Clínic, which affected the centre's information systems and forced the re-scheduling of non-urgent surgeries and appointments.

The anonymised corpus of clinical reports CARMEN-I

The Infoday was an opportunity to present the latest resources and advances of the Language Technologies Plan in the field of health, including the most complete anonymised corpus of real clinical reports in Spanish, known as CARMEN-I (Corpus of Anonymized Records for Medical information Extraction). In addition to technical details, the discussion covered access to the corpus in compliance with data protection regulations and knowledge transfer to other technical and healthcare stakeholders interested in the technological development of AI in healthcare.

CARMEN-I will be made publicly available to clinicians, AI researchers, academics and industry in Spain and globally, subject to specific conditions. The objective is for it to serve as a freely accessible health database that enables the application of AI in health, and as a resource with an appropriate information structure (model of extensions, compliance and versioning) for the creation of documented, evaluated and licensed clinical NLP components. Experts in medical specialties, clinical informatics, clinical documentation, Machine Learning, Artificial Intelligence, linguistics and medical ethics have participated in its development.

As part of the collaboration between BSC and Clínic, the Barcelona hospital has shared reports of patients with Covid-19 admitted to the hospital since the beginning of the pandemic. "The processing of hundreds of records, which in addition to Covid-19-related aspects include all kinds of underlying pathologies, comorbidities and complications of Covid-19, has generated a very rich corpus of mentions of infectious, geriatric, oncological, rheumatological, cardiac, pulmonary, neurological, immunological and other diseases," explains Krallinger, who is in charge of the annotation and standardisation of the texts, as well as the training of Machine Learning and AI systems to facilitate the automation of the annotation and standardisation processes.

Among the main challenges of the initiative is the disparity between clinical cases published in scientific journals and real medical records, which often contain spelling mistakes, irregular formatting, switches between Spanish and Catalan, highly context-dependent abbreviations, and so on. Solving these challenges will help AI language researchers and industry to develop automatic processing methods that exploit data which are currently not standardised and therefore go unused.

High-impact communication and dissemination environment

The event's format (presentations, panel discussions and interventions by representatives of public and private institutions) attracted attendees from a wide variety of sectors, including technology, biomedical research and healthcare. The Infoday provided a high-impact communication and dissemination environment for the latest advances and results of the Language Technologies Plan in the health domain.

"The Infoday served as a mechanism to give visibility and interact with experts to maximise the use, dissemination and application of the NLP components of anonymisation and semantic annotation of clinically relevant information. Results generated for a very diverse set of anonymised clinical reports related to non-covid patients, which represent different stages of the pandemic, as well as a diversity of high-impact report types for the development of clinical NLP systems have been presented," Krallinger adds.

The event included a panel discussion with industry experts and representatives from public and private healthcare, who addressed the ethical and regulatory implications and challenges related to the use of AI and NLP systems in healthcare. The discussion helped attendees to better understand the critical issues that must be considered to ensure the integrity and privacy of patient data and a fair and equitable application of these technologies.

In summary, the Infoday "AI and language technologies applied to clinical data: CARMEN-I resources, systems and applications" was a success in terms of dissemination, training and interaction in the healthcare and biomedical research sector, and reflected the growing interest in, and commitment to, the use of AI and NLP systems to improve healthcare and biomedical research.

The results of this event and the resources presented have also attracted the attention of those responsible for American initiatives such as PhysioNet (MIMIC-IV), who are very interested in the anonymisation mechanism used and in extending clinical NLP resources beyond English. At the national level, interested parties include, for example, the National Epidemiology Centre, which wants to explore the use of AI and advanced NLP systems to process clinical records for applications related to viral infection surveillance.