CecilIA: A Cuban Artificial Intelligence Model

Artificial intelligence is no longer the future; it is the present. And in Cuba, that present is named CecilIA. It is a language model developed with national talent, capable of processing information, learning, interacting in natural language, and responding to the needs of institutions and citizens. Today, En Tiempo Real examines one of the most innovative ventures in the Cuban digital ecosystem. We will speak with the people behind its design, development, and application to understand how CecilIA is becoming both a technological and a cultural tool.
AI in CecilIA, a Cuban Language Model
At the second public presentation of the Cuban language model for AI named CecilIA, its creators reaffirmed the project's relevance to the digital transformation process and to safeguarding the identity of the Cuban people.
In 2025, CecilIA stands as a Cuban language model for artificial intelligence (AI) and, at the same time, as an example of technological sovereignty in a highly topical field. Its initial training concluded in the final days of May, during the Saber UH Convention in Havana. The news was announced there, along with the details of a project that is critical to fulfilling the constitutional commitment to developing an information and knowledge society. The project also draws on the Agenda for Digital Transformation and the Artificial Intelligence Strategy approved in the country a year ago.
Among its creators are Dr. C. Yudivián Almeida Cruz, director of the Artificial Intelligence and Data Science Group at the Faculty of Mathematics and Computing (Matcom) of the University of Havana (UH); Dr. C. Suilán Estévez Velarde, dean of Matcom; professors such as Dr. C. Alejandro Piad Morffis; and a group of students deeply committed to advancing the model. Although it still requires further training and adjustment before it is optimized and fully developed, it already has merits that its counterparts elsewhere in the world lack.
CecilIA is as Cuban as the character of the same name popularized by Cirilo Villaverde's novel, and it aims to be quintessentially Cuban. The model was trained on approximately 400 significant Cuban literary works, the country's press coverage from the last 10 years, encyclopedias, various speeches, and the issues of the Official Gazette that are available in digital form. Together these amounted to 2.7 GB of text and required three days of intensive training.
A current objective is to expand this corpus with other sources, such as scripts for audiovisual productions, including dialogues from quintessentially Creole works like Las Aventuras de Elpidio Valdés. All this enables the model to interpret and generate text in Spanish, but with words and phrases closer to the identity of Cubans.
The success of this aspiration will also depend on how far society and its institutions support the digitization of documents and collaborate to make relevant information public and accessible. To this end, the Cuban Society of Law and Informatics convened members of its Havana Chapter and professionals from various fields on July 4th. The meeting at the headquarters of the National Union of Jurists of Cuba (UNJC) brought together a hundred interested people. More than two hours of exchange led to reflection and to a commitment from participants to contribute from their different areas of work.
Base Technology
Among the first lessons shared by Professor Yudivián Almeida Cruz was a reflection on the small language models (SLMs) used as a base for CecilIA. SLMs are an option within reach of developing countries: they require less hardware, less electrical power, and less training time, and their training data can be tailored more closely to the environment in which they will be used. He also noted that large models and their training have not been designed to incorporate the cultural nuances of individual communities.
His presentation covered the steps followed to create the Cuban language model: building a Cuban textual corpus, taking an SLM as a base, performing continual pre-training of the base model on the Cuban corpus, quantizing the result to different sizes, fine-tuning it on instructions, designing a benchmark to measure the model's Cubanía (Cubanness), and validating the model.
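To give a sense of what the quantization step in that list involves, here is a minimal sketch in Python. It assumes a simple symmetric 8-bit scheme chosen purely for illustration; the presentation did not say which quantization method or bit widths the CecilIA team uses, and the function names below are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale.

    Illustrative symmetric scheme only; not the CecilIA team's method.
    """
    max_abs = max(abs(w) for w in weights)
    # Avoid dividing by zero when every weight is 0.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


# Toy weights standing in for one row of a model's weight matrix.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding to the nearest integer loses at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as a small integer instead of a 16- or 32-bit float, which is why a quantized model takes up less memory, at the cost of the small rounding error measured in the last lines.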
The Salamandra model, pre-trained for the Spanish language, was used as the base. Several experiments were run for validation, and CecilIA achieved performance similar to that of the Salamandra 2b model. However, once it had been specialized with new knowledge, its performance on some specific tasks fell below that of Salamandra 2b. This result was expected in light of the No-Free-Lunch theorem, which is often read as saying that gains on one class of tasks come at the cost of losses on others.
Unlike the model's first presentation at the Saber UH 2025 academic event, this time it was possible to show some advances, as work has been intense since then. Current efforts are focused on improving the training corpus, making fine adjustments, and perfecting the instruction corpus with a greater number of personalized elements.
According to Dr. C. Yudivián Almeida, the goal is to create a Cuban instruction corpus of about 10,000 instructions. To achieve this, participation is open: anyone who wishes can propose instructions, written in JSON format, for the next round of training. Future training is expected to scale up to 7B-parameter models and continue from there.
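For readers wondering what an instruction in JSON format might look like, the following Python sketch builds and checks one record. The field names ("instruction", "input", "output") follow a layout commonly used in open instruction datasets and are an assumption; the article does not give the schema the CecilIA team actually expects, so contributors should follow the project's own guidelines.

```python
import json

# Hypothetical schema: these field names are an assumption, not the
# CecilIA project's published format.
REQUIRED_FIELDS = ("instruction", "output")


def validate_instruction(record):
    """Return a list of problems found in one record (empty if it is valid)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    # "input" is optional context for the instruction; if present, it must be text.
    if "input" in record and not isinstance(record["input"], str):
        problems.append("field 'input' must be a string")
    return problems


example = {
    "instruction": "Resume el siguiente texto en una oración.",
    "input": "CecilIA es un modelo de lenguaje cubano desarrollado en la Universidad de La Habana.",
    "output": "CecilIA es un modelo de lenguaje cubano creado en la Universidad de La Habana.",
}

# ensure_ascii=False keeps Spanish accents readable in the JSON text.
serialized = json.dumps(example, ensure_ascii=False, indent=2)
print(serialized)
print(validate_instruction(example))
```

A check like this, run before submitting, would catch records with an empty prompt or a missing answer.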
"Today, with the existence of the first version of the Cuban language model (#CecilIA), work is underway on developing the first ecosystem to give it greater use value," the specialist said.
Committed to Development
The model benefits from the contributions of the doctoral theses of Dr. Suilán Estévez and Dr. Alejandro Piad. Other Cuban doctoral students, co-supervised by these professors, are currently carrying out their research and betting on the development of both Salamandra and CecilIA.
As happened at the Saber UH Convention, a session for exchange was opened at the UNJC headquarters. One especially important question from the audience concerned Cuban Sign Language and the prospects for developing models that take it into account.
The responses addressed intellectual property in literature and the importance of expanding the community interested in developing the Cuban model. They also stressed the need for a strategy for sharing documents from the digital heritage repository, as well as loose phrases and song lyrics, with the help of all organizations involved in managing information.
The discussion of model training prompted a reflection on the influence of language models in transforming societies, ways of speaking, and cultures. For example, it was said that models like ChatGPT interpret EcuRed and return translated or manipulated responses that reflect their own ideology and stray far from the original source.
It was also learned that the CecilIA development team pays attention to balancing the training data, preventing bias, and ensuring explainability, adhering to the protocols, best practices, and standards promoted by UNESCO following the adoption of the Recommendation on the Ethics of AI.
The presenters shared a concern for ethics, which is a task for everyone and for all areas of knowledge, including philosophers, linguists, and sociologists. The social sciences must participate actively, both in building the models and in putting them to use.
The audience also asked about ambiguity and about the importance of handling uncertainty in information, or certainty in responses, and the presenters addressed both. On hallucination, another question from the public, Dr. C. Yudivián Almeida said, "language models are machines for hallucinating." In weighing the virtues and risks of hallucination, the team is betting on a balance in line with the state of the art.
There was a consensus in recognizing that CecilIA will contribute to preserving Cuban culture if we can provide it with daily news, legal texts, film scripts, images and sounds, data that helps it "speak Cuban."
In terms of sovereignty and identity, CecilIA defends the idea that today it is not enough to have information online, because the population increasingly interacts with AI through language models (using applications like ChatGPT, among others). Therefore, having a Cuban language model would ensure the ability to subsequently build generative AI applications and preserve our culture and ideology.
As on other occasions, the need for data in digital format came to light. Entire libraries of great value still exist only on paper. Hence the emphasis on data, on its normalization and use, and on a revolutionary digital transformation policy.
In this sense, the presenters insisted on the importance of having data and information in plain text. In the book editing process, it is essential to preserve the original content in digital form, as plain text. From now on, an editable format must be an output of any content generation process and an input for CecilIA.
Finally, while massive and orderly digitization of the digital heritage is still pending, CecilIA MLS calls on everyone who wishes to collaborate to help achieve endogenous developments in specific domains such as law, health, the Cuban language itself, the arts, history, and everything that Cuban creativity can generate.
Here the legal sector is contributing in a coherent, coordinated, and organized way, supplying a significant part of the language corpus that CecilIA needs and that our population requires in order to speak in good Cuban about any legal topic.
The exchange with the public was dominated by congratulations to the development team and by discussion of the importance of a public policy guaranteeing access to data to feed the model, the ethics and explainability of AI, and the training of professionals in other disciplines so they can contribute to the language model's development from their fields of knowledge.
At the conclusion of the presentation, the room was filled with pride, interest, gratitude, and curiosity. Very importantly, the development team itself acknowledged the results of other teams across the country in building AI tools and applications. "We grew in everything. Excellent presentation. A good debate." The work on the CecilIA model continues.