ACC Cyfronet AGH, the National Digital Archives (NAC), NASK, and the SpeakLeash project have launched a collaboration that could mark a new chapter in the development of artificial intelligence in Poland. Together, they aim to leverage NAC’s vast archival resources - millions of photographs, maps, and scans - to create modern datasets essential for training advanced AI models.
Two large language models (LLMs) have already been developed in Poland: Bielik.AI, primarily designed for business environments, and PLLuM, intended mainly for public institutions and administration. Both are capable of analyzing and generating text, assisting with document processing, and facilitating information retrieval. However, this is just the beginning. The future belongs to multimodal models - those that can simultaneously understand various types of data, such as text, images, sound, and video. Among them are Vision-Language Models (VLMs), which combine language and visual understanding. These models enable computers not only to read text but also to comprehend what an image depicts, describe it in words, and even answer questions about illustrations or maps.
However, creating such models requires vast, carefully annotated datasets. This is where the National Digital Archives plays a key role: it collects petabytes of digitized resources - photographs, documents, maps, and scans. Thanks to collaboration with Cyfronet, NASK, and SpeakLeash, these archives can be shared and developed to become the foundation of artificial intelligence research in Poland. This will make it possible to build a multimodal data ecosystem, carry out initial research and development projects, and train new language and multimodal models. Over time, tools may emerge that ease citizens' access to cultural and historical resources, making digital archives more useful and accessible than ever before.
Importantly, this initiative goes beyond purely technological aspects. SpeakLeash, in collaboration with Cyfronet, is already running the “Obywatel Bielik” project - the first crowdsourcing initiative in Poland that invites everyone to take part in shaping the future of AI. Citizens contribute by submitting their own photos and helping to annotate them, thereby co-creating the data needed to train multimodal models. These participatory experiences and mechanisms will now be integrated into the consortium’s efforts involving NAC and NASK. This means that the development of Polish artificial intelligence will take place not only in research labs and data centers, but also with the active involvement of citizens.
Such a strong partnership - bringing together the creators of Polish language models, a vast archive of digitized resources, and a unique civic component - is exceptional on a global scale. It aligns well with strategies for AI development, as well as with the concept of AI factories and gigafactories, where model building is carried out in an organized, responsible manner, supported by collaboration across diverse sectors. This step could give Polish AI new momentum and significance on the international stage.