GL-BLARK

Project details

Status: Completed project.

Duration: Project developed in 2023, within the framework of the SRIA Contribution Projects.

Funding programme / development framework: Participation in the SRIA Contribution Projects, together with eight other selected projects, in the context of European initiatives aimed at the technological equality of languages.

Summary

GL-BLARK was a project aimed at creating an updated BLARK (Basic Language Resource Kit) for minoritised languages in the deep learning era, using Galician as a case study. Its purpose was to define the minimum resources a language needs to be competitive in language technologies and, at the same time, to provide a tool for systematically assessing the actual level of development of each language in this field.

The project started from a clear observation: the classic BLARKs, formulated in the late 1990s and early 2000s, no longer adequately addressed the current context of artificial intelligence. The emergence of neural networks and deep models completely transformed the technical requirements needed to develop useful language technologies, making it necessary to redefine the minimum standards for minoritised languages in this new scenario.

In this context, GL-BLARK made it possible to review, structure and assess the Galician language resource ecosystem from a contemporary perspective, with particular attention to the availability of corpora, NLP tools, lexical resources, language models and specific tasks such as Machine Translation, Speech Recognition, Speech Synthesis and Evaluation Systems.

The challenge

One of the main challenges for minoritised languages in the field of artificial intelligence is that having a few linguistic resources or isolated tools is not enough. In the deep learning era, the real development of a language depends on having adequate corpora, trained models, reusable data, open licences and evaluation systems that make it possible to measure and improve the performance of the various tasks.

The challenge of GL-BLARK was therefore to formulate a useful tool to answer a question of great strategic relevance: what does a minoritised language need today to be truly present in language technologies? It was not just a matter of drawing up an inventory of existing resources, but of establishing comparable, quantifiable minimum standards adapted to the current technological reality.

This challenge was particularly important because many languages have historical resources or tools inherited from previous paradigms, but do not always have the necessary foundation to compete in an ecosystem dominated by neural models, large corpora and advanced AI architectures. Updating the BLARK concept meant, in that sense, offering a useful diagnostic tool for researchers, institutions and linguistic communities alike.

imaxin's contribution

imaxin's contribution to GL-BLARK focused on defining and building a new evaluation framework for minoritised languages in the deep learning era, using Galician as an application example. The work involved reviewing the BLARK concept, adapting it to the new requirements of language technologies and designing a structure that would allow the development status of a language to be assessed quantitatively.

To this end, a comparative analysis was carried out of existing resources in several minoritised languages, such as Galician, Catalan and Basque, as well as in larger languages that, despite having bigger speaker communities, do not always reach a level of development in language technologies equivalent to that of English or Chinese.

Based on this study, imaxin contributed to defining a methodology capable of evaluating both the transversal resources needed for any linguistic task and the specific tasks that require neural models and dedicated data, such as Machine Translation, Text Correction, Speech Recognition, Speech Synthesis, Dialogue Systems and Automatic Summarisation.

What was developed

The outcome of the project was an updated BLARK for minoritised languages, structured in two main blocks. The first groups together the transversal resources, that is, the resources and tools that serve as the foundation for multiple language technology tasks. The second covers the specific tasks, focused on those concrete applications that require trained neural models and adequate corpora for their development.

Within the transversal resources, categories such as corpora, NLP tools, lexical resources and language models were included. In the specific tasks block, areas such as Speech Synthesis, Speech Recognition, Machine Translation, Error Correction, Automatic Summarisation, Sentiment Analysis, Fact Verification, Dialogue Systems and Evaluation Systems were analysed.

In addition to the conceptual structure, a quantitative evaluation system was also developed to measure the degree of coverage of each language based on three criteria applied to each resource: size, quality and licence. This last variable was particularly relevant, as many resources exist but cannot be freely used in applied research, technology transfer or commercial contexts.

Evaluation system

One of the most significant elements of the project was the introduction of a quantitative evaluation system, in contrast to the more descriptive approach that had been followed by previous BLARKs. In this new model, each major block of the BLARK has a specific weight within the overall result: transversal resources account for 40% and the specific language technology tasks for 60%.

In turn, each section is divided into subsections, and each subsection contains specific resources that also have their own weighting. In this way, the BLARK not only makes it possible to know whether or not a language has a given resource, but also to assess to what extent that resource is genuinely useful in the context of current deep learning.

For each resource, criteria of size, quality and licence were defined. This approach made it possible to go beyond the mere existence of corpora, lexicons or models, and to distinguish between small or large resources, of low or high quality, and with closed, research-only or fully open licences. The result is a more precise tool for identifying gaps, strengths and development priorities.
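As an illustration only, the weighting scheme described above can be sketched in a few lines of Python. The 40% / 60% block weights and the three criteria (size, quality, licence) come from the project description; everything else is an assumption made for this example: the 0 to 1 scales, the numeric values assigned to each licence type, the equal weighting of the three criteria, the per-resource weights and the sample values. The subsection level is also flattened for brevity. The project report defines the actual scoring rules.

```python
from dataclasses import dataclass

# Block weights as stated in the project description.
BLOCK_WEIGHTS = {"transversal": 0.40, "tasks": 0.60}

# The three licence categories come from the text; the numeric
# values are an assumption made for this sketch.
LICENCE_SCORE = {"closed": 0.0, "research-only": 0.5, "open": 1.0}


@dataclass
class Resource:
    name: str
    weight: float   # hypothetical weight of the resource within its block
    size: float     # assumed scale: 0.0 (very small) to 1.0 (sufficient)
    quality: float  # assumed scale: 0.0 (low) to 1.0 (high)
    licence: str    # "closed", "research-only" or "open"

    def score(self) -> float:
        # Assumption: size, quality and licence count equally.
        return (self.size + self.quality + LICENCE_SCORE[self.licence]) / 3


def block_score(resources: list[Resource]) -> float:
    """Weighted average of the resource scores inside one block."""
    total_weight = sum(r.weight for r in resources)
    if total_weight == 0:
        return 0.0
    return sum(r.weight * r.score() for r in resources) / total_weight


def blark_score(blocks: dict[str, list[Resource]]) -> float:
    """Overall score: each block's score multiplied by its block weight."""
    return sum(BLOCK_WEIGHTS[name] * block_score(res) for name, res in blocks.items())


# Invented values, for illustration only; not the project's actual data.
example = {
    "transversal": [
        Resource("annotated corpus", weight=2.0, size=0.5, quality=0.5, licence="research-only"),
        Resource("tokeniser", weight=1.0, size=1.0, quality=1.0, licence="open"),
    ],
    "tasks": [
        Resource("machine translation", weight=1.0, size=0.5, quality=0.5, licence="open"),
    ],
}
print(f"Overall coverage: {blark_score(example):.2f}")
```

Keeping the licence as an explicit criterion means that a large, high-quality but closed resource scores lower than an equivalent open one, which reflects the point made above about resources that exist but cannot be freely reused.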

Transversal resources

In the analysis of Galician, GL-BLARK made it possible to observe more clearly the state of the available transversal resources. In the area of corpora, the existence of relevant text collections was confirmed, including annotated, reference and large-scale corpora, but significant limitations were also identified in quality, openness or the size needed to train neural models.

In the case of NLP tools, Galician presented a more favourable situation. Resources such as tokenisers, morphosyntactic taggers, entity recognisers and language identifiers already had a certain degree of maturity, in many cases integrated into broader libraries. This reflected solid prior development in classic linguistic processing tasks.

Regarding lexical resources, the project found notable progress, but also limitations arising from the closed nature of some fundamental resources and the low quality of the more open ones. Finally, in the area of language models, Galician showed an intermediate level of development: it had its own embeddings and several BERT-type models, both monolingual and multilingual, but continued to have a limited presence in autoregressive models and large generative models.

Specific language technology tasks

In the specific tasks block, the project made it possible to analyse to what extent Galician was prepared to develop neural applications in specific areas. In Speech Synthesis and Speech Recognition, the situation was positive thanks to the recent availability of high-quality corpora and models with open licences, which placed Galician in a favourable position within the BLARK for these tasks.

In Machine Translation, the assessment was more nuanced. Although neural models and parallel corpora existed for some language pairs, especially Galician-Spanish and Galician-English, coverage remained limited compared to what might be expected from a fully developed ecosystem. The existence of models for only a few pairs and the uneven quality of the available corpora meant that the overall score for this task was still low.

In other tasks, such as Grammatical Correction, Automatic Summarisation, Sentiment Analysis, Fact Verification and Dialogue Systems, the project identified a very low or practically non-existent level of development. Although not all of these tasks form part of the minimum required for a minoritised language, they do represent necessary lines of work to ensure that speakers can access advanced tools on equal terms in the future.

Evaluation systems

Another aspect analysed by GL-BLARK was the availability of specific resources for evaluating language models. This point is particularly important, as training models is not enough: it is also necessary to be able to measure their performance, compare them with other solutions and objectively identify their limitations and areas for improvement.

In this area, the project found that Galician was in a weak position. The scarcity of evaluation datasets and language-specific metrics limited the ability to rigorously analyse the state of the art in specific tasks. This gap reinforced one of the central conclusions of the project: for a language to advance in language technologies, it needs not only data and models, but also tools to evaluate them systematically.

Results

GL-BLARK made it possible to build an updated tool for assessing the degree of development of a minoritised language in the deep learning era and to apply it to the case of Galician. The overall result placed Galician in an intermediate state within a BLARK designed specifically to measure the minimum standards required of non-dominant languages in the field of language technologies.

The project made it possible to identify more precisely the existing strengths, such as the development of certain NLP tools, the recent improvement in speech tasks and the availability of some language models, but also highlighted significant gaps in corpora, open lexical resources, advanced tasks and evaluation systems.

Beyond the case of Galician, GL-BLARK established a reusable analytical framework for other minoritised languages, providing a useful basis for guiding investment, research and development priorities in language technologies.

Impact

GL-BLARK enabled imaxin to participate in a strategic reflection on the future of minoritised languages in the context of artificial intelligence and deep learning. The project represented a relevant contribution at a time when the technological competitiveness of a language depends increasingly on its ability to be integrated into next-generation models, data and tools.

Its main contribution was to convert a diffuse concern, technological inequality between languages, into a concrete diagnostic and evaluation tool. This makes it possible to move from general statements about the lack of resources to a more structured, comparative perspective that can inform decisions.

Furthermore, the project reinforces imaxin's expertise in areas such as language technologies, linguistic resources, NLP ecosystem evaluation, deep learning applied to minoritised languages and the development of tools aimed at measuring and improving the presence of a language in AI environments.

Funding

Project developed in 2023, within the framework of the SRIA Contribution Projects, as a contribution to European initiatives for linguistic and technological equality. For more information on the context and results of the project, see the report published by European Language Equality:
https://european-language-equality.eu/wp-content/uploads/2023/04/ELE2_Project_Report_BLARK.pdf
