GUA-SPA - Guarani-Spanish Code Switching Analysis

Welcome to the shared task GUA-SPA - Guarani-Spanish Code Switching Analysis, a task to automatically detect and analyze instances of code-switching between Guarani and Spanish in news and social media. This task is part of IberLEF 2023.

News

June 7, 2023. We have published the official results.
May 24, 2023. You can download the test data from the Codalab page. Evaluation phase is open for submissions.
March 22, 2023. You can download the training data from the Codalab page. Development phase is open for submissions.
February 22, 2023. The Codalab page for the competition is available: https://codalab.lisn.upsaclay.fr/competitions/11030. Registration is open!

Introduction

Guarani is a South American indigenous language that belongs to the Tupi-Guarani family, while Spanish is a Romance language that belongs to the Indo-European family. The two languages have been in contact in the South American region for about 500 years [Rodríguez, 2018], resulting in many interesting varieties with different levels of mixture. Paraguay is a South American country where Guarani and Spanish are the two official languages (Ley de Lenguas, [Ley 4251]). According to the most recent census in Paraguay, most of the country's population speaks at least some Guarani, and there is a high prevalence of Guarani-Spanish bilingualism in urban areas, while Guarani monolingualism is mostly confined to rural areas.

Bilingual speakers often make use of the two languages at the same time, mixing them in different ways, in a phenomenon called code-switching [Joshi, 1982]. This phenomenon is very frequent in situations where two or more languages come into contact. In Paraguay, this has resulted in several identified language varieties that combine Guarani and Spanish [Kallfell, 2016].

There have been a number of competitions focusing on detection and analysis of code-switching, starting with [Solorio et al., 2014], which covered language identification in code-switched data for several language pairs, including Spanish-English. Later, these competitions started to include more complex tasks in code-switched contexts, such as named entity recognition (NER) [Aguilar et al., 2020] and machine translation (MT) into the code-switched languages [Chen et al., 2021]. In our case, we are proposing language identification (Task 1), NER (Task 2), and a novel classification task for Spanish spans in a code-switched Guarani-Spanish context (Task 3).

Guarani is considered a low-resource language [Joshi et al., 2020] because, despite having millions of speakers, it has few digital resources to work with, its written use online is scarce, and it has been mostly under-researched from the NLP perspective. The situation of this and other American indigenous languages could change in the future, as there are now some initiatives to build resources for these languages [Mager et al., 2021], but there is still a long way to go. Spanish, on the other hand, belongs to the set of very resource-rich languages [Joshi et al., 2020]. This is an advantage for this competition, since many existing tools for Spanish could be leveraged to see how they perform in this context.

The expected target audience is NLP researchers interested in working with low-resource languages and code-switched data, as well as researchers interested in NER and MT in general.

Task description

We propose a challenge for analyzing code-switched texts in Guarani and Spanish, trying to identify the language used in each span of text, the named entities mentioned in the text, and the way Spanish is used. The challenge will be structured as three tasks:

Task 1: Language identification in code-switched data

Given a text (sequence of tokens), label each token of the sequence with one of the following categories:

gn: It is a Guarani token.
es: It is a Spanish token.
ne: It is part of a named entity (either in Guarani or in Spanish).
mix: The token is a mixture of Guarani and Spanish. For example, a verb with a Spanish root that has been adapted to Guarani morphology, like ‘osuspendeta’ (he/she will suspend).
foreign: Used for tokens that are in languages other than Guarani or Spanish.
other: Used for other types of tokens that are invariant to language, like punctuation, emojis and URLs.

Examples:

→ che kuerai de pagar 6000 gs. por una llamada de 40 segundos . son aliados del gobierno parece ustedes
Could be tagged as:
che/gn kuerai/gn de/es pagar/es 6000/other gs./other por/es una/es llamada/es de/es 40/other segundos/es ./other son/es aliados/es del/es gobierno/es parece/es ustedes/es

→ Ministerio de Salud omombe'u ko'ã káso malaria ojuhúva importado Guinea Ecuatorial guive ha oîma jesareko ohapejokóvo jeipyso .
Could be tagged as:
Ministerio/ne de/ne Salud/ne omombe'u/gn ko'ã/gn káso/mix malaria/es ojuhúva/gn importado/es Guinea/ne Ecuatorial/ne guive/gn ha/gn oîma/gn jesareko/gn ohapejokóvo/gn jeipyso/gn ./other
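
The tagged examples above use an inline token/label notation. As a minimal sketch, assuming that notation (the files distributed through Codalab may use a different layout), a line can be split into (token, label) pairs as follows. The names TASK1_LABELS and parse_tagged are hypothetical helpers, not part of any official tooling.

```python
# Label set for Task 1, as listed in the task description.
TASK1_LABELS = {"gn", "es", "ne", "mix", "foreign", "other"}


def parse_tagged(line):
    """Split a 'token/label token/label ...' line into (token, label) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST slash so that tokens containing '/' stay intact.
        token, sep, label = item.rpartition("/")
        if not sep or not token or label not in TASK1_LABELS:
            raise ValueError(f"malformed item: {item!r}")
        pairs.append((token, label))
    return pairs


example = "che/gn kuerai/gn de/es pagar/es 6000/other gs./other"
pairs = parse_tagged(example)
tokens = [t for t, _ in pairs]
labels = [l for _, l in pairs]
```

Splitting on the last slash keeps tokens such as URLs in one piece, at the cost of assuming that labels never contain a slash themselves.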

The metrics for task 1 are accuracy, weighted precision, weighted recall and weighted F1. The main metric is weighted F1.
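
The sketch below illustrates how these token-level metrics could be computed. It assumes the common convention in which each class's precision, recall and F1 are averaged with weights proportional to the number of gold tokens of that class; the official scorer may differ in details, and weighted_scores is a hypothetical helper name.

```python
from collections import Counter


def weighted_scores(gold, pred):
    """Accuracy plus support-weighted precision, recall and F1 over token labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    support = Counter(gold)      # gold tokens per class
    predicted = Counter(pred)    # predicted tokens per class
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    total = len(gold)
    scores = {"accuracy": sum(correct.values()) / total,
              "precision": 0.0, "recall": 0.0, "f1": 0.0}
    for label, n in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / n
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total       # assumption: weight by gold support
        scores["precision"] += weight * p
        scores["recall"] += weight * r
        scores["f1"] += weight * f1
    return scores


gold = ["gn", "gn", "es", "es", "other"]
pred = ["gn", "es", "es", "es", "other"]
scores = weighted_scores(gold, pred)
```

Under this weighting, weighted recall always equals accuracy, so the two numbers only differ if the scorer uses another averaging scheme.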

Task 2: Named entity classification

Given a text (sequence of tokens), identify the named entities as spans in the text, and classify each one with a category: person, location or organization. These must be marked in the tokens using BIO labels: B-per, B-loc, B-org, I-per, I-loc, I-org, O.

Examples:

→ [ORG Ministerio de Salud] omombe'u ko'ã káso malaria ojuhúva importado [LOC Guinea Ecuatorial] guive ha oîma jesareko ohapejokóvo jeipyso .

→ [PER Ministra de Hacienda Lea Giménez] he'i oñepromulga léi capitalidad ary 2014

The metrics for task 2 are precision, recall and F1, each computed in a labeled and an unlabeled variant. The criterion for finding a named entity is exact match. The main metric is labeled F1.
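
To make the exact-match criterion concrete, the following sketch extracts spans from BIO labels and compares gold and predicted spans for a single sentence. It is an illustration under assumptions, not the official scorer: bio_to_spans and span_f1 are hypothetical names, a stray I- tag is leniently treated as opening a new span, and a corpus-level score would accumulate the counts over all sentences before computing the ratios.

```python
def bio_to_spans(labels):
    """Return the set of (start, end, type) spans in a BIO sequence; end is exclusive."""
    spans = set()
    start = kind = None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a final span
        prefix, _, suffix = label.partition("-")
        if prefix == "I" and kind is not None and suffix == kind:
            continue  # the current span keeps going
        if kind is not None:
            spans.add((start, i, kind))
            start = kind = None
        if prefix in ("B", "I"):  # lenient: a stray I- tag opens a new span
            start, kind = i, suffix
    return spans


def span_f1(gold_labels, pred_labels, labeled=True):
    """Exact-match precision, recall and F1 over spans of one sentence."""
    gold = bio_to_spans(gold_labels)
    pred = bio_to_spans(pred_labels)
    if not labeled:  # unlabeled variant: compare boundaries only
        gold = {(s, e) for s, e, _ in gold}
        pred = {(s, e) for s, e, _ in pred}
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# First 11 tokens of the first example: "Ministerio de Salud ... Guinea Ecuatorial"
gold = ["B-org", "I-org", "I-org"] + ["O"] * 6 + ["B-loc", "I-loc"]
pred = ["B-org", "I-org", "I-org"] + ["O"] * 6 + ["B-org", "I-org"]
labeled = span_f1(gold, pred, labeled=True)
unlabeled = span_f1(gold, pred, labeled=False)
```

In this example the second entity has the right boundaries but the wrong category, so it counts as a match only in the unlabeled variant.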

Task 3: Spanish code classification

Given a text (sequence of tokens), identify spans of text in Spanish and label each span with one of these categories:

change in code (CC): the text keeps all the characteristics of Spanish.
unadapted loan (UL): the Spanish text could be partially adapted in some ways to Guarani syntax, but it is not fully merged into Guarani, in particular it does not present orthographic transformations.

These must be marked in the tokens using BIO labels: B-cc, B-ul, I-cc, I-ul, O.

Examples:

→ che kuerai [CC de pagar 6000 gs. por una llamada de 40 segundos . son aliados del gobierno parece ustedes]

→ Okañývo pe Policía Nacional ha orekóva caso omomarandúvo Fiscalía , peteî [UL investigación ámbito penal] .

The metrics for task 3 are precision, recall and F1, each computed in a labeled and an unlabeled variant. The criterion for finding a Spanish span is exact match. The main metric is labeled F1.
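
The bracketed examples above can be converted to the token-level BIO labels described for this task. The sketch below assumes the bracket notation shown on this page (an opening item such as [CC or [UL, and a closing ] attached to the last token of the span) and assumes that no ordinary token starts with [ or ends with ]; brackets_to_bio is a hypothetical helper name. The same routine would also handle the [ORG ...], [LOC ...] and [PER ...] notation used in the Task 2 examples.

```python
def brackets_to_bio(text):
    """Convert '[CC tok tok] tok' style annotation into (tokens, BIO labels)."""
    tokens, labels = [], []
    kind = None    # category of the span currently open, if any
    first = False  # True until the first token of the open span is emitted
    for item in text.split():
        if kind is None and item.startswith("[") and len(item) > 1:
            kind, first = item[1:].lower(), True  # e.g. '[UL' opens a 'ul' span
            continue
        closes = kind is not None and item.endswith("]")
        token = item[:-1] if closes else item
        if token:
            if kind is None:
                labels.append("O")
            else:
                labels.append(("B-" if first else "I-") + kind)
                first = False
            tokens.append(token)
        if closes:
            kind = None
    return tokens, labels


example = ("Okañývo pe Policía Nacional ha orekóva caso omomarandúvo "
           "Fiscalía , peteî [UL investigación ámbito penal] .")
tokens, labels = brackets_to_bio(example)
```

For this example the three tokens inside the brackets receive B-ul, I-ul, I-ul, and every other token, including the final period, receives O.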

Important Dates

February 22nd, 2023: Codalab page.
March 22nd, 2023: training set.
May 24th, 2023: test set and open for submissions.
June 7th, 2023: publication of results.
June 14th, 2023: paper submission.
June 28th, 2023: notification of acceptance.
July 3rd, 2023: camera-ready paper submission.
September, 2023: IberLEF 2023 Workshop.

Data

Training, dev, and test data can be downloaded from the Codalab page.

Results

These are the final results of the competition on the test data, with each participant's rank per task in parentheses:

User	Task 1 - wF1	Task 2 - Labeled F1	Task 3 - Labeled F1
pughrob	0.9381 (1)	0.7028 (1)	0.3836 (1)
tsjauhia	0.9139 (2)	-	-
amunozo	0.8500 (3)	0.4153 (3)	0.1939 (3)
baseline	0.7325 (4)	0.4946 (2)	0.2195 (2)
pakapro	0.0452 (5)	-	-

Contact

We will use the Codalab platform to manage participants and submissions. If you have any further questions, you can contact us via email.

The organizers of the task are:

Luis Chiruzzo. Universidad de la República, Montevideo, Uruguay.
Marvin Agüero-Torales. Universidad de Granada, Granada, Spain. Global CoE of Data Intelligence, Fujitsu, Spain.
Gustavo Giménez-Lugo. Universidade Tecnologica Federal do Paraná, Curitiba, PR, Brasil.
Santiago Góngora. Universidad de la República, Montevideo, Uruguay.
Aiala Rosá. Universidad de la República, Montevideo, Uruguay.
Aldo Alvarez. Universidad Nacional de Itapúa, Encarnación, Paraguay.
Yliana Rodríguez. Universidad de la República, Montevideo, Uruguay.
Thamar Solorio. University of Houston, Houston, TX, USA.

Bibliography

[Agüero-Torales et al., 2021] Agüero-Torales, Marvin M., David Vilares, and Antonio G. López-Herrera. "On the logistical difficulties and findings of Jopara Sentiment Analysis." In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching, pages 95–102, Online. Association for Computational Linguistics. 2021.

[Agüero-Torales, 2022] Agüero-Torales, Marvin M. "Machine Learning approaches for Topic and Sentiment Analysis in multilingual opinions and low-resource languages: From English to Guarani." Ph.D. thesis, University of Granada, Granada, 2022.

[Aguilar et al., 2020] Aguilar, Gustavo, Sudipta Kar, and Thamar Solorio. "Lince: A centralized benchmark for linguistic code-switching evaluation." Proceedings of The 12th Language Resources and Evaluation Conference, pp. 1803-1813, 2020.

[Chiruzzo et al., 2022] Chiruzzo, Luis, Santiago Góngora, Aldo Alvarez, Gustavo Giménez-Lugo, Marvin Agüero-Torales, and Yliana Rodríguez. "Jojajovai: A Parallel Guarani-Spanish Corpus for MT Benchmarking." In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pp. 2098-2107. 2022.

[Chen et al., 2021] Chen, Shuguang, Gustavo Aguilar, Anirudh Srinivasan, Mona Diab, and Thamar Solorio. "CALCS 2021 Shared Task: Machine Translation for Code-Switched Data." arXiv preprint arXiv:2202.09625 (2022).

[Góngora et al., 2021] Góngora, Santiago, Nicolás Giossa, and Luis Chiruzzo. "Experiments on a Guarani corpus of news and social media." In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas, pp. 153-158. 2021.

[Joshi, 1982] Joshi, Aravind. "Processing of sentences with intra-sentential code-switching." In Coling 1982: Proceedings of the Ninth International Conference on Computational Linguistics. 1982.

[Joshi et al., 2020] Joshi, Pratik, Sebastin Santy, Amar Budhiraja, Kalika Bali, and Monojit Choudhury. "The state and fate of linguistic diversity and inclusion in the NLP world." arXiv preprint arXiv:2004.09095 (2020).

[Kallfell, 2016] Kallfell, G., 2016. ¿Cómo hablan los paraguayos con dos lenguas?: gramática del jopara. Centro de Estudios Antropológicos de la Universidad Católica (CEADUC).

[Ley 4251] Ley Nº 4251 / Ley de Lenguas - https://www.bacn.gov.py/leyes-paraguayas/2895/ley-n-4251-de-lenguas.

[Mager et al., 2021] Mager, M., Oncevay, A., Ebrahimi, A., Ortega, J., Gonzales, A.R., Fan, A., Gutierrez-Vasques, X., Chiruzzo, L., Lugo, G.G., Ramos, R. and Meza-Ruiz, I., 2021, June. Findings of the AmericasNLP 2021 shared task on open machine translation for indigenous languages of the Americas. In Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas (pp. 202-217).

[Ríos et al., 2014] Ríos, Adolfo A., Pedro J. Amarilla, and Gustavo A. Giménez Lugo. "Sentiment categorization on a creole language with lexicon-based and machine learning techniques." In 2014 Brazilian Conference on Intelligent Systems, pp. 37-43. IEEE, 2014.

[Rodríguez, 2018] Rodríguez Gutiérrez, Y. V. (2018). Language contact and the indigenous languages of Uruguay. In E. Núñez Méndez (Ed.), Biculturalism and Spanish in contact: sociolinguistic case studies (pp. 217-238). Routledge.

[Solorio et al., 2014] Solorio, Thamar, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari et al. "Overview for the first shared task on language identification in code-switched data." In Proceedings of the First Workshop on Computational Approaches to Code Switching, pp. 62-72. 2014.