Welcome to the shared task QuALES - Question Answering Learning from Examples in Spanish, a task on automatically finding answers to questions in Spanish from news text. This task is part of IberLEF 2022 and is organized by Grupo PLN-UdelaR.



Question Answering (QA) is a classical Natural Language Processing task (Jurafsky, 2021), and can be divided into two main categories: semantic analysis, where the question is transformed into a query against a knowledge base; and open domain question answering, where, starting from a question written in natural language and a set of documents, the answer to the question is obtained using information retrieval and information extraction techniques.

Open domain question answering involves two main stages: (a) obtaining the relevant documents, generally using methods from the field of Information Retrieval (IR) (Manning, 2008), possibly one of the most widely studied topics in NLP, with web search engines as its most visible product; and (b) extracting the answer from those documents. Each of these stages has its own challenges, and the whole task requires a successful outcome in each of them and in their integration.

In this task we address the problem of answering questions by extracting answers from a set of documents.

Task description

We propose a task for developing question answering systems that can answer questions based on news articles written in Spanish. The systems will get a full news article and a question, and must find the shortest spans of text in the article (if they exist) that answer the question.

The training, development and test datasets are based on a corpus of news in Spanish related to the Covid-19 domain. Originally, we planned to have two separate corpora for evaluation, but since the texts often mix Covid-19 related news with other topics, we decided to annotate only one set. Most of the questions in the dataset are about Covid-19 matters, but some of them are about other topics.

The expected results are the shortest spans of text that contain the answer, taking into account that some questions cannot be answered using the information in the text. For example, consider the following news text:

Comenzaron las clases presenciales en 344 escuelas rurales, con baja asistencia
A las 8.45 dos perros paseaban por el patio de la escuela rural 27 de La Macana, en Florida. Dos maestras con túnicas blancas y tapabocas esperaban a los alumnos que reanudarían las clases presenciales luego de cinco semanas de conexión virtual. Ya estaba instalado el micrófono y el parlante en el patio, habían llegado los inspectores regionales junto con la directora general del Consejo de Educación Inicial y Primaria (CEIP), Irupé Buzzetti, que junto a la prensa local esperaban a los niños. De los 28 alumnos que asisten regularmente, 14 habían dicho que no iban a ir y los otros no habían confirmado. A las 9.00, cuando debían comenzar las clases en la escuela de La Macana, no había ningún niño.
La situación de La Macana se repitió en varias de las escuelas que abrieron este miércoles. De las 547 escuelas habilitadas abrieron 344, confirmó a la diaria Limber Santos, director del departamento de Educación Rural del CEIP. De esas escuelas, cerca de 90 no recibieron alumnos; Santos estimó que en la mañana del miércoles 1.030 niños concurrieron a las escuelas, de un total de 3.900 que concurren a las 547 habilitadas y de 2.838 alumnos que tienen matriculadas las 344 escuelas que abrieron. La asistencia, por tanto, llegó a 36% en el primer día.

Given these possible questions, the expected answers that the system should find are the following:

Q1: ¿Cuántas escuelas rurales hay en Uruguay?
A1: De las [547] escuelas habilitadas abrieron 344, confirmó a la diaria Limber Santos, director del departamento de Educación Rural del CEIP.

Q2: ¿Cuándo vuelven las clases presenciales a todas las escuelas?
A2: –not found in the text–
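The two cases above (an answer span that exists in the text, and a question with no answer) can be sketched in Python as follows. This is only an illustration: representing a span as character offsets, the helper name find_span, and the shortened article string are assumptions made here, not the official data format, which is described on Codalab.

```python
from typing import Optional, Tuple


def find_span(article: str, answer: Optional[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of `answer` in `article`.

    Hypothetical helper: returns None when there is no answer, which is
    how this sketch models a question that the text does not answer.
    """
    if not answer:
        return None
    start = article.find(answer)
    if start == -1:
        return None
    return start, start + len(answer)


# Shortened excerpt of the example news text (for illustration only).
article = (
    "De las 547 escuelas habilitadas abrieron 344, "
    "confirmó a la diaria Limber Santos."
)

span_q1 = find_span(article, "547")  # Q1: answer is the shortest span
span_q2 = find_span(article, None)   # Q2: not found in the text
```

Slicing the article with the offsets of span_q1 recovers the answer text, while span_q2 is None.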

As one of our evaluation metrics, we will measure the average Exact Match over all dataset instances, following the approach of SQuAD (Rajpurkar, 2016). Following Reddy et al. (2019), we will also report the macro-averaged F1 score of word overlap: each individual prediction is compared against each of the human gold standard answers, and the maximum value is taken as the system F1 score for that instance; the system performance is the macro-average of all those per-instance F1 scores. Determiners and punctuation are excluded from the evaluation.
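The two metrics can be sketched in Python as follows. This is a minimal illustration, not the official scoring script: the determiner list, the set of punctuation characters, the use of an empty string for unanswerable questions, and taking the maximum over gold answers for Exact Match as well as F1 are all assumptions made here.

```python
import string
from collections import Counter
from typing import List, Tuple

# Assumed normalization resources; the official script may differ.
DETERMINERS = {"el", "la", "los", "las", "un", "una", "unos", "unas"}
PUNCTUATION = set(string.punctuation) | set("¿¡«»–")


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and drop determiners."""
    text = "".join(ch for ch in text.lower() if ch not in PUNCTUATION)
    return [tok for tok in text.split() if tok not in DETERMINERS]


def exact_match(pred: str, gold: str) -> float:
    return float(normalize(pred) == normalize(gold))


def f1_overlap(pred: str, gold: str) -> float:
    """Word-overlap F1 between one prediction and one gold answer."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        # An empty answer only matches another empty answer.
        return float(p == g)
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(g)
    return 2 * precision * recall / (precision + recall)


def score(predictions: List[str], references: List[List[str]]) -> Tuple[float, float]:
    """Return (average Exact Match, macro-averaged F1).

    For each instance, the best value over its gold answers is kept.
    """
    em = [max(exact_match(p, g) for g in golds)
          for p, golds in zip(predictions, references)]
    f1 = [max(f1_overlap(p, g) for g in golds)
          for p, golds in zip(predictions, references)]
    n = len(predictions)
    return sum(em) / n, sum(f1) / n
```

For instance, with this sketch the prediction "las 547 escuelas" scored against the gold answer "547" gets precision 1/2 and recall 1, so its F1 is 2/3, while its Exact Match is 0.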

Important Dates


We provide a (small) training set of 1000 question-answer pairs, a development set of 800 question-answer pairs, and a test set of 800 question-answer pairs. Participants can use any other data for training as well, in particular SQuAD (Rajpurkar, 2016) or NewsQA (Trischler, 2016). The data can be downloaded from Codalab.


These are the results for the evaluation phase. We show the best result for each user for each metric. Please note that the best exact match and F1 scores might have been obtained in different submissions by the same user.

The best exact match scores for each user are the following:


The best F1 overlap scores for each user are the following:



We will use the Codalab platform to manage participants and submissions. If you have any questions, you can contact us via the Codalab Forum or email.

The organizers of the task are:


Over the last decade, as distributional semantic methods based on neural networks became popular (Le, 2014; Lecun, 2015), these methods began to be applied to the QA task, achieving significant improvements in results (Yu, 2014; Seo, 2017; Min, 2018; Xiong, 2018).

All of this supervised learning work was made possible by datasets that are publicly available for research purposes (Richardson, 2013; Yang, 2015; Rajpurkar, 2016). These datasets have enabled not only the training of models, but also the continuous monitoring of the state of the art in this area.

In the last few years, following the publication of models based on the Transformer architecture (Vaswani, 2017) for solving sequence-to-sequence transformation problems, and particularly of language models such as BERT (Devlin, 2018) and ALBERT (Lan, 2019), there has been a new jump in system performance, particularly for the English language. These models are trained in an unsupervised (or self-supervised) way using large volumes of data and computing power, but after that stage (called pretraining) they can easily be fine-tuned for different tasks. In particular, they can be adapted to the task of finding answers to questions.

QA research for Spanish has progressed much more slowly so far. However, similar language resources have been created for this language, which suggests that current architectures can be studied and fine-tuned to obtain competitive results. In particular, there is a recently developed version of BERT for Spanish, dubbed BETO (Cañete, 2020), and a version of SQuAD (the main dataset for training and evaluating open domain QA systems) translated to Spanish (Rajpurkar, 2016; Carrino, 2019).

From 2003 to 2014, the CLEF Question Answering Track ran several campaigns related to question answering, some of which included Spanish datasets. For example, as part of the CLEF 2009 forum, ResPubliQA, a question answering task over European legislation, was proposed (Peñas, 2009). The task consisted of extracting a relevant paragraph of text that included the answer to a natural language question. During CLEF 2010, the task was expanded (Peñas, 2010) to include an answer selection task (i.e., besides retrieving the relevant paragraph, systems were required to identify the exact answer to the question). It also proposed several cross-lingual tasks working on two multilingual parallel corpora: the JRC-ACQUIS Multilingual Parallel Corpus (10,700 parallel and aligned documents) and the Europarl collection (150 parallel and aligned documents per language), with 200 question/answer pairs provided for evaluation.

Unlike the task we present here, the CLEF tasks addressed either general-domain questions or questions from specific domains other than the one selected for QuALES. In addition, they worked with smaller amounts of training and testing data than those currently available. Some of these tasks also differ in nature from the current proposal: for example, their datasets were oriented to answering multiple choice questions, or natural language questions to be answered from DBPedia structured data.


(Cañete, 2020) Cañete, J., Chaperon, G., Fuentes, R., Ho, J. H., Kang, H., & Pérez, J. (2020). Spanish pre-trained BERT model and evaluation data. PML4DC at ICLR 2020.

(Carrino, 2019) Carrino, C. P., Costa-jussà, M. R., & Fonollosa, J. A. (2019). Automatic spanish translation of the squad dataset for multilingual question answering. arXiv preprint arXiv:1912.05200.

(Devlin, 2018) Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, Toutanova, Kristina. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

(Jurafsky, 2021) Jurafsky, D. and Martin, J.H. (2021). Speech and Language Processing (3rd ed. draft).

(Kwiatkowski, 2019) Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., ... & Petrov, S. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7, 453-466.

(Lan, 2019) Lan, Zhenzhong, et al. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

(Le, 2014) Le, Quoc; Mikolov, Tomas. Distributed representations of sentences and documents. In International conference on machine learning. 2014. p. 1188-1196.

(Lecun, 2015) Lecun, Yann; Bengio, Yoshua; Hinton, Geoffrey. Deep learning. Nature, 2015, vol. 521, no 7553, p. 436-444.

(Manning, 2008) Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich. Introduction to Information Retrieval. 2008.

(Min, 2018) Min, Sewon, et al. Efficient and robust question answering from minimal context over documents. arXiv preprint arXiv:1805.08092, 2018.

(Nakov, 2015) Nakov, P., Màrquez, L., Magdy, W., Moschitti, A., Glass, J., Randeree, B. (2019). Semeval-2015 task 3: Answer selection in community question answering. arXiv preprint arXiv:1911.11403.

(Nakov, 2016) Nakov, P., Villodre, L.M., Moschitti, A., Magdy, W., Mubarak, H., Freihat, A.A., Glass, J.R., & Randeree, B. (2016). SemEval-2016 Task 3: Community Question Answering.

(Peñas, 2009) Peñas, A., Forner, P., Sutcliffe, R., Rodrigo, Á., Forăscu, C., Alegria, I., ... & Osenova, P. (2009, September). Overview of ResPubliQA 2009: Question answering evaluation over European legislation. In the Workshop of the Cross-Language Evaluation Forum for European Languages (pp. 174-196). Springer, Berlin, Heidelberg.

(Peñas, 2010) Peñas, A., Forner, P., Rodrigo, Á., Sutcliffe, R., Forascu, C., Mota, C. (2010). Overview of ResPubliQA 2010: Question Answering Evaluation over European Legislation.

(Rajpurkar, 2016) Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

(Reddy, 2019) Reddy, S., Chen, D., & Manning, C. D. (2019). Coqa: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7, 249-266.

(Richardson, 2013) Richardson, Matthew; Burges, Christopher JC; Renshaw, Erin. Mctest: A challenge dataset for the open-domain machine comprehension of text. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013. p. 193-203.

(Trischler, 2016) Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., & Suleman, K. (2016). Newsqa: A machine comprehension dataset. arXiv preprint arXiv:1611.09830.

(Vaswani, 2017) Vaswani, Ashish, Shazeer, Noam, Parmar, Niki, Uszkoreit, Jakob, Jones, Llion, Gómez, Aidan N., Kaiser, Lukasz, Polosukhin, Illia. Attention is all you need. In Advances in neural information processing systems. 2017. p. 5998-6008.

(Xiong, 2018) Xiong, Caiming; Zhong, Victor; Socher, Richard. DCN+: Mixed Objective And Deep Residual Coattention for Question Answering. In International Conference on Learning Representations. 2018.

(Yang, 2015) Yang, Yi; Yih, Wen-tau; Meek, Christopher. Wikiqa: A challenge dataset for open-domain question answering. In Proceedings of the 2015 conference on empirical methods in natural language processing. 2015. p. 2013-2018.

(Yu, 2014) Yu, Lei, et al. Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632, 2014.