Research projects | Facultad de Ingeniería

Coding for DNA storage (2023)

Participants: Federico Bello, Santiago Castro, Guillermo Dufort y Álvarez, Fernando Fernández, Álvaro Martín (PI), Marcos Rapetti, Gadiel Seroussi
Funding: CSIC

The idea of using DNA molecules to store information has been around for decades. The medium offers two very attractive qualities: high information density per unit of physical space and high durability. In both respects, DNA storage is unmatched by any other technology available today. The topic is currently under active development, in both its technological aspects and its theoretical foundations. One line of research concerns how to encode information to make optimal use of the storage medium, given that the storage and retrieval processes are subject to errors. In this project we intend to advance on this topic by studying the capacity of the storage medium (channel capacity) and practical coding for error correction.

Compression of raw nanopore sequencing data (2023)

Participants: Guillermo Dufort y Álvarez (PI), Tomás González, Álvaro Martín, Gadiel Seroussi, Rodrigo Torrado
Funding: ANII

In this project we intend to advance the development of compression algorithms for raw nanopore sequencing data. Better compression translates into lower IT infrastructure costs for data storage and transmission, which are increasingly significant given the current mass production of genomic data.

Compression of genome sequencing data generated by nanopore technology (2019 - 2022)

Participants: Guillermo Dufort y Álvarez, Álvaro Martín (PI), Idoia Ochoa, Tatiana Rischewski, Gadiel Seroussi, Pablo Smircich, José Sotelo Silveira
Funding: CSIC

Nanopore sequencing has some distinctive features that make it very attractive. Chief among them, the reads of DNA sequence fragments it generates are much longer than those generated by the most widely used sequencers. However, the read error rate is high. From a data compression point of view, both the specific statistical characteristics of the signals measured during sequencing and the combined use of technologies give rise to interesting challenges that we plan to address in this project.

Quality score quantization of nanopore sequencing data (2020 - 2021)

Participants: Lucía Balestrazzi, Martín Rivara, Guillermo Dufort y Álvarez, Álvaro Martín (PI), Idoia Ochoa, Gadiel Seroussi, Pablo Smircich, José Sotelo Silveira
Funding: ANII

The amount of data generated by modern sequencing platforms (genomic, metagenomic, transcriptomic data, etc.) is extremely large and, thanks to the decrease in costs in recent years, is growing at an ever-increasing rate. This makes the cost of storing and transmitting this type of information a real problem in various bioinformatics applications. The data produced during a sequencing process includes the so-called quality scores, which represent an estimate of the probability of error for each nucleotide read. Quality scores are a fundamental input for the analysis of sequencing data and, at the same time, they account for most of the data generated during sequencing (more than the base information itself). Given the general concern about the amount of data generated by the new sequencing methodologies, there is great interest in understanding how much of the information provided by the quality scores is really necessary to carry out the biological investigations that follow from these data. In this project we set out to investigate this problem for a nanopore sequencing platform, a state-of-the-art technology that has not yet been investigated in this regard. To this end, we propose using sequencing databases associated with previously conducted biological experiments to analyze the effect that different quality score quantization schemes have on the biological conclusions that emerge from these data.
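
As a rough illustration of what quantizing quality scores means, the sketch below assumes the scores follow the standard Phred convention (Q = -10 log10 p, where p is the estimated error probability) and applies a uniform binning scheme. The function names and the bin width are hypothetical choices made for this example; they are not the quantization schemes studied in the project.

```python
def phred_to_error_prob(q: int) -> float:
    """Error probability for a Phred-scaled quality score (assumed convention)."""
    return 10 ** (-q / 10)


def quantize(q: int, bin_width: int = 10) -> int:
    """Replace q by the midpoint of its bin (hypothetical uniform scheme)."""
    return (q // bin_width) * bin_width + bin_width // 2


# Made-up scores for illustration only.
scores = [2, 7, 13, 18, 25, 31]
quantized = [quantize(q) for q in scores]

# Fewer distinct symbols means a lower-entropy, more compressible stream,
# at the price of a coarser error estimate for each base.
print(len(set(scores)), "distinct values before,", len(set(quantized)), "after")
for q, qq in zip(scores, quantized):
    print(q, qq, phred_to_error_prob(q), phred_to_error_prob(qq))
```

A coarser bin width reduces the alphabet further; whether downstream biological conclusions survive that loss is the question the project description raises.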

Applications of Information Theory to nanopore DNA sequencing data processing (2017 - 2019)

Participants: Guillermo Dufort y Álvarez, Álvaro Martín (PI), Gadiel Seroussi, José Sotelo Silveira
Funding: CSIC

In 2015, the first commercial version of a nanopore genome sequencer was released, a technology that is emerging as the next generation of sequencing instruments. This type of sequencer generates very long reads of DNA sequence fragments, which is generally advantageous, but with a high error rate. Processing reads with these characteristics demands a specific treatment, which we think will be of central importance as the use of this technology spreads. In this project we aim to investigate compression algorithms for various types of nanopore sequencing data and the application of denoising techniques.

A low-energy wireless electroencephalograph (2015 - 2017)

Participants: Ignacio Capurro, Guillermo Dufort y Álvarez, Federico Favaro, Federico Lecumberry, Álvaro Martín (PI), Juan Pablo Oliver, Julián Oreggioni, Julio Pérez, Ignacio Ramírez (PI), Gadiel Seroussi, Leonardo Steinfeld
Funding: CSIC

We investigate the energy savings that can be obtained in wireless electroencephalographs through the use of efficient coding schemes. We will measure the energy consumption obtained with different coding algorithms, with the goal of evaluating the trade-off between algorithmic complexity (which translates to higher computation energy) and compression efficiency (which yields lower transmission energy).

Low complexity brain-computer interface (2013 - 2015)

Participants: Ignacio Capurro, Federico Lecumberry, Álvaro Martín (PI), Martín Patrone, Eugenio Rovira, Ignacio Ramírez (PI), Gadiel Seroussi
Funding: CSIC

We investigate the application of Signal Processing and Information Theory techniques to the development of Brain-Computer interfaces based on electroencephalography, with low power consumption in the electroencephalograph. We are interested in low complexity algorithms that can be applied to the compression of electroencephalograms, with the aim of reducing energy consumption for wireless transmission between the electroencephalograph and a computer that analyzes the signals.

Efficient estimation of stochastic models (2011 - 2013)

Participants: Álvaro Martín (PI), Gadiel Seroussi, Luciana Vitale
Funding: CSIC

Estimating a stochastic model from a sample sequence over a given alphabet is key in many practical applications, such as various data compression algorithms, simulation, and prediction. In this project we focus on Markov models, which are commonly applied in different areas of Information Theory. For this type of model, there exist estimation algorithms that are efficient from a theoretical point of view, in the sense that they require running time and memory that are linear in the length of the input sequence. In practice, however, the memory requirements of these algorithms can be prohibitive for large input sequences. We will study theoretical properties of these estimators with the aim of deriving new efficient estimation algorithms.
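
To make the estimation problem concrete, here is a minimal sketch of a naive fixed-order Markov estimator based on counting. It is not one of the algorithms referred to above; the function name and the choice of a fixed order are assumptions made for illustration. It makes a single pass over the input, and its table of counts grows with the number of distinct contexts observed, which hints at why memory becomes a concern for long sequences and large orders.

```python
from collections import Counter, defaultdict


def estimate_markov(seq: str, order: int) -> dict:
    """Maximum-likelihood conditional probabilities P(symbol | previous `order` symbols)."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        context = seq[i - order:i]      # the `order` symbols preceding position i
        counts[context][seq[i]] += 1    # one count per (context, next symbol) pair
    model = {}
    for context, symbol_counts in counts.items():
        total = sum(symbol_counts.values())
        model[context] = {s: n / total for s, n in symbol_counts.items()}
    return model


model = estimate_markov("abababab", order=1)
print(model)
```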

Study of Tree Models in Information Theory (2007 - 2008)

Participants: Álvaro Martín (PI)
Funding: PDT

Tree models, which in the statistical community have been called variable length Markov chains, provide a mechanism to "join" states of a Markov chain that share the same probability distribution. In practical applications, such models allow for an important reduction in the number of free scalar parameters (conditional probabilities) required to model a stochastic process. In this project we study theoretical properties of tree models and some applications in data compression.
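
The parameter reduction can be illustrated with a small count. The sketch below assumes a model is described by its set of contexts (the leaves of the context tree), that each context carries one conditional distribution with (alphabet size - 1) free parameters, and that a valid context set is prefix-free and complete. The function names and the example context set are hypothetical.

```python
from fractions import Fraction


def markov_params(alphabet_size: int, order: int) -> int:
    """Free parameters of a full Markov chain of the given order."""
    return alphabet_size ** order * (alphabet_size - 1)


def tree_params(contexts: list, alphabet_size: int) -> int:
    """Free parameters of a tree model whose leaves are `contexts`."""
    for c in contexts:
        for d in contexts:
            if c != d and d.startswith(c):
                raise ValueError("contexts must be prefix-free")
    # A prefix-free set is the leaf set of a complete tree iff its Kraft sum is 1.
    kraft = sum(Fraction(1, alphabet_size ** len(c)) for c in contexts)
    if kraft != 1:
        raise ValueError("contexts must form a complete tree")
    return len(contexts) * (alphabet_size - 1)


# Binary alphabet, maximum context length 3.
full = markov_params(2, 3)                            # 8 states, 8 parameters
tree = tree_params(["0", "10", "110", "111"], 2)      # 4 leaves, 4 parameters
print(full, tree)
```

In this example, the eight states of the order-3 chain are merged into four contexts, halving the number of conditional probabilities to estimate.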

Study of models for finite memory stochastic processes (2005 - 2007)

Participants: Álvaro Martín, Alfredo Viola (PI)
Funding: CSIC

The objective of this project is to achieve a deep understanding of the properties of tree models for finite memory processes. Progress in this direction has both theoretical and practical interest, with important implications, for example, for compression algorithms and simulation.