On the improvement of missing value imputation in proteomics

Researchers at CEA-IRIG/BGE have designed a new statistical model for missing values in mass spectrometry-based proteomics*, as well as a new and more efficient imputation algorithm.

Published on 18 June 2025

In experimental science, data sets can be affected by missing values, i.e., observations for which no measurement was recorded. As too many missing values may jeopardize the data analysis, imputation (i.e., completing the data by estimating the measurements that should have been observed) is often both a necessity and a lesser evil. However, this task is particularly difficult in proteomics*, because of the high rate of missing values and because of their multiple origins.

Researchers at CEA-IRIG/BGE have therefore designed a new statistical model that jointly characterizes two types of missing values: censored values (i.e., when a protein fragment is not abundant enough to be detected) and randomly missing values (i.e., resulting from the non-exhaustiveness of the instruments). In addition, they have shown that an imputation algorithm that maximizes the known correlations between biomolecules (proteins and their fragments, transcripts, etc.) can be derived from this model. Finally, in the absence of a closed-form solution to the associated maximization problem, they have implemented a numerical solver relying on a feed-forward neural network.
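To make the two kinds of missing values concrete, here is a minimal Python sketch that simulates both on a handful of abundance values. It is only an illustration: the function name `simulate_missingness`, the hard detection limit, the missingness rate and the numbers are all assumptions made for this example, and the model in the published work is more elaborate than a fixed threshold.

```python
import random


def simulate_missingness(abundances, detection_limit, random_missing_rate, seed=0):
    """Mask values the way the two mechanisms would (hypothetical helper).

    Returns (observed, labels): `observed` holds None where a value is
    missing, and `labels` says why ("censored", "random" or "observed").
    """
    rng = random.Random(seed)
    observed, labels = [], []
    for value in abundances:
        if value < detection_limit:
            # Censoring: the fragment is not abundant enough to be detected.
            observed.append(None)
            labels.append("censored")
        elif rng.random() < random_missing_rate:
            # Random loss: the instrument simply did not record this value.
            observed.append(None)
            labels.append("random")
        else:
            observed.append(value)
            labels.append("observed")
    return observed, labels


# Hypothetical log-abundances for six measurements of one peptide.
log_abundances = [18.2, 21.5, 19.9, 24.1, 17.4, 22.8]
observed, labels = simulate_missingness(
    log_abundances, detection_limit=19.0, random_missing_rate=0.2
)
for value, label in zip(observed, labels):
    print(value, label)
```

In this sketch a censored value carries information (the true abundance lies below the detection limit), whereas a randomly missing value does not, which is one reason why treating both types with a single rule can be misleading.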

Figure: Toy example of how to leverage correlations between biomolecules to improve missing value imputation: several peptides coming from the same protein (as well as possibly the transcript it was translated from) have measurement profiles that should be correlated. It is thus relevant to impute the missing values so as to maximize this correlation, as illustrated by the location of "?" for Peptide 4.
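The idea in the figure can be sketched in a few lines of Python. The example below fills the single missing value of a hypothetical "Peptide 4" with the candidate that maximizes its average Pearson correlation with three sibling peptides. All intensities, the candidate grid and the function names are made up for illustration, and a plain grid search stands in for the numerical solver; it is not the published algorithm.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def impute_by_correlation(target, references, candidates):
    """Fill the single None in `target` with the candidate value that
    maximizes the mean correlation with the reference profiles."""
    missing = target.index(None)

    def score(value):
        filled = target[:missing] + [value] + target[missing + 1:]
        return sum(pearson(filled, ref) for ref in references) / len(references)

    return max(candidates, key=score)


# Hypothetical log-intensities of four peptides of one protein, five samples.
peptide_1 = [20.0, 21.0, 22.0, 23.0, 24.0]
peptide_2 = [18.1, 19.0, 20.2, 20.9, 22.0]
peptide_3 = [25.0, 26.1, 26.9, 28.0, 29.1]
peptide_4 = [15.0, 16.0, None, 18.0, 19.0]  # the "?" of the figure

# Candidate values from 14.0 to 20.0 in steps of 0.1 (arbitrary grid).
candidates = [14.0 + 0.1 * k for k in range(61)]
best = impute_by_correlation(
    peptide_4, [peptide_1, peptide_2, peptide_3], candidates
)
print(f"Imputed value for Peptide 4: {best:.1f}")
```

With these made-up profiles, the selected value falls between the neighbouring measurements of Peptide 4, in line with the trend shared by the other peptides. A correlation criterion alone ignores the censoring mechanism, which is why the article describes it as derived from a joint model rather than used in isolation.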

The resulting imputation tool outperforms all state-of-the-art imputation methods, and its use makes it possible to significantly improve the results of mass spectrometry-based proteomic analyses.

Proteomics*: characterisation by identification and quantification of all the proteins present in a biological sample.

Funding
This work was supported by the ANR through the following projects:

ProFI (ANR-10-INBS-08)
GRAL CBH (ANR-17-EURE-0003)
SECRET (ANR-22-CE45-0026)
DEAP (ANR-15-IDEX-02)
MIAI @ Grenoble Alpes (ANR-19-P3IA-0003).

Collaboration
Laboratoire TIMC (Univ. Grenoble Alpes, CNRS, Grenoble INP) « Recherche Translationnelle et Innovation en Médecine et Complexité »

L. Etourneau, L. Fancello, S. Wieczorek, N. Varoquaux and T. Burger. 
Penalized likelihood optimization for censored missing value imputation in proteomics.
Biostatistics 2025.

Keywords: machine learning | proteomics | EDyP

Alternative and Atomic Energies Agency

CEA is a French government-funded technological research organisation in four main areas: low-carbon energies, defense and security, information technologies and health technologies. A prominent player in the European Research Area, it is involved in setting up collaborative projects with many partners around the world.

Interdisciplinary Research Institute of Grenoble (IRIG)