Unintentional p-value hacking. The risk of making the data lie, despite themselves

Researchers at IRIG have published a series of articles aimed at promoting good practices in data science for proteomics.

Published on 28 August 2019

"There are three kinds of lies: lies, damned lies, and statistics." This saying, attributed to British Prime Minister Benjamin Disraeli, targets the misuse of statistical tools to support an intuition or a conviction. Reprehensible as it is, this practice unfortunately has a counterpart in research: p-value hacking. It consists of dredging the data until the hypothesis under test reaches the desired significance, at the cost of violating some statistical rules. In most cases this is done in good faith, since the aim is to get the most out of the data while meeting publication requirements. Nevertheless, it is essential to limit this practice, which leads to many false discoveries.
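
To see why dredging the data produces false discoveries, consider a simplified case that is an assumption made here for illustration, not taken from the articles: a researcher tries several independent analyses on data where no real effect exists, and keeps the first one with a p-value below 0.05. The short Python sketch below computes the chance of obtaining at least one such "significant" result; the function name is hypothetical.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_tests independent tests of a true
    null hypothesis yields p < alpha (illustrative assumption: the tests
    are independent and their p-values are well calibrated)."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(f"{n:>3} analyses tried: "
              f"{prob_at_least_one_false_positive(n):.3f}")
```

With a single analysis the risk is the nominal 5%, but after 20 attempts it is already about 64%, even though nothing is there to be found.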

This is particularly true in proteomics, owing to the intrinsic complexity of the proteome but also to the rapid evolution of analytical technologies. This is one reason why new bioinformatics and biostatistics tools regularly appear in the literature ^[1], promising to overcome the three challenges of big proteomics data: large size, high dimensionality, and great complexity. However, the ease of use of these tools does not remove the need for a minimum of theoretical understanding to use them correctly.

It is with this objective in mind that researchers at IRIG have made a special effort in recent years to disseminate good practices in data science for proteomics ^[2-4]. They published an introduction to the theory of the false discovery rate (FDR, a ubiquitous quality control measure), disambiguated a set of terms that have different meanings in artificial intelligence and analytical chemistry, and proposed five steps to improve the quality control of differential proteomic analyses between several samples.
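
As a rough illustration of what FDR control looks like in practice, the sketch below implements the Benjamini-Hochberg procedure, one standard way of controlling the FDR. It is chosen here as an assumption for illustration and is not claimed to be the exact method detailed in the cited articles. The function name and the p-values are made up for the example.

```python
def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    by the Benjamini-Hochberg procedure at the given FDR level."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr_level.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr_level:
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Hypothetical p-values for ten proteins (illustrative only).
    p = [0.001, 0.008, 0.039, 0.041, 0.042,
         0.060, 0.074, 0.205, 0.212, 0.216]
    naive = sum(pv < 0.05 for pv in p)
    controlled = sum(benjamini_hochberg(p, fdr_level=0.05))
    print(f"p < 0.05 without correction: {naive} proteins")
    print(f"Benjamini-Hochberg at 5% FDR: {controlled} proteins")
```

On these made-up values, five proteins pass the uncorrected 0.05 threshold, but only two are retained once the FDR is controlled at 5%. Such a procedure is only meaningful if the p-values themselves are well calibrated.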

Figure: graphical construction for visually assessing how well the p-values are calibrated.

^[1] Jacob L, Combes F and Burger T. PEPA test: Fast and powerful differential analysis from relative quantitative proteomics data using shared peptides. Biostatistics, 2018
^[2] Burger T. Gentle introduction to the statistical foundations of false discovery rate in quantitative proteomics. Journal of Proteome Research, 2018
^[3] Borges H, Guibert R, Permiakova O and Burger T. Distinguishing between spectral clustering and cluster analysis of mass spectra. Journal of Proteome Research, 2019
^[4] Wieczorek S, Giai Gianetto Q and Burger T. Five simple yet essential steps to correctly estimate the rate of false differentially abundant proteins in mass spectrometry analyses. Journal of Proteomics, 2019

Alternative and Atomic Energies Agency

CEA is a French government-funded technological research organisation in four main areas: low-carbon energies, defense and security, information technologies and health technologies. A prominent player in the European Research Area, it is involved in setting up collaborative projects with many partners around the world.
