
Scientific result | Large-scale biology

Bringing Order to Protein Databases

New research is challenging the way genes are classified in international databases. How can a gene be reliably matched to the function of its associated protein without experimental testing? Researchers from the François-Jacob Institute have found the answer.
Published on 20 December 2017

The human genome contains about 25,000 genes. That is a significant number, yet it is fewer than in mice (30,000) or the paramecium (40,000), though more than in bacteria (around 4,000). New sequencing technologies have led to a dramatic increase in the inventory of the genomes of different living species and the genes that compose them. Each gene performs a specific function via the protein it produces, but what exactly does each one do? Testing all 88 million genes classified in databases to answer this question is simply impossible.

"Scientists rely on similarities between proteins to extrapolate the function of one protein to another," said Véronique de Berardinis, a researcher at the François-Jacob Biology Institute. "Yet how similar do two proteins have to be to be considered as having the same function?" she added. The matching is often calculated automatically by software and can be risky due to the lack of experimental data. This absence of data means that families of proteins, whose function can change with only a few amino acids, cannot be represented in their entire complexity. For instance, some human proteins are annotated in the databases in the same way as proteins of bacterium Escherichia coli and "a large fraction of the proteins are associated with a predicted function that is unreliable," said de Berardinis. On the other hand, different proteins may ultimately perform the same activity—a phenomenon called function convergence. "This is the case for two families of proteins called MetA and MetX, involved in the production of methionine, an essential amino acid in living organisms" she said. "Known for nearly 40 years, these two families are responsible for performing this same step in this metabolic pathway, yet using different approaches. To obtain a thorough representation of the complexity of these families, we have selected and tested the activity of 100 representative proteins." The results show that many proteins in both families ensure the same function, which was previously unknown. In this case, making a match between protein function and sequence was not enough. In the end, an in-depth study of the three-dimensional structures of these enzymes, and their active sites (where the chemical reaction unfolds) provided the solution.

"Our studies have shown that the functions depend on the topology of the active sites," said de Berardinis. These results have a global impact since European database UniProt, which gathers all 88 million documented proteins, will now integrate the function annotation rules proposed by the researchers from CEA. "The 10,000 annotations of the MetA and MetX proteins listed in UniProt have been updated," de Berardinis said. "And these two essential families will be correctly annotated in any new documented genome." This functional exploration also revealed that 10% of MetX are involved in the biosynthesis of the essential amino acid cysteine, via a molecule (O-succinyl-L-serine) that had never been described in nature.

This study has opened other prospects: an unexpected result on the evolution of the MetA and MetX families paves the way for further studies. The scientists have shown how these two groups have undergone evolutionary pressure twice to converge towards the same function.
