
Bioinformatics R&D and sequencing team

Activities

Published on 25 June 2018

The Bioinformatics R&D and sequencing team is responsible for the data flows coming directly from the sequencers, so that the data can be exploited through the various bioinformatics processes and visualization interfaces. The data are heterogeneous and cover all types of preparation and sequencing. They come from sequencing projects carried out in partnership with the Institute of Genomics laboratories or in cooperation with outside laboratories. Processing covers the whole spectrum of bioinformatics analyses (primary, secondary and tertiary), from data generation and quality control through to assembly and eukaryotic genome annotation.

        

(Figure: NGS workflow)

Data production 

Technological monitoring 

The Biosequencing R&D team works closely with the technological development team to develop new protocols that meet the requirements of the downstream bioinformatics analyses (metagenomics, transcriptomics, assembly, annotation, etc.). In that context, several key points have been identified:

  1. Selection of model organisms
  2. Development/enhancement of new protocols
  3. Quality bias identification
  4. Evaluation of the various sequencing technologies 


Quality control

The Biosequencing R&D team has set up a quality control process for the data produced by the sequencers. The control relies on metrics chosen according to the sequencing technology and the downstream bioinformatics analyses. In that context, we have developed several components:

  1. A software suite for quality processing
  2. A workflow to schedule processing
  3. A man-machine interface (MMI) to display the results of processing and validate the sequencing data.
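As an illustration of the kind of metric such a quality control step might compute, here is a minimal sketch in Python. It is not the team's software suite: the function names, the choice of metrics (read length, mean Phred quality, GC fraction), the Phred+33 quality encoding and the thresholds are all assumptions made for the example.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 is assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def read_metrics(sequence, quality_line):
    """Compute a few per-read metrics (hypothetical selection)."""
    scores = phred_scores(quality_line)
    gc_count = sum(1 for base in sequence.upper() if base in "GC")
    return {
        "length": len(sequence),
        "mean_quality": sum(scores) / len(scores) if scores else 0.0,
        "gc_fraction": gc_count / len(sequence) if sequence else 0.0,
    }


def passes_qc(metrics, min_mean_quality=20.0, min_length=30):
    """Validate a read against illustrative thresholds (assumed values)."""
    return (metrics["mean_quality"] >= min_mean_quality
            and metrics["length"] >= min_length)


if __name__ == "__main__":
    metrics = read_metrics("ACGT", "IIII")
    print(metrics, passes_qc(metrics, min_length=4))
```

In a real process the thresholds would depend on the sequencing technology and on the downstream analysis, as described above.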


 

Data processing workflow


  
(Figure: Taxonomic assignment)

(Figure: Quality control display)


Assembly

From the collection of random reads produced by a whole-genome shotgun (WGS) sequencing project, the assembly stage aims to reconstruct the chromosome sequences of the organism under study. The algorithms rely on the sequence identity between overlapping reads and on topological information provided by 'links' or by markers from genetic and physical maps. The result of the assembly, a set of 'supercontigs', is a consensus reconstruction of the original sequence.
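The role of read overlaps can be sketched with a toy greedy assembler in Python. This is only an illustration under strong assumptions (error-free reads, exact suffix/prefix overlaps, a single strand, no link or map information); the function names and the `min_overlap` parameter are hypothetical and do not describe the tools used by the group.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(reads)
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # no remaining overlaps: these are the contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
    # prints ['ATGGCGTACGTTA']
```

Real assemblers must also tolerate sequencing errors and repeats, which is where the link and map information mentioned above becomes necessary.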

The tools and methods used by the group for this activity come either from software developed in-house at the Institute of Genomics or from software developed by other groups working on assembly.


Annotation 

The aim of annotation is to define the structure of the genes along the assembled sequences, i.e. their start and end positions and those of their exons. We have chosen an approach that can take into account an a priori open-ended number of information sources of any type. In practice, however, the information falls into two major categories:

1/ Ab initio predictions. For each genome, we calibrate and use several gene prediction programs that rely on the statistical properties of the known protein-coding genes of the species. Calibration is carried out beforehand on a collection of known genes.

2/ Exploitation of coding sequences. We align all public protein sequences and the cDNA sequences available for related phyla. We give more statistical weight to cDNA collections from the same species, whether public or sequenced at Genoscope. The final alignment is performed with software that constrains exon junctions to positions compatible with splice sites.

The set of predictions is then 'reconciled' so as to retain a single 'gene model' per locus. This stage relies on the GAZE program, which integrates a set of weighted evidence fed to an automaton that we have adapted. Using dynamic programming, this stage produces, for each sequence, the set of gene models that contains no phase break and has the highest score.
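To give an idea of how dynamic programming can select the highest-scoring set of compatible gene models, here is a deliberately simplified sketch in Python. It is not the GAZE algorithm: it only picks non-overlapping candidate models that maximise the total score (weighted interval scheduling) and ignores exon structure and phase. The tuple layout `(start, end, score)` and the function name are assumptions made for the example.

```python
from bisect import bisect_right


def best_gene_models(candidates):
    """Pick non-overlapping (start, end, score) models with the highest total score.

    Coordinates are assumed to be inclusive integers.
    Returns (total_score, chosen_models).
    """
    models = sorted(candidates, key=lambda m: m[1])  # sort by end position
    ends = [m[1] for m in models]
    # best[k] holds the optimal (score, chosen) using only the first k models
    best = [(0.0, [])]
    for k, (start, end, score) in enumerate(models):
        # p = number of earlier models that end strictly before this start
        p = bisect_right(ends, start - 1, 0, k)
        take_score = best[p][0] + score
        if take_score > best[k][0]:
            best.append((take_score, best[p][1] + [(start, end, score)]))
        else:
            best.append(best[k])
    return best[-1]


if __name__ == "__main__":
    candidates = [(1, 100, 5.0), (50, 150, 8.0), (120, 200, 4.0), (160, 300, 6.0)]
    print(best_gene_models(candidates))
    # prints (14.0, [(50, 150, 8.0), (160, 300, 6.0)])
```

The point of the sketch is the principle: each candidate is either kept or discarded, and the best solution for a sequence prefix is reused rather than recomputed.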


Display

The results of the various analyses are stored in a database and made available to staff through a dedicated interface, a generic genome browser (GGB).


 

(Excerpt from the grapevine GGB)
Annotation of a locus of grapevine K11.


                                   

Genome duplications of Paramecium tetraurelia

The genome sequence of the Paramecium macronucleus retains striking traces of at least three whole-genome duplications that followed one another during evolution (from the outer circles, the most recent, to the inner circles, the oldest). Whereas in other lineages (fish, plants, yeast) very few duplicated genes remain after whole-genome duplications, here 24,000 genes, i.e. 68% of the total, have been retained in two copies since the most recent duplication. In addition, very little chromosomal rearrangement has taken place, as gene order is conserved. These characteristics, chiefly the high number of genes duplicated at three different evolutionary time points, show that gene loss is strongly constrained in the short term. In particular, the stoichiometric constraint on genes whose products interact is strong.

  

  Projects

 Tetraodon nigroviridis (GGB)
 Paramecium tetraurelia (GGB)
 Vitis vinifera (GGB)
 Oikopleura dioica (GGB)
 Tuber melanosporum (GGB)