The European Nucleotide Archive (ENA) of the EMBL-EBI and its partners

The European Nucleotide Archive – Image copyright of the ROTFEL INSIGHT LIBRARY

Introduction

Almost 38 years ago, the European Molecular Biology Laboratory (EMBL) launched the EMBL Nucleotide Sequence Data Library. Since then, successive improvements in sequencing and archiving technologies have led to the European Nucleotide Archive (ENA) as we see it today. ENA is an online database that archives and provides the world’s nucleotide sequencing information. It is an ELIXIR Core Data Resource and a founding partner of the International Nucleotide Sequence Database Collaboration (INSDC), through which sequence data and related information are shared among its members, the other major ones being the DNA Data Bank of Japan (DDBJ) and GenBank at the National Center for Biotechnology Information (NCBI). In this article we explore the main scope of ENA and introduce the tools that ENA makes available to the world.

The scope of ENA

The main scope of ENA is to serve as a worldwide repository of genetic sequences. Its aim is to promote and support the use of genetic sequences in the work of researchers worldwide, and it does so by providing, through its website, services for the submission, archiving, search and download of sequences. ENA is a public archive that brings together several databases. These collect not only raw sequence data but also assembly information (which describes how reads and sequence contigs are built into higher-order scaffolds and chromosomes (European Nucleotide Archive)) and functional annotations made by the researchers who submitted the sequences.

Almost 10 years ago, in October 2010, ENA contained 50 trillion base pairs (Leinonen, 2010); by 2013 it contained almost 570 trillion base pairs (Pakaseresht, 2013), more than ten times as many in three years. Now, in 2020, thanks to direct submissions to ENA and to its partnerships with other databases, those trillions of base pairs have become quadrillions. They are collected, displayed and categorized ever more accurately, so that researchers can search and share sequences within data domains such as assembly, sequence, coding, non-coding, marker, analysis, read, trace, taxon, sample and study, each associated with specific information.

Over more than 30 years of progress, ENA services have been continually improved and adapted to manage a growing volume of sequences and associated data. This growth has tested ENA’s ability to develop the database in line with the needs of its users and of its partners, whose number has increased over the years. It has required great effort to adapt and standardize the databases so that the partners can store and exchange data with each other. For this purpose, ENA is developed and maintained at EMBL-EBI under the guidance of the INSDC advisory board.

As a result of this coordination, ENA can nowadays be used both interactively and programmatically, as we will see in the next section, and all data can be viewed in the ENA Browser, which was significantly improved in 2019. The ENA Browser supports free text search, programmatic access and sequence similarity search, and bulk data can be downloaded over FTP (File Transfer Protocol) and Aspera (European Nucleotide Archive). In the free text search, researchers enter a search query and select a domain such as assembly, sequence, contig set, coding or non-coding.

As this short overview shows, collecting genetic sequences and linking them to their analysis lays the foundations for, and opens new horizons in, our understanding of nucleic acids, which together with lipids, carbohydrates and proteins form the four major classes of biomolecules in living beings and which underlie the evolution of life at all levels.
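
To make the idea of programmatic searching within a data domain more concrete, here is a minimal sketch in Python that only builds a search URL and does not contact any server. The endpoint address, the parameter names, the `read_run` domain, the `tax_eq(9606)` query and the field names are assumptions based on the public ENA Portal API and are not taken from this article, so they should be checked against the current ENA documentation before use.

```python
from urllib.parse import urlencode

# Assumption: base URL of the ENA Portal API search endpoint.
PORTAL_SEARCH_URL = "https://www.ebi.ac.uk/ena/portal/api/search"


def build_search_url(result, query, fields, fmt="tsv", limit=10):
    """Return a search URL for one data domain (called a "result" here)."""
    params = {
        "result": result,            # data domain to search, e.g. reads
        "query": query,              # search expression
        "fields": ",".join(fields),  # columns wanted in the report
        "format": fmt,               # report format
        "limit": limit,              # maximum number of records
    }
    return PORTAL_SEARCH_URL + "?" + urlencode(params)


# Illustrative values only: read runs for taxon 9606 (human).
example_url = build_search_url(
    result="read_run",
    query="tax_eq(9606)",
    fields=["run_accession", "fastq_ftp"],
)
print(example_url)
```

The resulting URL could then be opened in a browser or fetched with any HTTP client. Keeping URL construction separate from the download makes the query easy to inspect before any data is transferred.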

The beating heart of sharing genetic data at ENA

The beating heart of ENA is the sharing and use of genetic sequence data, made possible by Webin, its portal for submitting and updating data. In 2018 ENA introduced a command line interface (Webin-CLI) that considerably improved the submission process. It has become ENA’s primary submission tool for genomes and transcriptomes, and it also supports reads and annotated sequences. Webin-CLI is provided as a standalone executable JAR file that can be run from a UNIX terminal or the Windows command prompt (Amid, 2020). In the 365 days after its debut, Webin-CLI supported thousands of data submitters around the world, accounting for 5,700 studies, 620,000 samples and 197,000 assemblies. To guide users in sharing and using the stored data, ENA publishes a detailed set of guidelines and tutorials on a dedicated documentation site hosted on Read the Docs. The ENA training modules are divided into four main sections:

  1. ENA data submission
  2. ENA data discovery and retrieval
  3. ENA data updates
  4. ENA tips and FAQs
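
Since Webin-CLI is distributed as an executable JAR file, a submission run comes down to a single `java -jar` command. The Python sketch below assembles such a command without executing it. The option names (`-context`, `-manifest`, `-userName`, `-password`, `-validate`, `-submit`) and the list of contexts are assumptions drawn from the Webin-CLI documentation, while the JAR path, manifest file name and credentials are hypothetical placeholders.

```python
import shlex

# Assumption: submission contexts accepted by Webin-CLI.
CONTEXTS = {"genome", "transcriptome", "sequence", "reads"}


def webin_cli_command(jar_path, context, manifest, user, password, submit=False):
    """Build the argument list for one Webin-CLI run (validate by default)."""
    if context not in CONTEXTS:
        raise ValueError(f"unknown context: {context}")
    command = [
        "java", "-jar", jar_path,
        "-context", context,
        "-manifest", manifest,
        "-userName", user,
        "-password", password,
    ]
    # Validate first; submit only when explicitly requested.
    command.append("-submit" if submit else "-validate")
    return command


# Hypothetical placeholders, not real files or credentials.
command = webin_cli_command(
    "webin-cli.jar", "genome", "manifest.txt", "Webin-XXXXX", "PASSWORD"
)
print(shlex.join(command))
```

Defaulting to validation rather than submission is a deliberate choice in this sketch: it lets a submitter check the manifest and files before anything is sent to the archive.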

This ordered “user manual” includes clear guidelines on registering a study. Every data submission to ENA requires a registered study object, which allows the database to group related objects together. The analysis of each sample (the sequenced biomaterial) follows the upload of the raw reads, which for each experiment allow additional information to be stored in support of the study. A study and its data are not published, and so do not become public, until the study release date has passed. Once a study has been published, the submitter can no longer withdraw it.

When a study is registered, Webin creates two accession numbers, each serving a different purpose. The first is the BioProject accession number (which starts with PRJEB), intended for use in journal publications; the second is a study accession in the Sequence Read Archive (SRA) style (which starts with ERP), used to refer to the study within the ENA database. Studies can be submitted to ENA in two ways: interactively, by filling in a form, or programmatically, by submitting the study in XML (eXtensible Markup Language) format.

Thanks to ENA, biological systems can be assembled from shared data. ENA gives data submitters a place to publish and compare their research; it gives direct data consumers and secondary service providers a blueprint on which to build their own databases and resources for the future sharing of data; and it gives data coordinators of sequence-based studies a platform to work with in almost every area of the life sciences, such as genomics, marine biotechnology, pathogen surveillance, livestock and stem cell biology. Among the service providers that build on ENA are UniProt, RNAcentral, Ensembl Genomes, ArrayExpress and many others.
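
The programmatic route for registering a study, mentioned earlier in this section, can be sketched as follows. The example builds a small study document in XML and checks whether an accession has the BioProject form that starts with PRJEB. The element names (`PROJECT_SET`, `PROJECT`, `TITLE`, `DESCRIPTION`, `SUBMISSION_PROJECT`, `SEQUENCING_PROJECT`) are assumptions based on the ENA submission documentation, and the alias, title, description and accession are hypothetical; a real submission should be validated against the schema that ENA publishes.

```python
import re
import xml.etree.ElementTree as ET


def build_study_xml(alias, title, description):
    """Return a study (project) registration document as an XML string."""
    root = ET.Element("PROJECT_SET")
    project = ET.SubElement(root, "PROJECT", alias=alias)
    ET.SubElement(project, "TITLE").text = title
    ET.SubElement(project, "DESCRIPTION").text = description
    # Assumed marker elements declaring a sequencing project.
    submission = ET.SubElement(project, "SUBMISSION_PROJECT")
    ET.SubElement(submission, "SEQUENCING_PROJECT")
    return ET.tostring(root, encoding="unicode")


def is_bioproject_accession(accession):
    """True if the accession is PRJEB followed by digits only."""
    return re.fullmatch(r"PRJEB\d+", accession) is not None


# Hypothetical study details and accession, for illustration only.
study_xml = build_study_xml(
    alias="example_study_001",
    title="Example sequencing study",
    description="Placeholder description of the study.",
)
print(study_xml)
print(is_bioproject_accession("PRJEB12345"))
```

Generating the document with an XML library rather than by string concatenation ensures that special characters in titles and descriptions are escaped correctly.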

The ENA workflow

As mentioned in the introduction, ENA collects data from several sources. These sources can provide ENA with different types of information, ranging from raw data to annotations and from small-scale to large-scale sequencing, according to the capacity of the submitting research laboratory. In addition to this flow of submissions, ENA exchanges data with its INSDC partners, which greatly increases the data available for analysis. Everything begins with the isolation and preparation of the biological material for sequencing; the material is then sequenced, and the data are recorded in the databases, where they are ready for further bioinformatic analysis. The ENA workflow is divided into three phases:

  1. input information
  2. output information
  3. interpreted information

This workflow has become a central step in disseminating research findings to the scientific community. The information is accessible without restriction to scientists worldwide. Anybody can download the data generated through the ENA workflow to begin a new study which, once completed, may re-enter the database and improve its reliability. Scientists who use the ENA database are also free, and encouraged by INSDC policy, to publish any analysis or critique, provided that appropriate credit is given by citing the original submission. There are no use restrictions or licensing requirements on any sequence data record (as stated in the Nucleotide Sequence Database Policies of November 2002), and the INSDC accepts no responsibility if submitters upload data to which they do not hold the rights. To compensate for this limited editorial control and to favour internal integrity checks, ENA has adopted the following simple and clear reporting standards, which are intended to strengthen the workflow and to avoid technical errors:

  1. BARCODE – Minimum information about a species BARCODE sequence
  2. GMI:MDM – Minimal Data for Mapping in relation to the Global Microbial Identifier pathogen tracking initiative
  3. Micro B3 – Minimum information about marine microbial sampling
  4. MINSEQE – Minimum Information about a high-throughput Nucleotide SeQuencing Experiment
  5. MIxS – Minimum Information about any (x) Sequence
  6. Influenza/COMPARE – Minimum Information for reporting of Influenza virus samples
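
The reporting standards listed above are "minimum information" checklists: a record is acceptable only if every required field has a value. The Python sketch below illustrates that kind of check. The field names attached to each checklist are hypothetical and chosen only for illustration; the real checklists define their own mandatory fields.

```python
# Hypothetical required fields; the real checklists define their own.
REQUIRED_FIELDS = {
    "MIxS": ["collection_date", "geographic_location", "environment"],
    "Influenza/COMPARE": ["collection_date", "host", "virus_subtype"],
}


def missing_fields(checklist, sample):
    """List the required fields that are absent or empty in a sample record."""
    required = REQUIRED_FIELDS[checklist]
    return [
        name for name in required
        if not str(sample.get(name, "")).strip()
    ]


# Hypothetical sample record with one empty and one absent field.
sample = {"collection_date": "2019-11-13", "environment": ""}
print(missing_fields("MIxS", sample))
```

An empty result means the record satisfies the checklist as modelled here. Reporting the missing fields by name, instead of returning a simple pass or fail, tells the submitter exactly what to complete.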

Central to this is the harmonization of data and metadata collections, which is essential for research but requires more effort than data generation itself. ENA requires that newly submitted data include a description of the provenance of the nucleotide sequence and the functional annotation of nucleotide sequence domains. These requirements are implemented through guidelines produced by the INSDC for the submission of assemblies and/or annotations and of genome assemblies (e.g. chromosomes). Submitters can also upload data to specialized databases, but they can access such repositories only once they have received specific accession numbers from ENA. This is not meant to limit access; it is a necessary requirement, because such specialized databases have a characteristic structure that requires the user to be familiar with the nomenclature approved and used for those collections.

In conclusion, ENA and the INSDC are among the greatest expressions of shared knowledge about genetic resources. Their success has been made possible by the scientists in private and public laboratories whose work builds a solid basis for the understanding of life across the global scientific community and beyond. It gives those with few resources access to immense ones, and it lets those with great resources reach in minutes results that, on their own, would have taken them decades.

Resources

Amid Clara et al. The European Nucleotide Archive in 2019 [Journal] // Nucleic Acids Research. – [s.l.] : Oxford University Press, 8 January 2020. – D1 : Vol. 48. – pp. D70-D76. – Published online on 13 November 2019, ahead of the journal issue of 8 January 2020.

European Nucleotide Archive Changes to public data release mechanisms for sequence and study records [Online] // ENA – European Nucleotide Archive. – European Molecular Biology Laboratory. – https://www.ebi.ac.uk/ena/about/data-release-mechanism.

European Nucleotide Archive ebi.ac.uk [Online] // ENA – European Nucleotide Archive. – European Molecular Biology Laboratory (EMBL). – https://www.ebi.ac.uk/ena/about.

European Nucleotide Archive ENA data formats [Online] // ENA – European Nucleotide Archive. – EMBL-EBI. – ALPHA. – https://www.ebi.ac.uk/ena/submit/data-formats.

European Nucleotide Archive Searching ENA [Online] // ENA – European Nucleotide Archive. – EMBL-EBI. – https://www.ebi.ac.uk/ena/browse.

European Nucleotide Archive Standards and policies [Online] // ENA – European Nucleotide Archive. – European Molecular Biology Laboratory (EMBL). – https://www.ebi.ac.uk/ena/standards-and-policies.

Leinonen Rasko et al. The European Nucleotide Archive [Journal] // Nucleic Acids Research. – 22 October 2010. – Vol. 39. – pp. D28-D31.

Pakaseresht Nima et al. Assembly information services in the European Nucleotide Archive [Journal] // Nucleic Acids Research. – [s.l.] : Oxford University Press, 8 November 2013. – Vol. 42.

Silvester Nicole et al. The European Nucleotide Archive in 2017 [Journal]. – 13 November 2017. – Vol. 46.

Acronyms and abbreviations

  • DDBJ – DNA Data Bank of Japan
  • EBI – European Bioinformatics Institute
  • EMBL – European Molecular Biology Laboratory
  • ENA – European Nucleotide Archive
  • FTP – File Transfer Protocol
  • INSDC – International Nucleotide Sequence Database Collaboration
  • NCBI – National Center for Biotechnology Information
  • PRJEB – Starting code of the BioProject accession number
  • SRA – Sequence Read Archive (accession number)
  • XML – eXtensible Markup Language

Suggested websites to visit after reading this article

The ENA browser improvements of 2019

A few examples of ENA service providers