Decoding SRA Accessions: A Primer on Navigating NCBI's Sequence Read Archive
Overview
If you work with high-throughput sequencing data, chances are you have encountered the Sequence Read Archive (SRA) from the National Center for Biotechnology Information (NCBI). SRA is the largest publicly accessible repository of high-throughput sequencing data. It holds data submitted not only to NCBI but also to two partner organizations: the European Molecular Biology Laboratory's European Bioinformatics Institute (EMBL-EBI, shortened to EMBL in this article) and the DNA Data Bank of Japan (DDBJ). These three organizations form the International Nucleotide Sequence Database Collaboration. Over the years, many users and organizations have submitted large amounts of data to SRA. The following figure shows the components of SRA submissions.
A submission consists of a project that seeks to explore a biological or technical question. The exploration involves running sequencing experiments on certain biological samples. Each component of the project carries specific information. After submission, that information is stored under different accessions. However, navigating and understanding the different accessions used in SRA can be confusing. In this article, we will explore the various accessions in SRA and their uses, and show which accession to look for when seeking specific information. For example, the source of the data can be determined from the first letter of the accession prefix: S for NCBI, E for EMBL, and D for DDBJ. The following sections give the NCBI prefix for each accession.
Let’s begin.
Important SRA accessions
1. Study (Prefix: SRP)
A Study is an object imported from BioProject, which contains metadata describing a sequencing study or project. It provides information such as study type, abstract, links to study publications, and other relevant details. The following figure shows an SRA Study page.
2. Experiment (Prefix: SRX)
An Experiment is an object that contains metadata describing the library preparation, the sequencing technique, instrument used, and other experiment-specific details. A snapshot of an SRA Experiment is shown below. Unlike SRA Study, the long description is not the abstract. Can you guess which organization is the source of this experiment? (Hint: Prefix)
Did you notice that multiple runs belong to the same experiment above?
3. Run (Prefix: SRR)
A Run is an object that contains the actual sequencing data for a particular sequencing experiment. An Experiment may have multiple Runs. The Run accession allows users to access the raw sequencing data. The SRA Run Browser provides an interactive web interface to view run-related information and download the data. Here’s an example. The data can also be accessed and downloaded in other ways, such as with the SRA Toolkit or from AWS or GCP.
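The relationship between the three objects described so far can be sketched as a small data structure. This is only an illustration of the hierarchy (one Study, several Experiments, each with one or more Runs), not how SRA stores anything internally. The class names, the `run_accessions` helper, and the accession values are all hypothetical placeholders chosen for their format and are not meant to point at real records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Run:
    accession: str  # SRR-style accession; holds the actual sequencing data


@dataclass
class Experiment:
    accession: str  # SRX-style accession; library and instrument metadata
    runs: List[Run] = field(default_factory=list)


@dataclass
class Study:
    accession: str  # SRP-style accession; project-level metadata
    experiments: List[Experiment] = field(default_factory=list)

    def run_accessions(self) -> List[str]:
        """Collect every Run accession under this Study, in order."""
        return [run.accession for exp in self.experiments for run in exp.runs]


# Placeholder accessions, for illustration only.
study = Study(
    "SRP0000001",
    [
        Experiment("SRX0000001", [Run("SRR0000001"), Run("SRR0000002")]),
        Experiment("SRX0000002", [Run("SRR0000003")]),
    ],
)
print(study.run_accessions())
```

The first Experiment in the sketch has two Runs, which mirrors the case where multiple runs belong to the same experiment.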
Metadata accessions
Technically, the resources below are not part of SRA. Please see the third note in the Notes section for details.
BioProject (Prefix: PRJNA)
BioProject provides a description of the research project associated with the sequencing study. It includes information such as the study’s abstract, links to publications (if available), and links to project data, including experiments, PubMed, and GEO datasets.
BioSample (Prefix: SAMN)
BioSample contains sample-specific details, such as disease information, sex, response to drug, race, and other relevant attributes. The practical challenge with BioSample is that attribute names vary across submissions. For example, in one study a BioSample could have an attribute called race, and in another the same attribute could be called ethnicity. This makes cross-study comparisons difficult.
Source identification using accession
For most accessions, you can identify the source from the first letter of the prefix. As mentioned above: S for NCBI, E for EMBL, and D for DDBJ. For example, an Experiment has the prefix SRX for NCBI, ERX for EMBL, and DRX for DDBJ. However, for the metadata accessions BioProject and BioSample, the source can't be identified from the first letter; it is indicated later in the prefix, as shown in the table below.
Source | Accession | Prefix |
---|---|---|
NCBI | BioProject | PRJNA |
EMBL | BioProject | PRJEB |
DDBJ | BioProject | PRJDB |
NCBI | BioSample | SAMN |
EMBL | BioSample | SAME |
DDBJ | BioSample | SAMD |
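The two rules above (first letter for SRA objects, the table for metadata accessions) can be combined into one lookup. The following is a minimal sketch in Python, not an official NCBI utility. The function name `accession_source` is hypothetical, and the regular expression assumes that the first-letter rule applies to all six SRA object types and that an accession is a prefix followed only by digits.

```python
import re

# Metadata prefixes, copied from the table above.
_METADATA_PREFIXES = {
    "PRJNA": "NCBI", "PRJEB": "EMBL", "PRJDB": "DDBJ",
    "SAMN": "NCBI", "SAME": "EMBL", "SAMD": "DDBJ",
}

# First-letter rule for SRA objects such as SRX, ERX, DRX.
_FIRST_LETTER = {"S": "NCBI", "E": "EMBL", "D": "DDBJ"}

# Assumption: source letter, then "R", then one object letter
# (A, P, X, R, S, Z), then digits.
_SRA_PATTERN = re.compile(r"^[SED]R[APXRSZ]\d+$")


def accession_source(accession: str) -> str:
    """Return 'NCBI', 'EMBL', or 'DDBJ' for an accession string."""
    acc = accession.strip().upper()
    # Check metadata prefixes first: every BioSample prefix starts with
    # "S", so the first-letter rule would mislabel SAME and SAMD.
    for prefix, source in _METADATA_PREFIXES.items():
        if acc.startswith(prefix):
            return source
    if _SRA_PATTERN.match(acc):
        return _FIRST_LETTER[acc[0]]
    raise ValueError(f"Unrecognized accession: {accession!r}")


# Placeholder accessions, for illustration only.
for example in ["SRX0000001", "ERR0000001", "PRJDB00001", "SAME0000001"]:
    print(example, "->", accession_source(example))
```

Checking the metadata prefixes before the first letter is the important design choice here, since a BioSample from EMBL or DDBJ would otherwise be reported as coming from NCBI.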
Other SRA accessions
There are webpages corresponding to each of the above accessions. However, there are other accessions as well.
4. Sample Accession (Prefix: SRS):
A Sample is an object imported from BioSample, which contains metadata describing the physical sample upon which a sequencing experiment was performed. It provides information about the biological material, such as its nature, characteristics, and relevant attributes. However, it has no webpage of its own, so to get the sample information you visit the BioSample webpage instead.
5. Analysis Accession (Prefix: SRZ):
An Analysis is an object that contains a sequence data analysis BAM file and metadata describing the sequence analysis. However, I could not find specific examples of Analysis accessions. If any reader finds one, please let me know. :)
6. Submission Accession (Prefix: SRA):
The submission accession represents a virtual container that holds the objects represented by the other accessions. It is used to track the submission in the SRA archive. However, this accession does not have a specific response page associated with it.
Summary
To summarize the accessions and their uses:
- SRA Study Accession (SRP#): Contains metadata describing the sequencing study or project imported from BioProject.
- SRA Experiment Accession (SRX#): Provides metadata information about a specific sequencing experiment.
- SRA Run Accession (SRR#): Contains the actual sequencing data for a particular experiment.
- SRA Sample Accession (SRS#): Contains metadata associated with the physical sample imported from BioSample.
- SRA Analysis Accession (SRZ#): Contains a sequence data analysis BAM file and associated metadata (specific examples not found).
- SRA Submission Accession (SRA#): Virtual container holding other accessions, used for tracking submissions.
Understanding these accessions and their relationships can greatly help researchers utilize the SRA effectively. For example, if you are interested in the details of a particular sequencing study, you should look for the Study Accession (SRP#). If you want to access the raw sequencing data, you should focus on the Run Accession (SRR#). Similarly, if you need information about the biological sample, the BioSample Accession (SAMN#) is what you should look for.
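The summary can also be expressed as a small prefix lookup that tells you what kind of object an accession refers to. This is a sketch under stated assumptions rather than a validated parser: the function name `accession_type` is hypothetical, only the NCBI prefixes listed in this article are covered, and an accession is assumed to be a prefix followed only by digits.

```python
# NCBI prefixes from this article, mapped to the object they identify.
_OBJECT_BY_PREFIX = {
    "SRP": "Study",
    "SRX": "Experiment",
    "SRR": "Run",
    "SRS": "Sample",
    "SRZ": "Analysis",
    "SRA": "Submission",
    "PRJNA": "BioProject",
    "SAMN": "BioSample",
}


def accession_type(accession: str) -> str:
    """Return the object type for an NCBI accession listed in this article."""
    acc = accession.strip().upper()
    # Try longer prefixes first so the match is unambiguous.
    for prefix in sorted(_OBJECT_BY_PREFIX, key=len, reverse=True):
        rest = acc[len(prefix):]
        if acc.startswith(prefix) and rest.isdigit():
            return _OBJECT_BY_PREFIX[prefix]
    raise ValueError(f"Unrecognized accession: {accession!r}")


# Placeholder accessions, for illustration only.
for example in ["SRP0000001", "SRR0000001", "SAMN00000001"]:
    print(example, "->", accession_type(example))
```

For raw data you would then follow accessions reported as Run, and for sample attributes those reported as BioSample.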
Conclusion
In conclusion, the Sequence Read Archive (SRA) is a valuable resource for researchers working with high-throughput sequencing data. By understanding the different accessions and their uses, users can navigate the SRA more effectively. Whether you are interested in the study details, experiment metadata, raw sequencing data, or sample information, knowing which accession to look for will help you find the relevant information efficiently.
Notes
- This article is written with respect to the web interface and not the CLI.
- ChatGPT was used to help improve the article, but not as a source of information. All information comes from NCBI.
- The SRA knowledgebase mentions six types of accessions, which are numbered in this article. BioSample and BioProject are separate NCBI resources, like SRA. They are called metadata accessions here and are included in this article because of their importance.