Genome sequencing has become fast, affordable, and accessible, even for small labs. But once you have a raw genome assembly, the real scientific value comes from annotating it.
In simple terms, genome annotation is the process of identifying the biological meaning of DNA sequences. Sequencing gives you the letters (A, T, G, and C). Annotation turns them into information: genes, proteins, functions, pathways, and biological roles.
In this comprehensive guide, you’ll learn:
- What is genome annotation
- Why annotation matters in microbial genomics
- The difference between structural and functional annotation
- The complete step-by-step annotation process
- The most widely used tools: Prokka, PGAP, RAST, InterProScan
- Best practices for accurate results
- When automatic pipelines fail—and how to fix them
Let’s start with the fundamentals.
1. What Is Genome Annotation?
Genome annotation is the process of describing the location and function of genomic features within a DNA sequence.
When you sequence a genome, the FASTA file contains nothing but:
ATGCGTCGTGACCT…
Annotation transforms this raw sequence into a biological blueprint by identifying:
- Genes (CDS)
- Open Reading Frames (ORFs)
- RNA genes (tRNA, rRNA, ncRNA)
- Promoters and regulatory regions
- Mobile elements (IS elements, prophages)
- Operons and clusters
- Protein functions and metabolic roles
In microbial genomics, annotation is essential for:
- Species identification and taxonomic placement
- Antimicrobial resistance prediction
- Pathway reconstruction
- Comparative genomics
- Metagenomic bin annotation
- Understanding virulence factors
- Industrial strain engineering
Without annotation, a genome is essentially a long string of characters with no meaning.
2. Two Main Types of Genome Annotation
Genome annotation is usually divided into structural and functional annotation.
Both are required to produce a complete and biologically useful genome.
2.1 Structural Annotation
Structural annotation identifies the physical components of the genome:
- Protein-coding genes (CDS)
- Open reading frames (ORFs)
- 5’ and 3’ UTRs (rarely annotated in bacteria, routinely annotated in eukaryotes)
- rRNA operons
- tRNA genes
- Repeats
- GC content and GC skew
- Mobile elements
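Of the features listed above, GC content and GC skew are simple enough to compute directly from the sequence. The following is a minimal sketch, not how any particular pipeline implements it; the function names and the window and step defaults are illustrative assumptions.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous bases (N and other symbols are ignored)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def gc_skew(seq: str, window: int = 1000, step: int = 1000) -> list[tuple[int, float]]:
    """GC skew, (G - C) / (G + C), per window, returned as (window_start, skew) pairs."""
    seq = seq.upper()
    result = []
    # max(..., 1) ensures a sequence shorter than one window still yields one value.
    for start in range(0, max(len(seq) - window + 1, 1), step):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        result.append((start, (g - c) / (g + c) if g + c else 0.0))
    return result
```

In circular bacterial chromosomes, a sign change in windowed GC skew is often used as a rough indicator of the replication origin or terminus, which is one reason the metric appears in structural annotation reports.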
Tools commonly used:
- Prodigal (default in most pipelines)
- Glimmer
- Barrnap (rRNA)
- tRNAscan-SE
Accuracy in structural annotation depends on:
- Genome completeness
- Sequencing depth
- Assembly contiguity (Nanopore/PacBio help dramatically)
A fragmented Illumina-only assembly often leads to:
- Missing plasmids
- Split genes
- Misannotated rRNA operons
This is why hybrid assemblies or long-read assemblies give superior annotation quality.
2.2 Functional Annotation
Once genes are predicted, each must be assigned a function, or at least an annotation that reflects known homology.
Functional annotation includes:
- Assigning protein names
- Mapping Enzyme Commission (EC) numbers
- Adding Gene Ontology (GO) terms
- Assigning KEGG Orthology (KO) identifiers
- Predicting COG categories
- Identifying virulence factors and AMR genes
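Key databases used include Pfam and InterPro (protein domains), KEGG and COG (orthology and pathways), and Gene Ontology (functional terms).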
Functional annotation is the step where your genome becomes scientifically interpretable.
3. Why Genome Annotation Matters in Microbial Genomics
Annotation enables researchers to:
1. Understand metabolism and pathways
Which nutrients can the microbe use?
Does it fix nitrogen?
Can it degrade pollutants?
2. Identify virulence and AMR genes
Critical for clinical and public health microbiology.
3. Detect plasmids and mobile elements
Plasmids can encode:
- Resistance
- Toxins
- Biotechnologically relevant pathways
4. Perform comparative genomics
Ortholog detection depends on well-annotated genomes.
5. Industrial strain engineering
Annotation guides metabolic engineering and gene knockouts.
Without annotation, none of these insights are possible.
4. Step-by-Step: The Genome Annotation Workflow
This section walks through the full annotation process, from assembly to exported files.
4.1 Step 1: Genome Assembly (Prerequisite)
Annotation requires a completed assembly.
Typical workflows:
Quality checks:
- N50
- Contig count
- Coverage
- GC %
- BUSCO completeness
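Several of these checks need nothing more than the list of contig lengths. As a rough sketch (the function name and returned keys are illustrative, not taken from any specific tool), N50 is the length of the contig at which the running total of lengths, sorted longest first, reaches half of the assembly size:

```python
def assembly_stats(contig_lengths: list[int]) -> dict[str, int]:
    """Contig count, total size, longest contig, and N50 from a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    total = sum(lengths)
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        # N50: the first contig at which the cumulative length covers half the assembly.
        if running * 2 >= total:
            n50 = length
            break
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "longest": lengths[0] if lengths else 0,
        "n50": n50,
    }
```

A high contig count combined with a low N50 is a warning sign that genes may be split across contig boundaries before annotation even starts.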
4.2 Step 2: Structural Annotation
Tools predict:
- CDS / ORFs
- tRNA
- rRNA
- Repeats
- Mobile elements
Most pipelines use Prodigal (bacterial gene finder).
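To make the idea of CDS/ORF prediction concrete, here is a deliberately naive ORF scan. It is a toy illustration only, not Prodigal's algorithm: real gene finders also model ribosome binding sites, alternative start codons, and codon usage. The function names and the `min_len` default are assumptions for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 90) -> list[tuple[int, int, str]]:
    """Naive ORF scan: first ATG to the next in-frame stop codon, on both strands.

    Returns (start, end, strand) in 0-based, half-open forward-strand coordinates.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, n - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3  # include the stop codon
                    if end - start >= min_len:
                        if strand == "+":
                            orfs.append((start, end, "+"))
                        else:
                            # Map reverse-strand coordinates back to the forward strand.
                            orfs.append((n - end, n - start, "-"))
                    start = None
    return orfs
```

Usage: `find_orfs(contig_sequence, min_len=300)` returns candidate coding regions. Comparing such raw ORFs with a real gene finder's output shows how many open frames are not genes at all.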
4.3 Step 3: Functional Annotation
Functional annotation leverages homology-based approaches.
Typical steps:
- BLAST protein sequences against curated databases
- Assign protein names
- Map GO, COG, KEGG, EC numbers
- Annotate domains (Pfam, InterPro)
- Identify AMR / virulence genes
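The first two steps above can be sketched as parsing BLAST tabular output (`-outfmt 6`, whose 12 default columns are listed in the code) and keeping the best-scoring hit per query protein. The identity and e-value cutoffs below are illustrative assumptions, not recommended values, and the helper name is hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(blast_tsv: str, min_identity: float = 30.0,
              max_evalue: float = 1e-5) -> dict[str, dict]:
    """Return the highest-bitscore hit per query that passes both cutoffs."""
    best: dict[str, dict] = {}
    reader = csv.DictReader(io.StringIO(blast_tsv),
                            fieldnames=BLAST6_FIELDS, delimiter="\t")
    for row in reader:
        identity = float(row["pident"])
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        if identity < min_identity or evalue > max_evalue:
            continue  # too weak to transfer a name from
        current = best.get(row["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[row["qseqid"]] = {
                "subject": row["sseqid"],
                "identity": identity,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best
```

Queries with no hit passing the cutoffs are simply absent from the result; in practice these are the proteins that end up labeled "hypothetical protein".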
4.4 Step 4: Quality Control of Annotation
After annotation:
- Check missing rRNAs (common in Illumina-only assemblies)
- Inspect pseudogene calls
- Manually inspect suspicious ORFs
- Compare against reference genomes
Tools like Artemis, IGV, or UGENE help.
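Before opening a genome browser, a quick automated check can flag obviously missing feature types. The sketch below counts the feature types in column 3 of a GFF3 file and reports which expected types are absent; the function names and the default list of expected types are assumptions for illustration.

```python
from collections import Counter


def count_feature_types(gff3_text: str) -> Counter:
    """Count GFF3 feature types (column 3), ignoring comments and any embedded FASTA."""
    counts: Counter = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # sequence section follows; no more feature lines
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9:  # a valid GFF3 feature line has exactly nine columns
            counts[cols[2]] += 1
    return counts


def missing_features(counts: Counter, expected=("CDS", "tRNA", "rRNA")) -> list[str]:
    """Return the expected feature types that have no entries at all."""
    return [ftype for ftype in expected if counts[ftype] == 0]
```

An annotation with zero rRNA features, for example, is worth a manual look, since it may mean the rRNA operons collapsed or were lost during assembly.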
4.5 Step 5: Export Annotation
Output files:
- GFF3
- GenBank (.gbk)
- Table (.tbl)
- Amino acid FASTA
- Nucleotide FASTA
- Annotation summary tables
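For reference, a GFF3 feature is one line of nine tab-separated columns: seqid, source, type, start, end, score, strand, phase, and attributes, with 1-based inclusive coordinates. The sketch below formats one such line; the helper name and the example values are hypothetical, and the escaping is a simplified take on the GFF3 rule that reserved characters in attribute values must be percent-encoded.

```python
from urllib.parse import quote


def gff3_line(seqid: str, source: str, ftype: str, start: int, end: int,
              strand: str, attributes: dict[str, str],
              score: str = ".", phase: str = ".") -> str:
    """Format one GFF3 feature line (1-based, inclusive start and end)."""
    # Percent-encode reserved characters such as ';', '=' and ',' in values;
    # spaces are left as-is, which GFF3 permits in column 9.
    attr = ";".join(f"{key}={quote(str(value), safe=' ')}"
                    for key, value in attributes.items())
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      score, strand, phase, attr])


line = gff3_line("contig_1", "toy", "CDS", 3, 17, "+",
                 {"ID": "cds_0001", "product": "DNA gyrase, subunit A"},
                 phase="0")
```

In practice the annotation pipeline writes these files for you; knowing the layout mainly helps when filtering or correcting features by hand.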
These formats are commonly needed for:
- PGAP submission
- NCBI BioProject / BioSample
- Downstream analyses (differential gene expression, variant calling, pan-genomics)
5. Popular Genome Annotation Tools (Pros & Cons)
The tools below cover most microbial annotation needs. Each has distinct strengths and trade-offs.
5.1 Prokka
Probably the most widely used tool for bacterial annotation.
Pros:
- Fast (minutes per genome)
- Easy to run
- Local and offline
- Custom databases supported
Cons:
- Depends heavily on included databases
- Fewer curated protein names than PGAP
Great for microbial genomics, MAGs, and metagenomic bins.
5.2 NCBI PGAP
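NCBI's Prokaryotic Genome Annotation Pipeline. NCBI uses it to annotate prokaryotic RefSeq genomes and offers it as an option for GenBank submissions, so it is the de facto standard for NCBI records, although submitters may also supply their own annotation.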
Pros:
- Extremely high-quality curated annotations
- Standardized naming
- Good for clinical strains
Cons:
- Strict requirements
- Slower
- Difficult for fragmented assemblies
5.3 RAST / RASTtk
Web-based annotation server.
Pros:
- Easy interface
- KEGG-like functional categorization
Cons:
- Slower
- Requires account
- Not ideal for very large datasets
5.4 Bakta
A modern replacement for Prokka.
Pros:
- Uses high-quality curated databases
- Consistent annotation
- Very fast
Cons:
- Newer (less documented)
5.5 InterProScan
For protein domain annotation.
Pros:
- Deep domain-level annotation
- Excellent for functional analyses
Cons:
- Very computationally heavy
5.6 EggNOG-mapper
Best for broad functional annotation:
- COG
- GO
- KEGG Orthology
- EC numbers
Especially useful in metagenomics.
6. Challenges in Genome Annotation
Even with good tools and a sound workflow, genome annotation is not perfect. Common issues:
1. Fragmented assemblies
→ Missing genes
→ Split CDS
2. Frameshifts
→ False pseudogenes
3. Incorrect protein names
→ Homology-based errors
4. Incomplete pathways
→ Missing enzymes lead to misinterpretation
5. False positives in AMR or virulence genes
This is why many labs combine multiple tools.
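One simple way to combine tools is to compare their gene calls. The sketch below assumes each call is a `(contig, start, end, strand)` tuple with 1-based inclusive coordinates and `start < end`, and treats two calls as the same gene when they share contig, strand, and stop position. That matching rule is a common convention rather than a standard, and the function names are illustrative.

```python
Call = tuple[str, int, int, str]  # (contig, start, end, strand)


def stop_key(contig: str, start: int, end: int, strand: str) -> tuple[str, str, int]:
    """Identify a gene by its stop position: 'end' on '+', 'start' on '-'."""
    return (contig, strand, end if strand == "+" else start)


def compare_calls(calls_a: list[Call], calls_b: list[Call]) -> dict[str, list]:
    """Partition gene calls from two tools into agreeing and disagreeing sets."""
    a = {stop_key(*call): call for call in calls_a}
    b = {stop_key(*call): call for call in calls_b}
    shared = a.keys() & b.keys()
    return {
        "identical": sorted(a[k] for k in shared if a[k] == b[k]),
        "same_stop_different_start": sorted((a[k], b[k]) for k in shared if a[k] != b[k]),
        "only_a": sorted(a[k] for k in a.keys() - b.keys()),
        "only_b": sorted(b[k] for k in b.keys() - a.keys()),
    }
```

Genes called by only one tool, or with disagreeing start codons, are good candidates for the manual inspection described in Step 4.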
7. Best Practices for High-Quality Annotation
1. Use high-quality assemblies
Assemblies that include Nanopore or PacBio long reads generally yield much better annotations than Illumina-only assemblies.
2. Use multiple databases
No single database is perfect.
3. Perform manual curation
When possible.
4. Submit to NCBI PGAP for official annotation
5. Keep annotation reproducible
Document tool versions and parameters.
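One lightweight way to do this is to write a small provenance record next to each annotation. The sketch below is one possible layout, not an established standard: the field names are assumptions, and the version strings in the example are placeholders to be replaced with what each tool actually reports.

```python
import hashlib
import json
import platform
from datetime import datetime, timezone


def provenance_record(assembly_bytes: bytes, tools: dict[str, str],
                      parameters: dict[str, object]) -> str:
    """Return a JSON record of the input checksum, tool versions, and parameters."""
    record = {
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "python": platform.python_version(),
        # The checksum ties the annotation to the exact assembly that was annotated.
        "input_sha256": hashlib.sha256(assembly_bytes).hexdigest(),
        "tools": tools,
        "parameters": parameters,
    }
    return json.dumps(record, indent=2, sort_keys=True)


# Placeholder versions and parameters, for illustration only.
example = provenance_record(
    b">contig_1\nATGAAATAA\n",
    tools={"prokka": "x.y.z", "eggnog-mapper": "x.y.z"},
    parameters={"cpus": 8, "min_identity": 30.0},
)
```

Saving this string as a `.json` file alongside the GFF3 and GenBank outputs makes it possible to rerun or audit the annotation later.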
8. Conclusion
Genome annotation is a crucial step in converting raw sequence data into meaningful biological knowledge. Whether you are working with a new isolate, a metagenomic bin, or a clinical pathogen, annotating the genome carefully is essential.
By combining structural and functional tools such as Prokka, PGAP, RAST, Bakta, and EggNOG-mapper, you can generate high-quality annotations suitable for publication, comparative genomics, and downstream analyses.
This guide can serve as a reference for researchers, students, and bioinformatics practitioners who want to understand what genome annotation is, how it works, and how to annotate genomes with modern tools.