What Is Genome Annotation? A Complete Guide to Understanding and Annotating Genomes


Learn what genome annotation is, how it works, and which tools are most widely used. A complete guide to structural and functional genome annotation for beginners.


Genome sequencing has become fast, affordable, and accessible, even for small labs. But after obtaining a raw genome assembly, the real scientific value comes from annotating it.

In simple terms, genome annotation is the process of identifying the biological meaning of DNA sequences. Sequencing gives you the letters (A, T, G, and C). Annotation turns them into information: genes, proteins, functions, pathways, and biological roles.

In this comprehensive guide, you’ll learn:

  • What genome annotation is
  • Why annotation matters in microbial genomics
  • The difference between structural and functional annotation
  • The complete step-by-step annotation process
  • The most widely used tools: Prokka, PGAP, RAST, InterProScan
  • Best practices for accurate results
  • When automatic pipelines fail—and how to fix them

Let’s start with the fundamentals.


1. What Is Genome Annotation?

Genome annotation is the process of describing the location and function of genomic features within a DNA sequence.

When you sequence a genome, the FASTA file contains nothing but:
ATGCGTCGTGACCT…

Annotation transforms this raw sequence into a biological blueprint by identifying:

  • Genes (CDS)
  • Open Reading Frames (ORFs)
  • RNA genes (tRNA, rRNA, ncRNA)
  • Promoters and regulatory regions
  • Mobile elements (IS elements, prophages)
  • Operons and clusters
  • Protein functions and metabolic roles

In microbial genomics, annotation is essential for:

  • Species identification and taxonomic placement
  • Antimicrobial resistance prediction
  • Pathway reconstruction
  • Comparative genomics
  • Metagenomic bin annotation
  • Understanding virulence factors
  • Industrial strain engineering

Without annotation, a genome is essentially a long string of characters with no meaning.
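
As a minimal sketch of what "turning letters into information" means, the Python snippet below stores one annotated feature as a small record and translates its coding sequence with the standard genetic code. The toy sequence, the coordinates, the `Feature` class, and the product name are hypothetical illustrations, not output from any real annotation tool.

```python
from dataclasses import dataclass
from itertools import product

# Standard genetic code in the compact TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


@dataclass
class Feature:
    """One annotated feature (hypothetical minimal record)."""
    seqid: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str
    type: str
    product: str


def translate(cds):
    """Translate a forward-strand coding sequence; unknown codons become X."""
    codons = (cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


# Toy genome and a hand-written feature (illustrative only).
genome = "CCATGAAACGTGGCTAAGG"
feature = Feature("toy_contig", 3, 17, "+", "CDS", "hypothetical protein")

cds = genome[feature.start - 1:feature.end]
protein = translate(cds)
print(feature.type, feature.start, feature.end, protein)
```

The record is deliberately small; real formats such as GFF3 or GenBank carry the same core fields plus many more qualifiers.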


2. Two Main Types of Genome Annotation

Genome annotation is usually divided into structural and functional annotation.
Both are required to produce a complete and biologically useful genome.


2.1 Structural Annotation

Structural annotation identifies the physical components of the genome:

  • Protein-coding genes (CDS)
  • Open reading frames (ORFs)
  • 5’ and 3’ UTRs (rarely annotated in bacteria, common in eukaryotes)
  • rRNA operons
  • tRNA genes
  • Repeats
  • GC content and GC skew
  • Mobile elements
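
Two of the items above, GC content and GC skew, are simple enough to compute by hand. The sketch below assumes the usual definition of skew as (G − C) / (G + C) per non-overlapping window; the function names, the window size, and the toy sequence are illustrative choices.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return gc / total if total else 0.0


def gc_skew(seq, window):
    """(G - C) / (G + C) for each complete, non-overlapping window."""
    seq = seq.upper()
    skews = []
    for i in range(0, len(seq) - window + 1, window):
        chunk = seq[i:i + window]
        g, c = chunk.count("G"), chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


toy = "GGGCATATCCCGATAT"  # toy sequence, not real data
print(gc_content(toy))
print(gc_skew(toy, window=8))
```

On real chromosomes the window would be thousands of bases wide, and a sign change in the skew is commonly used as a hint for the replication origin or terminus.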

Tools commonly used:

Accuracy in structural annotation depends on:

  • Genome completeness
  • Sequencing depth
  • Assembly contiguity (Nanopore/PacBio help dramatically)

A fragmented Illumina-only assembly often leads to:

  • Missing plasmids
  • Split genes
  • Misannotated rRNA operons

This is why hybrid assemblies or long-read assemblies give superior annotation quality.


2.2 Functional Annotation

Once genes are predicted, each must be assigned a function, or at least an annotation that reflects known homology.

Functional annotation includes:

  • Assigning protein names
  • Mapping Enzyme Commission (EC) numbers
  • Adding Gene Ontology (GO) terms
  • Assigning KEGG Orthology (KO) identifiers
  • Predicting COG categories
  • Identifying virulence factors and AMR genes

Key databases used:

Functional annotation is the step where your genome becomes scientifically interpretable.


3. Why Genome Annotation Matters in Microbial Genomics

Genome annotation enables researchers to:

1. Understand metabolism and pathways

Which nutrients can the microbe use?
Does it fix nitrogen?
Can it degrade pollutants?

2. Identify virulence and AMR genes

Critical for clinical and public health microbiology.

3. Detect plasmids and mobile elements

Plasmids can encode:

  • Resistance
  • Toxins
  • Biotechnologically relevant pathways

4. Perform comparative genomics

Ortholog detection depends on well-annotated genomes.

5. Industrial strain engineering

Annotation guides metabolic engineering and gene knockouts.

Without annotation, none of these insights are possible.


4. Step-by-Step: The Genome Annotation Workflow

This section walks through the entire annotation process, step by step.


4.1 Step 1: Genome Assembly (Prerequisite)

Annotation starts from an assembled genome, so assembly comes first.
Typical workflows:

Quality checks:

  • N50
  • Contig count
  • Coverage
  • GC %
  • BUSCO completeness
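
Two of these checks, contig count and N50, can be computed directly from the assembly FASTA. The following sketch uses only the standard library; the function names and the toy contigs are made up for illustration.

```python
def fasta_lengths(text):
    """Return {contig_name: length} from FASTA-formatted text."""
    lengths = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total of the
    length-sorted contigs first reaches half of the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


toy_fasta = """>c1
ACGTACGT
>c2
ACG
TAC
>c3
ACGT
>c4
AC
"""

contigs = fasta_lengths(toy_fasta)
print("contigs:", len(contigs))
print("total bases:", sum(contigs.values()))
print("N50:", n50(contigs.values()))
```

Coverage and BUSCO completeness need read alignments and marker-gene databases, so they are outside the scope of this small example.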

4.2 Step 2: Structural Annotation

Tools predict:

  • CDS / ORFs
  • tRNA
  • rRNA
  • Repeats
  • Mobile elements

Many bacterial pipelines, including Prokka, rely on Prodigal for gene prediction.
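
To illustrate the basic idea behind CDS prediction, here is a deliberately naive ORF scan over both strands. It is not how Prodigal works: dedicated gene finders also evaluate alternative start codons and coding potential. This sketch only reports stretches that begin at the first ATG in a frame and end at the next in-frame stop. The sequence and the `min_codons` threshold are toy assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def scan_strand(seq, min_codons):
    """Yield (start, end) as 0-based half-open intervals, stop codon included."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end, strand) tuples in 1-based forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    orfs = [(s + 1, e, "+") for s, e in scan_strand(seq, min_codons)]
    for s, e in scan_strand(reverse_complement(seq), min_codons):
        # Map reverse-strand positions back onto the forward strand.
        orfs.append((n - e + 1, n - s, "-"))
    return sorted(orfs)


toy = "CCATGAAACGTTAAGGTCACCCGGGCATT"  # one ORF on each strand
for orf in find_orfs(toy):
    print(orf)
```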


4.3 Step 3: Functional Annotation

Functional annotation leverages homology-based approaches.

Typical steps:

  1. Search protein sequences against curated databases (for example with BLAST)
  2. Assign protein names
  3. Map GO, COG, KEGG, EC numbers
  4. Annotate domains (Pfam, InterPro)
  5. Identify AMR / virulence genes
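
Steps 1 and 2 can be sketched as a "best hit" name transfer. The example assumes hits in the 12-column tabular layout produced by BLAST's `-outfmt 6`; the identity and coverage thresholds, the database IDs, and the product names are hypothetical values chosen for illustration, not recommended settings.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6).
OUTFMT6 = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, query_lengths, min_identity=40.0, min_coverage=0.7):
    """Keep the highest-bitscore hit per query that passes both thresholds."""
    best = {}
    reader = csv.DictReader(io.StringIO(tabular_text),
                            fieldnames=OUTFMT6, delimiter="\t")
    for row in reader:
        qid = row["qseqid"]
        aligned = int(row["qend"]) - int(row["qstart"]) + 1
        coverage = aligned / query_lengths[qid]
        if float(row["pident"]) < min_identity or coverage < min_coverage:
            continue
        score = float(row["bitscore"])
        if qid not in best or score > best[qid][1]:
            best[qid] = (row["sseqid"], score)
    return {qid: hit for qid, (hit, _score) in best.items()}


def assign_products(query_lengths, hits, product_names):
    """Transfer the product name of the best hit, else 'hypothetical protein'."""
    return {qid: product_names.get(hits.get(qid), "hypothetical protein")
            for qid in query_lengths}


# Toy inputs (all identifiers and numbers are made up).
rows = [
    ("gene_1", "dbB", "60.0", "90", "36", "0", "5", "94", "1", "90", "1e-20", "90"),
    ("gene_1", "dbA", "95.0", "100", "5", "0", "1", "100", "1", "100", "1e-50", "200"),
    ("gene_2", "dbC", "30.0", "80", "56", "0", "1", "80", "1", "80", "1e-3", "40"),
]
hits_text = "\n".join("\t".join(r) for r in rows)
query_lengths = {"gene_1": 100, "gene_2": 100, "gene_3": 120}
product_names = {"dbA": "DNA gyrase subunit A",
                 "dbB": "DNA topoisomerase",
                 "dbC": "ABC transporter"}

hits = best_hits(hits_text, query_lengths)
products = assign_products(query_lengths, hits, product_names)
print(products)
```

Genes without a convincing hit stay "hypothetical protein" rather than inheriting a name from a weak match, which is the conservative default in most pipelines.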

4.4 Step 4: Quality Control of Annotation

After annotation:

  • Check missing rRNAs (common in Illumina-only assemblies)
  • Inspect pseudogene calls
  • Manually inspect suspicious ORFs
  • Compare against reference genomes

Genome browsers such as Artemis, IGV, or UGENE help with visual inspection.
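
Some of these checks can be automated before opening a genome browser. The sketch below parses GFF3 text, counts features by type, and flags missing rRNA or tRNA genes and very short CDS calls. The 90 bp cutoff and the toy records are assumptions for illustration only.

```python
from collections import Counter


def summarize_gff3(text, min_cds_length=90):
    """Return (feature counts, short CDS list, warnings) for GFF3 text."""
    counts = Counter()
    short_cds = []
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; stop parsing features
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a valid feature line
        seqid, _source, ftype, start, end = cols[:5]
        counts[ftype] += 1
        if ftype == "CDS" and int(end) - int(start) + 1 < min_cds_length:
            short_cds.append((seqid, int(start), int(end)))
    warnings = [f"no {ftype} genes annotated"
                for ftype in ("rRNA", "tRNA") if counts[ftype] == 0]
    return counts, short_cds, warnings


# Toy GFF3 records (made up for this example).
records = [
    ("contig_1", "toy", "CDS", "100", "1299", ".", "+", "0", "ID=cds_1"),
    ("contig_1", "toy", "CDS", "1400", "1450", ".", "-", "0", "ID=cds_2"),
    ("contig_1", "toy", "tRNA", "2000", "2075", ".", "+", ".", "ID=trna_1"),
]
gff_text = "##gff-version 3\n" + "\n".join("\t".join(r) for r in records)

counts, short_cds, warnings = summarize_gff3(gff_text)
print(dict(counts))
print(short_cds)
print(warnings)
```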


4.5 Step 5: Export Annotation

Output files:

  • GFF3
  • GenBank (.gbk)
  • Table (.tbl)
  • Amino acid FASTA
  • Nucleotide FASTA
  • Annotation summary tables

These formats are used for:

  • Genome submission to NCBI GenBank (linked to a BioProject and BioSample)
  • Downstream analyses (differential gene expression, variant calling, pan-genomics)
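
For orientation, this is roughly what writing two of those outputs looks like: one GFF3 feature line (nine tab-separated columns) and one protein FASTA record. The helper names, the `toy` source label, and the example values are hypothetical; annotation pipelines write these files for you.

```python
from urllib.parse import quote


def gff3_line(seqid, ftype, start, end, strand, attributes,
              source="toy", phase="."):
    """Build one GFF3 line; reserved characters in attribute values
    (such as ';', '=' and ',') are percent-encoded."""
    attr = ";".join(f"{key}={quote(str(value), safe=' /:()')}"
                    for key, value in attributes.items())
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      ".", strand, phase, attr])


def fasta_record(identifier, description, sequence, width=60):
    """Build one FASTA record with the sequence wrapped at `width` characters."""
    lines = [f">{identifier} {description}"]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines)


line = gff3_line("contig_1", "CDS", 3, 17, "+",
                 {"ID": "cds_1", "product": "kinase; putative"}, phase="0")
record = fasta_record("cds_1", "toy protein", "MKRG", width=2)
print(line)
print(record)
```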

5. Popular Genome Annotation Tools (Pros & Cons)

Below are the most widely used genome annotation tools, with their main strengths and limitations.


5.1 Prokka

Probably the most widely used tool for bacterial annotation.

Pros:

  • Fast (minutes per genome)
  • Easy to run
  • Local and offline
  • Custom databases supported

Cons:

  • Depends heavily on included databases
  • Fewer curated protein names than PGAP

Great for microbial genomes, MAGs, and metagenomic bins.
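
A typical Prokka run can be scripted as below. This is a sketch that only builds and prints the command line; the file names, prefix, and CPU count are placeholders, and the flags shown (`--outdir`, `--prefix`, `--kingdom`, `--cpus`) should be checked against `prokka --help` for your installed version.

```python
import shlex


def prokka_command(assembly, outdir, prefix, cpus=4, kingdom="Bacteria"):
    """Return the argument list for a basic Prokka run (nothing is executed)."""
    return ["prokka",
            "--outdir", outdir,
            "--prefix", prefix,
            "--kingdom", kingdom,
            "--cpus", str(cpus),
            assembly]


# Placeholder file and sample names.
cmd = prokka_command("contigs.fasta", "prokka_out", "isolate01")
print(shlex.join(cmd))

# To actually run it, Prokka must be installed and on PATH:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.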


5.2 NCBI PGAP

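NCBI's Prokaryotic Genome Annotation Pipeline. It is used to annotate prokaryotic genomes for RefSeq and can be requested when submitting a genome to GenBank.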

Pros:

  • Extremely high-quality curated annotations
  • Standardized naming
  • Good for clinical strains

Cons:

  • Strict requirements
  • Slower
  • Difficult for fragmented assemblies

5.3 RAST / RASTtk

Web-based annotation server.

Pros:

  • Easy interface
  • KEGG-like functional categorization

Cons:

  • Slower
  • Requires account
  • Not ideal for very large datasets

5.4 Bakta

A modern replacement for Prokka.

Pros:

  • Uses high-quality curated databases
  • Consistent annotation
  • Very fast

Cons:

  • Newer (less documented)

5.5 InterProScan

For protein domain annotation.

Pros:

  • Deep domain-level annotation
  • Excellent for functional analyses

Cons:

  • Very computationally heavy


5.6 EggNOG-mapper

Best for broad functional annotation:

  • COG
  • GO
  • KEGG Orthology
  • EC numbers

Especially useful in metagenomics.


6. Challenges in Genome Annotation

The steps and tools above cover the core of genome annotation. However, annotation is not perfect. Common issues:

1. Fragmented assemblies

→ Missing genes
→ Split CDS

2. Frameshifts

→ False pseudogenes

3. Incorrect protein names

→ Homology-based errors

4. Incomplete pathways

→ Missing enzymes lead to misinterpretation

5. False positives in AMR or virulence genes

This is why many labs combine multiple tools.
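
One simple way to combine tools is to compare their calls and review only the disagreements. The sketch below assumes each tool's CDS predictions have already been loaded into a dictionary keyed by `(seqid, start, end, strand)`; the tool labels and records are invented for the example.

```python
def compare_annotations(tool_a, tool_b):
    """Compare two {(seqid, start, end, strand): product} dictionaries."""
    shared = tool_a.keys() & tool_b.keys()
    return {
        "only_a": sorted(tool_a.keys() - tool_b.keys()),
        "only_b": sorted(tool_b.keys() - tool_a.keys()),
        "name_conflicts": sorted(
            (key, tool_a[key], tool_b[key])
            for key in shared
            if tool_a[key].lower() != tool_b[key].lower()
        ),
    }


# Toy predictions from two hypothetical tools.
calls_a = {
    ("contig_1", 100, 1299, "+"): "DNA gyrase subunit A",
    ("contig_1", 1500, 2400, "-"): "hypothetical protein",
    ("contig_1", 3000, 3600, "+"): "beta-lactamase",
}
calls_b = {
    ("contig_1", 100, 1299, "+"): "DNA gyrase subunit A",
    ("contig_1", 1500, 2400, "-"): "ABC transporter permease",
    ("contig_1", 4000, 4300, "+"): "hypothetical protein",
}

report = compare_annotations(calls_a, calls_b)
for section, entries in report.items():
    print(section, entries)
```

Matching on exact coordinates is strict: two tools that pick different start codons for the same gene will show up as separate calls, so a real comparison might match on the stop position instead.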


7. Best Practices for High-Quality Annotation

1. Use high-quality assemblies

Assemblies that include Nanopore or PacBio long reads generally yield better annotations than Illumina-only assemblies.

2. Use multiple databases

No single database is perfect.

3. Perform manual curation

Whenever possible, inspect key genes and suspicious calls by hand.

4. Submit to NCBI PGAP for official annotation

5. Keep annotation reproducible

Document tool versions and parameters.
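
A lightweight way to do this is to save a small provenance file next to the annotation outputs. The sketch below records a checksum of the input assembly together with tool names, versions, and parameters as JSON. The field names are an arbitrary choice, and the version string is a placeholder to be filled in from the tool itself.

```python
import hashlib
import json
import platform


def provenance_record(assembly_bytes, tools):
    """Bundle an input checksum and tool settings into one dictionary."""
    return {
        "assembly_sha256": hashlib.sha256(assembly_bytes).hexdigest(),
        "python_version": platform.python_version(),
        "tools": tools,
    }


# Placeholder values; in practice, read the real assembly file in binary mode
# and copy the version reported by each tool.
tools = [{
    "name": "prokka",
    "version": "<fill in from the tool's version output>",
    "parameters": {"--kingdom": "Bacteria", "--cpus": 4},
}]
record = provenance_record(b">toy\nACGT\n", tools)
print(json.dumps(record, indent=2, sort_keys=True))
```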


8. Conclusion

Genome annotation is a crucial step in converting raw sequence data into meaningful biological knowledge. Whether you are working with a new isolate, a metagenomic bin, or a clinical pathogen, understanding genome annotation and performing it properly is essential.

By using a combination of structural and functional tools—Prokka, PGAP, RAST, Bakta, EggNOG-mapper—you can generate high-quality annotations suitable for publication, comparative genomics, and downstream analyses.

If you want to learn more about bacterial genome sequencing, compare the available technologies, or learn what metagenomics is, check our other posts! And if you are planning a microbial genomics or metagenomics project that also requires genome annotation, don’t hesitate to contact us!


Rubén Javier López

Founder and Bioinformatician, PhD in Microbiology

Rubén holds a PhD in microbiology from the University of Bergen (Norway). He is proficient in bacterial metagenomics, genomics and transcriptomics. He has hands-on experience and data analysis expertise in Illumina, Nanopore and PacBio sequencing technologies and has collaborated with scientists and labs all over the world. He has also worked with biomedical research groups, analyzing microbiome and mycobiome data.

Areas of Expertise: Microbiology, Extremophiles, NGS, Microbial Genomics, Transcriptomics, Differential Gene Expression, Metagenomics, Microbiome studies.

