In this workshop we will assemble a mock metagenome from the following species
| Proportion of Reads | Species | Accession | Number of Chromosomes | Size | GC Content |
|---|---|---|---|---|---|
| 44.5 | Geitlerinema sp. PCC 7407 | ASM31704v1 | 1 | 4.7Mb | 58 |
| 30 | Vibrio alfacsensis | ASM1967048v1 | 2 | 5.3Mb | 44 |
| 7.5 | Vibrio algicola | ASM960176v2 | 3 | 3.5Mb | 42 |
| 15 | Desulfovibrio sulfodismutans | ASM1337645v1 | 2 | 4.4Mb | 63.5 |
| 3 | Winogradskyella schleiferi | ASM1339465v1 | 1 | 4.6Mb | 34.5 |
We will proceed through a full metagenomic analysis consisting of the following steps
Data for the workshop is provided in a folder on the rstudio server
called metagenome_assembly in your home directory.
Look inside this folder with the ls command
ls -R metagenome_assembly
The raw_data folder contains reads that will be used for our assembly.
All of the other folders are empty. We will use those to contain work
for each step of our analysis.
Assuming that we are satisfied with the overall quality of our reads the first step is to assemble them. We will use flye for this task because it is probably the best assembler for long-read metagenome data.
Change to the flye directory in our project folder
cd metagenome_assembly/flye
Run flye.
flye --nano-hq ../raw_data/sim.fastq.gz --threads 8 --out-dir mock --meta
Once finished, outputs from flye will be available in the directory
mock. The key output files are;
| File | Description |
|---|---|
| assembly.fasta | The actual assembled contigs |
| assembly_graph.gfa | Representation of the assembly graph in gfa format. |
| assembly_graph.gv | Assembly graph in graphviz format |
| assembly_info.txt | Table of information on each contig |
Before moving onto more formal binning analysis it is worthwhile quickly inspecting some more properties of our assembly.
Firstly we will calculate GC content for each of our contigs. This could be used to refine the rough contig groupings we made using Bandage
Switch to the filter directory
cd filter
Create a new file plain text file called gc.awk inside this directory.
Open this file and copy-paste the following awk script into it.
# Bioawk script for calculating gc
BEGIN {
OFS="\t"
}
{
l=length($seq)
ngc=gsub("[cgCG]","",$seq)
nat=gsub("[taTA]","",$seq)
lend=length($seq)
print $name,l,ngc/(ngc+nat)
}
After you have saved the file you should now be able to run it as follows
cat ../flye/mock/assembly.fasta | bioawk -c fastx -f gc.awk > assembly.gc.tsv
Now take a look at the GC content values for the top 10 largest contigs
cat assembly.gc.tsv | sort -n -k 2 -r | head
Compare these values with the expected GC content from our mock community. Considering both GC content and coverage, does this affect your ideas about how these contigs should be assigned to taxa?
Now let’s create some plots in R to explore the GC and coverage of contigs in a slightly more general way.
First read the gc and assembly_info datasets.
library(tidyverse)
gc <- read_tsv("filter/assembly.gc.tsv",col_names = c("#seq_name","length","GC"))
info <- read_tsv("flye/mock/assembly_info.txt")
Join them together
gc_info <- info %>%
left_join(gc)
A histogram by GC content shows several peaks, probably corresponding to major groupings of taxonomically distinct lineages within the data.
gc_info %>%
ggplot(aes(x=GC)) + geom_histogram()

A plot of GC vs coverage helps to identify clusters that have similar values for both these parameters.
gc_info %>%
ggplot(aes(x=GC,y=cov.)) + geom_point()

Adding labels allows us to look at the contigs with highest coverage and compare these with the contigs in bandage.
library(ggrepel)
gc_info %>%
ggplot(aes(x=GC,y=cov.)) + geom_point() + geom_label_repel(aes(label=`#seq_name`))

One of the most challenging tasks in metagenomic assembly is separating sequences from the many different organisms that were present in the sample. There are many approaches to this and many different tools. Broadly speaking tools fall into two categories;
In both cases the resulting outputs are “bins” each of which will contain multiple distinct sequences (reads or contigs) that come the same organism. It is important to acknowledge that binning is a very difficult problem and the results from binning tools often contain errors such as;
We will perform contig binning here since we have long-reads and can rely on metaFlye to do a good job of assembling high quality contigs from a mixed collection of reads.
Firstly we will use the program metabat2. This program tries to group
contigs based on nucleotide composition and read coverage.
For simplicity and speed we will use coverage information already
calculated by flye and provide this to metabat.
First prepare an input file summarising read depth information for all contigs
cd metabat
cat ../flye/mock/assembly_info.txt | grep -v '#' | awk '{OFS="\t";print $1,$3}' > depths.txt
Next we need to create a new version of the assembly file in which the
contigs have the same order as those in the depths.txt file.
cat depths.txt | awk '{print $1}' | xargs samtools faidx ../flye/mock/assembly.fasta > assembly_ordered.fasta
Finally, run metabat as follows
metabat -i assembly_ordered.fasta -a depths.txt --cvExt -o bins/metabat --seed 5
The bins directory will contain multi-fasta files with each of the
bins inferred by metabat. To quickly inspect which contigs were grouped
into which bin we can just use a grep command like this
grep '>' bins/*
You can see here that metabat has grouped contigs into 7 bins (exact
numbers can vary from run to run which is why we used the --seed
argument above). If you compare the contigs in these bins with those
shown in bandage you can see that the largest and highest coverage
contig has not been included. Other than this some of the other bins to
seem to make some sense. For example, bin number 4 includes contigs 4
and 30, 7 and 8 all of which have a similar GC content and coverage.
In this step we will collect bins created by MetaBAT and also include any large contigs that were skipped by the binning step.
First switch back to the filter directory and make a subdirectory
called long_contigs
cd filter
mkdir -p long_contigs
The following code will collect the longest 20 contigs and place each
into its own fasta file within the long_contigs directory
while read contig;do
echo $contig;
samtools faidx ../flye/mock/assembly.fasta $contig > long_contigs/$contig.fasta;
done < <(cat ../flye/mock/assembly_info.txt | grep -v 'seq_name' | sort -n -k 2 -r | head -n 20 | awk '{print $1}')
There is quite a bit going on in this code. Before running it let’s break it down and see how it works
First let’s just focus on the bash while loop. A much simpler loop would be
while read item;do
echo $item
done < <(echo "1 2")
Now let’s look at what is “feeding” the loop. It expects to be reading a
file line by line. We don’t actually have a file but we can trick bash
into thinking we have one by using the construct <(). Text output
generated by a command inside the brackets is treated as if it were in a
file.
Let’s examine the code we are using to generate our list of contigs
cat ../flye/mock/assembly_info.txt | grep -v 'seq_name' | sort -n -k 2 -r | head -n 20 | awk '{print $1}'
Break this down by removing the pipes and gradually adding them back
Finally we can look at the command used to extract contigs
samtools faidx ../flye/mock/assembly.fasta contig_1
Now we are in a position to identify the long contigs that were not otherwise included in our MetaBAT bins.
# Make a sorted list of contigs in metabat bins
grep '>' ../metabat/bins/metabat.[0-9]*.fa | sed -E 's/.*(contig_.*)/\1/' | sort > metabat_contigs.txt
# Make a sorted list of long contigs
cat ../flye/mock/assembly_info.txt | grep -v 'seq_name' | sort -n -k 2 -r | head -n 20 | awk '{print $1}' | sort > long_contigs.txt
Use the comm command to print contigs present in long but not
present in metabat.
comm -13 metabat_contigs.txt long_contigs.txt > missing_contigs.txt
Switch to the checkm directory
cd checkm
Collect bins from metabat and combine with long contigs
mkdir bins
cp ../metabat/bins/metabat.[0-9]*.fa bins/
while read contig;do
cp ../filter/long_contigs/${contig}.fasta bins/${contig}.fa
done < <(cat ../filter/missing_contigs.txt)
Take a look in your bins folder. It should contain some files with
names like metabat.[0-9].fa and others like contig_23.fa.
Now run the checkm workflow. This could take quite a while.
checkm lineage_wf bins out -t 10 -x fa -f checkm_results.txt
Once checkm has finished take a look at the results file,
checkm_results.txt. It provides a summary of results for each of the
bins. This includes
Change to the gtdbtk directory
cd gtdbtk
Make a symbolic link to the bins and call it genomes
ln -s ../checkm/bins genomes
Run gtdbtk. This will take a while
conda activate gtdbtk-2.1.1
gtdbtk classify_wf -x fa --cpus 10 --genome_dir genomes --out_dir out
Once it has finished running you can find the final output table in
out/classify/gtdbtk.bac120.summary.tsv
Take a look at this table
head -n 2 out/classify/gtdbtk.bac120.summary.tsv
An easier way to look at it might be to load it into R
gtdbtk <- read_tsv("gtdbtk/out/classify/gtdbtk.bac120.summary.tsv")
There are many columns in this output, however, the key one is
classification. You should see from this that only four of the bins
were classified. Looking at the classifications though we can see that
they are very close to our reference genomes.
MetaQuast provides several useful visualisations for assessing metagenome assembly quality. In this case since we know the reference genomes of all members of our community we can use it to see how well our assemblies match those.
cd metaquast
ln -s ../gtdbtk/genomes .
ln -s ../raw_data/refs .
Run metaquast. This will download each of the reference genomes from NCBI and perform alignments between each of the bins and these genomes.
metaquast.py genomes/*.fa -o out --references-list refs/references.txt -t 4
When metaquast is finished running you can find an html report in the
output folder out/report.html. This is a little confusing but it
essentially confirms that two of our genomes (those with the most
coverage) are assembled very well.
Alternative binning program
An alternative binning option is metabinner. This is a much newer program that uses a greater suite of information when performing binning. It is also very new, a little hard to use, and much much slower than metabat.
When performing binning for your own projects you could explore a range of tools. Your goal should be to obtain bins with high checkM scores (see below) and that are supported by multiple tools.
The code below shows how to run metabinner. Please DO NOT run this during the day when we are in the workshop as it will use up too much CPU on the server. Try it yourself after the workshop is complete.
conda activate metabinner
cd metabinner
cp ../metabat/depths.txt
cp ../metabat/assembly_ordered.fasta
/opt/miniconda3/envs/metabinner/bin/scripts/gen_kmer.py assembly_ordered.fasta 1000 4
run_metabinner.sh -t 10 -a $(pwd)/assembly_ordered.fasta -o $(pwd)/mbout2 -d $(pwd)/depths.txt -k $(pwd)/assembly_ordered_kmer_4_f1000.csv -p /opt/miniconda3/envs/metabinner/bin