Hello all, does anyone know of a program that can filter out all the non-bacterial genomes from metagenomic data?





I only know how to do this after the assembly, which may not be what you are asking. I am assuming that you mean non-prokaryotic rather than non-bacterial, but the answer is probably the same.

The first step is to bin the contigs by tetranucleotide/pentanucleotide (4-mer/5-mer) frequencies. Even related bacterial species can be separated this way, and it is almost guaranteed that any eukaryotic sequence will be well separated from the rest. The same is true for archaeal bins, in case you really meant non-bacterial genomes. The bins can then be classified with GTDB-Tk (the GTDB Toolkit); since its reference taxonomy covers only bacteria and archaea, eukaryotic bins will usually be classified in the Asgard/Loki group.
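To illustrate what "binning by 4-mer frequencies" is built on, here is a minimal Python sketch that turns a contig into a normalized tetranucleotide frequency vector. The function name `kmer_frequencies` is made up for this example, and real binning tools do more than this (they typically merge reverse-complement k-mers and also use coverage), so treat it as a sketch of the idea rather than a binner.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the normalized k-mer frequency vector of a DNA sequence.

    The vector has 4**k entries, ordered AAAA, AAAC, ..., TTTT for k=4.
    Windows containing anything other than A/C/G/T (e.g. N) are ignored.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only windows made purely of A/C/G/T contribute to the total.
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


if __name__ == "__main__":
    # Hypothetical toy contig, just to show the call.
    vector = kmer_frequencies("ACGTACGTTTGACCA")
    print(len(vector), sum(vector))
```

Contigs whose vectors lie close together (for example after a PCA or a clustering step) tend to come from the same genome, which is why eukaryotic contigs usually fall well away from the prokaryotic ones.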

You can perform the separation at the read level using Kraken2.
Download a pre-built Kraken2 and Bracken database (I suggest the standard database, not the mini one) here: (Dec 2020).

Then run kraken2 with:

kraken2 --db {kraken2_database_path} --unclassified-out {uncseq} --classified-out {cseq} --use-names --threads {threads} --output {output.txt} --report {output.kreport} {input.fq}

Kraken2 will then assign each read to a taxon. You can later select the reads you want from {cseq} using the per-read assignments in {output.txt}, where each line lists a read ID and the taxon it was assigned to.
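The selection step could be sketched in Python roughly as follows. It assumes the standard tab-separated Kraken2 formats: a six-column report (percentage, clade reads, direct reads, rank code, taxid, name indented by depth) and a per-read output whose third column is either a bare taxid or, with --use-names, "name (taxid N)". The function names are invented for this example; taxid 2 is NCBI's Bacteria.

```python
import re

# With --use-names the third output column looks like "Escherichia coli (taxid 562)".
TAXID_RE = re.compile(r"\(taxid (\d+)\)\s*$")


def taxids_under(report_lines, root_taxid):
    """Collect root_taxid and all of its descendants from a Kraken2 report.

    Relies on the report listing each clade directly below its parent,
    with the name column indented more deeply than the parent's name.
    """
    selected = set()
    root_depth = None
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        taxid = int(fields[4])
        name = fields[5]
        depth = len(name) - len(name.lstrip(" "))
        if root_depth is not None:
            if depth > root_depth:
                selected.add(taxid)  # still inside the subtree
                continue
            root_depth = None  # indentation dropped back: subtree ended
        if taxid == root_taxid:
            root_depth = depth
            selected.add(taxid)
    return selected


def read_ids_for_taxids(output_lines, taxids):
    """Return IDs of classified reads whose assigned taxid is in taxids."""
    ids = set()
    for line in output_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "C":
            continue  # skip malformed lines and unclassified reads
        match = TAXID_RE.search(fields[2])
        if match:
            taxid = int(match.group(1))
        elif fields[2].strip().isdigit():
            taxid = int(fields[2])
        else:
            continue
        if taxid in taxids:
            ids.add(fields[1])
    return ids


if __name__ == "__main__":
    # Placeholder file names standing in for {output.kreport} and {output.txt}.
    with open("output.kreport") as report:
        bacterial = taxids_under(report, 2)
    with open("output.txt") as assignments:
        keep = read_ids_for_taxids(assignments, bacterial)
    print(len(keep), "reads assigned to Bacteria")
```

The resulting set of read IDs can then be used to pull the matching records out of {cseq} with whatever FASTQ tool you prefer.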

