Dealing with chimeric assembly contigs


Hello esteemed fellow researchers.

I've been getting partial matches when comparing my assemblies to a BLAST database. I was told that partial matches of contigs to a BLAST database are due to chimeras (unrelated sequences that are wrongly merged during assembly). For instance, a little over half of a contig of about 880k base pairs matched a certain bacterium, which must be the result of sample contamination. Assuming these contigs really are chimeric, I thought a sensible next step would be to remove the contaminating sequences from the chimeric contigs, which will hopefully keep only the DNA of interest. Then I'll be able to align these sequences to a reference genome or to the genomes of closely related species to find genetic differences between them, which is my research question.

I'd love to get your opinions on whether that's a good idea or not.

Thanks and keep on rocking in the free world!





To start off with the worst news right away: this is one of the most difficult things to resolve in assembly.

Rerunning the whole assembly with, for instance, more stringent settings is perhaps the best option, but it is usually not feasible.

There are probably also tools around that can partly fix this, but it remains difficult. One thing you can try yourself is to map the reads back to your assembled contigs and inspect the mapping coverage. The thing to look for is 'drops' in the coverage, which indicate that something might be going on there; that way you might be able to pinpoint where the chimeric junction is located.
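
To make the coverage check a bit more concrete, here is a minimal sketch in Python. It assumes you have already mapped the reads and exported per-base depth as three tab-separated columns (contig, 1-based position, depth), which is what `samtools depth -a` writes for a single BAM file. The function names, the window size and the 20% threshold are my own illustrative choices, not established defaults, so tune them to your data.

```python
from statistics import median


def read_depths(lines, contig):
    """Collect per-base depths for one contig from 'contig<TAB>pos<TAB>depth' lines."""
    depths = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == contig:
            depths.append(int(fields[2]))
    return depths


def find_coverage_drops(depths, window=1000, drop_fraction=0.2):
    """Return (start, end, mean_depth) for windows far below the contig median.

    start/end are 0-based, end-exclusive offsets into the depth list.
    window and drop_fraction are arbitrary illustrative values.
    """
    if not depths:
        return []
    threshold = median(depths) * drop_fraction
    drops = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        mean_depth = sum(chunk) / len(chunk)
        if mean_depth < threshold:
            drops.append((start, start + len(chunk), mean_depth))
    return drops
```

A flagged window is only a candidate: low coverage can also come from repeats or hard-to-map regions, so compare the position with where your BLAST match starts or ends before cutting anything.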

Simply removing all contigs that have some contamination match is also an option, but you will chuck out many true sequences as well. Moreover, to make it all even more difficult: the fact that part of a contig matches bacteria does not mean that part is not true sequence from your organism (there are eukaryotic sequences that match bacteria as well).

If you want to remove contamination contigs from your assembly, it is worth taking some other info into account as well. For instance, you can check the %GC of your contigs (bacterial and eukaryotic contigs often differ markedly in %GC) in combination with your BLAST matches.
If a whole contig has nothing but bacterial matches and a %GC that deviates strongly from the mean of your whole assembly, there is a high chance it is indeed contamination.
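
As a rough illustration of combining the two signals, here is a small Python sketch. It assumes you have the contigs in a dict of name to sequence, plus a dict giving the fraction of each contig covered by bacterial BLAST hits, which you would have to compute yourself from your BLAST output. Both function names, the `bacterial_fraction` input and the two cutoffs are hypothetical choices for the example, and the "mean" used here is the length-weighted GC of the whole assembly.

```python
def gc_fraction(seq):
    """GC fraction of a sequence, ignoring N and other ambiguous bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if (gc + at) else 0.0


def likely_contaminants(contigs, bacterial_fraction,
                        max_gc_deviation=0.10, min_bacterial_fraction=0.9):
    """Names of contigs that both deviate in GC and are mostly bacterial hits.

    contigs: dict of contig name -> sequence string.
    bacterial_fraction: dict of contig name -> fraction (0..1) of the contig
        covered by bacterial BLAST matches (assumed to be precomputed).
    Both cutoffs are arbitrary illustrative values.
    """
    gc_total = 0
    acgt_total = 0
    for seq in contigs.values():
        upper = seq.upper()
        gc = upper.count("G") + upper.count("C")
        gc_total += gc
        acgt_total += gc + upper.count("A") + upper.count("T")
    assembly_gc = gc_total / acgt_total if acgt_total else 0.0

    flagged = []
    for name, seq in contigs.items():
        deviation = abs(gc_fraction(seq) - assembly_gc)
        if (deviation > max_gc_deviation
                and bacterial_fraction.get(name, 0.0) >= min_bacterial_fraction):
            flagged.append(name)
    return flagged
```

Requiring both conditions is deliberate: a contig with bacterial hits but a normal %GC is left alone, since it may well be true sequence.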

Long story short: it's a difficult issue to resolve.

Are we talking contigs or scaffolds here, btw? (Contigs are even worse in this setting than scaffolds.)
