Hi!
In (Roux et al., 2017), the authors state that they used nucmer (from MUMmer) to cluster the contigs:
'Contigs from all samples were clustered with nucmer (Delcher, Salzberg & Phillippy, 2003) at ≥95% ANI across ≥80% of their lengths, as in (Brum et al., 2015; Gregory et al., 2016), to generate a pool of non-redundant “population contigs”'
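To make sure I read the criterion correctly, here is a minimal sketch of how I understand it. The function name and arguments are hypothetical, and measuring coverage against a single contig's length (for example the shorter one of a pair) is my assumption, not something stated in the paper.

```python
def passes_threshold(percent_identity, aligned_bases, contig_length,
                     min_ani=95.0, min_coverage=0.80):
    """Return True if an alignment meets the >=95% ANI / >=80% length rule.

    Assumption: coverage is aligned bases divided by the length of the
    contig being tested (hypothetical reading of "across >=80% of their lengths").
    """
    coverage = aligned_bases / contig_length
    return percent_identity >= min_ani and coverage >= min_coverage


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(passes_threshold(96.0, 850, 1000))
    print(passes_threshold(96.0, 700, 1000))
```

Under this reading, both conditions must hold at the same time: a high-identity hit that covers only a small part of a contig would not merge it into a population.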
- I have the contigs from all my samples (which are grouped by experimental condition) in a single de novo assembly file (built with MEGAHIT). There is no reference; the samples come from mouse gut, so in theory they contain several (probably unknown) genomes.
- nucmer requires two multi-FASTA inputs: a reference and a query. So what should I use as the reference?
- Merge a selection of viral genomes and use it as reference?
- Assemble the samples/groups separately and then use one assembly as reference?
- Use the same assembly file as both reference and query?
- Split the assembly file then use one (maybe the largest) contig as the reference?
- Anything else?
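For the "same file as reference and query" option, this is roughly how I imagine post-processing the alignments. It is only a sketch under assumptions: it expects tab-separated show-coords output with the length and coverage columns enabled (I believe that is `-T -H -c -l`), and the 13-column layout in the comments is my assumption about that output. The function name is hypothetical.

```python
from collections import defaultdict


def summarize_self_hits(coords_lines):
    """Summarize contig-vs-contig hits from a self-alignment.

    Assumed columns (0-based, tab-separated):
    0 S1, 1 E1, 2 S2, 3 E2, 4 LEN1, 5 LEN2, 6 %IDY,
    7 LEN R, 8 LEN Q, 9 COV R, 10 COV Q, 11 ref id, 12 query id.
    Returns {(ref_id, query_id): (length-weighted identity, query coverage)}.
    """
    aligned = defaultdict(int)      # aligned query bases per pair
    weighted = defaultdict(float)   # sum of identity * aligned bases per pair
    query_len = {}
    for line in coords_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 13:
            continue  # skip lines that do not match the assumed layout
        ref_id, qry_id = fields[11], fields[12]
        if ref_id == qry_id:
            continue  # a contig always matches itself; ignore those hits
        bases = int(fields[5])
        aligned[(ref_id, qry_id)] += bases
        weighted[(ref_id, qry_id)] += float(fields[6]) * bases
        query_len[(ref_id, qry_id)] = int(fields[8])
    summary = {}
    for pair, bases in aligned.items():
        # Overlapping alignments can be counted twice, so cap coverage at 1.0.
        coverage = min(1.0, bases / query_len[pair])
        summary[pair] = (weighted[pair] / bases, coverage)
    return summary
```

Pairs whose identity is at least 95 and whose coverage is at least 0.80 would then be candidates for the same cluster, but I do not know whether this matches what the authors actually did.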
Alternatively, I have already clustered the contigs with CD-HIT. Would nucmer be better at clustering? I can only answer that once I manage to run nucmer.
If anyone has good experience with another viromics pipeline, I would be happy to test it.
Any help will be very much appreciated. Thanks!