Clustering based on sequence similarity


I've got 5 from 7 to 30 thousand virus genome sequences per each strain and I need to separate the sequences into groups based on their similarity. How can I do that? I'm able to align each strain with MAFFT, but I don't really know how to cluster them. I'd be really happy to hear the answer.



This part is not clear:

I've got 5 from 7 to 30 thousand virus genome sequences

You have 5 protein sequences from 30K genomes? 5-7 protein sequences? 5-7 genes?

If you are talking about whole genome clustering, that would not be easy on such a scale. I recommend that you use predicted proteins for each of them. Then:

  • align them individually
  • trim the alignments
  • concatenate those alignments into a super-matrix
  • make a phylogenetic tree
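
As a rough illustration of the concatenation step, here is a minimal Python sketch. It assumes each protein has already been aligned and trimmed, and that every FASTA header starts with the genome ID. The function names `read_fasta` and `concatenate_alignments` are made up for this example, and padding a genome that lacks a protein with gaps is an assumption, not the only reasonable choice.

```python
def read_fasta(text):
    """Parse FASTA text into a {record_id: sequence} mapping."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the genome ID.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def concatenate_alignments(alignments):
    """Join per-protein alignments into one super-matrix.

    alignments: list of {genome_id: aligned_sequence} dicts, one per protein.
    Genomes missing from an alignment are padded with gap characters
    (an assumption made for this sketch).
    """
    genome_ids = sorted({gid for aln in alignments for gid in aln})
    supermatrix = {gid: [] for gid in genome_ids}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for gid in genome_ids:
            supermatrix[gid].append(aln.get(gid, "-" * width))
    return {gid: "".join(parts) for gid, parts in supermatrix.items()}


# Toy example with two tiny "protein" alignments.
protein_a = read_fasta(">g1\nMK-L\n>g2\nMKAL\n")
protein_b = read_fasta(">g1\nQQ\n>g3\nQE\n")
matrix = concatenate_alignments([protein_a, protein_b])
```

The resulting dictionary can be written back out as FASTA and passed to whatever tree-building program you prefer.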

Beware that each of these steps, especially the last one, will take a long time. There is also a large potential for error when working at this scale, even for those who have done all of these steps before. Even if everything works, it is very difficult to look through a tree with 30K tips. Lastly, most of your genomes will be (near-)identical at the protein level, so you still may not get much useful information.
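
To get a feel for how much redundancy there is before building anything, one option is to group genomes whose sequences are exactly identical. The sketch below is only an assumption about how you might do that: `collapse_identical` is a hypothetical helper, it ignores gap characters and letter case, and it catches only exact duplicates. Near-identical sequences would need a distance threshold and a dedicated clustering tool.

```python
from collections import defaultdict


def collapse_identical(sequences):
    """Group genome IDs that share exactly the same sequence.

    sequences: {genome_id: sequence}; gaps ("-") and case are ignored.
    Returns {representative_id: [all member ids]}, where the representative
    is simply the first genome seen with that sequence.
    """
    groups = defaultdict(list)
    for gid, seq in sequences.items():
        key = seq.replace("-", "").upper()
        groups[key].append(gid)
    return {ids[0]: ids for ids in groups.values()}


# Toy example: g1 and g2 are the same protein once gaps and case are ignored.
toy = {"g1": "MKAL", "g2": "mk-al", "g3": "MKVL"}
clusters = collapse_identical(toy)
```

If the number of groups is far smaller than the number of genomes, building the tree from one representative per group should make it much easier to compute and to read.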
