I have just started learning bioinformatics and pangenomics, so I apologize in advance if this question seems basic.
NCBI's genome database provides two kinds of sequence records: RefSeq and GenBank. I have read about the differences between the two, but I am curious which one is preferred for pangenome studies.
For example, would a phylogeny built from RefSeq sequences differ from one built from GenBank sequences?
As far as I have read, RefSeq records are already curated and annotated, and contaminants have been removed. Will this affect the phylogeny that we build with a pangenome pipeline?
I am particularly interested in non-synonymous mutations, especially frameshifts and premature stop codons inside coding sequences. Do RefSeq curators also remove or correct these kinds of unusual errors or mutations? If so, it would be difficult to study such mutations using RefSeq sequences.
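To make concrete what I mean by a stop codon inside a coding sequence or a possible frameshift, here is a minimal sketch of the check I have in mind. It assumes each CDS is a plain nucleotide string read in frame from its first base, and that the stop codons are TAA, TAG and TGA (the standard set; other genetic codes would need a different set). The function names are my own and do not come from any pipeline.

```python
# Stop codons of the standard genetic code (assumption: adjust for other codes).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stops(cds):
    """Return 0-based nucleotide positions of in-frame stop codons
    that occur before the final codon of the CDS."""
    cds = cds.upper()
    n_codons = len(cds) // 3
    complete = len(cds) % 3 == 0
    hits = []
    for k in range(n_codons):
        codon = cds[3 * k:3 * k + 3]
        # The last codon counts as the normal terminator only when the
        # CDS length is a whole number of codons.
        is_terminal = complete and k == n_codons - 1
        if codon in STOP_CODONS and not is_terminal:
            hits.append(3 * k)
    return hits


def length_not_multiple_of_three(cds):
    """Crude frameshift hint: the CDS length is not divisible by 3."""
    return len(cds) % 3 != 0


if __name__ == "__main__":
    print(internal_stops("ATGTAAAAATAA"))         # premature stop at codon 2
    print(length_not_multiple_of_three("ATGAA"))  # incomplete final codon
```

A length that is not a multiple of three is only a hint of a frameshift, since partial CDS features at contig edges would trigger it as well.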
Someone may suggest combining GenBank and RefSeq after removing the duplicates. In that case, which copy should I remove, the RefSeq one or the GenBank one? I read about this in a previous Biostars post (www.biostars.org/p/377799/).
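For the duplicate removal, this is a minimal sketch of what I imagine, assuming that a GenBank assembly (GCA_ prefix) and its RefSeq copy (GCF_ prefix) share the same nine-digit number in the accession. That is my understanding of the accession scheme, but the paired records are not guaranteed to be identical in sequence or annotation, so treat it as an assumption. The function name, the `prefer` parameter and the example accessions are for illustration only.

```python
import re

# Assembly accession pattern: GCA_ (GenBank) or GCF_ (RefSeq),
# nine digits, then a version number.
ACCESSION_RE = re.compile(r"^(GC[AF])_(\d{9})\.(\d+)$")


def dedupe_assemblies(accessions, prefer="GCF"):
    """Keep one accession per assembly number, choosing the `prefer`
    prefix when both the GenBank and the RefSeq copy are present."""
    chosen = {}
    for acc in accessions:
        match = ACCESSION_RE.match(acc)
        if match is None:
            raise ValueError(f"not an assembly accession: {acc!r}")
        prefix, number, _version = match.groups()
        current = chosen.get(number)
        # Take the first copy seen, and replace it only if this one has
        # the preferred prefix and the stored one does not.
        if current is None or (prefix == prefer and not current.startswith(prefer)):
            chosen[number] = acc
    return sorted(chosen.values())


if __name__ == "__main__":
    mixed = ["GCA_000005845.2", "GCF_000005845.2", "GCA_000006745.1"]
    print(dedupe_assemblies(mixed))                # keeps the RefSeq copy of the pair
    print(dedupe_assemblies(mixed, prefer="GCA"))  # keeps the GenBank copy instead
```

Assemblies that exist only in GenBank are kept either way, so the `prefer` setting only decides which copy of a pair survives, which is exactly the choice I am unsure about.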