ATpoint, 7 minutes ago

I was thinking along the same lines as Mensur Dlakic, but from the nucleotide perspective. Even though IUPAC, to the best of my knowledge, allows 16 characters for nucleotide FASTA, the vast majority will always be A/T/C/G/N, even in noisy Sanger sequencing data. I include N explicitly because in genome FASTA files repetitive regions, especially the telomeres, often consist of large N stretches.
Therefore, if a selection of the FASTA, say the first 100 characters, maybe of each of the first 10 entries if it is a multi-FASTA, consists of more than x% A/T/C/G/N characters, call it nucleotide, else call it amino acid. One can probably calibrate x by randomly pulling some FASTA files from NCBI, maybe from the nucleotide collection, plus the same number from an amino acid collection, and then deriving the expected proportion of A/T/C/G/N versus all other characters in each set. I guess the distribution will be rather bimodal, so finding a good cutoff is probably not too difficult, as Mensur Dlakic suggested in his second-to-last sentence.
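A rough Python sketch of the heuristic above, in case it helps. The function names, the sampling of 100 characters from each of the first 10 records, and especially the default cutoff of 0.9 are my assumptions for illustration, not calibrated values; the cutoff should come from the calibration on real nucleotide and protein files described above.

```python
NUCLEOTIDE_CHARS = frozenset("ACGTN")


def sample_residues(lines, max_records=10, max_chars=100):
    """Collect up to max_chars residues from each of the first max_records records."""
    samples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header starts a new record; stop once enough were sampled.
            if len(samples) == max_records:
                break
            samples.append("")
        elif samples and len(samples[-1]) < max_chars:
            room = max_chars - len(samples[-1])
            samples[-1] += line[:room].upper()
    return "".join(samples)


def acgtn_fraction(residues):
    """Fraction of sampled characters that are A, C, G, T or N."""
    if not residues:
        raise ValueError("no sequence characters found")
    hits = sum(ch in NUCLEOTIDE_CHARS for ch in residues)
    return hits / len(residues)


def guess_fasta_type(lines, cutoff=0.9):
    """Return 'nucleotide' if the A/C/G/T/N fraction exceeds cutoff (assumed, uncalibrated)."""
    fraction = acgtn_fraction(sample_residues(lines))
    return "nucleotide" if fraction > cutoff else "amino acid"
```

It can be called on an open file handle, for example `with open(path) as handle: guess_fasta_type(handle)`. Running `acgtn_fraction` over a set of known nucleotide and protein files would give the two distributions from which a cutoff could be picked.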


