I am trying to find the unique sequences, along with their counts and IDs, in a FASTA file in R using Biostrings. For example:

>random sequence 1
tatgtgcgag
>random sequence 2
agggtgttat
>random sequence 3
tatgtgcgag
>random sequence 4
gactcgcggt
>random sequence 5
tatgtgcgag
>random sequence 6
gcagccatcg
>random sequence 7
gactcgcggt
>random sequence 8
tatgtgcgag
>random sequence 9
tatgtgcgag
>random sequence 10
tatgtgcgag

The following code gives me the set of unique sequences:

library(Biostrings)
random <- readDNAStringSet("random.fasta")
unique(random)

DNAStringSet object of length 4:
    width seq         names
[1]    10 TATGTGCGAG  random sequence 1
[2]    10 AGGGTGTTAT  random sequence 2
[3]    10 GACTCGCGGT  random sequence 4
[4]    10 GCAGCCATCG  random sequence 6

But I am not sure how to return the count and the IDs for each unique sequence, or how to remove sequences with ambiguous characters.
Can anyone help, please? Thanks.
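
One possible approach, as a minimal sketch rather than a definitive answer: drop any sequence containing a non-ACGT letter using alphabetFrequency(), then group the FASTA names by sequence string with split(). The sketch builds the example in memory instead of reading random.fasta so that it is self-contained. "random sequence 11" is a hypothetical entry with an ambiguous base, added only to show the filtering step, and the variable names (seqs, clean, ids, result) are illustrative.

```r
library(Biostrings)

# The ten example sequences, plus one hypothetical record ("random sequence 11")
# containing an ambiguous base (N) to demonstrate the filter.
seqs <- DNAStringSet(c(
  "random sequence 1"  = "TATGTGCGAG",
  "random sequence 2"  = "AGGGTGTTAT",
  "random sequence 3"  = "TATGTGCGAG",
  "random sequence 4"  = "GACTCGCGGT",
  "random sequence 5"  = "TATGTGCGAG",
  "random sequence 6"  = "GCAGCCATCG",
  "random sequence 7"  = "GACTCGCGGT",
  "random sequence 8"  = "TATGTGCGAG",
  "random sequence 9"  = "TATGTGCGAG",
  "random sequence 10" = "TATGTGCGAG",
  "random sequence 11" = "TATGNGCGAG"
))

# With baseOnly = TRUE, everything that is not A, C, G or T is counted
# in the "other" column; keep only sequences where that count is zero.
freq  <- alphabetFrequency(seqs, baseOnly = TRUE)
clean <- seqs[freq[, "other"] == 0]

# Group the record names by sequence string, keeping first-appearance order.
seq_chr <- as.character(clean)
key     <- factor(seq_chr, levels = unique(seq_chr))
ids     <- split(names(clean), key)

# One row per unique sequence: the sequence, how often it occurs, and its IDs.
result <- data.frame(
  sequence = names(ids),
  count    = lengths(ids),
  ids      = vapply(ids, paste, character(1), collapse = "; "),
  row.names = NULL,
  stringsAsFactors = FALSE
)
print(result)
```

For the example data, this should report TATGTGCGAG six times (sequences 1, 3, 5, 8, 9, 10), GACTCGCGGT twice (sequences 4 and 7), and the other two sequences once each, with the hypothetical N-containing record removed. For a real file, seqs would instead come from readDNAStringSet("random.fasta") as in the code above.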


