I have paired-end 250 bp sequencing data for a sample, with a read count of around 2 million. The reads contain 70-mer barcodes that are flanked by common upstream and downstream regions.
I need to determine how many unique barcodes are present in the sample and the frequency of each relative to the total number of reads.
So far, I have mapped the reads to a reference that has N's in the barcode region, and I found some common sequences that may be the barcodes.
I then merged the forward and reverse reads with a minimum overlap of 50 and grepped for the observed barcode sequences. However, I am not sure whether this approach is correct.
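To make the counting step concrete, here is a rough Python sketch of what I think the extraction could look like, instead of grepping for one barcode at a time. It is only a sketch under several assumptions: the flank sequences `UPSTREAM` and `DOWNSTREAM` are made-up placeholders (not my real constant regions), the merged reads are in plain four-line FASTQ, the barcode sits in the forward orientation, and no mismatches are allowed in the flanks. The function names are hypothetical.

```python
import re
from collections import Counter

# Placeholder flanks (assumption): replace with the real constant
# sequences immediately upstream and downstream of the barcode.
UPSTREAM = "ACGTACGTAC"
DOWNSTREAM = "TGCATGCATG"
BARCODE_LEN = 70

# Exact-match pattern: upstream flank, 70 unambiguous bases, downstream flank.
PATTERN = re.compile(f"{UPSTREAM}([ACGT]{{{BARCODE_LEN}}}){DOWNSTREAM}")


def read_fastq_sequences(path):
    """Yield the sequence line of each record in a four-line FASTQ file."""
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:
                yield line.strip().upper()


def count_barcodes(sequences):
    """Return (Counter of barcodes, total number of reads seen)."""
    counts = Counter()
    total = 0
    for seq in sequences:
        total += 1
        match = PATTERN.search(seq)
        if match:
            counts[match.group(1)] += 1
    return counts, total


def barcode_frequencies(counts, total):
    """Frequency of each barcode relative to the total number of reads."""
    if total == 0:
        return {}
    return {barcode: n / total for barcode, n in counts.items()}


# Example usage (the file name is a placeholder):
#   counts, total = count_barcodes(read_fastq_sequences("merged.fastq"))
#   print(len(counts), "unique barcodes in", total, "reads")
#   for barcode, n in counts.most_common(10):
#       print(barcode, n, n / total)
```

Reads where the flanks are not found exactly are still counted in the total but contribute no barcode, so the frequencies are relative to all merged reads. Sequencing errors inside the barcode would show up as extra low-count barcodes with this approach.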
Is there a better way to perform this kind of analysis?
Any help would be appreciated.
Thanks in advance!