I'm currently working with the transcriptome of a non-model plant organism. For this study I first assembled a transcriptome using the genome as a guide, with both short and long reads. Afterward, I extracted all the non-mapping short-read pairs and non-mapping long reads (around 10% of the data) to build a second, reference-free transcriptome.
I also checked the quality of the non-mapping reads. Not surprisingly, some samples showed two peaks in the GC content plot, so I suspect that those reads include contamination.
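For reference, the two-peak pattern can be checked per sequence with something like the sketch below. It is only a rough illustration: the function names `gc_fraction` and `gc_histogram` and the 5% bin width are my own placeholders, not part of any QC tool, and the sequences are assumed to be already loaded as plain strings.

```python
from collections import Counter

def gc_fraction(seq):
    """Fraction of G/C among unambiguous bases (A, C, G, T); N and others are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0

def gc_histogram(seqs, bin_width=5):
    """Count sequences per GC-percentage bin; each key is the lower edge of its bin."""
    counts = Counter()
    for s in seqs:
        pct = gc_fraction(s) * 100
        # Put exactly 100% GC into the last bin instead of opening a new one.
        b = min(int(pct // bin_width) * bin_width, 100 - bin_width)
        counts[b] += 1
    return dict(sorted(counts.items()))
```

Two well-separated bins with high counts would correspond to the two peaks I see in the GC plot.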
I assembled these reads with Trinity and BLASTed 100 randomly chosen sequences against nr. I was expecting to find fungal, human, or other animal sequences, but I only got plant hits. Although this looks like good news, I want to be sure the assembly really is free of contaminant sequences. What would be the best way to confirm that?
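In case it matters how the 100 sequences were picked, a random subset can be drawn roughly as in this sketch. The helpers `read_fasta` and `sample_records` are hypothetical names for illustration, the fixed seed is only there to make the draw repeatable, and the whole assembly is assumed to fit in memory.

```python
import random

def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def sample_records(records, n=100, seed=0):
    """Return up to n records chosen uniformly at random, without replacement."""
    records = list(records)
    rng = random.Random(seed)  # fixed seed (an assumption) so the subset is reproducible
    return rng.sample(records, min(n, len(records)))
```

The sampled records can then be written back out as FASTA and used as the BLAST query.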