How to quantify piRNAs ?

1

Hi

I'm trying to create a piRNA count table from samples enriched with small RNAs (~31 bases) using the piRBase reference. Despite all my readings I haven't find a simple way to do that.

In the first place I tried to quantifiy piRNAs with a similar strategy as miRNAs by aligning trimmed reads with Bowtie to piRBase (instead of miRBase) but I underestimated the multimapping problem: more than 99% of the aligning reads multimap the reference even when requiring 0 mistmatch (bowtie -v 0). Indeed the drosophila piRNA reference contains >41 million sequences.

Among all the perfect matches I would be interested in prioritizing the reference sequences that have no additionnal bases.

For example the read TGCTTGGACTACATATGGTTGAGGGTTGTA perfectly matches many piRNAs in the piRBase:

>piR-dme-179
TGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-183
CTGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-800
TGCTTGGACTACATATGGTTGAGGGTTGTAA
>piR-dme-46776
ATGCTTGGACTACATATGGTTGAGGGTTGTA
>piR-dme-48715
GCTGCTTGGACTACATATGGTTGAGGGTTGTA
...

For that case I would like to count only piR-dme-179 because it has no additionnal bases. Are there tools that could help efficiently do that ? Maybe using a reciprocal matching criterion.

I hope you can help, thank you


piRNA


piRBase

• 151 views

updated 2 hours ago by

101k

written 4 days ago by

0

Thank you.

The other references/databases have less piRNAs but they still have sequences that are subsequences of others, therefore it doesn't solve the multi-mapping problem when using aligners like Bowtie. In most papers/pipelines I have seen they map to the genome which is worse in terms of multi-mapping and I'm not trying to discover new piRNAs. I ended up writting a custom script to quantify piRBase piRNAs (in my case >95% of the FASTQ sequences found a perfect match).

In case someone is interested, the way I did (requires at least 10GB of RAM):

  1. Parse the piRBase FASTA to create a map:<sequence> -> <sequence name> (e.g. with a dict in Python or a list in R)
  2. Parse the sample FASTQ:
    • if you have UMIs to account for PCR bias, create a map <sequence> -> seen UMIs: the count will be the number of unique UMIs
    • otherwise map <sequence> -> number of appearances
  3. Write the count table from these 2 mappings


Login
before adding your answer.

Traffic: 2628 users visited in the last hour



Source link