Trying to use Python regular expressions to filter FASTA sequence headers


Hi,

Apologies if I do not follow the correct question formatting; this is my first time posting. My question is about Python regular expressions. I have a FASTA file of sequences in the following format:

>NODE_143195_length_100_cov_16076.000000
TTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCAG
TCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCGGC
>NODE_143196_length_100_cov_15891.000000
CTTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTCA
GTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCGG
>NODE_143197_length_100_cov_15696.000000
GCTTGTGTTGGTTGTTGTGTTGCCTGTCTTGGTGGCGGTTGTGTTGGCTGCTTTCGTGTC
AGTCTCTTCACCGATGTTATGTTGCTCTGTTGTGGCTCCG

I am trying to filter by both length and coverage, both of which are encoded in the header. I want to filter out sequences that are shorter than 5000 bp and have less than 100 coverage. I have been trying different variations of the following line:

^.+cov=([5-9][0-9][0-9]|([1-9]d{4}d*)..+$

But I cannot seem to make it work. If anyone can help, it would be greatly appreciated.
Thanks
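
For reference, one possible approach is sketched below. Rather than encoding the numeric thresholds in the pattern itself, it uses a regular expression only to capture the length and coverage from each header and then compares them as numbers. Note that the example headers use `cov_` while the pattern above looks for `cov=`. The names `HEADER_RE`, `parse_header` and `filter_fasta` are made up for illustration, the header layout is assumed from the three example records, and the filter is assumed to mean "keep records with length of at least 5000 and coverage of at least 100".

```python
import re

# Header layout assumed from the example records:
# >NODE_<id>_length_<integer>_cov_<decimal>
HEADER_RE = re.compile(r"^>NODE_\d+_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_header(header):
    """Return (length, coverage) from a header line, or None if it does not match."""
    match = HEADER_RE.match(header)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def filter_fasta(lines, min_length=5000, min_cov=100.0):
    """Yield header and sequence lines of records that pass both thresholds.

    Assumption: a record is kept only if length >= min_length AND
    coverage >= min_cov. Headers that do not match the pattern are dropped.
    """
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            parsed = parse_header(line)
            keep = (
                parsed is not None
                and parsed[0] >= min_length
                and parsed[1] >= min_cov
            )
        # Skip blank lines; emit everything belonging to a kept record.
        if keep and line:
            yield line
```

To apply it to a file, something like `with open("contigs.fasta") as fh: print("\n".join(filter_fasta(fh)))` should work, where `contigs.fasta` is a placeholder file name. If the intended rule is to drop a record only when it fails both thresholds, the `and` between the two comparisons would become `or`.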


Tags: filter, fasta, QC, python, read
