Coordinates of long non-coding RNAs

Dear All,

I am trying to get long non-coding RNA coordinates in GTF format. I have downloaded the file from here, but I want to filter it based on a couple of conditions:

  • Remove records shorter than 200 bp
  • Keep the records that intersect a coding region, including 100 bp upstream and downstream of it

The first one was easily achievable using Python:

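import pandas as pd
# GTF files are tab-separated, have no header row, and begin with '#' comment lines
df_nc = pd.read_csv('gencode.v37.hg38.long_noncoding_RNAs.gtf', sep='\t', comment='#', names=['CHROM', 'HAVANA', 'TYPE', 'START', 'END', 'ID', 'STRAND', 'ID1', 'DETAILS'])
# GTF coordinates are 1-based and inclusive, so length = END - START + 1
df_nc_len = df_nc[df_nc['END'] - df_nc['START'] + 1 >= 200]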

How can I go about the next condition?
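One possible way to approach the second condition is sketched below in plain Python, under several assumptions: the coding coordinates come from a separate set of GENCODE gene records carrying the `gene_type "protein_coding"` attribute, only `gene`-level records are compared, and the helper names (`build_coding_windows`, `overlaps_any`, `filter_lncrna`) and the toy records are hypothetical, made up for illustration. Each coding interval is widened by 100 bp on both sides, and an lncRNA record is kept when it overlaps one of these windows.

```python
from collections import defaultdict

FLANK = 100  # bp added on each side of every coding interval


def parse_gtf_line(line):
    """Return (chrom, feature, start, end) from one GTF data line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], fields[2], int(fields[3]), int(fields[4])


def build_coding_windows(coding_lines, flank=FLANK):
    """Collect protein-coding gene intervals per chromosome, widened by `flank`."""
    windows = defaultdict(list)
    for line in coding_lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        chrom, feature, start, end = parse_gtf_line(line)
        if feature == "gene" and 'gene_type "protein_coding"' in line:
            windows[chrom].append((max(1, start - flank), end + flank))
    return windows


def overlaps_any(chrom, start, end, windows):
    """True if the closed interval [start, end] overlaps any window on chrom."""
    return any(start <= w_end and end >= w_start
               for w_start, w_end in windows.get(chrom, []))


def filter_lncrna(lnc_lines, windows, min_len=200):
    """Keep lncRNA gene records of at least min_len bp that overlap a window."""
    kept = []
    for line in lnc_lines:
        if line.startswith("#"):
            continue
        chrom, feature, start, end = parse_gtf_line(line)
        if feature != "gene":
            continue
        length = end - start + 1  # GTF is 1-based, inclusive
        if length >= min_len and overlaps_any(chrom, start, end, windows):
            kept.append(line)
    return kept


# Toy records invented for this sketch (not real annotation)
coding = [
    'chr1\tHAVANA\tgene\t1000\t2000\t.\t+\t.\tgene_id "CODING1"; gene_type "protein_coding";',
]
lncrna = [
    'chr1\tHAVANA\tgene\t2050\t2400\t.\t-\t.\tgene_id "LNC_A";',  # inside the 100 bp flank
    'chr1\tHAVANA\tgene\t2200\t2600\t.\t-\t.\tgene_id "LNC_B";',  # beyond the flank
    'chr1\tHAVANA\tgene\t1900\t1950\t.\t+\t.\tgene_id "LNC_C";',  # shorter than 200 bp
    'chr2\tHAVANA\tgene\t1000\t2000\t.\t+\t.\tgene_id "LNC_D";',  # different chromosome
]

windows = build_coding_windows(coding)
kept = filter_lncrna(lncrna, windows)
kept_ids = [line.split('gene_id "')[1].split('"')[0] for line in kept]
print(kept_ids)
```

The linear scan over windows is only meant to show the logic; for a full annotation, an interval tool is likely a better fit. Since the post is tagged bedtools, `bedtools window` with `-w 100` may be worth a look, as it reports features in one file that fall within a window around features in another.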

Also, why do I find exons in the non-coding GTF?

df_nc_len['TYPE'].value_counts()

The third column (TYPE) gives me:

exon          69042
transcript    48673
gene          17882
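On the exon question, a likely explanation is that lncRNAs are still transcribed and frequently spliced, so a GTF describes each locus hierarchically with `gene`, `transcript`, and `exon` records; "non-coding" only means there is no coding sequence, which matches the three feature types counted above. The sketch below uses made-up toy records (the IDs are hypothetical) to show the hierarchy and how one feature level might be selected before applying a length or overlap filter, so that short exons are not mixed with whole genes.

```python
from collections import Counter

# Toy, invented GTF records: one lncRNA gene with a single two-exon transcript
records = [
    'chr1\tHAVANA\tgene\t1000\t5000\t.\t+\t.\tgene_id "LNC_A";',
    'chr1\tHAVANA\ttranscript\t1000\t5000\t.\t+\t.\tgene_id "LNC_A"; transcript_id "LNC_A.1";',
    'chr1\tHAVANA\texon\t1000\t1300\t.\t+\t.\tgene_id "LNC_A"; transcript_id "LNC_A.1";',
    'chr1\tHAVANA\texon\t4800\t5000\t.\t+\t.\tgene_id "LNC_A"; transcript_id "LNC_A.1";',
]

# Column 3 (index 2) of a GTF line holds the feature type
feature_counts = Counter(line.split("\t")[2] for line in records)
print(feature_counts)

# Restrict to a single feature level before filtering by length or overlap
gene_records = [line for line in records if line.split("\t")[2] == "gene"]
print(len(gene_records))
```

With the pandas frame from the question, the equivalent selection would presumably be `df_nc[df_nc['TYPE'] == 'gene']`.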

Any help would be much appreciated.


Tags: gtf, python, bedtools, rna-seq, hg38
