el24 · 2 hours ago · USA

Hi all!

I have a big .bed file, about 100 GB, and I want to extract specific columns from it in a timely manner. I was wondering if there is a tool that works well on huge BED files (or bed.gz). Could you please tell me if you know any excellent tool (or Python package) that can help me generate my desired output?

Here is an example of the head of the data:

chr1    10006   10018   M6176_1.02-NR2F1    0.00117 +   sequence=taaccctaaccc
chr1    10006   10020   M6432_1.02-PPARD    0.00034 +   sequence=taaccctaacccta
chr1    10008   10030   M6456_1.02-RREB1    0.00014 -   sequence=GGGTTAGGGTTAGGGTTAGGGT

And imagine I build my output from the exact values of columns 1, 2, 3, 5, and 6, plus the TF name from column 4. So for the first line of my data, I would like the output to be:

chr1 10006 10018 NR2F1 0.00117 +

I know I can extract it using awk with this command, but I hope there are faster ways to do it:

awk 'BEGIN{OFS="\t"} $5<0.01 {gsub(/.*-/,"",$4); print $1,$2,$3,$4,$5,$6}' file.bed
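In case it makes my intent clearer, here is a minimal Python sketch of the same per-line transformation I'm after (the `extract_fields` helper name and the sample line are just for illustration, not from any particular package):

```python
def extract_fields(line):
    # Hypothetical helper: split a tab-separated BED line, keep columns
    # 1-3, 5, 6, and strip the motif prefix from column 4
    # ("M6176_1.02-NR2F1" -> "NR2F1").
    f = line.rstrip("\n").split("\t")
    tf = f[3].rsplit("-", 1)[-1]
    return "\t".join([f[0], f[1], f[2], tf, f[4], f[5]])

# Example with the first line of my data:
line = "chr1\t10006\t10018\tM6176_1.02-NR2F1\t0.00117\t+\tsequence=taaccctaaccc"
print(extract_fields(line))  # chr1  10006  10018  NR2F1  0.00117  +
```

Of course, looping over a 100 GB file line by line in pure Python like this is exactly what I worry will be too slow, hence the question.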

My second question: can you recommend a way to apply criteria that limit the values in one column, for example keeping a row only when the 5th column is less than 0.00001?
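For concreteness, this is the kind of cutoff I have in mind, written as an awk sketch (file.bed is a placeholder name):

```shell
# Keep only rows whose 5th column (the score) is below 0.00001,
# stripping the motif prefix from column 4 as before.
awk 'BEGIN{OFS="\t"} $5 < 0.00001 {gsub(/.*-/, "", $4); print $1, $2, $3, $4, $5, $6}' file.bed
```

But again, I would be happy to hear about faster or more convenient alternatives for files this size.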

Thank you very much!
