bedtools intersect combined with filtering after particular word


Hello,
I have an NCBI reference .gtf file containing annotations for genes, transcripts, protein IDs, etc. I need to extract only the rows that contain the word "gene" in the feature column (the third column). Can that be done with bedtools intersect, or should I use awk?

The input file looks similar to this:

1       BestRefSeq      gene    943678  943679  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943678  943678  0       T       T
1       BestRefSeq      gene    943682  943683  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943682  943682  0       T       T
1       BestRefSeq      gene    943686  943687  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943686  943686  0       T       T
1       BestRefSeq      gene    943692  943693  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943692  943692  0       T       T
1       BestRefSeq      transcript      924024  924025  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924024  924024  0       G       G
1       BestRefSeq      transcript      924310  924311  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924310  924310  0       G       G
1       BestRefSeq      transcript      924321  924322  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924321  924321  0       G       G
1       BestRefSeq      transcript      924533  924534  .       +       .       gene_id "SAMD11"; transcript_id "NM_001385640.1"; db_xref "GeneID:148398"; gbkey "mRNA"; gene "SAMD11"; product "sterile alpha motif domain containing 11, transcript variant 2"; transcript_biotype "mRNA";         1       924533  924533  0       G       G

And I need only this part of the .gtf file:

1       BestRefSeq      gene    943678  943679  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943678  943678  0       T       T
1       BestRefSeq      gene    943682  943683  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943682  943682  0       T       T
1       BestRefSeq      gene    943686  943687  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943686  943686  0       T       T
1       BestRefSeq      gene    943692  943693  .       +       .       gene_id "SAMD11"; transcript_id ""; db_xref "GeneID:148398"; db_xref "HGNC:HGNC:28706"; db_xref "MIM:616765"; description "sterile alpha motif domain containing 11"; gbkey "Gene"; gene "SAMD11"; gene_biotype "protein_coding"; gene_synonym "MRS";        1       943692  943692  0       T       T

I would appreciate any tips.

Thank you!


bedtools


You should be able to use awk:

awk -F'\t' '$3 == "gene"' my_file.gtf > my_genes_only_file.gtf
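Note that the field separator has to be a tab (`\t`), and the test has to be on the third column rather than on the whole line: a plain text search for "gene" would also match the transcript rows, because their attribute column contains `gene_id`. Here is a small sketch you can run to check the behaviour; the two sample rows are shortened versions of the rows in the question, and the temporary directory and file names are made up for illustration:

```shell
# Work in a throwaway directory so no real file is overwritten.
tmpdir=$(mktemp -d)

# Two shortened, tab-separated sample rows: one "gene" and one "transcript".
# Both contain the substring "gene" in the attribute column (gene_id).
printf '1\tBestRefSeq\tgene\t943678\t943679\t.\t+\t.\tgene_id "SAMD11";\n' > "$tmpdir/sample.gtf"
printf '1\tBestRefSeq\ttranscript\t924024\t924025\t.\t+\t.\tgene_id "SAMD11";\n' >> "$tmpdir/sample.gtf"

# Keep only rows whose third tab-separated field is exactly "gene".
awk -F'\t' '$3 == "gene"' "$tmpdir/sample.gtf" > "$tmpdir/genes_only.gtf"

# Show the result; only the "gene" row should remain.
cat "$tmpdir/genes_only.gtf"
```

If your real file has header lines starting with `#`, they are dropped by this filter as well, since they have no third field equal to "gene".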

