2 hours ago by Midge

Hello all,
I need help selecting rows from a gene table, keeping only the longest transcript when a gene has more than one.
For each unique gene ID (gene), I need to extract the row with the longest transcript (length). I would like to keep the whole row because I need the protein ID and the coordinates for further analysis.
See below for a sample of my table, with data extracted from Ensembl BioMart. For example, for ENSGACG00000000027.1 I would keep only the row with the largest length (3024):
ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024

The whole table has about 60K lines, so I would really appreciate some help with this. I have searched for a solution, but similar questions focus on processing FASTA files.

gene    protein chr start   end length
ENSGACG00000000022.1    ENSGACP00000000030.1    scaffold_89 230109  238032  2415
ENSGACG00000000022.1    ENSGACP00000000031.1    scaffold_89 230109  238032  1284
ENSGACG00000000023.1    ENSGACP00000000032.1    scaffold_89 240752  246187  864
ENSGACG00000000024.1    ENSGACP00000000033.1    scaffold_89 263731  273624  3684
ENSGACG00000000025.1    ENSGACP00000000034.1    scaffold_89 275261  277377  780
ENSGACG00000000026.1    ENSGACP00000000035.1    scaffold_1045   508 2074    678
ENSGACG00000000026.1    ENSGACP00000000036.1    scaffold_1045   508 2074    810
ENSGACG00000000027.1    ENSGACP00000000037.1    scaffold_89 278309  290469  1289
ENSGACG00000000027.1    ENSGACP00000000038.1    scaffold_89 278309  290469  1305
ENSGACG00000000027.1    ENSGACP00000000041.1    scaffold_89 278309  290469  3024
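
To make the expected result concrete, here is a minimal Python sketch of the selection described above. It is only an illustration under some assumptions: the table is whitespace-separated with the six columns shown, the first line is the header, and when two transcripts of a gene have the same length the first one encountered is kept. The function name `longest_transcript_rows` is made up for this example.

```python
def longest_transcript_rows(lines):
    """Return one row (list of fields) per gene: the row with the largest length.

    Assumes whitespace-separated columns: gene protein chr start end length.
    On ties, the first row seen for that gene is kept (an arbitrary choice).
    """
    best = {}  # gene ID -> best row so far; dicts keep first-seen key order
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "gene":  # skip blank lines and the header
            continue
        gene, length = fields[0], int(fields[5])
        if gene not in best or length > int(best[gene][5]):
            best[gene] = fields
    return list(best.values())


sample = """\
gene protein chr start end length
ENSGACG00000000022.1 ENSGACP00000000030.1 scaffold_89 230109 238032 2415
ENSGACG00000000022.1 ENSGACP00000000031.1 scaffold_89 230109 238032 1284
ENSGACG00000000023.1 ENSGACP00000000032.1 scaffold_89 240752 246187 864
ENSGACG00000000024.1 ENSGACP00000000033.1 scaffold_89 263731 273624 3684
ENSGACG00000000025.1 ENSGACP00000000034.1 scaffold_89 275261 277377 780
ENSGACG00000000026.1 ENSGACP00000000035.1 scaffold_1045 508 2074 678
ENSGACG00000000026.1 ENSGACP00000000036.1 scaffold_1045 508 2074 810
ENSGACG00000000027.1 ENSGACP00000000037.1 scaffold_89 278309 290469 1289
ENSGACG00000000027.1 ENSGACP00000000038.1 scaffold_89 278309 290469 1305
ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024
"""

# Print the selected rows as tab-separated text, one row per gene.
for row in longest_transcript_rows(sample.splitlines()):
    print("\t".join(row))
```

With the sample above, this should keep a single row per gene, and for ENSGACG00000000027.1 it keeps the ENSGACP00000000041.1 row. For the full table, the same function could be fed the lines of an open file instead of `sample.splitlines()`.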
