Hello all,
I need help selecting rows from a gene table, keeping only the longest transcript when a gene has more than one.
For each unique gene ID (gene), I need to extract the row with the longest transcript (length). I would like to keep the whole row because I need the ID of the protein and coordinates for further analysis.
See below for a sample of my table, with data extracted from Ensembl BioMart. For example, for ENSGACG00000000027.1 I would keep only the row with the greatest length (3024):
(ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024).
The whole table has about 60K lines, so I would really appreciate some help filtering it. I have been trying to work this out, but the similar questions I found focus on processing FASTA files.
gene protein chr start end length
ENSGACG00000000022.1 ENSGACP00000000030.1 scaffold_89 230109 238032 2415
ENSGACG00000000022.1 ENSGACP00000000031.1 scaffold_89 230109 238032 1284
ENSGACG00000000023.1 ENSGACP00000000032.1 scaffold_89 240752 246187 864
ENSGACG00000000024.1 ENSGACP00000000033.1 scaffold_89 263731 273624 3684
ENSGACG00000000025.1 ENSGACP00000000034.1 scaffold_89 275261 277377 780
ENSGACG00000000026.1 ENSGACP00000000035.1 scaffold_1045 508 2074 678
ENSGACG00000000026.1 ENSGACP00000000036.1 scaffold_1045 508 2074 810
ENSGACG00000000027.1 ENSGACP00000000037.1 scaffold_89 278309 290469 1289
ENSGACG00000000027.1 ENSGACP00000000038.1 scaffold_89 278309 290469 1305
ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024
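
To make the goal concrete, here is a minimal sketch of the selection in plain Python. It is one possible approach, not a tested solution for the full 60K-line table. It assumes the table is whitespace-delimited with exactly six columns in the order shown above, that the header row starts with "gene", and that when two transcripts of a gene tie on length the first one seen is kept. The function name longest_transcripts is only illustrative.

```python
def longest_transcripts(lines):
    """Return one row per gene: the row with the largest value in the length column."""
    best = {}  # gene ID -> (length, row fields); dicts keep insertion order (Python 3.7+)
    for line in lines:
        fields = line.split()
        # Assumption: data rows have exactly 6 fields; skip blanks and the header.
        if len(fields) != 6 or fields[0] == "gene":
            continue
        gene, length = fields[0], int(fields[5])
        # Strict '>' keeps the first row seen when lengths tie.
        if gene not in best or length > best[gene][0]:
            best[gene] = (length, fields)
    return [row for _, row in best.values()]


# A few rows copied from the sample table above.
sample = """\
gene protein chr start end length
ENSGACG00000000022.1 ENSGACP00000000030.1 scaffold_89 230109 238032 2415
ENSGACG00000000022.1 ENSGACP00000000031.1 scaffold_89 230109 238032 1284
ENSGACG00000000023.1 ENSGACP00000000032.1 scaffold_89 240752 246187 864
ENSGACG00000000027.1 ENSGACP00000000037.1 scaffold_89 278309 290469 1289
ENSGACG00000000027.1 ENSGACP00000000038.1 scaffold_89 278309 290469 1305
ENSGACG00000000027.1 ENSGACP00000000041.1 scaffold_89 278309 290469 3024
"""

for row in longest_transcripts(sample.splitlines()):
    print("\t".join(row))
```

For the real table, the same function could be given an open file handle instead of the sample string, for example `with open("genes.txt") as fh: rows = longest_transcripts(fh)`, where genes.txt is a placeholder file name. Each returned row keeps all six columns, so the protein ID and coordinates remain available for the later analysis.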