I am looking at the Ensembl release 100 GTF annotation file for a model animal of interest and found that it lacks a number of important genes, most crucially homologues of human genes. Interestingly, a few alternative annotations in GTF format are available alongside the reference annotation for this species.
I can find some of the missing genes in the alternative annotations and was wondering what the best way is to merge the GTFs into an 'improved' annotation for the species. I am not particularly fussed about differences in exons, intron chains and transcripts; what matters to me is the completeness of the protein-coding gene annotation.
I imagine one way of doing this is manually, via a script that uses unique gene signatures as hash keys. Is there an existing tool that does this better?
So far, I've only found 'StringTie' and 'GffCompare'. Both appear unsuitable for the task, because they discard all gene-level information in the input and retain only 'exon' and 'transcript' features.
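To make the manual route concrete, a minimal Python sketch of such a script might look like the following. It rests on several assumptions that are not guaranteed for every GTF: that both files are on the same assembly, that attributes are quoted in the Ensembl style (`gene_id "..."; gene_name "...";`), and that the alternative annotation carries a `gene_biotype` attribute. The "gene signature" used here (upper-cased `gene_name`, falling back to `gene_id`) and the function names `gene_key` and `merge_gtf` are hypothetical choices for illustration only. The sketch does not check coordinate overlap, so a gene present under a different name in both files would be duplicated.

```python
import re
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

# Matches quoted GTF attribute pairs such as: gene_id "ENSG000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn the 9th GTF column into a dict (last value wins for repeated keys)."""
    return dict(ATTR_RE.findall(field))


def gene_key(attrs: Dict[str, str]) -> Optional[str]:
    """Assumed gene signature: upper-cased gene_name, else gene_id."""
    name = attrs.get("gene_name") or attrs.get("gene_id")
    return name.upper() if name else None


def records(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (raw_line, attributes) for every 9-column feature line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) == 9:
            yield line, parse_attributes(cols[8])


def merge_gtf(reference: Iterable[str], alternative: Iterable[str],
              biotype: str = "protein_coding") -> List[str]:
    """Return the reference lines plus all alternative features that belong
    to genes of the given biotype whose signature is absent from the reference."""
    merged = [line.rstrip("\n") for line in reference if line.strip()]
    known = {gene_key(attrs) for _, attrs in records(merged)}

    alt = list(records(alternative))

    # Pass 1: collect name and biotype per gene_id, in case they are
    # only present on some of the gene's feature lines.
    info: Dict[str, Dict[str, str]] = {}
    for _, attrs in alt:
        gid = attrs.get("gene_id")
        if gid:
            entry = info.setdefault(gid, {"gene_id": gid})
            for key in ("gene_name", "gene_biotype"):
                if key in attrs:
                    entry.setdefault(key, attrs[key])

    wanted = {
        gid for gid, entry in info.items()
        if entry.get("gene_biotype") == biotype and gene_key(entry) not in known
    }

    # Pass 2: keep every feature (gene, transcript, exon, CDS, ...) of those genes.
    merged.extend(line for line, attrs in alt if attrs.get("gene_id") in wanted)
    return merged
```

It could be called as `merge_gtf(open("reference.gtf"), open("alternative.gtf"))`, where both file names are placeholders, and the returned lines written out one per line. The added features are appended at the end, so the result would still need sorting by coordinate if a downstream tool expects that.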
Thanks in advance!