I downloaded the complete human coding sequence file (Homo_sapiens.GRCh38.cds.all.fa) from Ensembl but have not been able to filter it. I expected to find scripts for this online, but I have not come across any so far.
I need help with the following:
(1) Extract the longest CDS transcript for each gene from Homo_sapiens.GRCh38.cds.all.fa
(2) Remove the pseudogenes
I want to use the final output file as a reference database to find orthologous coding sequences in other mammalian taxa.
I am a beginner in bioinformatics, especially with big data, but I can find my way around Ubuntu and a Vagrant VM.