RefSeq Annotation Report - Gene and Feature Statistics
I have a question about RefSeq annotation reports, particularly the one for the purple urchin. The report lists 258,355 exons, but the GFF file for this assembly has 442,528 lines in which column three has the value "exon". What would explain this discrepancy?
For example, the following command prints 442528:
wget -O - -o /dev/null https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/7668/102/GCF_000002235.5_Spur_5.0/GCF_000002235.5_Spur_5.0_genomic.gff.gz | gunzip --stdout | awk '$3 == "exon"' | wc -l
I see that there is a note on the annotation report indicating that the counts do not include pseudogenes. There is also an additional note next to the exons row that states:
"Exons in mRNAs, misc_RNAs and ncRNAs of class lncRNA. Does not include tRNAs, rRNAs or ncRNAs of class other than lncRNA. Exons shared by multiple transcripts are counted once."
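I'm not sure that these exclusions and the collapsing of shared exons could account for a difference of 184,173 exons (442,528 - 258,355). My goal is to compute sequencing coverage statistics across exons, so I need to know which set of exons the report is counting.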
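One rough way to check whether the note's rules explain the gap is to count each exon interval once and drop the excluded classes. The function below is only a sketch, and `count_unique_exons` is a name I made up. It assumes that NCBI-style GFF3 exon rows carry a `gbkey=` attribute naming the parent feature kind and that pseudogene features carry `pseudo=true`. It also keeps every `gbkey=ncRNA` exon, so it does not separate lncRNA from other ncRNA classes and may still overcount relative to the report.

```shell
#!/bin/sh
# Count distinct exon intervals in a GFF3 stream read from stdin.
# Assumptions (not confirmed by the annotation report):
#   - exon rows carry gbkey=<kind of parent feature> in column 9
#   - pseudogene features carry pseudo=true in column 9
count_unique_exons() {
  awk -F'\t' '
    $3 != "exon" { next }                        # keep exon rows only
    {
      attrs = ";" $9 ";"                         # pad so every attribute is ;key=value;
      if (index(attrs, ";pseudo=true;")) next    # skip pseudogene exons
      if (attrs !~ /;gbkey=(mRNA|misc_RNA|ncRNA);/) next   # skip tRNA, rRNA, etc.
      key = $1 FS $4 FS $5 FS $7                 # seqid, start, end, strand
      if (!(key in seen)) { seen[key] = 1; n++ } # count each interval once
    }
    END { print n + 0 }                          # print 0 for empty input
  '
}
```

It would be used on the same stream as the command above, by replacing the `awk '$3 == "exon"' | wc -l` part with `count_unique_exons`. If the result lands near 258,355, the duplicate exons shared between transcripts are probably the main source of the difference.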