Here I show how to filter an annotation (often evidence-based) to create a golden set of genes in order to train ab-initio annotation tools.


  • an annotation in gff or gtf
  • AGAT (Another Gff Analysis Toolkit)
  • blast
  • to make it easier a nice folder structure:
    • mkdir filter
    • mkdir protein
    • mkdir nonredundant
    • mkdir blast_recursive

1. Select protein coding genes
First we select only the protein coding genes from the annotation file and remove all other type of level2 features: tRNA, snoRNA, etc. --gff annotation_evidence_based.gff -o splitByLevel2  
ln -s splitByLevel2/mrna.gff

2. Optional (Only if Maker annotation): filter by score
Next step, we need to filter the best genes. We will keep gene with AED score under a certain score (e.g 0.3). --gff mrna.gff  --value 0.3 -a _AED -t ">=" -o filter/codingGeneFeatures.filter.gff

3. Longest isoform per mRNA
Now we filter mRNAs to keep only the longest isoform when several are avaialble. --gff filter/codingGeneFeatures.filter.gff -o filter/codingGeneFeatures.filter.longest_cds.gff

4. Complete genes
We then check that the gene models are complete (has a start and stop codon), and remove the incomplete ones. --gff filter/codingGeneFeatures.filter.longest_cds.gff -f genome.fa -o filter/codingGeneFeatures.filter.longest_cds.complete.gff

4. Distance from neighboring genes
We can also filter the gene models by distance from neighboring genes in order to be sure to have nice intergenic region wihtout genes in it in order to train properly intergenic parameters. (default 500 bp) --gff filter/codingGeneFeatures.filter.longest_cds.complete.gff -o filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff

5. Remove redundancy
To avoid to bias the training and give an exhaustive view of the diversity of gene we have to remove the ones that are too similar to each other. In order to do so, we translate our coding genes into proteins, format the protein fasta file to be able to run a recursive blast and then filter them. --gff filtercodingGeneFeatures.filter.longest_cds.complete.good_distance.gff -f genome.fa --aa -o protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa  

makeblastdb -in protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -dbtype prot   

blastp -query protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -db 
protein/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa -outfmt 6 -out blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive --gff filter/codingGeneFeatures.filter.longest_cds.complete.good_distance.gff --blast blast_recursive/codingGeneFeatures.filter.longest_cds.complete.good_distance.proteins.fa.blast_recursive --outfile nonredundant/

Voila, nonredundant/ contains your nice gene set.
You need several hundred genes to train properly an ab initio tool. The more you have, the better is... but you kind of reach a plateau around 700 genes.

Source link