Hello everyone,

I'm pretty new on the genomic and transcriptomic world, so any kind of help is much appreciated. My problem in a nutshell is that I can't use featureCounts with the GBFF file provided in NCBI for Periplaneta americana. I've tried to convert the GBFF file to GFF and GTF but always get error. I can align transcripts to FASTA genome, and generate BAM and SAM files but that's all. Any idea if the GBFF file is corrupt or what I'm doing wrong? Here below a more detailed explanation of the problem and command lines I'm using.

Many thanks!


I'm trying to do a differential expression analysis from RNAseq samples of cockroach Periplaneta americana. I did it before using a de novo transcriptome with Salmon without problems. Nevertheless, I'd like to do the same analysis using the genome of this species (available here, Li et al 2018).

I can align transcripts and generate SAM and BAM files using the genome (FASTA format, I downloaded here and I merged all bits together) with hisat2 and samtools following this tutorial (more or less):

hisat2-build genome_fasta.fa index/genome_hisat hisat2 -x index/genome_hisat -1 my_fastafile1 -2 my_fastafile2 -S aligned_reads/aligned_out.sam

samtools view -bS aligned_reads/aligned_out.sam > aligned_reads/aligned_out.bam

samtools sort aligned_reads/aligned_out.bam -o aligned_reads/aligned_out.sorted.bam

samtools index aligned_reads/aligned_out.sorted.bam

samtools flagstat aligned_reads/aligned_out.sorted.bam

samtools idxstats aligned_reads/aligned_out.sorted.bam

Nevertheless, I get stack after this step. I cannot use featureCounts or anything to generate counts. I tried to download the GTF file using NCBI Assembly (here), which downloads some json and jsonl files but no GTF file. So the only option is to convert the GBFF file they provide to GFF. I did it with bp-genbank2gff3.pl, and the resulting file looks a bit weird (see image below). I also tried the conversion from this GFF to GTF file using gffread but gives an empty output file.

Image of GFF file after conversion

I used this command line to generate a GFF file that looks 'better' but it also gives empty output file when converting to GTF.

cut -d';' -f1 GCA_002939525.1_ASM293952v1_genomic.gbff.gff > genome.gtf

GFF file after 'cut' command line

I've tried to run featureCounts with both GFF files but I always get this error:

ERROR: no features were loaded in format GTF. The annotation format can be specified by the '-F' option, and the required feature type can be specified by the '-t' option..
The program has to terminate.

The command line I use is this one, although I also tried to add several options (e.g., -F GFF, -t ID, ... ) but nothing worked:

featureCounts -a genome.gff -o featureCounts/sample1.counts aligned_reads/aligned_out.sorted.bam

Any idea of what I'm doing wrong? Is it even possible to do this analysis with the GBFF file provided or is there some format problem? I can provide a BAM/SAM file or any other file if they can help by email (sorry, they might be too heavy to upload them here).

Thanks a lot!



Source link