I am very new to bioinformatics and have quite limited training. I am trying to look at a publicly available data set (SRP182816, Drosophila testis single-cell RNA-seq from Witt et al. 2019, eLife), but it seems I can't quite get the correct quantification.

I downloaded the data from the SRA site. I then used the all-transcript FASTA file from FlyBase (dmel-all-transcript-r6.33.fasta.gz) to create an index with salmon, and followed up by quantifying the experiment with salmon alevin.

This is how I called salmon alevin:

salmon alevin -l ISR -1 517_novogene_S1_L008_R1_001.fastq.gz -2 517_novogene_S1_L008_R2_001.fastq.gz --chromium -i Dmel_Ref_flybase_salmon/ -o SalmonAlevin_out --tgMap t2g.tsv
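
For context, the file passed to --tgMap is a two-column, tab-separated table of transcript ID and gene ID. Below is a minimal sketch of how such a table could be derived from the FlyBase transcript FASTA, assuming each header line carries a `parent=FBgn...` field. That header layout is an assumption, and the function names `parse_header` and `write_t2g` are purely illustrative.

```python
import gzip
import re

# Assumed FlyBase header layout (not verified against every record):
# >FBtr0070000 type=mRNA; loc=...; ID=FBtr0070000; name=...; parent=FBgn0031081; ...
HEADER_RE = re.compile(r"^>(\S+).*?parent=([^;,\s]+)")


def parse_header(line):
    """Return (transcript_id, gene_id) from a FASTA header, or None if no parent= field."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


def write_t2g(fasta_gz, out_tsv):
    """Write one 'transcript<TAB>gene' row per FASTA record; return the row count."""
    rows = 0
    with gzip.open(fasta_gz, "rt") as fin, open(out_tsv, "w") as fout:
        for line in fin:
            if not line.startswith(">"):
                continue  # skip sequence lines
            pair = parse_header(line)
            if pair is None:
                raise ValueError(f"no parent= field in header: {line.strip()}")
            fout.write(f"{pair[0]}\t{pair[1]}\n")
            rows += 1
    return rows


# Hypothetical usage:
# write_t2g("dmel-all-transcript-r6.33.fasta.gz", "t2g.tsv")
```

If a header lists several comma-separated parents, this sketch keeps only the first one. The transcript IDs in the first column have to match the names in the salmon index exactly.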

Salmon reported a mapping rate of 39.7% and gave 3001 cells (the authors sequenced 5000 cells). When I looked at the PCA plot for the quantified gene-by-cell table, it was horseshoe-shaped (it quite literally looks like a Nike logo, but a bit more symmetrical). I am under the impression that this indicates something is wrong, but I cannot figure out what to do at this point.

Can someone more knowledgeable please let me know what may cause this and how I could fix it?

