Hello all,
First of all, I apologise if I have left out any information; this is only my second Biostars post.

I have 3' RNA-seq reads prepped with the Lexogen QuantSeq FWD kit. Read depth ranges from 4 to 15 million reads per sample. My species is a stony coral (Acropora palmata).

I have so far done adapter and quality trimming as recommended by Lexogen, and checked the results with FastQC (everything looks good).

bbduk.sh in=stdin.fq out=${i}_trimmed_clean ref=/data/resources/polyA.fa.gz,/data/resources/truseq_rna.fa.gz k=13 ktrim=r useshortkmers=t mink=5 qtrim=r trimq=10 minlength=20

I am now running into problems while trying two methods:

  1. STAR align and Salmon Quant.

I am again using the STAR align parameters recommended by Lexogen (see the code chunk below, which is taken from their website and so does not include my exact read paths). My STAR index was built from the GFF3 and masked FASTA available for the Acropora genome. I have added the extra flags --outFilterScoreMinOverLread 0.5 and --outFilterMatchNminOverLread 0.5, which result in a unique mapping rate of around 65% for all samples. I probably do have reads from the symbiotic Symbiodinium, and this rate is in line with values reported for other coral TAG-seq studies.

STAR --runThreadN 8 --genomeDir /data/star/human --readFilesIn fastq/${sample}_R1.fastq --outFilterType BySJout --outFilterMultimapNmax 20 --alignSJoverhangMin 8 --alignSJDBoverhangMin 1 --outFilterMismatchNmax 999 --outFilterMismatchNoverLmax 0.1 --alignIntronMin 20 --alignIntronMax 1000000 --alignMatesGapMax 1000000 --outSAMattributes NH HI NM MD --outSAMtype BAM SortedByCoordinate --outFileNamePrefix star_out/${sample}
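
For comparing samples, the unique mapping rate can be pulled straight out of STAR's Log.final.out. This is only a minimal sketch: the function name star_unique_rate is made up for illustration, and the log path assumes the --outFileNamePrefix star_out/${sample} used in the command above.

```shell
# Print the "Uniquely mapped reads %" value (without the % sign)
# from a STAR Log.final.out file.
# Hypothetical helper; usage: star_unique_rate star_out/${sample}Log.final.out
star_unique_rate() {
    awk -F'|' '/Uniquely mapped reads %/ {
        gsub(/[ \t%]/, "", $2)   # strip whitespace and the percent sign
        print $2
    }' "$1"
}
```

Looping this over every star_out/*Log.final.out file gives a quick per-sample table of unique mapping rates.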

  2. Salmon Index and Salmon Quant

I have also been testing the Salmon Index and Quant for my data. Here is where my query comes in.

I generate a transcriptome FASTA for Salmon indexing using gffread, from the same FASTA and GFF3 files used for STAR.

gffread -F -w /nethome/bdy8/apal_genome/version3.1/Apal_3.1_gff_t_.fa -g /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_v3.1_200911.masked.fasta /nethome/bdy8/apal_genome/version3.1/Apalm_assembly_v3.1_notrna_200911.gff3

I then use this in the Salmon index command. As I have short reads, I have gone with a small k value (-k 9 in the command below).

salmon index -t /nethome/bdy8/apal_genome/version3.1/Apal_3.1_gff_t_.fa -i /nethome/bdy8/apal_genome/version3.1/Apal_trans_index_3.1 -k 9
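
Because the choice of k depends on how long the reads are after trimming, a quick read-length summary may help when reporting this. A minimal sketch, assuming an uncompressed FASTQ with four lines per record; the function name read_length_summary is hypothetical.

```shell
# Summarise read lengths in an uncompressed 4-line-per-record FASTQ file.
# Prints "min max mean" on one line.
# Hypothetical helper; usage: read_length_summary sample_tr.fastq
read_length_summary() {
    awk 'NR % 4 == 2 {               # sequence lines only
        n = length($0)
        if (count == 0 || n < min) min = n
        if (n > max) max = n
        total += n
        count++
    }
    END {
        if (count > 0) printf "%d %d %.1f\n", min, max, total / count
    }' "$1"
}
```

With minlength=20 in the bbduk.sh step, the reported minimum should not fall below 20.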

This index is then used in the Salmon quant command.

salmon quant -i /nethome/bdy8/apal_genome/version3.1/Apal_trans_index_3.1 -l SR -p 12 --minScoreFraction 0.3 --noLengthCorrection -r /scratch/projects/transcriptomics/ben_young/POR/tagseq/host/trimmed_reads/'"${PALPAL}"'_tr.fastq -o /scratch/projects/transcriptomics/ben_young/POR/tagseq/host/salmon_quant/'"${PALPAL}"'_salmon

This is where I have problems. My mapping rate is 3%, and Salmon reports a high number of mappings discarded because of alignment score, which I do not understand. Compared with what I get from alignment with STAR, I would expect at least a 40% mapping rate. Based on other questions and the recommendations in salmon quant --help, I have used --noLengthCorrection (recommended for QuantSeq data in the man page) and --minScoreFraction 0.3 to try to increase the mapping rate. Lowering --minScoreFraction from 0.3 to 0 increases the mapping rate to ~8%, but that is still awful. I have also tried older versions of the genome, with no luck so far. I have also checked the library type in the Salmon quant output, and SR (the -l SR option) is what it detects for my reads.
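
To compare these rates across samples and settings, the numbers can be read from the aux_info/meta_info.json file that Salmon writes into each output directory. A minimal sketch, assuming the JSON is pretty-printed with one key per line (as I understand Salmon writes it) and that fields such as percent_mapped, num_processed and num_mapped are present; the function name salmon_meta_field is hypothetical, and jq would be a more robust choice if it is installed.

```shell
# Print the value of one top-level field from a Salmon quant output directory.
# Assumes aux_info/meta_info.json has one "key": value pair per line.
# Hypothetical helper; usage: salmon_meta_field /path/to/sample_salmon percent_mapped
salmon_meta_field() {
    sed -n "s/^[[:space:]]*\"$2\":[[:space:]]*\([^,]*\),\{0,1\}[[:space:]]*\$/\1/p" \
        "$1/aux_info/meta_info.json"
}
```

Calling it with percent_mapped for each *_salmon directory gives the mapping rate per sample without having to search the log files.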

On a recommendation from Genomax, I have also tried the --softclip flag, and this had no discernible effect on the mapping rate.

Any and all advice would be seriously appreciated, and please let me know if you need any additional information. I appreciate the help.
