Tajima's D for targeted sequencing data
I apologize for the long post! I just thought it might help to give some background.
I was hoping to get some suggestions for testing for selection at a particular (short) locus in a population, given that our data consist of fewer than 100 short loci.
Our data: We are working with targeted amplicon sequencing data (Illumina short reads). The amplicons are ~70-100 bp each, and we have ~80 of these loci across the genome (not necessarily in coding regions, although some may be). For our population of interest, we have genotyped these loci in 14 individuals (it is a small population). We also have data for the same loci in other populations.
Additionally, these are non-invasive samples, so we are concerned about genotyping error and missing data (although depth is not too low), as well as non-target amplification. We have BAM files (reads mapped with bwa to the reference, a chromosome-level assembly).
Our question: We want to know whether any one of the ~80 loci could be under selection in our population of interest.
1. Can we use outlier-locus (between-population) tests, given that we only have ~80 loci?
2. If we try Tajima's D (a within-population test), can we somehow restrict the analysis to each locus separately, so that we can interpret the output meaningfully and see which locus, if any, may be under selection?
3. What program could we use for this? If we try DnaSP, is there a way to code the SNPs using IUPAC nomenclature and submit them to the program? Also, could we restrict the analysis to our loci?
4. Since we have short loci, we thought ANGSD might be the best option, since we could use the genotype likelihoods across each locus and estimate Tajima's D in sliding windows. With this approach, we have the following concerns:
4a. Because this is SFS-based, do we have enough loci? If yes, is there a way to restrict the analysis to our ~80 loci (we want to avoid picking up non-target amplifications)?
4b. At present, when we run the code (ANGSD and realSFS) without the sliding-window option, we get one estimate per chromosome. Is this the average over all the sites covered on that chromosome? We did use a regions file with Chr:start-stop entries for our loci; however, we have multiple loci on each chromosome, so a per-chromosome estimate is not very helpful or easy to interpret.
4c. When we specify a window size, often the window is interrupted by the "ends" specified in the regions file and we get an error. Is there any way to circumvent this?
(The error we get:
outnames=simAngsd/slidingWindowAnalysis/simTajimaDslidingW step: 10 win: 50
pc.chr=A1 pc.nSites=506 firstpos=107559481 lastpos=94103534
end of dataset is before end of window: end of window:107559540 last position in chr:94103534)
4d. Would it make sense to not specify a regions file, and then filter the results to our desired windows/loci?
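To make the "one value per locus" idea concrete, here is a minimal Python sketch of what we have in mind. It is not what ANGSD does: it assumes hard-called genotypes rather than genotype likelihoods, assumes diploid individuals (so 14 individuals give n = 28 sampled chromosomes), and ignores missing data. The function names, the locus name, and the coordinates are made up for illustration. Sites are assigned to loci by coordinate, so anything falling outside our amplicons (e.g. non-target amplification) is dropped, and the standard Tajima (1989) formula is then applied to each locus separately.

```python
from math import sqrt


def tajimas_d(alt_counts, n):
    """Tajima's D for one locus.

    alt_counts: count of the alternate allele at each site (0..n).
    n: number of sampled chromosomes (2 x individuals if diploid).
    Returns None when D is undefined (no segregating sites, or n < 4).
    """
    seg = [k for k in alt_counts if 0 < k < n]  # keep segregating sites only
    S = len(seg)
    if S == 0 or n < 4:
        return None

    # Mean pairwise differences, summed over segregating sites
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in seg)

    # Constants from Tajima (1989)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * S + e2 * S * (S - 1)
    return (pi - S / a1) / sqrt(variance)


def per_locus_d(sites, loci, n):
    """Group sites by locus and compute one D per locus.

    sites: iterable of (chrom, pos, alt_count).
    loci: dict of name -> (chrom, start, stop), 1-based inclusive.
    Sites outside every locus are ignored.
    """
    counts = {name: [] for name in loci}
    for chrom, pos, k in sites:
        for name, (c, start, stop) in loci.items():
            if chrom == c and start <= pos <= stop:
                counts[name].append(k)
    return {name: tajimas_d(ks, n) for name, ks in counts.items()}


if __name__ == "__main__":
    # Hypothetical locus and sites; the second site is off-target and is dropped
    loci = {"locus_01": ("A1", 100, 180)}
    sites = [("A1", 150, 3), ("A1", 160, 12), ("A1", 5000, 7)]
    print(per_locus_d(sites, loci, n=28))
```

With amplicons this short, most loci will have only zero to a few segregating sites, so many per-locus values will be undefined or very noisy. That is part of why we are unsure whether this kind of per-locus estimate is meaningful.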