How to compare LD of a gene across all subpopulations in 1000 Genomes?
I have a list of genes for which I need to compare the LD plots of each gene across all subpopulations in 1000 Genomes. I have written a Perl script to do so but have run into a few challenges.
My script does the following, given a gene name:
1) Extract the list of variants present in the 1000G data from the VCF file (the records have no rsID, only a position).
2) Calculate pairwise LD for all variants in each 1000G subpopulation using PLINK.
3) Plot the LD for all subpopulations in one PDF file using R.
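Because the VCF records carry no rsID, step 1 needs some other key to identify each variant consistently in every subpopulation. A minimal sketch in Python (not the Perl script itself) builds a position-based ID from the CHROM, POS, REF and ALT columns. The function name `variant_ids`, the `chrom:pos:ref:alt` format and the example records are assumptions made up for illustration.

```python
def variant_ids(vcf_lines):
    """Return a position-based ID (chrom:pos:ref:alt) for each VCF record."""
    ids = []
    for line in vcf_lines:
        # Skip blank lines, "##" meta lines and the "#CHROM" header line
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _rsid, ref, alt = fields[:5]
        ids.append(f"{chrom}:{pos}:{ref}:{alt}")
    return ids


# Hypothetical records for illustration only (not real 1000G data)
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t1000\t.\tA\tG\t100\tPASS\t.",
    "1\t1500\t.\tC\tT\t100\tPASS\t.",
]
print(variant_ids(example))
```

Including REF and ALT in the key keeps two different variants at the same position from collapsing into one ID.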
As I am new to population genetics, I have two questions:
1) From the attached LD1 plot, you can see that the list of variants is not the same for all subpopulations. Can I just take the subset of variants common to all subpopulations so that I can compare LD among them?
2) For some genes there are too many variants (>100, sometimes 200-300), so the LD plot either does not render or is uninformative (see plot LD2). How can I subset the list of variants WITHOUT LOSING the LD structure? (NOTE: the --indep option in PLINK is not suitable for me; I am NOT looking for independent SNPs.)
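For the first question, the common-subset idea can be sketched as a set intersection over the per-population variant lists. This is only an illustration in Python, assuming each variant has a position-based ID of the form `chrom:pos:ref:alt`; the function name `common_variants` and the example IDs are hypothetical, and whether restricting to shared variants is appropriate is exactly what the question asks.

```python
def common_variants(variants_by_pop):
    """Return the variant IDs present in every population, ordered by position."""
    sets = [set(ids) for ids in variants_by_pop.values()]
    if not sets:
        return []
    shared = set.intersection(*sets)

    def sort_key(vid):
        # IDs are assumed to look like chrom:pos:ref:alt
        chrom, pos = vid.split(":")[:2]
        return (chrom, int(pos), vid)

    return sorted(shared, key=sort_key)


# Hypothetical variant lists per population (IDs made up for illustration)
variants_by_pop = {
    "GBR": ["1:1000:A:G", "1:1500:C:T", "1:2000:G:A"],
    "YRI": ["1:1000:A:G", "1:1500:C:T", "1:2500:T:C"],
    "CHB": ["1:1500:C:T", "1:1000:A:G"],
}
print(common_variants(variants_by_pop))
```

The shared IDs could then be written one per line to a file and passed to PLINK's `--extract` option for each subpopulation, provided the variant IDs in the PLINK files use the same format.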