i am trying to perform a PCA on a local data set in comparison to the 1000 genome dataset. for that i merged the two sets using bcftools with the below command:
bcftools merge --force-samples -m none 1000genome_normalized.vcf.gz local_normalized.vcf.gz -Oz -o Merged.vcf.gz
after that i used plink 1.9 to perform the PCA :
plink --vcf Merged.vcf.gz --pca var-wts header --out Merged-plink1.9
i also did the same with plink 2.0 and got the same result.
when i got the eigenvec values and plot them through R and got the below output
now i need to color code the plot to show the 1000 genome samples as opposed to my local samples. how can this be performed, knowing that when i checked the eigenvec file i found that all family ids changed to 0!!! but i still have the sample IDs. so is there a way where i can group the samples and indicate to R to color code it accordingly?
for clarification, here is a row from the eigenvec file representing the 1000genome samples:
0 HG00096 -0.0117915 0.0241081 0.0132739 -0.0150777 -0.0034903
0.00428502 -0.00142438 0.0305039 -0.00026421 0.000715683 -0.000358925 0.00548679 -0.00263568 0.00154371 0.0028039 0.000966683 -0.00173037 0.00517797 -0.00036687 -0.00965849
and here is a row representing the local dataset:
0 13590 -0.0119216 0.0149963 0.00264354 -0.00849061 -0.0044652
0.00106673 -0.00368617 -0.00693559 0.0069702 0.0024315 0.00691328 -0.019868 0.0171684 -0.0075124 -0.0335678 0.0382487 0.0106169 0.0207575 -0.00615446 -0.0858812
it is clear that the 1000genome values start with alphabet while the local doesn't. will this be applicable in R? if so, then how can i use it and plot a colored scatter diagram? i am very sorry for this long question, but i have
zero experience in R and wanted to make all the data available