Posted 2 hours ago by Nicolas Rosewick (Belgium, Brussels)


I merged two PLINK datasets using:

# keep only the SNPs present in both datasets
plink --keep-allele-order --bfile dataset_A --extract snp_in_common.txt --make-bed --out dataset_A_common
plink --keep-allele-order --bfile dataset_B --extract snp_in_common.txt --make-bed --out dataset_B_common

echo dataset_A_common > merge.txt
echo dataset_B_common >> merge.txt

# merge datasets
plink --merge-list merge.txt --make-bed --out dataset_merge

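# filter out SNPs with low MAF, low genotyping rate, or strong deviation from HWE
# (--make-bed is needed so that the filtered fileset is actually written)
plink --bfile dataset_merge --maf 0.01 --geno 0.05 --hwe 0.00001 --make-bed --out dataset_merge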
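
The commands above do not show how `snp_in_common.txt` was built. A minimal sketch of one way to produce it, assuming the variant IDs (column 2 of each `.bim` file) follow the same naming scheme in both datasets; the helper name `common_snps` is hypothetical:

```shell
# Hypothetical helper: print the variant IDs present in both .bim files.
# Assumes column 2 of each .bim holds the variant ID and that both
# datasets name variants the same way (e.g. rsIDs).
common_snps() {
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1
    awk '{print $2}' "$1" | sort -u > "$tmp_a"
    awk '{print $2}' "$2" | sort -u > "$tmp_b"
    comm -12 "$tmp_a" "$tmp_b"   # keep only IDs found in both sorted lists
    rm -f "$tmp_a" "$tmp_b"
}

# Usage (file names taken from the commands above):
# common_snps dataset_A.bim dataset_B.bim > snp_in_common.txt
```

Matching on IDs alone does not check that positions or alleles agree between the two datasets, so this is only a starting point.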

I performed a PCA (after pruning the merged dataset):

# pruning
plink --bfile dataset_merge --exclude high-ld-regions.txt --range --indep-pairwise 50 5 0.2 --out dataset_merge
# keep only the SNPs retained by --indep-pairwise (listed in dataset_merge.prune.in)
plink --bfile dataset_merge --extract dataset_merge.prune.in --make-bed --out dataset_merge_pruned

# pca
plink --pca --bfile dataset_merge_pruned --out dataset_merge_pruned

The PCA plot clearly shows a strong batch effect between the two datasets:

[Image: PCA plot of the merged dataset]
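
One thing that may be worth checking at this point is whether case/control status is spread across both datasets or concentrated in one of them. If the batches and the phenotype largely coincide, a principal component that separates the batches will also absorb most of the case/control signal. A minimal sketch, assuming the phenotype is stored in column 6 of the `.fam` files; the helper name `pheno_by_batch` is hypothetical:

```shell
# Hypothetical helper: count phenotype codes per dataset from .fam files.
# Column 6 of a .fam file is the phenotype
# (1 = control, 2 = case, 0 or -9 = missing).
pheno_by_batch() {
    for fam in "$@"; do
        awk -v f="$fam" '{ n[$6]++ } END { for (p in n) print f, p, n[p] }' "$fam"
    done | sort
}

# Usage (file names taken from the commands above):
# pheno_by_batch dataset_A_common.fam dataset_B_common.fam
```

Each output line has the form `file phenotype count`.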

I continued the analysis by running a logistic regression:

plink --bfile dataset_merge --covar pca_file.txt --covar-name PC1,PC2 --logistic --out dataset_merge
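
How `pca_file.txt` was produced is not shown. `--covar-name` selects covariates by the names in the header line of the covariate file, while the `.eigenvec` file written by PLINK 1.9 `--pca` has no header unless the `header` modifier is used. A minimal sketch of adding one, assuming `pca_file.txt` is meant to be derived from `dataset_merge_pruned.eigenvec`; the helper name `add_pc_header` is hypothetical:

```shell
# Hypothetical helper: print a headerless .eigenvec file with a header line
# (FID IID PC1 PC2 ...) prepended, so that --covar-name PC1,PC2 can
# find the columns by name.
add_pc_header() {
    awk 'NR == 1 {
             printf "FID IID"
             for (i = 3; i <= NF; i++) printf " PC%d", i - 2
             printf "\n"
         }
         { print }' "$1"
}

# Usage (assumed derivation of the covariate file):
# add_pc_header dataset_merge_pruned.eigenvec > pca_file.txt
```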

Looking at the Manhattan plot and the p-value histogram, something is clearly wrong: most of the p-values are close to 1.

[Images: Manhattan plot and p-value histogram]
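
When covariates are supplied, the `.assoc.logistic` output of `--logistic` contains one row per SNP for the additive genotype term (`TEST` = `ADD`) plus one row per covariate (here `PC1` and `PC2`). A minimal sketch for extracting only the SNP p-values before plotting, assuming the standard PLINK 1.9 column names `TEST` and `P`; the helper name `add_pvalues` is hypothetical:

```shell
# Hypothetical helper: print the p-values of the additive SNP term only,
# skipping covariate rows and rows where the p-value is NA.
add_pvalues() {
    awk 'NR == 1 {
             # locate the TEST and P columns from the header line
             for (i = 1; i <= NF; i++) {
                 if ($i == "TEST") t = i
                 if ($i == "P") p = i
             }
             next
         }
         $t == "ADD" && $p != "NA" { print $p }' "$1"
}

# Usage (output name follows from --out dataset_merge):
# add_pvalues dataset_merge.assoc.logistic > add_pvalues.txt
```

Running `--logistic` with the `hide-covar` modifier should give the same rows directly.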

Any ideas on how to solve this?

Thank you
