I have two sets of imputed and QCed genetic data in PLINK binary format (.bed, .bim, .fam) and would like to merge them (e.g. using plink --merge-list) for a subsequent GWAS.

  1. The two data sets are from different populations.
  2. The variant counts in the data sets are very different. One set has 1.8× as many variants as the other.
  3. The same variant may have different A1/A2 values in the two .bim files.
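To make points 2 and 3 concrete, here is a minimal Python sketch of how the overlap and the A1/A2 discrepancies between two .bim files could be tabulated. It assumes the standard six-column .bim layout (chromosome, variant ID, cM position, bp position, A1, A2) and that variants are matched by ID; the function names and the toy records are illustrative only and are not part of PLINK.

```python
import io

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
PALINDROMIC = ({"A", "T"}, {"C", "G"})  # strand cannot be inferred from alleles alone


def read_bim(handle):
    """Map variant ID -> (A1, A2) from a .bim-style stream of six columns."""
    alleles = {}
    for line in handle:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed lines
        _chrom, var_id, _cm, _bp, a1, a2 = fields
        alleles[var_id] = (a1.upper(), a2.upper())
    return alleles


def classify(p, q):
    """Label how allele pair q (set B) relates to allele pair p (set A)."""
    flipped = tuple(COMPLEMENT.get(x, x) for x in p)
    if p == q:
        return "same"
    if set(p) in PALINDROMIC and p == q[::-1]:
        return "palindromic_ambiguous"  # swap and strand flip look identical
    if p == q[::-1]:
        return "swapped"
    if flipped == q:
        return "strand_flip"
    if flipped == q[::-1]:
        return "strand_flip_swapped"
    return "mismatch"


def compare(bim_a, bim_b):
    """Classify every variant ID present in both data sets."""
    shared = sorted(bim_a.keys() & bim_b.keys())
    return {v: classify(bim_a[v], bim_b[v]) for v in shared}


# Toy records standing in for the two real .bim files (made-up values).
set_a = read_bim(io.StringIO(
    "1 rs1 0 1000 A G\n1 rs2 0 2000 C T\n1 rs3 0 3000 A C\n"
    "1 rs4 0 4000 A T\n1 rs5 0 5000 G T\n"
))
set_b = read_bim(io.StringIO(
    "1 rs1 0 1000 A G\n1 rs2 0 2000 T C\n1 rs3 0 3000 T G\n"
    "1 rs4 0 4000 T A\n1 rs6 0 6000 A G\n"
))

report = compare(set_a, set_b)
only_in_a = sorted(set_a.keys() - set_b.keys())
only_in_b = sorted(set_b.keys() - set_a.keys())
for var_id, label in report.items():
    print(var_id, label)
print("only in A:", only_in_a, "only in B:", only_in_b)
```

With real data, `read_bim` would be given an open file handle for each .bim file instead of the in-memory strings. Note that an A/T or C/G variant labelled "same" may still be on opposite strands; allele codes alone cannot rule that out.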

I wonder whether I can merge the files directly or whether some cleaning is needed beforehand. More specifically:

  1. Should I keep only the variants present in both data sets?
  2. Do I need to fix the A1/A2 coding before merging, so that the same variant has the same A1/A2 in the two data sets? If so, how can I do this?
    • Some of my analyses use the merged data in the additive + dominant component format (generated using plink --recodeAD). This format counts A1 alleles, so if I do not make the A1/A2 coding consistent before merging, will the recoded merged data be wrong?
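
To illustrate the concern in the last bullet, here is a small Python sketch (not PLINK code) of an additive + dominance recoding. It assumes the additive component is the number of A1 copies (0, 1 or 2) and the dominance component is a heterozygote indicator (0, 1, 0); `recode_ad` is a hypothetical helper written only for this example.

```python
def recode_ad(genotype, a1):
    """Return (additive, dominance) for one genotype, counting copies of a1.

    genotype is a pair of allele codes, or None when the call is missing.
    """
    if genotype is None:
        return (None, None)
    additive = sum(1 for allele in genotype if allele == a1)
    dominance = 1 if additive == 1 else 0  # heterozygote indicator
    return (additive, dominance)


# The same three genotypes, recoded with each allele taken as A1 in turn.
genotypes = [("A", "A"), ("A", "G"), ("G", "G")]
coded_a1_is_A = [recode_ad(g, "A") for g in genotypes]
coded_a1_is_G = [recode_ad(g, "G") for g in genotypes]

print(coded_a1_is_A)
print(coded_a1_is_G)
```

Under these assumptions, swapping A1 and A2 turns the additive value x into 2 - x and leaves the heterozygote indicator unchanged. That would matter if the two data sets were recoded separately with different A1 alleles; a single merged file has one .bim and therefore one A1 per variant for all samples.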

Any advice will be much appreciated!

