Posted 2 hours ago by curious

I am working with an old but widely used "mixed" dataset that contains SNPs mapped to a mixture of b36 and b37 coordinates.

I don't know which build each SNP refers to, but each is labeled with an rsid. So I essentially tried to lift to b38 by rsid only like this:

  1. I updated the positions of the "mixed" dataset to b38 positions by merging with dbSNP141 on rsid to create a "lifted" set.

  2. I downloaded the 30x 1000 Genomes data, which was called de novo on b38, and updated the ID field to include the rsid.

  3. I used Beagle's conform-gt utility to make a "harmonized lifted" set by comparing to 1000 Genomes as the reference. This should ensure that alleles and strand are harmonized between the datasets, using frequency and LD to resolve ambiguous sites.
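To make step 1 concrete, here is a minimal sketch of an rsid-only position update in plain Python. The record layout, the `dbsnp_b38` lookup table, and the function name `lift_by_rsid` are hypothetical stand-ins for illustration, not the actual files or commands used. It assumes each rsid maps to at most one b38 position.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (rsid, chrom, pos) in an unknown b36/b37 build.
Site = Tuple[str, str, int]


def lift_by_rsid(
    sites: List[Site],
    dbsnp_b38: Dict[str, Tuple[str, int]],
) -> Tuple[List[Site], List[str]]:
    """Replace each site's coordinates with the b38 position for its rsid.

    Returns (lifted sites, rsids with no b38 entry). The old coordinates
    are ignored because the build they refer to is unknown.
    """
    lifted: List[Site] = []
    unmapped: List[str] = []
    for rsid, _old_chrom, _old_pos in sites:
        hit = dbsnp_b38.get(rsid)
        if hit is None:
            unmapped.append(rsid)
        else:
            chrom, pos = hit
            lifted.append((rsid, chrom, pos))
    return lifted, unmapped


# Toy data, made up for illustration only.
mixed = [("rs1", "1", 100), ("rs2", "2", 200), ("rs3", "3", 300)]
dbsnp = {"rs1": ("1", 150), "rs3": ("3", 333)}
lifted, unmapped = lift_by_rsid(mixed, dbsnp)
```

This sketch does not handle rsids that have been merged into another rsid or that map to more than one location, so sites landing in `unmapped` may be worth checking for those cases.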

I realize this isn't ideal, but does this approach seem OK, or are there better alternatives? I was able to "lift" 7022 of my original 7281 mixed-build sites this way. Plotting allele frequencies against a b38 reference like TOPMed looks really clean too, so I think it worked.
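The allele-frequency check can also be done numerically rather than by eye. The following is a rough sketch that flags sites whose frequency differs from the reference by more than a cutoff, plus a helper for spotting strand-ambiguous (A/T or C/G) SNPs. The function names, the toy frequencies, and the 0.2 cutoff are all assumptions chosen for illustration.

```python
from typing import Dict, List


def is_strand_ambiguous(a1: str, a2: str) -> bool:
    """True for A/T and C/G SNPs, whose strand cannot be told from alleles alone."""
    return {a1.upper(), a2.upper()} in ({"A", "T"}, {"C", "G"})


def flag_freq_outliers(
    study: Dict[str, float],
    ref: Dict[str, float],
    max_diff: float = 0.2,  # arbitrary cutoff, an assumption for this sketch
) -> List[str]:
    """Return rsids present in both sets whose frequencies differ by more than max_diff."""
    return sorted(
        rsid
        for rsid, freq in study.items()
        if rsid in ref and abs(freq - ref[rsid]) > max_diff
    )


# Toy frequencies for the same allele in each dataset, made up for illustration.
study_freq = {"rs1": 0.10, "rs2": 0.85, "rs3": 0.40}
ref_freq = {"rs1": 0.12, "rs2": 0.15, "rs3": 0.45}
outliers = flag_freq_outliers(study_freq, ref_freq)
```

A site whose study frequency is close to one minus the reference frequency, like the toy `rs2` here, is the typical signature of an allele or strand flip, and those are the sites most worth inspecting when they are also strand-ambiguous.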
