I am new in the field of bioinformatics and I just got back SNP array data for 96 samples for the Affymetrix Axiom Precision Medicine Array for Family Based Association Test analyses in rare diseases, with the objective of deciphering some common variation which could be implicated on the phenotype of the pathology.
After using Axiom Analysis Suite to export on PLINK format for the 89 samples that passed the filters, I decided to do a preliminary case vs control analysis using PLINK 1.9 to see what could I expect from these samples.
After deleting missing genotypes, differential genotyping rate, genotypes with MAF lower than 0.01, SNPs whitout Hardy-Weingberg equilibrium, and check for sex and heterozygosity inconsistencies with the following commands in R:

system2("plink",c("--bfile PMDA_Analysis", "--geno 0.1","--make-bed","--out cc1"))
system2("plink",c("--bfile cc1", "--mind 0.1","--make-bed","--out cc2"))
system2("plink",c("--bfile cc2", "--geno 0.01","--make-bed","--out cc3"))
system2("plink",c("--bfile cc3", "--mind 0.01","--make-bed","--out cc4"))
system2("plink",c("--bfile cc4", "--test-missing","--out dif_genot"))
dif_genot <- read.table("dif_genot.missing",header = T)
dif <- subset(dif_genot, dif_genot$P < 0.001)
write.table(dif$SNP, "missing", append = F, quote = F, row.names = F, col.names = F)
system2("plink",c("--bfile cc4", "--exclude missing", "--make-bed", "--out cc5"))
system2("plinK", c("--bfile cc5", "--maf 0.01", "--make-bed", "--out cc6"))
system2("plink", c("--bfile cc6", "--hwe 1e-5", "--make-bed", "--out cc7"))
system2("plink",c("--bfile cc7", "--check-sex", "--out sexcheck"))
sexinc <- read.table("sexcheck.sexcheck", header = T)
sexmal <- subset(sexinc, sexinc$STATUS !="OK")
write.table(sexmal[,1:2], "sexomal", append = F, quote = F, row.names = F, col.names = F)
system2("plink", c("--bfile cc7", "--remove sexomal", "--make-bed", "--out cc8"))
system2("plink",c("--bfile cc8", "--not-chr 6", "--indep-pairwise 50 5 0.1", "--out poda"))
system2("plink",c("--bfile cc8", "--extract poda.prune.in", "--het", "--out heteroz"))
heteroz <- read.table("heteroz.het", header = T)
heteroz$porc <- (heteroz$N.NM. - heteroz$O.HOM.)/heteroz$N.NM.
tresSD <- subset(heteroz, heteroz$porc > (mean(heteroz$porc) + 3 * sd(heteroz$porc)))
write.table(tresSD[,1:2], "remhet", append = F, quote = F, row.names = F, col.names = F)
system2("plink", c("--bfile cc8", "--remove remhet", "--make-bed", "--out cc9"))

I decided not to perform a population stratification control as all my samples are from Spanish people with the exception of a Turkish family of 6 members. So finally I get 375085 SNPs along 75 samples (27 cases vs 48 controls).
Then I decided to use the Michigan Imputation Server to increase the number of SNPs per sample, but this tool is a little bit picky. After extracting only ACGT SNPs and splitting my file in chromosomes to then convert them into VCF files and compress them in bgzip format using the following commands on the Linux console:

plink --bfile cc9 --list-duplicate-vars ids-only suppress-first
plink --bfile cc9 --alleleACGT --snps-only just-acgt --exclude plink.dupvar --make-bed --out cc9_corrected
plink --bfile cc9_corrected --allow-extra-chr --chr 1 --make-bed --out cc9_1
plink --bfile cc9_1 --recode vcf --out cc9_1
vcf-sort cc9_1.vcf | bgzip -c > cc9_1.vcf.gz

And so on for the rest of chromosomes, the Michigan Imputation Server gave me the following error:

Error: More than 100 obvious strand flips have been detected. Please check strand. Imputation cannot be started!

So, following the recommendations posted in How to fix this strand flips for Michigan Imputation Server? I decided to use the tool conform-gt as follows:

java –jar conform-gt.jar ref=chr1.1kg.phase3.v5a.vcf.gz gt=cc9_1.vcf.gz chrom=1 out=cc9_chr1

But the outputs for all chromosomes were empty, as no SNPs where common with the reference file.
To solve that I decided to use Beagle directly, without using conform-gt first, instead of the Michigan Imputation Server. The command used was:

java –jar beagle.jar map=plink.chr1.GRCh37.map gt=cc9_1.vcf.gz chrom=1 ne=75 out=cc9_chr1_imputed

But no imputation occur, so I decided just try to set a logistic linear model with that 375085 SNPs detected.
For that, I came back to PLINK 1.9 through R console and start by calculating the first 5 PCA components to set as covariates using the following commands:

system2("plink", c("--bfile cc9_corrected", "--pca 5 header", "--out pca"))
eigenvec <- read.table("pca.eigenvec", header = T)
fam$IID <- fam$V2
fampca <- merge (fam, eigenvec, by = "IID")
plot(fampca$PC1, fampca$PC2, col = fampca$V6, xlab = "PC1", ylab = "PC2")

But all PCA plots have only one point at (0,0), and that is imposible! Following the instructions posted in groups.google.com/forum/#!topic/plink2-users/i4dH8nf4NeA, as I have already applied those filters, I decided to calculate the GRM matrix to see if I had NA values:

system2("plink", c("--bfile cc9", "--make-grm-bin", "--pca", "--out cc9_grm"))
GRM <- read.table("cc9_grm.grm.bin", header = F)

And indeed I had NA values in the GRM matrix, but I don’t know how to solve that. Could it be because my family structure? Among the initial 89 samples there are grandparents, parents, uncles, brothers, half-siblings and samples without any familiar relationship. Could anyone suggest me a way to solve this problem?

Due to my resounding failure using this approach, I decided to search for a tool to perform a proper FBAT analysis, but I could only find one tool more or less brought up to date. That tool is ONETOOL (healthstat.snu.ac.kr/software/onetool/index.php). Could anyone suggest me a better way to analyse my data?

Many thanks!!!

Source link