Hello,

I am very new to R, so I apologise if this looks simple to someone.

I try to to join two files and then perform a one-sided Fisher's exact test to determine if there is a greater burden of qualifying variants in casefile or controlfile.

casefile:

GENE  CASE_COUNT_HET  CASE_COUNT_CH   CASE_COUNT_HOM  CASE_TOTAL_AC
ENSG00000124209   1   0   0   1
ENSG00000064703   1   1   0   9
ENSG00000171408   1   0   0   1
ENSG00000110514   1   1   1   12
ENSG00000247077   1   1   1   7

controlfile:

GENE  CASE_COUNT_HET  CASE_COUNT_CH   CASE_COUNT_HOM  CASE_TOTAL_AC
ENSG00000124209   1   0   0   1
ENSG00000064703   1   1   0   9
ENSG00000171408   1   0   0   1
ENSG00000110514   1   1   1   12
ENSG00000247077   1   1   1   7
ENSG00000174776   1   1   0   2
ENSG00000076864   1   0   1   13
ENSG00000086015   1   0   1   25

I have this script:

#!/usr/bin/env Rscript
library("argparse")
suppressPackageStartupMessages(library("argparse"))

parser <- ArgumentParser()
parser$add_argument("--casefile", action="store")
parser$add_argument("--casesize", action="store", type="integer")
parser$add_argument("--controlfile", action="store")
parser$add_argument("--controlsize", action="store", type="integer")
parser$add_argument("--outfile", action="store")

args <- parser$parse_args()

case.dat<-read.delim(args$casefile, header=T, stringsAsFactors=F, sep="t")
names(case.dat)[1]<-"GENE"
control.dat<-read.delim(args$controlfile, header=T, stringsAsFactors=F, sep="t")
names(control.dat)[1]<-"GENE"

dat<-merge(case.dat, control.dat, by="GENE", all.x=T, all.y=T)
dat[is.na(dat)]<-0

dat$P_DOM<-0
dat$P_REC<-0

for(i in 1:nrow(dat)){

  #Dominant model
  case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
  control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM

  if(case_count>args$casesize){
    case_count<-args$casesize
  }else if(case_count<0){
    case_count<-0
   }
  if(control_count>args$controlsize){
    control_count<-args$controlsize
  }else if(control_count<0){
    control_count<-0
   }

  mat<-cbind(c(case_count, (args$casesize-case_count)), c(control_count, (args$controlsize-control_count)))
  dat[i,]$P_DOM<-fisher.test(mat, alternative="greater")$p.value

and problem starts in here:

  case_count<-dat[i,]$CASE_COUNT_HET+dat[i,]$CASE_COUNT_HOM
  control_count<-dat[i,]$CONTROL_COUNT_HET+dat[i,]$CONTROL_COUNT_HOM

the result of case_count and control_count is NULL values, however corresponding columns in both input files are NOT empty. I am confused, which step to take next.

as a newbie in R and statistics, I will be happy about any suggestion. Thank you!



Source link