Asked 3 hours ago by Vivek (Denmark)

Hi,

I'm trying to implement a polygenic score pipeline in WDL, and I'm quite new to pipeline management.

The first step of the pipeline takes a file of GWAS summary statistics and splits it by chromosome.

The corresponding WDL task:

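task split {
    input {
        File gwas             # GWAS summary statistics, all chromosomes
        String output_prefix  # prefix for the per-chromosome output files
    }

    command {
        ./splitGwas -i ${gwas} -o ${output_prefix}
    }

    output {
        # One file per chromosome; glob does not say which file is which chromosome
        Array[File] gwas_by_chr = glob("${output_prefix}_*.assoc")
    }
}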

The next step computes posterior SNP effects per chromosome. Its inputs are the per-chromosome files from the previous step and the LD matrix for the same chromosome.

The corresponding WDL call would be something like this:

Array[File] ld_matrices = read_lines(file_of_ld_matrices_by_chr)

scatter(pair in zip(split.gwas_by_chr, ld_matrices)) {
    call sbayes.run {
        input:
            GCTB = gctb_executable_path,
            gwas = pair.left,
            ld_matrix = pair.right,
            output_prefix = out
    }
}

Since I'm collecting the split files with glob in the previous step, I don't know which element of the array corresponds to which chromosome. Zipping the two arrays could therefore pair the summary statistics for one chromosome with the LD matrix for a different chromosome.
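One way the ordering could be pinned down is to have the split task write a manifest in a known chromosome order and read that instead of globbing. This is a minimal sketch, not a tested solution: the `chromosomes` input and `manifest.txt` are hypothetical additions, and the `<prefix>_<chrom>.assoc` naming is only assumed from the glob pattern above. It also assumes the engine resolves the relative paths from `read_lines` against the task's execution directory, which is worth checking on your backend.

```wdl
version 1.0

# Sketch only: `chromosomes` and manifest.txt are hypothetical additions,
# and the <prefix>_<chrom>.assoc naming is assumed from the glob pattern.
task split {
    input {
        File gwas
        String output_prefix
        Array[String] chromosomes   # hypothetical, e.g. ["1", "2", "3"]
    }

    command <<<
        ./splitGwas -i ~{gwas} -o ~{output_prefix}

        # List the expected outputs in the same order as `chromosomes`.
        for chrom in ~{sep=" " chromosomes}; do
            echo "~{output_prefix}_${chrom}.assoc"
        done > manifest.txt
    >>>

    output {
        # The array follows the manifest order rather than a filename sort.
        Array[File] gwas_by_chr = read_lines("manifest.txt")
    }
}
```

If the file listing the LD matrices is written in the same chromosome order, element i of each array refers to the same chromosome and the zip stays aligned.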

The programmatic way would be to scatter over chromosome names, but then I would lose the ability to specify the dependency between the two tasks.
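For illustration, here is a rough sketch of what scattering over chromosomes might look like while still referring to the output of split. As far as I understand, any reference to `split.gwas_by_chr` inside the scatter should make each shard wait for split. The workflow name `pgs`, the `chromosomes` input, and both task bodies are hypothetical stand-ins for the real tasks, and both arrays are assumed to be in the same order as `chromosomes`.

```wdl
version 1.0

# Sketch only: `pgs`, `chromosomes`, and the stub tasks are hypothetical.
# Both arrays are assumed to be ordered like `chromosomes`.
workflow pgs {
    input {
        File gwas
        File file_of_ld_matrices_by_chr
        Array[String] chromosomes
        String out
    }

    call split {
        input:
            gwas = gwas,
            output_prefix = out,
            chromosomes = chromosomes
    }

    Array[File] ld_matrices = read_lines(file_of_ld_matrices_by_chr)

    # Scatter over the chromosome index. Reading split.gwas_by_chr here
    # is what ties each shard to the split call.
    scatter (i in range(length(chromosomes))) {
        call run_sbayes {
            input:
                gwas = split.gwas_by_chr[i],
                ld_matrix = ld_matrices[i],
                output_prefix = out + "_" + chromosomes[i]
        }
    }
}

# Stub standing in for the real split task: creates one empty file per
# chromosome and records the file names in chromosome order.
task split {
    input {
        File gwas
        String output_prefix
        Array[String] chromosomes
    }
    command <<<
        for chrom in ~{sep=" " chromosomes}; do
            touch "~{output_prefix}_${chrom}.assoc"
            echo "~{output_prefix}_${chrom}.assoc"
        done > manifest.txt
    >>>
    output {
        Array[File] gwas_by_chr = read_lines("manifest.txt")
    }
}

# Stub standing in for the real sbayes.run task.
task run_sbayes {
    input {
        File gwas
        File ld_matrix
        String output_prefix
    }
    command <<<
        echo "~{gwas} ~{ld_matrix}" > "~{output_prefix}.txt"
    >>>
    output {
        File result = "~{output_prefix}.txt"
    }
}
```

The index only lines up if the split outputs and the LD matrix list share one chromosome order, so this does not remove the ordering problem by itself.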

Is there a better way to do this?

Cheers!
