How to calculate variant length for each sample


I want to see the max, min, median, mean lengths of variants for each sample. Is there any tool I can use or I can just calculate from vcf files? I have used RGT tools vcfstats but only got histgram of the distribution.

Thanks in advance,



using bioalcidaejdk

with the file 'biostar.code' (I'm too lazy to do the median depth )

class Stat {
    Integer min;
    Integer max;
    long count;
    double sum;

final Map<String,Stat> hash = new HashMap<>();
for(final String sn: header.getSampleNamesInOrder()) {
    hash.put(sn,new Stat());
    for(final String sn:hash.keySet()) {
        final Genotype gt = V.getGenotype(sn);
        if(gt.isNoCall() || gt.isHomRef()) continue;
        Stat stat = hash.get(sn);
        final int len = V.getLengthOnReference();
        if(stat.min==null || stat.min.intValue()>len) stat.min = len;
        if(stat.max==null || stat.max.intValue()<len) stat.max = len;

for(final String sn:hash.keySet()) {
Stat stat = hash.get(sn);
if(stat.count==0L) continue;
println(sn+" "+stat.min+" "+stat.max+" "+(stat.sum/stat.count));

usage: (sample / min / max /mean )

java -jar dist/bioalcidaejdk.jar -f biostar.code src/test/resources/test_vcf01.vcf 
S3 1 5 1.1839080459770115
S4 1 5 1.1744186046511629
S5 1 5 1.1975308641975309
S6 1 5 1.1948051948051948
S1 1 3 1.0909090909090908
S2 1 5 1.2083333333333333

I think one way to do this is to list both the REF and ALT, then subtract the lengths of each accordingly. Feels naive a bit, but I think it works just fine:


# Extract the REF and the ALT
cat input_test.vcf | grep -v '^#' | cut -f 4,5 > alts.txt

# Compute lengths of each REF/ALT pair.
cat alts.txt | python > len.txt

 # Compute the statistics
cat len.txt | datamash mean 1 median 1 min 1 max 1


2.3435582822086 1   1   81

where is a simple python script I just wrote (might be incorrect though since I just typed it all out in a single go, start there if not corrrect):

import sys
for line in sys.stdin:
    ref, alts = line.strip().split()
    for alt in alts.split(","):
        size = abs(len(ref) - len(alt))
        print (size + 1)

the above will count deletions the same way as insertions - the lengths are computed relative to the reference rather than the length of the variant itself)

before adding your answer.

Traffic: 1681 users visited in the last hour

Source link