How to calculate variant length for each sample

Hi,

I want to see the max, min, median, mean lengths of variants for each sample. Is there any tool I can use or I can just calculate from vcf files? I have used RGT tools vcfstats but only got histgram of the distribution.

Thanks in advance,
Leilei


genome

• 65 views

using bioalcidaejdk lindenb.github.io/jvarkit/BioAlcidaeJdk.html

with the file 'biostar.code' (I'm too lazy to do the median depth )

class Stat {
    Integer min;
    Integer max;
    long count;
    double sum;
    };

final Map<String,Stat> hash = new HashMap<>();
for(final String sn: header.getSampleNamesInOrder()) {
    hash.put(sn,new Stat());
}
stream().forEach(V->{
    for(final String sn:hash.keySet()) {
        final Genotype gt = V.getGenotype(sn);
        if(gt.isNoCall() || gt.isHomRef()) continue;
        Stat stat = hash.get(sn);
        final int len = V.getLengthOnReference();
        if(stat.min==null || stat.min.intValue()>len) stat.min = len;
        if(stat.max==null || stat.max.intValue()<len) stat.max = len;
        stat.sum+=len;
        stat.count++;
        }
    });

for(final String sn:hash.keySet()) {
Stat stat = hash.get(sn);
if(stat.count==0L) continue;
println(sn+" "+stat.min+" "+stat.max+" "+(stat.sum/stat.count));
}

usage: (sample / min / max /mean )

java -jar dist/bioalcidaejdk.jar -f biostar.code src/test/resources/test_vcf01.vcf 
S3 1 5 1.1839080459770115
S4 1 5 1.1744186046511629
S5 1 5 1.1975308641975309
S6 1 5 1.1948051948051948
S1 1 3 1.0909090909090908
S2 1 5 1.2083333333333333

I think one way to do this is to list both the REF and ALT, then subtract the lengths of each accordingly. Feels naive a bit, but I think it works just fine:

wget https://raw.githubusercontent.com/everestial/VCF-Simplify/master/exampleInput/input_test.vcf

# Extract the REF and the ALT
cat input_test.vcf | grep -v '^#' | cut -f 4,5 > alts.txt

# Compute lengths of each REF/ALT pair.
cat alts.txt | python size.py > len.txt

 # Compute the statistics
cat len.txt | datamash mean 1 median 1 min 1 max 1

prints:

2.3435582822086 1   1   81

where size.py is a simple python script I just wrote (might be incorrect though since I just typed it all out in a single go, start there if not corrrect):

import sys
for line in sys.stdin:
    ref, alts = line.strip().split()
    for alt in alts.split(","):
        size = abs(len(ref) - len(alt))
        print (size + 1)

the above will count deletions the same way as insertions - the lengths are computed relative to the reference rather than the length of the variant itself)


Login
before adding your answer.

Traffic: 1681 users visited in the last hour



Source link