How to calculate variant length for each sample
Hi,
I want to see the max, min, median, mean lengths of variants for each sample. Is there any tool I can use or I can just calculate from vcf files? I have used RGT tools vcfstats but only got histgram of the distribution.
Thanks in advance,
Leilei
• 65 views
using bioalcidaejdk lindenb.github.io/jvarkit/BioAlcidaeJdk.html
with the file 'biostar.code' (I'm too lazy to do the median depth )
class Stat {
Integer min;
Integer max;
long count;
double sum;
};
final Map<String,Stat> hash = new HashMap<>();
for(final String sn: header.getSampleNamesInOrder()) {
hash.put(sn,new Stat());
}
stream().forEach(V->{
for(final String sn:hash.keySet()) {
final Genotype gt = V.getGenotype(sn);
if(gt.isNoCall() || gt.isHomRef()) continue;
Stat stat = hash.get(sn);
final int len = V.getLengthOnReference();
if(stat.min==null || stat.min.intValue()>len) stat.min = len;
if(stat.max==null || stat.max.intValue()<len) stat.max = len;
stat.sum+=len;
stat.count++;
}
});
for(final String sn:hash.keySet()) {
Stat stat = hash.get(sn);
if(stat.count==0L) continue;
println(sn+" "+stat.min+" "+stat.max+" "+(stat.sum/stat.count));
}
usage: (sample / min / max /mean )
java -jar dist/bioalcidaejdk.jar -f biostar.code src/test/resources/test_vcf01.vcf
S3 1 5 1.1839080459770115
S4 1 5 1.1744186046511629
S5 1 5 1.1975308641975309
S6 1 5 1.1948051948051948
S1 1 3 1.0909090909090908
S2 1 5 1.2083333333333333
I think one way to do this is to list both the REF and ALT, then subtract the lengths of each accordingly. Feels naive a bit, but I think it works just fine:
wget https://raw.githubusercontent.com/everestial/VCF-Simplify/master/exampleInput/input_test.vcf
# Extract the REF and the ALT
cat input_test.vcf | grep -v '^#' | cut -f 4,5 > alts.txt
# Compute lengths of each REF/ALT pair.
cat alts.txt | python size.py > len.txt
# Compute the statistics
cat len.txt | datamash mean 1 median 1 min 1 max 1
prints:
2.3435582822086 1 1 81
where size.py
is a simple python script I just wrote (might be incorrect though since I just typed it all out in a single go, start there if not corrrect):
import sys
for line in sys.stdin:
ref, alts = line.strip().split()
for alt in alts.split(","):
size = abs(len(ref) - len(alt))
print (size + 1)
the above will count deletions the same way as insertions - the lengths are computed relative to the reference rather than the length of the variant itself)
Traffic: 1681 users visited in the last hour