The number of SNPs and the sum of transitions + transversions do not match in the snpEff CSV output. Has anyone encountered this before? I ran snpEff on a sample called "B86097":

snpEff \
    -Xms750m \
    -csvStats B86097-effect-stats.csv \
    GRCh37.75 B86097.varscan_cns.threshold_0.01.vcf \
    > B86097.snpEff.vcf

The CSV output shows 109 SNPs:

# Variantss by type 
Type , Count , Percent  
DEL , 13 , 10.4%  
INS , 3 , 2.4%  
SNP , 109 , 87.2%

But the Ts/Tv summaries show 85 transitions and 47 transversions, which add up to 132:

# Ts/Tv summary

Transitions , 85
Transversions , 47
Ts_Tv_ratio , 1.808511

# Ts/Tv : All variants

Sample ,Sample1,Total
Transitions ,85,85
Transversions ,47,47
Ts/Tv ,1.809,1.809

Notably, the entries in the "Base changes" matrix sum to 109, which matches the SNP count rather than the Ts/Tv total:

# Base changes

base  , A  , C  , G  , T 
 A  , 0  , 5  , 24  , 4 
 C  , 3  , 0  , 10  , 15 
 G  , 18  , 5  , 0  , 4 
 T  , 4  , 11  , 6  , 0
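
As a sanity check, the transition and transversion counts can be recomputed directly from the matrix above. This is only a minimal sketch: the numbers are copied from my CSV, the variable names are my own, and I am assuming that rows are the reference base and columns the alternate base (the totals come out the same either way, since both directions of a base pair fall in the same class).

```python
# Base changes matrix copied from the snpEff CSV.
# Assumption: rows = reference base, columns = alternate base.
base_changes = {
    "A": {"A": 0, "C": 5, "G": 24, "T": 4},
    "C": {"A": 3, "C": 0, "G": 10, "T": 15},
    "G": {"A": 18, "C": 5, "G": 0, "T": 4},
    "T": {"A": 4, "C": 11, "G": 6, "T": 0},
}

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

transitions = 0
transversions = 0
for ref, row in base_changes.items():
    for alt, count in row.items():
        if ref == alt:
            continue  # diagonal: not a substitution
        if (ref, alt) in TRANSITIONS:
            transitions += count
        else:
            transversions += count

print("Transitions  :", transitions)
print("Transversions:", transversions)
print("Total        :", transitions + transversions)
```

Counted this way, the matrix gives 68 transitions and 41 transversions (109 in total), not the 85 and 47 reported in the Ts/Tv summary.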

I see this discrepancy in every sample that I've examined so far. To investigate, I split the VCF into one file per variant and ran snpEff on each one individually. I found 23 variants that are each counted as either 2 transitions or 2 transversions, which accounts for the difference between the two totals (109 + 23 = 132):

$ grep -A 4 "# Ts/Tv summary" *.csv | grep " 2"
line_102.vcf-effects-stats.csv-Transitions , 2
line_10.vcf-effects-stats.csv-Transversions , 2
line_11.vcf-effects-stats.csv-Transversions , 2
line_12.vcf-effects-stats.csv-Transversions , 2
line_13.vcf-effects-stats.csv-Transitions , 2
line_14.vcf-effects-stats.csv-Transversions , 2
line_15.vcf-effects-stats.csv-Transitions , 2
line_18.vcf-effects-stats.csv-Transversions , 2
line_1.vcf-effects-stats.csv-Transitions , 2
line_29.vcf-effects-stats.csv-Transversions , 2
line_35.vcf-effects-stats.csv-Transitions , 2
line_37.vcf-effects-stats.csv-Transitions , 2
line_42.vcf-effects-stats.csv-Transitions , 2
line_44.vcf-effects-stats.csv-Transitions , 2
line_48.vcf-effects-stats.csv-Transitions , 2
line_59.vcf-effects-stats.csv-Transitions , 2
line_66.vcf-effects-stats.csv-Transitions , 2
line_70.vcf-effects-stats.csv-Transitions , 2
line_72.vcf-effects-stats.csv-Transitions , 2
line_7.vcf-effects-stats.csv-Transitions , 2
line_8.vcf-effects-stats.csv-Transitions , 2
line_90.vcf-effects-stats.csv-Transitions , 2
line_95.vcf-effects-stats.csv-Transitions , 2
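
To double-check the arithmetic, here is a small sketch that tallies the grep hits above by type and subtracts the extra counts from the reported Ts/Tv values. The per-variant file numbers are transcribed from the grep output; the variable names are my own, and it assumes each of these files holds exactly one SNP.

```python
# Per-variant files (line_<N>.vcf) whose Ts/Tv summary reported a count of 2,
# transcribed from the grep output above.
transitions_twice = [102, 13, 15, 1, 35, 37, 42, 44, 48, 59, 66, 70, 72, 7, 8, 90, 95]
transversions_twice = [10, 11, 12, 14, 18, 29]

# Values reported in the "# Ts/Tv summary" section for the whole sample.
reported_ts = 85
reported_tv = 47
snp_count = 109  # from the "by type" table

# Assumption: each double-counted variant contributes exactly one extra count.
extra_ts = len(transitions_twice)
extra_tv = len(transversions_twice)

corrected_ts = reported_ts - extra_ts
corrected_tv = reported_tv - extra_tv

print("Double-counted variants:", extra_ts + extra_tv)
print("Corrected transitions  :", corrected_ts)
print("Corrected transversions:", corrected_tv)
print("Corrected total        :", corrected_ts + corrected_tv)
```

With 17 double-counted transitions and 6 double-counted transversions removed, the totals become 68 and 41, which sum to the 109 SNPs reported in the type table.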

I suspected that the double-counting might be due to snpEff annotating each of these variants for multiple genes, but this double-annotation was not the case for all of the double-counted variants:

$ while read i; do echo "$i--------------------"; grep -v '^#' "line_$i.vcf.snpEff.vcf" | cut -f8 | tr ',' '\n' | cut -d'|' -f4 | sort | uniq -c; done < double_counted_variant_numbers.txt
102--------------------
      6 ATM
      1 C11orf65
10--------------------
     17 TP53
11--------------------
      3 SPEN
     12 ZBTB17
12--------------------
     25 TP53
13--------------------
     25 TP53
14--------------------
     17 TP53
15--------------------
      6 CD79B
18--------------------
      2 CTD-2369P2.2
      2 DNMT1
      1 S1PR2
1--------------------
     16 IL4R
29--------------------
      1 BCL10
      2 RP11-131L23.1
35--------------------
      9 FCGR2B
      1 RP11-25K21.1
37--------------------
      5 RP3-395M20.8
     15 TNFRSF14
42--------------------
      6 SETD2
      1 snoU13
44--------------------
      5 SETD2
48--------------------
      1 RP3-395M20.7
     16 TNFRSF14
59--------------------
      1 SPEN
66--------------------
      3 RP1-234P15.4
      3 TMEM30A
70--------------------
      1 SPEN
72--------------------
      3 CARD11
7--------------------
      4 PLCG2
8--------------------
      4 PLCG2
90--------------------
      1 NOTCH1
95--------------------
     19 FAS
      1 RP11-399O19.9

Any idea why this might be happening? Let me know if you need more information! Thanks so much.


