Hi All,

I am working on identifying the insertion site and copy number of a retroviral vector integration in a human cell line. The sample was sequenced by WGS on a NextSeq, giving around 900 million PE150 reads.

I concatenated hg38 and the vector genome and mapped the reads to this combined reference using BWA. I then used the following command to pull the reads containing the transgene out of the resulting SAM file.

cat sample.sam | grep "vector" | awk '$7~/chr*/' > transgene_reads.sam (which should extract reads that mapped to the vector and whose mates mapped to human chromosomes)

This command generated an output SAM file that I believe contains the read pairs where the first read of the pair is mapped to the vector and its mate is mapped to a human chromosome.
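
For comparison, a stricter version of that filter might look like the sketch below. It assumes the vector contig is literally named "vector" in the combined reference and that the human contigs start with "chr"; the function name is only for illustration. Unlike grep, it looks only at the RNAME (column 3) and RNEXT (column 7) fields, so header lines and records that mention "vector" somewhere else are not picked up.

```shell
# Sketch only: "vector" is an assumed contig name, adjust it to your reference.
filter_vector_human_pairs() {
    # $1 = input SAM file
    # Skip header lines (starting with "@"), then keep records whose
    # RNAME (column 3) is the vector and whose RNEXT (column 7) starts with "chr".
    awk -F'\t' '!/^@/ && $3 == "vector" && $7 ~ /^chr/' "$1"
}

# Usage: filter_vector_human_pairs sample.sam > transgene_reads.sam
```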

The problem is that transgene_reads.sam shows vector reads whose mates map to all of the chromosomes, e.g. chr1, chr2, chr3, and so on.
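
To put numbers on that spread, something like the following could tabulate how many records have a mate on each chromosome. This is a minimal sketch that assumes the file is tab-delimited SAM with RNEXT in column 7; the function name is made up for this post.

```shell
# Sketch only: counts records per mate chromosome (RNEXT, column 7),
# sorted from the most to the least frequent chromosome.
count_mates_per_chrom() {
    # $1 = SAM file (e.g. transgene_reads.sam); header lines are skipped.
    grep -v '^@' "$1" | cut -f7 | sort | uniq -c | sort -k1,1nr
}

# Usage: count_mates_per_chrom transgene_reads.sam
```

A chromosome with many more supporting pairs than the rest would stand out at the top of this list.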

I visualized the BAM file for chr7 in IGV, with the view set to show the chromosome of the mate. The coverage for the target in that region is 2X, i.e. 2 reads map to the target around chr7:50343667.

Here is the IGV screenshot

The highlighted yellow region is the vector region.

  1. Is it common for a vector to integrate itself into multiple chromosomes at low frequency?
  2. How can I validate and identify the true insertion site using IGV and the SAM file?
  3. Should I remove supplementary and secondary alignments from the SAM file before proceeding to insertion site analysis?
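
Regarding question 3, this is roughly how I would check how many of the extracted records are secondary (FLAG bit 0x100) or supplementary (FLAG bit 0x800) before deciding what to do with them. It is only a sketch: the function name is hypothetical, and it decodes the FLAG field (column 2) with integer arithmetic so that it works in any POSIX awk.

```shell
# Sketch only: classifies SAM records by the secondary (0x100 = 256)
# and supplementary (0x800 = 2048) FLAG bits.
count_alignment_types() {
    # $1 = SAM file; header lines (starting with "@") are skipped.
    awk -F'\t' '!/^@/ {
        if (int($2 / 256) % 2)       secondary++      # bit 0x100 set
        else if (int($2 / 2048) % 2) supplementary++  # bit 0x800 set
        else                         primary++
    }
    END {
        printf "primary=%d secondary=%d supplementary=%d\n", primary, secondary, supplementary
    }' "$1"
}

# Usage: count_alignment_types transgene_reads.sam
```

If they do need to be removed, I assume samtools view with -F 0x900 would exclude both types.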

Thanks in advance !!

