I am working on identifying the insertion site and copy number of a retroviral vector integration in a human cell line. The sample was sequenced by WGS on a NextSeq, yielding around 900 million PE150 reads.
I concatenated hg38 and the vector genome and mapped the reads to this combined reference using BWA. I then used the following command to pull the reads involving the transgene out of the SAM file:
cat sample.sam | grep "vector" | awk '$7~/chr*/' > transgene_reads.sam
(The intent is to extract reads that mapped to the vector and whose mates mapped to human chromosomes.)
This command generated an output SAM file that I believe contains the pairs where the first read of the pair is mapped to the vector and its mate is mapped to a human chromosome.
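For reference, a stricter version of this filter could be sketched in Python as below. It is only a sketch under a few assumptions: the vector contig is literally named `vector` in the combined reference (a placeholder), human contigs start with `chr`, and the function name `vector_human_pairs` is made up for illustration. Unlike the grep, it checks the RNAME (column 3) and RNEXT (column 7) fields explicitly, so header lines and reads that only mention the vector in an optional tag such as `SA:Z:` are not picked up.

```python
def vector_human_pairs(sam_lines, vector_name="vector", human_prefix="chr"):
    """Yield SAM records mapped to the vector whose mate maps to a human contig.

    vector_name and human_prefix are assumptions about how the combined
    reference was built; adjust them to the real contig names.
    """
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines (@HD, @SQ, @PG, ...)
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        rname, rnext = fields[2], fields[6]
        # RNEXT is "=" when the mate is on the same contig, so a plain
        # prefix test is enough to require a human-mapped mate.
        if rname == vector_name and rnext.startswith(human_prefix):
            yield fields


# Tiny made-up example: only r1 is a vector read with a human-mapped mate.
example = [
    "@SQ\tSN:vector\tLN:9000",
    "r1\t97\tvector\t100\t60\t150M\tchr7\t50343667\t0\t*\t*",
    "r2\t99\tvector\t200\t60\t150M\t=\t400\t350\t*\t*",
    "r3\t97\tchr1\t500\t60\t150M\tchr2\t900\t0\t*\t*\tSA:Z:vector,10,+,50M100S,60,0;",
]
print([f[0] for f in vector_human_pairs(example)])
```

On a real file this could be fed with `open("sample.sam")`, since the function only needs an iterable of lines.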
The problem is that transgene_reads.sam shows vector reads with mates mapping to all of the chromosomes, e.g. chr1, chr2, chr3, and so on.
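One way I could check whether these mates are scattered singletons or pile up at a few loci is to count them per genomic window. The sketch below assumes records that are already split into SAM columns (lists of strings) and uses an arbitrary 1 kb window; `mate_clusters` and the window size are my own placeholders, not part of any tool. Fixed windows can split a cluster that straddles a window boundary, so this is only a rough first look.

```python
from collections import Counter


def mate_clusters(records, window=1000):
    """Count mate positions per (contig, window start) bin.

    records: iterable of SAM records already split into columns.
    window:  bin size in bp (1000 is an arbitrary choice).
    """
    counts = Counter()
    for fields in records:
        rnext = fields[6]        # contig the mate maps to
        pnext = int(fields[7])   # 1-based leftmost position of the mate
        counts[(rnext, pnext // window * window)] += 1
    return counts


# Made-up records: two mates land close together on chr7, one on chr2.
records = [
    ["r1", "97", "vector", "100", "60", "150M", "chr7", "50343667", "0", "*", "*"],
    ["r4", "97", "vector", "120", "60", "150M", "chr7", "50343701", "0", "*", "*"],
    ["r5", "97", "vector", "3000", "60", "150M", "chr2", "1200345", "0", "*", "*"],
]
for (contig, start), n in mate_clusters(records).most_common():
    print(contig, start, n)
```

Bins supported by several independent pairs would be the ones I inspect in IGV first; bins with a single mate are harder to tell apart from mapping or library artifacts.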
I visualized the BAM file for chr7 in IGV, with the alignments displayed by the chromosome of the mate. The coverage at the target region is 2X, i.e. 2 reads map to the target at chr7:50343667.
The highlighted yellow region is the vector region.
- Is it common for a vector to integrate itself into multiple chromosomes at low frequency?
- How can I validate and identify the true insertion site using IGV and the SAM file?
- Should I remove supplementary and secondary alignments from the SAM file before proceeding to insertion site analysis?
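Regarding the last question, this is how I understand the relevant SAM flag bits: 0x100 marks a secondary alignment and 0x800 marks a supplementary one. The sketch below just classifies a flag value so the three kinds can be tallied separately before deciding what to drop; `alignment_kind` is a hypothetical helper name. My understanding is that supplementary alignments are the extra segments of split reads, so some of them might be reads spanning a vector-genome junction.

```python
SECONDARY = 0x100       # SAM flag bit: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag bit: supplementary alignment


def alignment_kind(flag):
    """Classify a SAM FLAG value as primary, secondary, or supplementary."""
    if flag & SUPPLEMENTARY:
        return "supplementary"
    if flag & SECONDARY:
        return "secondary"
    return "primary"


# 97 = paired + mate reverse + first in pair; adding 256 or 2048 sets
# the secondary or supplementary bit respectively.
for flag in (97, 97 + 256, 97 + 2048):
    print(flag, alignment_kind(flag))
```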
Thanks in advance!