gravatar for jbalberge

2 hours ago by

France/Nantes

In bam file from single-cell RNA-seq experiments processed with 10X's cellranger 3.0 pipeline, the @RG:SM tag is set by sample while the @CB is computed from cell barcode.

$ samtools view possorted_genome_bam.bam | head -n 1
> NS500651:49:"RUN":4:12405:6951:17414  256 1   11256   0   97M1S   *   0   0   "SEQ" "qual"
> NH:i:5    HI:i:2  AS:i:83 nM:i:6  RE:A:I  li:i:0  BC:Z:ATCTTTAG   QT:Z:A/AAAAEE   CR:Z:GAACGGACAGACGCAA   CY:Z:/AAAAEAEEEEEEEEE   CB:Z:GAACGGACAGACGCAA-1 UR:Z:TAGCTGTAAT UY:Z:EEEEEEEEEE UB:Z:TAGCTGTAAT RG:Z:MYSAMPLE:0:1:"RUN":4

For cell-specific genotyping, it would make more sense to treat the @CB barcode as individual's identity.

Is there an efficient way to move the @CB tag values to @RG tag ? This way, each unique @RG defines a cellular barcode.

Currently I do:

(1) split the bam by @CB tag (thanks to www.biostars.org/p/263346/#263348)

samtools view [email protected] 32 <input.bam> | cut -f 12- | tr "t" "n" | grep  "^CB:Z:" | cut -d ':' -f 3 | sort | uniq | while read S; do samtools view -h <input.bam> | awk -v tag="CB:Z:$S" '($0 ~ /^@/ || index($0,tag)>0)' > out/${S}.sam ; done

(2) edit @RG SM based on file name with Picard/AddOrReplaceReadGroups

for s in $(ls out/*.sam); do SM=$(basename $s .sam) ; java -jar /sandbox/apps/bioinfo/binaries/picard/2.19.0/picard.jar AddOrReplaceReadGroups I=$s O=out-RGSM/${SM}.bam RGID=1 RGLB=0.1 RGSM=${SM} RGPU=MYSAMPLE RGPL=ILLUMINA; done
samtools merge -r - out-RGSM/*.bam | samtools view -o merged-RGSM.bam -
samtools index merged-RGSM.bam

(3) call variants per sample

bcftools mpileup -r chr22 -Ou -f GRCh38.fasta merged-RGSM.bam | bcftools call -mv -Ob -o calls-RGSM.bcf

This is very inefficient and memory/disk-consuming. How could I improve this process by either (1) properly replace @RG:SM with CB or to run (2) bcftools mpileup with CB instead of RG:SM?

Thanks,
JB

link

modified 1 hour ago

written
2 hours ago
by

jbalberge90



Source link