I have received some CRAM files with paired-end WGS data. I would like to remap the reads in a specific region to another reference.

However, most of the template/read names have the format filename.cram:21937611, so I suspect that the CRAM files have been stored with samtools ... lossy_names=1.

When converting the CRAM file into FASTQ files (samtools fastq), I loose most of the reads, and from the CRAM file, I can see that <10% of the read names are unique.

Is it possible to generate new read names based on the information present in the CRAM file? Of course it would not be possible to recreate the old names, but is there a solution that will allow me to generate fastq files and perform realignment?

I've inserted one example for a read read name used for multiple reads below. Note that the same sample/library has been analysed in 8 lanes.

From samtools documentation:
lossy_names=0|1
CRAM output only; defaults to 0 (off). If 1, templates
with all members within the same CRAM slice will have their read names
removed. New names will be automatically generated during decoding.
Also see the name_prefix option.


An example where filename.cram:21937611 is found as read name for multiple reads (note that the the reads may come from different read groups: RG:Z:AHH7F7CCXX.sample_name.#, but RG:Z:AHH7F7CCXX.sample_name.6 is found 4 times)

filename.cram:21937611  99  6   28999893    60  150M    =   29000058    315 AACATACCCATGTTCCCATTATGACGTTTCTTCATTTAATTAATAAATTAGAAAAAAATTCTTGTTGGATGAGTTACTAATGCCCTGAAGAATTGGATTAACCACTGGTCATACTGACACTACAGTGCCATTCACACTTAAATGCAACAG  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.3   NM:i:0  MQ:i:60 AS:i:150    XS:i:0
filename.cram:21937611  147 6   29000058    60  150M    =   28999893    -315    ATAGAAGTCTCTATTTAATGTGGATATTGGAAGTAAACTAAATGTGGGACTGGTGAAAATCCTTAATTAGGTTTGGTTAAATATATTTTCGGTTGGCTATTTGATGTCTTTTTAATGTATACTCTTGTTATTCATATTTACAGCTAGATT  KKKKKFA<FFKKKFFFKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.3   NM:i:0  MQ:i:60 AS:i:150    XS:i:20
filename.cram:21937611  1187    6   29028730    60  150M    =   29028863    185 ACATCCCAGGGATGAAGCTGACTAGCTCATGGTGAATAAGCTTTTTGATGTGCTGTTGAATTGTTTGCTAGTATTTTATTGAAGATTTTCACATCAATGTCCATCAGGGATATTGGCCTGAAATTTTCTTTTTTGTTGTTGTTGTGTCTC  AAA<FKKFKFK,,AAKKKFK<FFKKFKKKAFKKFKKKFAF,,77A,77,F7<KKK7FFA,K7FAAAFA7AFKFK<F7<FKKKKKF7FFKKKKKAKK,,AFAA<FFKKK,7,7F,A,,,A7AA,,<<FKKF,7,,<AF,,AFA,,,,<,7<  MC:Z:18S52M80S  MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.4   NM:i:0  MQ:i:40 AS:i:150    XS:i:66
filename.cram:21937611  1107    6   29028863    40  18S52M80S   =   29028730    -185    GTCTTTTATTTGTCTTTGTGTTGTTGTTGTGTCTCTGCCCTGTTTTGTTAACAATATGATGCTGTCTTCAAACTTAGAGTTATGTAGGAAATCCCCTCTTACTTTATTGTGGATTAGTTTCTGAAGGAAGGATACCAGTTCATCTTAGTA  ,,A,,,,,,,,,,,<,,,A<A,,K<<,,7,AA7A,,(AA<,,,7,7,,,,,7<,7,,<,7KFA77,KF<7,7,,,,,7,FFA7,,,,,7,,,(,,,,,,,,KK7F,KF,7<FA,7,AF,,A,<<,,7,,,FF7KFA<,,A,F,A<FAA,,  XA:Z:2,-117731944,14S36M100S,1;4,+75904852,111S26M13S,0;    MC:Z:150M   MD:Z:29G2T3G9A5 PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.4   NM:i:4  MQ:i:60 AS:i:32 XS:i:31
filename.cram:21937611  99  6   29086353    60  119M31S =   29086353    119 AGACATATATGTTGTCTTGGGTGGCCACAACATAGTTAGATTTTGCACTGCAGTCCTCCAGATCTAGGGCTTATCTTTTGAGGAAAGTATACTTGCTCTGAAAGAAACATGTAGGTAGCAGATCGGAAGAGCGTCGTGTAGGGAAAGAGT  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKAKAFKKKKKKKKKA  MC:Z:31S119M    MD:Z:119    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.7   NM:i:0  MQ:i:60 AS:i:119    XS:i:20
filename.cram:21937611  147 6   29086353    60  31S119M =   29086353    -119    ACTGGAGTTCAGACGTGTGCTCTTCCGATCTAGACATATATGTTGTCTTGGGTGGCCACAACATAGTTAGATTTTGCACTGCAGTCCTCCAGATCTAGGGCTTATCTTTTGAGGAAAGTATACTTGCTCTGAAAGAAACATGTAGGTAGC  KKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA  MC:Z:119M31S    MD:Z:119    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.7   NM:i:0  MQ:i:60 AS:i:119    XS:i:20
filename.cram:21937611  163 6   29115837    60  150M    =   29116033    346 GTAACCCCTTTTACTACCTACCCCAGAGTTTCTTTTATTATTAACAACTTGCATTCATGTGGTACATTTGTTATAATTGATAAGCCAATATTAATACATTATTAGTAACCAAATTCCATAGTTTACATTAGGGTTGATGGTGTGTGTTTT  AAFFFAKKKKKFKKA,F<F,A,AFKKKKAFKAKKKK7KKKKKKFKF<<AFK<AFK,AAKKKKKKKKKKKKKFKKKKFAAKK<FFF,7<FKKKFKKF,AFAKKK7AKKF<<,A7,<A<FA7<,AK77,<FAFAF<AFKKKKKKKKKKAKKF  MC:Z:150M   MD:Z:15C3C130   PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.8   NM:i:2  MQ:i:60 AS:i:140    XS:i:34
filename.cram:21937611  83  6   29116033    60  150M    =   29115837    -346    TTCAGTACCATAAAGAATGGTTTCACTGCCTTAAAAATTCCCTGTTCTCCTTCCATTCATCATTTCCTCCCCTCCTCCCCGGGAGCCCCTGACAACCACTGATTTTTTATTGTTTCCATAACTGTGCTTTTTCCAGAATATCATACAATT  <,,,7<,,FKAA,,,KFF<,AFA7FF7KKKAFF,7KAF,AKKFFF,KKF,FFKF7F,KKKF,KAF,FFFA,FA(F(KFA77A7<(FFKKFAFFF<FKAKK,AKKKKKKKKKKKKFK<,AFFFKKKKKKKFF,KKKKKKAKFKKFKFFAAA  MC:Z:150M   MD:Z:7T142  PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.8   NM:i:1  MQ:i:60 AS:i:145    XS:i:23
filename.cram:21937611  99  6   29144947    40  150M    =   29145234    437 NACACTGACTTCCACAATGGTTGAACTAGTTTACAGTCACACCAACAGTGTAAAAGTGTTCCTATTTCTCCACATCCTGTCCAGCACCTGTTGTTTCGTGACTTTTTAATGATTGCCATTCTAACTGGTGTGAGGTGATATCTCCTTGTG  #AAAFF<AFKKKKAKKF7FKAKFK<FK,FKKKKKKFFA,7F,FKKKFKKF,,<F77FKKKKKKFFFKKKK,AFFKKKAKKK7FKKKKFF<KFKK7<<7<<7AFKKKKFF,,7F,<,A<,,,7,,,7,,,,,,,,A7A,,,,,,,,,,,,,  MC:Z:150M   MD:Z:0C37C58C39G6A5PG:Z:MarkDuplicates  RG:Z:AHH7F7CCXX.sample_name.5   NM:i:5  MQ:i:60 AS:i:129    XS:i:128
filename.cram:21937611  147 6   29145234    60  150M    =   29144947    -437    CTTGTAAATTTGTTTGAGTTCATTGTAGATTCTGGATATTAGCCCTTTGTAAGATGAGTAGCTTGCAAAAATTTTCTCCCATTCTGTAGGTTGCCTATTCACTCTGATGGTAGTTTCTTTTGCTGTGCAGAAACTCTTTAGTTTAATTAG  FF7<<777FKKKFKKA,7A7,,,<A7A,FFKKKKKKFF<AFKKKFFAF7F7A,F,,KKFA7AFF,,FF7FF,,FKKFKKKKKKKKKKKKKFFFAKKKKKAA7<AKKKAFFKFFKKKKKKFKAKKKKAFAAKKKKKKKKKFKKKKKF<F<<  MC:Z:150M   MD:Z:50C99  PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5   NM:i:1  MQ:i:40 AS:i:145    XS:i:135
filename.cram:21937611  163 6   29174463    60  150M    =   29174749    436 TGTATTGAGATATAATTTATTTACAATACAGTTCACCCATTTAAAGTGTACAAGTCAATGATTTTTGGTATATTTCATTACTAATTTATAACATTGTGGTAATATATAACATAAAATTTGCCATTTTAACTATTTTTAAGTGTACAATTT  AAFFFKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFAAFKKKKKKKKKKKKKKFKAFFKKAAFFAFKKKKFKKK7,A,AA,AAFK<  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:0  MQ:i:60 AS:i:150    XS:i:26
filename.cram:21937611  83  6   29174749    60  150M    =   29174463    -436    CTGCCTTATGTCTGTATGAATTTGCATGTTCCAGATATTTCATATTATTGGAAACATACAATTTTTGCCCTTTTAGGTCTGGCTTATTCATTTAGCATAAGATGTACATATTGCTTTATAGCCTGCTTTTTCACTTAGCAGCATATTGTG  ,,,,,<A,<<7<A,AFFA<,<FA,,<A<AFFFA<,F<KKKAAKKKKKKFFKKKKKKKKKKFKKKKA7AFAAA<F,FFKKF<7FFFKKKKFF7KKKKKKKKKKKKKKKFKKKKKKKKKKKFKKKKKKKKKKKKFKKKKKKKKKKKKFFFAA  MC:Z:150M   MD:Z:2A147  PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:1  MQ:i:60 AS:i:147    XS:i:24
filename.cram:21937611  99  6   29202989    60  150M    =   29203110    271 CATGAAATATTTAAAGAGTTTTATTCTGAGCCAAATGTGAGTAATTGACGGCCTGAGGCGCAGTCTCAAGACATCCTGAGAACAAGTGCCCAAGGTGATTGAATTACAACTTGATTTACATTTTAGGAGCATGTAAAAAATCAGTCAATA  AAFFFKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKA<FKKKKKKKKKKKKKKKKKKKAFAFKKKKKKKKKFFKKFKFKFKKFKKKAK  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:0  MQ:i:60 AS:i:150    XS:i:30
filename.cram:21937611  147 6   29203110    60  150M    =   29202989    -271    TTTAGGAGCATGTAAAAAATCAGTCAATACATGTGGGGTCTGTGTTGGTTCAGTCATGAAAGGTGGAACAACTCAAAGTGGGGCCTTCCACATCAAAGGTAAATTCAAAGATTTTCTTATTGGCAATTGGTTGAAAAACTTGTTACTATT  FKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKFFFAA  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.6   NM:i:0  MQ:i:60 AS:i:150    XS:i:26
filename.cram:21937611  163 6   29292208    60  150M    =   29292473    415 AAGAAAAAATATTAATAGAGCTATTTAACCTGGTTTGTTGAATGTGATAATGATTACAGAAAACATTGACTAATGCTGCTTATAATAATGAAAAACTGGAAAAGTGACTGGTATACAATAGGCAATCAATAAATATTTGTTAAATAAATA  AAAFAKAKFFKKKKKKKKFKKKFFAAKAKKFKK7FKFFKKKKAKAKKKK<FKKAKFFKFFFFFKKKK7KKKFKFFFKKKAAFFFF<KAKFK7FA<FA7FAAFAKKFFKKFKKKKKAK7<FFKKKK<AAFAAFFFFF7,,A7,<AA<,AF7  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5   NM:i:0  MQ:i:60 AS:i:150    XS:i:20
filename.cram:21937611  83  6   29292473    60  150M    =   29292208    -415    TATTAAAAATACTAAAATGTGTATACAAGATGGTGAGACAACTCCAAAATATAAAACAACATTATGGACTTACCTTTATTTCTAAGTACTTACAAAGTTGTTGCAAAAATTTTAATATACATATTAATGGATTTAGCTATCATATGAGAT  FA7A,77<A,<<AA,A<A<F<<7<FA,7A,,,A<7AF7<,7KKKKKKKF7FAFKA7FKKKKFFKFF7AKFF,<FFFKKKKKKKKKKKKFKKKKKFKKKFKFFKKKKFAKKFFAK<7AFAKKKKKKKKKKKKKKKFAAFAKKKKKKFFFA,  MC:Z:150M   MD:Z:150    PG:Z:MarkDuplicates RG:Z:AHH7F7CCXX.sample_name.5   NM:i:0  MQ:i:60 AS:i:150    XS:i:20

And read group information (one sample, one library, 8 flow cells):

@RG ID:AHH7F7CCXX.sample_name.1 PU:AHH7F7CCXX.sample_name.1 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.2 PU:AHH7F7CCXX.sample_name.2 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.3 PU:AHH7F7CCXX.sample_name.3 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.4 PU:AHH7F7CCXX.sample_name.4 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.5 PU:AHH7F7CCXX.sample_name.5 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.6 PU:AHH7F7CCXX.sample_name.6 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.7 PU:AHH7F7CCXX.sample_name.7 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina
@RG ID:AHH7F7CCXX.sample_name.8 PU:AHH7F7CCXX.sample_name.8 LB:SX441_sample_name.v1 SM:sample_name  CN:U    PL:Illumina



Source link