I'm trying to find matching Ensembl IDs between my germline and somatic datasets. Intersecting on chromosome and position would not answer my question, because I want to know whether there are somatic AND germline variants in the same gene for each patient.

Germline data is like:

       CHROM      POS REF ALT    AF GNOMAD_AF    GENE          FEATURE  DP AC AN     sample genotype_final                                                    GeneID patient
1:     1   860778   A   G 0.900     0.816  SAMD11 FIVE_PRIME_FLANK  76  2  2 Sample1            1/1 NA, NA, ENSG00000223764, ENSG00000187634, ENSG00000268179  Patient14
2:     1   874327   G   C 0.031 2.462e-05  SAMD11           INTRON 606  1 32 Sample7            0/1                                       NA, ENSG00000187634  Patient18
3:     1   877831   T   C 1.000         1  SAMD11         MISSENSE 336 14 14 Sample3            1/1                  NA, NA, ENSG00000188976, ENSG00000187634  Patient9

Somatic data is like:

3   52440897    .   TC  AA  .   PASS    LongAnnotation  GT:AD:AF:DP:F1R2:F2R1:SB    0/1:74,4:0.068:78:23,2:50,2:48,26,3,1   0/0:94,0:0.012:94:41,0:52,0:52,42,0,0   ENSG00000187634

For every row in the somatic data, I'd like to take the final column (GeneID) and use it to grep the GeneID column in the germline data. The solution below works for smaller somatic variant sets, but breaks on the larger sets where no germline sample was available for filtering the somatic variants. The pattern string it constructs is too large and yields the invalid regular expression error shown further below.

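patient.matcher <- function(germline.dt, somatic.dt, patientID) {
  # Join all somatic gene IDs into one alternation pattern: "ID1|ID2|..."
  pattern <- paste(somatic.dt$GeneID, collapse = "|")
  # Keep germline rows for this patient whose GeneID matches the pattern
  return(germline.dt[grepl(pattern, GeneID) & patient == patientID])
}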

patient.matcher(patients.w.geneIDs.germline, Patient9.somatic.dt, 'Patient9')

Error: invalid regular expression 'ENSG00000188976|ENSG00000187583|ENSG00000188290|E.............

Any help is appreciated. I can't figure out which combination of apply() and grep() would make this work on an arbitrarily large dataset.
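
One direction that might avoid the oversized pattern altogether is exact set membership instead of a regular expression. The following is only a minimal sketch, assuming the germline GeneID column is a single comma-separated character string per row (as the printed output suggests) and that Ensembl IDs can be compared as whole strings. The toy tables `germline` and `somatic` are made up for illustration and are not my real data.

```r
library(data.table)

patient.matcher <- function(germline.dt, somatic.dt, patientID) {
  # Unique somatic gene IDs, dropping missing values and literal "NA" strings
  somatic.ids <- setdiff(unique(somatic.dt$GeneID), c(NA, "NA"))
  # Subset to the patient first so only those rows need to be split
  germline.patient <- germline.dt[patient == patientID]
  # Assumption: GeneID looks like "NA, NA, ENSG..., ENSG..." (one string per row)
  id.list <- strsplit(germline.patient$GeneID, ",\\s*")
  # TRUE for rows where at least one germline ID is also a somatic ID
  keep <- vapply(id.list, function(ids) any(ids %in% somatic.ids), logical(1))
  germline.patient[keep]
}

# Hypothetical toy data mirroring the columns shown above
germline <- data.table(
  GENE    = c("SAMD11", "SAMD11", "SAMD11"),
  GeneID  = c("NA, NA, ENSG00000223764, ENSG00000187634, ENSG00000268179",
              "NA, ENSG00000187634",
              "NA, NA, ENSG00000188976, ENSG00000187634"),
  patient = c("Patient14", "Patient18", "Patient9")
)
somatic <- data.table(GeneID = c("ENSG00000187634", "ENSG00000187583"))

res <- patient.matcher(germline, somatic, "Patient9")
print(res)
```

Because `%in%` does a hash-style lookup rather than compiling a pattern, the number of somatic IDs should no longer be limited by regex size. If GeneID is actually a list column rather than a string, the `strsplit()` step would need to be dropped.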
