I have a column of FASTA headers in UniProt style.

Some rows contain a single FASTA header, and some contain multiple FASTA headers separated by semicolons.

Example (row 1 has a single FASTA header; row 2 has three FASTA headers concatenated with semicolons):

df <- data.frame(
  fasta_headers = c(
    "tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2",
    "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"
  )
)

I have tried the regex below, but it only captures the last instance in each row, and I would like all matched instances:

# Note: the backreference must be written "\\1" in an R string; "\1" is a control character.
df$'protein names' <- ifelse(grepl(".*_PIG (.*) OS.*", df$fasta_headers),
                             gsub(".*_PIG (.*) OS.*", "\\1", df$fasta_headers),
                             "")

df$'gene names' <- ifelse(grepl(".* GN=([^ ]+).*", df$fasta_headers),
                          gsub(".* GN=([^ ]+).*", "\\1", df$fasta_headers),
                          "")

The desired output is:

df_out <- data.frame(
  fasta_headers = c("tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2", "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"),
  gene_names = c("VWA1","stat5B; STAT5A"),
  protein_names = c("von Willebrand factor A domain containing 1","Signal transducer and activator of transcription; Signal transducer and activator of transcription 5A"))

There could be any number of semicolons, not just 2 or 3.

Any help would be appreciated.
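
For reference, one possible direction is sketched below in base R. The leading greedy `.*` in the patterns above consumes everything up to the last match, so only the last header survives. Splitting each row on semicolons first and then matching each header separately avoids that. The helper name `extract_all` is hypothetical, and the patterns assume that every complete header contains `_PIG <name> OS=` and `GN=<gene>`; headers that lack a field (such as the truncated third one in row 2) are skipped.

```r
# Example data, same headers as in the question.
df <- data.frame(
  fasta_headers = c(
    "tr|F1RJE3|F1RJE3_PIG von Willebrand factor A domain containing 1 OS=Sus scrofa OX=9823 GN=VWA1 PE=1 SV=2",
    "tr|E7CR01|E7CR01_PIG Signal transducer and activator of transcription OS=Sus scrofa OX=9823 GN=stat5B PE=2 SV=1;sp|Q9TUZ1|STA5A_PIG Signal transducer and activator of transcription 5A OS=Sus scrofa OX=9823 GN=STAT5A PE=2 SV=1;tr|A0A0A0MY60|A0A0A0MY60_PIG S"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each row on ";", capture the first group of
# `pattern` in every header, and join the captures with "; ".
extract_all <- function(x, pattern) {
  parts <- strsplit(as.character(x), ";", fixed = TRUE)
  vapply(parts, function(p) {
    m <- regmatches(p, regexec(pattern, p))
    # Each element of m is c(full_match, group_1), or character(0) if no match.
    hits <- vapply(m, function(h) if (length(h) >= 2) h[2] else NA_character_,
                   character(1))
    paste(hits[!is.na(hits)], collapse = "; ")
  }, character(1))
}

df$gene_names    <- extract_all(df$fasta_headers, "GN=([^ ;]+)")
df$protein_names <- extract_all(df$fasta_headers, "_PIG (.*) OS=")
```

Because the split happens before matching, the number of semicolons per row does not matter. For other organisms, the `_PIG` part of the pattern would need to be generalized.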


