Posted 3 hours ago by rodolfo.peacewalker

Hi everyone!

I'm having trouble extracting Ensembl gene IDs from a messy data frame.
First, I loaded a CSV file into R (the file is not actually comma-separated), so every row ended up in a single column. It looks like this:

> my_csv_file
               ensembl_gene_id.entrezgene_id.hgnc_symbol.gene_biotype
1                           1 ENSG00000174365 128439 SNHG11 lncRNA
2 2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene
3                                     3 ENSG00000183562 NA  lncRNA
4  4 ENSG00000205266 NA KRT17P5 transcribed_unprocessed_pseudogene
5                            5 ENSG00000206585 26864 RNVU1-7 snRNA
6                              6 ENSG00000206588 NA RNU1-28P snRNA

Then I tried to extract the Ensembl gene ID from each row using the sub function.
For example, for row 1:

> sub("^\\d", "", my_csv_file[1, ])
[1] " ENSG00000174365 128439 SNHG11 lncRNA"

However, I'm stuck because I don't know how to use regular expressions to remove everything after the Ensembl ID, or how to put this inside a for loop to process every row.
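
For reference, here is a minimal sketch of the kind of thing I am trying to do. It assumes every row has the form "row number, spaces, ENSG ID, rest of the line"; the three-row data frame is a hypothetical stand-in for my real data, and the pattern is only a guess at what would work. Since sub is vectorized, it may be that no for loop is needed at all.

```r
# Hypothetical stand-in for the single-column data frame shown above
my_csv_file <- data.frame(
  ensembl_gene_id.entrezgene_id.hgnc_symbol.gene_biotype = c(
    "1 ENSG00000174365 128439 SNHG11 lncRNA",
    "2 ENSG00000180385 NA EMC3-AS1 transcribed_unprocessed_pseudogene",
    "3 ENSG00000183562 NA  lncRNA"
  ),
  stringsAsFactors = FALSE
)

# Pull the only column out as a character vector
rows <- as.character(my_csv_file[[1]])

# Capture the ENSG ID in a group and replace the whole line with that group.
# Assumed layout: leading digits, one or more spaces, the ID, then anything.
# sub() works on the whole vector at once, so all rows are handled in one call.
ensembl_ids <- sub("^[0-9]+ +(ENSG[0-9]+).*$", "\\1", rows)

print(ensembl_ids)
```

One caveat with this approach: a row that does not match the pattern is returned unchanged rather than dropped, so the result would need to be checked if some rows have a different layout.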

I appreciate your help.

Best regards.


