2 hours ago by Mengying (East China Normal University)

I am having difficulty splitting my taxonomy column into separate ranks, i.e. "domain", "phylum", "class", "order", "family", "genus".
The main problem is that the format of the taxonomy column is not uniform. Some rows have every taxonomic level, while others have only the "domain", "phylum", and "genus" levels.
My data has a few thousand rows and looks something like this:

OTUID   Taxonomy
OTU1    d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas
OTU20   d:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera
OTU774  d:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4

I'm not familiar with R, so I spent a whole day searching online for a solution. I tried the separate function from the tidyr package, like this:

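library(tidyr)  # provides separate() and re-exports the %>% pipe
x <- read.csv("annotation.csv")
# Split on ",<letter>:"; the pieces are assigned to the columns by position
y <- x %>% separate(Taxonomy, c("domain", "phylum", "class", "order", "family", "genus"), ",[a-z]:")
write.csv(y, "tax_split.csv", row.names = TRUE)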

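But the result was disappointing. The values are assigned to columns by position rather than by rank prefix, so rows with missing levels end up with values in the wrong columns (for example, the genus of OTU774 lands in the class column):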

OTUID   domain      phylum             class                order               family              genus
OTU1    d:Bacteria  "Proteobacteria"   Gammaproteobacteria  Pseudomonadales     Pseudomonadaceae    Pseudomonas
OTU20   d:Archaea   "Thaumarchaeota"   Nitrososphaerales    Nitrososphaeraceae  Nitrososphaera      NA
OTU774  d:Bacteria  "Armatimonadetes"  Armatimonadetes_gp4  NA                  NA                  NA

In the end I had to use Excel's filter function to deal with this, but that method is very time-consuming (╯︵╰)
I'd still like to ask: is there an elegant way to solve this problem in R?

Thanks for your help!
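
For reference, the rank-aware split described above could look roughly like the following. This is only a minimal base-R sketch, not a tested solution for the full data set. The helper name split_taxonomy and the lookup vector ranks are made up for illustration, and the sketch assumes every field has the form letter:name, with commas appearing only between fields. The three example rows are typed in directly so the snippet runs on its own; with the real file, x would come from read.csv("annotation.csv") as in the attempt above.

```r
# Hypothetical lookup table: one-letter prefix -> rank (column) name
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# Hypothetical helper: turn one taxonomy string into a named vector
# with one slot per rank; ranks missing from the string stay NA.
split_taxonomy <- function(tax) {
  parts  <- strsplit(tax, ",", fixed = TRUE)[[1]]       # "d:Bacteria", "p:...", ...
  keys   <- sub(":.*$", "", parts)                      # prefix before the colon
  values <- gsub('"', "", sub("^[^:]*:", "", parts))    # name without prefix or quotes
  out    <- setNames(rep(NA_character_, length(ranks)), ranks)
  known  <- keys %in% names(ranks)                      # ignore unexpected prefixes
  out[ranks[keys[known]]] <- values[known]              # place each value by its rank
  out
}

# Example rows copied from the question
x <- data.frame(
  OTUID = c("OTU1", "OTU20", "OTU774"),
  Taxonomy = c(
    'd:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas',
    'd:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera',
    'd:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4'
  ),
  stringsAsFactors = FALSE
)

# One row per OTU, one column per rank
tax_matrix <- do.call(rbind, lapply(x$Taxonomy, split_taxonomy))
y <- data.frame(OTUID = x$OTUID, tax_matrix, stringsAsFactors = FALSE)
print(y)
```

Because each value is placed by its prefix instead of its position, a row with only d, p and g should get NA in class, order and family while the genus stays in the genus column. Whether stripping the double quotes around the phylum is wanted is a separate choice; dropping the gsub call would keep them.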


modified 1 hour ago


