I am having difficulty splitting my taxonomy column into separate ranks, i.e. "domain", "phylum", "class", "order", "family", "genus".
The biggest problem is that the format of my taxonomy column is not uniform. Some rows have all the taxonomy levels, while others have only some of them, for example only the "domain", "phylum", and "genus" levels.
My data has a few thousand rows and looks something like this:
    OTUID   Taxonomy
    OTU1    d:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas
    OTU20   d:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera
    OTU774  d:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4
I'm not familiar with R, so I spent a whole day searching the Internet for relevant solutions. I tried the separate function from the tidyr package, like this:
    library(tidyr)
    x <- read.csv("annotation.csv")
    y <- x %>% separate(Taxonomy, c("domain", "phylum", "class", "order", "family", "genus"), ",[a-z]:")
    write.csv(y, "tax_split.csv", row.names = TRUE)
But the result was disappointing. The values are split by position rather than by rank, so whenever a rank is missing, the remaining values shift into the wrong columns:
    OTUID   domain      phylum             class                order               family            genus
    OTU1    d:Bacteria  "Proteobacteria"   Gammaproteobacteria  Pseudomonadales     Pseudomonadaceae  Pseudomonas
    OTU20   d:Archaea   "Thaumarchaeota"   Nitrososphaerales    Nitrososphaeraceae  Nitrososphaera    NA
    OTU774  d:Bacteria  "Armatimonadetes"  Armatimonadetes_gp4  NA                  NA                NA
In the end I had to fall back on Excel's filter function to deal with this, but that method is very time-consuming (╯︵╰)
So I would still like to ask: is there an elegant way to solve this problem in R?
Thanks for your help!
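To make the goal concrete, here is a minimal sketch of one direction that might work. It is not a definitive solution. It uses only base R and assumes that every entry has the form `letter:value`, that the letters d, p, c, o, f, g map to the six ranks, and that the quotes around the phylum can simply be dropped. The helper `split_taxonomy` and the lookup vector `ranks` are hypothetical names made up for this illustration, and the three rows are the sample rows shown above.

```r
# Sample rows copied from the question; real data would come from read.csv("annotation.csv").
x <- data.frame(
  OTUID = c("OTU1", "OTU20", "OTU774"),
  Taxonomy = c(
    'd:Bacteria,p:"Proteobacteria",c:Gammaproteobacteria,o:Pseudomonadales,f:Pseudomonadaceae,g:Pseudomonas',
    'd:Archaea,p:"Thaumarchaeota",o:Nitrososphaerales,f:Nitrososphaeraceae,g:Nitrososphaera',
    'd:Bacteria,p:"Armatimonadetes",g:Armatimonadetes_gp4'
  ),
  stringsAsFactors = FALSE
)

# Assumed mapping from the one-letter prefix to the rank name (hypothetical helper).
ranks <- c(d = "domain", p = "phylum", c = "class",
           o = "order", f = "family", g = "genus")

# Hypothetical helper: turn one taxonomy string into a named vector with one
# slot per rank, leaving NA where a rank is absent.
split_taxonomy <- function(tax) {
  tax    <- gsub('"', "", tax, fixed = TRUE)        # drop the quotes around the phylum
  parts  <- strsplit(tax, ",", fixed = TRUE)[[1]]   # "d:Bacteria", "p:Proteobacteria", ...
  prefix <- sub(":.*$", "", parts)                  # text before the first colon
  value  <- sub("^[^:]*:", "", parts)               # text after the first colon
  out    <- setNames(rep(NA_character_, length(ranks)), unname(ranks))
  known  <- prefix %in% names(ranks)                # ignore prefixes outside the mapping
  out[ranks[prefix[known]]] <- value[known]         # place each value under its own rank
  out
}

# One row per OTU, one column per rank.
tax_cols <- t(vapply(x$Taxonomy, split_taxonomy,
                     character(length(ranks)), USE.NAMES = FALSE))
colnames(tax_cols) <- unname(ranks)

y <- cbind(x["OTUID"], as.data.frame(tax_cols, stringsAsFactors = FALSE))
print(y)
```

Because each value is placed by its prefix rather than by its position, a missing rank should stay NA in its own column instead of pulling the later ranks to the left. The result `y` could then be written out with write.csv as in my attempt above.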