Hi So,

Here is your script in both bash and R, as this was easier for me. (I also used some shortcuts here), I will make a bash solution at some point... it will just take more time as I have a Master's project to tend to that I've been putting aside!!!

I must say though Google is your friend. you can google things like: count duplicates in bash

and the second link is someone asking the question on stackoverflow:

stackoverflow.com/questions/6712437/find-duplicate-lines-in-a-file-and-count-how-many-time-each-line-was-duplicated

and then piece that together with the next question you have to build your script.


Anyways here's my try at this.

So you would just have to change the file paths appropriately.

Your current output is this:

[email protected] So % cat transcripts.txt 
VFFH01000932.1  StringTie   transcript  716340  733649  1000    +   .   "gene_id ""STRG.14845""; transcript_id ""STRG.14845.6""; cov ""2.191051""; FPKM ""0.287366""; TPM ""0.383253"";"
VFFH01001189.1  StringTie   transcript  11702   12012   1000    .   .   "gene_id ""STRG.14846""; transcript_id ""STRG.14846.1""; cov ""4.823151""; FPKM ""0.632579""; TPM ""0.843654"";"
...

Doing the following:

cat transcripts.txt | tr -d '"' | tr -d ';' | awk -F " " '{ print $3,$4,$5,$10,$12 }' | tr "." ' ' | awk -F " " '{ print $1,$2,$3,$4 "." $5,$6 "." $7 " " $8 }' > transcripts_cleaned.txt

Will result in a text file "transcripts_cleaned.txt" that looks like this:

[email protected] So % cat transcripts_cleaned.txt 
transcript 716340 733649 STRG.14845 STRG.14845 6
transcript 11702 12012 STRG.14846 STRG.14846 1
transcript 72502 81013 STRG.14847 STRG.14847 1
...

You can see here that I basically used the number after transcript id to provide a number for the counts of transcripts. The last column being the counts.

Now in R:

The following script or variations should get you what you want:

df <- read.table("~/Desktop/biostars/So/transcripts_cleaned.txt", sep = " ")
df2 <- df[!duplicated(as.list(df))]

install.packages('dplyr')
library(dplyr)
df.with.more.than.one.count <-  df2 %>% group_by(V4) %>% filter(n()>1)
df.with.gene.ids.transcripts <- df.with.more.than.one.count %>% group_by(V4) %>% filter(V6==max(V6))
df.with.gene.ids.transcripts$V4
write.table(df.with.gene.ids.transcripts$V4, file = "~/Desktop/biostars/So/geneids_with_more_than_one_transcript.txt", sep= " ", col.names = F, row.names =  F)
write.table(df.with.gene.ids.transcripts, file = "~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame.txt", sep= " ", col.names = F, row.names =  F)

if you want to remove quotes in the file you can use the tr for translate like this:

cat ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame.txt | tr -d '"' > ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame_no_quotes.txt

and

cat ~/Desktop/biostars/So/geneids_with_more_than_one_transcript.txt | tr -d '"' >  ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_without_quotes.txt

I hope this helps.



Source link