Posted 2 hours ago by Bioinfo (Morocco)

Hello

I have a question, please. I have a contigs file that I want to annotate using Prokka, but I get an error message saying that the file contains a duplicate sequence ID: scaffold36|size13034. This makes sense, because I merged several assembly files and then removed duplicates using cd-hit and seqkit, and I think they did not do the job perfectly.

So what I need is to eliminate the duplicated sequences "manually" (or using another tool).

Basically, what I want to do is this.

I have a file like this:

>scaffold1|size1334
ACTGATGATACAGATACAGAAAGTAGAGATCGATGATAGA..
>scaffold2|size23034
ACAGATGAGACAGATTGACAGATAGAGATAGAGGATAGGACAG..
>scaffold3|size11654
ATAGCGCTCGCGCGCCGCGCGGCGGGGTAGAGAGATCTTTTGAGAGAGA..
>scaffold4|size3034
TGGGGTAGAGAGAGAGAGAGAAGAGGAAGAGAGGAGAGAGGA..
>scaffold2|size23034
ACAGATGAGACAGATTGACAGATAGAGATAGAGGATAGGACAG..
>scaffold100|size304
AAAAAAATACAGATAGAGAGAGAGAGGAGAGAGAGAG..
>scaffold67|size2400
ATAGAGAGAGAGAGAGAGAGAGAGAGAGGAGAGAGAGAGA..

I want to eliminate the duplicated scaffold (in this case scaffold2: the line >scaffold2|size23034 and its sequence), because it appears twice.

So the output will be:

>scaffold1|size1334
ACTGATGATACAGATACAGAAAGTAGAGATCGATGATAGA..
>scaffold2|size23034
ACAGATGAGACAGATTGACAGATAGAGATAGAGGATAGGACAG..
>scaffold3|size11654
ATAGCGCTCGCGCGCCGCGCGGCGGGGTAGAGAGATCTTTTGAGAGAGA..
>scaffold4|size3034
TGGGGTAGAGAGAGAGAGAGAAGAGGAAGAGAGGAGAGAGGA..
>scaffold100|size304
AAAAAAATACAGATAGAGAGAGAGAGGAGAGAGAGAG..
>scaffold67|size2400
ATAGAGAGAGAGAGAGAGAGAGAGAGAGGAGAGAGAGAGA..

Each scaffold should appear only once. Thank you.
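
For reference, the "manual" clean-up described above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the sequence ID is taken to be the header text up to the first whitespace (so `scaffold2|size23034` as a whole), the first occurrence of each ID is the one to keep, and the script and file names (`dedup_fasta.py`, `contigs.fasta`) are hypothetical.

```python
import sys


def dedup_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each sequence ID."""
    seen = set()
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            # Assumption: the ID is the header up to the first whitespace,
            # without the leading ">".
            fields = line[1:].split()
            seq_id = fields[0] if fields else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            # Header and all following sequence lines of a kept record
            yield line


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage:
    #   python dedup_fasta.py contigs.fasta > contigs.dedup.fasta
    with open(sys.argv[1]) as handle:
        sys.stdout.writelines(dedup_fasta(handle))
```

The sketch keys on the ID only, since a duplicate ID is what the error message reports; it does not compare the sequences themselves, so if two different sequences shared an ID, the second one would be dropped as well. It also works when a sequence is wrapped over several lines.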
