Duplicate gene IDs with different lengths after dereplicating and clustering with cd-hit-est


Hi. I am trying to use CD-HIT to remove duplicates from a file that combines all of the nucleotide FASTA output from Prodigal.
I used the following parameters:

cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 0.95 -s 0.8 -M 0 -T 0 -n 8

The representative sequences appear in the output FASTA file, but many short or fragmented sequences share the same header ID.
Does anyone know how to set the cd-hit-est parameters so that there is only one sequence per header ID?
I also tried cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 1 -t 1 -d 0, which someone recommended, but it did not solve the problem.
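
A quick way to check whether the repeated IDs are already present in the combined input (for example, if the per-sample Prodigal files reuse the same contig names) is to count header IDs before clustering. This is only a minimal sketch: it assumes the ID is the text between ">" and the first whitespace, and the function name and example headers are made up for illustration.

```python
from collections import Counter


def duplicate_header_ids(fasta_lines):
    """Return {id: count} for header IDs that occur more than once.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


# Tiny in-memory example with hypothetical headers; two records share one ID.
example = [
    ">k141_1_1 # 2 # 181 # 1\n",
    "ATGAAAC\n",
    ">k141_1_1 # 5 # 90 # -1\n",
    "ATGCC\n",
    ">k141_2_1 # 1 # 60 # 1\n",
    "ATGTTT\n",
]
print(duplicate_header_ids(example))  # {'k141_1_1': 2}
```

To run it on the real file, pass an open file handle instead of the list, e.g. `duplicate_header_ids(open("nuc_sum.fa"))`. If the input already contains repeated IDs, the duplicates would not be coming from cd-hit-est itself.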



replicate


duplicate


cd-hit


cluster
