Duplicate gene IDs with different lengths after dereplicating and clustering with cd-hit-est
Hi. I am trying to use CD-HIT to remove duplicates from a file that combines all the nucleotide FASTA output from Prodigal.
I used the following parameters:
cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 0.95 -s 0.8 -M 0 -T 0 -n 8
The representative sequences appear in the output FASTA file, but many short or fragmented sequences share the same header ID.
Does anyone know how to set the parameters in cd-hit-est to make sure there will be only one sequence for one header ID?
I also tried cd-hit-est -i nuc_sum.fa -o cd-hit_sum -c 1 -t 1 -d 0, which someone recommended, but it does not solve the problem.
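One thing worth checking first, as an assumption rather than a confirmed diagnosis: the repeated IDs may already be present in nuc_sum.fa before clustering, for example if Prodigal was run on several assemblies whose contig names overlap. The Python sketch below counts repeated header IDs and, if needed, renames the repeats. The helper names and the `_dupN` suffix are made up for illustration, and the suffix could in principle collide with an ID that already exists.

```python
from collections import Counter


def read_fasta_ids(lines):
    """Yield the ID (first whitespace-delimited token) of each FASTA header."""
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            yield fields[0] if fields else ""


def duplicated_ids(lines):
    """Return {id: count} for every header ID that occurs more than once."""
    counts = Counter(read_fasta_ids(lines))
    return {seq_id: n for seq_id, n in counts.items() if n > 1}


def make_ids_unique(lines):
    """Yield FASTA lines, appending a hypothetical _dupN suffix to repeated IDs.

    The first occurrence of an ID is left unchanged; the 2nd, 3rd, ...
    occurrences become id_dup2, id_dup3, and so on. Sequence lines pass
    through untouched.
    """
    seen = Counter()
    for line in lines:
        if not line.startswith(">"):
            yield line
            continue
        parts = line[1:].rstrip("\n").split(None, 1)
        seq_id = parts[0] if parts else ""
        rest = " " + parts[1] if len(parts) > 1 else ""
        seen[seq_id] += 1
        if seen[seq_id] > 1:
            seq_id = f"{seq_id}_dup{seen[seq_id]}"
        yield f">{seq_id}{rest}\n"
```

To use it, open nuc_sum.fa, pass the file object to `duplicated_ids`, and if the result is non-empty, write the output of `make_ids_unique` to a new file and run cd-hit-est on that file instead. Prefixing each header with its sample name before concatenating is another option and keeps the origin of each gene traceable.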