
1 hour ago by

To cluster (put similar reads "together") you can start with this:

cd-hit-est -i reads.fa -o output.fa -c 0.95 -n 10 -d 999 -M 0 -T 0

For more info see github.com/weizhongli/cdhit/wiki/3.-User's-Guide#CDHITEST

The option -c sets the global sequence identity threshold, so in this example all reads that are at least 95% identical to a cluster's representative will be put together. For redundancy removal I guess you need to set this to -c 1.

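BUT! Keep in mind how that identity is computed. As I read the user guide, cd-hit's global identity is the number of identical bases in the alignment divided by the length of the shorter sequence. So for example the following reads:

>read1
AAAA
>read2
AAAAA

are not the same string, but read1 is fully contained in read2, so it can still count as 100% identical and be removed at -c 1. If only reads of equal length should count as duplicates, look at the length difference cutoff (-s) as well. So what does redundancy mean in your case?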

The output (output.fa) will contain the representative sequences. In practice (sort of) cd-hit first sorts your input fasta by read length, longest first. After that it goes through the sorted reads from top to bottom. At the very first read there are no clusters yet, so this read becomes the representative of the first cluster. If the second read is at least 95% similar to it, it joins that first cluster; if not, it becomes the representative of a new cluster. Let's say those two reads are similar: then your output file will contain only 1 sequence, so the redundancy is removed. Which reads ended up in which cluster is listed in output.fa.clstr.
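
If it helps to see the idea in code, here is a rough Python sketch of that greedy, length-sorted pass. It is not cd-hit's actual implementation: the function names (lcs_length, identity, greedy_cluster) and the toy reads are made up for illustration, and the identity score uses a plain longest-common-subsequence count divided by the shorter read's length as a crude stand-in for a real alignment.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence (crude stand-in for an alignment)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def identity(a: str, b: str) -> float:
    """Matching bases divided by the length of the shorter sequence (assumed definition)."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


def greedy_cluster(reads: dict, threshold: float = 0.95) -> dict:
    """Map each representative read name to the names of the reads in its cluster."""
    clusters = {}
    # Longest reads first, so a representative is never shorter than its members.
    for name in sorted(reads, key=lambda n: len(reads[n]), reverse=True):
        for rep in clusters:
            if identity(reads[name], reads[rep]) >= threshold:
                clusters[rep].append(name)  # similar enough: join the existing cluster
                break
        else:
            clusters[name] = [name]  # no match: this read starts a new cluster
    return clusters


# Hypothetical toy reads: read2 differs from read1 in one base, read3 is unrelated.
reads = {
    "read1": "ACGTACGTACGTACGTACGT",
    "read2": "ACGTACGTACGTACGTACGA",
    "read3": "TTTTTGGGGGCCCCCAAAAA",
}
clusters = greedy_cluster(reads, threshold=0.95)
for rep, members in clusters.items():
    print(rep, members)
```

The keys of the returned dict play the role of the sequences written to output.fa, and the member lists roughly correspond to what the .clstr file records. Note that with this shorter-length denominator, identity("AAAA", "AAAAA") comes out as 1.0.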


modified 1 hour ago

written
1 hour ago
by

gb1.3k


