Mensur Dlakic (USA), 41 minutes ago

For the purposes of machine learning, sequence alignments and/or hidden Markov models of protein families should be just as good as, and realistically better than, PSSMs. You can find a PDB database clustered at 70% identity here and the explanation is here.

There is no need to model all sequences, and that goes for all protein databases, not just PDB. As of a month ago, there were almost 0.5 million individual protein chains in PDB. Simple clustering at 95% identity drops that number to ~60 thousand, meaning that more than 85% of protein chains in PDB are at least 95% identical to at least one other chain in the database. In other words, there is huge sequence and structure redundancy in PDB. Depending on your exact task, I think clustering at 50% identity would work as well, since almost all sequences that share 50% identity are related. Some people will tell you it is safe to go down to 30-40% identity when clustering.
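To make the redundancy argument concrete, here is a minimal Python sketch of greedy clustering at an identity threshold, plus the arithmetic behind the "more than 85%" figure. It is only an illustration under stated assumptions: `approx_identity`, `greedy_cluster`, and `redundant_fraction` are hypothetical helper names, the toy sequences are made up, and the identity measure is a crude stand-in built on `difflib` rather than a real alignment. For an actual database you would use a dedicated clustering tool such as CD-HIT or MMseqs2.

```python
from difflib import SequenceMatcher


def approx_identity(a: str, b: str) -> float:
    """Crude identity proxy (assumption): matched characters divided by
    the length of the shorter sequence. Not a real pairwise alignment."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: dict[str, str], threshold: float) -> dict[str, list[str]]:
    """Longest-first greedy clustering: each sequence joins the first
    representative it matches at >= threshold, otherwise it becomes a
    new representative. Returns {representative: [members]}."""
    clusters: dict[str, list[str]] = {}
    for name in sorted(seqs, key=lambda n: len(seqs[n]), reverse=True):
        for rep in clusters:
            if approx_identity(seqs[name], seqs[rep]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


def redundant_fraction(n_chains: int, n_clusters: int) -> float:
    """Fraction of chains that are not cluster representatives."""
    return 1 - n_clusters / n_chains


if __name__ == "__main__":
    # Made-up toy sequences: A and B differ at one position, C is unrelated.
    toy = {
        "A": "MKTAYIAKQRQISFVKSHFSRQ",
        "B": "MKTAYIAKQRQISFVKSHFSRA",
        "C": "GSHMLEDPVDAFWWWNNNCCCP",
    }
    print(greedy_cluster(toy, threshold=0.95))
    # Numbers from the post: ~0.5 million chains, ~60 thousand clusters.
    print(round(redundant_fraction(500_000, 60_000), 2))
```

With the numbers quoted above, the ~60 thousand representatives are about 12% of the ~0.5 million chains, so roughly 88% of chains collapse into a cluster represented by another chain, which is consistent with "more than 85%". Lowering `threshold` to 0.5 or 0.3 in a real tool merges more distant relatives and shrinks the representative set further.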


