According to UniProtKB, the human proteome contains 20,370 reviewed proteins. I would like to create a 20,370 x 20,370 matrix containing all pairwise protein sequence identities or similarities (ranging from 0 to 1). I would very much appreciate any hints regarding the following:
(a) Have pairwise protein sequence identities or similarities already been pre-computed and made available for download? I am familiar with the UniRef clusters at 100%, 90% and 50% sequence identity; however, I am interested in the pairwise sequence identities themselves rather than in the sequence clusters.
(b) A number of robust tools have already been developed to calculate sequence similarities / identities and to cluster proteins, e.g. MMseqs2, Clustal Omega or blastall. Are there any other good tools you are familiar with for an all-against-all pairwise sequence similarity calculation? It would be great if you could share them on this thread.
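To make the target concrete, here is a minimal sketch of the matrix I have in mind. It assumes the all-against-all results are available as a three-column, tab-separated table of query, target and identity already scaled to 0..1; that layout, the function name `identity_matrix` and the toy accessions are placeholders of mine and not the output format of any particular tool. Pairs with no reported hit are simply left at 0, and NumPy is assumed to be installed.

```python
import csv
import io

import numpy as np


def identity_matrix(ids, hits_tsv):
    """Build a symmetric identity matrix from (query, target, identity) rows.

    Assumes `hits_tsv` is a file-like object with three tab-separated columns
    and identities already scaled to 0..1 (hypothetical layout, not tied to
    a specific tool's output).
    """
    index = {pid: i for i, pid in enumerate(ids)}
    mat = np.zeros((len(ids), len(ids)), dtype=np.float32)  # no hit -> 0
    np.fill_diagonal(mat, 1.0)  # a sequence is identical to itself
    for query, target, ident in csv.reader(hits_tsv, delimiter="\t"):
        i, j = index[query], index[target]
        # Keep the best value seen for the pair and mirror it,
        # so the matrix stays symmetric even if A->B and B->A differ.
        best = max(mat[i, j], float(ident))
        mat[i, j] = mat[j, i] = best
    return mat


# Toy example with three made-up accessions.
ids = ["P1", "P2", "P3"]
hits = io.StringIO("P1\tP2\t0.5\nP2\tP1\t0.25\nP2\tP3\t0.75\n")
m = identity_matrix(ids, hits)

# Rough size of the full proteome matrix.
n = 20_370
full_bytes = n * n * 4           # float32 entries, dense storage
unique_pairs = n * (n - 1) // 2  # unordered pairs, diagonal excluded
```

For the full proteome this comes to about 207 million unique pairs, and a dense float32 matrix would need roughly 1.7 GB, so a dense array still seems feasible on a workstation; a sparse representation might be the better choice if most pairs turn out to have no detectable hit.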
Any hints would be greatly appreciated.