UMAP and "equal" objects


I want to plot a very large dataset. UMAP works quite well with this type of data (not single-cell expression, but similar). However, I have a couple of clusters of absolutely equal objects: the distance between all objects within each such cluster is 0. UMAP somehow draws these huge clusters as "outlier" dots, even though these objects are not that dissimilar to the other objects.

I can replace these objects with a single representative, but is there an alternative way to visualize these clusters using UMAP so that they are not plotted as a dot very far from the other dots?



You probably need to play with the parameters.
Check these papers to get an idea of where you could focus your efforts:

It depends on your definition of a large dataset. I have used openTSNE with 20-30 CPUs on a 100000 x 136 dataset, and it finishes the embedding in ~25 minutes. Even though this implementation of t-SNE is not as fast as UMAP, it is fast enough that using t-SNE should not be a problem even on datasets with a million data points, as long as the number of columns (features) is not in the thousands.

I am curious how you define the distance between your vectors to be 0. UMAP is not supposed to separate (near-)identical data points at all, no matter what parameters are used.
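
For what it's worth, the "one representative per group" idea from the question can be sketched roughly as below. This is only an illustration, not a UMAP feature: the names `embed_with_duplicates`, `embed`, and `jitter` are made up for this example, and `embed` is assumed to be any function that turns an (n, features) array into (n, 2) coordinates, for instance a wrapper around `umap.UMAP().fit_transform`. Only the unique rows are embedded; the coordinates are then copied back to every original row, and the group sizes are returned so the duplicates can be drawn as bigger markers instead of vanishing into one dot.

```python
import numpy as np


def embed_with_duplicates(X, embed, jitter=0.0, seed=0):
    """Embed the unique rows of X and map the result back to all rows.

    X      : (n_samples, n_features) array, may contain identical rows.
    embed  : hypothetical callable, (n_unique, n_features) -> (n_unique, 2),
             e.g. lambda U: umap.UMAP().fit_transform(U)
    jitter : optional std. dev. of Gaussian noise added to the final
             coordinates so duplicates do not overplot as a single dot.
    Returns (coords, group_size), both aligned with the rows of X.
    """
    X = np.asarray(X)
    unique_rows, inverse, counts = np.unique(
        X, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.ravel()  # keep the index 1-D across NumPy versions

    # Embed one representative per group of identical rows.
    coords_unique = np.asarray(embed(unique_rows), dtype=float)

    # Copy each representative's coordinates back to all of its duplicates.
    coords = coords_unique[inverse]
    if jitter > 0:
        rng = np.random.default_rng(seed)
        coords = coords + rng.normal(scale=jitter, size=coords.shape)

    # Number of identical rows behind each point, usable as marker size.
    return coords, counts[inverse]
```

With umap-learn installed, something like `coords, size = embed_with_duplicates(X, lambda U: umap.UMAP().fit_transform(U), jitter=0.05)` followed by a scatter plot with `s=size` would be one way to use it. I believe umap-learn also has a `unique` constructor option meant for data with many duplicate rows, but please check the documentation of your installed version before relying on it.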
