-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Precomputed distance matrix (take 2) #82
Comments
This definitely not implemented in FIt-SNE in any branch. Out of curiosity, what is your use case? Why do you have a precomputed distance matrix? |
HI @dkobak thanks for your response. This query was motivated by your recent paper According to @linqiaozhi, an old version of FIt-SNE had this. Here's the link to the code, and the link to discussion about this I'm trying to use t-SNE to embed a large dataset where I have the pairwise distance matrix and nothing else. I know this is weird, but this is because I don't have the "primary data" itself, not because I don't have access to it, but because in my use case, it is fundamentally hard to define. What I can define is a function that maps pairs of "data" onto a distance, so I can create a distance matrix. My goal is to use this distance matrix to reveal structure in my high-dimensional data set. |
So your N must be relatively small if you can hold the whole NxN distance matrix in memory? Around 10k max I guess? |
Yes, my N right now is ~15K (I plan on going bigger, and I'm essentially limited by RAM) |
Sorry I have not had time to implement this. The way it should be implemented is based on a sparse matrix, so RAM will not be a limitation. That is, the wrapper would just write the three vectors defining the sparse matrix to a file, and then the binary will read those vectors. Here is the example of writing the vectors in Matlab. I'm afraid I'm just not able to work on this for now. Maybe in January I'd have more time. @sg-s, if you want to try your hand at it, then I'd be more than happy to help you if you get stuck. |
Hi @linqiaozhi sure, I can take a crack at it. I can write the data into a sparse matrix, no problem. Can you point me to the part of your code that reads this? I can then probably put the pieces together and send you a PR. Thanks for the hint! |
If you only have a black box function to compute pairwise distances, how are you going to construct the sparse kNN matrix without first computing the full sense one? |
@dkobak I don't think it's possible to avoid computing the full distance matrix (which I have done). If I understand what's being said here, the sparsity in the matrix comes from only considering the k nearest neighbours in the full matrix, right? |
Yes, it is possible to compute an (exact or approximate) sparse kNN matrix without computing all the pairwise distances. In fact, that's exactly what FIt-SNE (and BH t-SNE for that matter) does by default! Essentially any fast KNN method does that. FIt-SNE uses VP trees for exact KNN, and it only requires that the "distance function" be a metric. @dkobak , did I misunderstand your question? The feature that we are missing is for users to provide their own sparse distance matrix and embed it with t-SNE. A user could generate this from their own nearest neighbor algorithm, or it could just be a graph that they want to embed (that does not correspond to nearest neighbors at all). @sg-s One question is whether we want users to be able to pass a distance matrix or a fully normalized P matrix. That is, do you want to provide the distances and let FIt-SNE take care of applying the Gaussian kernel and normalizations? Or do you want to be able to pass the affinities themselves? |
@sg-s thanks for being willing to take a stab at it! As I mentioned above, there are two options: Option 1: Pass the affinities. That is, trust the user to apply the kernel converting distances to affinities. This is nice because I bet there are lots of more interesting affinities than Gaussian with perplexity that we use! Here is the code that loads in the P matrix. As you can see, it is three different vectors, this is the Compressed Sparse Row format. It's easiest to just see in an example (and take a look at the wikipedia page linked above): So, this code is already there. You should try and set load_affinities to true, and test it to see if it works! That is, if you want to pass the affinities. Option 2: Pass the distances, and let the algorithm apply the kernel. You'll probably want the wrapper to write three vectors corresponding to your sparse distance matrix just like above (maybe call them val_D, col_D, row_D). Then you should probably make a new computeGaussianPerplexity() function. See how this one computes the nearest neighbors using ANNOY and then computes the affinities from them? You'll want to make a new function that does essentially the same thing, but instead of computing the nearest neighbors here, it should just look them up in the loaded val_D, col_D, and row_D (where each of these will be similar to the three vectors defined above but they have distances). Is that clear? Please feel free to ask any questions...I will be more responsive now that the holidays are over. |
It should be much easier to implement this in openTSNE which is designed to be easily extendible/modifiable. It uses pynndescent for kNN search and supports a lot of metrics out of the box. One can easily plug your own distance matrix / affinity matrix into openTSNE. In fact, I don't know if pynndescent supports a user defined metric, but it very well might, which would allow to directly get sparse kNN matrix. I am not sure it's worth implementing any of that in the C++ code here. (The only thing openTSNE currently does not support is the variable degree of freedom; hopefully it will eventually get implemented there too...) |
OK, I'll take a crack at it. Will look at how fast_tsne.cpp works and will report back. Thanks!
|
@dkobak, yes, it might be easier to implement in openTSNE. While it may be a little slower, that is probably not prohibitive for most users. However, openTSNE is pure Python; it's possible that @sg-s also likes having an R or MATLAB interface like we have here. @sg-s , sounds great! Just to be clear, it looks like you are going for "Option 1" I mentioned above, correct? That is, the user will provide the affinities, which will be plugged directly into the P matrix by fast_tsne. |
HI @linqiaozhi and @dkobak, I have a workaround that achieves what I want. Turns out I did not have to modify any of your code, and the functionality I was looking for already exists in your codebase. Here are the steps to embed a distance matrix using FIt-SNE:
The results seem to make sense, and the only thing left is this: How do I sparsify my affinity matrix? Is there some rigorous way to do this other than throwing out all except K-highest values per row?
|
The standard way to compute sparse similarities in tSNE given desired perplexity P, is to find 3P nearest neighbors for any point i, use these 3P distances to search for a kernel width (sigma) that would yield perplexity P, convert these 3P distances into affinities, and set the other N-3P values to zero. This heuristic was suggested in the Barnes-Hut tSNE paper, and everybody else is using it since then. |
Thanks, that's clear! Closing this for now |
Hi, I had previously asked about using FIt-SNE with a precomputed distance matrix and I was pointed to an earlier version which did support it. Thanks for that, but I now want something else.
I see that there is now an "degrees of freedom" parameter that makes the heavy tails heavier, and I'm interested in using that if possible, and I wonder if I can do that with a precomputed distance matrix.
Thanks for any help!
The text was updated successfully, but these errors were encountered: