Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup Benchmarker.prepare (compute_connectivities_umap) #128

Conversation

jan-engelmann
Copy link
Contributor

Thanks for the great package!

Unfortunately Benchmarker.prepare() is very slow due to currently released scanpy code:
sc.neighbors._compute_connectivities_umap is highly inefficient (see this monster).

One iteration of the KNN calculation takes 2:30h for 7 Million cells. A large part of this time is spent in sc.neighbors._compute_connectivities_umap.

I see two solutions:
A Use the approach proposed in this PR
B Use new scanpy code not yet released but already on main

A This PR

The solution in this PR offers the following speedup:
image

B Unreleased Scanpy code

The Neighbors scanpy code on main has been thoroughly refactored by @Zethson and likely offers a way to do this efficiently. For example scanpy.neighbors._common._get_sparse_matrix_from_indices_distances looks promising. I can look more into this if there's interest.

What do you think @Zethson @adamgayoso

@Zethson
Copy link
Contributor

Zethson commented Dec 21, 2023

I didn't touch the neighbors code. Maybe I merged a PR but that's about it.

I'd rather lobby for a new scanpy release and bring any improvements upstream into scanpy.

But cool work @jan-engelmann !

Copy link
Member

@adamgayoso adamgayoso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this! I think this should be added to the package so we don't rely on a private fn of scanpy. Also the fn is quite simple.

Comment on lines +69 to +75
knn_indices,
knn_dists,
n_obs,
n_neighbors,
set_op_mix_ratio=1.0,
local_connectivity=1.0,
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add typing

set_op_mix_ratio=1.0,
local_connectivity=1.0,
):
"""Sped up version of sc.neighbors._compute_connectivities_umap."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put a more general docstring? Overview of the method and that it matches how connectivies are computed in scanpy?

Comment on lines +84 to +87
X,
n_neighbors,
None,
None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I prefer using keywords everywhere

Comment on lines +19 to +20
assert (new_dist == sc_dist).todense().all()
assert (new_connect == sc_connect).todense().all()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: use np.testing

assert (new_connect == sc_connect).todense().all()


def test_timing_compute_connectivities_umap():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this test is necessary if you add a reproducible script to this PR description

Comment on lines +94 to +96
if isinstance(connectivities, tuple):
# In umap-learn 0.4, this returns (result, sigmas, rhos)
connectivities = connectivities[0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this bit still necessary? What's the lower bound for scanpy? If we merge this PR we will need to make umap a direct dependency, so please also add that and potentially remove this block.

@adamgayoso
Copy link
Member

Actually I would hold off on this PR. I see some redundancies in converting back and forth from sparse distance matrices that I'd like to address first. This should eliminate the need for building the sparse distance graph here

@adamgayoso
Copy link
Member

@jan-engelmann feel free to give a review to #129

@adamgayoso
Copy link
Member

Closed due to #129

@jan-engelmann I will add you as an author on the PR

@adamgayoso adamgayoso closed this Dec 28, 2023
@jan-engelmann
Copy link
Contributor Author

But cool work @jan-engelmann !

thanks @Zethson :)

@jan-engelmann
Copy link
Contributor Author

Closed due to #129

@jan-engelmann I will add you as an author on the PR

Thanks @adamgayoso I appreciate it! I like the refactoring. Very clean! 👍
Left a review on #129 but did not request any changes.

@jan-engelmann
Copy link
Contributor Author

Also thanks for the review! @adamgayoso
Will keep these things in mind for the next time

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants