Skip to content

Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution #3532

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 35 commits into
base: main
Choose a base branch
from

Conversation

joe-jhou2
Copy link

@joe-jhou2 joe-jhou2 commented Mar 28, 2025

Description

Added cluster_resolution_finder and cluster_decision_tree as new tools and plotting functions, respectively, with corresponding tests. These functions enable hierarchical clustering analysis across multiple resolutions and visualization of cluster relationships as a decision tree.

This feature requires the igraph library for graph-based plotting in cluster_decision_tree and for Leiden clustering in cluster_resolution_finder. igraph is included in the leiden extra, which can be installed with:

pip install scanpy[leiden]
  • Release notes not necessary because:

@joe-jhou2 joe-jhou2 marked this pull request as draft March 28, 2025 05:10
@flying-sheep
Copy link
Member

Thanks, this looks nice! Do you need help with the tests?

@joe-jhou2 joe-jhou2 marked this pull request as ready for review March 28, 2025 18:47
@Zethson
Copy link
Member

Zethson commented Apr 11, 2025

How does this implementation compare to https://github.com/complextissue/pyclustree ?

@joe-jhou2
Copy link
Author

That looks good! I hadn't come across it when I created this feature. Interestingly, we were both inspired by the R package clustree. It seems my function includes more features, most notably the ability to identify the genes driving the cluster splits in the tree. This significantly enhances the usefulness and interpretability of the splitting tree.

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I’ll take a last look once the image test runs, but until then, some comments

Comment on lines 153 to 160
@pytest.fixture
def adata_for_test():
"""Fixture to provide a preprocessed AnnData object for testing."""
import scanpy as sc

adata = sc.datasets.pbmc68k_reduced()
sc.pp.neighbors(adata)
return adata
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move this into tests/test_cluster_resolution.py, and use the cached version:

from testing.scanpy._helpers.data import pbmc68k_reduced

instead of sc.datasets.pbmc68k_reduced.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to tests/test_cluster_resolution.py, and use the cached version.


# # Test 0: Image comparison
# @pytest.mark.mpl_image_compare
# def test_cluster_decision_tree_plot(adata_with_clusters, image_comparer):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make this one work, then I’ll do a proper review!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and test is pass.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scanpy/pyproject.toml

Lines 95 to 98 in bdcef41

test = [
"scanpy[test-min]",
# Optional but important dependencies
"scanpy[leiden]",

the leidenalg package is already part of the test extra, which means it will be installed in all test jobs except for the test-min one. This should stay that way.

please revert your changes to this workflow and instead add a marker to all tests that need leidenalg to run, using this helper:

from testing.scanpy._pytest.marks import needs

There are two ways to do this:

  1. if not all your test files’ tests need it, decorate the tests that do with @needs.leidenalg
  2. if all tests in the file you added need it, add pytestmark = [needs.leidenalg] to the top of your test file (below the imports)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add pytestmark = [needs.leidenalg] on the top for test file.

@@ -19,4 +19,5 @@
"python.testing.pytestArgs": ["-vv", "--color=yes"],
"python.testing.pytestEnabled": true,
"python.terminal.activateEnvironment": true,
"git.ignoreLimitWarning": true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what’s that for?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this.

from typing import Literal

from anndata import AnnData

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please mark the export. Also please rename the function to imperative mood (find_ instead of _finder. A …Finder would be the name for a class!)

Suggested change
__all__ = ["find_cluster_resolutions"]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to 'find_cluster_resoltuon'

Comment on lines 42 to 43
Dict mapping (parent_node, child_node) to top marker genes.
E.g., {("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Dict mapping (parent_node, child_node) to top marker genes.
E.g., {("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}
Dict mapping (parent_node, child_node) to top marker genes.
E.g., `{("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}`

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix this format issue

Comment on lines 31 to 38
Args:
adata: AnnData object with clustering in obs.
resolutions: List of resolution values (e.g., [0.0, 0.2, 0.5]).
prefix: Prefix for clustering columns in adata.obs (default: "leiden_res_").
method: Method for DEG analysis (default: "wilcoxon").
n_top_genes: Number of top genes per child node (default: 3).
min_cells: Minimum cells required in a subcluster (default: 2).
deg_mode: "within_parent" or "per_resolution" (default: "within_parent").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Args:
adata: AnnData object with clustering in obs.
resolutions: List of resolution values (e.g., [0.0, 0.2, 0.5]).
prefix: Prefix for clustering columns in adata.obs (default: "leiden_res_").
method: Method for DEG analysis (default: "wilcoxon").
n_top_genes: Number of top genes per child node (default: 3).
min_cells: Minimum cells required in a subcluster (default: 2).
deg_mode: "within_parent" or "per_resolution" (default: "within_parent").
Parameters
----------
adata
AnnData object with clustering in obs.
resolutions
List of resolution values (e.g., `[0.0, 0.2, 0.5]`).
prefix
Prefix for clustering columns in :attr:`~anndata.AnnData.obs`
method
Method for DEG analysis
n_top_genes
Number of top genes per child node
min_cells
Minimum cells required in a subcluster
deg_mode
See above

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reformat


Raises
------
ValueError: If deg_mode is invalid or input data is malformed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ValueError: If deg_mode is invalid or input data is malformed.
ValueError
If deg_mode is invalid or input data is malformed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reformat

@maltekuehl
Copy link

Hi @joe-jhou2, author of pyclustree here. Thanks for highlighting some of the differences in feature sets between the implementations. Whether one is “more” feature-rich than the other really depends on the intended use case and would not be the language I’d lean on, but I agree that the visualization of top genes differing between child clusters is a distinction and I am sure that it can potentially offer insights while saving an explicit call to rank_genes_groups for comparisons of interest if the focus is on the top differing genes.
Regarding this PR, I noticed that the rank_genes_groups calls don’t currently expose user-configurable parameters. Allowing **kwargs to be passed through might be helpful for users who want to align this with other analyses, e.g., choosing different test statistics or multiple testing corrections.

@joe-jhou2
Copy link
Author

Hi @joe-jhou2, author of pyclustree here. Thanks for highlighting some of the differences in feature sets between the implementations. Whether one is “more” feature-rich than the other really depends on the intended use case and would not be the language I’d lean on, but I agree that the visualization of top genes differing between child clusters is a distinction and I am sure that it can potentially offer insights while saving an explicit call to rank_genes_groups for comparisons of interest if the focus is on the top differing genes. Regarding this PR, I noticed that the rank_genes_groups calls don’t currently expose user-configurable parameters. Allowing **kwargs to be passed through might be helpful for users who want to align this with other analyses, e.g., choosing different test statistics or multiple testing corrections.

Hi @maltekuehl, It was a bit of virtual handshaking — I agree, there’s really no direct comparison in terms of which implementation has more features. We just happened to converge on the same idea, fortunately.

My initial idea/intention was very straightforward, aimed at providing a quick and intuitive visualization to help answer questions from bench scientists, like why we chose resolution A over B.

Your package is much more sophisticated and thoughtfully designed. I really appreciate your suggestion — I’ll make sure to include **kwargs.

@flying-sheep
Copy link
Member

flying-sheep commented Apr 13, 2025

OK, so how about you two collaborate and make the package the best they can be (i.e. implement features this PR has there), and we advertise its use in scanpy tutorials?

That way I don’t have to maintain it, but more people get to see it?

@maltekuehl
Copy link

maltekuehl commented Apr 14, 2025

From the pyclustree side, we would definitely be open to that! For some context, pyclustree is part of a broader effort we're working on, but we decided to release this component early since we thought it could already be helpful to others.
One important difference is that the implementation in this PR uses igraph (with networkx as a fallback), while in pyclustree we've intentionally kept dependencies light by only using networkx for layout. This does come at the cost of some of the visual niceties, like the Bezier curves shown in this PR, which we agree look great (and if scanpy is going to depend on igraph in a future release for Leiden clustering, it might not matter, though networkx potentially also offers interactivity via holoviews which we are currently evaluating).
That said, the cluster_resolution_finder and find_cluster_specific_genes functions would be fantastic additions to the plotting tools in pyclustree, and it would be great to incorporate them.
We’re completely aligned on the idea that having a single, well-maintained reference for clustering tree functionality within the scverse ecosystem would be best for the community. We're definitely open to finding common ground on the implementation side and happy to help maintain a unified solution in the long run.

EDIT: @joe-jhou2, if you would like to connect (e.g., schedule a meeting), feel free to reach out via mail to malte [dot] kuehl [at] clin [dot] au [dot] dk. We are on Central European time.

@flying-sheep
Copy link
Member

if scanpy is going to depend on igraph in a future release for Leiden clustering, it might not matter

FYI: We won’t ever hard-depend on GPL libraries, no matter if we recommend them. If we did, scanpy could effectively not be used under any license terms other than the GPL.

networkx potentially also offers interactivity via holoviews which we are currently evaluating

I’m almost completely sure that scanpy will switch to holoviews in 2025, it offers too much for us to stick with matplotlib in the long term.

Some tools that can’t be supported by holoviews for now might stay in matplotlib for the time being though.

We're definitely open to finding common ground on the implementation side and happy to help maintain a unified solution in the long run.

that’d be lovely!

@joe-jhou2
Copy link
Author

OK, so how about you two collaborate and make the package the best they can be (i.e. implement features this PR has there), and we advertise its use in scanpy tutorials?

That way I don’t have to maintain it, but more people get to see it?

Thank you both for recognizing my contribution and showing interest in the visualization approach. I believe the most straightforward way to contribute to pyclustree would be to include find_cluster_specific_genes—without modifying the core structure of pytree—as it nicely complements the cluster tree splitting logic.

That said, I’ll be stepping away for the next six months due to a major life event and won’t be able to commit to further development during that time. For that reason, I’d really appreciate getting this PR merged so I can wrap up the task before my break.

@maltekuehl -- I’d love to reach out to you later for a meeting call to explore the possibility of collaborating, or at the very least, see if find_cluster_specific_genes could be picked up and adapted into the package if it fits.

@joe-jhou2
Copy link
Author

@flying-sheep I refactor the 'cluster_decision_tree' function into class. but I'm in the middle of a tightrope walk, I need import networkx, ruff doesn't allow it into type checking block, but 'deferred import test' won't allow me import that earlier. Can you advise how to deal with dilemma? Thanks.

@flying-sheep
Copy link
Member

sure! we need to import it both in all functions using it and in the TYPE_CHECKING block.

I did that for you: 44cc72f

@joe-jhou2
Copy link
Author

sure! we need to import it both in all functions using it and in the TYPE_CHECKING block.

I did that for you: 44cc72f

It passes almost all checks, but fail milestone label check. Can you advice that? Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enhance Clustering Resolution Selection with Hierarchical Tree Visualization
4 participants