Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution #3532

joe-jhou2 · 2025-03-28T05:02:22Z

Description

Added cluster_resolution_finder and cluster_decision_tree as new tools and plotting functions, respectively, with corresponding tests. These functions enable hierarchical clustering analysis across multiple resolutions and visualization of cluster relationships as a decision tree.

This feature requires the igraph library for graph-based plotting in cluster_decision_tree and for Leiden clustering in cluster_resolution_finder. igraph is included in the leiden extra, which can be installed with:

pip install scanpy[leiden]

Closes Enhance Clustering Resolution Selection with Hierarchical Tree Visualization #3533
Tests included or not required because:

Release notes not necessary because:

…llow Scanpy conventions

flying-sheep · 2025-03-28T09:47:47Z

Thanks, this looks nice! Do you need help with the tests?

Zethson · 2025-04-11T14:15:51Z

How does this implementation compare to https://github.com/complextissue/pyclustree ?

joe-jhou2 · 2025-04-11T14:35:33Z

That looks good! I hadn't come across it when I created this feature. Interestingly, we were both inspired by the R package clustree. It seems my function includes more features, most notably the ability to identify the genes driving the cluster splits in the tree. This significantly enhances the usefulness and interpretability of the splitting tree.

flying-sheep

Looks good! I’ll take a last look once the image test runs, but until then, some comments

flying-sheep · 2025-04-11T15:38:16Z

tests/conftest.py

+@pytest.fixture
+def adata_for_test():
+    """Fixture to provide a preprocessed AnnData object for testing."""
+    import scanpy as sc
+
+    adata = sc.datasets.pbmc68k_reduced()
+    sc.pp.neighbors(adata)
+    return adata


Please move this into tests/test_cluster_resolution.py, and use the cached version:

from testing.scanpy._helpers.data import pbmc68k_reduced

instead of sc.datasets.pbmc68k_reduced.

move to tests/test_cluster_resolution.py, and use the cached version.

flying-sheep · 2025-04-11T15:39:53Z

tests/test_cluster_tree.py

+
+# # Test 0: Image comparison
+# @pytest.mark.mpl_image_compare
+# def test_cluster_decision_tree_plot(adata_with_clusters, image_comparer):


please make this one work, then I’ll do a proper review!

Yes, and test is pass.

flying-sheep · 2025-04-11T15:44:35Z

.github/workflows/ci.yml

scanpy/pyproject.toml

Lines 95 to 98 in bdcef41

test = [

"scanpy[test-min]",

# Optional but important dependencies

"scanpy[leiden]",

the leidenalg package is already part of the test extra, which means it will be installed in all test jobs except for the test-min one. This should stay that way.

please revert your changes to this workflow and instead add a marker to all tests that need leidenalg to run, using this helper:

from testing.scanpy._pytest.marks import needs

There are two ways to do this:

if not all your test files’ tests need it, decorate the tests that do with @needs.leidenalg

if all tests in the file you added need it, add pytestmark = [needs.leidenalg] to the top of your test file (below the imports)

add pytestmark = [needs.leidenalg] on the top for test file.

flying-sheep · 2025-04-11T15:47:14Z

.vscode/settings.json

@@ -19,4 +19,5 @@
    "python.testing.pytestArgs": ["-vv", "--color=yes"],
    "python.testing.pytestEnabled": true,
    "python.terminal.activateEnvironment": true,
+    "git.ignoreLimitWarning": true,


what’s that for?

remove this.

flying-sheep · 2025-04-11T15:49:58Z

src/scanpy/tools/_cluster_resolution.py

+    from typing import Literal
+
+    from anndata import AnnData
+


Please mark the export. Also please rename the function to imperative mood (find_ instead of _finder. A …Finder would be the name for a class!)

Suggested change

__all__ = ["find_cluster_resolutions"]

rename to 'find_cluster_resoltuon'

flying-sheep · 2025-04-11T15:52:24Z

src/scanpy/tools/_cluster_resolution.py

+        Dict mapping (parent_node, child_node) to top marker genes.
+        E.g., {("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}


Suggested change

Dict mapping (parent_node, child_node) to top marker genes.

E.g., {("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}

Dict mapping (parent_node, child_node) to top marker genes.

E.g., `{("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}`

fix this format issue

flying-sheep · 2025-04-11T15:54:19Z

src/scanpy/tools/_cluster_resolution.py

+    Args:
+        adata: AnnData object with clustering in obs.
+        resolutions: List of resolution values (e.g., [0.0, 0.2, 0.5]).
+        prefix: Prefix for clustering columns in adata.obs (default: "leiden_res_").
+        method: Method for DEG analysis (default: "wilcoxon").
+        n_top_genes: Number of top genes per child node (default: 3).
+        min_cells: Minimum cells required in a subcluster (default: 2).
+        deg_mode: "within_parent" or "per_resolution" (default: "within_parent").


Suggested change

Args:

adata: AnnData object with clustering in obs.

resolutions: List of resolution values (e.g., [0.0, 0.2, 0.5]).

prefix: Prefix for clustering columns in adata.obs (default: "leiden_res_").

method: Method for DEG analysis (default: "wilcoxon").

n_top_genes: Number of top genes per child node (default: 3).

min_cells: Minimum cells required in a subcluster (default: 2).

deg_mode: "within_parent" or "per_resolution" (default: "within_parent").

Parameters

----------

adata

AnnData object with clustering in obs.

resolutions

List of resolution values (e.g., `[0.0, 0.2, 0.5]`).

prefix

Prefix for clustering columns in :attr:`~anndata.AnnData.obs`

method

Method for DEG analysis

n_top_genes

Number of top genes per child node

min_cells

Minimum cells required in a subcluster

deg_mode

See above

flying-sheep · 2025-04-11T15:54:52Z

src/scanpy/tools/_cluster_resolution.py

+
+    Raises
+    ------
+        ValueError: If deg_mode is invalid or input data is malformed.


Suggested change

ValueError: If deg_mode is invalid or input data is malformed.

ValueError

If deg_mode is invalid or input data is malformed.

maltekuehl · 2025-04-11T16:22:59Z

Hi @joe-jhou2, author of pyclustree here. Thanks for highlighting some of the differences in feature sets between the implementations. Whether one is “more” feature-rich than the other really depends on the intended use case and would not be the language I’d lean on, but I agree that the visualization of top genes differing between child clusters is a distinction and I am sure that it can potentially offer insights while saving an explicit call to rank_genes_groups for comparisons of interest if the focus is on the top differing genes.
Regarding this PR, I noticed that the rank_genes_groups calls don’t currently expose user-configurable parameters. Allowing **kwargs to be passed through might be helpful for users who want to align this with other analyses, e.g., choosing different test statistics or multiple testing corrections.

for more information, see https://pre-commit.ci

joe-jhou2 · 2025-04-11T17:11:01Z

Hi @joe-jhou2, author of pyclustree here. Thanks for highlighting some of the differences in feature sets between the implementations. Whether one is “more” feature-rich than the other really depends on the intended use case and would not be the language I’d lean on, but I agree that the visualization of top genes differing between child clusters is a distinction and I am sure that it can potentially offer insights while saving an explicit call to rank_genes_groups for comparisons of interest if the focus is on the top differing genes. Regarding this PR, I noticed that the rank_genes_groups calls don’t currently expose user-configurable parameters. Allowing **kwargs to be passed through might be helpful for users who want to align this with other analyses, e.g., choosing different test statistics or multiple testing corrections.

Hi @maltekuehl, It was a bit of virtual handshaking — I agree, there’s really no direct comparison in terms of which implementation has more features. We just happened to converge on the same idea, fortunately.

My initial idea/intention was very straightforward, aimed at providing a quick and intuitive visualization to help answer questions from bench scientists, like why we chose resolution A over B.

Your package is much more sophisticated and thoughtfully designed. I really appreciate your suggestion — I’ll make sure to include **kwargs.

flying-sheep · 2025-04-13T09:33:59Z

OK, so how about you two collaborate and make the package the best they can be (i.e. implement features this PR has there), and we advertise its use in scanpy tutorials?

That way I don’t have to maintain it, but more people get to see it?

maltekuehl · 2025-04-14T09:06:11Z

From the pyclustree side, we would definitely be open to that! For some context, pyclustree is part of a broader effort we're working on, but we decided to release this component early since we thought it could already be helpful to others.
One important difference is that the implementation in this PR uses igraph (with networkx as a fallback), while in pyclustree we've intentionally kept dependencies light by only using networkx for layout. This does come at the cost of some of the visual niceties, like the Bezier curves shown in this PR, which we agree look great (and if scanpy is going to depend on igraph in a future release for Leiden clustering, it might not matter, though networkx potentially also offers interactivity via holoviews which we are currently evaluating).
That said, the cluster_resolution_finder and find_cluster_specific_genes functions would be fantastic additions to the plotting tools in pyclustree, and it would be great to incorporate them.
We’re completely aligned on the idea that having a single, well-maintained reference for clustering tree functionality within the scverse ecosystem would be best for the community. We're definitely open to finding common ground on the implementation side and happy to help maintain a unified solution in the long run.

EDIT: @joe-jhou2, if you would like to connect (e.g., schedule a meeting), feel free to reach out via mail to malte [dot] kuehl [at] clin [dot] au [dot] dk. We are on Central European time.

flying-sheep · 2025-04-14T12:51:43Z

if scanpy is going to depend on igraph in a future release for Leiden clustering, it might not matter

FYI: We won’t ever hard-depend on GPL libraries, no matter if we recommend them. If we did, scanpy could effectively not be used under any license terms other than the GPL.

networkx potentially also offers interactivity via holoviews which we are currently evaluating

I’m almost completely sure that scanpy will switch to holoviews in 2025, it offers too much for us to stick with matplotlib in the long term.

Some tools that can’t be supported by holoviews for now might stay in matplotlib for the time being though.

We're definitely open to finding common ground on the implementation side and happy to help maintain a unified solution in the long run.

that’d be lovely!

for more information, see https://pre-commit.ci

joe-jhou2 · 2025-04-15T22:25:04Z

OK, so how about you two collaborate and make the package the best they can be (i.e. implement features this PR has there), and we advertise its use in scanpy tutorials?

That way I don’t have to maintain it, but more people get to see it?

Thank you both for recognizing my contribution and showing interest in the visualization approach. I believe the most straightforward way to contribute to pyclustree would be to include find_cluster_specific_genes—without modifying the core structure of pytree—as it nicely complements the cluster tree splitting logic.

That said, I’ll be stepping away for the next six months due to a major life event and won’t be able to commit to further development during that time. For that reason, I’d really appreciate getting this PR merged so I can wrap up the task before my break.

@maltekuehl -- I’d love to reach out to you later for a meeting call to explore the possibility of collaborating, or at the very least, see if find_cluster_specific_genes could be picked up and adapted into the package if it fits.

…e-jhou1/scanpy into feature-cluster-decision-tree

joe-jhou2 · 2025-04-17T22:30:42Z

@flying-sheep I refactor the 'cluster_decision_tree' function into class. but I'm in the middle of a tightrope walk, I need import networkx, ruff doesn't allow it into type checking block, but 'deferred import test' won't allow me import that earlier. Can you advise how to deal with dilemma? Thanks.

flying-sheep · 2025-04-22T12:21:20Z

sure! we need to import it both in all functions using it and in the TYPE_CHECKING block.

I did that for you: 44cc72f

for more information, see https://pre-commit.ci

joe-jhou2 · 2025-04-22T16:57:22Z

sure! we need to import it both in all functions using it and in the TYPE_CHECKING block.

I did that for you: 44cc72f

It passes almost all checks, but fail milestone label check. Can you advice that? Thanks.

joe-jhou2 added 5 commits March 27, 2025 11:58

Add cluster_decision_tree and cluster_resolution_finder with tests

b37477c

Fix formatting issue

88306f6

igraph install

4cd3134

formatting fix

0b2f126

Restructure cluster_resolution_finder and cluster_decision_tree to fo…

6126013

…llow Scanpy conventions

joe-jhou2 marked this pull request as draft March 28, 2025 05:10

Register mpl_image_compare marker in pyproject.toml for pytest-mpl

12cf89d

joe-jhou2 mentioned this pull request Mar 28, 2025

Enhance Clustering Resolution Selection with Hierarchical Tree Visualization #3533

Open

joe-jhou2 added 4 commits March 28, 2025 10:54

Merge branch 'main' into feature-cluster-decision-tree

25569b2

comment out one test

1aadcc4

suppress print statements during testing

69b16c7

suppress print statements during testing

70dea35

joe-jhou2 marked this pull request as ready for review March 28, 2025 18:47

joe-jhou2 added 4 commits April 1, 2025 20:33

Merge branch 'main' into feature-cluster-decision-tree

8e25dfc

Fix the check-milestone workflow

c74cd1a

Fix the check-milestone workflow

a91e9f1

restore check-milestone workflow

bc5dd9f

flying-sheep requested changes Apr 11, 2025

View reviewed changes

flying-sheep and others added 2 commits April 11, 2025 18:34

Merge branch 'main' into feature-cluster-decision-tree

f595f5d

[pre-commit.ci] auto fixes from pre-commit.com hooks

e422ad1

for more information, see https://pre-commit.ci

joe-jhou2 and others added 2 commits April 15, 2025 09:02

WIP: save changes before pulling

5932717

[pre-commit.ci] auto fixes from pre-commit.com hooks

d6d4e72

for more information, see https://pre-commit.ci

joe-jhou2 added 3 commits April 15, 2025 09:53

fix ruff error

ce51530

fix ruff error

c084342

Merge branch 'main' into feature-cluster-decision-tree

9d6dd21

joe-jhou2 added 5 commits April 17, 2025 13:49

refactror main function to class

6e19de7

Merge branch 'feature-cluster-decision-tree' of https://github.com/jo…

3eb806a

…e-jhou1/scanpy into feature-cluster-decision-tree

Merge branch 'main' into feature-cluster-decision-tree

da8f6f4

refactror main function to class

ac54dfd

refactror main function to class

0cffc68

Defer networkx import

44cc72f

joe-jhou2 and others added 8 commits April 22, 2025 08:01

Merge branch 'main' into feature-cluster-decision-tree

34d9f1c

Skip TC004 warning for now

e171c8c

Skip TC004 warning for now

a0ff151

[pre-commit.ci] auto fixes from pre-commit.com hooks

202280e

for more information, see https://pre-commit.ci

Skip TC004 warning for now

00e10c8

[pre-commit.ci] auto fixes from pre-commit.com hooks

2d60321

for more information, see https://pre-commit.ci

modify seaborn import

01a1e87

modify seaborn import

20bf0a8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution #3532

Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution #3532

joe-jhou2 commented Mar 28, 2025 •

edited

Loading

flying-sheep commented Mar 28, 2025

Zethson commented Apr 11, 2025

joe-jhou2 commented Apr 11, 2025

flying-sheep left a comment

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

flying-sheep Apr 11, 2025

joe-jhou2 Apr 15, 2025

maltekuehl commented Apr 11, 2025

joe-jhou2 commented Apr 11, 2025

flying-sheep commented Apr 13, 2025 •

edited

Loading

maltekuehl commented Apr 14, 2025 •

edited

Loading

flying-sheep commented Apr 14, 2025

joe-jhou2 commented Apr 15, 2025

joe-jhou2 commented Apr 17, 2025

flying-sheep commented Apr 22, 2025

joe-jhou2 commented Apr 22, 2025

	test = [
	"scanpy[test-min]",
	# Optional but important dependencies
	"scanpy[leiden]",

		Dict mapping (parent_node, child_node) to top marker genes.
		E.g., {("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}

	ValueError: If deg_mode is invalid or input data is malformed.
	ValueError
	If deg_mode is invalid or input data is malformed.

Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution #3532

Are you sure you want to change the base?

Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution #3532

Conversation

joe-jhou2 commented Mar 28, 2025 • edited Loading

Description

flying-sheep commented Mar 28, 2025

Zethson commented Apr 11, 2025

joe-jhou2 commented Apr 11, 2025

flying-sheep left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

maltekuehl commented Apr 11, 2025

joe-jhou2 commented Apr 11, 2025

flying-sheep commented Apr 13, 2025 • edited Loading

maltekuehl commented Apr 14, 2025 • edited Loading

flying-sheep commented Apr 14, 2025

joe-jhou2 commented Apr 15, 2025

joe-jhou2 commented Apr 17, 2025

flying-sheep commented Apr 22, 2025

joe-jhou2 commented Apr 22, 2025

joe-jhou2 commented Mar 28, 2025 •

edited

Loading

flying-sheep commented Apr 13, 2025 •

edited

Loading

maltekuehl commented Apr 14, 2025 •

edited

Loading