-
Notifications
You must be signed in to change notification settings - Fork 626
Add cluster_resolution_finder and cluster_decision_tree for help deciding the proper resolution #3532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thanks, this looks nice! Do you need help with the tests? |
How does this implementation compare to https://github.com/complextissue/pyclustree ? |
That looks good! I hadn't come across it when I created this feature. Interestingly, we were both inspired by the R package clustree. It seems my function includes more features, most notably the ability to identify the genes driving the cluster splits in the tree. This significantly enhances the usefulness and interpretability of the splitting tree. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I’ll take a last look once the image test runs, but until then, some comments
tests/conftest.py
Outdated
@pytest.fixture | ||
def adata_for_test(): | ||
"""Fixture to provide a preprocessed AnnData object for testing.""" | ||
import scanpy as sc | ||
|
||
adata = sc.datasets.pbmc68k_reduced() | ||
sc.pp.neighbors(adata) | ||
return adata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this into tests/test_cluster_resolution.py
, and use the cached version:
from testing.scanpy._helpers.data import pbmc68k_reduced
instead of sc.datasets.pbmc68k_reduced
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to tests/test_cluster_resolution.py, and use the cached version.
tests/test_cluster_tree.py
Outdated
|
||
# # Test 0: Image comparison | ||
# @pytest.mark.mpl_image_compare | ||
# def test_cluster_decision_tree_plot(adata_with_clusters, image_comparer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please make this one work, then I’ll do a proper review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and test is pass.
.github/workflows/ci.yml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lines 95 to 98 in bdcef41
test = [ | |
"scanpy[test-min]", | |
# Optional but important dependencies | |
"scanpy[leiden]", |
the leidenalg package is already part of the test
extra, which means it will be installed in all test jobs except for the test-min
one. This should stay that way.
please revert your changes to this workflow and instead add a marker to all tests that need leidenalg to run, using this helper:
from testing.scanpy._pytest.marks import needs
There are two ways to do this:
- if not all your test files’ tests need it, decorate the tests that do with
@needs.leidenalg
- if all tests in the file you added need it, add
pytestmark = [needs.leidenalg]
to the top of your test file (below the imports)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add pytestmark = [needs.leidenalg] on the top for test file.
.vscode/settings.json
Outdated
@@ -19,4 +19,5 @@ | |||
"python.testing.pytestArgs": ["-vv", "--color=yes"], | |||
"python.testing.pytestEnabled": true, | |||
"python.terminal.activateEnvironment": true, | |||
"git.ignoreLimitWarning": true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what’s that for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this.
from typing import Literal | ||
|
||
from anndata import AnnData | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please mark the export. Also please rename the function to imperative mood (find_
instead of _finder
. A …Finder
would be the name for a class!)
__all__ = ["find_cluster_resolutions"] | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to 'find_cluster_resoltuon'
Dict mapping (parent_node, child_node) to top marker genes. | ||
E.g., {("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Dict mapping (parent_node, child_node) to top marker genes. | |
E.g., {("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]} | |
Dict mapping (parent_node, child_node) to top marker genes. | |
E.g., `{("res_0.0_C0", "res_0.2_C1"): ["gene1", "gene2", "gene3"]}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fix this format issue
Args: | ||
adata: AnnData object with clustering in obs. | ||
resolutions: List of resolution values (e.g., [0.0, 0.2, 0.5]). | ||
prefix: Prefix for clustering columns in adata.obs (default: "leiden_res_"). | ||
method: Method for DEG analysis (default: "wilcoxon"). | ||
n_top_genes: Number of top genes per child node (default: 3). | ||
min_cells: Minimum cells required in a subcluster (default: 2). | ||
deg_mode: "within_parent" or "per_resolution" (default: "within_parent"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Args: | |
adata: AnnData object with clustering in obs. | |
resolutions: List of resolution values (e.g., [0.0, 0.2, 0.5]). | |
prefix: Prefix for clustering columns in adata.obs (default: "leiden_res_"). | |
method: Method for DEG analysis (default: "wilcoxon"). | |
n_top_genes: Number of top genes per child node (default: 3). | |
min_cells: Minimum cells required in a subcluster (default: 2). | |
deg_mode: "within_parent" or "per_resolution" (default: "within_parent"). | |
Parameters | |
---------- | |
adata | |
AnnData object with clustering in obs. | |
resolutions | |
List of resolution values (e.g., `[0.0, 0.2, 0.5]`). | |
prefix | |
Prefix for clustering columns in :attr:`~anndata.AnnData.obs` | |
method | |
Method for DEG analysis | |
n_top_genes | |
Number of top genes per child node | |
min_cells | |
Minimum cells required in a subcluster | |
deg_mode | |
See above |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reformat
|
||
Raises | ||
------ | ||
ValueError: If deg_mode is invalid or input data is malformed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ValueError: If deg_mode is invalid or input data is malformed. | |
ValueError | |
If deg_mode is invalid or input data is malformed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reformat
Hi @joe-jhou2, author of |
Hi @maltekuehl, It was a bit of virtual handshaking — I agree, there’s really no direct comparison in terms of which implementation has more features. We just happened to converge on the same idea, fortunately. My initial idea/intention was very straightforward, aimed at providing a quick and intuitive visualization to help answer questions from bench scientists, like why we chose resolution A over B. Your package is much more sophisticated and thoughtfully designed. I really appreciate your suggestion — I’ll make sure to include **kwargs. |
OK, so how about you two collaborate and make the package the best they can be (i.e. implement features this PR has there), and we advertise its use in scanpy tutorials? That way I don’t have to maintain it, but more people get to see it? |
From the EDIT: @joe-jhou2, if you would like to connect (e.g., schedule a meeting), feel free to reach out via mail to malte [dot] kuehl [at] clin [dot] au [dot] dk. We are on Central European time. |
FYI: We won’t ever hard-depend on GPL libraries, no matter if we recommend them. If we did, scanpy could effectively not be used under any license terms other than the GPL.
I’m almost completely sure that scanpy will switch to holoviews in 2025, it offers too much for us to stick with matplotlib in the long term. Some tools that can’t be supported by holoviews for now might stay in matplotlib for the time being though.
that’d be lovely! |
for more information, see https://pre-commit.ci
Thank you both for recognizing my contribution and showing interest in the visualization approach. I believe the most straightforward way to contribute to pyclustree would be to include find_cluster_specific_genes—without modifying the core structure of pytree—as it nicely complements the cluster tree splitting logic. That said, I’ll be stepping away for the next six months due to a major life event and won’t be able to commit to further development during that time. For that reason, I’d really appreciate getting this PR merged so I can wrap up the task before my break. @maltekuehl -- I’d love to reach out to you later for a meeting call to explore the possibility of collaborating, or at the very least, see if find_cluster_specific_genes could be picked up and adapted into the package if it fits. |
…e-jhou1/scanpy into feature-cluster-decision-tree
@flying-sheep I refactor the 'cluster_decision_tree' function into class. but I'm in the middle of a tightrope walk, I need import networkx, ruff doesn't allow it into type checking block, but 'deferred import test' won't allow me import that earlier. Can you advise how to deal with dilemma? Thanks. |
sure! we need to import it both in all functions using it and in the TYPE_CHECKING block. I did that for you: 44cc72f |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
It passes almost all checks, but fail milestone label check. Can you advice that? Thanks. |
Description
Added
cluster_resolution_finder
andcluster_decision_tree
as new tools and plotting functions, respectively, with corresponding tests. These functions enable hierarchical clustering analysis across multiple resolutions and visualization of cluster relationships as a decision tree.This feature requires the
igraph
library for graph-based plotting incluster_decision_tree
and for Leiden clustering incluster_resolution_finder
.igraph
is included in theleiden
extra, which can be installed with: