
RFC: In-memory sparse array interchange #840

Open
@hameerabbasi


Motivation

The sparse (also called PyData/Sparse) development team has been working on integration with the wider ecosystem, most notably SciPy and scikit-learn, with CuPy, PyTorch, JAX and TensorFlow also on the radar. One challenge we face is the lack of (possibly zero-copy) interchange between the different sparse array implementations. We believe this may be a pain point for many sparse array implementations going forward.

This mirrors an issue seen previously for dense arrays, where the DLPack protocol was one of the first things to be standardised. We're hoping to reach community consensus on the analogous problem for sparse arrays.

Luckily, all sparse array formats (with the possible exception of DOK) are usually collections of dense arrays underneath. In addition, this problem has been solved for on-disk arrays before by the binsparse specification. @willow-ahrens is a co-author of that spec, and is also a collaborator for the sparse work.

Proposal

We propose introducing two new methods on array-API-compliant sparse array objects (such as those in sparse), which are described below.

__binsparse_descriptor__

The returned item is a dict equivalent to a parsed JSON binsparse descriptor of an array.
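
As a rough illustration of what such a dict could look like, here is a minimal sketch for a 3x4 CSR matrix with 5 stored values. The key names (`version`, `format`, `shape`, `number_of_stored_values`, `data_types`, `pointers_to_1`, `indices_1`, `values`) follow the binsparse JSON layout as we understand it; treat the exact keys and the version string as assumptions and check them against the binsparse specification.

```python
import json

# Hypothetical descriptor for a 3x4 CSR matrix with 5 stored values.
# Key names and the version string are assumptions based on the
# binsparse JSON layout, not something this RFC defines.
csr_descriptor = {
    "binsparse": {
        "version": "0.1",
        "format": "CSR",
        "shape": [3, 4],
        "number_of_stored_values": 5,
        "data_types": {
            "pointers_to_1": "uint64",
            "indices_1": "uint64",
            "values": "float64",
        },
    }
}

# "Equivalent to a parsed JSON descriptor" implies that the dict
# survives a JSON round trip unchanged.
round_tripped = json.loads(json.dumps(csr_descriptor))
```

Only JSON-representable values (strings, numbers, lists, dicts) appear in the dict, which is what makes the round trip lossless.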

__binsparse__

The returned item is a dict[str, Array] of __dlpack__-compatible arrays, which are the constituent arrays of the sparse array. Each key matches the corresponding key in the descriptor.
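
To make the key correspondence concrete, here is a minimal sketch of a consistency check a consumer might run. `ToyCSR` and `check_binsparse_keys` are hypothetical names, the descriptor layout is assumed as in the binsparse JSON format, and plain Python lists stand in for real `__dlpack__`-capable arrays so the example needs nothing beyond the standard library.

```python
from typing import Any, Mapping


def check_binsparse_keys(descriptor: Mapping[str, Any], arrays: Mapping[str, Any]) -> None:
    """Raise ValueError if the constituent arrays do not match the descriptor."""
    expected = set(descriptor["binsparse"]["data_types"])
    actual = set(arrays)
    if expected != actual:
        missing = sorted(expected - actual)
        extra = sorted(actual - expected)
        raise ValueError(f"key mismatch: missing={missing}, unexpected={extra}")


class ToyCSR:
    """Hypothetical producer of a 2x3 CSR matrix [[1, 0, 2], [0, 3, 0]].

    Plain lists stand in for __dlpack__-capable arrays (an assumption
    made only to keep the sketch dependency-free).
    """

    def __binsparse_descriptor__(self) -> dict:
        return {
            "binsparse": {
                "version": "0.1",
                "format": "CSR",
                "shape": [2, 3],
                "number_of_stored_values": 3,
                "data_types": {
                    "pointers_to_1": "int64",
                    "indices_1": "int64",
                    "values": "float64",
                },
            }
        }

    def __binsparse__(self) -> dict:
        return {
            "pointers_to_1": [0, 2, 3],
            "indices_1": [0, 2, 1],
            "values": [1.0, 2.0, 3.0],
        }


x = ToyCSR()
# Passes silently because both dicts use the same three keys.
check_binsparse_keys(x.__binsparse_descriptor__(), x.__binsparse__())
```

In a real implementation the values would be arrays exposing `__dlpack__`, and the consumer would convert each one with its own `from_dlpack`.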

Introduction of from_binsparse function.

If a library supports sparse arrays, its from_binsparse function should accept objects that follow this __binsparse__ protocol and have an equivalent sparse format within the library, converting them zero-copy when possible.

Pseudocode implementation

Here's a pseudocode example using two libraries, xp1 and xp2, both supporting sparse arrays:

# In library code:
xp2_sparray = xp2.from_binsparse(xp1_sparray, ...)

# This pseudocode implementation is common to both `xp1` and `xp2`
def from_binsparse(x: object, /, *, device: device | None = None, copy: bool | None = None) -> array:
    binsparse_descr = getattr(x, "__binsparse_descriptor__", None)
    binsparse_impl = getattr(x, "__binsparse__", None)
    if binsparse_impl is None or binsparse_descr is None:
        raise TypeError(...)
    
    binsparse_descriptor = binsparse_descr()
    # Will raise an error if the format/descriptor is unsupported.
    sparse_type = _type_from_binsparse_descriptor(binsparse_descriptor)
    constituent_arrays = binsparse_impl()
    my_constituent_arrays = {
        k: from_dlpack(arr, device=device, copy=copy) for k, arr in constituent_arrays.items()
    }
    return sparse_type.from_strided_arrays(my_constituent_arrays, shape=...)

Parallel implementation in sparse: pydata/sparse#764
Parallel implementation in SciPy: scipy/scipy#22553
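
The `_type_from_binsparse_descriptor` helper in the pseudocode above is left unspecified. One way it could work is a simple lookup from the descriptor's format name to the library's own sparse type, sketched below. The registry contents, the function name `type_name_from_descriptor`, the choice of `ValueError`, and the descriptor layout are all assumptions made for illustration; a real library would return a class rather than a string.

```python
from typing import Any, Mapping

# Hypothetical registry: binsparse format name -> name of the
# consuming library's own sparse array type.
_SUPPORTED_FORMATS = {
    "CSR": "csr_array",
    "CSC": "csc_array",
    "COO": "coo_array",
}


def type_name_from_descriptor(descriptor: Mapping[str, Any]) -> str:
    """Return the native type name for a descriptor, or raise ValueError."""
    fmt = descriptor["binsparse"]["format"]
    # Anything other than a known plain-string format name is rejected,
    # which also covers formats described by a nested object.
    if not isinstance(fmt, str) or fmt not in _SUPPORTED_FORMATS:
        raise ValueError(f"unsupported binsparse format: {fmt!r}")
    return _SUPPORTED_FORMATS[fmt]


native_name = type_name_from_descriptor({"binsparse": {"format": "CSR"}})
```

Failing early here, before any constituent array is converted, means an unsupported format never triggers a copy or a device transfer.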

Alternative solutions

There are formats for on-disk sparse-array interchange [1] [2], but none for in-memory interchange. binsparse is the one that comes closest to offering in-memory interchange.

Pinging possibly interested parties:

Updated on 2024.10.09 as agreed in #840 (comment).
