Skip to content

Add CSD #125

Open
Open
@hameerabbasi

Description

@hameerabbasi

CSD stands for Compressed Sparse Dimensions. It is a multidimensional generalization of CSR, CSC and COO.

CSD has a few "compressed" dimensions and a few "uncompressed" dimensions. The "compressed" dimensions are linearized and go into indptr similar to CSR/CSC. The "uncompressed" dimensions (corresponding to indices in CSR/CSC, but there can be multiple) will be stored in coords.

The plan is to:

  • Write CSD.
  • Test it.
  • Make COO and others inherit from it.

Task checklist:

Implement CSD (parametrize tests over compressed axes).:

  • Element-wise
  • Reductions
  • Slicing
  • Change compressed dimensions

Post-Implementation (parametrize tests over types):

  • Make COO inherit from CSD
  • Make CSR inherit from CSD
  • Make CSC inherit from CSD

Compressed Sparse Dimensions — Format Specification

Compressed Sparse Dimensions (CSD) is an multidimensional sparse array storage format.

Let’s say that the number of dimensions is ndim and compressed_axes represents the axes/dimensions that are compressed, as a tuple.

It contains three main arrays, one of them two-dimensional:

  • data (one dimensional, (nnz,))
  • coords (two-dimensional, (ndim - len(compressed_axes), nnz))
  • indptr (one dimensional (reduce(operator.mul, (shape[a] for a in compressed_axes), 1) + 1,))

data/coords/indptr from an ndarray

Relating to a normal NumPy array, we can specify data and coords directly. Let’s say arr is the Numpy array to convert.

non_compressed_axes = tuple(a for a in range(ndim) if a not in set(compressed_axis))
coo_coords = np.where(arr)
data = arr[coo_coords]
axes_order = compressed_axes + non_compressed_axes
if axes_order != tuple(range(ndim)):
    lin_coo_coords = linear_loc(coo_coords[np.asarray(axes_order)], tuple(self.shape[a] for a in axes_order))
    idx = np.argsort(lin_coo_coords)
    data = data[idx]
    coords = coords[:, idx]

coords = coo_coords[np.asarray(non_compressed_axes)].astype(np.min_scalar_type(max(non_compressed_axes)))
compressed_linear_shape = reduce(operator.mul, (shape[a] for a in compressed_axes), 1)
compressed_linearized_coords = linear_loc(coo_coords[np.array(compressed_axes)], compressed_axes)
indptr = np.empty(compressed_linear_shape + 1, dtype=np.min_scalar_type(compressed_linear_shape))
indptr[0] = 0
np.cumsum(np.bincount(compressed_linearized_coords, minlength=compressed_linear_shape), out=indptr[1:])

Interpreting the arrays

For CSR, the interpretation is simple: indices represents the columns, data the respective data. indptr[i]:indptr[i+1] represents the slice corresponding to row i for which columns and data are valid.

We can extend this definition to CSD as well. coords represent the non-compressed axes, and data the respective data. indptr can be interpreted the same way, replacing the word “row” with “linearized compressed axes index”, but it turns out that there is a simpler way to work with things.

Alternative representation for indptr

Consider the following two arrays produced from indptr:

starts = indptr[:-1].reshape((shape[a] for a in compressed_axes))
stops indptr[1:].reshape((shape[a] for a in compressed_axes))

Note that these are views, no copies are made. This gives a simpler and more intuitive interpretation for indptr. For every possible index corresponding to the compressed axes: Indexing into starts gives the starts for that index and stops gives the stops.

The “multiple-COO” view

For every value in starts/stops, coords[:, start_val:stop_val] and data[start_val:stop_val] can be seen as a mini-COO array, and can thus have any COO-related algorithms applied to it.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions