Description
CSD
stands for Compressed Sparse Dimensions. It is a multidimensional generalization of CSR
, CSC
and COO
.
CSD
has a few "compressed" dimensions and a few "uncompressed" dimensions. The "compressed" dimensions are linearized and go into indptr
similar to CSR/CSC. The "uncompressed" dimensions (corresponding to indices
in CSR/CSC, but there can be multiple) will be stored in coords
.
The plan is to:
- Write
CSD
. - Test it.
- Make
COO
and others inherit from it.
Task checklist:
Implement CSD (parametrize tests over compressed axes).:
- Element-wise
- Reductions
- Slicing
- Change compressed dimensions
Post-Implementation (parametrize tests over types):
- Make COO inherit from CSD
- Make CSR inherit from CSD
- Make CSC inherit from CSD
Compressed Sparse Dimensions — Format Specification
Compressed Sparse Dimensions (CSD) is an multidimensional sparse array storage format.
Let’s say that the number of dimensions is ndim
and compressed_axes
represents the axes/dimensions that are compressed, as a tuple.
It contains three main arrays, one of them two-dimensional:
data
(one dimensional,(nnz,)
)coords
(two-dimensional,(ndim - len(compressed_axes), nnz)
)indptr
(one dimensional(reduce(operator.mul, (shape[a] for a in compressed_axes), 1) + 1,)
)
data
/coords
/indptr
from an ndarray
Relating to a normal NumPy array, we can specify data
and coords
directly. Let’s say arr
is the Numpy array to convert.
non_compressed_axes = tuple(a for a in range(ndim) if a not in set(compressed_axis))
coo_coords = np.where(arr)
data = arr[coo_coords]
axes_order = compressed_axes + non_compressed_axes
if axes_order != tuple(range(ndim)):
lin_coo_coords = linear_loc(coo_coords[np.asarray(axes_order)], tuple(self.shape[a] for a in axes_order))
idx = np.argsort(lin_coo_coords)
data = data[idx]
coords = coords[:, idx]
coords = coo_coords[np.asarray(non_compressed_axes)].astype(np.min_scalar_type(max(non_compressed_axes)))
compressed_linear_shape = reduce(operator.mul, (shape[a] for a in compressed_axes), 1)
compressed_linearized_coords = linear_loc(coo_coords[np.array(compressed_axes)], compressed_axes)
indptr = np.empty(compressed_linear_shape + 1, dtype=np.min_scalar_type(compressed_linear_shape))
indptr[0] = 0
np.cumsum(np.bincount(compressed_linearized_coords, minlength=compressed_linear_shape), out=indptr[1:])
Interpreting the arrays
For CSR, the interpretation is simple: indices
represents the columns
, data
the respective data. indptr[i]:indptr[i+1]
represents the slice corresponding to row i
for which columns
and data
are valid.
We can extend this definition to CSD as well. coords
represent the non-compressed axes, and data
the respective data. indptr
can be interpreted the same way, replacing the word “row” with “linearized compressed axes index”, but it turns out that there is a simpler way to work with things.
Alternative representation for indptr
Consider the following two arrays produced from indptr
:
starts = indptr[:-1].reshape((shape[a] for a in compressed_axes))
stops indptr[1:].reshape((shape[a] for a in compressed_axes))
Note that these are views, no copies are made. This gives a simpler and more intuitive interpretation for indptr
. For every possible index corresponding to the compressed axes: Indexing into starts
gives the starts for that index and stops
gives the stops.
The “multiple-COO” view
For every value in starts
/stops
, coords[:, start_val:stop_val]
and data[start_val:stop_val]
can be seen as a mini-COO array, and can thus have any COO-related algorithms applied to it.