RFC-0004: Adding fill value property to PyTorch sparse tensors #8


Open
wants to merge 18 commits into master

Conversation


@pearu pearu commented Sep 13, 2020

This proposal introduces a fill value property to PyTorch sparse tensors that generalizes the current interpretation of unspecified elements from the fixed value zero to an arbitrary value, including an indefinite fill value as a future extension.

The proposal can be read here.

CC: @vincentqb @mruberry @ezyang

@pearu pearu self-assigned this Sep 13, 2020

@rgommers rgommers left a comment


Looks great, thanks @pearu!

@rgommers

@mruberry this document has the different interpretations of sparse tensors that we were talking about.

@pearu

pearu commented Sep 15, 2020

@rgommers @hameerabbasi, please review. The updated proposal resolves many of the open-ended questions we had previously, and it should now be complete.


@rgommers rgommers left a comment


LGTM, just a few small comments.

You can remove the "Draft" status I think.


@hameerabbasi hameerabbasi left a comment


I gave this a somewhat thorough review. I'll push up some English style/understanding fixes tomorrow. The English here is grammatically and structurally correct, but it's a bit hard to follow and needs some motivation.

@pearu pearu marked this pull request as ready for review September 16, 2020 12:15
@pearu pearu added the module: sparse Related to pytorch sparse support label Sep 16, 2020
@pearu pearu changed the title PyTorch Sparse Tensors: fill-value property Adding fill value property to PyTorch sparse tensors Sep 16, 2020

@dzhulgakov dzhulgakov left a comment


This is a very detailed and comprehensive proposal, thanks! Everything makes sense.

Maybe expand on what exactly it'd mean for autograd? (E.g., provide a sample formula for an element-wise operation.) Does it also mean that we'd need optimizer modifications to support fill_value? (Right now there are a few optimizers that can consume sparse grads directly.)

```python
size=(4, 2),
fill_value = 1.2)
A.fill_value() -> torch.tensor([1.2, 1.2])
A._fill_value() -> torch.tensor(1.2)
```


Can we just return a full-shaped tensor restrided with strides of 0? Then I don't think you need a separate _fill_value method.

The element-wise operators can have a fast path for the case when it's really a scalar being broadcast to the hybrid shape. And the general logic would be pretty much the same as the TensorIterator broadcasting rules.
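
A minimal sketch of this zero-stride idea (the names here, e.g. `fill` and the dense shape, are illustrative and not part of the proposal's API):

```python
import torch

# Illustrative only: expose a scalar fill value as a tensor matching the hybrid
# tensor's dense shape without allocating per-element storage, by expanding
# with zero strides.
fill = torch.tensor(1.2)        # the scalar fill value
view = fill.expand(2)           # dense dims of the hybrid tensor above; no copy

print(view)                     # tensor([1.2000, 1.2000])
print(view.stride())            # (0,) -- every element aliases the same scalar
```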


One thing that is a little tricky here is that if you get the restrided tensor with zero strides and then do an operation on it like add_(2), that will cause the entire tensor to get materialized. Maybe TensorIterator shouldn't do that...

@ezyang ezyang requested a review from mruberry September 17, 2020 02:11
need to be updated for handling the defined fill values. This will
be implemented in two stages:

1. All relevant functions need to check for zero fill value. If a

This is trouble if fill values live on device. If the fill value lives on device, you cannot conveniently check whether it is nonzero without inducing a synchronization. Indeed, the pseudocode below's use of nonzero would cause a sync. This is bad.

It might be possible to do some conservative analysis; for example, we can separately track whether something is "definitely zero" versus "maybe not zero" in a CPU-side bit, and then have individual functions propagate it. But at that point, you might as well just fix all the functions up front to handle fill values. Fortunately, there aren't that many; it's small enough that, for example, when I did the port from THS to ATen, I did it all in one go.
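
A hypothetical sketch of this conservative tracking (names such as FillValue and add_fill are illustrative, not proposed API):

```python
import torch

class FillValue:
    """Pair a (possibly device-resident) fill value with a CPU-side flag."""
    def __init__(self, value: torch.Tensor, definitely_zero: bool):
        self.value = value                    # 0-dim tensor, may live on CUDA
        self.definitely_zero = definitely_zero

    @staticmethod
    def zero(device="cpu"):
        return FillValue(torch.zeros((), device=device), definitely_zero=True)

def add_fill(f: "FillValue", g: "FillValue") -> "FillValue":
    # Propagation rule for addition: zero + zero is still definitely zero;
    # anything else is conservatively treated as "maybe nonzero".
    return FillValue(f.value + g.value, f.definitely_zero and g.definitely_zero)
```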


@ezyang Having a fast "definitely zero" test would be advantageous even after fixing all the functions to handle fill values, because most linear algebra functions would always do the zero-test in order to use the fastest and most frequently used zero-fill-value path.

An idea: let the fill value always live in CPU memory so that the zero-test is fast. When an algorithm needs to use the fill value on a device, it makes the to-device copy.

To make coding easier, one could use unified CUDA memory for the fill value when the sparse tensor values use a CUDA device. However, other non-CUDA devices would not gain from this, and an explicit copy of the fill value to the device would still be needed.
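
A small sketch of this idea (again illustrative, not the proposal's API): the fill value stays on the CPU so the zero-test never synchronizes, and kernels request an explicit copy when they need it on device.

```python
import torch

fill_value = torch.tensor(1.2)      # always CPU-resident
is_zero = fill_value.item() == 0    # cheap host-side test, no device sync

def fill_value_on(device):
    # Explicit, on-demand copy for algorithms that consume the fill value on device.
    return fill_value.to(device)
```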


```python
matmul(A, B) = matmul(A - a + a, B - b + b)
             = matmul(A - a, B - b) + a * matmul(ones_like(A), B) + b * matmul(A, ones_like(B))
```

cc @albanD doesn't this look like dual numbers? It kind of looks like dual numbers.


Does this actually preserve sparsity? Consider this example:

```python
>>> torch.tensor([[1,0,0],[0,2,0],[0,0,3]], dtype=torch.float) @ torch.ones((3, 3))
tensor([[1., 1., 1.],
        [2., 2., 2.],
        [3., 3., 3.]])
```

I'm not sure the output can be expressed in a sparse way. (I suppose the transpose can be done as one sparse dimension and one dense dimension, with a fill value of [1, 2, 3])


No, in general, the matmul of sparse tensors with nonzero fill values will result in a dense tensor (read: a tensor in sparse format but with a small number of equal elements).

This is going to be slightly off-topic, but if there are applications that require such a matmul, I can think of one way to preserve sparsity through matmul: lazy evaluation. It would be like an uncoalesced tensor, but at the level of tensors rather than tensor elements.

For instance, the result of A @ B would be an object, say LazyTensor, that holds three sparse tensors: (A-a) @ (B-b), ones_like(A) @ (a*B), and (b*A) @ ones_like(B). Each of these can be stored memory-efficiently as a sparse tensor, while adding them would lead to a dense tensor (but we postpone the addition). So, when one has

A @ B -> C = LazyTensor((A-a) @ (B-b), ones_like(A) @ (a*B), (b*A) @ ones_like(B))

then computing a matrix-vector product C @ v boils down to computing the sparse matrix-vector product with each of the three sparse tensors and v:

LazyTensor(C1, C2, C3) @ v -> C1 @ v + C2 @ v + C3 @ v

so that the postponed addition is realized with the results of matrix-vector products.
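
An illustrative sketch of this lazy evaluation (the class and method names are hypothetical, not part of the proposal):

```python
import torch

class LazyTensor:
    """Holds the addends of a postponed sum, e.g. the three matmul terms above."""
    def __init__(self, *terms):
        self.terms = terms

    def __matmul__(self, v):
        # (C1 + C2 + C3) @ v == C1 @ v + C2 @ v + C3 @ v, so the postponed
        # addition is realized on the (small) matrix-vector products instead.
        out = self.terms[0] @ v
        for t in self.terms[1:]:
            out = out + t @ v
        return out

# Usage with small dense stand-ins for the three terms:
C = LazyTensor(torch.eye(3), torch.ones(3, 3), 2 * torch.ones(3, 3))
v = torch.arange(3.0)
print(C @ v)    # tensor([ 9., 10., 11.])
```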


OK. We're generally amenable to laziness (see also pytorch/pytorch#43563 ) and it is easier to do this in sparse as we have no concept of views, which make laziness harder to do. But out of scope for this RFC :)

@ezyang

ezyang commented Sep 17, 2020

There isn't much in the proposal about autograd. Although I don't think fill values necessarily make the situation any worse than it was before, I think they do deserve some discussion. In particular, suppose I apply a function f(x) where x has a non-zero fill value. What should the fill value of the computed grad_x be?

I think the answer is simply that the gradient should be derived from the computations that involved the fill value directly, and the sparsity pattern should be preserved. But it would be good to see this worked out in more detail, and to verify that desirable mathematical properties still hold.
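
A small dense worked example of that intuition for an element-wise op (an illustration, not the RFC's specification): positions that share the input fill value also share their gradient, so grad_x could keep the sparsity pattern with a derived fill value.

```python
import torch

# "Unspecified" positions all carry the fill value 2.0.
x = torch.tensor([3.0, 2.0, 2.0, 2.0], requires_grad=True)
y = torch.exp(x)
y.backward(torch.ones_like(y))   # upstream gradient with fill value 1.0

print(x.grad)
# tensor([20.0855,  7.3891,  7.3891,  7.3891])
# The "unspecified" positions all receive exp(2.0) * 1.0, i.e. a single fill value.
```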

@ezyang ezyang requested a review from albanD September 17, 2020 02:47
@mruberry

Hey @pearu, thanks for this write-up!

I appreciate your thinking on sparse tensors and adding a "fill value" to them seems like a mathematically clear and consistent way of defining them. However, I'm not sure if an implementation will actually give users clear and consistent behavior. That is, will users understand the difference between sparse tensors with and without fill values, how to preserve sparsity, and which operations are/aren't performant or memory-saving when setting a fill value? Let me try to detail that query with some sub-questions:

  • How do we define the behavior of sparse tensors without a fill value? These are not consistently interpreted as having a value of zero today.
  • How do we think about the gradient of sparse tensors without a fill value? What if the gradient specifies more values than the sparse tensor?
  • How do we think about the gradient of sparse tensors with a fill value? What if the gradient specifies more values than the sparse tensor?
  • If I have sparse A and strided B, what does A * B return? I would be concerned if A's having or not having a fill value could change properties of the returned tensor. How intuitive would that be?

I understand the desire to implement a fill value for some cases, but I'd like to get a better sense for the mathematics behind the no-fill-value definition of a sparse tensor, and I wonder if we shouldn't apply more guardrails to sparse tensors with fill values to protect users from getting into trouble. Also, as mentioned, I'd like to get a better understanding for how we treat sparse tensors without a fill value. Should we consider them masked tensors and let functions have their own interpretations of that mask?

@rgommers

> Also, as mentioned, I'd like to get a better understanding for how we treat sparse tensors without a fill value. Should we consider them masked tensors and let functions have their own interpretations of that mask?

It may be good to answer this by prioritizing a good rewrite of the docs (pytorch/pytorch#44635), rather than going into detail on this PR. That rewrite has to be done anyway (for 1.7.0), and doing it first will make this RFC easier to fit into the picture.

@ezyang

ezyang commented Sep 18, 2020

Another thought that @mruberry and I had: although fill values are the more conservative extension in a semantic sense, since they maintain the correspondence between sparse and dense tensors, going straight to masked tensors and adding fill values as an extension on top of them might be a path that reduces the total amount of work we have to do (and, in particular, may help us avoid adding a bunch of complicated logic for handling fills in scenarios no one really cares about).

@ngimel

ngimel commented Sep 18, 2020

I agree with Ed. This RFC contains one valid example of when a non-zero fill value is helpful (a random sequence of rare events with nonzero mean); the others (exp, log, softmax, etc.), at least in the neural network context, are better handled by masking semantics. Masking semantics also provides the necessary restrictions on the sparsity pattern of the gradient, so the answer to @mruberry's question "What if the gradient specifies more values than the sparse tensor?" is "It can't".

Comment on lines +117 to +118
| `torch.empty` | uninitialized value |
| `torch.empty_like` | uninitialized value |

Currently, torch.empty(layout=torch.sparse_coo) is equivalent to torch.zeros(layout=torch.sparse_coo).
For backward compatibility, the default fill value for empty ought to be 0.


For sparse Tensors, do we actually zero out the memory in empty()? Because for dense Tensors this is not true.


Indeed, the empty equivalence applies only when using the sparse layout; it does not hold for strided tensors.
In the case of sparse tensors, there is nothing to zero out in zeros: sparse zeros is sparse empty by definition (the default fill value is 0).
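
A quick check of the equivalence described above (behavior as stated in this discussion; treat it as an assumption to verify against the PyTorch version at hand):

```python
import torch

e = torch.empty((2, 3), layout=torch.sparse_coo)
z = torch.zeros((2, 3), layout=torch.sparse_coo)

print(e._nnz(), z._nnz())                        # 0 0 -- no specified elements
print(torch.equal(e.to_dense(), z.to_dense()))   # True -- both are all zeros
```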

@pearu

pearu commented Sep 28, 2020

Just a heads up: I'll respond to the review comments shortly. Sorry for the delay!

@ezyang ezyang changed the title Adding fill value property to PyTorch sparse tensors [RFC-0004] Adding fill value property to PyTorch sparse tensors Oct 13, 2020
@ezyang ezyang changed the title [RFC-0004] Adding fill value property to PyTorch sparse tensors RFC-0004: Adding fill value property to PyTorch sparse tensors Oct 13, 2020
@facebook-github-bot

Hi @pearu!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@cpuhrsch

cpuhrsch commented Jun 8, 2021

I'm still working my way into this topic, so please have some patience as I catch up on a multi-year effort.

When we say "masking semantics", is this maybe too implementation-specific?

To some extent, a "mask" could perhaps also be thought of as introducing nullability. The mask can be used to determine which values are and are not null, but it's not the only metadata representation that can achieve this. Further, a padded, strided tensor might not be the only value storage that can achieve this. Having said that, I know there are various applications that might want to generate a dense, strided Tensor with a particular fill value from a "masked" Tensor. It might also be convenient for many kernels, and operator coverage could be bootstrapped more easily.

Further, is a mask truly necessary for the use cases that require this kind of nullability? It can introduce a lot of branching, which I could see being particularly difficult to support on GPUs in a generic sense.

Also, in some cases a user might want different mask semantics. Let's take addition as an example (a small sketch of these three policies follows the list below).

a) fail when the masks don't agree: addition of two tensors encoding variable-length sentences
b) take the intersection of masked values (e.g. MHA input mask + attention mask): replace masked-out values with the identity
c) take the union of masked values: if one of the operands' values is masked out, so is the corresponding result. This is the masked array's behavior in NumPy, but I don't know a use case for this.
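
A minimal sketch of these three policies, using a hypothetical (values, mask) pair where mask == True means "masked out" (numpy.ma convention); names are illustrative:

```python
import torch

def masked_add(a, mask_a, b, mask_b, policy):
    if policy == "fail":              # (a) require the masks to agree
        if not torch.equal(mask_a, mask_b):
            raise ValueError("masks do not agree")
        return a + b, mask_a
    if policy == "intersection":      # (b) masked-out values act as the additive identity
        values = a.masked_fill(mask_a, 0) + b.masked_fill(mask_b, 0)
        return values, mask_a & mask_b
    if policy == "union":             # (c) masked out if either operand is (numpy.ma behavior)
        return a + b, mask_a | mask_b
    raise ValueError(f"unknown policy: {policy}")
```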

@ezyang

ezyang commented Jun 8, 2021

I personally think of "masking semantics" as referring to the semantics of the simplest possible implementation strategy (dense tensor + mask). It's hard to think of any simpler implementation; in particular, you cannot easily null out elements in a tensor, as traditional semantics use up all bit patterns to mean floating-point numbers. But of course you can do other approaches like COO/CSR; they're just more complicated and the maskedness is more implicit.

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@cpuhrsch

cpuhrsch commented Jun 9, 2021

Agreed; in particular, the semantics of dense tensor + mask are independent of the implementation. I think it's actually an implementation-independent generalization (e.g., we might still want to use sparse storage layouts even when the object is semantically a dense tensor + mask). As such, we should be clear on what exactly we want to achieve with these dense tensor + mask semantics and what makes them uniquely useful as opposed to related semantic generalizations such as nullability or generalized shapes.

@cpuhrsch

cpuhrsch commented Aug 26, 2021

I created an RFC around masked reductions and normalizations: #27

Labels: cla signed, module: sparse (Related to pytorch sparse support)