RFC-0004: Adding fill value property to PyTorch sparse tensors #8


Open
wants to merge 18 commits into master

Conversation


@pearu pearu commented Sep 13, 2020

This proposal introduces a fill value property to PyTorch sparse tensors that generalizes the current interpretation of unspecified elements from the fixed value zero to an arbitrary value, including an indefinite fill value as a future extension.

The proposal can be read here.

CC: @vincentqb @mruberry @ezyang

@pearu pearu self-assigned this Sep 13, 2020

@rgommers rgommers left a comment


Looks great, thanks @pearu!

@rgommers

@mruberry this document has the different interpretations of sparse tensors that we were talking about.

@pearu

pearu commented Sep 15, 2020

@rgommers @hameerabbasi, please review. The updated proposal resolves many of the open-ended questions we had previously, and it should now be complete.


@rgommers rgommers left a comment


LGTM, just a few small comments.

You can remove the "Draft" status I think.


@hameerabbasi hameerabbasi left a comment


I gave this a somewhat thorough review. I'll push up some English style/understanding fixes tomorrow. The English here is grammatically and structurally correct, but it's a bit hard to follow and needs some motivation.

@pearu pearu marked this pull request as ready for review September 16, 2020 12:15
@pearu pearu added the module: sparse Related to pytorch sparse support label Sep 16, 2020
@pearu pearu changed the title PyTorch Sparse Tensors: fill-value property Adding fill value property to PyTorch sparse tensors Sep 16, 2020

@dzhulgakov dzhulgakov left a comment


This is a very detailed and comprehensive proposal, thanks! Everything makes sense.

Maybe expand on what exactly it'd mean for autograd? (E.g., provide a sample formula for an element-wise operation.) Does it also mean that we'd need optimizer modifications to support fill_value? (Right now there are a few optimizers that can consume sparse grads directly.)

```python
size=(4, 2),
fill_value = 1.2)
A.fill_value() -> torch.tensor([1.2, 1.2])
A._fill_value() -> torch.tensor(1.2)
```


Can we just return a full-shaped tensor restrided with strides of 0? Then I don't think you need a separate _fill_value method.

The element-wise operators can have a fast path for the case when it's really a scalar being broadcast to the hybrid shape. And the general logic would be pretty much the same as the TensorIterator broadcasting rules.
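
A minimal sketch of this zero-stride idea (the names here, e.g. `fill` and the dense shape, are illustrative and not part of the proposal's API):

```python
import torch

# Illustrative only: expose a scalar fill value as a tensor matching the hybrid
# tensor's dense shape without allocating per-element storage, by expanding
# with zero strides.
fill = torch.tensor(1.2)        # the scalar fill value
view = fill.expand(2)           # dense dims of the hybrid tensor above; no copy

print(view)                     # tensor([1.2000, 1.2000])
print(view.stride())            # (0,) -- every element aliases the same scalar
```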


One thing that is a little tricky here is that if you get the restrided tensor with zero strides and then do an operation on it like add_(2), that will cause the entire tensor to get materialized. Maybe TensorIterator shouldn't do that...

@ezyang ezyang requested a review from mruberry September 17, 2020 02:11
need to be updated for handling the defined fill values. This will
be implemented in two stages:

1. All relevant functions need to check for zero fill value. If a

This is trouble if fill values live on device. If the fill value lives on device, you cannot conveniently check whether it is nonzero without inducing a synchronization. Indeed, the pseudocode below's use of nonzero would cause a sync. This is bad.

It might be possible to do some conservative analysis; for example, we can separately track whether something is "definitely zero" versus "maybe not zero" in a CPU-side bit, and then have individual functions propagate it. But at that point, you might as well just fix all the functions up front to handle fill values. Fortunately, there aren't that many; it's small enough that, for example, when I did the port from THS to ATen, I did it all in one go.
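
A hypothetical sketch of this conservative tracking (names such as FillValue and add_fill are illustrative, not proposed API):

```python
import torch

class FillValue:
    """Pair a (possibly device-resident) fill value with a CPU-side flag."""
    def __init__(self, value: torch.Tensor, definitely_zero: bool):
        self.value = value                    # 0-dim tensor, may live on CUDA
        self.definitely_zero = definitely_zero

    @staticmethod
    def zero(device="cpu"):
        return FillValue(torch.zeros((), device=device), definitely_zero=True)

def add_fill(f: "FillValue", g: "FillValue") -> "FillValue":
    # Propagation rule for addition: zero + zero is still definitely zero;
    # anything else is conservatively treated as "maybe nonzero".
    return FillValue(f.value + g.value, f.definitely_zero and g.definitely_zero)
```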


@ezyang Having a fast "definitely zero" test would be advantageous even after fixing all the functions to handle fill values, because most linear algebra functions would always do the zero-test in order to use the fastest and most frequently used zero-fill-value path.

An idea: let the fill value always live in CPU memory so that the zero-test is fast. When an algorithm needs to use the fill value on a device, it makes the to-device copy.

To make coding easier, one could use unified CUDA memory for the fill value when the sparse tensor values use a CUDA device. However, other non-CUDA devices would not gain from this, and an explicit copy of the fill value to the device would still be needed.
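
A small sketch of this idea (again illustrative, not the proposal's API): the fill value stays on the CPU so the zero-test never synchronizes, and kernels request an explicit copy when they need it on device.

```python
import torch

fill_value = torch.tensor(1.2)      # always CPU-resident
is_zero = fill_value.item() == 0    # cheap host-side test, no device sync

def fill_value_on(device):
    # Explicit, on-demand copy for algorithms that consume the fill value on device.
    return fill_value.to(device)
```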


```python
matmul(A, B) = matmul(A - a + a, B - b + b)
             = matmul(A - a, B - b) + a * matmul(ones_like(A), B) + b * matmul(A, ones_like(B))
```

cc @albanD doesn't this look like dual numbers? It kind of looks like dual numbers.


Does this actually preserve sparsity? Consider this example:

```python
>>> torch.tensor([[1,0,0],[0,2,0],[0,0,3]], dtype=torch.float) @ torch.ones((3, 3))
tensor([[1., 1., 1.],
        [2., 2., 2.],
        [3., 3., 3.]])
```

I'm not sure the output can be expressed in a sparse way. (I suppose the transpose can be done as one sparse dimension and one dense dimension, with a fill value of [1, 2, 3])


No, in general, the matmul of sparse tensors with nonzero fill values will result in a dense tensor (read: a tensor in sparse format but with a small number of equal elements).

This is going to be slightly off-topic, but if there are applications that require such a matmul, I can think of one way to preserve sparsity through matmul: lazy evaluation. It would be like an uncoalesced tensor, but at the level of tensors rather than tensor elements.

For instance, the result of A @ B would be an object, say LazyTensor, that holds three sparse tensors: (A-a) @ (B-b), ones_like(A) @ (a*B), and (b*A) @ ones_like(B). Each of these can be stored memory-efficiently as a sparse tensor, while adding them would lead to a dense tensor (but we postpone the addition). So, when one has

A @ B -> C = LazyTensor((A-a) @ (B-b), ones_like(A) @ (a*B), (b*A) @ ones_like(B))

then computing a matrix-vector product C @ v boils down to computing the sparse matrix-vector product with each of the three sparse tensors and v:

LazyTensor(C1, C2, C3) @ v -> C1 @ v + C2 @ v + C3 @ v

so that the postponed addition is realized with the results of matrix-vector products.
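
An illustrative sketch of this lazy evaluation (the class and method names are hypothetical, not part of the proposal):

```python
import torch

class LazyTensor:
    """Holds the addends of a postponed sum, e.g. the three matmul terms above."""
    def __init__(self, *terms):
        self.terms = terms

    def __matmul__(self, v):
        # (C1 + C2 + C3) @ v == C1 @ v + C2 @ v + C3 @ v, so the postponed
        # addition is realized on the (small) matrix-vector products instead.
        out = self.terms[0] @ v
        for t in self.terms[1:]:
            out = out + t @ v
        return out

# Usage with small dense stand-ins for the three terms:
C = LazyTensor(torch.eye(3), torch.ones(3, 3), 2 * torch.ones(3, 3))
v = torch.arange(3.0)
print(C @ v)    # tensor([ 9., 10., 11.])
```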


OK. We're generally amenable to laziness (see also pytorch/pytorch#43563 ) and it is easier to do this in sparse as we have no concept of views, which make laziness harder to do. But out of scope for this RFC :)

@ezyang

ezyang commented Sep 17, 2020

There isn't much in the proposal about autograd. Although I don't think fill values necessarily make the situation any worse than it was before, I think they do deserve some discussion. In particular, suppose I apply a function f(x) where x has a non-zero fill value. What should the fill value of the computed grad_x be?

I think the answer is simply that the gradient should be derived from the computations that involved the fill value directly, and the sparsity pattern should be preserved. But it would be good to see this worked out in more detail, and to verify that desirable mathematical properties still hold.
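
A small dense worked example of that intuition for an element-wise op (an illustration, not the RFC's specification): positions that share the input fill value also share their gradient, so grad_x could keep the sparsity pattern with a derived fill value.

```python
import torch

# "Unspecified" positions all carry the fill value 2.0.
x = torch.tensor([3.0, 2.0, 2.0, 2.0], requires_grad=True)
y = torch.exp(x)
y.backward(torch.ones_like(y))   # upstream gradient with fill value 1.0

print(x.grad)
# tensor([20.0855,  7.3891,  7.3891,  7.3891])
# The "unspecified" positions all receive exp(2.0) * 1.0, i.e. a single fill value.
```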

@ezyang ezyang requested a review from albanD September 17, 2020 02:47
@mruberry

Hey @pearu, thanks for this write-up!

I appreciate your thinking on sparse tensors and adding a "fill value" to them seems like a mathematically clear and consistent way of defining them. However, I'm not sure if an implementation will actually give users clear and consistent behavior. That is, will users understand the difference between sparse tensors with and without fill values, how to preserve sparsity, and which operations are/aren't performant or memory-saving when setting a fill value? Let me try to detail that query with some sub-questions:

  • How do we define the behavior of sparse tensors without a fill value? These are not consistently interpreted as having a value of zero today.
  • How do we think about the gradient of sparse tensors without a fill value? What if the gradient specifies more values than the sparse tensor?
  • How do we think about the gradient of sparse tensors with a fill value? What if the gradient specifies more values than the sparse tensor?
  • If I have sparse A and strided B, what does A * B return? I would be concerned if A's having or not having a fill value could change properties of the returned tensor. How intuitive would that be?

I understand the desire to implement a fill value for some cases, but I'd like to get a better sense for the mathematics behind the no-fill-value definition of a sparse tensor, and I wonder if we shouldn't apply more guardrails to sparse tensors with fill values to protect users from getting into trouble. Also, as mentioned, I'd like to get a better understanding for how we treat sparse tensors without a fill value. Should we consider them masked tensors and let functions have their own interpretations of that mask?

@rgommers

> Also, as mentioned, I'd like to get a better understanding for how we treat sparse tensors without a fill value. Should we consider them masked tensors and let functions have their own interpretations of that mask?

It may be good to answer this by prioritizing a good rewrite of the docs (pytorch/pytorch#44635), rather than going into detail on this PR. That rewrite has to be done anyway (for 1.7.0), and doing it first will make this RFC easier to fit into the picture.

@ezyang

ezyang commented Sep 18, 2020

Another thought that @mruberry and I had: although fill values are the more conservative extension in a semantic sense, since they maintain the correspondence between sparse and dense tensors, going straight to masked tensors and adding fill values as an extension on top of them might be a path that reduces the total amount of work we have to do (and, in particular, may help us avoid adding a bunch of complicated logic for handling fills in scenarios no one really cares about).

@ngimel

ngimel commented Sep 18, 2020

I agree with Ed. This RFC contains one valid example of when a non-zero fill value is helpful (a random sequence of rare events with nonzero mean); the others (exp, log, softmax, etc.), at least in the neural network context, are better handled by masking semantics. Masking semantics also provides the necessary restrictions on the sparsity pattern of the gradient, so the answer to @mruberry's question "What if the gradient specifies more values than the sparse tensor?" is "It can't".

Comment on lines +117 to +118
| `torch.empty` | uninitialized value |
| `torch.empty_like` | uninitialized value |

Currently, torch.empty(layout=torch.sparse_coo) is equivalent to torch.zeros(layout=torch.sparse_coo).
For backward compatibility, the default fill value for empty ought to be 0.


For sparse Tensors, do we actually zero out the memory in empty()? Because for dense Tensors this is not true.


Indeed, the empty equivalence applies only when using the sparse layout; it does not hold for strided tensors.
In the case of sparse tensors, there is nothing to zero out in zeros: sparse zeros is sparse empty by definition (the default fill value is 0).
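
A quick check of the equivalence described above (behavior as stated in this discussion; treat it as an assumption to verify against the PyTorch version at hand):

```python
import torch

e = torch.empty((2, 3), layout=torch.sparse_coo)
z = torch.zeros((2, 3), layout=torch.sparse_coo)

print(e._nnz(), z._nnz())                        # 0 0 -- no specified elements
print(torch.equal(e.to_dense(), z.to_dense()))   # True -- both are all zeros
```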

@pearu

pearu commented Sep 28, 2020

Just a heads up: I'll respond to the review comments shortly. Sorry for the delay!

@ezyang ezyang changed the title Adding fill value property to PyTorch sparse tensors [RFC-0004] Adding fill value property to PyTorch sparse tensors Oct 13, 2020
@ezyang ezyang changed the title [RFC-0004] Adding fill value property to PyTorch sparse tensors RFC-0004: Adding fill value property to PyTorch sparse tensors Oct 13, 2020
@facebook-github-bot

Hi @pearu!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

@cpuhrsch

cpuhrsch commented Jun 8, 2021

I'm still working my way into this topic, so please have some patience as I catch up on a multi-year effort.

When we say "masking semantics", is this maybe too implementation-specific?

To some extent, a "mask" could perhaps also be thought of as introducing nullability. The mask can be used to determine which values are and are not null, but it's not the only metadata representation that can achieve this. Further, a padded, strided tensor might not be the only value storage that can achieve this. Having said that, I know there are various applications that might want to generate a dense, strided Tensor with a particular fill value from a "masked" Tensor. It might also be convenient for many kernels, and operator coverage could be bootstrapped more easily.

Further, is a mask truly necessary for the use cases that require this kind of nullability? It can introduce a lot of branching, which I could see being particularly difficult to support on GPUs in a generic sense.

Also, in some cases a user might want different mask semantics. Let's take addition as an example (a small sketch of these three policies follows the list below).

a) fail when the masks don't agree: addition of two tensors encoding variable-length sentences
b) take the intersection of masked values (e.g. MHA input mask + attention mask): replace masked-out values with the identity
c) take the union of masked values: if one of the operands' values is masked out, so is the corresponding result. This is the masked array's behavior in NumPy, but I don't know a use case for this.
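
A minimal sketch of these three policies, using a hypothetical (values, mask) pair where mask == True means "masked out" (numpy.ma convention); names are illustrative:

```python
import torch

def masked_add(a, mask_a, b, mask_b, policy):
    if policy == "fail":              # (a) require the masks to agree
        if not torch.equal(mask_a, mask_b):
            raise ValueError("masks do not agree")
        return a + b, mask_a
    if policy == "intersection":      # (b) masked-out values act as the additive identity
        values = a.masked_fill(mask_a, 0) + b.masked_fill(mask_b, 0)
        return values, mask_a & mask_b
    if policy == "union":             # (c) masked out if either operand is (numpy.ma behavior)
        return a + b, mask_a | mask_b
    raise ValueError(f"unknown policy: {policy}")
```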

@ezyang

ezyang commented Jun 8, 2021

I personally think of "masking semantics" as referring to the semantics of the simplest possible implementation strategy (dense tensor + mask). It's hard to think of any simpler implementation; in particular, you cannot easily null out elements in a tensor, as traditional semantics use up all bit patterns to mean floating-point numbers. But of course you can do other approaches like COO/CSR; they're just more complicated and the maskedness is more implicit.

@facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@cpuhrsch

cpuhrsch commented Jun 9, 2021

Agreed; in particular, the semantics of dense tensor + mask are independent of the implementation. I think it's actually an implementation-independent generalization (e.g., we might still want to use sparse storage layouts even when the object is semantically a dense tensor + mask). As such, we should be clear on what exactly we want to achieve with these dense tensor + mask semantics and what makes them uniquely useful as opposed to related semantic generalizations such as nullability or generalized shapes.

@cpuhrsch

cpuhrsch commented Aug 26, 2021

I created an RFC around masked reductions and normalizations: #27

Labels: cla signed, module: sparse (Related to pytorch sparse support)