Description
Prerequisites
Please make sure to check off these prerequisites before submitting a bug report.
- Test that the bug appears on the current version of the main branch. Make sure to include the commit hash of the commit you checked out: 6ca8f8e
- Check that the issue hasn't already been reported, by checking the currently open issues.
- If there are steps to reproduce the problem, make sure to write them down below.
- If relevant, please include the ONNX files, which were created directly before and/or after the bug.
Quick summary
While working on our characterization of the transformer data-flow we encountered some discrepancies when validating against the QONNX `inference_cost` estimations of the MatMul operator within the attention mechanism. We are not entirely sure whether this is indeed a bug on the QONNX side or still some confusion/error on our side. Thus we would like to start a discussion to understand this issue.
Details
Multi-Head Scaled Dot-Product Attention involves two consecutive MatMul operations where both inputs dynamically depend on the model inputs. The heads are independent of each other and typically treated in a way similar to a batch dimension. Our cost model assumes HxTxTxd MAC operations for each of the two MatMuls, i.e. H heads each producing a TxT attention matrix (T is the sequence length) where each element is the result of a d-dimensional dot-product. However, the QONNX analysis function `inference_cost_matmul` seems to be off by an additional factor of H (i.e. HxHxTxTxd), indicating that the heads are not treated like a batch dimension.
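To make the counting rule explicit, here is a small sketch of the cost model described above (`expected_matmul_macs` is just an illustrative helper of ours, not QONNX code; the shapes match the example model further below with H=4, T=64 and d_head=32):

```python
import numpy as np

# Sketch of our cost model: for a (possibly batched) MatMul the leading axes
# act like batch/head dimensions and each output element is one dot-product
# over the shared inner dimension. Illustrative helper, not QONNX code.
def expected_matmul_macs(lhs_shape, rhs_shape):
    *batch, m, k = lhs_shape
    assert k == rhs_shape[-2], "inner dimensions must match"
    n = rhs_shape[-1]
    return int(np.prod(batch, dtype=np.int64) * m * n * k)

# Q x K^T per head: (H=4, T=64, d=32) x (H=4, d=32, T=64) -> HxTxTxd MACs
print(expected_matmul_macs((4, 64, 32), (4, 32, 64)))  # 524288
# A x V per head:   (4, T=64, T=64) x (4, T=64, d=32)   -> also HxTxTxd MACs
print(expected_matmul_macs((4, 64, 64), (4, 64, 32)))  # 524288
```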
My suspicion is further raised by the following lines from the QONNX `inference_cost_matmul` function:

```python
# exclude common dim (last axis) from one side to avoid duplication
n_macs = np.prod(i_shape[:-1]) * np.prod(w_shape)
```

Is this actually always the case? At least for the model graph I have attached, it seems like the last axis is not the common dimension.
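To illustrate the suspected overcounting concretely, this is the quoted formula applied by hand to the shapes the first dynamic MatMul should have in the attached graph after the reshape/permute (this is only our reading of the code, so please correct us if the shapes enter differently):

```python
import numpy as np

# Shapes of the first dynamic MatMul (Q x K^T) after reshape/permute:
# i_shape = (batch*heads, T, d_head), w_shape = (batch*heads, d_head, T)
i_shape, w_shape = (4, 64, 32), (4, 32, 64)

# The quoted counting rule from inference_cost_matmul
n_macs = np.prod(i_shape[:-1]) * np.prod(w_shape)
print(n_macs)  # 2097152 = (4*64) * (4*32*64): the head axis enters twice

# Counting rule with the leading axis treated as a batch/head dimension
print(np.prod(i_shape) * w_shape[-1])  # 524288 = 4*64*64*32 = HxTxTxd
```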
In the following, I provide a minimal working example of scaled dot-product attention in isolation in PyTorch, exported to an ONNX graph. I have also attached the already preprocessed graph, which in particular already includes the `InferShapes` transform. Note that running the `qonnx.util.inference_cost` script on the raw PyTorch ONNX export breaks at the `FoldConstants` transform due to an `IndexError`, which is probably unrelated and should be investigated separately (I have "fixed" it by removing that transformation step for now).
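For completeness, the attached preprocessed graph was produced roughly as follows (a minimal sketch assuming the standard QONNX `ModelWrapper` API, with the `FoldConstants` step left out as described above):

```python
from qonnx.core.modelwrapper import ModelWrapper
from qonnx.transformation.infer_shapes import InferShapes

# Load the raw PyTorch ONNX export, run shape inference and save the result.
# FoldConstants is deliberately skipped here because of the IndexError above.
model = ModelWrapper("sdp.onnx")
model = model.transform(InferShapes())
model.save("sdp_preprocessed.onnx")
```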
Steps to Reproduce
The following code produces a minimal example of scaled dot-product attention and exports to ONNX.
```python
import torch


# Minimal working example of the Scaled Dot-Product Attention mechanism
class ScaleDotProductAttention(torch.nn.Module):
    # Initializes the module parameters
    def __init__(self, num_heads):
        # Initialize the PyTorch base Module
        super().__init__()
        # Set the number of attention heads
        self.num_heads = num_heads

    # Forward pass computing scaled dot-product attention between q, k and v
    def forward(self, q, k, v):
        # Assume the most simple case of q, k and v all having the same
        # dimensions
        assert q.shape == k.shape == v.shape, \
            "Q, K and V must have the same shape"
        # Embedding dimension must be divisible by the number of heads
        assert q.shape[-1] % self.num_heads == 0, \
            f"Dimensions must be divisible by heads ({self.num_heads})"
        # Assume sequence-first layout and get the sizes per axis
        s, b, d = q.shape
        # Number of heads and dimension per head
        n_head, d_head = self.num_heads, d // self.num_heads
        # Reshape tensors to treat the heads like batch dimensions
        q = q.reshape(s, b, n_head, d_head).reshape(s, b * n_head, d_head)
        k = k.reshape(s, b, n_head, d_head).reshape(s, b * n_head, d_head)
        v = v.reshape(s, b, n_head, d_head).reshape(s, b * n_head, d_head)
        # Compute the not-yet-normalized attention matrices for each head.
        # Note: permute brings batch x heads to the front and transposes k
        a = torch.matmul(q.permute(1, 0, 2), k.permute(1, 2, 0))
        # Scale and normalize the attention matrix
        a = torch.softmax(a * (d_head ** -0.5), dim=-1)
        # Apply the attention matrices to the value projection
        # Note: Second permute brings the sequence dimension back to the front
        o = torch.matmul(a, v.permute(1, 0, 2)).permute(1, 0, 2)
        # Reshape the heads back into the feature dimension
        o = o.reshape(s, b, n_head, d_head).reshape(s, b, n_head * d_head)
        # Return the scaled dot-product attention output
        return o


# Script entrypoint
if __name__ == '__main__':
    # Instantiate a scaled dot-product attention with 4 attention heads
    sdp = ScaleDotProductAttention(num_heads=4)
    # Generate random query, key and value tensors
    # Note: Sequence of length 64, single-instance batch, 128-dim embeddings
    q, k, v = torch.randn(3, 64, 1, 128)
    # Export the attention module to ONNX
    torch.onnx.export(sdp, args=(q, k, v), f='sdp.onnx')
```
Get the MAC operation counts by running:

```
python -m qonnx.util.inference_cost sdp.onnx
```

This outputs something like:

```
{'op_mac_FLOAT32_FLOAT32': 4194304.0, 'mem_w_FLOAT32': 0.0, 'mem_o_FLOAT32': 24576.0, 'unsupported': "{'Softmax', 'Pow', 'Constant'}", 'discount_sparsity': True, 'total_bops': 4294967296.0, 'total_mem_w_bits': 0.0, 'total_mem_o_bits': 786432.0}
```
Expected behavior
According to our cost model, the MAC count should be 2x HxTxTxd, which for the given example model is 2x 4x64x64x32 = 1048576.
Actual behavior
The MAC count is reported as 4194304, which is 4x (Hx) our expectation, indicating a cost function of 2x HxHxTxTxd.
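For a side-by-side sanity check, plain arithmetic with the numbers above:

```python
# Plain arithmetic check of the two cost formulas for the example model
H, T, d_head = 4, 64, 32                 # heads, sequence length, dim per head
expected = 2 * H * T * T * d_head        # heads treated as a batch dimension
reported = 2 * H * H * T * T * d_head    # extra factor of H
print(expected, reported)                # 1048576 4194304
```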