Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed #7595

Open · wants to merge 6 commits into trunk from use-packed-vector-format

Conversation

@robamler (Contributor) commented Apr 22, 2025

Connections

Description

Ensures that the new naga optimizations added in #7574 can be used by wgpu-core on SPIR-V (these optimizations require SPIR-V language version >= 1.6).

Adds a feature `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is available on `Adapter`s that support the specialized implementations of `dot4I8Packed` and `dot4U8Packed` introduced in #7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with the device extension "VK_KHR_shader_integer_dot_product").

If this feature is available on an `Adapter`, it can be requested during `Device` creation; the device is then set up so that any occurrence of `dot4I8Packed` or `dot4U8Packed` compiles to its specialized instruction. Concretely, on a Vulkan `Device`, the SPIR-V language version is set to 1.6 and the required SPIR-V capabilities are marked as available. On DX12, requesting the feature changes nothing: wgpu-hal appears to already use the highest Shader Model supported by the DX12 library, and availability of the feature already guarantees that the optimizations will be emitted.
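For reference, the semantics that these instructions implement (as specified for WGSL's `dot4I8Packed`/`dot4U8Packed` builtins) can be sketched in plain Rust; the function names below are illustrative, not part of any wgpu API:

```rust
/// Reference semantics of `dot4I8Packed`: interpret each `u32` as four
/// packed signed 8-bit lanes, multiply lane-wise, and sum into an `i32`.
fn dot4_i8_packed(a: u32, b: u32) -> i32 {
    (0..4)
        .map(|i| {
            let x = ((a >> (8 * i)) & 0xff) as u8 as i8 as i32;
            let y = ((b >> (8 * i)) & 0xff) as u8 as i8 as i32;
            x * y
        })
        .sum()
}

/// Unsigned variant `dot4U8Packed`: lanes are unsigned 8-bit integers.
fn dot4_u8_packed(a: u32, b: u32) -> u32 {
    (0..4)
        .map(|i| ((a >> (8 * i)) & 0xff) * ((b >> (8 * i)) & 0xff))
        .sum()
}
```

Without the feature, naga emits a polyfill along these lines; with it, the whole lane loop collapses into a single dot-product instruction on the SPIR-V side.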

I'm not sure if this is the best approach to expose an optimization that is only available for some SPIR-V language versions, and I welcome feedback.

Testing

I'm not sure where and how to add a test for this. Probably somewhere in wgpu-hal? Are there any related tests that I can use as templates?

I did test it in my own use case (warning: messy research code) and found (by stepping through in a debugger) that, as intended, the optimized code gets generated if and only if the feature `NATIVE_PACKED_INTEGER_DOT_PRODUCT` is requested for the `Device` (these tests also confirm that the optimization in #7574 improves performance).

Squash or Rebase?

Single commit

Open Questions

  • Where and how to add tests?
  • What should we do on WebGPU?
    • Should the `Adapter` somehow detect (how?) whether the WebGPU runtime supports the "packed_4x8_integer_dot_product" language extension? If it does, we could make the new feature `NATIVE_PACKED_INTEGER_DOT_PRODUCT` available on the `Adapter` and, if it is requested for a `Device`, translate `dot4I8Packed` and `dot4U8Packed` literally; if the feature is not requested, we would emit a polyfill instead.
    • I'm not sure how to do this and would welcome advice.

Checklist

  • Run `cargo fmt`.
  • Run `taplo format`.
  • Run `cargo clippy --tests`. If applicable, add:
    • `--target wasm32-unknown-unknown`
  • Run `cargo xtask test` to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

When checking for capabilities in SPIR-V,
`capabilities_available == None` indicates that all capabilities are
available. However, some capabilities are not even defined for all
language versions, so we still need to check if the requested
capabilities even exist in the language version we're using.
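That check could look roughly like the following sketch. The enum and function are simplified stand-ins, not naga's actual types, and the version cutoff encodes this PR's assumption that the integer-dot-product capabilities only exist in core SPIR-V from version 1.6 on:

```rust
// Hypothetical, simplified model of the capability check described above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Capability {
    DotProduct,
    DotProductInput4x8BitPacked,
}

/// A capability may be used only if (a) the caller allows it
/// (`None` means "all capabilities are available") and (b) the target
/// SPIR-V language version actually defines it.
fn capability_usable(
    cap: Capability,
    capabilities_available: Option<&[Capability]>,
    lang_version: (u8, u8),
) -> bool {
    let allowed = capabilities_available.map_or(true, |caps| caps.contains(&cap));
    // Assumption for this sketch: both integer-dot-product capabilities
    // are only defined from SPIR-V 1.6 onward.
    let defined = match cap {
        Capability::DotProduct | Capability::DotProductInput4x8BitPacked => {
            lang_version >= (1, 6)
        }
    };
    allowed && defined
}
```

Note that the second condition must be checked even when `capabilities_available` is `None`, which is exactly the case the commit above fixes.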
@robamler robamler requested a review from a team as a code owner April 22, 2025 16:47
@robamler robamler changed the title Use packed vector format Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed Apr 22, 2025
See gfx-rs#7574, in particular
<gfx-rs#7574 (comment)>.

Adds `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is
available on `Adapter`s that support the specialized implementations for
`dot4I8Packed` and `dot4U8Packed` implemented in gfx-rs#7574 (currently, this
includes DX12 with Shader Model >= 6.4 and Vulkan with device extension
"VK_KHR_shader_integer_dot_product").

If this feature is requested during `Device` creation, the device is set
up such that `dot4I8Packed` and `dot4U8Packed` will be compiled to their
respective specialized instructions. This means that, on a Vulkan
`Device`, the SPIR-V language version is set to 1.6, and the required
SPIR-V capabilities are marked as available (on DX12, requesting the
feature doesn't change anything since availability of the feature
already guarantees that Shader Model >= 6.4, which is all we need to
generate specialized code).
@robamler robamler force-pushed the use-packed-vector-format branch from 8594f8f to 33ed6fd on April 22, 2025 16:54
@robamler (Contributor, Author) commented Apr 22, 2025

In case anyone is interested, these are the preliminary benchmarks from our motivating use case (a sequence of low precision integer matrix-vector multiplications, simulating the main bottleneck in on-device inference in large language models).

The table shows throughput in G MAC/s (10^9 multiply-accumulate operations per second, ± standard error; higher is better).

|                                  | Bench 1: uncompressed matrices | Bench 2: compressed matrices |
| -------------------------------- | ------------------------------ | ---------------------------- |
| optimizations in #7574 turned on | 13.94 ± 0.05                   | 15.99 ± 0.21                 |
| optimizations in #7574 turned off | 13.15 ± 0.13                  | 13.43 ± 0.36                 |

More detailed benchmark results also indicate that possible further optimizations of our use case currently seem to suffer from poor performance of `pack4xU8` and `unpack4xU8`, which I think aren't optimized yet in naga. I'll look into these when I find some time.
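For context, `pack4xU8` and `unpack4xU8` just move four 8-bit lanes in and out of a `u32`. A hedged Rust sketch of their semantics (illustrative names; fixed-size arrays standing in for WGSL's `vec4<u32>`):

```rust
/// Reference semantics of `pack4xU8`: pack the low 8 bits of each of the
/// four lanes into one `u32`, with lane 0 in the least significant byte.
fn pack4x_u8(v: [u32; 4]) -> u32 {
    v.iter()
        .enumerate()
        .fold(0u32, |acc, (i, &x)| acc | ((x & 0xff) << (8 * i)))
}

/// Reference semantics of `unpack4xU8`: split a `u32` into its four
/// bytes, zero-extended, with the least significant byte in lane 0.
fn unpack4x_u8(p: u32) -> [u32; 4] {
    [p & 0xff, (p >> 8) & 0xff, (p >> 16) & 0xff, (p >> 24) & 0xff]
}
```

A backend that lacks native pack/unpack instructions has to emit shift-and-mask sequences like these, which is presumably where the overhead comes from.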
