Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed #7595

Open · wants to merge 6 commits into trunk from use-packed-vector-format

Conversation

@robamler (Contributor) commented Apr 22, 2025

Connections

Description

Ensures that the new naga optimizations added in #7574 can be used by wgpu-core on SPIR-V (these optimizations require SPIR-V language version >= 1.6).

Adds a feature `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is available on `Adapter`s that support the specialized implementations of `dot4I8Packed` and `dot4U8Packed` introduced in #7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with the device extension "VK_KHR_shader_integer_dot_product").

If this feature is available on an `Adapter`, it can be requested during `Device` creation; the device is then set up so that any occurrence of `dot4I8Packed` or `dot4U8Packed` compiles to its specialized instruction. Concretely, on a Vulkan `Device`, the SPIR-V language version is set to 1.6 and the required SPIR-V capabilities are marked as available. On DX12, requesting the feature changes nothing: wgpu-hal appears to already use the highest Shader Model supported by the DX12 library, and availability of the feature already guarantees that the optimizations will be emitted.
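For reference, the semantics that these instructions implement (as specified for WGSL's `dot4I8Packed`/`dot4U8Packed` builtins) can be sketched in plain Rust; the function names below are illustrative, not part of any wgpu API:

```rust
/// Reference semantics of `dot4I8Packed`: interpret each `u32` as four
/// packed signed 8-bit lanes, multiply lane-wise, and sum into an `i32`.
fn dot4_i8_packed(a: u32, b: u32) -> i32 {
    (0..4)
        .map(|i| {
            let x = ((a >> (8 * i)) & 0xff) as u8 as i8 as i32;
            let y = ((b >> (8 * i)) & 0xff) as u8 as i8 as i32;
            x * y
        })
        .sum()
}

/// Unsigned variant `dot4U8Packed`: lanes are unsigned 8-bit integers.
fn dot4_u8_packed(a: u32, b: u32) -> u32 {
    (0..4)
        .map(|i| ((a >> (8 * i)) & 0xff) * ((b >> (8 * i)) & 0xff))
        .sum()
}
```

Without the feature, naga emits a polyfill along these lines; with it, the whole lane loop collapses into a single dot-product instruction on the SPIR-V side.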

I'm not sure if this is the best approach to expose an optimization that is only available for some SPIR-V language versions, and I welcome feedback.

Testing

I'm not sure where and how to add a test for this. Probably somewhere in wgpu-hal? Are there any related tests that I can use as templates?

I did test it in my own use case (warning: messy research code) and found (by stepping through in a debugger) that, as intended, the optimized code gets generated if and only if the feature `NATIVE_PACKED_INTEGER_DOT_PRODUCT` is requested for the `Device` (these tests also confirm that the optimization in #7574 improves performance).

Squash or Rebase?

Single commit

Open Questions

  • Where and how to add tests?
  • What should we do on WebGPU?
    • Should the `Adapter` somehow detect (how?) whether the WebGPU runtime supports the "packed_4x8_integer_dot_product" language extension? If it does, we could make the new feature `NATIVE_PACKED_INTEGER_DOT_PRODUCT` available on the `Adapter` and, if it is requested for a `Device`, translate `dot4I8Packed` and `dot4U8Packed` literally; if the feature is not requested, we would emit a polyfill instead.
    • I'm not sure how to do this and would welcome advice.

Checklist

  • Run `cargo fmt`.
  • Run `taplo format`.
  • Run `cargo clippy --tests`. If applicable, add:
    • `--target wasm32-unknown-unknown`
  • Run `cargo xtask test` to run tests.
  • If this contains user-facing changes, add a CHANGELOG.md entry.

When checking for capabilities in SPIR-V,
`capabilities_available == None` indicates that all capabilities are
available. However, some capabilities are not even defined for all
language versions, so we still need to check if the requested
capabilities even exist in the language version we're using.
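That check could look roughly like the following sketch. The enum and function are simplified stand-ins, not naga's actual types, and the version cutoff encodes this PR's assumption that the integer-dot-product capabilities only exist in core SPIR-V from version 1.6 on:

```rust
// Hypothetical, simplified model of the capability check described above.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Capability {
    DotProduct,
    DotProductInput4x8BitPacked,
}

/// A capability may be used only if (a) the caller allows it
/// (`None` means "all capabilities are available") and (b) the target
/// SPIR-V language version actually defines it.
fn capability_usable(
    cap: Capability,
    capabilities_available: Option<&[Capability]>,
    lang_version: (u8, u8),
) -> bool {
    let allowed = capabilities_available.map_or(true, |caps| caps.contains(&cap));
    // Assumption for this sketch: both integer-dot-product capabilities
    // are only defined from SPIR-V 1.6 onward.
    let defined = match cap {
        Capability::DotProduct | Capability::DotProductInput4x8BitPacked => {
            lang_version >= (1, 6)
        }
    };
    allowed && defined
}
```

Note that the second condition must be checked even when `capabilities_available` is `None`, which is exactly the case the commit above fixes.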
@robamler robamler requested a review from a team as a code owner April 22, 2025 16:47
@robamler robamler changed the title Use packed vector format Allow wgpu-core to use new naga optimizations for dot4{I, U}8Packed Apr 22, 2025
See gfx-rs#7574, in particular
<gfx-rs#7574 (comment)>.

Adds `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is
available on `Adapter`s that support the specialized implementations for
`dot4I8Packed` and `dot4U8Packed` implemented in gfx-rs#7574 (currently, this
includes DX12 with Shader Model >= 6.4 and Vulkan with device extension
"VK_KHR_shader_integer_dot_product").

If this feature is requested during `Device` creation, the device is set
up such that `dot4I8Packed` and `dot4U8Packed` will be compiled to their
respective specialized instructions. This means that, on a Vulkan
`Device`, the SPIR-V language version is set to 1.6, and the required
SPIR-V capabilities are marked as available (on DX12, requesting the
feature doesn't change anything since availability of the feature
already guarantees that Shader Model >= 6.4, which is all we need to
generate specialized code).
@robamler robamler force-pushed the use-packed-vector-format branch from 8594f8f to 33ed6fd on April 22, 2025 16:54
@robamler (Contributor, Author) commented Apr 22, 2025

In case anyone is interested, these are the preliminary benchmarks from our motivating use case (a sequence of low precision integer matrix-vector multiplications, simulating the main bottleneck in on-device inference in large language models).

The table shows throughput in G MAC/s (10^9 multiply-accumulate operations per second, ± standard error; higher is better).

|                                  | Bench 1: uncompressed matrices | Bench 2: compressed matrices |
| -------------------------------- | ------------------------------ | ---------------------------- |
| optimizations in #7574 turned on | 13.94 ± 0.05                   | 15.99 ± 0.21                 |
| optimizations in #7574 turned off | 13.15 ± 0.13                  | 13.43 ± 0.36                 |

More detailed benchmark results also indicate that possible further optimizations of our use case currently seem to suffer from poor performance of `pack4xU8` and `unpack4xU8`, which I think aren't optimized yet in naga. I'll look into these when I find some time.
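For context, `pack4xU8` and `unpack4xU8` just move four 8-bit lanes in and out of a `u32`. A hedged Rust sketch of their semantics (illustrative names; fixed-size arrays standing in for WGSL's `vec4<u32>`):

```rust
/// Reference semantics of `pack4xU8`: pack the low 8 bits of each of the
/// four lanes into one `u32`, with lane 0 in the least significant byte.
fn pack4x_u8(v: [u32; 4]) -> u32 {
    v.iter()
        .enumerate()
        .fold(0u32, |acc, (i, &x)| acc | ((x & 0xff) << (8 * i)))
}

/// Reference semantics of `unpack4xU8`: split a `u32` into its four
/// bytes, zero-extended, with the least significant byte in lane 0.
fn unpack4x_u8(p: u32) -> [u32; 4] {
    [p & 0xff, (p >> 8) & 0xff, (p >> 16) & 0xff, (p >> 24) & 0xff]
}
```

A backend that lacks native pack/unpack instructions has to emit shift-and-mask sequences like these, which is presumably where the overhead comes from.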
