Allow `wgpu-core` to use new naga optimizations for `dot4{I, U}8Packed` #7595
Conversation
When checking for capabilities in SPIR-V, `capabilities_available == None` indicates that all capabilities are available. However, some capabilities are not even defined in every language version, so we still need to check that the requested capabilities actually exist in the language version we're using.
In case anyone is interested, these are the preliminary benchmarks from our motivating use case (a sequence of low-precision integer matrix-vector multiplications, simulating the main bottleneck of on-device inference in large language models). The table shows throughput in G MAC/s (10^9 multiply-accumulate operations per second ± standard error; higher is better).

More detailed benchmark results also indicate that possible further optimizations of our use case currently seem to suffer from poor performance of …
Connections

- `dot4{I, U}8Packed` on SPIR-V and HLSL: #7574

Description
Ensures that the new naga optimizations added in #7574 can be used by `wgpu-core` on SPIR-V (these optimizations require SPIR-V language version >= 1.6).

Adds a feature `FeaturesWGPU::NATIVE_PACKED_INTEGER_DOT_PRODUCT`, which is available on `Adapter`s that support the specialized implementations for `dot4I8Packed` and `dot4U8Packed` implemented in #7574 (currently, this includes DX12 with Shader Model >= 6.4 and Vulkan with the device extension "VK_KHR_shader_integer_dot_product"). If this feature is available on an `Adapter`, it can be requested during `Device` creation, and the device is then set up such that any occurrences of `dot4I8Packed` and `dot4U8Packed` will be compiled to their respective specialized instructions. This means that, on a Vulkan `Device`, the SPIR-V language version is set to 1.6 and the required SPIR-V capabilities are marked as available (on DX12, requesting the feature doesn't change anything, since it looks like `wgpu-hal` already uses the highest Shader Model supported by the DX12 library, and availability of the new feature already guarantees that the optimizations will be emitted).

I'm not sure whether this is the best approach to expose an optimization that is only available from a certain SPIR-V language version onward, and I welcome feedback.
Testing

I'm not sure where and how to add a test for this. Probably somewhere in `wgpu-hal`? Are there any related tests that I can use as templates?

I did test it in my own use case (warning: messy research code) and found (by stepping through in a debugger) that, as intended, the optimized code gets generated if and only if the feature is requested for the `Device` (these tests also confirm that the optimization in #7574 increases performance).

Squash or Rebase?

Single commit
Open Questions

Is it a good approach to make `NATIVE_PACKED_INTEGER_DOT_PRODUCT` available on the `Adapter` and, if it is requested for a `Device`, translate `dot4I8Packed` and `dot4U8Packed` literally (and emit a polyfill if the feature is not requested for the `Device`)?

Checklist
- Run `cargo fmt`.
- Run `taplo format`.
- Run `cargo clippy --tests`. If applicable, add: `--target wasm32-unknown-unknown`
- Run `cargo xtask test` to run tests.
- Add a `CHANGELOG.md` entry.