Description
Problem description
TL;DR:

- `rustc` auto-vectorization produces different binary code on Windows than on Linux.
- Turning off the `unstable` feature of the `png` crate (i.e. avoiding `std::simd`) helps.
I have been running Chromium-based PNG benchmarks (see https://crrev.com/c/4860210/25). In the past I have been using my 2023 corpus of 1655 PNG images from the top 500 websites (see here for more details). This week I tried running these microbenchmarks against the 100 biggest PNG images from this corpus on Windows and on Linux, and discovered that the ratio of total-decoding-runtime-when-using-Rust / total-decoding-runtime-when-using-C++ differs considerably across these OS platforms: 79% on Linux (i.e. Rust is ~20% faster) vs 101% on Windows (i.e. Rust is comparable).
I have investigated further by looking at the 1 image with the biggest ratio (color type = 6 / RGBA; bits = 8; width x height = 740 x 3640; all 740 rows use the Paeth filter). For this image, it turned out that unfiltering takes the most time, so I experimented with turning the `unstable` / `std::simd` feature of the `png` crate on and off:
| | Windows | Linux |
|---|---|---|
| `png` features = `["unstable"]` | 147% - 151% | 93% - 94% |
| `png` features = `[]` | 114% - 118% | 94% |
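For context, Paeth unfiltering is the per-byte reconstruction defined by the PNG specification: each filtered byte is adjusted by whichever of the left, above, and upper-left neighbors is closest to `left + above - upper_left`. A scalar sketch of that logic (illustrative only, not the `png` crate's actual implementation):

```rust
// Scalar Paeth predictor from the PNG specification.
// Illustrative sketch only; the `png` crate's real code is more elaborate.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    // a = left, b = above, c = upper-left
    let p = a as i16 + b as i16 - c as i16; // initial estimate
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

// Unfilter one Paeth-filtered row in place, given the previous
// (already-unfiltered) row; `bpp` is bytes per pixel (4 for RGBA8).
fn unfilter_paeth_row(curr: &mut [u8], prev: &[u8], bpp: usize) {
    for i in 0..curr.len() {
        let a = if i >= bpp { curr[i - bpp] } else { 0 };
        let b = prev[i];
        let c = if i >= bpp { prev[i - bpp] } else { 0 };
        curr[i] = curr[i].wrapping_add(paeth_predictor(a, b, c));
    }
}
```

The sequential dependency on the previous pixel in the same row (`curr[i - bpp]`) is what makes this loop hard to vectorize naively, which is why both libpng (`png_read_filter_row_paeth4_sse2`) and the `png` crate resort to specialized SIMD formulations.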
The profiling notes below show how often `png::decoder::unfiltering_buffer::UnfilteringBuffer::unfilter_curr_row` is present in ETW profiling samples on Windows or `perf record -e cpu-clock` samples on Linux (decoding the single file). I am also including `fdeflate::decompress::Decompressor::read` for comparison. The percentages are "inclusive", although this doesn't really matter for `unfilter_curr_row`, which IIUC doesn't call any other functions:
| | `unfilter_curr_row` | `Decompressor::read` |
|---|---|---|
| Windows with `unstable` | 56.3% | 5.7% |
| Windows w/o `unstable` | 43.2% | 7.5% |
| Linux w/o `unstable` | 54.4% | 11.0% |
| Linux with `unstable` | 54.4% | 10.9% |
FWIW profiling C++ on Linux shows `png_read_filter_row_paeth4_sse2.cfi` at 55.1% and `Cr_z_inflate_fast_chunk_` at 11.3%. This suggests to me that Rust's auto-vectorization gives approximately the same speed as direct usage of SIMD intrinsics in the C/C++ code. I am not sure why the Windows numbers look different.
Discussion
AFAIK this is the 2nd time we have found that using `std::simd` negatively affects performance - the last time we discovered this, we limited `std::simd` usage to x86 (see #539 (comment)).
In the last couple of months I tried measuring the impact of the `unstable` feature of the `png` crate, and the overall impact was modest but consistently negative - I recorded a 1.6% regression in row 1.9 from "2024-11-27 - 2024-12-02" and a 0.13% regression in row 1.1.1 from "2024-12-12 - 2024-12-17" in a document here. Back then I didn't act on this data, because the 2 results were different, the delta was relatively modest, and because I don't fully trust my benchmarking methodology (leaning toward attributing small differences to noise).
Based on the results above, I plan to turn off the `unstable` feature of the `png` crate in Chromium (while still using `crc32fast/nightly`).
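Concretely, this amounts to a dependency declaration along the following lines (a sketch; the version numbers are illustrative and Chromium's actual build files declare this differently):

```toml
[dependencies]
# Omit the "unstable" feature so the png crate takes its scalar /
# auto-vectorized code paths instead of std::simd.
png = { version = "0.17", features = [] }
# crc32fast keeps its "nightly" feature enabled.
crc32fast = { version = "1", features = ["nightly"] }
```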
Open questions:
- We may want to consider removing the `std::simd`-based code from the `png` crate. I assume that the Windows-vs-Linux difference is due to how auto-vectorization works, and I am surprised that auto-vectorization of this OS-agnostic/CPU-focused code behaves differently on Windows vs Linux. So maybe there is an opportunity to identify and fix a bug in `rustc` or LLVM here. OTOH, 1) I am not sure how to efficiently narrow the problem down into something that others can debug further, and 2) I am not sure if `rustc`/LLVM wants to (or can) give any guarantees about when auto-vectorization occurs. It is also possible that the difference is not because Rust ends up slower on Windows, but because C++ ends up faster on Windows.
- I am not sure if we want to revise our high-level approach to SIMD:
  - It seems that relying on auto-vectorization produces results comparable to direct usage of SIMD intrinsics (at least on Linux, when the `unstable` feature of the `png` crate is off). From this perspective we can probably say that auto-vectorization works well.
  - OTOH, in theory relying on auto-vectorization should mean being able to work with a single (platform-agnostic) and simple piece of code. And IMO the `png` unfiltering code is quite complex, and different code paths are used on different platforms. Ideally this would be accompanied by automated tests that can catch functional and/or performance regressions on all special-cased platforms, but I am not sure if GitHub supports this.
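As an illustration of the "single, simple, platform-agnostic code" style that auto-vectorization is supposed to enable: a filter without intra-row dependencies, such as the Up filter, reduces to a plain byte-wise wrapping add that LLVM can typically vectorize on its own. This is a toy example, not the `png` crate's unfiltering code:

```rust
// Toy example of relying on auto-vectorization: the PNG "Up" filter is
// just Recon(x) = Filt(x) + Recon(above), a byte-wise wrapping add with
// no dependency between bytes of the same row. A zipped-slice loop like
// this compiles to a bounds-check-free loop that LLVM usually turns into
// SIMD code without any explicit intrinsics or std::simd.
fn unfilter_up_row(curr: &mut [u8], prev: &[u8]) {
    for (c, &p) in curr.iter_mut().zip(prev.iter()) {
        *c = c.wrapping_add(p);
    }
}
```

Whether LLVM actually vectorizes such a loop (and how well) is exactly the kind of thing that can differ between targets and that is hard to pin down without inspecting the generated code on each platform.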
/cc @calebzulawski from https://rust-lang.github.io/rfcs/2977-stdsimd.html (see also the related rust-lang/rfcs#2948)
/cc @veluca93 who has kindly helped me narrow down microbenchmarks-vs-field-trial differences I've observed into the Windows-vs-Linux investigation above