Description
Problem description
TL;DR:

- `rustc` auto-vectorization produces different binary code on Windows than on Linux.
- Turning off the `unstable` feature of the `png` crate (i.e. avoiding `std::simd`) helps.
I have been running Chromium-based PNG benchmarks (see https://crrev.com/c/4860210/25). In the past I have been using my 2023 corpus of 1655 PNG images from the top 500 websites (see here for more details). This week I tried running these microbenchmarks against the 100 biggest PNG images from this corpus on Windows and on Linux, and discovered that the ratio of total-decoding-runtime-when-using-Rust / total-decoding-runtime-when-using-C++ differs considerably across these OS platforms: 79% on Linux (i.e. Rust is ~20% faster) vs 101% on Windows (i.e. Rust is comparable).
I have investigated further by looking at the 1 image with the biggest ratio (color type = 6 / RGBA; bits = 8; width x height = 740 x 3640; all 740 rows use the Paeth filter). For this image, it turned out that unfiltering takes the most time, so I experimented with turning the `unstable` / `std::simd` feature of the `png` crate on and off:
| | Windows | Linux |
|---|---|---|
| `png` features = `["unstable"]` | 147% - 151% | 93% - 94% |
| `png` features = `[]` | 114% - 118% | 94% |
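For context, Paeth unfiltering is the per-byte reconstruction defined by the PNG specification: each filtered byte is adjusted by whichever of the left, above, and upper-left neighbors is closest to `left + above - upper_left`. A scalar sketch of that logic (illustrative only, not the `png` crate's actual implementation):

```rust
// Scalar Paeth predictor from the PNG specification.
// Illustrative sketch only; the `png` crate's real code is more elaborate.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    // a = left, b = above, c = upper-left
    let p = a as i16 + b as i16 - c as i16; // initial estimate
    let pa = (p - a as i16).abs();
    let pb = (p - b as i16).abs();
    let pc = (p - c as i16).abs();
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

// Unfilter one Paeth-filtered row in place, given the previous
// (already-unfiltered) row; `bpp` is bytes per pixel (4 for RGBA8).
fn unfilter_paeth_row(curr: &mut [u8], prev: &[u8], bpp: usize) {
    for i in 0..curr.len() {
        let a = if i >= bpp { curr[i - bpp] } else { 0 };
        let b = prev[i];
        let c = if i >= bpp { prev[i - bpp] } else { 0 };
        curr[i] = curr[i].wrapping_add(paeth_predictor(a, b, c));
    }
}
```

The sequential dependency on the previous pixel in the same row (`curr[i - bpp]`) is what makes this loop hard to vectorize naively, which is why both libpng (`png_read_filter_row_paeth4_sse2`) and the `png` crate resort to specialized SIMD formulations.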
The profiling notes below show how often `png::decoder::unfiltering_buffer::UnfilteringBuffer::unfilter_curr_row` is present in ETW profiling samples on Windows or `perf record -e cpu-clock` samples on Linux (decoding the single file). I am also including `fdeflate::decompress::Decompressor::read` for comparison. The percentages are "inclusive", although this doesn't really matter for `unfilter_curr_row`, which IIUC doesn't call any other functions:
| | `unfilter_curr_row` | `Decompressor::read` |
|---|---|---|
| Windows with `unstable` | 56.3% | 5.7% |
| Windows w/o `unstable` | 43.2% | 7.5% |
| Linux w/o `unstable` | 54.4% | 11.0% |
| Linux with `unstable` | 54.4% | 10.9% |
FWIW profiling C++ on Linux shows `png_read_filter_row_paeth4_sse2.cfi` at 55.1% and `Cr_z_inflate_fast_chunk_` at 11.3%. This suggests to me that Rust's auto-vectorization gives approximately the same speed as direct usage of SIMD intrinsics in the C/C++ code. I am not sure why the Windows numbers look different.
Discussion
AFAIK this is the 2nd time we have found that using `std::simd` negatively affects performance - the last time we discovered this, we limited `std::simd` usage to x86 (see #539 (comment)).
In the last couple of months I tried measuring the impact of the `unstable` feature of the `png` crate, and the overall impact was modest but consistently negative - I recorded a 1.6% regression in row 1.9 from "2024-11-27 - 2024-12-02" and a 0.13% regression in row 1.1.1 from "2024-12-12 - 2024-12-17" in a document here. Back then I didn't act on this data, because the 2 results were different, the delta was relatively modest, and because I don't fully trust my benchmarking methodology (leaning toward attributing small differences to noise).
Based on the results above, I plan to turn off the `unstable` feature of the `png` crate in Chromium (while still using `crc32fast/nightly`).
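Concretely, this amounts to a dependency declaration along the following lines (a sketch; the version numbers are illustrative and Chromium's actual build files declare this differently):

```toml
[dependencies]
# Omit the "unstable" feature so the png crate takes its scalar /
# auto-vectorized code paths instead of std::simd.
png = { version = "0.17", features = [] }
# crc32fast keeps its "nightly" feature enabled.
crc32fast = { version = "1", features = ["nightly"] }
```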
Open questions:
- We may want to consider removing the `std::simd`-based code from the `png` crate. I assume that the Windows-vs-Linux difference is due to how auto-vectorization works, and I am surprised that auto-vectorization of this OS-agnostic/CPU-focused code behaves differently on Windows vs Linux. So maybe there is an opportunity to identify and fix a bug in `rustc` or LLVM here. OTOH, 1) I am not sure how to efficiently narrow the problem down into something that others can debug further, and 2) I am not sure if `rustc`/LLVM wants to (or can) give any guarantees about when auto-vectorization occurs. It is also possible that the difference is not because Rust ends up slower on Windows, but because C++ ends up faster on Windows.
- I am not sure if we want to revise our high-level approach to SIMD:
  - It seems that relying on auto-vectorization produces results comparable to direct usage of SIMD intrinsics (at least on Linux, when the `unstable` feature of the `png` crate is off). From this perspective we can probably say that auto-vectorization works well.
  - OTOH, in theory relying on auto-vectorization should mean being able to work with a single (platform-agnostic) and simple piece of code. And IMO the `png` unfiltering code is quite complex, and different code paths are used on different platforms. Ideally this would be accompanied by automated tests that can catch functional and/or performance regressions on all special-cased platforms, but I am not sure if GitHub supports this.
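As an illustration of the "single, simple, platform-agnostic code" style that auto-vectorization is supposed to enable: a filter without intra-row dependencies, such as the Up filter, reduces to a plain byte-wise wrapping add that LLVM can typically vectorize on its own. This is a toy example, not the `png` crate's unfiltering code:

```rust
// Toy example of relying on auto-vectorization: the PNG "Up" filter is
// just Recon(x) = Filt(x) + Recon(above), a byte-wise wrapping add with
// no dependency between bytes of the same row. A zipped-slice loop like
// this compiles to a bounds-check-free loop that LLVM usually turns into
// SIMD code without any explicit intrinsics or std::simd.
fn unfilter_up_row(curr: &mut [u8], prev: &[u8]) {
    for (c, &p) in curr.iter_mut().zip(prev.iter()) {
        *c = c.wrapping_add(p);
    }
}
```

Whether LLVM actually vectorizes such a loop (and how well) is exactly the kind of thing that can differ between targets and that is hard to pin down without inspecting the generated code on each platform.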
/cc @calebzulawski from https://rust-lang.github.io/rfcs/2977-stdsimd.html (see also the related rust-lang/rfcs#2948)
/cc @veluca93 who has kindly helped me narrow down microbenchmarks-vs-field-trial differences I've observed into the Windows-vs-Linux investigation above