Unfiltering is slower on Windows than on Linux (especially with std::simd) #567

Open
@anforowicz

Description

Problem description

TL;DR:

  • rustc auto-vectorization produces different binary code on Windows than on Linux
  • Turning off the unstable feature of the png crate (i.e. avoiding std::simd) helps

I have been running Chromium-based PNG benchmarks (see https://crrev.com/c/4860210/25). In the past I have been using my 2023 corpus of 1655 PNG images from top 500 websites (see here for more details). This week I tried running these microbenchmarks against the 100 biggest PNG images from this corpus on Windows and on Linux, and discovered that the ratio total-decoding-runtime-when-using-Rust / total-decoding-runtime-when-using-C++ differs considerably across these OS platforms: 79% on Linux (i.e. Rust is ~20% faster) vs 101% on Windows (i.e. Rust is comparable).

I have investigated further by looking at 1 image with the biggest ratio (color type = 6 / RGBA; bits = 8; width x height = 740 x 3640; all 740 rows use Paeth filter). For this image, it turned out that unfiltering takes the most time and so I experimented with turning on and off the unstable / std::simd feature of the png crate:

|                             | Windows     | Linux     |
|-----------------------------|-------------|-----------|
| png features = ['unstable'] | 147% - 151% | 93% - 94% |
| png features = []           | 114% - 118% | 94%       |
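For context, Paeth unfiltering reconstructs each output byte from its left, up, and upper-left neighbors. A minimal scalar sketch of the algorithm as defined in the PNG specification (this is illustrative, not the png crate's actual implementation; the function names and the `bpp` parameter, bytes per pixel, are made up here):

```rust
// Paeth predictor per the PNG spec: pick whichever of a (left), b (up),
// c (upper-left) is closest to p = a + b - c, with ties broken a, b, c.
fn paeth_predictor(a: u8, b: u8, c: u8) -> u8 {
    let (a, b, c) = (a as i16, b as i16, c as i16);
    let p = a + b - c;
    let (pa, pb, pc) = ((p - a).abs(), (p - b).abs(), (p - c).abs());
    if pa <= pb && pa <= pc {
        a as u8
    } else if pb <= pc {
        b as u8
    } else {
        c as u8
    }
}

// Unfilter one Paeth-filtered row in place. `prev` is the already
// unfiltered previous row (same length as `curr`); `bpp` is bytes per
// pixel (4 for RGBA8). Pixels left of the row start count as zero.
fn unfilter_paeth_row(prev: &[u8], curr: &mut [u8], bpp: usize) {
    for i in 0..curr.len() {
        let left = if i >= bpp { curr[i - bpp] } else { 0 };
        let up = prev[i];
        let up_left = if i >= bpp { prev[i - bpp] } else { 0 };
        curr[i] = curr[i].wrapping_add(paeth_predictor(left, up, up_left));
    }
}
```

Note the loop-carried dependency: `curr[i]` reads the just-written `curr[i - bpp]`. That dependency is what makes Paeth awkward for auto-vectorization and why both libpng (`png_read_filter_row_paeth4_sse2`) and the png crate reach for explicit SIMD here.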

Profiling notes below show how often png::decoder::unfiltering_buffer::UnfilteringBuffer::unfilter_curr_row is present in ETW/profiling samples on Windows or in perf record -e cpu-clock samples on Linux (decoding the single file). I am also including fdeflate::decompress::Decompressor::read for comparison. The percentages are "inclusive", although this doesn't really matter for unfilter_curr_row, which IIUC doesn't call any other functions:

|                       | unfilter_curr_row | Decompressor::read |
|-----------------------|-------------------|--------------------|
| Windows with unstable | 56.3%             | 5.7%               |
| Windows w/o unstable  | 43.2%             | 7.5%               |
| Linux w/o unstable    | 54.4%             | 11.0%              |
| Linux with unstable   | 54.4%             | 10.9%              |

FWIW profiling C++ on Linux shows png_read_filter_row_paeth4_sse2.cfi at 55.1% and Cr_z_inflate_fast_chunk_ at 11.3%. This suggests to me that Rust's auto-vectorization gives approximately the same speed as direct usage of SIMD intrinsics in the C/C++ code. I am not sure why Windows numbers look different.

Discussion

AFAIK this is the 2nd time we have found that using std::simd negatively affects performance - the last time we discovered this, we limited std::simd usage to x86 (see #539 (comment)).

In the last couple of months I tried measuring the impact of the unstable feature of the png crate; the overall impact was modest, but consistently negative - I recorded a 1.6% regression in row 1.9 from "2024-11-27 - 2024-12-02" and a 0.13% regression in row 1.1.1 from "2024-12-12 - 2024-12-17" in a document here. Back then I didn't act on this data, because the 2 results differed, the deltas were relatively modest, and I don't fully trust my benchmarking methodology (I lean toward attributing small differences to noise).

Based on the results above, I plan to turn off the unstable feature of the png crate in Chromium. (But still using crc32fast/nightly.)
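In Cargo.toml terms, the planned change roughly amounts to something like the following sketch (version numbers are placeholders, not Chromium's actual dependency spec):

```toml
[dependencies]
# No "unstable" feature => the png crate's std::simd code paths stay off.
png = "0.17"
# crc32fast's nightly feature is kept.
crc32fast = { version = "1", features = ["nightly"] }
```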

Open questions:

  • We may want to consider removing the std::simd-based code from the png crate
  • I assume that the Windows-vs-Linux difference is due to how auto-vectorization works, and I am surprised that auto-vectorization of this OS-agnostic/CPU-focused code behaves differently on Windows vs Linux. So maybe there is an opportunity to identify and fix a bug in rustc or LLVM here. OTOH, 1) I am not sure how to efficiently narrow the problem down into something that others can debug further, and 2) I am not sure whether rustc / LLVM wants to (or can) give any guarantees about when auto-vectorization occurs. It is also possible that the difference is not that Rust ends up slower on Windows, but that C++ ends up faster on Windows.
  • I am not sure if we want to revise our high-level approach to SIMD
    • It seems that relying on auto-vectorization produces results comparable to direct usage of SIMD intrinsics (at least on Linux, when unstable feature of the png crate is off). From this perspective we can probably say that auto-vectorization works well.
    • OTOH, in theory relying on auto-vectorization should mean being able to work with a single (platform-agnostic) and simple implementation. And IMO the png unfiltering code is quite complex, and different code paths are used on different platforms. Ideally this would be accompanied by automated tests that can catch functional and/or performance regressions on all special-cased platforms, but I am not sure if GitHub supports this.
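To illustrate the auto-vectorization side of the trade-off: a filter with no intra-row dependency, such as the Up filter, needs nothing more than a plain scalar loop, which LLVM readily auto-vectorizes on any target. A hypothetical sketch (function name is made up; this is not the png crate's code):

```rust
// Up filter: each output byte is curr[i] + prev[i] mod 256. There is no
// dependency between iterations, so LLVM can turn this scalar loop into
// wide SIMD on any target without intrinsics or std::simd.
fn unfilter_up_row(prev: &[u8], curr: &mut [u8]) {
    for (c, &p) in curr.iter_mut().zip(prev.iter()) {
        *c = c.wrapping_add(p);
    }
}
```

Paeth is the hard case precisely because it lacks this shape: the dependency on the pixel to the left serializes the loop, so it is the filter where explicit SIMD (intrinsics or std::simd) has been worth the extra platform-specific complexity.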

/cc @calebzulawski from https://rust-lang.github.io/rfcs/2977-stdsimd.html (see also the related rust-lang/rfcs#2948)
/cc @veluca93 who has kindly helped me narrow down microbenchmarks-vs-field-trial differences I've observed into the Windows-vs-Linux investigation above
