Vectorized Paeth filtering (multiple pixels at the same time)

I've seen the other issue about vectorized Paeth filtering; this is not that.

It's possible to apply the Paeth filter to multiple pixels at once. Not to 4 sequential pixels, which would be the neatest case for SIMD, but to 4 anti-diagonal pixels. Each pixel depends on its left, upper, and upper-left neighbours, so sequential pixels in a row form a dependency chain, whereas pixels on the same anti-diagonal do not depend on each other. 4 anti-diagonal pixels can be collected into a vector relatively efficiently by loading 4 rows of pixels, staggered so that the first row has an x-offset of 3 pixels relative to the last row, and then transposing those 4 rows (similar to `_MM_TRANSPOSE4_PS`, but with integer vectors). That produces 4 sets of 4 anti-diagonal pixels for the price of 8 shuffles. After applying the filter, another 8 shuffles are needed to un-transpose them, plus some additional shuffles to update the "top" (aka B) vector between columns. All that shuffling is not free, but in my tests it was well worth doing.
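To make the dependency argument concrete, here is a minimal scalar sketch (no SIMD) that unfilters a group of Paeth rows in anti-diagonal "wavefront" order and can be compared against plain row-by-row unfiltering. The function names, the `BPP` constant, and the assumption that every row has the same length as the previous row are mine, purely for illustration; this is not the actual implementation.

```rust
/// Paeth predictor as defined by the PNG specification.
fn paeth(a: u8, b: u8, c: u8) -> u8 {
    let (ia, ib, ic) = (a as i16, b as i16, c as i16);
    let p = ia + ib - ic;
    let (pa, pb, pc) = ((p - ia).abs(), (p - ib).abs(), (p - ic).abs());
    if pa <= pb && pa <= pc {
        a
    } else if pb <= pc {
        b
    } else {
        c
    }
}

/// Assumption for this sketch: 32bpp, i.e. 4 bytes per pixel.
const BPP: usize = 4;

/// Reference: unfilter the rows one after another.
/// `prev` is the already reconstructed row above the first row.
fn unfilter_rows_sequential(prev: &[u8], rows: &mut [Vec<u8>]) {
    let mut above = prev.to_vec();
    for row in rows.iter_mut() {
        for i in 0..row.len() {
            let a = if i >= BPP { row[i - BPP] } else { 0 };
            let b = above[i];
            let c = if i >= BPP { above[i - BPP] } else { 0 };
            row[i] = row[i].wrapping_add(paeth(a, b, c));
        }
        above = row.clone();
    }
}

/// Wavefront order: in step `t`, row `r` handles pixel column `t - r`.
/// All pixels touched within one step lie on the same anti-diagonal and
/// only read pixels finished in earlier steps, so a SIMD version could
/// process them together.
fn unfilter_rows_wavefront(prev: &[u8], rows: &mut [Vec<u8>]) {
    let n = rows.len();
    let width = prev.len() / BPP;
    for t in 0..(width + n).saturating_sub(1) {
        for r in 0..n {
            // Row r lags r pixels behind row 0.
            if t < r || t - r >= width {
                continue;
            }
            let x = t - r;
            for ch in 0..BPP {
                let i = x * BPP + ch;
                let a = if x > 0 { rows[r][i - BPP] } else { 0 };
                let b = if r == 0 { prev[i] } else { rows[r - 1][i] };
                let c = if x == 0 {
                    0
                } else if r == 0 {
                    prev[i - BPP]
                } else {
                    rows[r - 1][i - BPP]
                };
                rows[r][i] = rows[r][i].wrapping_add(paeth(a, b, c));
            }
        }
    }
}
```

Calling both functions on copies of the same 4 filtered rows should give identical output, which is the property the vectorized version relies on.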

Some diagrams:

![paeth_load_data](https://github.com/user-attachments/assets/41aa2703-f5b1-4b79-93cd-c7496a9119b0)

![paeth_transposed_data](https://github.com/user-attachments/assets/5e75b03a-278c-472d-8f44-2135885e6482)
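As a rough sketch of the staggered load and the 8-shuffle transpose, assuming 32bpp pixels handled as `u32` lanes and SSE2 on x86_64 (with a scalar fallback elsewhere): the names `load_staggered`, `transpose4x4`, and `transpose4x4_scalar` are hypothetical and not taken from any existing decoder.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Load 4 pixels from each of 4 rows, staggered so that row 0 starts
/// 3 pixels further right than row 3.
pub fn load_staggered(rows: [&[u32]; 4], x: usize) -> [[u32; 4]; 4] {
    let mut out = [[0u32; 4]; 4];
    for r in 0..4 {
        let start = x + 3 - r;
        out[r].copy_from_slice(&rows[r][start..start + 4]);
    }
    out
}

/// Plain scalar transpose, used as a reference and as a fallback.
pub fn transpose4x4_scalar(m: [[u32; 4]; 4]) -> [[u32; 4]; 4] {
    let mut out = [[0u32; 4]; 4];
    for r in 0..4 {
        for c in 0..4 {
            out[c][r] = m[r][c];
        }
    }
    out
}

/// Integer 4x4 transpose using 8 unpack shuffles (SSE2).
#[cfg(target_arch = "x86_64")]
pub fn transpose4x4(m: [[u32; 4]; 4]) -> [[u32; 4]; 4] {
    let mut out = [[0u32; 4]; 4];
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned
    // load/store variants are used on 16-byte arrays.
    unsafe {
        let r0 = _mm_loadu_si128(m[0].as_ptr() as *const __m128i);
        let r1 = _mm_loadu_si128(m[1].as_ptr() as *const __m128i);
        let r2 = _mm_loadu_si128(m[2].as_ptr() as *const __m128i);
        let r3 = _mm_loadu_si128(m[3].as_ptr() as *const __m128i);
        let t0 = _mm_unpacklo_epi32(r0, r1); // a0 b0 a1 b1
        let t1 = _mm_unpackhi_epi32(r0, r1); // a2 b2 a3 b3
        let t2 = _mm_unpacklo_epi32(r2, r3); // c0 d0 c1 d1
        let t3 = _mm_unpackhi_epi32(r2, r3); // c2 d2 c3 d3
        let c0 = _mm_unpacklo_epi64(t0, t2); // a0 b0 c0 d0
        let c1 = _mm_unpackhi_epi64(t0, t2); // a1 b1 c1 d1
        let c2 = _mm_unpacklo_epi64(t1, t3); // a2 b2 c2 d2
        let c3 = _mm_unpackhi_epi64(t1, t3); // a3 b3 c3 d3
        _mm_storeu_si128(out[0].as_mut_ptr() as *mut __m128i, c0);
        _mm_storeu_si128(out[1].as_mut_ptr() as *mut __m128i, c1);
        _mm_storeu_si128(out[2].as_mut_ptr() as *mut __m128i, c2);
        _mm_storeu_si128(out[3].as_mut_ptr() as *mut __m128i, c3);
    }
    out
}

/// Fallback for other targets so the sketch compiles everywhere.
#[cfg(not(target_arch = "x86_64"))]
pub fn transpose4x4(m: [[u32; 4]; 4]) -> [[u32; 4]; 4] {
    transpose4x4_scalar(m)
}
```

After `transpose4x4(load_staggered(rows, x))`, output vector `k` holds the pixels `rows[0][x + 3 + k]`, `rows[1][x + 2 + k]`, `rows[2][x + 1 + k]`, and `rows[3][x + k]`, i.e. one anti-diagonal. The same transpose, applied again, undoes it.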

I've already worked out how to do this and integrate it into a PNG decoder, in the context of the (C++) stb_image PNG decoder. Filtering 4 or 8 rows at the same time does not map very naturally onto the "possibly different filter for each row" nature of the PNG format, but it is still possible. It requires finding 4 or 8 adjacent rows that all have the Paeth filter type, but that appears to be common enough not to invalidate the approach, and I did get wins on real-world files. Not on every file, obviously.
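Finding eligible row groups could look roughly like the sketch below. It assumes the per-row filter-type bytes are already available as a slice, which a streaming decoder may not have without look-ahead; `paeth_groups` is a hypothetical helper name. The value 4 is the Paeth filter type from the PNG specification.

```rust
/// Filter type byte for Paeth, per the PNG specification.
const PAETH: u8 = 4;

/// Returns the starting row index of each non-overlapping group of
/// `group` consecutive rows that all use the Paeth filter.
/// Rows not covered by any group would go through the normal per-row path.
fn paeth_groups(filter_types: &[u8], group: usize) -> Vec<usize> {
    assert!(group > 0, "group size must be positive");
    let mut starts = Vec::new();
    let mut run = 0;
    for (row, &ft) in filter_types.iter().enumerate() {
        // Extend the current run of Paeth rows, or reset it.
        run = if ft == PAETH { run + 1 } else { 0 };
        if run == group {
            starts.push(row + 1 - group);
            run = 0;
        }
    }
    starts
}
```

With `group` set to 4 or 8, the returned indices mark where the multi-row path could be used.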

Though I described primarily how to filter a 32bpp format, a 24bpp format could use the same technique by expanding (on demand, right before the transpose) every 12 bytes to 16 with `pshufb`.
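A minimal sketch of that 12-to-16 byte expansion, assuming SSSE3 is detected at runtime and falling back to scalar code otherwise. The function names are hypothetical. Note that the SIMD path loads 16 bytes even though only 12 are used, so a real decoder would have to make sure that over-read stays inside the buffer.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Scalar reference: spread 4 RGB pixels (first 12 bytes of `src`) into
/// 4-byte slots, leaving the 4th byte of each slot zero.
pub fn expand_rgb_scalar(src: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    for px in 0..4 {
        out[px * 4..px * 4 + 3].copy_from_slice(&src[px * 3..px * 3 + 3]);
    }
    out
}

/// Same expansion with a single `pshufb` (`_mm_shuffle_epi8`).
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "ssse3")]
unsafe fn expand_rgb_ssse3(src: &[u8; 16]) -> [u8; 16] {
    let mut out = [0u8; 16];
    unsafe {
        // A mask byte with the high bit set (-1) produces a zero byte.
        let mask = _mm_setr_epi8(0, 1, 2, -1, 3, 4, 5, -1, 6, 7, 8, -1, 9, 10, 11, -1);
        let v = _mm_loadu_si128(src.as_ptr() as *const __m128i);
        let e = _mm_shuffle_epi8(v, mask);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, e);
    }
    out
}

/// Picks the SSSE3 path when available, otherwise the scalar one.
pub fn expand_rgb(src: &[u8; 16]) -> [u8; 16] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("ssse3") {
            // SAFETY: SSSE3 support was just checked at runtime.
            return unsafe { expand_rgb_ssse3(src) };
        }
    }
    expand_rgb_scalar(src)
}
```

The zero padding byte is harmless for the Paeth filter: a lane whose inputs are all zero stays zero, and it can simply be dropped when packing back to 24bpp.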

This approach requires unsafe code in order to use platform-specific intrinsics, which may not fit with the goals of this library. Without intrinsics I see no way to implement the transposes (autovectorization does not like to do that sort of thing) or the `pshufb` required for 24bpp images. Porting this approach to NEON may be possible, but I'm not a NEON expert and I'm not sure whether it would be worth doing, given the typical latency of 2 cycles for NEON operations and the fairly latency-sensitive nature of this filter (especially the 4-row version; the 8-row version has more ILP to hide that latency). Using Rust's portable SIMD abstraction may also be possible, though it's still marked experimental.

The Avg filter could be vectorized with a similar approach. Even Up and Sub could be, but that's overkill. Filters could also be combined, using vector-select to pick the right filter on a per-row basis, but that adds a bunch of overhead which I expect is not worth it.

Vectorized Paeth filtering (multiple pixels at the same time) #549
