[dl-cifar] Reduce the impact of host computations on benchmark results #96

rafbiels · 2025-04-15T16:30:54Z

The dl-cifar benchmark, especially with small workload size, suffers from significant impact of host computations on the overall measured computation time. This exacerbates any differences in compiler optimisations for host code between icpx, gcc and possibly other host compilers.

The differences between gcc and icpx as host compiler are greatly reduced with these three changes.

1. Use -O3 -ffast-math in all versions

Only SYCL version used fast math and only SYCL and HIP used -O3, whereas CUDA used -O2. This created an unfair comparison between the programming models. Align the compilation flags to use -O3 -ffast-math everywhere.

2. Comment out redundant host upsampling

This is already commented out in other places, but one was missed. The redundant upsampling on the host introduces large host overhead in the measured computation times and skews the comparison between offload programming models.

3. Optimise initImage using static cache

The initImage function is called hundreds of times, repeating the same computations on the host each time. The redundant computation introduces large host overhead in the measured computation times and skews the comparison between offload programming models. Furthermore, icpx is far better than gcc at reducing this redundancy, resulting in large performance differences caused entirely by host computation and irrelevant to offloading performance.

Speed up the initImage function by using a static cache in order to minimise the host overheads in the benchmarked computation. This also aligns the host performance between icpx and gcc.

Only SYCL version used fast math and only SYCL and HIP used -O3, whereas CUDA used -O2. This created an unfair comparison between the programming models. Align the compilation flags to use -O3 -ffast-math everywhere.

This is already commented out in other places, but one was missed. The redundant upsampling on the host introduces large host overhead in the measured computation times and skews the comparison between offload programming models.

The initImage function is called hundreds of times, repeating the same computations on the host each time. The redundant computation introduces large host overhead in the measured computation times and skews the comparison between offload programming models. Speed up the initImage function by using a static cache in order to minimise the host overheads in the benchmarked computation.

rafbiels added 3 commits April 15, 2025 16:53

[dl-cifar] Use -O3 -ffast-math in all versions

a777a86

Only SYCL version used fast math and only SYCL and HIP used -O3, whereas CUDA used -O2. This created an unfair comparison between the programming models. Align the compilation flags to use -O3 -ffast-math everywhere.

[dl-cifar] Comment out redundant host upsampling

0bd76bb

This is already commented out in other places, but one was missed. The redundant upsampling on the host introduces large host overhead in the measured computation times and skews the comparison between offload programming models.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[dl-cifar] Reduce the impact of host computations on benchmark results #96

[dl-cifar] Reduce the impact of host computations on benchmark results #96

rafbiels commented Apr 15, 2025

[dl-cifar] Reduce the impact of host computations on benchmark results #96

Are you sure you want to change the base?

[dl-cifar] Reduce the impact of host computations on benchmark results #96

Conversation

rafbiels commented Apr 15, 2025

1. Use -O3 -ffast-math in all versions

2. Comment out redundant host upsampling

3. Optimise initImage using static cache