
[dl-cifar] Reduce the impact of host computations on benchmark results #96


Open
wants to merge 3 commits into base: main

rafbiels
Contributor

In the dl-cifar benchmark, especially at small workload sizes, host computations account for a significant share of the overall measured computation time. This amplifies any differences in how icpx, gcc and possibly other host compilers optimise host code.

The differences between gcc and icpx as host compiler are greatly reduced with these three changes.

1. Use -O3 -ffast-math in all versions

Only the SYCL version used fast math, and only the SYCL and HIP versions used -O3, whereas the CUDA version used -O2. This created an unfair comparison between the programming models. Align the compilation flags to use -O3 -ffast-math everywhere.
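One way to confirm that every host build really picked up the aligned flags is to query the compiler's predefined macros. The sketch below is not part of this PR; `hostBuildFlagsSummary` is a hypothetical helper name. It relies on `__OPTIMIZE__` and `__FAST_MATH__`, which GCC and Clang-based compilers define when optimisation and `-ffast-math` are enabled. Note that `__OPTIMIZE__` only shows that some optimisation level is active; it cannot distinguish -O2 from -O3.

```cpp
#include <string>

// Hypothetical helper (not from the PR): summarises which host
// optimisation settings were active when this translation unit was built.
std::string hostBuildFlagsSummary() {
    std::string summary;
#if defined(__OPTIMIZE__)
    // Defined by GCC/Clang at -O1 and above; does not reveal the exact level.
    summary += "optimized=yes";
#else
    summary += "optimized=no";
#endif
#if defined(__FAST_MATH__)
    // Defined by GCC/Clang when -ffast-math is in effect.
    summary += " fast-math=yes";
#else
    summary += " fast-math=no";
#endif
    return summary;
}
```

Printing this string once at start-up in each of the SYCL, CUDA and HIP builds would make a flag mismatch visible in the benchmark log.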

2. Comment out redundant host upsampling

The host upsampling is already commented out in other places, but one occurrence was missed. The redundant upsampling on the host adds large host overhead to the measured computation times and skews the comparison between offload programming models.

3. Optimise initImage using static cache

The initImage function is called hundreds of times and repeats the same host computations on each call. This redundant work adds large host overhead to the measured computation times and skews the comparison between offload programming models. Furthermore, icpx is far better than gcc at eliminating this redundancy, which results in large performance differences that are caused entirely by host computation and are irrelevant to offloading performance.

Speed up the initImage function by using a static cache in order to minimise the host overheads in the benchmarked computation. This also aligns the host performance between icpx and gcc.
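The idea of a static cache can be sketched as follows. This is a minimal illustration, not the code in this PR: the signature of `initImageCached`, the helper `computeImage`, the counter `computeCallCount` and the cache key (width and height) are all assumptions made for the example. The expensive host computation runs once per distinct image size; later calls only copy the cached result.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical counter used only to show how often the expensive path runs.
static int g_computeCalls = 0;
int computeCallCount() { return g_computeCalls; }

// Hypothetical stand-in for the repeated host computation in initImage.
std::vector<float> computeImage(std::size_t width, std::size_t height) {
    ++g_computeCalls;
    std::vector<float> image(width * height);
    for (std::size_t i = 0; i < image.size(); ++i) {
        image[i] = std::sin(static_cast<float>(i) * 0.001f);
    }
    return image;
}

// Fills dst (which must hold width * height floats) with the image data.
// The function-local static map persists across calls, so each distinct
// (width, height) pair is computed only once.
// Note: concurrent calls would need external synchronisation.
void initImageCached(float* dst, std::size_t width, std::size_t height) {
    static std::map<std::pair<std::size_t, std::size_t>, std::vector<float>> cache;
    const auto key = std::make_pair(width, height);
    auto it = cache.find(key);
    if (it == cache.end()) {
        it = cache.emplace(key, computeImage(width, height)).first;
    }
    std::copy(it->second.begin(), it->second.end(), dst);
}
```

With this shape, hundreds of calls for the same image size cost one computation plus a copy per call, so the remaining host time depends far less on how well a given host compiler optimises the initialisation loop.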
