[dl-cifar] Reduce the impact of host computations on benchmark results #96
In the dl-cifar benchmark, especially with a small workload size, host computations account for a significant share of the overall measured computation time. This amplifies any differences in host-code optimisation between icpx, gcc and possibly other host compilers.
The differences between gcc and icpx as host compiler are greatly reduced with these three changes.
1. Use -O3 -ffast-math in all versions
Only the SYCL version used fast math, and only SYCL and HIP used -O3, whereas CUDA used -O2. This created an unfair comparison between the programming models. Align the compilation flags to use -O3 -ffast-math everywhere.
2. Comment out redundant host upsampling
This is already commented out in other places, but one was missed. The redundant upsampling on the host introduces large host overhead in the measured computation times and skews the comparison between offload programming models.
3. Optimise initImage using static cache
The initImage function is called hundreds of times, repeating the same computations on the host each time. The redundant computation introduces large host overhead in the measured computation times and skews the comparison between offload programming models. Furthermore, icpx is far better than gcc at reducing this redundancy, resulting in large performance differences caused entirely by host computation and irrelevant to offloading performance.
Speed up the initImage function by using a static cache in order to minimise the host overheads in the benchmarked computation. This also aligns the host performance between icpx and gcc.
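The static-cache idea can be sketched roughly as follows. This is an illustration only, not the code from this PR: the initImage signature, the computeImage helper, the g_fillCount counter and the fill pattern are all hypothetical stand-ins, and the sketch assumes initImage is called from a single host thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counter, present only to show how often the expensive path runs.
static int g_fillCount = 0;

// Hypothetical stand-in for the expensive host computation that the
// real initImage repeats on every call.
static std::vector<float> computeImage(std::size_t n) {
    ++g_fillCount;
    std::vector<float> img(n);
    for (std::size_t i = 0; i < n; ++i) {
        img[i] = static_cast<float>(i % 256) / 255.0f;
    }
    return img;
}

// Hypothetical initImage: the signature is an assumption.
// The function-local static keeps the computed values alive across calls,
// so repeated calls with the same size reduce to a plain copy.
void initImage(float* image, std::size_t n) {
    assert(image != nullptr || n == 0);
    static std::vector<float> cache;  // not safe for concurrent callers
    if (cache.size() != n) {
        cache = computeImage(n);      // recompute only when the size changes
    }
    std::copy(cache.begin(), cache.end(), image);
}
```

With this shape, calling initImage hundreds of times with the same size runs the expensive loop once and copies the cached values afterwards, so the host cost no longer depends on how well a given host compiler removes the redundancy.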