Description
I was inspired by some performance helpdesk discussion over the weekend (xref JuliaLang/julia#55009) and made a very simple reproducer for our matrix-field `getidx`. I recently had a hunch that we should hoist our `eltype` calls, but @dennisYatunin pointed out that `eltype` should simply return a compile-time constant. The benchmark below shows, however, that we're spending most of the time in `eltype`:
```julia
#=
julia --project
using Revise; include(joinpath("test", "Operators", "finitedifference", "getidx.jl"))
=#
using ClimaComms
ClimaComms.@import_required_backends
import BenchmarkTools
import ClimaCore
@isdefined(TU) || include(
    joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"),
);
include(joinpath(pkgdir(ClimaCore), "test", "MatrixFields", "matrix_fields_broadcasting", "test_scalar_utils.jl"))
import .TestUtilities as TU;
using Test
using JET
import ClimaCore.MatrixFields: ⋅
import ClimaCore: Utilities, Spaces, Fields, Operators
import LazyBroadcast: @lazy
function call_getidx(space, bc, loc, idx, hidx)
    @inbounds Operators.getidx(space, bc, loc, idx, hidx)
    return nothing
end
device = ClimaComms.device()
space = TU.CenterExtrudedFiniteDifferenceSpace(
    FT;
    zelem = 30,
    helem = 4,
    context = ClimaComms.context(device),
)
bc = @lazy @. ᶠᶜmat ⋅ ᶜᶠmat ⋅
    (2 * ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat + ᶠᶠmat ⋅ ᶠᶠmat / 3 - (4I,)) ⋅ ᶠᶠmat ⋅
    (ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat * 2 - (ᶠᶠmat / 3) ⋅ ᶠᶠmat + (4I,)) ⋅ ᶠᶠmat;
loc = Operators.Interior()
# idx = 10  # plain integer index (center levels); not used here
idx = Utilities.PlusHalf(10)  # half-level index (face levels), used in the calls below
hidx = (1, 1, 1)
call_getidx(space, bc, loc, idx, hidx)  # warm up (force compilation)
BenchmarkTools.@benchmark call_getidx($space, $bc, $loc, $idx, $hidx)  # interpolate the globals so the benchmark measures the call itself
import Profile, ProfileCanvas
function do_work(space, bc, loc, idx, hidx, n)
    for i in 1:n
        call_getidx(space, bc, loc, idx, hidx)
    end
    return nothing
end
do_work(space, bc, loc, idx, hidx, 1)  # warm up before profiling
Profile.clear()
prof = Profile.@profile do_work(space, bc, loc, idx, hidx, 10^6)
results = Profile.fetch()
Profile.clear()
ProfileCanvas.html_file("flame.html", results)
```
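As a quick follow-up check (a sketch, reusing the `bc`, `space`, `loc`, `idx`, and `hidx` defined above), one can look at whether `eltype(bc)` actually folds to a constant and whether the call contains any runtime dispatch:

```julia
import InteractiveUtils

# If `eltype` really is a compile-time constant here, the typed IR should be a
# single `return` of the element type, with no runtime call left over.
InteractiveUtils.@code_typed eltype(bc)

# JET is already loaded above; this reports any runtime dispatch reachable from
# the call, which is where a non-folded `eltype` would surface.
JET.@report_opt call_getidx(space, bc, loc, idx, hidx)
```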

Does this mean that the compiler is not caching the inference result? @vchuravy
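For reference, `eltype` by itself does fold away when the argument type is concrete, which is what makes the profile surprising; a tiny check outside of ClimaCore (nothing here is specific to `getidx`):

```julia
# For a concrete argument type, `eltype` is resolved at compile time: the typed
# IR below is just `return Float64`, with no call remaining.
f(x) = eltype(x)
InteractiveUtils.@code_typed f(rand(3))
```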
The good news is that:
1) there is clearly room for improving our emitted code (a rough hoisting sketch is below),
2) this is a really nice way to make reproducers that capture the full complexity of `getidx` while only targeting a single point in space, and
3) this impacts both the CPU and the GPU, since these extra instructions could easily increase register usage on the GPU.
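To make the hoisting idea from above concrete, here is a rough sketch of what I mean; `get_value_at` and the two kernels are made-up stand-ins rather than ClimaCore code, and the real change would live inside the broadcast/`getidx` machinery rather than in user code:

```julia
# Made-up example, only to illustrate hoisting `eltype` out of a hot loop.
get_value_at(x, idx) = x[idx]

function kernel_unhoisted!(out, x, n)
    for idx in 1:n
        # If `eltype(x)` does not constant-fold, it is re-evaluated on every
        # iteration (and, on GPU, may cost extra registers).
        out[idx] = convert(eltype(x), get_value_at(x, idx))
    end
    return nothing
end

function kernel_hoisted!(out, x, n)
    T = eltype(x)  # hoisted: evaluated once, outside the loop
    for idx in 1:n
        out[idx] = convert(T, get_value_at(x, idx))
    end
    return nothing
end
```

Whether hoisting is worth doing depends on whether the issue above is a compiler limitation or something in our types: if `eltype` really does fold, the hoisted and unhoisted versions should compile to the same code.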