Description
I was inspired by some performance helpdesk discussion over the weekend (xref JuliaLang/julia#55009) and made a very simple reproducer for our matrix-field `getidx`. I recently had a hunch that we should hoist our `eltype` calls, but @dennisYatunin pointed out that `eltype` should simply return a compile-time constant. The benchmark below shows, however, that we're spending most of the time in `eltype`:
```julia
#=
julia --project
using Revise; include(joinpath("test", "Operators", "finitedifference", "getidx.jl"))
=#
using ClimaComms
ClimaComms.@import_required_backends
import BenchmarkTools
import ClimaCore
@isdefined(TU) || include(
    joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"),
);
include(joinpath(pkgdir(ClimaCore), "test", "MatrixFields", "matrix_fields_broadcasting", "test_scalar_utils.jl"))
import .TestUtilities as TU;
using Test
using JET
import ClimaCore.MatrixFields: ⋅
import ClimaCore: Utilities, Spaces, Fields, Operators
import LazyBroadcast: @lazy
function call_getidx(space, bc, loc, idx, hidx)
    @inbounds Operators.getidx(space, bc, loc, idx, hidx)
    return nothing
end
device = ClimaComms.device()
space = TU.CenterExtrudedFiniteDifferenceSpace(
    FT;
    zelem = 30,
    helem = 4,
    context = ClimaComms.context(device),
)
bc = @lazy @. ᶠᶜmat ⋅ ᶜᶠmat ⋅
    (2 * ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat + ᶠᶠmat ⋅ ᶠᶠmat / 3 - (4I,)) ⋅ ᶠᶠmat ⋅
    (ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat * 2 - (ᶠᶠmat / 3) ⋅ ᶠᶠmat + (4I,)) ⋅ ᶠᶠmat;
loc = Operators.Interior()
# idx = 10  # plain integer index (center levels); not used here
idx = Utilities.PlusHalf(10)  # half-level index (face levels), used in the calls below
hidx = (1, 1, 1)
call_getidx(space, bc, loc, idx, hidx)  # warm up (force compilation)
BenchmarkTools.@benchmark call_getidx($space, $bc, $loc, $idx, $hidx)  # interpolate the globals so the benchmark measures the call itself
import Profile, ProfileCanvas
function do_work(space, bc, loc, idx, hidx, n)
    for i in 1:n
        call_getidx(space, bc, loc, idx, hidx)
    end
    return nothing
end
do_work(space, bc, loc, idx, hidx, 1)  # warm up before profiling
Profile.clear()
prof = Profile.@profile do_work(space, bc, loc, idx, hidx, 10^6)
results = Profile.fetch()
Profile.clear()
ProfileCanvas.html_file("flame.html", results)
```
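As a quick follow-up check (a sketch, reusing the `bc`, `space`, `loc`, `idx`, and `hidx` defined above), one can look at whether `eltype(bc)` actually folds to a constant and whether the call contains any runtime dispatch:

```julia
import InteractiveUtils

# If `eltype` really is a compile-time constant here, the typed IR should be a
# single `return` of the element type, with no runtime call left over.
InteractiveUtils.@code_typed eltype(bc)

# JET is already loaded above; this reports any runtime dispatch reachable from
# the call, which is where a non-folded `eltype` would surface.
JET.@report_opt call_getidx(space, bc, loc, idx, hidx)
```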

Does this mean that the compiler is not caching the inference result? @vchuravy
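For reference, `eltype` by itself does fold away when the argument type is concrete, which is what makes the profile surprising; a tiny check outside of ClimaCore (nothing here is specific to `getidx`):

```julia
# For a concrete argument type, `eltype` is resolved at compile time: the typed
# IR below is just `return Float64`, with no call remaining.
f(x) = eltype(x)
InteractiveUtils.@code_typed f(rand(3))
```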
The good news is that:
1) there is clearly room for improving our emitted code (a rough hoisting sketch is below),
2) this is a really nice way to make reproducers that capture the full complexity of `getidx` while only targeting a single point in space, and
3) this impacts both the CPU and the GPU, since these extra instructions could easily increase register usage on the GPU.
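To make the hoisting idea from above concrete, here is a rough sketch of what I mean; `get_value_at` and the two kernels are made-up stand-ins rather than ClimaCore code, and the real change would live inside the broadcast/`getidx` machinery rather than in user code:

```julia
# Made-up example, only to illustrate hoisting `eltype` out of a hot loop.
get_value_at(x, idx) = x[idx]

function kernel_unhoisted!(out, x, n)
    for idx in 1:n
        # If `eltype(x)` does not constant-fold, it is re-evaluated on every
        # iteration (and, on GPU, may cost extra registers).
        out[idx] = convert(eltype(x), get_value_at(x, idx))
    end
    return nothing
end

function kernel_hoisted!(out, x, n)
    T = eltype(x)  # hoisted: evaluated once, outside the loop
    for idx in 1:n
        out[idx] = convert(T, get_value_at(x, idx))
    end
    return nothing
end
```

Whether hoisting is worth doing depends on whether the issue above is a compiler limitation or something in our types: if `eltype` really does fold, the hoisted and unhoisted versions should compile to the same code.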