eltype occurs at runtime in getidx for at least some MatrixField operations #1871

Open
@charleskawczynski

I was inspired by some performance helpdesk discussion over the weekend (xref JuliaLang/julia#55009), and made a very simple reproducer for our matrix-field getidx. I recently had a hunch that we should hoist our eltype calls, but @dennisYatunin pointed out that this should simply return a compile-time constant. The benchmark shows, however, that we're spending most of the time in eltype:

#=
julia --project
using Revise; include(joinpath("test", "Operators", "finitedifference", "getidx.jl"))
=#
using ClimaComms
ClimaComms.@import_required_backends
import BenchmarkTools
import ClimaCore
@isdefined(TU) || include(
    joinpath(pkgdir(ClimaCore), "test", "TestUtilities", "TestUtilities.jl"),
);
include(joinpath(pkgdir(ClimaCore), "test", "MatrixFields", "matrix_fields_broadcasting", "test_scalar_utils.jl"))
import .TestUtilities as TU;
using Test
using JET

import ClimaCore.MatrixFields: ⋅
import ClimaCore: Utilities, Spaces, Fields, Operators
import LazyBroadcast: @lazy

function call_getidx(space, bc, loc, idx, hidx)
    @inbounds Operators.getidx(space, bc, loc, idx, hidx)
    return nothing
end

device = ClimaComms.device()
space = TU.CenterExtrudedFiniteDifferenceSpace(
    FT;
    zelem = 30,
    helem = 4,
    context = ClimaComms.context(device),
)
bc = @lazy @. ᶠᶜmat ⋅ ᶜᶠmat ⋅
    (2 * ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat + ᶠᶠmat ⋅ ᶠᶠmat / 3 - (4I,)) ⋅ ᶠᶠmat ⋅
    (ᶠᶜmat ⋅ ᶜᶜmat ⋅ ᶜᶠmat * 2 - (ᶠᶠmat / 3) ⋅ ᶠᶠmat + (4I,)) ⋅ ᶠᶠmat;
loc = Operators.Interior()
idx = 10
idx = Utilities.PlusHalf(10)
hidx = (1, 1, 1)
call_getidx(space, bc, loc, idx, hidx)
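# Interpolate the non-const globals with `$` so that global-variable access is not part of the measurement.
BenchmarkTools.@benchmark call_getidx($space, $bc, $loc, $idx, $hidx)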

import Profile, ProfileCanvas

function do_work(space, bc, loc, idx, hidx, n)
    for i in 1:n
        call_getidx(space, bc, loc, idx, hidx)
    end
    return nothing
end

do_work(space, bc, loc, idx, hidx, 1)
Profile.clear()
prof = Profile.@profile do_work(space, bc, loc, idx, hidx, 10^6)
results = Profile.fetch()
Profile.clear()
ProfileCanvas.html_file("flame.html", results)
[Image: Screenshot 2024-07-08 at 9.41.00 AM]

Does this mean that the compiler is not caching the inference result? @vchuravy
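
One way to probe this question on a small case is to ask the compiler what it infers for an `eltype` call that depends only on the argument's type. The sketch below is not the ClimaCore code path: `ToyLazy` and `get_eltype` are hypothetical stand-ins, and only `code_typed` from the InteractiveUtils standard library is used. If the call folds as expected, the inferred return type should be a concrete `Type{...}` and the typed body should reduce to returning that constant.

```julia
using InteractiveUtils  # provides code_typed outside the REPL

# Hypothetical stand-in for a lazy broadcast node; NOT a ClimaCore type.
# The element type T is carried purely in the type parameters.
struct ToyLazy{T, A}
    args::A
end
ToyLazy{T}(args::A) where {T, A} = ToyLazy{T, A}(args)

# eltype is defined on the type, so it needs no runtime data.
Base.eltype(::Type{<:ToyLazy{T}}) where {T} = T

# Hypothetical helper: its result depends only on typeof(x).
get_eltype(x) = eltype(x)

x = ToyLazy{Float32}((1, 2.0))

# code_typed returns CodeInfo => return-type pairs, one per matching method.
ci, rt = only(code_typed(get_eltype, (typeof(x),)))

@show rt      # inferred return type
println(ci)   # inspect the optimized body for leftover calls
```

Running the same `code_typed` query on `Operators.getidx` with the argument types from the reproducer would be the analogous check, assuming the resulting output is small enough to read.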

The good news is that 1) there is clearly room for improving our emitted code, 2) this is a really nice way to make reproducers that capture the full complexity of getidx while targeting only a single point in space, and 3) this impacts both the CPU and the GPU, since these instructions could easily increase register usage on the GPU.
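
For reference, the "hoist the eltype call" idea mentioned at the top can be sketched on a toy kernel. This is only an illustration of the shape of such a change under simplifying assumptions: `sum_unhoisted`, `sum_hoisted`, and `_sum_hoisted` are hypothetical names, and for a plain `Vector` the two variants are expected to compile to essentially the same code, since `eltype` already folds there.

```julia
# Hypothetical kernel that asks for the element type inside the loop.
function sum_unhoisted(xs)
    acc = zero(eltype(xs))
    for i in eachindex(xs)
        acc += convert(eltype(xs), xs[i])
    end
    return acc
end

# Hoisted variant: eltype is evaluated once, then passed as a Type{T}
# argument so the inner function is specialized on T.
sum_hoisted(xs) = _sum_hoisted(eltype(xs), xs)

function _sum_hoisted(::Type{T}, xs) where {T}
    acc = zero(T)
    for i in eachindex(xs)
        acc += convert(T, xs[i])
    end
    return acc
end

xs = Float32[1, 2, 3]
@show sum_unhoisted(xs) sum_hoisted(xs)
```

Whether hoisting helps in getidx depends on why the `eltype` call survives to runtime there, which this toy does not reproduce.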
