ARMV8SVE dgemm kernel on Nvidia Grace slightly slower than generic ARMV8 kernel #4440

Closed

Description

giordano (Contributor, Author)

Now that I have a proper build of OpenBLAS 0.3.26 with GCC 11 for Julia (see #4431 for the record), I'm testing it on Nvidia Grace, with surprising results:

$ OPENBLAS_NUM_THREADS=72 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> peakflops(20_000)
1.975233629673878e12

julia> 
$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
2.2233581470163237e12

The source code of the peakflops function is at https://github.com/JuliaLang/julia/blob/b058146cafec8405aa07f9647642fd485ea3b5d7/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L636-L654; it basically runs a double-precision matrix-matrix multiplication a few times (3 by default) using the underlying BLAS library (OpenBLAS by default), measures the elapsed time, and returns the flops. The numbers reported above are reproducible: I always get peakflops on the order of 2.0e12 for armv8sve and 2.2e12 for armv8, so it isn't a one-off result.
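
For reference, here is a minimal sketch of what peakflops does (not the exact stdlib code; it just times a double-precision GEMM through the loaded BLAS and reports the rate of the best run):

using LinearAlgebra

# Minimal sketch of peakflops: time an n×n Float64 matrix multiply a few
# times through the loaded BLAS and report the flop rate of the best run.
function peakflops_sketch(n::Integer=4096; ntrials::Integer=3)
    a = rand(Float64, n, n)
    b = rand(Float64, n, n)
    c = zeros(Float64, n, n)
    t = Inf
    for _ in 1:ntrials
        t = min(t, @elapsed mul!(c, a, b))  # mul! dispatches to BLAS dgemm
    end
    return 2.0 * n^3 / t  # a GEMM performs ~2n^3 floating-point operations
end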

It appears that the armv8sve dgemm kernel is slower (i.e. lower peakflops) than the generic armv8 one.

For the record, I get the same results as with Julia's dynamic-arch build when using a native build of OpenBLAS 0.3.26 on the system, driven by Spack and configured with

 'make' '-j24' '-s' 'CC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gcc' 'FC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=ARMV8SVE' 'USE_LOCKING=1' 'USE_OPENMP=1' 'USE_THREAD=1' 'INTERFACE64=1' 'SYMBOLSUFFIX=64_' 'RANLIB=ranlib' 'NUM_THREADS=512' 'all'

Compiler used on the system is

$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So the problem isn't specific to Julia's build of OpenBLAS.

Activity

giordano (Contributor, Author) commented on Jan 16, 2024

On A64FX, with the same build of Julia as above (BTW, if you're curious you can obtain it from the "Artifacts" tab of https://buildkite.com/julialang/julia-master/builds/32344#018d12b8-4104-4406-b95b-a874b65d18db until the corresponding PR is merged), I get:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(10_000)
6.493188698299631e11

julia> 
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(10_000)
1.276729385867521e11

This is much more reasonable: the armv8sve kernel is considerably faster than the generic one. I obtained the same numbers on both Isambard and Ookami, so this seems consistent across different systems (I haven't tried Fugaku, but could test that too for good measure if it helps).

Edit: using 20k x 20k matrices, for closer comparison with the numbers above:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(20_000)
8.650433108152069e11

julia>
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
1.3525547058847133e11

Edit 2: on Fugaku:

$ OPENBLAS_NUM_THREADS=48 OPENBLAS_VERBOSE=2 ./julia -q
Unknown CPU model - implementer 46 part 1
Core: armv8sve
julia> peakflops(20_000)
9.674598962304862e11

julia>
$ OPENBLAS_NUM_THREADS=48 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
1.5124337976818753e11

martin-frbg (Collaborator) commented on Jan 16, 2024

Hmm, that's a bit disappointing, but I don't think anybody had tested on Neoverse V2 yet. Currently ARMV8SVE uses the same GEMM P and Q that were shown in #4381 to make poor use of the Neoverse V1's cache, though that difference is unlikely to account for as large a speedup as you showed on A64FX (which, incidentally, is the CPU the SVE DGEMM kernel was originally written for).
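
For context, P and Q here are the per-target blocking factors in OpenBLAS's param.h (DGEMM_DEFAULT_P and DGEMM_DEFAULT_Q) that size the matrix panels so they stay cache-resident. A toy Julia sketch of the idea (illustrative only, nothing like the actual assembly kernels):

# Toy blocked GEMM: computes C += A*B while keeping a P×Q panel of A hot,
# loosely mirroring the role of DGEMM_DEFAULT_P/Q in OpenBLAS's param.h.
function blocked_gemm!(C, A, B; P=256, Q=128)
    m, k = size(A)
    for q0 in 1:Q:k                       # Q: depth of the A panel
        qr = q0:min(q0 + Q - 1, k)
        for p0 in 1:P:m                   # P: height of the A panel
            pr = p0:min(p0 + P - 1, m)
            # The P×Q block of A should fit in cache; if P and Q are
            # mis-sized for the core, this inner update thrashes.
            @views C[pr, :] .+= A[pr, qr] * B[qr, :]
        end
    end
    return C
end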

martin-frbg (Collaborator) commented on Jan 16, 2024

In that sense I wonder what you'd get with OPENBLAS_CORETYPE=NEOVERSEV1 instead of the automatic fallback to ARMV8SVE?

giordano (Contributor, Author) commented on Jan 16, 2024

$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=NEOVERSEV1 OPENBLAS_VERBOSE=2 ./julia -q
Core: neoversev1
julia> peakflops(20_000)
2.250280675680324e12

It's similar to the generic one.

Mousius (Contributor) commented on Jan 16, 2024

You should expect similar performance, as SVE and ASIMD are the same width here. I started thinking about how to resolve this in #4397; different implementations have different cache sizes, so it's a bit trickier to specify. For now it'd be safe to use NEOVERSEV1, though.
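
For a sense of scale, a back-of-the-envelope peak estimate; the SVE width of 128 bits follows from the comments here, while the pipe count and clock are illustrative assumptions, not confirmed Grace specifications:

# Rough double-precision peak for a Grace-like chip. The pipe count and
# clock are assumptions for illustration, not vendor-confirmed figures.
cores     = 72
clock_ghz = 3.0       # assumed sustained clock
sve_bits  = 128       # Neoverse V2 SVE width, same as ASIMD
pipes     = 4         # assumed FMA-capable SIMD pipes per core
flops_per_cycle = pipes * (sve_bits ÷ 64) * 2  # lanes × (mul+add per FMA)
peak = cores * clock_ghz * 1e9 * flops_per_cycle
println(peak)  # ≈ 3.5e12, against ≈ 2.2e12 measured with peakflops(20_000)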

martin-frbg (Collaborator) commented on Jan 16, 2024

OK, thanks. At least that trivial tweak of the parameters file would put it ever so slightly ahead... still strange that we see none of the speedup seen on A64FX. (Now I'm wondering if the different number of threads, 48 vs. 72, could play a role, although of course it shouldn't.)

martin-frbg (Collaborator) commented on Jan 16, 2024

Ah, the V2 is no longer 256-bit SVE like the V1...

giordano (Contributor, Author) commented on Jan 16, 2024

> (Now I'm wondering if the different number of threads, 48 vs. 72, could play a role, although of course it shouldn't.)

$ OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> using LinearAlgebra

julia> for threads in (12, 24, 36, 48, 60, 72)
           BLAS.set_num_threads(threads)
           @assert BLAS.get_num_threads() == threads # just to be safe
           @show threads, peakflops(20_000)
       end
(threads, peakflops(20000)) = (12, 3.9528300310608466e11)
(threads, peakflops(20000)) = (24, 7.716683918087646e11)
(threads, peakflops(20000)) = (36, 1.1449729103280854e12)
(threads, peakflops(20000)) = (48, 1.4842484586872903e12)
(threads, peakflops(20000)) = (60, 1.7123865022574075e12)
(threads, peakflops(20000)) = (72, 1.951739414398176e12)

julia> 
$ OPENBLAS_CORETYPE=NEOVERSEV1 OPENBLAS_VERBOSE=2 ./julia -q
Core: neoversev1
julia> using LinearAlgebra

julia> for threads in (12, 24, 36, 48, 60, 72)
           BLAS.set_num_threads(threads)
           @assert BLAS.get_num_threads() == threads # just to be safe
           @show threads, peakflops(20_000)
       end
(threads, peakflops(20000)) = (12, 4.434179171364078e11)
(threads, peakflops(20000)) = (24, 8.723736554950096e11)
(threads, peakflops(20000)) = (36, 1.2741984819577405e12)
(threads, peakflops(20000)) = (48, 1.6504006918821333e12)
(threads, peakflops(20000)) = (60, 1.9505026499028577e12)
(threads, peakflops(20000)) = (72, 2.19212646981316e12)

Thread scaling is very similar with the two kernels; the neoversev1 baseline is just higher.
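
Expressed as parallel efficiency (speedup over the 12-thread run divided by the thread ratio), the two kernels track each other closely; a quick computation from the numbers above:

# Parallel efficiency relative to the 12-thread run, from the data above.
threads    = [12, 24, 36, 48, 60, 72]
armv8sve   = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12,
              1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]
neoversev1 = [4.434179171364078e11, 8.723736554950096e11, 1.2741984819577405e12,
              1.6504006918821333e12, 1.9505026499028577e12, 2.19212646981316e12]
eff(v) = (v ./ v[begin]) ./ (threads ./ threads[begin])
round.(eff(armv8sve); digits=2)    # ≈ [1.0, 0.98, 0.97, 0.94, 0.87, 0.82]
round.(eff(neoversev1); digits=2)  # ≈ [1.0, 0.98, 0.96, 0.93, 0.88, 0.82]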

giordano (Contributor, Author) commented on Jan 16, 2024

And a plot of the above data just to make things more visual:
[plot: measured thread scaling of the ARMV8SVE and NEOVERSEV1 kernels against ideal linear scaling]

Code to generate the plot

using Plots

armv8sve = [3.9528300310608466e11, 7.716683918087646e11, 1.1449729103280854e12, 1.4842484586872903e12, 1.7123865022574075e12, 1.951739414398176e12]
neoversev1 = [4.434179171364078e11, 8.723736554950096e11, 1.2741984819577405e12, 1.6504006918821333e12, 1.9505026499028577e12, 2.19212646981316e12]
threads = [12, 24, 36, 48, 60, 72]

p = plot(threads, threads ./ threads[begin]; linestyle=:dash, label="Ideal", xlabel="Number of threads", ylabel="Scaling", size=(900, 600))
plot!(p, threads, armv8sve ./ armv8sve[begin]; marker=:circle, label="ARMV8SVE")
plot!(p, threads, neoversev1 ./ neoversev1[begin]; marker=:diamond, label="NEOVERSEV1")

brada4 (Contributor) commented on Jan 19, 2024

That's life, according to Amdahl's law.
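
For the record, Amdahl's law says the speedup on n threads is S(n) = 1/(s + (1-s)/n) for a serial fraction s; a rough estimate of s from the measured 12→72-thread speedup of the armv8sve runs above:

# Amdahl's law: speedup S(n) = 1 / (s + (1 - s)/n) for serial fraction s.
# Estimate s from the measured 12→72-thread speedup of the armv8sve runs.
r = 1.951739414398176e12 / 3.9528300310608466e11  # ≈ 4.94× on 6× the threads
n1, n2 = 12, 72
# Solve (s + (1 - s)/n1) = r * (s + (1 - s)/n2) for s:
s = (r / n2 - 1 / n1) / ((1 - 1 / n1) - r * (1 - 1 / n2))
println(s)  # ≈ 0.004, i.e. a serial fraction of roughly 0.4%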

Mousius (Contributor) commented on Jan 19, 2024

@giordano I assume something like #4444 (comment) would fix this temporarily?

brada4 (Contributor) commented on Jan 19, 2024

Is it more like Neoverse V1 or like N2, or is generic ARMV8 faster?

martin-frbg (Collaborator) commented on Jan 26, 2024

Closing as fixed by #4444.
