Description
Now that I have a proper build of OpenBLAS 0.3.26 with GCC 11 for Julia (see #4431 for the record), I'm testing it on Nvidia Grace, with surprising results:
$ OPENBLAS_NUM_THREADS=72 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8sve
julia> peakflops(20_000)
1.975233629673878e12
julia>
$ OPENBLAS_NUM_THREADS=72 OPENBLAS_CORETYPE=ARMV8 OPENBLAS_VERBOSE=2 ./julia -q
Core: armv8
julia> peakflops(20_000)
2.2233581470163237e12
The source code of the peakflops function is at https://github.com/JuliaLang/julia/blob/b058146cafec8405aa07f9647642fd485ea3b5d7/stdlib/LinearAlgebra/src/LinearAlgebra.jl#L636-L654. It basically runs a double-precision matrix-matrix multiplication a few times (3 by default) using the underlying BLAS library (OpenBLAS by default), measures the time, and returns the flops. The numbers reported above are reproducible: I always get peakflops of the order of 2.0e12 for armv8sve and 2.2e12 for armv8, so it isn't a one-off result.
It appears that the armv8sve dgemm kernel is slower (i.e. lower peakflops) than the generic armv8 one.
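For a sense of scale, the two peakflops figures can be converted back into wall time per multiplication. This is only a rough sketch: it assumes the conventional count of 2n^3 floating-point operations for an n x n matrix product, and the helper name gemm_seconds is made up for illustration.

```shell
# Convert a flops figure back into seconds per n x n matrix multiplication.
# Assumption: one multiplication costs 2 * n^3 floating-point operations.
# gemm_seconds is a hypothetical helper, not part of Julia or OpenBLAS.
gemm_seconds() {
    awk -v n="$1" -v flops="$2" 'BEGIN { printf "%.2f\n", 2 * n * n * n / flops }'
}

gemm_seconds 20000 1.975233629673878e12    # armv8sve result, prints 8.10
gemm_seconds 20000 2.2233581470163237e12   # armv8 result, prints 7.20
```

Under that assumption, the generic armv8 kernel finishes each 20k x 20k multiplication roughly 0.9 seconds sooner than the armv8sve one, which matches the roughly 12.5% gap between the two flops figures.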
For the record, a native build of OpenBLAS 0.3.26 on the same system, driven by Spack, gives the same results as Julia's dynamic arch build. It was configured with
'make' '-j24' '-s' 'CC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gcc' 'FC=/home/cceamgi/repo/spack/lib/spack/env/gcc/gfortran' 'MAKE_NB_JOBS=0' 'ARCH=arm64' 'TARGET=ARMV8SVE' 'USE_LOCKING=1' 'USE_OPENMP=1' 'USE_THREAD=1' 'INTERFACE64=1' 'SYMBOLSUFFIX=64_' 'RANLIB=ranlib' 'NUM_THREADS=512' 'all'
Compiler used on the system is
$ gcc --version
gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
So the problem isn't specific to Julia's build of OpenBLAS.
Activity
giordano commented on Jan 16, 2024
With the same build of Julia as above (BTW, if you're curious you can obtain it from the "Artifacts" tab of https://buildkite.com/julialang/julia-master/builds/32344#018d12b8-4104-4406-b95b-a874b65d18db until the corresponding PR is merged) on A64FX I get:
This is much more reasonable: the armv8sve kernel is noticeably faster than the generic one. I obtained the same numbers on both Isambard and Ookami, so this seems to be consistent across different systems (I haven't tried Fugaku, but could also test that for good measure if it helps).
Edit: using 20k x 20k matrices, for closer comparison with the numbers above:
Edit 2: on Fugaku:
martin-frbg commented on Jan 16, 2024
Hmm, that's a bit disappointing, but I don't think anybody had tested on NeoverseV2 yet, and currently ARMV8SVE would use the same GEMM P and Q that were shown to make poor use of the NeoverseV1's cache in #4381, though the difference is unlikely to lead to as large a speedup as you showed for A64FX (which, incidentally, the SVE DGEMM kernel was originally written for).
martin-frbg commented on Jan 16, 2024
In that sense I wonder what you'd get with OPENBLAS_CORETYPE=NEOVERSEV1 instead of the automatic fallback to ARMV8SVE?
giordano commented on Jan 16, 2024
It's similar to the generic one.
Mousius commented on Jan 16, 2024
You should expect similar performance, as SVE and ASIMD are the same width here. I started thinking about how to resolve this with #4397; different implementations have different cache sizes, so it's a bit trickier to specify. For now it'd be safe to use NEOVERSEV1 though.
martin-frbg commented on Jan 16, 2024
Ok thanks, at least that trivial tweak of the parameters file would put it ever so slightly ahead... still strange that we see none of the speedup of A64FX. (Now I'm wondering if the different number of threads, 48 vs 72, could play a role, although of course it shouldn't.)
martin-frbg commented on Jan 16, 2024
ah, V2 is no longer 256-bit SVE like V1...
giordano commented on Jan 16, 2024
Thread scaling is very similar using the different kernels, but the baseline of neoversev1 is just higher.
giordano commented on Jan 16, 2024
And a plot of the above data just to make things more visual:
brada4 commented on Jan 19, 2024
That's life, according to Amdahl's law.
Mousius commented on Jan 19, 2024
@giordano I assume something like #4444 (comment) would fix this temporarily?
brada4 commented on Jan 19, 2024
Is it more like Neoverse V1, or like N2, or is generic ARMv8 faster?
martin-frbg commented on Jan 26, 2024
closing as fixed by #4444