Whole-array reductions always use init to start each reduction chain #58241

Open · wants to merge 11 commits into base: master

Conversation

@mbauman (Member) commented Apr 27, 2025

This implements the new init behaviors that were decided in #53945 — that init is used one or more times to start each reduction chain. Unlike the dimensional counterpart (#55318), this can be done without a big refactor, which makes it much easier to evaluate what the core behaviors should be. And, importantly, it makes it much easier for bigger refactorings — like #52397 — to be implemented subsequently. Perhaps most importantly, it's what we need to do in order to officially document that this is how init works. The current docs give us leeway to do this; once it's implemented, we can document the better semantics (and finally merge #53945).

In implementing this, I realized there's still an outstanding fundamental question about how reductions should work. #53945 clarifies how init itself is used, but there's still a question about what needs to happen next.

The crux of it is how reductions should start. I talk in #55318 about the 2*3 different ways in which a reduction may start... and here I codified it all under a single function:

"""
    _mapreduce_start(f, op, A, init, [a1], [a2])

Perform the first step in a mapped reduction over `A` with 0, 1, or more elements.
The two-element method may be called multiple times within a single reduction at
the start of each new chain of `op` calls.
"""
_mapreduce_start(f, op, A, ::_InitialValue) = mapreduce_empty(f, op, eltype(A))
_mapreduce_start(f, op, A, ::_InitialValue, a1) = mapreduce_first(f, op, a1)
_mapreduce_start(f, op, A, ::_InitialValue, a1, a2) = op(f(a1), f(a2))
_mapreduce_start(f, op, A, init) = init
_mapreduce_start(f, op, A, init, a1) = op(init, f(a1))
_mapreduce_start(f, op, A, init, a1, a2) = op(op(init, f(a1)), f(a2))

This brings all this chicanery into one consolidated place — and makes it clear why you need 6 cases when these are the definitions.
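To make the shape concrete, here's a toy sequential reduction built on these six cases. This is a simplified sketch, not Base's actual code: `Init`, `start6`, and `toy_reduce` are made-up names, and the empty/one-element cases use trivial placeholders instead of `mapreduce_empty`/`mapreduce_first`.

```julia
struct Init end  # stand-in for Base's internal _InitialValue sentinel

# the six starting cases, with placeholder empty/one-element behavior
start6(f, op, A, ::Init)         = zero(eltype(A))      # ~ mapreduce_empty
start6(f, op, A, ::Init, a1)     = f(a1)                # ~ mapreduce_first
start6(f, op, A, ::Init, a1, a2) = op(f(a1), f(a2))
start6(f, op, A, init)           = init
start6(f, op, A, init, a1)       = op(init, f(a1))
start6(f, op, A, init, a1, a2)   = op(op(init, f(a1)), f(a2))

function toy_reduce(f, op, A; init = Init())
    n = length(A)
    n == 0 && return start6(f, op, A, init)
    n == 1 && return start6(f, op, A, init, A[begin])
    v = start6(f, op, A, init, A[begin], A[begin+1])  # two elements peeled
    for i in firstindex(A)+2:lastindex(A)
        v = op(v, f(A[i]))
    end
    return v
end

toy_reduce(identity, +, [1, 2, 3, 4])              # 10
toy_reduce(identity, +, [1, 2, 3, 4]; init = 100)  # 110
```

Note how every implementation in this style has to special-case two elements before it can fall into the plain loop.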

But there's (at least) one very interesting test that breaks, and it makes for a perfect example (just ignore the fact that it breaks the "neutral element" init rule; that's actually not the interesting part here!)

julia> @inferred count(false:true, dims = (:), init = 0x0004) === 0x0005
ERROR: return type UInt16 does not match inferred return type Union{UInt16, UInt64}

On master, Julia just happily returns a properly inferred 0x0005. Why did this change? Well, now we can use an init in the pairwise implementation, so this now uses the pairwise implementation (hooray!). But that adds an additional possible branch here. This could be computed (and master does so) as the foldl (0x0004 + false) + true, where you only ever add UInt16 and Bool. Or (with more elements) you could do ((0x0004 + false) + ...) + ((0x0004 + true) + ...), and that final summation is between two UInt16s. And I lied: this isn't computed with +, it's add_sum, which very intentionally promotes that case (but not the other) to UInt. So there's the instability.
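The promotion asymmetry can be seen by calling Base.add_sum by hand. Note add_sum is an internal Base function, so this is illustrative rather than a supported API:

```julia
using Base: add_sum  # internal: the op that sum uses under the hood

# foldl-style chain: a UInt16 accumulator plus Bool elements stays UInt16
chain = add_sum(add_sum(0x0004, false), true)

# pairwise combine step: two UInt16 partial results, which add_sum widens to UInt
combined = add_sum(0x0004 + false, 0x0004 + true)

typeof(chain), typeof(combined)  # (UInt16, UInt64) on 64-bit systems
```

The union of those two types is exactly the Union{UInt16, UInt64} that @inferred complains about.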

But is this actually how we want reductions to start? What if we always called mapreduce_first? I had been thinking it was the analog of empty — in that it only gets called when the entire reduction is just 1 element — and that's also how @stevengj had implemented the streaming iterable pairwise algorithm. But it's actually documented that it "may also be used to initialise the recursion." In other words, maybe the above _mapreduce_start definition should be — more simply:

_mapreduce_start(f, op, A, ::_InitialValue) = mapreduce_empty(f, op, _empty_eltype(A))
_mapreduce_start(f, op, A, ::_InitialValue, a1) = mapreduce_first(f, op, a1)
_mapreduce_start(f, op, A, ::_InitialValue, a1, a2) = op(mapreduce_first(f, op, a1), f(a2))
_mapreduce_start(f, op, A, init) = init
_mapreduce_start(f, op, A, init, a1) = op(init, mapreduce_first(f, op, a1))
_mapreduce_start(f, op, A, init, a1, a2) = op(op(init, mapreduce_first(f, op, a1)), f(a2))

And then you squint at it, and you realize the two-element case is no longer special — it's just the natural "next step" of the reduction after the one-element starting point in both of the init-or-not cases! This is great, as only doing something special to one element before launching into a for loop or recursion is drastically (and perhaps surprisingly) simpler than needing to do something special to two elements. It removes a case from every single implementation, and every single implementation only needs to peel one element out of a loop instead of two. It also fixes one bit of breakage that I initially added here — with the previous definitions, mapreduce_impl requires at least two elements, but it only requires one element on master.
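With that simplification, a sequential fold only has to peel a single element. A sketch with placeholder names (`NoInit`, `startval`, and `peel1_reduce` are made up here; `startval` stands in for the init/mapreduce_first dispatch, and the empty case is omitted):

```julia
struct NoInit end  # stand-in for Base's internal _InitialValue sentinel

startval(f, op, ::NoInit, a1) = f(a1)           # ~ mapreduce_first(f, op, a1)
startval(f, op, init, a1)     = op(init, f(a1))

function peel1_reduce(f, op, A; init = NoInit())
    isempty(A) && error("empty case omitted in this sketch")
    v = startval(f, op, init, A[begin])   # the only special-cased element
    for i in firstindex(A)+1:lastindex(A)
        v = op(v, f(A[i]))
    end
    return v
end

peel1_reduce(identity, +, [1, 2, 3])             # 6
peel1_reduce(identity, +, [1, 2, 3]; init = 10)  # 16
```

Compare this to the earlier six-case scheme: one starting step, one loop, no two-element special case.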

Now the test above is still broken, but it's broken in a new and more interesting way:

julia> @inferred count(false:true, dims = (:), init = 0x0004)
ERROR: return type Int64 does not match inferred return type Union{Int64, UInt16}

There are really only a handful of meaningful things that mapreduce_first does, but one of them is to promote Bools in summations to Int — and count is just a sum of bools. So now the instability is just stemming from our choice of init itself.

So that's now the question: is that a better behavior? Maybe? I've gone through a handful of tests and stdlibs locally, but I'm curious what CI (and eventually PkgEval) says.

This fixes #47216.

@mbauman mbauman added the fold sum, maximum, reduce, foldl, etc. label Apr 27, 2025
@mbauman mbauman marked this pull request as draft April 27, 2025 05:00
@mbauman mbauman added breaking This change will break code and removed breaking This change will break code labels Apr 27, 2025
@adienes (Member) commented Apr 28, 2025

part of me feels like this is a problem with count and not mapreduce

On master, Julia just happily returns a properly inferred 0x005.

is this "properly"? I could totally see an argument that count should always return Int regardless of init type.

@mbauman (Member, Author) commented Apr 28, 2025

It is about mapreduce at the end of the day; count is just the (only!) canary that's squawking. The core behavior change is in both sum([false,true], init=0x00) and mapreduce(identity, +, [false,true], init=0x00).

There are two orthogonal features of the reduction machinery at play. They're also fundamentally where the problem originates in the first place.

  • We want sum to avoid integer overflow. So it uses a special Base.add_sum function as its reduction op. This makes small signed integers (but not including Bool, although ::Bool + ::Bool itself promotes to Int) promote to Int. Perhaps Bool should be included here. That would solve the problem, but that doesn't help us answer what mapreduce_first should do.
  • We want reductions of one element to be type stable when init is not provided — but they cannot call op directly. So mapreduce_first gives ops the opportunity to specify what should happen. In the case of add_sum, it promotes Bool and other small signed integers to Int. But what about other cases — when init is provided and/or there's more than one element? If mapreduce_first was unconditionally called, then that would also solve the problem. In fact, we'd no longer need to define methods on the add_sum function itself; this would be sufficient to get the promoting behaviors we wanted. If mapreduce_first solves the output type stability problem, then that means (definitionally) that it also solves the inner type stability of the reduction variable itself.

I've always thought of mapreduce_first as being solely about output type stability with one element. That indeed was its original raison d'être (#25051). But if it does good work there, why shouldn't we also use it at the beginning of each reduction chain — particularly since the possibility of its use to initialize a reduction chain has always been a part of its docs!

@adienes (Member) commented Apr 28, 2025

ok, I think I'm convinced.

@mbauman (Member, Author) commented Apr 28, 2025

It's still taking me some time to actually figure out what this is doing, but I think I'm getting convinced myself, too. This example works pretty well at expressing all the possibilities in a single concise table.

On Julia v1.12/current master, here's the representative universe of what sum does:

using Base: reduce_empty as r0, reduce_first as r1, add_sum as +ₛ
const t, f = true, false
# #= high-level reduc =# === #= internal API =#   === #= expanding r0/r1/+ₛ =# === #= value =#
sum(Bool[])              === r0(+ₛ, Bool)         === 0
sum([f])                 === r1(+ₛ, f)            === 0
sum([f,t])               ===  f +ₛ t              ===  f + t            === 1
sum([f,t,f,t])           === (f +ₛ t) +ₛ (f +ₛ t) === (f + t) + (f + t) === 2
sum(Bool[],   init=0x00) ===  0x00
sum([f],      init=0x00) ===  0x00 +ₛ f           === 0x00 + f          === 0x00
sum([f,t],    init=0x00) === (0x00 +ₛ f) +ₛ t     === (0x00 + f) + t    === 0x01
sum([f,t,f,t],init=0x00) === (((0x00 +ₛ f) +ₛ t) +ₛ f) +ₛ t === (((0x00 + f) + t) + f) + t === 0x02

If we jam in a pairwise implementation with init like I tried at first, we get that extra promotion from the step that "combines" the two reduction chains. This is the only place where the add_sum function call itself does something different from + in this entire post:

sum([f,t,f,t],init=0x00) === ((0x00 +ₛ f) +ₛ t) +ₛ ((0x00 +ₛ f) +ₛ t) === UInt(0x00 + f + t) + UInt(0x00 + f + t) === UInt(0x01) + UInt(0x01) === UInt(2)

The current state of this PR calls reduce_first in every single case except the empty ones, so the whole scheme becomes:

using Base: reduce_empty as r0, reduce_first as r1, add_sum as +ₛ
const t, f = true, false
# #= high-level reduc =# === #= internal implementation =#         === #= expanding r0/r1/+ₛ =#
sum(Bool[])              === r0(+ₛ, Bool)                          === 0
sum([f])                 === r1(+ₛ, f)                             === 0
sum([f,t])               === r1(+ₛ, f) +ₛ t                        ===  0 + t            === 1
sum([f,t,f,t])           === (r1(+ₛ, f) +ₛ t) +ₛ (r1(+ₛ, f) +ₛ t)  === (0 + t) + (0 + t) === 2
sum(Bool[],   init=0x00) ===   0x00
sum([f],      init=0x00) ===   0x00 +ₛ r1(+ₛ, f)                   ===  0x00 + 0      === 0
sum([f,t],    init=0x00) ===  (0x00 +ₛ r1(+ₛ, f)) +ₛ t             === (0x00 + 0) + t === 1
sum([f,t,f,t],init=0x00) === ((0x00 +ₛ r1(+ₛ, f)) +ₛ t) +ₛ ((0x00 +ₛ r1(+ₛ, f)) +ₛ t) === (0x00 + 0 + t) + (0x00 + 0 + t) === 2

This probably makes it easier to ensure subsequent type stability, because the type of r1 is always a part of the reduction.


It's probably worth noting that these effects are quite limited — there simply aren't many reduce_first definitions in the first place. Base only defines these 9 methods, and there's only a handful of packages in the ecosystem that define similar sum/prod methods.

reduce_first(op, x) = x
reduce_first(::typeof(+), x::Bool) = Int(x)
reduce_first(::typeof(*), x::AbstractChar) = string(x)
reduce_first(::typeof(add_sum), x) = reduce_first(+, x)
reduce_first(::typeof(add_sum), x::BitSignedSmall)   = Int(x)
reduce_first(::typeof(add_sum), x::BitUnsignedSmall) = UInt(x)
reduce_first(::typeof(mul_prod), x) = reduce_first(*, x)
reduce_first(::typeof(mul_prod), x::BitSignedSmall)   = Int(x)
reduce_first(::typeof(mul_prod), x::BitUnsignedSmall) = UInt(x)
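Each of these can be exercised directly. Base.reduce_first is internal, so this is illustrative only:

```julia
using Base: reduce_first, add_sum

reduce_first(+, true)           # 1    (Bool promotes to Int under +)
reduce_first(*, 'J')            # "J"  (a Char starts a string product)
reduce_first(add_sum, Int8(3))  # 3    (small signed widens to Int)
reduce_first(add_sum, 0x03)     # small unsigned widens to UInt
reduce_first(add_sum, 3.0)      # 3.0  (everything else passes through)
```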

@mbauman (Member, Author) commented Apr 28, 2025

Let's see what Nanosoldier thinks. I can still make this less breaking — most notably by relaxing the mapreduce_impl's requirement to have two elements back to one again with that reduce_first simplification — but that's all internalsy stuff anyhow. I'm more curious about the behavior changes.

@nanosoldier runtests()

@nanosoldier (Collaborator)

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

1 packages crashed only on the current version.

  • A segmentation fault happened: 1 packages

6 packages crashed on the previous version too.

✖ Packages that failed

82 packages failed only on the current version.

  • Package fails to precompile: 16 packages
  • Package has test failures: 9 packages
  • Package tests unexpectedly errored: 42 packages
  • Tests became inactive: 1 packages
  • Test duration exceeded the time limit: 12 packages
  • Test log exceeded the size limit: 2 packages

1487 packages failed on the previous version too.

✔ Packages that passed tests

18 packages passed tests only on the current version.

  • Other: 18 packages

5183 packages passed tests on the previous version too.

~ Packages that at least loaded

13 packages successfully loaded only on the current version.

  • Other: 13 packages

2810 packages successfully loaded on the previous version too.

➖ Packages that were skipped altogether

917 packages were skipped on the previous version too.

@mbauman (Member, Author) commented Apr 29, 2025

Ah shucks. Looks like Nanosoldier isn't picking up my Statistics and SparseArrays branches from here. How does that work? Does it always grab the latest release?

@mbauman (Member, Author) commented Apr 30, 2025

Well, there are still a number of meaningful errors here, from most interesting to least:

  • OnlineStats calls reduce(merge!, ks, init = KahanSum(T)), where a single mutable init is updated and then re-used multiple times.
  • AcceleratedKernels calls AK.reduce(+, v; init=Int32(10)). AK.reduce is indeed separate from Base, but it falls back to Base.reduce for CPU backends. AcceleratedKernels.jl/test/runtests.jl
  • Gridap (and deps) have this strange promotion thing going sideways in count. In short, Gridap defines a type VectorValue that's a subtype of Number but is itself iterable and may have more than one element. They run count over these VectorValues, expecting to iterate over the many values in the Number. On master, this misses the mapreduce(f, op, ::Number) specialization because count calls sum with an init. This PR brings init support to that ::Number specialization, which then doesn't iterate multiple values, and then leads to this not working as Gridap expected.
  • DistributedArrays overloads the internal Base._mapreduce and Base.mapreducedim! in ways currently incompatible with this PR and loses their specializations
  • InMemoryDatasets directly calls the internal mapreduce_impl in src/stat/non_hp_stat.jl:165, leading to about a half dozen failures.
  • The vast majority of failures are because Nanosoldier didn't pick up the Statistics branch here, and Statistics directly calls the internal mapreduce_impl
  • A preprocessed png image isn't bitwise identical in TensorBoardLogger?
  • Approximate tolerances exceeded in CopEnt and DirectTrajOpt and AnyMOD
  • StateSpaceDynamics failed to converge
  • TravelingSalesmanExact ended up with an empty solutions vector somehow
  • EarlyStopping looks unrelated
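The Gridap pattern above (a Number subtype that nonetheless iterates multiple values) can be sketched with a made-up stand-in; Vec2 below is hypothetical, not Gridap's actual VectorValue:

```julia
# Hypothetical stand-in for Gridap's VectorValue: a "scalar" that iterates
# two components.
struct Vec2 <: Number
    x::Float64
    y::Float64
end
Base.iterate(v::Vec2, s = 1) = s == 1 ? (v.x, 2) : s == 2 ? (v.y, 3) : nothing
Base.length(::Vec2) = 2
Base.eltype(::Type{Vec2}) = Float64

v = Vec2(1.0, -2.0)
sum(collect(v))  # -1.0: iterating the "scalar" sums its components
sum(v)           # hits mapreduce's ::Number fast path and returns v unchanged
```

Whether a reduction iterates the components or takes the ::Number fast path depends on which specialization it hits, which is exactly what shifts when init support is added to that specialization.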

I think I can fiddle around with some of the internal method signatures to make this a little more kind to the folks playing with fire.

@adienes (Member) commented Apr 30, 2025

while clever, reduce(merge!, ks, init = KahanSum(T)) does not feel legal

@mbauman (Member, Author) commented Apr 30, 2025

Yeah, it's definitely not. They also test foldl (which is ok!) and I also get the answer they expect if I make a slew of copies with reduce((x,y)->merge!(copy(x), y), ks, init = KahanSum(T)) or reduce(merge!, copy.(ks)). It makes you wish for an init=()->KahanSum(T) sort of thing though, doesn't it?
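OnlineStats' KahanSum isn't available here, but the same hazard and workaround can be sketched with any mutable accumulator. merge2! and the Vector accumulator below are made-up stand-ins for OnlineStats.merge! and KahanSum:

```julia
# append!-based stand-in for a mutating merge
merge2!(a, b) = append!(a, b)

chunks = [[1], [2], [3], [4]]

# foldl threads a single accumulator left to right, so a mutable init is fine:
foldl(merge2!, chunks; init = Int[])                         # [1, 2, 3, 4]

# reduce makes no such promise; the safe spelling copies before mutating:
reduce((x, y) -> merge2!(copy(x), y), chunks; init = Int[])  # [1, 2, 3, 4]
```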

It feels like you could almost make this reliable with reduce_first — and looking back now, this is what's discussed in #47696:

julia> foldl(merge!, ks, init = KahanSum(T)) |> sum
250439.49035165994

julia> ks_reduce = reduce((x,y)->merge!(copy(x), y), ks, init = KahanSum(T)) |> sum
250439.49035165994

julia> Base.reduce_first(::typeof(merge!), k::KahanSum) = copy(k) # prevent mutating the input array without init

julia> reduce(merge!, ks) |> sum
250439.49035165994

The trouble is, though, that it doesn't solve this footgun:

julia> reduce(merge!, ks, init=KahanSum(T)) |> sum
8.360444161382082e18

I don't really see a simple solution; I think we'd have to promise so much about the internals in order to have a sensible way to rely on mutation... and even then I think it'd be quite fragile.

mbauman added 3 commits May 1, 2025 15:56
This reverts commit c07a5f7. The Statistics branch is no longer needed; the Sparse branch is still beneficial but not required.
@mbauman (Member, Author) commented May 2, 2025

@nanosoldier runtests(["Statistics", "NonConvexPenalizedRegression", "EarlyStopping", "StandardizedMatrices", "BarkerMCMC", "BoltzmannMachines", "DistributedArrays", "Normalization", "NaNStatistics", "FeatureSelection", "TimeSeries", "OnlineStatsBase", "CoupledFields", "InMemoryDatasets", "StatsBase", "LongDatasetSort", "CartesianJoin", "TimeSeriesEcon", "DLMReader", "CopEnt", "ASCertain", "QuantumPropagators", "GraphicalLasso", "MLJScikitLearnInterface", "OnlineStats", "SpatialDependence", "TensorBoardLogger", "TransitionalMCMC", "LaTeXCompilers", "FastDMTransform", "VectorizedStatistics", "SDeMo", "Tonari", "MDEStudy", "AcceleratedKernels", "PopGenCore", "MFCC", "Pathogen", "GlobalSensitivityAnalysis", "Lasso", "TravelingSalesmanExact", "DMRGenie", "DirectTrajOpt", "SDDP", "HurdleDMR", "GtkObservables", "FlashWeave", "GenomicOffsets", "LiftAndLearn", "TaylorIntegration", "IRKGaussLegendre", "StateSpaceDynamics", "StochasticSeriesExpansion", "AnyMOD", "StatisticalGraphics", "EquilibratedFlux", "DynamicHMC", "RobustModels", "BulkLMM", "PopGen", "GridapDistributed", "GenX", "TinyGibbs", "BoltzmannMachinesPlots", "GridapPETSc", "PosteriorStats", "MCMCChains", "GridapP4est", "NestedSamplers", "GridapSolvers", "OpticalFibers", "DynamicHMCExamples", "GlobalSensitivity", "GridapGmsh", "ReactionNetworkImporters", "JumpProblemLibrary", "DataDrivenSparse", "Chron", "ModiaLang", "Spectra", "GridapMakie", "MaterialPointVisualizer", "BloqadeNoisy"])

there are still tests that use "illegal" init values but happen to be ok due to the number of elements or somesuch; I am intentionally leaving them as they may be helpful to flag if and when further refactorings change these fraught behaviors.
@nanosoldier (Collaborator)

The package evaluation job you requested has completed - possible new issues were detected.
The full report is available.

Report summary

❗ Packages that crashed

1 packages crashed only on the current version.

  • A segmentation fault happened: 1 packages

✖ Packages that failed

15 packages failed only on the current version.

  • Package fails to precompile: 7 packages
  • Package has test failures: 3 packages
  • Package tests unexpectedly errored: 2 packages
  • Test duration exceeded the time limit: 3 packages

8 packages failed on the previous version too.

✔ Packages that passed tests

2 packages passed tests only on the current version.

  • Other: 2 packages

53 packages passed tests on the previous version too.

~ Packages that at least loaded

1 packages successfully loaded only on the current version.

  • Other: 1 packages

3 packages successfully loaded on the previous version too.

@mbauman (Member, Author) commented May 3, 2025

Great, that's much better. We still have Gridap's Numbers of length > 1, and OnlineStats' and AcceleratedKernels' non-neutral inits in their tests, as I'd expect. But we also still have a few less-obvious and harder-to-identify failures: AnyMOD fails to converge (looking for a specific objective value), traveling salesman gets missing solutions in att48.tsp, and StateSpaceDynamics somehow gets non-finite values in some matrix (here).

@mbauman mbauman changed the title WIP: a more principled take on whole-array reduction inits Whole-array reductions always use init to start each reduction chain May 3, 2025
Successfully merging this pull request may close these issues.

sum(f, itr; init) slower than sum(f, itr)