An easy way to check for statistically significant difference between benchmarks #786

Open

Description

@AndreyAkinshin

Currently, we have the optional WelchTTestPValueColumn, which helps you verify that there is a statistically significant difference between benchmarks. However, it doesn't work well with the default run strategy because this strategy typically doesn't perform enough iterations. Users have to choose a sufficient number of iterations manually. So such checks are possible, but the user experience is not good enough. We can do the following:

  • Introduce an additional property in AccuracyMode. Let's call it StopCriterion (let me know if you have better ideas about naming). It will contain the logic that decides when we have enough iterations (a rough sketch of what this could look like follows this list).
  • Currently, this logic is hardcoded inside EngineTargetStage. Let's move it to a class called StdErrStopCriterion.
  • We can introduce a WelchStopCriterion which will request additional iterations until we are sure that we have enough data for Welch's two-sample t-test. (Bonus: users will be able to write their own criteria.)
  • StopCriterion should be able to affect IOrderProvider.GetExecutionOrder and ask to run the baseline benchmarks first.
  • EngineTargetStage.RunAuto should get additional information about benchmarks, such as the IsBaseline value. The non-baseline benchmarks should get all measurements from the baseline benchmark in the corresponding group.
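
A rough sketch of the shape such an abstraction could take; the interface, member names, and thresholds below are hypothetical illustrations, not existing BenchmarkDotNet APIs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical contract: the engine asks the criterion after each workload
// iteration whether the collected measurements are sufficient.
public interface IStopCriterion
{
    // measurements: iteration times for the current benchmark (e.g. ns/op);
    // baselineMeasurements: empty when there is no baseline to compare against.
    bool IsSatisfied(IReadOnlyList<double> measurements,
                     IReadOnlyList<double> baselineMeasurements);
}

// Roughly the kind of logic that is currently hardcoded in EngineTargetStage:
// stop once the standard error is small relative to the mean.
// The thresholds below are illustrative, not the actual BenchmarkDotNet defaults.
public class StdErrStopCriterion : IStopCriterion
{
    private readonly double maxRelativeError;
    private readonly int minIterations;

    public StdErrStopCriterion(double maxRelativeError = 0.02, int minIterations = 15)
    {
        this.maxRelativeError = maxRelativeError;
        this.minIterations = minIterations;
    }

    public bool IsSatisfied(IReadOnlyList<double> measurements,
                            IReadOnlyList<double> baselineMeasurements)
    {
        int n = measurements.Count;
        if (n < minIterations)
            return false;

        double mean = measurements.Average();
        double variance = measurements.Sum(x => (x - mean) * (x - mean)) / (n - 1);
        double stdErr = Math.Sqrt(variance / n);
        return stdErr / mean < maxRelativeError;
    }
}
```

A WelchStopCriterion would implement the same interface but use both samples, which is why the baseline measurements are passed in alongside the benchmark's own.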

Original request: https://twitter.com/AnthonyLloyd123/status/1005388154046644230

Activity

AndreyAkinshin (Member, Author) commented on Jun 9, 2018

Another idea by Anthony Lloyd:

Perhaps, as a first step, a warning if the values are too close to be reliable at a given confidence. A smart Welch-based criterion can make a run very long if the tests are close.

caaavik-msft (Contributor) commented on Feb 11, 2025

I've been looking into this particular issue recently, and the way I see it there are two use cases to consider:

  1. Comparing two benchmark cases
  2. Comparing a benchmark case against historical data collected previously

The first option applies when you can run both implementations side by side, while the second may be needed when you don't have an easy way to rerun the baseline (e.g. running in CI against the latest commit but comparing to previously recorded versions).

Since BDN runs all the test cases sequentially and the stopping criterion is scoped to a single test run, it can only be used to solve the 2nd use case today with a custom IStoppingCriteria. The first use case is not possible to implement, since we would need the measurements of both cases to decide when to stop running iterations. Ultimately, we would need some way to run the benchmark cases in an alternating order so that we know when we can stop running iterations for all benchmarks being compared. I'm not sure of the best way to handle this, but perhaps it could be done by extending the launch count functionality with a custom stopping criterion and adding the ability to run benchmark cases in a custom order when using a launch count, rather than just sequentially.
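
To make the alternating-order idea concrete, here is a rough sketch of the kind of driver loop it would require; IPairedStoppingCriterion, AlternatingRunner, and the iteration budget are hypothetical names used only for illustration, not existing BenchmarkDotNet APIs:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical paired criterion: it sees both samples at once, so it can stop
// exactly when the comparison (e.g. a Welch test) has enough data.
public interface IPairedStoppingCriterion
{
    bool IsSatisfied(IReadOnlyList<double> baseline, IReadOnlyList<double> candidate);
}

public static class AlternatingRunner
{
    // Alternates single iterations of the baseline and the candidate case until
    // the paired criterion is satisfied or the iteration budget is exhausted.
    public static (List<double> Baseline, List<double> Candidate) Run(
        Func<double> runBaselineIteration,   // returns one measurement, e.g. ns/op
        Func<double> runCandidateIteration,
        IPairedStoppingCriterion criterion,
        int maxIterationsPerCase = 100)
    {
        var baseline = new List<double>();
        var candidate = new List<double>();

        for (int i = 0; i < maxIterationsPerCase; i++)
        {
            baseline.Add(runBaselineIteration());
            candidate.Add(runCandidateIteration());

            if (criterion.IsSatisfied(baseline, candidate))
                break;
        }

        return (baseline, candidate);
    }
}
```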

Additionally, this issue describes using Welch's t-test, but my understanding is that it assumes the data is normally distributed, which may not be the case. I have also seen that the Mann-Whitney U test and the Kolmogorov-Smirnov test are options, but from what I can tell they are only useful for rejecting the null hypothesis that "the two distributions are the same", which tells us nothing if we fail to reject it. Also, the Mann-Whitney U test doesn't seem to be robust against changes in the shape of the distributions.
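
For reference, the statistic under discussion is cheap to recompute after every iteration; below is a self-contained sketch of Welch's t statistic and the Welch–Satterthwaite degrees of freedom (it uses no BenchmarkDotNet or Perfolizer API, and the normality caveat above still applies when interpreting the result):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WelchTTest
{
    // Welch's two-sample t statistic and the Welch–Satterthwaite degrees of freedom.
    // Turning (t, df) into a p-value additionally needs the Student-t CDF, which is
    // available in libraries such as MathNet.Numerics or Perfolizer.
    public static (double T, double Df) Compute(IReadOnlyList<double> x, IReadOnlyList<double> y)
    {
        double meanX = x.Average(), meanY = y.Average();
        double varX = x.Sum(v => (v - meanX) * (v - meanX)) / (x.Count - 1);
        double varY = y.Sum(v => (v - meanY) * (v - meanY)) / (y.Count - 1);

        double sx = varX / x.Count;  // squared standard error of the x sample mean
        double sy = varY / y.Count;  // squared standard error of the y sample mean

        double t = (meanX - meanY) / Math.Sqrt(sx + sy);
        double df = (sx + sy) * (sx + sy) /
                    (sx * sx / (x.Count - 1) + sy * sy / (y.Count - 1));

        return (t, df);
    }
}
```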

However, I happened to come across a blog post you wrote that describes a way to calculate a non-parametric effect size: https://aakinshin.net/posts/nonparametric-effect-size/. In it you also have a footnote indicating that you use this for comparing performance measurements in Rider. I'd be curious to know, if possible, whether this is still in use in Rider and how successful it has been. The only thing missing is how to implement the stopping criterion, as this metric does not come with a way to estimate the confidence of its value. One approach would be to implement optimal stopping by tracking the effect size and waiting for it to stabilize, similar to the logic in AutoWarmupStoppingCriteria, or maybe by using a sliding-window-based approach.
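
A rough sketch of what such a sliding-window stabilization check could look like; it assumes some external estimator recomputes the effect size after every new iteration, and all names and thresholds here are hypothetical:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class EffectSizeStabilization
{
    // history holds the effect-size estimate recomputed after each new iteration.
    // We declare the run finished once the last `window` estimates all lie within
    // `tolerance` of each other; both parameters are arbitrary illustrative values.
    public static bool HasStabilized(IReadOnlyList<double> history,
                                     int window = 5,
                                     double tolerance = 0.05)
    {
        if (history.Count < window)
            return false;

        var recent = history.Skip(history.Count - window).ToList();
        return recent.Max() - recent.Min() < tolerance;
    }
}
```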
