Description
Currently, we have an optional `WelchTTestPValueColumn` which helps you verify that there is a statistically significant difference between benchmarks. However, it doesn't work well with the default run strategy because this strategy typically doesn't perform enough iterations. Users have to choose a satisfactory number of iterations manually. Thus, it's possible to do such checks, but the user experience is not good enough. We can do the following (a rough sketch of these types is given after the list):

- Introduce an additional property in `AccuracyMode`. Let's call it `StopCriterion` (let me know if you have better ideas about naming). It will contain the logic that decides when we have enough iterations.
- Currently, we have hardcoded logic inside `EngineTargetStage`. Let's move it to a class called `StdErrStopCriterion`.
- We can introduce `WelchStopCriterion` which will do additional iterations until we are sure that it's enough for Welch's Two Sample t-test. (Bonus: users will be able to write their own criteria.)

`StopCriterion` should be able to affect `IOrderProvider.GetExecutionOrder` and ask to run baseline benchmarks first. `EngineTargetStage.RunAuto` should get additional information about benchmarks, such as the `IsBaseline` value. The non-baseline benchmarks should get all measurements from the baseline benchmark in the corresponding group.
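For illustration, here is a rough sketch of what the proposed abstraction could look like. All types and thresholds below are hypothetical placeholders for this proposal, not existing BenchmarkDotNet APIs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the proposed abstraction; not an existing BenchmarkDotNet API.
// A minimal measurement: one workload iteration result in nanoseconds.
public readonly record struct Measurement(double Nanoseconds);

// The proposed StopCriterion: decides whether the engine has collected enough iterations.
// Baseline measurements are passed in so that criteria like WelchStopCriterion can
// compare the current benchmark against the baseline of its group.
public interface IStopCriterion
{
    bool IsSatisfied(IReadOnlyList<Measurement> current, IReadOnlyList<Measurement> baseline);
}

// Mirrors the idea of moving the current hardcoded EngineTargetStage logic into a class:
// stop once the standard error is small relative to the mean (illustrative threshold).
public class StdErrStopCriterion : IStopCriterion
{
    private readonly double maxRelativeError;
    private readonly int minIterations;

    public StdErrStopCriterion(double maxRelativeError = 0.02, int minIterations = 15)
    {
        this.maxRelativeError = maxRelativeError;
        this.minIterations = minIterations;
    }

    public bool IsSatisfied(IReadOnlyList<Measurement> current, IReadOnlyList<Measurement> baseline)
    {
        if (current.Count < minIterations)
            return false;

        double mean = current.Average(m => m.Nanoseconds);
        double variance = current.Sum(m => Math.Pow(m.Nanoseconds - mean, 2)) / (current.Count - 1);
        double standardError = Math.Sqrt(variance / current.Count);
        return standardError / mean < maxRelativeError;
    }
}
```

A `WelchStopCriterion` would implement the same interface, but keep requesting iterations until Welch's t-test between `current` and the group's `baseline` measurements gives a stable verdict.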
Original request: https://twitter.com/AnthonyLloyd123/status/1005388154046644230
Activity
AndreyAkinshin commented on Jun 9, 2018
Another idea by Anthony Lloyd:
caaavik-msft commented on Feb 11, 2025
I've been looking into this particular issue recently, and the way I see it there are two use cases to consider: the first is when you have the ability to test the implementations side by side, and the second is when you don't have an easy way to test the baseline again (e.g. running in CI against the latest commit but comparing to previous versions).
Since BDN runs all the test cases sequentially and the stopping criteria are scoped to a single test run, it can only be used to solve the second use case today with a custom `IStoppingCriteria`. The first use case is not possible to implement, since we would need the measurements of both cases to decide when to stop running iterations. Ultimately, we would need some way to run each benchmark case in an alternating order so that we know when we can stop running iterations for all benchmarks being compared. I'm not sure of the best way to handle this, but perhaps it could be done by extending the launch count functionality with a custom stopping criteria and adding the ability to run benchmark cases in a custom order when using a launch count, rather than just sequentially.

Additionally, this issue describes using Welch's t-test, but my understanding is that this assumes the data is normally distributed, which may not be the case. I have seen as well that the Mann-Whitney U test and Kolmogorov-Smirnov test are options, but from what I can tell they are only useful to reject the null hypothesis of "the two distributions are the same", which doesn't tell us anything if we fail to reject it. Also, the Mann-Whitney U test doesn't seem to be robust against changes in the shape of the distributions.
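For reference, whichever stopping policy is chosen, the comparison itself only needs the Welch t-statistic and its Welch–Satterthwaite degrees of freedom computed from the two sets of measurements. A minimal, dependency-free sketch (the type and method names here are illustrative, not BenchmarkDotNet API):

```csharp
using System;
using System.Linq;

public static class WelchTTest
{
    // Returns the Welch t-statistic and the Welch–Satterthwaite degrees of freedom
    // for two independent samples (e.g. nanosecond measurements of two benchmarks).
    // The p-value would then come from the Student's t distribution with df degrees
    // of freedom (available in statistics libraries); omitted here to stay dependency-free.
    public static (double T, double Df) Compute(double[] x, double[] y)
    {
        if (x.Length < 2 || y.Length < 2)
            throw new ArgumentException("Each sample needs at least two measurements.");

        double meanX = x.Average(), meanY = y.Average();
        double varX = x.Sum(v => (v - meanX) * (v - meanX)) / (x.Length - 1);
        double varY = y.Sum(v => (v - meanY) * (v - meanY)) / (y.Length - 1);

        double seX = varX / x.Length;   // squared standard error of each mean
        double seY = varY / y.Length;

        double t = (meanX - meanY) / Math.Sqrt(seX + seY);

        // Welch–Satterthwaite approximation of the degrees of freedom.
        double df = (seX + seY) * (seX + seY) /
                    (seX * seX / (x.Length - 1) + seY * seY / (y.Length - 1));

        return (t, df);
    }
}
```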
However, I happened to come across a blog post you had written which describes a way to calculate a non-parametric effect size: https://aakinshin.net/posts/nonparametric-effect-size/. In it you also have a footnote indicating that you make use of this for comparing performance measurements in Rider too. I'd be curious to know, if possible, whether this is still in use at Rider and how successful it has been. The only thing missing from this is how to implement the stopping criteria, since this metric does not have a way to estimate the confidence of the value. One approach would be to implement optimal stopping by tracking the value of the effect size and waiting for it to stabilize, similar to the logic in `AutoWarmupStoppingCriteria`, or maybe by using a sliding window-based approach.
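As a sketch of that last idea: the rule only needs to recompute the effect size after each new iteration and stop once the most recent values agree within a tolerance. Everything below (`StabilityStoppingRule`, the window size, the tolerance) is a hypothetical illustration under those assumptions, not existing BenchmarkDotNet code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sliding-window stopping rule: recompute a statistic (e.g. the
// non-parametric effect size) after every iteration and stop once the last
// `windowSize` values all lie within `tolerance` of each other.
public class StabilityStoppingRule
{
    private readonly List<double> history = new();
    private readonly int windowSize;
    private readonly double tolerance;
    private readonly int maxIterations;

    public StabilityStoppingRule(int windowSize = 5, double tolerance = 0.05, int maxIterations = 100)
    {
        this.windowSize = windowSize;
        this.tolerance = tolerance;
        this.maxIterations = maxIterations;
    }

    // Call once per iteration with the freshly recomputed effect size;
    // returns true when it is safe to stop collecting measurements.
    public bool ShouldStop(double latestEffectSize)
    {
        history.Add(latestEffectSize);

        if (history.Count >= maxIterations)
            return true;                       // hard upper bound on iteration count

        if (history.Count < windowSize)
            return false;

        var window = history.Skip(history.Count - windowSize).ToList();
        return window.Max() - window.Min() < tolerance;
    }
}
```

Per iteration, the engine would compute the effect size between the accumulated baseline and current measurements and feed it into `ShouldStop`; the hard cap keeps the run bounded when the statistic never settles.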