feat: repair Go stack traces #3014
Conversation
```go
//
// For each stack trace S taller than 24 frames: if there is another
// stack trace R taller than 24 frames that overlaps with the given
// one by at least 16 frames in a row from the top, and has frames
```
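For illustration, here is a rough sketch of the overlap test described in the comment above. This is not the PR's implementation; the function name and the leaf-first frame ordering are assumptions. The idea is that the outermost frames of a truncated stack must appear as a contiguous run in a candidate stack before the latter is used for repair:

```go
package main

import (
	"fmt"
	"slices"
)

// overlapsFromTop reports whether the outermost minOverlap frames of the
// truncated stack s appear as a contiguous run somewhere in the candidate
// stack r. Frames are ordered leaf-first (as in pprof samples), so the
// outermost frames sit at the end of the slice.
func overlapsFromTop(s, r []uint64, minOverlap int) bool {
	if len(s) < minOverlap || len(r) < minOverlap {
		return false
	}
	top := s[len(s)-minOverlap:] // the outermost frames of s
	for i := 0; i+minOverlap <= len(r); i++ {
		if slices.Equal(r[i:i+minOverlap], top) {
			return true
		}
	}
	return false
}

func main() {
	s := []uint64{9, 8, 1, 2, 3, 4} // truncated stack, leaf first
	r := []uint64{7, 1, 2, 3, 4, 5} // deeper stack sharing a 4-frame run
	fmt.Println(overlapsFromTop(s, r, 4)) // true
}
```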
Why 16 and not less?
There should be a reasonable overlap between stack traces. Half of the max looks good to me.
I think we should test in dev with different values of that configuration just to see the impact.
A smaller overlap implies more tokens. The more tokens, the more checks we need to perform, and the more expensive the procedure becomes. At the same time, with a shorter overlap, the chance that we "repair" a wrong stack is higher. Therefore I'm not sure it's worth it.
I experimented with various values and can say that the current ones work well. Thus, I would prefer to deploy this version and optimize it if we deem it necessary.
Nice one.
Quite hard to follow, but I don't know if there's anything more we can do beyond your current comment.
```go
for _, s := range p.Sample {
	if len(s.LocationId) >= minDepth {
		return true
	}
}
```
See https://go-review.googlesource.com/c/go/+/572396: we also need to check that the depth does not exceed the limit.
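A possible shape of that extra check, as a sketch only (the sample type and the bounds used here are stand-ins, not the PR's code):

```go
package main

import "fmt"

// sample is a minimal stand-in for a pprof sample: the IDs of the locations
// in the stack, leaf first (illustrative; not the generated protobuf type).
type sample struct {
	LocationId []uint64
}

// mayHaveTruncatedStacks reports whether any sample's depth falls within
// [minDepth, maxDepth]. The upper bound is the additional check suggested
// above: once the runtime's frame limit is raised, stacks deeper than the
// assumed limit should not be treated as truncation candidates.
func mayHaveTruncatedStacks(samples []sample, minDepth, maxDepth int) bool {
	for _, s := range samples {
		if n := len(s.LocationId); n >= minDepth && n <= maxDepth {
			return true
		}
	}
	return false
}

func main() {
	samples := []sample{{LocationId: make([]uint64, 32)}}
	fmt.Println(mayHaveTruncatedStacks(samples, 32, 128)) // true
}
```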
I do think the fact that truncation has happened could be propagated more clearly: golang/go#43669
Hey, I've just come across this while trying to explain a phenomenon we're seeing where the total value of Pyroscope Go profiles does not match the total value of the samples sent to Pyroscope. If we emit the weight of the sample values from the client as a metric, then compare the sum of these values over a period of time with profiles over the same period, the Pyroscope profiles are nearly double what we expect.

The smoking gun seems to be right in the screenshots: after repairing the stack traces, the total weight of the profile is almost doubled, e.g. 13.6 TiB to 25.0 TiB. This seems to be an error in the algorithm, as moving the stacks around within the graph shouldn't result in new weight being added to the graph. I believe that if the tests were updated to assert that the total value of the truncated profile equals the total value of the repaired profile, this assertion would fail. I cannot think of any reason why this should be the case.

This phenomenon only happens with Go profiles; other pprof profiles, such as those from Node.js, are fine.
Hey Dale! I believe you're right about the tests – unfortunately, the fixtures were generated in a buggy way, and therefore the sample values are indeed doubled (because of #3251). I'm fixing this in #4018 – I also want to add some safety checks to detect drifts, but now that we have valid fixtures, the numbers match. The algorithm itself does not touch values at all. However, later on, when we deduplicate stack traces, the values are aggregated, so I think the issue might be somewhere else in the processing pipeline. I'd like to learn more about the phenomenon: it would be ideal if there were a repro. If that's not possible, please clarify which profile types are affected and how exactly you collect and measure the values.
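To illustrate the deduplication point, here is a minimal sketch (not Pyroscope's actual code): when identical stack traces are merged, their values are summed into a single entry, so the total weight of the profile should stay the same.

```go
package main

import "fmt"

// sample is an illustrative stand-in: a stack of location IDs and a value.
type sample struct {
	Stack []uint64
	Value int64
}

// dedupe merges samples with identical stacks by summing their values.
func dedupe(samples []sample) []sample {
	byStack := make(map[string]int) // stack fingerprint -> index in out
	var out []sample
	for _, s := range samples {
		key := fmt.Sprint(s.Stack)
		if i, ok := byStack[key]; ok {
			out[i].Value += s.Value // aggregate, do not duplicate
			continue
		}
		byStack[key] = len(out)
		out = append(out, s)
	}
	return out
}

func main() {
	in := []sample{
		{Stack: []uint64{1, 2, 3}, Value: 10},
		{Stack: []uint64{1, 2, 3}, Value: 5},
		{Stack: []uint64{4, 5}, Value: 7},
	}
	fmt.Println(dedupe(in)) // two entries; the total value (22) is preserved
}
```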
Interesting, because when I read the algorithm I couldn't see where this would be happening. If it is further down the pipe, though, I wonder what it is about changing the profile that causes this mistake to happen? Intuitively I don't see how just copying stack frames onto the call chain should cause this corruption, unless we are duplicating stacks somehow. In any case, some kind of test here is needed to ensure that whatever is handed to the rest of the pipeline is valid, so that the duplication doesn't happen 👍

It doesn't happen for all Go profiles; simple ones like the rideshare app are just fine. But with more complicated Go apps we see the issue with deep stacks, which points to this code that fixes the truncated stacks (unbelievable that this is a thing... they should really fix that in Go 😬). To make a repro, I think using recursion or something would make it easy to get the deep stack. The code in question uses GraphQL, which loves to have deep stacks...
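For illustration, a sketch of the kind of guard such a test could apply (types and names here are stand-ins; in the real test the repair step would be the PR's entry point in pkg/pprof/fix_go_truncated.go):

```go
package main

import "fmt"

// sample is an illustrative stand-in for a pprof sample.
type sample struct {
	LocationId []uint64
	Value      []int64
}

// totalValue sums all sample values; a repair step that only rearranges
// stack frames should leave this number unchanged.
func totalValue(samples []sample) int64 {
	var total int64
	for _, s := range samples {
		for _, v := range s.Value {
			total += v
		}
	}
	return total
}

// checkRepairPreservesTotal runs a repair function and reports whether the
// total weight drifted.
func checkRepairPreservesTotal(samples []sample, repair func([]sample)) error {
	before := totalValue(samples)
	repair(samples)
	if after := totalValue(samples); after != before {
		return fmt.Errorf("total value changed after repair: before=%d after=%d", before, after)
	}
	return nil
}

func main() {
	samples := []sample{{LocationId: []uint64{1, 2, 3}, Value: []int64{100}}}
	err := checkRepairPreservesTotal(samples, func([]sample) {}) // no-op repair for demonstration
	fmt.Println(err) // <nil>
}
```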
Thanks for sharing this! I'll take a careful look early next week. I think I'll start by adding an option to disable "repairing".
They have increased the limit to 128 in https://go-review.googlesource.com/c/go/+/572396; we, however, haven't yet backported it to our fork (grafana/pyroscope-go#113).
In a local fork, I used the logic in pyroscope/pkg/pprof/fix_go_truncated.go (line 26 at 1c81971):

[screenshot] (the totals here are right; I verified the ones in my other fork against the metric I'm emitting that sums the values from a counter over the same period).

I duplicated the same profile to an official Pyroscope image, and get:

[screenshot] (note the totals are wrong here).
Maybe a ternary switch of "repair|truncate|disabled" to pick which behaviour to use, where "truncate" inserts a truncation frame as above?
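A rough sketch of what such a switch could look like (names and values are illustrative only, not an existing Pyroscope option):

```go
package main

import "fmt"

// TruncatedStacksMode selects how truncated Go stack traces are handled.
type TruncatedStacksMode string

const (
	ModeRepair   TruncatedStacksMode = "repair"   // reconstruct missing frames (this PR's behaviour)
	ModeTruncate TruncatedStacksMode = "truncate" // insert a synthetic "truncated" frame instead
	ModeDisabled TruncatedStacksMode = "disabled" // leave the profile untouched
)

func main() {
	mode := ModeTruncate
	switch mode {
	case ModeRepair:
		fmt.Println("repairing truncated stacks")
	case ModeTruncate:
		fmt.Println("marking truncation with a synthetic frame")
	case ModeDisabled:
		fmt.Println("leaving profiles untouched")
	}
}
```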
This is an app that uses generated code, in this case https://github.com/99designs/gqlgen. I guess, given sufficiently complex types or GraphQL queries, it needs to do a lot of recursive calls? I would think it would be easy to craft a repro just using a recursive function that terminates after N iterations, e.g.:

```go
package main

import (
	"fmt"
)

// Recursive function that calls itself until it has been called N times
func recursiveFunction(count, N int) {
	if count < N {
		fmt.Printf("Function call number: %d\n", count+1)
		recursiveFunction(count+1, N) // Call the function recursively
	}
}

func main() {
	N := 5                  // Set the number of times to call the function
	recursiveFunction(0, N) // Start the recursion with count as 0
}
```

Accepting a function pointer, we could also "do some work" once we hit that depth, such as a loop allocating objects or some busy CPU work to generate a profile?
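Building on that idea, here is a sketch of a repro that recurses past the pprof frame limit and then burns CPU at the bottom so the profiler captures truncated stacks (the depth of 200 and the names are illustrative assumptions):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// busyWork burns CPU so the profiler has samples to attribute to the
// bottom of the deep call chain.
func busyWork() {
	x := 0
	for i := 0; i < 2_000_000_000; i++ {
		x += i
	}
	_ = x
}

// deepStacks recurses until the requested depth is reached and then runs
// the supplied workload, producing stack traces deep enough to be truncated
// by the runtime's pprof frame limit.
func deepStacks(depth, target int, work func()) {
	if depth < target {
		deepStacks(depth+1, target, work)
		return
	}
	work()
}

func main() {
	f, err := os.Create("cpu.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// 200 frames should comfortably exceed the frame limits discussed above.
	deepStacks(0, 200, busyWork)
}
```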
The PR suggests a method for fixing truncated Go heap profile stack traces. The function attempts to infer truncated frames by analyzing neighbors and then repairs the stack traces if there is a high likelihood that the missing part exists in the profile but in a different stack. I manually reviewed a few cases and can confidently say that it produces reliable and valid results. However, it's not immune to errors. In my opinion, any potential mistakes, if they occur, do not cause much additional harm to an already malformed profile (which Go heap profiles often are). Therefore, I consider it safe.
The performance impact of the fix is relatively moderate; approximately 75% of the CPU time is dedicated to sorting, and there are no substantial allocations.
Interestingly, the correction of stack traces does not lead to a significant change in their cardinality (a 5–15% reduction). However, it does reduce pressure on symdb, as we expect a tree-like structure in call chains, which is often broken by truncation.
More importantly, from the UX standpoint, Go heap profiles become much clearer.
Before: [screenshot]
After: [screenshot]

Before: [screenshot]
After: [screenshot]