
feat: repair Go stack traces #3014


Merged
merged 6 commits into from
Feb 27, 2024

Conversation

kolesnikovae
Collaborator

@kolesnikovae kolesnikovae commented Feb 21, 2024

The PR suggests a method for fixing truncated Go heap profile stack traces. The function attempts to infer truncated frames by analyzing neighbors and then repairs the stack traces if there is a high likelihood that the missing part exists in the profile but in a different stack. I manually reviewed a few cases and can confidently say that it produces reliable and valid results. However, it's not immune to errors. In my opinion, any potential mistakes, if they take place, do not cause much harm to a malformed profile (which Go heap profiles often are). Therefore, I consider it safe.
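
Roughly, the idea can be sketched as follows. This is a simplified illustration of the approach, not the actual implementation; the function names, frame IDs, and overlap size below are illustrative only:

package main

import "fmt"

// repair is a simplified sketch: a stack that was cut off at the depth limit
// is extended with the frames that follow the same run in a taller "donor"
// stack from the same profile. Frames are ordered leaf-first, as in pprof
// Sample.LocationId, so the missing frames are the root-most ones at the end.
func repair(truncated, donor []uint64, overlap int) ([]uint64, bool) {
	if len(truncated) < overlap {
		return truncated, false
	}
	tail := truncated[len(truncated)-overlap:] // the root-most frames we still have
	for i := 0; i+overlap < len(donor); i++ {
		match := true
		for j := 0; j < overlap; j++ {
			if donor[i+j] != tail[j] {
				match = false
				break
			}
		}
		if match {
			// Append the donor's remaining root-side frames to the truncated stack.
			return append(append([]uint64{}, truncated...), donor[i+overlap:]...), true
		}
	}
	return truncated, false
}

func main() {
	donor := []uint64{9, 8, 7, 6, 5, 4, 3, 2, 1} // a full stack, leaf-first
	cut := []uint64{20, 8, 7, 6, 5, 4}           // a stack cut off at the depth limit
	fixed, ok := repair(cut, donor, 4)
	fmt.Println(ok, fixed) // true [20 8 7 6 5 4 3 2 1]
}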

The performance impact of the fix is relatively moderate; approximately 75% of the CPU time is dedicated to sorting, and there are no substantial allocations.

Interestingly, the correction of stack traces does not lead to a significant change in their cardinality (a 5–15% reduction). Even so, it reduces pressure on symdb, as we expect a tree-like structure in call chains, which is often broken by truncation.

More importantly, from the UX standpoint, Go heap profiles become much clearer.

Before: [image]
After: [image]

Before: [image]
After: [image]

@kolesnikovae kolesnikovae marked this pull request as ready for review February 21, 2024 08:50
@kolesnikovae kolesnikovae requested a review from a team as a code owner February 21, 2024 08:50
//
// For each stack trace S taller than 24 frames: if there is another
// stack trace R taller than 24 frames that overlaps with the given
// one by at least 16 frames in a row from the top, and has frames
Contributor

why 16 and not less?

Collaborator Author

There should be a reasonable overlap between stack traces. Half of the max looks good to me.

Contributor

@cyriltovena cyriltovena Feb 27, 2024

I think we should test in dev with different values of that configuration just to see the impact.

Collaborator Author

A smaller overlap implies more tokens; the more tokens, the more checks we need to perform, and the more expensive the procedure becomes. At the same time, with a shorter overlap, the chance that we "repair" a wrong stack is higher. Therefore I'm not sure it's worth it.

I experimented with various values and can say that the current ones work well. Thus, I would prefer to deploy this version and optimize it if we deem it necessary.

Contributor

@cyriltovena cyriltovena left a comment

Nice one.

It's quite hard to follow, but I don't know if there's anything more we can do beyond your current comment.

Comment on lines +33 to +37
for _, s := range p.Sample {
if len(s.LocationId) >= minDepth {
return true
}
}
Collaborator Author

See https://go-review.googlesource.com/c/go/+/572396: we also need to check that the depth does not exceed the limit
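
For illustration, the extra bound could look something like this (maxDepth is an assumed constant for the raised runtime limit, not an existing name in the code):

for _, s := range p.Sample {
	// Sketch only: a sample is a repair candidate if it sits at the old
	// truncation boundary but does not exceed the raised limit from CL 572396.
	if d := len(s.LocationId); d >= minDepth && d <= maxDepth {
		return true
	}
}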

Contributor

I do think the fact that truncation has happened could be propagated more clearly: golang/go#43669

@dalehamel

Hey, I've just come across this while trying to explain a phenomenon we're seeing where the total value of Pyroscope Go profiles does not match the total value of the samples sent to Pyroscope. If we emit the weight of the sample values from the client as a metric, then compare the sum of these values over a period of time with profiles over the same period, the pyroscope profiles are nearly double what we expect.

The smoking gun seems to be right in the screenshots: after repairing the stack traces, the total weight of the profile is almost doubled, e.g. 13.6 TiB to 25.0 TiB. This seems to be an error in the algorithm, as moving stacks around within the graph shouldn't result in new weight being added to the graph.

I believe if the tests were updated to assert that the total value of the truncated profile equals the total value of the repaired profile, this assertion would fail. I cannot think of any reason why this should be the case.

This phenomenon only happens with Go profiles; other pprof profiles, such as those from Node.js, are fine.

@kolesnikovae
Collaborator Author

kolesnikovae commented Mar 19, 2025

The smoking gun seems to be right in the screenshots: after repairing the stack traces, the total weight of the profile is almost doubled, e.g. 13.6 TiB to 25.0 TiB. This seems to be an error in the algorithm, as moving stacks around within the graph shouldn't result in new weight being added to the graph.

I believe if the tests were updated to assert that the total value of the truncated profile equals the total value of the repaired profile, this assertion would fail. I cannot think of any reason why this should be the case.

Hey Dale! I believe you're right about the tests – unfortunately, the fixtures were generated in a buggy way, and therefore, the sample values are indeed doubled (because of #3251). I'm fixing it in #4018 – I also want to add some safety checks to detect drifts, but now that we have valid fixtures, the numbers match.

The algo itself does not touch values at all. But later on, when we deduplicate stack traces, the values are aggregated. So I think the issue might be somewhere else in the processing pipeline.
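
For context, deduplication here means roughly the following (a simplified sketch, not the actual symdb code): samples that end up with identical stacks have their values summed, so the profile total is preserved by construction.

package main

import "fmt"

// dedupe sums the values of samples that share the same stack: the number of
// distinct stacks drops, but the total value across all samples stays the same.
func dedupe(stacks [][]uint64, values []int64) map[string]int64 {
	out := make(map[string]int64, len(stacks))
	for i, s := range stacks {
		out[fmt.Sprint(s)] += values[i] // illustrative key; real code hashes location IDs
	}
	return out
}

func main() {
	stacks := [][]uint64{{1, 2, 3}, {1, 2, 3}, {4, 5}}
	values := []int64{10, 5, 7}
	merged := dedupe(stacks, values)
	var total int64
	for _, v := range merged {
		total += v
	}
	fmt.Println(len(merged), total) // 2 22: fewer stacks, same total
}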

I'd like to learn more about the phenomenon:

If we emit the weight of the sample values from the client as a metric, then compare the sum of these values over a period of time with profiles over the same period, the pyroscope profiles are nearly double what we expect.

It would be ideal if there were a repro. If that's not possible, please clarify which profile types are affected and how exactly you collect and measure the values.

@dalehamel

dalehamel commented Mar 19, 2025

The algo itself does not touch values at all. But later on, when we deduplicate stack traces, the values are aggregated. So I think the issue might be somewhere else in the processing pipeline.

Interesting, because when I read the algorithm I couldn't see where this would be happening. If it is further down the pipe, though, I wonder what it is about changing the profile that causes this mistake? Intuitively I don't see how just copying stack frames onto the call chain should cause this corruption, unless we are duplicating stacks somehow? In any case, some kind of test is needed here to ensure that whatever is handed to the rest of the pipeline is valid, so that the duplication doesn't happen 👍

If we emit the weight of the sample values from the client as a metric, then compare the sum of these values over a period of time with profiles over the same period, the pyroscope profiles are nearly double what we expect.

It doesn't happen for all Go profiles; simple ones like the rideshare app are just fine. But with more complicated Go apps we see the issue with deep stacks, which points to this code that fixes the truncated stacks (unbelievable that this is a thing... they should really fix that in golang 😬 ).

To make a repro, I think using recursion or something similar would make it easy to get the deep stack. The code in question used GraphQL, which loves to have deep stacks...

@kolesnikovae
Collaborator Author

Thanks for sharing this! I'll take a careful look early next week. I think I'll start by adding an option to disable "repairing".

(unbelievable that this is a thing... they should really fix that in golang 😬 ).

They have increased the limit to 128 in https://go-review.googlesource.com/c/go/+/572396; we, however, haven't yet backported it to our fork (grafana/pyroscope-go#113).

@dalehamel

In a local fork, I used the check in func PotentialTruncatedGoStacktraces(p *profilev1.Profile) bool to bypass the stack-repairing logic and instead insert a "truncation" frame, like @simonswine suggested in golang/go#43669 (comment).

[Screenshot 2025-03-20 at 7:55 PM]

(The totals here are right; I verified the ones in my other fork against the metric I'm emitting that sums the values from a counter over the same period.)

I duplicated the same profile to an official Pyroscope image, and got:

[Screenshot 2025-03-20 at 7:56 PM]

(note the totals are wrong here)

I think I'll start by adding an option to disable "repairing".

Maybe a ternary switch of "repair|truncate|disabled" to pick which behaviour to use, where "truncate" inserts a truncation frame as above?
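
A rough sketch of what the "truncate" behaviour could look like (markerID and limit here are assumptions for illustration, not the actual pyroscope API):

// appendTruncationMarker adds a synthetic "stack truncated" frame at the root
// end of any stack that reached the frame limit, so the missing weight stays
// visible under that marker instead of being re-parented onto another stack.
func appendTruncationMarker(stack []uint64, markerID uint64, limit int) []uint64 {
	if len(stack) < limit {
		return stack // not truncated, leave as-is
	}
	return append(stack, markerID)
}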

It would be ideal if there were a repro. If that's not possible, please clarify which profile types are affected and how exactly you collect and measure the values.

This is an app that uses generated code, in this case:

https://github.com/99designs/gqlgen

I guess given sufficiently complex types or GraphQL queries, it needs to do a lot of recursive calls?

I would think it would be easy to craft a repro, though, just using a recursive function that terminates after N iterations, e.g.:

package main

import (
	"fmt"
)

// Recursive function that calls itself until it has been called N times
func recursiveFunction(count, N int) {
	if count < N {
		fmt.Printf("Function call number: %d\n", count+1)
		recursiveFunction(count+1, N) // Call the function recursively
	}
}

func main() {
	N := 5 // Set the number of times to call the function
	recursiveFunction(0, N) // Start the recursion with count as 0
}

Accepting a function pointer, we could also "do some work" once we hit that depth, such as a loop allocating objects or some busy CPU work to generate a profile?
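
For example, something along these lines might work (the depth, allocation sizes, and output path are arbitrary choices, not a tested repro):

package main

import (
	"os"
	"runtime/pprof"
)

var sink [][]byte // keep allocations reachable so they appear in the heap profile

// deepWork recurses until it reaches the target depth, then runs the supplied
// work function, producing heap samples with stacks deep enough to hit the
// runtime's frame limit.
func deepWork(depth, target int, work func()) {
	if depth < target {
		deepWork(depth+1, target, work)
		return
	}
	work()
}

func main() {
	deepWork(0, 200, func() {
		for i := 0; i < 1000; i++ {
			sink = append(sink, make([]byte, 4096))
		}
	})
	f, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.WriteHeapProfile(f)
}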
