Skip to content

Query Frontend: add cortex_query_samples_processed_cache_adjusted_total metric #11164

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 31 commits into
base: main
Choose a base branch
from

Conversation

Konstantinov-Innokentii
Copy link
Member

@Konstantinov-Innokentii Konstantinov-Innokentii commented Apr 9, 2025

What this PR does

The metric tracks the number of samples processed to execute a query if no query results cache was enabled.

How does it work: (Updated)
The extent now has a SamplesProcessedPerStep field. splitAndCacheMiddleware tracks partial stats of each downstream response it made and when downstreamRequest is converted to an extent. perStepStats are approximated from totalSamplesProcessed for this request. When extents are used to respond to a query in the partitionCacheExtents, the corresponding perStepStats are extracted and merged.
The same happens when extents are merged in or truncated when filtering recent cache extents filterRecentCacheExtents.

This number of samples processed in Extent is approximate since MQE does not expose per-step sample statistics. When extracting only part of the Extent's range, perStepStats are estimated based on TotalSamplesProcessed and number of step in the request range.

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@Konstantinov-Innokentii Konstantinov-Innokentii changed the title Add cortex_query_samples_processed_from_cache_total metric Add cortex_query_samples_processed_cache_adjusted_total metric Apr 10, 2025
@Konstantinov-Innokentii Konstantinov-Innokentii marked this pull request as ready for review April 10, 2025 05:41
@Konstantinov-Innokentii Konstantinov-Innokentii requested a review from a team as a code owner April 10, 2025 05:41
@Konstantinov-Innokentii Konstantinov-Innokentii changed the title Add cortex_query_samples_processed_cache_adjusted_total metric Query Frontend: add cortex_query_samples_processed_cache_adjusted_total metric Apr 10, 2025
@pracucci pracucci self-requested a review April 10, 2025 08:41
Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work! I just left a couple of nits. 👏

@pracucci pracucci self-requested a review April 16, 2025 10:20
@@ -67,6 +67,13 @@ message Extent {
// If the response is combination of multiple queries over time, all of which had timestamp set, this is the timestamp of oldest query.
// When merging extents and some of them have 0 query timestamp, we keep non-zero timestamp, if possible.
int64 query_timestamp_ms = 6;
// Number of samples processed per step to create this Extent.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we sort it by timestamp by contract? If so, I would document it here, given it's a design decision.

return result
}

func mergeSamplesProcessedPerStep(steps ...[]StepStat) []StepStat {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we simply define that StepStat must be ordered by timestamp, then you don't need to use a map here. You can just go building the merged slice in a single pass. Keep it simple, just take 2 inputs. Don't need to be variadic.

Comment on lines 537 to 539
if details := QueryDetailsFromContext(ctx); details != nil {
details.SamplesProcessedFromCache = cachedSamplesProcessed
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Design-wise, I don't think it's partitionCacheExtents() responsibility taking the context in input and updating it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks. Lifted stats updates up.

cachedSamplesProcessed += uint64(stepStat.Value)
}
if details := QueryDetailsFromContext(ctx); details != nil {
details.SamplesProcessedFromCache = cachedSamplesProcessed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused. partitionCacheExtents() can be called multiple times for a single input query. Here we overwrite SamplesProcessedFromCache each time 🤔

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, my bad; thanks for catching it. Pushed a fix

Copy link
Collaborator

@pracucci pracucci left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic LGTM.

Konstantinov-Innokentii and others added 21 commits April 16, 2025 21:40
It tracks oriignal number of samples used to populate the cache.

The value is approximate, as cache extents might be merged or truncated. Since MQE does not expose per-step sample statistics, a heuristic is applied: each extent stores the total number of ProcessedSamples, and when only a sub-range of an extent is used in a query, the number of samples is estimated proportionally to the overlap between the requested range and the extent's time range.
…nSubrange more clear - it receives range borders, not he the range itself.
1. TestMergeCacheExtentsForRequest now expectsExtents, not samples and timeranges separately
2.  Style fixes in the TestFilterRecentCacheExtents
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
1. Introduce new approximation algoritm. Instead of operating with SamplesProcessed count per whole extent, use approximate perStepStats, to avoid cumulative error when resizing extent
2. Move SamplesProcessedFromCache stat to query-frontend stats from the Querier stats, since they are tracked & reported in frontend.
Konstantinov-Innokentii and others added 10 commits April 16, 2025 21:40
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Co-authored-by: Marco Pracucci <marco@pracucci.com>
Sum samples processed from cache, rather that overwrite.
Also, mergePerStepStats in one pass, under the assumption that they are alwats sorted by timestamp
split_and_cache_middleware doesn't update it concurrently.
@Konstantinov-Innokentii Konstantinov-Innokentii force-pushed the add_cortex_query_samples_processed_from_cache_total_metric branch from 71a4c29 to 2706b22 Compare April 16, 2025 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants