Skip to content

Add query limiter middleware for instant queries #11097

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

chencs
Copy link
Contributor

@chencs chencs commented Apr 3, 2025

What this PR does

This PR introduces a middleware (modeled after the query blocker and error caching middlewares) which checks if a query should not be run more than a configured AllowedFrequency. If the last time the query was allowed less than AllowedFrequency time ago, it is blocked with a query-limited error, and returns a 429 status. It works with the range query, instant query, and remote read paths, though this PR only adds it to the instant query middlewares, since frequent, expensive instant queries can easily overwhelm the system due to the lack of results caching on that path.

Limitations:

  • only matches exact queries
  • does not check query time (e.g., range start/end, or instant time), since the intent is to block excessive unruly/expensive instant queries that are automatically executed at "random" times.

I'm not 100% convinced that 429 is the appropriate status code to return, but it is a retryable error.

Which issue(s) this PR fixes or relates to

Checklist

  • Tests updated.
  • Documentation added.
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX].
  • about-versioning.md updated with experimental features.

@dimitarvdimitrov
Copy link
Contributor

This runs once in each query-frontend. So in effect the rate limit will scale as the query-frontend deployment scales. Any thoughts on doing this in the query-scheduler where we have fewer replicas? Or perhaps using memcached for this?

@chencs
Copy link
Contributor Author

chencs commented Apr 7, 2025

Or perhaps using memcached for this?

I'll look into using memcached for this. I think it makes more sense than implementing at the query-scheduler; we want to block requests before they're split, sharded, etc., and it's not intuitive that a block like this is the responsibility of the scheduler anyways. We do other blocking at the frontend.

@chencs chencs force-pushed the casie/limit-query-frequency branch 2 times, most recently from bf01f5e to 9c13691 Compare April 10, 2025 23:16
@chencs chencs marked this pull request as ready for review April 10, 2025 23:20
@chencs chencs requested a review from a team as a code owner April 10, 2025 23:20
@chencs chencs force-pushed the casie/limit-query-frequency branch 3 times, most recently from 75b9cb9 to 95ea26a Compare April 10, 2025 23:23
@56quarters
Copy link
Contributor

Hey, sorry to jump into the middle of this PR but this looks like it's reading and writing a structure to cache. I was hoping we could consider some approaches that are a little less race-prone.

Option 1:

If the requirement is only run a particular query every N seconds, we could use Cache.Add("limiter:hash_of_query", ttl=N*time.Second). This fails when there's already an unexpired entry, preventing multiple query-frontends from running the query more often than required.

Option 2:

If the requirement is to only run a particular query N times every minute, we could use a key with 60 second buckets in it and Cache.Incr("limiter:hash_of_query:60", 1) and check the result to ensure we're under the N limit.

Both of these would only require a single operation against the cache and would avoid races (AFAICT) in our code.

@chencs
Copy link
Contributor Author

chencs commented Apr 12, 2025

@56quarters no need to apologize! These are good suggestions, thanks for the feedback. I'll work on a change next week.

@chencs chencs force-pushed the casie/limit-query-frequency branch from 95ea26a to d92c579 Compare April 16, 2025 04:43
@chencs chencs force-pushed the casie/limit-query-frequency branch 5 times, most recently from 1ae3aaa to 31c59df Compare April 16, 2025 05:55
@chencs chencs force-pushed the casie/limit-query-frequency branch 2 times, most recently from 577a319 to d0b2cd0 Compare April 16, 2025 21:13
@chencs chencs force-pushed the casie/limit-query-frequency branch from d0b2cd0 to 9b8dd8e Compare April 16, 2025 21:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants