Description
What is the problem you are trying to solve?
Our Grafana Mimir deployment has been suffering from high CPU usage on the query path. One way to lower it is a Bloom filter or a related structure such as a Cuckoo filter.
Which solution do you envision (roughly)?
- A well-established solution would be to incorporate a Bloom filter, a probabilistic data structure that can answer almost immediately that a time series does not exist, so the query can skip the lookup. It is already implemented in Loki, Thanos, M3DB, etc.
In Thanos, a TSDB with a very similar setup (also multi-tenant, with Prometheus inside), we have incorporated a Cuckoo filter (a relative of the Bloom filter) on metric names only, and CPU usage dropped immediately from 50% to <20%, a reduction of more than 30 percentage points from this simple feature. See this PR for a reference implementation.
I have also worked extensively with M3DB, which has a more robust approach: each fileset carries a Bloom filter bitset of all the series it contains, so it can quickly decide whether to attempt retrieving a series from that fileset volume. With M3DB we have never had a problem with CPU usage.
- A second approach, orthogonal to the Bloom filter, would be to separate storage from the query engine. Right now the ingester handles both write and read traffic, which makes it very heavy and critical. Separating the write and read paths would help not only with resource usage management, but also with isolation, reducing the chance of failures on both the read and write paths.
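To make the first approach more concrete, here is a minimal sketch of a Bloom filter over metric names. It is not Mimir, Thanos, or M3DB code: the `MetricNameFilter` type, its methods, and the sizing parameters are all hypothetical, and it uses a plain Bloom filter rather than the Cuckoo filter mentioned above. The idea is only that a query for a metric name that was never added can be rejected before any expensive lookup.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// MetricNameFilter is a hypothetical Bloom filter over metric names.
// It can return false positives, but never false negatives.
type MetricNameFilter struct {
	bits []uint64
	m    uint64 // total number of bits
	k    uint64 // number of probes per name
}

// NewMetricNameFilter allocates a filter with m bits and k probes.
// The values of m and k are assumptions to be tuned per workload.
func NewMetricNameFilter(m, k uint64) *MetricNameFilter {
	if m == 0 {
		m = 1
	}
	if k == 0 {
		k = 1
	}
	return &MetricNameFilter{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashPair derives two base hashes used for double hashing.
func hashPair(name string) (uint64, uint64) {
	h1 := fnv.New64a()
	h1.Write([]byte(name))
	h2 := fnv.New64()
	h2.Write([]byte(name))
	// Force the step to be odd so probes do not all collapse to one bit.
	return h1.Sum64(), h2.Sum64() | 1
}

// Add records a metric name in the filter.
func (f *MetricNameFilter) Add(name string) {
	a, b := hashPair(name)
	for i := uint64(0); i < f.k; i++ {
		pos := (a + i*b) % f.m
		f.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MayContain returns false only if the name was definitely never added.
func (f *MetricNameFilter) MayContain(name string) bool {
	a, b := hashPair(name)
	for i := uint64(0); i < f.k; i++ {
		pos := (a + i*b) % f.m
		if f.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	f := NewMetricNameFilter(1<<16, 4)
	f.Add("http_requests_total")

	// An added name always passes the filter: prints "true".
	fmt.Println(f.MayContain("http_requests_total"))

	// A name that was never added is usually rejected, so the
	// expensive series lookup could be skipped for it.
	if !f.MayContain("no_such_metric") {
		fmt.Println("skip lookup for no_such_metric")
	}
}
```

In a real query path the filter would presumably be built per block or per tenant when series are written, and consulted before touching the index. Where exactly that hook belongs in Mimir is left open here.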
Have you considered any alternatives?
No response
Any additional context to share?
No response
How long do you think this would take to be developed?
Small (<= 1 month dev)
What are the documentation dependencies?
No response
Proposer?
No response