Improvement: improve multiple series creation (especially on ingester startup) #11078

### What is the problem you are trying to solve?

Sometimes ingesters have to create a lot of series. We already see some lock contention on `MemPostings.Add()` for tenants with a lot of churn, but the worst case is when an ingester is either started from scratch or becomes writeable again.

In that last case we've seen write requests waiting for seconds on `MemPostings.Add()`, since we have to create a couple of million series in a matter of seconds, and each one takes an exclusive mutex.

It's not a very expensive operation, but the synchronization process wastes a lot of CPU and clears the precious caches.

Additionally, since hundreds of goroutines are waiting to get the exclusive write mutex, no reads can proceed during that time, which also increases the latency of the read path.

### Which solution do you envision (roughly)?

Instead of creating series one by one, create all the series for a given request at once, and optimize both the number of times the mutex is taken and the time spent under that mutex.

I [wrote a simple implementation](https://github.com/grafana/mimir-prometheus/pull/860) of a best-effort `BatchSeriesRefs` method on the Appender: it takes a list of series and tries to create the ones that are valid and don't exist yet. It also copies the labels as necessary, since we make heavy use of unsafe strings in Mimir.

[The implementation in Mimir](https://github.com/grafana/mimir/pull/11019) is straightforward: most of the code is there to keep allocations low by reusing builders and labels.

In my first approach I just added a `MemPostings.AddBatch` that creates 512 index entries each time it takes the mutex. That number is shared with other things `MemPostings` does, like garbage collection: we don't want to hold the mutex for too long if the caller wants to create 1M series. I think this is a good first approach. We could do better, for example with a pool of GOMAXPROCS workers that create series in parallel to minimize the time spent holding the mutex, but I would leave that to future versions.

Another improvement we could make is to batch the series lookup in the `stripeSeries`: we don't see mutex contention there, but there's likely some synchronization overhead that we don't see.

### Have you considered any alternatives?

_No response_

### Any additional context to share?

There's one drawback I identified: [we're now skipping the shortcut in `headAppender.Append`](https://github.com/grafana/mimir-prometheus/blob/29b599309e750994eab878944c2bd30f72ec5fcb/tsdb/head_append.go#L420-L426) that prevented series from being created when the received sample can't be appended to the head.

This means that if a tenant is sending data that is too old, we'll create their series without appending any samples associated with them.

### How long do you think this would take to be developed?

Not sure

### What are the documentation dependencies?

_No response_

### Proposer?

_No response_
