[Failure store] Introduce dedicated failure store lifecycle configuration #127314

gmarouli · 2025-04-24T10:58:56Z

The failure store is a set of data stream indices used to store certain types of ingestion failures. Until now they shared the configuration of the backing indices. We understand that the two data sets have different lifecycle needs.

We believe that failures will typically need to be retained for much less time than the data. Given this, we believe the lifecycle needs of the failures are also more limited and are a better fit for the simplicity of the data stream lifecycle feature.

This allows the user to set only the desired retention, and we perform the rollover and other maintenance tasks without the user having to think about them. Furthermore, having only one lifecycle management feature allows us to ensure that this data is managed by default.

This PR introduces the following:

Configuration

We extend the failure store configuration to also allow lifecycle configuration. The data stream options API reflects only the user's configuration, as shown below:

PUT _data_stream/*/options
{
  "failure_store": {
     "lifecycle": {
       "data_retention": "5d"
     }
  }
}

GET _data_stream/*/options

{
  "data_streams": [
    {
      "name": "my-ds",
      "options": {
        "failure_store": {
          "lifecycle": {
            "data_retention": "5d"
          }
        }
      }
    }
  ]
}

To retrieve the effective configuration, use the GET data streams API; see #126668.

Functionality

The data stream lifecycle (DLM) manages the failure indices regardless of whether the failure store is enabled. This ensures that we do not end up with stagnant data if the failure store gets disabled.
The data stream options APIs reflect only the user's configuration.
The GET data stream API should be used to check the current effective failure store configuration.
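
As a rough illustration of how an effective failure store retention could be resolved from the user's setting and the global retention values, the sketch below assumes that the configured `data_retention` takes precedence over the global default and that the global maximum, when defined, caps the result. The function name and its millisecond arguments are hypothetical and are not the code in this PR.

```python
from typing import Optional


def effective_failure_retention_millis(
    configured: Optional[int],
    global_default: Optional[int],
    global_max: Optional[int],
) -> Optional[int]:
    """Hypothetical helper: resolve the retention applied to failure indices.

    Assumption: the user's data_retention wins over the global default,
    and the global max (when defined) caps whatever was chosen.
    None means "not defined", i.e. no retention limit from that source.
    """
    retention = configured if configured is not None else global_default
    if global_max is not None and (retention is None or retention > global_max):
        retention = global_max
    return retention


DAY = 24 * 60 * 60 * 1000  # one day in milliseconds

# User set 5d and there is no global max: the user's value is effective.
print(effective_failure_retention_millis(5 * DAY, 30 * DAY, None))
# Nothing configured: the global default applies.
print(effective_failure_retention_millis(None, 30 * DAY, None))
# Configured value exceeds the global max: the max caps it.
print(effective_failure_retention_millis(90 * DAY, 30 * DAY, 60 * DAY))
```

Under these assumptions the result does not depend on whether the failure store is enabled, which is consistent with the first point above: the failure indices stay managed either way.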

Telemetry

We extend the data stream failure store telemetry to also include the lifecycle telemetry.

{
  "data_streams": {
    "available": true,
    "enabled": true,
    "data_streams": 10,
    "indices_count": 50,
    "failure_store": {
      "explicitly_enabled_count": 1,
      "effectively_enabled_count": 15,
      "failure_indices_count": 30,
      "lifecycle": {
        "explicitly_enabled_count": 5,
        "effectively_enabled_count": 20,
        "data_retention": {
          "configured_data_streams": 5,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "effective_retention": {
          "retained_data_streams": 20,
          "minimum_millis": X,
          "maximum_millis": Y,
          "average_millis": Z
        },
        "global_retention": {
          "max": {
            "defined": false
          },
          "default": {
            "defined": true,  <------ this is the default value applicable for the failure store
            "millis": X
          }
        }
      }
    }
  }
}
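
The `minimum_millis`, `maximum_millis` and `average_millis` fields above are plain aggregates over the retention values of the relevant data streams. A minimal sketch of that aggregation follows; the function name, the `count` key (standing in for `configured_data_streams` or `retained_data_streams`) and the behavior for an empty input are assumptions made for illustration, not the implementation in this PR.

```python
from typing import Dict, List, Union


def retention_stats(retentions_millis: List[int]) -> Dict[str, Union[int, float]]:
    """Hypothetical helper: summarize retention values, one per data stream."""
    if not retentions_millis:
        # Assumption: with no data streams only the count is reported.
        return {"count": 0}
    return {
        "count": len(retentions_millis),
        "minimum_millis": min(retentions_millis),
        "maximum_millis": max(retentions_millis),
        "average_millis": sum(retentions_millis) / len(retentions_millis),
    }


# Three data streams with failure store retentions of 1s, 3s and 5s.
print(retention_stats([1000, 3000, 5000]))
```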

Implementation details

We ensure that a partially reset failure store composes to a valid failure store configuration.
We ensure that when a node communicates with a node on a previous version, it does not send an invalid failure store configuration with enabled: null.


elasticsearchmachine · 2025-04-24T10:59:20Z

Pinging @elastic/es-data-management (Team:Data Management)

elasticsearchmachine · 2025-04-24T11:00:01Z

Hi @gmarouli, I've created a changelog YAML for you.

gmarouli added 10 commits April 24, 2025 09:19

Merge getBackingIndicesPastRetention & getFailureIndicesPastRetention

f171f9d

Add configuration for failure store lifecycle

f1bb80f

Use the failure store lifecycle config in DataStreamLifecycleService

c1366c2

Expose the failure store lifecycle in info APIs

3a6f151

Add telemetry for the failure store lifecycle

b9179a6

Ensure backwards compatibility when it comes to failure store config with null enabled.

073df93

Ensure fully reset failure store composes to valid template

e22ce67

Failure store should not inherit the ILM policy from the data

99664a0

Warn the user when the data retention of the failure exceeds the max

217218a

Small test fixes

0e0bd34

gmarouli added >enhancement :Data Management/Data streams Data streams and their lifecycles labels Apr 24, 2025

elasticsearchmachine added Team:Data Management Meta label for data/management team v9.1.0 labels Apr 24, 2025

gmarouli added auto-backport Automatically create backport pull requests when merged v8.19.0 labels Apr 24, 2025

Update docs/changelog/127314.yaml

197e323

github-actions bot deployed to docs-preview April 24, 2025 11:00 View deployment

Merge branch 'main' into failures-lifecycle-config

09c83a8

github-actions bot deployed to docs-preview April 24, 2025 11:05 View deployment

elasticsearchmachine added the serverless-linked Added by automation, don't add manually label Apr 24, 2025

Fix test

f9474c1

github-actions bot deployed to docs-preview April 24, 2025 11:39 View deployment

gmarouli requested a review from jbaiera April 24, 2025 12:42
