From fa21fc3763aa7a81dc3594cfa4a460fae12e7b10 Mon Sep 17 00:00:00 2001
From: Kofi Bartlett
Date: Sun, 9 Mar 2025 17:56:11 +0900
Subject: [PATCH 1/6] Edits copied over from https://github.com/elastic/elasticsearch/pull/120346

---
 .../how-to/size-your-shards.asciidoc | 36 +++++++++++++------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/docs/reference/how-to/size-your-shards.asciidoc b/docs/reference/how-to/size-your-shards.asciidoc
index 5f67014d5bb4a..185a62fb3981e 100644
--- a/docs/reference/how-to/size-your-shards.asciidoc
+++ b/docs/reference/how-to/size-your-shards.asciidoc
@@ -1,17 +1,28 @@
 [[size-your-shards]]
 == Size your shards
+[discrete]
+[[what-is-a-shard]]
+=== What is a shard?
+
+A shard is a basic unit of storage in {es}. Every index is divided into one or more shards to help distribute data and workload across nodes in a cluster. This division allows {es} to handle large datasets and perform operations like searches and indexing efficiently but not without cost. Each index and shard has some overhead and if you divide your data across too many shards then the overhead will degrade performance. Shards play several key roles in {es}:
+
+* *Data Distribution:* Each shard contains a portion of the data from the index. When you add more nodes to your cluster, {es} will spread the shards across the nodes, balancing the workload between them.
+* *Replication:* Shards can have replicas which are copies of the original shard. Replicas ensure data availability and improve search performance by allowing multiple nodes to handle requests for that shard.
+* *Parallel Processing:* Shards enable {es} to distribute indexing of documents, and process queries in parallel across shards, making ingestion and searches faster and more efficient.
+
+By effectively using shards, {es} can scale horizontally and provide fault tolerance, ensuring your data is distributed and indexing and searches are processed efficiently.
+
+[discrete]
+[[sizing-shard-guidelines]]
+=== Sizing Shard Guidelines
+
+Proper shard sizing is crucial for maintaining the performance and stability of an {es} cluster. _Oversharding_ occurs when data is distributed across an excessive number of shards (primary or replica), which can degrade search performance and make the cluster unstable. Conversely, very large shards may slow down search operations and prolong recovery times after failures.
+
+To strike the right balance, the <> are to aim for shard sizes between 10GB and 50GB, keeping the per-shard document count below 200 million. To ensure that each node is working optimally, it's important to distribute shards evenly across nodes. Uneven distribution can cause some nodes to work harder than others, leading to performance degradation and instability. While Elasticsearch automatically balances shards, it’s important to configure your indices with an appropriate number of shards and replicas to facilitate even distribution across nodes.
 
-Each index in {es} is divided into one or more shards, each of which may be
-replicated across multiple nodes to protect against hardware failures. If you
-are using <> then each data stream is backed by a sequence of
-indices. There is a limit to the amount of data you can store on a single node
-so you can increase the capacity of your cluster by adding nodes and increasing
-the number of indices and shards to match. However, each index and shard has
-some overhead and if you divide your data across too many shards then the
-overhead can become overwhelming. A cluster with too many indices or shards is
-said to suffer from _oversharding_. An oversharded cluster will be less
-efficient at responding to searches and in extreme cases it may even become
-unstable.
+If you are using <>, each data stream is backed by a sequence of indices, each index potentially having multiple shards.
+
+Despite these general guidelines, it is good to develop a tailored <> that considers your specific infrastructure, use case, and performance expectations.
 
 [discrete]
 [[create-a-sharding-strategy]]
@@ -208,6 +219,7 @@ index can be <>. You may then consider setting <> against the destination index for the source index's name to point to it for continuity.
+See this https://www.youtube.com/watch?v=sHyNYnwbYro[fixing shard sizes video] for an example troubleshooting walkthrough.
 
 [discrete]
 [[shard-count-recommendation]]
@@ -571,6 +583,8 @@ PUT _cluster/settings
 }
 ----
 
+See this https://www.youtube.com/watch?v=tZKbDegt4-M[fixing "max shards open" video] for an example troubleshooting walkthrough. For more information, see <>.
+
 [discrete]
 [[troubleshooting-max-docs-limit]]
 ==== Number of documents in the shard cannot exceed [2147483519]

From faa0742ddbc70e5f4ba796c27f593cd68ef0ed33 Mon Sep 17 00:00:00 2001
From: Kofi B
Date: Mon, 17 Mar 2025 20:32:17 -0700
Subject: [PATCH 2/6] Update docs/reference/how-to/size-your-shards.asciidoc

Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
---
 docs/reference/how-to/size-your-shards.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/reference/how-to/size-your-shards.asciidoc b/docs/reference/how-to/size-your-shards.asciidoc
index 185a62fb3981e..a7171e5cf8db5 100644
--- a/docs/reference/how-to/size-your-shards.asciidoc
+++ b/docs/reference/how-to/size-your-shards.asciidoc
@@ -22,7 +22,7 @@ To strike the right balance, the <
 
 If you are using <>, each data stream is backed by a sequence of indices, each index potentially having multiple shards.
 
-Despite these general guidelines, it is good to develop a tailored <> that considers your specific infrastructure, use case, and performance expectations.
+In addition to these general guidelines, you should develop a tailored <> that considers your specific infrastructure, use case, and performance expectations.
 
 [discrete]
 [[create-a-sharding-strategy]]

From 68572f73c530e51f20ebba94d9efaab69321d107 Mon Sep 17 00:00:00 2001
From: Kofi B
Date: Mon, 17 Mar 2025 20:46:06 -0700
Subject: [PATCH 3/6] Improve Scanability

Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
---
 .../reference/how-to/size-your-shards.asciidoc | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/docs/reference/how-to/size-your-shards.asciidoc b/docs/reference/how-to/size-your-shards.asciidoc
index a7171e5cf8db5..fd987a4f2aaeb 100644
--- a/docs/reference/how-to/size-your-shards.asciidoc
+++ b/docs/reference/how-to/size-your-shards.asciidoc
@@ -16,9 +16,23 @@ By effectively using shards, {es} can scale horizontally and provide fault toler
 [[sizing-shard-guidelines]]
 === Sizing Shard Guidelines
 
-Proper shard sizing is crucial for maintaining the performance and stability of an {es} cluster. _Oversharding_ occurs when data is distributed across an excessive number of shards (primary or replica), which can degrade search performance and make the cluster unstable. Conversely, very large shards may slow down search operations and prolong recovery times after failures.
+Balancing the number and size of your shards is important for the performance and stability of an {es} cluster:
 
-To strike the right balance, the <> are to aim for shard sizes between 10GB and 50GB, keeping the per-shard document count below 200 million. To ensure that each node is working optimally, it's important to distribute shards evenly across nodes. Uneven distribution can cause some nodes to work harder than others, leading to performance degradation and instability. While Elasticsearch automatically balances shards, it’s important to configure your indices with an appropriate number of shards and replicas to facilitate even distribution across nodes.
+* Too many shards can degrade search performance and make the cluster unstable. This is referred to as _oversharding_.
+* Very large shards can slow down search operations and prolong recovery times after failures.
+
+To avoid either of these states, implement the following guidelines:
+
+==== General sizing guidelines
+
+* Aim for shard sizes between 10GB and 50GB
+* Keep the number of documents on each shard below 200 million
+
+==== Shard distribution guidelines
+
+To ensure that each node is working optimally, distribute shards evenly across nodes. Uneven distribution can cause some nodes to work harder than others, leading to performance degradation and instability.
+
+While {es} automatically balances shards, you need to configure indices with an appropriate number of shards and replicas to allow for even distribution across nodes.
 
 If you are using <>, each data stream is backed by a sequence of indices, each index potentially having multiple shards.
 

From 7ffd0586e74218bf7d03b5ecd4d65357bc1bc484 Mon Sep 17 00:00:00 2001
From: Kofi Bartlett
Date: Wed, 19 Mar 2025 10:45:11 +0900
Subject: [PATCH 4/6] Reduced what is a shard section for concision

---
 docs/reference/how-to/size-your-shards.asciidoc | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/docs/reference/how-to/size-your-shards.asciidoc b/docs/reference/how-to/size-your-shards.asciidoc
index fd987a4f2aaeb..dc17f5f66c9bc 100644
--- a/docs/reference/how-to/size-your-shards.asciidoc
+++ b/docs/reference/how-to/size-your-shards.asciidoc
@@ -4,13 +4,7 @@
 [[what-is-a-shard]]
 === What is a shard?
 
-A shard is a basic unit of storage in {es}. Every index is divided into one or more shards to help distribute data and workload across nodes in a cluster. This division allows {es} to handle large datasets and perform operations like searches and indexing efficiently but not without cost. Each index and shard has some overhead and if you divide your data across too many shards then the overhead will degrade performance. Shards play several key roles in {es}:
-
-* *Data Distribution:* Each shard contains a portion of the data from the index. When you add more nodes to your cluster, {es} will spread the shards across the nodes, balancing the workload between them.
-* *Replication:* Shards can have replicas which are copies of the original shard. Replicas ensure data availability and improve search performance by allowing multiple nodes to handle requests for that shard.
-* *Parallel Processing:* Shards enable {es} to distribute indexing of documents, and process queries in parallel across shards, making ingestion and searches faster and more efficient.
-
-By effectively using shards, {es} can scale horizontally and provide fault tolerance, ensuring your data is distributed and indexing and searches are processed efficiently.
+A shard is a basic unit of storage in {es}. Every index is divided into one or more shards to help distribute data and workload across nodes in a cluster. This division allows {es} to handle large datasets and perform operations like searches and indexing efficiently. For more detailed information on shards, see <>.
 
 [discrete]
 [[sizing-shard-guidelines]]

From f080f5bad3d9b80754c2ae18e7b9056f67aab78b Mon Sep 17 00:00:00 2001
From: Kofi Bartlett
Date: Wed, 19 Mar 2025 10:47:45 +0900
Subject: [PATCH 5/6] Adjusted title

---
 docs/reference/how-to/size-your-shards.asciidoc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/reference/how-to/size-your-shards.asciidoc b/docs/reference/how-to/size-your-shards.asciidoc
index dc17f5f66c9bc..f54fbd0e220d8 100644
--- a/docs/reference/how-to/size-your-shards.asciidoc
+++ b/docs/reference/how-to/size-your-shards.asciidoc
@@ -8,7 +8,7 @@ A shard is a basic unit of storage in {es}. Every index is divided into one or m
 
 [discrete]
 [[sizing-shard-guidelines]]
-=== Sizing Shard Guidelines
+=== General guidelines
 
 Balancing the number and size of your shards is important for the performance and stability of an {es} cluster:
 

From 1cd4471333f84d119d8448d6d126c526bbb17469 Mon Sep 17 00:00:00 2001
From: George Wallace
Date: Mon, 7 Apr 2025 18:31:56 -0600
Subject: [PATCH 6/6] Add general and distribution sizing guidelines.

---
 docs/reference/how-to/size-your-shards.asciidoc | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/docs/reference/how-to/size-your-shards.asciidoc b/docs/reference/how-to/size-your-shards.asciidoc
index f54fbd0e220d8..3b3891b43500e 100644
--- a/docs/reference/how-to/size-your-shards.asciidoc
+++ b/docs/reference/how-to/size-your-shards.asciidoc
@@ -17,11 +17,15 @@ Balancing the number and size of your shards is important for the performance an
 
 To avoid either of these states, implement the following guidelines:
 
+[discrete]
+[[general-sizing-guidelines]]
 ==== General sizing guidelines
 
 * Aim for shard sizes between 10GB and 50GB
 * Keep the number of documents on each shard below 200 million
 
+[discrete]
+[[shard-distribution-guidelines]]
 ==== Shard distribution guidelines
 
 To ensure that each node is working optimally, distribute shards evenly across nodes. Uneven distribution can cause some nodes to work harder than others, leading to performance degradation and instability.
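
Reviewer note (not part of the patch series): the general sizing guidelines added above (shards between 10GB and 50GB, fewer than 200 million documents per shard) can be checked mechanically. The following is a minimal Python sketch under stated assumptions, not an official tool. It assumes each input row looks like one entry returned by `GET _cat/shards?format=json&bytes=b&h=index,shard,prirep,store,docs`, with numeric values delivered as strings. The function names `check_shard` and `report` and the sample rows are hypothetical, and treating 1GB as 1024^3 bytes is an assumption of this sketch.

```python
"""Flag shards that fall outside the general sizing guidelines."""

GB = 1024 ** 3                 # assumption: binary gigabytes
MIN_SHARD_BYTES = 10 * GB      # lower bound of the 10GB to 50GB guideline
MAX_SHARD_BYTES = 50 * GB      # upper bound of the 10GB to 50GB guideline
MAX_SHARD_DOCS = 200_000_000   # guideline: below 200 million docs per shard


def check_shard(row):
    """Return a list of guideline warnings for one shard row.

    Rows with no store or docs value (assumed here to mean the shard
    is not assigned to a node) produce no warnings.
    """
    if row.get("store") is None or row.get("docs") is None:
        return []
    size = int(row["store"])
    docs = int(row["docs"])
    warnings = []
    if size < MIN_SHARD_BYTES:
        warnings.append("smaller than 10GB")
    elif size > MAX_SHARD_BYTES:
        warnings.append("larger than 50GB")
    if docs >= MAX_SHARD_DOCS:
        warnings.append("200 million documents or more")
    return warnings


def report(rows):
    """Map 'index[shard] p|r' labels to warnings, omitting compliant shards."""
    findings = {}
    for row in rows:
        warnings = check_shard(row)
        if warnings:
            label = f"{row['index']}[{row['shard']}] {row['prirep']}"
            findings[label] = warnings
    return findings


if __name__ == "__main__":
    # Hypothetical sample rows, not output from a real cluster.
    sample = [
        {"index": "logs-000001", "shard": "0", "prirep": "p",
         "store": str(30 * GB), "docs": "150000000"},
        {"index": "logs-000002", "shard": "0", "prirep": "p",
         "store": str(2 * GB), "docs": "9000000"},
        {"index": "logs-000003", "shard": "0", "prirep": "p",
         "store": str(80 * GB), "docs": "250000000"},
    ]
    for label, warnings in report(sample).items():
        print(label, "->", ", ".join(warnings))
```

The thresholds are module-level constants so they can be adjusted to a tailored sharding strategy; the guideline values are starting points rather than hard limits, apart from the separate hard cap on documents per shard that the documentation's troubleshooting section covers.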