Building an Elasticsearch cluster for log ingestion can be as easy as assembling an Ikea Billy bookshelf, while sizing one can be as complex as engineering a 24 Hours of Le Mans race car. In this article series we will put a spotlight on what has to be taken into account when planning such clusters. We will take a look at index management strategies and node resource choices, then tune the cluster, indices and shippers in order to achieve optimal data ingestion performance.
Mapping Out the Cluster
Elasticsearch does not come with a one-size-fits-all recipe in any shape or form: use cases and data shape a cluster’s contours. The complexity ranges from a simple yet sufficient single-node setup to a distributed multi-node cluster with dedicated master, data and ingest nodes.
The number of data nodes is determined by the data sources, JVM heap size, index lifecycling, data growth and the number of shards. In fact, we cannot properly size and optimize a cluster until it is hit by a production workload. However, by closely examining our use case and data, we can estimate the cluster size, then refine and optimize it further. The process can be applied to any Elasticsearch version from 6.7 up to release 8.
Planning
Building an Elasticsearch cluster follows an iterative procedure:
- Clarify the use case
- Examine the data and its sources
- Define an index management strategy
- Draft the cluster based on that information
- Benchmark with test data
- Refine and tune the cluster with test and real data
The draft must respect certain boundaries:
- A shard has a capacity of 2.1 billion (2^31) documents.
- Elastic recommends a heap-to-shard ratio of 1:20 (1GB of heap manages 20 shards) in order to guarantee best performance. Although it is not a hard limit, we want to stay close to that number (a quick way to check a running cluster against this guideline is sketched right after this list).
- It is advised to not grow a shard larger than 60GB to 80GB for two main reasons:
- In case of a failure, it will take longer to recover and rebalance big shards than it would with small shards.
- Merge operations are I/O-intensive jobs. Long CPU and disk consumption cycles can lead to degraded indexing throughput and, worse, dropped messages.
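The heap-to-shard guideline can be sanity-checked on any running cluster via the cluster stats API. A minimal sketch in Kibana Dev Tools syntax; the filter_path parameter is only there to trim the response down to the two numbers we care about:

GET _cluster/stats?filter_path=indices.shards.total,nodes.jvm.mem.heap_max_in_bytes

Dividing the total shard count by the reported heap in GB (note that the value aggregates all nodes, not just data nodes) gives a rough idea of how close the cluster is to the 1:20 guideline.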
Alongside these boundaries, we define the index management and data retention strategies. Once the initial draft is made, it’s ready for a benchmark with test and real data.
Drafting the Elasticsearch Cluster
Our main concern is how data is managed in indices. The way we organize indices and their lifecycles will affect node resources and therefore influence the cluster size. Thus, we’re primarily sizing a cluster based on our index management and housekeeping decisions, and only secondarily for throughput.
Choosing a Proper Heap Size
Although it is tempting to set the JVM heap to the maximum configurable size, it is advised to follow the setup guide and keep the heap small enough for Elasticsearch to start with compressed oops. Generally speaking, we recommend starting low and increasing the heap as needed. A Heap of Trouble discusses the implications of too large and too small JVM heap sizes.
TL;DR:
[…] If the heap is too small, applications will be prone to the danger of out of memory errors. […] If the heap is too large, the application will be prone to infrequent long latency spikes from full-heap garbage collections. […] a long pause is indistinguishable from a node that is unreachable because it is hung […].
[…] it’s better to set the heap as low as possible while satisfying your requirements for indexing and query throughput, end-user query response times, yet large enough to have adequate heap space for indexing buffers, and large consumers of heap space like aggregations, and suggesters.
Heap size needs to be monitored and adjusted under production workload. A heap of 8GB to 12GB might be a reasonable starting point, unless a tremendous amount of data, users and complex queries is expected right from the start.
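A minimal sketch of how such a starting point translates into configuration, assuming Elasticsearch 7.7 or later, which picks up custom JVM flags from files under config/jvm.options.d/ (the file name heap.options is arbitrary; older versions use config/jvm.options directly):

# config/jvm.options.d/heap.options
# pin initial and maximum heap to the same value to avoid resize pauses
-Xms8g
-Xmx8g

Keeping -Xms and -Xmx identical is what Elasticsearch expects, and staying well below roughly 26GB to 30GB keeps compressed oops in play; the exact threshold depends on the JVM.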
Planning Index Organization
In order to estimate the resources needed, we have to examine our data and sources. Five things will help with making the first draft:
- data retention periods
- data structures and affiliation
- amount of data received per given time period
- number of clients shipping data
- number of clients querying data
The number of clients becomes valuable when tuning the cluster for data ingestion and calculating the client shard concurrency. We will take a closer look at concurrency when benchmarking the cluster; for now, we do not have to take it into account. The expected data volume per day gives us a hint whether basic index management will be sufficient, or whether the index management, or even the whole cluster, needs further enhancement.
Shards Influence Cluster Size
The total number of shards is the result of how indices and their lifecycles are managed. If we commit to the 1:20 ratio, a node with 8GB heap is capable of managing 160 shards. A single node can hold 80 daily rotated indices configured with 1 shard and 1 replica. If we set index.number_of_shards to 2, the node has a capacity of 40 indices.
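Where index.number_of_shards actually gets set is typically an index template. A minimal sketch in Kibana Dev Tools syntax, assuming Elasticsearch 7.8 or later (composable templates; earlier versions use the legacy _template API) and a hypothetical logs-* naming scheme:

PUT _index_template/logs-daily
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.number_of_shards": 2,
      "index.number_of_replicas": 1
    }
  }
}

Every index matching logs-* then starts out with four shards in total (two primaries plus their replicas), and that total is what counts against the 1:20 guideline.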
Consolidate Data by Affiliation
Logs come from many different sources and applications, and they can be either structured or unstructured. In order to reduce the number of indices, and therefore the number of shards, we should consider storing logs with a similar structure in the same index. Grouping logs by structure and affiliation instead of origin will help keep the cluster small.
Having three applications with a similar structure logging into separate indices, each configured with 1 shard and 1 replica, leaves us with 180 shards after 30 days for a single environment. Three environments would generate 540 shards. That equates to at least three data nodes with 8GB heap each, and even then the 1:20 guideline is slightly exceeded. The same applications shipping data into a single shared index would result in just 180 shards for all three environments.
A common example of data affiliation is a highly available web application with a middleware, an HAProxy and a database. None of these components share a log structure, but the logs are affiliated because they belong to the same application. Whenever we want to track and correlate logs across several different components, it makes sense to ship the data into the same index in order to make it more easily accessible in Kibana.
Index Patterns
The most basic and common index is daily rotated and environment specific. A pattern like logs-%{environment}-%{YYYY.MM.dd} is the one most likely to be encountered. Depending on how much data a single index receives, this might be sufficient, or a waste of shards. Some applications might generate 300GB or more per day, others only 5GB or less. Whenever small indices are encountered, it is worth reconsidering the daily rotation and applying a weekly rotation instead, to use shards more efficiently.
When data is consolidated by affiliation and grouped by environment, how do we find a specific application’s logs? Filebeat needs an input for each log file, and that input adds extra fields so that a user can query on them and find the application’s log events.
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/application1-*.log
    fields:
      # identifies the application; stored under fields.app by default
      app: application1
In order to route the events into the correct index, we need to configure conditions on output.elasticsearch.
output.elasticsearch:
  hosts: ["https://elastic-1.domain.tld:9200"]
  indices:
    - index: "application-%{+yyyy.MM.dd}"
      # the custom field lives under the "fields" prefix unless fields_under_root is set
      when.contains:
        fields.app: "application"
The host on which the application is running is added to the event by Filebeat automatically.
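With the app field and the host metadata in place, finding one application’s events in Kibana comes down to a single filter. A minimal KQL sketch, using the hypothetical fields.app value from above and an equally hypothetical host name (the exact host field name can vary with the Filebeat version and enabled processors):

fields.app : "application1" and host.name : "web-01"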
Up Next
Part 2 will focus on the pros and cons of ILM and on shard allocation in a hot-warm-cold architecture, and it will explain client shard concurrency and how it affects cluster size.