ElasticSearch does not offer support for clusters spanning data centers. However, on our project we had network links with a latency of only 400 *micro*seconds (0.4 ms) between three separate locations in the same city, and decided to test a cluster spanning all three data centers. Network latency did not prove to be a problem, but a trickier issue was deciding how to set up the cluster to best guard against network partitioning.
Background
Network partitions arise when a network failure leads to ambiguity regarding cluster quorum/makeup: the well-established CAP theorem states that, in the case of partitioning, one can plan for either consistency or availability, but not both. The reality is often far scarier, with tool-stacks making all sorts of claims that do not stand up under rigorous testing (for the gory details, see the excellent series of blog articles here). To their credit, the ElasticSearch team have not only acknowledged this, but dedicate a page to up-to-date details of resiliency issues. They have also incorporated the Jepsen test suite into their own test harness.
In terms of partition tolerance, ElasticSearch will not allow any writes to a cluster whose state is red (e.g. in a split-brain situation) – providing Consistency over Availability – so preventing split-brain with a correct minimum-master-nodes setting is essential. Master election is conducted by ElasticSearch’s internal Zen Discovery component rather than by an external service – such as Zookeeper – that employs peer-reviewed and well-tested algorithms; in other words, it uses its own distributed consensus mechanism rather than existing tools.
ElasticSearch uses failure detection to elect masters rather than passing writes through a consensus algorithm: this is (somewhat unfortunately) the only option and is by design, so that Zen Discovery – with its access to the entire Cluster State – has more information available regarding cluster activity/stability.
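Since the post itself never shows the setting, here is a minimal sketch of how minimum-master-nodes could be applied, assuming the official Python client and a made-up host name; the same value can also be set statically in each node’s elasticsearch.yml.

```python
# Minimal sketch: applying minimum_master_nodes via the cluster settings API.
# Host name is hypothetical; for a 6-node cluster of master-eligible nodes the
# quorum is 6 / 2 + 1 = 4.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node1:9200"])

es.cluster.put_settings(body={
    "persistent": {
        # dynamic setting; can also be set statically in elasticsearch.yml
        "discovery.zen.minimum_master_nodes": 4
    }
})
```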
Replica count?
Increasing the number of replicas increases the likelihood of ElasticSearch remaining available and does not interfere with consistency, though this comes at a cost:
a) write overhead: every write has to be applied to additional shard copies (significant in a write-heavy environment),
b) disk space, and
c) performance (in both directions: replicas add replication overhead, but can also improve data locality).
Availability can be expressed in relation to internal replication (the number of replicas) or in relation to the minimum number of master nodes (configuration setting: discovery.zen.minimum_master_nodes). In the following scenarios we keep the replica count constant at 1 as we want to start simple and increase replicas only as and when needed.
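Purely for illustration (index name, shard count and host are assumptions, not from our project), this is how a single-replica index can be created with the Python client, and how the replica count can be raised later, since number_of_replicas is a dynamic setting:

```python
# Illustration only: a 6-shard index with a single replica, created with the
# official Python client; number_of_replicas is dynamic and can be raised later.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node1:9200"])

es.indices.create(index="logs", body={
    "settings": {"number_of_shards": 6, "number_of_replicas": 1}
})

# increase the replica count on demand, e.g. once more nodes are available
es.indices.put_settings(index="logs", body={"index": {"number_of_replicas": 2}})
```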
Scenario #1: Single DC
Notes
- 1 replica reduces internal load on ES
- 1 replica reduces overall space requirements
- no added complexity due to cross-DC activity
Availability
- Individual nodes: up to 5 nodes can fail individually (as replica shards are promoted to primaries and reallocated to the remaining live nodes), but no more than N simultaneously, where N is the number of replicas.
- Quorum: = 4 masters (N/2 + 1, where N here is the number of master-eligible nodes), so only two nodes can fail, regardless of how many replicas we have.
This is our starting point: a cluster hosted in a single data center with a single replica per shard. We can lose up to 5 nodes sequentially, but only 1 simultaneously: if we were to lose two nodes which between them happen to hold a "related" primary and replica shard, there would be no way to recover the data held in that shard. And we can only lose a maximum of two nodes before it is no longer possible to elect a master (since we need more than half the nodes).
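To make the quorum arithmetic explicit – assuming, as the numbers above imply, six master-eligible nodes – a throwaway helper:

```python
# Quorum helper: minimum_master_nodes should be floor(n / 2) + 1 for n
# master-eligible nodes, so a 6-node cluster tolerates the loss of only
# 2 nodes before a master can no longer be elected.
def quorum(master_eligible_nodes):
    return master_eligible_nodes // 2 + 1

assert quorum(6) == 4  # scenario #1: a single 6-node cluster
assert quorum(3) == 2  # e.g. a small 3-node cluster
```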
Scenario #2: 2 DCs, rack awareness
Notes
- no extra replication (snapshot/restore) needed to sync DCs
- 1 replica reduces overall space requirements
- config-overhead to setup rack awareness
- multiple DCs not supported by ES (clock time, latency, bandwidth)
Availability
- Individual nodes: up to 3 nodes can fail individually (rack awareness prevents replicas from being allocated to the same DC as their primaries), but no more than N simultaneously, where N is the number of replicas.
- Quorum: = 4 masters (N/2 + 1), so only two nodes can fail, regardless of how many replicas we have, and a single DC can never go offline (as we would drop below the quorum).
Here, we split our cluster across two data centers and implement rack awareness at the data-center level. This prevents a replica shard being created in the same data center as its primary. We haven’t really added any resilience, as the loss of a data center will compromise the whole cluster (we will no longer have a quorum of nodes).
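A sketch of the settings involved, assuming we call the awareness attribute dc (any name works, as long as each node is tagged with it in its elasticsearch.yml, e.g. node.dc: dc1):

```python
# Sketch: enabling shard allocation awareness on a custom attribute "dc".
# Each node must be tagged in its elasticsearch.yml, e.g.  node.dc: dc1
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node1:9200"])

es.cluster.put_settings(body={
    "persistent": {
        # ES will then try to keep a primary and its replica in different DCs
        "cluster.routing.allocation.awareness.attributes": "dc"
    }
})
```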
Scenario #3: 2 DCs, no rack awareness
Notes
- no extra replication (snapshot/restore) needed to sync DCs
- 1 replica reduces overall space requirements
- multiple DCs not supported by ES (clock time, latency, bandwidth)
- no extra configuration needed
Availability
- Individual nodes: up to 5 nodes can fail individually, but no more than N simultaneously, where N is the number of replicas.
- Quorum: = 4 masters (N/2 + 1), so only two nodes can fail, regardless of how many replicas we have.
As with the previous two scenarios, we haven’t added anything tangible in terms of enhanced resilience, although not enforcing rack awareness will reduce network traffic across the two data centers. In order to make use of data center redundancy we really need a third location so that the loss of a data center does not compromise the cluster.
Scenario #4: include a third DC as a tie-breaker
Notes
- no extra replication (snapshot/restore) needed to sync DCs
- 1 replica reduces overall space requirements
- multiple DCs not supported by ES (clock time, latency, bandwidth)
- no extra configuration required
Availability
- Individual nodes: up to 5 nodes can fail individually, but no more than N simultaneously, where N is the number of replicas.
- Quorum: = 4 masters (N/2 + 1), so 3 nodes can fail, or an entire DC can go offline.
With a third data center we are now able to cope with losing a whole data center, or up to 3 nodes sequentially, and still have a quorum. If we added rack awareness to this setup, we could also guard against a related primary and replica shard being lost together when a single data center fails.
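If we did want that protection, "forced" awareness is the relevant knob; a sketch, again assuming an attribute named dc with the values dc1, dc2 and dc3:

```python
# Sketch: "forced" awareness across three data centers (attribute name and
# values are assumptions). With forced awareness, if a DC goes offline the
# missing copies stay unassigned instead of being squeezed into the surviving
# DCs alongside their primaries.
from elasticsearch import Elasticsearch

es = Elasticsearch(["es-node1:9200"])

es.cluster.put_settings(body={
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "dc",
        "cluster.routing.allocation.awareness.force.dc.values": "dc1,dc2,dc3"
    }
})
```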
Scenario #5: 2 clusters hosted in different DCs
Notes
- "extra" replication (snapshot/restore) needed to sync DCs
- 1 replica reduces overall space requirements
- no extra configuration needed
- overhead implicit in re-syncing of DCs (we can only re-sync at roughly 2x speed of import)
- metadata must be tracked for re-syncing, or only re-sync at aggregation level
Availability
- Individual nodes: only 1 node can fail in each DC.
- Quorum: = 2 masters (N/2 + 1), so only 1 node can fail.
- DC: an entire DC can go offline.
A slightly different approach is to maintain two clusters, kept in sync either by the bulk-load process (i.e. the loading process inserts into both clusters in parallel) or by using the ElasticSearch snapshot/restore mechanism.
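A rough sketch of the snapshot/restore variant (repository name, location and host names are made up; the location must be on a filesystem reachable from both clusters): both clusters register the same repository, the primary cluster snapshots, the standby cluster restores.

```python
# Rough sketch of the snapshot/restore variant. Repository name, location and
# hosts are illustrative; the location must be shared by both clusters.
from elasticsearch import Elasticsearch

primary = Elasticsearch(["dc1-node1:9200"])
standby = Elasticsearch(["dc2-node1:9200"])

repo = {"type": "fs", "settings": {"location": "/mnt/es_backups"}}
primary.snapshot.create_repository(repository="dc_sync", body=repo)
standby.snapshot.create_repository(repository="dc_sync", body=repo)

# primary cluster: take a snapshot of all indices
primary.snapshot.create(repository="dc_sync", snapshot="snap_001",
                        wait_for_completion=True)

# standby cluster: restore it (target indices must be closed or not yet exist)
standby.snapshot.restore(repository="dc_sync", snapshot="snap_001")
```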
Conclusion
As you can see from the above, there are a number of possibilities, two or more of which can of course be combined as needed. Our eventual set-up was to have 3 data centers, each with 3 nodes, plus 2 dedicated client nodes that we used for client queries and aggregations.
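For completeness, the split between data nodes and dedicated client nodes comes down to two settings in each node’s elasticsearch.yml; shown here as a Python dict purely for illustration:

```python
# The node roles mentioned above boil down to two elasticsearch.yml settings
# per node; represented as a Python dict purely for illustration.
NODE_ROLES = {
    "data_node":   {"node.master": True,  "node.data": True},
    # dedicated client node: never master, holds no data, coordinates
    # queries/aggregations and merges the per-shard results
    "client_node": {"node.master": False, "node.data": False},
}
```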
Read the complete series
- Part 1: ElasticSearch as a Database
- Part 2: The aggregation framework
- Part 4: Aggregations & Plugins