Testing the Elasticsearch cluster behavior under network failure
Elasticsearch is a distributed system that depends heavily on the network; as such, you need to know how it behaves under different failure scenarios. This post shows a way of setting up these scenarios with Linux containers.
You can simulate a couple of failure scenarios:
- Node network loss
- Packet delay and/or loss
This setup should also be generally useful whenever you want to know how Elasticsearch behaves. For example:
- Expand or shrink the cluster
- Node maintenance
- Backup/Restore
- Shard re-allocation
- Near disk full behavior
The setup will be created with the help of VirtualBox, Vagrant, CoreOS, and Docker.
Preparation
Make sure you install VirtualBox, Vagrant, and Git, and have them available on your $PATH.
Check the tool versions.
VirtualBox:
VBoxManage --version
4.3.24r98716
Vagrant:
vagrant version
Installed Version: 1.7.2
Git:
git --version
git version 1.9.5.msysgit.0
CoreOS and Docker
CoreOS will be running inside a VirtualBox Virtual Machine. It will be used as our Docker host. Docker will be used to create a Linux container for each Elasticsearch node.
Install and start CoreOS:
git clone https://github.com/coreos/coreos-vagrant test-elasticsearch
cd test-elasticsearch
vagrant up
Open a shell:
vagrant ssh
Check the Docker version:
docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.3.3
Git commit (client): a8a31ef-dirty
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.3.3
Git commit (server): a8a31ef-dirty
Check the Docker info:
docker info
Containers: 0
Images: 0
Storage Driver: overlay
Backing Filesystem: extfs
Execution Driver: native-0.2
Kernel Version: 3.19.0
Operating System: CoreOS 618.0.0
CPUs: 1
Total Memory: 998 MiB
Name: core-01
ID: QLVN:NP7B:SZDN:D7PM:DWNF:364W:O4XA:3JLG:MNJ4:2NFS:L5YU:GUOO
Setup
We'll test the different scenarios in a three-node Elasticsearch cluster.
Each node is run in a Docker container that is based on the dockerfile/elasticsearch image.
Docker manages the network between the nodes. It does so by creating a pair of veth (virtual Ethernet) interfaces for each container. These form a point-to-point link between the container and the host. The host side is connected to a bridge, which allows the communication between nodes:
+-- docker0 bridge --+       +-- node0 container --+
|                    |       |                     |
|   +-----> $node0_veth <----> eth1 interface      |
|   |                |       |                     |
|   |                |       +---------------------+
|   |                |
|   |                |       +-- node1 container --+
|   |                |       |                     |
|   +-----> $node1_veth <----> eth1 interface      |
|   |                |       |                     |
|   |                |       +---------------------+
|   |                |
|   |                |       +-- node2 container --+
|   |                |       |                     |
|   +-----> $node2_veth <----> eth1 interface      |
|                    |       |                     |
+--------------------+       +---------------------+
NB This setup allows multicast, so you don't have to do anything more to get a working Elasticsearch cluster.
NB For more details read the Docker Network Configuration documentation.
The interface name of each container is saved in an environment variable, e.g. node0_veth. This will later let us easily fiddle with a node's network interface.
Each container will get its data from a volume that is stored on the host. This will let us easily see the logs and index data.
Each container will also have the normal Elasticsearch 9200 port forwarded to the host, e.g. node 1 will be available at http://localhost:9201.
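Once the nodes are running (they are started in the loop below), this lets you talk to a specific node directly from the host, e.g.:
curl http://localhost:9201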
Before you can start the nodes you need the docker-native-inspect tool. Build it with:
curl -o docker-native-inspect.tar.bz2 https://bitbucket.org/rgl/docker-native-inspect/get/default.tar.bz2
mkdir docker-native-inspect
sudo mkdir -p /opt/bin
tar xf docker-native-inspect.tar.bz2 --strip-components=1 -C docker-native-inspect
docker build -t docker-native-inspect docker-native-inspect
docker run --rm docker-native-inspect tar czf - docker-native-inspect | sudo tar vxzf - -C /opt/bin
docker-native-inspect -h
And start the Elasticsearch nodes with:
for i in 0 1 2; do
mkdir -p node$i/{data,work,logs,plugins}
docker run -d \
--name=node$i \
--hostname=node$i \
-p 920$i:9200 \
-v "$PWD/node$i:/data" \
dockerfile/elasticsearch \
/elasticsearch/bin/elasticsearch \
-Des.node.name=node$i \
-Des.path.data=/data/data \
-Des.path.work=/data/work \
-Des.path.logs=/data/logs \
-Des.path.plugins=/data/plugins \
-Des.logger.level=DEBUG
declare "node${i}_veth=`sudo docker-native-inspect -format '{{.network_state.veth_host}}' state node$i`"
done
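As a quick sanity check (not part of the original loop), verify that the three containers are running and that the veth interface names were captured:
docker ps
echo $node0_veth $node1_veth $node2_veth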
Install some plugins:
for i in 0 1 2; do
docker exec -it node$i /elasticsearch/bin/plugin --install mobz/elasticsearch-head # /_plugin/head
docker exec -it node$i /elasticsearch/bin/plugin --install royrusso/elasticsearch-HQ # /_plugin/HQ
#docker exec -it node$i /elasticsearch/bin/plugin --install lmenezes/elasticsearch-kopf # /_plugin/kopf
done
NB Access them at http://localhost:9200/_plugin/head or http://localhost:9200/_plugin/HQ.
If you need to access the node, e.g. to inspect something, you can use:
docker exec -it node0 bash
Create a basic hello index:
curl -XPUT http://localhost:9200/hello -d '
index:
number_of_shards: 3
number_of_replicas: 1
'
Index some documents:
for i in 1 2 3 4 5 6 7 8 9 10; do
curl -XPUT http://localhost:9200/hello/message/$i -d "{\"message\":\"hello $i\"}"
done
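To confirm that the documents were indexed and all shards got allocated, you can use the standard search and cluster health APIs:
curl 'http://localhost:9200/hello/_search?pretty'
curl 'http://localhost:9200/_cluster/health?pretty'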
To see how things are going, open a different shell, and tail the logs:
tail -f node*/logs/elasticsearch.log
Also open the head and HQ pages:
http://localhost:9200/_plugin/head
http://localhost:9200/_plugin/HQ
You are now ready to test the different scenarios!
Test
Node network loss
To simulate a network loss, we simply bring the node's container veth interface down:
sudo ip link set dev $node1_veth down
If you now try to refresh the cluster state in the head plugin, you should see that it no longer really works. After a while, the Elasticsearch logs should start to display a lot of activity.
For example, on the master node (in my case this was node2), things like:
[2015-03-14 14:50:55,265][DEBUG][action.admin.cluster.node.stats] [node2] failed to execute on node [gDtJuVYCQ5qHiU-aSKnj7g]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node1][inet[/10.1.0.12:9300]][cluster:monitor/nodes/stats[n]] request_id [289] timed out after [15002ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2015-03-14 14:51:25,263][DEBUG][action.admin.cluster.node.stats] [node2] failed to execute on node [gDtJuVYCQ5qHiU-aSKnj7g]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node1][inet[/10.1.0.12:9300]][cluster:monitor/nodes/stats[n]] request_id [326] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2015-03-14 14:51:55,265][DEBUG][action.admin.cluster.node.stats] [node2] failed to execute on node [gDtJuVYCQ5qHiU-aSKnj7g]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [node1][inet[/10.1.0.12:9300]][cluster:monitor/nodes/stats[n]] request_id [363] timed out after [15001ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:366)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
[2015-03-14 14:51:58,692][DEBUG][discovery.zen.fd ] [node2] [node ] failed to ping [[node1][gDtJuVYCQ5qHiU-aSKnj7g][node1][inet[/10.1.0.12:9300]]], tried [3] times, each with maximum [30s] timeout
[2015-03-14 14:51:58,700][DEBUG][cluster.service ] [node2] processing [zen-disco-node_failed([node1][gDtJuVYCQ5qHiU-aSKnj7g][node1][inet[/10.1.0.12:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout]: execute
[2015-03-14 14:51:58,726][DEBUG][cluster.service ] [node2] cluster state updated, version [11], source [zen-disco-node_failed([node1][gDtJuVYCQ5qHiU-aSKnj7g][node1][inet[/10.1.0.12:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-03-14 14:51:58,732][INFO ][cluster.service ] [node2] removed {[node1][gDtJuVYCQ5qHiU-aSKnj7g][node1][inet[/10.1.0.12:9300]],}, reason: zen-disco-node_failed([node1][gDtJuVYCQ5qHiU-aSKnj7g][node1][inet[/10.1.0.12:9300]]), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2015-03-14 14:51:58,732][DEBUG][cluster.service ] [node2] publishing cluster state version 11
After this, the cluster will scramble to recover the lost shards on the remaining nodes.
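You can watch this from the host by asking one of the surviving nodes, e.g. node0, for the cluster health:
curl 'http://localhost:9200/_cluster/health?pretty'
The status should go yellow while the replica shards are promoted and re-allocated on the two remaining nodes.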
And node1 (the one whose network cable we cut) will have something like:
[2015-03-14 14:51:58,847][DEBUG][discovery.zen.fd ] [node1] [master] failed to ping [[node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]]], tried [3] times, each with maximum [30s] timeout
[2015-03-14 14:51:58,851][DEBUG][discovery.zen.fd ] [node1] [master] stopping fault detection against master [[node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]]], reason [master failure, failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-03-14 14:51:58,854][INFO ][discovery.zen ] [node1] master_left [[node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2015-03-14 14:51:58,868][DEBUG][cluster.service ] [node1] processing [zen-disco-master_failed ([node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]])]: execute
[2015-03-14 14:51:58,870][WARN ][discovery.zen ] [node1] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: {[node0][tu2reCtdT2Shekx0gitB1g][node0][inet[/10.1.0.11:9300]],[node1][gDtJuVYCQ5qHiU-aSKnj7g][node1][inet[/10.1.0.12:9300]],}
[2015-03-14 14:51:58,898][DEBUG][cluster.service ] [node1] cluster state updated, version [10], source [zen-disco-master_failed ([node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]])]
[2015-03-14 14:51:58,903][INFO ][cluster.service ] [node1] removed {[node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]],}, reason: zen-disco-master_failed ([node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]])
[2015-03-14 14:51:58,918][DEBUG][cluster.service ] [node1] set local cluster state to version 10
[2015-03-14 14:51:58,968][DEBUG][cluster.service ] [node1] processing [zen-disco-master_failed ([node2][E3MoSOWDSDW_8AdCPznakg][node2][inet[/10.1.0.13:9300]])]: done applying updated cluster_state (version: 10)
After a while, node1 will be marked as failed by node2 and will no longer be pinged, which means it has to manually re-join the cluster.
After a while, node1 will elect itself as master:
[2015-03-14 14:53:03,491][DEBUG][cluster.service ] [node1] cluster state updated, version [11], source [zen-disco-join (elected_as_master)]
[2015-03-14 14:53:03,491][INFO ][cluster.service ] [node1] new_master [node1][gDtJuVYCQ5qHiU-aSKnj7g][node1][inet[/10.1.0.12:9300]], reason: zen-disco-join (elected_as_master)
[2015-03-14 14:53:03,492][DEBUG][cluster.service ] [node1] publishing cluster state version 11
[2015-03-14 14:53:33,504][WARN ][discovery.zen.publish ] [node1] timed out waiting for all nodes to process published state [11] (timeout [30s], pending nodes: [[node0][tu2reCtdT2Shekx0gitB1g][node0][inet[/10.1.0.11:9300]]])
This is somewhat odd, but it happens because the discovery.zen.minimum_master_nodes setting was left at the default of 1. You should probably increase that value.
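For a three-node cluster the usual quorum value is (3 / 2) + 1 = 2. As a sketch, you could pass it as an extra startup option in the docker run loop above:
-Des.discovery.zen.minimum_master_nodes=2
Or set it at runtime through the cluster settings API:
curl -XPUT http://localhost:9200/_cluster/settings -d '{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}'
With it set to 2, an isolated node can no longer elect itself as master.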
Bring the interface back up:
sudo ip link set dev $node1_veth up
You’ll notice that the node will not re-join the other nodes in the cluster. You have to manually restart it:
docker stop node1
docker start node1
declare "node1_veth=`sudo docker-native-inspect -format '{{.network_state.veth_host}}' state node1`"
And it should now join the cluster.
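To confirm, check that the node count is back to three:
curl http://localhost:9200/_cat/nodes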
Packet delay and/or loss
Linux lets you fiddle with the network stack in several ways. One of them lets you delay or discard packets.
You can influence the communication between all nodes by changing the docker0 bridge interface. You can also do it for a single node by changing the corresponding veth interface.
For example, to add a 100ms ± 10ms delay to all packets:
sudo tc qdisc add dev docker0 root netem delay 100ms 10ms
To change it to have 25% pseudo-correlation:
sudo tc qdisc change dev docker0 root netem delay 100ms 10ms 25%
To change it to have 1% packet loss:
sudo tc qdisc change dev docker0 root netem loss 1%
To change it to combine delay and loss:
sudo tc qdisc change dev docker0 root netem delay 100ms 10ms 25% loss 1%
To see all rules:
sudo tc qdisc show dev docker0
To remove all rules:
sudo tc qdisc del dev docker0 root
To apply the above examples to a specific node, use the corresponding veth device instead of the docker0 device, e.g. $node2_veth.
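For example, to delay only node2's traffic ($node2_veth was captured by the startup loop):
sudo tc qdisc add dev $node2_veth root netem delay 100ms 10ms
And to remove it again:
sudo tc qdisc del dev $node2_veth root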
For more information, see the tc-netem(8) man page.