Problem 1
Starting the Redis cluster fails with a connection timeout.
The cluster was created with redis-trib.rb, the management tool bundled in the official Redis package; it is a script written in Ruby.
So Ruby has to be installed first.
Then install the Ruby Redis client via gem: gem install redis. By default this installs version 3.3.2.
Create the cluster:
./redis-trib.rb create --replicas 1 192.168.188.129:6379 192.168.188.129:7379 192.168.188.129:8379 192.168.188.132:6379 192.168.188.132:7379 192.168.188.132:8379
It fails with:
rescue in _write_to_socket': Connection timed out (Redis::TimeoutError)
Solution: I double-checked the network and disabled both the firewall and SELinux, but it turned out not to be a connectivity problem at all. The real cause, as far as I can tell, was a version mismatch between the Ruby Redis client and the Redis instances.
Uninstall the client: gem uninstall redis --version 3.3.2
Install an older one: gem install redis -v 3.2.2
Note that the official gem repository has no redis 3.2.6; the latest 3.2.x release available is 3.2.2.
After the reinstall the problem was gone.
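Before and after the downgrade it is worth confirming which client version gem actually installed and which server version it has to talk to (both commands are standard, nothing here is specific to this setup):
gem list redis          # lists the installed redis gem version(s)
redis-server --version  # the Redis server version the client must match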
Problem 2
Cluster creation hangs forever at: Waiting for the cluster to join......
Redis Cluster opens a second TCP port on every node, the cluster bus port, which is the data port plus 10000 (for a node on 6379, the cluster opens 16379 as its cluster bus port). The cluster bus is the node-to-node communication channel; it is used for failure detection, configuration updates, failover authorization, and so on. So if a firewall is enabled, both ports must be opened (e.g. both 6379 and 16379).
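On a firewalld-based system (an assumption; adjust to whatever firewall is actually in use), a minimal sketch of opening one node's pair of ports looks like:
firewall-cmd --permanent --add-port=6379/tcp    # data port
firewall-cmd --permanent --add-port=16379/tcp   # cluster bus port (data port + 10000)
firewall-cmd --reload
Repeat for every data port in use (here 6379/7379/8379 on each machine, plus 16379/17379/18379).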
Problem 3
Startup fails with:
[ERR] Node XXX.XXX.XXX.XXX:XXXX is not empty. Either the node already knows other nodes (check with CLUSTER NODES) or contains some
This usually means a previous cluster creation did not complete and one of the nodes already holds data. The same error also appears when adding a node to an existing cluster if the node being added already contains data. Delete the reported node's cluster config file, RDB file, and AOF file, then restart it.
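A cleanup sketch for one offending node, assuming the file names and data directory from the redis.conf shown later in this post (nodes-6379.conf, dump.rdb, appendonly.aof under /var/lib/redis/6379; the config path /etc/redis/6379.conf is hypothetical):
redis-cli -p 6379 shutdown nosave   # stop the node without persisting anything
rm -f /var/lib/redis/6379/nodes-6379.conf /var/lib/redis/6379/dump.rdb /var/lib/redis/6379/appendonly.aof
redis-server /etc/redis/6379.conf   # restart the now-empty node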
Problem 4
Startup fails with:
[ERR] Slot XXXXXX is already busy (Redis::CommandError)
Again caused by an earlier cluster creation that did not complete. Connect to each node with redis-cli -p XXXX, run FLUSHALL and CLUSTER RESET on every one of them, then create the cluster again.
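A sketch of resetting all six nodes in this two-machine layout (hosts and ports as used throughout this post; note that FLUSHALL deletes all data and CLUSTER RESET wipes the node's cluster state):
for port in 6379 7379 8379; do
  for host in 192.168.188.129 192.168.188.132; do
    redis-cli -h $host -p $port flushall
    redis-cli -h $host -p $port cluster reset
  done
done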
Problem 5
A failover pitfall: in Redis Cluster, when a master goes down, one of its slaves is elected and promoted to master in its place. But the election needs the participation of more than half of the masters; if more than half of the masters are not alive to vote, the slave will not be promoted. The instance config has a cluster-node-timeout parameter (in milliseconds): in this two-machine, 3-master/3-slave setup, if two or more masters die within cluster-node-timeout of each other, no slave can be elected master any more. There is also a cluster-require-full-coverage parameter, default yes, which means the cluster stops serving as soon as part of the key space is lost. So with cluster-require-full-coverage yes, the whole cluster goes down in this scenario; with no, the cluster keeps serving, but the slots owned by the dead nodes can no longer be operated on at all.
The only real fix is more machines. I have 6 nodes on 2 machines here; if the machine that hosts two of the masters dies, the cluster inevitably goes down with it.
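To watch these two settings and the cluster's view of itself from any node, the standard commands suffice (a sketch; CONFIG GET works on these parameters at runtime):
redis-cli -p 6379 cluster info                              # cluster_state, cluster_slots_ok, cluster_known_nodes
redis-cli -p 6379 config get cluster-node-timeout           # failure-detection/election window, in ms
redis-cli -p 6379 config get cluster-require-full-coverage  # yes: stop serving on any slot loss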
Problem 6
An extension of the previous problem: with one slave per master, never run a master and its own slave on the same machine. If cluster-require-full-coverage is yes and that machine dies, the whole cluster dies with it.
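A quick way to audit the placement (a sketch): list the topology and compare each slave's IP with its master's; in CLUSTER NODES output, the fourth field of a slave line is the node ID of its master.
redis-cli -p 6379 cluster nodes   # check that no slave shares an IP with the master it replicates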
Problem 7
Still following on from Problem 5: the slot assignment of a Redis Cluster is entirely up to us, but unless you are decommissioning a master, do not migrate all of its slots away; it is easy to shoot yourself in the foot. (When you do decommission a master, you must move its data, i.e. its slots, away first, or the removal will fail.) The reason: a master that owns no slots cannot take part in elections, so if a slot-holding master happens to die at that moment, there may not be enough masters left to vote, and the cluster becomes unavailable. In the extreme case, assign all 16384 slots to a single master: if that master dies, then even though its slave is alive, the cluster has no way to elect the slave as master, so automatic failover simply cannot happen.
For reference, the redis.conf used on each node (port, pidfile, logfile, and cluster-config-file vary per instance):
bind 0.0.0.0
protected-mode yes
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize yes
supervised no
pidfile /var/run/redis_6379.pid
loglevel notice
logfile /var/log/redis_6379.log
databases 16
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis/6379
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-disable-tcp-nodelay no
slave-priority 100
maxmemory 2gb
# strongly recommended: keep AOF disabled here
# appendonly yes
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
cluster-enabled yes
cluster-config-file nodes-6379.conf
cluster-node-timeout 5000
cluster-slave-validity-factor 0
cluster-migration-barrier 1
cluster-require-full-coverage no
slowlog-log-slower-than 10000
slowlog-max-len 128
latency-monitor-threshold 0
notify-keyspace-events ""
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-size -2
list-compress-depth 0
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes
Handy inspection commands:
redis-cli -c cluster info
redis-cli -p XXXX -c cluster info
redis-cli -c cluster nodes
redis-cli -p XXXX -c cluster nodes
redis-trib.rb check 192.168.188.132:8379
Removing a node. Precondition: all of the node's slots must have been migrated away; use redis-trib.rb reshard X.X.X.X:XXXX and follow the prompts to move them.
redis-trib.rb del-node 192.168.188.132:8379 e7ce3c7e4a6cdeecae97b2330975361209bc02a2
>>> Removing node e7ce3c7e4a6cdeecae97b2330975361209bc02a2 from cluster 192.168.188.132:8379
>>> Sending CLUSTER FORGET messages to the cluster...
>>> SHUTDOWN the node.
Adding a node
redis-trib.rb add-node 192.168.188.132:8379 192.168.188.132:7379
>>> Adding node 192.168.188.132:8379 to cluster 192.168.188.132:7379
>>> Performing Cluster Check (using node 192.168.188.132:7379)
M: 6449b2eaefe0018110a946f82e89cc03a8bfa900 192.168.188.132:7379
slots:0-5460,10923-16383 (10922 slots) master
2 additional replica(s)
S: 00b906181d22ea31eb2eddc015fb788f4e2969b9 192.168.188.132:6379
slots: (0 slots) slave
replicates f2879b098eb546f95962f713e455cbae2f26b4f8
S: a540a580139a87d304d74ab5089aadefb0fe27cb 192.168.188.129:7379
slots: (0 slots) slave
replicates 6449b2eaefe0018110a946f82e89cc03a8bfa900
S: b9cf998626820e5e05798518439d92b86265f103 192.168.188.129:6379
slots: (0 slots) slave
replicates 6449b2eaefe0018110a946f82e89cc03a8bfa900
M: f2879b098eb546f95962f713e455cbae2f26b4f8 192.168.188.129:8379
slots:5461-10922 (5462 slots) master
1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
>>> Send CLUSTER MEET to node 192.168.188.132:8379 to make it join the cluster.
[OK] New node added correctly.
This adds the 192.168.188.132:8379 node back into the cluster; the second argument is simply any node already in the cluster.
Now give it some slots.
Here I assign it 5000:
redis-trib.rb reshard 192.168.188.132:8379
>>> Performing Cluster Check (using node 192.168.188.132:8379)
M: cb9b8e99d976713693e1e1fc3e69ee2d9a5bb011 192.168.188.132:8379
slots: (0 slots) master
0 additional replica(s)
S: b9cf998626820e5e05798518439d92b86265f103 192.168.188.129:6379
slots: (0 slots) slave
replicates 6449b2eaefe0018110a946f82e89cc03a8bfa900
S: 00b906181d22ea31eb2eddc015fb788f4e2969b9 192.168.188.132:6379
slots: (0 slots) slave
replicates f2879b098eb546f95962f713e455cbae2f26b4f8
S: a540a580139a87d304d74ab5089aadefb0fe27cb 192.168.188.129:7379
slots: (0 slots) slave
replicates 6449b2eaefe0018110a946f82e89cc03a8bfa900
M: f2879b098eb546f95962f713e455cbae2f26b4f8 192.168.188.129:8379
slots:5461-10922 (5462 slots) master
1 additional replica(s)
M: 6449b2eaefe0018110a946f82e89cc03a8bfa900 192.168.188.132:7379
slots:0-5460,10923-16383 (10922 slots) master
2 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
How many slots do you want to move (from 1 to 16384)? 5000
What is the receiving node ID? cb9b8e99d976713693e1e1fc3e69ee2d9a5bb011
Please enter all the source node IDs.
Type 'all' to use all the nodes as source nodes for the hash slots.
Type 'done' once you entered all the source nodes IDs.
Source node #1:6449b2eaefe0018110a946f82e89cc03a8bfa900
Source node #2:done
Check the result:
redis-trib.rb check 192.168.188.132:8379
>>> Performing Cluster Check (using node 192.168.188.132:8379)
M: cb9b8e99d976713693e1e1fc3e69ee2d9a5bb011 192.168.188.132:8379
slots:0-4999 (5000 slots) master
1 additional replica(s)
S: b9cf998626820e5e05798518439d92b86265f103 192.168.188.129:6379
slots: (0 slots) slave
replicates 6449b2eaefe0018110a946f82e89cc03a8bfa900
S: 00b906181d22ea31eb2eddc015fb788f4e2969b9 192.168.188.132:6379
slots: (0 slots) slave
replicates f2879b098eb546f95962f713e455cbae2f26b4f8
S: a540a580139a87d304d74ab5089aadefb0fe27cb 192.168.188.129:7379
slots: (0 slots) slave
replicates cb9b8e99d976713693e1e1fc3e69ee2d9a5bb011
M: f2879b098eb546f95962f713e455cbae2f26b4f8 192.168.188.129:8379
slots:5461-10922 (5462 slots) master
1 additional replica(s)
M: 6449b2eaefe0018110a946f82e89cc03a8bfa900 192.168.188.132:7379
slots:5000-5460,10923-16383 (5922 slots) master
1 additional replica(s)
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
>>> Check slots coverage...
[OK] All 16384 slots covered.
Adding a replica: join node 192.168.188.222:8379 to the cluster that 192.168.188.129:8379 belongs to, as a slave of 1c3ef1f3474bf96b9b8749c2fd106db53c728d54 (the ID of one of the cluster's masters).
redis-trib.rb add-node --slave --master-id 1c3ef1f3474bf96b9b8749c2fd106db53c728d54 192.168.188.222:8379 192.168.188.129:8379
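The master ID passed to --master-id can be read off the first column of CLUSTER NODES output on any node of the cluster, e.g.:
redis-cli -h 192.168.188.129 -p 8379 cluster nodes | grep master   # first field of each line is the node ID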