Adding ZooKeeper Nodes and a Failure Drill


Environment overview
OS: Ubuntu 10.10 Server 64-bit
Servers:
hadoop-master:10.6.1.150
- namenode, jobtracker; hbase-master, hbase-thrift;
- secondarynamenode;
- hive-master, hive-metastore;
- zookeeper-server;
- flume-master;
- flume-node;
- datanode, tasktracker

hadoop-node-1:10.6.1.151
- datanode, tasktracker; hbase-regionserver;
- zookeeper-server;
- flume-node

hadoop-node-2:10.6.1.152
- datanode, tasktracker; hbase-regionserver;
- zookeeper-server;
- flume-node

The setup of this environment is covered in an earlier post: Hadoop Cluster Practice (0): Complete Architecture Design.

Two additional zookeeper-server nodes will now be added:
zookeeper-single-1:10.6.1.161
- zookeeper-server;

zookeeper-single-2:10.6.1.162
- zookeeper-server;

Conventions used in this article, to avoid confusion when running commands on multiple servers:
Any command prefixed with a bare $ and no hostname must be executed on all servers, unless a trailing // comment narrows the scope.
A command prefixed with dongguo@zookeeper-single-1:~$ must also be executed on zookeeper-single-2, unless a matching dongguo@zookeeper-single-2:~$ command appears alongside it.

1. Configure /etc/hosts
$ sudo vim /etc/hosts

127.0.0.1	localhost

10.6.1.150 hadoop-master
10.6.1.151 hadoop-node-1
10.6.1.152 hadoop-node-2
10.6.1.153 hadoop-node-3
10.6.1.161 zookeeper-single-1
10.6.1.162 zookeeper-single-2

2. Set the hostnames
dongguo@zookeeper-single-1:~$ sudo vim /etc/hostname

zookeeper-single-1

dongguo@zookeeper-single-1:~$ sudo hostname zookeeper-single-1

dongguo@zookeeper-single-2:~$ sudo vim /etc/hostname

zookeeper-single-2

dongguo@zookeeper-single-2:~$ sudo hostname zookeeper-single-2

3. Install the Java environment

Add the APT source for a matching Java version
dongguo@zookeeper-single-1:~$ sudo apt-get install python-software-properties
dongguo@zookeeper-single-1:~$ sudo vim /etc/apt/sources.list.d/sun-java-community-team-sun-java6-maverick.list

deb http://ppa.launchpad.net/sun-java-community-team/sun-java6/ubuntu maverick main
deb-src http://ppa.launchpad.net/sun-java-community-team/sun-java6/ubuntu maverick main

Install sun-java6-jdk
dongguo@zookeeper-single-1:~$ sudo add-apt-repository ppa:sun-java-community-team/sun-java6
dongguo@zookeeper-single-1:~$ sudo apt-get update
dongguo@zookeeper-single-1:~$ sudo apt-get install sun-java6-jdk

4. Configure Cloudera's Hadoop package repository
dongguo@zookeeper-single-1:~$ sudo vim /etc/apt/sources.list.d/cloudera.list

deb http://archive.cloudera.com/debian maverick-cdh3u3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3u3 contrib

dongguo@zookeeper-single-1:~$ sudo apt-get install curl
dongguo@zookeeper-single-1:~$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
dongguo@zookeeper-single-1:~$ sudo apt-get update

5. Install ZooKeeper
dongguo@zookeeper-single-1:~$ sudo apt-get install hadoop-zookeeper-server

6. Configure ZooKeeper
$ sudo vim /etc/zookeeper/zoo.cfg

tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper
clientPort=2181
maxClientCnxns=0
server.1=hadoop-master:2888:3888
server.2=hadoop-node-1:2888:3888
server.3=hadoop-node-2:2888:3888
server.4=zookeeper-single-1:2888:3888
server.5=zookeeper-single-2:2888:3888

dongguo@zookeeper-single-1:~$ sudo mkdir -p /data/zookeeper
dongguo@zookeeper-single-1:~$ sudo chown zookeeper:zookeeper /data/zookeeper

Create the myid files
dongguo@zookeeper-single-1:~$ sudo -u zookeeper vim /data/zookeeper/myid

4

dongguo@zookeeper-single-2:~$ sudo -u zookeeper vim /data/zookeeper/myid

5
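The myid value must match the N of this host's server.N line in zoo.cfg; a mismatch keeps the server out of the quorum. As a sanity check, the id can be derived from the config itself. A minimal sketch (`myid_for` is a hypothetical helper; the awk field splitting assumes the exact server.N=host:port:port layout used above):

```shell
# Print the myid that zoo.cfg assigns to a given hostname.
# stdin: zoo.cfg contents; $1: hostname to look up.
myid_for() {
  awk -F'[.=:]' -v h="$1" '$1 == "server" && $3 == h { print $2 }'
}

# Example: derive the id for zookeeper-single-1 from the server list above.
myid_for zookeeper-single-1 <<'EOF'
server.1=hadoop-master:2888:3888
server.2=hadoop-node-1:2888:3888
server.3=hadoop-node-2:2888:3888
server.4=zookeeper-single-1:2888:3888
server.5=zookeeper-single-2:2888:3888
EOF
```

Comparing this output against /data/zookeeper/myid on each node catches copy-paste mistakes before the service is started.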

7. Update the ZooKeeper-related configuration across the cluster
7.1 Update the ZooKeeper settings for HBase
$ sudo vim /etc/hbase/conf/hbase-site.xml //only on hadoop-master, hadoop-node-1 and hadoop-node-2

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://hadoop-master:8020/hbase</value>
</property>
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop-master,hadoop-node-1,hadoop-node-2,zookeeper-single-1,zookeeper-single-2</value>
</property>
</configuration>

7.2 Update the ZooKeeper settings for Flume
$ sudo vim /etc/flume/conf/flume-site.xml //only on hadoop-master, hadoop-node-1 and hadoop-node-2

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"  href="configuration.xsl"?>

<configuration>
<property>
  <name>flume.master.servers</name>
  <value>hadoop-master</value>
  </property>
<property>
  <name>flume.master.store</name>
  <value>zookeeper</value>
</property>
<property>
  <name>flume.master.zk.use.external</name>
  <value>true</value>
</property>
<property>
  <name>flume.master.zk.servers</name>
  <value>hadoop-master:2181,hadoop-node-1:2181,hadoop-node-2:2181,zookeeper-single-1:2181,zookeeper-single-2:2181</value>
</property>
</configuration>

7.3 Update the ZooKeeper settings for Hive
dongguo@hadoop-master:~$ sudo vim /etc/hive/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
<!-- that are implied by Hadoop setup variables.                                                -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource).                                                                                 -->

<!-- Hive Execution Parameters -->

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>

<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>

<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u3.jar,file:///usr/lib/hbase/hbase-0.90.4-cdh3u3.jar,file:///usr/lib/zookeeper/zookeeper-3.3.4-cdh3u3.jar</value>
</property>

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop-master,hadoop-node-1,hadoop-node-2,zookeeper-single-1,zookeeper-single-2</value>
</property>

</configuration>

8. Restart the ZooKeeper service
$ sudo /etc/init.d/hadoop-zookeeper-server restart
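Note that restarting all five servers at the same moment would briefly drop the ensemble below a majority; restarting one server at a time keeps it available throughout. A dry-run sketch (the ssh invocation is an assumption about your access setup and is only echoed here, not executed):

```shell
# Dry-run plan for a rolling restart: one server at a time keeps a majority up.
ZK_HOSTS="hadoop-master hadoop-node-1 hadoop-node-2 zookeeper-single-1 zookeeper-single-2"

rolling_restart_plan() {
  for host in $ZK_HOSTS; do
    # Replace `echo` with a real remote invocation (e.g. ssh) once verified.
    echo "ssh $host sudo /etc/init.d/hadoop-zookeeper-server restart"
  done
}

rolling_restart_plan
```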

9. Restart all services that depend on ZooKeeper
9.1 Restart the HBase services
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hbase-master restart
dongguo@hadoop-node-1:~$ sudo /etc/init.d/hadoop-hbase-regionserver restart
dongguo@hadoop-node-2:~$ sudo /etc/init.d/hadoop-hbase-regionserver restart

9.2 Restart the Flume services
dongguo@hadoop-master:~$ sudo /etc/init.d/flume-master restart
$ sudo /etc/init.d/flume-node restart //only on hadoop-master, hadoop-node-1 and hadoop-node-2

9.3 Restart the Hive services
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-server restart
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-metastore restart

10. Check the ZooKeeper status
10.1 Use the client to verify that the data is in sync
dongguo@zookeeper-single-1:~$ zookeeper-client

Connecting to localhost:2181
[zk: localhost:2181(CONNECTED) 0] ls /
[flume-cfgs, counters-config_version, flume-chokemap, hbase, zookeeper, flume-nodes]
[zk: localhost:2181(CONNECTED) 1] 

As the listing shows, the znodes of all related systems have already been synchronized to the newly added zookeeper-servers.

11. ZooKeeper cluster failure drill
The drill simulates ZooKeeper server failures to check whether the remaining servers can keep serving after some members are killed abruptly.
A ZooKeeper ensemble should run an odd number of servers: it stays available only while a majority of members is alive, so an even count adds no extra fault tolerance. With 5 servers we satisfy that recommendation.
A running zookeeper-server takes one of two roles, leader or follower, chosen automatically by vote; clients may connect to any member. As long as a majority is up and a leader can be elected, the ensemble serves requests; otherwise the whole cluster becomes unavailable.

We will first determine each zookeeper-server's role, then kill the processes of randomly chosen servers one after another, checking each server's role changes and health at every step.

Check the status of a zookeeper-server with the following command:
dongguo@zookeeper-single-2:~$ zookeeper-server status

JMX enabled by default
Using config: /etc/zookeeper/zoo.cfg
Mode: follower

As shown above, zookeeper-single-2 currently holds the follower role.
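When scripting the drill, the role can be extracted from the status output rather than read by eye. A small helper (`zk_mode` is a hypothetical name; it assumes the three-line output format shown above):

```shell
# stdin: output of `zookeeper-server status`; prints the bare mode word.
zk_mode() {
  awk -F': ' '$1 == "Mode" { print $2 }'
}

# Example: pipe the status output through the helper.
# zookeeper-server status 2>/dev/null | zk_mode
```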

Get the zookeeper-server process ID with the following command:
$ ps aux | grep /etc/zookeeper/zoo.cfg | grep -v grep | awk '{print $2}'

Then kill that process directly with sudo kill -9 <PID>.
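For repeated drill rounds, the pipeline above can be wrapped in a helper that reads ps output and prints the PID, so it can also be exercised against a captured sample (`zk_pid` is a hypothetical name; the sample row below is fabricated for illustration):

```shell
# stdin: `ps aux` output; prints the PID of the process whose command line
# references zoo.cfg. The !/awk|grep/ clause filters out the filter itself
# when the helper is fed a live `ps aux | ...` pipeline.
zk_pid() {
  awk '/\/etc\/zookeeper\/zoo\.cfg/ && !/awk|grep/ { print $2 }'
}

# Live usage (equivalent to the one-liner above):
# ps aux | zk_pid
```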

Conclusion:
The drill, run over arbitrary combinations of failed servers, confirmed that the ensemble's fault tolerance matches its algorithm:
with a total of 2n+1 servers, up to n may fail while a majority survives.
In the actual tests, 3 servers tolerated 1 failure, 5 tolerated 2, and 7 tolerated 3.
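That relationship can be checked numerically: an ensemble of N servers needs a majority (more than N/2) alive, so it tolerates (N-1)/2 failures, rounded down. A quick sketch:

```shell
# Failures an N-server ZooKeeper ensemble tolerates while keeping a majority.
zk_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

# Reproduce the table from the drill results.
for n in 3 5 7; do
  echo "$n servers tolerate $(zk_tolerance "$n") failure(s)"
done
```

This also shows why an even count is wasteful: 6 servers tolerate the same 2 failures as 5.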

  1. #1 by denny on 2014/06/10 - 02:22

    If you're going to play, play big: why not just unleash a Chaos Monkey test and break things at random?
