Hadoop Cluster in Practice (1): Setting up Hadoop (HDFS)


Series index
Hadoop Cluster in Practice (0): Overall Architecture Design [Hadoop(HDFS), HBase, Zookeeper, Flume, Hive]
Hadoop Cluster in Practice (1): Setting up Hadoop (HDFS)
Hadoop Cluster in Practice (2): Setting up HBase & Zookeeper
Hadoop Cluster in Practice (3): Setting up Flume
Hadoop Cluster in Practice (4): Setting up Hive

This article
Hadoop Cluster in Practice (1): Setting up Hadoop (HDFS)

References
http://www.infoq.com/cn/articles/hadoop-intro
http://www.infoq.com/cn/articles/hadoop-config-tip
http://www.infoq.com/cn/articles/hadoop-process-develop
http://www.docin.com/p-238056369.html

Installing and configuring Hadoop (HDFS)
Environment
OS: Ubuntu 10.10 Server 64-bit
Servers:
hadoop-master: 10.6.1.150, 1024 MB RAM
- namenode, jobtracker
- secondarynamenode
- datanode, tasktracker

hadoop-node-1: 10.6.1.151, 640 MB RAM
- datanode, tasktracker

hadoop-node-2: 10.6.1.152, 640 MB RAM
- datanode, tasktracker

A brief introduction to the roles above:
namenode - manages the namespace (metadata) of the entire HDFS
secondarynamenode - periodically checkpoints the namenode's metadata; note that it is not a hot standby for the namenode
jobtracker - manages MapReduce jobs across the cluster
datanode - stores HDFS data blocks
tasktracker - executes MapReduce tasks on a node

Conventions used in this article, to avoid confusion when configuring multiple servers:
Any command shown with a plain $ prompt and no hostname must be executed on all servers, unless a trailing // comment says otherwise.

1. Choose the installation packages
To make the deployment easier and more consistent, we use Cloudera's packaged distribution (CDH).
Cloudera ships the Hadoop-related components as an integrated, tested set, which avoids many of the bugs caused by version mismatches between the individual projects.
This is also what many experienced Hadoop administrators recommend.
https://ccp.cloudera.com/display/DOC/Documentation//

2. Install the Java environment
Hadoop is written mainly in Java, so a JVM is required.
Add the APT repository that provides the matching Java version
$ sudo apt-get install python-software-properties
$ sudo vim /etc/apt/sources.list.d/sun-java-community-team-sun-java6-maverick.list

deb http://ppa.launchpad.net/sun-java-community-team/sun-java6/ubuntu maverick main
deb-src http://ppa.launchpad.net/sun-java-community-team/sun-java6/ubuntu maverick main

Install sun-java6-jdk
$ sudo add-apt-repository ppa:sun-java-community-team/sun-java6
$ sudo apt-get update
$ sudo apt-get install sun-java6-jdk
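
To confirm that the JDK is picked up correctly on every server, a quick check (the exact build number in the output will vary):
$ java -version
$ ls /usr/lib/jvm/java-6-sun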

3. Configure Cloudera's Hadoop package repository
$ sudo vim /etc/apt/sources.list.d/cloudera.list

deb http://archive.cloudera.com/debian maverick-cdh3u3 contrib
deb-src http://archive.cloudera.com/debian maverick-cdh3u3 contrib

$ sudo apt-get install curl
$ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -
$ sudo apt-get update

4. Install the Hadoop packages
$ sudo apt-get install hadoop-0.20-namenode // only on hadoop-master
$ sudo apt-get install hadoop-0.20-datanode
$ sudo apt-get install hadoop-0.20-secondarynamenode // only on hadoop-master
$ sudo apt-get install hadoop-0.20-jobtracker // only on hadoop-master
$ sudo apt-get install hadoop-0.20-tasktracker

5. Create the Hadoop configuration files
$ sudo cp -r /etc/hadoop-0.20/conf.empty /etc/hadoop-0.20/conf.my_cluster

6. Activate the new configuration
$ sudo update-alternatives --install /etc/hadoop-0.20/conf hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster 50
Here, 50 is the priority assigned to this alternative.

Check which configuration is currently selected
$ sudo update-alternatives --display hadoop-0.20-conf

hadoop-0.20-conf - auto mode
  link currently points to /etc/hadoop-0.20/conf.my_cluster
/etc/hadoop-0.20/conf.empty - priority 10
/etc/hadoop-0.20/conf.my_cluster - priority 50
/etc/hadoop-0.20/conf.pseudo - priority 30
Current 'best' version is '/etc/hadoop-0.20/conf.my_cluster'.

The output shows that conf.my_cluster has the highest priority and is therefore the current 'best' choice.
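
If the alternatives system ever ends up pointing at a different directory (for example after another conf package is installed with a higher priority), the selection can also be pinned explicitly instead of relying on priorities; note that this switches hadoop-0.20-conf from auto to manual mode:
$ sudo update-alternatives --set hadoop-0.20-conf /etc/hadoop-0.20/conf.my_cluster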

7. Add hosts entries and set the hostname accordingly; hadoop-master is shown as an example
$ sudo vim /etc/hosts

10.6.1.150 hadoop-master
10.6.1.150 hadoop-secondary
10.6.1.151 hadoop-node-1
10.6.1.152 hadoop-node-2

$ sudo vim /etc/hostname

hadoop-master

$ sudo hostname hadoop-master
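
A quick way to verify that the hostname and hosts entries are consistent is to check the local name and ping the other nodes by name; run this on every server, and each name should resolve to the address listed above:
$ hostname
$ ping -c 1 hadoop-master
$ ping -c 1 hadoop-node-1
$ ping -c 1 hadoop-node-2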

8. Set up SSH key authentication between the servers
dongguo@hadoop-master:~$ ssh-keygen -t rsa

Generating public/private rsa key pair.
Enter file in which to save the key (/home/dongguo/.ssh/id_rsa): 
Created directory '/home/dongguo/.ssh'.
Enter passphrase (empty for no passphrase):  
Enter same passphrase again: 

dongguo@hadoop-master:~$ cd /home/dongguo/.ssh
dongguo@hadoop-master:~/.ssh$ cp id_rsa.pub authorized_keys
dongguo@hadoop-master:~/.ssh$ chmod 600 authorized_keys

Establish trust with hadoop-node-1 and hadoop-node-2 so that SSH logins no longer require a password
dongguo@hadoop-master:~$ ssh-copy-id -i /home/dongguo/.ssh/id_rsa.pub dongguo@10.6.1.151
dongguo@hadoop-master:~$ ssh-copy-id -i /home/dongguo/.ssh/id_rsa.pub dongguo@10.6.1.152
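
Before going further it is worth confirming that passwordless login actually works; each command should print the remote hostname without asking for a password:
dongguo@hadoop-master:~$ ssh dongguo@10.6.1.151 hostname
dongguo@hadoop-master:~$ ssh dongguo@10.6.1.152 hostname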

9. Edit the configuration files under hadoop/conf
$ sudo vim /etc/hadoop/conf/hadoop-env.sh

export JAVA_HOME="/usr/lib/jvm/java-6-sun"

$ sudo vim /etc/hadoop/conf/masters

hadoop-master

$ sudo vim /etc/hadoop/conf/slaves

hadoop-node-1
hadoop-node-2

10. Create the local directories used by HDFS
$ sudo mkdir -p /data/storage
$ sudo mkdir -p /data/hdfs
$ sudo chmod 700 /data/hdfs
$ sudo chown -R hdfs:hadoop /data/hdfs
$ sudo chmod 777 /data/storage
$ sudo chmod o+t /data/storage
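
A quick sanity check of the result; the listing should show /data/hdfs as drwx------ owned by hdfs:hadoop, and /data/storage as world-writable with the sticky bit set (drwxrwxrwt):
$ ls -ld /data/hdfs /data/storage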

11. Configure core-site.xml - Hadoop
$ sudo vim /etc/hadoop/conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<!--- global properties -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/storage</value>
  <description>A directory for other temporary directories.</description>
</property>
<!-- file system properties -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://hadoop-master:8020</value>
</property>
</configuration>

hadoop.tmp.dir is the base directory from which the other Hadoop paths in this setup are derived (the namenode metadata and the secondarynamenode checkpoints configured below), so make sure it sits on a partition with enough space.
fs.default.name specifies the address and port of the NameNode.

12. Configure hdfs-site.xml - HDFS
$ sudo vim /etc/hadoop/conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>dfs.name.dir</name>
  <value>${hadoop.tmp.dir}/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hdfs</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>

<property>
  <name>fs.checkpoint.period</name>
  <value>300</value>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>${hadoop.tmp.dir}/dfs/namesecondary</value>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>hadoop-secondary:50090</value>
</property>
</configuration>

dfs.data.dir specifies where the datanodes store their data blocks.
dfs.replication specifies how many replicas of each block are kept, providing redundancy; the value should not exceed the number of datanodes, otherwise blocks will remain under-replicated.
dfs.datanode.max.xcievers limits how many files a datanode can serve concurrently (the misspelled name is the property Hadoop actually uses).
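
Once the daemons are running (step 15) and a file has been written (step 17), the effective replication of its blocks can be verified with fsck, for example against the test file created later in this article:
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fsck /dongguo/hello.txt -files -blocks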

13. Configure mapred-site.xml - MapReduce
$ sudo vim /etc/hadoop/conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
  <name>mapred.job.tracker</name>
  <value>hdfs://hadoop-master:8021</value> 
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value> 
</property>
<property>
  <name>mapreduce.jobtracker.staging.root.dir</name>
  <value>/user</value>
</property>
</configuration>

mapred.job.tracker specifies the address and port of the jobtracker.
mapred.system.dir specifies the directory in HDFS where the MapReduce framework stores its control files (created in step 16).

Note: the configuration above must be applied on all servers. Since the contents are identical on every machine, you can finish the configuration on one server and then copy the whole /etc/hadoop-0.20 directory over to the other servers, as sketched below.
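
A minimal sketch of one way to copy the configuration over, using the dongguo account and scp (any other file-transfer method works just as well); repeat the same for hadoop-node-2 (10.6.1.152):
dongguo@hadoop-master:~$ scp -r /etc/hadoop-0.20/conf.my_cluster dongguo@10.6.1.151:/tmp/
dongguo@hadoop-node-1:~$ sudo cp -r /tmp/conf.my_cluster/* /etc/hadoop-0.20/conf.my_cluster/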

14. Format the HDFS distributed file system
dongguo@hadoop-master:~$ sudo -u hdfs hadoop namenode -format

12/10/07 21:52:48 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = hadoop-master/10.6.1.150
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2-cdh3u3
STARTUP_MSG:   build = file:///data/1/tmp/nightly_2012-03-20_13-13-48_3/hadoop-0.20-0.20.2+923.197-1~maverick -r 318bc781117fa276ae81a3d111f5eeba0020634f; compiled by 'root' on Tue Mar 20 13:45:02 PDT 2012
************************************************************/
12/10/07 21:52:48 INFO util.GSet: VM type       = 64-bit
12/10/07 21:52:48 INFO util.GSet: 2% max memory = 19.33375 MB
12/10/07 21:52:48 INFO util.GSet: capacity      = 2^21 = 2097152 entries
12/10/07 21:52:48 INFO util.GSet: recommended=2097152, actual=2097152
12/10/07 21:52:48 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
12/10/07 21:52:48 INFO namenode.FSNamesystem: fsOwner=hdfs (auth:SIMPLE)
12/10/07 21:52:49 INFO namenode.FSNamesystem: supergroup=supergroup
12/10/07 21:52:49 INFO namenode.FSNamesystem: isPermissionEnabled=true
12/10/07 21:52:49 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=1000
12/10/07 21:52:49 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
12/10/07 21:52:49 INFO common.Storage: Image file of size 110 saved in 0 seconds.
12/10/07 21:52:50 INFO common.Storage: Storage directory /data/storage/dfs/name has been successfully formatted.
12/10/07 21:52:50 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at hadoop-master/10.6.1.150
************************************************************/
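
The formatted namenode metadata now lives under the path derived from hadoop.tmp.dir; a quick look confirms the directory was created (the exact contents vary between versions):
dongguo@hadoop-master:~$ sudo ls /data/storage/dfs/name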

15. Start the Hadoop daemons
On hadoop-master, start the datanode, namenode, secondarynamenode, and jobtracker
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-datanode start
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-namenode start
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-jobtracker start
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-0.20-secondarynamenode start

On the hadoop-node servers, start the datanode and tasktracker; hadoop-node-1 is shown as an example
dongguo@hadoop-node-1:~$ sudo /etc/init.d/hadoop-0.20-datanode start
dongguo@hadoop-node-1:~$ sudo /etc/init.d/hadoop-0.20-tasktracker start

Check that the daemons are running, using hadoop-master and hadoop-node-1 as examples
dongguo@hadoop-master:~$ sudo jps

12683 SecondaryNameNode
12716 Jps
9250 JobTracker
11736 DataNode
5816 NameNode

dongguo@hadoop-node-1:~$ sudo jps

4591 DataNode
7093 Jps
6853 TaskTracker

16. Create the HDFS directory for mapred.system.dir
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /mapred/system
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chown mapred:hadoop /mapred/system
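
A quick check that the directory exists with the expected ownership, since the jobtracker's mapred user must be able to write here:
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -ls /mapred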

17. Test basic operations on HDFS
dongguo@hadoop-master:~$ echo "Hello Hadoop." > hello.txt
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /dongguo
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -copyFromLocal hello.txt /dongguo
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -ls /dongguo

Found 1 items
-rw-r--r--   2 hdfs supergroup         33 2012-10-07 22:02 /dongguo/hello.txt
hdfs@hadoop-master:~$ hadoop fs -cat /dongguo/hello.txt
Hello Hadoop. echo Hello Hadoop.

18. Check the status of the whole cluster
Via the web interfaces
http://10.6.1.150:50070 (NameNode)

http://10.6.1.150:50030 (JobTracker)

Via the command line
dongguo@hadoop-master:~$ sudo -u hdfs hadoop dfsadmin -report

12/10/07 23:22:01 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
Configured Capacity: 59858374656 (55.75 GB)
Present Capacity: 53180121088 (49.53 GB)
DFS Remaining: 53180010496 (49.53 GB)
DFS Used: 110592 (108 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Name: 10.6.1.152:50010
Decommission Status : Normal
Configured Capacity: 19952791552 (18.58 GB)
DFS Used: 28672 (28 KB)
Non DFS Used: 2221387776 (2.07 GB)
DFS Remaining: 17731375104(16.51 GB)
DFS Used%: 0%
DFS Remaining%: 88.87%
Last contact: Sun Oct 07 23:22:01 CST 2012


Name: 10.6.1.151:50010
Decommission Status : Normal
Configured Capacity: 19952791552 (18.58 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 2221387776 (2.07 GB)
DFS Remaining: 17731362816(16.51 GB)
DFS Used%: 0%
DFS Remaining%: 88.87%
Last contact: Sun Oct 07 23:22:01 CST 2012


Name: 10.6.1.150:50010
Decommission Status : Normal
Configured Capacity: 19952791552 (18.58 GB)
DFS Used: 40960 (40 KB)
Non DFS Used: 2235478016 (2.08 GB)
DFS Remaining: 17717272576(16.5 GB)
DFS Used%: 0%
DFS Remaining%: 88.8%
Last contact: Sun Oct 07 23:22:01 CST 2012
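
As an optional end-to-end check of MapReduce on top of HDFS, the examples jar shipped with the CDH3 packages can be used to run a small job; the jar path below is the usual CDH3 location and may differ on other installations:
dongguo@hadoop-master:~$ sudo -u hdfs hadoop jar /usr/lib/hadoop-0.20/hadoop-examples.jar pi 2 100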

19. At this point, the Hadoop (HDFS) setup is complete.

20. Next, we can continue with the following:
Hadoop集群实践 之 (2) HBase&Zookeeper搭建
Hadoop集群实践 之 (3) Flume搭建
Hadoop集群实践 之 (4) Hive搭建


  1. #1 by mcsrainbow on 2013/04/09 - 10:12

    If you want a pseudo-distributed deployment, it is very simple: just run sudo yum install hadoop-0.20-conf-pseudo.
