Hadoop Cluster in Practice (4): Hive Setup


Series Contents
Hadoop Cluster in Practice (0): Overall Architecture Design [Hadoop (HDFS), HBase, ZooKeeper, Flume, Hive]
Hadoop Cluster in Practice (1): Hadoop (HDFS) Setup
Hadoop Cluster in Practice (2): HBase & ZooKeeper Setup
Hadoop Cluster in Practice (3): Flume Setup
Hadoop Cluster in Practice (4): Hive Setup

This Article
Hadoop Cluster in Practice (4): Hive Setup

References
https://ccp.cloudera.com/display/CDHDOC/Hive+Installation
http://blog.csdn.net/cloudeep/article/details/4487181
http://jinghong.iteye.com/blog/1127968
http://dueam.org/2010/06/hive-%E5%8A%A0%E9%80%9Fmapreduce%E7%9A%84%E5%BC%80%E5%8F%91/
http://blog.csdn.net/hguisu/article/details/7282050
http://my.oschina.net/wangjiankui/blog/41442

Installing and Configuring Hive
OS: Ubuntu 10.10 Server 64-bit
Servers:
hadoop-master:10.6.1.150
- namenode,jobtracker;hbase-master,hbase-thrift;
- secondarynamenode;
- hive-master,hive-metastore;
- zookeeper-server;
- flume-master
- flume-node
- datanode,tasktracker

hadoop-node-1:10.6.1.151
- datanode,tasktracker;hbase-regionserver;
- zookeeper-server;
- flume-node

hadoop-node-2:10.6.1.152
- datanode,tasktracker;hbase-regionserver;
- zookeeper-server;
- flume-node

A brief introduction to the roles above:
namenode - manages the namespace of the entire HDFS
secondarynamenode - performs periodic checkpoints of the namenode metadata (often described, loosely, as a backup for the namenode)
jobtracker - manages MapReduce (parallel computation) jobs
datanode - stores HDFS data blocks
tasktracker - executes MapReduce tasks
hbase-master - HBase management service
hbase-regionserver - serves client requests such as insert, delete, and query
zookeeper-server - ZooKeeper coordination and configuration management service
flume-master - Flume management service
flume-node - Flume agent and collector service; the exact role is assigned by the flume-master
hive-master - Hive management service (Hive server)
hive-metastore - Hive's metastore, which holds the metadata used for type checking and semantic analysis
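
To check which of these daemons are actually running on a given host (an optional check, not part of the original steps), the JDK's jps tool lists the running Java processes. Run as root it shows all users' JVMs; on hadoop-master you would expect entries such as NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker, HMaster, and QuorumPeerMain (exact process names vary slightly by distribution):
dongguo@hadoop-master:~$ sudo jps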

Conventions used in this article, to avoid confusion when working across multiple servers:
Any command that starts with a bare $ and is not preceded by a hostname must be executed on all servers, unless a note starting with // says otherwise.

1. Preparation
Make sure you have completed Hadoop Cluster in Practice (3): Flume Setup.

Hive is essentially a SQL engine: it translates SQL statements into MapReduce jobs and runs them on Hadoop, which makes development much faster.
Hive was contributed to the Hadoop open-source community by Facebook.
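
As a quick illustration of that translation (a sketch, not an installation step): once a table exists, for example the pokes table created in section 9 below, Hive's EXPLAIN statement prints the plan of stages, including the MapReduce stage, that a query will be compiled into, without actually running the job:
hive> EXPLAIN SELECT bar, count(*) FROM pokes GROUP BY bar;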

2. Install Hive
dongguo@hadoop-master:~$ sudo apt-get install hadoop-hive hadoop-hive-metastore hadoop-hive-server
dongguo@hadoop-master:~$ sudo mkdir -p /var/lock/subsys
dongguo@hadoop-master:~$ sudo chown hive:hive /var/lock/subsys

3. Create the local directories Hive requires
dongguo@hadoop-master:~$ sudo mkdir -p /var/run/hive/
dongguo@hadoop-master:~$ sudo chown hive:hive /var/run/hive/

4. Start the Hive services
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-metastore start
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-server start

5. Install the MySQL JDBC Connector
Hive stores its metadata in a MySQL database, and since Hive itself is written in Java, the MySQL JDBC driver must be installed.
dongguo@hadoop-master:~$ wget 'http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.22.tar.gz'
dongguo@hadoop-master:~$ tar xzf mysql-connector-java-5.1.22.tar.gz
dongguo@hadoop-master:~$ sudo cp mysql-connector-java-5.1.22/mysql-connector-java-5.1.22-bin.jar /usr/lib/hive/lib/

6. Install MySQL
dongguo@hadoop-master:~$ sudo apt-get install mysql-server
When prompted, set the MySQL root password to hadoophive.

7. Create the metastore database and grant privileges
dongguo@hadoop-master:~$ mysql -uroot -phadoophive

mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.7.0.mysql.sql;

mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON metastore.* TO 'hiveuser'@'%';
mysql> REVOKE ALTER,CREATE ON metastore.* FROM 'hiveuser'@'%';
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON metastore.* TO 'hiveuser'@'localhost';
mysql> REVOKE ALTER,CREATE ON metastore.* FROM 'hiveuser'@'localhost';
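
As an optional sanity check (not part of the original steps), confirm that the hiveuser account can reach the schema that was just loaded; this should list the metastore tables (TBLS, DBS, and so on):
dongguo@hadoop-master:~$ mysql -uhiveuser -ppassword -e 'USE metastore; SHOW TABLES;'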

8. Configure hive-site.xml
dongguo@hadoop-master:~$ sudo vim /etc/hive/conf/hive-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files  -->
<!-- that are implied by Hadoop setup variables.                                                -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive    -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource).                                                                                 -->

<!-- Hive Execution Parameters -->

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore</value>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
 
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>
 
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
 
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>

<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>

<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u3.jar,file:///usr/lib/hbase/hbase-0.90.4-cdh3u3.jar,file:///usr/lib/zookeeper/zookeeper-3.3.4-cdh3u3.jar</value>
</property>

<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop-master,hadoop-node-1,hadoop-node-2</value>
</property>

</configuration>

Make sure the HBase and ZooKeeper jars can be picked up by Hive:
dongguo@hadoop-master:~$ sudo cp /usr/lib/hbase/hbase-0.90.4-cdh3u3.jar /usr/lib/hive/lib/
dongguo@hadoop-master:~$ sudo cp /usr/lib/zookeeper/zookeeper-3.3.4-cdh3u3.jar /usr/lib/hive/lib/
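
Before moving on, a quick smoke test (an optional check, not from the original steps) confirms that Hive can reach the MySQL-backed metastore with the settings above; if the JDBC configuration is wrong, this fails with a connection error instead of returning the (still empty) table list:
dongguo@hadoop-master:~$ sudo -u hive hive -e 'SHOW TABLES;'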

9. Test running a job through Hive against data in HBase
Hive lowers the barrier to entry: you can run computations on the Hadoop cluster without writing Java MapReduce code, and it integrates well with HBase. Below we use Hive to drive MapReduce jobs on the cluster directly and get a first taste of what Hadoop can do.

9.1 Restart the Hive services
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-server restart
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-metastore restart

9.2 Create the HDFS directories Hive requires
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /user/hive
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chown hive /user/hive
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /tmp
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod 777 /tmp
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod o+t /tmp
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /data
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chown hdfs /data
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod 777 /data
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod o+t /data
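
A quick listing (optional) confirms that the directories exist with the intended owners and permissions:
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -ls /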

9.3 Create test data
dongguo@hadoop-master:~$ sudo -u hive vim /tmp/kv1.txt

 
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com

Note: the fields are separated by a TAB character, not spaces; this matters later when we load the data.
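
To double-check that the separator really is a tab (an optional check), cat -A displays tabs as ^I and line ends as $, so each line should look like 1^Iwww.baidu.com$:
dongguo@hadoop-master:~$ cat -A /tmp/kv1.txt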

9.4 Working in the Hive shell
dongguo@hadoop-master:~$ sudo -u hive hive

 
Hive history file=/tmp/hive/hive_job_log_hive_201210090516_1966038691.txt
hive> 

#Create an HBase-backed table
hive> CREATE TABLE hbase_table_1(key int, value string)  
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'  
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")  
    > TBLPROPERTIES ("hbase.table.name" = "xyz");
OK
Time taken: 17.919 seconds
hive> 

#hbase.table.name defines the name of the table on the HBase side
#hbase.columns.mapping defines the mapping to HBase column families

#Create a new native Hive table so we can import data with SQL
hive> CREATE TABLE IF NOT EXISTS pokes (   
    > foo INT,
    > bar STRING
    > )ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY "\t"
    > LINES TERMINATED BY "\n";
OK
Time taken: 2.883 seconds
hive> 

#Tab (\t) is used as the field delimiter and newline (\n) as the line terminator.

#Bulk-load the data
hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE pokes;
Copying data from file:/tmp/kv1.txt
Copying file: file:/tmp/kv1.txt
Loading data to table default.pokes
Deleted hdfs://hadoop-master/user/hive/warehouse/pokes
OK
Time taken: 4.361 seconds
hive> select * from pokes;
OK
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com
Time taken: 0.641 seconds

#Use SQL to load the data into hbase_table_1
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201210091928_0001, Tracking URL = http://hadoop-master:50030/jobdetails.jsp?jobid=job_201210091928_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job  -Dmapred.job.tracker=hdfs://hadoop-master:8021 -kill job_201210091928_0001
2012-10-09 19:43:25,233 Stage-0 map = 0%,  reduce = 0%
2012-10-09 19:43:31,323 Stage-0 map = 100%,  reduce = 0%
2012-10-09 19:43:34,348 Stage-0 map = 100%,  reduce = 100%
Ended Job = job_201210091928_0001
1 Rows loaded to hbase_table_1
OK
Time taken: 17.325 seconds
hive> 
#Hive automatically invoked Hadoop MapReduce for the computation and inserted the single row with ID 5 into HBase.

#Check the data
hive> select * from pokes;
OK
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com
Time taken: 0.874 seconds
hive> select * from hbase_table_1;
OK
5	heylinux.com
Time taken: 1.174 seconds
hive>  

#Next, test the integration with Flume: LOAD the data that Flume previously collected into HDFS.
hive> LOAD DATA INPATH '/flume/rawdata/2012/10/10/0252/rawdata-1-a*' INTO TABLE pokes;      
Loading data to table default.pokes
OK
Time taken: 1.65 seconds
hive> select * from pokes;
OK
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com
100	hadoop-node-1-rawdata-1-value1
101	hadoop-node-1-rawdata-1-value2
102	hadoop-node-1-rawdata-1-value3
103	hadoop-node-1-rawdata-1-value4
104	hadoop-node-1-rawdata-1-value5
105	hadoop-node-1-rawdata-1-value6
106	hadoop-node-1-rawdata-1-value7
Time taken: 0.623 seconds
hive> 

#As you can see, the data collected into HDFS by Flume was loaded successfully as well.

9.5 Log in to HBase and check the data that was just written
dongguo@hadoop-master:~$ sudo -u hdfs hbase shell

 
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.4-cdh3u3, r, Thu Jan 26 10:13:36 PST 2012

hbase(main):001:0> describe 'xyz'
DESCRIPTION                                                            ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE',   true
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 0.9640 seconds

hbase(main):002:0> scan 'xyz'
ROW                       COLUMN+CELL                                            
5    column=cf1:val, timestamp=1349783010142, value=heylinux.com                                               
1 row(s) in 0.0790 seconds

hbase(main):003:0> 

#In HBase we can now see the record that was just inserted from Hive, i.e. the row with the value heylinux.com.

#Now test inserting data directly in HBase.
hbase(main):004:0> put 'xyz','6','cf1:val','www.360buy.com'
0 row(s) in 0.0690 seconds

hbase(main):005:0> scan 'xyz'                              
ROW                  COLUMN+CELL                                            
5   column=cf1:val, timestamp=1349783010142, value=heylinux.com                     
6   column=cf1:val, timestamp=1349783457396, value=www.360buy.com   
2 row(s) in 0.0320 seconds

hbase(main):006:0> 

9.6 Check the MapReduce job in the MapReduce Administration web UI
The output of the "Use SQL to load the data into hbase_table_1" step above already shows that Hive automatically invoked Hadoop MapReduce for the computation and finally inserted the row with ID 5 into HBase.
To see the result in detail, open the MapReduce Administration web UI:
http://10.6.1.150:50030

It is clear from the web UI that the job completed successfully.
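
If you prefer the command line (an alternative check, not mentioned in the original text), the job list can also be queried from the JobTracker:
dongguo@hadoop-master:~$ hadoop job -list all
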
With this, the entire Hadoop cluster architecture is in place, and through Hive we can conveniently drive Hadoop MapReduce computations with plain SQL.

Finally, let's revisit the diagram in Hadoop Cluster in Practice (0): Overall Architecture Design to review the overall architecture and data flow, and begin our Hadoop journey!
