Table of Contents
Hadoop Cluster in Practice, Part 0: Complete Architecture Design [Hadoop(HDFS), HBase, Zookeeper, Flume, Hive]
Hadoop Cluster in Practice, Part 1: Hadoop (HDFS) Setup
Hadoop Cluster in Practice, Part 2: HBase & Zookeeper Setup
Hadoop Cluster in Practice, Part 3: Flume Setup
Hadoop Cluster in Practice, Part 4: Hive Setup
This Article
Hadoop Cluster in Practice, Part 4: Hive Setup
References
https://ccp.cloudera.com/display/CDHDOC/Hive+Installation
http://blog.csdn.net/cloudeep/article/details/4487181
http://jinghong.iteye.com/blog/1127968
http://dueam.org/2010/06/hive-%E5%8A%A0%E9%80%9Fmapreduce%E7%9A%84%E5%BC%80%E5%8F%91/
http://blog.csdn.net/hguisu/article/details/7282050
http://my.oschina.net/wangjiankui/blog/41442
Installing and Configuring Hive
OS: Ubuntu 10.10 Server 64-bit
Servers:
hadoop-master:10.6.1.150
- namenode, jobtracker; hbase-master, hbase-thrift;
- secondarynamenode;
- hive-master, hive-metastore;
- zookeeper-server;
- flume-master
- flume-node
- datanode, tasktracker
hadoop-node-1:10.6.1.151
- datanode, tasktracker; hbase-regionserver;
- zookeeper-server;
- flume-node
hadoop-node-2:10.6.1.152
- datanode, tasktracker; hbase-regionserver;
- zookeeper-server;
- flume-node
A brief description of the roles above:
namenode - the namespace management service for the whole of HDFS
secondarynamenode - often described as a backup for the namenode; in practice it periodically checkpoints the namenode's metadata
jobtracker - the job management service for parallel computation
datanode - the HDFS storage node service
tasktracker - the job execution service for parallel computation
hbase-master - the HBase management service
hbase-regionserver - serves client requests such as inserts, deletes and queries
zookeeper-server - the Zookeeper coordination and configuration management service
flume-master - the Flume management service
flume-node - the Flume Agent and Collector service; the exact role is decided by the flume-master
hive-master - the Hive management service
hive-metastore - Hive's metadata store, used for type checking and semantic analysis of metadata
Conventions used in this article, to avoid confusion when configuring multiple servers:
Any command that starts directly with $ and is not preceded by a hostname must be run on all servers, unless a separate note starting with // says otherwise.
1. Preparation before installation
Completed: Hadoop Cluster in Practice, Part 3: Flume Setup
Hive is essentially a SQL parsing engine: it translates SQL statements into Map/Reduce jobs and runs them on Hadoop, which makes development much faster.
Hive was contributed to the Hadoop open-source community by Facebook.
2. Install Hive
dongguo@hadoop-master:~$ sudo apt-get install hadoop-hive hadoop-hive-metastore hadoop-hive-server
dongguo@hadoop-master:~$ sudo mkdir -p /var/lock/subsys
dongguo@hadoop-master:~$ sudo chown hive:hive /var/lock/subsys
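Before moving on, a quick sanity check that the CDH3 Hive packages installed cleanly and the libraries are in place can save debugging later (a sketch; it only lists what apt-get installed above):
dongguo@hadoop-master:~$ dpkg -l | grep hadoop-hive
dongguo@hadoop-master:~$ ls /usr/lib/hive/lib/ | head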
3. Create the local directories Hive needs
dongguo@hadoop-master:~$ sudo mkdir -p /var/run/hive/
dongguo@hadoop-master:~$ sudo chown hive:hive /var/run/hive/
4. Start the Hive services
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-metastore start
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-server start
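To verify the two services actually came up, you can look for their processes and listening ports. This is just a sanity check and assumes the default ports (10000 for hive-server, 9083 for hive-metastore); adjust if you changed them:
dongguo@hadoop-master:~$ ps aux | grep -i [h]ive
dongguo@hadoop-master:~$ sudo netstat -ntlp | grep -E '10000|9083'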
5. Install the MySQL JDBC Connector
Because Hive uses a MySQL database to store its metadata, and Hive itself is written in Java, the MySQL JDBC driver needs to be installed.
dongguo@hadoop-master:~$ wget 'http://cdn.mysql.com/Downloads/Connector-J/mysql-connector-java-5.1.22.tar.gz'
dongguo@hadoop-master:~$ tar xzf mysql-connector-java-5.1.22.tar.gz
dongguo@hadoop-master:~$ sudo cp mysql-connector-java-5.1.22/mysql-connector-java-5.1.22-bin.jar /usr/lib/hive/lib/
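A quick way to confirm the driver jar landed where Hive will pick it up:
dongguo@hadoop-master:~$ ls -l /usr/lib/hive/lib/ | grep mysql-connector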
6. Install MySQL
dongguo@hadoop-master:~$ sudo apt-get install mysql-server
In the prompt that appears, set the MySQL root password to hadoophive.
7. Create the database and grant privileges
dongguo@hadoop-master:~$ mysql -uroot -phadoophive
mysql> CREATE DATABASE metastore;
mysql> USE metastore;
mysql> SOURCE /usr/lib/hive/scripts/metastore/upgrade/mysql/hive-schema-0.7.0.mysql.sql;
mysql> CREATE USER 'hiveuser'@'%' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON metastore.* TO 'hiveuser'@'%';
mysql> REVOKE ALTER,CREATE ON metastore.* FROM 'hiveuser'@'%';
mysql> CREATE USER 'hiveuser'@'localhost' IDENTIFIED BY 'password';
mysql> GRANT SELECT,INSERT,UPDATE,DELETE ON metastore.* TO 'hiveuser'@'localhost';
mysql> REVOKE ALTER,CREATE ON metastore.* FROM 'hiveuser'@'localhost';
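To confirm the schema was loaded and the grants work, you can log back in as hiveuser and list the metastore tables (the tables come from the hive-schema script sourced above):
dongguo@hadoop-master:~$ mysql -uhiveuser -ppassword metastore -e "SHOW TABLES; SHOW GRANTS;"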
8. Configure hive-site.xml
dongguo@hadoop-master:~$ sudo vim /etc/hive/conf/hive-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>

<!-- Hive Configuration can either be stored in this file or in the hadoop configuration files -->
<!-- that are implied by Hadoop setup variables. -->
<!-- Aside from Hadoop setup variables - this file is provided as a convenience so that Hive -->
<!-- users do not have to edit hadoop configuration files (that may be managed as a centralized -->
<!-- resource). -->

<!-- Hive Execution Parameters -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>password</value>
</property>
<property>
  <name>datanucleus.autoCreateSchema</name>
  <value>false</value>
</property>
<property>
  <name>datanucleus.fixedDatastore</name>
  <value>true</value>
</property>
<property>
  <name>hive.stats.autogather</name>
  <value>false</value>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/lib/hive-hbase-handler-0.7.1-cdh3u3.jar,file:///usr/lib/hbase/hbase-0.90.4-cdh3u3.jar,file:///usr/lib/zookeeper/zookeeper-3.3.4-cdh3u3.jar</value>
</property>
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>hadoop-master,hadoop-node-1,hadoop-node-2</value>
</property>
</configuration>
Make sure the HBase and Zookeeper jars are available to Hive:
dongguo@hadoop-master:~$ sudo cp /usr/lib/hbase/hbase-0.90.4-cdh3u3.jar /usr/lib/hive/lib/
dongguo@hadoop-master:~$ sudo cp /usr/lib/zookeeper/zookeeper-3.3.4-cdh3u3.jar /usr/lib/hive/lib/
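With the configuration and jars in place, a minimal smoke test is to list tables from the command line; if the MySQL metastore connection in hive-site.xml is correct this returns without errors (and, at this point, with an empty list):
dongguo@hadoop-master:~$ sudo -u hive hive -e "SHOW TABLES;"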
9. Test running a job through Hive against data in HBase
Hive lowers the learning curve: it lets you run computations on the Hadoop cluster without writing Map/Reduce programs in Java, and it integrates well with HBase. Below we use Hive directly to drive MapReduce parallel computation on the Hadoop cluster and get a feel for what Hadoop can do.
9.1 Start Hive
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-server restart
dongguo@hadoop-master:~$ sudo /etc/init.d/hadoop-hive-metastore restart
9.2 Create the HDFS directories Hive needs
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /user/hive
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chown hive /user/hive
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /tmp
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod 777 /tmp
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod o+t /tmp
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir /data
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chown hdfs /data
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod 777 /data
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod o+t /data
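The directory layout, ownership and permissions can be double-checked from HDFS before moving on:
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -ls /
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -ls /user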
9.3 Create test data
dongguo@hadoop-master:~$ sudo -u hive vim /tmp/kv1.txt
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com
Note: the fields are separated by TAB characters, not spaces; this matters when we LOAD the data later.
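If you are not sure whether your editor inserted real TAB characters, generating the file from the shell avoids the problem entirely. A small sketch that produces the same five rows (printf reuses the format string for each pair of arguments):
dongguo@hadoop-master:~$ sudo -u hive bash -c 'printf "%s\t%s\n" 1 www.baidu.com 2 www.google.com 3 www.sina.com.cn 4 www.163.com 5 heylinux.com > /tmp/kv1.txt'
dongguo@hadoop-master:~$ cat -A /tmp/kv1.txt   # TABs show up as ^I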
9.4 Working in the Hive shell
dongguo@hadoop-master:~$ sudo -u hive hive
Hive history file=/tmp/hive/hive_job_log_hive_201210090516_1966038691.txt
hive>
# Create a table that HBase can recognize
hive> CREATE TABLE hbase_table_1(key int, value string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
    > TBLPROPERTIES ("hbase.table.name" = "xyz");
OK
Time taken: 17.919 seconds
hive>
# hbase.table.name defines the table name in HBase
# hbase.columns.mapping defines the column family in HBase

# To import data with SQL, create a new native Hive table
hive> CREATE TABLE IF NOT EXISTS pokes (
    > foo INT,
    > bar STRING
    > ) ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY "\t"
    > LINES TERMINATED BY "\n";
OK
Time taken: 2.883 seconds
hive>
# Tab (\t) is used as the field delimiter and newline (\n) as the line delimiter.

# Bulk-load the data
hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' OVERWRITE INTO TABLE pokes;
Copying data from file:/tmp/kv1.txt
Copying file: file:/tmp/kv1.txt
Loading data to table default.pokes
Deleted hdfs://hadoop-master/user/hive/warehouse/pokes
OK
Time taken: 4.361 seconds

hive> select * from pokes;
OK
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com
Time taken: 0.641 seconds

# Import into hbase_table_1 with SQL
hive> INSERT OVERWRITE TABLE hbase_table_1 SELECT * FROM pokes WHERE foo=5;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201210091928_0001, Tracking URL = http://hadoop-master:50030/jobdetails.jsp?jobid=job_201210091928_0001
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=hdfs://hadoop-master:8021 -kill job_201210091928_0001
2012-10-09 19:43:25,233 Stage-0 map = 0%, reduce = 0%
2012-10-09 19:43:31,323 Stage-0 map = 100%, reduce = 0%
2012-10-09 19:43:34,348 Stage-0 map = 100%, reduce = 100%
Ended Job = job_201210091928_0001
1 Rows loaded to hbase_table_1
OK
Time taken: 17.325 seconds
hive>
# You can see that Hive automatically invoked Hadoop Map/Reduce for the parallel computation,
# and finally inserted the row with ID 5 into HBase.

# Check the data
hive> select * from pokes;
OK
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com
Time taken: 0.874 seconds

hive> select * from hbase_table_1;
OK
5	heylinux.com
Time taken: 1.174 seconds
hive>
# Next, test the integration with Flume by directly LOADing the data Flume collected into HDFS earlier.
hive> LOAD DATA INPATH '/flume/rawdata/2012/10/10/0252/rawdata-1-a*' INTO TABLE pokes;
Loading data to table default.pokes
OK
Time taken: 1.65 seconds

hive> select * from pokes;
OK
1	www.baidu.com
2	www.google.com
3	www.sina.com.cn
4	www.163.com
5	heylinux.com
100	hadoop-node-1-rawdata-1-value1
101	hadoop-node-1-rawdata-1-value2
102	hadoop-node-1-rawdata-1-value3
103	hadoop-node-1-rawdata-1-value4
104	hadoop-node-1-rawdata-1-value5
105	hadoop-node-1-rawdata-1-value6
106	hadoop-node-1-rawdata-1-value7
Time taken: 0.623 seconds
hive>
# As you can see, the data Flume collected into HDFS was also LOADed successfully.
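The same workflow can also be run non-interactively with hive -e or hive -f, which is convenient once you move from experimenting to automation. A sketch reusing the tables created above (the file name /tmp/query.hql is just an example of ours):
dongguo@hadoop-master:~$ sudo -u hive hive -e "SELECT * FROM pokes WHERE foo = 5;"
dongguo@hadoop-master:~$ echo "SELECT COUNT(*) FROM pokes;" > /tmp/query.hql
dongguo@hadoop-master:~$ sudo -u hive hive -f /tmp/query.hql
The COUNT(*) query is a handy one-liner because it forces Hive to launch a full MapReduce job, just like the INSERT above.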
9.5 Log in to HBase and check the data just written
dongguo@hadoop-master:~$ sudo -u hdfs hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 0.90.4-cdh3u3, r, Thu Jan 26 10:13:36 PST 2012

hbase(main):001:0> describe 'xyz'
DESCRIPTION                                                                                                                 ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSI true
 ONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
1 row(s) in 0.9640 seconds

hbase(main):002:0> scan 'xyz'
ROW                          COLUMN+CELL
 5                           column=cf1:val, timestamp=1349783010142, value=heylinux.com
1 row(s) in 0.0790 seconds

hbase(main):003:0>
# Here we can see the record that was just inserted from Hive into HBase, i.e. the row with value heylinux.com.
# Now test inserting data directly in HBase.
hbase(main):004:0> put 'xyz','6','cf1:val','www.360buy.com'
0 row(s) in 0.0690 seconds

hbase(main):005:0> scan 'xyz'
ROW                          COLUMN+CELL
 5                           column=cf1:val, timestamp=1349783010142, value=heylinux.com
 6                           column=cf1:val, timestamp=1349783457396, value=www.360buy.com
2 row(s) in 0.0320 seconds

hbase(main):006:0>
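To confirm the mapping works in the other direction as well, query hbase_table_1 again from the Hive side; with the table definitions above, both row 5 and the row 6 just written in the HBase shell should come back:
dongguo@hadoop-master:~$ sudo -u hive hive -e "SELECT * FROM hbase_table_1;"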
9.6 Check the MapReduce execution in the MapReduce Administration web UI
From the output of the "Import into hbase_table_1 with SQL" step above, you can see that Hive automatically invoked Hadoop MapReduce for the parallel computation and finally inserted the row with ID 5 into HBase.
To see the result in detail, we can look at the MapReduce Administration web UI:
http://10.6.1.150:50030
It clearly shows that the job completed successfully.
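If you prefer the command line to the web UI, the jobtracker can also be queried with the hadoop job command from this Hadoop generation; depending on your cluster permissions you may need to run it as the hdfs or mapred user (the job id below is the one from the output above):
dongguo@hadoop-master:~$ hadoop job -list all
dongguo@hadoop-master:~$ hadoop job -status job_201210091928_0001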
At this point we have finished building the entire Hadoop cluster architecture, and through Hive we can conveniently drive Hadoop's MapReduce parallel computation using plain SQL.
Now let's revisit the diagram from Hadoop Cluster in Practice, Part 0: Complete Architecture Design to review the overall architecture and data flow, and begin our Hadoop journey!