Hadoop Cluster in Practice (3): Flume Setup


Series index
Hadoop Cluster in Practice (0): Overall Architecture Design [Hadoop (HDFS), HBase, Zookeeper, Flume, Hive]
Hadoop Cluster in Practice (1): Hadoop (HDFS) Setup
Hadoop Cluster in Practice (2): HBase & Zookeeper Setup
Hadoop Cluster in Practice (3): Flume Setup
Hadoop Cluster in Practice (4): Hive Setup

This article covers
Hadoop Cluster in Practice (3): Flume Setup

References
http://www.cnblogs.com/oubo/archive/2012/05/25/2517751.html
http://archive.cloudera.com/cdh/3/flume/UserGuide/
http://log.medcl.net/item/2012/03/flume-build-process/
http://blog.csdn.net/rzhzhz/article/details/7449767
https://blogs.apache.org/flume/entry/flume_ng_architecture

Installing and Configuring Flume
Environment
OS: Ubuntu 10.10 Server 64-bit
Servers:
hadoop-master: 10.6.1.150 (1024 MB RAM)
- namenode,jobtracker;hbase-master,hbase-thrift;
- secondarynamenode;
- zookeeper-server;
- datanode,taskTracker
- flume-master
- flume-node

hadoop-node-1: 10.6.1.151 (640 MB RAM)
- datanode,taskTracker;hbase-regionServer;
- zookeeper-server;
- flume-node

hadoop-node-2: 10.6.1.152 (640 MB RAM)
- datanode,taskTracker;hbase-regionServer;
- zookeeper-server;
- flume-node

A brief description of the roles above:
namenode - manages the namespace of the entire HDFS
secondarynamenode - performs periodic checkpoints of the namenode metadata (it is not a hot standby)
jobtracker - job management service for MapReduce (parallel computation)
datanode - HDFS storage node service
tasktracker - executes MapReduce tasks on a node
hbase-master - HBase management service
hbase-regionServer - serves client requests such as inserts, deletes, and queries
zookeeper-server - ZooKeeper coordination and configuration management service
flume-master - Flume management service
flume-node - Flume agent and collector service; the actual role is assigned by the flume-master

Conventions used in this article, to avoid confusion when running commands across multiple servers:
Any command that starts with a bare $ prompt (no hostname) must be run on all servers, unless a trailing // comment says otherwise.

1. Prerequisites
Hadoop Cluster in Practice (2): HBase & Zookeeper Setup has already been completed.

Flume has three roles:
Master: responsible for configuration and communication management; it is the controller of the cluster.
Collector: aggregates the data coming from agents, typically producing a larger stream, and loads it into storage.
Agent: collects data; agents are where data streams originate in Flume, and they forward the streams they produce to a collector.

Both collectors and agents must be configured with a Source (think of it as the data entry point) and a Sink (the data exit point); a quick way to try a source locally is shown after the lists below.

Commonly used sources:
text("filename"): uses the file filename as the data source, sending it line by line
tail("filename"): watches filename for newly appended data and sends it line by line
fsyslogTcp(5140): listens on TCP port 5140 and forwards the data it receives

Commonly used sinks:
console[("format")]: writes the data directly to the console (stdout)
text("txtfile"): writes the data to the file txtfile
dfs("dfsfile"): writes the data to the file dfsfile on HDFS
syslogTcp("host",port): sends the data to the given host and port over TCP
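
Before configuring anything on the cluster, a single source can be tried out locally with the flume dump helper, which wires the given source to a console sink (a minimal, optional sanity check; flume dump is described in the CDH3 Flume user guide, and /var/log/syslog is just an example file):

$ flume dump 'tail("/var/log/syslog")'

Every new line appended to the file should be printed as an event; stop it with Ctrl+C.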

2. Install the Flume agent and collector (flume-node)
$ sudo apt-get install flume-node

3. Install the Flume master
dongguo@hadoop-master:~$ sudo apt-get install flume-master
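
A quick, optional way to confirm which Flume packages ended up on each host:

$ dpkg -l | grep flume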

4. Start the Flume services
$ sudo /etc/init.d/flume-node start
dongguo@hadoop-master:~$ sudo /etc/init.d/flume-master start
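
If the daemons came up cleanly, the master should be listening on its web and admin ports (35871 and 35873 by default) and each node on its web port (35862); these are the same ports used by the web UI and the Flume shell later in this article. A quick check on the master:

dongguo@hadoop-master:~$ sudo netstat -lntp | grep -E '35871|35873|35862'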

5. Configure flume-conf.xml
$ sudo vim /etc/flume/conf/flume-conf.xml
Modify the path and output-format parameters in the following property blocks:

  <property>
    <name>flume.collector.output.format</name>
    <value>raw</value>
    <description>The output format for the data written by a Flume
    collector node.  There are several formats available:
      syslog - outputs events in a syslog-like format
      log4j - outputs events in a pattern similar to Hadoop's log4j pattern
      raw - Event body only.  This is most similar to copying a file but
        does not preserve any uniqifying metadata like host/timestamp/nanos.
      avro - Avro Native file format.  Default currently is uncompressed.
      avrojson - this outputs data as json encoded by avro
      avrodata - this outputs data as a avro binary encoded data
      debug - used only for debugging
    </description>
  </property>

  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master</value>
    <description>A comma-separated list of hostnames, one for each
      machine in the Flume Master.
    </description>
  </property>

  <property>
    <name>flume.collector.dfs.dir</name>
    <value>file:///data/flume/tmp/flume-${user.name}/collected</value>
    <description>This is a dfs directory that is the the final resting
    place for logs to be stored in.  This defaults to a local dir in
    /tmp but can be hadoop URI path that such as hdfs://namenode/path/
    </description>
  </property>

  <property>
    <name>flume.master.zk.logdir</name>
    <value>/data/flume/tmp/flume-${user.name}-zk</value>
    <description>The base directory in which the ZBCS stores data.</description>
  </property>

  <property>
    <name>flume.agent.logdir</name>
    <value>/data/flume/tmp/flume-${user.name}/agent</value>
    <description> This is the directory that write-ahead logging data
      or disk-failover data is collected from applications gets
      written to. The agent watches this directory.
    </description>
  </property>

Create the local and HDFS directories that Flume needs
$ sudo mkdir -p /data/flume/tmp
$ sudo chown -R flume:flume /data/flume/

dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir -p /data/flume/tmp
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chown -R flume:flume /data/flume/

The settings above mainly relocate Flume's temporary/working directories to /data/flume and keep the collected log content in raw format.

6. Configure flume-site.xml
$ sudo vim /etc/flume/conf/flume-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"  href="configuration.xsl"?>

<configuration>
  <property>
    <name>flume.master.servers</name>
    <value>hadoop-master</value>
  </property>
  <property>
    <name>flume.master.store</name>
    <value>zookeeper</value>
  </property>
  <property>
    <name>flume.master.zk.use.external</name>
    <value>true</value>
  </property>
  <property>
    <name>flume.master.zk.servers</name>
    <value>hadoop-master:2181,hadoop-node-1:2181,hadoop-node-2:2181</value>
  </property>
</configuration>

These settings make Flume store its configuration in the external ZooKeeper ensemble and designate hadoop-master as the Flume master.
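
Since both flume-conf.xml and flume-site.xml were changed after the daemons were started in step 4, restart the services so that the new settings take effect:

$ sudo /etc/init.d/flume-node restart
dongguo@hadoop-master:~$ sudo /etc/init.d/flume-master restart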

7. Using the Flume shell
dongguo@hadoop-master:~$ flume shell

[flume (disconnected)] connect localhost
Using default admin port: 35873
Using default report port: 45678
Connecting to Flume master localhost:35873:45678...
2012-10-08 21:52:50,876 [main] INFO util.AdminRPCThrift: Connected to master at localhost:35873
[flume localhost:35873:45678] getnodestatus
Master knows about 3 nodes
	hadoop-node-1 --> IDLE
	hadoop-node-2 --> IDLE
	hadoop-master --> IDLE

8. Check the configuration stored in Zookeeper
dongguo@hadoop-master:~$ zookeeper-client

[zk: localhost:2181(CONNECTED) 0] ls /
[flume-cfgs, counters-config_version, flume-chokemap, hbase, zookeeper, flume-nodes]
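
The flume-cfgs and flume-nodes znodes show that the Flume master is persisting its state in the external ZooKeeper ensemble. To drill down, the registered nodes can be listed as well (the exact child names depend on how the nodes registered):

[zk: localhost:2181(CONNECTED) 1] ls /flume-nodes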

The same information can also be checked through the web interfaces.
Flume master UI:
http://10.6.1.150:35871

Flume node UIs:
http://10.6.1.150:35862
http://10.6.1.151:35862
http://10.6.1.152:35862

9. Configure the sources, sinks, and collectors
The design: hadoop-node-1 and hadoop-node-2 act as agents responsible for collecting the logs, while hadoop-master acts as the collector, receiving the logs and writing them into HDFS.
The example below collects two logs on hadoop-node-1 and one log on hadoop-node-2, with hadoop-master listening on three ports to receive the logs and write them into HDFS.
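
To summarize the plan before issuing the commands (ports 35852-35854 are simply unused ports picked for this example):
agent-1-a (hadoop-node-1): tail /var/log/rawdata/rawdata-1.log -> hadoop-master:35852
agent-1-b (hadoop-node-1): tail /var/log/rawdata/rawdata-2.log -> hadoop-master:35853
agent-2-a (hadoop-node-2): tail /var/log/rawdata/rawdata-1.log -> hadoop-master:35854
clct-1-a / clct-1-b / clct-2-a (hadoop-master): collectorSource on 35852/35853/35854, writing to hdfs://hadoop-master/flume/rawdata/%Y/%m/%d/%H%M/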

Create the HDFS directory that will hold the collected logs
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -mkdir -p /flume/rawdata
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chown -R hdfs /flume
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -chmod 777 /flume/rawdata
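
A quick check that the directory was created with the expected owner and permissions:

dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -ls /flume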

The log records below are test data; in practice the content follows whatever format your own application produces. The field delimiter can be a comma "," or a TAB character; a TAB is used here (a quick check for the tab characters follows the sample files below).
Create the log files under /var/log/rawdata/ on hadoop-node-1 and hadoop-node-2:
$ sudo mkdir /var/log/rawdata/

On hadoop-node-1
dongguo@hadoop-node-1:~$ sudo vim /var/log/rawdata/rawdata-1.log

100     hadoop-node-1-rawdata-1-value1
101     hadoop-node-1-rawdata-1-value2
102	hadoop-node-1-rawdata-1-value3
103	hadoop-node-1-rawdata-1-value4
104	hadoop-node-1-rawdata-1-value5
105	hadoop-node-1-rawdata-1-value6

dongguo@hadoop-node-1:~$ sudo vim /var/log/rawdata/rawdata-2.log

200     hadoop-node-1-rawdata-2-value1
201     hadoop-node-1-rawdata-2-value2

On hadoop-node-2
dongguo@hadoop-node-2:~$ sudo vim /var/log/rawdata/rawdata-1.log

300     hadoop-node-2-rawdata-1-value1
301     hadoop-node-2-rawdata-1-value2
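
Because the TAB delimiter will matter again when these files are queried (for example from Hive in the next article of this series), it is worth confirming that the editor really inserted tab characters rather than spaces; cat -A shows a tab as ^I:

dongguo@hadoop-node-1:~$ cat -A /var/log/rawdata/rawdata-1.log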

Configure the Flume agents and collectors
dongguo@hadoop-master:~$ flume shell

[flume (disconnected)] connect hadoop-master
[flume hadoop-master:35873:45678] exec map hadoop-node-1 agent-1-a
[flume hadoop-master:35873:45678] exec map hadoop-node-1 agent-1-b
[flume hadoop-master:35873:45678] exec map hadoop-node-2 agent-2-a
[flume hadoop-master:35873:45678] exec config agent-1-a 'tail("/var/log/rawdata/rawdata-1.log",true)' 'agentE2EChain("hadoop-master:35852")'
[flume hadoop-master:35873:45678] exec config agent-1-b 'tail("/var/log/rawdata/rawdata-2.log",true)' 'agentE2EChain("hadoop-master:35853")'
[flume hadoop-master:35873:45678] exec config agent-2-a 'tail("/var/log/rawdata/rawdata-1.log",true)' 'agentE2EChain("hadoop-master:35854")'
[flume hadoop-master:35873:45678] exec map hadoop-master clct-1-a
[flume hadoop-master:35873:45678] exec map hadoop-master clct-1-b
[flume hadoop-master:35873:45678] exec map hadoop-master clct-2-a
[flume hadoop-master:35873:45678] exec config clct-1-a 'collectorSource(35852)' 'collectorSink("hdfs://hadoop-master/flume/rawdata/%Y/%m/%d/%H%M/", "rawdata-1-a")'
[flume hadoop-master:35873:45678] exec config clct-1-b 'collectorSource(35853)' 'collectorSink("hdfs://hadoop-master/flume/rawdata/%Y/%m/%d/%H%M/", "rawdata-1-b")'
[flume hadoop-master:35873:45678] exec config clct-2-a 'collectorSource(35854)' 'collectorSink("hdfs://hadoop-master/flume/rawdata/%Y/%m/%d/%H%M/", "rawdata-2-a")'
[flume hadoop-master:35873:45678] 

The web UI shows that the configuration above has taken effect:
http://10.6.1.150:35871
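
Besides the web UI, the logical-to-physical mappings and the per-node configurations can also be verified from the Flume shell session that is still open (assuming this Flume release provides the getmappings and getconfigs shell commands):

[flume hadoop-master:35873:45678] getmappings
[flume hadoop-master:35873:45678] getconfigs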

Now append a new record to one of the logs to trigger incremental collection.
On hadoop-node-1
dongguo@hadoop-node-1:~$ sudo vim /var/log/rawdata/rawdata-1.log

106     hadoop-node-1-rawdata-1-value7
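
As an alternative to opening an editor, the same record could have been appended with tee, which is closer to how a real log file grows (shown only as an illustrative equivalent of the edit above):

dongguo@hadoop-node-1:~$ echo -e "106\thadoop-node-1-rawdata-1-value7" | sudo tee -a /var/log/rawdata/rawdata-1.log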

Check whether the new data has been collected into HDFS
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -lsr /flume/rawdata

drwxrwxrwx   - flume supergroup          0 2012-10-10 02:52 /flume/rawdata/2012
drwxrwxrwx   - flume supergroup          0 2012-10-10 02:52 /flume/rawdata/2012/10
drwxrwxrwx   - flume supergroup          0 2012-10-10 02:52 /flume/rawdata/2012/10/10
drwxrwxrwx   - flume supergroup          0 2012-10-10 02:52 /flume/rawdata/2012/10/10/0252
-rw-r--r--   2 flume supergroup        245 2012-10-10 02:52 /flume/rawdata/2012/10/10/0252/rawdata-1-a20121010-025216795+0800.26693113439423.00000029

View the contents of the collected file
dongguo@hadoop-master:~$ sudo -u hdfs hadoop fs -cat /flume/rawdata/2012/10/10/0252/rawdata-1-a20121010-025216795+0800.26693113439423.00000029

 
100	hadoop-node-1-rawdata-1-value1
101	hadoop-node-1-rawdata-1-value2
102	hadoop-node-1-rawdata-1-value3
103	hadoop-node-1-rawdata-1-value4
104	hadoop-node-1-rawdata-1-value5
105	hadoop-node-1-rawdata-1-value6
106	hadoop-node-1-rawdata-1-value7

As the output shows, the updated log file was successfully collected.

10. At this point, the Flume setup is complete.

11. Next, we can continue with:
Hadoop Cluster in Practice (4): Hive Setup
