Distributed Configuration of Hadoop and HBase
Keywords: hadoop, hbase, configuration, relationship between Hadoop and HBase
After getting Hadoop and HBase running on a single machine, it is finally time to deploy them on a cluster. Having worked through the whole configuration, I find the distributed setup is actually very similar to the standalone one; it mostly comes down to changing a few parameters in the configuration files so they suit a cluster. Of course this is only a simple deployment: if the cluster were to serve as a real production system, performance, stability and other factors would make things far less simple. Today's goal, though, is just to bring up a distributed Hadoop/HBase without worrying about performance tuning, and the cluster is tiny at this stage, with one master and two slaves:
master: 30.0.0.69
node1: 30.0.0.161
node2: 30.0.0.162
The overall work can be divided into four steps: preparing the distributed prerequisites, editing the Hadoop configuration files, editing the HBase configuration files, and copying the installation directories to the other nodes. Before starting, make sure every node has a hadoop user (any user name works as long as it is the same on all nodes); you can create it as the default user during system installation or simply add one with adduser. This user must be able to run sudo, so it is best to add it to the %admin group and, if necessary, edit /etc/sudoers. A more detailed explanation is available at http://blog.chinaunix.net/uid-26275986-id-3940725.html.
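As a minimal sketch of that user setup on each Ubuntu node (the hadoop user name and the admin group are this post's conventions; skip whatever already exists on your machines):

sudo adduser hadoop              # create the shared cluster user
sudo adduser hadoop admin        # put it in the admin group so it can use sudo
# If your system has no admin group, grant sudo via visudo instead, with a line such as:
#   hadoop ALL=(ALL:ALL) ALL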
I. Distributed prerequisites
By "distributed prerequisites" I mean the settings that let the cluster nodes talk to each other: the network interface configuration, the hosts file, and passwordless SSH.
1. Network interface configuration
All of my machines run Ubuntu 12.04 desktop. On Ubuntu the NIC can be configured with the ifconfig command or through the network applet in the top-right corner, but I still prefer editing the configuration file (/etc/network/interfaces):
auto lo
iface lo inet loopback
#auto eth0
#iface eth0 inet dhcp
auto eth0
iface eth0 inet static
address 30.0.0.69
netmask 255.255.255.0
gateway 30.0.0.254
#dns-nameservers DNS-IP
This is the interface configuration file, and two points deserve attention. First, changes made this way only take effect immediately after sudo /etc/init.d/networking restart; otherwise they still apply after the next reboot. Second, on Ubuntu you cannot set DNS by editing /etc/resolv.conf directly, because that file is regenerated on every reboot. To set DNS, either add the dns-nameservers line to the interface configuration (commented out in this example, since no DNS is needed here) or edit /etc/resolvconf/resolv.conf.d/base, then restart networking for it to take effect.
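For completeness, a sketch of applying these settings (the 8.8.8.8 nameserver is only a placeholder example):

sudo /etc/init.d/networking restart                                            # apply the interface file immediately
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolvconf/resolv.conf.d/base     # optional: persistent DNS
sudo resolvconf -u                                                             # regenerate /etc/resolv.conf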
2. hosts configuration
The /etc/hosts file fixes the IP of every node in the cluster. So that the nodes can reach one another correctly later on, set it up as follows, and configure the same file on every node:
127.0.0.1 localhost
#127.0.0.1 hadoop
#127.0.0.1 master
30.0.0.69 master
30.0.0.161 node1
30.0.0.162 node2
# The following lines are desirable for IPv6 capable hosts
#::1 ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters
I do not use IPv6, so the IPv6 entries are all commented out. Once this file is ready, it can simply be copied to the other nodes.
3. Passwordless SSH
Hadoop nodes talk to each other over SSH, so to avoid endless password prompts we set up passwordless SSH login. The setup is actually simple: the node being logged into only needs to hold the public key of the user who is logging in.
Step 1: run ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa. This creates the public/private key pair in ~/.ssh.
Step 2: append the public key to the target machine's authorized keys file: cat id_dsa.pub >> authorized_keys
Concretely, master's public key has to be appended to .ssh/authorized_keys on node1 and node2. During testing I hit a snag, though: node2 could ssh to the other nodes, but the other nodes could not ssh back to node2. After some digging it turned out the ssh versions on the nodes were inconsistent; node2 was running the version that shipped with the system, which presumably lacked the server component and needed to be upgraded. To keep things simple, just run sudo apt-get install ssh on every node at this step.
Once these steps are done, ssh back and forth between the nodes to test. The first login still asks for a password, but from the second time on it is password-free.
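Putting the two steps together, the key distribution from master to the slaves can look like this (a sketch; ssh-copy-id is assumed to be available, otherwise append the key by hand as in step 2):

# On master, as the hadoop user:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys    # lets master ssh to itself too
ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@node1
ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@node2
# Verify:
ssh node1 hostname
ssh node2 hostname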
II. Hadoop distributed configuration files
The distributed setup still requires everything a standalone Hadoop needs, so the JDK is mandatory. Keep the Hadoop user and the hadoop/hbase installation directories identical across all nodes; it saves a lot of trouble when copying later. Then download the Hadoop tarball, unpack it, and start editing the configuration files:
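For reference, a sketch of the unpacking and directory layout assumed in the rest of this post (the hadoop-1.0.3 tarball and the /home/hadoop paths match the values used in the configuration files below; adjust to taste):

# On master, as the hadoop user:
mkdir -p ~/platform ~/hdfs/name ~/hdfs/data ~/hdfs/tmp
tar -xzf hadoop-1.0.3.tar.gz -C ~/platform
cd ~/platform/hadoop-1.0.3/conf     # the files edited in steps 1-5 live here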
1. Set the JDK path by editing hadoop-env.sh:
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME=/home/hadoop/platform/jdk1.6.0_35
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=1000
# Extra Java runtime options. Empty by default.
# export HADOOP_OPTS=-server
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
# export HADOOP_TASKTRACKER_OPTS=
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
#export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"
# Extra ssh options. Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1
# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids
# A string representing this instance of hadoop. $USER by default.
# export HADOOP_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
2. Configure the namenode by editing core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hdfs/tmp</value>
  </property>
</configuration>
3. Configure HDFS by editing hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>32768</value>
  </property>
  <property>
    <!-- at most as many replicas as there are datanodes; each datanode keeps only one copy of a block -->
    <name>dfs.replication</name>
    <value>2</value>
    <final>true</final>
  </property>
</configuration>
Note that the final tag means the setting may not be overridden or changed dynamically later at run time.
4. Configure MapReduce by editing mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>30.0.0.69:9001</value>
  </property>
  <property>
    <!-- size this to your machine; mine has only 1 GB of RAM -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx800m</value>
    <final>true</final>
  </property>
</configuration>
5. Fill in the masters and slaves files for your cluster: write the master host name (master) into the masters file and the datanode host names (node1, node2) into the slaves file, as sketched below.
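With the three hosts above, the two files contain just one host name per line (note that in Hadoop 1.x the masters file actually controls where the SecondaryNameNode runs):

conf/masters:
master

conf/slaves:
node1
node2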
6. Do all of the above on master, then copy the configured hadoop directory as-is to every other node. Again, keep the Hadoop user and the installation path identical on each node, and do not forget to set up the hosts file on every machine;
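A sketch of that copy step, assuming the passwordless SSH set up earlier and the same ~/platform path on every node:

scp -r ~/platform/hadoop-1.0.3 hadoop@node1:~/platform/
scp -r ~/platform/hadoop-1.0.3 hadoop@node2:~/platform/
# For later config-only changes, rsync just the conf directory:
rsync -av ~/platform/hadoop-1.0.3/conf/ hadoop@node1:~/platform/hadoop-1.0.3/conf/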
7. Run and test:
Start Hadoop on master, then check the running processes with the jps command; DataNode and TaskTracker should both be running on node1 and node2.
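A sketch of the start-up and check commands (Hadoop 1.x style; it assumes $HADOOP_HOME/bin is on the PATH, and the namenode format is a first-run-only step):

# On master, first run only: format HDFS
hadoop namenode -format
# Start the HDFS and MapReduce daemons across the cluster
start-all.sh
# On master, expect NameNode, SecondaryNameNode and JobTracker
jps
# On the slaves, expect DataNode and TaskTracker
ssh node1 jps
ssh node2 jps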
The HDFS status page is then available at http://30.0.0.69:50070, where Live Nodes should show 2.
To check the MapReduce side, open http://30.0.0.69:50030.
I was puzzled that my Nodes count on that page was 0, even though the TaskTracker had clearly started on each slave. A quick search showed the problem is fairly common, and the usual fixes are:
1- take the namenode out of safe mode: hadoop dfsadmin -safemode leave;
2- reformat HDFS: delete the HDFS directories on every node (the name/data directories from hdfs-site.xml and the hadoop.tmp.dir from core-site.xml), then format again;
3- check the firewall: sudo ufw status
None of these helped. Then it occurred to me: could it be that no MapReduce job had run yet? So I quickly ran a test job on Hadoop:
Sure enough, this time two nodes showed up. So the problem was my own misunderstanding; the page was only populated once MapReduce tasks were actually running.
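Any bundled example job works for this check. A sketch using the wordcount example that ships with hadoop-1.0.3 (the input/output paths here are arbitrary):

hadoop fs -mkdir /test-in
hadoop fs -put ~/platform/hadoop-1.0.3/README.txt /test-in/
hadoop jar ~/platform/hadoop-1.0.3/hadoop-examples-1.0.3.jar wordcount /test-in /test-out
hadoop fs -cat /test-out/part-r-00000 | head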
With that, the distributed Hadoop configuration is essentially done; next comes HBase.
III. HBase configuration
Configuring HBase follows much the same approach as Hadoop: get the files right on the HMaster node first, and then, provided the user and the paths are identical everywhere, simply copy the HBase installation directory to the other nodes:
1. Set up the HBase runtime environment by editing hbase-env.sh:
#
#/**
# * Copyright 2007 The Apache Software Foundation
# *
# * Licensed to the Apache Software Foundation (ASF) under one
# * or more contributor license agreements. See the NOTICE file
# * distributed with this work for additional information
# * regarding copyright ownership. The ASF licenses this file
# * to you under the Apache License, Version 2.0 (the
# * "License"); you may not use this file except in compliance
# * with the License. You may obtain a copy of the License at
# *
# * http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */
# Set environment variables here.
# The java implementation to use. Java 1.6 required.
export JAVA_HOME=/home/hadoop/platform/jdk1.6.0_35
export HBASE_HOME=/home/hadoop/platform/hbase-0.90.0
export HADOOP_HOME=/home/hadoop/platform/hadoop-1.0.3
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
#export HBASE_HEAPSIZE=1000
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="$HBASE_OPTS -ea -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
# Uncomment below to enable java garbage collection logging.
# export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.
# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
#
# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
# File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default.
# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
# Extra ssh options. Empty by default.
# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"
# Where log files are stored. $HBASE_HOME/logs by default.
export HBASE_LOG_DIR=${HBASE_HOME}/logs
# A string representing this instance of hbase. $USER by default.
# export HBASE_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HBASE_NICENESS=10
# The directory where pid files are stored. /tmp by default.
# export HBASE_PID_DIR=/var/hadoop/pids
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HBASE_SLAVE_SLEEP=0.1
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=true
2. Configure the HBase site settings, including its location on HDFS, by editing hbase-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2010 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>30.0.0.69:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>30.0.0.161,30.0.0.162</value>
  </property>
  <property>
    <!-- default location -->
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>
</configuration>
The main settings here are the HBase root directory on HDFS and the location of the HMaster.
3. Copy the hbase-default.xml file from src/main/resources/ under the HBase directory, and modify two entries: hbase.rootdir, pinning it to the HDFS directory, and hbase.cluster.distributed:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2009 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
    <description>The directory shared by region servers and into
    which HBase persists. The URL should be 'fully-qualified'
    to include the filesystem scheme. For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase. By default HBase writes
    into /tmp. Change this configuration else all data will be lost
    on machine restart.
    </description>
  </property>
  <property>
    <name>hbase.master.port</name>
    <value>60000</value>
    <description>The port the HBase Master should bind to.</description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
    false for standalone mode and true for distributed mode. If
    false, startup will run all HBase and ZooKeeper daemons together
    in the one JVM.
    </description>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/tmp/hbase-${user.name}</value>
    <description>Temporary directory on the local filesystem.
    Change this setting to point to a location more permanent
    than '/tmp' (The '/tmp' directory is often cleared on
    machine restart).
    </description>
  </property>
  <!-- remaining properties omitted -->
4. Just as Hadoop has its masters and slaves files, HBase has a regionservers file listing the hosts that will run HRegionServers; simply add the host names, one per line, on each node, as sketched below;
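A minimal conf/regionservers for this cluster is just the two slave host names (the HMaster runs on whichever machine launches start-hbase.sh, master in this case):

node1
node2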
5. Copy the HBase directory to the same location on every other node (the same scp/rsync approach used for Hadoop works here); at this point the HBase configuration is essentially complete;
6. Run and test with start-hbase.sh:
Create a table from the hbase shell to verify that everything works.
The HMaster can also be checked through its web page at http://30.0.0.69:60010
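A quick smoke test from the command line (the table name and column family below are arbitrary examples):

# On master, with $HBASE_HOME/bin on the PATH:
start-hbase.sh
jps            # HMaster should appear; HRegionServer/HQuorumPeer on the slave and ZooKeeper hosts
hbase shell
# Inside the shell:
#   status
#   create 'test', 'cf'
#   put 'test', 'row1', 'cf:a', 'value1'
#   scan 'test'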
PS:
When configuring distributed Hadoop/HBase, be absolutely clear about the IPs of the Master and the HMaster, and make sure the configuration files contain no mistakes; otherwise the services will not start properly. The overall approach is:
1. On every node, configure hosts and ssh and install the JDK; create the same Hadoop user and the same file paths;
2. Configure hadoop/hbase on the Master/HMaster;
3. Copy the hadoop/hbase directories straight to every node and test that the cluster works.