Distributed Configuration of Hadoop and HBase
Keywords: hadoop, hbase, configuration, relationship between Hadoop and HBase
After getting Hadoop and HBase running on a single machine, it is finally time to deploy them on a cluster. Having worked through the whole configuration, I find the distributed setup is actually very similar to the standalone one; it mostly comes down to changing a few parameters in the configuration files so they suit a cluster. Of course this is only a simple deployment: if the cluster were to serve as a real production system, performance, stability and other factors would make things far less simple. Today's goal, though, is just to bring up a distributed Hadoop/HBase without worrying about performance tuning, and the cluster is tiny at this stage, with one master and two slaves:
master: 30.0.0.69
node1: 30.0.0.161
node2: 30.0.0.162
The overall work can be divided into four steps: preparing the distributed prerequisites, editing the Hadoop configuration files, editing the HBase configuration files, and copying the installation directories to the other nodes. Before starting, make sure every node has a hadoop user (any user name works as long as it is the same on all nodes); you can create it as the default user during system installation or simply add one with adduser. This user must be able to run sudo, so it is best to add it to the %admin group and, if necessary, edit /etc/sudoers. A more detailed explanation is available at http://blog.chinaunix.net/uid-26275986-id-3940725.html.
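As a minimal sketch of that user setup on each Ubuntu node (the hadoop user name and the admin group are this post's conventions; skip whatever already exists on your machines):

sudo adduser hadoop              # create the shared cluster user
sudo adduser hadoop admin        # put it in the admin group so it can use sudo
# If your system has no admin group, grant sudo via visudo instead, with a line such as:
#   hadoop ALL=(ALL:ALL) ALL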
I. Distributed prerequisites
By "distributed prerequisites" I mean the settings that let the cluster nodes talk to each other: the network interface configuration, the hosts file, and passwordless SSH.
1. Network interface configuration
All of my machines run Ubuntu 12.04 desktop. On Ubuntu the NIC can be configured with the ifconfig command or through the network applet in the top-right corner, but I still prefer editing the configuration file (/etc/network/interfaces):
auto lo
iface lo inet loopback
#auto eth0
#iface eth0 inet dhcp
auto eth0
iface eth0 inet static
address 30.0.0.69
netmask 255.255.255.0
gateway 30.0.0.254
#dns-nameservers DNS-IP
This is the interface configuration file, and two points deserve attention. First, changes made this way only take effect immediately after sudo /etc/init.d/networking restart; otherwise they still apply after the next reboot. Second, on Ubuntu you cannot set DNS by editing /etc/resolv.conf directly, because that file is regenerated on every reboot. To set DNS, either add the dns-nameservers line to the interface configuration (commented out in this example, since no DNS is needed here) or edit /etc/resolvconf/resolv.conf.d/base, then restart networking for it to take effect.
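For completeness, a sketch of applying these settings (the 8.8.8.8 nameserver is only a placeholder example):

sudo /etc/init.d/networking restart                                            # apply the interface file immediately
echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolvconf/resolv.conf.d/base     # optional: persistent DNS
sudo resolvconf -u                                                             # regenerate /etc/resolv.conf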
2. hosts configuration
The /etc/hosts file fixes the IP of every node in the cluster. So that the nodes can reach one another correctly later on, set it up as follows, and configure the same file on every node:
127.0.0.1 localhost
#127.0.0.1 hadoop
#127.0.0.1 master
30.0.0.69 master
30.0.0.161 node1
30.0.0.162 node2
# The following lines are desirable for IPv6 capable hosts
#::1 ip6-localhost ip6-loopback
#fe00::0 ip6-localnet
#ff00::0 ip6-mcastprefix
#ff02::1 ip6-allnodes
#ff02::2 ip6-allrouters
I do not use IPv6, so the IPv6 entries are all commented out. Once this file is ready, it can simply be copied to the other nodes.
3. Passwordless SSH
Hadoop nodes talk to each other over SSH, so to avoid endless password prompts we set up passwordless SSH login. The setup is actually simple: the node being logged into only needs to hold the public key of the user who is logging in.
Step 1: run ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa. This creates the public/private key pair in ~/.ssh.
Step 2: append the public key to the target machine's authorized keys file: cat id_dsa.pub >> authorized_keys
Concretely, master's public key has to be appended to .ssh/authorized_keys on node1 and node2. During testing I hit a snag, though: node2 could ssh to the other nodes, but the other nodes could not ssh back to node2. After some digging it turned out the ssh versions on the nodes were inconsistent; node2 was running the version that shipped with the system, which presumably lacked the server component and needed to be upgraded. To keep things simple, just run sudo apt-get install ssh on every node at this step.
Once these steps are done, ssh back and forth between the nodes to test. The first login still asks for a password, but from the second time on it is password-free.
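Putting the two steps together, the key distribution from master to the slaves can look like this (a sketch; ssh-copy-id is assumed to be available, otherwise append the key by hand as in step 2):

# On master, as the hadoop user:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys    # lets master ssh to itself too
ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@node1
ssh-copy-id -i ~/.ssh/id_dsa.pub hadoop@node2
# Verify:
ssh node1 hostname
ssh node2 hostname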
II. Hadoop distributed configuration files
The distributed setup still requires everything a standalone Hadoop needs, so the JDK is mandatory. Keep the Hadoop user and the hadoop/hbase installation directories identical across all nodes; it saves a lot of trouble when copying later. Then download the Hadoop tarball, unpack it, and start editing the configuration files:
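For reference, a sketch of the unpacking and directory layout assumed in the rest of this post (the hadoop-1.0.3 tarball and the /home/hadoop paths match the values used in the configuration files below; adjust to taste):

# On master, as the hadoop user:
mkdir -p ~/platform ~/hdfs/name ~/hdfs/data ~/hdfs/tmp
tar -xzf hadoop-1.0.3.tar.gz -C ~/platform
cd ~/platform/hadoop-1.0.3/conf     # the files edited in steps 1-5 live here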
1. Set the JDK path by editing hadoop-env.sh:
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME=/home/hadoop/platform/jdk1.6.0_35
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
#export HADOOP_HEAPSIZE=1000
# Extra Java runtime options. Empty by default.
# export HADOOP_OPTS=-server
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
# export HADOOP_TASKTRACKER_OPTS=
# The following applies to multiple commands (fs, dfs, fsck, distcp etc)
#export HADOOP_CLIENT_OPTS="-Xmx1024m $HADOOP_CLIENT_OPTS"
# Extra ssh options. Empty by default.
# export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
# Where log files are stored. $HADOOP_HOME/logs by default.
# export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
# File naming remote slave hosts. $HADOOP_HOME/conf/slaves by default.
# export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HADOOP_SLAVE_SLEEP=0.1
# The directory where pid files are stored. /tmp by default.
# export HADOOP_PID_DIR=/var/hadoop/pids
# A string representing this instance of hadoop. $USER by default.
# export HADOOP_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HADOOP_NICENESS=10
2. Configure the namenode by editing core-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hdfs/tmp</value>
  </property>
</configuration>
3. Configure HDFS by editing hdfs-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/hadoop/hdfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/hadoop/hdfs/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>32768</value>
  </property>
  <property>
    <!-- at most as many replicas as there are datanodes; each datanode keeps only one copy of a block -->
    <name>dfs.replication</name>
    <value>2</value>
    <final>true</final>
  </property>
</configuration>
Note that the final tag means the setting may not be overridden or changed dynamically later at run time.
4. Configure MapReduce by editing mapred-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>30.0.0.69:9001</value>
  </property>
  <property>
    <!-- size this to your machine; mine has only 1 GB of RAM -->
    <name>mapred.child.java.opts</name>
    <value>-Xmx800m</value>
    <final>true</final>
  </property>
</configuration>
5. Fill in the masters and slaves files for your cluster: write the master host name (master) into the masters file and the datanode host names (node1, node2) into the slaves file, as sketched below.
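With the three hosts above, the two files contain just one host name per line (note that in Hadoop 1.x the masters file actually controls where the SecondaryNameNode runs):

conf/masters:
master

conf/slaves:
node1
node2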
6. Do all of the above on master, then copy the configured hadoop directory as-is to every other node. Again, keep the Hadoop user and the installation path identical on each node, and do not forget to set up the hosts file on every machine;
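A sketch of that copy step, assuming the passwordless SSH set up earlier and the same ~/platform path on every node:

scp -r ~/platform/hadoop-1.0.3 hadoop@node1:~/platform/
scp -r ~/platform/hadoop-1.0.3 hadoop@node2:~/platform/
# For later config-only changes, rsync just the conf directory:
rsync -av ~/platform/hadoop-1.0.3/conf/ hadoop@node1:~/platform/hadoop-1.0.3/conf/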
7. Run and test:
Start Hadoop on master, then check the running processes with the jps command; DataNode and TaskTracker should both be running on node1 and node2.
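A sketch of the start-up and check commands (Hadoop 1.x style; it assumes $HADOOP_HOME/bin is on the PATH, and the namenode format is a first-run-only step):

# On master, first run only: format HDFS
hadoop namenode -format
# Start the HDFS and MapReduce daemons across the cluster
start-all.sh
# On master, expect NameNode, SecondaryNameNode and JobTracker
jps
# On the slaves, expect DataNode and TaskTracker
ssh node1 jps
ssh node2 jps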
The HDFS status page is then available at http://30.0.0.69:50070, where Live Nodes should show 2.
To check the MapReduce side, open http://30.0.0.69:50030.
I was puzzled that my Nodes count on that page was 0, even though the TaskTracker had clearly started on each slave. A quick search showed the problem is fairly common, and the usual fixes are:
1- take the namenode out of safe mode: hadoop dfsadmin -safemode leave;
2- reformat HDFS: delete the HDFS directories on every node (the name/data directories from hdfs-site.xml and the hadoop.tmp.dir from core-site.xml), then format again;
3- check the firewall: sudo ufw status
None of these helped. Then it occurred to me: could it be that no MapReduce job had run yet? So I quickly ran a test job on Hadoop:
Sure enough, this time two nodes showed up. So the problem was my own misunderstanding; the page was only populated once MapReduce tasks were actually running.
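Any bundled example job works for this check. A sketch using the wordcount example that ships with hadoop-1.0.3 (the input/output paths here are arbitrary):

hadoop fs -mkdir /test-in
hadoop fs -put ~/platform/hadoop-1.0.3/README.txt /test-in/
hadoop jar ~/platform/hadoop-1.0.3/hadoop-examples-1.0.3.jar wordcount /test-in /test-out
hadoop fs -cat /test-out/part-r-00000 | head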
With that, the distributed Hadoop configuration is essentially done; next comes HBase.
III. HBase configuration
Configuring HBase follows much the same approach as Hadoop: get the files right on the HMaster node first, and then, provided the user and the paths are identical everywhere, simply copy the HBase installation directory to the other nodes:
1. Set up the HBase runtime environment by editing hbase-env.sh:
#
#/**
# * Copyright 2007 The Apache Software Foundation
# *
# * Licensed to the Apache Software Foundation (ASF) under one
# * or more contributor license agreements. See the NOTICE file
# * distributed with this work for additional information
# * regarding copyright ownership. The ASF licenses this file
# * to you under the Apache License, Version 2.0 (the
# * "License"); you may not use this file except in compliance
# * with the License. You may obtain a copy of the License at
# *
# * http://www.apache.org/licenses/LICENSE-2.0
# *
# * Unless required by applicable law or agreed to in writing, software
# * distributed under the License is distributed on an "AS IS" BASIS,
# * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# * See the License for the specific language governing permissions and
# * limitations under the License.
# */
# Set environment variables here.
# The java implementation to use. Java 1.6 required.
export JAVA_HOME=/home/hadoop/platform/jdk1.6.0_35
export HBASE_HOME=/home/hadoop/platform/hbase-0.90.0
export HADOOP_HOME=/home/hadoop/platform/hadoop-1.0.3
# Extra Java CLASSPATH elements. Optional.
# export HBASE_CLASSPATH=
# The maximum amount of heap to use, in MB. Default is 1000.
#export HBASE_HEAPSIZE=1000
# Extra Java runtime options.
# Below are what we set by default. May only work with SUN JVM.
# For more on why as well as other possible settings,
# see http://wiki.apache.org/hadoop/PerformanceTuning
export HBASE_OPTS="$HBASE_OPTS -ea -XX:+HeapDumpOnOutOfMemoryError -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"
# Uncomment below to enable java garbage collection logging.
# export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:$HBASE_HOME/logs/gc-hbase.log"
# Uncomment and adjust to enable JMX exporting
# See jmxremote.password and jmxremote.access in $JRE_HOME/lib/management to configure remote password access.
# More details at: http://java.sun.com/javase/6/docs/technotes/guides/management/agent.html
#
# export HBASE_JMX_BASE="-Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false"
# export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10101"
# export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10102"
# export HBASE_THRIFT_OPTS="$HBASE_THRIFT_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10103"
# export HBASE_ZOOKEEPER_OPTS="$HBASE_ZOOKEEPER_OPTS $HBASE_JMX_BASE -Dcom.sun.management.jmxremote.port=10104"
# File naming hosts on which HRegionServers will run. $HBASE_HOME/conf/regionservers by default.
# export HBASE_REGIONSERVERS=${HBASE_HOME}/conf/regionservers
# Extra ssh options. Empty by default.
# export HBASE_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HBASE_CONF_DIR"
# Where log files are stored. $HBASE_HOME/logs by default.
export HBASE_LOG_DIR=${HBASE_HOME}/logs
# A string representing this instance of hbase. $USER by default.
# export HBASE_IDENT_STRING=$USER
# The scheduling priority for daemon processes. See 'man nice'.
# export HBASE_NICENESS=10
# The directory where pid files are stored. /tmp by default.
# export HBASE_PID_DIR=/var/hadoop/pids
# Seconds to sleep between slave commands. Unset by default. This
# can be useful in large clusters, where, e.g., slave rsyncs can
# otherwise arrive faster than the master can service them.
# export HBASE_SLAVE_SLEEP=0.1
# Tell HBase whether it should manage it's own instance of Zookeeper or not.
export HBASE_MANAGES_ZK=true
2. Configure the HBase site settings, including its location on HDFS, by editing hbase-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2010 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <property>
    <name>hbase.master</name>
    <value>30.0.0.69:60000</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>30.0.0.161,30.0.0.162</value>
  </property>
  <property>
    <!-- default location -->
    <name>zookeeper.znode.parent</name>
    <value>/hbase</value>
  </property>
</configuration>
The main settings here are the HBase root directory on HDFS and the location of the HMaster.
3. Copy the hbase-default.xml file from src/main/resources/ under the HBase directory, and modify two entries: hbase.rootdir, pinning it to the HDFS directory, and hbase.cluster.distributed:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
 * Copyright 2009 The Apache Software Foundation
 *
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements. See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership. The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License. You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-->
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://master:9000/hbase</value>
    <description>The directory shared by region servers and into
    which HBase persists. The URL should be 'fully-qualified'
    to include the filesystem scheme. For example, to specify the
    HDFS directory '/hbase' where the HDFS instance's namenode is
    running at namenode.example.org on port 9000, set this value to:
    hdfs://namenode.example.org:9000/hbase. By default HBase writes
    into /tmp. Change this configuration else all data will be lost
    on machine restart.
    </description>
  </property>
  <property>
    <name>hbase.master.port</name>
    <value>60000</value>
    <description>The port the HBase Master should bind to.</description>
  </property>
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
    false for standalone mode and true for distributed mode. If
    false, startup will run all HBase and ZooKeeper daemons together
    in the one JVM.
    </description>
  </property>
  <property>
    <name>hbase.tmp.dir</name>
    <value>/tmp/hbase-${user.name}</value>
    <description>Temporary directory on the local filesystem.
    Change this setting to point to a location more permanent
    than '/tmp' (The '/tmp' directory is often cleared on
    machine restart).
    </description>
  </property>
  <!-- remaining properties omitted -->
4. Just as Hadoop has its masters and slaves files, HBase has a regionservers file listing the hosts that will run HRegionServers; simply add the host names, one per line, on each node, as sketched below;
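A minimal conf/regionservers for this cluster is just the two slave host names (the HMaster runs on whichever machine launches start-hbase.sh, master in this case):

node1
node2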
5. Copy the HBase directory to the same location on every other node (the same scp/rsync approach used for Hadoop works here); at this point the HBase configuration is essentially complete;
6. Run and test with start-hbase.sh:
Create a table from the hbase shell to verify that everything works.
The HMaster can also be checked through its web page at http://30.0.0.69:60010
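A quick smoke test from the command line (the table name and column family below are arbitrary examples):

# On master, with $HBASE_HOME/bin on the PATH:
start-hbase.sh
jps            # HMaster should appear; HRegionServer/HQuorumPeer on the slave and ZooKeeper hosts
hbase shell
# Inside the shell:
#   status
#   create 'test', 'cf'
#   put 'test', 'row1', 'cf:a', 'value1'
#   scan 'test'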
PS:
When configuring distributed Hadoop/HBase, be absolutely clear about the IPs of the Master and the HMaster, and make sure the configuration files contain no mistakes; otherwise the services will not start properly. The overall approach is:
1. On every node, configure hosts and ssh and install the JDK; create the same Hadoop user and the same file paths;
2. Configure hadoop/hbase on the Master/HMaster;
3. Copy the hadoop/hbase directories straight to every node and test that the cluster works.