Preface
In the previous post we set up the virtual machine cluster. Now we will build the Hadoop cluster on top of it, providing HDFS storage and a MapReduce compute environment for the data processing that follows.
Goal
Build a Hadoop cluster with one master and two slaves, put a file into the HDFS distributed file system, and then inspect the corresponding node information through the web console.
Cluster node | Roles |
---|---|
master201 | NameNode, SecondaryNameNode |
slave202 | DataNode, NodeManager |
slave203 | DataNode, NodeManager |
Preparing the installation files
The files below can also be downloaded from elsewhere on the web.

Installer | Download |
---|---|
hadoop-2.7.2.tar.gz (Hadoop package) | Link: https://pan.baidu.com/s/1dlUjcLjemTcm7jnc0HDiwQ  extraction code: typ4 |
jdk-8u144-linux-x64.tar.gz (JDK package) | Link: https://pan.baidu.com/s/1dlUjcLjemTcm7jnc0HDiwQ  extraction code: typ4 |
Pre-installation setup
Configure the hosts file (identical on all three VMs):

```
vim /etc/hosts

192.168.152.201 master201
192.168.152.202 slave202
192.168.152.203 slave203
```

Set up SSH trust among the cluster VMs (run this on all three)
```
ssh-keygen -t rsa
```

After the command starts, just keep pressing Enter (do this on all three machines).
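The keygen step can also be done non-interactively. A minimal sketch, run against a throwaway temp directory so it does not touch your real ~/.ssh:

```shell
# -N '' supplies the empty passphrase and -f the output path; together they
# replace the "keep pressing Enter" prompts. Done in a temp dir for safety.
d=$(mktemp -d)
ssh-keygen -t rsa -N '' -f "$d/id_rsa" -q
ls "$d"   # the private key plus the .pub that ssh-copy-id distributes
rm -rf "$d"
```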
Then run the following on master201:

```
ssh-copy-id -i ~/.ssh/id_rsa.pub master201
ssh-copy-id -i ~/.ssh/id_rsa.pub slave202
ssh-copy-id -i ~/.ssh/id_rsa.pub slave203
```

Install rz/sz for convenient file upload and download later
```
yum install -y lrzsz
```

Upload the Hadoop and JDK packages with the rz command.

Installing the JDK
```
[root@localhost Soft]# tar -xvf jdk-8u144-linux-x64.tar.gz
[root@localhost Soft]# vim /etc/profile
```

Append the following environment variables at the end of the file:

```
export JAVA_HOME=/home/lishijia/Soft/jdk1.8.0_144
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
```

```
[root@localhost Soft]# source /etc/profile
[root@localhost Soft]# java -version
java version "1.8.0_144"
Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
```

Copy the JDK installation to the slave machines
```
scp -r /home/lishijia/Soft/jdk1.8.0_144 slave202:/home/lishijia/Soft
scp -r /home/lishijia/Soft/jdk1.8.0_144 slave203:/home/lishijia/Soft
scp /etc/profile slave202:/etc/
scp /etc/profile slave203:/etc/
```

Then run `source /etc/profile` on both slaves.
With that, the preparation for installing the Hadoop cluster is complete.
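Since every check so far has to be repeated on all three machines, a small loop helps. This is only a sketch: `run_on` is a hypothetical helper that executes locally so the snippet is self-contained; on the real cluster you would replace its body with `ssh "$1" "$2"`, which works password-free once the ssh-copy-id step is done.

```shell
nodes="master201 slave202 slave203"
# Hypothetical helper: run a command "on" a node. Locally it just uses
# bash -c; on the cluster, swap the body for:  ssh "$1" "$2"
run_on() { bash -c "$2"; }
for n in $nodes; do
  echo "== $n =="
  # e.g. verify the JDK environment variable reached every node
  run_on "$n" 'echo JAVA_HOME is ${JAVA_HOME:-unset}'
done
```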
Installing the Hadoop cluster
Install Hadoop on the master
Unpack Hadoop:

```
tar -xvf hadoop-2.7.2.tar.gz
```
Configure the environment variables:

```
vim /etc/profile.d/hadoop2.7.2.sh
```

Add the following variables:

```
export HADOOP_HOME="/home/lishijia/Soft/hadoop-2.7.2"
export PATH="$HADOOP_HOME/bin:$PATH"
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
```

Edit the hadoop-env.sh configuration file
```
vim /home/lishijia/Soft/hadoop-2.7.2/etc/hadoop/hadoop-env.sh
```

Add the following variable:

```
export JAVA_HOME=/home/lishijia/Soft/jdk1.8.0_144
```

Edit the slaves configuration file
```
vim /home/lishijia/Soft/hadoop-2.7.2/etc/hadoop/slaves
```

Add the slave hostnames:

```
slave202
slave203
```

Edit the core-site.xml configuration file
```
vim /home/lishijia/Soft/hadoop-2.7.2/etc/hadoop/core-site.xml
```

Change it to the following:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <!-- URI of the HDFS NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://master201:9000</value>
    </property>
    <!-- Size of read/write buffer used in SequenceFiles. -->
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
    <!-- Hadoop temporary directory; create it yourself -->
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/lishijia/Soft/hadoop-2.7.2/tmp</value>
    </property>
</configuration>
```

Edit the hdfs-site.xml configuration file
```
vim /home/lishijia/Soft/hadoop-2.7.2/etc/hadoop/hdfs-site.xml
```

Change it to the following:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>dfs.namenode.secondary.http-address</name>
        <value>master201:50090</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/home/lishijia/Soft/hadoop-2.7.2/hdfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/home/lishijia/Soft/hadoop-2.7.2/hdfs/data</value>
    </property>
</configuration>
```

Edit the yarn-site.xml configuration file
```
vim /home/lishijia/Soft/hadoop-2.7.2/etc/hadoop/yarn-site.xml
```

Change it to the following:

```xml
<?xml version="1.0"?>
<configuration>
    <!-- Site specific YARN configuration properties -->
    <!-- Configurations for ResourceManager -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master201:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master201:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master201:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>master201:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>master201:8088</value>
    </property>
</configuration>
```

Edit the mapred-site.xml configuration file
```
vim /home/lishijia/Soft/hadoop-2.7.2/etc/hadoop/mapred-site.xml
```

Change it to the following (note: the second JobHistory property is mapreduce.jobhistory.webapp.address, not a repeat of mapreduce.jobhistory.address; 19888 is the web UI port):

```xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>master201:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>master201:19888</value>
    </property>
</configuration>
```

That completes the core configuration files.
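Repeating a property name inside a *-site.xml does not raise an error; Hadoop simply lets one definition shadow the other, so a copy-paste slip like two `mapreduce.jobhistory.address` entries is easy to miss. A quick grep-based duplicate check, shown here against an inline sample file so the sketch is self-contained (point `conf` at a real config on the cluster):

```shell
conf=$(mktemp)
cat > "$conf" <<'EOF'
<configuration>
  <property><name>mapreduce.jobhistory.address</name><value>master201:10020</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>master201:19888</value></property>
</configuration>
EOF
# Print any <name> that appears more than once; empty output means no duplicates.
grep -o '<name>[^<]*</name>' "$conf" | sort | uniq -d
rm -f "$conf"
```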
Use scp to sync the modified files to the slave machines:

```
scp /etc/profile.d/hadoop2.7.2.sh slave202:/etc/profile.d/
scp /etc/profile.d/hadoop2.7.2.sh slave203:/etc/profile.d/
scp -r /home/lishijia/Soft/hadoop-2.7.2 slave202:/home/lishijia/Soft
scp -r /home/lishijia/Soft/hadoop-2.7.2 slave203:/home/lishijia/Soft
```

Format HDFS on master201
```
source /etc/profile.d/hadoop2.7.2.sh
hdfs namenode -format
```

Start the Hadoop cluster
```
cd /home/lishijia/Soft/hadoop-2.7.2/sbin
./start-all.sh
```

Put a file into the HDFS distributed file system (simple hdfs commands; read the troubleshooting section below before running them)
```
hadoop fs -mkdir /lishijia
hadoop fs -ls /lishijia
hadoop fs -ls /
hadoop fs -mkdir /lishijia/input
hadoop fs -ls /lishijia/input
hadoop fs -ls /lishijia
hadoop fs -put train.csv /lishijia/input/
```

To verify, browse the Hadoop cluster's slave node information and open the HDFS admin console to inspect the node and storage details.
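Besides the web console, the running daemons can be checked with `jps`. A sketch of that check, with the `jps` output simulated inline (and made-up pids) so the snippet is self-contained; on master201 you would set `jps_out="$(jps)"` instead, and on the slaves check for DataNode and NodeManager:

```shell
# Simulated `jps` output for the master node; replace with: jps_out="$(jps)"
jps_out='12001 NameNode
12102 SecondaryNameNode
12203 ResourceManager
12304 Jps'
for d in NameNode SecondaryNameNode ResourceManager; do
  if printf '%s\n' "$jps_out" | grep -qw "$d"; then
    echo "$d: running"
  else
    echo "$d: NOT running" >&2
  fi
done
```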
Installation problems and issues encountered before
Firewall (be sure to run this on all three machines; otherwise the admin console is unreachable, and putting files into HDFS fails with java.net.NoRouteToHostException: No route to host):

```
sudo systemctl stop firewalld.service
sudo systemctl disable firewalld.service
```

After Hadoop has been running for a while (a few days), stopping it fails with errors (an issue hit in earlier use)
```
[root@master ~]# stop-dfs.sh
Stopping namenodes on [master]
hadoop100: no namenode to stop
hadoop101: no datanode to stop
hadoop102: no datanode to stop
hadoop103: no datanode to stop
hadoop104: no datanode to stop
```

Root cause: the Hadoop pid files under /tmp were cleaned up automatically by the system. You can kill the processes by hand, but you cannot keep doing that every time.
Fix: change where the pid files are stored. Only one line needs to be added to the hadoop-daemon.sh script:
```
if [ "$HADOOP_PID_DIR" = "" ]; then
  HADOOP_PID_DIR=/tmp
fi
```

After that block, add the following line to relocate the pid files:

```
HADOOP_PID_DIR=/home/lishijia/Soft/hadoop-2.7.2/pid
```

Remember to stop Hadoop before making this change, otherwise you will be unable to stop it afterwards! By the same logic, also edit yarn-daemon.sh (around line 91):
```
if [ "$YARN_PID_DIR" = "" ]; then
  YARN_PID_DIR=/tmp
fi
```

After that block, add the following line to relocate the pid files:

```
YARN_PID_DIR=/home/lishijia/Soft/hadoop-2.7.2/pid
```

Then start Hadoop with start-dfs.sh and start-yarn.sh, and check the /home/lishijia/Soft/hadoop-2.7.2/pid directory.
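The effect of the override can be seen by replaying the daemon script's default-if-unset logic in isolation (paths as used in this guide; this is purely a local simulation, not the actual script):

```shell
# The stock hadoop-daemon.sh logic: an unset HADOOP_PID_DIR falls back to
# /tmp, where the OS may delete the pid files after a few days.
HADOOP_PID_DIR=""
if [ "$HADOOP_PID_DIR" = "" ]; then
  HADOOP_PID_DIR=/tmp
fi
echo "without override: $HADOOP_PID_DIR"
# The one-line override added after that block pins pids to a durable path.
HADOOP_PID_DIR=/home/lishijia/Soft/hadoop-2.7.2/pid
echo "with override:    $HADOOP_PID_DIR"
```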
Remember to sync these changes to the corresponding Hadoop slave machines as well.
Clock drift across the three servers
Added: 2018-10-21 21:09:22

```
18/10/21 20:58:21 INFO yarn.Client:
client token: N/A
diagnostics: Application application_1540122054916_0004 failed 2 times due to Error launching appattempt_1540122054916_0004_000002. Got exception: org.apache.hadoop.yarn.exceptions.YarnException: Unauthorized request to start container.
This token is expired. current time is 1540155500688 found 1540127301513
Note: System times on machines may be out of sync. Check system time and time zones.
```

Fix it by syncing the clocks periodically with ntpdate from cron:

```
vi /etc/crontab
*/10 * * * * ntpdate -u ntp.api.bz
```
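The two millisecond timestamps in the log above quantify the drift, which is why the token was rejected as expired:

```shell
current_ms=1540155500688   # "current time" from the error message
token_ms=1540127301513     # token timestamp from the error message
drift_s=$(( (current_ms - token_ms) / 1000 ))
echo "clock drift: ${drift_s}s (~$(( drift_s / 3600 )) hours)"
# prints: clock drift: 28199s (~7 hours)
```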
Summary
This completes the Hadoop cluster installation, covering the core HDFS and MapReduce components, along with a simple HDFS walkthrough: creating directories, listing them, and uploading a file to the HDFS cluster.