The previous post installed Hadoop on each virtual machine (Big Data Series - Hadoop Cluster Setup (1)). This post introduces Hadoop itself and plans the deployment across those virtual machines.
Contents
1. Getting to Know Hadoop
1.1 Core Modules
1.2 Related Projects
1.3 Hadoop Architecture
1.3.1 HDFS Architecture
1.3.2 YARN Architecture
2. Hadoop Deployment Planning
2.1 Node Planning
2.2 Port Planning
3. Hadoop Cluster Configuration
3.1 Configuration Files
3.1.1 Custom Configuration Files
3.1.2 Default Configuration
3.2 Configuring Hadoop in Non-Secure Mode
3.2.1 Hadoop Configuration Notes
3.2.2 Cluster Configuration
3.3 Starting the Cluster & Verification
3.3.1 Starting HDFS
3.3.2 Starting YARN
3.3.3 Starting the JobHistory Server
3.3.4 Cluster Verification
3.4 Processes and Ports
3.4.1 leader node
3.4.2 follower1 node
3.4.3 follower2 node
1. Getting to Know Hadoop

1.1 Core Modules
The project includes the following modules:

Hadoop Common: the common utilities that support the other three modules.
Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets; this is the module behind the classic "divide and conquer" approach.
1.2 Related Projects

Ambari™: a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health (e.g. heatmaps) and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: a data serialization system.
Cassandra™: a scalable multi-master database with no single points of failure.
Chukwa™: a data collection system for managing large distributed systems.
HBase™: a scalable, distributed database that supports structured data storage for large tables.
Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: a scalable machine learning and data mining library.
Ozone™: a scalable, redundant, distributed object store for Hadoop.
Pig™: a high-level data-flow language and execution framework for parallel computation.
Spark™: a fast, general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Submarine: a unified AI platform that allows engineers and data scientists to run machine learning and deep learning workloads on a distributed cluster.
Tez™: a generalized data-flow programming framework, built on Hadoop YARN, that provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: a high-performance coordination service for distributed applications.
1.3 Hadoop Architecture

1.3.1 HDFS Architecture
HDFS has a master/slave architecture, consisting of a single NameNode and multiple DataNodes.

NameNode: the master server that manages the file system namespace and regulates client access to files. It handles namespace operations such as opening, closing, and renaming files and directories, manages the metadata, and maintains the mapping from data blocks to DataNodes, acting somewhat like a service registry.

Secondary NameNode (2NN): periodically checkpoints the NameNode's metadata; note that it is a checkpoint helper, not a hot standby.

DataNodes: the nodes that store the actual data. A file is split into one or more blocks, which are stored on the DataNodes. DataNodes serve read and write requests from file system clients, and create, delete, and replicate blocks on instruction from the NameNode.
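To see this block-to-DataNode mapping for a concrete file, you can query the NameNode with fsck; a minimal example (the path is just an illustration, using the file uploaded in section 3.3.4):

# list a file's blocks and the DataNodes holding each replica
$ hdfs fsck /input/wcinput_sample.txt -files -blocks -locations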
1.3.2 YARN Architecture
The fundamental idea of YARN is to split the functions of resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.

YARN consists of the RM (ResourceManager) and the NMs (NodeManagers).

ResourceManager: the ultimate authority that arbitrates resources among all applications in the system. The RM has two main components, the Scheduler and the ApplicationsManager:

Scheduler: responsible for allocating resources to the various running applications, subject to familiar constraints such as capacities and queues.

ApplicationsManager: responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and restarting the ApplicationMaster container on failure. Each application's ApplicationMaster then negotiates appropriate resource containers from the Scheduler, tracks their status, and monitors progress.

NodeManager: the per-machine framework agent responsible for containers; it monitors their resource usage (CPU, memory, disk, network) and reports it to the ResourceManager/Scheduler.
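Once the cluster is running (section 3.3), these daemons can be inspected from the command line; for example:

# NodeManagers currently registered with the ResourceManager
$ yarn node -list
# applications known to the ResourceManager, with their states
$ yarn application -list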
2. Hadoop Deployment Planning

2.1 Node Planning
        192.168.56.101 (leader)   192.168.56.102 (follower1)    192.168.56.103 (follower2)
HDFS    NameNode, DataNode        SecondaryNameNode, DataNode   DataNode
YARN    NodeManager               NodeManager                   ResourceManager, NodeManager

Note that the master daemons are spread out: the NameNode on leader, the SecondaryNameNode on follower1, and the ResourceManager on follower2, so no single VM carries them all.
2.2 Port Planning
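For reference, the Hadoop 3.x default ports involved here are: NameNode RPC 8020, NameNode web UI 9870, SecondaryNameNode web UI 9868, ResourceManager web UI 8088, NodeManager web UI 8042, JobHistory Server RPC 10020, and JobHistory web UI 19888.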
3. Hadoop Cluster Configuration

3.1 Configuration Files
Hadoop has two kinds of configuration files, default configuration files and custom configuration files; the custom files take precedence.
3.1.1 Custom Configuration Files

The *-site.xml files (core-site.xml, hdfs-site.xml, yarn-site.xml, and mapred-site.xml under etc/hadoop) are the custom configuration files.
3.1.2 Default Configuration

The default configuration can be found in the following files:
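In the Hadoop 3.3.1 distribution, these sit inside the bundled jars:

core-default.xml (in share/hadoop/common/hadoop-common-3.3.1.jar)
hdfs-default.xml (in share/hadoop/hdfs/hadoop-hdfs-3.3.1.jar)
yarn-default.xml (in share/hadoop/yarn/hadoop-yarn-common-3.3.1.jar)
mapred-default.xml (in share/hadoop/mapreduce/hadoop-mapreduce-client-core-3.3.1.jar)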
3.2 Configuring Hadoop in Non-Secure Mode

3.2.1 Hadoop Configuration Notes

The *-env files in the same directory (hadoop-env.sh, yarn-env.sh, mapred-env.sh; see the listing in 3.1.1) configure the corresponding environment variables.

Notes on the key configuration parameters:
etc/hadoop/core-site.xml

fs.defaultFS: the NameNode URI, of the form hdfs://host:port/.
io.file.buffer.size: 131072 (128 KB). Size of the read/write buffer used in SequenceFiles.

etc/hadoop/hdfs-site.xml

Configurations for the NameNode:

dfs.namenode.name.dir: path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. If this is a comma-delimited list of directories, the name table is replicated in all of them for redundancy.
dfs.hosts / dfs.hosts.exclude: list of permitted/excluded DataNodes. If necessary, use these files to control the list of allowed DataNodes.
dfs.blocksize: 268435456 (256 MB). HDFS block size of 256 MB for large file systems.
dfs.namenode.handler.count: 100. More NameNode server threads to handle RPCs from a large number of DataNodes.

Configurations for the DataNode:

dfs.datanode.data.dir: comma-separated list of paths on the DataNode's local filesystem where it should store its blocks. If this is a comma-delimited list of directories, data will be stored in all of them, typically on different devices.

etc/hadoop/yarn-site.xml

Configurations for the ResourceManager and NodeManager:

yarn.admin.acl: ACL for setting admins on the cluster. ACLs take the form "comma-separated-users space comma-separated-groups". Defaults to the special value *, which means anyone; the special value of just a space means no one has access.
yarn.log-aggregation-enable: false. Enables or disables log aggregation (disabled by default).

Configurations for the ResourceManager:

yarn.resourcemanager.address: host:port for clients to submit jobs. If set, this overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.scheduler.address: host:port for ApplicationMasters to talk to the Scheduler to obtain resources. If set, this overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.resource-tracker.address: host:port for NodeManagers. If set, this overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.admin.address: host:port for administrative commands. If set, this overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.webapp.address: host:port of the ResourceManager web UI. If set, this overrides the hostname set in yarn.resourcemanager.hostname.
yarn.resourcemanager.hostname: the ResourceManager host. A single hostname that can be set in place of all the yarn.resourcemanager*address settings, resulting in default ports for the ResourceManager components.
yarn.resourcemanager.scheduler.class: the ResourceManager Scheduler class: CapacityScheduler (recommended), FairScheduler (also recommended), or FifoScheduler. Use a fully qualified class name, e.g. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.
yarn.scheduler.minimum-allocation-mb: minimum amount of memory, in MB, allocated to each container request at the ResourceManager.
yarn.scheduler.maximum-allocation-mb: maximum amount of memory, in MB, allocated to each container request at the ResourceManager.
yarn.resourcemanager.nodes.include-path / yarn.resourcemanager.nodes.exclude-path: list of permitted/excluded NodeManagers. If necessary, use these files to control the list of allowed NodeManagers.

Configurations for the NodeManager:

yarn.nodemanager.resource.memory-mb: available physical memory, in MB, for the given NodeManager; defines the total resources on the NodeManager that can be made available to running containers.
yarn.nodemanager.vmem-pmem-ratio: maximum ratio by which the virtual memory usage of tasks may exceed physical memory. The virtual memory usage of each task may exceed its physical memory limit by this ratio, and the total virtual memory used by tasks on the NodeManager may exceed its physical memory usage by this ratio.
yarn.nodemanager.local-dirs: comma-separated list of paths on the local filesystem where intermediate data is written; multiple paths help spread disk I/O.
yarn.nodemanager.log-dirs: comma-separated list of paths on the local filesystem where logs are written; multiple paths help spread disk I/O.
yarn.nodemanager.log.retain-seconds: 10800 (3 hours). Default time, in seconds, to retain log files on the NodeManager; only applicable if log aggregation is disabled.
yarn.nodemanager.remote-app-log-dir: /logs. HDFS directory to which application logs are moved on application completion; appropriate permissions need to be set; only applicable if log aggregation is enabled.
yarn.nodemanager.remote-app-log-dir-suffix: logs. Suffix appended to the remote log dir; logs are aggregated to ${yarn.nodemanager.remote-app-log-dir}/${user}/${thisParam}; only applicable if log aggregation is enabled.
yarn.nodemanager.aux-services: mapreduce_shuffle. The shuffle service that needs to be set for MapReduce applications.
yarn.nodemanager.env-whitelist: environment properties to be inherited by containers from NodeManagers. For MapReduce applications, HADOOP_MAPRED_HOME should be added to the default values, so the property value should be JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME.
etc/hadoop/mapred-site.xml

Configurations for MapReduce applications:

mapreduce.framework.name: yarn. Sets the execution framework to Hadoop YARN.

Configurations for the MapReduce JobHistory Server:

mapreduce.jobhistory.address: MapReduce JobHistory Server host:port. The default port is 10020.
mapreduce.jobhistory.webapp.address: MapReduce JobHistory Server web UI host:port. The default port is 19888.
mapreduce.jobhistory.intermediate-done-dir: /mr-history/tmp. Directory where history files are written by MapReduce jobs.
mapreduce.jobhistory.done-dir: /mr-history/done. Directory where history files are managed by the MR JobHistory Server.
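To check which value a running cluster actually uses for any of these keys (default vs. site file), getconf is convenient:

# print the effective value of a configuration key
$ hdfs getconf -confKey dfs.blocksize
$ hdfs getconf -confKey fs.defaultFS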
3.2.2 Cluster Configuration

Below is the configuration of my cluster:
/opt/module/hadoop-3.3.1/etc/hadoop/hdfs-site.xml
/opt/module/hadoop-3.3.1/etc/hadoop/core-site.xml
/opt/module/hadoop-3.3.1/etc/hadoop/mapred-site.xml
/opt/module/hadoop-3.3.1/etc/hadoop/yarn-site.xml
/opt/module/hadoop-3.3.1/etc/hadoop/workers
leader
follower1
follower2
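As a minimal sketch of what the four *-site.xml files need to contain for the node plan in section 2.1 (the hdfs://leader:8020 URI and the data directory are assumptions; adjust them to your environment):

<!-- core-site.xml: the NameNode runs on leader; 8020 is the default RPC port -->
<configuration>
  <property><name>fs.defaultFS</name><value>hdfs://leader:8020</value></property>
  <!-- base directory for HDFS data; this path is an assumption -->
  <property><name>hadoop.tmp.dir</name><value>/opt/module/hadoop-3.3.1/data</value></property>
</configuration>

<!-- hdfs-site.xml: web UIs follow the node plan (NameNode on leader, 2NN on follower1) -->
<configuration>
  <property><name>dfs.namenode.http-address</name><value>leader:9870</value></property>
  <property><name>dfs.namenode.secondary.http-address</name><value>follower1:9868</value></property>
</configuration>

<!-- yarn-site.xml: the ResourceManager runs on follower2 -->
<configuration>
  <property><name>yarn.resourcemanager.hostname</name><value>follower2</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>
</configuration>

<!-- mapred-site.xml: run MapReduce on YARN; the JobHistory Server runs on leader -->
<configuration>
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>
  <property><name>mapreduce.jobhistory.address</name><value>leader:10020</value></property>
  <property><name>mapreduce.jobhistory.webapp.address</name><value>leader:19888</value></property>
</configuration>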
*Important: sync the configuration above from leader to follower1 and follower2, e.g.:
$ scp /opt/module/hadoop-3.3.1/etc/hadoop/workers hadoop@follower1:/opt/module/hadoop-3.3.1/etc/hadoop/
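A small loop copies all of the edited files to both followers in one go (assuming the passwordless SSH set up in part 1):

$ for host in follower1 follower2; do
    scp /opt/module/hadoop-3.3.1/etc/hadoop/{core,hdfs,yarn,mapred}-site.xml \
        /opt/module/hadoop-3.3.1/etc/hadoop/workers \
        hadoop@$host:/opt/module/hadoop-3.3.1/etc/hadoop/
  done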
3.3 Starting the Cluster & Verification

3.3.1 Starting HDFS

For the first start, format the NameNode once, on the leader node:

$ hdfs namenode -format
# then start HDFS
$ start-dfs.sh
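Once start-dfs.sh returns, you can check that all of the DataNodes registered with the NameNode (per the node plan in 2.1, there should be three):

# should report 3 live DataNodes
$ hdfs dfsadmin -report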
3.3.2 Starting YARN

Start YARN on follower2, the node where the ResourceManager is configured:

$ start-yarn.sh
3.3.3 Starting the JobHistory Server

On the leader node, start the history server:

$ mapred --daemon start historyserver
Note: if a warning like the following appears during startup:

Java HotSpot(TM) Client VM warning: You have loaded library /opt/module/hadoop-3.3.1/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fix the library with 'execstack -c <libfile>', or link it with '-z noexecstack'.
Fix: add the last three export lines below to cust.sh, so that it ends up as follows, then source the file (or log in again) and restart the daemons:

[hadoop@leader hadoop]$ more /etc/profile.d/cust.sh
#JAVA_HOME
JAVA_HOME=/opt/module/jdk1.8.0_311
PATH=$PATH:$JAVA_HOME/bin
HADOOP_HOME=/opt/module/hadoop-3.3.1
PATH=$PATH:$HADOOP_HOME/bin
PATH=$PATH:$HADOOP_HOME/sbin
export JAVA_HOME
export HADOOP_HOME
export PATH
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=/opt/module/hadoop-3.3.1/lib/native
3.3.4 Cluster Verification

Verification 1: create a directory in HDFS and upload some files

$ hadoop fs -mkdir /input
# put the wordcount text file from the earlier post into /input on HDFS
$ hadoop fs -put $HADOOP_HOME/wcinput/wcinput_sample.txt /input
# put the JDK tarball into /
$ hadoop fs -put /opt/software/jdk-8u212-linux-x64.tar.gz /
The files can be inspected in the configured data directory, e.g. on the leader node, or in the NameNode web UI (the output and tmp directories can be ignored here; they are used in the verification below).
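On disk, the uploaded data is stored as block files under the DataNode data directory; assuming the default layout beneath hadoop.tmp.dir, it looks roughly like this:

# block files and their checksums, nested under a block pool directory
$ ls /opt/module/hadoop-3.3.1/data/dfs/data/current/BP-*/current/finalized/subdir0/subdir0/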
Verification 2: run the wordcount job

# first remove any existing /output directory, to avoid a directory-already-exists error
$ hadoop fs -rm -r /output
# run the wordcount example, writing the result to /output on HDFS
$ hadoop jar /opt/module/hadoop-3.3.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.1.jar wordcount /input /output
When the job finishes, it and its logs show up in the JobHistory UI, and the result appears under /output in HDFS.
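The result can also be read directly from the command line; a single-reducer wordcount writes its output to part-r-00000:

$ hadoop fs -cat /output/part-r-00000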
3.4 Processes and Ports

3.4.1 leader node

3.4.2 follower1 node

3.4.3 follower2 node
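Given the node plan in 2.1, jps on each node should show roughly the following daemons (this is the expected picture, not captured output; PIDs omitted):

# leader
$ jps
NameNode
DataNode
NodeManager
JobHistoryServer

# follower1
$ jps
SecondaryNameNode
DataNode
NodeManager

# follower2
$ jps
ResourceManager
NodeManager
DataNode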
That completes the Hadoop cluster setup; follow-up posts will work through some examples to get deeper into the Hadoop ecosystem.