2022-02-09大数据学习日志——Hadoop离线阶段——Hive窗口函数、性能调优

时间：2023-06-17

学习目标

掌握窗口函数的使用知道Hive数据压缩、文件存储格式掌握Hive通用调优（重要的见下述大纲）

内容大纲

#Hive窗口函数（Window function）开窗函数分组TopN级联累加问题连续登陆#Hive的性能调优hive的数据文件格式数据压缩行式存储列式存储（ORC parquet）hive通用调优*join优化*group by数据倾斜优化*task并行度问题其他通用调优

01_Apache Hive 窗口函数快速理解与语法 1.1 快速理解窗口函数功能

window function 窗口函数、开窗函数、olap分析函数。

窗口：可以理解为操作数据的范围，窗口有大有小，本窗口中操作的数据有多有少。

可以简单地解释为类似于聚合函数的计算函数，但是通过GROUP BY子句组合的常规聚合会隐藏正在聚合的各个行，最终输出一行；而窗口函数聚合后还可以访问当中的各个行，并且可以将这些行中的某些属性添加到结果集中。

--建表加载数据CREATE TABLE employee( id int, name string, deg string, salary int, dept string) row format delimited fields terminated by ',';load data local inpath '/root/hivedata/employee.txt' into table employee;select * from employee;----sum+group by普通常规聚合操作------------select dept,sum(salary) as total from employee group by dept;select id,dept,sum(salary) as total from employee group by dept; --添加id至结果，错误sql+-------+---------+| dept | total |+-------+---------+| AC | 60000 || TP | 120000 |+-------+---------+----sum+窗口函数聚合操作------------select id,name,deg,salary,dept,sum(salary) over(partition by dept) as total from employee;+-------+-----------+----------+---------+-------+---------+| id | name | deg | salary | dept | total |+-------+-----------+----------+---------+-------+---------+| 1204 | prasanth | dev | 30000 | AC | 60000 || 1203 | khalil | dev | 30000 | AC | 60000 || 1206 | kranthi | admin | 20000 | TP | 120000 || 1202 | manisha | cto | 50000 | TP | 120000 || 1201 | gopal | manager | 50000 | TP | 120000 |+-------+-----------+----------+---------+-------+---------+

1.2 窗口函数语法

具有OVER语句的函数叫做窗口函数。

Function(arg1,..., argn) OVER ([PARTITION BY <...>] [ORDER BY <....>] [])--其中Function(arg1,..., argn) 可以是下面分类中的任意一个 --聚合函数：比如sum max avg等 --排序函数：比如rank row_number等 --分析函数：比如lead lag first_value等--OVER [PARTITION BY <...>] 类似于group by 用于指定分组每个分组你可以把它叫做窗口--如果没有PARTITION BY 那么整张表的所有行就是一组--[ORDER BY <....>] 用于指定每个分组内的数据排序规则支持ASC、DESC--[] 用于指定每个窗口中操作的数据范围默认是窗口中所有行

建表加载数据后续练习使用

---建表并且加载数据create table website_pv_info( cookieid string, createtime string, --day pv int) row format delimitedfields terminated by ',';create table website_url_info ( cookieid string, createtime string, --访问时间 url string --访问页面) row format delimitedfields terminated by ',';load data local inpath '/root/hivedata/website_pv_info.txt' into table website_pv_info;load data local inpath '/root/hivedata/website_url_info.txt' into table website_url_info;select * from website_pv_info;select * from website_url_info;

02_Apache Hive 窗口函数聚合函数和窗口表达式 2.1 聚合函数

语法

sum|max|min|avg OVER ([PARTITION BY <...>] [ORDER BY <....>] [])

重点：有PARTITION BY 没有PARTITION BY的区别；有ORDER BY没有ORDER BY的区别。

有没有partition by 影响的是全局聚合还是分组之后每个组内聚合。有没有order by的区别：

没有order by,聚合的时候是组内所有的数据聚合再一起全局聚合如果有order by，聚合的时候是累加聚合，默认是第一行聚合到当前行。

栗子

--1、求出每个用户总pv数 sum+group by普通常规聚合操作select cookieid,sum(pv) as total_pv from website_pv_info group by cookieid;+-----------+-----------+| cookieid | total_pv |+-----------+-----------+| cookie1 | 26 || cookie2 | 35 |+-----------+-----------+--2、sum+窗口函数总共有四种用法注意是整体聚合还是累积聚合--sum(...) over( )对表所有行求和--sum(...) over( order by ..、) 连续累积求和--sum(...) over( partition by..、) 同组内所行求和--sum(...) over( partition by..、order by ..、) 在每个分组内，连续累积求和--需求：求出网站总的pv数所有用户所有访问加起来--sum(...) over( )对表所有行求和select cookieid,createtime,pv, sum(pv) over() as total_pv from website_pv_info;--需求：求出每个用户总pv数--sum(...) over( partition by..、)，同组内所行求和select cookieid,createtime,pv, sum(pv) over(partition by cookieid) as total_pvfrom website_pv_info;--需求：求出每个用户截止到当天，累积的总pv数--sum(...) over( partition by..、order by ..、)，在每个分组内，连续累积求和select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime) as current_total_pvfrom website_pv_info;+-----------+-------------+-----+-------------------+| cookieid | createtime | pv | current_total_pv |+-----------+-------------+-----+-------------------+| cookie1 | 2018-04-10 | 1 | 1 || cookie1 | 2018-04-11 | 5 | 6 || cookie1 | 2018-04-12 | 7 | 13 || cookie1 | 2018-04-13 | 3 | 16 || cookie1 | 2018-04-14 | 2 | 18 || cookie1 | 2018-04-15 | 4 | 22 || cookie1 | 2018-04-16 | 4 | 26 || cookie2 | 2018-04-10 | 2 | 2 || cookie2 | 2018-04-11 | 3 | 5 || cookie2 | 2018-04-12 | 5 | 10 || cookie2 | 2018-04-13 | 6 | 16 || cookie2 | 2018-04-14 | 3 | 19 || cookie2 | 2018-04-15 | 9 | 28 || cookie2 | 2018-04-16 | 7 | 35 |+-----------+-------------+-----+-------------------+

2.2 窗口表达式 window_expression

直译叫做window表达式，通俗叫法称之为window子句。

功能：控制窗口操作的范围。

语法

rows between- preceding：往前- following：往后- current row：当前行- unbounded：起点- unbounded preceding 表示从前面的起点第一行- unbounded following：表示到后面的终点最后一行

栗子

--默认从第一行到当前行select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime) as pv1 from website_pv_info;--第一行到当前行等效于rows between不写默认就是第一行到当前行select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and current row) as pv2from website_pv_info;--向前3行至当前行select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and current row) as pv4from website_pv_info;--向前3行向后1行select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5from website_pv_info;--当前行至最后一行select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime rows between current row and unbounded following) as pv6from website_pv_info;--第一行到最后一行也就是分组内的所有行select cookieid,createtime,pv, sum(pv) over(partition by cookieid order by createtime rows between unbounded preceding and unbounded following) as pv6from website_pv_info;

03_Apache Hive 窗口函数排序函数（row_number等）

功能：主要对数据分组排序之后，组内顺序标号。

核心函数：row_number、rank、dense_rank

适合场景：分组TopN问题（注意哦不是全局topN）

栗子

SELECt cookieid, createtime, pv, RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn1, DENSE_RANK() OVER(PARTITION BY cookieid ORDER BY pv desc) AS rn2, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn3FROM website_pv_info;--需求：找出每个用户访问pv最多的Top3 重复并列的不考虑SELECt * from(SELECt cookieid, createtime, pv, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY pv DESC) AS seqFROM website_pv_info) tmp where tmp.seq <4;

ntile函数

功能：将分组排序之后的数据分成指定的若干个部分（若干个桶）

规则：尽量平均分配，优先满足最小的桶，彼此最多不相差1个。

栗子

--把每个分组内的数据分为3桶SELECt cookieid, createtime, pv, NTILE(3) OVER(PARTITION BY cookieid ORDER BY createtime) AS rn2FROM website_pv_infoORDER BY cookieid,createtime;--需求：统计每个用户pv数最多的前3分之1天。--理解：将数据根据cookieid分根据pv倒序排序排序之后分为3个部分取第一部分SELECt * from(SELECt cookieid, createtime, pv, NTILE(3) OVER(PARTITION BY cookieid ORDER BY pv DESC) AS rn FROM website_pv_info) tmp where rn =1;

04_Apache Hive 窗口函数 lag、lead函数

--LAG 用于统计窗口内往上第n行值SELECt cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LAG(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS last_1_time, LAG(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS last_2_timeFROM website_url_info;--LEAD 用于统计窗口内往下第n行值SELECt cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LEAD(createtime,1,'1970-01-01 00:00:00') OVER(PARTITION BY cookieid ORDER BY createtime) AS next_1_time, LEAD(createtime,2) OVER(PARTITION BY cookieid ORDER BY createtime) AS next_2_timeFROM website_url_info;--FIRST_VALUE 取分组内排序后，截止到当前行，第一个值SELECt cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, FIRST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS first1FROM website_url_info;--LAST_VALUE 取分组内排序后，截止到当前行，最后一个值SELECt cookieid, createtime, url, ROW_NUMBER() OVER(PARTITION BY cookieid ORDER BY createtime) AS rn, LAST_VALUE(url) OVER(PARTITION BY cookieid ORDER BY createtime) AS last1FROM website_url_info;

05_Apache Hive 文件存储格式（text、ORC、parquet）

列式存储、行式存储

数据最终在文件中底层以什么样的形成保存。

Hive中表的数据存储格式，不是只支持text文本格式，还支持其他很多格式。

hive表的文件格式是如何指定的呢？建表的时候通过STORED AS 语法指定。如果没有指定默认都是textfile。

Hive中主流的几种文件格式。

textfile 文件格式

ORC、Parquet 列式存储格式。

都是列式存储格式，底层是以二进制形式存储。数据存储效率极高，对于查询贼方便。二进制意味着肉眼无法直接解析，hive可以自解析。

栗子

分别使用3种不同格式存储数据，去HDFS上查看底层文件存储空间的差异。

--1、创建表，存储数据格式为TEXTFILEcreate table log_text (track_time string,url string,session_id string,referer string,ip string,end_user_id string,city_id string)ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'STORED AS TEXTFILE; --如果不写stored as textfile 默认就是textfile--加载数据load data local inpath '/root/hivedata/log.data' into table log_text;--2、创建表，存储数据格式为ORCcreate table log_orc(track_time string,url string,session_id string,referer string,ip string,end_user_id string,city_id string)ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'STORED AS orc ;--向表中插入数据思考为什么不能使用load命令加载？因为load是纯复制移动操作不会调整文件格式。insert into table log_orc select * from log_text;--3、创建表，存储数据格式为parquetcreate table log_parquet(track_time string,url string,session_id string,referer string,ip string,end_user_id string,city_id string)ROW FORMAT DELIMITED FIELDS TERMINATED BY 't'STORED AS PARQUET ;--向表中插入数据 insert into table log_parquet select * from log_text ;

06_Apache Hive 数据压缩和文件格式搭配（ORC+snappy）

Hive的默认执行引擎是MapReduce，因此通常所说的Hive压缩指的是MapReduce的压缩。

压缩是指通过算法对数据进行重新编排，降低存储空间。无损压缩。

MapReduce可以在两个阶段进行数据压缩

map的输出

减少shuffle的数据量提高shuffle时网络IO的效率 reduce的输出

减少输出文件的大小降低磁盘的存储空间

压缩的弊端

浪费时间消耗CPU、内存某些优秀的压缩算法需要钱

压缩的算法（推荐使用snappy）

Snappyorg.apache.hadoop.io.compress.SnappyCodec

Hive中压缩的设置：注意本质还是指的是MapReduce的压缩

--设置Hive的中间压缩也就是map的输出压缩1）开启 hive 中间传输数据压缩功能set hive.exec.compress.intermediate=true;2）开启 mapreduce 中 map 输出压缩功能set mapreduce.map.output.compress=true;3）设置 mapreduce 中 map 输出数据的压缩方式set mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.SnappyCodec;--设置Hive的最终输出压缩，也就是Reduce输出压缩1）开启 hive 最终输出数据压缩功能set hive.exec.compress.output=true;2）开启 mapreduce 最终输出数据压缩set mapreduce.output.fileoutputformat.compress=true;3）设置 mapreduce 最终数据输出压缩方式set mapreduce.output.fileoutputformat.compress.codec =org.apache.hadoop.io.compress.SnappyCodec;4）设置 mapreduce 最终数据输出压缩为块压缩还可以指定RECORDset mapreduce.output.fileoutputformat.compress.type=BLOCK;

--设置完毕之后只有当HiveSQL底层通过MapReduce程序执行才会涉及压缩。--已有普通格式的表select * from student_hdfs;--ctas语句create table student_snappy as select * from student_hdfs ;

在实际开发中，可以根据需求选择不同的文件格式并且搭配不同的压缩算法。可以得到更好的存储效果。

--不指定压缩格式代表什么呢？--orc 存储文件默认采用ZLIB 压缩。比 snappy 压缩的小STORED AS orc; --2.78M--以ORC格式存储不压缩STORED AS orc tblproperties ("orc.compress"="NONE"); --7.69M--以ORC格式存储使用snappy压缩STORED AS orc tblproperties ("orc.compress"="SNAPPY"); --3.78M

07_Apache Hive 通用调优 fetch抓取机制、MR本地模式 7.1 Fetch抓取机制

功能：在执行sql的时候，能不走MapReduce程序处理就尽量不走MapReduce程序处理。

尽量直接去操作数据文件。

官方描述

Currently the query should be single sourced not having any subquery and should not have any aggregations or distincts (which incur RS – ReduceSinkOperator, requiring a MapReduce task), lateral views and joins.应该是单一数据源没有子查询没有聚合操作没有去重操作没有侧视图没有join

设置： hive.fetch.task.conversion= more。

--在下述3种情况下 sql不走mr程序--全局查找select * from student;--字段查找列裁剪select num,name from student;--limit 限制查找select num,name from student limit 2;

7.2 mapreduce本地模式

功能：如果非要执行MapReduce程序，能够本地执行的，尽量不提交yarn上执行。

默认是关闭的。意味着只要走MapReduce就提交yarn执行。

mapreduce.framework.name = local 本地模式mapreduce.framework.name = yarn 集群模式

Hive提供了一个参数，自动切换MapReduce程序为本地模式，如果不满足条件，就执行yarn模式。

set hive.exec.mode.local.auto = true; --3个条件必须都满足自动切换本地模式The total input size of the job is lower than: hive.exec.mode.local.auto.inputbytes.max (128MB by default) --数据量小于128MThe total number of map-tasks is less than: hive.exec.mode.local.auto.tasks.max (4 by default) --maptask个数少于4个The total number of reduce tasks required is 1 or 0、 --reducetask个数是0 或者 1

切换Hive的执行引擎

WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions、Consider using a different execution engine (i.e、spark, tez) or using Hive 1.X releases.如果针对Hive的调优依然无法满足你的需求还是效率低，尝试使用spark计算引擎或者Tez.

08_Apache Hive 通用调优 join优化

底层还是MapReduce的join优化

8.1 map join

适合于小表join大表或者小表Join小表

#是否开启自动转为mapjoin 在满足条件的情况下默认truehive.auto.convert.join=trueHive老版本#如果参与的一个表大小满足条件转换为map joinhive.mapjoin.smalltable.filesize=25000000 Hive2.0之后版本#是否启用基于输入文件的大小，将reduce join转化为Map join的优化机制。hive.auto.convert.join.noconditionaltask=true #如果上述参数为true，假设参与join的表(或分区)有N个，并且有N-1个表(或分区)的大小总和小于下述参数指定的值，那么会直接转为Map join。hive.auto.convert.join.noconditionaltask.size=10000000 默认10M

8.2 reduce join

适合于大表Join大表

如果不满足map端join，那么就只能走reduce端join了，在hive中也把reduce端join叫做common join.

8.3 bucket join

适合于大表Join大表

方式1：Bucktet Map Join

将表进行分桶，每次join时分桶参与而不是整张表参与，相当于小表join了。

语法: clustered by colName(参与join的字段)参数: set hive.optimize.bucketmapjoin = true要求: 分桶字段 = Join字段，分桶的个数相等或者成倍数，必须是在map join中

方式2：Sort Merge Bucket Join（SMB）

基于有序的数据Join语法:clustered by colName sorted by (colName)参数set hive.optimize.bucketmapjoin = true;set hive.auto.convert.sortmerge.join=true; set hive.optimize.bucketmapjoin.sortedmerge = true; set hive.auto.convert.sortmerge.join.noconditionaltask=true; 要求: 分桶字段 = Join字段 = 排序字段,分桶的个数相等或者成倍数

09_Apache Hive 通用调优数据倾斜优化 9.1 group by数据倾斜

方案一：开启Map端聚合

hive.map.aggr=true;#是否在Hive Group By 查询中使用map端聚合。#这个设置可以将顶层的部分聚合操作放在Map阶段执行，从而减轻清洗阶段数据传输和Reduce阶段的执行时间，提升总体性能。但是指标不治本。

方案二：实现随机分区

select * from table distribute by rand();

方案三：数据倾斜时自动负载均衡

hive.groupby.skewindata=true; #开启该参数以后，当前程序会自动通过两个MapReduce来运行 #第一个MapReduce自动进行随机分布到Reducer中，每个Reducer做部分聚合操作，输出结果 #第二个MapReduce将上一步聚合的结果再按照业务（group by key）进行处理，保证相同的分布到一起，最终聚合得到结果

9.2 join数据倾斜

方案一：提前过滤（比如分区裁剪），将大表数据变成小表数据，争取实现Map Join

方案二：使用Bucket Join

方案三：使用Skew Join

#Skew Join原理：将Map Join和Reduce Join进行合并使用。如果某个值出现了数据倾斜，就会将产生数据倾斜的数据单独使用Map Join来实现，如果不是倾斜的，则按正常的reduce端join流程进行。其他没有产生数据倾斜的数据由Reduce Join来实现，这样就避免了Reduce Join中产生数据倾斜的问题最终将Map Join的结果和Reduce Join的结果进行Union合并#开启运行过程中skewjoinset hive.optimize.skewjoin=true;#如果这个key的出现的次数超过这个范围set hive.skewjoin.key=100000;#在编译时判断是否会产生数据倾斜set hive.optimize.skewjoin.compiletime=true;set hive.optimize.union.remove=true;#如果Hive的底层走的是MapReduce，必须开启这个属性，才能实现不合并set mapreduce.input.fileinputformat.input.dir.recursive=true;

10_Apache Hive 通用调优 MR程序task个数调整 10.1 maptask个数

如果是在MapReduce中 maptask是通过逻辑切片机制决定的。

但是在hive中，影响的因素很多。比如逻辑切片机制，文件是否压缩、压缩之后是否支持切割。

因此在Hive中，调整MapTask的个数，直接去HDFS调整文件的大小和个数，效率较高。

如果小文件多，就进行小文件的合并合并的大小最好=block size如果大文件多，就调整blocl size

10.2 reducetask个数

如果在MapReduce中，通过代码可以直接指定 job.setNumReduceTasks(N)

在Hive中，reducetask个数受以下几个条件控制的

（1）每个 Reduce 处理的数据量默认是 256MBhive.exec.reducers.bytes.per.reducer=256000000（2）每个任务最大的 reduce 数，默认为 1009hive.exec.reducsers.max=1009（3）mapreduce.job.reduces该值默认为-1，由 hive 自己根据任务情况进行判断。--如果用户用户不设置 hive将会根据数据量或者sql需求自己评估reducetask个数。--用户可以自己通过参数设置reducetask的个数 set mapreduce.job.reduces = N--用户设置的不一定生效，如果用户设置的和sql执行逻辑有冲突，比如order by，在sql编译期间，hive又会将reducetask设置为合理的个数。 Number of reduce tasks determined at compile time: 1

11_Apache Hive 通用调优执行计划

通过执行计划可以看出hive接下来是如何打算执行这条sql的。

语法格式：explain + sql语句

栗子

explain select * from student;+----------------------------------------------------+| Explain |+----------------------------------------------------+| STAGE DEPENDENCIES: || Stage-0 is a root stage || || STAGE PLANS: || Stage: Stage-0 || Fetch Operator || limit: -1 || Processor Tree: || TableScan || alias: student || Statistics: Num rows: 1 Data size: 5260 Basic stats: COMPLETE Column stats: NONE || Select Operator || expressions: num (type: int), name (type: string), sex (type: string), age (type: int), dept (type: string) || outputColumnNames: _col0, _col1, _col2, _col3, _col4 || Statistics: Num rows: 1 Data size: 5260 Basic stats: COMPLETE Column stats: NONE || ListSink || |+----------------------------------------------------+

12_Apache Hive 通用调优并行机制、推测执行机制 12.1 并行执行机制

如果hivesql的底层某些stage阶段可以并行执行，就可以提高执行效率。

前提是stage之间没有依赖并行的弊端是瞬时服务器压力变大。

参数

set hive.exec.parallel=true; --是否并行执行作业。适用于可以并行运行的 MapReduce 作业，例如在多次插入期间移动文件以插入目标set hive.exec.parallel.thread.number=16; --最多可以并行执行多少个作业。默认为8。

12.2 Hive的严格模式

注意。不要和动态分区的严格模式搞混淆。

这里的严格模式指的是开启之后 hive会禁止一些用户都影响不到的错误包括效率低下的操作，不允许运行一些有风险的查询。

设置

set hive.mapred.mode = strict --默认是严格模式 nonstrict

解释

1、如果是分区表，没有where进行分区裁剪禁止执行2、order by语句必须+limit限制

12.3 推测执行机制

MapReduce中task的一个机制。功能：

一个job底层可能有多个task执行，如果某些拖后腿的task执行慢，可能会导致最终job失败。所谓的推测执行机制就是通过算法找出拖后腿的task,为其启动备份的task。两个task同时处理一份数据，谁先处理完，谁的结果作为最终结果。推测执行机制默认是开启的，但是在企业生产环境中建议关闭。

上一篇：Hadoop3.x入门-搭建3节点分布式集群

下一篇：字节实习接触到的数据库简单总结