语义语法详解
Windowing functionsThe OVER clauseAnalytics functions其它注意事项 操作示例
Window functions
数据准备SQL撰写 The OVER clause
数据准备SQL撰写 Analytics functions
数据准备SQL撰写 语义语法详解 Windowing functions
LEAD
LEAD(col, n, Default)用于统计窗口内往下第N行的值,第一个参数为列名、第二个参数为往下第N行(可选,默认为1,不可为负数)、第三个参数为默认值(当往下第N行为NULL时候,取默认值,如不指定则为NULL)。
The number of rows to lead can optionally be specified、If the number of rows to lead is not specified, the lead is one row.
Returns null when the lead for the current row extends beyond the end of the window.LAG
LAG(col, n, Default)用于统计窗口内往上第N行的值,第一个参数为列名、第二个参数为往上第N行(可选,默认为1,不可为负数)、第三个参数为默认值(当往上第N行为NULL时候,取默认值,如不指定则为NULL)。
The number of rows to lag can optionally be specified、If the number of rows to lag is not specified, the lag is one row.
Returns null when the lag for the current row extends before the beginning of the window.FIRST_VALUE
FIRST_VALUE(col, boolean),取分组内排序后的截止当前行,第一行的值。第一个参数是需要获取的列,第二个参数(可选)默认是false(如果设置为true,则跳过空值)。
This takes at most two parameters、The first parameter is the column for which you want the first value, the second (optional) parameter must be a boolean which is false by default、If set to true it skips null values.LAST_VALUE
LAST_VALUE(col, boolean),取分组内排序后的截止当前行,最后一行的值。第一个参数是需要获取的列,第二个参数(可选)默认是false(如果设置为true,则跳过空值)。
This takes at most two parameters、The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default、If set to true skips null values、The OVER clause
OVER with standard aggregates
COUNTSUMMINMAXAVG OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatypeOVER with PARTITION BY and ORDER BY with one or more partitioning and /or ordering columns of any datatype
OVER with a window specification、Windows can be defined separately in a WINDOW clause.Window specifications support the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
Analytics functionsRANKROW_NUMBERDENSE_RANKCUME_DISTPERCENT_RANKNTILE 其它注意事项
理解WINDOW子句,灵活控制窗口
CURRENT ROW:指当前行
[num] PRECEDING:当前行,往前num行(不包括当前行)
UNBOUNDED PRECEDING:分组内的第一行
[num] FOLLOWING:当前行,往后num行(不包括当前行)
UNBOUNDED FOLLOWING:分组内的最后一行
ROWS和RANGE
Row是指物理窗口,从行数上进行控制尺寸
Range是指在列值窗口,从列值上进行控制尺寸
ORDER BY缺省以及ORDER BY、WINDOW clause缺省
When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
Distinct support in Hive2.1.0 and later
Distinct is supported for aggregation functions including SUM,COUNT and AVG, which aggregate over the distinct values within each partition、Current implementation has the limitation that no ORDER BY or window specification can be supported in the partitioning clause for performance reason.The supported syntax is as follows.
COUNT(DISTINCT a) OVER (PARTITION BY c)
ORDER BY and window specification is supported for distinct in Hive 2.2.0.An example is as follows.
COUNT(DISTINCT a) OVER (PARTITION BY c ORDER BY d ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING)
Aggregate functions in OVER clause support in Hive 2.1.0 and later
Support to reference aggregate functions within the OVER clause has been added.For instance, currently we can use the SUM aggregate function within the OVER clause as follows.
SELECt rank() OVER (ORDER BY sum(b)) FROM T GROUP BY a;
操作示例 Window functions 数据准备username1,2022-02-10 10:00:02,url1username1,2022-02-10 10:00:00,url2username1,2022-02-10 10:03:04,url3username1,2022-02-10 10:50:05,url4username1,2022-02-10 11:00:00,url5username1,2022-02-10 10:10:00,url4username1,2022-02-10 10:50:01,url5username2,2022-02-10 10:00:02,url22username2,2022-02-10 10:00:00,url11username2,2022-02-10 10:03:04,url33username2,2022-02-10 10:50:05,url66username2,2022-02-10 11:00:00,url55username2,2022-02-10 10:10:00,url44username2,2022-02-10 10:50:01,url22
创建Hive表
-- 创建表语句create table cookie_1 (cookieid string, createtime string, url string) row format delimited fields terminated by ',';-- 加载数据到该表load data local inpath "/data/tian/cookie_1.txt" into table cookie_1;
SQL撰写LAG
select cookieid,createtime,url,row_number() over (partition by cookieid order by createtime) as rn,lag(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime) as lag_time_1,lag(createtime, 2) over (partition by cookieid order by createtime) as lag_time_2from cookie_1;
查询结果
结果说明
lag_time_1 :lag(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime)第一行,向上1行createtime为NULL,所以这里取默认值:1970-01-01 00:00:00第二行,向上1行createtime为2022-02-10 10:00:00,所以lag_time_1这里取该值:2022-02-10 10:00:00第三行,向上1行createtime为2022-02-10 10:00:02,所以lag_time_1这里取该值:2022-02-10 10:00:02........以上操作是当前窗口内的操作lag_time_2 :lag(createtime, 2) over (partition by cookieid order by createtime)第一行,向上2行createtime为NULL,这里default没有指定,所以lag_time_2为NULL第二行,向上2行createtime为NULL,这里default没有指定,所以lag_time_2为NULL第三行,向上2行createtime为2022-02-10 10:00:00,所以lag_time_2这里取该值:2022-02-10 10:00:00.........以上操作是当前窗口内的操作
LEAD
select cookieid,createtime,url,row_number() over (partition by cookieid order by createtime) as rn,lead(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime) as lead_time_1,lead(createtime, 2) over (partition by cookieid order by createtime) as lead_time_2from cookie_1;
查询结果
结果说明
lead_time_1 :lead(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime)第一行,向下1行createtime为2022-02-10 10:00:02,所以该行lead_time_1取该值:2022-02-10 10:00:02第二行,向下1行createtime为2022-02-10 10:03:04,所以该行lead_time_1取该值:2022-02-10 10:03:04第三行,向下1行createtime为2022-02-10 10:10:00,所以该行lead_time_1取该值:2022-02-10 10:10:00........第七行,向下1行createtime为NULL,default为1970-01-01 00:00:00,所以该行lead_time_1取该值:1970-01-01 00:00:00lead_time_2 :lead(createtime, 2) over (partition by cookieid order by createtime)........同理如上
FIRST_VALUE
select cookieid, createtime, url, row_number() over (partition by cookieid order by createtime) as rn, first_value(url) over (partition by cookieid order by createtime) as first1 from cookie_1;
查询结果
结果说明
first_value(url) over (partition by cookieid order by createtime) as first1取分组内、排序后的第一个值,按cookieid进行分组,分组内按createtime降序。
LAST_VALUE
select cookieid, createtime, url, row_number() over (partition by cookieid order by createtime) as rn, last_value(url) over (partition by cookieid order by createtime) as last1 from cookie_1;
查询结果
结果说明
last_value(url) over (partition by cookieid order by createtime) as last1 取分组内的、排序后的,截止到当前行,取最后一个值。
The OVER clause 数据准备username1,2022-02-10,1username1,2022-02-11,5username1,2022-02-12,7username1,2022-02-13,3username1,2022-02-14,2username1,2022-02-15,4username1,2022-02-16,4
创建Hive表
-- 创建表语句create table cookie_2 (cookieid string, createtime string, pv int) row format delimited fields terminated by ',';-- 加载数据到该表load data local inpath "/data/tian/cookie_2.txt" into table cookie_2;
SQL撰写SUM
select cookieid, createtime, pv, sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, sum(pv) over (partition by cookieid order by createtime) as pv2, sum(pv) over (partition by cookieid) as pv3, sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 2 following) as pv5, sum(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 from cookie_2;
查询结果
结果说明
pv1:sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row)分组依据:以cookieid的值进行分组,partition by cookieid排序规则:按照createtime进行排序,order by createtime,默认为升序asc;pv1为升序进行累加pv的值;窗口内排序顺序为:1行 username12022-02-1012行 username12022-02-1153行 username12022-02-1274行 username12022-02-1335行 username12022-02-1426行 username12022-02-1547行 username12022-02-164窗口规则:between unbounded preceding and current row,从窗口内排序后的第一行开始截止到当前行 上面查询结果的第一行的pv1=sum(1行-7行的pv值),第二行的pv1=sum(1行-6行的pv值),第三行的pv1=sum(1行-5行的pv值),...以此类推。 pv2:sum(pv) over (partition by cookieid order by createtime)结果与pv1一致,省略window clause子句,表示从窗口内排序的第一行到当前行(rows between unbounded preceding and current row)pv2的值与pv1的值解析一致。pv3:(partition by cookieid),获取所有行的pv值累加pv4:(partition by cookieid order by createtime rows between 3 preceding and current row)分组依据:与pv1一致排序规则:与pv1一致窗口规则:当前窗口排序后的数据,计算范围:[当前行,往上3行],闭区间;如2022-02-16这行,pv4=sum(7行、6行、5行、4行的pv值)=13pv5:(partition by cookieid order by createtime rows between 3 preceding and 2 following)分组依据:与pv1一致排序规则:与pv1一致窗口规则:当前窗口排序后的数据,计算范围:[当前行往上3行,当前行往下2行],闭区间;如2022-02-15这行,pv5=sum(7行、6行、5行、4行、3行的pv值)=20,6行为当前行,7行为往下的2行(这里只有一行),5行、4行、3行为往上的3行。pv6:(partition by cookieid order by createtime rows between current row and unbounded following)分组依据:与pv1一致排序规则:与pv1一致窗口规则:当前窗口排序后的数据,当前行截止到最后一行,计算范围[当前行,最后一行],闭区间;如2022-02-14这行,pv6=sum(5行、6行、7行的pv值)=10此段中,关于行数,1行、2行、3行、4行、5行、6行、7行皆是窗口内排序的行号,pv1注释中的排序行。
这里有另外一个例子供大家仔细品味,注意观察pv1:
select cookieid, createtime, pv, SUM(pv) over (partition by cookieid order by createtime desc rows between unbounded preceding and current row) as pv1, SUM(pv) over (partition by cookieid order by createtime) as pv2, SUM(pv) over (partition by cookieid) as pv3, SUM(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, SUM(pv) over (partition by cookieid order by createtime rows between 3 preceding and 2 following) as pv5, SUM(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6 from cookie_2;
AVG
select cookieid, createtime, pv, avg(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, avg(pv) over (partition by cookieid order by createtime) as pv2, avg(pv) over (partition by cookieid) as pv3, avg(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, avg(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, avg(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6from cookie_2;
查询结果
结果说明
avg(pv)=sum(pv)/行数第一行2022-02-16,pv1=sum(pv)/7=26/7第二行2022-02-15,pv1=sum(pv)/6=22/6第三行2022-02-14,pv1=sum(pv)/5=18/5.......
MIN
select cookieid, createtime, pv, min(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, min(pv) over (partition by cookieid order by createtime) as pv2, min(pv) over (partition by cookieid) as pv3, min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, min(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, min(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6from cookie_2;
MAX
select cookieid, createtime, pv, max(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, max(pv) over (partition by cookieid order by createtime) as pv2, max(pv) over (partition by cookieid) as pv3, max(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, max(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, max(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6from cookie_2;
COUNT
select cookieid, createtime, pv, count(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1, count(pv) over (partition by cookieid order by createtime) as pv2, count(pv) over (partition by cookieid) as pv3, count(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4, count(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5, count(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6from cookie_2;
Analytics functions 数据准备cookie1,2015-04-10,1cookie1,2015-04-11,5cookie1,2015-04-12,7cookie1,2015-04-13,3cookie1,2015-04-14,2cookie1,2015-04-15,4cookie1,2015-04-16,4cookie2,2015-04-10,2cookie2,2015-04-11,3cookie2,2015-04-12,5cookie2,2015-04-13,6cookie2,2015-04-14,3cookie2,2015-04-15,9cookie2,2015-04-16,7
创建Hive表
# 创建hive表create table cookie_3(cookieid string, createtime string, pv int) row format delimited fields terminated by ',';# 加载数据load data local inpath "/data/tian/cookie_3.txt" into table cookie_3;
SQL撰写ROW_NUMBER
select cookieid,createtime,pv,row_number() over (partition by cookieid order by createtime) as num from cookie_3;
查询结果
结果说明
num这一列是分组内的排序,默认从1开始。
RANK、DENSE_RANK
select cookieid, createtime, pv, rank() over (partition by cookieid order by pv desc) as rn1, dense_rank() over (partition by cookieid order by pv desc) as rn2, row_number() over (partition by cookieid order by pv desc) as rn3from cookie_3;
查询结果
结果说明
RANK(),生成组内中排序序号,排名相等的序号会一样,但是会略过下一个序号为空位。如上列中的3行、4行,rn1都为3,排列第5行时略过了4,直接为5.DENSE_RANK(),这个和RANK正好相反,不进行略过。ROW_NUMBER(),按顺序排列序号,不留空位。
NTILE
select cookieid, createtime, pv, ntile(2) over (partition by cookieid order by createtime) as rn1, --分组内将数据分成2片 ntile(3) over (partition by cookieid order by createtime) as rn2, --分组内将数据分成2片 ntile(4) over (order by createtime) as rn3 --将所有数据分成4片from cookie_3 order by cookieid,createtime;
查询结果
结果分析
ntile(num),对数据进行切片,返回切片值。如果切片不均匀,默认增加第一个切片的分布。ntile不支持window clause子句,(rows | range) between .、preceding and .、following如,统计一个用户访问最多的天,取TOP 3selectcookieid,createtime,pv,ntile(3) over (partition by cookieid order by pv desc) as rnfrom cookie_3;针对以上SQL查询的结果进行筛选rn=1的数据即可。
CUME_DIST
select cookieid,createtime,pv,cume_dist() over (partition by cookieid order by pv) as cume from cookie_3;
查询结果
结果分析
cume_dist:小于等于当前值的行数/分组内总行数小于等于当前访问量的天数,占窗口内总天数的比例。访问量为1的天,占总天数量的比例。
PERCENT_RANK
select cookieid,createtime,pv,rank() over (order by pv) rk1,percent_rank() over (order by pv) as p1,rank() over (partition by cookieid order by pv),percent_rank() over (partition by cookieid order by pv) as p2 from cookie_3;
查询结果
结果分析
PERCENT_RANK :分组内当前行的RANK值-1/分组内总行数-1
以上为Hive的窗口分析函数,特此整理记录。
参考文档:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WINDOWclause
https://www.cnblogs.com/ZackSun/p/9713435.html