HIVE总结

时间：2023-07-29

一：hive作用

Hive是基于Hadoop的一个数据仓库工具，可以将结构化的数据文件映射为一张表，并提供类SQL查询功能。

Hive本质：将HQL转化成MapReduce程序

（1）Hive处理的数据存储在HDFS

（2）Hive分析数据底层的实现是MapReduce

（3）执行程序运行在Yarn上

1.1：Hive优缺点

1>Hive支持用户自定义函数，用户可以根据自己的需求来实现自己的函数;

缺点：

1）Hive的HQL表达能力有限

（1）迭代式算法无法表达

（2）数据挖掘方面不擅长，由于MapReduce数据处理流程的限制，效率更高的算法却无法实现。

2）Hive的效率比较低 (Hive调优比较困难，粒度较粗)

1.2:Hive和数据库比较

从结构上来看，Hive 和数据库除了拥有类似的查询语言，再无类似之处

1.3:数据更新

Hive是针对数据仓库应用设计的，而数据仓库的内容是读多写少的。因此，Hive中不建议对数据的改写，所有的数据都是在加载的时候确定好的。

1.4:执行延迟

Hive 在查询数据的时候，由于没有索引，需要扫描整个表，因此延迟较高。

二：HIve使用 2.1：本地模式

set hive.exec.mode.local.auto=true; //开启本地mr
注意：当切换成本地模式，可能回报一个内存溢出的错误，因为本地模式默认只给我们分配 256M 内存。可以通过修改 hive-env.sh.template

2.2：一般操作 2.3：客户端查看hdfs

三:Hive中的数据类型

基本数据类型和集合数据类型

四：Hive操作

load data local inpath '/opt/module/hive/datas/test.txt' into table 表名;

Hdfs上创建一个库放到指定路径下

4.1：外部表

五：Hive操作表 5.1：上传文件到HDFS下的表

hadoop fs -put student.txt /user/hive/warehouse/student -- 查看当前表的详情 desc student; desc formatted student; -- 查看当前表的详细信息

五：DML数据操作

5.2：通过查询语句向表中插入数据（Insert） 5.3：导入数据的方式

-- 3、导入数据 --- 本地不覆盖导入 load data local inpath '/opt/module/hive-3.1.2/datas/ztudent.txt' into table student; --- 本地覆盖导入 load data local inpath '/opt/module/hive-3.1.2/datas/student1.txt' overwrite into table student; --- HDFS导入 load data inpath '/student/ztudent.txt' into table student; --- 注意：从本地导入数据属于cp行为，如果从HDFS导入属于mv行为 1.3 通过查询语句向表中插入数据（Insert） -- 1、建表 create table stu1(id int, name string) row format delimited fields terminated by 't'; -- 2、基本插入数据 insert into table stu1 values(1,'wangwu'),(2,'zhaoliu'); -- 3、基本模式插入（根据单张表查询结果） insert into table stu1 select id, name from student; 1.4 查询语句中创建表并加载数据（As Select） create table stu2 as select id, name from student; 1.5 创建表时通过Location指定加载数据路径 create table stu3(id int, name string) row format delimited fields terminated by 't' location '/student';

5.4：Insert数据导出(通常用于数据迁移)

2.1 Insert导出 --1、将查询的结果导出到本地 insert overwrite local directory '/opt/module/hive-3.1.2/datas/export_data' select * from student; --2、将查询的结果导出到本地(对数据格式约束) insert overwrite local directory '/opt/module/hive-3.1.2/datas/export_data' row format delimited fields terminated by "t" select * from student; --3、将查询的结果导出到HDFS上(没有local) insert overwrite directory '/export_data' row format delimited fields terminated by "t" select * from student; 2.2 Export导出到HDFS上 & import数据到指定Hive表中 -- export导出到HDFS export table student to '/export/student'; -- import 导入Hive的表中(导数据的过程中会根据元数据进行建表) import table bucunzai from '/export/student'; -- 注意 Export & import 使用场景：数据迁移的时候常用！！！ 2.3 清除表中数据（Truncate）注意：当清楚数据的表是内部表的时候才允许清楚，外部表不允许清除数据 truncate table student;

六：Hive函数 6.1：常用函数

HiveSQL语法有限制

6.2：SQL的执行顺序案例一：

Having语句 (只能和 group by 结合使用)
-- 计算每个部门最高薪水的那个哥们，并且薪资大于等于3000 使用group by，select后面的字段只能是分组字段和要求的值，如平均工资；

select t1.deptno,t2.ename,t1.max_salfrom ( select deptno, max(sal) as max_sal from emp group by deptno having max_sal >= 3000) t1 join emp t2on t1.deptno = t2.deptnowhere t1.max_sal = t2.sal;

七：JOIN

(emp 可以看成左表 dept可以看成右表表连接的形式与key有关 ) emp 和 dept的交集数据： join emp的全部数据和 dept匹配到的数据： left join dept 的全部数据和 emp 匹配到的数据： right join 获取emp独有的数据 left join where d.deptno is null 获取dept独有的数据 right join where e.deptno is null 获取 emp 和dept 的全部数据(全连接) （ union 传统的方式 full join 进行全连接(Hive支持，传统SQL不支持，oracle中支持)）获取emp和dept各自独有的数据 full join ( where e.deptno is null or d.deptno is null )

full join 进行全连接(Hive支持，传统SQL不支持，oracle中支持)

4.1 emp 和 dept的交集数据select e.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e joindept d on e.deptno = d.deptno; +----------+----------+-----------+-----------+-------------+| e.empno | e.ename | e.deptno | d.deptno | d.dname |+----------+----------+-----------+-----------+-------------+| 7369 | SMITH | 20 | 20 | RESEARCH || 7499 | ALLEN | 30 | 30 | SALES || 7521 | WARD | 30 | 30 | SALES || 7566 | JonES | 20 | 20 | RESEARCH || 7654 | MARTIN | 30 | 30 | SALES || 7698 | BLAKE | 30 | 30 | SALES || 7782 | CLARK | 10 | 10 | ACCOUNTING || 7788 | SCOTT | 20 | 20 | RESEARCH || 7839 | KING | 10 | 10 | ACCOUNTING || 7844 | TURNER | 30 | 30 | SALES || 7876 | ADAMS | 20 | 20 | RESEARCH || 7900 | JAMES | 30 | 30 | SALES || 7902 | FORD | 20 | 20 | RESEARCH || 7934 | MILLER | 10 | 10 | ACCOUNTING |+----------+----------+-----------+-----------+-------------+4.2 emp的全部数据和 dept匹配到的数据selecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e left join dept d on e.deptno = d.deptno;+----------+----------+-----------+-----------+-------------+| e.empno | e.ename | e.deptno | d.deptno | d.dname |+----------+----------+-----------+-----------+-------------+| 7369 | SMITH | 20 | 20 | RESEARCH || 7499 | ALLEN | 30 | 30 | SALES || 7521 | WARD | 30 | 30 | SALES || 7566 | JonES | 20 | 20 | RESEARCH || 7654 | MARTIN | 30 | 30 | SALES || 7698 | BLAKE | 30 | 30 | SALES || 7782 | CLARK | 10 | 10 | ACCOUNTING || 7788 | SCOTT | 20 | 20 | RESEARCH || 7839 | KING | 10 | 10 | ACCOUNTING || 7844 | TURNER | 30 | 30 | SALES || 7876 | ADAMS | 20 | 20 | RESEARCH || 7900 | JAMES | 30 | 30 | SALES || 7902 | FORD | 20 | 20 | RESEARCH || 7934 | MILLER | 10 | 10 | ACCOUNTING || 6666 | MILLER | 60 | NULL | NULL |+----------+----------+-----------+-----------+-------------+4.3 dept 的全部数据和 emp 匹配到的数据selecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e right join dept d on e.deptno = d.deptno;+----------+----------+-----------+-----------+-------------+| e.empno | e.ename | e.deptno | d.deptno | d.dname |+----------+----------+-----------+-----------+-------------+| 7782 | CLARK | 10 | 10 | ACCOUNTING || 7839 | KING | 10 | 10 | ACCOUNTING || 7934 | MILLER | 10 | 10 | ACCOUNTING || 7369 | SMITH | 20 | 20 | RESEARCH || 7566 | JonES | 20 | 20 | RESEARCH || 7788 | SCOTT | 20 | 20 | RESEARCH || 7876 | ADAMS | 20 | 20 | RESEARCH || 7902 | FORD | 20 | 20 | RESEARCH || 7499 | ALLEN | 30 | 30 | SALES || 7521 | WARD | 30 | 30 | SALES || 7654 | MARTIN | 30 | 30 | SALES || 7698 | BLAKE | 30 | 30 | SALES || 7844 | TURNER | 30 | 30 | SALES || 7900 | JAMES | 30 | 30 | SALES || NULL | NULL | NULL | 40 | OPERATIONS |+----------+----------+-----------+-----------+-------------+4.4 获取emp独有的数据selecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e left join dept d on e.deptno = d.deptnowhere d.deptno is null;+----------+----------+-----------+-----------+----------+| e.empno | e.ename | e.deptno | d.deptno | d.dname |+----------+----------+-----------+-----------+----------+| 6666 | MILLER | 60 | NULL | NULL |+----------+----------+-----------+-----------+----------+4.5 获取dept独有的数据selecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e right join dept d on e.deptno = d.deptnowhere e.deptno is null;+----------+----------+-----------+-----------+-------------+| e.empno | e.ename | e.deptno | d.deptno | d.dname |+----------+----------+-----------+-----------+-------------+| NULL | NULL | NULL | 40 | OPERATIONS |+----------+----------+-----------+-----------+-------------+4.6 获取 emp 和dept 的全部数据(全连接)-- 实现方式一： union 传统的方式selecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e left join dept d on e.deptno = d.deptnounionselecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e right join dept d on e.deptno = d.deptno;+------------+------------+-------------+------------+| _u1.empno | _u1.ename | _u1.deptno | _u1.dname |+------------+------------+-------------+------------+| NULL | NULL | NULL | 40 || 6666 | MILLER | 60 | NULL || 7369 | SMITH | 20 | 20 || 7499 | ALLEN | 30 | 30 || 7521 | WARD | 30 | 30 || 7566 | JonES | 20 | 20 || 7654 | MARTIN | 30 | 30 || 7698 | BLAKE | 30 | 30 || 7782 | CLARK | 10 | 10 || 7788 | SCOTT | 20 | 20 || 7839 | KING | 10 | 10 || 7844 | TURNER | 30 | 30 || 7876 | ADAMS | 20 | 20 || 7900 | JAMES | 30 | 30 || 7902 | FORD | 20 | 20 || 7934 | MILLER | 10 | 10 |+------------+------------+-------------+------------+-- 方式二： full join 进行全连接(Hive支持，传统SQL不支持，oracle中支持)selecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e full join dept d on e.deptno = d.deptno;4.7 获取emp和dept各自独有的数据selecte.empno,e.ename,e.deptno,d.deptno,d.dnamefrom emp e full join dept d on e.deptno = d.deptnowhere e.deptno is null or d.deptno is null;+----------+----------+-----------+-----------+-------------+| e.empno | e.ename | e.deptno | d.deptno | d.dname |+----------+----------+-----------+-----------+-------------+| NULL | NULL | NULL | 40 | OPERATIONS || 6666 | MILLER | 60 | NULL | NULL |+----------+----------+-----------+-----------+-------------+

八：Group By（）Having语句和 LIke

GROUP BY语句通常会和聚合函数一起使用，按照一个或者多个列结果进行分组，然后对每个组执行聚合操作。 Having语句

九：order by() sort by() Distribute By() Cluster By()

set mapreduce.job.reduces=3; set hive.exec.mode.local.auto=false;
运行结果

-- 小结：Hive中如果单独使用sort by 它也能进行局部排序，但是每个文件中的数据是随机进入，如果
-- 如果将 distribute by + sort by 使用就完全和Hadoop中的HashPartitioner相吻合！！！

十：分区表(Hive中分区就是分目录)

Hive中没有索引的概念分区表在查询的时候要加上分区字段
针对庞大的数据集，存储的时候考虑用分区表表存储，只要按照一定规律
-- 进行分区，将来查询的时候就可以使用分区字段结合查询条件小范围匹配数据
-- 最终避免全表扫描，从而提升查询效率。
即可以将表的分区字段看成表的普通字段 -- 查看分区详情
show partitions dept_par;
-- 准备数据导入load data [local] inpath '数据的path' [overwrite] into table student[partition (partcol1=val1,…)];load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_parpartition (day='20200401');load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_parpartition (day='20200402');load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_parpartition (day='20200403');
注意：再给分区表中导入数据的时候一定要加上分区属性partition(day='20200403')
2、增加分区--增加单个分区alter table dept_par add partition(day='20200404');--增加多个分区alter table dept_par add partition(day='20200405') partition(day='20200406');3、删除分区alter table dept_par drop partition(day='20200404');alter table dept_par drop partition(day='20200405'), partition(day='20200406');

二级分区：
二级分区：1、创建分区表 -- 建表create table dept_par1(deptno int, dname string, loc string)partitioned by (day string, hour string)row format delimited fields terminated by 't';-- 准备数据导入load data [local] inpath '数据的path' [overwrite] into table student[partition (partcol1=val1,…)];load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200401.log' into table dept_par1partition (day='20200401', hour='00');load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par1partition (day='20200401', hour='01');load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200403.log' into table dept_par1partition (day='20200401', hour='02');
--思考：分区是不是越多越好？
-- 分区越多导致文件的数量就会增加，在HDFS尽量避免出现大量小文件，在正常使用中一般做到2级分区就可以啦！

10.1：分区表和数据产生关联的三种方式

分区表和数据产生关联的三种方式：(分区表和普通表不一样需要此三种操作才能找到对应的元数据) 方式一：上传数据后修复（msck repair table dept_par2;）方式二：上传数据后添加分区 (alter table dept_par add partition(day='20200401');) 方式三：创建文件夹后load数据到分区
load data local inpath '/opt/module/hive-3.1.2/datas/dept_20200402.log' into table dept_par4partition (day='20200402');

10.2：动态分区

相关参数：注意通过insert into的方式给分区表中插入数据一定要指定 partition(loc)
-- 进行动态分区（动态的往 dept_par_dy 插入数据）insert into table dept_par_dy partition(loc)select deptno, dname, loc from dept;

10.3:分桶表(分桶操作实际上就是Hadoop中的分区)

面对海量数据表查询效率低下，此时就考虑缩小数据的管理范围，
-- 从而达到避免查询的时候进行全表扫描，分桶操作实质就是Hadoop
-- 中默认规则的分区操作分桶表在查询的时候要加上分桶字段，类似于分区表在查询的时候要加上分区字段
注意:在进行分桶操作的时候要设置 set mapreduce.job.reduces=-1; (因为分区分的是分目录(分区上传文件的时候是每个区的文件上传到对应的目录下边)，分桶分的是分文件(上传文件的时候只上传一个文件，hive会按照指定的分桶字段创建不同的桶))

创建分桶表 CREATE TABLE table_name [(col_name data_type [COMMENT col_comment], ...)] [CLUSTERED BY (col_name, col_name, ...) INTO num_buckets BUCKETS]-- 建表create table stu_bucket(id int, name string)clustered by(id) into 4 bucketsrow format delimited fields terminated by 't';-- 加载数据load data local inpath '/opt/module/hive-3.1.2/datas/student.txt' into table stu_bucket;

10.4:抽样查询

抽样查询取是按照id进行随机分桶，然后选择第x个桶，并不是真正意义上按照分桶的规则进行划分，然后获取第x个桶的数据，而是随机的。
四、抽样查询-- 语法: TABLESAMPLE(BUCKET x OUT OF y ON 分桶字段) select * from stu_bucket tablesample(bucket 1 out of 4 on id);select * from emp tablesample(bucket 1 out of 4 on empno);

十一：Hive中的函数(思路：按照需求缺什么补什么)

2> CASE WHEN THEN ELSE END（逻辑判断函数）
1)、按照部门分组,再按照性别分组，求统计select dept_id,sex,count(name) as numfrom emp_sexgroup by dept_id,sex;+----------+------+------+| dept_id | sex | num |+----------+------+------+| A | 女 | 1 || A | 男 | 2 || B | 女 | 2 || B | 男 | 1 |+----------+------+------+2)、实现目标结果数据select dept_id,sum(CASE sex WHEN '男' THEN 1 ELSE 0 END) as man,sum(CASE sex WHEN '女' THEN 1 ELSE 0 END) as womenfrom emp_sexgroup by dept_id;+----------+------+--------+| dept_id | man | women |+----------+------+--------+| A | 2 | 1 || B | 1 | 2 |+----------+------+--------+3)、扩展 if() 函数select dept_id,sum(if(sex='男',1,0)) as man,sum(if(sex='女',1,0)) as womenfrom emp_sexgroup by dept_id;+----------+------+--------+| dept_id | man | women |+----------+------+--------+| A | 2 | 1 || B | 1 | 2 |+----------+------+--------+

11.1：函数第二波

COLLECT_SET(col)：函数只接受基本数据类型，它的主要作用是将某字段的值进行去重汇总，产生array类型字段。
1)、把 constellation 和 blood_type 拼接起来select concat_ws(',',constellation,blood_type) as cb,namefrom person_info; --> t1 因为第二个表查询的时候用到了第一个表中的name，因此要把那么查出来+--------+-------+| cb | name |+--------+-------+| 白羊座,A | 孙悟空 || 射手座,A | 大海 || 白羊座,B | 宋宋 || 白羊座,A | 猪八戒 || 射手座,A | 凤姐 || 白羊座,B | 苍老师 |+--------+-------+2)、根据 cb 字段进行分组select t1.cb,concat_ws('|',collect_set(t1.name))from( select concat_ws(',',constellation,blood_type) as cb, name from person_info) t1group by t1.cb;+--------+----------+| t1.cb | _c1 |+--------+----------+| 射手座,A | 大海|凤姐 || 白羊座,A | 孙悟空|猪八戒 || 白羊座,B | 宋宋|苍老师 |+--------+----------+
侧写函数LATERAL VIEW 语法：(因为SQL语法局限需要 LATERAL VIEW） explode 一进多出函数，侧写函数代表虚拟表的过程；执行过程炸开用explode 对应key 用侧写函数LATERAL VIEW
EXPLODE(col)：将hive一列中复杂的array或者map结构拆分成多行。select explode(split(category,',')) as category_name,moviefrom movie_info;+------+| col |+------+| 悬疑 || 动作 || 科幻 || 剧情 || 悬疑 || 警匪 || 动作 || 心理 || 剧情 || 战争 || 动作 || 灾难 |+------+

select movie,category_namefrom movie_infolateral view explode(split(category,',')) movie_info_tmp as category_name;+--------------+----------------+| movie | category_name |+--------------+----------------+| 《疑犯追踪》 | 悬疑 || 《疑犯追踪》 | 动作 || 《疑犯追踪》 | 科幻 || 《疑犯追踪》 | 剧情 || 《Lie to me》 | 悬疑 || 《Lie to me》 | 警匪 || 《Lie to me》 | 动作 || 《Lie to me》 | 心理 || 《Lie to me》 | 剧情 || 《战狼2》 | 战争 || 《战狼2》 | 动作 || 《战狼2》 | 灾难 |+--------------+----------------+

11.2：函数第三波

窗口函数就是针对SQL中的限制重新开辟一个处理数据空间，
-- 可以灵活的的控制窗口的大小，窗口的大小取决于数据的记录的条数。需求一：查询在2017年4月份购买过的顾客及总人数 -- 开窗函数此处应用大意是按照名字分组开辟了一个新的统计名字个数的空间，这个空间的范围是从每一个组中的起始位置到最后位置即每一个组统计一次。 WINDOW 规范默认为
ROW BETWEEN UNBOUNDED PRECEDING(从前面的起点) AND UNBOUNDED FOLLOWING(到后面的终点)、 count(name) over() totalNum 统计姓名的个数此时如果不分组的情况会对相同的名字进行个数统计，因此需要加上 group by() 而over() 是在group by()之后，开辟了新的统计组个数的空间(即要统计总人数就要分组
selectname ,count(name) from businesswhere substring(orderdate,0,7) = '2017-04'group by name ;name_c1jack1mart4
)但是分组的结果会重复统计相同的人名，此时开窗的作用就是开辟了新的统计人数的空间，即一个组统计一次；
1) 根据需求的日期过滤数据select orderdate ,namefrom business where substring(orderdate,0,7) = '2017-04' ;orderdatename2017-04-06jack2017-04-08mart2017-04-09mart2017-04-11mart2017-04-13mart2)、根据 name 进行分组,然后使用开窗函数重新处理分组后的结果求统计 -- (因为用到了分组，而分组要查询的字段是分组字段因此查询字段不能带日期)selectname,count(name) over() totalNumfrom businesswhere substring(orderdate,0,7) = '2017-04'group by name ;-- 开窗函数此处应用大意是按照名字分组开辟了一个新的统计名字个数的空间，这个空间的范围是-- 从每一个组中的起始位置到最后位置。如果selectname,count(name) over() totalNumfrom businesswhere substring(orderdate,0,7) = '2017-04'没有开窗语句也没有order by语句则默认窗口大小从起点到终点nametotalnummart5mart5mart5mart5jack5
需求二：查询顾客的购买明细及月购买总额目的是统计顾客的购买明细(每个顾客每个月的花费金额) 及月购买总额(一个月内所有顾客花费的总金额) 因此SQL( sum(cost) over(partition by substring(orderdate,0,7)) costByMonth 表示在求一个月内所有顾客花费的总金额的时候重新开辟了一个新的空间，新的空间的大小是按照substring(orderdate,0,7)一个月进行分区汇总的。（partition by 可以控制新空间的大小)只有窗口语句没有order by语句此时窗口的大小是每一个分区内的起点到终点
select name,cost,orderdate,sum(cost) over(partition by substring(orderdate,0,7)) total_costfrom business;
需求三：查询每个顾客的购买明细及 (每个)月购买总额
selectname,cost,orderdate,sum(cost) over(partition by substring(orderdate,0,7),name) oneCostByMonthfrom business ;

(sum(cost) over(partition by substring(orderdate,0,7),name) oneCostByMonth)

开窗函数新空间的大小由按照月份和顾客确定。只有窗口语句没有order by语句此时窗口的大小是每一个分区内的起点到终点
需求四：上述的场景, 将每个顾客的cost按照日期进行累加
-- 方式一selectname,cost,orderdate,sum(cost) over(partition by name order by orderdate) oneCustTotalfrom business ;-- 方式二selectname,cost,orderdate,sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) oneCustTotalfrom business ;
开窗Sql代码（sum(cost) over(partition by name order by orderdate) oneCustTotal）每个顾客的cost按照日期进行累加，即开窗的新的空间是按照顾客名字进行分区，按照日期进行累加，因此要排序，开窗规则的默认窗口是从起始行累加到当前行。 (sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) 开窗规则的默认窗口是从起始行累加到当前行。注意：
当 ORDER BY 指定时缺少 WINDOW 子句，WINDOW 窗口大小规范默认为
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW、当 ORDER BY 和 WINDOW 子句都缺失时即单纯的over()，WINDOW 窗口大小规范默认为
ROW BETWEEN UNBOUNDED PRECEDING 起点 AND UNBOUNDED FOLLOWING、终点 -- 需求五：查询每个顾客上次的购买时间和下一次购买时间
LAG(col,n,default_val)：往前第n行数据
LEAD(col,n, default_val)：往后第n行数据
selectname ,orderdate,lag(orderdate,1,'startTime') over(partition by name order by orderdate) preTime,lead(orderdate,1,'endTime') over(partition by name order by orderdate) nextTimefrom business ;nameorderdatepretimenexttimejack2017-01-01startTime2017-01-05jack2017-01-052017-01-012017-01-08jack2017-01-082017-01-052017-02-03jack2017-02-032017-01-082017-04-06jack2017-04-062017-02-03endTimemart2017-04-08startTime2017-04-09mart2017-04-092017-04-082017-04-11mart2017-04-112017-04-092017-04-13mart2017-04-132017-04-11endTimeneil2017-05-10startTime2017-06-12neil2017-06-122017-05-10endTimetony2017-01-02startTime2017-01-04tony2017-01-042017-01-022017-01-07tony2017-01-072017-01-04endTime
开窗函数是作为对其它函数在统计的时候由于SQL的局限，进行辅助作用的辅助主函数的，因此不能单独存在；因此以下写法不正确，应该将over()开窗函数放在主函数的后边。且开窗函数不能单独使用(以下两种写法都是错误的)
selectname ,orderdate,over(partition by name order by orderdate) lag(orderdate,1,'startTime') preTime,over(partition by name order by orderdate) lead(orderdate,1,'endTime') nextTimefrom business ;开窗函数不能单独使用selectname ,orderdate,over(partition by name order by orderdate) preTime,over(partition by name order by orderdate) nextTimefrom business ;
需求六：查询前20%时间的订单信息 NTILE(n)：把有序窗口的行分发到指定数据的组中， NTILE(n) n为需要的组数各个组有默认有编号，编号从1开始，对于每一行，
NTILE函数返回此行所属的组的编号。注意：n必须为int类型。
selectt1.name,t1.orderdate,t1.costfrom(selectname,orderdate,cost,-- 分成5组按照日期排序 ntile(5) over(order by orderdate) as group_idfrom business ) t1 where t1.group_id = 1;t1.namet1.orderdatet1.costjack2017-01-0110tony2017-01-0215tony2017-01-0429
rank() over(partition by subject order by score desc) rankOrder,
dense_rank() over(partition by subject order by score desc) dense_rank_Order,
row_number() over(partition by subject order by score desc) row_number_orderNum 开窗函数辅助主函数 row_number()进行统计
selectname ,subject,score ,rank() over(partition by subject order by score desc) rankOrder,dense_rank() over(partition by subject order by score desc) dense_rank_Order,row_number() over(partition by subject order by score desc) row_number_orderNumfrom score ;namesubjectscorerankorderdense_rank_orderrow_number_ordernum孙悟空数学95111宋宋数学86222婷婷数学85333大海数学56444宋宋英语84111大海英语84112婷婷英语78323孙悟空英语68434大海语文94111孙悟空语文87222婷婷语文65333宋宋语文64444

十二：自定义函数(看文档)

1)、获取客户端连接对象
2)、调用API完成具体功能
3)、关闭资源

十三：Hive企业级调优(看文档) 十四：Hive ETL案例分析

ETL阶段：清洗字段数目

数据清洗（ETL）
-- 操作数据的注意点
1)、对字段的数量进行验证
2)、对category字段中的空格进行处理
3)、对关联视频字段进行处理 -- ETL的思路
1)、通过MR程序进行数据清洗（编码）
2)、打包，上传至Linux中
3)、执行数据清洗
-- 进行数据清洗
1)、把待处理的数据上传至HDFS 2)、执行MR清洗程序
hadoop jar /opt/module/hive-3.1.2/datas/GuliETL-1.0-SNAPSHOT.jar com.atguigu.mr.GuliEtlDriver /gulivideo/video /gulivideo/video/output

需求分析：需求一：统计视频(字段用videoId视频id) 观看数(views用字段观看次数) Top10
selectvideoId,viewsfrom gulivideo_orcorder by views desclimit 10;+--------------+-----------+| videoid | views |+--------------+-----------+| dMH0bHeiRNg | 42513417 || 0XxI-hvPRRA | 20282464 || 1dmVU08zVpA | 16087899 || RB-wUgnyGv0 | 15712924 || QjA5faZF1A8 | 15256922 || -_CSo1gOd48 | 13199833 || 49IDp76kjPw | 11970018 || tYnn51C3X_w | 11823701 || pv5zWaTEVkI | 11672017 || D2kJZOfq7zk | 11184051 |+--------------+-----------+
需求二：统计视频类别热度Top10(热度用视频的数量来体现)
统计视频类别热度Top10(热度用视频的数量来体现)-- 分析：1)、获取视频类别并炸开select videoId,category_namefrom gulivideo_orclateral view explode(category) gulivideo_orc_tmp as category_name; --> t1 2)、根据 t1 中的 category_name 进行分组求统计获取 hot值并根据hot值倒序取前十select category_name,count(t1.videoId) as hotfrom ( select videoId, category_name from gulivideo_orc lateral view explode(category) gulivideo_orc_tmp as category_name) t1 group by t1.category_nameorder by hot desc limit 10;+----------------+---------+| category_name | hot |+----------------+---------+| Music | 179049 || Entertainment | 127674 || Comedy | 87818 || Animation | 73293 || Film | 73293 || Sports | 67329 || Games | 59817 || Gadgets | 59817 || People | 48890 || Blogs | 48890 |+----------------+---------+
需求三：统计出视频观看数最高的20个视频的所属类别以及类别包含Top20视频的个数
视频(字段用videoId视频id) 观看数(views用字段观看次数) 最高的20个视频selectvideoId ,views ,category ,category_namefrom gulivideo_orcorder by views desclimit 20 ; -- t1-- top 20 视频的videoId 与类别selectt1.videoId videoId_t2,category_namefrom (selectvideoId ,views ,category from gulivideo_orcorder by views desclimit 20) t1LATERAL VIEW explode(t1.category) gulivideo_orc_temp as category_name ; -- t2以及类别包含Top20视频的个数 (按照类别进行分组统计视频的个数)selectt2.category_name,count(t2.videoId_t2) as numfrom (selectt1.videoId videoId_t2,category_namefrom (selectvideoId ,views ,category from gulivideo_orcorder by views desclimit 20) t1LATERAL VIEW explode(t1.category) gulivideo_orc_temp as category_name) t2group by t2.category_name ;t2.category_name_c1Blogs 2Comedy 6Entertainment 6Music 5People 2UNA 1
需求四：统计视频观看数Top50所关联视频的所属类别排名
统计视频，观看次数，和视频相关idselect videoId ,views ,relatedId from gulivideo_orcorder by views desclimit 50 ; -- t1将 t1 中的关联视频字段炸开selectt1.videoId ,videoId_Idfrom (select videoId ,views ,relatedId from gulivideo_orcorder by views desclimit 50) t1LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id ; -- t2通过和原表Join获取关联视频对应的所属类别 select t2.videoId_Id videoId_t4,t3.category category_t4from(selectt1.videoId ,videoId_Idfrom (select videoId ,views ,relatedId from gulivideo_orcorder by views desclimit 50) t1LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id) t2 join gulivideo_orc t3on t2.videoId_Id = t3.videoId ; -- t4将 t4 结果中的 category 炸开selectt4.videoId_t4 videoId_t5,category_name_t4 category_t5from(select t2.videoId_Id videoId_t4,t3.category category_t4from(selectt1.videoId ,videoId_Idfrom (select videoId ,views ,relatedId from gulivideo_orcorder by views desclimit 50) t1LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id) t2 join gulivideo_orc t3on t2.videoId_Id = t3.videoId ) t4LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4 ; -- t5根据 t5 的 category_name 进行分组然后求统计selectt5.category_t5 category_t6,count(t5.videoId_t5) numfrom(selectt4.videoId_t4 videoId_t5,category_name_t4 category_t5from(select t2.videoId_Id videoId_t4,t3.category category_t4from(selectt1.videoId ,videoId_Idfrom (select videoId ,views ,relatedId from gulivideo_orcorder by views desclimit 50) t1LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id) t2 join gulivideo_orc t3on t2.videoId_Id = t3.videoId ) t4LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4) t5group by t5.category_t5 ; -- t6根据 t6 中num 进行排名selectt6.category_t6,t6.num ,rank() over(order by t6.num desc) as r_kfrom(selectt5.category_t5 category_t6,count(t5.videoId_t5) numfrom(selectt4.videoId_t4 videoId_t5,category_name_t4 category_t5from(select t2.videoId_Id videoId_t4,t3.category category_t4from(selectt1.videoId ,videoId_Idfrom (select videoId ,views ,relatedId from gulivideo_orcorder by views desclimit 50) t1LATERAL VIEW explode(t1.relatedId) gulivideo_orc_temp as videoId_Id) t2 join gulivideo_orc t3on t2.videoId_Id = t3.videoId ) t4LATERAL VIEW explode(t4.category_t4) t4_temp as category_name_t4) t5group by t5.category_t5) t6 ;
需求五：统计每个类别中的视频热度Top10，以Music为例 (热度以观看数衡量) 方式一：
获取视频类别的并炸开selectvideoId,views,category_namefrom gulivideo_orclateral view explode(category) gulivideo_orc_tmp as category_name; --> t12)、根据 t1 中的 category_name 分组然后求统计select t1.category_name,t1.viewsfrom ( selectvideoId,views,category_namefrom gulivideo_orclateral view explode(category) gulivideo_orc_tmp as category_name) t1 where t1.category_name = 'Music'order by t1.views desc limit 10;+-------------------+-----------+| t1.category_name | t1.views |+-------------------+-----------+| Music | 15256922 || Music | 11823701 || Music | 11672017 || Music | 9579911 || Music | 7533070 || Music | 6946033 || Music | 6935578 || Music | 6193057 || Music | 5581171 || Music | 5142238 |+-------------------+-----------+
方式二：
统计类别为Music 的videoIdselectvideoId,views,category_namefrom gulivideo_orcLATERAL VIEW explode(category) gulivideo_orc_temp as category_name where category_name = "Music" order by views desclimit 10;
需求六：统计每个类别中的视频热度Top10
1)、获取视频类别的并炸开selectvideoId,views,category_namefrom gulivideo_orcLATERAL VIEW explode(category) gulivideo_orc_temp as category_name ; -- t12)、根据观看数对每个类别进行排序selectt1.videoId videoId_t2,t1.views views_t2,t1.category_name category_t2,rank() over(partition by t1.category_name order by views desc) r_kfrom (selectvideoId,views,category_namefrom gulivideo_orcLATERAL VIEW explode(category) gulivideo_orc_temp as category_name) t1 ; -- t2获取前10 select t2.videoId_t2 ,t2.views_t2 ,t2.category_t2,t2.r_kfrom (selectt1.videoId videoId_t2,t1.views views_t2,t1.category_name category_t2,rank() over(partition by t1.category_name order by views desc) r_kfrom (selectvideoId,views,category_namefrom gulivideo_orcLATERAL VIEW explode(category) gulivideo_orc_temp as category_name) t1 ) t2where t2.r_k <=10 ;每个类别都是前十名 aRNzWyD7C9o8825788UNA1jtExxsiLgPM5320895UNA2PxNNR4symuE4033376UNA38cjTSvvoddc3486368UNA4LIhbap3FlGc2849832UNA5lCSTULqmmYE2179562UNA6UyTxWvp8upM2106933UNA7y6oXEWowirI1666084UNA8_x2-AmY8FI81403113UNA9ICoDFooBXpU1376215UNA10RjrEQaG5jPM2803140Vehicles1cv157ZIInUk2773979Vehicles2Gyg9U1YaVk81832224Vehicles36GNB7xT3rNE1412497Vehicles4tth9krDtxII1347317Vehicles546LQd9dXFRU1262173Vehicles6pdiuDXwgrjQ1013697Vehicles7kY_cDpENQLE956665Vehicles8YtxfbxGz1u4942604Vehicles9aCamHfJwSGU847442Vehicles10

十五：SQL练习

需求一：

方式一：一：求出每个学生所有科目成绩的最小值selectuid,min(score) over(partition by uid) min_scorefrom score ; -- t1二：求出学科的平均成绩的最大值selectsubject_id,avg(score) avg_scorefrom score group by subject_id order by avg_score desclimit 1 ; -- t2三：求出每个学生所有科目成绩的最小值大于各个学科的平均成绩的最大值的uidselectuidfrom(selectuid,min(score) over(partition by uid) min_scorefrom score) t1 , (selectsubject_id,avg(score) avg_scorefrom score group by subject_id order by avg_score desclimit 1 ) t2where t1.min_score > t2.avg_score ;uid100110011001 方式二 (标准方式)求出每个学科平均成绩selectuid,score,avg(score) over(partition by subject_id) avg_scorefrom score ; -- t1求出每个学科平均成绩10037081.6666666666666710028581.6666666666666710019081.6666666666666710037081.6666666666666710028581.6666666666666710019081.6666666666666710038581.6666666666666710027081.6666666666666710019081.66666666666667根据是否大于平均成绩记录flag，大于则记为0否则记为1selectt1.uid,if(t1.score>t1.avg_score,0,1) flagfrom(selectuid,score,avg(score) over(partition by subject_id) avg_scorefrom score) t1 ; -- t2100311002010010100311002010010100301002110010根据学生id进行分组统计flag的和，和为0则是所有学科都大于平均成绩selectuidfrom (selectt1.uid,if(t1.score>t1.avg_score,0,1) flagfrom(selectuid,score,avg(score) over(partition by subject_id) avg_scorefrom score) t1) t2 group by uidhaving sum(flag)=0;uid1001
需求二：

使用SQL统计出每个用户的累积访问次数修改数据格式select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCountfrom action ; -- t1u012017-015u022017-016u032017-018u042017-013u012017-016u012017-028U022017-016U012017-024计算每人单月访问量selectt1.userId userId_t2,t1.mn mn_t2 ,sum(t1.visitCount) mn_count_t2from (select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCountfrom action) t1group by t1.userId , t1.mn ; -- t2userid_t2mn_t2mn_count_t2U012017-024U022017-016u012017-0111u012017-028u022017-016u032017-018u042017-013按月累计访问量selectt2.userid_t2,t2.mn_t2,t2.mn_count_t2,sum(t2.mn_count_t2) over(partition by t2.userid_t2 order by t2.mn_t2)from (selectt1.userId userId_t2,t1.mn mn_t2 ,sum(t1.visitCount) mn_count_t2from (select userId, date_format(regexp_replace(visitDate,'/','-'),'yyyy-MM') mn, visitCountfrom action) t1group by t1.userId , t1.mn) t2 ;u012017-011111u012017-021223u022017-011212u032017-0188u042017-0133
需求三：
有50W个京东店铺，每个顾客访客访问任何一个店铺的任何一个商品时都会产生一条访问日志，访问日志存储的表名为Visit，访客的用户id为user_id，被访问的店铺名称为shop。

方式一： (不对，每家店铺应该去除相同顾客，即相同顾客算一次)selectshop,count(user_id) numfrom visitgroup by shop ;a9b6c4方式二：一家店去除了相同的顾客selectshop,count(distinct user_id) from visit group by shop ; a4b4c3

每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数（1）查询每个店铺被每个用户访问次数selectshop ,user_id ,count(*) ctfrom visit group by shop , user_id ; -- t1shopuser_idctau13bu12au22bu21cu22（2）计算每个店铺被用户访问次数排名selectt1.shop shop_t2,t1.user_id user_id_t2,t1.ct ct_t2,rank() over(partition by shop order by t1.ct) r_kfrom(selectshop ,user_id ,count(*) ctfrom visit group by shop , user_id) t1 ; -- t2（3）取每个店铺排名前3的selectt2.shop_t2,t2.user_id_t2,t2.ct_t2from (selectt1.shop shop_t2,t1.user_id user_id_t2,t1.ct ct_t2,rank() over(partition by shop order by t1.ct ) r_kfrom(selectshop ,user_id ,count(*) ctfrom visit group by shop , user_id) t1) t2 where t2.r_k <= 3 ;au31au22au13au53bu21bu51bu12bu42cu31cu61cu22
需求四:

数据样例:2017-01-01,10029028,1000003251,33.57。1）给出 2017年每个月的订单数、用户数、总成交金额。selectdate_format(dt,'yyyy-MM'),count(order_id),count(distinct user_id),sum(amount)from order_tabwhere date_format(dt,'yyyy') = '2017'group by date_format(dt,'yyyy-MM') ;2）给出2017年11月的新客数(指在11月才有第一笔订单)selectcount(user_id)from order_tabgroup by user_idhaving date_format(min(dt),'yyyy-MM')='2017-11' ;
需求五：

select sum(user_total_count), sum(user_total_avg_age), sum(twice_count), sum(twice_count_avg_age)from (select 0 user_total_count, 0 user_total_avg_age, count(*) twice_count, cast(sum(age)/count(*) as decimal(10,2)) twice_count_avg_agefrom ( select user_id, min(age) agefrom (select user_id, min(age) agefrom ( select user_id, age, date_sub(dt,rk) flagfrom ( select dt, user_id, min(age) age, rank() over(partition by user_id order by dt) rk from user_age group by dt,user_id )t1 )t2group by user_id,flaghaving count(*)>=2)t3group by user_id )t4union allselect count(*) user_total_count, cast((sum(age)/count(*)) as decimal(10,1)), 0 twice_count, 0 twice_count_avg_agefrom ( select user_id, min(age) age from user_age group by user_id )t5)t6;
需求六：
select t1.userid, t1.paymenttime, od.moneyfrom (select userid, min(paymenttime) paymenttimefrom ordertablewhere date_format(paymenttime,'yyyy-MM')='2017-10'group by userid)t1join ordertable odon t1.userid=od.userid and t1.paymenttime=od.paymenttime;
需求七：

select ip, interface, count(*) ctfrom ipwhere date_format(time,'yyyy-MM-dd HH')>='2016-11-09 14' and date_format(time,'yyyy-MM-dd HH')<='2016-11-09 15' and interface='/api/user/login'group byip,interfaceorder by ct desclimit 2;t1

上一篇：mysql部署hive（一）

下一篇：20220225每日一题黄金矿工刷后感

HIVE总结

Hive 在查询数据的时候，由于没有索引，需要扫描整个表，因此延迟较高。

set hive.exec.mode.local.auto=true; //开启本地mr 注意：当切换成本地模式，可能回报一个 内存溢出 的错误，因为本地模式默认只给我们分配 256M 内存。可以通过修改 hive-env.sh.template

set hive.exec.mode.local.auto=true; //开启本地mr
注意：当切换成本地模式，可能回报一个内存溢出的错误，因为本地模式默认只给我们分配 256M 内存。可以通过修改 hive-env.sh.template