Common Hive Tuning Methods

Date: 2023-06-16

Preface: Hive tuning comes up frequently in day-to-day work. This article briefly summarizes the commonly used tuning methods.

I. Code-level optimizations

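1. Deduplication: there are two ways to deduplicate, distinct and group by, and group by is generally recommended. The difference matters most for aggregates such as count(distinct key) over a whole table: all the data is sent to a single reduce task, which becomes a bottleneck on large inputs. Deduplicating with group by first lets Hive spread the grouping across multiple reduce tasks.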

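2. When implementing specific business logic, name the columns you actually need and avoid select *, so that Hive does not have to read and pass along columns the query never uses.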

3. When using join, filter the rows of each table first and then join, so that less data has to be shuffled and matched.
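
The three points above can be sketched in HiveQL as follows. The tables orders and users, their columns, and the filter values are hypothetical and exist only for illustration.

```sql
-- Hypothetical tables:
--   orders(order_id, user_id, amount, dt)
--   users(user_id, city, status)

-- 1. Count distinct users: deduplicate with group by in a subquery first,
--    instead of: select count(distinct user_id) from orders;
select count(*) as user_cnt
from (
  select user_id
  from orders
  group by user_id
) t;

-- 2 and 3. Name only the needed columns and filter each table before the join.
select o.order_id, o.amount, u.city
from (
  select order_id, user_id, amount
  from orders
  where dt = '2023-06-16'
) o
join (
  select user_id, city
  from users
  where status = 'active'
) u
on o.user_id = u.user_id;
```

Whether the rewritten forms are actually faster depends on data volume and on what the Hive optimizer already does for you, so it is worth comparing the plans with explain.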

II. Too many small files

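1. Use Hive's built-in concatenate command (alter table ... concatenate) to merge small files automatically. It is supported only for tables stored as RCFile or ORC.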

2. Use Hadoop archive (HAR) to pack the small files into an archive.

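3. Merge small files before the map phase to reduce the number of map tasks: use CombineHiveInputFormat, which combines several small files into one input split.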

set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
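
A minimal sketch of the three approaches, assuming a hypothetical ORC table named logs that is partitioned by dt. The archive statements rely on Hive's archiving feature, which wraps Hadoop archive; the split size is an illustrative value, and all settings should be verified against your Hive version.

```sql
-- 1. Merge the small files of one partition (RCFile and ORC tables only)
alter table logs partition (dt = '2023-06-16') concatenate;

-- 2. Pack one partition into a Hadoop archive (HAR)
set hive.archive.enabled=true;
alter table logs archive partition (dt = '2023-06-16');

-- 3. Combine small files into larger splits before the map phase
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- upper bound of a combined split in bytes (256 MB, illustrative)
set mapreduce.input.fileinputformat.split.maxsize=268435456;
```

The first two statements change how the data is stored, while the third only changes how a query reads it, so the third is the least intrusive one to try first.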

III. Parallel execution

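Setting the parameter hive.exec.parallel to true enables parallel execution, so that stages of one query that do not depend on each other can run at the same time.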

-- enable parallel execution of independent stages
set hive.exec.parallel=true;

-- maximum number of stages of one SQL statement that may run in parallel; the default is 8
set hive.exec.parallel.thread.number=16;
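
As a rough illustration of a query that can benefit, the sketch below uses two hypothetical tables, orders and users. The two branches of the union all do not depend on each other, so their stages are candidates for running concurrently once parallel execution is on.

```sql
-- enable parallel execution of independent stages
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;

-- orders and users are hypothetical tables; each branch is an independent aggregation
select 'orders' as src, count(*) as cnt from orders
union all
select 'users' as src, count(*) as cnt from users;
```

Parallel stages compete for the same cluster resources, so the gain is largest when the cluster has spare capacity.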

IV. JVM reuse

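Enabling JVM reuse helps when a job has a very large number of tasks, for example because of many small files: a JVM instance can be reused N times within the same job instead of being started fresh for every task. It is configured in the mapred-site.xml configuration file. This setting comes from classic MapReduce (MRv1); on YARN-based MapReduce it may have no effect, so verify it against your Hadoop version.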

mapreduce.job.jvm.numtasks

How many tasks to run per JVM. If set to -1, there is no limit.
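
Besides mapred-site.xml, the property can usually also be set for a single Hive session. The line below is a sketch under that assumption; the value 10 is illustrative only, and the setting may be ignored on Hadoop versions that no longer support JVM reuse.

```sql
-- let each JVM run up to 10 tasks of the same job (10 is an illustrative value)
set mapreduce.job.jvm.numtasks=10;
```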

V. Speculative execution

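For a task that seriously slows down job completion, Hadoop uses speculative execution: it launches a backup copy of the task and takes the output of whichever attempt finishes successfully first as the final result.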

mapreduce.map.speculative

true

If true, then multiple instances of some map tasks may be executed in parallel.

mapreduce.reduce.speculative

true

If true, then multiple instances of some reduce tasks may be executed in parallel.
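
The same switches can be sketched as session-level settings in Hive. The third line is Hive's own parameter for speculative execution of reducers; it is not mentioned above and is included here as an assumption to verify against your Hive version.

```sql
set mapreduce.map.speculative=true;
set mapreduce.reduce.speculative=true;
-- Hive's own switch for speculative execution of reduce tasks
set hive.mapred.reduce.tasks.speculative.execution=true;
```

Speculative execution spends extra cluster resources on duplicate attempts, so it is usually turned off for tasks that are slow because of data skew rather than a slow node.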