
Hive: GROUP BY and DISTINCT have identical performance

Date: 2023-04-30

The conclusion first: there is no difference between the two. Start by comparing the execution plans.

1. group by

explain
select prov_id
from dim.dim_city
group by prov_id;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dim_city
            Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: prov_id (type: int)
              outputColumnNames: prov_id
              Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                keys: prov_id (type: int)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

2. distinct

explain
select distinct prov_id
from dim.dim_city;

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: dim_city
            Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: prov_id (type: int)
              outputColumnNames: prov_id
              Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
              Group By Operator
                keys: prov_id (type: int)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 3775 Data size: 522191 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: int)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1887 Data size: 261026 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

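The two execution plans are identical. For distinct, Hive also performs a partial group-by aggregation on the map side first (the Group By Operator with mode: hash in the Map Operator Tree), rather than doing all of the deduplication on the reduce side. Older versions of Hive did not have this optimization; when everything ran on the reduce side, the performance difference between the two forms could be large.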
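To make the effect of map-side aggregation concrete, here is a toy Python model of the two strategies. It is only a sketch and not Hive code: the function names and the sample splits are hypothetical, and it simply counts how many records each strategy would send from the mappers to the reducer.

```python
from typing import Callable, List, Tuple


def map_side_hash_aggregate(split: List[int]) -> List[int]:
    """Mapper with partial aggregation: emit each key at most once per split."""
    # dict.fromkeys keeps first-seen order and drops duplicate keys.
    return list(dict.fromkeys(split))


def map_side_passthrough(split: List[int]) -> List[int]:
    """Mapper without partial aggregation: emit every row unchanged."""
    return list(split)


def run_distinct(
    splits: List[List[int]],
    map_fn: Callable[[List[int]], List[int]],
) -> Tuple[List[int], int]:
    """Run the mappers, then deduplicate on the 'reduce' side.

    Returns the distinct keys and the number of records shuffled.
    """
    shuffled: List[int] = []
    for split in splits:
        shuffled.extend(map_fn(split))
    # The reducer merges the partial results into the final distinct set.
    return sorted(set(shuffled)), len(shuffled)


# Hypothetical input: three splits of prov_id values with many duplicates.
splits = [[1, 1, 2, 2, 2], [2, 3, 3, 1], [3, 3, 3]]

with_agg, shuffled_with = run_distinct(splits, map_side_hash_aggregate)
without_agg, shuffled_without = run_distinct(splits, map_side_passthrough)

print(with_agg, shuffled_with)        # [1, 2, 3] 6
print(without_agg, shuffled_without)  # [1, 2, 3] 12
```

Both strategies return the same distinct values; the difference is only in how many records cross the shuffle. On this sample the map-side variant sends half as many, and the gap grows as the number of duplicates per split grows.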
