Before Hive can run statistical queries over LZO-compressed data, Hadoop itself must be built with LZO support. Compile the hadoop-3.2.2 source with the LZO codec enabled; see the earlier post:
Compiling the hadoop-3.2.2 source with LZO support
That is the prerequisite for everything below.
A 379 MB test file, makedatatest.txt, was prepared and compressed with lzop into makedatatest.txt.lzo (158 MB). The HDFS block size is 128 MB.
[ruoze@hadoop001 data]$ lzop -v makedatatest.txt
compressing makedatatest.txt into makedatatest.txt.lzo
[ruoze@hadoop001 data]$ ll -h | grep makedatatest
-rw-r--r--. 1 ruoze ruoze 379M Jan 27 22:17 makedatatest.txt
-rw-r--r--. 1 ruoze ruoze 158M Jan 27 22:17 makedatatest.txt.lzo
[ruoze@hadoop001 data]$ tail makedatatest.txt
9999991,Role76,85,Nokia,11,2016-9-16
9999992,Role30,85,Apple,13,2016-6-15
9999993,Role70,85,Oppo,2,2016-11-27
9999994,Role16,29,Meizu,9,2016-8-12
9999995,Role50,46,Samsung,6,2016-5-15
9999996,Role17,85,Huawei,10,2016-11-30
9999997,Role19,55,Samsung,7,2016-7-24
9999998,Role47,85,Oppo,15,2016-7-2
9999999,Role9,54,Meizu,17,2016-10-1
10000000,Role12,67,Apple,0,2016-2-24
[ruoze@hadoop001 data]$
To verify, start Hive, create a table, load the data above into it, and run some test queries.
set hive.cli.print.current.db=true;
set hive.cli.print.header=true;

-- create table makedatatest_lzo, specifying the input/output formats
create table makedatatest_lzo(
  id string,
  name string,
  dataage string,
  phonePlus string,
  clicks string,
  logintime string
)
row format delimited fields terminated by ','
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

-- load the data
load data local inpath '/home/ruoze/data/makedatatest.txt.lzo' overwrite into table makedatatest_lzo;

hive (study_hive)> select * from makedatatest_lzo limit 5;
OK
1	Role5	29	MI	4	2016-7-22
2	Role35	37	Oppo	6	2016-1-2
3	Role43	40	Huawei	7	2016-4-7
4	Role16	46	Huawei	14	2016-1-28
5	Role65	33	MI	1	2016-2-26
Time taken: 0.108 seconds, Fetched: 5 row(s)
hive (study_hive)>
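As an aside, a file shaped like makedatatest.txt could be produced with a short generator. This is a hypothetical sketch (the post does not show how the original file was made); it only mirrors the six comma-separated fields of the table schema and the sample rows above:

```python
import random

def make_rows(n):
    # Hypothetical generator for rows shaped like the sample data:
    # id,name,dataage,phonePlus,clicks,logintime
    brands = ["Apple", "Samsung", "Huawei", "MI", "Oppo", "Meizu", "Nokia"]
    for i in range(1, n + 1):
        yield "%d,Role%d,%d,%s,%d,2016-%d-%d" % (
            i, random.randint(1, 99), random.randint(1, 99),
            random.choice(brands), random.randint(0, 20),
            random.randint(1, 12), random.randint(1, 28))

# Write a small sample file (the real file had 10,000,000 rows)
with open("makedatatest_sample.txt", "w") as f:
    for row in make_rows(1000):
        f.write(row + "\n")
```

A file generated this way would then be compressed with `lzop` exactly as shown above.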
Run an aggregate query:
hive (study_hive)> select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211104155_a2e3fd88-dc35-4cb2-b544-792e3197dac0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
As the output shows, number of mappers: 1; number of reducers: 1 — the 158 MB file produced a single input split.
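The single split is a direct consequence of non-splittable input. A rough model of the split counts seen in this post (assuming the simplified rule that a splittable file yields one split per 128 MB HDFS block, while a non-splittable file is always one split):

```python
import math

def num_splits(size_mb, block_mb, splittable):
    # Simplified split model: a splittable file gets roughly one split
    # per HDFS block; a non-splittable file is always a single split.
    return math.ceil(size_mb / block_mb) if splittable else 1

print(num_splits(379, 128, splittable=True))   # plain text file: 3 splits
print(num_splits(158, 128, splittable=False))  # .lzo without index: 1 split
print(num_splits(158, 128, splittable=True))   # .lzo with index: 2 splits
```

The last value matches the two mappers observed later in this post, once the LZO index exists.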
Then run the following command to build an LZO index for the table's data file makedatatest.txt.lzo:
hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/study_hive.db/makedatatest_lzo/makedatatest.txt.lzo
Once it completes, an index file makedatatest.txt.lzo.index appears in the same directory.
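To my understanding of hadoop-lzo (treat the layout as an assumption, not something the post confirms), the .index file is simply a flat sequence of 8-byte big-endian integers, each recording the byte offset of one compressed LZO block in the data file. A minimal reader sketch:

```python
import struct

def read_lzo_index(path):
    # Assumed hadoop-lzo index layout: consecutive 8-byte big-endian
    # longs, each the file offset of one compressed LZO block.
    offsets = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(8)
            if len(chunk) < 8:
                break
            offsets.append(struct.unpack(">q", chunk)[0])
    return offsets
```

These recorded offsets are what make the file splittable: they are the only positions where a new reader can safely start decompressing.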
Next, set the hive.input.format parameter and rerun the aggregate query:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive (study_hive)> set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive (study_hive)> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive (study_hive)> select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211105120_93f2811d-09ba-4999-a949-1eded05b0000
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
This time: number of mappers: 2; number of reducers: 1 — the split count became 2.
To recap: Hadoop's native libraries do not support LZO out of the box (reading an .lzo file directly just shows binary garbage), so you must build hadoop-lzo, place its jar in Hadoop's classpath, and add the LZO codec settings to core-site.xml. Even then, an LZO file is not splittable by itself — the whole 158 MB file goes to one mapper — and only after building an index with LzoIndexer can the input format split it across multiple mappers.
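To illustrate why the index enables splitting: with the block offsets available, an input format can end each split at the next recorded compressed-block boundary instead of treating the file as opaque. A simplified sketch of that idea (not the actual DeprecatedLzoTextInputFormat code; sizes are in MB for readability, and the block offsets are made up):

```python
import bisect

def indexed_lzo_splits(file_size, block_offsets, split_size):
    # Every split must begin at a compressed-block boundary recorded in
    # the index, so each split's end is snapped forward to the next
    # boundary at or beyond the target split size.
    splits = []
    start = 0
    while start < file_size:
        target = start + split_size
        i = bisect.bisect_left(block_offsets, target)
        end = block_offsets[i] if i < len(block_offsets) else file_size
        splits.append((start, end))
        start = end
    return splits

# 158 MB file, hypothetical block boundaries, 128 MB target split size:
print(indexed_lzo_splits(158, [0, 40, 80, 120, 155], 128))
# -> [(0, 155), (155, 158)]: two splits, matching "number of mappers: 2"
```

Without the index there are no known safe boundaries, so the only possible split is the whole file — which is exactly the one-mapper behavior seen before the index was built.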