Before Hive can run statistical queries over LZO-compressed data, Hadoop itself must be built with LZO support. Compile the hadoop-3.2.2 source with the LZO codec enabled; see the earlier post:
Compiling the hadoop-3.2.2 source with LZO support
That is the prerequisite for everything below.
A 379 MB test file, makedatatest.txt, was prepared and compressed with lzop into makedatatest.txt.lzo (158 MB). The HDFS block size is 128 MB.
[ruoze@hadoop001 data]$ lzop -v makedatatest.txt
compressing makedatatest.txt into makedatatest.txt.lzo
[ruoze@hadoop001 data]$ ll -h | grep makedatatest
-rw-r--r--. 1 ruoze ruoze 379M Jan 27 22:17 makedatatest.txt
-rw-r--r--. 1 ruoze ruoze 158M Jan 27 22:17 makedatatest.txt.lzo
[ruoze@hadoop001 data]$ tail makedatatest.txt
9999991,Role76,85,Nokia,11,2016-9-16
9999992,Role30,85,Apple,13,2016-6-15
9999993,Role70,85,Oppo,2,2016-11-27
9999994,Role16,29,Meizu,9,2016-8-12
9999995,Role50,46,Samsung,6,2016-5-15
9999996,Role17,85,Huawei,10,2016-11-30
9999997,Role19,55,Samsung,7,2016-7-24
9999998,Role47,85,Oppo,15,2016-7-2
9999999,Role9,54,Meizu,17,2016-10-1
10000000,Role12,67,Apple,0,2016-2-24
[ruoze@hadoop001 data]$
To verify, start Hive, create a table, load the data above into it, and run some test queries.
set hive.cli.print.current.db=true;
set hive.cli.print.header=true;

-- create table makedatatest_lzo, specifying the input/output formats
create table makedatatest_lzo(
  id string,
  name string,
  dataage string,
  phonePlus string,
  clicks string,
  logintime string
)
row format delimited fields terminated by ','
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

-- load the data
load data local inpath '/home/ruoze/data/makedatatest.txt.lzo' overwrite into table makedatatest_lzo;

hive (study_hive)> select * from makedatatest_lzo limit 5;
OK
1	Role5	29	MI	4	2016-7-22
2	Role35	37	Oppo	6	2016-1-2
3	Role43	40	Huawei	7	2016-4-7
4	Role16	46	Huawei	14	2016-1-28
5	Role65	33	MI	1	2016-2-26
Time taken: 0.108 seconds, Fetched: 5 row(s)
hive (study_hive)>
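As an aside, a file shaped like makedatatest.txt could be produced with a short generator. This is a hypothetical sketch (the post does not show how the original file was made); it only mirrors the six comma-separated fields of the table schema and the sample rows above:

```python
import random

def make_rows(n):
    # Hypothetical generator for rows shaped like the sample data:
    # id,name,dataage,phonePlus,clicks,logintime
    brands = ["Apple", "Samsung", "Huawei", "MI", "Oppo", "Meizu", "Nokia"]
    for i in range(1, n + 1):
        yield "%d,Role%d,%d,%s,%d,2016-%d-%d" % (
            i, random.randint(1, 99), random.randint(1, 99),
            random.choice(brands), random.randint(0, 20),
            random.randint(1, 12), random.randint(1, 28))

# Write a small sample file (the real file had 10,000,000 rows)
with open("makedatatest_sample.txt", "w") as f:
    for row in make_rows(1000):
        f.write(row + "\n")
```

A file generated this way would then be compressed with `lzop` exactly as shown above.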
Run an aggregate query:
hive (study_hive)> select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211104155_a2e3fd88-dc35-4cb2-b544-792e3197dac0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
As the output shows, number of mappers: 1; number of reducers: 1 — the 158 MB file produced a single input split.
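The single split is a direct consequence of non-splittable input. A rough model of the split counts seen in this post (assuming the simplified rule that a splittable file yields one split per 128 MB HDFS block, while a non-splittable file is always one split):

```python
import math

def num_splits(size_mb, block_mb, splittable):
    # Simplified split model: a splittable file gets roughly one split
    # per HDFS block; a non-splittable file is always a single split.
    return math.ceil(size_mb / block_mb) if splittable else 1

print(num_splits(379, 128, splittable=True))   # plain text file: 3 splits
print(num_splits(158, 128, splittable=False))  # .lzo without index: 1 split
print(num_splits(158, 128, splittable=True))   # .lzo with index: 2 splits
```

The last value matches the two mappers observed later in this post, once the LZO index exists.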
Then run the following command to build an LZO index for the table's data file makedatatest.txt.lzo:
hadoop jar ~/app/hadoop/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar com.hadoop.compression.lzo.LzoIndexer /user/hive/warehouse/study_hive.db/makedatatest_lzo/makedatatest.txt.lzo
Once it completes, an index file makedatatest.txt.lzo.index appears in the same directory.
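To my understanding of hadoop-lzo (treat the layout as an assumption, not something the post confirms), the .index file is simply a flat sequence of 8-byte big-endian integers, each recording the byte offset of one compressed LZO block in the data file. A minimal reader sketch:

```python
import struct

def read_lzo_index(path):
    # Assumed hadoop-lzo index layout: consecutive 8-byte big-endian
    # longs, each the file offset of one compressed LZO block.
    offsets = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(8)
            if len(chunk) < 8:
                break
            offsets.append(struct.unpack(">q", chunk)[0])
    return offsets
```

These recorded offsets are what make the file splittable: they are the only positions where a new reader can safely start decompressing.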
Next, set the hive.input.format parameter and rerun the aggregate query:
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive (study_hive)> set mapreduce.output.fileoutputformat.compress.codec;
mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec
hive (study_hive)> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
hive (study_hive)> select phoneplus,count(1) from makedatatest_lzo group by phoneplus;
Query ID = ruoze_20220211105120_93f2811d-09ba-4999-a949-1eded05b0000
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=
This time: number of mappers: 2; number of reducers: 1 — the split count became 2.
To recap: Hadoop's native libraries do not support LZO out of the box (reading an .lzo file directly just shows binary garbage), so you must build hadoop-lzo, place its jar in Hadoop's classpath, and add the LZO codec settings to core-site.xml. Even then, an LZO file is not splittable by itself — the whole 158 MB file goes to one mapper — and only after building an index with LzoIndexer can the input format split it across multiple mappers.
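To illustrate why the index enables splitting: with the block offsets available, an input format can end each split at the next recorded compressed-block boundary instead of treating the file as opaque. A simplified sketch of that idea (not the actual DeprecatedLzoTextInputFormat code; sizes are in MB for readability, and the block offsets are made up):

```python
import bisect

def indexed_lzo_splits(file_size, block_offsets, split_size):
    # Every split must begin at a compressed-block boundary recorded in
    # the index, so each split's end is snapped forward to the next
    # boundary at or beyond the target split size.
    splits = []
    start = 0
    while start < file_size:
        target = start + split_size
        i = bisect.bisect_left(block_offsets, target)
        end = block_offsets[i] if i < len(block_offsets) else file_size
        splits.append((start, end))
        start = end
    return splits

# 158 MB file, hypothetical block boundaries, 128 MB target split size:
print(indexed_lzo_splits(158, [0, 40, 80, 120, 155], 128))
# -> [(0, 155), (155, 158)]: two splits, matching "number of mappers: 2"
```

Without the index there are no known safe boundaries, so the only possible split is the whole file — which is exactly the one-mapper behavior seen before the index was built.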