Big Data Technology: Word Count with MapReduce

Date: 2023-06-14

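Note: this write-up follows Lin Ziyu's tutorial; for details see

MapReduce编程实践(Hadoop3.1.3) (MapReduce Programming Practice, Hadoop 3.1.3), Xiamen University Database Lab blog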

I. Objectives

1. Understand the processing logic of the MapReduce module in Hadoop.

2. Become familiar with MapReduce programming.

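II. Tasks

1. Create a folder named input, create three text files with the specified names inside it, and store the specified content in each file.

2. Start Hadoop in pseudo-distributed or fully distributed mode and upload the input folder to HDFS.

3. Write a MapReduce program that counts how many times each word occurs, save the result to the output folder on HDFS, and retrieve the result.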
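Before turning to the Hadoop program, the processing logic behind task 3 can be sketched in plain Java with no Hadoop dependency. This is only an illustration of the map, shuffle and reduce stages for word counting, not Hadoop's implementation; the class name WordCountSketch, its method names and the sample lines are all hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the three stages of word count.
public class WordCountSketch {

    // Map stage: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle stage: group all values that share a key (TreeMap keeps keys sorted).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run the three stages in sequence over a list of input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : shuffle(pairs).entrySet()) {
            counts.put(entry.getKey(), reduce(entry.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample input; the files used in the experiment are not shown here.
        List<String> lines = Arrays.asList("hello hadoop", "hello mapreduce", "hello world");
        wordCount(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In a real job the map and reduce calls run on different nodes and Hadoop performs the grouping between them; the sketch only shows what each stage contributes to the final counts.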

III. Environment

1. Operating system: Ubuntu 20.04 (64-bit)

2. Virtualization software: VMware Workstation 15.5

3. JDK: jdk-8u202-linux-x64.tar.gz

4. Hadoop version: 3.3.1

IV. Procedure

1. Creating the project in Eclipse and writing the code

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Consume generic Hadoop options and keep only the remaining path arguments.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner because summing is associative.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Every argument except the last is an input path; the last is the output path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Emits (word, 1) for every whitespace-separated token in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums the counts received for each word and emits (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

2. Compilation

3. Running the program

4. Results

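V. Completion Status and Discussion

1. What was completed

All of the operations described above were carried out.

2. Problems and discussion

At first the job ran too slowly to produce a result. On inspection, the cause was that the machine's hardware configuration allocated too few resources; after the allocation was increased, the job ran successfully.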

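VI. Lessons Learned

1. When a MapReduce program is run from Eclipse, the Advanced parameters of the Hadoop-Eclipse-Plugin are read as the Hadoop runtime parameters. If they are left unmodified, the defaults are in fact the standalone (non-distributed) parameters, so the program reads from a local directory rather than an HDFS directory and reports that the Input path does not exist. The fix is to copy the Hadoop configuration files into the project's src directory so that they override these parameters and the program runs correctly.

2. To run WordCount.jar again, first delete the output directory in HDFS; otherwise the job fails because the output directory already exists.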
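The first note above comes down to which default filesystem a path without a scheme is resolved against. The sketch below imitates that resolution with java.net.URI only; it is a simplified stand-in, not Hadoop's own code. The helper qualify, the class name DefaultFsSketch, the workspace path, the user name hadoop and the address hdfs://localhost:9000 are assumptions chosen for illustration. What is not assumed is that Hadoop's built-in default for fs.defaultFS is file:///, which is why an unconfigured run looks on the local disk.

```java
import java.net.URI;

// Hypothetical illustration of how the same relative path lands on different filesystems.
public class DefaultFsSketch {

    // Simplified stand-in for path qualification: resolve a scheme-less path
    // against a default filesystem URI and a working directory.
    static URI qualify(String defaultFs, String workingDir, String path) {
        String dir = workingDir.endsWith("/") ? workingDir : workingDir + "/";
        return URI.create(defaultFs).resolve(dir).resolve(path);
    }

    public static void main(String[] args) {
        // Built-in default filesystem: the path resolves under a local directory
        // (the workspace path here is an assumed example).
        System.out.println(qualify("file:///", "/home/hadoop/workspace/WordCount", "input"));

        // With a configuration that points at HDFS (assumed address and user),
        // the same argument resolves under the user's HDFS home directory.
        System.out.println(qualify("hdfs://localhost:9000", "/user/hadoop", "input"));
    }
}
```

The program argument is identical in both cases; only the configuration in effect decides whether input means a local folder or an HDFS folder.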
