What Are the MapReduce Design Patterns?

Published: 2022-01-04 10:59:32 Author: iii
Source: 亿速云


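1 Summarization Patterns

1.1 Numerical Summarizations

This one is effectively built in, because it is simply the MapReduce model itself. It corresponds to aggregates such as COUNT and MAX in SQL, and WordCount is one implementation of it.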
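As a rough illustration of a numerical summarization, the sketch below computes count, min and max per key in plain Java. It does not use Hadoop at all: the class name, the "key,value" line format and the in-memory map standing in for the shuffle are assumptions made only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of a numerical summarization; no Hadoop classes are involved.
class NumericalSummarySketch {
    // Input lines have the hypothetical form "key,value".
    // Returns key -> {count, min, max}.
    static Map<String, long[]> summarize(List<String> lines) {
        Map<String, long[]> result = new TreeMap<>();
        for (String line : lines) {
            // "Map": parse the record into (key, value).
            String[] parts = line.split(",");
            String key = parts[0];
            long value = Long.parseLong(parts[1].trim());
            // "Reduce": fold the value into the running summary for its key.
            long[] s = result.get(key);
            if (s == null) {
                result.put(key, new long[] {1, value, value});
            } else {
                s[0] += 1;
                s[1] = Math.min(s[1], value);
                s[2] = Math.max(s[2], value);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("alice,3", "bob,7", "alice,10");
        for (Map.Entry<String, long[]> e : summarize(lines).entrySet()) {
            System.out.println(e.getKey() + " " + Arrays.toString(e.getValue()));
        }
    }
}
```

Because count, min and max are all associative, the same folding step could also serve as a combiner in a real job.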

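1.2 Inverted Index Summarizations

The name sounds mysterious, but I don't think this really counts as a pattern. It is more of an application, and it involves no special MapReduce design. At its core it builds an index over list(V3), where V3 is a referencing id. The result this pattern is expected to produce is:
url -> list of id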
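The url -> list of id result can be sketched in plain Java as follows. This is only an in-memory imitation under assumed inputs: each pair holds a hypothetical document id and a url it references, and a sorted map plays the role of the shuffle and reduce steps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// In-memory sketch of an inverted index; no Hadoop classes are involved.
class InvertedIndexSketch {
    // Each input pair is {id, url}; both fields are hypothetical sample data.
    static Map<String, SortedSet<String>> build(List<String[]> idAndUrl) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (String[] pair : idAndUrl) {
            String id = pair[0];
            String url = pair[1];
            // The mapper would emit (url, id); grouping by url stands in for shuffle and reduce.
            index.computeIfAbsent(url, k -> new TreeSet<>()).add(id);
        }
        return index;
    }

    public static void main(String[] args) {
        List<String[]> pairs = Arrays.asList(
                new String[] {"1", "a.com"},
                new String[] {"2", "a.com"},
                new String[] {"2", "b.org"});
        System.out.println(build(pairs));
    }
}
```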

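1.3 Counting with Counters

Counters are fast and easy to use. The price is memory on the TaskTracker and, more importantly, on the JobTracker. In the Hadoop 1.0 era the advice was therefore to use only tens of counters, and in any case fewer than 100. In the 2.0 era the JobTracker's role became per job, so I think counters can be used more freely, although they are still best suited to counting-style algorithms that compute totals.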
context.getCounter(STATE_COUNTER_GROUP, UNKNOWN_COUNTER).increment(1);

2 Filtering Patterns

2.1 Filtering

This one is also effectively built in: in MapReduce, a record that the Mapper does not write is filtered out. It corresponds to WHERE in SQL.

map(key, record):
    if we want to keep record then
        emit key, record

2.2 Bloom Filtering

I never knew why it is called a Bloom filter until I read the Wikipedia entry, so I'm pasting it here for everyone:
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not, thus a Bloom filter has a 100% recall rate.
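The principle is explained in this article:

http://blog.csdn.net/jiaomeng/article/details/1495500
If I had to put it in one sentence: based on the contents of the set, apply several hash functions and record the results in a bitmap. If all of a word's hash positions are set in the bitmap, the word may or may not be in the set; if any of them is not set, the word is definitely not in the set. This filter is rather interesting and is widely used in HBase. Next I need to try it out in practice.

Note: I still need to write a small program to play with this.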
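A minimal Bloom filter to experiment with might look like the sketch below. It is not Hadoop's or HBase's implementation: the class name, the bit-array size and the double-hashing scheme built on String.hashCode() are all assumptions chosen to keep the example short.

```java
import java.util.BitSet;

// Toy Bloom filter over strings; sizes and hash scheme are illustrative assumptions.
class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    BloomFilterSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Derives the i-th bit position from two base hashes (double hashing).
    private int position(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1 * 0x9E3779B9, 15) | 1; // forced odd so steps differ
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String item) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(item, i));
        }
    }

    // false means "definitely not in the set"; true means "possibly in the set".
    boolean mightContain(String item) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(item, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        filter.add("apple");
        filter.add("banana");
        System.out.println("apple: " + filter.mightContain("apple"));
        System.out.println("cherry: " + filter.mightContain("cherry"));
    }
}
```

Every added item is always reported as possibly present, which is the "no false negatives" property from the quote above. An item that was never added may still be reported as present when its positions collide with bits set by other items.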

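2.3 Top Ten (Top N)

This is a typical top-N computation, similar to TOP or LIMIT in SQL, and usually a top-N under some condition.
Implementation: I like pseudocode, since it makes things clear at a glance: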

class mapper:
    setup():
        initialize top ten sorted list

    map(key, record):
        insert record into top ten sorted list
        if length of list is greater than 10 then
            truncate list to a length of 10

    cleanup():
        for record in top ten sorted list:
            emit null, record

class reducer:
    setup():
        initialize top ten sorted list

    reduce(key, records):
        sort records
        truncate records to top 10
        for record in records:
            emit record
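The pseudocode above can be tried out in plain Java roughly as follows. This is a sketch, not a Hadoop job: each inner list stands for one input split, records are assumed to be plain integers, and a min-heap replaces the "sorted list" so that dropping the smallest element is cheap.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// In-memory sketch of the top-N pattern; no Hadoop classes are involved.
class TopTenSketch {
    // Keeps the n largest values of one split, as a mapper would before cleanup().
    static List<Integer> localTop(List<Integer> split, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap: smallest kept value on top
        for (int value : split) {
            heap.add(value);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest so at most n values remain
            }
        }
        return new ArrayList<>(heap);
    }

    // Merges the per-split candidates in a single "reducer", largest first.
    static List<Integer> topN(List<List<Integer>> splits, int n) {
        List<Integer> candidates = new ArrayList<>();
        for (List<Integer> split : splits) {
            candidates.addAll(localTop(split, n));
        }
        List<Integer> result = localTop(candidates, n);
        result.sort(Collections.reverseOrder());
        return result;
    }

    public static void main(String[] args) {
        List<List<Integer>> splits = Arrays.asList(
                Arrays.asList(5, 1, 9, 3),
                Arrays.asList(8, 2, 7),
                Arrays.asList(6, 4, 10));
        System.out.println(topN(splits, 3));
    }
}
```

The point of the local step is that each mapper forwards at most n records, so the single reducer only has to look at n records per split.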

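2.4 Distinct

This pattern is simple too: it just exploits the Reduce phase of MapReduce, where each distinct key shows up exactly once. The structure makes it clear at a glance: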

map(key, record):
    emit record,null

reduce(key, records):
    emit key
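The same idea in plain Java, as a small sketch: the whole record is used as the key, and a sorted set imitates the grouping by key that the framework performs between map and reduce. The class name and the string records are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// In-memory sketch of the distinct pattern; no Hadoop classes are involved.
class DistinctSketch {
    static List<String> distinct(List<String> records) {
        // "Map" emits (record, null); grouping by key collapses duplicates.
        Set<String> keys = new TreeSet<>(records);
        // "Reduce" emits each key once and ignores the values.
        return new ArrayList<>(keys);
    }

    public static void main(String[] args) {
        System.out.println(distinct(Arrays.asList("b", "a", "b", "c", "a")));
    }
}
```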

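3 Data Organization Patterns

3.1 Structured to Hierarchical

Algorithmically this is a join, and at the application level it has the effect of denormalization. The key point of the program is MultipleInputs, which lets you plug in several Mappers, so that content in several structured formats, or in any format, can be brought together on the reducer side and aggregated into a hierarchical format.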
MultipleInputs.addInputPath(job, new Path(args[0]),
TextInputFormat.class, PostMapper.class);
MultipleInputs.addInputPath(job, new Path(args[1]),
TextInputFormat.class, CommentMapper.class);
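There is a small trick in the Map output: each value is given a prefix according to its category, so that the Reduce phase knows where the data came from and can better perform the Cartesian product or tell the records apart. Technically this saves you from writing your own Writable, and in principle you could define a format that carries more information. Of course, if you have special sorting or grouping requirements, you still need to write a custom Writable.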
outkey.set(post.getAttribute("ParentId"));
outvalue.set("A" + value.toString());
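The prefix trick can be sketched in plain Java as shown below. The example assumes two hypothetical inputs, posts and comments, each given as {postId, text}; the prefixes "P" and "C" and the output layout are made up for this illustration and are not taken from the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory sketch of structured-to-hierarchical; no Hadoop classes are involved.
class HierarchySketch {
    // posts and comments are rows of {postId, text}; returns postId -> "postText[comment, ...]".
    static Map<String, String> build(List<String[]> posts, List<String[]> comments) {
        // "Map": tag every value with a one-character source prefix, keyed by post id.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] p : posts) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>()).add("P" + p[1]);
        }
        for (String[] c : comments) {
            grouped.computeIfAbsent(c[0], k -> new ArrayList<>()).add("C" + c[1]);
        }
        // "Reduce": use the prefix to tell the parent apart from its children.
        Map<String, String> out = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            String post = null;
            List<String> children = new ArrayList<>();
            for (String tagged : e.getValue()) {
                if (tagged.charAt(0) == 'P') {
                    post = tagged.substring(1);
                } else {
                    children.add(tagged.substring(1));
                }
            }
            if (post != null) { // comments without a matching post are dropped
                out.put(e.getKey(), post + children);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> posts = Arrays.asList(new String[] {"1", "Hello"}, new String[] {"2", "Bye"});
        List<String[]> comments = Arrays.asList(new String[] {"1", "nice"}, new String[] {"1", "+1"});
        System.out.println(build(posts, comments));
    }
}
```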

3.2 Partitioning

Here it is again: this one is built in. You write your own Partitioner to direct records to specific Reducers.
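What a partitioner decides can be shown with two small functions in plain Java. The first mirrors the formula used by Hadoop's default HashPartitioner; the second is a hypothetical custom rule that sends each year to its own reducer. Neither extends Hadoop's Partitioner class here, and the year-based rule and its parameters are assumptions for illustration.

```java
// Sketch of partitioning decisions as plain functions; no Hadoop classes are involved.
class PartitionSketch {
    // Non-negative hash of the key modulo the number of reducers.
    static int hashPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Hypothetical custom rule: one reducer per year, counted from minYear.
    // Years outside the expected range are clamped to the first or last reducer.
    static int yearPartition(int year, int minYear, int numPartitions) {
        int partition = year - minYear;
        return Math.max(0, Math.min(numPartitions - 1, partition));
    }

    public static void main(String[] args) {
        System.out.println(hashPartition("user42", 4));
        System.out.println(yearPartition(2010, 2008, 4));
    }
}
```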

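3.3 Binning

This one is interesting. It is similar to partitioning, but it is map-side only and therefore more efficient, although the results may need a further merge.
The SPLIT operation in Pig implements this pattern.
The implementation uses MultipleOutputs.addNamedOutput().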

// Configure the MultipleOutputs by adding an output called "bins"
// With the proper output format and mapper key/value pairs

MultipleOutputs.addNamedOutput(job, "bins", TextOutputFormat.class,Text.class, NullWritable.class);

// Enable the counters for the job
// If there are a significant number of different named outputs, this
// should be disabled

MultipleOutputs.setCountersEnabled(job, true);

// Map-only job
job.setNumReduceTasks(0);

3.4 Total Order Sorting

This was already described in detail in the Hadoop part, so it is omitted here.

3.5 Shuffling

The essence of this pattern is the creation of a random key.
outkey.set(rndm.nextInt());
context.write(outkey, outvalue);
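To see why a random key is enough, the sketch below imitates the effect in plain Java: each record gets a random integer key, and a sorted map stands in for the framework's sort by key. The seed parameter and the class name are assumptions added so the example is repeatable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.TreeMap;

// In-memory sketch of the shuffling pattern; no Hadoop classes are involved.
class ShuffleSketch {
    static List<String> shuffle(List<String> records, long seed) {
        Random rndm = new Random(seed);
        // "Map": emit (random int, record). The sorted map imitates the sort by key.
        TreeMap<Integer, List<String>> byKey = new TreeMap<>();
        for (String record : records) {
            byKey.computeIfAbsent(rndm.nextInt(), k -> new ArrayList<>()).add(record);
        }
        // "Reduce": write the values out in key order and discard the keys.
        List<String> out = new ArrayList<>();
        for (List<String> group : byKey.values()) {
            out.addAll(group);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(shuffle(Arrays.asList("a", "b", "c", "d", "e"), 42L));
    }
}
```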

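4 Join Patterns

4.1 Reduce Side Join

This one is fairly simple and was already covered under Structured to Hierarchical.

4.2 Replicated Join (Map-Side Join)

A map-side join is more efficient, but the smaller data set has to be placed in the DistributedCache; the other data set is then fed to map, and the join is completed inside map.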
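The core of a replicated join is a hash lookup, which the following plain-Java sketch tries to show. The in-memory map stands in for the small data set that a real job would load from the distributed cache in setup(); the user and comment fields are hypothetical sample data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of a replicated (map-side) join; no Hadoop classes are involved.
class ReplicatedJoinSketch {
    // small: userId -> user name, assumed to fit in memory.
    // large: rows of {userId, comment}, streamed through "map" one at a time.
    static List<String> join(Map<String, String> small, List<String[]> large) {
        List<String> out = new ArrayList<>();
        for (String[] row : large) {
            String user = small.get(row[0]);
            if (user != null) { // inner join: rows without a match are skipped
                out.add(user + "\t" + row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = new HashMap<>();
        users.put("1", "alice");
        users.put("2", "bob");
        List<String[]> comments = Arrays.asList(
                new String[] {"1", "hi"},
                new String[] {"3", "unmatched"},
                new String[] {"2", "yo"});
        System.out.println(join(users, comments));
    }
}
```

No shuffle or reduce step is needed, which is where the efficiency comes from; the constraint is that the small side must fit in each mapper's memory.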

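4.3 Composite Join

This pattern is also a map-side join, and it can join two large data sets. Its limitation is that the data sets must be organized in the same way. What does "organized in the same way" mean? The pattern applies when: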
• An inner or full outer join is desired.
• All the data sets are sufficiently large.
• All data sets can be read with the foreign key as the input key to the mapper.
• All data sets have the same number of partitions.
• Each partition is sorted by foreign key, and all the foreign keys reside in the associated partition of each data set. That is, partition X of data sets A and B contain the same foreign keys and these foreign keys are present only in partition X. For a visualization of this partitioning and sorting key, refer to Figure 5-3.
• The data sets do not change often (if they have to be prepared).

// The composite input format join expression will set how the records
// are going to be read in, and in what input format.
conf.set("mapred.join.expr", CompositeInputFormat.compose(joinType,
KeyValueTextInputFormat.class, userPath, commentPath));
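The sorting and partitioning requirements listed above are what make a merge-style join possible. The sketch below shows that idea for one pair of matching partitions in plain Java; it is my own illustration of the principle, not the code inside CompositeInputFormat, and the {key, value} row layout is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// In-memory sketch of a merge join over two sorted partitions; no Hadoop classes are involved.
class CompositeJoinSketch {
    // a and b are rows of {key, value}, each list already sorted by key.
    static List<String> mergeJoin(List<String[]> a, List<String[]> b) {
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i)[0].compareTo(b.get(j)[0]);
            if (cmp < 0) {
                i++; // key only in a: no output for an inner join
            } else if (cmp > 0) {
                j++; // key only in b
            } else {
                String key = a.get(i)[0];
                int runStart = j;
                // Pair every a-row carrying this key with every b-row carrying it.
                while (i < a.size() && a.get(i)[0].equals(key)) {
                    for (j = runStart; j < b.size() && b.get(j)[0].equals(key); j++) {
                        out.add(key + "\t" + a.get(i)[1] + "\t" + b.get(j)[1]);
                    }
                    i++;
                }
                // j now points just past b's run of this key.
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[] {"1", "alice"}, new String[] {"2", "bob"}, new String[] {"4", "dave"});
        List<String[]> comments = Arrays.asList(
                new String[] {"1", "c1"}, new String[] {"1", "c2"},
                new String[] {"3", "c3"}, new String[] {"4", "c4"});
        System.out.println(mergeJoin(users, comments));
    }
}
```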

4.4 Cartesian Product

This requires writing a custom InputFormat so that the two parts of the data can be combined at the record level. The sample is omitted.

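5 MetaPatterns

5.1 Job Chaining

There are several ways to do this: the chain can be written in the driver, or driven from a bash script. Hadoop also provides JobControl, which offers useful features such as tracking failed jobs.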

5.2 Chain Folding

The ChainMapper and ChainReducer approach: M+ R M* (one or more mappers, then one reducer, then zero or more mappers).

5.3 Job Merging

Merging jobs means combining, at the code level, the mappers and reducers of two jobs that read the same data, which saves I/O and parsing time.

6 Input and Output Patterns

6.1 Customizing Input and Output in Hadoop

InputFormat
  getSplits
  createRecordReader
InputSplit
  getLength
  getLocations
RecordReader
  initialize
  getCurrentKey and getCurrentValue
  nextKeyValue
  getProgress
  close
OutputFormat
  checkOutputSpecs
  getRecordWriter
  getOutputCommitter
RecordWriter
  write
  close

6.2 Generating Data

Key point: build fake InputSplits. Unlike a FileSplit, such a split is not based on blocks, so you simply have to fool Hadoop.
