Advanced MapReduce Programming: Custom InputFormat

Shared by LinuxBoy on 2019-03-31 12:03:09

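InputFormat is a concept that comes up constantly in MapReduce, but what role does it actually play while a job runs? InputFormat is an interface with two methods:

    public interface InputFormat<K, V> {
        InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

        RecordReader<K, V> getRecordReader(InputSplit split,
                                           JobConf job,
                                           Reporter reporter) throws IOException;
    }

The two methods do the following work:

getSplits divides the input data into splits. The number of splits is the number of map tasks, and the split size defaults to the HDFS block size, which is 64 MB in Hadoop 1.x.

getRecordReader parses each split into records, and then parses each record into a <K,V> pair.

In other words, InputFormat carries out the following transformation: InputFile --> splits --> <K,V>

Which InputFormat implementations does the framework commonly provide?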
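Before turning to the built-in formats, the InputFile --> splits --> <K,V> flow can be made concrete with a small sketch. It is plain Java and does not use any Hadoop classes: Split, getSplits and readRecords here are hypothetical stand-ins that only mimic the two roles described above. The rule that a line belongs to the split containing its first byte is a simplifying assumption of this sketch, not necessarily the exact rule Hadoop applies.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /** Hypothetical stand-in for a file split: the byte range [start, start + length). */
    static final class Split {
        final long start;
        final long length;
        Split(long start, long length) { this.start = start; this.length = length; }
    }

    /** Step 1 (the getSplits role): cut the input into fixed-size byte ranges. */
    static List<Split> getSplits(long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    /** Step 2 (the getRecordReader role): turn one split into "offset<TAB>line" records. */
    static List<String> readRecords(byte[] data, Split split) {
        List<String> records = new ArrayList<>();
        int pos = (int) split.start;
        int end = (int) (split.start + split.length);
        // Assumption of this sketch: a line belongs to the split that contains
        // its first byte, so skip the tail of a line begun in the previous split.
        if (pos > 0 && data[pos - 1] != '\n') {
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++; // step past the newline
        }
        while (pos < end && pos < data.length) {
            int lineEnd = pos;
            while (lineEnd < data.length && data[lineEnd] != '\n') lineEnd++;
            String line = new String(data, pos, lineEnd - pos, StandardCharsets.UTF_8);
            records.add(pos + "\t" + line); // key = byte offset, value = line content
            pos = lineEnd + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        for (Split s : getSplits(data.length, 8)) {
            System.out.println("split@" + s.start + " -> " + readRecords(data, s));
        }
    }
}
```

With the 17-byte sample input and a split size of 8, the first split yields the lines at offsets 0 and 6, the second yields the line at offset 11, and the third yields nothing, so every line is emitted exactly once even though the split boundaries fall in the middle of lines.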

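Of these, TextInputFormat is the most commonly used. Its <K,V> is <line offset, line content>, where the offset is the byte position at which the line starts in the file.

However, these fixed, built-in ways of turning an InputFile into <K,V> pairs do not always meet our needs. In that case we write a custom InputFormat, so that the Hadoop framework parses the InputFile into <K,V> pairs the way we specify.

Before writing a custom InputFormat, it helps to understand the following abstract classes and interfaces and how they relate: InputFormat (interface), FileInputFormat (abstract class), TextInputFormat (class), RecordReader (interface) and LineRecordReader (class).

FileInputFormat implements InputFormat
TextInputFormat extends FileInputFormat
TextInputFormat.getRecordReader calls LineRecordReader
LineRecordReader implements RecordReader

The InputFormat interface has already been described in detail above. FileInputFormat implements the getSplits method of InputFormat and leaves getRecordReader to concrete classes such as TextInputFormat, which may also override isSplitable. isSplitable usually does not need to be changed, so a custom InputFormat only has to implement getRecordReader. The core of that method is constructing a RecordReader (LineRecordReader in the case of TextInputFormat). It is the LineRecordReader class that does the work of "parsing each split into records, then parsing each record into a <K,V> pair", and it implements the RecordReader interface:

    public interface RecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
        K createKey();
        V createValue();
        long getPos() throws IOException;
        public void close() throws IOException;
        float getProgress() throws IOException;
    }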
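To show how these six methods fit together, here is a minimal sketch of a custom reader and the loop that consumes it. It is plain Java with a local copy of the interface shape shown above, not the Hadoop class itself. KeyValueLineReader, its "name=value" line format and the drain helper are all hypothetical names invented for illustration, and StringBuilder stands in for a reusable, mutable key and value type. In a real job, a custom InputFormat would return such a reader from its getRecordReader method.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class RecordReaderSketch {
    /** Local copy of the interface shape from the article; not the Hadoop class. */
    interface RecordReader<K, V> {
        boolean next(K key, V value) throws IOException;
        K createKey();
        V createValue();
        long getPos() throws IOException;
        void close() throws IOException;
        float getProgress() throws IOException;
    }

    /** Hypothetical reader: parses "name=value" lines into <name, value> pairs. */
    static final class KeyValueLineReader implements RecordReader<StringBuilder, StringBuilder> {
        private final String[] lines;
        private int index = 0;

        KeyValueLineReader(String input) {
            this.lines = input.isEmpty() ? new String[0] : input.split("\n");
        }

        @Override public StringBuilder createKey() { return new StringBuilder(); }
        @Override public StringBuilder createValue() { return new StringBuilder(); }

        @Override
        public boolean next(StringBuilder key, StringBuilder value) {
            if (index >= lines.length) return false; // no more records
            String line = lines[index++];
            int eq = line.indexOf('=');
            // Overwrite the caller's objects instead of allocating new ones.
            key.setLength(0);
            value.setLength(0);
            key.append(eq < 0 ? line : line.substring(0, eq));
            value.append(eq < 0 ? "" : line.substring(eq + 1));
            return true;
        }

        // In this sketch the position counts consumed records, not bytes.
        @Override public long getPos() { return index; }
        @Override public float getProgress() {
            return lines.length == 0 ? 1.0f : (float) index / lines.length;
        }
        @Override public void close() { }
    }

    /** Consumes a reader with one reused key object and one reused value object. */
    static List<String> drain(RecordReader<StringBuilder, StringBuilder> reader) {
        List<String> pairs = new ArrayList<>();
        StringBuilder key = reader.createKey();
        StringBuilder value = reader.createValue();
        try {
            while (reader.next(key, value)) {
                pairs.add("<" + key + ", " + value + ">");
            }
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(drain(new KeyValueLineReader("host=node1\nport=8020")));
    }
}
```

The design point worth noticing is that next fills in objects supplied by the caller rather than returning new ones, which is why the interface also exposes createKey and createValue: the caller creates one key and one value up front and reuses them for every record.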
