Generating a unique id for every line of a large data file

Four main approaches:

1 Single-threaded processing

2 Plain multithreading

3 Hive

4 Hadoop
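As a rough illustration of approaches 1 and 2, here is a minimal Java sketch. It assumes the lines fit in memory as a List, which a truly large file would not, and the class and method names (LineIdSketch, numberSequentially, numberInParallel) are made up for this example. The multithreaded variant shares one AtomicLong between worker threads, so the ids are unique but not necessarily in line order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
class LineIdSketch {

    // Approach 1: a single thread walks the lines in order and numbers them from 1.
    static List<String> numberSequentially(List<String> lines) {
        List<String> out = new ArrayList<>(lines.size());
        long id = 0;
        for (String line : lines) {
            id++;
            out.add(id + "\t" + line);
        }
        return out;
    }

    // Approach 2: worker threads draw ids from one shared atomic counter.
    // Every id is handed out exactly once, but which line gets which id
    // depends on thread scheduling.
    static List<String> numberInParallel(List<String> lines) {
        AtomicLong counter = new AtomicLong();
        return lines.parallelStream()
                .map(line -> counter.incrementAndGet() + "\t" + line)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c");
        numberSequentially(lines).forEach(System.out::println);
        numberInParallel(lines).forEach(System.out::println);
    }
}
```

The shared counter is the simplest way to keep ids unique across threads; it becomes a point of contention when there are many threads, which is one reason to look at the Hive and Hadoop options for really large inputs.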

 

Some references I found:


Notes on 《Hadoop实战》 (Hadoop in Action), part 2: Hadoop input and output

https://book.douban.com/annotation/17068812/

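TextInputFormat: the key is the byte offset of the line within the file, and the value is the full line.

However, this offset appears to be relative to a single file rather than global, so on its own it is not unique across multiple input files.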
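One way around the per-file offset, sketched here in plain Java without any Hadoop dependency, is to combine the file name with the offset. The class OffsetIdSketch and its method are hypothetical; the sketch assumes UTF-8 text with '\n' line endings. In a real mapper the file name would, as far as I know, come from the input split (for example via FileSplit.getPath()), which is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class OffsetIdSketch {

    // Returns one id per line in the form "fileName:offset", where offset is
    // the byte position at which the line starts within that file.
    static List<String> idsFor(String fileName, String content) {
        List<String> ids = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            ids.add(fileName + ":" + offset);
            // Advance past the line's bytes plus the '\n' terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return ids;
    }

    public static void main(String[] args) {
        // Both files have lines starting at offsets 0 and 2, so the bare
        // offsets collide; the file name prefix keeps the ids distinct.
        System.out.println(idsFor("part-0", "a\nbb\n"));
        System.out.println(idsFor("part-1", "x\nyy\n"));
    }
}
```

The resulting ids are unique but are strings rather than consecutive integers, so this only helps when any unique key will do.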

 

Generate Auto-increment Id in Map-reduce Job

http://shzhangji.com/blog/2013/10/31/generate-auto-increment-id-in-map-reduce-job/
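A common scheme for numeric ids across parallel tasks is to interleave them: task k out of n starts at k + 1 and steps by n. The sketch below is plain Java and is my own illustration of that general idea, not code taken from the linked post; InterleavedIdSketch, taskIndex and numTasks are hypothetical names, and in a real job the task index and task count would have to come from the job context.

```java
// Hypothetical helper class, for illustration only.
class InterleavedIdSketch {

    // Ids for the rows handled by one task:
    // taskIndex + 1, taskIndex + 1 + numTasks, taskIndex + 1 + 2 * numTasks, ...
    static long[] idsForTask(int taskIndex, int numTasks, int rowCount) {
        long[] ids = new long[rowCount];
        long next = taskIndex + 1L;
        for (int i = 0; i < rowCount; i++) {
            ids[i] = next;
            next += numTasks; // skip the ids reserved for the other tasks
        }
        return ids;
    }

    public static void main(String[] args) {
        int numTasks = 3;
        for (int task = 0; task < numTasks; task++) {
            System.out.println("task " + task + ": "
                    + java.util.Arrays.toString(idsForTask(task, numTasks, 3)));
        }
    }
}
```

With three tasks, task 0 gets 1, 4, 7 and task 1 gets 2, 5, 8, so no two tasks ever produce the same id. The ids are only gap-free if every task processes the same number of rows.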

 

Generate unique customer id / insert unique rows in hive

http://stackoverflow.com/questions/26855003/generate-unique-customer-id-insert-unique-rows-in-hive

 

Need to add auto increment column in a table using hive

http://stackoverflow.com/questions/23082763/need-to-add-auto-increment-column-in-a-table-using-hive

 

 

https://hadooptutorial.info/writing-custom-udf-in-hive-auto-increment-column-hive/

Here make sure that addition of annotation @UDFType(stateful = true) is required otherwise counter value will not get increment in the Hive column, it will just return value 1 for all the rows but not the actual row number.

 

In the end I went with writing a UDF in Hive.


package hive.udf;
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.hive.ql.udf.UDFType;

/**
 * UDFRowSequence.
 */
@Description(name = "row_sequence",
    value = "_FUNC_() - Returns a generated row sequence number starting from 1")
@UDFType(deterministic = false, stateful = true) // stateful = true is required; without it every row gets 1
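public class UDFRowSequence extends UDF
{
  // long rather than int: a large file can have more rows than an int can count.
  // The counter lives in this UDF instance, so ids are unique only within one task.
  private long result;

  public UDFRowSequence() {
    result = 0;
  }

  // Called once per row; returns 1, 2, 3, ...
  public long evaluate() {
    result++;
    return result;
  }
}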

// End UDFRowSequence.java

 

Author: linger

Permalink: http://blog.csdn.net/lingerlanlan/article/details/46430747


