Hive自定义函数的使用——useragent解析,hiveuseragent


想要从日志数据中分析一下操作系统、浏览器、版本使用情况,但是hive中的函数不能直接解析useragent,于是可以写一个UDF来解析。useragent用于表示用户的当前操作系统,浏览器版本信息,形如:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 180.173.196.29

其中解析ua可以用一个开源的工具包,叫做useragentutils.jar来处理,但是不能直接引入这个包,因为hadoop和hive都不支持直接引用第三方的包,要导入源码。项目结构应该如下图


下面的代码用来打印出操作系统、浏览器版本信息:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

import eu.bitwalker.useragentutils.UserAgent;

public class ParseUserAgent_UDF extends UDF{
	public Text evaluate(final Text userAgent){
		StringBuilder builder = new StringBuilder();
		UserAgent ua = new UserAgent(userAgent.toString());
		builder.append(ua.getOperatingSystem()+"\t"+ua.getBrowser()+"\t"+ua.getBrowserVersion());
		return new Text(builder.toString());
	}
}
使用:打成jar包,hive中add jar xx.jar;

create temporary function ua_parse as 'com.xx.ParseUserAgent_UDF';

select ua_parse(ua) from table_name limit 3;

结果:

WINDOWS_7       CHROME21        21.0.1180.89
WINDOWS_7       CHROME33        33.0.1750.146
WINDOWS_7       CHROME21        21.0.1180.89

此种方式只能处理一行,生成一行,无法进行统计分析。

下面使用UDTF(User Defined Table Generating Function),处理一行,生成多列。

import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import eu.bitwalker.useragentutils.UserAgent;

public class ParseUserAgent_UDTF extends GenericUDTF{
	@Override
	public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException {
		if (args.length != 1) {
			throw new UDFArgumentLengthException("ExplodeMap takes only one argument");
		}
		if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
			throw new UDFArgumentException("ExplodeMap takes string as a parameter");
		}
		ArrayList<String> fieldNames = new ArrayList<String>();
		ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
		fieldNames.add("system");
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
		fieldNames.add("browser");
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
		fieldNames.add("version");
		fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
		return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
	}
	@Override
	public void process(Object[] arg){
		try {
			if(arg == null || arg.length == 0)
				return;
			String input = arg[0].toString();
			String result[] = ua_parse(input).split("\t");
			forward(result);
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
	
	@Override
	public void close() throws HiveException {
		
	}
	public String ua_parse(String userAgent){
		StringBuilder builder = new StringBuilder();
		UserAgent ua = new UserAgent(userAgent.toString());
		builder.append(ua.getOperatingSystem()+"\t"+ua.getBrowser()+"\t"+ua.getBrowserVersion());
		return builder.toString();
	}
}
select t.browser,count(*) c from (select ua_parse(ua) as (system,browser,version) from table_name) t group by t.browser order by c desc;
前十名:

CHROME31        987220571
UNKNOWN 708890045
IE8     420021677
IE7     411500373
MOBILE_SAFARI   291920740
IE6     217574865
IE11    179582201
IE9     165160040
CHROME30        158623163
CHROME21        155192489

未识别的还是很多!


参考:http://blog.csdn.net/ruidongliu/article/details/8791865

http://computerdragon.blog.51cto.com/6235984/1288567



hive中怎调用python函数

ADD FILE /home/taobao/dw_hive/hivelets/smoking/ext/tsa/hivesql/bjx_topic_t1/splitsysin.py.bak;
create table if not exists splittest_t1
(
topic_id string,
topic_title string,
topic_desc string,
biz_date string,
gmt_create string
) PARTITIONED BY(pt string)
row format delimited fields terminated by '\001'
lines terminated by '\n'
STORED AS textfile;

select TRANSFORM(topic_id,topic_title,topic_desc,biz_date,gmt_create)
USING 'splitsysin.py'
as topic_id,topic_title,topic_desc,biz_date,gmt_create
from r_bjx_dim_topic_t1;
 

oracle中与minus同样功可以的hive函数是什,举个使用的例子

hive功能只是近似sql,功能并没有那么强大,没有这种函数,你可以自己用<=,>=之类的自己实现这个效果
 

相关内容