Many readers are unsure what Distinct Count is used for, so this article summarizes the topic with step-by-step examples in Hive, Pig, and Spark.
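Big data is an IT industry term for data sets that cannot be captured, managed, and processed with conventional software tools within an acceptable time frame. It denotes massive, fast-growing, and diverse information assets that require new processing models to deliver stronger decision-making, insight discovery, and process optimization.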
Hive
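In big-data scenarios, one of the most important report metrics is UV (Unique Visitors), the number of distinct users within a given period. For example, to see how users are distributed across apps over one week, the HiveQL is as follows (the query needs a group by clause because app is selected next to an aggregate):
    select app, count(distinct uid) as uv
    from log_table
    where week_cal = '2016-03-27'
    group by app;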
Pig
Pig works similarly:
    -- all users
    define DISTINCT_COUNT(A, a) returns dist {
        B = foreach $A generate $a;
        unique_B = distinct B;
        C = group unique_B all;
        $dist = foreach C generate SIZE(unique_B);
    };
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = DISTINCT_COUNT(A, uid);

    -- distinct users per app
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = distinct A;
    C = group B by app;
    D = foreach C generate group as app, COUNT($1) as uv;
    -- suitable for small cardinality scenarios
    D = foreach C generate group as app, SIZE($1) as uv;
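DataFu provides a cardinality-estimation UDF for Pig, datafu.pig.stats.HyperLogLogPlusPlus. It uses the HyperLogLog++ algorithm, which trades exactness for speed and returns an approximate Distinct Count:
    -- the DataFu jar must be registered before this UDF can be used
    define HyperLogLogPlusPlus datafu.pig.stats.HyperLogLogPlusPlus();
    A = load '/path/to/data' using PigStorage() as (app, uid);
    B = group A by app;
    C = foreach B generate group as app, HyperLogLogPlusPlus($1) as uv;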
Spark
In Spark, after loading the data, Distinct Count is computed through a chain of RDD transformations (map, distinct, and reduceByKey):
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map { line => (line._1, 1) }
      .reduceByKey(_ + _)

    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .mapValues { _ => 1 }
      .reduceByKey(_ + _)

    // or
    rdd.map { row => (row.app, row.uid) }
      .distinct()
      .map(_._1)
      .countByValue()
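To make the logic of the pipeline above concrete without a cluster, here is a minimal sketch that applies the same steps to an ordinary Scala collection. The `LogRow` case class, the `ExactUv` object, and the sample data are hypothetical names introduced only for illustration; they are not part of Spark or of the original article.

```scala
// Hypothetical record type standing in for a parsed log row.
final case class LogRow(app: String, uid: String)

object ExactUv {
  // Mirrors the RDD pipeline: project to (app, uid), drop duplicate pairs,
  // then count the surviving pairs per app.
  def uvByApp(rows: Seq[LogRow]): Map[String, Int] =
    rows
      .map(row => (row.app, row.uid))
      .distinct
      .groupBy(_._1)
      .map { case (app, pairs) => app -> pairs.size }

  // Illustrative sample: u1 visits "mail" twice and must be counted once.
  val sample: Seq[LogRow] = Seq(
    LogRow("mail", "u1"),
    LogRow("mail", "u1"),
    LogRow("mail", "u2"),
    LogRow("maps", "u1")
  )
}
```

Calling `ExactUv.uvByApp(ExactUv.sample)` counts two users for "mail" and one for "maps". The key point is that deduplication happens on the (app, uid) pair before counting, which is exactly why the exact approach becomes expensive at scale: every distinct pair has to be materialized and shuffled.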
Spark also provides an API for approximate Distinct Count:
    rdd.map { row => (row.app, row.uid) }
      .countApproxDistinctByKey(0.001)
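The argument 0.001 is the target relative standard deviation (the `relativeSD` parameter). A smaller value means a more accurate estimate but a larger sketch per key. The following is a rough sizing sketch, assuming the textbook HyperLogLog error bound of about 1.04 / sqrt(m) for m = 2^p registers. The `HllSizing` object and its method names are hypothetical, and the constant and per-register storage that Spark uses internally may differ.

```scala
object HllSizing {
  // Smallest precision p such that 1.04 / sqrt(2^p) <= relativeSD,
  // i.e. p >= 2 * log2(1.04 / relativeSD). Assumes the textbook bound.
  def precisionFor(relativeSD: Double): Int = {
    require(relativeSD > 0.0 && relativeSD < 1.0, "relativeSD must be in (0, 1)")
    math.ceil(2.0 * math.log(1.04 / relativeSD) / math.log(2.0)).toInt
  }

  // Number of registers implied by that precision.
  def registersFor(relativeSD: Double): Long = 1L << precisionFor(relativeSD)
}
```

Under these assumptions, `HllSizing.precisionFor(0.001)` gives p = 21, roughly two million registers per key, whereas a relative standard deviation of 0.05 gives p = 9, or 512 registers. With many distinct apps, a very small `relativeSD` can therefore cost far more memory than the accuracy gain is worth.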
countApproxDistinctByKey is implemented on top of the HyperLogLog algorithm, as its API documentation states:
The algorithm used is based on streamlib's implementation of "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", available here.
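To show the idea behind this family of algorithms, here is a minimal HyperLogLog sketch in plain Scala. It is an illustration only, not streamlib's or Spark's implementation: the `SimpleHll` class is a hypothetical name, it uses a 32-bit hash from the Scala standard library, and it omits the large-range correction and the HyperLogLog++ refinements described in the cited paper.

```scala
import scala.util.hashing.MurmurHash3

// Minimal HyperLogLog with 2^p registers over a 32-bit hash (illustrative only).
final class SimpleHll(p: Int) {
  require(p >= 7 && p <= 16, "this sketch only supports 7 <= p <= 16")

  private val m: Int = 1 << p
  private val registers = new Array[Byte](m)
  // Bias-correction constant, valid for m >= 128.
  private val alpha: Double = 0.7213 / (1.0 + 1.079 / m)

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p) // the top p bits select a register
    val rest = h << p        // the remaining bits, left-aligned
    // Rank = position of the first 1-bit among the remaining (32 - p) bits.
    val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  def estimate: Double = {
    val mD = m.toDouble
    val harmonicSum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * mD * mD / harmonicSum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting while registers are empty.
    if (raw <= 2.5 * mD && zeros > 0) mD * math.log(mD / zeros) else raw
  }
}
```

Each register keeps only the largest rank it has seen, so adding the same uid again never changes the state. That is what makes the estimate insensitive to duplicates and lets the memory footprint stay fixed at m registers regardless of how many rows are processed.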
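Alternatively, convert an RDD that carries a schema into a DataFrame, call registerTempTable (superseded by createOrReplaceTempView in Spark 2.0 and later), and run a SQL query:
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // brings rdd.toDF() into scope

    val df = rdd.toDF()
    df.registerTempTable("app_table")
    val appUsers = sqlContext.sql(
      "select app, count(distinct uid) as uv from app_table group by app")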