Hi, Yin
Thanks for your time. This is the result.
------------------
scala> eventDF.filter($"entityType" ===
"user").select("entityId").distinct.explain(true)
== Parsed Logical Plan ==
Aggregate [entityId#16], [entityId#16]
Project [entityId#16]
Filter (entityType#15 = user)
Project [_1#0 AS id#13,_2#1 AS event#14,_3#2 AS entityType#15,_4#3 AS
entityId#16,_5#4 AS targetEntityType#17,_6#5 AS targetEntityId#18,_7#6 AS
properties#19,_8#7 AS eventTime#20,_9#8 AS eventTimeZone#21,_10#9 AS
tags#22,_11#10 AS prId#23,_12#11 AS creationTime#24,_13#12 AS
creationTimeZone#25]
LogicalRDD
[_1#0,_2#1,_3#2,_4#3,_5#4,_6#5,_7#6,_8#7,_9#8,_10#9,_11#10,_12#11,_13#12],
MapPartitionsRDD[6] at rddToDataFrameHolder at <console>:61
== Analyzed Logical Plan ==
entityId: string
Aggregate [entityId#16], [entityId#16]
Project [entityId#16]
Filter (entityType#15 = user)
Project [_1#0 AS id#13,_2#1 AS event#14,_3#2 AS entityType#15,_4#3 AS
entityId#16,_5#4 AS targetEntityType#17,_6#5 AS targetEntityId#18,_7#6 AS
properties#19,_8#7 AS eventTime#20,_9#8 AS eventTimeZone#21,_10#9 AS
tags#22,_11#10 AS prId#23,_12#11 AS creationTime#24,_13#12 AS
creationTimeZone#25]
LogicalRDD
[_1#0,_2#1,_3#2,_4#3,_5#4,_6#5,_7#6,_8#7,_9#8,_10#9,_11#10,_12#11,_13#12],
MapPartitionsRDD[6] at rddToDataFrameHolder at <console>:61
== Optimized Logical Plan ==
Aggregate [entityId#16], [entityId#16]
Project [entityId#16]
Filter (entityType#15 = user)
InMemoryRelation
[id#13,event#14,entityType#15,entityId#16,targetEntityType#17,targetEntityId#18,properties#19,eventTime#20,eventTimeZone#21,tags#22,prId#23,creationTime#24,creationTimeZone#25],
true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject
[_1#0 AS id#13,_2#1 AS event#14,_3#2 AS entityType#15,_4#3 AS
entityId#16,_5#4 AS targetEntityType#17,_6#5 AS targetEntityId#18,_7#6 AS
properties#19,_8#7 AS eventTime#20,_9#8 AS eventTimeZone#21,_10#9 AS
tags#22,_11#10 AS prId#23,_12#11 AS creationTime#24,_13#12 AS
creationTimeZone#25]), None
== Physical Plan ==
TungstenAggregate(key=[entityId#16], functions=[], output=[entityId#16])
TungstenExchange hashpartitioning(entityId#16)
TungstenAggregate(key=[entityId#16], functions=[], output=[entityId#16])
Project [entityId#16]
Filter (entityType#15 = user)
InMemoryColumnarTableScan [entityId#16,entityType#15], [(entityType#15
= user)], (InMemoryRelation
[id#13,event#14,entityType#15,entityId#16,targetEntityType#17,targetEntityId#18,properties#19,eventTime#20,eventTimeZone#21,tags#22,prId#23,creationTime#24,creationTimeZone#25],
true, 10000, StorageLevel(true, true, false, true, 1), (TungstenProject
[_1#0 AS id#13,_2#1 AS event#14,_3#2 AS entityType#15,_4#3 AS
entityId#16,_5#4 AS targetEntityType#17,_6#5 AS targetEntityId#18,_7#6 AS
properties#19,_8#7 AS eventTime#20,_9#8 AS eventTimeZone#21,_10#9 AS
tags#22,_11#10 AS prId#23,_12#11 AS creationTime#24,_13#12 AS
creationTimeZone#25]), None)
Code Generation: true
scala>
2015-11-06 5:27 GMT+09:00 Yin Huai <[email protected]>:
> Can you attach the result of eventDF.filter($"entityType" ===
> "user").select("entityId").distinct.explain(true)?
>
> Thanks,
>
> Yin
>
> On Thu, Nov 5, 2015 at 1:12 AM, 千成徳 <[email protected]> wrote:
>
>> Hi All,
>>
>> I have data frame like this.
>>
>> Equality expression is not working in 1.5.1 but, works as expected in
>> 1.4.0
>> What is the difference?
>>
>> scala> eventDF.printSchema()
>> root
>> |-- id: string (nullable = true)
>> |-- event: string (nullable = true)
>> |-- entityType: string (nullable = true)
>> |-- entityId: string (nullable = true)
>> |-- targetEntityType: string (nullable = true)
>> |-- targetEntityId: string (nullable = true)
>> |-- properties: string (nullable = true)
>>
>> scala> eventDF.groupBy("entityType").agg(countDistinct("entityId")).show
>> +----------+------------------------+
>> |entityType|COUNT(DISTINCT entityId)|
>> +----------+------------------------+
>> | ib_user| 4751|
>> | user| 2091|
>> +----------+------------------------+
>>
>>
>> ----- not works ( Bug ? )
>> scala> eventDF.filter($"entityType" ===
>> "user").select("entityId").distinct.count
>> res151: Long = 1219
>>
>> scala> eventDF.filter(eventDF("entityType") ===
>> "user").select("entityId").distinct.count
>> res153: Long = 1219
>>
>> scala> eventDF.filter($"entityType" equalTo
>> "user").select("entityId").distinct.count
>> res149: Long = 1219
>>
>> ----- works as expected
>> scala> eventDF.map{ e => (e.getAs[String]("entityId"),
>> e.getAs[String]("entityType")) }.filter(x => x._2 ==
>> "user").map(_._1).distinct.count
>> res150: Long = 2091
>>
>> scala> eventDF.filter($"entityType" in
>> "user").select("entityId").distinct.count
>> warning: there were 1 deprecation warning(s); re-run with -deprecation
>> for details
>> res155: Long = 2091
>>
>> scala> eventDF.filter($"entityType" !==
>> "ib_user").select("entityId").distinct.count
>> res152: Long = 2091
>>
>>
>> But, All of above code works in 1.4.0
>>
>> Thanks.
>>
>>
>
--
-------------------------------------------------------
千 成徳 (Sondoku Chon)
株式会社オプトホールディング http://www.opt.ne.jp/holding/
データサイエンスラボ https://datasciencelab.jp/
ビックデータアーキテクト
〒102-0081 東京都千代田区四番町6東急番町ビル
Tel:080-4581-9708
Fax:050-3156-7346
-------------------------------------------------------