Re: How to query a query with not contain, not start_with, not end_with condition effective?

Yong Zhang Tue, 21 Feb 2017 10:18:26 -0800

Sorry, didn't pay attention to the originally requirement.


Did you try the left outer join, or left semi join?

What is the explain plan when you use "not in"? Is it leading to a 
broadcastNestedLoopJoin?


spark.sql("select user_id from data where user_id not in (select user_id from 
data where url like '%sell%')").explain(true)


Yong


________________________________
From: Sidney Feiner <[email protected]>
Sent: Tuesday, February 21, 2017 10:46 AM
To: Yong Zhang; Chanh Le; user @spark
Subject: RE: How to query a query with not contain, not start_with, not 
end_with condition effective?


Chanh wants to return user_id's that don't have any record with a url 
containing "sell". Without a subquery/join, it can only filter per record 
without knowing about the rest of the user_id's record



Sidney Feiner   /  SW Developer

M: +972.528197720  /  Skype: sidney.feiner.startapp



[StartApp]<http://www.startapp.com/>



From: Yong Zhang [mailto:[email protected]]
Sent: Tuesday, February 21, 2017 4:10 PM
To: Chanh Le <[email protected]>; user @spark <[email protected]>
Subject: Re: How to query a query with not contain, not start_with, not 
end_with condition effective?



Not sure if I misunderstand your question, but what's wrong doing it this way?



scala> spark.version

res6: String = 2.0.2

scala> val df = Seq((1,"lao.com/sell"), (2, "lao.com/buy")).toDF("user_id", 
"url")

df: org.apache.spark.sql.DataFrame = [user_id: int, url: string]



scala> df.registerTempTable("data")

warning: there was one deprecation warning; re-run with -deprecation for details



scala> spark.sql("select user_id from data where url not like '%sell%'").show

+-------+

|user_id|

+-------+

|      2|

+-------+



Yong



________________________________

From: Chanh Le <[email protected]<mailto:[email protected]>>
Sent: Tuesday, February 21, 2017 4:56 AM
To: user @spark
Subject: How to query a query with not contain, not start_with, not end_with 
condition effective?



Hi everyone,



I am working on a dataset like this
user_id         url
1              lao.com/buy<http://lao.com/buy>
2      bao.com/sell<http://bao.com/sell>
2              cao.com/market<http://cao.com/market>
1       lao.com/sell<http://lao.com/sell>
3              vui.com/sell<http://vui.com/sell>

I have to find all user_id with url not contain sell. Which means I need to 
query all user_id contains sell and put it into a set then do another query to 
find all user_id not in that set.
SELECT user_id
FROM data
WHERE user_id not in ( SELECT user_id FROM data WHERE url like ‘%sell%’;

My data is about 20 million records and it’s growing. When I tried in zeppelin 
I need to set spark.sql.crossJoin.enabled = true

Then I ran the query and the driver got extremely high CPU percentage and the 
process get stuck and I need to kill it.

I am running at client mode that submit to a Mesos cluster.



I am using Spark 2.0.2 and my data store in HDFS with parquet format.



Any advices for me in this situation?



Thank you in advance!.



Regards,

Chanh

Re: How to query a query with not contain, not start_with, not end_with condition effective?

Reply via email to