Re: Scala left join with multiple columns Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.

2017-04-05 Thread gjohnson35
Thanks Andrew.  I completely missed that. It worked by removing the null safe
join condition.
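
A minimal PySpark sketch of the pattern under discussion (illustrative only: the
thread used the Scala API, the original join condition is not shown, and the
DataFrames and column names here are invented). It shows a multi-column left join
whose condition references columns from both sides; in Spark 2.x a condition that
folds to a constant (for example, one that accidentally compares a column with
itself) is treated as missing or trivial and trips the cartesian-product check
unless spark.sql.crossJoin.enabled is set.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

left = spark.createDataFrame([(1, "a", 10), (2, "b", 20)], ["k1", "k2", "v"])
right = spark.createDataFrame([(1, "a", "x"), (3, "c", "y")], ["k1", "k2", "w"])

# The condition references columns from both DataFrames, so Spark plans an
# equi-join instead of flagging a cartesian product.
joined = left.join(
    right,
    on=(left["k1"] == right["k1"]) & (left["k2"] == right["k2"]),
    how="left",
)
joined.show()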







Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-05 Thread Holden Karau
Following up, the issue with missing pypandoc/pandoc on the packaging
machine has been resolved.
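
For context, a hedged sketch of the usual pypandoc fallback pattern that produces
this symptom (illustrative only, not Spark's actual setup.py): if the pandoc binary
or the pypandoc module is missing on the packaging machine, the fallback string
becomes the long description recorded in PKG-INFO.

try:
    import pypandoc
    # Convert the Markdown README to reStructuredText for the PyPI page.
    long_description = pypandoc.convert_file("README.md", "rst")
except (ImportError, OSError):
    # pypandoc raises OSError when the pandoc binary itself cannot be found.
    long_description = "! missing pandoc do not upload to PyPI"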

On Tue, Apr 4, 2017 at 3:54 PM, Holden Karau  wrote:

> See SPARK-20216, if Michael can let me know which machine is being used
> for packaging I can see if I can install pandoc on it (should be simple but
> I know the Jenkins cluster is a bit on the older side).
>
> On Tue, Apr 4, 2017 at 3:06 PM, Holden Karau  wrote:
>
>> So the fix is installing pandoc on whichever machine is used for
>> packaging. I thought that was generally done on the machine of the person
>> rolling the release, so I wasn't sure it made sense as a JIRA, but from
>> chatting with Josh it sounds like that part might be on one of the Jenkins
>> workers - is there a fixed one that is used?
>>
>> Regardless I'll file a JIRA for this when I get back in front of my
>> desktop (~1 hour or so).
>>
>> On Tue, Apr 4, 2017 at 2:35 PM Michael Armbrust 
>> wrote:
>>
>>> Thanks for the comments everyone.  This vote fails.  Here's how I think
>>> we should proceed:
>>>  - [SPARK-20197] - SparkR CRAN - appears to be resolved
>>>  - [SPARK-] - Python packaging - Holden, please file a JIRA and
>>> report if this is a regression and if there is an easy fix that we should
>>> wait for.
>>>
>>> For all the other test failures, please take the time to look through
>>> JIRA and open an issue if one does not already exist so that we can triage
>>> whether these are just environmental issues.  If I don't hear any objections, I'm
>>> going to go ahead with RC3 tomorrow.
>>>
>>> On Sun, Apr 2, 2017 at 1:16 PM, Felix Cheung 
>>> wrote:
>>>
>>> -1
>>> Sorry, I found an issue with the SparkR CRAN check.
>>> Opened SPARK-20197 and am working on a fix.
>>>
>>> --
>>> *From:* holden.ka...@gmail.com  on behalf of
>>> Holden Karau 
>>> *Sent:* Friday, March 31, 2017 6:25:20 PM
>>> *To:* Xiao Li
>>> *Cc:* Michael Armbrust; dev@spark.apache.org
>>> *Subject:* Re: [VOTE] Apache Spark 2.1.1 (RC2)
>>>
>>> -1 (non-binding)
>>>
>>> Python packaging doesn't seem to have quite worked out (looking
>>> at PKG-INFO, the description is "Description: ! missing pandoc do not
>>> upload to PyPI "). Ideally it would be nice to have this as a version we
>>> upload to PyPI.
>>> Building this on my own machine results in a longer description.
>>>
>>> My guess is that whichever machine was used to package this is missing
>>> the pandoc executable (or possibly pypandoc library).
>>>
>>> On Fri, Mar 31, 2017 at 3:40 PM, Xiao Li  wrote:
>>>
>>> +1
>>>
>>> Xiao
>>>
>>> 2017-03-30 16:09 GMT-07:00 Michael Armbrust :
>>>
>>> Please vote on releasing the following candidate as Apache Spark
>>> version 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.1.1
>>> [ ] -1 Do not release this package because ...
>>>
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v2.1.1-rc2
>>> (02b165dcc2ee5245d1293a375a31660c9d4e1fa6)
>>>
>>> List of JIRA tickets resolved can be found with this filter.
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1227/
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://people.apache.org/~pwendell/spark-releases/spark-2.1.1-rc2-docs/
>>>
>>>
>>> *FAQ*
>>>
>>> *How can I help test this release?*
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Spark workload and running it on this release candidate, then
>>> reporting any regressions.
>>>
>>> *What should happen to JIRA tickets still targeting 2.1.1?*
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should be
>>> worked on immediately. Everything else please retarget to 2.1.2 or 2.2.0.
>>>
>>> *But my bug isn't fixed!??!*
>>>
>>> In order to make timely releases, we will typically not hold the release
>>> unless the bug in question is a regression from 2.1.0.
>>>
>>> *What happened to RC1?*
>>>
>>> There were issues with the release packaging and as a result it was skipped.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>>
>>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Cell : 425-233-8271

[Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Maciej Bryński
Hi,
I'm trying to run queries with many values in an IN operator.

The result is that with more than 10K values the IN operator gets slower.

For example, this code runs for about 20 seconds.

df = spark.range(0, 10, 1, 1)
df.where('id in ({})'.format(','.join(map(str, range(10))))).count()

Any ideas how to improve this?
Is it a bug?
-- 
Maciek Bryński
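
The value counts in the snippet above appear truncated by the archive (ten values
would not take 20 seconds). A runnable reconstruction, assuming on the order of
100,000 values (a guess consistent with the "more than 10K" wording, not a figure
preserved in the original message), might look like:

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

N = 100000  # assumed value count; the archived message does not preserve the real figure
df = spark.range(0, N, 1, 1)

start = time.time()
# Build a SQL predicate containing N literal values. The replies in this thread
# attribute most of the elapsed time to constructing and analyzing this very
# long expression rather than to executing the scan itself.
df.where('id in ({})'.format(','.join(map(str, range(N))))).count()
print('elapsed: {:.1f}s'.format(time.time() - start))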




Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Garren Staubli
Query building time is significant: it's a simple query, but a long one at
almost 4,000 characters.

Task deserialization takes up an inordinate amount of time (0.9s) when I run
your test, and building the query itself takes several seconds.

I would recommend using a JOIN (a broadcast join if your data set is small
enough) when the alternative is a massive IN statement.
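
A hedged sketch of that suggestion (illustrative only; it recreates the single-column
df from the original message and assumes 100,000 lookup values): put the values into
a small DataFrame and do a broadcast semi join instead of a giant IN list.

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 100000, 1, 1)  # stand-in for the original df

# The values to test membership against, as a one-column DataFrame.
values_df = spark.createDataFrame([(v,) for v in range(100000)], ['id'])

# A left semi join keeps only the rows of df whose id appears in values_df;
# broadcast() hints that the small side should be shipped to every executor.
df.join(broadcast(values_df), on='id', how='left_semi').count()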

On Wed, Apr 5, 2017 at 2:31 PM, Maciej Bryński [via Apache Spark Developers
List]  wrote:

> Hi,
> I'm trying to run queries with many values in an IN operator.
>
> The result is that with more than 10K values the IN operator gets slower.
>
> For example, this code runs for about 20 seconds.
>
> df = spark.range(0, 10, 1, 1)
> df.where('id in ({})'.format(','.join(map(str, range(10))))).count()
>
> Any ideas how to improve this?
> Is it a bug?
> --
> Maciek Bryński
>






Re: [Pyspark, SQL] Very slow IN operator

2017-04-05 Thread Michael Segel
Just out of curiosity, what would happen if you put your 10K values into a
temp table and then did a join against it?
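
A hedged sketch of that idea in PySpark (view names and the 100,000-value count are
assumptions, not from the thread): register the values as a temporary view and
express the filter as an equi-join in SQL.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.range(0, 100000, 1, 1).createOrReplaceTempView('data')
spark.createDataFrame([(v,) for v in range(100000)], ['id']).createOrReplaceTempView('ids')

# An equi-join lets Spark plan a hash join instead of parsing and evaluating a
# huge literal IN list.
spark.sql("""
    SELECT d.*
    FROM data d
    JOIN ids i ON d.id = i.id
""").count()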

> On Apr 5, 2017, at 4:30 PM, Maciej Bryński  wrote:
> 
> Hi,
> I'm trying to run queries with many values in an IN operator.
> 
> The result is that with more than 10K values the IN operator gets slower.
> 
> For example, this code runs for about 20 seconds.
> 
> df = spark.range(0, 10, 1, 1)
> df.where('id in ({})'.format(','.join(map(str, range(10))))).count()
> 
> Any ideas how to improve this?
> Is it a bug?
> -- 
> Maciek Bryński
> 

