Inner join with the table itself

2018-01-15 Thread Michael Shtelma
Hi all,

When I try to join a table with itself on the join columns, I get the
following error:
"Join condition is missing or trivial. Use the CROSS JOIN syntax to
allow cartesian products between these relations.;"

This is not the case: my join is not trivial and is not a real cross
join. I am providing a join condition and expect to get maybe a couple
of joined rows for each row in the original table.

There is a workaround, which amounts to renaming all the columns in the
source data frame and only then proceeding with the join. This lets us
fool Spark.

Now I am wondering whether there is a better way around this problem.
I do not like renaming the columns, because it makes it really difficult
to keep track of the column names in the resulting data frames.
Is it possible to deactivate this check?
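
For reference, a minimal sketch of the kind of query and workaround I mean
(the schema is just illustrative, not my real one):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("self-join-example").getOrCreate()
import spark.implicits._

val df = Seq((1, 0), (2, 1), (3, 1)).toDF("id", "parent_id")

// Joining the DataFrame with itself: both column references resolve against
// the same plan, so Spark sees a condition that does not connect the two
// sides and raises the cartesian-product error quoted above.
// df.join(df, df("parent_id") === df("id")).show()

// The workaround: rename every column on one side before joining, so the
// attributes on the two sides are distinct.
val renamed = df.toDF(df.columns.map(_ + "_r"): _*)
df.join(renamed, df("parent_id") === renamed("id_r")).show()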

Thanks,
Michael




Re: Inner join with the table itself

2018-01-15 Thread Jacek Laskowski
Hi Michael,

-dev +user

What's the query? How do you "fool Spark"?

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski



Limit the block size of data received by Spark Streaming receiver

2018-01-15 Thread Xilang Yan
Hey, 

We use a custom receiver to receive data from our MQ. We used to use
def store(dataItem: T) to store the data, but I found that the block sizes
can vary widely, from 0.5K to 5M, so the per-partition processing time
varies a lot as well. A shuffle is an option, but I want to avoid it.

I noticed that def store(dataBuffer: ArrayBuffer[T]) stores the whole buffer
as a single block, so it lets me control the block size; however, this method
does not apply any rate limit, so I have to do the rate limiting myself.

So for now I do not have a good way to control the block size. I am asking
whether Spark could add a rate limit to the store(dataBuffer: ArrayBuffer[T])
method, or provide a way to control the block size generated by BlockGenerator.
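
For reference, a rough sketch of the kind of receiver I mean (MqClient and
all the names here are invented for illustration; the point is that each call
to store(dataBuffer: ArrayBuffer[T]) produces one block, but the rate limiting
has to be done by hand):

import scala.collection.mutable.ArrayBuffer

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical MQ client, just to make the sketch self-contained.
trait MqClient extends Serializable {
  def poll(max: Int): Seq[String]
}

class BufferedMqReceiver(
    newClient: () => MqClient,
    batchSize: Int,
    maxRecordsPerSec: Long)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  def onStart(): Unit = {
    val thread = new Thread("buffered-mq-receiver") {
      override def run(): Unit = receive()
    }
    thread.setDaemon(true)
    thread.start()
  }

  def onStop(): Unit = { /* the receive loop checks isStopped() */ }

  private def receive(): Unit = {
    val client = newClient()
    while (!isStopped()) {
      val start = System.nanoTime()
      val buffer = new ArrayBuffer[String](batchSize)
      buffer ++= client.poll(batchSize)
      if (buffer.nonEmpty) {
        // One call to store(ArrayBuffer) produces one block, so the block
        // size is bounded by batchSize -- but no rate limit is applied here.
        store(buffer)
      }
      // Crude manual rate limit: sleep long enough that we do not exceed
      // maxRecordsPerSec on average.
      val elapsedMs = (System.nanoTime() - start) / 1000000L
      val minIntervalMs = buffer.size * 1000L / math.max(maxRecordsPerSec, 1L)
      if (elapsedMs < minIntervalMs) Thread.sleep(minIntervalMs - elapsedMs)
    }
  }
}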






Re: Join Strategies

2018-01-15 Thread Herman van Hövell tot Westerflier
Hey Marco,

A Cartesian product is an inner join by definition :). The current
cartesian product operator does not support outer joins, so we use the only
operator that does: BroadcastNestedLoopJoinExec. This is far from great,
and it does have the potential to OOM; there are some safety nets in the
driver, though, that should start complaining before you actually hit an OOM.

An outer non-equi join is pretty hard to do in a distributed setting. This
is caused by two things:

   - There is no way to partition the data in such a way that you can
   exploit locality (i.e., know that all identical keys end up in the same
   partition), unless you use only one partition or some clever index.
   - You need to keep track of records that do not match the join condition
   if you are doing a full join or a join in which the stream side does not
   match the join side. This is the number one source of complexity in the
   current join implementations. If you can partition your data then you can
   track and emit unmatched rows as part of processing the partition. If you
   cannot (and you have more than 1 partition) then you need to send the
   unmatched rows (in some form) back to the driver and figure out which
   records actually have not been matched (see BroadcastNestedLoopJoinExec for
   example).

It is definitely doable to implement such a join; however, I have not seen
many JIRAs or user requests for this.
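
To make this concrete, a small sketch (the names are made up): a non-equi
outer join has no equality keys for the planner to hash or sort on, so it
should fall back to BroadcastNestedLoopJoinExec, which explain() will show:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("non-equi-outer-join").getOrCreate()
import spark.implicits._

val events  = Seq((1, 10), (2, 25), (3, 99)).toDF("id", "ts")
val windows = Seq((100, 0, 15), (200, 20, 30)).toDF("wid", "start_ts", "end_ts")

// There are no equality keys, so the planner cannot use a hash or
// sort-merge join for this full outer join.
val joined = events.join(
  windows,
  $"ts" >= $"start_ts" && $"ts" < $"end_ts",
  "full_outer")

joined.explain()  // the physical plan should show BroadcastNestedLoopJoin
joined.show()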

HTH

Herman


On Sat, Jan 13, 2018 at 6:41 AM, Marco Gaido  wrote:

> Hi dev,
>
> I have a question about how join strategies are defined.
>
> I see that CartesianProductExec is used only for InnerJoin, while for
> other kind of joins BroadcastNestedLoopJoinExec is used.
> For reference:
> https://github.com/apache/spark/blob/cd9f49a2aed3799964976ead06080a0f7044a0c3/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala#L260
>
> Could you kindly explain why this is done? It doesn't seem a great choice
> to me, since BroadcastNestedLoopJoinExec can fail with OOM.
>
> Thanks,
> Marco
>
>
>
>


Broken SQL Visualization?

2018-01-15 Thread Tomasz Gawęda
Hi,

Today I updated my test cluster to the current Spark master, and after that my SQL 
Visualization page started to crash with the following error in JS:


The screenshot was cropped for readability and to hide internal server names ;)

It may be caused by the upgrade or by some code changes, but, to be honest, I did 
not use any new operators or any new Spark functions, so it should render 
correctly, as it did a few days ago. Some visualizations work fine and some crash; 
I have no idea why it does not work. Can anyone help me? It is probably a bug in 
Spark, but it is hard for me to say where.

Thanks in advance!

Pozdrawiam / Best regards,

Tomek


Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include a picture?
It looks like the picture didn't come through.
Please use a third-party image hosting site.
Thanks


Re: Broken SQL Visualization?

2018-01-15 Thread Wenchen Fan
Hi, thanks for reporting. Can you include the steps to reproduce this bug?



Thoughts on Cloudpickle Update

2018-01-15 Thread Bryan Cutler
Hi All,

I've seen a couple issues lately related to cloudpickle, notably
https://issues.apache.org/jira/browse/SPARK-22674, and would like to get
some feedback on updating the version in PySpark which should fix these
issues and allow us to remove some workarounds.  Spark is currently using a
forked version and it seems like updates are made every now and then when
needed, but it's not really clear where the current state is and how much
it has diverged.  This makes back-porting fixes difficult.  There was a
previous discussion on moving it to a dependency here,
but given the status right now I think it would be best to do another
update and bring things closer to upstream before we talk about completely
moving it outside of Spark.  Before starting another update, it might be
good to discuss the strategy a little.  Should the version in Spark be
derived from a release or at least tied to a specific commit?  It would
also be good if we can document where it has diverged.  Are there any known
issues with recent changes from those that follow cloudpickle dev?  Any
other thoughts or concerns?

Thanks,
Bryan


Re: Thoughts on Cloudpickle Update

2018-01-15 Thread Hyukjin Kwon
Hi Bryan,

Yup, I support matching the version. I have pushed this forward a few times
before, both in Spark's copy (to match it with
https://github.com/cloudpipe/cloudpickle) and in cloudpickle itself with a
few fixes. I believe our copy is closest to 0.4.1.

I have been trying to follow the changes in cloudpipe/cloudpickle to see
which version we should match. I think we should match 0.4.2 first (I need
to double-check), because IMHO they have been adding rather radical changes
since 0.5.0, including a change of the default pickle protocol.

Personally, I would like to eventually match the latest version, because
there have been some important changes; for example, see
https://github.com/cloudpipe/cloudpickle/pull/138 (it is still pending
review). But 0.4.2 should be a good starting point.

As for the strategy, I think we can match and then follow the 0.4.x line
within Spark, as the conservative and safe choice with minimal cost.


I tried to leave a few explicit answers to your questions, Bryan:

> Spark is currently using a forked version and it seems like updates are
> made every now and then when needed, but it's not really clear where the
> current state is and how much it has diverged.

I am quite sure our cloudpickle copy is closest to 0.4.1, IIRC.


> Are there any known issues with recent changes from those that follow
> cloudpickle dev?

I am technically involved in cloudpickle dev, although I am less active.
They changed the default pickle protocol
(https://github.com/cloudpipe/cloudpickle/pull/127), so if we target
0.5.x+, we should double-check the potential compatibility issues or fix
the protocol explicitly; I believe that change was introduced in 0.5.x.


