Re: Spark 2.0.0 preview docs uploaded

2016-06-08 Thread Pete Robbins
It would be nice to have a "what's new in 2.0.0" page, equivalent to https://spark.apache.org/releases/spark-release-1-6-0.html, available. Or am I just missing it?

On Wed, 8 Jun 2016 at 13:15 Sean Owen wrote:
> OK, this is done:
>
> http://spark.apache.org/documentation.html
> http://spark.apache.org/docs/2.0.0-preview/
> http://spark.apache.org/docs/preview/

Re: rdd.distinct with Partitioner

2016-06-08 Thread 汪洋
Frankly speaking, I think reduceByKey with a Partitioner has the same problem and should not be exposed to public users either, because it is a little hard to fully understand how the partitioner behaves without looking at the actual code. And if there exists a basic contract of a Partitioner
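The contract in question: getPartition must be a pure function of the key, so that equal keys always land in the same partition. A minimal sketch of a contract-honoring partitioner (ModPartitioner is a hypothetical name, not Spark API):

  import org.apache.spark.Partitioner

  // Honors the contract: getPartition depends only on the key,
  // so equal keys always map to the same partition.
  class ModPartitioner(parts: Int) extends Partitioner {
    def numPartitions: Int = parts
    def getPartition(key: Any): Int = {
      val h = key.hashCode % parts
      if (h < 0) h + parts else h  // keep the result non-negative
    }
  }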

DAG in Pipeline

2016-06-08 Thread Pranay Tonpay
Hi, Pipeline as of now seems to be a series of transformers and estimators arranged in a serial fashion. Is it possible to create a DAG sort of thing? E.g., two transformers running in parallel to cleanse data (a custom-built Transformer) in some way, and then their outputs (two outputs) used for s
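For context, a Pipeline today is declared as a flat, ordered sequence of stages, with no way to mark two stages as parallel branches. A minimal sketch of the current linear API (stage choices are illustrative):

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
  val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
  val lr = new LogisticRegression()

  // setStages takes a linear Array: each stage feeds the next.
  val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))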

Re: rdd.distinct with Partitioner

2016-06-08 Thread Alexander Pivovarov
reduceByKey(randomPartitioner, (a, b) => a + b) also gives an incorrect result. Why does reduceByKey with a Partitioner exist, then?

On Wed, Jun 8, 2016 at 9:22 PM, 汪洋 wrote:
> Hi Alexander,
>
> I think it does not guarantee to be right if an arbitrary Partitioner is
> passed in.
>
> I have created a notebook and you can check it out.
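A sketch of why an arbitrary Partitioner can break this (RandomPartitioner is hypothetical, and an existing SparkContext sc is assumed): if equal keys can land in different partitions, each partition is reduced independently, so the same key can appear more than once in the output:

  import scala.util.Random
  import org.apache.spark.Partitioner

  // Violates the contract: the same key may go to different partitions.
  class RandomPartitioner(parts: Int) extends Partitioner {
    def numPartitions: Int = parts
    def getPartition(key: Any): Int = Random.nextInt(parts)
  }

  // Copies of key "a" are scattered across partitions and reduced
  // separately, so the output may contain ("a", 2) and ("a", 1)
  // instead of a single ("a", 3).
  val counts = sc.parallelize(Seq(("a", 1), ("a", 1), ("a", 1)))
    .reduceByKey(new RandomPartitioner(4), _ + _)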

Re: rdd.distinct with Partitioner

2016-06-08 Thread Mridul Muralidharan
The example violates the basic contract of a Partitioner. It does make sense to take a Partitioner as a param to distinct, though it is fairly trivial to simulate that in user code as well ...

Regards,
Mridul

On Wednesday, June 8, 2016, 汪洋 wrote:
> Hi Alexander,
>
> I think it does not guarantee to be right if an arbitrary Partitioner is
> passed in.
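A sketch of that user-code simulation, mirroring how distinct(numPartitions) is implemented internally (map to dummy pairs, reduce with the partitioner, keep the keys); distinctWith is a hypothetical helper name:

  import scala.reflect.ClassTag
  import org.apache.spark.Partitioner
  import org.apache.spark.rdd.RDD

  // distinct with an explicit (contract-honoring) Partitioner.
  def distinctWith[T: ClassTag](rdd: RDD[T], partitioner: Partitioner): RDD[T] =
    rdd.map(x => (x, null))
       .reduceByKey(partitioner, (a, _) => a)
       .map(_._1)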

Re: rdd.distinct with Partitioner

2016-06-08 Thread 汪洋
Hi Alexander,

I think it is not guaranteed to be right if an arbitrary Partitioner is passed in. I have created a notebook and you can check it out. (https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/7973071962862063/2110745399505739/58107563000366

rdd.distinct with Partitioner

2016-06-08 Thread Alexander Pivovarov
Most of the RDD methods that shuffle data take a Partitioner as a parameter, but rdd.distinct does not have such a signature. Should I open a PR for that?

  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]

Re: Kryo registration for Tuples?

2016-06-08 Thread Reynold Xin
Yes you can :)

On Wed, Jun 8, 2016 at 6:00 PM, Alexander Pivovarov wrote:
> Can I just enable spark.kryo.registrationRequired and look at error
> messages to get unregistered classes?
>
> On Wed, Jun 8, 2016 at 5:52 PM, Reynold Xin wrote:
>
>> Due to type erasure they have no difference, although watch out for Scala
>> tuple serialization.

Re: Kryo registration for Tuples?

2016-06-08 Thread Alexander Pivovarov
Can I just enable spark.kryo.registrationRequired and look at error messages to get unregistered classes?

On Wed, Jun 8, 2016 at 5:52 PM, Reynold Xin wrote:
> Due to type erasure they have no difference, although watch out for Scala
> tuple serialization.
>
> On Wednesday, June 8, 2016, Ted Yu wrote:
>> I think the second group (3 classOf's) should be used.
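A sketch of that approach: with registrationRequired on, Kryo fails fast with an error naming each class that gets serialized without being registered:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Throw an error (naming the class) instead of silently falling
    // back to writing full class names for unregistered classes.
    .set("spark.kryo.registrationRequired", "true")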

Re: Kryo registration for Tuples?

2016-06-08 Thread Reynold Xin
Due to type erasure they have no difference, although watch out for Scala tuple serialization.

On Wednesday, June 8, 2016, Ted Yu wrote:
> I think the second group (3 classOf's) should be used.
>
> Cheers
>
> On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov wrote:
>
>> if my RDD is RDD[(String, (Long, MyClass))]
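Concretely, the erasure point: the generic tuple types all erase to scala.Tuple2 at runtime, so the classOf expressions in the two groups refer to the same Class object (MyClass is the class from the original question):

  // All of these are the same runtime Class object, scala.Tuple2:
  classOf[(Long, MyClass)] == classOf[(Any, Any)]            // true
  classOf[(String, (Long, MyClass))] == classOf[(Any, Any)]  // true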

Re: Kryo registration for Tuples?

2016-06-08 Thread Ted Yu
I think the second group (3 classOf's) should be used.

Cheers

On Wed, Jun 8, 2016 at 4:53 PM, Alexander Pivovarov wrote:
> if my RDD is RDD[(String, (Long, MyClass))]
>
> Do I need to register
>
> classOf[MyClass]
> classOf[(Any, Any)]
>
> or
>
> classOf[MyClass]
> classOf[(Long, MyClass)]
> classOf[(String, (Long, MyClass))]

Kryo registration for Tuples?

2016-06-08 Thread Alexander Pivovarov
If my RDD is RDD[(String, (Long, MyClass))], do I need to register

  classOf[MyClass]
  classOf[(Any, Any)]

or

  classOf[MyClass]
  classOf[(Long, MyClass)]
  classOf[(String, (Long, MyClass))]

?
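A sketch of the registration Ted suggests, using SparkConf's registerKryoClasses helper (MyClass stands in for the user's class); per Reynold's note on erasure, both groups end up registering the same runtime tuple class:

  import org.apache.spark.SparkConf

  val conf = new SparkConf().registerKryoClasses(Array(
    classOf[MyClass],
    classOf[(Long, MyClass)],           // erases to scala.Tuple2
    classOf[(String, (Long, MyClass))]  // also erases to scala.Tuple2
  ))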

Re: NegativeArraySizeException / segfault

2016-06-08 Thread Andres Perez
We were able to reproduce it with a minimal example. I've opened a jira issue: https://issues.apache.org/jira/browse/SPARK-15825

On Wed, Jun 8, 2016 at 12:43 PM, Koert Kuipers wrote:
> great!
>
> we weren't able to reproduce it because the unit tests use a
> broadcast-join while on the cluster it uses sort-merge-join.

Re: NegativeArraySizeException / segfault

2016-06-08 Thread Koert Kuipers
Great! We weren't able to reproduce it because the unit tests use a broadcast join while on the cluster it uses a sort-merge join, so the issue is in the sort-merge join. We are now able to reproduce it in tests using spark.sql.autoBroadcastJoinThreshold=-1. It produces weird-looking garbled results in
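A sketch of forcing the sort-merge join path in a local test the same way (the session setup is illustrative):

  import org.apache.spark.sql.SparkSession

  // Disabling the broadcast threshold makes the planner pick
  // sort-merge join, reproducing the cluster behavior locally.
  val spark = SparkSession.builder()
    .master("local[2]")
    .config("spark.sql.autoBroadcastJoinThreshold", "-1")
    .getOrCreate()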

Re: NegativeArraySizeException / segfault

2016-06-08 Thread Pete Robbins
I just raised https://issues.apache.org/jira/browse/SPARK-15822 for a similar-looking issue. Analyzing the core dump from the segv with Memory Analyzer, it looks very much like a UTF8String is very corrupt.

Cheers,

On Fri, 27 May 2016 at 21:00 Koert Kuipers wrote:
> hello all,
> after getting ou

Spark 2.0.0 preview docs uploaded

2016-06-08 Thread Sean Owen
OK, this is done:

http://spark.apache.org/documentation.html
http://spark.apache.org/docs/2.0.0-preview/
http://spark.apache.org/docs/preview/

On Tue, Jun 7, 2016 at 4:59 PM, Shivaram Venkataraman wrote:
> As far as I know the process is just to copy docs/_site from the build
> to the appropriate