Driver fail-over in Spark Streaming 1.2.0

2015-02-11 Thread lin
Hi all, in Spark Streaming 1.2.0, when the driver fails and a new driver starts with the most up-to-date checkpointed data, will the former Executors connect to the new driver, or will the new driver start its own set of new Executors? In which piece of code is that done? Any reply will be a
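
(A minimal sketch of the checkpoint-recovery entry point the question is about: the documented StreamingContext.getOrCreate pattern, with a hypothetical checkpoint path. On driver restart, getOrCreate rebuilds the context and its DStream graph from the checkpoint instead of calling the creation function; see StreamingContext.getOrCreate and CheckpointReader in the streaming code for where that happens.)

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object RecoverableApp {
      // Hypothetical checkpoint location; any fault-tolerant store works.
      val checkpointDir = "hdfs:///tmp/streaming-checkpoint"

      // Called only when no checkpoint exists. After a driver failure,
      // StreamingContext.getOrCreate reconstructs the context (and its
      // DStream graph) from the checkpointed data instead.
      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("RecoverableApp")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)
        ssc.socketTextStream("localhost", 9999).count().print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }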

Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-11 Thread fightf...@163.com
Hi, we still have no adequate solution for this issue. Any applicable analysis rules or hints would be appreciated. Thanks, Sun. fightf...@163.com From: fightf...@163.com Date: 2015-02-09 11:56 To: user; dev Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets
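
(While waiting for better analysis: the sort-shuffle memory knobs in Spark 1.2 are the usual first things to try when the in-memory aggregation map grows large. A hedged sketch with illustrative values only; the config keys are real, but whether they help this particular workload is untested.)

    import org.apache.spark.SparkConf

    // Untested sketch: real Spark 1.2 config keys, made-up values.
    val conf = new SparkConf()
      .set("spark.shuffle.manager", "sort")        // sort-based shuffle (the 1.2 default)
      .set("spark.shuffle.spill", "true")          // allow the aggregation map to spill to disk
      .set("spark.shuffle.memoryFraction", "0.3")  // heap fraction for shuffle buffers (default 0.2)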

[ANNOUNCE] Spark 1.3.0 Snapshot 1

2015-02-11 Thread Patrick Wendell
Hey All, I've posted Spark 1.3.0 snapshot 1. At this point the 1.3 branch is ready for community testing, and we are strictly merging fixes and documentation across all components. The release files, including signatures, digests, etc., can be found at: http://people.apache.org/~pwendell/spark-1.3.0

Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Todd Gao
Oh, I see! Thank you very much, Davies. You corrected some of my misunderstandings. On Thu, Feb 12, 2015 at 9:50 AM, Davies Liu wrote: > Yes. > > On Wed, Feb 11, 2015 at 5:44 PM, Todd Gao > wrote: > > Thanks Davies. > > I am not quite familiar with Spark Streaming. Do you mean that the > comput

Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
Yes. On Wed, Feb 11, 2015 at 5:44 PM, Todd Gao wrote: > Thanks Davies. > I am not quite familiar with Spark Streaming. Do you mean that the compute > routine of a DStream is only invoked on the driver node, > while the compute routines of RDDs are distributed to the slaves? > > On Thu, Feb 12,
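
(For other readers, a sketch of the split being confirmed here; this is an illustration, not code from the thread. The closure passed to foreachRDD runs on the driver once per batch, because that is where DStream.compute builds each batch's RDD; only the functions passed to RDD operations are serialized and shipped to the executors.)

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DriverVsExecutors {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(
          new SparkConf().setAppName("sketch"), Seconds(1))
        val lines = ssc.socketTextStream("localhost", 9999) // assumed source

        lines.foreachRDD { rdd =>
          // Runs on the driver, once per batch: DStream.compute has already
          // produced `rdd`, but no records have been touched yet.
          rdd.foreach { line =>
            // Runs on the executors: this closure is what gets distributed.
            println(line)
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }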

Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Todd Gao
Thanks Davies. I am not quite familiar with Spark Streaming. Do you mean that the compute routine of a DStream is only invoked on the driver node, while the compute routines of RDDs are distributed to the slaves? On Thu, Feb 12, 2015 at 2:38 AM, Davies Liu wrote: > The CallbackServer is part o

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
SPARK-5747: Review all Bash scripts for word splitting bugs. I’ll file sub-tasks under this issue. Feel free to pitch in, people! Nick On Wed Feb 11 2015 at 3:07:51 PM Ted Yu wrote: > After some googling / trial and error, I got the following

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
After some googling / trial and error, I got the following working (against a directory with a space in its name):

    #!/usr/bin/env bash
    OLDIFS="$IFS"  # save it
    IFS=""         # don't split on any white space
    dir="$1/*"
    for f in "$dir"; do
      cat $f
    done
    IFS=$OLDIFS    # restore IFS

Cheers On Wed, Feb 11, 2015

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
The tragic thing here is that I was asked to review the patch that introduced this, and totally missed it... :( On Wed Feb 11 2015 at 2:46:35 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > lol yeah, I changed the path f

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
lol yeah, I changed the path for the email... which turned out to be the issue itself. On Wed Feb 11 2015 at 2:43:09 PM Ted Yu wrote: > I see. > '/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain a space :-) > > On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas < > nicholas.cham...@gmail.com> wrote: >

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
I see. '/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain a space :-) On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > Found it: > > > https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-73058f8e51951ec0b4cb3d48ade91a1fR73 > > GRRR BASH WORD SPLITTI

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
Found it: https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-73058f8e51951ec0b4cb3d48ade91a1fR73 GRRR BASH WORD SPLITTING. My path has a space in it... Nick On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote: > This is what I get: > > spark-1.2.1-bin-

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
This is what I get: spark-1.2.1-bin-hadoop2.4$ ls -1 lib/ datanucleus-api-jdo-3.2.6.jar datanucleus-core-3.2.10.jar datanucleus-rdbms-3.2.9.jar spark-1.2.1-yarn-shuffle.jar spark-assembly-1.2.1-hadoop2.4.0.jar spark-examples-1.2.1-hadoop2.4.0.jar So that looks correct… Hmm. Nick On Wed Feb 11 2

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
I downloaded the 1.2.1 tarball for Hadoop 2.4. I got: ls lib/ datanucleus-api-jdo-3.2.6.jar datanucleus-rdbms-3.2.9.jar spark-assembly-1.2.1-hadoop2.4.0.jar datanucleus-core-3.2.10.jar spark-1.2.1-yarn-shuffle.jar spark-examples-1.2.1-hadoop2.4.0.jar FYI On Wed, Feb 11, 2015 at 2:27 PM, Nichola

Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Sean Owen
Seems to work OK for me on OS X. I ran ./sbin/start-all.sh from the root. Both processes say they started successfully. On Wed, Feb 11, 2015 at 10:27 PM, Nicholas Chammas wrote: > I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran sbin/start-all.sh > on OS X. > > Failed to find Spark as

1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran sbin/start-all.sh on OS X. Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib You need to build Spark before running this program. Did the same for 1.2.0 and it worked fine. Nick

numpy on PyPy - potential benefit to PySpark

2015-02-11 Thread Nicholas Chammas
Random question for the PySpark and Python experts/enthusiasts on here: how big a deal would it be for PySpark and PySpark users if you could run numpy on PyPy? PySpark already supports running on PyPy, but libraries like MLlib that use numpy are not

[ml] Lost persistence for fold in cross-validation.

2015-02-11 Thread Peter Rudenko
Hi, I have a problem. I'm using Spark 1.2 with Pipeline + GridSearch + LogisticRegression. I've reimplemented the LogisticRegression.fit method and commented out instances.unpersist(): override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = { println(s"Fitting dataset ${da
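
(A hedged sketch of the pattern Peter describes, written against the public MLlib API rather than his patched spark.ml code: if fit() unpersists its cached input on exit, every paramMap in the grid re-caches the same fold from scratch.)

    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.classification.{
      LogisticRegressionModel, LogisticRegressionWithLBFGS}

    // Sketch only, not the spark.ml code under discussion.
    def fit(instances: RDD[LabeledPoint]): LogisticRegressionModel = {
      instances.cache()
      val model = new LogisticRegressionWithLBFGS().run(instances)
      // Commenting this out (as Peter did in LogisticRegression.fit) keeps
      // the fold cached, so the next grid point reuses it instead of
      // re-extracting and re-caching the same data:
      instances.unpersist()
      model
    }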

Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-11 Thread Reynold Xin
Unfortunately this is not going to happen for 1.3 (as a snapshot release has already been cut). We need to figure out how we are going to do cardinality estimation before implementing this. If we need to do this in the future, I think we can do it in a way that doesn't break existing APIs. Given I think this w
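
(For readers new to the thread: sizeInBytes currently sits on BaseRelation, so a source reports one estimate for the whole relation regardless of pushed-down filters; the proposal is to move it onto the *Scan interfaces. A hedged sketch against the 1.3 data sources API, with a made-up relation:)

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.sources.{BaseRelation, TableScan}
    import org.apache.spark.sql.types.StructType

    // Made-up relation illustrating where the estimate lives today.
    class TinyRelation(override val sqlContext: SQLContext)
        extends BaseRelation with TableScan {

      override def schema: StructType = StructType(Nil)

      // One coarse, relation-wide estimate. Because the hook is here and
      // not on TableScan/PrunedFilteredScan, a heavily filtered scan cannot
      // report a smaller size, which is what this thread is asking for.
      override def sizeInBytes: Long = 10L * 1024 * 1024

      override def buildScan(): RDD[Row] =
        sqlContext.sparkContext.emptyRDD[Row]
    }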

Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-11 Thread Aniket Bhatnagar
Circling back on this. Did you get a chance to re-look at this? Thanks, Aniket On Sun, Feb 8, 2015, 2:53 AM Aniket Bhatnagar wrote: > Thanks for looking into this. If this is true, isn't it an issue today? The > default implementation of sizeInBytes is 1 + broadcast threshold. So, if > catalyst'

Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
The CallbackServer is part of Py4J; it's only used in the driver, not in the slaves or workers. On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao wrote: > Hi all, > > I am reading the code of PySpark and its Streaming module. > > In PySpark Streaming, when the `compute` method of the instance of > PythonTr

[GraphX] Estimating Average distance of a big graph using GraphX

2015-02-11 Thread James
Hello, recently I have been trying to estimate the average distance of a big graph using Spark with the help of HyperANF (http://dl.acm.org/citation.cfm?id=1963493). It works like the Connected Components algorithm, except that the attribute of a vertex is a HyperLogLog counter that at the k-th iteration estimate
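
(Not from James's code, but a sketch of that iteration shape in GraphX's Pregel API, with exact vertex-id sets standing in for the HyperLogLog counters; real HyperANF replaces the set union with a mergeable HLL register merge, which is what makes it scale. At iteration k each vertex holds its radius-k ball B(v, k), and the growth of the ball sizes over k gives the distance distribution that the average distance is read from.)

    import org.apache.spark.graphx._

    // Simplified neighborhood-function iteration: the vertex attribute is
    // the set of vertex ids within the current radius (a stand-in for an
    // HLL counter). Messages grow a vertex's ball by its neighbors' balls.
    def neighborhoodBalls(
        graph: Graph[Int, Int], maxIter: Int): Graph[Set[VertexId], Int] = {
      val init = graph.mapVertices((id, _) => Set(id))
      Pregel(init, Set.empty[VertexId], maxIter)(
        (id, ball, msg) => ball ++ msg,          // vertex program: union in arrivals
        t =>
          if (t.srcAttr.subsetOf(t.dstAttr)) Iterator.empty
          else Iterator((t.dstId, t.srcAttr)),   // send src's ball to dst if it adds anything
        (a, b) => a ++ b)                        // merge concurrent messages
    }

(This sketch sends balls only along the edge direction; for an undirected graph you would also send dstAttr back to src, or duplicate each edge in both directions.)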

CallbackServer in PySpark Streaming

2015-02-11 Thread Todd Gao
Hi all, I am reading the code of PySpark and its Streaming module. In PySpark Streaming, when the `compute` method of an instance of PythonTransformedDStream is invoked, a connection to the CallbackServer is created internally. I wonder where the CallbackServer is for each PythonTransformedDStre