Cannot connect to Python process in Spark Streaming

2017-08-01 Thread canan chen
I ran the pyspark streaming example queue_streaming.py but ran into the following error. Does anyone know what might be wrong? Thanks. ERROR [2017-08-02 08:29:20,023] ({Stop-StreamingContext} Logging.scala[logError]:91) - Cannot connect to Python process. It's probably dead. Stopping StreamingContext

Is there any api for categorical column statistic ?

2016-11-23 Thread canan chen
DataSet.describe only calculates statistics for numerical columns, not for categorical columns. R's summary method can calculate statistics for categorical data as well, which is very useful for exploratory data analysis. Just wondering, is there any API for categorical column statistics as well, or is
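In the meantime, a common workaround is a per-column frequency table via groupBy/count, plus the DataFrame statistic functions for frequent items. A minimal sketch, assuming a DataFrame df with a hypothetical categorical column "category":

    // frequency table for one categorical column
    df.groupBy("category").count()
      .orderBy(org.apache.spark.sql.functions.desc("count"))
      .show()

    // frequent items (approximate), via DataFrame.stat
    df.stat.freqItems(Seq("category")).show()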

How to use custom class in DataSet

2016-08-29 Thread canan chen
E.g. I have a custom class A (not a case class), and I'd like to use it as DataSet[A]. I guess I need to implement an Encoder for this, but I didn't find any example of that. Is there any documentation for it? Thanks
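For a class that is neither a case class nor a Java bean, one option is a Kryo-based encoder. A minimal sketch, assuming a hypothetical class A; the resulting Dataset stores A as a single binary column:

    import org.apache.spark.sql.{Encoder, Encoders, SparkSession}

    class A(val id: Int, val name: String)  // hypothetical non-case class

    val spark = SparkSession.builder().master("local[*]").appName("encoder-demo").getOrCreate()
    import spark.implicits._

    // generic encoder that serializes A with Kryo
    implicit val aEncoder: Encoder[A] = Encoders.kryo[A]

    val ds = spark.createDataset(Seq(new A(1, "x"), new A(2, "y")))
    ds.map(_.id).show()  // primitive encoders come from spark.implicits._

Encoders.javaSerialization[A] works the same way; for Java beans, Encoders.bean(classOf[A]) gives a columnar encoding instead of an opaque binary blob.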

Why the shuffle write is not the exactly same as shuffle read of next stage

2016-03-10 Thread canan chen
Here's my screenshot; stages 19 and 20 have a one-to-one relationship, each being the other's only child/parent. From my understanding, the shuffle write of stage 19 should be exactly the same as the shuffle read of stage 20, but here they differ slightly. Is there any reason for this? Thanks.

Re: When does python program started in pyspark

2015-10-13 Thread canan chen
Runner @ > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala > > On Tue, Oct 13, 2015 at 7:50 PM, canan chen wrote: > >> I look at the source code of spark, but didn't find where python program >> is started in pytho

When does python program started in pyspark

2015-10-13 Thread canan chen
I looked at the source code of Spark but didn't find where the Python program is started in pyspark. It seems spark-submit calls PythonGatewayServer, but where is the Python program started? Thanks

Re: Can not allocate executor when running spark on mesos

2015-09-09 Thread canan chen
, 2015 at 10:39 PM, canan chen wrote: > Yes, I follow the guide in this doc, and run it as mesos client mode > > On Tue, Sep 8, 2015 at 6:31 PM, Akhil Das > wrote: > >> In which mode are you submitting your application? (coarse-grained or >> fine-grained(default)).

Re: Can not allocate executor when running spark on mesos

2015-09-08 Thread canan chen
pache.org/docs/latest/running-on-mesos.html#using-a-mesos-master-url > > Thanks > Best Regards > > On Tue, Sep 8, 2015 at 12:54 PM, canan chen wrote: > >> Hi all, >> >> I try to run spark on mesos, but it looks like I can not allocate >> resources from mesos. I

Can not allocate executor when running spark on mesos

2015-09-08 Thread canan chen
Hi all, I'm trying to run Spark on Mesos, but it looks like I cannot allocate resources from Mesos. I am no expert on Mesos, but from the Mesos log it seems Spark always declines the offers from Mesos. Not sure what's wrong; maybe some configuration change is needed. Here's the mesos master log I0908 15:0
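A common cause of declined offers is that each individual offer is smaller than what Spark asks for (executor memory plus overhead, and cores). A sketch that shrinks the requests to fit small offers; the master URL and values are examples, not taken from this thread:

    spark-submit \
      --master mesos://mesos-master:5050 \
      --conf spark.executor.memory=512m \
      --conf spark.cores.max=1 \
      examples/src/main/python/pi.py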

Re: Where is the doc about the spark rest api ?

2015-08-31 Thread canan chen
k/tree/master/core/src/main/scala/org/apache/spark/deploy/rest > ), current I don't think there's a document address this part, also this > rest api is only used for SparkSubmit currently, not public API as I know. > > Thanks > Jerry > > > On Mon, Aug 31, 2015 at 4

Re: Where is the doc about the spark rest api ?

2015-08-31 Thread canan chen
I mean the Spark built-in REST API. On Mon, Aug 31, 2015 at 3:09 PM, Akhil Das wrote: > Check Spark Jobserver > <https://github.com/spark-jobserver/spark-jobserver> > > Thanks > Best Regards > > On Mon, Aug 31, 2015 at 8:54 AM, canan chen wrote: > >> I fou

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread canan chen
ocally then the `spark.history.fs.logDirectory` > will happen to point to `spark.eventLog.dir`, but the use case it provides > is broader than that. > > -Andrew > > 2015-08-19 5:13 GMT-07:00 canan chen : > >> Anyone know about this ? Or do I miss something here ? >> >

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-19 Thread canan chen
Does anyone know about this? Or am I missing something here? On Fri, Aug 7, 2015 at 4:20 PM, canan chen wrote: > Is there any reason that the history server uses another property for the event > log dir? Thanks >

Re: What's the best practice for developing new features for spark ?

2015-08-19 Thread canan chen
> http://search-hadoop.com/m/q3RTtdZv0d1btRHl/Spark+build+module&subj=Building+Spark+Building+just+one+module+ > > > > On Aug 19, 2015, at 1:44 AM, canan chen wrote: > > > > I want to work on one jira, but it is not easy to do unit test, because > it involves di

What's the best practice for developing new features for spark ?

2015-08-19 Thread canan chen
I want to work on one JIRA, but it is not easy to unit test because it involves different components, especially the UI. Building Spark is pretty slow, and I don't want to rebuild it each time I test a code change. I am wondering how other people do this? Is there any experience you can share? Thanks
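The usual trick is to rebuild only the module you touched, or keep an incremental compiler running. A sketch, assuming the standard Spark build scripts (the module name is an example; check the pom for the exact artifact id):

    # rebuild a single module with the bundled maven
    ./build/mvn -pl :spark-core_2.10 -DskipTests install

    # or let sbt recompile incrementally on every file change
    ./build/sbt ~compile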

Why standalone mode don't allow to set num-executor ?

2015-08-18 Thread canan chen
--num-executors only works in yarn mode. In standalone mode, I have to set --total-executor-cores and --executor-cores instead. Isn't that unintuitive? Any reason for it?
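For reference, in standalone mode the executor count falls out of the two core settings. A sketch with example values (8 total cores / 2 cores per executor gives 4 executors; the master URL and jar are placeholders):

    spark-submit \
      --master spark://master:7077 \
      --total-executor-cores 8 \
      --executor-cores 2 \
      --executor-memory 2g \
      myapp.jar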

Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-16 Thread canan chen
LContext > > or just create a new SQLContext from a SparkContext. > > -Andrew > > 2015-08-15 20:33 GMT-07:00 canan chen : > >> I am not sure other people's spark debugging environment ( I mean for the >> master branch) , Anyone can share his experience ? >>

Re: TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread canan chen
I am not sure about other people's Spark debugging environment (I mean for the master branch). Can anyone share their experience? On Sun, Aug 16, 2015 at 10:40 AM, canan chen wrote: > I imported the Spark source code into IntelliJ, and want to run SparkPi in > IntelliJ, but met the foll

TestSQLContext compilation error when run SparkPi in Intellij ?

2015-08-15 Thread canan chen
I imported the Spark source code into IntelliJ and want to run SparkPi in IntelliJ, but I met the following weird compilation error. I googled it, and sbt clean doesn't work for me. I am not sure whether anyone else has met this issue too; any help is appreciated. Error:scalac: while compiling: /

Error when running SparkPi in Intellij

2015-08-11 Thread canan chen
I imported the Spark project into IntelliJ and tried to run SparkPi in IntelliJ, but it failed with a compilation error: Error:scalac: while compiling: /Users/werere/github/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala during phase: jvm library version: ver

Re: Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-10 Thread canan chen
Does anyone know about this? Thanks. On Fri, Aug 7, 2015 at 4:20 PM, canan chen wrote: > Is there any reason that the history server uses another property for the event > log dir? Thanks >

Why use spark.history.fs.logDirectory instead of spark.eventLog.dir

2015-08-07 Thread canan chen
Is there any reason that the history server uses a different property for the event log dir? Thanks
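In practice the two properties usually point at the same location: spark.eventLog.dir is where running applications write events, and spark.history.fs.logDirectory is where the (possibly separate) history server reads them. A spark-defaults.conf sketch with an example path:

    spark.eventLog.enabled           true
    spark.eventLog.dir               hdfs://namenode:8021/spark-logs
    spark.history.fs.logDirectory    hdfs://namenode:8021/spark-logs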

Re: How to set log level in spark-submit ?

2015-07-29 Thread canan chen
> On Wednesday, July 29, 2015, canan chen wrote: > >> Does anyone know how to set the log level in spark-submit? Thanks >> >

How to set log level in spark-submit ?

2015-07-29 Thread canan chen
Does anyone know how to set the log level in spark-submit? Thanks
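Two common approaches, as a sketch: call sc.setLogLevel("WARN") after the SparkContext is created, or ship a custom log4j.properties to driver and executors. The file path and jar below are placeholders:

    spark-submit \
      --driver-java-options "-Dlog4j.configuration=file:/path/to/log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:/path/to/log4j.properties" \
      myapp.jar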

Re: RDD saveAsTextFile() to local disk

2015-07-08 Thread canan chen
It works for me with the following code. Could you share yours?

    val data = sc.parallelize(List(1, 2, 3))
    data.saveAsTextFile("file:///Users/chen/Temp/c")

On Thu, Jul 9, 2015 at 4:05 AM, spok20nn wrote: > Getting an exception when writing an RDD to local disk using the following function > > s

What does RDD lineage refer to ?

2015-07-08 Thread canan chen
Lots of places refer to RDD lineage; I'd like to know exactly what it refers to. My understanding is that it means the RDD dependencies plus the intermediate MapOutput info in the MapOutputTracker. Correct me if I am wrong. Thanks
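You can print a lineage directly with RDD#toDebugString; a small spark-shell sketch:

    val rdd = sc.parallelize(1 to 10).map(_ * 2).filter(_ > 5)
    // prints the chain of parent RDDs (the lineage) this RDD was derived from
    println(rdd.toDebugString)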

Re: Yarn application ID for Spark job on Yarn

2015-06-23 Thread canan chen
I don't think there is YARN-related stuff to access in Spark; Spark doesn't depend on YARN. BTW, why do you want the YARN application id? On Mon, Jun 22, 2015 at 11:45 PM, roy wrote: > Hi, > > Is there a way to get Yarn application ID inside spark application, when > running spark Job on YARN

Re: Spark launching without all of the requested YARN resources

2015-06-23 Thread canan chen
Why do you want it to wait until all the resources are ready before starting? Starting as early as possible should make it complete earlier and increase resource utilization. On Tue, Jun 23, 2015 at 10:34 PM, Arun Luthra wrote: > Sometimes if my Hortonworks yarn-enabled cluster is fairly busy, Spark

Re: When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread canan chen
I don't think this is the correct question. Spark can be deployed on different cluster manager frameworks like standalone, YARN & Mesos. Spark can't run without one of these cluster manager frameworks, which means Spark depends on a cluster manager framework. And the data management layer is the upstream

Re: map V mapPartitions

2015-06-23 Thread canan chen
One example is where you'd like to set up a JDBC connection per partition and share that connection across the records. mapPartitions is much closer to the paradigm of the mapper in MapReduce: in a MapReduce mapper, you have a setup method to do any initialization before processing the spl
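A minimal sketch of the pattern, assuming an RDD[Int] of user ids and a hypothetical JDBC url/table:

    import java.sql.DriverManager

    val enriched = rdd.mapPartitions { ids =>
      // one connection per partition, reused for every record in it
      val conn = DriverManager.getConnection("jdbc:h2:mem:demo", "sa", "")
      val stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")
      val out = ids.map { id =>
        stmt.setInt(1, id)
        val rs = stmt.executeQuery()
        val name = if (rs.next()) rs.getString(1) else "unknown"
        rs.close()
        (id, name)
      }.toList // materialize before closing: iterators are lazy
      stmt.close()
      conn.close()
      out.iterator
    }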

Re: Spark standalone cluster - resource management

2015-06-23 Thread canan chen
Check the available resources you have (CPU cores & memory) on the master web UI. The log you see means the job can't get any resources. On Wed, Jun 24, 2015 at 5:03 AM, Nizan Grauer wrote: > I'm having 30G per machine > > This is the first (and only) job I'm trying to submit. So it's weird that

Re: Intermedate stage will be cached automatically ?

2015-06-17 Thread canan chen
ast. > > > Best > Ayan > > On Wed, Jun 17, 2015 at 10:21 PM, Mark Tse wrote: > >> I think >> https://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence >> might shed some light on the behaviour you’re seeing. >> >> >> >>

Intermedate stage will be cached automatically ?

2015-06-17 Thread canan chen
Here's one simple Spark example where I call RDD#count twice. The first time it invokes 2 stages, but the second time only needs 1 stage. It seems the first stage is cached. Is that true? Is there any flag to control whether the intermediate stage is cached? val data = sc.parallelize(1 to 10, 2).m
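What is reused here is the map stage's shuffle output, not a cached RDD: the shuffle files survive on the executors, so the second job marks the map stage as skipped. A sketch that reproduces this (my own example, not the truncated code above):

    val data = sc.parallelize(1 to 10, 2).map(x => (x % 2, x)).reduceByKey(_ + _)
    data.count() // 2 stages: the map stage writes shuffle files
    data.count() // 1 stage: the map stage is skipped, shuffle output is reused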

Spark compilation issue on intellij

2015-06-08 Thread canan chen
Maybe someone has asked this question before. I have a compilation issue when compiling Spark SQL, and I found a couple of posts on Stack Overflow, but they didn't work for me. Does anyone have experience with this? Thanks. http://stackoverflow.com/questions/26788367/quasiquotes-in-intellij-14 Error:

Re: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread canan chen
etween > these reducers tasks since each shuffle will consume a lot of memory ? > > On Tue, May 26, 2015 at 7:27 PM, Evo Eftimov > wrote: > > the link you sent says multiple executors per node > > Worker is just demon process launching Executors / JVMs so it can execute &

Re: How many executors can I acquire in standalone mode ?

2015-05-27 Thread canan chen
executors I want in the code ? On Tue, May 26, 2015 at 5:57 PM, Arush Kharbanda wrote: > I believe you would be restricted by the number of cores you have in your > cluster. Having a worker running without a core is useless. > > On Tue, May 26, 2015 at 3:04 PM, canan chen wrote: >

Re: How does spark manage the memory of executor with multiple tasks

2015-05-27 Thread canan chen
> > > Original message > From: Arush Kharbanda > Date: 2015/05/26 10:55 (GMT+00:00) > To: canan chen > Cc: Evo Eftimov, user@spark.apache.org > Subject: Re: How does spark manage the memory of executor with multiple > tasks > > Hi Evo, > > Worker is the

Re: Is the executor number fixed during the lifetime of one app ?

2015-05-27 Thread canan chen
n, the number of executor is not fixed, will change > dynamically according to the load. > > Thanks > Jerry > > 2015-05-27 14:44 GMT+08:00 canan chen : > >> It seems the executor number is fixed for the standalone mode, not sure >> other modes. >> > >

Is the executor number fixed during the lifetime of one app ?

2015-05-26 Thread canan chen
It seems the executor number is fixed for standalone mode; I'm not sure about other modes.

How many executors can I acquire in standalone mode ?

2015-05-26 Thread canan chen
In Spark standalone mode, there is one executor per worker. I am wondering how many executors I can acquire when I submit an app. Is it greedy mode (as many as I can acquire)?
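By default a standalone app does grab as many cores as the cluster offers; spark.cores.max caps that. A one-line spark-defaults.conf sketch (the value is an example):

    spark.cores.max    4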

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
ances as there is available in the Executor aka JVM Heap > > From: canan chen [mailto:ccn...@gmail.com] > Sent: Tuesday, May 26, 2015 9:30 AM > To: Evo Eftimov > Cc: user@spark.apache.org > Subject: Re: How does spark manage the memory of executor with multipl

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
dard > concepts familiar to every Java, Scala etc developer > > From: canan chen [mailto:ccn...@gmail.com] > Sent: Tuesday, May 26, 2015 9:02 AM > To: user@spark.apache.org > Subject: How does spark manage the memory of executor with multiple > tasks > >

How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
Since Spark can run multiple tasks in one executor, I am curious how Spark manages memory across these tasks. Say one executor takes 1GB of memory; if this executor can run 10 tasks simultaneously, then each task can consume 100MB on average. Do I understand it correctly? It do
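For what it's worth, the split is not a fixed 1/N: Spark's shuffle memory manager lets each of N concurrently running tasks claim between 1/(2N) and 1/N of the execution memory pool, so the shares grow and shrink with the task count. A toy calculation of those bounds (example values):

    val poolMb = 1024.0 // execution memory pool, example value
    val tasks  = 10     // concurrently running tasks
    println(f"per-task share: ${poolMb / (2 * tasks)}%.1f MB .. ${poolMb / tasks}%.1f MB")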