Re: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-10 Thread Guillaume Pitel
Hi, I had to set up a cron job for cleanup in $SPARK_HOME/work and in $SPARK_LOCAL_DIRS. Here are the cron lines. Unfortunately they are for *nix machines; I guess you will have to adapt them seriously for Windows. 12 * * * * find $SPARK_HOME/work -cmin +1440 -prune -exec rm -rf {} \+ 32 * * * *

Re: Keep local variable

2015-04-10 Thread Tassilo Klein
Hi Gerard, thanks for the hint with the Singleton object. Seems very interesting. However, when my singleton object (e.g. a handle to my DB) is supposed to have a member variable that is non-serializable, I will again have a problem, won't I? At least I always run into issues where Python tries to pic
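For reference, a minimal Scala sketch of the per-executor singleton pattern being discussed (the object name and JDBC URL are placeholders; in PySpark the analogous trick is a lazily initialized module-level handle):

    import java.sql.{Connection, DriverManager}

    // Per-JVM singleton holding a non-serializable handle. Only the object's name
    // is referenced from closures; the connection itself is created lazily the
    // first time an executor touches it. The JDBC URL is a placeholder.
    object DBHandle {
      lazy val connection: Connection =
        DriverManager.getConnection("jdbc:postgresql://dbhost/mydb")
    }

    // Usage inside an action (sketch):
    // rdd.foreachPartition { records =>
    //   val conn = DBHandle.connection
    //   records.foreach { r => /* use conn */ }
    // }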

Re: Benchmarking col vs row similarities

2015-04-10 Thread Burak Yavuz
Depends... The heartbeat issue you are seeing happens due to GC pressure (probably due to a Full GC). If you increase the memory too much, the GCs may be less frequent, but the Full GCs may take longer. Try increasing the following confs: spark.executor.heartbeatInterval spark.core.connection.ack.wait.tim
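A minimal sketch of raising those settings from the application side, assuming the truncated second key is spark.core.connection.ack.wait.timeout; the values are illustrative only:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only; tune for the cluster. The second key is assumed
    // to be the truncated "spark.core.connection.ack.wait.timeout" from the reply.
    val conf = new SparkConf()
      .setAppName("similarity-benchmark")
      .set("spark.executor.heartbeatInterval", "60000")      // milliseconds in Spark 1.x
      .set("spark.core.connection.ack.wait.timeout", "600")  // seconds
    val sc = new SparkContext(conf)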

Re: Benchmarking col vs row similarities

2015-04-10 Thread Debasish Das
I will increase memory for the job... that will also fix it, right? On Apr 10, 2015 12:43 PM, "Reza Zadeh" wrote: > You should pull in this PR: https://github.com/apache/spark/pull/5364 > It should resolve that. It is in master. > Best, > Reza > > On Fri, Apr 10, 2015 at 8:32 AM, Debasish Das > w

foreach going in infinite loop

2015-04-10 Thread Jeetendra Gangele
Hi All, I am running the code below. Before calling foreach I did 3 transformations using mapToPair. In my application there are 16 executors, but no executor is running anything. rddWithscore.foreach(new VoidFunction>>() { @Override public void call(Tuple2> t) throws Exception { Entry maxEntry = null; for
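A hypothetical Scala sketch of the same pattern (keys, scores and the data shape are made up); foreach is the action that triggers the preceding transformations, so if nothing runs the problem is usually in an upstream stage rather than in foreach itself:

    import org.apache.spark.{SparkConf, SparkContext}

    object MaxEntryExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("max-entry"))
        // Assumed shape: (key, candidateId -> score)
        val rddWithScore = sc.parallelize(Seq(
          (1L, Map(10L -> 0.3, 11L -> 0.9)),
          (2L, Map(20L -> 0.7))
        ))
        // println output lands in the executor logs, not the driver console.
        rddWithScore.foreach { case (key, scores) =>
          val (bestId, bestScore) = scores.maxBy(_._2)
          println(s"$key -> $bestId ($bestScore)")
        }
        sc.stop()
      }
    }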

DataFrame column name restriction

2015-04-10 Thread Justin Yip
Hello, Are there any restrictions on column names? I tried to use ".", but sqlContext.sql cannot find the column. I would guess that "." is tricky as it affects accessing StructType, but are there any other restrictions on column names? scala> case class A(a: Int) defined class A scala> sqlCont
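One workaround that is often suggested, sketched below for a hypothetical column literally named "a.b" registered as table "t"; backtick-quoting keeps the SQL parser from treating the dot as struct-field access, though support can vary with the Spark SQL version:

    // The table name "t" and column "a.b" are made up for illustration.
    case class Dotted(`a.b`: Int)
    // val df = sqlContext.createDataFrame(Seq(Dotted(1)))
    // df.registerTempTable("t")
    // sqlContext.sql("SELECT `a.b` FROM t").show()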

Re: Streaming anomaly detection using ARIMA

2015-04-10 Thread Corey Nolet
Sean, I do agree about the "inside out" parallelization, but my curiosity is mostly about what kind of performance I can expect by piping out to R. I'm playing with Twitter's new Anomaly Detection library, btw; this could be a solution if I can get the calls to R to stand up to the massive data
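A minimal sketch of piping partitions out to an external R process with RDD.pipe; the script name, input/output paths and line protocol are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    object PipeToR {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipe-to-r"))
        // "anomaly.R" is a placeholder script that reads one value per line on
        // stdin and writes one result per line on stdout; it must be present on
        // every worker (or be shipped with --files).
        val series = sc.textFile("hdfs:///data/metric-series")
        val scored = series.pipe("Rscript anomaly.R")
        scored.saveAsTextFile("hdfs:///data/anomaly-scores")
        sc.stop()
      }
    }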

Re: coalesce(*, false) problem

2015-04-10 Thread Tathagata Das
Coalesce tries to reduce the number of partitions to a smaller number without moving the data around (as much as possible). Since most of the received data is on a few machines (those running receivers), coalesce just makes bigger merged partitions on those. Without coalesce Machine 1:
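A short sketch of the difference described above, for a hypothetical RDD of received records; the target of 8 partitions is arbitrary:

    import org.apache.spark.rdd.RDD

    // With shuffle = false (the default) coalesce only merges partitions in place,
    // so records stay on the receiver machines; with shuffle = true it behaves
    // like repartition and spreads them across the cluster.
    def rebalance(received: RDD[String]): (RDD[String], RDD[String]) = {
      val mergedLocally = received.coalesce(8)                  // no data movement
      val redistributed = received.coalesce(8, shuffle = true)  // same as repartition(8)
      (mergedLocally, redistributed)
    }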

RE: Is the disk space in SPARK_LOCAL_DIRS cleaned up?

2015-04-10 Thread Wang, Ningjun (LNG-NPV)
Does anybody have an answer for this? Thanks, Ningjun. From: Wang, Ningjun (LNG-NPV) Sent: Thursday, April 02, 2015 12:14 PM To: user@spark.apache.org Subject: Is the disk space in SPARK_LOCAL_DIRS cleaned up? I set SPARK_LOCAL_DIRS to C:\temp\spark-temp. When RDDs are shuffled, Spark writes

How to use the --files arg

2015-04-10 Thread Udit Mehta
Hi, Suppose I have a command and I pass the --files arg as below: bin/spark-submit --class com.test.HelloWorld --master yarn-cluster --num-executors 8 --driver-memory 512m --executor-memory 2048m --executor-cores 4 --queue public --files $HOME/myfile.txt --name test_1 ~/test_code-1.0-SNAPSHOT
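A minimal Scala sketch of reading that file from inside the job via SparkFiles; exact driver-side behavior in yarn-cluster mode may vary by version:

    import org.apache.spark.SparkFiles
    import scala.io.Source

    // Resolves the local copy of the file shipped with --files; on executors this
    // points into the task's working area, and in yarn-cluster mode the file is
    // also localized into the container's working directory.
    val path  = SparkFiles.get("myfile.txt")
    val lines = Source.fromFile(path).getLines().toList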

Re: ClassCastException when calling updateStateKey

2015-04-10 Thread Pradeep Rai
Hi Marcelo, I am not including Spark's classes. When I used the userClasspathFirst flag, I started getting those errors. Been there, done that. Removing guava classes was one of the first things I tried. I saw your replies to a similar problem from Sept. http://apache-spark-developers-list.10

The $ notation for DataFrame Column

2015-04-10 Thread Justin Yip
Hello, The DataFrame documentation always uses $"columnX" to annotate a column, but I cannot find much information about it. Maybe I have missed something. Can anyone point me to the doc about the "$", if there is any? Thanks. Justin
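A short sketch of where the $ syntax comes from: importing the SQLContext implicits adds a StringContext extension so that $"name" builds a Column, equivalent to df("name") or org.apache.spark.sql.functions.col("name"). The DataFrame and column names below are placeholders, and an existing SparkContext named sc is assumed:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Given some DataFrame df (sketch):
    // df.select($"columnX").show()
    // df.filter($"columnX" > 21).show()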

Getting OutOfMemory errors on Spark

2015-04-10 Thread Anshul Singhle
Hi, I'm reading data stored in S3 and aggregating and storing it in Cassandra using a spark job. When I run the job with approx 3Mil records (about 3-4 GB of data) stored in text files, I get the following error: (11529/14925)15/04/10 19:32:43 INFO TaskSetManager: Starting task 11609.0 in stage
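One common mitigation, sketched below with placeholder paths, partition count and aggregation (the real fix depends on where the OOM actually happens): ask for more input partitions so each task holds a smaller slice of the S3 data.

    import org.apache.spark.{SparkConf, SparkContext}

    object S3Aggregate {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("s3-aggregate"))
        // More input partitions means less data held per task.
        val records = sc.textFile("s3n://my-bucket/input/*", minPartitions = 2000)
        val counts  = records.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)
        counts.saveAsTextFile("s3n://my-bucket/output/")
        sc.stop()
      }
    }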

Re: Benchmarking col vs row similarities

2015-04-10 Thread Reza Zadeh
You should pull in this PR: https://github.com/apache/spark/pull/5364 It should resolve that. It is in master. Best, Reza On Fri, Apr 10, 2015 at 8:32 AM, Debasish Das wrote: > Hi, > > I am benchmarking row vs col similarity flow on 60M x 10M matrices... > > Details are in this JIRA: > > https:/