Re: Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Hi Tobias, It seems this parameter is an input to the function. What I am expecting is output from a function that tells me the starting or ending time of the batch. For instance, if I use saveAsTextFiles, it seems DStream will generate a batch every minute and the starting time is a complete minu

Re: Configuring Spark Memory

2014-07-23 Thread Nishkam Ravi
See if this helps: https://github.com/nishkamravi2/SparkAutoConfig/ It's a very simple tool for auto-configuring default parameters in Spark. Takes as input high-level parameters (like number of nodes, cores per node, memory per node, etc) and spits out default configuration, user advice and comm

Re: What if there are large, read-only variables shared by all map functions?

2014-07-23 Thread Aaron Davidson
In particular, take a look at the TorrentBroadcast, which should be much more efficient than HttpBroadcast (which was the default in 1.0) for large files. If you find that TorrentBroadcast doesn't work for you, then another way to solve this problem is to place the data on all nodes' local disks,
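
For illustration, a minimal Scala sketch of the broadcast approach described above (the lookup map, paths, and the explicit TorrentBroadcastFactory setting are placeholders for the example, not details from the thread):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("broadcast-example")
      // ask for the torrent-based implementation instead of HttpBroadcast
      .set("spark.broadcast.factory", "org.apache.spark.broadcast.TorrentBroadcastFactory")
    val sc = new SparkContext(conf)

    // load the large read-only structure once on the driver...
    val lookup: Map[String, Int] = Map("a" -> 1, "b" -> 2)   // stand-in for the ~6GB data
    // ...and ship it to every executor once, instead of inside every task closure
    val lookupBc = sc.broadcast(lookup)

    val result = sc.textFile("hdfs:///input").map { line =>
      lookupBc.value.getOrElse(line, 0)   // map tasks read the shared copy
    }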

Re: persistent HDFS instance for cluster restarts/destroys

2014-07-23 Thread durga
Thanks Mayur. Is there any documentation/readme with a step-by-step process available for adding or deleting nodes? Thanks, D. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/persistent-HDFS-instance-for-cluster-restarts-destroys-tp10551p10565.html Sent from

Re: Lost executors

2014-07-23 Thread Eric Friedman
And... PEBCAK. I mistakenly believed I had set PYSPARK_PYTHON to a Python 2.7 install, but it was pointing at a Python 2.6 install on the remote nodes, hence incompatible with what the master was sending. I have set this to point to the correct version everywhere and it works. Apologies for the false alarm.

Re: Lost executors

2014-07-23 Thread Eric Friedman
hi Andrew, Thanks for your note. Yes, I see a stack trace now. It seems to be an issue with python interpreting a function I wish to apply to an RDD. The stack trace is below. The function is a simple factorial: def f(n): if n == 1: return 1 return n * f(n-1) and I'm trying to use it lik

Re: What if there are large, read-only variables shared by all map functions?

2014-07-23 Thread Mayur Rustagi
Have a look at broadcast variables . On Tuesday, July 22, 2014, Parthus wrote: > Hi there, > > I was wondering if anybody could help me find an efficient way to make a > MapReduce program like this: > > 1) For each map function, it need access some huge files, which is around > 6GB > > 2) These

Re: persistent HDFS instance for cluster restarts/destroys

2014-07-23 Thread Mayur Rustagi
Yes, you lose the data. You can add machines, but that will require you to restart the cluster. Also, adding is manual; you add the nodes yourself. Regards Mayur On Wednesday, July 23, 2014, durga wrote: > Hi All, > I have a question, > > For my company , we are planning to use spark-ec2 scripts to create cluster >

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
For what it's worth, I got it to work with a Cartesian product; even though it's very inefficient, it worked out alright for me. The trick was to flatMap it (step 4) after the cartesian product so that I could do a reduceByKey and find the commonalities. After I was done, I could check if any Value pai
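
A rough Scala sketch of the cartesian-then-reduce idea described above (hypothetical names; it assumes the input is an RDD of (ip, set-of-user-ids) pairs and that the data is small enough to tolerate the quadratic blow-up):

    import org.apache.spark.SparkContext._   // for reduceByKey on pair RDDs
    import org.apache.spark.rdd.RDD

    // ipUsers: RDD[(String, Set[Long])] of (ip, userIds) -- placeholder input
    def mergeAssociated(ipUsers: RDD[(String, Set[Long])]): RDD[(Long, Set[Long])] =
      ipUsers.cartesian(ipUsers)
        // keep only pairs of IPs whose user-id sets overlap
        .filter { case ((_, usersA), (_, usersB)) => (usersA & usersB).nonEmpty }
        // flatten so the merged set is keyed by every user id it contains
        .flatMap { case ((_, usersA), (_, usersB)) =>
          val merged = usersA ++ usersB
          merged.map(uid => (uid, merged))
        }
        // reduceByKey collapses the overlapping sets per user id
        .reduceByKey(_ ++ _)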

Re: Get Spark Streaming timestamp

2014-07-23 Thread Tobias Pfeiffer
Bill, Spark Streaming's DStream provides overloaded methods for transform() and foreachRDD() that allow you to access the timestamp of a batch: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream I think the timestamp is the end of the batch, not th
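
For reference, a minimal sketch of that overload (assuming a 60-second batch interval and a socket source; the output path is made up for the example):

    import org.apache.spark.streaming.{Seconds, StreamingContext, Time}

    val ssc = new StreamingContext(sc, Seconds(60))
    val lines = ssc.socketTextStream("localhost", 9999)

    // the (RDD, Time) overload hands you the batch timestamp explicitly
    lines.foreachRDD { (rdd, time: Time) =>
      // time.milliseconds is the same batch time that saveAsTextFiles puts in its file names
      rdd.saveAsTextFile("hdfs:///results/batch-" + time.milliseconds)
    }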

Re: Streaming. Cannot get socketTextStream to receive anything.

2014-07-23 Thread kytay
Hi TD You are right, I did not include "\n" to delimit the string flushed. That's the reason. Is there a way for me to define the delimiter? Like SOH or ETX instead of "\n" Regards kytay -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Solved-Streaming-Can
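
One possible direction, not from the thread: the lower-level socketStream variant takes a user-supplied converter from the raw InputStream to an iterator of records, so the framing can split on ETX (0x03) instead of newlines. A hedged sketch (assumes an existing StreamingContext named ssc):

    import java.io.InputStream
    import org.apache.spark.storage.StorageLevel

    // read the socket byte by byte and emit a record whenever the ETX byte appears
    def etxDelimited(in: InputStream): Iterator[String] = new Iterator[String] {
      private val buf = new StringBuilder
      private var nextByte = in.read()
      def hasNext: Boolean = nextByte != -1
      def next(): String = {
        buf.clear()
        while (nextByte != -1 && nextByte != 0x03) {   // 0x03 = ETX
          buf.append(nextByte.toChar)
          nextByte = in.read()
        }
        if (nextByte == 0x03) nextByte = in.read()     // skip the delimiter itself
        buf.toString
      }
    }

    val records = ssc.socketStream("localhost", 9999, etxDelimited _, StorageLevel.MEMORY_AND_DISK_SER)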

streaming sequence files?

2014-07-23 Thread Barnaby
If I save an RDD as a sequence file such as: val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.foreachRDD( d => { d.saveAsSequenceFile("tachyon://localhost:19998/files/WordCounts-" + (new SimpleDateFormat("MMdd-HHmmss") format Calendar.getInstance.getTime).t
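
If the follow-up question is how to read those files back as a stream, one hedged sketch (the directory is hypothetical, and it assumes the saved pairs are (String, Int), which saveAsSequenceFile writes as Text/IntWritable, plus an existing StreamingContext named ssc):

    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat

    // watch a directory for new sequence files and convert the Writables back to plain types
    val counts = ssc
      .fileStream[Text, IntWritable, SequenceFileInputFormat[Text, IntWritable]]("hdfs:///wordcounts/")
      .map { case (word, count) => (word.toString, count.get) }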

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Ah yes, you're quite right with partitions I could probably process a good chunk of the data but I didn't think a reduce would work? Sorry, I'm still new to Spark and map reduce in general but I thought that the reduce result wasn't an RDD and had to fit into memory. If the result of a reduce can b

RE: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
/JaccardScore.classapproxstrmatch/JaccardScore$$anonfun$calculateSortedJaccardScore$1$$anonfun$5.classapproxstrmatch/JaccardScore$$anonfun$calculateJaccardScore$1$$anonfun$1.class However, when I start my spark shell: spark-shell --jars /apps/sameert/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar

Re: Where is the "PowerGraph abstraction"

2014-07-23 Thread Ankur Dave
We removed the PowerGraph abstraction layer when merging GraphX into Spark to reduce the maintenance costs. You can still read the code

persistent HDFS instance for cluster restarts/destroys

2014-07-23 Thread durga
Hi All, I have a question. For my company, we are planning to use spark-ec2 scripts to create a cluster for us. I understand that persistent HDFS will make the HDFS available across cluster restarts. The question is: 1) What happens if I destroy and re-create? Do I lose the data? a) If I loos

Announcing Spark 0.9.2

2014-07-23 Thread Xiangrui Meng
I'm happy to announce the availability of Spark 0.9.2! Spark 0.9.2 is a maintenance release with bug fixes across several areas of Spark, including Spark Core, PySpark, MLlib, Streaming, and GraphX. We recommend all 0.9.x users to upgrade to this stable release. Contributions to this release came f

Get Spark Streaming timestamp

2014-07-23 Thread Bill Jay
Hi all, I have a question regarding Spark streaming. When we use the saveAsTextFiles function and my batch is 60 seconds, Spark will generate a series of files such as: result-140614896, result-140614802, result-140614808, etc. I think this is the timestamp for the beginning of each

Re: Spark job tracker.

2014-07-23 Thread abhiguruvayya
Is there anything equivalent to Hadoop's "Job" (org.apache.hadoop.mapreduce.Job) in Spark? Once I submit the Spark job I want to concurrently read the SparkListener interface implementation methods where I can grab the job status. I am trying to find a way to wrap the spark submit object into one t
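
A rough sketch of the SparkListener route mentioned above (a minimal illustration; the println bodies are placeholders for real status handling):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

    // register a listener on the same SparkContext the job runs in
    sc.addSparkListener(new SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println("job " + jobStart.jobId + " started")
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println("job " + jobEnd.jobId + " finished: " + jobEnd.jobResult)
    })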

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
That worked for me as well. I was using Spark 1.0 compiled against Hadoop 1.0; switching to 1.0.1 compiled against Hadoop 2 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/wholeTextFiles-not-working-with-HDFS-tp7490p10547.html Sent from the Apache Spark User

Re: Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Sean Owen
So, given sets, you are joining overlapping sets until all of them are mutually disjoint, right? If graphs are out, then I also would love to see a slick distributed solution, but couldn't think of one. It seems like a cartesian product won't scale. You can write a simple method to implement this

Re: spark github source build error

2014-07-23 Thread m3.sharma
Thanks, it works now :) On Wed, Jul 23, 2014 at 11:45 AM, Xiangrui Meng [via Apache Spark User List] wrote: > try `sbt/sbt clean` first? -Xiangrui > > On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma <[hidden email] > > wrote: > > > I am trying to

using shapeless in spark to optimize data layout in memory

2014-07-23 Thread Koert Kuipers
hello all, in case anyone is interested, i just wrote a short blog about using shapeless in spark to optimize data layout in memory. blog is here: http://tresata.com/tresata-open-sources-spark-columnar code is here: https://github.com/tresata/spark-columnar

Spark cluster spanning multiple data centers

2014-07-23 Thread Ray Qiu
Hi, Is it feasible to deploy a Spark cluster spanning multiple data centers if there are fast connections with not-too-high latency (30ms) between them? I don't know whether there are any assumptions in the software expecting all cluster nodes to be local (super low latency, for instance). Has an

Re: spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
Turns out to be an issue with the number of fields being read; one of the fields might be missing from the raw data file, causing this error. Michael Armbrust pointed it out in another thread. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-running-int

Re: Convert raw data files to Parquet format

2014-07-23 Thread buntu
That seems to be the issue, when I reduce the number of fields it works perfectly fine. Thanks again Michael.. that was super helpful!! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Convert-raw-data-files-to-Parquet-format-tp10526p10541.html Sent from the

Error in History UI - Seeing stdout/stderr

2014-07-23 Thread balvisio
Hello all, I have noticed what (I think) is erroneous behavior when using the WebUI: 1. Launching the app from Eclipse to a cluster (with 3 workers) 2. Using Spark 0.9.0 (Cloudera distribution 5.0.1) 3. The application makes the worker write to stdout using System.out.println(...) When the Applic

Re: Convert raw data files to Parquet format

2014-07-23 Thread Michael Armbrust
BTW, I knew this because the top line was "<console>:21". Anytime you see "<console>" that means that the code is something that you typed into the REPL. On Wed, Jul 23, 2014 at 11:55 AM, Michael Armbrust wrote: > Looks like a bug in your lambda function. Some of the lines you are > processing must have less t

Re: Convert raw data files to Parquet format

2014-07-23 Thread Michael Armbrust
Looks like a bug in your lambda function. Some of the lines you are processing must have less than 6 elements, so doing p(5) is failing. On Wed, Jul 23, 2014 at 11:44 AM, buntu wrote: > Thanks Michael. > > If I read in multiple files and attempt to saveAsParquetFile() I get the > ArrayIndexOut
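
A common defensive pattern for this situation (a sketch, using a hypothetical input path and the 6-column Point case class from the thread):

    // drop rows that do not have all 6 tab-separated fields before indexing p(5)
    val points = sc.textFile("rawdata/*")
      .map(_.split("\t"))
      .filter(_.length >= 6)   // guards against the ArrayIndexOutOfBoundsException
      .map(p => Point(p(0), p(1), p(2), p(3).trim.toInt, p(4).trim.toInt, p(5)))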

Re: spark github source build error

2014-07-23 Thread Xiangrui Meng
try `sbt/sbt clean` first? -Xiangrui On Wed, Jul 23, 2014 at 11:23 AM, m3.sharma wrote: > I am trying to build spark after cloning from github repo: > > I am executing: > ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly > > I am getting following error: > [warn] ^

Re: Convert raw data files to Parquet format

2014-07-23 Thread buntu
Thanks Michael. If I read in multiple files and attempt to saveAsParquetFile() I get the ArrayIndexOutOfBoundsException. I don't see this if I try the same with a single file: > case class Point(dt: String, uid: String, kw: String, tz: Int, success: > Int, code: String ) > val point = sc.textFil

Re: Spark Streaming: no job has started yet

2014-07-23 Thread Bill Jay
The code is pretty long. But the main idea is to consume from Kafka, preprocess the data, and groupBy a field. I use multiple DStreams to add parallelism to the consumer. It seems that when the number of DStreams is large, this happens often. Thanks, Bill On Tue, Jul 22, 2014 at 11:13 PM, Akhil Das

Re: combineByKey at ShuffledDStream.scala

2014-07-23 Thread Bill Jay
The streaming program contains the following main stages: 1. receive data from Kafka 2. preprocessing of the data. These are all map and filtering stages. 3. Group by a field 4. Process the groupBy results using map. Inside this processing, I use collect, count. Thanks! Bill On Tue, Jul 22, 20

RE: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
$calculateSortedJaccardScore$1$$anonfun$5.classapproxstrmatch/JaccardScore$$anonfun$calculateJaccardScore$1$$anonfun$1.class However, when I start my spark shell: spark-shell --jars /apps/sameert/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar /apps/sameert/software/approxstrmatch/target

spark github source build error

2014-07-23 Thread m3.sharma
I am trying to build spark after cloning from github repo: I am executing: ./sbt/sbt -Dhadoop.version=2.4.0 -Pyarn assembly I am getting following error: [warn] ^ [error] [error] while compiling: /home/m3.sharma/installSrc/spark/spark/sql/core/src/main/scala/o

Re: driver memory

2014-07-23 Thread Andrew Or
Hi Maria, SPARK_MEM is actually deprecated because it was too general; the reason it worked is that SPARK_MEM applies to everything (drivers, executors, masters, workers, history servers...). In favor of more specific configs, we broke this down into SPARK_DRIVER_MEMORY and SPARK_EXECUTOR_ME
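
For illustration, two hedged ways to apply the driver-specific setting in Spark 1.0 (the memory size, class name, and jar are placeholders):

    # either export the environment variable before launching the shell/driver...
    export SPARK_DRIVER_MEMORY=4g
    ./bin/spark-shell

    # ...or pass it explicitly to spark-submit
    ./bin/spark-submit --driver-memory 4g --class com.example.MyApp myapp.jar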

Help in merging an RDD against itself using the V of a (K,V).

2014-07-23 Thread Roch Denis
Hello, Most of the tasks I've accomplished in Spark were fairly straightforward, but I can't figure out the following problem using the Spark API: Basically, I have an IP with a bunch of user IDs associated with it. I want to create a list of all user IDs that are associated together, even if some are on

Re: Use of SPARK_DAEMON_JAVA_OPTS

2014-07-23 Thread Andrew Or
Hi Meethu, SPARK_DAEMON_JAVA_OPTS is not intended for setting memory. Please use SPARK_DAEMON_MEMORY instead. It turns out that java respects only the last -Xms and -Xmx values, and in spark-class we put SPARK_DAEMON_JAVA_OPTS before the SPARK_DAEMON_MEMORY. In general, memory configuration in spa

Re: How to do an interactive Spark SQL

2014-07-23 Thread hsy...@gmail.com
Anyone has any idea on this? On Tue, Jul 22, 2014 at 7:02 PM, hsy...@gmail.com wrote: > But how do they do the interactive sql in the demo? > https://www.youtube.com/watch?v=dJQ5lV5Tldw > > And if it can work in the local mode. I think it should be able to work in > cluster mode, correct? > > >

Re: Convert raw data files to Parquet format

2014-07-23 Thread Michael Armbrust
Take a look at the programming guide for spark sql: http://spark.apache.org/docs/latest/sql-programming-guide.html On Wed, Jul 23, 2014 at 11:09 AM, buntu wrote: > I wanted to experiment with using Parquet data with SparkSQL. I got some > tab-delimited files and wanted to know how to convert th
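
For reference, a hedged sketch of the flow that guide describes for Spark 1.0 (the case class, column layout, and paths are illustrative, not from the original message):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD   // implicit conversion from an RDD of case classes to SchemaRDD

    case class Record(id: String, value: Int)

    // parse the tab-delimited text into the case class, then write Parquet
    val records = sc.textFile("hdfs:///raw/*.tsv")
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(f => Record(f(0), f(1).trim.toInt))

    records.saveAsParquetFile("hdfs:///out/records.parquet")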

Convert raw data files to Parquet format

2014-07-23 Thread buntu
I wanted to experiment with using Parquet data with SparkSQL. I got some tab-delimited files and wanted to know how to convert them to Parquet format. I'm using standalone spark-shell. Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Convert-raw-data

Re: Lost executors

2014-07-23 Thread Andrew Or
Hi Eric, Have you checked the executor logs? It is possible they died because of some exception, and the message you see is just a side effect. Andrew 2014-07-23 8:27 GMT-07:00 Eric Friedman : > I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc. > Cluster resources are

Re: spark-submit to remote master fails

2014-07-23 Thread Andrew Or
Yeah, seems to be the case. In general your executors should be able to reach the driver, which I don't think is the case for you currently (LinuxDevVM.local:59266 seems very private). What you need is some sort of gateway node that can be publicly reached from your worker machines to launch your d

Re: Graphx : Perfomance comparison over cluster

2014-07-23 Thread ShreyanshB
Thanks Ankur. The version with in-memory shuffle is here: https://github.com/amplab/graphx2/commits/vldb. Unfortunately Spark has changed a lot since then, and the way to configure and invoke Spark is different. I can send you the correct configuration/invocation for this if you're interested in b

spark-submit to remote master fails

2014-07-23 Thread didi
Hi all, I guess the problem has something to do with the fact that I submit the job to a remote location. I submit from an OracleVM running Ubuntu and suspect some NAT issues, maybe? Akka TCP tries this address, as seen in the appended STDERR output: akka.tcp://spark@LinuxDevVM.local:59266 STDERR

why there is only getString(index) but no getString(columnName) in catalyst.expressions.Row.scala ?

2014-07-23 Thread chutium
I do not want to always use schemaRDD.map { case Row(xxx) => ... }; using case we must rewrite the table schema again. Is there any plan to implement this? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/why-there-is-only-getString-index-but-no

Re: spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
Just wanted to add more info.. I was using SparkSQL reading in the tab-delimited raw data files converting the timestamp to Date format: sc.textFile("rawdata/*").map(_.split("\t")).map(p => Point(df.format(new Date( p(0).trim.toLong*1000L )), p(1), p(2).trim.toInt ,p(3).trim.toInt, p(4).trim.toI

Re: error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sean Owen
Any help > with this will be great! > > > scalac -classpath > /apps/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar:/opt/cloudera/parcels/CDH/lib/spark/core/lib/spark-core_2.10-1.0.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH/lib/spark/lib/kryo-2.21.jar:/opt/clou

error: bad symbolic reference. A signature in SparkContext.class refers to term io in package org.apache.hadoop which is not available

2014-07-23 Thread Sameer Tilak
will be great! scalac -classpath /apps/software/secondstring/secondstring/dist/lib/secondstring-20140723.jar:/opt/cloudera/parcels/CDH/lib/spark/core/lib/spark-core_2.10-1.0.0-cdh5.1.0.jar:/opt/cloudera/parcels/CDH/lib/spark/lib/kryo-2.21.jar:/opt/cloudera/parcels/CDH/lib/hadoop/lib/commons-io-2.4

Re: How could I start new spark cluster with hadoop2.0.2

2014-07-23 Thread durga
Hi, It seems I can only give --hadoop-major-version=2; it is taking 2.0.0. How could I say it should use 2.0.2? Is there any --hadoop-minor-version variable I can use? Thanks, D. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-start-new-spark-

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Marcelo Vanzin
Discussions about how CDH packages Spark aside, you should be using the spark-class script (assuming you're still in 0.9) instead of executing Java directly. That will make sure that the environment needed to run Spark apps is set up correctly. CDH 5.1 ships with Spark 1.0.0, so it has spark-submi

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-23 Thread Yin Huai
Yes, https://issues.apache.org/jira/browse/SPARK-2576 is used to track it. On Wed, Jul 23, 2014 at 9:11 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Do we have a JIRA issue to track this? I think I've run into a similar > issue. > > > On Wed, Jul 23, 2014 at 1:12 AM, Yin Huai wr

Re: How could I start new spark cluster with hadoop2.0.2

2014-07-23 Thread durga
Thanks Akhil -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-could-I-start-new-spark-cluster-with-hadoop2-0-2-tp10450p10514.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

akka 2.3.x?

2014-07-23 Thread Lee Mighdoll
Is there a branch somewhere with current Spark on a current version of Akka? I'm trying to embed Spark into a spray app. I can probably backport our app to 2.2.3 for a little while, but I wouldn't want to be stuck there too long. Related: do I need the protobuf shading if I'm using the spark-cassand

Re: Down-scaling Spark on EC2 cluster

2014-07-23 Thread Nicholas Chammas
There is a JIRA issue to track adding such functionality to spark-ec2: SPARK-2008 - Enhance spark-ec2 to be able to add and remove slaves to an existing cluster ​ On Wed, Jul 23, 2014 at 10:12 AM, Akhil Das wrote: > Hi > > Currently this is not

Re: Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Chris Schneider
Thanks Steve, but my goal is to hopefully avoid installing yet another component into my environment. Yarn is cool, but wouldn't be used for anything but Spark. We have no hadoop in our ecosystem (or HDFS). Ideally I'd avoid having to learn about yet another tool. On Wed, Jul 23, 2014 at 11:12

Re: spark1.0.1 spark sql error java.lang.NoClassDefFoundError: Could not initialize class $line11.$read$

2014-07-23 Thread Nicholas Chammas
Do we have a JIRA issue to track this? I think I've run into a similar issue. On Wed, Jul 23, 2014 at 1:12 AM, Yin Huai wrote: > It is caused by a bug in Spark REPL. I still do not know which part of the > REPL code causes it... I think people working REPL may have better idea. > > Regarding ho

Re: Have different reduce key than mapper key

2014-07-23 Thread Nathan Kronenfeld
You do them sequentially in code; Spark will take care of combining them in execution. so something like: foo.map(fcn to [K1, V1]).reduceByKey(fcn from (V1, V1) to V1).map(fcn from (K1, V1) to (K2, V2)) On Wed, Jul 23, 2014 at 11:22 AM, soumick86 wrote: > How can I transform the mapper key at t
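
A concrete, hedged version of that pattern (made-up data: summing word counts per word, then re-keying the result by word length):

    import org.apache.spark.SparkContext._   // for reduceByKey on pair RDDs

    // (K1, V1) = (word, count)
    val wordCounts = sc.parallelize(Seq(("spark", 1), ("hadoop", 1), ("spark", 1)))

    val byLength = wordCounts
      .reduceByKey(_ + _)                                   // reduce on the original key K1
      .map { case (word, count) => (word.length, count) }   // then re-key to K2 in a follow-up map
      .reduceByKey(_ + _)                                   // optionally aggregate again on K2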

Re: Configuring Spark Memory

2014-07-23 Thread Martin Goodson
Thanks Andrew, So if there is only one SparkContext there is only one executor per machine? This seems to contradict Aaron's message from the link above: "If each machine has 16 GB of RAM and 4 cores, for example, you might set spark.executor.memory between 2 and 3 GB, totaling 8-12 GB used by Sp

Lost executors

2014-07-23 Thread Eric Friedman
I'm using spark 1.0.1 on a quite large cluster, with gobs of memory, etc. Cluster resources are available to me via Yarn and I am seeing these errors quite often. ERROR YarnClientClusterScheduler: Lost executor 63 on : remote Akka client disassociated This is in an interactive shell session. I

Re: wholeTextFiles not working with HDFS

2014-07-23 Thread kmader
I have the same issue val a = sc.textFile("s3n://MyBucket/MyFolder/*.tif") a.first works perfectly fine, but val d = sc.wholeTextFiles("s3n://MyBucket/MyFolder/*.tif") does not work d.first Gives the following error message java.io.FileNotFoundExceptio

Re: Configuring Spark Memory

2014-07-23 Thread Sean Owen
On Wed, Jul 23, 2014 at 4:19 PM, Andrew Ash wrote: > > > In standalone mode, each SparkContext you initialize gets its own set of > executors across the cluster. So for example if you have two shells open, > they'll each get two JVMs on each worker machine in the cluster. > Dumb question offline

Have different reduce key than mapper key

2014-07-23 Thread soumick86
How can I transform the mapper key at the reducer output? The functions I have encountered are combineByKey, reduceByKey, etc. that work on the values and not on the key. For example below, this is what I want to achieve but it seems like I can only have K1 and not K2: Mapper->(K1,V1)->Reducer

Re: Configuring Spark Memory

2014-07-23 Thread Andrew Ash
Hi Martin, In standalone mode, each SparkContext you initialize gets its own set of executors across the cluster. So for example if you have two shells open, they'll each get two JVMs on each worker machine in the cluster. As far as the other docs, you can configure the total number of cores req
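
As a hedged example of the knobs referred to above (values are placeholders): "spark.executor.memory" caps the heap of the single executor an application gets on each worker, and "spark.cores.max" caps the total cores that application takes across the standalone cluster.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")        // standalone cluster manager
      .setAppName("memory-config-example")
      .set("spark.executor.memory", "4g")      // heap of this app's executor on each worker
      .set("spark.cores.max", "16")            // total cores this app may use cluster-wide
    val sc = new SparkContext(conf)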

Re: Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Steve Nunez
I'm also in the early stages of setting up long-running Spark jobs. The easiest way I've found is to set up a cluster and submit the job via YARN. Then I can come back and check in on progress when I need to. Seems the trick is tuning the queue priority and YARN preemption to get the job to run in a reason

Re: Workarounds for accessing sequence file data via PySpark?

2014-07-23 Thread Nick Pentreath
Load from sequenceFile for PySpark is in master and save is in this PR underway (https://github.com/apache/spark/pull/1338) I hope that Kan will have it ready to merge in time for 1.1 release window (it should be, the PR just needs a final review or two). In the meantime you can check out master

Workarounds for accessing sequence file data via PySpark?

2014-07-23 Thread Gary Malouf
I am aware that today PySpark can not load sequence files directly. Are there work-arounds people are using (short of duplicating all the data to text files) for accessing this data?

Cluster submit mode - only supported on Yarn?

2014-07-23 Thread Chris Schneider
We are starting to use Spark, but we don't have any existing infrastructure related to big-data, so we decided to setup the standalone cluster, rather than mess around with Yarn or Mesos. But it appears like the driver program has to stay up on the client for the full duration of the job ("client

Re: Down-scaling Spark on EC2 cluster

2014-07-23 Thread Akhil Das
Hi, Currently this is not supported out of the box. But you can of course add/remove workers in a running cluster. A better option would be to use a Mesos cluster, where adding/removing nodes is quite simple. But again, I believe adding a new worker in the middle of a task won't give you better perform

Re: Spark execution plan

2014-07-23 Thread Luis Guerra
Thanks for your answer. However, there has been a misunderstanding here. My question is related to controlling the parallel execution of different parts of the code, similarly to PIG, where there is a planning phase before the execution. On Wed, Jul 23, 2014 at 1:46 PM, chutium wrote: > it seems un

Configuring Spark Memory

2014-07-23 Thread Martin Goodson
We are having difficulties configuring Spark, partly because we still don't understand some key concepts. For instance, how many executors are there per machine in standalone mode? This is after having closely read the documentation several times: http://spark.apache.org/docs/latest/configuration

Down-scaling Spark on EC2 cluster

2014-07-23 Thread Shubhabrata
Hello, We plan to use Spark on EC2 for our data science pipeline. We successfully managed to set up a cluster as well as launch and run applications on remote clusters. To enhance scalability, we would like to implement auto-scaling in EC2 for Spark applications. However, I did not find any p

RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"

2014-07-23 Thread Shao, Saisai
Yeah, the document may not be precisely aligned with latest code, so the best way is to check the code. -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Wednesday, July 23, 2014 5:56 PM To: user@spark.apache.org Subject: RE: "spark.streaming.unpersist" and "spark.cl

Re: Spark execution plan

2014-07-23 Thread chutium
it seems union should work for this scenario in part C, try to use: output_a union output_b -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-execution-plan-tp10482p10491.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.0.1 SQL on 160 G parquet file (snappy compressed, made by cloudera impala), 23 core and 60G mem / node, yarn-client mode, always failed

2014-07-23 Thread chutium
In Spark 1.1 it may not be as easy as in Spark 1.0, after this commit: https://issues.apache.org/jira/browse/SPARK-2446. Only binary columns with the UTF8 annotation will be recognized as strings after this commit, but in Impala strings are always without the UTF8 anno -- View this message in context: http://apache-spark-

Re: driver memory

2014-07-23 Thread mrm
Hi, I figured out my problem so I wanted to share my findings. I was basically trying to broadcast an array with 4 million elements and a size of approximately 150 MB. Every time I tried to broadcast, I got an OutOfMemory error. I fixed my problem by increasing the driver memory using: exp

Re: java.lang.StackOverflowError when calling count()

2014-07-23 Thread lalit1303
Hi, Thanks TD for your reply. I am still not able to resolve the problem for my use case. I have, let's say, 1000 different RDDs, and I am applying a transformation function on each RDD and want the output of all RDDs combined into a single output RDD. For this, I am doing the following: ** tempRD
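
A hedged sketch of one way around very long lineage chains like this (not from the thread): build the combined RDD with a single SparkContext.union over the whole sequence and checkpoint it so the lineage is truncated.

    import org.apache.spark.rdd.RDD

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // required once before calling checkpoint()

    // transformedRdds: Seq[RDD[String]] -- placeholder for the ~1000 per-RDD transformation outputs
    def combine(transformedRdds: Seq[RDD[String]]): RDD[String] = {
      val combined = sc.union(transformedRdds)   // one union call instead of 1000 chained unions
      combined.checkpoint()                      // truncate the lineage so the DAG stays shallow
      combined.count()                           // an action materializes the RDD and writes the checkpoint
      combined
    }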

Re: Need info on log4j.properties for apache spark.

2014-07-23 Thread Sean Owen
You specify your own log4j configuration in the usual log4j way -- package it in your assembly, or specify on the command line for example. See http://logging.apache.org/log4j/1.2/manual.html The template you can start with is in core/src/main/resources/org/apache/spark/log4j-defaults.properties

driver memory

2014-07-23 Thread mrm
Hi, How do I increase the driver memory? This are my configs right now: sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > ./ephemeral-hdfs/conf/log4j.properties sed 's/INFO/ERROR/' spark/conf/log4j.properties.template > spark/conf/log4j.properties # Environment variables and Spark pro

Re: hadoop version

2014-07-23 Thread mrm
Thank you! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/hadoop-version-tp10405p10485.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"

2014-07-23 Thread Haopu Wang
Jerry, thanks for the response. For the default storage level of DStream, it looks like Spark's document is wrong. In this link: http://spark.apache.org/docs/latest/streaming-programming-guide.html#memory-tuning It mentions: "Default persistence level of DStreams: Unlike RDDs, the default persis

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Sean Owen
That's how it's supposed to work, right? You don't deploy an assembly .jar for this reason. You get things like Hadoop from the cluster at runtime. At least this was the gist of what Matei described last month. This is not some issue with CDH. On Wed, Jul 23, 2014 at 8:28 AM, Debasish Das wrote:

Spark execution plan

2014-07-23 Thread Luis Guerra
Hi all, I was wondering how spark may deal with an execution plan. Using PIG as example and its DAG execution, I would like to manage Spark for a similar solution. For instance, if my code has 3 different "parts", being A and B self-sufficient parts: Part A: .. . . var output_a Part

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Bertrand Dechoux
> Is there any documentation from cloudera on how to run Spark apps on CDH Manager deployed Spark ? Asking the cloudera community would be a good idea. http://community.cloudera.com/ In the end only Cloudera will fix quickly issues with CDH... Bertrand Dechoux On Wed, Jul 23, 2014 at 9:28 AM,

spark-shell -- running into ArrayIndexOutOfBoundsException

2014-07-23 Thread buntu
I'm using the spark-shell locally and working on a dataset of size 900MB. I initially ran into "java.lang.OutOfMemoryError: GC overhead limit exceeded" error and upon researching set "SPARK_DRIVER_MEMORY" to 4g. Now I run into ArrayIndexOutOfBoundsException, please let me know if there is some way

Use of SPARK_DAEMON_JAVA_OPTS

2014-07-23 Thread MEETHU MATHEW
Hi all, Sorry for bringing up this topic again; I am still confused on this. I set SPARK_DAEMON_JAVA_OPTS="-XX:+UseCompressedOops -Xmx8g". When I run my application, I got the following line in the logs. Spark Command: java -cp ::/usr/local/spark-1.0.1/conf:/usr/local/spark-1.0.1/assembly

Re: akka disassociated on GC

2014-07-23 Thread Xiangrui Meng
Yes, that's the plan. If you use broadcast, please also make sure TorrentBroadcastFactory is used, which became the default broadcast factory very recently. -Xiangrui On Tue, Jul 22, 2014 at 10:47 PM, Makoto Yui wrote: > Hi Xiangrui, > > By using your treeAggregate and broadcast patch, the evalua

Re: Spark 0.9.1 core dumps on Mesos 0.18.0

2014-07-23 Thread Dale Johnson
Okay, I finally got this. The project/SparkBuild needed to be set, and only 0.19.0 seems to work (out of 0.14.1, 0.14.2). "org.apache.mesos" % "mesos" % "0.19.0", was the one that worked. -- View this message in context: http://apache-spark-user-list.1001560.n3.nab

Re: streaming window not behaving as advertised (v1.0.1)

2014-07-23 Thread Alan Ngai
TD, it looks like your instincts were correct. I misunderstood what you meant. If I force an eval on the inputstream using foreachRDD, the windowing will work correctly. If I don’t do that, lazy eval somehow screws with window batches I eventually receive. Any reason the bug is categorized a

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
I found the issue... If you use spark git and generate the assembly jar then org.apache.hadoop.io.Writable.class is packaged with it If you use the assembly jar that ships with CDH in /opt/cloudera/parcels/CDH/lib/spark/assembly/lib/spark-assembly_2.10-0.9.0-cdh5.0.2-hadoop2.3.0-cdh5.0.2.jar,

Re: Spark deployed by Cloudera Manager

2014-07-23 Thread buntu
If you need to run Spark apps through Hue, see if Ooyala's job server helps: http://gethue.com/get-started-with-spark-deploy-spark-server-and-compute-pi-from-your-web-browser/ -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-deployed-by-Cloudera-Manag

RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"

2014-07-23 Thread Shao, Saisai
Hi Haopu, Please see the inline comments. Thanks Jerry -Original Message- From: Haopu Wang [mailto:hw...@qilinsoft.com] Sent: Wednesday, July 23, 2014 3:00 PM To: user@spark.apache.org Subject: "spark.streaming.unpersist" and "spark.cleaner.ttl" I have a DStream receiving data from a

Spark deployed by Cloudera Manager

2014-07-23 Thread Debasish Das
Hi, We have been using standalone spark for last 6 months and I used to run application jars fine on spark cluster with the following command. java -cp ":/app/data/spark_deploy/conf:/app/data/spark_deploy/lib/spark-assembly-1.0.0-SNAPSHOT-hadoop2.0.0-mr1-cdh4.5.0.jar:./app.jar" -Xms2g -Xmx2g -Ds

"spark.streaming.unpersist" and "spark.cleaner.ttl"

2014-07-23 Thread Haopu Wang
I have a DStream receiving data from a socket. I'm using local mode. I set "spark.streaming.unpersist" to "false" and leave " spark.cleaner.ttl" to be infinite. I can see files for input and shuffle blocks under "spark.local.dir" folder and the size of folder keeps increasing, although JVM's memory
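
For reference, a hedged example of setting both properties on the streaming application's SparkConf (values are illustrative only):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("cleanup-example")
      .set("spark.streaming.unpersist", "true")   // let Spark Streaming unpersist the RDDs it generates
      .set("spark.cleaner.ttl", "3600")           // periodic metadata/shuffle cleanup, in seconds
    val ssc = new StreamingContext(conf, Seconds(10))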