Re: Is the executor number fixed during the lifetime of one app ?

2015-05-26 Thread Saisai Shao
It depends on how you use Spark. If you use Spark on YARN and enable dynamic allocation, the number of executors is not fixed; it will change dynamically according to the load. Thanks, Jerry 2015-05-27 14:44 GMT+08:00 canan chen : > It seems the executor number is fixed for the standalone mode, not
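For reference, a minimal sketch of the dynamic-allocation settings mentioned here (values are illustrative; the external shuffle service must be enabled on YARN):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("dynamic-allocation-example")        // app name is illustrative
    .set("spark.dynamicAllocation.enabled", "true")
    .set("spark.shuffle.service.enabled", "true")     // required for dynamic allocation on YARN
    .set("spark.dynamicAllocation.minExecutors", "2")
    .set("spark.dynamicAllocation.maxExecutors", "20")
  val sc = new SparkContext(conf)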

Is the executor number fixed during the lifetime of one app ?

2015-05-26 Thread canan chen
It seems the executor number is fixed for standalone mode; I'm not sure about other modes.

Spark Streming yarn-cluster Mode Off-heap Memory Is Constantly Growing

2015-05-26 Thread Ji ZHANG
Hi, I'm using Spark Streaming 1.3 on CDH5.1 with yarn-cluster mode. I found out that YARN is killing the driver and executor processes because of excessive use of memory. Here's something I tried: 1. Xmx is set to 512M and the GC looks fine (one ygc per 10s), so the extra memory is not used by the heap.

How to give multiple directories as input ?

2015-05-26 Thread ๏̯͡๏
I have this piece sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]]( "/a/b/c/d/exptsession/2015/05/22/out-r-*.avro") that takes "/a/b/c/d/exptsession/2015/05/22/out-r-*.avro" as input. I want to give a second directory as input, but this is an inva
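A minimal sketch of one way to do this: read each directory separately and union the results (the second date directory below is assumed for illustration):

  import org.apache.avro.generic.GenericRecord
  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyInputFormat
  import org.apache.hadoop.io.NullWritable

  val dirs = Seq(
    "/a/b/c/d/exptsession/2015/05/22/out-r-*.avro",
    "/a/b/c/d/exptsession/2015/05/23/out-r-*.avro")   // second directory assumed for the example
  val combined = dirs
    .map(dir => sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
      AvroKeyInputFormat[GenericRecord]](dir))
    .reduce(_ union _)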

Re: Help optimizing some spark code

2015-05-26 Thread Debasish Das
You don't need sort... use topByKey if your top-k number is small... it uses the Java heap... On May 24, 2015 10:53 AM, "Tal" wrote: > Hi, > I'm running this piece of code in my program: > > smallRdd.join(largeRdd) > .groupBy { case (id, (_, X(a, _, _))) => a } > .map { case (a, iterable)
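A minimal sketch of the topByKey suggestion (it lives in MLlib's MLPairRDDFunctions; the sample data is made up):

  import org.apache.spark.mllib.rdd.MLPairRDDFunctions._

  val pairs = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("a", 2.0), ("b", 5.0)))
  // Keeps only the k largest values per key on the Java heap; no full sort or groupBy needed.
  val top2 = pairs.topByKey(2)   // at most 2 values per key, in descending order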

Re: Running Javascript from scala spark

2015-05-26 Thread marcos rebelo
Hi Ayan, I'm not an expert on Spark or on the use of dynamic languages on the JVM. I started yesterday doing a proof of concept for a project. The idea is: - to get some datasets in CSV format (I still need to check what is the best way to parse a CSV in Spark. Any suggestion?) - work this d

Re: Help reading Spark UI tea leaves..

2015-05-26 Thread Imran Rashid
Yes, HashPartitioner is the default, but (a) just creating an RDD of pairs doesn't apply any partitioner and (b) both RDDs have to have the same number of partitions (it's not enough for them to just have a HashPartitioner). You only get a narrow dependency if you have the same number of partitions

Building scaladoc using "build/sbt unidoc" failure

2015-05-26 Thread Justin Yip
Hello, I am trying to build scala doc from the 1.4 branch. But it failed due to [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode) I followed the instruction on github

Re: DataFrame. Conditional aggregation

2015-05-26 Thread ayan guha
For this, I can give you a SQL solution: joinedData.registerTempTable('j'); Res = ssc.sql('select col1, col2, count(1) counter, min(col3) minimum, sum(case when endrscp > 100 then 1 else 0 end) test from j group by col1, col2'). Let me know if this works. On 26 May 2015 23:47, "Masf" wrote: > Hi > I don't know how it wor

Re: Running Javascript from scala spark

2015-05-26 Thread ayan guha
Yes you are in right mailing list, for sure :) Regarding your question, I am sure you are well versed with how spark works. Essentially you can run any arbitrary function with map call and it will run in remote nodes. Hence you need to install any needed dependency in all nodes. You can also pass

Need some Cassandra integration help

2015-05-26 Thread Yana Kadiyska
Hi folks, for those of you working with Cassandra, wondering if anyone has been successful processing a mix of Cassandra and hdfs data. I have a dataset which is stored partially in HDFS and partially in Cassandra (schema is the same in both places) I am trying to do the following: val dfHDFS = s

Re: Running Javascript from scala spark

2015-05-26 Thread andy petrella
Yep, why not use, like you said, a JS engine like Rhino? But then I would suggest using mapPartitions instead, so there is only one engine per partition. Probably broadcasting the script is also a good thing to do. I guess it's for ad hoc transformations passed by a remote client, otherwise you could simply c
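A rough sketch of that mapPartitions idea, using the JDK's built-in ScriptEngineManager (the script body and input data are made up for the example):

  import javax.script.ScriptEngineManager

  val script = sc.broadcast(
    "function transform(x) { return x.toUpperCase(); } transform(input);")
  val lines = sc.parallelize(Seq("hello", "world"))
  val transformed = lines.mapPartitions { iter =>
    // One engine per partition, reused for every record in it.
    val engine = new ScriptEngineManager().getEngineByName("JavaScript")
    iter.map { line =>
      engine.put("input", line)
      engine.eval(script.value).toString
    }
  }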

Re: Running Javascript from scala spark

2015-05-26 Thread marcos rebelo
Hi all Let me be clear, I'm speaking of Spark (big data, map/reduce, hadoop, ... related). I have multiple map/flatMap/groupBy and one of the steps needs to be a map passing the item inside a JavaScript code. 2 Questions: - Is this question related to this list? - Did someone do something simil

Re: PySpark Unknown Opcode Error

2015-05-26 Thread Davies Liu
This is likely the case that you are running different versions of Python in the driver and slaves; Spark 1.4 (which will be released soon) will double-check that. SPARK_PYTHON should be PYSPARK_PYTHON. On Tue, May 26, 2015 at 11:21 AM, Nikhil Muralidhar wrote: > Hello, > I am trying to run a spark job (which runs

Spark unknown OpCode Error

2015-05-26 Thread Nikhil Muralidhar
Not sure if emails with attachments are discarded so re-sending this without the attachment. Hello, I am trying to run a spark job (which runs fine on the master node of the cluster), on a HDFS hadoop cluster using YARN. When I run the job which has a rdd.saveAsTextFile() line in it, I get the f

Re: Apache Spark application deployment best practices

2015-05-26 Thread Cody Koeninger
IMHO, keep it simple. Option 1: bash, cron, whatever monitoring you're already using On Tue, May 26, 2015 at 1:31 PM, lucas101 wrote: > Hi, > > I have a couple of use cases for Apache Spark applications/scripts, > generally of the following form: > > *General ETL use case* - more specifical

Apache Spark application deployment best practices

2015-05-26 Thread lucas1000001
Hi, I have a couple of use cases for Apache Spark applications/scripts, generally of the following form: *General ETL use case* - more specifically a transformation of a Cassandra column family containing many events (think event sourcing) into various aggregated column families. *Streaming use

Re: Accumulators in Spark Streaming on UI

2015-05-26 Thread Justin Pihony
You need to make sure to name the accumulator. On Tue, May 26, 2015 at 2:23 PM, Snehal Nagmote wrote: > Hello all, > > I have accumulator in spark streaming application which counts number of > events received from Kafka. > > From the documentation , It seems Spark UI has support to display it
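A minimal sketch of the naming Justin refers to (here `stream` is assumed to be the Kafka DStream from the original question):

  // Only accumulators created with a name show up in the web UI.
  val eventCount = sc.accumulator(0L, "events received from Kafka")
  stream.foreachRDD { rdd =>
    rdd.foreach(_ => eventCount += 1L)
  }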

Accumulators in Spark Streaming on UI

2015-05-26 Thread Snehal Nagmote
Hello all, I have an accumulator in a Spark Streaming application which counts the number of events received from Kafka. From the documentation, it seems the Spark UI has support to display it. But I am unable to see it on the UI. I am using Spark 1.3.1. Do I need to call any method (print) or am I missing

PySpark Unknown Opcode Error

2015-05-26 Thread Nikhil Muralidhar
Hello, I am trying to run a spark job (which runs fine on the master node of the cluster), on a HDFS hadoop cluster using YARN. When I run the job which has a rdd.saveAsTextFile() line in it, I get the following error: *SystemError: unknown opcode* The entire stacktrace has been appended to thi

Re: Running Javascript from scala spark

2015-05-26 Thread Marcelo Vanzin
Is it just me or does that look completely unrelated to Spark-the-Apache-project? On Tue, May 26, 2015 at 10:55 AM, Ted Yu wrote: > Have you looked at https://github.com/spark/sparkjs ? > > Cheers > > On Tue, May 26, 2015 at 10:17 AM, marcos rebelo wrote: > >> Hi all, >> >> My first message on

Re: Running Javascript from scala spark

2015-05-26 Thread Ted Yu
Have you looked at https://github.com/spark/sparkjs ? Cheers On Tue, May 26, 2015 at 10:17 AM, marcos rebelo wrote: > Hi all, > > My first message on this mailing list: > > I need to run JavaScript on Spark. Somehow I would like to use the > ScriptEngineManager or any other way that makes Rhino

Running Javascript from scala spark

2015-05-26 Thread marcos rebelo
Hi all, My first message on this mailing list: I need to run JavaScript on Spark. Somehow I would like to use the ScriptEngineManager or any other way that makes Rhino do the work for me. Consider that I have a Structure that needs to be changed by a JavaScript. I will have a set of Javascript a

process independent columns with same operations

2015-05-26 Thread Laeeq Ahmed
Hi guys, I have a Spark Streaming application and I want to increase its performance. Basically it's a design question. My input is like time, s1, s2 ... s23. I have to process these columns with the same operations. I am running on 40 cores on 10 machines. 1. I am trying to get rid of the loop in the mid
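One possible sketch for avoiding a per-column loop: key each value by its column index so all 23 columns go through the same operations in a single pass (the aggregation shown, a per-column sum, is only an example):

  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.dstream.DStream

  def perColumnSums(lines: DStream[String]): DStream[(Int, Double)] =
    lines
      .map(_.split(","))
      .flatMap(fields => (1 to 23).map(i => (i, fields(i).toDouble)))   // fields(0) is the time column
      .reduceByKey(_ + _)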

Re: Recommended Scala version

2015-05-26 Thread Koert Kuipers
we are still running into issues with spark-shell not working on 2.11, but we are running on somewhat older master so maybe that has been resolved already. On Tue, May 26, 2015 at 11:48 AM, Dean Wampler wrote: > Most of the 2.11 issues are being resolved in Spark 1.4. For a while, the > Spark pr

Re: Recommended Scala version

2015-05-26 Thread Dean Wampler
Most of the 2.11 issues are being resolved in Spark 1.4. For a while, the Spark project has published maven artifacts that are compiled with 2.11 and 2.10, although the downloads at http://spark.apache.org/downloads.html are still all for 2.10. Dean Wampler, Ph.D. Author: Programming Scala, 2nd Ed

Re: Recommended Scala version

2015-05-26 Thread Ted Yu
w.r.t. Kafka library, see https://repository.apache.org/content/repositories/orgapachespark-1104/org/apache/spark/spark-streaming-kafka_2.11/1.4.0-rc2/ FYI On Tue, May 26, 2015 at 8:33 AM, Ritesh Kumar Singh < riteshoneinamill...@gmail.com> wrote: > Yes, recommended version is 2.10 as all the f

DataFrame.explode produces field with wrong type.

2015-05-26 Thread Eugene Morozov
Hi! I create a DataFrame using the following method: JavaRDD<Row> rows = ... StructType structType = ... Then apply sqlContext.createDataFrame(rows, structType). I have a pretty complex schema: root |-- Id: long (nullable = true) |-- attributes: struct (nullable = true) | |-- FirstName: array (nullable =

Re: Recommended Scala version

2015-05-26 Thread Ritesh Kumar Singh
Yes, the recommended version is 2.10, as not all features are supported by 2.11. The Kafka libraries and JDBC components are yet to be ported to 2.11. So if your project doesn't depend on these components, you can give 2.11 a try. Here's a link

Recommended Scala version

2015-05-26 Thread Punyashloka Biswal
Dear Spark developers and users, Am I correct in believing that the recommended version of Scala to use with Spark is currently 2.10? Is there any plan to switch to 2.11 in future? Are there any advantages to using 2.11 today? Regards, Punya

SparkR Jobs Hanging in collectPartitions

2015-05-26 Thread Eskilson,Aleksander
I’ve been attempting to run a SparkR translation of a similar Scala job that identifies words from a corpus not existing in a newline delimited dictionary. The R code is: dict <- SparkR:::textFile(sc, src1) corpus <- SparkR:::textFile(sc, src2) words <- distinct(SparkR:::flatMap(corpus, function

RE: spark on Windows 2008 failed to save RDD to windows shared folder

2015-05-26 Thread Wang, Ningjun (LNG-NPV)
It is Hadoop-2.4.0 with spark-1.3.0. I found that the problem only happens if there are multiple nodes. If the cluster has only one node, it works fine. For example, if the cluster has a spark-master on machine A and a spark-worker on machine B, this problem happens. If both spark-master and spark-wo

Re: DataFrame. Conditional aggregation

2015-05-26 Thread Masf
Hi I don't know how it works. For example: val result = joinedData.groupBy("col1","col2").agg( count(lit(1)).as("counter"), min(col3).as("minimum"), sum("case when endrscp> 100 then 1 else 0 end").as("test") ) How can I do it? Thanks Regards. Miguel. On Tue, May 26, 2015 at 12:35 AM,
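A sketch of the DataFrame-API equivalent, using when/otherwise from org.apache.spark.sql.functions (available from Spark 1.4 on; on 1.3 the SQL approach from ayan's reply is the way to go):

  import org.apache.spark.sql.functions._

  val result = joinedData.groupBy("col1", "col2").agg(
    count(lit(1)).as("counter"),
    min(col("col3")).as("minimum"),
    sum(when(col("endrscp") > 100, 1).otherwise(0)).as("test"))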

Re: spark-streaming-kafka_2.11 not available yet?

2015-05-26 Thread Cody Koeninger
It's being added in 1.4 https://repository.apache.org/content/repositories/orgapachespark-1104/org/apache/spark/spark-streaming-kafka_2.11/1.4.0-rc2/ On Tue, May 26, 2015 at 3:14 AM, Petr Novak wrote: > Hello, > I would like to switch from Scala 2.10 to 2.11 for Spark app development. > It se
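For reference, once 1.4.0 is out the 2.11 artifact should resolve like any other Spark module from sbt (the version shown matches the RC linked above; treat it as illustrative):

  // In build.sbt, with scalaVersion := "2.11.6", %% picks the _2.11 artifact.
  libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.4.0"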

Re: Tasks randomly stall when running on mesos

2015-05-26 Thread Reinis Vicups
Hi, I just configured my cluster to run with 1.4.0-rc2; alas, the dependency jungle does not let one just download, configure and start. Instead one will have to fiddle with sbt settings for the next couple of nights: 2015-05-26 14:50:52,686 WARN a.r.ReliableDeliverySupervisor - Association

Re: Implementing custom RDD in Java

2015-05-26 Thread Alex Robbins
I know it isn't exactly what you are asking for, but you could solve it like this: Driver program queries dynamo for the s3 file keys. sc.textFile each of the file keys and .union them all together to make your RDD. You could wrap that up in a function and it wouldn't be too painful to reuse. I d
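A small sketch of that suggestion (the bucket and keys are made up; in practice they would come from the Dynamo query):

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  def rddForKeys(sc: SparkContext, keys: Seq[String]): RDD[String] =
    keys.map(key => sc.textFile(s"s3n://my-bucket/$key")).reduce(_ union _)

  val events = rddForKeys(sc, Seq("events/2015-05-25.log", "events/2015-05-26.log"))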

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
The link you sent says multiple executors per node. Worker is just a daemon process launching Executors / JVMs so it can execute tasks - it does that by cooperating with the master and the driver. There is a one-to-one mapping between Executor and JVM. Sent from Samsung Mobile Original m

Re: Roadmap for Spark with Kafka on Scala 2.11?

2015-05-26 Thread Iulian Dragoș
On Tue, May 26, 2015 at 10:09 AM, algermissen1971 < algermissen1...@icloud.com> wrote: > Hi, > > I am setting up a project that requires Kafka support and I wonder what > the roadmap is for Scala 2.11 Support (including Kafka). > > Can we expect to see 2.11 support anytime soon? > The upcoming 1.

Re: HiveContext test, "Spark Context did not initialize after waiting 10000ms"

2015-05-26 Thread Nitin kak
That is a much better solution than how I resolved it. I got around it by placing comma-separated jar paths for all the Hive-related jars in the --jars clause. I will try your solution. Thanks for sharing it. On Tue, May 26, 2015 at 4:14 AM, Mohammad Islam wrote: > I got a similar problem. > I'm no

Re: How many executors can I acquire in standalone mode ?

2015-05-26 Thread Arush Kharbanda
I believe you would be restricted by the number of cores you have in your cluster. Having a worker running without a core is useless. On Tue, May 26, 2015 at 3:04 PM, canan chen wrote: > In spark standalone mode, there will be one executor per worker. I am > wondering how many executor can I acq
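A minimal sketch of how to cap that: in standalone mode an application grabs all available cores unless spark.cores.max is set (the value below is illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setAppName("cores-cap-example")
    .set("spark.cores.max", "8")   // total cores across the cluster for this app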

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Arush Kharbanda
Hi Evo, Worker is the JVM and an executor runs on the JVM. And after Spark 1.4 you would be able to run multiple executors on the same JVM/worker. https://issues.apache.org/jira/browse/SPARK-1706. Thanks Arush On Tue, May 26, 2015 at 2:54 PM, canan chen wrote: > I think the concept of task in

How many executors can I acquire in standalone mode ?

2015-05-26 Thread canan chen
In Spark standalone mode, there will be one executor per worker. I am wondering how many executors I can acquire when I submit an app. Is it greedy mode (as many as I can acquire)?

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
I think the concept of a task in Spark should be on the same level as a task in MR. Usually in MR, we need to specify the memory for each mapper/reducer task. And I believe executor is not a user-facing concept, it's a Spark-internal concept. For Spark users, they don't need to know the concept of execu

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
This is the first time I hear that “one can specify the RAM per task” – the RAM is granted per Executor (JVM). On the other hand each Task operates on ONE RDD Partition – so you can say that this is “the RAM allocated to the Task to process” – but it is still within the boundaries allocated to t

Re: Using Log4j for logging messages inside lambda functions

2015-05-26 Thread Spico Florin
Hello! Thank you all for your answers. Akhil's proposed solution works fine. Thanks. Florin On Tue, May 26, 2015 at 3:08 AM, Wesley Miao wrote: > The reason it didn't work for you is that the function you registered with > someRdd.map will be running on the worker/executor side, not in your >

Re: Re: Re: Re: Re: how to distributed run a bash shell in spark

2015-05-26 Thread luohui20001
Thanks Akhil, I checked the job UI again, my app is running concurrently on all the executors. But some of the tasks got an I/O exception. I will continue inspecting this. java.io.IOException: Failed to create local dir in /tmp/spark-b66c7c95-9242-4454-b900-9be9301c4c26/spark-e4c3e926-4b

Re: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
Yes, I know that one task represents a JVM thread. This is what confused me. Usually users want to specify the memory at the task level, so how can I do that if a task is thread-level and multiple tasks run in the same executor? And I don't even know how many threads there will be. Besides that, if one task

Re: HiveContext test, "Spark Context did not initialize after waiting 10000ms"

2015-05-26 Thread Mohammad Islam
I got a similar problem. I'm not sure if your problem is already resolved. For the record, I solved this type of error by calling sc.setMaster("yarn-cluster"); If you find the solution, please let us know. Regards, Mohammad On Friday, March 6, 2015 2:47 PM, nitinkak001 wrote: I am

Roadmap for Spark with Kafka on Scala 2.11?

2015-05-26 Thread algermissen1971
Hi, I am setting up a project that requires Kafka support and I wonder what the roadmap is for Scala 2.11 Support (including Kafka). Can we expect to see 2.11 support anytime soon? Jan - To unsubscribe, e-mail: user-unsubscr...

RE: How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread Evo Eftimov
An Executor is a JVM instance spawned and running on a Cluster Node (Server machine). Task is essentially a JVM Thread – you can have as many Threads as you want per JVM. You will also hear about “Executor Slots” – these are essentially the CPU Cores available on the machine and granted for use
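For reference, the sizes Evo describes are the ones set per application, e.g. (values illustrative):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.executor.memory", "4g")   // heap for each Executor JVM; all its task threads share it
    .set("spark.executor.cores", "2")     // concurrent task slots per Executor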

spark-streaming-kafka_2.11 not available yet?

2015-05-26 Thread Petr Novak
Hello, I would like to switch from Scala 2.10 to 2.11 for Spark app development. It seems that the only thing blocking me is a missing spark-streaming-kafka_2.11 maven package. Any plan to add it or am I just blind? Many thanks, Vladimir

How does spark manage the memory of executor with multiple tasks

2015-05-26 Thread canan chen
Since Spark can run multiple tasks in one executor, I am curious to know how Spark manages memory across these tasks. Say one executor takes 1GB of memory; then if this executor can run 10 tasks simultaneously, each task can consume 100MB on average. Do I understand it correctly? It do

Parallel parameter tuning: distributed execution of MLlib algorithms

2015-05-26 Thread hugof
Hi, I am currently experimenting with linear regression (SGD) (Spark + MLlib, ver. 1.2). At this point in time I need to fine-tune the hyper-parameters. I do this (for now) by an exhaustive grid search of the step size and the number of iterations. Currently I am on a dual core that acts as a mast
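A rough sketch of such a grid search (the training/validation RDDs, the parameter grid and the metric are placeholders; note that the grid loop itself runs serially on the driver, while each train() call is distributed):

  import org.apache.spark.SparkContext._
  import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
  import org.apache.spark.rdd.RDD

  def gridSearch(training: RDD[LabeledPoint],
                 validation: RDD[LabeledPoint]): ((Double, Int), Double) = {
    val results = for {
      stepSize      <- Seq(0.01, 0.1, 1.0)
      numIterations <- Seq(50, 100, 200)
    } yield {
      val model = LinearRegressionWithSGD.train(training, numIterations, stepSize)
      // Mean squared error on the validation set for this parameter pair.
      val mse = validation
        .map(p => math.pow(model.predict(p.features) - p.label, 2))
        .mean()
      ((stepSize, numIterations), mse)
    }
    results.minBy(_._2)   // best (stepSize, numIterations) by MSE
  }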

RE: Websphere MQ as a data source for Apache Spark Streaming

2015-05-26 Thread Chaudhary, Umesh
Thanks for the suggestion, I will try and post the outcome. From: Arush Kharbanda [mailto:ar...@sigmoidanalytics.com] Sent: Monday, May 25, 2015 12:24 PM To: Chaudhary, Umesh; user@spark.apache.org Subject: Re: Websphere MQ as a data source for Apache Spark Streaming Hi Umesh, You can write a cu

Collabrative Filtering

2015-05-26 Thread Yasemin Kaya
Hi, In CF String path = "data/mllib/als/test.data"; JavaRDD<String> data = sc.textFile(path); JavaRDD<Rating> ratings = data.map(new Function<String, Rating>() { public Rating call(String s) { String[] sarray = s.split(","); return new Rating(Integer.parseInt(sarray[0]), Integer.parseInt(sarray[1]), Double.parseDouble(sarray[

Re: Remove COMPLETED applications and shuffle data

2015-05-26 Thread Akhil Das
Try these: - Disable shuffle spill: spark.shuffle.spill=false (it might end up in OOM) - Enable log rotation: sparkConf.set("spark.executor.logs.rolling.strategy", "size") .set("spark.executor.logs.rolling.size.maxBytes", "1024") .set("spark.executor.logs.rolling.maxRetainedFiles", "3") You can also

Caching parquet table (with GZIP) on Spark 1.3.1

2015-05-26 Thread shshann
We tried to cache a table through hiveCtx = HiveContext(sc); hiveCtx.cacheTable("table name") as described in Spark 1.3.1's documentation. We're on CDH5.3.0 with Spark 1.3.1 built with Hadoop 2.6. The following error message would occur if we tried to cache a table with Parquet format & GZIP, though we're not