KafkaReceiver Error when starting ssc (Actor name not unique)

2014-04-09 Thread gaganbm
Hi All, I am getting this exception when doing ssc.start to start the streaming context. ERROR KafkaReceiver - Error receiving data akka.actor.InvalidActorNameException: actor name [NetworkReceiver-0] is not unique! at akka.actor.dungeon.ChildrenContainer$NormalChildrenContainer.reserve(

Spark on YARN performance

2014-04-09 Thread Flavio Pompermaier
Hi to everybody, I'm new to Spark and I'd like to know if running Spark on top of YARN or Mesos could affect (and how much) its performance. Is there any doc about this? Best, Flavio

Spark operators on Objects

2014-04-09 Thread Flavio Pompermaier
Hi to everybody, In my current scenario I have complex objects stored as xml in an HBase Table. What's the best strategy to work with them? My final goal would be to define operators on those objects (like filter, equals, append, join, merge, etc) and then work with multiple RDDs to perform some k
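
One common approach at the time (a sketch, not the only option): read the table with newAPIHadoopRDD and HBase's TableInputFormat, deserialize the XML once into plain objects, and apply filter/join/etc. to those. The table name, column family and the parseXml helper below are assumptions; sc is an existing SparkContext.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes

case class MyObject(id: String, payload: String)                   // hypothetical domain type
def parseXml(xml: String): MyObject = MyObject(xml.take(8), xml)   // hypothetical XML parser

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")            // assumed table name

val rows = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
  classOf[ImmutableBytesWritable], classOf[Result])

// Deserialize once, then work with normal RDD operators on plain objects
val objects = rows.map { case (_, result) =>
  val xml = Bytes.toString(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("xml")))
  parseXml(xml)
}
val filtered = objects.filter(_.id.nonEmpty)                        // filter, join, etc. from here on
```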

Spark packaging

2014-04-09 Thread Pradeep baji
Hi all, I am new to spark and trying to learn it. Is there any document which describes how spark is packaged. ( like dependencies needed to build spark, which jar contains what after build etc) Thanks for the help. Regards, Pradeep

Re: Spark packaging

2014-04-09 Thread prabeesh k
Please refer http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html Regards, prabeesh On Wed, Apr 9, 2014 at 1:04 PM, Pradeep baji wrote: > Hi all, > > I am new to spark and trying to learn it. Is there any document which > describes how spark is packaged. ( like de
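
The linked post covers building a single "fat" jar with sbt-assembly. A minimal sketch of that setup, with the plugin version and dependency scope as assumptions (follow the post for the exact details):

```scala
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")   // version is an assumption

// build.sbt
import AssemblyKeys._

assemblySettings

name := "my-spark-job"

libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1" % "provided"
```

Running `sbt assembly` then produces one jar containing the project code and its non-provided dependencies.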

Re: trouble with "join" on large RDDs

2014-04-09 Thread Andrew Ash
A JVM can easily be limited in how much memory it uses with the -Xmx parameter, but Python doesn't have memory limits built in in such a first-class way. Maybe the memory limits aren't making it to the python executors. What was your SPARK_MEM setting? The JVM below seems to be using 603201 (pag

Re: PySpark SocketConnect Issue in Cluster

2014-04-09 Thread Surendranauth Hiraman
This appears to be an issue around using pandas. Even if we just instantiate a dataframe and do nothing with it, the python worker process is exiting. But if we remove any pandas references, the same job runs to completion. Has anyone run into this before? -Suren On Mon, Apr 7, 2014 at 1:10 PM

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
For 1, persist can be used to save an RDD to disk using the various persistence levels. When a persistency level is set on an RDD, when that RDD is evaluated it's saved to memory/disk/elsewhere so that it can be re-used. It's applied to that RDD, so that subsequent uses of the RDD can use the cac

Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Thanks, Andrew. That helps. For 1, it sounds like the data for the RDD is held in memory and then only written to disk after the entire RDD has been realized in memory. Is that correct? -Suren On Wed, Apr 9, 2014 at 9:25 AM, Andrew Ash wrote: > For 1, persist can be used to save an RDD to di

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Marco Costantini
Hi there, To answer your question; no there is no reason NOT to use an AMI that Spark has prepared. The reason we haven't is that we were not aware such AMIs existed. Would you kindly point us to the documentation where we can read about this further? Many many thanks, Shivaram. Marco. On Tue, A

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
Which persistence level are you talking about? MEMORY_AND_DISK ? Sent from my mobile phone On Apr 9, 2014 2:28 PM, "Surendranauth Hiraman" wrote: > Thanks, Andrew. That helps. > > For 1, it sounds like the data for the RDD is held in memory and then only > written to disk after the entire RDD ha

Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Yes, MEMORY_AND_DISK. We do a groupByKey and then call persist on the resulting RDD. So I'm wondering if groupByKey is aware of the subsequent persist setting to use disk or just creates the Seq[V] in memory and only uses disk after that data structure is fully realized in memory. -Suren On We
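
For reference, a minimal sketch of the pattern under discussion (the input path and key field are assumptions):

```scala
import org.apache.spark.storage.StorageLevel

val pairs = sc.textFile("hdfs:///input")               // assumed input
  .map(line => (line.split("\t")(0), line))            // key by the first field (assumption)

// groupByKey builds a Seq[V] per key; persist declares where the materialized
// partitions may live once they have been computed
val grouped = pairs.groupByKey().persist(StorageLevel.MEMORY_AND_DISK)
grouped.count()                                        // forces evaluation
```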

Top Ten RDD

2014-04-09 Thread Jeyaraj, Arockia R (Arockia)
Hi, Can anyone tell me how to get the top ten elements of an RDD by value? Thanks Arockia Raja

Re: Top Ten RDD

2014-04-09 Thread mailforledkk
I see the top method in the RDD class; you can use this method to get the top N. But I found an error when I use this method: it seems that when mapPartitions runs inside top, a task result may be Nil, and then the reduce results in a class cast exception as below: java.lang.ClassCastException: sc
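
A minimal sketch of using top on a pair RDD to get the ten largest elements by value (the data here is made up):

```scala
val counts = sc.parallelize(Seq(("a", 3), ("b", 10), ("c", 7), ("d", 1)))

// top(n) takes an implicit Ordering; order by the value component of the pair
val topTen = counts.top(10)(Ordering.by[(String, Int), Int](_._2))
topTen.foreach(println)
```

Note that top collects its result to the driver, so it is only suitable for small n.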

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
Marco, If you call spark-ec2 launch without specifying an AMI, it will default to the Spark-provided AMI. Nick On Wed, Apr 9, 2014 at 9:43 AM, Marco Costantini < silvio.costant...@granatads.com> wrote: > Hi there, > To answer your question; no there is no reason NOT to use an AMI that > Spark

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Nicholas Chammas
And for the record, that AMI is ami-35b1885c. Again, you don't need to specify it explicitly; spark-ec2 will default to it. On Wed, Apr 9, 2014 at 11:08 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Marco, > > If you call spark-ec2 launch without specifying an AMI, it will default

RE: Top Ten RDD

2014-04-09 Thread Jeyaraj, Arockia R (Arockia)
Thanks. It works for me. From: mailforledkk [mailto:mailforle...@126.com] Sent: Wednesday, April 09, 2014 9:16 AM To: user Cc: mailforledkk Subject: Re: To Ten RDD i see the top method in RDD class , you can use this method to get top N , but i found some error when i use this method it's s

What level of parallelism should I expect from my cluster?

2014-04-09 Thread Nicholas Chammas
When you click on a stage in the Spark UI at 4040, you can see how many tasks are running concurrently. How many tasks should I expect to see running concurrently, given I have things set up optimally in my cluster, and my RDDs are partitioned properly? Is it the total number of virtual cores acr
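
As a rough rule, the number of tasks running at once is bounded by the total cores granted to the application, while the number of tasks in a stage equals the number of partitions of the RDD being computed. A few quick checks, as a sketch (the path is an assumption):

```scala
// Default parallelism hint Spark uses when none is given explicitly
println(sc.defaultParallelism)

// Number of tasks a stage over this RDD will launch
val rdd = sc.textFile("hdfs:///some/path")
println(rdd.partitions.size)

// The task count for a shuffle stage can also be set explicitly
val reduced = rdd.map(l => (l, 1)).reduceByKey(_ + _, 64)
```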

executors not registering with the driver

2014-04-09 Thread azurecoder
Up until last week we had no problems running a Spark standalone cluster. We now have a problem registering executors with the driver node in any application. Although we can start-all and see the worker on 8080 no executors are registered with the blockmanager. The feedback we have is scant but w

hbase scan performance

2014-04-09 Thread David Quigley
Hi all, We are currently using hbase to store user data and periodically doing a full scan to aggregate data. The reason we use hbase is that we need a single user's data to be contiguous, so as user data comes in, we need the ability to update a random access store. The performance of a full hba

Re: Error when compiling spark in IDEA and best practice to use IDE?

2014-04-09 Thread Dong Mo
All of these works Thanks -Mo 2014-04-09 2:34 GMT-04:00 Xiangrui Meng : > After sbt/sbt gen-diea, do not import as an SBT project but choose > "open project" and point it to the spark folder. -Xiangrui > > On Tue, Apr 8, 2014 at 10:45 PM, Sean Owen wrote: > > I let IntelliJ read the Maven buil

How does Spark handle RDD via HDFS ?

2014-04-09 Thread gtanguy
Hello everybody, I am wondering how Spark handles its RDDs via HDFS: what if during a map phase I need data which is not present locally? What I am working on: a recommendation algorithm, Matrix Factorization (MF), using a stochastic gradient as the optimizer. For now my algorithm wo

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Marco Costantini
Ah, tried that. I believe this is an HVM AMI? We are exploring paravirtual AMIs. On Wed, Apr 9, 2014 at 11:17 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > And for the record, that AMI is ami-35b1885c. Again, you don't need to > specify it explicitly; spark-ec2 will default to it.

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Shivaram Venkataraman
The AMI should automatically switch between PVM and HVM based on the instance type you specify on the command line. For reference (note you don't need to specify this on the command line), the PVM ami id is ami-5bb18832 in us-east-1. FWIW we maintain the list of AMI Ids (across regions and pvm, hv

Re: AWS Spark-ec2 script with different user

2014-04-09 Thread Marco Costantini
Perfect. Now I know what to do. Thanks to your help! Many thanks, Marco. On Wed, Apr 9, 2014 at 12:27 PM, Shivaram Venkataraman < shiva...@eecs.berkeley.edu> wrote: > The AMI should automatically switch between PVM and HVM based on the > instance type you specify on the command line. For refere

Re: How does Spark handle RDD via HDFS ?

2014-04-09 Thread Andrew Ash
The typical way to handle that use case would be to join the 3 files together into one RDD and then do the factorization on that. There will definitely be network traffic during the initial join to get everything into one table, and after that there will likely be more network traffic for various
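
A rough sketch of that "join first, then factorize" approach, with made-up file layouts keyed by user id:

```scala
val ratings  = sc.textFile("hdfs:///ratings").map  { l => val f = l.split(","); (f(0), f(1)) }
val profiles = sc.textFile("hdfs:///profiles").map { l => val f = l.split(","); (f(0), f(1)) }
val history  = sc.textFile("hdfs:///history").map  { l => val f = l.split(","); (f(0), f(1)) }

// The joins shuffle matching keys onto the same partition, so later map
// phases see all of a user's data locally
val joined = ratings.join(profiles).join(history)
  .map { case (user, ((rating, profile), hist)) => (user, (rating, profile, hist)) }
```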

Re: Spark Disk Usage

2014-04-09 Thread Andrew Ash
The groupByKey would be aware of the subsequent persist -- that's part of the reason why operations are lazy. As far as whether it's materialized in memory first and then flushed to disk vs streamed to disk I'm not sure the exact behavior. What I'd expect to happen would be that the RDD is materi

How to change the parallelism level of input dstreams

2014-04-09 Thread Dong Mo
Dear list, A quick question about spark streaming: Say I have this stage set up in my Spark Streaming cluster: batched TCP stream ==> map(expensive computation) ===> ReduceByKey I know I can set the number of tasks for ReduceByKey. But I didn't find a place to specify the parallelism for the
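
A common pattern at the time was to create several input streams and union them, so the expensive map runs across multiple receivers, and then give reduceByKey its own task count. A hedged sketch (ssc, the ZooKeeper address, group and topic are assumptions, and expensiveComputation is a stand-in):

```scala
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

def expensiveComputation(msg: String): String = msg.toUpperCase   // stand-in for the real work

// One receiver per stream; more streams means more partitions feeding the map
val numStreams = 4
val streams = (1 to numStreams).map { _ =>
  KafkaUtils.createStream(ssc, "zk:2181", "my-group", Map("topic" -> 1))
}
val unified = ssc.union(streams)

val mapped  = unified.map { case (_, msg) => (expensiveComputation(msg), 1) }
val reduced = mapped.reduceByKey(_ + _, 32)   // 32 reduce tasks (assumption)
```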

Re: hbase scan performance

2014-04-09 Thread Jerry Lam
Hi Dave, This is HBase solution to the poor scan performance issue: https://issues.apache.org/jira/browse/HBASE-8369 I encountered the same issue before. To the best of my knowledge, this is not a mapreduce issue. It is hbase issue. If you are planning to swap out mapreduce and replace it with sp

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-09 Thread Mayur Rustagi
Not right now. Like the pitch though Open new horizons for In-memory analysis.. mind if I borrow that :) Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, Apr 8, 2014 at 8:36 PM, abhietc31 wrote: > Anybody, please he

Re: Why doesn't the driver node do any work?

2014-04-09 Thread Mayur Rustagi
Also Driver can run on one of the slave nodes. (you will stil need a spark master though for resource allocation etc). Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Tue, Apr 8, 2014 at 2:46 PM, Nan Zhu wr

cannot run spark shell in yarn-client mode

2014-04-09 Thread Pennacchiotti, Marco
I am pretty new to Spark and I am trying to run the spark shell on a Yarn cluster from the cli (in yarn-client mode). I am able to start the shell with the following command: SPARK_JAR=../spark-0.9.0-incubating/jars/spark-assembly-0.9.0-incubating-hadoop2.2.0.jar \ SPARK_YARN_APP_JAR=emptyfile

Re: Spark RDD to Shark table IN MEMORY conversion

2014-04-09 Thread abhietc31
Never mind...plz return it later with interest -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-RDD-to-Shark-table-IN-MEMORY-conversion-tp3682p4014.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-09 Thread Kanwaldeep
Any update on this? We are still facing this issue. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-ProtoBuf-2-5-for-messages-with-Spark-Streaming-tp3396p4015.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

KafkaInputDStream Stops reading new messages

2014-04-09 Thread Kanwaldeep
Spark Streaming job was running on two worker nodes and then there was an error on one of the nodes. The spark job showed running but no progress was being made and not processing any new messages. Based on the driver log files I see the following errors. I would expect the stream reading would b

Re: Spark Disk Usage

2014-04-09 Thread Surendranauth Hiraman
Andrew, Thanks a lot for the pointer to the code! This has answered my question. Looks like it tries to write it to memory first and then if it doesn't fit, it spills to disk. I'll have to dig in more to figure out the details. -Suren On Wed, Apr 9, 2014 at 12:46 PM, Andrew Ash wrote: > The

is it possible to initiate Spark jobs from Oozie?

2014-04-09 Thread Segerlind, Nathan L
Howdy. Is it possible to initiate Spark jobs from Oozie (presumably as a java action)? If so, are there known limitations to this? And would anybody have a pointer to an example? Thanks, Nate

Re: Spark packaging

2014-04-09 Thread Pradeep baji
Thanks Prabeesh. On Wed, Apr 9, 2014 at 12:37 AM, prabeesh k wrote: > Please refer > > http://prabstechblog.blogspot.in/2014/04/creating-single-jar-for-spark-project.html > > Regards, > prabeesh > > > On Wed, Apr 9, 2014 at 1:04 PM, Pradeep baji > wrote: > >> Hi all, >> >> I am new to spark an

Re: Spark operators on Objects

2014-04-09 Thread Flavio Pompermaier
Any help about this...? On Apr 9, 2014 9:19 AM, "Flavio Pompermaier" wrote: > Hi to everybody, > > In my current scenario I have complex objects stored as xml in an HBase > Table. > What's the best strategy to work with them? My final goal would be to > define operators on those objects (like fil

Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
Hi everyone, We have just posted Spark 0.9.1, which is a maintenance release with bug fixes, performance improvements, better stability with YARN and improved parity of the Scala and Python API. We recommend all 0.9.0 users to upgrade to this stable release. This is the first release since Spark

Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
A small additional note: Please use the direct download links in the Spark Downloads page. The Apache mirrors take a day or so to sync from the main repo, so may not work immediately. TD On Wed, Apr 9, 2014 at 2:54 PM, Tathagata Das wrote: > Hi everyone,

Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi all, I have been able to run LR in local mode, but I am facing problem to run it in cluster mode, below is the source script, and stack trace when running it cluster mode, I used sbt package to build the project, not sure what it is complaining? another question I have is for LogisticRegress
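
For reference, a minimal LogisticRegressionWithSGD sketch against the 0.9-era MLlib API, where LabeledPoint takes an Array[Double] of features; the input format is an assumption:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// One record per line: label,feature1,feature2,... (assumed format)
val data = sc.textFile("hdfs:///lr/input").map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts.head, parts.tail)
}.cache()

val model = LogisticRegressionWithSGD.train(data, 100)   // 100 iterations

val trainError = data.filter(p => model.predict(p.features) != p.label).count.toDouble / data.count
```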

Multi master Spark

2014-04-09 Thread Pradeep Ch
Hi, I want to enable Spark Master HA in Spark. Documentation specifies that we can do this with the help of ZooKeeper. But what I am worried about is how to configure one master with the other, and similarly how do workers know that they have two masters? Where do you specify the multi-master information

Re: Spark 0.9.1 released

2014-04-09 Thread Matei Zaharia
Thanks TD for managing this release, and thanks to everyone who contributed! Matei On Apr 9, 2014, at 2:59 PM, Tathagata Das wrote: > A small additional note: Please use the direct download links in the Spark > Downloads page. The Apache mirrors take a day or so to sync from the main > repo,

Re: Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jagat Singh
Hi Jenny, How are you packaging your jar. Can you please confirm if you have included the Mlib jar inside the fat jar you have created for your code. libraryDependencies += "org.apache.spark" % "spark-mllib_2.9.3" % "0.8.1-incubating" Thanks, Jagat Singh On Thu, Apr 10, 2014 at 8:05 AM, Jenn

Re: Multi master Spark

2014-04-09 Thread Dmitriy Lyubimov
The only way i know to do this is to use mesos with zookeepers. you specify zookeeper url as spark url that contains multiple zookeeper hosts. Multiple mesos masters are then elected thru zookeeper leader election until current leader dies; at which point mesos will elect another master (if still l

Re: Multi master Spark

2014-04-09 Thread Jagat Singh
Hello Pradeep, Quoting from https://spark.apache.org/docs/0.9.0/spark-standalone.html In order to schedule new applications or add Workers to the cluster, they need to know the IP address of the current leader. This can be accomplished by simply passing in a list of Masters where you used to pas
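
As a hedged sketch, the standalone-HA setup in those docs amounts to pointing every Master and Worker at the same ZooKeeper ensemble and then listing all Masters in the application's master URL (hostnames below are placeholders):

```scala
// conf/spark-env.sh on each Master and Worker (placeholder hosts):
// SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
//   -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181"

// Applications, including spark-shell, then list every Master in the URL;
// whichever one is the current leader accepts the registration
val sc = new SparkContext("spark://master1:7077,master2:7077", "ha-test")
```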

Re: programmatic way to tell Spark version

2014-04-09 Thread Nicholas Chammas
Hey Patrick, I've created SPARK-1458 to track this request, in case the team/community wants to implement it in the future. Nick On Sat, Feb 22, 2014 at 7:25 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > No use case at the moment

Re: Multi master Spark

2014-04-09 Thread Pradeep Ch
Thanks Dmitriy. But I want multi master support when running spark standalone. Also I want to know if this multi master thing works if I use spark-shell. On Wed, Apr 9, 2014 at 3:26 PM, Dmitriy Lyubimov wrote: > The only way i know to do this is to use mesos with zookeepers. you > specify zooke

Re: Multi master Spark

2014-04-09 Thread Dmitriy Lyubimov
ah. standalone HA master was added in 0.9.0. Same logic, but Spark-native. On Wed, Apr 9, 2014 at 3:31 PM, Pradeep Ch wrote: > Thanks Dmitriy. But I want multi master support when running spark > standalone. Also I want to know if this multi master thing works if I use > spark-shell. > > > On W

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
Hi Matei, thanks for working with me to find these issues. To summarize, the issues I've seen are: 0.9.0: - https://issues.apache.org/jira/browse/SPARK-1323 SNAPSHOT 2014-03-18: - When persist() used and batchSize=1, java.lang.OutOfMemoryError: Java heap space. To me this indicates a memory leak

Re: Problem with running LogisticRegression in spark cluster mode

2014-04-09 Thread Jenny Zhao
Hi Jagat, yes, I did specify mllib in build.sbt name := "Spark LogisticRegression" version :="1.0" scalaVersion := "2.10.3" libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "0.9.0-incubating" libraryDependencies += "org.apache.spark" % "spark-mllib_2.10" % "0.9.0-incubating" l

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
A very nice addition for us PySpark users in 0.9.1 is the addition of RDD.repartition(), which is not mentioned in the release notes ! This is super helpful for when you create an RDD from a gzipped file and then need to explicitly shuffle
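
A small sketch of the same idea in Scala: gzip is not splittable, so the file arrives as a single partition, and repartition spreads the data out before the heavy work (path and numbers are made up):

```scala
val raw = sc.textFile("hdfs:///logs/big.gz")    // gzipped input => one partition
println(raw.partitions.size)                    // likely 1

val spread = raw.repartition(100)               // shuffle into 100 partitions
val counts = spread.map(line => (line.length, 1)).reduceByKey(_ + _)
```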

Re: pySpark memory usage

2014-04-09 Thread Matei Zaharia
Okay, thanks. Do you have any info on how large your records and data file are? I’d like to reproduce and fix this. Matei On Apr 9, 2014, at 3:52 PM, Jim Blomo wrote: > Hi Matei, thanks for working with me to find these issues. > > To summarize, the issues I've seen are: > 0.9.0: > - https://

Best way to turn an RDD back into a SchemaRDD

2014-04-09 Thread Jan-Paul Bultmann
Hey, My application requires the use of “classical” RDD methods like `distinct` and `subtract` on SchemaRDDs. What is the preferred way to turn the resulting regular RDD[org.apache.spark.sql.Row] back into SchemaRDDs? Calling toSchemaRDD, will not work as the Schema information seems lost already

Re: Spark 0.9.1 released

2014-04-09 Thread Tathagata Das
Thanks Nick for pointing that out! I have updated the release notes. But I see the new operations like repartition in the latest PySpark RDD docs. Maybe refresh the page couple of

Re: trouble with "join" on large RDDs

2014-04-09 Thread Brad Miller
I set SPARK_MEM in the driver process by setting "spark.executor.memory" to 10G. Each machine had 32G of RAM and a dedicated 32G spill volume. I believe all of the units are in pages, and the page size is the standard 4K. There are 15 slave nodes in the cluster and the sizes of the datasets I'm
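
For reference, a 0.9-era sketch of setting the executor heap from the driver with SparkConf; the master URL and values are placeholders, and note this governs only the JVM heap, not the separate Python worker processes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")          // placeholder master URL
  .setAppName("large-join")
  .set("spark.executor.memory", "10g")       // JVM heap per executor
  .set("spark.default.parallelism", "120")   // optional: more, smaller tasks

val sc = new SparkContext(conf)
```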

Re: Spark 0.9.1 released

2014-04-09 Thread Nicholas Chammas
Ah, looks good now. It took me a minute to realize that doing a hard refresh on the docs page was missing the RDD class doc page... And thanks for updating the release notes. On Wed, Apr 9, 2014 at 7:21 PM, Tathagata Das wrote: > Thanks Nick for pointing that out! I have updated the release >

Re: Best way to turn an RDD back into a SchemaRDD

2014-04-09 Thread Michael Armbrust
Good question. This is something we wanted to fix, but unfortunately I'm not sure how to do it without changing the API to RDD, which is undesirable now that the 1.0 branch has been cut. We should figure something out though for 1.1. I've created https://issues.apache.org/jira/browse/SPARK-1460 t
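
One workaround sometimes used in that era, assuming the data can be modeled as case classes rather than generic Rows, was to keep the RDD typed and let the implicit createSchemaRDD conversion re-attach the schema after the plain RDD operations. A sketch, with the Record type made up:

```scala
import org.apache.spark.sql.SQLContext

case class Record(id: Int, name: String)             // hypothetical schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                    // implicit RDD[case class] => SchemaRDD

val a = sc.parallelize(Seq(Record(1, "x"), Record(2, "y")))
val b = sc.parallelize(Seq(Record(2, "y")))

// distinct/subtract return a plain RDD[Record]; the implicit conversion
// applies again when a SchemaRDD method such as registerAsTable is called
val diff = a.subtract(b).distinct()
diff.registerAsTable("diff")
sqlContext.sql("SELECT id FROM diff").collect()
```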

Re: Multi master Spark

2014-04-09 Thread Aaron Davidson
It is as Jagat said. The Masters do not need to know about one another, as ZooKeeper manages their implicit communication. As for Workers (and applications, such as spark-shell), once a Worker is registered with *some *Master, its metadata is stored in ZooKeeper such that if another Master is elect

Re: Only TraversableOnce?

2014-04-09 Thread wxhsdp
Thank you, it works. After my operation over p, I return p.toIterator, because mapPartitions has an iterator return type; is that right? rdd.mapPartitions{D => {val p = D.toArray; ...; p.toIterator}} -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Only-Traversable

Re: Only TraversableOnce?

2014-04-09 Thread Nan Zhu
Yeah, should be right -- Nan Zhu On Wednesday, April 9, 2014 at 8:54 PM, wxhsdp wrote: > thank you, it works > after my operation over p, return p.toIterator, because mapPartitions has > iterator return type, is that right? > rdd.mapPartitions{D => {val p = D.toArray; ...; p.toIterator}} > >
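
A compact sketch of that pattern; mapPartitions expects an Iterator back, so materializing the partition to an Array and returning its iterator at the end is fine as long as a single partition fits in memory (rdd is assumed to exist):

```scala
val result = rdd.mapPartitions { iter =>
  val p = iter.toArray     // materialize the whole partition
  // ... operate on p in place ...
  p.iterator               // hand back an Iterator, as mapPartitions requires
}
```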

Re: pySpark memory usage

2014-04-09 Thread Jim Blomo
This dataset is uncompressed text at ~54GB. stats() returns (count: 56757667, mean: 1001.68740583, stdev: 601.775217822, max: 8965, min: 343) On Wed, Apr 9, 2014 at 6:59 PM, Matei Zaharia wrote: > Okay, thanks. Do you have any info on how large your records and data file > are? I'd like to repro

shuffle memory requirements

2014-04-09 Thread Ameet Kini
val hrdd = sc.hadoopRDD(..) val res = hrdd.partitionBy(myCustomPartitioner).reduceByKey(..).mapPartitionsWithIndex( some code to save those partitions ) I'm getting OutOfMemoryErrors on the read side of partitionBy shuffle. My custom partitioner generates over 20,000 partitions, so there are 20,000
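
For context, a custom partitioner is just a partition count plus a key-to-partition mapping. A minimal hash-based sketch with the 20,000 partitions mentioned above (hrdd is the pair RDD from the snippet):

```scala
import org.apache.spark.Partitioner

class MyCustomPartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = {
    val h = key.hashCode % numPartitions
    if (h < 0) h + numPartitions else h   // keep the result non-negative
  }
}

val partitioned = hrdd.partitionBy(new MyCustomPartitioner(20000))
```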

Strange behaviour of different SSCs with same Kafka topic

2014-04-09 Thread gaganbm
I am really at my wits' end here. I have different Streaming contexts, lets say 2, and both listening to same Kafka topics. I establish the KafkaStream by setting different consumer groups to each of them. Ideally, I should be seeing the kafka events in both the streams. But what I am getting is

Re: NPE using saveAsTextFile

2014-04-09 Thread Nick Pentreath
Anyone have a chance to look at this? Am I just doing something silly somewhere? If it makes any difference, I am using the elasticsearch-hadoop plugin for ESInputFormat. But as I say, I can parse the data (count, first() etc). I just can't save it as text file. On Tue, Apr 8, 2014 at 4:50 PM

Re: NPE using saveAsTextFile

2014-04-09 Thread Matei Zaharia
I haven’t seen this but it may be a bug in Typesafe Config, since this is serializing a Config object. We don’t actually use Typesafe Config ourselves. Do you have any nulls in the data itself by any chance? And do you know how that Config object is getting there? Matei On Apr 9, 2014, at 11:3

Re: NPE using saveAsTextFile

2014-04-09 Thread Nick Pentreath
Ok I thought it may be closing over the config option. I am using config for job configuration, but extracting vals from that. So not sure why as I thought I'd avoided closing over it. Will go back to source and see where it is creeping in. On Thu, Apr 10, 2014 at 8:42 AM, Matei Zaharia wrote:
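
The usual fix is exactly what is described here: pull plain values out of the Config on the driver so the closure captures only serializable primitives and never the Config object itself. A sketch with made-up keys, assuming rdd is an RDD[String]:

```scala
import com.typesafe.config.ConfigFactory

val config     = ConfigFactory.load()
val outputPath = config.getString("job.output.path")   // hypothetical key
val threshold  = config.getInt("job.threshold")        // hypothetical key

// The closure below captures only outputPath and threshold (plain String/Int),
// so the non-serializable Config never ships to the executors
rdd.filter(_.length > threshold).saveAsTextFile(outputPath)
```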