Re: Guidelines for Spark Cluster Sizing

2014-04-03 Thread Sonal Goyal
Hi, my earlier email did not get any response. I am looking for some guidelines for sizing a Spark cluster. Please let me know if there are any best practices or rules of thumb. Thanks a lot. Best Regards, Sonal Nube Technologies

Re: Guidelines for Spark Cluster Sizing

2014-04-03 Thread Prashant Sharma
Hi, correct me if I am wrong, but I have not heard of such a guideline, maybe because it is actually very dynamic and depends on a lot of factors. The most important factor is the kind of workload: some workloads can benefit a lot from large memory and some don't. So it's not just input data size, it's also how

Re: Guidelines for Spark Cluster Sizing

2014-04-03 Thread Sonal Goyal
Hi Prashant, thanks for replying. Yes, I understand that there are a lot of factors depending on the kind of processing and data, as well as private/public cluster. However, as you rightly mentioned, it would be great to hear from the community the kind of data loads they run, configurations they have f

How to use addJar for adding external jars in spark-0.9?

2014-04-03 Thread yh18190
Hi, I guess there is a problem with the Spark 0.9 version: when I tried to add the external jar jerkson_2.9.1 (version 0.5.0), with the Scala version being 2.10.3, on the cluster, I got a java.lang.NoClassDefFoundError because these jars are not being sent to the worker nodes. Please let me know how to resolve this issue.
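A minimal sketch of how SparkContext.addJar is typically used; the jar path below is illustrative, and the jar must be reachable from the driver (a local path, or an hdfs:// / http:// URI). A later reply in this digest suggests building a single assembly jar instead.

    // Hedged sketch: ship an extra jar to the executors via SparkContext.addJar.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("addJar-example")
    val sc = new SparkContext(conf)
    sc.addJar("/path/to/jerkson_2.9.1-0.5.0.jar")  // illustrative path on the driver machine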

Re: Error when run Spark on mesos

2014-04-03 Thread felix
You can download this tarball to replace the 0.9.0 one: wget https://github.com/apache/spark/archive/v0.9.1-rc3.tar.gz, then just compile it and test it! 2014-04-03 18:41 GMT+08:00 Gino Mathews [via Apache Spark User List] < ml-node+s1001560n3702...@n3.nabble.com>: > Hi, > > > > I have installed S

How to stop system info output in spark shell

2014-04-03 Thread weida xu
Hi all, when I start Spark in the shell, it automatically outputs some system info every minute, see below. Can I stop or block this output? I tried the ":silent" command, but the automatic output remains. 14/04/03 19:34:30 INFO MetadataCleaner: Ran metadata cleaner for SHUFFLE_BLO

Spark Disk Usage

2014-04-03 Thread Surendranauth Hiraman
Hi, I know if we call persist with the right options, we can have Spark persist an RDD's data on disk. I am wondering what happens in intermediate operations that could conceivably create large collections/Sequences, like GroupBy and shuffling. Basically, one part of the question is when is disk
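For reference, a minimal sketch of the explicit disk persistence the first sentence refers to; the storage level and input path are just examples.

    // Hedged sketch: allow an RDD's partitions to spill to disk when they don't fit in memory.
    import org.apache.spark.storage.StorageLevel

    val data = sc.textFile("hdfs:///some/input")        // illustrative input path
    data.persist(StorageLevel.MEMORY_AND_DISK)          // one of several available storage levels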

Re: Strange behavior of RDD.cartesian

2014-04-03 Thread Jaonary Rabarisoa
You can find here a gist that illustrates this issue: https://gist.github.com/jrabary/9953562 I got this with Spark from the master branch. On Sat, Mar 29, 2014 at 7:12 PM, Andrew Ash wrote: > Is this spark 0.9.0? Try setting spark.shuffle.spill=false. There was a > hash collision bug that's fixed in
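A minimal sketch of setting the property suggested in the quoted reply, shown here via SparkConf (it can equally be passed as a -D system property); application name is illustrative.

    // Hedged sketch: disable shuffle spilling as a workaround for the 0.9.0 hash-collision bug.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("cartesian-test")
      .set("spark.shuffle.spill", "false")
    val sc = new SparkContext(conf)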

what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread felix
I deployed Spark on Mesos successfully by copying the Spark tarball to all the slave nodes. I have tried removing the SPARK_EXECUTOR_URI setting in spark-env.sh and everything goes well. So, I want to know what the SPARK_EXECUTOR_URI setting does; does Spark 0.9.1 not use it any more, or is it my

Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread andy petrella
It's used to tell Mesos where the Spark assembly is located -- it contains a Mesos Executor implementation (the entry point being spark-executor under the sbin folder). In a nutshell, the Spark Mesos framework (see the MesosSchedulerBackend and MesosExecutorBackend impls) will send this URI to t

Re: Submitting to yarn cluster

2014-04-03 Thread Tom Graves
You should just be making sure your HADOOP_CONF_DIR env variable is correct and not setting yarn.resourcemanager.address in SparkConf.  For Yarn/Hadoop you need to point it to the configuration files for your cluster.   Generally that setting goes into yarn-site.xml. If just setting it doesn't w
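In practice that usually just means exporting, on the submitting machine, the directory that holds yarn-site.xml and the other cluster config files; the path below is illustrative.

    # Illustrative: point the Spark client at the cluster's Hadoop/YARN configuration files
    export HADOOP_CONF_DIR=/etc/hadoop/conf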

Re: Submitting to yarn cluster

2014-04-03 Thread Ron Gonzalez
Right thanks, that worked. My goal is to programmatically submit things to the yarn cluster. The underlying framework we have is a set of property files that specify different machines for dev, qe, prod. While it's definitely possible to have different things deployed as the client etc/hadoop di

Re: How to use addJar for adding external jars in spark-0.9?

2014-04-03 Thread andy petrella
Could you try building an "uber jar" with all deps, using mvn shade or sbt assembly (or whatever can do that ^^)? I think it'll be easier as a first step than trying to add deps independently (and it reduces the amount of network traffic). Andy Petrella Belgium (Liège) * * D
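A rough sketch of what the sbt-assembly route can look like; the plugin version shown is an assumption (use whatever sbt-assembly release matches your sbt), and the Spark dependency line is illustrative.

    // project/plugins.sbt -- version is an assumption
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

    // build.sbt -- sbt-assembly 0.x style settings
    import AssemblyKeys._

    assemblySettings

    // mark Spark itself as "provided" so it is not bundled into the assembly
    libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided"

Running `sbt assembly` then produces a single fat jar containing your code plus its dependencies, which avoids shipping each dependency jar separately.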

Re: How to stop system info output in spark shell

2014-04-03 Thread Eduardo Costa Alfaia
Have you already tried, in conf/log4j.properties, setting log4j.rootCategory=OFF? On 4/3/14, 13:46, weida xu wrote: Hi all, when I start Spark in the shell, it automatically outputs some system info every minute, see below. Can I stop or block this output? I tried the ":silent" command,
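A minimal conf/log4j.properties sketch along these lines; using WARN instead of OFF keeps errors visible while silencing the per-minute INFO lines (the appender settings mirror what the shipped template roughly looks like).

    # conf/log4j.properties (start from conf/log4j.properties.template)
    # Silence routine INFO output in the shell; keep warnings and errors.
    log4j.rootCategory=WARN, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n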

Avro serialization

2014-04-03 Thread Ron Gonzalez
Hi, I know that sources need to either be Java serializable or use Kryo serialization. Does anyone have sample code that reads, transforms and writes Avro files in Spark? Thanks, Ron

Re: Avro serialization

2014-04-03 Thread Ian O'Connell
Objects being transformed need to be one of these in flight. Source data can just use the MapReduce input formats, so anything you can do with mapred. For doing an Avro one you probably want one of: https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantb

Re: Submitting to yarn cluster

2014-04-03 Thread Tom Graves
Generally the YARN cluster handles propagating and setting HADOOP_CONF_DIR for any containers it launches, so it should really just be needed on your client node submitting the applications. I haven't specifically tried doing what you said, but like you say, Spark doesn't really expose the configurat

Re: Submitting to yarn cluster

2014-04-03 Thread Ron Gonzalez
Cool, thanks for your feedback. On Thursday, April 3, 2014 7:20 AM, Tom Graves wrote: Generally the YARN cluster handles propagating and setting HADOOP_CONF_DIR for any containers it launches, so it should really just be needed on your client node submitting the applications. I haven't specificall

Re: Avro serialization

2014-04-03 Thread FRANK AUSTIN NOTHAFT
We use avro objects in our project, and have a Kryo serializer for generic Avro SpecificRecords. Take a look at: https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala Also, Matt Massie has a good blog post
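For readers who don't want to follow the link, here is a minimal sketch of the same idea: a Kryo serializer that delegates to Avro's specific datum reader/writer, registered for a hypothetical Avro-generated class MyRecord. It is modeled loosely on the registrator linked above, not copied from it.

    // Hedged sketch of Kryo serialization for Avro SpecificRecords.
    // "MyRecord" is a placeholder for an Avro-generated class.
    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}
    import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
    import org.apache.spark.serializer.KryoRegistrator

    class AvroSpecificSerializer[T <: SpecificRecord](klass: Class[T]) extends Serializer[T] {
      private val reader = new SpecificDatumReader[T](klass)
      private val writer = new SpecificDatumWriter[T](klass)

      override def write(kryo: Kryo, output: Output, record: T): Unit = {
        // Kryo's Output is a java.io.OutputStream, so Avro can encode straight into it.
        val encoder = EncoderFactory.get().binaryEncoder(output, null)
        writer.write(record, encoder)
        encoder.flush()
      }

      override def read(kryo: Kryo, input: Input, clazz: Class[T]): T = {
        val decoder = DecoderFactory.get().binaryDecoder(input, null)
        reader.read(null.asInstanceOf[T], decoder)
      }
    }

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyRecord], new AvroSpecificSerializer(classOf[MyRecord]))
      }
    }

    // Enabled via SparkConf (registrator class name is illustrative):
    //   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   conf.set("spark.kryo.registrator", "com.example.MyKryoRegistrator")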

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
This is great news, thanks for the update! I will either wait for the 1.0 release or go and test it ahead of time from git, rather than trying to pull it out of JobLogger or creating my own SparkListener. On 04/02/2014 06:48 PM, Andrew Or wrote: Hi Philip, In the upcoming release of Spark 1.0

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Philip Ogren
I can appreciate the reluctance to expose something like the JobProgressListener as a public interface. It's exactly the sort of thing that you want to deprecate as soon as something better comes along and can be a real pain when trying to maintain the level of backwards compatibility that we

Re: Is there a way to get the current progress of the job?

2014-04-03 Thread Mark Hamstra
https://issues.apache.org/jira/browse/SPARK-1081?jql=project%20%3D%20SPARK%20AND%20text%20~%20Annotate On Thu, Apr 3, 2014 at 9:24 AM, Philip Ogren wrote: > I can appreciate the reluctance to expose something like the > JobProgressListener as a public interface. It's exactly the sort of thing

Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread felix
So, if I set this parameter, there is no need to copy the Spark tarball to every Mesos slave node? Am I right? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/what-does-SPARK-EXECUTOR-URI-in-spark-env-sh-do-tp3708p3722.html Sent from the Apache Spark User L

Re: what does SPARK_EXECUTOR_URI in spark-env.sh do ?

2014-04-03 Thread andy petrella
Indeed, that's how Mesos actually works. So the tarball just has to be somewhere accessible by the Mesos slaves. That's why it is often put in HDFS. On 3 Apr 2014 18:46, "felix" wrote: > So, if I set this parameter, there is no need to copy the Spark tarball to > every Mesos slave node? Am I
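So a typical conf/spark-env.sh for Mesos might look roughly like this; the library path, namenode host, and tarball name are illustrative.

    # conf/spark-env.sh -- illustrative values
    export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
    # Tarball the Mesos slaves fetch and unpack to get the Spark executor:
    export SPARK_EXECUTOR_URI=hdfs://namenode:8020/frameworks/spark/spark-0.9.1-bin.tar.gz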

RE: Issue with zip and partitions

2014-04-03 Thread Patrick_Nicolas
Hi Xiangrui, thanks for your reply. This makes sense, and I should have looked at the doc, indeed. Zipping before saveAsFile did the trick. -Original Message- From: Xiangrui Meng [mailto:men...@gmail.com] Sent: Tuesday, April 01, 2014 11:43 PM To: user@spark.apache.org Cc: u...@spark.i

Spark 1.0.0 release plan

2014-04-03 Thread Bhaskar Dutta
Hi, Is there any change in the release plan for Spark 1.0.0-rc1 release date from what is listed in the "Proposal for Spark Release Strategy" thread? == Tentative Release Window for 1.0.0 == Feb 1st - April 1st: General development April 1st: Code freeze for new features April 15th: RC1 Thanks,

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-03 Thread Kevin Markey
We are now testing precisely what you ask about in our environment.  But Sandy's questions are relevant.  The bigger issue is not Spark vs. Yarn but "client" vs. "standalone" and where the client is located on the network relative to the cluster. The "client" options

Re: Spark 1.0.0 release plan

2014-04-03 Thread Matei Zaharia
Hey Bhaskar, this is still the plan, though QAing might take longer than 15 days. Right now since we’ve passed April 1st, the only features considered for a merge are those that had pull requests in review before. (Some big ones are things like annotating the public APIs and simplifying configur

Re: Example of creating expressions for SchemaRDD methods

2014-04-03 Thread Michael Armbrust
I'll start off by saying that the DSL is pretty experimental, and we are still figuring out exactly how to expose all of it to end users. Right now you are going to get more full featured functionality from SQL. Under the covers, using these two representations results in the same execution plans
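For context, a rough sketch of the two styles being compared, based on the then-experimental 1.0-era API (exact method names may shift before release); a SQLContext named sqlContext and a SchemaRDD `people` with name/age columns are assumed.

    // Assumed: an existing SQLContext (sqlContext) and a SchemaRDD `people`.
    import sqlContext._

    // SQL string interface:
    people.registerAsTable("people")
    val teenagersSql = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    // Scala DSL interface (experimental):
    val teenagersDsl = people.where('age >= 13).where('age <= 19).select('name)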

Re: Using ProtoBuf 2.5 for messages with Spark Streaming

2014-04-03 Thread Vipul Pandey
Any word on this one ? On Apr 2, 2014, at 12:26 AM, Vipul Pandey wrote: > I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus > generated also has both shaded and real version of protobuf classes > > Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv > ./assembly/

Re: Shark Direct insert into table value (?)

2014-04-03 Thread Michael Armbrust
This should soon be possible with Spark SQL. PR-195 adds support for INSERT statements in SparkSQL and SPARK-1366 will let you use this syntax on Hive tables. Both of these should be included in the 1.0 release. The result would look something like this: sql("INSERT INTO emp SELECT 212,'Abhi'")

Re: Spark output compression on HDFS

2014-04-03 Thread Konstantin Kudryavtsev
Thanks all, it works fine now and I managed to compress the output. However, I am still stuck... How is it possible to set the compression type for Snappy? I mean setting record- or block-level compression for the output. On Apr 3, 2014 1:15 AM, "Nicholas Chammas" wrote: > Thanks for pointing that out.
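One hedged sketch of setting both the codec and the RECORD/BLOCK compression type through the underlying Hadoop configuration; the RDD names and output paths are illustrative, and the compression type only takes effect for SequenceFile output.

    // Hedged sketch: route compression settings through the Hadoop configuration.
    import org.apache.hadoop.io.compress.SnappyCodec

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("mapred.output.compress", "true")
    hadoopConf.set("mapred.output.compression.codec", classOf[SnappyCodec].getName)
    hadoopConf.set("mapred.output.compression.type", "BLOCK")   // RECORD or BLOCK; SequenceFiles only

    rdd.saveAsTextFile("hdfs:///path/to/output")                // illustrative
    // pairRdd.saveAsSequenceFile("hdfs:///path/to/seq-output") // where record/block actually matters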

Re: Optimal Server Design for Spark

2014-04-03 Thread Mayur Rustagi
Are your workers not utilizing all the cores? One worker will utilize multiple cores depending on resource allocation. Regards Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Wed, Apr 2, 2014 at 7:19 PM, Debasish Da

Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Jan-Paul Bultmann
Hey, Does somebody know the kinds of dependencies that the new SQL operators produce? I’m specifically interested in the relational join operation as it seems substantially more optimized. The old join was narrow on two RDDs with the same partitioner. Is the relational join narrow as well? Cheer

Re: Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Michael Armbrust
I'm sorry, but I don't really understand what you mean when you say "wide" in this context. For a HashJoin, the only dependencies of the produced RDD are the two input RDDs. For BroadcastNestedLoopJoin, the only dependency will be on the streamed RDD. The other RDD will be distributed to all nod

Re: Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Jan-Paul Bultmann
I was referring to the categorization made by the RDD paper. It describes a narrow dependency as one where every parent partition is required by at most one child partition (e.g. map), whereas a wide dependency means that some parent partitions are required by multiple child partitions (e.g. a jo

Re: Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Michael Armbrust
Ah, I understand now. It is going to depend on the input partitioning (which is a little different than a partitioner in Spark core, since in Spark SQL we have more information about the logical structure). Basically all SQL operators have a method `requiredChildDistribution` and an `outputPartit

Re: Optimal Server Design for Spark

2014-04-03 Thread Matei Zaharia
To run multiple workers with Spark’s standalone mode, set SPARK_WORKER_INSTANCES and SPARK_WORKER_CORES in conf/spark-env.sh. For example, if you have 16 cores and want 2 workers, you could add export SPARK_WORKER_INSTANCES=2 export SPARK_WORKER_CORES=8 Matei On Apr 3, 2014, at 12:38 PM, Mayur
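Laid out as the conf/spark-env.sh entries described above:

    # conf/spark-env.sh -- a 16-core machine split into two 8-core workers
    export SPARK_WORKER_INSTANCES=2
    export SPARK_WORKER_CORES=8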

Re: Spark SQL transformations, narrow vs. wide

2014-04-03 Thread Jan-Paul Bultmann
Thanks a lot for the detailed explanation, and your work on this awesome addition to Spark :D I just realized that with classical joins one has to repartition for each different join as well, so I assume calling partitionBy for each classical join is equivalent to the `Exchange` operation. All

Re: Optimal Server Design for Spark

2014-04-03 Thread Debasish Das
@Mayur... I am hitting ulimits on the cluster if I go beyond 4 cores per worker, and I don't think I can change the ulimit due to sudo issues etc... If I have more workers, in ALS I can go for 20 blocks (right now I am running 10 blocks on 10 nodes with 4 cores each, and now I can go up to 20 blocks o
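For reference, a hedged sketch of how a block count like the one above is passed to MLlib's ALS; the ratings RDD is assumed, and rank/iterations/lambda are illustrative.

    // Hedged sketch: ALS parallelism is controlled by the final "blocks" argument.
    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD

    val ratings: RDD[Rating] = ???   // assumed input of user/product/rating triples
    // (ratings, rank, iterations, lambda, blocks) -- 20 blocks as discussed above
    val model = ALS.train(ratings, 20, 10, 0.01, 20)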

Re: Spark 1.0.0 release plan

2014-04-03 Thread Patrick Wendell
Btw - after that initial thread I proposed a slightly more detailed set of dates: https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage - Patrick On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia wrote: > Hey Bhaskar, this is still the plan, though QAing might take longer than > 15 days.

Re: pySpark memory usage

2014-04-03 Thread Matei Zaharia
Cool, thanks for the update. Have you tried running a branch with this fix (e.g. branch-0.9, or the 0.9.1 release candidate?) Also, what memory leak issue are you referring to, is it separate from this? (Couldn’t find it earlier in the thread.) To turn on debug logging, copy conf/log4j.properti

Sample Project for using Shark API in Spark programs

2014-04-03 Thread Jerry Lam
Hello everyone, I have successfully installed Shark 0.9 and Spark 0.9 in standalone mode on a cluster of 6 nodes for testing purposes. I would like to use the Shark API in Spark programs. So far I could only find the following: $ ./bin/shark-shell scala> val youngUsers = sc.sql2rdd("SELECT * FROM use
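A rough sketch of embedding Shark in a standalone program, with the initialization call as I recall it from the Shark docs of that era; treat the package and method names as assumptions and check them against the Shark 0.9 README.

    // Hedged sketch: create a SharkContext instead of a plain SparkContext,
    // then use sql2rdd as in the shell example above.
    import shark.{SharkContext, SharkEnv}

    val sc: SharkContext = SharkEnv.initWithSharkContext("my-shark-app")
    val youngUsers = sc.sql2rdd("SELECT * FROM users WHERE age < 30")   // illustrative query
    println(youngUsers.count())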

Re: Resilient nature of RDD

2014-04-03 Thread David Thomas
I'm trying to understand the Spark source code. Could you please point me to the code where the compute() function of an RDD is called? Is that called by the workers? On Wed, Apr 2, 2014 at 5:36 PM, Patrick Wendell wrote: > The driver stores the meta-data associated with the partition, but the > re

Re: Example of creating expressions for SchemaRDD methods

2014-04-03 Thread All In A Days Work
Hi Michael, The idea is to build a pipeline of operators on RDD, leveraging existing operations already done. E.g. rdd1 = rdd.select(...). rdd2 = rdd1.where(). rdd3 = rdd2.groupBy() etc. In such construct, each operator builds on the previous one, including any materialized results etc.

Re: Spark 1.0.0 release plan

2014-04-03 Thread Bhaskar Dutta
Thanks a lot guys! On Fri, Apr 4, 2014 at 5:34 AM, Patrick Wendell wrote: > Btw - after that initial thread I proposed a slightly more detailed set of > dates: > https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage > > - Patrick > > > On Thu, Apr 3, 2014 at 11:28 AM, Matei Zaharia

Re: Resilient nature of RDD

2014-04-03 Thread Andrew Or
It all begins with calling rdd.iterator, which calls rdd.computeOrReadCheckpoint(). This materializes the RDD if it's not already materialized, or reads a previously checkpointed version if one exists. See https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L21
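Paraphrased, the dispatch logic inside RDD.scala looks roughly like this; it is a simplified sketch of the flow described above, not the verbatim source.

    // Simplified sketch (not verbatim Spark source). Tasks running on the
    // executors call rdd.iterator(partition, context) for each partition:
    final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
      if (storageLevel != StorageLevel.NONE) {
        // cached RDD: ask the cache manager / block manager, computing if missing
        SparkEnv.get.cacheManager.getOrCompute(this, split, context, storageLevel)
      } else {
        // otherwise compute the partition, or read it back from a checkpoint if one exists
        computeOrReadCheckpoint(split, context)
      }
    }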