Create/shutdown objects before/after RDD use (or: Non-serializable classes)

2014-05-29 Thread Tobias Pfeiffer
Hi, I want to use an object x in my RDD processing as follows: val x = new X(); rdd.map(row => x.doSomethingWith(row)); println(rdd.count()); x.shutdown(). Now the problem is that X is non-serializable, so while this works locally, it does not work in a cluster setup. I thought I could do rdd.mapPart
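
A common workaround is to create and tear down the object inside mapPartitions, so that X is instantiated on each executor and never serialized. A minimal sketch, reusing the poster's names X, doSomethingWith, and shutdown:

    val results = rdd.mapPartitions { iter =>
      val x = new X()   // constructed on the executor, so X need not be serializable
      // materialize the partition before shutting down, since iterators are lazy
      val out = iter.map(row => x.doSomethingWith(row)).toList
      x.shutdown()
      out.iterator
    }
    println(results.count())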

Re: spark job stuck when running on mesos fine grained mode

2014-05-29 Thread prabeesh
Hi Lukasz Jastrzebski, I have a question regarding Shark execution on Mesos. I am querying a file in HDFS and writing the results back to HDFS. The problem I am facing is that I am unable to write the output to HDFS; i.e., when I use the saveAsTextFile() method, the job is getting resubmitted

Re: access hdfs file name in map()

2014-05-29 Thread Aaron Davidson
Currently there is not a way to do this using textFile(). However, you could pretty straightforwardly define your own subclass of HadoopRDD [1] in order to get access to this information (likely using mapPartitionsWithIndex to look up the InputSplit for a particular partition). Note that sc.textFi
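
For reference, later Spark versions expose mapPartitionsWithInputSplit directly on HadoopRDD (a developer API), which yields the same information without subclassing. A sketch, assuming such a version:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileSplit, InputSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD

    // sc.hadoopFile returns a HadoopRDD under the hood; the cast exposes it
    val hadoopRdd = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs://test/path/*")
      .asInstanceOf[HadoopRDD[LongWritable, Text]]

    // pair every line with the name of the file its input split came from
    val withFileNames = hadoopRdd.mapPartitionsWithInputSplit { (split: InputSplit, iter) =>
      val fileName = split.asInstanceOf[FileSplit].getPath.toString
      iter.map { case (_, line) => (fileName, line.toString) }
    }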

Re: Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Thanks, it worked. 2014-05-30 1:53 GMT+08:00 Matei Zaharia : > That hash map is just a list of where each task ran, it’s not the actual > data. How many map and reduce tasks do you have? Maybe you need to give the > driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, > 100)

getPreferredLocations

2014-05-29 Thread ansriniv
I am building my own custom RDD class. 1) Is there a guarantee that a partition will only be processed on a node which is in the "getPreferredLocations" set of nodes returned by the RDD? 2) I am implementing this custom RDD in Java and plan to extend JavaRDD. However, I don't see a "getPreferred
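
On (2): JavaRDD is only a wrapper around the Scala RDD, so getPreferredLocations has to be overridden on an RDD subclass (which you can still do from Java by extending RDD rather than JavaRDD). On (1): locality is a scheduler preference, not a guarantee; once the locality wait expires, a task may run elsewhere. A minimal Scala sketch, with hypothetical names:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class MyPartition(override val index: Int) extends Partition

    class MyRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[String](sc, Nil) {
      override def getPartitions: Array[Partition] = Array(new MyPartition(0))

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        Iterator("some", "data")

      // A hint to the scheduler only; tasks can still run off these hosts
      // after spark.locality.wait expires.
      override def getPreferredLocations(split: Partition): Seq[String] = hosts
    }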

Re: Why Scala?

2014-05-29 Thread Krishna Sankar
Nicholas, Good question. A couple of thoughts from my practical experience: - Coming from R, Scala feels more natural than other languages. The functional nature and succinctness of Scala are more suited to Data Science than other languages. In short, Scala-Spark makes sense, for Data Science, ML

Re: Spark hook to create external process

2014-05-29 Thread ansriniv
Hi Matei, Thanks for the reply. I would like to avoid having to spawn these external processes every time during the processing of the task, to reduce task latency. I'd like these to be pre-spawned as much as possible; tying them to the lifecycle of the corresponding threadpool thread would simplify mana

access hdfs file name in map()

2014-05-29 Thread Xu (Simon) Chen
Hello, a quick question about using Spark to parse CSV files stored on HDFS. I have something very simple: sc.textFile("hdfs://test/path/*").map(line => line.split(",")).map(p => (XXX, p(0), p(2))) Here, I want to replace XXX with a string, which is the current CSV filename for the l

Re: pyspark MLlib examples don't work with Spark 1.0.0

2014-05-29 Thread Xiangrui Meng
You are using EC2. Did you specify the Spark version when you ran the spark-ec2 script, or update /root/spark after the cluster was created? It is very likely that you are running 0.9 on EC2. -Xiangrui On Thu, May 29, 2014 at 5:22 PM, jamborta wrote: > Hi all, > > I wanted to try spark 1.0.0, because
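
One quick way to verify what is actually deployed, from the Spark shell on the cluster (sc.version only appeared around Spark 1.0, so on a 0.9 build the call itself failing is the answer):

    // prints the version of the running Spark build, e.g. "1.0.0"
    println(sc.version)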

Re: Why Scala?

2014-05-29 Thread Nicholas Chammas
Thank you for the specific points about the advantages Scala provides over other languages. Looking at several code samples, the reduction of boilerplate code compared to Java is one of the biggest pluses, to me. On Thu, May 29, 2014 at 8:10 PM, Marek Kolodziej wrote: > I would advise others to form t

Re: Why Scala?

2014-05-29 Thread Marek Kolodziej
Also regarding "why the JVM in general," it's worth remembering that the JVM has excellent garbage collection, and the just-in-time compiler (JIT) can make repetitive code run almost as fast as native C++ code. Then there's the concurrency aspect, which is broken in both Python and Ruby (GIL). Ther

pyspark MLlib examples don't work with Spark 1.0.0

2014-05-29 Thread jamborta
Hi all, I wanted to try spark 1.0.0, because of the new SQL component. I have cloned and built the latest from git. But the examples described here do not work anymore: http://people.apache.org/~pwendell/catalyst-docs/mllib-classification-regression.html#binary-classification-2 I get the followi

Re: Why Scala?

2014-05-29 Thread Marek Kolodziej
I would disagree that Scala is controversial. It's less controversial than Java was when it came out in 1995. Scala's been around since 2004, and over the past couple of years, it saw major adoption at LinkedIn, Twitter, FourSquare, Netflix, Tumblr, The Guardian, Airbnb, Meetup.com, Coursera, UBS,

Re: Spark SQL JDBC Connectivity and more

2014-05-29 Thread Michael Armbrust
On Thu, May 29, 2014 at 3:26 PM, Venkat Subramanian wrote: > > 1) If I have a standalone spark application that has already built a RDD, > how can SharkServer2 or for that matter Shark access 'that' RDD and do > queries on it. All the examples I have seen for Shark, the RDD (tables) are > created
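
For context, the Spark 1.0-era API for turning an existing RDD into a queryable table looks roughly like this (a sketch; the Record case class and data are illustrative). A table registered this way is visible only inside the application that registered it, which is the crux of the question:

    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)

    val sqlContext = new SQLContext(sc)
    import sqlContext._   // brings in the implicit RDD-to-SchemaRDD conversion

    val rdd = sc.parallelize(1 to 100).map(i => Record(i, "val_" + i))
    rdd.registerAsTable("records")   // register under a name for SQL queries
    val results = sql("SELECT key FROM records WHERE key < 10")
    results.collect().foreach(println)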

Re: Spark SQL JDBC Connectivity and more

2014-05-29 Thread Venkat Subramanian
Thanks Michael. OK will try SharkServer2.. But I have some basic questions on a related area: 1) If I have a standalone spark application that has already built a RDD, how can SharkServer2 or for that matter Shark access 'that' RDD and do queries on it. All the examples I have seen for Shark, the

Re: Why Scala?

2014-05-29 Thread Dmitriy Lyubimov
There were a few known concerns about Scala, and some still are, but having been doing Scala professionally for over two years now, I have learned to master and appreciate the advantages. The major concern IMO is Scala in a less-than-scrupulous corporate environment. First, Scala requires significantly more di

Re: Why Scala?

2014-05-29 Thread Nicholas Chammas
Matei, Thank you for the concise explanation. I use Python and will definitely add my vote of interest to seeing more of Spark's functionality (especially Spark Streaming) exposed via Python. Scala seems like an interesting language to learn, if only to unlock more of Spark's functionality for u

Re: Shuffle file consolidation

2014-05-29 Thread Matei Zaharia
It can be set in an individual application. Consolidation had some issues on ext3 as mentioned there, though we might enable it by default in the future because other optimizations now made it perform on par with the non-consolidation version. It also had some bugs in 0.9.0 so I’d suggest at le
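
Per application, the setting mentioned above goes into the SparkConf (a sketch; the property name is from the 0.9/1.0-era configuration docs):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.shuffle.consolidateFiles", "true")  // enable consolidation for this app
    val sc = new SparkContext(conf)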

Re: Shuffle file consolidation

2014-05-29 Thread Nathan Kronenfeld
Thanks, I missed that. One thing that's still unclear to me, even looking at that, is: does this parameter have to be set when starting up the cluster, on each of the workers, or can it be set by an individual client job? On Fri, May 23, 2014 at 10:13 AM, Han JU wrote: > Hi Nathan, > > There'

Re: Why Scala?

2014-05-29 Thread Matei Zaharia
Quite a few people ask this question and the answer is pretty simple. When we started Spark, we had two goals — we wanted to work with the Hadoop ecosystem, which is JVM-based, and we wanted a concise programming interface similar to Microsoft’s DryadLINQ (the first language-integrated big data

Re: Why Scala?

2014-05-29 Thread Benjamin Black
HN is a cesspool safely ignored. On Thu, May 29, 2014 at 1:55 PM, Nick Chammas wrote: > I recently discovered Hacker News and started reading through older posts > about Scala. It > looks like the language is fairly controversial on ther

Why Scala?

2014-05-29 Thread Nick Chammas
I recently discovered Hacker News and started reading through older posts about Scala. It looks like the language is fairly controversial on there, and it got me thinking. Scala appears to be the preferred language to work with in Spark, an

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Gerard Maas
DStream has a helper method to print the first 10 elements of each RDD. You could take some inspiration from it, as the use case is practically the same and the code will probably be very similar: rdd.take(10)... https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/s
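
A sketch of that approach; hashtagCounts is a hypothetical DStream[(String, Long)] of (hashtag, count) pairs:

    hashtagCounts.foreachRDD { rdd =>
      // top(10) finds the 10 largest elements without fully sorting the RDD
      val top10 = rdd.top(10)(Ordering.by[(String, Long), Long](_._2))
      top10.foreach(println)
    }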

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Brian Gawalt
Try looking at the .mapPartitions( ) method implemented for RDD[T] objects. It will give you direct access to an iterator containing the member objects of each partition for doing the kind of within-partition hashtag counts you're describing. -- View this message in context: http://apache-spark

Re: Running Jars on Spark, program just hanging there

2014-05-29 Thread Min Li
Yana, thanks for your advice. The Spark UI is showing everything, and I can see the details of the running app at sparkmaster:4040. I've also looked into the three logs you mentioned; there's no error or warning. After the parallelize(), I first used the rdd.count() operation, and even w

[ANN]: Scala By the Bay Developer Conference, CFP now open

2014-05-29 Thread Chester Chen
Hi Sparkers, Scala By The Bay 2014 is a new conference for developers who use the Scala language or are interested in functional programming practices (www.scalabythebay.org). Scala By The Bay is renamed from last year's successful "Silicon Valley

Re: Driver OOM while using reduceByKey

2014-05-29 Thread Matei Zaharia
That hash map is just a list of where each task ran, it’s not the actual data. How many map and reduce tasks do you have? Maybe you need to give the driver a bit more memory, or use fewer tasks (e.g. do reduceByKey(_ + _, 100) to use only 100 tasks). Matei On May 29, 2014, at 2:03 AM, haitao .
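
In code, the suggestion is just (a sketch; pairs and 100 are illustrative):

    // the second argument to reduceByKey caps the number of reduce tasks,
    // so the driver has fewer task locations (MapStatus entries) to track
    val counts = pairs.reduceByKey(_ + _, 100)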

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread Stephen Boesch
The MergeStrategy combined with sbt assembly did work for me. This is not painless: it takes some trial and error, and the assembly may take multiple minutes. You will likely want to filter out some additional classes from the generated jar file. Here is an SOF answer that explains that and has, IMHO, the bes
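
For reference, a mergeStrategy block for that era's sbt-assembly plugin looks roughly like this in build.sbt (a sketch; the exact keys and imports vary across plugin versions, and the pattern matches are illustrative):

    import sbtassembly.Plugin._
    import AssemblyKeys._

    assemblySettings

    mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
      {
        case PathList("META-INF", xs @ _*) => MergeStrategy.discard
        case "reference.conf"              => MergeStrategy.concat
        case x                             => old(x)   // fall back to the default
      }
    }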

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread Andrei
Thanks, Jordi, your gist looks pretty much like what I have in my project currently (with a few exceptions that I'm going to borrow). I like the idea of using "sbt package", since it doesn't require third-party plugins and, most importantly, doesn't create a mess of classes and resources. But in this

Re: Spark hook to create external process

2014-05-29 Thread Matei Zaharia
Hi Anand, This is probably already handled by the RDD.pipe() operation. It will spawn a process and let you feed data to it through its stdin and read data through stdout. Matei On May 29, 2014, at 9:39 AM, ansriniv wrote: > I have a requirement where for every Spark executor threadpool thre
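
A minimal sketch of RDD.pipe(); the binary path is a placeholder:

    // elements of each partition are written, one per line, to the external
    // command's stdin; its stdout lines form the resulting RDD of strings
    val piped = rdd.pipe("/path/to/worker_binary")
    piped.collect().foreach(println)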

Spark hook to create external process

2014-05-29 Thread ansriniv
I have a requirement where, for every Spark executor threadpool thread, I need to launch an associated external process. My job will consist of some processing in the Spark executor thread and some processing by its associated external process, with the two communicating via some IPC mechanism. Is th

Re: ClassCastExceptions when using Spark shell

2014-05-29 Thread Marcelo Vanzin
Hi Sebastian, That exception generally means you have the class loaded by two different class loaders, and some code is trying to mix instances created by the two different loaded classes. Do you happen to have that class both in the spark jars and in your app's uber-jar? That might explain the p

Re: Spark SQL JDBC Connectivity

2014-05-29 Thread Michael Armbrust
On Wed, May 28, 2014 at 11:39 PM, Venkat Subramanian wrote: > We are planning to use the latest Spark SQL on RDDs. If a third party > application wants to connect to Spark via JDBC, does Spark SQL have > support? > (We want to avoid going though Shark/Hive JDBC layer as we need good > performance)

Re: Comprehensive Port Configuration reference?

2014-05-29 Thread Jacob Eisinger
Howdy Andrew, This is a standalone cluster. And, yes, if my understanding of Spark terminology is correct, you are correct about the port ownerships. Jacob Jacob D. Eisinger IBM Emerging Technologies jeis...@us.ibm.com - (512) 286-6075 From: Andrew Ash To: user@spark.apache.org Date:

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Anwar Rizal
Can you clarify what you're trying to achieve here? If you want to take only the top 10 of each RDD, why not sort followed by take(10) on every RDD? Or do you want to take the top 10 over five minutes? Cheers, On Thu, May 29, 2014 at 2:04 PM, nilmish wrote: > I have a DSTREAM which consists of RDD

ClassCastExceptions when using Spark shell

2014-05-29 Thread Sebastian Schelter
Hi, I have trouble running some custom code on Spark 0.9.1 in standalone mode on a cluster. I built a fat jar (excluding Spark) that I'm adding to the classpath with ADD_JARS=... When I start the Spark shell, I can instantiate classes, but when I run Spark code, I get strange ClassCastExcepti

Re: Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread jaranda
Hi Andrei, I think the preferred way to deploy Spark jobs is by using the sbt package task instead of using the sbt assembly plugin. In any case, as you comment, the mergeStrategy in combination with some dependency exclusions should fix your problems. Have a look at this gist

Re: problem about broadcast variable in iteration

2014-05-29 Thread randylu
Hi Andrew Ash, thanks for your reply. In fact, I have already used unpersist(), but it doesn't take effect. One reason I selected the 1.0.0 version is precisely that it provides the unpersist() interface. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/problem-about-
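
For reference, the 1.0 unpersist() call takes an optional blocking flag, which is worth trying when the non-blocking default appears to have no effect (a sketch; largeTable and updatedTable are hypothetical):

    var bc = sc.broadcast(largeTable)
    // ... run jobs that read bc.value ...
    bc.unpersist(blocking = true)       // wait until executors actually delete it
    bc = sc.broadcast(updatedTable)     // re-broadcast fresh data each iteration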

Is uberjar a recommended way of running Spark/Scala applications?

2014-05-29 Thread Andrei
I'm using Spark 1.0 and sbt assembly plugin to create uberjar of my application. However, when I run assembly command, I get a number of errors like this: java.lang.RuntimeException: deduplicate: different file contents found in the following: /home/username/.ivy2/cache/com.esotericsoftware.kryo/k

Selecting first ten values in a RDD/partition

2014-05-29 Thread nilmish
I have a DSTREAM which consists of RDDs partitioned every 2 sec. I have sorted each RDD and want to retain only the top 10 values and discard the rest. How can I retain only the top 10 values? I am trying to get the top 10 hashtags. Instead of sorting the entire 5-minute counts (thereby incurring th

Re: Python, Spark and HBase

2014-05-29 Thread Nick Pentreath
Hi Tommer, I'm working on updating and improving the PR, and will work on getting an HBase example working with it. Will feed back as soon as I have had the chance to work on this a bit more. N On Thu, May 29, 2014 at 3:27 AM, twizansk wrote: > The code which causes the error is: > > The code

How can I dispose an Accumulator?

2014-05-29 Thread innowireless TaeYun Kim
Hi, How can I dispose an Accumulator? It has no method like 'unpersist()' which Broadcast provides. Thanks.

Re: A Standalone App in Scala: Standalone mode issues

2014-05-29 Thread jaranda
I finally got it working. Main points: - I had to add the hadoop-client dependency to avoid a strange EOFException. - I had to set SPARK_MASTER_IP in conf/start-master.sh to hostname -f instead of hostname, since Akka seems not to work properly with host names / IPs; it requires fully qualified domain

Driver OOM while using reduceByKey

2014-05-29 Thread haitao .yao
Hi, I used 1g of memory for the driver Java process and got an OOM error on the driver side before reduceByKey. After analyzing the heap dump, the biggest object is org.apache.spark.MapStatus, which occupied over 900MB of memory. Here's my question: 1. Are there any optimization switches that I can tune

Re: Use mvn run Spark program occur problem

2014-05-29 Thread jaranda
That was it, thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Use-mvn-run-Spark-program-occur-problem-tp1751p6512.html Sent from the Apache Spark User List mailing list archive at Nabble.com.