Re: When will spark 1.2 released?

2014-12-18 Thread Matei Zaharia
Yup, as he posted before, "An Apache infrastructure issue prevented me from pushing this last night. The issue was resolved today and I should be able to push the final release artifacts tonight." > On Dec 18, 2014, at 10:14 PM, Andrew Ash wrote: > > Patrick is working on the release as we spe

Re: When will spark 1.2 released?

2014-12-18 Thread Andrew Ash
Patrick is working on the release as we speak -- I expect it'll be out later tonight (US west coast) or tomorrow at the latest. On Fri, Dec 19, 2014 at 1:09 AM, Ted Yu wrote: > > Interesting, the maven artifacts were dated Dec 10th. > However vote for RC2 closed recently: > > http://search-hadoop

Re: When will spark 1.2 released?

2014-12-18 Thread Ted Yu
Interesting, the maven artifacts were dated Dec 10th. However, the vote for RC2 closed only recently: http://search-hadoop.com/m/JW1q5K8onk2/Patrick+spark+1.2.0&subj=Re+VOTE+Release+Apache+Spark+1+2+0+RC2+ Cheers On Dec 18, 2014, at 10:02 PM, madhu phatak wrote: > It’s on Maven Central already http://se

Re: When will spark 1.2 released?

2014-12-18 Thread madhu phatak
It’s on Maven Central already http://search.maven.org/#browse%7C717101892 On Fri, Dec 19, 2014 at 11:17 AM, vboylin1...@gmail.com < vboylin1...@gmail.com> wrote: > > Hi, > Does anyone know when spark 1.2 will be released? 1.2 has many great features > that we can't wait for ,-) > > Sincerely > Lin wukan

Re: SchemaRDD.sample problem

2014-12-18 Thread madhu phatak
Hi, Can you clean up the code a little bit? It's hard to read what's going on. You can use pastebin or gist to put the code. On Wed, Dec 17, 2014 at 3:58 PM, Hao Ren wrote: > > Hi, > > I am using SparkSQL on 1.2.1 branch. The problem comes from the following > 4-line code: > > *val t1: SchemaR

When will spark 1.2 released?

2014-12-18 Thread vboylin1...@gmail.com
Hi, Does anyone know when spark 1.2 will be released? 1.2 has many great features that we can't wait for ,-) Sincerely, Lin wukang (Sent from NetEase Mail Master)

Re: Can we specify driver running on a specific machine of the cluster on yarn-cluster mode?

2014-12-18 Thread madhu phatak
Hi, The driver runs on the machine from where you did the spark-submit. You cannot change that. On Thu, Dec 18, 2014 at 3:44 PM, LinQili wrote: > > Hi all, > On yarn-cluster mode, can we let the driver run on a specific machine > that we choose in the cluster? Or even a machine not in the cl

Re: UNION two RDDs

2014-12-18 Thread madhu phatak
Hi, coalesce is an operation which changes the number of records in a partition. It will not touch ordering within a row, AFAIK. On Fri, Dec 19, 2014 at 2:22 AM, Jerry Lam wrote: > > Hi Spark users, > > I wonder if val resultRDD = RDDA.union(RDDB) will always have records in > RDDA before records in RDDB

RE: SPARK-2243 Support multiple SparkContexts in the same JVM

2014-12-18 Thread Anton Brazhnyk
Well, that's actually what I need (one simple app, several contexts, similar to what JobServer does) and I'm just looking for some workaround here. Classloaders look a little easier for me than spawning my own processes. Being more specific, I just need to be able to execute arbitrary Spark jobs

Re: Spark GraphX question.

2014-12-18 Thread Tae-Hyuk Ahn
Thanks, Harihar. But this is slightly more complicated than just using subgraph(filter()). See transitive reduction: http://en.wikipedia.org/wiki/Transitive_reduction My case has one additional requirement: weights need to be taken into account (like a maximum spanning tree). Using a linear transitiv

java.lang.ExceptionInInitializerError/Unable to load YARN support

2014-12-18 Thread maven
All, I just built Spark-1.2 on my enterprise server (which has Hadoop 2.3 with YARN). Here're the steps I followed for the build: $ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package $ export SPARK_HOME=/path/to/spark/folder $ export HADOOP_CONF_DIR=/etc/hadoop/conf Ho

Re: Sharing sqlContext between Akka router and "routee" actors ...

2014-12-18 Thread Soumya Simanta
Why do you need a router? I mean, can't you do it with just one actor which has the SQLContext inside it? On Thu, Dec 18, 2014 at 9:45 PM, Manoj Samel wrote: > Hi, > > Akka router creates a sqlContext and creates a bunch of "routees" actors > with sqlContext as parameter. The actors then execute q

Re: does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Michael Armbrust
There is only column level encoding (run length encoding, delta encoding, dictionary encoding) and no generic compression. On Thu, Dec 18, 2014 at 12:07 PM, Sadhan Sood wrote: > > Hi All, > > Wondering if when caching a table backed by lzo compressed parquet data, > if spark also compresses it (u
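As a rough sketch of the behavior Michael describes (not code from this thread; the SQLContext and table names are assumed), the column-level encodings kick in when a table is cached with the compression flag on:

    // assumes an existing SQLContext named sqlContext and a registered table "logs"
    sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
    sqlContext.cacheTable("logs")  // columns get RLE/delta/dictionary encoding as they are cached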

Sharing sqlContext between Akka router and "routee" actors ...

2014-12-18 Thread Manoj Samel
Hi, Akka router creates a sqlContext and creates a bunch of "routees" actors with sqlContext as a parameter. The actors then execute queries on that sqlContext. Would this pattern be an issue? Is there any other way sparkContext etc. should be shared cleanly in Akka routers/routees? Thanks,

RE: Do I need to applied feature scaling via StandardScaler for LBFGS for Linear Regression?

2014-12-18 Thread Bui, Tri
Thanks dbtsai for the info. Are you using the case class for: Case(response, vec) => ? Also, what library do I need to import to use .toBreeze ? Thanks, tri -Original Message- From: dbt...@dbtsai.com [mailto:dbt...@dbtsai.com] Sent: Friday, December 12, 2014 3:27 PM To: Bui, T

Re: MLLib /ALS : java.lang.OutOfMemoryError: Java heap space

2014-12-18 Thread Xiangrui Meng
Hi Jay, Please try increasing executor memory (if the available memory is more than 2GB) and reduce numBlocks in ALS. The current implementation stores all subproblems in memory and hence the memory requirement is significant when k is large. You can also try reducing k and see whether the problem

Re: How to increase parallelism in Yarn

2014-12-18 Thread Andrew Or
Hi Suman, I'll assume that you are using spark submit to run your application. You can pass the --num-executors flag to ask for more containers. If you want to allocate more memory for each executor, you may also pass in the --executor-memory flag (this accepts a string in the format 1g, 512m etc.
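For example (a sketch only; the class and jar names below are placeholders, not from this thread):

    spark-submit --master yarn-cluster \
      --num-executors 25 \
      --executor-memory 2g \
      --class com.example.KMeansJob \
      my-app.jar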

Re: Creating a smaller, derivative RDD from an RDD

2014-12-18 Thread Sean Owen
I don't think you can avoid examining each element of the RDD, if that's what you mean. Your approach is basically the best you can do in general. You're not making a second RDD here, and even if you did this in two steps, the second RDD is really more of a bookkeeping than a second huge data struc

How to increase parallelism in Yarn

2014-12-18 Thread Suman Somasundar
Hi, I am using Spark 1.1.1 on Yarn. When I try to run K-Means, I see from the Yarn dashboard that only 3 containers are being used. How do I increase the number of containers used? P.S: When I run K-Means on Mahout with the same settings, I see that there are 25-30 containers being used.

Re: Spark GraphX question.

2014-12-18 Thread Harihar Nahak
Hi Ted, I've no idea what Transitive Reduction is, but you can achieve the expected result with the graph.subgraph(graph.edges.filter()) syntax, which filters edges by their weight and gives you a new graph as per your condition. On 19 December 2014 at 11:11, Tae-Hyuk Ahn [via Apache Spark User List] < ml-

Re: hello

2014-12-18 Thread Harihar Nahak
You mean the Spark User List? It's pretty easy. Check the first email, it has all the instructions. On 18 December 2014 at 21:56, csjtx1021 [via Apache Spark User List] < ml-node+s1001560n20759...@n3.nabble.com> wrote: > > i want to join you > > -- > If you reply to this emai

RE: Control default partition when load a RDD from HDFS

2014-12-18 Thread Shuai Zheng
Hmmm, how to do that? You mean create an RDD for each file? Then I will have tons of RDDs. And my calculation needs to rely on other input, not just the file itself. Can you show some pseudo code for that logic? Regards, Shuai From: Diego García Valverde [mailto:dgarci...@agbar.es] Se

Creating a smaller, derivative RDD from an RDD

2014-12-18 Thread bethesda
We have a very large RDD and I need to create a new RDD whose values are derived from each record of the original RDD, and we only retain the few new records that meet a criteria. I want to avoid creating a second large RDD and then filtering it since I believe this could tax system resources unne

Spark GraphX question.

2014-12-18 Thread Tae-Hyuk Ahn
Hi All, I am wondering what is the best way to remove transitive edges with a maximum spanning tree. For example, Edges: 1 -> 2 (30) 2 -> 3 (30) 1 -> 3 (25) where the number in parentheses is the weight of each edge. Then, I'd like to get the reduced edge graph after "Transitive Reduction" while considering the

Re: Standalone Spark program

2014-12-18 Thread Andrew Or
Hey Akshat, What is the class that is not found, is it a Spark class or classes that you define in your own application? If the latter, then Akhil's solution should work (alternatively you can also pass the jar through the --jars command line option in spark-submit). If it's a Spark class, howeve
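As a hedged example of the --jars route (the class, jar, and dependency names are placeholders, not from this thread):

    spark-submit --master spark://foo.example.com:7077 \
      --class com.example.Main \
      --jars /path/to/dependency.jar \
      my-app.jar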

UNION two RDDs

2014-12-18 Thread Jerry Lam
Hi Spark users, I wonder if val resultRDD = RDDA.union(RDDB) will always have records in RDDA before records in RDDB. Also, will resultRDD.coalesce(1) change this ordering? Best Regards, Jerry

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
Great, glad it worked out! Just keep an eye on memory usage as you roll it out. Like I said before, if you’ll be running this 24/7 consider cleaning up sessions by returning None after some sort of timeout. On 12/18/14, 8:25 PM, "Pierce Lamb" wrote: >This produces the expected output, thank

Re: Help with updateStateByKey

2014-12-18 Thread Pierce Lamb
This produces the expected output, thank you! On Thu, Dec 18, 2014 at 12:11 PM, Silvio Fiorito wrote: > Ok, I have a better idea of what you’re trying to do now. > > I think the prob might be the map. The first time the function runs, > currentValue will be None. Using map on None returns None. >

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
Ok, I have a better idea of what you’re trying to do now. I think the problem might be the map. The first time the function runs, currentValue will be None. Using map on None returns None. Instead, try: Some(currentValue.getOrElse(Seq.empty) ++ newValues) I think that should give you the expected
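A minimal sketch of an update function along those lines (not Pierce's actual code; the key/value types and the DStream name eventsByKey are assumed):

    def updateSession(newValues: Seq[(String, Long)],
                      currentValue: Option[Seq[(String, Long)]]): Option[Seq[(String, Long)]] = {
      // getOrElse avoids the None.map problem on the first batch for a key
      Some(currentValue.getOrElse(Seq.empty) ++ newValues)
    }
    val sessions = eventsByKey.updateStateByKey(updateSession _)  // eventsByKey: DStream[(String, (String, Long))], assumed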

does spark sql support columnar compression with encoding when caching tables

2014-12-18 Thread Sadhan Sood
Hi All, Wondering, when caching a table backed by lzo compressed parquet data, whether spark also compresses it (using lzo/gzip/snappy) along with column level encoding, or just does the column level encoding when "spark.sql.inMemoryColumnarStorage.compressed" is set to true. This is because when I

Re: When will Spark SQL support building DB index natively?

2014-12-18 Thread Michael Armbrust
It is implemented in the same way as Hive and interoperates with the hive metastore. In 1.2 we are considering adding partitioning to the SparkSQL data source API as well. However, for now, you should create a hive context and a partitioned table. Spark SQL will automatically select partitions
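A sketch of that setup, with made-up table and column names (this is not code from the thread):

    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    hiveContext.sql("CREATE TABLE IF NOT EXISTS events (id INT, payload STRING) PARTITIONED BY (dt STRING)")
    // queries that filter on the partition column only read the matching partitions
    hiveContext.sql("SELECT count(*) FROM events WHERE dt = '2014-12-18'").collect()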

RE: Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Thanks, I will try. From: DB Tsai [mailto:dbt...@dbtsai.com] Sent: Thursday, December 18, 2014 16:24 To: Franco Barrientos CC: Sean Owen; user@spark.apache.org Subject: Re: Effects problems in logistic regression Can you try LogisticRegressionWithLBFGS? I verified that this will be

Re: Help with updateStateByKey

2014-12-18 Thread Pierce Lamb
Hi Silvio, This is a great suggestion (I wanted to get rid of groupByKey). I have been trying to implement it this morning, but am having some trouble. I would love to see your code for the function that goes inside updateStateByKey. Here is my current code: def updateGroupByKey( newValues: Seq[(St

Re: Effects problems in logistic regression

2014-12-18 Thread DB Tsai
Can you try LogisticRegressionWithLBFGS? I verified that this will converge to the same result trained by R's glmnet package without regularization. The problem of LogisticRegressionWithSGD is that it's very slow in terms of convergence, and a lot of the time it's very sensitive to the stepsize, which can lead
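A minimal sketch of the LBFGS variant, assuming `training` is an RDD[LabeledPoint] already loaded elsewhere:

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
    // LBFGS converges without hand-tuning a step size, unlike SGD
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(training)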

RE: Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Yes, without the “amounts” variables the results are similar. When I put in other variables it's fine. From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, December 18, 2014 14:22 To: Franco Barrientos CC: user@spark.apache.org Subject: Re: Effects problems in logistic regression

Re: Standalone Spark program

2014-12-18 Thread Akhil Das
You can build a jar of your project and add it to the sparkContext (sc.addJar("/path/to/your/project.jar")); then it will get shipped to the workers and hence no ClassNotFoundException! Thanks Best Regards On Thu, Dec 18, 2014 at 10:06 PM, Akshat Aranya wrote: > > Hi, > > I am building a Spark-bas

Re: Effects problems in logistic regression

2014-12-18 Thread Sean Owen
Are you sure this is an apples-to-apples comparison? for example does your SAS process normalize or otherwise transform the data first? Is the optimization configured similarly in both cases -- same regularization, etc.? Are you sure you are pulling out the intercept correctly? It is a separate v

Re: Spark 1.2 Release Date

2014-12-18 Thread Al M
Awesome. Thanks!

Standalone Spark program

2014-12-18 Thread Akshat Aranya
Hi, I am building a Spark-based service which requires initialization of a SparkContext in a main(): def main(args: Array[String]) { val conf = new SparkConf(false) .setMaster("spark://foo.example.com:7077") .setAppName("foobar") val sc = new SparkContext(conf) val rdd =

Effects problems in logistic regression

2014-12-18 Thread Franco Barrientos
Hi all!, I have a problem with LogisticRegressionWithSGD: when I train a data set with one variable (which is an amount of an item) and an intercept, I get weights of (-0.4021,-207.1749) for both features, respectively. This doesn't make sense to me because I run a logistic regression for the same da

Re: Spark 1.2 Release Date

2014-12-18 Thread nitin
Soon enough :) http://apache-spark-developers-list.1001551.n3.nabble.com/RESULT-VOTE-Release-Apache-Spark-1-2-0-RC2-td9815.html

EC2 VPC script

2014-12-18 Thread Eduardo Cusa
Hi guys. I run the following command to launch a new cluster: ./spark-ec2 -k test -i test.pem -s 1 --vpc-id vpc-X --subnet-id subnet-X launch vpc_spark The instances started ok but the command never ends. With the following output: Setting up security groups... Searching for existing cl

Re: Spark 1.2 Release Date

2014-12-18 Thread Silvio Fiorito
It’s on Maven Central already http://search.maven.org/#browse%7C717101892 On 12/18/14, 2:09 PM, "Al M" wrote: >Is there a planned release date for Spark 1.2? I saw on the Spark Wiki > that >we >are already in the latter p

Spark 1.2 Release Date

2014-12-18 Thread Al M
Is there a planned release date for Spark 1.2? I saw on the Spark Wiki that we are already in the latter part of the release window.

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
Hi Pierce, You shouldn’t have to use groupByKey because updateStateByKey will get a Seq of all the values for that key already. I used that for realtime sessionization as well. What I did was key my incoming events, then send them to updateStateByKey. The updateStateByKey function then receive

Downloads from S3 exceedingly slow when running on spark-ec2

2014-12-18 Thread Jon Chase
I'm running a very simple Spark application that downloads files from S3, does a bit of mapping, then uploads new files. Each file is roughly 2MB and is gzip'd. I was running the same code on Amazon's EMR w/Spark and not having any download speed issues (Amazon's EMR provides a custom implementat

Re: pyspark 1.1.1 on windows saveAsTextFile - NullPointerException

2014-12-18 Thread Akhil Das
It seems you are missing HADOOP_HOME in the environment. As it says: java.io.IOException: Could not locate executable *null*\bin\winutils.exe in the Hadoop binaries. That null is supposed to be your HADOOP_HOME. Thanks Best Regards On Thu, Dec 18, 2014 at 7:10 PM, mj wrote: > > Hi, > > I'm try
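For example, on Windows (the path below is only a placeholder; winutils.exe must sit under %HADOOP_HOME%\bin):

    set HADOOP_HOME=C:\hadoop
    set PATH=%PATH%;%HADOOP_HOME%\bin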

pyspark 1.1.1 on windows saveAsTextFile - NullPointerException

2014-12-18 Thread mj
Hi, I'm trying to use pyspark to save a simple rdd to a text file (code below), but it keeps throwing an error. - Python Code - items=["Hello", "world"] items2 = sc.parallelize(items) items2.coalesce(1).saveAsTextFile('c:/tmp/python_out.csv') - Error --C:\Python27\py

Re: No disk single pass RDD aggregation

2014-12-18 Thread Jim Carroll
Hi, This was all my fault. It turned out I had a line of code buried in a library that did a "repartition." I used this library to wrap an RDD to present it to legacy code as a different interface. That's what was causing the data to spill to disk. The really stupid thing is it took me the better

Re: Spark SQL DSL for joins?

2014-12-18 Thread Jerry Raj
Thanks, that helped. And I needed SchemaRDD.as() to provide an alias for the RDD. -Jerry On 17/12/14 12:12 pm, Tobias Pfeiffer wrote: Jerry, On Wed, Dec 17, 2014 at 3:35 PM, Jerry Raj wrote: Another problem with the DSL: t1.where('term == "dmin").count()

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Recording the outcome here for the record. Based on Sean’s advice I’ve confirmed that making defensive copies of records that will be collected avoids this problem - it does seem like Avro is being a bit too aggressive when deciding it’s safe to reuse an object for a new record. On 18 December 201

Re: java.io.NotSerializableException: org.apache.avro.mapred.AvroKey using spark with avro

2014-12-18 Thread anish
Hi, I had the same problem. One option (starting with Spark 1.2, which is currently in preview) is to use the Avro library for Spark SQL. The other is using Kryo serialization. By default spark uses Java serialization; you can specify Kryo serialization while creating the spark context. val conf = new S
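A minimal sketch of the Kryo option, assuming the context is built in your own code (the app name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    val conf = new SparkConf()
      .setAppName("avro-job")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)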

RE: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sun, Rui
Owen, Since we have individual module jars published into the central maven repo for an official release, we need to make sure the official Spark assembly jar is assembled exactly from these jars, so there will be no binary compatibility issue. We can also publish the official assem

create table in yarn-cluster mode vs yarn-client mode

2014-12-18 Thread Chirag Aggarwal
Hi, I have a simple app where I am trying to create a table. I am able to create the table when running the app in yarn-client mode, but not in yarn-cluster mode. Is this a known issue? Has this already been fixed? Please note that I am using spark-1.1 over hadoop-2.4.0 App: - import org.ap

RE: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sun, Rui
Yes, https://issues.apache.org/jira/browse/SPARK-2075 is what I met. Thanks! I think we need to address this issue. At least we need to document this issue. -Original Message- From: Sean Owen [mailto:so...@cloudera.com] Sent: Thursday, December 18, 2014 5:47 PM To: Shixiong Zhu Cc: Sun,

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Sean Owen
Being mutable is fine; reusing and mutating the objects is the issue. And yes the objects you get back from Hadoop are reused by Hadoop InputFormats. You should just map the objects to a clone before using them where you need them to exist all independently at once, like before a collect(). (That
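A sketch of the defensive copy, assuming an RDD[Text] named rdd whose values the InputFormat reuses:

    // map each record to an independent value before collect()
    val safe = rdd.map(_.toString)   // or new Text(t) if you need to keep the Writable type
    val all  = safe.collect()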

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Suspected the same thing, but because the underlying data classes are deserialised by Avro I think they have to be mutable as you need to provide the no-args constructor with settable fields. Nothing is being cached in my code anywhere, and this can be reproduced using data directly out of the new

Re: Spark Streaming Python APIs?

2014-12-18 Thread Tathagata Das
A more updated version of the streaming programming guide is here http://people.apache.org/~tdas/spark-1.2-temp/streaming-programming-guide.html Please refer to this until we make the official release of Spark 1.2 TD On Tue, Dec 16, 2014 at 3:50 PM, smallmonkey...@hotmail.com wrote: > Hi zhu:

Re: Help with updateStateByKey

2014-12-18 Thread Tathagata Das
Another point to start playing with updateStateByKey is the example StatefulNetworkWordCount. See the streaming examples directory in the Spark repository. TD On Thu, Dec 18, 2014 at 6:07 AM, Pierce Lamb wrote: > I am trying to run stateful Spark Streaming computations over (fake) > apache web

Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread Juan Rodríguez Hortalá
Hi Andy, Thanks again for your thoughts on this, I haven't found much information about the internals of Spark, so I find very useful and interesting these kind of explanations about its low level mechanisms. It's also nice to know that the two pass approach is a viable solution. Regards, Juan

Re: Incorrect results when calling collect() ?

2014-12-18 Thread Sean Owen
It sounds a lot like your values are mutable classes and you are mutating or reusing them somewhere? It might work until you actually try to materialize them all and find many point to the same object. On Thu, Dec 18, 2014 at 10:06 AM, Tristan Blakers wrote: > Hi, > > I’m getting some seemingly i

Re: Implementing a spark version of Haskell's partition

2014-12-18 Thread andy petrella
NP man, The thing is that since you're in a dist env, it'd be cumbersome to do that. Remember that Spark works basically on block/partition, they are the unit of distribution and parallelization. That means that actions have to be run against it **after having been scheduled on the cluster**. The

Can we specify driver running on a specific machine of the cluster on yarn-cluster mode?

2014-12-18 Thread LinQili
Hi all, On yarn-cluster mode, can we let the driver run on a specific machine that we choose in the cluster? Or even a machine not in the cluster?

Incorrect results when calling collect() ?

2014-12-18 Thread Tristan Blakers
Hi, I’m getting some seemingly invalid results when I collect an RDD. This is happening in both Spark 1.1.0 and 1.2.0, using Java8 on Mac. See the following code snippet: JavaRDD rdd= pairRDD.values(); rdd.foreach( e -> System.out.println ( "RDD Foreach: " + e ) ); rdd.collect().forEach( e -> Sy

Re: SPARK-2243 Support multiple SparkContexts in the same JVM

2014-12-18 Thread Sean Owen
Yes, although once you have multiple ClassLoaders, you are operating as if in multiple JVMs for most intents and purposes. I think the request for this kind of functionality comes from use cases where multiple ClassLoaders wouldn't work, like, wanting to have one app (in one ClassLoader) managing m

Re: Providing query dsl to Elasticsearch for Spark (2.1.0.Beta3)

2014-12-18 Thread Ian Wilkinson
Quick follow-up: this works sweetly with spark-1.1.1-bin-hadoop2.4. > On Dec 3, 2014, at 3:31 PM, Ian Wilkinson wrote: > > Hi, > > I'm trying the Elasticsearch support for Spark (2.1.0.Beta3). > > In the following I provide the query (as query dsl): > > > import org.elasticsearch.spark._ >

Re: java.io.NotSerializableException: org.apache.avro.mapred.AvroKey using spark with avro

2014-12-18 Thread M. Dale
I did not encounter this with my Avro records using Spark 1.1.0 (see https://github.com/medale/spark-mail/blob/master/analytics/src/main/scala/com/uebercomputing/analytics/basic/UniqueSenderCounter.scala). I do use the default Java serialization but all the fields in my Avro object are Seriali

Re: Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi again, On Thu, Dec 18, 2014 at 6:43 PM, Tobias Pfeiffer wrote: > > tmpRdd.foreachPartition(iter => { > iter.map(item => { > println("xyz: " + item) > }) > }) > Uh, with iter.foreach(...) it works... the reason being apparently that iter.map() re
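A sketch of the difference (using the same tmpRdd as above): Iterator.map is lazy, so the println never runs unless the mapped iterator is consumed, while iter.foreach is eager:

    tmpRdd.foreachPartition { iter =>
      iter.foreach(item => println("xyz: " + item))   // eager: prints each item
      // iter.map(item => println("xyz: " + item))    // lazy: nothing is evaluated
    }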

Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sean Owen
Have a look at https://issues.apache.org/jira/browse/SPARK-2075 It's not quite that the API is different, but indeed building different 'flavors' of the same version (hadoop1 vs 2) can strangely lead to this problem, even though the public API is identical and in theory the API is completely separ

Re: Can Spark 1.0.2 run on CDH-4.3.0 with yarn? And Will Spark 1.2.0 support CDH5.1.2 with yarn?

2014-12-18 Thread Zhihang Fan
Hi, Sean Thank you for your reply. I will try to use Spark 1.1 and 1.2 on CDH5.X. :) 2014-12-18 17:38 GMT+08:00 Sean Owen : > > The question is really: will Spark 1.1 work with a particular version > of YARN? many, but not all versions of YARN are supported. The > "stable" versions are (2.2.x

Semantics of foreachPartition()

2014-12-18 Thread Tobias Pfeiffer
Hi, I have the following code in my application: tmpRdd.foreach(item => { println("abc: " + item) }) tmpRdd.foreachPartition(iter => { iter.map(item => { println("xyz: " + item) }) }) In the output, I see only the "abc" pr

Re: Can Spark 1.0.2 run on CDH-4.3.0 with yarn? And Will Spark 1.2.0 support CDH5.1.2 with yarn?

2014-12-18 Thread Sean Owen
The question is really: will Spark 1.1 work with a particular version of YARN? many, but not all versions of YARN are supported. The "stable" versions are (2.2.x+). Before that, support is patchier, and in fact has been removed in Spark 1.3. The "yarn" profile supports "YARN stable" which is about

Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Shixiong Zhu
@Rui do you mean the spark-core jar in the maven central repo is incompatible with the same version of the official pre-built Spark binary? That's really weird. I thought they should have used the same code. Best Regards, Shixiong Zhu 2014-12-18 17:22 GMT+08:00 Sean Owen : > > Well, it's al

Re: weird bytecode incompatability issue between spark-core jar from mvn repo and official spark prebuilt binary

2014-12-18 Thread Sean Owen
Well, it's always a good idea to use matched binary versions. Here it is more acutely necessary. You can use a pre-built binary -- if you use it to compile and also run. Why does it not make sense to publish artifacts? Not sure what you mean about core vs assembly, as the assembly contains all of

Can Spark 1.0.2 run on CDH-4.3.0 with yarn? And Will Spark 1.2.0 support CDH5.1.2 with yarn?

2014-12-18 Thread Canoe
I was not able to compile spark 1.1.0 source code on CDH4.3.0 with yarn successfully. Does it support CDH4.3.0 with yarn? And will spark 1.2.0 support CDH5.1.2?

Re: Unable to start Spark 1.3 after building:java.lang. NoClassDefFoundError: org/codehaus/jackson/map/deser/std/StdDeserializer

2014-12-18 Thread Sean Owen
Adding a hadoop-2.6 profile is not necessary. Use hadoop-2.4, which already exists and is intended for 2.4+. In fact this declaration is missing things that Hadoop 2 needs. On Thu, Dec 18, 2014 at 3:46 AM, Kyle Lin wrote: > Hi there > > The following is my steps. And got the same exception with D