Spark SQL

2014-09-13 Thread rkishore999
val file = sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt") 1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "' and fxCurCode = '" + fxCurCodesMap(line.substring(
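
Calling sqlContext.sql(...) inside map() runs on the executors, where the SQLContext is not usable; a common alternative is to collect the (small) CurrencyCodeRates table once on the driver and broadcast it. A minimal sketch, assuming Spark 1.1's SchemaRDD API; the column names follow the query above, while the fxCurCodesMap lookup and the second substring offsets are hypothetical stand-ins for the truncated code:

    // Collect the rates table once and broadcast it to the executors.
    val rates = sqlContext.sql("select txCurCode, fxCurCode, rate from CurrencyCodeRates")
      .map(r => ((r.getString(0), r.getString(1)), r.getDouble(2)))
      .collectAsMap()
    val ratesBc = sc.broadcast(rates)

    val xyz = file.map { line =>
      val txCur = line.substring(202, 205)
      val fxCur = fxCurCodesMap(line.substring(0, 3))   // hypothetical offsets
      ratesBc.value.get((txCur, fxCur))                 // Option[Double] per input line
    }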

Re: spark 1.1.0 unit tests fail

2014-09-13 Thread Andrew Or
Hi Koert, Thanks for reporting this. These tests have been flaky even on the master branch for a long time. You can safely disregard these test failures, as the root cause is port collisions from the many SparkContexts we create over the course of the entire test. There is a patch that fixes this

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
Thanks for the pointers. I meant previous run of spark-submit. For 1: This would be a bit more computation in every batch. 2: It's a good idea, but it may be inefficient to retrieve each value. In general, for a generic state machine the initialization and input sequence is critical for correctne

Re: compiling spark source code

2014-09-13 Thread Yin Huai
Can you try "sbt/sbt clean" first? On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu wrote: > bq. [error] File name too long > > It is not clear which file(s) loadfiles was loading. > Is the filename in earlier part of the output ? > > Cheers > > On Sat, Sep 13, 2014 at 10:58 AM, kkptninja wrote: > >> Hi

Workload for spark testing

2014-09-13 Thread 牛兆捷
Hi All: We know that some memory in Spark is used for computing (e.g., spark.shuffle.memoryFraction) and some is used for caching RDDs for future use (e.g., spark.storage.memoryFraction). Is there any existing workload which can utilize both of them during the running life cycle? I want to do some pe
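
A minimal sketch of a job that touches both pools, assuming Spark 1.1's memory settings; the input path and fraction values are illustrative: caching an RDD exercises spark.storage.memoryFraction, and the reduceByKey shuffle exercises spark.shuffle.memoryFraction.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mixed-memory-workload")
      .set("spark.storage.memoryFraction", "0.5")   // fraction reserved for cached RDDs
      .set("spark.shuffle.memoryFraction", "0.3")   // fraction for shuffle-side aggregation
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///tmp/input")      // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
    data.cache()                                     // exercises storage memory
    val counts = data.reduceByKey(_ + _)             // exercises shuffle memory
    counts.count()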

spark 1.1.0 unit tests fail

2014-09-13 Thread Koert Kuipers
On Ubuntu 12.04 with 2 cores and 8G of RAM I see errors when I run the tests for Spark 1.1.0. Not sure how significant this is, since I used to see errors for Spark 1.0.0 too $ java -version java version "1.6.0_43" Java(TM) SE Runtime Environment (build 1.6.0_43-b01) Java HotSpot(TM) 64-Bit Se

Re: compiling spark source code

2014-09-13 Thread Ted Yu
bq. [error] File name too long It is not clear which file(s) loadfiles was loading. Is the filename in earlier part of the output ? Cheers On Sat, Sep 13, 2014 at 10:58 AM, kkptninja wrote: > Hi Ted, > > Thanks for the prompt reply :) > > please find details of the issue at this url http://pa

Re: How to initialize StateDStream

2014-09-13 Thread qihong
I'm not sure what you mean by "previous run". Is it previous batch? or previous run of spark-submit? If it's "previous batch" (spark streaming creates a batch every batch interval), then there's nothing to do. If it's previous run of spark-submit (assuming you are able to save the result somewher

Re: compiling spark source code

2014-09-13 Thread kkptninja
Hi Ted, Thanks for the prompt reply :) please find details of the issue at this url http://pastebin.com/Xt0hZ38q Kind Regards -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/compiling-spark-source-code-tp13980p14175.html

Write 1 RDD to multiple output paths in one go

2014-09-13 Thread Nick Chammas
Howdy doody Spark Users, I’d like to somehow write out a single RDD to multiple paths in one go. Here’s an example. I have an RDD of (key, value) pairs like this: >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) >>> a.collect() [('N', 'Nick'), ('N', 'N
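
One way to get a single pass over the RDD is Hadoop's MultipleTextOutputFormat, which routes each record to a file named after its key. A hedged sketch in Scala (the original example is PySpark); the output path is hypothetical and the key function mirrors the first-letter keyBy above:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each record to a subdirectory named after its key, and drop the key
    // from the written value.
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    val a = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie")).keyBy(_.substring(0, 1))
    a.saveAsHadoopFile("/tmp/multi-output",           // hypothetical base path
      classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])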

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
OK, mapPartitions seems to be the way to go. Thanks for the help! On Sep 13, 2014, at 16:41, "Sean Owen" wrote: > This is more concise: > > x.groupBy(obj.fieldtobekey).values.map(_.head) > > ... but I doubt it's faster. > > If all objects with the same fieldtobekey are within the same > partition

Re: Nested Case Classes (Found and Required Same)

2014-09-13 Thread Ramaraju Indukuri
Upgraded to 1.1 and the issue is resolved. Thanks. I still wonder if there is a better way to approach a large attribute dataset. On Fri, Sep 12, 2014 at 12:20 PM, Prashant Sharma wrote: > What is your spark version ? This was fixed I suppose. Can you try it > with latest release ? > > Prashan

Re: compiling spark source code

2014-09-13 Thread Ted Yu
bq. [error] (repl/compile:compile) Compilation failed Can you pastebin more of the output ? Cheers

Re: RDDs and Immutability

2014-09-13 Thread Nicholas Chammas
Have you tried using RDD.map() to transform some of the RDD elements from 0 to 1? Why doesn’t that work? That’s how you change data in Spark, by defining a new RDD that’s a transformation of an old one. On Sat, Sep 13, 2014 at 5:39 AM, Deep Pradhan wrote: > Hi, > We all know that RDDs are immu
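
A minimal sketch of that map() approach: derive a new RDD with some elements flipped from 0 to 1 instead of mutating in place. Which indices get flipped is a hypothetical predicate.

    val zeros = sc.parallelize(Array.fill[Byte](10)(0)).zipWithIndex()
    val updated = zeros.map { case (b, idx) =>
      if (idx % 3 == 0) 1.toByte else b      // "change some elements to 1" (illustrative rule)
    }
    updated.collect()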

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
This is more concise: x.groupBy(obj.fieldtobekey).values.map(_.head) ... but I doubt it's faster. If all objects with the same fieldtobekey are within the same partition, then yes I imagine your biggest speedup comes from exploiting that. How about ... x.mapPartitions(_.map(obj => (obj.fieldtob
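
A hedged completion of the truncated mapPartitions idea, assuming all objects sharing fieldtobekey live in the same partition (otherwise a shuffle is still needed); MyAwesomeObject and fieldtobekey are the thread's placeholders and the key is assumed to be a String:

    val deduped = x.mapPartitions { iter =>
      // Keep the first object seen for each key within this partition.
      val firstSeen = scala.collection.mutable.LinkedHashMap[String, MyAwesomeObject]()
      iter.foreach(obj => firstSeen.getOrElseUpdate(obj.fieldtobekey, obj))
      firstSeen.valuesIterator
    }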

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
I had looked at that. If I have a set of saved word counts from previous run, and want to load that in the next run, what is the best way to do it? I am thinking of hacking the Spark code and have an initial rdd in StateDStream, and use that in for the first time. On Fri, Sep 12, 2014 at 11:04 PM
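
In case it helps avoid patching StateDStream: a hedged sketch that seeds state from the previous run using the Iterator-based overload of updateStateByKey (which exposes the key) plus a broadcast of the saved counts. The save path, line format, and the words DStream are assumptions.

    import org.apache.spark.HashPartitioner

    // Load the previous run's word counts (saved as "word count" text lines) and broadcast them.
    val previous = ssc.sparkContext.textFile("hdfs:///tmp/previous-word-counts")  // hypothetical path
      .map { line => val Array(w, c) = line.split("\\s+"); (w, c.toLong) }
      .collectAsMap()
    val previousBc = ssc.sparkContext.broadcast(previous)

    // The Iterator-based overload exposes the key, so the saved count can seed the
    // state the first time a word is seen in this run.
    val updateFunc = (it: Iterator[(String, Seq[Long], Option[Long])]) =>
      it.map { case (word, newCounts, state) =>
        val start = state.getOrElse(previousBc.value.getOrElse(word, 0L))
        (word, start + newCounts.sum)
      }
    val counts = words.map((_, 1L)).updateStateByKey[Long](
      updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)  // true = remember partitioner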

Re: ReduceByKey performance optimisation

2014-09-13 Thread Gary Malouf
You need something like: val x: RDD[MyAwesomeObject] x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l } Does that make sense? On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme wrote: > I need to remove objects with duplicate key, but I need the whole object. > Object which ha

Re: compiling spark source code

2014-09-13 Thread kkptninja
Hi, I too am having a problem with compiling Spark from source. However, my problem is different. I downloaded the latest version (1.1.0) and ran ./sbt/sbt assembly from the command line. I end up with the following error [info] SHA-1: 20abd673d1e0690a6d5b64951868eef8d332d084 [info] Packaging /home/kk

Re: spark 1.1 failure. class conflict?

2014-09-13 Thread Sean Owen
No, your error is right there in the logs. Unset SPARK_CLASSPATH. On Fri, Sep 12, 2014 at 10:20 PM, freedafeng wrote: > : org.apache.spark.SparkException: Found both spark.driver.extraClassPath > and SPARK_CLASSPATH. Use only the former. --

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-13 Thread Sean Owen
Your app is the running Spark Streaming system. It would be up to you to build some mechanism that lets you cause it to call stop() in response to some signal from you. On Fri, Sep 12, 2014 at 3:59 PM, stanley wrote: > In spark streaming programming document >
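
One hedged sketch of such a mechanism, assuming a marker file on HDFS as the external signal; the path and polling interval are hypothetical:

    import org.apache.hadoop.fs.{FileSystem, Path}

    ssc.start()
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val stopMarker = new Path("/tmp/stop-streaming-app")   // hypothetical signal file
    while (!fs.exists(stopMarker)) {
      Thread.sleep(10000)                                  // poll every 10s for the stop signal
    }
    ssc.stop(true, true)                                   // stop the SparkContext too, shut down gracefully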

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
I need to remove objects with duplicate keys, but I need the whole object. Objects which have the same key are not necessarily equal, though (but I can dump any of the ones that have an identical key). 2014-09-13 12:50 GMT+02:00 Sean Owen : > If you are just looking for distinct keys, .keys.distinct()

Re: JMXSink for YARN deployment

2014-09-13 Thread Otis Gospodnetic
Hi, Jerry said "I'm guessing", so maybe the thing to try is to check if his guess is correct. What about running sudo lsof | grep metrics.properties ? I imagine you should be able to see it if the file was found and read. If Jerry is right, then I think you will NOT see it. Next, how about try

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
If you are just looking for distinct keys, .keys.distinct() should be much better. On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme wrote: > Hello, > > I am facing performance issues with reduceByKey. In know that this topic has > already been covered but I did not really find answers to my questio

Re: Serving data

2014-09-13 Thread andy petrella
However, the cache is not guaranteed to remain: if other jobs are launched in the cluster and require more memory than what's left in the overall caching memory, previous RDDs will be discarded. Using an off-heap cache like Tachyon as a dump repo can help. In general, I'd say that using a persist
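
For reference, a minimal sketch of the persist idea with Spark 1.x storage levels; the results RDD is a placeholder: MEMORY_AND_DISK keeps data retrievable even when cached blocks are evicted, and OFF_HEAP (Tachyon-backed in Spark 1.x) moves the cache outside the executor heaps.

    import org.apache.spark.storage.StorageLevel

    results.persist(StorageLevel.MEMORY_AND_DISK)
    // or, with a Tachyon deployment configured:
    // results.persist(StorageLevel.OFF_HEAP)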

ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
Hello, I am facing performance issues with reduceByKey. I know that this topic has already been covered but I did not really find answers to my question. I am using reduceByKey to remove entries with identical keys, using, as reduce function, (a,b) => a. It seems to be a relatively straightforwa

RDDs and Immutability

2014-09-13 Thread Deep Pradhan
Hi, We all know that RDDs are immutable. There are not enough operations that can achieve anything and everything on RDDs. Take for example this: I want an Array of Bytes filled with zeros which during the program should change. Some elements of that Array should change to 1. If I make an RDD with

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread DB Tsai
Hi Yanbo, We made the change here https://github.com/apache/spark/commit/5d25c0b74f6397d78164b96afb8b8cbb1b15cfbd Those apis to set the parameters are very difficult to maintain, so we decide not to provide them. In next release, Spark 1.2, we will have a better api design for parameter setting.
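
A hedged sketch of the setter style this points to in Spark 1.1: configure the underlying LBFGS optimizer directly instead of passing hyperparameters to a static train() helper. trainingData is an assumed RDD[LabeledPoint]; the values are illustrative.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val lr = new LogisticRegressionWithLBFGS()
    lr.optimizer
      .setConvergenceTol(1e-4)    // illustrative values
      .setNumCorrections(10)
      .setRegParam(0.01)
    val model = lr.run(trainingData)   // trainingData: RDD[LabeledPoint] (assumed)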

[mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
Hi All, I found that LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD in master and 1.1 release. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L199 In the above code snippet, u

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
I also found that https://github.com/apache/spark/commit/8f6e2e9df41e7de22b1d1cbd524e20881f861dd0 had resolved this issue, but it seems that the right code snippet does not occur in master or the 1.1 release. 2014-09-13 17:12 GMT+08:00 Yanbo Liang : > Hi All, > > I found that LogisticRegressionWithLBFGS interface i

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Sorry, posting too late at night. That should be "...transformations, that produce further RDDs; and actions, that return values to the driver program." On Sat, Sep 13, 2014 at 12:45 AM, Mark Hamstra wrote: > Again, RDD operations are of two basic varieties: transformations, that > produce furt

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Again, RDD operations are of two basic varieties: transformations, that produce further RDDs; and operations, that return values to the driver program. You've used several RDD transformations and then finally the top(1) action, which returns an array of one element to your driver program. That is

Re: sc.textFile problem due to newlines within a CSV record

2014-09-13 Thread Mohit Jaggi
Thanks Xiangrui. This file already exists w/o escapes. I could probably try to preprocess it and add the escaping. On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng wrote: > I wrote an input format for Redshift's tables unloaded with UNLOAD and the > ESCAPE option: https://github.com/mengxr/redshift-input-f

Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Take for example this:

    val lines = sc.textFile(args(0))
    val nodes = lines.map(s => {
      val fields = s.split("\\s+")
      (fields(0), fields(1))
    }).distinct().groupByKey().cache()
    val nodeSizeTuple = nodes.map(node => (node._1.toInt, node._2.size))
    val rootNode = nodeSizeTuple.

Re: Serving data

2014-09-13 Thread Mayur Rustagi
You can cache data in memory & query it using Spark Job Server. Most folks dump data down to a queue/db for retrieval. You can batch up data & store it into parquet partitions as well, & query it using another Spark SQL shell; the JDBC driver in Spark SQL is part of 1.1, I believe. -- Regards, Mayur Rust

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
This is all covered in http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations By definition, RDD transformations take an RDD to another RDD; actions produce some other type as a value on the driver program. On Fri, Sep 12, 2014 at 11:15 PM, Deep Pradhan wrote: > Is it always
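
A tiny illustration of that distinction, with illustrative values:

    val nums = sc.parallelize(1 to 10)
    val squares = nums.map(n => n * n)   // transformation: RDD => RDD, evaluated lazily
    val largest = squares.top(1)         // action: Array(100) returned to the driver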

Re: How to save mllib model to hdfs and reload it

2014-09-13 Thread Yanbo Liang
Shixiong, These two snippets behave differently in Scala. In the second snippet, you define a variable named m and the right-hand side is evaluated as part of the definition. In other words, the variable is replaced by the pre-computed value of Array(1.0) in the subsequent code. So in the second s