Spark SQL

2014-09-13 Thread rkishore999
val file = sc.textFile("hdfs://ec2-54-164-243-97.compute-1.amazonaws.com:9010/user/fin/events.txt") 1. val xyz = file.map(line => extractCurRate(sqlContext.sql("select rate from CurrencyCodeRates where txCurCode = '" + line.substring(202,205) + "' and fxCurCode = '" + fxCurCodesMap(line.substring(
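
Calling sqlContext.sql(...) inside map() runs on the executors, where the SQLContext is not usable; a common alternative is to collect the (small) CurrencyCodeRates table once on the driver and broadcast it. A minimal sketch, assuming Spark 1.1's SchemaRDD API; the column names follow the query above, while the fxCurCodesMap lookup and the second substring offsets are hypothetical stand-ins for the truncated code:

    // Collect the rates table once and broadcast it to the executors.
    val rates = sqlContext.sql("select txCurCode, fxCurCode, rate from CurrencyCodeRates")
      .map(r => ((r.getString(0), r.getString(1)), r.getDouble(2)))
      .collectAsMap()
    val ratesBc = sc.broadcast(rates)

    val xyz = file.map { line =>
      val txCur = line.substring(202, 205)
      val fxCur = fxCurCodesMap(line.substring(0, 3))   // hypothetical offsets
      ratesBc.value.get((txCur, fxCur))                 // Option[Double] per input line
    }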

Re: spark 1.1.0 unit tests fail

2014-09-13 Thread Andrew Or
Hi Koert, Thanks for reporting this. These tests have been flaky even on the master branch for a long time. You can safely disregard these test failures, as the root cause is port collisions from the many SparkContexts we create over the course of the entire test. There is a patch that fixes this

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
Thanks for the pointers. I meant previous run of spark-submit. For 1: This would be a bit more computation in every batch. 2: It's a good idea, but it may be inefficient to retrieve each value. In general, for a generic state machine the initialization and input sequence is critical for correctne

Re: compiling spark source code

2014-09-13 Thread Yin Huai
Can you try "sbt/sbt clean" first? On Sat, Sep 13, 2014 at 4:29 PM, Ted Yu wrote: > bq. [error] File name too long > > It is not clear which file(s) loadfiles was loading. > Is the filename in earlier part of the output ? > > Cheers > > On Sat, Sep 13, 2014 at 10:58 AM, kkptninja wrote: > >> Hi

Workload for spark testing

2014-09-13 Thread 牛兆捷
Hi All: We know that some memory in Spark is used for computing (e.g., spark.shuffle.memoryFraction) and some is used for caching RDDs for future use (e.g., spark.storage.memoryFraction). Is there any existing workload which can utilize both of them during the running life cycle? I want to do some pe
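
A minimal sketch of a job that touches both pools, assuming Spark 1.1's memory settings; the input path and fraction values are illustrative: caching an RDD exercises spark.storage.memoryFraction, and the reduceByKey shuffle exercises spark.shuffle.memoryFraction.

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("mixed-memory-workload")
      .set("spark.storage.memoryFraction", "0.5")   // fraction reserved for cached RDDs
      .set("spark.shuffle.memoryFraction", "0.3")   // fraction for shuffle-side aggregation
    val sc = new SparkContext(conf)

    val data = sc.textFile("hdfs:///tmp/input")      // hypothetical input path
      .flatMap(_.split("\\s+"))
      .map(w => (w, 1))
    data.cache()                                     // exercises storage memory
    val counts = data.reduceByKey(_ + _)             // exercises shuffle memory
    counts.count()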

spark 1.1.0 unit tests fail

2014-09-13 Thread Koert Kuipers
On Ubuntu 12.04 with 2 cores and 8G of RAM I see errors when I run the tests for Spark 1.1.0. Not sure how significant this is, since I used to see errors for Spark 1.0.0 too $ java -version java version "1.6.0_43" Java(TM) SE Runtime Environment (build 1.6.0_43-b01) Java HotSpot(TM) 64-Bit Se

Re: compiling spark source code

2014-09-13 Thread Ted Yu
bq. [error] File name too long It is not clear which file(s) loadfiles was loading. Is the filename in earlier part of the output ? Cheers On Sat, Sep 13, 2014 at 10:58 AM, kkptninja wrote: > Hi Ted, > > Thanks for the prompt reply :) > > please find details of the issue at this url http://pa

Re: How to initialize StateDStream

2014-09-13 Thread qihong
I'm not sure what you mean by "previous run". Is it previous batch? or previous run of spark-submit? If it's "previous batch" (spark streaming creates a batch every batch interval), then there's nothing to do. If it's previous run of spark-submit (assuming you are able to save the result somewher

Re: compiling spark source code

2014-09-13 Thread kkptninja
Hi Ted, Thanks for the prompt reply :) please find details of the issue at this url http://pastebin.com/Xt0hZ38q Kind Regards -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/compiling-spark-source-code-tp13980p14175.html

Write 1 RDD to multiple output paths in one go

2014-09-13 Thread Nick Chammas
Howdy doody Spark Users, I’d like to somehow write out a single RDD to multiple paths in one go. Here’s an example. I have an RDD of (key, value) pairs like this: >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0]) >>> a.collect() [('N', 'Nick'), ('N', 'N
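
One way to get a single pass over the RDD is Hadoop's MultipleTextOutputFormat, which routes each record to a file named after its key. A hedged sketch in Scala (the original example is PySpark); the output path is hypothetical and the key function mirrors the first-letter keyBy above:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each record to a subdirectory named after its key, and drop the key
    // from the written value.
    class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString + "/" + name
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    val a = sc.parallelize(Seq("Nick", "Nancy", "Bob", "Ben", "Frankie")).keyBy(_.substring(0, 1))
    a.saveAsHadoopFile("/tmp/multi-output",           // hypothetical base path
      classOf[String], classOf[String], classOf[RDDMultipleTextOutputFormat])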

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
OK, mapPartitions seems to be the way to go. Thanks for the help! On Sep 13, 2014, at 16:41, "Sean Owen" wrote: > This is more concise: > > x.groupBy(obj.fieldtobekey).values.map(_.head) > > ... but I doubt it's faster. > > If all objects with the same fieldtobekey are within the same > partition

Re: Nested Case Classes (Found and Required Same)

2014-09-13 Thread Ramaraju Indukuri
Upgraded to 1.1 and the issue is resolved. Thanks. I still wonder if there is a better way to approach a large attribute dataset. On Fri, Sep 12, 2014 at 12:20 PM, Prashant Sharma wrote: > What is your spark version ? This was fixed I suppose. Can you try it > with latest release ? > > Prashan

Re: compiling spark source code

2014-09-13 Thread Ted Yu
bq. [error] (repl/compile:compile) Compilation failed Can you pastebin more of the output ? Cheers

Re: RDDs and Immutability

2014-09-13 Thread Nicholas Chammas
Have you tried using RDD.map() to transform some of the RDD elements from 0 to 1? Why doesn’t that work? That’s how you change data in Spark, by defining a new RDD that’s a transformation of an old one. On Sat, Sep 13, 2014 at 5:39 AM, Deep Pradhan wrote: > Hi, > We all know that RDDs are immu
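
A minimal sketch of that map() approach: derive a new RDD with some elements flipped from 0 to 1 instead of mutating in place. Which indices get flipped is a hypothetical predicate.

    val zeros = sc.parallelize(Array.fill[Byte](10)(0)).zipWithIndex()
    val updated = zeros.map { case (b, idx) =>
      if (idx % 3 == 0) 1.toByte else b      // "change some elements to 1" (illustrative rule)
    }
    updated.collect()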

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
This is more concise: x.groupBy(obj.fieldtobekey).values.map(_.head) ... but I doubt it's faster. If all objects with the same fieldtobekey are within the same partition, then yes I imagine your biggest speedup comes from exploiting that. How about ... x.mapPartitions(_.map(obj => (obj.fieldtob
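
A hedged completion of the truncated mapPartitions idea, assuming all objects sharing fieldtobekey live in the same partition (otherwise a shuffle is still needed); MyAwesomeObject and fieldtobekey are the thread's placeholders and the key is assumed to be a String:

    val deduped = x.mapPartitions { iter =>
      // Keep the first object seen for each key within this partition.
      val firstSeen = scala.collection.mutable.LinkedHashMap[String, MyAwesomeObject]()
      iter.foreach(obj => firstSeen.getOrElseUpdate(obj.fieldtobekey, obj))
      firstSeen.valuesIterator
    }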

Re: How to initialize StateDStream

2014-09-13 Thread Soumitra Kumar
I had looked at that. If I have a set of saved word counts from previous run, and want to load that in the next run, what is the best way to do it? I am thinking of hacking the Spark code and have an initial rdd in StateDStream, and use that in for the first time. On Fri, Sep 12, 2014 at 11:04 PM
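
In case it helps avoid patching StateDStream: a hedged sketch that seeds state from the previous run using the Iterator-based overload of updateStateByKey (which exposes the key) plus a broadcast of the saved counts. The save path, line format, and the words DStream are assumptions.

    import org.apache.spark.HashPartitioner

    // Load the previous run's word counts (saved as "word count" text lines) and broadcast them.
    val previous = ssc.sparkContext.textFile("hdfs:///tmp/previous-word-counts")  // hypothetical path
      .map { line => val Array(w, c) = line.split("\\s+"); (w, c.toLong) }
      .collectAsMap()
    val previousBc = ssc.sparkContext.broadcast(previous)

    // The Iterator-based overload exposes the key, so the saved count can seed the
    // state the first time a word is seen in this run.
    val updateFunc = (it: Iterator[(String, Seq[Long], Option[Long])]) =>
      it.map { case (word, newCounts, state) =>
        val start = state.getOrElse(previousBc.value.getOrElse(word, 0L))
        (word, start + newCounts.sum)
      }
    val counts = words.map((_, 1L)).updateStateByKey[Long](
      updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), true)  // true = remember partitioner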

Re: ReduceByKey performance optimisation

2014-09-13 Thread Gary Malouf
You need something like: val x: RDD[MyAwesomeObject] x.map(obj => obj.fieldtobekey -> obj).reduceByKey { case (l, _) => l } Does that make sense? On Sat, Sep 13, 2014 at 7:28 AM, Julien Carme wrote: > I need to remove objects with duplicate key, but I need the whole object. > Object which ha

Re: compiling spark source code

2014-09-13 Thread kkptninja
Hi, I too am having a problem with compiling Spark from source. However, my problem is different. I downloaded the latest version (1.1.0) and ran ./sbt/sbt assembly from the command line. I end up with the following error [info] SHA-1: 20abd673d1e0690a6d5b64951868eef8d332d084 [info] Packaging /home/kk

Re: spark 1.1 failure. class conflict?

2014-09-13 Thread Sean Owen
No, your error is right there in the logs. Unset SPARK_CLASSPATH. On Fri, Sep 12, 2014 at 10:20 PM, freedafeng wrote: > : org.apache.spark.SparkException: Found both spark.driver.extraClassPath > and SPARK_CLASSPATH. Use only the former. --

Re: How to initiate a shutdown of Spark Streaming context?

2014-09-13 Thread Sean Owen
Your app is the running Spark Streaming system. It would be up to you to build some mechanism that lets you cause it to call stop() in response to some signal from you. On Fri, Sep 12, 2014 at 3:59 PM, stanley wrote: > In spark streaming programming document >
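
One hedged sketch of such a mechanism, assuming a marker file on HDFS as the external signal; the path and polling interval are hypothetical:

    import org.apache.hadoop.fs.{FileSystem, Path}

    ssc.start()
    val fs = FileSystem.get(ssc.sparkContext.hadoopConfiguration)
    val stopMarker = new Path("/tmp/stop-streaming-app")   // hypothetical signal file
    while (!fs.exists(stopMarker)) {
      Thread.sleep(10000)                                  // poll every 10s for the stop signal
    }
    ssc.stop(true, true)                                   // stop the SparkContext too, shut down gracefully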

Re: ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
I need to remove objects with duplicate keys, but I need the whole object. Objects which have the same key are not necessarily equal, though (but I can dump any of the ones that have an identical key). 2014-09-13 12:50 GMT+02:00 Sean Owen : > If you are just looking for distinct keys, .keys.distinct()

Re: JMXSink for YARN deployment

2014-09-13 Thread Otis Gospodnetic
Hi, Jerry said "I'm guessing", so maybe the thing to try is to check if his guess is correct. What about running sudo lsof | grep metrics.properties ? I imagine you should be able to see it if the file was found and read. If Jerry is right, then I think you will NOT see it. Next, how about try

Re: ReduceByKey performance optimisation

2014-09-13 Thread Sean Owen
If you are just looking for distinct keys, .keys.distinct() should be much better. On Sat, Sep 13, 2014 at 10:46 AM, Julien Carme wrote: > Hello, > > I am facing performance issues with reduceByKey. In know that this topic has > already been covered but I did not really find answers to my questio

Re: Serving data

2014-09-13 Thread andy petrella
However, the cache is not guaranteed to remain: if other jobs are launched in the cluster and require more memory than what's left in the overall caching memory, previous RDDs will be discarded. Using an off-heap cache like Tachyon as a dump repo can help. In general, I'd say that using a persist
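
For reference, a minimal sketch of the persist idea with Spark 1.x storage levels; the results RDD is a placeholder: MEMORY_AND_DISK keeps data retrievable even when cached blocks are evicted, and OFF_HEAP (Tachyon-backed in Spark 1.x) moves the cache outside the executor heaps.

    import org.apache.spark.storage.StorageLevel

    results.persist(StorageLevel.MEMORY_AND_DISK)
    // or, with a Tachyon deployment configured:
    // results.persist(StorageLevel.OFF_HEAP)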

ReduceByKey performance optimisation

2014-09-13 Thread Julien Carme
Hello, I am facing performance issues with reduceByKey. I know that this topic has already been covered but I did not really find answers to my question. I am using reduceByKey to remove entries with identical keys, using, as reduce function, (a,b) => a. It seems to be a relatively straightforwa

RDDs and Immutability

2014-09-13 Thread Deep Pradhan
Hi, We all know that RDDs are immutable. There are not enough operations that can achieve anything and everything on RDDs. Take for example this: I want an Array of Bytes filled with zeros which during the program should change. Some elements of that Array should change to 1. If I make an RDD with

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread DB Tsai
Hi Yanbo, We made the change here https://github.com/apache/spark/commit/5d25c0b74f6397d78164b96afb8b8cbb1b15cfbd Those apis to set the parameters are very difficult to maintain, so we decide not to provide them. In next release, Spark 1.2, we will have a better api design for parameter setting.
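
A hedged sketch of the setter style this points to in Spark 1.1: configure the underlying LBFGS optimizer directly instead of passing hyperparameters to a static train() helper. trainingData is an assumed RDD[LabeledPoint]; the values are illustrative.

    import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

    val lr = new LogisticRegressionWithLBFGS()
    lr.optimizer
      .setConvergenceTol(1e-4)    // illustrative values
      .setNumCorrections(10)
      .setRegParam(0.01)
    val model = lr.run(trainingData)   // trainingData: RDD[LabeledPoint] (assumed)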

[mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
Hi All, I found that LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD in master and 1.1 release. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala#L199 In the above code snippet, u

Re: [mllib] LogisticRegressionWithLBFGS interface is not consistent with LogisticRegressionWithSGD

2014-09-13 Thread Yanbo Liang
I also found that https://github.com/apache/spark/commit/8f6e2e9df41e7de22b1d1cbd524e20881f861dd0 had resolved this issue, but it seems that the right code snippet does not occur in master or the 1.1 release. 2014-09-13 17:12 GMT+08:00 Yanbo Liang : > Hi All, > > I found that LogisticRegressionWithLBFGS interface i

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Sorry, posting too late at night. That should be "...transformations, that produce further RDDs; and actions, that return values to the driver program." On Sat, Sep 13, 2014 at 12:45 AM, Mark Hamstra wrote: > Again, RDD operations are of two basic varieties: transformations, that > produce furt

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
Again, RDD operations are of two basic varieties: transformations, that produce further RDDs; and operations, that return values to the driver program. You've used several RDD transformations and then finally the top(1) action, which returns an array of one element to your driver program. That is

Re: sc.textFile problem due to newlines within a CSV record

2014-09-13 Thread Mohit Jaggi
Thanks Xiangrui. This file already exists w/o escapes. I could probably try to preprocess it and add the escaping. On Fri, Sep 12, 2014 at 9:38 PM, Xiangrui Meng wrote: > I wrote an input format for Redshift's tables unloaded with UNLOAD and the > ESCAPE option: https://github.com/mengxr/redshift-input-f

Re: Spark and Scala

2014-09-13 Thread Deep Pradhan
Take for example this:

    val lines = sc.textFile(args(0))
    val nodes = lines.map(s => {
      val fields = s.split("\\s+")
      (fields(0), fields(1))
    }).distinct().groupByKey().cache()
    val nodeSizeTuple = nodes.map(node => (node._1.toInt, node._2.size))
    val rootNode = nodeSizeTuple.

Re: Serving data

2014-09-13 Thread Mayur Rustagi
You can cache data in memory & query it using Spark Job Server. Most folks dump data down to a queue/db for retrieval. You can batch up data & store it into parquet partitions as well, & query it using another Spark SQL shell; the JDBC driver in Spark SQL is part of 1.1, I believe. -- Regards, Mayur Rust

Re: Spark and Scala

2014-09-13 Thread Mark Hamstra
This is all covered in http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations By definition, RDD transformations take an RDD to another RDD; actions produce some other type as a value on the driver program. On Fri, Sep 12, 2014 at 11:15 PM, Deep Pradhan wrote: > Is it always
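
A tiny illustration of that distinction, with illustrative values:

    val nums = sc.parallelize(1 to 10)
    val squares = nums.map(n => n * n)   // transformation: RDD => RDD, evaluated lazily
    val largest = squares.top(1)         // action: Array(100) returned to the driver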

Re: How to save mllib model to hdfs and reload it

2014-09-13 Thread Yanbo Liang
Shixiong, These two snippets behave differently in Scala. In the second snippet, you define a variable named m and the right-hand side is evaluated as part of the definition. In other words, the variable is replaced by the pre-computed value of Array(1.0) in the subsequent code. So in the second s