Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Hi Tathagata, It's a standalone cluster. The submit commands are:
== CLIENT
spark-submit --class com.fake.Test \
  --deploy-mode client --master spark://fake.com:7077 \
  fake.jar
== CLUSTER
spark-submit --class com.fake.Test \
  --deploy-mode cluster --master spark://fake.com:7077 \
  s3n://fake.ja

Re: spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Ilya Ganelin
Welcome to Spark. What's more fun is that that setting controls memory on the executors, but if you want to set a memory limit on the driver you need to configure it as a parameter of the spark-submit script. You also set num-executors and executor-cores on the spark-submit call. See both the Spark tuning

Re: NullPointerException

2014-12-31 Thread rapelly kartheek
OK. Let me try it out on a newer version. Thank you!! On Thu, Jan 1, 2015 at 12:17 PM, Josh Rosen wrote: > It looks like 'null' might be selected as a block replication peer? > https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L786 > > I kn

Re: NullPointerException

2014-12-31 Thread Josh Rosen
It looks like 'null' might be selected as a block replication peer? https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L786 I know that we fixed some replication bugs in newer versions of Spark (such as https://github.com/apache/spark/pull/2

Re: spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Kevin Burton
wow. Just figured it out: conf.set("spark.executor.memory", "2g"); I have to set it in the job… that's really counterintuitive, especially because the documentation in spark-env.sh says the exact opposite. What's the resolution here? This seems like a mess. I'd propose a solution to
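A minimal sketch of that workaround, assuming Spark 1.x and hypothetical values (app name, sizes): executor memory can be set on the SparkConf before the SparkContext is created, whereas driver memory generally has to be passed to spark-submit (--driver-memory) because the driver JVM is already running by the time the conf is read.

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical job setup: executor settings are honored when set on the conf
    // before the context is created; driver memory normally comes from the
    // spark-submit flag instead.
    val conf = new SparkConf()
      .setAppName("MemorySettingsExample")
      .set("spark.executor.memory", "2g")
      .set("spark.executor.cores", "2")
    val sc = new SparkContext(conf)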

Fwd: NullPointerException

2014-12-31 Thread rapelly kartheek
-- Forwarded message -- From: rapelly kartheek Date: Thu, Jan 1, 2015 at 12:05 PM Subject: Re: NullPointerException To: Josh Rosen , user@spark.apache.org spark-1.0.0 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen wrote: > Which version of Spark are you using? > > On Wed, Dec 31,

Re: NullPointerException

2014-12-31 Thread rapelly kartheek
spark-1.0.0 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen wrote: > Which version of Spark are you using? > > On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek < > kartheek.m...@gmail.com> wrote: > >> Hi, >> I get this following Exception when I submit spark application that >> calculates the freq

spark ignoring all memory settings and defaulting to 512MB?

2014-12-31 Thread Kevin Burton
This is really weird and I’m surprised no one has found this issue yet. I’ve spent about an hour or more trying to debug this :-( My spark install is ignoring ALL my memory settings. And of course my job is running out of memory. The default is 512MB so pretty darn small. The worker and master

Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using? On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek wrote: > Hi, > I get this following Exception when I submit spark application that > calculates the frequency of characters in a file. Especially, when I > increase the size of data, I face this problem. > >

NullPointerException

2014-12-31 Thread rapelly kartheek
Hi, I get the following exception when I submit a Spark application that calculates the frequency of characters in a file. Especially when I increase the size of the data, I face this problem. Exception in thread "Thread-47" org.apache.spark.SparkException: Job aborted due to stage failure: Task 11.0:

Re: limit vs sample for indexing a small amount of data quickly?

2014-12-31 Thread Fernando O.
There's a take method that might do what you need: def take(num: Int): Array[T]. Take the first num elements of the RDD. On Jan 1, 2015 12:02 AM, "Kevin Burton" wrote: > Is there a limit function which just returns the first N records? > > Sample is nice but I'm trying to do this so it
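A minimal sketch of the difference, with hypothetical data: take returns the first N elements directly, while sample needs a fraction that the caller has to estimate first.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("TakeExample").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000)

    // take: the first N elements, returned as a local Array
    val firstThousand: Array[Int] = rdd.take(1000)

    // sample: needs a fraction, so you would first have to work out roughly
    // what fraction of the data yields ~1000 elements
    val sampled = rdd.sample(withReplacement = false, fraction = 0.001)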

limit vs sample for indexing a small amount of data quickly?

2014-12-31 Thread Kevin Burton
Is there a limit function which just returns the first N records? Sample is nice but I’m trying to do this so it’s super fast and just to test the functionality of an algorithm. With sample I’d have to compute the % that would yield 1000 results first… Kevin -- Founder/CEO Spinn3r.com Locatio

Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Tathagata Das
What are your spark-submit commands in both cases? Is it Spark Standalone or YARN (both support client and cluster)? Accordingly, what is the number of executors/cores requested? TD On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji wrote: > Also the job was deployed from the master machine in the clust

Re: UpdateStateByKey persist to Tachyon

2014-12-31 Thread amkcom
bumping this thread up

Re: Why the major.minor version of the new hive-exec is 51.0?

2014-12-31 Thread Ted Yu
I see. I logged SPARK-5041 which references this thread. Thanks On Wed, Dec 31, 2014 at 12:57 PM, Michael Armbrust wrote: > We actually do publish our own version of this jar, because the version > that the hive team publishes is an uber jar and this breaks all kinds of > things. As a result

Re: Why the major.minor version of the new hive-exec is 51.0?

2014-12-31 Thread Michael Armbrust
We actually do publish our own version of this jar, because the version that the hive team publishes is an uber jar and this breaks all kinds of things. As a result I'd file the JIRA against Spark. On Wed, Dec 31, 2014 at 12:55 PM, Ted Yu wrote: > Michael: > hive-exec-0.12.0-protobuf-2.5.jar is

Re: building spark1.2 meet error

2014-12-31 Thread Jacek Laskowski
Hi, Where does the following path that appears in the logs below come from? /opt/xdsp/spark-1.2.0/H:\Soft\Maven\repository/org/scala-lang/scala-compiler/2.10.4/scala-compiler-2.10.4.jar Did you somehow point at the local Maven repository that's H:\Soft\Maven? Jacek 31 Dec 2014 01:48 "j_soft" n

Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Oh sorry, that was an edit mistake. The code is essentially:
val msgStream = kafkaStream
  .map { case (k, v) => v }
  .map(DatatypeConverter.printBase64Binary)
  .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
I.e. there is essentially no original code (I was callin
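For reference, a self-contained sketch of the pipeline as described, with hypothetical wiring (ZooKeeper quorum, consumer group, topic, batch interval, S3 path) and the per-batch save done inside foreachRDD; LzoCodec assumes the hadoop-lzo jars are on the classpath.

    import javax.xml.bind.DatatypeConverter
    import com.hadoop.compression.lzo.LzoCodec
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val ssc = new StreamingContext(new SparkConf().setAppName("KafkaToS3"), Seconds(10))
    val kafkaStream = KafkaUtils.createStream(
      ssc, "zk-host:2181", "consumer-group", Map("some-topic" -> 1), StorageLevel.MEMORY_AND_DISK_SER)

    kafkaStream
      .map { case (k, v) => v }                                            // keep only the payload
      .map(v => DatatypeConverter.printBase64Binary(v.getBytes("UTF-8")))  // base64-encode it
      .foreachRDD { (rdd, time) =>
        // one output directory per batch, compressed with LZO
        rdd.saveAsTextFile(s"s3n://some.bucket/path/${time.milliseconds}", classOf[LzoCodec])
      }

    ssc.start()
    ssc.awaitTermination()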

Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Enno Shioji
Also the job was deployed from the master machine in the cluster. On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji wrote: > Oh sorry, that was an edit mistake. The code is essentially: > > val msgStream = kafkaStream >.map { case (k, v) => v} >.map(DatatypeConverter.printBase64B

Re: Big performance difference between "client" and "cluster" deployment mode; is this expected?

2014-12-31 Thread Sean Owen
-dev, +user A decent guess: does your 'save' function entail collecting data back to the driver? And are you running this from a machine that's not in your Spark cluster? Then in client mode you're shipping data back to a less-nearby machine, compared with cluster mode. That could explain the b

Re: FlatMapValues

2014-12-31 Thread Sanjay Subramanian
hey guys Some of you may care :-) but this is just to give you some background on where I am going with this. I have an iOS medical side effects app called MedicalSideFx. I built the entire underlying data layer aggregation using Hadoop, and currently the search is based on Lucene. I am re-architecting t

RE: Fwd: Sample Spark Program Error

2014-12-31 Thread Kapil Malik
Hi Naveen, Quoting http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext SparkContext is the main entry point for Spark functionality. A SparkContext represents the connection to a Spark cluster, and can be used to create RDDs, accumulators and broadcast variables o

RE: FlatMapValues

2014-12-31 Thread Kapil Malik
Hi Sanjay, Oh yes .. on flatMapValues: it's defined in PairRDDFunctions, and you need to import org.apache.spark.SparkContext._ to use it (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions ) @Sean, yes indeed flatMap / flatMapValues both can b
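A minimal sketch of what that import enables, with hypothetical data: flatMapValues lives on PairRDDFunctions, which the SparkContext._ implicits attach to any RDD of (K, V) pairs (Spark 1.x).

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // brings PairRDDFunctions into scope for RDD[(K, V)]

    val sc = new SparkContext(new SparkConf().setAppName("FlatMapValuesExample").setMaster("local[*]"))

    // hypothetical (id, terms) pairs
    val pairs = sc.parallelize(Seq(("025126", Seq("Chills", "Malaise"))))

    // the key is kept; each element of the value collection becomes its own record
    val exploded = pairs.flatMapValues(terms => terms)   // RDD[(String, String)]
    exploded.collect().foreach(println)                  // (025126,Chills), (025126,Malaise)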

Re: Trying to make spark-jobserver work with yarn

2014-12-31 Thread Fernando O.
Before jumping into a sea of dependencies and bash files: does anyone have an example of how to run a Spark job without using spark-submit or the shell? On Tue, Dec 30, 2014 at 3:23 PM, Fernando O. wrote: > Hi all, > I'm investigating spark for a new project and I'm trying to use > spark-jobser

Re: Long-running job cleanup

2014-12-31 Thread Ganelin, Ilya
The previously submitted code doesn't actually demonstrate the problem I was trying to show, since the issue only becomes clear between subsequent steps. Within a single step, things appear to be cleared up properly. Memory usage becomes evident pretty quickly. def showMemoryUsage(sc: SparkCon
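The helper's body is cut off in this digest; a purely hypothetical reconstruction of what such a method could look like, using SparkContext.getExecutorMemoryStatus, which reports the maximum and remaining cache memory per executor.

    import org.apache.spark.SparkContext

    // Hypothetical sketch, not the author's original code.
    def showMemoryUsage(sc: SparkContext): Unit = {
      sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, remaining)) =>
        val usedMb = (maxMem - remaining) / (1024 * 1024)
        val maxMb  = maxMem / (1024 * 1024)
        println(s"$executor: $usedMb MB used of $maxMb MB cache memory")
      }
    }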

Re: Fwd: Sample Spark Program Error

2014-12-31 Thread Naveen Madhire
Yes. The exception is gone now after adding stop() at the end. Can you please tell me what this stop() does at the end? Does it disable the Spark context? On Wed, Dec 31, 2014 at 10:09 AM, RK wrote: > If you look at your program output closely, you can see the following > output. > Lines with a:

Re: FlatMapValues

2014-12-31 Thread Sean Owen
From the clarification below, the problem is that you are calling flatMapValues, which is only available on an RDD of key-value tuples. Your map function returns a tuple in one case but a String in the other, so your RDD is a bunch of Any, which is not at all what you want. You need to return a tu
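A minimal sketch of the fix being described, with hypothetical field positions: make map return the same tuple type on every branch so the result is a pair RDD, and flatMapValues then compiles.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // PairRDDFunctions, needed for flatMapValues

    val sc = new SparkContext(new SparkConf().setAppName("PairRddFix").setMaster("local[*]"))
    val lines = sc.parallelize(Seq(
      "025126,Chills,8.10,Injection site oedema,8.10,Injection site reaction,8.10,Malaise,8.10,Myalgia,8.10"))

    // Every branch returns (String, Seq[String]), so this is RDD[(String, Seq[String])],
    // not RDD[Any]; a branch returning a bare String would break the pair-RDD type.
    val parsed = lines.map { l =>
      val f = l.split(',')
      if (f.length >= 10) (f(0), Seq(f(1), f(3), f(5), f(7), f(9))) else (f(0), Seq.empty[String])
    }
    val exploded = parsed.flatMapValues(terms => terms)   // RDD[(String, String)]
    exploded.map { case (id, term) => s"$id,$term" }.collect().foreach(println)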

Re: FlatMapValues

2014-12-31 Thread Sanjay Subramanian
Hey Kapil, Fernando, thanks for your mail. [1] Fernando, if I don't use "if" logic inside the "map", then lines of input data that have fewer fields than I am expecting throw an ArrayOutOfBounds exception, so the "if" is to safeguard against that. [2] Kapil, I am sorry I did not clarify.

Re: Fwd: Sample Spark Program Error

2014-12-31 Thread RK
If you look at your program output closely, you can see the following output.  Lines with a: 24, Lines with b: 15 The exception seems to be happening with Spark cleanup after executing your code. Try adding sc.stop() at the end of your program to see if the exception goes away. On Wedne

Fwd: Sample Spark Program Error

2014-12-31 Thread Naveen Madhire
Hi All, I am trying to run a sample Spark program using Scala SBT. Below is the program:
def main(args: Array[String]) {
  val logFile = "E:/ApacheSpark/usb/usb/spark/bin/README.md" // Should be some file on your system
  val sc = new SparkContext("local", "Simple App", "E:/ApacheSpark/
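A minimal sketch of the complete program, assuming the standard quick-start letter-count example the post appears to be based on (the truncated sparkHome argument and the filter logic are guesses, as noted in the comments); the sc.stop() at the end is what the replies above recommend.

    import org.apache.spark.SparkContext

    object SimpleApp {
      def main(args: Array[String]) {
        val logFile = "E:/ApacheSpark/usb/usb/spark/bin/README.md" // should be some file on your system
        // the third argument (sparkHome) is truncated in the post; placeholder used here
        val sc = new SparkContext("local", "Simple App", "E:/ApacheSpark/usb/usb/spark")
        val logData = sc.textFile(logFile, 2).cache()

        // hypothetical body, matching the "Lines with a / Lines with b" output quoted in the replies
        val numAs = logData.filter(_.contains("a")).count()
        val numBs = logData.filter(_.contains("b")).count()
        println(s"Lines with a: $numAs, Lines with b: $numBs")

        sc.stop()   // shut the SparkContext down cleanly so nothing blows up during JVM shutdown
      }
    }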

Re: FlatMapValues

2014-12-31 Thread Fernando O.
Hi Sanjay, Doing an if inside a map sounds like a bad idea; it seems like you actually want to filter and then apply map. On Wed, Dec 31, 2014 at 9:54 AM, Kapil Malik wrote: > Hi Sanjay, > > > > I tried running your code on spark shell piece by piece – > > > > // Setup > > val line1 = "025126,C
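A minimal sketch of that filter-then-map shape, with a hypothetical input path and field layout.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("FilterThenMap").setMaster("local[*]"))
    val lines = sc.textFile("aers_data.csv")   // hypothetical input path

    // drop short/malformed rows up front, then map without any defensive "if" inside
    val keyed = lines
      .map(_.split(','))
      .filter(_.length >= 10)
      .map(f => (f(0), f(1) + "\t" + f(3) + "\t" + f(5) + "\t" + f(7) + "\t" + f(9)))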

RE: FlatMapValues

2014-12-31 Thread Kapil Malik
Hi Sanjay, I tried running your code on spark-shell piece by piece:
// Setup
val line1 = "025126,Chills,8.10,Injection site oedema,8.10,Injection site reaction,8.10,Malaise,8.10,Myalgia,8.10"
val line2 = "025127,Chills,8.10,Injection site oedema,8.10,Injection site reaction,8.10,Malaise,8.10,M

Re: building spark1.2 meet error

2014-12-31 Thread xhudik
Hi J_soft, for me it is working; I didn't put the -Dscala-2.10 -X parameters. I got only one warning: since I don't have Hadoop 2.5, it didn't activate this profile: larix@kovral:~/sources/spark-1.2.0> mvn -Pyarn -Phadoop-2.5 -Dhadoop.version=2.5.0 -DskipTests clean package Found 0 infos Fin

NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2014-12-31 Thread Christophe Billiard
Hi all, I am currently trying to combine DataStax's "spark-cassandra-connector" and Typesafe's "akka-http-experimental" on Spark 1.1.1 (the spark-cassandra-connector for Spark 1.2.0 is not out yet) and Scala 2.10.4. I am using the Hadoop 2.4 pre-built package. (build.sbt file at the end) To solve the jav

Re: pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
Hi, I am receiving the following error when I try to connect to a Spark cluster (which is on Unix) from my Windows machine using the pyspark interactive shell: > pyspark -master (spark cluster url) Then I executed the following commands. lines = sc.textFile("hdfs://master/data/spark/SINGLE.T

Re: Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

2014-12-31 Thread Sean Owen
This generally means you have packaged Hadoop 1.x classes into your app accidentally. The most common cause is not marking Hadoop and Spark classes as "provided" dependencies. Your app doesn't need to ship its own copy of these classes when you use spark-submit. On Wed, Dec 31, 2014 at 10:47 AM, H
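A minimal build.sbt sketch of that advice (the versions here are assumptions; match your cluster): marking Spark and Hadoop as provided keeps your packaged jar from shipping its own, possibly conflicting, copies.

    // build.sbt (sketch with assumed versions)
    name := "my-spark-app"

    scalaVersion := "2.10.4"

    libraryDependencies ++= Seq(
      "org.apache.spark"  %% "spark-core"    % "1.2.0" % "provided",
      "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
    )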

Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4

2014-12-31 Thread Hafiz Mujadid
I am accessing HDFS with Spark's textFile method, and I receive the error: Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 9 cannot communicate with client version 4. Here are my dependencies

Re: FlatMapValues

2014-12-31 Thread Raghavendra Pandey
Why don't you push "\n" instead of "\t" in your first transformation [ (fields(0),(fields(1)+"\t"+fields(3)+"\t"+fields(5)+"\t"+fields(7)+"\t" +fields(9)))] and then do saveAsTextFile? -Raghavendra On Wed Dec 31 2014 at 1:42:55 PM Sanjay Subramanian wrote: > hey guys > > My dataset is like this
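A minimal sketch of that suggestion, with a hypothetical input path: because saveAsTextFile writes each element verbatim, joining the id/term pairs with "\n" inside the mapped string yields one id,term per physical output line.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("NewlineJoin").setMaster("local[*]"))
    val lines = sc.textFile("aers_data.csv")   // hypothetical input path

    val perLine = lines.map { l =>
      val f = l.split(',')
      // join the id with each term using "\n" so one input row becomes several output lines
      Seq(f(1), f(3), f(5), f(7), f(9)).map(term => f(0) + "," + term).mkString("\n")
    }
    perLine.saveAsTextFile("side_effects_out")   // hypothetical output path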

Re: How to set local property in beeline connect to the spark thrift server

2014-12-31 Thread Xiaoyu Wang
Yi Tian, Thanks a lot! It works! I found the variable-setting guide in http://spark.apache.org/docs/latest/sql-programming-guide.html 2014-12-31 17:12 GMT+08:00 田毅 : > Hi, Xiaoyu! > > You can use `spark.sql.thriftserver.scheduler.pool` instead of > `spark.scheduler.pool` only in spark thrift server.

spark stream + cassandra (execution on event)

2014-12-31 Thread Oleg Ruchovets
Hi. I want to use Spark Streaming to read data from Cassandra, but in my case I need to process data based on an event (not retrieving the data constantly from Cassandra). Question: what is the way to trigger the processing with Spark Streaming from time to time? Thanks, Oleg.

Re: How to set local property in beeline connect to the spark thrift server

2014-12-31 Thread 田毅
Hi, Xiaoyu! You can use `spark.sql.thriftserver.scheduler.pool` instead of `spark.scheduler.pool` only in spark thrift server. On Wed, Dec 31, 2014 at 3:55 PM, Xiaoyu Wang wrote: > Hi all! > > I use Spark SQL1.2 start the thrift server on yarn. > > I want to use fair scheduler in the thrift s

pyspark.daemon not found

2014-12-31 Thread Naveen Kumar Pokala
Error from python worker: python: module pyspark.daemon not found PYTHONPATH was: /home/npokala/data/spark-install/spark-master/python: Can somebody please help me with this and how to resolve the issue? -Naveen

FlatMapValues

2014-12-31 Thread Sanjay Subramanian
hey guys My dataset is like this:
025126,Chills,8.10,Injection site oedema,8.10,Injection site reaction,8.10,Malaise,8.10,Myalgia,8.10
Intended output is:
025126,Chills
025126,Injection site oedema
025126,Injection site reaction
025126,Malaise
025126,Myalgia
My code is as follo
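A minimal sketch of one way to get from that input to the intended output, assuming the terms sit at the odd positions with a score after each (paths are hypothetical).

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("SideEffects").setMaster("local[*]"))
    val lines = sc.textFile("aers_data.csv")   // hypothetical input path

    val output = lines.flatMap { l =>
      val f = l.split(',')
      // after the id, pair up (term, score) and keep only the term
      f.drop(1).grouped(2).map(pair => f(0) + "," + pair(0))
    }
    output.saveAsTextFile("intended_output")   // hypothetical output path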

Re: Kafka + Spark streaming

2014-12-31 Thread Samya Maiti
Thanks TD. On Wed, Dec 31, 2014 at 7:19 AM, Tathagata Das wrote: > 1. Of course, a single block / partition has many Kafka messages, and > from different Kafka topics interleaved together. The message count is > not related to the block count. Any message received within a > particular block int