Using unshaded akka in Spark driver

2014-08-28 Thread Aniket Bhatnagar
I am building (yet another) job server for Spark using Play! framework and it seems like Play's akka dependency conflicts with Spark's shaded akka dependency. Using SBT, I can force Play to use akka 2.2.3 (unshaded) but I haven't been able to figure out how to exclude com.typesafe.akka dependencies

Re: Visualizing stage & task dependency graph

2014-08-28 Thread Phuoc Do
I'm working on this patch to visualize stages: https://github.com/apache/spark/pull/2077 Phuoc Do On Mon, Aug 4, 2014 at 10:12 PM, Zongheng Yang wrote: > I agree that this is definitely useful. > > One related project I know of is Sparkling [1] (also see talk at Spark > Summit 2014 [2]), but

Key-Value Operations

2014-08-28 Thread Deep Pradhan
Hi, I have an RDD of key-value pairs. Now I want to find the "key" for which the "values" have the largest number of elements. How should I do that? Basically I want to select the key for which the number of items in values is the largest. Thank You

The concurrent model of spark job/stage/task

2014-08-28 Thread 李华
hi, guys I am trying to understand how Spark's concurrency model works. I read the below from https://spark.apache.org/docs/1.0.2/job-scheduling.html quote " Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from s

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
If you just want an arbitrary unique id attached to each record in a dstream (no ordering etc.), then why not generate and attach a UUID to each record? On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar wrote: > I see an issue here. > > If rdd.id is 1000 then rdd.id * 1e9.toLong would be BIG. > >
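For reference, a minimal sketch of that suggestion (assuming dstream is an existing DStream; names are illustrative):

    // Attach a random, globally unique id to every record; no ordering guarantees.
    val withIds = dstream.map(record => (java.util.UUID.randomUUID().toString, record))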

sbt package assembly run spark examples

2014-08-28 Thread filipus
hi guys, can someone explain, or give a stupid user like me a link to, the right usage of sbt and spark in order to run the examples as a standalone app? I got to the point of running the app by sbt "run path-to-the-data" but still get some error because I probably didn't tell the app th

Re: sbt package assembly run spark examples

2014-08-28 Thread filipus
got it when I read the class reference https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkConf conf.setMaster("local[2]") sets the master to local with 2 threads, but I still get some warnings and the result (see below) is also not right, I think. ps: by the way ... first

Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Jaonary Rabarisoa
Hi all, What is the expression that I should use with the spark sql DSL if I need to retrieve data with a field in a given set? For example: I have the following schema case class Person(name: String, age: Int) And I need to do something like: personTable.where('name in Seq("foo", "bar")) ? Ch

Re: Trying to run SparkSQL over Spark Streaming

2014-08-28 Thread praveshjain1991
Thanks for the reply. Sorry I could not ask more earlier. Trying to use a parquet file is not working at all. case class Rec(name:String,pv:Int) val sqlContext=new org.apache.spark.sql.SQLContext(sc) import sqlContext.createSchemaRDD val d1=sc.parallelize(Array(("a",10),("b",3))).map(e=>Rec(e._1

Re: How to join two PairRDD together?

2014-08-28 Thread Yanbo Liang
Maybe you can refer to the sliding method of RDD, but right now it's an mllib-private method. Look at org.apache.spark.mllib.rdd.RDDFunctions. 2014-08-26 12:59 GMT+08:00 Vida Ha : > Can you paste the code? It's unclear to me how/when the out of memory is > occurring without seeing the code. > > > > > On

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, tried mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt. The dep.txt is attached for your information. [WARNING] [WARNING] Some problems were encountered while building the effective settings [WARNING] Unrecognised tag: 'mirrors' (position: START_TAG s

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Ted Yu
I see 0.98.5 in dep.txt. You should be good to go. On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > tried > mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests > dependency:tree > dep.txt > > Attached the dep.txt for your

Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
The file is compiling properly but when I try to run the jar file using spark-submit, it is giving some errors. I am running spark locally and have downloaded a pre-built version of Spark named "For Hadoop 2 (HDP2, CDH5)". I don't know if it is a dependency problem but I don't want to have Hado

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread GADV
Not sure if this makes sense, but maybe it would be nice to have a kind of "flag" available within the code that tells me if I'm running in a "normal" situation or during a recovery. To better explain this, let's consider the following scenario: I am processing data, let's say from a Kafka stream, a

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I tried to start Spark but failed: $ ./sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out failed to launch org.apache.spark.deploy.master.Master: Failed to find Spa

how to filter value in spark

2014-08-28 Thread marylucy
fileA=1 2 3 4, one number per line, saved in /sparktest/1/ fileB=3 4 5 6, one number per line, saved in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_,1)) var b = sc.textFile("/sparktest/2/").map((_,1)) a.filter(param=>{b.lookup(param._1).length>0}).map(_._1).foreach(prin

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
You need to set HADOOP_HOME. Is Spark officially supposed to work on Windows or not at this stage? I know the build doesn't quite yet. On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet < vineet.hingor...@sap.com> wrote: > The file is compiling properly but when I try to run the jar file using

Re: how to filter value in spark

2014-08-28 Thread Matthew Farrellee
On 08/28/2014 07:20 AM, marylucy wrote: fileA=1 2 3 4 one number a line,save in /sparktest/1/ fileB=3 4 5 6 one number a line,save in /sparktest/2/ I want to get 3 and 4 var a = sc.textFile("/sparktest/1/").map((_,1)) var b = sc.textFile("/sparktest/2/").map((_,1)) a.filter(param=>{b.lookup(p
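(The reply above is truncated by the archive.) For reference only, and not necessarily what the reply suggests: calling b.lookup inside a transformation of a cannot work, because one RDD cannot be used inside another RDD's closure. A minimal sketch of one way to keep only the numbers present in both files:

    // Read both files and keep the lines that appear in both; collect() brings
    // the small result back to the driver so println runs locally.
    val a = sc.textFile("/sparktest/1/")
    val b = sc.textFile("/sparktest/2/")
    a.intersection(b).collect().foreach(println)   // prints 3 and 4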

how to specify columns in groupby

2014-08-28 Thread MEETHU MATHEW
Hi all, I have an RDD which has values in the format "id,date,cost". I want to group the elements based on the id and date columns and get the sum of the cost for each group. Can somebody tell me how to do this? Thanks & Regards, Meethu M

Re: How to join two PairRDD together?

2014-08-28 Thread Sean Owen
It sounds like you are adding the same key to every element, and joining, in order to accomplish a full cartesian join? I can imagine doing it that way would blow up somewhere. There is a cartesian() method to do this maybe more efficiently. However if your data set is large, this sort of algorith

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Sean Owen
"0.98.2" is not an HBase version, but "0.98.2-hadoop2" is: http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.hbase%22%20AND%20a%3A%22hbase%22 On Thu, Aug 28, 2014 at 2:54 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I need to use Spark with HBase 0.98 an

Re: how to specify columns in groupby

2014-08-28 Thread Yanbo Liang
For your reference: val d1 = textFile.map(line => { val fields = line.split(",") ((fields(0),fields(1)), fields(2).toDouble) }) val d2 = d1.reduceByKey(_+_) d2.foreach(println) 2014-08-28 20:04 GMT+08:00 MEETHU MATHEW : > Hi all, > > I have an RDD which has values in t

Re: Key-Value Operations

2014-08-28 Thread Sean Owen
If you mean your values are all a Seq or similar already, then you just take the top 1 ordered by the size of the value: rdd.top(1)(Ordering.by(_._2.size)) On Thu, Aug 28, 2014 at 9:34 AM, Deep Pradhan wrote: > Hi, > I have a RDD of key-value pairs. Now I want to find the "key" for which > the
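A small usage sketch of that one-liner, with illustrative data:

    val rdd = sc.parallelize(Seq(("a", Seq(1, 2, 3)), ("b", Seq(4, 5))))
    // Take the single pair whose value collection is largest.
    val largest = rdd.top(1)(Ordering.by(_._2.size))   // Array(("a", Seq(1, 2, 3)))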

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
How can I set HADOOP_HOME if I am running Spark on my local machine without anything else? Do I have to install some other pre-built file? I am on Windows 7 and Spark’s official site says that it is available on Windows. I added the Java path to the PATH variable. Vineet From: Sean Owen [mailt

Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Can you copy the exact spark-submit command that you are running? You should be able to run it locally without installing hadoop. Here is an example of how to run the job locally. # Run application locally on 8 cores ./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master
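(The command is cut off above.) For reference, the local-mode form from the Spark docs looks roughly like this; the jar path and final argument are illustrative:

    # Run application locally on 8 cores
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master local[8] \
      /path/to/spark-examples.jar \
      100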

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
Yes, but I think at the moment there is still a dependency on Hadoop even when not using it. See https://issues.apache.org/jira/browse/SPARK-2356 On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani wrote: > Can you copy the exact spark-submit command that you are running? > > You should be able to r

SPARK on YARN, containers fails

2014-08-28 Thread Control
Hi there, I'm trying to run the JavaSparkPi example on YARN with master = yarn-client but I have a problem. Submitting the application runs smoothly, and the first container for the Application Master works too. When the job starts and there are some tasks to do, I get this warning on the console (I'm us

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread Yana Kadiyska
Can you clarify the scenario: val ssc = new StreamingContext(sparkConf, Seconds(10)) ssc.checkpoint(checkpointDirectory) val stream = KafkaUtils.createStream(...) val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L)) val wordDstream= wordCounts.updateStateByKey[Int](updateFunc) wo

Re: Compilation Error: Spark 1.0.2 with HBase 0.98

2014-08-28 Thread Ted Yu
I didn't see that problem. Did you run this command ? mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Here is what I got: TYus-MacBook-Pro:spark-1.0.2 tyu$ sbin/start-all.sh starting org.apache.spark.deploy.master.Master, logging to /Users/tyu/spark-1.0.2/sbi

repartitioning an RDD yielding imbalance

2014-08-28 Thread Rok Roskar
I've got an RDD where each element is a long string (a whole document). I'm using pyspark so some of the handy partition-handling functions aren't available, and I count the number of elements in each partition with: def count_partitions(id, iterator): c = sum(1 for _ in iterator) yiel

Re: Graphx: undirected graph support

2014-08-28 Thread FokkoDriesprong
A bit like the analogy of a linked list versus a doubly linked list: it might introduce overhead in terms of memory usage, but you could use two directed edges to substitute for the undirected edge. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Graphx-undirected-graph-

Re: Spark-submit not running

2014-08-28 Thread Guru Medasani
Thanks Sean. Looks like there is a workaround as per the JIRA https://issues.apache.org/jira/browse/SPARK-2356 . http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7. Maybe that's worth a shot? On Aug 28, 2014, at 8:15 AM, Sean Owen wrote: > Yes, but I think at the moment t

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
Thank you Sean and Guru for giving the information. Btw I have to put this line, but where should I add it? In my scala file, below where the other ‘import …’ lines are written? System.setProperty("hadoop.home.dir", "d:\\winutil\\") Thank you Vineet From: Sean Owen [mailto:so...@cloudera.com] Sent: Donner

Re: Spark-submit not running

2014-08-28 Thread Sean Owen
You should set this as early as possible in your program, before other code runs. On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet wrote: > Thank you Sean and Guru for giving the information. Btw I have to put this > line, but where should I add it? In my scala file below other ‘import …’ > lin
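A minimal sketch of what "as early as possible" means here, assuming a simple standalone app; the winutil path is the one from the message above and is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    object SimpleApp {
      def main(args: Array[String]) {
        // Set before any Spark/Hadoop classes are initialized.
        System.setProperty("hadoop.home.dir", "d:\\winutil\\")
        val sc = new SparkContext(new SparkConf().setAppName("Simple Project").setMaster("local[2]"))
        // ... rest of the application ...
        sc.stop()
      }
    }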

RE: Spark-submit not running

2014-08-28 Thread Hingorani, Vineet
The following error is given when I try to add this line: [info] Set current project to Simple Project (in build file:/C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/) [info] Compiling 1 Scala source to C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0.2-bin-hadoop

Change delimiter when collecting SchemaRDD

2014-08-28 Thread yadid ayzenberg
Hi All, Is there any way to change the delimiter from being a comma ? Some of the strings in my data contain commas as well, making it very difficult to parse the results. Yadid

Print to spark log

2014-08-28 Thread jamborta
Hi all, Just wondering if there is a way to use logging to print to spark logs some additional info (similar to debug in scalding). Thanks, -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Print-to-spark-log-tp13035.html Sent from the Apache Spark User List

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Debasish Das
Breeze author David also has a github project on cuda bindings in scala. Do you prefer using java or scala? On Aug 27, 2014 2:05 PM, "Frank van Lankvelt" wrote: > you could try looking at ScalaCL[1], it's targeting OpenCL rather than > CUDA, but that might be close enough? > > cheers, Frank >

SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I have just tried to apply the patch of SPARK-1297: https://issues.apache.org/jira/browse/SPARK-1297 There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt respectively. When applying the 2nd one, I got "Hunk #1 FAILED at 45" Can you please advise how to fix it in order

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread Ted Yu
I attached patch v5 which corresponds to the pull request. Please try again. On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com < arthur.hk.c...@gmail.com> wrote: > Hi, > > I have just tried to apply the patch of SPARK-1297: > https://issues.apache.org/jira/browse/SPARK-1297 > > There ar

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, patch -p1 -i spark-1297-v5.txt can't find file to patch at input line 5 Perhaps you used the wrong -p or --strip option? The text leading up to this was: -- |diff --git docs/building-with-maven.md docs/building-with-maven.md |index 672d0ef..f8bcd2b 100644 |--- docs/bui

New SparkR mailing list, JIRA

2014-08-28 Thread Shivaram Venkataraman
Hi I'd like to announce a couple of updates to the SparkR project. In order to facilitate better collaboration for new features and development we have a new mailing list, issue tracker for SparkR. - The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and we have migrated all ex

Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Verma, Rishi (398J)
Hi Folks, I’d like to find out tips on how to convert the RDDs inside a Spark Streaming DStream to a set of SchemaRDDs. My DStream contains JSON data pushed over from Kafka, and I’d like to use SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as a table, and perform
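A rough sketch of the conversion being asked about, assuming each record in the DStream is a JSON string and that the jsonRDD/registerTempTable API from the Spark 1.1 line is available; names here are illustrative, not the poster's code:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    jsonDStream.foreachRDD { rdd =>
      if (rdd.count() > 0) {                       // skip empty batches
        val schemaRdd = sqlContext.jsonRDD(rdd)    // RDD[String] of JSON -> SchemaRDD
        schemaRdd.registerTempTable("events")
        sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
      }
    }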

Re: Spark webUI - application details page

2014-08-28 Thread Brad Miller
Hi All, @Andrew Thanks for the tips. I just built the master branch of Spark last night, but am still having problems viewing history through the standalone UI. I dug into the Spark job events directories as you suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and 'EVENT_LOG_1'; for appli

Re: Spark Streaming checkpoint recovery causes IO re-execution

2014-08-28 Thread RodrigoB
Hi Yana, The fact is that the DB writing is happening at the node level and not at the Spark level. One of the benefits of the distributed computing nature of Spark is enabling IO distribution as well. For example, it is much faster to have the nodes write to Cassandra instead of having them all collected

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread Ted Yu
bq. Spark 1.0.2 For the above release, you can download pom.xml attached to the JIRA and place it in examples directory I verified that the build against 0.98.4 worked using this command: mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package Patch v5 is

Re: spark.files.userClassPathFirst=true Not Working Correctly

2014-08-28 Thread DNoteboom
Sorry for the extremely late reply. It turns out that the same error occurred when running on yarn. However, I recently updated my project to depend on cdh5 and the issue I was having disappeared and I am no longer setting the userClassPathFirst to true. -- View this message in context: http:/

Re: Low Level Kafka Consumer for Spark

2014-08-28 Thread Chris Fregly
@bharat- overall, i've noticed a lot of confusion about how Spark Streaming scales - as well as how it handles failover and checkpointing, but we can discuss that separately. there's actually 2 dimensions to scaling here: receiving and processing. *Receiving* receiving can be scaled out by subm
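(The message is truncated here.) For reference, and not necessarily Chris's exact point, the usual pattern for scaling the receiving side is to create several input DStreams so that several receivers run in parallel, then union them; a sketch assuming a Kafka source and that ssc, zkQuorum, groupId and topicMap are already defined:

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 4                                          // illustrative
    val kafkaStreams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)   // one receiver per stream
    }
    val unified = ssc.union(kafkaStreams)                         // process as a single DStream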

Re: SPARK-1297 patch error (spark-1297-v4.txt )

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi Ted, I downloaded pom.xml to examples directory. It works, thanks!! Regards Arthur [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.119s] [INFO] Spark Project

org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and HBase. With default setting in Spark 1.0.2, when trying to load a file I got "Class org.apache.hadoop.io.compress.SnappyCodec not found" Can you please advise how to enable snappy in Spark? Regards Arthur scala> i

Re: CUDA in spark, especially in MLlib?

2014-08-28 Thread Wei Tan
Thank you Debasish. I am fine with either Scala or Java. I would like to get a quick evaluation on the performance gain, e.g., ALS on GPU. I would like to try whichever library does the business :) Best regards, Wei - Wei Tan, PhD Research Staff Member IBM T. J.

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, my check native result: hadoop checknative 14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize native-bzip2 library system-native, will use pure-Java version 14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library Native library checki

org.apache.spark.examples.xxx

2014-08-28 Thread filipus
hey guys, i'm still trying to get used to compiling and running the example code. why does the run_example script submit the class with an org.apache.spark.examples in front of the class itself? probably a stupid question but i would be glad if some one of you explains. by the way.. how was the "spark...example..

Re: Print to spark log

2014-08-28 Thread Control
I'm not sure if this is the case, but basic monitoring is described here: https://spark.apache.org/docs/latest/monitoring.html If it comes to something more sophisticated I was for example able to save some messages into local logs and view them in YARN UI via http by editing spark source code (use

What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Steve Lewis
In many cases when I work with Map Reduce, my mapper or my reducer might take a single value and map it to multiple keys. The reducer might also take a single key and emit multiple values. I don't think that functions like flatMap and reduceByKey will work, or are there tricks I am not aware of

Re: Spark webUI - application details page

2014-08-28 Thread SK
I was able to recently solve this problem for standalone mode. For this mode, I did not use a history server. Instead, I set spark.eventLog.dir (in conf/spark-defaults.conf) to a directory in hdfs (basically this directory should be in a place that is writable by the master and accessible globally
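For reference, the corresponding entries in conf/spark-defaults.conf look roughly like this (the HDFS path is illustrative):

    spark.eventLog.enabled   true
    spark.eventLog.dir       hdfs://namenode:8020/user/spark/applicationHistory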

Re: OutofMemoryError when generating output

2014-08-28 Thread SK
Hi, Thanks for the response. I tried to use countByKey. But I am not able to write the output to console or to a file. Neither collect() nor saveAsTextFile() work for the Map object that is generated after countByKey(). val x = sc.textFile(baseFile).map { line => val field

Re: Print to spark log

2014-08-28 Thread jamborta
thanks for the reply. I was looking for something for the case when it's running outside of the spark framework. If I declare a sparkcontext or an rdd, could that print some messages in the log? The problem I have is that if I print something from the scala object that runs the spark app, it does

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Soumitra Kumar
Yes, that is an option. I started with a function of batch time and index to generate the id as a long. This may be faster than generating a UUID, with the added benefit of sorting based on time. - Original Message - From: "Tathagata Das" To: "Soumitra Kumar" Cc: "Xiangrui Meng" , user@spark.apac
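A hedged sketch of that idea, combining the batch time with a per-batch index into a Long id; this is not the poster's actual code, the record type and bit layout are illustrative, and it assumes fewer than 2^20 records per batch:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.Time

    val withIds = dstream.transform { (rdd: RDD[String], time: Time) =>
      rdd.zipWithIndex().map { case (record, idx) =>
        ((time.milliseconds << 20) | idx, record)   // ids sort by batch time
      }
    }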

Re: Spark Streaming: DStream - zipWithIndex

2014-08-28 Thread Tathagata Das
But then if you want to generate ids that are unique across ALL the records that you are going to see in a stream (which can be potentially infinite), then you definitely need a number space larger than long :) TD On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar wrote: > Yes, that is an option

Q on downloading spark for standalone cluster

2014-08-28 Thread Sanjeev Sagar
Hello there, I have a basic question on the download: which option do I need to download for a standalone cluster? I have a private cluster of three machines on Centos. When I click on download it shows me the following: Download Spark The latest release is Spark 1.0.2, released August 5, 2014 (re

Re: Converting a DStream's RDDs to SchemaRDDs

2014-08-28 Thread Tathagata Das
Try using "local[n]" with n > 1, instead of local. Since receivers take up 1 slot, and "local" is basically 1 slot, there is no slot left to process the data. That's why nothing gets printed. TD On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J) < rishi.ve...@jpl.nasa.gov> wrote: > Hi Folks,

Where to save intermediate results?

2014-08-28 Thread huylv
Hi, I'm building a system for near real-time data analytics. My plan is to have an ETL batch job which calculates aggregations running periodically. User queries are then parsed for on-demand calculations, also in Spark. Where are the pre-calculated results supposed to be saved? I mean, after fini

Re: repartitioning an RDD yielding imbalance

2014-08-28 Thread Davies Liu
On Thu, Aug 28, 2014 at 7:00 AM, Rok Roskar wrote: > I've got an RDD where each element is a long string (a whole document). I'm > using pyspark so some of the handy partition-handling functions aren't > available, and I count the number of elements in each partition with: > > def count_partitio

Re: Q on downloading spark for standalone cluster

2014-08-28 Thread Daniel Siegmann
If you aren't using Hadoop, I don't think it matters which you download. I'd probably just grab the Hadoop 2 package. Out of curiosity, what are you using as your data store? I get the impression most Spark users are using HDFS or something built on top. On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev

Re: Where to save intermediate results?

2014-08-28 Thread Daniel Siegmann
I assume your on-demand calculations are a streaming flow? If your data aggregated from batch isn't too large, maybe you should just save it to disk; when your streaming flow starts you can read the aggregations back from disk and perhaps just broadcast them. Though I guess you'd have to restart yo

Re: What happens if I have a function like a PairFunction but which might return multiple values

2014-08-28 Thread Sean Owen
To emulate a Mapper, flatMap() is exactly what you want. Since it flattens, it means you return an Iterable of values instead of 1 value. That can be a Collection containing many values, or 1, or 0. For a reducer, to really reproduce what a Reducer does in Java, I think you will need groupByKey()
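A small sketch of both halves of that answer, assuming lines is an RDD[String]; the field handling is hypothetical:

    // Mapper-like step: each input line can emit zero, one, or many (key, value) pairs.
    val pairs = lines.flatMap { line =>
      line.split(" ").map(token => (token, 1))
    }
    // Reducer-like step: each key sees all of its values and may emit multiple outputs.
    val reduced = pairs.groupByKey().flatMap { case (key, values) =>
      values.map(v => (key, v)).take(2)   // at most two outputs per key, illustrative
    }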

RE: Q on downloading spark for standalone cluster

2014-08-28 Thread Sagar, Sanjeev
Hello Daniel, If you’re not using Hadoop then why would you want to grab the Hadoop package? CDH5 will download all the Hadoop packages and cloudera manager too. Just curious, what happens if you start spark on an EC2 cluster, what does it choose for the data store by default? -Sanjeev From: Daniel Siegmann [

DStream repartitioning, performance tuning processing

2014-08-28 Thread Tim Smith
Hi, In my streaming app, I receive from kafka where I have tried setting the partitions when calling "createStream" or later, by calling repartition - in both cases, the number of nodes running the tasks seems to be stubbornly stuck at 2. Since I have 11 nodes in my cluster, I was hoping to use mo

Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tim Smith
Hi, Have a Spark-1.0.0 (CDH5) streaming job reading from kafka that died with: 14/08/28 22:28:15 INFO DAGScheduler: Failed to run runJob at ReceiverTracker.scala:275 Exception in thread "Thread-59" 14/08/28 22:28:15 INFO YarnClientClusterScheduler: Cancelling stage 2 14/08/28 22:28:15 INFO DAGSch

transforming a Map object to RDD

2014-08-28 Thread SK
Hi, How do I convert a Map object to an RDD so that I can use the saveAsTextFile() operation to output the Map object? thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/transforming-a-Map-object-to-RDD-tp13071.html Sent from the Apache Spark User Li

Re: transforming a Map object to RDD

2014-08-28 Thread Sean Owen
val map = Map("foo" -> 1, "bar" -> 2, "baz" -> 3) val rdd = sc.parallelize(map.toSeq) rdd is an RDD[(String,Int)] and you can do what you like from there. On Thu, Aug 28, 2014 at 11:56 PM, SK wrote: > Hi, > > How do I convert a Map object to an RDD so that I can use the > saveAsTextFile() oper
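And, to close the loop on the original question, the resulting RDD can then be written out (output path illustrative):

    rdd.saveAsTextFile("hdfs:///tmp/map-as-text")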

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
Do you see this error right in the beginning or after running for sometime? The root cause seems to be that somehow your Spark executors got killed, which killed receivers and caused further errors. Please try to take a look at the executor logs of the lost executor to find what is the root cause

problem connection to hdfs on localhost from spark-shell

2014-08-28 Thread Bharath Bhushan
I have HDFS servers running locally and "hadoop dfs -ls /" are all running fine. From spark-shell I do this: val lines = sc.textFile("hdfs:///test") and I get this error message. java.io.IOException: Failed on local exception: java.io.EOFException; Host Details : local host is: "localhost.locald

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tathagata Das
If you are repartitioning to 8 partitions, and your node happen to have at least 4 cores each, its possible that all 8 partitions are assigned to only 2 nodes. Try increasing the number of partitions. Also make sure you have executors (allocated by YARN) running on more than two nodes if you want t

Re: Kinesis receiver & spark streaming partition

2014-08-28 Thread Chris Fregly
great question, wei. this is very important to understand from a performance perspective. and this extends beyond kinesis - it's for any streaming source that supports shards/partitions. i need to do a little research into the internals to confirm my theory. lemme get back to you! -chris

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, If I change my etc/hadoop/core-site.xml from io.compression.codecs org.apache.hadoop.io.compress.SnappyCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.BZip2Codec to

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tim Smith
Appeared after running for a while. I re-ran the job and this time, it crashed with: 14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing thread - java.net.SocketException: Too many open files Shouldn't the failed receiver get re-spawned on a diff

Re: DStream repartitioning, performance tuning processing

2014-08-28 Thread Tim Smith
TD - Apologies, didn't realize I was replying to you instead of the list. What does "numPartitions" refer to when calling createStream? I read an earlier thread that seemed to suggest that numPartitions translates to partitions created on the Spark side? http://mail-archives.apache.org/mod_mbox/in

Re: Change delimiter when collecting SchemaRDD

2014-08-28 Thread Michael Armbrust
The comma is just the way the default toString works for Row objects. Since SchemaRDDs are also RDDs, you can do arbitrary transformations on the Row objects that are returned. For example, if you'd rather the delimiter was '|': sql("SELECT * FROM src").map(_.mkString("|")).collect() On Thu, A

Re: Spark SQL : how to find element where a field is in a given set

2014-08-28 Thread Michael Armbrust
You don't need the Seq, as 'in' is a variadic function. personTable.where('name in ("foo", "bar")) On Thu, Aug 28, 2014 at 3:09 AM, Jaonary Rabarisoa wrote: > Hi all, > > What is the expression that I should use with spark sql DSL if I need to > retreive > data with a field in a given set. > Fo

Memory statistics in the Application detail UI

2014-08-28 Thread SK
Hi, I am using a cluster where each node has 16GB (this is the executor memory). After I complete an MLlib job, the executor tab shows the following: Memory: 142.6 KB Used (95.5 GB Total) and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB (this is different for differe

Re: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread arthur.hk.c...@gmail.com
Hi, I fixed the issue by copying libsnappy.so to the Java JRE. Regards Arthur On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com wrote: > Hi, > > If change my etc/hadoop/core-site.xml > > from > > io.compression.codecs > > org.apache.hadoop.io.compress.SnappyCodec, >

The concurrent model of spark job/stage/task

2014-08-28 Thread 35597...@qq.com
hi, guys I am trying to understand how Spark's concurrency model works. I read the below from https://spark.apache.org/docs/1.0.2/job-scheduling.html quote " Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from sep

Problem using accessing HiveContext

2014-08-28 Thread Zitser, Igor
Hi, While using HiveContext: if the hive table is created as test_datatypes(testbigint bigint, ss bigint), the select below works fine. For "create table test_datatypes(testbigint bigint, testdec decimal(5,2) )": scala> val dataTypes=hiveContext.hql("select * from test_datatypes") 14/08/28 21:18:44 INFO p

FW: Reference Accounts & Large Node Deployments

2014-08-28 Thread Steve Nunez
Anyone? No customers using streaming at scale? From: Steve Nunez Date: Wednesday, August 27, 2014 at 9:08 To: "user@spark.apache.org" Subject: Reference Accounts & Large Node Deployments > All, > > Does anyone have specific references to customers, use cases and large-scale > deployments

Odd saveAsSequenceFile bug

2014-08-28 Thread Shay Seng
Hey Sparkies... I have an odd "bug". I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in REPL) After a bunch of processing, I tell spark to save my rdd to S3 using: rdd.saveAsSequenceFile(uri,codec) That line of code hangs. By hang I mean (a) Spark stages UI shows no update on

RE: Sorting Reduced/Groupd Values without Explicit Sorting

2014-08-28 Thread fluke777
Hi list, Any change on this one? I think I have seen a lot of work being done on this lately but I am unable to forge a working solution from jira tickets. Any example would be highly appreciated. Tomas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Sorti

Re: OutofMemoryError when generating output

2014-08-28 Thread Burak Yavuz
Yeah, saveAsTextFile is an RDD specific method. If you really want to use that method, just turn the map into an RDD: `sc.parallelize(x.toSeq).saveAsTextFile(...)` Reading through the api-docs will present you many more alternate solutions! Best, Burak - Original Message - From: "SK"

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Sean Owen
Each executor reserves some memory for storing RDDs in memory, and some for executor operations like shuffling. The number you see is memory reserved for storing RDDs, and defaults to about 0.6 of the total (spark.storage.memoryFraction). On Fri, Aug 29, 2014 at 2:32 AM, SK wrote: > Hi, > > I am
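If that split needs adjusting, the fraction is configurable; a hedged example with an illustrative value:

    // Set on the driver before creating the SparkContext.
    val conf = new org.apache.spark.SparkConf().set("spark.storage.memoryFraction", "0.5")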

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Burak Yavuz
Hi, Spark uses by default approximately 60% of the executor heap memory to store RDDs. That's why you have 8.6GB instead of 16GB. 95.5 is therefore the sum of all the 8.6 GB of executor memory + the driver memory. Best, Burak - Original Message - From: "SK" To: u...@spark.incubator.ap

Spark / Thrift / ODBC connectivity

2014-08-28 Thread Denny Lee
I’m currently using the Spark 1.1 branch and have been able to get the Thrift service up and running. The quick questions were whether I should be able to use the Thrift service to connect to SparkSQL generated tables and/or Hive tables? As well, by any chance do we have any documents that point

How to debug this error?

2014-08-28 Thread Gary Zhao
Hello, I'm new to Spark and playing around, but saw the following error. Could anyone help with it? Thanks Gary scala> c res15: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[7] at flatMap at :23 scala> group res16: org.apache.spark.rdd.RDD[(String, Iterable[String])] = MappedValuesRDD[5] a

Spark Hive max key length is 767 bytes

2014-08-28 Thread arthur.hk.c...@gmail.com
(Please ignore if duplicated) Hi, I use Spark 1.0.2 with Hive 0.13.1 I have already set the hive mysql database to latin1; mysql: alter database hive character set latin1; Spark: scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) scala> hiveContext.hql("create table tes

Re: Memory statistics in the Application detail UI

2014-08-28 Thread SK
Hi, Thanks for the responses. I understand that the second values in the Memory Used column for the executors add up to 95.5 GB and the first values add up to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is the 17.3 KB? Is that the memory used for shuffling operations? For non

Re: Failed to run runJob at ReceiverTracker.scala

2014-08-28 Thread Tathagata Das
It did. It failed and was respawned 4 times. In this case, the too many open files error is a sign that you need to increase the system-wide limit of open files. Try adding ulimit -n 16000 to your conf/spark-env.sh. TD On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith wrote: > Appeared after running for a whi

Re: Memory statistics in the Application detail UI

2014-08-28 Thread Sean Owen
Click the Storage tab. You have some (tiny) RDD persisted in memory. On Fri, Aug 29, 2014 at 5:58 AM, SK wrote: > Hi, > Thanks for the responses. I understand that the second values in the Memory > Used column for the executors add up to 95.5 GB and the first values add up > to 17.3 KB. If 95.5

RE: org.apache.hadoop.io.compress.SnappyCodec not found

2014-08-28 Thread linkpatrickliu
Hi, You can set the settings in conf/spark-env.sh like this: export SPARK_LIBRARY_PATH=/usr/lib/hadoop/lib/native/ SPARK_JAVA_OPTS+="-Djava.library.path=$SPARK_LIBRARY_PATH " SPARK_JAVA_OPTS+="-Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec " SPARK_JAVA_OPTS+="-Dio.compress

RE: The concurrent model of spark job/stage/task

2014-08-28 Thread linkpatrickliu
Hi, Please see the answers following each question. If there's any mistake, please let me know. Thanks! I am not sure which mode you are running, so I will assume you are using the spark-submit script to submit spark applications to a spark cluster (spark-standalone or Yarn). 1. how to start 2 or more

RE: problem connection to hdfs on localhost from spark-shell

2014-08-28 Thread linkpatrickliu
Change your conf/spark-env.sh: export HADOOP_CONF_DIR="/etc/hadoop/conf" export YARN_CONF_DIR="/etc/hadoop/conf" Date: Thu, 28 Aug 2014 16:19:05 -0700 From: ml-node+s1001560n13074...@n3.nabble.com To: linkpatrick...@live.com Subject: problem connection to hdfs on localhost from spark-shell