Re: spark1.0 principal component analysis

2014-07-10 Thread Sean Owen
To clarify, you are looking for eigenvectors of what, the covariance matrix? So for example you are looking for the sqrt of the eigenvalues when you talk about stdev of components? Looking at https://github.com/apache/spark/blob/1f33e1f2013c508aa86511750f7bd8437154e51a/mllib/src/main/scala/org/apa
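For reference, a standard PCA fact (not specific to the linked code): if lambda_i is the i-th eigenvalue of the covariance matrix, then the standard deviation of the i-th principal component is

    stdev_i = sqrt(lambda_i)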

Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Sean Owen
Have a look at the code maybe? https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala Yes there is a smoothing parameter, and yes from the looks of it it is simply additive / Laplace smoothing. It's been in there for a while. On Th
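For illustration, a minimal sketch of that parameter in use (assuming the Spark 1.0 MLlib API; lambda = 1.0 gives classic additive/Laplace smoothing):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    // lambda is the additive-smoothing term applied to the feature counts
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(1.0, 0.0)),
      LabeledPoint(1.0, Vectors.dense(0.0, 1.0))))
    val model = NaiveBayes.train(training, lambda = 1.0)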

A Task failed with java.lang.ArrayIndexOutOfBoundsException at com.ning.compress.lzf.impl.UnsafeChunkDecoder.copyOverlappingLong

2014-07-10 Thread innowireless TaeYun Kim
Hi, A Task failed with java.lang.ArrayIndexOutOfBoundsException at com.ning.compress.lzf.impl.UnsafeChunkDecoder.copyOverlappingLong, and the whole job was terminated after repeated task failures. It ran without any problem several days ago. Currently we suspect that the cluster i

KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
Can someone please run the standard kMeans code on this input with 2 centers?: 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 The obvious result should be (2,2) and (5,2) ... (you can draw them if you don't believe me ...) Thanks,  Wanda

Re: KMeans code is rubbish

2014-07-10 Thread Sean Owen
I ran it, and your answer is exactly what I got. import org.apache.spark.mllib.linalg._ import org.apache.spark.mllib.clustering._ val vectors = sc.parallelize(Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3)).map(p => Vectors.dense(Array[Double](p._1, p._2 val
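The preview cuts the snippet off; a complete sketch of the same run (assuming the Spark 1.0 MLlib API) would be:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.clustering.KMeans

    val points = Array((2,1),(1,2),(3,2),(2,3),(4,1),(5,1),(6,1),(4,2),(6,2),(4,3),(5,3),(6,3))
    val vectors = sc.parallelize(points.map(p => Vectors.dense(Array[Double](p._1, p._2))))
    val model = KMeans.train(vectors, k = 2, maxIterations = 20)
    model.clusterCenters.foreach(println)  // centers come out near (2,2) and (5,2)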

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread cjwang
Not sure that was what I want. I tried to run Spark Shell on a machine other than the master and got the same error. The "192" was supposed to be a simple shell script change that alters SPARK_HOME before submitting jobs. Too bad it wasn't there anymore. The build described in the pull request (

Re: KMeans code is rubbish

2014-07-10 Thread Bertrand Dechoux
A picture is worth a thousand... Well, a picture with this dataset, what you are expecting and what you get, would help answering your initial question. Bertrand On Thu, Jul 10, 2014 at 10:44 AM, Wanda Hawk wrote: > Can someone please run the standard kMeans code on this input with 2 > centers

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
so this is what I am running:  "./bin/run-example SparkKMeans ~/Documents/2dim2.txt 2 0.001" And this is the input file:" ┌───[spark2013@SparkOne]──[~/spark-1.0.0].$ └───#!cat ~/Documents/2dim2.txt 2 1 1 2 3 2 2 3 4 1 5 1 6 1 4 2 6 2 4 3 5 3 6 3 " This is the final output from spark: "14/07/1

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Tathagata Das
Do you see any errors in the logs of the driver? On Thu, Jul 10, 2014 at 1:21 AM, Haopu Wang wrote: > I'm running an App for hours in a standalone cluster. From the data > injector and "Streaming" tab of web ui, it's running well. > > However, I see quite a lot of Active stages in web ui even s

Re: "NoSuchElementException: key not found" when changing the window lenght and interval in Spark Streaming

2014-07-10 Thread richiesgr
Hi, I get exactly the same problem here. Have you found the cause? Thanks -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchElementException-key-not-found-when-changing-the-window-lenght-and-interval-in-Spark-Streaming-tp9010p9283.html Sent from the

Re: KMeans code is rubbish

2014-07-10 Thread Tathagata Das
I ran the SparkKMeans example (not the mllib KMeans that Sean ran) with your dataset as well, and I got the expected answer. And I believe that even though initialization is done using sampling, the example actually sets the seed to a constant 42, so the result should always be the same no matter how m

Getting Persistent Connection using socketStream?

2014-07-10 Thread kytay
Hi, I am trying out a simple piece of code by writing my own JavaNetworkCount app to test out Spark Streaming. So here are the 2 sets of code. // #1 JavaReceiverInputDStream lines = sctx.socketTextStream("127.0.0.1", ); // #2 JavaReceiverInputDStream lines = sctx.socketStream(

running scrapy (or any other scraper) on the cluster?

2014-07-10 Thread mrm
Hi all, Has anybody tried to run scrapy on a cluster? If yes, I would appreciate hearing about the general approach that was taken (multiple spiders? single spider? how to distribute urls across nodes?...etc). I would also be interested in hearing about any experience running a different scraper o

RE: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Haopu Wang
I didn't keep the driver's log. It's a lesson learned. I will try to run it again to see if it happens again. From: Tathagata Das [mailto:tathagata.das1...@gmail.com] Sent: July 10, 2014 17:29 To: user@spark.apache.org Subject: Re: All of the tasks have been completed b

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Surendranauth Hiraman
History Server is also very helpful. On Thu, Jul 10, 2014 at 7:37 AM, Haopu Wang wrote: > I didn't keep the driver's log. It's a lesson. > > I will try to run it again to see if it happens again. > > > -- > > *From:* Tathagata Das [mailto:tathagata.das1...@gmail.c

Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Rahul Bhojwani
Ya thanks. I can see that lambda is used as the parameter. On Thu, Jul 10, 2014 at 1:40 PM, Sean Owen wrote: > Have a look at the code maybe? > > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala > > Yes there is a smoothing

Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Rahul Bhojwani
And also, there is a small bug in the implementation, as I mentioned earlier. This is the first time I am reporting a bug, so I just wanted to ask: does your name appear somewhere, or do you get credit after correcting/reporting a bug, so that I can mention it in my profile/res

RDD registerAsTable gives error on regular scala class records

2014-07-10 Thread Kefah Issa
Hi, SQL on spark 1.0 is an interesting feature. It works fine when the "record" is a case class. The issue I have is that there are around 50 attributes per record, and a Scala case class cannot handle that (the hard-coded limit is 22 fields for some reason). So I created a regular class and defined the att

EC2 Cluster script. Shark install fails

2014-07-10 Thread Jason H
Hi Just going through the process of installing Spark 1.0.0 on EC2 and noticed that the script throws an error when installing Shark. Unpacking Spark ~/spark-ec2 Initializing shark ~ ~/spark-ec2 ERROR: Unknown Shark version The install completes in the end but Shark is missing entirely. Lo

Difference between collect() and take(n)

2014-07-10 Thread MEETHU MATHEW
Hi all, I want to know how collect() works and how it is different from take(). I am just reading a file of 330MB, which has 43 lakh (4.3 million) rows with 13 columns, and calling take(430) to save to a variable. But the same is not working with collect(). So is there any difference in the operation of both.

Re: Purpose of spark-submit?

2014-07-10 Thread Koert Kuipers
when i write a general spark application i use SparkConf/SparkContext, not Client.scala for Yarn On Wed, Jul 9, 2014 at 9:39 PM, Andrew Or wrote: > I don't see why using SparkSubmit.scala as your entry point would be any > different, because all that does is invoke the main class of Client.scal
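For context, a minimal sketch of the conventional entry point Koert describes (names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object MyApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("MyApp")  // master is supplied by spark-submit
        val sc = new SparkContext(conf)
        // ... job logic ...
        sc.stop()
      }
    }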

SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread premdass
Hi, any suggestions on how to implement OR clause and IN clause in the SparkSQL language integrated queries. For example: 'SELECT name FROM people WHERE age >= 10 AND month = 2' can be written as val teenagers = people.where('age >= 10).where('month === 2).select('name) How do we write 'SELE

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Takuya UESHIN
Hi Prem, You can write like: people.where('age >= 10 && 'month === 2).select('name) people.where('age >= 10 || 'month === 2).select('name) people.where(In('month, Seq(2, 6))).select('name) The last one needs to import catalyst.expressions package. Thanks. 2014-07-10 22:15 GMT+09:00 pre

Re: Restarting a Streaming Context

2014-07-10 Thread Nicholas Chammas
Okie doke. Thanks for the confirmation, Burak and Tathagata. On Thu, Jul 10, 2014 at 2:23 AM, Tathagata Das wrote: > I confirm that is indeed the case. It is designed to be so because it > keeps things simpler - less chances of issues related to cleanup when > stop() is called. Also it keeps t

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Daniel Siegmann
One thing to keep in mind is that the progress bar doesn't take into account tasks which are rerun. If you see 4/4 but the stage is still active, click the stage name and look at the task list. That will show you if any are actually running. When rerun tasks complete, it can result in the number of

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread premdass
Thanks Takuya, works like a charm. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-Language-Integrated-query-OR-clause-and-IN-clause-tp9298p9303.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Terminal freeze during SVM

2014-07-10 Thread AlexanderRiggers
Tried the newest branch, but still get stuck on the same task: (kill) runJob at SlidingRDD.scala:74 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Terminal-freeze-during-SVM-Broken-pipe-tp9022p9304.html Sent from the Apache Spark User List mailing list ar

Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Bertrand Dechoux
A patch proposal on the apache JIRA for Spark? https://issues.apache.org/jira/browse/SPARK/ Bertrand On Thu, Jul 10, 2014 at 2:37 PM, Rahul Bhojwani wrote: > And also that there is a small bug in implementation. As I mentioned this > earlier also. > > This is my first time I am reporting some

Re: Yay for 1.0.0! EC2 Still has problems.

2014-07-10 Thread nit
I am also running into the "modules/mod_authn_alias.so" issue on r3.8xlarge when launching a cluster with ./spark-ec2, so ganglia is not accessible. From the posts it seems that Patrick suggested using Ubuntu 12.04. Can you please provide the name of an AMI that can be used with the -a flag that will not have this

Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
Hi Spark developers, I have the following hqls that spark will throw exceptions of this kind: 14/07/10 15:07:55 INFO TaskSetManager: Loss was due to org.apache.spark.TaskKilledException [duplicate 17] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:736 failed 4 times, m

GraphX: how to specify partition strategy?

2014-07-10 Thread Yifan LI
Hi, I am doing graph computation using GraphX, but there seems to be an error with the graph partition strategy specification. As in the "GraphX programming guide": The Graph.partitionBy operator allows users to choose the graph partitioning strategy, but due to SPARK-1931, this method is broken in Spark 1.0

How to read saved model?

2014-07-10 Thread rohitspujari
Hello Folks: I attended the session Aron D did at Hadoop Summit. He mentioned training the model and saving it on HDFS; when you start scoring, you can read the saved model. So, I can save the model using sc.makeRDD(model.clusterCenters).saveAsObjectFile(model) But when I try to read the mo
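Since the centers were written with saveAsObjectFile, a hedged sketch of reading them back (sc.objectFile is the typed counterpart; the "model" path echoes the question):

    import org.apache.spark.mllib.linalg.Vector

    val centers: Array[Vector] = sc.objectFile[Vector]("model").collect()
    // score a point by picking the nearest center by squared distance
    def nearest(p: Vector): Int = centers.indices.minBy { i =>
      p.toArray.zip(centers(i).toArray).map { case (a, b) => (a - b) * (a - b) }.sum
    }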

SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread yadid
Hi All, I have a class with too many variables to be implemented as a case class, therefore I am using a regular class that implements Scala's Product interface, like so: class Info () extends Product with Serializable { var param1 : String = "" var param2 : String = "" ... var param
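The preview truncates the class; a hedged sketch of what a complete Product implementation looks like (two fields shown, where the real class would enumerate all ~50):

    class Info extends Product with Serializable {
      var param1: String = ""
      var param2: String = ""
      def canEqual(that: Any): Boolean = that.isInstanceOf[Info]
      def productArity: Int = 2
      def productElement(n: Int): Any = n match {
        case 0 => param1
        case 1 => param2
        case _ => throw new IndexOutOfBoundsException(n.toString)
      }
    }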

Re: Does the MLlib Naive Bayes implementation incorporate Laplace smoothing?

2014-07-10 Thread Rahul Bhojwani
I have created the issue: "In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug" Have a look at it. Thanks, On Thu, Jul 10, 2014 at 8:37 PM, Bertrand Dechoux wrote: > A patch proposal on the apache JIRA for Spark? > https://issues.apache.org/jira/browse/SPARK

Re: Use Spark Streaming to update result whenever data come

2014-07-10 Thread Bill Jay
Tobias, Your help on the problems I have met have been very helpful. Thanks a lot! Bill On Wed, Jul 9, 2014 at 6:04 PM, Tobias Pfeiffer wrote: > Bill, > > good to know you found your bottleneck. Unfortunately, I don't know how to > solve this; until know, I have used Spark only with embarassi

Re: KMeans code is rubbish

2014-07-10 Thread Xiangrui Meng
SparkKMeans is a naive implementation. Please use mllib.clustering.KMeans in practice. I created a JIRA for this: https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das wrote: > I ran the SparkKMeans example (not the mllib KMeans that Sean ran) w

Re: Spark job tracker.

2014-07-10 Thread abhiguruvayya
Hi Mayur, Thanks so much for the explanation. It did help me. Is there a way I can print these details to the console rather than logging them? As of now, once I start my application I see this: 14/07/10 00:48:20 INFO yarn.Client: Application report from ASM: application identifier: ap

Re: Terminal freeze during SVM

2014-07-10 Thread Xiangrui Meng
news20.binary's feature dimension is 1.35M, so the serialized task size is above the default limit of 10M. You need to set spark.akka.frameSize to, e.g., 20. Due to a bug, SPARK-1112, this parameter is not passed to executors automatically, which causes Spark to freeze. This was fixed in the latest master
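A minimal sketch of setting that property (spark.akka.frameSize is in MB and must be set before the SparkContext is created):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("SVM on news20.binary")
      .set("spark.akka.frameSize", "20")  // MB; the default is 10
    val sc = new SparkContext(conf)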

Re: Spark job tracker.

2014-07-10 Thread Marcelo Vanzin
That output means you're running in yarn-cluster mode. So your code is running inside the ApplicationMaster and has no access to the local terminal. If you want to see the output: - try yarn-client mode, then your code will run inside the launcher process - check the RM web ui and look at the logs

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Aris Vlasakakis
Thank you very much Yana for replying! So right now the setup is a single-node machine which is my "cluster", and YES you are right, my submitting laptop has a different path to the spark-1.0.0 installation than the "cluster" machine. I tried to set SPARK_HOME on my submitter laptop using the act

Recommended pipeline automation tool? Oozie?

2014-07-10 Thread k.tham
I'm just wondering what's the general recommendation for data pipeline automation. Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc... It looks like Oozie might be the best choice. But I'd like some advice/suggest

Running Spark on Yarn vs Mesos

2014-07-10 Thread k.tham
What do people usually do for this? It looks like Yarn might be the simplest since the Cloudera distribution already installs this for you when you install hadoop. Any advantages of using Mesos instead? Thanks. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.

Stateful RDDs?

2014-07-10 Thread Sargun Dhillon
So, one portion of our Spark streaming application requires some state. Our application takes a bunch of application events (i.e. user_session_started, user_session_ended, etc..), and calculates out metrics from these, and writes them to a serving layer (see: Lambda Architecture). Two related event

Re: Some question about SQL and streaming

2014-07-10 Thread hsy...@gmail.com
Yes, this is what I tried, but thanks! On Wed, Jul 9, 2014 at 6:02 PM, Tobias Pfeiffer wrote: > Siyuan, > > I do it like this: > > // get data from Kafka > val ssc = new StreamingContext(...) > val kvPairs = KafkaUtils.createStream(...) > // we need to wrap the data in a case class for register

Difference between SparkSQL and shark

2014-07-10 Thread hsy...@gmail.com
I have a newbie question. What is the difference between SparkSQL and Shark? Best, Siyuan

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Paul Brown
We use Luigi for this purpose. (Our pipelines are typically on AWS (no EMR) backed by S3 and using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what is invoked from Luigi.) — p...@mult.ifario.us | Multi

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I ran the example with "./bin/run-example SparkKMeans file.txt 2 0.001" I get this response: "Finished iteration (delta = 0.0) Final centers: DenseVector(2.8571428571428568, 2.0) DenseVector(5.6005, 2.0) " The start point is not random. It uses the first K points from the given set O

Re: Difference between SparkSQL and shark

2014-07-10 Thread Nicholas Chammas
In short, Spark SQL is the future, built from the ground up. Shark was built as a drop-in replacement for Hive, will be retired, and will perhaps be replaced by a future initiative to run Hive on Spark. More info: - http://databricks.com/bl

RE: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Haoming Zhang
Hi Yadid, I have the same problem as you, so I implemented the Product interface as well; my code is similar to yours. But now I face another problem: I don't know how to run the code... My whole program is like this: object SimpleApp { class Record(val x1: String, va

Re: KMeans code is rubbish

2014-07-10 Thread Wanda Hawk
I am running spark-1.0.0 with java 1.8 "java version "1.8.0_05" Java(TM) SE Runtime Environment (build 1.8.0_05-b13) Java HotSpot(TM) 64-Bit Server VM (build 25.5-b02, mixed mode)" "which spark-shell ~/bench/spark-1.0.0/bin/spark-shell" "which scala ~/bench/scala-2.10.4/bin/scala" On Thursday,

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Andrew Or
Setting SPARK_HOME is not super effective, because it is overridden very quickly by bin/spark-submit here . Instead you should set the config "spark.home". Here's why: Each of your executors inherit

Generic Interface between RDD and DStream

2014-07-10 Thread mshah
I wanted to get a perspective on how to share code between Spark batch processing and Spark Streaming. For example, I want to compute unique tweets and store them in an HDFS file, from both Spark batch and Spark Streaming. Currently I would have to do the following: Tweet { String tweetText; String use
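A hedged sketch of the usual workaround: write the logic once against RDDs and reuse it in streaming via DStream.transform (Tweet and uniqueTweets are illustrative names based on the question):

    import org.apache.spark.rdd.RDD

    case class Tweet(tweetText: String, userId: String)

    // batch logic written once against RDD
    def uniqueTweets(rdd: RDD[Tweet]): RDD[Tweet] =
      rdd.map(t => (t.tweetText, t)).reduceByKey((a, _) => a).values

    // batch:     uniqueTweets(sc.objectFile[Tweet]("hdfs:///tweets"))
    // streaming: tweetStream.transform(uniqueTweets _)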

Re: Difference between SparkSQL and shark

2014-07-10 Thread Du Li
On the spark 1.0-jdbc branch, there are a thrift server and a beeline CLI that roughly keep the Shark style, corresponding to the Shark server and Shark CLI, respectively. Check out https://github.com/apache/spark/blob/branch-1.0-jdbc/docs/sql-programming-guide.md for more information. Du Fro

sparkStaging

2014-07-10 Thread Koert Kuipers
in spark 1.0.0 using yarn-client mode i am seeing that the sparkStaging directories do not get cleaned up. for example i run: $ spark-submit --class org.apache.spark.examples.SparkPi spark-examples-1.0.0-hadoop2.3.0-cdh5.0.2.jar 10 after which i have this directory left behind with one file in it

Re: All of the tasks have been completed but the Stage is still shown as "Active"?

2014-07-10 Thread Andrew Or
Yes, there are a few bugs in the UI in the event of a node failure. The duplicated stages in both the active and completed tables should be fixed by this PR: https://github.com/apache/spark/pull/1262 The fact that the progress bar on the stages page displays an overflow (e.g. 5/4) is still an open

How are the executors used in Spark Streaming in terms of receiver and driver program?

2014-07-10 Thread Yan Fang
Hi all, I am working to improve the parallelism of the Spark Streaming application. But I have problems understanding how the executors are used and how the application is distributed. 1. In YARN, is one executor equal to one container? 2. I saw the statement that a streaming receiver runs on one wor

Re: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Zongheng Yang
Hi Haoming, For your spark-submit question: can you try using an assembly jar ("sbt/sbt assembly" will build it for you)? Another thing to check is whether there is any package structure that contains your SimpleApp; if so, you should include the hierarchical name. Zongheng On Thu, Jul 10, 2014 at 11:33

Re: GraphX: how to specify partition strategy?

2014-07-10 Thread Ankur Dave
On Thu, Jul 10, 2014 at 8:20 AM, Yifan LI wrote: > > - how to "build the latest version of Spark from the master branch, which > contains a fix"? Instead of downloading a prebuilt Spark release from http://spark.apache.org/downloads.html, follow the instructions under "Development Version" on th

Use of the SparkContext.hadoopRDD function in Scala code

2014-07-10 Thread Nick R. Katsipoulakis
Hello, I want to run an MLlib task in the Scala API that creates a hadoopRDD from a CustomInputFormat. According to the Spark API: def hadoopRDD[K, V](conf: JobConf, inputFormatClass: Class[_ <: org.apache.hadoop.mapred.InputFormat[K,V]], keyClass: Class[K], valueClass: Class[V], minSplits: Int): RDD
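A hedged sketch of calling it (CustomInputFormat stands in for the poster's class and is assumed to implement the old org.apache.hadoop.mapred.InputFormat[LongWritable, Text]):

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///path/to/input")
    val records = sc.hadoopRDD(jobConf, classOf[CustomInputFormat],
      classOf[LongWritable], classOf[Text], 2)  // 2 = minSplits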

How to RDD.take(middle 10 elements)

2014-07-10 Thread Nick Chammas
Interesting question on Stack Overflow: http://stackoverflow.com/q/24677180/877069 Basically, is there a way to take() elements of an RDD at an arbitrary index? Nick -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-RDD-take-middle-10-elements-tp93

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread Tathagata Das
The implementation of the input-stream-to-iterator function in #2 is incorrect. The function should be such that, when hasNext is called on the iterator, it should try to read from the buffered reader. If an object (that is, a line) can be read, then return it; otherwise block and wait for data t
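A hedged sketch of such a function (block inside hasNext until a line is available; a null from readLine signals end of stream):

    import java.io.{BufferedReader, InputStream, InputStreamReader}

    def bytesToLines(in: InputStream): Iterator[String] = new Iterator[String] {
      private val reader = new BufferedReader(new InputStreamReader(in))
      private var line: String = null
      def hasNext: Boolean = {
        if (line == null) line = reader.readLine()  // blocks until data or EOF
        line != null
      }
      def next(): String = { val l = line; line = null; l }
    }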

Re: How to RDD.take(middle 10 elements)

2014-07-10 Thread Xiangrui Meng
This is expensive but doable: rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect() -Xiangrui On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas wrote: > Interesting question on Stack Overflow: > http://stackoverflow.com/q/24677180/877069 > > Basically, is there a way to ta
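Spelled out, with the index dropped afterwards (a sketch against the 1.0 API; zipWithIndex itself launches a job to count partition sizes, hence the cost):

    val middleTen = rdd.zipWithIndex()
      .filter { case (_, idx) => idx >= 10 && idx < 20 }  // elements 10..19
      .map { case (elem, _) => elem }
      .collect()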

writing FLume data to HDFS

2014-07-10 Thread Sundaram, Muthu X.
I am new to Spark. I am trying to do the following: Netcat-->Flume-->Spark Streaming (process Flume data)-->HDFS. My Flume config file has the following setup: Source = netcat, Sink = avrosink. Spark Streaming code: I am able to print data from Flume to the monitor, but I am struggling to create a fil

Re: Potential bugs in SparkSQL

2014-07-10 Thread Stephen Boesch
Hi Jerry, To add to your question: the following does work (from master) - notice the registerAsTable is commented out (I took the liberty of adding the "order by" clause): val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) import hiveContext._ hql("USE test") // hql("select id from m").register

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread Andrew Or
Hi C.J., The PR Yana pointed out seems to fix this. However, it is not merged in master yet, so for now I would recommend that you try the following workaround: set "spark.home" to the executor's /path/to/spark. I provided more detail here: http://mail-archives.apache.org/mod_mbox/spark-user/20140

Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users and developers, I'm doing some simple benchmarks with my team and we found a potential performance issue using Hive via SparkSQL. It is very bothersome, so your help in understanding why it is terribly slow is very important. First, we have some text files in HDFS which ar

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
By the way, I also try hql("select * from m").count. It is terribly slow too. On Thu, Jul 10, 2014 at 5:08 PM, Jerry Lam wrote: > Hi Spark users and developers, > > I'm doing some simple benchmarks with my team and we found out a potential > performance issue using Hive via SparkSQL. It is very

Re: RDD registerAsTable gives error on regular scala class records

2014-07-10 Thread Thomas Robert
Hi, I'm quite a Spark newbie so I might be wrong but I think that registerAsTable works either on case classes or on classes extending Product. You find this info in an example on the doc page of Spark SQL: http://spark.apache.org/docs/latest/sql-programming-guide.html // Define the schema using
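The quoted doc example is cut off; the documented pattern is roughly this (a sketch following the Spark 1.0 SQL guide; the file name and fields are illustrative):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[<: Product] => SchemaRDD

    case class Person(name: String, age: Int)
    val people = sc.textFile("people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
    people.registerAsTable("people")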

RE: SPARKSQL problem with implementing Scala's Product interface

2014-07-10 Thread Haoming Zhang
Hi Zongheng, Thanks a lot for your reply. I was editing my code in my group project and forgot to remove the package declaration... How silly! Regards, Haoming > Date: Thu, 10 Jul 2014 12:00:40 -0700 > Subject: Re: SPARKSQL problem with implementing Scala's Product interface > From: zonghen..

What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Nick Chammas
Looks like twitter4j 2.2.6 is what works, but I don’t believe it’s documented anywhere. Using 3.0.6 works for a while, but then causes the following error: 14/07/10 18:34:13 WARN ReceiverTracker: Error reported by receiver for stream 0: Error in block pushing threa

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Spark users, Also, to put the performance issue into perspective, we also ran the query on Hive. It took about 5 minutes to run. Best Regards, Jerry On Thu, Jul 10, 2014 at 5:10 PM, Jerry Lam wrote: > By the way, I also try hql("select * from m").count. It is terribly slow > too. > > >

Multiple SparkContexts with different configurations in same JVM

2014-07-10 Thread Philip Ogren
In various previous versions of Spark (and I believe the current version, 1.0.0, as well) we have noticed that it does not seem possible to have a "local" SparkContext and a SparkContext connected to a cluster via either a Spark Cluster (i.e. using the Spark resource manager) or a YARN cluster.

incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread SK
Hi, I have a csv data file, which I have organized in the following format to be read as a LabeledPoint(following the example in mllib/data/sample_tree_data.csv): 1,5.1,3.5,1.4,0.2 1,4.9,3,1.4,0.2 1,4.7,3.2,1.3,0.2 1,4.6,3.1,1.5,0.2 The first column is the binary label (1 or 0) and the remainin
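One thing to check (a hedged note): MLUtils.loadLabeledData in 1.0 expects lines of the form "label,f1 f2 f3", with the features space-separated after a single comma. If the file is fully comma-separated, a sketch of parsing it manually:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    val points = sc.textFile("sample_tree_data.csv").map { line =>
      val cols = line.split(',').map(_.toDouble)
      LabeledPoint(cols.head, Vectors.dense(cols.tail))  // first column is the label
    }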

Re: Recommended pipeline automation tool? Oozie?

2014-07-10 Thread Andrei
I used both - Oozie and Luigi - but found them inflexible and still overcomplicated, especially in presence of Spark. Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch Hive query, but Impala, Shark/SparkSQL, etc. are out of scope (of course, you can a

Re: Cannot submit to a Spark Application to a remote cluster Spark 1.0

2014-07-10 Thread Aris Vlasakakis
Andrew, thank you so much! That worked! I had to manually set the spark.home configuration in the SparkConf object using .set("spark.home","/cluster/path/to/spark/"), and then I was able to submit from my laptop to the cluster! Aris On Thu, Jul 10, 2014 at 11:41 AM, Andrew Or wrote: > Setting

SparkR failed to connect to the master

2014-07-10 Thread cjwang
I have a cluster running. I was able to run Spark Shell and submit programs. But when I tried to use SparkR, I got these errors: wifi-orcus:sparkR cwang$ MASTER=spark://wifi-orcus.dhcp.carrieriq.com:7077 sparkR R version 3.1.0 (2014-04-10) -- "Spring Dance" Copyright (C) 2014 The R Foundation f

Submitting to a cluster behind a VPN, configuring different IP address

2014-07-10 Thread Aris Vlasakakis
Hi Spark folks, So on our production Spark cluster, it lives in the data center and I need to attach to a VPN from my laptop, so that I can then submit a Spark application job to the Spark Master (behind the VPN). However, the problem arises that I have a local IP address on the laptop which is o

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
On Thu, Jul 10, 2014 at 2:08 PM, Jerry Lam wrote: > > For the curious mind, the dataset is about 200-300GB and we are using 10 > machines for this benchmark. Given the env is equal between the two > experiments, why pure spark is faster than SparkSQL? > There is going to be some overhead to parsi

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hi Jerry, Thanks for reporting this. It would be helpful if you could provide the output of the following command: println(hql("select s.id from m join s on (s.id=m_id)").queryExecution) Michael On Thu, Jul 10, 2014 at 8:15 AM, Jerry Lam wrote: > Hi Spark developers, > > I have the followin

Re: SparkSQL - Language Integrated query - OR clause and IN clause

2014-07-10 Thread Michael Armbrust
I'll add that the SQL parser is very limited right now, and that you'll get much wider coverage using hql inside of HiveContext. We are working on bringing sql() much closer to SQL-92 though in the future. On Thu, Jul 10, 2014 at 7:28 AM, premdass wrote: > Thanks Takuya . works like a Charm >

Re: EC2 Cluster script. Shark install fails

2014-07-10 Thread Michael Armbrust
There is no version of Shark that is compatible with Spark 1.0, however, Spark SQL does come included automatically. More information here: http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html http://spark.apache.org/docs/latest/sql-programming-g

Re: What version of twitter4j should I use with Spark Streaming?

2014-07-10 Thread Tathagata Das
Spark Streaming uses twitter4j 3.0.3, so 3.0.6 should probably work fine. The exception that you are seeing is something that should be looked into. Can you give us more logs (especially executor logs) with the stack traces that have the error? TD On Thu, Jul 10, 2014 at 2:42 PM, Nick Chammas wrote: >

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-10 Thread Tathagata Das
How are you supplying the text file? On Wed, Jul 9, 2014 at 11:51 AM, M Singh wrote: > Hi Folks: > > I am working on an application which uses spark streaming (version 1.1.0 > snapshot on a standalone cluster) to process text file and save counters in > cassandra based on fields in each row. I

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Are you specifying the number of reducers in all the DStream.ByKey operations? If the reduce by key is not set, then the number of reducers used in the stages can keep changing across batches. TD On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay wrote: > Hi all, > > I have a Spark streaming job run
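Concretely, a sketch of pinning the reducer count in a streaming reduceByKey (300 echoes the poster's default-parallelism setting; pairs stands for any DStream of key-value pairs):

    import org.apache.spark.streaming.StreamingContext._  // pair-DStream functions

    // an explicit numPartitions keeps stage parallelism stable across batches
    val counts = pairs.reduceByKey((a: Int, b: Int) => a + b, 300)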

Re: "NoSuchElementException: key not found" when changing the window lenght and interval in Spark Streaming

2014-07-10 Thread Tathagata Das
This bug has been fixed. Either use the master branch of Spark, or maybe wait a few days for Spark 1.0.1 to be released (voting has successfully closed). TD On Thu, Jul 10, 2014 at 2:33 AM, richiesgr wrote: > Hi > > I get exactly the same problem here, do you've found the problem ? > Thanks >

Re: Spark Streaming using File Stream in Java

2014-07-10 Thread Tathagata Das
The fileStream is not designed to work with continuously updating files, as one of the main design goals of Spark is immutability (to guarantee fault-tolerance by recomputation), and files that are appended to (mutated) defeat that. It is rather designed to pick up new files added atomically (using
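A hedged sketch of the atomic-rename approach from a Hadoop client (write the file into a staging directory, then rename it into the watched one; paths are illustrative):

    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(sc.hadoopConfiguration)
    // rename is atomic within a filesystem, so fileStream sees the file once, complete
    fs.rename(new Path("/staging/events-0001.txt"), new Path("/watched/events-0001.txt"))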

Re: Number of executors change during job running

2014-07-10 Thread Bill Jay
Hi Tathagata, I set default parallelism as 300 in my configuration file. Sometimes there are more executors in a job. However, it is still slow. And I further observed that most executors take less than 20 seconds but two of them take much longer such as 2 minutes. The data size is very small (les

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Jerry Lam
Hi Michael, Yes the table is partitioned on 1 column. There are 11 columns in the table and they are all String type. I understand that SerDes contributes to some overheads but using pure Hive, we could run the query about 5 times faster than SparkSQL. Given that Hive also has the same SerDes ove

Re: Potential bugs in SparkSQL

2014-07-10 Thread Jerry Lam
Hi Michael, I got the log you asked for. Note that I manually edited the table name and the field names to hide some sensitive information. == Logical Plan == Project ['s.id] Join Inner, Some((id#106 = 'm.id)) Project [id#96 AS id#62] MetastoreRelation test, m, None MetastoreRelation test

Re: incorrect labels being read by MLUtils.loadLabeledData()

2014-07-10 Thread Yana Kadiyska
I do not believe the order of points in a distributed RDD is in any way guaranteed. For a simple test, you can always add a last column which is an id (make it double and throw it in the feature vector). Printing the rdd back will not give you the points in file order. If you don't want to go that

Re: Using HQL is terribly slow: Potential Performance Issue

2014-07-10 Thread Michael Armbrust
Yeah, sorry. I think you are seeing some weirdness with partitioned tables that I have also seen elsewhere. I've created a JIRA and assigned someone at databricks to investigate. https://issues.apache.org/jira/browse/SPARK-2443 On Thu, Jul 10, 2014 at 5:33 PM, Jerry Lam wrote: > Hi Michael, >

Re: Number of executors change during job running

2014-07-10 Thread Tathagata Das
Can you try setting the number-of-partitions in all the shuffle-based DStream operations, explicitly. It may be the case that the default parallelism (that is, spark.default.parallelism) is probably not being respected. Regarding the unusual delay, I would look at the task details of that stage in

Re: Potential bugs in SparkSQL

2014-07-10 Thread Michael Armbrust
Hmm, yeah, looks like the table name is not getting applied to the attributes of m. You can work around this by rewriting your query as: hql("select s.id from (SELECT * FROM m) m join s on (s.id=m.id) order by s.id") This explicitly gives the alias m to the attributes of that table. You can also op

Re: Some question about SQL and streaming

2014-07-10 Thread Tobias Pfeiffer
Hi, I think it would be great if we could do the string parsing only once and then just apply the transformation for each interval (reducing the processing overhead for short intervals). Also, one issue with the approach above is that transform() has the following signature: def transform(tran

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread cjwang
Andrew, Thanks for replying. I did the following and the result was still the same. 1. Added "spark.home /root/spark-1.0.0" to local conf/spark-defaults.conf, where "/root" was the place in the cluster where I put Spark. 2. Ran "bin/spark-shell --master spark://sjc1-eng-float01.carrieriq.co

Re: Some question about SQL and streaming

2014-07-10 Thread Tathagata Das
Yeah, the right solution is to have something like SchemaDStream, where the schema of all the schemaRDD generated by it can be stored. Something I really would like to see happen in the future :) TD On Thu, Jul 10, 2014 at 6:37 PM, Tobias Pfeiffer wrote: > Hi, > > I think it would be great if

Re: Getting Persistent Connection using socketStream?

2014-07-10 Thread kytay
Hi TD, Thanks. I have problems understanding the code in GitHub, object SocketReceiver.bytesToLines(...) private[streaming] obje

RE: Some question about SQL and streaming

2014-07-10 Thread Shao, Saisai
Actually we have a POC project which shows the power of combining Spark Streaming and Catalyst, it can manipulate SQL on top of Spark Streaming and get SchemaDStream. You can take a look at it: https://github.com/thunderain-project/StreamSQL Thanks Jerry From: Tathagata Das [mailto:tathagata.d

Streaming. Cannot get socketTextStream to receive anything.

2014-07-10 Thread kytay
Hi, I am learning Spark Streaming and am trying out the JavaNetworkCount example. #1 - This is the code I wrote: JavaStreamingContext sctx = new JavaStreamingContext("local", appName, new Duration(5000)); JavaReceiverInputDStream lines = sctx.socketTextStream("127.0.0.1", ); JavaDStr
