Re: Example of Geoprocessing with Spark

2014-09-18 Thread Evan Chan
Hi Abel, Pretty interesting. May I ask how big your point CSV dataset is? It seems you are relying on a linear search through the FeatureCollection of polygons to find the one that intersects your point. This is going to be extremely slow. I highly recommend using a SpatialIndex, such as the many that exis
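For reference, a minimal sketch of the indexed lookup Evan describes, assuming JTS (the geometry library underlying GeoTools) and its STRtree; `polygons`, `lon`, and `lat` are placeholders for the user's data:

    import com.vividsolutions.jts.geom.{Coordinate, Geometry, GeometryFactory}
    import com.vividsolutions.jts.index.strtree.STRtree
    import scala.collection.JavaConverters._

    // Build the R-tree once over all polygons (polygons: Seq[Geometry], assumed).
    val index = new STRtree()
    polygons.foreach(p => index.insert(p.getEnvelopeInternal, p))

    // Per point: the index returns envelope candidates; an exact test filters them.
    val point = new GeometryFactory().createPoint(new Coordinate(lon, lat))
    val hit = index.query(point.getEnvelopeInternal).asScala
      .map(_.asInstanceOf[Geometry])
      .find(_.intersects(point))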

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-18 Thread Evan Chan
Sweet, that's probably it. Too bad it didn't seem to make 1.1? On Wed, Sep 17, 2014 at 5:32 PM, Michael Armbrust wrote: > The unknown slowdown might be addressed by > https://github.com/apache/spark/commit/f858f466862541c3faad76a1fa2391f1c17ec9dd > > On Sun, Sep 14, 2014 at 10:40 PM, Evan Chan

Re: diamond dependency tree

2014-09-18 Thread Victor Tso-Guillen
Yes, sorry, I meant DAG. I fixed it in my message but not the subject. The "leaf" terminology wasn't helpful, I know, so hopefully my visual example was enough. Anyway, I noticed what you said in a local-mode test. I can try that in a cluster, too. Thank you! On Thu, Sep 18, 2014 at 10:28 PM, Tobi

spark-submit command-line with --files

2014-09-18 Thread chinchu
Hi, I am running spark-1.1.0 and I want to pass in a file (that contains java serialized objects used to initialize my program) to the App main program. I am using the --files option but I am not able to retrieve the file in the main_class. It reports a null pointer exception. [I tried both local
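One hedged workaround, based on the documented SparkContext.addFile / SparkFiles API rather than --files (whose behavior differs across deploy modes): ship the file explicitly and resolve its executor-local path inside tasks. The file name "init.ser" is a hypothetical stand-in:

    import org.apache.spark.SparkFiles

    sc.addFile("hdfs:///path/to/init.ser")           // distribute to every node
    val loaded = sc.parallelize(1 to 10).map { x =>
      val localPath = SparkFiles.get("init.ser")     // executor-local copy
      // deserialize the java objects from localPath here
      x
    }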

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
It turns out Kryo doesn't play well with protobuf. Out of the box I see: com.esotericsoftware.kryo.KryoException: java.lang.UnsupportedOperationException Serialization trace: extra_ (com.foo.bar.MyMessage) com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSeria

Re: diamond dependency tree

2014-09-18 Thread Tobias Pfeiffer
Hi, On Thu, Sep 18, 2014 at 8:55 PM, Victor Tso-Guillen wrote: > >> Is it possible to express a diamond DAG and have the leaf dependency >> evaluate only once? >> > Well, strictly speaking your graph is not a "tree", and also the meaning of "leaf" is not totally clear, I'd say. > So say data fl

Re: diamond dependency tree

2014-09-18 Thread Victor Tso-Guillen
Caveat: all arrows are shuffle dependencies. On Thu, Sep 18, 2014 at 8:55 PM, Victor Tso-Guillen wrote: > Is it possible to express a diamond DAG and have the leaf dependency > evaluate only once? So say data flows left to right (and the dependencies > are oriented right to left): > > [image: In

diamond dependency tree

2014-09-18 Thread Victor Tso-Guillen
Is it possible to express a diamond DAG and have the leaf dependency evaluate only once? So say data flows left to right (and the dependencies are oriented right to left): [image: Inline image 1] Is it possible to run d.collect() and have a evaluate its iterator only once?

Re: Spark run slow after unexpected repartition

2014-09-18 Thread Tan Tim
I also encountered a similar problem: after some stages, all the tasks are assigned to one machine, and the stage execution gets slower and slower. *[the spark conf setting]* val conf = new SparkConf().setMaster(sparkMaster).setAppName("ModelTraining" ).setSparkHome(sparkHome).setJars(List(jarFi

Re: paging through an RDD that's too large to collect() all at once

2014-09-18 Thread Matei Zaharia
Hey Dave, try out RDD.toLocalIterator -- it gives you an iterator that reads one RDD partition at a time. Scala iterators also have methods like grouped() that let you get fixed-size groups. Matei On September 18, 2014 at 7:58:34 PM, dave-anderson (david.ander...@pobox.com) wrote: I have an R
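A minimal sketch of the pattern Matei describes; `pushDownstream` is a hypothetical stand-in for the user's throttled sink:

    val rdd = sc.parallelize(1 to 100000, 10)
    val it = rdd.toLocalIterator            // fetches one partition at a time
    it.grouped(1000).foreach { page =>      // fixed-size pages of 1000 elements
      page.foreach(pushDownstream)          // pushDownstream: the user's sink (hypothetical)
      Thread.sleep(100)                     // crude throttle, illustration only
    }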

paging through an RDD that's too large to collect() all at once

2014-09-18 Thread dave-anderson
I have an RDD on the cluster that I'd like to iterate over and perform some operations on each element (push data from each element to another downstream system outside of Spark). I'd like to do this at the driver so I can throttle the rate that I push to the downstream system (as opposed to submi

Re: Example of Geoprocessing with Spark

2014-09-18 Thread Abel Coronado Iruegas
Now I have a better version, but the problem is that saveAsTextFile does not finish the job; only a partial temporary file exists in the HDFS repository. Can someone tell me what is wrong? Thanks!! object SimpleApp { def main(args: Array[String]){ val conf = new Sp

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Tobias Pfeiffer
Hi, thanks for everyone's replies! > On Thu, Sep 18, 2014 at 7:37 AM, Sandy Ryza wrote: >> YARN cluster mode should have the behavior you're looking for. The client >> process will stick around to report on things, but should be able to be >> killed without affecting the application. If this i

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
hmm, would using Kryo help me here? On Thursday, September 18, 2014, Paul Wais wrote: > Ah, can one NOT create an RDD of any arbitrary Serializable type? It > looks like I might be getting bitten by the same > "java.io.ObjectInputStream uses root class loader only" bugs mentioned > in: > > *

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
Ah, can one NOT create an RDD of any arbitrary Serializable type? It looks like I might be getting bitten by the same "java.io.ObjectInputStream uses root class loader only" bugs mentioned in: * http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-td3259.html * ht

request to merge the pull request #1893 to master

2014-09-18 Thread freedafeng
We are working on a project that needs Python + Spark to work on HDFS and HBase data. We'd like to use a not-too-old version of HBase, such as HBase 0.98.x. We have tried many different ways (and platforms) to compile and test the Spark 1.1 official release, but got all sorts of issues. The only version t

Re: schema for schema

2014-09-18 Thread Eric Friedman
Thanks! On Thu, Sep 18, 2014 at 1:14 PM, Davies Liu wrote: > Thanks for reporting this, it will be fixed by > https://github.com/apache/spark/pull/2448 > > On Thu, Sep 18, 2014 at 12:32 PM, Michael Armbrust > wrote: > > This looks like a bug, we are investigating. > > > > On Thu, Sep 18, 2014 a

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
Well, it looks like Spark is just not loading my code into the driver/executors. E.g.: List<String> foo = bars.map( new Function<MyMessage, String>() { { System.err.println("classpath: " + System.getProperty("java.class.path")); CodeSource src = com.google.protobuf.Ge

AbstractMethodError when creating cassandraTable object

2014-09-18 Thread Emil Gustafsson
Pretty sure this is a result of me being new to Scala, Spark, and sbt, but I'm getting the error above when I try to use the Cassandra driver for Spark. I posted more information here: https://github.com/datastax/spark-cassandra-connector/issues/245 Ideas? /E

Re: Unable to load app logs for MLLib programs in history server

2014-09-18 Thread Xiangrui Meng
Could you create a JIRA for it? We can either remove special characters or encode with alphanumerics. -Xiangrui On Thu, Sep 18, 2014 at 3:50 PM, SK wrote: > Hi, > > The default log files for the Mllib examples use a rather long naming > convention that includes special characters like parentheses

Unable to load app logs for MLLib programs in history server

2014-09-18 Thread SK
Hi, The default log files for the MLlib examples use a rather long naming convention that includes special characters like parentheses and commas. For e.g., one of my log files is named "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032". When I click on the program on the hi

Re: Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-18 Thread Manish Amde
Hi Aris, Thanks for the interest. First and foremost, tree ensembles are a top priority for the 1.2 release and we are working hard towards it. A random forests PR is already under review and AdaBoost and gradient boosting will be added soon after.  Unfortunately, the GBDT branch I shared

PairRDD's lookup method Performance

2014-09-18 Thread Harsha HN
Hi All, My question is related to improving the performance of pairRDD's lookup method. I went through the link below, where Tathagata Das explains creating a hash map over partitions using "mappartitio
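A sketch of the per-partition hash map idea being referenced (a hedged reconstruction, since the linked explanation is truncated above):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3))
      .partitionBy(new HashPartitioner(4))

    // One hash map per partition, built once and cached.
    val indexed = pairs
      .mapPartitions(iter => Iterator(iter.toMap), preservesPartitioning = true)
      .cache()
    indexed.count()  // materialize the maps

    // Each lookup is now a hash probe per partition rather than a full scan.
    def fastLookup(key: String): Option[Int] =
      indexed.map(_.get(key)).collect().flatten.headOption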

Re: Spark + Mahout

2014-09-18 Thread Sean Owen
No, the architectures are entirely different. The Mahout implementations have been deprecated and are not being updated, so there won't be a port or anything. You would have to create these things from scratch on Spark if they don't already exist. On Sep 18, 2014 7:50 PM, "Daniel Takabayashi" wrot

Re: Huge matrix

2014-09-18 Thread Debasish Das
I am still a bit confused whether numbers like these can be aggregated as double: iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)) It should be aggregated using something like List[iVal*jVal, colMags(i), colMags(j)] I am not sure Algebird can aggregate deterministically over Do

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb, I am currently seeding the algorithm to be pseudo-random; this is an issue being addressed in the PR. If you pull the current version it will be deterministic, but potentially not pseudo-random. The PR will be updated today. Best, Reza On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das wrote:

Re: Huge matrix

2014-09-18 Thread Debasish Das
Hi Reza, Have you tested if different runs of the algorithm produce different similarities (basically if the algorithm is deterministic) ? This number does not look like a Monoid aggregation...iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j)) I am noticing some weird behavior as

Re: SVD on larger than taller matrix

2014-09-18 Thread Li Pu
The main bottleneck of the current SVD implementation is the memory of the driver node. It requires at least 5*n*k doubles in driver memory, because all right singular vectors are stored in driver memory and some working memory is required. So it is bounded by the smaller dimension of your matrix

Re: spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-18 Thread Yin Huai
Hello Andy, Will our JSON support in Spark SQL help your case? If your JSON files store one JSON object per line, you can use SQLContext.jsonFile to load them. If you want to pre-process these files, once you have an RDD[String] (one JSON object per String), you can use SQLContext.jsonRDD. In bot
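A sketch of both paths Yin mentions, against the Spark SQL 1.1 API; paths and table names are placeholders:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Files that already hold one JSON object per line:
    val events = sqlContext.jsonFile("hdfs:///data/events.json")

    // Or pre-process first, ending with an RDD[String] of JSON objects:
    val cleaned = sc.textFile("hdfs:///data/raw").map(_.trim).filter(_.nonEmpty)
    val events2 = sqlContext.jsonRDD(cleaned)

    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").collect()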

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Xiangrui Meng
We don't support kernels because it doesn't scale well. Please check "When to use LIBLINEAR but not LIBSVM" on http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html . I like Jey's suggestion on expanding features. -Xiangrui On Thu, Sep 18, 2014 at 12:29 PM, Jey Kottalam wrote: > Hi Aris, > > A s

Re: SVD on larger than taller matrix

2014-09-18 Thread Xiangrui Meng
Did you cache `features`? Without caching it is slow because we need O(k) iterations. The storage requirement on the driver is about 2 * n * k = 2 * 3 million * 200 ~= 9GB, not considering any overhead. Computing U is also an expensive task in your case. We should use some randomized SVD implementa
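A sketch of the caching Xiangrui suggests, skipping U since computing it is the expensive part; the toy vectors stand in for the user's sparse rows:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val features = sc.parallelize(Seq(
      Vectors.sparse(5, Seq((0, 1.0), (3, 2.0))),
      Vectors.sparse(5, Seq((1, 1.5), (4, 0.5)))
    )).cache()                      // avoid recomputing the input on each of the O(k) iterations

    val mat = new RowMatrix(features)
    val svd = mat.computeSVD(2, computeU = false)   // singular values and V only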

Re: MLLib regression model weights

2014-09-18 Thread Xiangrui Meng
The importance should be based on some statistics, for example, the standard deviation of the feature column and the magnitude of the weight. If the columns are scaled to unit standard deviation (using StandardScaler), you can tell the importance by the absolute value of the weight. But there are o
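A sketch of the scaling step being described, using the StandardScaler added in 1.1; the two-point data set is a toy placeholder:

    import org.apache.spark.mllib.feature.StandardScaler
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Vectors.dense(10.0, 0.1)),
      LabeledPoint(0.0, Vectors.dense(20.0, 0.3))
    ))

    // Scale each feature column to unit standard deviation.
    val scaler = new StandardScaler(withMean = false, withStd = true)
      .fit(data.map(_.features))
    val scaled = data.map(p => LabeledPoint(p.label, scaler.transform(p.features)))
    // After training on `scaled`, |weight| is comparable across features.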

Re: schema for schema

2014-09-18 Thread Davies Liu
Thanks for reporting this, it will be fixed by https://github.com/apache/spark/pull/2448 On Thu, Sep 18, 2014 at 12:32 PM, Michael Armbrust wrote: > This looks like a bug, we are investigating. > > On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman > wrote: >> >> I have a SchemaRDD which I've gotten

Re: Joining multiple rowMatrix

2014-09-18 Thread Xiangrui Meng
You can use CoGroupedRDD (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.CoGroupedRDD) directly. -Xiangrui On Thu, Sep 18, 2014 at 7:09 AM, Debasish Das wrote: > Hi, > > I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j) > and I would like
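For a handful of matrices, the public cogroup API expresses the same idea as using CoGroupedRDD directly; a toy sketch keyed by (i, j):

    import org.apache.spark.SparkContext._

    val a = sc.parallelize(Seq(((0L, 1L), 1.0), ((2L, 3L), 4.0)))
    val b = sc.parallelize(Seq(((0L, 1L), 2.0)))
    val c = sc.parallelize(Seq(((0L, 1L), 3.0), ((2L, 3L), 5.0)))

    // One shuffle groups all three matrices by key instead of chaining pairwise joins.
    val joined = a.cogroup(b, c)
    // RDD[((Long, Long), (Iterable[Double], Iterable[Double], Iterable[Double]))]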

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread Xiangrui Meng
Hi Jatin, HashingTF should be able to solve the memory problem if you use a small feature dimension in HashingTF. Please do not cache the input document, but cache the output from HashingTF and IDF instead. We don't have a label indexer yet, so you need a label to index map to map it to double val
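A sketch of the caching pattern Xiangrui describes, with the 1.1 mllib.feature API; the corpus path is a placeholder:

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val documents = sc.textFile("hdfs:///corpus").map(_.split(" ").toSeq)

    val hashingTF = new HashingTF(1 << 18)   // modest feature dimension
    val tf = hashingTF.transform(documents)
    tf.cache()                               // cache the hashed vectors, not the raw text

    val idfModel = new IDF().fit(tf)
    val tfidf = idfModel.transform(tf)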

SVD on larger than taller matrix

2014-09-18 Thread Glitch
I have a matrix of about 2 million+ rows with 3 million+ columns in svm format* and it's sparse. As I understand it, running SVD on such a matrix shouldn't be a problem since version 1.1. I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was able to compute the SVD for 20 si

Re: Spark + Mahout

2014-09-18 Thread Daniel Takabayashi
Yes, thanks, but I want to test using Fuzzy KMeans as an option. Is that possible? 2014-09-18 16:40 GMT-03:00 Huang, Roger : > Taka, > > Have you considered using Spark’s own MLlib k-means clustering? > > http://spark.apache.org/docs/latest/mllib-clustering.html > > > > Roger > > > > *From:* Daniel

RE: Spark + Mahout

2014-09-18 Thread Huang, Roger
Taka, Have you considered using Spark’s own MLlib k-means clustering? http://spark.apache.org/docs/latest/mllib-clustering.html Roger From: Daniel Takabayashi [mailto:takabaya...@scanboo.com.br] Sent: Thursday, September 18, 2014 1:50 PM To: user@spark.apache.org Subject: Spark + Mahout Hi guys,

spark 1.1 examples build failure on cdh 5.1

2014-09-18 Thread freedafeng
This is a mvn build. [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.1.0: Could not find artifact org.apache.hbase:hbase:jar:0.98.1 in central (https://repo1.maven.org/maven2) -> [Help 1] [ERROR]

Re: schema for schema

2014-09-18 Thread Michael Armbrust
This looks like a bug, we are investigating. On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman wrote: > I have a SchemaRDD which I've gotten from a parquetFile. > > Did some transforms on it and now want to save it back out as parquet > again. > > Getting a SchemaRDD proves challenging because some

Re: problem with HiveContext inside Actor

2014-09-18 Thread Du Li
I have figured it out. As shown in the code below, if the HiveContext hc were created in the actor object and used to create a db in response to a message, it would throw a null pointer exception. This is fixed by creating the HiveContext inside the MyActor class instead. I also tested the code by re

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Jey Kottalam
Hi Aris, A simple approach to gaining some of the benefits of an RBF kernel is to add synthetic features to your training set. For example, if your original data consists of 3-dimensional vectors [x, y, z], you could compute a new 9-dimensional feature vector containing [x, y, z, x^2, y^2, z^2, xy
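A sketch of the feature expansion Jey describes for 3-dimensional inputs; `trainingData` (an RDD[LabeledPoint]) is an assumed placeholder:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // [x, y, z] -> [x, y, z, x^2, y^2, z^2, xy, xz, yz]
    def expand(v: Array[Double]): Array[Double] = {
      val Array(x, y, z) = v
      Array(x, y, z, x * x, y * y, z * z, x * y, x * z, y * z)
    }

    val expanded = trainingData.map(p =>
      LabeledPoint(p.label, Vectors.dense(expand(p.features.toArray))))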

Re: Spark on EC2

2014-09-18 Thread Burak Yavuz
Hi Gilberto, Could you please attach the driver logs as well, so that we can pinpoint what's going wrong? Could you also add the flag `--driver-memory 4g` while submitting your application and try that as well? Best, Burak - Original Message - From: "Gilberto Lira" To: user@spark.apach

Re: Kafka Spark Streaming on Spark 1.1

2014-09-18 Thread JiajiaJing
Yeah, I forgot to build the new jar file for spark 1.1... And now the errors are gone. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-on-Spark-1-1-tp14597p14604.html Sent from the Apache Spark User List mailing l

Re: Kafka Spark Streaming on Spark 1.1

2014-09-18 Thread Tim Smith
What kafka receiver are you using? Did you build a new jar for your app with the latest streaming-kafka code for 1.1? On Thu, Sep 18, 2014 at 11:47 AM, JiajiaJing wrote: > Hi Spark Users, > > We just upgrade our spark version from 1.0 to 1.1. And we are trying to > re-run all the written and tes

Re: MLLib regression model weights

2014-09-18 Thread Debasish Das
sc.parallelize(model.weights.toArray, blocks).top(k) will get that, right? For logistic regression you might want both positive and negative features... so just pass it through a filter on abs and then pick top(k) On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak wrote: > Hi All, > > I am able to run LinearReg
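Spelling out the suggestion with the absolute-value ordering folded in; `model` is the trained model from the quoted message and k is arbitrary:

    val k = 20
    val weightsWithIndex = model.weights.toArray.zipWithIndex  // (weight, featureIndex)

    // Top-k features by |weight|, keeping the sign for inspection.
    val topK = sc.parallelize(weightsWithIndex)
      .top(k)(Ordering.by((p: (Double, Int)) => math.abs(p._1)))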

Spark + Mahout

2014-09-18 Thread Daniel Takabayashi
Hi guys, Is it possible to run Mahout k-means on top of the Spark infrastructure? Thanks, taka (Brazil)

Spark on EC2

2014-09-18 Thread Gilberto Lira
Hello, I am trying to run a Python script that makes use of MLlib's KMeans and I'm not getting anywhere. I'm using a c3.xlarge instance as master, and 10 c3.large instances as slaves. In the code I map a 600MB CSV file in S3, where each row has 128 integer columns. The problem is that a

Kafka Spark Streaming on Spark 1.1

2014-09-18 Thread JiajiaJing
Hi Spark Users, We just upgrade our spark version from 1.0 to 1.1. And we are trying to re-run all the written and tested projects we implemented on Spark 1.0. However, when we try to execute the spark streaming project that stream data from Kafka topics, it yields the following error message. I

Re: spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-18 Thread Andy Davidson
After lots of hacking I figured out how to resolve this problem. This is a good solution. It severely cripples Jackson, but at least for now I am unblocked. 1) turn off annotations. mapper.configure(Feature.USE_ANNOTATIONS, false); 2) in maven set the jackson dependencies as provided. 1.9

Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-18 Thread Aris
Thank you, Spark community, you make life much more lovely - suffering in silence is not fun! I am trying to build the Spark Git branch from Manish Amde, available here: https://github.com/manishamde/spark/tree/ada_boost I am trying to build the non-master branch 'ada_boost' (in the link above), b

Re: Adjacency List representation in Spark

2014-09-18 Thread Koert Kuipers
we build our own adjacency lists as well. the main motivation for us was that graphx has some assumptions about everything fitting in memory (it has .cache statements all over the place). however if my understanding is wrong and graphx can handle graphs that do not fit in memory i would be interested t

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Aris
Sorry to bother you guys, but does anybody have any ideas about the status of MLlib with a Radial Basis Function kernel for SVM? Thank you! On Tue, Sep 16, 2014 at 3:27 PM, Aris < wrote: > Hello Spark Community - > > I am using the support vector machine / SVM implementation in MLlib with > the

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Nicholas Chammas
And for the record, the issue is here: https://issues.apache.org/jira/browse/SPARK-3591 On Thu, Sep 18, 2014 at 1:19 PM, Andrew Or wrote: > Thanks Tobias, I have filed a JIRA for it. > > 2014-09-18 10:09 GMT-07:00 Patrick Wendell : > > I agree, that's a good idea Marcelo. There isn't AFAIK any r

Re: Huge matrix

2014-09-18 Thread Debasish Das
Yup that's what I did for now... On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh wrote: > Hi Deb, > > I am not templating RowMatrix/CoordinateMatrix since that would be a big > deviation from the PR. We can add jaccard and other similarity measures in > later PRs. > > In the meantime, you can un-no

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb, I am not templating RowMatrix/CoordinateMatrix since that would be a big deviation from the PR. We can add jaccard and other similarity measures in later PRs. In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measur
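The un-normalization Reza mentions just inverts the cosine formula; a one-line sketch using the colMags naming from this thread:

    // cosine(i, j) = dot(i, j) / (colMags(i) * colMags(j)), hence:
    def dotFromCosine(cosine: Double, magI: Double, magJ: Double): Double =
      cosine * magI * magJ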

Re: Sending multiple DStream outputs

2014-09-18 Thread contractor
One output is sort of a subset of the other so it didn't make much sense to spin up another Spark app for the same source. On 9/18/14, 11:19 AM, "Tim Smith" wrote: >Curious, if you have 1:1 mapping between Stream1:topic1 and >Stream2:topic2 then why not run different instances of the app for >ea

MLLib regression model weights

2014-09-18 Thread Sameer Tilak
Hi All, I am able to run LinearRegressionWithSGD on a small sample dataset (~60MB Libsvm file of sparse data) with 6700 features. val model = LinearRegressionWithSGD.train(examples, numIterations) At the end I get a model; model.weights.size gives res6: Int = 6699. I am assuming each entry in the mo

RE: Spark SQL Exception

2014-09-18 Thread Paul Magid
Michael: Thanks for the quick response. I can confirm that once I removed the “order by” clause the exception below went away. So, I believe this confirms what you were saying, and I will be opening a new feature request in JIRA. However, that exception was replaced by a java.lang.OutOfMemoryEr

Re: Odd error when using a rdd map within a stream map

2014-09-18 Thread Burak Yavuz
Hi, I believe it's because you're trying to use a Function of an RDD, in an RDD, which is not possible. Instead of using a `Function<JavaRDD<Float>, Void>`, could you try `Function<Float, Void>`, with `public Void call(Float arg0) throws Exception {` and `System.out.println(arg0)` instead? I'm not perfectly sure of the semantics i

RE: MLLib: LIBSVM issue

2014-09-18 Thread Sameer Tilak
Thanks, will try it out today. Date: Wed, 17 Sep 2014 23:04:31 -0700 Subject: Re: MLLib: LIBSVM issue From: debasish.da...@gmail.com To: bya...@stanford.edu CC: ssti...@live.com; user@spark.apache.org We dump fairly big libsvm to compare against liblinear/libsvm...the following code dumps out li

RE: MLLib: LIBSVM issue

2014-09-18 Thread Sameer Tilak
Thanks, Burak. Yes, the tab was an issue, and I was able to get it working after replacing it with a space. > Date: Wed, 17 Sep 2014 21:11:00 -0700 > From: bya...@stanford.edu > To: ssti...@live.com > CC: user@spark.apache.org > Subject: Re: MLLib: LIBSVM issue > > Hi, > > The spacing between the inp

Re: Sending multiple DStream outputs

2014-09-18 Thread Tim Smith
Curious, if you have 1:1 mapping between Stream1:topic1 and Stream2:topic2 then why not run different instances of the app for each and pass as arguments to each instance the input source and output topic? On Thu, Sep 18, 2014 at 8:07 AM, Padmanabhan, Mahesh (contractor) wrote: > Hi all, > > I am

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Andrew Or
Thanks Tobias, I have filed a JIRA for it. 2014-09-18 10:09 GMT-07:00 Patrick Wendell : > I agree, that's a good idea Marcelo. There isn't AFAIK any reason the > client needs to hang there for correct operation. > > On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin > wrote: > > Yes, what Sandy sai

Re: Stable spark streaming app

2014-09-18 Thread Tim Smith
Dibyendu - I am using the Kafka consumer built into Spark streaming. Pulled the jar from here: http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-streaming-kafka_2.10/1.0.0/spark-streaming-kafka_2.10-1.0.0.jar Thanks for the sbt-assembly link, Soumitra. On Wed, Sep 17, 2014 at

Re: Spot instances on Amazon EMR

2014-09-18 Thread Patrick Wendell
Hey Grzegorz, EMR is a service that is not maintained by the Spark community. So this list isn't the right place to ask EMR questions. - Patrick On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek wrote: > Hi, > I would like to run Spark application on Amazon EMR. I have some questions > about tha

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Patrick Wendell
I agree, that's a good idea Marcelo. There isn't AFAIK any reason the client needs to hang there for correct operation. On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin wrote: > Yes, what Sandy said. > > On top of that, I would suggest filing a bug for a new command line > argument for spark-submi

Re: Support R in Spark

2014-09-18 Thread Shivaram Venkataraman
As R is single-threaded, SparkR launches one R process per-executor on the worker side. Thanks Shivaram On Thu, Sep 18, 2014 at 7:49 AM, oppokui wrote: > Shivaram, > > As I know, SparkR used rJava package. In work node, spark code will execute R > code by launching R process and send/receive by

Re: SQL shell for Spark SQL?

2014-09-18 Thread Denny Lee
The CLI is the command line connection to SparkSQL and yes, SparkSQL replaces Shark - there’s a great article by Reynold on the Databricks blog that provides the context:  http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html As for SparkSQL and

Re: Spark SQL Exception

2014-09-18 Thread Michael Armbrust
It's failing to sort because the columns are of Binary type (though maybe we should support this as well). Is this parquet data that was generated by Impala that you would expect to be a String? If so, turn on spark.sql.parquet.binaryAsString
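The flag can be set on the SQLContext before reading the file; a minimal sketch with a placeholder path:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")  // read BINARY columns as String

    val table = sqlContext.parquetFile("hdfs:///warehouse/my_table")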

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Marcelo Vanzin
Yes, what Sandy said. On top of that, I would suggest filing a bug for a new command line argument for spark-submit to make the launcher process exit cleanly as soon as a cluster job starts successfully. That can be helpful for code that launches Spark jobs but monitors the job through different m

Spark Zmq issue in cluster mode

2014-09-18 Thread Hatch M
I have a spark streaming zmq application running fine in non-cluster mode. When running a local cluster and I do spark-submit, zero mq java client is choking. org.zeromq.ZMQException: No such file or directory at org.zeromq.ZMQ$Socket.raiseZMQException(ZMQ.java:480) at org.zeromq.ZMQ$Socket.recv(Z

Spark SQL Exception

2014-09-18 Thread Paul Magid
All: I am putting Spark SQL 1.1 through its paces (in a POC) and have been pleasantly surprised with what can be done with such a young technology. I have run into an exception (listed below) that I suspect relates to the number of columns in the table I am querying. There are 336 columns

Re: Cannot run SimpleApp as regular Java app

2014-09-18 Thread ericacm
Upgrading from spark-1.0.2-hadoop2 to spark-1.1.0-hadoop1 fixed my problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-run-SimpleApp-as-regular-Java-app-tp13695p14570.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread Frank Austin Nothaft
Mohan, You’ll need to register it; we register our serializer in lines 69 to 76 in https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala. Our serializer implementation falls back on the default Avro serializer; yo
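In outline, the registration hook Frank refers to looks like this; MyAvroRecord and MyAvroSerializer are hypothetical stand-ins for the user's Avro-generated class and an Avro-backed Kryo serializer such as ADAM's:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        // Pair the Avro-generated class with a serializer that delegates to Avro.
        kryo.register(classOf[MyAvroRecord], new MyAvroSerializer())
      }
    }

    // sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // sparkConf.set("spark.kryo.registrator", "com.example.MyKryoRegistrator")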

schema for schema

2014-09-18 Thread Eric Friedman
I have a SchemaRDD which I've gotten from a parquetFile. Did some transforms on it and now want to save it back out as parquet again. Getting a SchemaRDD proves challenging because some of my fields can be null/None and SQLContext.inferSchema rejects those. So, I decided to use the schema on the

Spark Streaming and ReactiveMongo

2014-09-18 Thread t1ny
Hello all, Spark newbie here. We are trying to use Spark Streaming (unfortunately stuck on version 0.9.1 of Spark) to stream data out of MongoDB. ReactiveMongo (http://reactivemongo.org/) is a Scala driver that enables you to stream a MongoDB capped collection (in our case, the Oplog). Given that Mongo

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
Thanks for the info, Frank. So your suggestion is to use the Avro serializer. I just have to configure it like Kryo for the same property? And is there any registering process for this, or do I just specify the serializer? Also, does it affect performance, and what measures should be taken to avoid that? (I'm using kryo

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread Frank Austin Nothaft
Hi Mohan, It’s been a while since I’ve looked at this specifically, but I don’t think the default Kryo serializer will properly serialize Avro. IIRC, there are complications around the way that Avro handles nullable fields, which would be consistent with the NPE you’re encountering here. That’s

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
Hi Frank, thanks for the info, that's great. But I'm not saying the Avro serializer is failing; Kryo is failing, and I'm using the Kryo serializer and registering the Avro-generated classes with Kryo. sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); sparkConf.set("spark.kr

Sending multiple DStream outputs

2014-09-18 Thread contractor
Hi all, I am using Spark 1.0 streaming to ingest a high-volume stream of data (approx. 1mm lines every few seconds), transform it into two outputs, and send those outputs to two separate Apache Kafka topics. I have two blocks of output code like this: Stream1 = …. Stream2 = … Stream1.foreac

Re: Stable spark streaming app

2014-09-18 Thread Soumitra Kumar
Refer to https://github.com/sbt/sbt-assembly to generate a jar with dependencies. I prefer not to build a big fat jar, since the bulk would be Hadoop-related, and I prefer to use what is installed on the host. - Original Message - From: "Tim Smith" Cc: "spark users" Sent: Wednesday, Septem

Re: Support R in Spark

2014-09-18 Thread oppokui
Shivaram, As I know, SparkR uses the rJava package. On the worker node, Spark code will execute R code by launching an R process and sending/receiving byte arrays. I have a question on when the R process is launched: is it per worker process, per executor thread, or per RDD processed? Thanks and Regards

Re: Adjacency List representation in Spark

2014-09-18 Thread Harsha HN
Hi Andrew, The only reason I avoided the GraphX approach is that I didn't see any explanation on the Java side, nor any API documentation for Java. Do you have any code piece using the GraphX API in Java? Thanks, Harsha On Wed, Sep 17, 2014 at 10:44 PM, Andrew Ash wrote: > Hi Harsha, > > You could look t

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread Frank Austin Nothaft
Mohan, I don’t think this is a Spark issue, rather, I think the issue is coming from your serializer. Can you point us to the serializer that you are using? We have no problems serializing complex Avro (nested schemas with unions and arrays) when using this serializer. You may also want to look

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Sandy Ryza
Hi Tobias, YARN cluster mode should have the behavior you're looking for. The client process will stick around to report on things, but should be able to be killed without affecting the application. If this isn't the behavior you're observing, and your application isn't failing for a different r

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
Added some more info on this issue in the tracker Spark-3447 https://issues.apache.org/jira/browse/SPARK-3447 - Thanks & Regards, Mohan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-fails-with-avro-having-Arrays-and-unions-but-succeeds-with-simple

Re: SQL shell for Spark SQL?

2014-09-18 Thread David Rosenstrauch
Is the cli in fact a full replacement for shark? The description says "The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode ...". The way I've used Shark in the past, however, is to run the shark shell on a client machine and connect it to a Hive metastore on

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Nicholas Chammas
Dunno about having the application be independent of whether spark-submit is still alive, but you can have spark-submit run in a new session on Linux using setsid. That way even if you terminate your SSH session, spark-submit will keep running independ

Re: problem with HiveContext inside Actor

2014-09-18 Thread Chester @work
Akka actors are managed by a thread pool, so the same actor can run under different threads. If you create the HiveContext in the actor, is it possible that you are essentially creating different instances of HiveContext? Sent from my iPhone > On Sep 17, 2014, at 10:14 PM, Du Li wrote: > > Thanks

RE: SchemaRDD and RegisterAsTable

2014-09-18 Thread Denny Lee
Could you clarify - when you’re connecting via beeline, aren’t you also connecting to the thrift server that generates the Spark context? It’s possible that the first time you query it is a little slower, as it needs to transfer the data from file / Hadoop / source into an RDD, and subsequent querie

Joining multiple rowMatrix

2014-09-18 Thread Debasish Das
Hi, I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j) and I would like to join multiple matrices to come up with a sqlTable for each key... What's the best way to do it ? Right now I am doing N joins if I want to combine data from N matrices which does not look quite r

Odd error when using a rdd map within a stream map

2014-09-18 Thread Filip Andrei
Here I wrote a simpler version of the code to get an understanding of how it works: final List<NeuralNet> nns = new ArrayList<NeuralNet>(); for(int i = 0; i < numberOfNets; i++){ nns.add(NeuralNet.createFrom(...)); } final JavaRDD<NeuralNet> nnRdd = sc.parallelize(nns); JavaDStream results = rndLists.flatMap(new FlatMapFu

Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
*I am facing similar issue to Spark-3447 with spark streaming Api, Kryo Serializer, Avro messages. If avro message is simple, its fine. but if the avro message has Union/Arrays its failing with the exception Below:* ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 or

RE: StackOverflowError

2014-09-18 Thread Shao, Saisai
Hi, Does your application fail in task deserialization? If so, this is a known issue in Spark caused by a too-long RDD dependency chain, which makes Java deserialization overflow the stack. Two ways to solve this issue: one is to use RDD’s checkpoint to cut the dependency chain; another is to e
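A sketch of the checkpointing fix Saisai describes, cutting the lineage every 50 iterations (the interval is arbitrary):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // any reliable directory

    var current = sc.parallelize(1 to 1000000)
    for (i <- 1 to 500) {
      current = current.map(_ + 1)
      if (i % 50 == 0) {
        current.cache()
        current.checkpoint()
        current.count()   // force materialization so the chain is actually cut
      }
    }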

Spark Package for Mesos

2014-09-18 Thread John Omernik
I know there is a script that builds a TGZ ready for Mesos, but I was wondering if there are switches and/or a methodology that would allow me to change some files and create the TGZ file without compiling... Just trying to understand what happens under the hood here, and ensure I include the

Re: SPARK BENCHMARK TEST

2014-09-18 Thread VJ Shalish
Hi, please can someone advise on this? On Wed, Sep 17, 2014 at 6:59 PM, VJ Shalish wrote: > I am trying to benchmark Spark in a Hadoop cluster. > I need to design a sample Spark job to test the CPU utilization, RAM > usage, input throughput, output throughput and duration of execution in the > cl

Re: StackOverflowError

2014-09-18 Thread Akhil Das
What were you trying to do? Thanks Best Regards On Thu, Sep 18, 2014 at 3:37 PM, gm yu wrote: > Exception in thread "main" org.apache.spark.SparkException: Job aborted > due to stage failure: Task 736952.0:2 failed 1 times, most recent failure: > Exception failure in TID 21006 on host localhost

New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread jatinpreet
Hi, I have been running into memory overflow issues while creating TFIDF vectors to be used in document classification using MLlib's Naive Bayes implementation. http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ Memory overfl

Re: [SparkStreaming] task failure with 'Unknown exception in doAs'

2014-09-18 Thread Gerard Maas
Found it! (with sweat on my forehead) The job was actually running on Mesos using a Spark 1.1.0 executor. I guess there's some incompatibility between the 1.0.2 and the 1.1 versions - still quite weird. -kr, Gerard. On Thu, Sep 18, 2014 at 12:29 PM, Gerard Maas wrote: > My Spark Streaming
