how to track the job status without the web UI

2014-09-18 Thread Tan Tim
Hi all, I can see the job failed from the web UI. But when I run ps on the client (the machine from which I submit the job), I find the process still exists: user_tt 5971 2.6 2.2 15030180 3029840 ? Sl 11:41 4:37 java -cp /var/bh/lib/spark-0.9.1-bin-hadoop1/assembly/target/scala-2.10/spark

Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread vasiliy
I posted an example in my previous post. Tested on Spark 1.0.2, 1.2.0-SNAPSHOT and 1.1.0 for Hadoop 2.4.0 on Windows and Linux servers with Hortonworks Hadoop 2.4 in local[4] mode. Any ideas about this Spark behavior? Akhil Das-2 wrote > Can you dump out a small piece of data? while doing rdd.colle

Re: collect on hadoopFile RDD returns wrong results

2014-09-18 Thread Reynold Xin
This is due to the HadoopRDD (and also the underlying Hadoop InputFormat) reuse objects to avoid allocation. It is sort of tricky to fix. However, in most cases you can clone the records to make sure you are not collecting the same object over and over again. https://issues.apache.org/jira/browse/
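A minimal sketch of the cloning workaround Reynold describes, assuming a SparkContext `sc` and an illustrative text input path: copy each reused Writable into an immutable value before collect().

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapred.TextInputFormat

    // HadoopRDD hands back the same Writable instances for every record,
    // so materialize an immutable copy of each one before collecting.
    val raw = sc.hadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/input")
    val safe = raw.map { case (offset, line) => (offset.get, line.toString) }
    safe.collect()  // each element is now an independent copy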

Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
Dear List, I'm writing an application where I have RDDs of protobuf messages. When I run the app via bin/spark-submit with --master local --driver-class-path path/to/my/uber.jar, Spark is able to ser/deserialize the messages correctly. However, if I run WITHOUT --driver-class-path path/to/my/uber.

spark-submit: fire-and-forget mode?

2014-09-18 Thread Tobias Pfeiffer
Hi, I am wondering: Is it possible to run spark-submit in a mode where it will start an application on a YARN cluster (i.e., driver and executors run on the cluster) and then forget about it in the sense that the Spark application is completely independent from the host that ran the spark-submit c

RE: Serious Issue with Spark Streaming ? Blocks Getting Removed and Jobs have Failed..

2014-09-18 Thread Shao, Saisai
Hi Rafeeq, I think this always occurs when your Spark Streaming application is running in an abnormal state. Would you mind checking your job processing time in the WebUI or logs: is the total latency of job processing + job scheduling larger than the batch duration? If your Spark Str

Strange exception while accessing hdfs from spark.

2014-09-18 Thread Julien Carme
Hello, I have been using Spark for quite some time, and I now get this error (please see stderr output below) when accessing HDFS. It seems to come from Hadoop; however, I can access HDFS from the command line without any problem. The WARN on the first line seems to be key, because it never appeared previ

Better way to process large image data set ?

2014-09-18 Thread Jaonary Rabarisoa
Hi all, I'm trying to process a large image data set and need some way to optimize my implementation, since it's currently very slow. In my current implementation I store my images in an object file with the following fields: case class Image(groupId: String, imageId: String, buffer: String) Images

Re: HBase 0.96+ with Spark 1.0+

2014-09-18 Thread Reinis Vicups
I am humbly bumping this since even after another week of trying I haven't had any luck fixing this yet. On 14.09.2014 19:21, Reinis Vicups wrote: I did actually try Sean's suggestion just before I posted for the first time in this thread. I got an error when doing this and thought that I am not un

Re: pyspark on yarn - lost executor

2014-09-18 Thread Oleg Ruchovets
Great. The upgrade helped. Still need some inputs: 1) Are there any log files of Spark job execution? 2) Where can I read about tuning / parameter configuration? For example: --num-executors 12 --driver-memory 4g --executor-memory 2g what is the meaning of those parameters? Thanks Oleg. On Thu, S

Re: HBase 0.96+ with Spark 1.0+

2014-09-18 Thread Ted Yu
The stack trace mentioned OutOfMemory error. See: http://stackoverflow.com/questions/3003855/increase-permgen-space On Sep 18, 2014, at 1:59 AM, Reinis Vicups wrote: > I am humbly bumping this since even after another week of trying I haven't > had luck to fix this yet. > > On 14.09.2014 19:2

Re: Better way to process large image data set ?

2014-09-18 Thread Sean Owen
Base 64 is an inefficient encoding for binary data by about 2.6x. You could use byte[] directly. But you would still be storing and potentially shuffling lots of data in your RDDs. If the files exist separately on HDFS perhaps you can just send around the file location and load it directly using
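A sketch of that idea, assuming the images already sit as individual files on HDFS (paths are illustrative): only the small path strings travel through the RDD, and each executor loads its own bytes.

    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val paths = sc.parallelize(Seq("hdfs:///images/img-001.jpg", "hdfs:///images/img-002.jpg"))

    // Load the bytes on the executor side instead of shuffling them around.
    val images = paths.map { p =>
      val fs = FileSystem.get(new URI(p), new Configuration())
      val in = fs.open(new Path(p))
      try {
        val out = new java.io.ByteArrayOutputStream()
        org.apache.hadoop.io.IOUtils.copyBytes(in, out, 4096, false)
        (p, out.toByteArray)
      } finally {
        in.close()
      }
    }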

StackOverflowError

2014-09-18 Thread gm yu
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 736952.0:2 failed 1 times, most recent failure: Exception failure in TID 21006 on host localhost: java.lang.StackOverflowError java.util.zip.GZIPInputStream.read(GZIPInputStream.java:116) java.util.zi

Spot instances on Amazon EMR

2014-09-18 Thread Grzegorz Białek
Hi, I would like to run a Spark application on Amazon EMR. I have some questions about that: 1. I have input data on another HDFS (not on Amazon). Can I send all input data from that cluster to HDFS on the Amazon EMR cluster (if it has enough storage) or do I have to send it to Amazon S3 storage and th

[SparkStreaming] task failure with 'Unknown exception in doAs'

2014-09-18 Thread Gerard Maas
My Spark Streaming job (running on Spark 1.0.2) stopped working today and consistently throws the exception below. No code changed for it, so I'm really puzzled about the cause of the issue. Looks like a security issue at HDFS level. Has anybody seen this exception and maybe know the root cause?

New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread jatinpreet
Hi, I have been running into memory overflow issues while creating TFIDF vectors to be used in document classification using MLlib's Naive Baye's classification implementation. http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ Memory overfl

Re: [SparkStreaming] task failure with 'Unknown exception in doAs'

2014-09-18 Thread Gerard Maas
Found it! (with sweat on my forehead) The job was actually running on Mesos using a Spark 1.1.0 executor. I guess there's some incompatibility between the 1.0.2 and the 1.1 versions - still quite weird. -kr, Gerard. On Thu, Sep 18, 2014 at 12:29 PM, Gerard Maas wrote: > My Spark Streaming

Re: StackOverflowError

2014-09-18 Thread Akhil Das
What were you trying to do? Thanks Best Regards On Thu, Sep 18, 2014 at 3:37 PM, gm yu wrote: > Exception in thread "main" org.apache.spark.SparkException: Job aborted > due to stage failure: Task 736952.0:2 failed 1 times, most recent failure: > Exception failure in TID 21006 on host localhost

Re: SPARK BENCHMARK TEST

2014-09-18 Thread VJ Shalish
Hi, please can someone advise on this. On Wed, Sep 17, 2014 at 6:59 PM, VJ Shalish wrote: > I am trying to benchmark spark in a hadoop cluster. > I need to design a sample spark job to test the CPU utilization, RAM > usage, Input throughput, Output throughput and Duration of execution in the > cl

Spark Package for Mesos

2014-09-18 Thread John Omernik
I know there is a script that builds a TGZ ready for Mesos, but I was wondering if there are switches and/or a methodology that would allow me to change some files, and create the TGZ file without the compiling... Just trying to understand what happens under the hood here, and ensure I include the

RE: StackOverflowError

2014-09-18 Thread Shao, Saisai
Hi, does your application fail during task deserialization? If so, this is a known issue in Spark caused by an overly long RDD dependency chain, which makes Java deserialization overflow the stack. There are two ways to solve this issue: one is to use RDD's checkpoint to cut the dependency chain, another is to e
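A minimal sketch of the checkpoint option (directory and iteration counts are illustrative): periodically checkpointing and forcing an action truncates the lineage that would otherwise overflow the deserialization stack.

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // illustrative path

    var rdd = sc.parallelize(1 to 1000000)
    for (i <- 1 to 1000) {
      rdd = rdd.map(_ + 1)
      if (i % 100 == 0) {
        rdd.checkpoint()  // cut the dependency chain here
        rdd.count()       // force an action so the checkpoint is written
      }
    }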

Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
*I am facing a similar issue to SPARK-3447 with the Spark Streaming API, Kryo serializer, and Avro messages. If the Avro message is simple, it's fine, but if the Avro message has unions/arrays it's failing with the exception below:* ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 or

Odd error when using a rdd map within a stream map

2014-09-18 Thread Filip Andrei
Here I wrote a simpler version of the code to get an understanding of how it works: final List nns = new ArrayList(); for(int i = 0; i < numberOfNets; i++){ nns.add(NeuralNet.createFrom(...)); } final JavaRDD nnRdd = sc.parallelize(nns); JavaDStream results = rndLists.flatMap(new FlatMapFu

RE: SchemaRDD and RegisterAsTable

2014-09-18 Thread Denny Lee
Could you clarify - when you’re connecting via beeline, aren’t you also connecting to the thrift server that generates the Spark context?  It’s possible the first time you query that its a little slower as it needs to transfer the data from file / Hadoop / source into a RDD and subsequent querie

Joining multiple rowMatrix

2014-09-18 Thread Debasish Das
Hi, I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j) and I would like to join multiple matrices to come up with a sqlTable for each key... What's the best way to do it? Right now I am doing N joins if I want to combine data from N matrices, which does not look quite r

Re: problem with HiveContext inside Actor

2014-09-18 Thread Chester @work
Akka actors are managed under a thread pool, so the same actor can run on different threads. If you create the HiveContext in the actor, is it possible that you are essentially creating different instances of HiveContext? Sent from my iPhone > On Sep 17, 2014, at 10:14 PM, Du Li wrote: > > Thanks

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Nicholas Chammas
Dunno about having the application be independent of whether spark-submit is still alive, but you can have spark-submit run in a new session in Linux using setsid. That way, even if you terminate your SSH session, spark-submit will keep running independ

Re: SQL shell for Spark SQL?

2014-09-18 Thread David Rosenstrauch
Is the cli in fact a full replacement for shark? The description says "The Spark SQL CLI is a convenient tool to run the Hive metastore service in local mode ...". The way I've used Shark in the past, however, is to run the shark shell on a client machine and connect it to a Hive metastore on

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
Added some more info on this issue in the tracker Spark-3447 https://issues.apache.org/jira/browse/SPARK-3447 - Thanks & Regards, Mohan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-fails-with-avro-having-Arrays-and-unions-but-succeeds-with-simple

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Sandy Ryza
Hi Tobias, YARN cluster mode should have the behavior you're looking for. The client process will stick around to report on things, but should be able to be killed without affecting the application. If this isn't the behavior you're observing, and your application isn't failing for a different r

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread Frank Austin Nothaft
Mohan, I don’t think this is a Spark issue, rather, I think the issue is coming from your serializer. Can you point us to the serializer that you are using? We have no problems serializing complex Avro (nested schemas with unions and arrays) when using this serializer. You may also want to look

Re: Adjacency List representation in Spark

2014-09-18 Thread Harsha HN
Hi Andrew, The only reason I avoided the GraphX approach is that I didn't see any explanation or API documentation for the Java side. Do you have any code sample using the GraphX API in Java? Thanks, Harsha On Wed, Sep 17, 2014 at 10:44 PM, Andrew Ash wrote: > Hi Harsha, > > You could look t

Re: Support R in Spark

2014-09-18 Thread oppokui
Shivaram, As far as I know, SparkR uses the rJava package. On the worker node, Spark code executes R code by launching an R process and sending/receiving byte arrays. I have a question on when the R process is launched: is it per worker process, per executor thread, or per RDD processed? Thanks and Regards

Re: Stable spark streaming app

2014-09-18 Thread Soumitra Kumar
Refer to https://github.com/sbt/sbt-assembly to generate a jar with dependencies. I prefer not to build a big fat jar, since the bulk would be Hadoop-related, and I prefer to use what is installed on the host. - Original Message - From: "Tim Smith" Cc: "spark users" Sent: Wednesday, Septem

Sending multiple DStream outputs

2014-09-18 Thread contractor
Hi all, I am using Spark 1.0 streaming to ingest a high-volume stream of data (approx. 1mm lines every few seconds), transform it into two outputs, and send those outputs to two separate Apache Kafka topics. I have two blocks of output code like this: Stream1 = …. Stream2 = … Stream1.foreac

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
Hi Frank, thanks for the info, that's great. But I'm not saying the Avro serializer is failing; Kryo is failing, but I'm using the Kryo serializer and registering Avro-generated classes with Kryo. sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"); sparkConf.set("spark.kr

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread Frank Austin Nothaft
Hi Mohan, It’s been a while since I’ve looked at this specifically, but I don’t think the default Kryo serializer will properly serialize Avro. IIRC, there are complications around the way that Avro handles nullable fields, which would be consistent with the NPE you’re encountering here. That’s

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread mohan.gadm
Thanks for the info, Frank. So your suggestion would be to use an Avro serializer; I just have to configure it like Kryo for the same property? And is there any registration process for this, or do I just specify the serializer? Also, does it affect performance? What measures should be taken to avoid that? (I'm using kryo

Spark Streaming and ReactiveMongo

2014-09-18 Thread t1ny
Hello all, Spark newbie here. We are trying to use Spark Streaming (unfortunately stuck on version 0.9.1 of Spark) to stream data out of MongoDB. ReactiveMongo (http://reactivemongo.org/) is a Scala driver that enables you to stream a MongoDB capped collection (in our case, the Oplog). Given that Mongo

schema for schema

2014-09-18 Thread Eric Friedman
I have a SchemaRDD which I've gotten from a parquetFile. Did some transforms on it and now want to save it back out as parquet again. Getting a SchemaRDD proves challenging because some of my fields can be null/None and SQLContext.inferSchema rejects those. So, I decided to use the schema on the

Re: Kryo fails with avro having Arrays and unions, but succeeds with simple avro.

2014-09-18 Thread Frank Austin Nothaft
Mohan, You’ll need to register it; we register our serializer in lines 69 to 76 in https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/serialization/ADAMKryoRegistrator.scala. Our serializer implementation falls back on the default Avro serializer; yo
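A hedged sketch of such a registrator, modeled loosely on the ADAM code linked above: the Kryo serializer round-trips each record through Avro's own binary encoding, so unions and arrays are handled by Avro rather than Kryo. `MyAvroRecord` is a hypothetical stand-in for an Avro-generated class.

    import java.io.ByteArrayOutputStream
    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.avro.io.{DecoderFactory, EncoderFactory}
    import org.apache.avro.specific.{SpecificDatumReader, SpecificDatumWriter, SpecificRecord}
    import org.apache.spark.serializer.KryoRegistrator

    class AvroBackedSerializer[T <: SpecificRecord] extends Serializer[T] {
      override def write(kryo: Kryo, output: Output, record: T): Unit = {
        val writer = new SpecificDatumWriter[T](record.getSchema)
        val baos = new ByteArrayOutputStream()
        val encoder = EncoderFactory.get().binaryEncoder(baos, null)
        writer.write(record, encoder)
        encoder.flush()
        val bytes = baos.toByteArray
        output.writeInt(bytes.length)
        output.writeBytes(bytes)
      }

      override def read(kryo: Kryo, input: Input, klass: Class[T]): T = {
        val bytes = input.readBytes(input.readInt())
        val reader = new SpecificDatumReader[T](klass)
        val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
        reader.read(null.asInstanceOf[T], decoder)
      }
    }

    // MyAvroRecord is hypothetical; register each generated class you use.
    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[MyAvroRecord], new AvroBackedSerializer[MyAvroRecord])
      }
    }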

Re: Cannot run SimpleApp as regular Java app

2014-09-18 Thread ericacm
Upgrading from spark-1.0.2-hadoop2 to spark-1.1.0-hadoop1 fixed my problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-run-SimpleApp-as-regular-Java-app-tp13695p14570.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Spark SQL Exception

2014-09-18 Thread Paul Magid
All: I am putting Spark SQL 1.1 through its paces (in a POC) and have been pleasantly surprised with what can be done with such a young technology. I have run into an exception (listed below) that I suspect relates to the number of columns in the table I am querying. There are 336 columns

Spark Zmq issue in cluster mode

2014-09-18 Thread Hatch M
I have a Spark Streaming ZeroMQ application running fine in non-cluster mode. When running a local cluster and I do spark-submit, the ZeroMQ Java client is choking. org.zeromq.ZMQException: No such file or directory at org.zeromq.ZMQ$Socket.raiseZMQException(ZMQ.java:480) at org.zeromq.ZMQ$Socket.recv(Z

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Marcelo Vanzin
Yes, what Sandy said. On top of that, I would suggest filing a bug for a new command line argument for spark-submit to make the launcher process exit cleanly as soon as a cluster job starts successfully. That can be helpful for code that launches Spark jobs but monitors the job through different m

Re: Spark SQL Exception

2014-09-18 Thread Michael Armbrust
It's failing to sort because the columns are of Binary type (though maybe we should support this as well). Is this Parquet data that was generated by Impala that you would expect to be a String? If so, turn on spark.sql.parquet.binaryAsString
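A one-setting sketch of that workaround (Spark 1.1 API; the table path is illustrative), assuming the Parquet data was written by Impala with strings stored as binary:

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    // Interpret Parquet binary columns as strings (e.g., Impala-written data).
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
    val table = sqlContext.parquetFile("hdfs:///warehouse/my_table")
    table.registerTempTable("my_table")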

Re: SQL shell for Spark SQL?

2014-09-18 Thread Denny Lee
The CLI is the command line connection to SparkSQL and yes, SparkSQL replaces Shark - there’s a great article by Reynold on the Databricks blog that provides the context:  http://databricks.com/blog/2014/07/01/shark-spark-sql-hive-on-spark-and-the-future-of-sql-on-spark.html As for SparkSQL and

Re: Support R in Spark

2014-09-18 Thread Shivaram Venkataraman
As R is single-threaded, SparkR launches one R process per-executor on the worker side. Thanks Shivaram On Thu, Sep 18, 2014 at 7:49 AM, oppokui wrote: > Shivaram, > > As I know, SparkR used rJava package. In work node, spark code will execute R > code by launching R process and send/receive by

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Patrick Wendell
I agree, that's a good idea Marcelo. There isn't AFAIK any reason the client needs to hang there for correct operation. On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin wrote: > Yes, what Sandy said. > > On top of that, I would suggest filing a bug for a new command line > argument for spark-submi

Re: Spot instances on Amazon EMR

2014-09-18 Thread Patrick Wendell
Hey Grzegorz, EMR is a service that is not maintained by the Spark community. So this list isn't the right place to ask EMR questions. - Patrick On Thu, Sep 18, 2014 at 3:19 AM, Grzegorz Białek wrote: > Hi, > I would like to run Spark application on Amazon EMR. I have some questions > about tha

Re: Stable spark streaming app

2014-09-18 Thread Tim Smith
Dibyendu - I am using the Kafka consumer built into Spark streaming. Pulled the jar from here: http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-streaming-kafka_2.10/1.0.0/spark-streaming-kafka_2.10-1.0.0.jar Thanks for the sbt-assembly link, Soumitra. On Wed, Sep 17, 2014 at

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Andrew Or
Thanks Tobias, I have filed a JIRA for it. 2014-09-18 10:09 GMT-07:00 Patrick Wendell : > I agree, that's a good idea Marcelo. There isn't AFAIK any reason the > client needs to hang there for correct operation. > > On Thu, Sep 18, 2014 at 9:39 AM, Marcelo Vanzin > wrote: > > Yes, what Sandy sai

Re: Sending multiple DStream outputs

2014-09-18 Thread Tim Smith
Curious, if you have 1:1 mapping between Stream1:topic1 and Stream2:topic2 then why not run different instances of the app for each and pass as arguments to each instance the input source and output topic? On Thu, Sep 18, 2014 at 8:07 AM, Padmanabhan, Mahesh (contractor) wrote: > Hi all, > > I am

RE: MLLib: LIBSVM issue

2014-09-18 Thread Sameer Tilak
Thanks, Burak. Yes, tab was an issue and I was able to get it working after replacing it with a space. > Date: Wed, 17 Sep 2014 21:11:00 -0700 > From: bya...@stanford.edu > To: ssti...@live.com > CC: user@spark.apache.org > Subject: Re: MLLib: LIBSVM issue > > Hi, > > The spacing between the inp

RE: MLLib: LIBSVM issue

2014-09-18 Thread Sameer Tilak
Thanks, will try it out today. Date: Wed, 17 Sep 2014 23:04:31 -0700 Subject: Re: MLLib: LIBSVM issue From: debasish.da...@gmail.com To: bya...@stanford.edu CC: ssti...@live.com; user@spark.apache.org We dump fairly big libsvm to compare against liblinear/libsvm...the following code dumps out li

Re: Odd error when using a rdd map within a stream map

2014-09-18 Thread Burak Yavuz
Hi, I believe it's because you're trying to use a Function of an RDD, in an RDD, which is not possible. Instead of using a `Function<JavaRDD<Float>, Void>`, could you try `Function<Float, Void>`, with `public Void call(Float arg0) throws Exception { ` and `System.out.println(arg0)` instead. I'm not perfectly sure of the semantics i

RE: Spark SQL Exception

2014-09-18 Thread Paul Magid
Michael: Thanks for the quick response. I can confirm that once I removed the “order by” clause the exception below went away. So, I believe this confirms what you were saying and I will be opening a new feature request in JIRA. However, that exception was replaced by a java.lang.OutOfMemoryEr

MLLib regression model weights

2014-09-18 Thread Sameer Tilak
Hi All, I am able to run LinearRegressionWithSGD on a small sample dataset (~60MB Libsvm file of sparse data) with 6700 features. val model = LinearRegressionWithSGD.train(examples, numIterations) At the end I get a model where model.weights.size gives res6: Int = 6699. I am assuming each entry in the mo

Re: Sending multiple DStream outputs

2014-09-18 Thread contractor
One output is sort of a subset of the other so it didn't make much sense to spin up another Spark app for the same source. On 9/18/14, 11:19 AM, "Tim Smith" wrote: >Curious, if you have 1:1 mapping between Stream1:topic1 and >Stream2:topic2 then why not run different instances of the app for >ea

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb, I am not templating RowMatrix/CoordinateMatrix since that would be a big deviation from the PR. We can add jaccard and other similarity measures in later PRs. In the meantime, you can un-normalize the cosine similarities to get the dot product, and then compute the other similarity measur
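As a small illustration of that un-normalization (variable names follow the formula quoted later in this thread; `cosine`, `sg`, and `colMags` are assumed to be in scope):

    // Recover the dot product from a cosine similarity between columns i and j
    // by multiplying the (possibly thresholded) column magnitudes back in.
    val dot = cosine * math.min(sg, colMags(i)) * math.min(sg, colMags(j))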

Re: Huge matrix

2014-09-18 Thread Debasish Das
Yup that's what I did for now... On Thu, Sep 18, 2014 at 10:34 AM, Reza Zadeh wrote: > Hi Deb, > > I am not templating RowMatrix/CoordinateMatrix since that would be a big > deviation from the PR. We can add jaccard and other similarity measures in > later PRs. > > In the meantime, you can un-no

Re: spark-submit: fire-and-forget mode?

2014-09-18 Thread Nicholas Chammas
And for the record, the issue is here: https://issues.apache.org/jira/browse/SPARK-3591 On Thu, Sep 18, 2014 at 1:19 PM, Andrew Or wrote: > Thanks Tobias, I have filed a JIRA for it. > > 2014-09-18 10:09 GMT-07:00 Patrick Wendell : > > I agree, that's a good idea Marcelo. There isn't AFAIK any r

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Aris
Sorry to bother you guys, but does anybody have any ideas about the status of MLlib with a Radial Basis Function kernel for SVM? Thank you! On Tue, Sep 16, 2014 at 3:27 PM, Aris wrote: > Hello Spark Community - > > I am using the support vector machine / SVM implementation in MLlib with > the

Re: Adjacency List representation in Spark

2014-09-18 Thread Koert Kuipers
We build our own adjacency lists as well. The main motivation for us was that GraphX has some assumptions about everything fitting in memory (it has .cache statements all over the place). However, if my understanding is wrong and GraphX can handle graphs that do not fit in memory, I would be interested t

Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-18 Thread Aris
Thank you, Spark community, you make life much more lovely - suffering in silence is not fun! I am trying to build the Spark Git branch from Manish Amde, available here: https://github.com/manishamde/spark/tree/ada_boost I am trying to build the non-master branch 'ada_boost' (in the link above), b

Re: spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-18 Thread Andy Davidson
After lots of hacking I figured out how to resolve this problem. This is a good solution. It severely cripples Jackson, but at least for now I am unblocked. 1) turn off annotations. mapper.configure(Feature.USE_ANNOTATIONS, false); 2) in maven set the jackson dependencies as provided. 1.9

Kafka Spark Streaming on Spark 1.1

2014-09-18 Thread JiajiaJing
Hi Spark Users, We just upgrade our spark version from 1.0 to 1.1. And we are trying to re-run all the written and tested projects we implemented on Spark 1.0. However, when we try to execute the spark streaming project that stream data from Kafka topics, it yields the following error message. I

Spark on EC2

2014-09-18 Thread Gilberto Lira
Hello, I am trying to run a Python script that makes use of MLlib's KMeans and I'm not getting anywhere. I'm using a c3.xlarge instance as master, and 10 c3.large instances as slaves. In the code I make a map of a 600MB csv file in S3, where each row has 128 integer columns. The problem is that a

Spark + Mahout

2014-09-18 Thread Daniel Takabayashi
Hi guys, Is it possible to run a Mahout k-means through the Spark infrastructure? Thanks, taka (Brazil)

Re: MLLib regression model weights

2014-09-18 Thread Debasish Das
sc.parallelize(model.weights.toArray, blocks).top(k) will get that, right? For logistic regression you might want both positive and negative features... so just pass it through a filter on abs and then pick top(k). On Thu, Sep 18, 2014 at 10:30 AM, Sameer Tilak wrote: > Hi All, > > I am able to run LinearReg
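A sketch of that filter-then-top-k idea (k and the partition count are illustrative); keeping the feature index alongside each weight shows which features matter:

    val k = 20  // illustrative
    // Pair each weight with its feature index, then rank by absolute magnitude.
    val indexed = sc.parallelize(model.weights.toArray.zipWithIndex, 4)
    val topK = indexed.top(k)(Ordering.by { case (w, _) => math.abs(w) })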

Re: Kafka Spark Streaming on Spark 1.1

2014-09-18 Thread Tim Smith
What kafka receiver are you using? Did you build a new jar for your app with the latest streaming-kafka code for 1.1? On Thu, Sep 18, 2014 at 11:47 AM, JiajiaJing wrote: > Hi Spark Users, > > We just upgrade our spark version from 1.0 to 1.1. And we are trying to > re-run all the written and tes

Re: Kafka Spark Streaming on Spark 1.1

2014-09-18 Thread JiajiaJing
Yeah, I forgot to build the new jar file for spark 1.1... And now the errors are gone. Thank you very much! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-Spark-Streaming-on-Spark-1-1-tp14597p14604.html Sent from the Apache Spark User List mailing l

Re: Spark on EC2

2014-09-18 Thread Burak Yavuz
Hi Gilberto, Could you please attach the driver logs as well, so that we can pinpoint what's going wrong? Could you also add the flag `--driver-memory 4g` while submitting your application and try that as well? Best, Burak - Original Message - From: "Gilberto Lira" To: user@spark.apach

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Jey Kottalam
Hi Aris, A simple approach to gaining some of the benefits of an RBF kernel is to add synthetic features to your training set. For example, if your original data consists of 3-dimensional vectors [x, y, z], you could compute a new 9-dimensional feature vector containing [x, y, z, x^2, y^2, z^2, xy
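A small sketch of that expansion, assuming plain Array[Double] feature vectors: a 3-dimensional input grows to 9 dimensions with squares and pairwise products.

    // Expand [x, y, z] into [x, y, z, x^2, y^2, z^2, xy, xz, yz].
    def expand(v: Array[Double]): Array[Double] = {
      val squares = v.map(x => x * x)
      val cross = for {
        i <- v.indices
        j <- v.indices if j > i
      } yield v(i) * v(j)
      v ++ squares ++ cross
    }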

Re: problem with HiveContext inside Actor

2014-09-18 Thread Du Li
I have figured it out. As shown in the code below, if the HiveContext hc were created in the actor object and used to create db in response to message, it would throw null pointer exception. This is fixed by creating the HiveContext inside the MyActor class instead. I also tested the code by re
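A minimal sketch of the fix described (the actor and query flow are illustrative): the HiveContext lives as a member of the actor class, so each actor instance owns exactly one context.

    import akka.actor.Actor
    import org.apache.spark.SparkContext
    import org.apache.spark.sql.hive.HiveContext

    class MyActor(sc: SparkContext) extends Actor {
      // One HiveContext per actor instance, created inside the class rather
      // than in a companion object shared across threads.
      private val hc = new HiveContext(sc)

      def receive = {
        case query: String => sender ! hc.hql(query).collect()
      }
    }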

Re: schema for schema

2014-09-18 Thread Michael Armbrust
This looks like a bug, we are investigating. On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman wrote: > I have a SchemaRDD which I've gotten from a parquetFile. > > Did some transforms on it and now want to save it back out as parquet > again. > > Getting a SchemaRDD proves challenging because some

spark 1.1 examples build failure on cdh 5.1

2014-09-18 Thread freedafeng
This is a mvn build. [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.1.0: Could not find artifact org.apache.hbase:hbase:jar:0.98.1 in central (https://repo1.maven.org/maven2) -> [Help 1] [ERROR]

RE: Spark + Mahout

2014-09-18 Thread Huang, Roger
Taka, Have you considered using Spark’s own MLlib k-means clustering? http://spark.apache.org/docs/latest/mllib-clustering.html Roger From: Daniel Takabayashi [mailto:takabaya...@scanboo.com.br] Sent: Thursday, September 18, 2014 1:50 PM To: user@spark.apache.org Subject: Spark + Mahout Hi guys,

Re: Spark + Mahout

2014-09-18 Thread Daniel Takabayashi
Yes, thanks, but I want to test using fuzzy k-means as an option; is that possible? 2014-09-18 16:40 GMT-03:00 Huang, Roger : > Taka, > > Have you considered using Spark’s own MLlib k-means clustering? > > http://spark.apache.org/docs/latest/mllib-clustering.html > > > > Roger > > > > *From:* Daniel

SVD on larger than taller matrix

2014-09-18 Thread Glitch
I have a matrix of about 2 million+ rows with 3 million+ columns in svm format* and it's sparse. As I understand it, running SVD on such a matrix shouldn't be a problem since version 1.1. I'm using 10 worker nodes on EC2, each with 30G of RAM (r3.xlarge). I was able to compute the SVD for 20 si

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread Xiangrui Meng
Hi Jatin, HashingTF should be able to solve the memory problem if you use a small feature dimension in HashingTF. Please do not cache the input document, but cache the output from HashingTF and IDF instead. We don't have a label indexer yet, so you need a label to index map to map it to double val
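A sketch following that advice (Spark 1.1 MLlib; the input path and feature dimension are illustrative):

    import org.apache.spark.mllib.feature.{HashingTF, IDF}

    val documents = sc.textFile("hdfs:///docs").map(_.split(" ").toSeq)
    val hashingTF = new HashingTF(numFeatures = 1 << 18)  // keep the dimension modest

    val tf = hashingTF.transform(documents)
    tf.cache()  // cache the transformed vectors, not the raw documents
    val tfidf = new IDF().fit(tf).transform(tf)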

Re: Joining multiple rowMatrix

2014-09-18 Thread Xiangrui Meng
You can use CoGroupedRDD (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.CoGroupedRDD) directly. -Xiangrui On Thu, Sep 18, 2014 at 7:09 AM, Debasish Das wrote: > Hi, > > I have some RowMatrices all with the same key (MatrixEntry.i, MatrixEntry.j) > and I would like
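A sketch of the one-pass grouping (the three RDDs of type RDD[((Long, Long), Double)], keyed by (MatrixEntry.i, MatrixEntry.j), are assumptions); the pair-RDD cogroup method shown here is built on top of CoGroupedRDD:

    // entriesA, entriesB, entriesC: RDD[((Long, Long), Double)]
    val grouped = entriesA.cogroup(entriesB, entriesC)
    // grouped: RDD[((Long, Long), (Iterable[Double], Iterable[Double], Iterable[Double]))]
    val firsts = grouped.mapValues { case (a, b, c) => (a.headOption, b.headOption, c.headOption) }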

Re: schema for schema

2014-09-18 Thread Davies Liu
Thanks for reporting this, it will be fixed by https://github.com/apache/spark/pull/2448 On Thu, Sep 18, 2014 at 12:32 PM, Michael Armbrust wrote: > This looks like a bug, we are investigating. > > On Thu, Sep 18, 2014 at 8:49 AM, Eric Friedman > wrote: >> >> I have a SchemaRDD which I've gotten

Re: MLLib regression model weights

2014-09-18 Thread Xiangrui Meng
The importance should be based on some statistics, for example, the standard deviation of the feature column and the magnitude of the weight. If the columns are scaled to unit standard deviation (using StandardScaler), you can tell the importance by the absolute value of the weight. But there are o

Re: SVD on larger than taller matrix

2014-09-18 Thread Xiangrui Meng
Did you cache `features`? Without caching it is slow because we need O(k) iterations. The storage requirement on the driver is about 2 * n * k = 2 * 3 million * 200 ~= 9GB, not considering any overhead. Computing U is also an expensive task in your case. We should use some randomized SVD implementa
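A sketch of where the caching goes (k mirrors the 200 singular values mentioned; `rows` is an assumed RDD[Vector]):

    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    val features = rows.cache()  // cache before the O(k) iterations start
    val mat = new RowMatrix(features)
    // Skip U when only the singular values / V are needed; computing U is costly.
    val svd = mat.computeSVD(200, computeU = false)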

Re: MLlib - Possible to use SVM with Radial Basis Function kernel rather than Linear Kernel?

2014-09-18 Thread Xiangrui Meng
We don't support kernels because it doesn't scale well. Please check "When to use LIBLINEAR but not LIBSVM" on http://www.csie.ntu.edu.tw/~cjlin/liblinear/index.html . I like Jey's suggestion on expanding features. -Xiangrui On Thu, Sep 18, 2014 at 12:29 PM, Jey Kottalam wrote: > Hi Aris, > > A s

Re: spark-1.1.0-bin-hadoop2.4 java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass

2014-09-18 Thread Yin Huai
Hello Andy, Will our JSON support in Spark SQL help your case? If your JSON files store one JSON object per line, you can use SQLContext.jsonFile to load it. If you want to do pre-process these files, once you have an RDD[String] (one JSON object per String), you can use SQLContext.jsonRDD. In bot

Re: SVD on larger than taller matrix

2014-09-18 Thread Li Pu
The main bottleneck of current SVD implementation is on the memory of driver node. It requires at least 5*n*k doubles in driver memory because all right singular vectors are stored in driver memory and there are some working memory required. So it is bounded by the smaller dimension of your matrix

Re: Huge matrix

2014-09-18 Thread Debasish Das
Hi Reza, Have you tested if different runs of the algorithm produce different similarities (basically if the algorithm is deterministic)? This number does not look like a Monoid aggregation... iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j))) I am noticing some weird behavior as

Re: Huge matrix

2014-09-18 Thread Reza Zadeh
Hi Deb, I am currently seeding the algorithm to be pseudo-random; this is an issue being addressed in the PR. If you pull the current version it will be deterministic, but potentially not pseudo-random. The PR will be updated today. Best, Reza On Thu, Sep 18, 2014 at 2:06 PM, Debasish Das wrote:

Re: Huge matrix

2014-09-18 Thread Debasish Das
I am still a bit confused whether numbers like these can be aggregated as double: iVal * jVal / (math.min(sg, colMags(i)) * math.min(sg, colMags(j))) It should be aggregated using something like List[iVal*jVal, colMags(i), colMags(j)] I am not sure Algebird can aggregate deterministically over Do

Re: Spark + Mahout

2014-09-18 Thread Sean Owen
No, the architectures are entirely different. The Mahout implementations have been deprecated and are not being updated, so there won't be a port or anything. You would have to create these things from scratch on Spark if they don't already exist. On Sep 18, 2014 7:50 PM, "Daniel Takabayashi" wrot

PairRDD's lookup method Performance

2014-09-18 Thread Harsha HN
Hi All, My question is related to improving the performance of PairRDD's lookup method. I went through the link below where Tathagata Das explains creating a hash map over partitions using "mappartitio
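One relevant sketch (data is illustrative), assuming repeated lookups are the bottleneck: giving the pair RDD a known partitioner lets lookup() scan a single partition instead of all of them.

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(1 to 1000000)
      .map(i => (i, i * 2))
      .partitionBy(new HashPartitioner(16))  // known partitioner
      .cache()

    pairs.lookup(42)  // routed to exactly one partition via the partitioner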

Re: Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-18 Thread Manish Amde
Hi Aris, Thanks for the interest. First and foremost, tree ensembles are a top priority for the 1.2 release and we are working hard towards it. A random forests PR is already under review and AdaBoost and gradient boosting will be added soon after.  Unfortunately, the GBDT branch I shared

Unable to load app logs for MLLib programs in history server

2014-09-18 Thread SK
Hi, The default log files for the MLlib examples use a rather long naming convention that includes special characters like parentheses and commas. For example, one of my log files is named "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032". When I click on the program on the hi

Re: Unable to load app logs for MLLib programs in history server

2014-09-18 Thread Xiangrui Meng
Could you create a JIRA for it? We can either remove special characters or encode with alphanumerics. -Xiangrui On Thu, Sep 18, 2014 at 3:50 PM, SK wrote: > Hi, > > The default log files for the Mllib examples use a rather long naming > convention that includes special characters like parentheses

AbstractMethodError when creating cassandraTable object

2014-09-18 Thread Emil Gustafsson
Pretty sure this is a result of me being new to Scala, Spark, and sbt, but I'm getting the error above when I try to use the Cassandra driver for Spark. I posted more information here: https://github.com/datastax/spark-cassandra-connector/issues/245 Ideas? /E

Re: Unable to find proto buffer class error with RDD

2014-09-18 Thread Paul Wais
Well, it looks like Spark is just not loading my code into the driver/executors. E.g., given a JavaRDD<MyMessage> bars: List<String> foo = bars.map(new Function<MyMessage, String>() { { System.err.println("classpath: " + System.getProperty("java.class.path")); CodeSource src = com.google.protobuf.Ge

Re: schema for schema

2014-09-18 Thread Eric Friedman
Thanks! On Thu, Sep 18, 2014 at 1:14 PM, Davies Liu wrote: > Thanks for reporting this, it will be fixed by > https://github.com/apache/spark/pull/2448 > > On Thu, Sep 18, 2014 at 12:32 PM, Michael Armbrust > wrote: > > This looks like a bug, we are investigating. > > > > On Thu, Sep 18, 2014 a

request to merge the pull request #1893 to master

2014-09-18 Thread freedafeng
We are working on a project that needs python + spark to work on hdfs and hbase data. We like to use a not-too-old version of hbase such as hbase 0.98.x. We have tried many different ways (and platforms) to compile and test Spark 1.1 official release, but got all sorts of issues. The only version t
