Re: Spark: All masters are unresponsive!

2014-07-08 Thread Akhil Das
Are you sure this is your master URL, spark://pzxnvm2018:7077? You can look it up in the WebUI (usually http://pzxnvm2018:8080), top left corner. Also make sure you are able to telnet pzxnvm2018 7077 from the machines where you are running the spark shell. Thanks Best Regards On Tue, Jul 8, 2014

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Rahul Bhojwani
I am really sorry. It's actually my mistake. My problem 2 is wrong because using a single feature is senseless. Sorry for the inconvenience. But I will still be waiting for solutions to problems 1 and 3. Thanks, On Tue, Jul 8, 2014 at 12:14 PM, Rahul Bhojwani wrote: > Hello, > > I a

Re: Spark Installation

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 4:07 AM, Srikrishna S wrote: > Hi All, > > Does anyone know what the command line arguments to mvn are to generate > the pre-built binary for Spark on Hadoop 2 / CDH5. > > I would like to pull in a recent bug fix in spark-master and rebuild the > binaries in the exact same wa

Re: Unable to run Spark 1.0 SparkPi on HDP 2.0

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 2:01 AM, DB Tsai wrote: > Actually, the one needed to install the jar to each individual node is > standalone mode which works for both MR1 and MR2. Cloudera and > Hortonworks currently support spark in this way as far as I know. > (CDH5 uses Spark on YARN.)

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
This is on the roadmap for the next release (1.1) JIRA: SPARK-2179 On Mon, Jul 7, 2014 at 11:48 PM, Ionized wrote: > The Java API requires a Java Class to register as table. > > // Apply a schema to an RDD of JavaBeans and register it as a >

error when spark access hdfs with Kerberos enable

2014-07-08 Thread 许晓炜
Hi all, I encounter a strange issue when using Spark 1.0 to access HDFS with Kerberos. I just have one test node for Spark, and HADOOP_CONF_DIR is set to the location containing the HDFS configuration files (hdfs-site.xml and core-site.xml). When I use spark-shell in local mode, the access t

Re: Re: Pig 0.13, Spark, Spork

2014-07-08 Thread Bertrand Dechoux
@Mayur: I won't fight over the semantics of a fork, but at the moment, no, Spork does take the standard Pig as a dependency. On that, we should agree. As for my use of Pig, I have no limitation. I am however interested to see the rise of a 'no-sql high level non programming language' for Spark. @Zhan

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
Hi Tobias, thanks for your help. I understand that with that code we obtain a database connection per partition, but I also suspect that with that code a new database connection is created per each execution of the function used as argument for mapPartitions(). That would be very inefficient becaus

"NoSuchElementException: key not found" when changing the window lenght and interval in Spark Streaming

2014-07-08 Thread Juan Rodríguez Hortalá
Hi all, I'm writing a Spark Streaming program that uses reduceByKeyAndWindow(), and when I change the window length or sliding interval I get the following exceptions, running in local mode 14/07/06 13:03:46 ERROR actor.OneForOneStrategy: key not found: 1404677026000 ms java.util.NoSuchElementExc

RE: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Shao, Saisai
I think you can maintain a connection pool or keep the connection as a long-lived object on the executor side (like lazily creating a singleton object in object { } in Scala), so each task can get this connection when it executes, rather than creating a new one; that would be good for your scenari
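
A minimal sketch of that lazily created singleton, in Scala, assuming a plain JDBC connection with a placeholder URL and credentials; dstream stands in for the stream from the original question. The object is initialized once per executor JVM, so every task running on that executor reuses the same connection:

    import java.sql.{Connection, DriverManager}

    // Sketch: one connection per executor JVM, created lazily on first use.
    object ConnectionHolder {
      // placeholder URL and credentials
      lazy val connection: Connection =
        DriverManager.getConnection("jdbc:mysql://db-host:3306/mydb", "user", "pass")
    }

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = ConnectionHolder.connection   // reused across tasks on this executor
        records.foreach { r => /* write r using conn */ }
      }
    }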

Re: Spark Installation

2014-07-08 Thread 田毅
try this command: make-distribution.sh --hadoop 2.3.0-cdh5.0.0 --with-yarn --with-hive 田毅 === 橘云 Platform Product Line, Big Data Products Department, AsiaInfo-Linkage Technologies (China), Inc. Mobile: 13910177261 Tel: 010-82166322 Fax: 010-82166617 QQ: 20057509 MSN: yi.t...@hotmail.com Address: AsiaInfo-Linkage Building, East Zone, No. 10 Dongbeiwang West Road, Haidian District, Beijing ===

Re: Reading text file vs streaming text files

2014-07-08 Thread M Singh
Hi Akhil: Thanks for your response. Mans On Thursday, July 3, 2014 9:16 AM, Akhil Das wrote: Hi Singh! For this use-case it's better to have a Streaming context listening to the directory in HDFS where the files are being dropped, and you can set the Streaming interval as 15 minutes and

Re: Java sample for using cassandra-driver-spark

2014-07-08 Thread M Singh
Hi Piotr: It would be great if we can have an api to support batch updates (counter + non-counter). Thanks Mans On Monday, July 7, 2014 11:36 AM, Piotr Kołaczkowski wrote: Hi, we're planning to add a basic Java-API very soon, possibly this week. There's a ticket for it here:  https://gi

Re: Which is the best way to get a connection to an external database per task in Spark Streaming?

2014-07-08 Thread Juan Rodríguez Hortalá
Hi Jerry, thanks for your answer. I'm using Spark Streaming for Java, and I only have rudimentary knowledge about Scala, how could I recreate in Java the lazy creation of a singleton object that you propose for Scala? Maybe a static class member in Java for the connection would be the solution? Th

Task's "Scheduler Delay" in web ui

2014-07-08 Thread haopu
What's the meaning of a Task's "Scheduler Delay" in the web ui? And what could cause that delay? Thanks! -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Task-s-Scheduler-Delay-in-web-ui-tp9019.html Sent from the Apache Spark User List mailing list archive at

Re: Graphx traversal and merge interesting edges

2014-07-08 Thread HHB
Hi Ankur, I was trying out the PatternMatcher; it works for shorter paths, but I see that for the longer ones it continues to run forever... Here's what I am trying: https://gist.github.com/hihellobolke/dd2dc0fcebba485975d1 (The example of 3 share traders transacting in appl shares) The first e

Spark MapReduce job to work with Hive

2014-07-08 Thread Darq Moth
Please let me know if the following can be done in Spark. In terms of MapReduce I need: 1) Map function: 1.1) Get a Hive record. 1.2) Create a key from some fields of the record. Register my own key comparison function with the framework. This function will make a decision about key equality by calculati

Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Martin Gammelsæter
Hi! I am building a web frontend for a Spark app, allowing users to input sql/hql and get results back. When starting a SparkContext from within my server code (using jetty/dropwizard) I get the error java.lang.NoSuchMethodError: org.eclipse.jetty.server.AbstractConnector: method ()V not found w
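
For reference, a configuration sketch that avoids starting the per-application UI at all. This is an assumption to verify: the spark.ui.enabled property may not be honored by every Spark version, and spark.ui.port is an alternative that just moves the UI to another port:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: either disable the application web UI entirely or move it off port 4040.
    val conf = new SparkConf()
      .setAppName("web-frontend")
      .set("spark.ui.enabled", "false")   // assumption: supported by your Spark version
      .set("spark.ui.port", "4050")       // alternative: keep the UI but bind it to a free port
    val sc = new SparkContext(conf)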

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Martin Gammelsæter
Digging a bit more I see that there is yet another jetty instance that is causing the problem, namely the BroadcastManager has one. I guess this one isn't very wise to disable... It might very well be that the WebUI is a problem as well, but I guess the code doesn't get far enough. Any ideas on how

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
In addition to Scalding and Scrunch, there is Scoobi. Unlike the others, it is only Scala (it doesn't wrap a Java framework). All three have fairly similar APIs and aren't too different from Spark. For example, instead of RDD you have DList (distributed list) or PCollection (parallel collection) -

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-07-08 Thread Cheney Sun
Hi Nan, The problem is still there, just as I described before. It's said that the issue had already been addressed in some JIRA and resolved in a newer version, but I haven't had a chance to try it. If you have any findings, please let me know. Thanks, Cheney On Tue, Jul 8, 2014 at 7:16 AM, Nan Zh

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I don't have those numbers off-hand. Though the shuffle spill to disk was coming to several gigabytes per node, if I recall correctly. The MapReduce pipeline takes about 2-3 hours I think for the full 60 day data set. Spark chugs along fine for awhile and then hangs. We restructured the flow a few

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-07-08 Thread Nan Zhu
Hi, Cheney, Thanks for the information which version are you using, 0.9.1? Best, -- Nan Zhu On Tuesday, July 8, 2014 at 10:09 AM, Cheney Sun wrote: > Hi Nan, > > The problem is still there, just as I described before. It's said that the > issue had already been addressed in so

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-07-08 Thread Cheney Sun
Yes, 0.9.1. On Tue, Jul 8, 2014 at 10:26 PM, Nan Zhu wrote: > Hi, Cheney, > > Thanks for the information > > which version are you using, 0.9.1? > > Best, > > -- > Nan Zhu > > On Tuesday, July 8, 2014 at 10:09 AM, Cheney Sun wrote: > > Hi Nan, > > The problem is still there, just as I describe

Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Rahul Bhojwani
Hi, I wanted to use Naive Bayes for a text classification problem. I am using Spark 0.9.1. I was just curious: is the Naive Bayes implementation in Spark 0.9.1 correct? Or are there any bugs in the Spark 0.9.1 implementation which are taken care of in Spark 1.0? My question is specific abou

How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Rahul Bhojwani
Hi, I am using the MLlib Naive Bayes for a text classification problem. I have very little training data, and then data will be coming in continuously and I need to classify it as either A or B. I am training the MLlib Naive Bayes model using the training data, but next time when data come

got java.lang.AssertionError when run sbt/sbt compile

2014-07-08 Thread bai阿蒙
Hi guys, when I try to compile the latest source with sbt/sbt compile, I get an error. Can anyone help me? The following is the detail: it may be caused by TestSQLContext.scala [error] [error] while compiling: /disk3/spark/sql/core/src/main/scala/org/apache/spark/sql/test/TestSQLContext.scala [

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread Koert Kuipers
do you control your cluster and spark deployment? if so, you can try to rebuild with jetty 9.x On Tue, Jul 8, 2014 at 9:39 AM, Martin Gammelsæter < martingammelsae...@gmail.com> wrote: > Digging a bit more I see that there is yet another jetty instance that > is causing the problem, namely the B

Re: Comparative study

2014-07-08 Thread Kevin Markey
When you say "large data sets", how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's

Scheduling in spark

2014-07-08 Thread rapelly kartheek
Hi, I am a postgraduate student, new to Spark. I want to understand how the Spark scheduler works. I only have a theoretical understanding of the DAG scheduler and the underlying task scheduler. I want to know, given a job submitted to the framework, how the scheduling happens after the DAG scheduler phase? Can

java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Konstantin Kudryavtsev
Hi all, I faced with the next exception during map step: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded) java.lang.reflect.Array.newInstance(Array.java:70) com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySeria

Re: Spark Installation

2014-07-08 Thread Sandy Ryza
Hi Srikrishna, The binaries are built with something like mvn package -Pyarn -Dhadoop.version=2.3.0-cdh5.0.1 -Dyarn.version=2.3.0-cdh5.0.1 -Sandy On Tue, Jul 8, 2014 at 3:14 AM, 田毅 wrote: > try this command: > > make-distribution.sh --hadoop 2.3.0-cdh5.0.0 --with-yarn --with-hive > > > > > 田毅

Re: Disabling SparkContext WebUI on port 4040, accessing information programatically?

2014-07-08 Thread DB Tsai
We're doing a similar thing to launch Spark jobs in Tomcat, and I opened a JIRA for this. There are a couple of technical discussions there. https://issues.apache.org/jira/browse/SPARK-2100 In the end, we realized that Spark uses Jetty not only for the Spark WebUI, but also for distributing the jars and task

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
I'll respond for Dan. Our test dataset was a total of 10 GB of input data (full production dataset for this particular dataflow would be 60 GB roughly). I'm not sure what the size of the final output data was but I think it was on the order of 20 GBs for the given 10 GB of input data. Also, I can

Please add Talend to "Powered By Spark" page

2014-07-08 Thread Daniel Kulp
We are looking to add a note about Talend Open Studio's support for Spark components to the page at: https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark Name: Talend Open Studio URL: http://www.talendforge.org/exchange/ Description: Talend Labs are building open source tooling t

Re: NoSuchMethodError in KafkaReciever

2014-07-08 Thread Michael Chang
To be honest I'm a scala newbie too. I just copied it from createStream. I assume it's the canonical way to convert a java map (JMap) to a scala map (Map) On Mon, Jul 7, 2014 at 1:40 PM, mcampbell wrote: > xtrahotsauce wrote > > I had this same problem as well. I ended up just adding the nec

RE: Spark: All masters are unresponsive!

2014-07-08 Thread Sameer Tilak
Hi Akhil et al., I made the following changes: In spark-env.sh I added the following three entries (standalone mode): export SPARK_MASTER_IP=pzxnvm2018.x.y.name.org export SPARK_WORKER_MEMORY=4G export SPARK_WORKER_CORES=3 I then use the start-master and start-slaves commands to start the services. Anoth

how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Konstantin Kudryavtsev
Hi all, sorry for the silly question, but how can I get a PairRDDFunctions RDD? I'm doing it to perform a leftOuterJoin afterwards. Currently I do it this way (it seems incorrect): val parRDD = new PairRDDFunctions( oldRdd.map(i => (i.key, i)) ) I guess this constructor is definitely wrong... Thank you,

Re: how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Sean Owen
If your RDD contains pairs, like an RDD[(String,Integer)] or something, then you get to use the functions in PairRDDFunctions as if they were declared on RDD. On Tue, Jul 8, 2014 at 6:25 PM, Konstantin Kudryavtsev < kudryavtsev.konstan...@gmail.com> wrote: > Hi all, > > sorry for fooly question,

Re: how to convert RDD to PairRDDFunctions ?

2014-07-08 Thread Mark Hamstra
See Working with Key-Value Pairs . In particular: "In Scala, these operations are automatically available on RDDs containing Tuple2 objects (the built-in tuples in the language, created by simply writing (a, b)), as long as you import org
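
A short sketch of what both replies describe, assuming an RDD of records with a key field (oldRdd is from the question; otherwiseKeyedRdd is a hypothetical second pair RDD):

    import org.apache.spark.SparkContext._   // brings the implicit RDD[(K, V)] => PairRDDFunctions into scope

    // Sketch: key the records; pair operations such as leftOuterJoin then become available directly.
    val keyed  = oldRdd.map(i => (i.key, i))             // RDD[(KeyType, Record)]
    val joined = keyed.leftOuterJoin(otherwiseKeyedRdd)  // no explicit PairRDDFunctions constructor needed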

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I believe our full 60 days of data contains over ten million unique entities. Across 10 days I'm not sure, but it should be in the millions. I haven't verified that myself though. So that's the scale of the RDD we're writing to disk (each entry is entityId -> profile). I think it's hard to know ho

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
There is a difference from actual GC overhead, which can be reduced by reusing objects, versus this error, which actually means you ran out of memory. This error can probably be relieved by increasing your executor heap size, unless your data is corrupt and it is allocating huge arrays, or you are

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Jerry Lam
Hi Konstantin, I just ran into the same problem. I mitigated the issue by reducing the number of cores when I executed the job, which otherwise wouldn't be able to finish. Contrary to what many people believe, it might not mean that you were running out of memory. A better answer can be found here: http:/

Re: Comparative study

2014-07-08 Thread Kevin Markey
It seems to me that you're not taking full advantage of the lazy evaluation, especially persisting to disk only.  While it might be true that the cumulative size of the RDDs looks like it's 300GB, only a small portion of that should be resident at any one time.  We've eva

Re: java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)

2014-07-08 Thread Aaron Davidson
This seems almost equivalent to a heap size error -- since GCs are stop-the-world events, the fact that we were unable to release more than 2% of the heap suggests that almost all the memory is currently in use (i.e., live). Decreasing the number of cores is another solution which decreases memo
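
A sketch of the two knobs discussed in this thread, with placeholder values (property names as documented for Spark 1.0):

    import org.apache.spark.SparkConf

    // Sketch: a larger executor heap plus fewer concurrent tasks reduces live memory per task.
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g")   // bigger heap per executor
      .set("spark.cores.max", "8")          // fewer total cores => fewer tasks running at once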

Join two Spark Streaming

2014-07-08 Thread Bill Jay
Hi all, I am working on a pipeline that needs to join two Spark streams. The input is a stream of integers. And the output is each integer's number of appearances divided by the total number of unique integers. Suppose the input is: 1 2 3 1 2 2 There are 3 unique integers and 1 appears twice. Ther
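
A rough per-batch sketch of that computation, assuming a DStream[Int] named ints; the dummy key exists only to make the two single-key streams joinable:

    import org.apache.spark.streaming.StreamingContext._   // pair DStream operations

    // Sketch: per batch, count appearances of each integer and divide by the number of unique integers.
    val counts  = ints.countByValue()      // DStream[(Int, Long)]
    val uniques = counts.count()           // DStream[Long]: number of distinct integers in the batch
    val ratios  = counts.map(kv => (1, kv))
      .join(uniques.map(n => (1, n)))      // join on a dummy key
      .map { case (_, ((value, c), total)) => (value, c.toDouble / total) }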

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way. At this time, we are NOT using persist but cannot get the flow to complete in Standalone Cluster mode. We do not have a YARN-capable cluster at this time. We agree w

Further details on spark cluster set up

2014-07-08 Thread Sameer Tilak
Hi All, I used IP addresses in my scripts (spark-env.sh), and slaves contains the IP addresses of the master and slave nodes respectively. However, I still have no luck. Here is the relevant log file snippet: Master node log: 14/07/08 10:56:19 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster

Re: Use Spark Streaming to update result whenever data come

2014-07-08 Thread Bill Jay
Hi Tobias, Thanks for the suggestion. I have tried to add more nodes from 300 to 400. It seems the running time did not get improved. On Wed, Jul 2, 2014 at 6:47 PM, Tobias Pfeiffer wrote: > Bill, > > can't you just add more nodes in order to speed up the processing? > > Tobias > > > On Thu, J

Re: Comparative study

2014-07-08 Thread Kevin Markey
Nothing particularly custom.  We've tested with small (4 node) development clusters, single-node pseudoclusters, and bigger, using plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local, Spark Yarn (client and cluster) modes, with total me

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
How wide are the rows of data, either the raw input data or any generated intermediate data? We are at a loss as to why our flow doesn't complete. We banged our heads against it for a few weeks. -Suren On Tue, Jul 8, 2014 at 2:12 PM, Kevin Markey wrote: > Nothing particularly custom. We've

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Also, our exact same flow but with 1 GB of input data completed fine. -Suren On Tue, Jul 8, 2014 at 2:16 PM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > How wide are the rows of data, either the raw input data or any generated > intermediate data? > > We are at a loss as to why our

CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Dear All, When I look inside the following directory on my worker node: $SPARK_HOME/work/app-20140708110707-0001/3 I see the following error message: log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration). log4j:WARN Please initialize the log4j system properly. log

Re: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Aaron Davidson
Hmm, looks like the Executor is trying to connect to the driver on localhost, from this line: 14/07/08 11:07:13 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@localhost:39701/user/CoarseGrainedScheduler What is your setup? Standalone mode with 4 separate machines? Are yo
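
If the executors really are resolving the driver as localhost, a configuration sketch like the following forces the driver to advertise a reachable address (the hostname is a placeholder; spark.driver.host is the relevant property):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: make the driver advertise a hostname the worker machines can resolve.
    val conf = new SparkConf()
      .setMaster("spark://pzxnvm2018:7077")
      .setAppName("my-app")
      .set("spark.driver.host", "driver-node.example.org")   // placeholder for the driver machine's name
    val sc = new SparkContext(conf)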

Re: Spark Installation

2014-07-08 Thread Srikrishna S
Hi All, I tried the make-distribution script and it worked well. I was able to compile the Spark binary on our CDH5 cluster. Once I compiled Spark, I copied over the binaries in the dist folder to all the other machines in the cluster. However, I run into an issue while submitting a job in yarn-clie

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Marcelo Vanzin
Someone might be able to correct me if I'm wrong, but I don't believe standalone mode supports kerberos. You'd have to use Yarn for that. On Tue, Jul 8, 2014 at 1:40 AM, 许晓炜 wrote: > Hi all, > > > > I encounter a strange issue when using spark 1.0 to access hdfs with > Kerberos > > I just have on

Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
HI, I am getting this error. Can anyone help out to explain why is this error coming. Exception in thread "delete Spark temp dir C:\Users\shawn\AppData\Local\Temp\spark-27f60467-36d4-4081-aaf5-d0ad42dda560" java.io.IOException: Failed to delete: C:\Users\shawn\AppData\Local\Temp\spark-

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
This is generally a side effect of your executor being killed. For example, Yarn will do that if you're going over the requested memory limits. On Tue, Jul 8, 2014 at 12:17 PM, Rahul Bhojwani wrote: > HI, > > I am getting this error. Can anyone help out to explain why is this error > coming. > >

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Hi Marcelo. Thanks for the quick reply. Can you suggest me how to increase the memory limits or how to tackle this problem. I am a novice. If you want I can post my code here. Thanks On Wed, Jul 9, 2014 at 12:50 AM, Marcelo Vanzin wrote: > This is generally a side effect of your executor bein

RE: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread sstilak
Hi Aaron, I have 4 nodes - 1 master and 3 workers. I am not setting up driver public dns name anywhere. I didn't see that step in the documentation -- may be I missed it. Can you please point me in the right direction? Sent via the Samsung GALAXY S®4, an AT&T 4G LTE smartphone Origina

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Note I didn't say that was your problem - it would be if (i) you're running your job on Yarn and (ii) you look at the Yarn NodeManager logs and see that it's actually killing your process. I just said that the exception shows up in those kinds of situations. You haven't provided enough information

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
We kind of hijacked Santos' original thread, so apologies for that and let me try to get back to Santos' original question on Map/Reduce versus Spark. I would say it's worth migrating from M/R, with the following thoughts. Just my opinion but I would summarize the latest emails in this thread as

Re: Comparative study

2014-07-08 Thread Sean Owen
On Tue, Jul 8, 2014 at 8:32 PM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > > Libraries like Scoobi, Scrunch and Scalding (and their associated Java > versions) provide a Spark-like wrapper around Map/Reduce but my guess is > that, since they are limited to Map/Reduce under the covers,

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Here I am adding my code. If you can have a look to help me out. Thanks ### import tokenizer import gettingWordLists as gl from pyspark.mllib.classification import NaiveBayes from numpy import array from pyspark import SparkContext, SparkConf conf = (SparkConf().setMaster("loc

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
I have pasted the logs below: PS F:\spark-0.9.1\codes\sentiment analysis> pyspark .\naive_bayes_analyser.py Running python with PYTHONPATH=F:\spark-0.9.1\spark-0.9.1\bin\..\python; SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/F:/spark-0.9.1/spark-0.9.1/as

Re: Scheduling in spark

2014-07-08 Thread Sujeet Varakhedi
This is a good start: http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html On Tue, Jul 8, 2014 at 9:11 AM, rapelly kartheek wrote: > Hi, > I am a post graduate student, new to spark. I want to understand how > Spark scheduler works. I just have theoretical understanding of DAG >

[Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Hi there! 1/ Is there a way to convert a SchemaRDD (for instance loaded from a parquet file) back to an RDD of a given case class? 2/ Even better, is there a way to get the schema information from a SchemaRDD? I am trying to figure out how to properly get the various fields of the Rows of a Schem

Re: Comparative study

2014-07-08 Thread Reynold Xin
Not sure exactly what is happening but perhaps there are ways to restructure your program for it to work better. Spark is definitely able to handle much, much larger workloads. I've personally run a workload that shuffled 300 TB of data. I've also ran something that shuffled 5TB/node and stuffed m

Re: Scheduling in spark

2014-07-08 Thread Andrew Or
Here's the most updated version of the same page: http://spark.apache.org/docs/latest/job-scheduling 2014-07-08 12:44 GMT-07:00 Sujeet Varakhedi : > This is a good start: > > http://www.eecs.berkeley.edu/~tdas/spark_docs/job-scheduling.html > > > On Tue, Jul 8, 2014 at 9:11 AM, rapelly kartheek

Re: Powered By Spark: Can you please add our org?

2014-07-08 Thread Reynold Xin
I added you to the list. Cheers. On Mon, Jul 7, 2014 at 6:19 PM, Alex Gaudio wrote: > Hi, > > Sailthru is also using Spark. Could you please add us to the Powered By > Spark > page > when you have a chance? > > Organization

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Xiangrui Meng
Hi Rahul, We plan to add online model updates with Spark Streaming, perhaps in v1.1, starting with linear methods. Please open a JIRA for Naive Bayes. For Naive Bayes, we need to update the priors and conditional probabilities, which means we should also remember the number of observations for the

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Ionized
Thanks for the heads-up. In the meantime, we'd like to test this out ASAP - are there any open PR's we could take to try it out? (or do you have an estimate on when some will be available?) On Tue, Jul 8, 2014 at 12:24 AM, Michael Armbrust wrote: > This is on the roadmap for the next release (

Re: Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Xiangrui Meng
Well, I believe this is a correct implementation, but please let us know if you run into problems. The NaiveBayes implementation in MLlib v1.0 supports sparse data, which is usually the case for text classification. I would recommend upgrading to v1.0. -Xiangrui On Tue, Jul 8, 2014 at 7:20 AM, Rah
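
For reference, a sketch of training on sparse term-count vectors with MLlib v1.0 (the vocabulary size, indices, and counts are made-up placeholders):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Sketch: each document is a sparse term-count vector over a fixed vocabulary of 10,000 terms.
    val doc = LabeledPoint(1.0, Vectors.sparse(10000, Array(3, 57, 912), Array(2.0, 1.0, 4.0)))
    val training = sc.parallelize(Seq(doc))             // normally built from your whole corpus
    val model = NaiveBayes.train(training, lambda = 1.0)
    val prediction = model.predict(Vectors.sparse(10000, Array(57), Array(1.0)))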

Re: got java.lang.AssertionError when run sbt/sbt compile

2014-07-08 Thread Xiangrui Meng
try sbt/sbt clean first On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 wrote: > Hi guys, > when i try to compile the latest source by sbt/sbt compile, I got an error. > Can any one help me? > > The following is the detail: it may cause by TestSQLContext.scala > [error] > [error] while compiling: > /d

Re: Spark SQL registerAsTable requires a Java Class?

2014-07-08 Thread Michael Armbrust
Yin (cc-ed) is working on it as we speak. We'll post to the JIRA as soon as a PR is up. On Tue, Jul 8, 2014 at 1:04 PM, Ionized wrote: > Thanks for the heads-up. > > In the meantime, we'd like to test this out ASAP - are there any open PR's > we could take to try it out? (or do you have an est

Re: Comparative study

2014-07-08 Thread Daniel Siegmann
I think we're missing the point a bit. Everything was actually flowing through smoothly and in a reasonable time. Until it reached the last two tasks (out of over a thousand in the final stage alone), at which point it just fell into a coma. Not so much as a cranky message in the logs. I don't kno

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Michael Armbrust
On Tue, Jul 8, 2014 at 12:43 PM, Pierre B < pierre.borckm...@realimpactanalytics.com> wrote: > > 1/ Is there a way to convert a SchemaRDD (for instance loaded from a > parquet > file) back to a RDD of a given case class? > There may be someday, but doing so will either require a lot of reflection
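
Until such a conversion exists, a manual mapping sketch (the Person case class, the parquet path, and the positional getters are assumptions about the file's schema, not part of the answer above):

    case class Person(name: String, age: Int)

    // Sketch: rebuild the case class from each Row by position.
    val schemaRdd = sqlContext.parquetFile("people.parquet")   // placeholder file
    val people = schemaRdd.map(row => Person(row.getString(0), row.getInt(1)))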

Re: Comparative study

2014-07-08 Thread Aaron Davidson
> > Not sure exactly what is happening but perhaps there are ways to > restructure your program for it to work better. Spark is definitely able to > handle much, much larger workloads. +1 @Reynold Spark can handle big "big data". There are known issues with informing the user about what went wro

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Xiangrui Meng
1) The feature dimension should be a fixed number before you run NaiveBayes. If you use bag of words, you need to handle the word-to-index dictionary by yourself. You can either ignore the words that never appear in training (because they have no effect in prediction), or use hashing to randomly pr
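
A sketch of the hashing option mentioned above, which avoids maintaining a word-to-index dictionary (the dimension of 10,000 is arbitrary, and hash collisions are simply accepted):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // Sketch: hash each token into a fixed number of buckets so the feature dimension never changes.
    def hashedBagOfWords(tokens: Seq[String], numFeatures: Int = 10000): Vector = {
      val counts = scala.collection.mutable.Map.empty[Int, Double]
      for (t <- tokens) {
        val idx = math.abs(t.hashCode) % numFeatures
        counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
      }
      Vectors.sparse(numFeatures, counts.toSeq)
    }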

Re: error when spark access hdfs with Kerberos enable

2014-07-08 Thread Sandy Ryza
That's correct. Only Spark on YARN supports Kerberos. -Sandy On Tue, Jul 8, 2014 at 12:04 PM, Marcelo Vanzin wrote: > Someone might be able to correct me if I'm wrong, but I don't believe > standalone mode supports kerberos. You'd have to use Yarn for that. > > On Tue, Jul 8, 2014 at 1:40 AM,

Re: Help for the large number of the input data files

2014-07-08 Thread Xiangrui Meng
You can either use sc.wholeTextFiles and then a flatMap to reduce the number of partitions, or give more memory to the driver process by using --driver-memory 20g and then call RDD.repartition(small number) after you load the data in. -Xiangrui On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun K
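
A sketch of both options, with placeholder paths and partition counts:

    // Option 1: read many small files as (filename, content) pairs, splitting records yourself.
    val merged = sc.wholeTextFiles("hdfs:///data/small-files/")              // placeholder path
      .flatMap { case (_, content) => content.split("\n") }

    // Option 2: load normally, then shrink the partition count
    // (launching with e.g. --driver-memory 20g if the driver needs more headroom).
    val fewer = sc.textFile("hdfs:///data/small-files/*").repartition(100)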

Re: [Spark SQL]: Convert SchemaRDD back to RDD

2014-07-08 Thread Pierre B
Cool, thanks Michael! Message sent from a mobile device - excuse typos and abbreviations > On 8 Jul 2014, at 22:17, Michael Armbrust [via Apache Spark User List] > wrote: > >> On Tue, Jul 8, 2014 at 12:43 PM, Pierre B <[hidden email]> wrote: >> 1/ Is there a way to convert a SchemaRDD (for i

OutOfMemory : Java heap space error

2014-07-08 Thread Rahul Bhojwani
Hi, My code was running properly but then it suddenly gave this error. Can you just put some light on it. ### 0 KB, free: 38.7 MB) 14/07/09 01:46:12 INFO BlockManagerMaster: Updated info of block rdd_2212_4 14/07/09 01:46:13 INFO PythonRDD: Times: total = 1486, boot = 698, ini

Re: How to incorporate the new data in the MLlib-NaiveBayes model along with predicting?

2014-07-08 Thread Rahul Bhojwani
Thanks a lot Xiangrui. This will help. On Wed, Jul 9, 2014 at 1:34 AM, Xiangrui Meng wrote: > Hi Rahul, > > We plan to add online model updates with Spark Streaming, perhaps in > v1.1, starting with linear methods. Please open a JIRA for Naive > Bayes. For Naive Bayes, we need to update the pri

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Hi Rahul, Can you try calling "sc.close()" at the end of your program, so Spark can clean up after itself? On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani wrote: > Here I am adding my code. If you can have a look to help me out. > Thanks > ### > > import tokenizer > import ge

Re: Is MLlib NaiveBayes implementation for Spark 0.9.1 correct?

2014-07-08 Thread Rahul Bhojwani
Thanks a lot Xiangrui for the help. On Wed, Jul 9, 2014 at 1:39 AM, Xiangrui Meng wrote: > Well, I believe this is a correct implementation but please let us > know if you run into problems. The NaiveBayes implementation in MLlib > v1.0 supports sparse data, which is usually the case for text >

Re: Error and doubts in using Mllib Naive bayes for text clasification

2014-07-08 Thread Rahul Bhojwani
Thanks Xiangrui. You have solved almost all my problems :) On Wed, Jul 9, 2014 at 1:47 AM, Xiangrui Meng wrote: > 1) The feature dimension should be a fixed number before you run > NaiveBayes. If you use bag of words, you need to handle the > word-to-index dictionary by yourself. You can either

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Sorry, that would be sc.stop() (not close). On Tue, Jul 8, 2014 at 1:31 PM, Marcelo Vanzin wrote: > Hi Rahul, > > Can you try calling "sc.close()" at the end of your program, so Spark > can clean up after itself? > > On Tue, Jul 8, 2014 at 12:40 PM, Rahul Bhojwani > wrote: >> Here I am adding my

Re: Comparative study

2014-07-08 Thread Surendranauth Hiraman
Aaron, I don't think anyone was saying Spark can't handle this data size, given testimony from the Spark team, Bizo, etc., on large datasets. This has kept us trying different things to get our flow to work over the course of several weeks. Agreed that the first instinct should be "what did I do

Re: Error: Could not delete temporary files.

2014-07-08 Thread Rahul Bhojwani
Thanks Marcelo. I was having another problem. My code was running properly and then it suddenly stopped with the error: java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.<init>(Unknown Source) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:

Re: Error: Could not delete temporary files.

2014-07-08 Thread Marcelo Vanzin
Have you tried the obvious (increase the heap size of your JVM)? On Tue, Jul 8, 2014 at 2:02 PM, Rahul Bhojwani wrote: > Thanks Marcelo. > I was having another problem. My code was running properly and then it > suddenly stopped with the error: > > java.lang.OutOfMemoryError: Java heap space >

RE: CoarseGrainedExecutorBackend: Driver Disassociated

2014-07-08 Thread Sameer Tilak
Hi Aaron, Would really appreciate your help if you can point me to the documentation. Is this something that I need to do with /etc/hosts on each of the worker machines? Or do I set SPARK_PUBLIC_DNS (if yes, what is the format?) or something else? I have the following set up: master node: pzxnv

Re: Comparative study

2014-07-08 Thread Robert James
As a new user, I can definitely say that my experience with Spark has been rather raw. The appeal of interactive, batch, and in between all using more or less straight Scala is unarguable. But the experience of deploying Spark has been quite painful, mainly about gaps between compile time and run

spark-1.0.0-rc11 2f1dc868 spark-shell not honoring --properties-file option?

2014-07-08 Thread Andrew Lee
Build: Spark 1.0.0 rc11 (git commit tag: 2f1dc868e5714882cf40d2633fb66772baf34789) Hi All, When I enabled the spark-defaults.conf for the eventLog, spark-shell broke while spark-submit works. I'm trying to create a separate directory per user to keep track with their own Spark job event

Re: Spark: All masters are unresponsive!

2014-07-08 Thread Andrew Or
It seems that your driver (which I'm assuming you launched on the master node) can now connect to the Master, but your executors cannot. Did you make sure that all nodes have the same conf/spark-defaults.conf, conf/spark-env.sh, and conf/slaves? It would be good if you can post the stderr of the ex

Re: Spark job tracker.

2014-07-08 Thread abhiguruvayya
Hello Mayur, How can I implement the methods mentioned below? Do you have any clue on this? Please let me know. public void onJobStart(SparkListenerJobStart arg0) { } @Override public void onStageCompleted(SparkListenerStageCompleted arg0) { }
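
A sketch of one way to wire those callbacks up, shown in Scala (the event fields used here, jobId and stageInfo.name, are as of Spark 1.0 and may differ in other versions):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageCompleted}

    // Sketch: a listener that logs job starts and stage completions, registered on the SparkContext.
    class JobProgressLogger extends SparkListener {
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"Job ${jobStart.jobId} started")
      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit =
        println(s"Stage ${stageCompleted.stageInfo.name} completed")
    }

    sc.addSparkListener(new JobProgressLogger)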

Spark-streaming-kafka error

2014-07-08 Thread Bill Jay
Hi all, I used sbt to package a code that uses spark-streaming-kafka. The packaging succeeded. However, when I submitted to yarn, the job ran for 10 seconds and there was an error in the log file as follows: Caused by: java.lang.NoClassDefFoundError: org/apache/spark/streaming/kafka/KafkaUtils$
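
A common cause is that spark-streaming-kafka and its transitive Kafka dependencies are not packaged into the application jar that reaches YARN. A build.sbt sketch, assuming the sbt-assembly plugin and Spark 1.0.0, that keeps the Kafka connector inside the assembly while marking core Spark as provided:

    // build.sbt sketch (assumes the sbt-assembly plugin is installed)
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"            % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming"       % "1.0.0" % "provided",
      "org.apache.spark" %% "spark-streaming-kafka" % "1.0.0"   // must end up inside the assembly jar
    )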

issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Mikhail Strebkov
Hi! I've been using Spark compiled from 1.0 branch at some point (~2 month ago). The setup is a standalone cluster with 4 worker machines and 1 master machine. I used to run spark shell like this: ./bin/spark-shell -c 30 -em 20g -dm 10g Today I've finally updated to Spark 1.0 release. Now I can

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
Hi Mikhail, It looks like the documentation is a little out-dated. Neither is true anymore. In general, we try to shift away from short options ("-em", "-dm" etc.) in favor of more explicit ones ("--executor-memory", "--driver-memory"). These options, and "--cores", refer to the arguments passed i

Re: issues with ./bin/spark-shell for standalone mode

2014-07-08 Thread Andrew Or
>> "The proper way to specify this is through "spark.master" in your config or the "--master" parameter to spark-submit." By "this" I mean configuring which master the driver connects to (not which port and address the standalone Master binds to). 2014-07-08 16:43 GMT-07:00 Andrew Or : > Hi Mik
