Re: Dynamic visualizations from Spark Streaming output?

2014-10-01 Thread Akhil Das
You can start with the Zeppelin project; it currently has support for SparkContext (not streaming). Since the code is open, you can customize it for your use case. Here's a video of it: https://www.youtube.com/watch?v=_PQbVH_aO5E&feature=youtu.be Thanks Best Regards

Re: Multiple exceptions in Spark Streaming

2014-10-01 Thread Akhil Das
Looks like a configuration issue, can you paste your spark-env.sh on the worker? Thanks Best Regards On Wed, Oct 1, 2014 at 8:27 AM, Tathagata Das wrote: > It would help to turn on debug level logging in log4j and see the logs. > Just looking at the error logs is not giving me any sense. :( > >

Re: Multiple exceptions in Spark Streaming

2014-10-01 Thread Shaikh Riyaz
Hi Akhil, Thanks for your reply. We are using CDH 5.1.3 and spark configuration is taken care by Cloudera configuration. Please let me know if you would like to review the configuration. Regards, Riyaz On Wed, Oct 1, 2014 at 10:10 AM, Akhil Das wrote: > Looks like a configuration issue, can y

Re: Multiple exceptions in Spark Streaming

2014-10-01 Thread Akhil Das
In that case, fire up a spark-shell and try the following: scala>import org.apache.spark.streaming.{Seconds, StreamingContext} > scala>import org.apache.spark.streaming.StreamingContext._ > scala>val ssc = new > StreamingContext("spark://YOUR-SPARK-MASTER-URI","Streaming > Job",Seconds(5),"/home/ak
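The quoted snippet is cut off above; a minimal sketch of the same idea using the SparkConf-based constructor (the master URL, batch interval and socket source are placeholders, not the original poster's values) would look roughly like:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // placeholder: point the master URL at your standalone cluster
    val conf = new SparkConf()
      .setMaster("spark://YOUR-SPARK-MASTER-URI:7077")
      .setAppName("Streaming Job")
    val ssc = new StreamingContext(conf, Seconds(5))      // 5-second batches

    val lines = ssc.socketTextStream("localhost", 9999)   // illustrative input source
    lines.count().print()                                  // per-batch event counts

    ssc.start()
    ssc.awaitTermination()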

persistent state for spark streaming

2014-10-01 Thread Chia-Chun Shih
Hi, My application digests user logs and deducts user quotas. I need to maintain the latest states of user quotas persistently, so that the latest user quotas will not be lost. I have tried *updateStateByKey* to generate a DStream for user quotas and called *persist(StorageLevel.MEMORY_AND_DISK()

any code examples demonstrating spark streaming applications which depend on states?

2014-10-01 Thread Chia-Chun Shih
Hi, Are there any code examples demonstrating spark streaming applications which depend on states? That is, last-run *updateStateByKey* results are used as inputs. Thanks.

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Sean Owen
It's much easier than all this. Spark Streaming gives you a DStream of RDDs. You want the count for each RDD. DStream.count() gives you exactly that: a DStream of Longs which are the counts of events in each mini batch. On Tue, Sep 30, 2014 at 8:42 PM, Andy Davidson wrote: > Hi > > I have a simpl

Spark AccumulatorParam generic

2014-10-01 Thread Johan Stenberg
Hi, I have a problem with using accumulators in Spark. As seen on the Spark website, if you want custom accumulators you can simply extend (with an object) the AccumulatorParam trait. The problem is that I need to make that object generic, such as this: object SeqAccumulatorParam[B] extends A

Re: Spark AccumulatorParam generic

2014-10-01 Thread Johan Stenberg
Just realized that, of course, objects can't be generic, but how do I create a generic AccumulatorParam? 2014-10-01 12:33 GMT+02:00 Johan Stenberg : > Hi, > > I have a problem with using accumulators in Spark. As seen on the Spark > website, if you want custom accumulators you can simply extend (

Re: Installation question

2014-10-01 Thread Sean Owen
If you want a single-machine 'cluster' to try all of these things, you don't strictly need a distribution, but, it will probably save you a great deal of time and trouble compared to setting all of this up by hand. Naturally I would promote CDH, as it contains Spark and Mahout and supports them al

Re: Spark SQL + Hive + JobConf NoClassDefFoundError

2014-10-01 Thread Patrick McGloin
FYI, in case anybody else has this problem, we switched to Spark 1.1 (outside CDH) and the same Spark application worked first time (once recompiled with Spark 1.1 libs of course). I assume this is because Spark 1.1 is compiled with Hive. On 29 September 2014 17:41, Patrick McGloin wrote: > Hi,

Re: MLLib ALS question

2014-10-01 Thread Alex T
Hi, thanks for the reply. I added the ALS.setIntermediateRDDStorageLevel and it worked well (a little slow, but it still did the job and I've done the MF and got all the features). But even if I persist with DISK_ONLY, the system monitor's memory and swap history shows that Apache Spark is using R
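For reference, a minimal sketch of what setting the intermediate RDD storage level on ALS looks like (the rank, iteration count and ratings RDD are placeholders):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}
    import org.apache.spark.rdd.RDD
    import org.apache.spark.storage.StorageLevel

    // ratings: RDD[Rating] built elsewhere (placeholder)
    def train(ratings: RDD[Rating]) = {
      new ALS()
        .setRank(20)
        .setIterations(10)
        .setIntermediateRDDStorageLevel(StorageLevel.DISK_ONLY)  // spill intermediate blocks to disk
        .run(ratings)
    }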

Re: Spark Streaming for time consuming job

2014-10-01 Thread Mayur Rustagi
Calling collect on anything is almost always a bad idea. The only exception is if you are looking to pass that data on to any other system & never see it again :) . I would say you need to implement outlier detection on the rdd & process it in spark itself rather than calling collect on it. Regar

Solution for small files in HDFS

2014-10-01 Thread rzykov
We encountered a problem loading a huge number of small files (hundreds of thousands of files) from HDFS in Spark. Our jobs failed over time. This forced us to write our own loader that combines files by means of Hadoop's CombineFileInputFormat. It significantly reduced the number of mappers from 10

KryoSerializer exception in Spark Streaming JAVA

2014-10-01 Thread Mudassar Sarwar
Hi, I'm implementing KryoSerializer for my custom class. Here is class public class ImpressionFactsValue implements KryoSerializable { private int hits; public ImpressionFactsValue() { } public int getHits() { re

Re: persistent state for spark streaming

2014-10-01 Thread Yana Kadiyska
I don't think persist is meant for end-user usage. You might want to call saveAsTextFiles, for example, if you're saving to the file system as strings. You can also dump the DStream to a DB -- there are samples on this list (you'd have to do a combo of foreachRDD and mapPartitions, likely) On Wed,
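A rough sketch of both suggestions, where quotaStream stands in for the DStream of user quotas and the connection/save helpers are hypothetical:

    // write each mini-batch out per partition, opening one connection per partition
    quotaStream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        val conn = createDbConnection()             // hypothetical helper
        records.foreach(r => saveQuota(conn, r))    // hypothetical helper
        conn.close()
      }
    }

    // or simply dump each batch interval to the file system as text
    quotaStream.saveAsTextFiles("hdfs:///data/quotas")   // prefix is illustrative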

Re: any code examples demonstrating spark streaming applications which depend on states?

2014-10-01 Thread Yana Kadiyska
I don't think your question is very clear -- *updateStateByKey* usually updates the previous state. For example, the StatefulNetworkWordCount example that ships with Spark shows the following snippet: val updateFunc = (values: Seq[Int], state: Option[Int]) => { val currentCount = values.sum
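The snippet is truncated above; the update function from that shipped example continues along these lines (wordDstream being a DStream[(String, Int)]):

    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum           // occurrences in this batch
      val previousCount = state.getOrElse(0)  // running count carried over from earlier batches
      Some(currentCount + previousCount)
    }

    // the resulting DStream holds the cumulative count per word across batches
    val stateDstream = wordDstream.updateStateByKey[Int](updateFunc)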

Re: How to get SparckContext inside mapPartitions?

2014-10-01 Thread Daniel Siegmann
I don't think you can get a SparkContext inside an RDD function (such as mapPartitions), but you shouldn't need to. Have you considered returning the data read from the database from mapPartitions to create a new RDD and then just save it to a file like normal? For example: rddObject.mapPartition
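A minimal sketch of that suggestion, with readFromDatabase standing in for whatever lookup the original code does inside mapPartitions:

    // hypothetical: look up each key against the database and keep the results as a new RDD
    val fetched = rddObject.mapPartitions { keys =>
      keys.map(key => readFromDatabase(key))    // readFromDatabase is a placeholder helper
    }

    // then persist the new RDD like any other
    fetched.saveAsTextFile("hdfs:///tmp/db-dump")   // path is illustrative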

Problem with very slow behaviour of TorrentBroadcast vs. HttpBroadcast

2014-10-01 Thread Guillaume Pitel
Hi, We've had some performance issues since switching to 1.1.0, and we finally found the origin : TorrentBroadcast seems to be very slow in our setting (and it became default with 1.1.0) The logs of a 4MB variable with TB : (15s) 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84

GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Hi, Sorry this question may be trivial. I'm new to Spark and GraphX. I need to create a graph that has different types of nodes(3 types) and edges(4 types). Each type of node and edge has a different list of attributes. 1) How should I build the graph? Should I specify all types of nodes(or edge

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread andy petrella
I'll try my best ;-). 1/ you could create an abstract type for the types (1 on top of the V types, 1 other on top of the E types), then use the subclasses as payload in your VertexRDD or in your Edge. Regarding storage and files, it doesn't really matter (unless you want to use the OOTB loading method, thus you
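A minimal sketch of option 1, with made-up node and edge types as payloads (assuming an existing SparkContext sc; the names are illustrative, not from the thread):

    import org.apache.spark.graphx._
    import org.apache.spark.rdd.RDD

    // abstract supertypes for the node types and edge types
    sealed trait NodeAttr
    case class Author(name: String) extends NodeAttr
    case class Paper(title: String) extends NodeAttr

    sealed trait EdgeAttr
    case class Wrote(year: Int) extends EdgeAttr
    case class Cites(count: Int) extends EdgeAttr

    val vertices: RDD[(VertexId, NodeAttr)] = sc.parallelize(Seq(
      (1L, Author("alice")), (2L, Paper("graphs"))))
    val edges: RDD[Edge[EdgeAttr]] = sc.parallelize(Seq(
      Edge(1L, 2L, Wrote(2014))))

    // pattern-match on the payload later to tell the concrete types apart
    val graph: Graph[NodeAttr, EdgeAttr] = Graph(vertices, edges)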

Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of an object, which can be generic? class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] { } On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg wrote: > Just realized that, of course, objects can't be generic, but how do I > create a generic AccumulatorParam?
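A minimal sketch of that class-based approach (the Seq payload and the usage line are illustrative):

    import org.apache.spark.AccumulatorParam

    class SeqAccumulatorParam[B] extends AccumulatorParam[Seq[B]] {
      def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]
      def addInPlace(s1: Seq[B], s2: Seq[B]): Seq[B] = s1 ++ s2
    }

    // usage, assuming an existing SparkContext sc
    val acc = sc.accumulator(Seq.empty[Int])(new SeqAccumulatorParam[Int])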

MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat: outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class); but I'm getting this compilation error: Bound mismatch: The generic method saveAsNewAPIHadoopFil

Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
Hi, What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Are you trying to do something along the lines of what's described here? https://issues.apache.org/jira/browse/SPARK-3533 On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini wrote: > Hi, > > I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with > MultipleTextOutputFormat,: > > outRdd

org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-01 Thread tonsat
We have a configuration of CDH 5.0, Spark 1.1.0 (standalone) and Hive 0.12. We are trying to run some real-time analytics from Tableau 8.1 (BI tool) and one of the dashboards was failing with the error below. Is it something to do with an aggregate function not supported by Spark SQL? Any help appreciated. 14/10/

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Yes exactly.. so I guess this is still an open request. Any workaround? On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas wrote: > Are you trying to do something along the lines of what's described here? > https://issues.apache.org/jira/browse/SPARK-3533 > > On Wed, Oct 1, 2014 at 10:53 AM, Tomer

Re: Reading from HBase is too slow

2014-10-01 Thread Tao Xiao
I can submit a MapReduce job reading that table; its processing rate is also a little slower than I expected, but not as slow as with Spark. 2014-10-01 12:04 GMT+08:00 Ted Yu : > Can you launch a job which exercises TableInputFormat on the same table > without using Spark ? > > This would s

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Not that I'm aware of. I'm looking for a work-around myself! On Wed, Oct 1, 2014 at 11:15 AM, Tomer Benyamini wrote: > Yes exactly.. so I guess this is still an open request. Any workaround? > > On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas > wrote: > > Are you trying to do something along t

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
There is this thread on Stack Overflow about the same topic, which you may find helpful. On Wed, Oct 1, 2014 at 11:17 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Not that I'm aware o
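One commonly used workaround (not confirmed from that thread) is to fall back to the old mapred API with saveAsHadoopFile and a small MultipleTextOutputFormat subclass; a hedged sketch, where the key-to-filename mapping is illustrative:

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
      // write each record into a file named after its key
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        key.toString
      // drop the key from the output line itself
      override def generateActualKey(key: Any, value: Any): Any =
        NullWritable.get()
    }

    // pairRdd: RDD[(String, String)] built elsewhere (placeholder)
    pairRdd.saveAsHadoopFile("/tmp/out", classOf[String], classOf[String], classOf[KeyBasedOutput])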

Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit formulation... On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng wrote: > We don't handle missing value imputation in the current version of > MLlib. In future releases, we can store feature information in the > dataset metadata, wh

spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster("yarn-client") .setAppName("test") .set("spark.driver.memory", "1G") .set("spark.ex

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Using TableInputFormat is not the fastest way of reading data from HBase. Do not expect 100s of Mb per sec. You probably should take a look at M/R over HBase snapshots. https://issues.apache.org/jira/browse/HBASE-8369 -Vladimir Rodionov On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao wrote: > I can s

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
You can't set the driver memory programmatically in client mode. In that mode, the same JVM is running the driver, so you can't modify command line options anymore when initializing the SparkContext. (And you can't really start cluster mode apps that way, so the only way to set this is through t

Re: Reading from HBase is too slow

2014-10-01 Thread Ted Yu
As far as I know, that feature is not in CDH 5.0.0 FYI On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov < vrodio...@splicemachine.com> wrote: > Using TableInputFormat is not the fastest way of reading data from HBase. > Do not expect 100s of Mb per sec. You probably should take a look at M/R >

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Yes, its in 0.98. CDH is free (w/o subscription) and sometimes its worth upgrading to the latest version (which is 0.98 based). -Vladimir Rodionov On Wed, Oct 1, 2014 at 9:52 AM, Ted Yu wrote: > As far as I know, that feature is not in CDH 5.0.0 > > FYI > > On Wed, Oct 1, 2014 at 9:34 AM, Vladi

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
thanks Marcelo. What's the reason it is not possible in cluster mode, either? On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin wrote: > You can't set up the driver memory programatically in client mode. In > that mode, the same JVM is running the driver, so you can't modify > command line options

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-10-01 Thread Kan Zhang
CC user@ for indexing. Glad you fixed it. All source code for these examples are under SPARK_HOME/examples. For example, the converters used here are in examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala Btw, you may find our blog post useful. https://databri

Re: MLLib ALS question

2014-10-01 Thread Xiangrui Meng
ALS still needs to load and deserialize the in/out blocks (one by one) from disk and then construct least squares subproblems. All happen in RAM. The final model is also stored in memory. -Xiangrui On Wed, Oct 1, 2014 at 4:36 AM, Alex T wrote: > Hi, thanks for the reply. > > I added the ALS.setIn

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
Because that's not how you launch apps in cluster mode; you have to do it through the command line, or by calling directly the respective backend code to launch it. (That being said, it would be nice to have a programmatic way of launching apps that handled all this - this has been brought up in a

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
when you say "respective backend code to launch it", I thought this is the way to do that. thanks, Tamas On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin wrote: > Because that's not how you launch apps in cluster mode; you have to do > it through the command line, or by calling directly the respec

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
No, you can't instantiate a SparkContext to start apps in cluster mode. For Yarn, for example, you'd have to call directly into org.apache.spark.deploy.yarn.Client; that class will tell the Yarn cluster to launch the driver for you and then instantiate the SparkContext. On Wed, Oct 1, 2014 at 10:

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Andrew Or
Hi Tamas, Yes, Marcelo is right. The reason why it doesn't make sense to set "spark.driver.memory" in your SparkConf is because your application code, by definition, *is* the driver. This means by the time you get to the code that initializes your SparkConf, your driver JVM has already started wit

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Andy Davidson
Hi Sean I guess I am missing something. JavaDStream foo = … JavaDStream c = foo.count() This is circular. I need to get the count as an actual scalar value, not a JavaDStream. Someone else posted pseudo code that used foreachRDD(). This seems to work for me. Thanks Andy From: Sean Owen Dat

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Thank you for the replies. It makes sense for scala/java, but in python the JVM is launched when the spark context is initialised, so it should be able to set it, I assume. On 1 Oct 2014 18:25, "Andrew Or-2 [via Apache Spark User List]" < ml-node+s1001560n15510...@n3.nabble.com> wrote: > Hi Tamas,

Re: IPython Notebook Debug Spam

2014-10-01 Thread Davies Liu
On Tue, Sep 30, 2014 at 10:14 PM, Rick Richardson wrote: > I am experiencing significant logging spam when running PySpark in IPython > Notebok > > Exhibit A: http://i.imgur.com/BDP0R2U.png > > I have taken into consideration advice from: > http://apache-spark-user-list.1001560.n3.nabble.com/Disa

can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
Hi I am new to Spark Streaming. Can I think of JavaDStream<> foreachRDD() as being 'for each mini batch'? The java doc does not say much about this function. Here is the background. I am writing a little test program to figure out how to use streams. At some point I wanted to calculate an aggre

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Boromir Widas
1. Worker memory caps executor memory. 2. With the default config, every job gets one executor per worker. This executor runs with all cores available to the worker. On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya wrote: > Hi, > > What's the relationship between Spark worker and executor memory settings >

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Boromir Widas
Thanks a lot Andy and Debashish, your suggestions were of great help. On Tue, Sep 30, 2014 at 6:44 PM, Debasish Das wrote: > If the tree is too big build it on graphxbut it will need thorough > analysis so that the partitions are well balanced... > > On Tue, Sep 30, 2014 at 2:45 PM, Andy Twi

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Matei Zaharia
Some of the MLlib algorithms do tree reduction in 1.1: http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html. You can check out how they implemented it -- it is a series of reduce operations. Matei On Oct 1, 2014, at 11:02 AM, Boromir Widas wrote: > Thanks a lot

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Andy Twigg
Yes, that makes sense. It's similar to the all reduce pattern in vw. On Wednesday, 1 October 2014, Matei Zaharia wrote: > Some of the MLlib algorithms do tree reduction in 1.1: > http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html. > You can check out how they imp

Re: can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Sean Owen
Yes, foreachRDD will do your something for each RDD, which is what you get for each mini-batch of input. The operations you express on a DStream (or JavaDStream) are all, really, "for each RDD", including print(). Logging is a little harder to reason about since the logging will happen on a potent

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Sean Owen
Hm, I think this is the same thing though. The input is conceptually a stream of RDDs. You want the count of elements of each of the RDDs, so you get a stream of counts. In the foreachRDD example, you're just computing the count() of each RDD directly and printing it, so, you print a stream of coun

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas wrote: > 1. worker memory caps executor. > 2. With default config, every job gets one executor per worker. This > executor runs with all cores available to the worker. > > By the job do you mean one SparkContext or one stage execution within a progra

Protocol buffers with Spark ?

2014-10-01 Thread Jaonary Rabarisoa
Dear all, I have a Spark job that communicates with C++ code using pipe. Since the data I need to send is rather complicated, I am thinking about using protobuf to serialize it. The problem is that the string form of my data output by protobuf contains the "\n" character, so it's a bit complicated to
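One common way around the newline problem (not suggested in this thread, just a possible approach) is to Base64-encode each serialized message so the piped lines never contain raw protobuf bytes; a rough sketch, assuming protoRdd is an RDD of protobuf messages:

    import java.util.Base64   // JDK 8+

    // toByteArray is the standard protobuf-java serializer; the C++ side would
    // Base64-decode each line before parsing the message
    val piped = protoRdd
      .map(msg => Base64.getEncoder.encodeToString(msg.toByteArray))
      .pipe("./my_cpp_binary")   // hypothetical external program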

Re: Poor performance writing to S3

2014-10-01 Thread Gustavo Arjones
Hi, I found the answer to my problem, and I'm just writing to keep it as a KB entry. It turns out the problem wasn’t related to S3 performance; it was because my SOURCE was not fast enough. Due to the lazy nature of Spark, what I saw on the dashboard was saveAsTextFile at FacebookProcessor.scala:46 instead of the load

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Thanks for your reply. Unfortunately changing the log4j.properties within SPARK_HOME/conf has no effect on pyspark for me. When I change it in the master or workers the log changes have the desired effect, but pyspark seems to ignore them. I have changed the levels to WARN, changed the appender

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya wrote: > > > On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas wrote: > >> 1. worker memory caps executor. >> 2. With default config, every job gets one executor per worker. This >> executor runs with all cores available to the worker. >> >> By the job

RE: MLLib: Missing value imputation

2014-10-01 Thread Sameer Tilak
Thanks, Xiangrui and Debashish for your input. Date: Wed, 1 Oct 2014 08:35:51 -0700 Subject: Re: MLLib: Missing value imputation From: debasish.da...@gmail.com To: men...@gmail.com CC: ssti...@live.com; user@spark.apache.org If the missing values are 0, then you can also look into implicit formu

Re: Short Circuit Local Reads

2014-10-01 Thread Colin McCabe
On Tue, Sep 30, 2014 at 6:28 PM, Andrew Ash wrote: > Thanks for the research Kay! > > It does seem addressed, and hopefully fixed in that ticket conversation also > in https://issues.apache.org/jira/browse/HDFS-4697 So the best thing here > is to wait to upgrade to a version of Hadoop that has th

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Tathagata Das
To clarify the confusion here: dstream.count() generates a DStream[Long] which contains an RDD[Long] for each batch. Each of these RDDs has only one element in it, which is the count you are interested in. So the following are equivalent. dstream.foreachRDD { rdd => val count = rdd.cou
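Completing the truncated snippet, the two equivalent forms look roughly like this (dstream is any DStream of events):

    // compute the per-batch count directly inside foreachRDD
    dstream.foreachRDD { rdd =>
      val count = rdd.count()   // a plain Long for this mini-batch
      println(s"events in this batch: $count")
    }

    // or use count(), which yields a DStream[Long] with exactly one element per batch
    dstream.count().foreachRDD(rdd => println(s"events in this batch: ${rdd.first()}"))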

Re: IPython Notebook Debug Spam

2014-10-01 Thread Davies Liu
How do you setup IPython to access pyspark in notebook? I did as following, it worked for me: $ export SPARK_HOME=/opt/spark-1.1.0/ $ export PYTHONPATH=/opt/spark-1.1.0/python:/opt/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip $ ipython notebook All the logging will go into console (not in notebo

Re: can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
Hi Sean Many many thanks. This really clears a lot up for me Andy From: Sean Owen Date: Wednesday, October 1, 2014 at 11:27 AM To: Andrew Davidson Cc: "user@spark.apache.org" Subject: Re: can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ? > Yes, foreachRDD will d

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
I was starting PySpark as a profile within IPython Notebook as per: http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ My setup looks like: import os import sys spark_home = os.environ.get('SPARK_HOME', None) if not spark_home: raise ValueError('SPARK_HOME e

MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Guys, Obviously I am doing something wrong. Maybe 4 points are too small a dataset. Can you help me figure out why the following doesn't work? a) This works: data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(10.0, [10.0]), LabeledPoint(20.0, [20.0]), LabeledPoint(30.0, [30.0]) ]

Task deserialization problem using 1.1.0 for Hadoop 2.4

2014-10-01 Thread Timothy Potter
I'm running into the following deserialization issue when trying to run a very simple Java-based application using a local Master (see stack trace below). My code basically queries Solr using a custom Hadoop InputFormat. I've hacked my code to make sure the objects involved (PipelineDocumentWritab

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Here is the other relevant bit of my set-up: MASTER=spark://sparkmaster:7077 IPYTHON_OPTS="notebook --pylab inline --ip=0.0.0.0" CASSANDRA_NODES="cassandra1|cassandra2|cassandra3" PYSPARK_SUBMIT_ARGS="--master $MASTER --deploy-mode client --num-executors 6 --executor-memory 1g --executor-cores 1" i

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Out of curiosity, how do you actually launch pyspark in your set-up? On Wed, Oct 1, 2014 at 3:44 PM, Rick Richardson wrote: > Here is the other relevant bit of my set-up: > MASTER=spark://sparkmaster:7077 > IPYTHON_OPTS="notebook --pylab inline --ip=0.0.0.0" > CASSANDRA_NODES="cassandra1|cassand

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Boromir Widas
I have not found a way to control the cores yet. This effectively limits the cluster to a single application at a time. A subsequent application shows in the 'WAITING' State on the dashboard. On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya wrote: > > > On Wed, Oct 1, 2014 at 11:33 AM, Akshat Arany

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Liquan Pei
One indirect way to control the number of cores used in an executor is to set spark.cores.max and set spark.deploy.spreadOut to be true. The scheduler in the standalone cluster then assigns roughly the same number of cores (spark.cores.max/number of worker nodes) to each executor for an application
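A minimal sketch of that indirect control on the application side (values are illustrative; spark.deploy.spreadOut itself is a cluster-wide setting in the master's configuration, not something set per application):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("capped-app")
      .set("spark.cores.max", "8")          // total cores this application may take
      .set("spark.executor.memory", "4g")   // per-executor memory, capped by each worker's memory
    // with spark.deploy.spreadOut=true (the default) those 8 cores are spread across the workers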

Spark Monitoring with Ganglia

2014-10-01 Thread danilopds
Hi, I need monitoring some aspects about my cluster like network and resources. Ganglia looks like a good option for what I need. Then, I found out that Spark has support to Ganglia. On the Spark monitoring webpage there is this information: "To install the GangliaSink you’ll need to perform a cu

Re: Question About Submit Application

2014-10-01 Thread danilopds
I'll do this test and reply with the result afterwards. Thank you Marcelo. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072p15539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
Hi, After reading some previous posts about this issue, I have increased the java heap space to "-Xms64g -Xmx64g", but still hit the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does anyone have other suggestions? I am reading 200 GB of data and my total memory is 120 GB, so

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei
Hi How many nodes are in your cluster? It seems to me 64g does not help if each of your nodes doesn't have that much memory. Liquan On Wed, Oct 1, 2014 at 1:37 PM, anny9699 wrote: > Hi, > > After reading some previous posts about this issue, I have increased the > java heap space to "-Xms64g -Xmx64

run scalding on spark

2014-10-01 Thread Koert Kuipers
well, sort of! we make input/output formats (cascading taps, scalding sources) available in spark, and we ported the scalding fields api to spark. so it's for those of us that have a serious investment in cascading/scalding and want to leverage that in spark. blog is here: http://tresata.com/tresa

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
Hi Liquan, I have 8 workers, each with 15.7GB memory. What you said makes sense, but if I don't increase heap space, it keeps telling me "GC overhead limit exceeded". Thanks! Anny On Wed, Oct 1, 2014 at 1:41 PM, Liquan Pei [via Apache Spark User List] < ml-node+s1001560n1554...@n3.nabble.com> w

Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects. Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers wrote: > well, sort of! we make input/output formats (cascading taps, scalding > sources) a

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-10-01 Thread Gilberto Lira
Thank you Zhang! I am grateful for your help! 2014-10-01 14:05 GMT-03:00 Kan Zhang : > CC user@ for indexing. > > Glad you fixed it. All source code for these examples are under > SPARK_HOME/examples. For example, the converters used here are in > examples/src/main/scala/org/apache/spark/example

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Sean Owen
How are you setting this memory? You may be configuring the wrong process's memory, like the driver and not the executors. On Oct 1, 2014 9:37 PM, "anny9699" wrote: > Hi, > > After reading some previous posts about this issue, I have increased the > java heap space to "-Xms64g -Xmx64g", but still
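For the record, a minimal sketch of setting the executor memory on the application's SparkConf rather than on the driver JVM's -Xmx (the value is illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-job")
      .set("spark.executor.memory", "12g")   // memory per executor; must fit within each worker
    val sc = new SparkContext(conf)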

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
I found the problem. I was manually constructing the CLASSPATH and SPARK_CLASSPATH because I needed jars for running the cassandra lib. For some reason that I cannot explain, it was this that was causing the issue. Maybe one of the jars had a log4j.properties rolled up in it? I removed almost all

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei
Hi Anny, I am assuming that you perform some complex logic for processing. Can you try to reduce your data size using RDD.sample or RDD.filter before actual processing? That may reduce memory pressure. Liquan On Wed, Oct 1, 2014 at 1:53 PM, anny9699 wrote: > Hi Liquan, > > I have 8 workers,

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
I guess one way to do so would be to run >1 worker per node, like say, instead of running 1 worker and giving it 8 cores, you can run 4 workers with 2 cores each. Then, you get 4 executors with 2 cores each. On Wed, Oct 1, 2014 at 1:06 PM, Boromir Widas wrote: > I have not found a way to contro

Print Decision Tree Models

2014-10-01 Thread Jimmy McErlain
So I am trying to print the model output from MLlib however I am only getting things like the following: org.apache.spark.mllib.tree.model.DecisionTreeModel@1120c600 0.17171527904439082 0.8282847209556092 5273125.0 2.5435412E7 from the following code: val trainErr = labelAndPreds.filter(

Creating a feature vector from text before using with MLLib

2014-10-01 Thread Soumya Simanta
I'm trying to understand the intuition behind the features method that Aaron used in one of his demos. I believe this feature will just work for detecting the character set (i.e., language used). Can someone help ? def featurize(s: String): Vector = { val n = 1000 val result = new Array[Doub

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread 陈韵竹
Thanks Sean. This is how I set this memory. I set it when I start to run the job java -Xms64g -Xmx64g -cp /root/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/root/scala/lib/scala-library.jar:./target/MyProject.jar MyClass Is there some problem with it? On Wed, Oct 1, 2014 at 2:03 PM, Sean Ow

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei
Can you use spark-submit to set the executor memory? Take a look at https://spark.apache.org/docs/latest/submitting-applications.html. Liquan On Wed, Oct 1, 2014 at 2:21 PM, 陈韵竹 wrote: > Thanks Sean. This is how I set this memory. I set it when I start to run > the job > > java -Xms64g -Xmx

Determining number of executors within RDD

2014-10-01 Thread Akshat Aranya
Hi, I want to implement an RDD wherein the decision on the number of partitions is based on the number of executors that have been set up. Is there some way I can determine the number of executors within the getPartitions() call?
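There is no first-class API for this; one hedged sketch is to approximate it from the SparkContext's executor storage statuses (the driver also appears in that list, hence the -1), keeping in mind that executors may still be registering when getPartitions runs, so this is an approximation rather than a contract:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class ExecutorAwareRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
      override def getPartitions: Array[Partition] = {
        // roughly one storage status per executor, plus one for the driver
        val numExecutors = math.max(sparkContext.getExecutorStorageStatus.length - 1, 1)
        Array.tabulate[Partition](numExecutors) { i =>
          new Partition { override def index: Int = i }
        }
      }
      override def compute(split: Partition, context: TaskContext): Iterator[Int] = Iterator.empty
    }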

Re: run scalding on spark

2014-10-01 Thread Koert Kuipers
thanks On Wed, Oct 1, 2014 at 4:56 PM, Matei Zaharia wrote: > Pretty cool, thanks for sharing this! I've added a link to it on the wiki: > https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects > . > > Matei > > On Oct 1, 2014, at 1:41 PM, Koert Kuipers wrote: > > well, s

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Liquan Pei
The program computes hashed bigram frequencies normalized by the total number of bigrams, then filters out zero values. Hashing is an effective trick for vectorizing features. Take a look at http://en.wikipedia.org/wiki/Feature_hashing Liquan On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta wrote: > I'm
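A rough reconstruction of what such a featurize function does (the vector size and the bigram hashing are illustrative, not necessarily the exact code from the demo):

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    def featurize(s: String, n: Int = 1000): Vector = {
      val counts = new Array[Double](n)
      val bigrams = s.sliding(2).toSeq                // character bigrams
      bigrams.foreach { bg =>
        val idx = ((bg.hashCode % n) + n) % n         // hash each bigram into a fixed-size bucket
        counts(idx) += 1.0 / bigrams.size             // normalize by total bigram count
      }
      // keep only the non-zero entries
      val nonZero = counts.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
      Vectors.sparse(n, nonZero)
    }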

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Burak Yavuz
Hi, It appears that the step size is so high that the model is diverging with the added noise. Could you try setting the step size to 0.1 or 0.01? Best, Burak - Original Message - From: "Krishna Sankar" To: user@spark.apache.org Sent: Wednesday, October 1, 2014 12:43:20 PM Subj
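For reference, a Scala sketch of the equivalent call with a smaller step size (the data RDD, iteration count and step value are placeholders):

    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.rdd.RDD

    // data: RDD[LabeledPoint] prepared elsewhere (placeholder)
    def fit(data: RDD[LabeledPoint]) = {
      val numIterations = 100
      val stepSize = 0.01   // smaller step to keep SGD from diverging
      LinearRegressionWithSGD.train(data, numIterations, stepSize)
    }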

Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi We were using Hortonworks 2.4.1 as our Hadoop distribution and now switched to MapR. Previously, to read a text file we would use: test = sc.textFile("hdfs://10.48.101.111:8020/user/hdfs/test") What would be the equivalent of the same for MapR? Best Regards Santosh

Re: Task deserialization problem using 1.1.0 for Hadoop 2.4

2014-10-01 Thread Timothy Potter
Forgot to mention that I've tested that SerIntWritable and PipelineDocumentWritable are serializable by serializing / deserializing to/from a byte array in memory. On Wed, Oct 1, 2014 at 1:43 PM, Timothy Potter wrote: > I'm running into the following deserialization issue when trying to > run a v

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Thanks Burak. Step size 0.01 worked for b) and step=0.0001 for c) ! Cheers On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz wrote: > Hi, > > It appears that the step size is too high that the model is diverging with > the added noise. > Could you try by setting the step size to be 0.1 or 0.01? >

Re: Spark And Mapr

2014-10-01 Thread Vladimir Rodionov
There is doc on MapR: http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications -Vladimir Rodionov On Wed, Oct 1, 2014 at 3:00 PM, Addanki, Santosh Kumar < santosh.kumar.adda...@sap.com> wrote: > Hi > > > > We were using Horton 2.4.1 as our Hadoop distribution and now switched to

RE: Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi We would like to do this in the PySpark environment, i.e. something like test = sc.textFile("maprfs:///user/root/test") or test = sc.textFile("hdfs:///user/root/test"). Currently when we try test = sc.textFile("maprfs:///user/root/test") it throws the error No File-System for scheme: maprfs

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Excellent! Thanks Andy. I will give it a go. On Thu, Oct 2, 2014 at 12:42 AM, andy petrella [via Apache Spark User List] wrote: > I'll try my best ;-). > > 1/ you could create a abstract type for the types (1 on top of Vs, 1 other > on top of Es types) than use the subclasses as payload in your

Re: Spark And Mapr

2014-10-01 Thread Sungwook Yoon
Yes.. you should use maprfs:// I personally haven't used pyspark, I just used the scala shell or standalone with MapR. I think you need to set the classpath right, adding a jar like /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar to the classpath. Sungwook On Wed, Oct 1, 2

Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps. Matei On Oct 1, 2014, at 4:12 PM, Sungwook Yoon wrote: > > Yes.. you should use maprfs:// > > I personally haven't used pyspark, I just used scala shell or standalone with > MapR. > > I think you need to set classpat
