Issue with Partitioning

2014-10-01 Thread Ankur Srivastava
Hi, I am using a custom partitioner to partition my JavaPairRDD where the key is a String. I use the hashCode of a sub-string of the key to derive the partition index, but I have noticed that my partition contains keys which have a different partitionIndex returned by the partitioner. Another issue I am
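One frequent cause of keys landing in an unexpected partition is that String.hashCode can be negative, so a plain modulo can return a negative index. A minimal Scala sketch of a substring-based partitioner that guards against this (the prefix length and class name are illustrative, not taken from the thread):

```scala
import org.apache.spark.Partitioner

// Minimal sketch: partition by the hash of a key prefix, keeping the
// resulting index in [0, numParts) even when hashCode is negative.
class PrefixPartitioner(val numParts: Int, prefixLen: Int) extends Partitioner {
  override def numPartitions: Int = numParts

  override def getPartition(key: Any): Int = {
    val prefix = key.asInstanceOf[String].take(prefixLen)
    val mod = prefix.hashCode % numParts
    if (mod < 0) mod + numParts else mod
  }

  override def equals(other: Any): Boolean = other match {
    case p: PrefixPartitioner => p.numParts == numParts
    case _ => false
  }

  override def hashCode: Int = numParts
}
```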

Re: Print Decision Tree Models

2014-10-01 Thread Jimmy
Yeah I'm using 1.0.0 and thanks for taking the time to check! Sent from my iPhone > On Oct 1, 2014, at 8:48 PM, Xiangrui Meng wrote: > > Which Spark version are you using? It works in 1.1.0 but not in 1.0.0. > -Xiangrui > >> On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain wrote: >> So I am t

Re: Print Decision Tree Models

2014-10-01 Thread Xiangrui Meng
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0. -Xiangrui On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain wrote: > So I am trying to print the model output from MLlib however I am only > getting things like the following: > > org.apache.spark.mllib.tree.model.DecisionTreeMo

Re: Help Troubleshooting Naive Bayes

2014-10-01 Thread Xiangrui Meng
The cost depends on the feature dimension, number of instances, number of classes, and number of partitions. Do you mind sharing those numbers? -Xiangrui On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico wrote: > Hi Everyone, > > I'm working on training mllib's Naive Bayes to classify TF/IDF vectorized

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Xiangrui Meng
Yes, the "bigram" in that demo only has two characters, which could separate different character sets. -Xiangrui On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei wrote: > The program computes hashing bi-gram frequency normalized by total number of > bigrams then filter out zero values. hashing is a eff

Re: org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-01 Thread Michael Armbrust
You are likely running into SPARK-3708 , which was fixed by #2594 this morning. On Wed, Oct 1, 2014 at 8:09 AM, tonsat wrote: > We have a configuration CDH5.0,Spark1.1.0(stand alone) and hive0.12 > We a

Help Troubleshooting Naive Bayes

2014-10-01 Thread Mike Bernico
Hi Everyone, I'm working on training mllib's Naive Bayes to classify TF/IDF vectorized docs using Spark 1.1.0. I've gotten this to work fine on a smaller set of data, but when I increase the number of vectorized documents I get hung up on training. The only messages I'm seeing are below. I'm pr

What can be done if a FlatMapFunctions generated more data that can be held in memory

2014-10-01 Thread Steve Lewis
A number of the problems I want to work with generate datasets which are too large to hold in memory. This becomes an issue when building a FlatMapFunction and also when the data used in combineByKey cannot be held in memory. The following is a simple, if a little silly, example of a FlatMapF

Re: Multiple spark shell sessions

2014-10-01 Thread Sanjay Subramanian
Awesome thanks a TON. It works. There is a clash in the UI port initially, but it looks like it creates a second UI port at 4041 for the second user wanting to use the spark-shell. 14/10/01 17:34:38 INFO JettyUtils: Failed to create UI at port, 4040. Trying again. 14/10/01 17:34:38 INFO JettyUtils: Err

spark.cleaner.ttl

2014-10-01 Thread SK
Hi, I am using spark v 1.1.0. The default value of spark.cleaner.ttl is infinite as per the online docs. Since a lot of shuffle files are generated in /tmp/spark-local* and the disk is running out of space, we tested with a smaller value of ttl. However, even when the job has completed and the timer

Re: Multiple spark shell sessions

2014-10-01 Thread Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs on the cluster. --executor-cores is just for each individual executor, but it will try to launch many of them. Matei On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian wrote: > hey guys > > I am using spark 1.0.0+cdh

Re: Spark And Mapr

2014-10-01 Thread Surendranauth Hiraman
As Sungwook said, the classpath pointing to the mapr jar is the key for that error. MapR has a Spark install that hopefully makes it easier. I don't have the instructions handy but you can ask their support about it. -Suren On Wed, Oct 1, 2014 at 7:18 PM, Matei Zaharia wrote: > It should ju

Re: timestamp not implemented yet

2014-10-01 Thread barge.nilesh
The Parquet format seems to be comparatively better for analytic loads; it has performance and compression benefits for large analytic workloads. A workaround could be to use the long datatype to store the epoch timestamp value. If you already have existing parquet files (impala tables) then you may need to consid

Re: Spark inside Eclipse

2014-10-01 Thread Ashish Jain
Hello Sanjay, This can be done, and is a very effective way to debug. 1) Compile and package your project to get a fat jar 2) In your SparkConf use setJars and give location of this jar. Also set your master here as local in SparkConf 3) Use this SparkConf when creating JavaSparkContext 4) Debug
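A minimal Scala sketch of the set-up described in these steps, run directly from Eclipse (the jar path and app name are placeholders, not from the thread):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EclipseDebugApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("eclipse-debug")
      .setMaster("local[2]")                       // run inside the Eclipse JVM
      .setJars(Seq("target/my-assembly-fat.jar"))  // fat jar from step 1 (hypothetical path)
    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 100).count())    // set breakpoints anywhere here
    } finally {
      sc.stop()
    }
  }
}
```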

Re: Spark inside Eclipse

2014-10-01 Thread Ted Yu
Cycling bits: http://search-hadoop.com/m/JW1q5wxkXH/spark+eclipse&subj=Buidling+spark+in+Eclipse+Kepler On Wed, Oct 1, 2014 at 4:35 PM, Sanjay Subramanian < sanjaysubraman...@yahoo.com.invalid> wrote: > hey guys > > Is there a way to run Spark in local mode from within Eclipse. > I am running Ecl

Spark inside Eclipse

2014-10-01 Thread Sanjay Subramanian
hey guys Is there a way to run Spark in local mode from within Eclipse? I am running Eclipse Kepler on a MacBook Pro with Mavericks. Like one can run hadoop map/reduce applications from within Eclipse and debug and learn. thanks sanjay

Multiple spark shell sessions

2014-10-01 Thread Sanjay Subramanian
hey guys I am using spark 1.0.0+cdh5.1.0+41 When two users try to run "spark-shell", the first guy's spark-shell shows active in the 18080 Web UI but the second user shows WAITING and the shell has a bunch of errors, but does get to the spark-shell and "sc.master" seems to point to the correct master

Re: Spark And Mapr

2014-10-01 Thread Matei Zaharia
It should just work in PySpark, the same way it does in Java / Scala apps. Matei On Oct 1, 2014, at 4:12 PM, Sungwook Yoon wrote: > > Yes.. you should use maprfs:// > > I personally haven't used pyspark, I just used scala shell or standalone with > MapR. > > I think you need to set classpat

Re: Spark And Mapr

2014-10-01 Thread Sungwook Yoon
Yes.. you should use maprfs:// I personally haven't used pyspark, I just used scala shell or standalone with MapR. I think you need to set the classpath right, adding a jar like /opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar to the classpath. Sungwook On Wed, Oct 1, 2

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Excellent! Thanks Andy. I will give it a go. On Thu, Oct 2, 2014 at 12:42 AM, andy petrella [via Apache Spark User List] wrote: > I'll try my best ;-). > > 1/ you could create a abstract type for the types (1 on top of Vs, 1 other > on top of Es types) than use the subclasses as payload in your

RE: Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi We would like to do this in the PySpark environment, i.e. something like test = sc.textFile("maprfs:///user/root/test") or test = sc.textFile("hdfs:///user/root/test") Currently when we try test = sc.textFile("maprfs:///user/root/test") It throws the error No File-System for scheme: maprfs

Re: Spark And Mapr

2014-10-01 Thread Vladimir Rodionov
There is doc on MapR: http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications -Vladimir Rodionov On Wed, Oct 1, 2014 at 3:00 PM, Addanki, Santosh Kumar < santosh.kumar.adda...@sap.com> wrote: > Hi > > > > We were using Horton 2.4.1 as our Hadoop distribution and now switched to

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Thanks Burak. Step size 0.01 worked for b) and step=0.0001 for c) ! Cheers On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz wrote: > Hi, > > It appears that the step size is too high that the model is diverging with > the added noise. > Could you try by setting the step size to be 0.1 or 0.01? >

Re: Task deserialization problem using 1.1.0 for Hadoop 2.4

2014-10-01 Thread Timothy Potter
Forgot to mention that I've tested that SerIntWritable and PipelineDocumentWritable are serializable by serializing / deserializing to/from a byte array in memory. On Wed, Oct 1, 2014 at 1:43 PM, Timothy Potter wrote: > I'm running into the following deserialization issue when trying to > run a v

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
Thank you for the replies. It makes sense for scala/java, but in python the JVM is launched when the spark context is initialised, so it should be able to set it, I assume. On Wed, Oct 1, 2014 at 6:24 PM, Andrew Or wrote: > Hi Tamas, > > Yes, Marcelo is right. The reason why it doesn't make sen

Spark And Mapr

2014-10-01 Thread Addanki, Santosh Kumar
Hi We were using Horton 2.4.1 as our Hadoop distribution and now switched to MapR. Previously, to read a text file we would use: test = sc.textFile("hdfs://10.48.101.111:8020/user/hdfs/test") What would be the equivalent for MapR? Best Regards Santosh

Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Burak Yavuz
Hi, It appears that the step size is so high that the model is diverging with the added noise. Could you try by setting the step size to be 0.1 or 0.01? Best, Burak - Original Message - From: "Krishna Sankar" To: user@spark.apache.org Sent: Wednesday, October 1, 2014 12:43:20 PM Subj
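For reference, a hedged Scala sketch of retraining with a smaller step size (the thread's original code is in Python; `sc` is an existing SparkContext and the data points mirror the toy dataset in the question):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Tiny, perfectly linear dataset; with a large step size SGD can still diverge.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0)),
  LabeledPoint(10.0, Vectors.dense(10.0)),
  LabeledPoint(20.0, Vectors.dense(20.0)),
  LabeledPoint(30.0, Vectors.dense(30.0))))

val numIterations = 100
val model = LinearRegressionWithSGD.train(data, numIterations, 0.01) // stepSize = 0.01
```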

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Liquan Pei
The program computes hashed bi-gram frequencies normalized by the total number of bigrams, then filters out zero values. Hashing is an effective trick for vectorizing features. Take a look at http://en.wikipedia.org/wiki/Feature_hashing Liquan On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta wrote: > I'm
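A sketch of that idea in Scala (not Aaron's original code; the bucket count and the dense output are illustrative):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Hash each character bigram into a fixed-size bucket array, then normalize
// by the total number of bigrams so the result is a frequency vector.
// The demo additionally drops zero entries (i.e. builds a sparse vector).
def featurize(s: String, numBuckets: Int = 1000): Vector = {
  val counts = new Array[Double](numBuckets)
  val bigrams = s.sliding(2).toSeq
  for (bg <- bigrams) {
    val idx = ((bg.hashCode % numBuckets) + numBuckets) % numBuckets  // feature hashing
    counts(idx) += 1.0
  }
  val total = math.max(bigrams.size, 1).toDouble
  Vectors.dense(counts.map(_ / total))
}
```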

Re: run scalding on spark

2014-10-01 Thread Koert Kuipers
thanks On Wed, Oct 1, 2014 at 4:56 PM, Matei Zaharia wrote: > Pretty cool, thanks for sharing this! I've added a link to it on the wiki: > https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects > . > > Matei > > On Oct 1, 2014, at 1:41 PM, Koert Kuipers wrote: > > well, s

Determining number of executors within RDD

2014-10-01 Thread Akshat Aranya
Hi, I want to implement an RDD wherein the decision on the number of partitions is based on the number of executors that have been set up. Is there some way I can determine the number of executors within the getPartitions() call?
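One possible probe, as a sketch only (behaviour can vary by deploy mode, and executors must already have registered): SparkContext.getExecutorMemoryStatus lists one block manager per executor plus one for the driver, and inside a custom RDD the context is available via `context`.

```scala
// Sketch: derive a partition count from the number of registered executors.
val executorCount = math.max(sc.getExecutorMemoryStatus.size - 1, 1)
val numPartitions = executorCount * 4   // e.g. a few partitions per executor
```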

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei
Can you use spark-submit to set the executor memory? Take a look at https://spark.apache.org/docs/latest/submitting-applications.html. Liquan On Wed, Oct 1, 2014 at 2:21 PM, 陈韵竹 wrote: > Thanks Sean. This is how I set this memory. I set it when I start to run > the job > > java -Xms64g -Xmx

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread 陈韵竹
Thanks Sean. This is how I set this memory. I set it when I start to run the job java -Xms64g -Xmx64g -cp /root/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/root/scala/lib/scala-library.jar:./target/MyProject.jar MyClass Is there some problem with it? On Wed, Oct 1, 2014 at 2:03 PM, Sean Ow

Creating a feature vector from text before using with MLLib

2014-10-01 Thread Soumya Simanta
I'm trying to understand the intuition behind the features method that Aaron used in one of his demos. I believe this feature will just work for detecting the character set (i.e., language used). Can someone help ? def featurize(s: String): Vector = { val n = 1000 val result = new Array[Doub

Print Decision Tree Models

2014-10-01 Thread Jimmy McErlain
So I am trying to print the model output from MLlib however I am only getting things like the following: org.apache.spark.mllib.tree.model.DecisionTreeModel@1120c600 0.17171527904439082 0.8282847209556092 5273125.0 2.5435412E7 from the following code: val trainErr = labelAndPreds.filter(

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
I guess one way to do so would be to run >1 worker per node, like say, instead of running 1 worker and giving it 8 cores, you can run 4 workers with 2 cores each. Then, you get 4 executors with 2 cores each. On Wed, Oct 1, 2014 at 1:06 PM, Boromir Widas wrote: > I have not found a way to contro

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei
Hi Anny, I am assuming that you perform some complex logic for processing. Can you try to reduce your data size using RDD.sample or RDD.filter before actual processing? That may reduce memory pressure. Liquan On Wed, Oct 1, 2014 at 1:53 PM, anny9699 wrote: > Hi Liquan, > > I have 8 workers,

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
I found the problem. I was manually constructing the CLASSPATH and SPARK_CLASSPATH because I needed jars for running the cassandra lib. For some reason that I cannot explain, it was this that was causing the issue. Maybe one of the jars had a log4j.properties rolled up in it? I removed almost all

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Sean Owen
How are you setting this memory? You may be configuring the wrong process's memory, like the driver and not the executors. On Oct 1, 2014 9:37 PM, "anny9699" wrote: > Hi, > > After reading some previous posts about this issue, I have increased the > java heap space to "-Xms64g -Xmx64g", but still

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-10-01 Thread Gilberto Lira
Thank you Zhang! I am grateful for your help! 2014-10-01 14:05 GMT-03:00 Kan Zhang : > CC user@ for indexing. > > Glad you fixed it. All source code for these examples are under > SPARK_HOME/examples. For example, the converters used here are in > examples/src/main/scala/org/apache/spark/example

Re: run scalding on spark

2014-10-01 Thread Matei Zaharia
Pretty cool, thanks for sharing this! I've added a link to it on the wiki: https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects. Matei On Oct 1, 2014, at 1:41 PM, Koert Kuipers wrote: > well, sort of! we make input/output formats (cascading taps, scalding > sources) a

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
Hi Liquan, I have 8 workers, each with 15.7GB memory. What you said makes sense, but if I don't increase heap space, it keeps telling me "GC overhead limit exceeded". Thanks! Anny On Wed, Oct 1, 2014 at 1:41 PM, Liquan Pei [via Apache Spark User List] < ml-node+s1001560n1554...@n3.nabble.com> w

run scalding on spark

2014-10-01 Thread Koert Kuipers
well, sort of! we make input/output formats (cascading taps, scalding sources) available in spark, and we ported the scalding fields api to spark. so it's for those of us that have a serious investment in cascading/scalding and want to leverage that in spark. blog is here: http://tresata.com/tresa

Re: still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread Liquan Pei
Hi How many nodes in your cluster? It seems to me 64g does not help if each of your nodes doesn't have that much memory. Liquan On Wed, Oct 1, 2014 at 1:37 PM, anny9699 wrote: > Hi, > > After reading some previous posts about this issue, I have increased the > java heap space to "-Xms64g -Xmx64

still "GC overhead limit exceeded" after increasing heap space

2014-10-01 Thread anny9699
Hi, After reading some previous posts about this issue, I have increased the java heap space to "-Xms64g -Xmx64g", but still met the "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. Does anyone have other suggestions? I am reading 200 GB of data and my total memory is 120 GB, so

Re: Question About Submit Application

2014-10-01 Thread danilopds
I'll do this test and then reply with the result. Thank you Marcelo. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Question-About-Submit-Application-tp15072p15539.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Spark Monitoring with Ganglia

2014-10-01 Thread danilopds
Hi, I need to monitor some aspects of my cluster like network and resources. Ganglia looks like a good option for what I need. Then I found out that Spark has support for Ganglia. On the Spark monitoring webpage there is this information: "To install the GangliaSink you’ll need to perform a cu

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Liquan Pei
One indirect way to control the number of cores used in an executor is to set spark.cores.max and set spark.deploy.spreadOut to be true. The scheduler in the standalone cluster then assigns roughly the same number of cores (spark.cores.max/number of worker nodes) to each executor for an application
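A sketch of that indirect approach (values are illustrative); note that spark.deploy.spreadOut is read by the standalone master, not the application, and defaults to true:

```scala
import org.apache.spark.SparkConf

// Cap the total cores for this application; with spread-out scheduling the
// standalone master assigns each executor roughly spark.cores.max / #workers cores.
val conf = new SparkConf()
  .setAppName("core-capped-app")
  .set("spark.cores.max", "8")   // e.g. 8 cores over 4 workers -> ~2 cores per executor
```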

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Boromir Widas
I have not found a way to control the cores yet. This effectively limits the cluster to a single application at a time. A subsequent application shows in the 'WAITING' State on the dashboard. On Wed, Oct 1, 2014 at 2:49 PM, Akshat Aranya wrote: > > > On Wed, Oct 1, 2014 at 11:33 AM, Akshat Arany

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Out of curiosity, how do you actually launch pyspark in your set-up? On Wed, Oct 1, 2014 at 3:44 PM, Rick Richardson wrote: > Here is the other relevant bit of my set-up: > MASTER=spark://sparkmaster:7077 > IPYTHON_OPTS="notebook --pylab inline --ip=0.0.0.0" > CASSANDRA_NODES="cassandra1|cassand

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Here is the other relevant bit of my set-up: MASTER=spark://sparkmaster:7077 IPYTHON_OPTS="notebook --pylab inline --ip=0.0.0.0" CASSANDRA_NODES="cassandra1|cassandra2|cassandra3" PYSPARK_SUBMIT_ARGS="--master $MASTER --deploy-mode client --num-executors 6 --executor-memory 1g --executor-cores 1" i

Task deserialization problem using 1.1.0 for Hadoop 2.4

2014-10-01 Thread Timothy Potter
I'm running into the following deserialization issue when trying to run a very simple Java-based application using a local Master (see stack trace below). My code basically queries Solr using a custom Hadoop InputFormat. I've hacked my code to make sure the objects involved (PipelineDocumentWritab

MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Guys, Obviously I am doing something wrong. May be 4 points are too small a dataset. Can you help me to figure out why the following doesn't work ? a) This works : data = [ LabeledPoint(0.0, [0.0]), LabeledPoint(10.0, [10.0]), LabeledPoint(20.0, [20.0]), LabeledPoint(30.0, [30.0]) ]

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
I was starting PySpark as a profile within IPython Notebook as per: http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/ My setup looks like: import os import sys spark_home = os.environ.get('SPARK_HOME', None) if not spark_home: raise ValueError('SPARK_HOME e

Re: can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
Hi Sean Many many thanks. This really clears a lot up for me Andy From: Sean Owen Date: Wednesday, October 1, 2014 at 11:27 AM To: Andrew Davidson Cc: "user@spark.apache.org" Subject: Re: can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ? > Yes, foreachRDD will d

Re: IPython Notebook Debug Spam

2014-10-01 Thread Davies Liu
How do you setup IPython to access pyspark in notebook? I did as following, it worked for me: $ export SPARK_HOME=/opt/spark-1.1.0/ $ export PYTHONPATH=/opt/spark-1.1.0/python:/opt/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip $ ipython notebook All the logging will go into console (not in notebo

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Tathagata Das
To clarify the confusion here, when you do dstream.count() it generates a DStream[Long] which contains an RDD[Long] for each batch. Each of these RDDs has only one element in it, which is the count you are interested in. So the following are equivalent. dstream.foreachRDD { rdd => val count = rdd.cou
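A short Scala sketch of the two equivalent forms (assumes an existing DStream[String] named lines):

```scala
// 1) count() gives a DStream whose every RDD holds a single Long
lines.count().foreachRDD { rdd =>
  val count: Long = rdd.first()          // the one element is the batch count
  println(s"batch size = $count")
}

// 2) counting each batch RDD directly
lines.foreachRDD { rdd =>
  val count: Long = rdd.count()
  println(s"batch size = $count")
}
```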

Re: Short Circuit Local Reads

2014-10-01 Thread Colin McCabe
On Tue, Sep 30, 2014 at 6:28 PM, Andrew Ash wrote: > Thanks for the research Kay! > > It does seem addressed, and hopefully fixed in that ticket conversation also > in https://issues.apache.org/jira/browse/HDFS-4697 So the best thing here > is to wait to upgrade to a version of Hadoop that has th

RE: MLLib: Missing value imputation

2014-10-01 Thread Sameer Tilak
Thanks, Xiangrui and Debashish for your input. Date: Wed, 1 Oct 2014 08:35:51 -0700 Subject: Re: MLLib: Missing value imputation From: debasish.da...@gmail.com To: men...@gmail.com CC: ssti...@live.com; user@spark.apache.org If the missing values are 0, then you can also look into implicit formu

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya wrote: > > > On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas wrote: > >> 1. worker memory caps executor. >> 2. With default config, every job gets one executor per worker. This >> executor runs with all cores available to the worker. >> >> By the job

Re: IPython Notebook Debug Spam

2014-10-01 Thread Rick Richardson
Thanks for your reply. Unfortunately changing the log4j.properties within SPARK_HOME/conf has no effect on pyspark for me. When I change it in the master or workers the log changes have the desired effect, but pyspark seems to ignore them. I have changed the levels to WARN, changed the appender

Re: Poor performance writing to S3

2014-10-01 Thread Gustavo Arjones
Hi, I found the answer to my problem, and am just writing to keep it as KB. Turns out the problem wasn’t related to S3 performance; it was because my SOURCE was not fast enough. Due to the lazy nature of Spark, what I saw on the dashboard was saveAsTextFile at FacebookProcessor.scala:46 instead of the load

Protocol buffers with Spark ?

2014-10-01 Thread Jaonary Rabarisoa
Dear all, I have a spark job that communicates with a C++ code using pipe. Since the data I need to send is rather complicated, I am thinking about using protobuf to serialize it. The problem is that the string form of my data outputted by protobuf contains the "\n" character, so it is a bit complicated to
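One way around the embedded-newline problem, sketched under the assumption that the RDD already holds the protobuf-serialized bytes of each record (e.g. from a protoc-generated toByteArray): Base64-encode each serialized message so no piped line contains a raw newline, and decode on the C++ side before parsing.

```scala
import javax.xml.bind.DatatypeConverter
import org.apache.spark.rdd.RDD

// Base64 keeps "\n" out of the piped lines; the C++ side decodes each line
// before calling ParseFromString. The command path is hypothetical.
def pipeThroughCpp(serialized: RDD[Array[Byte]]): RDD[String] =
  serialized
    .map(bytes => DatatypeConverter.printBase64Binary(bytes))
    .pipe("/path/to/cpp-binary")
```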

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas wrote: > 1. worker memory caps executor. > 2. With default config, every job gets one executor per worker. This > executor runs with all cores available to the worker. > > By the job do you mean one SparkContext or one stage execution within a progra

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Sean Owen
Hm, I think this is the same thing though. The input is conceptually a stream of RDDs. You want the count of elements of each of the RDDs, so you get a stream of counts. In the foreachRDD example, you're just computing the count() of each RDD directly and printing it, so, you print a stream of coun

Re: can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Sean Owen
Yes, foreachRDD will do your something for each RDD, which is what you get for each mini-batch of input. The operations you express on a DStream (or JavaDStream) are all, really, "for each RDD", including print(). Logging is a little harder to reason about since the logging will happen on a potent

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Andy Twigg
Yes, that makes sense. It's similar to the all reduce pattern in vw. On Wednesday, 1 October 2014, Matei Zaharia wrote: > Some of the MLlib algorithms do tree reduction in 1.1: > http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html. > You can check out how they imp

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Matei Zaharia
Some of the MLlib algorithms do tree reduction in 1.1: http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html. You can check out how they implemented it -- it is a series of reduce operations. Matei On Oct 1, 2014, at 11:02 AM, Boromir Widas wrote: > Thanks a lot

Re: Handling tree reduction algorithm with Spark in parallel

2014-10-01 Thread Boromir Widas
Thanks a lot Andy and Debashish, your suggestions were of great help. On Tue, Sep 30, 2014 at 6:44 PM, Debasish Das wrote: > If the tree is too big build it on graphxbut it will need thorough > analysis so that the partitions are well balanced... > > On Tue, Sep 30, 2014 at 2:45 PM, Andy Twi

Re: Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Boromir Widas
1. worker memory caps executor. 2. With default config, every job gets one executor per worker. This executor runs with all cores available to the worker. On Wed, Oct 1, 2014 at 11:04 AM, Akshat Aranya wrote: > Hi, > > What's the relationship between Spark worker and executor memory settings >

can I think of JavaDStream<> foreachRDD() as being 'for each mini batch' ?

2014-10-01 Thread Andy Davidson
Hi I am new to Spark Streaming. Can I think of JavaDStream<> foreachRDD() as being 'for each mini batch'? The java doc does not say much about this function. Here is the background. I am writing a little test program to figure out how to use streams. At some point I wanted to calculate an aggre

Re: IPython Notebook Debug Spam

2014-10-01 Thread Davies Liu
On Tue, Sep 30, 2014 at 10:14 PM, Rick Richardson wrote: > I am experiencing significant logging spam when running PySpark in IPython > Notebok > > Exhibit A: http://i.imgur.com/BDP0R2U.png > > I have taken into consideration advice from: > http://apache-spark-user-list.1001560.n3.nabble.com/Disa

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Thank you for the replies. It makes sense for scala/java, but in python the JVM is launched when the spark context is initialised, so it should be able to set it, I assume. On 1 Oct 2014 18:25, "Andrew Or-2 [via Apache Spark User List]" < ml-node+s1001560n15510...@n3.nabble.com> wrote: > Hi Tamas,

Re: how to get actual count from as long from JavaDStream ?

2014-10-01 Thread Andy Davidson
Hi Sean I guess I am missing something. JavaDStream foo = … JavaDStream c = foo.count() This is circular. I need to get the count as an actual scalar value, not a JavaDStream. Someone else posted pseudo code that used foreachRDD(). This seems to work for me. Thanks Andy From: Sean Owen Dat

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Andrew Or
Hi Tamas, Yes, Marcelo is right. The reason why it doesn't make sense to set "spark.driver.memory" in your SparkConf is because your application code, by definition, *is* the driver. This means by the time you get to the code that initializes your SparkConf, your driver JVM has already started wit

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
No, you can't instantiate a SparkContext to start apps in cluster mode. For Yarn, for example, you'd have to call directly into org.apache.spark.deploy.yarn.Client; that class will tell the Yarn cluster to launch the driver for you and then instantiate the SparkContext. On Wed, Oct 1, 2014 at 10:

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
when you say "respective backend code to launch it", I thought this is the way to do that. thanks, Tamas On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin [via Apache Spark User List] wrote: > Because that's not how you launch apps in cluster mode; you have to do > it through the command line, or b

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
when you say "respective backend code to launch it", I thought this is the way to do that. thanks, Tamas On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin wrote: > Because that's not how you launch apps in cluster mode; you have to do > it through the command line, or by calling directly the respec

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
Because that's not how you launch apps in cluster mode; you have to do it through the command line, or by calling directly the respective backend code to launch it. (That being said, it would be nice to have a programmatic way of launching apps that handled all this - this has been brought up in a

Re: MLLib ALS question

2014-10-01 Thread Xiangrui Meng
ALS still needs to load and deserialize the in/out blocks (one by one) from disk and then construct least squares subproblems. All happen in RAM. The final model is also stored in memory. -Xiangrui On Wed, Oct 1, 2014 at 4:36 AM, Alex T wrote: > Hi, thanks for the reply. > > I added the ALS.setIn

Re: Spark 1.1.0 hbase_inputformat.py not work

2014-10-01 Thread Kan Zhang
CC user@ for indexing. Glad you fixed it. All source code for these examples are under SPARK_HOME/examples. For example, the converters used here are in examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala Btw, you may find our blog post useful. https://databri

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Tamas Jambor
thanks Marcelo. What's the reason it is not possible in cluster mode, either? On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin wrote: > You can't set up the driver memory programatically in client mode. In > that mode, the same JVM is running the driver, so you can't modify > command line options

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Yes, it's in 0.98. CDH is free (w/o subscription) and sometimes it's worth upgrading to the latest version (which is 0.98 based). -Vladimir Rodionov On Wed, Oct 1, 2014 at 9:52 AM, Ted Yu wrote: > As far as I know, that feature is not in CDH 5.0.0 > > FYI > > On Wed, Oct 1, 2014 at 9:34 AM, Vladi

Re: Reading from HBase is too slow

2014-10-01 Thread Ted Yu
As far as I know, that feature is not in CDH 5.0.0 FYI On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov < vrodio...@splicemachine.com> wrote: > Using TableInputFormat is not the fastest way of reading data from HBase. > Do not expect 100s of Mb per sec. You probably should take a look at M/R >

Re: spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread Marcelo Vanzin
You can't set up the driver memory programatically in client mode. In that mode, the same JVM is running the driver, so you can't modify command line options anymore when initializing the SparkContext. (And you can't really start cluster mode apps that way, so the only way to set this is through t

Re: Reading from HBase is too slow

2014-10-01 Thread Vladimir Rodionov
Using TableInputFormat is not the fastest way of reading data from HBase. Do not expect 100s of Mb per sec. You probably should take a look at M/R over HBase snapshots. https://issues.apache.org/jira/browse/HBASE-8369 -Vladimir Rodionov On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao wrote: > I can s

spark.driver.memory is not set (pyspark, 1.1.0)

2014-10-01 Thread jamborta
Hi all, I cannot figure out why this command is not setting the driver memory (it is setting the executor memory): conf = (SparkConf() .setMaster("yarn-client") .setAppName("test") .set("spark.driver.memory", "1G") .set("spark.ex

Re: MLLib: Missing value imputation

2014-10-01 Thread Debasish Das
If the missing values are 0, then you can also look into implicit formulation... On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng wrote: > We don't handle missing value imputation in the current version of > MLlib. In future releases, we can store feature information in the > dataset metadata, wh

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
There is this thread on Stack Overflow about the same topic, which you may find helpful. On Wed, Oct 1, 2014 at 11:17 AM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Not that I'm aware o

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Not that I'm aware of. I'm looking for a work-around myself! On Wed, Oct 1, 2014 at 11:15 AM, Tomer Benyamini wrote: > Yes exactly.. so I guess this is still an open request. Any workaround? > > On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas > wrote: > > Are you trying to do something along t

Re: Reading from HBase is too slow

2014-10-01 Thread Tao Xiao
I can submit a MapReduce job reading that table, although its processing rate is also a little slower than I expected, but not as slow as Spark. 2014-10-01 12:04 GMT+08:00 Ted Yu : > Can you launch a job which exercises TableInputFormat on the same table > without using Spark ? > > This would s

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Yes exactly.. so I guess this is still an open request. Any workaround? On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas wrote: > Are you trying to do something along the lines of what's described here? > https://issues.apache.org/jira/browse/SPARK-3533 > > On Wed, Oct 1, 2014 at 10:53 AM, Tomer

org.apache.spark.sql.catalyst.errors.package$TreeNodeException:

2014-10-01 Thread tonsat
We have a configuration of CDH5.0, Spark 1.1.0 (stand alone) and hive 0.12. We are trying to run some real-time analytics from Tableau 8.1 (BI tool) and one of the dashboards was failing with the error below. Is it something to do with an aggregate function not supported by spark-sql? Any help appreciated. 14/10/

Re: MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Nicholas Chammas
Are you trying to do something along the lines of what's described here? https://issues.apache.org/jira/browse/SPARK-3533 On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini wrote: > Hi, > > I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with > MultipleTextOutputFormat,: > > outRdd

Relation between worker memory and executor memory in standalone mode

2014-10-01 Thread Akshat Aranya
Hi, What's the relationship between Spark worker and executor memory settings in standalone mode? Do they work independently or does the worker cap executor memory? Also, is the number of concurrent executors per worker capped by the number of CPU cores configured for the worker?

MultipleTextOutputFormat with new hadoop API

2014-10-01 Thread Tomer Benyamini
Hi, I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with MultipleTextOutputFormat: outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class, MultipleTextOutputFormat.class); but I'm getting this compilation error: Bound mismatch: The generic method saveAsNewAPIHadoopFil
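A commonly suggested workaround, sketched here in Scala rather than Java (not from this thread): MultipleTextOutputFormat belongs to the old org.apache.hadoop.mapred API, so write through saveAsHadoopFile instead of saveAsNewAPIHadoopFile and derive the output file name from the key.

```scala
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
import org.apache.spark.rdd.RDD

// Old-API output format: pick the output file name from the record's key.
class KeyBasedOutput extends MultipleTextOutputFormat[String, String] {
  override def generateFileNameForKeyValue(key: String, value: String, name: String): String =
    key   // one output file per key (adjust to taste)
}

def save(outRdd: RDD[(String, String)]): Unit =
  outRdd.saveAsHadoopFile("/tmp/out", classOf[String], classOf[String], classOf[KeyBasedOutput])
```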

Re: Spark AccumulatorParam generic

2014-10-01 Thread Debasish Das
Can't you extend a class in place of an object, which can be generic? class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] { } On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg wrote: > Just realized that, of course, objects can't be generic, but how do I > create a generic AccumulatorParam?
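A sketch of that suggestion filled in (names are illustrative); the class carries the type parameter that an object cannot:

```scala
import org.apache.spark.AccumulatorParam

// Collects a Seq of any element type B by concatenation.
class SeqAccumulatorParam[B] extends AccumulatorParam[Seq[B]] {
  override def zero(initialValue: Seq[B]): Seq[B] = Seq.empty[B]
  override def addInPlace(r1: Seq[B], r2: Seq[B]): Seq[B] = r1 ++ r2
}

// usage (assumes an existing SparkContext `sc` and RDD `rdd`):
// val acc = sc.accumulator(Seq.empty[Int])(new SeqAccumulatorParam[Int])
// rdd.foreach(x => acc += Seq(x))
```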

Re: GraphX: Types for the Nodes and Edges

2014-10-01 Thread andy petrella
I'll try my best ;-). 1/ you could create an abstract type for the types (1 on top of the Vs, 1 other on top of the Es types) then use the subclasses as payload in your VertexRDD or in your Edge. Regarding storage and files, it doesn't really matter (unless you want to use the OOTB loading method, thus you
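A small Scala sketch of point 1/ (the payload types and attributes are made up for illustration; assumes an existing SparkContext `sc`):

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// One sealed base type per side, concrete subclasses as the per-node / per-edge payloads.
sealed trait VertexPayload
case class Person(name: String, age: Int) extends VertexPayload
case class Paper(title: String) extends VertexPayload

sealed trait EdgePayload
case class Authored(year: Int) extends EdgePayload
case class Cites(count: Int) extends EdgePayload

val vertices = sc.parallelize(Seq[(VertexId, VertexPayload)](
  (1L, Person("Ada", 36)), (2L, Paper("GraphX"))))
val edges = sc.parallelize(Seq(Edge(1L, 2L, Authored(2014): EdgePayload)))

val graph: Graph[VertexPayload, EdgePayload] = Graph(vertices, edges)
```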

GraphX: Types for the Nodes and Edges

2014-10-01 Thread Oshi
Hi, Sorry this question may be trivial. I'm new to Spark and GraphX. I need to create a graph that has different types of nodes(3 types) and edges(4 types). Each type of node and edge has a different list of attributes. 1) How should I build the graph? Should I specify all types of nodes(or edge

Problem with very slow behaviour of TorrentBroadcast vs. HttpBroadcast

2014-10-01 Thread Guillaume Pitel
Hi, We've had some performance issues since switching to 1.1.0, and we finally found the origin : TorrentBroadcast seems to be very slow in our setting (and it became default with 1.1.0) The logs of a 4MB variable with TB : (15s) 14/10/01 15:47:13 INFO storage.MemoryStore: Block broadcast_84
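For comparison runs, Spark 1.1 still allows an application to switch back to the HTTP implementation via spark.broadcast.factory; a sketch (not a recommendation):

```scala
import org.apache.spark.SparkConf

// Fall back to HttpBroadcast to compare against the TorrentBroadcast default.
val conf = new SparkConf()
  .setAppName("broadcast-comparison")
  .set("spark.broadcast.factory", "org.apache.spark.broadcast.HttpBroadcastFactory")
```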

Re: How to get SparckContext inside mapPartitions?

2014-10-01 Thread Daniel Siegmann
I don't think you can get a SparkContext inside an RDD function (such as mapPartitions), but you shouldn't need to. Have you considered returning the data read from the database from mapPartitions to create a new RDD and then just save it to a file like normal? For example: rddObject.mapPartition
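A sketch of that suggestion; readFromDb is a hypothetical helper that opens its own connection inside each partition and returns the fetched rows:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical per-partition fetch: open a connection, read rows for the keys
// in this partition, close the connection, return the rows.
def readFromDb(keys: Iterator[String]): Iterator[String] = ???

def dumpToFile(rddObject: RDD[String]): Unit = {
  val fetched = rddObject.mapPartitions(keys => readFromDb(keys))  // no SparkContext needed here
  fetched.saveAsTextFile("/tmp/db-dump")   // save the new RDD like any other
}
```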
