Any advice for using big spark.cleaner.delay value in Spark Streaming?

2014-04-27 Thread buremba
It seems the default value for spark.cleaner.delay is 3600 seconds, but I need to be able to count things on a daily, weekly or even monthly basis. I suppose the aim of DStream batches and spark.cleaner.delay is to avoid space issues (running out of memory etc.). I usually use HyperLogLog for counting un

Re: Re: what is the best way to do cartesian

2014-04-27 Thread qinwei
Thanks a lot for your reply, but I have tried the built-in RDD.cartesian() method before; it didn't make it faster. qinwei | From: Alex Boisvert | Date: 2014-04-26 00:32 | To: user | Subject: Re: what is the best way to do cartesian | You might want to try the built-in RDD.cartesian() method. On Th
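For reference, a minimal sketch of the built-in call, with made-up data and a local master; note that the output grows as |left| * |right|, which is usually why cartesian products are slow no matter how they are computed.

    import org.apache.spark.SparkContext

    // Hypothetical data and local master, just to show the call.
    val sc = new SparkContext("local[2]", "cartesian-sketch")
    val left = sc.parallelize(Seq(1, 2, 3))
    val right = sc.parallelize(Seq("a", "b"))

    // Every (left, right) pair: 3 * 2 = 6 elements here.
    // The result size is |left| * |right|, so it explodes quickly on real data.
    val pairs = left.cartesian(right)
    pairs.collect().foreach(println)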

Re: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-27 Thread qinwei
Thanks a lot for your reply, it gave me much inspiration. qinwei | From: Sean Owen | Date: 2014-04-25 14:10 | To: user | Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark | So you are computing all-pairs similarity over 20M users? This is going to take ab
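As a rough illustration of why Sean's point matters (this is not the thread's actual code, and the (userId, itemId) shape of the input is assumed): item-based approaches usually count item co-occurrences per user, so only item pairs that actually appear together are materialized, instead of comparing all pairs of 20M users.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[2]", "cooccurrence-sketch")

    // Hypothetical (userId, itemId) pairs standing in for the real rating data.
    val userItems: RDD[(Long, String)] = sc.parallelize(Seq(
      (1L, "a"), (1L, "b"), (2L, "a"), (2L, "c"), (3L, "a"), (3L, "b")))

    // Group items by user, emit each unordered item pair seen together for a user,
    // then count how often each pair co-occurs. Only pairs that actually co-occur
    // are materialized, which is far cheaper than an all-pairs user comparison.
    val cooccurrence = userItems
      .groupByKey()
      .flatMap { case (_, items) =>
        val arr = items.toArray.distinct
        for (i <- arr; j <- arr if i < j) yield ((i, j), 1)
      }
      .reduceByKey(_ + _)

    cooccurrence.collect().foreach(println)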

Re: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-27 Thread Qin Wei
Thanks a lot for your reply, it gave me much inspiration. qinwei | From: Sean Owen-2 [via Apache Spark User List] | Date: 2014-04-25 14:11 | To: Qin Wei | Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark | So you are computing all-pairs simi

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
That doesn't work. I don't think it is just slow; it never ends (it ran for 30+ hours and I killed it). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4900.html Sent from the Apache Spark User List mailing list

help

2014-04-27 Thread Joe L
I am getting this error; please help me fix it. 14/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run program "/home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh" (in directory "."): error=13, -- View thi

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
That thread was mostly about benchmarking YARN vs standalone, and the results are what I'd expect -- spinning up a Spark cluster on demand through YARN has higher startup latency than using a standalone cluster, where the JVMs are already initialized and ready. Given that there's a lot more commit

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Matei Zaharia
From my point of view, both are supported equally. The YARN support is newer and that’s why there’s been a lot more action there in recent months. Matei On Apr 27, 2014, at 12:08 PM, Andrew Ash wrote: > That thread was mostly about benchmarking YARN vs standalone, and the results > are what I

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
Much thanks for the perspective Matei. On Sun, Apr 27, 2014 at 10:51 PM, Matei Zaharia wrote: > From my point of view, both are supported equally. The YARN support is > newer and that’s why there’s been a lot more action there in recent months. > > Matei > > On Apr 27, 2014, at 12:08 PM, Andrew

Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Roger Hoover
Hi, From the meetup talk about the 1.0 release, I saw that spark-submit will be the preferred way to launch apps going forward. How do you recommend launching such jobs in a development cycle? For example, how can I load an app that's expected to be given to spark-submit into spark-shell? Also

Re: Running out of memory Naive Bayes

2014-04-27 Thread John King
I'm already using the SparseVector class. ~200 labels On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng wrote: > How many labels does your dataset have? -Xiangrui > > On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote: > > Which version of mllib are you using? For Spark 1.0, mllib will > > support
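For context, a minimal sketch of what training MLlib's Naive Bayes on sparse vectors looks like with the Spark 1.0 API referenced in the quoted reply; the tiny dataset and the feature count are made up:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext("local[2]", "sparse-nb-sketch")

    // Hypothetical tiny dataset: sparse vectors in a very large feature space.
    val numFeatures = 2000000
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(3, 17), Array(1.0, 2.0))),
      LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(5, 42), Array(1.0, 1.0)))))

    // lambda is the additive smoothing parameter.
    val model = NaiveBayes.train(data, lambda = 1.0)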

Re: Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Matei Zaharia
Hi Roger, You should be able to use the --jars argument of spark-shell to add JARs onto the classpath and then work with those classes in the shell. (A recent patch, https://github.com/apache/spark/pull/542, made spark-shell use the same command-line arguments as spark-submit). But this is a gr
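A minimal sketch of the workflow Matei describes; the jar path and the class name are hypothetical:

    // Start the shell with the application jar on the classpath (hypothetical path):
    //   ./bin/spark-shell --jars target/myapp.jar
    //
    // Inside the shell, reuse the shell's existing SparkContext `sc` rather than
    // creating a new one, and call into the app's code directly (hypothetical class):
    val result = com.example.MyJob.run(sc)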

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Matei Zaharia
How many values are in that sequence? I.e. what is its size? You can also profile your program while it’s running to see where it’s spending time. The easiest way is to get a single stack trace with jstack . Maybe some of the serialization methods for this data are super inefficient, or toSeq o

Re: Any advice for using big spark.cleaner.delay value in Spark Streaming?

2014-04-27 Thread Tathagata Das
Hello, If you want to do aggregations like count that span across days, weeks or months, AND do not want the result in real time, then Spark Streaming is probably not the best thing to use. You probably should store all the data in a data store (HDFS file or database) and then use a Spark job / SQL qu
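A minimal sketch of the pattern Tathagata describes, split into two separate jobs; the socket source, HDFS paths, 60-second batch interval and line format are all assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Job 1: just persist the raw events to HDFS (hypothetical source and path).
    object PersistEvents {
      def main(args: Array[String]) {
        val ssc = new StreamingContext("local[2]", "persist-events", Seconds(60))
        ssc.socketTextStream("localhost", 9999)
           .saveAsTextFiles("hdfs:///events/part")   // one directory per streaming batch
        ssc.start()
        ssc.awaitTermination()
      }
    }

    // Job 2: run daily/weekly/monthly as a plain batch job over the stored files.
    // Assumes each line looks like "timestamp,userId,..." (hypothetical format).
    object CountDistinctUsers {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "distinct-users")
        val n = sc.textFile("hdfs:///events/*")
                  .map(_.split(",")(1))
                  .distinct()
                  .count()
        println("distinct users: " + n)
      }
    }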

Re: questions about debugging a spark application

2014-04-27 Thread wxhsdp
Or should I run my app in the spark shell by using addJars? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/questions-about-debugging-a-spark-application-tp4891p4910.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: questions about debugging a spark application

2014-04-27 Thread wxhsdp
Or should I run my app in the spark shell by using addJars? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/questions-about-debugging-a-spark-application-tp4891p4911.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Strange lookup behavior. Possible bug?

2014-04-27 Thread Yadid Ayzenberg
Can someone please suggest how I can move forward with this? My Spark version is 0.9.1. The big challenge is that this issue is not reproduced when running in local mode. What could be the difference? I would really appreciate any pointers, as currently the job just hangs. On 4/25/14, 7:3

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
It's my fault! I uploaded the wrong jar when I changed the number of partitions, and now it just works fine :) The size of word_mapping is 2444185. So it will take a very long time for large object serialization? I don't think two million is very large, because the cost at local for such size is typical
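For anyone hitting similar symptoms, a small sketch of the knobs discussed in this thread (an explicit partition count, and the serializer choice); the data, sizes and timing here are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parallelize-sketch")
      .setMaster("local[4]")
      // Kryo is optional; shown only because serialization cost came up in the thread.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Hypothetical stand-in for the ~2.4M-entry word_mapping sequence.
    val wordMapping: Seq[(String, Int)] = (0 until 2444185).map(i => ("word" + i, i))

    val start = System.currentTimeMillis()
    // Explicit partition count; parallelize ships the whole local collection
    // from the driver, so some delay is expected for a collection this size.
    val rdd = sc.parallelize(wordMapping, 64).cache()
    println("count = " + rdd.count() + " in " + (System.currentTimeMillis() - start) + " ms")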

spark running examples error

2014-04-27 Thread Joe L
I ran ./bin/run-example org.apache.spark.examples.SparkPi spark://MASTERIP:7077, but I am getting the following error; it seems the master is not connecting to the slave nodes. Any suggestions? -- View this mess

Re: Running out of memory Naive Bayes

2014-04-27 Thread Xiangrui Meng
Even if the features are sparse, the conditional probabilities are stored in a dense matrix. With 200 labels and 2 million features, you need to store at least 4e8 doubles on the driver node. With multiple partitions, you may need more memory on the driver. Could you try reducing the number of partiti
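For a back-of-the-envelope sense of scale (assuming 8-byte doubles): 200 labels * 2,000,000 features = 4e8 entries, and 4e8 * 8 bytes is roughly 3.2 GB for the dense conditional-probability matrix alone, before any per-partition copies accumulate on the driver.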

NullPointerException when run SparkPI using YARN env

2014-04-27 Thread martin.ou
1. my hadoop 2.3.0  2. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly  3. SPARK_YARN_MODE=true SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.3.0.jar SPARK_YARN_APP_JAR=$SPARK_HOME/examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar MASTER=yarn-client

Re: is it okay to reuse objects across RDD's?

2014-04-27 Thread DB Tsai
Hi Todd, As Patrick and you already pointed out, it's really dangerous to mutate the state of an RDD. However, when we implemented glmnet in Spark, reusing the residuals for each row of the RDD computed in the previous step can speed things up 4~5x. As a result, we add an extra column in the RDD for
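One way to get that reuse without mutating RDD state is to carry the residual as an explicit column and rebuild a cached RDD each iteration; the row type, weights and update rule below are made up for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical row type: features, label, plus the residual carried between iterations.
    case class Row(features: Array[Double], label: Double, residual: Double)

    // Hypothetical residual update; stands in for the actual glmnet step.
    def updateResidual(row: Row, weights: Array[Double]): Double =
      row.label - row.features.zip(weights).map { case (x, w) => x * w }.sum

    val sc = new SparkContext("local[2]", "residual-column-sketch")
    val weights = Array(0.1, 0.2)
    var data: RDD[Row] = sc.parallelize(Seq(
      Row(Array(1.0, 2.0), 1.0, 0.0),
      Row(Array(0.5, 1.5), 0.0, 0.0))).cache()

    for (iter <- 1 to 5) {
      // Build a new cached RDD with the updated residual column instead of mutating in place.
      val updated = data.map(r => r.copy(residual = updateResidual(r, weights))).cache()
      updated.count()          // materialize before dropping the previous copy
      data.unpersist()
      data = updated
    }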

Re: Running out of memory Naive Bayes

2014-04-27 Thread DB Tsai
Hi Xiangrui, We also ran into this issue at Alpine Data Labs. We ended up using an LRU cache to store the counts, and spilling the least-used counts to the distributed cache in HDFS. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn:

Re: Running out of memory Naive Bayes

2014-04-27 Thread Xiangrui Meng
How big is your problem and how many labels? -Xiangrui On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai wrote: > Hi Xiangrui, > > We also run into this issue at Alpine Data Labs. We ended up using LRU cache > to store the counts, and splitting those least used counts to distributed > cache in HDFS. > >

Re: Running out of memory Naive Bayes

2014-04-27 Thread DB Tsai
A year ago our customer asked us to implement a Naive Bayes that should at least be able to train on news20, and we implemented it for them in Hadoop using the distributed cache to store the model. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Running out of memory Naive Bayes

2014-04-27 Thread Matei Zaharia
Not sure if this is always ideal for Naive Bayes, but you could also hash the features into a lower-dimensional space (e.g. reduce it to 50,000 features). For each feature simply take MurmurHash3(featureID) % 5 for example. Matei On Apr 27, 2014, at 11:24 PM, DB Tsai wrote: > Our customer
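A small sketch of that hashing trick in Scala, assuming the bucket count is the 50,000-feature target mentioned in the message; the feature ids and values are made up:

    import scala.util.hashing.MurmurHash3

    // Hash each feature id into a fixed number of buckets; 50,000 matches the
    // target dimensionality mentioned in the message.
    val numBuckets = 50000

    def hashFeature(featureId: String): Int = {
      val h = MurmurHash3.stringHash(featureId)
      // force a non-negative bucket index
      ((h % numBuckets) + numBuckets) % numBuckets
    }

    // e.g. turn (featureId -> value) pairs into (bucket -> summed value)
    val example = Seq("user:42" -> 1.0, "item:apple" -> 2.0, "item:pear" -> 1.0)
    val hashed = example
      .map { case (f, v) => hashFeature(f) -> v }
      .groupBy(_._1)
      .mapValues(_.map(_._2).sum)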

Spark with Parquet

2014-04-27 Thread Sai Prasanna
Hi All, I want to store a CSV text file in Parquet format in HDFS and then do some processing in Spark. Somehow my search for a way to do this was futile; most of what I found was about Parquet with Impala. Any guidance here? Thanks !!

Re: Spark with Parquet

2014-04-27 Thread Matei Zaharia
Spark uses the Hadoop InputFormat and OutputFormat classes, so you can simply create a JobConf to read the data and pass that to SparkContext.hadoopFile. There are some examples for Parquet usage here: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ and here: http://engineering.ooyal
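Following the approach in the posts Matei links (which use the parquet-avro / parquet-proto bindings of that era), a hedged sketch of the read side might look like the following; MyRecord stands in for an Avro-generated record class, and the path is hypothetical:

    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext
    import parquet.avro.AvroReadSupport
    import parquet.hadoop.ParquetInputFormat

    val sc = new SparkContext("local[2]", "parquet-read-sketch")

    // Tell the Parquet input format how to materialize records (Avro here).
    val job = new Job()
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[MyRecord]])

    // Keys are Void; values are the materialized records.
    val records = sc.newAPIHadoopFile(
      "hdfs:///data/events.parquet",
      classOf[ParquetInputFormat[MyRecord]],
      classOf[Void],
      classOf[MyRecord],
      job.getConfiguration)

    println(records.map(_._2).count())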