Any advice for using big spark.cleaner.delay value in Spark Streaming?

2014-04-27 Thread buremba
It seems the default value for spark.cleaner.delay is 3600 seconds, but I need to be able to count things on a daily, weekly or even monthly basis. I suppose the aim of DStream batches and spark.cleaner.delay is to avoid space issues (running out of memory etc.). I usually use HyperLogLog for counting un

Re: Re: what is the best way to do cartesian

2014-04-27 Thread qinwei
Thanks a lot for your reply, but I have tried the built-in RDD.cartesian() method before; it didn't make it faster. qinwei | From: Alex Boisvert | Date: 2014-04-26 00:32 | To: user | Subject: Re: what is the best way to do cartesian | You might want to try the built-in RDD.cartesian() method. On Th
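For reference, a minimal sketch of the built-in call, with made-up data and a local master; note that the output grows as |left| * |right|, which is usually why cartesian products are slow no matter how they are computed.

    import org.apache.spark.SparkContext

    // Hypothetical data and local master, just to show the call.
    val sc = new SparkContext("local[2]", "cartesian-sketch")
    val left = sc.parallelize(Seq(1, 2, 3))
    val right = sc.parallelize(Seq("a", "b"))

    // Every (left, right) pair: 3 * 2 = 6 elements here.
    // The result size is |left| * |right|, so it explodes quickly on real data.
    val pairs = left.cartesian(right)
    pairs.collect().foreach(println)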

Re: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-27 Thread qinwei
Thanks a lot for your reply, it gave me much inspiration. qinwei | From: Sean Owen | Date: 2014-04-25 14:10 | To: user | Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark | So you are computing all-pairs similarity over 20M users? This is going to take ab
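As a rough illustration of why Sean's point matters (this is not the thread's actual code, and the (userId, itemId) shape of the input is assumed): item-based approaches usually count item co-occurrences per user, so only item pairs that actually appear together are materialized, instead of comparing all pairs of 20M users.

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[2]", "cooccurrence-sketch")

    // Hypothetical (userId, itemId) pairs standing in for the real rating data.
    val userItems: RDD[(Long, String)] = sc.parallelize(Seq(
      (1L, "a"), (1L, "b"), (2L, "a"), (2L, "c"), (3L, "a"), (3L, "b")))

    // Group items by user, emit each unordered item pair seen together for a user,
    // then count how often each pair co-occurs. Only pairs that actually co-occur
    // are materialized, which is far cheaper than an all-pairs user comparison.
    val cooccurrence = userItems
      .groupByKey()
      .flatMap { case (_, items) =>
        val arr = items.toArray.distinct
        for (i <- arr; j <- arr if i < j) yield ((i, j), 1)
      }
      .reduceByKey(_ + _)

    cooccurrence.collect().foreach(println)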

Re: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark

2014-04-27 Thread Qin Wei
Thanks a lot for your reply, it gave me much inspiration. qinwei | From: Sean Owen-2 [via Apache Spark User List] | Date: 2014-04-25 14:11 | To: Qin Wei | Subject: Re: Problem with the Item-Based Collaborative Filtering Recommendation Algorithms in spark | So you are computing all-pairs simi

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
That doesn't work. I don't think it is just slow; it never ends (it ran for 30+ hours and I killed it). -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/parallelize-for-a-large-Seq-is-extreamly-slow-tp4801p4900.html Sent from the Apache Spark User List mailing list

help

2014-04-27 Thread Joe L
I am getting this error; please help me fix it. 14/04/28 02:16:20 INFO SparkDeploySchedulerBackend: Executor app-20140428021620-0007/10 removed: class java.io.IOException: Cannot run program "/home/exobrain/install/spark-0.9.1/bin/compute-classpath.sh" (in directory "."): error=13, -- View thi

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
That thread was mostly about benchmarking YARN vs standalone, and the results are what I'd expect -- spinning up a Spark cluster on demand through YARN has higher startup latency than using a standalone cluster, where the JVMs are already initialized and ready. Given that there's a lot more commit

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Matei Zaharia
From my point of view, both are supported equally. The YARN support is newer and that’s why there’s been a lot more action there in recent months. Matei On Apr 27, 2014, at 12:08 PM, Andrew Ash wrote: > That thread was mostly about benchmarking YARN vs standalone, and the results > are what I

Re: Spark on Yarn or Mesos?

2014-04-27 Thread Andrew Ash
Much thanks for the perspective Matei. On Sun, Apr 27, 2014 at 10:51 PM, Matei Zaharia wrote: > From my point of view, both are supported equally. The YARN support is > newer and that’s why there’s been a lot more action there in recent months. > > Matei > > On Apr 27, 2014, at 12:08 PM, Andrew

Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Roger Hoover
Hi, From the meetup talk about the 1.0 release, I saw that spark-submit will be the preferred way to launch apps going forward. How do you recommend launching such jobs in a development cycle? For example, how can I load an app that's expected to be given to spark-submit into spark-shell? Also

Re: Running out of memory Naive Bayes

2014-04-27 Thread John King
I'm already using the SparseVector class. ~200 labels On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng wrote: > How many labels does your dataset have? -Xiangrui > > On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote: > > Which version of mllib are you using? For Spark 1.0, mllib will > > support
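For context, a minimal sketch of what training MLlib's Naive Bayes on sparse vectors looks like with the Spark 1.0 API referenced in the quoted reply; the tiny dataset and the feature count are made up:

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext("local[2]", "sparse-nb-sketch")

    // Hypothetical tiny dataset: sparse vectors in a very large feature space.
    val numFeatures = 2000000
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.sparse(numFeatures, Array(3, 17), Array(1.0, 2.0))),
      LabeledPoint(1.0, Vectors.sparse(numFeatures, Array(5, 42), Array(1.0, 1.0)))))

    // lambda is the additive smoothing parameter.
    val model = NaiveBayes.train(data, lambda = 1.0)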

Re: Running a spark-submit compatible app in spark-shell

2014-04-27 Thread Matei Zaharia
Hi Roger, You should be able to use the --jars argument of spark-shell to add JARs onto the classpath and then work with those classes in the shell. (A recent patch, https://github.com/apache/spark/pull/542, made spark-shell use the same command-line arguments as spark-submit). But this is a gr
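A minimal sketch of the workflow Matei describes; the jar path and the class name are hypothetical:

    // Start the shell with the application jar on the classpath (hypothetical path):
    //   ./bin/spark-shell --jars target/myapp.jar
    //
    // Inside the shell, reuse the shell's existing SparkContext `sc` rather than
    // creating a new one, and call into the app's code directly (hypothetical class):
    val result = com.example.MyJob.run(sc)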

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Matei Zaharia
How many values are in that sequence? I.e. what is its size? You can also profile your program while it’s running to see where it’s spending time. The easiest way is to get a single stack trace with jstack . Maybe some of the serialization methods for this data are super inefficient, or toSeq o

Re: Any advice for using big spark.cleaner.delay value in Spark Streaming?

2014-04-27 Thread Tathagata Das
Hello, If you want to do aggregations like count that span across days, weeks or months, AND do not want the result in real time, then Spark Streaming is probably not the best thing to use. You probably should store all the data in a data store (HDFS file or database) and then use a Spark job / SQL qu
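A minimal sketch of the pattern Tathagata describes, split into two separate jobs; the socket source, HDFS paths, 60-second batch interval and line format are all assumptions:

    import org.apache.spark.SparkContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Job 1: just persist the raw events to HDFS (hypothetical source and path).
    object PersistEvents {
      def main(args: Array[String]) {
        val ssc = new StreamingContext("local[2]", "persist-events", Seconds(60))
        ssc.socketTextStream("localhost", 9999)
           .saveAsTextFiles("hdfs:///events/part")   // one directory per streaming batch
        ssc.start()
        ssc.awaitTermination()
      }
    }

    // Job 2: run daily/weekly/monthly as a plain batch job over the stored files.
    // Assumes each line looks like "timestamp,userId,..." (hypothetical format).
    object CountDistinctUsers {
      def main(args: Array[String]) {
        val sc = new SparkContext("local[2]", "distinct-users")
        val n = sc.textFile("hdfs:///events/*")
                  .map(_.split(",")(1))
                  .distinct()
                  .count()
        println("distinct users: " + n)
      }
    }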

Re: questions about debugging a spark application

2014-04-27 Thread wxhsdp
Or should I run my app in the spark shell by using addJars? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/questions-about-debugging-a-spark-application-tp4891p4910.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: questions about debugging a spark application

2014-04-27 Thread wxhsdp
Or should I run my app in the spark shell by using addJars? -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/questions-about-debugging-a-spark-application-tp4891p4911.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Strange lookup behavior. Possible bug?

2014-04-27 Thread Yadid Ayzenberg
Can someone please suggest how I can move forward with this? My Spark version is 0.9.1. The big challenge is that this issue is not reproduced when running in local mode. What could be the difference? I would really appreciate any pointers, as currently the job just hangs. On 4/25/14, 7:3

Re: parallelize for a large Seq is extreamly slow.

2014-04-27 Thread Earthson
It's my fault! I uploaded the wrong jar when I changed the number of partitions, and now it just works fine :) The size of word_mapping is 2444185. So it will take a very long time for large object serialization? I don't think two million is very large, because the cost at local for such size is typical
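For anyone hitting similar symptoms, a small sketch of the knobs discussed in this thread (an explicit partition count, and the serializer choice); the data, sizes and timing here are purely illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("parallelize-sketch")
      .setMaster("local[4]")
      // Kryo is optional; shown only because serialization cost came up in the thread.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    // Hypothetical stand-in for the ~2.4M-entry word_mapping sequence.
    val wordMapping: Seq[(String, Int)] = (0 until 2444185).map(i => ("word" + i, i))

    val start = System.currentTimeMillis()
    // Explicit partition count; parallelize ships the whole local collection
    // from the driver, so some delay is expected for a collection this size.
    val rdd = sc.parallelize(wordMapping, 64).cache()
    println("count = " + rdd.count() + " in " + (System.currentTimeMillis() - start) + " ms")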

spark running examples error

2014-04-27 Thread Joe L
I ran ./bin/run-example org.apache.spark.examples.SparkPi spark://MASTERIP:7077, but I am getting the following error; it seems the master is not connecting to the slave nodes. Any suggestions? -- View this mess

Re: Running out of memory Naive Bayes

2014-04-27 Thread Xiangrui Meng
Even if the features are sparse, the conditional probabilities are stored in a dense matrix. With 200 labels and 2 million features, you need to store at least 4e8 doubles on the driver node. With multiple partitions, you may need more memory on the driver. Could you try reducing the number of partiti
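For a back-of-the-envelope sense of scale (assuming 8-byte doubles): 200 labels * 2,000,000 features = 4e8 entries, and 4e8 * 8 bytes is roughly 3.2 GB for the dense conditional-probability matrix alone, before any per-partition copies accumulate on the driver.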

NullPointerException when run SparkPI using YARN env

2014-04-27 Thread martin.ou
1. my hadoop 2.3.0  2. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly  3. SPARK_YARN_MODE=true SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.3.0.jar SPARK_YARN_APP_JAR=$SPARK_HOME/examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar MASTER=yarn-client

Re: is it okay to reuse objects across RDD's?

2014-04-27 Thread DB Tsai
Hi Todd, As Patrick and you already pointed out, it's really dangerous to mutate the state of an RDD. However, when we implemented glmnet in Spark, reusing the residuals for each row of the RDD computed in the previous step can speed things up 4~5x. As a result, we add an extra column in the RDD for
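One way to get that reuse without mutating RDD state is to carry the residual as an explicit column and rebuild a cached RDD each iteration; the row type, weights and update rule below are made up for illustration:

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Hypothetical row type: features, label, plus the residual carried between iterations.
    case class Row(features: Array[Double], label: Double, residual: Double)

    // Hypothetical residual update; stands in for the actual glmnet step.
    def updateResidual(row: Row, weights: Array[Double]): Double =
      row.label - row.features.zip(weights).map { case (x, w) => x * w }.sum

    val sc = new SparkContext("local[2]", "residual-column-sketch")
    val weights = Array(0.1, 0.2)
    var data: RDD[Row] = sc.parallelize(Seq(
      Row(Array(1.0, 2.0), 1.0, 0.0),
      Row(Array(0.5, 1.5), 0.0, 0.0))).cache()

    for (iter <- 1 to 5) {
      // Build a new cached RDD with the updated residual column instead of mutating in place.
      val updated = data.map(r => r.copy(residual = updateResidual(r, weights))).cache()
      updated.count()          // materialize before dropping the previous copy
      data.unpersist()
      data = updated
    }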

Re: Running out of memory Naive Bayes

2014-04-27 Thread DB Tsai
Hi Xiangrui, We also ran into this issue at Alpine Data Labs. We ended up using an LRU cache to store the counts, and spilling the least-used counts to the distributed cache in HDFS. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn:

Re: Running out of memory Naive Bayes

2014-04-27 Thread Xiangrui Meng
How big is your problem and how many labels? -Xiangrui On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai wrote: > Hi Xiangrui, > > We also run into this issue at Alpine Data Labs. We ended up using LRU cache > to store the counts, and splitting those least used counts to distributed > cache in HDFS. > >

Re: Running out of memory Naive Bayes

2014-04-27 Thread DB Tsai
A year ago our customer asked us to implement a Naive Bayes that should at least be able to train on news20, and we implemented it for them in Hadoop using the distributed cache to store the model. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com

Re: Running out of memory Naive Bayes

2014-04-27 Thread Matei Zaharia
Not sure if this is always ideal for Naive Bayes, but you could also hash the features into a lower-dimensional space (e.g. reduce it to 50,000 features). For each feature simply take MurmurHash3(featureID) % 5 for example. Matei On Apr 27, 2014, at 11:24 PM, DB Tsai wrote: > Our customer
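A small sketch of that hashing trick in Scala, assuming the bucket count is the 50,000-feature target mentioned in the message; the feature ids and values are made up:

    import scala.util.hashing.MurmurHash3

    // Hash each feature id into a fixed number of buckets; 50,000 matches the
    // target dimensionality mentioned in the message.
    val numBuckets = 50000

    def hashFeature(featureId: String): Int = {
      val h = MurmurHash3.stringHash(featureId)
      // force a non-negative bucket index
      ((h % numBuckets) + numBuckets) % numBuckets
    }

    // e.g. turn (featureId -> value) pairs into (bucket -> summed value)
    val example = Seq("user:42" -> 1.0, "item:apple" -> 2.0, "item:pear" -> 1.0)
    val hashed = example
      .map { case (f, v) => hashFeature(f) -> v }
      .groupBy(_._1)
      .mapValues(_.map(_._2).sum)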

Spark with Parquet

2014-04-27 Thread Sai Prasanna
Hi All, I want to store a CSV text file in Parquet format in HDFS and then do some processing in Spark. Somehow my search for a way to do this was futile; most of what I found was about Parquet with Impala. Any guidance here? Thanks !!

Re: Spark with Parquet

2014-04-27 Thread Matei Zaharia
Spark uses the Hadoop InputFormat and OutputFormat classes, so you can simply create a JobConf to read the data and pass that to SparkContext.hadoopFile. There are some examples for Parquet usage here: http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/ and here: http://engineering.ooyal
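Following the approach in the posts Matei links (which use the parquet-avro / parquet-proto bindings of that era), a hedged sketch of the read side might look like the following; MyRecord stands in for an Avro-generated record class, and the path is hypothetical:

    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext
    import parquet.avro.AvroReadSupport
    import parquet.hadoop.ParquetInputFormat

    val sc = new SparkContext("local[2]", "parquet-read-sketch")

    // Tell the Parquet input format how to materialize records (Avro here).
    val job = new Job()
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[MyRecord]])

    // Keys are Void; values are the materialized records.
    val records = sc.newAPIHadoopFile(
      "hdfs:///data/events.parquet",
      classOf[ParquetInputFormat[MyRecord]],
      classOf[Void],
      classOf[MyRecord],
      job.getConfiguration)

    println(records.map(_._2).count())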