Re: is it okay to reuse objects across RDD's?

2014-04-26 Thread Patrick Wendell
Hey Todd, This approach violates the normal semantics of RDD transformations, as you point out. You noted some issues already, and there are others. For instance, say you cache originalRDD and some of the partitions end up in memory and others end up on disk. The ones that end up in me

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Jonathan Chayat
In IntelliJ, nothing changed. In SBT console I got this error: $sbt > console [info] Running org.apache.spark.repl.Main -usejavacp 14/04/27 08:29:44 INFO spark.HttpServer: Starting HTTP Server 14/04/27 08:29:44 INFO server.Server: jetty-7.6.8.v20121106 14/04/27 08:29:44 INFO server.AbstractCo

Re: Running out of memory Naive Bayes

2014-04-26 Thread Xiangrui Meng
How many labels does your dataset have? -Xiangrui On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote: > Which version of mllib are you using? For Spark 1.0, mllib will > support sparse feature vector which will improve performance a lot > when computing the distance between points and centroid. > > S

questions about debugging a spark application

2014-04-26 Thread wxhsdp
Hi all, I have some questions about debugging in Spark: 1) when the application finishes, the application UI is shut down, and I cannot see the details about the app, like shuffle size, duration time, stage information... there is not sufficient information in the master UI. Do I need to hang the
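
One option, assuming Spark 1.0 or later where event logging and the history server exist, is to have the application write an event log that can be browsed after it finishes; a minimal sketch with a placeholder log directory:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only (assumes Spark 1.0+): persist event data so stage and shuffle
    // details remain available after the application finishes. The log directory
    // is a placeholder; a history server would be pointed at the same directory.
    val conf = new SparkConf()
      .setAppName("my-app")                                   // hypothetical name
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs:///spark-event-logs")  // placeholder path
    val sc = new SparkContext(conf)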

Re: Running out of memory Naive Bayes

2014-04-26 Thread DB Tsai
Which version of mllib are you using? For Spark 1.0, mllib will support sparse feature vector which will improve performance a lot when computing the distance between points and centroid. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com Li
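
For reference, a minimal sketch of sparse input to MLlib's Naive Bayes, assuming the Spark 1.0 mllib API; the feature indices, values, and label below are made up for illustration, and sc is an existing SparkContext:

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Each example stores only its non-zero feature indices and values,
    // instead of a 2-million-element dense array.
    val numFeatures = 2000000
    val example = LabeledPoint(1.0,
      Vectors.sparse(numFeatures, Array(3, 104729, 1999999), Array(1.0, 2.0, 0.5)))

    val training = sc.parallelize(Seq(example))
    val model = NaiveBayes.train(training, 1.0)  // 1.0 is the smoothing parameter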

Re: how to get subArray without copy

2014-04-26 Thread wxhsdp
The only way I can find is to use a 2-D Array, if the split has regularity -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-get-subArray-without-copy-tp4873p4888.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
You'll also need: libraryDependencies += "org.apache.spark" %% "spark-repl" % On Sat, Apr 26, 2014 at 3:32 PM, Michael Armbrust wrote: > This is a little bit of a hack, but might work for you. You'll need to be > on sbt 0.13.2. > > connectInput in run := true > > outputStrategy in run := Som
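
Putting the settings from this thread together, a build.sbt sketch might look roughly as follows; the Spark version string is a placeholder, and sbt 0.13.2 is assumed as noted below:

    // build.sbt sketch; "1.0.0" is a placeholder Spark version.
    libraryDependencies += "org.apache.spark" %% "spark-repl" % "1.0.0"

    connectInput in run := true

    outputStrategy in run := Some(StdoutOutput)

    // Replace the default console task with Spark's REPL.
    console := { (runMain in Compile).toTask(" org.apache.spark.repl.Main -usejavacp").value }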

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
This is a little bit of a hack, but might work for you. You'll need to be on sbt 0.13.2. connectInput in run := true outputStrategy in run := Some (StdoutOutput) console := { (runMain in Compile).toTask(" org.apache.spark.repl.Main -usejavacp").value } On Sat, Apr 26, 2014 at 1:05 PM, Jonat

Re: Spark and HBase

2014-04-26 Thread Nicholas Chammas
Thank you for sharing. Phoenix for realtime queries and Spark for more complex batch processing seems like a potentially good combo. I wonder if Spark's future will include support for the same kinds of workloads that Phoenix is being built for. This little tidbit

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Jonathan Chayat
Hi Michael, thanks for your prompt reply. It seems like IntelliJ Scala Console actually runs the Scala REPL (they print the same stuff when starting up). It is probably the SBT console. When I tried the same code in the Scala REPL of my project using "sbt console" it didn't work either. It only w

Re: Question about Transforming huge files from Local to HDFS

2014-04-26 Thread Michael Armbrust
> 1) When I tried to read a huge file from local and used Avro + Parquet to > transform it into Parquet format and stored them to HDFS using the API > "saveAsNewAPIHadoopFile", the JVM would be out of memory, because the file > is too large to be contained by memory. > How much memory are you givi

Re: Using Spark in IntelliJ Scala Console

2014-04-26 Thread Michael Armbrust
The spark-shell is a special version of the Scala REPL that serves the classes created for each line over HTTP. Do you know if the IntelliJ Spark console is just the normal Scala REPL in a GUI wrapper, or if it is something else entirely? If it's the former, perhaps it might be possible to tell Int

Using Spark in IntelliJ Scala Console

2014-04-26 Thread Jonathan Chayat
Hi all, TLDR: running spark locally through IntelliJ IDEA Scala Console results in java.lang.ClassNotFoundException Long version: I'm an algorithms developer in SupersonicAds - an ad network. We are building a major new big data project and we are now in the process of selecting our tech stack &

Re: parallelize for a large Seq is extreamly slow.

2014-04-26 Thread Aaron Davidson
Could it be that the default number of partitions from parallelize() is too small in this case? Try something like spark.parallelize(word_mapping.value.toSeq, 60). (Given your setup, it should already be 30, but perhaps that's not the case in YARN mode...) On Fri, Apr 25, 2014 at 11:38
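
For context, the second argument to parallelize is the number of partitions of the resulting RDD; a tiny sketch, with wordMapping standing in for the broadcast variable from the thread and sc for an existing SparkContext:

    // Request an explicit partition count instead of relying on the default,
    // which may be low in some deployment modes.
    val rdd = sc.parallelize(wordMapping.value.toSeq, 60)
    println(rdd.partitions.length)  // should print 60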

Re: Parquet-SPARK-PIG integration.

2014-04-26 Thread suman bharadwaj
Figured how to do it. Hence thought of sharing in case if someone is interested. import parquet.column.ColumnReader import parquet.filter.ColumnRecordFilter._ import parquet.filter.ColumnPredicates._ import parquet.hadoop.{ParquetOutputFormat, ParquetInputFormat} import org.apache.hadoop.mapred.Jo
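
For illustration, one way Parquet data can be read into an RDD is through ParquetInputFormat and newAPIHadoopFile; this sketch assumes parquet-avro's AvroReadSupport (not necessarily the read support used in the thread), a placeholder input path, and an existing SparkContext sc:

    import org.apache.avro.generic.GenericRecord
    import org.apache.hadoop.mapreduce.Job
    import parquet.avro.AvroReadSupport
    import parquet.hadoop.ParquetInputFormat

    // Read Parquet files into an RDD of Avro GenericRecords.
    val job = new Job(sc.hadoopConfiguration)
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[GenericRecord]])

    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/parquet-output",            // placeholder path
      classOf[ParquetInputFormat[GenericRecord]],
      classOf[Void],
      classOf[GenericRecord],
      job.getConfiguration)

    records.map(_._2).take(5).foreach(println)     // values hold the records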

Re: Spark and HBase

2014-04-26 Thread Josh Mahonin
We're still in the infancy stages of the architecture for the project I'm on, but presently we're investigating HBase / Phoenix data store for its realtime query abilities, and being able to expose data over a JDBC connector is attractive for us. Much of our data is event based, and many of th

is it okay to reuse objects across RDD's?

2014-04-26 Thread Lisonbee, Todd
For example, val originalRDD: RDD[SomeCaseClass] = ... // Option 1: objects are copied, setting prop1 in the process val transformedRDD = originalRDD.map( item => item.copy(prop1 = calculation() ) // Option 2: objects are re-used and modified val transformedRDD = originalRDD.map( item => item.pro
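
For illustration, a fuller sketch of the two options being contrasted; SomeCaseClass, prop1, and calculation() are stand-ins, prop1 is assumed to be a var so the in-place mutation compiles, and sc is an existing SparkContext:

    import org.apache.spark.rdd.RDD

    case class SomeCaseClass(var prop1: Int, prop2: String)
    def calculation(): Int = 42  // hypothetical

    val originalRDD: RDD[SomeCaseClass] =
      sc.parallelize(Seq(SomeCaseClass(0, "a"), SomeCaseClass(0, "b")))

    // Option 1: objects are copied, setting prop1 in the process
    val copiedRDD = originalRDD.map(item => item.copy(prop1 = calculation()))

    // Option 2: objects are re-used and modified in place
    val mutatedRDD = originalRDD.map { item =>
      item.prop1 = calculation()
      item
    }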

Re: Running out of memory Naive Bayes

2014-04-26 Thread John King
I'm just wondering: are the SparkVector calculations really taking the sparsity into account, or just converting to dense? On Fri, Apr 25, 2014 at 10:06 PM, John King wrote: > I've been trying to use the Naive Bayes classifier. Each example in the > dataset is about 2 million features, only about

Re: help

2014-04-26 Thread Sean Owen
In CDH5, worker logs are under /var/log/spark. The executor logs are under /var/run/spark/work. This might be answerable without looking at logs. Is the directory in question going to be visible to Spark's user on all machines where it is trying to run? -- Sean Owen | Director, Data Science | London

Parquet-SPARK-PIG integration.

2014-04-26 Thread suman bharadwaj
Hi All, We have written PIG Jobs which outputs the data in parquet format. For eg: register parquet-column-1.3.1.jar; register parquet-common-1.3.1.jar; register parquet-format-2.0.0.jar; register parquet-hadoop-1.3.1.jar; register parquet-pig-1.3.1.jar; register parquet-encoding-1.3.1.jar; A =

how to get subArray without copy

2014-04-26 Thread wxhsdp
Hi all, I want to do the following operations: (1) each partition does some operations on its data in Array format (2) split the array into subArrays, and combine each subArray with an id (3) do a shuffle according to the id. Here is the pseudo code /*pseudo code*/ case class
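
A rough sketch of that pattern; the per-partition work, the chunk size, and the input rdd are placeholders, and the id here is simply the sub-array's index within its partition:

    import org.apache.spark.SparkContext._  // pair-RDD functions such as groupByKey

    // (1) work on each partition's data as an Array,
    // (2) split it into sub-arrays tagged with an id,
    // (3) shuffle so that sub-arrays with the same id end up together.
    // rdd is a placeholder for an existing RDD of elements to re-chunk.
    val chunkSize = 1000  // placeholder
    val keyed = rdd.mapPartitions { iter =>
      val data = iter.toArray
      // ... per-partition operations on `data` would go here ...
      data.grouped(chunkSize).zipWithIndex.map { case (sub, id) => (id, sub) }
    }
    val regrouped = keyed.groupByKey()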

Re: how to set spark.executor.memory and heap size

2014-04-26 Thread wxhsdp
Hi, finally I solved this problem by using the SPARK_HOME/bin/run-example script to run my application, and it works. I guess the error was due to some missing classpath entries -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-spark-executor-memory-and-h
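
An alternative to piggy-backing on run-example, assuming the missing classes live in the application's own packaged jar, is to ship that jar to the executors explicitly; a sketch with placeholder jar path, master URL, and app name:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point the executors at the application's packaged jar so its classes
    // are on their classpath. All values below are placeholders.
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")
      .setAppName("my-app")
      .setJars(Seq("target/scala-2.10/my-app_2.10-0.1.jar"))
    val sc = new SparkContext(conf)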