Re: Shark vs Impala

2014-06-23 Thread Aaron Davidson
Note that regarding a "long load time", data format means a whole lot in terms of query performance. If you load all your data into compressed, columnar Parquet files on local hardware, Spark SQL would also perform far, far better than it would reading from gzipped S3 files. You must also be carefu
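For reference, a minimal sketch of writing and reading compressed, columnar Parquet with the Spark 1.0 SQL API; the Record case class, the local master, and the paths are placeholders, not taken from this thread:

    import org.apache.spark.SparkContext
    import org.apache.spark.sql.SQLContext

    case class Record(key: Int, value: String)

    val sc = new SparkContext("local[2]", "parquet-sketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD  // implicit RDD[Record] -> SchemaRDD conversion

    val records = sc.parallelize(1 to 100).map(i => Record(i, "val_" + i))
    records.saveAsParquetFile("records.parquet")            // columnar, compressed on disk

    val parquetRecords = sqlContext.parquetFile("records.parquet")
    parquetRecords.registerAsTable("records")               // registerAsTable in the 1.0 API
    sqlContext.sql("SELECT COUNT(*) FROM records").collect().foreach(println)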

Re: Problems running Spark job on mesos in fine-grained mode

2014-06-23 Thread Fedechicco
I'm getting the same behavior, and it's crucial I get it fixed for an evaluation of Spark + Mesos within my company. I'm bumping +1 for the request to put this fix in 1.0.1 if possible! Thanks, Federico 2014-06-20 20:51 GMT+02:00 Sébastien Rainville : > Hi, > > this is just a follow-up

Multiclass classification evaluation measures

2014-06-23 Thread Ulanov, Alexander
Hi, I've implemented a class with measures for evaluation of multiclass classification (as well as unit tests). They are per-class and averaged Precision, Recall, and F1-measure. As far as I know, Spark has only a binary classification evaluator, given that Spark's Bayesian classifier sup
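As a rough illustration only (this is the standard textbook definition of the per-class measures, not the poster's implementation), one way to compute them over an RDD of (prediction, label) pairs:

    import org.apache.spark.rdd.RDD

    // Per-class precision, recall and F1 from (prediction, label) pairs.
    // Note: one filter/count pass per class and metric, so this is a sketch, not tuned code.
    def perClassMeasures(predAndLabels: RDD[(Double, Double)]): Array[(Double, (Double, Double, Double))] = {
      val classes = predAndLabels.map(_._2).distinct().collect()
      classes.map { c =>
        val tp = predAndLabels.filter { case (p, l) => p == c && l == c }.count().toDouble
        val fp = predAndLabels.filter { case (p, l) => p == c && l != c }.count().toDouble
        val fn = predAndLabels.filter { case (p, l) => p != c && l == c }.count().toDouble
        val precision = if (tp + fp == 0) 0.0 else tp / (tp + fp)
        val recall    = if (tp + fn == 0) 0.0 else tp / (tp + fn)
        val f1 = if (precision + recall == 0) 0.0 else 2 * precision * recall / (precision + recall)
        (c, (precision, recall, f1))
      }
    }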

Re: implicit ALS dataSet

2014-06-23 Thread redocpot
Hi, The real-world dataset is a bit larger, so I tested on the MovieLens data set, and found the same results: alpha lambda rank top1 top5 EPR_in EPR_out 40 0.001 50 297 559 0.05855

Re: Need help. Spark + Accumulo => Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread anoldbrain
found a workaround by adding "SPARK_CLASSPATH=.../commons-codec-xxx.jar" to spark-env.sh -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-Spark-Accumulo-Error-java-lang-NoSuchMethodError-org-apache-commons-codec-binary-Base64-eng-tp7667p8117.html S

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Sun, Jun 22, 2014 at 5:53 PM, Debasish Das wrote: > 600s for Spark vs 5s for Redshift...The numbers look much different from > the amplab benchmark... > > https://amplab.cs.berkeley.edu/benchmark/ > > Is it like SSDs or something that's helping redshift or the whole data is > in memory when yo

Re: Shark vs Impala

2014-06-23 Thread Toby Douglass
On Mon, Jun 23, 2014 at 8:50 AM, Aaron Davidson wrote: > Note that regarding a "long load time", data format means a whole lot in > terms of query performance. If you load all your data into compressed, > columnar Parquet files on local hardware, Spark SQL would also perform far, > far better tha

Re: Serialization problem in Spark

2014-06-23 Thread rrussell25
Thanks for pointer...tried Kryo and ran into a strange error: org.apache.spark.SparkException: Job aborted due to stage failure: Exception while deserializing and fetching task: com.esotericsoftware.kryo.KryoException: Unable to find class: rg.apache.hadoop.hbase.io.ImmutableBytesWritable It is s
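If the goal is simply to register the HBase writable with Kryo, a sketch of the Spark 1.x registrator mechanism follows; the registrator class name is an illustrative assumption, and this does not by itself explain the mangled "rg.apache..." class name in the error above:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable

    // Registers the HBase class with Kryo; must be on the executor classpath.
    class HBaseRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo) {
        kryo.register(classOf[ImmutableBytesWritable])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "HBaseRegistrator")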

Re: pyspark-Failed to run first

2014-06-23 Thread angel2014
I've got the same problem trying to execute the following scriptlet from my Eclipse environment: v = sc.textFile("path_to_my_file") print v.take(1) File "my_script.py", line 18, in print v.take(1) File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take iterator =

Help with object access from mapper (simple question)

2014-06-23 Thread Yana Kadiyska
Hi folks, hoping someone can explain to me what's going on: I have the following code, largely based on RecoverableNetworkWordCount example ( https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala ): I am setting f

Re: Need help. Spark + Accumulo => Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread Jianshi Huang
Thanks, I solved it by recompiling Spark (I think it's the preferred way). But I agree that the official Spark build for hadoop2 needs to be compiled with newer libs. Jianshi On Mon, Jun 23, 2014 at 7:41 PM, anoldbrain wrote: > found a workaround by adding "SPARK_CLASSPATH=.../commons-codec-xxx.jar" to >

Re: pyspark-Failed to run first

2014-06-23 Thread Congrui Yi
So it does not work for files on HDFS either? That is really a problem. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-Failed-to-run-first-tp7691p8128.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Help with object access from mapper (simple question)

2014-06-23 Thread Marcelo Vanzin
"object" in Scala is similar to a class with only static fields / methods in Java. So when you set its fields in the driver, the "object" does not get serialized and sent to the executors; they have their own copy of the class and its static fields, which haven't been initialized. Use a proper cla

Basic Scala and Spark questions

2014-06-23 Thread Sameer Tilak
Hi All, I am new to Scala and Spark. I have a basic question. I have the following import statements in my Scala program. I want to pass my function (printScore) to Spark. It will compare a string import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import o

Re: hi

2014-06-23 Thread Andrew Or
Hm, spark://localhost:7077 should work, because the standalone master binds to 0.0.0.0. Are you sure you ran `sbin/start-master.sh`? 2014-06-22 22:50 GMT-07:00 Akhil Das : > Open your webUI in the browser and see the spark url in the top left > corner of the page and use it while starting your s

about a JavaWordCount example with spark-core_2.10-1.0.0.jar

2014-06-23 Thread Alonso Isidoro Roman
Hi all, I am new to Spark, so this is probably a basic question. I want to explore the possibilities of this framework, concretely using it in conjunction with third-party libs, like MongoDB, for example. I have been following the instructions from http://codeforhire.com/2014/02/18/using-spark-with-mongodb/ in or

RE: Basic Scala and Spark questions

2014-06-23 Thread Sameer Tilak
Hi All, I was able to solve both of these issues. Thanks! Just FYI: For 1: import org.apache.spark.rdd; import org.apache.spark.rdd.RDD; For 2: rdd.map(x => jc_.score(str1, new StringWrapper(x))) From: ssti...@live.com To: u...@spark.incubator.apache.org Subject: Basic Scala

Error in run spark.ContextCleaner under Spark 1.0.0

2014-06-23 Thread Haoming Zhang
Hi all, I tried to run a simple Spark Streaming program with sbt. It compiled correctly, but when I run the program I get an error: "ERROR spark.ContextCleaner: Error in cleaning thread" I'm not sure whether this is a bug, because I can get the running result as I expected,

Re: Help with object access from mapper (simple question)

2014-06-23 Thread Yana Kadiyska
Thank you so much! I was trying for a singleton and opted against a class but clearly this backfired. Clearly time to revisit Scala lessons. Thanks again On Mon, Jun 23, 2014 at 1:16 PM, Marcelo Vanzin wrote: > "object" in Scala is similar to a class with only static fields / > methods in Java.

Re: about a JavaWordCount example with spark-core_2.10-1.0.0.jar

2014-06-23 Thread Yana Kadiyska
One thing I noticed around the place where you get the first error -- you are calling words.map instead of words.mapToPair. map produces JavaRDD whereas mapToPair gives you a JavaPairRDD. I don't use the Java APIs myself but it looks to me like you need to check the types more carefully. On Mon, J

RE: HDFS folder .sparkStaging not deleted and filled up HDFS in yarn mode

2014-06-23 Thread Andrew Lee
I checked the source code; it looks like it was re-added based on JIRA SPARK-1588, but I don't know if there's any test case associated with this. SPARK-1588. Restore SPARK_YARN_USER_ENV and SPARK_JAVA_OPTS for YARN. Sandy Ryza 2014-04-29 12:54:02 -0700 Commit: 5f48721, git

Re: pyspark regression results way off

2014-06-23 Thread frol
Here is my conversation about the same issue with regression methods: https://issues.apache.org/jira/browse/SPARK-1859 -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-regression-results-way-off-tp7672p8139.html Sent from the Apache Spark User List ma

Re: Need help. Spark + Accumulo => Error: java.lang.NoSuchMethodError: org.apache.commons.codec.binary.Base64.encodeBase64String

2014-06-23 Thread anoldbrain
Assuming "this should not happen", I don't want to have to keep building a custom version of spark for every new release, thus preferring the workaround. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Need-help-Spark-Accumulo-Error-java-lang-NoSuchMethodErr

Re: Error in run spark.ContextCleaner under Spark 1.0.0

2014-06-23 Thread Andrew Or
Hi Haoming, You can safely disregard this error. This is printed at the end of the execution when we clean up and kill the daemon context cleaning thread. In the future it would be good to silence this particular message, as it may be confusing to users. Andrew 2014-06-23 12:13 GMT-07:00 Haomin

How to use K-fold validation in spark-1.0?

2014-06-23 Thread holdingonrobin
Hello, I noticed there are some discussions about adding K-fold validation to MLlib and believe it should be in Spark 1.0 now. However, there isn't any documentation or example of how to use it in training. While I am reading the code to find out, has anyone used it successful
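If MLUtils.kFold is present in your MLlib build (its availability and exact signature in 1.0 are assumptions here), usage looks roughly like the following; the toy RDD and fold count are placeholders:

    import org.apache.spark.mllib.util.MLUtils

    val data = sc.parallelize(1 to 100)
    // Returns an array of (training, validation) RDD pairs, one per fold.
    val folds = MLUtils.kFold(data, 5, 42)
    folds.zipWithIndex.foreach { case ((training, validation), i) =>
      println("fold " + i + ": train=" + training.count() + " validation=" + validation.count())
    }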

Re: how to make saveAsTextFile NOT split output into multiple file?

2014-06-23 Thread holdingonrobin
I used some standard Java IO libraries to write files directly to the cluster. It is a little bit trivial tho: val sc = getSparkContext val hadoopConf = SparkHadoopUtil.get.newConfiguration val hdfsPath = "hdfs://your/path" val fs = FileSystem.get(hadoopConf) val path
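The snippet above is cut off by the archive; a generic sketch of writing a single file through the Hadoop FileSystem API (not the poster's exact code, and the path is a placeholder resolved against the default filesystem) might look like:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val hadoopConf = new Configuration()
    val fs = FileSystem.get(hadoopConf)
    // Create one output file and write to it from the driver.
    val out = fs.create(new Path("/tmp/single-output.txt"))
    try {
      out.write("all output in one file, written from the driver\n".getBytes("UTF-8"))
    } finally {
      out.close()
    }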

Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-23 Thread Aaron Dossett
I am relatively new to Spark and am getting stuck trying to do the following: - My input is integer key, value pairs where the key is not unique. I'm interested in information about all possible distinct key combinations, thus the Cartesian product. - My first attempt was to create a separate RDD

Re: Efficiently doing an analysis with Cartesian product (pyspark)

2014-06-23 Thread Aaron
Sorry, I got my sample outputs wrong: (1,1) -> 400 (1,2) -> 500 (2,2) -> 600 On Jun 23, 2014, at 4:29 PM, "Aaron Dossett [via Apache Spark User List]" wrote: I am relatively new to Spark and am getting stuck trying to do the following: - My input

Run Spark on Mesos? Add yourself to the #PoweredByMesos list

2014-06-23 Thread Dave Lester
Hi All, It's great to see a growing number of companies Powered By Spark! If you're running Spark on Apache Mesos, drop me a line or post to the u...@mesos.apache.org list and we'll also be happy to add

balancing RDDs

2014-06-23 Thread Sean McNamara
We have a use case where we’d like something to execute once on each node and I thought it would be good to ask here. Currently we achieve this by setting the parallelism to the number of nodes and using a mod partitioner: val balancedRdd = sc.parallelize( (0 until Settings.parallelism)
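A hedged sketch of the mod-partitioner approach described above; `parallelism` standing in for the number of nodes is an assumption, and one-task-per-node only holds if the scheduler actually spreads the partitions across distinct executors:

    import org.apache.spark.Partitioner

    // Routes integer keys to partition (key % partitions).
    class ModPartitioner(val partitions: Int) extends Partitioner {
      def numPartitions: Int = partitions
      def getPartition(key: Any): Int = key.asInstanceOf[Int] % partitions
    }

    val parallelism = 4  // e.g. the number of worker nodes (assumption)
    val balancedRdd = sc.parallelize(0 until parallelism)
      .map(i => (i, i))
      .partitionBy(new ModPartitioner(parallelism))
      .cache()

    balancedRdd.foreachPartition { _ =>
      // runs once per partition, i.e. roughly once per node under the assumption above
    }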

Re: hi

2014-06-23 Thread Andrew Or
Ah never mind. The 0.0.0.0 is for the UI, not for Master, which uses the output of the "hostname" command. But yes, long answer short, go to the web UI and use that URL. 2014-06-23 11:13 GMT-07:00 Andrew Or : > Hm, spark://localhost:7077 should work, because the standalone master > binds to 0.0.
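Once you have the URL from the web UI, pointing an application (or spark-shell --master) at it is just the master setting; the hostname below is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("my-app")
      .setMaster("spark://your-master-hostname:7077")  // copy this URL from the master web UI
    val sc = new SparkContext(conf)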

Error when running unit tests

2014-06-23 Thread SK
I am using Spark 1.0.0. I am able to successfully run "sbt package". However, when I run "sbt test" or "sbt test-only ", I get the following error: [error] error while loading , zip file is empty scala.reflect.internal.MissingRequirementError: object scala.runtime in compiler mirror not found. T

Bug in Spark REPL

2014-06-23 Thread Shivani Rao
I have two jars with the following packages: package a.b.c.d.z found in jar1, package a.b.e found in jar2. In the Scala REPL (no Spark) both imports work just fine, but in the Spark REPL, I found that import a.b.c.d.z gives me the following error: object "c" is not a member of package a.b Has a

Re: Spark 0.9.1 java.lang.outOfMemoryError: Java Heap Space

2014-06-23 Thread Shivani Rao
Hello Eugene, Thanks for your patience and answers. The issue was that one of the third-party libraries was not built with "sbt assembly" but just packaged as "sbt package". So it did not contain all the source dependencies. Thanks for all your help Shivani On Fri, Jun 20, 2014 at 1:46 PM, Eug

DAGScheduler: Failed to run foreach

2014-06-23 Thread Sameer Tilak
Hi All, I am using Spark for text analysis. I have a source file that has a few thousand sentences and a dataset of tens of millions of statements. I want to compare each statement from the sourceFile with each statement from the dataset and generate a score. I am having the following problem. I would

Re: Bug in Spark REPL

2014-06-23 Thread Shivani Rao
Actually I figured it out. The problem was that I was loading the "sbt package"-ed jar into the classpath and not the "sbt assembly"-ed jar. Once I put the right jar in for package a.b.c.d.z everything worked. Thanks, Shivani On Mon, Jun 23, 2014 at 4:38 PM, Shivani Rao wrote: > I have

RE: DAGScheduler: Failed to run foreach

2014-06-23 Thread Sameer Tilak
The subject should be: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: and not DAGScheduler: Failed to run foreach If I call printScoreCanndedString with a hard-coded string and identical 2nd parameter, it works fine.

Re: DAGScheduler: Failed to run foreach

2014-06-23 Thread Aaron Davidson
Please note that this: for (sentence <- sourcerdd) { ... } is actually Scala syntactic sugar which is converted into sourcerdd.foreach { sentence => ... } What this means is that this will actually run on the cluster, which is probably not what you want if you're trying to print them. Try t
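A short sketch of the contrast described above, using the thread's sourcerdd name; the take(20) limit is just an example:

    // This runs on the executors, so the println output lands in executor logs,
    // not on the driver's console:
    sourcerdd.foreach(sentence => println(sentence))

    // To print on the driver, bring the (small) data back first:
    sourcerdd.collect().foreach(println)        // or sourcerdd.take(20).foreach(println)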

apache spark 1.0.0 sha1 & md5 checksum fails

2014-06-23 Thread MrAsanjar .
source: http://www.apache.org/dist/spark/spark-1.0.0/ It looks like the spark-1.0.0-bin-hadoop2.tgz.sha content is not valid. Am I using the wrong checksum tool? ==> sha1sum spark-1.0.0-bin-hadoop2.tgz 804fe9a0caff941fb791e15ea1cda45a7c2b7608 spark-1.0.0-bin-hadoop2.tgz ===> cat spark-1.0.0-bin-hadoop2.t

which function can generate a ShuffleMapTask

2014-06-23 Thread lihu
I see that a task will either be a ShuffleMapTask or a ResultTask. I wonder which function generates a ShuffleMapTask and which generates a ResultTask?

How to Reload Spark Configuration Files

2014-06-23 Thread Sirisha Devineni
Hi All, I am working with Spark to add new slaves automatically when there is more data to be processed by the cluster. During this process a question arose: after adding/removing a slave node to/from the Spark cluster, do we need to restart the master and the other existing slaves in the clus

How data is distributed while processing in spark cluster?

2014-06-23 Thread srujana
Hi, I am working on an auto-scaling Spark cluster. I would like to know in detail how the master distributes data to the slaves for processing. Any information on this would be helpful. Thanks, Srujana -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-dat

Questions regarding different spark pre-built packages

2014-06-23 Thread Sourav Chandra
Hi, I am just curious to know what the differences are between the prebuilt packages for Hadoop 1, 2, CDH, etc. I am using a Spark standalone cluster and we don't use Hadoop at all. Can we use any one of the pre-built packages, or do we have to run the make-distribution.sh script from the code? Thanks, --

Re: apache spark 1.0.0 sha1 & md5 checksum fails

2014-06-23 Thread Sean Owen
I ran into this before. The algorithm is SHA-512, not SHA-1. On OS X, for example, try: shasum -a 512 spark-1.0.0-bin-hadoop2.tgz ... and you will get the right answer. The .sha file is not quite formatted in the way shasum expects to read it, I find. It expects c1cb554194660b154536ad32f50908204