Re: Installation of Spark

2014-05-14 Thread Madhu
I've forgotten most of my French. You can download a Spark binary or build from source. This is how I build from source: download and install sbt (http://www.scala-sbt.org/); I installed it in C:\sbt. Check C:\sbt\conf\sbtconfig.txt and use these options: -Xmx512M -XX:MaxPermSize=256m -XX:Reserv
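Once sbt is installed, a minimal sketch of the build step itself (assuming a Spark source checkout; the exact invocation can vary by Spark version):

    # run from the root of the Spark source tree; produces the assembly jar
    sbt assembly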

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-14 Thread Benjamin
Hi Gerard, thank you for your feedback. On Mon, May 5, 2014 at 11:17 PM, Gerard Maas wrote: > Hi Benjamin, > > Yes, we initially used a modified version of the AmpLabs docker scripts > [1]. The amplab docker images are a good starting point. > One of the biggest hurdles has been HDFS, which re

Express VMs - good idea?

2014-05-14 Thread Marco Shaw
Hi, I've wanted to play with Spark. I wanted to fast-track things and just use one of the vendors' "express VMs". I've tried Cloudera CDH 5.0 and Hortonworks HDP 2.1. I've not written down all of my issues, but for certain, when I try to run spark-shell it doesn't work. Cloudera seems to crash

Re: Unable to load native-hadoop library problem

2014-05-14 Thread Shivani Rao
Hello Sophia, You are only providing the Spark jar here (nevertheless a Spark jar that contains Hadoop libraries in it, but that is not sufficient). Where is your Hadoop installed? (Most probably /usr/lib/hadoop/*.) You need to add that to your classpath (by using -cp), I guess. Let me know if

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-14 Thread Siyuan he
Hi Cheney, Which mode are you running, YARN or standalone? I got the same exception when I ran Spark on YARN. On Tue, May 6, 2014 at 10:06 PM, Cheney Sun wrote: > Hi Nan, > > In the worker's log, I see the following exception thrown when trying to launch > an executor. (The SPARK_HOME is wrongly specif

Average of each RDD in Stream

2014-05-14 Thread Laeeq Ahmed
Hi, I use the following code for calculating an average. The problem is that the reduce operation returns a DStream here, and not a tuple as it normally does without streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple?     val numbers = ssc.textFileStream(a
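A minimal sketch of one way to do this, assuming the goal is a per-batch average: reduce over (sum, count) pairs instead of the raw values, then divide (the stream and names here are illustrative, not the original code):

    import org.apache.spark.streaming.dstream.DStream

    // reduce on a DStream yields a one-element DStream per batch, so carry
    // (sum, count) through the reduce and extract the average with a map
    def average(numbers: DStream[Double]): DStream[Double] =
      numbers.map(x => (x, 1L))
        .reduce { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
        .map { case (sum, count) => sum / count }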

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-14 Thread Gerard Maas
Hi Jacob, Thanks for the help & answer on the docker question. Have you already experimented with the new link feature in Docker? That does not help the HDFS issue as the DataNode needs the namenode and vice-versa but it does facilitate simpler client-server interactions. My issue described at th

spark on yarn-standalone throws StackOverflowError, sometimes failing and sometimes succeeding

2014-05-14 Thread phoenix bai
Hi all, My Spark code is running on yarn-standalone. The last three lines of the code are as below:
val result = model.predict(prdctpairs)
result.map(x => x.user+","+x.product+","+x.rating).saveAsTextFile(output)
sc.stop()
The same code is sometimes able to run successfully and could g

Re: How to use spark-submit

2014-05-14 Thread phoenix bai
I used spark-submit to run the MovieLensALS example from the examples module. Here is the command: $spark-submit --master local /home/phoenix/spark/spark-dev/examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.mllib.MovieLensALS u.data also,
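One thing worth checking (an assumption about the failure, since the message is truncated): spark-submit treats everything after the application jar as arguments to the application itself, so --class has to come before the jar. A sketch with the same paths:

    spark-submit \
      --class org.apache.spark.examples.mllib.MovieLensALS \
      --master local \
      /home/phoenix/spark/spark-dev/examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar \
      u.data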

No configuration setting found for key 'akka.zeromq'

2014-05-14 Thread Francis . Hu
Hi all, When I run the ZeroMQWordCount example on the cluster, the worker log says: Caused by: com.typesafe.config.ConfigException$Missing: No configuration setting found for key 'akka.zeromq' Actually, I can see that the reference.conf in spark-examples-assembly-0.9.1.jar contains the below configura

NotSerializableException in Spark Streaming

2014-05-14 Thread Diana Carroll
Hey all, trying to set up a pretty simple streaming app and getting some weird behavior. First, a non-streaming job that works fine: I'm trying to pull out lines of a log file that match a regex, for which I've set up a function: def getRequestDoc(s: String): String = { "KBDOC-[0-9]*".r.find
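The message is truncated, but a hedged sketch of a common shape for this kind of setup (findFirstIn is assumed as the intended call; it is not in the original text): defining the matcher in a top-level serializable object avoids capturing an enclosing non-serializable instance in the closure.

    object RequestDocExtractor extends Serializable {
      private val pattern = "KBDOC-[0-9]*".r
      // returns the first KBDOC id in the line, or "" if none is found
      def getRequestDoc(s: String): String = pattern.findFirstIn(s).getOrElse("")
    }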

Re: ERROR: Unknown Spark version

2014-05-14 Thread wxhsdp
I've tried 0.9.0 and it's OK; is v1.0.0 too new for EC2?

Re: log4j question

2014-05-14 Thread Andrew Or
What do you mean it cannot work? Did you copy log4j.properties.template to a new file called log4j.properties? If you're running a standalone cluster, the logs should be in the $SPARK_HOME/logs directory. On Tue, May 6, 2014 at 8:10 PM, Sophia wrote: > I have tryed to see the log,but the log4

Re: log4j question

2014-05-14 Thread Sophia
I have tried to see the log, but the log4j.properties does not take effect. What should I do?

Understanding epsilon in KMeans

2014-05-14 Thread Stuti Awasthi
Hi All, I wanted to understand the functionality of epsilon in KMeans in Spark MLlib. As per the documentation: the distance threshold within which we consider centers to have converged; if all centers move less than this Euclidean distance, we stop iterating one run. Now I have assumed that if cent
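A conceptual sketch of the convergence test as that documentation describes it (not MLlib's exact implementation; names and types are illustrative):

    // Euclidean distance between two centers
    def dist(a: Array[Double], b: Array[Double]): Double =
      math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

    // a run stops iterating once every center has moved less than epsilon
    def hasConverged(oldCenters: Seq[Array[Double]],
                     newCenters: Seq[Array[Double]],
                     epsilon: Double): Boolean =
      oldCenters.zip(newCenters).forall { case (o, n) => dist(o, n) < epsilon }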

Serializable different behavior Spark Shell vs. Scala Shell

2014-05-14 Thread Michael Malak
I'm seeing different Serializable behavior in the Spark shell vs. the Scala shell. In the Spark shell, equals() fails when I use the canonical equals() pattern of match{}, but works when I substitute isInstanceOf[]. I am using Spark 0.9.0/Scala 2.10.3. Is this a bug? Spark Shell (equals uses match
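For concreteness, a sketch of the two styles being compared (class and field names are illustrative, not the original code):

    class Doc(val id: String) extends Serializable {
      // canonical pattern-match style, reported to fail in the Spark shell
      override def equals(other: Any): Boolean = other match {
        case that: Doc => that.id == id
        case _         => false
      }
      // isInstanceOf style, reported to work in the Spark shell
      def equalsAlt(other: Any): Boolean =
        other.isInstanceOf[Doc] && other.asInstanceOf[Doc].id == id
      override def hashCode: Int = id.hashCode
    }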

Re: Packaging a spark job using maven

2014-05-14 Thread François Le Lay
I have a similar objective of using Maven as our build tool and ran into the same issue. The idea is that your config file is actually not found: your fat-jar assembly does not contain the reference.conf resource. I added the following to the relevant section of my pom to make it work: src/main/resources
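The XML tags were stripped from the message above. The usual Maven approach for this is the shade plugin's AppendingTransformer, which concatenates the reference.conf files from all dependencies instead of keeping only one; a sketch (plugin version omitted, and this may differ from François's exact pom):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
          <configuration>
            <transformers>
              <!-- merge reference.conf files rather than letting one win -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                <resource>reference.conf</resource>
              </transformer>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>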

little confused about SPARK_JAVA_OPTS alternatives

2014-05-14 Thread Koert Kuipers
I have some settings that I think are relevant for my application. They are spark.akka settings, so I assume they are relevant for both the executors and my driver program. I used to do: SPARK_JAVA_OPTS="-Dspark.akka.frameSize=1" Now this is deprecated. The alternatives mentioned are: * some spark
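For the application-level spark.* settings, a sketch of the SparkConf alternative (the app name and frameSize value here are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // programmatic replacement for -Dspark.* flags in SPARK_JAVA_OPTS
    val conf = new SparkConf()
      .setAppName("MyApp")
      .set("spark.akka.frameSize", "10")
    val sc = new SparkContext(conf)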

Re: Spark unit testing best practices

2014-05-14 Thread Philip Ogren
Have you actually found this to be true? I have found Spark local mode to be quite good about blowing up if there is something non-serializable, and so my unit tests have been great for detecting this. I have never seen something that worked in local mode that didn't work on the cluster becaus

Re: Spark unit testing best practices

2014-05-14 Thread Andrew Ash
There's an undocumented mode that looks like it simulates a cluster. SparkContext.scala:
// Regular expression for simulating a Spark cluster of [N, cores, memory] locally
val LOCAL_CLUSTER_REGEX = """local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
Can you try running your t
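A sketch of what that master URL looks like in use, per the regex above (the arguments are [numWorkers, coresPerWorker, memoryPerWorkerMB]; values illustrative):

    import org.apache.spark.SparkContext

    // spawns 2 worker processes with 1 core and 512 MB each, so data crosses
    // process boundaries and serialization problems surface as on a cluster
    val sc = new SparkContext("local-cluster[2,1,512]", "serialization-test")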

Worker re-spawn and dynamic node joining

2014-05-14 Thread Han JU
Hi all, Just 2 questions: 1. Is there a way to automatically re-spawn Spark workers? We have situations where an executor OOM causes the worker process to be DEAD, and it does not come back automatically. 2. How do we dynamically add (or remove) worker machines to (from) the cluster? We'd like to le
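On question 2, for standalone mode, a sketch of manually attaching a new worker to a running master (host and port are illustrative):

    # run on the new machine, pointing at the existing master
    ./bin/spark-class org.apache.spark.deploy.worker.Worker spark://masterhost:7077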

Packaging a spark job using maven

2014-05-14 Thread Laurent Thoulon
(I've never actually received my previous mail, so I'm resending it. Sorry if it creates a duplicate.) Hi, I'm quite new to Spark (and Scala), but has anyone ever successfully compiled and run a Spark job using Java and Maven? Packaging seems to go fine, but when I try to execute the job u

problem with hdfs access in spark job

2014-05-14 Thread Marcin Cylke
Hi, I'm running Spark 0.9.1 on a Hadoop cluster (CDH 4.2.1) with YARN. I have a job that performs a few transformations on a given file and joins that file with another. The job itself finishes with success; however, some tasks fail and then succeed after a rerun. During the development

Re: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Debasish Das
You will get a 10x speedup by not using the Mahout vector and using the Breeze sparse vector from MLlib in your MLlib KMeans run; @Xiangrui showed the comparison chart some time back... On May 14, 2014 6:33 AM, "Xiangrui Meng" wrote: > You need > > > val raw = sc.sequenceFile(path, classOf[Text], classOf[
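The quoted snippet is truncated; a hedged sketch of what reading a Mahout-format sequence file typically looks like (the path is illustrative, and mahout-math must be on the classpath):

    import org.apache.hadoop.io.Text
    import org.apache.mahout.math.VectorWritable

    // key/value classes must match what Mahout wrote into the sequence file
    val raw = sc.sequenceFile("/path/to/vectors", classOf[Text], classOf[VectorWritable])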

saveAsTextFile with replication factor in HDFS

2014-05-14 Thread Sai Prasanna
Hi, Can we override the default file-replication factor while using saveAsTextFile() to HDFS? My default replication factor is >1, but intermediate files that I want to put in HDFS while running a Spark query need not be replicated, so is there a way? Thanks!
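One possible approach (an untested sketch, not a confirmed answer to the thread): dfs.replication is decided by the writing client per file, so lowering it on the Hadoop configuration the job uses should affect the files Spark writes (`rdd` and the path stand for the data being saved):

    // lower the replication factor for files this SparkContext writes
    sc.hadoopConfiguration.set("dfs.replication", "1")
    rdd.saveAsTextFile("hdfs:///tmp/intermediate")  // path illustrative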

RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
Hi Xiangrui, Thanks for the response. I tried a few ways to include the mahout-math jar while launching the Spark shell, but with no success. Can you please point out what I am doing wrong? 1. mahout-math.jar exported in CLASSPATH and PATH. 2. Tried launching the Spark shell by: MASTER=spark://: ADD_JARS=~/insta
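For comparison, a sketch of that shell invocation with the placeholders filled in (host, port, and jar path are illustrative; ADD_JARS ships the jar to the shell's SparkContext, while SPARK_CLASSPATH puts it on the driver classpath):

    MASTER=spark://masterhost:7077 \
    ADD_JARS=/path/to/mahout-math-0.9.jar \
    SPARK_CLASSPATH=/path/to/mahout-math-0.9.jar \
    ./bin/spark-shell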

Re: Spark LIBLINEAR

2014-05-14 Thread Debasish Das
Hi Professor Lin, On our internal datasets, I am getting accuracy on par with glmnet-R for sparse feature selection from liblinear. The default MLlib-based gradient descent was way off. I did not tune the learning rate, but I ran with varying lambda; the feature selection was weak. I used liblinear c

Re: Packaging a spark job using maven

2014-05-14 Thread Laurent T
Hi, Thanks François, but this didn't change much. I'm not even sure what this reference.conf is; it isn't mentioned anywhere in the Spark documentation. Should I have one in my resources? Thanks, Laurent

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-14 Thread wxhsdp
Hi DB, I've added the Breeze jars to the workers using sc.addJar(). The Breeze jars include:
breeze-natives_2.10-0.7.jar
breeze-macros_2.10-0.3.jar
breeze-macros_2.10-0.3.1.jar
breeze_2.10-0.8-SNAPSHOT.jar
breeze_2.10-0.7.jar
almost all the jars related to Breeze that I can find, but still NoSuchMethodErr

Re: How to run shark?

2014-05-14 Thread Mayur Rustagi
Is your Spark working? Can you try running the Spark shell? http://spark.apache.org/docs/0.9.1/quick-start.html If Spark is working, we can move this to the Shark user list (copied here). Also, I am anything but a sir :) Regards, Mayur Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @may

Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread lalit1303
If we do cache() + count() after, say, every 50 iterations, the whole process becomes very slow. I have tried checkpoint(), cache() + count(), and saveAsObjectFiles(); nothing works. Materializing RDDs leads to a drastic decrease in performance, and if we don't materialize, we face a StackOverflowError. On W

Re: java.lang.StackOverflowError when calling count()

2014-05-14 Thread Nicholas Chammas
Would cache() + count() every N iterations work just as well as checkpoint() + count() to get around this issue? We're basically trying to get Spark to avoid working on too lengthy a lineage at once, right? Nick On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng wrote: > After checkpoint, call c
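A sketch of the pattern under discussion, assuming an existing SparkContext sc (the transformation, interval, and path are illustrative; checkpoint() only takes effect when the RDD is subsequently materialized):

    import org.apache.spark.rdd.RDD

    // stand-in for one iteration's transformation
    def step(r: RDD[Int]): RDD[Int] = r.map(_ + 1)

    sc.setCheckpointDir("/tmp/spark-checkpoints")  // path illustrative
    var rdd: RDD[Int] = sc.parallelize(1 to 1000)
    for (i <- 1 to 200) {
      rdd = step(rdd).cache()
      if (i % 50 == 0) {
        rdd.checkpoint()  // schedules lineage truncation
        rdd.count()       // forces materialization so the checkpoint runs
      }
    }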

Re: Spark to utilize HDFS's mmap caching

2014-05-14 Thread Sandy Ryza
It's worth mentioning that leveraging HDFS caching in Spark doesn't work smoothly out of the box right now. By default, cached files in HDFS will have 3 on-disk replicas and only one of these will be an in-memory replica. In its scheduling, Spark will prefer all equally, meaning that, even when r

Re: Turn BLAS on MacOSX

2014-05-14 Thread wxhsdp
Hi Xiangrui, I compiled OpenBLAS on an EC2 m1.large. When Breeze calls the native lib, an error occurs:
INFO: successfully loaded /mnt2/wxhsdp/libopenblas/lib/libopenblas_nehalemp-r0.2.9.rc2.so
[error] (run-main-0) java.lang.UnsatisfiedLinkError: com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread DB Tsai
Hi Xiangrui, I actually used `yarn-standalone`; sorry for the confusion. I did some debugging over the last couple of days, and everything up to updateDependency in executor.scala works. I also checked the file size and md5sum in the executors, and they are the same as the ones on the driver. Gonna do more testing

Re: logging in pyspark

2014-05-14 Thread Diana Carroll
foreach vs. map isn't the issue. Both require serializing the called function, so the pickle error would still apply, yes? And at the moment, I'm just testing. Definitely wouldn't want to log something for each element, but may want to detect something and log for SOME elements. So my question

Proper way to create standalone app with custom Spark version

2014-05-14 Thread Andrei
We can create a standalone Spark application by simply adding "spark-core_2.x" to build.sbt/pom.xml and connecting it to a Spark master. We can also compile a custom version of Spark (e.g. compiled against Hadoop 2.x) from source and deploy it to the cluster manually. But what is a proper way to use _custo
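One common approach (a sketch, not necessarily the thread's eventual answer): publish the custom build to the local repository and depend on that version.

    # in the custom Spark checkout: install the artifacts to the local ivy repository
    sbt publish-local

The application's build.sbt then points at the locally published artifact, e.g. libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0-SNAPSHOT" (version illustrative).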

RE: How to use Mahout VectorWritable in Spark.

2014-05-14 Thread Stuti Awasthi
The issue of ":12: error: not found: type Text" is resolved by import statement.. But still facing issue with imports of VectorWritable. Mahout math jar is added to classpath as I can check on WebUI as well on shell scala> System.getenv res1: java.util.Map[String,String] = {TERM=xterm, JAVA_HOME

Re: spark+mesos: configure mesos 'callback' port?

2014-05-14 Thread Scott Clasen
It's not the port for the Mesos slave that I want to set; there is another port used for communicating between the Mesos master and the Spark tasks. Here are some example log lines. In this case, if port 56311 is not opened up via iptables and security groups, the detecting-new-master step will

accessing partition i+1 from mapper of partition i

2014-05-14 Thread Mohit Jaggi
Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence, for example (1, 2, 3, 5, 8, 11, ...); I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11). One way to do this is to "slide and zip": rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...
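A sketch of the idea using an index join rather than direct cross-partition access (assumes RDD.zipWithIndex, available in recent Spark; names illustrative):

    import org.apache.spark.SparkContext._  // pair-RDD functions for join

    val rdd1 = sc.parallelize(List(1L, 2L, 3L, 5L, 8L, 11L))
    val indexed = rdd1.zipWithIndex().map { case (v, i) => (i, v) }
    val shifted = indexed.map { case (i, v) => (i - 1, v) }
    // pair element i with element i+1, then emit every value in between
    val filled = indexed.join(shifted)
      .flatMap { case (_, (a, b)) => a to b }
      .distinct()  // shared endpoints of adjacent pairs appear twice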

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread Xiangrui Meng
I don't know whether this would fix the problem. In v0.9, you need `yarn-standalone` instead of `yarn-cluster`. See https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08 On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng wrote: > Does v0.9 support yarn-cluster mode? I che

Re: 1.0.0 Release Date?

2014-05-14 Thread Patrick Wendell
Hey Brian, We've had a fairly stable 1.0 branch for a while now. I started a vote on the dev list last night; voting can take some time, but it usually wraps up anywhere from a few days to a few weeks. However, you can get started right now with the release candidates. These are likely to be almost

EndpointWriter: AssociationError

2014-05-14 Thread Laurent Thoulon
Hi, I've been trying to run my newly created Spark job on my local master instead of just running it using Maven, and I haven't been able to make it work. My main issue seems to be related to this error: 14/05/14 09:34:26 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@devsrv:70

Re: How to run shark?

2014-05-14 Thread Sophia
My configuration is as follows; the slave node has been configured, but I do not know what has happened to Shark. Can you help me, Sir?
shark-env.sh:
export SPARK_USER_HOME=/root
export SPARK_MEM=2g
export SCALA_HOME="/root/scala-2.11.0-RC4"
export SHARK_MASTER_MEM=1g
export HIVE_CONF_DIR="/usr/

Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-14 Thread Xiangrui Meng
Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui On Mon, May 12, 2014 at 11:14 AM, DB Tsai wrote: > We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar > dependencies in command line with "-