I've forgotten most of my French.
You can download a Spark binary or build from source.
This is how I build from source:
Download and install sbt:
http://www.scala-sbt.org/
I installed in C:\sbt
Check C:\sbt\conf\sbtconfig.txt, use these options:
-Xmx512M
-XX:MaxPermSize=256m
-XX:Reserv
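With sbt installed, the build itself is then typically kicked off from the Spark
source directory, as a rough sketch (the path below is illustrative and the sbt
target can differ between Spark versions):
cd C:\projects\spark
sbt assembly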
Hi Gerard, thank you for your feedback.
On Mon, May 5, 2014 at 11:17 PM, Gerard Maas wrote:
> Hi Benjamin,
>
> Yes, we initially used a modified version of the AMPLab docker scripts
> [1]. The AMPLab Docker images are a good starting point.
> One of the biggest hurdles has been HDFS, which re
Hi,
I've been wanting to play with Spark. I wanted to fast-track things and just use
one of the vendors' "express VMs". I've tried Cloudera CDH 5.0 and
Hortonworks HDP 2.1.
I've not written down all of my issues, but for certain, when I try to run
spark-shell it doesn't work. Cloudera seems to crash
Hello Sophia
You are only providing the Spark jar here (admittedly, a Spark jar that
contains the Hadoop libraries in it, but that is not sufficient). Where is your
Hadoop installed? (Most probably: /usr/lib/hadoop/*)
So you need to add that to your classpath (using -cp), I guess. Let me
know if
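For illustration only, a launch command along those lines might look like this
(the jar names, main class, and Hadoop location are assumptions, not taken from
your setup):
java -cp "my-app.jar:spark-assembly.jar:/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*" com.example.MyApp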
Hi Cheney
Which mode are you running in? YARN or standalone?
I got the same exception when I ran Spark on YARN.
On Tue, May 6, 2014 at 10:06 PM, Cheney Sun wrote:
> Hi Nan,
>
> In the worker's log, I see the following exception thrown when trying to launch
> an executor. (The SPARK_HOME is wrongly specif
Hi,
I use the following code for calculating an average. The problem is that the
reduce operation returns a DStream here, and not a tuple as it normally does
without Streaming. So how can we get the sum and the count from the DStream?
Can we cast it to a tuple?
val numbers = ssc.textFileStream(a
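For what it's worth, here is a minimal sketch of one way to do it without any
casting, assuming each line of the stream holds a single number (the directory
name is a placeholder, since the original snippet is cut off): carry (sum, count)
through the reduce and derive the average per batch.
val numbers = ssc.textFileStream("/some/dir").map(_.toDouble)
val sumAndCount = numbers.map(x => (x, 1L)).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val average = sumAndCount.map { case (sum, count) => sum / count }
average.print()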
Hi Jacob,
Thanks for the help & answer on the docker question. Have you already
experimented with the new link feature in Docker? That does not help the
HDFS issue, as the DataNode needs the NameNode and vice versa, but it does
facilitate simpler client-server interactions.
My issue described at th
Hi all,
My Spark code is running in yarn-standalone mode.
The last three lines of the code are as below:
val result = model.predict(prdctpairs)
result.map(x =>
x.user+","+x.product+","+x.rating).saveAsTextFile(output)
sc.stop()
The same code sometimes is able to run successfully and could g
I used spark-submit to run the MovieLensALS example from the examples
module.
Here is the command:
$spark-submit --master local
/home/phoenix/spark/spark-dev/examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar
--class org.apache.spark.examples.mllib.MovieLensALS u.data
also,
Hi all,
When I run the ZeroMQWordCount example on a cluster, the worker log says: Caused
by: com.typesafe.config.ConfigException$Missing: No configuration setting
found for key 'akka.zeromq'
Actually, I can see that the reference.conf in
spark-examples-assembly-0.9.1.jar contains the below configura
Hey all, trying to set up a pretty simple streaming app and getting some
weird behavior.
First, a non-streaming job that works fine: I'm trying to pull out lines
of a log file that match a regex, for which I've set up a function:
def getRequestDoc(s: String):
String = { "KBDOC-[0-9]*".r.find
I've tried 0.9.0 and it's OK; is v1.0.0 too new for EC2?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/ERROR-Unknown-Spark-version-tp5500p5502.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
What do you mean it cannot work? Did you copy the log4j.properties.template
to a new file called log4j.properties? If you're running a standalone
cluster, the logs should be in the $SPARK_HOME/logs directory.
On Tue, May 6, 2014 at 8:10 PM, Sophia wrote:
> I have tried to see the log, but the log4
I have tried to see the log, but the log4j.properties doesn't work. What should I do?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/log4j-question-tp412p5471.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi All,
I wanted to understand the functionality of epsilon in KMeans in Spark MLlib.
As per the documentation:
"distance threshold within which we consider centers to have converged. If all
centers move less than this Euclidean distance, we stop iterating one run."
Now I have assumed that if cent
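To make the convergence test concrete, here is a rough sketch of the kind of
check epsilon controls (illustrative code only, not MLlib's actual
implementation; centers are plain arrays here):
def converged(oldCenters: Array[Array[Double]],
              newCenters: Array[Array[Double]],
              epsilon: Double): Boolean =
  oldCenters.zip(newCenters).forall { case (o, n) =>
    // Euclidean distance each center moved in this iteration
    val moved = math.sqrt(o.zip(n).map { case (a, b) => (a - b) * (a - b) }.sum)
    moved < epsilon
  }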
I'm seeing different Serializable behavior in Spark Shell vs. Scala Shell. In
the Spark Shell, equals() fails when I use the canonical equals() pattern of
match{}, but works when I substitute it with isInstanceOf[]. I am using Spark
0.9.0/Scala 2.10.3.
Is this a bug?
Spark Shell (equals uses match
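For reference, the two equals() styles being compared look roughly like this
(a sketch with made-up classes, since the original shell session is cut off
above):
class PointA(val x: Int) {
  // canonical pattern-match version
  override def equals(other: Any): Boolean = other match {
    case that: PointA => this.x == that.x
    case _            => false
  }
}
class PointB(val x: Int) {
  // isInstanceOf/asInstanceOf version
  override def equals(other: Any): Boolean =
    other.isInstanceOf[PointB] && other.asInstanceOf[PointB].x == x
}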
I have a similar objective of using Maven as our build tool and ran into the
same issue.
The issue is that your config file is actually not found: your fat-jar
assembly does not contain the reference.conf resource.
I added the following section to my pom to make it work:
src/main/resources
I have some settings that I think are relevant for my application. They are
spark.akka settings, so I assume they are relevant for both the executors and my
driver program.
I used to do:
SPARK_JAVA_OPTS="-Dspark.akka.frameSize=1"
Now this is deprecated. The alternatives mentioned are:
* some spark
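One of those alternatives, as a sketch (the app name and master below are
placeholders; the frameSize value just mirrors the old -D option), is to set it
programmatically on SparkConf:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("local[2]")
  .set("spark.akka.frameSize", "1")
val sc = new SparkContext(conf)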
Have you actually found this to be true? I have found Spark local mode
to be quite good about blowing up if there is something non-serializable
and so my unit tests have been great for detecting this. I have never
seen something that worked in local mode that didn't work on the cluster
becaus
There's an undocumented mode that looks like it simulates a cluster:
SparkContext.scala:
// Regular expression for simulating a Spark cluster of [N, cores,
memory] locally
val LOCAL_CLUSTER_REGEX =
"""local-cluster\[\s*([0-9]+)\s*,\s*([0-9]+)\s*,\s*([0-9]+)\s*]""".r
Can you try running your t
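To try it, the master URL takes the form local-cluster[numWorkers,coresPerWorker,memoryPerWorkerMB], for example (a sketch):
import org.apache.spark.SparkContext

// 2 workers with 1 core and 512 MB each, matching the regex quoted above
val sc = new SparkContext("local-cluster[2,1,512]", "lineage-test")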
Hi all,
Just 2 questions:
1. Is there a way to automatically re-spawn Spark workers? We've had
situations where an executor OOM causes the worker process to be marked DEAD, and it does
not come back automatically.
2. How can we dynamically add (or remove) worker machines to (from) the
cluster? We'd like to le
(I've never actually received my previous mail, so I'm resending it. Sorry if
it creates a duplicate.)
Hi,
I'm quite new to Spark (and Scala), but has anyone ever successfully compiled
and run a Spark job using Java and Maven?
Packaging seems to go fine, but when I try to execute the job u
Hi
I'm running Spark 0.9.1 on a Hadoop cluster (CDH 4.2.1) with YARN.
I have a job that performs a few transformations on a given file and joins
that file with another.
The job itself finishes successfully; however, some tasks fail and then
succeed after a rerun.
During the development
You will get a 10x speedup by not using the Mahout vector and instead using the
Breeze-backed sparse vector from MLlib in your MLlib KMeans run.
@Xiangrui showed the comparison chart some time back...
On May 14, 2014 6:33 AM, "Xiangrui Meng" wrote:
> You need
>
> > val raw = sc.sequenceFile(path, classOf[Text], classOf[
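For reference, an MLlib sparse vector (which is backed by Breeze internally)
can be built like this (a sketch with made-up size, indices, and values):
import org.apache.spark.mllib.linalg.Vectors

// a length-5 vector with 1.0 at index 0 and 4.0 at index 3
val v = Vectors.sparse(5, Array(0, 3), Array(1.0, 4.0))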
Hi,
Can we override the default file replication factor while using
saveAsTextFile() to HDFS?
My default replication factor is >1, but the intermediate files that I want to put in
HDFS while running a Spark query need not be replicated, so is there a way?
Thanks !
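One possible approach, sketched below under the assumption that the Hadoop
dfs.replication setting is what the output format consults when Spark writes
the files (the RDD name and output path are placeholders):
sc.hadoopConfiguration.set("dfs.replication", "1")
myRdd.saveAsTextFile("hdfs:///tmp/intermediate-output")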
Hi Xiangrui,
Thanks for the response. I tried a few ways to include the mahout-math jar while
launching the Spark shell, but with no success. Can you please point out what I am
doing wrong?
1. mahout-math.jar exported in CLASSPATH and PATH
2. Tried launching the Spark shell with: MASTER=spark://:
ADD_JARS=~/insta
Hi Professor Lin,
On our internal datasets, I am getting accuracy on par with glmnet-R for
sparse feature selection from liblinear. The default MLlib-based gradient
descent was way off. I did not tune the learning rate, but I ran with varying
lambda. The feature selection was weak.
I used liblinear c
Hi,
Thanks François, but this didn't change much. I'm not even sure what this
reference.conf is; it isn't mentioned anywhere in the Spark documentation. Should
I have one in my resources?
Thanks
Laurent
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Packaging-a
Hi DB,
I've added the Breeze jars to the workers using sc.addJar().
The Breeze jars include:
breeze-natives_2.10-0.7.jar
breeze-macros_2.10-0.3.jar
breeze-macros_2.10-0.3.1.jar
breeze_2.10-0.8-SNAPSHOT.jar
breeze_2.10-0.7.jar
almost all the Breeze jars I can find, but still NoSuchMethodErr
Is your Spark working? Can you try running the Spark shell?
http://spark.apache.org/docs/0.9.1/quick-start.html
If Spark is working, we can move this to the Shark user list (copied here).
Also, I am anything but a sir :)
Regards
Mayur
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@may
If we do cache() + count() after, say, every 50 iterations, the whole process
becomes very slow.
I have tried checkpoint(), cache() + count(), and saveAsObjectFiles().
Nothing works.
Materializing RDDs leads to a drastic decrease in performance, and if we don't
materialize, we face a StackOverflowError.
On W
Would cache() + count() every N iterations work just as well as
checkPoint() + count() to get around this issue?
We're basically trying to get Spark to avoid working on too lengthy a
lineage at once, right?
Nick
On Tue, May 13, 2014 at 12:04 PM, Xiangrui Meng wrote:
> After checkPoint, call c
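For concreteness, the pattern under discussion looks something like this
(illustrative names; the per-iteration map is a stand-in for the real
transformation, and sc.setCheckpointDir(...) is assumed to have been called):
import org.apache.spark.rdd.RDD

def iterate(initial: RDD[Double], numIterations: Int): RDD[Double] = {
  var current = initial
  for (i <- 1 to numIterations) {
    current = current.map(_ * 0.99).cache()  // stand-in for the real per-iteration step
    if (i % 50 == 0) {
      current.checkpoint()  // mark for checkpointing
      current.count()       // force evaluation so the checkpoint is written and the lineage is cut
    }
  }
  current
}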
It's worth mentioning that leveraging HDFS caching in Spark doesn't work
smoothly out of the box right now. By default, cached files in HDFS will
have 3 on-disk replicas and only one of these will be an in-memory replica.
In its scheduling, Spark will prefer all of them equally, meaning that, even when
r
If we do cache() + count() after, say, every 50 iterations, the whole process
becomes very slow.
I have tried checkpoint(), cache() + count(), and saveAsObjectFiles().
Nothing works.
Materializing RDDs leads to a drastic decrease in performance, and if we don't
materialize, we face a StackOverflowError.
--
Hi, Xiangrui
I compiled OpenBLAS on an EC2 m1.large; when Breeze calls the native lib, an
error occurs:
INFO: successfully loaded
/mnt2/wxhsdp/libopenblas/lib/libopenblas_nehalemp-r0.2.9.rc2.so
[error] (run-main-0) java.lang.UnsatisfiedLinkError:
com.github.fommil.netlib.NativeSystemBLAS.dgemm_offsets
Hi Xiangrui,
I actually used `yarn-standalone`, sorry for the confusion. I did debugging over
the last couple of days, and everything up to updateDependency in
executor.scala works. I also checked the file size and md5sum in the
executors, and they are the same as the ones on the driver. Gonna do more
testing
foreach vs. map isn't the issue. Both require serializing the called
function, so the pickle error would still apply, yes?
And at the moment, I'm just testing. Definitely wouldn't want to log
something for each element, but may want to detect something and log for
SOME elements.
So my question
We can create a standalone Spark application by simply adding
"spark-core_2.x" to build.sbt/pom.xml and connecting it to the Spark master.
We can also compile a custom version of Spark (e.g. compiled against Hadoop
2.x) from source and deploy it to the cluster manually.
But what is the proper way to use _custo
The issue of ":12: error: not found: type Text" is resolved by an import
statement, but I am still facing an issue with the imports of VectorWritable.
The Mahout math jar is added to the classpath, as I can check on the WebUI as well as on the shell:
scala> System.getenv
res1: java.util.Map[String,String] = {TERM=xterm,
JAVA_HOME
It's not the port for the Mesos slave that I want to set; there is another
port used for communicating between the Mesos master and the Spark tasks.
Here are some example log lines.
In this case, if port 56311 is not opened up via iptables and security
groups, the "detecting new master" step will
Hi,
I am trying to find a way to fill in missing values in an RDD. The RDD is a
sorted sequence.
For example, (1, 2, 3, 5, 8, 11, ...)
I need to fill in the missing numbers and get (1,2,3,4,5,6,7,8,9,10,11)
One way to do this is to "slide and zip"
rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...
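To flesh out the "slide and zip" idea, here is one possible sketch (it assumes
Long values, a strictly increasing sequence, and Spark 1.0+ for zipWithIndex):
import org.apache.spark.SparkContext._  // pair-RDD functions for join

val data = sc.parallelize(Seq(1L, 2L, 3L, 5L, 8L, 11L))
val byIndex   = data.zipWithIndex().map { case (v, i) => (i, v) }  // (position, value)
val successor = byIndex.map { case (i, v) => (i - 1, v) }          // value at position i + 1
val filled = byIndex.join(successor)
  .flatMap { case (_, (cur, nxt)) => cur until nxt }               // fill each gap [cur, nxt)
  .union(sc.parallelize(data.top(1)))                              // re-add the final value
filled.collect().sorted                                            // Array(1, 2, ..., 11)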
I don't know whether this would fix the problem. In v0.9, you need
`yarn-standalone` instead of `yarn-cluster`.
See
https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08
On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng wrote:
> Does v0.9 support yarn-cluster mode? I che
Hey Brian,
We've had a fairly stable 1.0 branch for a while now. I started
voting on the dev list last night... voting can take some time, but it
usually wraps up in anywhere from a few days to a few weeks.
However, you can get started right now with the release candidates.
These are likely to be almost
Hi,
I've been trying to run my newly created Spark job on my local master instead
of just running it using Maven, and I haven't been able to make it work. My main
issue seems to be related to this error:
14/05/14 09:34:26 ERROR EndpointWriter: AssociationError
[akka.tcp://sparkMaster@devsrv:70
My configuration is just like this; the slave node has been configured, but I
don't know what has happened to Shark. Can you help me, sir?
shark-env.sh
export SPARK_USER_HOME=/root
export SPARK_MEM=2g
export SCALA_HOME="/root/scala-2.11.0-RC4"
export SHARK_MASTER_MEM=1g
export HIVE_CONF_DIR="/usr/
Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in
v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui
On Mon, May 12, 2014 at 11:14 AM, DB Tsai wrote:
> We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar
> dependencies in command line with "-