I tried, but it had no effect.
Qin Wei wrote
> try the complete path
>
>
> qinwei
> From: wxhsdp
> Date: 2014-04-24 14:21
> To: user
> Subject: Re: how to set spark.executor.memory and heap size
>
> thank you, i added setJars, but nothing changes
>
> val conf = new SparkConf()
> .setMaster("spark://12
Good to know, thanks for pointing this out to me!
On 23/04/2014 19:55, Sandy Ryza wrote:
Ah, you're right about SPARK_CLASSPATH and ADD_JARS. My bad.
SPARK_YARN_APP_JAR is going away entirely -
https://issues.apache.org/jira/browse/SPARK-1053
On Wed, Apr 23, 2014 at 8:07 AM, Christophe Préau
Thank you very much for your help Prashant.
Sorry I still have another question about your answer: "however if the
file("/home/scalatest.txt") is present on the same path on all systems it
will be processed on all nodes."
When placing the file at the same path on all nodes, do we just simply
c
It is the same file, and the Hadoop library that we use for splitting takes care
of assigning the right split to each node.
Prashant Sharma
On Thu, Apr 24, 2014 at 1:36 PM, Carter wrote:
> Thank you very much for your help Prashant.
>
> Sorry I still have another question about your answer: "howeve
I think maybe it's a problem with reading the local file:
val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
val logData = sc.textFile(logFile).cache()
If I replace the above code with
val logData = sc.parallelize(Array(1,2,3,4)).cache()
the job completes successfully.
Can't I read a
You need to use the proper URL format:
file://home/wxhsdp/spark/example/standalone/README.md
On Thu, Apr 24, 2014 at 1:29 PM, wxhsdp wrote:
> i think maybe it's the problem of read local file
>
> val logFile = "/home/wxhsdp/spark/example/standalone/README.md"
> val logData = sc.textFile(logFile).c
Sorry wrong format:
file:///home/wxhsdp/spark/example/standalone/README.md
An extra / is needed at the start.
On Thu, Apr 24, 2014 at 1:46 PM, Adnan Yaqoob wrote:
> You need to use proper url format:
>
> file://home/wxhsdp/spark/example/standalone/README.md
>
>
> On Thu, Apr 24, 2014 at 1:29
Thanks for your reply, Adnan. I tried
val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
(I think there need to be three slashes after "file:").
It behaves just the same as val logFile =
"home/wxhsdp/spark/example/standalone/README.md"
and the error remains :(
Hi,
You should be able to read it; file:// or file:/// is not even required for
reading locally, just the path is enough.
What error message are you getting in spark-shell while reading the local file?
Also read the same file from HDFS:
put your README file there and read it, it works both ways.
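For example, something like this (the paths and the HDFS namenode address are
just placeholders):

val localPlain = sc.textFile("/home/wxhsdp/spark/example/standalone/README.md")        // bare local path
val localUrl   = sc.textFile("file:///home/wxhsdp/spark/example/standalone/README.md") // explicit file:/// URL
val onHdfs     = sc.textFile("hdfs://namenode:8020/user/wxhsdp/README.md")             // same file copied to HDFS
println(localUrl.count())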
Hi Arpit,
In spark-shell I can read the local file properly,
but when I use sbt run, the error occurs.
The sbt error message is at the beginning of the thread.
Arpit Tak-2 wrote
> Hi,
>
> You should be able to read it, file://or file:/// not even required for
> reading locally , just path is enough..
>
Thank you very much Prashant.
Date: Thu, 24 Apr 2014 01:24:39 -0700
From: ml-node+s1001560n4739...@n3.nabble.com
To: gyz...@hotmail.com
Subject: Re: Need help about how hadoop works.
It is the same file and hadoop library that we use for splitting takes
care of assigning the right spl
OK fine,
try it like this; I tried it and it works.
Specify the Spark path also in the constructor,
and also:
export SPARK_JAVA_OPTS="-Xms300m -Xmx512m -XX:MaxPermSize=1g"
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object SimpleApp {
def main(args: Array[Stri
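A rough sketch of the full app along these lines (the master URL, Spark home,
and jar path are placeholders for your own setup):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://master-host:7077")                        // placeholder master URL
      .setAppName("Simple App")
      .setSparkHome("/home/wxhsdp/spark")                           // the "spark path" in the constructor
      .setJars(Seq("target/scala-2.10/simple-project_2.10-1.0.jar")) // placeholder app jar
    val sc = new SparkContext(conf)
    val logFile = "file:///home/wxhsdp/spark/example/standalone/README.md"
    val logData = sc.textFile(logFile).cache()
    println("lines: " + logData.count())
    sc.stop()
  }
}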
Also try out these examples, all of them work:
http://docs.sigmoidanalytics.com/index.php/MLlib
If you spot any problems in those, let us know.
Regards,
arpit
On Wed, Apr 23, 2014 at 11:08 PM, Matei Zaharia wrote:
> See http://people.csail.mit.edu/matei/spark-unified-docs/ for a more
> re
Hi All, finally I wrote the following code, which I feel does this optimally,
if not in the most optimal way.
Using file pointers, seeking the byte after the last \n, but backwards!
This is memory efficient, and I suspect even the Unix tail implementation is
something similar.
import java.io.RandomAcces
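The idea, roughly (this is a sketch of the approach, not my exact code):

import java.io.RandomAccessFile

def lastLine(path: String): Option[String] = {
  val raf = new RandomAccessFile(path, "r")
  try {
    var pos = raf.length() - 1
    if (pos < 0) return None                      // empty file
    // Skip a trailing newline, if any
    raf.seek(pos)
    if (raf.readByte() == '\n'.toByte) pos -= 1
    // Walk backwards until the previous '\n' or the start of the file
    var found = false
    while (pos >= 0 && !found) {
      raf.seek(pos)
      if (raf.readByte() == '\n'.toByte) found = true else pos -= 1
    }
    raf.seek(pos + 1)                             // first byte after the last '\n'
    Option(raf.readLine())
  } finally {
    raf.close()
  }
}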
Hi,
I'm relatively new to Spark and have tried running the SparkPi example on a
standalone 12-core, three-machine cluster. What I'm failing to understand is
that running this example with a single slice gives better performance
compared to using 12 slices. The same was the case when I was using parallelize
It seems that it's not about the settings. I tried the take action and found
it's OK, but the error occurs when I try count and collect:
val a = sc.textFile("any file")
a.take(n).foreach(println) // ok
a.count()   // failed
a.collect() // failed
val b = sc.parallelize(Array(1,2,3,4))
b.take(n).foreach(pri
You may try this:
val lastOption = sc.textFile("input").mapPartitions { iterator =>
  if (iterator.isEmpty) {
    iterator
  } else {
    Iterator
      .continually((iterator.next(), iterator.hasNext))
      .collect { case (value, false) => value }
      .take(1)
  }
}.collect().lastOption
It
Thanks for the info. It seems like the JTS library is exactly what I
need (I'm not doing any raster processing at this point).
So, once they successfully finish the Scala wrappers for JTS, I would
theoretically be able to use Scala to write a Spark job that includes
the JTS library, and then run i
I have been stuck on an issue for the last two days and have not found any
solution after several hours of googling. Here are the details.
The following is a simple Python script (Temp.py):
import sys
from random import random
from operator import add
from pyspark import SparkContext
from pyspark import Spar
Hi Matei,
I checked out the git repository and built it. However, I'm still getting the
error below. It couldn't find those SQL packages. Please advise.
package org.apache.spark.sql.api.java does not exist
[ERROR]
/home/VirtualBoxImages.com/Documents/projects/errCount/src/main/java/errorCount/TransDr
Spark Command: /usr/lib/jvm/java-1.7.0/bin/java -cp
:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/ephemeral-hdfs/conf:/root/spark/conf:/root/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop1.0.4.jar
-Dspark.akka.logLifecycleEvents=true
-Djava.
Moreover, it seems all the workers are registered and have sufficient memory
(2.7GB whereas I have asked for 512 MB). The UI also shows the jobs are
running on the slaves. But on the terminal it is still the same error
"Initial job has not accepted any resources; check your cluster UI to ensure
that
I have changed the log level in log4j to ALL. Still I cannot see any log
coming from org.apache.spark.executor.Executor.
Is there something I am missing?
If I have this code:
val stream1 = doublesInputStream.window(Seconds(10), Seconds(2))
val stream2 = stream1.reduceByKeyAndWindow(_ + _, Seconds(10), Seconds(10))
Does reduceByKeyAndWindow merge all RDDs from stream1 that came in the 10
second window?
Example, in the first 10 secs stream1 will have
I am new to R and I am trying to learn the SparkR source code,
so I am wondering: what is the IDE for the SparkR project?
Is it RStudio? Is there any IDE like IntelliJ IDEA?
I tried, but IntelliJ IDEA wasn't able to import SparkR as a project.
Please help! Thank you!
Did you build it with SPARK_HIVE=true?
On Thu, Apr 24, 2014 at 7:00 AM, diplomatic Guru
wrote:
> Hi Matei,
>
> I checked out the git repository and built it. However, I'm still getting
> below error. It couldn't find those SQL packages. Please advice.
>
> package org.apache.spark.sql.api.java do
You shouldn't need to set SPARK_HIVE=true unless you want to use the
JavaHiveContext. You should be able to access
org.apache.spark.sql.api.java.JavaSQLContext with the default build.
How are you building your application?
Michael
On Thu, Apr 24, 2014 at 9:17 AM, Andrew Or wrote:
> Did you b
It's a simple application based on the "People" example.
I'm using Maven for building and below is the pom.xml. Perhaps, I need to
change the version?
<groupId>Uthay.Test.App</groupId>
<artifactId>test-app</artifactId>
<modelVersion>4.0.0</modelVersion>
<name>TestApp</name>
<packaging>jar</packaging>
<version>1.0</version>
Akka repository
Yeah, you'll need to run `sbt publish-local` to push the jars to your local
maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT.
On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru
wrote:
> It's a simple application based on the "People" example.
>
> I'm using Maven for building and b
Looks like you're depending on Spark 0.9.1, which doesn't have Spark SQL.
Assuming you've downloaded Spark, just run 'mvn install' to publish Spark
locally, and depend on Spark version 1.0.0-SNAPSHOT.
On Thu, Apr 24, 2014 at 9:58 AM, diplomatic Guru
wrote:
> It's a simple application based on th
Oh, and you'll also need to add a dependency on "spark-sql_2.10".
On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust
wrote:
> Yeah, you'll need to run `sbt publish-local` to push the jars to your
> local maven repository (~/.m2) and then depend on version 1.0.0-SNAPSHOT.
>
>
> On Thu, Apr 24, 20
Many thanks for your prompt reply. I'll try your suggestions and will get
back to you.
On 24 April 2014 18:17, Michael Armbrust wrote:
> Oh, and you'll also need to add a dependency on "spark-sql_2.10".
>
>
> On Thu, Apr 24, 2014 at 10:13 AM, Michael Armbrust wrote:
>
>> Yeah, you'll need
Rstudio should be fine.
Thanks Cheng !!
On Thu, Apr 24, 2014 at 5:43 PM, Cheng Lian wrote:
> You may try this:
>
> val lastOption = sc.textFile("input").mapPartitions { iterator =>
> if (iterator.isEmpty) {
> iterator
> } else {
> Iterator
> .continually((iterator.next(), iterator.hasNext()))
>
Same problem.
On Thu, Apr 24, 2014 at 10:54 AM, Shubhabrata wrote:
> Moreover it seems all the workers are registered and have sufficient memory
> (2.7GB where as I have asked for 512 MB). The UI also shows the jobs are
> running on the slaves. But on the termial it is still the same error
> "I
I receive this error:
Traceback (most recent call last):
File "", line 1, in
File
"/home/ubuntu/spark-1.0.0-rc2/python/pyspark/mllib/classification.py", line
178, in train
ans = sc._jvm.PythonMLLibAPI().trainNaiveBayes(dataBytes._jrdd, lambda_)
File
"/home/ubuntu/spark-1.0.0-rc2/pyt
./spark-shell: line 153: 17654 Killed
$FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"
Any ideas?
Did you launch this using our EC2 scripts
(http://spark.apache.org/docs/latest/ec2-scripts.html) or did you manually set
up the daemons? My guess is that their hostnames are not being resolved
properly on all nodes, so executor processes can’t connect back to your driver
app. This error message
The problem is that SparkPi uses Math.random(), which is a synchronized method,
so it can’t scale to multiple cores. In fact it will be slower on multiple
cores due to lock contention. Try another example and you’ll see better
scaling. I think we’ll have to update SparkPi to create a new Random
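Something along these lines, for illustration (not the actual SparkPi source;
the sample count is arbitrary):

import scala.util.Random
import org.apache.spark.SparkContext

// Give each partition its own Random so tasks don't contend on the lock
// behind Math.random().
def estimatePi(sc: SparkContext, slices: Int): Double = {
  val n = 100000 * slices
  val count = sc.parallelize(1 to n, slices).mapPartitions { iter =>
    val rand = new Random()
    iter.map { _ =>
      val x = rand.nextDouble() * 2 - 1
      val y = rand.nextDouble() * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }
  }.reduce(_ + _)
  4.0 * count / n
}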
Could you share the command you used and more of the error message?
Also, is it an MLlib specific problem? -Xiangrui
On Thu, Apr 24, 2014 at 11:49 AM, John King
wrote:
> ./spark-shell: line 153: 17654 Killed
> $FWDIR/bin/spark-class org.apache.spark.repl.Main "$@"
>
>
> Any ideas?
Is your Spark cluster running? Try to start with generating simple
RDDs and counting. -Xiangrui
On Thu, Apr 24, 2014 at 11:38 AM, John King
wrote:
> I receive this error:
>
> Traceback (most recent call last):
>
> File "", line 1, in
>
> File
> "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/ml
Last command was:
val model = new NaiveBayes().run(points)
On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng wrote:
> Could you share the command you used and more of the error message?
> Also, is it an MLlib specific problem? -Xiangrui
>
> On Thu, Apr 24, 2014 at 11:49 AM, John King
> wrote:
>
Yes, I got it running for a large RDD (~7 million lines) and mapping. I just
received this error when trying to classify.
On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng wrote:
> Is your Spark cluster running? Try to start with generating simple
> RDDs and counting. -Xiangrui
>
> On Thu, Apr 24, 201
This happens to me when using the EC2 scripts for the recent v1.0.0-rc2 release.
The Master connects and then disconnects immediately, eventually saying the
Master disconnected from the cluster.
On Thu, Apr 24, 2014 at 4:01 PM, Matei Zaharia wrote:
> Did you launch this using our EC2 scripts (
> http://spark
It worked!! Many thanks for your brilliant support.
On 24 April 2014 18:20, diplomatic Guru wrote:
> Many thanks for your prompt reply. I'll try your suggestions and will get
> back to you.
>
>
>
>
> On 24 April 2014 18:17, Michael Armbrust wrote:
>
>> Oh, and you'll also need to add a depend
Thanks Xiangrui, Matei and Arpit. It does work fine after adding
Vector.dense. I have a follow up question, I will post on a new thread.
On Thu, Apr 24, 2014 at 2:49 AM, Arpit Tak wrote:
> Also try out these examples, all of them works
>
> http://docs.sigmoidanalytics.com/index.php/MLlib
>
>
Folks,
I am wondering how mllib interacts with jblas and lapack. Does it make
copies of data from my RDD format to jblas's format? Does jblas copy it
again before passing to lapack native code?
I also saw some comparisons with VW and it seems mllib is slower on a
single node but scales better and
The data array in the RDD is passed by reference to jblas, so there is no data
copying in this stage. However, if jblas uses the native interface, there is a
copying overhead. I think jblas uses a Java implementation for at least
Level 1 BLAS, and calls the native interface for Levels 2 & 3. -Xiangrui
On Thu, Apr 24,
I tried locally with the example described in the latest guide:
http://54.82.157.211:4000/mllib-naive-bayes.html , and it worked fine.
Do you mind sharing the code you used? -Xiangrui
On Thu, Apr 24, 2014 at 1:57 PM, John King wrote:
> Yes, I got it running for large RDD (~7 million lines) and ma
Do you mind sharing more code and error messages? The information you
provided is too little to identify the problem. -Xiangrui
On Thu, Apr 24, 2014 at 1:55 PM, John King wrote:
> Last command was:
>
> val model = new NaiveBayes().run(points)
>
>
>
> On Thu, Apr 24, 2014 at 4:27 PM, Xiangrui Meng
I was able to run simple examples as well.
Which version of Spark? Did you use the most recent commit or from
branch-1.0?
Some background: I tried to build both on Amazon EC2, but the master kept
disconnecting from the client and executors failed after connecting. So I
tried to just use one machi
In the other thread I had an issue with Python. In this issue, I tried
switching to Scala. The code is:
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.linalg.SparseVector;
import org.apache.spark.mllib.classification.NaiveBayes;
import scala.colle
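Roughly, the rest of the pipeline looks like this (a simplified sketch; the
input path and line format below are made up, not my actual code):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object NaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("NaiveBayesSketch"))
    // Lines of the form "label,f1 f2 f3" -- made-up format
    val points = sc.textFile("data/sample.txt").map { line =>
      val Array(label, features) = line.split(",", 2)
      LabeledPoint(label.toDouble,
        Vectors.dense(features.trim.split(' ').map(_.toDouble)))
    }.cache()
    println("training examples: " + points.count())  // sanity check
    val model = new NaiveBayes().run(points)
    println(model.predict(Vectors.dense(1.0, 0.0, 3.0)))
    sc.stop()
  }
}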
Also when will the official 1.0 be released?
On Thu, Apr 24, 2014 at 7:04 PM, John King wrote:
> I was able to run simple examples as well.
>
> Which version of Spark? Did you use the most recent commit or from
> branch-1.0?
>
> Some background: I tried to build both on Amazon EC2, but the maste
I don't see anything wrong with your code. Could you do points.count()
to see how many training examples you have? Also, make sure you don't
have negative feature values. The error message you sent did not say
NaiveBayes went wrong, but the Spark shell was killed. -Xiangrui
On Thu, Apr 24, 2014 at
It just displayed this error and stopped on its own. Do the lines of code
mentioned in the error have anything to do with it?
On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng wrote:
> I don't see anything wrong with your code. Could you do points.count()
> to see how many training examples you ha
An exception occurs when compiling Spark 0.9.1 using sbt; env: Hadoop 2.3.
1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
2. Found exception:
found : org.apache.spark.streaming.dstream.DStream[(K, V)]
[error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V]
[error] N
Does anyone know the reason? I've googled a bit and found some people had the
same problem, but with no replies...
I'm using PySpark to load some data and getting an error while
parsing it. Is it possible to find the source file and line of the bad
data? I imagine that this would be extremely tricky when dealing with
multiple derived RDDs, so an answer with the caveat of "this only
works when running .map() o
I noticed that the error occurs
at
org.apache.hadoop.io.WritableUtils.readCompressedStringArray(WritableUtils.java:183)
at
org.apache.hadoop.conf.Configuration.readFields(Configuration.java:2378)
at
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:28
Try running sbt/sbt clean and re-compiling. Any luck?
On Thu, Apr 24, 2014 at 5:33 PM, martin.ou wrote:
>
>
> occure exception when compile spark 0.9.1 using sbt,env: hadoop 2.3
>
> 1. SPARK_HADOOP_VERSION=2.3.0 SPARK_YARN=true sbt/sbt assembly
>
>
>
> 2.found Exception:
>
> found : org.apache
Hey Jim, this is unfortunately harder than I’d like right now, but here’s how
to do it. Look at the stderr file of the executor on that machine, and you’ll
see lines like this:
14/04/24 19:17:24 INFO HadoopRDD: Input split:
file:/Users/matei/workspace/apache-spark/README.md:0+2000
This says wh
spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/word_mapping")
This line is too slow. There are about 2 million elements in word_mapping.
Is there a good style for writing a large collection to HDFS?
import org.apache.spark._
import SparkContext._
import scala.io
Try setting the serializer to org.apache.spark.serializer.KryoSerializer (see
http://spark.apache.org/docs/0.9.1/tuning.html), it should be considerably
faster.
Matei
On Apr 24, 2014, at 8:01 PM, Earthson Lu wrote:
> spark.parallelize(word_mapping.value.toSeq).saveAsTextFile("hdfs://ns1/nlp/w
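For example, the conf would look something like this (Spark 0.9-era settings;
the registrator class below is a placeholder you'd replace with your own, or
simply omit):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("WordMappingSave")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: registering your classes makes Kryo faster and the output smaller
  .set("spark.kryo.registrator", "mypackage.MyKryoRegistrator")  // placeholder class
val sc = new SparkContext(conf)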
Thanks Andrew.
Comments from Ooyala/Spark folks on reasons behind this?
Cheers,
On Sun, Apr 20, 2014 at 9:14 AM, Andrew Ash wrote:
> The homepage for Ooyala's job server is here:
> https://github.com/ooyala/spark-jobserver
>
> They decided (I think with input from the Spark team) that it made
Hi All,
I have a problem with the item-based collaborative filtering recommendation
algorithm in Spark.
The basic flow is as below:
          (Item1, (User1, Score1))
RDD1 ==>  (Item2, (User2, Score2))
Kryo fails with the exception below:
com.esotericsoftware.kryo.KryoException
(com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 0,
required: 1)
com.esotericsoftware.kryo.io.Output.require(Output.java:138)
com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
com.esotericsof
On Fri, Apr 25, 2014 at 2:20 AM, wxhsdp wrote:
> 14/04/25 08:38:36 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/04/25 08:38:36 WARN snappy.LoadSnappy: Snappy native library not loaded
Since this comes up r
Thanks for the reply. It indeed increased the usage. There was another issue
we found: we were broadcasting the Hadoop configuration by writing a wrapper
class over it, but then found the proper way in the Spark code:
sc.broadcast(new SerializableWritable(conf))
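In case it helps anyone else, the pattern looks roughly like this (the config
key/value and the small job are placeholders for illustration; the wrapper is
needed because Configuration is not java-serializable on its own):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}

def broadcastHadoopConf(sc: SparkContext): Unit = {
  val hadoopConf = new Configuration()
  hadoopConf.set("fs.defaultFS", "hdfs://namenode:8020")     // placeholder setting
  val bcConf = sc.broadcast(new SerializableWritable(hadoopConf))
  sc.parallelize(1 to 4).foreach { _ =>
    // On executors, unwrap with .value.value to get the Configuration back
    val conf = bcConf.value.value
    println(conf.get("fs.defaultFS"))
  }
}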
I only see one risk: if your feature indices are not sorted, it might
have undefined behavior. Other than that, I don't see anything
suspicious. -Xiangrui
On Thu, Apr 24, 2014 at 4:56 PM, John King wrote:
> It just displayed this error and stopped on its own. Do the lines of code
> mentioned in
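To illustrate the sorted-indices point (made-up sizes and values):

import org.apache.spark.mllib.linalg.Vectors

// Indices must be in strictly increasing order when building a sparse vector by hand:
val ok = Vectors.sparse(10, Array(1, 4, 7), Array(0.5, 1.0, 2.0))
// Vectors.sparse(10, Array(7, 1, 4), Array(0.5, 1.0, 2.0)) -- unsorted indices, avoid this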
Hi,
I am also curious about this question.
Is the textFile function supposed to read an HDFS file? In this case the file
is being read from the local filesystem. Is there any way to tell the local
filesystem and HDFS apart in the textFile function?
Besides, the OOM exe
Hey Mayur,
We use HiveColumnarLoader and XMLLoader. Are these working as well?
Will try a few things regarding porting Java MR.
Regards,
Suman Bharadwaj S
On Thu, Apr 24, 2014 at 3:09 AM, Mayur Rustagi wrote:
> Right now UDF is not working. Its in the top list though. You should be
> able to s
So you are computing all-pairs similarity over 20M users?
This is going to take about 200 trillion similarity computations, no?
I don't think there's any way to make that fundamentally fast.
I see you're copying the data set to all workers, which helps make it
faster at the expense of memory consumpt
Quin,
I'm not sure that I understand your source code correctly but the common
problem with item-based collaborative filtering at scale is
that the comparison of all pairs of item vectors needs quadratic effort
and therefore does not scale. A common approach to this problem is to
selectively d