Re: Distribute jar dependencies via sc.AddJar(fileName)

2014-05-15 Thread Xiangrui Meng
In SparkContext#addJar, for yarn-standalone mode, the workers should get the jars from the local distributed cache instead of fetching them from the HTTP server. Could you send the command you used to submit the job? -Xiangrui On Wed, May 14, 2014 at 1:26 AM, DB Tsai wrote: > Hi Xiangrui, > > I actua
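
For reference, a minimal sketch of shipping a dependency jar through the context (the path and app name here are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("addJarDemo") // hypothetical app name
    val sc = new SparkContext(conf)
    // ship a dependency jar to the executors; in yarn-standalone mode it is
    // served from the distributed cache rather than the driver's HTTP server
    sc.addJar("/path/to/my-deps.jar") // hypothetical path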

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
Hi DB, I tried including the breeze library using Spark 1.0 and it works. But how can I call the native library in standalone cluster mode? In local mode: 1. I include the "org.scalanlp" % "breeze-natives_2.10" % "0.7" dependency in my sbt build file 2. I install openblas. It works in standalon
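
For reference, a minimal build.sbt sketch for the native bindings, using the versions quoted above (the OpenBLAS note reflects the thread's setup, not a documented requirement):

    // build.sbt -- versions as quoted in the thread
    libraryDependencies ++= Seq(
      "org.scalanlp" % "breeze_2.10"         % "0.7",
      "org.scalanlp" % "breeze-natives_2.10" % "0.7"
    )
    // OpenBLAS must also be installed on every worker node,
    // not just on the machine that builds the jar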

A new resource for getting examples of Spark RDD API calls

2014-05-15 Thread zhen
Hi Everyone, I found it quite difficult to find good examples for Spark RDD API calls, so my student and I decided to go through the entire API and write examples for the vast majority of API calls (basically examples for anything that is remotely interesting). I think these examples may be useful

Re: master attempted to re-register the worker and then took all workers as unregistered

2014-05-15 Thread Nan Zhu
This is a bit different from what I met before; I suspect that this is a new bug and I will look into it further. -- Nan Zhu On Tuesday, May 6, 2014 at 10:06 PM, Cheney Sun wrote: > Hi Nan, > > In the worker's log, I see the following exception thrown when trying to launch an > executor. (

Re: Class not found in Kafka-Stream due to multi-thread without correct ClassLoader?

2014-05-15 Thread n0rb3rt
Any resolution to this? I'm new to Spark and have had success running an application locally, but I hit this same error when submitting it to a standalone cluster. I'm not using Kafka streaming in this case, just parsing proto messages wrapped in an Avro object file. I have read all the threads abo

same log4j slf4j error in Spark 0.9.1

2014-05-15 Thread Adrian Mocanu
I recall someone from the Spark team (TD?) saying that Spark 0.9.1 would change the logger so the circular-loop error between slf4j and log4j wouldn't show up. Yet on Spark 0.9.1 I still get SLF4J: Detected both log4j-over-slf4j.jar AND slf4j-log4j12.jar on the class path, preempting StackOverflowEr

Preferred RDD Size

2014-05-15 Thread Sai Prasanna
Hi, Is there any lower bound on the size of an RDD for optimally utilizing Spark's in-memory framework? Say creating an RDD for a very small data set of some 64 MB is not as efficient as one of some 256 MB; then the application can be tuned accordingly. So is there a soft lower bound related to hadoop-blo

Re: No space left on device error when pulling data from s3

2014-05-15 Thread darkjh
Setting `hadoop.tmp.dir` in `spark-env.sh` solved the problem; the Spark job no longer writes tmp files in /tmp/hadoop-root/:

    SPARK_JAVA_OPTS+=" -Dspark.local.dir=/mnt/spark,/mnt2/spark -Dhadoop.tmp.dir=/mnt/ephemeral-hdfs"
    export SPARK_JAVA_OPTS

I'm wondering if we need to permanently add this in th

ERROR: Unknown Spark version

2014-05-15 Thread wxhsdp
hello, it's my first time running Spark on EC2. I followed the instructions at http://spark.apache.org/docs/latest/ec2-scripts.html and used the command below to launch the cluster, but an error occurs:

    ./spark-ec2 -w 500 -k wxhsdp -i wxhsdp.pem -s 1 -v 1.0.0 launch wxhsdp

~/spark-ec2 Initializing s

Real world

2014-05-15 Thread Ian Ferreira
Folks, I keep getting questioned on real-world experience of Spark in mission-critical production deployments. Does anyone have some war stories to share, or know of resources to review? Cheers - Ian

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Dmitriy Lyubimov
PS spark shell with all proper imports is also supported natively in Mahout (the mahout spark-shell command). See M-1489 for specifics. There's also a tutorial somewhere, but I suspect it has not yet been finished/published via a public link. Again, you need trunk to use the spark shell there. On Wed, M

Re: why is Spark 0.9.1 (context creation?) so slow on my OSX laptop?

2014-05-15 Thread Madhu
It took some digging, but I think I found it. It's Hadoop code that's trying to get group information, which might not be available if you use Kerberos:

    cacheTimeout = conf.getLong(CommonConfigurationKeys.HADOOP_SECURITY_GROUPS_CACHE_SECS, 5*60) * 1000;
    public static final String HADO

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Dmitriy Lyubimov
PPS The shell/spark tutorial I mentioned is actually being developed in MAHOUT-1542. As it stands, I believe it is now complete in its core. On Wed, May 14, 2014 at 5:48 PM, Dmitriy Lyubimov wrote: > PS spark shell with all proper imports is also supported natively in > Mahout (mahout spark

problem about broadcast variable in iteration

2014-05-15 Thread randylu
My code is just like the following:

    var rdd1 = ...
    var rdd2 = ...
    var kv = ...
    for (i <- 0 until n) {
      var kvGlobal = sc.broadcast(kv) // broadcast kv
      rdd1 = rdd2.map {
        case t => doSomething(t, kvGlobal.value)
      }
      var tmp = rdd1.reduceByKey().collec

Re: Equally weighted partitions in Spark

2014-05-15 Thread Syed A. Hashmi
I took a stab at it and wrote a partitioner that I intend to contribute back to the main repo some time later. The partitioner takes in a parameter which governs the minimum number of keys per partition, and once all partition h
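
For context, a minimal sketch of the Partitioner contract such a class has to satisfy (the hash fallback below is only a placeholder for the weight-aware lookup described above):

    import org.apache.spark.Partitioner

    class EqualWeightPartitioner(val numPartitions: Int) extends Partitioner {
      def getPartition(key: Any): Int = {
        val h = key.hashCode % numPartitions
        if (h < 0) h + numPartitions else h // replace with weight-aware lookup
      }
    }

    // usage on a pair RDD: rdd.partitionBy(new EqualWeightPartitioner(1000))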

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Koert Kuipers
Hey Patrick, I have a SparkConf I can add them to. I was looking for a way to do this where they are not hardwired within Scala, which is what SPARK_JAVA_OPTS used to do. I guess if I just set -Dspark.akka.frameSize=1 on my Java app launch then it will get picked up by the SparkConf too, right?

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread wxhsdp
Finally I fixed it. The previous failure was caused by the lack of some jars. I pasted the classpath from local mode to the workers by using "show compile:dependencyClasspath" and it works!

File present but file not found exception

2014-05-15 Thread Sai Prasanna
Hi Everyone, I think all are pretty busy; the response time in this group has slightly increased. Anyway, this is a pretty silly problem but I could not get over it. I have a file in my local FS, but when I try to create an RDD out of it, the task fails with a file-not-found exception thrown at th

Re: run spark0.9.1 on yarn with hadoop CDH4

2014-05-15 Thread Arpit Tak
Also try this out; we have already done this and it will help you: http://docs.sigmoidanalytics.com/index.php/Setup_hadoop_2.0.0-cdh4.2.0_and_spark_0.9.0_on_ubuntu_12.04 On Tue, May 6, 2014 at 10:17 PM, Andrew Lee wrote: > Please check JAVA_HOME. Usually it should point to /usr/java/default

Test

2014-05-15 Thread Matei Zaharia

Fwd: Is there a way to load a large file from HDFS faster into Spark

2014-05-15 Thread Soumya Simanta
I have a Spark cluster with 3 worker nodes.

  - Workers: 3
  - Cores: 48 Total, 48 Used
  - Memory: 469.8 GB Total, 72.0 GB Used

I want to process a single compressed (*.gz) file on HDFS. The file is 1.5 GB compressed and 11 GB uncompressed. When I try to read the compressed file from HDFS i
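
Worth noting: gzip is not a splittable codec, so the whole file lands in a single partition on read; a common workaround is to repartition immediately after reading. A hedged sketch (path and partition count are placeholders):

    val raw = sc.textFile("hdfs:///data/big.gz") // gzip cannot be split: one partition
    val spread = raw.repartition(48)             // redistribute across all 48 cores
    println(spread.count())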

Re: 0.9 won't start cluster on ec2, SSH connection refused?

2014-05-15 Thread wxhsdp
Hi Mayur, I've met the same problem. The instances are on, I can see them from the EC2 console and connect to them: wxhsdp@ubuntu:~/spark/spark/tags/v1.0.0-rc3/ec2$ ssh -i wxhsdp-us-east.pem root@54.86.181.108 The authenticity of host '54.86.181.108 (54.86.181.108)' can't be established. ECDSA key fi

pyspark python exceptions / py4j exceptions

2014-05-15 Thread Patrick Donovan
Hello, I'm trying to write a python function that does something like:

    def foo(line):
        try:
            return stuff(line)
        except Exception:
            raise MoreInformativeException(line)

and then use it in a map like so:

    rdd.map(foo)

and have my MoreInformativeException make it back if/when

Re: NoSuchMethodError: breeze.linalg.DenseMatrix

2014-05-15 Thread DB Tsai
Hi Wxhsdp, I also have some difficulties with "sc.addJar()". Since we include the breeze library by using Spark 1.0, we don't have the problem you ran into. However, when we add external jars via sc.addJar(), I found that the executors actually fetch the jars but the classloader still doesn't hon

Re: Equally weighted partitions in Spark

2014-05-15 Thread deenar.toraskar
This is my first implementation. There are a few rough edges, but when I run this I get the following exception. The class extends Partitioner, which in turn extends Serializable. Any idea what I am doing wrong? scala> res156.partitionBy(new EqualWeightPartitioner(1000, res156, weightFunction)) 14/

Re: Unable to load native-hadoop library problem

2014-05-15 Thread Andrew Or
This seems unrelated to not being able to load the native-hadoop library. Is it failing to connect to the ResourceManager? Have you verified that there is an RM process listening on port 8032 at the specified IP? On Tue, May 6, 2014 at 6:25 PM, Sophia wrote: > Hi,everyone, > [root@CHBM220 spark-0.9.1]#

Re: is Mesos falling out of favor?

2014-05-15 Thread deric
I'm running the 1.0.0 branch; finally I've managed to make it work. I'm using a Debian package which is distributed to all slave nodes. So I've removed `SPARK_EXECUTOR_URI` and it works. spark-env.sh looks like this:

    export MESOS_NATIVE_LIBRARY="/usr/local/lib/libmesos.so"
    export SCALA_HOME="/usr" e

filling missing values in a sequence

2014-05-15 Thread Mohit Jaggi
Hi, I am trying to find a way to fill in missing values in an RDD. The RDD is a sorted sequence. For example, (1, 2, 3, 5, 8, 11, ...); I need to fill in the missing numbers and get (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11). One way to do this is to "slide and zip": rdd1 = sc.parallelize(List(1, 2, 3, 5, 8, 11, ...
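
One hedged way to express the "slide and zip" with plain RDD operations, assuming Spark 1.0 for zipWithIndex and sortBy:

    val rdd = sc.parallelize(Seq(1, 2, 3, 5, 8, 11))
    val indexed = rdd.zipWithIndex().map { case (v, i) => (i, v) }
    // key each value to the index of its predecessor so the join pairs neighbours
    val nexts = indexed.map { case (i, v) => (i - 1, v) }
    val last = rdd.reduce(_ max _)
    val filled = indexed.join(nexts)
      .flatMap { case (_, (cur, nxt)) => cur until nxt } // expand each gap
      .union(sc.parallelize(Seq(last)))                  // the join drops the final element
      .sortBy(identity)
    // filled.collect() => Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)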

Re: sbt run with spark.ContextCleaner ERROR

2014-05-15 Thread Nan Zhu
same problem +1, though does not change the program result -- Nan Zhu On Tuesday, May 6, 2014 at 11:58 PM, Tathagata Das wrote: > Okay, this needs to be fixed. Thanks for reporting this! > > > > On Mon, May 5, 2014 at 11:00 PM, wxhsdp (mailto:wxh...@gmail.com)> wrote: > > Hi, TD > > >

Re: How to read a multipart s3 file?

2014-05-15 Thread kamatsuoka
Whereas with s3://, the write takes 32 seconds and the rename takes 33 seconds: 14/05/06 20:23:08 INFO DAGScheduler: Stage 0 (saveAsTextFile at FileCopy.scala:17) finished in 32.208 s 14/05/06 20:23:08 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 14/05/06

Re: is Mesos falling out of favor?

2014-05-15 Thread deric
I'm also using SPARK_EXECUTOR_URI right now, though I would prefer distributing Spark as a binary package. Running examples with `./bin/run-example ...` works fine; however, tasks from spark-shell are getting lost: Error: Could not find or load main class org.apache.spark.executor.MesosExec

Taking value out from DStream for each RDD

2014-05-15 Thread Laeeq Ahmed
Hi all, I want to calculate the mean and SD for each RDD. I used the following code for the mean, and now I have to use this mean for the SD, but I am not sure how to get these means for each RDD from the DStream so I can use them for the SD. My sample file is: 1 2 3 4 5 The code is: val individualpoin

Re: spark on yarn-standalone, throws StackOverflowError and fails sometimes and succeeds for the rest

2014-05-15 Thread Xiangrui Meng
This is a known issue. Please try to reduce the number of iterations (e.g., <35). -Xiangrui On Fri, May 9, 2014 at 3:45 AM, phoenix bai wrote: > Hi all, > > My spark code is running on yarn-standalone. > > the last three lines of the code as below, > > val result = model.predict(prdctpairs) >

Re: How can adding a random count() change the behavior of my program?

2014-05-15 Thread Nicholas Chammas
Yeah, I believe repartition() is a lazy operation, but it’s strange that adding the count() can affect anything. I wonder if it has anything to do with defining an RDD as a transformation on itself in PySpark. Maybe combining lazy transformations with Python’s labels-not-variables

How to run the SVM and LogisticRegression

2014-05-15 Thread yxzhao
Hello, I found the classification algorithms SVM and LogisticRegression implemented in the following directory. How do I run them? What should the command line be? Thanks. spark-0.9.0-incubating/mllib/src/main/scala/org/apache/spark/mllib/classification
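
For reference, a minimal driver sketch against the Spark 1.0-style MLlib API (the file layout and iteration count are placeholders, not the project's documented usage):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.classification.{LogisticRegressionWithSGD, SVMWithSGD}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("clf"))
    // hypothetical input: "label,f1 f2 f3" per line
    val data = sc.textFile("data.txt").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()
    val lrModel  = LogisticRegressionWithSGD.train(data, 100) // numIterations
    val svmModel = SVMWithSGD.train(data, 100)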

Re: can't get tests to pass anymore on master

2014-05-15 Thread Koert Kuipers
I did not save it; next time I try to run it I will also send those. It was also a timeout. On Mon, May 12, 2014 at 4:59 PM, Tathagata Das wrote: > Can you also send us the error you are seeing in the streaming suites? > > TD > > > On Sun, May 11, 2014 at 11:50 AM, Koert Kuipers wrote: > >> res

Re: Easy one

2014-05-15 Thread Laeeq Ahmed
Hi Ian, Don't use SPARK_MEM in spark-env.sh; it will set it for all of your jobs. The better way is to use only the second option, sconf.setExecutorEnv("spark.executor.memory", "4g"), i.e. set it in the driver program. This way every job will have memory according to its requirement. For examp

Re: java.lang.StackOverflowError when calling count()

2014-05-15 Thread Tathagata Das
Just to add some more clarity in the discussion, there is a difference between caching to memory and checkpointing, when considered from the lineage point of view. When an RDD is checkpointed, the data of the RDD is saved to HDFS (or any Hadoop API compatible fault-tolerant storage) and the lineag
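
A minimal sketch of that checkpointing pattern (the path and the transformation are placeholders):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")      // must be fault-tolerant storage
    val rdd = sc.textFile("hdfs:///data").map(_.length) // stand-in for a long lineage
    rdd.cache()       // avoids computing twice: once for the job, once for the checkpoint write
    rdd.checkpoint()  // lineage is truncated after the first materialization
    rdd.count()       // the first action also writes the checkpoint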

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Koert Kuipers
VectorWritable is not in the mahout-math jar but in the mahout-core jar, so you need to include both. On Wed, May 14, 2014 at 3:43 AM, Stuti Awasthi wrote: > Hi Xiangrui, > Thanks for the response. I tried a few ways to include the mahout-math jar > while launching the Spark shell, but no success. Can you ple

Re: Using String Dataset for Logistic Regression

2014-05-15 Thread Xiangrui Meng
It depends on how you want to use the string features. For the day of the week, you can replace it with 6 binary features indicating Mon/Tue/Wed/Th/Fri/Sat. -Xiangrui On Fri, May 9, 2014 at 5:31 AM, praveshjain1991 wrote: > I have been trying to use LR in Spark's Java API. I used the dataset give
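
A sketch of that encoding with a hypothetical helper (the seventh day is the all-zeros baseline):

    val days = Seq("Mon", "Tue", "Wed", "Thu", "Fri", "Sat") // Sunday = all zeros
    def encodeDay(day: String): Array[Double] =
      days.map(d => if (d == day) 1.0 else 0.0).toArray

    encodeDay("Tue") // Array(0.0, 1.0, 0.0, 0.0, 0.0, 0.0)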

Re: Equivalent of collect() on DStream

2014-05-15 Thread Stephen Boesch
It seems the concept I had been missing is to invoke the DStream "foreach" method. This method takes a function expecting an RDD and applies the function to each RDD within the DStream. 2014-05-14 21:33 GMT-07:00 Stephen Boesch : > Looking further it appears the functionality I am seeking is
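
In other words, something along these lines (a sketch assuming a DStream[String] named lines; the method was called foreach in 0.8.x):

    lines.foreachRDD { rdd =>
      val batch: Array[String] = rdd.collect() // pulls this batch into the driver JVM
      batch.foreach(println)
    }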

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Dmitriy Lyubimov
Mahout now supports doing its distributed linalg natively on Spark, so the problem of loading sequence-file input into Spark is already solved there (trunk, http://mahout.apache.org/users/sparkbindings/home.html, the drmFromHDFS() call -- and then you can access the underlying rdd via the "rdd" matrix property

Re: How to run shark?

2014-05-15 Thread Mayur Rustagi
Most likely your Shark server is not started. Are you connecting to the cluster or running in local mode? What is the lowest error on the stack? Mayur Rustagi Ph: +1 (760) 203 3257 http://www.sigmoidanalytics.com @mayur_rustagi On Mon, May 12, 2014 at 2:07 PM, Sop

Re: Spark GCE Script

2014-05-15 Thread Aureliano Buendia
Please send a pull request, this should be maintained by the community, just in case you do not feel like continuing to maintain it. Also, nice to see that the gce version is shorter than the aws version. On Tue, May 6, 2014 at 10:11 AM, Akhil Das wrote: > Hi Matei, > > Will clean up the code a

[Suggestion]Strange behavior for broadcast cleaning with spark 0.9

2014-05-15 Thread Earthson
I'm using spark-0.9 with YARN. Q: Why can the spark.cleaner.ttl setting remove broadcasts that are still in use? I think the cleaner should not remove broadcasts that are still in the dependencies of some RDDs. It makes the value of spark.cleaner.ttl need to be set more carefully. POINT: the cleaner should not crash the

Average of each RDD in Stream

2014-05-15 Thread Laeeq Ahmed
Hi, I use the following code for calculating the average. The problem is that the reduce operation returns a DStream here, and not a tuple as it normally does without streaming. So how can we get the sum and the count from the DStream? Can we cast it to a tuple? val numbers = ssc.textFileStream(args(
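
One hedged way to get both numbers per batch is stats() on a double RDD, which returns a StatCounter carrying count, mean and stdev in one pass (numbers is assumed to be a DStream[String] as above):

    numbers.foreachRDD { rdd =>
      val st = rdd.map(_.toDouble).stats() // StatCounter: count, mean, stdev
      if (st.count > 0)
        println("count=" + st.count + " mean=" + st.mean + " stdev=" + st.stdev)
    }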

RE: Understanding epsilon in KMeans

2014-05-15 Thread Stuti Awasthi
Hi All, Any ideas on this? Thanks Stuti Awasthi From: Stuti Awasthi Sent: Wednesday, May 14, 2014 6:20 PM To: user@spark.apache.org Subject: Understanding epsilon in KMeans Hi All, I wanted to understand the functionality of epsilon in KMeans in Spark MLlib. As per the documentation: distance

Re: different in spark on yarn mode and standalone mode

2014-05-15 Thread Vipul Pandey
So here's a follow-up question: what's the preferred mode? We have a new cluster coming up with petabytes of data and we intend to take Spark to production. We are trying to figure out which mode would be safe and stable for a production-like environment. Pros and cons? Anyone? Any reasons why o

Using String Dataset for Logistic Regression

2014-05-15 Thread praveshjain1991
I have been trying to use LR in Spark's Java API. I used the dataset given along with Spark for training and testing purposes. Now I want to use it on another dataset that contains string values along with numbers. Is there any way to do this? I am attaching the dataset that I want to use. T

application detail ui can not open on ec2

2014-05-15 Thread wxhsdp
Hi all, I follow the instructions on http://spark.apache.org/docs/latest/ec2-scripts.html to set up a standalone-mode cluster on EC2; the Spark version is v1.0.0.rc3. I set spark.eventLog.enabled to true and can see the log file in /tmp/spark-event, but I cannot access the application detail UI, an

Re: slf4j and log4j loop

2014-05-15 Thread amoc
Hi Patrick/Sean, Sorry to resurrect this thread, but after upgrading to Spark 0.9.1 I still get this error at runtime while trying to run some tests here. Has this actually been integrated into Spark 0.9.1? Thanks again -A

spark+mesos: configure mesos 'callback' port?

2014-05-15 Thread Scott Clasen
Is anyone aware of a way to configure the mesos GroupProcess port on the mesos slave/task which the mesos master calls back on? The log line that shows this port looks like below (mesos 0.17.0) I0507 02:37:20.893334 11638 group.cpp:310] Group process ((2)@1.2.3.4:54321) connected to ZooKeeper. I

Re: Hadoop Writable and Spark serialization

2014-05-15 Thread Madhu
I have done this kind of thing successfully using Hadoop serialization, e.g. SessionContainer extends Writable and overrides write/readFields. I didn't try Kryo. It's fairly straightforward; I'll see if I can dig up the code if you really need it. I remember that I had to add a map transformation o
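
For what it's worth, a minimal sketch of that pattern (field names are hypothetical; write and readFields must serialize the fields in the same order):

    import java.io.{DataInput, DataOutput}
    import org.apache.hadoop.io.Writable

    class SessionContainer extends Writable {
      var userId: String = _
      var hits: Int = _
      override def write(out: DataOutput): Unit = {
        out.writeUTF(userId)
        out.writeInt(hits)
      }
      override def readFields(in: DataInput): Unit = {
        userId = in.readUTF()
        hits = in.readInt()
      }
    }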

Re: Spark unit testing best practices

2014-05-15 Thread Mark Hamstra
Local mode does serDe, so it should expose serialization problems. On Wed, May 14, 2014 at 10:53 AM, Philip Ogren wrote: > Have you actually found this to be true? I have found Spark local mode to > be quite good about blowing up if there is something non-serializable and > so my unit tests hav

Re: 1.0.0 Release Date?

2014-05-15 Thread Madhu
Spark 1.0.0 rc5 is available and open for voting. Give it a try and vote on it on the dev mailing list. - Madhu https://www.linkedin.com/in/msiddalingaiah

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
I should add that I had to tweak the numbers a bit to keep above the swap threshold, but below the "Too many open files" error (`ulimit -n` is 32768). On Wed, May 14, 2014 at 10:47 AM, Jim Blomo wrote: > That worked amazingly well, thank you Matei! Numbers that worked for > me were 400 for the textFil

Re: Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext

2014-05-15 Thread Shivani Rao
This is something that I have bumped into time and again: the object that contains your main() should also be serializable; then you won't have this issue. For example:

    object Test extends Serializable {
      def main() {
        // set up spark context
        // read your data
        // create your RDDs (grouped by key)
        /

Re: pySpark memory usage

2014-05-15 Thread Jim Blomo
That worked amazingly well, thank you Matei! Numbers that worked for me were 400 for the textFile()s, 1500 for the join()s. On Mon, May 12, 2014 at 7:58 PM, Matei Zaharia wrote: > Hey Jim, unfortunately external spilling is not implemented in Python right > now. While it would be possible to up

EndpointWriter: AssociationError

2014-05-15 Thread Laurent Thoulon
Hi, I've been trying to run my newly created Spark job on my local master instead of just running it using Maven, and I haven't been able to make it work. My main issue seems to be related to this error: 14/05/14 09:34:26 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@devsrv:

Re: Equivalent of collect() on DStream

2014-05-15 Thread Stephen Boesch
Looking further, it appears the functionality I am seeking is in the following private[spark] class ForEachDStream (version 0.8.1; yes, we are presently using an older release):

    private[streaming] class ForEachDStream[T: ClassManifest](
        parent: DStream[T],
        foreachFunc: (RDD[T], Time)

Equivalent of collect() on DStream

2014-05-15 Thread Stephen Boesch
Given that collect() does not exist on DStream, apparently my mental model of streaming RDDs (DStreams) needs correction/refinement. So what is the way to convert DStream data into a JVM in-memory representation? All of the methods on DStream, i.e. filter, map, transform, reduce, etc., generate other

Re: pySpark memory usage

2014-05-15 Thread Matei Zaharia
Cool, that’s good to hear. We’d also like to add spilling in Python itself, or at least make it exit with a good message if it can’t do it. Matei On May 14, 2014, at 10:47 AM, Jim Blomo wrote: > That worked amazingly well, thank you Matei! Numbers that worked for > me were 400 for the textFil

Re: little confused about SPARK_JAVA_OPTS alternatives

2014-05-15 Thread Patrick Wendell
Just wondering - how are you launching your application? If you want to set values like this, the right way is to add them to the SparkConf when you create a SparkContext:

    val conf = new SparkConf()
      .set("spark.akka.frameSize", "1")
      .setAppName(...)
      .setMaster(...)
    val sc = new SparkContext(conf)

Re: How to use Mahout VectorWritable in Spark.

2014-05-15 Thread Xiangrui Meng
You need

    val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])

to load the data. After that, you can do

    val data = raw.values.map(_.get)

to get an RDD of Mahout's Vector. You can use `--jar mahout-math.jar` when you launch spark-shell to include mahout-math. Best, Xiangr

RE: missing method in my slf4j after excluding Spark ZK log4j

2014-05-15 Thread Adrian Mocanu
Yeah, I had to change the versions of slf4j and log4j-over-slf4j to 1.7.6. -----Original Message----- From: Sean Owen [mailto:so...@cloudera.com] Sent: May-12-14 7:55 PM To: user@spark.apache.org Subject: Re: missing method in my slf4j after excluding Spark ZK log4j It sounds like you are doing every

Re: Spark temp dir (spark.local.dir)

2014-05-15 Thread Scott Clasen
Are you setting '-Dspark.local.dir=/mytemp/mytempsubdir'?

Re: problem about broadcast variable in iteration

2014-05-15 Thread Earthson
Is the RDD not cached? Because recomputation may be required, every broadcast object is included in the dependencies of RDDs; this may also cause memory issues (when n and kv are too large, as in your case).

Reading from .bz2 files with Spark

2014-05-15 Thread Andrew Ash
Hi all, Is anyone reading and writing .bz2 files stored in HDFS from Spark with success? I'm finding the following results on a recent commit (756c96 from 24hr ago) and CDH 4.4.0:

Works:
    val r = sc.textFile("/user/aa/myfile.bz2").count

Doesn't work:
    val r = sc.textFile("/user/aa/myfile.bz2")

Re: spark 0.9.1 textFile hdfs unknown host exception

2014-05-15 Thread Eugen Cepoi
Solved: Putting HADOOP_CONF_DIR in spark-env of the workers solved the problem. The difference between HadoopRDD and NewHadoopRDD is that the old one creates the JobConf on worker side, where the new one creates an instance of JobConf on driver side and then broadcasts it. I tried creating mysel

Re: Spark Streaming and JMS

2014-05-15 Thread Patrick McGloin
Hi Tathagata, Thanks for your response, just the advice I was looking for. I will try this out with Spark 1.0 when it comes out. Best regards, Patrick On 5 May 2014 22:42, Tathagata Das wrote: > A few high-level suggestions. > > 1. I recommend using the new Receiver API in almost-released Sp

Stable Hadoop version supported ?

2014-05-15 Thread Soumya Simanta
Currently I have HDFS version hadoop-0.20.2-cdh3u6 with Spark 0.9.1. I want to upgrade to Spark 1.0.0 soon and would also like to upgrade my HDFS version. What's the recommended version of HDFS to use with Spark 1.0.0? I don't know much about YARN, but I would just like to use the Spark sta

Re: spark on yarn-standalone, throws StackOverflowError and fails sometimes and succeeds for the rest

2014-05-15 Thread phoenix bai
After a couple of tests, I find that if I use:

    val result = model.predict(prdctpairs)
    result.map(x => x.user + "," + x.product + "," + x.rating).saveAsTextFile(output)

it always fails with the above error and the exception seems iterative. But if I do:

    val result = model.predict(prdctpairs)
    result.cac

Schema view of HadoopRDD

2014-05-15 Thread Debasish Das
Hi, For each line that we read as textLine from HDFS, we have a schema. If there is an API that takes the schema as List[Symbol] and maps each token to its Symbol, it would be helpful. Do RDDs provide a schema view of the dataset on HDFS? Thanks. Deb

Re: problem about broadcast variable in iteration

2014-05-15 Thread randylu
rdd1 is cached, but it has no effect:

    var rdd1 = ...
    var rdd2 = ...
    var kv = ...
    for (i <- 0 until n) {
      var kvGlobal = sc.broadcast(kv) // broadcast kv
      rdd1 = rdd2.map {
        case t => doSomething(t, kvGlobal.value)
      }.cache()
      var tmp

os buffer cache does not cache shuffle output file

2014-05-15 Thread wxhsdp
Hi, Patrick said: "The intermediate shuffle output gets written to disk, but it often hits the OS buffer cache since it's not explicitly fsync'ed, so in many cases it stays entirely in memory. The behavior of the shuffle is agnostic to whether the base RDD is in cache or on disk." i

Fwd: Doubts regarding Shark

2014-05-15 Thread vinay Bajaj
Hello, I have a few questions regarding Shark. 1) I have a table of 60 GB and total memory of 50 GB, but when I try to cache the table it gets cached successfully. How does Shark cache the table when there was not enough memory to fit it in memory? And how do the cache eviction policies (FIFO and LRU) w

Re: Schema view of HadoopRDD

2014-05-15 Thread rxin
The new Spark SQL component is designed for this!
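
For reference, a sketch against the Spark 1.0 Spark SQL API (the case class and file layout are hypothetical):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext._ // implicit conversion from case-class RDDs to SchemaRDDs

    case class Record(name: String, age: Int) // hypothetical schema
    val records = sc.textFile("hdfs:///data.csv")
      .map(_.split(","))
      .map(r => Record(r(0), r(1).trim.toInt))
    records.registerAsTable("records")
    val adults = sql("SELECT name FROM records WHERE age >= 18")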

Is there any problem on the spark mailing list?

2014-05-15 Thread Cheney Sun
I haven't received any spark-user mail since yesterday. Can you guys receive any new mail? -- Cheney

Re: problem about broadcast variable in iteration

2014-05-15 Thread randylu
But when I put the broadcast variable outside the for-loop, it works well (if not concerned about the memory issue you pointed out):

    var rdd1 = ...
    var rdd2 = ...
    var kv = ...
    var kvGlobal = sc.broadcast(kv) // broadcast kv
    for (i <- 0 until n) {
      rdd1 = rdd2.ma
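
If each iteration's result is materialized before the next broadcast, one hedged mitigation is to release the old broadcast explicitly (assuming Spark 1.0, where Broadcast.unpersist exists; rdd2, kv, doSomething and n are the thread's names, and doSomething is assumed to emit key/value pairs):

    for (i <- 0 until n) {
      val kvGlobal = sc.broadcast(kv)
      kv = rdd2.map(t => doSomething(t, kvGlobal.value))
               .reduceByKey(_ + _)
               .collect().toMap             // materialize before dropping the broadcast
      kvGlobal.unpersist(blocking = true)   // free the old copies on the executors
    }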

Re: Task not serializable?

2014-05-15 Thread pedro
I'm still fairly new to this, but I found problems using classes in maps if they used instance variables in part of the map function. It seems like for maps and such to work correctly, it needs to be purely functional programming.

Spark unit testing best practices

2014-05-15 Thread Andras Nemeth
Hi, Spark's local mode is great to create simple unit tests for our spark logic. The disadvantage however is that certain types of problems are never exposed in local mode because things never need to be put on the wire. E.g. if I accidentally use a closure which has something non-serializable in
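
For what it's worth, the usual local-mode baseline looks like this (a ScalaTest sketch; names are illustrative):

    import org.apache.spark.SparkContext
    import org.scalatest.FunSuite

    class WordCountSuite extends FunSuite {
      test("counts words in local mode") {
        val sc = new SparkContext("local[2]", "test")
        try {
          val counts = sc.parallelize(Seq("a b", "b"))
            .flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("b") === 2)
        } finally {
          sc.stop() // always stop, or the next test cannot create a context
        }
      }
    }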

Re: is Mesos falling out of favor?

2014-05-15 Thread Scott Clasen
Curious - what is the bug and what does it break? I have Spark 0.9.0 running on Mesos 0.17.0 and it seems to work correctly.

Not getting mails from user group

2014-05-15 Thread Laeeq Ahmed
Hi all, There seems to be a problem: I have not been getting mails from the Spark user group for two days. Regards, Laeeq

Spark to utilize HDFS's mmap caching

2014-05-15 Thread Chanwit Kaewkasi
Hi all, Can Spark (0.9.x) utilize the caching feature in HDFS 2.3 via sc.textFile() and other HDFS-related APIs? http://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/CentralizedCacheManagement.html Best regards, -chanwit -- Chanwit Kaewkasi linkedin.com/in/chanwit