Re: StackOverflow Error when run ALS with 100 iterations

2014-04-16 Thread Nick Pentreath
I'd also say that running for 100 iterations is a waste of resources, as ALS will typically converge pretty quickly, usually within 10-20 iterations. On Wed, Apr 16, 2014 at 3:54 AM, Xiaoli Li wrote: > Thanks a lot for your information. It really helps me. > > > On Tue, Apr 15, 2014 at 7:57 PM, C
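
For reference, a minimal sketch of capping the iteration count in a spark-shell session (the input path, rank, and lambda below are placeholders, not from this thread):

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Hypothetical ratings file with "userId,productId,rating" per line.
    val ratings = sc.textFile("hdfs:///path/to/ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // 10-20 iterations is usually enough for ALS to converge.
    val model = ALS.train(ratings, /* rank = */ 10, /* iterations = */ 15, /* lambda = */ 0.01)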

what is a partition? how it works?

2014-04-16 Thread Joe L
I would like to know: what is a partition? How does it work? How is it different from a Hadoop partition? For example: >>> sc.parallelize([1,2,3,4]).map(lambda x: (x,x)).partitionBy(2).glom().collect() [[(2,2), (4,4)], [(1,1), (3,3)]] From this we get 2 partitions, but what does that mean?
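
For comparison, a rough Scala equivalent of that snippet, assuming an existing SparkContext named sc:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.SparkContext._

    val pairs = sc.parallelize(Seq(1, 2, 3, 4)).map(x => (x, x))

    // partitionBy splits the pair RDD into 2 partitions by hashing the key;
    // glom() turns each partition into an array so the grouping becomes visible.
    val grouped = pairs.partitionBy(new HashPartitioner(2)).glom().collect()

Each inner array is one partition, i.e. one chunk of the data that is stored and processed together (and handled by one task at a time).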

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-16 Thread Cheng Lian
Hmm… The good part of reduce is that it performs local combining within a single partition automatically, but since you turned each partition into a single-value one, local combining is not applicable, and reduce simply degrades to collecting and then performing a skyline over all the collected partial results…
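
For context, a sketch of the pattern being described (Point, skyline, and data are stand-ins, not the poster's actual code):

    case class Point(x: Double, y: Double)

    // Placeholder: a real skyline would drop dominated points.
    def skyline(points: Array[Point]): Array[Point] = points

    val data = sc.parallelize(Seq(Point(1, 2), Point(2, 1), Point(3, 3)))

    // Each partition is collapsed into a single element: its partial skyline.
    val partials = data.mapPartitions(points => Iterator(skyline(points.toArray)))

    // With one value per partition there is nothing left to combine locally,
    // so reduce effectively just gathers the partial skylines and merges them.
    val result = partials.reduce((left, right) => skyline(left ++ right))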

Re: Why these operations are slower than the equivalent on Hadoop?

2014-04-16 Thread Yanzhe Chen
Hi Eugen, Sorry if I haven't caught your point. In the second example, val result = data.mapPartitions(points => skyline(points.toArray).iterator) .reduce { case (left, right) => skyline(left ++ right) } In my understanding, if the data is of type RDD, then both left and right

Re: Proper caching method

2014-04-16 Thread Cheng Lian
You can remove cached rdd1 from the cache manager by calling rdd1.unpersist(). But there are some subtleties: RDD.cache() is *lazy* while RDD.unpersist() is *eager*. When .cache() is called, it just tells the Spark runtime to cache the RDD *later*, when the corresponding job that uses this RDD is submitted;
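
A small illustration of that ordering, assuming an existing SparkContext named sc (the input path is a placeholder):

    val rdd1 = sc.textFile("hdfs:///path/to/input").map(_.length)

    rdd1.cache()       // lazy: only marks rdd1 to be cached
    rdd1.count()       // the first job materializes and caches rdd1
    rdd1.count()       // served from the cache

    rdd1.unpersist()   // eager: cached blocks are dropped right away
    // any job after this point recomputes rdd1 from the input file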

Re: Proper caching method

2014-04-16 Thread Arpit Tak
Thanks Cheng, that was helpful. On Wed, Apr 16, 2014 at 1:29 PM, Cheng Lian wrote: > You can remove cached rdd1 from the cache manager by calling > rdd1.unpersist(). But there are some subtleties: RDD.cache() is *lazy* while > RDD.unpersist() is *eager*. When .cache() is called, it just tells

Java heap space and spark.akka.frameSize

2014-04-16 Thread Chieh-Yen
Dear all, I developed an application in which the size of the messages exchanged sometimes exceeds 10 MB. For smaller datasets it works fine, but it fails for larger datasets. Please check the error message below. I surveyed the situation online and lots of people said the problem can be solved…
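
One commonly suggested knob for this symptom (hedged: whether it helps depends on the actual stack trace) is spark.akka.frameSize, set before the SparkContext is created; the value is in MB and defaults to 10 in this Spark version:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("large-messages")          // placeholder app name
      .set("spark.akka.frameSize", "64")     // allow messages up to ~64 MB

    val sc = new SparkContext(conf)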

Re: Spark program thows OutOfMemoryError

2014-04-16 Thread Andre Bois-Crettez
It seems you do not have enough memory on the Spark driver. Hints below: On 2014-04-15 12:10, Qin Wei wrote: val resourcesRDD = jsonRDD.map(arg => arg.get("rid").toString.toLong).distinct // the program crashes at this line of code val bcResources = sc.broadcast(resourcesRDD.collect.to
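
As a sketch of the general alternative to collect-then-broadcast (the RDDs below are stand-ins, not the poster's data): keep the large id set distributed and join against it instead of pulling it onto the driver.

    import org.apache.spark.SparkContext._   // pair RDD functions such as join

    // stand-ins for the real data
    val resourceIds = sc.parallelize(1L to 1000000L).map(id => (id, ()))
    val events      = sc.parallelize(Seq((42L, "view"), (7L, "click")))

    // the join stays distributed; nothing large is collected on the driver
    val matched = events.join(resourceIds).mapValues { case (event, _) => event }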

PySpark still reading only text?

2014-04-16 Thread Bertrand Dechoux
Hi, I have browsed the online documentation and it is stated that PySpark only reads text files as sources. Is that still the case? From what I understand, the RDD can, after this first step, contain any serialized Python structure if the class definitions are well distributed. Is it not possible to read

using saveAsNewAPIHadoopFile with OrcOutputFormat

2014-04-16 Thread Brock Bose
Howdy all, I recently saw that the OrcInputFormat/OutputFormat have been exposed for use outside of Hive (https://issues.apache.org/jira/browse/HIVE-5728). Does anyone know how one could use this with saveAsNewAPIHadoopFile to write records in ORC format? In particular, I would like…

Create cache fails on first time

2014-04-16 Thread Arpit Tak
I am loading some data (25 GB) into Shark from HDFS (Spark and Shark, both 0.9). Caching a table sometimes fails the very first time we cache the data; the second time it runs successfully... Is anybody facing the same issue? *Shark Client Log:* > create table sample_cach

graph.reverse & Pregel API

2014-04-16 Thread Bogdan Ghidireac
Hello, I am using the Pregel API with Spark (1.0 branch compiled on Apr 16th) and I ran into some problems when my graph has its edges reversed. If the edges of my graph are reversed, the sendMsg function no longer receives the attribute of the source vertex (it is null). This does not happen with…
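
For anyone trying to reproduce, a minimal toy sketch of the pattern (not the poster's graph): reverse a small graph and read triplet.srcAttr inside sendMsg, assuming an existing SparkContext named sc.

    import org.apache.spark.graphx._

    val vertices = sc.parallelize(Seq((1L, 1.0), (2L, 2.0), (3L, 3.0)))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph    = Graph(vertices, edges).reverse

    // Propagate the maximum vertex attribute along the reversed edges.
    val result = Pregel(graph, Double.NegativeInfinity, maxIterations = 5)(
      (id, attr, msg) => math.max(attr, msg),
      triplet =>
        // triplet.srcAttr is what reportedly comes back null after graph.reverse
        if (triplet.srcAttr > triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
        else Iterator.empty,
      (a, b) => math.max(a, b))

    result.vertices.collect().foreach(println)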

SPARK_YARN_APP_JAR, SPARK_CLASSPATH and ADD_JARS in a spark-shell on YARN

2014-04-16 Thread Christophe Préaud
Hi, I am running Spark 0.9.1 on a YARN cluster, and I am wondering which is the correct way to add external jars when running a Spark shell on a YARN cluster. Packaging all these dependencies in an assembly whose path is then set in SPARK_YARN_APP_JAR (as written in the doc: http://spark.apache.o

Re: using saveAsNewAPIHadoopFile with OrcOutputFormat

2014-04-16 Thread Kostiantyn Kudriavtsev
I’d prefer to find a good example of using saveAsNewAPIHadoopFile with different OutputFormat implementations (not only ORC, but EsOutputFormat, etc.). Any common example? On Apr 16, 2014, at 4:51 PM, Brock Bose wrote: > Howdy all, > I recently saw that the OrcInputFormat/OutputFormat's have
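
Not an ORC example, but a generic sketch of the call shape, using the new-API SequenceFileOutputFormat as a stand-in; an ORC or EsOutputFormat variant would swap in that format's key/value/output classes and any format-specific Configuration entries. Assumes an existing SparkContext sc; the output path is a placeholder.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{IntWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
    import org.apache.spark.SparkContext._

    val records = sc.parallelize(Seq("a" -> 1, "b" -> 2))
      .map { case (k, v) => (new Text(k), new IntWritable(v)) }

    val conf = new Configuration(sc.hadoopConfiguration)
    // any format-specific settings would go on conf here

    records.saveAsNewAPIHadoopFile(
      "hdfs:///tmp/output-sketch",
      classOf[Text],
      classOf[IntWritable],
      classOf[SequenceFileOutputFormat[Text, IntWritable]],
      conf)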

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-16 Thread Aureliano Buendia
Is this resolved in spark 0.9.1? On Tue, Apr 15, 2014 at 6:55 PM, anant wrote: > I've received the same error with Spark built using Maven. It turns out > that > mesos-0.13.0 depends on protobuf-2.4.1 which is causing the clash at > runtime. Protobuf included by Akka is shaded and doesn't cause

Using google cloud storage for spark big data

2014-04-16 Thread Aureliano Buendia
Hi, Google has published a new connector for Hadoop: Google Cloud Storage, which is an equivalent of Amazon S3: googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html How can Spark be configured to use this connector?
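
A hedged sketch only (the property and class names below are assumptions based on the connector's Hadoop setup and should be checked against its documentation): with the connector jar on the driver and executor classpath, registering the gs:// filesystem in the Hadoop configuration should let Spark read it like any other Hadoop path.

    // assumes the GCS connector jar is already on the classpath
    sc.hadoopConfiguration.set("fs.gs.impl",
      "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    sc.hadoopConfiguration.set("fs.gs.project.id", "my-gcp-project")   // placeholder

    val lines = sc.textFile("gs://my-bucket/path/to/data.txt")         // placeholder
    println(lines.count())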

Re: Shark: class java.io.IOException: Cannot run program "/bin/java"

2014-04-16 Thread Arpit Tak
Just set your JAVA_HOME properly: export JAVA_HOME=/usr/lib/jvm/java-7-. (something like this, whatever version you have) and it will work. Regards, Arpit On Wed, Apr 16, 2014 at 1:24 AM, ge ko wrote: > Hi, > > > > after starting the shark-shell > via /opt/shark/shark-0.9.1/bin/sha

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Roger Hoover
Ah, in case this helps others, looks like RDD.zipPartitions will accomplish step 4. On Tue, Apr 15, 2014 at 10:44 AM, Roger Hoover wrote: > Andrew, > > Thank you very much for your feedback. Unfortunately, the ranges are not > of predictable size but you gave me an idea of how to handle it. He
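
A small illustration of zipPartitions, assuming an existing SparkContext sc and two RDDs whose corresponding partitions already line up (which is the hard part of the original problem):

    val left  = sc.parallelize(Seq(1, 2, 3, 4), 2)
    val right = sc.parallelize(Seq("a", "b", "c", "d"), 2)

    // the function gets one iterator per RDD for each pair of aligned partitions
    val zipped = left.zipPartitions(right) { (xs, ys) => xs.zip(ys) }

    zipped.collect().foreach(println)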

Re: Error reading HDFS file using spark 0.9.0 / hadoop 2.2.0 - incompatible protobuf 2.5 and 2.4.1

2014-04-16 Thread Arpit Tak
I am also stuck on the same issue, but with Shark (0.9 with Spark 0.9) on Hadoop 2.2.0. On other Hadoop versions it works perfectly. Regards, Arpit Tak On Wed, Apr 16, 2014 at 11:18 PM, Aureliano Buendia wrote: > Is this resolved in spark 0.9.1? > > > On Tue, Apr 15, 2014 at 6:55 PM, anant wrote:

sbt assembly error

2014-04-16 Thread Yiou Li
Hi all, I am trying to build the Spark assembly using sbt and got connection errors when resolving dependencies. I tried a web browser and wget on some of the dependency links in the error and got 404 errors too. This happened with the following branches: spark-0.8.1-incubating spark-0.9.1 spark-0.9.1-b

Re: Spark packaging

2014-04-16 Thread Arpit Tak
Also try this ... http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_Ubuntu-12.04 http://docs.sigmoidanalytics.com/index.php/How_to_Install_Spark_on_HortonWorks_VM Regards, arpit On Thu, Apr 10, 2014 at 3:04 AM, Pradeep baji wrote: > Thanks Prabeesh. > > > On Wed, Apr 9, 2014 a

Re: sbt assembly error

2014-04-16 Thread Arpit Tak
It's because there is no slf4j directory there; maybe they are updating it: https://oss.sonatype.org/content/repositories/snapshots/org/ Hard luck, try again after some time... Regards, Arpit On Thu, Apr 17, 2014 at 12:33 AM, Yiou Li wrote: > Hi all, > > I am trying to build spark a

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Andrew Ash
Glad to hear you're making progress! Do you have a working version of the join? Is there anything else you need help with? On Wed, Apr 16, 2014 at 7:11 PM, Roger Hoover wrote: > Ah, in case this helps others, looks like RDD.zipPartitions will > accomplish step 4. > > > On Tue, Apr 15, 2014 at

Re: sbt assembly error

2014-04-16 Thread Sean Owen
This is just a red herring. You are seeing the build fail to contact many repos it knows about, including ones that do not have a given artifact. This is almost always a symptom of a network connectivity problem, like perhaps a proxy in between, esp. one that breaks HTTPS connections. You may need to…

Re: How to cogroup/join pair RDDs with different key types?

2014-04-16 Thread Roger Hoover
Thanks for following up. I hope to get some free time this afternoon to get it working. Will let you know. On Wed, Apr 16, 2014 at 12:43 PM, Andrew Ash wrote: > Glad to hear you're making progress! Do you have a working version of the > join? Is there anything else you need help with? > > >

Re: PySpark still reading only text?

2014-04-16 Thread Matei Zaharia
Hi Bertrand, We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately this is not in yet, but there is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161. In 1.0, one feature we do have now is the abil

Re: graph.reverse & Pregel API

2014-04-16 Thread Ankur Dave
Hi Bogdan, This is a bug -- thanks for reporting it! I just fixed it in https://github.com/apache/spark/pull/431. Does it help if you apply that patch? Ankur On Wed, Apr 16, 2014 at 7:51 AM, Bogdan Ghidireac wrote: > I am using Pregel API with Spark (1.0 branch comp

Regarding Partitioner

2014-04-16 Thread yh18190
Hi, I have a large dataset of elements [RDD] and I want to divide it into two exactly equal-sized partitions while maintaining the order of the elements. I tried using RangePartitioner like var data = partitionedFile.partitionBy(new RangePartitioner(2, partitionedFile)). This doesn't give satisfactory results because…
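
For reference, a hedged sketch of how RangePartitioner is normally used: it needs a pair RDD with an ordered key, and because it samples the keys to pick the range bounds, the resulting partitions are only roughly (not exactly) equal in size. The indexing scheme below is a stand-in for attaching an order-preserving key; sc is an existing SparkContext.

    import org.apache.spark.RangePartitioner
    import org.apache.spark.SparkContext._

    val elements = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"))

    // attach an ordered Long key: partition id in the high bits,
    // position within the partition in the low bits, preserving the original order
    val indexed = elements.mapPartitionsWithIndex { (pid, iter) =>
      iter.zipWithIndex.map { case (value, i) => ((pid.toLong << 32) | i, value) }
    }

    val partitioned = indexed.partitionBy(new RangePartitioner(2, indexed))
    partitioned.glom().collect().foreach(part => println(part.length))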

Re: GC overhead limit exceeded

2014-04-16 Thread Nicholas Chammas
I’m running into a similar issue as the OP. I’m running the same job over and over (with minor tweaks) in the same cluster to profile it. It just recently started throwing java.lang.OutOfMemoryError: Java heap space. > Are you caching a lot of RDD's? If so, maybe you should unpersist() the > ones
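
In the spirit of that suggestion, a small sketch for repeated profiling runs (the work itself is a stand-in, and sc is an existing SparkContext): unpersist each cached RDD once a run is done with it instead of leaving it orphaned in the cache.

    for (run <- 1 to 5) {
      val cached = sc.parallelize(1 to 1000000).map(_ * 2).cache()
      println("run " + run + " count = " + cached.count())   // materializes the cache
      cached.unpersist()                                      // free the blocks before the next run
    }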

Re: GC overhead limit exceeded

2014-04-16 Thread Nicholas Chammas
Never mind. I'll take it from both Andrew and Syed's comments that the answer is yes. Dunno why I thought otherwise. On Wed, Apr 16, 2014 at 5:43 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > I’m running into a similar issue as the OP. I’m running the same job over > and over (with

Re: GC overhead limit exceeded

2014-04-16 Thread Nicholas Chammas
But wait, does Spark know to unpersist() RDDs that are not referenced anywhere? That would’ve taken care of the RDDs that I kept creating and then orphaning as part of my job testing/profiling. Is that what SPARK-1103 is about, btw? (Sorry to keep

Re: sbt assembly error

2014-04-16 Thread Yiou Li
Hi Sean, It's true that sbt is trying different links, but ALL of them have connection issues (which are actually 404 File Not Found errors), and the build process takes forever connecting to the different links. I don't think it's a proxy issue because my other projects (other than Spark) build well

choose the number of partition according to the number of nodes

2014-04-16 Thread Joe L
Is it true that it is better to choose the number of partitions according to the number of nodes in the cluster? partitionBy(numNodes) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/choose-the-number-of-partition-according-to-the-number-of-nodes-tp4362.html

Re: choose the number of partition according to the number of nodes

2014-04-16 Thread Nicholas Chammas
From the Spark tuning guide: In general, we recommend 2-3 tasks per CPU core in your cluster. I think you can only get one task per partition to run concurrently for a given RDD. So if your RDD has 10 partitions, then at most 10 tasks can operate
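
A small sketch of turning that guideline into a partition count (the core count is a placeholder you would compute from your own cluster; sc is an existing SparkContext):

    val totalCores    = 4                   // e.g. nodes * coresPerNode in a real cluster
    val numPartitions = totalCores * 3      // roughly 2-3 tasks per core

    val data = sc.parallelize(1 to 1000000, numPartitions)
    println(data.partitions.length)         // 12

    // for file input: sc.textFile("hdfs:///path/to/input", numPartitions)
    // for an existing RDD: data.repartition(numPartitions)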

Re: choose the number of partition according to the number of nodes

2014-04-16 Thread Joe L
Thank you Nicholas -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/choose-the-number-of-partition-according-to-the-number-of-nodes-tp4362p4364.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: PySpark still reading only text?

2014-04-16 Thread Jesvin Jose
When this is implemented, can you load/save an RDD of pickled objects to HDFS? On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia wrote: > Hi Bertrand, > > We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile > that will allow saving pickled objects. Unfortunately this is not in

Re: PySpark still reading only text?

2014-04-16 Thread Matei Zaharia
Yes, this JIRA would enable that. The Hive support also handles HDFS. Matei On Apr 16, 2014, at 9:55 PM, Jesvin Jose wrote: > When this is implemented, can you load/save an RDD of pickled objects to HDFS? > > > On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia > wrote: > Hi Bertrand, > > We s

Re: sbt assembly error

2014-04-16 Thread Sean Owen
The error is "Connection timed out", not 404. The build references many repos, and only one will contain any given artifact. You are seeing it fail through trying many different repos, many of which don't even have the artifact either, but that's not the underlying cause. FWIW I can build the asse

Re: sbt assembly error

2014-04-16 Thread Azuryy Yu
It is only a network issue; you have some limited network access in China. On Thu, Apr 17, 2014 at 2:27 PM, Sean Owen wrote: > The error is "Connection timed out", not 404. The build references > many repos, and only one will contain any given artifact. You are > seeing it fail through trying ma

Re: graph.reverse & Pregel API

2014-04-16 Thread Bogdan Ghidireac
yes, the patch works fine. thank you! On Thu, Apr 17, 2014 at 12:08 AM, Ankur Dave wrote: > Hi Bogdan, > > This is a bug -- thanks for reporting it! I just fixed it in > https://github.com/apache/spark/pull/431. Does it help if you apply that > patch? > > Ankur > > >