Re: Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
Hi, This is on a 4-node cluster, each node with 32 cores/256GB RAM. Spark (0.9.0) is deployed in standalone mode. Each worker is configured with 192GB. Spark executor memory is also 192GB. This is on the first iteration. K=50. Here’s the code I use: http://pastebin.com/2yXL3y8i , which is a copy-
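
A minimal sketch of the call being discussed, assuming the 0.9-era MLlib API (the input path and the iteration/run counts are illustrative; only k = 50 comes from the thread):

  import org.apache.spark.mllib.clustering.KMeans

  // points: RDD[Array[Double]] parsed from a whitespace-separated text file
  val points = sc.textFile("hdfs:///data/points.txt")
    .map(_.split(' ').map(_.toDouble))
    .cache()
  // k = 50 (from the thread); maxIterations and runs are placeholder values
  val model = KMeans.train(points, 50, 10, 1)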

Re: Kmeans example reduceByKey slow

2014-03-23 Thread Xiangrui Meng
Hi Tsai, Could you share more information about the machine you used and the training parameters (runs, k, and iterations)? That would help in tracking down the issue. Thanks! Best, Xiangrui On Sun, Mar 23, 2014 at 3:15 AM, Tsai Li Ming wrote: > Hi, > > At the reduceBuyKey stage, it takes a few minutes befo

Re: No space left on device exception

2014-03-23 Thread Patrick Wendell
Ognen - just so I understand: the issue is that there weren't enough inodes and this was causing a "No space left on device" error? Is that correct? If so, that's good to know because it's definitely counterintuitive. On Sun, Mar 23, 2014 at 8:36 PM, Ognen Duzlevski wrote: > I would love to work

Re: How many partitions is my RDD split into?

2014-03-23 Thread Patrick Wendell
As Mark said, you can actually access this easily. The main issue I've seen from a performance perspective is people having a bunch of really small partitions. This will still work, but the performance will improve if you coalesce the partitions using rdd.coalesce(). This can happen for example if y
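
A small illustration of the coalesce suggestion (the input path and partition count are hypothetical):

  // e.g. a read that produced thousands of tiny partitions
  val fine = sc.textFile("hdfs:///logs/*")
  // merge them into fewer, larger partitions; coalesce avoids a full shuffle
  val coarse = fine.coalesce(64)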

Re: How many partitions is my RDD split into?

2014-03-23 Thread Mark Hamstra
It's much simpler: rdd.partitions.size On Sun, Mar 23, 2014 at 9:24 PM, Nicholas Chammas < nicholas.cham...@gmail.com> wrote: > Hey there fellow Dukes of Data, > > How can I tell how many partitions my RDD is split into? > > I'm interested in knowing because, from what I gather, having a good >
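
For example, in the Scala shell:

  val rdd = sc.parallelize(1 to 1000, 8)
  rdd.partitions.size   // => 8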

How many partitions is my RDD split into?

2014-03-23 Thread Nicholas Chammas
Hey there fellow Dukes of Data, How can I tell how many partitions my RDD is split into? I'm interested in knowing because, from what I gather, having a good number of partitions is good for performance. If I'm looking to understand how my pipeline is performing, say for a parallelized write out

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
I would love to work on this (and other) stuff if I can bother someone with questions offline or on a dev mailing list. Ognen On 3/23/14, 10:04 PM, Aaron Davidson wrote: Thanks for bringing this up, 100% inode utilization is an issue I haven't seen raised before and this raises another issue wh

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
Thanks for bringing this up; 100% inode utilization is an issue I haven't seen raised before, and it points to another issue which is not on our current roadmap for state cleanup (cleaning up data which was not fully cleaned up from a crashed process). On Sun, Mar 23, 2014 at 7:57 PM, Ognen Duzlevs

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Ah, interesting. count() without distinct, for instance, is streaming and does not require that a single partition fit in memory. That said, the behavior may change if you increase the number of partitions in your input RDD by using RDD.repartition() On Sun, Mar 23, 2014 at 11:47 AM, Kane wrote:
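
A hedged sketch of the repartitioning idea (the RDD name and partition count are hypothetical):

  // spread the data over more, smaller partitions before the memory-hungry step
  val deduped = bigRdd.repartition(200).distinct()
  deduped.count()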

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Bleh, strike that, one of my slaves was at 100% inode utilization on the file system. It was /tmp/spark* leftovers that apparently did not get cleaned up properly after failed or interrupted jobs. Mental note - run a cron job on all slaves and master to clean up /tmp/spark* regularly. Thanks (

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Aaron, thanks for replying. I am very much puzzled as to what is going on. A job that used to run on the same cluster is failing with this mysterious message about not having enough disk space when in fact I can see through "watch df -h" that the free space is always hovering around 3+GB on the

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I don’t see the errors anymore. Thanks Aaron. On 24-Mar-2014, at 12:52 am, Aaron Davidson wrote: > These errors should be fixed on master with Sean's PR: > https://github.com/apache/spark/pull/209 > > The orbit errors are quite possibly due to using https instead of http, > whether or not the

Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your SparkContext? It tries to serialize that many objects together at a time, which might be too much. By default the batchSize is 1024. Matei On Mar 23, 2014, at 10:11 AM, Jeremy Freeman wrote: > Hi all, > > Hitting a myst

is it possible to access the inputsplit in Spark directly?

2014-03-23 Thread hwpstorage
Hello, In Spark we can use *newAPIHadoopRDD* to access different distributed systems like HDFS, HBase, and MongoDB via different InputFormats. Is it possible to access the *InputSplit* in Spark directly? Spark can cache data in local memory. Perform local computation/aggregation on the local inpu
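
A rough sketch of reading through the new Hadoop API, where each partition of the resulting RDD corresponds to one InputSplit (the input path is hypothetical, and the input-dir property name assumes Hadoop 2):

  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

  val conf = new Configuration()
  conf.set("mapreduce.input.fileinputformat.inputdir", "hdfs:///data/input")
  val rdd = sc.newAPIHadoopRDD(conf, classOf[TextInputFormat],
                               classOf[LongWritable], classOf[Text])
  // per-split local work: this closure runs once per InputSplit/partition
  val linesPerSplit = rdd.mapPartitions(iter => Iterator(iter.size))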

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
By default, with P partitions (for both the pre-shuffle and post-shuffle stages), there are P^2 files created. With spark.shuffle.consolidateFiles turned on, we would instead create only P files. Disk space consumption is largely unaffected, however, by the number of partitions unless each partition
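
A minimal sketch of turning the flag on, assuming the 0.9-style SparkConf (it can equally be passed as a -D system property; the master URL and app name are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setMaster("local[4]")
    .setAppName("shuffle-consolidation-example")
    .set("spark.shuffle.consolidateFiles", "true")
  val sc = new SparkContext(conf)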

Re: Problem with SparkR

2014-03-23 Thread Shivaram Venkataraman
Hi Thanks for reporting this. It'll be great if you can check a couple of things: 1. Are you trying to use this with Hadoop2 by any chance ? There was an incompatible ASM version bug that we fixed for Hadoop2 https://github.com/amplab-extras/SparkR-pkg/issues/17 and we verified it, but I just wan

Problem with SparkR

2014-03-23 Thread Jacques Basaldúa
I am really interested in using Spark from R and have tried to use SparkR, but always get the same error. This is how I installed: - I successfully installed Spark version 0.9.0 with Scala 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_45) I can run examples from spark-shell and Python

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
On 3/23/14, 5:49 PM, Matei Zaharia wrote: You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it’s recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk. Matei, does the number of tasks/partitions i

Re: combining operations elegantly

2014-03-23 Thread Patrick Wendell
Hey All, I think the old thread is here: https://groups.google.com/forum/#!msg/spark-users/gVtOp1xaPdU/Uyy9cQz9H_8J The method proposed in that thread is to create a utility class for doing single-pass aggregations. Using Algebird is a pretty good way to do this and is a bit more flexible since y
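
As a rough illustration of the single-pass idea using plain RDD.aggregate (no Algebird; the case class and values are made up), computing a sum over one field plus a count and an average over another in one job:

  case class Row(a: Double, b: Double)
  val rows = sc.parallelize(Seq(Row(1, 10), Row(2, 20), Row(3, 30)))

  val (sumA, count, sumB) = rows.aggregate((0.0, 0L, 0.0))(
    (acc, r) => (acc._1 + r.a, acc._2 + 1L, acc._3 + r.b),  // fold one row in
    (x, y)   => (x._1 + y._1, x._2 + y._2, x._3 + y._3)     // merge partition results
  )
  val avgB = sumB / count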

Re: No space left on device exception

2014-03-23 Thread Ognen Duzlevski
On 3/23/14, 5:35 PM, Aaron Davidson wrote: On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the "df" command to check the free space of "/tmp" or root if tmp isn't listed. 3 GB also seems

Re: No space left on device exception

2014-03-23 Thread Matei Zaharia
You can set spark.local.dir to put this data somewhere other than /tmp if /tmp is full. Actually it’s recommended to have multiple local disks and set it to a comma-separated list of directories, one per disk. Matei On Mar 23, 2014, at 3:35 PM, Aaron Davidson wrote: > On some systems, /tmp/ i
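
A small sketch of the multiple-disk setup (the directory paths are hypothetical):

  // point shuffle and spill files at dedicated disks instead of /tmp;
  // the same value can be passed as -Dspark.local.dir=...
  val conf = new org.apache.spark.SparkConf()
    .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")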

Re: No space left on device exception

2014-03-23 Thread Aaron Davidson
On some systems, /tmp/ is an in-memory tmpfs file system, with its own size limit. It's possible that this limit has been exceeded. You might try running the "df" command to check the free space of "/tmp" or root if tmp isn't listed. 3 GB also seems pretty low for the remaining free space of a disk

No space left on device exception

2014-03-23 Thread Ognen Duzlevski
Hello, I have a weird error showing up when I run a job on my Spark cluster. The version of spark is 0.9 and I have 3+ GB free on the disk when this error shows up. Any ideas what I should be looking for? [error] (run-main-0) org.apache.spark.SparkException: Job aborted: Task 167.0:3 failed

Re: combining operations elegantly

2014-03-23 Thread Koert Kuipers
i currently typically do something like this: scala> val rdd = sc.parallelize(1 to 10) scala> import com.twitter.algebird.Operators._ scala> import com.twitter.algebird.{Max, Min} scala> rdd.map{ x => (1L, Min(x), Max(x), x) }.reduce(_ + _) res0: (Long,

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Aaron Davidson
These errors should be fixed on master with Sean's PR: https://github.com/apache/spark/pull/209 The orbit errors are quite possibly due to using https instead of http, whether or not the SSL cert was bad. Let us know if they go away with reverting to http. On Sun, Mar 23, 2014 at 11:48 AM, Debas

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Debasish Das
I am getting these weird errors which I have not seen before: [error] Server access Error: handshake alert: unrecognized_name url= https://repo.maven.apache.org/maven2/org/eclipse/jetty/orbit/javax.servlet/2.5.0.v201103041518/javax.servlet-2.5.0.v201103041518.orbit [info] Resolving org.eclipse.je

Re: distinct on huge dataset

2014-03-23 Thread Kane
Yes, there was an error in the data; after fixing it, count fails with an Out of Memory error. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: combining operations elegantly

2014-03-23 Thread Richard Siebeling
Hi Koert, Patrick, do you already have an elegant solution to combine multiple operations on a single RDD? Say for example that I want to do a sum over one column, a count and an average over another column, thanks in advance, Richard On Mon, Mar 17, 2014 at 8:20 AM, Richard Siebeling wrote: >

error loading large files in PySpark 0.9.0

2014-03-23 Thread Jeremy Freeman
Hi all, Hitting a mysterious error loading large text files, specific to PySpark 0.9.0. In PySpark 0.8.1, this works: data = sc.textFile("path/to/myfile") data.count() But in 0.9.0, it stalls. There are indications of completion up to: 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1

Re: distinct on huge dataset

2014-03-23 Thread Aaron Davidson
Andrew, this should be fixed in 0.9.1, assuming it is the same hash collision error we found there. Kane, is it possible your bigger data is corrupt, such that any operations on it fail? On Sat, Mar 22, 2014 at 10:39 PM, Andrew Ash wrote: > FWIW I've seen correctness errors with spark.shu

Re: sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Sean Owen
I'm also seeing this. It was also working for me previously, AFAIK. The proximate cause is my well-intentioned change that uses HTTPS to access all artifact repos. The default for Maven Central before would have been HTTP. While it's a good idea to use HTTPS, it may run into complications. I see:

sbt/sbt assembly fails with ssl certificate error

2014-03-23 Thread Bharath Bhushan
I am facing a weird failure where "sbt/sbt assembly" shows a lot of SSL certificate errors for repo.maven.apache.org. Is anyone else facing the same problems? Any idea why this is happening? Yesterday I was able to successfully run it. Loading https://repo.maven.apache.org shows an invalid cert

Kmeans example reduceByKey slow

2014-03-23 Thread Tsai Li Ming
Hi, At the reduceByKey stage, it takes a few minutes before the tasks start working. I have -Dspark.default.parallelism=127 (total cores minus 1). CPU/Network/IO is idling across all nodes when this is happening, and there is nothing unusual in the master log file. From the spark-shell: 14/03/23 1
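
For reference, reduceByKey also accepts an explicit partition count, which overrides the default parallelism for that stage (a sketch; the pair RDD name is hypothetical):

  val counts = pairs.reduceByKey(_ + _, 127)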