Re: Spark Shell slowness on Google Cloud

2014-12-17 Thread Alessandro Baretta
Here's another data point: the slow part of my code is the construction of an RDD as the union of the textFile RDDs representing data from several distinct Google Storage directories. So the question becomes: what computation happens when the union method is called on two RDDs? On Wed,

Fwd: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-17 Thread Krishna Sankar
Forgot Reply To All ;o(
-- Forwarded message --
From: Krishna Sankar
Date: Wed, Dec 10, 2014 at 9:16 PM
Subject: Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
To: Matei Zaharia
+1 Works same as RC1
1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Dsk

Re: Nabble mailing list mirror errors: "This post has NOT been accepted by the mailing list yet"

2014-12-17 Thread Josh Rosen
Yeah, it looks like messages that are successfully posted via Nabble end up on the Apache mailing list, but messages posted directly to Apache aren't mirrored to Nabble anymore because it's based off the incubator mailing list. We should fix this so that Nabble posts to / archives the non-incubato

Re: running the Terasort example

2014-12-17 Thread Tim Harsch
On 12/16/14, 11:42 PM, "Ewan Higgs" wrote:
>Hi Tim,
>
>> On 16 Dec 2014, at 19:27, Tim Harsch wrote:
>>
>> Hi Ewan,
>> Thanks, I think I was just a bit confused at the time, I was looking at
>> the spark-perf repo when there was the problem (uh.. ok)…
>>
>The PR that I am working on is indeed

Re: RDD data flow

2014-12-17 Thread Madhu
Patrick Wendell wrote
> The Partition itself doesn't need to be an iterator - the iterator
> comes from the result of compute(partition). The Partition is just an
> identifier for that partition, not the data itself.
OK, that makes sense. The docs for Partition are a bit vague on this point. Maybe
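The distinction quoted above (a Partition identifies a slice of the data, while the records only appear when compute(partition) is called) can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's actual API: `ToyRDD`, its round-robin split, and the sample data are all hypothetical names and choices made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class Partition:
    """Identifier only: holds an index, never the records themselves."""
    index: int


class ToyRDD:
    """Hypothetical stand-in for an RDD backed by an in-memory list."""

    def __init__(self, data: List[int], num_partitions: int) -> None:
        self._data = data
        self._num_partitions = num_partitions

    def partitions(self) -> List[Partition]:
        # Cheap to build: no records are read here.
        return [Partition(i) for i in range(self._num_partitions)]

    def compute(self, partition: Partition) -> Iterator[int]:
        # The iterator over records comes from compute(), not from the
        # Partition. Round-robin assignment is an arbitrary toy choice.
        return (x for i, x in enumerate(self._data)
                if i % self._num_partitions == partition.index)


rdd = ToyRDD([10, 11, 12, 13, 14], num_partitions=2)
parts = rdd.partitions()
first = list(rdd.compute(parts[0]))
second = list(rdd.compute(parts[1]))
```

In this sketch `parts` carries only indices; the data is touched only when `compute` is asked for a given partition's iterator.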

When will Spark SQL support building DB index natively?

2014-12-17 Thread Xuelin Cao
Hi, In the Spark SQL help document, it says "Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL. - Block level bitmap indexes and virtual columns (used to build indexes)" For our