Here's another data point: the slow part of my code is the construction of
an RDD as the union of the textFile RDDs representing data from several
distinct Google Storage directories. So the question becomes the following:
what computation happens when calling the union method on two RDDs?
On Wed,
Forgot Reply To All ;o(
-- Forwarded message --
From: Krishna Sankar
Date: Wed, Dec 10, 2014 at 9:16 PM
Subject: Re: [VOTE] Release Apache Spark 1.2.0 (RC2)
To: Matei Zaharia
+1
Works same as RC1
1. Compiled OSX 10.10 (Yosemite) mvn -Pyarn -Phadoop-2.4
-Dhadoop.version=2.4.0 -Dsk
Yeah, it looks like messages that are successfully posted via Nabble end up
on the Apache mailing list, but messages posted directly to Apache aren't
mirrored to Nabble anymore because it's based off the incubator mailing
list. We should fix this so that Nabble posts to / archives the
non-incubato
On 12/16/14, 11:42 PM, "Ewan Higgs" wrote:
>Hi Tim,
>
>> On 16 Dec 2014, at 19:27, Tim Harsch wrote:
>>
>> Hi Ewan,
>> Thanks, I think I was just a bit confused at the time, I was looking at
>> the spark-perf repo when there was the problem (uh.. ok)…
>>
>The PR that I am working on is indeed
Patrick Wendell wrote
> The Partition itself doesn't need to be an iterator - the iterator
> comes from the result of compute(partition). The Partition is just an
> identifier for that partition, not the data itself.
OK, that makes sense. The docs for Partition are a bit vague on this point.
Maybe
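
To make the quoted distinction concrete, here is a minimal toy sketch in plain Python. It is not Spark's API: ToyPartition, ToyRangeDataset, and their methods are hypothetical names invented for illustration. It only mirrors the idea quoted above, that the partition object is an identifier and the iterator over the data comes from compute(partition).

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class ToyPartition:
    """Hypothetical partition: only identifies a slice, holds no data."""
    index: int


class ToyRangeDataset:
    """Hypothetical stand-in for an RDD over the integers 0..n-1."""

    def __init__(self, n: int, num_partitions: int) -> None:
        self.n = n
        self.num_partitions = num_partitions

    def partitions(self) -> List[ToyPartition]:
        # Cheap: builds identifiers only, no data is touched here.
        return [ToyPartition(i) for i in range(self.num_partitions)]

    def compute(self, partition: ToyPartition) -> Iterator[int]:
        # The data for a partition is produced here, as an iterator.
        start = partition.index * self.n // self.num_partitions
        end = (partition.index + 1) * self.n // self.num_partitions
        return iter(range(start, end))


dataset = ToyRangeDataset(n=10, num_partitions=3)
collected = [list(dataset.compute(p)) for p in dataset.partitions()]
```

In this sketch a ToyPartition carries nothing but its index, so it can be created and passed around cheaply; the elements only appear once compute is called on it.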
Hi,
In the Spark SQL documentation, it says "Some of these (such as indexes) are
less important due to Spark SQL’s in-memory computational model. Others are
slotted for future releases of Spark SQL.
- Block level bitmap indexes and virtual columns (used to build indexes)"
For our