Hello all,
I worked around this for now using the class (that I already had) that
inherits from RDD and is the one all of our custom RDDs inherit from. I did
the following:
1) Overload all of the transformations (that get used in our app) that don't
change the RDD size, wrapping the results with a
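A minimal sketch of this kind of workaround (written here as a standalone wrapper rather than an RDD subclass; the class and method names are made up, not the poster's actual code):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Remembers the element count once it is known and propagates it through
// transformations that cannot change the size, so a later count() does not
// have to re-run the whole lineage.
class SizedRDD[T: ClassTag](val rdd: RDD[T], private var knownSize: Option[Long] = None)
    extends Serializable {

  // Use the cached size when we have one, otherwise do a real count() once.
  def count(): Long = knownSize.getOrElse {
    val n = rdd.count()
    knownSize = Some(n)
    n
  }

  // map() is one-to-one, so the known size carries over to the result.
  def map[U: ClassTag](f: T => U): SizedRDD[U] =
    new SizedRDD(rdd.map(f), knownSize)

  // filter() can drop elements, so the cached size has to be discarded.
  def filter(p: T => Boolean): SizedRDD[T] =
    new SizedRDD(rdd.filter(p), None)
}

The real class presumably does the same for whatever other size-preserving transformations the app uses.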
I think the worry here is that people often use count() to force execution,
and when count() is coupled with transformations that have side effects, it is
no longer safe to skip running them.
However, maybe we can add a new lazy val .size that doesn't require
recomputation.
On Sat, Mar 28, 2015 at 7:42 AM, Sandy Ryza
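A small constructed illustration of that worry (not from the thread; it uses the Spark 1.x accumulator API, matching the era of this discussion):

import org.apache.spark.{SparkConf, SparkContext}

object CountForcesSideEffects {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("count-side-effects").setMaster("local[2]"))

    val sideEffects = sc.accumulator(0L)

    val mapped = sc.parallelize(1 to 100).map { x =>
      sideEffects += 1L   // side effect inside the transformation
      x
    }

    // count() is the only action here, so it is what forces the map() to run.
    // If count() just returned a precomputed size without evaluating the
    // lineage, the accumulator would stay at 0 and the side effects would be
    // silently skipped -- which is why a separate lazy .size is the safer option.
    println(mapped.count())      // 100
    println(sideEffects.value)   // 100 today; would be 0 under the shortcut

    sc.stop()
  }
}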
So it looks like this was actually a combination of using out of date
artifacts and further debugging needed on my part. Ripping the logic out
and testing in spark-shell works fine, so it is likely something upstream
in my application that causes it to take the whole Row.
Thanks!
-Pat
On Sat,
On 3/29/15 12:26 AM, Patrick Woody wrote:
Hey Cheng,
I didn't mean that Catalyst casting was eager, just that my
approaches thus far seem to have been. Maybe I should give a concrete
example?
I have columns A, B, C where B is saved as a String but I'd like all
references to B to go through
Hey Cheng,
I didn't mean that Catalyst casting was eager, just that my approaches
thus far seem to have been. Maybe I should give a concrete example?
I have columns A, B, C where B is saved as a String but I'd like all
references to B to go through a Cast to decimal regardless of the code used
o
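For what it's worth, one way to route every reference to B through a cast with the 1.3-era DataFrame API (column names A, B, C are the ones from the example; the path and temp-table name are placeholders):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val sqlContext = new SQLContext(sc)   // sc as provided by spark-shell

// Raw Parquet data with columns A, B, C, where B is physically a String.
val raw = sqlContext.parquetFile("/path/to/table")

// Re-expose the data with B always read through a cast to unlimited-precision
// decimal; downstream code uses this DataFrame (or the temp table) instead of
// the raw one.
val withDecimalB = raw.select(
  col("A"),
  col("B").cast(DecimalType.Unlimited).as("B"),
  col("C"))
withDecimalB.registerTempTable("my_table")

Since the cast is just another expression wrapped around the column, it is evaluated at read time, in line with Cheng's point below that casting happens at runtime.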
Hi Pat,
I don't understand what "lazy casting" means here. Why do you think the
current Catalyst casting is "eager"? Casting happens at runtime, and
doesn't disable column pruning.
Cheng
On 3/28/15 11:26 PM, Patrick Woody wrote:
Hi all,
In my application, we take input from Parquet files where
Hi all,
In my application, we take input from Parquet files where BigDecimals are
written as Strings to maintain arbitrary precision.
I was hoping to convert these back over to Decimal with Unlimited
precision, but I'd still like to maintain the Parquet column pruning (all
my attempts thus far se
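One way to check whether the String-to-Decimal conversion interferes with Parquet column pruning is to look at the physical plan (a sketch; the path and the column names id and amount are placeholders):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val sqlContext = new SQLContext(sc)   // e.g. in spark-shell

val raw = sqlContext.parquetFile("/path/to/decimals")

// Touch only two of the columns, casting the String one to decimal.
val q = raw.select(col("id"), col("amount").cast(DecimalType.Unlimited).as("amount"))

// The Parquet scan in the physical plan should list only the referenced
// columns, which confirms the cast did not defeat column pruning.
q.explain(true)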
I definitely see the value in this. However, I think at this point it
would be an incompatible behavioral change. People often use count in
Spark to exercise their DAG. Omitting processing steps that were
previously included would likely mislead many users into thinking their
pipeline was running
No, I'm not saying side effects change the count. But not executing
the map() function at all certainly has an effect on the side effects
of that function: the side effects which should take place never do. I
am not sure that is something to be 'fixed'; it's a legitimate
question.
You can persist
Hi Sean,
Thanks for the response.
I can't imagine a case (though my imagination may be somewhat limited) where
even map side effects could change the number of elements in the resulting
mapped RDD.
I guess "count" wouldn't officially be an 'action' if it were implemented
this way. At least it wouldn't