Re: RDD.count

2015-03-28 Thread jimfcarroll
Hello all, I worked around this for now using the class (that I already had) that inherits from RDD and is the one all of our custom RDDs inherit from. I did the following: 1) Overload all of the transformations (that get used in our app) that don't change the RDD size wrapping the results with a
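Jim's workaround can be sketched in plain Scala. Note this is a hypothetical sketch: `SizedData` and its methods are invented stand-ins, not Spark's RDD API; the real workaround overrides transformations on an RDD subclass.

```scala
// Hypothetical sketch of the workaround: a wrapper that carries a known
// element count through size-preserving transformations.
class SizedData[T](val data: Seq[T], val knownSize: Long) {
  // map is 1-to-1, so the known size can be propagated without recounting
  def map[U](f: T => U): SizedData[U] = new SizedData(data.map(f), knownSize)

  // filter can drop elements, so the size must be recomputed
  def filter(p: T => Boolean): SizedData[T] = {
    val kept = data.filter(p)
    new SizedData(kept, kept.length.toLong)
  }

  // count becomes a cheap lookup instead of a full traversal
  def count(): Long = knownSize
}
```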

Re: RDD.count

2015-03-28 Thread Reynold Xin
I think the worry here is that people often use count() to force execution, and when coupled with transformations that have side effects, it is no longer safe to skip the computation. However, maybe we can add a new lazy val .size that doesn't require recomputation. On Sat, Mar 28, 2015 at 7:42 AM, Sandy Ryza
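The `lazy val` suggestion can be illustrated without Spark. This is a minimal sketch (`CountingSeq` is an invented name): the size is computed at most once on first access and cached, so repeated reads don't recompute.

```scala
// Minimal sketch of a memoized size: computed on first access, cached after.
class CountingSeq[T](underlying: Seq[T]) {
  var computeCalls = 0  // tracks how many times the size was actually computed

  lazy val size: Long = {
    computeCalls += 1
    underlying.length.toLong
  }
}
```

In Spark itself a plain `lazy val` would not survive serialization across stages, so this only shows the caching semantics of the idea, not a working implementation.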

Re: Lazy casting with Catalyst

2015-03-28 Thread Patrick Woody
So it looks like this was actually a combination of using out of date artifacts and further debugging needed on my part. Ripping the logic out and testing in spark-shell works fine, so it is likely something upstream in my application that causes it to take the whole Row. Thanks! -Pat On Sat,

Re: Lazy casting with Catalyst

2015-03-28 Thread Cheng Lian
On 3/29/15 12:26 AM, Patrick Woody wrote: Hey Cheng, I didn't mean that Catalyst casting was eager, just that my approaches thus far seem to have been. Maybe I should give a concrete example? I have columns A, B, C where B is saved as a String but I'd like all references to B to go throug

Re: Lazy casting with Catalyst

2015-03-28 Thread Patrick Woody
Hey Cheng, I didn't mean that Catalyst casting was eager, just that my approaches thus far seem to have been. Maybe I should give a concrete example? I have columns A, B, C where B is saved as a String but I'd like all references to B to go through a Cast to decimal regardless of the code used o

Re: Lazy casting with Catalyst

2015-03-28 Thread Cheng Lian
Hi Pat, I don't understand what "lazy casting" means here. Why do you think the current Catalyst casting is "eager"? Casting happens at runtime and doesn't disable column pruning. Cheng On 3/28/15 11:26 PM, Patrick Woody wrote: Hi all, In my application, we take input from Parquet files where
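A small sketch of what Cheng describes, using the DataFrame API (a later API than the one current at the time of this thread). The path and decimal precision are hypothetical, and the column names follow the A/B/C example in the thread; this needs a Spark runtime, so it is illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("cast-sketch").getOrCreate()

// B is stored as a String in Parquet; cast it to a decimal at read time.
val df = spark.read.parquet("/path/to/input")  // hypothetical path
val withDecimal = df.select(col("A"), col("B").cast("decimal(38,18)").as("B"))

// The cast is a Catalyst expression evaluated at runtime; the physical plan
// should still show only columns A and B being read from Parquet (C is pruned).
withDecimal.explain()
```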

Lazy casting with Catalyst

2015-03-28 Thread Patrick Woody
Hi all, In my application, we take input from Parquet files where BigDecimals are written as Strings to maintain arbitrary precision. I was hoping to convert these back over to Decimal with Unlimited precision, but I'd still like to maintain the Parquet column pruning (all my attempts thus far se

Re: RDD.count

2015-03-28 Thread Sandy Ryza
I definitely see the value in this. However, I think at this point it would be an incompatible behavioral change. People often use count in Spark to exercise their DAG. Omitting processing steps that were previously included would likely mislead many users into thinking their pipeline was runnin

Re: RDD.count

2015-03-28 Thread Sean Owen
No, I'm not saying side effects change the count. But not executing the map() function at all certainly has an effect on the side effects of that function: the side effects which should take place never do. I am not sure that is something to be 'fixed'; it's a legitimate question. You can persist
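Sean's point can be demonstrated with a lazy Scala collection (a plain-Scala sketch, not Spark): if a count never evaluates the mapped elements, the map function's side effects never take place.

```scala
// Sketch: why a size-based count is unsafe when map has side effects.
var sideEffects = 0
val data = Seq(1, 2, 3)

// A lazy pipeline: the mapped function runs only when results are traversed.
val mapped = data.view.map { x => sideEffects += 1; x * 2 }

// A "count" that reads the known size without evaluating the map:
val count = data.size
assert(sideEffects == 0)  // the side effects never ran

// Forcing traversal (analogous to actually executing the stage) runs them.
mapped.foreach(_ => ())
assert(sideEffects == 3)
```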

Re: RDD.count

2015-03-28 Thread jimfcarroll
Hi Sean, Thanks for the response. I can't imagine a case (though my imagination may be somewhat limited) where even map side effects could change the number of elements in the resulting map. I guess "count" wouldn't officially be an 'action' if it were implemented this way. At least it wouldn't