So it looks like this was actually a combination of my using out-of-date artifacts and needing to debug further on my end. Ripping the logic out and testing it in spark-shell works fine, so it is likely something upstream in my application that causes it to take the whole Row.
Thanks!
-Pat

On Sat, Mar 28, 2015 at 12:34 PM, Cheng Lian <lian.cs....@gmail.com> wrote:

> On 3/29/15 12:26 AM, Patrick Woody wrote:
>
> Hey Cheng,
>
> I didn't mean that catalyst casting was eager, just that my approaches
> thus far seem to have been. Maybe I should give a concrete example?
>
> I have columns A, B, C where B is saved as a String but I'd like all
> references to B to go through a Cast to decimal regardless of the code used
> on the SchemaRDD. So if someone does a min(B) it uses Decimal ordering
> instead of String.
>
> One approach that I had taken was to do a select of everything with the
> casts on certain columns, but then when I did a count(literal(1)) on top of
> that RDD it seemed to bring in the whole row.
>
> What version of Spark SQL are you using? Would you mind providing a brief
> snippet that can reproduce this issue? This might be a bug depending on
> your concrete usage. Thanks in advance!
>
> Thanks!
> -Pat
>
> On Sat, Mar 28, 2015 at 11:35 AM, Cheng Lian <lian.cs....@gmail.com>
> wrote:
>
>> Hi Pat,
>>
>> I don't understand what "lazy casting" means here. Why do you think
>> current Catalyst casting is "eager"? Casting happens at runtime, and
>> doesn't disable column pruning.
>>
>> Cheng
>>
>> On 3/28/15 11:26 PM, Patrick Woody wrote:
>>
>>> Hi all,
>>>
>>> In my application, we take input from Parquet files where BigDecimals are
>>> written as Strings to maintain arbitrary precision.
>>>
>>> I was hoping to convert these back over to Decimal with Unlimited
>>> precision, but I'd still like to maintain the Parquet column pruning (all
>>> my attempts thus far seem to bring in the whole Row). Is it possible to do
>>> this lazily through catalyst?
>>>
>>> Basically I'd want to do Cast(col, DecimalType()) whenever col is actually
>>> referenced. Any tips on how to approach this would be appreciated.
>>>
>>> Thanks!
>>> -Pat
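
For anyone who finds this thread later, here is a rough sketch of the "select everything with casts on certain columns" approach discussed above. It is not the code from my application. It assumes a newer Spark with the SparkSession/DataFrame API rather than the 1.3-era SchemaRDD, it uses made-up sample data written to a temporary directory, and it uses DecimalType(38, 18) as a stand-in for "unlimited" precision. The object name CastWithPruning is also just a placeholder.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, min}
import org.apache.spark.sql.types.DecimalType

object CastWithPruning {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cast-with-pruning")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: B holds decimal values serialized as strings.
    val path = Files.createTempDirectory("cast-demo").resolve("data").toString
    Seq(("a1", "10.5", "c1"), ("a2", "9.25", "c2"))
      .toDF("A", "B", "C")
      .write.parquet(path)

    // Select every column, casting B so that later references see a decimal.
    // DecimalType(38, 18) is an assumption, not true unlimited precision.
    val raw = spark.read.parquet(path)
    val typed = raw.select(
      col("A"),
      col("B").cast(DecimalType(38, 18)).as("B"),
      col("C"))

    // min(B) now compares decimals, so 9.25 sorts below 10.5;
    // comparing the raw strings would put "10.5" first.
    val minB = typed.agg(min(col("B")))
    minB.explain() // check which columns the Parquet scan actually reads
    minB.show()

    // A bare count on top of the casted projection, for comparison.
    typed.groupBy().count().explain()

    spark.stop()
  }
}
```

If pruning is working, the Parquet scan in the first plan should list only B in its read schema, and the count plan should not need A, B, or C at all. If the scan lists every column, something upstream of the select is probably forcing the whole Row, which is what turned out to be happening in my case.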