Agreed. That's the conclusion we came to as well, so it's less of a bug and more of a feature request. I think one of the main advantages of Hive is the flexibility of letting non-technical users run basic queries without having to think about the transform step (i.e., we in the IT shop can set up the transform for them). I like the annotation idea: some way to declare that the partition specs can be pushed through the transform (or identified in some other way).

I am new to the Apache/JIRA world. What would you recommend for getting this filed as a feature request for consideration? I am not a Java programmer, so my idea may need to be paired with a champion to help implement it :)
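To make the annotation idea concrete, here is a minimal Python sketch of a toy planner. It is not Hive's optimizer and says nothing about how Hive would implement this; `Scan`, `Transform`, `push_down`, `partitions_scanned`, and the `passthrough` annotation are all hypothetical names invented for illustration. It only models the point made in the thread: a filter on `day` can reach the partitioned scan only if something declares that the transform passes `day` through unchanged.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Scan:
    """Scan of a table partitioned on `day` (toy model)."""
    partitions: List[str]                    # one entry per `day` partition
    partition_filter: Optional[str] = None   # set when a predicate reaches the scan


@dataclass
class Transform:
    """An opaque script sitting above the scan, like TRANSFORM ... USING 'cat'."""
    child: Scan
    script: str
    # Hypothetical annotation: output columns declared identical to the
    # input column of the same name, so filters on them are safe to push down.
    passthrough: Set[str] = field(default_factory=set)


def push_down(plan: Transform, column: str, value: str) -> bool:
    """Push `column = value` below the transform only if it is annotated."""
    if column in plan.passthrough:
        plan.child.partition_filter = value
        return True
    # Without the annotation the planner cannot know the script's semantics,
    # so the filter must stay above the transform.
    return False


def partitions_scanned(plan: Transform) -> int:
    """Count the partitions the scan would have to read."""
    scan = plan.child
    if scan.partition_filter is None:
        return len(scan.partitions)
    return sum(1 for p in scan.partitions if p == scan.partition_filter)


days = [f"2012-10-{d:02d}" for d in range(1, 11)]

opaque = Transform(Scan(list(days)), script="cat")
annotated = Transform(Scan(list(days)), script="cat", passthrough={"day"})

push_down(opaque, "day", "2012-10-08")
push_down(annotated, "day", "2012-10-08")

print(partitions_scanned(opaque))     # 10: every partition is read
print(partitions_scanned(annotated))  # 1: only the matching partition is read
```

The annotation would have to be supplied by whoever defines the view, since only they know whether the script really leaves `day` untouched.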
On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar <sshan...@qubole.com> wrote:

> I assume the reason for this is that the Hive compiler has no way of
> determining that the 'day' that is input into the transform script is the
> same 'day' that is output from the transform script. Even if it did, it's
> unclear if pushing down would be legal without knowing the semantics of the
> transformation. Any optimization to be done here will likely need an
> annotation somewhere to say that certain columns in the output of a
> transform refer to specific columns in the input of a transform for
> predicate pushdown purposes (and that such pushdown is legal for this
> transformation).
>
> thanks,
> Shrikanth
>
> On Oct 10, 2012, at 12:04 PM, John Omernik wrote:
>
> > Greetings all, I am trying to incorporate a TRANSFORM into a view (so we
> > can abstract the transform script away from the user).
> >
> > As a test, I have a table partitioned on day (in YYYY-MM-DD format)
> > with lots of partitions, and I tried this:
> >
> > CREATE VIEW view_transform as
> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table;
> >
> > The reason I used 'cat' in my test is that, if this works, I will
> > distribute my transform scripts to each node manually. I know each node
> > has cat, so this works as a test.
> >
> > When I run
> >
> > SELECT * from view_transform where day = '2012-10-08'
> >
> > 10,432 map tasks get spun up.
> >
> > If I rewrite the view to be
> >
> > CREATE VIEW view_transform as
> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table
> > where day = '2012-10-08';
> >
> > then only 16 map tasks get spun up (the desired behavior, but the
> > pruning is happening in the view, not in the query).
> >
> > Thus I wanted input on whether this should be considered a bug, i.e.,
> > should we be able to define a partition spec in a view that uses a
> > transform so that normal pruning can occur even though the partition
> > spec will be passed to the transform script? I think we should, and it's
> > likely doable somehow. This would be awesome for a number of situations
> > where you may want to expose "transformed" data to analysis without the
> > mess of having them format their script for transform.