Agreed. That's the conclusion we came to as well, so it's less of a bug and
more of a feature request. I think one of the main advantages of Hive is
the flexibility it gives non-technical users to run basic queries without
having to think about the transform details (i.e., we in the IT shop can
set up the transform). I like the annotation idea, where the partition
specs can somehow be pushed through the transform (or identified in some
other way).

I am new to the Apache/JIRA world. What would you recommend for getting
this filed as a feature request for consideration? I am not a Java
programmer, so my idea may need to be paired with a champion to help
implement it :)
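
In the meantime, here is a rough sketch of the only workaround I can see,
assuming the same source_table and 'cat' test script from my original
message below. It does not solve the view case, since whoever runs it still
has to write the TRANSFORM themselves; it just puts the partition filter on
source_table directly, which is what the second view definition does:

```sql
-- Filter on the partition column before the rows reach the script,
-- so pruning happens on source_table rather than on the script output.
SELECT TRANSFORM (day, ip) USING 'cat' AS (day, ip)
FROM source_table
WHERE day = '2012-10-08';

-- Assumption: the plan output lists the input partitions, which would be
-- a cheaper check than launching the job and counting map tasks.
EXPLAIN EXTENDED
SELECT * FROM view_transform WHERE day = '2012-10-08';
```

I have not verified the EXPLAIN output on our version, so treat that part
as a guess about where to look rather than a confirmed check.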



On Wed, Oct 10, 2012 at 3:24 PM, shrikanth shankar <sshan...@qubole.com> wrote:

> I assume the reason for this is that the Hive compiler has no way of
> determining that the 'day' that is input into the transform script is the
> same 'day' that is output from the transform script. Even if it did, it's
> unclear whether pushing down would be legal without knowing the semantics
> of the transformation. Any optimization to be done here will likely need an
> annotation somewhere to say that certain columns in the output of a
> transform refer to specific columns in the input of a transform for
> predicate pushdown purposes (and that such pushdown is legal for this
> transformation).
>
> thanks,
> Shrikanth
> On Oct 10, 2012, at 12:04 PM, John Omernik wrote:
>
> > Greetings all, I am trying to incorporate a TRANSFORM into a view (so we
> > can abstract the transform script away from the user).
> >
> >
> >
> > As a test, I have a table partitioned on day (formatted as YYYY-MM-DD)
> > with lots of partitions,
> >
> > and I tried this:
> >
> > CREATE VIEW view_transform as
> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table;
> >
> > The reason I used 'cat' in my test is that, if this works, I will
> > distribute my transform scripts to each node manually. I know each node
> > has cat, so this works as a test.
> >
> > When I run
> >
> > SELECT * from view_transform where day = '2012-10-08'
> >
> > 10,432 map tasks get spun up.
> >
> > If I rewrite the view to be
> >
> > CREATE VIEW view_transform as
> > Select TRANSFORM (day, ip) using 'cat' as (day, ip) from source_table
> > where day = '2012-10-08';
> >
> > Then only 16 map tasks get spun up (the desired behavior, but the
> > pruning is happening in the view, not in the query).
> >
> > Thus I wanted input on whether this should be considered a bug. I.e.,
> > should we be able to specify a partition spec against a view that uses a
> > transform and have normal pruning occur, even though the partition spec
> > will be passed to the transform script? I think we should, and it's
> > likely doable somehow. This would be awesome for a number of situations
> > where you may want to expose "transformed" data to analysts without the
> > mess of having them format their script for transform.
> >
> >
>
>
