Implement CLUSTERED BY, DISTRIBUTED BY, SORTED BY directives for a single query 
level.
--------------------------------------------------------------------------------------

                 Key: HIVE-2295
                 URL: https://issues.apache.org/jira/browse/HIVE-2295
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor
            Reporter: Adam Kramer


The common pattern for using the mapreduce framework from Hive looks like this:

SELECT TRANSFORM(a.foo, a.bar)
USING 'reducer.py'
AS x, y, z
FROM (
  SELECT b.foo, b.bar
  FROM tablename b
  CLUSTER BY b.foo
) a;

...however, this is exceptionally fragile, as it relies on the assumption that 
Hive is not doing any "magic" in between the query steps. People familiar with 
SQL frequently assume that query steps are effectively separated from each 
other. Under that assumption, CLUSTER BY guarantees only that data are 
clustered on their way OUT of the subquery, but what we really need is a 
directive to indicate that data must be clustered on the way INTO the query 
that consumes them.
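
To make that assumption concrete, here is a minimal sketch of the kind of 
script such a query might call. It is hypothetical: this issue does not show 
the contents of any script, the names reduce_clustered and main are 
illustrative, and bar is assumed to hold integers. Rows are read as 
tab-separated text on stdin, which is how TRANSFORM feeds a script by default.

```python
import sys
from itertools import groupby


def reduce_clustered(lines):
    """Yield (key, row_count, total) for each run of adjacent rows sharing a key.

    Assumption: every row for a given key arrives contiguously, which is
    exactly what the script relies on the clustering step to provide.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines)
    # groupby only merges ADJACENT rows with equal keys; it does not sort.
    for key, group in groupby(rows, key=lambda row: row[0]):
        values = [int(row[1]) for row in group]
        yield key, len(values), sum(values)


def main():
    # Emit three tab-separated columns, matching "AS x, y, z" in the query.
    for key, count, total in reduce_clustered(sys.stdin):
        print(f"{key}\t{count}\t{total}")

# When saving this as a script for TRANSFORM, call main() at the bottom.
```

If the rows for one value of foo arrive non-adjacent, or are split across 
reducers, this script raises no error. It silently emits several partial lines 
for that key instead of one.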

This is not pedantic: nothing prevents Hive from trying to optimize data flow 
between query levels, for example by systematically splitting up big queries. 
The UDAF framework, with its merging step, would allow the rows for a single 
key to be split across SEVERAL reducers, "violating" the mapreduce assumptions 
but still returning the correct data...however, for a TRANSFORM statement, no 
such protections are afforded.
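
The difference can be illustrated with a toy model. This is only a sketch 
under stated assumptions, not Hive's UDAF API: partial_count, merge_counts and 
opaque_script are hypothetical names, and the two lists stand in for the rows 
that two reducers would receive if one key were split between them.

```python
from collections import Counter


def partial_count(rows):
    """Count rows per key within one reducer's share of the data."""
    return Counter(key for key, _value in rows)


def merge_counts(partials):
    """Combine per-reducer partial counts, like an aggregate's merge step."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged


def opaque_script(rows):
    """Stand-in for a TRANSFORM script: one (key, count) per key it sees.

    Nothing merges its output afterwards, because the engine cannot know
    how to combine the results of an arbitrary external script.
    """
    return sorted(partial_count(rows).items())


# Rows for key "a" are split across two hypothetical reducers.
reducer_1 = [("a", 1), ("b", 5)]
reducer_2 = [("a", 2)]

# Aggregate with a merge step: the split is invisible in the result.
merged = merge_counts([partial_count(reducer_1), partial_count(reducer_2)])

# Opaque script without a merge step: key "a" is reported twice.
script_output = opaque_script(reducer_1) + opaque_script(reducer_2)
```

Here merged holds a single count of 2 for "a", whereas script_output contains 
two separate entries for "a", each with a count of 1.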

I propose, for greater clarity, that these directives be part of the same query 
level. Example syntax:

SELECT TRANSFORM(foo, bar)
USING 'reducer.py'
AS x, y, z
FROM tablename
CLUSTERED BY foo;

...in other words, move the directive regarding data distribution to the query 
that actually cares about it, so that users who rely on the assumptions of the 
mapreduce framework can formally indicate that their transformer really DOES 
need clustered data. Put differently, CLUSTER BY is a directive guaranteeing 
that data are clustered on the way OUT OF a query (i.e., for bucketed tables), 
whereas CLUSTERED BY is a directive guaranteeing that data are clustered on the 
way INTO a query.

Bonus points: For tables that are already CLUSTERED BY in their definition, 
allow this query to run in the map phase.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
