Jesus Camacho Rodriguez created HIVE-14468:
----------------------------------------------

             Summary: Implement Druid query based input format
                 Key: HIVE-14468
                 URL: https://issues.apache.org/jira/browse/HIVE-14468
             Project: Hive
          Issue Type: Sub-task
          Components: Druid integration
    Affects Versions: 2.2.0
            Reporter: Jesus Camacho Rodriguez
            Assignee: Jesus Camacho Rodriguez


The input format is responsible for generating the splits and creating the record readers.

* For *Timeseries*, *TopN*, and *GroupBy* queries, create a single split containing 
the broker address and the query. The record reader then submits the query 
to the broker, retrieves the results, and parses them to generate records.

* For *Select* queries, Druid uses a threshold (limit) that bounds the number 
of rows returned per request, so the results of a query are retrieved over 
multiple requests. Hence, we first issue a Druid Segment Metadata query to 
obtain the number of rows in the datasource. We then create _number of rows / 
default\_threshold_ splits, where _default\_threshold_ is a Hive configuration 
property defined as {{hive.druid.select.threshold}}. Each generated split 
contains the broker address and a Select JSON query with a _start_ and _end_ row. 
The record readers handle the splits independently: each one submits its 
query to the broker, retrieves the results, and parses them to generate records. 
This lets us parallelize the retrieval of results for these queries.
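
The split generation described above could be sketched roughly as follows. This is only an illustration, not the actual implementation: the class {{DruidSplitSketch}}, the {{Split}} holder, and the method names are hypothetical, the row bounds are carried as plain fields instead of being written into the Select JSON, and the division is rounded up so that a trailing partial range still gets its own split. The threshold value used in {{main}} is an example value, not necessarily the default of {{hive.druid.select.threshold}}.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the split computation; none of these names come from Hive.
final class DruidSplitSketch {

    /** Hypothetical split descriptor: broker address, query, and a row range. */
    static final class Split {
        final String brokerAddress;
        final String queryJson;
        final long startRow; // inclusive; -1 when the split is not row-bounded
        final long endRow;   // exclusive; -1 when the split is not row-bounded

        Split(String brokerAddress, String queryJson, long startRow, long endRow) {
            this.brokerAddress = brokerAddress;
            this.queryJson = queryJson;
            this.startRow = startRow;
            this.endRow = endRow;
        }

        @Override
        public String toString() {
            return brokerAddress + " rows [" + startRow + ", " + endRow + ")";
        }
    }

    /** Timeseries, TopN, GroupBy: one split holding the broker address and the query. */
    static List<Split> singleSplit(String brokerAddress, String queryJson) {
        return Collections.singletonList(new Split(brokerAddress, queryJson, -1L, -1L));
    }

    /**
     * Select: one split per block of at most `threshold` rows.
     * `numRows` is assumed to come from a Segment Metadata query issued beforehand.
     */
    static List<Split> selectSplits(String brokerAddress, String selectQueryJson,
                                    long numRows, int threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < numRows; start += threshold) {
            // The last range is clipped so it never runs past the row count.
            long end = Math.min(start + threshold, numRows);
            splits.add(new Split(brokerAddress, selectQueryJson, start, end));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Example values only: 25000 rows with a threshold of 10000.
        for (Split s : selectSplits("localhost:8082", "{}", 25_000L, 10_000)) {
            System.out.println(s);
        }
    }
}
```

Each split is self-contained, so a record reader needs nothing beyond its own split to submit the query to the broker and fetch its share of the rows.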



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
