[ 
https://issues.apache.org/jira/browse/HIVE-5761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13833368#comment-13833368
 ] 

Eric Hanson commented on HIVE-5761:
-----------------------------------

Teddy, 

Yes, this would be great! Before you start coding, please write a 1 page or so 
design document to talk about how Date will be represented and how you will 
operate on it. Please edit it into the HIVE-4160 design doc in MS Word format 
and post the revised version as a draft. Then we can review that. One thing to 
consider is to store a date as an integer number of days in a LongColumnVector, 
e.g. 0000-01-01 is 1, 0000-01-02 is 2 etc. up to 9999-12-31 being some 
large integer. But before we conclude this is the way to go, we need to think 
about how operations will be performed on date data elements represented as 
longs. 
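
To make the day-number idea concrete, here is a minimal sketch in Java. It assumes the proleptic ISO calendar and uses java.time (Java 8+) only for the conversion at the edges; the class and method names (DayNumberDate, toDayNumber, fromDayNumber) are hypothetical and not part of Hive.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Hypothetical helper, not Hive code: maps a calendar date to a long
// day number so that it could be stored in a long-typed column vector.
final class DayNumberDate {
    // Assumption: 0000-01-01 (proleptic ISO calendar) is day 1.
    private static final LocalDate DAY_ONE = LocalDate.of(0, 1, 1);

    // Calendar date -> day number (0000-01-01 maps to 1).
    static long toDayNumber(LocalDate date) {
        return ChronoUnit.DAYS.between(DAY_ONE, date) + 1;
    }

    // Day number -> calendar date (inverse of toDayNumber).
    static LocalDate fromDayNumber(long dayNumber) {
        return DAY_ONE.plusDays(dayNumber - 1);
    }
}
```

Under these assumptions the calendar is consulted only when converting in or out; everything in between is plain long arithmetic.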

An alternative representation would be to embed YYYY, MM, DD as fields inside a 
long. E.g. put DD in byte 0, MM in byte 1, and YYYY as a 16-bit int in bytes 2 
and 3. Then comparisons will work on these ints directly. And you can grab a 
YYYY, MM or DD field out of them with a simple shift and mask. You would not 
need to consult a calendar.
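
The packed layout described above could look roughly like this in Java. This is only a sketch: the class and method names (PackedDate, pack, year, month, day) are hypothetical, and it assumes years in the range 0 to 9999 so that YYYY fits in 16 bits.

```java
// Hypothetical helper, not Hive code: packs YYYY, MM, DD into one long.
// Layout (assumed from the description above): DD in byte 0, MM in byte 1,
// YYYY as a 16-bit value in bytes 2 and 3.
final class PackedDate {
    static long pack(int year, int month, int day) {
        return ((long) year << 16) | ((long) month << 8) | (long) day;
    }

    // Each field is recovered with one shift and one mask.
    static int year(long packed) {
        return (int) ((packed >>> 16) & 0xFFFF);
    }

    static int month(long packed) {
        return (int) ((packed >>> 8) & 0xFF);
    }

    static int day(long packed) {
        return (int) (packed & 0xFF);
    }
}
```

Because the year occupies the most significant bits, followed by month and day, comparing two packed values as longs gives the same result as comparing the dates. The trade-off is that adding days or computing a difference in days still needs calendar logic, whereas with day numbers those are a plain addition or subtraction.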

Obviously we want to avoid using an internal Java Date type as much as possible 
to get all the standard benefits of vectorization.

In your design, please think about these things:

How to get data into column vectors from ORC in the vectorized InputFormat 
(does ORC support the DATE type? How is it stored?)
How to perform basic operations (comparisons, simple get() functions etc.)
How to perform more complex operations (DateDiff, DateAdd etc.)
How to output a date at the top of the vectorized plan.

Hopefully you can make all of them fast. If some complex ones need to use Java 
Date for calendar management etc., that may be fine.
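
As an illustration of why the day-number representation is attractive for DateDiff and DateAdd, here is a minimal sketch in Java. It assumes dates are stored as day numbers in plain long arrays; the class and method names are hypothetical, and the arrays merely stand in for the contents of a column vector.

```java
// Hypothetical sketch, not Hive code: date arithmetic on day-number longs.
final class DayNumberArithmetic {
    // out[i] = end[i] - start[i], i.e. the difference in whole days.
    static void dateDiff(long[] end, long[] start, long[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = end[i] - start[i];
        }
    }

    // out[i] = dates[i] + days, i.e. each date shifted by a constant number of days.
    static void dateAddDays(long[] dates, long days, long[] out, int n) {
        for (int i = 0; i < n; i++) {
            out[i] = dates[i] + days;
        }
    }
}
```

Both loops are tight, branch-free, and never touch a calendar object. Operations in units of months or years would not reduce to a single addition in this representation.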

This will need to be broken into several patches. The first one might just be 
to be able to read date data in vectors from ORC. The next one might be to 
output date data in an end-to-end query like "select date_column from table" in 
vectorized mode. Then you can implement all the comparisons that just piggy 
back on LongColumnVector VectorExpression operations that have already been 
built. Then you can do additional patches for different groups of related 
operations.
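
To show why comparisons can piggyback on long operations, here is a minimal sketch in Java of a "date column less than constant" filter over a plain long array. It is not the actual Hive VectorExpression code; the class name, method name, and the use of a selection-index array are assumptions made for illustration.

```java
// Hypothetical sketch, not Hive code: filters rows whose date value
// (stored as a long) is less than a constant bound.
final class DateFilterSketch {
    // Writes the indices of qualifying rows into `selected` and
    // returns how many rows qualified.
    static int filterLessThan(long[] vector, int size, long bound, int[] selected) {
        int newSize = 0;
        for (int i = 0; i < size; i++) {
            if (vector[i] < bound) {
                selected[newSize++] = i;
            }
        }
        return newSize;
    }
}
```

Nothing in the loop knows the values are dates, so the same code serves both long-based representations discussed above, as long as the encoding preserves date order.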

Eric

> Implement vectorized support for the DATE data type
> ---------------------------------------------------
>
>                 Key: HIVE-5761
>                 URL: https://issues.apache.org/jira/browse/HIVE-5761
>             Project: Hive
>          Issue Type: Sub-task
>            Reporter: Eric Hanson
>
> Add support to allow queries referencing DATE columns and expression results 
> to run efficiently in vectorized mode. This should re-use the code for the 
> integer/timestamp types to the extent possible and beneficial. Include 
> unit tests and end-to-end tests. Consider re-using or extending existing 
> end-to-end tests for vectorized integer and/or timestamp operations.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
