[ https://issues.apache.org/jira/browse/HIVE-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699820#comment-13699820 ]
Vinod Kumar Vavilapalli commented on HIVE-4160:
-----------------------------------------------

A huge +1 to that. Having a common set of operators will be a huge win.

That said, I already see that the current branch follows Hive's operator base classes, uses HiveConf, etc. I believe that with a little effort, this can be cleaned up and pulled apart into one separate Maven module that everyone can use.

Some points to think about:
 - The target location of the module. The dependency graph can become unwieldy.
 - Given the use of the base Operator, OperatorDesc, etc. from Hive, if there is interest and commitment at all, we should do this ASAP, while we only have a handful of operators.
 - Have one other project demonstrate how it can be reused across ecosystem projects. Pig would be great; just a few operators would be a good start.

Thoughts?

> Vectorized Query Execution in Hive
> ----------------------------------
>
>                 Key: HIVE-4160
>                 URL: https://issues.apache.org/jira/browse/HIVE-4160
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Hive-Vectorized-Query-Execution-Design.docx,
> Hive-Vectorized-Query-Execution-Design-rev2.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.pdf,
> Hive-Vectorized-Query-Execution-Design-rev4.docx,
> Hive-Vectorized-Query-Execution-Design-rev4.pdf,
> Hive-Vectorized-Query-Execution-Design-rev5.docx,
> Hive-Vectorized-Query-Execution-Design-rev5.pdf,
> Hive-Vectorized-Query-Execution-Design-rev6.docx,
> Hive-Vectorized-Query-Execution-Design-rev6.pdf,
> Hive-Vectorized-Query-Execution-Design-rev7.docx,
> Hive-Vectorized-Query-Execution-Design-rev8.docx,
> Hive-Vectorized-Query-Execution-Design-rev8.pdf,
> Hive-Vectorized-Query-Execution-Design-rev9.docx,
> Hive-Vectorized-Query-Execution-Design-rev9.pdf
>
>
> The Hive query execution engine currently processes one row at a time.
> A single row of data goes through all the operators before the next row can be
> processed. This mode of processing is very inefficient in terms of CPU usage.
> Research has demonstrated that this yields very low instructions per cycle
> [MonetDB X100]. Also currently Hive heavily relies on lazy deserialization
> and data columns go through a layer of object inspectors that identify column
> type, deserialize data and determine appropriate expression routines in the
> inner loop. These layers of virtual method calls further slow down the
> processing.
> This work will add support for vectorized query execution to Hive, where,
> instead of individual rows, batches of about a thousand rows at a time are
> processed. Each column in the batch is represented as a vector of a primitive
> data type. The inner loop of execution scans these vectors very fast,
> avoiding method calls, deserialization, unnecessary if-then-else, etc. This
> substantially reduces CPU time used, and gives excellent instructions per
> cycle (i.e. improved processor pipeline utilization). See the attached design
> specification for more details.
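
To make the batch idea in the issue description concrete, here is a minimal Java sketch of a column-at-a-time inner loop next to a row-at-a-time one. It is an illustration only and is not taken from the branch or the design documents: the names VectorizedSketch, LongColumn, Batch, RowExpr, addColumns and addRowAtATime are hypothetical, and the batch size of 1024 is an assumption standing in for "about a thousand rows".

```java
public class VectorizedSketch {
    // Assumed batch size; the description only says "about a thousand rows".
    static final int BATCH_SIZE = 1024;

    /** Hypothetical column vector: one primitive array per column. */
    static final class LongColumn {
        final long[] values = new long[BATCH_SIZE];
    }

    /** Hypothetical batch: two input columns, one output column, and a row count. */
    static final class Batch {
        final LongColumn a = new LongColumn();
        final LongColumn b = new LongColumn();
        final LongColumn out = new LongColumn();
        int size; // number of valid rows in this batch
    }

    /** Vectorized add: one tight loop over primitive arrays, no per-row calls. */
    static void addColumns(Batch batch) {
        long[] a = batch.a.values;
        long[] b = batch.b.values;
        long[] out = batch.out.values;
        int n = batch.size;
        for (int i = 0; i < n; i++) {
            out[i] = a[i] + b[i];
        }
    }

    /** Stand-in for the per-row interface calls of row-at-a-time processing. */
    interface RowExpr {
        long eval(long a, long b);
    }

    /** Row-at-a-time contrast: one interface call for every row. */
    static void addRowAtATime(Batch batch, RowExpr expr) {
        for (int i = 0; i < batch.size; i++) {
            batch.out.values[i] = expr.eval(batch.a.values[i], batch.b.values[i]);
        }
    }

    public static void main(String[] args) {
        Batch batch = new Batch();
        batch.size = BATCH_SIZE;
        for (int i = 0; i < batch.size; i++) {
            batch.a.values[i] = i;
            batch.b.values[i] = 2L * i;
        }
        addColumns(batch);
        System.out.println("out[10] = " + batch.out.values[10]);
    }
}
```

Both methods compute the same result. The difference the description points at is where the work happens: addColumns keeps the whole loop over primitive arrays in one method, while addRowAtATime pays for a call per row, which is a simplified stand-in for the object inspector and deserialization layers mentioned above.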