[ https://issues.apache.org/jira/browse/HIVE-4160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13702513#comment-13702513 ]
Dmitriy V. Ryaboy commented on HIVE-4160:
-----------------------------------------

I would like to provide the same vectorization benefits to Pig and similar frameworks (possibly Cascading, and maybe the Spark or Crunch guys will want to use this as well, etc).

> Vectorized Query Execution in Hive
> ----------------------------------
>
>                 Key: HIVE-4160
>                 URL: https://issues.apache.org/jira/browse/HIVE-4160
>             Project: Hive
>          Issue Type: New Feature
>            Reporter: Jitendra Nath Pandey
>            Assignee: Jitendra Nath Pandey
>         Attachments: Hive-Vectorized-Query-Execution-Design.docx,
> Hive-Vectorized-Query-Execution-Design-rev2.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.docx,
> Hive-Vectorized-Query-Execution-Design-rev3.pdf,
> Hive-Vectorized-Query-Execution-Design-rev4.docx,
> Hive-Vectorized-Query-Execution-Design-rev4.pdf,
> Hive-Vectorized-Query-Execution-Design-rev5.docx,
> Hive-Vectorized-Query-Execution-Design-rev5.pdf,
> Hive-Vectorized-Query-Execution-Design-rev6.docx,
> Hive-Vectorized-Query-Execution-Design-rev6.pdf,
> Hive-Vectorized-Query-Execution-Design-rev7.docx,
> Hive-Vectorized-Query-Execution-Design-rev8.docx,
> Hive-Vectorized-Query-Execution-Design-rev8.pdf,
> Hive-Vectorized-Query-Execution-Design-rev9.docx,
> Hive-Vectorized-Query-Execution-Design-rev9.pdf
>
>
> The Hive query execution engine currently processes one row at a time. A
> single row of data goes through all the operators before the next row can be
> processed. This mode of processing is very inefficient in terms of CPU usage.
> Research has demonstrated that this yields very low instructions per cycle
> [MonetDB X100]. Also currently Hive heavily relies on lazy deserialization
> and data columns go through a layer of object inspectors that identify column
> type, deserialize data and determine appropriate expression routines in the
> inner loop. These layers of virtual method calls further slow down the
> processing.
> This work will add support for vectorized query execution to Hive, where,
> instead of individual rows, batches of about a thousand rows at a time are
> processed. Each column in the batch is represented as a vector of a primitive
> data type. The inner loop of execution scans these vectors very fast,
> avoiding method calls, deserialization, unnecessary if-then-else, etc. This
> substantially reduces CPU time used, and gives excellent instructions per
> cycle (i.e. improved processor pipeline utilization). See the attached design
> specification for more details.
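
For readers who want a concrete picture of the batch-at-a-time model in the issue description, here is a minimal Java sketch. It is not taken from the Hive patch or the attached design documents. The class name VectorizedBatchSketch, the methods addColumns, sum and sumOfAddition, and the batch size of 1024 are all hypothetical, chosen only to illustrate "batches of about a thousand rows" where each column is a vector of a primitive type and the inner loop avoids per-row method calls.

```java
// Illustrative sketch only: names and sizes are assumptions, not Hive's actual API.
final class VectorizedBatchSketch {

    // Assumption: the description says "about a thousand rows"; 1024 is an arbitrary choice.
    static final int BATCH_SIZE = 1024;

    // Vectorized expression: out[i] = left[i] + right[i] for the first `size` rows.
    // The loop body touches only primitive arrays, with no per-row virtual calls.
    static void addColumns(long[] left, long[] right, long[] out, int size) {
        for (int i = 0; i < size; i++) {
            out[i] = left[i] + right[i];
        }
    }

    // Vectorized aggregate: sums the first `size` entries of one column vector.
    static long sum(long[] column, int size) {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += column[i];
        }
        return total;
    }

    // Computes SUM(a + b) over two equally long columns, one batch at a time.
    // Each operator runs over a whole batch before the next operator starts.
    static long sumOfAddition(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("columns must have the same length");
        }
        // Batch buffers are allocated once and reused for every batch.
        long[] left = new long[BATCH_SIZE];
        long[] right = new long[BATCH_SIZE];
        long[] out = new long[BATCH_SIZE];
        long total = 0;
        for (int start = 0; start < a.length; start += BATCH_SIZE) {
            // The last batch may be only partially filled.
            int size = Math.min(BATCH_SIZE, a.length - start);
            System.arraycopy(a, start, left, 0, size);
            System.arraycopy(b, start, right, 0, size);
            addColumns(left, right, out, size);
            total += sum(out, size);
        }
        return total;
    }

    public static void main(String[] args) {
        int rows = 2500; // spans two full batches plus one partial batch
        long[] a = new long[rows];
        long[] b = new long[rows];
        for (int i = 0; i < rows; i++) {
            a[i] = i;
            b[i] = 2L * i;
        }
        System.out.println(sumOfAddition(a, b));
    }
}
```

The point of the sketch is the shape of the loops: each operator is a tight loop over primitive arrays, so per-row dispatch and deserialization are kept out of the inner loop. A real engine would also have to handle nulls, row selection and non-numeric types, which this sketch leaves out.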