Prasanth J created HIVE-5324:
--------------------------------

             Summary: Extend record writer interface, ORC reader/writer 
interfaces to provide statistics
                 Key: HIVE-5324
                 URL: https://issues.apache.org/jira/browse/HIVE-5324
             Project: Hive
          Issue Type: New Feature
    Affects Versions: 0.13.0
            Reporter: Prasanth J
            Assignee: Prasanth J
             Fix For: 0.13.0


Statistics (number of rows and raw data size) are currently computed for 
every single row processed. The processOp() method in FileSinkOperator gets 
the raw data size of each row from the serde and accumulates it in a hash map 
while counting the number of rows. These accumulated statistics are then 
published to the metastore. 
ORC, however, already stores enough statistics internally, and these can be 
used when publishing the stats to the metastore. Doing so avoids duplicating 
the work that happens in processOp(). Getting the statistics directly from 
ORC is also very cheap, since they can be read directly from the file footer.
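
To illustrate the idea, here is a minimal sketch in Java. It is not Hive code: 
every type and method name below (StatsSketch, Stats, StatsProvidingWriter, 
FooterTrackingWriter, perRowStats, writerProvidedStats) is hypothetical, and 
the "raw data size" is simply assumed to be the UTF-8 byte length of a row. 
It contrasts a sink that accumulates statistics itself for every row with a 
sink that asks the writer once, after all rows are written.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Hypothetical sketch; none of these types are Hive's actual classes. */
final class StatsSketch {

    /** The two statistics named in this issue. */
    static final class Stats {
        final long rowCount;
        final long rawDataSize;

        Stats(long rowCount, long rawDataSize) {
            this.rowCount = rowCount;
            this.rawDataSize = rawDataSize;
        }
    }

    /** Assumed extension point: a writer that can report its own statistics. */
    interface StatsProvidingWriter {
        void write(String row);
        Stats getStats();
    }

    /** Stand-in for a format that already tracks statistics while writing. */
    static final class FooterTrackingWriter implements StatsProvidingWriter {
        private long rows;
        private long bytes;

        @Override
        public void write(String row) {
            rows++;
            bytes += row.getBytes(StandardCharsets.UTF_8).length;
        }

        @Override
        public Stats getStats() {
            return new Stats(rows, bytes);
        }
    }

    /** Per-row approach: the sink measures and accumulates for every row. */
    static Stats perRowStats(List<String> rows) {
        long count = 0;
        long size = 0;
        for (String row : rows) {
            size += row.getBytes(StandardCharsets.UTF_8).length;
            count++;
        }
        return new Stats(count, size);
    }

    /** Proposed approach: the sink only writes, then asks the writer once. */
    static Stats writerProvidedStats(List<String> rows, StatsProvidingWriter writer) {
        for (String row : rows) {
            writer.write(row);
        }
        return writer.getStats();
    }
}
```

Both paths should yield the same numbers for the same rows. The difference 
is that in the second path the sink does no bookkeeping of its own and relies 
on what the writer has already recorded.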

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira