Prasanth J created HIVE-5324:
--------------------------------
Summary: Extend record writer interface, ORC reader/writer interfaces to provide statistics
Key: HIVE-5324
URL: https://issues.apache.org/jira/browse/HIVE-5324
Project: Hive
Issue Type: New Feature
Affects Versions: 0.13.0
Reporter: Prasanth J
Assignee: Prasanth J
Fix For: 0.13.0

The current implementation computes statistics (number of rows and raw
data size) for every single row processed. The processOp() method in
FileSinkOperator gets the raw data size of each row from the serde and
accumulates it in a hashmap while counting the number of rows. These
accumulated statistics are then published to the metastore.
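
To make the cost concrete, here is a minimal sketch of that per-row bookkeeping. It is not Hive code: PerRowStatsSink, process_row, the "part=0" key, and the string-length size estimate are all hypothetical stand-ins for FileSinkOperator, processOp(), a partition key, and the serde's raw data size.

```python
from dataclasses import dataclass


@dataclass
class Stat:
    num_rows: int = 0
    raw_data_size: int = 0


class PerRowStatsSink:
    """Hypothetical stand-in for the per-row accumulation described above."""

    def __init__(self):
        self.stats = {}        # hypothetical key (e.g. a partition) -> Stat
        self.size_lookups = 0  # how many times the "serde" was asked for a size

    def process_row(self, key, row):
        # Assumption: the raw data size is approximated by the length of each
        # column's string form; the real serde computes this differently.
        self.size_lookups += 1
        size = sum(len(str(col)) for col in row)
        stat = self.stats.setdefault(key, Stat())
        stat.num_rows += 1
        stat.raw_data_size += size


sink = PerRowStatsSink()
for row in [("a", 1), ("bb", 22), ("ccc", 333)]:
    sink.process_row("part=0", row)
```

The number of size lookups grows with the number of rows, which is the overhead this issue is about.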

In the case of ORC, the file format already stores enough statistics
internally, and these can be used when publishing stats to the metastore.
This avoids duplicating that work in processOp(). Getting the statistics
directly from ORC is also very cheap, since they can be read straight from
the file footer.
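
One possible shape for the proposed extension, sketched in Python rather than against the real Hive interfaces: a writer may optionally expose its own statistics, and the sink only falls back to per-row accounting when it does not. WriterStats, FooterStatsWriter, PlainWriter, get_stats, and collect_stats are assumed names for illustration, and the size estimate is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriterStats:
    row_count: int
    raw_data_size: int


def raw_size(row):
    # Placeholder size estimate; a real serde would report this.
    return sum(len(str(col)) for col in row)


class FooterStatsWriter:
    """Toy writer that tracks stats as a side effect of writing,
    loosely modelling what ORC keeps for its file footer."""

    def __init__(self):
        self._rows = 0
        self._size = 0

    def write(self, row):
        self._rows += 1
        self._size += raw_size(row)

    def close(self):
        pass

    def get_stats(self):
        return WriterStats(self._rows, self._size)


class PlainWriter:
    """Toy writer with no statistics support."""

    def write(self, row):
        pass

    def close(self):
        pass


def collect_stats(writer, rows):
    # Ask the writer for stats if it offers them; otherwise count per row.
    provides = callable(getattr(writer, "get_stats", None))
    count = size = 0
    for row in rows:
        writer.write(row)
        if not provides:
            count += 1
            size += raw_size(row)
    writer.close()
    return writer.get_stats() if provides else WriterStats(count, size)


rows = [("a", 1), ("bb", 22), ("ccc", 333)]
from_writer = collect_stats(FooterStatsWriter(), rows)
from_fallback = collect_stats(PlainWriter(), rows)
```

Both paths produce the same numbers; the difference is that the sink does no extra per-row work when the writer already maintains the statistics.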