[
https://issues.apache.org/jira/browse/HIVE-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13041897#comment-13041897
]
[email protected] commented on HIVE-2185:
-----------------------------------------------------
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/785/#review725
-----------------------------------------------------------
trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1468>
For a better debugging info, print out the key and the valid stats keys.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java
<https://reviews.apache.org/r/785/#comment1492>
currentValue + amount will result in object creation. This is very
expensive in the this case since this function is called for every input row.
Instead of using immutable class Long, LongWritable maybe a better choice.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1501>
Also consider using LongWritable rather than Long.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1502>
LongWritable.
trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java
<https://reviews.apache.org/r/785/#comment1503>
Can you print the stack trace to LOG rather than to console?
trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java
<https://reviews.apache.org/r/785/#comment1504>
declaration should be interface List rather than implementation ArrayList.
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1506>
Better use Map rather than HashMap in declaration
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
<https://reviews.apache.org/r/785/#comment1507>
Can you change it to use Utilities.executeWithRetry() as well?
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
<https://reviews.apache.org/r/785/#comment1508>
Let's also put the comment here as in other statements.
trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out
<https://reviews.apache.org/r/785/#comment1509>
The uncompressed size is smaller than the totalSize. Can you double check
if this is because of the overhead (headers etc) in the fileformat or because
of a bug in the stats?
trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java
<https://reviews.apache.org/r/785/#comment1510>
Please add some comments here on what it is used for.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java
<https://reviews.apache.org/r/785/#comment1511>
Would it make sense to add the size of field delimiters as well? And if we
know the record delimiters (for most record reader it is a newline), we can add
that too. This will make the stats more accurately reflect the real
uncompressed size stored in the file.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java
<https://reviews.apache.org/r/785/#comment1512>
may be simpler just use == rather than ! and ^. Also consider assert
rather than returning null?
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
<https://reviews.apache.org/r/785/#comment1513>
same as above.
- Ning
On 2011-05-26 21:27:34, Tomasz Nykiel wrote:
bq.
bq. -----------------------------------------------------------
bq. This is an automatically generated e-mail. To reply, visit:
bq. https://reviews.apache.org/r/785/
bq. -----------------------------------------------------------
bq.
bq. (Updated 2011-05-26 21:27:34)
bq.
bq.
bq. Review request for hive.
bq.
bq.
bq. Summary
bq. -------
bq.
bq. Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we
collect statistics about the number of rows per partition/table.
bq. Other statistics (e.g., total table/partition size) are derived from the
file system.
bq.
bq. We introduce a new feature for collecting information about the sizes of
uncompressed data, to be able to determine the efficiency of compression.
bq. On top of adding the new statistic collected, this patch extends the stats
collection mechanism, so any new statistics could be added easily.
bq.
bq. 1. serializer/deserializer classes are amended to accommodate collecting
sizes of uncompressed data, when serializing/deserializing objects.
bq. We support:
bq.
bq. Columnar SerDe
bq. LazySimpleSerDe
bq. LazyBinarySerDe
bq.
bq. For other SerDe classes the uncompressed siez will be 0.
bq.
bq. 2. StatsPublisher / StatsAggregator interfaces are extended to support
multi-stats collection for both JDBC and HBase.
bq.
bq. 3. For both INSERT OVERWRITE and ANALYZE statements, FileSinkOperator and
TableScanOperator respectively are extended to support multi-stats collection.
bq.
bq. (2) and (3) enable easy extension for other types of statistics.
bq.
bq. 4. Collecting uncompressed size can be disabled by setting:
bq.
bq. hive.stats.collect.uncompressedsize = false
bq.
bq.
bq. This addresses bug HIVE-2185.
bq. https://issues.apache.org/jira/browse/HIVE-2185
bq.
bq.
bq. Diffs
bq. -----
bq.
bq. trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 1128070
bq.
trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
1128070
bq.
trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/TypedBytesSerDe.java
1128070
bq.
trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/s3/S3LogDeserializer.java
1128070
bq.
trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java
1128070
bq.
trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsAggregator.java
1128070
bq.
trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsPublisher.java
1128070
bq.
trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsSetupConstants.java
1128070
bq.
trunk/hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStatsUtils.java
PRE-CREATION
bq. trunk/hbase-handler/src/test/queries/hbase_stats.q 1128070
bq. trunk/hbase-handler/src/test/queries/hbase_stats2.q PRE-CREATION
bq. trunk/hbase-handler/src/test/results/hbase_stats.q.out 1128070
bq. trunk/hbase-handler/src/test/results/hbase_stats2.q.out PRE-CREATION
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/MapOperator.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/Stat.java 1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/StatsTask.java 1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/metadata/VirtualColumn.java
1128070
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ColumnPrunerProcFactory.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsAggregator.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsPublisher.java
1128070
bq. trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/StatsSetupConst.java
1128070
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsAggregator.java
1128070
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsPublisher.java
1128070
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsSetupConstants.java
1128070
bq.
trunk/ql/src/java/org/apache/hadoop/hive/ql/stats/jdbc/JDBCStatsUtils.java
PRE-CREATION
bq. trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisher.java
1128070
bq.
trunk/ql/src/test/org/apache/hadoop/hive/ql/exec/TestStatsPublisherEnhanced.java
PRE-CREATION
bq. trunk/ql/src/test/org/apache/hadoop/hive/serde2/TestSerDe.java 1128070
bq. trunk/ql/src/test/queries/clientpositive/stats14.q PRE-CREATION
bq. trunk/ql/src/test/queries/clientpositive/stats15.q PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin1.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin2.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin3.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin4.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/bucketmapjoin5.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/combine2.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/filter_join_breaktask.q.out
1128070
bq. trunk/ql/src/test/results/clientpositive/join_map_ppr.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/merge3.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/merge4.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/pcr.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/sample10.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/stats11.q.out 1128070
bq. trunk/ql/src/test/results/clientpositive/stats14.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/stats15.q.out PRE-CREATION
bq. trunk/ql/src/test/results/clientpositive/union22.q.out 1128070
bq. trunk/serde/src/java/org/apache/hadoop/hive/serde2/Deserializer.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/MetadataTypedColumnsetSerDe.java
1128070
bq. trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStats.java
PRE-CREATION
bq. trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeStatsStruct.java
PRE-CREATION
bq. trunk/serde/src/java/org/apache/hadoop/hive/serde2/Serializer.java
1128070
bq. trunk/serde/src/java/org/apache/hadoop/hive/serde2/TypedSerDe.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarSerDe.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/columnar/ColumnarStruct.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/dynamic_type/DynamicSerDe.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java
1128070
bq. trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyStruct.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryStruct.java
1128070
bq.
trunk/serde/src/java/org/apache/hadoop/hive/serde2/thrift/ThriftDeserializer.java
1128070
bq. trunk/serde/src/test/org/apache/hadoop/hive/serde2/TestStatsSerde.java
PRE-CREATION
bq.
bq. Diff: https://reviews.apache.org/r/785/diff
bq.
bq.
bq. Testing
bq. -------
bq.
bq. - additional JUnit test for Serializer/Deserializer amended classes
bq. - additional queries for TestCliDriver over multi-partition tables
bq. - all other JUnit tests
bq. - standalone setup
bq.
bq.
bq. Thanks,
bq.
bq. Tomasz
bq.
bq.
> extend table statistics to store the size of uncompressed data (+extend
> interfaces for collecting other types of statistics)
> ----------------------------------------------------------------------------------------------------------------------------
>
> Key: HIVE-2185
> URL: https://issues.apache.org/jira/browse/HIVE-2185
> Project: Hive
> Issue Type: New Feature
> Components: Serializers/Deserializers, Statistics
> Reporter: Tomasz Nykiel
> Assignee: Tomasz Nykiel
> Attachments: HIVE-2185.1.patch, HIVE-2185.patch
>
>
> Currently, when executing INSERT OVERWRITE and ANALYZE TABLE commands we
> collect statistics about the number of rows per partition/table. Other
> statistics (e.g., total table/partition size) are derived from the file
> system.
> Here, we want to collect information about the sizes of uncompressed data, to
> be able to determine the efficiency of compression.
> Currently, a large part of statistics collection mechanism is hardcoded and
> not-easily extensible for other statistics.
> On top of adding the new statistic collected, it would be desirable to extend
> the collection mechanism, so any new statistics could be added easily.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira