[ https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073958#comment-14073958 ]
Carter Shanklin commented on HIVE-6584:
---------------------------------------

I tested the .12 version of this patch on a 20-node cluster to see what sort of performance gains might be expected. I did a YCSB load of 180m rows and ran a few simple SQL queries in Hive while simultaneously running a YCSB 32-thread workload.

TL;DR: the snapshot approach provides a nice performance boost of about 2.5x across different types of queries. The more fields I queried, the better the performance was.

|Query|Run|Workload|Snapshot Time (s)|Direct Time (s)|Time X Factor|
|count(*)|1|a|191.019|488.915|2.56x|
|count(*)|2|a|200.641|480.837|2.40x|
|Aggregate 1 field|1|a|214.452|499.304|2.33x|
|Aggregate 1 field|2|a|217.744|500.07|2.30x|
|Aggregate 9 fields|1|a|281.514|802.799|2.85x|
|Aggregate 9 fields|2|a|272.358|785.816|2.89x|
|Aggregate 1 with GBY|1|a|248.874|558.143|2.24x|
|Aggregate 1 with GBY|2|a|269.658|533.562|1.98x|
|count(*)|1|b|194.739|482.261|2.48x|
|count(*)|2|b|195.178|481.437|2.47x|
|Aggregate 1 field|1|b|220.325|498.956|2.26x|
|Aggregate 1 field|2|b|227.117|489.27|2.15x|
|Aggregate 9 fields|1|b|276.939|817.118|2.95x|
|Aggregate 9 fields|2|b|290.288|876.753|3.02x|
|Aggregate 1 with GBY|1|b|244.025|563.884|2.31x|
|Aggregate 1 with GBY|2|b|225.431|570.723|2.53x|
|count(*)|1|c|194.568|502.79|2.58x|
|count(*)|2|c|205.418|508.319|2.47x|
|Aggregate 1 field|1|c|209.709|531.39|2.53x|
|Aggregate 1 field|2|c|217.551|526.878|2.42x|
|Aggregate 9 fields|1|c|267.93|756.476|2.82x|
|Aggregate 9 fields|2|c|273.107|723.459|2.65x|
|Aggregate 1 with GBY|1|c|240.991|526.053|2.18x|
|Aggregate 1 with GBY|2|c|258.06|527.845|2.05x|

For those not familiar with YCSB, it uses a table with 9 fields, each filled with random junk 100 characters long. It defines workloads A-F, of which I've used A-C. The main point to note is that the more fields my query fetches, the better it works in snapshot mode.

The other thing I measured was throughput as reported by the YCSB tool.
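For anyone who wants to recompute the "Time X Factor" column: it appears to be the direct time divided by the snapshot time. The comment does not state the formula, so treat that as an assumption, although it matches the reported figures. A minimal Python sketch using three rows copied from the table above (the `speedup` helper is a name made up for illustration):

```python
def speedup(direct_s: float, snapshot_s: float) -> float:
    """Return how many times faster the snapshot run was than the direct run.

    Assumes the X factor is direct time / snapshot time (not stated in the comment).
    """
    return direct_s / snapshot_s


# (query, workload, snapshot time in s, direct time in s), copied from the table
rows = [
    ("count(*)", "a", 191.019, 488.915),
    ("Aggregate 9 fields", "b", 290.288, 876.753),
    ("Aggregate 1 with GBY", "a", 269.658, 533.562),
]

for query, workload, snapshot_s, direct_s in rows:
    print(f"{query} (workload {workload}): {speedup(direct_s, snapshot_s):.2f}x")
```

These three rows are the first row of the table plus its largest and smallest factors, so they bracket the "about 2.5x" summary.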
For the most part, when running the query over a snapshot the throughput was much better.

|Workload|Tput Snapshot|Tput Direct|Throughput Improvement (Snapshot)|
|a|83443.11623|56267.34148|48.30%|
|b|45709.15011|31224.30376|46.39%|
|c|46634.58415|43224.86383|7.89%|

The throughput when using the snapshot seems to be close to the throughput when not scanning data, but I didn't run the baseline tests long enough to get anything conclusive here.

In any event this looks like a good patch, especially considering its small size. The numbers quoted here are for reference only, YMMV, etc.

> Add HiveHBaseTableSnapshotInputFormat
> -------------------------------------
>
>                 Key: HIVE-6584
>                 URL: https://issues.apache.org/jira/browse/HIVE-6584
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: Nick Dimiduk
>            Assignee: Nick Dimiduk
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, 
> HIVE-6584.10.patch, HIVE-6584.11.patch, HIVE-6584.12.patch, 
> HIVE-6584.2.patch, HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch, 
> HIVE-6584.6.patch, HIVE-6584.7.patch, HIVE-6584.8.patch, HIVE-6584.9.patch
>
>
> HBASE-8369 provided mapreduce support for reading from HBase table snapshots. 
> This allows a MR job to consume a stable, read-only view of an HBase table 
> directly off of HDFS. Bypassing the online region server API provides a nice 
> performance boost for the full scan. HBASE-10642 is backporting that feature 
> to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's 
> available, we should add an input format. A follow-on patch could work out 
> how to integrate this functionality into the StorageHandler, similar to how 
> HIVE-6473 integrates the HFileOutputFormat into existing table definitions.

--
This message was sent by Atlassian JIRA
(v6.2#6252)