[jira] [Commented] (HIVE-6584) Add HiveHBaseTableSnapshotInputFormat

Carter Shanklin (JIRA) Thu, 24 Jul 2014 19:02:21 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-6584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14073958#comment-14073958
 ]


Carter Shanklin commented on HIVE-6584:
---------------------------------------

I tested version 12 of this patch (HIVE-6584.12.patch) on a 20-node cluster to 
see what sort of performance gains might be expected.

I did a YCSB load of 180 million rows and ran a few simple SQL queries in Hive 
while simultaneously running a YCSB 32-thread workload.

TL;DR: the snapshot approach provides a nice performance boost of about 2.5x 
across different types of queries. The more fields I queried, the larger the 
speedup was.

|Query|Run|Workload|Snapshot Time (s)|Direct Time (s)|Speedup (Direct / Snapshot)|
|count(*)|1|a|191.019|488.915|2.56x|
|count(*)|2|a|200.641|480.837|2.40x|
|Aggregate 1 field|1|a|214.452|499.304|2.33x|
|Aggregate 1 field|2|a|217.744|500.07|2.30x|
|Aggregate 9 fields|1|a|281.514|802.799|2.85x|
|Aggregate 9 fields|2|a|272.358|785.816|2.89x|
|Aggregate 1 with GBY|1|a|248.874|558.143|2.24x|
|Aggregate 1 with GBY|2|a|269.658|533.562|1.98x|
|count(*)|1|b|194.739|482.261|2.48x|
|count(*)|2|b|195.178|481.437|2.47x|
|Aggregate 1 field|1|b|220.325|498.956|2.26x|
|Aggregate 1 field|2|b|227.117|489.27|2.15x|
|Aggregate 9 fields|1|b|276.939|817.118|2.95x|
|Aggregate 9 fields|2|b|290.288|876.753|3.02x|
|Aggregate 1 with GBY|1|b|244.025|563.884|2.31x|
|Aggregate 1 with GBY|2|b|225.431|570.723|2.53x|
|count(*)|1|c|194.568|502.79|2.58x|
|count(*)|2|c|205.418|508.319|2.47x|
|Aggregate 1 field|1|c|209.709|531.39|2.53x|
|Aggregate 1 field|2|c|217.551|526.878|2.42x|
|Aggregate 9 fields|1|c|267.93|756.476|2.82x|
|Aggregate 9 fields|2|c|273.107|723.459|2.65x|
|Aggregate 1 with GBY|1|c|240.991|526.053|2.18x|
|Aggregate 1 with GBY|2|c|258.06|527.845|2.05x|

For those not familiar with YCSB it uses a table with 9 fields, each filled 
with random junk 100 characters long. It defines workloads A-F, of which I've 
used A-C.

The main point to note is that the more fields my query fetches, the bigger 
the gain from snapshot mode.
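
For anyone who wants to check the arithmetic, here is a minimal Python sketch (not part of the patch or of the test setup) that recomputes the last column of the timing table as direct time divided by snapshot time, then averages it per query type. The names `ROWS`, `speedup` and `mean_speedup_by_query` are made up for illustration; the numbers are copied from the table above.

```python
from collections import defaultdict

# (query, run, workload, snapshot_seconds, direct_seconds), copied from the table.
ROWS = [
    ("count(*)", 1, "a", 191.019, 488.915),
    ("count(*)", 2, "a", 200.641, 480.837),
    ("Aggregate 1 field", 1, "a", 214.452, 499.304),
    ("Aggregate 1 field", 2, "a", 217.744, 500.07),
    ("Aggregate 9 fields", 1, "a", 281.514, 802.799),
    ("Aggregate 9 fields", 2, "a", 272.358, 785.816),
    ("Aggregate 1 with GBY", 1, "a", 248.874, 558.143),
    ("Aggregate 1 with GBY", 2, "a", 269.658, 533.562),
    ("count(*)", 1, "b", 194.739, 482.261),
    ("count(*)", 2, "b", 195.178, 481.437),
    ("Aggregate 1 field", 1, "b", 220.325, 498.956),
    ("Aggregate 1 field", 2, "b", 227.117, 489.27),
    ("Aggregate 9 fields", 1, "b", 276.939, 817.118),
    ("Aggregate 9 fields", 2, "b", 290.288, 876.753),
    ("Aggregate 1 with GBY", 1, "b", 244.025, 563.884),
    ("Aggregate 1 with GBY", 2, "b", 225.431, 570.723),
    ("count(*)", 1, "c", 194.568, 502.79),
    ("count(*)", 2, "c", 205.418, 508.319),
    ("Aggregate 1 field", 1, "c", 209.709, 531.39),
    ("Aggregate 1 field", 2, "c", 217.551, 526.878),
    ("Aggregate 9 fields", 1, "c", 267.93, 756.476),
    ("Aggregate 9 fields", 2, "c", 273.107, 723.459),
    ("Aggregate 1 with GBY", 1, "c", 240.991, 526.053),
    ("Aggregate 1 with GBY", 2, "c", 258.06, 527.845),
]


def speedup(snapshot_s, direct_s):
    """How many times faster the snapshot run was than the direct run."""
    return direct_s / snapshot_s


def mean_speedup_by_query(rows):
    """Average the speedup over all runs and workloads for each query type."""
    factors = defaultdict(list)
    for query, _run, _workload, snapshot_s, direct_s in rows:
        factors[query].append(speedup(snapshot_s, direct_s))
    return {query: sum(vals) / len(vals) for query, vals in factors.items()}


if __name__ == "__main__":
    means = mean_speedup_by_query(ROWS)
    # Print query types from smallest to largest average speedup.
    for query, mean in sorted(means.items(), key=lambda kv: kv[1]):
        print(f"{query:22s} {mean:.2f}x")
```

On these numbers the 9-field aggregate comes out with the largest average factor (roughly 2.9x) and the GROUP BY query with the smallest (roughly 2.2x), which is consistent with the point above. With only two runs per cell, the averages are indicative at best.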

The other thing I measured was throughput as reported by the YCSB tool. For the 
most part, when running the query over a snapshot the throughput was much 
better.

|Workload|Throughput Snapshot (ops/sec)|Throughput Direct (ops/sec)|Throughput Improvement (Snapshot)|
|a|83443.11623|56267.34148|48.30%|
|b|45709.15011|31224.30376|46.39%|
|c|46634.58415|43224.86383|7.89%|
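
The improvement column in the table above is the relative gain of the snapshot throughput over the direct throughput. A small Python sketch of that calculation follows, using the figures from the table; the names `improvement_pct` and `THROUGHPUT` are illustrative only and do not come from YCSB or the patch.

```python
# workload -> (throughput with snapshot scan, throughput with direct scan),
# copied from the table above.
THROUGHPUT = {
    "a": (83443.11623, 56267.34148),
    "b": (45709.15011, 31224.30376),
    "c": (46634.58415, 43224.86383),
}


def improvement_pct(snapshot_tput, direct_tput):
    """Relative throughput gain of the snapshot run over the direct run, in percent."""
    return (snapshot_tput / direct_tput - 1.0) * 100.0


if __name__ == "__main__":
    for workload, (snap, direct) in sorted(THROUGHPUT.items()):
        print(f"workload {workload}: {improvement_pct(snap, direct):.2f}%")
```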

The throughput when using the snapshot seems to be close to the throughput when 
not scanning data, but I didn't run the baseline tests long enough to get 
anything conclusive here.

In any event this looks like a good patch, especially considering its small 
size.

The numbers quoted here are for reference only, YMMV, etc.

> Add HiveHBaseTableSnapshotInputFormat
> -------------------------------------
>
>                 Key: HIVE-6584
>                 URL: https://issues.apache.org/jira/browse/HIVE-6584
>             Project: Hive
>          Issue Type: Improvement
>          Components: HBase Handler
>            Reporter: Nick Dimiduk
>            Assignee: Nick Dimiduk
>             Fix For: 0.14.0
>
>         Attachments: HIVE-6584.0.patch, HIVE-6584.1.patch, 
> HIVE-6584.10.patch, HIVE-6584.11.patch, HIVE-6584.12.patch, 
> HIVE-6584.2.patch, HIVE-6584.3.patch, HIVE-6584.4.patch, HIVE-6584.5.patch, 
> HIVE-6584.6.patch, HIVE-6584.7.patch, HIVE-6584.8.patch, HIVE-6584.9.patch
>
>
> HBASE-8369 provided mapreduce support for reading from HBase table snapshots. 
> This allows a MR job to consume a stable, read-only view of an HBase table 
> directly off of HDFS. Bypassing the online region server API provides a nice 
> performance boost for the full scan. HBASE-10642 is backporting that feature 
> to 0.94/0.96 and also adding a {{mapred}} implementation. Once that's 
> available, we should add an input format. A follow-on patch could work out 
> how to integrate this functionality into the StorageHandler, similar to how 
> HIVE-6473 integrates the HFileOutputFormat into existing table definitions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
