Think about it like this: one system is scanning a local ORC file, while the
other is going through an HBase scanner (over the network) and reading the
data in SSTable format.

On Fri, Jun 9, 2017 at 5:50 AM, Amey Barve <ameybarv...@gmail.com> wrote:

> Hi Michael,
>
> "If there is predicate pushdown, then you will be faster, assuming that
> the query triggers an implied range scan"
> ---> Does this return results faster than plain Hive querying over ORC /
> Text file formats?
>
> In other words, is querying over plain Hive (ORC or Text) *always* faster
> than through the HiveStorageHandler?
>
> Regards,
> Amey
>
> On 9 June 2017 at 15:08, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> The pros: you have the ability to update a table without having to
>> worry about duplication of the row.  Tez is doing some form of
>> compaction for you that already exists in HBase.
>>
>> The cons:
>>
>> 1) It’s slower. Reads from HBase have more overhead than just
>> reading a file.  Read Lars George’s book on what takes place when you do a
>> read.
>>
>> 2) HBase is not a relational store. (You have to think about what that
>> implies)
>>
>> 3) You need to query against your row key for best performance; otherwise
>> it will always be a full table scan.
>>
>> HBase was designed to give you fast access for direct get() and limited
>> range scans.  Otherwise you have to perform full table scans.  This means
>> that unless you’re able to do a range scan, your full table scan will be
>> slower than if you did this on a flat file set.  Again, the reason you
>> would want to use HBase is that your data set is mutable.
>>
>> You also have to trigger a range scan when you write your Hive query, and
>> you have to make sure that you’re querying off your row key.
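>>
>> As a HiveQL sketch of that point (the table and column names here are
>> assumptions, not from an actual setup):
>>
>> ```sql
>> -- Fast path: a predicate on the column mapped to the HBase row key
>> -- can be pushed down as a get() or a limited range scan.
>> SELECT * FROM tx WHERE rowkey = 'txn-0001';
>> SELECT * FROM tx WHERE rowkey >= 'txn-0001' AND rowkey < 'txn-0100';
>>
>> -- Slow path: a predicate on a non-key column forces a full table
>> -- scan through the HBase scanner.
>> SELECT * FROM tx WHERE amount > 100;
>> ```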
>>
>> HBase was designed as a <key,value> store. Plain and simple.  If you
>> don’t use the key, you have to do a full table scan. So even though you are
>> partitioning on row key, you never use your partitions.  However, in Hive
>> or Spark, you can create an alternative partition pattern (e.g. your key is
>> the transaction_id, yet you partition on the month/year portion of the
>> transaction_date).
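>>
>> That alternative partition pattern could look something like this in a
>> native Hive table (names and types are hypothetical):
>>
>> ```sql
>> -- Keyed by transaction_id in the data, but partitioned by the
>> -- year/month of transaction_date.
>> CREATE TABLE transactions (
>>   transaction_id   STRING,
>>   transaction_date DATE,
>>   amount           DECIMAL(10,2)
>> )
>> PARTITIONED BY (tx_year INT, tx_month INT)
>> STORED AS ORC;
>>
>> -- Filtering on the partition columns prunes partitions instead of
>> -- scanning the whole table.
>> SELECT * FROM transactions WHERE tx_year = 2017 AND tx_month = 6;
>> ```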
>>
>> You can speed things up a little by using an inverted table as a
>> secondary index.  However, this assumes that you want to use joins. If you
>> have a single base table with no joins, then you can limit your range scans
>> by making sure you are querying against the row key.  Note: this will
>> mean that you have limited querying capabilities.
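>>
>> A sketch of the inverted-table idea in HiveQL, assuming a hypothetical
>> index table tx_by_customer that maps a secondary value back to the base
>> table’s row key:
>>
>> ```sql
>> -- Base table tx: row key is transaction_id.
>> -- Inverted table tx_by_customer: keyed by customer_id, holds the
>> -- matching transaction_id.  Probe the index first, then join back
>> -- on the row key so the base-table access stays a keyed lookup
>> -- rather than a full scan.
>> SELECT t.*
>> FROM tx_by_customer idx
>> JOIN tx t ON t.transaction_id = idx.transaction_id
>> WHERE idx.customer_id = 'cust-42';
>> ```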
>>
>> And yes, I’ve done this before but can’t share it with you.
>>
>> HTH
>>
>> P.S.
>> I haven’t tried Hive queries where you have what would be the equivalent
>> of a get().
>>
>> In earlier versions of Hive, the issue was that “SELECT * FROM foo WHERE
>> rowkey=BAR” would still do a full table scan because of the lack of
>> predicate pushdown.
>> This may have been fixed in later releases of Hive. That would be your
>> test case.   If there is predicate pushdown, then you will be faster,
>> assuming that the query triggers an implied range scan.
>> This would be a simple thing. However, keep in mind that you’re going to
>> generate a map/reduce job (unless you are using a query engine like Tez)
>> where you wouldn’t if you just wrote your code in Java.
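>>
>> That test case can be sketched with Hive’s HBase integration (the table,
>> column, and column-family names below are assumptions):
>>
>> ```sql
>> -- External Hive table over an existing HBase table 'foo';
>> -- ':key' maps the HBase row key to the Hive column 'rowkey'.
>> CREATE EXTERNAL TABLE foo (
>>   rowkey STRING,
>>   val    STRING
>> )
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:val')
>> TBLPROPERTIES ('hbase.table.name' = 'foo');
>>
>> -- With predicate pushdown this should become a single get() against
>> -- HBase; without it, a full table scan.
>> SELECT * FROM foo WHERE rowkey = 'BAR';
>> ```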
>>
>>
>>
>>
>> > On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan <
>> ramasubramanian.naraya...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Can you please let us know the pros and cons of using an HBase table as
>> an external table in Hive.
>> >
>> > Will there be any performance degradation when using Hive over HBase
>> instead of using a direct Hive table?
>> >
>> > The tables that I am planning to use in HBase will be master tables like
>> account and customer. I want to achieve a Slowly Changing Dimension. Please
>> throw some light on that too if you have done any such implementations.
>> >
>> > Thanks and Regards,
>> > Rams
>>
>>
>