Think about it like this: one system is scanning a local ORC file, while the other is going through an HBase scanner (over the network) and reading the data in SSTable format.
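For concreteness, here is a minimal sketch of the two setups being compared, with hypothetical table and column names (the native table reads local ORC files; the external table reads through the HBase storage handler over the network):

  -- 1) Native Hive table, stored locally as ORC:
  CREATE TABLE accounts_orc (
    account_id STRING,
    balance    DOUBLE
  )
  STORED AS ORC;

  -- 2) External Hive table backed by HBase; account_id is mapped to
  --    the HBase row key, balance to a column in family cf:
  CREATE EXTERNAL TABLE accounts_hbase (
    account_id STRING,
    balance    DOUBLE
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:balance")
  TBLPROPERTIES ("hbase.table.name" = "accounts");

The same query run against each table exercises the two scan paths discussed below.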
On Fri, Jun 9, 2017 at 5:50 AM, Amey Barve <ameybarv...@gmail.com> wrote:
> Hi Michael,
>
> "If there is predicate pushdown, then you will be faster, assuming that
> the query triggers an implied range scan"
> ---> Does this bring results faster than plain Hive querying over ORC /
> Text file formats?
>
> In other words, is querying over plain Hive (ORC or Text) *always* faster
> than through a HiveStorageHandler?
>
> Regards,
> Amey
>
> On 9 June 2017 at 15:08, Michael Segel <msegel_had...@hotmail.com> wrote:
>
>> The pros: you have the ability to update a table without having to
>> worry about duplication of the row. Tez does some form of compaction
>> for you that HBase already has built in.
>>
>> The cons:
>>
>> 1) It's slower. Reads from HBase carry more overhead than just reading
>> a file. Read Lars George's book on what takes place when you do a read.
>>
>> 2) HBase is not a relational store. (You have to think about what that
>> implies.)
>>
>> 3) You need to query against your row key for best performance;
>> otherwise it will always be a complete table scan.
>>
>> HBase was designed to give you fast access for direct get() calls and
>> limited range scans. Otherwise you have to perform full table scans.
>> This means that unless you're able to do a range scan, your full table
>> scan will be slower than if you did this on a flat file set. Again, the
>> reason you would want to use HBase is if your data set is mutable.
>>
>> You also have to trigger a range scan when you write your Hive query,
>> and you have to make sure that you're querying off your row key.
>>
>> HBase was designed as a <key,value> store. Plain and simple. If you
>> don't use the key, you have to do a full table scan. So even though you
>> are partitioning on row key, you never use your partitions. However, in
>> Hive or Spark, you can create an alternative partition pattern. (E.g.
>> your key is the transaction_id, yet you partition on the month/year
>> portion of the transaction_date.)
>>
>> You can speed things up a little by using an inverted table as a
>> secondary index (see the sketch at the bottom of this thread). However,
>> this assumes that you want to use joins. If you have a single base
>> table with no joins, then you can limit your range scans by making sure
>> you are querying against the row key. Note: this will mean that you
>> have limited querying capabilities.
>>
>> And yes, I've done this before but can't share it with you.
>>
>> HTH
>>
>> P.S.
>> I haven't tried Hive queries where you have what would be the
>> equivalent of a get().
>>
>> In earlier versions of Hive, the issue was that "SELECT * FROM foo
>> WHERE rowkey=BAR" would still do a full table scan because of the lack
>> of predicate pushdown. This may have been fixed in later releases of
>> Hive; that would be your test case. If there is predicate pushdown,
>> then you will be faster, assuming that the query triggers an implied
>> range scan. This would be a simple test to write (sketched just below).
>> However, keep in mind that you're going to generate a map/reduce job
>> (unless you're using a query engine like Tez), where you wouldn't if
>> you just wrote your code in Java.
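A concrete version of that test case, reusing the hypothetical tables sketched at the top of this mail:

  -- If the planner pushes the predicate down, this can turn into a
  -- single HBase get() / short range scan, because account_id is
  -- mapped to the HBase row key (:key):
  SELECT * FROM accounts_hbase WHERE account_id = 'ACCT-0042';

  -- This filters on a non-key column, so HBase must scan the entire
  -- table whether or not pushdown happens:
  SELECT * FROM accounts_hbase WHERE balance > 1000.0;

Comparing the runtime of the first query against the same SELECT on accounts_orc is the experiment Michael describes.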
>>
>> > On Jun 7, 2017, at 5:13 AM, Ramasubramanian Narayanan <
>> > ramasubramanian.naraya...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > Can you please let us know the pros and cons of using an HBase table
>> > as an external table in Hive.
>> >
>> > Will there be any performance degradation when using Hive over HBase
>> > instead of using a direct Hive table?
>> >
>> > The tables that I am planning to use in HBase will be master tables,
>> > like account and customer. I want to achieve a Slowly Changing
>> > Dimension. Please throw some light on that too if you have done any
>> > such implementations.
>> >
>> > Thanks and Regards,
>> > Rams
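One way to read the "inverted table as a secondary index" suggestion above, again with hypothetical names. (This is a sketch of the general idea, not Michael's actual design; a real index would usually pack customer_id and account_id together into the row key rather than allow only one account per customer.)

  -- Index table keyed on the attribute you query by, pointing back
  -- at the base table's row key:
  CREATE EXTERNAL TABLE accounts_by_customer (
    customer_id STRING,
    account_id  STRING
  )
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,ref:account_id")
  TBLPROPERTIES ("hbase.table.name" = "accounts_by_customer");

  -- The join narrows base-table access to the row keys found in the
  -- index, instead of forcing a full scan on a non-key column:
  SELECT a.*
  FROM accounts_by_customer i
  JOIN accounts_hbase a ON a.account_id = i.account_id
  WHERE i.customer_id = 'CUST-007';

Whether the join side actually turns into targeted gets depends on the join strategy the engine picks, so this is worth benchmarking rather than assuming.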