There's one issue here specifically for the Hive portion, but a full-stack
system profile is really needed to decide where to attack it:

https://issues.apache.org/jira/browse/HIVE-1231

I don't know of anyone currently working in this area.

JVS

On Mar 8, 2011, at 9:51 PM, Otis Gospodnetic wrote:

> Hi,
> 
> John, are there plans or specific JIRA issues related to this particular 
> performance hit that you or somebody else is working on and that those of us 
> interested in performance improvements when Hive points to external tables in 
> HBase should watch?
> 
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> 
> 
> 
> ----- Original Message ----
>> From: John Sichi <jsi...@fb.com>
>> To: "<user@hive.apache.org>" <user@hive.apache.org>
>> Sent: Tue, March 8, 2011 1:17:51 AM
>> Subject: Re: Performance between Hive queries vs. Hive over HBase queries
>> 
>> For native tables, Hive reads rows directly from HDFS.
>> 
>> For HBase tables, it has to go through the HBase region servers, which 
>> reconstruct rows from column families (combining cache + HDFS).
>> 
>> HBase makes it possible to keep your table up to date in real time, but you 
>> have to pay an overhead cost at query time.
>> 
>> On the other hand, with native Hive tables, there's latency in loading new 
>> batches of data.
>> 
>> JVS
>> 
>> On Mar 7, 2011, at 10:13 PM, Biju Kaimal wrote:
>> 
>>> Hi,
>>> 
>>> Could you please explain the reason for the behavior?
>>> 
>>> Regards,
>>> Biju
>>> 
>>> On Tue, Mar 8, 2011 at 11:35 AM, John Sichi <jsi...@fb.com> wrote:
>>> Yes.
>>> 
>>> JVS
>>> 
>>> On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I loaded a data set which has 1 million rows into both Hive and HBase 
>>>> tables. For the HBase table, I created a corresponding Hive table so that 
>>>> the data in HBase can be queried from Hive QL. Both tables have a key 
>>>> column and a value column.
>>>> 
>>>> For the same query (select value, count(*) from table group by value), 
>>>> the Hive-only query runs much faster (~30 seconds) than the Hive over 
>>>> HBase query (~150 seconds).
>>>> 
>>>> Is this expected?
>>>> 
>>>> Regards,
>>>> Biju
>>> 
>>> 
>> 
>> 
