There's one here specifically for the Hive portion, but really a full-stack system profile is needed for deciding where to attack it:
https://issues.apache.org/jira/browse/HIVE-1231

I don't know of anyone currently working in this area.

JVS

On Mar 8, 2011, at 9:51 PM, Otis Gospodnetic wrote:

> Hi,
>
> John, are there plans or specific JIRA issues related to this particular
> performance hit that you or somebody else is working on and that those of us
> interested in performance improvements when Hive points to external tables in
> HBase should watch?
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message ----
>> From: John Sichi <jsi...@fb.com>
>> To: "<user@hive.apache.org>" <user@hive.apache.org>
>> Sent: Tue, March 8, 2011 1:17:51 AM
>> Subject: Re: Performance between Hive queries vs. Hive over HBase queries
>>
>> For native tables, Hive reads rows directly from HDFS.
>>
>> For HBase tables, it has to go through the HBase region servers, which
>> reconstruct rows from column families (combining cache + HDFS).
>>
>> HBase makes it possible to keep your table up to date in real time, but you
>> have to pay an overhead cost at query time.
>>
>> On the other hand, with native Hive tables, there's latency in loading new
>> batches of data.
>>
>> JVS
>>
>> On Mar 7, 2011, at 10:13 PM, Biju Kaimal wrote:
>>
>>> Hi,
>>>
>>> Could you please explain the reason for the behavior?
>>>
>>> Regards,
>>> Biju
>>>
>>> On Tue, Mar 8, 2011 at 11:35 AM, John Sichi <jsi...@fb.com> wrote:
>>> Yes.
>>>
>>> JVS
>>>
>>> On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote:
>>>
>>>> Hi,
>>>>
>>>> I loaded a data set which has 1 million rows into both Hive and HBase
>>>> tables. For the HBase table, I created a corresponding Hive table so that
>>>> the data in HBase can be queried from Hive QL. Both tables have a key
>>>> column and a value column
>>>>
>>>> For the same query (select value, count(*) from table group by value), the
>>>> Hive only query runs much faster (~ 30 seconds) as compared to Hive over
>>>> HBase (~ 150 seconds).
>>>>
>>>> Is this expected?
>>>>
>>>> Regards,
>>>> Biju