On Mon, May 13, 2013 at 9:34 AM, Nalin Khosla <nalin.kho...@rogers.com> wrote:

> Had a quick question wrt querying HADOOP data:
>
> 1. What tools are available to Query Hadoop data in real time vs batch?
>

The line between real time and batch isn't that clear. We are working on
substantially improving the performance of Hive (
http://www.slideshare.net/Hadoop_Summit/innovations-in-apache-hadoop-mapreduce-pig-hive-for-improving-query-performance).
The better question is whether your data is small enough to fit in RAM on
your cluster. If so, you should look at Shark (
https://amplab.cs.berkeley.edu/2012/11/26/low-latency-sql-queries-at-massive-scale-a-performance-analysis-of-shark/)
or a proprietary MPP database such as Teradata or Impala.
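
As a rough way to answer the "does it fit in RAM" question, here is a
back-of-the-envelope sketch in Python. The function name, the example
cluster, and the 50% usable fraction are all assumptions made up for
illustration; real headroom depends on your workload and configuration.

```python
def fits_in_cluster_ram(data_gb, nodes, ram_per_node_gb, usable_fraction=0.5):
    """Return True if the working set fits in the cluster's usable memory.

    usable_fraction is an assumed allowance for the OS, the JVMs, and
    intermediate results; it is a placeholder, not a measured value.
    """
    if nodes <= 0 or ram_per_node_gb <= 0:
        raise ValueError("nodes and ram_per_node_gb must be positive")
    if not 0 < usable_fraction <= 1:
        raise ValueError("usable_fraction must be in (0, 1]")
    usable_gb = nodes * ram_per_node_gb * usable_fraction
    return data_gb <= usable_gb


# Hypothetical cluster: 20 nodes with 64 GB each, half of it usable for data.
print(fits_in_cluster_ram(data_gb=500, nodes=20, ram_per_node_gb=64))   # 500 GB vs 640 GB usable
print(fits_in_cluster_ram(data_gb=2000, nodes=20, ram_per_node_gb=64))  # 2000 GB vs 640 GB usable
```

If the answer is comfortably yes, an in-memory engine is worth a look; if
not, you are in batch territory regardless of the tool.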


>
> 2. I believe HIVE provides a batch interface, not sure on what tools
> within HIVE support the query capabilities against HADOOP ?
>

Hive currently uses MapReduce to run queries. We plan to extend it to use
Tez, a new Apache project that provides a richer framework for queries.


>
> 3. Besides HIVE, are there any other Query tools to query HADOOP data
> (ad-hoc queries) ?
>

Pig and Cascading are the main open source tools for large data. Shark
handles the smaller ad-hoc queries. Drill plans to fit into the ad-hoc
space, but hasn't made a release yet.

> 4. Finally, what skill set is required to use HIVE or other alternate tools
> ? Can business users use these tools?
>

Hive has a learning curve. Business users will be able to run queries
against the data, but it will take someone with more of an engineering
background to design the table layouts and the update scheme.
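
To make that split concrete, here is a minimal sketch in Python that holds
the two pieces as HiveQL text. The table, its columns, and the helper name
are hypothetical, chosen only for illustration: the CREATE TABLE statement
is the kind of layout decision an engineer would make (partitioning by day),
and the generated SELECT is the kind of query a business user could run.

```python
from datetime import date

# Engineering side: table layout, partitioned by day so a daily query
# reads only one partition. Table and column names are hypothetical.
CREATE_PAGE_VIEWS = """
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (view_date STRING)
"""


def top_urls_query(day, limit=10):
    """Build the kind of HiveQL a business user might run for one day."""
    if not isinstance(day, date):
        raise TypeError("day must be a datetime.date")
    return (
        "SELECT url, COUNT(*) AS views\n"
        "FROM page_views\n"
        f"WHERE view_date = '{day.isoformat()}'\n"
        "GROUP BY url\n"
        "ORDER BY views DESC\n"
        f"LIMIT {int(limit)}"
    )


print(top_urls_query(date(2013, 5, 13)))
```

The query itself is ordinary SQL-style work; knowing that view_date is the
partition column, and filtering on it, is the part that needs the
engineering background.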

-- Owen
