On Mon, May 13, 2013 at 9:34 AM, Nalin Khosla <nalin.kho...@rogers.com> wrote:
> Had a quick question wrt to querying HADOOP data;
>
> 1. What tools are available to Query Hadoop data in real time vs batch?
>

The line between real time and batch isn't that clear. We are working on
substantially speeding up the performance of Hive
(http://www.slideshare.net/Hadoop_Summit/innovations-in-apache-hadoop-mapreduce-pig-hive-for-improving-query-performance).
The better question is whether you have small enough data so that it can
fit in RAM on your cluster. If so, you should look at Shark
(https://amplab.cs.berkeley.edu/2012/11/26/low-latency-sql-queries-at-massive-scale-a-performance-analysis-of-shark/)
or a proprietary MPP database such as Teradata or Impala.

> 2. I believe HIVE provides a batch interface, not sure on what tools
> within HIVE support the query capabilities against HADOOP ?
>

Hive currently uses MapReduce to run the queries. We plan on extending to
use Tez, which is a new Apache project that provides a richer framework for
queries.

> 3. Besides HIVE, are there any other Query tools to query HADOOP data
> (ad-hoc queries) ?
>

Pig and Cascading are the main open source ones for large data. Shark does
the smaller ad-hoc queries. Drill plans to fit into the ad-hoc space, but
hasn't made a release yet.

> 4. Finally, what skill set is required to use HIVE or other alternate tools
> ? Can business users uses these tools?
>

Using Hive requires a learning curve. Business users will be able to run
queries against the data, but it will require someone with more engineering
background to design the table layouts and updating scheme.

-- Owen