Which [open-source] SQL engine atop Hadoop?

Samuel Marks Fri, 30 Jan 2015 03:27:54 -0800

Since Hadoop <https://hadoop.apache.org> came out, there have been various
commercial and open-source attempts to layer some degree of SQL compatibility
on top of it. Obviously, by posting here I am not expecting an unbiased
answer.


I am seeking an SQL-on-Hadoop offering which provides low-latency querying and
supports the most common CRUD operations, including [the basics!] statements
along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE
Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional support
would be nice also, but is not a must-have.
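
As a concrete illustration of the statement set above, here is a minimal
sketch of a probe that runs each statement against a Python DB-API 2.0
connection and records whether the engine accepted it. Everything in it is
an assumption made for illustration: the table name "t", the columns "id"
and "c1", and the helper name probe_crud are hypothetical, and the standard
library's sqlite3 only stands in for whichever engine is being evaluated.

```python
import sqlite3

# (label, SQL) pairs mirroring the statement list above. The table name "t"
# and the columns "id"/"c1" are hypothetical placeholders.
STATEMENTS = [
    ("CREATE TABLE", "CREATE TABLE t (id INTEGER, c1 INTEGER)"),
    ("INSERT INTO", "INSERT INTO t VALUES (1, 1)"),
    ("SELECT", "SELECT * FROM t"),
    ("UPDATE", "UPDATE t SET c1 = 2 WHERE id = 1"),
    ("DELETE FROM", "DELETE FROM t WHERE id = 1"),
    ("DROP TABLE", "DROP TABLE t"),
]


def probe_crud(conn):
    """Run each statement on a DB-API connection; map label -> accepted?"""
    results = {}
    cur = conn.cursor()
    for label, sql in STATEMENTS:
        try:
            cur.execute(sql)
            # Drain any result set so the next statement starts clean.
            if cur.description is not None:
                cur.fetchall()
            results[label] = True
        except Exception:
            # The engine rejected (or does not implement) this statement.
            results[label] = False
    return results


if __name__ == "__main__":
    # sqlite3 is only a stand-in; pass any other DB-API connection instead.
    for label, ok in probe_crud(sqlite3.connect(":memory:")).items():
        print(label, "supported" if ok else "rejected")
```

Swapping in a connection from another DB-API driver should, in principle,
give a quick per-engine checklist. One caveat: on engines that abort the
whole transaction after an error, a single rejected statement may cause the
later ones to be reported as rejected too.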

Essentially I want a full replacement for the more traditional RDBMS, one
which can scale from 1 node to a serious Hadoop cluster.

Python is my language of choice for interfacing, though there does seem to
be a Python JDBC wrapper.

Here is what I've found thus far:

   - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL
   thanks to the Stinger initiative)
   - Apache Drill <http://drill.apache.org> (ANSI SQL support)
   - Apache Spark <https://spark.apache.org> (Spark SQL
   <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
   <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
   or Parquet <http://parquet.io/>)
   - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
   <http://hbase.apache.org>, lacks full transaction
   <http://en.wikipedia.org/wiki/Database_transaction> support, relational
   operators <http://en.wikipedia.org/wiki/Relational_operators> and some
   built-in functions)
   - Cloudera Impala
   <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
   (significant HiveQL support, some SQL language support, no support for
   indexes on its tables, importantly missing DELETE, UPDATE and INTERSECT;
   amongst others)
   - Presto <https://github.com/facebook/presto> from Facebook (can query
   Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
   Doesn't seem to be designed for low-latency responses across small
   clusters, nor to support UPDATE operations. It is optimized for data
   warehousing or analytics¹
   <http://prestodb.io/docs/current/overview/use-cases.html>)
   - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
   community edition <https://www.mapr.com/products/hadoop-download> (seems
   to be a packaging of Hive, HP Vertica
   <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
   Drill and a native ODBC wrapper
   <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
   - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
   interface and multi-dimensional analysis [OLAP
   <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop and
   supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
   Hive and HBase, and seems targeted at very large data-sets, though it
   maintains low query latency)
   - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance
   with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks
   against Hive and Impala
   <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
   - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
   Lingual <http://docs.cascading.org/lingual/1.0/>²
   <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides
   JDBC Drivers, a SQL command shell, and a catalog manager for publishing
   files [or any resource] as schemas and tables.")

Which—from this list or elsewhere—would you recommend, and why?
Thanks for all suggestions,

Samuel Marks
http://linkedin.com/in/samuelmarks
