Re: Which [open-source] SQL engine atop Hadoop?

Samuel Marks Fri, 30 Jan 2015 04:58:00 -0800

HAWQ is pretty nifty: it offers full SQL compliance (ANSI SQL-92) and exposes
both JDBC and ODBC interfaces. However, although Pivotal does open-source a
lot of software <http://www.pivotal.io/oss>, I don't believe they
open-source Pivotal HD: HAWQ.


So that doesn't meet my requirements. I should note that the project I am
building will also be open-source, which makes it all the more important
that every component is open-source too.

Cheers,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
siddharth.tiw...@live.com> wrote:

> Have you looked at HAWQ from Pivotal ?
>
> Sent from my iPhone
>
> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelma...@gmail.com> wrote:
>
> Since Hadoop <https://hadoop.apache.org> came out, there have been various
> commercial and/or open-source attempts to expose some compatibility with
> SQL. Obviously, by posting here I am not expecting an unbiased answer.
>
> I am seeking an SQL-on-Hadoop offering which provides low-latency querying
> and supports the most common CRUD operations, including [the basics!]
> CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE,
> DELETE FROM, and DROP TABLE. Transactional support would also be nice, but
> is not a must-have.
>
> Essentially I want a full replacement for the more traditional RDBMS, one
> which can scale from 1 node to a serious Hadoop cluster.
>
> Python is my language of choice for interfacing, and there does seem to be
> a Python JDBC wrapper.
>
> Here is what I've found thus far:
>
>    - Apache Hive <https://hive.apache.org> (SQL-like, with interactive
>    SQL thanks to the Stinger initiative)
>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>    - Apache Spark <https://spark.apache.org> (Spark SQL
>    <https://spark.apache.org/sql>, queries only; add data via Hive, RDD
>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>    or Parquet <http://parquet.io/>)
>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase
>    <http://hbase.apache.org>, lacks full transaction
>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>    operators <http://en.wikipedia.org/wiki/Relational_operators> and some
>    built-in functions)
>    - Cloudera Impala
>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>    (significant HiveQL support, some SQL language support, no support for
>    indexes on its tables, and importantly missing DELETE, UPDATE and
>    INTERSECT, amongst others)
>    - Presto <https://github.com/facebook/presto> from Facebook (can query
>    Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc.
>    It doesn't seem to be designed for low-latency responses across small
>    clusters, or to support UPDATE operations. It is optimized for data
>    warehousing or analytics¹
>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>    community edition <https://www.mapr.com/products/hadoop-download>
>    (seems to be a packaging of Hive, HP Vertica
>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>    Drill and a native ODBC wrapper
>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>    interface and multi-dimensional analysis [OLAP
>    <http://en.wikipedia.org/wiki/OLAP>]; "… offers ANSI SQL on Hadoop and
>    supports most ANSI SQL query functions". It depends on HDFS, MapReduce,
>    Hive and HBase, and seems targeted at very large data-sets while
>    maintaining low query latency)
>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>    support [benchmarks against Hive and Impala
>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>    publishing files [or any resource] as schemas and tables.")
>
> Which—from this list or elsewhere—would you recommend, and why?
> Thanks for all suggestions,
>
> Samuel Marks
> http://linkedin.com/in/samuelmarks
>
>
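
For anyone wanting to check a candidate engine against the statement list in
the quoted message (CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE ... SET
... WHERE, DELETE FROM, DROP TABLE), here is a minimal sketch of a probe
written against Python's DB-API. It uses the standard-library sqlite3 module
purely as a stand-in so that it runs anywhere; it is not a Hadoop engine. The
table name, column names and helper functions are hypothetical, and the "?"
placeholder style is sqlite3's, so another driver may need a different
paramstyle.

```python
import sqlite3

# Statements mirror the list in the question. The table "t" and its columns
# are hypothetical; "?" is sqlite3's placeholder style (an assumption that
# may not hold for other DB-API drivers).
CRUD_STATEMENTS = [
    ("CREATE TABLE t (id INTEGER PRIMARY KEY, c1 INTEGER)", ()),
    ("INSERT INTO t (id, c1) VALUES (?, ?)", (1, 1)),
    ("INSERT INTO t (id, c1) VALUES (?, ?)", (2, 5)),
    ("UPDATE t SET c1 = 2 WHERE id = ?", (1,)),
    ("DELETE FROM t WHERE id = ?", (2,)),
]


def run_crud_check(conn):
    """Run the basic statements on a DB-API connection; return surviving rows."""
    cur = conn.cursor()
    for sql, params in CRUD_STATEMENTS:
        cur.execute(sql, params)
    cur.execute("SELECT * FROM t")
    rows = cur.fetchall()
    cur.execute("DROP TABLE t")
    conn.commit()
    return rows


def supports_basic_crud(conn):
    """Return True if every statement ran and the final row is as expected."""
    try:
        return run_crud_check(conn) == [(1, 2)]
    except Exception:
        # Any failure (e.g. an unsupported UPDATE or DELETE) counts as "no".
        return False


if __name__ == "__main__":
    print(supports_basic_crud(sqlite3.connect(":memory:")))
```

Swapping the sqlite3 connection for one from an engine's own DB-API driver
(if it has one) would, under the assumptions above, show quickly whether
UPDATE and DELETE are actually supported.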
