"is the metastore thrift definition stable across hive versions?" I would say yes. Like many API's the core eventually solidifies. No one is saying it will never every change, but basically there are things like "database" and "table" and they have properties like "name". I have some basic scripts that look for table names matching patterns or summarize disk usage by owner. I have not had to touch them very much. Usually if they do change it is something small and if you tie the commit to a jira you can figure out what and why.
On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <ko...@tresata.com> wrote:

> seems the metastore thrift service supports SASL. that's great. so if i understand it correctly, all i need is the metastore thrift definition to query the metastore. is the metastore thrift definition stable across hive versions? if so, then i can build my app once without worrying about the hive version deployed. in that case i admit it's not as bad as i thought. let's see!
>
> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> oh sorry edward, i misread your post. seems we agree that "SQL constructs inside hive" are not for other systems.
>>
>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> edward,
>>> i would not call "SQL constructs inside hive" accessible to other systems. it's inside hive, after all.
>>>
>>> it is true that i can contact the metastore in java using HiveMetaStoreClient, but then i need to bring in a whole slew of dependencies (the minimum seems to be hive-metastore, hive-common, hive-shims, libfb303, libthrift and a few hadoop dependencies, found by trial and error). these jars need to be "provided" and added to the classpath on the cluster, unless someone is willing to build versions of an application for every hive version out there. and even when you do all this you can only pray it's going to be compatible with the next hive version, since backwards compatibility is... well, let's just say lacking. the attitude seems to be that hive does not have a java api, so there is nothing that needs to be stable.
>>>
>>> you are right that i could go the pure thrift road. i haven't tried that yet; it might just be the best option. but how easy is it to do this with a secure hadoop/hive ecosystem? now i need to handle kerberos myself and somehow pass tokens into thrift, i assume?
>>>
>>> contrast all of this with an avro file on hadoop with the metadata baked in, and i think it's safe to say hive metadata is not easily accessible.
>>>
>>> i will take a look at your book. i hope it has an example of using thrift on a secure cluster to contact the hive metastore (without using the HiveMetaStoreClient); that would be awesome.
>>>
>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>
>>>> "with the metadata in a special metadata store (not on hdfs), and its not as easy for all systems to access hive metadata." I disagree.
>>>>
>>>> Hive's metadata is not only accessible through SQL constructs like "describe table". The entire metastore is also a thrift service, so you have programmatic access to determine things like what columns are in a table, etc. Thrift generates RPC clients for almost every major language.
>>>>
>>>> In the Programming Hive book <http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e> there are even examples where I show how to iterate over all the tables in the database from a java client.
>>>>
>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> yes, you can run whatever you like against the data in hdfs. keep in mind that hive makes this general access pattern just a little harder, since hive has a tendency to store data and metadata separately, with the metadata in a special metadata store (not on hdfs), and it's not as easy for all systems to access hive metadata.
>>>>>
>>>>> i am not familiar at all with tajo or drill.
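On the secure-cluster question in the quoted thread above: with the stock client you do not hand-roll kerberos or SASL plumbing yourself; you log in via UserGroupInformation and set two metastore properties. A sketch follows; the principal, keytab path and host are placeholders, and the property names are as of the Hive 0.13/1.x line (verify against your version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SecureMetastorePing {
        public static void main(String[] args) throws Exception {
            // kerberos login first; principal and keytab path are placeholder values
            Configuration hadoopConf = new Configuration();
            hadoopConf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(hadoopConf);
            UserGroupInformation.loginUserFromKeytab(
                    "app-user@EXAMPLE.COM", "/etc/security/keytabs/app-user.keytab");

            HiveConf conf = new HiveConf();
            // placeholder metastore host
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");
            // hive.metastore.sasl.enabled and hive.metastore.kerberos.principal
            conf.setBoolVar(HiveConf.ConfVars.METASTORE_USE_THRIFT_SASL, true);
            conf.setVar(HiveConf.ConfVars.METASTORE_KERBEROS_PRINCIPAL,
                    "hive/_HOST@EXAMPLE.COM");

            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
            System.out.println("databases: " + client.getAllDatabases());
            client.close();
        }
    }

The SASL transport wrapping itself comes from hive's shims (HadoopThriftAuthBridge), which is the piece you would have to reproduce on the pure-thrift road, and part of why the dependency-heavy client is hard to escape.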
>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelma...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for the advice.
>>>>>>
>>>>>> Koert: when everything is in the same essential data-store (HDFS), can't I just run whatever complex tools in whichever paradigm they like?
>>>>>>
>>>>>> E.g.: GraphX, Mahout &etc.
>>>>>>
>>>>>> Also, what about Tajo or Drill?
>>>>>>
>>>>>> Best,
>>>>>> Samuel Marks
>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>
>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>>
>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>
>>>>>>> since you require high-powered analytics, and i assume you want to stay sane while doing so, you require the ability to "drop out of sql" when needed. so spark-sql and lingual would be my choices.
>>>>>>>
>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>
>>>>>>> so i would say spark-sql.
>>>>>>>
>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelma...@gmail.com> wrote:
>>>>>>>
>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and its exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source a lot of software <http://www.pivotal.io/oss>, I don't believe they open-source Pivotal HD: HAWQ.
>>>>>>>>
>>>>>>>> So that doesn't meet my requirements. I should note that the project I am building will also be open-source, which heightens the importance of having all components be open-source as well.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Samuel Marks
>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <siddharth.tiw...@live.com> wrote:
>>>>>>>>
>>>>>>>>> Have you looked at HAWQ from Pivotal?
>>>>>>>>>
>>>>>>>>> Sent from my iPhone
>>>>>>>>>
>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelma...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Since Hadoop came out, there have been various commercial and/or open-source attempts to expose some compatibility with SQL. Obviously by posting here I am not expecting an unbiased answer.
>>>>>>>>>
>>>>>>>>> I am seeking an SQL-on-Hadoop offering which provides low-latency querying and supports the most common CRUD operations, including [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional support would be nice also, but is not a must-have.
>>>>>>>>>
>>>>>>>>> Essentially I want a full replacement for the more traditional RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>
>>>>>>>>> Python is my language of choice for interfacing; however, there does seem to be a Python JDBC wrapper.
>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>
>>>>>>>>> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL thanks to the Stinger initiative)
>>>>>>>>> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>> - Apache Spark <https://spark.apache.org> (Spark SQL <https://spark.apache.org/sql>: queries only; add data via Hive, RDD <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD> or Parquet <http://parquet.io/>)
>>>>>>>>> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase <http://hbase.apache.org>; lacks full transaction <http://en.wikipedia.org/wiki/Database_transaction> support, relational operators <http://en.wikipedia.org/wiki/Relational_operators> and some built-in functions)
>>>>>>>>> - Cloudera Impala <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html> (significant HiveQL support and some SQL language support; no support for indexes on its tables; importantly missing DELETE, UPDATE and INTERSECT, amongst others)
>>>>>>>>> - Presto <https://github.com/facebook/presto> from Facebook (can query Hive, Cassandra <http://cassandra.apache.org>, relational DBs &etc. Doesn't seem to be designed for low-latency responses across small clusters, or to support UPDATE operations. It is optimized for data warehousing or analytics¹ <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via the MapR community edition <https://www.mapr.com/products/hadoop-download> (seems to be a packaging of Hive, HP Vertica <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL, Drill and a native ODBC wrapper <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>> - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL interface and multi-dimensional analysis [OLAP <http://en.wikipedia.org/wiki/OLAP>]; "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions". It depends on HDFS, MapReduce, Hive and HBase, and seems targeted at very large data-sets, though it maintains low query latency)
>>>>>>>>> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks against Hive and Impala <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>>>>>>>> - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s Lingual <http://docs.cascading.org/lingual/1.0/>² <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager for publishing files [or any resource] as schemas and tables.")
>>>>>>>>>
>>>>>>>>> Which, from this list or elsewhere, would you recommend, and why?
>>>>>>>>>
>>>>>>>>> Thanks for all suggestions,
>>>>>>>>> Samuel Marks
>>>>>>>>> http://linkedin.com/in/samuelmarks
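Since nearly everything on that list exposes JDBC, one way to compare the engines against the CRUD wish-list is a single smoke test, swapping only the driver class and URL per engine. Below is a sketch against HiveServer2, with placeholder host and credentials; note that in Hive, UPDATE and DELETE additionally require the ACID transactional tables introduced in 0.14:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class JdbcCrudSmokeTest {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // placeholder host, port and user
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hs2-host:10000/default", "user", "");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS smoke_test (c1 INT, c2 STRING)");
                // INSERT ... VALUES requires Hive 0.14+
                stmt.execute("INSERT INTO TABLE smoke_test VALUES (1, 'a'), (2, 'b')");
                try (ResultSet rs = stmt.executeQuery("SELECT c1, c2 FROM smoke_test")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
                    }
                }
                stmt.execute("DROP TABLE smoke_test");
            }
        }
    }

The same program should run against, e.g., Phoenix or Drill by swapping the Class.forName and the JDBC URL; whether UPDATE and DELETE then succeed is exactly the difference the list above is drawing.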