Interesting discussion. It looks like the HBase metastore can also be configured to use HDFS HA (e.g. this tutorial: <http://www.cloudera.com/content/cloudera/en/documentation/core/latest/topics/cdh_hag_hdfs_ha_cdh_components_config.html>).
To get back on topic though, the primary contenders now are: Phoenix, Lingual, and perhaps Tajo or Drill?

Best,

Samuel Marks
http://linkedin.com/in/samuelmarks

On Sun, Feb 1, 2015 at 9:38 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> "is the metastore thrift definition stable across hive versions?" I would say yes. Like many APIs, the core eventually solidifies. No one is saying it will never ever change, but basically there are things like "database" and "table", and they have properties like "name". I have some basic scripts that look for table names matching patterns or summarize disk usage by owner, and I have not had to touch them very much. Usually if they do change, it is something small, and if you tie the commit to a JIRA you can figure out what changed and why.
>
> On Sat, Jan 31, 2015 at 3:02 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> seems the metastore thrift service supports SASL. that's great. so if i understand it correctly, all i need is the metastore thrift definition to query the metastore. is the metastore thrift definition stable across hive versions? if so, then i can build my app once without worrying about the hive version deployed. in that case i admit it's not as bad as i thought. let's see!
>>
>> On Sat, Jan 31, 2015 at 2:41 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> oh sorry edward, i misread your post. seems we agree that "SQL constructs inside hive" are not for other systems.
>>>
>>> On Sat, Jan 31, 2015 at 2:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> edward,
>>>> i would not call "SQL constructs inside hive" accessible for other systems. it's inside hive, after all.
>>>>
>>>> it is true that i can contact the metastore in java using HiveMetaStoreClient, but then i need to bring in a whole slew of dependencies (the minimum seems to be hive-metastore, hive-common, hive-shims, libfb303, libthrift and a few hadoop dependencies, found by trial and error). these jars need to be "provided" and added to the classpath on the cluster, unless someone is willing to build versions of an application for every hive version out there. and even when you do all this, you can only pray it's going to be compatible with the next hive version, since backwards compatibility is... well, let's just say lacking. the attitude seems to be that hive does not have a java api, so there is nothing that needs to be stable.
>>>>
>>>> you are right, i could go the pure thrift road. i haven't tried that yet; that might just be the best option. but how easy is it to do this with a secure hadoop/hive ecosystem? now i need to handle kerberos myself and somehow pass tokens into thrift, i assume?
>>>>
>>>> contrast all of this with an avro file on hadoop with the metadata baked in, and i think it's safe to say hive metadata is not easily accessible.
>>>>
>>>> i will take a look at your book. i hope it has an example of using thrift on a secure cluster to contact the hive metastore (without using the HiveMetaStoreClient), that would be awesome.
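The "pure thrift road" can be fairly compact. Below is a minimal sketch, assuming only the Java client generated from Hive's hive_metastore.thrift IDL (plus libthrift and libfb303 on the classpath); the host name is a placeholder, 9083 is the metastore's default port, and the transport is left unsecured, so on a kerberized cluster you would additionally wrap it in a SASL (GSSAPI) transport before opening it:

    import java.util.List;

    import org.apache.hadoop.hive.metastore.api.FieldSchema;
    import org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransport;

    public class MetastoreThriftSketch {
        public static void main(String[] args) throws Exception {
            // "metastore-host" is a placeholder; 9083 is the metastore's default port
            TTransport transport = new TSocket("metastore-host", 9083);
            // on a secure cluster, wrap `transport` in a SASL (GSSAPI) transport here
            transport.open();
            ThriftHiveMetastore.Client client =
                new ThriftHiveMetastore.Client(new TBinaryProtocol(transport));

            // walk every database and table, printing each table's column count
            for (String db : client.get_all_databases()) {
                for (String table : client.get_all_tables(db)) {
                    List<FieldSchema> columns = client.get_fields(db, table);
                    System.out.println(db + "." + table + ": " + columns.size() + " columns");
                }
            }
            transport.close();
        }
    }

The trade-off Koert raises still applies: the generated client tracks the IDL of the Hive version it was built against, so the stability of the thrift definition is what makes or breaks this approach.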
>>>> On Sat, Jan 31, 2015 at 1:32 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>>>>
>>>>> "with the metadata in a special metadata store (not on hdfs), and its not as easy for all systems to access hive metadata." I disagree.
>>>>>
>>>>> Hive's metadata is not only accessible through SQL constructs like "describe table". The entire metastore is also a thrift service, so you have programmatic access to determine things like which columns are in a table, etc. Thrift generates RPC clients for almost every major language.
>>>>>
>>>>> In the Programming Hive book <http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e> there are even examples where I show how to iterate over all the tables inside a database from a java client.
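For readers without the book to hand, the HiveMetaStoreClient route looks roughly like the following sketch. The metastore URI is a placeholder, and this is the wrapper-API approach carrying the dependency baggage discussed above, not raw thrift; printing the owner echoes the kind of per-owner summaries Edward mentions:

    import org.apache.hadoop.hive.conf.HiveConf;
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class ListTablesSketch {
        public static void main(String[] args) throws Exception {
            HiveConf conf = new HiveConf();
            // point the client at the metastore; host and port are placeholders
            conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");
            HiveMetaStoreClient client = new HiveMetaStoreClient(conf);

            // iterate every table in every database, e.g. to match name patterns
            for (String db : client.getAllDatabases()) {
                for (String tableName : client.getAllTables(db)) {
                    Table table = client.getTable(db, tableName);
                    System.out.println(db + "." + tableName + " (owner: " + table.getOwner() + ")");
                }
            }
            client.close();
        }
    }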
>>>>> On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>>>
>>>>>> yes, you can run whatever you like with the data in hdfs. keep in mind that hive makes this general access pattern just a little harder, since hive has a tendency to store data and metadata separately, with the metadata in a special metadata store (not on hdfs), and its not as easy for all systems to access hive metadata.
>>>>>>
>>>>>> i am not familiar at all with tajo or drill.
>>>>>>
>>>>>> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelma...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for the advice.
>>>>>>>
>>>>>>> Koert: when everything is in the same essential data-store (HDFS), can't I just run whatever complex tools I want, in whichever paradigm I like?
>>>>>>>
>>>>>>> E.g. GraphX, Mahout, etc.
>>>>>>>
>>>>>>> Also, what about Tajo or Drill?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Samuel Marks
>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>
>>>>>>> PS: Spark-SQL is read-only IIRC, right?
>>>>>>>
>>>>>>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>>>>>>
>>>>>>>> since you require high-powered analytics, and i assume you want to stay sane while doing so, you require the ability to "drop out of sql" when needed. so spark-sql and lingual would be my choices.
>>>>>>>>
>>>>>>>> low latency indicates phoenix or spark-sql to me.
>>>>>>>>
>>>>>>>> so i would say spark-sql.
>>>>>>>>
>>>>>>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelma...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and its exposing both JDBC and ODBC interfaces. However, although Pivotal does open-source a lot of software <http://www.pivotal.io/oss>, I don't believe they open-source Pivotal HD: HAWQ. So that doesn't meet my requirements. I should note that the project I am building will also be open-source, which heightens the importance of having all components be open-source as well.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>>
>>>>>>>>> Samuel Marks
>>>>>>>>> http://linkedin.com/in/samuelmarks
>>>>>>>>>
>>>>>>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <siddharth.tiw...@live.com> wrote:
>>>>>>>>>
>>>>>>>>>> Have you looked at HAWQ from Pivotal?
>>>>>>>>>>
>>>>>>>>>> Sent from my iPhone
>>>>>>>>>>
>>>>>>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelma...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Since Hadoop <https://hadoop.apache.org> came out, there have been various commercial and/or open-source attempts to expose some compatibility with SQL. Obviously, by posting here I am not expecting an unbiased answer.
>>>>>>>>>>
>>>>>>>>>> I am seeking an SQL-on-Hadoop offering which provides low-latency querying and supports the most common CRUD operations, including [the basics!] along these lines: CREATE TABLE, INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM, and DROP TABLE. Transactional support would be nice also, but is not a must-have.
>>>>>>>>>>
>>>>>>>>>> Essentially I want a full replacement for the more traditional RDBMS, one which can scale from 1 node to a serious Hadoop cluster.
>>>>>>>>>>
>>>>>>>>>> Python is my language of choice for interfacing; however, there does seem to be a Python JDBC wrapper.
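Whichever engine is chosen, that CRUD wish-list maps onto plain JDBC calls, which is also what a Python JDBC wrapper would drive underneath. A hedged sketch against HiveServer2 as one concrete target: host, port and database are placeholders, the Hive JDBC driver is assumed to be on the classpath, and in Hive the UPDATE and DELETE statements only succeed on transactional (ACID) tables from 0.14 onwards; they are shown here for the syntax:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class CrudSketch {
        public static void main(String[] args) throws Exception {
            // register the HiveServer2 driver (needed on pre-JDBC4 setups)
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // host and port are placeholders; 10000 is HiveServer2's default port
            Connection conn =
                DriverManager.getConnection("jdbc:hive2://server-host:10000/default");
            Statement stmt = conn.createStatement();

            stmt.execute("CREATE TABLE t (c1 INT, c2 STRING)");
            stmt.execute("INSERT INTO TABLE t VALUES (1, 'a'), (2, 'b')");

            ResultSet rs = stmt.executeQuery("SELECT * FROM t");
            while (rs.next()) {
                System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
            }

            // UPDATE/DELETE require a transactional (ACID) table in Hive 0.14+
            stmt.execute("UPDATE t SET c1 = 2 WHERE c2 = 'a'");
            stmt.execute("DELETE FROM t WHERE c1 = 2");

            stmt.execute("DROP TABLE t");
            conn.close();
        }
    }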
>>>>>>>>>> Here is what I've found thus far:
>>>>>>>>>>
>>>>>>>>>> - Apache Hive <https://hive.apache.org> (SQL-like, with interactive SQL thanks to the Stinger initiative)
>>>>>>>>>> - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>>>>>> - Apache Spark <https://spark.apache.org> (Spark SQL <https://spark.apache.org/sql>; queries only, add data via Hive, RDD <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD> or Parquet <http://parquet.io/>)
>>>>>>>>>> - Apache Phoenix <http://phoenix.apache.org> (built atop Apache HBase <http://hbase.apache.org>; lacks full transaction <http://en.wikipedia.org/wiki/Database_transaction> support, relational operators <http://en.wikipedia.org/wiki/Relational_operators> and some built-in functions)
>>>>>>>>>> - Cloudera Impala <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html> (significant HiveQL support, some SQL language support, no support for indexes on its tables; importantly missing DELETE, UPDATE and INTERSECT, amongst others)
>>>>>>>>>> - Presto <https://github.com/facebook/presto> from Facebook (can query Hive, Cassandra <http://cassandra.apache.org>, relational DBs, etc. Doesn't seem to be designed for low-latency responses across small clusters, or to support UPDATE operations. It is optimized for data warehousing or analytics¹ <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>>>>>> - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via the MapR community edition <https://www.mapr.com/products/hadoop-download> (seems to be a packaging of Hive, HP Vertica <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL, Drill and a native ODBC wrapper <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>>>>>> - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL interface and multi-dimensional analysis [OLAP <http://en.wikipedia.org/wiki/OLAP>]; "… offers ANSI SQL on Hadoop and supports most ANSI SQL query functions". It depends on HDFS, MapReduce, Hive and HBase, and seems targeted at very large data-sets, though it maintains low query latency)
>>>>>>>>>> - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver support [benchmarks against Hive and Impala <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>>>>>>>>> - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s Lingual <http://docs.cascading.org/lingual/1.0/>² <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual provides JDBC Drivers, a SQL command shell, and a catalog manager for publishing files [or any resource] as schemas and tables.")
>>>>>>>>>>
>>>>>>>>>> Which, from this list or elsewhere, would you recommend, and why?
>>>>>>>>>>
>>>>>>>>>> Thanks for all suggestions,
>>>>>>>>>>
>>>>>>>>>> Samuel Marks
>>>>>>>>>> http://linkedin.com/in/samuelmarks
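Since Phoenix heads the shortlist at the top of this thread, one syntax wrinkle is worth noting: Phoenix has no INSERT statement; writes go through UPSERT, and every table needs a primary key. A minimal sketch, with the ZooKeeper quorum host in the connection URL as a placeholder:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixSketch {
        public static void main(String[] args) throws Exception {
            // the Phoenix JDBC URL names the HBase ZooKeeper quorum; "zk-host" is a placeholder
            Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host");
            conn.setAutoCommit(true); // Phoenix defaults to autoCommit=false

            Statement stmt = conn.createStatement();
            // every Phoenix table requires a primary key
            stmt.execute("CREATE TABLE IF NOT EXISTS t (id BIGINT PRIMARY KEY, c2 VARCHAR)");
            stmt.execute("UPSERT INTO t VALUES (1, 'a')"); // UPSERT, not INSERT

            ResultSet rs = stmt.executeQuery("SELECT * FROM t");
            while (rs.next()) {
                System.out.println(rs.getLong(1) + "\t" + rs.getString(2));
            }

            stmt.execute("DELETE FROM t WHERE id = 1");
            stmt.execute("DROP TABLE t");
            conn.close();
        }
    }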