"with the metadata in a special metadata store (not on hdfs), and its not as easy for all systems to access hive metadata." I disagree.
Hive's metadata is not only accessible through SQL constructs like
"describe table": the entire metastore is actually a Thrift service, so you
have programmatic access to determine things like which columns are in a
table. Thrift generates RPC clients for almost every major language. In the
Programming Hive book <http://www.amazon.com/dp/1449319335/> there are even
examples where I show how to iterate over all the tables in a database from
a Java client; a couple of rough sketches along those lines are at the
bottom of this mail, below the quoted thread.

On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com> wrote:

> yes you can run whatever you like with the data in hdfs. keep in mind that
> hive makes this general access pattern just a little harder, since hive
> has a tendency to store data and metadata separately, with the metadata in
> a special metadata store (not on hdfs), and its not as easy for all
> systems to access hive metadata.
>
> i am not familiar at all with tajo or drill.
>
> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelma...@gmail.com>
> wrote:
>
>> Thanks for the advice.
>>
>> Koert: when everything is in the same essential data-store (HDFS), can't
>> I just run whatever complex tools in whichever paradigm they like?
>>
>> E.g.: GraphX, Mahout &etc.
>>
>> Also, what about Tajo or Drill?
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>> PS: Spark-SQL is read-only IIRC, right?
>>
>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>
>>> since you require high-powered analytics, and i assume you want to stay
>>> sane while doing so, you require the ability to "drop out of sql" when
>>> needed. so spark-sql and lingual would be my choices.
>>>
>>> low latency indicates phoenix or spark-sql to me.
>>>
>>> so i would say spark-sql
>>>
>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelma...@gmail.com>
>>> wrote:
>>>
>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92) and
>>>> exposing both JDBC and ODBC interfaces. However, although Pivotal does
>>>> open-source a lot of software <http://www.pivotal.io/oss>, I don't
>>>> believe they open-source Pivotal HD: HAWQ.
>>>>
>>>> So that doesn't meet my requirements. I should note that the project I
>>>> am building will also be open-source, which heightens the importance
>>>> of having all components also be open-source.
>>>>
>>>> Cheers,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>> siddharth.tiw...@live.com> wrote:
>>>>
>>>>> Have you looked at HAWQ from Pivotal?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Since Hadoop <http://hadoop.apache.org> came out, there have been
>>>>> various commercial and/or open-source attempts to expose some
>>>>> compatibility with SQL <http://en.wikipedia.org/wiki/SQL>. Obviously
>>>>> by posting here I am not expecting an unbiased answer.
>>>>>
>>>>> I am seeking an SQL-on-Hadoop offering which provides low-latency
>>>>> querying and supports the most common CRUD
>>>>> <http://en.wikipedia.org/wiki/Create,_read,_update_and_delete>
>>>>> operations, including [the basics!] along these lines: CREATE TABLE,
>>>>> INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM,
>>>>> and DROP TABLE. Transactional support would be nice also, but is not
>>>>> a must-have.
>>>>>
>>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>>
>>>>> Python is my language of choice for interfacing; however, there does
>>>>> seem to be a Python JDBC wrapper.
>>>>>
>>>>> Here is what I've found thus far:
>>>>>
>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>    <https://spark.apache.org/sql>; queries only, add data via Hive,
>>>>>    RDD
>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>    or Parquet <http://parquet.io/>)
>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>    HBase <http://hbase.apache.org>; lacks full transaction
>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support,
>>>>>    relational operators
>>>>>    <http://en.wikipedia.org/wiki/Relational_operators> and some
>>>>>    built-in functions)
>>>>>    - Cloudera Impala
>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>    (significant HiveQL support, some SQL language support, no support
>>>>>    for indexes on its tables; importantly missing DELETE, UPDATE and
>>>>>    INTERSECT, amongst others)
>>>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational
>>>>>    DBs &etc. Doesn't seem to be designed for low-latency responses
>>>>>    across small clusters, or to support UPDATE operations. It is
>>>>>    optimized for data warehousing and analytics¹
>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via the
>>>>>    MapR community edition
>>>>>    <https://www.mapr.com/products/hadoop-download> (seems to be a
>>>>>    packaging of Hive, HP Vertica
>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>, SparkSQL,
>>>>>    Drill and a native ODBC wrapper
>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>>    interface and multi-dimensional analysis [OLAP
>>>>>    <http://en.wikipedia.org/wiki/OLAP>]; "… offers ANSI SQL on Hadoop
>>>>>    and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>    MapReduce, Hive and HBase, and seems targeted at very large
>>>>>    data-sets, though it maintains low query latency)
>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>>    support [benchmarks against Hive and Impala
>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager
>>>>>    for publishing files [or any resource] as schemas and tables.")
>>>>>
>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>>
>>>>> Thanks for all suggestions,
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
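
For what it's worth, the iteration I mentioned boils down to a handful of
metastore calls. Below is a minimal sketch of the idea (not the book's
listing verbatim), assuming Hive's bundled HiveMetaStoreClient and a
metastore Thrift service on the default port 9083; the URI and the class
name are placeholders for your own setup:

import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;

public class MetastoreWalker {
  public static void main(String[] args) throws Exception {
    // HiveConf picks up hive-site.xml from the classpath; setting the
    // metastore URI explicitly makes the Thrift dependency obvious.
    HiveConf conf = new HiveConf();
    conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://localhost:9083");

    HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
    try {
      // Walk every database, every table in it, and every column.
      for (String db : client.getAllDatabases()) {
        for (String table : client.getAllTables(db)) {
          System.out.println(db + "." + table);
          List<FieldSchema> cols =
              client.getTable(db, table).getSd().getCols();
          for (FieldSchema col : cols) {
            System.out.println("  " + col.getName() + " : " + col.getType());
          }
        }
      }
    } finally {
      client.close();
    }
  }
}

And because the metastore interface is plain Thrift, the same walk can be
done from Python, PHP, C++ and so on with Thrift-generated clients; none of
this is locked behind HiveQL.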
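Also, since most of the options on that list (Hive, Tajo, Phoenix, Lingual,
HAWQ, ...) expose JDBC, the client side of the basic statements Samuel
listed looks much the same everywhere. A rough sketch against HiveServer2,
assuming the hive-jdbc driver is on the classpath and a server is listening
on the default localhost:10000 (the table name and credentials are
placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcBasics {
  public static void main(String[] args) throws Exception {
    // Older hive-jdbc jars do not auto-register under JDBC 4, so load the
    // driver explicitly. Phoenix, Tajo, Lingual, etc. differ mainly in the
    // driver class and the URL scheme.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection conn = DriverManager.getConnection(url, "hive", "");
         Statement stmt = conn.createStatement()) {
      stmt.execute("CREATE TABLE IF NOT EXISTS t (c1 INT, c2 STRING)");
      stmt.execute("INSERT INTO TABLE t VALUES (1, 'a')"); // Hive >= 0.14
      try (ResultSet rs = stmt.executeQuery("SELECT c1, c2 FROM t")) {
        while (rs.next()) {
          System.out.println(rs.getInt(1) + "\t" + rs.getString(2));
        }
      }
      stmt.execute("DROP TABLE t");
    }
  }
}

One caveat that bears on the UPDATE/DELETE points in the comparison: in
Hive, INSERT ... VALUES only arrived with 0.14, and UPDATE and DELETE
additionally require ACID-enabled (ORC, transactional) tables, which is
exactly where several of the listed engines still differ.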