"with the metadata in a special metadata store (not on hdfs), and its not
as easy for all systems to access hive metadata." I disagree.

Hive's metadata is not only accessible through SQL constructs like
"describe table". The entire metastore is actually a Thrift service, so
you have programmatic access to determine things like what columns are
in a table. Thrift generates RPC clients for almost every major
language.

In the Programming Hive book
http://www.amazon.com/dp/1449319335/?tag=mh0b-20&hvadid=3521269638&ref=pd_sl_4yiryvbf8k_e
there are even examples where I show how to iterate over all the tables
in the database from a Java client.
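Along those lines, here is a minimal sketch (not the book's exact
listing; the metastore URI is a placeholder for your deployment) that
walks every database, table and column over the metastore's Thrift
interface:

  import org.apache.hadoop.hive.conf.HiveConf;
  import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
  import org.apache.hadoop.hive.metastore.api.FieldSchema;

  public class MetastoreWalk {
    public static void main(String[] args) throws Exception {
      HiveConf conf = new HiveConf();
      // placeholder: point this at your metastore's thrift endpoint
      conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");
      HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
      try {
        for (String db : client.getAllDatabases()) {
          for (String table : client.getAllTables(db)) {
            System.out.println(db + "." + table);
            // the same service answers "what columns are in this table?"
            for (FieldSchema col : client.getFields(db, table)) {
              System.out.println("  " + col.getName() + ": " + col.getType());
            }
          }
        }
      } finally {
        client.close();
      }
    }
  }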

On Sat, Jan 31, 2015 at 11:05 AM, Koert Kuipers <ko...@tresata.com> wrote:

> yes you can run whatever you like with the data in hdfs. keep in mind that
> hive makes this general access pattern just a little harder, since hive has
> a tendency to store data and metadata separately, with the metadata in a
> special metadata store (not on hdfs), and it's not as easy for all systems
> to access hive metadata.
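>
> for example, any hdfs client can read a hive table's files directly
> (the warehouse path and table name below are placeholders;
> /user/hive/warehouse is just the hive default), but nothing on hdfs
> tells you the column names or types, those live in the metastore:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   public class RawTableFiles {
>     public static void main(String[] args) throws Exception {
>       // the raw bytes are right there on hdfs...
>       FileSystem fs = FileSystem.get(new Configuration());
>       for (FileStatus f : fs.listStatus(new Path("/user/hive/warehouse/mytable"))) {
>         System.out.println(f.getPath() + " (" + f.getLen() + " bytes)");
>       }
>       // ...but the schema (column names and types) is in the metastore, not here
>     }
>   }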
>
> i am not familiar at all with tajo or drill.
>
> On Fri, Jan 30, 2015 at 8:27 PM, Samuel Marks <samuelma...@gmail.com>
> wrote:
>
>> Thanks for the advice
>>
>> Koert: when everything is in the same essential data-store (HDFS), can't
>> I just run whatever complex tools in whichever paradigm they like?
>>
>> E.g.: GraphX, Mahout &etc.
>>
>> Also, what about Tajo or Drill?
>>
>> Best,
>>
>> Samuel Marks
>> http://linkedin.com/in/samuelmarks
>>
>> PS: Spark-SQL is read-only IIRC, right?
>> On 31 Jan 2015 03:39, "Koert Kuipers" <ko...@tresata.com> wrote:
>>
>>> since you require high-powered analytics, and i assume you want to stay
>>> sane while doing so, you require the ability to "drop out of sql" when
>>> needed. so spark-sql and lingual would be my choices.
>>>
>>> low latency indicates phoenix or spark-sql to me.
>>>
>>> so i would say spark-sql
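>>>
>>> e.g. with the spark 1.2 java api it looks something like this (the
>>> "orders" table is made up, and you need a spark build with hive
>>> support):
>>>
>>>   import org.apache.spark.SparkConf;
>>>   import org.apache.spark.api.java.JavaSparkContext;
>>>   import org.apache.spark.api.java.function.Function;
>>>   import org.apache.spark.sql.api.java.JavaSchemaRDD;
>>>   import org.apache.spark.sql.api.java.Row;
>>>   import org.apache.spark.sql.hive.api.java.JavaHiveContext;
>>>
>>>   public class DropOutOfSql {
>>>     public static void main(String[] args) {
>>>       JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("sketch"));
>>>       JavaHiveContext hive = new JavaHiveContext(sc);
>>>       // start in sql, against tables the hive metastore already knows about...
>>>       JavaSchemaRDD rows = hive.sql("SELECT name, amount FROM orders WHERE amount > 100");
>>>       // ...then drop out of sql into plain code for the awkward parts
>>>       long n = rows.map(new Function<Row, String>() {
>>>         public String call(Row r) { return r.getString(0).toLowerCase(); }
>>>       }).distinct().count();
>>>       System.out.println(n + " distinct names");
>>>       sc.stop();
>>>     }
>>>   }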
>>>
>>> On Fri, Jan 30, 2015 at 7:56 AM, Samuel Marks <samuelma...@gmail.com>
>>> wrote:
>>>
>>>> HAWQ is pretty nifty due to its full SQL compliance (ANSI 92), and it
>>>> exposes both JDBC and ODBC interfaces. However, although Pivotal does
>>>> open-source a lot of software <http://www.pivotal.io/oss>, I don't
>>>> believe they open-source Pivotal HD: HAWQ.
>>>>
>>>> So that doesn't meet my requirements. I should note that the project I
>>>> am building will itself be open-source, which heightens the importance
>>>> of every component being open-source as well.
>>>>
>>>> Cheers,
>>>>
>>>> Samuel Marks
>>>> http://linkedin.com/in/samuelmarks
>>>>
>>>> On Fri, Jan 30, 2015 at 11:35 PM, Siddharth Tiwari <
>>>> siddharth.tiw...@live.com> wrote:
>>>>
>>>>> Have you looked at HAWQ from Pivotal ?
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>> On Jan 30, 2015, at 4:27 AM, Samuel Marks <samuelma...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Since Hadoop <https://hadoop.apache.org> came out, there have been
>>>>> various commercial and/or open-source attempts to expose some
>>>>> compatibility with SQL <http://en.wikipedia.org/wiki/SQL>. Obviously by
>>>>> posting here I am not expecting an unbiased answer.
>>>>>
>>>>> Seeking an SQL-on-Hadoop offering which provides: low-latency
>>>>> querying, and supports the most common CRUD
>>>>> <http://en.wikipedia.org/wiki/Create,_read,_update_and_delete>
>>>>> operations, including [the basics!] along these lines: CREATE TABLE,
>>>>> INSERT INTO, SELECT * FROM, UPDATE Table SET C1=2 WHERE, DELETE FROM,
>>>>> and DROP TABLE. Transactional support would be nice also, but is not a
>>>>> must-have.
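>>>>>
>>>>> For concreteness, this is roughly how I picture driving those
>>>>> statements over JDBC (a HiveServer2 URL is shown purely as an example
>>>>> endpoint; INSERT ... VALUES needs Hive 0.14+, and UPDATE/DELETE only
>>>>> work where the engine supports them):
>>>>>
>>>>>   import java.sql.Connection;
>>>>>   import java.sql.DriverManager;
>>>>>   import java.sql.ResultSet;
>>>>>   import java.sql.Statement;
>>>>>
>>>>>   public class CrudSmokeTest {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>       Class.forName("org.apache.hive.jdbc.HiveDriver");
>>>>>       // example endpoint; substitute your engine's JDBC URL
>>>>>       Connection conn = DriverManager.getConnection(
>>>>>           "jdbc:hive2://localhost:10000/default", "user", "");
>>>>>       Statement stmt = conn.createStatement();
>>>>>       stmt.execute("CREATE TABLE t (c1 INT)");
>>>>>       stmt.execute("INSERT INTO t VALUES (1)");
>>>>>       ResultSet rs = stmt.executeQuery("SELECT * FROM t");
>>>>>       while (rs.next()) System.out.println(rs.getInt(1));
>>>>>       stmt.execute("DROP TABLE t");
>>>>>       conn.close();
>>>>>     }
>>>>>   }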
>>>>>
>>>>> Essentially I want a full replacement for the more traditional RDBMS,
>>>>> one which can scale from 1 node to a serious Hadoop cluster.
>>>>>
>>>>> Python is my language of choice for interfacing; however, there does
>>>>> seem to be a Python JDBC wrapper.
>>>>>
>>>>> Here is what I've found thus far:
>>>>>
>>>>>    - Apache Hive <https://hive.apache.org> (SQL-like, with
>>>>>    interactive SQL thanks to the Stinger initiative)
>>>>>    - Apache Drill <http://drill.apache.org> (ANSI SQL support)
>>>>>    - Apache Spark <https://spark.apache.org> (Spark SQL
>>>>>    <https://spark.apache.org/sql>, queries only, add data via Hive, RDD
>>>>>    <https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SchemaRDD>
>>>>>    or Parquet <http://parquet.io/>)
>>>>>    - Apache Phoenix <http://phoenix.apache.org> (built atop Apache
>>>>>    HBase <http://hbase.apache.org>, lacks full transaction
>>>>>    <http://en.wikipedia.org/wiki/Database_transaction> support, relational
>>>>>    operators <http://en.wikipedia.org/wiki/Relational_operators> and
>>>>>    some built-in functions)
>>>>>    - Cloudera Impala
>>>>>    <http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html>
>>>>>    (significant HiveQL support, some SQL language support, no support for
>>>>>    indexes on its tables, importantly missing DELETE, UPDATE and
>>>>>    INTERSECT; amongst others)
>>>>>    - Presto <https://github.com/facebook/presto> from Facebook (can
>>>>>    query Hive, Cassandra <http://cassandra.apache.org>, relational
>>>>>    DBs &etc. Doesn't seem to be designed for low-latency responses across
>>>>>    small clusters, or support UPDATE operations. It is optimized for
>>>>>    data warehousing or analytics¹
>>>>>    <http://prestodb.io/docs/current/overview/use-cases.html>)
>>>>>    - SQL-Hadoop <https://www.mapr.com/why-hadoop/sql-hadoop> via MapR
>>>>>    community edition <https://www.mapr.com/products/hadoop-download>
>>>>>    (seems to be a packaging of Hive, HP Vertica
>>>>>    <http://www.vertica.com/hp-vertica-products/sqlonhadoop>,
>>>>>    SparkSQL, Drill and a native ODBC wrapper
>>>>>    <http://package.mapr.com/tools/MapR-ODBC/MapR_ODBC>)
>>>>>    - Apache Kylin <http://www.kylin.io> from eBay (provides an SQL
>>>>>    interface and multi-dimensional analysis [OLAP
>>>>>    <http://en.wikipedia.org/wiki/OLAP>], "… offers ANSI SQL on Hadoop
>>>>>    and supports most ANSI SQL query functions". It depends on HDFS,
>>>>>    MapReduce, Hive and HBase; and seems targeted at very large
>>>>>    data-sets, though it maintains low query latency)
>>>>>    - Apache Tajo <http://tajo.apache.org> (ANSI/ISO SQL standard
>>>>>    compliance with JDBC <http://en.wikipedia.org/wiki/JDBC> driver
>>>>>    support [benchmarks against Hive and Impala
>>>>>    <http://blogs.gartner.com/nick-heudecker/apache-tajo-enters-the-sql-on-hadoop-space>])
>>>>>    - Cascading <http://en.wikipedia.org/wiki/Cascading_%28software%29>'s
>>>>>    Lingual <http://docs.cascading.org/lingual/1.0/>²
>>>>>    <http://docs.cascading.org/lingual/1.0/#sql-support> ("Lingual
>>>>>    provides JDBC Drivers, a SQL command shell, and a catalog manager for
>>>>>    publishing files [or any resource] as schemas and tables.")
>>>>>
>>>>> Which—from this list or elsewhere—would you recommend, and why?
>>>>> Thanks for all suggestions,
>>>>>
>>>>> Samuel Marks
>>>>> http://linkedin.com/in/samuelmarks
>>>>>
>>>>>
>>>>
>>>
>
