On Mon, Jan 29, 2018 at 12:10 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

> You should really look at what the Netflix guys are doing on Iceberg.
>
> https://github.com/Netflix/iceberg
>
> They have put a lot of thought into how to efficiently handle tabular data
> in S3. They put all of the metadata in S3 except for a single link to the
> name of the table's root metadata file.
>
> Other advantages of their design:
>
>    - Efficient atomic addition and removal of files in S3.
>    - Consistent schema evolution across formats
>    - More flexible partitioning and bucketing.
>
>
> .. Owen
>
> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>> All,
>>
>> I have been bouncing around the earth for a while and have had the
>> privilege of working at 4-5 places. On arrival, each place was at a
>> different point in its hadoop journey.
>>
>> One large company that I was at had a ~200 TB hadoop cluster. They
>> actually ran PIG, and their ops group REFUSED to support hive, even though
>> they had written thousands of lines of pig macros to deal with selecting
>> from a partition, or a pig script file you would import so you would know
>> what the columns of the data at location /x/y/z were.
>>
>> In another lifetime I was at a shop that used SCALDING. Again, lots of
>> custom effort there with avro and parquet, all to do things that hive
>> would do out of the box. Again, the biggest challenge was the thrift
>> service and metastore.
>>
>> In the cloud many people will use a bootstrap script
>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
>> or 'msck repair'
>>
>> The "rise of the cloud" has changed us all the metastore is being a
>> database is a hard paradigm to support. Imagine for example I created data
>> to an s3 bucket with hive, and another group in my company requires read
>> only access to this data for an ephemeral request. Sharing the data is
>> easy, S3 access can be granted, sharing the metastore and thrift services
>> are much more complicated.
>>
>> So let's think outside the box:
>>
>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>>
>> Datastax was able to build a platform where the filesystem and the
>> metastore were baked into Cassandra. Even though an HBase user would not
>> want that, the novel thing about that approach is that the metastore was
>> not "some extra thing in a database" that you had to deal with.
>>
>> What I am thinking is that for users of s3, the metastore should be in
>> s3, probably in hidden files inside the warehouse/table directory(ies).
>>
>> Think of it as msck repair "on the fly":
>> https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
>>
>> The implementation could be something like this:
>>
>> On startup, read hive.warehouse.dir and look for "_warehouse". That would
>> help us locate the databases; in the databases we can locate tables, and
>> within the tables we can locate partitions.
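>>
>> A rough sketch of that startup scan (the "_warehouse" marker and the plain
>> database/table/partition directory layout are assumptions of this proposal,
>> not anything Hive does today; only the Hadoop FileSystem calls are real):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.fs.FileStatus;
>>   import org.apache.hadoop.fs.FileSystem;
>>   import org.apache.hadoop.fs.Path;
>>
>>   public class WarehouseScan {
>>     // Walk hive.warehouse.dir: databases -> tables -> partitions.
>>     public static void scan(String warehouseDir) throws Exception {
>>       Path root = new Path(warehouseDir);             // e.g. s3a://bucket/warehouse
>>       FileSystem fs = root.getFileSystem(new Configuration());
>>       if (!fs.exists(new Path(root, "_warehouse"))) {  // hypothetical marker file
>>         return;                                        // not a managed warehouse
>>       }
>>       for (FileStatus db : fs.listStatus(root)) {      // database directories
>>         if (!db.isDirectory()) continue;
>>         for (FileStatus table : fs.listStatus(db.getPath())) {      // tables
>>           if (!table.isDirectory()) continue;
>>           for (FileStatus part : fs.listStatus(table.getPath())) {  // partitions
>>             System.out.println(part.getPath());        // register the partition here
>>           }
>>         }
>>       }
>>     }
>>   }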
>>
>> This will of course scale horribly for tables with 90,000,000 partitions,
>> but that would not be our use case. For all the people with "msck repair"
>> in their bootstrap, this gives a much cleaner way of using hive.
>>
>> The implementations could even be "stacked": files first, metastore
>> fallback second.
>>
>> It would also be wise to have a tool available in the CLI, "metastore
>> <table> toJson", making it drop-dead simple to export schema definitions.
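>>
>> The export itself could be close to trivial, because the metastore API
>> objects are Thrift-generated and can be dumped straight to JSON. A sketch
>> (the class and method here are illustrative, not an existing tool):
>>
>>   import org.apache.hadoop.hive.conf.HiveConf;
>>   import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>>   import org.apache.hadoop.hive.metastore.api.Table;
>>   import org.apache.thrift.TSerializer;
>>   import org.apache.thrift.protocol.TSimpleJSONProtocol;
>>
>>   public class TableToJson {
>>     // Fetch a table definition from the metastore and render it as JSON.
>>     public static String toJson(String db, String tableName) throws Exception {
>>       HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>>       try {
>>         Table t = client.getTable(db, tableName);
>>         return new TSerializer(new TSimpleJSONProtocol.Factory()).toString(t);
>>       } finally {
>>         client.close();
>>       }
>>     }
>>   }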
>>
>> Thoughts?
>>
>>
>>
>
Close!

They have many of the concepts right, but the dealbreaker is that they have
their own file format. That will ultimately be a downfall: Hive needs to
continue working with a variety of formats, and a single format seems like a
non-starter when everyone is already divided into camps over not-invented-here
file formats.

Potentially we could implement this as a StorageHandler; that interface has
been flexible and has had success (see
https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage). A storage handler
can delegate to iceberg or something else.
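
A minimal sketch of that shape, assuming a hypothetical delegating handler
(the Parquet classes below are only stand-ins so the skeleton compiles; a real
handler would return whatever input/output formats the table format library
exposes):

  import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat;
  import org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat;
  import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
  import org.apache.hadoop.mapred.InputFormat;
  import org.apache.hadoop.mapred.OutputFormat;

  // Hypothetical handler: delegate reads/writes to the table format's own
  // InputFormat/OutputFormat (Parquet classes used here as placeholders).
  public class DelegatingStorageHandler extends DefaultStorageHandler {
    @Override
    public Class<? extends InputFormat> getInputFormatClass() {
      return MapredParquetInputFormat.class;
    }

    @Override
    public Class<? extends OutputFormat> getOutputFormatClass() {
      return MapredParquetOutputFormat.class;
    }
  }

A table would then be wired to it with CREATE TABLE ... STORED BY
'<handler class>', the same way the MongoDB handler in the link above is used.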

I was thinking of this problem as more of a "docker" type solution. For
example, let's say you have built a 40GB dataset divided into partitions by
day. Imagine we build a docker image: the image would launch with an embedded
derby DB (read only) and a start script that completely describes the data
and the partitions. (You still need some way to connect it to your
processing.) But now we have a one-shot, "shippable" hive.
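
The embedded-derby part is mostly just configuration; a sketch of the wiring,
where "/opt/bundle/metastore_db" is a made-up path for wherever the image
bakes the prebuilt Derby database:

  import org.apache.hadoop.hive.conf.HiveConf;

  public class BundledMetastoreConf {
    // Point Hive at a Derby metastore database shipped inside the image.
    // Read-only-ness comes from how the image/filesystem is built, not from
    // this setting.
    public static HiveConf bundledConf() {
      HiveConf conf = new HiveConf();
      conf.set("javax.jdo.option.ConnectionURL",
          "jdbc:derby:/opt/bundle/metastore_db");
      conf.set("javax.jdo.option.ConnectionDriverName",
          "org.apache.derby.jdbc.EmbeddedDriver");
      return conf;
    }
  }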

Another approach: we have a JSON format, with files that live in each of the
40 partitions. If you are running a Hive metastore and your system admins are
smart, you can run:

hive> scan /data/sent/to/me/data.bundle

The above command would scan and import that data into your metastore. It
could be a wizard, or it could be headless. But now I can share datasets on
clouds and use them easily.
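
The import side could lean on the existing metastore client. A deliberately
simplified sketch, assuming the bundle descriptor is just a flat list of
partition names ("ds=2018-01-28" style) for a table that already exists and
whose partition data sits under the table's location:

  import java.nio.file.Files;
  import java.nio.file.Paths;
  import java.util.List;
  import org.apache.hadoop.hive.conf.HiveConf;
  import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;

  public class BundleImporter {
    // Read a (hypothetical) bundle descriptor listing partition names and
    // register each one against an existing table definition.
    public static void importBundle(String db, String table, String bundlePath)
        throws Exception {
      List<String> partitionNames = Files.readAllLines(Paths.get(bundlePath));
      HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
      try {
        for (String partName : partitionNames) {        // e.g. "ds=2018-01-28"
          client.appendPartition(db, table, partName);  // creates the metastore entry
        }
      } finally {
        client.close();
      }
    }
  }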
