On Mon, Jan 29, 2018 at 12:44 PM, Owen O'Malley <owen.omal...@gmail.com>
wrote:

>
>
> On Jan 29, 2018, at 9:29 AM, Edward Capriolo <edlinuxg...@gmail.com>
> wrote:
>
>
>
> On Mon, Jan 29, 2018 at 12:10 PM, Owen O'Malley <owen.omal...@gmail.com>
> wrote:
>
>> You should really look at what the Netflix guys are doing on Iceberg.
>>
>> https://github.com/Netflix/iceberg
>>
>> They have put a lot of thought into how to efficiently handle tabular
>> data in S3. They put all of the metadata in S3 except for a single link to
>> the name of the table's root metadata file.
>>
>> Other advantages of their design:
>>
>>    - Efficient atomic addition and removal of files in S3.
>>    - Consistent schema evolution across formats
>>    - More flexible partitioning and bucketing.
>>
>>
>> .. Owen
>>
>> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com>
>> wrote:
>>
>>> All,
>>>
>>> I have been bouncing around the earth for a while and have had the
>>> privilege of working at 4-5 places. On arrival, each place was at a
>>> different point in its Hadoop journey.
>>>
>>> One large company I was at had a ~200 TB Hadoop cluster. They actually
>>> ran Pig, and their ops group REFUSED to support Hive, even though they
>>> had written thousands of lines of Pig macros to deal with selecting
>>> from a partition, plus a Pig script file you would import just to know
>>> what the columns of the data at location /x/y/z are.
>>>
>>> In another lifetime I was at a shop that used Scalding. Again, lots of
>>> custom effort there with Avro and Parquet, all to do things that Hive
>>> would do out of the box. Again, the biggest challenge was the Thrift
>>> service and the metastore.
>>>
>>> In the cloud, many people will use a bootstrap script
>>> (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html)
>>> or 'msck repair'.
>>>
>>> The "rise of the cloud" has changed us all the metastore is being a
>>> database is a hard paradigm to support. Imagine for example I created data
>>> to an s3 bucket with hive, and another group in my company requires read
>>> only access to this data for an ephemeral request. Sharing the data is
>>> easy, S3 access can be granted, sharing the metastore and thrift services
>>> are much more complicated.
>>>
>>> So let's think outside the box:
>>>
>>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>>>
>>> Datastax was able to build a platform where the filesystem and the
>>> metastore were baked into Cassandra. Even though an HBase user would
>>> not want that, the novel thing about the approach is that the metastore
>>> was not "some extra thing in a database" that you had to deal with.
>>>
>>> What I am thinking is that, for the user of S3, the metastore should be
>>> in S3, probably in hidden files inside the warehouse/table directory(ies).
>>>
>>> Think of it as msck repair "on the fly"
>>> (https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html).
>>>
>>> The implementation could be something like this:
>>>
>>> On startup, read hive.metastore.warehouse.dir and look for a
>>> "_warehouse" marker. That would help us locate the databases; in the
>>> databases we can locate tables, and with the tables we can locate
>>> partitions.
>>>
>>> This will of course scale horribly for tables with 90,000,000
>>> partitions, but that would not be our use case. For all the people with
>>> "msck repair" in their bootstrap scripts, this gives a much cleaner way
>>> of using Hive.
>>>
>>> The implementations could even be "stacked": files first, falling back
>>> to the metastore second.
>>>
>>> It would also be wise to have a tool available in the CLI, "metastore
>>> <table> toJson", making it drop-dead simple to export the schema
>>> definitions.
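>>>
>>> A minimal sketch of that export, assuming we just reuse the existing
>>> metastore client and Thrift's simple-JSON output (the "toJson" command
>>> name above is hypothetical, and so is this class):
>>>
>>>   import org.apache.hadoop.hive.conf.HiveConf;
>>>   import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>>>   import org.apache.hadoop.hive.metastore.api.Table;
>>>   import org.apache.thrift.TSerializer;
>>>   import org.apache.thrift.protocol.TSimpleJSONProtocol;
>>>
>>>   // Sketch: print a table's metastore definition as JSON on stdout.
>>>   public class SchemaToJson {
>>>     public static void main(String[] args) throws Exception {
>>>       String db = args[0];
>>>       String table = args[1];
>>>       HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>>>       try {
>>>         Table tbl = client.getTable(db, table);
>>>         TSerializer json = new TSerializer(new TSimpleJSONProtocol.Factory());
>>>         System.out.println(json.toString(tbl)); // Table is Thrift-defined
>>>       } finally {
>>>         client.close();
>>>       }
>>>     }
>>>   }
>>>
>>> Since the api.Table object is already a Thrift struct, the JSON falls
>>> out almost for free; the real work would be agreeing on a stable format.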
>>>
>>> Thoughts?
>>>
>>>
>>>
>>
> Close!
>
> They have many concepts right, but the dealbreaker is that they have
> their own file format. This will ultimately be a downfall. Hive needs to
> continue working with a variety of formats, and this seems like a
> non-starter since everyone is already divided into camps over
> not-invented-here file formats.
>
>
> They define a different layout, but they use Avro, ORC, or Parquet for the
> data.
>
>
> Potentially we could implement this as a StorageHandler; this interface
> has been flexible and has had success
> (https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage). A storage
> handler could delegate to Iceberg or something else.
>
> I was thinking of this problem as more of a "docker" type solution. For
> example, let's say you have built a 40GB dataset divided into partitions
> by day. Imagine we build a Docker image; the image would launch with an
> embedded Derby DB (read only) and a start script that completely
> describes the data and the partitions. (You need some way to connect it
> to your processing.) But now we have a one-shot "shippable" Hive.
>
> Another approach: we have a JSON format, with files that live in each of
> the 40 partitions. If you are running a Hive metastore and your system
> admins are smart, you can run:
>
> hive> scan /data/sent/to/me/data.bundle
>
> The above command would scan and import that data into your datastore. It
> could be a wizard, it could be headless. But now I can share datasets on
> clouds and use them easily.
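>
> The bundle format does not exist today, so purely as a sketch: if the
> bundle carried a Thrift-JSON serialized Table descriptor, the "scan" step
> could be little more than deserialize plus createTable through the normal
> metastore client (file name and layout here are made up):
>
>   import java.nio.file.Files;
>   import java.nio.file.Paths;
>
>   import org.apache.hadoop.hive.conf.HiveConf;
>   import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>   import org.apache.hadoop.hive.metastore.api.Table;
>   import org.apache.thrift.TDeserializer;
>   import org.apache.thrift.protocol.TJSONProtocol;
>
>   // Sketch: import a shipped table descriptor into the local metastore.
>   public class BundleImporter {
>     public static void main(String[] args) throws Exception {
>       byte[] descriptor = Files.readAllBytes(Paths.get(args[0]));
>       Table tbl = new Table();
>       new TDeserializer(new TJSONProtocol.Factory())
>           .deserialize(tbl, descriptor);          // bundle -> Table struct
>       HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
>       try {
>         client.createTable(tbl);                  // register it locally
>       } finally {
>         client.close();
>       }
>     }
>   }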
>
>
>
>
>
> Owen,


I see you commenting here:
https://docs.google.com/document/d/1Q-zL5lSCle6NEEdyfiYsXYzX_Q8Qf0ctMyGBKslOswA/edit#


I also see this "Enum and union types are not supported." and "Map keys
must use a primitive string type, " The spec is already incompatible with
hive. However they have done the legwork on an important part:

"

This table format tracks individual data files in a table instead of
directories. This allows writers to create data files in-place and only
adds files to the table in an explicit commit.

Table state is maintained in metadata files. All changes to table state
create a new metadata file and replace the old metadata with an atomic
operation. The table metadata file tracks the table schema, partitioning
config, other properties, and snapshots of the table contents. Each
snapshot is a complete set of data files in the table at some point in
time. Snapshots are listed in the metadata file, but the files in a
snapshot are stored in separate manifest files.

The atomic transitions from one table metadata file to the next provide
snapshot isolation. Readers use the snapshot that was current when they
load the table metadata and are not affected by changes until they refresh
and pick up a new metadata location.

Data files in snapshots are stored in one or more manifest files that
contain a row for each data file in the table, its partition data, and its
metrics. A snapshot is the union of all files in its manifests. Manifest
files can be shared between snapshots to avoid rewriting metadata that is
slow-changing."
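
To make that commit pattern concrete, here is a minimal sketch (not
Iceberg's actual code) of the "write a new metadata file, then swap a
single pointer" step, using the Hadoop FileSystem API. On HDFS the rename
is atomic; on raw S3 the pointer swap would need something stronger behind
it, which is exactly why they keep that one link outside S3:

  import java.io.IOException;
  import java.nio.charset.StandardCharsets;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Sketch of a snapshot commit: every commit writes a brand new
  // metadata file and then swaps one pointer file to reference it.
  public class MetadataCommit {

    public static void commit(Configuration conf, Path tableDir,
                              long version, String metadataJson)
        throws IOException {
      FileSystem fs = tableDir.getFileSystem(conf);

      // 1. Write the new table metadata under a unique, versioned name.
      Path newMeta = new Path(tableDir,
          "metadata/v" + version + ".metadata.json");
      try (FSDataOutputStream out = fs.create(newMeta, false)) {
        out.write(metadataJson.getBytes(StandardCharsets.UTF_8));
      }

      // 2. Swap the pointer. Readers load whatever the pointer referenced
      //    at the moment they read it, giving them a consistent snapshot.
      Path tmp = new Path(tableDir, "metadata/.current.tmp");
      try (FSDataOutputStream out = fs.create(tmp, true)) {
        out.write(newMeta.toString().getBytes(StandardCharsets.UTF_8));
      }
      Path current = new Path(tableDir, "metadata/current");
      fs.delete(current, false);   // delete+rename is NOT atomic by itself
      if (!fs.rename(tmp, current)) {
        throw new IOException("Lost a commit race for " + current);
      }
    }
  }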


This is a solution that works around the limitations of an external
metastore. For example, we could implement it like so:

hive --hiveconf rawstoremap=database1:com.myclass

*This would instruct Hive to use a different RawStore for database1.*

https://cwiki.apache.org/confluence/display/Hive/AdminManual+MetastoreAdmin
hive.metastore.rawstore.impl

The most naive RawStore could implement a version of what is described
above. We could do it in a primitive, non-performant way initially,
targeting small tables (< 300 partitions).
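
A rough sketch of the routing half, assuming the hypothetical rawstoremap
property above (today Hive only consults the single
hive.metastore.rawstore.impl class, so in practice a router like this
would itself have to be installed as that class and delegate):

  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hive.metastore.RawStore;

  // Sketch: parse a hypothetical "rawstoremap" property of the form
  // "database1:com.example.S3RawStore,database2:com.example.Other"
  // and hand back a RawStore for a given database, defaulting otherwise.
  public class RawStoreRouter {

    private final Map<String, String> storeByDatabase = new HashMap<>();
    private final String defaultImpl;

    public RawStoreRouter(Configuration conf) {
      this.defaultImpl = conf.get("hive.metastore.rawstore.impl",
          "org.apache.hadoop.hive.metastore.ObjectStore");
      for (String entry : conf.get("rawstoremap", "").split(",")) {
        String[] parts = entry.split(":", 2);
        if (parts.length == 2) {
          storeByDatabase.put(parts[0].trim(), parts[1].trim());
        }
      }
    }

    /** Instantiate the store configured for this database, or the default. */
    public RawStore storeFor(String database)
        throws ReflectiveOperationException {
      String impl = storeByDatabase.getOrDefault(database, defaultImpl);
      return (RawStore) Class.forName(impl)
          .getDeclaredConstructor().newInstance();
    }
  }

The naive file-backed implementation for a small table would then only
have to answer the get-table and get-partition calls from the files
described earlier, and could leave everything else to the default store.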
