On Mon, Jan 29, 2018 at 12:10 PM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> You should really look at what the Netflix guys are doing on Iceberg.
>
> https://github.com/Netflix/iceberg
>
> They have put a lot of thought into how to efficiently handle tabular data
> in S3. They put all of the metadata in S3 except for a single link to the
> name of the table's root metadata file.
>
> Other advantages of their design:
>
>    - Efficient atomic addition and removal of files in S3.
>    - Consistent schema evolution across formats
>    - More flexible partitioning and bucketing.
>
> .. Owen
>
> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> All,
>>
>> I have been bouncing around the earth for a while and have had the
>> privilege of working at 4-5 places. On arrival, each place was at a
>> different point in its Hadoop journey.
>>
>> One large company that I was at had a ~200 TB Hadoop cluster. They
>> actually ran Pig, and their ops group REFUSED to support Hive, even though
>> they had written thousands of lines of Pig macros to deal with selecting
>> from a partition, or a Pig script file you would import so you would know
>> what the columns of the data at location /x/y/z are.
>>
>> In another lifetime I was at a shop that used Scalding. Again, lots of
>> custom effort there with Avro and Parquet, all to do things that Hive
>> would do out of the box. Again, the biggest challenge was the thrift
>> service and metastore.
>>
>> In the cloud many people will use a bootstrap script
>> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
>> or 'msck repair'.
>>
>> The "rise of the cloud" has changed us all: the metastore being a
>> database is a hard paradigm to support. Imagine, for example, that I
>> created data in an S3 bucket with Hive, and another group in my company
>> requires read-only access to this data for an ephemeral request. Sharing
>> the data is easy, since S3 access can be granted; sharing the metastore
>> and thrift services is much more complicated.
>>
>> So let's think out of the box:
>>
>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>>
>> Datastax was able to build a platform where the filesystem and the
>> metastore were baked into Cassandra. Even though an HBase user would not
>> want that, the novel thing about that approach is that the metastore was
>> not "some extra thing in a database" that you had to deal with.
>>
>> What I am thinking is that for the user of S3, the metastore should be
>> in S3, probably in hidden files inside the warehouse/table directory(ies).
>>
>> Think of it as msck repair "on the fly":
>> https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
>>
>> The implementation could be something like this:
>>
>> On startup, read hive.warehouse.dir and look for "_warehouse". That would
>> help us locate the databases, in the databases we can locate tables, and
>> with the tables we can locate partitions.
>>
>> This will of course scale horribly across tables with 90000000 partitions,
>> but that would not be our use case. For all the people with "msck repair"
>> in their bootstrap, this would be a much cleaner way of using Hive.
>>
>> The implementations could even be "stacked": files first, metastore
>> lookback second.
>>
>> It would also be wise to have a tool available in the CLI, "metastore
>> <table> toJson", making it drop dead simple to export the schema
>> definitions.
>>
>> Thoughts?

Close!
They ultimately have many of the concepts right, but the dealbreaker is that
they have their own file format. That will ultimately be its downfall: Hive
needs to continue working with a variety of formats, and this seems like a
non-starter when everyone is already divided into camps over
not-invented-here file formats.

Potentially we could implement it as a StorageHandler. That interface has
been flexible and has had success
(https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage); a storage handler
could delegate to Iceberg or something else.

I was thinking of this problem as more of a "docker" type solution. For
example, let's say you have built a 40 GB dataset divided into partitions by
day. Imagine we build a Docker image that launches with an embedded Derby DB
(read-only) and a start script that completely describes the data and the
partitions. (You would still need some way to connect it to your processing.)
But now we have a one-shot, "shippable" Hive.

Another approach: we have a JSON format with files that live in each of the
40 partitions. If you are running a Hive metastore and your system admins are
smart, you can run:

hive> scan /data/sent/to/me/data.bundle

The above command would scan and import that data into your metastore. It
could be a wizard, it could be headless. But now I can share datasets on
clouds and use them easily.
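
To make the "scan a shipped bundle" idea (and the "_warehouse" layout quoted
above) a little more concrete, here is a rough sketch. Everything in it is
made up for illustration: the _table.json descriptor name and layout do not
exist in Hive today, and it assumes Jackson is on the classpath. The idea is
simply to walk a bundle directory, read one small JSON descriptor per table,
e.g. {"name":"events","format":"PARQUET","columns":[{"name":"uid","type":"bigint"}],"partitions":[{"spec":"dt='2018-01-28'","path":"dt=2018-01-28"}]},
and print the DDL a headless importer could replay against a real metastore:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical sketch of "hive> scan /data/sent/to/me/data.bundle".
// Assumes (purely illustrative) that every table directory in the bundle
// carries a _table.json descriptor listing columns, storage format and
// partitions; prints CREATE TABLE / ADD PARTITION statements instead of
// talking to a real metastore.
public class BundleScanner {

    private static final ObjectMapper MAPPER = new ObjectMapper();

    public static void main(String[] args) throws IOException {
        Path bundle = Paths.get(args.length > 0 ? args[0] : "data.bundle");
        try (Stream<Path> files = Files.walk(bundle)) {
            files.filter(p -> p.getFileName().toString().equals("_table.json"))
                 .forEach(BundleScanner::printDdl);
        }
    }

    private static void printDdl(Path descriptor) {
        try {
            JsonNode table = MAPPER.readTree(descriptor.toFile());
            String name = table.get("name").asText();

            // columns: [{"name":"uid","type":"bigint"}, ...]
            StringBuilder cols = new StringBuilder();
            for (JsonNode col : table.get("columns")) {
                if (cols.length() > 0) cols.append(", ");
                cols.append(col.get("name").asText())
                    .append(" ")
                    .append(col.get("type").asText());
            }
            System.out.printf("CREATE EXTERNAL TABLE IF NOT EXISTS %s (%s)%n", name, cols);
            System.out.printf("  STORED AS %s LOCATION '%s';%n",
                    table.get("format").asText(), descriptor.getParent().toUri());

            // partitions: [{"spec":"dt='2018-01-28'","path":"dt=2018-01-28"}, ...]
            for (JsonNode part : table.get("partitions")) {
                System.out.printf("ALTER TABLE %s ADD IF NOT EXISTS PARTITION (%s) LOCATION '%s';%n",
                        name,
                        part.get("spec").asText(),
                        descriptor.getParent().resolve(part.get("path").asText()).toUri());
            }
        } catch (IOException e) {
            System.err.println("Skipping unreadable descriptor " + descriptor + ": " + e.getMessage());
        }
    }
}

Pointed at a "_warehouse" tree instead of a shipped bundle, roughly the same
walk, run lazily whenever the metastore has no entry for a path, would give
the "files first, metastore lookback second" stacking described above.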