Thanks, Owen. I agree, Iceberg addresses a lot of the problems that you're hitting here. It doesn't quite go as far as moving all metadata into the file system. You can do that on HDFS and other file systems that support atomic rename, but not in S3 (Iceberg has an implementation of that strategy for HDFS). For S3 you need some way of making commits atomic, for which we are using a metastore that is far more lightweight. You could also use a ZooKeeper cluster for write-side locking, or maybe there are other clever ideas out there.
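To make that concrete, the rename-based commit is roughly the following (a sketch against the Hadoop FileSystem API with an illustrative file layout, not the actual Iceberg code):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: commit a new version of the table metadata by writing it to a
// temporary file and atomically renaming it into place. On HDFS, rename to
// an existing destination fails instead of overwriting, so exactly one
// writer wins a race on the same version number.
public class RenameCommit {
  public static boolean commit(FileSystem fs, Path tableDir,
                               long newVersion, byte[] metadataJson) throws IOException {
    Path tmp = new Path(tableDir, "metadata/.tmp-" + newVersion + ".json");
    Path finalPath = new Path(tableDir, "metadata/v" + newVersion + ".json");

    // Write the new metadata to a temporary file first.
    try (FSDataOutputStream out = fs.create(tmp, false /* do not overwrite */)) {
      out.write(metadataJson);
    }

    // Atomic rename: the loser of a race sees false and retries against
    // the new current version.
    boolean committed = fs.rename(tmp, finalPath);
    if (!committed) {
      fs.delete(tmp, false);
    }
    return committed;
  }
}

On S3, rename is a copy plus delete and is not atomic, which is why you need the lightweight metastore (or a lock) to arbitrate that last pointer swap instead.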
Even if Iceberg is agnostic to the commit mechanism, it does almost all of what you're suggesting, and does it in a way that's faster than the current metastore while providing snapshot isolation.

rb

On Mon, Jan 29, 2018 at 9:10 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:

> You should really look at what the Netflix guys are doing on Iceberg.
>
> https://github.com/Netflix/iceberg
>
> They have put a lot of thought into how to efficiently handle tabular data in S3. They put all of the metadata in S3 except for a single link to the name of the table's root metadata file.
>
> Other advantages of their design:
>
> - Efficient atomic addition and removal of files in S3.
> - Consistent schema evolution across formats.
> - More flexible partitioning and bucketing.
>
> .. Owen
>
> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> All,
>>
>> I have been bouncing around the earth for a while and have had the privilege of working at 4-5 places. On arrival, each place was at a different point in its Hadoop journey.
>>
>> One large company I was at had a ~200 TB Hadoop cluster. They actually ran Pig, and their ops group REFUSED to support Hive, even though they had written thousands of lines of Pig macros to deal with selecting from a partition, or a Pig script file you would import so you would know what the columns of the data at location /x/y/z are.
>>
>> In another lifetime I was at a shop that used Scalding. Again, lots of custom effort there with Avro and Parquet, all to do things that Hive would do out of the box. Again, the biggest challenge was the Thrift service and metastore.
>>
>> In the cloud, many people will use a bootstrap script https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html or 'msck repair'.
>>
>> The "rise of the cloud" has changed us all; the metastore being a database is a hard paradigm to support. Imagine, for example, that I write data to an S3 bucket with Hive, and another group in my company requires read-only access to that data for an ephemeral request. Sharing the data is easy, since S3 access can be granted; sharing the metastore and Thrift services is much more complicated.
>>
>> So let's think out of the box:
>>
>> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>>
>> DataStax was able to build a platform where the filesystem and the metastore were baked into Cassandra. Even though an HBase user would not want that, the novel thing about that approach is that the metastore was not "some extra thing in a database" that you had to deal with.
>>
>> What I am thinking is that, for the S3 user, the metastore should be in S3, probably in hidden files inside the warehouse/table directory(ies).
>>
>> Think of it as msck repair "on the fly": https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
>>
>> The implementation could be something like this:
>>
>> On startup, read hive.warehouse.dir and look for "_warehouse". That would help us locate the databases; in the databases we can locate tables, and with the tables we can locate partitions.
>>
>> This will of course scale horribly across tables with 90000000 partitions, but that would not be our use case. For all the people with "msck repair" in their bootstrap, this is a much cleaner way of using Hive.
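Something like the following is what I'd picture for that startup scan; this is only a sketch of the listing pass, and the "_warehouse" / "_table.json" markers and class names are made up, not an existing Hive convention:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the "msck repair on the fly" discovery pass: walk
// hive.warehouse.dir, treat any directory under a _warehouse marker as a
// database, and any child directory with a hidden _table.json file (the
// serialized schema) as a table.
public class WarehouseScan {
  public static List<Path> discoverTables(FileSystem fs, Path warehouseDir) throws IOException {
    List<Path> tables = new ArrayList<>();
    if (!fs.exists(new Path(warehouseDir, "_warehouse"))) {
      return tables; // not a self-describing warehouse
    }
    for (FileStatus db : fs.listStatus(warehouseDir)) {
      if (!db.isDirectory()) continue;
      for (FileStatus table : fs.listStatus(db.getPath())) {
        if (table.isDirectory() && fs.exists(new Path(table.getPath(), "_table.json"))) {
          tables.add(table.getPath());
        }
      }
    }
    return tables;
  }
}

Recursing one more level under each table directory would pick up the partitions, which is more or less what 'msck repair' walks today.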
>> The implementations could even be "stacked": files first, metastore lookup second.
>>
>> It would also be wise to have a tool available in the CLI, "metastore <table> toJson", making it drop-dead simple to export the schema definitions.
>>
>> Thoughts?
>

-- Ryan Blue
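For concreteness, the "stacked" lookup Edward describes above could be as small as this sketch; TableMetadataStore and TableDef are hypothetical names, not an existing Hive API:

import java.util.List;
import java.util.Optional;

// Sketch of the "stacked" idea: resolve table metadata from the
// filesystem-backed store first and fall back to the classic metastore.
interface TableMetadataStore {
  Optional<TableDef> lookup(String db, String table);
}

class StackedMetadataStore implements TableMetadataStore {
  private final TableMetadataStore fileStore;   // e.g. reads _table.json under the table dir
  private final TableMetadataStore metastore;   // e.g. wraps the Thrift metastore client

  StackedMetadataStore(TableMetadataStore fileStore, TableMetadataStore metastore) {
    this.fileStore = fileStore;
    this.metastore = metastore;
  }

  @Override
  public Optional<TableDef> lookup(String db, String table) {
    Optional<TableDef> fromFiles = fileStore.lookup(db, table);
    return fromFiles.isPresent() ? fromFiles : metastore.lookup(db, table);
  }
}

// Minimal placeholder for the schema a "metastore <table> toJson" command
// might export: name, location, columns, partition keys.
class TableDef {
  String name;
  String location;
  List<String> columns;
  List<String> partitionKeys;
}

The same TableDef could be what the toJson export serializes, so the CLI tool and the file-backed store would share one format.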