> On Jan 29, 2018, at 9:29 AM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
> On Mon, Jan 29, 2018 at 12:10 PM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> You should really look at what the Netflix guys are doing on Iceberg.
>
> https://github.com/Netflix/iceberg
>
> They have put a lot of thought into how to efficiently handle tabular data in S3. They put all of the metadata in S3 except for a single link to the name of the table's root metadata file.
>
> Other advantages of their design:
> - Efficient atomic addition and removal of files in S3.
> - Consistent schema evolution across formats.
> - More flexible partitioning and bucketing.
>
> .. Owen
>
> On Sun, Jan 28, 2018 at 12:02 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> All,
>
> I have been bouncing around the earth for a while and have had the privilege of working at 4-5 places. On arrival, each place was at a different point in its Hadoop journey.
>
> One large company I was at had a ~200 TB Hadoop cluster. They actually ran Pig, and their ops group REFUSED to support Hive, even though they had written thousands of lines of Pig macros to deal with selecting from a partition, or a Pig script file you would import so you would know what the columns of the data at location /x/y/z were.
>
> In another lifetime I was at a shop that used Scalding. Again, lots of custom effort there with Avro and Parquet, all to do things that Hive would do out of the box. Again, the biggest challenge was the Thrift service and the metastore.
>
> In the cloud, many people will use a bootstrap script
> https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
> or 'msck repair'.
>
> The "rise of the cloud" has changed us all: the metastore being a database is a hard paradigm to support. Imagine, for example, that I created data in an S3 bucket with Hive, and another group in my company requires read-only access to this data for an ephemeral request. Sharing the data is easy, since S3 access can be granted; sharing the metastore and Thrift services is much more complicated.
>
> So let's think out of the box:
>
> https://www.datastax.com/2011/03/brisk-is-here-hadoop-and-cassandra-together-at-last
>
> DataStax was able to build a platform where the filesystem and the metastore were baked into Cassandra. Even though an HBase user would not want that, the novel thing about that approach is that the metastore was not "some extra thing in a database" that you had to deal with.
>
> What I am thinking is that, for the user of S3, the metastore should be in S3, probably in hidden files inside the warehouse/table directory(ies).
>
> Think of it as msck repair "on the fly":
> https://www.ibm.com/support/knowledgecenter/SSPT3X_4.2.5/com.ibm.swg.im.infosphere.biginsights.commsql.doc/doc/biga_msckrep.html
>
> The implementation could be something like this:
>
> On startup, read hive.metastore.warehouse.dir and look for "_warehouse". That would help us locate the databases; in the databases we can locate tables, and with the tables we can locate partitions.
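A minimal sketch of what that startup scan might look like, going through the Hadoop FileSystem API so the same walk runs against s3a:// as well as HDFS. The "_warehouse" marker and the database/table/partition directory layout are assumptions taken from the paragraph above, not existing Hive behavior:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WarehouseScan {
      public static void main(String[] args) throws Exception {
        Path warehouse = new Path(args[0]);   // e.g. s3a://bucket/warehouse
        FileSystem fs = warehouse.getFileSystem(new Configuration());

        // proposed marker file that tags the directory as a warehouse root
        if (!fs.exists(new Path(warehouse, "_warehouse"))) {
          System.err.println("no _warehouse marker under " + warehouse);
          return;
        }
        for (FileStatus db : fs.listStatus(warehouse)) {
          if (!db.isDirectory()) continue;                  // database dirs
          for (FileStatus tbl : fs.listStatus(db.getPath())) {
            if (!tbl.isDirectory()) continue;               // table dirs
            for (FileStatus part : fs.listStatus(tbl.getPath())) {
              if (part.isDirectory()) {                     // partition dirs
                System.out.println(db.getPath().getName() + "."
                    + tbl.getPath().getName() + " => "
                    + part.getPath().getName());
              }
            }
          }
        }
      }
    }

The cost is one listing round trip per directory, which is exactly why this would fall over at the partition counts mentioned next.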
> This will of course scale horribly across tables with 90,000,000 partitions, but that would not be our use case. For all the people with "msck repair" in their bootstrap, this gives a much cleaner way of using Hive.
>
> The implementations could even be "stacked": files first, metastore lookback second.
>
> It would also be wise to have a tool available in the CLI, "metastore <table> toJson", making it drop-dead simple to export the schema definitions.
>
> Thoughts?
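On the "stacked" idea above, the lookup could be as simple as trying the file-based definition first and falling back to the Thrift client. IMetaStoreClient is Hive's existing metastore client interface; readTableFromMarkerFile() is hypothetical:

    import org.apache.hadoop.hive.metastore.IMetaStoreClient;
    import org.apache.hadoop.hive.metastore.api.Table;

    public class StackedResolver {
      private final IMetaStoreClient metastore;   // second layer of the stack

      public StackedResolver(IMetaStoreClient metastore) {
        this.metastore = metastore;
      }

      public Table resolve(String db, String tbl) throws Exception {
        Table fromFiles = readTableFromMarkerFile(db, tbl);
        if (fromFiles != null) {
          return fromFiles;                       // files win when present
        }
        return metastore.getTable(db, tbl);       // metastore lookback
      }

      // Would parse the hidden metadata file under the table directory;
      // returns null when no file-based definition exists.
      private Table readTableFromMarkerFile(String db, String tbl) {
        return null;                              // placeholder
      }
    }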
> Close!
>
> They ultimately have many concepts right, but the dealbreaker is that they have their own file format. This ultimately will be a downfall. Hive needs to continue working with a variety of formats. This seems like a non-starter, as everyone is already divided into camps on not-invented-here file formats.

They define a different layout, but they use Avro, ORC, or Parquet for the data.

> Potentially we could implement this as a StorageHandler; this interface has been flexible and has had success:
> https://github.com/mongodb/mongo-hadoop/wiki/Hive-Usage
> A storage handler can delegate to Iceberg or something else.
>
> I was thinking of this problem as more of a "docker" type solution. For example, let's say you have built a 40 GB dataset divided into partitions by day. Imagine we build a Docker image; the image would launch with an embedded Derby DB (read only) and a start script that completely describes the data and the partitions. (You need some way to connect it to your processing.) But now we have a one-shot "shippable" Hive.
>
> Another approach: we have a JSON format, with files that live in each of the 40 partitions. If you are running a Hive metastore and your system admins are smart, you can run:
>
> hive> scan /data/sent/to/me/data.bundle
>
> The above command would scan and import that data into your datastore. It could be a wizard, it could be headless. But now I can share datasets on clouds and use them easily.
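On the StorageHandler route: Hive's extension point is org.apache.hadoop.hive.ql.metadata.HiveStorageHandler, and a skeleton is small. In this sketch the Avro classes are stand-in delegates only; nothing here is an actual Iceberg binding:

    import org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat;
    import org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat;
    import org.apache.hadoop.hive.ql.metadata.DefaultStorageHandler;
    import org.apache.hadoop.hive.serde2.AbstractSerDe;
    import org.apache.hadoop.hive.serde2.avro.AvroSerDe;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.OutputFormat;

    // Delegates the bytes-on-disk work to an existing format; the interesting
    // part for this thread would be overriding getMetaHook() so that table
    // create/drop reads and writes the hidden S3 metadata files instead of
    // rows in the RDBMS-backed metastore.
    public class S3MetadataStorageHandler extends DefaultStorageHandler {
      @Override
      public Class<? extends InputFormat> getInputFormatClass() {
        return AvroContainerInputFormat.class;    // stand-in delegate
      }

      @Override
      public Class<? extends OutputFormat> getOutputFormatClass() {
        return AvroContainerOutputFormat.class;   // stand-in delegate
      }

      @Override
      public Class<? extends AbstractSerDe> getSerDeClass() {
        return AvroSerDe.class;                   // stand-in delegate
      }
    }

A table would pick this up through the usual CREATE TABLE ... STORED BY clause, the same mechanism the MongoDB handler linked above uses.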
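For the JSON-bundle variant, the per-partition file could stay tiny. Every field name below is invented purely to make the idea concrete:

    {
      "table": "web_logs",
      "schema": [
        {"name": "ts",     "type": "timestamp"},
        {"name": "url",    "type": "string"},
        {"name": "status", "type": "int"}
      ],
      "partition": {"ds": "2018-01-28"},
      "format": "parquet",
      "files": ["part-00000.parquet", "part-00001.parquet"]
    }

A scan would walk the bundle, parse these files, and replay them as CREATE TABLE and ALTER TABLE ... ADD PARTITION calls against the local metastore, which is the headless-wizard behavior described above.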