Comments inlined.

On Sun, Jul 2, 2017 at 3:22 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
> I am not sure I am on the fence with this.
>
> I am -1, and I offer this -1 with the hope of being convinced otherwise
>

Thank you for being open to reconsider.

> "By making it a separate project we will enable other projects to join us
> in innovating on the metastore. "
>
> The relevant questions I have are,
>
> "What is stopping others from joining us now?"
> "What does being a TLP do for us that we do not have now?"
>

Walking through a use case will help answer these. This is a real world situation, not a hypothetical. I’ve been talking with a team building a schema registry for Kafka[1]. I’d like them to use the Hive metastore rather than reinvent the wheel. I believe this would be good for users (all their tools can work together on a shared understanding of the data), for admins (just one metadata store to administer), and for the ecosystem (tools can work across stored data and streaming data).

This system has some requirements on metadata that Hive does not. To take one example, it would like a schema to be a top-level concept instead of a concept tied to tables or partitions. This is not a problem for Hive, but neither is it interesting. So if they come with patches for this, would we accept them? As the Hive PMC our answer will be no, because it doesn’t help Hive’s metadata.

Even if we accept their patches, will we make them committers when we know they don’t care about Hive as Hive, but only about the metastore? Again, the right answer for the Hive PMC is no.

And we cannot say that Hive should support a generic metadata system within itself. That turns Hive into an umbrella project, which Apache has repeatedly worked to avoid. So Hive will either need to reject non-Hive-centric features and contributors or end up in a place Apache has worked to avoid.

And finally, why would other teams want to mess with all of Hive when they only want the metastore? Hive is a large and complex system.
If we break the metastore out it is much more approachable for non-Hive contributors. Obviously the Hive team doesn’t want to see their metastore turn into something unusable by Hive, which is why we were specific in saying we wanted it to continue to support high performance SQL systems.

My experience in watching ORC move out of Hive is that the adoption has increased significantly. It is reasonable to assume that moving the metastore out will also increase adoption and make it easier for others to get involved.

> I see a lot of downsides:
> 1) We have to maintain two sites
> 2) we have to maintain two committer lists
>
> A large problem I see is this: Hive is already being pulled in too many
> different directions. There is some grumbling about the state of
> hive-on-spark.
>

I believe this argues in favor of the split, not against. By pulling out the metastore we are relieving pressure on Hive itself. Let Hive focus on being a SQL engine. Let another team focus on runtime metadata.

On your committer questions in later emails, the point of going to a TLP has nothing to do with adding new committers. Traditionally new projects start in the incubator. But given that all of the PMC of this new project are already experienced Hive PMC members, I see no reason to go through the incubator. I agree with you that we would not throw any new people into the mix. People join the project in the same way as always, by contributing.

Alan.

1. https://github.com/hortonworks/registry

> Most importantly, our release process seems 'injured' by too many branches
> going off in different ways. If the metastore lives outside of Hive we are
> going to compound this issue. I would strongly suggest we do not undertake
> this until we can at least turn out 2 usable releases in a 6 month period.
>