> Had the original Spark proposal implied they wished to fork portions of the
> hive code base, I would have considered it a hostile fork. (this is open to
> interpretation).

FYI, I did ask bluntly whether Spark intends to cut-paste Hive code into their repos previously & got an affirmative answer from rxin.

http://grokbase.com/t/hive/dev/15cjb3kjvn/using-the-hive-sql-parser-in-spark

> People have the right to fork it via the licence. We can not stop that.

Later, I did get a response that they never made a release with the said copy-paste & they deprecated the "HiveContext" object in Spark 2.0.

> than what Hive could handle at peak."
>
> (EC Note: How is this statement verifiable?)

Reading about Hive at Facebook, I feel like we've already solved those problems that were due to FB Corona + Hadoop-1 (or, 0.20 *shudder*) limitations.

Spark does not need be limited by Corona and the version of Hive being compared might not have YARN or Tez on its side.

Cheers,
Gopal

On 3/2/17, 8:25 PM, "Edward Capriolo" <edlinuxg...@gmail.com> wrote:

All,

I have compiled a short (non exhaustive) list of items related to Spark's forking of Apache Hive code and usage of Apache Hive trademarks.

1)
----------------------------
The original spark proposal repeatedly claims that Spark "inter operates" with hive.

https://wiki.apache.org/incubator/SparkProposal

"Finally, Shark (a higher layer framework built on Spark) inter-operates with Apache Hive."

(EC note: Originally spark may have linked to hive, but now the situation is much different.)
-------------------------

2)
------------------
Spark distributes jar files to maven repositories carrying the hive name.
https://mvnrepository.com/artifact/org.spark-project.hive/hive-exec

(EC note: These are not simple "ports"; features are added/missing/broken in artifacts named "hive")
-----------------------

3)
---------------------------------
Spark carries forked and modified copies of hive source code

https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionHookContextImpl.java
--------------------------------------------

4
-------------------------------
Spark has "imported" and modified components of hive

https://issues.apache.org/jira/browse/SPARK-12572

(EC note: Further discussions of the code make little to no reference to its origins in propaganda)
---------------------------------------------

5
--------------------------------
Databricks, a company heavily involved in spark development, uses the Hive trademark to make claims

https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"The Databricks platform provides a fully managed Hive Metastore that allows users to share a data catalog across multiple Spark clusters."

This blog defining hadoop (draft) is clear on this:

https://wiki.apache.org/hadoop/Defining%20Hadoop

"Products that are derivative works of Apache Hadoop are not Apache Hadoop, and may not call themselves versions of Apache Hadoop, nor Distributions of Apache Hadoop."
--------------------

6
----------------------
https://databricks.com/blog/2017/01/30/integrating-central-hive-metastore-apache-spark-databricks.html

"Apache Spark supports multiple versions of Hive, from 0.12 up to 1.2.1. "

Apache spark can NOT support multiple versions of Hive because they are working with a fork, and there is no standard body for "supporting hive"

Some products have been released that have been described as "compatible" with Hadoop, even though parts of the Hadoop codebase have either been changed or replaced.
The Apache™ Hadoop® developer team are not a standards body: they do not qualify such (derivative) works as compatible. Nor do they feel constrained by the requirements of external entities when changing the behavior of Apache Hadoop software or related Apache software.
-----------------------

7
---------------------------------
The spark committers openly use the word "take" during the process of "importing" hive code.

https://github.com/apache/spark/pull/10583/files

"are there unit tests from Hive that we can take?"

Apache foundation will not take a hostile fork for a proposal. Had the original Spark proposal implied they wished to fork portions of the hive code base, I would have considered it a hostile fork. (this is open to interpretation).

(EC Note: Is this the Apache way? How can we build communities? How would small projects feel if, for example, hive "imported" code by copying it while they sat in incubation?)
------------------------------

8
----------------------------
Databricks (after borrowing slabs of hive code, using our trademarks, etc) makes disparaging comments about the performance of hive.

https://databricks.com/blog/2017/02/28/voice-facebook-using-apache-spark-large-scale-language-model-training.html

"Spark-based pipelines can scale comfortably to process many times more input data than what Hive could handle at peak. "

(EC Note: How is this statement verifiable?)
-----------------------------------------------

9
--------------------------
https://issues.apache.org/jira/browse/SPARK-10793

It's easily enough added, to the code, there's just the risk of the fork diverging more from ASF hive.

(EC Note: Even those responsible for this admit the code is diverging and will diverge more from their actions.)
------------------------

10
----------------------
My opinion of all of this:

The above points are hurtful to Hive. First, we are robbed of community.
People could be improving hive by making it more modular, but instead they are improving Spark's fork of hive. Next, our code base is subject to continued "poaching". Apache Spark "imports", copies, alters, and claims compatibility with/from Hive (I pointed out above why the compatibility claims should not be made). Finally, we are subject to unfair performance comparisons, "x is faster than hive", by software (spark) that is essentially *POWERED BY Hive (via the forking and code copying).*

Hive has always been a bullseye as the best hadoop SQL

https://vision.cloudera.com/impala-v-hive/

In my hood we have a saying, "Haters gonna hate"

For every Impala and every Spark claiming to be better than hive, there are 10 HadoopDBs that collapsed under the weight of themselves. We outlasted fleets of them.

That being said, software like Hive Metastore is our baby. It is our TM. It is our creation. It is what makes us special. People have the right to fork it via the licence. We can not stop that. But it can't be both ways: either downstream needs to bring in our published artifacts, or they fork and give what they are doing another name.

None of this activity represents what I believe is the "Apache Way". I believe the Apache Way would be to communicate to us, the hive community, about ways to make the components more modular and easier to use in other projects. Users suffer when the same code "moves" between two projects: there is fragmentation, and typically it leads to negative effects for both projects.
--------------------------------------

Thanks,
Edward