Hi!
On behalf of the Hivemall team, I'd like to kick off
a discussion thread around accepting Hivemall
into the ASF Incubator.
Hivemall is a library for machine learning implemented
as Hive UDFs/UDAFs/UDTFs that runs on Hadoop-based
data processing frameworks. More specifically, it currently
runs on Apache Hive, Apache Spark, and Apache Pig, all of which
support Hive UDFs as an extension mechanism.
Here's the link to the proposal:
https://wiki.apache.org/incubator/HivemallProposal
and the full text is also attached to this email.
Two of the areas that I'd like to explicitly solicit IPMC's opinion
on are:
1. whether the process of re-licensing from LGPL to ALv2
was enough given the ASF's strict IP policies
2. whether the 5 initial committers make sense given that
there's a total of 15 contributors as per GitHub stats.
With that, thanks, in advance, for your time and let the discussion
begin!
Thanks,
Roman.
== Abstract ==
Hivemall is a library for machine learning implemented as Hive
UDFs/UDAFs/UDTFs.
Hivemall runs on Hadoop-based data processing frameworks, specifically
on Apache Hive, Apache Spark, and Apache Pig, that support Hive UDFs
as an extension mechanism.
== Proposal ==
Hivemall is a collection of machine learning algorithms and versatile
data analytics functions. It provides a number of easy-to-use machine
learning functionalities through user-defined functions (UDFs),
user-defined aggregate functions (UDAFs), and/or user-defined table
generating functions (UDTFs) of Apache Hive. It offers a variety of
functionalities: regression, classification, recommendation, anomaly
detection, k-nearest neighbor, and feature engineering. Hivemall
supports state-of-the-art machine learning algorithms such as Soft
Confidence Weighted, Adaptive Regularization of Weight Vectors,
Factorization Machines, and AdaDelta. Hivemall is mainly designed to
run on Apache Hive, but it also supports Apache Pig and Apache Spark
as runtimes.
== Background ==
Hivemall started as a research project of the main developer at the
National Institute of Advanced Industrial Science and Technology
(AIST) in 2013, and the initial version was released on 2 Oct, 2013 on
GitHub: https://github.com/myui/hivemall.
After the main developer moved to Treasure Data in 2015, the project
has been actively developed as an open source product, and its license
was changed from GNU LGPL v2.1 to Apache License v2 on Mar 16, 2015.
The project copyright holders agreed to the license change at that
time.
The community is growing incrementally, and the project has 15
contributors, 431 stars, and 131 forks on GitHub as of Aug 15, 2016.
The project received an InfoWorld Bossie Award (the best open
source big data tools) in 2014.
Past major contributions by external contributors include Apache Pig
support from Daniel Dai (Hortonworks), and the Apache Spark port and
an integration with Apache YARN from Takeshi Yamamuro (NTT). Hivemall
was originally designed for Apache Hive but it now supports Apache
Spark and Apache Pig.
== Rationale ==
User-defined functions are a powerful mechanism to enrich the
expressive power of declarative query languages like SQL, HiveQL, Pig
Latin, and Spark SQL. The Hive UDF interface is becoming the de facto
standard for SQL-on-Hadoop platforms; Apache Spark and Apache Pig have
full support for Hive UDFs/UDAFs/UDTFs, and Apache Impala, Apache
Drill, and Apache Tajo also have limited support for Hive UDFs/UDAFs.
Hivemall can be considered a cross-platform library for machine
learning because it is implemented as cross-platform Hive
UDFs/UDAFs/UDTFs; prediction models built by a batch query of Apache
Hive can be used on Apache Spark/Pig, and conversely, prediction
models built by Apache Spark can be used from Apache Hive/Pig.
Several database vendors are trying to offer machine learning
functionality in relational databases, so that the costs of moving
data can be eliminated. Apache MADlib, a machine learning library for
HAWQ and PostgreSQL, has been accepted as an Apache Incubator project.
MADlib is implemented using the PostgreSQL UDF interface.
Apache Hive has a JIRA ticket, HIVE-7940, to support machine learning
functionalities, so we consider this proposal useful for the
community. We believe Hivemall is better off as a project separate
from Apache Hive because 1) we also target other data processing
frameworks such as Apache Spark as runtimes for Hivemall,
and 2) the current codebase is large enough to stand on its own.
Separation of concerns is good for project governance (e.g., release
management). For example, Apache DataFu is a data mining and statistics
library for Apache Pig and is a project separate from Apache Pig.
We consider that Hivemall would be in a similar position to Apache
DataFu, but there are large differences in features and target runtimes.
The target runtime of Apache DataFu is Apache Pig, whereas Hivemall
targets Apache Hive, Apache Spark, and Apache Pig.
Apache DataFu is closer to a statistics library and does not
support machine learning features such as classification and
regression, whereas Hivemall is a machine learning library that
supports them.
== Initial Goals ==
The initial goals are as follows:
* Establish the project governance in the Apache way and broaden the
community
* Improve documentation
* Add more unit/scenario tests
* Hand over code and copyrights
== Current Status ==
Hivemall has several work-in-progress (WIP) features.
Running the parameter server (a kind of distributed key-value store) as
an Apache YARN application is a major issue. Hivemall’s parameter server
is currently a standalone application. Running parameter servers on
Apache YARN would use Hadoop cluster resources efficiently and make
management of parameter servers easier.
Another major WIP issue is integrating XGBoost into Hivemall. It needs
more work and testing, e.g., supporting cross compilation of the native
JNI objects of XGBoost.
=== Meritocracy ===
The project members understand the importance of letting motivated
individuals contribute to the project. Since Hivemall was initially
released in 2013, it has received contributions from 15 contributors.
Our intent with this incubator proposal is to build a diverse developer
community following the Apache meritocracy model. We welcome external
contributions and plan to elect committers from those who contribute
significantly to the project.
=== Community ===
While there are 15 contributors in total, there are 3-4 active
developers continuously involved in major feature development at
the moment. We hope to extend our contributor base and encourage
suggestions and contributions from any potential user.
=== Core Developers ===
The current main developers are employees of Treasure Data, NTT,
and Hortonworks. Some of them are Hadoop/Pig PMC members and/or Hive
committers.
=== Alignment ===
Incubating at the ASF is the natural choice for the Hivemall project
because Hivemall targets Apache Hive, Apache Spark, and Apache Pig as
its runtimes. We encourage integrations with other ASF data
processing frameworks like Apache Impala and Apache Drill.
== Known Risks ==
The contributions of the main developer are significant at the moment,
but this dependency should decrease as the community grows.
=== Orphaned products ===
While the main developer is developing Hivemall as a full-time job at
Treasure Data, the company is well aware of the open source
philosophy and the importance of open governance of open source
products. Orphaning an ASF product would itself be considered a risk.
Hence, we think the risks of it being orphaned are minimal.
=== Inexperience with Open Source ===
Hivemall has been developed as an open source project since 2013.
The majority of the project members have jobs developing open source
products, and some of them are working on other ASF projects like
Apache Hadoop and Apache Pig. We thus consider that the project
members have enough experience with open source development.
=== Homogenous Developers ===
The current list of committers consists of developers from three
different companies. The committers are geographically distributed
across the U.S. and Asia. They are experienced with working in a
distributed environment.
While not included in the initial committer list, there are other
external contributors to the project. We hope to establish a developer
community that includes those contributors from several other
corporations during the incubation process.
=== Reliance on Salaried Developers ===
The main developer is paid by his employer to contribute to this
project, and the other developers are paid by their employers for
Hadoop-related open source development. While they might change their
affiliations over time, they are willing to contribute their expertise
to open source development, so the project would continue regardless
of their affiliations.
=== Relationships with Other Apache Products ===
Hivemall is a collection of machine learning functions on Apache
Hive, Apache Spark, and Apache Pig. Apache MADlib is a collection of
machine learning functions for relational databases, i.e., Apache HAWQ
and PostgreSQL. There is no conflict in their target runtimes.
=== An Excessive Fascination with the Apache Brand ===
Our interest in this incubation lies in attracting more contributors,
building a strong community with open governance, and increasing the
visibility of Hivemall in the market/community. We will be sensitive
to inadvertent abuse of the Apache brand for any commercial use and
will work with the Incubator PMC and project mentors to ensure the
brand policies are respected.
== Documentation ==
Information on Hivemall can be found at:
https://github.com/myui/hivemall/wiki
== Initial Source ==
We released the initial version of Hivemall in 2013 at
https://github.com/myui/hivemall and introduced Hivemall at the Hadoop
Summit 2014.
== Source and Intellectual Property Submission Plan ==
We know of no legal encumbrance to the transfer of the source to
Apache. We are going to obtain Contributor License Agreements (CLAs)
covering all intellectual property in Hivemall.
Also, we plan to obtain a signed Software Grant Agreement (SGA) from
AIST.
== External Dependencies ==
Hivemall depends on the following third party libraries:
Core module:
* netty (The MIT License)
* smile (Apache License v2.0)
* org.tukaani.xz (Public Domain)
* xgboost (Apache License v2.0)
* hadoop (Apache License v2.0)
* hive (Apache License v2.0)
* log4j (Apache License v2.0)
* guava (Apache License v2.0)
* lucene-analyzers-kuromoji (Apache License v2.0)
* junit (Eclipse Public License v1.0)
* mockito (The MIT License)
* powermock (Apache License v2.0)
* kryo (BSD License)
Hivemall on Spark:
* spark (Apache License v2.0)
* commons-cli (Apache License v2.0)
* commons-logging (Apache License v2.0)
* commons-compress (Apache License v2.0)
* scala-library (BSD License)
* scalatest (Apache License v2.0)
* xerial-core (Apache License v2.0)
The dependencies all have Apache compatible licenses.
== Cryptography ==
N/A
== Required resources ==
=== Mailing lists ===
* priv...@hivemall.incubator.apache.org (with moderated subscriptions)
* comm...@hivemall.incubator.apache.org
* d...@hivemall.incubator.apache.org
* u...@hivemall.incubator.apache.org
=== Git Repository ===
https://git-wip-us.apache.org/repos/asf/incubator-hivemall.git
=== JIRA assistance ===
JIRA project Hivemall (HIVEMALL)
== Initial Committers ==
* Makoto Yui (m...@treasure-data.com)
* Takeshi Yamamuro (yamamuro.tak...@lab.ntt.co.jp)
* Daniel Dai (da...@hortonworks.com)
* Tsuyoshi Ozawa (ozawa.tsuyo...@lab.ntt.co.jp)
* Kai Sasaki (sas...@treasure-data.com)
== Affiliations ==
=== Treasure Data ===
* Makoto Yui
* Kai Sasaki
=== NTT ===
* Takeshi Yamamuro
* Tsuyoshi Ozawa Apache Hadoop PMC member
=== Hortonworks ===
* Daniel Dai (ASF member) Apache Pig PMC member
== Sponsors ==
=== Champion ===
* Roman Shaposhnik (Pivotal, ASF member, IPMC member) Apache
Bigtop/Incubator PMC member
=== Nominated Mentors ===
* Reynold Xin (Databricks, ASF member) Apache Spark PMC member
* Markus Weimer (Microsoft, ASF member) Apache REEF PMC member
* Xiangrui Meng (Databricks, ASF member) Apache Spark PMC member
=== Sponsoring Entity ===
We are requesting the Incubator to sponsor this project.