Re: [DISCUSS] Proposing Griffin for Apache incubator

吕志兴 Thu, 24 Nov 2016 13:50:44 -0800

eBay legal already reviewed this name before we open-sourced the project on 
GitHub; there is no trademark problem with Griffin.


Thx

Alex

> On Nov 24, 2016, at 4:57 PM, Jochen Theodorou <blackd...@gmx.org> wrote:
> 
> just a remark on the name. Griffin is used in a lot of company names and is a 
> family name. That may induce trademark problems, but I did not do any 
> research on that.
> 
> Well and I am not happy about Griffin and Griffon being so near together, 
> even though the latter project most likely has no trademark as such.
> 
>> On 24.11.2016 00:30, Henry Saputra wrote:
>> Hi All,
>> 
>> As the champion for Griffin, I would like to start a discussion about
>> bringing the project into the Apache Incubator as a podling.
>> 
>> Here is the direct quote from the abstract:
>> 
>> "
>> Griffin is a Data Quality Service platform built on Apache Hadoop and
>> Apache Spark. It provides a framework process for defining data
>> quality model, executing data quality measurement, automating data
>> profiling and validation, as well as a unified data quality
>> visualization across multiple data systems. It tries to address the
>> data quality challenges in big data and streaming context.
>> "
>> 
>> Here is the link to the proposal:
>> https://wiki.apache.org/incubator/GriffinProposal
>> 
>> I have copied the proposal below for easy access
>> 
>> 
>> Thanks,
>> 
>> - Henry
>> 
>> 
>> Griffin Proposal
>> 
>> Abstract
>> 
>> Griffin is a Data Quality Service platform built on Apache Hadoop and
>> Apache Spark. It provides a framework process for defining data
>> quality model, executing data quality measurement, automating data
>> profiling and validation, as well as a unified data quality
>> visualization across multiple data systems. It tries to address the
>> data quality challenges in big data and streaming context.
>> 
>> Proposal
>> 
>> Griffin is an open source Data Quality solution for distributed data
>> systems at any scale, in both streaming and batch data contexts. When
>> people use open source products (e.g. Apache Hadoop, Apache Spark,
>> Apache Kafka, Apache Storm), they need a data quality service to
>> build confidence in the quality of the data processed by those
>> platforms. Griffin creates a unified process to define and construct
>> data quality measurement pipelines across multiple data systems to
>> provide:
>> 
>> Automatic quality validation of the data
>> Data profiling and anomaly detection
>> Data quality lineage from upstream to downstream data systems.
>> Data quality health monitoring visualization
>> Shared infrastructure resource management
>> 
>> Overview of Griffin
>> 
>> Griffin has been deployed in production at eBay, serving major data
>> systems. It takes a platform approach, providing generic features to
>> solve common data quality validation pain points. First, the user
>> registers the data asset to be checked for data quality. The data
>> asset can be batch data in an RDBMS (e.g. Teradata) or an Apache
>> Hadoop system, or near real-time streaming data from Apache Kafka,
>> Apache Storm and other real time data platforms. Second, the user
>> creates a data quality model to define the data quality rules and
>> metadata. Third, the model or rule is executed automatically (by the
>> model engine) to produce sample data quality validation results,
>> within a few seconds for streaming data. Finally, the user analyzes
>> the data quality results through the built-in visualization tool and
>> takes action.
>> 
>> Griffin includes:
>> 
>> Data Quality Model Engine
>> 
>> Griffin is a model driven solution: the user can choose various data
>> quality dimensions and run data quality validation on a selected
>> target data-set or source data-set (the latter as the golden
>> reference data). The back-end has a corresponding library supporting
>> the following measurements:
>> 
>> Accuracy - Does data reflect the real-world objects or a verifiable source
>> Completeness - Is all necessary data present
>> Validity - Are all data values within the data domains specified by the 
>> business
>> Timeliness - Is the data available at the time needed
>> Anomaly detection - Pre-built algorithm functions for the
>> identification of items, events or observations which do not conform
>> to an expected pattern or other items in a dataset
>> Data Profiling - Apply statistical analysis and assessment of data
>> values within a dataset for consistency, uniqueness and logic.
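
As a rough illustration of the measurement dimensions listed above, here is a minimal sketch in plain Python. It is not Griffin's implementation or API; the function names, the record layout (dicts with "id" and "country" fields) and the sample data are all assumptions made up for this example. Each function returns the fraction of records that pass the check.

```python
from datetime import datetime

def completeness(records, required_fields):
    """Fraction of records where every required field is present and not None."""
    if not records:
        return 1.0  # an empty dataset has nothing missing
    ok = sum(all(r.get(f) is not None for f in required_fields) for r in records)
    return ok / len(records)

def validity(records, field, allowed_values):
    """Fraction of records whose `field` value lies in the allowed domain."""
    if not records:
        return 1.0
    ok = sum(r.get(field) in allowed_values for r in records)
    return ok / len(records)

def accuracy(target, source, key):
    """Fraction of target records identical to the source (golden) record with the same key."""
    if not target:
        return 1.0
    golden = {r[key]: r for r in source}
    ok = sum(golden.get(r[key]) == r for r in target)
    return ok / len(target)

def timeliness(arrival_times, deadline):
    """Fraction of datasets/records that arrived at or before the deadline."""
    if not arrival_times:
        return 1.0
    ok = sum(t <= deadline for t in arrival_times)
    return ok / len(arrival_times)

# Hypothetical sample data: SOURCE is the golden reference, TARGET is checked against it.
SOURCE = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "DE"},
    {"id": 3, "country": "CN"},
    {"id": 4, "country": "US"},
]
TARGET = [
    {"id": 1, "country": "US"},
    {"id": 2, "country": "XX"},   # wrong value: hurts accuracy and validity
    {"id": 3, "country": None},   # missing value: hurts all three record checks
    {"id": 4, "country": "US"},
]
ARRIVALS = [datetime(2016, 11, 24, 1, 0), datetime(2016, 11, 24, 3, 0)]
DEADLINE = datetime(2016, 11, 24, 2, 0)

if __name__ == "__main__":
    print("accuracy    ", accuracy(TARGET, SOURCE, "id"))
    print("completeness", completeness(TARGET, ["id", "country"]))
    print("validity    ", validity(TARGET, "country", {"US", "DE", "CN"}))
    print("timeliness  ", timeliness(ARRIVALS, DEADLINE))
```

A real deployment would presumably run such checks as distributed jobs rather than over in-memory lists; the sketch only shows what each ratio measures.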
>> 
>> Data Collection Layer
>> 
>> We support two kinds of data sources, batch data and real time data.
>> 
>> For batch mode, we can collect data from Apache Hadoop based
>> platforms through various data connectors.
>> 
>> For real time mode, we can connect to a messaging system such as
>> Kafka for near real time analysis.
>> 
>> Data Process and Storage Layer
>> 
>> For batch analysis, our data quality model computes data quality
>> metrics in our Spark cluster based on data sources in Apache Hadoop.
>> 
>> For near real time analysis, we consume data from the messaging
>> system, then our data quality model computes real time data quality
>> metrics in our Spark cluster. For data storage, we use a time series
>> database in our back end to serve front end requests.
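
To make the near real time path above more concrete, here is a small sketch of the general idea: group incoming events into fixed time windows, compute one quality metric per window, and emit (timestamp, value) points of the kind a time series database would store. This is plain Python standing in for the Spark job and the storage layer, not Griffin code; the event layout (a "ts" field in epoch seconds plus a "user_id" field) and the 60-second window are assumptions for illustration.

```python
from collections import defaultdict

def windowed_completeness(events, field, window_seconds=60):
    """Return sorted (window_start, ratio) points, where ratio is the share
    of events in that window whose `field` is present and not None."""
    totals = defaultdict(int)
    present = defaultdict(int)
    for ev in events:
        # Align the event timestamp to the start of its window.
        start = ev["ts"] - ev["ts"] % window_seconds
        totals[start] += 1
        if ev.get(field) is not None:
            present[start] += 1
    return [(start, present[start] / totals[start]) for start in sorted(totals)]

# Hypothetical events as they might arrive from a messaging system.
EVENTS = [
    {"ts": 0, "user_id": "a"},
    {"ts": 30, "user_id": None},   # incomplete record in the first window
    {"ts": 61, "user_id": "b"},
    {"ts": 119, "user_id": "c"},
]

if __name__ == "__main__":
    for window_start, ratio in windowed_completeness(EVENTS, "user_id"):
        print(window_start, ratio)
```

Each emitted point could then be written to the time series store and charted by the front end; how Griffin actually does this is not described in the proposal.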
>> 
>> Griffin Service
>> 
>> We have RESTful web services that expose all the functionality of
>> Griffin, such as registering data assets, creating data quality
>> models, publishing metrics, retrieving metrics, adding subscriptions,
>> etc. Developers can therefore build their own user interfaces on top
>> of these web services.
>> 
>> Background
>> 
>> At eBay, when people work with big data in Apache Hadoop (or with
>> streaming data), data quality often becomes a big challenge.
>> Different teams have built customized data quality tools to detect and
>> analyze data quality issues within their own domains. We want to take
>> a platform approach that provides shared infrastructure and generic
>> features to solve common data quality pain points. This would enable
>> us to build trusted data assets.
>> 
>> Currently it’s very difficult and costly to do data quality validation
>> when big data flows across multiple platforms at eBay (e.g. Oracle,
>> Apache Hadoop, Couchbase, Apache Cassandra, Apache Kafka, MongoDB).
>> Take eBay's real time personalization platform as an example. Every
>> day we have to validate data quality status for ~600M records
>> (imagine 150M active users on our website). Data quality often
>> becomes a big challenge in both its streaming and batch pipelines.
>> 
>> So we have identified three data quality problems at eBay:
>> 
>> Lack of an end-to-end unified view of data quality measurement from
>> multiple data sources to target applications; it usually takes a
>> long time to identify and fix poor data quality.
>> How to measure data quality in streaming mode: we need a process and
>> tool to visualize data quality insights by registering the dataset
>> to be checked, creating a data quality measurement model, executing
>> the data quality validation job and getting metrics insights for
>> taking action.
>> No shared platform and API service; each team has to provision and
>> manage its own hardware and software infrastructure.
>> 
>> Rationale
>> 
>> The challenge we face at eBay is that our data volume keeps growing
>> and our system processes are becoming more complex, while we do not
>> have a unified data quality solution to ensure trusted data sets that
>> give our data consumers confidence in data quality. The key
>> challenges in data quality include:
>> 
>> Existing commercial data quality solutions cannot address data quality
>> lineage among systems and cannot scale out to support fast growing
>> data at eBay.
>> Existing eBay domain specific tools take a long time to identify and
>> fix poor data quality when data flows through multiple systems.
>> Business logic is becoming complex, which requires a much more
>> flexible data quality system.
>> 
>> Some data quality issues do have business impact on user experiences,
>> revenue, efficiency & compliance.
>> 
>> Communication overhead around data quality metrics, typically in a
>> big organization where different teams are involved.
>> 
>> The idea of Griffin is to provide Data Quality validation as a
>> Service, to allow data engineers and data consumers to have:
>> 
>> Near real-time understanding of the data quality health of your data
>> pipelines with end-to-end monitoring, all in one place.
>> Profiling, detecting and correlating issues and providing
>> recommendations that drive rapid and focused troubleshooting
>> A centralized data quality model management system including rule,
>> metadata, scheduler etc.
>> Native code generation to run everywhere, including Hadoop, Kafka, Spark, 
>> etc.
>> One set of tools to build data quality pipelines across all eBay data 
>> platforms.
>> 
>> Current Status
>> 
>> Meritocracy
>> 
>> Griffin has been deployed in production at eBay and provides a
>> centralized data quality service for several eBay systems (for
>> example, real time personalization platform, eBay real time ID linking
>> platform, Hadoop datasets, Site speed analytics platform). Our aim is
>> to build a diverse developer and user community following the Apache
>> meritocracy model. We will encourage contributions and participation
>> of all types of work, and ensure that contributors are appropriately
>> recognized.
>> 
>> Community
>> 
>> Currently the project is being developed at eBay, and its community
>> is internal to eBay. Griffin seeks to grow its developer and user
>> communities during incubation. We believe it will grow substantially
>> by becoming an Apache project.
>> 
>> Core Developers
>> 
>> Griffin is currently being designed and developed by engineers from
>> eBay Inc. – William Guo, Alex Lv, Shawn Sha, Vincent Zhao, John Liu.
>> All of these core developers have deep expertise in Apache Hadoop and
>> the Hadoop Ecosystem in general.
>> 
>> Alignment
>> 
>> The ASF is a natural host for Griffin given that it is already the
>> home of Hadoop, Beam, HBase, Hive, Storm, Kafka, Spark and other
>> emerging big data products. These products inherently need a data
>> quality solution to ensure the quality of the data they process. When
>> people use open source data technology, the big question for them is
>> how to ensure the quality of the data in it. Griffin leverages a lot
>> of Apache open-source products. Griffin was designed to enable real
>> time insights into data quality validation through shared
>> infrastructure and generic features that solve common data quality
>> pain points.
>> 
>> Known Risks
>> 
>> Orphaned Products
>> 
>> The core developers of the Griffin team work full time on this
>> project. There is no risk of Griffin getting orphaned since at least
>> one large company (eBay) is extensively using it in its production
>> Hadoop and Spark clusters for multiple data systems. For example, 4
>> data systems at eBay (real time personalization platform, eBay real
>> time ID linking platform, Hadoop, Site speed analytics platform)
>> currently leverage Griffin, with ~600M records validated for data
>> quality status every day, 35 data sets being monitored, and 50+ data
>> quality models created.
>> 
>> As Griffin is designed to connect to many types of data sources, we
>> are confident that users of those systems will adopt Griffin as a
>> service for ensuring data quality in open source data ecosystems. We
>> plan to extend and diversify this community further through Apache.
>> 
>> Inexperience with Open Source
>> 
>> Griffin's core engineers are all active users and followers of open
>> source projects. They are already committers and contributors to the
>> Griffin GitHub project. All have worked on source code that has been
>> released under an open source license, and several of them also have
>> experience developing code in an open source environment. Though the
>> core set of developers does not have Apache open source experience,
>> there are plans to onboard individuals with Apache open source
>> experience on to the project.
>> 
>> Homogenous Developers
>> 
>> The core developers are from eBay. The Apache Incubation process
>> encourages an open and diverse meritocratic community. Griffin intends
>> to make every possible effort to build a diverse, vibrant and involved
>> community. We are committed to recruiting additional committers from
>> other companies based on their contributions to the project.
>> 
>> Reliance on Salaried Developers
>> 
>> eBay has invested in Griffin as a company-wide data quality service
>> platform, and some of its key engineers are working full time on the
>> project. They are all paid by eBay. We look forward to other Apache
>> developers and researchers contributing to the project.
>> 
>> Relationships with Other Apache Products
>> 
>> Griffin has a strong relationship with, and dependency on, Apache
>> Hadoop, Apache HBase, Apache Spark, Apache Kafka, Apache Storm and
>> Apache Hive. In addition, since there is a growing need for a data
>> quality solution for open source platforms (e.g. Hadoop, Kafka, Spark
>> etc.), being part of Apache’s Incubation community could help with
>> closer collaboration among these projects as well as others.
>> 
>> Documentation
>> 
>> Information about Griffin can be found at https://github.com/eBay/griffin
>> 
>> Initial Source
>> 
>> Griffin has been under development since early 2016 by a team of
>> engineers at eBay Inc. It is currently hosted on GitHub under the
>> Apache License 2.0 at https://github.com/eBay/griffin . Once in
>> incubation we will move the code base to an Apache Git repository.
>> 
>> External Dependencies
>> 
>> Griffin has the following external dependencies.
>> 
>> Basic
>> 
>> JDK 1.7+
>> Scala
>> Apache Maven
>> JUnit
>> Log4j
>> Slf4j
>> Apache Commons
>> 
>> Hadoop
>> 
>> Apache Hadoop
>> Apache HBase
>> Apache Hive
>> 
>> DB
>> 
>> InfluxData
>> 
>> Apache Spark
>> 
>> Spark Core Library
>> 
>> REST Service
>> 
>> Jersey
>> Spring MVC
>> 
>> Web frontend
>> 
>> AngularJS
>> jQuery
>> Bootstrap
>> RequireJS
>> eCharts
>> Font Awesome
>> 
>> Cryptography
>> 
>> Currently there's no cryptography in Griffin.
>> 
>> Required Resources
>> 
>> Mailing List
>> 
>> We currently use eBay mailboxes to communicate, but we'd like to move
>> that to ASF maintained mailing lists.
>> 
>> Current mailing list: ebay-griffin-d...@googlegroups.com
>> 
>> Proposed ASF maintained lists:
>> 
>> priv...@griffin.incubator.apache.org
>> 
>> d...@griffin.incubator.apache.org
>> 
>> comm...@griffin.incubator.apache.org
>> 
>> Subversion Directory
>> 
>> Git is the preferred source control system.
>> 
>> Issue Tracking
>> 
>> JIRA
>> 
>> Other Resources
>> 
>> The existing code already has unit tests so we will make use of
>> existing Apache continuous testing infrastructure. The resulting load
>> should not be very large.
>> 
>> Initial Committers
>> 
>> William Guo
>> Alex Lv
>> Vincent Zhao
>> Shawn Sha
>> John Liu
>> Liang Shao
>> 
>> Affiliations
>> 
>> The initial committers are employees of eBay Inc.
>> 
>> Sponsors
>> 
>> Champion
>> 
>> Henry Saputra (hsapu...@apache.org)
>> 
>> Nominated Mentors
>> 
>> Kasper Sørensen (kasper...@apache.org)
>> 
>> Uma Maheswara Rao Gangumalla (umamah...@apache.org)
>> 
>> Luciano Resende (luckbr1...@gmail.com)
>> 
>> Sponsoring Entity
>> 
>> We are requesting the Incubator to sponsor this project.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>> 
