Hi all,

I am Reed, a developer on the Palo team. Palo is an MPP-based interactive
SQL data warehouse.
https://github.com/baidu/palo/wiki/Palo-Overview

We propose to contribute Palo as an Apache Incubator project, and
we are still looking for a Champion, if anyone would like to volunteer.
Thanks a lot.

Best Regards,
Reed

===================
The draft of the proposal is below:

#Apache Palo

##Abstract

Palo is an MPP-based interactive SQL data warehouse for reporting and analysis.

##Proposal

We propose to contribute the Palo codebase and associated artifacts (e.g. 
documentation, web-site content etc.) to the Apache Software Foundation with 
the intent of forming a productive, meritocratic and open community around 
Palo’s continued development, according to the ‘Apache Way’.

Baidu owns several trademarks regarding Palo, and proposes to transfer 
ownership of those trademarks in full to the ASF.

###Overview of Palo

Palo’s implementation consists of two daemons: Frontend (FE) and Backend (BE).

**Frontend daemon** consists of a query coordinator and a catalog manager. The
query coordinator is responsible for receiving users' SQL queries, compiling
them, and managing their execution. The catalog manager is responsible for
managing metadata such as databases, tables, partitions, and replicas. Several
frontend daemons can be deployed to provide fault tolerance and load
balancing.

**Backend daemon** stores the data and executes the query fragments. Many
backend daemons can be deployed to provide scalability and fault tolerance.

A typical Palo cluster consists of several frontend daemons and dozens to
hundreds of backend daemons.

Users can connect to any frontend daemon with MySQL client tools and submit
SQL queries. The Frontend receives a query and compiles it into query plans
executable by the Backend, then sends the query plan fragments to the
Backends. Each Backend builds a query execution DAG, and data is fetched and
pipelined into the DAG. The final result is returned to the client via the
Frontend. The distribution of query fragment execution takes minimizing data
movement and maximizing scan locality as its main goals.
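
To illustrate the scan-locality goal described above, here is a minimal Python
sketch, not Palo's actual scheduler. It assumes a hypothetical `replica_map`
that lists which backends hold a replica of each partition, and a hypothetical
function `assign_scans` that places each scan on a backend that already stores
the data, preferring the backend with the fewest scans assigned so far. All
names are illustrative assumptions.

```python
from collections import defaultdict


def assign_scans(replica_map):
    """Pick one backend per partition, always among the replica holders.

    replica_map: dict mapping a partition name to the list of backends
    that store a replica of it (hypothetical names, not Palo's API).
    """
    load = defaultdict(int)  # number of scans already assigned per backend
    plan = {}
    for partition in sorted(replica_map):
        candidates = sorted(replica_map[partition])
        # Scan locality: only backends that already hold the data qualify.
        # Among those, prefer the one with the fewest assigned scans.
        chosen = min(candidates, key=lambda be: load[be])
        plan[partition] = chosen
        load[chosen] += 1
    return plan


# Illustrative cluster layout (made up for this sketch).
replicas = {
    "p1": ["be1", "be2"],
    "p2": ["be1", "be3"],
    "p3": ["be2", "be3"],
}
plan = assign_scans(replicas)
print(plan)
```

Because every scan runs on a backend that holds a replica, no partition data
has to move before scanning; the load counter only spreads work among the
eligible backends.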

##Background

At Baidu, prior to Palo, different tools were deployed to meet diverse
requirements. When a use case required capabilities that no single tool could
provide at once, users were forced to build hybrid architectures that stitched
multiple tools together. We believe they should not have to accept such
inherent complexity. A storage system built to deliver strong performance
across a broad range of workloads offers a more elegant solution to the
problems that hybrid architectures aim to solve. Palo is that solution.

Palo is designed to be a simple, single, tightly coupled system that does not
depend on other systems. It provides high-concurrency, low-latency point
queries as well as high-throughput ad-hoc analytical queries. It supports
bulk batch data loading as well as near real-time mini-batch data loading.
Palo also provides high availability, reliability, fault tolerance, and
scalability.

##Rationale

Palo mainly integrates the technology of Google Mesa and Apache Impala.

Mesa is a highly scalable analytic data storage system that stores critical
measurement data related to Google's Internet advertising business. Mesa is
designed to satisfy a complex and challenging set of user and system
requirements, including near real-time data ingestion and query ability, as
well as high availability, reliability, fault tolerance, and scalability for
large data and query volumes.

Impala is a modern, open-source MPP SQL engine architected from the ground up
for the Hadoop data processing environment. By virtue of its superior
performance and rich functionality, Impala is now comparable to many
commercial MPP database query engines. Mesa can satisfy many of our storage
requirements, but Mesa itself does not provide a SQL query engine; Impala is
a very good MPP SQL query engine, but it lacks an ideal distributed storage
engine. In the end, we chose to combine these two technologies.

Learning from Mesa's data model, we developed a distributed storage engine.
Unlike Mesa, this storage engine does not rely on any distributed file system.
We then deeply integrated this storage engine with the Impala query engine.
Query compiling, query execution coordination, and catalog management of the
storage engine are integrated into the frontend daemon; query execution and
data storage are integrated into the backend daemon. With this integration, we
implemented a single, full-featured, high-performance, state-of-the-art MPP
database while maintaining simplicity.

##Current Status

Palo is an open source project on GitHub (https://github.com/baidu/palo).

###Meritocracy

Palo has been deployed in production at Baidu and serves more than 200
lines of business. It has demonstrated great performance benefits and has
proved to be a better way to do reporting and analysis on big data. Still, we
look forward to growing a rich user and developer community.

###Community

Palo seeks to develop developer and user communities during incubation.

###Core Developers

* Ruyue Ma (https://github.com/maruyue, maru...@baidu.com)
* Chun Zhao (https://github.com/imay, buaa.zh...@gmail.com)
* Mingyu Chen (https://github.com/morningman, chenmin...@baidu.com)
* De Li (https://github.com/lide-reed, mailtol...@sina.com)
* Hao Chen (https://github.com/chenhao7253886, chenha...@baidu.com)
* Chaoyong Li (https://github.com/cyongli, lichaoy...@baidu.com)
* Bin Lin (https://github.com/lingbin, lingbi...@gmail.com)

###Alignment

Palo is related to several other Apache projects:

* Palo can read data stored in Apache Hadoop clusters powered by the HDFS
filesystem.
* Palo is closely integrated with Impala, which is also being proposed to the 
Incubator.
* Palo uses Apache Thrift as its RPC and serialization framework of choice.

##Known Risks

###Orphaned Products

The core developers of the Palo team plan to work full time on this project.
There is very little risk of Palo being orphaned, since at least one large
company (Baidu) uses it extensively in production: there are currently more
than 200 use cases running on Palo in production. Furthermore, since Palo was
open sourced at the beginning of October 2017, it has received more than 660
stars and been forked nearly 170 times. We plan to extend and diversify this
community further through Apache.

###Inexperience with Open Source

The core developers are all active users and followers of open source. They are
already committers and contributors to the Palo GitHub project. All have
worked on source code that has been released under an open source license, and
several of them also have experience developing code in an open source
environment. Although the core developers do not have Apache open source
experience, there are plans to onboard individuals with Apache open source
experience onto the project.

###Homogenous Developers

Most of the core developers are from Baidu, but since Palo was open sourced,
it has received many bug fixes and enhancements from developers who do not
work at Baidu.

###Reliance on Salaried Developers

Baidu has invested in Palo as its OLAP solution, and some of its key engineers
are working full time on the project. In addition, since there is a growing
Big Data need for scalable OLAP solutions, we look forward to contributions
from other Apache developers and researchers. The key to addressing the risk
of relying on salaried developers from a single entity is to increase the
diversity of contributors and to actively encourage domain experts in the BI
space to contribute. Apache Palo intends to do this.

###An Excessive Fascination with the Apache Brand

Palo is proposing to enter incubation at Apache in order to help efforts to
diversify the committer base, not so much to capitalize on the Apache brand.
The Palo project is already in production use inside Baidu, but is not
expected to be a Baidu product for external customers. As such, the Palo
project is not seeking to use the Apache brand as a marketing tool.

##Documentation

Information about Palo can be found at https://github.com/baidu/palo. The 
following links provide more information about Palo in open source:

* Palo wiki site: https://github.com/baidu/palo/wiki
* Codebase at GitHub: https://github.com/baidu/palo
* Issue Tracking: https://github.com/baidu/palo/issues
* Overview: https://github.com/baidu/palo/wiki/Palo-Overview
* FAQ: https://github.com/baidu/palo/wiki/Palo-FAQ

##Initial Source

Palo has been under development since 2017 by a team of engineers at Baidu Inc. 
It is currently hosted on Github.com under an Apache license at 
https://github.com/baidu/palo.

##External Dependencies

Palo has the following external dependencies.

* Google gflags (BSD)
* Google glog (BSD)
* Apache Thrift (Apache Software License v2.0)
* Apache Commons (Apache Software License v2.0)
* Boost (Boost Software License)
* OpenLDAP (OpenLDAP Software License)
* rapidjson (Tencent)
* Google RE2 (BSD-style)
* lz4 (BSD)
* snappy (BSD)
* cyrus-sasl (CMU License)
* Twitter Bootstrap (Apache Software License v2.0)
* d3 (BSD)
* LLVM (BSD-like)

Build and test dependencies:

* ant (Apache Software License v2.0)
* Apache Maven (Apache Software License v2.0)
* cmake (BSD)
* clang (BSD)
* Google gtest (Apache Software License v2.0)

##Required Resources

###Mailing List

There are currently no mailing lists. The usual mailing lists are expected to 
be set up when entering incubation:

priv...@palo.incubator.apache.org
d...@palo.incubator.apache.org
comm...@palo.incubator.apache.org

###Subversion Directory

Upon entering incubation: https://github.com/baidu/palo.
After incubation, we want to move the existing repo from 
https://github.com/baidu/palo to Apache infrastructure.

###Issue Tracking

Palo currently uses GitHub to track issues. We would like to continue to do so
while we discuss migration possibilities with the ASF Infra committee.

###Other Resources

The existing code already has unit tests so we will make use of existing Apache 
continuous testing infrastructure. The resulting load should not be very large.

##Initial Committers

* Ruyue Ma (https://github.com/maruyue, maru...@baidu.com)
* Chun Zhao (https://github.com/imay, buaa.zh...@gmail.com)
* Mingyu Chen (https://github.com/morningman, chenmin...@baidu.com)
* De Li (https://github.com/lide-reed, mailtol...@sina.com)
* Hao Chen (https://github.com/chenhao7253886, chenha...@baidu.com)
* Chaoyong Li (https://github.com/cyongli, lichaoy...@baidu.com)
* Bin Lin (https://github.com/lingbin, lingbi...@gmail.com)

##Affiliations

The initial committers are employees of Baidu Inc. The nominated mentors are
employees of TODO.

##Sponsors

###Champion

TODO

###Nominated Mentors

* Sijie Guo, guosi...@gmail.com
* Luke Han, luke...@apache.org
* Zheng Shao, zs...@apache.org

###Sponsoring Entity

We are requesting the Incubator to sponsor this project.
