Re: [Proposal] lxdb - proposal for Apache Incubation

Sheng Wu Sat, 27 Feb 2021 05:57:58 -0800

Could you change your mail tool?

Your reply looks like this...
1.Since you are proposing a new project to a global foundation, you should
at
least keep your documentation in English.&nbsp;
&gt;Of course, if Apache accepts this project, I will complete all the
documents and translate them into English. Although my English is not very
good, many of our company come back from Australia. This should not be a
problem
2:Your provided links are Chinese,which for most IPMC people, it is not
readable.
&gt;In addition to the source code, what other documents are needed? Do you
want me to provide some basic project use or introduction first?
3:And since this project is close-source, please provide the dependencies.


Sheng Wu 吴晟
Twitter, wusheng1108


On Sat, Feb 27, 2021 at 9:56 PM, fp <f...@lucene.cn> wrote:

> Hi 吴晟,
> Thank you for your reply. My answers to your questions are as
> follows. (My English is not very good; please bear with me.)
>
>
> 1. Since you are proposing a new project to a global foundation, you should
> at least keep your documentation in English.
> > Of course. If Apache accepts this project, I will complete all the
> > documents and translate them into English. Although my English is not
> > very good, many people in our company have returned from Australia, so
> > this should not be a problem.
> 2. The links you provided are in Chinese, which most IPMC members cannot
> read.
> > In addition to the source code, what other documents are needed? Do you
> > want me to provide a basic usage guide or introduction to the project
> > first?
> 3. And since this project is closed-source, please provide the dependencies.
> > The version to be open-sourced is 100% rewritten. It relies on Hadoop,
> > HBase, Spark, and ZooKeeper, and does not rely on any code from my
> > previous company.
> 4. And since you repeatedly mention the original projects: was this project
> created 100% on your own, or does it include anything from Alibaba/Tencent?
> > The current version of lxdb was created 100% on my own. It does not
> > include anything from Alibaba/Tencent.
> > The previous version of lxdb relied on Alibaba's mdrill. I am the author
> > of the mdrill project, and mdrill is an open source project.
> > Tencent Hermes was my work at Tencent, but after I started my business I
> > did not use the Hermes source code, and I informed Tencent before I
> > started my business.
> 5. As none of it is open source, I can't verify.
> > If you are interested, I can provide the source code to PMC members
> > separately for auditing.
> 6. Because this is closed-source, we also need you to be clear about whether
> you are going to submit an SGA and open the source to the public.
> > I haven't open-sourced the project yet, mainly because I want to see
> > whether the PMC is interested in my project. If it is, I will open-source
> > it; that way I can persuade my investors. If the PMC is not interested, I
> > may consider open-sourcing it later. At present the project has about
> > 100000 lines of code, which can be provided to the PMC for review.
> 7. Most importantly, `lucene` is an Apache trademark and an Apache project,
> which makes me concerned about a branding violation.
> > I just like Lucene. If the name offends the PMC, I can change it to an
> > appropriate name.
> 8. Lastly, we (the Incubator) typically expect you to have open-sourced the
> project already, and to have at least a small community and first adoption
> outside your company.
> > Our company is a commercial company. The community around our previous
> > projects may be different from what you describe. We have organized a QQ
> > discussion group with about 1000 people. Many members have been our users
> > for many years, and they are looking forward to the development of our
> > project.
> 9. To join the Incubator, you also need at least 3 IPMC members and 1
> Champion (an Apache member or officer) to help you understand the Incubator.
> > Can you help me? I really do have language problems, and I have had
> > little communication in this area. I have given many talks in China
> > before. I hope you can help me if you can. If you like this project, you
> > can also join us. It is a very good opportunity in China's database
> > market.
> > My phone number is 17099831107.
>
> yannian mu 母延年
> luxin,muyannian
>
> ------------------ Original Message ------------------
> From: "general" <wu.sheng.841...@gmail.com>;
> Sent: Saturday, February 27, 2021, 9:06 PM
> To: "Incubator" <general@incubator.apache.org>;
>
> Subject: Re: [Proposal] lxdb - proposal for Apache Incubation
>
>
>
> Hi
>
> Since you are proposing a new project to a global foundation, you should at
> least keep your documentation in English. The links you provided are in
> Chinese, which most IPMC members cannot read.
> And since this project is closed-source, please provide the dependencies.
> And since you repeatedly mention the original projects: was this project
> created 100% on your own, or does it include anything from Alibaba/Tencent?
> As none of it is open source, I can't verify.
> Because this is closed-source, we also need you to be clear about whether
> you are going to submit an SGA and open the source to the public.
>
> Most importantly, `lucene` is an Apache trademark and an Apache project,
> which makes me concerned about a branding violation.
>
> Lastly, we (the Incubator) typically expect you to have open-sourced the
> project already, and to have at least a small community and first adoption
> outside your company.
>
> To join the Incubator, you also need at least 3 IPMC members and 1
> Champion (an Apache member or officer) to help you understand the Incubator.
>
> Sheng Wu 吴晟
> Twitter, wusheng1108
>
>
> On Sat, Feb 27, 2021 at 6:40 PM, fp <f...@lucene.cn> wrote:
>
> > Dear Apache Incubator Community,
> >
> > Please accept the following proposal for presentation and discussion:
> > https://github.com/lucene-cn/lxdb/wiki
> >
> > LXDB is a high-performance OLAP and full-text search database. It is
> > based on HBase, but replaces HFile with a Lucene index to support more
> > effective secondary indexes. It is also based on Spark SQL, so you can
> > use the SQL API to access data and run OLAP calculations. The Lucene
> > index is stored on HDFS (not on local disk).
> >
> > In our production systems, LXDB supports 200+ clusters, some of which
> > have 1000+ nodes, and inserts 200 billion rows per day (20000 billion
> > rows in total). One of the biggest single tables has 200 million Lucene
> > indexes on LXDB.
> >
> > Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce
> > (Hive), HDFS, and Lucene. We have merged these separate projects again:
> > LXDB equals Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super
> > database. It took me 10 years to complete this merging. The purpose is no
> > longer a search engine, but a database.
> >
> > Best regards
> > yannian mu
> >
> > LXDB Proposal
> > == Abstract ==
> > LXDB is a high-performance OLAP and full-text search database.
> >
> > === It is based on HBase, but replaces HFile with a Lucene index to
> > support more effective secondary indexes ===
> > We modified the HBase region server and changed HFile to Lucene: on a
> > put, we write a document to Lucene instead of writing data to an HFile.
> > The Lucene index is stored on the region server. (It is not stored in a
> > different cluster as with Elasticsearch + HBase, which requires two
> > copies of the data.)
> >
> > === It is based on Spark SQL for OLAP ===
> > We integrated Spark and HBase together. It is used like this:
> > 1. Unpack lxdb.tar.gz.
> > 2. Configure the hadoop_config path.
> > 3. Run start-all.sh to start the cluster.
> > lxdb starts Spark through Hadoop YARN, and each Spark executor process
> > then starts an embedded HBase region server service.
> >
> > You can operate the lxdb database through the Spark SQL API (Hive) or
> > the MySQL API.
> > 1. The SQL uses a Spark RDD + HBase scanner to access HBase.
> > 2. The SQL conditions (filter or group-by aggregation) are pushed down
> > to HBase.
> > 3. HBase uses the Lucene index to filter data in the region server.
> > Spark, HBase, and Lucene are all embedded and integrated together rather
> > than run as separate clusters; that is the difference from a Solr/ES +
> > HBase + Spark solution.
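The three query steps above can be illustrated with a toy model. This is only a sketch under stated assumptions, not lxdb code: `ToyRegion` and `count_by` are hypothetical names, a Python dict stands in for the Lucene index, and only equality filters are pushed down. It shows the compute layer handing its filter to each region, which answers from an inverted index instead of scanning every row.

```python
from collections import defaultdict


class ToyRegion:
    """Hypothetical stand-in for a region server: rows plus an inverted index."""

    def __init__(self):
        self.rows = {}                  # row_key -> dict of column values
        self.index = defaultdict(set)   # (column, value) -> set of row keys

    def put(self, row_key, row):
        # Assumes row_key is new; overwriting would require removing stale
        # index entries first, which this sketch does not do.
        self.rows[row_key] = row
        for column, value in row.items():
            self.index[(column, value)].add(row_key)

    def scan(self, filters=None):
        """Return rows; pushed-down equality filters are answered from the index."""
        if not filters:
            return list(self.rows.values())
        # Intersect the posting sets of all pushed-down (column, value) pairs.
        postings = [self.index.get(pair, set()) for pair in filters.items()]
        matching = set.intersection(*postings)
        return [self.rows[key] for key in sorted(matching)]


def count_by(regions, filters, group_column):
    """Compute-layer aggregation over rows the regions have already filtered."""
    counts = defaultdict(int)
    for region in regions:
        for row in region.scan(filters):
            counts[row[group_column]] += 1
    return dict(counts)
```

With two regions holding a few rows each, `count_by(regions, {"status": "ok"}, "city")` only ever sees the rows whose `status` is `ok`; the non-matching rows never leave the region.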
> >
> > == Background ==
> > === Multiple copies of data ===
> > Apache HBase + Elasticsearch is the most popular solution for full-text
> > search, but it is weak at online analytical processing (OLAP).
> > So most of the time, production systems use Spark (or Hive, Impala, or
> > Presto), HBase, and Solr/ES at the same time. Multiple copies of the data
> > are stored in multiple systems, the systems have different APIs, and data
> > consistency is difficult to guarantee. For these reasons we merged Spark,
> > HBase, and Elastic into one project. Its goal is to use one copy of the
> > data, one cluster, and one API to cover OLAP, KV, full-text, and other
> > database scenarios.
> >
> > === Merging and splitting of Lucene indexes (HStore) across different
> > machines on HDFS ===
> > As we all know, Solr/ES store files in the local file system, so the
> > number of shards must be fixed. But if we store the index on HDFS, the
> > index can be split like an HBase HStore, and it can split or merge across
> > machine nodes. This is very useful for a distributed database, because it
> > determines how many resources are allocated to a table. Most of the time,
> > the number of records in a table changes over time, so the number of
> > shards always needs adjusting. If the index is stored locally it cannot
> > be split across different machines, but a Lucene index stored on HDFS
> > can.
> > Whether the number of shards can be flexibly adjusted, and whether the
> > system can scale elastically, is particularly important in a distributed
> > database.
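The idea of adjusting the shard count as a table grows can be sketched as range-based splitting and merging. This is a minimal illustration, not lxdb's implementation: `Shard`, `Table`, and `max_keys_per_shard` are hypothetical names, and each shard holds only a sorted list of row keys in place of an index on shared storage.

```python
from bisect import insort


class Shard:
    """Hypothetical index shard covering the row-key range [start, end)."""

    def __init__(self, start, end, keys=None):
        self.start, self.end = start, end   # end=None means "no upper bound"
        self.keys = sorted(keys or [])

    def split(self):
        """Split at the median key into two shards covering the same range."""
        mid = self.keys[len(self.keys) // 2]
        left = Shard(self.start, mid, [k for k in self.keys if k < mid])
        right = Shard(mid, self.end, [k for k in self.keys if k >= mid])
        return left, right


class Table:
    def __init__(self, max_keys_per_shard):
        self.max_keys = max_keys_per_shard
        self.shards = [Shard("", None)]     # one shard covering the whole key space

    def _find(self, key):
        for i, shard in enumerate(self.shards):
            if key >= shard.start and (shard.end is None or key < shard.end):
                return i
        raise KeyError(key)

    def put(self, key):
        i = self._find(key)
        shard = self.shards[i]
        if key not in shard.keys:
            insort(shard.keys, key)
        if len(shard.keys) > self.max_keys:
            # Replace the oversized shard with its two halves.
            self.shards[i:i + 1] = shard.split()

    def merge(self, i):
        """Merge shard i with its right neighbour, for example after deletions."""
        a, b = self.shards[i], self.shards[i + 1]
        self.shards[i:i + 2] = [Shard(a.start, b.end, a.keys + b.keys)]
```

Because every shard is defined by a key range rather than by a fixed hash bucket, the number of shards can grow or shrink without rewriting the rest of the table; in a real system the new shards could then be assigned to any node that can read the shared storage.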
> >
> > === Solving the shortcomings of secondary indexes ===
> > Some people use HBase secondary indexes, as in the Phoenix project. But
> > those schemes are built on the HBase row key and carry a lot of
> > redundancy: you cannot create too many indexes, because the data
> > inflation rate is too high. So using a Lucene index instead of a
> > secondary index is the best choice.
> >
> > === We add a Lucene index for Spark OLAP ===
> > Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP
> > databases, suffer from brute-force scanning and poor data timeliness.
> > 1. They use brute-force scans to compute over the data. Another choice
> > is to add an index to the big data; in some cases an index can greatly
> > improve on the performance of the original brute-force scan. I think
> > that, just as in a traditional database, indexing can greatly improve
> > database speed.
> > 2. Another problem is that most of these databases or systems are
> > offline or batch systems. lxdb's goal is real-time append and real-time
> > KV update, just like HBase.
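Both points above, index lookup versus brute-force scan and real-time KV update, can be shown in a small sketch. It rests on assumptions that are not from the proposal: `IndexedTable` is a hypothetical class, rows are plain strings tokenized on whitespace, and an in-memory dict plays the role of the Lucene index.

```python
from collections import defaultdict


class IndexedTable:
    """Hypothetical table: key-value rows plus an inverted index kept in sync on write."""

    def __init__(self):
        self.rows = {}                  # row_key -> text
        self.index = defaultdict(set)   # term -> row keys whose text contains it

    def put(self, row_key, text):
        """Insert or overwrite a row; stale index entries are removed first."""
        old = self.rows.get(row_key)
        if old is not None:
            for term in set(old.split()):
                self.index[term].discard(row_key)
        self.rows[row_key] = text
        for term in set(text.split()):
            self.index[term].add(row_key)

    def search(self, term):
        """Index lookup: touches only the posting set for the term."""
        return sorted(self.index.get(term, ()))

    def brute_force_search(self, term):
        """Full scan: tokenizes and checks every row."""
        return sorted(k for k, text in self.rows.items() if term in text.split())
```

An overwrite through `put` is visible to `search` immediately, which is the behaviour a batch-oriented system would only provide after its next load. Both search methods return the same row keys; they differ in how many rows they have to touch.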
> >
> > == Future ==
> > === Lucene on Parquet ===
> > In the near future I will change Lucene's tim and tip (inverted index),
> > dvd, and dvm files to a Parquet- or ORC-like format. The goals are:
> > * To solve the performance problem of traversing the Lucene index.
> > * To solve the problem that opening a Lucene file requires loading files
> >   such as tip into memory, which makes opening a Lucene index slow.
> > * To enable Lucene to store multi-column joint indexes by column, which
> >   can be used for logic such as multi-table joins, materialized views,
> >   and multi-field group-by over the inverted index.
> > The current Lucene index has many problems caused by too many file
> > pointers and its single-column design. We want to modify Lucene to make
> > it more suitable for HDFS, not only for full-text retrieval but also
> > better at statistical analysis: a real database-level index. We also
> > want Lucene to be splittable, so that storage can be separated from
> > computation.
> >
> > === Supporting all kinds of predicate pushdown ===
> > We find that if we combine the computation closely with the data, we can
> > get more performance out of the database. An index is only one form of
> > pushdown. Storage pushdown is another example: we can store the index on
> > SSD devices and the data on SATA devices. We can store data that is
> > often grouped together in advance, instead of computing it row by row.
> > We can give important tables or columns dedicated devices and resources.
> > HBase still lacks these capabilities, which we need to improve further.
> >
> > === Controlling data distribution ===
> > We can use the row key to direct data to different nodes, which enables
> > many interesting things.
> >
> > === Resource control, resource isolation ===
> > Lucene currently does not support resource isolation, but on HDFS we can
> > do it: I can control the priority of SQL queries so that Lucene indexes
> > with higher priority get faster I/O resources.
> >
> > == Status ==
> > In 2011 I released the first open source version at Alibaba. At that
> > time, mdrill used 10 nodes of 48 GB machines to support 400 billion
> > records. The first index on HDFS came from this version; it was one year
> > ahead of the community. https://github.com/alibaba/mdrill
> >
> > In 2014 I stopped updating the mdrill project because I joined Tencent.
> > In our team we developed the Hermes project, where we also built Lucene
> > on HDFS. Hermes now imports 1000 billion rows of data per day in real
> > time. It is the largest database I have ever developed.
> > https://plus.tencent.com/bigdata/hermes
> >
> > In 2018 I set up my own company, called Luxin; "Lu Xin" is the Chinese
> > pronunciation of Lucene. As a fan of Lucene, I chose lucene.xin as the
> > company's domain and lucene.cn as its mail domain.
> > Luxin's first version of lxdb was called lsql, meaning "Lucene SQL". It
> > uses lucene(2.5.3)+hdfs+spark(1.6.3). It is stable, and 200+ clusters use
> > lsql. It processes about 200 billion rows per day, for a total of 20000
> > billion rows in a single cluster (1000 nodes).
> >
> > In 2020, during COVID-19, our team decided to develop the next
> > generation of lsql, called lxdb (lx = the pronunciation of Lucene). We
> > added HBase to lsql to solve the update problem. We have now finished the
> > first version of lxdb. https://github.com/lucene-cn/lxdb/wiki
> >
> > == Known Risks ==
> > === Meritocracy ===
> >
> > lxdb has been deployed in production and serves more than 200 lines of
> > business. It has demonstrated great performance benefits and has proved
> > to be a better way to do reporting and analysis on big data. Still, we
> > look forward to growing a rich user and developer community.
> >
> > === Orphaned products ===
> >
> > The core developers currently work full-time for Luxin.
> > lxdb is widely adopted by many companies and individuals. There is no
> > realistic chance of it becoming orphaned. We also have a Tencent QQ
> > instant messaging group of about 1000 people.
> >
> > === Inexperience with Open Source ===
> > The core developers are all active users and followers of open source.
> > They are already committers and contributors to the lxdb project. The
> > developer yannian mu has ten years of experience with open source
> > projects, including jstorm https://github.com/alibaba/jstorm and mdrill
> > https://github.com/alibaba/mdrill
> >
> > === Homogenous Developers ===
> >
> > Most of the core developers are from Luxin because the product has been
> > closed source, but once lxdb is open-sourced, we expect it to receive
> > many bug fixes and enhancements from developers not working at Luxin.
> > What we took from open source, we will give back to open source.
> >
> >
> > === Reliance on Salaried Developers ===
> >
> > Luxin invested in lxdb as its solution, and some of its key engineers
> > are working full time on the project. In addition, since there is a
> > growing Big Data need for scalable solutions, we look forward to other
> > Apache developers and researchers contributing to the project. Key to
> > addressing the risk of relying on salaried developers from a single
> > entity is to increase the diversity of the contributors and to actively
> > lobby for outside contributions. Apache lxdb intends to do this.
> >
> > === An Excessive Fascination with the Apache Brand ===
> >
> > lxdb is proposing to enter incubation at Apache in order to help efforts
> > to diversify the committer base, not so much to capitalize on the Apache
> > brand. The lxdb project is already in production use inside Luxin, but
> > is not expected to be a Luxin product for external customers. As such,
> > the lxdb project is not seeking to use the Apache brand as a marketing
> > tool.
> >
> > === Documentation ===
> >
> > Information about lxdb can be found at https://github.com/lucene-cn/lxdb.
> > The following links provide more information about lxdb in open source:
> >
> > * Wiki site: https://github.com/lucene-cn/lxdb/wiki
> > * Issue tracking: https://github.com/lucene-cn/lxdb/issues
> > * Overview: https://github.com/lucene-cn/lxdb/wiki/intro
> > * Luxin home page: http://www.lucene.xin
> > * lsql documentation: http://docs.lucene.xin/lsql/v21/
> >
> > === Initial Source ===
> >
> > lxdb will develop its source code under an Apache license at
> > https://github.com/lucene-cn/lxdb.
> >
> >
> > === Core Developers ===
> >
> > Currently most of the core developers of LXDB work in the research team
> > of Luxin.
> >
> > - yannian mu (dev)
> > - yu chen (dev)
> > - guangshi hao (dev)
> > - wei sun (dev)
> > - qihua zheng (dev)
> > - xin wang (dev)
> > - qingsong liu (dev)
> > - anxing zhou (Tester)
> > - jiajun duan (Tester)
> >
> > == External Dependencies ==
> > All dependencies are managed using Apache Maven.
> >
> > Dependency   License              Optional?
> > lucene       Apache License 2.0   true
> > zookeeper    Apache License 2.0   true
> > hbase        Apache License 2.0   true
> > spark        Apache License 2.0   true
> > hadoop       Apache License 2.0   true
> > hive         Apache License 2.0   true
> >
> > == Required Resources ==
> >
> > === Mailing lists ===
> >
> > * lxdb-private (PMC discussion)
> > * lxdb-dev (developer discussion)
> > * lxdb-user (user discussion)
> > * lxdb-commits (SCM commits)
> > * lxdb-issues (JIRA issue feed)
> >
> > === Subversion Directory ===
> >
> > Instead of Subversion, LXDB prefers Git as its source control
> > management system: git://git.apache.org/lxdb
