Thanks Ted. That helps a lot! I have also reached out to a few other folks in the Mahout community to see if they might be interested.

On Fri, Feb 27, 2015 at 8:06 AM, Ted Dunning <ted.dunn...@gmail.com> wrote:

> Thejas,
>
> Please add me as a mentor if it helps to have diversity. I have enormous
> trust, based on previous experience with him, that Alan Gates would act as
> a highly impartial and effective mentor, but I would be happy to help if
> there is a concern that could be addressed by having another mentor from a
> different company.
>
> On Thu, Feb 26, 2015 at 6:12 PM, Thejas Nair <thejas.n...@gmail.com> wrote:
>
>> The incubator proposal has been updated with the feedback so far.
>> We have 3 mentors now, but I think it would be good to have additional
>> mentors. Please let me know if anyone is able to help mentor this
>> project.
>>
>> I am planning to start a vote on the proposal in a day or two.
>>
>> On Fri, Feb 6, 2015 at 5:21 PM, <oo...@comp.nus.edu.sg> wrote:
>> >
>> > Regarding the number of users using this project -- at this moment, the
>> > community is not big. A few local start-ups have been trying to use it
>> > (mainly due to an announcement in our seminar list), e.g. one is using it
>> > for image recognition (given a photo snapped by a user, it wants to
>> > return the same product and a list of similar products, such as a
>> > luxury bag on a passerby). Researchers from outside of NUS may have been
>> > using it since we published an application paper on cross domain/modal
>> > retrieval in VLDB 2014.
>> >
>> > We have not announced the project to the outside community yet -- we
>> > would announce it in dbworld etc. in due course.
>> >
>> > Thanks and have a good weekend.
>> >
>> > regards
>> > beng chin
>> >
>> >> Thanks for the comments and suggestions.
>> >> With permission from Thejas, I would like to respond to point 2.
>> >>
>> >> We have a huge team down at NUS (National University of Singapore) --
>> >> we have about seven database/data mining professors (not including
>> >> those in systems, networking, and machine learning).
>> >> I myself have nine PhD students in a steady state, and I have a few
>> >> large grants, with a total budget of about 15 million S$ (~12 million
>> >> USD), that allow me to hire a number of research fellows and research
>> >> assistants for the next few years. In a constant state, I have about 20
>> >> people (PhD students/RA/RF) working with me alone. Other professors have
>> >> their own grants (unlike other countries, it is relatively easy to get
>> >> large grants in Singapore; many overseas universities, including UIUC,
>> >> MIT, ETH etc., have research labs funded by Singapore Research
>> >> Foundation [equivalent of NSF]).
>> >>
>> >> SINGA is a long term project for us -- while it is a platform as it is,
>> >> we are using it for healthcare predictive analytics (by working with a
>> >> hospital associated with the University). Therefore, we will be working
>> >> on SINGA, not solely as a distributed DL platform, but as a tool that
>> >> will enable us to do data analytics on some business domains (e.g.
>> >> healthcare, consumer etc.).
>> >>
>> >> For the initial set of committers, three are tenured professors and five
>> >> are students with 2-5 years to go before they complete their PhD. Quite
>> >> often, some would stay back as a research fellow for a couple of years
>> >> before they start looking for a job outside. We will work with mentors
>> >> and new developers (from outside of NUS or Zhejiang University) in
>> >> enhancing the system.
>> >>
>> >> The project should survive in that sense.
>> >>
>> >> (I have an on-going project CIIDAA that has been around since 2008; it
>> >> was started as another project, epiC, with a different grant, and then
>> >> we continued the development with a new grant for CIIDAA --
>> >> http://www.comp.nus.edu.sg/~ciidaa/ )
>> >>
>> >> Thanks.
>> >>
>> >> regards
>> >> beng chin
>> >> ps: i am not sure if my email will get through to the group.
>> >>
>> >> ---------------------------- Original Message ----------------------------
>> >> Subject: Re: [DISCUSS] [PROPOSAL] Singa for Apache Incubator
>> >> From: "Henry Saputra" <henry.sapu...@gmail.com>
>> >> Date: Thu, February 5, 2015 2:57 pm
>> >> To: "general@incubator.apache.org" <general@incubator.apache.org>
>> >> Cc: oo...@comp.nus.edu.sg
>> >> --------------------------------------------------------------------------
>> >>
>> >> Several comments:
>> >> -) How many users are already using this project? I would recommend
>> >> dropping the request for the singa-user list at the beginning.
>> >> -) All the initial committers come from university, and it seems like
>> >> some of them are already about to leave university. I am not too sure
>> >> this project will survive if all of the initial committers are
>> >> university students.
>> >> -) Need to solicit more mentors if this project ever gets to the Apache
>> >> incubator.
>> >>
>> >> - Henry
>> >>
>> >> On Tue, Feb 3, 2015 at 3:58 PM, Thejas Nair <thejas.n...@gmail.com> wrote:
>> >>> The "Relationship with Other Apache Products" section has been
>> >>> updated. The reference to H2O in that section has been removed, and
>> >>> other projects have been added.
>> >>> Thanks for the feedback!
>> >>>
>> >>> On Wed, Jan 28, 2015 at 10:27 AM, Thejas Nair <thejas.n...@gmail.com> wrote:
>> >>>> Thanks for pointing that out Henry! Yes, looks like H2O is not an
>> >>>> Apache project, I should have verified that.
>> >>>> I will edit that, and revisit that section along with the folks in
>> >>>> the Singa community.
>> >>>>
>> >>>> On Tue, Jan 27, 2015 at 6:55 PM, Henry Saputra <henry.sapu...@gmail.com> wrote:
>> >>>>> Quick immediate comment that "Apache H2O" is not really an Apache
>> >>>>> project.
>> >>>>>
>> >>>>> I assume you are referring to https://github.com/h2oai/h2o (or
>> >>>>> https://github.com/h2oai/h2o-dev) ?
>> >>>>>
>> >>>>> - Henry
>> >>>>>
>> >>>>> On Tue, Jan 27, 2015 at 5:29 PM, Thejas Nair <thejas.n...@gmail.com> wrote:
>> >>>>>> Hello everyone,
>> >>>>>>
>> >>>>>> I would like to propose the inclusion of Singa as an Apache
>> >>>>>> Incubator project.
>> >>>>>>
>> >>>>>> Here is the proposal -
>> >>>>>> https://wiki.apache.org/incubator/SingaProposal
>> >>>>>>
>> >>>>>> Please review the proposal and give feedback. I am planning to
>> >>>>>> start a vote after 7 days if the proposal looks good.
>> >>>>>> We are also seeking additional Apache mentors for the project.
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Thejas
>> >>>>>> ==========================================================
>> >>>>>> Singa Incubator Proposal
>> >>>>>>
>> >>>>>> Abstract
>> >>>>>>
>> >>>>>> SINGA is a distributed deep learning platform.
>> >>>>>>
>> >>>>>> Proposal
>> >>>>>>
>> >>>>>> SINGA is an efficient, scalable and easy-to-use distributed platform
>> >>>>>> for training deep learning models, e.g., Deep Convolutional Neural
>> >>>>>> Network and Deep Belief Network. It parallelizes the computation
>> >>>>>> (i.e., training) onto a cluster of nodes by distributing the
>> >>>>>> training data and model automatically to speed up the training.
>> >>>>>> Built-in training algorithms like Back-Propagation and Contrastive
>> >>>>>> Divergence are implemented based on common abstractions of deep
>> >>>>>> learning models. Users can train their own deep learning models by
>> >>>>>> simply customizing these abstractions like implementing the Mapper
>> >>>>>> and Reducer in Hadoop.
>> >>>>>>
>> >>>>>> Background
>> >>>>>>
>> >>>>>> Deep learning refers to a set of feature (or representation)
>> >>>>>> learning models that consist of multiple (non-linear) layers, where
>> >>>>>> different layers learn different levels of abstractions
>> >>>>>> (representations) of the raw input data.
>> >>>>>> Larger (in terms of model parameters) and deeper (in terms of
>> >>>>>> number of layers) models have shown better performance, e.g.,
>> >>>>>> lower image classification error in the Large Scale Visual
>> >>>>>> Recognition Challenge. However, a larger model requires more memory
>> >>>>>> and larger training data to reduce over-fitting. Complex numeric
>> >>>>>> operations make the training computation intensive. In practice,
>> >>>>>> training large deep learning models takes weeks or months on a
>> >>>>>> single node (even with GPU).
>> >>>>>>
>> >>>>>> Rationale
>> >>>>>>
>> >>>>>> Deep learning has gained a lot of traction in both academia and
>> >>>>>> industry due to its success in a wide range of areas such as
>> >>>>>> computer vision and speech recognition. However, training of such
>> >>>>>> models is computationally expensive, especially for large and deep
>> >>>>>> models (e.g., with billions of parameters and more than 10 layers).
>> >>>>>> Both Google and Microsoft have developed distributed deep learning
>> >>>>>> systems to make the training more efficient by distributing the
>> >>>>>> computations within a cluster of nodes. However, these systems are
>> >>>>>> closed-source software. Our goal is to leverage the community of
>> >>>>>> open source developers to make SINGA efficient, scalable and easy
>> >>>>>> to use. SINGA is a full-fledged distributed platform that could
>> >>>>>> benefit the community and also benefit from the community's
>> >>>>>> involvement in contributing to further work in this area. We
>> >>>>>> believe the nature of SINGA and our visions for the system fit
>> >>>>>> naturally with Apache's philosophy and development framework.
>> >>>>>>
>> >>>>>> Initial Goals
>> >>>>>>
>> >>>>>> We have developed a system for SINGA running on a commodity
>> >>>>>> computer cluster.
>> >>>>>> The initial goals include:
>> >>>>>> * improving the system in terms of scalability and efficiency,
>> >>>>>>   e.g., using Infiniband for network communication and
>> >>>>>>   multi-threading for one node computation. We would consider
>> >>>>>>   extending SINGA to GPU clusters later.
>> >>>>>> * benchmarking with larger datasets (hundreds of millions of
>> >>>>>>   training instances) and models (billions of parameters).
>> >>>>>> * adding more built-in deep learning models. Users can train the
>> >>>>>>   built-in models on their datasets directly.
>> >>>>>>
>> >>>>>> Current Status
>> >>>>>>
>> >>>>>> Meritocracy
>> >>>>>>
>> >>>>>> We would like to follow ASF meritocratic principles to encourage
>> >>>>>> more developers to contribute in this project. We know that only
>> >>>>>> active and excellent developers can make SINGA a successful
>> >>>>>> project. The committer list and PMC will be updated based on
>> >>>>>> developers' performance and commitment. We are also improving the
>> >>>>>> documentation and code to help new developers get started quickly.
>> >>>>>>
>> >>>>>> Community
>> >>>>>>
>> >>>>>> SINGA is currently being developed in the Database System Research
>> >>>>>> Lab at the National University of Singapore (NUS) in collaboration
>> >>>>>> with Zhejiang University in China. Our lab has extensive experience
>> >>>>>> in building database related systems, including distributed
>> >>>>>> systems. Six PhD students and research assistants (Jinyang Gao,
>> >>>>>> Kaiping Zheng, Sheng Wang, Wei Wang, Zhaojing Luo and Zhongle Xie),
>> >>>>>> a research fellow (Anh Dinh) and three professors (Beng Chin Ooi,
>> >>>>>> Gang Chen, Kian Lee Tan) have been working for a year on this
>> >>>>>> project. We are open to recruiting more developers from diverse
>> >>>>>> backgrounds.
>> >>>>>>
>> >>>>>> Core Developers
>> >>>>>>
>> >>>>>> Beng Chin Ooi, Gang Chen and Kian Lee Tan are professors who have
>> >>>>>> worked on distributed systems for more than 20 years. They have
>> >>>>>> collaborated with the industry and have built various large scale
>> >>>>>> systems. Anh Dinh's research is also on distributed systems, albeit
>> >>>>>> with more focus on security aspects. Wei Wang's research is on deep
>> >>>>>> learning problems including deep learning applications and large
>> >>>>>> scale training. Sheng Wang and Jinyang are working on efficient
>> >>>>>> indexing, querying of large scale data and machine learning.
>> >>>>>> Kaiping, Zhaojing and Zhongle are new PhD students who joined SINGA
>> >>>>>> recently. They will work on this project for a longer time (next
>> >>>>>> 4-5 years). While we share common research interests, each member
>> >>>>>> also brings diverse expertise to the team.
>> >>>>>>
>> >>>>>> Alignment
>> >>>>>>
>> >>>>>> ASF is already the home of many distributed platforms, e.g., Hadoop,
>> >>>>>> Spark and Mahout, each of which targets a different application
>> >>>>>> domain. SINGA, being a distributed platform for large-scale deep
>> >>>>>> learning, focuses on another important domain which still lacks a
>> >>>>>> robust and scalable open-source platform. The recent success of
>> >>>>>> deep learning models, especially for vision and speech recognition
>> >>>>>> tasks, has generated interest both in applying existing deep
>> >>>>>> learning models and in developing new ones. Thus, an open-source
>> >>>>>> platform for deep learning will be able to attract a large
>> >>>>>> community of users and developers. SINGA is a complex system
>> >>>>>> needing many iterations of design, implementation and testing.
>> >>>>>> Apache's collaboration framework, which encourages active
>> >>>>>> contribution from developers, will inevitably help improve the
>> >>>>>> quality of the system, as shown in the success of Hadoop, Spark,
>> >>>>>> etc. Equally important is the community of users, which helps
>> >>>>>> identify real-life applications of deep learning and helps to
>> >>>>>> evaluate the system's performance and ease-of-use. We hope to
>> >>>>>> leverage ASF for coordinating and promoting both communities, and
>> >>>>>> in return benefit the communities with another useful tool.
>> >>>>>>
>> >>>>>> Known Risks
>> >>>>>>
>> >>>>>> Orphaned products
>> >>>>>>
>> >>>>>> Four core developers (Anh, Wei Wang, Jinyang and Sheng Wang) may
>> >>>>>> leave the lab in two to four years' time. It is possible that some
>> >>>>>> of them may not have enough time to focus on this project after
>> >>>>>> that. But SINGA is part of our other bigger research projects on
>> >>>>>> building an infrastructure for data intensive applications, which
>> >>>>>> include health-care analytics and brain-inspired computing. Beng
>> >>>>>> Chin and Kian Lee would continue working on it and getting more
>> >>>>>> people involved. For example, three new developers (Kaiping,
>> >>>>>> Zhaojing and Zhongle) joined us recently. Individual developers are
>> >>>>>> welcome, to make SINGA a diverse community that is robust and
>> >>>>>> independent from any single developer.
>> >>>>>>
>> >>>>>> Inexperience with Open Source
>> >>>>>>
>> >>>>>> All the developers are active users and followers of open source
>> >>>>>> projects. Our research lab has a strong commitment to open source,
>> >>>>>> and has released the source code of several systems under open
>> >>>>>> source license as a way of contributing back to the open source
>> >>>>>> community.
>> >>>>>> But we do not have much real experience in open source projects
>> >>>>>> with large and well organized communities like those in Apache.
>> >>>>>> This is one reason we chose Apache, which is experienced in open
>> >>>>>> source project incubation. We hope to get help from Apache (e.g.,
>> >>>>>> champion and mentors) to establish a healthy path for SINGA.
>> >>>>>>
>> >>>>>> Homogeneous Developers
>> >>>>>>
>> >>>>>> Although the current developers are researchers in the
>> >>>>>> universities, they have different research interests and project
>> >>>>>> experiences, as mentioned in the section that introduces the core
>> >>>>>> developers. We know that a diverse community is helpful. Hence we
>> >>>>>> are open to the idea of recruiting developers from other regions
>> >>>>>> and organizations.
>> >>>>>>
>> >>>>>> Reliance on Salaried Developers
>> >>>>>>
>> >>>>>> As a research project in the university, SINGA's current developer
>> >>>>>> community consists of professors, PhD students, research assistants
>> >>>>>> and postdoctoral fellows. They are driven by their interests to
>> >>>>>> work on this project and have contributed actively since the start
>> >>>>>> of the project. The research assistants and fellows are expected to
>> >>>>>> leave when their contracts expire. However, they are keen to
>> >>>>>> continue to work on the project voluntarily. Moreover, as this is a
>> >>>>>> long term research project, new research assistants and fellows
>> >>>>>> are likely to join the project.
>> >>>>>>
>> >>>>>> An Excessive Fascination with the Apache Brand
>> >>>>>>
>> >>>>>> We choose Apache not for publicity. We have two purposes. First, we
>> >>>>>> want to leverage Apache's reputation to recruit more developers to
>> >>>>>> make a diverse community. Second, we hope that Apache can help us
>> >>>>>> to establish a healthy path in developing SINGA.
>> >>>>>> Beng Chin and Kian-Lee are established database and distributed
>> >>>>>> system researchers, and together with the other contributors, they
>> >>>>>> sincerely believe that there is a need for a widely accepted open
>> >>>>>> source distributed deep learning platform. The field of deep
>> >>>>>> learning is still in its infancy, and an open source platform will
>> >>>>>> fuel the research in the area. Moreover, such a platform will
>> >>>>>> enable researchers to develop new models and algorithms, rather
>> >>>>>> than spending time implementing a deep learning system from
>> >>>>>> scratch. Furthermore, the need for scalability for such a platform
>> >>>>>> is obvious.
>> >>>>>>
>> >>>>>> Relationship with Other Apache Products
>> >>>>>>
>> >>>>>> Apache H2O implemented two simple deep learning models, namely the
>> >>>>>> Multi-Layer Perceptron and Deep Auto-encoders. There are two
>> >>>>>> significant differences between H2O and SINGA. First, H2O adopts
>> >>>>>> the Map-Reduce framework, which runs a set of computing nodes in
>> >>>>>> parallel against the training set. Model parameters trained by all
>> >>>>>> computing nodes are averaged as the final model parameters. This
>> >>>>>> training algorithm is different from the distributed training
>> >>>>>> algorithm used by DistBelief, Adam and SINGA, which frequently
>> >>>>>> synchronizes the parameters trained from different nodes. SINGA
>> >>>>>> adopts the parameter server framework to support a wide range of
>> >>>>>> distributed training algorithms and parallelization methods (e.g.,
>> >>>>>> data parallelism, model parallelism and hybrid parallelism; H2O
>> >>>>>> only supports data parallelism). Second, in H2O, users are
>> >>>>>> restricted to using the two built-in models. In SINGA, we provide a
>> >>>>>> simple programming model to let users implement their own deep
>> >>>>>> learning models.
>> >>>>>> A new deep learning model can be implemented by customizing the
>> >>>>>> base Layer class for each layer involved in the model. It is
>> >>>>>> similar to writing Hadoop programs where users only need to
>> >>>>>> override the base Mapper and Reducer. We also provide built-in
>> >>>>>> models for users to use directly.
>> >>>>>>
>> >>>>>> Documentation
>> >>>>>>
>> >>>>>> The project is hosted at
>> >>>>>> http://www.comp.nus.edu.sg/~dbsystem/project/singa.html.
>> >>>>>> Documentation can be found at the Github Wiki Page:
>> >>>>>> https://github.com/nusinga/singa/wiki. We continue to refine and
>> >>>>>> improve the documentation.
>> >>>>>>
>> >>>>>> Initial Source
>> >>>>>>
>> >>>>>> We use Github to maintain our source code,
>> >>>>>> https://github.com/nusinga/singa
>> >>>>>>
>> >>>>>> Source and Intellectual Property Submission Plan
>> >>>>>>
>> >>>>>> We plan to release our code base under Apache License, Version 2.0.
>> >>>>>>
>> >>>>>> External Dependencies
>> >>>>>>
>> >>>>>> required by the core code base: glog, gflags, google protobuf,
>> >>>>>> open-blas, mpich, armci-mpi.
>> >>>>>> required by data preparation and preprocessing: opencv, hdfs,
>> >>>>>> python.
>> >>>>>>
>> >>>>>> Cryptography
>> >>>>>>
>> >>>>>> Not Applicable
>> >>>>>>
>> >>>>>> Required Resources
>> >>>>>>
>> >>>>>> Mailing Lists
>> >>>>>>
>> >>>>>> Currently, we use google group for internal discussion. The mailing
>> >>>>>> address is nusi...@googlegroup.com. We will migrate the content to
>> >>>>>> the apache mailing lists in the future.
>> >>>>>>
>> >>>>>> singa-dev
>> >>>>>> singa-user
>> >>>>>> singa-commits
>> >>>>>> singa-private (for private discussion within PMC)
>> >>>>>>
>> >>>>>> Git Repository
>> >>>>>>
>> >>>>>> We want to continue using git for version control. Hence, a git
>> >>>>>> repo is required.
>> >>>>>>
>> >>>>>> Issue Tracking
>> >>>>>>
>> >>>>>> JIRA Singa (SINGA)
>> >>>>>>
>> >>>>>> Initial Committers
>> >>>>>>
>> >>>>>> Beng Chin Ooi (ooibc @comp.nus.edu.sg)
>> >>>>>> Kian Lee Tan (tankl @comp.nus.edu.sg)
>> >>>>>> Gang Chen (cg @zju.edu.cn)
>> >>>>>> Wei Wang (wangwei @comp.nus.edu.sg)
>> >>>>>> Dinh Tien Tuan Anh (dinhtta @comp.nus.edu.sg)
>> >>>>>> Jinyang Gao (jinyang.gao @comp.nus.edu.sg)
>> >>>>>> Sheng Wang (wangsh @comp.nus.edu.sg)
>> >>>>>> Kaiping Zheng (kaiping @comp.nus.edu.sg)
>> >>>>>> Zhaojing Luo (zhaojing @comp.nus.edu.sg)
>> >>>>>> Zhongle Xie (zhongle @comp.nus.edu.sg)
>> >>>>>>
>> >>>>>> Affiliations
>> >>>>>>
>> >>>>>> Beng Chin Ooi, National University of Singapore
>> >>>>>> Kian Lee Tan, National University of Singapore
>> >>>>>> Gang Chen, Zhejiang University
>> >>>>>> Wei Wang, National University of Singapore
>> >>>>>> Dinh Tien Tuan Anh, National University of Singapore
>> >>>>>> Jinyang Gao, National University of Singapore
>> >>>>>> Sheng Wang, National University of Singapore
>> >>>>>> Kaiping Zheng, National University of Singapore
>> >>>>>> Zhaojing Luo, National University of Singapore
>> >>>>>> Zhongle Xie, National University of Singapore
>> >>>>>>
>> >>>>>> Sponsors
>> >>>>>>
>> >>>>>> Champion
>> >>>>>>
>> >>>>>> Thejas Nair (thejas at apache.org) - Hortonworks
>> >>>>>>
>> >>>>>> Nominated Mentors
>> >>>>>>
>> >>>>>> Thejas Nair (thejas at apache.org) - Hortonworks
>> >>>>>> Alan Gates (gates at apache dot org) - Hortonworks
>> >>>>>> (Seeking more volunteers!)
>> >>>>>>
>> >>>>>> Sponsoring Entity
>> >>>>>>
>> >>>>>> We are requesting the Incubator to sponsor this project.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org