+1 (binding)

--Chris Nauroth
On 2/29/16, 9:37 AM, "Patrick Hunt" <ph...@apache.org> wrote:

>Hi folks,
>
>OK the discussion is now completed. Please VOTE to accept Mnemonic
>into the Apache Incubator. I'll leave the VOTE open for at least
>the next 72 hours, with hopes to close it Thursday the 3rd of
>March, 2016 at 10am PT.
>https://wiki.apache.org/incubator/MnemonicProposal
>
>[ ] +1 Accept Mnemonic as an Apache Incubator podling.
>[ ] +0 Abstain.
>[ ] -1 Don't accept Mnemonic as an Apache Incubator podling because..
>
>Of course, I am +1 on this. Please note VOTEs from Incubator PMC
>members are binding but all are welcome to VOTE!
>
>Regards,
>
>Patrick
>
>--------------------
>= Mnemonic Proposal =
>=== Abstract ===
>Mnemonic is a Java-based non-volatile memory library for in-place
>structured data processing and computing. It is a solution for generic
>object and block persistence on heterogeneous block and
>byte-addressable devices, such as DRAM, persistent memory, NVMe, SSD,
>and cloud network storage.
>
>=== Proposal ===
>Mnemonic is a structured data persistence in-memory in-place library
>for Java-based applications and frameworks. It provides unified
>interfaces for data manipulation on heterogeneous
>block/byte-addressable devices, such as DRAM, persistent memory, NVMe,
>SSD, and cloud network devices.
>
>The design motivation for this project is to create a non-volatile
>programming paradigm for in-memory data object persistence, in-memory
>data object caching, and JNI-less IPC.
>Mnemonic simplifies the usage of data object caching, persistence, and
>JNI-less IPC for massive object-oriented structural datasets.
>
>Mnemonic defines Non-Volatile Java objects that store data fields in
>persistent memory and storage. During the program runtime, only
>methods and volatile fields are instantiated in the Java heap;
>Non-Volatile data fields are directly accessed via GET/SET operations
>to and from persistent memory and storage.
>Mnemonic avoids SerDes and
>significantly reduces the amount of garbage in the Java heap.
>
>Major features of Mnemonic:
>* Provides an abstract level of viewpoint to utilize heterogeneous
>block/byte-addressable devices as a whole (e.g., DRAM, persistent
>memory, NVMe, SSD, HD, cloud network storage).
>
>* Provides seamless support for object-oriented design and programming
>without adding the burden of transferring object data to a different
>form.
>
>* Avoids object data serialization/de-serialization for data
>retrieval, caching and storage.
>
>* Reduces the consumption of on-heap memory and in turn reduces and
>stabilizes Java Garbage Collection (GC) pauses for latency-sensitive
>applications.
>
>* Overcomes current limitations of Java GC to manage much larger
>memory resources for massive dataset processing and computing.
>
>* Supports migrating the data usage model from traditional NVMe/SSD/HD
>to non-volatile memory with ease.
>
>* Uses a lazy loading mechanism to avoid unnecessary memory consumption
>if some data is not needed for computing immediately.
>
>* Bypasses JNI calls for the interaction between a Java runtime
>application and its native code.
>
>* Provides an allocation-aware auto-reclaim mechanism to prevent
>external memory resource leaks.
>
>
>=== Background ===
>Big Data and Cloud applications increasingly require both high
>throughput and low latency processing. Java-based applications
>targeting the Big Data and Cloud space should be tuned for better
>throughput, lower latency, and more predictable response time.
>Typically, there are some issues that impact BigData applications'
>performance and scalability:
>
>1) The Complexity of Data Transformation/Organization: In most cases,
>during data processing, applications use their own complicated data
>caching mechanisms for SerDes data objects, spilling to different
>storage and evicting large amounts of data.
>Some data objects contain
>complex values and structures that make data organization much more
>difficult. Loading and then parsing/decoding datasets from
>storage consumes significant system resources and computation power.
>
>2) Lack of Caching, Burst Temporary Object Creation/Destruction Causes
>Frequent Long GC Pauses: Big Data computing/syntax generates a large
>number of temporary objects during processing, e.g. lambdas, SerDes,
>copying, etc. This triggers frequent long Java GC pauses to scan
>references, to update reference lists, and to copy live objects from
>one memory location to another blindly.
>
>3) The Unpredictable GC Pause: Latency-sensitive applications,
>such as databases, search engines, web queries, and real-time/streaming
>computing, require request-response latency to be under control. But
>current Java GC does not provide predictable GC activity with large
>on-heap memory management.
>
>4) High JNI Invocation Cost: JNI calls are expensive, but high
>performance applications usually try to leverage native code to
>improve performance. However, JNI calls need to convert Java objects
>into something that C/C++ can understand. In addition, some
>comprehensive native code needs to communicate with the Java-based
>application, which causes frequent JNI calls along with stack
>marshalling.
>
>The Mnemonic project provides a solution to address the above issues
>and performance bottlenecks for structured data processing and
>computing. It also simplifies massive data handling with much reduced
>GC activity.
>
>=== Rationale ===
>There is a strong need for a cohesive, easy-to-use non-volatile
>programming model for unified heterogeneous memory resource management
>and allocation. The Mnemonic project provides a reusable and flexible
>framework to accommodate other special types of memory/block devices
>for better performance without changing client code.
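A side note on the in-place GET/SET access that the proposal describes: below is a minimal sketch of the general idea, using only JDK memory-mapped files. This is not the Mnemonic API. The class name InPlaceRecord, its two fields, and their byte offsets are hypothetical and chosen only for illustration, and a file-backed mapping is assumed to stand in for persistent memory.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Hypothetical illustration only; NOT the Mnemonic API.
 * The record's data fields live in a memory-mapped file, so getters and
 * setters read and write them in place with no serialization step.
 */
final class InPlaceRecord {
    // Assumed fixed layout: an 8-byte long followed by an 8-byte double.
    private static final int ID_OFFSET = 0;
    private static final int SCORE_OFFSET = 8;
    private static final int SIZE = 16;

    // Only this small handle lives on the Java heap.
    private final MappedByteBuffer buf;

    private InPlaceRecord(MappedByteBuffer buf) {
        this.buf = buf;
    }

    /** Maps the first SIZE bytes of the file; the mapping outlives the channel. */
    public static InPlaceRecord open(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {
            return new InPlaceRecord(ch.map(FileChannel.MapMode.READ_WRITE, 0, SIZE));
        }
    }

    public long getId() { return buf.getLong(ID_OFFSET); }

    public void setId(long id) { buf.putLong(ID_OFFSET, id); }

    public double getScore() { return buf.getDouble(SCORE_OFFSET); }

    public void setScore(double score) { buf.putDouble(SCORE_OFFSET, score); }

    /** Forces the mapped bytes out to the backing storage. */
    public void flush() { buf.force(); }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("record", ".bin");

        InPlaceRecord first = open(file);
        first.setId(42L);
        first.setScore(0.5);
        first.flush();

        // A fresh mapping of the same file sees the fields directly.
        InPlaceRecord second = open(file);
        System.out.println(second.getId() + " " + second.getScore());
    }
}
```

Running main writes the two fields, reopens the file through a fresh mapping, and reads them back without any encode/decode step. A real non-volatile object model would also need allocation, reclamation, and object references, which this sketch leaves out.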
>
>Most of the BigData frameworks (e.g., Apache Spark, Apache Hadoop®,
>Apache HBase, Apache Flink, Apache Kafka, etc.) have their own
>complicated memory management modules for caching and checkpointing.
>Many of these approaches increase complexity and make the code
>error-prone to maintain.
>
>We have observed heavy overheads during the operations of data parsing,
>SerDes, pack/unpack, and encode/decode for data loading, storage,
>checkpointing, caching, marshalling and transfer. Mnemonic provides a
>generic in-memory persistence object model to address those overheads
>for better performance. In addition, it manages its in-memory
>persistence objects and blocks in the way that GC does, which means
>their underlying memory resources can be reclaimed without
>explicitly releasing them.
>
>Some existing Big Data applications suffer from poor Java GC behaviors
>when they process their massive unstructured datasets. Those
>behaviors either cause very long stop-the-world GC pauses or take
>significant system resources during computing, which impacts throughput
>and incurs significant perceivable pauses for interactive analytics.
>
>More and more computing-intensive Big Data applications
>rely on JNI to offload their computing tasks to native
>code, which dramatically increases the cost of JNI invocation and IPC.
>Mnemonic provides a mechanism to communicate with native code directly
>through in-place object data updates to avoid complex object data type
>conversion and stack marshalling. In addition, this project can be
>extended to support various lockers for threads between Java code and
>native code.
>
>=== Initial Goals ===
>Our initial goal is to bring Mnemonic into the ASF and transition the
>engineering and governance processes to the "Apache Way." We would
>like to enrich a collaborative development model that closely aligns
>with current and future industry memory and storage technologies.
>
>Another important goal is to encourage efforts to integrate the
>non-volatile programming model into data-centric processing/analytics
>frameworks/applications (e.g., Apache Spark, Apache HBase, Apache
>Flink, Apache Hadoop®, Apache Cassandra, etc.).
>
>We expect the Mnemonic project to continuously develop new
>functionality in an open, community-driven way. We envision
>accelerating innovation under ASF governance in order to meet the
>requirements of a wide variety of use cases for in-memory non-volatile
>and volatile data caching programming.
>
>=== Current Status ===
>The Mnemonic project is available in Intel's internal repository and
>managed by its designers and developers. It is also temporarily hosted
>on GitHub for general viewing:
>https://github.com/NonVolatileComputing/Mnemonic.git
>
>We have integrated this project with Apache Spark 1.5.0 and got a 2X
>performance improvement for the Spark MLlib k-means workload. We also
>observed the expected benefits of removing SerDes, with total GC pause
>time reduced by 40% in our experiments.
>
>==== Meritocracy ====
>Mnemonic was originally created by Gang (Gary) Wang and Yanping Wang
>in early 2015. The initial committers are the current Mnemonic R&D
>team members from the US, China, and India Big Data Technologies Group
>at Intel. This group will form a base for a much broader community to
>collaborate on this code base.
>
>We intend to radically expand the initial developer and user community
>by running the project in accordance with the "Apache Way." Users and
>new contributors will be treated with respect and welcomed. By
>participating in the community and providing quality patches/support
>that move the project forward, they will earn merit. They also will be
>encouraged to provide non-code contributions (documentation, events,
>community management, etc.) and will gain merit for doing so. Those
>with a proven support and quality track record will be encouraged to
>become committers.
>
>==== Community ====
>If Mnemonic is accepted for incubation, the primary initial goal is to
>transition the core community towards embracing the Apache Way of
>project governance. We would solicit major existing contributors to
>become committers on the project from the start.
>
>==== Core Developers ====
>Mnemonic core developers are all skilled software developers and
>system performance engineers at Intel Corp with years of experience
>in their fields. They have contributed much code to Apache projects.
>PMC members and experienced committers from Apache Spark, Apache HBase,
>Apache Phoenix, and Apache Hadoop® have been working with us
>on this project's open source efforts.
>
>=== Alignment ===
>The initial code base is targeted at data-centric processing and
>analysis in general. Mnemonic has been building connections and
>integrations with Apache projects and other projects.
>
>We believe Mnemonic will evolve to become a promising project for
>real-time processing, in-memory streaming analytics and more, along
>with current and future new server platforms with persistent memory as
>base storage devices.
>
>=== Known Risks ===
>==== Orphaned products ====
>Intel's Big Data Technologies Group is actively working with the
>community on integrating this project into Big Data frameworks and
>applications. We are continuously adding new concepts and code to this
>project and supporting new use cases and features for the Apache Big
>Data ecosystem.
>
>The project contributors are leading contributors to Hadoop-based
>technologies and have a long standing in the Hadoop community. As we
>are addressing major Big Data processing performance issues, there is
>minimal risk of this work becoming non-strategic and unsupported.
>
>Our contributors are confident that a larger community will be formed
>within the project in a relatively short period of time.
>
>==== Inexperience with Open Source ====
>This project has long-standing experienced mentors and interested
>contributors from Apache Spark, Apache HBase, Apache Phoenix, and
>Apache Hadoop® to help us move through the open source process. We are
>actively working with experienced Apache community PMC members and
>committers to improve our project and further testing.
>
>==== Homogeneous Developers ====
>All initial committers and interested contributors are employed at
>Intel. As this is an infrastructure memory project, a wide range of
>Apache projects are interested in an innovative memory project that
>fits large persistent memory and storage devices. Various Apache
>projects such as Apache Spark, Apache HBase, Apache Phoenix, Apache
>Flink, Apache Cassandra, etc. can take good advantage of this project
>to overcome serialization/de-serialization, Java GC, and caching
>issues. We expect a wide range of interest will be generated after we
>open source this project to Apache.
>
>==== Reliance on Salaried Developers ====
>All developers are paid by their employers to contribute to this
>project. We welcome all others to contribute to this project after it
>is open sourced.
>
>==== Relationships with Other Apache Products ====
>Relationship with Apache Arrow:
>Arrow's columnar data layout allows great use of CPU caches & SIMD. It
>places all data that is relevant to a column operation in a compact
>format in memory.
>
>Mnemonic directly puts the whole business object graphs on external
>heterogeneous storage media, e.g. off-heap, SSD. It is not necessary
>to normalize the structures of object graphs for caching, checkpointing
>or storing. It doesn't require developers to normalize their data
>object graphs. Mnemonic applications can avoid indexing and joining
>datasets compared to traditional approaches.
>
>Mnemonic can leverage Arrow to transparently re-layout qualified data
>objects or create special containers that are able to efficiently hold
>those data records in columnar form as one of its major performance
>optimization constructs.
>
>Mnemonic can be integrated into various Big Data and Cloud frameworks
>and applications.
>We are currently working on several Apache projects with Mnemonic:
>For Apache Spark we are integrating Mnemonic to improve:
>a) Local checkpoints
>b) Memory management for caching
>c) Persistent memory datasets input
>d) Non-Volatile RDD operations
>The best use case for Apache Spark computing is that the input data
>is stored in the form of Mnemonic native storage to avoid caching its
>row data for iterative processing. Moreover, Spark applications can
>leverage Mnemonic to perform data transformation in persistent or
>non-persistent memory without SerDes.
>
>For Apache Hadoop®, we are integrating HDFS Caching with Mnemonic
>instead of mmap. This will take advantage of persistent memory related
>features. We also plan to evaluate integrating the Namenode Editlog and
>FSImage persistent data into the Mnemonic persistent memory area.
>
>For Apache HBase, we are using Mnemonic for BucketCache and
>evaluating performance improvements.
>
>We expect Mnemonic will be further developed and integrated into many
>Apache BigData projects and so on, to enhance memory management
>solutions for much improved performance and reliability.
>
>==== An Excessive Fascination with the Apache Brand ====
>While we expect the Apache brand to help attract more contributors, our
>interest in starting this project is based on the factors mentioned
>in the Rationale section.
>
>We would like Mnemonic to become an Apache project to further foster a
>healthy community of contributors and consumers in BigData technology
>R&D areas.
>Since Mnemonic can directly benefit many Apache projects
>and solves major performance problems, we expect the Apache Software
>Foundation to increase interaction with the larger community as well.
>
>=== Documentation ===
>The documentation is currently available at Intel and will be posted
>under: https://mnemonic.incubator.apache.org/docs
>
>=== Initial Source ===
>Initial source code is temporarily hosted on GitHub for general viewing:
>https://github.com/NonVolatileComputing/Mnemonic.git
>It will be moved to Apache http://git.apache.org/ after podling.
>
>The initial source is written in Java (88%), mixed with JNI C
>code (11%) and shell script (1%) for the underlying native allocation
>libraries.
>
>=== Source and Intellectual Property Submission Plan ===
>As soon as Mnemonic is approved to join the Incubator, the source code
>will be transitioned via the Software Grant Agreement onto ASF
>infrastructure and in turn made available under the Apache License,
>version 2.0.
>
>=== External Dependencies ===
>The required external dependencies are all under Apache licenses or
>other compatible licenses.
>Note: The runtime dependent licenses of Mnemonic are all declared as
>Apache 2.0; the GNU licensed components are used for Mnemonic build
>and deployment. The Mnemonic JNI libraries are built using the GNU
>tools.
>
>maven and its plugins (http://maven.apache.org/ ) [Apache 2.0]
>JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]
>Nvml (http://pmem.io ) [optional] [Open Source]
>PMalloc (https://github.com/bigdata-memory/pmalloc ) [optional] [Apache 2.0]
>
>Build and test dependencies:
>org.testng.testng v6.8.17 (http://testng.org) [Apache 2.0]
>org.flowcomputing.commons.commons-resgc v0.8.7 [Apache 2.0]
>org.flowcomputing.commons.commons-primitives v.0.6.0 [Apache 2.0]
>com.squareup.javapoet v1.3.1-SNAPSHOT [Apache 2.0]
>JDK8 or OpenJDK 8 (http://java.com/) [Oracle or Openjdk JDK License]
>
>=== Cryptography ===
>Project Mnemonic does not use cryptography itself; however, Hadoop
>projects use standard APIs and tools for SSH and SSL communication
>where necessary.
>
>=== Required Resources ===
>We request that the following resources be created for the project to
>use:
>
>==== Mailing lists ====
>priv...@mnemonic.incubator.apache.org (moderated subscriptions)
>comm...@mnemonic.incubator.apache.org
>d...@mnemonic.incubator.apache.org
>
>==== Git repository ====
>https://github.com/apache/incubator-mnemonic
>
>==== Documentation ====
>https://mnemonic.incubator.apache.org/docs/
>
>==== JIRA instance ====
>https://issues.apache.org/jira/browse/mnemonic
>
>=== Initial Committers ===
>* Gang (Gary) Wang (gang1 dot wang at intel dot com)
>
>* Yanping Wang (yanping dot wang at intel dot com)
>
>* Uma Maheswara Rao G (umamahesh at apache dot org)
>
>* Kai Zheng (drankye at apache dot org)
>
>* Rakesh Radhakrishnan Potty (rakeshr at apache dot org)
>
>* Sean Zhong (seanzhong at apache dot org)
>
>* Henry Saputra (hsaputra at apache dot org)
>
>* Hao Cheng (hao dot cheng at intel dot com)
>
>=== Additional Interested Contributors ===
>* Debo Dutta (dedutta at cisco dot com)
>
>* Liang Chen (chenliang613 at Huawei dot com)
>
>=== Affiliations ===
>* Gang (Gary) Wang, Intel
>
>* Yanping Wang, Intel
>
>* Uma Maheswara Rao G, Intel
>
>* Kai Zheng, Intel
>
>* Rakesh
>Radhakrishnan Potty, Intel
>
>* Sean Zhong, Intel
>
>* Henry Saputra, Independent
>
>* Hao Cheng, Intel
>
>=== Sponsors ===
>==== Champion ====
>Patrick Hunt
>
>==== Nominated Mentors ====
>* Patrick Hunt <phunt at apache dot org> - Apache IPMC member
>
>* Andrew Purtell <apurtell at apache dot org> - Apache IPMC member
>
>* James Taylor <jamestaylor at apache dot org> - Apache IPMC member
>
>* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>
>==== Sponsoring Entity ====
>Apache Incubator PMC
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>For additional commands, e-mail: general-h...@incubator.apache.org