+1 (non-binding)

2015-10-23 23:50 GMT+08:00 Owen O'Malley <[email protected]>:
> +1 (binding)
>
> > On Fri, Oct 23, 2015 at 8:42 AM, wp chun <[email protected]> wrote:
> >
> > > +1
> > > [email protected]
> > >
> > > On 10/23/15, 11:26 PM, "P. Taylor Goetz" <[email protected]> wrote:
> > >
> > > >+1 (binding)
> > > >
> > > >-Taylor
> > > >
> > > >> On Oct 23, 2015, at 10:11 AM, Manoharan, Arun <[email protected]> wrote:
> > > >>
> > > >> Hello Everyone,
> > > >>
> > > >> Thanks for all the feedback on the Eagle Proposal.
> > > >>
> > > >> I would like to call for a [VOTE] on Eagle joining the ASF as an incubation project.
> > > >>
> > > >> The vote is open for 72 hours:
> > > >>
> > > >> [ ] +1 accept Eagle in the Incubator
> > > >> [ ] ±0
> > > >> [ ] -1 (please give reason)
> > > >>
> > > >> Eagle is a Monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks, malicious activities and take actions in real time. Eagle supports a wide variety of policies on HDFS data and Hive. Eagle also provides machine learning models for detecting anomalous user behavior in Hadoop.
> > > >>
> > > >> The proposal is available on the wiki here:
> > > >> https://wiki.apache.org/incubator/EagleProposal
> > > >>
> > > >> The text of the proposal is also available at the end of this email.
> > > >>
> > > >> Thanks for your time and help.
> > > >>
> > > >> Thanks,
> > > >> Arun
> > > >>
> > > >> <COPY of the proposal in text format>
> > > >>
> > > >> Eagle
> > > >>
> > > >> Abstract
> > > >> Eagle is an Open Source Monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks, malicious activities in hadoop and take actions.
> > > >>
> > > >> Proposal
> > > >> Eagle audits access to HDFS files, Hive and HBase tables in real time, enforces policies defined on sensitive data access and alerts or blocks user's access to that sensitive data in real time.
> > > >> Eagle also creates user profiles based on the typical access behaviour for HDFS and Hive and sends alerts when anomalous behaviour is detected. Eagle can also import sensitive data information classified by external classification engines to help define its policies.
> > > >>
> > > >> Overview of Eagle
> > > >> Eagle has 3 main parts.
> > > >> 1. Data collection and storage - Eagle collects data from various hadoop logs in real time using Kafka/Yarn API and uses HDFS and HBase for storage.
> > > >> 2. Data processing and policy engine - Eagle allows users to create policies based on various metadata properties on HDFS, Hive and HBase data.
> > > >> 3. Eagle services - Eagle services include policy manager, query service and the visualization component. Eagle provides intuitive user interface to administer Eagle and an alert dashboard to respond to real time alerts.
> > > >>
> > > >> Data Collection and Storage:
> > > >> Eagle provides programming API for extending Eagle to integrate any data source into Eagle policy evaluation framework. For example, Eagle hdfs audit monitoring collects data from Kafka which is populated from namenode log4j appender or from logstash agent. Eagle hive monitoring collects hive query logs from running job through YARN API, which is designed to be scalable and fault-tolerant. Eagle uses HBase as storage for storing metadata and metrics data, and also supports relational database through configuration change.
> > > >>
> > > >> Data Processing and Policy Engine:
> > > >> Processing Engine: Eagle provides stream processing API which is an abstraction of Apache Storm. It can also be extended to other streaming engines. This abstraction allows developers to assemble data transformation, filtering, external data join etc.
> > > >> without being physically bound to a specific streaming platform. Eagle streaming API allows developers to easily integrate business logic with Eagle policy engine and internally Eagle framework compiles business logic execution DAG into program primitives of underlying stream infrastructure e.g. Apache Storm. For example, Eagle HDFS monitoring transforms audit log from Namenode to object and joins sensitivity metadata, security zone metadata which are generated from external programs or configured by user. Eagle hive monitoring filters running jobs to get hive query string and parses query string into object and then joins sensitivity metadata.
> > > >> Alerting Framework: Eagle Alert Framework includes stream metadata API, scalable policy engine framework, extensible policy engine framework. Stream metadata API allows developers to declare event schema including what attributes constitute an event, what is the type for each attribute, and how to dynamically resolve attribute value in runtime when user configures policy. Scalable policy engine framework allows policies to be executed on different physical nodes in parallel. It can also be used to define your own policy partitioner class. Policy engine framework together with streaming partitioning capability provided by all streaming platforms will make sure policies and events can be evaluated in a fully distributed way. Extensible policy engine framework allows developers to plug in a new policy engine with a few lines of code. WSO2 Siddhi CEP engine is the policy engine which Eagle supports as first-class citizen.
> > > >> Machine Learning module: Eagle provides capabilities to define user activity patterns or user profiles for Hadoop users based on the user behaviour in the platform.
> > > >> These user profiles are modeled using Machine Learning algorithms and used for detection of anomalous user activities. Eagle uses Eigenvalue Decomposition and Density Estimation algorithms for generating user profile models. The model reads data from HDFS audit logs, preprocesses and aggregates data, and generates models using Spark programming APIs. Once models are generated, Eagle uses stream processing engine for near real-time anomaly detection to determine if any user's activities are suspicious or not.
> > > >>
> > > >> Eagle Services:
> > > >> Query Service: Eagle provides SQL-like service API to support comprehensive computation for huge set of data on the fly, e.g. comprehensive filtering, aggregation, histogram, sorting, top, arithmetical expression, pagination etc. HBase is the data storage which Eagle supports as first-class citizen; relational database is supported as well. For HBase storage, Eagle query framework compiles user provided SQL-like query into HBase native filter objects and executes it through HBase coprocessor on the fly.
> > > >> Policy Manager: Eagle policy manager provides UI and Restful API for user to define policy with just a few clicks. It includes site management UI, policy editor, sensitivity metadata import, HDFS or Hive sensitive resource browsing, alert dashboards etc.
> > > >>
> > > >> Background
> > > >> Data is one of the most important assets for today's businesses, which makes data security one of the top priorities of today's enterprises. Hadoop is widely used across different verticals as a big data repository to store this data in most modern enterprises.
> > > >> At eBay we use hadoop platform extensively for our data processing needs.
> > > >> Our data in Hadoop is becoming bigger and bigger as our user base is seeing an exponential growth. Today there is a variety of data sets available in Hadoop cluster for our users to consume. eBay has around 120 PB of data stored in HDFS across 6 different clusters and around 1800+ active hadoop users consuming data through Hive, HBase and mapreduce jobs every day to build applications using this data. With this astronomical growth of data there are also challenges in securing sensitive data and monitoring the access to this sensitive data. Today in large organizations HDFS is the de facto standard for storing big data. Data sets including, but not limited to, consumer sentiment, social media data, customer segmentation, web clicks, sensor data, geo-location and transaction data get stored in Hadoop for day to day business needs.
> > > >> We at eBay want to make sure the sensitive data and data platforms are completely protected from security breaches. So we partnered very closely with our Information Security team to understand the requirements for Eagle to monitor sensitive data access on hadoop:
> > > >> 1. Ability to identify and stop security threats in real time
> > > >> 2. Scale for big data (Support PB scale and Billions of events)
> > > >> 3. Ability to create data access policies
> > > >> 4. Support multiple data sources like HDFS, HBase, Hive
> > > >> 5. Visualize alerts in real time
> > > >> 6. Ability to block malicious access in real time
> > > >> We did not find any data access monitoring solution that is available today and can provide the features and functionality that we need to monitor the data access in the hadoop ecosystem at our scale. Hence with an excellent team of world class developers and several users, we have been able to bring Eagle into production as well as open source it.
> > > >>
> > > >> Rationale
> > > >> In today's world, data is an important asset for any company. Businesses are using data extensively to create amazing experiences for users. Data has to be protected and access to data should be secured from security breaches. Today Hadoop is not only used to store logs but also stores financial data, sensitive data sets, geographical data, user click stream data sets etc. which makes it more important to be protected from security breaches. To secure a data platform there are multiple things that need to happen. One is having a strong access control mechanism which today is provided by Apache Ranger and Apache Sentry. These tools provide fine-grained access control to data sets on hadoop. But there is a big gap in terms of monitoring all the data access events and activities in order to secure the hadoop data platform. Together with strong access control, perimeter security and data access monitoring in place, data in the hadoop clusters can be secured against breaches. We looked around and found the following:
> > > >> Existing data activity monitoring products are designed for traditional databases and data warehouses. Existing monitoring platforms cannot scale out to support fast growing data and petabyte scale. The few products in the industry are still very early in terms of supporting HDFS, Hive, HBase data access monitoring.
> > > >> As mentioned in the background, the business requirement and urgency to secure the data from users with malicious intent drove eBay to invest in building a real time data access monitoring solution from scratch to offer real time alerts and remediation features for malicious data access.
> > > >> With the power of open source distributed systems like Hadoop, Kafka and much more we were able to develop a data activity monitoring system that can scale, identify and stop malicious access in real time.
> > > >> Eagle allows admins to create standard access policies and rules for monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box machine learning models for modeling user profiles based on user access behaviour and uses the models to alert on anomalies.
> > > >>
> > > >> Current Status
> > > >>
> > > >> Meritocracy
> > > >> Eagle has been deployed in production at eBay for monitoring billions of events per day from HDFS and Hive operations. From the start, the product has been built with high scalability and application extensibility in mind, and Eagle has demonstrated great performance in responding to suspicious events instantly and great flexibility in defining policy.
> > > >>
> > > >> Community
> > > >> Eagle seeks to develop the developer and user communities during incubation.
> > > >>
> > > >> Core Developers
> > > >> Eagle is currently being designed and developed by engineers from eBay Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of these core developers have deep expertise in developing monitoring products for the Hadoop ecosystem.
> > > >>
> > > >> Alignment
> > > >> The ASF is a natural host for Eagle given that it is already the home of Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects. Eagle leverages a lot of Apache open-source products.
> > > >> Eagle was designed to offer real time insights into sensitive data access by actively monitoring the data access on various data sets in hadoop and an extensible alerting framework with a powerful policy engine. Eagle complements the existing Hadoop platform area by providing a comprehensive monitoring and alerting solution for detecting sensitive data access threats based on preset policies and machine learning models for user behaviour analysis.
> > > >>
> > > >> Known Risks
> > > >>
> > > >> Orphaned Products
> > > >> The core developers of Eagle team work full time on this project. There is no risk of Eagle getting orphaned since eBay is extensively using it in its production Hadoop clusters and has plans to go beyond hadoop. For example, currently there are 7 hadoop clusters and 2 of them are being monitored using Hadoop Eagle in production. We have plans to extend it to all hadoop clusters and eventually other data platforms. There are tens of policies onboarded and actively monitored with plans to onboard more use cases. We are very confident that every hadoop cluster in the world will be monitored using Eagle for securing the hadoop ecosystem by actively monitoring for data access on sensitive data. We plan to extend and diversify this community further through Apache. We presented Eagle at the Hadoop Summit in China and garnered interest from different companies who use hadoop extensively.
> > > >>
> > > >> Inexperience with Open Source
> > > >> The core developers are all active users and followers of open source. They are already committers and contributors to the Eagle Github project.
> > > >> All have been involved with the source code that has been released under an open source license, and several of them also have experience developing code in an open source environment. Though the core set of Developers do not have Apache Open Source experience, there are plans to onboard individuals with Apache open source experience on to the project. Apache Kylin PMC members are also in the same eBay organization. We work very closely with Apache Ranger committers and are looking forward to finding meaningful integrations to improve the security of hadoop platform.
> > > >>
> > > >> Homogenous Developers
> > > >> The core developers are from eBay. Today the problem of monitoring data activities to find and stop threats is a universal problem faced by all the businesses. Apache Incubation process encourages an open and diverse meritocratic community. Eagle intends to make every possible effort to build a diverse, vibrant and involved community and has already received substantial interest from various organizations.
> > > >>
> > > >> Reliance on Salaried Developers
> > > >> eBay invested in Eagle as the monitoring solution for Hadoop clusters and some of its key engineers are working full time on the project. In addition, since there is a growing need to secure sensitive data access and for a data activity monitoring solution for Hadoop, we look forward to other Apache developers and researchers contributing to the project. Additional contributors, including Apache committers, have plans to join this effort shortly. Also key to addressing the risk associated with relying on Salaried developers from a single entity is to increase the diversity of the contributors and actively lobby for Domain experts in the security space to contribute.
> > > >> Eagle intends to do this.
> > > >>
> > > >> Relationships with Other Apache Products
> > > >> Eagle has a strong relationship and dependency with Apache Hadoop, HBase, Spark, Kafka and Storm. Being part of Apache's Incubation community could help with closer collaboration among these projects as well as others.
> > > >>
> > > >> An Excessive Fascination with the Apache Brand
> > > >> Eagle is proposing to enter incubation at Apache in order to help efforts to diversify the committer-base, not so much to capitalize on the Apache brand. The Eagle project is in production use already inside eBay, but is not expected to be an eBay product for external customers. As such, the Eagle project is not seeking to use the Apache brand as a marketing tool.
> > > >>
> > > >> Documentation
> > > >> Information about Eagle can be found at https://github.com/eBay/Eagle. The following link provides more information about Eagle: http://goeagle.io
> > > >>
> > > >> Initial Source
> > > >> Eagle has been under development since 2014 by a team of engineers at eBay Inc. It is currently hosted on Github.com under an Apache license 2.0 at https://github.com/eBay/Eagle. Once in incubation we will be moving the code base to apache git library.
> > > >>
> > > >> External Dependencies
> > > >> Eagle has the following external dependencies.
> > > >> Basic
> > > >> * JDK 1.7+
> > > >> * Scala 2.10.4
> > > >> * Apache Maven
> > > >> * JUnit
> > > >> * Log4j
> > > >> * Slf4j
> > > >> * Apache Commons
> > > >> * Apache Commons Math3
> > > >> * Jackson
> > > >> * Siddhi CEP engine
> > > >>
> > > >> Hadoop
> > > >> * Apache Hadoop
> > > >> * Apache HBase
> > > >> * Apache Hive
> > > >> * Apache Zookeeper
> > > >> * Apache Curator
> > > >>
> > > >> Apache Spark
> > > >> * Spark Core Library
> > > >>
> > > >> REST Service
> > > >> * Jersey
> > > >>
> > > >> Query
> > > >> * Antlr
> > > >>
> > > >> Stream processing
> > > >> * Apache Storm
> > > >> * Apache Kafka
> > > >>
> > > >> Web
> > > >> * AngularJS
> > > >> * jQuery
> > > >> * Bootstrap V3
> > > >> * Moment JS
> > > >> * Admin LTE
> > > >> * html5shiv
> > > >> * respond
> > > >> * Fastclick
> > > >> * Date Range Picker
> > > >> * Flot JS
> > > >>
> > > >> Cryptography
> > > >> Eagle will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Eagle to be a controlled export item due to the use of encryption. Eagle supports but does not require the Kerberos authentication mechanism to access secured Hadoop services.
> > > >>
> > > >> Required Resources
> > > >>
> > > >> Mailing List
> > > >> * eagle-private for private PMC discussions
> > > >> * eagle-dev for developers
> > > >> * eagle-commits for all commits
> > > >> * eagle-users for all eagle users
> > > >>
> > > >> Subversion Directory
> > > >> * Git is the preferred source control system.
> > > >>
> > > >> Issue Tracking
> > > >> * JIRA Eagle (Eagle)
> > > >>
> > > >> Other Resources
> > > >> The existing code already has unit tests so we will make use of existing Apache continuous testing infrastructure. The resulting load should not be very large.
> > > >>
> > > >> Initial Committers
> > > >> * Seshu Adunuthula <sadunuthula at ebay dot com>
> > > >> * Arun Manoharan <armanoharan at ebay dot com>
> > > >> * Edward Zhang <yonzhang at ebay dot com>
> > > >> * Hao Chen <hchen9 at ebay dot com>
> > > >> * Chaitali Gupta <cgupta at ebay dot com>
> > > >> * Libin Sun <libsun at ebay dot com>
> > > >> * Jilin Jiang <jiljiang at ebay dot com>
> > > >> * Qingwen Zhao <qingwzhao at ebay dot com>
> > > >> * Hemanth Dendukuri <hdendukuri at ebay dot com>
> > > >> * Senthil Kumar <senthilkumar at ebay dot com>
> > > >>
> > > >> Affiliations
> > > >> The initial committers are employees of eBay Inc.
> > > >>
> > > >> Sponsors
> > > >>
> > > >> Champion
> > > >> * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> > > >>
> > > >> Nominated Mentors
> > > >> * Owen O'Malley <omalley at apache dot org> - Apache IPMC member, Hortonworks
> > > >> * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> > > >> * Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
> > > >> * Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC member
> > > >> * Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member, Hortonworks
> > > >>
> > > >> Sponsoring Entity
> > > >> We are requesting the Incubator to sponsor this project.
