Looks like the discussion has calm down, so unless there is more
comments we will send VOTE thread tomorrow.

Thanks all for the feedback.

- Henry

On Mon, Oct 19, 2015 at 8:33 AM, Manoharan, Arun <armanoha...@ebay.com> wrote:
> Hello Everyone,
>
> My name is Arun Manoharan. Currently a product manager in the Analytics 
> platform team at eBay Inc.
>
> I would like to start a discussion on Eagle and its joining the ASF as an 
> incubation project.
>
> Eagle is a Monitoring solution for Hadoop to instantly identify access to 
> sensitive data, recognize attacks, malicious activities and take actions in 
> real time. Eagle supports a wide variety of policies on HDFS data and Hive. 
> Eagle also provides machine learning models for detecting anomalous user 
> behavior in Hadoop.
>
> The proposal is available on the wiki here:
> https://wiki.apache.org/incubator/EagleProposal
>
> The text of the proposal is also available at the end of this email.
>
> Thanks for your time and help.
>
> Thanks,
> Arun
>
> <COPY of the proposal in text format>
>
> Eagle
>
> Abstract
> Eagle is an Open Source Monitoring solution for Hadoop to instantly identify 
> access to sensitive data, recognize attacks, malicious activities in hadoop 
> and take actions.
>
> Proposal
> Eagle audits access to HDFS files, Hive and HBase tables in real time, 
> enforces policies defined on sensitive data access and alerts or blocks 
> user’s access to that sensitive data in real time. Eagle also creates user 
> profiles based on the typical access behaviour for HDFS and Hive and sends 
> alerts when anomalous behaviour is detected. Eagle can also import sensitive 
> data information classified by external classification engines to help define 
> its policies.
>
> Overview of Eagle
> Eagle has 3 main parts.
> 1.Data collection and storage - Eagle collects data from various hadoop logs 
> in real time using Kafka/Yarn API and uses HDFS and HBase for storage.
> 2.Data processing and policy engine - Eagle allows users to create policies 
> based on various metadata properties on HDFS, Hive and HBase data.
> 3.Eagle services - Eagle services include policy manager, query service and 
> the visualization component. Eagle provides intuitive user interface to 
> administer Eagle and an alert dashboard to respond to real time alerts.
>
> Data Collection and Storage:
> Eagle provides programming API for extending Eagle to integrate any data 
> source into Eagle policy evaluation framework. For example, Eagle hdfs audit 
> monitoring collects data from Kafka which is populated from namenode log4j 
> appender or from logstash agent. Eagle hive monitoring collects hive query 
> logs from running job through YARN API, which is designed to be scalable and 
> fault-tolerant. Eagle uses HBase as storage for storing metadata and metrics 
> data, and also supports relational database through configuration change.
>
> Data Processing and Policy Engine:
> Processing Engine: Eagle provides stream processing API which is an 
> abstraction of Apache Storm. It can also be extended to other streaming 
> engines. This abstraction allows developers to assemble data transformation, 
> filtering, external data join etc. without physically bound to a specific 
> streaming platform. Eagle streaming API allows developers to easily integrate 
> business logic with Eagle policy engine and internally Eagle framework 
> compiles business logic execution DAG into program primitives of underlying 
> stream infrastructure e.g. Apache Storm. For example, Eagle HDFS monitoring 
> transforms audit log from Namenode to object and joins sensitivity metadata, 
> security zone metadata which are generated from external programs or 
> configured by user. Eagle hive monitoring filters running jobs to get hive 
> query string and parses query string into object and then joins sensitivity 
> metadata.
> Alerting Framework: Eagle Alert Framework includes stream metadata API, 
> scalable policy engine framework, extensible policy engine framework. Stream 
> metadata API allows developers to declare event schema including what 
> attributes constitute an event, what is the type for each attribute, and how 
> to dynamically resolve attribute value in runtime when user configures 
> policy. Scalable policy engine framework allows policies to be executed on 
> different physical nodes in parallel. It is also used to define your own 
> policy partitioner class. Policy engine framework together with streaming 
> partitioning capability provided by all streaming platforms will make sure 
> policies and events can be evaluated in a fully distributed way. Extensible 
> policy engine framework allows developer to plugin a new policy engine with a 
> few lines of codes. WSO2 Siddhi CEP engine is the policy engine which Eagle 
> supports as first-class citizen.
> Machine Learning module: Eagle provides capabilities to define user activity 
> patterns or user profiles for Hadoop users based on the user behaviour in the 
> platform. These user profiles are modeled using Machine Learning algorithms 
> and used for detection of anomalous users activities. Eagle uses Eigen Value 
> Decomposition, and Density Estimation algorithms for generating user profile 
> models. The model reads data from HDFS audit logs, preprocesses and 
> aggregates data, and generates models using Spark programming APIs. Once 
> models are generated, Eagle uses stream processing engine for near real-time 
> anomaly detection to determine if any user’s activities are suspicious or not.
>
> Eagle Services:
> Query Service: Eagle provides SQL-like service API to support comprehensive 
> computation for huge set of data on the fly, for e.g. comprehensive 
> filtering, aggregation, histogram, sorting, top, arithmetical expression, 
> pagination etc. HBase is the data storage which Eagle supports as first-class 
> citizen, relational database is supported as well. For HBase storage, Eagle 
> query framework compiles user provided SQL-like query into HBase native 
> filter objects and execute it through HBase coprocessor on the fly.
> Policy Manager: Eagle policy manager provides UI and Restful API for user to 
> define policy with just a few clicks. It includes site management UI, policy 
> editor, sensitivity metadata import, HDFS or Hive sensitive resource 
> browsing, alert dashboards etc.
> Background
> Data is one of the most important assets for today’s businesses, which makes 
> data security one of the top priorities of today’s enterprises. Hadoop is 
> widely used across different verticals as a big data repository to store this 
> data in most modern enterprises.
> At eBay we use hadoop platform extensively for our data processing needs. Our 
> data in Hadoop is becoming bigger and bigger as our user base is seeing an 
> exponential growth. Today there are variety of data sets available in Hadoop 
> cluster for our users to consume. eBay has around 120 PB of data stored in 
> HDFS across 6 different clusters and around 1800+ active hadoop users 
> consuming data thru Hive, HBase and mapreduce jobs everyday to build 
> applications using this data. With this astronomical growth of data there are 
> also challenges in securing sensitive data and monitoring the access to this 
> sensitive data. Today in large organizations HDFS is the defacto standard for 
> storing big data. Data sets which includes and not limited to consumer 
> sentiment, social media data, customer segmentation, web clicks, sensor data, 
> geo-location and transaction data get stored in Hadoop for day to day 
> business needs.
> We at eBay want to make sure the sensitive data and data platforms are 
> completely protected from security breaches. So we partnered very closely 
> with our Information Security team to understand the requirements for Eagle 
> to monitor sensitive data access on hadoop:
> 1.Ability to identify and stop security threats in real time
> 2.Scale for big data (Support PB scale and Billions of events)
> 3.Ability to create data access policies
> 4.Support multiple data sources like HDFS, HBase, Hive
> 5.Visualize alerts in real time
> 6.Ability to block malicious access in real time
> We did not find any data access monitoring solution that available today and 
> can provide the features and functionality that we need to monitor the data 
> access in the hadoop ecosystem at our scale. Hence with an excellent team of 
> world class developers and several users, we have been able to bring Eagle 
> into production as well as open source it.
>
> Rationale
> In today’s world; data is an important asset for any company. Businesses are 
> using data extensively to create amazing experiences for users. Data has to 
> be protected and access to data should be secured from security breaches. 
> Today Hadoop is not only used to store logs but also stores financial data, 
> sensitive data sets, geographical data, user click stream data sets etc. 
> which makes it more important to be protected from security breaches. To 
> secure a data platform there are multiple things that need to happen. One is 
> having a strong access control mechanism which today is provided by Apache 
> Ranger and Apache Sentry. These tools provide the ability to provide fine 
> grain access control mechanism to data sets on hadoop. But there is a big gap 
> in terms of monitoring all the data access events and activities in order to 
> securing the hadoop data platform. Together with strong access control, 
> perimeter security and data access monitoring in place data in the hadoop 
> clusters can be secured against breaches. We looked around and found 
> following:
> Existing data activity monitoring products are designed for traditional 
> databases and data warehouse. Existing monitoring platforms cannot scale out 
> to support fast growing data and petabyte scale. Few products in the industry 
> are still very early in terms of supporting HDFS, Hive, HBase data access 
> monitoring.
> As mentioned in the background, the business requirement and urgency to 
> secure the data from users with malicious intent drove eBay to invest in 
> building a real time data access monitoring solution from scratch to offer 
> real time alerts and remediation features for malicious data access.
> With the power of open source distributed systems like Hadoop, Kafka and much 
> more we were able to develop a data activity monitoring system that can 
> scale, identify and stop malicious access in real time.
> Eagle allows admins to create standard access policies and rules for 
> monitoring HDFS, Hive and HBase data. Eagle also provides out of box machine 
> learning models for modeling user profiles based on user access behaviour and 
> use the model to alert on anomalies.
>
> Current Status
>
> Meritocracy
> Eagle has been deployed in production at eBay for monitoring billions of 
> events per day from HDFS and Hive operations. From the start; the product has 
> been built with focus on high scalability and application extensibility in 
> mind and Eagle has demonstrated great performance in responding to suspicious 
> events instantly and great flexibility in defining policy.
>
> Community
> Eagle seeks to develop the developer and user communities during incubation.
>
> Core Developers
> Eagle is currently being designed and developed by engineers from eBay Inc. – 
> Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, 
> Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of these core 
> developers have deep expertise in developing monitoring products for the 
> Hadoop ecosystem.
>
> Alignment
> The ASF is a natural host for Eagle given that it is already the home of 
> Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data 
> projects. Eagle leverages lot of Apache open-source products. Eagle was 
> designed to offer real time insights into sensitive data access by actively 
> monitoring the data access on various data sets in hadoop and an extensible 
> alerting framework with a powerful policy engine. Eagle compliments the 
> existing Hadoop platform area by providing a comprehensive monitoring and 
> alerting solution for detecting sensitive data access threats based on preset 
> policies and machine learning models for user behaviour analysis.
>
> Known Risks
>
> Orphaned Products
> The core developers of Eagle team work full time on this project. There is no 
> risk of Eagle getting orphaned since eBay is extensively using it in their 
> production Hadoop clusters and have plans to go beyond hadoop. For example, 
> currently there are 7 hadoop clusters and 2 of them are being monitored using 
> Hadoop Eagle in production. We have plans to extend it to all hadoop clusters 
> and eventually other data platforms. There are 10’s of policies onboarded and 
> actively monitored with plans to onboard more use case. We are very confident 
> that every hadoop cluster in the world will be monitored using Eagle for 
> securing the hadoop ecosystem by actively monitoring for data access on 
> sensitive data. We plan to extend and diversify this community further 
> through Apache. We presented Eagle at the hadoop summit in china and garnered 
> interest from different companies who use hadoop extensively.
>
> Inexperience with Open Source
> The core developers are all active users and followers of open source. They 
> are already committers and contributors to the Eagle Github project. All have 
> been involved with the source code that has been released under an open 
> source license, and several of them also have experience developing code in 
> an open source environment. Though the core set of Developers do not have 
> Apache Open Source experience, there are plans to onboard individuals with 
> Apache open source experience on to the project. Apache Kylin PMC members are 
> also in the same ebay organization. We work very closely with Apache Ranger 
> committers and are looking forward to find meaningful integrations to improve 
> the security of hadoop platform.
>
> Homogenous Developers
> The core developers are from eBay. Today the problem of monitoring data 
> activities to find and stop threats is a universal problem faced by all the 
> businesses. Apache Incubation process encourages an open and diverse 
> meritocratic community. Eagle intends to make every possible effort to build 
> a diverse, vibrant and involved community and has already received 
> substantial interest from various organizations.
>
> Reliance on Salaried Developers
> eBay invested in Eagle as the monitoring solution for Hadoop clusters and 
> some of its key engineers are working full time on the project. In addition, 
> since there is a growing need for securing sensitive data access we need a 
> data activity monitoring solution for Hadoop, we look forward to other Apache 
> developers and researchers to contribute to the project. Additional 
> contributors, including Apache committers have plans to join this effort 
> shortly. Also key to addressing the risk associated with relying on Salaried 
> developers from a single entity is to increase the diversity of the 
> contributors and actively lobby for Domain experts in the security space to 
> contribute. Eagle intends to do this.
>
> Relationships with Other Apache Products
> Eagle has a strong relationship and dependency with Apache Hadoop, HBase, 
> Spark, Kafka and Storm. Being part of Apache’s Incubation community, could 
> help with a closer collaboration among these projects and as well as others. 
> An Excessive Fascination with the Apache Brand Eagle is proposing to enter 
> incubation at Apache in order to help efforts to diversify the 
> committer-base, not so much to capitalize on the Apache brand. The Eagle 
> project is in production use already inside eBay, but is not expected to be 
> an eBay product for external customers. As such, the Eagle project is not 
> seeking to use the Apache brand as a marketing tool.
>
> Documentation
> Information about Eagle can be found at https://github.com/eBay/Eagle. The 
> following link provide more information about Eagle http://goeagle.io.
>
> Initial Source
> Eagle has been under development since 2014 by a team of engineers at eBay 
> Inc. It is currently hosted on Github.com under an Apache license 2.0 at 
> https://github.com/eBay/Eagle. Once in incubation we will be moving the code 
> base to apache git library.
>
> External Dependencies
> Eagle has the following external dependencies.
> Basic
> •JDK 1.7+
> •Scala 2.10.4
> •Apache Maven
> •JUnit
> •Log4j
> •Slf4j
> •Apache Commons
> •Apache Commons Math3
> •Jackson
> •Siddhi CEP engine
>
> Hadoop
> •Apache Hadoop
> •Apache HBase
> •Apache Hive
> •Apache Zookeeper
> •Apache Curator
>
> Apache Spark
> •Spark Core Library
>
> REST Service
> •Jersey
>
> Query
> •Antlr
>
> Stream processing
> •Apache Storm
> •Apache Kafka
>
> Web
> •AngularJS
> •jQuery
> •Bootstrap V3
> •Moment JS
> •Admin LTE
> •html5shiv
> •respond
> •Fastclick
> •Date Range Picker
> •Flot JS
>
> Cryptography
> Eagle will eventually support encryption on the wire. This is not one of the 
> initial goals, and we do not expect Eagle to be a controlled export item due 
> to the use of encryption. Eagle supports but does not require the Kerberos 
> authentication mechanism to access secured Hadoop services.
>
> Required Resources
>
> Mailing List
> •eagle-private for private PMC discussions
> •eagle-dev for developers
> •eagle-commits for all commits
> •eagle-users for all eagle users
>
> Subversion Directory
> •Git is the preferred source control system.
>
> Issue Tracking
> •JIRA Eagle (Eagle)
>
> Other Resources
> The existing code already has unit tests so we will make use of existing 
> Apache continuous testing infrastructure. The resulting load should not be 
> very large.
>
> Initial Committers
> •Seshu Adunuthula <sadunuthula at ebay dot com>
> •Arun Manoharan <armanoharan at ebay dot com>
> •Edward Zhang <yonzhang at ebay dot com>
> •Hao Chen <hchen9 at ebay dot com>
> •Chaitali Gupta <cgupta at ebay dot com>
> •Libin Sun <libsun at ebay dot com>
> •Jilin Jiang <jiljiang at ebay dot com>
> •Qingwen Zhao <qingwzhao at ebay dot com>
> •Hemanth Dendukuri <hdendukuri at ebay dot com>
> •Senthil Kumar <senthilkumar at ebay dot com>
> •Tan Chen <tanchen at ebay dot com>
>
> Affiliations
> The initial committers are employees of eBay Inc.
>
> Sponsors
>
> Champion
> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>
> Nominated Mentors
> •Owen O’Malley < omalley at apache dot org > - Apache IPMC member, Hortonworks
> •Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> •Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
>
> Sponsoring Entity
> We are requesting the Incubator to sponsor this project.
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Reply via email to