+1

On Fri, Oct 23, 2015 at 12:26 PM, Chris Nauroth <cnaur...@hortonworks.com> wrote:
> +1 (binding)
>
> --Chris Nauroth
>
> On 10/23/15, 7:11 AM, "Manoharan, Arun" <armanoha...@ebay.com> wrote:
>
> >Hello Everyone,
> >
> >Thanks for all the feedback on the Eagle Proposal.
> >
> >I would like to call for a [VOTE] on Eagle joining the ASF as an incubation project.
> >
> >The vote is open for 72 hours:
> >
> >[ ] +1 accept Eagle in the Incubator
> >[ ] ±0
> >[ ] -1 (please give reason)
> >
> >Eagle is a monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks and malicious activities, and take action in real time. Eagle supports a wide variety of policies on HDFS data and Hive. Eagle also provides machine learning models for detecting anomalous user behavior in Hadoop.
> >
> >The proposal is available on the wiki here:
> >https://wiki.apache.org/incubator/EagleProposal
> >
> >The text of the proposal is also available at the end of this email.
> >
> >Thanks for your time and help.
> >
> >Thanks,
> >Arun
> >
> ><COPY of the proposal in text format>
> >
> >Eagle
> >
> >Abstract
> >Eagle is an open source monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks and malicious activities in Hadoop, and take action.
> >
> >Proposal
> >Eagle audits access to HDFS files, Hive and HBase tables in real time, enforces policies defined on sensitive data access, and alerts on or blocks a user's access to that sensitive data in real time. Eagle also creates user profiles based on typical access behaviour for HDFS and Hive and sends alerts when anomalous behaviour is detected. Eagle can also import sensitive data information classified by external classification engines to help define its policies.
> >
> >Overview of Eagle
> >Eagle has 3 main parts:
> >1. Data collection and storage - Eagle collects data from various Hadoop logs in real time using Kafka and the YARN API, and uses HDFS and HBase for storage.
> >2. Data processing and policy engine - Eagle allows users to create policies based on various metadata properties of HDFS, Hive and HBase data.
> >3. Eagle services - Eagle services include the policy manager, the query service and the visualization component. Eagle provides an intuitive user interface to administer Eagle and an alert dashboard to respond to real-time alerts.
> >
> >Data Collection and Storage:
> >Eagle provides a programming API for extending Eagle to integrate any data source into the Eagle policy evaluation framework. For example, Eagle HDFS audit monitoring collects data from Kafka, which is populated by a namenode log4j appender or by a logstash agent. Eagle Hive monitoring collects Hive query logs from running jobs through the YARN API in a scalable and fault-tolerant way. Eagle uses HBase to store metadata and metrics data, and also supports relational databases through a configuration change.
> >
> >Data Processing and Policy Engine:
> >Processing Engine: Eagle provides a stream processing API which is an abstraction over Apache Storm; it can also be extended to other streaming engines. This abstraction allows developers to assemble data transformations, filtering, external data joins etc. without being physically bound to a specific streaming platform. The Eagle streaming API allows developers to easily integrate business logic with the Eagle policy engine, and internally the Eagle framework compiles the business logic execution DAG into program primitives of the underlying stream infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring transforms audit log entries from the namenode into objects and joins them with sensitivity metadata and security zone metadata, which are generated by external programs or configured by the user. Eagle Hive monitoring filters running jobs to get the Hive query string, parses the query string into an object, and then joins it with sensitivity metadata.
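As a rough illustration of what such an execution DAG compiles down to, the sketch below wires a transform step and an external-data join step directly against the Storm 0.9/0.10 Java API. Every class, stream and field name here (FakeAuditLogSpout, ParseAuditLogBolt, JoinSensitivityBolt, the tab-separated log layout) is a hypothetical stand-in, not Eagle's actual code; a real deployment would consume the namenode audit stream from Kafka rather than the fake spout used to keep the example self-contained.

// Sketch only: the kind of Storm primitives an Eagle-style DAG targets.
import java.util.Map;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class AuditMonitorTopologySketch {

  // Stands in for the Kafka spout fed by the namenode log4j appender.
  public static class FakeAuditLogSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
      this.collector = collector;
    }

    public void nextTuple() {
      Utils.sleep(1000);
      // user <TAB> resource <TAB> command -- hypothetical audit-log layout
      collector.emit(new Values("alice\t/data/pii/users.db\topen"));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("line"));
    }
  }

  // Transformation step: parse the raw line into structured fields.
  public static class ParseAuditLogBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      String[] parts = input.getStringByField("line").split("\t");
      collector.emit(new Values(parts[0], parts[1], parts[2]));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("user", "resource", "cmd"));
    }
  }

  // External-data join step: tag each event with sensitivity metadata.
  public static class JoinSensitivityBolt extends BaseBasicBolt {
    public void execute(Tuple input, BasicOutputCollector collector) {
      String resource = input.getStringByField("resource");
      boolean sensitive = resource.startsWith("/data/pii/"); // stand-in metadata lookup
      collector.emit(new Values(input.getStringByField("user"), resource,
          input.getStringByField("cmd"), sensitive));
    }

    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("user", "resource", "cmd", "sensitive"));
    }
  }

  public static void main(String[] args) {
    // Assemble the DAG: spout -> parse -> join, partitioned by resource.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("audit-log", new FakeAuditLogSpout(), 1);
    builder.setBolt("parse", new ParseAuditLogBolt(), 2).shuffleGrouping("audit-log");
    builder.setBolt("join-sensitivity", new JoinSensitivityBolt(), 2)
        .fieldsGrouping("parse", new Fields("resource"));
    new LocalCluster().submitTopology("eagle-sketch", new Config(), builder.createTopology());
  }
}

In Eagle's design as described above, this wiring is what the streaming API abstracts away, so the same logical DAG could in principle target a different streaming engine.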
> >Alerting Framework: The Eagle alert framework includes a stream metadata API, a scalable policy engine framework and an extensible policy engine framework. The stream metadata API allows developers to declare an event schema, including which attributes constitute an event, the type of each attribute, and how to dynamically resolve attribute values at runtime when a user configures a policy. The scalable policy engine framework allows policies to be executed on different physical nodes in parallel; it can also be used to define your own policy partitioner class. The policy engine framework, together with the stream partitioning capability provided by all streaming platforms, ensures that policies and events can be evaluated in a fully distributed way. The extensible policy engine framework allows a developer to plug in a new policy engine with a few lines of code. The WSO2 Siddhi CEP engine is the policy engine that Eagle supports as a first-class citizen.
> >Machine Learning module: Eagle provides capabilities to define user activity patterns or user profiles for Hadoop users based on their behaviour on the platform. These user profiles are modeled using machine learning algorithms and used to detect anomalous user activities. Eagle uses Eigenvalue Decomposition and Density Estimation algorithms for generating user profile models. The model reads data from HDFS audit logs, preprocesses and aggregates the data, and generates models using the Spark programming APIs. Once models are generated, Eagle uses the stream processing engine for near real-time anomaly detection to determine whether any user's activities are suspicious.
> >
> >Eagle Services:
> >Query Service: Eagle provides a SQL-like service API to support comprehensive computation over huge data sets on the fly, e.g. comprehensive filtering, aggregation, histograms, sorting, top-N, arithmetical expressions, pagination etc. HBase is the data store that Eagle supports as a first-class citizen; relational databases are supported as well. For HBase storage, the Eagle query framework compiles a user-provided SQL-like query into native HBase filter objects and executes it through an HBase coprocessor on the fly.
> >Policy Manager: The Eagle policy manager provides a UI and RESTful API for users to define policies with just a few clicks. It includes a site management UI, a policy editor, sensitivity metadata import, HDFS and Hive sensitive resource browsing, alert dashboards etc.
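To make the policy engine description a bit more concrete: the proposal names WSO2 Siddhi as the first-class policy engine, and a single "user opened a PII-tagged HDFS resource" style policy could be evaluated along the lines of the sketch below, assuming the Siddhi 3.x Java API (SiddhiManager / ExecutionPlanRuntime). The stream definition, attribute names and policy text are hypothetical examples, not Eagle's actual stream metadata or policies.

// Sketch only: one hypothetical policy evaluated with the Siddhi 3.x API.
import org.wso2.siddhi.core.ExecutionPlanRuntime;
import org.wso2.siddhi.core.SiddhiManager;
import org.wso2.siddhi.core.event.Event;
import org.wso2.siddhi.core.query.output.callback.QueryCallback;
import org.wso2.siddhi.core.stream.input.InputHandler;

public class SensitiveAccessPolicySketch {
  public static void main(String[] args) throws InterruptedException {
    // Hypothetical event schema and policy; in Eagle these would come from
    // the stream metadata API and the policy manager respectively.
    String executionPlan =
        "define stream hdfsAuditStream (user string, resource string, cmd string, sensitivityType string); "
        + "@info(name = 'sensitiveAccessPolicy') "
        + "from hdfsAuditStream[sensitivityType == 'PII' and cmd == 'open'] "
        + "select user, resource, cmd insert into alertStream;";

    SiddhiManager siddhiManager = new SiddhiManager();
    ExecutionPlanRuntime runtime = siddhiManager.createExecutionPlanRuntime(executionPlan);

    // Matching events would be handed to Eagle's alerting/notification layer.
    runtime.addCallback("sensitiveAccessPolicy", new QueryCallback() {
      @Override
      public void receive(long timestamp, Event[] inEvents, Event[] removeEvents) {
        for (Event e : inEvents) {
          System.out.println("ALERT: " + java.util.Arrays.toString(e.getData()));
        }
      }
    });

    InputHandler input = runtime.getInputHandler("hdfsAuditStream");
    runtime.start();
    input.send(new Object[]{"alice", "/data/pii/users.db", "open", "PII"});
    Thread.sleep(500); // give the asynchronous callback a moment before shutting down
    runtime.shutdown();
  }
}

The extensible policy engine framework described above is what would allow a different engine to be plugged in behind the same kind of policy definition.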
> >
> >Background
> >Data is one of the most important assets for today's businesses, which makes data security one of the top priorities for today's enterprises. Hadoop is widely used across different verticals as a big data repository to store this data in most modern enterprises.
> >At eBay we use the Hadoop platform extensively for our data processing needs. Our data in Hadoop keeps getting bigger as our user base sees exponential growth, and today there is a wide variety of data sets available in the Hadoop clusters for our users to consume. eBay has around 120 PB of data stored in HDFS across 6 different clusters and around 1800+ active Hadoop users consuming data through Hive, HBase and MapReduce jobs every day to build applications using this data. With this astronomical growth of data there are also challenges in securing sensitive data and monitoring access to it. Today, in large organizations, HDFS is the de facto standard for storing big data. Data sets including, but not limited to, consumer sentiment, social media data, customer segmentation, web clicks, sensor data, geo-location and transaction data get stored in Hadoop for day-to-day business needs.
> >We at eBay want to make sure the sensitive data and data platforms are completely protected from security breaches, so we partnered very closely with our Information Security team to understand the requirements for Eagle to monitor sensitive data access on Hadoop:
> >1. Ability to identify and stop security threats in real time
> >2. Scale for big data (support PB scale and billions of events)
> >3. Ability to create data access policies
> >4. Support multiple data sources like HDFS, HBase, Hive
> >5. Visualize alerts in real time
> >6. Ability to block malicious access in real time
> >We did not find any data access monitoring solution available today that can provide the features and functionality we need to monitor data access in the Hadoop ecosystem at our scale. Hence, with an excellent team of world-class developers and several users, we have been able to bring Eagle into production as well as open source it.
> >
> >Rationale
> >In today's world, data is an important asset for any company. Businesses are using data extensively to create amazing experiences for users. Data has to be protected and access to data should be secured against security breaches. Today Hadoop is used not only to store logs but also financial data, sensitive data sets, geographical data, user click stream data sets etc., which makes it all the more important to protect it from security breaches. To secure a data platform, multiple things need to happen. One is having a strong access control mechanism, which today is provided by Apache Ranger and Apache Sentry; these tools provide fine-grained access control to data sets on Hadoop. But there is a big gap in terms of monitoring all the data access events and activities in order to secure the Hadoop data platform. With strong access control, perimeter security and data access monitoring in place together, data in Hadoop clusters can be secured against breaches. We looked around and found the following: existing data activity monitoring products are designed for traditional databases and data warehouses; existing monitoring platforms cannot scale out to support fast-growing data at petabyte scale; and a few products in the industry are still very early in terms of supporting HDFS, Hive and HBase data access monitoring.
> >As mentioned in the background, the business requirement and urgency to secure the data from users with malicious intent drove eBay to invest in building a real-time data access monitoring solution from scratch to offer real-time alerts and remediation features for malicious data access.
> >With the power of open source distributed systems like Hadoop, Kafka and many more, we were able to develop a data activity monitoring system that can scale and that can identify and stop malicious access in real time. Eagle allows admins to create standard access policies and rules for monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box machine learning models for modeling user profiles based on user access behaviour, and uses these models to alert on anomalies.
> >
> >Current Status
> >
> >Meritocracy
> >Eagle has been deployed in production at eBay for monitoring billions of events per day from HDFS and Hive operations. From the start, the product has been built with high scalability and application extensibility in mind, and Eagle has demonstrated great performance in responding to suspicious events instantly and great flexibility in defining policies.
> >
> >Community
> >Eagle seeks to develop the developer and user communities during incubation.
> >
> >Core Developers
> >Eagle is currently being designed and developed by engineers from eBay Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri and Arun Manoharan. All of these core developers have deep expertise in developing monitoring products for the Hadoop ecosystem.
> >
> >Alignment
> >The ASF is a natural host for Eagle given that it is already the home of Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects. Eagle leverages a lot of Apache open-source products. Eagle was designed to offer real-time insights into sensitive data access by actively monitoring data access across various data sets in Hadoop, with an extensible alerting framework and a powerful policy engine. Eagle complements the existing Hadoop platform area by providing a comprehensive monitoring and alerting solution for detecting sensitive data access threats based on preset policies and on machine learning models for user behaviour analysis.
> >
> >Known Risks
> >
> >Orphaned Products
> >The core developers of the Eagle team work full time on this project. There is no risk of Eagle becoming orphaned, since eBay is using it extensively in its production Hadoop clusters and has plans to go beyond Hadoop. For example, there are currently 7 Hadoop clusters and 2 of them are being monitored by Eagle in production. We have plans to extend it to all Hadoop clusters and eventually to other data platforms. There are tens of policies onboarded and actively monitored, with plans to onboard more use cases. We are very confident that every Hadoop cluster in the world will be monitored using Eagle, securing the Hadoop ecosystem by actively monitoring access to sensitive data. We plan to extend and diversify this community further through Apache. We presented Eagle at the Hadoop Summit in China and garnered interest from different companies who use Hadoop extensively.
> >
> >Inexperience with Open Source
> >The core developers are all active users and followers of open source. They are already committers and contributors to the Eagle GitHub project. All have been involved with source code that has been released under an open source license, and several of them also have experience developing code in an open source environment.
> >Though the core set of developers does not have Apache open source experience, there are plans to onboard individuals with Apache open source experience onto the project. Apache Kylin PMC members are also in the same eBay organization. We work very closely with Apache Ranger committers and look forward to finding meaningful integrations to improve the security of the Hadoop platform.
> >
> >Homogenous Developers
> >The core developers are from eBay. Today the problem of monitoring data activities to find and stop threats is a universal problem faced by all businesses. The Apache incubation process encourages an open and diverse meritocratic community. Eagle intends to make every possible effort to build a diverse, vibrant and involved community and has already received substantial interest from various organizations.
> >
> >Reliance on Salaried Developers
> >eBay invested in Eagle as the monitoring solution for its Hadoop clusters, and some of its key engineers are working full time on the project. In addition, since the growing need to secure sensitive data access calls for a data activity monitoring solution for Hadoop, we look forward to other Apache developers and researchers contributing to the project. Additional contributors, including Apache committers, have plans to join this effort shortly. Also key to addressing the risk associated with relying on salaried developers from a single entity is increasing the diversity of the contributors and actively lobbying for domain experts in the security space to contribute. Eagle intends to do this.
> >
> >Relationships with Other Apache Products
> >Eagle has a strong relationship with, and dependency on, Apache Hadoop, HBase, Spark, Kafka and Storm. Being part of Apache's incubation community could help with closer collaboration among these projects as well as others.
> >
> >An Excessive Fascination with the Apache Brand
> >Eagle is proposing to enter incubation at Apache in order to help efforts to diversify the committer base, not so much to capitalize on the Apache brand. The Eagle project is already in production use inside eBay, but it is not expected to be an eBay product for external customers. As such, the Eagle project is not seeking to use the Apache brand as a marketing tool.
> >
> >Documentation
> >Information about Eagle can be found at https://github.com/eBay/Eagle. The following link provides more information about Eagle: http://goeagle.io
> >
> >Initial Source
> >Eagle has been under development since 2014 by a team of engineers at eBay Inc. It is currently hosted on GitHub under the Apache License 2.0 at https://github.com/eBay/Eagle. Once in incubation we will be moving the code base to the Apache git repository.
> >
> >External Dependencies
> >Eagle has the following external dependencies.
> >
> >Basic
> >* JDK 1.7+
> >* Scala 2.10.4
> >* Apache Maven
> >* JUnit
> >* Log4j
> >* Slf4j
> >* Apache Commons
> >* Apache Commons Math3
> >* Jackson
> >* Siddhi CEP engine
> >
> >Hadoop
> >* Apache Hadoop
> >* Apache HBase
> >* Apache Hive
> >* Apache ZooKeeper
> >* Apache Curator
> >
> >Apache Spark
> >* Spark Core Library
> >
> >REST Service
> >* Jersey
> >
> >Query
> >* Antlr
> >
> >Stream processing
> >* Apache Storm
> >* Apache Kafka
> >
> >Web
> >* AngularJS
> >* jQuery
> >* Bootstrap V3
> >* Moment JS
> >* Admin LTE
> >* html5shiv
> >* respond
> >* Fastclick
> >* Date Range Picker
> >* Flot JS
> >
> >Cryptography
> >Eagle will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Eagle to be a controlled export item due to the use of encryption. Eagle supports but does not require the Kerberos authentication mechanism to access secured Hadoop services.
> >
> >Required Resources
> >
> >Mailing Lists
> >* eagle-private for private PMC discussions
> >* eagle-dev for developers
> >* eagle-commits for all commits
> >* eagle-users for all Eagle users
> >
> >Subversion Directory
> >* Git is the preferred source control system.
> >
> >Issue Tracking
> >* JIRA Eagle (Eagle)
> >
> >Other Resources
> >The existing code already has unit tests, so we will make use of the existing Apache continuous testing infrastructure. The resulting load should not be very large.
> >
> >Initial Committers
> >* Seshu Adunuthula <sadunuthula at ebay dot com>
> >* Arun Manoharan <armanoharan at ebay dot com>
> >* Edward Zhang <yonzhang at ebay dot com>
> >* Hao Chen <hchen9 at ebay dot com>
> >* Chaitali Gupta <cgupta at ebay dot com>
> >* Libin Sun <libsun at ebay dot com>
> >* Jilin Jiang <jiljiang at ebay dot com>
> >* Qingwen Zhao <qingwzhao at ebay dot com>
> >* Hemanth Dendukuri <hdendukuri at ebay dot com>
> >* Senthil Kumar <senthilkumar at ebay dot com>
> >
> >Affiliations
> >The initial committers are employees of eBay Inc.
> >
> >Sponsors
> >
> >Champion
> >* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> >
> >Nominated Mentors
> >* Owen O'Malley <omalley at apache dot org> - Apache IPMC member, Hortonworks
> >* Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> >* Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
> >* Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC member
> >* Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member, Hortonworks
> >
> >Sponsoring Entity
> >We are requesting the Incubator to sponsor this project.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org