Hi Bill,

Sorry for the slow response. I am in the Swedish High Coast <http://en.wikipedia.org/wiki/The_High_Coast> this week with a weak connection. This is the proposal I had for GSoC.

*Proposed idea:*
I would like to propose an idea based on Mark's (@MarkCC) ideas described in AURORA-256 <https://issues.apache.org/jira/browse/AURORA-256> and AURORA-257 <https://issues.apache.org/jira/browse/AURORA-257>, with a few additions.

1- First, we identify the requirements for Aurora's logging system. We study existing solutions such as Flume <http://flume.apache.org/>, Scribe <https://github.com/facebook/scribe/wiki>, Chukwa <https://chukwa.apache.org/> and Kafka <http://kafka.apache.org/>. The results shall be put in a report similar to the Wikimedia Foundation's report on their choice of a logging solution <https://wikitech.wikimedia.org/wiki/Analytics/Kraken/Logging_Solutions_Recommendation>. The report should be ready for submission by the end of the bonding period. We will also identify any functionality missing from the chosen system.

2- We start implementing any missing functionality and integrating the chosen system with Aurora. Any functionality added to the logging system shall be pushed to the respective open-source project.

3- Once we have the logging system in place, we will design and build the analytics module. This system will support both simple queries, such as the example given by Mark, "Show all of the update commands that resulted in a rollback between 12:00 and 2pm.", and more complex ones like "Show the correlation between failures and number of jobs", "Detect anomalies in the logged data for the past 10 days" or "What is the distribution of job execution times". The analytics tool(s) will be written mainly in Python and R, bound together by RPy2, while leveraging the power of MapReduce when needed. The tool will be modular to allow for future extensions and updates. The analysis reports will be in both textual and visual formats, e.g., histograms, box plots, CDFs and so on, to help Aurora users and cluster managers make informed decisions.

Best,
--Ahmed

On Sat, May 24, 2014 at 7:56 AM, Bill Farner <wfar...@apache.org> wrote:

> Welcome, Ahmed! Cool stuff!
>
> Is there a design doc or mission statement that you and Mark are working
> off?
>
> -=Bill
>
>
> On Wed, May 21, 2014 at 11:15 PM, Ahmed Aley <ahm...@cs.umu.se> wrote:
>
> > Hi,
> >
> > I am Ahmed Ali-Eldin <https://www8.cs.umu.se/~ahmeda/>, a PhD student at
> > Umeå University, Sweden (It is up north
> > <http://tools.wmflabs.org/geohack/geohack.php?pagename=Ume%C3%A5&params=63_49_30_N_20_15_50_E_type:city%2879594%29_region:SE>
> > :) ). I am working with @MarkCC on integrating a distributed logging
> > framework with Aurora and building an analytics framework on top to analyze
> > the logged data.
> >
> > We started off by looking into different logging frameworks
> > (Kafka <http://kafka.apache.org/>,
> > Scribe <https://github.com/facebook/scribe>,
> > Chukwa <https://chukwa.apache.org/>,
> > Suro <http://techblog.netflix.com/2013/12/announcing-suro-backbone-of-netflixs.html>,
> > Calligraphus <http://www-conf.slac.stanford.edu/xldb2011/talks/xldb2011_tue_0940_facebookrealtimeanalytics.pdf>
> > and Flume <http://flume.apache.org/>). We chose Suro coupled with Kafka out of
> > these for different reasons.
> > i- It has been built to allow scale-up and down (elastic).
> > ii- It is quite flexible with a Kafka sink giving us access to all Kafka
> > sinks.
> > iii- It has an S3 sink making it a suitable solution for more scenarios.
> > iv- I got a tip from someone I know at Netflix on Suro benchmarking
> > results.
> > v- it is an active project
> >
> > Based on the above, I have started some experiments with Suro and will be
> > looking at its integration with Aurora this weekend. I can not make any
> > statements on if Suro (coupled with kafka) is "the best" solution for
> > distributed logging but it looks very promising till now. I will hopefully
> > send some results/updates late next week.
> >
> > Best,
> > --Ahmed
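
P.S. To make the simple query from step 3 of the proposal concrete ("Show all of the update commands that resulted in a rollback between 12:00 and 2pm."), here is a minimal Python sketch. The record fields (time, command, job, outcome) and the sample data are assumptions made up for illustration only; they are not Aurora's actual log schema, and a real analytics module would read from the logging system rather than an in-memory list.

```python
from datetime import datetime

# Hypothetical log records. The field names and values are illustrative
# assumptions, not the format Aurora or any logging backend emits.
LOG = [
    {"time": datetime(2014, 5, 24, 11, 50), "command": "update", "job": "web", "outcome": "ok"},
    {"time": datetime(2014, 5, 24, 12, 30), "command": "update", "job": "db", "outcome": "rollback"},
    {"time": datetime(2014, 5, 24, 13, 10), "command": "kill", "job": "web", "outcome": "ok"},
    {"time": datetime(2014, 5, 24, 13, 45), "command": "update", "job": "cache", "outcome": "rollback"},
    {"time": datetime(2014, 5, 24, 14, 20), "command": "update", "job": "web", "outcome": "rollback"},
]


def rolled_back_updates(records, start, end):
    """Return update commands that ended in a rollback within [start, end]."""
    return [
        r for r in records
        if r["command"] == "update"
        and r["outcome"] == "rollback"
        and start <= r["time"] <= end
    ]


# The time window from the example query: 12:00 to 2pm.
window_start = datetime(2014, 5, 24, 12, 0)
window_end = datetime(2014, 5, 24, 14, 0)

for record in rolled_back_updates(LOG, window_start, window_end):
    print(record["time"].isoformat(), record["job"])
```

The more complex queries (correlations, anomaly detection, distributions) would need the statistical tooling mentioned in the proposal, so they are not sketched here.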