Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Ioannis Canellos
Great!

Here is my +1

I have a comment to add. ActiveMQ does provide means for persisting the
messages, so you might want to clarify or rephrase that.

I would like to participate in this effort, so I've added myself to the
Initial Committers list.

-- 
*Ioannis Canellos*
http://iocanel.blogspot.com

Apache Karaf Committer & PMC
Apache ServiceMix Committer


Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Jun Rao
Thanks for the comments, Ioannis. Will refine the proposal. Welcome aboard!

Jun

On Fri, Jun 24, 2011 at 1:00 AM, Ioannis Canellos  wrote:

> Great!
>
> Here is my +1
>
> I have a comment to add. ActiveMQ does provide means for persisting the
> messages, so you might want to clarify or rephrase that.
>
> I would like to participate in this effort, so I've added myself to the
> Initial Committers list.
>
> --
> *Ioannis Canellos*
> http://iocanel.blogspot.com
>
> Apache Karaf Committer & PMC
> Apache ServiceMix Committer
>


Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Ioannis Canellos
It seems that Kafka overlaps with ActiveMQ more than I initially estimated,
which certainly decreases my personal interest in Kafka. So I see fit to
remove myself from the list.

-- 
*Ioannis Canellos*
http://iocanel.blogspot.com

Apache Karaf Committer & PMC
Apache ServiceMix Committer


Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Shalin Shekhar Mangar
On Wed, Jun 22, 2011 at 9:47 PM, Jun Rao  wrote:
> Hi,
>
> I would like to propose Kafka to be an Apache Incubator project.  Kafka is a
> distributed, high throughput, publish-subscribe system for processing large
> amounts of streaming data.
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/KafkaProposal
>

+1

I evaluated Kafka for an internal project and came away impressed.
Great to see it moving to Apache!

-- 
Regards,
Shalin Shekhar Mangar.

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Phillip Rhodes
On Fri, Jun 24, 2011 at 8:51 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Wed, Jun 22, 2011 at 9:47 PM, Jun Rao  wrote:
> > Hi,
> >
> > I would like to propose Kafka to be an Apache Incubator project.  Kafka
> is a
> > distributed, high throughput, publish-subscribe system for processing
> large
> > amounts of streaming data.
> >
> > Here's a link to the proposal in the Incubator wiki
> > http://wiki.apache.org/incubator/KafkaProposal
> >
>

+1

Also, I'm willing to volunteer to help with this project. I've added my
name to the proposal wiki under "initial committers."  If anybody wants to
know more about who I am, there's probably an "intro" email from me already
in the Incubator email archives, or I can certainly post again if anybody cares.


Phillip Rhodes


Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Chris Burroughs
On 06/24/2011 04:00 AM, Ioannis Canellos wrote:
> I have a comment to add. ActiveMQ does provide means for persisting the
> messages, so you might want to clarify or rephrase that.

I'm sure Jun will clarify the proposal but for those interested in more
elaboration on this point check out the NetDB paper.

http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Jun Rao
Thanks Shalin, Phillip. Welcome aboard, Phillip.

Jun

On Fri, Jun 24, 2011 at 6:34 AM, Phillip Rhodes
wrote:

> On Fri, Jun 24, 2011 at 8:51 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
> > On Wed, Jun 22, 2011 at 9:47 PM, Jun Rao  wrote:
> > > Hi,
> > >
> > > I would like to propose Kafka to be an Apache Incubator project.  Kafka
> > is a
> > > distributed, high throughput, publish-subscribe system for processing
> > large
> > > amounts of streaming data.
> > >
> > > Here's a link to the proposal in the Incubator wiki
> > > http://wiki.apache.org/incubator/KafkaProposal
> > >
> >
>
> +1
>
> Also, I'm willing to volunteer to help with this project. I've added my
> name to the proposal wiki under "initial committers."  If anybody wants to
> know more about who I am, there's probably an "intro" email from me already
> in the Incubator email archives, or I can certainly post again if anybody cares.
>
>
> Phillip Rhodes
>


Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Jake Mannix
+1

Both from the perspective of incorporating it into streaming machine-learning
work in Mahout, and from the perspective of a persistent, scalable WAL
(*especially* once http://linkedin.jira.com/browse/KAFKA-23 gets finished up),
I'm very interested, and I know some more folks at Twitter who are interested
as well.

On Fri, Jun 24, 2011 at 5:51 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Wed, Jun 22, 2011 at 9:47 PM, Jun Rao  wrote:
> > Hi,
> >
> > I would like to propose Kafka to be an Apache Incubator project.  Kafka
> is a
> > distributed, high throughput, publish-subscribe system for processing
> large
> > amounts of streaming data.
> >
> > Here's a link to the proposal in the Incubator wiki
> > http://wiki.apache.org/incubator/KafkaProposal
> >
>
> +1
>
> I evaluated Kafka for an internal project and came away impressed.
> Great to see it moving to Apache!
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
> -
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>


Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Henry Saputra
+1

A very good proposal, and it seems to solve our need for a low-latency
event messaging system, so I'm looking forward to it.

I would love to contribute to the project and have added my name to the
list of initial committers if there are no objections.

- Henry

>> 2011/6/22 Jun Rao 
>>
>> > Hi,
>> >
>> > I would like to propose Kafka to be an Apache Incubator project.  Kafka
>> is
>> > a
>> > distributed, high throughput, publish-subscribe system for processing
>> large
>> > amounts of streaming data.
>> >
>> > Here's a link to the proposal in the Incubator wiki
>> > http://wiki.apache.org/incubator/KafkaProposal
>> >
>> > I've also pasted the initial contents below.
>> >
>> > Thanks,
>> >
>> > Jun
>> >
>> > == Abstract ==
>> > Kafka is a distributed publish-subscribe system for processing large
>> > amounts
>> > of streaming data.
>> >
>> > == Proposal ==
>> > Kafka provides an extremely high throughput distributed publish/subscribe
>> > messaging system.  Additionally, it supports relatively long term
>> > persistence of messages to support a wide variety of consumers,
>> > partitioning
>> > of the message stream across servers and consumers, and functionality for
>> > loading data into Apache Hadoop for offline, batch processing.
>> >
>> > == Background ==
>> > Kafka was developed at LinkedIn to process the large amounts of events
>> > generated by that company's website and provide a common repository for
>> > many types of consumers to access and process those events. Kafka has
>> > been used in production at LinkedIn scale to handle dozens of types of
>> > events including page views, searches and social network activity. Kafka
>> > clusters at LinkedIn currently process more than two billion events per day.
>> >
>> > Kafka fills the gap between messaging systems such as Apache ActiveMQ,
>> > which can provide high message volumes but lack persistence of those
>> > messages, and log processing systems such as Scribe and Flume, which do
>> > not provide adequate latency for our diverse set of consumers.  Kafka can
>> > also be inserted into traditional log-processing systems, acting as an
>> > intermediate step before further processing. Kafka focuses relentlessly on
>> > performance and throughput by not introspecting into message contents, nor
>> > indexing messages on the broker.  We also achieve high performance by
>> > depending on Java's sendFile/transferTo capabilities to minimize
>> > intermediate buffer copies and relying on the OS's pagecache to
>> > efficiently serve up message contents to consumers.
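
[Note: the sendFile/transferTo capability mentioned above is exposed in Java
as FileChannel.transferTo, which lets the kernel copy file bytes directly
from the pagecache to a socket instead of staging them through user-space
buffers. Below is a minimal illustrative sketch of that zero-copy pattern,
not Kafka's actual code; the file name, host and port are placeholders.]

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.channels.FileChannel;
    import java.nio.channels.SocketChannel;

    public class ZeroCopySend {
        public static void main(String[] args) throws IOException {
            FileChannel log = new FileInputStream("segment.log").getChannel();
            SocketChannel socket =
                    SocketChannel.open(new InetSocketAddress("consumer-host", 9092));
            long pos = 0;
            long size = log.size();
            // transferTo typically maps to sendfile(2) on Linux: the kernel
            // moves bytes from the pagecache to the socket without an
            // intermediate copy. It may send fewer bytes than requested,
            // so loop until everything has been written.
            while (pos < size) {
                pos += log.transferTo(pos, size - pos, socket);
            }
            socket.close();
            log.close();
        }
    }
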
>> >
>> > Kafka is written in Scala and depends on Apache ZooKeeper for
>> > coordination amongst its producers, brokers and consumers.
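
[Note: a minimal sketch of this coordination pattern using the plain
ZooKeeper Java client. A broker registers itself under an ephemeral znode,
whose automatic removal on session loss is what makes failure detection
cheap; the znode paths and addresses below are illustrative, not Kafka's
actual registry layout.]

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class BrokerRegistration {
        public static void main(String[] args) throws Exception {
            // Connect to a ZooKeeper ensemble (address is a placeholder).
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, event -> { });

            // Ensure the persistent parent path exists.
            if (zk.exists("/brokers", false) == null) {
                zk.create("/brokers", new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }

            // An EPHEMERAL znode is deleted automatically when this broker's
            // ZooKeeper session ends, so clients watching /brokers learn of
            // broker failures without any extra machinery.
            zk.create("/brokers/broker-0", "broker-host:9092".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

            // A consumer can discover live brokers (and set a watch) with:
            //     zk.getChildren("/brokers", true);

            Thread.sleep(Long.MAX_VALUE); // keep the session (and registration) alive
        }
    }
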
>> >
>> > Kafka was developed internally at LinkedIn to meet our particular use
>> > cases, but will be useful to many organizations facing a similar need to
>> > reliably process large amounts of streaming data.  Therefore, we would
>> > like to share it with the ASF and begin developing a community of
>> > developers and users within Apache.
>> >
>> > == Rationale ==
>> > Many organizations can benefit from a reliable stream processing system
>> > such
>> > as Kafka.  While our use case of processing events from a very large
>> > website
>> > like LinkedIn has driven the design of Kafka, its uses are varied and we
>> > expect many new use cases to emerge.  Kafka provides a natural bridge
>> > between near real-time event processing and offline batch processing and
>> > will appeal to many users.
>> >
>> > == Current Status ==
>> > === Meritocracy ===
>> > Our intent with this incubator proposal is to start building a diverse
>> > developer community around Kafka following the Apache meritocracy model.
>> > Since Kafka was open sourced we have solicited contributions via the
>> > website
>> > and presentations given to user groups and technical audiences.  We have
>> > had
>> > positive responses to these and have received several contributions and
>> > clients for other languages.  We plan to continue this support for new
>> > contributors and work with those who contribute significantly to the
>> > project
>> > to make them committers.
>> >
>> > === Community ===
>> > Kafka is currently being developed by engineers within LinkedIn and used
>> > in production at that company. Additionally, we have active users in, or
>> > have received contributions from, a diverse set of companies including
>> > MediaSift, SocialTwist, Clearspring and Urban Airship. Recent public
>> > presentations of Kafka and its goals garnered much interest from
>> > potential contributors. We hope to extend our contributor base
>> > significantly and invite all those who are interested in building
>> > high-throughput distributed systems to participate.  We have begun
>> > receiving contributions from outside of LinkedIn, including clients for
>> > several languages including Ruby, PHP, Clojure, .NET and Python.
>> >
>> > To further 

Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Jun Rao
Ioannis,

That's fine. I just want to point out that Kafka does/will have some
differences from ActiveMQ given (1) its simple storage format; (2) how it
uses ZooKeeper for distributed coordination; (3) the future replication work
(http://linkedin.jira.com/browse/KAFKA-23). Let us know if you become
interested again in the future.

Thanks,

Jun

On Fri, Jun 24, 2011 at 3:40 AM, Ioannis Canellos  wrote:

> It seems that Kafka overlaps with ActiveMQ more than I initially estimated,
> which certainly decreases my personal interest in Kafka. So I see fit to
> remove myself from the list.
>
> --
> *Ioannis Canellos*
> http://iocanel.blogspot.com
>
> Apache Karaf Committer & PMC
> Apache ServiceMix Committer
>


RE: [VOTE] Retire Stonehenge

2011-06-24 Thread Kamaljit Bath
+1, (and +1 on the 'dormant' section proposal from Gianugo as well)

-Original Message-
From: gian...@gmail.com [mailto:gian...@gmail.com] On Behalf Of Gianugo 
Rabellino
Sent: Thursday, June 23, 2011 11:58 PM
To: general@incubator.apache.org
Subject: Re: [VOTE] Retire Stonehenge

+1, although I'd rather see it go in the "dormant" section. Also, I
would have rather seen this vote happen on stonehenge-dev first, but it's not
that big of a deal.

--
Gianugo Rabellino - gianugo at rabellino dot it
Blog: http://boldlyopen.com




On Wed, Jun 22, 2011 at 9:55 AM, Daniel Kulp  wrote:
>
> The Stonehenge project pretty much accomplished what it originally set out to
> do and then didn't really find a way to transition to something longer
> lasting and able to develop a community around it.  Lately, there has been
> no interest in it, as evidenced by the last commit being almost a year ago.
> The only recent activity on the dev list is about retiring it and notices
> about missing board reports.
>
> Please vote for the retiring of the Stonehenge podling:
>
> [] +1 - please retire
> [] +/-0
> [] -1 - please don't retire, because...
>
>
>
> --
> Daniel Kulp
> dk...@apache.org
> http://dankulp.com/blog
> Talend - http://www.talend.com

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



[PROPOSAL] Oozie for the Apache Incubator

2011-06-24 Thread Mohammad Islam
Hi,

I would like to propose Oozie to be an Apache Incubator project.  
Oozie is a server-based workflow scheduling and coordination system to manage 
data processing jobs for Apache Hadoop. 


Here's a link to the proposal in the Incubator wiki
http://wiki.apache.org/incubator/OozieProposal


I've also pasted the initial contents below.

Regards,

Mohammad Islam


Start of Oozie Proposal 

Abstract
Oozie is a server-based workflow scheduling and coordination system to manage
data processing jobs for Apache Hadoop(TM).

Proposal
Oozie is an extensible, scalable and reliable system to define, manage,
schedule, and execute complex Hadoop workloads via web services (a client
sketch follows this list). More specifically, this includes:

  * An XML-based declarative framework to specify a job or a complex
    workflow of dependent jobs.
  * Support for different types of jobs such as Hadoop Map-Reduce, Pipes,
    Streaming, Pig, Hive and custom Java applications.
  * Workflow scheduling based on frequency and/or data availability.
  * Monitoring capability, automatic retry and failure handling of jobs.
  * An extensible and pluggable architecture to allow arbitrary grid
    programming paradigms.
  * Authentication, authorization, and capacity-aware load throttling to
    allow multi-tenant software as a service.
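
Since Oozie is driven entirely through web services, a client can submit
and monitor a workflow remotely. The following is a minimal sketch using
Oozie's Java client API; the server URL, HDFS paths and property values are
placeholders, and it assumes a workflow definition has already been
deployed to HDFS:

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;
    import org.apache.oozie.client.WorkflowJob;

    public class SubmitWorkflow {
        public static void main(String[] args) throws Exception {
            // Point the client at the Oozie server's web-services endpoint.
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            // Job configuration: where the workflow XML lives in HDFS, plus
            // the parameters that the workflow definition references.
            Properties conf = oozie.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH,
                    "hdfs://namenode:8020/user/demo/example-wf");
            conf.setProperty("jobTracker", "jobtracker-host:8021");
            conf.setProperty("nameNode", "hdfs://namenode:8020");

            // Submit and start the workflow, then poll until it completes.
            String jobId = oozie.run(conf);
            while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
                Thread.sleep(10 * 1000);
            }
            System.out.println("Workflow " + jobId + " finished as "
                    + oozie.getJobInfo(jobId).getStatus());
        }
    }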

Background
Most data processing applications require multiple jobs to achieve their goals,
with inherent dependencies among the jobs. A dependency could be sequential,
where one job can only start after another job has finished. Or it could be
conditional, where the execution of a job depends on the return value or status
of another job. In other cases, parallel execution of multiple jobs may be
permitted, or desired, to exploit the massive pool of compute nodes provided
by Hadoop.

These job dependencies are often expressed as a Directed Acyclic Graph, also
called a workflow. A node in the workflow is typically a job (a computation on
the grid) or another type of action such as an e-mail notification. Computations
can be expressed in map/reduce, Pig, Hive or any other programming paradigm
available on the grid. Edges of the graph represent transitions from one node
to the next, as the execution of a workflow proceeds.

Describing a workflow in a declarative way has the advantage of decoupling job
dependencies and execution control from application logic. Furthermore, the
workflow is modularized into jobs that can be reused within the same workflow
or across different workflows. Execution of the workflow is then driven by a
runtime system without understanding the application logic of the jobs. This
runtime system specializes in reliable and predictable execution: it can retry
actions that have failed or invoke a cleanup action after termination of the
workflow; it can monitor progress, success, or failure of a workflow, and send
appropriate alerts to an administrator. The application developer is relieved
from implementing these generic procedures.

Furthermore, some applications or workflows need to run in periodic intervals
or when dependent data is available. For example, a workflow could be executed
every day as soon as output data from the previous 24 instances of another,
hourly workflow is available. The workflow coordinator provides such scheduling
features, along with prioritization, load balancing and throttling to optimize
utilization of resources in the cluster. This makes it easier to maintain,
control, and coordinate complex data applications.

Nearly three years ago, a team of Yahoo! developers addressed these critical
requirements for Hadoop-based data processing systems by developing a new
workflow management and scheduling system called Oozie. While it was initially
developed as a Yahoo!-internal project, it was designed and implemented with
the intention of open-sourcing it. Oozie was released as a GitHub project in
early 2010. Oozie is used in production within Yahoo! and, since it has been
open-sourced, it has been gaining adoption among external developers.

Rationale
Commonly, applications that run on Hadoop require multiple Hadoop jobs in order
to obtain the desired results. Furthermore, these Hadoop jobs are commonly a
combination of Java map-reduce jobs, Streaming map-reduce jobs, Pipes
map-reduce jobs, Pig jobs, Hive jobs, HDFS operations, Java programs and shell
scripts.

Because of this, developers find themselves writing ad-hoc glue programs to
combine these Hadoop jobs. These ad-hoc programs are difficult to schedule,
manage, monitor and recover.

Workflow management and scheduling is an essential feature for large-scale data
processing applications. Each such application could implement a customized
solution, but that would require separate development, operational, and
maintenance overhead. Since it is a prevalent use-case for data processing, the
application developer would surely

Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Ioannis Canellos
Jun,
I will definitely let you know!
Till then, I wish you the best of luck with the proposal!

-- 
*Ioannis Canellos*
http://iocanel.blogspot.com

Apache Karaf Committer & PMC
Apache ServiceMix Committer


Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-24 Thread Alejandro Abdelnur
Mohammad,

This is great. Looking forward to continuing to work on Oozie.

Thanks.

Alejandro

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam  wrote:

> Hi,
>
> I would like to propose Oozie to be an Apache Incubator project.
> Oozie is a server-based workflow scheduling and coordination system to
> manage
> data processing jobs for Apache Hadoop.
>
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/OozieProposal
>
>
> I've also pasted the initial contents below.
>
> Regards,
>
> Mohammad Islam
>
>
> [... full Oozie proposal text snipped; it duplicates the original message above ...]

Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-24 Thread Mattmann, Chris A (388J)
Awesome! Very excited about this, guys.

Cheers,
Chris

On Jun 24, 2011, at 12:46 PM, Mohammad Islam wrote:

> Hi,
>
> I would like to propose Oozie to be an Apache Incubator project.
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop.
>
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/OozieProposal
>
>
> I've also pasted the initial contents below.
>
> Regards,
>
> Mohammad Islam
>
>
> [... full Oozie proposal text snipped; it duplicates the original message above ...]

Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-24 Thread Roman Shaposhnik
A strong +1

Thanks,
Roman.

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam  wrote:
> Hi,
>
> I would like to propose Oozie to be an Apache Incubator project.
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop.
>
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/OozieProposal
>
>
> I've also pasted the initial contents below.
>
> Regards,
>
> Mohammad Islam
>
>
> [... full Oozie proposal text snipped; it duplicates the original message above ...]

Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-24 Thread Eli Collins
+1

Spectacular!

On Fri, Jun 24, 2011 at 12:46 PM, Mohammad Islam  wrote:
> Hi,
>
> I would like to propose Oozie to be an Apache Incubator project.
> Oozie is a server-based workflow scheduling and coordination system to manage
> data processing jobs for Apache Hadoop.
>
>
> Here's a link to the proposal in the Incubator wiki
> http://wiki.apache.org/incubator/OozieProposal
>
>
> I've also pasted the initial contents below.
>
> Regards,
>
> Mohammad Islam
>
>
> [... full Oozie proposal text snipped; it duplicates the original message above ...]

Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Jeffrey Damick
+1 from us also - we at Neustar are working on a large deployment using Kafka
as well. We'd be interested in helping in the future.

-jeff


On Fri, Jun 24, 2011 at 2:17 PM, Henry Saputra wrote:

> +1
>
> A very good proposal, and it seems to solve our need for a low-latency
> event messaging system, so I'm looking forward to it.
>
> I would love to contribute to the project and have added my name to the
> list of initial committers if there are no objections.
>
> - Henry
>
> >> 2011/6/22 Jun Rao 
> >>
> >> > Hi,
> >> >
> >> > I would like to propose Kafka to be an Apache Incubator project.
>  Kafka
> >> is
> >> > a
> >> > distributed, high throughput, publish-subscribe system for processing
> >> large
> >> > amounts of streaming data.
> >> >
> >> > Here's a link to the proposal in the Incubator wiki
> >> > http://wiki.apache.org/incubator/KafkaProposal
> >> >
> >> > I've also pasted the initial contents below.
> >> >
> >> > Thanks,
> >> >
> >> > Jun
> >> >
> >> > [... full Kafka proposal text snipped; it duplicates the proposal quoted earlier above ...]

Re: [PROPOSAL] Oozie for the Apache Incubator

2011-06-24 Thread Phillip Rhodes
On Fri, Jun 24, 2011 at 3:46 PM, Mohammad Islam  wrote:

> Hi,
>
> I would like to propose Oozie to be an Apache Incubator project.
> Oozie is a server-based workflow scheduling and coordination system to
> manage
> data processing jobs for Apache Hadoop.
>
>

+1


please subscribe me

2011-06-24 Thread Joe Key
-- 
Joe Andrew Key (Andy)


Re: [PROPOSAL] Kafka for the Apache Incubator

2011-06-24 Thread Joe Key
+1
We will be using it heavily here at HomeHealthCareSOS.com to relay app
server logs to our data warehouse and Hadoop cluster.

-- 
Joe Andrew Key (Andy)


Re: please subscribe me

2011-06-24 Thread Christian Grobmeier
You need to write to general-subscr...@incubator.apache.org to subscribe.

On Sat, Jun 25, 2011 at 2:34 AM, Joe Key  wrote:
> --
> Joe Andrew Key (Andy)
>

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org