Jonathan,

> A third, last-resort pattern might be to go the CDC route with something
> like Databus. This would require implementing additional fetchers and
> relays to support Cassandra and MongoDB. Also, the data would need to be
> transformed on the Hadoop/Spark side for virtually every learning
> application, since they have different data models.

The approach I would suggest is similar to what Jun suggested as well. In
this approach, Kafka is the source-of-truth system and is used as the
durable commit log. All the other systems you have simply feed from this
Kafka-based commit log and do their respective writes. For this to work,
you will have to configure the Kafka topics to compact data, so that the
latest value per key is always retained instead of old data simply being
deleted. All writers use the highest durability setting (acks=-1) when
writing to the commit log. Every downstream system (Cassandra, MongoDB) is
populated by fetchers that consume the commit log and write to the store.
If one fetcher fails, another picks up from where the previous one left
off. This does, however, create a stronger dependency on Kafka in your
ecosystem.
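For illustration only, a minimal sketch of the writer side could look
roughly like the following. It uses the Java producer client; the topic
name, broker address, and payload are made-up placeholders, and it assumes
a topic created with cleanup.policy=compact (not anything from this
thread):

    // Assumes a compacted topic created roughly like:
    //   kafka-topics.sh --create --topic learning-events \
    //       --partitions 8 --replication-factor 3 \
    //       --config cleanup.policy=compact
    // (topic name and sizing are placeholders)

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class CommitLogWriter {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder
            props.put("acks", "all");   // equivalent to acks=-1: wait for all in-sync replicas
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // The entity id is the message key, so log compaction keeps the
                // latest value per key; the value is the serialized entity state.
                producer.send(new ProducerRecord<>("learning-events",
                        "student-42", "{\"score\": 97}")).get();   // block until acked
            }
        }
    }

Downstream, each store (Cassandra, MongoDB) would then run its own consumer
group over this compacted topic, so each store gets its own copy of the log
and its own committed position.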
Thanks,
Neha

On Thu, Jun 5, 2014 at 8:27 AM, Nagesh <nageswara.r...@gmail.com> wrote:

> As Jun Rao said, it is pretty much possible for multiple publishers to
> publish to a topic and for different groups of consumers to consume a
> message and apply group-specific logic, for example raw data processing,
> aggregation, etc. Each distinct group will receive a copy.
>
> But the offset cannot be used as a UUID, as the counter may reset in case
> you restart Kafka for some reason. Not sure, can someone throw some light?
>
> Regards,
> Nageswara Rao
>
>
> On Thu, Jun 5, 2014 at 8:18 PM, Jun Rao <jun...@gmail.com> wrote:
>
> > It sounds like you want to write to a data store and a data pipe
> > atomically. Since both the data store and the data pipe that you want to
> > use are highly available, the only case you want to protect against is
> > the client failing between the two writes. One way to do that is to let
> > the client publish to Kafka first with the strongest ack. Then, run a few
> > consumers to read data from Kafka and write it to the data store. Any one
> > of those consumers can die and its work will be automatically picked up
> > by the remaining ones. You can use the partition id and the offset of
> > each message as its UUID if needed.
> >
> > Thanks,
> >
> > Jun
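As a rough illustration of the consumer side Jun describes, a fetcher that
loads one store might look something like the sketch below. It is written
against the current Java consumer client for brevity (the thread predates
it); the group id, topic, and saveToStore() call are invented placeholders.
The topic/partition/offset triple serves as the per-message UUID, so a
replacement consumer can re-apply messages idempotently.

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class StoreLoader {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder
            props.put("group.id", "cassandra-loader");        // one group per downstream store
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("learning-events"));
                while (true) {
                    ConsumerRecords<String, String> records =
                            consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> rec : records) {
                        // (topic, partition, offset) uniquely identifies the message,
                        // so it can double as an idempotent write id in the store.
                        String uuid = rec.topic() + "-" + rec.partition() + "-" + rec.offset();
                        saveToStore(uuid, rec.key(), rec.value());   // hypothetical upsert
                    }
                    consumer.commitSync();   // commit only after the store writes succeed
                }
            }
        }

        static void saveToStore(String uuid, String key, String value) {
            // placeholder for the real Cassandra/MongoDB write
        }
    }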
> >
> > On Wed, Jun 4, 2014 at 10:56 AM, Jonathan Hodges <hodg...@gmail.com>
> > wrote:
> >
> > > Sorry, didn't realize the mailing list wasn't copied...
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Jonathan Hodges <hodg...@gmail.com>
> > > Date: Wed, Jun 4, 2014 at 10:56 AM
> > > Subject: Re: Hadoop Summit Meetups
> > > To: Neha Narkhede <neha.narkh...@gmail.com>
> > >
> > >
> > > We have a number of customer-facing online learning applications. These
> > > applications use heterogeneous technologies with different data models
> > > in their underlying data stores, such as RDBMS, Cassandra, MongoDB,
> > > etc. We would like to run offline analysis on the data contained in
> > > these learning applications with tools like Hadoop and Spark.
> > >
> > > One thought is to use Kafka as a way for these learning applications to
> > > emit data in near real-time for analytics. We developed a common model,
> > > represented as Avro records in HDFS, that spans these learning
> > > applications so that we can accept the same structured message from
> > > each of them. This allows for comparing apples to apples across these
> > > apps, as opposed to messy transformations.
> > >
> > > So this all sounds good until you dig into the details. One pattern is
> > > for these applications to update state locally in their data stores
> > > first and then publish to Kafka. The problem with this is that the two
> > > operations aren't atomic, so the local persist can succeed and the
> > > publish to Kafka fail, leaving the application and HDFS out of sync.
> > > You can try to add some retry logic to the clients, but this quickly
> > > becomes very complicated and still doesn't solve the underlying
> > > problem.
> > >
> > > Another pattern is to publish to Kafka first with acks=-1 and wait for
> > > the ack from the leader and replicas before persisting locally. This is
> > > probably better than the first pattern but does add some complexity to
> > > the client. The clients must now generate unique entity IDs/UUIDs for
> > > persistence, when they typically rely on the data store for creating
> > > these. Also, the publish to Kafka can succeed and the local persist
> > > fail, leaving the stores out of sync. In this case the learning
> > > application needs to determine how to get itself back in sync. It can
> > > rely on getting the data back from Kafka, but it is possible the local
> > > store failure can't be fixed in a timely manner, e.g. hardware failure,
> > > constraint violation, etc. In that case the application needs to show
> > > an error to the user and will likely need to do something like send a
> > > delete message to Kafka to remove the earlier published message.
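Purely as an illustration of this second pattern, the client-side flow
might look roughly like the sketch below. The entity id, payload, and the
saveLocally() call are invented for the example, and the tombstone-on-
failure step is just one possible way to issue the "delete message"
described above (a null value also works as a delete on a compacted topic):

    import java.util.Properties;
    import java.util.UUID;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PublishFirstClient {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");   // placeholder
            props.put("acks", "all");   // wait for leader and replicas (acks=-1)
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            // The client now mints its own id instead of relying on the
            // data store's auto-generated key.
            String entityId = UUID.randomUUID().toString();
            String state = "{\"score\": 97}";                  // placeholder payload

            try (Producer<String, String> producer = new KafkaProducer<>(props)) {
                // 1. Publish to Kafka first and block until acknowledged.
                producer.send(new ProducerRecord<>("learning-events", entityId, state)).get();

                // 2. Only then persist locally; if that fails and cannot be
                //    retried, emit a tombstone so the earlier publish is
                //    logically undone.
                try {
                    saveLocally(entityId, state);   // hypothetical local data-store write
                } catch (Exception e) {
                    producer.send(new ProducerRecord<>("learning-events",
                            entityId, (String) null)).get();
                    throw e;
                }
            }
        }

        static void saveLocally(String entityId, String state) {
            // placeholder for the application's own data-store write
        }
    }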
> > >
> > > A third, last-resort pattern might be to go the CDC route with
> > > something like Databus. This would require implementing additional
> > > fetchers and relays to support Cassandra and MongoDB. Also, the data
> > > would need to be transformed on the Hadoop/Spark side for virtually
> > > every learning application, since they have different data models.
> > >
> > > I hope this gives enough detail to start discussing transactional
> > > messaging in Kafka. We are willing to help in this effort if it makes
> > > sense for our use cases.
> > >
> > > Thanks
> > > Jonathan
> > >
> > >
> > > On Wed, Jun 4, 2014 at 9:44 AM, Neha Narkhede <neha.narkh...@gmail.com>
> > > wrote:
> > >
> > > > If you are comfortable, share it on the mailing list. If not, I'm
> > > > happy to have this discussion privately.
> > > >
> > > > Thanks,
> > > > Neha
> > > > On Jun 4, 2014 9:42 AM, "Neha Narkhede" <neha.narkh...@gmail.com>
> > > > wrote:
> > > >
> > > >> Glad it was useful. It would be great if you could share your
> > > >> requirements on atomicity. A couple of us are very interested in
> > > >> thinking about transactional messaging in Kafka.
> > > >>
> > > >> Thanks,
> > > >> Neha
> > > >> On Jun 4, 2014 6:57 AM, "Jonathan Hodges" <hodg...@gmail.com> wrote:
> > > >>
> > > >>> Hi Neha,
> > > >>>
> > > >>> Thanks so much to you and the Kafka team for putting together the
> > > >>> meetup. It was very nice and gave people from out of town, like us,
> > > >>> the ability to join in person.
> > > >>>
> > > >>> We are the guys from Pearson Education, and we talked a little
> > > >>> about supplying some details on some of our use cases with respect
> > > >>> to atomicity of source systems eventing data and persisting
> > > >>> locally. Should we just post to the list, or is there somewhere
> > > >>> else we should send these details?
> > > >>>
> > > >>> Thanks again!
> > > >>> Jonathan
> > > >>>
> > > >>>
> > > >>> On Fri, Apr 11, 2014 at 9:31 AM, Neha Narkhede
> > > >>> <neha.narkh...@gmail.com> wrote:
> > > >>>
> > > >>> > Yes, that's a great idea. I can help organize the meetup at
> > > >>> > LinkedIn.
> > > >>> >
> > > >>> > Thanks,
> > > >>> > Neha
> > > >>> >
> > > >>> >
> > > >>> > On Fri, Apr 11, 2014 at 8:44 AM, Saurabh Agarwal (BLOOMBERG/ 731
> > > >>> > LEXIN) <sagarwal...@bloomberg.net> wrote:
> > > >>> >
> > > >>> > > Great idea. I am interested in attending as well....
> > > >>> > >
> > > >>> > > ----- Original Message -----
> > > >>> > > From: users@kafka.apache.org
> > > >>> > > To: users@kafka.apache.org
> > > >>> > > At: Apr 11 2014 11:40:56
> > > >>> > >
> > > >>> > > With the Hadoop Summit in San Jose 6/3 - 6/5, I wondered if any
> > > >>> > > of the LinkedIn geniuses were thinking of putting together a
> > > >>> > > meetup on any of the associated technologies like Kafka, Samza,
> > > >>> > > Databus, etc. For us poor souls who don't live on the West
> > > >>> > > Coast, it was a great experience attending the Kafka meetup
> > > >>> > > last year.
> > > >>> > >
> > > >>> > > Jonathan
> > > >>> >
> > > >>>
> > > >>
> > >
> >
>
> --
> Thanks & Regards,
> Nageswara Rao.V