Hi Matt,
If you can guarantee there are a certain # of events in a single "wrapper"
event, or at least bound that number, then you could potentially get away
with this. However, if you're not careful, you could get stuck in an infinite
fail-backoff-retry loop due to exceeding the (configurable) channel
transaction limit. The first limit you will want to tune is the
channel.transactionCapacity parameter, which is simply a sanity-check /
arbitrary limit on the # of events that can be placed into a channel in a
single transaction (this guards against weird bugs, like a source that
keeps stuffing events into a transaction it never commits). The other
thing you've got to
watch out for is what your (Flume) batch size looks like, since Flume is
designed to do batching at the RPC layer, not at the event layer like you
are describing.
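
For reference, both of those knobs live in the agent config. A minimal
sketch, assuming a memory channel and an HDFS sink (the names a1 / c1 / k1
are placeholders; the parameter names are real):

    a1.channels.c1.type = memory
    # Total events the channel can hold at once.
    a1.channels.c1.capacity = 100000
    # Max events per put/take transaction -- the limit discussed above.
    a1.channels.c1.transactionCapacity = 10000

    # Sink-side batch: events taken from the channel per transaction.
    a1.sinks.k1.hdfs.batchSize = 1000

The source-side batch matters just as much here, since interceptors run
before the channel put: however many wrapper events arrive in one RPC call
is the batch your interceptor will be expanding.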

So basically just make sure that your channel.transactionCapacity > max
batch size * max # sub-events per "wrapper" event.
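
To plug made-up numbers into that inequality: if a source batch can carry
up to 100 wrapper events and each wrapper can expand into at most 50
sub-events, a single put transaction may need to hold 100 * 50 = 5000
events, so in the sketch above you'd want something like

    a1.channels.c1.transactionCapacity = 6000

to leave a bit of headroom. (The 100 and 50 are illustrative; substitute
your actual maxima.)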

Hope this makes sense. The approach above is somewhat subtle, and since it
has sharp edges when misconfigured, we generally just recommend avoiding it
if possible.
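
For what it's worth, if you do go down this road, the shape of the change
is just an Interceptor whose list variant of intercept() returns more
events than it received. A bare sketch follows; the 4-byte length-prefix
framing in parseSubRecords() is an assumption, so swap in whatever your
scribe payload actually uses:

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.interceptor.Interceptor;

    public class WrapperSplittingInterceptor implements Interceptor {

      @Override
      public void initialize() {
        // No state to set up.
      }

      @Override
      public Event intercept(Event event) {
        // Single-event variant; the list variant below does the real work.
        return event;
      }

      @Override
      public List<Event> intercept(List<Event> wrappers) {
        List<Event> out = new ArrayList<Event>();
        for (Event wrapper : wrappers) {
          for (byte[] record : parseSubRecords(wrapper.getBody())) {
            // Deliberately emits more events than it received -- the very
            // thing the Interceptor javadoc warns against. This is also
            // where you'd parse each record's time and set the "timestamp"
            // header that HDFS path escaping keys off of.
            out.add(EventBuilder.withBody(record));
          }
        }
        return out;
      }

      @Override
      public void close() {
        // Nothing to clean up.
      }

      // ASSUMPTION: each record is prefixed with a 4-byte big-endian
      // length. Swap in whatever framing your scribe payload really uses.
      private List<byte[]> parseSubRecords(byte[] body) {
        List<byte[]> records = new ArrayList<byte[]>();
        ByteBuffer buf = ByteBuffer.wrap(body);
        while (buf.remaining() >= 4) {
          int len = buf.getInt();
          if (len < 0 || len > buf.remaining()) {
            break; // malformed frame; bail out rather than throw
          }
          byte[] record = new byte[len];
          buf.get(record);
          records.add(record);
        }
        return records;
      }

      public static class Builder implements Interceptor.Builder {
        @Override
        public Interceptor build() {
          return new WrapperSplittingInterceptor();
        }

        @Override
        public void configure(Context context) {
          // No parameters for this sketch.
        }
      }
    }

You'd register it on the source like any custom interceptor, e.g.
a1.sources.r1.interceptors.i1.type =
com.example.WrapperSplittingInterceptor$Builder (package name is a
placeholder).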

Best,
Mike


On Tue, Jun 24, 2014 at 10:17 AM, Matt Tenenbaum <matt.tenenb...@rockyou.com> wrote:

> I see in the documentation for org.apache.flume.interceptor.Interceptor
> that the result of intercept(List<Event>) must not exceed the size of the
> input (in all-caps, even). This is unfortunate for my use-case: I'm
> interfacing with a scribe source that provides each message as a
> serialization of some number of protobuf records together with their
> content-lengths, and an interceptor seemed like an ideal way to convert
> those possibly-many records into individual events. That's particularly
> desirable because I need to establish the timestamp header from each
> underlying record in order to route to the correct file in HDFS. It's
> unlikely that a batch of records coming in as a single event has
> _drastically_ different timestamps, but it's also out of my control.
>
> Given all the capital letters, the restriction on output cardinality is
> really-real, right? Would I be setting myself up for disaster?
>
> Is there some other way I can convert an event that looks essentially like
>
> Event(rec-size-1 + rec-1 + rec-size-2 + rec-2 + ... + rec-size-N + rec-N)
>
>
> into a List<Event>:
>
> {Event(rec-1), Event(rec-2), ..., Event(rec-N)}
>
>
> This channel has nontrivial volume, potentially hundreds of MB per minute,
> so I don't want to (e.g.) serialize the multiple records and then read them
> into a second stage if I can handle the one-to-many transformation on the
> fly.
>
> Thanks in advance for clarifications and suggestions.
>
> -mt
>
