Thanks, Mike. That's very helpful.

It should be no trouble to establish an upper bound on the intra-event
packing, so I should be able to tune the capacity parameters in the way you
suggest. Besides, aren't the sharp edges what make it fun to play with?

I may be able to impose suitable restrictions upstream that guarantee a
one-to-one correspondence between protobuf records and scribe messages,
thus allowing me to follow your recommendation and avoid these shenanigans.
But it's nice to know that I _can_ do it this way if I need to.

Cheers
-mt


On Tue, Jun 24, 2014 at 7:45 PM, Mike Percy <mpe...@apache.org> wrote:

> Hi Matt,
> If you can guarantee there are a certain # of events in a single "wrapper"
> event, or bound the limit, then you could potentially get away with this.
> However if you're not careful you could get stuck in an infinite
> fail-backoff-retry loop due to exceeding the (configurable) channel
> transaction limit. The first limit you will want to tune is the
> channel.transactionCapacity parameter, which is simply a sanity-check /
> arbitrary limit on the # of events that can be placed into a channel in a
> single transaction (this avoids weird bugs like a source that opens a
> transaction that never gets committed). The other thing you have to
> watch out for is what your (Flume) batch size looks like, since Flume is
> designed to do batching at the RPC layer, not at the event layer like you
> are describing.
>
> So basically just make sure that your channel.transactionCapacity > max
> batch size * max # sub-events per "wrapper" event.
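>
> For illustration only (the agent/channel names and numbers here are
> hypothetical, not from your setup): with a sink batch size of 100 and at
> most 50 sub-events per wrapper, you would need transactionCapacity above
> 100 * 50 = 5000, e.g. in the agent's properties file:
>
>   # must exceed 100 * 50 = 5000; 10000 leaves some headroom
>   agent.channels.ch1.type = memory
>   agent.channels.ch1.capacity = 100000
>   agent.channels.ch1.transactionCapacity = 10000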
>
> Hope this makes sense. The approach is somewhat subtle, and since it has
> sharp edges when misconfigured, we generally recommend against it if
> possible.
>
> Best,
> Mike
>
>
> On Tue, Jun 24, 2014 at 10:17 AM, Matt Tenenbaum <
> matt.tenenb...@rockyou.com> wrote:
>
>> I see in the documentation for org.apache.flume.interceptor.Interceptor
>> that the result of intercept(List<Event>) must not exceed the size of the
>> input (in all-caps, even). This is unfortunate for my use-case: I'm
>> interfacing with a scribe source that provides each message as a
>> serialization of some number of protobuf records together with their
>> content-lengths, and an interceptor seemed like an ideal way to convert
>> those possibly-many records into individual events. That's particularly
>> desirable because I need to establish the timestamp header from each
>> underlying record in order to route to the correct file in HDFS. It's
>> unlikely that a batch of records coming in as a single event has
>> _drastically_ different timestamps, but it's also out of my control.
>>
>> Given all the capital letters, the restriction on output cardinality is
>> really-real, right? I'll be setting myself up for disaster?
>>
>> Is there some other way I can convert an event that looks essentially like
>>
>> Event(rec-size-1 + rec-1 + rec-size-2 + rec-2 + ... + rec-size-N + rec-N)
>>
>>
>> into a List<Event>:
>>
>> {Event(rec-1), Event(rec-2), ..., Event(rec-N)}
>>
>>
>> This channel has nontrivial volume, potentially hundreds of MB per
>> minute, so I don't want to (e.g.) serialize the multiple records and then
>> read them into a second stage if I can handle the one-to-many
>> transformation on the fly.
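>>
>> To make the idea concrete, here is a minimal sketch of the interceptor I
>> have in mind (the class name and the 4-byte big-endian length framing are
>> my assumptions; only the Interceptor/EventBuilder APIs are Flume's):
>>
>> import java.nio.ByteBuffer;
>> import java.util.ArrayList;
>> import java.util.HashMap;
>> import java.util.List;
>>
>> import org.apache.flume.Context;
>> import org.apache.flume.Event;
>> import org.apache.flume.event.EventBuilder;
>> import org.apache.flume.interceptor.Interceptor;
>>
>> public class RecordSplittingInterceptor implements Interceptor {
>>
>>   @Override
>>   public void initialize() {}
>>
>>   @Override
>>   public Event intercept(Event event) {
>>     // Single-event form; the splitting happens in the List form below.
>>     return event;
>>   }
>>
>>   @Override
>>   public List<Event> intercept(List<Event> events) {
>>     List<Event> out = new ArrayList<Event>();
>>     for (Event wrapper : events) {
>>       ByteBuffer buf = ByteBuffer.wrap(wrapper.getBody());
>>       while (buf.remaining() >= 4) {
>>         int len = buf.getInt();          // rec-size-i
>>         byte[] record = new byte[len];
>>         buf.get(record);                 // rec-i
>>         // Copy the wrapper's headers; a real version would parse the
>>         // protobuf here and set the "timestamp" header from the record.
>>         out.add(EventBuuilder.withBody(record,
>>             new HashMap<String, String>(wrapper.getHeaders())));
>>       }
>>     }
>>     return out;  // may exceed events.size(), which the javadoc forbids
>>   }
>>
>>   @Override
>>   public void close() {}
>>
>>   public static class Builder implements Interceptor.Builder {
>>     @Override
>>     public Interceptor build() {
>>       return new RecordSplittingInterceptor();
>>     }
>>
>>     @Override
>>     public void configure(Context context) {}
>>   }
>> }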
>>
>> Thanks in advance for clarifications and suggestions.
>>
>> -mt
>>
>
>
