Thanks Mike. That's very helpful. It should be no trouble to establish an upper bound on the intra-event packing, so I should be able to tune the capacity parameters in the way you suggest. Besides, aren't the sharp edges what make it fun to play with?
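For concreteness (all the names and numbers below are made up), if each wrapper never unpacks to more than 50 records and the source never hands over more than 100 events per batch, something like this should keep the channel on the right side of your inequality, with some headroom above the strict 100 * 50 = 5000 minimum:

    # hypothetical agent/channel/sink names; assumed bounds:
    # <= 100 events per batch, <= 50 sub-events per wrapper event
    agent.channels.ch1.type = memory
    agent.channels.ch1.capacity = 60000
    agent.channels.ch1.transactionCapacity = 6000

    # keep the sink's per-transaction batch under the same limit
    agent.sinks.hdfs-sink.hdfs.batchSize = 1000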
I may be able to impose suitable restrictions upstream that guarantee a
one-to-one correspondence between protobuf records and scribe messages,
thus allowing me to follow your recommendation and avoid these
shenanigans. But it's nice to know that I _can_ do it this way if I need
to.

Cheers
-mt

On Tue, Jun 24, 2014 at 7:45 PM, Mike Percy <mpe...@apache.org> wrote:

> Hi Matt,
> If you can guarantee there are a certain # of events in a single "wrapper"
> event, or bound the limit, then you could potentially get away with this.
> However, if you're not careful you could get stuck in an infinite
> fail-backoff-retry loop due to exceeding the (configurable) channel
> transaction limit. The first limit you will want to tune is the
> channel.transactionCapacity parameter, which is simply a sanity check /
> arbitrary limit on the # of events that can be placed into a channel in a
> single transaction (this avoids weird bugs like a source that opens a
> transaction that never gets committed). The other thing you've got to
> watch out for is what your (Flume) batch size looks like, since Flume is
> designed to do batching at the RPC layer, not at the event layer like you
> are describing.
>
> So basically, just make sure that your channel.transactionCapacity > max
> batch size * max # of sub-events per "wrapper" event.
>
> Hope this makes sense. The above explanation is somewhat subtle, and
> since this approach has sharp edges when misconfigured, we just recommend
> not doing it if possible.
>
> Best,
> Mike
>
>
> On Tue, Jun 24, 2014 at 10:17 AM, Matt Tenenbaum <
> matt.tenenb...@rockyou.com> wrote:
>
>> I see in the documentation for org.apache.flume.interceptor.Interceptor
>> that the result of intercept(List<Event>) must not exceed the size of the
>> input (in all-caps, even). This is unfortunate for my use case: I'm
>> interfacing with a scribe source that provides each message as a
>> serialization of some number of protobuf records together with their
>> content lengths, and an interceptor seemed like an ideal way to convert
>> those possibly-many records into individual events. That's particularly
>> desirable because I need to establish the timestamp header from each
>> underlying record in order to route to the correct file in HDFS. It's
>> unlikely that a batch of records coming in as a single event has
>> _drastically_ different timestamps, but it's also out of my control.
>>
>> Given all the capital letters, the restriction on output cardinality is
>> really-real, right? I'll be setting myself up for disaster?
>>
>> Is there some other way I can convert an event that looks essentially
>> like
>>
>>   Event(rec-size-1 + rec-1 + rec-size-2 + rec-2 + ... + rec-size-N + rec-N)
>>
>> into a List<Event>:
>>
>>   {Event(rec-1), Event(rec-2), ..., Event(rec-N)}
>>
>> This channel has nontrivial volume, potentially hundreds of MB per
>> minute, so I don't want to (e.g.) serialize the multiple records and then
>> read them into a second stage if I can handle the one-to-many
>> transformation on the fly.
>>
>> Thanks in advance for clarifications and suggestions.
>>
>> -mt
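P.S. In case it's useful to anyone else who finds this thread: here's a rough, untested sketch of the splitting interceptor I had in mind, assuming each record is framed by a 4-byte big-endian length prefix (the class name and framing are mine, not anything that ships with Flume). It knowingly violates the documented contract that intercept(List<Event>) must not grow the list, so Mike's capacity tuning above still applies:

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.interceptor.Interceptor;

    /**
     * Splits one "wrapper" event whose body is a sequence of
     * [4-byte length][record bytes] frames into one event per record.
     * NOTE: this deliberately exceeds the documented output-size limit
     * of intercept(List), so channel.transactionCapacity must be sized
     * for the worst-case expansion (see thread above).
     */
    public class RecordSplittingInterceptor implements Interceptor {

      @Override
      public void initialize() {
        // no state to set up
      }

      @Override
      public Event intercept(Event event) {
        // Single-event path is unused; splitting happens in the list variant.
        return event;
      }

      @Override
      public List<Event> intercept(List<Event> events) {
        List<Event> out = new ArrayList<Event>();
        for (Event wrapper : events) {
          ByteBuffer buf = ByteBuffer.wrap(wrapper.getBody());
          while (buf.remaining() >= 4) {
            int len = buf.getInt();            // assumed big-endian length prefix
            if (len < 0 || len > buf.remaining()) {
              break;                           // malformed frame; stop splitting
            }
            byte[] record = new byte[len];
            buf.get(record);
            // Copy the wrapper's headers; a real version would parse the
            // protobuf record here and set the "timestamp" header from it
            // so the HDFS sink routes each record to the correct file.
            Map<String, String> headers =
                new HashMap<String, String>(wrapper.getHeaders());
            out.add(EventBuilder.withBody(record, headers));
          }
        }
        return out;
      }

      @Override
      public void close() {
        // nothing to release
      }

      /** Builder required so Flume can instantiate the interceptor. */
      public static class Builder implements Interceptor.Builder {
        @Override
        public void configure(Context context) {
          // no configuration needed for this sketch
        }

        @Override
        public Interceptor build() {
          return new RecordSplittingInterceptor();
        }
      }
    }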