Matt, no prob! Mike
Sent from my iPhone

> On Jun 24, 2014, at 8:35 PM, Matt Tenenbaum <matt.tenenb...@rockyou.com> wrote:
>
> Thanks Mike. That's very helpful.
>
> It should be no trouble to establish an upper bound on the intra-event
> packing, so I should be able to tune the capacity parameters in the way you
> suggest. Besides, aren't the sharp edges what make it fun to play with?
>
> I may be able to impose suitable restrictions upstream that guarantee a
> one-to-one correspondence between protobuf records and scribe messages, thus
> allowing me to follow your recommendation and avoid these shenanigans. But
> it's nice to know that I _can_ do it this way if I need to.
>
> Cheers
> -mt
>
>
>> On Tue, Jun 24, 2014 at 7:45 PM, Mike Percy <mpe...@apache.org> wrote:
>> Hi Matt,
>> If you can guarantee there are a certain # of events in a single "wrapper"
>> event, or bound the limit, then you could potentially get away with this.
>> However, if you're not careful you could get stuck in an infinite
>> fail-backoff-retry loop due to exceeding the (configurable) channel
>> transaction limit. The first limit you will want to tune is the
>> channel.transactionCapacity parameter, which is simply a sanity check /
>> arbitrary limit on the # of events that can be placed into a channel in a
>> single transaction (this avoids weird bugs like a source that opens a
>> transaction that never gets committed). The other thing you have to watch
>> out for is what your (Flume) batch size looks like, since Flume is designed
>> to do batching at the RPC layer, not at the event layer as you are
>> describing.
>>
>> So basically just make sure that your channel.transactionCapacity > max
>> batch size * max # of sub-events per "wrapper" event.
>>
>> Hope this makes sense. The above explanation is somewhat subtle, and since
>> it has sharp edges when misconfigured, we just recommend not doing it if
>> possible.
>>
>> Best,
>> Mike
>>
>>
>>> On Tue, Jun 24, 2014 at 10:17 AM, Matt Tenenbaum
>>> <matt.tenenb...@rockyou.com> wrote:
>>> I see in the documentation for org.apache.flume.interceptor.Interceptor
>>> that the result of intercept(List<Event>) must not exceed the size of the
>>> input (in all-caps, even). This is unfortunate for my use case: I'm
>>> interfacing with a scribe source that provides each message as a
>>> serialization of some number of protobuf records together with their
>>> content lengths, and an interceptor seemed like an ideal way to convert
>>> those possibly-many records into individual events. That's particularly
>>> desirable because I need to establish the timestamp header from each
>>> underlying record in order to route to the correct file in HDFS. It's
>>> unlikely that a batch of records coming in as a single event has
>>> _drastically_ different timestamps, but it's also out of my control.
>>>
>>> Given all the capital letters, the restriction on output cardinality is
>>> really-real, right? Will I be setting myself up for disaster?
>>>
>>> Is there some other way I can convert an event that looks essentially like
>>>
>>> Event(rec-size-1 + rec-1 + rec-size-2 + rec-2 + ... + rec-size-N + rec-N)
>>>
>>> into a List<Event>:
>>>
>>> {Event(rec-1), Event(rec-2), ..., Event(rec-N)}
>>>
>>> This channel has nontrivial volume, potentially hundreds of MB per minute,
>>> so I don't want to (e.g.) serialize the multiple records and then read them
>>> into a second stage if I can handle the one-to-many transformation on the
>>> fly.
>>>
>>> Thanks in advance for clarifications and suggestions.
>>>
>>> -mt
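
For illustration, a rough Java sketch of the one-to-many interceptor discussed in the thread: it walks a wrapper event's body as a sequence of length-prefixed records, emits one event per record, and sets a timestamp header on each for HDFS path routing. The class name, the assumed 4-byte big-endian length framing, and the extractTimestampMillis helper are placeholders invented for the sketch, not part of any real wire format or of Flume's API, and the sketch deliberately returns more events than it receives, which the Interceptor javadoc forbids, so the transactionCapacity sizing caveat from the thread still applies.

// Hypothetical one-to-many interceptor sketch; the length framing and the
// timestamp extraction below are assumptions, not Flume or scribe behavior.
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.interceptor.Interceptor;

public class RecordSplittingInterceptor implements Interceptor {

  @Override
  public void initialize() {}

  @Override
  public Event intercept(Event event) {
    // Single-event path is unused; the splitting happens in the List variant.
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    List<Event> out = new ArrayList<Event>();
    for (Event wrapper : events) {
      ByteBuffer buf = ByteBuffer.wrap(wrapper.getBody());
      // Assumed framing: a 4-byte big-endian length prefix before each record.
      while (buf.remaining() >= 4) {
        int len = buf.getInt();
        if (len < 0 || len > buf.remaining()) {
          break; // malformed tail; in practice, log and count this
        }
        byte[] record = new byte[len];
        buf.get(record);
        Event split = EventBuilder.withBody(record);
        Map<String, String> headers = split.getHeaders();
        headers.putAll(wrapper.getHeaders());
        // Set the timestamp header used by the HDFS sink's path escapes.
        headers.put("timestamp", Long.toString(extractTimestampMillis(record)));
        out.add(split);
      }
    }
    // NOTE: out.size() can exceed events.size(), which the Interceptor javadoc
    // forbids; size channel.transactionCapacity accordingly (see the thread).
    return out;
  }

  private long extractTimestampMillis(byte[] record) {
    // Placeholder: parse the protobuf record and return its event time.
    return System.currentTimeMillis();
  }

  @Override
  public void close() {}

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new RecordSplittingInterceptor();
    }

    @Override
    public void configure(Context context) {}
  }
}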
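
Likewise, a minimal and partial flume.conf sketch of Mike's sizing rule, assuming a memory channel feeding an HDFS sink; the agent, channel, and sink names and all numbers are placeholders. With a sink batch size of 100 and at most 50 records packed into one scribe message, transactionCapacity must exceed 100 * 50 = 5000, so it is set to 10000 here for headroom.

# Hypothetical, partial agent config; names and numbers are illustrative only.
agent.channels = ch1
agent.sinks = hdfs1

# Memory channel: transactionCapacity must exceed
# (max sink batch size) x (max sub-events per wrapper event) = 100 x 50 = 5000.
agent.channels.ch1.type = memory
agent.channels.ch1.capacity = 100000
agent.channels.ch1.transactionCapacity = 10000

# HDFS sink: batchSize is the number of events written per transaction;
# the path escapes use the per-event timestamp header for routing.
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = ch1
agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent.sinks.hdfs1.hdfs.batchSize = 100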