>> Okay. In the future, we probably need some form of
>> "serialization-free" batching mechanism to ship data more efficiently.
> 
> Do you guys have a sense of how load splits up between serialization
> and batching/communication? My hope has been that batching itself can
> take care of the performance issues, so that we'll be able to send
> logs as standard CAF messages, each one representing a batch of N log
> lines. The benchmark I created a while ago to examine that wasn't
> able to get the necessary performance out of Broker/CAF (hence the
> fall-back to Bro's old serialization of log messages for now, sent
> over CAF). But IIRC, the conclusion was that there's still room for
> improvement in CAF that should make this feasible eventually.
> However, if you guys believe it's really CAF's serialization that's
> the bottleneck, then we'll need to come up with something else
> indeed.

I think there are a few orthogonal aspects merged together here, namely 
(1) memory-mapping, (2) batching, and (3) the performance of CAF's serialization.

1) Matthias threw in memory-mapping, but I’m not so sure this is actually 
feasible for you. The main benefit would be a unified representation in 
memory, on disk, and on the wire. However, I think you’re still going to keep 
the ASCII log output format for Bro logs. Also, a memory-mapped format would 
mean dropping the current broker::data API entirely. My hunch is that you 
would rather not break the API immediately after releasing it to the public.

2) CAF already does batching. Ideally, Broker should not need to do any 
additional batching on top of that. In fact, doing the batching in user code 
greatly diminishes the effectiveness of CAF’s own batching, because CAF can 
no longer break up chunks on its own to make efficient use of resources.
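
To make that concrete, here is a minimal sketch against Broker's public 
publish() API. The topic name and the shape of the log lines are made up 
for illustration; the point is only the contrast between the two variants:

    #include <string>
    #include <vector>

    #include <broker/broker.hh>

    // Hypothetical helper: ship a set of log lines to subscribers.
    void ship_logs(broker::endpoint& ep,
                   const std::vector<std::string>& lines) {
      broker::topic t{"bro/logs/conn"}; // made-up topic

      // Variant A: user-level batching. N lines become one opaque
      // broker::vector, i.e., a single stream element that CAF can
      // neither split nor re-chunk to match the available credit.
      broker::vector batch;
      for (auto& line : lines)
        batch.emplace_back(line);
      ep.publish(t, broker::data{std::move(batch)});

      // Variant B: one element per log line. CAF's stream layer
      // batches these itself and sizes the chunks dynamically.
      for (auto& line : lines)
        ep.publish(t, broker::data{line});
    }

With variant B, CAF decides how many elements travel per network message, 
which is exactly the flexibility that user-level batching takes away.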

3) Serialization should really not be a bottleneck. The costly parts are 
shuffling bytes around in buffers and the heap allocations when deserializing 
a broker::data, and there’s no way around those two costs. Do you still 
remember what showed up during your investigation that triggered you to go 
with the blob? What I can see as a *much* bigger issue is *copying* overhead, 
not serialization. CAF streams assume that individual elements are cheap to 
copy. So a copy-on-write optimization for broker::data would probably have a 
much higher impact on performance (it’s also straightforward to implement, 
and CAF has re-usable pieces for that). If serialization still shows up with 
unreasonable costs in a profiler, however, there are ways to speed things up. 
The customization point here is a specialized inspect() overload for 
broker::data that essentially allows you to apply any optimization you want 
(and that might be used in Bro’s framework).
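
Regarding copy-on-write, here is a sketch of the idea in isolation, using 
std::shared_ptr for brevity (the re-usable piece I have in mind in CAF is 
its intrusive COW pointer; broker::data would hold its variant through a 
handle like this):

    #include <memory>
    #include <utility>

    // Generic COW handle: copies are reference-count bumps; the payload
    // is only cloned when a writer touches a shared object.
    template <class T>
    class cow {
    public:
      explicit cow(T value) : ptr_(std::make_shared<T>(std::move(value))) {}

      cow(const cow&) = default; // cheap: shares the payload

      const T& operator*() const { return *ptr_; } // reads never copy

      T& unshared() {
        if (ptr_.use_count() > 1)            // shared? detach first
          ptr_ = std::make_shared<T>(*ptr_); // deep copy on demand
        return *ptr_;
      }

    private:
      std::shared_ptr<T> ptr_;
    };

With a handle like this inside broker::data, passing a value through a 
stream stage costs a reference-count bump instead of a deep copy of the 
whole recursive structure.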
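
And for the customization point, the free-function pattern looks roughly 
like this (shown for a stand-in type, assuming the CAF 0.16-era inspector 
API; a real overload for broker::data would dispatch on the variant and 
could, e.g., emit a compact binary form):

    // ADL finds this overload and CAF uses it instead of its default
    // member-wise serialization; the same function serves both saving
    // and loading because the inspector visits fields by reference.
    struct point {
      int x;
      int y;
    };

    template <class Inspector>
    typename Inspector::result_type inspect(Inspector& f, point& p) {
      return f(p.x, p.y);
    }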

I hope we’re not talking past each other. :)

An in-depth performance analysis of Broker’s streaming layer has been on my 
todo list for months at this point. I hope to get something done before the 
Bro Workshop in Europe; then we can discuss this with some reliable data in 
person.

    Dominik