Micah, thanks very much for your input.  A few thoughts in response:

``Over the time horizon of desired latency if you aren't receiving enough
messages to take advantage of columnar analytics, a system probably has
enough time to compact batches after the fact for later analysis and
conversely if you are receiving many events you naturally get reasonable
batch sizes without having to do further work.''

To see my perspective, it is necessary to stop thinking about business
analytics and latencies of minutes, seconds, or significant fractions of
a second.  In real-time financial trading systems we do things like
kernel-bypass networking, core affinitization and spinning, fake
workloads to keep hot paths cached, and almost complete avoidance of
general-purpose allocators (preferring typed pools), all with the goal of
shaving off one more MICROsecond.

Therefore "probably has enough time" is never in our vocabulary.
Microbatching is not a thing.  It is either purely event driven or bust if
you are shooting for a dozen microseconds from packet-in to packet-out on
the wire.  This is also the reason that certain high-performance systems
will never use the reference implementations of Arrow, just as most other
financial tech standards are centered on the *protocol* and there end up
being dozens of implementations that compete on performance.

Arrow *as a protocol* has the potential to be a beautiful thing.  It is
simple and straightforward and brings needed standardization.  I would
encourage thinking of the reference implementations almost as second-class
citizens.  The reference implementations should not de facto define the
protocol; it should be the other way around.  (I know this is probably
already the philosophy.)

Of course data eventually pops out of high-performance systems and has a
second life in the hands of researchers, etc.  This is why I'd like
"vanilla" Arrow to be able to deal with Arrow data as-constructed in the
higher-performance systems.  Could there be special processes to refactor
all this data?  Of course, but why?  If I'm logging RecordBatches to disk
and some of them happen to have extra array elements (e.g. because of the
inevitable imperfect projection pre-allocation: running out of
variable-length string storage before running out of rows), why would I
refactor terabytes of data (and, more importantly, introduce a time window
where data is not available while this occurs) when I can just have
pyarrow skip the unused rows?  If I want compression/de-duplication I'll
do it at the storage-media layer.
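
To make that concrete, here is a minimal pyarrow sketch.  The column
names and row counts are made up for illustration; today the closest
approximation is an explicit slice, whereas under the proposal readers
would simply honor RecordBatch.length:

    import pyarrow as pa

    # Writer pre-allocated capacity for 1024 rows but (hypothetically)
    # ran out of variable-length string storage after 1000 rows.  Under
    # the proposal the batch would advertise length=1000 while its arrays
    # physically hold 1024 slots, and readers would skip the remainder.
    ids = pa.array(range(1024), type=pa.int64())
    names = pa.array(["row-%d" % i for i in range(1024)], type=pa.string())
    full = pa.RecordBatch.from_arrays([ids, names], ["id", "name"])

    # Today's nearest equivalent: slice off the unused tail explicitly.
    visible = full.slice(0, 1000)
    assert visible.num_rows == 1000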

``the proposal is essentially changing the unit of exchange from
RecordBatches to a segment of a RecordBatch''

It is the recognition that a RecordBatch may need to be computationally
useful before it is completely received/constructed, and without adding
additional overhead to the reception/construction process.  I grab a canned
RecordBatch from a pool of them and start filling it in.  If an earlier
RecordBatch falls out of the computation window, I put it back in the
pool.  At any moment the batch advertises the truth: RecordBatch.length is
the section of data that is immutable as of that moment.  That section
will not change in future moments.  It is also a recognition that some
RecordBatches so constructed are not completely fillable.  They are like a
moving truck with a little empty space left.  Sure, we could custom-build
trucks to exactly fit the cargo, but why?  It takes too much time.  Grab a
truck (or a RecordBatch) off the lot (pool) and go...  when you've filled
it as much as possible, but not perfectly, grab another one that is
exactly the same.
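
A rough sketch of the acquire/fill/release cycle I mean.  All names here
are hypothetical -- this is not an Arrow API, just the pool pattern:

    import collections

    class FixedCapacityBatch:
        # Stand-in for a pre-allocated batch: fixed buffers, mutable
        # write position.  "length" is the advertised immutable row count.
        def __init__(self, capacity=1024):
            self.capacity = capacity
            self.length = 0

        def reset(self):
            self.length = 0  # rewind; buffers are reused, never freed

    class BatchPool:
        # A pool of identically-shaped batches: the lot full of trucks.
        def __init__(self, make_batch, size):
            self._free = collections.deque(make_batch() for _ in range(size))

        def acquire(self):
            return self._free.popleft()  # O(1), no allocator on the hot path

        def release(self, batch):
            batch.reset()
            self._free.append(batch)

    pool = BatchPool(FixedCapacityBatch, size=8)
    batch = pool.acquire()
    # ... append rows, bumping batch.length as each row becomes immutable ...
    pool.release(batch)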

I think my perception of the situation is clearest if we think about
"frozen" or "sealed" RecordBatches that, during their construction stage,
use a one-size-fits-all set of arrays and variable-length buffer storage.
I grab a RecordBatch off the lot that is "for two int columns and a
variable-length string column where the average expected length of a
string is 10."  I fill it, then I'm done.  If my strings were slightly
longer than expected, I have extra array elements and RecordBatch.length
is less than the array length.
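
As a back-of-envelope check on that sizing (assuming 4096 rows, int64
columns, int32 string offsets, and ten bytes of string storage per row;
validity bitmaps ignored):

    rows = 4096
    int_col = rows * 8        # one fixed-width int64 column
    offsets = (rows + 1) * 4  # int32 offsets for the string column
    strings = rows * 10       # variable-length string storage budget
    total = 2 * int_col + offsets + strings
    print(total)              # 122884 bytes to pre-allocate per batch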

While it is true that I have other use-cases in mind regarding simultaneous
collection and computation, I'm hoping that the moving truck analogy by
itself demonstrates enough efficiency advantages (as compared to typical
batch construction) to warrant this change.  Put simply, it is Arrow
accommodating more space/time tradeoff options than it currently does.

When I was baking Plasma into a system for the first time, I ran into the
pattern of writing a record batch to a mock stream just to learn how much
Plasma memory to allocate.  That is exactly the kind of "build the moving
truck to fit the cargo" step that I just can't have!
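
For reference, this is roughly the pattern from the Plasma docs at the
time ("batch" is assumed to be an already-built RecordBatch; the socket
path and object ID are placeholders):

    import pyarrow as pa
    import pyarrow.plasma as plasma

    # Step 1: serialize to a mock sink purely to learn the size --
    # the "build the truck to fit the cargo" step.
    mock_sink = pa.MockOutputStream()
    writer = pa.RecordBatchStreamWriter(mock_sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    data_size = mock_sink.size()

    # Step 2: only now can the Plasma buffer be allocated and written.
    client = plasma.connect("/tmp/plasma")
    object_id = plasma.ObjectID(20 * b"a")
    buf = client.create(object_id, data_size)
    sink = pa.FixedSizeBufferWriter(buf)
    writer = pa.RecordBatchStreamWriter(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    client.seal(object_id)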

-John

On Wed, Oct 16, 2019 at 10:15 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Hi John and Wes,
>
> A few thoughts:
> One of the issues which we didn't get into in prior discussions, is the
> proposal is essentially changing the unit of exchange from RecordBatches to
> a segment of a RecordBatch.
>
> I think I brought this up earlier in discussions, an interesting idea that
> Trill [1], a columnar streaming engine, illustrates.  Over the time horizon
> of desired latency if you aren't receiving enough messages to take
> advantage of columnar analytics, a system probably has enough time to
> compact batches after the fact for later analysis and conversely if you are
> receiving many events you naturally get reasonable batch sizes without
> having to do further work.
>
>
> > I'm objecting to RecordBatch.length being inconsistent with the
> > constituent field lengths, that's where the danger lies. If all of the
> > lengths are consistent, no code changes are necessary.
>
> John, is it a viable solution to keep all lengths in sync for the use
> case you are imagining?
>
> A solution I like less, but might be viable: formally specify a negative
> constant that signifies length should be inherited from the RecordBatch
> length (this could only be used on top-level fields).
>
> I contend that it can only be useful and will never be harmful.  What are
> > the counter-examples of concrete harm?
>
>
> I'm not sure there is anything obviously wrong, however changes to
> semantics are always dangerous.  One blemish on the current proposal is
> that one can't easily determine whether a mismatch in row length is a
> programming error or intentional.
>
> [1]
>
> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf
>
> On Wed, Oct 16, 2019 at 4:41 PM John Muehlhausen <j...@jgm.org> wrote:
>
> > "that's where the danger lies"
> >
> > What danger?  I have no idea what the specific danger is, assuming that
> all
> > reference implementations have test cases that hedge around this.
> >
> > I contend that it can only be useful and will never be harmful.  What are
> > the counter-examples of concrete harm?
> >
>
