On the specification: I'm -0.5 on saying array lengths can be different than
the record batch length (especially if both are valid lengths).  I can see
some wiggle room in the current language [1][2] that might allow for
modifying this, so I think we should update it one way or the other, however
this conversation turns out.

On the implementation: I wouldn't want to see code that mutates record batch
headers on the fly in the current code base.  It is dangerous to break
immutability assumptions that standard implementations might be relying on.
IMO, for such a specialized use case it makes sense to have out-of-band
communication (perhaps in a private implementation that does modify headers)
and then touch up the metadata for later analysis so that it conforms to the
specification (and standard libraries can be used).
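
For illustration only, a minimal sketch of that touch-up step in pyarrow; the
names `conform`, `columns`, and `valid_rows` are placeholders, and the
out-of-band row count is whatever the private implementation provides:

  import pyarrow as pa

  def conform(columns, schema: pa.Schema, valid_rows: int) -> pa.RecordBatch:
      # `columns` are the full pre-allocated arrays from the private writer;
      # `valid_rows` is the row count communicated out of band.  Slicing is
      # zero-copy, and the rebuilt batch has RecordBatch.length equal to each
      # array length, so standard libraries can consume it.
      trimmed = [col.slice(0, valid_rows) for col in columns]
      return pa.RecordBatch.from_arrays(trimmed, schema=schema)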

[1] https://github.com/apache/arrow/blob/master/format/Message.fbs#L49
[2] https://github.com/apache/arrow/blob/master/format/Message.fbs#L34

On Thu, Oct 17, 2019 at 7:17 AM John Muehlhausen <j...@jgm.org> wrote:

> Micah, thanks very much for your input.  A few thoughts in response:
>
> ``Over the time horizon of the desired latency, if you aren't receiving
> enough messages to take advantage of columnar analytics, a system probably
> has enough time to compact batches after the fact for later analysis;
> conversely, if you are receiving many events, you naturally get reasonable
> batch sizes without having to do further work.''
>
> To see my perspective it is necessary to stop thinking about business
> analytics and latencies of minutes or seconds or significant fractions of
> seconds.  In real-time financial trading systems we do things like kernel
> bypass networking, core affinitization and spinning, fake workloads to keep
> hot paths cached, almost complete avoidance of general purpose allocators
> (preferring typed pools), all with the goal of shaving off one more
> MICROsecond.
>
> Therefore "probably has enough time" is never in our vocabulary.
> Microbatching is not a thing.  It is either purely event driven or bust if
> you are shooting for a dozen microseconds from packet-in to packet-out on
> the wire.  This is also the reason that certain high-performance systems
> will never use the reference implementations of Arrow, just as most other
> financial tech standards are centered on the *protocol* and there end up
> being dozens of implementations that compete on performance.
>
> Arrow *as a protocol* has the potential to be a beautiful thing.  It is
> simple and straightforward and brings needed standardization.  I would
> encourage thinking of the reference implementations almost as second-class
> citizens.  The reference implementations should not de facto define the
> protocol; it should be the other way around.  (I know this is probably
> already the philosophy.)
>
> Of course data eventually pops out of high performance systems and has a
> second life in the hands of researchers, etc.  This is why I'd like the
> "vanilla" Arrow to be able to deal with Arrow data as-constructed in the
> higher performance systems.  Could there be special processes to refactor
> all this data?  Of course, but why?  If I'm logging RecordBatches to disk
> and some of them happen to have extra array elements (e.g. because
> pre-allocation is based on an inevitably imperfect projection: running
> out of variable-length string storage before running out of rows), why
> would I refactor
> terabytes of data (and, more importantly, introduce a time window where
> data is not available while this occurs) when I can just have pyarrow skip
> the unused rows?  If I want compression/de-duplication I'll do it at the
> storage media layer.
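>
> To be concrete, the "skip" I have in mind is just a trim to
> RecordBatch.length, and Arrow slices are zero-copy, so nothing on disk gets
> rewritten.  A rough sketch (it assumes a reader willing to surface arrays
> longer than num_rows instead of rejecting the batch; `analyze` and the file
> name are placeholders):
>
>   import pyarrow as pa
>
>   def used_rows(batch: pa.RecordBatch) -> pa.RecordBatch:
>       # Keep only the prefix the header advertises as filled.
>       cols = [col.slice(0, batch.num_rows) for col in batch.columns]
>       return pa.RecordBatch.from_arrays(cols, schema=batch.schema)
>
>   reader = pa.ipc.open_file("log.arrow")
>   for i in range(reader.num_record_batches):
>       analyze(used_rows(reader.get_batch(i)))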
>
> ``the proposal is essentially changing the unit of exchange from
> RecordBatches to a segment of a RecordBatch''
>
> It is the recognition that a RecordBatch may need to be computationally
> useful before it is completely received/constructed, and without adding
> additional overhead to the reception/construction process.  I grab a canned
> RecordBatch from a pool of them and start filling it in.  If an earlier
> RecordBatch falls out of the computation window, I put it back in the
> pool.  At any moment in time the batch advertises the truth:
> RecordBatch.length marks the section of data that is immutable as of that
> moment, and that section will not change in later moments.  It is also a
> recognition that some RecordBatches so-constructed are not completely
> fillable.  They are like a moving truck with a little empty space left.
> Sure, we could custom-build trucks to exactly fit the cargo, but why?  It
> takes too much time.  Grab a truck (or a RecordBatch) off the lot (pool)
> and go...  when you've filled it as much as possible, but not perfectly,
> grab another one that is exactly the same.
>
> I think my perception of the situation is clearest if we think about
> "frozen" or "sealed" RecordBatches that, during their construction stage,
> use a one-size-fits-all set of arrays and variable length buffer storage.
> I grab a RecordBatch off the lot that is "for two int columns and a
> variable length string column where the average expected length of a string
> is 10."  I fill it, then I'm done.  If my strings were slightly longer than
> expected I have extra array elements and RecordBatch.length is less than
> array length.
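>
> A rough sketch of that construction model, in plain Python just to make the
> bookkeeping concrete (the capacities and the append/seal discipline are
> assumptions about the private writer, not an existing Arrow API):
>
>   CAPACITY_ROWS = 1024        # pre-allocated array length for every column
>   STRING_BYTES = 10 * 1024    # variable-length storage, ~10 bytes per row
>
>   class PooledBatch:
>       def __init__(self):
>           self.int_a = [0] * CAPACITY_ROWS
>           self.int_b = [0] * CAPACITY_ROWS
>           self.offsets = [0] * (CAPACITY_ROWS + 1)
>           self.data = bytearray(STRING_BYTES)
>           self.length = 0     # becomes RecordBatch.length when sealed
>
>       def append(self, a: int, b: int, s: bytes) -> bool:
>           end = self.offsets[self.length] + len(s)
>           if self.length == CAPACITY_ROWS or end > STRING_BYTES:
>               return False    # full, perhaps imperfectly: seal, grab another
>           self.int_a[self.length] = a
>           self.int_b[self.length] = b
>           self.data[self.offsets[self.length]:end] = s
>           self.offsets[self.length + 1] = end
>           self.length += 1    # slots past `length` stay allocated but unused
>           return True
>
> If the strings average longer than 10 bytes, the string storage runs out
> first and `length` seals below CAPACITY_ROWS, which is exactly the
> extra-array-elements case above.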
>
> While it is true that I have other use-cases in mind regarding
> simultaneous collection and computation, I'm hoping that the moving truck
> analogy by itself demonstrates enough efficiency advantages (as compared to
> typical batch construction) to warrant this change.  Put simply, it is
> Arrow accommodating more space/time tradeoff options than it currently does.
>
> When I was baking Plasma into a system for the first time, I ran into the
> example where I write a record batch to a mock stream just to learn how
> much Plasma memory to allocate.  That is the kind of "build the moving
> truck to fit the cargo" that I just can't have!
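>
> For reference, the pattern I mean looks roughly like this (paraphrased from
> memory of the Plasma documentation; exact APIs may differ by version, and
> `batch` is an already-built RecordBatch):
>
>   import numpy as np
>   import pyarrow as pa
>   import pyarrow.plasma as plasma
>
>   client = plasma.connect("/tmp/plasma")
>   object_id = plasma.ObjectID(np.random.bytes(20))
>
>   # "Build the truck to fit the cargo": serialize to a mock stream first,
>   # only to learn how large the plasma allocation must be...
>   mock = pa.MockOutputStream()
>   writer = pa.RecordBatchStreamWriter(mock, batch.schema)
>   writer.write_batch(batch)
>   writer.close()
>
>   # ...then allocate exactly that much and serialize a second time.
>   buf = client.create(object_id, mock.size())
>   sink = pa.FixedSizeBufferWriter(buf)
>   writer = pa.RecordBatchStreamWriter(sink, batch.schema)
>   writer.write_batch(batch)
>   writer.close()
>   client.seal(object_id)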
>
> -John
>
> On Wed, Oct 16, 2019 at 10:15 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>> Hi John and Wes,
>>
>> A few thoughts:
>> One of the issues which we didn't get into in prior discussions, is the
>> proposal is essentially changing the unit of exchange from RecordBatches
>> to
>> a segment of a RecordBatch.
>>
>> I think I brought this up earlier in the discussion: Trill [1], a columnar
>> streaming engine, illustrates an interesting idea.  Over the time horizon
>> of the desired latency, if you aren't receiving enough messages to take
>> advantage of columnar analytics, a system probably has enough time to
>> compact batches after the fact for later analysis; conversely, if you are
>> receiving many events, you naturally get reasonable batch sizes without
>> having to do further work.
>>
>>
>> > I'm objecting to RecordBatch.length being inconsistent with the
>> > constituent field lengths, that's where the danger lies. If all of the
>> > lengths are consistent, no code changes are necessary.
>>
>> John, is it a viable solution to keep all lengths in sync for the use case
>> you are imagining?
>>
>> A solution I like less, but which might be viable: formally specify a
>> negative constant that signifies the length should be inherited from the
>> RecordBatch length (this could only be used on top-level fields).
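>>
>> Roughly, a reader would then resolve each top-level field's length like
>> this (a sketch only; the sentinel value and names here are placeholders,
>> not a concrete spec proposal):
>>
>>   INHERIT_LENGTH = -1  # hypothetical sentinel stored in the field node
>>
>>   def resolve_length(field_node_length: int, record_batch_length: int) -> int:
>>       # A negative field length means "inherit from RecordBatch.length".
>>       if field_node_length == INHERIT_LENGTH:
>>           return record_batch_length
>>       return field_node_length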
>>
>> > I contend that it can only be useful and will never be harmful.  What
>> > are the counter-examples of concrete harm?
>>
>>
>> I'm not sure there is anything obviously wrong; however, changes to
>> semantics are always dangerous.  One blemish on the current proposal is
>> that one can't easily determine whether a mismatch in row length is a
>> programming error or intentional.
>>
>> [1]
>>
>> https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/trill-vldb2015.pdf
>>
>> On Wed, Oct 16, 2019 at 4:41 PM John Muehlhausen <j...@jgm.org> wrote:
>>
>> > "that's where the danger lies"
>> >
>> > What danger?  I have no idea what the specific danger is, assuming that
>> > all reference implementations have test cases that hedge around this.
>> >
>> > I contend that it can only be useful and will never be harmful.  What
>> > are the counter-examples of concrete harm?
>> >
>>
>
