Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow
std::vector terminology)
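For concreteness, here is a sketch of what that distinction could look like, borrowing std::vector semantics (the class and names are hypothetical illustration, not an Arrow API):

```python
import array

class PreallocatedColumn:
    """Hypothetical fixed-width column distinguishing capacity
    ("theoretical max length") from size ("utilized length")."""
    def __init__(self, capacity):
        self.capacity = capacity                 # preallocated rows
        self.size = 0                            # rows actually populated
        self._values = array.array("d", bytes(8 * capacity))

    def append(self, value):
        if self.size >= self.capacity:
            raise MemoryError("batch full; preallocate a new one")
        self._values[self.size] = value
        self.size += 1                           # publish the row last

    def view(self):
        return self._values[: self.size]         # readers see only 'size' rows

col = PreallocatedColumn(100)
col.append(1.5)
col.append(2.5)
assert list(col.view()) == [1.5, 2.5] and col.capacity == 100
```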

Thanks,
John

On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote:

> Wes et al, I think my core proposal is that Message.fbs:RecordBatch split
> the "length" parameter into "theoretical max length" and "utilized length"
> (perhaps not those exact names).
>
> "theoretical max length" is the same as "length" now ... /// ... The arrays
> in the batch should all have this length
>
> "utilized length" is the number of rows (starting from the first one)
> that actually contain interesting data... the rest do not.
>
> The reason we can have a RecordBatch where these numbers are not the same
> is that the RecordBatch space was preallocated (for performance reasons)
> and the number of rows that actually "fit" depends on how correct the
> preallocation was.  In any case, it gives the user control of this
> space/time tradeoff... wasted space in order to save time in record batch
> construction.  The fact that some space will usually be wasted when there
> are variable-length columns (barring extreme luck) with this batch
> construction paradigm explains the word "theoretical" above.  This also
> gives us the ability to inspect a partially constructed batch while it is
> still being built, given appropriate user-supplied concurrency control.
>
> I am not an expert in all of the Arrow variable-length data types, but I
> think this works if they are all similar to variable-length strings where
> we advance through "blob storage" by setting the indexes into that storage
> for the current and next row in order to indicate that we have
> incrementally consumed more blob storage.  (Conceptually this storage is
> "unallocated" after the pre-allocation and before rows are populated.)
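To make that concrete, here is a pure-Python sketch (hypothetical names, not the Arrow C++ builders) of a preallocated variable-length string column whose offsets buffer advances as blob storage is incrementally consumed:

```python
class PreallocatedStringColumn:
    """Hypothetical variable-length string column with preallocated blob
    storage; offsets advance as blob storage is incrementally consumed."""
    def __init__(self, capacity, blob_bytes):
        self.capacity = capacity
        self.size = 0                            # utilized length
        self.blob = bytearray(blob_bytes)        # "unallocated" blob storage
        self.offsets = [0] * (capacity + 1)      # Arrow-style offsets buffer
        self.end = 0                             # blob bytes consumed so far

    def append(self, s):
        data = s.encode()
        if self.size >= self.capacity or self.end + len(data) > len(self.blob):
            raise MemoryError("need a fresh preallocation")
        self.blob[self.end : self.end + len(data)] = data
        self.end += len(data)
        self.offsets[self.size + 1] = self.end   # next row's offset = new end
        self.size += 1

    def get(self, i):
        assert i < self.size, "row not populated yet"
        return bytes(self.blob[self.offsets[i] : self.offsets[i + 1]]).decode()

col = PreallocatedStringColumn(capacity=4, blob_bytes=64)
col.append("Hello, world!")
assert col.get(0) == "Hello, world!" and col.size == 1
```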
>
> At a high level I am seeking to shore up the format for event ingress into
> real-time analytics that have some look-back window.  If I'm not mistaken I
> think this is the subject of the last multi-sentence paragraph here?:
> https://zd.net/2H0LlBY
>
> Currently we have a less-efficient paradigm where "microbatches" (e.g. of
> length 1 for minimal latency) must periodically be combined, at CPU cost,
> into larger buffers where we get the columnar layout benefit.  With
> pre-allocation we can deal with microbatches (a partially populated larger
> RecordBatch) and immediately have the columnar layout benefits for the
> populated section with no additional computation.
>
> For example, consider an event processing system that calculates a "moving
> average" as events roll in.  While this is somewhat contrived, let's assume
> that the moving average window is 1000 periods and our pre-allocation
> ("theoretical max length") of RecordBatch elements is 100.  The algorithm
> would be something like this, for a list of RecordBatch buffers in memory:
>
> initialization():
>   set up configuration of expected variable length storage requirements,
> e.g. the template RecordBatch mentioned below
>
> onIncomingEvent(event):
>   obtain lock /// cf. swoopIn() below
>   if the last RecordBatch's utilized length has reached its theoretical max
> length, or the variable-length components of "event" will not fit in the
> remaining blob storage:
>     create a new RecordBatch pre-allocation with theoretical max length 100
> and with blob preallocation that is max(expected, event ... in case the
> single event is larger than the expectation for 100 events)
>        (note: in the expected case this can be very fast as it is a
> malloc() and a memcpy() from a template!)
>     set current RecordBatch to this newly created one
>   add event to current RecordBatch (for the non-calculated fields)
>   increment utilized length of current RecordBatch
>   calculate the calculated fields (in this case, moving average) by
> looking back at previous rows in this and previous RecordBatch objects
>   free() any RecordBatch objects that are now before the lookback window
>
> swoopIn(): /// somebody wants to chart the lookback window
>   obtain lock
>   visit all of the relevant data in the RecordBatches to construct the
> chart /// notice that the last RecordBatch may not yet be "as full as
> possible"
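The algorithm above can be sketched in plain Python (a single-process toy with assumed names; a real implementation would use preallocated Arrow buffers rather than lists):

```python
import threading
from collections import deque

BATCH_CAPACITY = 100       # "theoretical max length" per preallocated batch
WINDOW = 1000              # moving-average lookback, in rows

class Batch:
    """Stand-in for a preallocated RecordBatch: fixed capacity plus a
    growing "utilized length"."""
    def __init__(self):
        self.values = [0.0] * BATCH_CAPACITY   # preallocated column
        self.utilized = 0                      # rows actually populated

lock = threading.Lock()
batches = deque([Batch()])

def on_incoming_event(value):
    with lock:                                 # cf. swoop_in()
        if batches[-1].utilized == BATCH_CAPACITY:
            batches.append(Batch())            # cheap: just a preallocation
        b = batches[-1]
        b.values[b.utilized] = value
        b.utilized += 1                        # publish the row last
        # free batches that now fall entirely before the lookback window
        total = sum(x.utilized for x in batches)
        while total - batches[0].utilized >= WINDOW:
            total -= batches.popleft().utilized

def swoop_in():
    """Somebody wants to chart the lookback window."""
    with lock:
        # note the last batch may not yet be "as full as possible"
        rows = [v for b in batches for v in b.values[: b.utilized]]
        window = rows[-WINDOW:]
        return sum(window) / len(window)

for i in range(250):
    on_incoming_event(float(i))
assert len(batches) == 3 and batches[-1].utilized == 50
assert swoop_in() == 124.5
```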
>
> The above analysis (minus the free()) could apply to the IPC file format
> and the lock could be a file lock and the swoopIn() could be a separate
> process.  In the case of the file format, while the file is locked, a new
> RecordBatch would overwrite the previous file Footer and a new Footer would
> be written.  In order to be able to delete or archive old data multiple
> files could be strung together in a logical series.
>
> -John
>
> On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org> wrote:
>> >
>> > Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already
>> reads
>> > the future Feather format? If not, how will the future format differ?  I
>> > will work on my access pattern with this format instead of the current
>> > feather format.  Sorry I was not clear on that earlier.
>> >
>>
>> Yes, under the hood those will use the same zero-copy binary protocol
>> code paths to read the file.
>>
>> > Micah, thank you!
>> >
>> > On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <emkornfi...@gmail.com>
>> > wrote:
>> >
>> > > Hi John,
>> > > To give a specific pointer [1] describes how the streaming protocol is
>> > > stored to a file.
>> > >
>> > > [1] https://arrow.apache.org/docs/format/IPC.html#file-format
>> > >
>> > > On Tue, May 7, 2019 at 9:40 AM Wes McKinney <wesmck...@gmail.com>
>> wrote:
>> > >
>> > > > hi John,
>> > > >
>> > > > As soon as the R folks can install the Arrow R package consistently,
>> > > > the intent is to replace the Feather internals with the plain Arrow
>> > > > IPC protocol where we have much better platform support all around.
>> > > >
>> > > > If you'd like to experiment with creating an API for pre-allocating
>> > > > fixed-size Arrow protocol blocks and then mutating the data and
>> > > > metadata on disk in-place, please be our guest. We don't have the
>> > > > tools developed yet to do this for you
>> > > >
>> > > > - Wes
>> > > >
>> > > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org>
>> wrote:
>> > > > >
>> > > > > Thanks Wes:
>> > > > >
>> > > > > "the current Feather format is deprecated" ... yes, but there
>> will be a
>> > > > > future file format that replaces it, correct?  And my discussion
>> of
>> > > > > immutable "portions" of Arrow buffers, rather than immutability
>> of the
>> > > > > entire buffer, applies to IPC as well, right?  I am only
>> championing
>> > > the
>> > > > > idea that this future file format have the convenience that
>> certain
>> > > > > preallocated rows can be ignored based on a metadata setting.
>> > > > >
>> > > > > "I recommend batching your writes" ... this is extremely
>> inefficient
>> > > and
>> > > > > adds unacceptable latency, relative to the proposed solution.  Do
>> you
>> > > > > disagree?  Either I have a batch length of 1 to minimize latency,
>> which
>> > > > > eliminates columnar advantages on the read side, or else I add
>> latency.
>> > > > > Neither works, and it seems that a viable alternative is within
>> sight?
>> > > > >
>> > > > > If you don't agree that there is a core issue and opportunity
>> here, I'm
>> > > > not
>> > > > > sure how to better make my case....
>> > > > >
>> > > > > -John
>> > > > >
>> > > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmck...@gmail.com
>> >
>> > > > wrote:
>> > > > >
>> > > > > > hi John,
>> > > > > >
>> > > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <j...@jgm.org>
>> > > wrote:
>> > > > > > >
>> > > > > > > Wes et al, I completed a preliminary study of populating a
>> Feather
>> > > > file
>> > > > > > > incrementally.  Some notes and questions:
>> > > > > > >
>> > > > > > > I wrote the following dataframe to a feather file:
>> > > > > > >             a    b
>> > > > > > > 0  0123456789  0.0
>> > > > > > > 1  0123456789  NaN
>> > > > > > > 2  0123456789  NaN
>> > > > > > > 3  0123456789  NaN
>> > > > > > > 4        None  NaN
>> > > > > > >
>> > > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't
>> > > > > > > support --gen-mutable! yuck! C++ to the rescue...), it seems
>> that
>> > > > > > > read_feather is not affected by NumRows?  It seems to be
>> driven
>> > > > entirely
>> > > > > > by
>> > > > > > > the per-column Length values?
>> > > > > > >
>> > > > > > > Also, it seems as if one of the primary usages of NullCount
>> is to
>> > > > > > determine
>> > > > > > > whether or not a bitfield is present?  In the initialization
>> data
>> > > > above I
>> > > > > > > was careful to have a null value in each column in order to
>> > > generate
>> > > > a
>> > > > > > > bitfield.
>> > > > > >
>> > > > > > Per my prior e-mails, the current Feather format is deprecated,
>> so
>> > > I'm
>> > > > > > only willing to engage on a discussion of the official Arrow
>> binary
>> > > > > > protocol that we use for IPC (memory mapping) and RPC (Flight).
>> > > > > >
>> > > > > > >
>> > > > > > > I then wiped the bitfields in the file and set all of the
>> string
>> > > > indices
>> > > > > > to
>> > > > > > > one past the end of the blob buffer (all strings empty):
>> > > > > > >       a   b
>> > > > > > > 0  None NaN
>> > > > > > > 1  None NaN
>> > > > > > > 2  None NaN
>> > > > > > > 3  None NaN
>> > > > > > > 4  None NaN
>> > > > > > >
>> > > > > > > I then set the first record to some data by consuming some of
>> the
>> > > > string
>> > > > > > > blob and row 0 and 1 indices, also setting the double:
>> > > > > > >
>> > > > > > >                a    b
>> > > > > > > 0  Hello, world!  5.0
>> > > > > > > 1           None  NaN
>> > > > > > > 2           None  NaN
>> > > > > > > 3           None  NaN
>> > > > > > > 4           None  NaN
>> > > > > > >
>> > > > > > > As mentioned above, NumRows seems to be ignored.  I tried
>> adjusting
>> > > > each
>> > > > > > > column Length to mask off higher rows and it worked for 4
>> (hide
>> > > last
>> > > > row)
>> > > > > > > but not for less than 4.  I take this to be due to math used
>> to
>> > > > figure
>> > > > > > out
>> > > > > > > where the buffers are relative to one another since there is
>> only
>> > > one
>> > > > > > > metadata offset for all of: the (optional) bitset, index
>> column and
>> > > > (if
>> > > > > > > string) blobs.
>> > > > > > >
>> > > > > > > Populating subsequent rows would proceed in a similar way
>> until all
>> > > > of
>> > > > > > the
>> > > > > > > blob storage has been consumed, which may come before the
>> > > > pre-allocated
>> > > > > > > rows have been consumed.
>> > > > > > >
>> > > > > > > So what does this mean for my desire to incrementally write
>> these
>> > > > > > > (potentially memory-mapped) pre-allocated files and/or Arrow
>> > > buffers
>> > > > in
>> > > > > > > memory?  Some thoughts:
>> > > > > > >
>> > > > > > > - If a single value (such as NumRows) were consulted to
>> determine
>> > > the
>> > > > > > table
>> > > > > > > row dimension then updating this single value would serve to
>> tell a
>> > > > > > reader
>> > > > > > > which rows are relevant.  Assuming this value is updated
>> after all
>> > > > other
>> > > > > > > mutations are complete, and assuming that mutations are only
>> > > appends
>> > > > > > > (addition of rows), then concurrency control involves only
>> ensuring
>> > > > that
>> > > > > > > this value is not examined while it is being written.
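A toy of this "publish the row count last" ordering over a memory-mapped file (the 4-byte-count-then-double-column layout here is invented for illustration and is not the Arrow metadata encoding; real code would also need a genuinely atomic store, which Python does not guarantee):

```python
import mmap, os, struct, tempfile

CAPACITY = 100
path = os.path.join(tempfile.mkdtemp(), "prealloc.dat")
with open(path, "wb") as f:
    f.write(b"\x00" * (4 + 8 * CAPACITY))        # preallocate the file

# writer: populate row 0 first, then publish the row count
with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)
    struct.pack_into("<d", m, 4, 5.0)            # row 0 of the double column
    struct.pack_into("<I", m, 0, 1)              # readers may now see 1 row
    m.flush()
    m.close()

# reader: trusts the count and never looks past it
with open(path, "rb") as f:
    r = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (nrows,) = struct.unpack_from("<I", r, 0)
    values = [struct.unpack_from("<d", r, 4 + 8 * i)[0] for i in range(nrows)]
    r.close()

assert nrows == 1 and values == [5.0]
```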
>> > > > > > >
>> > > > > > > - NullCount presents a concurrency problem if someone reads
>> the
>> > > file
>> > > > > > after
>> > > > > > > this field has been updated, but before NumRows has exposed
>> the new
>> > > > > > record
>> > > > > > > (or vice versa).  The idea previously mentioned that there
>> will
>> > > > "likely
>> > > > > > > [be] more statistics in the future" feels like it might be
>> scope
>> > > > creep to
>> > > > > > > me?  This is a data representation, not a calculation
>> framework?
>> > > If
>> > > > > > > NullCount had its genesis in the optional nature of the
>> bitfield, I
>> > > > would
>> > > > > > > suggest that perhaps NullCount can be dropped in favor of
>> always
>> > > > > > supplying
>> > > > > > > the bitfield, which in any event is already contemplated by
>> the
>> > > spec:
>> > > > > > > "Implementations may choose to always allocate one anyway as a
>> > > > matter of
>> > > > > > > convenience."  If the concern is space savings, Arrow is
>> already an
>> > > > > > > extremely uncompressed format.  (Compression is something I
>> would
>> > > > also
>> > > > > > > consider to be scope creep as regards Feather... compressed
>> > > > filesystems
>> > > > > > can
>> > > > > > > be employed and there are other compressed dataframe formats.)
>> > > > However,
>> > > > > > if
>> > > > > > > protecting the 4 bytes required to update NumRows turns out
>> to be
>> > > no
>> > > > > > easier
>> > > > > > > than protecting all of the statistical bytes as well as part
>> of the
>> > > > same
>> > > > > > > "critical section" (locks: yuck!!) then statistics pose no
>> issue.
>> > > I
>> > > > > > have a
>> > > > > > > feeling that the availability of an atomic write of 4 bytes
>> will
>> > > > depend
>> > > > > > on
>> > > > > > > the storage mechanism... memory vs memory map vs write() etc.
>> > > > > > >
>> > > > > > > - The elephant in the room appears to be the presumptive
>> binary
>> > > > yes/no on
>> > > > > > > mutability of Arrow buffers.  Perhaps the thought is that
>> certain
>> > > > batch
>> > > > > > > processes will be wrecked if anyone anywhere is mutating
>> buffers in
>> > > > any
>> > > > > > > way.  But keep in mind I am not proposing general mutability,
>> only
>> > > > > > > appending of new data.  *A huge amount of batch processing
>> that
>> > > will
>> > > > take
>> > > > > > > place with Arrow is on time-series data (whether financial or
>> > > > otherwise).
>> > > > > > > It is only natural that architects will want the minimal
>> impedance
>> > > > > > mismatch
>> > > > > > > when it comes time to grow those time series as the events
>> occur
>> > > > going
>> > > > > > > forward.*  So rather than say that I want "mutable" Arrow
>> buffers,
>> > > I
>> > > > > > would
>> > > > > > > pitch this as a call for "immutable populated areas" of Arrow
>> > > buffers
>> > > > > > > combined with the concept that the populated area can grow up
>> to
>> > > > whatever
>> > > > > > > was preallocated.  This will not affect anyone who has
>> "memoized" a
>> > > > > > > dimension and wants to continue to consider the then-current
>> data
>> > > as
>> > > > > > > immutable... it will be immutable and will always be immutable
>> > > > according
>> > > > > > to
>> > > > > > > that then-current dimension.
>> > > > > > >
>> > > > > > > Thanks in advance for considering this feedback!  I absolutely
>> > > > require
>> > > > > > > efficient row-wise growth of an Arrow-like buffer to deal
>> with time
>> > > > > > series
>> > > > > > > data in near real time.  I believe that preallocation is (by
>> far)
>> > > the
>> > > > > > most
>> > > > > > > efficient way to accomplish this.  I hope to be able to use
>> Arrow!
>> > > > If I
>> > > > > > > cannot use Arrow then I will be using a home-grown Arrow that
>> is
>> > > > > > identical
>> > > > > > > except for this feature, which would be very sad!  Even if
>> Arrow
>> > > > itself
>> > > > > > > could be used in this manner today, I would be hesitant to
>> use it
>> > > if
>> > > > the
>> > > > > > > use-case was not protected on a go-forward basis.
>> > > > > > >
>> > > > > >
>> > > > > > I recommend batching your writes and using the Arrow binary
>> streaming
>> > > > > > protocol so you are only appending to a file rather than
>> mutating
>> > > > > > previously-written bytes. This use case is well defined and
>> supported
>> > > > > > in the software already.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > >
>> > >
>> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format
>> > > > > >
>> > > > > > - Wes
>> > > > > >
>> > > > > > > Of course, I am completely open to alternative ideas and
>> > > approaches!
>> > > > > > >
>> > > > > > > -John
>> > > > > > >
>> > > > > > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney <
>> wesmck...@gmail.com>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > > hi John -- again, I would caution you against using Feather
>> files
>> > > > for
>> > > > > > > > issues of longevity -- the internal memory layout of those
>> files
>> > > > is a
>> > > > > > > > "dead man walking" so to speak.
>> > > > > > > >
>> > > > > > > > I would advise against forking the project, IMHO that is a
>> dark
>> > > > path
>> > > > > > > > that leads nowhere good. We have a large community here and
>> we
>> > > > accept
>> > > > > > > > pull requests -- I think the challenge is going to be
>> defining
>> > > the
>> > > > use
>> > > > > > > > case to suitable clarity that a general purpose solution
>> can be
>> > > > > > > > developed.
>> > > > > > > >
>> > > > > > > > - Wes
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <
>> j...@jgm.org>
>> > > > wrote:
>> > > > > > > > >
>> > > > > > > > > François, Wes,
>> > > > > > > > >
>> > > > > > > > > Thanks for the feedback.  I think the most practical
>> thing for
>> > > > me to
>> > > > > > do
>> > > > > > > > is
>> > > > > > > > > 1- write a Feather file that is structured to
>> pre-allocate the
>> > > > space
>> > > > > > I
>> > > > > > > > need
>> > > > > > > > > (e.g. initial variable-length strings are of average size)
>> > > > > > > > > 2- come up with code to monkey around with the values
>> contained
>> > > > in
>> > > > > > the
>> > > > > > > > > vectors so that before and after each manipulation the
>> file is
>> > > > valid
>> > > > > > as I
>> > > > > > > > > walk the rows ... this is a writer that uses memory
>> mapping
>> > > > > > > > > 3- check back in here once that works, assuming that it
>> does,
>> > > to
>> > > > see
>> > > > > > if
>> > > > > > > > we
>> > > > > > > > > can bless certain mutation paths
>> > > > > > > > > 4- if we can't bless certain mutation paths, fork the
>> project
>> > > for
>> > > > > > those
>> > > > > > > > who
>> > > > > > > > > care more about stream processing?  That would not seem
>> to be
>> > > > ideal
>> > > > > > as I
>> > > > > > > > > think mutation in row-order across the data set is
>> relatively
>> > > low
>> > > > > > impact
>> > > > > > > > on
>> > > > > > > > > the overall design?
>> > > > > > > > >
>> > > > > > > > > Thanks again for engaging the topic!
>> > > > > > > > > -John
>> > > > > > > > >
>> > > > > > > > > On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <
>> > > > > > > > > fsaintjacq...@gmail.com> wrote:
>> > > > > > > > >
>> > > > > > > > > > Hello John,
>> > > > > > > > > >
>> > > > > > > > > > Arrow is not yet suited for partial writes. The
>> specification
>> > > > only
>> > > > > > > > > > talks about fully frozen/immutable objects, you're in
>> > > > > > implementation
>> > > > > > > > > > defined territory here. For example, the C++ library
>> assumes
>> > > > the
>> > > > > > Array
>> > > > > > > > > > object is immutable; it memoizes the null count, and
>> likely
>> > > more
>> > > > > > > > > > statistics in the future.
>> > > > > > > > > >
>> > > > > > > > > > If you want to use pre-allocated buffers and array, you
>> can
>> > > > use the
>> > > > > > > > > > column validity bitmap for this purpose, e.g. set all
>> null by
>> > > > > > default
>> > > > > > > > > > and flip once the row is written. It suffers from
>> concurrency
>> > > > > > issues
>> > > > > > > > > > (+ invalidation issues as pointed) when dealing with
>> multiple
>> > > > > > columns.
>> > > > > > > > > > You'll have to use a barrier of some kind, e.g. a
>> per-batch
>> > > > global
>> > > > > > > > > > atomic (if append-only), or dedicated column(s) à-la
>> MVCC.
>> > > But
>> > > > > > then,
>> > > > > > > > > > the reader needs to be aware of this and compute a mask
>> each
>> > > > time
>> > > > > > it
>> > > > > > > > > > needs to query the partial batch.
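A sketch of the bit-flip publication described here, using Arrow's least-significant-bit numbering (names invented; single-column only, so it sidesteps the multi-column concurrency caveat above):

```python
CAPACITY = 16
validity = bytearray((CAPACITY + 7) // 8)    # all bits clear => all null
values = [0.0] * CAPACITY                    # preallocated column

def publish(i, v):
    values[i] = v                            # write the value first...
    validity[i // 8] |= 1 << (i % 8)         # ...then flip the bit (LSB order)

def is_valid(i):
    return bool(validity[i // 8] & (1 << (i % 8)))

publish(0, 1.25)
assert is_valid(0) and not is_valid(1)
```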
>> > > > > > > > > >
>> > > > > > > > > > This is a common columnar database problem, see [1] for
>> a
>> > > > recent
>> > > > > > paper
>> > > > > > > > > > on the subject. The usual technique is to store the
>> recent
>> > > data
>> > > > > > > > > > row-wise, and transform it in column-wise when a
>> threshold is
>> > > > met
>> > > > > > akin
>> > > > > > > > > > to a compaction phase. There was a somewhat related
>> thread
>> > > [2]
>> > > > > > lately
>> > > > > > > > > > about streaming vs batching. In the end, I think your
>> > > solution
>> > > > > > will be
>> > > > > > > > > > very application specific.
>> > > > > > > > > >
>> > > > > > > > > > François
>> > > > > > > > > >
>> > > > > > > > > > [1]
>> > > https://db.in.tum.de/downloads/publications/datablocks.pdf
>> > > > > > > > > > [2]
>> > > > > > > > > >
>> > > > > > > >
>> > > > > >
>> > > >
>> > >
>> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <
>> > > j...@jgm.org>
>> > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > Wes,
>> > > > > > > > > > >
>> > > > > > > > > > > I’m not afraid of writing my own C++ code to deal
>> with all
>> > > of
>> > > > > > this
>> > > > > > > > on the
>> > > > > > > > > > > writer side.  I just need a way to “append”
>> (incrementally
>> > > > > > populate)
>> > > > > > > > e.g.
>> > > > > > > > > > > feather files so that a person using e.g. pyarrow
>> doesn’t
>> > > > suffer
>> > > > > > some
>> > > > > > > > > > > catastrophic failure... and “on the side” I tell them
>> which
>> > > > rows
>> > > > > > are
>> > > > > > > > junk
>> > > > > > > > > > > and deal with any concurrency issues that can’t be
>> solved
>> > > in
>> > > > the
>> > > > > > > > arena of
>> > > > > > > > > > > atomicity and ordering of ops.  For now I care about
>> basic
>> > > > types
>> > > > > > but
>> > > > > > > > > > > including variable-width strings.
>> > > > > > > > > > >
>> > > > > > > > > > > For event-processing, I think Arrow has to have the
>> concept
>> > > > of a
>> > > > > > > > > > partially
>> > > > > > > > > > > full record set.  Some alternatives are:
>> > > > > > > > > > > - have a batch size of one, thus littering the
>> landscape
>> > > with
>> > > > > > > > trivially
>> > > > > > > > > > > small Arrow buffers
>> > > > > > > > > > > - artificially increase latency with a batch size
>> larger
>> > > than
>> > > > > > one,
>> > > > > > > > but
>> > > > > > > > > > not
>> > > > > > > > > > > processing any data until a batch is complete
>> > > > > > > > > > > - continuously re-write the (entire!) “main” buffer as
>> > > > batches of
>> > > > > > > > length
>> > > > > > > > > > 1
>> > > > > > > > > > > roll in
>> > > > > > > > > > > - instead of one main buffer, several, and at some
>> > > threshold
>> > > > > > combine
>> > > > > > > > the
>> > > > > > > > > > > last N length-1 batches into a length N buffer ...
>> still an
>> > > > > > > > inefficiency
>> > > > > > > > > > >
>> > > > > > > > > > > Consider the case of QAbstractTableModel as the
>> underlying
>> > > > data
>> > > > > > for a
>> > > > > > > > > > table
>> > > > > > > > > > > or a chart.  This visualization shows all of the data
>> for
>> > > the
>> > > > > > recent
>> > > > > > > > past
>> > > > > > > > > > > as well as events rolling in.  If this model
>> interface is
>> > > > > > > > implemented as
>> > > > > > > > > > a
>> > > > > > > > > > > view onto “many thousands” of individual event
>> buffers then
>> > > > we
>> > > > > > gain
>> > > > > > > > > > nothing
>> > > > > > > > > > > from columnar layout.  (Suppose there are tons of
>> columns
>> > > and
>> > > > > > most of
>> > > > > > > > > > them
>> > > > > > > > > > > are scrolled out of the view.). Likewise we cannot
>> re-write
>> > > > the
>> > > > > > > > entire
>> > > > > > > > > > > model on each event... time complexity blows up.
>> What we
>> > > > want
>> > > > > > is to
>> > > > > > > > > > have a
>> > > > > > > > > > > large pre-allocated chunk and just change rowCount()
>> as
>> > > data
>> > > > is
>> > > > > > > > > > “appended.”
>> > > > > > > > > > >  Sure, we may run out of space and have another and
>> another
>> > > > > > chunk for
>> > > > > > > > > > > future row ranges, but a handful of chunks chained
>> together
>> > > > is
>> > > > > > better
>> > > > > > > > > > than
>> > > > > > > > > > > as many chunks as there were events!
>> > > > > > > > > > >
>> > > > > > > > > > > And again, having a batch size >1 and delaying the
>> data
>> > > > until a
>> > > > > > > > batch is
>> > > > > > > > > > > full is a non-starter.
>> > > > > > > > > > >
>> > > > > > > > > > > I am really hoping to see partially-filled buffers as
>> > > > something
>> > > > > > we
>> > > > > > > > keep
>> > > > > > > > > > our
>> > > > > > > > > > > finger on moving forward!  Or else, what am I missing?
>> > > > > > > > > > >
>> > > > > > > > > > > -John
>> > > > > > > > > > >
>> > > > > > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <
>> > > > wesmck...@gmail.com
>> > > > > > >
>> > > > > > > > wrote:
>> > > > > > > > > > >
>> > > > > > > > > > > > hi John,
>> > > > > > > > > > > >
>> > > > > > > > > > > > In C++ the builder classes don't yet support
>> writing into
>> > > > > > > > preallocated
>> > > > > > > > > > > > memory. It would be tricky for applications to
>> determine
>> > > a
>> > > > > > priori
>> > > > > > > > > > > > which segments of memory to pass to the builder. It
>> seems
>> > > > only
>> > > > > > > > > > > > feasible for primitive / fixed-size types so my
>> guess
>> > > > would be
>> > > > > > > > that a
>> > > > > > > > > > > > separate set of interfaces would need to be
>> developed for
>> > > > this
>> > > > > > > > task.
>> > > > > > > > > > > >
>> > > > > > > > > > > > - Wes
>> > > > > > > > > > > >
>> > > > > > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <
>> > > > > > jacq...@apache.org>
>> > > > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > This is more of a question of implementation
>> versus
>> > > > > > > > specification. An
>> > > > > > > > > > > > arrow
>> > > > > > > > > > > > > buffer is generally built and then sealed. In
>> different
>> > > > > > > > languages,
>> > > > > > > > > > this
>> > > > > > > > > > > > > building process works differently (a concern of
>> the
>> > > > language
>> > > > > > > > rather
>> > > > > > > > > > than
>> > > > > > > > > > > > > the memory specification). We don't currently
>> allow a
>> > > > half
>> > > > > > built
>> > > > > > > > > > vector
>> > > > > > > > > > > > to
>> > > > > > > > > > > > > be moved to another language and then be further
>> built.
>> > > > So
>> > > > > > the
>> > > > > > > > > > question
>> > > > > > > > > > > > is
>> > > > > > > > > > > > > really more concrete: what language are you
>> looking at
>> > > > and
>> > > > > > what
>> > > > > > > > is
>> > > > > > > > > > the
>> > > > > > > > > > > > > specific pattern you're trying to undertake for
>> > > building.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > If you're trying to go across independent
>> processes
>> > > > (whether
>> > > > > > the
>> > > > > > > > same
>> > > > > > > > > > > > > process restarted or two separate processes active
>> > > > > > > > simultaneously)
>> > > > > > > > > > you'll
>> > > > > > > > > > > > > need to build up your own data structures to help
>> with
>> > > > this.
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <
>> > > > j...@jgm.org
>> > > > > > >
>> > > > > > > > wrote:
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > > Hello,
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Glad to learn of this project— good work!
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > If I allocate a single chunk of memory and start
>> > > > building
>> > > > > > Arrow
>> > > > > > > > > > format
>> > > > > > > > > > > > > > within it, does this chunk save any state
>> regarding
>> > > my
>> > > > > > > > progress?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > For example, suppose I allocate a column for
>> floating
>> > > > point
>> > > > > > > > (fixed
>> > > > > > > > > > > > width)
>> > > > > > > > > > > > > > and a column for string (variable width).
>> Suppose I
>> > > > start
>> > > > > > > > > > building the
>> > > > > > > > > > > > > > floating point column at offset X into my single
>> > > > buffer,
>> > > > > > and
>> > > > > > > > the
>> > > > > > > > > > string
>> > > > > > > > > > > > > > “pointer” column at offset Y into the same
>> single
>> > > > buffer,
>> > > > > > and
>> > > > > > > > the
>> > > > > > > > > > > > string
>> > > > > > > > > > > > > > data elements at offset Z.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I write one floating point number and one
>> string,
>> > > then
>> > > > go
>> > > > > > away.
>> > > > > > > > > > When I
>> > > > > > > > > > > > > > come back to this buffer to append another
>> value,
>> > > does
>> > > > the
>> > > > > > > > buffer
>> > > > > > > > > > > > itself
>> > > > > > > > > > > > > > know where I would begin?  I.e. is there a
>> > > > differentiation
>> > > > > > in
>> > > > > > > > the
>> > > > > > > > > > > > column
>> > > > > > > > > > > > > > (or blob) data itself between the available
>> space and
>> > > > the
>> > > > > > used
>> > > > > > > > > > space?
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Suppose I write a lot of large variable width
>> strings
>> > > > and
>> > > > > > “run
>> > > > > > > > > > out” of
>> > > > > > > > > > > > > > space for them before running out of space for
>> > > floating
>> > > > > > point
>> > > > > > > > > > numbers
>> > > > > > > > > > > > or
>> > > > > > > > > > > > > > string pointers.  (I guessed badly when doing
>> the
>> > > > original
>> > > > > > > > > > > > allocation.). I
>> > > > > > > > > > > > > > consider this to be Ok since I can always
>> “copy” the
>> > > > data
>> > > > > > to
>> > > > > > > > > > “compress
>> > > > > > > > > > > > out”
>> > > > > > > > > > > > > > the unused fp/pointer buckets... the choice is
>> up to
>> > > > me.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > The above applied to a (feather?) file is how I
>> > > > anticipate
>> > > > > > > > > > appending
>> > > > > > > > > > > > data
>> > > > > > > > > > > > > > to disk... pre-allocate a mem-mapped file and
>> > > gradually
>> > > > > > fill
>> > > > > > > > it up.
>> > > > > > > > > > > > The
>> > > > > > > > > > > > > > efficiency of file utilization will depend on my
>> > > > > > projections
>> > > > > > > > > > regarding
>> > > > > > > > > > > > > > variable-width data types, but as I said above,
>> I can
>> > > > > > always
>> > > > > > > > > > re-write
>> > > > > > > > > > > > the
>> > > > > > > > > > > > > > file if/when this bothers me.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Is this the recommended and supported approach
>> for
>> > > > > > incremental
>> > > > > > > > > > appends?
>> > > > > > > > > > > > > > I’m really hoping to use Arrow instead of
>> rolling my
>> > > > own,
>> > > > > > but
>> > > > > > > > > > > > functionality
>> > > > > > > > > > > > > > like this is absolutely key!  Hoping not to use
>> a
>> > > > side-car
>> > > > > > > > file (or
>> > > > > > > > > > > > memory
>> > > > > > > > > > > > > > chunk) to store “append progress” information.
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > I am brand new to this project so please
>> forgive me
>> > > if
>> > > > I
>> > > > > > have
>> > > > > > > > > > > > overlooked
>> > > > > > > > > > > > > > something obvious.  And again, looks like great
>> work
>> > > so
>> > > > > > far!
>> > > > > > > > > > > > > >
>> > > > > > > > > > > > > > Thanks!
>> > > > > > > > > > > > > > -John
>> > > > > > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > >
>> > > > > >
>> > > >
>> > >
>>
>
