"perhaps the right way forward is to start by gathering a number of interested parties and start designing a proposal"

YES! How do we go about this?

"There are some early experiments to populate Arrow nodes in microbatches from Kafka" (cf. link in thread) Who did this?

-John

On Mon, May 13, 2019 at 9:39 AM Antoine Pitrou <anto...@python.org> wrote:

Hi John,

We are strongly committed to backwards compatibility in the Arrow format specification. You should not fear any compatibility-breaking changes in the future. People sometimes express uncertainty because we have not reached 1.0 yet, but that's because we have not yet implemented all the data types we want to be in that spec.

As for the general goal of making Arrow more suitable for event processing, perhaps the right way forward is to start by gathering a number of interested parties and start designing a proposal (which may or may not include spec additions).

Regards

Antoine.

On 13/05/2019 at 15:38, John Muehlhausen wrote:

Micah, yes, it all works at the moment. How have we staked out that it will always work in the future as people continue to work on the spec? That is my concern.

Also, it would be extremely useful if someone opening a file had my nil rows hidden from them without needing to analyze the app-specific side-car data.

I believe that something like my solution is how everyone will do efficient event processing with Arrow, so I believe it is worth a broader discussion.

On Mon, May 13, 2019 at 8:30 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Hi John,

To expand on this, I don't think there is anything preventing you in the current spec from over-provisioning the underlying buffers. So you can effectively split "capacity" from "length" by subtracting the amount of space taken by the rows indicated in the batch from the size of the buffer. For variable-width types you would have to reference the last value in the offset buffer to determine used capacity.
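Micah's over-provisioning idea can be reduced to a short schematic sketch (plain Python modeling the buffers; `PreallocatedStringColumn` and its methods are illustrative names, not the Arrow C++ API):

```python
class PreallocatedStringColumn:
    """Variable-width string column whose buffers are over-provisioned:
    'capacity' is the allocation, 'length' is the rows actually populated."""

    def __init__(self, row_capacity, data_capacity):
        self.offsets = [0] * (row_capacity + 1)  # Arrow-style: length + 1 offsets
        self.data = bytearray(data_capacity)     # over-provisioned blob storage
        self.row_capacity = row_capacity
        self.length = 0                          # populated rows ("length")

    def used_data(self):
        # "reference the last value in the offset buffer to determine used capacity"
        return self.offsets[self.length]

    def try_append(self, s):
        b = s.encode("utf-8")
        start = self.used_data()
        if self.length == self.row_capacity or start + len(b) > len(self.data):
            return False  # out of capacity: the caller starts the next batch
        self.data[start:start + len(b)] = b
        self.offsets[self.length + 1] = start + len(b)
        self.length += 1
        return True

    def get(self, i):
        assert i < self.length, "row not populated"
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]]).decode("utf-8")
```

A failed `try_append` is the signal, per the next paragraph of the thread, to stop incrementing this batch's count and move on to a fresh batch.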
When appending, if you run out of memory in a particular buffer, you don't increment the count on the batch and simply append to the next one.

This is restating parts of the thread, but I don't think the C++ code base has any facility for this directly, and if you want to be parsimonious with memory you would have to rewrite batches at some point.

Apologies if I missed something; as Wes said, this is a long thread.

Thanks,
Micah

On Mon, May 13, 2019 at 6:07 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

Sorry, there are a number of fairly long e-mails in this thread; I'm having a hard time following all of the details.

I suspect the most parsimonious thing would be to have some "sidecar" metadata that tracks the state of your writes into pre-allocated Arrow blocks so that readers know to call "Slice" on the blocks to obtain only the written-so-far portion. I'm not likely to be in favor of making changes to the binary protocol for this use case; if others have opinions I'll let them speak for themselves.

- Wes

On Mon, May 13, 2019 at 7:50 AM John Muehlhausen <j...@jgm.org> wrote:

Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow std::vector terminology)

Thanks,
John

On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote:

Wes et al, I think my core proposal is that Message.fbs:RecordBatch split the "length" parameter into "theoretical max length" and "utilized length" (perhaps not those exact names).

"theoretical max length" is the same as "length" now ...
("/// ... The arrays in the batch should all have this")

"utilized length" is the number of rows (starting from the first one) that actually contain interesting data; the rest do not.

The reason we can have a RecordBatch where these numbers are not the same is that the RecordBatch space was preallocated (for performance reasons) and the number of rows that actually "fit" depends on how correct the preallocation was. In any case, it gives the user control of this space/time tradeoff: wasted space in order to save time in record batch construction. The fact that some space will usually be wasted when there are variable-length columns (barring extreme luck) with this batch construction paradigm explains the word "theoretical" above. This also gives us the ability to look at a partially constructed batch that is still being constructed, given appropriate user-supplied concurrency control.

I am not an expert in all of the Arrow variable-length data types, but I think this works if they are all similar to variable-length strings, where we advance through "blob storage" by setting the indexes into that storage for the current and next row in order to indicate that we have incrementally consumed more blob storage. (Conceptually this storage is "unallocated" after the pre-allocation and before rows are populated.)

At a high level I am seeking to shore up the format for event ingress into real-time analytics that have some look-back window. If I'm not mistaken, I think this is the subject of the last multi-sentence paragraph here?: https://zd.net/2H0LlBY

Currently we have a less-efficient paradigm where "microbatches" (e.g.
of length 1 for minimal latency) have to spin the CPU periodically in order to be combined into buffers where we get the columnar layout benefit. With pre-allocation we can deal with microbatches (a partially populated larger RecordBatch) and immediately have the columnar layout benefits for the populated section with no additional computation.

For example, consider an event processing system that calculates a "moving average" as events roll in. While this is somewhat contrived, let's assume that the moving average window is 1000 periods and our pre-allocation ("theoretical max length") of RecordBatch elements is 100. The algorithm would be something like this, for a list of RecordBatch buffers in memory:

    initialization():
      set up configuration of expected variable-length storage requirements,
      e.g. the template RecordBatch mentioned below

    onIncomingEvent(event):
      obtain lock  /// cf. swoopIn() below
      if last RecordBatch utilized length is not less than theoretical max length,
          or variable-length components of "event" will not fit in remaining blob storage:
        create a new RecordBatch pre-allocation of max utilized length 100 and
        with blob preallocation that is max(expected, event) ... in case the single
        event is larger than the expectation for 100 events
        (note: in the expected case this can be very fast, as it is a
        malloc() and a memcpy() from a template!)
        set current RecordBatch to this newly created one
      add event to current RecordBatch (for the non-calculated fields)
      increment utilized length of current RecordBatch
      calculate the calculated fields (in this case, moving average) by
      looking back at previous rows in this and previous RecordBatch objects
      free() any RecordBatch objects that are now before the lookback window

    swoopIn():  /// somebody wants to chart the lookback window
      obtain lock
      visit all of the relevant data in the RecordBatches to construct the chart
      /// notice that the last RecordBatch may not yet be "as full as possible"

The above analysis (minus the free()) could apply to the IPC file format: the lock could be a file lock and the swoopIn() could be a separate process. In the case of the file format, while the file is locked, a new RecordBatch would overwrite the previous file Footer and a new Footer would be written. In order to be able to delete or archive old data, multiple files could be strung together in a logical series.

-John

On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com> wrote:

On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org> wrote:

Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads the future Feather format? If not, how will the future format differ? I will work on my access pattern with this format instead of the current feather format. Sorry I was not clear on that earlier.

Yes, under the hood those will use the same zero-copy binary protocol code paths to read the file.

Micah, thank you!
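John's onIncomingEvent() pseudocode earlier in the thread can be reduced to a runnable plain-Python sketch. The `Batch` class is a stand-in for a preallocated RecordBatch, not an Arrow type; the O(n) window sum and the module-level state are for clarity only:

```python
from collections import deque

BATCH_CAPACITY = 100   # "theoretical max length" of each preallocated batch
WINDOW = 1000          # look-back window, in rows

class Batch:
    """Stand-in for a preallocated RecordBatch: fixed capacity, growing utilized length."""
    def __init__(self, capacity):
        self.values = [0.0] * capacity   # preallocated data column
        self.avg = [0.0] * capacity      # preallocated calculated column
        self.capacity = capacity
        self.length = 0                  # "utilized length"

batches = deque()              # the chain of RecordBatch stand-ins
window = deque(maxlen=WINDOW)  # values currently inside the look-back window

def on_incoming_event(value):
    # start a fresh preallocation when the current batch is full
    if not batches or batches[-1].length == batches[-1].capacity:
        batches.append(Batch(BATCH_CAPACITY))
    b = batches[-1]
    b.values[b.length] = value
    window.append(value)
    b.avg[b.length] = sum(window) / len(window)  # moving average over the look-back
    b.length += 1                                # publish the row last
    # free() analogue: drop batches that fell wholly outside the window
    while sum(x.length for x in batches) - batches[0].length >= WINDOW:
        batches.popleft()
```

The lock from the pseudocode is omitted; a real writer would hold it around the body of `on_incoming_event` so a `swoopIn()`-style reader sees a consistent chain.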
On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Hi John,

To give a specific pointer, [1] describes how the streaming protocol is stored to a file.

[1] https://arrow.apache.org/docs/format/IPC.html#file-format

On Tue, May 7, 2019 at 9:40 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

As soon as the R folks can install the Arrow R package consistently, the intent is to replace the Feather internals with the plain Arrow IPC protocol, where we have much better platform support all around.

If you'd like to experiment with creating an API for pre-allocating fixed-size Arrow protocol blocks and then mutating the data and metadata on disk in-place, please be our guest. We don't have the tools developed yet to do this for you.

- Wes

On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org> wrote:

Thanks Wes:

"the current Feather format is deprecated" ... yes, but there will be a future file format that replaces it, correct? And my discussion of immutable "portions" of Arrow buffers, rather than immutability of the entire buffer, applies to IPC as well, right? I am only championing the idea that this future file format have the convenience that certain preallocated rows can be ignored based on a metadata setting.

"I recommend batching your writes" ...
this is extremely inefficient and adds unacceptable latency, relative to the proposed solution. Do you disagree? Either I have a batch length of 1 to minimize latency, which eliminates columnar advantages on the read side, or else I add latency. Neither works, and it seems that a viable alternative is within sight?

If you don't agree that there is a core issue and opportunity here, I'm not sure how to better make my case....

-John

On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <j...@jgm.org> wrote:

> Wes et al, I completed a preliminary study of populating a Feather file
> incrementally. Some notes and questions:
>
> I wrote the following dataframe to a feather file:
>
>                 a    b
>     0  0123456789  0.0
>     1  0123456789  NaN
>     2  0123456789  NaN
>     3  0123456789  NaN
>     4        None  NaN
>
> In re-writing the flatbuffers metadata (flatc -p doesn't
> support --gen-mutable! yuck! C++ to the rescue...), it seems that
> read_feather is not affected by NumRows? It seems to be driven entirely
> by the per-column Length values?
>
> Also, it seems as if one of the primary usages of NullCount is to
> determine whether or not a bitfield is present? In the initialization
> data above I was careful to have a null value in each column in order
> to generate a bitfield.

Per my prior e-mails, the current Feather format is deprecated, so I'm only willing to engage on a discussion of the official Arrow binary protocol that we use for IPC (memory mapping) and RPC (Flight).

> I then wiped the bitfields in the file and set all of the string
> indices to one past the end of the blob buffer (all strings empty):
>
>           a    b
>     0  None  NaN
>     1  None  NaN
>     2  None  NaN
>     3  None  NaN
>     4  None  NaN
>
> I then set the first record to some data by consuming some of the
> string blob and row 0 and 1 indices, also setting the double:
>
>                    a    b
>     0  Hello, world!  5.0
>     1           None  NaN
>     2           None  NaN
>     3           None  NaN
>     4           None  NaN
>
> As mentioned above, NumRows seems to be ignored. I tried adjusting
> each column Length to mask off higher rows and it worked for 4 (hide
> last row) but not for less than 4. I take this to be due to math used
> to figure out where the buffers are relative to one another, since
> there is only one metadata offset for all of: the (optional) bitset,
> the index column and (if string) the blobs.
> Populating subsequent rows would proceed in a similar way until all of
> the blob storage has been consumed, which may come before the
> pre-allocated rows have been consumed.
>
> So what does this mean for my desire to incrementally write these
> (potentially memory-mapped) pre-allocated files and/or Arrow buffers in
> memory? Some thoughts:
>
> - If a single value (such as NumRows) were consulted to determine the
> table row dimension, then updating this single value would serve to
> tell a reader which rows are relevant. Assuming this value is updated
> after all other mutations are complete, and assuming that mutations are
> only appends (addition of rows), then concurrency control involves only
> ensuring that this value is not examined while it is being written.
>
> - NullCount presents a concurrency problem if someone reads the file
> after this field has been updated, but before NumRows has exposed the
> new record (or vice versa). The idea previously mentioned that there
> will "likely [be] more statistics in the future" feels like it might be
> scope creep to me? This is a data representation, not a calculation
> framework?
> If NullCount had its genesis in the optional nature of the bitfield, I
> would suggest that perhaps NullCount can be dropped in favor of always
> supplying the bitfield, which in any event is already contemplated by
> the spec: "Implementations may choose to always allocate one anyway as
> a matter of convenience." If the concern is space savings, Arrow is
> already an extremely uncompressed format. (Compression is something I
> would also consider to be scope creep as regards Feather... compressed
> filesystems can be employed and there are other compressed dataframe
> formats.) However, if protecting the 4 bytes required to update
> NumRows turns out to be no easier than protecting all of the
> statistical bytes as well, as part of the same "critical section"
> (locks: yuck!!), then statistics pose no issue. I have a feeling that
> the availability of an atomic write of 4 bytes will depend on the
> storage mechanism... memory vs memory map vs write() etc.
>
> - The elephant in the room appears to be the presumptive binary yes/no
> on mutability of Arrow buffers. Perhaps the thought is that certain
> batch processes will be wrecked if anyone anywhere is mutating buffers
> in any way. But keep in mind I am not proposing general mutability,
> only appending of new data. *A huge amount of batch processing that
> will take place with Arrow is on time-series data (whether financial
> or otherwise). It is only natural that architects will want the
> minimal impedance mismatch when it comes time to grow those time
> series as the events occur going forward.* So rather than say that I
> want "mutable" Arrow buffers, I would pitch this as a call for
> "immutable populated areas" of Arrow buffers, combined with the
> concept that the populated area can grow up to whatever was
> preallocated. This will not affect anyone who has "memoized" a
> dimension and wants to continue to consider the then-current data as
> immutable... it will be immutable and will always be immutable
> according to that then-current dimension.
>
> Thanks in advance for considering this feedback! I absolutely require
> efficient row-wise growth of an Arrow-like buffer to deal with time
> series data in near real time. I believe that preallocation is (by
> far) the most efficient way to accomplish this. I hope to be able to
> use Arrow! If I cannot use Arrow then I will be using a home-grown
> Arrow that is identical except for this feature, which would be very
> sad!
Even > >> if > >>>>>> Arrow > >>>>>>>>> itself > >>>>>>>>>>>> could be used in this manner today, I would be hesitant > >> to > >>>>>> use it > >>>>>>>> if > >>>>>>>>> the > >>>>>>>>>>>> use-case was not protected on a go-forward basis. > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I recommend batching your writes and using the Arrow > >> binary > >>>>>> streaming > >>>>>>>>>>> protocol so you are only appending to a file rather than > >>>>>> mutating > >>>>>>>>>>> previously-written bytes. This use case is well defined > >> and > >>>>>> supported > >>>>>>>>>>> in the software already. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>> > >>> > >> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format > >>>>>>>>>>> > >>>>>>>>>>> - Wes > >>>>>>>>>>> > >>>>>>>>>>>> Of course, I am completely open to alternative ideas and > >>>>>>>> approaches! > >>>>>>>>>>>> > >>>>>>>>>>>> -John > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, May 6, 2019 at 11:39 AM Wes McKinney < > >>>>>> wesmck...@gmail.com> > >>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> hi John -- again, I would caution you against using > >>> Feather > >>>>>> files > >>>>>>>>> for > >>>>>>>>>>>>> issues of longevity -- the internal memory layout of > >>> those > >>>>>> files > >>>>>>>>> is a > >>>>>>>>>>>>> "dead man walking" so to speak. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I would advise against forking the project, IMHO that > >>> is a > >>>>>> dark > >>>>>>>>> path > >>>>>>>>>>>>> that leads nowhere good. We have a large community > >> here > >>> and > >>>>>> we > >>>>>>>>> accept > >>>>>>>>>>>>> pull requests -- I think the challenge is going to be > >>>>>> defining > >>>>>>>> the > >>>>>>>>> use > >>>>>>>>>>>>> case to suitable clarity that a general purpose > >> solution > >>>>>> can be > >>>>>>>>>>>>> developed. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>> - Wes > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, May 6, 2019 at 11:16 AM John Muehlhausen < > >>>>>> j...@jgm.org> > >>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> François, Wes, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for the feedback. I think the most practical > >>>>>> thing for > >>>>>>>>> me to > >>>>>>>>>>> do > >>>>>>>>>>>>> is > >>>>>>>>>>>>>> 1- write a Feather file that is structured to > >>>>>> pre-allocate the > >>>>>>>>> space > >>>>>>>>>>> I > >>>>>>>>>>>>> need > >>>>>>>>>>>>>> (e.g. initial variable-length strings are of average > >>> size) > >>>>>>>>>>>>>> 2- come up with code to monkey around with the > >> values > >>>>>> contained > >>>>>>>>> in > >>>>>>>>>>> the > >>>>>>>>>>>>>> vectors so that before and after each manipulation > >> the > >>>>>> file is > >>>>>>>>> valid > >>>>>>>>>>> as I > >>>>>>>>>>>>>> walk the rows ... this is a writer that uses memory > >>>>>> mapping > >>>>>>>>>>>>>> 3- check back in here once that works, assuming that > >>> it > >>>>>> does, > >>>>>>>> to > >>>>>>>>> see > >>>>>>>>>>> if > >>>>>>>>>>>>> we > >>>>>>>>>>>>>> can bless certain mutation paths > >>>>>>>>>>>>>> 4- if we can't bless certain mutation paths, fork > >> the > >>>>>> project > >>>>>>>> for > >>>>>>>>>>> those > >>>>>>>>>>>>> who > >>>>>>>>>>>>>> care more about stream processing? That would not > >>> seem > >>>>>> to be > >>>>>>>>> ideal > >>>>>>>>>>> as I > >>>>>>>>>>>>>> think mutation in row-order across the data set is > >>>>>> relatively > >>>>>>>> low > >>>>>>>>>>> impact > >>>>>>>>>>>>> on > >>>>>>>>>>>>>> the overall design? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks again for engaging the topic! > >>>>>>>>>>>>>> -John > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Mon, May 6, 2019 at 10:26 AM Francois > >>> Saint-Jacques < > >>>>>>>>>>>>>> fsaintjacq...@gmail.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hello John, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Arrow is not yet suited for partial writes. 
The specification only talks about fully frozen/immutable objects; you're in implementation-defined territory here. For example, the C++ library assumes the Array object is immutable; it memoizes the null count, and likely more statistics in the future.

If you want to use pre-allocated buffers and arrays, you can use the column validity bitmap for this purpose, e.g. set all rows null by default and flip the bit once the row is written. It suffers from concurrency issues (plus the invalidation issues pointed out above) when dealing with multiple columns. You'll have to use a barrier of some kind, e.g. a per-batch global atomic (if append-only), or dedicated column(s) à la MVCC. But then the reader needs to be aware of this and compute a mask each time it needs to query the partial batch.

This is a common columnar database problem; see [1] for a recent paper on the subject. The usual technique is to store the recent data row-wise and transform it to column-wise when a threshold is met, akin to a compaction phase. There was a somewhat related thread [2] lately about streaming vs batching. In the end, I think your solution will be very application-specific.
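François's set-all-null-then-flip pattern might look like the following schematic plain-Python model (not the Arrow C++ API; `NullableInt64Column` and its methods are illustrative names):

```python
class NullableInt64Column:
    """Preallocated column where the validity bitmap doubles as the publish flag."""

    def __init__(self, capacity):
        self.validity = bytearray((capacity + 7) // 8)  # all bits 0 => all rows null
        self.values = [0] * capacity                    # preallocated value buffer

    def write_row(self, i, value):
        self.values[i] = value                 # 1. write the data...
        self.validity[i // 8] |= 1 << (i % 8)  # 2. ...then flip the bit to publish

    def is_valid(self, i):
        return bool(self.validity[i // 8] & (1 << (i % 8)))
```

As the message notes, with multiple columns this per-column flip is no longer sufficient on its own; some per-row or per-batch barrier is still needed so a reader never sees a half-written row as valid.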
François

[1] https://db.in.tum.de/downloads/publications/datablocks.pdf
[2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E

On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <j...@jgm.org> wrote:

Wes,

I'm not afraid of writing my own C++ code to deal with all of this on the writer side. I just need a way to "append" (incrementally populate) e.g. feather files so that a person using e.g. pyarrow doesn't suffer some catastrophic failure... and "on the side" I tell them which rows are junk and deal with any concurrency issues that can't be solved in the arena of atomicity and ordering of ops. For now I care about basic types, but including variable-width strings.

For event-processing, I think Arrow has to have the concept of a partially full record set. Some alternatives are:

- have a batch size of one, thus littering the landscape with trivially small Arrow buffers
- artificially increase latency with a batch size larger than one, but not processing any data until a batch is complete
- continuously re-write the (entire!) "main" buffer as batches of length 1 roll in
- instead of one main buffer, several, and at some threshold combine the last N length-1 batches into a length-N buffer ... still an inefficiency

Consider the case of QAbstractTableModel as the underlying data for a table or a chart. This visualization shows all of the data for the recent past as well as events rolling in. If this model interface is implemented as a view onto "many thousands" of individual event buffers, then we gain nothing from columnar layout. (Suppose there are tons of columns and most of them are scrolled out of the view.) Likewise we cannot re-write the entire model on each event... time complexity blows up.
What we want is to have a large pre-allocated chunk and just change rowCount() as data is "appended." Sure, we may run out of space and have another and another chunk for future row ranges, but a handful of chunks chained together is better than as many chunks as there were events!

And again, having a batch size >1 and delaying the data until a batch is full is a non-starter.

I am really hoping to see partially-filled buffers as something we keep our finger on moving forward! Or else, what am I missing?

-John

On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

In C++ the builder classes don't yet support writing into preallocated memory. It would be tricky for applications to determine a priori which segments of memory to pass to the builder. It seems only feasible for primitive / fixed-size types, so my guess would be that a separate set of interfaces would need to be developed for this task.
- Wes

On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:

This is more of a question of implementation versus specification. An Arrow buffer is generally built and then sealed. In different languages this building process works differently (a concern of the language rather than of the memory specification). We don't currently allow a half-built vector to be moved to another language and then be further built. So the question is really more concrete: what language are you looking at, and what is the specific pattern you're trying to undertake for building?

If you're trying to go across independent processes (whether the same process restarted or two separate processes active simultaneously), you'll need to build up your own data structures to help with this.

On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:

Hello,

Glad to learn of this project — good work!
If I allocate a single chunk of memory and start building Arrow format within it, does this chunk save any state regarding my progress?

For example, suppose I allocate a column for floating point (fixed width) and a column for string (variable width). Suppose I start building the floating point column at offset X into my single buffer, the string "pointer" column at offset Y into the same single buffer, and the string data elements at offset Z.

I write one floating point number and one string, then go away. When I come back to this buffer to append another value, does the buffer itself know where I would begin? I.e., is there a differentiation in the column (or blob) data itself between the available space and the used space?

Suppose I write a lot of large variable-width strings and "run out" of space for them before running out of space for floating point numbers or string pointers.
(I guessed badly when doing the original allocation.) I consider this to be OK since I can always "copy" the data to "compress out" the unused fp/pointer buckets... the choice is up to me.

The above, applied to a (Feather?) file, is how I anticipate appending data to disk... pre-allocate a mem-mapped file and gradually fill it up. The efficiency of file utilization will depend on my projections regarding variable-width data types, but as I said above, I can always re-write the file if/when this bothers me.

Is this the recommended and supported approach for incremental appends? I'm really hoping to use Arrow instead of rolling my own, but functionality like this is absolutely key! Hoping not to use a side-car file (or memory chunk) to store "append progress" information.

I am brand new to this project, so please forgive me if I have overlooked something obvious. And again, looks like great work so far!

Thanks!
-John