This is already implicit in the spec: it requires 8-byte alignment and recommends 64-byte padding. I'd be OK updating the spec to explicitly state that buffers might be oversized, but I agree with Wes that I don't think a format change is warranted.
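The rounding implied by the spec's alignment and padding rules can be sketched as follows (a minimal illustration of the arithmetic only, not Arrow library code):

```python
def padded_size(nbytes: int, alignment: int = 64) -> int:
    """Round a buffer length up to a multiple of `alignment`.

    The Arrow spec requires 8-byte alignment and recommends padding
    to 64 bytes; either way the allocation may be larger than the
    bytes actually holding data, i.e. buffers can be "oversized".
    """
    return ((nbytes + alignment - 1) // alignment) * alignment


# A 13-byte string buffer occupies a 64-byte allocation under the
# recommended padding; the trailing 51 bytes are padding a reader
# must simply ignore.
print(padded_size(13))      # 64
print(padded_size(13, 8))   # 16
print(padded_size(64))      # 64
```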
On Mon, May 13, 2019 at 6:29 AM John Muehlhausen <j...@jgm.org> wrote: > Thanks Wes, do you have any comment on the following from the zdnet story I > linked? > > ``But the missing piece is streaming, where the velocity of incoming data > poses a special challenge. There are some early experiments to populate > Arrow nodes in microbatches from Kafka. And, as the edge gets smarter > (especially as machine learning is applied), it will also make sense for > Arrow to emerge in a small footprint version, and with it, harvesting some > of the work around transport for feeding filtered or aggregated data up to > the cloud.’’ > > Specifically, do you view Arrow as a data structure that bridges the batch > and event processing worlds? > > I am concerned that with side-car data for the distinction between size and > capacity, someone could rather easily change the Arrow internals spec in > the future such that incremental population (with pre-allocation) is no > longer possible. By coding this distinction in RecordBatch we are saying > to the future: “Don’t assume this won’t be incrementally populated! Don’t > assume this hasn’t over-allocated something because of actual data not > matching expected data!” > > -John > > On Mon, May 13, 2019 at 8:07 AM Wes McKinney <wesmck...@gmail.com> wrote: > > > hi John, > > > > Sorry, there's a number of fairly long e-mails in this thread; I'm > > having a hard time following all of the details. > > > > I suspect the most parsimonious thing would be to have some "sidecar" > > metadata that tracks the state of your writes into pre-allocated Arrow > > blocks so that readers know to call "Slice" on the blocks to obtain > > only the written-so-far portion. I'm not likely to be in favor of > > making changes to the binary protocol for this use case; if others > > have opinions I'll let them speak for themselves. 
> > > > - Wes > > > > On Mon, May 13, 2019 at 7:50 AM John Muehlhausen <j...@jgm.org> wrote: > > > > > > Any thoughts on a RecordBatch distinguishing size from capacity? (To > > borrow > > > std::vector terminology) > > > > > > Thanks, > > > John > > > > > > On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote: > > > > > > > Wes et al, I think my core proposal is that Message.fbs:RecordBatch > > split > > > > the "length" parameter into "theoretical max length" and "utilized > > length" > > > > (perhaps not those exact names). > > > > > > > > "theoretical max length is the same as "length" now ... /// ...The > > arrays > > > > in the batch should all have this > > > > > > > > "utilized length" are the number of rows (starting from the first > one) > > > > that actually contain interesting data... the rest do not. > > > > > > > > The reason we can have a RecordBatch where these numbers are not the > > same > > > > is that the RecordBatch space was preallocated (for performance > > reasons) > > > > and the number of rows that actually "fit" depends on how correct the > > > > preallocation was. In any case, it gives the user control of this > > > > space/time tradeoff... wasted space in order to save time in record > > batch > > > > construction. The fact that some space will usually be wasted when > > there > > > > are variable-length columns (barring extreme luck) with this batch > > > > construction paradigm explains the word "theoretical" above. This > also > > > > gives us the ability to look at a partially constructed batch that is > > still > > > > being constructed, given appropriate user-supplied concurrency > control. 
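The size/capacity distinction proposed above, and the sidecar-plus-Slice reading pattern Wes describes, amount to the same bookkeeping. A rough sketch in plain Python, with a list standing in for Arrow column storage (the class and method names are illustrative, not an Arrow API):

```python
class PreallocatedBatch:
    """A pre-allocated batch with a fixed capacity ("theoretical max
    length") and a growing utilized length -- std::vector's capacity
    vs. size. The utilized length is the "sidecar" metadata a reader
    consults before slicing."""

    def __init__(self, capacity: int):
        self.capacity = capacity           # rows pre-allocated
        self.utilized = 0                  # rows actually written so far
        self.values = [None] * capacity    # stand-in for column storage

    def append(self, value) -> bool:
        if self.utilized >= self.capacity:
            return False                   # full: caller starts a new batch
        self.values[self.utilized] = value
        self.utilized += 1                 # publish the row last
        return True

    def slice(self) -> list:
        # What a reader does with the sidecar value: take only the
        # written-so-far prefix, analogous to calling Slice on a
        # RecordBatch to obtain the populated portion.
        return self.values[: self.utilized]


batch = PreallocatedBatch(capacity=100)
batch.append(1.5)
batch.append(2.5)
assert batch.slice() == [1.5, 2.5]   # readers never see the unwritten tail
```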
> > > > I am not an expert in all of the Arrow variable-length data types, but I think this works if they are all similar to variable-length strings, where we advance through "blob storage" by setting the indexes into that storage for the current and next row in order to indicate that we have incrementally consumed more blob storage. (Conceptually this storage is "unallocated" after the pre-allocation and before rows are populated.)
> > > >
> > > > At a high level I am seeking to shore up the format for event ingress into real-time analytics that have some look-back window. If I'm not mistaken, I think this is the subject of the last multi-sentence paragraph here?: https://zd.net/2H0LlBY
> > > >
> > > > Currently we have a less efficient paradigm where "microbatches" (e.g. of length 1 for minimal latency) have to spin the CPU periodically in order to be combined into buffers where we get the columnar layout benefit. With pre-allocation we can deal with microbatches (a partially populated larger RecordBatch) and immediately have the columnar layout benefits for the populated section, with no additional computation.
> > > >
> > > > For example, consider an event processing system that calculates a "moving average" as events roll in. While this is somewhat contrived, let's assume that the moving-average window is 1000 periods and our pre-allocation ("theoretical max length") of RecordBatch elements is 100. The algorithm would be something like this, for a list of RecordBatch buffers in memory:
> > > >
> > > > initialization():
> > > >   set up configuration of expected variable-length storage requirements, e.g. the template RecordBatch mentioned below
> > > >
> > > > onIncomingEvent(event):
> > > >   obtain lock /// cf. swoopIn() below
> > > >   if utilized length of last RecordBatch is not less than its theoretical max length, or variable-length components of "event" will not fit in remaining blob storage:
> > > >     create a new RecordBatch pre-allocation of theoretical max length 100 and with blob preallocation of max(expected, event .. in case the single event is larger than the expectation for 100 events)
> > > >     (note: in the expected case this can be very fast as it is a malloc() and a memcpy() from a template!)
> > > >     set current RecordBatch to this newly created one
> > > >   add event to current RecordBatch (for the non-calculated fields)
> > > >   increment utilized length of current RecordBatch
> > > >   calculate the calculated fields (in this case, moving average) by looking back at previous rows in this and previous RecordBatch objects
> > > >   free() any RecordBatch objects that are now before the lookback window
> > > >
> > > > swoopIn(): /// somebody wants to chart the lookback window
> > > >   obtain lock
> > > >   visit all of the relevant data in the RecordBatches to construct the chart /// notice that the last RecordBatch may not yet be "as full as possible"
> > > >
> > > > The above analysis (minus the free()) could apply to the IPC file format: the lock could be a file lock and the swoopIn() could be a separate process. In the case of the file format, while the file is locked, a new RecordBatch would overwrite the previous file Footer and a new Footer would be written. In order to be able to delete or archive old data, multiple files could be strung together in a logical series.
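A runnable rendering of the pseudocode above, under simplifying assumptions: plain Python lists stand in for pre-allocated RecordBatches, a threading lock provides the concurrency control, and a new batch is allocated once the utilized length reaches the pre-allocated capacity:

```python
import threading
from collections import deque

CAPACITY = 100   # pre-allocated "theoretical max length" per batch
WINDOW = 1000    # moving-average look-back, in rows

lock = threading.Lock()
batches = deque()   # oldest -> newest; each entry is [values, utilized]


def on_incoming_event(value: float) -> float:
    with lock:
        # Allocate a fresh pre-sized batch if none exists or the last is full.
        if not batches or batches[-1][1] >= CAPACITY:
            batches.append([[None] * CAPACITY, 0])
        values, used = batches[-1]
        values[used] = value
        batches[-1][1] = used + 1   # increment the utilized length
        # free() batches that now fall entirely before the look-back window
        total = sum(n for _, n in batches)
        while total - batches[0][1] >= WINDOW:
            total -= batches.popleft()[1]
        # the calculated field: moving average across batches
        recent = [v for vals, n in batches for v in vals[:n]][-WINDOW:]
        return sum(recent) / len(recent)


def swoop_in() -> list:
    # A reader takes the same lock and visits only the utilized prefix
    # of each batch (the last batch may not yet be "as full as possible").
    with lock:
        return [v for vals, n in batches for v in vals[:n]]


for v in [1.0, 2.0, 3.0, 4.0, 5.0]:
    avg = on_incoming_event(v)
print(avg)   # 3.0
```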
> > > > > > > > -John > > > > > > > > On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com> > > wrote: > > > > > > > >> On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org> > wrote: > > > >> > > > > >> > Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` > > already > > > >> reads > > > >> > the future Feather format? If not, how will the future format > > differ? I > > > >> > will work on my access pattern with this format instead of the > > current > > > >> > feather format. Sorry I was not clear on that earlier. > > > >> > > > > >> > > > >> Yes, under the hood those will use the same zero-copy binary > protocol > > > >> code paths to read the file. > > > >> > > > >> > Micah, thank you! > > > >> > > > > >> > On Tue, May 7, 2019 at 11:44 AM Micah Kornfield < > > emkornfi...@gmail.com> > > > >> > wrote: > > > >> > > > > >> > > Hi John, > > > >> > > To give a specific pointer [1] describes how the streaming > > protocol is > > > >> > > stored to a file. > > > >> > > > > > >> > > [1] https://arrow.apache.org/docs/format/IPC.html#file-format > > > >> > > > > > >> > > On Tue, May 7, 2019 at 9:40 AM Wes McKinney < > wesmck...@gmail.com> > > > >> wrote: > > > >> > > > > > >> > > > hi John, > > > >> > > > > > > >> > > > As soon as the R folks can install the Arrow R package > > consistently, > > > >> > > > the intent is to replace the Feather internals with the plain > > Arrow > > > >> > > > IPC protocol where we have much better platform support all > > around. > > > >> > > > > > > >> > > > If you'd like to experiment with creating an API for > > pre-allocating > > > >> > > > fixed-size Arrow protocol blocks and then mutating the data > and > > > >> > > > metadata on disk in-place, please be our guest. 
We don't have > > the > > > >> > > > tools developed yet to do this for you > > > >> > > > > > > >> > > > - Wes > > > >> > > > > > > >> > > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org > > > > > >> wrote: > > > >> > > > > > > > >> > > > > Thanks Wes: > > > >> > > > > > > > >> > > > > "the current Feather format is deprecated" ... yes, but > there > > > >> will be a > > > >> > > > > future file format that replaces it, correct? And my > > discussion > > > >> of > > > >> > > > > immutable "portions" of Arrow buffers, rather than > > immutability > > > >> of the > > > >> > > > > entire buffer, applies to IPC as well, right? I am only > > > >> championing > > > >> > > the > > > >> > > > > idea that this future file format have the convenience that > > > >> certain > > > >> > > > > preallocated rows can be ignored based on a metadata > setting. > > > >> > > > > > > > >> > > > > "I recommend batching your writes" ... this is extremely > > > >> inefficient > > > >> > > and > > > >> > > > > adds unacceptable latency, relative to the proposed > > solution. Do > > > >> you > > > >> > > > > disagree? Either I have a batch length of 1 to minimize > > latency, > > > >> which > > > >> > > > > eliminates columnar advantages on the read side, or else I > add > > > >> latency. > > > >> > > > > Neither works, and it seems that a viable alternative is > > within > > > >> sight? > > > >> > > > > > > > >> > > > > If you don't agree that there is a core issue and > opportunity > > > >> here, I'm > > > >> > > > not > > > >> > > > > sure how to better make my case.... 
> > > >> > > > > > > > >> > > > > -John > > > >> > > > > > > > >> > > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney < > > wesmck...@gmail.com > > > >> > > > > >> > > > wrote: > > > >> > > > > > > > >> > > > > > hi John, > > > >> > > > > > > > > >> > > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen < > > j...@jgm.org> > > > >> > > wrote: > > > >> > > > > > > > > > >> > > > > > > Wes et al, I completed a preliminary study of > populating a > > > >> Feather > > > >> > > > file > > > >> > > > > > > incrementally. Some notes and questions: > > > >> > > > > > > > > > >> > > > > > > I wrote the following dataframe to a feather file: > > > >> > > > > > > a b > > > >> > > > > > > 0 0123456789 0.0 > > > >> > > > > > > 1 0123456789 NaN > > > >> > > > > > > 2 0123456789 NaN > > > >> > > > > > > 3 0123456789 NaN > > > >> > > > > > > 4 None NaN > > > >> > > > > > > > > > >> > > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't > > > >> > > > > > > support --gen-mutable! yuck! C++ to the rescue...), it > > seems > > > >> that > > > >> > > > > > > read_feather is not affected by NumRows? It seems to be > > > >> driven > > > >> > > > entirely > > > >> > > > > > by > > > >> > > > > > > the per-column Length values? > > > >> > > > > > > > > > >> > > > > > > Also, it seems as if one of the primary usages of > > NullCount > > > >> is to > > > >> > > > > > determine > > > >> > > > > > > whether or not a bitfield is present? In the > > initialization > > > >> data > > > >> > > > above I > > > >> > > > > > > was careful to have a null value in each column in order > > to > > > >> > > generate > > > >> > > > a > > > >> > > > > > > bitfield. > > > >> > > > > > > > > >> > > > > > Per my prior e-mails, the current Feather format is > > deprecated, > > > >> so > > > >> > > I'm > > > >> > > > > > only willing to engage on a discussion of the official > Arrow > > > >> binary > > > >> > > > > > protocol that we use for IPC (memory mapping) and RPC > > (Flight). 
> > > >> > > > > > > > > >> > > > > > > > > > >> > > > > > > I then wiped the bitfields in the file and set all of > the > > > >> string > > > >> > > > indices > > > >> > > > > > to > > > >> > > > > > > one past the end of the blob buffer (all strings empty): > > > >> > > > > > > a b > > > >> > > > > > > 0 None NaN > > > >> > > > > > > 1 None NaN > > > >> > > > > > > 2 None NaN > > > >> > > > > > > 3 None NaN > > > >> > > > > > > 4 None NaN > > > >> > > > > > > > > > >> > > > > > > I then set the first record to some data by consuming > > some of > > > >> the > > > >> > > > string > > > >> > > > > > > blob and row 0 and 1 indices, also setting the double: > > > >> > > > > > > > > > >> > > > > > > a b > > > >> > > > > > > 0 Hello, world! 5.0 > > > >> > > > > > > 1 None NaN > > > >> > > > > > > 2 None NaN > > > >> > > > > > > 3 None NaN > > > >> > > > > > > 4 None NaN > > > >> > > > > > > > > > >> > > > > > > As mentioned above, NumRows seems to be ignored. I > tried > > > >> adjusting > > > >> > > > each > > > >> > > > > > > column Length to mask off higher rows and it worked for > 4 > > > >> (hide > > > >> > > last > > > >> > > > row) > > > >> > > > > > > but not for less than 4. I take this to be due to math > > used > > > >> to > > > >> > > > figure > > > >> > > > > > out > > > >> > > > > > > where the buffers are relative to one another since > there > > is > > > >> only > > > >> > > one > > > >> > > > > > > metadata offset for all of: the (optional) bitset, index > > > >> column and > > > >> > > > (if > > > >> > > > > > > string) blobs. > > > >> > > > > > > > > > >> > > > > > > Populating subsequent rows would proceed in a similar > way > > > >> until all > > > >> > > > of > > > >> > > > > > the > > > >> > > > > > > blob storage has been consumed, which may come before > the > > > >> > > > pre-allocated > > > >> > > > > > > rows have been consumed. 
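The offset manipulation described above matches Arrow's variable-length binary layout, in which row i of a string column spans offsets[i]..offsets[i+1] of the blob buffer, and equal adjacent offsets denote an empty value. A small sketch in plain Python (not the Arrow library):

```python
def read_string(offsets: list, blob: bytes, i: int) -> str:
    """Row i spans blob[offsets[i]:offsets[i+1]]; equal adjacent
    offsets mean an empty (here: not-yet-populated) string."""
    return blob[offsets[i]:offsets[i + 1]].decode()


blob = bytearray(64)    # pre-allocated blob storage
offsets = [0] * 6       # 5 rows need 6 offsets; all rows start "empty"

# Incrementally populate row 0: copy bytes into the blob, then advance
# every subsequent offset past the consumed storage so rows 1..4 stay
# empty while row 0 becomes visible.
data = b"Hello, world!"
blob[0:len(data)] = data
for i in range(1, 6):
    offsets[i] = len(data)

print(read_string(offsets, bytes(blob), 0))   # Hello, world!
print(repr(read_string(offsets, bytes(blob), 1)))   # ''
```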
> > > >> > > > > > > > > > >> > > > > > > So what does this mean for my desire to incrementally > > write > > > >> these > > > >> > > > > > > (potentially memory-mapped) pre-allocated files and/or > > Arrow > > > >> > > buffers > > > >> > > > in > > > >> > > > > > > memory? Some thoughts: > > > >> > > > > > > > > > >> > > > > > > - If a single value (such as NumRows) were consulted to > > > >> determine > > > >> > > the > > > >> > > > > > table > > > >> > > > > > > row dimension then updating this single value would > serve > > to > > > >> tell a > > > >> > > > > > reader > > > >> > > > > > > which rows are relevant. Assuming this value is updated > > > >> after all > > > >> > > > other > > > >> > > > > > > mutations are complete, and assuming that mutations are > > only > > > >> > > appends > > > >> > > > > > > (addition of rows), then concurrency control involves > only > > > >> ensuring > > > >> > > > that > > > >> > > > > > > this value is not examined while it is being written. > > > >> > > > > > > > > > >> > > > > > > - NullCount presents a concurrency problem if someone > > reads > > > >> the > > > >> > > file > > > >> > > > > > after > > > >> > > > > > > this field has been updated, but before NumRows has > > exposed > > > >> the new > > > >> > > > > > record > > > >> > > > > > > (or vice versa). The idea previously mentioned that > there > > > >> will > > > >> > > > "likely > > > >> > > > > > > [be] more statistics in the future" feels like it might > be > > > >> scope > > > >> > > > creep to > > > >> > > > > > > me? This is a data representation, not a calculation > > > >> framework? 
> > > > > > If NullCount had its genesis in the optional nature of the bitfield, I would suggest that perhaps NullCount can be dropped in favor of always supplying the bitfield, which in any event is already contemplated by the spec: "Implementations may choose to always allocate one anyway as a matter of convenience." If the concern is space savings, Arrow is already an extremely uncompressed format. (Compression is something I would also consider to be scope creep as regards Feather... compressed filesystems can be employed and there are other compressed dataframe formats.) However, if protecting the 4 bytes required to update NumRows turns out to be no easier than protecting all of the statistical bytes as well as part of the same "critical section" (locks: yuck!!) then statistics pose no issue. I have a feeling that the availability of an atomic write of 4 bytes will depend on the storage mechanism... memory vs memory map vs write() etc.
> > > > > >
> > > > > > - The elephant in the room appears to be the presumptive binary yes/no on mutability of Arrow buffers.
Perhaps the thought is > that > > > >> certain > > > >> > > > batch > > > >> > > > > > > processes will be wrecked if anyone anywhere is mutating > > > >> buffers in > > > >> > > > any > > > >> > > > > > > way. But keep in mind I am not proposing general > > mutability, > > > >> only > > > >> > > > > > > appending of new data. *A huge amount of batch > processing > > > >> that > > > >> > > will > > > >> > > > take > > > >> > > > > > > place with Arrow is on time-series data (whether > > financial or > > > >> > > > otherwise). > > > >> > > > > > > It is only natural that architects will want the minimal > > > >> impedance > > > >> > > > > > mismatch > > > >> > > > > > > when it comes time to grow those time series as the > events > > > >> occur > > > >> > > > going > > > >> > > > > > > forward.* So rather than say that I want "mutable" > Arrow > > > >> buffers, > > > >> > > I > > > >> > > > > > would > > > >> > > > > > > pitch this as a call for "immutable populated areas" of > > Arrow > > > >> > > buffers > > > >> > > > > > > combined with the concept that the populated area can > > grow up > > > >> to > > > >> > > > whatever > > > >> > > > > > > was preallocated. This will not affect anyone who has > > > >> "memoized" a > > > >> > > > > > > dimension and wants to continue to consider the > > then-current > > > >> data > > > >> > > as > > > >> > > > > > > immutable... it will be immutable and will always be > > immutable > > > >> > > > according > > > >> > > > > > to > > > >> > > > > > > that then-current dimension. > > > >> > > > > > > > > > >> > > > > > > Thanks in advance for considering this feedback! I > > absolutely > > > >> > > > require > > > >> > > > > > > efficient row-wise growth of an Arrow-like buffer to > deal > > > >> with time > > > >> > > > > > series > > > >> > > > > > > data in near real time. I believe that preallocation is > > (by > > > >> far) > > > >> > > the > > > >> > > > > > most > > > >> > > > > > > efficient way to accomplish this. 
I hope to be able to > > use > > > >> Arrow! > > > >> > > > If I > > > >> > > > > > > cannot use Arrow than I will be using a home-grown Arrow > > that > > > >> is > > > >> > > > > > identical > > > >> > > > > > > except for this feature, which would be very sad! Even > if > > > >> Arrow > > > >> > > > itself > > > >> > > > > > > could be used in this manner today, I would be hesitant > to > > > >> use it > > > >> > > if > > > >> > > > the > > > >> > > > > > > use-case was not protected on a go-forward basis. > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > I recommend batching your writes and using the Arrow > binary > > > >> streaming > > > >> > > > > > protocol so you are only appending to a file rather than > > > >> mutating > > > >> > > > > > previously-written bytes. This use case is well defined > and > > > >> supported > > > >> > > > > > in the software already. > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > >> > > > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format > > > >> > > > > > > > > >> > > > > > - Wes > > > >> > > > > > > > > >> > > > > > > Of course, I am completely open to alternative ideas and > > > >> > > approaches! > > > >> > > > > > > > > > >> > > > > > > -John > > > >> > > > > > > > > > >> > > > > > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney < > > > >> wesmck...@gmail.com> > > > >> > > > > > wrote: > > > >> > > > > > > > > > >> > > > > > > > hi John -- again, I would caution you against using > > Feather > > > >> files > > > >> > > > for > > > >> > > > > > > > issues of longevity -- the internal memory layout of > > those > > > >> files > > > >> > > > is a > > > >> > > > > > > > "dead man walking" so to speak. > > > >> > > > > > > > > > > >> > > > > > > > I would advise against forking the project, IMHO that > > is a > > > >> dark > > > >> > > > path > > > >> > > > > > > > that leads nowhere good. 
We have a large community > here > > and > > > >> we > > > >> > > > accept > > > >> > > > > > > > pull requests -- I think the challenge is going to be > > > >> defining > > > >> > > the > > > >> > > > use > > > >> > > > > > > > case to suitable clarity that a general purpose > solution > > > >> can be > > > >> > > > > > > > developed. > > > >> > > > > > > > > > > >> > > > > > > > - Wes > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen < > > > >> j...@jgm.org> > > > >> > > > wrote: > > > >> > > > > > > > > > > > >> > > > > > > > > François, Wes, > > > >> > > > > > > > > > > > >> > > > > > > > > Thanks for the feedback. I think the most practical > > > >> thing for > > > >> > > > me to > > > >> > > > > > do > > > >> > > > > > > > is > > > >> > > > > > > > > 1- write a Feather file that is structured to > > > >> pre-allocate the > > > >> > > > space > > > >> > > > > > I > > > >> > > > > > > > need > > > >> > > > > > > > > (e.g. initial variable-length strings are of average > > size) > > > >> > > > > > > > > 2- come up with code to monkey around with the > values > > > >> contained > > > >> > > > in > > > >> > > > > > the > > > >> > > > > > > > > vectors so that before and after each manipulation > the > > > >> file is > > > >> > > > valid > > > >> > > > > > as I > > > >> > > > > > > > > walk the rows ... this is a writer that uses memory > > > >> mapping > > > >> > > > > > > > > 3- check back in here once that works, assuming that > > it > > > >> does, > > > >> > > to > > > >> > > > see > > > >> > > > > > if > > > >> > > > > > > > we > > > >> > > > > > > > > can bless certain mutation paths > > > >> > > > > > > > > 4- if we can't bless certain mutation paths, fork > the > > > >> project > > > >> > > for > > > >> > > > > > those > > > >> > > > > > > > who > > > >> > > > > > > > > care more about stream processing? 
That would not > > seem > > > >> to be > > > >> > > > ideal > > > >> > > > > > as I > > > >> > > > > > > > > think mutation in row-order across the data set is > > > >> relatively > > > >> > > low > > > >> > > > > > impact > > > >> > > > > > > > on > > > >> > > > > > > > > the overall design? > > > >> > > > > > > > > > > > >> > > > > > > > > Thanks again for engaging the topic! > > > >> > > > > > > > > -John > > > >> > > > > > > > > > > > >> > > > > > > > > On Mon, May 6, 2019 at 10:26 AM Francois > > Saint-Jacques < > > > >> > > > > > > > > fsaintjacq...@gmail.com> wrote: > > > >> > > > > > > > > > > > >> > > > > > > > > > Hello John, > > > >> > > > > > > > > > > > > >> > > > > > > > > > Arrow is not yet suited for partial writes. The > > > >> specification > > > >> > > > only > > > >> > > > > > > > > > talks about fully frozen/immutable objects, you're > > in > > > >> > > > > > implementation > > > >> > > > > > > > > > defined territory here. For example, the C++ > library > > > >> assumes > > > >> > > > the > > > >> > > > > > Array > > > >> > > > > > > > > > object is immutable; it memoize the null count, > and > > > >> likely > > > >> > > more > > > >> > > > > > > > > > statistics in the future. > > > >> > > > > > > > > > > > > >> > > > > > > > > > If you want to use pre-allocated buffers and > array, > > you > > > >> can > > > >> > > > use the > > > >> > > > > > > > > > column validity bitmap for this purpose, e.g. set > > all > > > >> null by > > > >> > > > > > default > > > >> > > > > > > > > > and flip once the row is written. It suffers from > > > >> concurrency > > > >> > > > > > issues > > > >> > > > > > > > > > (+ invalidation issues as pointed) when dealing > with > > > >> multiple > > > >> > > > > > columns. > > > >> > > > > > > > > > You'll have to use a barrier of some kind, e.g. a > > > >> per-batch > > > >> > > > global > > > >> > > > > > > > > > atomic (if append-only), or dedicated column(s) > à-la > > > >> MVCC. 
> > > >> > > But > > > >> > > > > > then, > > > >> > > > > > > > > > the reader needs to be aware of this and compute a > > mask > > > >> each > > > >> > > > time > > > >> > > > > > it > > > >> > > > > > > > > > needs to query the partial batch. > > > >> > > > > > > > > > > > > >> > > > > > > > > > This is a common columnar database problem, see > [1] > > for > > > >> a > > > >> > > > recent > > > >> > > > > > paper > > > >> > > > > > > > > > on the subject. The usual technique is to store > the > > > >> recent > > > >> > > data > > > >> > > > > > > > > > row-wise, and transform it in column-wise when a > > > >> threshold is > > > >> > > > met > > > >> > > > > > akin > > > >> > > > > > > > > > to a compaction phase. There was a somewhat > related > > > >> thread > > > >> > > [2] > > > >> > > > > > lately > > > >> > > > > > > > > > about streaming vs batching. In the end, I think > > your > > > >> > > solution > > > >> > > > > > will be > > > >> > > > > > > > > > very application specific. 
> > > >> > > > > > > > > > > > > >> > > > > > > > > > François > > > >> > > > > > > > > > > > > >> > > > > > > > > > [1] > > > >> > > https://db.in.tum.de/downloads/publications/datablocks.pdf > > > >> > > > > > > > > > [2] > > > >> > > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > >> > > > https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen < > > > >> > > j...@jgm.org> > > > >> > > > > > wrote: > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > Wes, > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > I’m not afraid of writing my own C++ code to > deal > > > >> with all > > > >> > > of > > > >> > > > > > this > > > >> > > > > > > > on the > > > >> > > > > > > > > > > writer side. I just need a way to “append” > > > >> (incrementally > > > >> > > > > > populate) > > > >> > > > > > > > e.g. > > > >> > > > > > > > > > > feather files so that a person using e.g. > pyarrow > > > >> doesn’t > > > >> > > > suffer > > > >> > > > > > some > > > >> > > > > > > > > > > catastrophic failure... and “on the side” I tell > > them > > > >> which > > > >> > > > rows > > > >> > > > > > are > > > >> > > > > > > > junk > > > >> > > > > > > > > > > and deal with any concurrency issues that can’t > be > > > >> solved > > > >> > > in > > > >> > > > the > > > >> > > > > > > > arena of > > > >> > > > > > > > > > > atomicity and ordering of ops. For now I care > > about > > > >> basic > > > >> > > > types > > > >> > > > > > but > > > >> > > > > > > > > > > including variable-width strings. 
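François's validity-bitmap suggestion earlier in the thread (all rows null by default, flip a bit once a row is fully written) can be sketched using Arrow's least-significant-bit bitmap ordering (plain Python, not the Arrow library):

```python
def set_valid(bitmap: bytearray, row: int) -> None:
    # Arrow validity bitmaps are LSB-ordered: row r lives in byte
    # r // 8, bit r % 8. Flipping the bit "publishes" the row.
    bitmap[row // 8] |= 1 << (row % 8)


def is_valid(bitmap: bytes, row: int) -> bool:
    return bool(bitmap[row // 8] & (1 << (row % 8)))


rows = 10
bitmap = bytearray((rows + 7) // 8)   # all zero: every row reads as null

set_valid(bitmap, 0)   # row 0 fully written -> flip its bit
set_valid(bitmap, 3)

print(is_valid(bitmap, 0))   # True
print(is_valid(bitmap, 1))   # False: still null / unpopulated
```

As noted in the thread, this publishes rows individually; coordinating flips across multiple columns still needs a barrier of some kind.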
> > > >> > > > > > > > > > > > > > >> > > > > > > > > > > For event-processing, I think Arrow has to have > > the > > > >> concept > > > >> > > > of a > > > >> > > > > > > > > > partially > > > >> > > > > > > > > > > full record set. Some alternatives are: > > > >> > > > > > > > > > > - have a batch size of one, thus littering the > > > >> landscape > > > >> > > with > > > >> > > > > > > > trivially > > > >> > > > > > > > > > > small Arrow buffers > > > >> > > > > > > > > > > - artificially increase latency with a batch > size > > > >> larger > > > >> > > than > > > >> > > > > > one, > > > >> > > > > > > > but > > > >> > > > > > > > > > not > > > >> > > > > > > > > > > processing any data until a batch is complete > > > >> > > > > > > > > > > - continuously re-write the (entire!) “main” > > buffer as > > > >> > > > batches of > > > >> > > > > > > > length > > > >> > > > > > > > > > 1 > > > >> > > > > > > > > > > roll in > > > >> > > > > > > > > > > - instead of one main buffer, several, and at > some > > > >> > > threshold > > > >> > > > > > combine > > > >> > > > > > > > the > > > >> > > > > > > > > > > last N length-1 batches into a length N buffer > ... > > > >> still an > > > >> > > > > > > > inefficiency > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > Consider the case of QAbstractTableModel as the > > > >> underlying > > > >> > > > data > > > >> > > > > > for a > > > >> > > > > > > > > > table > > > >> > > > > > > > > > > or a chart. This visualization shows all of the > > data > > > >> for > > > >> > > the > > > >> > > > > > recent > > > >> > > > > > > > past > > > >> > > > > > > > > > > as well as events rolling in. If this model > > > >> interface is > > > >> > > > > > > > implemented as > > > >> > > > > > > > > > a > > > >> > > > > > > > > > > view onto “many thousands” of individual event > > > >> buffers then > > > >> > > > we > > > >> > > > > > gain > > > >> > > > > > > > > > nothing > > > >> > > > > > > > > > > from columnar layout. 
(Suppose there are tons > of > > > >> columns > > > >> > > and > > > >> > > > > > most of > > > >> > > > > > > > > > them > > > >> > > > > > > > > > > are scrolled out of the view.). Likewise we > cannot > > > >> re-write > > > >> > > > the > > > >> > > > > > > > entire > > > >> > > > > > > > > > > model on each event... time complexity blows up. > > > >> What we > > > >> > > > want > > > >> > > > > > is to > > > >> > > > > > > > > > have a > > > >> > > > > > > > > > > large pre-allocated chunk and just change > > rowCount() > > > >> as > > > >> > > data > > > >> > > > is > > > >> > > > > > > > > > “appended.” > > > >> > > > > > > > > > > Sure, we may run out of space and have another > > and > > > >> another > > > >> > > > > > chunk for > > > >> > > > > > > > > > > future row ranges, but a handful of chunks > chained > > > >> together > > > >> > > > is > > > >> > > > > > better > > > >> > > > > > > > > > than > > > >> > > > > > > > > > > as many chunks as there were events! > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > And again, having a batch size >1 and delaying > the > > > >> data > > > >> > > > until a > > > >> > > > > > > > batch is > > > >> > > > > > > > > > > full is a non-starter. > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > I am really hoping to see partially-filled > > buffers as > > > >> > > > something > > > >> > > > > > we > > > >> > > > > > > > keep > > > >> > > > > > > > > > our > > > >> > > > > > > > > > > finger on moving forward! Or else, what am I > > missing? 
> > > >> > > > > > > > > > > > > > >> > > > > > > > > > > -John > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney < > > > >> > > > wesmck...@gmail.com > > > >> > > > > > > > > > >> > > > > > > > wrote: > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > hi John, > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > In C++ the builder classes don't yet support > > > >> writing into > > > >> > > > > > > > preallocated > > > >> > > > > > > > > > > > memory. It would be tricky for applications to > > > >> determine > > > >> > > a > > > >> > > > > > priori > > > >> > > > > > > > > > > > which segments of memory to pass to the > > builder. It > > > >> seems > > > >> > > > only > > > >> > > > > > > > > > > > feasible for primitive / fixed-size types so > my > > > >> guess > > > >> > > > would be > > > >> > > > > > > > that a > > > >> > > > > > > > > > > > separate set of interfaces would need to be > > > >> developed for > > > >> > > > this > > > >> > > > > > > > task. > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > - Wes > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau > < > > > >> > > > > > jacq...@apache.org> > > > >> > > > > > > > > > wrote: > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > This is more of a question of implementation > > > >> versus > > > >> > > > > > > > specification. An > > > >> > > > > > > > > > > > arrow > > > >> > > > > > > > > > > > > buffer is generally built and then sealed. > In > > > >> different > > > >> > > > > > > > languages, > > > >> > > > > > > > > > this > > > >> > > > > > > > > > > > > building process works differently (a > concern > > of > > > >> the > > > >> > > > language > > > >> > > > > > > > rather > > > >> > > > > > > > > > than > > > >> > > > > > > > > > > > > the memory specification). 
> > > We don't currently allow a half-built vector to be moved to another
> > > language and then be further built. So the question is really more
> > > concrete: what language are you looking at, and what is the specific
> > > pattern you're trying to undertake for building?
> > >
> > > If you're trying to go across independent processes (whether the same
> > > process restarted or two separate processes active simultaneously),
> > > you'll need to build up your own data structures to help with this.
> > >
> > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > Glad to learn of this project -- good work!
> > > >
> > > > If I allocate a single chunk of memory and start building Arrow
> > > > format within it, does this chunk save any state regarding my
> > > > progress?
> > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > For example, suppose I allocate a column > for > > > >> floating > > > >> > > > point > > > >> > > > > > > > (fixed > > > >> > > > > > > > > > > > width) > > > >> > > > > > > > > > > > > > and a column for string (variable width). > > > >> Suppose I > > > >> > > > start > > > >> > > > > > > > > > building the > > > >> > > > > > > > > > > > > > floating point column at offset X into my > > single > > > >> > > > buffer, > > > >> > > > > > and > > > >> > > > > > > > the > > > >> > > > > > > > > > string > > > >> > > > > > > > > > > > > > “pointer” column at offset Y into the same > > > >> single > > > >> > > > buffer, > > > >> > > > > > and > > > >> > > > > > > > the > > > >> > > > > > > > > > > > string > > > >> > > > > > > > > > > > > > data elements at offset Z. > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > I write one floating point number and one > > > >> string, > > > >> > > then > > > >> > > > go > > > >> > > > > > away. > > > >> > > > > > > > > > When I > > > >> > > > > > > > > > > > > > come back to this buffer to append another > > > >> value, > > > >> > > does > > > >> > > > the > > > >> > > > > > > > buffer > > > >> > > > > > > > > > > > itself > > > >> > > > > > > > > > > > > > know where I would begin? I.e. is there a > > > >> > > > differentiation > > > >> > > > > > in > > > >> > > > > > > > the > > > >> > > > > > > > > > > > column > > > >> > > > > > > > > > > > > > (or blob) data itself between the > available > > > >> space and > > > >> > > > the > > > >> > > > > > used > > > >> > > > > > > > > > space? 
> > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Suppose I write a lot of large variable > > width > > > >> strings > > > >> > > > and > > > >> > > > > > “run > > > >> > > > > > > > > > out” of > > > >> > > > > > > > > > > > > > space for them before running out of space > > for > > > >> > > floating > > > >> > > > > > point > > > >> > > > > > > > > > numbers > > > >> > > > > > > > > > > > or > > > >> > > > > > > > > > > > > > string pointers. (I guessed badly when > > doing > > > >> the > > > >> > > > original > > > >> > > > > > > > > > > > allocation.). I > > > >> > > > > > > > > > > > > > consider this to be Ok since I can always > > > >> “copy” the > > > >> > > > data > > > >> > > > > > to > > > >> > > > > > > > > > “compress > > > >> > > > > > > > > > > > out” > > > >> > > > > > > > > > > > > > the unused fp/pointer buckets... the > choice > > is > > > >> up to > > > >> > > > me. > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > The above applied to a (feather?) file is > > how I > > > >> > > > anticipate > > > >> > > > > > > > > > appending > > > >> > > > > > > > > > > > data > > > >> > > > > > > > > > > > > > to disk... pre-allocate a mem-mapped file > > and > > > >> > > gradually > > > >> > > > > > fill > > > >> > > > > > > > it up. > > > >> > > > > > > > > > > > The > > > >> > > > > > > > > > > > > > efficiency of file utilization will depend > > on my > > > >> > > > > > projections > > > >> > > > > > > > > > regarding > > > >> > > > > > > > > > > > > > variable-width data types, but as I said > > above, > > > >> I can > > > >> > > > > > always > > > >> > > > > > > > > > re-write > > > >> > > > > > > > > > > > the > > > >> > > > > > > > > > > > > > file if/when this bothers me. > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Is this the recommended and supported > > approach > > > >> for > > > >> > > > > > incremental > > > >> > > > > > > > > > appends? 
> > > >> > > > > > > > > > > > > > I’m really hoping to use Arrow instead of > > > >> rolling my > > > >> > > > own, > > > >> > > > > > but > > > >> > > > > > > > > > > > functionality > > > >> > > > > > > > > > > > > > like this is absolutely key! Hoping not > to > > use > > > >> a > > > >> > > > side-car > > > >> > > > > > > > file (or > > > >> > > > > > > > > > > > memory > > > >> > > > > > > > > > > > > > chunk) to store “append progress” > > information. > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > I am brand new to this project so please > > > >> forgive me > > > >> > > if > > > >> > > > I > > > >> > > > > > have > > > >> > > > > > > > > > > > overlooked > > > >> > > > > > > > > > > > > > something obvious. And again, looks like > > great > > > >> work > > > >> > > so > > > >> > > > > > far! > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Thanks! > > > >> > > > > > > > > > > > > > -John > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > >