Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow std::vector terminology)
Thanks, John On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote: > Wes et al, I think my core proposal is that Message.fbs:RecordBatch split > the "length" parameter into "theoretical max length" and "utilized length" > (perhaps not those exact names). > > "theoretical max length" is the same as "length" now ... /// ...The arrays > in the batch should all have this > > "utilized length" is the number of rows (starting from the first one) > that actually contain interesting data... the rest do not. > > The reason we can have a RecordBatch where these numbers are not the same > is that the RecordBatch space was preallocated (for performance reasons) > and the number of rows that actually "fit" depends on how correct the > preallocation was. In any case, it gives the user control of this > space/time tradeoff... wasted space in order to save time in record batch > construction. The fact that some space will usually be wasted when there > are variable-length columns (barring extreme luck) with this batch > construction paradigm explains the word "theoretical" above. This also > gives us the ability to look at a partially constructed batch that is still > being constructed, given appropriate user-supplied concurrency control. > > I am not an expert in all of the Arrow variable-length data types, but I > think this works if they are all similar to variable-length strings where > we advance through "blob storage" by setting the indexes into that storage > for the current and next row in order to indicate that we have > incrementally consumed more blob storage. (Conceptually this storage is > "unallocated" after the pre-allocation and before rows are populated.) > > At a high level I am seeking to shore up the format for event ingress into > real-time analytics that have some look-back window. 
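For concreteness, the proposed split might look roughly like the following in Message.fbs. The field name `utilizedLength` and the doc comments are invented here for illustration (the proposal explicitly leaves naming open); the existing table carries a single `length` alongside `nodes` and `buffers`:

```
table RecordBatch {
  /// Number of rows preallocated: the "theoretical max length".
  /// (Same meaning as today's `length`.)
  length: long;

  /// Number of rows, starting from row 0, that actually contain data:
  /// the "utilized length". Readers ignore rows in
  /// [utilizedLength, length).
  utilizedLength: long;

  nodes: [FieldNode];
  buffers: [Buffer];
}
```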
If I'm not mistaken I > think this is the subject of the last multi-sentence paragraph here?: > https://zd.net/2H0LlBY > > Currently we have a less-efficient paradigm where "microbatches" (e.g. of > length 1 for minimal latency) have to spin the CPU periodically in order to > be combined into buffers where we get the columnar layout benefit. With > pre-allocation we can deal with microbatches (a partially populated larger > RecordBatch) and immediately have the columnar layout benefits for the > populated section with no additional computation. > > For example, consider an event processing system that calculates a "moving > average" as events roll in. While this is somewhat contrived let's assume > that the moving average window is 1000 periods and our pre-allocation > ("theoretical max length") of RecordBatch elements is 100. The algorithm > would be something like this, for a list of RecordBatch buffers in memory: > > initialization(): > set up configuration of expected variable length storage requirements, > e.g. the template RecordBatch mentioned below > > onIncomingEvent(event): > obtain lock /// cf. swoopIn() below > if last RecordBatch utilized length is not less than theoretical max > length or variable-length components of "event" will not fit in remaining > blob storage: > create a new RecordBatch pre-allocation of theoretical max length 100 and > with blob preallocation that is max(expected, event ... in case the single > event is larger than the expectation for 100 events) > (note: in the expected case this can be very fast as it is a > malloc() and a memcpy() from a template!) 
> set current RecordBatch to this newly created one > add event to current RecordBatch (for the non-calculated fields) > increment utilized length of current RecordBatch > calculate the calculated fields (in this case, moving average) by > looking back at previous rows in this and previous RecordBatch objects > free() any RecordBatch objects that are now before the lookback window > > swoopIn(): /// somebody wants to chart the lookback window > obtain lock > visit all of the relevant data in the RecordBatches to construct the > chart /// notice that the last RecordBatch may not yet be "as full as > possible" > > The above analysis (minus the free()) could apply to the IPC file format > and the lock could be a file lock and the swoopIn() could be a separate > process. In the case of the file format, while the file is locked, a new > RecordBatch would overwrite the previous file Footer and a new Footer would > be written. In order to be able to delete or archive old data multiple > files could be strung together in a logical series. > > -John > > On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com> wrote: > >> On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org> wrote: >> > >> > Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already >> reads >> > the future Feather format? If not, how will the future format differ? I >> > will work on my access pattern with this format instead of the current >> > feather format. Sorry I was not clear on that earlier. >> > >> >> Yes, under the hood those will use the same zero-copy binary protocol >> code paths to read the file. >> >> > Micah, thank you! >> > >> > On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <emkornfi...@gmail.com> >> > wrote: >> > >> > > Hi John, >> > > To give a specific pointer [1] describes how the streaming protocol is >> > > stored to a file. 
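The onIncomingEvent()/lookback algorithm above can be sketched in plain Python. This is NOT the pyarrow API: the `Batch` class, its field names, and the small capacities are all invented here to illustrate the size-vs-capacity bookkeeping and the "utilized length is published last" property.

```python
# Plain-Python sketch of the pre-allocated batch algorithm -- illustrative
# names and layouts, not Arrow structures.

class Batch:
    def __init__(self, capacity, blob_capacity):
        self.capacity = capacity              # "theoretical max length"
        self.utilized = 0                     # "utilized length"
        self.values = [0.0] * capacity        # fixed-width column, preallocated
        self.offsets = [0] * (capacity + 1)   # string offsets into blob storage
        self.blob = bytearray(blob_capacity)  # preallocated blob storage
        self.blob_used = 0

    def fits(self, s):
        return (self.utilized < self.capacity and
                self.blob_used + len(s) <= len(self.blob))

    def append(self, value, s):
        i = self.utilized
        self.values[i] = value
        data = s.encode()
        self.blob[self.blob_used:self.blob_used + len(data)] = data
        self.blob_used += len(data)
        self.offsets[i + 1] = self.blob_used
        self.utilized += 1    # bumped last: readers see a consistent prefix

batches = []

def on_incoming_event(value, s, capacity=100, blob_capacity=1000):
    # In C++ the expected case would be a malloc() + memcpy() from a template.
    if not batches or not batches[-1].fits(s):
        batches.append(Batch(capacity, max(blob_capacity, len(s))))
    batches[-1].append(value, s)

def moving_average(window):
    vals = []
    for b in reversed(batches):               # look back across batches
        vals[:0] = b.values[:b.utilized]      # prepend this batch's rows
        if len(vals) >= window:
            break
    return sum(vals[-window:]) / min(window, len(vals))
```

Because `utilized` is only incremented after the row's data and offsets are in place, a reader (swoopIn) that snapshots `utilized` under the lock sees a consistent prefix of the batch.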
>> > > >> > > [1] https://arrow.apache.org/docs/format/IPC.html#file-format >> > > >> > > On Tue, May 7, 2019 at 9:40 AM Wes McKinney <wesmck...@gmail.com> >> wrote: >> > > >> > > > hi John, >> > > > >> > > > As soon as the R folks can install the Arrow R package consistently, >> > > > the intent is to replace the Feather internals with the plain Arrow >> > > > IPC protocol where we have much better platform support all around. >> > > > >> > > > If you'd like to experiment with creating an API for pre-allocating >> > > > fixed-size Arrow protocol blocks and then mutating the data and >> > > > metadata on disk in-place, please be our guest. We don't have the >> > > > tools developed yet to do this for you >> > > > >> > > > - Wes >> > > > >> > > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org> >> wrote: >> > > > > >> > > > > Thanks Wes: >> > > > > >> > > > > "the current Feather format is deprecated" ... yes, but there >> will be a >> > > > > future file format that replaces it, correct? And my discussion >> of >> > > > > immutable "portions" of Arrow buffers, rather than immutability >> of the >> > > > > entire buffer, applies to IPC as well, right? I am only >> championing >> > > the >> > > > > idea that this future file format have the convenience that >> certain >> > > > > preallocated rows can be ignored based on a metadata setting. >> > > > > >> > > > > "I recommend batching your writes" ... this is extremely >> inefficient >> > > and >> > > > > adds unacceptable latency, relative to the proposed solution. Do >> you >> > > > > disagree? Either I have a batch length of 1 to minimize latency, >> which >> > > > > eliminates columnar advantages on the read side, or else I add >> latency. >> > > > > Neither works, and it seems that a viable alternative is within >> sight? >> > > > > >> > > > > If you don't agree that there is a core issue and opportunity >> here, I'm >> > > > not >> > > > > sure how to better make my case.... 
>> > > > > >> > > > > -John >> > > > > >> > > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmck...@gmail.com >> > >> > > > wrote: >> > > > > >> > > > > > hi John, >> > > > > > >> > > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <j...@jgm.org> >> > > wrote: >> > > > > > > >> > > > > > > Wes et al, I completed a preliminary study of populating a >> Feather >> > > > file >> > > > > > > incrementally. Some notes and questions: >> > > > > > > >> > > > > > > I wrote the following dataframe to a feather file: >> > > > > > > a b >> > > > > > > 0 0123456789 0.0 >> > > > > > > 1 0123456789 NaN >> > > > > > > 2 0123456789 NaN >> > > > > > > 3 0123456789 NaN >> > > > > > > 4 None NaN >> > > > > > > >> > > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't >> > > > > > > support --gen-mutable! yuck! C++ to the rescue...), it seems >> that >> > > > > > > read_feather is not affected by NumRows? It seems to be >> driven >> > > > entirely >> > > > > > by >> > > > > > > the per-column Length values? >> > > > > > > >> > > > > > > Also, it seems as if one of the primary usages of NullCount >> is to >> > > > > > determine >> > > > > > > whether or not a bitfield is present? In the initialization >> data >> > > > above I >> > > > > > > was careful to have a null value in each column in order to >> > > generate >> > > > a >> > > > > > > bitfield. >> > > > > > >> > > > > > Per my prior e-mails, the current Feather format is deprecated, >> so >> > > I'm >> > > > > > only willing to engage on a discussion of the official Arrow >> binary >> > > > > > protocol that we use for IPC (memory mapping) and RPC (Flight). 
>> > > > > > >> > > > > > > >> > > > > > > I then wiped the bitfields in the file and set all of the >> string >> > > > indices >> > > > > > to >> > > > > > > one past the end of the blob buffer (all strings empty): >> > > > > > > a b >> > > > > > > 0 None NaN >> > > > > > > 1 None NaN >> > > > > > > 2 None NaN >> > > > > > > 3 None NaN >> > > > > > > 4 None NaN >> > > > > > > >> > > > > > > I then set the first record to some data by consuming some of >> the >> > > > string >> > > > > > > blob and row 0 and 1 indices, also setting the double: >> > > > > > > >> > > > > > > a b >> > > > > > > 0 Hello, world! 5.0 >> > > > > > > 1 None NaN >> > > > > > > 2 None NaN >> > > > > > > 3 None NaN >> > > > > > > 4 None NaN >> > > > > > > >> > > > > > > As mentioned above, NumRows seems to be ignored. I tried >> adjusting >> > > > each >> > > > > > > column Length to mask off higher rows and it worked for 4 >> (hide >> > > last >> > > > row) >> > > > > > > but not for less than 4. I take this to be due to math used >> to >> > > > figure >> > > > > > out >> > > > > > > where the buffers are relative to one another since there is >> only >> > > one >> > > > > > > metadata offset for all of: the (optional) bitset, index >> column and >> > > > (if >> > > > > > > string) blobs. >> > > > > > > >> > > > > > > Populating subsequent rows would proceed in a similar way >> until all >> > > > of >> > > > > > the >> > > > > > > blob storage has been consumed, which may come before the >> > > > pre-allocated >> > > > > > > rows have been consumed. >> > > > > > > >> > > > > > > So what does this mean for my desire to incrementally write >> these >> > > > > > > (potentially memory-mapped) pre-allocated files and/or Arrow >> > > buffers >> > > > in >> > > > > > > memory? 
Some thoughts: >> > > > > > > >> > > > > > > - If a single value (such as NumRows) were consulted to >> determine >> > > the >> > > > > > table >> > > > > > > row dimension then updating this single value would serve to >> tell a >> > > > > > reader >> > > > > > > which rows are relevant. Assuming this value is updated >> after all >> > > > other >> > > > > > > mutations are complete, and assuming that mutations are only >> > > appends >> > > > > > > (addition of rows), then concurrency control involves only >> ensuring >> > > > that >> > > > > > > this value is not examined while it is being written. >> > > > > > > >> > > > > > > - NullCount presents a concurrency problem if someone reads >> the >> > > file >> > > > > > after >> > > > > > > this field has been updated, but before NumRows has exposed >> the new >> > > > > > record >> > > > > > > (or vice versa). The idea previously mentioned that there >> will >> > > > "likely >> > > > > > > [be] more statistics in the future" feels like it might be >> scope >> > > > creep to >> > > > > > > me? This is a data representation, not a calculation >> framework? >> > > If >> > > > > > > NullCount had its genesis in the optional nature of the >> bitfield, I >> > > > would >> > > > > > > suggest that perhaps NullCount can be dropped in favor of >> always >> > > > > > supplying >> > > > > > > the bitfield, which in any event is already contemplated by >> the >> > > spec: >> > > > > > > "Implementations may choose to always allocate one anyway as a >> > > > matter of >> > > > > > > convenience." If the concern is space savings, Arrow is >> already an >> > > > > > > extremely uncompressed format. (Compression is something I >> would >> > > > also >> > > > > > > consider to be scope creep as regards Feather... compressed >> > > > filesystems >> > > > > > can >> > > > > > > be employed and there are other compressed dataframe formats.) 
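The first bullet above (a single NumRows-like value, updated after all other mutations, telling readers which rows are relevant) can be sketched as follows. The layout here, a leading uint32 row count followed by fixed-width float64 values in a bytearray standing in for a memory-mapped region, is invented for illustration and is not the actual Feather/Arrow metadata:

```python
# Sketch: publish new rows by updating a single length field last.
import struct

buf = bytearray(4 + 8 * 100)    # [row count: uint32][100 x float64 slots]

def append_row(value):
    n = struct.unpack_from("<I", buf, 0)[0]
    struct.pack_into("<d", buf, 4 + 8 * n, value)   # write the data first...
    struct.pack_into("<I", buf, 0, n + 1)           # ...then expose the row

def read_rows():
    n = struct.unpack_from("<I", buf, 0)[0]         # snapshot the row count
    return [struct.unpack_from("<d", buf, 4 + 8 * i)[0] for i in range(n)]
```

The count update is the publication point; whether that 4-byte write is actually atomic depends on the storage mechanism (memory vs. memory map vs. write()), as noted above.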
>> > > > However, >> > > > > > if >> > > > > > > protecting the 4 bytes required to update NumRows turns out >> to be >> > > no >> > > > > > easier >> > > > > > > than protecting all of the statistical bytes as well as part >> of the >> > > > same >> > > > > > > "critical section" (locks: yuck!!) then statistics pose no >> issue. >> > > I >> > > > > > have a >> > > > > > > feeling that the availability of an atomic write of 4 bytes >> will >> > > > depend >> > > > > > on >> > > > > > > the storage mechanism... memory vs memory map vs write() etc. >> > > > > > > >> > > > > > > - The elephant in the room appears to be the presumptive >> binary >> > > > yes/no on >> > > > > > > mutability of Arrow buffers. Perhaps the thought is that >> certain >> > > > batch >> > > > > > > processes will be wrecked if anyone anywhere is mutating >> buffers in >> > > > any >> > > > > > > way. But keep in mind I am not proposing general mutability, >> only >> > > > > > > appending of new data. *A huge amount of batch processing >> that >> > > will >> > > > take >> > > > > > > place with Arrow is on time-series data (whether financial or >> > > > otherwise). >> > > > > > > It is only natural that architects will want the minimal >> impedance >> > > > > > mismatch >> > > > > > > when it comes time to grow those time series as the events >> occur >> > > > going >> > > > > > > forward.* So rather than say that I want "mutable" Arrow >> buffers, >> > > I >> > > > > > would >> > > > > > > pitch this as a call for "immutable populated areas" of Arrow >> > > buffers >> > > > > > > combined with the concept that the populated area can grow up >> to >> > > > whatever >> > > > > > > was preallocated. This will not affect anyone who has >> "memoized" a >> > > > > > > dimension and wants to continue to consider the then-current >> data >> > > as >> > > > > > > immutable... 
it will be immutable and will always be immutable >> > > > according >> > > > > > to >> > > > > > > that then-current dimension. >> > > > > > > >> > > > > > > Thanks in advance for considering this feedback! I absolutely >> > > > require >> > > > > > > efficient row-wise growth of an Arrow-like buffer to deal >> with time >> > > > > > series >> > > > > > > data in near real time. I believe that preallocation is (by >> far) >> > > the >> > > > > > most >> > > > > > > efficient way to accomplish this. I hope to be able to use >> Arrow! >> > > > If I >> > > > > > > cannot use Arrow then I will be using a home-grown Arrow that >> is >> > > > > > identical >> > > > > > > except for this feature, which would be very sad! Even if >> Arrow >> > > > itself >> > > > > > > could be used in this manner today, I would be hesitant to >> use it >> > > if >> > > > the >> > > > > > > use-case was not protected on a go-forward basis. >> > > > > > > >> > > > > > >> > > > > > I recommend batching your writes and using the Arrow binary >> streaming >> > > > > > protocol so you are only appending to a file rather than >> mutating >> > > > > > previously-written bytes. This use case is well defined and >> supported >> > > > > > in the software already. >> > > > > > >> > > > > > >> > > > > > >> > > > >> > > >> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format >> > > > > > >> > > > > > - Wes >> > > > > > >> > > > > > > Of course, I am completely open to alternative ideas and >> > > approaches! >> > > > > > > >> > > > > > > -John >> > > > > > > >> > > > > > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney < >> wesmck...@gmail.com> >> > > > > > wrote: >> > > > > > > >> > > > > > > > hi John -- again, I would caution you against using Feather >> files >> > > > for >> > > > > > > > issues of longevity -- the internal memory layout of those >> files >> > > > is a >> > > > > > > > "dead man walking" so to speak. 
>> > > > > > > > >> > > > > > > > I would advise against forking the project, IMHO that is a >> dark >> > > > path >> > > > > > > > that leads nowhere good. We have a large community here and >> we >> > > > accept >> > > > > > > > pull requests -- I think the challenge is going to be >> defining >> > > the >> > > > use >> > > > > > > > case to suitable clarity that a general purpose solution >> can be >> > > > > > > > developed. >> > > > > > > > >> > > > > > > > - Wes >> > > > > > > > >> > > > > > > > >> > > > > > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen < >> j...@jgm.org> >> > > > wrote: >> > > > > > > > > >> > > > > > > > > François, Wes, >> > > > > > > > > >> > > > > > > > > Thanks for the feedback. I think the most practical >> thing for >> > > > me to >> > > > > > do >> > > > > > > > is >> > > > > > > > > 1- write a Feather file that is structured to >> pre-allocate the >> > > > space >> > > > > > I >> > > > > > > > need >> > > > > > > > > (e.g. initial variable-length strings are of average size) >> > > > > > > > > 2- come up with code to monkey around with the values >> contained >> > > > in >> > > > > > the >> > > > > > > > > vectors so that before and after each manipulation the >> file is >> > > > valid >> > > > > > as I >> > > > > > > > > walk the rows ... this is a writer that uses memory >> mapping >> > > > > > > > > 3- check back in here once that works, assuming that it >> does, >> > > to >> > > > see >> > > > > > if >> > > > > > > > we >> > > > > > > > > can bless certain mutation paths >> > > > > > > > > 4- if we can't bless certain mutation paths, fork the >> project >> > > for >> > > > > > those >> > > > > > > > who >> > > > > > > > > care more about stream processing? That would not seem >> to be >> > > > ideal >> > > > > > as I >> > > > > > > > > think mutation in row-order across the data set is >> relatively >> > > low >> > > > > > impact >> > > > > > > > on >> > > > > > > > > the overall design? 
>> > > > > > > > > >> > > > > > > > > Thanks again for engaging the topic! >> > > > > > > > > -John >> > > > > > > > > >> > > > > > > > > On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques < >> > > > > > > > > fsaintjacq...@gmail.com> wrote: >> > > > > > > > > >> > > > > > > > > > Hello John, >> > > > > > > > > > >> > > > > > > > > > Arrow is not yet suited for partial writes. The >> specification >> > > > only >> > > > > > > > > > talks about fully frozen/immutable objects; you're in >> > > > > > implementation >> > > > > > > > > > defined territory here. For example, the C++ library >> assumes >> > > > the >> > > > > > Array >> > > > > > > > > > object is immutable; it memoizes the null count, and >> likely >> > > more >> > > > > > > > > > statistics in the future. >> > > > > > > > > > >> > > > > > > > > > If you want to use pre-allocated buffers and array, you >> can >> > > > use the >> > > > > > > > > > column validity bitmap for this purpose, e.g. set all >> null by >> > > > > > default >> > > > > > > > > > and flip once the row is written. It suffers from >> concurrency >> > > > > > issues >> > > > > > > > > > (+ invalidation issues as pointed) when dealing with >> multiple >> > > > > > columns. >> > > > > > > > > > You'll have to use a barrier of some kind, e.g. a >> per-batch >> > > > global >> > > > > > > > > > atomic (if append-only), or dedicated column(s) à-la >> MVCC. >> > > But >> > > > > > then, >> > > > > > > > > > the reader needs to be aware of this and compute a mask >> each >> > > > time >> > > > > > it >> > > > > > > > > > needs to query the partial batch. >> > > > > > > > > > >> > > > > > > > > > This is a common columnar database problem, see [1] for >> a >> > > > recent >> > > > > > paper >> > > > > > > > > > on the subject. 
The usual technique is to store the >> recent >> > > data >> > > > > > > > > > row-wise, and transform it in column-wise when a >> threshold is >> > > > met >> > > > > > akin >> > > > > > > > > > to a compaction phase. There was a somewhat related >> thread >> > > [2] >> > > > > > lately >> > > > > > > > > > about streaming vs batching. In the end, I think your >> > > solution >> > > > > > will be >> > > > > > > > > > very application specific. >> > > > > > > > > > >> > > > > > > > > > François >> > > > > > > > > > >> > > > > > > > > > [1] >> > > https://db.in.tum.de/downloads/publications/datablocks.pdf >> > > > > > > > > > [2] >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > > >> > > >> https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen < >> > > j...@jgm.org> >> > > > > > wrote: >> > > > > > > > > > > >> > > > > > > > > > > Wes, >> > > > > > > > > > > >> > > > > > > > > > > I’m not afraid of writing my own C++ code to deal >> with all >> > > of >> > > > > > this >> > > > > > > > on the >> > > > > > > > > > > writer side. I just need a way to “append” >> (incrementally >> > > > > > populate) >> > > > > > > > e.g. >> > > > > > > > > > > feather files so that a person using e.g. pyarrow >> doesn’t >> > > > suffer >> > > > > > some >> > > > > > > > > > > catastrophic failure... and “on the side” I tell them >> which >> > > > rows >> > > > > > are >> > > > > > > > junk >> > > > > > > > > > > and deal with any concurrency issues that can’t be >> solved >> > > in >> > > > the >> > > > > > > > arena of >> > > > > > > > > > > atomicity and ordering of ops. For now I care about >> basic >> > > > types >> > > > > > but >> > > > > > > > > > > including variable-width strings. 
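The validity-bitmap approach suggested earlier in the thread (preallocate with every row null, flip a row's bit only after its data is written) can be sketched as follows. This is illustrative plain Python, not Arrow's buffer layout or API:

```python
# Sketch of "all rows null until written" using a validity bitmap.
CAPACITY = 8
values = [0.0] * CAPACITY        # preallocated fixed-width column
validity = bytearray(1)          # one bit per row; all 0 = all null

def write_row(i, value):
    values[i] = value            # fill the row's data...
    validity[0] |= 1 << i        # ...then flip its validity bit last

def visible_rows():
    # the reader recomputes the mask each time it queries the partial batch
    return [values[i] for i in range(CAPACITY) if validity[0] >> i & 1]
```

As noted in the suggestion, the reader must recompute the visible mask on every query, and consistency across multiple columns still needs a barrier of some kind (a per-batch atomic, or MVCC-style columns).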
>> > > > > > > > > > > >> > > > > > > > > > > For event-processing, I think Arrow has to have the >> concept >> > > > of a >> > > > > > > > > > partially >> > > > > > > > > > > full record set. Some alternatives are: >> > > > > > > > > > > - have a batch size of one, thus littering the >> landscape >> > > with >> > > > > > > > trivially >> > > > > > > > > > > small Arrow buffers >> > > > > > > > > > > - artificially increase latency with a batch size >> larger >> > > than >> > > > > > one, >> > > > > > > > but >> > > > > > > > > > not >> > > > > > > > > > > processing any data until a batch is complete >> > > > > > > > > > > - continuously re-write the (entire!) “main” buffer as >> > > > batches of >> > > > > > > > length >> > > > > > > > > > 1 >> > > > > > > > > > > roll in >> > > > > > > > > > > - instead of one main buffer, several, and at some >> > > threshold >> > > > > > combine >> > > > > > > > the >> > > > > > > > > > > last N length-1 batches into a length N buffer ... >> still an >> > > > > > > > inefficiency >> > > > > > > > > > > >> > > > > > > > > > > Consider the case of QAbstractTableModel as the >> underlying >> > > > data >> > > > > > for a >> > > > > > > > > > table >> > > > > > > > > > > or a chart. This visualization shows all of the data >> for >> > > the >> > > > > > recent >> > > > > > > > past >> > > > > > > > > > > as well as events rolling in. If this model >> interface is >> > > > > > > > implemented as >> > > > > > > > > > a >> > > > > > > > > > > view onto “many thousands” of individual event >> buffers then >> > > > we >> > > > > > gain >> > > > > > > > > > nothing >> > > > > > > > > > > from columnar layout. (Suppose there are tons of >> columns >> > > and >> > > > > > most of >> > > > > > > > > > them >> > > > > > > > > > > are scrolled out of the view.). Likewise we cannot >> re-write >> > > > the >> > > > > > > > entire >> > > > > > > > > > > model on each event... time complexity blows up. 
>> What we >> > > > want >> > > > > > is to >> > > > > > > > > > have a >> > > > > > > > > > > large pre-allocated chunk and just change rowCount() >> as >> > > data >> > > > is >> > > > > > > > > > “appended.” >> > > > > > > > > > > Sure, we may run out of space and have another and >> another >> > > > > > chunk for >> > > > > > > > > > > future row ranges, but a handful of chunks chained >> together >> > > > is >> > > > > > better >> > > > > > > > > > than >> > > > > > > > > > > as many chunks as there were events! >> > > > > > > > > > > >> > > > > > > > > > > And again, having a batch size >1 and delaying the >> data >> > > > until a >> > > > > > > > batch is >> > > > > > > > > > > full is a non-starter. >> > > > > > > > > > > >> > > > > > > > > > > I am really hoping to see partially-filled buffers as >> > > > something >> > > > > > we >> > > > > > > > keep >> > > > > > > > > > our >> > > > > > > > > > > finger on moving forward! Or else, what am I missing? >> > > > > > > > > > > >> > > > > > > > > > > -John >> > > > > > > > > > > >> > > > > > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney < >> > > > wesmck...@gmail.com >> > > > > > > >> > > > > > > > wrote: >> > > > > > > > > > > >> > > > > > > > > > > > hi John, >> > > > > > > > > > > > >> > > > > > > > > > > > In C++ the builder classes don't yet support >> writing into >> > > > > > > > preallocated >> > > > > > > > > > > > memory. It would be tricky for applications to >> determine >> > > a >> > > > > > priori >> > > > > > > > > > > > which segments of memory to pass to the builder. It >> seems >> > > > only >> > > > > > > > > > > > feasible for primitive / fixed-size types so my >> guess >> > > > would be >> > > > > > > > that a >> > > > > > > > > > > > separate set of interfaces would need to be >> developed for >> > > > this >> > > > > > > > task. 
>> > > > > > > > > > > > >> > > > > > > > > > > > - Wes >> > > > > > > > > > > > >> > > > > > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau < >> > > > > > jacq...@apache.org> >> > > > > > > > > > wrote: >> > > > > > > > > > > > > >> > > > > > > > > > > > > This is more of a question of implementation >> versus >> > > > > > > > specification. An >> > > > > > > > > > > > arrow >> > > > > > > > > > > > > buffer is generally built and then sealed. In >> different >> > > > > > > > languages, >> > > > > > > > > > this >> > > > > > > > > > > > > building process works differently (a concern of >> the >> > > > language >> > > > > > > > rather >> > > > > > > > > > than >> > > > > > > > > > > > > the memory specification). We don't currently >> allow a >> > > > half >> > > > > > built >> > > > > > > > > > vector >> > > > > > > > > > > > to >> > > > > > > > > > > > > be moved to another language and then be further >> built. >> > > > So >> > > > > > the >> > > > > > > > > > question >> > > > > > > > > > > > is >> > > > > > > > > > > > > really more concrete: what language are you >> looking at >> > > > and >> > > > > > what >> > > > > > > > is >> > > > > > > > > > the >> > > > > > > > > > > > > specific pattern you're trying to undertake for >> > > building. >> > > > > > > > > > > > > >> > > > > > > > > > > > > If you're trying to go across independent >> processes >> > > > (whether >> > > > > > the >> > > > > > > > same >> > > > > > > > > > > > > process restarted or two separate processes active >> > > > > > > > simultaneously) >> > > > > > > > > > you'll >> > > > > > > > > > > > > need to build up your own data structures to help >> with >> > > > this. 
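The point above that preallocation seems "only feasible for primitive / fixed-size types" can be made concrete: a fixed-width slot's position is simply i * width, independent of every other row, while a variable-width value's position depends on the length of every earlier value. A plain-Python sketch (buffer sizes and names invented for illustration):

```python
# Fixed-width vs. variable-width preallocation.
import struct

fixed = bytearray(8 * 4)          # 4 preallocated float64 slots

def write_fixed(i, value):
    # slot position is i * 8: independent of every other row
    struct.pack_into("<d", fixed, 8 * i, value)

offsets = [0]                     # variable-width needs running offsets
blob = bytearray()

def write_var(s):
    # where this value lands depends on all prior writes
    blob.extend(s.encode())
    offsets.append(len(blob))
```

This is why writing row i of a fixed-width column into preallocated memory is trivial, while a builder for variable-width data cannot know its write position a priori.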
>> > > > > > > > > > > > > >> > > > > > > > > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen < >> > > > j...@jgm.org >> > > > > > > >> > > > > > > > wrote: >> > > > > > > > > > > > > >> > > > > > > > > > > > > > Hello, >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Glad to learn of this project— good work! >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > If I allocate a single chunk of memory and start >> > > > building >> > > > > > Arrow >> > > > > > > > > > format >> > > > > > > > > > > > > > within it, does this chunk save any state >> regarding >> > > my >> > > > > > > > progress? >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > For example, suppose I allocate a column for >> floating >> > > > point >> > > > > > > > (fixed >> > > > > > > > > > > > width) >> > > > > > > > > > > > > > and a column for string (variable width). >> Suppose I >> > > > start >> > > > > > > > > > building the >> > > > > > > > > > > > > > floating point column at offset X into my single >> > > > buffer, >> > > > > > and >> > > > > > > > the >> > > > > > > > > > string >> > > > > > > > > > > > > > “pointer” column at offset Y into the same >> single >> > > > buffer, >> > > > > > and >> > > > > > > > the >> > > > > > > > > > > > string >> > > > > > > > > > > > > > data elements at offset Z. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > I write one floating point number and one >> string, >> > > then >> > > > go >> > > > > > away. >> > > > > > > > > > When I >> > > > > > > > > > > > > > come back to this buffer to append another >> value, >> > > does >> > > > the >> > > > > > > > buffer >> > > > > > > > > > > > itself >> > > > > > > > > > > > > > know where I would begin? I.e. is there a >> > > > differentiation >> > > > > > in >> > > > > > > > the >> > > > > > > > > > > > column >> > > > > > > > > > > > > > (or blob) data itself between the available >> space and >> > > > the >> > > > > > used >> > > > > > > > > > space? 
>> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Suppose I write a lot of large variable width >> strings >> > > > and >> > > > > > “run >> > > > > > > > > > out” of >> > > > > > > > > > > > > > space for them before running out of space for >> > > floating >> > > > > > point >> > > > > > > > > > numbers >> > > > > > > > > > > > or >> > > > > > > > > > > > > > string pointers. (I guessed badly when doing >> the >> > > > original >> > > > > > > > > > > > allocation.). I >> > > > > > > > > > > > > > consider this to be Ok since I can always >> “copy” the >> > > > data >> > > > > > to >> > > > > > > > > > “compress >> > > > > > > > > > > > out” >> > > > > > > > > > > > > > the unused fp/pointer buckets... the choice is >> up to >> > > > me. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > The above applied to a (feather?) file is how I >> > > > anticipate >> > > > > > > > > > appending >> > > > > > > > > > > > data >> > > > > > > > > > > > > > to disk... pre-allocate a mem-mapped file and >> > > gradually >> > > > > > fill >> > > > > > > > it up. >> > > > > > > > > > > > The >> > > > > > > > > > > > > > efficiency of file utilization will depend on my >> > > > > > projections >> > > > > > > > > > regarding >> > > > > > > > > > > > > > variable-width data types, but as I said above, >> I can >> > > > > > always >> > > > > > > > > > re-write >> > > > > > > > > > > > the >> > > > > > > > > > > > > > file if/when this bothers me. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Is this the recommended and supported approach >> for >> > > > > > incremental >> > > > > > > > > > appends? >> > > > > > > > > > > > > > I’m really hoping to use Arrow instead of >> rolling my >> > > > own, >> > > > > > but >> > > > > > > > > > > > functionality >> > > > > > > > > > > > > > like this is absolutely key! 
Hoping not to use >> a >> > > > side-car >> > > > > > > > file (or >> > > > > > > > > > > > memory >> > > > > > > > > > > > > > chunk) to store “append progress” information. >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > I am brand new to this project so please >> forgive me >> > > if >> > > > I >> > > > > > have >> > > > > > > > > > > > overlooked >> > > > > > > > > > > > > > something obvious. And again, looks like great >> work >> > > so >> > > > > > far! >> > > > > > > > > > > > > > >> > > > > > > > > > > > > > Thanks! >> > > > > > > > > > > > > > -John >> > > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > >> > > > > > >> > > > >> > > >> >