Hi John,

To give a specific pointer, [1] describes how the streaming protocol is stored to a file.
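For a concrete illustration, here is a minimal sketch (assuming pyarrow; the file name is made up) of appending record batches to a regular file with the streaming writer, which is the pattern recommended further down this thread. The file format described in [1] wraps the same stream with magic numbers and a footer (written with RecordBatchFileWriter instead).

    import pyarrow as pa

    schema = pa.schema([("a", pa.string()), ("b", pa.float64())])

    # Each write_batch() call only appends bytes to the file; previously
    # written bytes are never mutated.
    with pa.OSFile("events.arrows", "wb") as sink:
        writer = pa.RecordBatchStreamWriter(sink, schema)
        for i in range(3):
            batch = pa.RecordBatch.from_arrays(
                [pa.array(["event-%d" % i], type=pa.string()),
                 pa.array([float(i)], type=pa.float64())],
                names=["a", "b"])
            writer.write_batch(batch)
        writer.close()

    # Read every batch back from the same file.
    with pa.OSFile("events.arrows", "rb") as source:
        table = pa.RecordBatchStreamReader(source).read_all()
    print(table.num_rows)  # 3
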
[1] https://arrow.apache.org/docs/format/IPC.html#file-format On Tue, May 7, 2019 at 9:40 AM Wes McKinney <wesmck...@gmail.com> wrote: > hi John, > > As soon as the R folks can install the Arrow R package consistently, > the intent is to replace the Feather internals with the plain Arrow > IPC protocol where we have much better platform support all around. > > If you'd like to experiment with creating an API for pre-allocating > fixed-size Arrow protocol blocks and then mutating the data and > metadata on disk in-place, please be our guest. We don't have the > tools developed yet to do this for you > > - Wes > > On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org> wrote: > > > > Thanks Wes: > > > > "the current Feather format is deprecated" ... yes, but there will be a > > future file format that replaces it, correct? And my discussion of > > immutable "portions" of Arrow buffers, rather than immutability of the > > entire buffer, applies to IPC as well, right? I am only championing the > > idea that this future file format have the convenience that certain > > preallocated rows can be ignored based on a metadata setting. > > > > "I recommend batching your writes" ... this is extremely inefficient and > > adds unacceptable latency, relative to the proposed solution. Do you > > disagree? Either I have a batch length of 1 to minimize latency, which > > eliminates columnar advantages on the read side, or else I add latency. > > Neither works, and it seems that a viable alternative is within sight? > > > > If you don't agree that there is a core issue and opportunity here, I'm > not > > sure how to better make my case.... > > > > -John > > > > On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmck...@gmail.com> > wrote: > > > > > hi John, > > > > > > On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <j...@jgm.org> wrote: > > > > > > > > Wes et al, I completed a preliminary study of populating a Feather > file > > > > incrementally. Some notes and questions: > > > > > > > > I wrote the following dataframe to a feather file: > > > > a b > > > > 0 0123456789 0.0 > > > > 1 0123456789 NaN > > > > 2 0123456789 NaN > > > > 3 0123456789 NaN > > > > 4 None NaN > > > > > > > > In re-writing the flatbuffers metadata (flatc -p doesn't > > > > support --gen-mutable! yuck! C++ to the rescue...), it seems that > > > > read_feather is not affected by NumRows? It seems to be driven > entirely > > > by > > > > the per-column Length values? > > > > > > > > Also, it seems as if one of the primary usages of NullCount is to > > > determine > > > > whether or not a bitfield is present? In the initialization data > above I > > > > was careful to have a null value in each column in order to generate > a > > > > bitfield. > > > > > > Per my prior e-mails, the current Feather format is deprecated, so I'm > > > only willing to engage on a discussion of the official Arrow binary > > > protocol that we use for IPC (memory mapping) and RPC (Flight). > > > > > > > > > > > I then wiped the bitfields in the file and set all of the string > indices > > > to > > > > one past the end of the blob buffer (all strings empty): > > > > a b > > > > 0 None NaN > > > > 1 None NaN > > > > 2 None NaN > > > > 3 None NaN > > > > 4 None NaN > > > > > > > > I then set the first record to some data by consuming some of the > string > > > > blob and row 0 and 1 indices, also setting the double: > > > > > > > > a b > > > > 0 Hello, world! 
5.0 > > > > 1 None NaN > > > > 2 None NaN > > > > 3 None NaN > > > > 4 None NaN > > > > > > > > As mentioned above, NumRows seems to be ignored. I tried adjusting > each > > > > column Length to mask off higher rows and it worked for 4 (hide last > row) > > > > but not for less than 4. I take this to be due to math used to > figure > > > out > > > > where the buffers are relative to one another since there is only one > > > > metadata offset for all of: the (optional) bitset, index column and > (if > > > > string) blobs. > > > > > > > > Populating subsequent rows would proceed in a similar way until all > of > > > the > > > > blob storage has been consumed, which may come before the > pre-allocated > > > > rows have been consumed. > > > > > > > > So what does this mean for my desire to incrementally write these > > > > (potentially memory-mapped) pre-allocated files and/or Arrow buffers > in > > > > memory? Some thoughts: > > > > > > > > - If a single value (such as NumRows) were consulted to determine the > > > table > > > > row dimension then updating this single value would serve to tell a > > > reader > > > > which rows are relevant. Assuming this value is updated after all > other > > > > mutations are complete, and assuming that mutations are only appends > > > > (addition of rows), then concurrency control involves only ensuring > that > > > > this value is not examined while it is being written. > > > > > > > > - NullCount presents a concurrency problem if someone reads the file > > > after > > > > this field has been updated, but before NumRows has exposed the new > > > record > > > > (or vice versa). The idea previously mentioned that there will > "likely > > > > [be] more statistics in the future" feels like it might be scope > creep to > > > > me? This is a data representation, not a calculation framework? If > > > > NullCount had its genesis in the optional nature of the bitfield, I > would > > > > suggest that perhaps NullCount can be dropped in favor of always > > > supplying > > > > the bitfield, which in any event is already contemplated by the spec: > > > > "Implementations may choose to always allocate one anyway as a > matter of > > > > convenience." If the concern is space savings, Arrow is already an > > > > extremely uncompressed format. (Compression is something I would > also > > > > consider to be scope creep as regards Feather... compressed > filesystems > > > can > > > > be employed and there are other compressed dataframe formats.) > However, > > > if > > > > protecting the 4 bytes required to update NowRows turns out to be no > > > easier > > > > than protecting all of the statistical bytes as well as part of the > same > > > > "critical section" (locks: yuck!!) then statistics pose no issue. I > > > have a > > > > feeling that the availability of an atomic write of 4 bytes will > depend > > > on > > > > the storage mechanism... memory vs memory map vs write() etc. > > > > > > > > - The elephant in the room appears to be the presumptive binary > yes/no on > > > > mutability of Arrow buffers. Perhaps the thought is that certain > batch > > > > processes will be wrecked if anyone anywhere is mutating buffers in > any > > > > way. But keep in mind I am not proposing general mutability, only > > > > appending of new data. *A huge amount of batch processing that will > take > > > > place with Arrow is on time-series data (whether financial or > otherwise). 
> > > > It is only natural that architects will want the minimal impedance > > > mismatch > > > > when it comes time to grow those time series as the events occur > going > > > > forward.* So rather than say that I want "mutable" Arrow buffers, I > > > would > > > > pitch this as a call for "immutable populated areas" of Arrow buffers > > > > combined with the concept that the populated area can grow up to > whatever > > > > was preallocated. This will not affect anyone who has "memoized" a > > > > dimension and wants to continue to consider the then-current data as > > > > immutable... it will be immutable and will always be immutable > according > > > to > > > > that then-current dimension. > > > > > > > > Thanks in advance for considering this feedback! I absolutely > require > > > > efficient row-wise growth of an Arrow-like buffer to deal with time > > > series > > > > data in near real time. I believe that preallocation is (by far) the > > > most > > > > efficient way to accomplish this. I hope to be able to use Arrow! > If I > > > > cannot use Arrow than I will be using a home-grown Arrow that is > > > identical > > > > except for this feature, which would be very sad! Even if Arrow > itself > > > > could be used in this manner today, I would be hesitant to use it if > the > > > > use-case was not protected on a go-forward basis. > > > > > > > > > > I recommend batching your writes and using the Arrow binary streaming > > > protocol so you are only appending to a file rather than mutating > > > previously-written bytes. This use case is well defined and supported > > > in the software already. > > > > > > > > > > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format > > > > > > - Wes > > > > > > > Of course, I am completely open to alternative ideas and approaches! > > > > > > > > -John > > > > > > > > On Mon, May 6, 2019 at 11:39 AM Wes McKinney <wesmck...@gmail.com> > > > wrote: > > > > > > > > > hi John -- again, I would caution you against using Feather files > for > > > > > issues of longevity -- the internal memory layout of those files > is a > > > > > "dead man walking" so to speak. > > > > > > > > > > I would advise against forking the project, IMHO that is a dark > path > > > > > that leads nowhere good. We have a large community here and we > accept > > > > > pull requests -- I think the challenge is going to be defining the > use > > > > > case to suitable clarity that a general purpose solution can be > > > > > developed. > > > > > > > > > > - Wes > > > > > > > > > > > > > > > On Mon, May 6, 2019 at 11:16 AM John Muehlhausen <j...@jgm.org> > wrote: > > > > > > > > > > > > François, Wes, > > > > > > > > > > > > Thanks for the feedback. I think the most practical thing for > me to > > > do > > > > > is > > > > > > 1- write a Feather file that is structured to pre-allocate the > space > > > I > > > > > need > > > > > > (e.g. initial variable-length strings are of average size) > > > > > > 2- come up with code to monkey around with the values contained > in > > > the > > > > > > vectors so that before and after each manipulation the file is > valid > > > as I > > > > > > walk the rows ... this is a writer that uses memory mapping > > > > > > 3- check back in here once that works, assuming that it does, to > see > > > if > > > > > we > > > > > > can bless certain mutation paths > > > > > > 4- if we can't bless certain mutation paths, fork the project for > > > those > > > > > who > > > > > > care more about stream processing? 
That would not seem to be > ideal > > > as I > > > > > > think mutation in row-order across the data set is relatively low > > > impact > > > > > on > > > > > > the overall design? > > > > > > > > > > > > Thanks again for engaging the topic! > > > > > > -John > > > > > > > > > > > > On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques < > > > > > > fsaintjacq...@gmail.com> wrote: > > > > > > > > > > > > > Hello John, > > > > > > > > > > > > > > Arrow is not yet suited for partial writes. The specification > only > > > > > > > talks about fully frozen/immutable objects, you're in > > > implementation > > > > > > > defined territory here. For example, the C++ library assumes > the > > > Array > > > > > > > object is immutable; it memoize the null count, and likely more > > > > > > > statistics in the future. > > > > > > > > > > > > > > If you want to use pre-allocated buffers and array, you can > use the > > > > > > > column validity bitmap for this purpose, e.g. set all null by > > > default > > > > > > > and flip once the row is written. It suffers from concurrency > > > issues > > > > > > > (+ invalidation issues as pointed) when dealing with multiple > > > columns. > > > > > > > You'll have to use a barrier of some kind, e.g. a per-batch > global > > > > > > > atomic (if append-only), or dedicated column(s) à-la MVCC. But > > > then, > > > > > > > the reader needs to be aware of this and compute a mask each > time > > > it > > > > > > > needs to query the partial batch. > > > > > > > > > > > > > > This is a common columnar database problem, see [1] for a > recent > > > paper > > > > > > > on the subject. The usual technique is to store the recent data > > > > > > > row-wise, and transform it in column-wise when a threshold is > met > > > akin > > > > > > > to a compaction phase. There was a somewhat related thread [2] > > > lately > > > > > > > about streaming vs batching. In the end, I think your solution > > > will be > > > > > > > very application specific. > > > > > > > > > > > > > > François > > > > > > > > > > > > > > [1] https://db.in.tum.de/downloads/publications/datablocks.pdf > > > > > > > [2] > > > > > > > > > > > > > > > > https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <j...@jgm.org> > > > wrote: > > > > > > > > > > > > > > > > Wes, > > > > > > > > > > > > > > > > I’m not afraid of writing my own C++ code to deal with all of > > > this > > > > > on the > > > > > > > > writer side. I just need a way to “append” (incrementally > > > populate) > > > > > e.g. > > > > > > > > feather files so that a person using e.g. pyarrow doesn’t > suffer > > > some > > > > > > > > catastrophic failure... and “on the side” I tell them which > rows > > > are > > > > > junk > > > > > > > > and deal with any concurrency issues that can’t be solved in > the > > > > > arena of > > > > > > > > atomicity and ordering of ops. For now I care about basic > types > > > but > > > > > > > > including variable-width strings. > > > > > > > > > > > > > > > > For event-processing, I think Arrow has to have the concept > of a > > > > > > > partially > > > > > > > > full record set. 
Some alternatives are: > > > > > > > > - have a batch size of one, thus littering the landscape with > > > > > trivially > > > > > > > > small Arrow buffers > > > > > > > > - artificially increase latency with a batch size larger than > > > one, > > > > > but > > > > > > > not > > > > > > > > processing any data until a batch is complete > > > > > > > > - continuously re-write the (entire!) “main” buffer as > batches of > > > > > length > > > > > > > 1 > > > > > > > > roll in > > > > > > > > - instead of one main buffer, several, and at some threshold > > > combine > > > > > the > > > > > > > > last N length-1 batches into a length N buffer ... still an > > > > > inefficiency > > > > > > > > > > > > > > > > Consider the case of QAbstractTableModel as the underlying > data > > > for a > > > > > > > table > > > > > > > > or a chart. This visualization shows all of the data for the > > > recent > > > > > past > > > > > > > > as well as events rolling in. If this model interface is > > > > > implemented as > > > > > > > a > > > > > > > > view onto “many thousands” of individual event buffers then > we > > > gain > > > > > > > nothing > > > > > > > > from columnar layout. (Suppose there are tons of columns and > > > most of > > > > > > > them > > > > > > > > are scrolled out of the view.). Likewise we cannot re-write > the > > > > > entire > > > > > > > > model on each event... time complexity blows up. What we > want > > > is to > > > > > > > have a > > > > > > > > large pre-allocated chunk and just change rowCount() as data > is > > > > > > > “appended.” > > > > > > > > Sure, we may run out of space and have another and another > > > chunk for > > > > > > > > future row ranges, but a handful of chunks chained together > is > > > better > > > > > > > than > > > > > > > > as many chunks as there were events! > > > > > > > > > > > > > > > > And again, having a batch size >1 and delaying the data > until a > > > > > batch is > > > > > > > > full is a non-starter. > > > > > > > > > > > > > > > > I am really hoping to see partially-filled buffers as > something > > > we > > > > > keep > > > > > > > our > > > > > > > > finger on moving forward! Or else, what am I missing? > > > > > > > > > > > > > > > > -John > > > > > > > > > > > > > > > > On Mon, May 6, 2019 at 8:24 AM Wes McKinney < > wesmck...@gmail.com > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > hi John, > > > > > > > > > > > > > > > > > > In C++ the builder classes don't yet support writing into > > > > > preallocated > > > > > > > > > memory. It would be tricky for applications to determine a > > > priori > > > > > > > > > which segments of memory to pass to the builder. It seems > only > > > > > > > > > feasible for primitive / fixed-size types so my guess > would be > > > > > that a > > > > > > > > > separate set of interfaces would need to be developed for > this > > > > > task. > > > > > > > > > > > > > > > > > > - Wes > > > > > > > > > > > > > > > > > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau < > > > jacq...@apache.org> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > This is more of a question of implementation versus > > > > > specification. An > > > > > > > > > arrow > > > > > > > > > > buffer is generally built and then sealed. In different > > > > > languages, > > > > > > > this > > > > > > > > > > building process works differently (a concern of the > language > > > > > rather > > > > > > > than > > > > > > > > > > the memory specification). 
We don't currently allow a > half > > > built > > > > > > > vector > > > > > > > > > to > > > > > > > > > > be moved to another language and then be further built. > So > > > the > > > > > > > question > > > > > > > > > is > > > > > > > > > > really more concrete: what language are you looking at > and > > > what > > > > > is > > > > > > > the > > > > > > > > > > specific pattern you're trying to undertake for building. > > > > > > > > > > > > > > > > > > > > If you're trying to go across independent processes > (whether > > > the > > > > > same > > > > > > > > > > process restarted or two separate processes active > > > > > simultaneously) > > > > > > > you'll > > > > > > > > > > need to build up your own data structures to help with > this. > > > > > > > > > > > > > > > > > > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen < > j...@jgm.org > > > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > > > > > Glad to learn of this project— good work! > > > > > > > > > > > > > > > > > > > > > > If I allocate a single chunk of memory and start > building > > > Arrow > > > > > > > format > > > > > > > > > > > within it, does this chunk save any state regarding my > > > > > progress? > > > > > > > > > > > > > > > > > > > > > > For example, suppose I allocate a column for floating > point > > > > > (fixed > > > > > > > > > width) > > > > > > > > > > > and a column for string (variable width). Suppose I > start > > > > > > > building the > > > > > > > > > > > floating point column at offset X into my single > buffer, > > > and > > > > > the > > > > > > > string > > > > > > > > > > > “pointer” column at offset Y into the same single > buffer, > > > and > > > > > the > > > > > > > > > string > > > > > > > > > > > data elements at offset Z. > > > > > > > > > > > > > > > > > > > > > > I write one floating point number and one string, then > go > > > away. > > > > > > > When I > > > > > > > > > > > come back to this buffer to append another value, does > the > > > > > buffer > > > > > > > > > itself > > > > > > > > > > > know where I would begin? I.e. is there a > differentiation > > > in > > > > > the > > > > > > > > > column > > > > > > > > > > > (or blob) data itself between the available space and > the > > > used > > > > > > > space? > > > > > > > > > > > > > > > > > > > > > > Suppose I write a lot of large variable width strings > and > > > “run > > > > > > > out” of > > > > > > > > > > > space for them before running out of space for floating > > > point > > > > > > > numbers > > > > > > > > > or > > > > > > > > > > > string pointers. (I guessed badly when doing the > original > > > > > > > > > allocation.). I > > > > > > > > > > > consider this to be Ok since I can always “copy” the > data > > > to > > > > > > > “compress > > > > > > > > > out” > > > > > > > > > > > the unused fp/pointer buckets... the choice is up to > me. > > > > > > > > > > > > > > > > > > > > > > The above applied to a (feather?) file is how I > anticipate > > > > > > > appending > > > > > > > > > data > > > > > > > > > > > to disk... pre-allocate a mem-mapped file and gradually > > > fill > > > > > it up. > > > > > > > > > The > > > > > > > > > > > efficiency of file utilization will depend on my > > > projections > > > > > > > regarding > > > > > > > > > > > variable-width data types, but as I said above, I can > > > always > > > > > > > re-write > > > > > > > > > the > > > > > > > > > > > file if/when this bothers me. 
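To put rough numbers on the preallocation being described above, here is a small back-of-the-envelope sketch; the helper name and the 32-byte average string size are made up for illustration, and the 8/64-byte buffer padding the format actually applies is ignored.

    # Hypothetical sizing helper (names made up): per-buffer byte counts for a
    # preallocated batch with one float64 column and one utf8 column.
    def preallocated_sizes(num_rows, avg_string_bytes):
        return {
            "validity bitmap (per column)": (num_rows + 7) // 8,
            "float64 values": num_rows * 8,
            "string offsets (int32, n+1)": (num_rows + 1) * 4,
            "string blob (estimated)": num_rows * avg_string_bytes,
        }

    # 1M preallocated rows, guessing 32 bytes per string on average.
    print(preallocated_sizes(1_000_000, avg_string_bytes=32))

If the blob estimate is too small, the string space runs out before the row budget does, which is exactly the "guessed badly" case described above.
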
> > > > > > > > > > > > > > > > > > > > > > Is this the recommended and supported approach for > > > incremental > > > > > > > appends? > > > > > > > > > > > I’m really hoping to use Arrow instead of rolling my > own, > > > but > > > > > > > > > functionality > > > > > > > > > > > like this is absolutely key! Hoping not to use a > side-car > > > > > file (or > > > > > > > > > memory > > > > > > > > > > > chunk) to store “append progress” information. > > > > > > > > > > > > > > > > > > > > > > I am brand new to this project so please forgive me if > I > > > have > > > > > > > > > overlooked > > > > > > > > > > > something obvious. And again, looks like great work so > > > far! > > > > > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > -John > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
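To round out the thread, here is a hedged sketch of the reader-side masking idea mentioned above: a reader memory-maps a file in the IPC file format and exposes only a committed prefix of rows, with the committed row count kept in a hypothetical side-car counter. Nothing here is an official Arrow facility; the file names, the side-car file, and the assumption that some lower-level writer has populated the batch in place (as John describes) are all made up for illustration.

    import struct
    import pyarrow as pa

    # Hypothetical side-car file holding one little-endian uint64: the number
    # of rows the writer has fully populated. The writer updates it only after
    # the row data and validity bits are in place.
    def committed_rows(counter_path):
        with open(counter_path, "rb") as f:
            return struct.unpack("<Q", f.read(8))[0]

    # Memory-map a file written in the IPC file format and read it zero-copy.
    source = pa.memory_map("preallocated.arrow", "r")
    table = pa.RecordBatchFileReader(source).read_all()

    # Expose only the populated prefix; rows beyond it are preallocated filler.
    visible = table.slice(0, committed_rows("preallocated.arrow.rows"))
    print(visible.num_rows)

Whether updating that counter is atomic still depends on the storage mechanism, echoing John's point above about protecting the few bytes that expose new rows.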