François, Wes,

Thanks for the feedback. I think the most practical thing for me to do
is:

1. Write a Feather file that is structured to pre-allocate the space I
   need (e.g. initial variable-length strings are of average size).
2. Come up with code that manipulates the values contained in the
   vectors so that the file is valid before and after each manipulation
   as I walk the rows... this is a writer that uses memory mapping.
   (A sketch of step 1 follows just below; the fill-in side is sketched
   at the bottom of this message.)
3. Check back in here once that works, assuming that it does, to see if
   we can bless certain mutation paths.
4. If we can't bless certain mutation paths, fork the project for those
   who care more about stream processing? That would not seem ideal,
   since I think mutation in row order across the data set has
   relatively low impact on the overall design.
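Here is roughly what I mean for step 1, as a pyarrow sketch (untested;
the file name, row count, and 32-byte average string size are just
illustrative guesses, and compression has to be off so the buffers land
in the file unencoded):

    import pyarrow as pa
    import pyarrow.feather as feather

    N = 100_000    # pre-allocated row capacity (a guess)
    AVG = 32       # assumed average string size

    # Every row starts as a placeholder of the average size, so the
    # variable-width region of the file is pre-sized.  Uncompressed
    # Feather V2 is the Arrow IPC file format, so the Arrow buffers
    # are written into the file as-is.
    table = pa.table({
        "price": pa.array([0.0] * N, type=pa.float64()),
        "symbol": pa.array([" " * AVG] * N, type=pa.string()),
    })
    feather.write_feather(table, "preallocated.feather",
                          compression="uncompressed")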
Thanks again for engaging the topic!
-John

On Mon, May 6, 2019 at 10:26 AM Francois Saint-Jacques <
fsaintjacq...@gmail.com> wrote:

> Hello John,
>
> Arrow is not yet suited for partial writes. The specification only
> talks about fully frozen/immutable objects; you're in
> implementation-defined territory here. For example, the C++ library
> assumes the Array object is immutable: it memoizes the null count,
> and likely more statistics in the future.
>
> If you want to use pre-allocated buffers and arrays, you can use the
> column validity bitmap for this purpose, e.g. set all rows null by
> default and flip the bit once the row is written. This suffers from
> concurrency issues (plus the invalidation issues pointed out above)
> when dealing with multiple columns. You'll have to use a barrier of
> some kind, e.g. a per-batch global atomic (if append-only), or
> dedicated column(s) à la MVCC. But then the reader needs to be aware
> of this and compute a mask each time it queries the partial batch.
>
> This is a common columnar database problem; see [1] for a recent
> paper on the subject. The usual technique is to store the recent data
> row-wise and transform it to column-wise form when a threshold is
> met, akin to a compaction phase. There was a somewhat related thread
> [2] lately about streaming vs. batching. In the end, I think your
> solution will be very application-specific.
>
> François
>
> [1] https://db.in.tum.de/downloads/publications/datablocks.pdf
> [2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E
>
> On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <j...@jgm.org> wrote:
> >
> > Wes,
> >
> > I’m not afraid of writing my own C++ code to deal with all of this
> > on the writer side. I just need a way to “append” (incrementally
> > populate) e.g. feather files so that a person using e.g. pyarrow
> > doesn’t suffer some catastrophic failure... and “on the side” I
> > tell them which rows are junk and deal with any concurrency issues
> > that can’t be solved in the arena of atomicity and ordering of ops.
> > For now I care about basic types, but including variable-width
> > strings.
> >
> > For event processing, I think Arrow has to have the concept of a
> > partially full record set. Some alternatives are:
> > - have a batch size of one, thus littering the landscape with
> >   trivially small Arrow buffers
> > - artificially increase latency with a batch size larger than one,
> >   but not processing any data until a batch is complete
> > - continuously re-write the (entire!) “main” buffer as batches of
> >   length 1 roll in
> > - instead of one main buffer, several, and at some threshold
> >   combine the last N length-1 batches into a length-N buffer...
> >   still an inefficiency
> >
> > Consider the case of QAbstractTableModel as the underlying data for
> > a table or a chart. This visualization shows all of the data for
> > the recent past as well as events rolling in. If this model
> > interface is implemented as a view onto “many thousands” of
> > individual event buffers, then we gain nothing from columnar
> > layout. (Suppose there are tons of columns and most of them are
> > scrolled out of the view.) Likewise we cannot re-write the entire
> > model on each event... time complexity blows up.
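François, interjecting here to check my understanding of the
validity-bitmap "publish" you describe above, a rough pyarrow sketch
(untested; single writer only, and subject to the memoization caveat
you raise, since pyarrow caches null_count on first access):

    import numpy as np
    import pyarrow as pa

    N = 64
    # Pre-allocated mutable buffers, unlike the sealed ones IPC reads.
    validity = pa.allocate_buffer(N // 8)
    values = pa.allocate_buffer(N * 8)

    bits = np.frombuffer(validity, dtype=np.uint8)
    bits[:] = 0                          # every row starts null
    data = np.frombuffer(values, dtype=np.float64)

    arr = pa.Array.from_buffers(pa.float64(), N, [validity, values])

    # "Publish" row i: write the value first, then flip its validity
    # bit.  A reader that has already computed arr.null_count will not
    # see the flip -- exactly the invalidation caveat above.
    def publish(i, x):
        data[i] = x
        bits[i // 8] |= 1 << (i % 8)

    publish(0, 42.0)
    print(arr[0])  # 42.0; rows 1..63 are still null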
> > What we want is to have a large pre-allocated chunk and just change
> > rowCount() as data is “appended.” Sure, we may run out of space and
> > have another and another chunk for future row ranges, but a handful
> > of chunks chained together is better than as many chunks as there
> > were events!
> >
> > And again, having a batch size >1 and delaying the data until a
> > batch is full is a non-starter.
> >
> > I am really hoping to see partially-filled buffers as something we
> > keep our finger on moving forward! Or else, what am I missing?
> >
> > -John
> >
> > On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi John,
> > >
> > > In C++ the builder classes don't yet support writing into
> > > preallocated memory. It would be tricky for applications to
> > > determine a priori which segments of memory to pass to the
> > > builder. It seems only feasible for primitive / fixed-size types,
> > > so my guess would be that a separate set of interfaces would need
> > > to be developed for this task.
> > >
> > > - Wes
> > >
> > > On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:
> > > >
> > > > This is more of a question of implementation versus
> > > > specification. An Arrow buffer is generally built and then
> > > > sealed. In different languages, this building process works
> > > > differently (a concern of the language rather than the memory
> > > > specification). We don't currently allow a half-built vector to
> > > > be moved to another language and then be further built. So the
> > > > question is really more concrete: what language are you looking
> > > > at, and what is the specific pattern you're trying to undertake
> > > > for building?
> > > >
> > > > If you're trying to go across independent processes (whether
> > > > the same process restarted or two separate processes active
> > > > simultaneously) you'll need to build up your own data
> > > > structures to help with this.
> > > >
> > > > On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > Glad to learn of this project -- good work!
> > > > >
> > > > > If I allocate a single chunk of memory and start building
> > > > > Arrow format within it, does this chunk save any state
> > > > > regarding my progress?
> > > > >
> > > > > For example, suppose I allocate a column for floating point
> > > > > (fixed width) and a column for string (variable width).
> > > > > Suppose I start building the floating point column at offset
> > > > > X into my single buffer, and the string “pointer” column at
> > > > > offset Y into the same single buffer, and the string data
> > > > > elements at offset Z.
> > > > >
> > > > > I write one floating point number and one string, then go
> > > > > away. When I come back to this buffer to append another
> > > > > value, does the buffer itself know where I would begin? I.e.,
> > > > > is there a differentiation in the column (or blob) data
> > > > > itself between the available space and the used space?
> > > > >
> > > > > Suppose I write a lot of large variable-width strings and
> > > > > “run out” of space for them before running out of space for
> > > > > floating point numbers or string pointers. (I guessed badly
> > > > > when doing the original allocation.) I consider this to be
> > > > > OK, since I can always “copy” the data to “compress out” the
> > > > > unused fp/pointer buckets... the choice is up to me.
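Interjecting on my own original question above: to make the X/Y/Z
layout concrete, here is a rough pyarrow sketch of the three buffers
behind a string column (untested; the capacity and byte budget are
guesses). Note that the write cursor has to live outside Arrow -- the
buffers store no notion of used vs. available space, which I now
believe answers my "does the chunk save any state" question:

    import numpy as np
    import pyarrow as pa

    N = 1_000        # row capacity (sizes the validity/offset regions)
    BYTES = 32 * N   # guessed budget for string data (the offset-Z region)

    validity = pa.allocate_buffer((N + 7) // 8)
    offsets = pa.allocate_buffer(4 * (N + 1))  # int32 "pointer" column (Y)
    strdata = pa.allocate_buffer(BYTES)        # string elements (Z)

    vbits = np.frombuffer(validity, dtype=np.uint8)
    vbits[:] = 0
    off = np.frombuffer(offsets, dtype=np.int32)
    off[:] = 0
    raw = np.frombuffer(strdata, dtype=np.uint8)

    rows = 0   # "append progress" -- necessarily tracked outside Arrow
    used = 0   # bytes of string data consumed so far

    def append(s: bytes):
        global rows, used
        raw[used:used + len(s)] = np.frombuffer(s, dtype=np.uint8)
        used += len(s)
        off[rows + 1] = used                 # next string starts here
        vbits[rows // 8] |= 1 << (rows % 8)  # publish the row
        rows += 1

    append(b"hello")
    arr = pa.Array.from_buffers(pa.string(), rows,
                                [validity, offsets, strdata])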
> > > > > The above applied to a (feather?) file is how I anticipate
> > > > > appending data to disk... pre-allocate a mem-mapped file and
> > > > > gradually fill it up. The efficiency of file utilization will
> > > > > depend on my projections regarding variable-width data types,
> > > > > but as I said above, I can always re-write the file if/when
> > > > > this bothers me.
> > > > >
> > > > > Is this the recommended and supported approach for
> > > > > incremental appends? I’m really hoping to use Arrow instead
> > > > > of rolling my own, but functionality like this is absolutely
> > > > > key! Hoping not to use a side-car file (or memory chunk) to
> > > > > store “append progress” information.
> > > > >
> > > > > I am brand new to this project, so please forgive me if I
> > > > > have overlooked something obvious. And again, looks like
> > > > > great work so far!
> > > > >
> > > > > Thanks!
> > > > > -John
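And for step 2 of my plan at the top, the fill-in side would look
something like this (untested; it assumes buffers read zero-copy from
a writable memory map come back mutable, which is the part I still
need to verify):

    import numpy as np
    import pyarrow as pa

    # Re-open the pre-allocated file from the first sketch with a
    # writable map; uncompressed Feather V2 is the Arrow IPC file
    # format, so the IPC reader can open it zero-copy.
    mm = pa.memory_map("preallocated.feather", "r+")
    batch = pa.ipc.open_file(mm).get_batch(0)

    prices = batch.column(0)               # the float64 column
    view = np.frombuffer(prices.buffers()[1], dtype=np.float64)
    view[0] = 101.25   # patch row 0 directly in the mapped file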