"perhaps the right way forward is to start by gathering a number of interested parties and start designing a proposal"

YES! How do we go about this?

"There are some early experiments to populate Arrow nodes in microbatches from Kafka" (cf. link in thread) Who did this?

-John

On Mon, May 13, 2019 at 9:39 AM Antoine Pitrou <anto...@python.org> wrote:

Hi John,

We are strongly committed to backwards compatibility in the Arrow format specification. You should not fear any compatibility-breaking changes in the future. People sometimes express uncertainty because we have not reached 1.0 yet, but that's because we have not yet implemented all the data types we want to be in that spec.

As for the general goal of making Arrow more suitable for event processing, perhaps the right way forward is to start by gathering a number of interested parties and start designing a proposal (which may or may not include spec additions).

Regards

Antoine.

On 13/05/2019 at 15:38, John Muehlhausen wrote:

Micah, yes, it all works at the moment. How have we staked out that it will always work in the future as people continue to work on the spec? That is my concern.

Also, it would be extremely useful if someone opening a file had my nil rows hidden from them without needing to analyze the app-specific side-car data.

I believe that something like my solution is how everyone will do efficient event processing with Arrow, so I believe it is worth a broader discussion.

On Mon, May 13, 2019 at 8:30 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Hi John,

To expand on this, I don't think there is anything preventing you in the current spec from over-provisioning the underlying buffers. So you can effectively split "capacity" from "length" by subtracting the amount of space taken by the rows indicated in the batch from the size of the buffer. For variable-width types you would have to reference the last value in the offset buffer to determine used capacity.
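Micah's over-provisioning idea can be reduced to a short schematic sketch (plain Python modeling the buffers; `PreallocatedStringColumn` and its methods are illustrative names, not the Arrow C++ API):

```python
class PreallocatedStringColumn:
    """Variable-width string column whose buffers are over-provisioned:
    'capacity' is the allocation, 'length' is the rows actually populated."""

    def __init__(self, row_capacity, data_capacity):
        self.offsets = [0] * (row_capacity + 1)  # Arrow-style: length + 1 offsets
        self.data = bytearray(data_capacity)     # over-provisioned blob storage
        self.row_capacity = row_capacity
        self.length = 0                          # populated rows ("length")

    def used_data(self):
        # "reference the last value in the offset buffer to determine used capacity"
        return self.offsets[self.length]

    def try_append(self, s):
        b = s.encode("utf-8")
        start = self.used_data()
        if self.length == self.row_capacity or start + len(b) > len(self.data):
            return False  # out of capacity: the caller starts the next batch
        self.data[start:start + len(b)] = b
        self.offsets[self.length + 1] = start + len(b)
        self.length += 1
        return True

    def get(self, i):
        assert i < self.length, "row not populated"
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]]).decode("utf-8")
```

A failed `try_append` is the signal, per the next paragraph of the thread, to stop incrementing this batch's count and move on to a fresh batch.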
When appending, if you run out of memory in a particular buffer, you don't increment the count on the batch and simply append to the next one.

This is restating parts of the thread, but I don't think the C++ code base has any facility for this directly, and if you want to be parsimonious with memory you would have to rewrite batches at some point.

Apologies if I missed something; as Wes said, this is a long thread.

Thanks,
Micah

On Mon, May 13, 2019 at 6:07 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

Sorry, there are a number of fairly long e-mails in this thread; I'm having a hard time following all of the details.

I suspect the most parsimonious thing would be to have some "sidecar" metadata that tracks the state of your writes into pre-allocated Arrow blocks so that readers know to call "Slice" on the blocks to obtain only the written-so-far portion. I'm not likely to be in favor of making changes to the binary protocol for this use case; if others have opinions I'll let them speak for themselves.

- Wes

On Mon, May 13, 2019 at 7:50 AM John Muehlhausen <j...@jgm.org> wrote:

Any thoughts on a RecordBatch distinguishing size from capacity? (To borrow std::vector terminology)

Thanks,
John

On Thu, May 9, 2019 at 2:46 PM John Muehlhausen <j...@jgm.org> wrote:

Wes et al, I think my core proposal is that Message.fbs:RecordBatch split the "length" parameter into "theoretical max length" and "utilized length" (perhaps not those exact names).

"theoretical max length" is the same as "length" now ...
("/// ... The arrays in the batch should all have this")

"utilized length" is the number of rows (starting from the first one) that actually contain interesting data; the rest do not.

The reason we can have a RecordBatch where these numbers are not the same is that the RecordBatch space was preallocated (for performance reasons) and the number of rows that actually "fit" depends on how correct the preallocation was. In any case, it gives the user control of this space/time tradeoff: wasted space in order to save time in record batch construction. The fact that some space will usually be wasted when there are variable-length columns (barring extreme luck) with this batch construction paradigm explains the word "theoretical" above. This also gives us the ability to look at a partially constructed batch that is still being constructed, given appropriate user-supplied concurrency control.

I am not an expert in all of the Arrow variable-length data types, but I think this works if they are all similar to variable-length strings, where we advance through "blob storage" by setting the indexes into that storage for the current and next row in order to indicate that we have incrementally consumed more blob storage. (Conceptually this storage is "unallocated" after the pre-allocation and before rows are populated.)

At a high level I am seeking to shore up the format for event ingress into real-time analytics that have some look-back window. If I'm not mistaken, I think this is the subject of the last multi-sentence paragraph here?: https://zd.net/2H0LlBY

Currently we have a less-efficient paradigm where "microbatches" (e.g.
of length 1 for minimal latency) have to spin the CPU periodically in order to be combined into buffers where we get the columnar layout benefit. With pre-allocation we can deal with microbatches (a partially populated larger RecordBatch) and immediately have the columnar layout benefits for the populated section with no additional computation.

For example, consider an event processing system that calculates a "moving average" as events roll in. While this is somewhat contrived, let's assume that the moving average window is 1000 periods and our pre-allocation ("theoretical max length") of RecordBatch elements is 100. The algorithm would be something like this, for a list of RecordBatch buffers in memory:

    initialization():
      set up configuration of expected variable-length storage requirements,
      e.g. the template RecordBatch mentioned below

    onIncomingEvent(event):
      obtain lock  /// cf. swoopIn() below
      if last RecordBatch utilized length is not less than theoretical max length,
          or variable-length components of "event" will not fit in remaining blob storage:
        create a new RecordBatch pre-allocation of max utilized length 100 and
        with blob preallocation that is max(expected, event) ... in case the single
        event is larger than the expectation for 100 events
        (note: in the expected case this can be very fast, as it is a
        malloc() and a memcpy() from a template!)
        set current RecordBatch to this newly created one
      add event to current RecordBatch (for the non-calculated fields)
      increment utilized length of current RecordBatch
      calculate the calculated fields (in this case, moving average) by
      looking back at previous rows in this and previous RecordBatch objects
      free() any RecordBatch objects that are now before the lookback window

    swoopIn():  /// somebody wants to chart the lookback window
      obtain lock
      visit all of the relevant data in the RecordBatches to construct the chart
      /// notice that the last RecordBatch may not yet be "as full as possible"

The above analysis (minus the free()) could apply to the IPC file format: the lock could be a file lock and the swoopIn() could be a separate process. In the case of the file format, while the file is locked, a new RecordBatch would overwrite the previous file Footer and a new Footer would be written. In order to be able to delete or archive old data, multiple files could be strung together in a logical series.

-John

On Tue, May 7, 2019 at 2:39 PM Wes McKinney <wesmck...@gmail.com> wrote:

On Tue, May 7, 2019 at 12:26 PM John Muehlhausen <j...@jgm.org> wrote:

Wes, are we saying that `pa.ipc.open_file(...).read_pandas()` already reads the future Feather format? If not, how will the future format differ? I will work on my access pattern with this format instead of the current feather format. Sorry I was not clear on that earlier.

Yes, under the hood those will use the same zero-copy binary protocol code paths to read the file.

Micah, thank you!
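John's onIncomingEvent() pseudocode earlier in the thread can be reduced to a runnable plain-Python sketch. The `Batch` class is a stand-in for a preallocated RecordBatch, not an Arrow type; the O(n) window sum and the module-level state are for clarity only:

```python
from collections import deque

BATCH_CAPACITY = 100   # "theoretical max length" of each preallocated batch
WINDOW = 1000          # look-back window, in rows

class Batch:
    """Stand-in for a preallocated RecordBatch: fixed capacity, growing utilized length."""
    def __init__(self, capacity):
        self.values = [0.0] * capacity   # preallocated data column
        self.avg = [0.0] * capacity      # preallocated calculated column
        self.capacity = capacity
        self.length = 0                  # "utilized length"

batches = deque()              # the chain of RecordBatch stand-ins
window = deque(maxlen=WINDOW)  # values currently inside the look-back window

def on_incoming_event(value):
    # start a fresh preallocation when the current batch is full
    if not batches or batches[-1].length == batches[-1].capacity:
        batches.append(Batch(BATCH_CAPACITY))
    b = batches[-1]
    b.values[b.length] = value
    window.append(value)
    b.avg[b.length] = sum(window) / len(window)  # moving average over the look-back
    b.length += 1                                # publish the row last
    # free() analogue: drop batches that fell wholly outside the window
    while sum(x.length for x in batches) - batches[0].length >= WINDOW:
        batches.popleft()
```

The lock from the pseudocode is omitted; a real writer would hold it around the body of `on_incoming_event` so a `swoopIn()`-style reader sees a consistent chain.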
On Tue, May 7, 2019 at 11:44 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

Hi John,

To give a specific pointer, [1] describes how the streaming protocol is stored to a file.

[1] https://arrow.apache.org/docs/format/IPC.html#file-format

On Tue, May 7, 2019 at 9:40 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

As soon as the R folks can install the Arrow R package consistently, the intent is to replace the Feather internals with the plain Arrow IPC protocol, where we have much better platform support all around.

If you'd like to experiment with creating an API for pre-allocating fixed-size Arrow protocol blocks and then mutating the data and metadata on disk in-place, please be our guest. We don't have the tools developed yet to do this for you.

- Wes

On Tue, May 7, 2019 at 11:25 AM John Muehlhausen <j...@jgm.org> wrote:

Thanks Wes:

"the current Feather format is deprecated" ... yes, but there will be a future file format that replaces it, correct? And my discussion of immutable "portions" of Arrow buffers, rather than immutability of the entire buffer, applies to IPC as well, right? I am only championing the idea that this future file format have the convenience that certain preallocated rows can be ignored based on a metadata setting.

"I recommend batching your writes" ...
this is extremely inefficient and adds unacceptable latency, relative to the proposed solution. Do you disagree? Either I have a batch length of 1 to minimize latency, which eliminates columnar advantages on the read side, or else I add latency. Neither works, and it seems that a viable alternative is within sight?

If you don't agree that there is a core issue and opportunity here, I'm not sure how to better make my case....

-John

On Tue, May 7, 2019 at 11:02 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

On Tue, May 7, 2019 at 10:53 AM John Muehlhausen <j...@jgm.org> wrote:

> Wes et al, I completed a preliminary study of populating a Feather file
> incrementally. Some notes and questions:
>
> I wrote the following dataframe to a feather file:
>
>                 a    b
>     0  0123456789  0.0
>     1  0123456789  NaN
>     2  0123456789  NaN
>     3  0123456789  NaN
>     4        None  NaN
>
> In re-writing the flatbuffers metadata (flatc -p doesn't
> support --gen-mutable! yuck! C++ to the rescue...), it seems that
> read_feather is not affected by NumRows? It seems to be driven entirely
> by the per-column Length values?
>
> Also, it seems as if one of the primary usages of NullCount is to
> determine whether or not a bitfield is present? In the initialization
> data above I was careful to have a null value in each column in order
> to generate a bitfield.

Per my prior e-mails, the current Feather format is deprecated, so I'm only willing to engage on a discussion of the official Arrow binary protocol that we use for IPC (memory mapping) and RPC (Flight).

> I then wiped the bitfields in the file and set all of the string
> indices to one past the end of the blob buffer (all strings empty):
>
>           a    b
>     0  None  NaN
>     1  None  NaN
>     2  None  NaN
>     3  None  NaN
>     4  None  NaN
>
> I then set the first record to some data by consuming some of the
> string blob and row 0 and 1 indices, also setting the double:
>
>                    a    b
>     0  Hello, world!  5.0
>     1           None  NaN
>     2           None  NaN
>     3           None  NaN
>     4           None  NaN
>
> As mentioned above, NumRows seems to be ignored. I tried adjusting
> each column Length to mask off higher rows and it worked for 4 (hide
> last row) but not for less than 4. I take this to be due to math used
> to figure out where the buffers are relative to one another, since
> there is only one metadata offset for all of: the (optional) bitset,
> the index column and (if string) the blobs.
> Populating subsequent rows would proceed in a similar way until all of
> the blob storage has been consumed, which may come before the
> pre-allocated rows have been consumed.
>
> So what does this mean for my desire to incrementally write these
> (potentially memory-mapped) pre-allocated files and/or Arrow buffers in
> memory? Some thoughts:
>
> - If a single value (such as NumRows) were consulted to determine the
> table row dimension, then updating this single value would serve to
> tell a reader which rows are relevant. Assuming this value is updated
> after all other mutations are complete, and assuming that mutations are
> only appends (addition of rows), then concurrency control involves only
> ensuring that this value is not examined while it is being written.
>
> - NullCount presents a concurrency problem if someone reads the file
> after this field has been updated, but before NumRows has exposed the
> new record (or vice versa). The idea previously mentioned that there
> will "likely [be] more statistics in the future" feels like it might be
> scope creep to me? This is a data representation, not a calculation
> framework?
> If NullCount had its genesis in the optional nature of the bitfield, I
> would suggest that perhaps NullCount can be dropped in favor of always
> supplying the bitfield, which in any event is already contemplated by
> the spec: "Implementations may choose to always allocate one anyway as
> a matter of convenience." If the concern is space savings, Arrow is
> already an extremely uncompressed format. (Compression is something I
> would also consider to be scope creep as regards Feather... compressed
> filesystems can be employed and there are other compressed dataframe
> formats.) However, if protecting the 4 bytes required to update
> NumRows turns out to be no easier than protecting all of the
> statistical bytes as well, as part of the same "critical section"
> (locks: yuck!!), then statistics pose no issue. I have a feeling that
> the availability of an atomic write of 4 bytes will depend on the
> storage mechanism... memory vs memory map vs write() etc.
>
> - The elephant in the room appears to be the presumptive binary yes/no
> on mutability of Arrow buffers. Perhaps the thought is that certain
> batch processes will be wrecked if anyone anywhere is mutating buffers
> in any way. But keep in mind I am not proposing general mutability,
> only appending of new data. *A huge amount of batch processing that
> will take place with Arrow is on time-series data (whether financial
> or otherwise). It is only natural that architects will want the
> minimal impedance mismatch when it comes time to grow those time
> series as the events occur going forward.* So rather than say that I
> want "mutable" Arrow buffers, I would pitch this as a call for
> "immutable populated areas" of Arrow buffers, combined with the
> concept that the populated area can grow up to whatever was
> preallocated. This will not affect anyone who has "memoized" a
> dimension and wants to continue to consider the then-current data as
> immutable... it will be immutable and will always be immutable
> according to that then-current dimension.
>
> Thanks in advance for considering this feedback! I absolutely require
> efficient row-wise growth of an Arrow-like buffer to deal with time
> series data in near real time. I believe that preallocation is (by
> far) the most efficient way to accomplish this. I hope to be able to
> use Arrow! If I cannot use Arrow then I will be using a home-grown
> Arrow that is identical except for this feature, which would be very
> sad!
Even > >> if > >>>>>> Arrow > >>>>>>>>> itself > >>>>>>>>>>>> could be used in this manner today, I would be hesitant > >> to > >>>>>> use it > >>>>>>>> if > >>>>>>>>> the > >>>>>>>>>>>> use-case was not protected on a go-forward basis. > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> I recommend batching your writes and using the Arrow > >> binary > >>>>>> streaming > >>>>>>>>>>> protocol so you are only appending to a file rather than > >>>>>> mutating > >>>>>>>>>>> previously-written bytes. This use case is well defined > >> and > >>>>>> supported > >>>>>>>>>>> in the software already. > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>> > >>> > >> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#streaming-format > >>>>>>>>>>> > >>>>>>>>>>> - Wes > >>>>>>>>>>> > >>>>>>>>>>>> Of course, I am completely open to alternative ideas and > >>>>>>>> approaches! > >>>>>>>>>>>> > >>>>>>>>>>>> -John > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, May 6, 2019 at 11:39 AM Wes McKinney < > >>>>>> wesmck...@gmail.com> > >>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> hi John -- again, I would caution you against using > >>> Feather > >>>>>> files > >>>>>>>>> for > >>>>>>>>>>>>> issues of longevity -- the internal memory layout of > >>> those > >>>>>> files > >>>>>>>>> is a > >>>>>>>>>>>>> "dead man walking" so to speak. > >>>>>>>>>>>>> > >>>>>>>>>>>>> I would advise against forking the project, IMHO that > >>> is a > >>>>>> dark > >>>>>>>>> path > >>>>>>>>>>>>> that leads nowhere good. We have a large community > >> here > >>> and > >>>>>> we > >>>>>>>>> accept > >>>>>>>>>>>>> pull requests -- I think the challenge is going to be > >>>>>> defining > >>>>>>>> the > >>>>>>>>> use > >>>>>>>>>>>>> case to suitable clarity that a general purpose > >> solution > >>>>>> can be > >>>>>>>>>>>>> developed. 
> >>>>>>>>>>>>> > >>>>>>>>>>>>> - Wes > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, May 6, 2019 at 11:16 AM John Muehlhausen < > >>>>>> j...@jgm.org> > >>>>>>>>> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> François, Wes, > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks for the feedback. I think the most practical > >>>>>> thing for > >>>>>>>>> me to > >>>>>>>>>>> do > >>>>>>>>>>>>> is > >>>>>>>>>>>>>> 1- write a Feather file that is structured to > >>>>>> pre-allocate the > >>>>>>>>> space > >>>>>>>>>>> I > >>>>>>>>>>>>> need > >>>>>>>>>>>>>> (e.g. initial variable-length strings are of average > >>> size) > >>>>>>>>>>>>>> 2- come up with code to monkey around with the > >> values > >>>>>> contained > >>>>>>>>> in > >>>>>>>>>>> the > >>>>>>>>>>>>>> vectors so that before and after each manipulation > >> the > >>>>>> file is > >>>>>>>>> valid > >>>>>>>>>>> as I > >>>>>>>>>>>>>> walk the rows ... this is a writer that uses memory > >>>>>> mapping > >>>>>>>>>>>>>> 3- check back in here once that works, assuming that > >>> it > >>>>>> does, > >>>>>>>> to > >>>>>>>>> see > >>>>>>>>>>> if > >>>>>>>>>>>>> we > >>>>>>>>>>>>>> can bless certain mutation paths > >>>>>>>>>>>>>> 4- if we can't bless certain mutation paths, fork > >> the > >>>>>> project > >>>>>>>> for > >>>>>>>>>>> those > >>>>>>>>>>>>> who > >>>>>>>>>>>>>> care more about stream processing? That would not > >>> seem > >>>>>> to be > >>>>>>>>> ideal > >>>>>>>>>>> as I > >>>>>>>>>>>>>> think mutation in row-order across the data set is > >>>>>> relatively > >>>>>>>> low > >>>>>>>>>>> impact > >>>>>>>>>>>>> on > >>>>>>>>>>>>>> the overall design? > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> Thanks again for engaging the topic! > >>>>>>>>>>>>>> -John > >>>>>>>>>>>>>> > >>>>>>>>>>>>>> On Mon, May 6, 2019 at 10:26 AM Francois > >>> Saint-Jacques < > >>>>>>>>>>>>>> fsaintjacq...@gmail.com> wrote: > >>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hello John, > >>>>>>>>>>>>>>> > >>>>>>>>>>>>>>> Arrow is not yet suited for partial writes. 
The specification only talks about fully frozen/immutable objects; you're in implementation-defined territory here. For example, the C++ library assumes the Array object is immutable; it memoizes the null count, and likely more statistics in the future.

If you want to use pre-allocated buffers and arrays, you can use the column validity bitmap for this purpose, e.g. set all rows null by default and flip the bit once the row is written. It suffers from concurrency issues (plus the invalidation issues pointed out above) when dealing with multiple columns. You'll have to use a barrier of some kind, e.g. a per-batch global atomic (if append-only), or dedicated column(s) à la MVCC. But then the reader needs to be aware of this and compute a mask each time it needs to query the partial batch.

This is a common columnar database problem; see [1] for a recent paper on the subject. The usual technique is to store the recent data row-wise and transform it to column-wise when a threshold is met, akin to a compaction phase. There was a somewhat related thread [2] lately about streaming vs batching. In the end, I think your solution will be very application-specific.
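François's set-all-null-then-flip pattern might look like the following schematic plain-Python model (not the Arrow C++ API; `NullableInt64Column` and its methods are illustrative names):

```python
class NullableInt64Column:
    """Preallocated column where the validity bitmap doubles as the publish flag."""

    def __init__(self, capacity):
        self.validity = bytearray((capacity + 7) // 8)  # all bits 0 => all rows null
        self.values = [0] * capacity                    # preallocated value buffer

    def write_row(self, i, value):
        self.values[i] = value                 # 1. write the data...
        self.validity[i // 8] |= 1 << (i % 8)  # 2. ...then flip the bit to publish

    def is_valid(self, i):
        return bool(self.validity[i // 8] & (1 << (i % 8)))
```

As the message notes, with multiple columns this per-column flip is no longer sufficient on its own; some per-row or per-batch barrier is still needed so a reader never sees a half-written row as valid.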
François

[1] https://db.in.tum.de/downloads/publications/datablocks.pdf
[2] https://lists.apache.org/thread.html/27945533db782361143586fd77ca08e15e96e2f2a5250ff084b462d6@%3Cdev.arrow.apache.org%3E

On Mon, May 6, 2019 at 10:39 AM John Muehlhausen <j...@jgm.org> wrote:

Wes,

I'm not afraid of writing my own C++ code to deal with all of this on the writer side. I just need a way to "append" (incrementally populate) e.g. feather files so that a person using e.g. pyarrow doesn't suffer some catastrophic failure... and "on the side" I tell them which rows are junk and deal with any concurrency issues that can't be solved in the arena of atomicity and ordering of ops. For now I care about basic types, but including variable-width strings.

For event-processing, I think Arrow has to have the concept of a partially full record set. Some alternatives are:

- have a batch size of one, thus littering the landscape with trivially small Arrow buffers
- artificially increase latency with a batch size larger than one, but not processing any data until a batch is complete
- continuously re-write the (entire!) "main" buffer as batches of length 1 roll in
- instead of one main buffer, several, and at some threshold combine the last N length-1 batches into a length-N buffer ... still an inefficiency

Consider the case of QAbstractTableModel as the underlying data for a table or a chart. This visualization shows all of the data for the recent past as well as events rolling in. If this model interface is implemented as a view onto "many thousands" of individual event buffers, then we gain nothing from columnar layout. (Suppose there are tons of columns and most of them are scrolled out of the view.) Likewise we cannot re-write the entire model on each event... time complexity blows up.
What we want is to have a large pre-allocated chunk and just change rowCount() as data is "appended." Sure, we may run out of space and have another and another chunk for future row ranges, but a handful of chunks chained together is better than as many chunks as there were events!

And again, having a batch size >1 and delaying the data until a batch is full is a non-starter.

I am really hoping to see partially-filled buffers as something we keep our finger on moving forward! Or else, what am I missing?

-John

On Mon, May 6, 2019 at 8:24 AM Wes McKinney <wesmck...@gmail.com> wrote:

hi John,

In C++ the builder classes don't yet support writing into preallocated memory. It would be tricky for applications to determine a priori which segments of memory to pass to the builder. It seems only feasible for primitive / fixed-size types, so my guess would be that a separate set of interfaces would need to be developed for this task.
- Wes

On Mon, May 6, 2019 at 8:18 AM Jacques Nadeau <jacq...@apache.org> wrote:

This is more of a question of implementation versus specification. An Arrow buffer is generally built and then sealed. In different languages this building process works differently (a concern of the language rather than of the memory specification). We don't currently allow a half-built vector to be moved to another language and then be further built. So the question is really more concrete: what language are you looking at, and what is the specific pattern you're trying to undertake for building?

If you're trying to go across independent processes (whether the same process restarted or two separate processes active simultaneously), you'll need to build up your own data structures to help with this.

On Mon, May 6, 2019 at 6:28 PM John Muehlhausen <j...@jgm.org> wrote:

Hello,

Glad to learn of this project — good work!
If I allocate a single chunk of memory and start building Arrow format within it, does this chunk save any state regarding my progress?

For example, suppose I allocate a column for floating point (fixed width) and a column for string (variable width). Suppose I start building the floating point column at offset X into my single buffer, the string "pointer" column at offset Y into the same single buffer, and the string data elements at offset Z.

I write one floating point number and one string, then go away. When I come back to this buffer to append another value, does the buffer itself know where I would begin? I.e., is there a differentiation in the column (or blob) data itself between the available space and the used space?

Suppose I write a lot of large variable-width strings and "run out" of space for them before running out of space for floating point numbers or string pointers.
(I guessed badly when doing the original allocation.) I consider this to be OK since I can always "copy" the data to "compress out" the unused fp/pointer buckets... the choice is up to me.

The above, applied to a (Feather?) file, is how I anticipate appending data to disk... pre-allocate a mem-mapped file and gradually fill it up. The efficiency of file utilization will depend on my projections regarding variable-width data types, but as I said above, I can always re-write the file if/when this bothers me.

Is this the recommended and supported approach for incremental appends? I'm really hoping to use Arrow instead of rolling my own, but functionality like this is absolutely key! Hoping not to use a side-car file (or memory chunk) to store "append progress" information.

I am brand new to this project, so please forgive me if I have overlooked something obvious. And again, looks like great work so far!

Thanks!
-John