In my mind there were two/three formats:

* 2 related: IPC stream/file: native storage, everything memory mappable, slight overhead from having to read in meta data due to chunking.
* feather: like IPC, but with possible compression/codecs, so non-memory mappable (at least no practical use), to process this data without high memory usage we stream over the data.
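
A rough sketch of that distinction, using only the Python standard library rather than Arrow itself (so this is an analogy under my own assumptions, not the IPC format): a raw, uncompressed buffer on disk can be memory-mapped and read in place, while a compressed one has to be decoded into newly allocated memory first. The file names and the single int64 "column" are made up for the example.

```python
import mmap
import os
import tempfile
import zlib
from array import array

# Hypothetical stand-in for one column buffer: 1000 native 64-bit integers.
values = array("q", range(1000))
raw = values.tobytes()

tmpdir = tempfile.mkdtemp()
raw_path = os.path.join(tmpdir, "column.raw")      # uncompressed "memory dump"
packed_path = os.path.join(tmpdir, "column.zlib")  # compressed variant

with open(raw_path, "wb") as f:
    f.write(raw)
with open(packed_path, "wb") as f:
    f.write(zlib.compress(raw))

# Uncompressed: map the file and view the bytes as int64 in place.
# No decoded copy of the buffer is allocated; pages are loaded on access.
with open(raw_path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        with memoryview(mapped) as byte_view, byte_view.cast("q") as int_view:
            mapped_sum = sum(int_view)

# Compressed: the bytes on disk are not the in-memory layout, so they have
# to be decompressed into a fresh buffer before they can be used.
with open(packed_path, "rb") as f:
    decoded = array("q")
    decoded.frombytes(zlib.decompress(f.read()))
decoded_sum = sum(decoded)

print(mapped_sum, decoded_sum)
```

With pyarrow, the analogous step would presumably be opening the file through `pyarrow.memory_map` before handing it to the IPC reader. Whether the resulting buffers are really zero-copy then depends on the file being uncompressed.
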
Maybe I've made too strong assumptions, but to me it always seemed like an arrow stream/file is basically a 'memory dump' of the arrays that are used to compute on. Not sure if this input is useful, but this was my mental model.

I still think it's useful to have the distinction, since I always assumed in Vaex I can mmap an IPC file/stream and don't need to worry about pyarrow/arrow allocating a lot of memory.

On Tue, Oct 25, 2022 at 12:23 PM Nic <thisis...@gmail.com> wrote:

> I'm in favour of moving things for the R package as well and would prefer
> to do it all at once and make some noise about it, so there isn't lingering
> out-of-date documentation or general ambiguity to cause confusion for
> users.
>
> > I'm not familiar with R but read_arrow_file() may be better
> > if users use it without any prefix. If users use the
> > function is used with Apache Arrow related prefix such as
> > arrow::read_ipc_file(), I think that read_ipc_file() isn't
> > strange.
>
> There'll likely be a mix of both, depending on whether users load the
> package in a script or just want to call the function directly without
> importing the whole namespace. I think there's more thought needed around
> the exact function names for the R package (there are other considerations,
> such as consistency with our other file reading function names - none of
> those end in "file" and maybe that matters but maybe it doesn't).
>
> Perhaps, if we agree at a high level how we should be referring to these
> files in our documentation (i.e. actual names that we use when writing full
> sentences), the individual function names can fall out of that in later
> discussions?
>
> On Wed, 19 Oct 2022 at 05:01, Sutou Kouhei <k...@clear-code.com> wrote:
>
> > Hi,
> >
> > > However, I think we need to be very careful in how we brand the
> > > alternative, and think proactively about what terminology we want to be
> > > used (and which terms to use in APIs, ..).
> > > Because I think the "IPC" aspect of the naming can also become
> > > confusing (IPC is a generic term, does not clearly indicate it is a
> > > *file* format, and also not that it is related to *arrow*).
> >
> > I like "Apache Arrow File" and "Apache Arrow Stream" (no
> > IPC) for format names because we use vnd.apache.arrow.file
> > and vnd.apache.arrow.stream for IANA:
> >
> > * https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file
> > * https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.stream
> >
> > > - In pyarrow, we have a `feather` submodule with read/write_feather
> > > functions. How do we want to replace this? The current alternative is
> > > the pyarrow.ipc submodule (which has functionality to open files), but
> > > so this is using the "IPC" terminology. Are we OK with making this the
> > > alternative, or do we want to add new APIs?
> >
> > How about pyarrow.ipc.{read,write}_{file,stream}()?
> > I think that .ipc isn't strange under Apache Arrow namespace
> > (pyarrow.) because it means Apache Arrow's IPC.
> >
> > > - In pyarrow.dataset, we also use IpcFileFormat for Arrow files. Should
> > > we rename this to `ArrowFileFormat`? (and keep IpcFileFormat as alias)
> >
> > I like this idea.
> >
> > > - In the R arrow package, the non-feather alternative for
> > > `read_feather` currently is `read_ipc_file`
> >
> > I'm not familiar with R but read_arrow_file() may be better
> > if users use it without any prefix. If users use the
> > function is used with Apache Arrow related prefix such as
> > arrow::read_ipc_file(), I think that read_ipc_file() isn't
> > strange.
> >
> > > - In pandas, there is read_feather/to_feather. What do we think that
> > > pandas should use instead?
> >
> > read_arrow_file/to_arrow_file?
> >
> > > If we want to move the (mostly Python and R) ecosystem away from
> > > "Feather", I think we should have a clear recommendation of what to
> > > use instead.
> >
> > +1
> >
> >
> > Thanks,
> > --
> > kou
> >
> > In <calqtmbbmaepukytnin5n4-jpahtwbacmeb0bsxjz8dmkkmo...@mail.gmail.com>
> >   "Re: Usage of the name Feather?" on Tue, 6 Sep 2022 15:46:39 +0200,
> >   Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:
> >
> > > Personally, I like the "Feather" name (and actually think it could help
> > > disambiguate the file vs in-memory distinction), but I understand that
> > > we have chosen a certain path (eg ".arrow" is the official registered
> > > extension), and have to move on.
> > >
> > > However, I think we need to be very careful in how we brand the
> > > alternative, and think proactively about what terminology we want to be
> > > used (and which terms to use in APIs, ..). Because I think the "IPC"
> > > aspect of the naming can also become confusing (IPC is a generic term,
> > > does not clearly indicate it is a *file* format, and also not that it is
> > > related to *arrow*).
> > >
> > > As an example, I just noticed a twitter thread (
> > > https://twitter.com/braaannigan/status/1566715704937676800) that is
> > > promoting the "IPC format". The specific library used here (polars) also
> > > exposes this as a "read_ipc" function.
> > > Other examples:
> > >
> > > - In pyarrow, we have a `feather` submodule with read/write_feather
> > > functions. How do we want to replace this? The current alternative is
> > > the pyarrow.ipc submodule (which has functionality to open files), but
> > > so this is using the "IPC" terminology. Are we OK with making this the
> > > alternative, or do we want to add new APIs?
> > > - In pyarrow.dataset, we also use IpcFileFormat for Arrow files. Should
> > > we rename this to `ArrowFileFormat`?
> > > (and keep IpcFileFormat as alias)
> > > - In the R arrow package, the non-feather alternative for
> > > `read_feather` currently is `read_ipc_file`
> > > - In pandas, there is read_feather/to_feather. What do we think that
> > > pandas should use instead?
> > > - ...
> > >
> > > Personally, I think we should certainly avoid names that just use IPC
> > > (like `read_ipc`). An alternative could be `read_arrow_ipc`, but if want
> > > to drop the IPC part (as proposed earlier in this thread, although not
> > > yet agreed on), that would become `read_arrow`/`to_arrow`. That might
> > > then be confused with reading from / converting to in-memory arrow data
> > > or stream?
> > > If we want to recommend using "Arrow file" terminology, so then APIs
> > > like `read_arrow_file` could be used?
> > >
> > > If we want to move the (mostly Python and R) ecosystem away from
> > > "Feather", I think we should have a clear recommendation of what to use
> > > instead.
> > >
> > > On Wed, 31 Aug 2022 at 20:33, Aldrin <akmon...@ucsc.edu.invalid> wrote:
> > >
> > >> similarly to Micah, I mentally think of "Arrow IPC" a format that is
> > >> optimized for "IPC".
> > >> Which I have assumed meant it minimizes CPU overhead when using data
> > >> read from storage because it's already in a memory friendly format
> > >> (e.g. minimal deserialization).
> > >>
> > >> Not sure the "IPC" is necessary, but it does push the intent into the
> > >> name (unless it's actually a misnomer).
> > >>
> > >>
> > >> Aldrin Montana
> > >> Computer Science PhD Student
> > >> UC Santa Cruz
> > >>
> > >>
> > >> On Tue, Aug 30, 2022 at 8:29 PM Micah Kornfield <emkornfi...@gmail.com>
> > >> wrote:
> > >>
> > >> > I think one source of ambiguity for Arrow files, at least for me, is
> > >> > whether they are just a string of messages concatenated or they are
> > >> > the files that contain the metadata footer.
> > >> >
> > >> > On Tue, Aug 30, 2022 at 5:11 AM Dewey Dunnington
> > >> > <de...@voltrondata.com.invalid> wrote:
> > >> >
> > >> > > Ian has a very good point...I would be in favour of calling them
> > >> > > "Arrow files" wherever possible since there's no need to know or
> > >> > > care what interprocess communication is to use them!
> > >> > >
> > >> > > On Mon, Aug 29, 2022 at 6:50 PM Ian Cook <i...@ursacomputing.com>
> > >> > > wrote:
> > >> > >
> > >> > > > +1 We should explicitly discourage further use of “Feather” to
> > >> > > > refer to Arrow IPC files.
> > >> > > >
> > >> > > > In this spirit of simplifying terminology: Does the “IPC” in the
> > >> > > > term “Arrow IPC files” serve a truly necessary purpose? Is there
> > >> > > > another type of “Arrow file” that the “IPC” serves to
> > >> > > > disambiguate? If not, can we simply refer to these files as
> > >> > > > “Arrow files” in most places in the documentation and website?
> > >> > > > (In a few important places we should clarify that when we say
> > >> > > > “Arrow file” we are referring to a file that uses the Arrow IPC
> > >> > > > file format.)
> > >> > > >
> > >> > > > Ian
> > >> > > >
> > >> > > > On Mon, Aug 29, 2022 at 17:33 Sutou Kouhei <k...@clear-code.com>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > +1 for 1.
> > >> > > > >
> > >> > > > > Thanks,
> > >> > > > > --
> > >> > > > > kou
> > >> > > > >
> > >> > > > > In <CAOYPqDCAib2wBKaKnRij9=__OsUJJghVq1UUTNibK2T0Np+=r...@mail.gmail.com>
> > >> > > > >   "Re: Usage of the name Feather?" on Mon, 29 Aug 2022 20:18:37 +0200,
> > >> > > > >   Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > >> > > > >
> > >> > > > > > I agree.
> > >> > > > > >
> > >> > > > > > I suspect that the most widely used API with "feather" is
> > >> > > > > > Pandas' read_feather.
> > >> > > > > >
> > >> > > > > > On Mon, 29 Aug 2022, 19:55 Weston Pace, <weston.p...@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > >> I agree as well. I think most lingering uses of the term
> > >> > > > > >> "feather" are in pyarrow and R however, so it might be good
> > >> > > > > >> to hear from some of those maintainers.
> > >> > > > > >>
> > >> > > > > >> On Mon, Aug 29, 2022 at 9:35 AM Antoine Pitrou
> > >> > > > > >> <anto...@python.org> wrote:
> > >> > > > > >>
> > >> > > > > >> > I agree with this as well.
> > >> > > > > >> >
> > >> > > > > >> > Regards
> > >> > > > > >> >
> > >> > > > > >> > Antoine.
> > >> > > > > >> >
> > >> > > > > >> >
> > >> > > > > >> > On Mon, 29 Aug 2022 11:29:45 -0400
> > >> > > > > >> > Andrew Lamb <al...@influxdata.com> wrote:
> > >> > > > > >> > > In the rust implementation we use the term "Arrow IPC"
> > >> > > > > >> > > and I support your option 1:
> > >> > > > > >> > >
> > >> > > > > >> > > > The name Feather V2 is deprecated. Only the extension
> > >> > > > > >> > > > ".arrow" will be used for IPC files.
> > >> > > > > >> > >
> > >> > > > > >> > > Andrew
> > >> > > > > >> > >
> > >> > > > > >> > > On Mon, Aug 29, 2022 at 11:21 AM Matthew Topol
> > >> > > > > >> > > <m...@voltrondata.com.invalid> wrote:
> > >> > > > > >> > >
> > >> > > > > >> > > > When I wrote "In-Memory Analytics with Apache Arrow" I
> > >> > > > > >> > > > definitely treated "Feather" as deprecated and
> > >> > > > > >> > > > mentioned it only in passing specifically indicating
> > >> > > > > >> > > > "Arrow IPC" as the terminology to use.
> > >> > > > > >> > > > I only even mentioned "Feather" at all because there
> > >> > > > > >> > > > are still methods in pyarrow that reference it by name.
> > >> > > > > >> > > >
> > >> > > > > >> > > > That's just my opinion though...
> > >> > > > > >> > > >
> > >> > > > > >> > > > On Mon, Aug 29 2022 at 11:08:53 AM -0400, David Li
> > >> > > > > >> > > > <lidav...@apache.org> wrote:
> > >> > > > > >> > > > > This has come up before, e.g. see [1] [2] [3].
> > >> > > > > >> > > > >
> > >> > > > > >> > > > > I would say "Feather" is effectively deprecated and we
> > >> > > > > >> > > > > are using "Arrow IPC" now but I am not sure what others
> > >> > > > > >> > > > > think. (From that GitHub link, it seems to be mixed.)
> > >> > > > > >> > > > > And ".arrow" is the official extension now (since it is
> > >> > > > > >> > > > > registered as part of our MIME type). But there's
> > >> > > > > >> > > > > existing documentation and not everything has been
> > >> > > > > >> > > > > updated to be consistent (as you saw).
> > >> > > > > >> > > > >
> > >> > > > > >> > > > > [1]: <https://lists.apache.org/thread/0s6lgvd3g56ymd60vl5lgzhf4ro6hts5>
> > >> > > > > >> > > > > [2]: <https://arrow.apache.org/faq/#what-about-the-feather-file-format>
> > >> > > > > >> > > > > [3]: <https://stackoverflow.com/questions/67910612/arrow-ipc-vs-feather/67911190#67911190>
> > >> > > > > >> > > > >
> > >> > > > > >> > > > > -David
> > >> > > > > >> > > > >
> > >> > > > > >> > > > > On Mon, Aug 29, 2022, at 10:50, 島 達也 wrote:
> > >> > > > > >> > > > >> Hi all.
> > >> > > > > >> > > > >>
> > >> > > > > >> > > > >> I know the documentation (mainly pyarrow documentation)
> > >> > > > > >> > > > >> sometimes refers to IPC files as Feather files, but are
> > >> > > > > >> > > > >> there any guidelines for when to refer to an IPC file
> > >> > > > > >> > > > >> as a Feather file and when to refer to it as an IPC
> > >> > > > > >> > > > >> file?
> > >> > > > > >> > > > >> I believe that calling the same file an Arrow IPC file
> > >> > > > > >> > > > >> at times and a Feather file at other times is confusing
> > >> > > > > >> > > > >> to those unfamiliar with Apache Arrow (myself included).
> > >> > > > > >> > > > >> Surprisingly, these files may even have completely
> > >> > > > > >> > > > >> different extensions, ".arrow" and ".feather", which are
> > >> > > > > >> > > > >> not similar.
> > >> > > > > >> > > > >>
> > >> > > > > >> > > > >> Perhaps there are several options for future use of the
> > >> > > > > >> > > > >> name Feather, such as
> > >> > > > > >> > > > >>
> > >> > > > > >> > > > >>  1. The name Feather V2 is deprecated. Only the
> > >> > > > > >> > > > >>     extension ".arrow" will be used for IPC files.
> > >> > > > > >> > > > >>  2. In some contexts(?), IPC files are referred to as
> > >> > > > > >> > > > >>     Feather; only ".arrow" is used for the IPC file
> > >> > > > > >> > > > >>     extension to clearly distinguish it from Feather
> > >> > > > > >> > > > >>     V1's ".feather".
> > >> > > > > >> > > > >>  3. When an IPC file is called Feather by some rule,
> > >> > > > > >> > > > >>     extension ".feather" is used, and when an IPC file
> > >> > > > > >> > > > >>     is not called Feather, extension ".arrow" is used.
> > >> > > > > >> > > > >>
> > >> > > > > >> > > > >> I mistakenly thought the current status was 2, but
> > >> > > > > >> > > > >> according to the discussion in this PR
> > >> > > > > >> > > > >> (<https://github.com/apache/arrow/pull/13677>),
> > >> > > > > >> > > > >> apparently the current status seems 3. (However, there
> > >> > > > > >> > > > >> seems to be no rule as to when an IPC file should be
> > >> > > > > >> > > > >> called a Feather)
> > >> > > > > >> > > > >>
> > >> > > > > >> > > > >> I am not very familiar with Arrow and this is my first
> > >> > > > > >> > > > >> post to this mailing list so I apologize if I have done
> > >> > > > > >> > > > >> something wrong or inappropriate.
> > >> > > > > >> > > > >>
> > >> > > > > >> > > > >> Best,
> > >> > > > > >> > > > >> SHIMA Tatsuya