Hi,

> However, I think we need to be very careful in how we brand the
> alternative, and think proactively about what terminology we want to be
> used (and which terms to use in APIs, ..). Because I think the "IPC" aspect
> of the naming can also become confusing (IPC is a generic term, does not
> clearly indicate it is a *file* format, and also not that it is related to
> *arrow*).

I like "Apache Arrow File" and "Apache Arrow Stream" (no IPC)
for format names because we use vnd.apache.arrow.file and
vnd.apache.arrow.stream for IANA:

* https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file
* https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.stream

> - In pyarrow, we have a `feather` submodule with read/write_feather
> functions. How do we want to replace this? The current alternative is the
> pyarrow.ipc submodule (which has functionality to open files), but so this
> is using the "IPC" terminology. Are we OK with making this the alternative,
> or do we want to add new APIs?

How about pyarrow.ipc.{read,write}_{file,stream}()?
I think that .ipc isn't strange under the Apache Arrow
namespace (pyarrow.) because it means Apache Arrow's IPC.

> - In pyarrow.dataset, we also use IpcFileFormat for Arrow files. Should we
> rename this to `ArrowFileFormat`? (and keep IpcFileFormat as alias)

I like this idea.

> - In the R arrow package, the non-feather alternative for `read_feather`
> currently is `read_ipc_file`

I'm not familiar with R but read_arrow_file() may be better
if users call it without any prefix. If users call the
function with an Apache Arrow related prefix such as
arrow::read_ipc_file(), I think that read_ipc_file() isn't
strange.

> - In pandas, there is read_feather/to_feather. What do we think that pandas
> should use instead?

read_arrow_file/to_arrow_file?

> If we want to move the (mostly Python and R) ecosystem away from "Feather",
> I think we should have a clear recommendation of what to use instead.

+1


Thanks,
--
kou

In <calqtmbbmaepukytnin5n4-jpahtwbacmeb0bsxjz8dmkkmo...@mail.gmail.com>
  "Re: Usage of the name Feather?"
  on Tue, 6 Sep 2022 15:46:39 +0200,
  Joris Van den Bossche <jorisvandenboss...@gmail.com> wrote:

> Personally, I like the "Feather" name (and actually think it could help
> disambiguate the file vs in-memory distinction), but I understand that we
> have chosen a certain path (eg ".arrow" is the official registered
> extension), and have to move on.
>
> However, I think we need to be very careful in how we brand the
> alternative, and think proactively about what terminology we want to be
> used (and which terms to use in APIs, ..). Because I think the "IPC" aspect
> of the naming can also become confusing (IPC is a generic term, does not
> clearly indicate it is a *file* format, and also not that it is related to
> *arrow*).
>
> As an example, I just noticed a twitter thread (
> https://twitter.com/braaannigan/status/1566715704937676800) that is
> promoting the "IPC format". The specific library used here (polars) also
> exposes this as a "read_ipc" function.
> Other examples:
>
> - In pyarrow, we have a `feather` submodule with read/write_feather
> functions. How do we want to replace this? The current alternative is the
> pyarrow.ipc submodule (which has functionality to open files), but so this
> is using the "IPC" terminology. Are we OK with making this the alternative,
> or do we want to add new APIs?
> - In pyarrow.dataset, we also use IpcFileFormat for Arrow files. Should we
> rename this to `ArrowFileFormat`? (and keep IpcFileFormat as alias)
> - In the R arrow package, the non-feather alternative for `read_feather`
> currently is `read_ipc_file`
> - In pandas, there is read_feather/to_feather. What do we think that pandas
> should use instead?
> - ...
>
> Personally, I think we should certainly avoid names that just use IPC (like
> `read_ipc`). An alternative could be `read_arrow_ipc`, but if want to drop
> the IPC part (as proposed earlier in this thread, although not yet agreed
> on), that would become `read_arrow`/`to_arrow`. That might then be confused
> with reading from / converting to in-memory arrow data or stream?
> If we want to recommend using "Arrow file" terminology, so then APIs like
> `read_arrow_file` could be used?
>
> If we want to move the (mostly Python and R) ecosystem away from "Feather",
> I think we should have a clear recommendation of what to use instead.
>
> On Wed, 31 Aug 2022 at 20:33, Aldrin <akmon...@ucsc.edu.invalid> wrote:
>
>> similarly to Micah, I mentally think of "Arrow IPC" a format that is
>> optimized for "IPC".
>> Which I have assumed meant it minimizes CPU overhead when using data read
>> from storage because it's already in a memory friendly format (e.g.
>> minimal deserialization).
>>
>> Not sure the "IPC" is necessary, but it does push the intent into the name
>> (unless it's actually a misnomer).
>>
>>
>> Aldrin Montana
>> Computer Science PhD Student
>> UC Santa Cruz
>>
>>
>> On Tue, Aug 30, 2022 at 8:29 PM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>> > I think one source of ambiguity for Arrow files, at least for me, is
>> > whether they are just a string of messages concatenated or they are the
>> > files that contain the metadata footer.
>> >
>> > On Tue, Aug 30, 2022 at 5:11 AM Dewey Dunnington
>> > <de...@voltrondata.com.invalid> wrote:
>> >
>> > > Ian has a very good point...I would be in favour of calling them "Arrow
>> > > files" wherever possible since there's no need to know or care what
>> > > interprocess communication is to use them!
>> > >
>> > > On Mon, Aug 29, 2022 at 6:50 PM Ian Cook <i...@ursacomputing.com> wrote:
>> > >
>> > > > +1 We should explicitly discourage further use of “Feather” to refer to
>> > > > Arrow IPC files.
>> > > >
>> > > > In this spirit of simplifying terminology: Does the “IPC” in the term
>> > > > “Arrow IPC files” serve a truly necessary purpose? Is there another type of
>> > > > “Arrow file” that the “IPC” serves to disambiguate? If not, can we simply
>> > > > refer to these files as “Arrow files” in most places in the documentation
>> > > > and website? (In a few important places we should clarify that when we say
>> > > > “Arrow file” we are referring to a file that uses the Arrow IPC file
>> > > > format.)
>> > > >
>> > > > Ian
>> > > >
>> > > > On Mon, Aug 29, 2022 at 17:33 Sutou Kouhei <k...@clear-code.com> wrote:
>> > > >
>> > > > > +1 for 1.
>> > > > >
>> > > > > Thanks,
>> > > > > --
>> > > > > kou
>> > > > >
>> > > > > In <CAOYPqDCAib2wBKaKnRij9=__OsUJJghVq1UUTNibK2T0Np+=r...@mail.gmail.com>
>> > > > >   "Re: Usage of the name Feather?" on Mon, 29 Aug 2022 20:18:37 +0200,
>> > > > >   Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
>> > > > >
>> > > > > > I agree.
>> > > > > >
>> > > > > > I suspect that the most widely used API with "feather" is Pandas'
>> > > > > > read_feather.
>> > > > > >
>> > > > > >
>> > > > > >
>> > > > > > On Mon, 29 Aug 2022, 19:55 Weston Pace, <weston.p...@gmail.com> wrote:
>> > > > > >
>> > > > > >> I agree as well. I think most lingering uses of the term "feather"
>> > > > > >> are in pyarrow and R however, so it might be good to hear from some of
>> > > > > >> those maintainers.
>> > > > > >>
>> > > > > >>
>> > > > > >>
>> > > > > >> On Mon, Aug 29, 2022 at 9:35 AM Antoine Pitrou <anto...@python.org> wrote:
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > I agree with this as well.
>> > > > > >> >
>> > > > > >> > Regards
>> > > > > >> >
>> > > > > >> > Antoine.
>> > > > > >> >
>> > > > > >> >
>> > > > > >> > On Mon, 29 Aug 2022 11:29:45 -0400
>> > > > > >> > Andrew Lamb <al...@influxdata.com> wrote:
>> > > > > >> > > In the rust implementation we use the term "Arrow IPC" and I support
>> > > > > >> > > your option 1:
>> > > > > >> > >
>> > > > > >> > > > The name Feather V2 is deprecated. Only the extension ".arrow" will be
>> > > > > >> > > > used for IPC files.
>> > > > > >> > >
>> > > > > >> > > Andrew
>> > > > > >> > >
>> > > > > >> > > On Mon, Aug 29, 2022 at 11:21 AM Matthew Topol
>> > > > > >> > > <m...@voltrondata.com.invalid> wrote:
>> > > > > >> > >
>> > > > > >> > > > When I wrote "In-Memory Analytics with Apache Arrow" I definitely
>> > > > > >> > > > treated "Feather" as deprecated and mentioned it only in passing
>> > > > > >> > > > specifically indicating "Arrow IPC" as the terminology to use. I only
>> > > > > >> > > > even mentioned "Feather" at all because there are still methods in
>> > > > > >> > > > pyarrow that reference it by name.
>> > > > > >> > > >
>> > > > > >> > > > That's just my opinion though...
>> > > > > >> > > >
>> > > > > >> > > > On Mon, Aug 29 2022 at 11:08:53 AM -0400, David Li
>> > > > > >> > > > <lidav...@apache.org> wrote:
>> > > > > >> > > > > This has come up before, e.g. see [1] [2] [3].
>> > > > > >> > > > >
>> > > > > >> > > > > I would say "Feather" is effectively deprecated and we are using
>> > > > > >> > > > > "Arrow IPC" now but I am not sure what others think. (From that
>> > > > > >> > > > > GitHub link, it seems to be mixed.) And ".arrow" is the official
>> > > > > >> > > > > extension now (since it is registered as part of our MIME type). But
>> > > > > >> > > > > there's existing documentation and not everything has been updated to
>> > > > > >> > > > > be consistent (as you saw).
>> > > > > >> > > > >
>> > > > > >> > > > > [1]:
>> > > > > >> > > > > <https://lists.apache.org/thread/0s6lgvd3g56ymd60vl5lgzhf4ro6hts5>
>> > > > > >> > > > > [2]:
>> > > > > >> > > > > <https://arrow.apache.org/faq/#what-about-the-feather-file-format>
>> > > > > >> > > > > [3]:
>> > > > > >> > > > > <https://stackoverflow.com/questions/67910612/arrow-ipc-vs-feather/67911190#67911190>
>> > > > > >> > > > >
>> > > > > >> > > > > -David
>> > > > > >> > > > >
>> > > > > >> > > > > On Mon, Aug 29, 2022, at 10:50, 島 達也 wrote:
>> > > > > >> > > > >> Hi all.
>> > > > > >> > > > >>
>> > > > > >> > > > >> I know the documentation (mainly pyarrow documentation) sometimes
>> > > > > >> > > > >> refers to IPC files as Feather files, but are there any guidelines for
>> > > > > >> > > > >> when to refer to an IPC file as a Feather file and when to refer to it
>> > > > > >> > > > >> as an IPC file?
>> > > > > >> > > > >> I believe that calling the same file an Arrow IPC file at times and a
>> > > > > >> > > > >> Feather file at other times is confusing to those unfamiliar with
>> > > > > >> > > > >> Apache Arrow (myself included).
>> > > > > >> > > > >> Surprisingly, these files may even have completely different
>> > > > > >> > > > >> extensions, ".arrow" and ".feather", which are not similar.
>> > > > > >> > > > >>
>> > > > > >> > > > >> Perhaps there are several options for future use of the name Feather,
>> > > > > >> > > > >> such as
>> > > > > >> > > > >>
>> > > > > >> > > > >>  1. The name Feather V2 is deprecated. Only the extension ".arrow" will
>> > > > > >> > > > >>     be used for IPC files.
>> > > > > >> > > > >>  2. In some contexts(?), IPC files are referred to as Feather; only
>> > > > > >> > > > >>     ".arrow" is used for the IPC file extension to clearly distinguish
>> > > > > >> > > > >>     it from Feather V1's ".feather".
>> > > > > >> > > > >>  3. When an IPC file is called Feather by some rule, extension
>> > > > > >> > > > >>     ".feather" is used, and when an IPC file is not called Feather,
>> > > > > >> > > > >>     extension ".arrow" is used.
>> > > > > >> > > > >>
>> > > > > >> > > > >> I mistakenly thought the current status was 2, but according to the
>> > > > > >> > > > >> discussion in this PR
>> > > > > >> > > > >> (<https://github.com/apache/arrow/pull/13677>),
>> > > > > >> > > > >> apparently the current status seems 3. (However, there seems to be no
>> > > > > >> > > > >> rule as to when an IPC file should be called a Feather)
>> > > > > >> > > > >>
>> > > > > >> > > > >> I am not very familiar with Arrow and this is my first post to this
>> > > > > >> > > > >> mailing list so I apologize if I have done something wrong or
>> > > > > >> > > > >> inappropriate.
>> > > > > >> > > > >>
>> > > > > >> > > > >> Best,
>> > > > > >> > > > >> SHIMA Tatsuya
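
For illustration only: the renaming ideas raised in this thread (introducing `ArrowFileFormat` while keeping `IpcFileFormat` as an alias, and steering callers from `read_feather` toward something like `read_arrow_file`) might be sketched roughly as below. This is not pyarrow, pandas, or R arrow code. The class and both functions are hypothetical stand-ins written with the standard library only, and `read_arrow_file` is just one of the names floated above, not an agreed API.

```python
import warnings


class ArrowFileFormat:
    """Hypothetical stand-in for a dataset file-format class."""

    default_extension = "arrow"


# Keep the old name importable: both names refer to the same class object.
IpcFileFormat = ArrowFileFormat


def read_arrow_file(path):
    """Hypothetical reader; returns a placeholder string, not real data."""
    return f"<table from {path}>"


def read_feather(path):
    """Old name kept as a thin wrapper that warns and then delegates."""
    warnings.warn(
        "read_feather is deprecated; use read_arrow_file instead",
        FutureWarning,
        stacklevel=2,
    )
    return read_arrow_file(path)
```

With an alias of this kind, existing code that imports the old names would keep working while the documentation moves to the new ones. Whether a warning is appropriate at all, and which warning category to use, is a separate decision that the thread does not settle.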