"Apache Arrow is a data processing library that also provides a uniform, efficient interface for data systems."
This probably still isn't quite right; I imagine the bit about "for data systems" needs some addition (maybe "for transport between data systems")?

My primary motivators:
- "A data processing library":
    - Arrow provides many language bindings, but ultimately they're all part of the same "library ecosystem", which I think is fine to capture in "library"
    - A main goal of Arrow is for processing to be fast, whatever that processing may be
- "uniform, efficient interface for data systems":
    - Arrow provides (or tries to) a cohesive ("uniform") interface for data processing (although it has several APIs to do this)
    - Also, IMO, a motivation for Arrow was a format and library that facilitate processing, while also providing functions and interfaces to easily translate into the optimized data formats used by disparate data systems (Cassandra, Hadoop, etc.).
    - Arrow tries to be transparently zero-copy, which is part of the interface for efficiency
    - Arrow certainly has a data format, but that format is the crux of the interface (IMO). However, it also makes using other formats easy (via the filesystem API, Parquet readers/writers, etc.). So, focusing on the data format seems unnecessary in such a terse description.

Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Mon, May 17, 2021 at 5:07 PM Weston Pace <weston.p...@gmail.com> wrote:

> I'd avoid the word "structured" as it is somewhat ill-defined.
>
> On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
> <mauri...@ursacomputing.com> wrote:
> >
> > more marketed:
> > How about: "Apache Arrow is a format and language-agnostic library
> > focused on efficient sharing and processing of structured data."
> >
> > On Mon, May 17, 2021 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > How about: "Apache Arrow is a collection of specifications, cross
> > > language libraries and applications focused on efficient sharing and
> > > processing of structured data."
> > >
> > > On Mon, May 17, 2021 at 3:06 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > On Mon, May 17, 2021 at 4:58 PM Weston Pace <weston.p...@gmail.com>
> > > > wrote:
> > > > >
> > > > > > “Apache Arrow is a format and compute kernel for in-memory data”
> > > > >
> > > > > I like this but no one ever knows what "in-memory" means (or they
> > > > > just think 'data is always in memory'). How about...
> > > > >
> > > > > "Apache Arrow is a format and compute kernel for zero-copy
> > > > > processing and sharing of data."
> > > > >
> > > > > or...
> > > > >
> > > > > "Apache Arrow is a format and compute kernel for processing and
> > > > > sharing data without serialization overhead."
> > > >
> > > > A few issues with this:
> > > >
> > > > * Multiple PL aspect unclear (is it a single piece of software, or
> > > >   multiple pieces of software?)
> > > > * Development platform aspect unclear
> > > >
> > > > I see that some people don't like the word "platform". Some people
> > > > come to this project and want to find an end-to-end application,
> > > > rather than a developer toolkit that they can use to build
> > > > applications. Perhaps we should be more explicit and use
> > > > "computational development toolkit" instead of "platform".
> > > >
> > > > > Although marshalling[1] would probably be a more precise word it is
> > > > > not as well known.
> > > > >
> > > > > [1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)
> > > > >
> > > > > On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
> > > > > <mauri...@ursacomputing.com> wrote:
> > > > >
> > > > > > a few ideas
> > > > > >
> > > > > > github.com/apache/arrow - Apache Arrow is an efficient library for
> > > > > > big data processing and sharing
> > > > > >
> > > > > > github.com/apache/arrow - Apache Arrow is a computational tool for
> > > > > > processing, storing and sharing large datasets
> > > > > >
> > > > > > github.com/apache/arrow - Apache Arrow is a fast and simple library
> > > > > > for big data analytics
> > > > > >
> > > > > > *github.com/apache/arrow <http://github.com/apache/arrow> - Apache
> > > > > > Arrow is a powerful workhorse for analytic operations on modern
> > > > > > hardware*
> > > > > >
> > > > > > On Mon, May 17, 2021 at 3:13 PM Julian Hyde <jhyde.apa...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Alright, well, whatever it is, it must fit into one breath. If
> > > > > > > the high-concept pitch is successful, people will stick around
> > > > > > > for the full pitch.
> > > > > > >
> > > > > > > Words such as “platform” and “enable” are noise. You say
> > > > > > > “platform”, they start to say “what exactly do you mean by
> > > > > > > platform”, the elevator doors open, and they’re gone.
> > > > > > >
> > > > > > > “Apache Arrow is a format and compute kernel for in-memory data”
> > > > > > >
> > > > > > > > On May 17, 2021, at 12:03 PM, Eduardo Ponce
> > > > > > > > <edponc...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > One more suggestion for the bucket:
> > > > > > > > "Apache Arrow is a computational platform for efficient
> > > > > > > > in-memory data representation and processing."
> > > > > > > >
> > > > > > > > On Mon, May 17, 2021 at 2:49 PM Wes McKinney
> > > > > > > > <wesmck...@gmail.com> wrote:
> > > > > > > >
> > > > > > > >> I think less is better in the description, but unfortunately
> > > > > > > >> the association of Arrow as being "just a data format" has
> > > > > > > >> been actively harmful in some ways to community growth. We
> > > > > > > >> have a data format, yes, but we are also creating a
> > > > > > > >> computational platform to go hand-in-hand with the data
> > > > > > > >> format to make it easier to build fast applications that use
> > > > > > > >> the data format. So the description needs to capture both of
> > > > > > > >> these ideas.
> > > > > > > >>
> > > > > > > >> On Mon, May 17, 2021 at 12:15 PM Julian Hyde
> > > > > > > >> <jhyde.apa...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>> I think that the “cross-language development platform for”
> > > > > > > >>> is noise. (I’m sure that JPEG developers think that JPEG is
> > > > > > > >>> a “cross-language development platform” too. But it isn’t.
> > > > > > > >>> It is an image format.)
> > > > > > > >>>
> > > > > > > >>> “Apache Arrow is a data format for efficient in-memory
> > > > > > > >>> processing.”
> > > > > > > >>>
> > > > > > > >>> I’ll note that in marketing speak, we are developing a
> > > > > > > >>> high-concept pitch [1] here. Every company needs a name, a
> > > > > > > >>> brand, a high-concept pitch, and a 3- or 4-sentence
> > > > > > > >>> description. But every Apache project needs these too. It’s
> > > > > > > >>> worth spending the time on the description, also, and then
> > > > > > > >>> using them in all the places that we describe Arrow.
> > > > > > > >>>
> > > > > > > >>> Julian
> > > > > > > >>>
> > > > > > > >>> [1] https://www.growthink.com/content/whats-your-high-concept-pitch
> > > > > > > >>>
> > > > > > > >>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce
> > > > > > > >>>> <edponc...@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>> I agree with Nate's and Brian's suggestions, but would like
> > > > > > > >>>> to add that we can make it a one-liner for more conciseness
> > > > > > > >>>> and consistency with other Apache projects.
> > > > > > > >>>> Apologies if it seems I am going around the suggestions
> > > > > > > >>>> loop again.
> > > > > > > >>>>
> > > > > > > >>>> "Apache Arrow is a cross-language development platform
> > > > > > > >>>> enabling efficient in-memory data processing and
> > > > > > > >>>> transport."
> > > > > > > >>>>
> > > > > > > >>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette
> > > > > > > >>>> <bhule...@apache.org> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Thank you for bringing this up Dominik. I sampled some of
> > > > > > > >>>>> the descriptions for other Apache projects I frequent; the
> > > > > > > >>>>> ones with a meaningful description have a single sentence:
> > > > > > > >>>>>
> > > > > > > >>>>> github.com/apache/spark - Apache Spark - A unified
> > > > > > > >>>>> analytics engine for large-scale data processing
> > > > > > > >>>>> github.com/apache/beam - Apache Beam is a unified
> > > > > > > >>>>> programming model for Batch and Streaming
> > > > > > > >>>>> github.com/apache/avro - Apache Avro is a data
> > > > > > > >>>>> serialization system
> > > > > > > >>>>>
> > > > > > > >>>>> Several others (Flink, Hadoop, ...) just have "[Mirror of]
> > > > > > > >>>>> Apache <name>" as the description.
> > > > > > > >>>>>
> > > > > > > >>>>> +1 for Nate's suggestion "Apache Arrow is a cross-language
> > > > > > > >>>>> development platform for in-memory data. It enables
> > > > > > > >>>>> systems to process and transport data more efficiently."
> > > > > > > >>>>>
> > > > > > > >>>>> On Mon, May 17, 2021 at 5:23 AM Wes McKinney
> > > > > > > >>>>> <wesmck...@gmail.com> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> It's probably best for the description to limit mentions
> > > > > > > >>>>>> of specific features. There are some high level features
> > > > > > > >>>>>> mentioned in the description now ("computational
> > > > > > > >>>>>> libraries and zero-copy streaming messaging and
> > > > > > > >>>>>> interprocess communication"), but now in 2021 since the
> > > > > > > >>>>>> project has grown so much, it could leave people with a
> > > > > > > >>>>>> limited view of what they might find here.
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> > > > > > > >>>>>> <mauri...@ursacomputing.com> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> How about
> > > > > > > >>>>>>> 'Apache Arrow is a cross-language development platform
> > > > > > > >>>>>>> for in-memory data. It enables systems to process and
> > > > > > > >>>>>>> transport data efficiently, providing a simple and fast
> > > > > > > >>>>>>> library for partitioning of large tables'?
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> Sorry for the delay, long election day
> > > > > > > >>>>>>>
> > > > > > > >>>>>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind
> > > > > > > >>>>>>> <natebauernfe...@deephaven.io> wrote:
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>> Suggestion: faster -> more efficiently
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> "Apache Arrow is a cross-language development platform
> > > > > > > >>>>>>>> for in-memory data. It enables systems to process and
> > > > > > > >>>>>>>> transport data more efficiently."
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> On Sun, May 16, 2021 at 11:35 AM Wes McKinney
> > > > > > > >>>>>>>> <wesmck...@gmail.com> wrote:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>> Here's what's there now:
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> "Apache Arrow is a cross-language development platform
> > > > > > > >>>>>>>>> for in-memory data. It specifies a standardized
> > > > > > > >>>>>>>>> language-independent columnar memory format for flat
> > > > > > > >>>>>>>>> and hierarchical data, organized for efficient
> > > > > > > >>>>>>>>> analytic operations on modern hardware. It also
> > > > > > > >>>>>>>>> provides computational libraries and zero-copy
> > > > > > > >>>>>>>>> streaming messaging and interprocess communication…"
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> How about something shorter like
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> "Apache Arrow is a cross-language development platform
> > > > > > > >>>>>>>>> for in-memory data. It enables systems to process and
> > > > > > > >>>>>>>>> transport data faster."
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> Suggestions / refinements from others welcome
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>> On Sat, May 15, 2021 at 9:12 PM Dominik Moritz
> > > > > > > >>>>>>>>> <domor...@cmu.edu> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> Super minor issue but could someone make the
> > > > > > > >>>>>>>>>> description on GitHub shorter?
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>> GitHub puts the description into the title of the
> > > > > > > >>>>>>>>>> page and makes it hard to find it in URL
> > > > > > > >>>>>>>>>> autocomplete.