"Apache Arrow is a data processing library that also provides a uniform,
efficient interface for data systems."

This probably still isn't quite right; I imagine the bit about "for data
systems" needs some addition (maybe "for transport between data systems").

My primary motivators:

   - "A data processing library":
      - Arrow provides many language bindings, but ultimately they're all
      part of the same "library ecosystem", which I think is fine to capture in
      "library"
      - A main goal of Arrow is for processing to be fast, whatever that
      processing may be
   - "uniform, efficient interface for data systems":
      - Arrow provides (or tries to) a cohesive ("uniform") interface for
      data processing (although it has several APIs to do this)
      - Also, IMO, a motivation for Arrow was a format and library that
      facilitate processing, but that also provide functions and interfaces
      to easily translate into optimized data formats used by disparate data
      systems (Cassandra, Hadoop, etc.).
      - Arrow tries to be transparently zero-copy, which is part of the
      interface for efficiency
   - Arrow certainly has a data format, and that format is the crux of the
   interface (IMO). However, it also makes using other formats easy (via
   the filesystem API, Parquet readers/writers, etc.). So, focusing on the
   data format seems unnecessary in such a terse description.


Aldrin Montana
Computer Science PhD Student
UC Santa Cruz


On Mon, May 17, 2021 at 5:07 PM Weston Pace <weston.p...@gmail.com> wrote:

> I'd avoid the word "structured" as it is somewhat ill-defined.
>
> On Mon, May 17, 2021 at 12:37 PM Mauricio Vargas
> <mauri...@ursacomputing.com> wrote:
> >
> > more marketed:
> > How about: "Apache Arrow is a format and language-agnostic library
> > focused on efficient sharing and processing of structured data."
> >
> > On Mon, May 17, 2021 at 6:25 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > How about: "Apache Arrow is a collection of specifications, cross
> > > language libraries and applications focused on efficient sharing and
> > > processing of structured data."
> > >
> > > On Mon, May 17, 2021 at 3:06 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > On Mon, May 17, 2021 at 4:58 PM Weston Pace <weston.p...@gmail.com> wrote:
> > > > >
> > > > > > “Apache Arrow is a format and compute kernel for in-memory data”
> > > > >
> > > > > I like this but no one ever knows what "in-memory" means (or they
> > > > > just think 'data is always in memory').  How about...
> > > > >
> > > > > "Apache Arrow is a format and compute kernel for zero-copy
> > > > > processing and sharing of data."
> > > > >
> > > > > or...
> > > > >
> > > > > "Apache Arrow is a format and compute kernel for processing and
> > > > > sharing data without serialization overhead."
> > > >
> > > > A few issues with this:
> > > >
> > > > * Multiple PL aspect unclear (is a single piece of software, or
> > > > multiple pieces of software?)
> > > > * Development platform aspect unclear
> > > >
> > > > I see that some people don't like the word "platform". Some people
> > > > come to this project and want to find an end-to-end application,
> > > > rather than a developer toolkit that they can use to build
> > > > applications. Perhaps we should be more explicit and use
> > > > "computational development toolkit" instead of "platform".
> > > >
> > > > > Although marshalling[1] would probably be a more precise word it is
> > > > > not as well known.
> > > > >
> > > > > [1] https://en.wikipedia.org/wiki/Marshalling_(computer_science)
> > > > >
> > > > > On Mon, May 17, 2021 at 9:36 AM Mauricio Vargas
> > > > > <mauri...@ursacomputing.com> wrote:
> > > > > >
> > > > > > a few ideas
> > > > > >
> > > > > > > github.com/apache/arrow - Apache Arrow is an efficient library
> > > > > > > for big data processing and sharing
> > > > > > >
> > > > > > > github.com/apache/arrow - Apache Arrow is a computational tool
> > > > > > > for processing, storing and sharing large datasets
> > > > > > >
> > > > > > > github.com/apache/arrow - Apache Arrow is a fast and simple
> > > > > > > library for big data analytics
> > > > > > >
> > > > > > > *github.com/apache/arrow <http://github.com/apache/arrow> -
> > > > > > > Apache Arrow is a powerful workhorse for analytic operations on
> > > > > > > modern hardware*
> > > > > >
> > > > > >
> > > > > > > On Mon, May 17, 2021 at 3:13 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
> > > > > >
> > > > > > > > Alright, well, whatever it is, it must fit into one breath. If
> > > > > > > > the high-concept pitch is successful, people will stick around
> > > > > > > > for the full pitch.
> > > > > > > >
> > > > > > > > Words such as “platform” and “enable” are noise. You say
> > > > > > > > “platform”, they start to say “what exactly do you mean by
> > > > > > > > platform”, the elevator doors open, and they’re gone.
> > > > > > > >
> > > > > > > > “Apache Arrow is a format and compute kernel for in-memory
> > > > > > > > data”
> > > > > > >
> > > > > > >
> > > > > > > > > On May 17, 2021, at 12:03 PM, Eduardo Ponce <edponc...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > One more suggestion for the bucket:
> > > > > > > > > "Apache Arrow is a computational platform for efficient
> > > > > > > > > in-memory data representation and processing."
> > > > > > > >
> > > > > > > > > On Mon, May 17, 2021 at 2:49 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > >> I think less is better in the description, but unfortunately the
> > > > > > > > >> association of Arrow as being "just a data format" has been actively
> > > > > > > > >> harmful in some ways to community growth. We have a data format, yes,
> > > > > > > > >> but we are also creating a computational platform to go hand-in-hand
> > > > > > > > >> with the data format to make it easier to build fast applications that
> > > > > > > > >> use the data format. So the description needs to capture both of these
> > > > > > > > >> ideas.
> > > > > > > >>
> > > > > > > > >> On Mon, May 17, 2021 at 12:15 PM Julian Hyde <jhyde.apa...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > > >>> I think that the “cross-language development platform for” is
> > > > > > > > >>> noise. (I’m sure that JPEG developers think that JPEG is a
> > > > > > > > >>> “cross-language development platform” too. But it isn’t. It is
> > > > > > > > >>> an image format.)
> > > > > > > > >>>
> > > > > > > > >>> “Apache Arrow is a data format for efficient in-memory
> > > > > > > > >>> processing.”
> > > > > > > > >>>
> > > > > > > > >>> I’ll note that in marketing speak, we are developing a
> > > > > > > > >>> high-concept pitch [1] here. Every company needs a name, a
> > > > > > > > >>> brand, a high-concept pitch, and a 3- or 4-sentence description.
> > > > > > > > >>> But every Apache project needs these too. It’s worth spending
> > > > > > > > >>> the time on the description, also, and then using them in all
> > > > > > > > >>> the places that we describe Arrow.
> > > > > > > >>>
> > > > > > > >>> Julian
> > > > > > > >>>
> > > > > > > >>> [1]
> > > > https://www.growthink.com/content/whats-your-high-concept-pitch
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > > >>>> On May 17, 2021, at 7:38 AM, Eduardo Ponce <edponc...@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > > >>>> I agree with Nate's and Brian's suggestions, but would like to add
> > > > > > > > >>>> that we can make it a one-liner for more conciseness and consistency
> > > > > > > > >>>> with other Apache projects.
> > > > > > > > >>>> Apologies if it seems I am going around the suggestions loop again.
> > > > > > > > >>>>
> > > > > > > > >>>> "Apache Arrow is a cross-language development platform enabling
> > > > > > > > >>>> efficient in-memory data processing and transport."
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > >>>>
> > > > > > > > >>>> On Mon, May 17, 2021 at 10:11 AM Brian Hulette <bhule...@apache.org> wrote:
> > > > > > > >>>>
> > > > > > > > >>>>> Thank you for bringing this up, Dominik. I sampled some of the
> > > > > > > > >>>>> descriptions for other Apache projects I frequent; the ones with a
> > > > > > > > >>>>> meaningful description have a single sentence:
> > > > > > > > >>>>>
> > > > > > > > >>>>> github.com/apache/spark - Apache Spark - A unified analytics engine
> > > > > > > > >>>>> for large-scale data processing
> > > > > > > > >>>>> github.com/apache/beam - Apache Beam is a unified programming model
> > > > > > > > >>>>> for Batch and Streaming
> > > > > > > > >>>>> github.com/apache/avro - Apache Avro is a data serialization system
> > > > > > > > >>>>>
> > > > > > > > >>>>> Several others (Flink, Hadoop, ...) just have "[Mirror of] Apache
> > > > > > > > >>>>> <name>" as the description.
> > > > > > > > >>>>>
> > > > > > > > >>>>> +1 for Nate's suggestion "Apache Arrow is a cross-language
> > > > > > > > >>>>> development platform for in-memory data. It enables systems to
> > > > > > > > >>>>> process and transport data more efficiently."
> > > > > > > >>>>>
> > > > > > > > >>>>> On Mon, May 17, 2021 at 5:23 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > >>>>>
> > > > > > > > >>>>>> It's probably best for the description to limit mentions of
> > > > > > > > >>>>>> specific features. There are some high level features mentioned in
> > > > > > > > >>>>>> the description now ("computational libraries and zero-copy
> > > > > > > > >>>>>> streaming messaging and interprocess communication"), but now in
> > > > > > > > >>>>>> 2021, since the project has grown so much, it could leave people
> > > > > > > > >>>>>> with a limited view of what they might find here.
> > > > > > > >>>>>>
> > > > > > > >>>>>> On Mon, May 17, 2021 at 12:14 AM Mauricio Vargas
> > > > > > > >>>>>> <mauri...@ursacomputing.com> wrote:
> > > > > > > >>>>>>>
> > > > > > > > >>>>>>> How about
> > > > > > > > >>>>>>> 'Apache Arrow is a cross-language development platform for
> > > > > > > > >>>>>>> in-memory data. It enables systems to process and transport data
> > > > > > > > >>>>>>> efficiently, providing a simple and fast library for partitioning
> > > > > > > > >>>>>>> of large tables'?
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> Sorry for the delay, long election day
> > > > > > > >>>>>>>
> > > > > > > > >>>>>>> On Sun, May 16, 2021, 2:27 PM Nate Bauernfeind <natebauernfe...@deephaven.io> wrote:
> > > > > > > >>>>>>>
> > > > > > > > >>>>>>>> Suggestion: faster -> more efficiently
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> "Apache Arrow is a cross-language development platform for in-memory
> > > > > > > > >>>>>>>> data. It enables systems to process and transport data more
> > > > > > > > >>>>>>>> efficiently."
> > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> On Sun, May 16, 2021 at 11:35 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>>> Here's what's there now:
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> "Apache Arrow is a cross-language development platform for in-memory
> > > > > > > > >>>>>>>>> data. It specifies a standardized language-independent columnar memory
> > > > > > > > >>>>>>>>> format for flat and hierarchical data, organized for efficient
> > > > > > > > >>>>>>>>> analytic operations on modern hardware. It also provides computational
> > > > > > > > >>>>>>>>> libraries and zero-copy streaming messaging and interprocess
> > > > > > > > >>>>>>>>> communication…"
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> How about something shorter like
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> "Apache Arrow is a cross-language development platform for in-memory
> > > > > > > > >>>>>>>>> data. It enables systems to process and transport data faster."
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> Suggestions / refinements from others welcome
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> On Sat, May 15, 2021 at 9:12 PM Dominik Moritz <domor...@cmu.edu> wrote:
> > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Super minor issue but could someone make the description on GitHub
> > > > > > > > >>>>>>>>>> shorter?
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> GitHub puts the description into the title of the page and makes it
> > > > > > > > >>>>>>>>>> hard to find it in URL autocomplete.
> > > > > > > >>>>>>>>>>
> > > > > > > >>>>>>>>>
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>> --
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>
> > > > > > > >>>>>
> > > > > > > >>>
> > > > > > > >>
> > > > > > >
> > > > > > >
> > > >
> > >
>
